How to use Beam SQL to perform Lookups - google-cloud-dataflow

Is it possible to perform column-based lookups using Beam SQL? I have come across the class BeamJoinTransforms.JoinAsLookup but couldn't find any working snippet.
Currently, to perform lookups in my Apache Beam code, I do a 'left join' using CoGroupByKey/side inputs and produce the filtered TableRows by maintaining a column mapping within my code.
I believe Beam SQL could handle this as well, and in a more efficient way. Does anyone have a working snippet for this? I'm looking for an example that converts PCollection<TableRow> to PCollection<Row> and performs the field lookups using the Beam SQL library.

I don't believe there's any concrete version of this logic bundled with the Beam SDK. It is supposed to be triggered when one of the tables in the join is an instance of BeamSeekableTable; see this part of the source code. For more context, you can read the original pull request that introduced this feature: PR-4196
Currently, though, BeamSeekableTable doesn't have any working implementation in the Beam SDK. You could potentially implement your own TableProvider that returns a BeamSqlTable which also implements BeamSeekableTable. For an example, see how TableProvider is implemented for text tables (CSV, lines).
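In the meantime, if the goal is simply to enrich one PCollection from another by key, you can express the lookup as an ordinary Beam SQL join over two schema-aware PCollections. Below is a minimal sketch, assuming a Beam release that ships SqlTransform; the field names (user_id, country, payload) and the input PCollections are made up for illustration.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptor;

public class SqlLookupSketch {

  // Hypothetical schema for the lookup side.
  static final Schema USER_SCHEMA = Schema.builder()
      .addStringField("user_id")
      .addStringField("country")
      .build();

  // 'events' is assumed to already be a schema-aware PCollection<Row>
  // with at least a user_id and a payload column.
  static PCollection<Row> enrich(PCollection<TableRow> userTableRows,
                                 PCollection<Row> events) {
    // 1. Convert PCollection<TableRow> to PCollection<Row> and attach the schema.
    PCollection<Row> users = userTableRows
        .apply("TableRowToRow", MapElements.into(TypeDescriptor.of(Row.class))
            .via((TableRow tr) -> Row.withSchema(USER_SCHEMA)
                .addValues((String) tr.get("user_id"), (String) tr.get("country"))
                .build()))
        .setRowSchema(USER_SCHEMA);

    // 2. Register both collections under table names and run a plain SQL join.
    return PCollectionTuple
        .of(new TupleTag<Row>("events"), events)
        .and(new TupleTag<Row>("users"), users)
        .apply(SqlTransform.query(
            "SELECT events.user_id, events.payload, users.country "
          + "FROM events JOIN users ON events.user_id = users.user_id"));
  }
}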

Related

Resume Parsing using Solr and TIKA

I was going through this slide, and I'm having a little difficulty understanding the approach.
My two questions are:
1) How does Solr maintain the schema of semi-structured documents like resumes (with sections such as Name, Skills, Education, etc.)?
2) Can Apache Tika extract section-wise information from PDFs? Since every resume has different sections, how do I define a common schema of entities?
You define the schema yourself, so that you get the fields you expect and can search the different fields based on the kinds of queries you want to run. You can lump any unknown values (i.e. where you're not sure where they belong) into a common search field and rank that field lower.
You'll have to parse the response from Tika (or a different PDF / docx parser) yourself. Just using Tika by itself will not give you an automagically structured response tuned to the problem you're trying to solve. There will be a lot of manual parsing and trying to make sense of what is what from the uploaded document, and then inserting the relevant data into the relevant field.
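To make the extraction step concrete, here is a minimal sketch of pulling the raw text out of a resume with Tika; the ResumeText class is made up, and everything after this point, such as splitting the text into Skills or Education sections, is your own parsing logic.

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ResumeText {

  public static String extract(String path) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
    Metadata metadata = new Metadata();

    try (InputStream in = Files.newInputStream(Paths.get(path))) {
      // Works for PDF, docx, etc.; format detection is automatic.
      parser.parse(in, handler, metadata, new ParseContext());
    }

    // Tika returns one flat block of text plus file-level metadata.
    // Carving it into "Skills", "Education", ... (e.g. by matching known
    // section headings with a regex) is entirely up to you.
    return handler.toString();
  }
}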
We have done many implementations using Solr and Elasticsearch, and ran into two challenges:
1) Defining the schema and, more specifically, getting documents into that schema.
2) Expanding search terms into more accurate and useful matches. Solr and Elasticsearch can match what they get from the content, but not beyond that content.
You need to use a resume parser like www.rchilli.com, Sovrn, Daxtra, HireAbility or others, take their output and map it to your schema. The best part is that you get access to taxonomies to enrich your content in Solr.
You can use any of them based on your budget and needs, but for us RChilli worked best.
Let me know if you need any further help.

Custom Collector for Neo4j Legacy Index

I have a Neo4j application that uses the legacy Lucene indexes on certain relationship properties. Whenever I query these I am looking for exact matches, and all of them. While doing some profiling I discovered that the application is spending a highly disproportionate amount of time retrieving these results as it is pulling them in chunks from a prioritized queue. Given that I do not care about the ordering and want all of the results, what can I do to change the underlying behavior?
From my own searching, I came across Lucene's Collector implementations and it seems like a custom one that collects everything and never bothers scoring could be the answer, but I do not know how I can inject one into Neo4j. I am not opposed to using reflection or other means if it is not actually supported by Neo4j.
The application accesses Neo4j via the embedded Java methods.
We're working on some of that as part of our upgrade to Lucene 5, where custom collectors for some of these use cases will be implemented. Hopefully we can make something available in the next few weeks.
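For reference, this is roughly what a "collect everything, never score" collector looks like against the Lucene 5 SimpleCollector API mentioned above. It is only a sketch; getting Neo4j's legacy index layer to use it instead of its default top-docs collector is exactly the part that is not exposed yet.

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.SimpleCollector;

import java.util.ArrayList;
import java.util.List;

/** Collects every matching doc id, in no particular order, without scoring. */
public class CollectAllCollector extends SimpleCollector {

  private final List<Integer> docIds = new ArrayList<>();
  private int docBase;

  @Override
  protected void doSetNextReader(LeafReaderContext context) {
    this.docBase = context.docBase; // offset of the current segment
  }

  @Override
  public void collect(int doc) {
    docIds.add(docBase + doc);      // record the global doc id, nothing else
  }

  @Override
  public boolean needsScores() {
    return false;                   // skip scoring entirely
  }

  public List<Integer> getDocIds() {
    return docIds;
  }
}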

How to compute custom metrics using Elasticsearch + Kibana?

It seems like a pretty easy question, but for some reason I still can't figure out how to solve it. I have an Elasticsearch cluster that uses the Twitter river to download tweets. I would like to implement a sentiment analysis module that takes each tweet and computes a score (positive/negative). I would like the score to be computed for all existing tweets as well as for new ones, and then visualized using Kibana.
However, I am not sure where should I place the call to this sentiment analysis module in the elastic search pipeline.
I have considered the option of modifying twitter river plugin but that will not work retrospectively.
Essentially, I need to answer two questions:
1) How do I call Python/Java code while indexing a document, so that I can modify the JSON accordingly?
2) How do I use the same code to modify all the existing documents in ES?
If you don't want an external application to do the analysis before the documents are indexed in Elasticsearch, the best way, I guess, is to write a plugin that does it: a plugin implementing a custom analyzer that performs the sentiment analysis. Then, in the mapping, define the fields you want to run your analyzer on.
See examples of analysis plugins -
https://github.com/barminator/elasticsearch-analysis-annotation
https://github.com/yakaz/elasticsearch-analysis-combo/
To run the analysis on all existing documents you will need to reindex them after defining the correct mapping.
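For the second question, updating the documents that are already in the index, one option is a scan-and-scroll pass that computes the score and writes it back as a partial update. Below is a rough sketch against the 1.x-era Java client that was current with the Twitter river; the scoreSentiment helper and the index/type/field names are placeholders.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

import java.util.Collections;

public class SentimentBackfill {

  /** Hypothetical sentiment function; plug in your own model here. */
  static double scoreSentiment(String text) {
    return 0.0;
  }

  static void backfill(Client client) {
    TimeValue keepAlive = TimeValue.timeValueMinutes(2);

    // Open a scroll over all existing tweets.
    SearchResponse resp = client.prepareSearch("twitter")
        .setTypes("status")
        .setQuery(QueryBuilders.matchAllQuery())
        .setScroll(keepAlive)
        .setSize(200)
        .execute().actionGet();

    while (resp.getHits().getHits().length > 0) {
      for (SearchHit hit : resp.getHits().getHits()) {
        String text = (String) hit.getSource().get("text");
        double score = scoreSentiment(text);

        // Partial update: only the sentiment field is written back.
        client.prepareUpdate("twitter", hit.getType(), hit.getId())
            .setDoc(Collections.singletonMap("sentiment", (Object) score))
            .execute().actionGet();
      }
      // Fetch the next batch of hits.
      resp = client.prepareSearchScroll(resp.getScrollId())
          .setScroll(keepAlive)
          .execute().actionGet();
    }
  }
}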

Is there a better solution than ActiveRecord for batch data imports?

I've developed a web interface for a legacy (vendor) database using Ruby on Rails. The database schema is a complete mess, > 450 tables, and customer data spread over more than 20, involving complex joins, etc.
I've got a good solution for this for the web app, it works very well. But we also do nightly imports from external data sources (currently a view to a SQL Server DB and a SOAP feed) and they run SLOW. About 1.5-2.5 hours for the XML data import and about 4 hours for the DB import.
This is after doing some basic optimizations, which include manually starting the MRI garbage collector. And that right there suggests to me I'm Doing It Wrong. I've considered moving the nightly update/insert tasks out of the main Rails app and trying to use either JRuby or Rubinius to take advantage of the better concurrency and garbage collection.
My question is this: I know ActiveRecord isn't really designed for this type of task. But out of the O/RM options for Ruby (my preferred language), it seems to have the best Oracle support.
What would you do? Stick with AR and use a different interpreter? Would that really help? What about DataMapper or Sequel? Is there a better way of doing this?
I'm open to using Scala or Clojure if there's a better alternative (not limited to, but these are the other languages I'm playing with right now)... but what I don't want is something like DBI where I'm writing straight SQL, if for no other reason than that vendor updates occasionally change the DB schema, and I'd rather change a couple of classes than hundreds of UPDATE or INSERT statements.
Hopefully this question isn't 'too vague,' but I could really use some advice about this issue.
FWIW, Ruby is 1.9.2, Rails is 3.0.7, platform is OS X Server Snow Leopard (or optionally Debian 6.0).
Edit: OK, I just realized that this solution will not work for Oracle, sorry.
You should really check out activerecord-import; it is easy to use and handles bulk imports with a minimal number of SQL statements. I saw a speedup from 5 hours to 2 minutes, and it will still run validations on the data.
From the GitHub page:
books = []
10.times do |i|
  books << Book.new(:name => "book #{i}")
end
Book.import books
https://github.com/zdennis/activerecord-import
From my experience, ORMs are a great tool to use on the front end, where you're mostly just reading the data or updating a single row at a time. On the back end, where you're ingesting lots of data at a time, they can cause problems because of the way they tend to interact with the database.
As an example, assume you have a Person object that has a long list of Friends (let's say 100 for now). You create the Person object and assign 100 Friends to it, and then save it to the database. It's common for the naive use of an ORM to do 101 writes to the database (one for each Friend, one for the Person). If you were to do this in pure SQL at a lower level, you'd do 2 writes: one for the Person and then one for all the Friends at once (an insert with 100 actual rows). The difference between the two is significant.
There are a couple ways I've seen to work around the problem.
Use a lower level database API that lets you write your "insert 100 friends in a single call" type command
Use an ORM that lets you write lower level SQL in order to do the Friends insert as a single SQL command (not all of them allow this and I don't know if Rails does)
Use an ORM that lets you batch writes into a single database call. It's still 101 writes to the database, but it allows the ORM to batch them into a single network call to the database and say "do these 101 things". I'm not sure what ORMs allow for this.
There are probably other ways as well; a JDBC sketch of the first and third options follows right after this list.
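Since you mention being open to JVM languages, here is roughly what those options boil down to at the plain JDBC level: one insert for the parent row, then one batched, single-round-trip insert for all the children. This is only a sketch; the connection details, table names and id handling are made up.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInsertSketch {

  static void savePersonWithFriends(String jdbcUrl, String user, String password,
                                    String personName, List<String> friendNames)
      throws SQLException {
    try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
      conn.setAutoCommit(false);

      long personId;
      // One insert for the person row (id generation is database-specific;
      // a sequence or RETURN_GENERATED_KEYS would normally go here).
      try (PreparedStatement ps = conn.prepareStatement(
          "INSERT INTO people (id, name) VALUES (?, ?)")) {
        personId = System.nanoTime(); // placeholder id, for the sketch only
        ps.setLong(1, personId);
        ps.setString(2, personName);
        ps.executeUpdate();
      }

      // One batched insert for all the friends: rows are queued locally and
      // sent to the database in a single executeBatch() round trip.
      try (PreparedStatement ps = conn.prepareStatement(
          "INSERT INTO friends (person_id, name) VALUES (?, ?)")) {
        for (String friend : friendNames) {
          ps.setLong(1, personId);
          ps.setString(2, friend);
          ps.addBatch();
        }
        ps.executeBatch();
      }

      conn.commit();
    }
  }
}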
The main point is that using an ORM to ingest any real amount of data can run into efficiency problems. Understanding what the ORM is doing underneath the hood (asking it to log all DB calls is a good way to see it) is the best first step. Once you know what it's doing, you can look for ways to tell it "what I'm doing doesn't fit well into the normal pattern, let's change how you're using it"... and, should it not have a way that works, you can look at using a lower-level API to allow for it.
I'll point out one other thing you can look at with a STRONG caveat that it should be one of the last things you consider. When inserting rows into the database in bulk, you can create a raw text file with all the data (format depends on the db, but the concept is similar to a CSV file) and give the file to the database to import in bulk. It's a bad way to go in almost every case, but I wanted to include it because it does exist as an option.
Edit: As a side note, the comment about more efficiently parsing the XML is a good thing to look at too. Using SAX vs DOM, or a different XML library, can be a huge win in time to completion. In some cases, it can be an even bigger win than more efficient database interaction. For example, you may be parsing a LOT of XML with lots of small pieces of data, and then only use small parts of it. In a case like that, the parsing could take a long time via DOM while SAX could ignore the parts you don't need... or it could be using a lot of memory creating DOM objects and slow down the whole thing due to garbage collection, etc. At the very least, it's worth looking at.
Since your question is indeed "a bit vague", I can only recommend optimizing the XML import by using XML pull parsing.
Take a look at this:
https://gist.github.com/827475
I needed to import MySQL XML, and to be fair, using the XML pull method improved the parsing part by a factor of around 7 (yes, almost 7 times faster than reading the entire thing into memory).
Another thing: you are saying "the DB import takes 4 hours". What file formats are these DB exports you are importing?

Can I use Splunk to analyse events from a Rails application?

Looking at Splunk, http://www.splunk.com, it looks like a very nice platform for analysing how a system is performing in relation to the actions users are taking.
A Ruby on Rails implementation is provided, but it would seem to only offer traditional analytics.
Is there either:
A way to use Splunk to monitor events defined in the code of a Rails app?
or
A better tool for the job?
Thanks!
There's no Ruby-specific query handler for Ruby-generated logs. It's certainly possible to build one by:
Defining how to acquire the fields from Ruby-style generated logs (as linked above)
Defining how to translate your desired syntax to Splunk's search language, which for that query would probably be "sign_up referer=bla"
Splunk is extensible in various ways. For example, it would be entirely possible to author a search filter which narrows the set of events in Ruby, parsing a Ruby expression. The Splunk search language has its own ideas about quotation marks, backslashes, and pipes, but the rest of the text would be up to the filter. However, the core performance optimization of limiting the search to events containing certain substrings is currently only possible in the Splunk search language syntax.
That said, if your data set is very small, and the analysis you want to do limited in scope, then maybe some custom ruby solution is closer to what you want.
As far as Splunk goes, check out answers.splunk.com, and here is one answer related to Rails:
http://splunk-base.splunk.com/answers/8830/how-do-i-extract-key-value-pairs-from-ruby-on-rails-logs