It seems like a fairly easy question, but for some reason I still can't figure out how to solve it. I have an Elasticsearch cluster that uses the Twitter river to download tweets. I would like to implement a sentiment analysis module that takes each tweet and computes a score (positive/negative). I want the score to be computed for each existing tweet as well as for new tweets, and then visualize the results in Kibana.
However, I am not sure where I should place the call to this sentiment analysis module in the Elasticsearch pipeline.
I have considered modifying the Twitter river plugin, but that will not work retrospectively for tweets that are already indexed.
Essentially, I need to answer two questions:
1) How do I call Python/Java code while indexing a document so that I can modify the JSON accordingly?
2) How do I use the same code to modify all the existing documents in ES?
If you don't want an external application to do the analysis before indexing the documents in Elasticsearch, the best approach is probably to write a plugin that does it. You can write a plugin that implements a custom analyzer performing the sentiment analysis, and then define in the mapping which fields the analyzer should run on.
See these examples of analysis plugins:
https://github.com/barminator/elasticsearch-analysis-annotation
https://github.com/yakaz/elasticsearch-analysis-combo/
To run the analysis on all existing documents you will need to reindex them after defining the correct mapping.
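For the second part of the question (scoring tweets that are already in the index), a rough sketch of such a reindex pass is below. It assumes the Elasticsearch 1.x Java TransportClient (the same era as the Twitter river) and a hypothetical scoreSentiment() helper standing in for whatever sentiment module you use; treat it as a sketch, not a drop-in implementation.

```java
import java.util.Map;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public class SentimentReindexer {

    // Hypothetical sentiment scorer -- substitute your own Java/Python module here.
    static double scoreSentiment(String text) {
        return 0.0;
    }

    public static void enrichExistingTweets(Client client) {
        // Scroll over all existing tweets in batches.
        SearchResponse scroll = client.prepareSearch("twitter")
                .setQuery(QueryBuilders.matchAllQuery())
                .setScroll(TimeValue.timeValueMinutes(2))
                .setSize(200)
                .execute().actionGet();

        while (scroll.getHits().getHits().length > 0) {
            BulkRequestBuilder bulk = client.prepareBulk();
            for (SearchHit hit : scroll.getHits().getHits()) {
                Map<String, Object> source = hit.getSource();
                // Add the computed score as a new field and reindex the document in place.
                source.put("sentiment", scoreSentiment((String) source.get("text")));
                bulk.add(client.prepareIndex("twitter", hit.getType(), hit.getId())
                        .setSource(source));
            }
            bulk.execute().actionGet();
            scroll = client.prepareSearchScroll(scroll.getScrollId())
                    .setScroll(TimeValue.timeValueMinutes(2))
                    .execute().actionGet();
        }
    }
}
```

New tweets can get the same field either from the analyzer plugin above or from the same scorer called before indexing.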
Is it possible to perform column-based lookups using Beam SQL? I came across the class BeamJoinTransforms.JoinAsLookup but couldn't find any working snippet.
Currently, to perform lookups in my Apache Beam code, I do a left join using CoGroupByKey/side inputs and produce the filtered TableRows by maintaining a column mapping within my code.
I believe this should also be possible with Beam SQL, as a more efficient way to deal with lookups. Does anyone have a working snippet for this? I'm looking for examples of converting a PCollection<TableRow> to a PCollection<Row> and performing the field lookups using the Beam SQL library.
I don't believe there's any concrete version of this logic bundled with the Beam SDK. It is supposed to be triggered when one of the tables in the join is an instance of BeamSeekableTable; see this part of the source code. For more context you can read the original pull request that introduced this feature: PR-4196.
Currently, though, BeamSeekableTable doesn't have any working implementations in the Beam SDK. Potentially you could implement your own TableProvider that returns a BeamSqlTable which also implements BeamSeekableTable. For example, see here how TableProvider is implemented for text tables (CSV, lines).
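Until BeamSeekableTable has an implementation, the lookup can still be expressed as a plain Beam SQL join, which Beam expands into a CoGroupByKey-style join under the hood. A rough sketch, assuming a Beam release where SqlTransform is available in the beam-sdks-java-extensions-sql module; the schemas, field names (user_id, amount, country) and table tags are made up for illustration:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;

public class BeamSqlLookupSketch {

    // Example schema for the main stream of events (illustrative only).
    static final Schema EVENT_SCHEMA = Schema.builder()
            .addStringField("user_id")
            .addInt64Field("amount")
            .build();

    // Convert TableRow elements into schema-aware Row elements.
    static PCollection<Row> toRows(PCollection<TableRow> input) {
        return input
                .apply("TableRowToRow", ParDo.of(new DoFn<TableRow, Row>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        TableRow tr = c.element();
                        c.output(Row.withSchema(EVENT_SCHEMA)
                                .addValues((String) tr.get("user_id"),
                                        Long.valueOf(tr.get("amount").toString()))
                                .build());
                    }
                }))
                .setRowSchema(EVENT_SCHEMA);
    }

    // Join the main stream against a lookup table using Beam SQL.
    // `countries` is assumed to carry a schema with user_id and country fields.
    static PCollection<Row> lookup(PCollection<Row> events, PCollection<Row> countries) {
        return PCollectionTuple
                .of(new TupleTag<>("events"), events)
                .and(new TupleTag<>("countries"), countries)
                .apply(SqlTransform.query(
                        "SELECT e.user_id, e.amount, c.country "
                                + "FROM events e JOIN countries c "
                                + "ON e.user_id = c.user_id"));
    }
}
```

Note that this is the regular join path, not the seekable-table lookup the question originally asked about, so it does not avoid the shuffle that CoGroupByKey implies.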
I was going through this slide deck and am having a little difficulty understanding the approach.
My two questions are:
1) How does Solr maintain the schema of a semi-structured document like a resume (with sections such as Name, Skills, Education, etc.)?
2) Can Apache Tika extract section-wise information from PDFs? Since every resume has dissimilar sections, how do I define a common schema of entities?
You define the schema, so you get the fields you expect and can search the different fields based on the kinds of queries you want to run. You can lump any unknown values (i.e., those where you're not sure where they belong) into a common search field and rank that field lower.
You'll have to parse the response from Tika (or a different PDF/docx parser) yourself. Using Tika by itself will not give you an automagically structured response tuned to the problem you're trying to solve. There will be a lot of manual parsing and making sense of what is what in the uploaded document, and then inserting the relevant data into the relevant fields.
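To make that concrete, here is a rough sketch of the kind of manual work involved: run Tika to get plain text, then apply your own heuristics to cut the text into sections before filling Solr fields. The field names, core name, and heading regex are illustrative assumptions, not something Tika or Solr gives you out of the box.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ResumeIndexer {

    public static void main(String[] args) throws Exception {
        // 1. Extract plain text from the PDF/docx with Tika.
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(in, handler, metadata, new ParseContext());
        }
        String text = handler.toString();

        // 2. Heuristically cut the text into sections -- this heading pattern is
        //    made up; real resumes will need much more robust parsing.
        Pattern heading = Pattern.compile("(?m)^(Skills|Education|Experience)\\s*$");
        Matcher m = heading.matcher(text);
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("content", text); // catch-all field, ranked lower
        int last = -1;
        String lastName = null;
        while (m.find()) {
            if (lastName != null) {
                doc.addField(lastName.toLowerCase(), text.substring(last, m.start()).trim());
            }
            lastName = m.group(1);
            last = m.end();
        }
        if (lastName != null) {
            doc.addField(lastName.toLowerCase(), text.substring(last).trim());
        }

        // 3. Send the structured document to Solr (core name is illustrative).
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/resumes").build();
        solr.add(doc);
        solr.commit();
        solr.close();
    }
}
```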
We have done many implementations using Solr and Elasticsearch, and ran into two challenges:
1) Defining the schema and, more specifically, getting documents into that schema.
2) Expanding search terms to more accurate and useful matches. Solr and Elasticsearch can match what they get from the content, but not beyond that content.
You need to use a resume parser like RChilli (www.rchilli.com), Sovrn, Daxtra, HireAbility, or others, and map their output to your schema. The best part is that you get access to taxonomies to enrich your content in Solr.
You can use any of them based on your budget and needs; for us, RChilli worked best.
Let me know if you need any further help.
Let's say my data set is a shopping mall.
I have to build a graph for it. Whenever asked, I have to generate a (shortest) path from one shop to another.
Now my question is:
1) Is it efficient to build a graph of the whole building and generate the path?
2) Or should I build a graph (something like a subgraph) between only the two nodes and all their connectors (edges) when a user needs to find the path?
I have to implement this for a mobile application where all the data is loaded from a server.
My current code builds the whole graph. But I want to use this as a library for future use.
If it is only for the current building, then it works fine.
But assuming that in the future another data set is used that is way bigger than the current one, which of these methods is more efficient?
These are the only two ways I can think of implementing it. If there is any other solution, it would be highly appreciated!
Secondly, I am using Dijkstra's Algorithm for path finding, is that suitable for this kind of a case?
Any help would be highly appreciated,
Thanks.
Is it efficient to build a graph of the whole building and generate the path?
Or build a graph (something like a subgraph) between only the two nodes and all their connectors (edges) when a user needs to find the path?
If the graph is known a priori, the most efficient solution with regard to query time is to build the whole graph and preprocess it. You then query the contracted graph and get very fast query times. Look, for example, at contraction hierarchies, since it is one of the most widely used techniques. Otherwise, when the graph has to be built at runtime (which I think is what you mean by your second point), you could use A* or bidirectional Dijkstra. For A*, the best heuristic you can probably come up with is the straight-line distance, so it may not be very helpful.
Secondly, I am using Dijkstra's algorithm for path finding; is that suitable for this kind of case?
Yes, it is, but I would always use bidirectional Dijkstra: it's not difficult to implement and is generally a great improvement in running time over unidirectional Dijkstra. Some related questions on SO: 1, 2
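For reference, here is a compact unidirectional Dijkstra over an adjacency list, which is the baseline the question describes; the bidirectional variant mentioned above runs the same search from both endpoints at once and stops when the two frontiers meet. The graph representation here is an assumption, not something taken from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class MallRouter {

    // Adjacency list: edges[u] holds {v, weight} pairs.
    static int[] dijkstra(List<int[]>[] edges, int source) {
        int n = edges.length;
        int[] dist = new int[n];
        Arrays.fill(dist, Integer.MAX_VALUE);
        dist[source] = 0;

        // Priority queue of {node, distance}, ordered by distance.
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
        pq.add(new int[] {source, 0});

        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            int u = top[0], d = top[1];
            if (d > dist[u]) continue;           // stale entry, already settled with a shorter path
            for (int[] e : edges[u]) {
                int v = e[0], w = e[1];
                if (dist[u] + w < dist[v]) {     // relax edge u -> v
                    dist[v] = dist[u] + w;
                    pq.add(new int[] {v, dist[v]});
                }
            }
        }
        return dist;                             // shortest distances from source
    }
}
```

To recover the actual shop-to-shop route rather than just the distance, also record a predecessor for each node whenever you relax an edge and walk it back from the target.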
How to get the same results as http://developer.yahoo.com/search/content/V1/termExtraction.html
This question has been asked quite a few times before.
best approach to analyze text in PHP?
What is a good keyword extraction web service?
What is a simple way to generate keywords from a text?
Trying to approach this problem with existing solutions, I stumbled upon the "text analysis" Solr performs on a document before indexing, as described in http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters, which includes stemming as well.
So the final index will consist mostly of terms used to describe the document.
Is there a solution that provides analyzers, tokenizers, and token filters for direct use? If Solr is the way to go, what is the best way to get this data out of Solr's index?
Solr is a way to create a custom search engine; it does not seem to be the right tool for this job. The Wikipedia article on term extraction lists several web applications for term extraction in its "External links" section. OpenNLP has a list of tools that may be useful; its Chunker may be helpful.
Just ask for the parsed terms, e.g.:
http://localhost:8983/solr/terms?terms.fl=text&terms.sort=count&terms.limit=-1
See TermsComponent for more info.
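If you prefer to do this from code, the same request can be issued through SolrJ. A small sketch, assuming a core at /solr/collection1 and an indexed field named text:

```java
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

public class TermsDump {

    public static void main(String[] args) throws Exception {
        HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build();

        // Equivalent of /terms?terms.fl=text&terms.sort=count&terms.limit=-1
        SolrQuery query = new SolrQuery();
        query.setRequestHandler("/terms");
        query.setTerms(true);
        query.addTermsField("text");
        query.setTermsSortString("count");
        query.setTermsLimit(-1);

        QueryResponse response = solr.query(query);
        List<TermsResponse.Term> terms = response.getTermsResponse().getTerms("text");
        for (TermsResponse.Term term : terms) {
            System.out.println(term.getTerm() + "\t" + term.getFrequency());
        }
        solr.close();
    }
}
```

Keep in mind these are the analyzed (tokenized, stemmed) terms, so they reflect whatever analysis chain you configured for the field.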
Can anyone point me to a reference or provide a high-level overview of how companies like Facebook, Yahoo, Google, et al. perform the large-scale (e.g., multi-TB) log analysis that they do for operations and especially web analytics?
Focusing on web analytics in particular, I'm interested in two closely-related aspects: query performance and data storage.
I know that the general approach is to use MapReduce to distribute each query over a cluster (e.g., using Hadoop). However, what's the most efficient storage format to use? This is log data, so we can assume each event has a timestamp and that, in general, the data is structured and not sparse. Most web analytics queries involve analyzing slices of data between two arbitrary timestamps and retrieving aggregate statistics or anomalies from that data.
Would a column-oriented DB like Bigtable (or HBase) be an efficient way to store, and more importantly, query such data? Does the fact that you're selecting a subset of rows (based on timestamp) work against the basic premise of this type of storage? Would it be better to store it as unstructured data, e.g., a reverse index?
Unfortunately, there is no one-size-fits-all answer.
I am currently using Cascading, Hadoop, S3, and Aster Data to process hundreds of gigabytes a day through a staged pipeline inside AWS.
Aster Data is used for the queries and reporting, since it provides a SQL interface to the massive data sets cleaned and parsed by Cascading processes on Hadoop. Using the Cascading JDBC interfaces, loading Aster Data is quite a trivial process.
Keep in mind that tools like HBase and Hypertable are key/value stores, so they don't do ad-hoc queries and joins without the help of a MapReduce/Cascading app to perform the joins out of band, which is a very useful pattern.
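To illustrate the key/value point: HBase keeps rows sorted by key, so the one query shape it handles natively is a contiguous key-range scan. If you bake the timestamp into the row key (here a hypothetical "<source>#<epoch-millis>" layout), a "slice between two timestamps" query becomes such a scan; anything join-like still has to happen in MapReduce/Cascading. A rough sketch, assuming the HBase 1.x client API and made-up table, column family, and key values:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LogRangeScan {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("weblogs"))) {

            // Row keys are assumed to look like "<source>#<epoch-millis>", so a
            // time slice for one source is a contiguous range of keys.
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes("frontend#1262304000000"));
            scan.setStopRow(Bytes.toBytes("frontend#1262390400000"));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    byte[] status = row.getValue(Bytes.toBytes("event"), Bytes.toBytes("status"));
                    System.out.println(Bytes.toString(row.getRow()) + " -> " + Bytes.toString(status));
                }
            }
        }
    }
}
```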
In full disclosure, I am a developer on the Cascading project.
http://www.asterdata.com/
http://www.cascading.org/
The book Hadoop: The Definitive Guide by O'Reilly has a chapter that discusses how Hadoop is used at two real-world companies.
http://my.safaribooksonline.com/9780596521974/ch14
Have a look at the paper Interpreting the Data: Parallel Analysis with Sawzall by Google. It describes the tool Google uses for log analysis.