I was going through this slide and I'm having a little difficulty understanding the approach.
My two queries are:
How does Solr maintain the schema of a semi-structured document like a resume (with sections such as Name, Skills, Education, etc.)?
Can Apache Tika extract section-wise information from PDFs? Since every resume will have dissimilar sections, how do I define a common schema of entities?
You define the schema, so that you get the fields you expect and can search the different fields based on the kinds of queries you want to run. You can lump any unknown values (i.e. those where you're not sure where they belong) into a common search field and rank that field lower.
You'll have to parse the response from Tika (or a different PDF/docx parser) yourself. Just using Tika by itself will not give you an automagically structured response tuned to the problem you're trying to solve. There will be a lot of manual parsing and trying to make sense of what is what in the uploaded document, and then inserting the relevant data into the relevant field.
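As a rough illustration of that flow - the gem choices, field names, regexes and Solr URL below are all assumptions, not a recipe - you could extract the text, carve out the sections you can recognise, and dump everything into the lower-ranked catch-all field as well:

require "yomu"   # Ruby wrapper around Apache Tika
require "rsolr"  # Solr client

# Pull the raw text out of the uploaded resume.
text = Yomu.new("resume.pdf").text

# Naive section splitting - real resumes will need much smarter parsing.
skills    = text[/Skills\s*\n(.*?)(?=\n[A-Z]|\z)/m, 1]
education = text[/Education\s*\n(.*?)(?=\n[A-Z]|\z)/m, 1]

solr = RSolr.connect(url: "http://localhost:8983/solr/resumes")
solr.add(
  id:          "resume-1",
  skills_t:    skills,
  education_t: education,
  fallback_t:  text  # the catch-all field, boosted lower at query time
)
solr.commit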
We have done many implementations using Solr and Elasticsearch, and ran into two challenges:
Defining the schema and, more specifically, getting documents to fit that schema.
Expanding search terms to more accurate and useful matches. Solr and Elasticsearch can match what they get from the content, but not beyond that content.
You need to use a resume parser like www.rchilli.com, Sovren, Daxtra, HireAbility or others, take their output and map it to your schema. The best part is that you get access to taxonomies to enrich your content in Solr.
You can use any one of them based on your budget and needs, but for us RChilli worked best.
Let me know if you need any further help.
We are looking to transfer part of our database from PostgreSQL to Elasticsearch: basically we want to combine three tables - Properties, Listings and Addresses - into a single document.
I couldn't find any standard tools, and since we have a Ruby on Rails application, probably the simplest and most reliable way will be to write a migration which just iterates through the models, composes them into a single document and saves it to Elasticsearch.
The task doesn't seem complicated, but as it's my first experience with Elasticsearch I want to check with the community.
Thanks.
The closest thing I'm aware of is the JDBC importer. However, I think writing your own script is probably just as fast.
There is a Postgres function, row_to_json, that will convert a resulting row to JSON, which you can then publish into Elasticsearch. There's nothing I'm aware of that will do this for you automatically. Assuming it's not billions of rows, I'd stick with your plan of writing a short script to run your query and HTTP POST the results into Elasticsearch.
You'll need to decide on two things: Index name, and document type(s).
Some notes:
The consistency model of a relational database like Postgres and that of an eventually consistent document store like Elasticsearch are quite different. You should be aware of these differences and their drawbacks.
You will likely want the data in elasticsearch to be de-normalized, as there are no awesome ways of doing joins.
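As a minimal sketch of such a script - the model names and associations are taken from the question, while the properties/property index/type and the localhost URL are assumptions you'd adjust - it builds one de-normalized document per property and HTTP PUTs it into Elasticsearch:

require "net/http"
require "json"

Property.includes(:listings, :address).find_each do |property|
  # De-normalize the three tables into one document (see the note above).
  doc = property.as_json.merge(
    "listings" => property.listings.map(&:as_json),
    "address"  => property.address.as_json
  )
  uri = URI("http://localhost:9200/properties/property/#{property.id}")
  Net::HTTP.start(uri.host, uri.port) do |http|
    http.put(uri.path, doc.to_json, "Content-Type" => "application/json")
  end
end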
I am using Thinking Sphinx version 2.0.10 in Rails for full text search, and I am dealing with millions of records in the database. It takes a huge amount of time to return results, so is there any way to keep the indexes on a swap device so it will work faster?
Thank you for your help.
Thinking Sphinx configures Sphinx to store attributes in memory - but as far as I know there's no such setting that applies to field data. Sphinx index files can be stored on any disk you like, though, instead of just RAILS_ROOT/db/sphinx/RAILS_ENV - this is configured using the searchd_file_path setting in config/sphinx.yml.
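For example, in config/sphinx.yml (the paths below are just placeholders - point them at whichever disk you want the index files on):

development:
  searchd_file_path: "/mnt/fast_disk/sphinx/development"
production:
  searchd_file_path: "/mnt/fast_disk/sphinx/production"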
Perhaps you could elaborate on how you're using Sphinx and Thinking Sphinx - what kinds of queries you're running that are slow, and what the relevant index structures look like. There may be other ways of improving the speed of this.
I am using thinking_sphinx in Rails. As far as I know, Sphinx is used for full text search. Let's say I have these search parameters:
keyword
country
sort order
I use Sphinx for all of the searches above. However, when I am querying without a keyword - just country and sort order - is it better to use a normal MySQL query instead of Sphinx?
In other words, should Sphinx be used only when a keyword is being searched?
Looking at overall performance and speed.
Not to sound snarky, but does performance really matter?
If you're building an application which will only be used by a handful of users within an organization, then you can probably dismiss the performance benefits of using one method over the other and focus instead on simplicity in your code.
On the other hand, if your application is accessed by a large number of users on the interwebz and you really need to focus on being performant, then you should follow #barryhunter's advice above and benchmark to determine the best approach in a given circumstance.
P.S. Don't optimize before you need to. Fight with all your heart to keep code out of your code.
Benchmark! Benchmark! Benchmark!
I.e. test it yourself. The exact performance will vary depending on the exact data, and perhaps even on the relative performance of your Sphinx and MySQL servers.
Sphinx will offer killer-speeds over MySQL when searching by a text string and MySQL will probably be faster when searching by a numerical key.
So, assuming that both "country" and "sort order" can be indexed using a numerical index in MySQL, it will be better to use Sphinx only when there is a "keyword", and a normal MySQL query for the other two - something like the sketch below.
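A minimal sketch of that split (the model, column and index names are assumptions, and the Thinking Sphinx index with country_id as an attribute is assumed to be defined elsewhere):

class Listing < ActiveRecord::Base
  def self.find_listings(keyword, country_id, order)
    if keyword.present?
      # Full text search goes through Sphinx, filtering on attributes.
      search keyword, with: { country_id: country_id }, order: order
    else
      # No keyword: a plain query against MySQL's numeric indexes.
      where(country_id: country_id).order(order)
    end
  end
end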
However, benchmarks won't hurt, as barryhunter suggested ;)
I've been looking into search plugins/gems for Rails. Most of the articles compare Ferret (Lucene) to Ultrasphinx or possibly Thinking Sphinx, but none of them talk about SearchLogic. Does anyone have any clues as to how that one compares? What do you use, and how does it perform?
thinking_sphinx and Sphinx work beautifully: no indexing, query or install problems ever (5 or 6 installs, including production on Slicehost).
Why doesn't everybody use Sphinx, like, say, Craigslist? Read here about its limitations (the articles are a year and a half old; the Sphinx developer, Aksyonoff, is working on these, and he's adding features, improving reliability and stamping out bugs at an amazing pace):
http://codemonkey.ravelry.com/2008/01/09/sphinx-for-search/
http://www.ibm.com/developerworks/opensource/library/os-php-apachesolr/
Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?
Ferret: easy install, doesn't stem properly, very slow indexing (one MySQL db: Sphinx: 3 seconds, Ferret: 50 minutes). Well-documented problems (index corruption) in DRb servers in production under load. Having said that, I have used it in development since acts_as_ferret came out 3 years ago, and it has served me well. Not adhering to Porter stemming is an advantage in some contexts.
Lucene and Solr are the gorilla/Mack truck/heavyweight champ of open source search. The teams have added an impressive number of new features in the Solr 1.4 release.
acts_as_solr: works well once Tomcat or Jetty is in place, but those are sometimes a pain. The acts_as_solr fork by mattmatt is the main one, but the project is relatively unmaintained.
Re the Tomcat install: Solr/Lucene has unquestionably the best knowledge base/support search engine of any software package I've seen (I guess I'm not that surprised) - the search box here:
http://www.lucidimagination.com/
Sunspot is the new Ruby wrapper, built on solr-ruby. It looks promising, but I couldn't get it to install on OS X. It indexes all Ruby objects, not just databases through ActiveRecord.
One thing that's really instructive is to install two search plugins, e.g. Sphinx and Solr, or Sphinx and Ferret, and see what different results they return. It's as easy as diffing #sphinx_results - #ferret_results, as in the sketch below.
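For example, assuming acts_as_ferret and Thinking Sphinx are both set up on the same (hypothetical) model:

sphinx_ids = Article.search("rails").map(&:id)            # Thinking Sphinx
ferret_ids = Article.find_with_ferret("rails").map(&:id)  # acts_as_ferret

# Records that only one of the two engines returned:
puts "Sphinx only: #{(sphinx_ids - ferret_ids).inspect}"
puts "Ferret only: #{(ferret_ids - sphinx_ids).inspect}"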
Just saw this post and the responses:
http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/
http://www.jroller.com/otis/entry/open_source_search_engine_benchmark
http://www.flax.co.uk/blog/2009/07/07/xapian-compared/
First off, my obvious bias: I created and maintain Thinking Sphinx.
As it happens, I actually saw Ben Johnson (creator of SearchLogic) present on it at the NYC Ruby meetup last night. SearchLogic is SQL-only - so if you're not dealing with massive tables and don't need relevance rankings, it could be exactly what you're looking for. The syntax is pretty clean, too.
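For instance (the model and column names here are made up), Searchlogic chains dynamic scopes straight off the model:

User.username_like("bob").age_gte(21).ascend_by_created_at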
However, if you want all the query intelligence handled by code that is not your own, then Sphinx or Solr (which is Lucene under the hood, I think) is probably going to work out better.
SearchLogic is a good plugin, but it's really meant to make your search code more readable; it doesn't provide the automatic indexing that Sphinx does. I haven't used Ferret, but Sphinx is incredibly powerful.
http://railscasts.com/episodes/120-thinking-sphinx
Great introduction to see how flexible it is.
I have not used SearchLogic, but I can tell you that Lucene is a very mature project that has implementations in many languages. It is fast and flexible, and the API is fun to work with. It's a good bet.
Given that this question still ranks highly on Google for full text search, I'd really like to say that Sunspot is even stronger today if you're interested in adding full text search capabilities to your Rails application (and would like to have Solr behind you for that). You can check a full tutorial on this here.
And while we're at it, another contender that has arrived in the field is ElasticSearch, which aims to be a real time full text search engine built on top of Lucene (but doing things differently when compared to Solr). ElasticSearch includes out-of-the-box sharding and replication to multiple nodes, faster real time search, and "percolators" that allow you to receive notifications when something matching your criteria becomes available, and it's moving really fast with many more features on the way. It's easy to build something on top of it, since the API is dead simple and completely based on REST, using JSON as the format. One could say you don't even need a plugin to use it.
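For a taste of how simple that REST API is, indexing and then searching a document is just a couple of curl calls (the index and type names here are made up):

curl -XPUT 'http://localhost:9200/articles/article/1' -d '{ "title": "Hello search" }'
curl -XGET 'http://localhost:9200/articles/_search?q=title:hello'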
Personally, I don't bother with database agnosticism for web applications and am quite happy using the full text search in PostgreSQL 8.3. The benefit is that if and when you change your framework/language, you will still have full text search.
Full Text Indexing and MATCH() AGAINST().
If you're just looking to do a fast search against a few text columns in your table, you can simply use a full text index of those columns and use MATCH() AGAINST() in your queries.
Create the full text index in a migration file:
add_index :table, :column, type: :fulltext
Query using that index:
where( "MATCH( column ) AGAINST( ? )", term )
ElasticSearch and Searchkick
If you're looking for a full blown search indexing solution that allows you to search for any column in any of your records while still being lightning quick, take a look at ElasticSearch and Searchkick.
ElasticSearch is the indexing and search engine.
Searchkick is the integration library with Rails that makes it very easy to index your records and search them.
Searchkick's README does a fantastic job of explaining how to get up and running and how to fine-tune your setup, but here is a little snippet:
Install and start ElasticSearch.
brew install elasticsearch
brew services start elasticsearch
Add searchkick gem to your bundle:
bundle add searchkick --strict
The --strict option just tells Bundler to use an exact version in your Gemfile, which I highly recommend.
Add searchkick to a model you want to index:
class MyModel < ApplicationRecord
searchkick
end
Index your records.
MyModel.reindex
Search your index.
matching_records = MyModel.search( "term" )
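And once that works, narrowing a search is just a matter of options (the column names below are made up; the README covers the full list):

matching_records = MyModel.search(
  "term",
  fields: [:name, :description],  # only match against these fields
  where:  { active: true },       # filter on attributes
  order:  { created_at: :desc }   # sort the results
)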
For anyone looking for a simple search gem without any dependencies, check out acts_as_indexed.