How to best use Solr parser syntax in a specific business requirement - parsing

Just starting to learn Solr for a project at work and I was wondering how to go about this issue. Our application allows a user to search based on a business name. The business name is made up of 3 different categories (English, French and Combined Name). Based on a single query entered by the user, how would one go about using Solr to provide the most relevant search results? I have looked into fuzzy and proximity searches, which seem reasonable enough, although fuzzy search only applies to a single term, which makes me believe that I would need to split the query into single terms, apply fuzzy search to each and merge the results if I were to use it. My question is: how best to approach the problem? Thanks!

To provide relevancy for your documents, you need a combination of proper boosting queries and a clear idea of what relevance means for your use case. If regex-based search is part of the use case you may go for NGrams; if exact search is what you are after, boosting is important. You can use parameters like phrase slop (ps), mm, and the other edismax parameters to your advantage. You may use a combination of title and text content search, with a good combination of boosts. Also, Solr allows you to pass your query terms in parentheses, which functions like a SQL IN query and further boosts relevancy by sticking to the keywords mentioned in the query. And, at last, if all of this doesn't suffice, you may use custom function queries to meet your needs. While doing all this, just make sure the analyzers in your schema.xml are right and serve the purpose of executing the queries mentioned above.
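As a rough sketch (the field names name_en, name_fr and name_combined are placeholders for your three business-name categories, not anything Solr defines for you), an edismax request combining these ideas might look like:

q=acme boulangerie
defType=edismax
qf=name_en^3 name_fr^3 name_combined^1.5
pf=name_en^10 name_fr^10
ps=2
mm=2<75%

Here qf spreads the user's terms across the three name fields with boosts, pf and ps reward documents where the terms appear as a phrase (with a slop of 2), and mm requires every term to match for one- or two-term queries and 75% of them for longer ones.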

You can go as far down this rabbit-hole as you have time for with regard to business name search (fuzzy matching, sound-alike matching, language-specific analysis, weird compounded terms used as a domain name; e.g. getting "EZBake" to match "easy bake", or "1-to-1" to match "one to one", is non-trivial).
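As one illustration of the analysis work involved, a synonym file is a common first stab at the compounded-terms case. This is only a sketch; the field type name and synonym file are made up, and multi-word synonyms bring their own pitfalls (recent Solr versions prefer SynonymGraphFilterFactory for them):

<fieldType name="text_business" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="business_synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

with business_synonyms.txt containing lines like:

ezbake, easy bake
1-to-1, one to one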
Since this sounds like a pre-existing application, I typically look to query logs (when available) to sample the frequency of different types of mismatches (dig out the zero-result search terms and start manually categorizing the high-level issues behind the more common mismatches).
That will provide you with a backlog of "matching use cases to research how to implement" (in the order of maximal benefit, as determined by your sample).
Then you're ready to start burning them down, and asking much more specific questions about how to get Solr to jump through your domain-specific hoops.

Related

How to implement fuzzy search

I'm using the Neo4j 3 REST API and I have a node named customer; it has properties like name, etc. I need to get search results on the customer name, e.g. I should get results for the name "john" for my input "joan". How do I implement fuzzy search to get my desired results?
Thanks in advance
First off, I want to make sure that you know that if you're using Neo4j 3.x, 3.x is currently in beta and isn't considered stable yet.
You have two options to implement a fuzzy search in Neo4j. You can use the legacy indexes to get Lucene-based indexing. That should provide anything that Lucene can do, though you'd probably need to do a bit more work. You can also implement your own unmanaged extension, which will allow you to use Lucene a bit more directly.
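For example, with a manual (legacy) index over the customer nodes you can pass a raw Lucene query string, including its fuzzy operator, straight through from Cypher. The index name customer_index is an assumption, and the nodes have to have been added to that manual index explicitly:

START c = node:customer_index("name:joan~")
RETURN c.name

The trailing ~ is Lucene's fuzzy operator; whether "joan" actually matches "john" depends on the edit distance the index allows.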
Perhaps the easier alternative is to use elasticsearch with Neo4j and have elasticsearch do your full-text indexing. You might take a look at the Neo4j and ElasticSearch page on neo4j.com. There they provide a link to a GitHub repository with a plugin for Neo4j which automagically updates ElasticSearch with data from Neo4j and which provides an endpoint for querying your graph fuzzily. There is also a video tutorial on how to do this.
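Once the data is mirrored into ElasticSearch, a fuzzy query against the indexed name property looks roughly like this (the index name and field mapping depend on how the plugin exports your graph):

GET /customers/_search
{
  "query": {
    "fuzzy": { "name": { "value": "joan", "fuzziness": 2 } }
  }
}

Since "joan" and "john" are a single edit apart, a fuzziness of 1 or 2 is enough for them to match each other.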
You will have to try using https://neo4j.com/developer/kb/how-to-perform-a-soundex-search/, which will work in this case. If your input is "joan" you will not get "john" as the response, unless you just give "jo" as the input, in which case you will get both. To get what you are expecting you will have to use the soundex search.
Stepping back a little, what is the problem you are trying to solve with fuzzy matching?
My experience has been that misspellings and typos are far less common than you might think, and humans prefer exact matches whenever possible. If there is no exact match (often just missing a space between words), that's a good time to use a spellchecker, and that's where the fuzzy matching should kick in.
In addition, your example would match "joan" to "john", but some synonyms like "joanie" would be more useful. If you have a big corpus of content to work with, you may be able to extract some relationships, using fuzzy & machine learning to identify "joanne" and "joni" as possible synonyms and then submit that to a human curator. "Jon" looks like a related name but it's not, while "jo" and even "nonie" may or may not be nicknames in these groupings.

How to get a search ranking based on multiple factors in sphinx?

Hello stackoverflow folks,
We have a Rails project which is growing and growing, and we are now hitting our first performance problems with search, because we don't know how to utilize Sphinx properly for our needs.
We have search queries like "Java PHP Software developer". Our problem now is that the ranking should take multiple things into account.
As search fields we have tag list, description and title.
If one of the terms is inside one of the fields it should get, for example, 2 points. More points if it's in more fields, but not multiple points if it is in the same field more than once.
The next problem is that I have a big file with synonyms which should also be checked. It looks like this:
Java > Java
Java-EE > Java
...
So if Java-EE is found it should get some points too but with a penalty for being a synonym.
Maximum amount of points would be 5 as in 5 stars which get displayed.
Any speedy solution would be nice, because at the moment it's done in plain Ruby and it gets slow, since we can't rank properly in Sphinx.
If there is a solution with another search engine that would also be very nice, as it could be changed.
Thanks in advance for all efforts. All spelling corrections and questions to clear the question are welcome.
Most of the performance issues can be solved by changing the way you use Sphinx. First you need to address how you index the data in Sphinx: doing some processing during indexing will make the search quicker and the results more relevant. Second, tackle the search terms, and last but not least, decide on the ranking algorithm to use.
I am going to use the "title" field as an example, but the logic can be replicated for all fields.
Indexing
Add two fields to sphinx ("title" and "title_synonyms"). For each record in the database do the following :-
Perform a DISTINCT on the words to remove duplicates ("Ruby Developer / Java Developer" will become "Ruby Developer / Java"). This will stop records from getting two scores for duplicates when searching. This goes into "title".
Take the DISTINCT title from above and REPLACE all the words with their expanded synonym equivalents. I would suggest putting the synonyms in the DB to make the expansion easier. The text would then become "Ruby Developer / Java-EE". Each word must be replaced with all of its synonyms; if Java has two synonyms, they both must be in the field. This goes into "title_synonyms".
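As a rough sketch of that transformation (Python just for illustration; the synonym map is assumed to be your synonym file reversed, canonical word to alternative spellings):

def build_title_fields(title, synonyms):
    # synonyms: canonical word -> alternative spellings, e.g. {"java": ["Java-EE"]}
    words, seen = [], set()
    for w in title.split():
        if w.lower() not in seen:                  # DISTINCT: drop repeated words
            seen.add(w.lower())
            words.append(w)
    expanded = [s for w in words for s in synonyms.get(w.lower(), [w])]
    return " ".join(words), " ".join(expanded)     # ("title", "title_synonyms")

build_title_fields("Ruby Developer / Java Developer", {"java": ["Java-EE"]})
# => ('Ruby Developer / Java', 'Ruby Developer / Java-EE')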
Searching
Because there are now two fields in sphinx we can give them each a different weight: "title" can get a weight of 10 and "title_synonyms" a weight of 3. That means a record has to match 4 synonyms before it ranks higher than one that matches the original title. You can play around with the weights to suit your needs.
Let's assume a user is searching for "Java Developer". For the search phrase do the following :-
Remove duplicate words
Get synonyms for each word in the search phrase
Set Matching Mode in Sphinx to SPH_MATCH_EXTENDED
The above rules will mean the search in sphinx looks like this :-
#title "Java Developer" | #title_synonyms "Java-EE"
If you want to rank exact matches higher than lexemes, the search query would look like this :-
#title ("Java Developer" | "=Java =Developer") | #title_synonyms ("Java-EE" | "=Java-EE")
You will need to use SPH_RANK_PROXIMITY_BM25 or SPH_RANK_SPH04 to make this work properly though.
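Put together, the SphinxQL equivalent of those API settings looks roughly like this (the index name jobs_index is an assumption, and the =word exact-form operator requires index_exact_words to be enabled on the index):

SELECT id, WEIGHT() FROM jobs_index
WHERE MATCH('@title ("Java Developer" | "=Java =Developer") | @title_synonyms ("Java-EE" | "=Java-EE")')
OPTION field_weights=(title=10, title_synonyms=3), ranker=sph04;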
Ranking
You can try any of the built in ranking algorithms to see what the results look like. I recommend SPH_RANK_MATCHANY or SPH_RANK_WORDCOUNT as a start.
For Proximity and exact match ranking use SPH_RANK_PROXIMITY_BM25, SPH_RANK_SPH04 or SPH_RANK_EXPR where you can use your own algorithm.
Conclusion
You should now have a search that is both fast and accurate. Very little work has to be done by your Ruby application and most of the work is done inside sphinx (where it should be).
Hope this helps...
This performance problem is an algorithm problem.
If you cannot express the problem in a way that lets a backend tool, like Sphinx or the database engine, do the work, then you end up doing the processing in Ruby, and that is an easy way to get a performance problem.
First, do as much as you can with Sphinx (or whatever other search engine) and the database. The more pre-digested the data coming into Ruby, the less you have to do in Ruby code, and that will likely be faster, since databases have been highly optimized over the last half century.
So, for example, run Sphinx on the keywords. Also run Sphinx on the synonyms. Limit all the answers to the top results, and merge the results. That way your Ruby code will be limited to the likely high results instead of having to consider the whole database of entries.
Once in Ruby, the most important thing is to avoid high-order algorithms; that is, make sure you are using a low-order algorithm.
As you process your raw data, if you hold your top results in an array and try to sort or scan the array, you are going to have an N-squared order. That is, your order will be the product of the number of raw entries and the number of elements you keep in your array.
The best approaches for your problem are a priority queue implemented by a heap-like container, or a B-tree. Both have N-log-N order (N times the log of N), or the number of raw data records times the log of the number of items you will keep in your container.
A heap is a binary tree where each node in the tree (not just the leaves but each node) holds a rated record. The nodes below each record all have lower ranks. This is called the heap condition.
There are algorithms for adding elements, taking the top-ranked element out, and replacing the lowest-ranked element, all while maintaining the heap condition. Look up binary heap on Wikipedia.
Let's say your site will display the top 100 ranked results. Maintain a heap where the root is the lowest ranked. Populate the heap by adding the first 100 raw records you are processing.
Now for record 101 and after, compare its rank with the root. If the new record is ranked higher, use the delete algorithm to reduce your heap to 99 nodes (which will remove the lowest ranked record in the heap) and add your new record to the heap.
Once you have gone through all your records, you will have the top 100 ranked results. The heap delete algorithm will pull them out in reverse order.
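As a concrete sketch of that streaming top-N idea (Python purely for brevity; the same approach works in Ruby, and the shape of each record and its rank is assumed):

import heapq

def top_n(records, n=100, key=lambda r: r["rank"]):
    heap = []                                # min-heap: root is the lowest-ranked record kept
    for i, rec in enumerate(records):
        entry = (key(rec), i, rec)           # i breaks ties so records themselves never compare
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            heapq.heapreplace(heap, entry)   # drop the current lowest, keep the better record
    return [rec for _, _, rec in sorted(heap, reverse=True)]

Each raw record costs at most one O(log n) heap operation, which gives the N-log-N behaviour described above.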

Lucene partial word matching

Lucene does not support it out of the box, so I need some help building my query.
Let's say I have a document with the field value "Develop".
I would like this document to be returned for the searches "Dev" and "lop".
Maybe creating two queries?
"*keyword"
and
"keyword*"
and
"keyword"
?
How would you go about doing this with multiple words? Would you split the sentence/search into a words list and do the previous example for each word?
What you're asking is, if I understand you correctly, not feasible on any large-scale search engine.
Lucene creates an index over keywords using term-document matrix and inverted-file techniques (see the links at the bottom). Fully fledged string matching might be very nice to have, but it does not scale: you would never be able to query a decently sized index (say, more than a couple of dozen or hundreds of documents) in an acceptable time.
Still, here are two ideas that might help...
Syllable tokenization
To come back to your example with 'Develop': as long as you are happy with letting users search for syllables, I guess you can do something.
You would have to create or use a tokenizer that splits up the words in your index according to their syllables, and create a database index over the syllables. (I am not sure there are built-in tokenizers for the English language that can do that, and writing one on your own might be tricky...)
An important thing to note:
If you index the full words AND the separate syllables, the size of your index will be much larger than if you only index one of the two.
However, I would not suggest indexing only syllables. If you also want to allow your users to search for the full word 'Develop' (which I guess you do), this would result in two queries with a logical AND between them, namely <'dev' AND 'lop'>. Although Lucene supports such logical constructs in queries, they are very expensive. I have personally had some trouble in the past using logical queries in Lucene.
Stemming
Another way to somehow arrive at what you're trying to do could be to use a brutal form of word stemming (http://en.wikipedia.org/wiki/Stemming) that stems words down to their first syllable. (This would allow searching for 'dev' but not for 'lop'...)
Again, I don't think such a word-stemming feature is already in Lucene. Writing one yourself will be a pain and will involve working with/importing huge dictionaries.
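If you just want to experiment before committing to dictionaries, a very crude 'first syllable' heuristic can be approximated with a regex. This is a toy sketch, not a linguistically correct syllabifier; applied at both index and query time it would let 'dev' find 'Develop', but it does nothing for 'lop':

import re

def crude_stem(word):
    # leading consonants + first vowel group + consonants up to the next vowel
    m = re.match(r"[^aeiou]*[aeiou]+[^aeiou]*", word.lower())
    return m.group(0) if m else word.lower()

crude_stem("develop")    # => 'dev'
crude_stem("developer")  # => 'dev'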
Links
These might be worth looking into if you don't know about search engine internals:
http://en.wikipedia.org/wiki/Index_%28search_engine%29
http://en.wikipedia.org/wiki/Vector_space_model
http://en.wikipedia.org/wiki/Inverted_file
http://en.wikipedia.org/wiki/Term-document_matrix
http://en.wikipedia.org/wiki/Tf-idf

What is the best approach for interpreting a text input for geocoding purposes?

Consider the following site:
http://maps.google.com
It has a main text input where the user can type businesses, countries, provinces, cities, addresses and zip codes. I wonder what the best way is to implement a search like this. I realize that Google Maps probably uses a full-text search with all kinds of data in the same table, and that it probably has a parser which classifies the input (i.e. between numeric, like zip codes and coordinates, and textual, like businesses and addresses).
With the data spread across many tables and systems, a parser is essential. The parser could be built from regular expressions, or with AI tools like artificial neural networks and genetic algorithms.
Which approach would you recommend?
It might be best to aggregate the data from all of your tables into a search index. Lucene is a free search engine, similar to how Google's search engine works (inverted index), and it should allow you to search by any of those values or any combination of them with relative ease.
http://lucene.apache.org/java/docs/
Lucene comes with its own query language (again, very similar to Google's or any other Internet search site's syntax). The only drawback of using something like Lucene is that you would need to build its index. You wouldn't be querying your database directly (which could get very complicated... inverted indexes are pretty much designed for what you're trying to do), so you need to periodically gather up new information from your database and add it to your index. It might also be necessary to rebuild the index to remove unneeded data.
With Lucene, you get a pretty flexible query syntax that most people are familiar with (because pretty much everyone searches the internet), it performs very well, and it is not terribly complicated. By using Lucene, you avoid the hit of using regular expressions (which are not the most performant text-searching mechanism), and you don't have to write your own parser. It should be a win-win, aside from a little learning curve to build a Lucene index generator and figure out how to query that index.
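For instance, once businesses, cities and zip codes are indexed as separate fields in the same Lucene index, one query string can combine them (the field names here are made up for illustration):

name:"coffee shop" AND city:seattle AND zip:981*

A looser variant would drop the field prefixes and search a single catch-all field, letting Lucene's scoring sort out relevance.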
I'd have the data in one database. If the data got too big or I knew it would be huge, I'd assign an id to each business, address, etc., then have other tables which reference this data.
Regular Expressions would only be necessary if the user could define what they want to search for:
business: Argos
But then what happens if they want an Argos in Manchester (sorry, I'm English)? Maybe then get the location of the user based on their IP, but what happens if they say:
business: Argos Scotland
Now you don't know if the company has two words, or if there is a location next to it. All of this has to be taken into consideration.
P.S. Sorry if that made no sense.
You will need to pre-process the query before doing a full-text search on it. If you are using a GIS database, then you will already have columns like city, area code, country, etc. Convert your query into tokens separated on spaces or commas, or both, then hit the individual columns to look for matches. This way you will know which part of the query is the city, the area code, etc.
You could also try some naive approximation approaches; for example, 6 consecutive digits will probably be an area code. Look for common words like "road", "restaurant", "street", etc., which will be part of many queries, and then use some approximation to figure out what the user is looking for. Hope this helps.
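A toy sketch of that pre-processing step (the CITIES set and the category names are invented for illustration, not a real GIS schema):

import re

CITIES = {"manchester", "glasgow", "london"}   # in practice, a lookup against your city column

def classify_tokens(query):
    parts = {"areacode": [], "city": [], "text": []}
    for token in re.split(r"[,\s]+", query.strip()):
        if re.fullmatch(r"\d{6}", token):      # six consecutive digits: probably an area code
            parts["areacode"].append(token)
        elif token.lower() in CITIES:          # matches a known city
            parts["city"].append(token)
        else:                                  # everything else: business name, street, etc.
            parts["text"].append(token)
    return parts

classify_tokens("Argos Manchester 123456")
# => {'areacode': ['123456'], 'city': ['Manchester'], 'text': ['Argos']}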

Using Ferret to build unique tag clouds

I've been using Ferret as my full-text search engine in a small project I'm working on.
Through the documentation and a few examples online, I've been able to pull together a tag cloud generator that uses the full-text index via the IndexReader.terms method.
It's worked quite well up to now, but now I want to get term data based on a search result set.
For example, if the user searches for "cake", I want to show them a tag cloud of terms used in association with the term "cake".
I've been looking for examples of how the terms method can be used in association with a search result set, or something similar.
Currently I'm using the following method to generate my list of tags:
reader = Ferret::Index::IndexReader.new(Scrape.find_last_index_version)
terms = []
reader.terms(:all_quotes).each do |term, doc_freq|
  terms << [term, doc_freq]
end
Cheers.
It sounds more like a term frequency chart (like a Wordle) than a tag cloud, or are these in a tag field? In any case, the index doesn't keep track of term frequency within each possible document subset (such as the results of a search), so that method wouldn't be fast even if it existed. For a single document, you can get the TermFreqVector and provide suggested documents that are good matches for the other frequent terms in that document. So you could take some of the top results, grab the term vectors from each one, and just add them up, but those aggregate functions don't exist natively (the library authors generally try not to put slow operations in there).
