elasticsearch nGram/edgengram partial match? - ruby-on-rails

I'm trying to get partial search working: a search for
"sw"
"swe"
"swed"
should match "Sweden".
I looked around and just can't get it to work.
Rails code: I'm using this code from the Tire repo as my template.
Whole words still match! I have reindexed and also tried using the edgengram filter.

I'm not a Ruby developer, but I found this article useful:
http://dev.af83.com/2012/01/19/autocomplete-with-tire.html
As it says:
Most databases handle that feature with a filter (the LIKE keyword in SQL, regular-expression search in MongoDB). The strategy is simple: iterate over all results and keep only the words which match the filter. This is brutal and hurts the hard drive. Elasticsearch can do it too, with the prefix query.
With a small index, it plays well. With large indexes it will be slower and more painful.
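For reference, the prefix query the article mentions is just a small request body. A sketch as a Ruby hash (the name field is an assumption); it matches indexed terms that start with the given string, with no n-gram analysis needed:

# The prefix query matches indexed terms beginning with "swe".
body = {
  query: {
    prefix: { name: "swe" }
  }
}
# POST body.to_json to the index's _search endpoint with any HTTP client.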
You said:
Whole words still match!
And what do you expect? Is "Sweden" not supposed to match "Sweden", but only "Swe", "Swed" or "Swede"?
Whole words match because your query against the field is analyzed too: by default the same edge n-gram analysis runs at search time, so a full word produces tokens that match the index just like any prefix does. Whole-word matches are expected behavior.

Use the edgengram token filter. That will get you what you're looking for.
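For illustration, a minimal sketch of such a mapping with Tire, assuming a Country model with a name field (the analyzer names and gram sizes here are made up). The key is an edge n-gram filter at index time paired with a plain search analyzer, so queries themselves aren't n-grammed:

class Country < ActiveRecord::Base
  include Tire::Model::Search
  include Tire::Model::Callbacks

  settings analysis: {
    filter: {
      # Emits "sw", "swe", "swed", ... for every token at index time.
      name_ngrams: { type: "edgeNGram", min_gram: 2, max_gram: 15 }
    },
    analyzer: {
      index_name_analyzer:  { type: "custom", tokenizer: "standard",
                              filter: %w[lowercase name_ngrams] },
      search_name_analyzer: { type: "custom", tokenizer: "standard",
                              filter: %w[lowercase] }
    }
  } do
    mapping do
      # N-gram at index time only: queries go through the plain analyzer,
      # so a search for "sw" matches "Sweden" without the query itself
      # being split into n-grams.
      indexes :name, type: "string",
                     index_analyzer:  "index_name_analyzer",
                     search_analyzer: "search_name_analyzer"
    end
  end
end

After changing settings like these you must drop and rebuild the index (e.g. with Tire's import rake task, rake environment tire:import CLASS='Country' FORCE=true), or the new analyzers won't take effect.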

Related

LIKE condition in SphinxQL

Dear programmers and IT experts, I need your help. I've just started to research what Sphinx is. I even made my own "google suggest", which fixes frequent and common human search-input mistakes. The problem is that it tries to fix errors all the time and interrupts the real input.
Well, I want the search engine to first try to find matches in the searched field by substring; then, if no matches are found, to use my logic for fixing errors. To put it shortly, I want Sphinx to first execute the equivalent of this SQL command
SELECT * FROM suggest WHERE keyword LIKE('%$keyword%')
and then, if nothing is found, continue with mistake fixing.
The main question is: is it possible to tell Sphinx to search by substring?
Sphinx can mostly do that, but you need to understand how it works. Sphinx indexes individual words and matches by keywords. It uses a large inverted index to make queries fast (rather than running a substring match).
So you can run MATCH('one two') as a query, and it will match a document that contains '... one two ...'; but the order doesn't matter and other words can be present, so it will ALSO match '... two three one ...', which wouldn't happen with MySQL's LIKE (a pure substring match).
You can use the phrase operator to fix that: MATCH('"one two"').
Furthermore, Sphinx matches whole words by default. So MATCH('one two') will only match those two words. It won't match a document saying '... one twotwentyone ...', whereas LIKE doesn't restrict matches to whole words.
So you can use wildcards to allow partial matches: MATCH('"*one two*"') --- you also need to enable this on the index with the min_infix_len config!
And even more: Sphinx doesn't index punctuation etc. (with the default charset_table), so a document saying '... One! (two?) ...' WOULD still match MATCH('"one two"'). SQL LIKE would NOT ignore that.
You could change Sphinx to index more punctuation (via charset_table) to get closer to a substring match.
So SELECT * FROM index WHERE MATCH('"*$keyword*"') is probably the closest Sphinx query to your original (i.e. a substring match), as long as you're aware of the differences. There are also MySQL collations to consider; they're not exactly the same as charset_table.
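For example, a sketch of running that query from Ruby, assuming the suggest index from the question. searchd speaks the MySQL wire protocol on its SphinxQL port (9306 by default), so the mysql2 gem can talk to it directly:

require 'mysql2'

# Connect to searchd's SphinxQL listener (MySQL wire protocol, port 9306).
sphinx  = Mysql2::Client.new(host: "127.0.0.1", port: 9306)
keyword = sphinx.escape("one two")

# Wildcarded phrase match; requires min_infix_len to be enabled on the index.
sphinx.query("SELECT * FROM suggest WHERE MATCH('\"*#{keyword}*\"')").each do |row|
  p row
end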
(Frankly, while all this is correct, I wonder if it's a bit over the top. If you just have a textual corpus you want to search, you could index it as normal, then run queries through CALL KEYWORDS() to get an idea of whether the query is a valid word in the index: it just tells you how many times the given words appear in the index. You can then run your algorithm to fix the mistakes.)
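A sketch of that check, again over SphinxQL (the suggest index name is the question's; the third argument 1 turns on per-keyword statistics):

require 'mysql2'

# CALL KEYWORDS shows how the query is tokenized and, with stats enabled,
# how many documents/hits each keyword has in the index.
sphinx = Mysql2::Client.new(host: "127.0.0.1", port: 9306)
sphinx.query("CALL KEYWORDS('keword', 'suggest', 1)").each { |row| p row }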
As a side note, Sphinx does have a built-in suggest system:
http://sphinxsearch.com/blog/2016/10/03/2-3-2-feature-built-in-suggests/
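That suggest system is exposed over SphinxQL too. A hedged sketch, assuming Sphinx 2.3.2+ and an index built with dict = keywords and min_infix_len enabled:

require 'mysql2'

# CALL SUGGEST returns corrections mined from the index itself,
# as suggest / distance / docs rows.
sphinx = Mysql2::Client.new(host: "127.0.0.1", port: 9306)
sphinx.query("CALL SUGGEST('keword', 'suggest')").each { |row| p row }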

How to implement fuzzy search

I'm using the Neo4j 3 REST API, and I have a node named customer with properties like name. I need to search customers by name: e.g. I should get results for the name "john" for my input "joan". How do I implement fuzzy search to get my desired results?
Thanks in advance.
First off, I want to make sure you know that if you're using Neo4j 3.x, 3.x is currently in beta and isn't considered stable yet.
You have two options to implement a fuzzy search in Neo4j. You can use the legacy indexes to implement Lucene-based indexing. That should provide anything that Lucene can do, though you'd probably need to do a bit more work. You can also implement your own unmanaged extension, which will allow you to use Lucene a bit more directly.
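For illustration, a hedged sketch of the legacy-index route over the REST API, assuming a Lucene node index named customers on the name property (Lucene's trailing ~ requests a fuzzy, edit-distance match):

require 'net/http'
require 'json'
require 'uri'

# Legacy-index lookup via the transactional Cypher endpoint; add Basic Auth
# if your server requires it.
uri = URI("http://localhost:7474/db/data/transaction/commit")
payload = { statements: [
  { statement: 'START c=node:customers("name:joan~") RETURN c.name' }
] }
request = Net::HTTP::Post.new(uri.path, "Content-Type" => "application/json",
                                        "Accept"       => "application/json")
request.body = payload.to_json
response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
puts JSON.parse(response.body)["results"]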
Perhaps the easier alternative is to use elasticsearch with Neo4j and have elasticsearch do your full-text indexing. You might take a look at the Neo4j and ElasticSearch page on neo4j.com. There they provide a link to a GitHub repository which is a plugin for Neo4j that automagically updates ElasticSearch with data from Neo4j and provides an endpoint for querying your graph fuzzily. There is also a video tutorial on how to do this.
You could try using https://neo4j.com/developer/kb/how-to-perform-a-soundex-search/ which in this case will work. If your input is "Joan" you will not get "John" as the response, unless you just give "jo" as input, in which case you will get both. To get what you are expecting, you will have to use the soundex search.
Stepping back a little, what is the problem you are trying to solve with fuzzy matching?
My experience has been that misspellings and typos are far less common than you might think, and humans prefer exact matches whenever possible. If there is no exact match (often just missing a space between words), that's a good time to use a spellchecker, and that's where the fuzzy matching should kick in.
In addition, your example would match "joan" to "john", but some synonyms like "joanie" would be more useful. If you have a big corpus of content to work with, you may be able to extract some relationships, using fuzzy & machine learning to identify "joanne" and "joni" as possible synonyms and then submit that to a human curator. "Jon" looks like a related name but it's not, while "jo" and even "nonie" may or may not be nicknames in these groupings.

Search a document in Elasticsearch by a list of Wildcarded statements on a single field

If I have documents in Elasticsearch that have a field called url, and the contents of the url field are strings like "http://www.foo.com" or "http://www.bar.com/some/url/segment/the-page.html", is it possible to search for documents matching a list of wildcarded url fragments, e.g. ["http://www.foo.*", "http://www.bar.com/*/segment/*.html", "*://*bar.com/*"]?
If it is possible, what is the best approach? I have explored the wildcard query, which only seems to support one fragment, not multiple. Filters don't seem to support wildcarding: I have tried using * in a term filter without any luck.
To make it a little more complex, I'm also interested in being able to search by a large number of these fragments. I have come across the terms filter with lookup, which seems like a good solution for dealing with many search terms, but I'm not sure wildcarding works with filters.
Any thoughts?
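One hedged sketch of a possible approach (not from the thread): wrap one wildcard query per fragment in a bool query's should clause, so a document matches when at least one pattern matches. The url field is assumed to be indexed not_analyzed, since wildcards run against indexed terms:

fragments = ["http://www.foo.*", "http://www.bar.com/*/segment/*.html"]

# One wildcard clause per fragment; bool/should means "match at least one".
body = {
  query: {
    bool: {
      should: fragments.map { |pattern| { wildcard: { url: pattern } } }
    }
  }
}
# POST body.to_json to the index's _search endpoint.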

Good search term in Sphinx

Using Rails 3.2, Sphinx search and the Thinking Sphinx gem. I have tried wrapping the keywords in % and * (wildcards) in my controller, but I can't achieve what I want:
Keywords to search:
world trade
worldtrade
worl trade
trade world
trad worl
Expected matched search results:
World Trade Center
How should I format the keywords in my controller so that I get the expected search result with the different keywords as shown above?
For Sphinx, the wildcard character is * - so there's no point using %.
If you want every query to have wildcards added to it, use :star => true in your Thinking Sphinx search call.
If you want to match on parts of words, you need to enable either min_prefix_len or min_infix_len. The default is 0 (disabled), but given your examples, perhaps 4 is a good setting? The lower it is, the larger your index files get (and so searches may get slower as well), so perhaps don't set it to 1 unless you need to.
None of the above will make 'worldtrade' match, because you're providing one long word instead of two separate ones. You could look into adding a wordforms file to your Sphinx configuration, but that may be overkill in this case, as you'd need to cover each specific case.
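A sketch of the resulting call, with a hypothetical Building model standing in for whatever holds "World Trade Center":

# :star => true wraps each term in wildcards, so "worl trade" reaches Sphinx
# as "*worl* *trade*". It only has an effect once min_prefix_len or
# min_infix_len is enabled and the index has been rebuilt.
results = Building.search "worl trade", :star => true
results.each { |building| puts building.name }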

Rails 4 + SQLite: Find records with substring of big string?

I want to do the opposite of this:
image_name = "blah"
Pipe.where("LOWER(name) like ?", "%#{image_name.downcase}%")
(this would find a Pipe named 'blahzzz')
What I want is the opposite: I have a pipe named 'ah' and, given image_name = "blah", I want to be able to find that Pipe. How would I accomplish this?
I think what you are looking for is the functionality of a full-text search index like Lucene. Look at Sunspot or Tire; they provide bindings for Solr and Elasticsearch, which are the most common full-text servers out there.
If you want to find partial matches of text, there is a feature called n-grams that lets you find matching parts or substrings. I think that would be the way to go on a larger scale.
If you have just one place where you are going to implement this functionality and your database is not too large, you can mimic the behavior in a relational database by combining a lot of OR LIKE queries, providing substrings of the input.
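A hedged sketch of that last approach, enumerating the substrings up front and using IN with exact values (which amounts to the same check as a pile of OR LIKEs):

image_name = "blah"

# Every substring of the input: "b", "bl", ..., "lah", "ah", "h".
# An n-character string has n*(n+1)/2 of them, so keep inputs short.
candidates = (0...image_name.length).flat_map do |i|
  (i...image_name.length).map { |j| image_name[i..j].downcase }
end.uniq

# Finds the Pipe named 'ah', since "ah" is among the substrings of "blah".
pipes = Pipe.where("LOWER(name) IN (?)", candidates)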
