LIKE condition in SphinxQL - search-engine

Dear programmers and IT experts, I need your help. I've just started to research what Sphinx is. I even made my own "google suggest", that fix frequent and common human search input mistakes. The problem is, that it tries to fix errors all the time and interrupt the real input.
Whell, I want the search engine try to find consilience in searched field by substring first, than, if consiliences are not found, than use my logic for fixing errors. If to say shortly, I want sphinx, first of all, execute this SQL equivalent command
SELECT * FROM suggest WHERE keyword LIKE('%$keyword%')
than, if nothing found, continue mistakes fixing.
The main questioin is....is it possible to tell spinx to search by substring?

Sphinx can mostly do that, but need to understand how it works. Sphinx indexes individual words, and matches by keywords. It uses a large inverted index to make queries fast (rather than running a substring match)
So can do MATCH('one two') as a query, and it will match a document that contains '... one two ...', but the order doesn't matter, and other words can be present, so will ALSO match '... two three one ...' which wouldn't happen with mysql LIKE (its a pure substring match)
Can use phrase operator to do that MATCH('"one two"')
Furthermore Sphinx matches whole words by default. So MATCH('one two') will only match those two works. It wont match against a document say "... one twotwentyone ...' whereas LIKE doesn't restrict to whole words.
So can use wildcards to allow part matches. MATCH('"*one two*"') --- also need to enable it on the index with min_infix_len config!
And even more, sphinx doesn't index punctuation etc (with default charset_table), so a document say '... One! (two?) ... ' WOULD still match MATCH('"one two"'). SQL like would NOT ignore that.
You could change sphinx to index more punctuation (via charset_table) to get a closer to substring match.
So SELECT * FROM index WHERE MATCH('"*$keyword*"') is possibly the closest sphinx query to your original (ie a substring match). Just as long as you aware of the differences. Also there is MySQL Collations to consider, they not exactly same as charset_table.
(frankly while this is correct. I wonder, if a bit OTT. If you just have a textual corpus you want to search, you could index it as normal. Then run queries though CALL KEYWORDS(), to get idea if the query is a valid word in the index (ie just tells you how many times given words appear in index). Can then run your algorithm to fix the mistakes)
As a side note sphinx does have a built in suggest system
http://sphinxsearch.com/blog/2016/10/03/2-3-2-feature-built-in-suggests/

Related

Good search term in Sphinx

Using Rails 3.2, Sphinx Search and Thinking Sphinx gem. I have tried wrapping % and *, and wildcard, in the keywords in my controller, but can't achieve what I want:
Keywords to search:
world trade
worldtrade
worl trade
trade world
trad worl
Expected matched search results:
World Trade Center
How should I format the keywords in my controller so that I get the expected search result with the different keywords as shown above?
For Sphinx, the wildcard character is * - so there's no point using %.
If you want every query to have wildcards added to it, use :star => true in your Thinking Sphinx search call.
If you want to match on parts of words, you need to enable either min_prefix_len or min_infix_len. The default is 0 (disabled), but given your examples, perhaps 4 is a good setting? The lower it is, the larger your index files get (and so searches may get slower as well), so perhaps don't set it to 1 unless you need to.
All of the above points won't make 'worldtrade' match, because you're providing one long word instead of two separate ones. You could look into adding a wordforms file to your Sphinx configuration, but that may be overkill in this case, as you'd want to cover each specific case.

Create Unique Relationship is taking much amount of time

START names = node(*),
target=node:node_auto_index(target_name="TARGET_1")
MATCH names
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
Iam consisting of nearly 1,80,000 names nodes, i had iterated the above process to create unique relationships above 100 times by changing the target. its taking too much amount of time.How can i resolve it..
i build the query with java and iterated.iam using neo4j 2.0.0.5 and java 1.7 .
I edited your cypher query because I think I understand it, but I can barely read the rest of your question. If you edit it with white spaces and punctuation it might be easier to understand what you are trying to do. Until then, here are some thoughts about your query being slow.
You bind all the nodes in the graph, that's typically pretty slow.
You bind all the nodes in the graph twice. First you bind universally in your start clause: names=node(*), and then you bind universally in your match clause: MATCH names, and only then you limit your pattern. I don't quite know what the Cypher engine makes of this (possibly it gets a migraine and goes off to make a pot of coffee). It's unnecessary, you can at least drop the names=node(*) from your start clause. Or drop the match clause, I suppose that could work too, since you don't really do anything there, and you will still need a start clause for as long as you use legacy indexing.
You are using Neo4j 2.x, but you use legacy indexing instead of labels, at least in this query. Without knowing your data and model it's hard to know what the difference would be for performance, but it would certainly make it much easier to write (and read) your queries. So, that's a different kind of slow. It's likely that if you had labels and label indices, the query performance would improve.
So, first try removing one of the universal bindings of nodes, then use the 2.x schema tools to structure your data. You should be able to write queries like
MATCH target:Target
WHERE target.target_name="TARGET_1"
WITH target
MATCH names:Name
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
I have no idea if such a query would be fast on your data, however. If you put the "Name" label on all your nodes, then MATCH names:Name will still bind all nodes in the database, so it'll probably still be slow.
P.S. The relationships you create have a TYPE called contains, and you give them a property called type with value declared. Maybe you have a good reason, but that's potentially very confusing.
Edit:
Reading through your question and my answer again I no longer think that I understand even your cypher query. (Why are you returning both the bound nodes and properties of those nodes?) Please consider posting sample data on console.neo4j.org and explain in more detail what your model looks like and what you are trying to do. Let me know if my answer meets your question at all or I'll consider removing it.

elasticsearch nGram/edgengram partial match?

I'm trying to make partial search working, a search for
"sw"
"swe"
"swed"
should match "Sweden"
I looked around and just can't get it to work
Rails Code
I'm using
this code from the Tire repo as templatecode.
whole words still match!
I have reindex and also tried using the edgengram filter.
I'm not a Ruby developper but found this article useful:
http://dev.af83.com/2012/01/19/autocomplete-with-tire.html
Like it sais:
Most databases handle that feature with a filter (with LIKE keyword in SQL, regular expression search with mongoDB). The strategy is simple: iterate on all results and keep only words which match the filter. This is brutal and hurts the hard drive. Elastic Search can do it too, with the prefix query.
With a small index, it plays well. For large indexes it will be more slow and painful.
You said:
whole words still match!
And what do you expect? Is "Sweden" not supposed to match "Sweden" but only "Swe", "Swed" or Swede" ?
Because your query on the field is also analyzed
Use the edgengram token filter. That will get you what you're looking for.

How to get a search ranking based on multiple factors in sphinx?

Hello stackoverflow folks,
We got a Rails project which is growing and growing and we now get first performance problems on the search, because we don't know how to utilize sphinx properly for our needs.
We have search queries like "Java PHP Software developer". Our problem is now the ranking should work with multiple things.
As search fields we have tag list, description and title.
If one of the terms is inside of one of the fields it should get for example 2 points. More Points if its in more fields, but not multiple points if it is in the same field more than once.
Next Problem is I have a big file with synonyms for which should also be checked. It looks like this:
Java > Java
Java-EE > Java
...
So if Java-EE is found it should get some points too but with a penalty for being a synonym.
Maximum amount of points would be 5 as in 5 stars which get displayed.
Any speedy solution would be nice because at the moment it's done in plain ruby and it gets slow, because we cant rank properly in sphinx.
If there is a solution with another search engine that would also be very nice, as it could be changed.
Thanks in advance for all efforts. All spelling corrections and questions to clear the question are welcome.
Most of the performance issues can be solved by changing the way you use sphinx. First you need to address how you index the data in sphinx. Doing some processing during while indexing will make the search quicker and the results more relevant. Second, tackle the search terms and last but not least, decide on the ranking algorithm to use.
I am going to use the "title" field as an example, but the logic can be replicated for all fields.
Indexing
Add two fields to sphinx ("title" and "title_synonyms"). For each record in the database do the following :-
Perform a DISTINCT on the words to remove duplicates ("Ruby Developer / Java Developer" will become "Ruby Developer / Java". This will stop records from getting two scores for duplicates when searching. This goes in to "title"
Take the DISTINCT title from above and REPLACE all the words with their expanded synonym equivalents. I would suggest putting the synonyms in the DB to make the expansion easier. The text would then become "Ruby Developer / Java-EE". Each word must be replaced with all the synonyms. If Java has two synonyms, they both must be in the field. This goes into "title_synonyms"
Searching
Because there are now two fields in sphinx we can give them each a different weight; "title" can get a weight of "10" and "title_synonyms" a weight of "3". That means a record has to match 4 synonyms before it ranks higher than one with the original title. You can play around with the weights to suit your needs.
Lets assume a user was searching for "Java Developer". For the search phrase do the following :-
Remove duplicate words
Get synonyms for each word in the search phrase
Set Matching Mode in Sphinx to SPH_MATCH_EXTENDED
The above rules will mean the search in sphinx looks like this :-
#title "Java Developer" | #title_synonyms "Java-EE"
If you want to rank exact matches higher than lexemes, the search query would look like this :-
#title ("Java Developer" | "=Java =Developer") | #title_synonyms ("Java-EE" | "=Java-EE")
You will need to use SPH_RANK_PROXIMITY_BM25 or SPH_RANK_SPH04 to make this work properly though.
Ranking
You can try any of the built in ranking algorithms to see what the results look like. I recommend SPH_RANK_MATCHANY or SPH_RANK_WORDCOUNT as a start.
For Proximity and exact match ranking use SPH_RANK_PROXIMITY_BM25, SPH_RANK_SPH04 or SPH_RANK_EXPR where you can use your own algorithm.
Conclusion
You should now have a search that is both fast and accurate. Very little work has to be done by your Ruby application and most of the work is done inside sphinx (where it should be).
Hope this helps...
This performance problem is an algorithm problem.
If you cannot express the problem in a way to utilize a backend tool, like sphinx or the database engine, then you are doing the processing in ruby, and that's easy to have a performance problem.
First, do as much as you can with sphinx (or whatever other search engine) and the database as you can. The more pre-digested the data coming into ruby, the less you have to do in ruby code, and that will likely be faster, since databases have been highly optimized over the last half century.
So, for example, run sphinx on the key words. Also run sphinx on the synonyms. Limit all the answers to the top results, and merge the results. That way your ruby code will be limited to the likely high results instead of having to consider the whole database of entries.
Once in ruby, the most important thing is to avoid high order algorithms, that is, make sure you are using a low order algorithm.
As you process your raw data, if you hold your top results in an array and try to sort or scan the array, you are going to have an N-squared order. That is, your order will be the product of the number of raw entries and the number of elements you keep in your array.
The best algorithms for your problem are a priority queue implemented by a heap like container, or a b-tree. Both have N-log-N order (N times the log of N), or the number of raw data records time the log of the number of items you will keep in your container.
A heap is a binary tree, where each node in the tree (not just the leaves but each node) has a rated record. The nodes below each record all have lower ranks. This is called the heap condition.
There are algorithms for adding elements, taking the top ranked element out, and replacing the lowest ranked element which maintain the heap condition. Look up binary heap in the wikipedia.
Let's say your site will display the top 100 ranked results. Maintain a help where the root is the lowest ranked. Populate the heap by adding the first 100 raw records you are processing.
Now for record 101 and after, compare its rank with the root. If the new record is ranked higher, use the delete algorithm to reduce your heap to 99 nodes (which will remove the lowest ranked record in the heap) and add your new record to the heap.
Once you have gone through all your records, you will have the top 100 ranked results. The heap delete algorithm will pull them out in reverse order.

Lucene partial word matching

Lucene does not support it out of the box, so I need some help building my query.
Lets say I have the document with a field value "Develop"
I would like this document to be returned for the searches "Dev" and "lop".
Maybe creating two queries?
"*keyword"
and
"keyword*"
and
"keyword"
?
How would you go about doing this with multiple words? Would you split the sentence/search into a words list and do the previous example for each word?
What you're asking is if I understand you correctly not feasible on any large scale search engine.
Lucene creates an index over keywords using term-document matrix and inverted-file techniques (see links at the bottom). A fully fledged string matching might be very nice to have, but it does not scale: you will never be able to query a decently sized index (say more than a couple of dozen/hundreds of documents) in an acceptable time.
Still, here are two ideas that might help...
Syllable tokenization
To come back to your example with 'Develop'. As long as you are happy with letting users search for syllables I guess you can do something.
You would have to create use tokenizer that splits up words in your indexed according to their syllables and create a database index over the syllables. (I am not sure there are built in tokenizers for the English language that can do that and writing one on your own might be tricky...)
An important thing to note:
If you would index the full words AND the seperate syllables the size of your index will be much larger than if you only index one of the two.
However I would not suggest to index only syllables. If you want to also allow your users to search for the full word 'Develop' (which I guess you want) this would result in two queries with a logical and between them, namely <'dev' AND 'lop'>. Although Lucene supports such logical constructs in queries they are very expensive. I have personally had some trouble in the past using logical queries in Lucene.
Stemming
Another way to somehow arrive at what you're trying could be to use a brutal form of word stemming (http://en.wikipedia.org/wiki/Stemming) that stems words to their first syllable. (This would allow to search for 'dev' but not for 'lop'...)
Again, I don't think such a word stem feature is already in Lucene. Writing one for yourself will be a pain and involve working with/importing huge dictionaries.
Links
These might be looking into if you don't know about search engine internals:
http://en.wikipedia.org/wiki/Index_%28search_engine%29
http://en.wikipedia.org/wiki/Vector_space_model
http://en.wikipedia.org/wiki/Inverted_file
http://en.wikipedia.org/wiki/Term-document_matrix
http://en.wikipedia.org/wiki/Tf-idf

Resources