Rails: A good search algorithm

I'm trying to return results that more closely match the search term.
My current algorithm is this:
def search_conditions(column, q)
  vars = []
  vars2 = []
  vars << q
  if q.size > 3
    (q.size - 2).times do |i|
      vars2 << q[i..(i + 2)]
      next if i == 0
      vars << q[i..-1]
      vars << q[0..(q.size - 1 - i)]
      vars << q[i % 2 == 0 ? (i / 2)..(q.size - (i / 2)) : (i / 2)..(q.size - 1 - (i / 2))] if i > 1
    end
  end
  query = "#{column} ILIKE ?"
  vars = (vars + vars2).uniq
  return [vars.map { query }.join(' OR ')] + vars.map { |x| "%#{x}%" }
end
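For reference, here is roughly how the returned array would be consumed (just a sketch; the Post model and title column are placeholders, and ILIKE assumes PostgreSQL):
# Post and "title" are hypothetical; the returned array is a standard
# conditions array: ["... ILIKE ? OR ... ILIKE ?", "%term1%", "%term2%", ...]
Post.where(search_conditions("title", params[:q]))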
If I search for "Ruby on Rails" it will build search terms in 4 ways:
1) Removing letters from the left: "uby on Rails".."ils"
2) Removing letters from the right: "Ruby on Rail".."Rub"
3) Removing letters from both left and right: "uby on Rails", "uby on Rail" ... "on "
4) Using only 3-letter slices: "Rub", "uby", "by ", "y o", " on" ... "ils"
Is it good to use these 4 ways? Are there any more?

Why are you removing these letters? Are you trying to make sure that if someone searches for 'widgets', you will also match 'widget'?
If so, what you are trying to do is called 'stemming', and it is really much more complicated than removing leading and trailing letters. You may also be interested in removing 'stop words' from your query. These are those extremely common words that are necessary to form grammatically-correct sentences, but are not very useful for search, such as 'a', 'the', etc.
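Stop-word removal itself is simple enough to sketch; here is a minimal, illustrative version (the stop-word list below is a tiny hand-rolled sample, real lists are much longer):
# A tiny, illustrative stop-word list; real lists contain hundreds of words.
STOP_WORDS = %w[a an and the of on in is are].freeze

def strip_stop_words(query)
  query.split.reject { |word| STOP_WORDS.include?(word.downcase) }.join(' ')
end

strip_stop_words("Ruby on Rails") # => "Ruby Rails"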
Getting search right is an immensely complex and difficult problem. I would suggest that you don't try to solve it yourself, and instead focus on the core purpose of your site. Perhaps you can leverage the search functionality from the Lucene project in your code. This link may also be helpful for using Lucene in Ruby on Rails.
I hope that helps; I realize that I sort of side-stepped your original question, but I really would not recommend trying to tackle this yourself.

As pkaeding says, stemming is far too complicated to try to implement yourself. However, if you want to search for similar (not exact) strings in MySQL, and your user search terms are very close to the full value of a database field (ie, you're not searching a large body of text for a word or phrase), you might want to try using the Levenshtein distance. Here is a MySQL implementation.
The Levenshtein algorithm will allow you to do "fuzzy" matching, give you a similarity score, and help you avoid installation and configuration of a search daemon, which is complicated. However, this is really only for a very specific case, not a general site search.
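If you would rather compute the distance in Ruby instead of MySQL, the standard dynamic-programming version is short; here is a generic sketch (not the linked MySQL implementation):
# Levenshtein distance: the number of single-character insertions,
# deletions, and substitutions needed to turn one string into the other.
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |a_char, i|
    curr = [i]
    b.each_char.with_index(1) do |b_char, j|
      cost = a_char == b_char ? 0 : 1
      curr << [curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

levenshtein("widgets", "widget") # => 1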

While we're all suggesting other possible solutions, check out:
Sphinx - How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles.
Thinking Sphinx - A Ruby connector between Sphinx and ActiveRecord.
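For a rough idea of what the wiring looks like, here is a minimal Thinking Sphinx sketch (index syntax varies by version; the Article model and its fields are made up for the example):
# Thinking Sphinx v3-style index definition; Article, title, and body are illustrative.
ThinkingSphinx::Index.define :article, with: :active_record do
  indexes title, sortable: true
  indexes body
end

# Full-text search backed by Sphinx instead of SQL LIKE scans.
Article.search "ruby on rails", page: 1, per_page: 20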

Related

How to do synonym matching on regular expressions in solr 5.3.1?

I'm writing a sunspot application for a large gene database. Ligands and receptors for genes are named with the normal gene name, followed by an 'l' or an 'r', respectively, so for example a ligand for the gene 'MIP2' would be called 'MIP2l'. However, I want to account for instances in which the scientists will search for them using the syntax "MIP2 ligand". How can I combine the two tokens "MIP2" and "ligand" into one, and then concat them?
I tried using the Synonym Graph Filter Factory, but my Solr is 5.3.1, so it won't load. A quick update is not feasible. I also tried the technique illustrated in this article (https://lucidworks.com/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/), but the database is too large for a simple synonyms.txt doc. I want to use regular expressions for this, but I can't without combining the two tokens into one first.
This is my current search function; the SQL lookup and weird hashing are there because it's replacing an old search function, and the SQL lookup is how I get the properly formatted data for the view.
search = GeneName.search do
  fulltext params[:search][:search_str]
  order_by(:use_name, :asc)
  order_by(:score, :desc)
end
gene_ids = []
for gene_name in search.results
  gene_ids << gene_name.gene_id unless gene_name.nil? or gene_ids.include? gene_name.gene_id
end
gene_ids_to_s = gene_ids.to_s.gsub("[", "(").gsub("]", ")")
# raise gene_ids_to_s.inspect
# genes = Gene.find_by_sql("select distinct g.id gene_id from genes g, gene_names gn where g.id = gn.gene_id and g.id in #{gene_ids_to_s} order by use_name desc") unless gene_ids_to_s == "()"
I believe I fixed it, but it's a lame workaround where I just added
@str.downcase!
@str.gsub!(" ligand", "l")
@str.gsub!(" receptor", "r")
params[:search][:search_str] = @str
before the previously mentioned code section. @str is a parsed version of params[:search][:search_str].
I realize this isn't really your question. But, it seems like here:
gene_ids = []
for gene_name in search.results
  gene_ids << gene_name.gene_id unless gene_name.nil? or gene_ids.include? gene_name.gene_id
end
You could be using map, compact, and uniq, like:
gene_ids = search.results.map do |result|
  result.gene_id unless result.nil?
end.compact.uniq
Also, I never use find_by_sql and I don't really understand what you're doing there. But, I wonder if you could use a standard ActiveRecord query there?
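If the associations line up, something along these lines might replace the raw SQL (a guess based on the query above; the Gene has_many :gene_names association and the location of the use_name column are assumptions):
# Assumes Gene has_many :gene_names and that use_name lives on gene_names,
# mirroring the raw SQL; `distinct` is `uniq` on older Rails versions.
genes = Gene.joins(:gene_names)
            .where(id: gene_ids)
            .order("gene_names.use_name DESC")
            .distinct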

Thinking Sphinx sql like query in a condition for a field

I had these queries, but now I'm trying to use sphinx, and I need to replace them, but I can't find a way to do this:
p1 = Product.where "category LIKE ?", "#{WORD}"
p2 = Product.where "category LIKE ?", "#{WORD}.%"
product_list = p1 + p2
I'm doing the search over a model named "Product" in "category" field; I need a way to replace "#" and "%" in sphinx. I have a basic idea of how to do that, but this isn't working:
Product.search conditions: {category: "('WORD' | 'WORD.*')"}
There's a few things to note.
If you want to match on prefixes, make sure you have min_prefix_len set to 1 or greater (the smaller, the more accurate, but also the slower your searches will be, and the larger your index files will get). Also, you need enable_star set to true. Both of these settings belong in config/thinking_sphinx.yml (there's examples in the docs).
Single quotes have no purpose in Sphinx searches, and will be ignored - but I don't think that's a problem with what you're trying to search with.
Full stops, however, are treated as word separators by default. You can change this with charset_table - but that means all full stops in all fields will be treated as part of words (say, at the end of sentences), so I wouldn't recommend it.
However, if full stops are ignored, then each word in the category field is indexed separately, and so without any extra settings, this should work:
Product.search conditions: {category: WORD}

Building an ILIKE clause from an array

I'm experimenting with a few concepts (actually playing and learning by building a RoR version of the 1978 database WHATSIT?).
It basically is a has_many :through structure with Subject -> Tags <- Value. I've tried to replicate a little of the command line structure by using a query text field to enter the commands. Basically things like: What's steve's phone.
Anyhow, with that interface most of the searches use ILIKE. I thought about enhancing it by allowing OR conditions using some form of an array. Something like What's steve's [son,daughter]. I got it working by creating the ILIKE clause directly, but not with string replacement.
def bracket_to_ilike(arrel, name, bracket)
  bracket_array = bracket.match(/\[([^\]]+)\]/)[1].split(',')
  like_clause = bracket_array.map { |i| "#{name} ILIKE '#{i}' " }.join(" OR ")
  arrel.where(like_clause)
end
bracket_to_ilike(tags, 'tags.name', '[son,daughter]') produces the like clause tags.name ILIKE 'son' OR tags.name ILIKE 'daughter'
And it gets the relations, but with all the talk about using the form ("tags.name ILIKE ? OR tags.name ILIKE ?", v1, v2, vN...), I thought I'd ask if anyone has any ideas on how to do that.
Creating variables on the fly is doable from what I've searched, but it's not favored. I just wondered if anyone has tried creating a method that can add a where clause with a variable number of parameters. I tried sending the where clause to the relation, but it didn't like that.
Steve
A couple of things to watch out for in your code...
What will happen when one of the elements of bracket_array contains a single quote?
What will happen if I take it a step farther and set an element to, say, "'; drop tables..."?
My first stab at refactoring your code would be to see if Arel can do it. Or Squeel, or whatever they call the "metawhere" gem these days. My second stab would be something like this:
arrel.where( [ bracket_array.size.times.map{"#{name} ILIKE ?"}.join(' OR '), *bracket_array ])
I didn't test it, but the idea is to use the size of bracket_array to generate a string of OR'd conditions, then use the splat operator to pass in all the values.
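For illustration, here is roughly what that builds for a two-element array (the values are just examples):
name = 'tags.name'
bracket_array = %w[son daughter]
[bracket_array.size.times.map { "#{name} ILIKE ?" }.join(' OR '), *bracket_array]
# => ["tags.name ILIKE ? OR tags.name ILIKE ?", "son", "daughter"]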
Thanks to Phillip for pointing me in the right direction.
I didn't know you could pass an array to a where clause - that opened up some options
I had used the splat operator a few times, but it didn't hit me that it actually creates an object (variable)
The [son,daughter] stuff was just a console exercise to see what I could do, and I wasn't sure what I was going to do with it. I ended up taking the model association and the array creation out of the picture and implemented OR searches.
def array_to_ilike(col_name, keys)
  ilike = [keys.map { |i| "#{col_name} ILIKE ? " }.join(" OR "), *keys]
  # ilike = [keys.size.times.map { "#{col_name} ILIKE ?" }.join(' OR '), *keys]
  # both work, guess it's just what you are used to.
end
I then allowed a pipe (|) character in my subject, tag, and value searches, so a WHATSIT-style question:
What's Steve's Phone Home|Work => displays home and work phone
steve phone home|work (the 's stuff is just for show)
steve son|daughter => displays children
phone james%|lori% => displays phone number for anyone whose name starts with james or lori
james%|lori% => dumps all information on anyone whose name starts with james or lori
The query then parses the command and if it encounters a | in any of the words, it will do things like:
t_ilike = array_to_ilike('tags.name', name.split("|"))
# or I actually stored it off on the initial parse
t_ilike = @tuple[:tag][:ilike] ||= ['tags.name ilike ?', tag]
Again this is just a learning exercise in creating a non-CRUD class to deal with the parsing and searching.
Steve

Concatenating two fields in a collect

Rails 2.3.5
I'm not having any luck searching for an answer on this. I know I could just write out a manual sql statement with a concat in it, but I thought I'd ask:
To load a select, I'm running a query of shift records. I'm trying to make the value in the select be the shift date followed by a space and then the shift name. I can't figure out the syntax for doing a concat of two fields in a collect. The Ruby docs make it look like plus signs and double quotes should work in a collect, but everything I try gets an "expected numeric" error from Rails.
#shift_list = [a find query].collect{|s| [s.shift_date + " " + s.shift_name, s.id]}
Thanks for any help - much appreciated.
Hard to say without knowing what s is going to be or what type s.shift_date and s.shift_name are, but if shift_date is a Date, then Date#+ expects a number of days, which would explain the "expected numeric" error. Maybe you're looking for this:
collect{|s| ["#{s.shift_date} #{s.shift_name}", s.id]}
That is pretty much the same as:
collect{|s| [s.shift_date.to_s + ' ' + s.shift_name.to_s, s.id]}
but less noisy.

Sphinx, Rails, ThinkSphinx and making some words matter more than others in your query

I have a list of keywords that I need to search against, using Thinking Sphinx.
Since some of them are more important than others, I need to find a way to weight those words.
So far, the only solution I came up with is to repeat the same word x times in my query to increase its relevance.
Eg:
3 keywords, each of them having a level of importance: Blue(1) Recent(2) Fun(3)
I run this query
MyModel.search "Blue Recent Recent Fun Fun Fun", :match_mode => :any
Not very elegant, and quite limiting.
Does anyone have a better idea?
If you can get those keywords into a separate field, then you could weight those fields to be more important. That's about the only good approach I can think of, though.
MyModel.search "Blue Recent Fun", :field_weights => {"keywords" => 100}
Recently I've been using Sphinx extensively, and since the death of UltraSphinx, I started using Pat's great plugin (Thanks Pat, I'll buy you a coffee in Melbourne soon!)
I see a possible solution based on your original idea, but you need to make changes to the data at "index time" not "run time".
Try this:
Modify your Sphinx SQL query to replace "Blue" with "Blue Blue Blue Blue", "Recent" with "Recent Recent Recent" and "Fun" with "Fun Fun". This will magnify any occurrences of your special keywords.
e.g. SELECT REPLACE(my_text_col,"blue","blue blue blue") as my_text_col ...
You probably want to do them all at once, so just nest the replace calls.
e.g. SELECT REPLACE(REPLACE(my_text_col,"fun","fun fun"),"blue","blue blue blue") as my_text_col ...
Next, change your ranking mode to SPH_RANK_WORDCOUNT. This way maximum relevancy is given to the frequency of the keywords.
(Optional) Imagine you have a list of keywords related to your special keywords. For example "pale blue" relates to "blue" and "pleasant" relates to "fun". At run time, rewrite the query text to look for the target word instead. You can store these words easily in a hash, and then loop through it to make the replacements.
# Add trigger words as the key,
# and the related special keyword as the value
trigger_words = {}
trigger_words['pale blue'] = 'blue'
trigger_words['pleasant'] = 'fun'
# Now loop through each query term and see if it should be replaced
new_query = ""
query.split.each do |word|
  word = trigger_words[word] if trigger_words.has_key?(word)
  new_query = new_query + ' ' + word
end
Now you have quasi-keyword-clustering too. Sphinx is really a fantastic technology, enjoy!
