Sphinx, Rails, ThinkSphinx and making some words matter more than others in your query - ruby-on-rails

I have a list of keywords that I need to search against using Thinking Sphinx.
Some of them are more important than others, so I need a way to weight those words.
So far, the only solution I've come up with is to repeat the same word x times in my query to increase its relevance.
Eg:
Three keywords, each with a level of importance: Blue (1), Recent (2), Fun (3).
I run this query
MyModel.search "Blue Recent Recent Fun Fun Fun", :match_mode => :any
Not very elegant, and quite limiting.
Does anyone have a better idea?

If you can get those keywords into a separate field, then you could weight those fields to be more important. That's about the only good approach I can think of, though.
MyModel.search "Blue Recent Fun", :field_weights => {"keywords" => 100}
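As a sketch of how that separate field might be defined at index time (assuming a Thinking Sphinx v3-style index file and a keywords column on the model — adjust the names to your schema):

```ruby
# app/indices/my_model_index.rb
# Assumes MyModel has a `keywords` column; the field name here must
# match the key passed to :field_weights at search time.
ThinkingSphinx::Index.define :my_model, with: :active_record do
  indexes name
  indexes keywords
end
```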

Recently I've been using Sphinx extensively, and since the death of UltraSphinx, I started using Pat's great plugin (Thanks Pat, I'll buy you a coffee in Melbourne soon!)
I see a possible solution based on your original idea, but you need to make changes to the data at "index time" not "run time".
Try this:
Modify your Sphinx SQL query to replace "Blue" with "Blue Blue Blue Blue", "Recent" with "Recent Recent Recent" and "Fun" with "Fun Fun". This will magnify any occurrences of your special keywords.
e.g. SELECT REPLACE(my_text_col,"blue","blue blue blue") as my_text_col ...
You probably want to do them all at once, so just nest the replace calls.
e.g. SELECT REPLACE(REPLACE(my_text_col,"fun","fun fun"),"blue","blue blue blue") as my_text_col ...
Next, change your ranking mode to SPH_RANK_WORDCOUNT. This way maximum relevancy is given to the frequency of the keywords.
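The effect can be sketched in plain Ruby with a toy scorer (this is not Sphinx itself, just an illustration of what the wordcount ranker rewards — occurrences of query terms — and why the index-time magnification shifts the ranking):

```ruby
# Toy word-count "ranker": score a document by how many of its words
# are query terms, which is what SPH_RANK_WORDCOUNT rewards.
def wordcount_score(document, query_terms)
  document.downcase.split.count { |w| query_terms.include?(w) }
end

plain     = "a blue and fun thing"
magnified = "a blue blue blue and fun fun thing"  # after the REPLACE trick

wordcount_score(plain, %w[blue fun])      # => 2
wordcount_score(magnified, %w[blue fun])  # => 5
```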
(Optional) Imagine you have a list of keywords related to your special keywords. For example "pale blue" relates to "blue" and "pleasant" relates to "fun". At run time, rewrite the query text to look for the target word instead. You can store these words easily in a hash, and then loop through it to make the replacements.
# Add trigger words as the key,
# and the related special keyword as the value
trigger_words = {}
trigger_words['pale blue'] = 'blue'
trigger_words['pleasant'] = 'fun'
# Replace each trigger phrase with its special keyword.
# Substituting over the whole query string (rather than word-by-word)
# lets multi-word triggers like "pale blue" match too.
new_query = query.dup
trigger_words.each do |trigger, keyword|
  new_query.gsub!(trigger, keyword)
end
Now you have quasi-keyword-clustering too. Sphinx is really a fantastic technology, enjoy!

Related

Rails - Detecting keywords in a string with exact match

This one is tricky, at least for me as I am new to rails.
soccer = ["football pitch", "soccer", "free kick", "penalty"]
string = "Did anyone see that free kick last night, let me get my pen!!!"
What I want to do is search for instances of keywords but with 2 main rules:
1 - Don't do partial matches, i.e. it should not match "pen" with "penalty"; it has to be a full match.
2 - Match multiple sets of words like "nice day" "sweet tooth" "three's a crowd" (max of 3)
This code works perfectly for scenario 1:
def self.check_for_keyword_match?(string, keyword_array)
  string.split.any? { |word| keyword_array.include?(word) }
end

if check_for_keyword_match?(string, soccer)
  soccer.to_set.freeze
  keywords_found.push('soccer')
  # send a response saying Hey, I see you are interested in soccer.
end
In that example it would not match pen but it would match penalty which is perfect.
But I also want it to match two-to-three-word keywords, i.e. "free kick" should match, whereas only "free" and "kick" would match if they were written as separate keywords. "Free" is too broad, same with "kick", but "free kick" is not broad, so it works much better at deciphering their interests.
I can change the format of the soccer array, but the string being submitted comes from a Slack post, so I can't control how it is formatted. In the actual program I have 20 or so of those arrays with keywords, but once I figure out how to do one, the rest I can handle.
For manipulating strings, Regular Expressions are useful.
The following code should fix your issue:
def self.check_for_keyword_match?(string, keyword_array)
  # Regexp.escape guards against keywords containing regex metacharacters
  keyword_array.any? { |word| Regexp.new('\b' + Regexp.escape(word) + '\b').match(string) }
end
Instead of splitting string, go through keyword_array and search the entire string for each keyword.
The regex adds a 'word boundary' modifier \b so that it will only match entire words (Rule 1, if you use include? here, then a keyword of "pen" will match "penalty").
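A quick sanity check of that approach against the strings from the question (the /i flag is my addition here, on the assumption that Slack posts vary in capitalization):

```ruby
soccer = ["football pitch", "soccer", "free kick", "penalty"]

def check_for_keyword_match?(string, keyword_array)
  # \b enforces whole-word matches; Regexp.escape guards against
  # keywords containing regex metacharacters; /i ignores case.
  keyword_array.any? { |word| /\b#{Regexp.escape(word)}\b/i.match?(string) }
end

string = "Did anyone see that free kick last night, let me get my pen!!!"
check_for_keyword_match?(string, soccer)               # => true  ("free kick" matches as a phrase)
check_for_keyword_match?("nice pen you have", soccer)  # => false ("pen" does not match "penalty")
```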

Remove excess junk words from string or array of strings

I have millions of arrays that each contain about five strings. I am trying to remove all of the "junk words" (for lack of a better description) from the arrays, such as all articles of speech, words like "to", "and", "or", "the", "a" and so on.
For example, one of my arrays has these six strings:
"14000"
"Things"
"to"
"Be"
"Happy"
"About"
I want to remove the "to" from the array.
One solution is to do:
excess_words = ["to","and","or","the","a"]
cleaned_array = dirty_array.reject {|term| excess_words.include? term}
But I am hoping to avoid manually typing every excess word. Does anyone know of a Rails function or helper that would help in this process? Or perhaps an array of "junk words" already written?
Dealing with stopwords is easy, but I'd suggest you do it BEFORE you split the string into the component words.
Building a fairly simple regular expression can make short work of the words:
STOPWORDS = /\b(?:#{ %w[to and or the a].join('|') })\b/i
# => /\b(?:to|and|or|the|a)\b/i
clean_string = 'to into and sandbar or forest the thesis a algebra'.gsub(STOPWORDS, '')
# => " into sandbar forest thesis algebra"
clean_string.split
# => ["into", "sandbar", "forest", "thesis", "algebra"]
How do you handle them if you get them already split? I'd join(' ') the array to turn it back into a string, then run the above code, which returns the array again.
incoming_array = [
  "14000",
  "Things",
  "to",
  "Be",
  "Happy",
  "About",
]
STOPWORDS = /\b(?:#{ %w[to and or the a].join('|') })\b/i
# => /\b(?:to|and|or|the|a)\b/i
incoming_array = incoming_array.join(' ').gsub(STOPWORDS, '').split
# => ["14000", "Things", "Be", "Happy", "About"]
You could try to use Array's set operations, but you'll run afoul of the case sensitivity of the words, forcing you to iterate over the stopwords and the arrays which will run a LOT slower.
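For comparison, the per-word iteration warned about above looks something like this; it works once each word is downcased, but it does a Ruby-level membership test per word instead of one regex pass over the string:

```ruby
STOPWORDS_LIST = %w[to and or the a]

incoming_array = ["14000", "Things", "To", "Be", "Happy", "About"]

# Downcase each word before comparing, since "To" won't equal "to"
# in a plain Array#include? or set-difference check.
cleaned = incoming_array.reject { |word| STOPWORDS_LIST.include?(word.downcase) }
# => ["14000", "Things", "Be", "Happy", "About"]
```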
Take a look at these two answers for some added tips on how you can build very powerful patterns making it easy to match thousands of strings:
"How do I ignore file types in a web crawler?"
"Is there an efficient way to perform hundreds of text substitutions in Ruby?"
All you need is a list of English stopwords; search for 'english stopwords list' to find one.

Thinking Sphinx sql like query in a condition for a field

I had these queries, but now I'm trying to use sphinx, and I need to replace them, but I can't find a way to do this:
p1 = Product.where "category LIKE ?", "#{WORD}"
p2 = Product.where "category LIKE ?", "#{WORD}.%"
product_list = p1 + p2
I'm doing the search over a model named "Product" in "category" field; I need a way to replace "#" and "%" in sphinx. I have a basic idea of how to do that, but this isn't working:
Product.search conditions: {category: "('WORD' | 'WORD.*')"}
There are a few things to note.
If you want to match on prefixes, make sure you have min_prefix_len set to 1 or greater (the smaller, the more accurate, but also the slower your searches will be, and the larger your index files will get). Also, you need enable_star set to true. Both of these settings belong in config/thinking_sphinx.yml (there's examples in the docs).
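For reference, both settings go in config/thinking_sphinx.yml, per environment; a minimal sketch (the values are illustrative — a smaller min_prefix_len is more accurate but costs index size and search speed):

```yaml
# config/thinking_sphinx.yml (illustrative values)
development:
  min_prefix_len: 3
  enable_star: true
production:
  min_prefix_len: 3
  enable_star: true
```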
Single quotes have no purpose in Sphinx searches, and will be ignored - but I don't think that's a problem with what you're trying to search with.
Full stops, however, are treated as word separators by default. You can change this with charset_table - but that means all full stops in all fields will be treated as part of words (say, at the end of sentences), so I wouldn't recommend it.
However, if full stops are ignored, then each word in the category field is indexed separately, and so without any extra settings, this should work:
Product.search conditions: {category: WORD}

Ruby On Rails searching database for word

I am new to rails. I am trying to search a database in MySQL where the term I am searching may be one word in the column string. For example if the cell was "this is a very lovely day" then I would like to be able to call that object by searching for the word 'lovely'
Thank you.
You need to do a LIKE query (i.e. foo LIKE '%bar%'). The % represents a wildcard operator: 'bar%' would be "starts with bar" and '%bar%' would be "contains bar". Note that "contains" searches cannot use column indexes and will be slow.
Suppose you had a Day class with the attribute description. In that case, you would do
Day.where("description LIKE '%lovely%'")
Or, by using Arel:
days = Day.arel_table
Day.where(days[:description].matches("%lovely%"))

Rails: A good search algorithm

I'm trying to return results that match the search term more closely.
My current algorithm is this:
def search_conditions(column, q)
  vars = []
  vars2 = []
  vars << q
  if q.size > 3
    (q.size-2).times do |i|
      vars2 << q[i..(i+2)]
      next if i == 0
      vars << q[i..-1]
      vars << q[0..(q.size-1-i)]
      vars << q[i % 2 == 0 ? (i/2)..(q.size-(i/2)) : (i/2)..(q.size-1-(i/2))] if i > 1
    end
  end
  query = "#{column} ILIKE ?"
  vars = (vars+vars2).uniq
  return [vars.map { query }.join(' OR ')] + vars.map { |x| "%#{x}%" }
end
If I search for "Ruby on Rails" it will generate four kinds of search patterns.
1) Removing the left letters "uby on Rails".."ils"
2) Removing the right letters "Ruby on Rail".."Rub"
3) Removing left and right letters "uby on Rails", "uby on Rail" ... "on "
4) Using only 3 letters "Rub", "uby", "by ", "y o", " on" ... "ils"
Is it good to use these four ways? Are there any more?
Why are you removing these letters? Are you trying to make sure that if someone searches for 'widgets', you will also match 'widget'?
If so, what you are trying to do is called 'stemming', and it is really much more complicated than removing leading and trailing letters. You may also be interested in removing 'stop words' from your query. These are those extremely common words that are necessary to form grammatically-correct sentences, but are not very useful for search, such as 'a', 'the', etc.
Getting search right is an immensely complex and difficult problem. I would suggest that you don't try to solve it yourself, and instead focus on the core purpose of your site. Perhaps you can leverage the search functionality from the Lucene project in your code. This link may also be helpful for using Lucene in Ruby on Rails.
I hope that helps; I realize that I sort of side-stepped your original question, but I really would not recommend trying to tackle this yourself.
As pkaeding says, stemming is far too complicated to try to implement yourself. However, if you want to search for similar (not exact) strings in MySQL, and your user search terms are very close to the full value of a database field (ie, you're not searching a large body of text for a word or phrase), you might want to try using the Levenshtein distance. Here is a MySQL implementation.
The Levenshtein algorithm will allow you to do "fuzzy" matching, give you a similarity score, and help you avoid installation and configuration of a search daemon, which is complicated. However, this is really only for a very specific case, not a general site search.
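For a sense of what the distance computes, here is a minimal pure-Ruby Levenshtein implementation (a sketch for illustration; in production you'd use the MySQL implementation linked above, or a gem, rather than scoring rows in Ruby):

```ruby
# Levenshtein distance: minimum number of single-character insertions,
# deletions, and substitutions needed to turn one string into the other.
def levenshtein(a, b)
  dp = (0..b.length).to_a  # dp[j] = distance between "" and b[0...j]
  (1..a.length).each do |i|
    prev = dp[0]           # dist(a[0...i-1], "")
    dp[0] = i              # dist(a[0...i], "")
    (1..b.length).each do |j|
      cur  = dp[j]         # dist(a[0...i-1], b[0...j])
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      dp[j] = [dp[j] + 1, dp[j - 1] + 1, prev + cost].min
      prev = cur
    end
  end
  dp[b.length]
end

levenshtein("rails", "railz")    # => 1
levenshtein("kitten", "sitting") # => 3
```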
While we're all suggesting other possible solutions, check out:
Sphinx - How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles.
Thinking Sphinx - A Ruby connector between Sphinx and ActiveRecord.
