Search term identification - ruby-on-rails

I am trying to build a small analytics plugin for my search. I want to isolate the useful search terms from all the searches performed.
For example:
search: "where do i register for charms class"
search terms: "register", "charms class"
I know this is not possible without the program having the context of our whole data set, but is there something I could use to achieve partial results?

What you can do is break the string into an array of words:
keywords = "where do i register for charms class".split(" ")
#=> ["where", "do", "i", "register", "for", "charms", "class"]
Then you can loop through the array of keywords and filter out the ones you don't need. This is not a perfect solution, but it should help.
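For example, a crude second pass could filter that array against a small stopword list (the list below is only an illustration, not an exhaustive one):

# Illustrative stopword list -- extend it for your own data.
STOPWORDS = %w[where do i for the a an is of to]

keywords = "where do i register for charms class".split(" ")
keywords.reject { |word| STOPWORDS.include?(word.downcase) }
#=> ["register", "charms", "class"]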

You could put all keywords into an array:
keywords = ['some keyword', 'another keyword']
string = 'My string with some keyword'
keywords.none? { |keyword| string.include?(keyword) } #=> true if the string contains none of the keywords

My take on this is to create rules to eliminate useless words, like removing articles, verbs, pronouns and other filler words.
You can first tokenize the string, then perform the pruning.
After that you can create rules to further extract the important tokens.
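A minimal sketch of that pipeline, assuming a hand-rolled filler-word list and the rule that adjacent surviving tokens form one phrase (which recovers "charms class" from the example above):

FILLER = %w[a an the i you we they where do for is of to and or]

def extract_terms(query)
  tokens = query.downcase.scan(/[a-z0-9']+/)   # tokenize
  tokens
    .slice_when { |a, b| FILLER.include?(a) || FILLER.include?(b) }  # cut runs at filler words
    .map { |run| run.reject { |t| FILLER.include?(t) }.join(' ') }   # prune fillers within each run
    .reject(&:empty?)
end

extract_terms("where do i register for charms class")
#=> ["register", "charms class"]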
For references:
Tokenizer
Tokenizing a String

Related

Rails - Detecting keywords in a string with exact match

This one is tricky, at least for me, as I am new to Rails.
soccer = ["football pitch", "soccer", "free kick", "penalty"]
string = "Did anyone see that free kick last night, let me get my pen!!!"
What I want to do is search for instances of keywords but with 2 main rules:
1 - Don't do partial matches, i.e. it should not match "pen" with "penalty"; it has to be a full match.
2 - Match multiple sets of words like "nice day" "sweet tooth" "three's a crowd" (max of 3)
This code works perfectly for scenario 1:
def self.check_for_keyword_match?(string, keyword_array)
  string.split.any? { |word| keyword_array.include?(word) }
end

if check_for_keyword_match?(string, soccer)
  keywords_found.push('soccer')
  # send a response saying Hey, I see you are interested in soccer.
end
In that example it would not match "pen", but it would match "penalty", which is perfect.
But I also want it to match 2-3 word sets of keywords: "free kick" should match, whereas only "free" and "kick" would match if they were written as singular keywords. "Free" is too broad, and so is "kick", but "free kick" is not broad, so it works much better at deciphering their interests.
I can change the format of the soccer array, but the string being submitted comes from a Slack post, so I can't control how it is formatted. In the actual program I have 20 or so of those arrays with keywords, but once I figure out how to do one, I can handle the rest.
For manipulating strings, regular expressions are useful.
The following code should fix your issue:
def self.check_for_keyword_match?(string, keyword_array)
  # Escape each keyword so any regex metacharacters in it are matched literally.
  keyword_array.any? { |word| Regexp.new('\b' + Regexp.escape(word) + '\b').match(string) }
end
Instead of splitting string, go through keyword_array and search the entire string for each keyword.
The regex adds the 'word boundary' anchor \b so that it will only match entire words (Rule 1; if you used include? here, then a keyword of "pen" would match "penalty").
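That also satisfies Rule 2, since each multi-word keyword is matched as one pattern against the whole string. Calling the method above:

soccer = ["football pitch", "soccer", "free kick", "penalty"]
string = "Did anyone see that free kick last night, let me get my pen!!!"

check_for_keyword_match?(string, soccer)
# => true -- "free kick" matches as a whole phrase, while the \b anchors
#    stop the keyword "pen" from matching inside "penalty"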

Remove excess junk words from string or array of strings

I have millions of arrays that each contain about five strings. I am trying to remove all of the "junk words" (for lack of a better description) from the arrays: articles and other filler words like "to", "and", "or", "the", "a" and so on.
For example, one of my arrays has these six strings:
"14000"
"Things"
"to"
"Be"
"Happy"
"About"
I want to remove the "to" from the array.
One solution is to do:
excess_words = ["to", "and", "or", "the", "a"]
cleaned_array = dirty_array.reject { |term| excess_words.include?(term) }
But I am hoping to avoid manually typing every excess word. Does anyone know of a Rails function or helper that would help in this process? Or perhaps an array of "junk words" already written?
Dealing with stopwords is easy, but I'd suggest you do it BEFORE you split the string into the component words.
Building a fairly simple regular expression can make short work of the words:
STOPWORDS = /\b(?:#{ %w[to and or the a].join('|') })\b/i
# => /\b(?:to|and|or|the|a)\b/i
clean_string = 'to into and sandbar or forest the thesis a algebra'.gsub(STOPWORDS, '')
# => " into sandbar forest thesis algebra"
clean_string.split
# => ["into", "sandbar", "forest", "thesis", "algebra"]
How do you handle them if you get them already split? I'd join(' ') the array to turn it back into a string, then run the above code, which returns the array again.
incoming_array = [
  "14000",
  "Things",
  "to",
  "Be",
  "Happy",
  "About",
]
STOPWORDS = /\b(?:#{ %w[to and or the a].join('|') })\b/i
# => /\b(?:to|and|or|the|a)\b/i
incoming_array = incoming_array.join(' ').gsub(STOPWORDS, '').split
# => ["14000", "Things", "Be", "Happy", "About"]
You could try to use Array's set operations, but you'll run afoul of the case sensitivity of the words, forcing you to iterate over the stopwords and the arrays which will run a LOT slower.
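For reference, the iteration that the answer warns about looks roughly like this; downcasing each word for the comparison sidesteps the case-sensitivity problem, but it is still slower than a single gsub pass when the stopword list grows:

stopwords = %w[to and or the a]
incoming_array.reject { |word| stopwords.include?(word.downcase) }
# => ["14000", "Things", "Be", "Happy", "About"]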
Take a look at these two answers for some added tips on how you can build very powerful patterns making it easy to match thousands of strings:
"How do I ignore file types in a web crawler?"
"Is there an efficient way to perform hundreds of text substitutions in Ruby?"
All you need is a list of English stopwords; searching for 'english stopwords list' turns up plenty of ready-made ones.

Suppress delimiters in Ruby's String#split

I'm importing data from old spreadsheets into a database using rails.
I have one column that contains a list on each row; the lists are sometimes formatted as
first, second
and other times like this
third and fourth
So I wanted to split up this string into an array, delimiting either with a comma or with the word "and". I tried
my_string.split(/\s?(,|and)\s?/)
Unfortunately, as the docs say:
If pattern contains groups, the respective matches will be returned in the array as well.
Which means that I get back an array that looks like
[
  [0] "first",
  [1] ", ",
  [2] "second"
]
Obviously only the zeroth and second elements are useful to me. What do you recommend as the neatest way of achieving what I'm trying to do?
You can instruct the regexp not to capture the group by using ?::
my_string.split(/\s?(?:,|and)\s?/)
# => ["first", "second"]
As an aside, regarding "into a database using rails":
please note this has nothing to do with Rails; that's Ruby.

MongoDB Substring matching query

My application is trying to match an incoming string against documents in my Mongo Database where a field has a list of keywords. The goal is to see if the keywords are present in the string.
Here's an example:
Incoming string:
"John Doe is from Florida and is a fan of American Express"
the field for the documents in the MongoDB has a value such as:
in_words: "georgia,american express"
So, the database record has in_words (keywords) separated by commas, and some of them are two or more words.
Currently, my RoR application pulls the documents, splits each one's in_words on ',', then loops through each keyword and sees if it is present in the string.
I really want to push this type of search into the actual database query in order to speed up the processing. I could change in_words in the database to an array like this:
in_words: ["georgia", "american express"]
but I'm still not sure how to query this?
To Sum up, my goal is to find the person that matches an incoming string by comparing a list of inwords/keywords for that person against the incoming string. And do this query all in the database layer.
Thanks in advance for your suggestions
You should definitely split the in_words into an array as a first step.
Your query is still a tricky one.
Next consider using a $regex query against that array field.
Constructing the regex will be a bit hard, since you want to match any single word from your input string or, it appears, any pair of words (how many words?). You may get some further ideas for how to construct a suitable regex from my blog entry here, where I am matching a substring of the input string against the database (the inverse of a normal LIKE operation).
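A sketch of that idea from the Rails side with the Ruby Mongo driver (the collection name is assumed; note the alternation only matches single-word keywords, so multi-word entries like "american express" would need extra handling, such as also generating adjacent word pairs):

require 'mongo'

client   = Mongo::Client.new(['127.0.0.1:27017'], database: 'myapp')
incoming = "John Doe is from Florida and is a fan of American Express"

# Build one escaped alternation of all incoming words...
words   = incoming.downcase.split(/\W+/).map { |w| Regexp.escape(w) }
pattern = /\A(?:#{words.join('|')})\z/

# ...and match it against each element of the in_words array field.
client[:people].find(in_words: pattern)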
You can solve this by splitting the long string into separate tokens, putting them into an array, and using an $all query to find the matching keywords efficiently.
Check out the sample
> db.splitter.insert({tags:'John Doe is from Florida and is a fan of American Express'.split(' ')})
> db.splitter.insert({tags:'John Doe is a super man'.split(' ')})
> db.splitter.insert({tags:'John cena is a dummy'.split(' ')})
> db.splitter.insert({tags:'the rock rocks'.split(' ')})
and when you query
> db.splitter.find({tags:{$all:['John','Doe']}})
it would return
> db.splitter.find({tags:{$all:['John','Doe']}})
{ "_id" : ObjectId("4f9435fa3dd9f18b05e6e330"), "tags" : [ "John", "Doe", "is", "from", "Florida", "and", "is", "a", "fan", "of", "American", "Express" ] }
{ "_id" : ObjectId("4f9436083dd9f18b05e6e331"), "tags" : [ "John", "Doe", "is", "a", "super", "man" ] }
And remember, this operation is case-sensitive.
If you are looking for a partial match, use $in instead of $all.
Also, you probably need to remove the noise words ('a', 'the', 'is', ...) before insert for accurate results.
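The same thing from Ruby with the Mongo driver, as a sketch (database and collection names are just placeholders):

require 'mongo'

client   = Mongo::Client.new(['127.0.0.1:27017'], database: 'test')
splitter = client[:splitter]

splitter.insert_one(tags: 'John Doe is from Florida and is a fan of American Express'.split(' '))
splitter.insert_one(tags: 'John Doe is a super man'.split(' '))

# Documents whose tags contain BOTH words (case-sensitive, as noted above):
splitter.find(tags: { '$all' => %w[John Doe] }).each { |doc| p doc['tags'] }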
I hope it is clear

Sphinx, Rails, Thinking Sphinx and making some words matter more than others in your query

I have a list of keywords that I need to search against, using Thinking Sphinx.
Some of them are more important than others, so I need to find a way to weight those words.
So far, the only solution I came up with is to repeat the same word x times in my query to increase its relevance.
Eg:
3 keywords, each with a level of importance: Blue (1), Recent (2), Fun (3)
I run this query
MyModel.search "Blue Recent Recent Fun Fun Fun", :match_mode => :any
Not very elegant, and quite limiting.
Does anyone have a better idea?
If you can get those keywords into a separate field, then you could weight those fields to be more important. That's about the only good approach I can think of, though.
MyModel.search "Blue Recent Fun", :field_weights => {"keywords" => 100}
Recently I've been using Sphinx extensively, and since the death of UltraSphinx, I started using Pat's great plugin (Thanks Pat, I'll buy you a coffee in Melbourne soon!)
I see a possible solution based on your original idea, but you need to make changes to the data at "index time" not "run time".
Try this:
Modify your Sphinx SQL query to replace "Recent" with "Recent Recent" and "Fun" with "Fun Fun Fun", mirroring each keyword's importance. This will magnify any occurrences of your special keywords.
e.g. SELECT REPLACE(my_text_col, 'fun', 'fun fun fun') AS my_text_col ...
You probably want to do them all at once, so just nest the REPLACE calls.
e.g. SELECT REPLACE(REPLACE(my_text_col, 'fun', 'fun fun fun'), 'recent', 'recent recent') AS my_text_col ...
Next, change your ranking mode to SPH_RANK_WORDCOUNT. This way maximum relevancy is given to the frequency of the keywords.
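With Thinking Sphinx's classic option style (the one the question already uses), that ranking mode can be requested per query; something like:

MyModel.search "blue recent fun", :match_mode => :any, :rank_mode => :wordcount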
(Optional) Imagine you have a list of keywords related to your special keywords. For example "pale blue" relates to "blue" and "pleasant" relates to "fun". At run time, rewrite the query text to look for the target word instead. You can store these words easily in a hash, and then loop through it to make the replacements.
# Add trigger words as the key,
# and the related special keyword as the value
trigger_words = {
  'pale blue' => 'blue',
  'pleasant'  => 'fun'
}

# Replace each trigger phrase with its special keyword. A phrase-level
# gsub handles multi-word triggers like "pale blue", which a
# word-by-word split would miss.
new_query = trigger_words.reduce(query) do |q, (trigger, keyword)|
  q.gsub(trigger, keyword)
end
Now you have quasi-keyword-clustering too. Sphinx is really a fantastic technology, enjoy!
