Remove excess junk words from string or array of strings - ruby-on-rails

I have millions of arrays that each contain about five strings. I am trying to remove all of the "junk words" (for lack of a better description) from the arrays: articles, conjunctions, and prepositions such as "to", "and", "or", "the", "a" and so on.
For example, one of my arrays has these six strings:
"14000"
"Things"
"to"
"Be"
"Happy"
"About"
I want to remove the "to" from the array.
One solution is to do:
excess_words = ["to","and","or","the","a"]
cleaned_array = dirty_array.reject {|term| excess_words.include? term}
But I am hoping to avoid manually typing every excess word. Does anyone know of a Rails function or helper that would help in this process? Or perhaps an array of "junk words" already written?

Dealing with stopwords is easy, but I'd suggest you do it BEFORE you split the string into its component words.
Building a fairly simple regular expression can make short work of the words:
STOPWORDS = /\b(?:#{ %w[to and or the a].join('|') })\b/i
# => /\b(?:to|and|or|the|a)\b/i
clean_string = 'to into and sandbar or forest the thesis a algebra'.gsub(STOPWORDS, '')
# => " into sandbar forest thesis algebra"
clean_string.split
# => ["into", "sandbar", "forest", "thesis", "algebra"]
How do you handle them if you get them already split? I'd join(' ') the array to turn it back into a string, then run the above code, which returns the array again.
incoming_array = [
  "14000",
  "Things",
  "to",
  "Be",
  "Happy",
  "About",
]
STOPWORDS = /\b(?:#{ %w[to and or the a].join('|') })\b/i
# => /\b(?:to|and|or|the|a)\b/i
incoming_array = incoming_array.join(' ').gsub(STOPWORDS, '').split
# => ["14000", "Things", "Be", "Happy", "About"]
You could try to use Array's set operations, but you'll run afoul of the case sensitivity of the words, forcing you to iterate over the stopwords and the arrays, which will run a LOT slower.
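Here's a quick sketch of that pitfall (my illustration, not from the original answer): plain array subtraction is case-sensitive, so capitalized words slip through, and the fix needs a downcase on every comparison:
%w[14000 Things To Be Happy About] - %w[to and or the a]
# => ["14000", "Things", "To", "Be", "Happy", "About"]   ("To" survives)
%w[14000 Things To Be Happy About].reject { |w| %w[to and or the a].include?(w.downcase) }
# => ["14000", "Things", "Be", "Happy", "About"]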
Take a look at these two answers for some added tips on how you can build very powerful patterns making it easy to match thousands of strings:
"How do I ignore file types in a web crawler?"
"Is there an efficient way to perform hundreds of text substitutions in Ruby?"

All you need is a list of English stopwords. Search for 'english stopwords list' and you'll find several ready-made ones.
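A minimal sketch of using such a list, assuming you've saved it one word per line as stopwords.txt (the filename is my assumption):
require 'set'

# Load the list once and keep it in a Set for fast lookups
STOPWORDS = File.readlines('stopwords.txt').map { |w| w.chomp.downcase }.to_set

cleaned_array = dirty_array.reject { |term| STOPWORDS.include?(term.downcase) }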

Related

Suppress delimiters in Ruby's String#split

I'm importing data from old spreadsheets into a database using rails.
I have one column that contains a list on each row, that are sometimes formatted as
first, second
and other times like this
third and fourth
So I wanted to split up this string into an array, delimiting either with a comma or with the word "and". I tried
my_string.split /\s?(\,|and)\s?/
Unfortunately, as the docs say:
If pattern contains groups, the respective matches will be returned in the array as well.
Which means that I get back an array that looks like
[
[0] "first"
[1] ", "
[2] "second"
]
Obviously only the zeroth and second elements are useful to me. What do you recommend as the neatest way of achieving what I'm trying to do?
You can instruct the regexp not to capture the group by using ?:.
my_string.split(/\s?(?:\,|and)\s?/)
# => ["first", "second"]
As a side note, regarding:
into a database using rails.
Please note this has nothing to do with Rails; that's Ruby.

Search term identification

I am trying to build a small analytics plugin for my search. I want to isolate the useful search terms from all the searches done.
For example:
search: "where do i register for charms class"
search terms: "register", "charms class"
I know this is not possible without the program having the context of our whole data. But is there something I could use to achieve partial results?
What you can do is break the string into an array of words:
keywords = "where do i register for charms class".split(" ")
#=> ["where", "do", "i", "register", "for", "charms", "class"]
Then you can loop through the array of keywords. This is not a perfect solution, but it should still help you.
You could put all keywords into an array:
keywords = ['some keyword', 'another keyword']
string = 'My string with some keyword'
keywords.none? { |keyword| string.include?(keyword) } # => false ('some keyword' is in the string)
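If you want the matching keywords themselves rather than a yes/no answer, select works the same way:
keywords.select { |keyword| string.include?(keyword) }
# => ["some keyword"]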
My take on this is to create rules to eliminate useless words,
like removing articles, verbs, pronouns and other useless stuff.
You can first tokenize the string, then perform the pruning.
After that you can create rules to further extract the important tokens.
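A minimal sketch of that approach (the stopword list here is illustrative, not exhaustive):
# Illustrative stopword list; a real one would be much longer
STOPWORDS = %w[where do i for the a an is of and or to]

def extract_terms(query)
  query.downcase.split(/\W+/).reject { |token| STOPWORDS.include?(token) }
end

extract_terms("where do i register for charms class")
# => ["register", "charms", "class"]
# Note: this keeps "charms" and "class" as separate tokens; recombining
# them into the phrase "charms class" needs further rules (e.g. n-grams).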
For references:
Tokenizer
Tokenizing a String

Rails order greek characters first

I have a list of names in my DB and I need to sort them alphabetically. However, I need to show the Greek letters first, and then the Latin ones. For example, I have:
[Jale, Βήτα, Άλφα, Ben]
and I need to order it like this:
[Άλφα, Βήτα, Ben, Jale]
Any suggestions would be much appreciated :)
I like to solve these problems by playing around in irb. Here's one way you could go about finding this solution. First, we'll define our test array:
>> names = %w{Jale Βήτα Άλφα Ben}
=> ["Jale", "Βήτα", "Άλφα", "Ben"]
To solve this, let's first transform the array into 2-tuples, each containing a flag indicating whether the name is Greek or not, followed by the name itself. We want the flag to be sortable, so we'll match the name against a Latin-only regex and coerce the result to a string: a Latin name matches at position 0 and becomes "0", while a Greek name gives nil, which to_s turns into "" (and "" sorts before "0").
>> names.map{|name| [(name =~ /^\w+$/).to_s, name]}
=> [["0", "Jale"], ["", "Βήτα"], ["", "Άλφα"], ["0", "Ben"]]
Then we'll sort the 2-tuples:
>> names.map{|name| [(name =~ /^\w+$/).to_s, name]}.sort
=> [["", "Άλφα"], ["", "Βήτα"], ["0", "Ben"], ["0", "Jale"]]
We now have a sort order where we have first the greek names, then the latin names. We can shorten this into our solution:
>> names.sort_by{|name| [(name =~ /^\w+$/).to_s, name]}
=> ["Άλφα", "Βήτα", "Ben", "Jale"]
I gave one solution above. Another approach is to partition the names into greek and latin, then sort within those groups, then flatten the two arrays into one:
>> names.partition{|name| name !~ /^\w+$/}.map(&:sort).flatten
=> ["Άλφα", "Βήτα", "Ben", "Jale"]
This might be a little more elegant and understandable than my other solution, but it is less flexible. Note that name !~ /^\w+$/ will return true if the name contains non-Latin characters, i.e., is Greek.
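Since the names live in the database, one hedged option is to fetch them and sort in Ruby with the same key (the Person model and name column here are assumptions for illustration):
Person.all.sort_by { |person| [(person.name =~ /^\w+$/).to_s, person.name] }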

MongoDB Substring matching query

My application is trying to match an incoming string against documents in my Mongo Database where a field has a list of keywords. The goal is to see if the keywords are present in the string.
Here's an example:
Incoming string:
"John Doe is from Florida and is a fan of American Express"
the field for the documents in the MongoDB has a value such as:
in_words: "georgia,american express"
So, the database record has in_words or keywords separated by commas, and some of them are two or more words.
Currently, my RoR application pulls the documents, splits each document's in_words with split(','), then loops through each keyword and checks whether it is present in the string.
I really want to find a way to push this type of search into the actual database query in order to speed up the processing. I could change the in_words in the database to an array like the following:
in_words: ["georgia", "american express"]
but I'm still not sure how to query this?
To sum up, my goal is to find the person that matches an incoming string by comparing that person's list of in_words/keywords against the incoming string, and to do this query entirely in the database layer.
Thanks in advance for your suggestions
You should definitely split the in_words into an array as a first step.
Your query is still a tricky one.
Next consider using a $regex query against that array field.
Constructing the regex will be a bit hard, since you want to match any single word from your input string, or, it appears, any pair of words (how many words?). You may get some further ideas for how to construct a suitable regex from my blog entry here, where I am matching a substring of the input string against the database (the inverse of a normal LIKE operation).
You can solve this by splitting the long string into separate tokens, putting them into an array, and using an $all query to effectively find the matching keywords.
Check out this sample:
> db.splitter.insert({tags:'John Doe is from Florida and is a fan of American Express'.split(' ')})
> db.splitter.insert({tags:'John Doe is a super man'.split(' ')})
> db.splitter.insert({tags:'John cena is a dummy'.split(' ')})
> db.splitter.insert({tags:'the rock rocks'.split(' ')})
and when you query
> db.splitter.find({tags:{$all:['John','Doe']}})
it would return
> db.splitter.find({tags:{$all:['John','Doe']}})
{ "_id" : ObjectId("4f9435fa3dd9f18b05e6e330"), "tags" : [ "John", "Doe", "is", "from", "Florida", "and", "is", "a", "fan", "of", "American", "Express" ] }
{ "_id" : ObjectId("4f9436083dd9f18b05e6e331"), "tags" : [ "John", "Doe", "is", "a", "super", "man" ] }
And remember, this operation is case-sensitive.
If you are looking for a partial match, use $in instead of $all.
Also, you probably need to remove the noise words ('a', 'the', 'is', ...) before inserting, for accurate results.
I hope it is clear
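For completeness, here is a hedged sketch of the same idea from Ruby, assuming the mongo driver (2.x) and the splitter collection from the sample above:
require 'mongo'

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'test')
noise_words = %w[is a the from and of]

# Tokenize the incoming string and drop the noise words first
tokens = "John Doe is from Florida and is a fan of American Express"
  .split(/\W+/).reject { |t| noise_words.include?(t.downcase) }

# $in matches any document whose tags array shares at least one token
client[:splitter].find(tags: { '$in' => tokens }).each { |doc| p doc }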

Sphinx, Rails, ThinkSphinx and making some words matter more than others in your query

I have a list of keywords that I need to search against, using ThinkingSphinx.
Since some of them are more important than others, I need to find a way to weight those words.
So far, the only solution I came up with is to repeat the same word x times in my query to increase its relevance.
E.g.:
3 keywords, each with a level of importance: Blue (1), Recent (2), Fun (3)
I run this query
MyModel.search "Blue Recent Recent Fun Fun Fun", :match_mode => :any
Not very elegant, and quite limiting.
Does anyone have a better idea?
If you can get those keywords into a separate field, then you could weight those fields to be more important. That's about the only good approach I can think of, though.
MyModel.search "Blue Recent Fun", :field_weights => {"keywords" => 100}
Recently I've been using Sphinx extensively, and since the death of UltraSphinx, I started using Pat's great plugin (Thanks Pat, I'll buy you a coffee in Melbourne soon!)
I see a possible solution based on your original idea, but you need to make changes to the data at "index time" not "run time".
Try this:
Modify your Sphinx SQL query to replace "Blue" with "Blue Blue Blue Blue", "Recent" with "Recent Recent Recent" and "Fun" with "Fun Fun". This will magnify any occurrences of your special keywords.
e.g. SELECT REPLACE(my_text_col,"blue","blue blue blue") as my_text_col ...
You probably want to do them all at once, so just nest the replace calls.
e.g. SELECT REPLACE(REPLACE(my_text_col,"fun","fun fun"),"blue","blue blue blue") as my_text_col ...
Next, change your ranking mode to SPH_RANK_WORDCOUNT. This way maximum relevancy is given to the frequency of the keywords.
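With Thinking Sphinx, the ranking mode can be passed per query; the option below follows the v1/v2 API, so treat it as a sketch:
MyModel.search "Blue Recent Fun", :match_mode => :any, :rank_mode => :wordcount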
(Optional) Imagine you have a list of keywords related to your special keywords. For example "pale blue" relates to "blue" and "pleasant" relates to "fun". At run time, rewrite the query text to look for the target word instead. You can store these words easily in a hash, and then loop through it to make the replacements.
# Add trigger words as the key,
# and the related special keyword as the value
trigger_words = {}
trigger_words['pale blue'] = 'blue'
trigger_words['pleasant'] = 'fun'
# Now loop through each query term and see if it should be replaced.
# Note: a multi-word trigger like 'pale blue' will never match a single
# token from split; handling those needs a gsub over the whole query string.
new_query = ""
query.split.each do |word|
  word = trigger_words[word] if trigger_words.has_key?(word)
  new_query = new_query + ' ' + word
end
Now you have quasi-keyword-clustering too. Sphinx is really a fantastic technology, enjoy!
