MongoDB Substring matching query - ruby-on-rails

My application is trying to match an incoming string against documents in my Mongo Database where a field has a list of keywords. The goal is to see if the keywords are present in the string.
Here's an example:
Incoming string:
"John Doe is from Florida and is a fan of American Express"
the field for the documents in the MongoDB has a value such as:
in_words: "georgia,american express"
So, the database record has in_words or keywords separated by commas, and some of them are two or more words.
Currently, my RoR application pulls the documents, runs split(',') on each document's in_words, then loops through each keyword and checks whether it is present in the string.
I really want to find a way to push this type of search into the actual database query in order to speed up the processing. I could change the in_words in the database to an array like this:
in_words: ["georgia", "american express"]
but I'm still not sure how I would query this.
To sum up, my goal is to find the person that matches an incoming string by comparing that person's list of in_words/keywords against the incoming string, and to do this query entirely in the database layer.
Thanks in advance for your suggestions

You should definitely split the in_words into an array as a first step.
Your query is still a tricky one.
Next consider using a $regex query against that array field.
Constructing the regex will be a bit hard, since you want to match any single word from your input string or, it appears, any pair of words (or more?). You may get some further ideas for how to construct a suitable regex from my blog entry here, where I match a substring of the input string against the database (the inverse of a normal LIKE operation).
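As a rough sketch of that construction (all names here are illustrative, and the Mongoid usage at the end is an assumption): build one anchored, case-insensitive regex out of the input string's single words and adjacent word pairs, then match it against the array field — MongoDB applies a regex condition to each element of an array field.

```ruby
# Sketch: build a case-insensitive, anchored regex that matches any single
# word or any adjacent word pair from the incoming string.
input = "John Doe is from Florida and is a fan of American Express"
words = input.split
pairs = words.each_cons(2).map { |a, b| "#{a} #{b}" }
tokens = (pairs + words).map { |t| Regexp.escape(t) }
pattern = Regexp.new("\\A(?:#{tokens.join('|')})\\z", Regexp::IGNORECASE)

pattern.match?("american express") # => true (matches the pair "American Express")
pattern.match?("georgia")          # => false (not in the input string)
```

With Mongoid this would hypothetically be used as `Person.where(in_words: pattern)`; since in_words is an array, MongoDB tests the regex against each element.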

You can solve this by splitting the long string into separate tokens, storing them in an array, and using an $all query to find the matching keywords.
Check out this sample:
> db.splitter.insert({tags:'John Doe is from Florida and is a fan of American Express'.split(' ')})
> db.splitter.insert({tags:'John Doe is a super man'.split(' ')})
> db.splitter.insert({tags:'John cena is a dummy'.split(' ')})
> db.splitter.insert({tags:'the rock rocks'.split(' ')})
and when you query
> db.splitter.find({tags:{$all:['John','Doe']}})
it would return
> db.splitter.find({tags:{$all:['John','Doe']}})
{ "_id" : ObjectId("4f9435fa3dd9f18b05e6e330"), "tags" : [ "John", "Doe", "is", "from", "Florida", "and", "is", "a", "fan", "of", "American", "Express" ] }
{ "_id" : ObjectId("4f9436083dd9f18b05e6e331"), "tags" : [ "John", "Doe", "is", "a", "super", "man" ] }
And remember, this operation is case-sensitive.
If you are looking for a partial match, use $in instead of $all.
Also, you probably need to remove the noise words ('a', 'the', 'is', ...) before inserting, for accurate results.
I hope that's clear.
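From Ruby, the same idea looks roughly like this (a sketch — the noise-word list is a toy one, and the driver call at the end is an assumption):

```ruby
# Sketch: tokenize the incoming string, drop noise words, and build the
# same query document the mongo shell examples above use.
input = "John Doe is from Florida and is a fan of American Express"
noise = %w[a an and the is of from]
tokens = input.split(' ').reject { |w| noise.include?(w.downcase) }
# => ["John", "Doe", "Florida", "fan", "American", "Express"]
query = { 'tags' => { '$in' => tokens } }
# With the mongo Ruby driver this would be passed as collection.find(query)
```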

Related

How can I write a Rails query to find phone numbers that are saved in various formats

I'm using Rails 5.2.0
I am trying to write a query to find a phone number based off user input. However, the format of phone numbers saved in the database varies (e.g. (123)-456-7890, +1(123)-456-7890, +1 123456789, and so on). Is there any way I can format the records saved in my database in this query? I've thought of adding a second column to the table that would simply be formatted_telephone, but I have tens of thousands of records. Can I add a method in the User controller to update these records when they are fetched?
Here is what I have so far:
User.where("REGEXP_REPLACE(telephone, '[^[:digit:]]', '') ~* ?", "%#{input}%")
Right now this is still only returning phone numbers with this format: 1234567890.
Am I on the right track with this? Or is it not possible to format columns when querying?
Normally, with a where clause and a regexp, we are asking something like "find me everything that matches this regexp". But here you are asking the DB for a phone number that matches "12035551212" and want the where clause to apply a regexp to every single phone number in the table while searching for a match. I suggest you try something like this (it can be streamlined, but I'm breaking it down to make it easier to follow):
my_phone = '12035551212'
phone_arr = my_phone.split('')
#=> ["1", "2", "0", "3", "5", "5", "5", "1", "2", "1", "2"]
regx = '^\D?' + phone_arr[0] + '?' + '\D*' + phone_arr[1..-1].join('\D*') + '$'
#=> '^\D?1?\D*2\D*0\D*3\D*5\D*5\D*5\D*1\D*2\D*1\D*2$'
Now you have a regexp that matches only your phone number, regardless of format. So now you can try:
User.where('phone_number ~* ?', regx)
This asks Postgres to match your very specific regexp, based on the phone number you are searching for. It should get you what you need, but I would look at refactoring it.
In the long run I would standardize all numbers in the DB. You could add a phone_number_e164 column to users and convert every number to E.164 format using a regexp, then remove the old phone number column and rename the new one to phone_number. You would also need to add code to standardize any new numbers coming into the DB.
As a stop-gap measure you could also create a Postgres view that takes the User records and applies a regexp to each phone number to transform it to E.164 format, and access that view instead of the users table.
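A minimal sketch of that normalization step in Ruby (assuming US numbers with a default country code of 1; the method name is illustrative):

```ruby
# Sketch: strip non-digits, then prefix a default country code when the
# result looks like a bare 10-digit national number.
def normalize_to_e164(raw, default_country_code: '1')
  digits = raw.gsub(/\D/, '')
  digits = default_country_code + digits if digits.length == 10
  "+#{digits}"
end

normalize_to_e164('(123)-456-7890')   # => "+11234567890"
normalize_to_e164('+1 123-456-7890')  # => "+11234567890"
```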

How can I query the db for all items in a table where a value meets one of the values in an array in Rails?

This is probably better explained with an example. I have a documents table with a country attribute; for example, Document.first.country could return 'DE'. I have an array of country codes called eu_countries with the value ["AT", "BE", "BG", "CY", "CZ", "DE"...], and I would like to query the db and return only the documents whose country code is in the array.
Something with the same functionality as: Documents.where(country == "AT" or "BE" or "BG" or "CY" or "CZ" or "DE"...)
It's quite simple:
Document.where(country: eu_countries)
This is translated to SQL similar to this:
SELECT documents.* FROM documents WHERE documents.country IN (values)

Remove excess junk words from string or array of strings

I have millions of arrays that each contain about five strings. I am trying to remove all of the "junk words" (for lack of a better description) from the arrays, such as articles and other filler words like "to", "and", "or", "the", "a" and so on.
For example, one of my arrays has these six strings:
"14000"
"Things"
"to"
"Be"
"Happy"
"About"
I want to remove the "to" from the array.
One solution is to do:
excess_words = ["to","and","or","the","a"]
cleaned_array = dirty_array.reject {|term| excess_words.include? term}
But I am hoping to avoid manually typing every excess word. Does anyone know of a Rails function or helper that would help in this process? Or perhaps an array of "junk words" already written?
Dealing with stopwords is easy, but I'd suggest you do it BEFORE you split the string into its component words.
Building a fairly simple regular expression can make short work of the words:
STOPWORDS = /\b(?:#{ %w[to and or the a].join('|') })\b/i
# => /\b(?:to|and|or|the|a)\b/i
clean_string = 'to into and sandbar or forest the thesis a algebra'.gsub(STOPWORDS, '')
# => " into sandbar forest thesis algebra"
clean_string.split
# => ["into", "sandbar", "forest", "thesis", "algebra"]
How do you handle them if you get them already split? I'd join(' ') the array to turn it back into a string, then run the above code, which returns the array again.
incoming_array = [
"14000",
"Things",
"to",
"Be",
"Happy",
"About",
]
STOPWORDS = /\b(?:#{ %w[to and or the a].join('|') })\b/i
# => /\b(?:to|and|or|the|a)\b/i
incoming_array = incoming_array.join(' ').gsub(STOPWORDS, '').split
# => ["14000", "Things", "Be", "Happy", "About"]
You could try to use Array's set operations, but you'll run afoul of the case sensitivity of the words, forcing you to iterate over the stopwords and the arrays, which will run a LOT slower.
Take a look at these two answers for some added tips on how you can build very powerful patterns making it easy to match thousands of strings:
"How do I ignore file types in a web crawler?"
"Is there an efficient way to perform hundreds of text substitutions in Ruby?"
All you need is a list of English stopwords. You can find one here, or google for 'english stopwords list'.
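As a minimal sketch tying this back to the original question (the stopword set here is a tiny stand-in; a real English stopword list has a few hundred entries):

```ruby
require 'set'

# Tiny stand-in stopword set; substitute a full English stopword list.
STOPWORDS = %w[a an and or the to of in is].to_set

dirty_array = %w[14000 Things to Be Happy About]
cleaned_array = dirty_array.reject { |w| STOPWORDS.include?(w.downcase) }
# => ["14000", "Things", "Be", "Happy", "About"]
```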

Suppress delimiters in Ruby's String#split

I'm importing data from old spreadsheets into a database using rails.
I have one column that contains a list on each row; the lists are sometimes formatted as
first, second
and other times like this
third and fourth
So I wanted to split up this string into an array, delimiting either with a comma or with the word "and". I tried
my_string.split /\s?(\,|and)\s?/
Unfortunately, as the docs say:
If pattern contains groups, the respective matches will be returned in the array as well.
Which means that I get back an array that looks like
[
[0] "first"
[1] ", "
[2] "second"
]
Obviously only the zeroth and second elements are useful to me. What do you recommend as the neatest way of achieving what I'm trying to do?
You can instruct the regexp to not capture the group using ?:.
my_string.split(/\s?(?:\,|and)\s?/)
# => ["first", "second"]
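One refinement worth noting (my suggestion, not part of the original answer): adding word boundaries around "and" keeps the splitter from firing inside words like "sandbar".

```ruby
# \band\b only matches "and" as a whole word; \s* absorbs surrounding spaces.
splitter = /\s*(?:,|\band\b)\s*/

"first, second".split(splitter)       # => ["first", "second"]
"third and fourth".split(splitter)    # => ["third", "fourth"]
"sandbar and mandate".split(splitter) # => ["sandbar", "mandate"]
```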
As an aside note
into a database using rails.
Please note this has nothing to do with Rails, that's Ruby.

Search term identification

I am trying to do a small analytics plugin for my search. I want to isolate the useful search terms from all the searches done.
for example:
search: "where do i register for charms class"
search terms: "register", "charms class"
I know this is not possible without the program having the context of our whole data set. But is there something I could use to achieve partial results?
What you can do is break the string into an array of strings:
keywords = "where do i register for charms class".split(" ")
#=> ["where", "do", "i", "register", "for", "charms", "class"]
Then you can loop through the array of keywords. This is not a perfect solution, but it may still help you.
You could put all keywords into an array:
keywords = ['some keyword', 'another keyword']
string = 'My string with some keyword'
keywords.none?{|keyword| string.include?(keyword)} #=> true/false
My take on this is to create rules to eliminate useless words, like removing articles, verbs, pronouns and other filler.
You can first tokenize the string, then perform the pruning.
After that you can create rules to further extract the important tokens.
For references:
Tokenizer
Tokenizing a String
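Putting those pieces together, here is a rough sketch of the tokenize-then-prune idea that also keeps consecutive non-stopword runs together as multi-word terms (the stopword set is a toy one; a real list would be much longer):

```ruby
require 'set'

# Toy stopword set for illustration only.
STOPWORDS = %w[where do i for the a an and to is of].to_set

# Split the query into words, then gather consecutive runs of
# non-stopwords into terms, so "charms class" stays together.
def extract_terms(query, stopwords)
  terms, current = [], []
  query.downcase.split.each do |word|
    if stopwords.include?(word)
      terms << current.join(' ') unless current.empty?
      current = []
    else
      current << word
    end
  end
  terms << current.join(' ') unless current.empty?
  terms
end

extract_terms('where do i register for charms class', STOPWORDS)
# => ["register", "charms class"]
```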
