Rails - Detecting keywords in a string with exact match - ruby-on-rails

This one is tricky, at least for me as I am new to rails.
soccer = ["football pitch", "soccer", "free kick", "penalty"]
string = "Did anyone see that free kick last night, let me get my pen!!!"
What I want to do is search for instances of keywords but with 2 main rules:
1 - Don't do partial matches, i.e. it should not match pen with penalty; it has to be a full match.
2 - Match multi-word keywords like "nice day", "sweet tooth", "three's a crowd" (max of 3 words)
This code works perfectly for scenario 1:
def self.check_for_keyword_match?(string,keyword_array)
  string.split.any? { |word| keyword_array.include?(word) }
end
if check_for_keyword_match?(string,soccer)
  soccer.to_set.freeze
  keywords_found.push('soccer')
  # send a response saying Hey, I see you are interested in soccer.
end
In that example it would not match pen but it would match penalty which is perfect.
But I also want it to match keywords made up of 2-3 words, i.e. "free kick" should match. With the code above, only "free" and "kick" would match, and only if they were listed as single keywords. "Free" is too broad, and so is "kick", but "free kick" is not broad, so it works much better at deciphering their interests.
I can change the format of the soccer array, but the string being submitted would come from a Slack post, so I can't control how that is formatted. In the actual program I have 20 or so of those arrays with keywords, but once I figure out how to do one, I can handle the rest.

For manipulating strings, Regular Expressions are useful.
The following code should fix your issue:
def self.check_for_keyword_match?(string, keyword_array)
  # Regexp.escape keeps any special characters in a keyword literal
  keyword_array.any? { |word| Regexp.new('\b' + Regexp.escape(word) + '\b').match(string) }
end
Instead of splitting string, go through keyword_array and search the entire string for each keyword.
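The regex wraps each keyword in the word-boundary anchor \b so that it only matches entire words (rule 1). If you used string.include?(word) here instead, a keyword of "pen" would also match inside "penalty". Because the whole string is searched, multi-word keywords such as "free kick" match as well (rule 2).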
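A minimal sketch of how the same word-boundary idea could cover the 20 or so keyword arrays mentioned in the question. The TOPICS hash, the 'weather' entry, the matched_topics helper, and the case-insensitive flag are assumptions made for illustration; they are not part of the original answer.

```ruby
# Hypothetical topic table; the question only shows the soccer array.
TOPICS = {
  'soccer'  => ['football pitch', 'soccer', 'free kick', 'penalty'],
  'weather' => ['nice day', 'rain']
}.freeze

# Build one regex for a keyword list: any keyword, bounded by \b on both sides.
# Regexp.escape keeps characters such as "." or "?" in a keyword literal.
def keyword_pattern(keywords)
  alternatives = keywords.map { |keyword| Regexp.escape(keyword) }.join('|')
  Regexp.new("\\b(?:#{alternatives})\\b", Regexp::IGNORECASE)
end

TOPIC_PATTERNS = TOPICS.transform_values { |keywords| keyword_pattern(keywords) }

# Returns the names of all topics with at least one whole-word keyword match.
def matched_topics(string)
  TOPIC_PATTERNS.select { |_topic, pattern| pattern.match?(string) }.keys
end

string = 'Did anyone see that free kick last night, let me get my pen!!!'
p matched_topics(string)
```

Building the patterns once, rather than on every call, avoids recompiling a regex for each incoming message.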

Related

Can I pull a list of info out of an email?

I get a daily email that lists upcoming appointments and their length. The number of appointments varies from day to day.
The emails go like this:
================
Today's Schedule
9:30 AM
3h
Brazilian Blowout
[Client #1 name]
12:30 PM
1h
Women's Cut
[Client 2 name]
6:00 PM
45m
Men's Cut
[Client #3 name]
Projected Revenue
===================
I want to create an event in a Google Calendar for each appointment, and it seems like Zapier MIGHT be able to do this, but all the help resources I can find are very general in nature.
Is this do-able on Zapier? If so, any nudges in the right direction would be awesome.
Any thoughts greatly appreciated.
I had some time to kill and enjoy the odd challenge. So I have put together a solution that should do what you are looking for. I will break it down by steps.
TEMPLATE
Zapier Trigger - Step 1
Type: Trigger
Module: Gmail
Criteria: User Dependent
Comments: For the trigger step you will want to use a Gmail-specific trigger, something to the effect of "execute trigger on emails titled 'xyz'", or "emails labeled 'xyz'" if you set up a filter in your inbox.
Zapier Action - Step 2
Type: Action
Module: Code (Python 3)
Comments: The Code module offered by Zapier executes whatever (properly written) code you place in its container. It is especially handy because it lets you use data from previous steps through a dictionary variable named 'input_data'. Zapier offers the Code module in two languages: JavaScript and Python. As I am most familiar with Python, my solution for this step is written in Python. I will append the code to the end of this answer. Using the data held in the body of the email (retrieved in step 1), we can apply some string manipulation and datetime conversions to break the email into its component parts and pass those on to the following action step: Create Calendar Event.
Zapier Action - Step 3
Type: Action
Module: Google Calendar - Create Event
Comments: Using the data outputted from the previous code step we can fill out the required fields for creating a new appointment.
PYTHON CODE
from datetime import timedelta, date, datetime
'''
Goal: Extract individual appointment details from variable length email
Steps:
    Remove all extraneous and new line characters.
    Isolate each individual appointment and group its relevant details.
    Derive appointment start and end times using appointment time and duration.
    Return all appointments in a list.
'''
def format_appt_times(appt_dict):
    appt_start_str = appt_dict.get("appt_start")
    appt_dur_str = appt_dict.get("appt_length")
    # isolate hour and minutes from appointment time
    appt_s_hour = int(appt_start_str[:appt_start_str.find(":")])
    if ("pm" in appt_start_str.lower()):
        appt_s_hour = 12 if appt_s_hour + 12 >= 24 else appt_s_hour + 12
    appt_s_min = int(appt_start_str[appt_start_str.find(":") + 1 :
                                    appt_start_str.find(":") + 3])
    # isolate hour and minutes from duration time
    appt_d_hour = 0
    appt_d_min = 0
    if ("h" in appt_dur_str):
        appt_d_hour = int(appt_dur_str[:appt_dur_str.find("h")])
    if ("m" in appt_dur_str):
        appt_d_min = int(appt_dur_str[appt_dur_str.find("m") - 2 : appt_dur_str.find("m")])
    # NOTE: adjust timedelta hours depending on your relation to UTC
    # create datetime objects for appointment start and end times
    time_zone = timedelta(hours=0)
    tdy = date.today() - time_zone
    duration = timedelta(hours=appt_d_hour, minutes=appt_d_min)
    appt_start_dto = datetime(year=tdy.year,
                              month=tdy.month,
                              day=tdy.day,
                              hour=appt_s_hour,
                              minute=appt_s_min)
    appt_end_dto = appt_start_dto + duration
    # return properly formatted datetime as string for use in next step.
    return (appt_start_dto.strftime("%Y-%m-%dT%H:%M"),
            appt_end_dto.strftime("%Y-%m-%dT%H:%M"))
def partition_list(target, part_size):
    for data in range(0, len(target), part_size):
        yield target[data : data + part_size]
def main():
    # Remove all extraneous and new line characters.
    email_body = input_data.get("email_body")
    head,delin,*email_body,delin,foot = [text for text in email_body.splitlines() if text != ""]
    appointment_list = []
    # Isolate each individual appointment and group its relevant details.
    for text in partition_list(email_body, 4):
        template = {
            "appt_start" : text[0],
            "appt_end" : None,
            "appt_length" : text[1],
            "appt_title" : text[2],
            "appt_client" : text[3]
        }
        appointment_list.append(template)
    for appt in appointment_list:
        appt["appt_start"], appt["appt_end"] = format_appt_times(appt)
    return appointment_list
return main()
I am not sure of your familiarity with Python, or programming more generally, but the comments in the code explain what each section is doing. If you have any specific questions regarding aspects of the code let me know. Assuming your email template does not change this setup should work exactly as needed. Let me know if anything is unclear.
UPDATE
I thought it best to address your question in the original answer should anyone else have similar questions.
Explaining how this code removes the extra characters:
There is actually a fair bit going on in the first line, so I will do my best to break it down, and provide resources where necessary.
The code in question:
head,delin,*email_body,delin,foot = [text for text in email_body.splitlines() if text != ""]
The first step here was to break the text into manageable chunks. I did so with the call email_body.splitlines(), which breaks a string into a list at each line boundary (to split on a delimiter of your own choosing, use str.split instead).
If we were to inspect the list at this moment its contents would be something of the following:
["================", "", "Today's Schedule", "", "9:30 AM", "", "3h", ..., "[Client #3 name]", "", "Projected Revenue", "", "==================="]
You will notice there is a fair amount of information in there that we really don't want.
First let's look at the "" elements. These are left over from the blank lines between each line of text, which, even though they are blank, still end in a newline character. There are a number of ways you could address this within Python. We could simply write a for-loop to go through and copy all elements that are not "" to a new list.
To me this felt like additional work, and besides, Python offers list comprehension for just such a scenario. I won't go too deep into list comprehension as there is a lot that can be said about it, and in more insightful ways than I could muster, but it essentially allows you to provide logic against a set of 'data' to form a list. In this case, I specifically wanted to filter out the "" elements returned from the call to splitlines().
And so you will see I address this with the following line
[text for text in email_body.splitlines() if text != ""]
With that we have a list as above, less the "" elements. Now we must turn our attention towards the more 'dynamic' garbage strings. Again there are a number of ways to do this. One option, though not a particularly flexible one, would be to store the strings we want to remove in variables, something to the effect of:
garb_1 = "==================="
garb_2 = "Projected Revenue"
garb_3 = ...
and once again filter the list with yet another for-loop. I instead chose to leverage Python's unpacking idiom, which allows us to 'unpack' lists (as well as tuples and other iterables) into variables. As an example:
one, two, three = ["a", "b", "c"]
I'm sure you can guess what is happening above: as long as we provide the same number of variables as there are elements in the list, we can 'unpack' it in this fashion. But wait! In our case we don't know how long the list is going to be, as it depends entirely on the number of appointments you have on any given day. This is where star unpacking comes in. Using my code as the example:
head,delin,*email_body,delin,foot = [text for text in email_body.splitlines() if text != ""]
The *, in plain English, is saying "I don't know how many elements to expect, just give me all of them in a list". As we know that there will always be two lines of garbage at the beginning and end of the email, we can assign them to throwaway variables and capture everything in between using our variable-length *email_body container.
With all of this complete we now have a list with only the data we are looking to capture. If, as you say, there are additional lines of garbage before or after the email_body, you can simply add additional throw away variables to account for them.
Once again feel free to ask any follow up questions.
Michael
Resources
List Comprehension
Star Unpacking

Thinking Sphinx sql like query in a condition for a field

I had these queries, but now I'm trying to use sphinx, and I need to replace them, but I can't find a way to do this:
p1 = Product.where "category LIKE ?", "#{WORD}"
p2 = Product.where "category LIKE ?", "#{WORD}.%"
product_list = p1 + p2
I'm doing the search over a model named "Product" in "category" field; I need a way to replace "#" and "%" in sphinx. I have a basic idea of how to do that, but this isn't working:
Product.search conditions: {category: "('WORD' | 'WORD.*')"}
There are a few things to note.
If you want to match on prefixes, make sure you have min_prefix_len set to 1 or greater (the smaller, the more accurate, but also the slower your searches will be, and the larger your index files will get). Also, you need enable_star set to true. Both of these settings belong in config/thinking_sphinx.yml (there are examples in the docs).
Single quotes have no purpose in Sphinx searches, and will be ignored - but I don't think that's a problem with what you're trying to search with.
Full stops, however, are treated as word separators by default. You can change this with charset_table - but that means all full stops in all fields will be treated as part of words (say, at the end of sentences), so I wouldn't recommend it.
However, if full stops are left as word separators, then each word in the category field is indexed separately, and so without any extra settings, this should work:
Product.search conditions: {category: WORD}

detect if a combination of string objects from an array matches against any commands

Please be patient and read my current scenario. My question is below.
My application takes in speech input and is successfully able to group words that match together to form either one word or a group of words - called phrases; be it a name, an action, a pet, or a time frame.
I have a master list of the phrases that are allowed and are stored in their respective arrays. So I have the following arrays validNamesArray, validActionsArray, validPetsArray, and a validTimeFramesArray.
A new array of phrases is returned each and every time the user stops speaking.
NSArray *phrasesBeingFedIn = @[@"CHARLIE", @"EAT", @"AT TEN O CLOCK",
                               @"CAT",
                               @"DOG", @"URINATE",
                               @"CHILDREN", @"ITS TIME TO", @"PLAY"];
Knowing that it's OK to have the following combinations to create a command:
COMMAND 1: NAME + ACTION + TIME FRAME
COMMAND 2: PET + ACTION
COMMAND n: n + n, .. + n
//In the example above, only the groups of phrases 'Charlie eat at ten o clock' and 'dog urinate'
//would be valid commands; the phrase 'cat' would not satisfy any of the commands
//and will therefore be ignored
Question
What is the best way for me to parse through the phrases being fed in and determine which combination phrases will satisfy my list of commands?
POSSIBLE solution I've come up with
One way is to step through the array with if and else statements that look ahead at the following phrases and check whether they satisfy any valid command pattern from the list. However, this solution is not dynamic: I would have to add a new set of if and else statements for every new command permutation I create.
My solution is not efficient. Any ideas on how I could build something that works and stays dynamic no matter how many new command sequences or phrase combinations I add?
I think what I would do is make an array for each category of speech (pet, command, etc). Those arrays would obviously have strings as elements. You could then test each word against each simple array using
[simpleWordListOfPets containsObject:word]
Which would return a BOOL result. You could do that in a case statement. The logic after that is up to you, but I would keep scanning the sentence using NSScanner until you have finished evaluating each section.
I've used some similar concepts to analyze a paragraph... it starts off like this:
while ([scanner scanUpToString:@"," intoString:&word]) {
    processedWordCount++;
    NSLog(@"%i total words processed", processedWordCount);
    // Does word exist in the simple list?
    if ([simpleWordList containsObject:word]) {
        //NSLog(@"Word already exists: %@", word);
        // ... handle the match here ...
    }
}
You would continue it with whatever logic you wanted (and you would search for a space rather than a ",").

Loop through text and extract pre-defined words and word pairs in Rails

I have a large string of text description, up to 500 words long. I would like to do the following:
Loop through description and look for a large number of pre-defined words from array keywords, which contains single words, word pairs and word triplets.
Every time a match is found, add this match to a new array matches (unless already added earlier in the process) and remove the matched word(s) from description.
I've had a look around for solutions, but most of them seem to either dive in at the deep end of natural language processing, which would be too complex for my current needs, or simply split the text string on spaces which means that it's then impossible to look for word pairs.
Would greatly appreciate any ideas as to how to do this efficiently.
description = "The quick brown fox jumped over the lazy dog, and another brown dog"
keywords = ["brown", "lazy", "apple"]
matches = []
keywords.each do |keyword|
  matches << description.match(keyword).to_s if description.match(keyword)
end
p matches
#=> ["brown", "lazy"]
matches.each do |keyword|
  description.gsub!(Regexp.new(keyword), '')
end
# collapse the doubled spaces left behind by the removed words
description.gsub!(/ {2,}/, ' ')
puts description
#=> "The quick fox jumped over the dog, and another dog"
You can keep a frequency counter for each word in the array, starting at 0.
Loop through the text in description.
If a word matches the description text exactly, increase its frequency by 1.
At the end, put the words whose frequency is greater than 0 into the new array matches and delete them from description.
For example, if a word occurs 2 times, its frequency will be 0 + 2 = 2.
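The counting idea above can be sketched as follows. This is only an illustration under assumptions: the 'lazy dog' keyword and the use of String#scan with a word-boundary regex are choices made here, not something the answer specifies.

```ruby
description = 'The quick brown fox jumped over the lazy dog, and another brown dog'
keywords = ['brown', 'lazy', 'apple', 'lazy dog']

# Count whole-word occurrences of each keyword; keywords that never occur stay at 0.
frequencies = keywords.map do |keyword|
  [keyword, description.scan(/\b#{Regexp.escape(keyword)}\b/).size]
end.to_h

# Keep only the keywords that occurred at least once.
matches = frequencies.select { |_keyword, count| count.positive? }.keys

p frequencies
p matches
```

Because each keyword is counted against the full text, word pairs and triplets are handled the same way as single words.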
This is the crude hack that occurred to me :)
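# find the keywords that occur as whole words
keywords.select do |keyword|
  description =~ /\b#{Regexp.escape(keyword)}\b/
end
# -or- process each occurrence as it is found
keywords.each do |keyword|
  description.gsub(/\b#{Regexp.escape(keyword)}\b/) do |match|
    # whatever
  end
end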
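A minimal sketch that combines both steps from the question, collecting each match once and removing it from the text. The extract_keywords helper, the longest-first ordering, and the whitespace cleanup are assumptions made for illustration.

```ruby
# Returns [matches, remaining_text]. Keywords are tried longest first so that
# "lazy dog" wins over "lazy" where both could match at the same position.
def extract_keywords(description, keywords)
  sorted = keywords.sort_by { |keyword| -keyword.length }
  alternatives = sorted.map { |keyword| Regexp.escape(keyword) }.join('|')
  pattern = Regexp.new("\\b(?:#{alternatives})\\b")

  matches = []
  remaining = description.gsub(pattern) do |match|
    matches << match unless matches.include?(match) # record each keyword once
    ''                                              # remove it from the text
  end

  # Collapse the doubled spaces left behind by removed words.
  [matches, remaining.squeeze(' ').strip]
end

description = 'The quick brown fox jumped over the lazy dog, and another brown dog'
matches, remaining = extract_keywords(description, ['brown', 'lazy dog', 'lazy', 'apple'])
p matches
puts remaining
```

A single pass with one combined regex avoids rescanning a 500-word description once per keyword, which may matter when the keyword list is large.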

Sphinx, Rails, ThinkSphinx and making some words matter more than others in your query

I have a list of keywords that I need to search against, using ThinkingSphinx
Some of them are more important than others, so I need to find a way to weight those words.
So far, the only solution I have come up with is to repeat the same word x number of times in my query to increase its relevance.
Eg:
3 keywords, each of them having a level of importance: Blue(1) Recent(2) Fun(3)
I run this query
MyModel.search "Blue Recent Recent Fun Fun Fun", :match_mode => :any
Not very elegant, and quite limiting.
Does anyone have a better idea?
If you can get those keywords into a separate field, then you could weight those fields to be more important. That's about the only good approach I can think of, though.
MyModel.search "Blue Recent Fun", :field_weights => {"keywords" => 100}
Recently I've been using Sphinx extensively, and since the death of UltraSphinx, I started using Pat's great plugin (Thanks Pat, I'll buy you a coffee in Melbourne soon!)
I see a possible solution based on your original idea, but you need to make changes to the data at "index time" not "run time".
Try this:
Modify your Sphinx SQL query to replace "Blue" with "Blue Blue Blue Blue", "Recent" with "Recent Recent Recent" and "Fun" with "Fun Fun". This will magnify any occurrences of your special keywords.
e.g. SELECT REPLACE(my_text_col,"blue","blue blue blue") as my_text_col ...
You probably want to do them all at once, so just nest the replace calls.
e.g. SELECT REPLACE(REPLACE(my_text_col,"fun","fun fun"),"blue","blue blue blue") as my_text_col ...
Next, change your ranking mode to SPH_RANK_WORDCOUNT. This way maximum relevancy is given to the frequency of the keywords.
(Optional) Imagine you have a list of keywords related to your special keywords. For example "pale blue" relates to "blue" and "pleasant" relates to "fun". At run time, rewrite the query text to look for the target word instead. You can store these words easily in a hash, and then loop through it to make the replacements.
# Add trigger words as the key,
# and the related special keyword as the value
trigger_words = {}
trigger_words['pale blue'] = 'blue'
trigger_words['pleasant'] = 'fun'
# Now loop through each query term and see if it should be replaced.
# Note: splitting on whitespace only ever sees single words, so a
# two-word trigger such as 'pale blue' needs phrase matching instead.
new_query = ""
query.split.each do |word|
  word = trigger_words[word] if trigger_words.has_key?(word)
  new_query = "#{new_query} #{word}".strip
end
Now you have quasi-keyword-clustering too. Sphinx is really a fantastic technology, enjoy!
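One caveat with rewriting the query word by word is that splitting on whitespace can never see a two-word trigger such as 'pale blue' as a single term. A minimal sketch of a phrase-aware rewrite follows, assuming a hypothetical rewrite_query helper and case-insensitive matching; neither is part of the original answer.

```ruby
# Hypothetical trigger table: related phrase => special keyword.
TRIGGER_WORDS = {
  'pale blue' => 'blue',
  'pleasant'  => 'fun'
}.freeze

def rewrite_query(query)
  # Longest triggers first so multi-word phrases win over shorter ones.
  sorted = TRIGGER_WORDS.keys.sort_by { |trigger| -trigger.length }
  alternatives = sorted.map { |trigger| Regexp.escape(trigger) }.join('|')
  pattern = Regexp.new("\\b(?:#{alternatives})\\b", Regexp::IGNORECASE)

  # Replace every whole-word trigger with its special keyword.
  query.gsub(pattern) { |match| TRIGGER_WORDS[match.downcase] }
end

p rewrite_query('a pleasant pale blue sky')
```

The rewritten string can then be passed to the search call in place of the raw query.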
