Select the missing word in the sentence from the possible options

I have a sentence with a missing word.
"I want to buy a ___ to drive it."
And I have a list of answer options that might fill the blank.
["car", "cat", "can"]
I need to choose the most suitable word. I wanted to do this with BERT, but I can't find a way to restrict its predictions to a given set of candidate words.

You can restrict BERT to a word list. In fact, I made a library which does this: FitBert.
from fitbert import FitBert
fb = FitBert()
masked_string = "I want to buy a ***mask*** to drive it."
options = ["car", "cat", "can"]
ranked_options = fb.rank(masked_string, options=options)
assert ranked_options[0] == "car"


Splitting a string based on a certain set of words?

I'm trying to figure out how to take a phrase and split it up into a list of separate strings based on the occurrence of certain words.
Examples are probably the easiest way to explain what I'm hoping to achieve:
List splitters = ['ABOVE', 'AT', 'NEAR', 'IN'];
INPUT: "ALFALFA DITCH IN ECKERT CO";
OUTPUT: ["ALFALFA DITCH", "IN ECKERT CO"];
INPUT: 'ANIMAS RIVER AT DURANGO, CO';
OUTPUT: ['ANIMAS RIVER', 'AT DURANGO, CO'];
INPUT: 'ALAMOSA RIVER ABOVE WILSON CREEK IN JASPER, CO';
OUTPUT ['ALAMOSA RIVER', 'ABOVE WILSON CREEK IN JASPER, CO'];
Notice in the third example, when there are multiple occurrences of splitters in the input phrase, I only want to use the first one.
To my knowledge, the split() method doesn't support multiple separator strings, and I can't find a single example of this in Dart. I would think there is a simple solution?
I'd use a RegExp, then:
var splitters = ['ABOVE', 'AT', 'NEAR', 'IN'];
var s = "ALFALFA DITCH IN ECKERT CO";
var splitterRE = RegExp(splitters.join('|'));
var match = splitterRE.firstMatch(s);
if (match != null) {
  var partOne = s.substring(0, match.start).trimRight();
  var partTwo = s.substring(match.start);
}
That does what you ask for, but it's slightly unsafe.
It will find "IN" in "BEHIND" if given "BEHIND THE FARM IN ALABAMA".
You likely want to match only complete words. In that case, RegExps are even more helpful, since they can do that too. Change the line to:
var splitterRE = RegExp(r'\b(?:' + splitters.join('|') + r')\b');
then it will only match entire words.
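The same whole-word trick translates directly to other regex engines. Here is a quick sketch in Python for readers comparing across languages (the function and variable names are mine, not from the Dart answer):

```python
import re

splitters = ["ABOVE", "AT", "NEAR", "IN"]
# \b on both sides ensures whole-word matches only,
# so "IN" will not match inside "BEHIND".
splitter_re = re.compile(r"\b(?:" + "|".join(splitters) + r")\b")

def split_on_first(s):
    # Split s at the first whole-word occurrence of any splitter.
    match = splitter_re.search(s)
    if match is None:
        return [s]
    return [s[:match.start()].rstrip(), s[match.start():]]

print(split_on_first("ALFALFA DITCH IN ECKERT CO"))
# ['ALFALFA DITCH', 'IN ECKERT CO']
```

As in the Dart version, only the first splitter occurrence is used, and later ones stay inside the second part.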

Generate valid words from string

Due to a technical problem, all the spaces in all sentences have been removed (except the full stops).
mystring='thisisonlyatest. andhereisanothersentense'
Is there any way in Python to get readable output like this?
"this is only a test. and here is another sentense."
If you have a list of valid common words (such lists can be found online for different languages), you can take each prefix of the text, check whether it is a valid word, and recurse on the rest of the sentence. Use memoization to prevent redundant computation on repeated suffixes.
Here is an example in Python. The lru_cache decorator adds memoization to the function, so the result for each suffix is calculated only once, independently of how the first part has been split. Note that words is a set for O(1) lookup. A prefix tree (trie) would work well, too.
import functools

words = {"this", "his", "is", "only", "a", "at", "ate", "test",
         "and", "here", "her", "an", "other", "another",
         "sent", "sentense", "tense", "thousands", "more"}
max_len = max(map(len, words))

@functools.lru_cache(None)
def find_sentences(text):
    # Return a list rather than yielding, so that cached results
    # can safely be reused for repeated suffixes.
    if not text:
        return [[]]
    results = []
    for i in range(1, min(max_len, len(text)) + 1):
        prefix, suffix = text[:i], text[i:]
        if prefix in words:
            for rest in find_sentences(suffix):
                results.append([prefix] + rest)
    return results

mystring = 'thisisonlyatest. andhereisanothersentense'
for text in mystring.split(". "):
    print(repr(text))
    for sentence in find_sentences(text):
        print(sentence)
This will give you a list of valid (but possibly nonsensical) ways to split the sentence into words. Those may be few enough that you can pick the right one by hand; otherwise you might have to add a post-processing step, e.g. using part-of-speech analysis with a proper NLP framework.

Titanic: Machine Learning from Disaster

Define a function to extract titles from passenger names:
import re

def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
What does title_search = re.search(r' ([A-Za-z]+)\.', name) mean?
The titanic dataset has names of the passenger like: Graham, Miss. Margaret Edith and Behr, Mr. Karl Howell
The titles here are Mr. and Miss.
title_search = re.search(r' ([A-Za-z]+)\.', name)
The above line of code searches for names containing titles. Mr. and Miss. are not the only ones; it could also be, for example, Dr., Prof. and so on. Since we do not know beforehand what the titles are, but we do know the pattern, which is 'letters followed by a period', we look for those words alone.
([A-Za-z]+)\. means: match one or more letters that are immediately followed by a full stop. The backslash matters, because an unescaped . matches any character.
I suggest you read about regular expressions.
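To see the pattern in action, here is a minimal self-contained demo (it repeats the function from above, using the sample names mentioned in the question):

```python
import re

def get_title(name):
    # Capture one or more letters that are immediately followed
    # by a literal period, e.g. "Mr." or "Miss.".
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""

print(get_title("Graham, Miss. Margaret Edith"))  # Miss
print(get_title("Behr, Mr. Karl Howell"))         # Mr
print(repr(get_title("no title here")))           # ''
```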

grep or detect a pattern in a character that begins with a phrase?

Using RStudio.
I have a descriptive character feature whose values begin with things like "I love you", "I love him", "I love my dad", "I rather love...", "I hate..", "I don't care..", "I surely love....". Many "I * love" patterns, among others.
Now I'd like to create a new feature that equals 1 if the raw feature begins with "I love", and 0 otherwise.
In SAS, I can just write:
if compress(old_feature) in: ("Ilove") then new_feature=1; else new_feature=0;
How can I do that in RStudio? I have searched here, and the closest example I found is grep("^FA_.*Sc$", names(nc_df), value=TRUE), but my attempts capture a lot I don't want, for example "I definitely love".
Thanks.
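The regex idea is language-neutral: anchor the pattern at the start of the string. In R that would be something like as.integer(grepl("^I love\\b", old_feature)); the same idea is sketched below in Python, with made-up sample phrases:

```python
import re

# ^ anchors the pattern at the start of the string, and \b requires
# a word boundary after "love", so only phrases that literally begin
# with "I love" (not "I lovebirds") get flagged as 1.
pattern = re.compile(r"^I love\b")

def flag(phrase):
    return 1 if pattern.match(phrase) else 0

phrases = ["I love you", "I love him", "I rather love cats",
           "I definitely love dogs", "I hate mornings"]
print([flag(p) for p in phrases])  # [1, 1, 0, 0, 0]
```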

Canonicalize NFL team names

This is actually a machine learning classification problem but I imagine there's a perfectly good quick-and-dirty way to do it. I want to map a string describing an NFL team, like "San Francisco" or "49ers" or "San Francisco 49ers" or "SF forty-niners", to a canonical name for the team. (There are 32 NFL teams so it really just means finding the nearest of 32 bins to put a given string in.)
The incoming strings are not actually totally arbitrary (they're from structured data sources like this: http://www.repole.com/sun4cast/stats/nfl2008lines.csv) so it's not really necessary to handle every crazy corner case like in the 49ers example above.
I should also add that in case anyone knows of a source of data containing both moneyline Vegas odds as well as actual game outcomes for the past few years of NFL games, that would obviate the need for this. The reason I need the canonicalization is to match up these two disparate data sets, one with odds and one with outcomes:
http://www.footballlocks.com/nfl_odds.shtml
http://www.repole.com/sun4cast/freepick.shtml
Ideas for better, more parsable, sources of data are very welcome!
Added: The substring matching idea might well suffice for this data; thanks! Could it be made a little more robust by picking the team name with the nearest Levenshtein distance?
Here's something plenty robust even for arbitrary user input, I think. First, map each team (I'm using a 3-letter code as the canonical name for each team) to a fully spelled out version with city and team name as well as any nicknames in parentheses between city and team name.
Scan[(fullname[First@#] = #[[2]])&, {
{"ari", "Arizona Cardinals"}, {"atl", "Atlanta Falcons"},
{"bal", "Baltimore Ravens"}, {"buf", "Buffalo Bills"},
{"car", "Carolina Panthers"}, {"chi", "Chicago Bears"},
{"cin", "Cincinnati Bengals"}, {"clv", "Cleveland Browns"},
{"dal", "Dallas Cowboys"}, {"den", "Denver Broncos"},
{"det", "Detroit Lions"}, {"gbp", "Green Bay Packers"},
{"hou", "Houston Texans"}, {"ind", "Indianapolis Colts"},
{"jac", "Jacksonville Jaguars"}, {"kan", "Kansas City Chiefs"},
{"mia", "Miami Dolphins"}, {"min", "Minnesota Vikings"},
{"nep", "New England Patriots"}, {"nos", "New Orleans Saints"},
{"nyg", "New York Giants NYG"}, {"nyj", "New York Jets NYJ"},
{"oak", "Oakland Raiders"}, {"phl", "Philadelphia Eagles"},
{"pit", "Pittsburgh Steelers"}, {"sdc", "San Diego Chargers"},
{"sff", "San Francisco 49ers forty-niners"}, {"sea", "Seattle Seahawks"},
{"stl", "St Louis Rams"}, {"tam", "Tampa Bay Buccaneers"},
{"ten", "Tennessee Titans"}, {"wsh", "Washington Redskins"}}]
Then, for any given string, find the longest common subsequence for each of the full names of the teams. To give preference to strings matching at the beginning or the end (e.g., "car" should match "carolina panthers" rather than "arizona cardinals"), sandwich both the input string and the full names between spaces. Whichever team's full name has the longest longest-common-subsequence (sic) with the input string is the team we return. Here's a Mathematica implementation of the algorithm:
teams = keys@fullnames;
(* argMax[f, domain] returns the element of domain for which f of that element is
   maximal -- breaks ties in favor of first occurrence. *)
SetAttributes[argMax, HoldFirst];
argMax[f_, dom_List] := Fold[If[f[#1] >= f[#2], #1, #2] &, First@dom, Rest@dom]
canonicalize[s_] := argMax[StringLength@LongestCommonSubsequence[" "<>s<>" ",
  " "<>fullname@#<>" ", IgnoreCase->True]&, teams]
Quick inspection by sight shows that both data sets contain the teams' locations (i.e. "Minnesota"). Only one of them has the teams' names. That is, one list looks like:
Denver
Minnesota
Arizona
Jacksonville
and the other looks like
Denver Broncos
Minnesota Vikings
Arizona Cardinals
Jacksonville Jaguars
Seems like, in this case, some pretty simple substring matching would do it.
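One quick-and-dirty version of that idea, sketched in Python: pad both strings with spaces (as in the answer above) and score each candidate by the longest common substring (a simplification of the longest common subsequence used there). The team codes and the shortened table are illustrative, not a complete mapping:

```python
import difflib

# Illustrative subset of the 32-team table; codes follow the answer above.
full_names = {
    "sff": "San Francisco 49ers forty-niners",
    "min": "Minnesota Vikings",
    "den": "Denver Broncos",
    "ari": "Arizona Cardinals",
}

def canonicalize(s):
    # Pad with spaces so matches at word edges score higher, then pick
    # the team whose full name shares the longest common substring
    # with the input.
    a = " " + s.lower() + " "
    def score(code):
        b = " " + full_names[code].lower() + " "
        sm = difflib.SequenceMatcher(None, a, b)
        return sm.find_longest_match(0, len(a), 0, len(b)).size
    return max(full_names, key=score)

print(canonicalize("49ers"))      # sff
print(canonicalize("Minnesota"))  # min
```

Ties break in favor of the first entry in the table; for production use you would want all 32 teams and perhaps a minimum-score threshold to reject garbage input.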
If you know both the source and destination names, then you just need to map them.
In PHP, you would just use an array with keys from the data source and values from the destination, then reference them like:
$map = array('49ers' => 'San Francisco 49ers',
             'packers' => 'Green Bay Packers');
foreach ($incoming_names as $name) {
    echo $map[$name];
}
