Canonicalize NFL team names - parsing

This is actually a machine learning classification problem but I imagine there's a perfectly good quick-and-dirty way to do it. I want to map a string describing an NFL team, like "San Francisco" or "49ers" or "San Francisco 49ers" or "SF forty-niners", to a canonical name for the team. (There are 32 NFL teams so it really just means finding the nearest of 32 bins to put a given string in.)
The incoming strings are not actually totally arbitrary (they're from structured data sources like this: http://www.repole.com/sun4cast/stats/nfl2008lines.csv) so it's not really necessary to handle every crazy corner case like in the 49ers example above.
I should also add: if anyone knows of a source of data containing both moneyline Vegas odds and actual game outcomes for the past few years of NFL games, that would obviate the need for this. The reason I need the canonicalization is to match up these two disparate data sets, one with odds and one with outcomes:
http://www.footballlocks.com/nfl_odds.shtml
http://www.repole.com/sun4cast/freepick.shtml
Ideas for better, more parsable, sources of data are very welcome!
Added: The substring matching idea might well suffice for this data; thanks! Could it be made a little more robust by picking the team name with the smallest Levenshtein distance?
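(For concreteness, here is a minimal Python sketch of that Levenshtein idea; the abbreviated team list and the nearest_team helper are purely illustrative, not taken from either data source.)
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, two rows at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

teams = ["San Francisco 49ers", "Green Bay Packers", "Arizona Cardinals"]  # ...all 32

def nearest_team(name):
    # Pick the canonical name with the smallest edit distance to the input.
    return min(teams, key=lambda t: levenshtein(name.lower(), t.lower()))

print(nearest_team("Green Bay"))
Note that whole-string edit distance penalizes length differences, so for very short inputs (like bare city names) combining it with the substring idea is probably safer.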

Here's something plenty robust even for arbitrary user input, I think. First, map each team (I'm using a 3-letter code as the canonical name for each team) to a fully spelled out version with city and team name as well as any nicknames in parentheses between city and team name.
Scan[(fullname[First@#] = #[[2]])&, {
{"ari", "Arizona Cardinals"}, {"atl", "Atlanta Falcons"},
{"bal", "Baltimore Ravens"}, {"buf", "Buffalo Bills"},
{"car", "Carolina Panthers"}, {"chi", "Chicago Bears"},
{"cin", "Cincinnati Bengals"}, {"clv", "Cleveland Browns"},
{"dal", "Dallas Cowboys"}, {"den", "Denver Broncos"},
{"det", "Detroit Lions"}, {"gbp", "Green Bay Packers"},
{"hou", "Houston Texans"}, {"ind", "Indianapolis Colts"},
{"jac", "Jacksonville Jaguars"}, {"kan", "Kansas City Chiefs"},
{"mia", "Miami Dolphins"}, {"min", "Minnesota Vikings"},
{"nep", "New England Patriots"}, {"nos", "New Orleans Saints"},
{"nyg", "New York Giants NYG"}, {"nyj", "New York Jets NYJ"},
{"oak", "Oakland Raiders"}, {"phl", "Philadelphia Eagles"},
{"pit", "Pittsburgh Steelers"}, {"sdc", "San Diego Chargers"},
{"sff", "San Francisco 49ers forty-niners"}, {"sea", "Seattle Seahawks"},
{"stl", "St Louis Rams"}, {"tam", "Tampa Bay Buccaneers"},
{"ten", "Tennessee Titans"}, {"wsh", "Washington Redskins"}}]
Then, for any given string, find the longest common subsequence between it and each team's full name. To give preference to strings matching at the beginning or the end (e.g., "car" should match "carolina panthers" rather than "arizona cardinals"), sandwich both the input string and the full names between spaces. Whichever team's full name has the longest such common subsequence with the input string is the team we return. Here's a Mathematica implementation of the algorithm:
(* the 32 canonical 3-letter codes used above *)
teams = {"ari", "atl", "bal", "buf", "car", "chi", "cin", "clv", "dal", "den",
         "det", "gbp", "hou", "ind", "jac", "kan", "mia", "min", "nep", "nos",
         "nyg", "nyj", "oak", "phl", "pit", "sdc", "sff", "sea", "stl", "tam",
         "ten", "wsh"};
(* argMax[f, domain] returns the element of domain for which f of that element is
   maximal -- breaks ties in favor of first occurrence. *)
SetAttributes[argMax, HoldFirst];
argMax[f_, dom_List] := Fold[If[f[#1] >= f[#2], #1, #2] &, First@dom, Rest@dom]
canonicalize[s_] := argMax[StringLength@LongestCommonSubsequence[" "<>s<>" ",
                             " "<>fullname@#<>" ", IgnoreCase->True]&, teams]

Quick inspection by sight shows that both data sets contain the teams' locations (e.g., "Minnesota"). Only one of them has the teams' names. That is, one list looks like:
Denver
Minnesota
Arizona
Jacksonville
and the other looks like
Denver Broncos
Minnesota Vikings
Arizona Cardinals
Jacksonville Jaguars
Seems like, in this case, some pretty simple substring matching would do it.
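A minimal sketch of that substring idea, assuming a list of the full team names is available (the four names below are just the ones from the example):
full_names = ["Denver Broncos", "Minnesota Vikings",
              "Arizona Cardinals", "Jacksonville Jaguars"]  # ...all 32

def canonicalize(short_name):
    # Return the first full team name that contains the short form.
    for full in full_names:
        if short_name.lower() in full.lower():
            return full
    return None

print(canonicalize("Minnesota"))  # -> Minnesota Vikings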

If you know both the source and destination names, then you just need to map them.
In PHP, you would use an array with keys from the data source and values from the destination, then reference them like:
$map = array('49ers' => 'San Francisco 49ers',
             'packers' => 'Green Bay Packers');
foreach ($incoming_names as $name) {
    echo $map[$name];
}

Related

Part of the data is not displayed fully in spss

I want to create a new string variable (Education) which will contain data from other string variables (Listofuniversities, Listofschools).
The problem is that the data in the variable Education is not displayed fully. It is displayed like this:
Education
TU
Gymna
TL
My original dataset looks like this:
Listofuniversities    Listofschools
TU
                      Gymnasium van der Ort
TEU
                      Gymnasium van der Ort
TU
                      Gymnasium van der Art
TL
                      Gymnasium van der Art
This is the syntax that I have written.
STRING Education (A8).
RECODE Listofuniversities ('TU'='TU') ('TEU'='TEU') ('TL'='TL') INTO Education.
EXECUTE.
RECODE Listofschools ("Gymnasium van der Ort" = "Gymnasium van der Ort") into Education.
VARIABLE WIDTH Education(20).
EXECUTE.
Your data looks as if there are two fields, "Listofuniversities" and "Listofschools", both string fields. It seems like the two fields are independent of one another: when there is a non-blank value in one, there is a blank in the other. Was this intended? If not, I'd look at how you read the data into the program.
Your first command creates a string field 8 characters wide (Education). The values you try to put into "Education" (from "Listofschools" at least) are clearly more than 8 characters wide, so it is appropriate to define "Education" as a wider field, e.g. STRING Education (A50).
If your intent is to consolidate these values across records:
STRING Education (A50).
DO IF (Listofschools=" ").
COMPUTE Education = RTRIM(LTRIM(ListofUniversities)).
ELSE.
COMPUTE Education = RTRIM(LTRIM(Listofschools)).
END IF.
EXECUTE.

Can we find sentences around an entity tagged via NER?

We have a model ready which identifies a custom named entity. The problem is that when the whole document is given, the model does not work as expected; when only a few sentences are given, it gives amazing results.
I want to select two sentences before and after a tagged entity.
E.g., if a part of the doc has the word Colombo (which is tagged as GPE), I need to select two sentences before the tag and two sentences after it. I tried a couple of approaches but the complexity is too high.
Is there a built-in way in spacy with which we can address this problem?
I am using python and spacy.
I have tried parsing the doc by identifying the index of the tag. But that approach is really slow.
It might be worth seeing if you can improve the custom named entity recognizer, because it is unusual for extra context to hurt performance, and if you fix that issue it will potentially work better overall.
However, regarding your concrete question about surrounding sentences:
A Token or a Span (an entity is a Span) has a .sent attribute that gives you the covering sentence as a Span. If you look at the tokens right before/after a given sentence's start/end tokens, you can get the previous/next sentences for any token in a document.
import spacy

def get_previous_sentence(doc, token_index):
    # sent.start is the index of the sentence's first token; the token just
    # before it (if any) belongs to the previous sentence.
    if doc[token_index].sent.start - 1 < 0:
        return None
    return doc[doc[token_index].sent.start - 1].sent

def get_next_sentence(doc, token_index):
    # sent.end is the index one past the sentence's last token, i.e. the first
    # token of the next sentence (if there is one).
    if doc[token_index].sent.end >= len(doc):
        return None
    return doc[doc[token_index].sent.end].sent

nlp = spacy.load('en_core_web_lg')
text = "Jane is a name. Here is a sentence. Here is another sentence. Jane was the mayor of Colombo in 2010. Here is another filler sentence. And here is yet another padding sentence without entities. Someone else is the mayor of Colombo right now."
doc = nlp(text)
for ent in doc.ents:
    print(ent, ent.label_, ent.sent)
    print("Prev:", get_previous_sentence(doc, ent.start))
    print("Next:", get_next_sentence(doc, ent.start))
    print("----")
Output:
Jane PERSON Jane is a name.
Prev: None
Next: Here is a sentence.
----
Jane PERSON Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
Colombo GPE Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
2010 DATE Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
Colombo GPE Someone else is the mayor of Colombo right now.
Prev: And here is yet another padding sentence without entities.
Next: None
----
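To get the two sentences the question asks for on each side, the same trick can be applied repeatedly; here is a sketch (the get_context_sentences helper is my own, not part of spaCy's API):
def get_context_sentences(doc, token_index, n=2):
    # Collect up to n sentences before and after the sentence containing the token.
    sent = doc[token_index].sent
    before, after = [], []
    current = sent
    for _ in range(n):
        if current.start - 1 < 0:
            break
        current = doc[current.start - 1].sent
        before.insert(0, current)
    current = sent
    for _ in range(n):
        if current.end >= len(doc):
            break
        current = doc[current.end].sent
        after.append(current)
    return before, sent, after
Calling get_context_sentences(doc, ent.start) then returns the two preceding sentences, the entity's own sentence, and the two following sentences (fewer near the document edges).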

Generate valid words from string

Due to some technical problem, all the spaces in all sentences have been removed (except full stops).
mystring='thisisonlyatest. andhereisanothersentense'
Is there any way in Python to get readable output like this:
"this is only a test. and here is another sentense."
If you have a list of valid common words (which can be found on the internet for different languages), you can take every prefix of the text, check whether it is a valid word, and recursively repeat with the rest of the sentence. Use memoization to prevent redundant computation on the same suffixes.
Here is an example in Python. The lru_cache decorator adds memoization to the function, so that the possible splits of each suffix are calculated only once, independently of how the first part has been split. Note that words is a set for O(1) lookup. A prefix tree (trie) would work very well, too.
words = {"this", "his", "is", "only", "a", "at", "ate", "test",
"and", "here", "her", "is", "an", "other", "another",
"sent", "sentense", "tense", "and", "thousands", "more"}
max_len = max(map(len, words))
import functools
functools.lru_cache(None)
def find_sentences(text):
if len(text) == 0:
yield []
else:
for i in range(min(max_len, len(text)) + 1):
prefix, suffix = text[:i], text[i:]
if prefix in words:
for rest in find_sentences(suffix):
yield [prefix] + rest
mystring = 'thisisonlyatest. andhereisanothersentense'
for text in mystring.split(". "):
print(repr(text))
for sentence in find_sentences(text):
print(sentence)
This will give you a list of valid (but possibly nonsensical) ways to split the sentence into words. Those may be few enough that you can pick the right one by hand; otherwise you might have to add another post-processing step, e.g. using part-of-speech analysis with a proper NLP framework.
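One very simple post-processing heuristic (my own suggestion, not part of the answer above): prefer the split with the fewest words, which for this input favors "another" over "an other":
candidates = find_sentences("andhereisanothersentense")
best = min(candidates, key=len)   # the split using the fewest (i.e. longest) words
print(" ".join(best))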

Extracting text from APA citation

I have a spreadsheet containing APA-style citations and I want to split them into author(s), date, and title.
An example of a citation would be:
Parikka, J. (2010). Insect Media: An Archaeology of Animals and Technology. Minneapolis: Univ Of Minnesota Press.
Given this string is in cell I2, I managed to do the following:
Name: =LEFT(I2, FIND("(", I2)-1) yields Parikka, J.
Date: =MID(I2,FIND("(",I2)+1,FIND(")",I2)-FIND("(",I2)-1) yields 2010
However, I am stuck at extracting the name of the title Insect Media: An Archaeology of Animals and Technology.
My current formula =MID(I2,FIND(").",I2)+2,FIND(").",I2)-FIND(".",I2)) only returns the title partially - the output should show every character between "). " and the following ". ".
I tried =REGEXEXTRACT(I2, "\)\.\s(.*[^\.])\.\s" ) and this generally works, but it does not stop at the first ". ", as with this example:
Sanders, E. B.-N., Brandt, E., & Binder, T. (2010). A framework for organizing the tools and techniques of participatory design. In Proceedings of the 11th biennial participatory design conference (pp. 195–198). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1900476
Where is the mistake?
The title can be found (in the two examples you've given, at least) with this:
=MID(I2,find("). ",I2)+3,find(". ",I2,find("). ",I2)+3)-(find("). ",I2)+3)+1)
In English: get the substring starting after the first occurrence of "). ", up to and including the first occurrence of ". " that follows.
If you wish to use REGEXEXTRACT, then this works (on your two examples). (You can also see a Regex101 demo.):
=REGEXEXTRACT(I3,"(?:.*\(\d{4}\)\.\s)([^.]*\.)(?: .*)")
Where is the mistake?
In your expression, you were capturing (.*[^\.]): the greedy .* can run across any number of characters, including periods, so multiple sentences can end up in the capture; the trailing [^\.] only requires that the last captured character is not a dot. The expression finished with \.\s, which wasn't captured, so the capture group ends before a period-then-space rather than including it.
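To see the difference outside the spreadsheet, here is a quick Python illustration of the same two patterns (Google Sheets' REGEXEXTRACT uses RE2, which behaves the same as Python's re for these particular constructs):
import re

citation = ("Sanders, E. B.-N., Brandt, E., & Binder, T. (2010). "
            "A framework for organizing the tools and techniques of participatory design. "
            "In Proceedings of the 11th biennial participatory design conference (pp. 195-198). "
            "ACM.")

# The question's pattern: the greedy .* is free to run past periods, so the
# capture stretches to the last ". " it can still satisfy.
print(re.search(r"\)\.\s(.*[^\.])\.\s", citation).group(1))

# Negated character class: [^.]* cannot cross a period, so the capture stops
# at the first "." after "(2010). ".
print(re.search(r"\)\.\s([^.]*\.)", citation).group(1))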
Try:
=split(SUBSTITUTE(SUBSTITUTE(I2, "(",""), ")", ""),".")
If you don't replace the parentheses around 2010, it thinks it is a negative number -2010.
For your title, try adding INDEX and SPLIT to your existing formula:
=index(split(REGEXEXTRACT(A5, "\)\.\s(.*[^\.])\.\s" ),"."),0,1)&"."

use WordNet to generalize specific word to higher-order concept

Does WordNet have "higher order" concepts? How to generate them for a given word?
I have a corpus of data in the form of Prolog 'facts'. I want to generalize the conceptual components, e.g. 'contains'('oranges', 'vitamin c'). and 'contains'('spinach','iron'). would be generalized to 'contains'(<food>, <nutrient>).
I don't know WordNet very well, so one thing I was thinking about was just to generate all possible hypernyms and then combinatorially elaborate every possible new rule, but this is sort of a 'brute force' approach.
Does WordNet store higher order concepts such as <food> for instance? That might make it easier, because then I can just create one new rule with the higher order concept of that particular variable, assuming that there is one in WordNet, as opposed to perhaps fifty or one hundred if I do it the brute force way.
So what I actually want to know is: is there a command to generate the higher-order concepts for each of the three components within a given 'fact'? Or maybe just for the two that are inside the parentheses. If such a command exists, what is it?
Below is some of the data I'm working with for reference.
'be'('mr jiang', 'representing china').
'be'('hrh', 'britain').
'be more than'('# distinguished guests', 'the principal representatives').
'end with'('the playing of the british national anthem', 'hong kong').
'follow at'('the stroke of midnight', 'this').
'take part in'('the ceremony', 'both countries').
'start at about'('# pm', 'the ceremony').
'end about'('# am', 'the ceremony').
'lower'('the british hong kong flag', '# royal hong kong police officers').
'raise'('the sar flag', 'another #').
'leave for'('the royal yacht britannia', 'the #').
'hold by'('the chinese and british governments', 'the handover of hong kong').
'rise over'('this land', 'the regional flag of the hong kong special administrative region of the people \'s republic of china').
'cast eye on'('hong kong', 'the world').
'hold on'('schedule', 'the # governments').
'be festival for'('the chinese nation', 'this').
'go in'('the annals of history', 'july # , #').
'become master of'('this chinese land', 'the hong kong compatriots').
'enter era of'('development', 'hong kong').
'remember'('mr deng xiaoping', 'history').
'be along'('the course', 'it').
'resolve'('the hong kong question', 'we').
'wish to express thanks to'('all the personages', 'i').
'contribute to'('the settlement of the hong kong', 'both china and britain').
'support'('hong kong \'s return', 'the world').
WordNet refers to higher-order concepts as "hypernyms". A hypernym for the color "green", for instance, is "chromatic color", because the color green belongs to the higher-order class of chromatic colors.
One should note that WordNet differentiates between "words" (strings of characters) and "synsets" (the meanings we associate with a given string of characters). Just as one word can have multiple meanings, one string can have multiple synsets. If you want to retrieve all of the higher-order meanings for a given word, you can run these lines in Python:
from nltk.corpus import wordnet as wn

# Find all the synsets for "green", then print each one together with its hypernyms.
for synset in wn.synsets('green'):
    for hypernym in synset.hypernyms():
        print(synset, hypernym)
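Building on that, here is a minimal sketch of how one might look for a shared higher-order concept for two fact arguments (the common_hypernyms helper is hypothetical, word-sense disambiguation is skipped, and the NLTK WordNet corpus is assumed to be downloaded):
from nltk.corpus import wordnet as wn

def common_hypernyms(word_a, word_b):
    # Try every noun-synset pair and collect their lowest common hypernyms;
    # inspect the candidates by hand to pick the level of generality you want.
    results = set()
    for syn_a in wn.synsets(word_a, pos=wn.NOUN):
        for syn_b in wn.synsets(word_b, pos=wn.NOUN):
            results.update(syn_a.lowest_common_hypernyms(syn_b))
    return results

print(common_hypernyms('orange', 'spinach'))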
