Custom names detection - machine-learning

This project is at a really early stage and I'm trying to find ideas on where to start.
Any help or pointers would be greatly appreciated!
My problem:
I have text on one side, and a list of named GraphDB elements on the other (usually the name is either an acronym or a multi-word expression). My texts are not annotated.
I want to detect whenever a name is explicitly used in the text. The trick is that it will not necessarily be a perfect string match (for example, an acronym can be used to shorten a multi-word expression, or a small part can be left out). So a simple string search will not have 100% recall (even though it can be used as a starting point).
If I just had a single input and wanted to match it to one of the names, I would do a simple edit distance computation and that's it. What bugs me is that I have to do this for a whole text, and I don't know how to approach or break down the problem.
I cannot break down everything in N-grams because my named entities can be a single word or up to seven words long... Or can I?
I have thousands of Graph elements so I don't think NER can be applied here... Or can it?
An example could be:
My list of names is ['Graph Database', 'Manager', 'Employee Number 1']
The text is:
Every morning, the Manager browse through the Graph Database to look for updates. Every evening, Employee 1 updates the GraphDB.
In this block of text, I want to map the 4 highlighted portions (Manager, Graph Database, Employee 1 and GraphDB) to their corresponding items in the list.
I have a small background in Machine Learning but I haven't really ever done NLP. To be clear, I do not care about the meaning of these words, I just want to be able to detect them.
Thanks
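As a possible starting point, here is a minimal sketch of the N-gram idea raised above: slide a window of 1 to 7 words over the text, score each window against every name with a string-similarity measure, and keep the best non-overlapping windows. The function names (best_match, find_mentions), the use of difflib's similarity ratio in place of a true edit distance, and the 0.65 threshold are all assumptions made for illustration, not a tuned solution.

```python
import re
from difflib import SequenceMatcher


def best_match(span, names):
    """Return (name, similarity) for the listed name closest to `span`."""
    scored = [(SequenceMatcher(None, span.lower(), n.lower()).ratio(), n)
              for n in names]
    score, name = max(scored)
    return name, score


def find_mentions(text, names, max_words=7, threshold=0.65):
    """Score every window of 1..max_words words and keep the best ones."""
    words = [m.span() for m in re.finditer(r"\w+", text)]
    candidates = []
    for i in range(len(words)):
        for j in range(i, min(i + max_words, len(words))):
            start, end = words[i][0], words[j][1]
            name, score = best_match(text[start:end], names)
            if score >= threshold:  # threshold is an arbitrary assumption
                candidates.append((score, start, end, name))
    # Greedy overlap resolution: take the highest-scoring window first and
    # drop any later window that overlaps one already chosen.
    chosen = []
    for score, start, end, name in sorted(candidates, reverse=True):
        if all(end <= s or start >= e for _, s, e, _ in chosen):
            chosen.append((score, start, end, name))
    chosen.sort(key=lambda c: c[1])
    return [(text[s:e], name, round(score, 2)) for score, s, e, name in chosen]


names = ['Graph Database', 'Manager', 'Employee Number 1']
text = ("Every morning, the Manager browse through the Graph Database to look "
        "for updates. Every evening, Employee 1 updates the GraphDB.")
for span, name, score in find_mentions(text, names):
    print(span, "->", name, score)
```

With thousands of names the inner comparison becomes the bottleneck, so in practice one would probably pre-filter candidates (for example by shared words or first letters) before scoring them.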

Related

How to infer surprisingly "missing" data in a large set of strings. Or nerds do unique (but sane) baby names

I was thinking about this problem the other day when trying to find an applicable email address for my very common name.
Let's say I had all the names of the roughly 150 million men in the United States in a file, and I wanted to figure out "men who don't exist but sound like they should". That is, I wanted to figure out a combination of names (First, Middle, Last) that exist without a person being named that combination in my record of all names. Let's say I appreciate the advantages of unique names but don't want any of the disadvantages of unfamiliarity and mispronunciation.
Of course I could make up a name like "Nickleback Sunshine Cheeseburger" and reasonably suspect that nobody would be named this combination but that may confuse people so I want names that exist in the set. So names like "Chao-Lin" which have different languages of origin although they may appear with the last name "Jones" would not be as likely to appear with Jones and seem more consistent with a last name of similar language origin like "Chao-Lin Kuo". José is more likely to appear with Gonzalez than Patel and so on.
Of course, any of these notions would have to be reinforced by the structure of the data.
So an example would be if "John Marcus Black" doesn't exist: that would be interesting because all of its component names are common and appear together frequently, just not in that order.
The first thing that came into my mind was some sort of trie or directed graph that is weighted by frequency but that only really works for an "autocomplete" like feature where what we are looking for is not actually present in the set. I was thinking about suffix trees as well but not sure if this is a good use case.
I'm sure there is a machine learning algorithm that would be sufficient in finding these names but I don't know very many.
Bonus, the most normal unique name given a necessary last name. Given a starting name like "Smith", come up with most surprising missing names.
tl;dr 1. Given all the names of men in the US in a file, find n names that probably should exist but don't. Also: some men have middle names, some don't.
The obvious choice would be character level Markov chains.
That won't prevent the generation of existing names, or of profanity, though. I.e., it might combine FUnk and niCK.
You could then rank the results by some surprisingness measure, e.g., based on character bigram/trigram frequencies.
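A rough sketch of that suggestion, under stated assumptions: a character-level Markov chain of order 2 is trained on the known names, candidates that already exist are discarded, and the mean log-probability of a candidate's character transitions serves as one possible ranking score. The helper names, the tiny sample list and the padding symbols "^" and "$" are all made up for illustration. It works on single name strings and does nothing about profanity.

```python
import math
import random
from collections import defaultdict

START, END = "^", "$"  # padding symbols, assumed never to occur in a name


def train(names, order=2):
    """Map each `order`-character context to the characters that follow it."""
    model = defaultdict(list)
    for name in names:
        padded = START * order + name.lower() + END
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model


def generate(model, rng, order=2, max_len=20):
    """Sample one name by walking the chain until the end symbol."""
    state, out = START * order, []
    while len(out) < max_len:
        nxt = rng.choice(model[state])
        if nxt == END:
            break
        out.append(nxt)
        state = state[1:] + nxt
    return "".join(out)


def mean_log_prob(model, name, order=2):
    """Average log-probability of the transitions; higher means more typical."""
    padded = START * order + name.lower() + END
    total = 0.0
    for i in range(len(padded) - order):
        followers = model.get(padded[i:i + order], [])
        count = followers.count(padded[i + order])
        if count == 0:
            return float("-inf")
        total += math.log(count / len(followers))
    return total / (len(padded) - order)


def novel_names(names, n, seed=0, order=2, attempts=10000):
    """Return up to n generated names that are absent from the input."""
    known = {x.lower() for x in names}
    model, rng, found = train(names, order), random.Random(seed), set()
    for _ in range(attempts):
        cand = generate(model, rng, order)
        if cand and cand not in known:
            found.add(cand)
            if len(found) == n:
                break
    return sorted(found, key=lambda c: mean_log_prob(model, c, order),
                  reverse=True)


sample = ["marcus", "mark", "martin", "justin", "jason", "mason"]
print(novel_names(sample, 2))
```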

vtd xml diff implementation

I have many VTD+XML indexes for different versions of the same file, and I am hoping to implement a diff-like method that returns the XPaths of nodes that have been modified between versions, as well as the difference between the text within those nodes.
I figure using an existing algorithm such as O(ND) difference would be best to compare the text within two nodes. Thus the approach I envisioned would be to traverse the two documents simultaneously and store the XPath that corresponds to any nodes that contain text variations.
The issue is that once I encounter new or removed nodes, how do I determine whether the node is in fact an inserted/removed node or a variation of an existing node?
Or maybe there is another approach I should be taking?
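For the text-within-a-node part, a minimal sketch in Python rather than against the VTD-XML API: difflib's SequenceMatcher reports which word runs were replaced, inserted or deleted between two strings. Note that difflib uses its own matching heuristic, not the O(ND) algorithm mentioned above, and the function name text_changes is made up for this example.

```python
from difflib import SequenceMatcher


def text_changes(old_text, new_text):
    """List (operation, old_words, new_words) for every non-equal word run."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = SequenceMatcher(None, old_words, new_words)
    return [(tag, old_words[i1:i2], new_words[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


print(text_changes("the quick brown fox", "the quick red fox"))
```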
Maybe my interpretation of your question is not exactly on the mark. But I feel that what you are trying to do may not have easy answers... consider the following XML snippet
<a>
<b>text1</b>
<b>text1</b>
</a>
and
<a>
<b>text2</b>
<b>text1</b>
</a>
You could say the second XML is simply the first one with text1 in the first b node replaced by text2.
But you could also say the second XML is the first one after removing the first b node, changing text1 of the second b node to text2, and then inserting text1 after the second b node.
In summary, it seems you don't just want to know what the differences are, but also the changes that led to those differences. This is difficult because different sequences of changes can lead to the same output.
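To make the ambiguity concrete, here is a small sketch that simply commits to the first interpretation: children are paired by position, so a changed node is always reported as a text change at that position and never as a remove-plus-insert. It uses Python's xml.etree.ElementTree instead of VTD-XML, and the function name and the path format (/a/b[1]) are assumptions for illustration. The path index is only reliable while both documents have the same tags in the same order.

```python
import xml.etree.ElementTree as ET
from itertools import zip_longest


def positional_diff(old, new, path):
    """Compare two elements child by child, pairing children by position."""
    changes = []
    old_text, new_text = (old.text or "").strip(), (new.text or "").strip()
    if old_text != new_text:
        changes.append((path, "text", old_text, new_text))
    seen = {}  # per-tag counter used to build the positional path
    for o, n in zip_longest(list(old), list(new)):
        child = o if o is not None else n
        seen[child.tag] = seen.get(child.tag, 0) + 1
        child_path = f"{path}/{child.tag}[{seen[child.tag]}]"
        if o is None:
            changes.append((child_path, "inserted", None, n.tag))
        elif n is None:
            changes.append((child_path, "removed", o.tag, None))
        elif o.tag != n.tag:
            changes.append((child_path, "replaced", o.tag, n.tag))
        else:
            changes.extend(positional_diff(o, n, child_path))
    return changes


first = ET.fromstring("<a><b>text1</b><b>text1</b></a>")
second = ET.fromstring("<a><b>text2</b><b>text1</b></a>")
print(positional_diff(first, second, "/a"))
# [('/a/b[1]', 'text', 'text1', 'text2')]
```

Any smarter pairing (matching nodes by key attribute or by content similarity) is a policy decision about which edit script you prefer, which is exactly the point made above.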

Delphi - What Structure allows for SAVING inverted index type of information?

Delphi XE6. Looking to implement a limited style of search, specifically an edit field for the user to enter a business name which would get looked up. I need to allow the user to enter multiple words, or parts of multiple words. For example, for a business "First Bank of Kansas", the user should be able to enter "Fir Kan", and it should return a match. This means an inverted index type of structure: I have some type of list of each unique word, and for each word a list of IDs (document ID, primary key ID, etc., each of which is an integer). I am struggling with WHAT type of structure to make this... I have approximately 250,000 business names, which have 43,500 unique words. Word count will vary from 1 occurrence of a word to several thousand (company, corporation, etc.). I have some requirements...
1). Assume the user enters BAN. I need to find ALL words that start with BAN. I need to return BANK, BANKER, etc... This means that whatever structure I use, I have to be able to find BAN and then move to the next alphabetic entry... and keep moving to the next until I find a value that does NOT start with BAN. This eliminates any type of HASH structure, correct?
2). I obviously want this to be fast. HASH is the fastest, but I can't use this, correct? See requirement 1.
3). Each entry in this structure needs to be able to hold a list of integers. If I end up going with a LinkedList, then each element has to hold a list of Integers.
4). I need to be able to save and load this structure. I don't want to have to build it each time I use it.
Whatever I end up with, it appears to have to be a NESTED structure, a higher level list (LinkedList?) with each node being an Integer List.
What am I looking for? What do commercial products use? Outlook, etc. have search capabilities.
Every word is linked to a specific set of IDs, each representing a business name, right?
I recommend using a binary tree data structure, because a lookup normally takes O(log n) time when the tree is balanced, which is quite fast. Especially if business names change at runtime, an AVL tree should do well, although it's quite some work to implement one yourself. But there should be many ready-to-use binary tree units all over the internet.
For each word found in the tree, take its list of IDs and aggregate those lists, grouped by the entered search term that matched.
As the last step you take all those aggregated lists of IDs and do an intersection.
There should only be IDs left which are fitting to all entered words. Those IDs are referencing the searched business names.
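The lookup described above can be sketched as follows, in Python rather than Delphi and with a sorted word list plus binary search standing in for the balanced tree (both give ordered O(log n) access to the first word with a given prefix). The function names and the sample businesses other than "First Bank of Kansas" are invented for illustration.

```python
import bisect
import re
from collections import defaultdict


def build_index(businesses):
    """businesses: {id: name}. Returns (sorted word list, word -> set of ids)."""
    postings = defaultdict(set)
    for business_id, name in businesses.items():
        for word in re.findall(r"\w+", name.upper()):
            postings[word].add(business_id)
    return sorted(postings), postings


def ids_for_prefix(words, postings, prefix):
    """Union of the ID lists of every word starting with `prefix`."""
    prefix = prefix.upper()
    i = bisect.bisect_left(words, prefix)  # first word >= prefix
    ids = set()
    while i < len(words) and words[i].startswith(prefix):
        ids |= postings[words[i]]
        i += 1  # move to the next alphabetic entry
    return ids


def search(words, postings, query):
    """Intersect the ID sets of all entered (partial) words."""
    parts = query.split()
    if not parts:
        return set()
    result = ids_for_prefix(words, postings, parts[0])
    for part in parts[1:]:
        result &= ids_for_prefix(words, postings, part)
    return result


businesses = {1: "First Bank of Kansas", 2: "First Banker Trust",
              3: "Kansas Farm Supply"}
words, postings = build_index(businesses)
print(search(words, postings, "Fir Kan"))  # {1}
print(search(words, postings, "BAN"))      # {1, 2}
```

Because the structure is just a sorted list of words and a list of integers per word, it can be written to and read back from a file in that same order, which would cover the save/load requirement.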

RegExp as table entries

I'm building an application that takes inputs from SMS text through Twilio. I'd like to build a table that matches the incoming SMS body with the appropriate response.
For example, imagine I'm building an NFL text message thing.
Someone texts in 'Redskins' and we text back, "The Redskins play at FedEx field"
Someone texts in 'Colts' and we text back, "The Colts are the pride of Indiana."
Here's the tricky part:
Of course, our Rails app is going to need to interpret the incoming team names through Regular Expressions, as many people will text in: Redskins or REDSKINS or REDSKIN or Redskin or REDskin.....
With one or two teams, one could just hardcode the RegExp and response into the controller...but with 30 teams, that seems wrong. (And with 120 entries -- say all pro sports-- even worse).
Does anyone have any tips on getting the team names from the input stage, through the DB table stage, with a 'RegExp' conversion in the middle?
Thanks in advance.
For a modest number of keywords, I recommend a two-table approach with Keywords and Aliases, always stored in lower case. Convert the input to lower case. For each Keyword (say, redskins) you manually add 5-10 variations (including the correct one) in Aliases, all of which have Alias.keyword_id = the id of the keyword. So you simply search Alias for the user input, and if you find a match you have the keyword_id of the keyword.
It has two advantages: it is fast and easy to extend... if you log the "no matches" you'll get a list of new aliases to add once to the database. MUCH easier and more reliable than trying to do it via regex.
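A minimal sketch of the two-table idea, using Python's sqlite3 with an in-memory database rather than a Rails model. The table and column names, the particular aliases and the respond_from_alias helper are assumptions for illustration. The responses are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, name TEXT, response TEXT);
    CREATE TABLE aliases  (alias TEXT PRIMARY KEY,
                           keyword_id INTEGER REFERENCES keywords(id));
""")
conn.executemany(
    "INSERT INTO keywords (id, name, response) VALUES (?, ?, ?)",
    [(1, "redskins", "The Redskins play at FedEx field"),
     (2, "colts", "The Colts are the pride of Indiana.")])
# Aliases are stored in lower case and include the correct spelling.
conn.executemany(
    "INSERT INTO aliases (alias, keyword_id) VALUES (?, ?)",
    [("redskins", 1), ("redskin", 1), ("skins", 1),
     ("colts", 2), ("colt", 2)])


def respond_from_alias(conn, sms_body):
    """Lower-case the input, look it up in aliases, return the response."""
    row = conn.execute(
        "SELECT k.response FROM aliases a "
        "JOIN keywords k ON k.id = a.keyword_id WHERE a.alias = ?",
        (sms_body.strip().lower(),)).fetchone()
    return row[0] if row else None  # None is the "no match" case to log


print(respond_from_alias(conn, "REDskin"))
```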
I don't think you want regexps here. What about spelling errors? For helpfulness (esp coming from a txt msg) I think you want to allow shortenings too.
Maybe a Soundex-based library or spelling correction thing would be best. You want a nearest match algorithm not a patterned match one.
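As one way to get a nearest match without Soundex, Python's standard difflib.get_close_matches picks the most similar known name above a cutoff. This is a swapped-in similarity measure, not Soundex, and the 0.8 cutoff and the nearest_team name are arbitrary assumptions.

```python
from difflib import get_close_matches

TEAM_NAMES = ["redskins", "colts"]


def nearest_team(sms_body, cutoff=0.8):
    """Return the closest team name, or None if nothing is similar enough."""
    matches = get_close_matches(sms_body.strip().lower(), TEAM_NAMES,
                                n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(nearest_team("Redskns"))  # redskins
print(nearest_team("cols"))     # colts
print(nearest_team("giants"))   # None
```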
If the text message is not too long, you should first chop that into words, and then take an intersection with the list of team names.
array_of_team_names = %w(Redskins Colts ... ) # keep it all capitalized
'cOLts blah blah'.scan(/\w+/).map{|word| word.capitalize} & array_of_team_names
# => ['Colts']
If you want to handle mistypes as suggested by drysdam, or if you want to handle larger text with more accuracy, you should use some library specific to that.
I think what you are asking is "how do I avoid hardcoding a regexp into my code, since I might have a lot of them, and they are really a data element"?
If you want to do the matching with regexps, note that you can create a regexp from a string, so you could easily have a table that contains a column of regexps in string form. You can then dynamically create the array of regexp objects that you'd use to search the incoming string. The trick is what to do when you have a match. You'll need to develop a set of rules (yet another table) that basically says which response to pick based on the incoming text. For example, if your rule is simply "match based on the team name and say where they play", that's pretty easy: each regexp that you are searching for maps to exactly one action ("The Bears play in Chicago"). If your rules are more complicated (look for the Bears, and then look to see if the word "schedule" is in there too, as well as "first game(s)"), then you'd need another table that maps a collection of matches to a response.
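The simple one-regexp-to-one-response case could look roughly like this. It is sketched in Python rather than Ruby, with a plain list standing in for rows read from the table. The patterns and the respond_from_rules helper are made up for illustration.

```python
import re

# Stand-in for rows loaded from a (pattern, response) table; patterns are
# stored as plain strings and compiled once at start-up.
RULE_ROWS = [
    (r"\bredskins?\b", "The Redskins play at FedEx field"),
    (r"\bcolts?\b", "The Colts are the pride of Indiana."),
]
COMPILED_RULES = [(re.compile(pattern, re.IGNORECASE), response)
                  for pattern, response in RULE_ROWS]


def respond_from_rules(sms_body):
    """Return the response of the first rule whose regexp matches."""
    for regexp, response in COMPILED_RULES:
        if regexp.search(sms_body):
            return response
    return None


print(respond_from_rules("go REDskin!"))
```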

Lucene partial word matching

Lucene does not support it out of the box, so I need some help building my query.
Let's say I have a document with a field value "Develop".
I would like this document to be returned for the searches "Dev" and "lop".
Maybe creating two queries?
"*keyword"
and
"keyword*"
and
"keyword"
?
How would you go about doing this with multiple words? Would you split the sentence/search into a list of words and do the previous example for each word?
What you're asking is, if I understand you correctly, not feasible on any large-scale search engine.
Lucene creates an index over keywords using term-document matrix and inverted-file techniques (see links at the bottom). Fully fledged string matching might be very nice to have, but it does not scale: you will never be able to query a decently sized index (say, more than a few dozen or a few hundred documents) in an acceptable time.
Still, here are two ideas that might help...
Syllable tokenization
To come back to your example with 'Develop': as long as you are happy with letting users search for syllables, I guess you can do something.
You would have to use a tokenizer that splits the words in your index into syllables and create a database index over the syllables. (I am not sure there are built-in tokenizers for the English language that can do that, and writing one on your own might be tricky...)
An important thing to note:
If you index the full words AND the separate syllables, the size of your index will be much larger than if you only index one of the two.
However, I would not suggest indexing only syllables. If you also want to allow your users to search for the full word 'Develop' (which I guess you want), this would result in two queries with a logical AND between them, namely <'dev' AND 'lop'>. Although Lucene supports such logical constructs in queries, they are very expensive. I have personally had some trouble in the past using logical queries in Lucene.
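To illustrate the general idea of indexing word fragments, here is a toy sketch that substitutes fixed-length character trigrams for syllables, since splitting into real syllables is the hard part. This is plain Python, not Lucene, and every name in it is an assumption for the example. It also shows why the index grows: each word contributes one entry per fragment.

```python
from collections import defaultdict


def fragments(word, n=3):
    """All character n-grams of a word, lower-cased."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def build_fragment_index(docs, n=3):
    """docs: {doc_id: field value}. Maps each fragment to the docs using it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for word in value.split():
            for fragment in fragments(word, n):
                index[fragment].add(doc_id)
    return index


def partial_search(index, docs, query, n=3):
    """Docs containing `query` as a substring of the field value."""
    query = query.lower()
    grams = fragments(query, n)
    if not grams:  # query shorter than n: no fragment to look up, scan all
        candidates = set(docs)
    else:  # logical AND over the fragments of the query
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Final check removes candidates whose fragments match but not in sequence.
    return {d for d in candidates if query in docs[d].lower()}


docs = {1: "Develop", 2: "Deliver"}
index = build_fragment_index(docs)
print(partial_search(index, docs, "Dev"))  # {1}
print(partial_search(index, docs, "lop"))  # {1}
```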
Stemming
Another way to get close to what you're trying to do could be to use a brutal form of word stemming (http://en.wikipedia.org/wiki/Stemming) that stems words to their first syllable. (This would allow searching for 'dev' but not for 'lop'...)
Again, I don't think such a word stemming feature is already in Lucene. Writing one yourself will be a pain and will involve working with/importing huge dictionaries.
Links
These might be worth looking into if you don't know about search engine internals:
http://en.wikipedia.org/wiki/Index_%28search_engine%29
http://en.wikipedia.org/wiki/Vector_space_model
http://en.wikipedia.org/wiki/Inverted_file
http://en.wikipedia.org/wiki/Term-document_matrix
http://en.wikipedia.org/wiki/Tf-idf
