How to infer surprisingly "missing" data in a large set of strings. Or: nerds do unique (but sane) baby names - machine-learning

I was thinking about this problem the other day when trying to find an applicable email address for my very common name.
Let's say I had all the names of the roughly 150 million men in the United States in a file, and I wanted to figure out "men who don't exist but sound like they should". That is, I wanted to find combinations of names (first, middle, last) whose parts all exist, but where no person in my record of all names has that exact combination. Let's say I appreciate the advantages of a unique name but want none of the disadvantages of unfamiliarity and mispronunciation.
Of course I could make up a name like "Nickleback Sunshine Cheeseburger" and reasonably suspect that nobody is named that, but such a name would confuse people, so I want names whose parts all exist in the set. Likewise, a first name like "Chao-Lin" may occasionally appear with the last name "Jones", but it is much more likely to appear with a last name of similar language origin, as in "Chao-Lin Kuo"; "José" is more likely to appear with "Gonzalez" than with "Patel", and so on.
Of course any of these notions would have to be reinforced by the structure of the data.
So an example: if "John Marcus Black" doesn't exist, that would be interesting, because all three names are common and appear together frequently, just not in that exact combination.
The first thing that came to mind was some sort of trie or directed graph weighted by frequency, but that only really works for an autocomplete-like feature, whereas what we are looking for is not actually present in the set. I was thinking about suffix trees as well, but I'm not sure whether this is a good use case.
I'm sure there is a machine learning algorithm that would be suited to finding these names, but I don't know very many.
Bonus: the most normal unique name given a required last name. Given a starting name like "Smith", come up with the most surprising missing full names.
tl;dr: Given all the names of men in the US in a file, find n names that probably should exist but don't. Also: some men have middle names, some don't.
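One crude way to make the "John Marcus Black" intuition concrete: treat the three name slots as independent and rank missing combinations by the product of their component frequencies. A minimal sketch (the file name, the top-200 cutoff, and the independence scoring are all assumptions of mine):

from collections import Counter
from itertools import product

# Sketch: "names.txt" holds one "First Middle Last" name per line (assumed).
names = []
with open("names.txt") as f:
    for line in f:
        parts = tuple(line.split())
        if len(parts) == 3:  # ignore men without middle names, for simplicity
            names.append(parts)

existing = set(names)
firsts = Counter(n[0] for n in names)
middles = Counter(n[1] for n in names)
lasts = Counter(n[2] for n in names)

# Restrict to the most common parts so the cross product stays manageable.
candidates = product(
    [w for w, _ in firsts.most_common(200)],
    [w for w, _ in middles.most_common(200)],
    [w for w, _ in lasts.most_common(200)],
)

# A missing combination is "surprising" in proportion to its expected
# frequency under independence: freq(first) * freq(middle) * freq(last).
missing = [
    (firsts[f] * middles[m] * lasts[l], f"{f} {m} {l}")
    for f, m, l in candidates
    if (f, m, l) not in existing
]
missing.sort(reverse=True)
print(missing[:10])  # the n most "should exist but don't" names

Replacing the marginal counts with first/last pair counts would capture the language-of-origin effect ("Chao-Lin Kuo" vs. "Chao-Lin Jones") that pure independence misses.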

The obvious choice would be character-level Markov chains.
That won't prevent the generation of existing names, or of profanity, though; e.g., it might combine "FUnk" and "niCK".
You could then rank the results by some surprisingness measure, e.g. one based on character bigram/trigram frequencies.
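A minimal sketch of that, with an order-2 chain and a toy training set (both assumptions); filtering the output against the existing set handles the first caveat, while profanity would need a separate blocklist:

import random
from collections import Counter, defaultdict

ORDER = 2  # chain order; longer contexts give more name-like output

def train(names):
    model = defaultdict(Counter)
    for name in names:
        padded = "^" * ORDER + name.lower() + "$"  # start/end markers
        for i in range(len(padded) - ORDER):
            model[padded[i:i + ORDER]][padded[i + ORDER]] += 1
    return model

def generate(model):
    state, out = "^" * ORDER, []
    while True:
        chars, weights = zip(*model[state].items())
        c = random.choices(chars, weights=weights)[0]
        if c == "$":
            return "".join(out).capitalize()
        out.append(c)
        state = state[1:] + c

existing = {"nick", "funk", "marcus", "john", "black"}  # toy training set
model = train(existing)
samples = {generate(model) for _ in range(200)}
# Generation alone can reproduce existing names (or blend them badly), so filter:
novel = sorted(n for n in samples if n.lower() not in existing)
print(novel)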

Related

Custom names detection

This is a project in a really early phase, and I'm trying to find ideas on where to start.
Any help or pointers would be greatly appreciated!
My problem:
I have text on one side, and a list of named GraphDB elements on the other (usually the name is either an acronym or a multi-word expression). My texts are not annotated.
I want to detect whenever a name is explicitly used in the text. The trick is that it will not necessarily be a perfect string match (for example, an acronym can be used to shorten a multi-word expression, or a small part can be left out). So a simple string search will not have 100% recall (even though it can be used as a starting point).
If I just had a single input string and wanted to match it to one of the names, I would do a simple edit-distance computation and be done. What bugs me is that I have to do this over a whole text, and I don't know how to approach or break down the problem.
I cannot break everything down into N-grams, because my named entities can be a single word or up to seven words long... Or can I?
I have thousands of Graph elements so I don't think NER can be applied here... Or can it?
An example could be:
My list of names is ['Graph Database', 'Manager', 'Employee Number 1']
The text is:
Every morning, the Manager browses through the Graph Database to look for updates. Every evening, Employee 1 updates the GraphDB.
I want to map the four name mentions in this text ("Manager", "Graph Database", "Employee 1", "GraphDB") to their corresponding items in the list.
I have a small background in Machine Learning but I haven't really ever done NLP. To be clear, I do not care about the meaning of these words, I just want to be able to detect them.
Thanks
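One way to make the N-gram idea concrete despite the varying entity lengths: slide windows of one to seven tokens over the text and score every window against every name with an edit-distance ratio. A minimal sketch (difflib is in the standard library; the 0.7 threshold is an assumption to tune on real data, and acronyms like "GraphDB" would still need a separate expansion step, e.g. also comparing windows against the initials of each multi-word name):

from difflib import SequenceMatcher

names = ["Graph Database", "Manager", "Employee Number 1"]
text = ("Every morning, the Manager browses through the Graph Database to "
        "look for updates. Every evening, Employee 1 updates the GraphDB.")

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

tokens = text.replace(",", " ").replace(".", " ").split()
matches = []
for size in range(1, 8):                      # entities are 1 to 7 words long
    for i in range(len(tokens) - size + 1):
        window = " ".join(tokens[i:i + size])
        for name in names:
            score = similarity(window, name)
            if score >= 0.7:
                matches.append((window, name, round(score, 2)))

# Best scores first; real code would also drop overlapping weaker windows.
for m in sorted(matches, key=lambda m: -m[2]):
    print(m)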

database design for dictionary of words

(my reason for asking this question is based on having read this answer, which made me rethink my current setup)
I am currently developing a Ruby on Rails application in which there are many languages, each of which has a dictionary of base words attached to it, as well as a list of the words that map to each base word. The way I currently have it set up, there is a base_words table that contains the base_word as a string, along with the language_id as a foreign key. There is also a words table, each row of which contains a word string, along with the base_word_id as a foreign key. Each words row also has an indexed language_id, although I'm almost positive that this is superfluous given the language_id on base_words, so I'm planning to remove it (although this could be a bad assumption on my part).
In sum, contrary to the answer I mentioned at the beginning, the tables are not separated by language, because I've reasoned that I can simply pull out one language's words programmatically when the time comes. However, my application will also have translation(s) associated with each base word (as did the answer I referenced), and so I'm doubting my structure, because each translation will actually be a base word in the same table, which means a translation would really just be the id of another base word in said table. This may be completely fine, or it may not be - I have no clue (this is my first ever programming project).
Is this ok? Do I need to separate my base_words into separate tables for each language, or can I leave it all in one table?
Another example: I also need to store many phrases for each language, along with their translations. Should I have one table where each row has the appropriate translation of the phrase, or one table where each row contains simply one phrase and a language_id, or multiple tables (one for each language)?
Regards,
Michael
As in the other scenario, you'll have a translations table. There is no technical reason it couldn't have multiple foreign keys to base_words (a source_word_id and target_word_id, perhaps). So yes, you can absolutely store all your words in one table. There are some minor side effects involved with translations being directional relationships: it becomes possible to have translations which only work one way, and there will be many pairs of entries with opposite source and target. Neither of these is much of a worry: the first is even potentially desirable in order to represent words with double meanings in one language but not the other, and as for the second, space is cheap and indexing is easy.
You are correct that you do not need words.language_id, so long as you always join base_words when you're querying words and the language matters. This obviously changes if you have a use case where it makes sense to leave base_words out, but that scenario sounds unlikely based on what you describe.
As for phrases: why should they be handled any differently than base_words?
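A sketch of that layout in plain SQL (run through sqlite3 here only so it is self-contained; Rails migrations would mirror it, and all names follow the question):

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE languages (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE base_words (
    id          INTEGER PRIMARY KEY,
    base_word   TEXT NOT NULL,
    language_id INTEGER NOT NULL REFERENCES languages(id)
);
CREATE TABLE words (
    id           INTEGER PRIMARY KEY,
    word         TEXT NOT NULL,
    base_word_id INTEGER NOT NULL REFERENCES base_words(id)
    -- no language_id here: it arrives by joining base_words
);
-- Directional: one row per (source -> target); add the reverse row if wanted.
CREATE TABLE translations (
    source_word_id INTEGER NOT NULL REFERENCES base_words(id),
    target_word_id INTEGER NOT NULL REFERENCES base_words(id),
    PRIMARY KEY (source_word_id, target_word_id)
);
""")

# The language always comes from the join, so words.language_id is redundant:
rows = db.execute("""
    SELECT w.word
    FROM words w
    JOIN base_words b ON b.id = w.base_word_id
    JOIN languages  l ON l.id = b.language_id
    WHERE l.name = 'Spanish'
""").fetchall()

Phrases can then get exactly the same treatment: one phrases table with a language_id, plus a directional phrase_translations table keyed the same way.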

how to create a replicable, unique code for a pre-ISBN book

I am putting my collection of some 13,000 books in a MySQL database. Most of the copies I possess can be identified uniquely by ISBN. I need to use this distinguishing code as a foreign key into another database table.
However, quite a few of my books date from pre-ISBN ages. So for these, I am trying to devise a scheme to uniquely assign a code, sort of like an SKU.
The code would be strictly for private use. It should have the important property that, when I obtain a pre-ISBN publication, I could build the code from inspecting the work, and based on the result search the database to see if I already have other copies in my possession.
Many years ago I think I saw a search scheme for some university(?) catalogue, where you could perform a search of a title based on a concatenated string (or code) that was made up of, let's say, 8 letters from the title, 4 from the author, and maybe some other data. For example, to search 'The Nature of Space and Time' by Stephen Hawking and Roger Penrose, you might search on the string 'Nature SHawk', comprising 8 characters from the title (omitting non-filing words and stopwords) and 4 from the author(s).
I haven't been able to find any information on such schemes, or whether such an approach was ever standardized in any way.
Something along these lines could be made up of course, but I was wondering if people here have heard of such schemes, or have ideas on how to arrive at a solution.
So keep in mind the important property of 'replicability': using the scheme, inspection of a pre-ISBN work should (omitting very special or exclusive cases) in general lead to a code that can on its own be used to determine whether such a copy is already in the database.
Thank you for your time.
Just use the title (add author and publisher as options) and a series id to produce a fake ISBN. Take a look at fake_isbn.
NOTE: use the first digit as a series id, but don't use 9!
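The 'Nature SHawk' example leaves the exact slicing ambiguous, so the concrete choices below (eight filing letters of the title, the author's initial plus four surname letters, aggressive normalization) are mine; the only property that matters is that inspecting the same work always reproduces the same code:

import re

# Words ignored when filing the title; extend for other languages as needed.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "on"}

def pre_isbn_code(title: str, forename: str, surname: str) -> str:
    """Replicable private code: 8 filing letters of the title, a hyphen,
    then the author's initial and the first 4 letters of the surname."""
    words = [w for w in re.findall(r"[A-Za-z]+", title)
             if w.lower() not in STOPWORDS]
    title_part = "".join(words)[:8].upper().ljust(8, "_")
    author_part = (forename[:1] + surname[:4]).upper()
    return f"{title_part}-{author_part}"

print(pre_isbn_code("The Nature of Space and Time", "Stephen", "Hawking"))
# -> NATURESP-SHAWK

Most of the replicability lives in the normalization (a fixed stopword list, stripping accents, always taking the first-listed author), not in the format itself.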

Fuzzy neo4j relationships

I want to do something in neo4j that I hope will work ok: I want to make "fuzzy" path matches; the links will sometimes count as a relationship, and sometimes not, depending on the query.
Here's an example: let's say I have a (p:Person)-[:HAS]->(n:Name). A search has found a Person (say, by phone number). I want to go from this Person to other Persons with similar names, to get their phone numbers. Also, I want the similarity to be adjustable, so the user might ask to match very similar names, or not very similar names.
I could get the first person's name, and then do a search against other names with some Lucene patterns - this is easy enough, but it means doing a full Lucene search on the Name values, which in my use case is not ideal, as I think it might be a bit slow (there are very many names - let's say a billion, remembering this is just an example). I hope there is a better way.
One approach I can imagine is having a "similarity" relationship between Names. Whenever a new Name node is added, we check for similar names and link them (creating these relationships would be slow, but we could push it onto a batch process, and it's ok if it takes some minutes). We would only link names that were fairly similar (so the number of links would hopefully not get too large). I suppose we could then craft a query on this, matching similarities greater than my threshold. Something like this:
MATCH (p1:Person {phone:"555-234234"})-->(n1:Name)-[s:SIMILAR]->(n2:Name)-->(p2:Person)
WHERE s.matchLevel >=2
RETURN p2.phone;
Is this approach better or worse than just doing the lucene search? Has anyone else wanted to do something like this?
Also, based on the suggestion at http://graphaware.com/neo4j/2013/10/24/neo4j-qualifying-relationships.html, I believe I'll be better off having many relationships (SIMILAR_1, SIMILAR_2 ..) instead of using a "match level" attribute on my relationship.
BTW, I know there are many similar questions to this (e.g. Neo4j 2 Cypher fuzzy search), but AFAIK this exact question isn't on Stack Overflow (and I have looked).
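For what it's worth, a sketch of the batch job that would maintain those typed relationships, using the official neo4j Python driver (the Name.value property, the ratio buckets, and the candidate list are all assumptions; at a billion names you would also want a blocking step, e.g. a phonetic key, so each new name is compared against a small candidate set rather than everything):

from difflib import SequenceMatcher
from neo4j import GraphDatabase  # official driver; usage here is a sketch

def match_level(a: str, b: str) -> int:
    """Bucket string similarity into discrete levels (3 = most similar)."""
    r = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return 3 if r >= 0.9 else 2 if r >= 0.8 else 1 if r >= 0.7 else 0

def link_similar(session, new_name, candidate_names):
    for other in candidate_names:
        level = match_level(new_name, other)
        if other != new_name and level > 0:
            # Relationship types can't be Cypher parameters, hence the
            # f-string; this follows the SIMILAR_1/2/3 scheme above.
            session.run(
                f"MATCH (a:Name {{value: $a}}), (b:Name {{value: $b}}) "
                f"MERGE (a)-[:SIMILAR_{level}]->(b)",
                a=new_name, b=other,
            )

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    link_similar(session, "Jon Smith", ["John Smith", "Jane Smythe", "Bob Jones"])

The adjustable-similarity query then becomes a union of relationship types, e.g. -[:SIMILAR_2|SIMILAR_3]-> for "very similar only".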

Delphi - What Structure allows for SAVING inverted index type of information?

Delphi XE6. I'm looking to implement a limited style of search: an edit field where the user enters a business name, which then gets looked up. I need to allow the user to enter multiple words, or parts of multiple words. For example, for the business "First Bank of Kansas", the user should be able to enter "Fir Kan" and get a match. This means an inverted-index type of structure: some kind of list of each unique word, where each word maps to integer IDs (document ID, primary key ID, etc.). I am struggling with WHAT type of structure to use... I have approximately 250,000 business names, which contain 43,500 unique words. Word counts vary from a single occurrence to several thousand ("company", "corporation", etc.). I have some requirements...
1). Assume the user enters BAN. I need to find ALL words that start with BAN. I need to return BANK, BANKER, etc... This means that whatever structure I use, I have to be able to find BAN and then move to the next alphabetic entry... and keep moving to the next until I find a value that does NOT start with BAN. This eliminates any type of HASH structure, correct?
2). I obviously want this to be fast. A HASH is the fastest, but I can't use one, correct? See requirement 1.
3). Each entry in this structure needs to be able to hold a list of integers. If I end up going with a LinkedList, then each element has to hold a list of Integers.
4). I need to be able to save and load this structure. I don't want to have to build it each time I use it.
Whatever I end up with, it appears it has to be a NESTED structure: a higher-level list (a LinkedList?) where each node holds an integer list.
What am I looking for? What do commercial products use? Outlook, etc. have search capabilities.
Every word is linked to a specific set of IDs, each representing a business name, right?
I recommend using a binary tree data structure, because search effort is normally O(log n), which is quite fast. In particular, if business names change at runtime, an AVL tree should do well, although it's quite some work to implement one yourself. But there should be many ready-to-use units on binary trees all over the internet.
For each word of the query that is found in your tree structure, take its list of IDs, keeping the lists grouped by the entered word they matched.
As the last step, intersect all those aggregated lists of IDs.
Only IDs that fit all entered words remain, and those IDs reference the searched business names.
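Stripped of Delphi specifics, the whole structure is: a sorted word list for the prefix scan, a map from word to ID set, and set intersection at the end. A Python sketch of that shape (a Delphi port could use a sorted TStringList, whose Find method does a binary search, with TList<Integer> payloads, and streams instead of pickle for saving):

import bisect
import pickle

class PrefixIndex:
    def __init__(self):
        self.words = []  # sorted unique words, scanned in order (requirement 1)
        self.ids = {}    # word -> set of business IDs (requirement 3)

    def add(self, business_id: int, name: str):
        for word in name.upper().split():
            if word not in self.ids:
                bisect.insort(self.words, word)
                self.ids[word] = set()
            self.ids[word].add(business_id)

    def prefix_ids(self, prefix: str) -> set:
        # Binary search to the first word >= prefix, then walk while words
        # still start with it: O(log n) to locate, no hashing (reqs. 1, 2).
        prefix = prefix.upper()
        result = set()
        i = bisect.bisect_left(self.words, prefix)
        while i < len(self.words) and self.words[i].startswith(prefix):
            result |= self.ids[self.words[i]]
            i += 1
        return result

    def search(self, query: str) -> set:
        # Intersect one ID set per entered word: "Fir Kan" must match both.
        sets = [self.prefix_ids(p) for p in query.split()]
        return set.intersection(*sets) if sets else set()

    def save(self, path: str):  # requirement 4: persist instead of rebuilding
        with open(path, "wb") as f:
            pickle.dump((self.words, self.ids), f)

    def load(self, path: str):
        with open(path, "rb") as f:
            self.words, self.ids = pickle.load(f)

idx = PrefixIndex()
idx.add(1, "First Bank of Kansas")
idx.add(2, "Kansas Fried Chicken")
print(idx.search("Fir Kan"))  # -> {1}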
