Trouble with fuzzy matching columns - google-sheets

We have a long list of organizations that we are trying to match to another long list of organizations. Normally, I would just use an index/match formula to match the 2 columns together. However, they both utilize organization names and the spelling can vary mildly.
For example, one column might refer to Organization A as "Tanner Health" and another might refer to it as "Tanner Healthcare". Another example, one column might refer to Organization B as "The Tanner Health Institute" and another might refer to it as "Tanner Health Institute". There are countless examples, but the idea is that we want to try and get the closest possible match between the 2 columns.
It is entirely possible that an organization in column 1 does not exist in column 2, in which case we don't want the formula to return anything.
If it helps at all, we also have each organization's state and zip code. So, I considered an array formula that matches zip code and state, and then does the best fuzzy match against the remaining options.
I am struggling to find a formula that can do a fuzzy match of the 2 columns, and am looking for suggestions on the best way to approach this problem.
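This is not a Sheets formula, but the approach described above (restrict candidates to the same state and zip code, then take the closest name) can be sketched in Python to show the logic. It is only a rough sketch: the `normalize` and `best_match` helpers, the 0.8 threshold, and the tuple layout are assumptions made up for illustration, and `difflib.SequenceMatcher` is just one of several possible similarity measures.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and drop a leading 'the' so trivial variants compare equal."""
    words = name.lower().split()
    if words and words[0] == "the":
        words = words[1:]
    return " ".join(words)


def best_match(name, state, zip_code, candidates, threshold=0.8):
    """Return the closest candidate name in the same state and zip, or None.

    candidates is a list of (name, state, zip_code) tuples (assumed layout).
    """
    target = normalize(name)
    best_name, best_score = None, 0.0
    for cand_name, cand_state, cand_zip in candidates:
        # Blocking step: only compare names within the same state and zip.
        if (cand_state, cand_zip) != (state, zip_code):
            continue
        score = SequenceMatcher(None, target, normalize(cand_name)).ratio()
        if score > best_score:
            best_name, best_score = cand_name, score
    # Below the threshold we return nothing, as required for missing organizations.
    return best_name if best_score >= threshold else None
```

The threshold is what keeps an organization with no real counterpart from being matched to something unrelated; it would need tuning against real data.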

Related

How to infer surprisingly "missing" data in a large set of strings. Or nerds do unique (but sane) baby names

I was thinking about this problem the other day when trying to find an applicable email address for my very common name.
Let's say I had all the names of the roughly 150 million men in the United States in a file, and I wanted to figure out "men who don't exist but sound like they should". That is, I wanted to figure out a combination of names (First, Middle, Last) that exist without a person being named that combination in my record of all names. Let's say I appreciate the advantages of unique names but don't want any of the disadvantages of unfamiliarity and mispronunciation.
Of course I could make up a name like "Nickleback Sunshine Cheeseburger" and reasonably suspect that nobody has this combination, but that may confuse people, so I want names that exist in the set. A name like "Chao-Lin" may well appear with the last name "Jones", but that is less likely than with a last name of similar language origin, as in "Chao-Lin Kuo". José is more likely to appear with Gonzalez than Patel, and so on.
Of course any of these notions would have to be reinforced by the structure of the data.
So an example would be if "John Marcus Black" doesn't exist, that would be interesting because all of the names in the combination are common and appear together frequently, just not in that order.
The first thing that came into my mind was some sort of trie or directed graph that is weighted by frequency but that only really works for an "autocomplete" like feature where what we are looking for is not actually present in the set. I was thinking about suffix trees as well but not sure if this is a good use case.
I'm sure there is a machine learning algorithm that would be sufficient in finding these names but I don't know very many.
Bonus, the most normal unique name given a necessary last name. Given a starting name like "Smith", come up with most surprising missing names.
tl;dr 1. Given all the names of men in the US in a file, find n names that probably should exist but don't. Also: some men have middle names, some don't.
The obvious choice would be character level Markov chains.
That won't prevent the generation of existing names, and of profanity, though. I.e., it might combine FUnk and niCK.
You could then rank the results by some surprisingness measure, e.g., based on character bigram/trigram frequencies.
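A minimal sketch of the character-level Markov chain idea, assuming an order-2 model and a plain list of names as input. The function names (`train`, `generate`, `novel_names`, `avg_log_prob`) are made up for this example. Existing names are filtered out explicitly, and the average log-probability of the character transitions is one possible stand-in for the "surprisingness" ranking; profanity is not handled here.

```python
import math
import random
from collections import defaultdict


def train(names, order=2):
    """Map each `order`-character context to the list of characters seen after it."""
    model = defaultdict(list)
    for name in names:
        padded = "^" * order + name.lower() + "$"  # ^ marks start, $ marks end
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model


def generate(model, order=2, max_len=12, rng=random):
    """Sample one name by walking the chain from the start context."""
    context, out = "^" * order, []
    while len(out) < max_len:
        nxt = rng.choice(model[context])
        if nxt == "$":
            break
        out.append(nxt)
        context = context[1:] + nxt
    return "".join(out)


def novel_names(names, count, order=2, seed=0, max_tries=10000):
    """Generate up to `count` distinct names that are absent from the input."""
    rng = random.Random(seed)
    model = train(names, order)
    existing = {n.lower() for n in names}
    found = set()
    for _ in range(max_tries):
        if len(found) >= count:
            break
        candidate = generate(model, order, rng=rng)
        if candidate and candidate not in existing:
            found.add(candidate)
    return sorted(found)


def avg_log_prob(model, name, order=2):
    """Average log-probability of the name's transitions; higher means more 'normal'."""
    padded = "^" * order + name.lower() + "$"
    steps = len(padded) - order
    total = 0.0
    for i in range(steps):
        followers = model.get(padded[i:i + order], [])
        seen = followers.count(padded[i + order])
        if seen == 0:
            return float("-inf")  # transition never observed in the data
        total += math.log(seen / len(followers))
    return total / steps
```

Sorting the output of `novel_names` by `avg_log_prob` would put the most plausible-sounding missing names first. The same idea could be applied at the word level (first, middle, last) instead of the character level.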

Fuzzy neo4j relationships

I want to do something in neo4j that I hope will work ok: I want to make "fuzzy" path matches; the links will sometimes count as a relationship, and sometimes not, depending on the query.
Here's an example: let's say I have a (p:Person)-[:HAS]->(n:Name). A search has found a Person (say, by phone number). I want to go from this Person to other Persons with similar names, to get their phone numbers. Also, I want the similarity to be adjustable, so the user might ask to match very similar names, or not very similar names.
I could get the first person's name, and then do a search against other names with some lucene patterns - this is easy enough, but it means doing a full lucene search on the Name values, which in my use case is not ideal as I think it might be a bit slow (there are very many names - let's say a billion, remembering this is just an example). I hope there is a better way.
One approach I can imagine is having a "similarity" relationship between Names. Whenever a new Name node is added, we check for similar names and link them (creating these relationships would be slow, but we could push it onto a batch process, and it's ok if it takes some minutes). We would only link names that were fairly similar (so the number of links would hopefully not get too large). I suppose we could then craft a query on this, matching similarities greater than my threshold. Something like this:
MATCH (p1:Person {phone:"555-234234"})-->(n1:Name)-[s:SIMILAR]->(n2:Name)-->(p2:Person)
WHERE s.matchLevel >=2
RETURN p2.phone;
Is this approach better or worse than just doing the lucene search? Has anyone else wanted to do something like this?
Also, based on the suggestion at http://graphaware.com/neo4j/2013/10/24/neo4j-qualifying-relationships.html, I believe I'll be better off having many relationships (SIMILAR_1, SIMILAR_2 ..) instead of using a "match level" attribute on my relationship.
BTW, I know there are many similar questions to this (eg. Neo4j 2 Cypher fuzzy search), but afaik this exact question isn't on stackoverflow (and I have looked).
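To make the batch-linking idea concrete, here is a rough Python sketch of how the similarity levels could be precomputed before they are written to the graph as SIMILAR relationships. Everything here is an assumption for illustration: the `match_level` buckets, the thresholds, and the use of `difflib.SequenceMatcher` as the similarity measure. Writing the edges into neo4j is left out.

```python
from difflib import SequenceMatcher
from itertools import combinations


def match_level(a, b):
    """Bucket a similarity ratio into a coarse level (3 = very similar, 0 = not linked)."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if ratio >= 0.9:
        return 3
    if ratio >= 0.8:
        return 2
    if ratio >= 0.7:
        return 1
    return 0


def similar_edges(names, min_level=1):
    """Yield (name_a, name_b, level) for every pair at or above min_level."""
    for a, b in combinations(sorted(set(names)), 2):
        level = match_level(a, b)
        if level >= min_level:
            yield a, b, level
```

Comparing every pair is quadratic, so with very many names the batch process would need some kind of blocking (for example, only comparing names that share a prefix or a phonetic key) before computing levels.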

Delphi - What Structure allows for SAVING inverted index type of information?

Delphi XE6. Looking to implement a limited style of search, specifically an edit field for the user to enter a business name which would get looked up. I need to allow the user to enter multiple words, or part of multiple words. For example, for a business "First Bank of Kansas", the user should be able to enter "Fir Kan", and it should return a match. This means an inverted index type of structure: I have some type of list of each unique word, and for each word a list of IDs (document ID, primary key ID, etc., which is an integer). I am struggling with WHAT type of structure to make this... I have approximately 250,000 business names, which have 43,500 unique words. Word count will vary from 1 occurrence of a word to several thousand (company, corporation, etc.). I have some requirements...
1). Assume the user enters BAN. I need to find ALL words that start with BAN. I need to return BANK, BANKER, etc... This means that whatever structure I use, I have to be able to find BAN and then move to the next alphabetic entry... and keep moving to the next until I find a value that does NOT start with BAN. This eliminates any type of HASH structure, correct?
2). I obviously want this to be fast. HASH is the fastest, but I can't use this, correct? See requirement 1.
3). Each entry in this structure needs to be able to hold a list of integers. If I end up going with a LinkedList, then each element has to hold a list of Integers.
4). I need to be able to save and load this structure. I don't want to have to build it each time I use it.
Whatever I end up with, it appears to have to be a NESTED structure, a higher level list (LinkedList?) with each node being an Integer List.
What am I looking for? What do commercial products use? Outlook, etc. have search capabilities.
Every word is linked to a specific set of IDs, each representing a business name, right?
I recommend using a binary tree data structure because effort for searching is normally log(n), which is quite fast. Especially, if business names are changing at runtime, an AVLTree should do well, although it's quite some work to implement it by yourself. But there should be many ready-to-use units on binary trees all over the internet.
For each successful search for a word in your tree data structure, you should take their list of IDs and aggregate those grouped by the entered word they succeeded for.
As the last step you take all those aggregated lists of IDs and do an intersection.
Only the IDs that match all entered words should be left. Those IDs reference the business names being searched for.
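The steps above (ordered lookup by prefix, collect the ID lists per entered word, then intersect) can be sketched as follows. This is written in Python rather than Delphi, and it swaps the binary tree for a sorted word list with binary search, which gives the same O(log n) lookup plus an in-order scan. The function names are made up for this example.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(businesses):
    """businesses: dict of id -> name. Returns (sorted_words, word -> set of ids)."""
    postings = defaultdict(set)
    for biz_id, name in businesses.items():
        for word in name.upper().split():
            postings[word].add(biz_id)
    return sorted(postings), postings


def ids_for_prefix(prefix, sorted_words, postings):
    """Union the ID sets of every indexed word that starts with prefix."""
    prefix = prefix.upper()
    ids = set()
    # Binary search to the first word >= prefix, then walk forward in order
    # until a word no longer starts with the prefix (requirement 1).
    i = bisect_left(sorted_words, prefix)
    while i < len(sorted_words) and sorted_words[i].startswith(prefix):
        ids |= postings[sorted_words[i]]
        i += 1
    return ids


def search(query, sorted_words, postings):
    """Return IDs of businesses that match every prefix in the query."""
    per_term = [ids_for_prefix(t, sorted_words, postings) for t in query.split()]
    return set.intersection(*per_term) if per_term else set()
```

A sorted list plus a word-to-IDs map is also straightforward to write to disk and read back, which would cover the save/load requirement without rebuilding the index each time.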

Solr join and non-Join queries give different results

I am trying to link two types of documents in my Solr index. The parent is named "house" and the child is named "available". So, I want to return a list of houses that have available documents with some filtering. However, the following query gives me around 18 documents, which is wrong. It should return 0 documents.
q=*:*
&fq={!join from=house_id_fk to=house_id}doctype:available AND discount:[1 TO *] AND start_date:[NOW/DAY TO NOW/DAY%2B21DAYS]
&fq={!join from=house_id_fk to=house_id}doctype:available AND sd_year:2014 AND sd_month:11
To debug it, I first tried to check whether there are any available documents matching the given filter queries. So, I tried the following query:
q=*:*
&fq=doctype:available AND discount:[1 TO *] AND start_date:[NOW/DAY TO NOW/DAY%2B21DAYS]
&fq=doctype:available AND sd_year:2014 AND sd_month:11
The query gives 0 results, which is correct. As you can see, both queries are the same; the only difference is the use of the join query parser. I am a bit confused as to why the first query gives results. My understanding is that this should not happen, because the second query shows that there are no available documents that satisfy the given filter queries.
I have figured it out.
The reason is simply how joins work in Solr. A Solr join behaves like a SQL subquery (IN) rather than a row-level join, and each filter query is executed separately. So a house is returned when it has one available document with discount >= 1 (and a matching start_date) and another, possibly different, available document with sd_year:2014 AND sd_month:11, even though my intention was to apply both conditions at the same time.
However, in the second case, both conditions are applied to the same available documents. Since no available document satisfies both conditions, there is no matching house either, which gives zero results.
It really took some time to figure this out. I hope this will help someone else.
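The difference can be illustrated with a small Python simulation of the join semantics. The data is hypothetical and the start_date condition is left out for brevity; `join` only mimics what `{!join from=house_id_fk to=house_id}` does conceptually (collect the house IDs of all matching child documents), it is not Solr code.

```python
# Hypothetical data: each "available" document points at a house via house_id_fk.
available = [
    {"house_id_fk": 1, "discount": 5, "sd_year": 2013, "sd_month": 6},
    {"house_id_fk": 1, "discount": 0, "sd_year": 2014, "sd_month": 11},
    {"house_id_fk": 2, "discount": 0, "sd_year": 2014, "sd_month": 11},
]


def join(predicate):
    """Return the house IDs that have at least one available doc matching predicate."""
    return {doc["house_id_fk"] for doc in available if predicate(doc)}


def has_discount(doc):
    return doc["discount"] >= 1


def in_nov_2014(doc):
    return doc["sd_year"] == 2014 and doc["sd_month"] == 11


# Two separate join filters: each may be satisfied by a different child document.
two_joins = join(has_discount) & join(in_nov_2014)

# One join with both conditions: a single child document must satisfy both.
one_join = join(lambda doc: has_discount(doc) and in_nov_2014(doc))
```

Here house 1 ends up in `two_joins` because one of its available documents has a discount and a different one falls in November 2014, while `one_join` is empty. Putting both conditions into a single join filter query should therefore give the intended result.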

is there an algorithm to find out which words in a search-string belong together?

I was thinking about text-driven search by user input.
Often you are searching in a database of addresses, where you can find customers and so on.
Does anybody have an idea how to find out which of the typed words is the name, which is the street name, and which is the company name?
Secondly, if the name is a double name like "Lee Harvey", how can I find out that the two words Lee and Harvey belong together?
Same problem with company names like "frank the baker inc."...
Is there any algorithm or best-practice strategy?
Thanks for links, tutorials, scripts and all other help ;-)
What you basically want is a search engine :) Here are the basic steps you need to follow -
You need to create an 'Inverted Index' of the content you want to be searched on.
The index is a set of 'name'=>'value' pairs. You can build these pairs in whichever way you want (tuned according to your data & needs).
Eg. for your problem of double names, you could split all your names into single words & index it like so -
'lee'=>'lee harvey'
'harvey'=>'lee harvey'
...
This way, when anyone searches for 'lee' they get 'lee harvey'. There are other, better approaches to this, such as "n-gram" indexing. Check it out...
You could possibly build indexes of names, addresses, emails etc. & when the user types a query, check it against all your indexes with the approach suggested above. After you get the results, merge them. Maybe you could introduce the notion of rank so that you can sort your results & show the latest or most relevant ones at the top. For this you need to figure out a way to score your terms...
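As a rough illustration of the n-gram indexing and ranking ideas mentioned above, here is a small Python sketch. It assumes character trigrams and scores each indexed value by the number of trigrams it shares with the query; the function names and the scoring rule are made up for this example, and a real search engine would use a more refined score.

```python
from collections import Counter, defaultdict


def ngrams(text, n=3):
    """Character n-grams of the lowercased text, padded with spaces at both ends."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(values, n=3):
    """Inverted index: map each n-gram to the set of values containing it."""
    index = defaultdict(set)
    for value in values:
        for gram in ngrams(value, n):
            index[gram].add(value)
    return index


def search(query, index, n=3):
    """Rank indexed values by how many query n-grams they share, best first."""
    hits = Counter()
    for gram in ngrams(query, n):
        for value in index.get(gram, ()):
            hits[value] += 1
    return [value for value, _ in hits.most_common()]
```

One index per field (names, addresses, emails) could be built this way and queried in turn, with the shared-trigram counts used to merge and sort the results. A side effect of n-grams is some tolerance for typos, since a misspelled word still shares a few trigrams with the correct one.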
Don't care, just perform a full-text search. Then check, for each result item, which field contains the search terms. You may also display the items in separate lists (terms found in name, terms found in address). The only difficulty is when John Smith lives on John Smith street: you must then decide which list or lists the result item belongs to.

Resources