I have a large table with a list of company names and need a way of unifying the company names, e.g.
McDonalds Restaurant = McDonalds
McDonalds Fast Food = McDonalds
McDonalds Food 1234 = McDonalds
McDonald = McDonalds
McDnld = McDonalds
McDonalds Farm doesn't equal McDonalds
Microsoft -> Microsoft
Bare Essentials -> Bare Escentuals
Polycom, Inc. -> Polycom
Is there a way to do this without writing out each rule individually? Or at least generate some sort of percentage for the likelihood that one company name belongs to a certain company?
Try:
SELECT * FROM `company`
WHERE `name` LIKE '%McDonalds%Food%' OR `name` LIKE '%McDonalds%Restaurant%'
You'll need to handle each case individually since you're explicitly excluding %Farm from the result set.
If your "doesn't equal" list were much shorter, you could add a NOT LIKE rule for each entry. Otherwise there isn't really a way for SQL to tell one from the other. What I would do is make a global company table that holds the base name and ties to the child table with a base store ID.
The short answer is...no, at least not in SQL.
This sort of heuristic matching of names has been the subject of a lot of research.
How can I measure the similarity between 2 strings?
A Comparison of String Distance Metrics for Name-Matching Tasks
A Fast Heuristic for Approximate String Matching
A Guided Tour to Approximate String Matching
Many SQL implementations have a Soundex function, but it works well (for some definition of "well") only for conventional Anglo-Saxon names, the kind that were widely used a century ago. See http://www.immagic.com/eLibrary/ARCHIVES/GENERAL/LAS_US/L030206B.pdf for some issues with Soundex.
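If you just want a rough similarity percentage rather than a linguistically informed metric, the Python standard library's difflib is enough to prototype with; the canonical list, the suffix list, and the 0.8 cutoff below are illustrative assumptions, not a definitive recipe:

import re
from difflib import SequenceMatcher

# Canonical names to unify against (illustrative).
canonical = ["McDonalds", "Microsoft", "Polycom", "Bare Escentuals"]

def normalize(name):
    """Lowercase, drop punctuation, strip common legal suffixes (illustrative list)."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    name = re.sub(r"\b(inc|corp|ltd|llc|co)\b", " ", name)
    return " ".join(name.split())

def best_match(name, threshold=0.8):
    """Closest canonical name and its 0..1 similarity, or None if below threshold."""
    scored = [(c, SequenceMatcher(None, normalize(name), normalize(c)).ratio())
              for c in canonical]
    best, score = max(scored, key=lambda pair: pair[1])
    return (best if score >= threshold else None, round(score, 2))

for raw in ["McDonald", "McDnld", "Polycom, Inc.", "Bare Essentials", "McDonalds Farm"]:
    print(raw, "->", best_match(raw))

Note that whole-string ratios punish extra words: "McDonalds Restaurant" scores only about 0.62 against "McDonalds", which is exactly the kind of case the token-based and hybrid metrics in the papers above handle better.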
I've just started using neo4j and, having done a few experiments, am ready to start organizing the database itself. I've begun by designing a basic diagram (on paper) and came across the following question:
Most examples in the material I'm using (Cypher and neo4j tutorials) present only a few properties per relationship/node. But I have to wonder what the cost of carrying a long list of properties is.
Q: Is it more efficient to favor a wide variety of relationship types (GOODFRIENDS_WITH, FRIENDS_WITH, ACQUAINTANCE, RIVAL, ENEMIES, etc) or fewer types with varying properties (SEES_AS type:good friend, friend, acquaintance, rival, enemy, etc)?
The same holds for nodes. The first draft of my diagram has a staggering number of properties (title, first name, second name, first surname, second surname, suffix, nickname, and then there are physical characteristics, personality, age, jobs...) and I'm thinking it may lower the performance of the db. Of course some nodes won't need all of the properties, but the basic properties will still be quite a few.
Q: What is the actual, and the advisable, limit for the number of properties, in both nodes and relationships?
FYI, I am going to remake my draft to reduce the number of properties by using nodes instead (a node for :FamilyName, another for :Job, and so on), but I've only just started thinking it over, as I'll need to carefully analyse which would-be properties make sense to keep as properties, especially because the change will increase the number of relationship types I'll be dealing with.
Background information:
1) I'm using neo4j to map out all relationships between the people living in a fictional small town. The queries I'll perform will mostly be as follows:
a. find all possible paths between 2 (or more) characters
b. find all locations which 2 (or more) characters frequent
c. find all characters which have certain types of relationship (friends, cousins, neighbors, etc) to character X
d. find all characters with the same age (or similar age) who studied in the same school
e. find all characters with the same age / first name / surname / hair color / height / hobby / job / temper (easy to anger) / ...
and variations of the above.
2) I'm not a programmer, but having taught myself HTML and advanced Excel, I feel confident I'll pick up the intuitive Cypher quickly enough.
First off, for small data "sandbox" use, this is a moot point. Even with the most inefficient data layout, as long as you avoid Cartesian products and the like, the only thing you will notice is how intuitive your data is to yourself. So if this is a "toy"-scale project, just focus on what makes the most organizational sense to you. If you change your mind later, reformatting via Cypher won't be too hard.
Now assuming this is a business project that needs to scale to some degree, remember that non-indexed properties are basically invisible to the Cypher planner. The more meaningful and diverse your relationships, the better the Cypher planner is going to be at finding your data quickly. Favor relationships for connections you want to be able to explore, and favor properties for data you just want to see. Index any properties or use labels that will be key for finding a particular (or set of) node(s) in your queries.
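As a concrete illustration of that split (an indexed, labeled property as the entry point; relationship types for the connections you traverse), here is a minimal sketch using the official neo4j Python driver. The URI, credentials, label, and relationship type are made-up examples, and the CREATE INDEX syntax shown is the neo4j 4.x+ form:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Entry point: index the property you look characters up by.
    session.run("CREATE INDEX char_name IF NOT EXISTS FOR (c:Character) ON (c.name)")

    # Connection you want to explore: a relationship type, not a property.
    session.run(
        "MERGE (a:Character {name: $a}) "
        "MERGE (b:Character {name: $b}) "
        "MERGE (a)-[:GOODFRIENDS_WITH]->(b)",
        a="Ana", b="Bruno",
    )

    # The planner can now start from the index instead of an AllNodesScan.
    result = session.run(
        "MATCH (c:Character {name: $name})-[:GOODFRIENDS_WITH]-(f) RETURN f.name",
        name="Ana",
    )
    print([record["f.name"] for record in result])

driver.close()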
I was thinking about this problem the other day when trying to find an applicable email address for my very common name.
Let's say I had all the names of the roughly 150 million men in the United States in a file, and I wanted to figure out "men who don't exist but sound like they should". That is, I wanted to find combinations of names (first, middle, last) whose parts all exist in my record of all names, but where no person is actually named that combination. Let's say I appreciate the advantages of unique names but don't want any of the disadvantages of unfamiliarity and mispronunciation.
Of course I could make up a name like "Nickleback Sunshine Cheeseburger" and reasonably suspect that nobody is named that combination, but such a name may confuse people, so I want names that exist in the set. Also, a first name like "Chao-Lin" may appear with the last name "Jones", but it is less likely to; it reads as more consistent with a last name of similar language origin, as in "Chao-Lin Kuo". "José" is more likely to appear with "Gonzalez" than with "Patel", and so on.
Of course any of these notions would have to be reinforced by the structure of the data.
So, for example, if "John Marcus Black" doesn't exist, that would be interesting, because all names in the combination are common and appear together frequently, just not in that order.
The first thing that came to mind was some sort of trie or directed graph weighted by frequency, but that only really works for an "autocomplete"-like feature, where what we are looking for is not actually present in the set. I was thinking about suffix trees as well, but I'm not sure whether this is a good use case for them.
I'm sure there is a machine learning algorithm that would be sufficient for finding these names, but I don't know very many.
Bonus: the most normal unique name given a required last name. Given a starting name like "Smith", come up with the most surprising missing names.
tl;dr: Given all the names of men in the US in a file, find n names that probably should exist but don't. Also: some men have middle names, some don't.
The obvious choice would be character-level Markov chains.
That won't prevent the generation of existing names, or of profanity, though. E.g., it might combine FUnk and niCK.
You could then rank the results by some surprisingness measure, e.g., based on character bigram/trigram frequencies.
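Here is a minimal sketch of that approach, assuming the real input is one name per line from the file; the order-2 character model, the tiny stand-in name set, and the rarity score are illustrative choices:

import math
import random
from collections import defaultdict

def build_model(names, order=2):
    """Map each `order`-character context to the list of observed next characters."""
    model = defaultdict(list)
    for name in names:
        padded = "^" * order + name.upper() + "$"   # ^ pads the start, $ marks the end
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def generate(model, order=2):
    """Random walk over observed transitions until the end marker is drawn."""
    state, out = "^" * order, []
    while True:
        nxt = random.choice(model[state])
        if nxt == "$":
            return "".join(out)
        out.append(nxt)
        state = state[1:] + nxt

def surprise(model, name, order=2):
    """Average negative log-probability of the name's transitions (higher = rarer)."""
    padded = "^" * order + name.upper() + "$"
    total = 0.0
    for i in range(len(padded) - order):
        succ = model.get(padded[i:i + order], [])
        seen = succ.count(padded[i + order])
        if seen == 0:
            return float("inf")                     # contains an unseen transition
        total += -math.log(seen / len(succ))
    return total / (len(padded) - order)

# Stand-in for the 150-million-name file.
names = {"NICK", "FUNK", "JOHN", "MARCUS", "BLACK", "MARK", "JACK"}
model = build_model(names)
candidates = {generate(model) for _ in range(2000)} - names  # drop names that exist
for name in sorted(candidates, key=lambda n: surprise(model, n))[:10]:
    print(name, round(surprise(model, name), 2))

Sorting by ascending surprise surfaces the "should exist but don't" candidates; a profanity blocklist, as noted above, would still have to be applied separately.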
Suppose I have a large knowledge base with many relationship types, e.g., hasChild, livesIn, locatedIn, capitalOf, largestCityOf...
The number of capitalOf relationships is relatively small (say, one hundred) compared to the number of nodes and of other types of relationships.
I want to fetch every capital which is also the largest city in its country with the following query:
MATCH (city)-[:capitalOf]->(country), (city)-[:largestCityOf]->(country) RETURN city
Apparently it would be wise to take the capitalOf type as a clue: scan all 100 relationships of this type and refine them by [:largestCityOf]. However, the current execution plan engine of neo4j does an AllNodesScan and an Expand. Why not consider adding a "RelationshipByTypeScan" operator to the current query optimization engine, like what NodeByLabelScan does?
I know that I can turn relationship types into relationship properties, index them using the legacy index, and manually indicate
START r=relationship:rels(rtype = "capitalOf")
to tell neo4j how to make the query efficient. But for a more complicated pattern query with many relationship types and no node id/label/property to start from, it is clearly the duty of the optimization engine to decide which relationship type to start with.
I've seen many questions asking about the same problem, but they get answers like "negative... a query TYPICALLY starts from nodes...". I just want to use the above typical scenario to ask why once more.
Thanks!
A relationship is local to its start and end node - there is no global relationship dictionary. An operation like "give me globally all relationships of type x" is therefore an expensive operation - you need to go through all nodes and collect matching relationships.
There are 2 ways to deal with this:
1) use a manual index on relationships as you've sketched
2) assign labels to your nodes. Assume all the country nodes have a Country label. You can rewrite your query:
MATCH (city)-[:capitalOf]->(country:Country), (city)-[:largestCityOf]->(country) RETURN city
The AllNodesScan is now a NodeByLabelScan. The query grabs all countries and matches them to the cities. Since every country has one capital and one largest city, this is efficient and scales independently of the rest of your graph.
If you put all relationships into one index and try to grab the ~100 capitalOf relationships, that operation scales logarithmically with the total number of relationships in your graph.
Delphi XE6. Looking to implement a limited style of search: specifically, an edit field where the user enters a business name, which then gets looked up. I need to allow the user to enter multiple words, or parts of multiple words. For example, for the business "First Bank of Kansas", the user should be able to enter "Fir Kan" and get a match. This implies an inverted-index type of structure: a list of each unique word, each tied to a document ID, primary key ID, etc. (an integer). I am struggling with WHAT type of structure to make this. I have approximately 250,000 business names, which contain 43,500 unique words. Word counts vary from one occurrence of a word to several thousand (company, corporation, etc.). I have some requirements...
1) Assume the user enters BAN. I need to find ALL words that start with BAN: BANK, BANKER, etc. This means that whatever structure I use, I have to be able to find BAN and then move to the next alphabetic entry, and keep moving to the next until I find a value that does NOT start with BAN. This eliminates any type of HASH structure, correct?
2) I obviously want this to be fast. A HASH is the fastest, but I can't use one, correct? See requirement 1.
3) Each entry in this structure needs to be able to hold a list of integers. If I end up going with a linked list, then each element has to hold a list of integers.
4) I need to be able to save and load this structure. I don't want to have to build it each time I use it.
Whatever I end up with, it appears it has to be a NESTED structure: a higher-level list (a linked list?) with each node holding an integer list.
What am I looking for? What do commercial products use? Outlook etc. have search capabilities.
Every word is linked to a specific set of IDs, each representing a business name, right?
I recommend using a binary tree data structure, because the search effort is normally O(log n), which is quite fast. Especially if business names change at runtime, an AVL tree should do well, although it's quite some work to implement one yourself. But there should be many ready-to-use units on binary trees all over the internet.
For each word you search for in your tree structure, take its list of IDs and aggregate those lists, grouped by the entered word they matched.
As the last step, take all those aggregated ID lists and intersect them.
Only IDs that fit all the entered words will be left; those IDs reference the business names you're after.
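The structure matters more than the language, so here is a minimal sketch of that word-to-ID-list scheme in Python, where a sorted array plus binary search stands in for the balanced tree and the sample businesses are made up:

import bisect
from collections import defaultdict

# Build the inverted index: word -> set of business IDs.
businesses = {1: "First Bank of Kansas", 2: "Kansas Farm Supply", 3: "First National Bank"}
index = defaultdict(set)
for biz_id, name in businesses.items():
    for word in name.upper().split():
        index[word].add(biz_id)
words = sorted(index)          # sorted keys enable the prefix range scan

def ids_for_prefix(prefix):
    """All IDs whose word starts with prefix: binary-search to the first
    candidate, then walk forward until words stop matching the prefix."""
    prefix = prefix.upper()
    ids = set()
    i = bisect.bisect_left(words, prefix)
    while i < len(words) and words[i].startswith(prefix):
        ids |= index[words[i]]
        i += 1
    return ids

def search(query):
    """Intersect the ID sets of every entered prefix."""
    result = None
    for prefix in query.split():
        matched = ids_for_prefix(prefix)
        result = matched if result is None else result & matched
    return sorted(result or [])

print(search("Fir Kan"))   # -> [1]  ("First Bank of Kansas")

In Delphi, a sorted TStringList gives you the same shape: Find does the binary search when Sorted is True, Objects[] can hold a TList<Integer> of IDs per word, and you would write your own save/load for requirement 4.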
I have a dataset containing fields like below:
id  amount  date       s_pName  s_cName  b_pName  b_cName
1   100     2/3/2012   IBM      IBM_USA  Pepsi    Pepsi_USA
2   200     21/3/2012  IBM      IBM_USA  Coke     Coke_UK
3   300     12/3/2012  IBM      IBM_USA  Pepsi    Pepsi_USA
4   1100    22/3/2012  Pepsi    IBM_Aus  IBM      IBM_USA
Here any of the four fields (s_pName, s_cName, b_pName, b_cName) can refer to a company acting as seller or buyer.
How do I model this dataset in neo4j so that I can query it with Gremlin like:
select b_cName, id, amount, date from tableName where s_cName in ('IBM_USA', 'IBM_AUS');
I noted your question on the gremlin-users mailing list as well (where you provided a bit more information about things you'd tried): https://groups.google.com/forum/#!topic/gremlin-users/AxsF2eJvpOA
I'm sure there are a few ways to approach this modelling issue, so I'll just provide some things to consider and hopefully that will inspire you toward a solution. First, instead of thinking of buyers and sellers, just think about the fact that you have "companies" that sell things to other companies and that companies have a hierarchy (meaning that a company can have a parent). Your model then comes down to:
company --sellsTo--> company
company --parent--> company
Place your transaction amount and date on the "sellsTo" edge, creating one such edge per row in your dataset. Create a key index on the "companyName" field of the company vertex so that you can look up the company. Your Gremlin would then be something like:
['IBM_USA','IBM_AUS'].collect{g.V('companyName',it).next()}._().outE('sellsTo').as('tx').inV.as('buyer').select{[it.id, it.amount, it.date]}{it.companyName}
Breaking that down: you look up the two companies you care about via the key index on companyName and get them into a pipeline with _(). Then you traverse out to the companies those two companies sold to. You use select to grab the tx (transaction edge) and buyer vertex, executing a closure on each of them to transform them into the fields you want, which will yield something like this (for one result; your Gremlin would likely return several of these with your full dataset, obviously):
[[1,100,2/3/2012],Pepsi_USA]
You could use some Groovy JDK (http://groovy.codehaus.org/groovy-jdk/) operations to transform it further from there if that's not the final format you need.