Graph Databases and MDM - neo4j

One of the problems that I'm trying to solve with Master Data Management (MDM) is merging duplicate entities that look different because of things like misspellings. For instance, John Doe and Jon Doe might in reality be the same person.
I've read that graph databases like Neo4j can be used for MDM, and I have the vague sense that graph theory might be able to help me resolve the problem of duplicate entities. Basically, if I look at the relationships around John Doe and Jon Doe, might the graph similarity of those nodes with other pieces of data offer a way to decide whether they are in fact the same entity?
If so, how can I go about doing this with Neo4J?
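One way this is sometimes sketched in Cypher, assuming the records are :Person nodes with a name property and that the APOC library is installed for string similarity (the label, property and thresholds below are assumptions, not a prescribed model): score candidate pairs with similar names by how many neighbouring nodes they share, since a high overlap suggests the two records may be the same entity.

// Hedged sketch: find name-similar pairs and count shared neighbours.
// :Person, the name property and the 0.8 / 3 thresholds are illustrative.
MATCH (a:Person), (b:Person)
WHERE id(a) < id(b)
  AND apoc.text.levenshteinSimilarity(a.name, b.name) > 0.8
MATCH (a)--(shared)--(b)
WITH a, b, count(DISTINCT shared) AS sharedNeighbours
WHERE sharedNeighbours >= 3
RETURN a.name, b.name, sharedNeighbours
ORDER BY sharedNeighbours DESC

On a large graph you would not compare every pair like this; you would first restrict candidates with an index lookup or a blocking key, but the shape of the query stays the same.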

Related

Why it is not recommended to index relationships in a graph database

In the book Neo4j in Action by Aleksa Vukotic and Nicki Watt, the authors say:
In our experience, it is less common for relationship indexes to be good solutions. We are not saying that relationship indexing is poor practice, but if you find yourself adding lots of relationship indexes, it is worth asking why.
It sounds like the authors do not recommend indexing relationships in a graph database, but no explanation is given thereafter. Does anyone know why?
I've voted for this question to be migrated to SO, and am answering it while hoping it really does get migrated. I used Neo4j a couple of years ago. Although it has changed a lot since then, the principles of being a graph database won't have altered much, I believe. In my opinion, if you need a lot of indices to promptly query the relationships between the nodes, you could have designed your data model in some other way so that it focuses more on the graph nodes (for example, with relationships becoming your nodes and nodes becoming your relationships, as in a line graph), because the querying mechanism (e.g. a Cypher query) is generally optimised for nodes.
First, it's important to understand the role of indexes in Neo4j, in that indexes are used to find starting points in the graph, after which relationship traversal and filtering are used to perform the remainder of the pattern matching and to complete the query.
The advice therefore is about the same as: "we do not recommend using relationships as starting points in the graph", and we find that true more often than not.
Usually when you need to do index lookups, you have certain "things" in mind as your starting places, and important things in graphs are typically represented by nodes. If we're asking "what employees are connected to this particular company" we're interested in starting quickly by finding that particular company and expanding out, not in finding all :EMPLOYED_BY relationships in the graph and filtering by the connected company, which would take far more time.
Often we find that those who encounter this restriction, and need this kind of fast lookup of relationships anyway, may need to rethink their model. When there is a need to look up relationships as starting places in the graph, it is usually an indication that the thing represented by the relationship is important enough that it really should be a node in the graph (with its own relationships to the previously connected nodes), so this becomes a "modeling smell" that drives refactoring of the model. This kind of change often feels more natural afterwards, and affords capabilities for the thing as a node that weren't available when it was modeled as a relationship (for example, the ability to apply multiple labels to it, or to connect it via relationships to more nodes than just the original two).
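For example (a hedged sketch, not from the original answer; the property and relationship names are assumptions), an :EMPLOYED_BY relationship that keeps accumulating detail could be refactored into its own node like this:

// Turn each (:Person)-[:EMPLOYED_BY]->(:Company) relationship into an
// :Employment node that can carry labels, be indexed, and gain its own
// relationships. Property names here are illustrative.
MATCH (p:Person)-[r:EMPLOYED_BY]->(c:Company)
CREATE (e:Employment {since: r.since, title: r.title})
CREATE (p)-[:HAS_EMPLOYMENT]->(e)-[:AT_COMPANY]->(c)
DELETE r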
All that said, there will be cases where a relationship really does just need to be a relationship (either for business reasons, or because it truly is most practical modeling-wise to keep it as a relationship), and using those relationships as starting points in the graph makes sense.
With the fulltext schema indexes introduced in Neo4j 3.5, we added the capability to create relationship indexes by relationship type(s) and property (or properties). So the capability is there, if needed, after you've ruled out refactoring of your model.
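For reference, this is roughly what that looks like (a sketch using the Neo4j 3.5 fulltext procedures; the index name, relationship type and property here are made up):

// Create a fulltext index over a relationship type and property,
// then query it for matching relationships plus a relevance score.
CALL db.index.fulltext.createRelationshipIndex("employmentNotes", ["EMPLOYED_BY"], ["notes"]);

CALL db.index.fulltext.queryRelationships("employmentNotes", "contractor")
YIELD relationship, score
RETURN relationship, score;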

In neo4j, how to club nodes together into groups and have relationships between these groups?

I want to club nodes and their relations together (like triplets) and have relations going between such triplets in Neo4j.
How can this be done in Neo4j? Sorry for not showing any of my previous work, because I was not able to find anything that was of much use.
I have asked the same thing in neo4j forums, here.
Thanks in advance.
EDIT1:
I have figured out one possible way to do this, but need your help to decide whether it will cause problems with queries or storage.
Sorry, stackoverflow said I'm not allowed to embed pictures yet
Possible solution sketch.
Triplets take the form of subject --> relation --> predicate
So for every triplet I need, I'll make another node representing that triplet.
The triplet node will have links to the subject and predicate, and can also contain their ids as key-value pairs.
In this way we could have relations between 2 triplets.
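A rough Cypher sketch of that approach (every label, relationship type and property below is a placeholder, not a recommendation): each triplet becomes its own node pointing at its subject and predicate nodes, and relations between triplets are then ordinary relationships between the triplet nodes.

// Reify one triplet as a node, linked to its subject and predicate.
MATCH (s:Entity {id: "subject-1"}), (p:Entity {id: "predicate-1"})
CREATE (t:Triplet {relation: "KNOWS", subjectId: s.id, predicateId: p.id})
CREATE (t)-[:SUBJECT]->(s)
CREATE (t)-[:PREDICATE]->(p);

// A relation between two triplets is then just a relationship between
// their :Triplet nodes.
MATCH (t1:Triplet {relation: "KNOWS"}), (t2:Triplet {relation: "WORKS_AT"})
CREATE (t1)-[:IMPLIES]->(t2);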

Simple social network design flaw with graph database

I was looking at graph databases and Neo4j. As suggested, I tried to draw a simple social networking graph on white paper, and after a few sketches I got stuck at some similar points.
At first I designed a social network where "user"s can "like" "post"s.
(u1:User)-[:LIKED]->(p:Post)<-[:POSTED]-(u2:User)
Now I want to notify user2 about the like action, so I drew this on the white paper:
(u1:User)-[:LIKED]->(p:Post)<-[:POSTED]-(u2:User)
              |                             ^
              |__________[:NOTIFY]__________|
I am not sure if it is clear, but I just drew a relationship between a node and another relationship, which is not possible in graph databases, at least in Neo4j. So I decided a Like should be a node instead of a relationship. Then my graph turned into this:
(u1:User)-[:CREATED]->(l:Like)-[:BELONGS_TO]->(p:Post)<-[:POSTED]-(u2:User)
                          |                                           ^
                          |_________________[:NOTIFY]_________________|
Now everything is OK. Then I added a Comments feature to the system as a relationship, but once notifications were involved, it again turned into a node. The same happened when I added a "liking comments" feature: likes on comments first seemed to be relationships, but once again they turned into nodes when notifications were involved.
In general, at some point I find myself drawing a relationship between a node and another relationship. My solution to that feels like I am turning entities which naturally look like relationships into nodes. And this makes me think I have some problems with deciding what should be a node and what should be a relationship.
So my question is, does anyone else other than me fall into this "relationship between a node and another relationship" issue and if so how do you solve that?
It all depends on your use cases. In many cases a simple relationship is good enough, but if you want to do more with that entity or fact, you turn it into a node; oftentimes it turns out that it is actually quite an important concept in the domain.
In our data modeling class there is a specific section on this, and it is also discussed in detail in the "Graph Databases" book (you can get the free PDF here).
Sometimes it makes sense to keep the original relationship around for a fast shortcut crossing over that intermediate node if you don't need that detail.
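As a sketch of that combination, following the example above (the name and id properties used to find the nodes are assumptions, and the shortcut [:LIKED] relationship is optional, only useful if you often traverse likes without needing the notification detail):

// Create the Like as a node, notify the post's author, and optionally
// keep a direct [:LIKED] shortcut for cheap traversals.
MATCH (u1:User {name: "user1"})
MATCH (u2:User)-[:POSTED]->(p:Post {id: "post-1"})
CREATE (u1)-[:CREATED]->(l:Like)-[:BELONGS_TO]->(p)
CREATE (l)-[:NOTIFY]->(u2)
CREATE (u1)-[:LIKED]->(p)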

Can this be accomplished by a Graph Database?

I have a request to develop an application that keeps track of the movements of a certain item (or items). To better demonstrate what the application must do, I drew a diagram (a simplified abstraction).
As I never worked with any databases other than the relational ones, I really don't know if I can solve this problem with a graph database.
These questions must be answered by the system:
What was the path that a certain pen drive walked?
I passed some pen drives on. Where are they now?
What are the pens I received, where did they come from, and where did they go?
Where are the pens I burned and passed? And with whom?
Any help and suggestions are much appreciated.
Thanks
In Neo4j everything is either a node or a relationship. So it's useful to think: what would be my nodes and relationships?
Here it might be, for example, that every "pen drive", "person" and "location" is a node. Verbs like "walk" or "give" would be your relationships.
In this model, you'd be able to use Cypher to query for things like "give me all location nodes connected to pen drive nodes by the relationship walk", or "start at all person nodes and return the ones that have a give relationship to a pen drive node that doesn't have a give relationship connecting back to the starting person node."
This rich graph query language gives you nice algorithms like shortest distance for free, so beyond a transactional record you could determine whether, for example, a pen drive made it from A to B using the optimal path. But as you can see above, "relational joins" do not beget simple queries or descriptions thereof.
When it comes to database design, when the model becomes cumbersome to map mentally, it's going to be a pain to develop too. Design your database based on how you plan to query it. If you're unable to easily explain those queries in terms of Neo4j, it's possible that Neo4j isn't going to be the best fit.
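To make that concrete, here is a hedged sketch of how two of the questions might look in Cypher; every label, relationship type and property below is an assumption about one possible model, not something prescribed by the question.

// 1. What was the path that a certain pen drive walked?
//    Here each movement is a [:WALKED {step}] relationship to a :Location.
MATCH (pd:PenDrive {serial: "PD-42"})-[w:WALKED]->(loc:Location)
RETURN loc.name
ORDER BY w.step;

// 2. I passed some pen drives on; with whom are they now?
//    Handovers are [:GAVE] relationships, current possession is [:HOLDS].
MATCH (:Person {name: "Me"})-[:GAVE]->(pd:PenDrive)<-[:HOLDS]-(holder:Person)
RETURN pd.serial, holder.name;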

Using machine learning to de-duplicate data

I have the following problem and was thinking I could use machine learning but I'm not completely certain it will work for my use case.
I have a data set of around a hundred million records containing customer data including names, addresses, emails, phones, etc., and I would like to find a way to clean this customer data and identify possible duplicates in the data set.
Most of the data has been manually entered using an external system with no validation so a lot of our customers have ended up with more than one profile in our DB, sometimes with different data in each record.
For instance, we might have 5 different entries for a customer John Doe, each with different contact details.
We also have cases where multiple records that represent different customers match on key fields like email. For instance, when a customer doesn't have an email address but the data entry system requires one, our consultants will use a random email address, resulting in many different customer profiles using the same email address; the same applies to phones, addresses, etc.
All of our data is indexed in Elasticsearch and stored in a SQL Server database. My first thought was to use Mahout as a machine learning platform (since this is a Java shop) and maybe use HBase to store our data (just because it fits with the Hadoop ecosystem, not because I'm sure it will be of any real value). But the more I read about it, the more confused I am as to how it would work in my case. For starters, I'm not sure what kind of algorithm I could use, since I'm not sure where this problem falls: could I use a clustering algorithm or a classification algorithm? And of course certain rules will have to be used as to what constitutes a profile's uniqueness, i.e. which fields.
The idea is to have this deployed initially as a customer-profile de-duplication service of sorts that our data entry systems can use to validate and detect possible duplicates when entering a new customer profile, and in the future perhaps to develop this into an analytics platform to gather insight about our customers.
Any feedback will be greatly appreciated :)
Thanks.
There has actually been a lot of research on this, and people have used many different kinds of machine learning algorithms for it. I've personally tried genetic programming, which worked reasonably well, but I still prefer to tune matching manually.
I have a few references for research papers on this subject. Stack Overflow doesn't want too many links, but here is bibliographic info that should be sufficient using Google:
Unsupervised Learning of Link Discovery Configuration, Andriy Nikolov, Mathieu d’Aquin, Enrico Motta
A Machine Learning Approach for Instance Matching Based on Similarity Metrics, Shu Rong, Xing Niu, Evan Wei Xiang, Haofen Wang, Qiang Yang, and Yong Yu
Learning Blocking Schemes for Record Linkage, Matthew Michelson and Craig A. Knoblock
Learning Linkage Rules using Genetic Programming, Robert Isele and Christian Bizer
That's all research, though. If you're looking for a practical solution to your problem, I've built an open-source engine for this type of deduplication, called Duke. It indexes the data with Lucene, and then searches for matches before doing more detailed comparison. It requires manual setup, although there is a script that can use genetic programming (see link above) to create a setup for you. There's also a guy who wants to make an Elasticsearch plugin for Duke (see thread), but nothing's done so far.
Anyway, that's the approach I'd take in your case.
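If the customer records were loaded into Neo4j (as in the MDM question at the top of this page), the same two-step idea of cheap candidate blocking followed by a more detailed comparison could be sketched in Cypher. This is not Duke's API; the :Customer label, property names, APOC similarity function and thresholds are all assumptions.

// Block candidate pairs on a shared email or phone, then compare names.
// The name comparison guards against the placeholder emails mentioned in
// the question, which would otherwise cause false matches.
MATCH (a:Customer), (b:Customer)
WHERE id(a) < id(b)
  AND (a.email = b.email OR a.phone = b.phone)
WITH a, b, apoc.text.levenshteinSimilarity(a.name, b.name) AS nameSimilarity
WHERE nameSimilarity > 0.85
RETURN a.name, b.name, a.email, nameSimilarity
ORDER BY nameSimilarity DESC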
Just came across a similar problem, so I did a bit of Googling and found a library called "Dedupe Python Library":
https://dedupe.io/developers/library/en/latest/
The documentation for this library covers common problems and solutions when de-duplicating entries, as well as papers in the de-duplication field. So even if you don't end up using it, the documentation is still worth reading.

Resources