Using machine learning to de-duplicate data

I have the following problem and was thinking I could use machine learning but I'm not completely certain it will work for my use case.
I have a data set of around a hundred million records containing customer data including names, addresses, emails, phones, etc and would like to find a way to clean this customer data and identify possible duplicates in the data set.
Most of the data has been manually entered using an external system with no validation so a lot of our customers have ended up with more than one profile in our DB, sometimes with different data in each record.
For instance, we might have 5 different entries for a customer John Doe, each with different contact details.
We also have the case where multiple records that represent different customers match on key fields like email. For instance, when a customer doesn't have an email address but the data entry system requires one, our consultants will enter a random email address, resulting in many different customer profiles sharing the same email address. The same applies to phones, addresses, etc.
All of our data is indexed in Elasticsearch and stored in a SQL Server database. My first thought was to use Mahout as the machine learning platform (since this is a Java shop) and maybe HBase to store our data (just because it fits with the Hadoop ecosystem; I'm not sure it will be of any real value). But the more I read about it, the more confused I am as to how it would work in my case. For starters, I'm not sure what kind of algorithm I could use, since I'm not sure what category this problem falls into: can I use a clustering algorithm or a classification algorithm? And of course certain rules will have to be defined as to what constitutes a profile's uniqueness, i.e. which fields.
The idea is to have this deployed initially as a Customer Profile de-duplicator service of sorts that our data entry systems can use to validate and detect possible duplicates when entering a new customer profile and in the future perhaps develop this into an analytics platform to gather insight about our customers.
Any feedback will be greatly appreciated :)
Thanks.

There has actually been a lot of research on this, and people have used many different kinds of machine learning algorithms for it. I've tried genetic programming myself, which worked reasonably well, but I still prefer to tune matching manually.
I have a few references for research papers on this subject. StackOverflow doesn't want too many links, but here is bibliographic info that should be sufficient for finding them with Google:
Unsupervised Learning of Link Discovery Configuration, Andriy Nikolov, Mathieu d’Aquin, Enrico Motta
A Machine Learning Approach for Instance Matching Based on Similarity Metrics, Shu Rong, Xing Niu, Evan Wei Xiang, Haofen Wang, Qiang Yang, and Yong Yu
Learning Blocking Schemes for Record Linkage, Matthew Michelson and Craig A. Knoblock
Learning Linkage Rules using Genetic Programming, Robert Isele and Christian Bizer
That's all research, though. If you're looking for a practical solution to your problem I've built an open-source engine for this type of deduplication, called Duke. It indexes the data with Lucene, and then searches for matches before doing more detailed comparison. It requires manual setup, although there is a script that can use genetic programming (see link above) to create a setup for you. There's also a guy who wants to make an ElasticSearch plugin for Duke (see thread), but nothing's done so far.
Anyway, that's the approach I'd take in your case.
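The "search for candidates first, then compare in detail" idea described above can be sketched in a few lines. This is only a minimal illustration in Python, not Duke's actual implementation or configuration: the blocking key (first three letters of the last name), the compared fields, and the 0.8 threshold are all assumptions chosen for the example, and a real setup would use an index such as Lucene or Elasticsearch instead of an in-memory dictionary.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    """Cheap key that groups likely duplicates (hypothetical choice of key)."""
    last_name = record["name"].split()[-1].lower()
    return last_name[:3]


def similarity(a, b):
    """Average string similarity over the compared fields, between 0 and 1."""
    fields = ("name", "email", "phone")  # assumed field names
    scores = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in fields
    ]
    return sum(scores) / len(scores)


def find_duplicates(records, threshold=0.8):
    """Return (id, id, score) for pairs in the same block scoring >= threshold."""
    # Step 1: blocking, so that only records sharing a key are ever compared.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    # Step 2: detailed pairwise comparison inside each block.
    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            score = similarity(a, b)
            if score >= threshold:
                matches.append((a["id"], b["id"], round(score, 2)))
    return matches


records = [
    {"id": 1, "name": "John Doe", "email": "john.doe@example.com", "phone": "555-0100"},
    {"id": 2, "name": "Jon Doe", "email": "john.doe@example.com", "phone": "555-0100"},
    {"id": 3, "name": "Jane Doe", "email": "jane@other.org", "phone": "555-0199"},
    {"id": 4, "name": "Alice Smith", "email": "alice@example.com", "phone": "555-0142"},
]
duplicates = find_duplicates(records)
```

Blocking is what makes this feasible at a hundred million records: comparing every pair is quadratic, while comparing only within blocks keeps the number of detailed comparisons manageable. Averaging field scores is deliberately naive; with shared placeholder emails, as in the question, the email field would need a much lower weight.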

I just came across a similar problem, so I did a bit of Googling and found a library called the "Dedupe Python Library":
https://dedupe.io/developers/library/en/latest/
The documentation for this library covers common problems and solutions in de-duplicating entries, and lists papers in the de-duplication field. So even if you are not going to use it, the documentation is still worth reading.

Related

Automate Solving of customer technical issue Production L3 tickets

I want to develop an app that understands text from various inputs and makes decisions based on it. If at any point the system gets confused, the user can manually supply the output, and from then on the system must learn to give that output in similar scenarios. Basically, the system must learn from its past experience. The job I want to handle with this system is the mundane one of resolving customer technical problems (production L3 tickets). The first input would be the customer's problem with an order (for example, the state in which the order is stuck and the state to which the customer wants it pushed), and the second input would be the current state of the order (data retrieved for that order from multiple DB tables). For these two inputs, the output would be the action to take, such as updating certain columns and firing XML for that order. The tools I think I would need are a natural language processing (NLP) library for understanding text, and machine learning to learn from past confusing scenarios.
If you want to use Java libraries for your NLP pipeline, have a look at OpenNLP.
It gives you a lot of basic support.
Then there is deeplearning4j, which has a lot of neural network implementations in Java.
Since you want a dynamic model that can learn from past experience rather than a static one, deeplearning4j offers a number of neural network implementations you can experiment with.
Hope this helps!

Master Data Management using Graph Database

I am building a master database to store all relevant information about our customers. I am using Neo4j.
Below is a sample of our model. We have Person, which can be registered in 3 of our mobile applications (App.01, App.02, App.03; we use the CPF as the key, which is like an SSN). In those apps the user can register with an email, represented by the Email entity. A user can have multiple addresses, represented by the Address entity.
The question is:
As I am building a master data store, IMO, if someone queries the MDM database asking for all the "best" information about a person, I would return for example:
Name: John
Best email: email2 (because it has two apps using it)
Best address: addr1 (because it has two apps using it)
So I am going to build some heuristics to define what the "best" email and address are.
For this purpose, I have some options:
I could create an edge from John to email2 and to addr1, so it would be easy for a user of the MDM to get the "best" address/email for John.
I could build a REST API endpoint and apply this heuristic at query time.
Does anyone have experience using graph database or design MDM database?
Is it a good approach?
This question is a complement for the question: Using Neo4j to build a Master Data Management
The graph data model is good to store your master data, however, your master data most likely will co-exist with operational and reference data in the form of dimensions.
If you decide to go with a graph model for your MDM, make sure that you have a well-defined semantic model for the core dimensions in MDM, usually:
products
customer
employees
Assets
Location
These core dimensions become attributes of your nodes.
Also, decide what MDM architecture style you are going to adopt. Some popular ones are:
The Registry - Graph fits very well with this style because your master data remains in the SOR (system of record) and the references can be represented in the graph very nicely.
Master data Hub - Extra transformations are required to transpose your system of record from tabular form to the graph.
Master-Master - This style fits well with your MDM in the graph if you do not have too many legacy apps that depend on your MDM.
Approach 1 would add a lot of essentially redundant information (about 2N extra relationships, where N is the number of people), and also require more complex coding to handle changes to a person's apps. And, as always when information is stored redundantly, you would have to be especially careful that inconsistencies do not creep in. But, it should be faster when querying for the "best" contact info.
Approach 2 keeps the DB the same size, but requires a more complex and slower query to get the "best" contact info. However, changing a person's apps and contact info is straightforward.
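As a rough illustration of Approach 2, the "best" value can be computed at query time by counting how many apps use each value. The sketch below is plain Python over an in-memory list rather than a Neo4j query, and the record layout (`app`, `email`, `address` keys) is an assumption made for the example; the counting rule is the one from the question (the value used by the most apps wins).

```python
from collections import Counter


def best_value(registrations, field):
    """Return the value of `field` used by the most apps, or None if absent.

    Ties go to the value that appears first in `registrations`.
    """
    counts = Counter(r[field] for r in registrations if r.get(field))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical registrations for John across the three apps.
john = [
    {"app": "App.01", "email": "email1", "address": "addr1"},
    {"app": "App.02", "email": "email2", "address": "addr1"},
    {"app": "App.03", "email": "email2", "address": "addr2"},
]

best_email = best_value(john, "email")
best_address = best_value(john, "address")
```

With Approach 1, the same function would run whenever a person's apps or contact details change, and its result would be stored as the extra relationship instead of being computed on each read.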
To decide which approach to use, you should consider whether DB size is an issue, and also look at your use cases and how frequently they will be performed.
Here is a simple heuristic if DB size is not an issue. Suppose G is the frequency at which you need to get a person's "best" contact info, and M is the frequency at which you need to modify a person's apps or contact info. You would pick approach 1 if the value of G/M exceeds some threshold value, K, that you would have to decide on in light of the trade-offs above.

Geodata Querying Optimisations

I am planning to write a Node.js-powered RESTful web service that I will use for a mobile application which provides some sort of location based features. The most basic use case is going to look something like this:
the user can create a resource by sending a request to the web service containing the resource's name and the user's current location (latitude and longitude)
the web service will store the metadata about this resource internally in some sort of collection
the user can query the web service for a list of resources within 5km of his current location
One of the first problems that came to mind was scalability. Let's suppose that at some point in the future the server will hold metadata for 1 million resources. When a user queries for nearby results, looping through 1 million entries to compute the distance will take forever.
There are many services out there that have the same flow, so I thought implementing something like this is not going to take me a lot of time. I might have been wrong.
I am now two days into researching proven methods and algorithms. By now I have read everything I could get my hands on about QuadTrees, Geohashes, databases with spatial indexing support, formulas and so on. However, I still can't get the whole picture of how everything is going to work.
I was hoping that maybe someone who has worked on something similar could share his insight on what approach might be the most suitable considering this use case and the technologies that I am planning to use. Also, a short description of how it can be implemented would help me a lot!
For those who are also looking for more information on this topic out of curiosity, my answer might not provide much clarity. However, some answers in here might help you understand how you could achieve proximity searches using Geohashes.
My approach, after doing a little research on Redis, will be not to overcomplicate things and just use the tools that are already out there. Redis has out-of-the-box support for spatial indexing and will most probably meet all my persistence requirements for this project.
Apparently MongoDB also comes with built-in support for geodata. In fact, even RDBMS like MySQL or SQLite do come with such capabilities.
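To make the idea behind these spatial indexes more concrete, here is a minimal sketch of how a proximity query avoids scanning every resource. It uses a simple fixed latitude/longitude grid, which is a stand-in for geohashes or a database's built-in spatial index and not how Redis or MongoDB implement it. The cell size, class name, and sample coordinates are assumptions for illustration, and the sketch ignores the poles and the antimeridian.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0
CELL_DEG = 0.05  # assumed grid cell size in degrees (about 5.5 km of latitude)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


class GridIndex:
    """Buckets resources by grid cell so queries only scan nearby cells."""

    def __init__(self):
        self.cells = defaultdict(list)

    @staticmethod
    def _cell(lat, lon):
        return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))

    def add(self, name, lat, lon):
        self.cells[self._cell(lat, lon)].append((name, lat, lon))

    def nearby(self, lat, lon, radius_km=5.0):
        """Return (name, distance_km) within radius_km, nearest first."""
        # Approximate bounding box in degrees, slightly padded.
        ang = math.degrees(radius_km / EARTH_RADIUS_KM) * 1.01
        lon_span = min(180.0, ang / max(math.cos(math.radians(lat)), 1e-6))
        lo_row, lo_col = self._cell(lat - ang, lon - lon_span)
        hi_row, hi_col = self._cell(lat + ang, lon + lon_span)

        hits = []
        for row in range(lo_row, hi_row + 1):
            for col in range(lo_col, hi_col + 1):
                # Exact distance check only for resources in candidate cells.
                for name, rlat, rlon in self.cells.get((row, col), ()):
                    dist = haversine_km(lat, lon, rlat, rlon)
                    if dist <= radius_km:
                        hits.append((name, dist))
        return sorted(hits, key=lambda hit: hit[1])


index = GridIndex()
index.add("cafe", 52.5200, 13.4050)
index.add("park", 52.5300, 13.4100)
index.add("airport", 52.3667, 13.5033)
results = index.nearby(52.5200, 13.4050, radius_km=5.0)
```

The exact haversine check runs only on the handful of resources in cells overlapping the search box, so the cost depends on local density rather than on the total of 1 million entries. In practice, a database's spatial index does this bucketing for you.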

Mahout Recommender - questions to setup user preference

I'm looking for some advice / guidance --
I'm working on a recommendation engine / personal assistant app, using Mahout as the framework -
What I want is for new users of the app to begin by answering 5 questions, and to use their answers to affect the recommendations -- pretty much feeding the answers in as user preferences
I'm just not sure how to incorporate this into my code, I'm not even sure where to begin looking - I've been Googling but none of the search results really address this...
Any suggestions / advice / guidance will be greatly appreciated
Thanks
I did just that with the new Spark Itemsimilarity implementation about a year ago. You'll need a search engine for the recommendations query because Mahout doesn't have a server. I'd suggest using the new "Universal Recommender" engine template with PredictionIO. It uses Mahout to calculate the model and Elasticsearch to serve it.
https://templates.prediction.io/PredictionIO/template-scala-parallel-universal-recommendation
PredictionIO is a framework of integrated components that provides an event server (for event storage), integration with Hadoop/HDFS, Spark, and HBase, and a REST or SDK API. All you do is install it and get the template as a plugin engine. This will provide pretty advanced recommendation queries with multiple event ingestion, a hybrid content-based method to tune results, and several methods of using popular items for backfill when no other recommendations can be made. It also uses realtime user actions for recommendations.
This last bit is super important if you want to have your users go through some training. This way they will see the benefit of training in realtime. Check this site, where I did exactly what you are talking about: https://guide.finderbots.com Notice the "Trainer". It presents you with movies and asks for thumbs up or down for as many as you care to do, then when you ask for recommendations they will be based on the realtime preferences of the user. You need to create an account first so we have a user-id.
The way I created the list for the trainer is by clustering popular items. By clustering I mean grouping items based on the users that preferred them. Clustering produces items that are differentiated because they belong to different clusters, which means different user-sets tended to like them, and the popular ones are more likely to be known by users when they go through training. These are good things to have in a trainer.
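One way to picture the selection step is sketched below. It is not the code behind the site mentioned above, and it assumes the clustering has already been done elsewhere: `cluster_of` is a hypothetical mapping from item to cluster id, and `popularity` is a hypothetical count of users who preferred each item. The function simply takes the most popular item or items from each cluster, so that the trainer list is both well known and diverse.

```python
from collections import defaultdict


def trainer_items(popularity, cluster_of, per_cluster=1):
    """Pick the most popular items from each cluster.

    popularity: dict mapping item -> number of users who preferred it
    cluster_of: dict mapping item -> cluster id (assumed computed elsewhere)
    """
    by_cluster = defaultdict(list)
    for item, count in popularity.items():
        by_cluster[cluster_of[item]].append((count, item))

    picks = []
    for cluster in sorted(by_cluster):
        # Highest count first; item name breaks ties deterministically.
        ranked = sorted(by_cluster[cluster], key=lambda pair: (-pair[0], pair[1]))
        picks.extend(item for _, item in ranked[:per_cluster])
    return picks


# Made-up sample data for illustration only.
popularity = {"Alien": 900, "Aliens": 700, "Amelie": 400, "Up": 850, "Coco": 300}
cluster_of = {"Alien": 0, "Aliens": 0, "Amelie": 1, "Up": 2, "Coco": 2}
trainer_list = trainer_items(popularity, cluster_of)
```

For the original question, the same list could serve as the set of onboarding questions: each answer becomes a preference event for the new user.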

Free data warehousing systems--specifically, for data storage

I am building out some reporting stuff for our website (a decent sized site that gets several million pageviews a day), and am wondering if there are any good free/open source data warehousing systems out there.
Specifically, I am looking for only something to store the data--I plan to build a custom front end/UI to it so that it shows the information we care about. However, I don't want to have to build a customized database for this, and while I'm pretty sure an SQL database would not work here, I'm not sure what to use exactly. Any pointers to helpful articles would also be appreciated.
Edit: I should mention--one DB I have looked at briefly was MongoDB. It seems like it might work, but their "Use Cases" specifically mention data warehousing as "Less Well Suited": http://www.mongodb.org/display/DOCS/Use+Cases . Also, it doesn't seem to be specifically targeted towards data warehousing.
http://www.hypertable.org/ might be what you are looking for, which (going by your description above) is something to store large amounts of logged data with normalization, i.e. a visitor log.
Hypertable is based on Google's Bigtable project.
See http://code.google.com/p/hypertable/wiki/PerformanceTestAOLQueryLog for benchmarks.
You lose the relational capabilities of SQL-based DBs but gain a lot in performance. You could easily use Hypertable to store millions of rows per hour (hard drive space permitting).
hope that helps
I may not understand the problem correctly -- however, if you find some time to (re)visit Kimball’s “The Data Warehouse Toolkit”, you will find that all it takes for a basic DW is a plain-vanilla SQL database; in other words, you could build a decent DW with MySQL using MyISAM for the storage engine. The question is only the desired granularity of information – what you want to keep and for how long. If your reports are mostly periodic, and you implement a report storage or cache, then you don’t need to store pre-calculated aggregations (no need for cubes). In other words, a Kimball star schema with cached reporting can provide decent performance in many cases.
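A star schema with a report cache can be quite small. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely so it runs without a server; it stands in for MySQL/MyISAM, and the table names, columns, and sample rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT NOT NULL);
CREATE TABLE dim_page (page_id INTEGER PRIMARY KEY, url TEXT NOT NULL);
-- Fact table: one row per (day, page) at the chosen granularity.
CREATE TABLE fact_pageviews (
    date_id INTEGER NOT NULL REFERENCES dim_date(date_id),
    page_id INTEGER NOT NULL REFERENCES dim_page(page_id),
    views   INTEGER NOT NULL
);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                 [(1, "2010-01-01"), (2, "2010-01-02")])
conn.executemany("INSERT INTO dim_page VALUES (?, ?)",
                 [(1, "/home"), (2, "/about")])
conn.executemany("INSERT INTO fact_pageviews VALUES (?, ?, ?)",
                 [(1, 1, 120), (1, 2, 30), (2, 1, 200), (2, 2, 45)])


def views_per_day(connection):
    """Aggregate the fact table by the date dimension."""
    rows = connection.execute("""
        SELECT d.day, SUM(f.views)
        FROM fact_pageviews AS f
        JOIN dim_date AS d ON d.date_id = f.date_id
        GROUP BY d.day
        ORDER BY d.day
    """)
    return rows.fetchall()


_report_cache = {}


def cached_report(connection, name, build):
    """Run a periodic report once and serve later requests from the cache."""
    if name not in _report_cache:
        _report_cache[name] = build(connection)
    return _report_cache[name]


daily = cached_report(conn, "views_per_day", views_per_day)
```

The granularity decision shows up in the fact table: storing one row per page per day rather than one row per pageview is what keeps several million pageviews a day manageable in an ordinary SQL database. A real cache would also need an expiry rule tied to the reporting period.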
You could also look at the community edition of “Pentaho BI Suite” (open source) to get a quick start with ETL, analytics and reporting -- and experiment a bit to evaluate the performance before diving into custom development.
Although this may not be what you were expecting, it may be worth considering.
Pentaho Mondrian
Open source
Uses standard relational database
MDX (think pivot table)
ETL ( via Kettle )
I use this.
In addition to Mike's answer of hypertable, you may want to take a look at Apache's Hadoop project:
http://hadoop.apache.org/
They provide a number of tools which may be useful for your application, including HBase, another implementation of the BigTable concept. I'd imagine for reporting, you might find their mapreduce implementation useful as well.
It all depends on the data and how you plan to access it. MonetDB is a column-oriented database engine from the most revolutionary team on database technologies. They just got VLDB's 10-year best paper award. The DB is open source and there are plenty of reviews online praising them.
Perhaps you should have a look at TPC and see which of their test problem datasets match best your case and work from there.
Also consider the need for concurrency: it adds a big overhead to any approach and sometimes is not really required. For example, you can pre-digest some summary or index data and protect only that for high concurrency. Profiling your data queries is the next step.
About SQL, I don't like it either but I don't think it's smart ruling out an engine just because of the front-end language.
I have a similar problem and am thinking of using plain MyISAM with http://www.jitterbit.com/ as the data access layer. Jitterbit (or a similar free tool) seems very nice for this sort of transformation.
Hope this helps a bit.
A lot of people just use MySQL or Postgres :)
