How to distinguish between two Different Named Entities of same name? - machine-learning

I have few articles, in which I am taking out name using NER Model (Named Entity Recognition). Since NER is classifying into four categories ( PERSON, LOCATION, ORGANISATION, MISCELLANEOUS ). Now I having two people of same name. How will I go about distinguishing between them?
Kindly direct me towards some research available on this problem, if possible.

The task you need is called Entity Linking, it is a harder problem than Named Entity Recognition.
A good way to start research on this problem is the ACL anthology.

Related

Domain specific ontologies to download? Specifically for agricultural domain

I am looking for ontologies made for the domain of agriculture. I need these ontologies for testing a logic I implemented for merging domain specific ontologies. I specifically need ontologies that are created for the sub domains of agriculture(ontologies of other domains will also be useful as well).
For example: Crops ontology, Fertilizer ontology, Rice ontology, Weed ontology.
Any kind of ontology will be helpful. Does anyone know where to find such ontologies? I couldn't find any.
If anyone know ontologies like these related to a domain other than agriculture, mention them too. Thank you in advance.
Sorry if I posted a wrong kind of question.
You don't state exactly what you mean by ontology, so I'm assuming a definition as is commonly used in the life-sciences, encompassing a wide degree of axiomatization with hundreds to potentially hundreds of thousands of classes. If you mean something more like an RDF schema then my examples may not apply.
AgroPortal has over 100 ontologies/thesauri in the domain of or closely related to Agriculture. You can see in the top 5 accessed ontologies some of the most relevant ones, including AGROVOC. Note that GACS is itself a merger of other thesauri, so if your goal is to test a merge framework you may want to hold this one back. Many of these are more thesaurus-like, but some such as ENVO and AGRO employ more extensive OWL axiomatization.
Note also that considerable work has been done in the Planteome project to merge together different crop ontologies into a pan-species trait ontology, this may also be useful for your evaluation.
If you are interested in applying your techniques more broadly in the life sciences, common sub-domains are anatomy, phenotype and disease. These frequently are used as tests in initiatives like the Ontology Alignment Evaluation Initiative. Although it sounds like your technique goes beyond mapping and into merging, it may be useful to look at past competitions. I have also produced merged anatomy, disease and phenotype ontologies, these have all been curated and could be used as test sets for your approach.

Azure Machine Learning One to Many Data

I'm trying to learn Azure Machine Learning and it seems the data sources for all the algorithms are two dimensional. Is there any way I can use one to many relational tables as a data source? or is it even possible?
It's not possible as far as I'm aware :(
However, the general rule is that you should flatten a relational graph into a single array of values. Remember, though, that you should have one array of values per main entity, it looks to me like your main entity in your example is the one with the Visits in.
Effectively, you'd be saying that all diagnoses are a property of Visit, but because there's potentially more than one, you'd have to have properties such as Diagnosis1, Diagnosis2, Diagnosis3 ...etc.

Is there an absolute model answer for Entity-Relationship diagram?

Hi I am new to conceptual data modelling and is currently working on entity-relationship diagrams. Just some questions that I am not able to find an answer to:
I am designing an E-R diagram based on a given scenerio and as I match my answers against the 'model answer', it is quite different, especially the terms I use within the diamond shapes representing the relationship between 2 entities. Am i right to say that the choice of words representing the relationship can be anything so long as it is logical?
I noticed different tutorials uses different means of representing cardinality. Some uses crow foot, some uses M:N to represent many-to-many. As there are so many standards, which is the recommended standard to follow for a beginner?
Thanks in advance
A Conceptual Model is used to represent a domain of discourse to aid you and the users understand that domain. As such the wording and form of that representation should be that most aids you and the users in that understanding.
Therefore, should you choose to include an Entity Relationship Diagram as part of the Conceptual Model, the terms, notation and diagrammatic conventions used should be those that make most sense to you and the users. The terms used may be specific to your organization or business.
There is no particular best practice to recommend. As long as you and the users are happy that your Conceptual Model adequately explains the domain of discourse then the production of the model has fulfilled its objective.

Using machine learning to de-duplicate data

I have the following problem and was thinking I could use machine learning but I'm not completely certain it will work for my use case.
I have a data set of around a hundred million records containing customer data including names, addresses, emails, phones, etc and would like to find a way to clean this customer data and identify possible duplicates in the data set.
Most of the data has been manually entered using an external system with no validation so a lot of our customers have ended up with more than one profile in our DB, sometimes with different data in each record.
For Instance We might have 5 different entries for a customer John Doe, each with different contact details.
We also have the case where multiple records that represent different customers match on key fields like email. For instance when a customer doesn't have an email address but the data entry system requires it our consultants will use a random email address, resulting in many different customer profiles using the same email address, same applies for phones, addresses etc.
All of our data is indexed in Elasticsearch and stored in a SQL Server Database. My first thought was to use Mahout as a machine learning platform (since this is a Java shop) and maybe use H-base to store our data (just because it fits with the Hadoop Ecosystem, not sure if it will be of any real value), but the more I read about it the more confused I am as to how it would work in my case, for starters I'm not sure what kind of algorithm I could use since I'm not sure where this problem falls into, can I use a Clustering algorithm or a Classification algorithm? and of course certain rules will have to be used as to what constitutes a profile's uniqueness, i.e what fields.
The idea is to have this deployed initially as a Customer Profile de-duplicator service of sorts that our data entry systems can use to validate and detect possible duplicates when entering a new customer profile and in the future perhaps develop this into an analytics platform to gather insight about our customers.
Any feedback will be greatly appreciated :)
Thanks.
There has actually been a lot of research on this, and people have used many different kinds of machine learning algorithms for this. I've personally tried genetic programming, which worked reasonably well, but personally I still prefer to tune matching manually.
I have a few references for research papers on this subject. StackOverflow doesn't want too many links, but here is bibliograpic info that should be sufficient using Google:
Unsupervised Learning of Link Discovery Configuration, Andriy Nikolov, Mathieu d’Aquin, Enrico Motta
A Machine Learning Approach for Instance Matching Based on Similarity Metrics, Shu Rong1, Xing Niu1, Evan Wei Xiang2, Haofen Wang1, Qiang Yang2, and Yong Yu1
Learning Blocking Schemes for Record Linkage, Matthew Michelson and Craig A. Knoblock
Learning Linkage Rules using Genetic Programming, Robert Isele and Christian Bizer
That's all research, though. If you're looking for a practical solution to your problem I've built an open-source engine for this type of deduplication, called Duke. It indexes the data with Lucene, and then searches for matches before doing more detailed comparison. It requires manual setup, although there is a script that can use genetic programming (see link above) to create a setup for you. There's also a guy who wants to make an ElasticSearch plugin for Duke (see thread), but nothing's done so far.
Anyway, that's the approach I'd take in your case.
Just came across similar problem so did a bit Google. Find a library called "Dedupe Python Library"
https://dedupe.io/developers/library/en/latest/
The document for this library have detail of common problems and solutions when de-dupe entries as well as papers in de-dupe field. So even if you are not using it, still good to read the document.

collaborative filtering approach for tips/recommendations related to registered courses

I am looking at a specific problem where I need to build a recommender.
The generalized problem is as follows,
Each user has registered for (say) x courses (c1, c2, c3, .. cx)
Depending on each course, I need to provide (say) top 5 tips/recommendations to the user (e.g. study materials that could be useful etc)
I need collaborative elements to be applied to learn what recommendations are proving helpful to users.
I looked at the recommendation engines like Apache Mahout Taste, but I am unable to model my problem in a way that it looks like the examples shown. (The extra filtering criteria where a user is associated with one or more courses and each recommendation/tip could be associated with one or more courses is throwing me off.)
Please let me know if there is any good way of modeling such a problem? Any pointers to documentation/examples would be very appreciated.
I am just starting my research in this area so please bear with me if I have misunderstood any concepts.
Thanks,
Vivek
This may be too simple to need a recommender. If each course has a set of associated materials, then it seems clear that taking course c1 means they should have the associated materials for the course. Maybe rank from among all materials by popularity. That might be very easy and accomplish most of what you need.
If you want to model this as CF, you can; I don't know how much data you have. If you have just a handful of users and courses it will be too sparse to give useful answers.
Your users have relations to two things: courses and materials. You don't want to recommend courses, but rather materials. I would build two data models: one with user-course info, and one with user-material purchase info. Use the user-course data as the basis of a UserSimilarity implementation that defines user-user similarity. Then piece that together with a NearestNUserNeighborhood, a GenericUserBasedRecommender, but using the other user-material data model.
You will be using user-user similarity based on courses to make recommendations from among materials.

Resources