Hadoop join with String key - join

I'm implementing a reduce-side join to find matches between databases A and B. Both files from the datasets contains a json object per line. The join key is the name attribute of each record, so, the mapper extract the name of the json and pass it as key and the json itself as value. The reducer must merge the jsons objects for the same or similar person name.
The problem is that I need to group keys using a string similarity matching algorithm, e.g., John White must be considered equal to John White Lennon.
I've tried to do that using a grouping comparator but it is not working as expected.
How can this be implemented?
Thanks in advance!

What you request here could be described as a set similarity join, where the sets are, e.g. the sets of tokens, or n-grams of each line. Here is a research paper, that describes how you can achieve that in MapReduce. I hope you find it useful.

Related

Collection? Dictionary? List?

Neo4j's manual does a very good job at explaining the meaning of the terms Node, Relationship, Label and a few others.
However, the real vocabulary of Cypher seems to include quite a few elusive terms, as well.
For instance, clause 3.3.15.1 of the manual says "Lists and paths are key concepts in Cypher". Fine, but what is a List in Cypher? I have all but given up trying to find a definition of that "key concept".
Similarly, the Cypher Reference Card mentions that "Cypher also supports maps and collections". Elsewhere, one can find that Cypher also "works with dictionaries".
Needless to say, I am in the dark as to how to spot and/or use those in Cypher.
Would really appreciate some illustrations.
Thanks.
The docs has a section on Composite types:
3.2.1.3. Composite types
✓ Can be returned from Cypher queries
✓ Can be used as parameters
❏ Cannot be stored as properties
✓ Can be constructed with Cypher literals
Composite types comprise:
Lists are heterogeneous, ordered collections of values, each of which has any property, structural or composite type.
Maps are heterogeneous, unordered collections of (key, value) pairs, where:
the key is a String
the value has any property, structural or composite type
You might also be interested in the development of openCypher. One of the goals of the openCypher project is to define the concepts of the Cypher language. As stated on its homepage:
The openCypher project aims to deliver a full and open specification of the industry’s most widely adopted graph database query language: Cypher.
Currently, openCypher is a work-in-progress. It has a draft document on the Property Graph Model, which itself does not discuss lists/maps in detail, but refers the CIP2015-06-16 - Public Type System and Type Annotations document, which is an accepted Cypher Improvement Proposal. It has a section on "Container Types", which defines how lists and maps work in Cypher.
I have not seen the term "dictionary" in the core Neo4j docs. It could be mentioned around drivers though, as some languages, e.g. Python, use this term for maps.
(Disclaimer: I am a regular participant at the openCypher Implementers Group meetings.)
According to wikipedia :
https://en.wikipedia.org/wiki/List_(abstract_data_type) : a list (or sequence, collection) is a data type that represents a countable number of ordered values.
https://en.wikipedia.org/wiki/List_(abstract_data_type) : a map (or dictionnary) is a data type composed of a collection of (key, value) pairs.
And in Cypher :
this is a list : RETURN ['Benoit', 'Simard'] AS list
this is a map : RETURN { firstname:'Benoit', lastname:'Simard'} AS map
Cheers

Individuals from DBpedia

I have an exercise in Semantic Web. I must extract some individuals from the DBpedia. These individuals must be inserted into an ontology that I must create. My question is. Can I retrieve individuals from the DBedia?
Let me clarify !
When I send this sparql query
PREFIX dbo: <http://dbpedia.org/ontology>
SELECT distinct * WHERE
{
?album a dbo:Album .
} LIMIT 10
I get 10 URIs. Should I get whole instances ? I mean, label, object properties, data properties etc. in order to insert them to my ontology?
I want them as a complete instance. I don't want to add more variables e.g
?album dbo:artist ?artist .
Can I use a java api e.g. Jena ?
EDIT:
Let me give you an example. Suppose that you get an Album with URI
http://dbpedia.org/resource/...Baby_One_More_Time_(album)
This album has also some properties with their values e.g.
dbo:artist dbr:Britney_Spears
dbo:releaseDate 1999-01-12 (xsd:date)
...
How could I get all of them in order to create an indivual album for my ontology with properties artist and releaseDate and values Britney_Spears and 1999-01-12 respectively ?
Well, a good point always to start is your requirements! What do you exactly need? There is scientific plethora research on Ontology Module Extraction (see for example here).
My rule of thumb is that: the amount you extract must align with the required constraints of soundness and completeness of results, which in turn, aligns with your requirements. To make it clear, consider the following: A DBpedia Artist is a subClassOf Person. Now consider that you extract all the instances of Artist from DBPedia, without the piece of information that Artist is a subClassOf Person. Now if you query your dataset asking for Person, you will get nothing. Is this a sound result? yes, but is it complete? No! However, if you don't care about the fact that each Artist is a Person, then it's okay. A mentioning worthy thing is that it depends on the DBpedia endpoint itself and what kind of reasoning it performs as well.
Concluding: Specify what you really need. While you can suffice for a couple of classes with their instances, you can as well extract the whole DBpedia.
Regarding getting the data, there are multiple ways; again depending on your requirements. For simple purposes, you can use Jena TDB for triples storage and access them via Jena. You can even store your data simply in an RDF file. You can, for example, use a construct query on DBpedia endpoint and specify the results format as RDF and then insert them to your RDF engine. Another option, for example, this answer, states how to use an INSERT query to perform the insert task into a local graph.
You can retrieve instances from DBpedia with whatever metadata you want, but it depends on your ontology that you would like to create. Please take a look at this document, it will help you to understand some notions.
Should you get whole instances? I think you are asking if you should take all the proporties and objects depending on the subject. Not necessarily..It depends on your ontology as stated in first step and you decide what to take.
Should you use Jena? You can but you don't have to! If you pose a CONSTRUCT query to the endpoint you can get the data but as far as I understood you don't want to add variables. So you can pose a query as follows by asking all the metadata regarding to the instance.
CONSTRUCT { ?album ?p ?o } WHERE {
?album a dbo:Album .
?album ?p ?o
}
If you would like to get a limited number of instances then you can add limit again at the end of this query.

Finding similar users based on String preperties

Im a software engineering student, and new to Data Mining, I want to implement a solution to find similar users based on their interests and skills (Strings sets).
I think I cannot use K nearest Neighbors using an edit distance(Levenshtein or ..)
If someone could help with that please
The first thing you should do is convert your data into some reasonable representation, so that you will have a well-defined notion of distance between suitably represented users.
I would recommend converting all strings into some canonical form, then sorting all n distinct skills and interest strings into a dictionary D. Now for each user u, construct a vector v(u) with n components, which has i-th component set to 1 if the property in dictionary entry i is present, and 0 otherwise. Essentially we represented each user with a characteristic vector of her interests/skills.
Now you can compare users with Jaccard index (it's just an example, you'll have to figure out what works best for you). With the notion of a distance in hand, you can start trying out various approaches. Here are some that spring to mind:
apply hierarchical clustering if the number of users is sufficiently small;
apply association rule learning (I'll leave you to think out the details);
etc.

What's the optimal structure for a multi-domain sentence/word graph in Neo4j?

I'm implementing abstractive summarization based on this paper, and I'm having trouble deciding the most optimal way to implement the graph such that it can be used for multi-domain analysis. Let's start with Twitter as an example domain.
For every tweet, each sentence would be graphed like this (ex: "#stackoverflow is a great place for getting help #graphsftw"):
(#stackoverflow)-[next]->(is)
-[next]->(a)
-[next]->(great)
-[next]->(place)
-[next]->(for)
-[next]->(getting)
-[next]->(help)
-[next]->(#graphsftw)
This would yield a graph similar to the one outlined in the paper:
To have a kind of domain layer for each word, I'm adding them to the graph like this (with properties including things like part of speech):
MERGE (w:Word:TwitterWord {orth: "word" }) ON CREATE SET ... ON MATCH SET ...
In the paper, they set a property on each word {SID:PID}, which describes the sentence id of the word (SID) and also the position of each word in the sentence (PID); so in the example sentence "#stackoverflow" would have a property of {1:1}, "is" would be {1:2}, "#graphsftw" {1:9}, etc. Each subsequent reference to the word in another sentence would add an element to the {SID:PID} property array: [{1:x}, {n:n}].
It doesn't seem like having sentence and positional information as an array of elements contained within a property of each node is efficient, especially when dealing with multiple word-domains and sub-domains within each word layer.
For each word layer or domain like Twitter, what I want to do is get an idea of what's happening around specific domain/layer entities like mentions and hashtags; in this example, #stackoverflow and #graphsftw.
What is the most optimal way to add subdomain layers on top of, for example, a 'Twitter' layer, such that different words are directed towards specific domain-entities like #hashtags and #mentions? I could use a separate label for each subdomain, like :Word:TwitterWord:Stackoverflow, but that would give my graph a ton of separate labels.
If I include the subdomain entities in a node property array, then it seems like traversal would become an issue.
Since all tweets and extracted entities like #mentions and #hashtags are being graphed as nodes/vertices prior to the word-graph step, I could have edges going from #hashtags and #mentions to words. Or, I could have edges going from tweets to words with the entities as an edge property. Basically, I'm looking for a structure that is the "cheapest" in terms of both storage and traversal.
Any input on how generally to structure this graph would be greatly appreciated. Thanks!
You could also put the domains / positions on the relationships (and perhaps also add a source-id).
OTOH you can also infer that information as long as your relationships represent the original sentence.
You could then either aggregate the relationships dynamically to compute the strengths or have a separate "composite" relationship that aggregates all the others into a counter or sum.

Bipartite graph maximum matching

I am new to graphs. I have two sets in a bipartite graph. I need to find unique matching of all the possible combinations. So I thought I use Hopcroft-Karp to find maximum matching. Being a newbie I thought I would get the resulting matching graph but all it tells me is 42. Ahhh that really helps. I don't need to know how many matchings there are I need to know the unique matchings themselfs.
Am I missing something? How do I get the resulting matching?
I diden't check the datastructures generated by the Hopcroft-Karp match function, only the retrun value. The return value is the number of matchings. However there was also a self.pair dictionary in the python code, the pair dictionary contains the matchings from "both" sides, which answers my question.

Resources