Entity relations from Freebase dump

I want to dump all entity pairs together with the relation that connects them.
Example:
subject predicate object
<freebase/ns/g.11bc7__xnw> <freebase/ns/people.place_lived.location> <freebase/ns/m.02_286> .
"freebase" in the line above stands for the URL of the Freebase website.
I extracted all triples that have a mid as both subject and object, and took the predicate as the relation.
For the above example my code outputs something like this:
entity pair : g.11bc7__xnw , m.02_286
relation : people.place_lived.location
I have two issues:
1) When I ran my code on the Freebase dump I got 14,887 relations, but Freebase actually has more than 25,000 relations.
2) For some mids there is no name or alias property (/type/object/name, /common/topic/alias).
Please tell me what I am doing wrong.

Well, some relations do not point to a mid, but to a basic value:
<http://rdf.freebase.com/ns/g.11vjz1ynm> <http://rdf.freebase.com/ns/measurement_unit.dated_percentage.date> "2001-02"
And that is the case for basically the entire measurement_unit domain.
Then, the mids that don't have a name or alias sound like CVTs (compound value types), which are artificial nodes that hold a complex relationship (e.g. node to node + time).
So I think you should account better for basic values (measurements, booleans, dates, etc.) and for CVTs.
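A minimal sketch of how the two kinds of triples could be counted separately. It assumes the dump lines look like the examples above (N-Triples style, full http://rdf.freebase.com/ns/ prefix) and that mids start with "m." or "g."; the function names are made up for illustration and are not part of any Freebase tooling.

```python
import re
from collections import Counter

NS = "http://rdf.freebase.com/ns/"
# Subject, predicate, then the object (the trailing " ." is optional here).
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.?\s*$")
# Assumed shape of a Freebase id, e.g. m.02_286 or g.11bc7__xnw.
MID = re.compile(r"^[mg]\.[0-9a-z_]+$")

def local(uri):
    """Strip the namespace prefix if it is present."""
    return uri[len(NS):] if uri.startswith(NS) else uri

def classify(line):
    """Return (predicate, kind) for a triple whose subject is a mid, else None."""
    m = TRIPLE.match(line.strip())
    if not m:
        return None
    subj, pred, obj = m.groups()
    if not MID.match(local(subj)):
        return None
    if obj.startswith("<") and obj.endswith(">"):
        kind = "mid" if MID.match(local(obj[1:-1])) else "other"
    else:
        kind = "literal"  # dates, numbers, booleans, strings
    return local(pred), kind

def relation_counts(lines):
    """Count predicates separately for mid-valued and literal-valued objects."""
    counts = {"mid": Counter(), "literal": Counter()}
    for line in lines:
        result = classify(line)
        if result and result[1] in counts:
            counts[result[1]][result[0]] += 1
    return counts

sample = [
    "<http://rdf.freebase.com/ns/g.11bc7__xnw> "
    "<http://rdf.freebase.com/ns/people.place_lived.location> "
    "<http://rdf.freebase.com/ns/m.02_286> .",
    "<http://rdf.freebase.com/ns/g.11vjz1ynm> "
    "<http://rdf.freebase.com/ns/measurement_unit.dated_percentage.date> "
    '"2001-02" .',
]
counts = relation_counts(sample)
```

Comparing the number of distinct predicates in counts["literal"] with those in counts["mid"] would show how many relations a mid-to-mid filter leaves out. CVT detection is not covered by this sketch.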

Related

Adding relationship to existing nodes with Cypher doesn't work

I am working on the Panama dataset using the Neo4J graph database, 1.1.5 web version. I found Ion Sturza, former Prime Minister of Moldova, in the database and want to map his related network. I used the following Cypher query (creating a variable 'IonSturza'):
MATCH (IonSturza {name: "Ion Sturza"}) RETURN IonSturza
I found that the officer 'CONSTANTIN LUTSENKO' and an officer with the same name in mixed case are linked to different entities, such as 'Quade..' and 'Kinbo...', as in this picture. I therefore want to create a 'SAME_COMPANY_AS' relationship between the upper-case and the mixed-case version. I tried the following code, based on this answer by @StefanArmbruster:
MATCH (a:Officer {name :"Constantin Lutsenko"}),(b:Officer{name :
"CONSTANTIN LUTSENKO"})
where (a:Officer{name :"Constantin Lutsenko"})-[:SHAREHOLDER_OF]->
(b:Entity{id:'284429'})
CREATE (a)-[:SAME_COMPANY_AS]->(b)
Instead of indexing, I used the 'where' statement to specify the uncapped version which is linked only to the entity bearing id '284429'.
My code however shows the cartesian product error message:
This query builds a cartesian product between disconnected patterns. If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (b))
Also, when I execute it, there are no changes and no rows. What am I missing here? Can someone please help me insert this relationship between the nodes? Thanks in advance!
The cartesian product warning will appear whenever you're matching on two or more disconnected patterns. In this case, however, it's fine, because you're looking up both of them by what is likely a unique name, so your result should be one node each.
If each separate part of that pattern returned multiple nodes, then you would have (rows of a) x (rows of b), a cartesian product between the two result sets.
So in this particular case, don't mind the warning.
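To make the row arithmetic concrete, here is a small Python sketch (not Cypher, and not how Neo4j executes the query) that pairs two result sets the way disconnected patterns are combined. The lists stand in for the rows matched by each pattern and are hypothetical.

```python
from itertools import product

def cartesian_rows(rows_a, rows_b):
    """Combine every row of a with every row of b, as disconnected patterns do."""
    return list(product(rows_a, rows_b))

# One node per pattern: a single combined row, so the warning is harmless.
unique_lookup = cartesian_rows(["Constantin Lutsenko"], ["CONSTANTIN LUTSENKO"])

# Three and four rows per pattern: twelve combined rows.
broad_lookup = cartesian_rows(["a1", "a2", "a3"], ["b1", "b2", "b3", "b4"])
```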
As for why you're not seeing changes, note that you're reusing variables for different parts of the graph: you're using variable b for both the uppercase version of the officer, and for the :Entity in your WHERE. There is no node that matches to both.
Instead, use different variables for each, and include the :Entity in your match. Also, once you match nodes and bind them to variables, you can reuse the variable names later in your query without having to repeat their labels or properties.
Try this:
MATCH (a:Officer {name :"Constantin Lutsenko"})-[:SHAREHOLDER_OF]->
(:Entity{id:'284429'}),(b:Officer{name : "CONSTANTIN LUTSENKO"})
CREATE (a)-[:SAME_COMPANY_AS]->(b)
Though I'm not quite sure of what you're trying to do...is an :Officer a company? That relationship type doesn't quite seem right.
I tried the answer by @InverseFalcon and it worked: after changing the property used to identify the entity from 'id' to 'name' and adding an entity pattern for both 'a' and 'b', the following code created 4 relationships:
MATCH (a:Officer {name :"Constantin Lutsenko"})-[:SHAREHOLDER_OF]->
(:Entity{name:'KINBOROUGH PORTFOLIO LTD.'}),
(b:Officer{name : "CONSTANTIN LUTSENKO"})-[:SHAREHOLDER_OF]->
(:Entity{name:'Chandler Group Holdings Ltd'})
CREATE (a)-[:SAME_NAME_AS]->(b)
Thank you so much @InverseFalcon!

Is it possible to query the graph in neo4j with just a part of the value of a relation's property?

I am trying to move the info of data flows to a DB. The data flows are something like this:
E_App1 sends data to I_App1. I_App1 then sends this data to I_App3. I_App3 then sends this data to I_App5.
E_App2 sends data to I_App2. I_App2 then sends this data to I_App3. I_App3 then sends this data to I_App5.
E_App3 sends data to I_App2. I_App2 then sends this data to I_App4. I_App4 then sends this data to I_App5. I_App5 then sends this data to I_App6.
E_App4 sends data to I_App3. I_App3 then sends this data to I_App5. I_App5 then sends this data to I_App6.
E_App5 sends data to I_App2. I_App2 then sends this data to I_App4. I_App4 then sends this data to I_App5.
I am thinking of having a property named "OF" of the "sends data" relationship that will contain names of the data that is being sent so I can trace the flow of a particular application. Something on the lines of the below diagram. Is it possible to query the OF values, like "show all relations whose OF value contains E_App4 only"?
This is the first time I am trying Graph DB and I am thinking of using it as the relationships are complex. I am not looking for high performance here. Is there some other approach I should follow to be able to achieve the result of tracing the flow of a particular application?
Link to the diagram: http://s27.postimg.org/5qieemks3/Graph_Data_Modeling.jpg
Your diagram was a little complicated, but all you are asking for is the relationships of type OF that have a node labeled E_App4 as the end node. There is no restriction on the start node.
So this query should work:
match (startNode) -[of:OF]->(endNode:E_App4) return startNode, of, endNode;
This of course assumes the following:
1) The relationship is directed from start node to end node, so any relationship with E_App4 as the start node will not be counted. If you wish to count those too, replace the -> with -.
2) The start node can be anything.
3) Only relationships of type OF are considered. Note that the type name is case sensitive.
4) End nodes must be labeled E_App4.
Edit
Reading your question again ("show all relations whose OF value contains E_App4 only"), I guess I misunderstood you. You are asking whether you can check the value of a relationship property. Yes, you can. Here is the query:
match (startNode) -[of:OF]->(endNode) where has(of.property) and of.property = "E_App4" return of;
This assumes:
1) The property on the relationship is stored under the key "property".
2) Only relationships that have this key are checked. Relationships without it will not be counted.
Thanks Rash, that helped me. I got stuck defining the filter, but some searching showed me how to use * in a regular expression. This is for others who may get stuck the way I did.
The graph looks roughly like this, with a lot of nodes and confusing relations, where only the Data property carries the identifying information:
z-[send]->b
y-[send]->b
w-[send]->d
x-[send]->c
q-[send]->c
b-[send]->c-[send]->e
also c-[send]->d
The Data property of every relation lists the source(s) whose data it carries. Relations far downstream will therefore have many sources in the "send" relation's "Data" property, in the form Data:"ABC,XYZ,QWR,SDF,TYOP,Zxcvb".
// find all send relations whose Data property contains ABC
MATCH ()-[r:send]->() WHERE r.Data =~ ".*ABC.*" RETURN r
// find all send relations whose Data property contains TYOP
MATCH ()-[r:send]->() WHERE r.Data =~ ".*TYOP.*" RETURN r
I hope this helps someone who is still getting the hang of all this.
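One caveat with the ".*ABC.*" pattern is that it matches any substring, so a source named "ABCD" would also be reported when searching for "ABC". The Python sketch below illustrates the difference between substring matching and matching whole comma-separated entries. It only models the matching logic; the function names are made up, and it is not a replacement for the Cypher queries above.

```python
import re

def matches_regex(data, source):
    """Whole-string regex match, in the spirit of r.Data =~ ".*<source>.*"."""
    return re.fullmatch(".*" + re.escape(source) + ".*", data) is not None

def matches_token(data, source):
    """True only if source is one complete comma-separated entry of data."""
    return source in [part.strip() for part in data.split(",")]

data = "ABC,XYZ,QWR,SDF,TYOP,Zxcvb"

substring_hit = matches_regex(data, "TYO")   # matches inside "TYOP"
token_hit = matches_token(data, "TYO")       # "TYO" is not a full entry
```

If partial names can collide in your data, a stricter pattern that anchors on the commas, or storing the sources as a list property, may be worth considering.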

An Example Showing the Necessity of Relationship Type Index and Related Execution Plan Optimization

Suppose I have a large knowledge base with many relationship types, e.g., hasChild, livesIn, locatedIn, capitalOf, largestCityOf...
The number of capitalOf relationships is relatively small (say, one hundred) compared to the number of nodes and of other types of relationships.
I want to fetch any capital which is also the largest city in their country by the following query:
MATCH city-[:capitalOf]->country, city-[:largestCityOf]->country RETURN city
It would clearly be wise to take the capitalOf type as the starting clue, scan all 100 relationships of this type, and then refine by [:largestCityOf]. However, the current execution plan engine of neo4j does an AllNodesScan and Expand. Why not add a "RelationshipByTypeScan" operator to the query optimization engine, analogous to NodeByLabelScan?
I know that I can transform relationship types to relationship properties, index it using the legacy index and manually indicate
START r=relationship:rels(rtype = "capitalOf")
to tell neo4j how to make it efficient. But for a more complicated pattern query with many relationship types but no node id/label/property to start from, it is clearly a duty of the optimization engine to decide which relationship type to start with.
I have seen many questions about the same problem that got answers like "negative... a query TYPICALLY starts from nodes... ". I just want to use the typical scenario above to ask why once more.
Thanks!
A relationship is local to its start and end node - there is no global relationship dictionary. An operation like "give me globally all relationships of type x" is therefore an expensive operation - you need to go through all nodes and collect matching relationships.
There are 2 ways to deal with this:
1) use a manual index on relationships as you've sketched
2) assign labels to your nodes. Assume all the country nodes have a Country label. You can rewrite your query:
MATCH (city)-[:capitalOf]->(country:Country), (city)-[:largestCityOf]->(country) RETURN city
The AllNodesScan is now a NodeByLabelScan. The query grabs all countries and matches them to the cities. Since every country has one capital and one largest city, this is efficient and scales independently of the rest of your graph.
If you put all relationships into one index and try to grab the ~100 capitalOf relationships, that operation scales logarithmically with the total number of relationships in your graph.
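The effect of starting from labeled nodes can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions: the graph, the labels, and the function name are made up, and the code does not reflect how Neo4j stores or plans anything internally. It just counts how many start nodes each strategy has to visit.

```python
# Toy graph; all names and data are invented for illustration.
labels = {
    "Paris": set(), "Canberra": set(), "Sydney": set(), "Alice": set(),
    "France": {"Country"}, "Australia": {"Country"},
}
edges = [
    ("Paris", "capitalOf", "France"),
    ("Paris", "largestCityOf", "France"),
    ("Canberra", "capitalOf", "Australia"),
    ("Sydney", "largestCityOf", "Australia"),
    ("Alice", "livesIn", "Paris"),
]

# Relationships are reachable only through their nodes: group them per end node.
incoming = {node: [] for node in labels}
for start, rel_type, end in edges:
    incoming[end].append((rel_type, start))

def capital_and_largest(start_nodes):
    """Expand from each start node; return (matching cities, nodes visited)."""
    cities, visited = [], 0
    for node in start_nodes:
        visited += 1
        capitals = {s for t, s in incoming[node] if t == "capitalOf"}
        largest = {s for t, s in incoming[node] if t == "largestCityOf"}
        cities.extend(sorted(capitals & largest))
    return cities, visited

# Scan every node versus scan only nodes carrying the Country label.
all_nodes_scan = capital_and_largest(list(labels))
label_scan = capital_and_largest(
    [n for n, node_labels in labels.items() if "Country" in node_labels]
)
```

Both strategies find the same city, but the label-based one touches only the country nodes, which is the point of the rewritten query.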

Hadoop join with String key

I'm implementing a reduce-side join to find matches between databases A and B. Both datasets are files containing one JSON object per line. The join key is the name attribute of each record, so the mapper extracts the name from the JSON and emits it as the key, with the JSON itself as the value. The reducer must merge the JSON objects for the same or a similar person name.
The problem is that I need to group keys using a string similarity matching algorithm, e.g., John White must be considered equal to John White Lennon.
I've tried to do that using a grouping comparator, but it is not working as expected.
How can this be implemented?
Thanks in advance!
What you are asking for could be described as a set similarity join, where the sets are, e.g., the sets of tokens or n-grams of each line. Here is a research paper that describes how you can achieve that in MapReduce. I hope you find it useful.
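A rough, single-process Python sketch of the idea, assuming names are compared as sets of lower-cased tokens. It is a simplification for illustration, not the algorithm from the paper: the "map" step keys each record by every token so that records sharing a token meet in the same group, and the "reduce" step verifies the candidate pairs. The containment measure, the threshold, and all function names are assumptions.

```python
from collections import defaultdict
from itertools import product

def tokens(name):
    """Lower-cased token set of a name."""
    return set(name.lower().split())

def jaccard(a, b):
    """Shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def containment(a, b):
    """Share of the shorter name's tokens that also occur in the other name."""
    ta, tb = tokens(a), tokens(b)
    smaller = min(len(ta), len(tb))
    return len(ta & tb) / smaller if smaller else 0.0

def similarity_join(names_a, names_b, threshold=1.0):
    """Group by token (map), then verify candidate pairs per group (reduce)."""
    groups = defaultdict(lambda: ([], []))
    for name in names_a:
        for tok in tokens(name):
            groups[tok][0].append(name)
    for name in names_b:
        for tok in tokens(name):
            groups[tok][1].append(name)
    matches = set()
    for side_a, side_b in groups.values():
        for a, b in product(side_a, side_b):
            if containment(a, b) >= threshold:
                matches.add((a, b))
    return matches

pairs = similarity_join(["John White"], ["John White Lennon", "Mary White"])
```

Keying by token sidesteps the grouping comparator entirely: a comparator can only merge keys that end up adjacent after sorting within one partition, and "similar" is not a transitive relation, so it is a poor fit for fuzzy grouping.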

How to model time dependent relationships?

What's the right way to model and iterate time dependent relationships?
For example:
John married Elizabeth on 07/03/69 and divorced her on 05/12/73; then he married Corrie on 03/18/82 and is still married.
Mark worked for IBM (certain date intervals), then MSFT (other time interval), etc.
There are many other relationships which are time dependent:
lived in
worked for
reported to
belonged to etc.
What is the right way to model these? A typical query would be to find a traversal with an "as of" parameter, e.g "Who is the spouse of John as of 01/01/74?"
This is easily done by using indexes. Indexes and properties can only hold primitive values, so convert your Date object to a long value and index that value. You'll have to store the indexed value as a special numeric type, but after that you can search on a range, which gives you "before", "after", or even "between" queries.
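As an illustration of the range idea only, here is a Python sketch in which a sorted list stands in for the numeric index. The day ordinal from date.toordinal() plays the role of the long value, the two-digit years from the question are assumed to mean 19xx, and the helper names are hypothetical. The Neo4j index API itself is not shown.

```python
from bisect import bisect_left, bisect_right
from datetime import date

def to_long(d):
    """Map a date to an integer that sorts in chronological order."""
    return d.toordinal()

# Events from the question, keyed by their integer date.
events = sorted([
    (to_long(date(1969, 7, 3)), "John marries Elizabeth"),
    (to_long(date(1973, 5, 12)), "John divorces Elizabeth"),
    (to_long(date(1982, 3, 18)), "John marries Corrie"),
])
keys = [key for key, _ in events]

def between(start, end):
    """Events with start <= date <= end."""
    lo = bisect_left(keys, to_long(start))
    hi = bisect_right(keys, to_long(end))
    return [label for _, label in events[lo:hi]]

def before(end):
    """Events strictly before the given date."""
    return [label for _, label in events[:bisect_left(keys, to_long(end))]]

seventies = between(date(1970, 1, 1), date(1979, 12, 31))
until_1974 = before(date(1974, 1, 1))
```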
You could consider using extra nodes to represent a particular marriage or period of employment in conjunction with a calendar subgraph. For example, John's marriages could be modelled as follows:
(John)-[:MARRIAGE]->(John+Liz)
(Liz)-[:MARRIAGE]->(John+Liz)
(John+Liz)-[:START_DATE]->(07/03/69)
(John+Liz)-[:END_DATE]->(05/12/73)
(John)-[:MARRIAGE]->(John+Corrie)
(Corrie)-[:MARRIAGE]->(John+Corrie)
(John+Corrie)-[:START_DATE]->(03/18/82)
This gives you flexibility in both the number of marriages each person may undertake as well as whether an END_DATE exists or not.
Hope this helps
Nige
I would always create two kinds of relationships for every type: lived_in_s, lived_in_e, belonged_to_s, belonged_to_e, and so on, where the suffixes "s" and "e" mark the start and end time of the relationship. Then a query with an "as of" parameter could look like:
START n=node({John})
MATCH n-[r:spouse_of_s]-m, n-[?r2:spouse_of_e]-m
WHERE r.time <= {timestamp} AND r2.time? > {timestamp}
RETURN m;
(I might have a typo in the query at r2?; I wrote it without testing.)
You might also want to use the indexes @nicholas wrote about.
I usually add a timestamp as an on_date attribute.
I would also recommend encoding your dates in the format YYYYMMDD. You can then easily compare them in your code. For instance:
John married Elizabeth on 07/03/69 and divorced her on 05/12/73; then he married Corrie on 03/18/82 and is still married.
If you want to know whether John is married and to whom, you can simply compare the dates and get the end node.
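A small Python sketch of that comparison, assuming the two-digit years mean 19xx and that a missing end date means the marriage is ongoing. The tuple layout and the function names are made up for illustration.

```python
def encode(year, month, day):
    """Encode a date as the integer YYYYMMDD, which sorts chronologically."""
    return year * 10000 + month * 100 + day

# (spouse, start, end); end is None while the marriage is ongoing.
marriages = [
    ("Elizabeth", encode(1969, 7, 3), encode(1973, 5, 12)),
    ("Corrie", encode(1982, 3, 18), None),
]

def spouse_as_of(records, as_of):
    """Return the spouse whose marriage interval contains as_of, else None."""
    for spouse, start, end in records:
        if start <= as_of and (end is None or as_of < end):
            return spouse
    return None

in_1970 = spouse_as_of(marriages, encode(1970, 1, 1))
in_1974 = spouse_as_of(marriages, encode(1974, 1, 1))
in_2000 = spouse_as_of(marriages, encode(2000, 1, 1))
```

For the "as of 01/01/74" question, this yields no spouse, since that date falls between the divorce and the second marriage.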
