Error creating relationships over huge dataset - neo4j

My question is similar to the one asked here:
Creating unique node and relationship NEO4J over huge dataset
I have two tables, Entity (Entities.txt) and Relationships (EntitiesRelationships_Updated.txt), which look like the samples below. Both files are in the import folder of the Neo4j database. I am trying to load them with the LOAD CSV command and then create relationships.
In the Entity table, a PARENTID of 0 means that the ENT_ID has no parent; any other value is the ENT_ID of its parent. For example, in the table below ENT_ID = 3 is the parent of ENT_ID = 4, and ENT_ID = 1 is the parent of ENT_ID = 2.
**Entity Table**
ENT_ID  Name  PARENTID
1       ABC   0
2       DEF   1
3       GHI   0
4       JKG   3
**Relationship Table**
RID  ENT_IDPARENT  ENT_IDCHILD
1    1             2
2    3             5
The Entity table has 2 million records and the relationship table has about 400K lines.
Each RID has a particular tag associated with it. For example, RID = 1 means the relation is A FATHER_OF B, and RID = 2 means the relation is A MOTHER_OF B. There are 20 such RIDs in total.
Both files are in txt format.
My first step is to load the Entity table. I used the following script:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Entities.txt" AS Entity FIELDTERMINATOR '|'
CREATE (n:Entity{ENT_ID: toInt(Entity.ENT_ID),NAME: Entity.NAME,PARENTID: toInt(Entity.PARENTID)})
This query works fine. It takes about 10 minutes to load 2.8 million records. Next, I index the records:
CREATE INDEX ON :Entity(PARENTID)
CREATE INDEX ON :Entity(ENT_ID)
These statements run fine as well. After this, I tried creating the relationships from the relationship table, using a query similar to the one in the link above:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
MATCH (n:A {ENT_IDPARENT : Rships.ENT_IDPARENT})
with Entity, n
MATCH (m:B {ENT_IDCHILD : Rships.ENT_IDCHILD})
with m,n
MERGE (n)-[r:RELATION_OF]->(m);
When I run this, the query keeps running for about an hour and then stops at a particular size (in my case 2.2 GB). I based this query on the link above. It includes the edit from the solution below and still does not work.
I have one more query, shown below (also based on the above link). I run it because I want to create relationships based on the Entity table:
PROFILE
MATCH(Entity)
MATCH (a:Entity {ENT_ID : Entity.ENT_ID})
WITH Entity, a
MATCH (b:Entity {PARENTID : Entity.PARENTID})
WITH a,b
MERGE (a)-[r:PARENT_OF]->(b)
When I run this query, I get a Java heap space error. Unfortunately, I have not been able to find a solution for either problem.
Could you please advise if I am doing something wrong?

This query allows you to take advantage of your :Entity(ENT_ID) index:
MATCH (child:Entity)
WHERE child.PARENTID > 0
WITH child.PARENTID AS pid, child
MATCH (parent:Entity {ENT_ID : pid})
MERGE (parent)-[:PARENT_OF]->(child);
Cypher does not use indices when the property value comes from another node. To get around that, the above query uses a WITH clause to expose child.PARENTID as a variable (pid). The time complexity of this query should be O(N), whereas your original query has a complexity of O(N * N).
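To make the complexity difference concrete, here is a minimal Python sketch (not Cypher, and not a description of how Neo4j works internally) that models the two strategies on the sample Entity table from the question. A plain dict stands in for the :Entity(ENT_ID) index, and the function names are made up for illustration.

```python
# Toy model of the two join strategies on the question's sample data.
# The dict in link_with_index stands in for the :Entity(ENT_ID) index.
entities = [
    {"ENT_ID": 1, "NAME": "ABC", "PARENTID": 0},
    {"ENT_ID": 2, "NAME": "DEF", "PARENTID": 1},
    {"ENT_ID": 3, "NAME": "GHI", "PARENTID": 0},
    {"ENT_ID": 4, "NAME": "JKG", "PARENTID": 3},
]


def link_with_scan(rows):
    """Compare every row with every other row: O(N * N) comparisons."""
    edges = set()
    for child in rows:
        for parent in rows:
            if child["PARENTID"] > 0 and parent["ENT_ID"] == child["PARENTID"]:
                edges.add((parent["ENT_ID"], child["ENT_ID"]))
    return edges


def link_with_index(rows):
    """One dict lookup per child: O(N) lookups overall."""
    by_id = {row["ENT_ID"]: row for row in rows}
    edges = set()
    for child in rows:
        pid = child["PARENTID"]
        if pid > 0 and pid in by_id:
            edges.add((pid, child["ENT_ID"]))
    return edges


print(sorted(link_with_index(entities)))  # prints [(1, 2), (3, 4)]
```

Both functions produce the same (parent, child) pairs; only the amount of work differs, which is what matters at 2 million rows.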
[EDITED]
If the above query takes too long or encounters errors that might be related to running out of memory, try this variant, which creates 1000 new relationships at a time. You can change 1000 to any number that is workable for you.
MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT 1000
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);
The WHERE clause filters out child nodes that already have a parent relationship. And the MERGE operation has been changed to a simpler CREATE operation, since we have already ascertained that the relationship does not yet exist. The query returns a count of the number of relationships created. If the result is less than 1000, then all parent relationships have been created.
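The stopping rule can be sketched in Python as a toy model of the repeated query. This is only an illustration: it assumes every non-zero PARENTID refers to an existing entity, and `create_batch`, `create_all` and the generated data are hypothetical names and values, not part of Neo4j.

```python
# Toy model of "run the LIMIT query again until it creates fewer than LIMIT".
# parent_of maps child id -> parent id for the relationships created so far.
def create_batch(entities, parent_of, limit):
    """Link up to `limit` children that have a parent id but no link yet."""
    pending = [
        e for e in entities
        if e["PARENTID"] > 0 and e["ENT_ID"] not in parent_of
    ][:limit]
    for child in pending:
        parent_of[child["ENT_ID"]] = child["PARENTID"]
    return len(pending)


def create_all(entities, limit):
    """Repeat batches until one comes back smaller than the limit."""
    parent_of = {}
    batches = 0
    while True:
        created = create_batch(entities, parent_of, limit)
        batches += 1
        if created < limit:
            return parent_of, batches


# Made-up data: entity 1 is a root (PARENTID 0), entities 2..10 have parents.
sample = [{"ENT_ID": i, "PARENTID": i // 2} for i in range(1, 11)]
links, runs = create_all(sample, limit=4)
print(len(links), runs)  # prints 9 3
```

With 9 children and a limit of 4, the batches create 4, 4 and 1 links, so the third run signals completion.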
Finally, to automate the repeated queries, you can install the APOC plugin on the Neo4j server and use the apoc.periodic.commit procedure, which repeatedly invokes a query until it returns 0. In this example, I use a limit parameter of 10000:
CALL apoc.periodic.commit(
"MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT {limit}
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);",
{limit: 10000});

Your entity creation Cypher looks fine, as do your indexes.
I am rather confused about the last two Cypher fragments though.
Since your relationships have a specific label or id associated with them, it's probably best to add them by loading from the relationship table data. However, the node labels in your query (A and B) aren't used in your Entity creation and aren't in your graph, and neither are the ENT_IDPARENT or ENT_IDCHILD properties. It looks like this isn't really the Cypher you used, but an example you built off of?
I'd change the relationship creation query to the following, storing the RID on the relationship for post-processing later (this assumes that there can only be one :RELATION_OF relationship between the same two nodes):
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
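// LOAD CSV yields strings, but ENT_ID was stored with toInt(), so convert before matching
MATCH (parent:Entity {ENT_ID : toInt(Rships.ENT_IDPARENT)})
MATCH (child:Entity {ENT_ID : toInt(Rships.ENT_IDCHILD)})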
MERGE (parent)-[r:RELATION_OF]->(child)
ON CREATE SET r.RID = Rships.RID;
Later on, if you like, you can match on your relationships by RID and add the corresponding type ("FATHER_OF", "MOTHER_OF", etc.) as a property.
As for creating the :PARENT_OF relationship, your query does an extra match on an Entity variable that is bound to every single node in your graph. Get rid of that.
Instead, use this:
PROFILE
// first, match on all Entities with a PARENTID property
MATCH(child:Entity)
WHERE EXISTS(child.PARENTID)
// next, find the parent for each child by the child's PARENTID
WITH child
MATCH (parent:Entity {ENT_ID : child.PARENTID})
MERGE (parent)-[:PARENT_OF]->(child)
// lastly remove the parentid from the child, so it won't be reprocessed
// if we run the query again.
REMOVE child.PARENTID
EDITED: the above query now uses an existence check on child.PARENTID and removes child.PARENTID after the corresponding relationship has been created.
If you need a solution that uses batching, you could do this manually (adding LIMIT 100000 to your WITH child line), or you could install the APOC Procedures library and use its apoc.periodic.commit() procedure to batch your processing.

Related

Cypher to lookup and order by multiple values

I have a JSON document with history-based entity counts and relationship counts. I want to use this lookup data for entities and relationships in Neo4j. The lookup data has around 3000 rows. For the entity counts, I want to display the counts for two entities based on UUID. For relationships, I want to order by two relationship counts (related entities and related mutual entities).
For entities, I have started with the following:
// get JSON doc
with value.aggregations.ent.terms.buckets as data
unwind data as lookup1
unwind data as lookup2
MATCH (e1:Entity)-[r1:RELATED_TO]-(e2)
WHERE e1.uuid = '$entityId'
AND e1.uuid = lookup1.key
AND e2.uuid = lookup2.key
RETURN e1.uuid, lookup1.doc_count, r1.uuid, e2.uuid, lookup2.doc_count
ORDER BY lookup2.doc_count DESC // just to demonstrate
LIMIT 50
I'm noticing that the query takes about 10 seconds. What am I doing wrong, and how can I correct it?
Attaching explain plan:
Your query is very inefficient. You stated that data has 3,000 rows (let's call that number D).
So, your first UNWIND creates an intermediate result of D rows.
Your second UNWIND creates an intermediate result of D**2 (i.e., 9 million) rows.
If your MATCH (e1:Entity)-[r1:RELATED_TO]-(e2) clause finds N results, that generates an intermediate result of up to N*(D**2) rows.
Since your MATCH clause specifies a non-directional relationship pattern, it finds the same pair of nodes twice (in reverse order). So, N is actually twice as large as it needs to be.
Here is an improved version of your query, which should be much faster (with N/2 intermediate rows):
WITH apoc.map.groupBy(value.aggregations.ent.terms.buckets, 'key') as lookup
MATCH (e1:Entity)-[r1:RELATED_TO]->(e2)
WHERE e1.uuid = $entityId AND lookup[e1.uuid] IS NOT NULL AND lookup[e2.uuid] IS NOT NULL
RETURN e1.uuid, lookup[e1.uuid].doc_count AS count1, r1.uuid, e2.uuid, lookup[e2.uuid].doc_count AS count2
ORDER BY count2 DESC
LIMIT 50
The trick here is that the query uses apoc.map.groupBy to convert your buckets (a list of maps) into a single unified lookup map that uses the bucket key values as its property names. This allows the rest of the query to literally "look up" each uuid's data in the unified map.
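As a rough Python analogue of that idea (a sketch only: the bucket data, the uuids and the `group_by_key` helper are made up, and this is not how APOC is implemented), building the map once replaces the two UNWIND cross products with direct lookups:

```python
# Made-up buckets in the same shape as the question's aggregation data.
buckets = [
    {"key": "uuid-a", "doc_count": 12},
    {"key": "uuid-b", "doc_count": 7},
    {"key": "uuid-c", "doc_count": 30},
]


def group_by_key(items, key):
    """Turn a list of maps into one map keyed by the given field."""
    return {item[key]: item for item in items}


lookup = group_by_key(buckets, "key")

# Made-up (e1, e2) pairs, standing in for the rows the MATCH would return.
related = [("uuid-a", "uuid-b"), ("uuid-a", "uuid-c"), ("uuid-a", "uuid-x")]

# Keep only pairs where both uuids have lookup data, as in the WHERE clause.
rows = [
    (e1, lookup[e1]["doc_count"], e2, lookup[e2]["doc_count"])
    for e1, e2 in related
    if e1 in lookup and e2 in lookup
]
rows.sort(key=lambda row: row[3], reverse=True)  # ORDER BY count2 DESC
print(rows)  # prints [('uuid-a', 12, 'uuid-c', 30), ('uuid-a', 12, 'uuid-b', 7)]
```

Each related pair costs two dictionary lookups, instead of being compared against every combination of two buckets.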

How to batch Neo4j Cypher queries

I have over 130M nodes of one type and 500K nodes of another type, and I am trying to create relationships between them as follows:
MATCH (p:person)
MATCH (f:food) WHERE f.name=p.likes
CREATE (p)-[l:likes]->(f)
The problem is that 130M relationships are created in one go, and I would like to do this in batches, similar to PERIODIC COMMIT when using LOAD CSV.
Is there such functionality for my type of query?
Yes, there is. You'll need the APOC Procedures library installed. You'll be using the apoc.periodic.commit() procedure from the Job Management section. From the documentation:
CALL apoc.periodic.commit(statement, params) - repeats a batch update
statement until it returns 0, this procedure is blocking
You'll be using this in combination with the LIMIT clause, passing the limit value as the params.
However, for best results, you'll want to make sure your join data (f.name, I think) has an index or a unique constraint to massively cut down on the time.
Here's how you might use it (assuming from your example that a person only likes one food, and that we should only apply this to :persons that don't already have the relationship set):
CALL apoc.periodic.commit("
MATCH (p:person)
WHERE p.likes IS NOT NULL
AND NOT (p)-[:likes]->(:food)
WITH p LIMIT {limit}
MATCH (f:food) WHERE p.likes = f.name
CREATE (p)-[:likes]->(f)
RETURN count(*)
", {limit: 10000})

neo4j cypher taking time to set relationship

I am relatively new to neo4j.
I have imported a dataset of 12 million records and created a relationship between two nodes. When I created the relationship, I forgot to attach a property to it. Now I am trying to set the property on the relationship as follows:
LOAD CSV WITH HEADERS FROM 'file:///FileName.csv' AS row
MATCH (user:User{userID: USERID})
MATCH (order:Order{orderID: OrderId})
MATCH(user)-[acc:ORDERED]->(order)
SET acc.field1=field1,
acc.field2=field2;
But this query takes too much time to execute.
I even tried a USING INDEX hint on the user and order nodes:
MATCH (user:User{userID: USERID}) USING INDEX user:User(userID)
Isn't it possible to create new attributes for the relationship at a later point?
Please let me know how I can do this operation quickly and efficiently.
You forgot to prefix your query with USING PERIODIC COMMIT.
Without it, your query builds up transaction state for 24 million changes (property updates) and won't have enough memory to keep all that state.
You also forgot the row. prefix for the data that comes from your CSV, and those names are inconsistently spelled.
If you run this from the Neo4j browser, pay attention to any yellow warning signs.
Run
CREATE CONSTRAINT ON (u:User) ASSERT u.userID IS UNIQUE;
Run
CREATE CONSTRAINT ON (o:Order) ASSERT o.orderID IS UNIQUE;
Run
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///FileName.csv' AS row
// keep row in scope, since the SET clause below still reads row.field1 and row.field2
WITH row, row.USERID AS userID, row.OrderId AS orderID
MATCH (user:User{userID: userID})
USING INDEX user:User(userID)
MATCH (order:Order{orderID: orderID})
USING INDEX order:Order(orderID)
MATCH(user)-[acc:ORDERED]->(order)
SET acc.field1=row.field1, acc.field2=row.field2;
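To illustrate why committing periodically keeps memory bounded, here is a small Python sketch of the idea (a toy model, not Neo4j's implementation). The generated rows and the `batches` and `apply_updates` helpers are hypothetical; the point is that the pending state never grows beyond one batch.

```python
from itertools import islice


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def apply_updates(rows, size):
    """Process rows batch by batch; return (commits, largest pending state)."""
    commits = 0
    peak = 0
    for chunk in batches(rows, size):
        # Pending changes for this batch only: two property updates per row.
        pending = [(r["USERID"], r["OrderId"], r["field1"], r["field2"]) for r in chunk]
        peak = max(peak, len(pending))
        commits += 1  # a real loader would flush the pending state here
    return commits, peak


# Made-up rows standing in for the CSV file.
rows = ({"USERID": i, "OrderId": i, "field1": "x", "field2": "y"} for i in range(2500))
print(apply_updates(rows, 1000))  # prints (3, 1000)
```

Without batching, the pending list would hold all 2500 rows at once; with a batch size of 1000 it peaks at 1000 regardless of the file size.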

How to generate relationships using property information [Neo4j]

I have imported a CSV where each node has 3 columns: id, parent_id, and title. This is a simple tree structure I had in MySQL. Now I need to create the relationships between those nodes based on the parent_id data, so that each pair of nodes is related as parent and child. I'm really new to Neo4j. Any suggestions?
I tried the following, but no luck:
MATCH (b:Branch {id}), (bb:Branch {parent_id})
CREATE (b)-[:PARENT]->(bb)
It seems as though your Cypher is very close. The first thing you are going to want to do is create an index on the id and parent_id properties for the label Branch.
CREATE INDEX ON :Branch(id)
CREATE INDEX ON :Branch(parent_id)
Once you have the indexes created, match all of the nodes with the label Branch (I would limit this to a specific id value at first, to make sure you create exactly what you want) and, for each one, find the corresponding parent by matching on your indexed properties.
MATCH (b:Branch), (bb:Branch)
WHERE b.id = ???
AND b.parent_id = bb.id
CREATE (b)-[:PARENT]->(bb)
Once you have proved this out on one branch and you get the results you expect, I would run it for more branches at once. You could still choose to do it in batches, depending on the number of branches in your graph.
After you have created all of the :PARENT relationships, you could optionally remove all of the parent_id properties.
MATCH (b:Branch)-[:PARENT]->(:Branch)
WHERE exists(b.parent_id)
REMOVE b.parent_id

Cypher - multiple relationships with same label, i want to delete just one

I have the following schema:
(Node a and b are identified by id and are the same in both relationships)
(a)-[r:RelType {comment:'a comment'} ]-(b)
(a)-[r:RelType {comment:'another comment'} ]-(b)
So I have 2 nodes and an arbitrary number of relationships between them. I want to delete just one of the relationships, and I don't care which one. How can I do this?
I have tried this, but it does not work:
match (a {id:'aaa'})-[r:RelType]-(b {id:'bbb'}) where count(r)=1 delete r;
Any ideas?
Here is the real-world query:
match (order:Order {id:'order1'}),(produs:Product {id:'supa'}),
(order)-[r:ordprod {status:'altered'}]->(produs) with r limit 1 set r.status='alteredAgain'
return (r);
The problem is that Cypher says
Set 1 property, returned 1 row in 219 ms
but when I inspect the database, it turns out all relationships have been updated.
Use the following:
match (a {id:'aaa'})-[r:RelType]-(b {id:'bbb'})
with r
limit 1
delete r
I tried to implement a data structure like yours (Mihai's) and tested both solutions, i.e., Stefan's and Sumit's. Stefan's solution works on my side. Mihai, are you still facing any problems?
Hope this helps (as I understand it, you are trying to modify a relationship between two given nodes):
MATCH (order:Order {id:'order1'})-[r:ordprod {status:'altered'}]->(produs:Product {id:'supa'})
WITH order,r,produs
LIMIT 1
DELETE r
WITH order,produs
// order and produs are already bound, so reuse the variables without labels or properties
CREATE (order)-[r:ordprod {status:'alteredAgain'}]->(produs)
return (r);
The reason all your relationships get updated is that after your WITH clause you are passing only r, i.e., the relationship, which may be the same between all such nodes labeled Order and Product. So when you set r.status = 'alteredAgain', it changes all the relationships instead of only the one between the two specific nodes that you matched at the beginning of your Cypher query. Pass the nodes in the WITH as well and it will work fine!

Resources