Skipping relationship creation if it already exists, not about MERGE - neo4j

I am new to neo4j; my data is in CSV files and I am trying to load them into the DB and create relationships.
departments.csv (9 rows): dept_name, dept_no
dept_emp.csv (331603 rows): dept_no, emp_no, from_date, to_date
I have created nodes with labels departments and dept_emp, with all columns as properties. Now I am trying to create relationships between them.
CALL apoc.periodic.iterate("
  LOAD CSV WITH HEADERS FROM 'file:///dept_emp.csv' AS row RETURN row",
  "MATCH (de:dept_emp)
   MATCH (d:departments)
   WHERE de.dept_no = row.dept_no AND d.dept_no = row.dept_no
   MERGE (de)-[:BELONGS_TO]->(d)",
  {batchSize:10000, parallel:false})
I do have indexes on :dept_emp and :departments
When I run this it takes ages to complete (many days). When I changed the batch size to 10 it created all 331603 relationships, but it kept running until it had completed every batch, which also takes far too long. Because the first few rows of dept_emp.csv already cover all 9 distinct dept_no values, all the relationships get created within the first batch or two, yet every remaining batch still has to re-scan the 331603 relationships that already exist. Please help me optimize this.
I used apoc.periodic.iterate so I can deal with larger data in the future. The way the data is related, and the way I am trying to establish the relationship, is what causes the problem: each department has many dept_emp nodes connected to it.
Currently using Neo4j version 4.2.1.
Max heap size is 1G due to my laptop limitations.

There's no need to create nodes in this fashion, i.e. set properties and then load the same CSV again, matching all nodes in the graph and doing a cartesian join.
Instead:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///departments.csv' AS row
CREATE (d:Department) SET d.deptNo=row.dept_no, d.name=row.dept_name
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///dept_emp.csv' AS row
MATCH (d:Department {deptNo:row.`dept_no`})
WITH d, row
MERGE (e:Employee {empNo: row.`emp_no`})
MERGE (e)-[:BELONGS_TO]->(d)
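Before running the two statements above, it also helps to make sure the lookups they perform are index-backed; otherwise every MATCH and MERGE becomes a label scan. A minimal sketch, assuming Neo4j 4.x syntax and the property names chosen in this answer (deptNo, empNo):
// Back the MATCH on :Department(deptNo) with an index
CREATE INDEX dept_deptNo IF NOT EXISTS FOR (d:Department) ON (d.deptNo);
// Back the MERGE on :Employee(empNo) with a uniqueness constraint,
// which also gives it an index
CREATE CONSTRAINT emp_empNo IF NOT EXISTS
ON (e:Employee) ASSERT e.empNo IS UNIQUE;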

Related

Bulk Update neo4j relationship properties through csv

I have a CSV file which has 3 columns:
Follower_id,Following_id,createTime
Each node in my Neo4j graph represents a USER and has multiple properties, one of them being profileId. Two nodes in the graph can have a FOLLOW_RELATIONSHIP, and I have to update the createTime property on the FOLLOW_RELATIONSHIP relationships. There are lots of relationships in the graph. I am new to Neo4j and don't have much of an idea how to do a bulk update efficiently.
You can try something like this:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'FILEPATH' AS row
MATCH (u1:User{profileId: row.Follower_id})
MATCH (u2:User{profileId: row.Following_id})
MERGE (u1)-[r:FOLLOW_RELATIONSHIP]->(u2)
SET r.createTime = row.createTime
FILEPATH is the path of the file on your system, usually within the database's import directory itself, or a web link.
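With lots of relationships in the graph, the two MATCH lookups above are the expensive part, so :User(profileId) should be index-backed. A one-line sketch, assuming no such constraint exists yet and using the pre-5.0 constraint syntax:
// also creates the index used by the MATCH clauses above
CREATE CONSTRAINT ON (u:User) ASSERT u.profileId IS UNIQUE;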

How to import large dataset into Neo4j with relationships defined in CSV

Problem: How to load ~8 GB of data, >10 million rows, of the following format into Neo4j efficiently. I am using the DocGraph data set which shows relationships between Medicare providers. The dataset is a csv with columns:
From_ID, To_ID, Count_Patients, Count_Transacts, Avg_Wait, Stdv_Wait
From_ID is the ID of the doctor making a referral; To_ID is the doctor who receives the referral. The last four columns are relationship properties. Any ID in the first or second column can reappear in either column, because providers can have many relationships in either direction.
Here is the basic query I've come up with (very new to Cypher but adept at SQL):
LOAD CSV FROM "url" AS line
CREATE (n:provider {NPI : line[0]})
WITH line, n
MERGE (m:provider {NPI : line[1]})
WITH m,n, line
MERGE (n)-[r:REFERRED {patients: line[2], transacts: line[3], avgdays: line[4], stdvdays: line[5]}]->(m)
It seems to work with a small subset of data but last time I tried it on the full dataset it broke my neo4j and it kept timing out when I tried to restart it, so I had to terminate my EC2 instance and start from scratch.
Appreciate any advice I can get and help with the Cypher query. Also, I am planning to merge this data with additional Medicare data with more properties about the nodes e.g. doctor specialty, location, name, etc, so let me know how I should take that into consideration.
Instance details: Ubuntu 18.04, m5ad.large (2 vCPUS, 8GB RAM, 75GB SSD)
It seems very likely that your logic is flawed.
You should investigate whether multiple rows in your CSV file can have the same line[0] value. If so, your CREATE clause should be changed to a MERGE, to avoid creating a potentially large number of duplicate provider nodes (and therefore also duplicate :REFERRED relationships).
Did you try using
USING PERIODIC COMMIT 1000 ......
USING PERIODIC COMMIT 1000
LOAD CSV FROM "url" AS line
CREATE (n:provider {NPI : line[0]})
WITH line, n
MERGE (m:provider {NPI : line[1]})
WITH m,n, line
MERGE (n)-[r:REFERRED {patients: line[2], transacts: line[3], avgdays: line[4], stdvdays: line[5]}]->(m)
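Combining the two suggestions above (MERGE instead of CREATE, plus USING PERIODIC COMMIT) with a uniqueness constraint, one possible two-pass shape of the import is sketched below. It is an untested sketch that assumes the six-column layout described in the question and pre-5.0 Cypher syntax; it also deliberately moves the relationship properties into a SET so that a re-run does not create duplicate :REFERRED relationships when property values differ:
CREATE CONSTRAINT ON (p:provider) ASSERT p.NPI IS UNIQUE;
// Pass 1: create providers from both ID columns
USING PERIODIC COMMIT 1000
LOAD CSV FROM "url" AS line
MERGE (:provider {NPI: line[0]})
MERGE (:provider {NPI: line[1]});
// Pass 2: create the referral relationships
USING PERIODIC COMMIT 1000
LOAD CSV FROM "url" AS line
MATCH (n:provider {NPI: line[0]})
MATCH (m:provider {NPI: line[1]})
MERGE (n)-[r:REFERRED]->(m)
SET r.patients = line[2], r.transacts = line[3], r.avgdays = line[4], r.stdvdays = line[5];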

How to speed up the relationship creation with match...create

I'm creating nodes and relationships programmatically with the Neo4j Java driver, based on relationships specified in a CSV file.
The CSV file contains about 16 million rows, and there will be about 16 * 4 million relationships to create.
I'm using a Match ... Match ... Create pattern for this purpose:
Match (a:label), (b:label) where a.prop='1234' and b.prop='4567' create (a)-[:LINKS]->(b)
I just started the program and functionally it ran well. I saw nodes and relationships being created properly in the neo4j DB.
However, in the past four hours only 100,000 rows from the CSV have been processed and only 92,037 relationships have been created.
At this speed, it will take about a month to finish processing the CSV and creating all the relationships.
I noticed that I was sending the Match...Create statements one by one to session.writeTransaction().
Is there any way to batch them up so as to speed up the creation?
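One common way to batch these up (a sketch, not from the original post) is to send a single parameterized statement per chunk of a few thousand rows and let Cypher UNWIND the chunk inside one transaction. The $batch parameter is assumed to be a list of {from, to} maps built by the Java driver before each writeTransaction call; label, prop and LINKS are the names from the query above, and the index statement uses the 3.x-style syntax:
// run once, so the two lookups per row are index-backed
CREATE INDEX ON :label(prop);
// run once per batch, with $batch bound by the driver,
// e.g. [{from:'1234', to:'4567'}, ...]
UNWIND $batch AS row
MATCH (a:label {prop: row.from})
MATCH (b:label {prop: row.to})
CREATE (a)-[:LINKS]->(b);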

neo4j data insertion taking time

I have installed Neo4j Community Edition 3.0.3 on Ubuntu 14.04 on a local Linux server. I am now accessing it from my Windows browser through port 7474 on that server.
Now I have a csv file having sales order data in the following format:
Customer_id, Item_id, Order_Date
It has 90000 rows, and both customer_id and item_id are nodes: a total of 60000 nodes (30000 customer_ids + 30000 item_ids) and 90000 relationships (with order_date as the distance attribute). I ran the query below to insert the data from the CSV into my graph database:
LOAD CSV WITH HEADERS FROM "file:///test.csv" AS line
MERGE (n:MyNode {Name:line.Customer})
MERGE (m:MyNode {Name:line.Item})
MERGE (n) -[:TO {dist:line.OrderDate}]-> (m)
I left it to run, and after around 7 to 8 hours it was still running. My question is: am I doing anything wrong? Is my query not optimized, or is this normal? I am new to both Neo4j and Cypher. Please help me with this.
Create a uniqueness constraint
You should create a uniqueness constraint on MyNode.Name:
CREATE CONSTRAINT ON (m:MyNode) ASSERT m.Name IS UNIQUE;
In addition to enforcing the data integrity / uniqueness of MyNode, that will create an index on MyNode.Name, which will speed up the lookups in the MERGE statements. There is a bit more info on indexes and performance in the Neo4j documentation.
Using periodic commit
Since Neo4j is a transactional database, the results of your query are built up in memory and the entire query is committed at once. Depending on the size of the data and the resources available on your machine, you may want to use the periodic commit functionality of LOAD CSV to avoid building up the entire statement in memory. Just start your query with USING PERIODIC COMMIT. This will commit the results periodically, freeing memory while iterating through your CSV file.
Avoiding the eager
One problem with your query is that it contains an eager operation. This hinders the periodic commit functionality, and the entire transaction is built up in memory regardless. To avoid the eager operation you can make two passes through the CSV file:
Once to create the nodes:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///test.csv" AS line
MERGE (n:MyNode {Name:line.Customer})
MERGE (m:MyNode {Name:line.Item})
Then again to create the relationships:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///test.csv" AS line
MATCH (n:MyNode {Name:line.Customer})
MATCH (m:MyNode {Name:line.Item})
MERGE (n) -[:TO {dist:line.OrderDate}]-> (m)
At a minimum you need to create the uniqueness constraint - that should be enough to increase the performance of your LOAD CSV statement.

Faster way to insert data in neo4j?

I am trying to insert unique nodes and relationships in Neo4j.
What I am using:
Neo4j Community Edition running on Amazon EC2 [Amazon Linux, m3.large]
Neo4j Java Rest Binding [ https://github.com/neo4j-contrib/java-rest-binding ]
Data Size and Type :
TSV files [multiple]. Each contains more than 8 million lines [each line represents a node or a relationship]. There are more than 10 files for nodes [= 2 million nodes] and another 2 million relationships.
I am using UniqueNodeFactory for inserting nodes, and I am inserting sequentially; I couldn't find any way to insert in batches while preserving unique nodes.
The problem is that it is taking a huge amount of time to insert the data. For example, it took almost a day to insert 0.3 million unique nodes. Is there any way to speed up the insertion?
Don't do that.
Java-REST-Binding was never made for that.
Instead, use:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "http://some.url" as line
CREATE (u:User {name:line.name})
You can also use MERGE (with constraints), create relationships, etc.; a minimal sketch follows the links below.
See my blog post for an example: http://jexp.de/blog/2014/06/using-load-csv-to-import-git-history-into-neo4j/
Or the Neo4j Manual: http://docs.neo4j.org/chunked/milestone/cypherdoc-importing-csv-files-with-cypher.html
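As an illustration of the MERGE-with-constraints variant mentioned above, here is a minimal sketch reusing the same column name; the constraint makes the load idempotent and gives the MERGE an index to work against:
CREATE CONSTRAINT ON (u:User) ASSERT u.name IS UNIQUE;
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "http://some.url" AS line
MERGE (u:User {name: line.name});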
