I have installed Neo4j Community Edition 3.0.3 on a local Linux server running Ubuntu 14.04. I am now accessing it from my Windows browser through port 7474 on that server.
Now I have a csv file having sales order data in the following format:
Customer_id, Item_id, Order_Date
It has 90000 rows, and both customer_id and item_id become nodes: a total of 60000 nodes (30000 customer_ids + 30000 item_ids) and 90000 relationships (with order_date as the distance attribute). I ran the query below to insert the data from the CSV into my graph database:
LOAD CSV WITH HEADERS FROM "file:///test.csv" AS line
MERGE (n:MyNode {Name:line.Customer})
MERGE (m:MyNode {Name:line.Item})
MERGE (n) -[:TO {dist:line.OrderDate}]-> (m)
I left it running, and after around 7 to 8 hours it was still going. My question is: am I doing anything wrong? Is my query not optimized, or is this amount of time usual? I am new to both Neo4j and Cypher. Please help me with this.
Create a uniqueness constraint
You should create a uniqueness constraint on MyNode.Name:
CREATE CONSTRAINT ON (m:MyNode) ASSERT m.Name IS UNIQUE;
In addition to enforcing the data integrity / uniqueness of MyNode, that will create an index on MyNode.Name which will speed the lookups on the MERGE statements. There's a bit more info in the indexes and performance section here.
Using periodic commit
Since Neo4j is a transactional database, the results of your query are built up in memory and the entire query is committed at once. Depending on the size of the data and the resources available on your machine, you may want to use the periodic commit functionality of LOAD CSV to avoid building up the entire statement in memory. Just start your query with USING PERIODIC COMMIT. This will commit the results periodically, freeing memory resources while iterating through your CSV file.
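For example, your original statement would simply gain the prefix (the optional batch size defaults to 1000 rows), although, as the next section explains, the eager operator also has to be addressed before periodic commit can take effect for this particular query:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///test.csv" AS line
MERGE (n:MyNode {Name:line.Customer})
MERGE (m:MyNode {Name:line.Item})
MERGE (n)-[:TO {dist:line.OrderDate}]->(m)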
Avoiding the eager
One problem with your query is that it contains an eager operation. This will hinder the periodic commit functionality, and the transaction will all be built up in memory regardless. To avoid the eager operation you can make two passes through the CSV file:
Once to create the nodes:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///test.csv" AS line
MERGE (n:MyNode {Name:line.Customer})
MERGE (m:MyNode {Name:line.Item})
Then again to create the relationships:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///test.csv" AS line
MATCH (n:MyNode {Name:line.Customer})
MATCH (m:MyNode {Name:line.Item})
MERGE (n) -[:TO {dist:line.OrderDate}]-> (m)
See these two posts for more info about the eager operation.
At a minimum you need to create the uniqueness constraint - that should be enough to increase the performance of your LOAD CSV statement.
Related
I have a csv file which has 3 columns:
Follower_id,Following_id,createTime
My node in Neo4j represents a USER and has multiple properties, one of them being profileId. Two nodes in the graph can have a FOLLOW_RELATIONSHIP, and I have to update the createTime property on those FOLLOW_RELATIONSHIP relationships. There are lots of relationships in the graph. I am new to Neo4j and don't have much idea about how to do a bulk update efficiently.
You can try something like this:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'FILEPATH' AS row
MATCH (u1:User {profileId: row.Follower_id})
MATCH (u2:User {profileId: row.Following_id})
MERGE (u1)-[r:FOLLOW_RELATIONSHIP]->(u2)
SET r.createTime = row.createTime
FILEPATH is the location of the file: usually a path under the database's import directory, or a web URL. You can learn how to set it from this article.
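Note that when you use a file:/// URL, Neo4j resolves it relative to its import directory. In Neo4j 3.x/4.x that directory is controlled by this neo4j.conf setting (shown here with its default value):
# LOAD CSV file:/// URLs are resolved relative to this directory
dbms.directories.import=import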
I am new to Neo4j. My data is in CSV files, and I am trying to load them into the database and create relationships.
departments.csv(9 rows)
dept_name
dept_no
dept_emp.csv(331603 rows)
dept_no
emp_no
from_date
to_date
I have created nodes with labels departments and dept_emp, with all columns as properties. Now I am trying to create relationships between them.
CALL apoc.periodic.iterate("
load csv with headers from 'file:///dept_emp.csv' as row return row",
"match(de:dept_emp)
match(d:departments)
where de.dept_no=row.dept_no and d.dept_no= row.dept_no
merge (de)-[:BELONGS_TO]->(d)",{batchSize:10000, parallel:false})
I do have indexes on :dept_emp and :departments
When I run this it takes ages to complete (many days). When I change the batch size to 10 it creates the 331603 relationships, but it keeps running until every batch has finished, which takes far too long. Because it encounters all 9 distinct dept_no values within the first rows of dept_emp.csv, all the relationships are created within the first couple of batches, yet it still has to work through the remaining batches, and in each one it has to scan the 331603 relationships that were already created. Please help me optimize this.
I used apoc.periodic.iterate here so that I can deal with huge data in the future. The way the data is related, and the way I am trying to establish the relationships, is what causes the problem: each department is connected to many dept_emp nodes.
Currently using Neo4j 4.2.1 version
Max heap size is 1G due to my laptop limitations.
There's no need to create nodes in this fashion, i.e. setting all the columns as properties and then loading the same CSV again, matching every node in the graph and doing a Cartesian join.
Instead:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///departments.csv' AS row
CREATE (d:Department) SET d.deptNo=row.dept_no, d.name=row.dept_name;

USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///dept_emp.csv' AS row
MATCH (d:Department {deptNo:row.`dept_no`})
WITH d, row
MERGE (e:Employee {empNo: row.`emp_no`})
MERGE (e)-[:BELONGS_TO]->(d)
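For large input files it also helps to create uniqueness constraints on the two merge keys before running the loads, so the MATCH and MERGE lookups are index-backed rather than label scans. A sketch, using the labels and properties from the queries above (valid syntax for Neo4j 4.2):
CREATE CONSTRAINT ON (d:Department) ASSERT d.deptNo IS UNIQUE;
CREATE CONSTRAINT ON (e:Employee) ASSERT e.empNo IS UNIQUE;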
I'm writing a custom doc manager for mongo-connector to replicate MongoDB documents to Neo4j, and I would like to create relationships in bulk. I'm using py2neo 2020.0.
It seems there were some options for this in previous versions but not in this one. Is there any way to create bulk nodes and relationships in py2neo?
I am currently working on bulk load functionality. There will be some new functions available in the next release. Until then, Cypher UNWIND...CREATE queries are your best bet for performance.
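A minimal sketch of that pattern, sent as parameterized queries from your doc manager; the label :Document, the relationship type :REFERENCES, and the $nodes / $rels parameters are placeholders to adapt to your own model:
// Bulk-create nodes from a list parameter
UNWIND $nodes AS node
CREATE (:Document {id: node.id});

// Bulk-create relationships between existing nodes
UNWIND $rels AS rel
MATCH (a:Document {id: rel.from}), (b:Document {id: rel.to})
CREATE (a)-[:REFERENCES]->(b);
Each statement handles a whole batch of rows in a single round trip.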
I would strongly recommend switching to the neo4j Python driver, as it's supported by Neo4j directly.
In any case, you can also do bulk insert directly in Cypher, and/or call that Cypher from within Python using the neo4j driver.
I recommend importing the nodes first, and then the relationships. It helps if you have a guaranteed unique identifier for the nodes, because then you can set up an index on that property before loading. Then you can load nodes from a CSV (or better yet a TSV) file like so:
// Create constraint on the unique ID - greatly improves performance.
CREATE CONSTRAINT ON (a:my_label) ASSERT a.id IS UNIQUE
;
// Load the nodes, along with any properties you might want, from
// a file in the Neo4j import folder.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///my_nodes.tsv" AS tsvLine FIELDTERMINATOR '\t'
CREATE (:my_label{id: toInteger(tsvLine.id), my_field2: tsvLine.my_field2})
;
// Load relationships.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///my_relationships.tsv" AS tsvLine FIELDTERMINATOR '\t'
MATCH(parent_node:my_label)
WHERE parent_node.id = toInteger(tsvLine.parent)
MATCH(child_node:my_label)
WHERE child_node.id = toInteger(tsvLine.child)
CREATE (parent_node)-[:PARENT_OF]->(child_node) // a relationship type is required here; :PARENT_OF is just an example
;
Problem: How to load ~8 GB of data, >10 million rows, of the following format into Neo4j efficiently. I am using the DocGraph data set which shows relationships between Medicare providers. The dataset is a csv with columns:
From_ID, To_ID, Count_Patients, Count_Transacts, Avg_Wait, Stdv_Wait
From_ID means ID of a doctor making a referral. To_ID is the doctor who receives the referral. The last four columns are relationship properties. Any ID in the first or 2nd column can reappear in either column, because providers can have many relationships in either direction.
Here is the basic query I've come up with (very new to Cypher but adept at SQL):
LOAD CSV FROM "url" AS line
CREATE (n:provider {NPI : line[0]})
WITH line, n
MERGE (m:provider {NPI : line[1]})
WITH m,n, line
MERGE (n)-[r:REFERRED {patients: line[2], transacts: line[3], avgdays: line[4], stdvdays: line[5]}]->(m)
It seems to work with a small subset of data but last time I tried it on the full dataset it broke my neo4j and it kept timing out when I tried to restart it, so I had to terminate my EC2 instance and start from scratch.
Appreciate any advice I can get and help with the Cypher query. Also, I am planning to merge this data with additional Medicare data with more properties about the nodes e.g. doctor specialty, location, name, etc, so let me know how I should take that into consideration.
Instance details: Ubuntu 18.04, m5ad.large (2 vCPUS, 8GB RAM, 75GB SSD)
It seems very likely that your logic is flawed.
You should investigate whether multiple rows in your CSV file can have the same line[0] value. If so, your CREATE clause should be changed to a MERGE, to avoid the creation of a potentially large number of duplicate provider nodes (and therefore also duplicate :REFERRED relationships).
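A sketch of the load with that change applied, keeping your column order and adding USING PERIODIC COMMIT as suggested in the other answer; the constraint on :provider(NPI), as recommended earlier in this document, makes the MERGE lookups index-backed:
CREATE CONSTRAINT ON (p:provider) ASSERT p.NPI IS UNIQUE;

USING PERIODIC COMMIT 1000
LOAD CSV FROM "url" AS line
MERGE (n:provider {NPI: line[0]})
MERGE (m:provider {NPI: line[1]})
MERGE (n)-[:REFERRED {patients: line[2], transacts: line[3], avgdays: line[4], stdvdays: line[5]}]->(m)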
Did you try using
USING PERIODIC COMMIT 1000 ......
USING PERIODIC COMMIT 1000 LOAD CSV FROM "url" AS line
CREATE (n:provider {NPI : line[0]})
WITH line, n
MERGE (m:provider {NPI : line[1]})
WITH m,n, line
MERGE (n)-[r:REFERRED {patients: line[2], transacts: line[3], avgdays: line[4], stdvdays: line[5]}]->(m)
I am loading the data into Neo4j using LOAD CSV. I have two types of nodes - Director and Company.
The commands below work fine and execute within 50 milliseconds.
LOAD CSV FROM "file:///Director.csv" AS line
CREATE (:Director {DirectorDIN: line[0]})

LOAD CSV FROM "file:///Company.csv" AS line
CREATE (:Company {CompanyCIN: line[0]})
Now I am trying to build the relationship between the two node types, and the query takes seemingly forever to execute. Here is the simple query that I am trying:
LOAD CSV FROM "file:///CompanyDirector.csv" AS line
MATCH (c:Company {CompanyCIN: toString(line[0])}), (d:Director {DirectorDIN: toString(line[1])})
CREATE (c)-[:Directed_by]->(d)
I have also tried:
LOAD CSV FROM "file:///CompanyDirector.csv" AS line
MATCH (c:Company {CompanyCIN: line[0]}), (d:Director {DirectorDIN: line[1]})
CREATE (c)-[:Directed_by]->(d)
It also runs seemingly forever. Please let me know what the issue could be here.
Information:
The CSV file does not contain more than 20k records.
CompanyCIN is alphanumeric
DirectorDIN is numeric in nature
I think you forgot to create some schema constraints in your database:
CREATE CONSTRAINT on (n:Company) ASSERT n.CompanyCIN IS UNIQUE;
CREATE CONSTRAINT on (n:Director) ASSERT n.DirectorDIN IS UNIQUE;
Without those constraints the complexity of your query is N*M, where N is the number of Company nodes and M is the number of Director nodes.
To see what I mean, you can EXPLAIN your query before and after the creation of those constraints.
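For example, prefix the statement with EXPLAIN to inspect the plan without executing it; before the constraints you should see label scans feeding a Cartesian product, afterwards index seeks:
EXPLAIN
LOAD CSV FROM "file:///CompanyDirector.csv" AS line
MATCH (c:Company {CompanyCIN: line[0]})
MATCH (d:Director {DirectorDIN: line[1]})
CREATE (c)-[:Directed_by]->(d)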
Moreover, you should also use PERIODIC COMMIT on your LOAD CSV query, like this:
USING PERIODIC COMMIT 5000
LOAD CSV FROM "file:///CompanyDirector.csv" AS line
MATCH (c:Company{CompanyCIN:line[0]})
MATCH (d:Director{DirectorDIN:line[1]})
CREATE (c)-[:Directed_by]->(d)
The main issue was that you did not have indexes on :Company(CompanyCIN) and :Director(DirectorDIN). Without the indexes, Neo4j is forced to evaluate every possible pair of Company and Director nodes for every line in your CSV file. That takes a lot of time.
CREATE INDEX ON :Company(CompanyCIN);
CREATE INDEX ON :Director(DirectorDIN);
By the way, creating the corresponding uniqueness constraints (as suggested by @logisima) has the side-effect of creating these indexes, but the issue was not caused by missing uniqueness constraints.
In addition, you should avoid creating duplicate Directed_by relationships by using MERGE instead of CREATE.
This should work better (you can also use the USING PERIODIC COMMIT option, as suggested by @logisima, if you run into memory issues with a large file):
USING PERIODIC COMMIT 5000 LOAD CSV FROM "file:///CompanyDirector.csv" AS line
MATCH (c:Company {CompanyCIN:line[0]})
MATCH (d:Director {DirectorDIN:line[1]})
MERGE (c)-[:Directed_by]->(d)