I'm creating nodes and relationships programmatically with the Neo4j Java driver, based on the relationships specified in a CSV file.
The CSV file contains about 16 million rows, and there will be about 16*4 million relationships to create.
I'm using the Match-Match-Create pattern for this purpose:
MATCH (a:label), (b:label) WHERE a.prop = '1234' AND b.prop = '4567' CREATE (a)-[:LINKS]->(b)
I just started the program and, functionally, it ran well: I saw nodes and relationships being created properly in the Neo4j DB.
However, over the past four hours only 100,000 rows of the CSV have been processed and only 92,037 relationships created.
At this speed it will take about a month to finish processing the CSV and creating all the relationships.
I noticed that I was sending the Match...Create statements one by one to session.writeTransaction().
Is there any way to batch them up so as to speed up the creation time?
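A common batching pattern (a sketch, not confirmed against your code; $batch, row.from, and row.to are hypothetical names) is to accumulate CSV rows into a list-of-maps parameter in the Java driver and send one UNWIND statement per chunk, so each transaction creates thousands of relationships instead of one:

// $batch is a list of maps built in Java, e.g. via Values.parameters("batch", rows),
// chunked at roughly 10k rows per writeTransaction() call.
UNWIND $batch AS row
MATCH (a:label {prop: row.from})
MATCH (b:label {prop: row.to})
CREATE (a)-[:LINKS]->(b)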
I am new to Neo4j. My data is in CSV files, and I am trying to load them into the DB and create relationships.
departments.csv (9 rows):
dept_name
dept_no

dept_emp.csv (331,603 rows):
dept_no
emp_no
from_date
to_date
I have created nodes with labels departments and dept_emp, with all columns as properties. Now I am trying to create relationships between them.
CALL apoc.periodic.iterate("
  LOAD CSV WITH HEADERS FROM 'file:///dept_emp.csv' AS row RETURN row",
  "MATCH (de:dept_emp)
   MATCH (d:departments)
   WHERE de.dept_no = row.dept_no AND d.dept_no = row.dept_no
   MERGE (de)-[:BELONGS_TO]->(d)",
  {batchSize:10000, parallel:false})
I do have indexes on :dept_emp and :departments
When I try to run this it takes ages to complete (many days). When I changed the batch size to 10 it created the 331,603 relationships, but it kept on running until it had completed all the batches, which takes far too long. Once it has encountered the 9 distinct dept_no values in the initial rows of dept_emp.csv it has created all the relationships, but it still has to complete every batch, and in each batch it has to scan all the 331,603 relationships that were created in the first two batches or so. Please help me with optimizing this.
I have used apoc.periodic.iterate here to cope with larger data in the future. The way the data is related, and the way I am trying to establish the relationship, seems to be the problem: each department has many dept_emp nodes connected to it.
Currently using Neo4j 4.2.1.
Max heap size is 1G due to my laptop limitations.
There's no need to create nodes in this fashion, i.e. to set properties and then load the same CSV again, matching all nodes in the graph and doing a Cartesian join.
Instead:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///departments.csv' AS row
CREATE (d:Department) SET d.deptNo = row.dept_no, d.name = row.dept_name

USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///dept_emp.csv' AS row
MATCH (d:Department {deptNo:row.`dept_no`})
WITH d
MERGE (e:Employee {empNo: row.`emp_no`})
MERGE (e)-[:BELONGS_TO]->(d)
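If they are not already in place, an index or uniqueness constraint on the matched properties keeps both the MATCH and the MERGE lookups fast. A sketch, assuming the labels above and Neo4j 4.x syntax:

CREATE CONSTRAINT IF NOT EXISTS ON (d:Department) ASSERT d.deptNo IS UNIQUE;
CREATE CONSTRAINT IF NOT EXISTS ON (e:Employee) ASSERT e.empNo IS UNIQUE;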
I'm loading relationships into my graph DB in Neo4j using the LOAD CSV operation. The nodes are already created. I have four different types of relationships to create from four different CSV files (file 1: 59 relationships, file 2: 905 relationships, file 3: 173,000 relationships, file 4: over 1 million relationships). The Cypher queries execute just fine; however, file 1 (59 relationships) takes 25 seconds to execute, file 2 took 6.98 minutes, and file 3 has been running for the past 2 hours. I'm not sure whether these execution times are normal, given Neo4j's ability to handle millions of relationships. A sample Cypher query I'm using is given below.
LOAD CSV WITH HEADERS FROM "file:/sample.csv" AS rels3
MATCH (a:Index1 {Filename: rels3.Filename})
MATCH (b:Index2 {Field_name: rels3.Field_name})
CREATE (a)-[:relation1 {type: rels3.`relation1`}]->(b)
RETURN a, b
'a' and 'b' match two of the preloaded node categories, for which I created indexes hoping to speed up the lookup operation.
Additional information - Number of nodes (a category) - 1791
Number of nodes (b category) - 3341
Is there a faster way to load this, and does the LOAD CSV operation normally take this much time? Am I going wrong somewhere?
Create an index on Index1.Filename and Index2.Field_name:
CREATE INDEX ON :Index1(Filename);
CREATE INDEX ON :Index2(Field_name);
Verify these indexes are online:
:schema
Verify your query is using the indexes by adding PROFILE to the start of your query and looking at the execution plan to see if the indexes are being used.
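For example, a sketch against the query above (plan operator names vary by version): with the indexes online, the plan should show NodeIndexSeek operators rather than NodeByLabelScan.

PROFILE
LOAD CSV WITH HEADERS FROM "file:/sample.csv" AS rels3
MATCH (a:Index1 {Filename: rels3.Filename})
MATCH (b:Index2 {Field_name: rels3.Field_name})
CREATE (a)-[:relation1 {type: rels3.`relation1`}]->(b)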
What I like to do before running a query is to run EXPLAIN first to see if there are any warnings; I have fixed many a query thanks to the warnings.
(Simply prepend EXPLAIN to your query.)
Also, you can probably drop the RETURN clause; after the query finishes, you can run another query just to see the nodes.
I create roughly 20M relationships in about 54 minutes using a query very similar to yours.
Indexes are important because that's how Neo4j finds the nodes.
I was doing a POC on a publicly-available Twitter dataset for our project. I was able to create the Neo4j database for it using Michael Hunger's Batch Inserter utility, and it was relatively fast (it took just 2 hours and 53 minutes to finish). All in all there were:
15,203,731 Nodes, with 2 properties (name, url)
256,147,121 Relationships, with 1 property
I then created a Cypher query to update the Twitter database: I added a new property (Age) to the nodes and a new property (FollowedSince) to the relationships in the CSVs. Now things start to look bad. The query to update the relationships (see below) takes forever to run.
USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM {csvfile} AS row FIELDTERMINATOR '\t'
MATCH (u1:USER {name:row.`name:string:user`}), (u2:USER {name:row.`name:string:user2`})
MERGE (u1)-[r:Follows]->(u2)
ON CREATE SET r.Property=row.Property, r.FollowedSince=row.FollowedSince
ON MATCH SET r.Property=row.Property, r.FollowedSince=row.FollowedSince;
I already pre-created the index by running
CREATE INDEX ON :USER(name);
My neo4j.properties:
allow_store_upgrade=true
dump_configuration=false
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=260M
neostore.propertystore.db.index.mapped_memory=260M
neostore.nodestore.db.mapped_memory=768M
neostore.relationshipstore.db.mapped_memory=12G
neostore.propertystore.db.mapped_memory=2048M
neostore.propertystore.db.strings.mapped_memory=2048M
neostore.propertystore.db.arrays.mapped_memory=260M
node_auto_indexing=true
I'd like to know what I should do to speed up my Cypher query. As of this writing, more than an hour and a half has passed and my relationship update (10,000,747 relationships) still hasn't finished. The node update (15,203,731 nodes), which finished earlier, clocked in at 34 minutes, which I think is way too long. (The Batch Inserter utility processed all the nodes in just 5 minutes!)
I did test my queries on a small dataset first, before tackling the bigger one, and they did work.
My Neo4j lives on a server-grade machine, so hardware is not an issue here.
Any advice please? Thanks.
I am merging large batches of ~500,000 relationships with the LOAD CSV command:
LOAD CSV WITH HEADERS FROM 'http://file.csv' AS csv
MATCH (a:Label {uid: csv.uid1}),(b:Otherlabel {uid: csv.uid2})
MERGE (a)-[:TYPE {key1: csv.key1}]->(b)
Both uid properties have a UNIQUE constraint.
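(For reference, such constraints would have been created along these lines, using the labels and property names from the query above:)

CREATE CONSTRAINT ON (a:Label) ASSERT a.uid IS UNIQUE;
CREATE CONSTRAINT ON (b:Otherlabel) ASSERT b.uid IS UNIQUE;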
The CSV file looks like:
uid1,uid2,key1
123,abc,some_value
456,def,some_value
This is usually very fast (< 1 min) when there are many different nodes on each side.
But performance drops dramatically when I load batches in which a single a node is connected to many different b nodes. uid1 is always the same, and the schema constraints are still in place. ~30,000 relationships take ~8 minutes to load.
Am I missing something here? What could explain the huge performance difference in MERGEing 'many-to-many' relationships vs. 'one-to-many'?
As I mentioned in the comment on the question, I verified this behavior with a ~300,000-line CSV file that I created with unique random values for uid1 and uid2. @MartinPreusse then mentioned that if you change the query to use CREATE instead of MERGE, the query is fast. This observation made me realize what is going on.
The slowdown is caused by the need to scan the relationships list of the 'a' node each time a MERGE is performed. When a CREATE is performed, the relationship is added without testing first to see if the relationship already exists. When the relationship lists remain short (first case), scanning the relationship lists has little impact. When the relationship lists are growing long (second case), the repeated scanning of a growing list is dominating the process. In my test I linked all 300,000 nodes to a single node using a MERGE clause and it took hours.
If you don't have to worry about creating duplicate relationships, you can use CREATE without fear. Even if duplicates are an issue, it might be faster to use CREATE and then craft a query that will remove the duplicates.
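For instance, a sketch of such a cleanup query (assuming the duplicates carry identical key1 values; if the keys differ, you would need to decide which relationship to keep):

// Keep the first relationship of each (a, b) pair and delete the rest.
MATCH (a:Label)-[r:TYPE]->(b:Otherlabel)
WITH a, b, collect(r) AS rels
WHERE size(rels) > 1
FOREACH (r IN rels[1..] | DELETE r)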
I am trying to insert unique nodes and relationship in neo4j.
What I am using:
Neo4j Community Edition running on Amazon EC2 [Amazon Linux, m3.large].
Neo4j Java Rest Binding [https://github.com/neo4j-contrib/java-rest-binding]
Data size and type:
Multiple TSV files. Each contains more than 8 million lines (each line represents a node or a relationship). There are more than 10 files for nodes (= 2 million nodes) and another 2 million relationships.
I am using UniqueNodeFactory to insert the nodes, and I am inserting sequentially; I couldn't find any way to insert in batches while preserving unique nodes.
The problem is that it is taking a huge amount of time to insert the data. For example, it took almost a day to insert 0.3 million unique nodes. Is there any way to speed up the insertion?
Don't do that.
Java-REST-Binding was never made for that.
Instead, use LOAD CSV:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "http://some.url" as line
CREATE (u:User {name:line.name})
You can also use MERGE (with constraints), create relationships, etc.
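For instance, a minimal sketch reusing the example above (the URL and property names are placeholders):

CREATE CONSTRAINT ON (u:User) ASSERT u.name IS UNIQUE;

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "http://some.url" AS line
MERGE (u:User {name: line.name});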
See my blog post for an example: http://jexp.de/blog/2014/06/using-load-csv-to-import-git-history-into-neo4j/
Or the Neo4j Manual: http://docs.neo4j.org/chunked/milestone/cypherdoc-importing-csv-files-with-cypher.html