I am trying to insert unique nodes and relationships in Neo4j.
What I am using:
Neo4j Community Edition running on Amazon EC2 [Amazon Linux, m3.large]
Neo4j Java Rest Binding [ https://github.com/neo4j-contrib/java-rest-binding ]
Data size and type:
Multiple TSV files. Each contains more than 8 million lines [each line represents a node or a relationship]. There are more than 10 files for nodes [= 2 million nodes] and another 2 million relationships.
I am using UniqueNodeFactory to insert nodes, and I am inserting sequentially; I couldn't find any way to insert in batches while preserving node uniqueness.
The problem is that the insertion is taking a huge amount of time. For example, it took almost a day to insert 0.3 million unique nodes. Is there any way to speed up the insertion?
Don't do that.
Java-REST-Binding was never made for that.
Use LOAD CSV instead:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "http://some.url" as line
CREATE (u:User {name:line.name})
You can also use MERGE (with constraints), create relationships, etc.; see the sketch after the links below for an example.
See my blog post for an example: http://jexp.de/blog/2014/06/using-load-csv-to-import-git-history-into-neo4j/
Or the Neo4j Manual: http://docs.neo4j.org/chunked/milestone/cypherdoc-importing-csv-files-with-cypher.html
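For example, a minimal sketch of the MERGE-with-constraint approach, reusing the User label and name property from the snippet above (run the constraint first, as a separate statement, before the load):
CREATE CONSTRAINT ON (u:User) ASSERT u.name IS UNIQUE;
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "http://some.url" AS line
MERGE (u:User {name: line.name});
Relationships work the same way; e.g. a hypothetical friend column could be handled with an extra MERGE (f:User {name: line.friend}) followed by MERGE (u)-[:FRIEND_OF]->(f).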
Related
I am new to Neo4j. My data is in CSV files; I am trying to load them into the DB and create relationships.
departments.csv (9 rows)
dept_name
dept_no
dept_emp.csv (331603 rows)
dept_no
emp_no
from_date
to_date
I have created nodes with labels departments and dept_emp, with all columns as properties. Now I am trying to create relationships between them.
CALL apoc.periodic.iterate("
load csv with headers from 'file:///dept_emp.csv' as row return row",
"match(de:dept_emp)
match(d:departments)
where de.dept_no=row.dept_no and d.dept_no= row.dept_no
merge (de)-[:BELONGS_TO]->(d)",{batchSize:10000, parallel:false})
I do have indexes on :dept_emp and :departments
When I try to run this, it takes ages to complete (many days). When I changed the batch size to 10, it created the 331,603 relationships, but it kept running until it had completed all the batches, which takes far too long. As soon as it encounters the 9 distinct dept_no values in the initial rows of dept_emp.csv it creates all the relationships, yet it still has to work through every remaining batch, and in each batch it re-scans the relationships that were already created in the first couple of batches. Please help me optimize this.
I used apoc.periodic.iterate here so I can deal with larger data in the future. The way the data is related, and the way I am trying to establish the relationships, is what causes the problem: each department has many dept_emp nodes connected to it.
I am currently using Neo4j 4.2.1.
The max heap size is 1G due to my laptop's limitations.
There's no need to create nodes in this fashion, i.e. set properties and then load the same CSV again, match all nodes in the graph, and do a Cartesian join.
Instead:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///departments.csv' AS row
CREATE (d:Department) SET d.deptNo=row.dept_no, d.name=row.dept_name
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///dept_emp.csv' AS row
MATCH (d:Department {deptNo:row.`dept_no`})
WITH d, row
MERGE (e:Employee {empNo: row.`emp_no`})
MERGE (e)-[:BELONGS_TO]->(d)
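For these loads to stay fast, the MATCH and MERGE lookups should be index-backed; a sketch, assuming Neo4j 4.2 era constraint syntax and the property names used above (uniqueness constraints also create the backing indexes), to run before the two loads:
CREATE CONSTRAINT ON (d:Department) ASSERT d.deptNo IS UNIQUE;
CREATE CONSTRAINT ON (e:Employee) ASSERT e.empNo IS UNIQUE;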
Problem: how to efficiently load ~8 GB of data (more than 10 million rows) in the following format into Neo4j. I am using the DocGraph data set, which shows relationships between Medicare providers. The dataset is a CSV with columns:
From_ID, To_ID, Count_Patients, Count_Transacts, Avg_Wait, Stdv_Wait
From_ID is the ID of the doctor making the referral; To_ID is the doctor who receives the referral. The last four columns are relationship properties. Any ID in the first or second column can reappear in either column, because providers can have many relationships in both directions.
Here is the basic query I've come up with (very new to Cypher but adept at SQL):
LOAD CSV FROM "url" AS line
CREATE (n:provider {NPI : line[0]})
WITH line, n
MERGE (m:provider {NPI : line[1]})
WITH m,n, line
MERGE (n)-[r:REFERRED {patients: line[2], transacts: line[3], avgdays: line[4], stdvdays: line[5]}]->(m)
It seems to work with a small subset of the data, but the last time I tried it on the full dataset it broke my Neo4j instance, which kept timing out when I tried to restart it, so I had to terminate my EC2 instance and start from scratch.
I appreciate any advice and help with the Cypher query. Also, I am planning to merge this data with additional Medicare data containing more node properties, e.g. doctor specialty, location, name, etc., so let me know how I should take that into consideration.
Instance details: Ubuntu 18.04, m5ad.large (2 vCPUS, 8GB RAM, 75GB SSD)
It seems very likely that your logic is flawed.
You should investigate whether multiple rows in your CSV file can have the same line[0] value. If so, your CREATE clause should be changed to a MERGE, to avoid creating a potentially large number of duplicate provider nodes (and therefore also duplicate :REFERRED relationships).
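A sketch of what the MERGE-based version could look like, assuming a uniqueness constraint on :provider(NPI) is created first as a separate statement (3.x/4.x syntax; the constraint also provides the index that keeps the MERGE lookups fast), with the relationship properties moved into a SET so re-runs update rather than duplicate:
CREATE CONSTRAINT ON (p:provider) ASSERT p.NPI IS UNIQUE;
USING PERIODIC COMMIT 1000
LOAD CSV FROM "url" AS line
MERGE (n:provider {NPI: line[0]})
MERGE (m:provider {NPI: line[1]})
MERGE (n)-[r:REFERRED]->(m)
SET r.patients = line[2], r.transacts = line[3], r.avgdays = line[4], r.stdvdays = line[5];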
Did you try using
USING PERIODIC COMMIT 1000 ......
USING PERIODIC COMMIT 1000 LOAD CSV FROM "url" AS line
CREATE (n:provider {NPI : line[0]})
WITH line, n
MERGE (m:provider {NPI : line[1]})
WITH m,n, line
MERGE (n)-[r:REFERRED {patients: line[2], transacts: line[3], avgdays: line[4], stdvdays: line[5]}]->(m)
I'm creating nodes and relationships programmatically with the Neo4j Java driver, based on the relationships specified in a CSV file.
The CSV file contains about 16 million rows, and there will be about 16*4 million relationships to create.
I'm using the MATCH, MATCH, CREATE pattern for this purpose:
Match (a:label), (b:label) where a.prop='1234' and b.prop='4567' create (a)-[:LINKS]->(b)
I just started the program and functionally it ran well; I saw nodes and relationships being created properly in the Neo4j DB.
However, over the past four hours only 100,000 rows from the CSV have been processed and only 92,037 relationships have been created.
At this speed it will take about a month to finish processing the CSV and creating all the relationships.
I noticed that I am sending the MATCH...CREATE statements one by one to session.writeTransaction().
Is there any way to batch them up so as to speed up the creation time?
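One common approach (a sketch, not from the original code) is to accumulate rows client-side and send each batch as a single parameterized query using UNWIND, so one transaction creates thousands of relationships instead of one. Assuming the same label, property, and relationship names as in the pattern above, and a hypothetical $rows parameter holding a list of maps built from the CSV:
// $rows is e.g. [{from:'1234', to:'4567'}, ...], sent in batches of ~10k per transaction
UNWIND $rows AS row
MATCH (a:label {prop: row.from})
MATCH (b:label {prop: row.to})
CREATE (a)-[:LINKS]->(b)
An index (or uniqueness constraint) on :label(prop) is essential here; otherwise every MATCH is a full label scan.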
I'm loading relationships into my graph DB in Neo4j using the LOAD CSV operation. The nodes are already created. I have four different types of relationships to create from four different CSV files (file 1: 59 relationships, file 2: 905, file 3: 173,000, file 4: over 1 million). The Cypher queries execute just fine; however, file 1 (59 relationships) takes 25 seconds to execute, file 2 took 6.98 minutes, and file 3 has been running for the past 2 hours. I'm not sure if these execution times are normal given Neo4j's ability to handle millions of relationships. A sample Cypher query I'm using is given below.
load csv with headers from
"file:/sample.csv"
as rels3
match (a:Index1 {Filename: rels3.Filename})
match (b:Index2 {Field_name: rels3.Field_name})
create (a)-[:relation1 {type: rels3.`relation1`}]->(b)
return a, b
Index1 and Index2 are two labels I created for two of the preloaded node categories, hoping to speed up the lookup operation.
Additional information - Number of nodes (a category) - 1791
Number of nodes (b category) - 3341
Is there a faster way to load this, and does the LOAD CSV operation normally take this much time? Am I going wrong somewhere?
Create an index on Index1.Filename and Index2.Field_name:
CREATE INDEX ON :Index1(Filename);
CREATE INDEX ON :Index2(Field_name);
Verify these indexes are online:
:schema
Verify your query is using the indexes by adding PROFILE to the start of your query and looking at the execution plan to see if the indexes are being used.
More info here
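For example (a sketch; the Filename value is just a placeholder), profiling a single lookup should show a NodeIndexSeek rather than a NodeByLabelScan once the index is online:
PROFILE
MATCH (a:Index1 {Filename: "some_file"})
RETURN a;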
What I like to do before running a query is to run EXPLAIN first to see if there are any warnings; I have fixed many a query thanks to the warnings.
(Simply prepend EXPLAIN to your query.)
Also, perhaps you can drop the RETURN statement. After your query finishes, you can run another query to just see the nodes.
I create roughly 20M relationships in about 54 minutes using a query very similar to yours.
Indices are important because that's how Neo4j finds the nodes.
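For instance, the relationship load from the question with the RETURN dropped (otherwise unchanged) would look like:
load csv with headers from
"file:/sample.csv"
as rels3
match (a:Index1 {Filename: rels3.Filename})
match (b:Index2 {Field_name: rels3.Field_name})
create (a)-[:relation1 {type: rels3.`relation1`}]->(b)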
I am importing around 12 million nodes and 13 million relationships.
First I used LOAD CSV with PERIODIC COMMIT 50000 and divided the data into different chunks, but it is still taking too much time.
Then I saw the batch insertion method, but for that I would have to create new data sets in an Excel sheet.
Basically I am importing the data from SQL Server: first I save the data to CSV, then import it into Neo4j.
I am also using the Neo4j Community edition. I changed the configuration properties following everything I found on Stack Overflow, but still: with PERIODIC COMMIT 50K it starts off fast, and then after about 1 million rows it slows down badly.
Is there any way to import this data directly from SQL in a short span of time, since Neo4j is famous for working fast with big data? Any suggestions or help?
Here is the LOAD CSV I used (with an index on :Number(num)):
USING PERIODIC COMMIT 50000 load csv with headers from "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv"
AS csvLine fieldterminator ';'
Merge (Numbers:Number {num: csvLine.Numbers}) return * ;
USING PERIODIC COMMIT 50000 load csv with headers from "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv"
AS csvLine fieldterminator ';'
MERGE (TermNum:Number {num: csvLine.TermNum})
MERGE (OrigNum:Number {num: (csvLine.OrigNum)})
MERGE (OrigNum)-[r:CALLS ]->(TermNum) return * ;
How long is it taking?
To give you a reference, my DB is about 4M nodes, 650,000 unique relationships, and ~10M-15M properties (not as large, but it should give an idea). It takes me less than 10 minutes to load the nodes file + set multiple labels, and then load the relationships file + create the relationships (all via LOAD CSV). This is also being done on a souped-up computer, but if yours is taking hours, I would make some tweaks.
My suggestions are as follows:
Are you intentionally returning the nodes after the MERGE? I can't imagine you are doing anything with it, but either way, consider removing the RETURN *. With RETURN *, you're returning all nodes, relationships, and paths found in the query and that's bound to slow things down. (http://neo4j.com/docs/stable/query-return.html#return-return-all-elements)
Is the "num" field meant to be unique? If so, consider adding the following constraints (NOTE: this will also create the index, so no need to create it separately). I think this might speed up the MERGE (I'm not sure on that), though see next point.
CREATE CONSTRAINT ON (Numbers:Number) ASSERT Numbers.num IS UNIQUE;
If the num field is unique AND this is a brand-new database (i.e. you're starting from scratch when you run this script), then use CREATE rather than MERGE for the creation of the nodes (for the node-creation step only).
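A sketch of the node-creation step with CREATE instead of MERGE (same file and field names as in the question; only valid if Numbers.csv contains no duplicate num values and the database starts empty):
USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv" AS csvLine FIELDTERMINATOR ';'
CREATE (:Number {num: csvLine.Numbers});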
As was already mentioned by Christophe, you should definitely increase the heap size to around 4g.
Let us know how it goes!
EDIT 1
I have not been able to find much relevant information on memory/performance tuning for the Windows version. What I have found leaves me with a couple of questions, and is potentially outdated.
This is potentially outdated, but provides some background on some of the different settings and the differences between Windows and Linux.
http://blog.bruggen.com/2014/02/some-neo4j-import-tweaks-what-and-where.html
Those differences between Windows and Linux have themselves changed from one version to the next (as demonstrated by the following links):
Cypher MATCH query speed,
https://stackoverflow.com/a/29055966/4471711
Michael's response above seems to indicate that if you're NOT running a Java application with Neo4j, you don't need to worry about the heap (-Xmx). That doesn't seem right to me given the other information I saw, but perhaps all of that other info predates 2.2.
I have also been through this.
http://neo4j.com/docs/stable/configuration.html
So, what I have done is set both heap (-Xmx in the neo4j.vmoptions) and the pagecache to 32g.
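For reference, on more recent Neo4j versions (3.x/4.x) both values are set in neo4j.conf; a sketch, assuming you want to give each of them 32g (in 2.x the equivalent settings were split across neo4j-wrapper.conf and neo4j.properties instead):
dbms.memory.heap.initial_size=32g
dbms.memory.heap.max_size=32g
dbms.memory.pagecache.size=32g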
Can you modify your heap settings to 4096MB?
Also, in the second LOAD CSV, are the numbers used in the first two MERGE clauses already in the database? If so, use MATCH instead.
I would also commit at a level of 10000.
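A sketch of the second load with both changes applied (MATCH instead of the first two MERGEs, commits every 10,000 rows, and the RETURN dropped); this is only safe if Numbers.csv has already been loaded, so every TermNum and OrigNum is guaranteed to exist:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv" AS csvLine FIELDTERMINATOR ';'
MATCH (TermNum:Number {num: csvLine.TermNum})
MATCH (OrigNum:Number {num: csvLine.OrigNum})
MERGE (OrigNum)-[:CALLS]->(TermNum);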