Big data import into neo4j

I am importing around 12 million nodes and 13 million relationships.
First I used the CSV import with periodic commit 50000 and divided the data into different chunks, but it is still taking too much time.
Then I looked at the batch insertion method, but for that I would have to create new data sets in an Excel sheet.
Basically I am importing the data from SQL Server: first I save the data into CSV, then import it into Neo4j.
I am also using the Neo4j community edition. I changed the configuration properties following everything I found on Stack Overflow, but while the import initially goes fast with periodic commit 50K, it slows right down after about 1 million rows.
Is there any way to import this data directly from SQL in a short span of time, given that Neo4j is known for handling big data quickly? Any suggestions or help?
Here is the LOAD CSV I used (with an index on :Number(num)):
USING PERIODIC COMMIT 50000 load csv with headers from "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv"
AS csvLine fieldterminator ';'
Merge (Numbers:Number {num: csvLine.Numbers}) return * ;
USING PERIODIC COMMIT 50000 load csv with headers from "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv"
AS csvLine fieldterminator ';'
MERGE (TermNum:Number {num: csvLine.TermNum})
MERGE (OrigNum:Number {num: (csvLine.OrigNum)})
MERGE (OrigNum)-[r:CALLS ]->(TermNum) return * ;

How long is it taking?
To give you a reference, my db is about 4m nodes, 650,000 unique relationships, ~10m-15m properties (not as large, but it should provide an idea). It takes me less than 10 minutes to load in the nodes file + set multiple labels, and then load in the relationships file + set the relationships (all via LOAD CSV). This is also being done on a souped-up computer, but if yours is taking hours, I would make some tweaks.
My suggestions are as follows:
Are you intentionally returning the nodes after the MERGE? I can't imagine you are doing anything with it, but either way, consider removing the RETURN *. With RETURN *, you're returning all nodes, relationships, and paths found in the query and that's bound to slow things down. (http://neo4j.com/docs/stable/query-return.html#return-return-all-elements)
Is the "num" field meant to be unique? If so, consider adding the following constraints (NOTE: this will also create the index, so no need to create it separately). I think this might speed up the MERGE (I'm not sure on that), though see next point.
CREATE CONSTRAINT ON (Numbers:Number) ASSERT Numbers.num IS UNIQUE;
If the num field is unique AND this is a brand new database (i.e. you're starting from scratch when you run this script), then call CREATE to create the nodes, rather than MERGE (for the creation of the nodes only).
As was already mentioned by Christophe, you should definitely increase the heap size to around 4g.
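Putting those suggestions together, a rough sketch of the node-loading statement might look like this (assuming num is meant to be unique and you are starting from an empty database, so CREATE is safe; the file path and field terminator are taken from your question):
CREATE CONSTRAINT ON (n:Number) ASSERT n.num IS UNIQUE;
USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv"
AS csvLine FIELDTERMINATOR ';'
CREATE (:Number {num: csvLine.Numbers});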
Let us know how it goes!
EDIT 1
I have not been able to find much relevant information on memory/performance tuning for the Windows version. What I have found leaves me with a couple of questions, and is potentially outdated.
This is potentially outdated, but provides some background on some of the different settings and the differences between Windows and Linux.
http://blog.bruggen.com/2014/02/some-neo4j-import-tweaks-what-and-where.html
Those differences between Windows & Linux have themselves changed from one version to the next (as demonstrated with the following links)
Cypher MATCH query speed,
https://stackoverflow.com/a/29055966/4471711
Michael's response above seems to indicate that if you're NOT running a Java application with Neo4j, you don't need to worry about the heap (-Xmx). That doesn't seem right to me given the other information I saw, but perhaps all of that other information predates 2.2.
I have also been through this.
http://neo4j.com/docs/stable/configuration.html
So, what I have done is set both heap (-Xmx in the neo4j.vmoptions) and the pagecache to 32g.
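For reference, on Neo4j 3.x the equivalent knobs live in conf/neo4j.conf; a minimal sketch, assuming that version (on 2.x the heap is instead set in conf/neo4j-wrapper.conf and the page cache via dbms.pagecache.memory in neo4j.properties):
dbms.memory.heap.initial_size=32g
dbms.memory.heap.max_size=32g
dbms.memory.pagecache.size=32g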

Can you modify your heap settings to 4096 MB?
Also, in the second LOAD CSV, are the numbers used for the first two MERGE clauses already in the database? If yes, use MATCH instead.
I would also commit at a level of 10000.
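A rough sketch of that second LOAD CSV with MATCH instead of MERGE, assuming every number was already created by the first load (file path and field terminator taken from the question, batch size lowered as suggested above):
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv"
AS csvLine FIELDTERMINATOR ';'
MATCH (TermNum:Number {num: csvLine.TermNum})
MATCH (OrigNum:Number {num: csvLine.OrigNum})
MERGE (OrigNum)-[:CALLS]->(TermNum);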

Related

Does the MERGE operation for creating relationships from a big CSV file with thousands of rows get slow in Neo4j?

I have files containing thousands of rows, where the CSV files range from 500 MB to 3.1 GB. I first did a bulk import, which took a few minutes to load all the data into the graph DB. Now, for my project, I need to upload data on a regular basis, so I have written a Python script using the Neo4j Bolt driver that performs all the regular node creates, updates, and deletes. Creating relationships from files also works for small amounts of data (a prototype). The problem occurs when I try to create relationships from large files. Although parallelism works, it gets very slow; my 32-core CPU is fully used (I have checked this with htop). For batch sizes of 100-1000 the cores are used properly; with batch sizes of 10000-100000, parallelism does not work. Here is my query code for the LOAD CSV:
"""CALL apoc.periodic.iterate('
load csv with headers from "file:///x.csv" AS row return row
','
MERGE (p1:A {ID: row.A})
MERGE (p2:B {ID: row.B})
WITH p1, p2, row
CALL apoc.create.relationship(p1, row.RELATIONSHIP, {}, p2) YIELD rel return rel
',{batchSize:10000, iterateList:true, parallel:true})"""
It works totally fine for a small amount of data, but it gets very slow with a big data set; creating 10 relationships took roughly 39 seconds. Is the MERGE operation inefficient in my case, or am I missing some trick here? Kindly help me solve this. I am working on an EC2 instance where the RAM size is 240 GB. I have tried warmup.run (it used about 192 GB), but no significant change was observed.

Erratic behavior of Neo4j while loading data

We're loading data into a Neo4j server which mainly represents (almost) k-ary trees, with k between 2 and 10 in most cases. We have about 50 possible node types, and about the same number of relationship types.
The server is online and data can be loaded from several instances (so, unfortunately, we can't use neo4j-import).
We experience very slow loading for about 100,000 nodes and relationships, which take about 6 minutes to load on a good machine. Sometimes loading the same data takes 40 minutes! Looking at the Neo4j process, it is sometimes doing nothing...
In this case, we have messages like:
WARN [o.n.k.g.TimeoutGuard] Transaction timeout. (Overtime: 1481 ms).
Besides that, we don't experience problems with queries, which execute quickly despite very complex structures.
We load data as follows:
A Cypher file is loaded like this:
neo4j-shell -host localhost -v -port 1337 -file myGraph.cypher
The Cypher file contains several sections:
Constraint creation:
CREATE CONSTRAINT ON (p:MyNodeType) ASSERT p.uid IS UNIQUE;
Indexes on a very small set of node types (10 at most):
We carefully select these to avoid counterproductive performance behaviour.
CREATE INDEX ON :MyNodeType1(uid);
Node creation:
USING PERIODIC COMMIT 4000 LOAD CSV WITH HEADERS FROM "file:////tmp/my.csv" AS csvLine CREATE (p:MyNodeType1 {Prop1: csvLine.prop1, mySupUUID: toInt(csvLine.uidFonctionEnglobante), lineNum: toInt(csvLine.lineNum), uid: toInt(csvLine.uid), name: csvLine.name, projectID: csvLine.projectID, vValue: csvLine.vValue});
Relationship creation:
LOAD CSV WITH HEADERS FROM "file:////tmp/RelsInfixExpression-vLeftOperand-SimpleName_javaouille-normal-b11695.csv" AS csvLine Match (n1:MyNodeType1) Where n1.uid = toInt(csvLine.uidFather) With n1, csvLine Match (n2:MyNodeType2) Where n2.uid = toInt(csvLine.uidSon) MERGE (n1)-[:vOperandLink]-(n2);
Question 1
We sometimes experienced OOM errors in the Neo4j server while loading data, which were difficult to reproduce even with the same data. But since recently adding USING PERIODIC COMMIT 1000 to the relationship-loading commands, we have never reproduced this problem. Could this be the solution to the OOM problem?
Question 2
Is the periodic commit parameter good?
Is there another way to speed up data loading? I.e., another strategy for writing the data loading script?
Question 3
Are there ways to prevent the timeout? With another way to write the data loading script, or maybe JVM tuning?
Question 4
Some months ago we split the Cypher script into 2 or 3 parts to launch them concurrently, but we stopped doing that because the server frequently messed up the data and became unusable. Is there a way to split the script "cleanly" and launch the parts concurrently?
Question 1: Yes, USING PERIODIC COMMIT is the first thing to try when LOAD CSV causes OOM errors.
Question 2&3: The "sweet spot" for periodic commit batch size depends on your Cypher query, your data characteristics, and how your neo4j server is configured (all of which can change over time). You do not want the batch size to be too high (to avoid occasional OOMs), nor too low (to avoid slowing down the import). And you should tune the server's memory configuration as well. But you will have to do your own experimentation to discover the best batch size and server configuration, and adjust them as needed.
Question 4: Concurrent write operations that touch the same nodes and/or relationships must be avoided, as they can cause errors (like deadlocks and constraint violations). If you can split up your operations so that they act on completely disjoint subgraphs, then they should be able to run concurrently without these kinds of errors.
Also, you should PROFILE your queries to see how the server will actually execute them. For example, even if both :MyNodeType1(uid) and :MyNodeType2(uid) are indexed (or have uniqueness constraints), that does not mean that the Cypher planner will automatically use those indexes when it executes your last query. If your profile of that query shows that it is not using the indexes, then you can add hints to the query to make the planner (more likely to) use them:
LOAD CSV WITH HEADERS FROM "file:////tmp/RelsInfixExpression-vLeftOperand-SimpleName_javaouille-normal-b11695.csv" AS csvLine
MATCH (n1:MyNodeType1) USING INDEX n1:MyNodeType1(uid)
WHERE n1.uid = TOINT(csvLine.uidFather)
MATCH (n2:MyNodeType2) USING INDEX n2:MyNodeType2(uid)
WHERE n2.uid = TOINT(csvLine.uidSon)
MERGE (n1)-[:vOperandLink]-(n2);
In addition, if it is OK to store the uid values as strings, you can remove the uses of TOINT(). This will speed things up to some extent.

Best Way to Bulk Load the Data into Neo4J

We are trying to load millions of nodes and relationships into Neo4j. We are currently using the command below:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:customers.csv" AS row
CREATE (:Customer ....
But it is taking us a lot of time.
I did see a link which explains modifying the Neo4j files directly:
http://blog.xebia.com/combining-neo4j-and-hadoop-part-ii/
But the above link seems to be very old. I wanted to know if that process is still valid.
There is an issue on the "neo4j-spark-connector" GitHub repository which is not fully updated:
https://github.com/neo4j-contrib/neo4j-spark-connector/issues/15
What is the best way among these?
The fastest way, especially for large data sets, should be through the import tool instead of via Cypher with LOAD CSV.
If you are using LOAD CSV, potentially with MERGE, I highly recommend adding unique constraints - for us it sped up a smallish import (100k nodes) by 100 times or so.
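For example, for the Customer load above, a uniqueness constraint might look like the sketch below; the id property is only an assumption, use whatever key column your CSV actually has:
CREATE CONSTRAINT ON (c:Customer) ASSERT c.id IS UNIQUE;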
You can make use of APOC procedures, which can perform better for large datasets. Below is a sample Cypher query:
CALL apoc.periodic.iterate(
'CALL apoc.load.csv(file_path) YIELD lineNo, map as row, list return row',
'MATCH (post:Post {id:row.`:END_ID(Post)`})
MATCH (owner:User {id:row.`:START_ID(User)`})
MERGE (owner)-[:ASKED]->(post);',
{batchSize:500, iterateList:true, parallel:true}
);
Below is the documentation link:
https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_examples_for_apoc_load_csv

Creating/Managing millions of Vertex Tree in Neo4j 3.0.4

I'm doing some stuff with my University and I've been asked to create a system that builds Complete Trees with millions of nodes (1 or 2 million at least).
I was trying to create the tree with a LOAD CSV using a periodic commit, and it worked well for creating just the nodes (70000 ms on a general-purpose notebook :P). When I tried the same with the edges, it didn't scale as well.
Using periodic commit LOAD CSV WITH HEADERS FROM 'file:///Archi.csv' AS line
Merge (:Vertex {name:line.from})<-[:EDGE {attr1: toFloat(line.attr1), attr2:toFloat(line.attr2), attr3: toFloat(line.attr3), attr4: toFloat(line.attr4), attr5: toFloat(line.attr5)}]-(:Vertex {name:line.to})
I need to guarantee that a Tree is generated in no more than 5 minutes.
Is there a faster method that can deliver such performance?
P.S.: The task doesn't require Neo4j, just a database (either SQL or NoSQL), but I found this NoSQL graph DB and thought it would be nice to implement it with Neo4j, as the graph data structure comes for free.
P.P.S.: I'm using Cypher.
I think you should read up on MERGE in the developer documentation again, to make sure you understand exactly what it's doing.
A few things in particular to be aware of...
If the pattern you are merging does not exist, all elements of the pattern will be merged, which could result in duplicate :Vertex nodes being created. If your :Vertex nodes are supposed to be in the database already, if there are no relationships yet, and if you are sure that no relationship repeats itself in your CSV, I strongly urge you to MATCH on the start and end nodes and then CREATE the relationship between them instead of using MERGE (see the sketch below). Remember that doing a MERGE on a relationship with many attributes means it will try to match on those first, so as the number of relationships between nodes grows, there will be an increasing number of comparisons, which will slow your query down further. CREATE is a better choice if you know that no relationship will be duplicated and you are sure those relationships don't exist yet.
I also urge you to create an index on :Vertex(name), as that will significantly help matching on end nodes.
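A minimal sketch of that approach, assuming the :Vertex nodes already exist and no relationship repeats itself in the CSV (column names and file path taken from the question):
CREATE INDEX ON :Vertex(name);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Archi.csv' AS line
MATCH (v1:Vertex {name: line.from})
MATCH (v2:Vertex {name: line.to})
CREATE (v1)<-[:EDGE {attr1: toFloat(line.attr1), attr2: toFloat(line.attr2), attr3: toFloat(line.attr3), attr4: toFloat(line.attr4), attr5: toFloat(line.attr5)}]-(v2);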

'Java heap space' error with neo4j -- when increasing memory allocation fails

I have used the import tool to read in ~1 million nodes. Now it is time to set relationships. (Unfortunately, it looks like you have to have relationships predetermined explicitly in a csv if you want to use the import tool, so that is out of the question.)
First thing I did was to put an index on the nodes.
Next, I wrote the following, which I'm wondering might be my problem -- even with an index, might this statement cause too many Cartesian products?
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM
'file:///home/monica/...relationship.csv' AS line
MATCH (p1:Player {player_id: line.player1_id}), (p2:Player {player_id: line.player2_id})
MERGE (p1)-[:VERSUS]-(p2)
Apparently the USING PERIODIC COMMIT 500 didn't help, as I got my error,
Java heap space
Googling around, I learned that it might help to change my memory settings in the neo4j-wrapper.conf file, so I changed the settings all the way up to 4GB (I have an eight GB system):
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096
Still got the same error.
Now, I'm stuck. I can't think of any other strategies, besides:
1) rewrite the statement
2) use a system with more RAM?
3) find some other way to run this in batches?
Any advice would be awesome. Thanks to the neo4j SO community in advance.
Do you have an index or a unique constraint on :Player(player_id)? If the former, drop the index and add a unique constraint instead (see the sketch below). Otherwise it is possible to have multiple Player nodes sharing the same player_id, which could cause Cartesian products: assume you have 10 copies of the very same player; this would end up in 100 combinations for each line of your CSV.
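A sketch of that swap, using the property name from your query (Neo4j 2.x/3.x syntax):
DROP INDEX ON :Player(player_id);
CREATE CONSTRAINT ON (p:Player) ASSERT p.player_id IS UNIQUE;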
Once you're sure there is no such duplication, the next thing to check is the EagerPipe. If the query plan (without PERIODIC COMMIT)
EXPLAIN LOAD CSV WITH HEADERS FROM
'file:///home/monica/...relationship.csv' AS line
MATCH (b1:Player {player_id: line.player1_id}), (p2:Player {player_id: line.player2_id})
MERGE (p1)-[:VERSUS]-(p2)
shows something with Eager, then PERIODIC COMMIT is not applied; see http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/ for details.
The cases where this can happen get fewer and fewer with more recent Neo4j versions.
update
I've just realized that you're using b1 in the MATCH but p1 in the MERGE, so the latter is not bound and gets created as a new node during the MERGE.
Can you please try:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM
'file:///home/monica/...relationship.csv' AS line
MATCH (p1:Player {player_id: line.player1_id})
MATCH (p2:Player {player_id: line.player2_id})
MERGE (p1)-[:VERSUS]-(p2)
