We're loading data into a Neo4j server. The data mainly represents (almost) k-ary trees, with k between 2 and 10 in most cases. We have about 50 possible node types and about the same number of relationship types.
The server is online and data can be loaded from several instances (so, unfortunately, we can't use neo4j-import).
Loading is very slow: about 100,000 nodes and relationships take about 6 minutes to load on a good machine. Sometimes loading the same data takes 40 minutes! When we look at the neo4j process, it is sometimes doing nothing.
In this case, we get messages like:
WARN [o.n.k.g.TimeoutGuard] Transaction timeout. (Overtime: 1481 ms).
Besides that, we have no problems with queries, which execute quickly despite very complex structures.
We load data as follows.
A Cypher file is loaded like this:
neo4j-shell -host localhost -v -port 1337 -file myGraph.cypher
The Cypher file contains several sections:
Constraint creation:
CREATE CONSTRAINT ON (p:MyNodeType) ASSERT p.uid IS UNIQUE;
Indexes on a very small set of node types (10 at most)
We select these carefully to avoid counterproductive performance behaviour.
CREATE INDEX ON :MyNodeType1(uid);
Node creation
USING PERIODIC COMMIT 4000
LOAD CSV WITH HEADERS FROM "file:////tmp/my.csv" AS csvLine
CREATE (p:MyNodeType1 {
  Prop1: csvLine.prop1,
  mySupUUID: toInt(csvLine.uidFonctionEnglobante),
  lineNum: toInt(csvLine.lineNum),
  uid: toInt(csvLine.uid),
  name: csvLine.name,
  projectID: csvLine.projectID,
  vValue: csvLine.vValue
});
Relationship creation
LOAD CSV WITH HEADERS FROM "file:////tmp/RelsInfixExpression-vLeftOperand-SimpleName_javaouille-normal-b11695.csv" AS csvLine
MATCH (n1:MyNodeType1) WHERE n1.uid = toInt(csvLine.uidFather)
WITH n1, csvLine
MATCH (n2:MyNodeType2) WHERE n2.uid = toInt(csvLine.uidSon)
MERGE (n1)-[:vOperandLink]-(n2);
Question 1
We sometimes experienced OOM errors in the Neo4j server while loading data. They were difficult to reproduce, even with the same data. Since we recently added USING PERIODIC COMMIT 1000 to the relationship-loading commands, we have never reproduced the problem. Could this be the solution to the OOM problem?
Question 2
Is our PERIODIC COMMIT parameter well chosen?
Is there another way to speed up data loading, i.e. another strategy for writing the data-loading script?
Question 3
Are there ways to prevent the timeout, either by writing the data-loading script differently or by tuning the JVM?
Question 4
Some months ago we split the Cypher script into 2 or 3 parts to launch them concurrently, but we stopped because the server frequently messed up the data and became unusable. Is there a way to split the script "cleanly" and launch the parts concurrently?
Question 1: Yes, USING PERIODIC COMMIT is the first thing to try when LOAD CSV causes OOM errors.
Question 2&3: The "sweet spot" for periodic commit batch size depends on your Cypher query, your data characteristics, and how your neo4j server is configured (all of which can change over time). You do not want the batch size to be too high (to avoid occasional OOMs), nor too low (to avoid slowing down the import). And you should tune the server's memory configuration as well. But you will have to do your own experimentation to discover the best batch size and server configuration, and adjust them as needed.
Question 4: Concurrent write operations that touch the same nodes and/or relationships must be avoided, as they can cause errors (like deadlocks and constraint violations). If you can split up your operations so that they act on completely disjoint subgraphs, then they should be able to run concurrently without these kinds of errors.
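One way to find such disjoint pieces before loading is to group the relationship rows by connected component on the client side. The following is only a rough sketch in Python, not something from the original setup: the function name is made up, the column names are taken from the relationship CSV in the question, and the sample rows are invented. If uids can collide across node types, this grouping merely becomes more conservative (groups get merged), which is still safe.

```python
import csv
import io
from collections import defaultdict


def split_into_disjoint_groups(rows, src_key="uidFather", dst_key="uidSon"):
    """Group relationship rows so that no node uid appears in two groups."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    rows = list(rows)
    for row in rows:
        union(row[src_key], row[dst_key])

    # Rows whose endpoints share a root belong to the same subgraph.
    groups = defaultdict(list)
    for row in rows:
        groups[find(row[src_key])].append(row)
    return list(groups.values())


# Tiny in-memory stand-in for a relationships CSV (made-up uids).
sample = io.StringIO("uidFather,uidSon\n1,2\n2,3\n10,11\n")
groups = split_into_disjoint_groups(csv.DictReader(sample))
print(len(groups))
```

Each group could then be written to its own CSV file and loaded by a separate script. Whether that is actually faster would still need to be measured on your server.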
Also, you should PROFILE your queries to see how the server will actually execute them. For example, even if both :MyNodeType1(uid) and :MyNodeType2(uid) are indexed (or have uniqueness constraints), that does not mean that the Cypher planner will automatically use those indexes when it executes your last query. If your profile of that query shows that it is not using the indexes, then you can add hints to the query to make the planner (more likely to) use them:
LOAD CSV WITH HEADERS FROM "file:////tmp/RelsInfixExpression-vLeftOperand-SimpleName_javaouille-normal-b11695.csv" AS csvLine
MATCH (n1:MyNodeType1) USING INDEX n1:MyNodeType1(uid)
WHERE n1.uid = TOINT(csvLine.uidFather)
MATCH (n2:MyNodeType2) USING INDEX n2:MyNodeType2(uid)
WHERE n2.uid = TOINT(csvLine.uidSon)
MERGE (n1)-[:vOperandLink]-(n2);
In addition, if it is OK to store the uid values as strings, you can remove the uses of TOINT(). This will speed things up to some extent.
Related
I have CSV files that contain thousands of rows, with file sizes from 500 MB to 3.1 GB. I first did a bulk import, which took a few minutes to load all the data into the graph DB. Now, for my project, I need to upload data on a regular basis, so I have written a Python script using the Neo4j Bolt driver that performs the regular node creations, updates, and deletions. Creating relationships from files also works for a small (prototype) data set. The problem occurs when I create relationships from large files. Although parallelism works, it gets very slow. All 32 CPU cores are fully used; I checked this with htop. With batch sizes of 100-1000 the cores are properly used. With batch sizes of 10000-100000, parallelism does not work. Here is my query code for the LOAD CSV:
"""CALL apoc.periodic.iterate('
load csv with headers from "file:///x.csv" AS row return row
','
MERGE (p1:A {ID: row.A})
MERGE (p2:B {ID: row.B})
WITH p1, p2, row
CALL apoc.create.relationship(p1, row.RELATIONSHIP, {}, p2) YIELD rel return rel
',{batchSize:10000, iterateList:true, parallel:true})"""
It works fine for a small amount of data, but it gets very slow with large data: creating 10 relationships took roughly 39 seconds. Is the MERGE operation inefficient in my case, or am I missing some trick here? Please help me solve this. I am working on an EC2 instance with 240 GB of RAM. I have tried warmup.run, tuned at 192G, but observed no significant change.
I have used the import tool to read in ~1 million nodes. Now it is time to set relationships. (Unfortunately, it looks like you have to have relationships predetermined explicitly in a csv if you want to use the import tool, so that is out of the question.)
First thing I did was to put an index on the nodes.
Next, I wrote the following, and I'm wondering whether it is my problem: even with an index, might this statement cause too many cartesian products?
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM
'file:///home/monica/...relationship.csv' AS line
MATCH (p1:Player {player_id: line.player1_id}), (p2:Player {player_id: line.player2_id})
MERGE (p1)-[:VERSUS]-(p2)
Apparently the USING PERIODIC COMMIT 500 didn't help, as I got my error,
Java heap space
Googling around, I learned that it might help to change my memory settings in the neo4j-wrapper.conf file, so I changed the settings all the way up to 4GB (I have an eight GB system):
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096
Still got the same error.
Now, I'm stuck. I can't think of any other strategies, besides:
1) rewrite the statement
2) use a system with more RAM?
3) find some other way to run this in batches?
Any advice would be awesome. Thanks to the neo4j SO community in advance.
Do you have an index or a unique constraint on :Player(player_id)? If the former, drop the index and add a unique constraint instead. Otherwise it is possible to have multiple Player nodes sharing the same player_id, which could cause cartesian products: if, say, the very same player exists 10 times, each line of your CSV that references it on both sides would end up with 100 combinations.
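To make the arithmetic concrete, here is a small Python sketch (not part of the original setup) that counts duplicate ids in a list and estimates how many (p1, p2) pairs a two-pattern MATCH would produce for one CSV line. The function names and the sample ids are made up for illustration.

```python
from collections import Counter


def duplicate_keys(ids):
    """Return {id: count} for every id that occurs more than once."""
    return {key: n for key, n in Counter(ids).items() if n > 1}


def matched_pairs(counts, id1, id2):
    """Pairs produced by MATCH (p1 {id1}), (p2 {id2}) given per-id node counts."""
    return counts.get(id1, 0) * counts.get(id2, 0)


# Made-up player ids: "42" exists as 10 nodes, "7" as a single node.
player_ids = ["42"] * 10 + ["7"]
counts = Counter(player_ids)

print(duplicate_keys(player_ids))         # {'42': 10}
print(matched_pairs(counts, "42", "42"))  # 100
print(matched_pairs(counts, "42", "7"))   # 10
```

Running a check like this over the node file before adding the constraint shows which ids would violate it.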
Once you're sure there is no such duplication the next thing to check is EagerPipe. If the query plan (without PERIODIC COMMIT)
EXPLAIN LOAD CSV WITH HEADERS FROM
'file:///home/monica/...relationship.csv' AS line
MATCH (b1:Player {player_id: line.player1_id}), (p2:Player {player_id: line.player2_id})
MERGE (p1)-[:VERSUS]-(p2)
shows an Eager operator, then PERIODIC COMMIT is not applied; see http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/ for details.
The cases where this can happen become fewer and fewer with more recent Neo4j versions.
update
I've just realized that you're using b1 in the MATCH and p1 in the MERGE, so the latter is not bound and gets created as a new node during the MERGE.
Can you please try:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM
'file:///home/monica/...relationship.csv' AS line
MATCH (p1:Player {player_id: line.player1_id})
MATCH (p2:Player {player_id: line.player2_id})
MERGE (p1)-[:VERSUS]-(p2)
I am importing around 12 million nodes and 13 million relationships.
First I used the CSV import with periodic commit 50000 and divided the data into different chunks, but it is still taking too much time.
Then I saw the batch insertion method, but for that I would have to create new data sets in Excel sheets.
Basically I am importing the data from SQL Server: first I save the data into CSV, then import it into my Neo4j.
Also, I am using the Neo4j Community edition. I changed the configuration properties as suggested in everything I had found on Stack Overflow. Even so, with periodic commit 50K it is fast initially, but after 1 million it takes too much time.
Is there any way to import this data directly from SQL in a short span of time, given that Neo4j is famous for working fast with big data? Any suggestions or help?
Here is the LOAD CSV used (index on numbers(num)) :
USING PERIODIC COMMIT 50000 load csv with headers from "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv"
AS csvLine fieldterminator ';'
Merge (Numbers:Number {num: csvLine.Numbers}) return * ;
USING PERIODIC COMMIT 50000 load csv with headers from "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv"
AS csvLine fieldterminator ';'
MERGE (TermNum:Number {num: csvLine.TermNum})
MERGE (OrigNum:Number {num: (csvLine.OrigNum)})
MERGE (OrigNum)-[r:CALLS ]->(TermNum) return * ;
How long is it taking?
To give you a reference, my db is about 4m nodes, 650,000 unique relationships, ~10m-15m properties (not as large, but should provide an idea). It takes me less than 10 minutes to load in the nodes file + set multiple labels, and then load in the relationships file + set the relationships (all via LOAD CSV). This is also being done on a souped-up computer, but if yours is taking hours, I would make some tweaks.
My suggestions are as follows:
Are you intentionally returning the nodes after the MERGE? I can't imagine you are doing anything with it, but either way, consider removing the RETURN *. With RETURN *, you're returning all nodes, relationships, and paths found in the query and that's bound to slow things down. (http://neo4j.com/docs/stable/query-return.html#return-return-all-elements)
Is the "num" field meant to be unique? If so, consider adding the following constraints (NOTE: this will also create the index, so no need to create it separately). I think this might speed up the MERGE (I'm not sure on that), though see next point.
CREATE CONSTRAINT ON (Numbers:Number) ASSERT Numbers.num IS UNIQUE;
If the num field is unique AND this is a brand new database (i.e. you're starting from scratch when you run this script), then call CREATE to create the nodes, rather than MERGE (for the creation of the nodes only).
As was already mentioned by Christophe, you should definitely increase the heap size to around 4g.
Let us know how it goes!
EDIT 1
I have not been able to find much relevant information on memory/performance tuning for the Windows version. What I have found leaves me with a couple of questions, and is potentially outdated.
This is potentially outdated, but provides some background on some of the different settings and the differences between Windows and Linux.
http://blog.bruggen.com/2014/02/some-neo4j-import-tweaks-what-and-where.html
Those differences between Windows & Linux have themselves changed from one version to the next (as demonstrated with the following links)
Cypher MATCH query speed,
https://stackoverflow.com/a/29055966/4471711
Michael's response above seems to indicate that if you're NOT running a java application with Neo4j, you don't need to worry about the heap (-Xmx), however that doesn't seem right in my mind given the other information I saw, but perhaps all of that other info is prior to 2.2.
I have also been through this.
http://neo4j.com/docs/stable/configuration.html
So, what I have done is set both heap (-Xmx in the neo4j.vmoptions) and the pagecache to 32g.
Can you change your heap settings to 4096 MB?
Also, in the second LOAD CSV, are the numbers used in the first two MERGE clauses already in the database? If so, use MATCH instead.
I would also commit in batches of 10000.
I have a very long Cypher request in my app (running on Node.Js and Neo4j 2.0.1), which creates at once about 16 nodes and 307 relationships between them. It is about 50K long.
The high number of relationships is determined by the data model, which I probably want to change later, but nevertheless, if I decide to keep everything as it is, two questions:
1) What would be the maximum size of each single Cypher request I send to Neo4J?
2) What would be the best strategy to deal with a request that is too long? Split it into the smaller ones and then batch them in a transaction? I wouldn't like to do that because in this case I lose the consistency that I had resulting from a combination of MERGE and CREATE commands (the request automatically recognized some nodes that did not exist yet, create them, and then I could make relations between them using their indices that I already got through the MERGE).
Thank you!
I usually recommend the following:
Use smaller statements, so that the query plan cache can kick in and execute your query immediately without compiling. For this you also need parameters, e.g. {context} or {user}.
I think a statement size of up to 10-15 elements is easy to handle.
You can still execute all of them in a single tx with the transactional cypher endpoint, which allows batching of statements and their parameters.
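As a rough sketch of what such a batched request could look like, the Python below only builds the JSON body (a "statements" list where each entry has a "statement" and its "parameters") and does not send it. The label Item, the relationship type LINKS_TO, the property names, and the helper name are all made up for illustration. On 2.x servers the body would, as far as I know, be POSTed to /db/data/transaction/commit.

```python
import json

# Small parameterised statements. The {name} placeholders are the legacy
# parameter syntax used in the answer above. Item and LINKS_TO are hypothetical.
MERGE_NODE = "MERGE (n:Item {id: {id}}) SET n.name = {name}"
MERGE_REL = (
    "MATCH (a:Item {id: {from_id}}), (b:Item {id: {to_id}}) "
    "MERGE (a)-[:LINKS_TO]->(b)"
)


def build_payload(nodes, rels):
    """Build one request body: many small statements, one transaction."""
    statements = []
    for node in nodes:
        statements.append({"statement": MERGE_NODE, "parameters": node})
    for from_id, to_id in rels:
        statements.append({
            "statement": MERGE_REL,
            "parameters": {"from_id": from_id, "to_id": to_id},
        })
    return {"statements": statements}


payload = build_payload(
    nodes=[{"id": "a", "name": "A"}, {"id": "b", "name": "B"}],
    rels=[("a", "b")],
)
print(json.dumps(payload, indent=2))
```

Because the statement text is identical for every node and every relationship, the server can reuse the cached plan and only the parameters change.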
I'm loading a Neo4j database using Cypher commands piped directly into the neo4j-shell. Some experiments suggest that subgraph batches of about 1000 lines give the optimal throughput (about 3.2ms/line, 300 lines/sec (slow!), Neo4j 2.0.1). I use MATCH statements to bind existing nodes to the loading subgraph. Here's a chopped example:
begin
...
MATCH (domain75ea8a4da9d65189999d895f536acfa5:SubDomain { shorturl: "threeboysandanoldlady.blogspot.com" })
MATCH (domainf47c8afacb0346a5d7c4b8b0e968bb74:SubDomain { shorturl: "myweeview.com" })
MATCH (domainf431704fab917205a54b2477d00a3511:SubDomain { shorturl: "www.computershopper.com" })
CREATE
(article1641203:Article { id: "1641203", url: "http://www.coolsocial.net/sites/www/blackhawknetwork.com.html", type: 4, timestamp: 1342549270, datetime: "2012-07-17 18:21:10"}),
(article1641203)-[:PUBLISHED_IN]->(domaina9b3ed6f4bc801731351b913dfc3f35a),(author104675)-[:WROTE]->(article1641203),
....
commit
Using this (ridiculously slow) method, it takes several hours to load 200K nodes (~370K relationships) and, at that point, the loading slows down even more. I presume the asymptotic slowdown is due to the overhead of the MATCH statements. They make up 1/2 of the subgraph load statements by the time the graph hits 200K nodes. There's got to be a better way of doing this, it just doesn't scale.
I'm going to try rewriting the statements with parameters (refs: What is the most efficient way to insert nodes into a neo4j database using cypher AND http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/). I expect that to help, but it seems that I will still have problems making the subgraph connections. Would using MERGE or CREATE UNIQUE instead of the MATCH statements be the way to go? There must be best practice ways to do this that I'm missing. Any other speed-up ideas?
many thanks
Use MERGE, and do smaller transactions--I've found best results with batches of 50-100 (while doing index lookups). Bigger batches are better when doing CREATE only without MATCH. Also, I recommend using a driver to send your commands over the transactional API (with parameters) instead of via neo4j-shell--it tends to be a fair bit faster.
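For what it's worth, splitting the input into fixed-size batches on the client is only a few lines. This is a generic Python sketch (the helper name is made up), where each yielded batch would be sent as one transaction:

```python
from itertools import islice


def batches(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Example: 250 made-up row numbers in batches of 100.
sizes = [len(batch) for batch in batches(range(250), 100)]
print(sizes)  # [100, 100, 50]
```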
Alternatively (might not be applicable to all use cases), keep a local "index" of the node ids you've created. For only 200k items, this should be easy to fit in a normal map/dict of string->long. This will prevent you needing to tax the index on the db, and you can do only node-ID-based lookups and CREATE statements, and create the indexes later.
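A minimal sketch of that local "index" idea in Python, assuming you record the node id the server returns for each node you create. The ids, the helper name, and the statement text below are made up for illustration:

```python
# Hypothetical statement for id-based lookups (legacy {param} syntax).
CREATE_REL_BY_ID = (
    "MATCH (a), (b) WHERE id(a) = {from_id} AND id(b) = {to_id} "
    "CREATE (a)-[:PUBLISHED_IN]->(b)"
)


def resolve_relationships(node_ids, rels):
    """Translate (from_key, to_key) pairs into node-id pairs via a local dict.

    Pairs with an unknown key are returned separately instead of failing.
    """
    resolved, missing = [], []
    for from_key, to_key in rels:
        try:
            resolved.append((node_ids[from_key], node_ids[to_key]))
        except KeyError:
            missing.append((from_key, to_key))
    return resolved, missing


# Local string -> node id map, filled in as nodes are created (made-up ids).
local_index = {"myweeview.com": 101, "www.computershopper.com": 102}
rels = [
    ("myweeview.com", "www.computershopper.com"),
    ("myweeview.com", "unknown.example"),
]
resolved, missing = resolve_relationships(local_index, rels)
print(resolved)  # [(101, 102)]
print(missing)   # [('myweeview.com', 'unknown.example')]
```

Each resolved pair can then be passed as the parameters of a statement like CREATE_REL_BY_ID, so the database never has to consult a label index during the load.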
The load2neo plugin worked well for me. Installation was fast+painless and it has a very cypher-like command structure that easily supports uniqueness requirements. Works with neo4j 2.0 labels.
load2neo install + curl usage example:
http://nigelsmall.com/load2neo
load2neo Geoff syntax:
http://nigelsmall.com/geoff
It is much faster (>>10x) than using Cypher via neo4j-shell.
I wasn't able to get the parameters in Cypher through neo4j-shell working despite trying everything I could find via internet search.