My database was affected by the bug in Neo4j 2.1.1 that tends to corrupt the database in areas where many nodes have been deleted. It turns out most of the affected relationships had been marked for deletion in my database. I have dumped the rest of the data using neo4j-shell with a single query. This gives a 1.5G Cypher file that I need to import into a mint database to get my data back into a healthy data structure.
I have noticed that the dump file contains definitions for (1) schema, (2) nodes and (3) relationships. I have already removed the schema definitions from the file because they can be applied later on. Now the issue is that since the dump file uses a single series of identifiers for nodes during node creation (in the following format: _nodeid) and relationship creation, it seems that all CREATE statements (33,160,527 in my case) need to be run in a single transaction.
My first attempt to do so kept the server busy for 36 hours without results. I had neo4j-shell load the data directly into a new database directory instead of connecting to a server. The data files in the new database directory never showed any sign of receiving data, and the message log showed many messages indicating thread blocks.
What is the best way of getting this data back into the database? Should I load a specific config file? Do I need to allocate a large Java heap? What is the trick to getting such a large dump file loaded into a database?
The dump command is not meant for larger-scale exports; there was originally a version that supported them, but it was not included in the product.
If you still have the old database around, you can try a few things:
contact Neo4j support to help you recover your data
use my store-utils to copy it over to a new db (it will skip all broken records)
query the data with Cypher and export the results as CSV
you could use the shell-import-tools for that
and then import your data from the CSV using either the shell tools again, the LOAD CSV command, or the batch-importer
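As a rough sketch of that last LOAD CSV step (the file names, the Person label and the uid/name/KNOWS identifiers below are hypothetical placeholders for whatever the CSV export actually contains):

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///nodes.csv" AS row
CREATE (:Person {uid: row.uid, name: row.name});

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///rels.csv" AS row
MATCH (a:Person {uid: row.startUid}), (b:Person {uid: row.endUid})
MERGE (a)-[:KNOWS]->(b);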
Here is what I finally did:
First I identified all unaffected nodes and marked them with one specific label (let's say Carriable). This was a pretty easy process in my case because all the affected nodes had the same label, so I just excluded that label. I did not have to identify the affected relationships separately, because all of them were connected to nodes carrying the affected label.
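As a sketch of that first step (the Affected label below is a hypothetical stand-in for the one label that all corrupted nodes shared in my database):

MATCH (n)
WHERE NOT n:Affected
SET n:Carriable;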
Then I exported the whole database except the affected nodes and relationships to GraphML using a single query (in neo4j-shell):
export-graphml -o /home/mah/full.gml -t -r match (n:Carriable) optional match (n)-[i]-(:Carriable) return n,i
This took about a half hour to yield a 4GB XML file.
Then I imported the entire GraphML back into a mint database:
JAVA_OPTS="-Xmx8G" neo4j-shell -c "import-graphml -c -t -b 10000 -i /home/mah/full.gml" -path /db/newneo
This took yet another half hour to accomplish.
Please note that I allocated more than sufficient Java heap memory (JAVA_OPTS="-Xmx8G"), imposed a particularly small batch size (-b 10000) and allowed the use of on-disk caching (-c).
Finally, I removed the unnecessary "Carriable" label and recreated the constraints.
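That last step was plain Cypher again; a sketch, where the User/uid constraint is only an example standing in for the schema definitions removed from the dump earlier:

MATCH (n:Carriable)
REMOVE n:Carriable;

CREATE CONSTRAINT ON (u:User) ASSERT u.uid IS UNIQUE;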
Related
I have files containing thousands of rows, with CSV file sizes ranging from 500MB to 3.1GB. I first did a bulk import, which took a few minutes to load all the data into the graph DB. Now, for my project, I need to upload data on a regular basis, so I have written a Python script using the Neo4j Bolt driver that performs all the regular node creates, updates and deletes. Creating relationships from files also works for small amounts of data (a prototype). The problem occurs when I create relationships from large files. Parallelism works, but it gets very slow; my 32-core CPU is fully used (I checked through htop). For batch sizes of 100-1000 the cores are used properly; for batch sizes of 10000-100000, parallelism does not work. Here is my query code for the LOAD CSV:
"""CALL apoc.periodic.iterate('
load csv with headers from "file:///x.csv" AS row return row
','
MERGE (p1:A {ID: row.A})
MERGE (p2:B {ID: row.B})
WITH p1, p2, row
CALL apoc.create.relationship(p1, row.RELATIONSHIP, {}, p2) YIELD rel return rel
',{batchSize:10000, iterateList:true, parallel:true})"""
It works totally fine for a small amount of data, but it gets very slow with a large amount: creating 10 relationships took roughly 39 seconds. Is the MERGE operation inefficient in my case, or am I missing some trick here? Kindly help me solve this. I am working on an EC2 instance with 240G of RAM. I have tried warmup.run, which warmed up to about 192G, but no significant change was observed.
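For reference, MERGE lookups like the two in the query above are normally only fast when the merged key is backed by a uniqueness constraint (or at least an index). A minimal sketch, assuming the A/B labels and ID property used in the query:

CREATE CONSTRAINT ON (a:A) ASSERT a.ID IS UNIQUE;
CREATE CONSTRAINT ON (b:B) ASSERT b.ID IS UNIQUE;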
We're loading data into a Neo4j server which mainly represents (almost) k-ary trees, with k between 2 and 10 in most cases. We have about 50 possible node types, and about the same number of relationship types.
The server is online and data can be loaded from several instances (so, unfortunately, we can't use neo4j-import).
We experience very slow loading: about 100,000 nodes and relationships take about 6 minutes to load on a good machine. Sometimes loading the same data takes 40 minutes! Looking at the neo4j process, it sometimes appears to be doing nothing...
In this case, we get messages like:
WARN [o.n.k.g.TimeoutGuard] Transaction timeout. (Overtime: 1481 ms).
Besides, we don't experience problems with queries, which execute quickly despite very complex structures.
We load data as follows:
A Cypher file is loaded like this:
neo4j-shell -host localhost -v -port 1337 -file myGraph.cypher
The Cypher file contains several sections:
Constraint creations:
CREATE CONSTRAINT ON (p:MyNodeType) ASSERT p.uid IS UNIQUE;
Indexes on a very small set of node types (10 at most).
We carefully select these to avoid counterproductive performance behaviour.
CREATE INDEX ON :MyNodeType1(uid);
Node creations
USING PERIODIC COMMIT 4000 LOAD CSV WITH HEADERS FROM "file:////tmp/my.csv" AS csvLine CREATE (p:MyNodeType1 {Prop1: csvLine.prop1, mySupUUID: toInt(csvLine.uidFonctionEnglobante), lineNum: toInt(csvLine.lineNum), uid: toInt(csvLine.uid), name: csvLine.name, projectID: csvLine.projectID, vValue: csvLine.vValue});
Relationship creations
LOAD CSV WITH HEADERS FROM "file:////tmp/RelsInfixExpression-vLeftOperand-SimpleName_javaouille-normal-b11695.csv" AS csvLine Match (n1:MyNodeType1) Where n1.uid = toInt(csvLine.uidFather) With n1, csvLine Match (n2:MyNodeType2) Where n2.uid = toInt(csvLine.uidSon) MERGE (n1)-[:vOperandLink]-(n2);
Question 1
We sometimes experienced OOM in the Neo4j server while loading data, which was difficult to reproduce even with the same data. But since we recently added USING PERIODIC COMMIT 1000 to the relationship-loading commands, we have never reproduced this problem. Could this be the solution to the OOM problem?
Question 2
Is the periodic commit value appropriate?
Is there another way to speed up data loading? I.e., another strategy for writing the data-loading script?
Question 3
Are there ways to prevent timeouts? Perhaps another way to write the data-loading script, or some JVM tuning?
Question 4
Some months ago we split the Cypher script into 2 or 3 parts in order to launch them concurrently, but we stopped doing that because the server frequently messed up the data and became unusable. Is there a way to split the script "cleanly" and launch the parts concurrently?
Question 1: Yes, USING PERIODIC COMMIT is the first thing to try when LOAD CSV causes OOM errors.
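For example, the relationship-loading statement from the question with the periodic commit prefix added (the query itself is unchanged; this is just a sketch of what was described above):

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:////tmp/RelsInfixExpression-vLeftOperand-SimpleName_javaouille-normal-b11695.csv" AS csvLine
MATCH (n1:MyNodeType1) WHERE n1.uid = toInt(csvLine.uidFather)
WITH n1, csvLine
MATCH (n2:MyNodeType2) WHERE n2.uid = toInt(csvLine.uidSon)
MERGE (n1)-[:vOperandLink]-(n2);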
Question 2&3: The "sweet spot" for periodic commit batch size depends on your Cypher query, your data characteristics, and how your neo4j server is configured (all of which can change over time). You do not want the batch size to be too high (to avoid occasional OOMs), nor too low (to avoid slowing down the import). And you should tune the server's memory configuration as well. But you will have to do your own experimentation to discover the best batch size and server configuration, and adjust them as needed.
Question 4: Concurrent write operations that touch the same nodes and/or relationships must be avoided, as they can cause errors (like deadlocks and constraint violations). If you can split up your operations so that they act on completely disjoint subgraphs, then they should be able to run concurrently without these kinds of errors.
Also, you should PROFILE your queries to see how the server will actually execute them. For example, even if both :MyNodeType1(uid) and :MyNodeType2(uid) are indexed (or have uniqueness constraints), that does not mean that the Cypher planner will automatically use those indexes when it executes your last query. If your profile of that query shows that it is not using the indexes, then you can add hints to the query to make the planner (more likely to) use them:
LOAD CSV WITH HEADERS FROM "file:////tmp/RelsInfixExpression-vLeftOperand-SimpleName_javaouille-normal-b11695.csv" AS csvLine
MATCH (n1:MyNodeType1) USING INDEX n1:MyNodeType1(uid)
WHERE n1.uid = TOINT(csvLine.uidFather)
MATCH (n2:MyNodeType2) USING INDEX n2:MyNodeType2(uid)
WHERE n2.uid = TOINT(csvLine.uidSon)
MERGE (n1)-[:vOperandLink]-(n2);
In addition, if it is OK to store the uid values as strings, you can remove the uses of TOINT(). This will speed things up to some extent.
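For example, if the uid properties and the CSV columns are both kept as strings, the hinted query above becomes (a sketch; nothing else changes):

LOAD CSV WITH HEADERS FROM "file:////tmp/RelsInfixExpression-vLeftOperand-SimpleName_javaouille-normal-b11695.csv" AS csvLine
MATCH (n1:MyNodeType1) USING INDEX n1:MyNodeType1(uid)
WHERE n1.uid = csvLine.uidFather
MATCH (n2:MyNodeType2) USING INDEX n2:MyNodeType2(uid)
WHERE n2.uid = csvLine.uidSon
MERGE (n1)-[:vOperandLink]-(n2);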
I have several separate, independent structures in the database. I need to back up each of these structures separately rather than doing a full backup of everything.
I am interested in whether there is a way to back up some specific part of the graph. I checked what backup strategies there are in the Neo4j documentation. There are incremental backups and full backups, but I could not find a way to extract and back up only some part of the graph, or perhaps some independent graph structure in the database.
Ideally I would define a Cypher query and get the result backed up. For example, in most relational databases it is possible to extract/back up a separate table or dataset (depending on the database). That is something I am looking to do in Neo4j too: define a node label and then do a backup, or select by some other criteria.
You can use the experimental dump command along with the shell:
Example: dumping the User nodes to a users.cypher file that will contain all the Cypher statements for recreating the users later:
./bin/neo4j-shell -c 'dump MATCH (n:User) RETURN n;' > users.cypher
Related info in the documentation : http://neo4j.com/docs/stable/shell-commands.html#_dumping_the_database_or_cypher_statement_results
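To restore such a partial dump later, the generated Cypher file can be replayed with the shell against the target database, e.g. (a sketch; the -path value is a placeholder for your database directory):

./bin/neo4j-shell -path /path/to/target/db -file users.cypher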
I am importing around 12 million nodes and 13 million relationships.
First I used the CSV import with periodic commit 50000 and divided the data into different chunks, but it's still taking too much time.
Then I saw the batch insertion method. But for the batch insertion method I would have to create new data sets in an Excel sheet.
Basically I am importing the data from SqlServer: first I save the data into csv, then import it into my neo4j.
Also, I am using the Neo4j community edition. I changed the configuration properties according to everything I found on Stack Overflow. But still, with periodic commit 50K the import initially goes fast, and then after about 1 million rows it takes too much time.
Is there any way to import this data directly from SQL in a short span of time, since Neo4j is famous for working fast with big data? Any suggestions or help?
Here is the LOAD CSV used (with an index on :Number(num)):
USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv" AS csvLine FIELDTERMINATOR ';'
MERGE (Numbers:Number {num: csvLine.Numbers}) RETURN *;

USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv" AS csvLine FIELDTERMINATOR ';'
MERGE (TermNum:Number {num: csvLine.TermNum})
MERGE (OrigNum:Number {num: csvLine.OrigNum})
MERGE (OrigNum)-[r:CALLS]->(TermNum) RETURN *;
How long is it taking?
To give you a reference, my db is about 4m nodes, 650,000 unique relationships, ~10m-15m properties (not as large, but should provide an idea). It takes me less than 10 minutes to load in the nodes file + set multiple labels, and then load in the relationships file + set the relationships (all via LOAD CSV). This is also being done on a souped-up computer, but if yours is taking hours, I would make some tweaks.
My suggestions are as follows:
Are you intentionally returning the nodes after the MERGE? I can't imagine you are doing anything with it, but either way, consider removing the RETURN *. With RETURN *, you're returning all nodes, relationships, and paths found in the query and that's bound to slow things down. (http://neo4j.com/docs/stable/query-return.html#return-return-all-elements)
Is the "num" field meant to be unique? If so, consider adding the following constraints (NOTE: this will also create the index, so no need to create it separately). I think this might speed up the MERGE (I'm not sure on that), though see next point.
CREATE CONSTRAINT ON (Numbers:Number) ASSERT Numbers.num IS UNIQUE;
If the num field is unique AND this is a brand new database (i.e. you're starting from scratch when you run this script), then call CREATE to create the nodes, rather than MERGE (for the creation of the nodes only).
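As a sketch of that variant, combining the first two points (same file as above, RETURN * dropped, MERGE replaced by CREATE; only valid on an empty database with unique num values):

USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv" AS csvLine FIELDTERMINATOR ';'
CREATE (:Number {num: csvLine.Numbers});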
As was already mentioned by Christophe, you should definitely increase the heap size to around 4g.
Let us know how it goes!
EDIT 1
I have not been able to find much relevant information on memory/performance tuning for the Windows version. What I have found leaves me with a couple of questions, and is potentially outdated.
This is potentially outdated, but provides some background on some of the different settings and the differences between Windows and Linux.
http://blog.bruggen.com/2014/02/some-neo4j-import-tweaks-what-and-where.html
Those differences between Windows & Linux have themselves changed from one version to the next (as demonstrated with the following links)
Cypher MATCH query speed,
https://stackoverflow.com/a/29055966/4471711
Michael's response above seems to indicate that if you're NOT running a Java application with Neo4j, you don't need to worry about the heap (-Xmx). However, that doesn't seem right to me given the other information I saw, but perhaps all of that other info predates 2.2.
I have also been through this.
http://neo4j.com/docs/stable/configuration.html
So, what I have done is set both heap (-Xmx in the neo4j.vmoptions) and the pagecache to 32g.
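As a rough sketch of what that looks like (file names and setting keys vary between Neo4j versions and installs, so treat these as assumptions to check against your own configuration):

# heap, in neo4j.vmoptions (name and location depend on the version/installer)
-Xmx32g

# page cache, in conf/neo4j.properties (2.2/2.3-era key; newer versions use a different setting name)
dbms.pagecache.memory=32g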
Can you modify your heap settings to 4096MB?
Also, in the second LOAD CSV, are the numbers used in the first two MERGE clauses already in the database? If so, use MATCH instead.
I would also commit at a level of 10000.
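Putting those points together, the second LOAD CSV from the question might become something like this (a sketch; the MATCH variant is only correct if both numbers really are already in the database):

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv" AS csvLine FIELDTERMINATOR ';'
MATCH (TermNum:Number {num: csvLine.TermNum})
MATCH (OrigNum:Number {num: csvLine.OrigNum})
MERGE (OrigNum)-[:CALLS]->(TermNum);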
I have 100M records in Oracle and am trying to import them all into Neo4j with Talend. My question is: since the 100M records are updated every day, how can I make sure Talend only imports records which do not already exist in the Neo4j database? In other words, Talend should only import the updated records.
For example, suppose Neo4j contains 38890, 38891 and 38892 right now. In Oracle, the updated data is 38890, 38891, 38892, 38893. The expected result is that only 38893 will be imported.
The data is very large, so it does not seem very efficient to import all of it into Neo4j every day and then delete the duplicates. Could anyone help me out please? Thanks in advance.
You should do 2 loads: one initial FULL load, just like you do it now, and another one for the daily incremental loads.
Check your primary keys and find a way to write a SELECT query which will return your new/modified rows. You need another query which will show you which rows have been deleted/modified, as you need to remove these rows before adding the new/modified rows into your db.
To run this automatically, right-click on your job and select "Export Job". It will build your job into a Java JAR file, with .sh and .bat launchers. You can then use the Windows scheduler to execute it daily, or use cron to execute it daily if you happen to have a Linux server.
You certainly have an update timestamp on your tables in Oracle, so I would use that to select only the data that was updated since the last import, which would be much less data, e.g. 1-5M rows.
For those entries you can have a unique constraint and then use Cypher with MERGE on the entries, which is a get-or-create.
Make sure to use parameters when updating the data against the embedded or server APIs:
FOREACH (p IN {people} |
  MERGE (person:Person {name: p.name})
  ON CREATE SET person.age = p.age, ...
)
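For reference, the same get-or-create can also be written with UNWIND, which is often easier to read and profile (a sketch, assuming {people} is a list of maps with name and age keys):

UNWIND {people} AS p
MERGE (person:Person {name: p.name})
ON CREATE SET person.age = p.age;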