I am importing several TB of CSV data into Neo4j for a project I have been working on. I have enough fast storage for the estimated 6.6TiB, but the machine has only 32GB of memory, and the import tool is suggesting 203GB to complete the import.
When I run the import, I see the following (I assume it exited because it ran out of memory). Is there any way I can import this large dataset with the limited amount of memory I have? Or, failing that, with the maximum ~128GB that this machine's motherboard can support?
Available resources:
Total machine memory: 30.73GiB
Free machine memory: 14.92GiB
Max heap memory : 6.828GiB
Processors: 16
Configured max memory: 21.51GiB
High-IO: true
WARNING: estimated number of nodes 37583174424 may exceed capacity 34359738367 of selected record format
WARNING: 14.62GiB memory may not be sufficient to complete this import. Suggested memory distribution is:
heap size: 5.026GiB
minimum free and available memory excluding heap size: 202.6GiB
Import starting 2022-10-08 19:01:43.942+0000
Estimated number of nodes: 15.14 G
Estimated number of node properties: 97.72 G
Estimated number of relationships: 37.58 G
Estimated number of relationship properties: 0.00
Estimated disk space usage: 6.598TiB
Estimated required memory usage: 202.6GiB
(1/4) Node import 2022-10-08 19:01:43.953+0000
Estimated number of nodes: 15.14 G
Estimated disk space usage: 5.436TiB
Estimated required memory usage: 202.6GiB
.......... .......... .......... .......... .......... 5% ∆1h 38m 2s 867ms
neo4j#79d2b0538617:~/import$
TL;DR: Use periodic commit, or transaction batching.
If you're trying to follow the Operations Manual: Neo4j Admin Import, and your csv matches the movies.csv in that example, I would suggest instead doing a more manual USING PERIODIC COMMIT LOAD CSV...:
Stop the db.
Put your csv at neo4j/import/myfile.csv.
If you're using Desktop: Project > DB > click the ... on the right > Open Folder.
Add the APOC plugin.
Start the DB.
Next, open a browser instance, run the following (adjust for your data), and leave it until tomorrow:
USING PERIODIC COMMIT LOAD CSV FROM 'file:///myfile.csv' AS line
WITH SPLIT(line[3], ';') AS nodeLabels, {
  id: line[0],
  title: line[1],
  year: toInteger(line[2])
} AS nodeProps
CALL apoc.create.node(nodeLabels, {}) YIELD node
SET node = nodeProps
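Side note: if the database is Neo4j 5.x, USING PERIODIC COMMIT has been removed; the rough equivalent is CALL { ... } IN TRANSACTIONS. A sketch of the same import (it has to run as an implicit/auto-commit transaction - prefix with :auto in Browser if needed):
LOAD CSV FROM 'file:///myfile.csv' AS line
CALL {
  WITH line
  WITH SPLIT(line[3], ';') AS nodeLabels, {
    id: line[0],
    title: line[1],
    year: toInteger(line[2])
  } AS nodeProps
  CALL apoc.create.node(nodeLabels, {}) YIELD node
  SET node = nodeProps
} IN TRANSACTIONS OF 10000 ROWS;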
Note: There are many ways to solve this problem, depending on your source data and the model you wish to create. This solution is only meant to give you a handful of tools to help you get around the memory limit. If it is a simple CSV, and you don't care about what labels the nodes get initially, and you have headers, you can skip the complex APOC, and probably just do something like the following:
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///myfile.csv' AS line
CREATE (a :ImportedNode)
SET a = line
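Later, if you do want real labels on those generically imported nodes, you can add them in a second pass. A sketch only, assuming APOC is installed and a hypothetical 'labels' property holding ';'-separated label names:
CALL apoc.periodic.iterate(
  "MATCH (a:ImportedNode) RETURN a",
  "CALL apoc.create.addLabels(a, split(a.labels, ';')) YIELD node RETURN node",
  {batchSize: 10000, parallel: false});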
File for Each Label
The original asker mentioned having a separate csv for each label. In such cases it may be helpful to have one big command that can handle all of it, rather than manually stepping through each file.
Assuming two label-types, each with a unique 'id' property, and one with a 'parent_id' referencing the other label...
UNWIND [
  { file: 'country.csv', label: 'Country'},
  { file: 'city.csv', label: 'City'}
] AS importFile
LOAD CSV WITH HEADERS FROM 'file:///' + importFile.file AS line
CALL apoc.merge.node([importFile.label], {id: line.id}) YIELD node
SET node = line
;
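Note that USING PERIODIC COMMIT only works when LOAD CSV is the very first clause, so it can't be combined with the UNWIND above; the statement above runs as one big transaction. If the files are large enough that this blows the heap, one possible workaround (a sketch only, assuming APOC's apoc.load.csv and apoc.periodic.iterate are available and apoc.import.file.enabled=true is set) is to let APOC batch the commits instead:
CALL apoc.periodic.iterate(
  "UNWIND [
     { file: 'country.csv', label: 'Country'},
     { file: 'city.csv', label: 'City'}
   ] AS importFile
   CALL apoc.load.csv('file:///' + importFile.file, {header: true}) YIELD map AS line
   RETURN importFile.label AS label, line",
  "CALL apoc.merge.node([label], {id: line.id}) YIELD node
   SET node = line",
  {batchSize: 10000, parallel: false});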
// then build the relationships
MATCH (city :City)
WHERE city.parent_id IS NOT NULL
MATCH (country :Country {id: city.parent_id})
MERGE (city)-[:IN]->(country)
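Either way, those MERGEs and the relationship MATCH will crawl without indexes, so it's worth creating uniqueness constraints before loading. A sketch using the Neo4j 4.4+/5.x syntax and the labels from the example above (older versions use ON ... ASSERT instead of FOR ... REQUIRE; the constraint names are arbitrary):
CREATE CONSTRAINT country_id IF NOT EXISTS
FOR (c:Country) REQUIRE c.id IS UNIQUE;

CREATE CONSTRAINT city_id IF NOT EXISTS
FOR (c:City) REQUIRE c.id IS UNIQUE;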
Related
I am running a batch merge operation in Neo4j, but it keeps failing. I increased the heap size and heap limit, but it seems like my operation is perhaps not supported, or there's a more appropriate method of doing this.
I have 10k of such MERGE statements
var transaction = `
MERGE (n0: Account {sfRecordId:$Id0})
ON CREATE SET ...
ON MATCH SET ...
MERGE (n1: Account {sfRecordId:$Id1})
ON CREATE SET ...
ON MATCH SET ...
MERGE ... //10k
`;
// then send to Neo4j via the JavaScript driver:
tx.run(transaction, bindParams)
My heap size is set to
server.memory.heap.initial_size=2G
server.memory.heap.max_size=3G
I have tried sending them all at once, I have tried batching them into batches of 100, 500, 1, and 10,000.
None of these gets all 10k inserted.
The first 500-4k (seemingly at random) get in, but then the server crashes with an out-of-memory error.
Generating a very large statement to bulk insert data is not recommended. A good way of inserting such bulk data is to prepare a list of objects, and use UNWIND to iterate through it:
UNWIND $data as item
MERGE (n:Account {sfRecordId:item.Id})
ON CREATE SET ...
ON MATCH SET ...
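If the list itself is big, you can also let the server commit in batches instead of holding everything in one transaction. A sketch for Neo4j 4.4+ (it must run as an auto-commit/implicit transaction, e.g. session.run in the JavaScript driver rather than inside an explicit tx; the += is just a placeholder for whatever you SET):
UNWIND $data AS item
CALL {
  WITH item
  MERGE (n:Account {sfRecordId: item.Id})
    ON CREATE SET n += item
    ON MATCH SET n += item
} IN TRANSACTIONS OF 5000 ROWS;
With this, a failure part-way through only loses the current batch, and the heap only ever holds a few thousand rows of pending work at a time.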
I have a Neo4J database with the following properties:
Array Store 8.00 KiB
Logical Log 16 B
Node Store 174.54 MiB
Property Store 477.08 MiB
Relationship Store 3.99 GiB
String Store Size 174.34 MiB
Total Store Size 5.41 GiB
There are 12M nodes and 125M relationships.
So you could say this is a pretty large database.
My OS is Windows 10 64-bit, running on an Intel i7-4500U CPU @ 1.80GHz with 8GB of RAM.
This isn't a complete powerhouse, but it's a decent machine and in theory the total store could even fit in RAM.
However when I run a very simple query (using the Neo4j Browser)
MATCH (n {title:"A clockwork orange"}) RETURN n;
I get a result:
Returned 1 row in 17445 ms.
I also used a POST request with the same query to http://localhost:7474/db/data/cypher; this took 19 seconds.
Something like this, however:
http://localhost:7474/db/data/node/15000
is executed in 23ms...
And I can confirm there is an index on title:
Indexes
ON :Page(title) ONLINE
So, does anyone have ideas on why this might be running so slow?
Thanks!
This has to scan all nodes in the db - if you re-run your query using n:Page instead of just n, it'll use the index on those nodes and you'll get better results.
To expand this a bit more - INDEX ON :Page(title) is only for nodes with a :Page label, and in order to take advantage of that index your MATCH() needs to specify that label in its search.
If a MATCH() is specified without a label, the query engine has no "clue" what you're looking for so it has to do a full db scan in order to find all the nodes with a title property and check its value.
That's why
MATCH (n {title:"A clockwork orange"}) RETURN n;
is taking so long - it has to scan the entire db.
If you tell the MATCH() you're looking for a node with a :Page label and a title property -
MATCH (n:Page {title:"A clockwork orange"}) RETURN n;
the query engine knows you're looking for nodes with that label, and it also knows there's an index on that label and property it can use - which means it can perform your search with the performance you're looking for.
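If you want to see which plan you're getting, prefix the query with PROFILE; the labelled version should show an index seek rather than a full scan (operator names vary slightly across Neo4j versions):
PROFILE MATCH (n:Page {title:"A clockwork orange"}) RETURN n;
// expect a NodeIndexSeek on :Page(title)

PROFILE MATCH (n {title:"A clockwork orange"}) RETURN n;
// expect AllNodesScan + Filter, i.e. the full-database scan described above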
I'm new to Neo4J, and I want to try it on some data I've exported from MySQL. I've got the community edition running with neo4j console, and I'm entering commands using the neo4j-shell command line client.
I have 2 CSV files, that I use to create 2 types of node, as follows:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/tmp/updates.csv" AS row
CREATE (:Update {update_id: row.id, update_type: row.update_type, customer_name: row.customer_name, .... });
CREATE INDEX ON :Update(update_id);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/tmp/facts.csv" AS row
CREATE (:Fact {update_id: row.update_id, status: row.status, ..... });
CREATE INDEX ON :Fact(update_id);
This gives me approx 650,000 Update nodes, and 21,000,000 Fact nodes.
Once the indexes are online, I try to create relationships between the nodes, as follows:
MATCH (a:Update)
WITH a
MATCH (b:Fact{update_id:a.update_id})
CREATE (b)-[:FROM]->(a)
This fails with an OutOfMemoryError. I believe this is because Neo4j does not commit the transaction until it completes, keeping it all in memory.
What can I do to prevent this? I have read about USING PERIODIC COMMIT but it appears this is only useful when reading the CSV, as it doesn't work in my case:
neo4j-sh (?)$ USING PERIODIC COMMIT
> MATCH (a:Update)
> WITH a
> MATCH (b:Fact{update_id:a.update_id})
> CREATE (b)-[:FROM]->(a);
QueryExecutionKernelException: Invalid input 'M': expected whitespace, comment, an integer or LoadCSVQuery (line 2, column 1 (offset: 22))
"MATCH (a:Update)"
^
Is it possible to create relationships in this way, between large numbers of existing nodes, or do I need to take a different approach?
The out-of-memory exception is expected, as the query tries to commit everything at once, and since you didn't provide them I assume the Java heap settings are at the default (512m).
You can, however, batch the process with a kind of pagination; I would also prefer to use MERGE rather than CREATE in this case:
MATCH (a:Update)
WITH a
SKIP 0
LIMIT 50000
MATCH (b:Fact{update_id:a.update_id})
MERGE (b)-[:FROM]->(a)
Modify SKIP and LIMIT after each batch until you reach the 650k Update nodes.
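Alternatively, if you can install the APOC plugin, apoc.periodic.iterate does the batching for you, so you don't have to edit SKIP/LIMIT by hand. A sketch (the batch size is arbitrary):
CALL apoc.periodic.iterate(
  "MATCH (a:Update) RETURN a",
  "MATCH (b:Fact {update_id: a.update_id}) MERGE (b)-[:FROM]->(a)",
  {batchSize: 10000, parallel: false});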
I have a performance issue with bulk insert into neo4j.
I have a csv file with 400k rows, which produces about 3.5 million rows, and I use the LOAD CSV command with the latest version of Neo4j.
I've noticed that when I use a CREATE statement, the load takes about 4 minutes, and without any indexes at all, about 3.5 minutes.
My first question, is whether this is the normal rate of nodes/ min.
Now, my real problem, is that I need to use merge, for data integrity reasons, and when I use it, it can take even 24 hours, together with indexes.
So 2 additional questions will be:
Is LOAD CSV recommended for best-performance loads?
And also:
What can I do about this performance issue?
EDIT:
here is the query:
LOAD CSV WITH HEADERS FROM 'file:///import.csv' AS line FIELDTERMINATOR '|'
MERGE (session :Session { session:line.session })
MERGE (hit :Hit { key:line.key,date_time:line.date_time,session:line.session })
MERGE (user :User { id:line.user_id })
MERGE (session2 :Session2 { session2:line.session2 })
MERGE (country :Country{ name:line.country})
MERGE (tv :TV { name:tv.Model })
MERGE (transfer_protocol :Protocol { name:line.transfer_protocol })
MERGE (os :OS { name:line.os_name ,version:line.os_version, row_key:line.os_name+line.os_version})
Sample: session_guid|hit_key_guid|useridguid|session2_guid|PANASONIC|TCP|ANDROID|5.0
The session, user, session2, country, tv, transfer_protocol and os nodes have unique constraints, and hit has an index.
session1 and session2 can have many hits (1 to 100, average 5).
hit_key_guid is different for each csv line
It's running really slowly on a pretty strong machine - each 1000 rows can take up to 10 seconds.
I also checked with the profiler, and there is no "Eager" in the plan.
thanks
Lior
You should share your data model, your indexes, your LOAD CSV query and also the profile output. Are you using PERIODIC commit?
Make sure that you don't run into the Eager issue, see here:
http://neo4j.com/developer/guide-import-csv/#_load_csv_for_medium_sized_datasets
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
In general, for a dataset your size, LOAD CSV is OK; from about 10M rows I'd probably switch to the import tool.
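Concretely, the pattern those links recommend is a periodic commit plus one pass per node label, then separate passes for the relationships, so each MERGE can use its own index and no Eager pipeline sneaks in. A sketch using the column names from the question (the relationship type here is made up for the example):
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///import.csv' AS line FIELDTERMINATOR '|'
MERGE (:Session { session: line.session });

// ...one pass like this per label (User, Country, TV, ...), then the relationships:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///import.csv' AS line FIELDTERMINATOR '|'
MATCH (session:Session { session: line.session })
MATCH (user:User { id: line.user_id })
MERGE (user)-[:HAS_SESSION]->(session);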
It turned out that the server-side code didn't create the indexes properly; once they were created, the load completed with good performance.
When I run a script that tries to batch merge all nodes a certain types, I am getting some weird performance results.
When merging 2 collections of nodes (~42k) and (~26k), the performance is nice and fast.
But when I merge (~42k) and (5), performance DRAMATICALLY degrades. I'm batching the ParentNodes (so the ~42k are split up in batches of 500). Why does performance drop when I'm, essentially, merging fewer nodes (the batch size is the same, but the source set is large and the target set is small)?
Relation Query:
MATCH (s:ContactPlayer)
WHERE has(s.ContactPrefixTypeId)
WITH collect(s) AS allP
WITH allP[7000..7500] as rangedP
FOREACH (parent in rangedP |
MERGE (child:ContactPrefixType
{ContactPrefixTypeId:parent.ContactPrefixTypeId}
)
MERGE (child)-[r:CONTACTPLAYER]->(parent)
SET r.ContactPlayerId = parent.ContactPlayerId ,
r.ContactPrefixTypeId = child.ContactPrefixTypeId )
Performance Results:
Process Starting
Starting to insert Contact items
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++]
Total time for 42149 Contact items: 19176.87ms
Average time per batch (500): 213.4ms
Longest batch time: 663ms
Starting to insert ContactPlayer items
[++++++++++++++++++++++++++++++++++++++++++++++++++++++++]
Total time for 27970 ContactPlayer items: 9419.2106ms
Average time per batch (500): 167.75ms
Longest batch time: 689ms
Starting to relate Contact to ContactPlayer
[++++++++++++++++++++++++++++++++++++++++++++++++++++++++]
Total time taken to relate Contact to ContactPlayer: 7907.4877ms
Average time per batch (500): 141.151517857143ms
Longest batch time: 883.0918ms for Batch number: 0
Starting to insert ContactPrefixType items
[+]
Total time for 5 ContactPrefixType items: 22.0737ms
Average time per batch (500): 22ms
Longest batch time: 22ms
Already inserted data for Contact.
Starting to relate ContactPrefixType to Contact
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++]
Total time taken to relate ContactPrefixType to Contact: 376540.8309ms
Average time per batch (500): 4429.78643647059ms
Longest batch time: 14263.1843ms for Batch number: 63
So far, the best I could come up with is the following (and it's a hack, specific to my environment):
If / Else condition:
If childrenNodes.count() < 200 -> assume they are type identifiers for the parent... i.e. ContactPrefixType
Else assume it is a matrix for relating multiple item types together (i.e. ContactAddress)
If childNodes < 200
MATCH (parent:{parentLabel}),
(child:{childLabel} {{childLabelIdProperty}:parent.{parentRelationProperty}})
CREATE child-[r:{relationshipLabel}]->parent
This takes about 3-5 seconds to complete per relationship type
Else
MATCH (child:{childLabel}),
      (parent:{parentLabel} {{parentPropertyField}: child.{childLabelIdProperty}})
WITH collect(parent) as parentCollection, child
WITH parentCollection[{batchStart}..{batchEnd}] as coll, child
FOREACH (parent in coll |
  CREATE child-[r:{relationshipLabel}]->parent )
I'm not sure this is the most efficient way of doing this, but after trying MANY different options, this seems to be the fastest.
Stats:
insert 225,018 nodes with 2,070,977 properties
create 464,606 relationships
Total: 331 seconds.
Because this is a straight import and I'm not dealing with updates yet, I assume that all the relationships are correct and don't need to worry about invalid data... however, I will try to set properties on the relationships so as to be able to perform cleanup later (i.e. store the parent and child Ids on the relationship as properties for later reference).
If anyone can improve on this, I would love it.
Can you pass the ids in as parameters rather than fetch them from the graph? The query could look like
MATCH (s:ContactPlayer {ContactPrefixTypeId:{cptid}})
MERGE (c:ContactPrefixType {ContactPrefixTypeId:{cptid}})
MERGE c-[:CONTACT_PLAYER]->s
If you use the REST API Cypher resource, I think the entity should look something like
{
"query":...,
"params": {
"cptid":id1
}
}
If you use the transactional endpoint, it should look something like this. You control transaction size by the number of statements in each call, and also by the number of calls before you commit. More here.
{
  "statements": [
    {
      "statement": ...,
      "parameters": {
        "cptid": id1
      }
    },
    {
      "statement": ...,
      "parameters": {
        "cptid": id2
      }
    }
  ]
}