Neo4j 10k MERGE statements fail - neo4j

I am running a batch merge operation in Neo4j, but it keeps on failing. I increased the heap size and heap limit but is seems like my operations is perhaps not supported or there's a more appropriate method of doing this.
I have 10k of such MERGE statements
var transaction = `
MERGE (n0: Account {sfRecordId:$Id0})
ON CREATE SET ...
ON MATCH SET ...
MERGE (n1: Account {sfRecordId:$Id1})
ON CREATE SET ...
ON MATCH SET ...
MERGE ... //10k
// then send to Neo4j via the Javascript driver:
tx.run(transaction, bindParams)
My heap size is set to
server.memory.heap.initial_size=2G
server.memory.heap.max_size=3G
I have tried sending them all at once, I have tried batching them into batches of 100, 500, 1, and 10,000.
None of these seems to get my 10k to be inserted.
The first random 500-4k get in, but then the server crashes with an out of memory error.

Generating a very large statement to bulk insert data is not recommended. A good way of inserting such bulk data is to prepare a list of objects, and use UNWIND to iterate through it:
UNWIND $data as item
MERGE (n:Account {sfRecordId:item.Id})
ON CREATE SET ...
ON MATCH SET ...

Related

Can I load in nodes and relationships from a csv file using 1 cypher command?

I have 2 csv files which I am trying to load into a Neo4j database using cypher: drivers.csv which holds every formula 1 driver and lap times.csv which stores every lap ever raced in F1.
I have managed to load in all of the nodes, although the lap times file is very large so it took quite a long time! I then tried to add relationships after, but there is so many that needs to be added that I gave up on it waiting (it was taking multiple days and still had not loaded in fully).
I’m pretty sure there is a way to load in the nodes and relationships at the same time, which would allow me to use periodic commit for the relationships which I cannot do right now. Essentially I just need to combine the 2 commands into one and after some attempts I can’t seem to work out how to do it?
// load in the lap_times.csv, changing the variable names - about half million nodes (takes 3-4 days)
PERIODIC COMMIT 25000
LOAD CSV WITH HEADERS from 'file:///lap_times.csv'
AS row
MERGE (lt: lapTimes {raceId: row.raceId, driverId: row.driverId, lap: row.lap, position: row.position, time: row.time, milliseconds: row.milliseconds})
RETURN lt;
// add a relationship between laptimes, drivers and races - takes 3-4 days
MATCH (lt:lapTimes),(d:Driver),(r:race)
WHERE lt.raceId = r.raceId AND lt.driverId = d.driverId
MERGE (d)-[rel8:LAPPING_AT]->(lt)
MERGE (r)-[rel9:TIMED_LAP]->(lt)
RETURN type(rel8), type(rel9)
Thanks in advance for any help!
You should review the documentation for indexes here:
https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/
Basically, indexes, once created, allow quick lookups of nodes of a given label, for the given property or properties. If you DON'T have an index and you do a MATCH or MERGE of a node, then for every row of that MATCH or MERGE, it has to do a label scan of all nodes of the given label and check all of their properties to find the nodes, and that becomes very expensive, especially when loading CSVs because those operations are likely happening for each row in the CSV.
For your :lapTimes nodes (though we would recommend you use singular labels in most cases), if there are none of them in your graph to start with, then a CREATE instead of a MERGE is fine. You may want a composite index on :lapTimes(raceId, driverId, lap), since that should uniquely identify the node, if you need to look it up later. Using CREATE instead of MERGE here should process much much faster.
Your second query should be MATCHing on :lapTimes nodes (label scan), and from each doing an index lookup on the :race and :driver nodes, so indexes are key here for performance.
You need indexes on: :race(raceId) and :Driver(driverId).
MATCH (lt:lapTimes)
WITH lt, lt.raceId as raceId, lt.driverId as driverId
MATCH (d:Driver), (r:race)
WHERE r.raceId = raceId AND d.driverId = driverId
MERGE (d)-[:LAPPING_AT]->(lt)
MERGE (r)-[:TIMED_LAP]->(lt)
You might consider CREATE instead of MERGE for the relationships, if you know there are no duplicate entries.
I removed your RETURN because returning the types isn't useful information.
Also, consider using consistent cases for your node labels, and that you are using the same case between the labels in your graph and the indexes you create.
Also, you would probably want to batch these changes instead of trying to process them all at once.
If you install APOC Procedures you can make use of apoc.periodic.iterate(), which can be used to batch changes, which will be faster and easier on your heap. You will still need indexes first.
CALL apoc.periodic.iterate("
MATCH (lt:lapTimes)
WITH lt, lt.raceId as raceId, lt.driverId as driverId
MATCH (d:Driver), (r:race)
WHERE r.raceId = raceId AND d.driverId = driverId
RETURN lt, d, ir",
"MERGE (d)-[:LAPPING_AT]->(lt)
MERGE (r)-[:TIMED_LAP]->(lt)", {}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
Single CSV load
If you want to handle everything all at once in a single CSV load, you can do that, but again you will need indexes first. Here's what you'll need at a minimum:
CREATE INDEX ON :Driver(driverId);
CREATE INDEX ON :Race(raceId);
After those are created, you can use this, assuming you are starting from scratch (I fixed the case of your labels and made them singular:
USING PERIODIC COMMIT 25000
LOAD CSV WITH HEADERS from 'file:///lap_times.csv' AS row
MERGE (d:Driver {driverId:row.driverId})
MERGE (r:Race {raceId:row.raceId})
CREATE (lt:LapTime {raceId: row.raceId, driverId: row.driverId, lap: row.lap, position: row.position, time: row.time, milliseconds: row.milliseconds})
CREATE (d)-[:LAPPING_AT]->(lt)
CREATE (r)-[:TIMED_LAP]->(lt)

Execute parallel create/merge using only Cypher

I have a cypher script which loads the content from a CSV and then tries to load it into the graph. What it basically looks like is something similar to that:
LOAD CSV WITH HEADERS
FROM 'file:///data.csv'
AS row
MERGE (o:Order {Id:toInteger(row.tr_ID)})
ON CREATE SET
o.Created = toInteger(row.tr_Created)
// read credit card and create a relation to an order (if applicable)
FOREACH (n IN CASE WHEN row.cc_CrdId = 'NULL' THEN [] ELSE [1] END |
MERGE (cc:CreditCard {Id:row.cc_CrdId})
ON CREATE SET
cc.Created = toInteger(row.cc_CrdCreated)
MERGE (o)-[:WITH_CC]->(cc)
)
This script reads the file row by row. But I was wondering if it is somehow possible to execute the logic for multiple rows in parallel.
I saw this article for parallel cypher but don't really understand how I could use it. I don't know if some kind of lock needs to be implemented.

how to print logs and measure execution time

Currently, I'm trying to import a CSV file that contains around 2 million lines. Each line corresponds to a node. I'm using neo4j browser. note: I also tried neo4j import tool but it is also somehow working slower.
I tried to run the script with standard cypher query like
USING PERIODIC COMMIT 500 LOAD CSV FROM 'file:///data.csv' AS r
WITH toInteger(r[0]) AS ID, toInteger(r[1]) AS national_id, toInteger(r[2]) as passport_no, toInteger(r[3]) as status, toInteger(r[4]) as activation_date
MERGE (p:Customer {ID: ID}) SET p.national_id = national_id, p.passport_no = passport_no, p.status = status, p.activation_date = activation_date
This works very slow.
Later I tried.
CALL apoc.periodic.iterate('CALL apoc.load.csv(\'file:/data.csv\') yield list as r return r','WITH toInteger(r[0]) AS ID, toInteger(r[1]) AS national_id, toInteger(r[2]) as passport_no, toInteger(r[3]) as status, toInteger(r[4]) as activation_date MERGE (p:Customer {ID: ID}) SET p.national_id = national_id, p.passport_no = passport_no, p.status = status, p.activation_date = activation_date',
{batchSize:10000, iterateList:true, parallel:true});
This one seems like working faster since the parallel option is true. BUT I want to measure the execution time of one batch.
How could I print something on the neo4j browser?
How could I measure execution time for one batch?
Your first query uses a batch size of 500, and your second one uses a batch size that is 20 times larger. You need to use the same batch size to do a valid comparison.
Since your query requires a large number of batches (at least 200), dividing the total time by the number of batches should be a reasonable approximation of the average time per batch.
Have you created an index on :Customer(ID)? That should help to speed up your queries.
You should consider whether you should use the ON CREATE expression with your MERGE clause. Right now, the SET clause is always executed, even if the node already exists.
The key thing is adding "unique constraint" before adding any data. This makes the process a lot faster. I see that from https://neo4j.com/docs/getting-started/current/cypher-intro/load-csv/
Now a script like this
CREATE CONSTRAINT ON (n:Movie) ASSERT n.no IS UNIQUE;
USING PERIODIC COMMIT 10000
LOAD CSV FROM 'file:///data/MovieData.csv' AS r
WITH r[0] AS no, toInteger(r[1]) AS status, toInteger(r[2]) as activation_date
MERGE (p:Movie {no: no})
ON CREATE SET p.status = status, p.activation_date = activation_date
adding 1 million nodes in 1 minute. Before it was more than 2-3 days.

"Unable to rollback transaction" error in Neo4j

I am trying to run the following query to create my nodes and relationships from a .csv file that I have:
USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM 'file:///LoanStats3bEDITED.csv' AS line
//USING PERIODIC COMMIT 1000 makes sure we don't get a memory error
//creating the nodes with their properties
//member node
CREATE (member:Person{member_id:TOINT(line.member_id)})
//Personal information node
CREATE (personalInformation:PersonalInformation{addr_state:line.addr_state})
//recordHistory node
CREATE (recordHistory:RecordHistory{delinq_2yrs:TOFLOAT(line.delinq_2yrs),earliest_cr_line:line.earliest_cr_line,inq_last_6mths:TOFLOAT(line.inq_last_6mths),collections_12_mths_ex_med:TOFLOAT(line.collections_12_mths_ex_med),delinq_amnt:TOFLOAT(line.delinq_amnt),percent_bc_gt_75:TOFLOAT(line.percent_bc_gt_75), pub_rec_bankruptcies:TOFLOAT(line.pub_rec_bankruptcies), tax_liens:TOFLOAT(line.tax_liens)})
//Loan node
CREATE (loan:Loan{funded_amnt:TOFLOAT(line.funded_amnt),term:line.term, int_rate:line.int_rate, installment:TOFLOAT(line.installment),purpose:line.purpose})
//Customer Finances node
CREATE (customerFinances:CustomerFinances{emp_length:line.emp_length,verification_status_joint:line.verification_status_joint,home_ownership:line.home_ownership, annual_inc:TOFLOAT(line.annual_inc), verification_status:line.verification_status,dti:TOFLOAT(line.dti), annual_inc_joint:TOFLOAT(line.annual_inc_joint),dti_joint:TOFLOAT(line.dti_joint)})
//Accounts node
CREATE (accounts:Accounts{revol_util:line.revol_util,tot_cur_bal:TOFLOAT(line.tot_cur_bal)})
//creating the relationships
CREATE UNIQUE (member)-[:FINANCIAL{issue_d:line.issue_d,loan_status:line.loan_status, application_type:line.application_type}]->(loan)
CREATE UNIQUE (customerFinances)<-[:FINANCIAL]-(member)
CREATE UNIQUE (accounts)<-[:FINANCIAL{open_acc:TOINT(line.open_acc),total_acc:TOFLOAT(line.total_acc)}]-(member)
CREATE UNIQUE (personalInformation)<-[:PERSONAL]-(member)
CREATE UNIQUE (recordHistory)<-[:HISTORY]-(member)
However, I keep getting the following error:
Unable to rollback transaction
What does this mean and how can I fix my query so it can be run successfully?
I am now getting the following error:
GC overhead limit exceeded
I think you are out of memory.
Solutions:
Use neo4j-import.batch
Split your querys.
Make constraints to speed up the querys.
Why do you need the create unique? You could just use create if your csv is clean, or use merge.
I think it could also be faster if you execute the query not in the browser but in shell.
Download more ram :-)
If you really need the uniqueness of relationships, replace create unique with merge.
Also, your repeated MERGE operation on FINANCIAL causes Cypher to materialize the whole result before each of the 3 operations with an Eager operator so that it doesn't run into endless loops.
That's why the periodic commit is not going into effect, causing the whole intermediate result to use too much memory.
Something else you could do is to use the APOC library and apoc.periodic.iterate instead for batching.
call apoc.periodic.iterate("
LOAD CSV WITH HEADERS FROM 'file:///LoanStats3bEDITED.csv' AS line RETURN line
","
//member node
CREATE (member:Person{member_id:TOINT(line.member_id)})
//Personal information node
CREATE (personalInformation:PersonalInformation{addr_state:line.addr_state})
//recordHistory node
CREATE (recordHistory:RecordHistory{delinq_2yrs:TOFLOAT(line.delinq_2yrs),earliest_cr_line:line.earliest_cr_line,inq_last_6mths:TOFLOAT(line.inq_last_6mths),collections_12_mths_ex_med:TOFLOAT(line.collections_12_mths_ex_med),delinq_amnt:TOFLOAT(line.delinq_amnt),percent_bc_gt_75:TOFLOAT(line.percent_bc_gt_75), pub_rec_bankruptcies:TOFLOAT(line.pub_rec_bankruptcies), tax_liens:TOFLOAT(line.tax_liens)})
//Loan node
CREATE (loan:Loan{funded_amnt:TOFLOAT(line.funded_amnt),term:line.term, int_rate:line.int_rate, installment:TOFLOAT(line.installment),purpose:line.purpose})
//Customer Finances node
CREATE (customerFinances:CustomerFinances{emp_length:line.emp_length,verification_status_joint:line.verification_status_joint,home_ownership:line.home_ownership, annual_inc:TOFLOAT(line.annual_inc), verification_status:line.verification_status,dti:TOFLOAT(line.dti), annual_inc_joint:TOFLOAT(line.annual_inc_joint),dti_joint:TOFLOAT(line.dti_joint)})
//Accounts node
CREATE (accounts:Accounts{revol_util:line.revol_util,tot_cur_bal:TOFLOAT(line.tot_cur_bal)})
//creating the relationships
MERGE (member)-[:FINANCIAL{issue_d:line.issue_d,loan_status:line.loan_status, application_type:line.application_type}]->(loan)
MERGE (customerFinances)<-[:FINANCIAL]-(member)
MERGE (accounts)<-[:FINANCIAL{open_acc:TOINT(line.open_acc),total_acc:TOFLOAT(line.total_acc)}]-(member)
MERGE (personalInformation)<-[:PERSONAL]-(member)
MERGE (recordHistory)<-[:HISTORY]-(member)
", {batchSize:1000, iterateList:true})

Out of memory when creating large number of relationships

I'm new to Neo4J, and I want to try it on some data I've exported from MySQL. I've got the community edition running with neo4j console, and I'm entering commands using the neo4j-shell command line client.
I have 2 CSV files, that I use to create 2 types of node, as follows:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/tmp/updates.csv" AS row
CREATE (:Update {update_id: row.id, update_type: row.update_type, customer_name: row.customer_name, .... });
CREATE INDEX ON :Update(update_id);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/tmp/facts.csv" AS row
CREATE (:Fact {update_id: row.update_id, status: row.status, ..... });
CREATE INDEX ON :Fact(update_id);
This gives me approx 650,000 Update nodes, and 21,000,000 Fact nodes.
Once the indexes are online, I try to create relationships between the nodes, as follows:
MATCH (a:Update)
WITH a
MATCH (b:Fact{update_id:a.update_id})
CREATE (b)-[:FROM]->(a)
This fails with an OutOfMemoryError. I believe this is because Neo4J does not commit the transaction until it completes, keeping it in memory.
What can I do to prevent this? I have read about USING PERIODIC COMMIT but it appears this is only useful when reading the CSV, as it doesn't work in my case:
neo4j-sh (?)$ USING PERIODIC COMMIT
> MATCH (a:Update)
> WITH a
> MATCH (b:Fact{update_id:a.update_id})
> CREATE (b)-[:FROM]->(a);
QueryExecutionKernelException: Invalid input 'M': expected whitespace, comment, an integer or LoadCSVQuery (line 2, column 1 (offset: 22))
"MATCH (a:Update)"
^
Is it possible to create relationships in this way, between large numbers of existing nodes, or do I need to take a different approach?
The Out of Memory Exception is normal as it will try to commit it all at once and as you didn't provide it, I assume java heap settings are set as default (512m).
You can however, batch the process with kind of pagination, only I would prefer to use MERGE rather than CREATE in this case :
MATCH (a:Update)
WITH a
SKIP 0
LIMIT 50000
MATCH (b:Fact{update_id:a.update_id})
MERGE (b)-[:FROM]->(a)
Modify SKIP and LIMIT after each batch until your reach 650k update nodes.

Resources