Neo4j: Best way to batch relate nodes using Cypher?

When I run a script that tries to batch merge all nodes of certain types, I am getting some weird performance results.
When merging two collections of nodes (~42k and ~26k), the performance is nice and fast.
But when I merge ~42k against 5, performance DRAMATICALLY degrades. I'm batching the parent nodes (so the ~42k are split up into batches of 500). Why does performance drop when I'm, essentially, merging fewer nodes (the batch size is the same, but the source of the batch set is large and the target set is small)?
Relation Query:
MATCH (s:ContactPlayer)
WHERE has(s.ContactPrefixTypeId)
WITH collect(s) AS allP
WITH allP[7000..7500] AS rangedP
FOREACH (parent IN rangedP |
  MERGE (child:ContactPrefixType {ContactPrefixTypeId: parent.ContactPrefixTypeId})
  MERGE (child)-[r:CONTACTPLAYER]->(parent)
  SET r.ContactPlayerId = parent.ContactPlayerId,
      r.ContactPrefixTypeId = child.ContactPrefixTypeId
)
Performance Results:
Process Starting
Starting to insert Contact items
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++]
Total time for 42149 Contact items: 19176.87ms
Average time per batch (500): 213.4ms
Longest batch time: 663ms
Starting to insert ContactPlayer items
[++++++++++++++++++++++++++++++++++++++++++++++++++++++++]
Total time for 27970 ContactPlayer items: 9419.2106ms
Average time per batch (500): 167.75ms
Longest batch time: 689ms
Starting to relate Contact to ContactPlayer
[++++++++++++++++++++++++++++++++++++++++++++++++++++++++]
Total time taken to relate Contact to ContactPlayer: 7907.4877ms
Average time per batch (500): 141.151517857143ms
Longest batch time: 883.0918ms for Batch number: 0
Starting to insert ContactPrefixType items
[+]
Total time for 5 ContactPrefixType items: 22.0737ms
Average time per batch (500): 22ms
Longest batch time: 22ms
Already inserted data for Contact.
Starting to relate ContactPrefixType to Contact
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++]
Total time taken to relate ContactPrefixType to Contact: 376540.8309ms
Average time per batch (500): 4429.78643647059ms
Longest batch time: 14263.1843ms for Batch number: 63

So far, the best I could come up with is the following (and it's a hack, specific to my environment):
If / Else condition:
If childNodes.count() < 200 -> assume they are type identifiers for the parent (i.e. ContactPrefixType)
Else -> assume it is a matrix for relating multiple item types together (i.e. ContactAddress)
If childNodes.count() < 200:
MATCH (parent:{parentLabel}),
      (child:{childLabel} {{childLabelIdProperty}: parent.{parentRelationProperty}})
CREATE (child)-[r:{relationshipLabel}]->(parent)
This takes about 3-5 seconds to complete per relationship type
Else
MATCH (child:{childLabel}),
      (parent:{parentLabel} {{parentPropertyField}: child.{childLabelIdProperty}})
WITH collect(parent) AS parentCollection, child
WITH parentCollection[{batchStart}..{batchEnd}] AS coll, child
FOREACH (parent IN coll |
  CREATE (child)-[r:{relationshipLabel}]->(parent) )
I'm not sure this is the most efficient way of doing this, but after trying MANY different options, this seems to be the fastest.
Stats:
insert 225,018 nodes with 2,070,977 properties
create 464,606 relationships
Total: 331 seconds.
Because this is a straight import and I'm not dealing with updates yet, I assume that all the relationships are correct and don't need to worry about invalid data. However, I will try to set properties on the relationship so that I can perform cleanup later (i.e. store the parent and child ids in the relationship as properties for later reference).
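As a minimal sketch of that idea (not something shown in the post; it assumes the relationships from the queries above already exist and just copies the ids onto them as a separate pass):
// Copy parent/child ids onto the relationship so orphaned or invalid
// relationships can be found and cleaned up later.
MATCH (child:ContactPrefixType)-[r:CONTACTPLAYER]->(parent:ContactPlayer)
SET r.ContactPrefixTypeId = child.ContactPrefixTypeId,
    r.ContactPlayerId = parent.ContactPlayerId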
If anyone can improve on this, I would love it.

Can you pass the ids in as parameters rather than fetch them from the graph? The query could look like
MATCH (s:ContactPlayer {ContactPrefixTypeId: {cptid}})
MERGE (c:ContactPrefixType {ContactPrefixTypeId: {cptid}})
MERGE (c)-[:CONTACT_PLAYER]->(s)
If you use the REST API Cypher resource, I think the entity should look something like
{
  "query": ...,
  "params": {
    "cptid": id1
  }
}
If you use the transactional endpoint, it should look something like this. You control transaction size by the number of statements in each call, and also by the number of calls before you commit. More here.
{
  "statements": [
    {
      "statement": ...,
      "parameters": {
        "cptid": id1
      }
    },
    {
      "statement": ...,
      "parameters": {
        "cptid": id2
      }
    }
  ]
}
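Not part of the original answer, but a related pattern worth sketching: instead of one statement per id, you can send a whole batch of ids as a single parameter and UNWIND it server-side (the parameter name ids and the batching details are assumptions here):
// One statement handles a whole batch of ids passed as a parameter.
UNWIND {ids} AS cptid
MATCH (s:ContactPlayer {ContactPrefixTypeId: cptid})
MERGE (c:ContactPrefixType {ContactPrefixTypeId: cptid})
MERGE (c)-[:CONTACT_PLAYER]->(s)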

Related

Correct order of operations in neo4j - LOAD, MERGE, MATCH, WITH, SET

I am loading simple csv data into Neo4j. The data is simple, as follows:
uniqueId compound value category
ACT12_M_609 mesulfen 21 carbon
ACT12_M_609 MNAF 23 carbon
ACT12_M_609 nifluridide 20 suphate
ACT12_M_609 sulfur 23 carbon
I am loading the data from the URL using the following query -
LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE( t: Transaction { transactionId: row.uniqueId })
MERGE(c:Compound {name: row.compound})
MERGE (t)-[r:CONTAINS]->(c)
ON CREATE SET c.category= row.category
ON CREATE SET r.price =row.value
Next I do an aggregation to count the total orders for a compound and set it as a property on the node, in the following way:
MATCH (c:Compound) <-[:CONTAINS]- (t:Transaction)
with c.name as name, count( distinct t.transactionId) as ord
set c.orders = ord
So far so good. I can accomplish what I want, but I have the following 2 questions:
How can I create the orders property for the Compound node in the first step itself? i.e. when I am loading the data, I would like to perform the aggregation straight away.
For a Compound node I am also setting the category property. Theoretically, it could also be modelled as category -contains-> compound by creating a Category node. But what advantage would I gain by doing that? I can already execute the queries and get the expected output without creating this additional node.
Thank you for your answer.
I don't think that's possible: LOAD CSV goes over one row at a time, so at row 1 it doesn't know how many more rows will follow.
I guess you could create virtual nodes and relationships, aggregate those and then use those to create the real nodes, but that would be way more complicated. Virtual Nodes/Rels
That depends on the questions/queries you want to ask.
A graph database is optimised for following relationships, so if you often do a query where the category is a criteria (e.g. MATCH (c: Category {category_id: 12})-[r]-(:Compound) ), it might be more performant to create a label for it.
If you just want to get the category in the results (e.g. RETURN compound.category), then it's fine as a property.
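If you do decide to promote the category to its own node, here is a minimal sketch of how the load might change, reusing the question's category -contains-> compound idea (the Category label is an assumption, not part of the original data model):
LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE (t:Transaction {transactionId: row.uniqueId})
MERGE (c:Compound {name: row.compound})
MERGE (cat:Category {name: row.category})
MERGE (t)-[:CONTAINS]->(c)
MERGE (cat)-[:CONTAINS]->(c)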

Can I load in nodes and relationships from a csv file using 1 cypher command?

I have 2 csv files which I am trying to load into a Neo4j database using Cypher: drivers.csv, which holds every Formula 1 driver, and lap_times.csv, which stores every lap ever raced in F1.
I have managed to load in all of the nodes, although the lap times file is very large so it took quite a long time! I then tried to add the relationships afterwards, but there are so many that need to be added that I gave up waiting (it was taking multiple days and still had not fully loaded).
I'm pretty sure there is a way to load in the nodes and relationships at the same time, which would allow me to use periodic commit for the relationships, which I cannot do right now. Essentially I just need to combine the 2 commands into one, but after some attempts I can't work out how to do it.
// load in the lap_times.csv, changing the variable names - about half a million nodes (takes 3-4 days)
USING PERIODIC COMMIT 25000
LOAD CSV WITH HEADERS from 'file:///lap_times.csv'
AS row
MERGE (lt:lapTimes {raceId: row.raceId, driverId: row.driverId, lap: row.lap, position: row.position, time: row.time, milliseconds: row.milliseconds})
RETURN lt;
// add a relationship between laptimes, drivers and races - takes 3-4 days
MATCH (lt:lapTimes),(d:Driver),(r:race)
WHERE lt.raceId = r.raceId AND lt.driverId = d.driverId
MERGE (d)-[rel8:LAPPING_AT]->(lt)
MERGE (r)-[rel9:TIMED_LAP]->(lt)
RETURN type(rel8), type(rel9)
Thanks in advance for any help!
You should review the documentation for indexes here:
https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/
Basically, indexes, once created, allow quick lookups of nodes of a given label by the given property or properties. If you DON'T have an index and you do a MATCH or MERGE of a node, then for every row it has to do a label scan of all nodes with the given label and check their properties to find the node, and that becomes very expensive, especially when loading CSVs, because those operations are likely happening for each row in the CSV.
For your :lapTimes nodes (though we would recommend you use singular labels in most cases), if there are none of them in your graph to start with, then a CREATE instead of a MERGE is fine. You may want a composite index on :lapTimes(raceId, driverId, lap), since that should uniquely identify the node, if you need to look it up later. Using CREATE instead of MERGE here should process much much faster.
Your second query should be MATCHing on :lapTimes nodes (label scan), and from each doing an index lookup on the :race and :driver nodes, so indexes are key here for performance.
You need indexes on: :race(raceId) and :Driver(driverId).
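For example, using the same older index syntax that appears later in this answer, plus the composite index suggested above (just a sketch; adjust the label case to whatever is actually in your graph):
CREATE INDEX ON :Driver(driverId);
CREATE INDEX ON :race(raceId);
CREATE INDEX ON :lapTimes(raceId, driverId, lap);
With those in place, the relationship query can be restructured like this: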
MATCH (lt:lapTimes)
WITH lt, lt.raceId as raceId, lt.driverId as driverId
MATCH (d:Driver), (r:race)
WHERE r.raceId = raceId AND d.driverId = driverId
MERGE (d)-[:LAPPING_AT]->(lt)
MERGE (r)-[:TIMED_LAP]->(lt)
You might consider CREATE instead of MERGE for the relationships, if you know there are no duplicate entries.
I removed your RETURN because returning the types isn't useful information.
Also, consider using consistent cases for your node labels, and that you are using the same case between the labels in your graph and the indexes you create.
Also, you would probably want to batch these changes instead of trying to process them all at once.
If you install APOC Procedures you can make use of apoc.periodic.iterate(), which can be used to batch changes, which will be faster and easier on your heap. You will still need indexes first.
CALL apoc.periodic.iterate("
MATCH (lt:lapTimes)
WITH lt, lt.raceId as raceId, lt.driverId as driverId
MATCH (d:Driver), (r:race)
WHERE r.raceId = raceId AND d.driverId = driverId
RETURN lt, d, r",
"MERGE (d)-[:LAPPING_AT]->(lt)
MERGE (r)-[:TIMED_LAP]->(lt)", {}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
Single CSV load
If you want to handle everything all at once in a single CSV load, you can do that, but again you will need indexes first. Here's what you'll need at a minimum:
CREATE INDEX ON :Driver(driverId);
CREATE INDEX ON :Race(raceId);
After those are created, you can use this, assuming you are starting from scratch (I fixed the case of your labels and made them singular):
USING PERIODIC COMMIT 25000
LOAD CSV WITH HEADERS from 'file:///lap_times.csv' AS row
MERGE (d:Driver {driverId:row.driverId})
MERGE (r:Race {raceId:row.raceId})
CREATE (lt:LapTime {raceId: row.raceId, driverId: row.driverId, lap: row.lap, position: row.position, time: row.time, milliseconds: row.milliseconds})
CREATE (d)-[:LAPPING_AT]->(lt)
CREATE (r)-[:TIMED_LAP]->(lt)

how to print logs and measure execution time

Currently, I'm trying to import a CSV file that contains around 2 million lines. Each line corresponds to a node. I'm using the Neo4j Browser. Note: I also tried the Neo4j import tool, but it was also somehow slower.
I tried to run the script with a standard Cypher query like:
USING PERIODIC COMMIT 500 LOAD CSV FROM 'file:///data.csv' AS r
WITH toInteger(r[0]) AS ID, toInteger(r[1]) AS national_id, toInteger(r[2]) as passport_no, toInteger(r[3]) as status, toInteger(r[4]) as activation_date
MERGE (p:Customer {ID: ID}) SET p.national_id = national_id, p.passport_no = passport_no, p.status = status, p.activation_date = activation_date
This works very slowly.
Later I tried:
CALL apoc.periodic.iterate('CALL apoc.load.csv(\'file:/data.csv\') yield list as r return r','WITH toInteger(r[0]) AS ID, toInteger(r[1]) AS national_id, toInteger(r[2]) as passport_no, toInteger(r[3]) as status, toInteger(r[4]) as activation_date MERGE (p:Customer {ID: ID}) SET p.national_id = national_id, p.passport_no = passport_no, p.status = status, p.activation_date = activation_date',
{batchSize:10000, iterateList:true, parallel:true});
This one seems to work faster since the parallel option is true. BUT I want to measure the execution time of one batch.
How could I print something on the neo4j browser?
How could I measure execution time for one batch?
Your first query uses a batch size of 500, and your second one uses a batch size that is 20 times larger. You need to use the same batch size to do a valid comparison.
Since your query requires a large number of batches (at least 200), dividing the total time by the number of batches should be a reasonable approximation of the average time per batch.
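If you want those numbers straight from the procedure, apoc.periodic.iterate also reports batch and timing totals you can divide yourself. A rough sketch based on your second query (the YIELD fields come from APOC's output; timeTaken is, as far as I recall, reported in seconds):
CALL apoc.periodic.iterate(
  'CALL apoc.load.csv("file:/data.csv") YIELD list AS r RETURN r',
  'MERGE (p:Customer {ID: toInteger(r[0])})
   SET p.national_id = toInteger(r[1]), p.passport_no = toInteger(r[2]),
       p.status = toInteger(r[3]), p.activation_date = toInteger(r[4])',
  {batchSize:500, iterateList:true, parallel:true}
) YIELD batches, total, timeTaken
RETURN batches, total, timeTaken, toFloat(timeTaken) / batches AS approxSecondsPerBatch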
Have you created an index on :Customer(ID)? That should help to speed up your queries.
You should consider whether you should use ON CREATE SET with your MERGE clause. Right now, the SET clause is always executed, even if the node already exists.
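For illustration, here is the MERGE from your first query with ON CREATE applied, so the properties are only written when the node is first created (a sketch reusing the question's property names):
MERGE (p:Customer {ID: ID})
ON CREATE SET p.national_id = national_id, p.passport_no = passport_no,
              p.status = status, p.activation_date = activation_date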
The key thing is adding a unique constraint before adding any data. This makes the process a lot faster. I learned that from https://neo4j.com/docs/getting-started/current/cypher-intro/load-csv/
Now a script like this:
CREATE CONSTRAINT ON (n:Movie) ASSERT n.no IS UNIQUE;
USING PERIODIC COMMIT 10000
LOAD CSV FROM 'file:///data/MovieData.csv' AS r
WITH r[0] AS no, toInteger(r[1]) AS status, toInteger(r[2]) as activation_date
MERGE (p:Movie {no: no})
ON CREATE SET p.status = status, p.activation_date = activation_date
adds 1 million nodes in about 1 minute. Before, it took more than 2-3 days.

storing/appending a list of values as a property of a relationship in neo4j

I am trying to store the metadata of my application in Neo4j.
So each time the application succeeds, I am trying to append the count of rows processed in that batch to a property on the relationship.
So, if in the first batch my application processes 30k rows, the graph database should look like this:
MERGE (N:Entity {name : "Cassandra"})-[:Success{rows:30000}]->(N:Entity{name:"MySQL"})
Now, in the second batch, if my application processes 20k rows, the database should APPEND 20000 to the rows property of the Success edge. Something of this sort:
MERGE (N:Entity {name : "Cassandra"})-[:Success{rows.APPEND(20000)}]->(N:Entity{name:"MySQL"})
so the output would look like this: [30000, 20000].
Is it even possible to do that?
Thanks in advance!
Does this do what you're looking for?
MERGE (n1:Entity {name : "Cassandra"})-[s:Success]->(n2:Entity{name:"MySQL"})
SET s.rows = coalesce(s.rows, []) + 20000
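Running that once per batch keeps appending to the list. For example (a sketch; the 30000 and 20000 values are from the question):
// First batch: s.rows becomes [30000]
MERGE (n1:Entity {name : "Cassandra"})-[s:Success]->(n2:Entity{name:"MySQL"})
SET s.rows = coalesce(s.rows, []) + 30000;
// Second batch: s.rows becomes [30000, 20000]
MERGE (n1:Entity {name : "Cassandra"})-[s:Success]->(n2:Entity{name:"MySQL"})
SET s.rows = coalesce(s.rows, []) + 20000;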

SUM(LAST()) on GROUP BY

I have a series, disk, that contains a path (/mnt/disk1, /mnt/disk2, etc.) and the total space of the disk. It also includes free and used values. These values are updated at a specified interval. What I would like to do is query for the sum of the total of the last() of each path. I would also like to do the same for free and for used, to get an aggregate of the total size, free space, and used space of all of my disks on my server.
I have a query here that will get me the last(total) of all the disks, grouped by its path (for distinction):
select last(total) as total from disk where path =~ /(mnt\/disk).*/ group by path
Currently, this returns 5 series, each containing 1 row (the latest) and the value of its total. I then want to take the sum of those series, but I cannot just wrap the last(total) into a sum() function call. Is there a way to do this that I am missing?
Carrying on from my comment above about nested functions.
Building a toy example:
CREATE DATABASE FOO
USE FOO
Assuming your data is updated no more often than[1] once a minute:
CREATE CONTINUOUS QUERY disk_sum_total ON FOO
BEGIN
SELECT sum("total") AS "total_1m" INTO disk_1m_total FROM "disk"
GROUP BY time(1m)
END
Then push some values in:
INSERT disk,path="/mnt/disk1" total=30
INSERT disk,path="/mnt/disk2" total=32
INSERT disk,path="/mnt/disk3" total=33
And wait more than a minute. Then:
INSERT disk,path="/mnt/disk1" total=41
INSERT disk,path="/mnt/disk2" total=42
INSERT disk,path="/mnt/disk3" total=43
And wait a minute+ again. Then:
SELECT * FROM disk_1m_total
name: disk_1m_total
-------------------
time total_1m
1476015300000000000 95
1476015420000000000 126
The two values are 30+32+33=95 and 41+42+43=126.
From there, it's trivial to query:
SELECT last(total_1m) FROM disk_1m_total
name: disk_1m_total
-------------------
time last
1476015420000000000 126
Hope that helps.
[1] Picking a CQ interval smaller than the update interval prevents minor timing jitter from causing data to be accidentally summed twice in a given group. There might be some "zero update" intervals, but no "double counting" intervals. I typically run the query twice as fast as the updates. If the CQ sees no data for a window, no CQ result is written for that window, so last() will still give the correct answer. For example, I left the CQ running overnight and pushed no new data in: last(total_1m) gives the same answer, not zero for "no new data".
