I'm using a Python script to generate and execute queries loaded from data in a CSV file. I've got a substantial amount of data to import, so speed is very important.
The problem I'm having is that merging between two nodes takes a very long time: including the Cypher to create the relationships between the nodes makes a query take around 3 seconds, compared with around 100 ms without it.
Here's a small bit of the query I'm trying to execute:
MERGE (s0:Chemical{`name`: "10074-g5"})
SET s0.`name`="10074-g5"
MERGE (y0:Gene{`gene-id`: "4149"})
SET y0.`name`="MAX"
SET y0.`gene-id`="4149"
MERGE (s0)-[:INTERACTS_WITH]->(y0)
MERGE (s1:Chemical{`name`: "10074-g5"})
SET s1.`name`="10074-g5"
MERGE (y1:Gene{`gene-id`: "4149"})
SET y1.`name`="MAX"
SET y1.`gene-id`="4149"
MERGE (s1)-[:INTERACTS_WITH]->(y1)
Any suggestions on why this is running so slowly? I've got indexes set up on Chemical->name and Gene->id, so I honestly don't understand why this runs so slowly.
Most of your SET clauses are just setting properties to the same values they already have (as guaranteed by the preceding MERGE clauses).
The remaining SET clauses probably only need to be executed when the MERGE creates a new node, so they should probably use ON CREATE SET.
You should never generate a long sequence of almost identical Cypher code. Instead, your Cypher code should use parameters, and you should pass your data as parameter(s).
You said you have a :Gene(id) index, whereas your code actually requires a :Gene(gene-id) index.
Below is sample Cypher code that uses the dataList parameter (a list of maps containing the desired property values), which fixes most of the above issues. The UNWIND clause just "unwinds" the list into individual maps.
UNWIND $dataList AS d
MERGE (s:Chemical{name: d.sName})
MERGE (y:Gene{`gene-id`: d.yId})
ON CREATE SET y.name=d.yName
MERGE (s)-[:INTERACTS_WITH]->(y)
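If the :Gene(gene-id) index really is missing, creating the two indexes this query relies on would look roughly like the following (Neo4j 3.x syntax; newer versions use CREATE INDEX FOR ... ON instead):
// index used by the MERGE on :Chemical(name)
CREATE INDEX ON :Chemical(name);
// index used by the MERGE on :Gene(`gene-id`); note the backticks around the hyphenated property
CREATE INDEX ON :Gene(`gene-id`);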
I am new to graph databases and especially Cypher. I am importing data from my CSV. Below is a sample I pulled for some country data, to which I added the cities and states. Now I am pushing the data for the areas:
LOAD CSV WITH HEADERS FROM
"file:///X:/loc.csv" as csvRow
MATCH (ct:city {poc:csvRow.poc})
MERGE (loc:area {eoc: csvRow.eoc, name:csvRow.loc_nme, name_wr:replace(csvRow.loc_nme," ","")})
MERGE (loc)-[:exists_inside]->(ct)
I've already pushed city and country data using the same query and built a relation between them too.
But when I try to create the areas inside the cities, the query just keeps going with no sign of stopping (15 minutes have passed).
There are 7000 cities in the data I've got from the internet and 90k areas inside those cities.
Is it just taking time, or have I messed up the query?
After the Update
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
"file:///X:/loc.csv" as csvRow
MATCH (ct:city {poc:csvRow.poc})
MERGE (loc:area {eoc: csvRow.eoc, name:csvRow.loc_nme, name_wr:replace(csvRow.loc_nme," ","")})
MERGE (loc)-[:exists_inside]->(ct)
Okay, your query plan shows NodeByLabelScans and filters are being used to find your nodes, which means that every time you match or merge to a node, it has to scan all nodes with the given labels and perform property access on all of them to find the nodes you're looking for.
You need to add indexes (or unique constraints, depending on if the field is supposed to be unique) on the relevant label/property combinations so those lookups will be quick.
So you'll need one on :city(poc), and probably one on :area(eoc), assuming those properties are referring to unique properties.
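If those are indeed the lookup properties, creating the indexes would look something like this (Neo4j 3.x syntax; use a uniqueness constraint instead for any property that is supposed to be unique):
// index for the MATCH on :city(poc)
CREATE INDEX ON :city(poc);
// index for the MERGE on :area(eoc)
CREATE INDEX ON :area(eoc);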
EDIT
One other big thing I initially missed: you need to add USING PERIODIC COMMIT before the LOAD CSV so the load batches the writes to the db. That should do the trick here.
My import.csv creates many nodes, and the merging creates a huge cartesian product that runs into a transaction timeout now that the data has grown so much. I've currently set the transaction timeout to 1 second because every other query is very quick and is not supposed to take any longer than one second to finish.
Is there a way to split or execute this specific query in smaller chunks to prevent a timeout?
Upping or disabling the transaction timeout in the neo4j.conf is not an option because the neo4j service needs a restart for every change made in the config.
The query hitting the timeout from my import script:
MATCH (l:NameLabel)
MATCH (m:Movie {id: l.id,somevalue: l.somevalue})
MERGE (m)-[:LABEL {path: l.path}]->(l);
Node counts: 1000 Movie, 2500 Namelabel
You can try installing APOC Procedures and using the procedure apoc.periodic.commit.
call apoc.periodic.commit("
MATCH (l:Namelabel)
WHERE NOT (l)-[:LABEL]->(:Movie)
WITH l LIMIT {limit}
MATCH (m:Movie {id: l.id,somevalue: l.somevalue})
MERGE (m)-[:LABEL {path: l.path}]->(l)
RETURN count(*)
",{limit:1000})
The inner query above will be executed repeatedly in separate transactions until it returns 0.
You can change the value of {limit : 1000}.
Note: remember to install APOC Procedures according to the version of Neo4j you are using. Take a look at the Version Compatibility Matrix.
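If you are not sure which versions you are running, one way to check from the browser is shown below (apoc.version() only exists once APOC is installed):
RETURN apoc.version();
CALL dbms.components() YIELD name, versions, edition RETURN name, versions, edition;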
The number of nodes and labels in your database suggests this is an indexing problem. Do you have constraints on both the Movie and Namelabel (which should be NameLabel since it is a node) nodes? The appropriate constraints should be in place and active.
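Assuming id is the unique key on both labels (the question only shows it being used for lookups, so the NameLabel one is a guess), the constraints would look something like:
CREATE CONSTRAINT ON (m:Movie) ASSERT m.id IS UNIQUE;
CREATE CONSTRAINT ON (l:NameLabel) ASSERT l.id IS UNIQUE;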
Indexing and Performance
- Make sure to have indexes and constraints declared and ONLINE for entities you want to MATCH or MERGE on.
- Always MATCH and MERGE on a single label and the indexed primary-key property.
- Prefix your load statements with USING PERIODIC COMMIT 10000.
- If possible, separate node creation from relationship creation into different statements (see the sketch below).
- If your import is slow or runs into memory issues, see Mark's blog post on Eager loading.
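As a rough sketch of the last three points combined (the file name, headers, and labels here are invented for illustration, not taken from your data):
// pass 1: create the nodes only
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///movies.csv' AS row
MERGE (m:Movie {id: row.movieId})
MERGE (l:NameLabel {id: row.labelId});
// pass 2: create the relationships only, matching on the indexed keys
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///movies.csv' AS row
MATCH (m:Movie {id: row.movieId})
MATCH (l:NameLabel {id: row.labelId})
MERGE (m)-[:LABEL]->(l);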
If your Movie nodes have unique names then use the CREATE UNIQUE statement (see the docs).
If one of the nodes is not unique but will be used in a relationship definition, then use the CREATE INDEX ON statement. With such a small dataset it may not be readily apparent how inefficient your queries are. Try the PROFILE command and see how many nodes are being searched. Your MERGE statement should only check a couple of nodes at each step.
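For example, prefixing a lookup with PROFILE (the id value here is just a placeholder) shows whether the planner uses a NodeIndexSeek or falls back to a NodeByLabelScan:
PROFILE
MATCH (m:Movie {id: '123'})
RETURN m;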
To keep things simple, as part of the ETL on my time-series data, I added a sequence number property to each row, corresponding to 0..370365 (370,366 nodes, 5,555,490 properties - not that big). I later added a second property, naming the original "outeseq" and the second "ineseq", to see whether an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000), but past that it's just an endless wait. My JVM has 16g max (if it can even use it on a Windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can also get bits of the relationships built with parameters, but haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way to do this.
I profiled the query, but didn't see any reason for it to "blow up".
Another question: I would like each relationship to have a property to represent the difference in the time-stamps of each node or delta-t. Is there a way to take the difference between the two values in two sequential nodes, and assign it to the relationship?....for all of the relationships at the same time?
The last question, if you have the time: I'd really like to use the raw data and just chain the directed relationships from one node's timestamp to the next nearest node with the minimum delta, but I didn't run right at this for fear that it would cause scanning of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other db's for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest running them page by page over all the nodes. This seems feasible because outeseq and ineseq are unique and numeric, so you can sort nodes by those properties and run the query against one slice at a time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
You will need to run the query about 13 times, changing {offset} each time, to cover all the data. It would be nice to write a script in any language that has a Neo4j client.
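If you drive the paged query from the Neo4j Browser rather than from a script, one way to supply the parameter is the :param command, bumping the value by 30000 before each run:
:param offset => 0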
Updating Relationship's Properties
You can assign the timestamp delta to relationships using a SET clause following the MATCH. Assuming that a timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
Chaining Nodes With Minimal Delta
Once the relationships carry the delta property, the graph becomes a weighted graph, so we can apply this approach to calculate the shortest path using deltas. Then we just save the length of the shortest path (the sum of the deltas) into a relationship between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight = 0, r IN relationships(p) | weight + r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
Disclaimer: the queries above are not meant to be fully working; they just hint at possible approaches to the problem.
I have a list of MATCH statements which are totally unrelated to each other. Each works if I execute it on its own, for example:
MATCH (a:Person),(b:InProceedings) WHERE a.identifier = 'person/joseph-valeri' and b.identifier = 'conference/edm2008/paper/209' CREATE (a)-[r:creator]->(b)
MATCH (a:Person),(b:InProceedings) WHERE a.identifier = 'person/nell-duke' and b.identifier = 'conference/edm2008/paper/209' CREATE (a)-[r:creator]->(b)
But if I execute them at once I get the following error:
WITH is required between CREATE and MATCH (line 2, column 1)
What changes should I incorporate?
(I am new to Neo4j)
Does this need to happen in a single transaction? If so, you should match your nodes up front before performing the CREATE:
MATCH (jo:Person{identifier:'person/joseph-valeri'}), (nell:Person{identifier:'person/nell-duke'}), (b:InProceedings{identifier:'conference/edm2008/paper/209'})
CREATE (jo)-[:creator]->(b), (nell)-[:creator]->(b)
If it's just the two creators you could change the create to:
CREATE (jo)-[:creator]->(b)<-[:creator]-(nell)
If this isn't what you want to achieve then effectively what you have posted is two distinct Cypher statements that you are trying to run as one, and the parser is getting confused.
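For reference, the error message itself points at another workaround: otherwise independent statements can be chained in a single query by putting a WITH (here an aggregation that collapses everything to one row) between a CREATE and the next MATCH. A rough sketch:
MATCH (a:Person), (b:InProceedings)
WHERE a.identifier = 'person/joseph-valeri' AND b.identifier = 'conference/edm2008/paper/209'
CREATE (a)-[:creator]->(b)
// reset cardinality to a single row before the next, unrelated MATCH
WITH count(*) AS _
MATCH (a:Person), (b:InProceedings)
WHERE a.identifier = 'person/nell-duke' AND b.identifier = 'conference/edm2008/paper/209'
CREATE (a)-[:creator]->(b)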
Post comment edit
Given that you said millions, I think you are going to find the transaction time of performing the import prohibitive, so you should investigate the CSV import syntax (and specifically pay attention to PERIODIC COMMIT), if you can write to CSV instead of to the big Cypher dump.
If for some reason that is not an option and you are starting from empty, then build up slowly, creating the nodes first. These will need identifiers so that you can refer to them later without re-matching, which keeps the speed up (the identifiers aren't persisted; they are just variables in your Cypher query):
CREATE (a:Person{identifier:'person/joseph-valeri'}),
(b:Person{identifier:'person/nell-duke'}),
(zzz:Person{identifier:'person/do-you-really-want-person-in-all-these-identifiers'}),
(inProca:InProceedings{identifier:'conference/edm2008/paper/209'}),
(inProcb:InProceedings{identifier:'conference/edm2009/paper/209'})
You will have kept track of a, b .. zzz in your Python script, allowing you to build the CREATE statement up with:
(a)-[:creator]->(inProca), (zzz)-[:creator]->(inProcb)
Now if all of your nodes already exist and you just want to build the relationships in now, then you have the choice of:
1. Performing individual MATCHes and CREATEs for each new relationship, executing each of them individually. This looks like what your original code was doing. You should move the conditions into the MATCH rather than the WHERE clause.
2. MATCHing a large set of nodes and CREATEing the new relationships. This is more akin to what my initial code was doing and will require your script to be smart in generating the queries.
3. MERGEing existing nodes into new relationships.
Whatever you do, you're going to need to batch the writes within the transaction or you're going to run out of memory. You can advise Neo4j to do this by using the USING PERIODIC COMMIT 50000 syntax; here is a great blog post on it.
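For example, if the creator pairs can be exported to a CSV file (the file name and headers below are invented for the sketch), the relationship pass could look like:
USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM 'file:///creators.csv' AS row
MATCH (p:Person {identifier: row.person})
MATCH (i:InProceedings {identifier: row.paper})
MERGE (p)-[:creator]->(i);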
I am merging large batches of ~500,000 relationships with the LOAD CSV command:
LOAD CSV WITH HEADERS FROM 'http://file.csv' AS csv
MATCH (a:Label {uid: csv.uid1}),(b:Otherlabel {uid: csv.uid2})
MERGE (a)-[:TYPE {key1: csv.key1}]->(b)
Both uid properties have a UNIQUE constraint.
The CSV file looks like:
uid1,uid2,key1
123,abc,some_value
456,def,some_value
This is usually very fast (< 1 min) when there are many different nodes on each side.
But performance drops dramatically when I load batches where a single a node is connected to many different b nodes. The uid1 is always the same but schema constraints are still there. ~30,000 relationships take ~8 min to load.
Am I missing something here? What could explain the huge performance difference in MERGEing 'many-to-many' relationships vs. 'one-to-many'?
As I mentioned in the comment on the question, I verified this behavior with a ~300,000 line CSV file that I created with unique random values for uid1 and uid2. @MartinPreusse then mentioned that if you change the query to use CREATE instead of MERGE, the query is fast. This observation made me realize what is going on.
The slowdown is caused by the need to scan the relationships list of the 'a' node each time a MERGE is performed. When a CREATE is performed, the relationship is added without testing first to see if the relationship already exists. When the relationship lists remain short (first case), scanning the relationship lists has little impact. When the relationship lists are growing long (second case), the repeated scanning of a growing list is dominating the process. In my test I linked all 300,000 nodes to a single node using a MERGE clause and it took hours.
If you don't have to worry about creating duplicate relationships, you can use CREATE without fear. Even if duplicates are an issue, it might be faster to use CREATE and then craft a query that will remove the duplicates.
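If you do go the CREATE-then-deduplicate route, a cleanup along these lines (grouping on the same key1 property used in the MERGE) should remove the surplus relationships afterwards:
MATCH (a:Label)-[r:TYPE]->(b:Otherlabel)
WITH a, b, r.key1 AS key1, collect(r) AS rels
WHERE size(rels) > 1
// keep the first relationship in each group and delete the rest
FOREACH (r IN tail(rels) | DELETE r);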