Make a copy of a subtree containing 500k nodes and 1M relations in Neo4j

I am evaluating Neo4j. I created some random data to compare it with other databases. The data represents a tree structure with 10k, 100k and 1M nodes. There are two relationship types: the hierarchical one, and a connection chain relation like a linked list.
One of the operations I want to test is making a copy of a subtree. The operation is done in three steps (copy nodes, copy relations, connect to target) and works fine for the 10k and 100k trees. But for the biggest example, copying a subtree of 500k nodes, Neo4j never comes back.
The browser shows that it keeps reconnecting and nothing happens. I don't think 500k nodes should be that much; the test data in CSV files is around 300 MB.
What am I doing wrong?
1: copy nodes
match (r {`domain key` : 'unit-B2'})-[:isPredecessorOf*0..]->(n:`T-Base`)
with n as map create (copy:`T-Base`)
set copy = map, copy.`domain key` = map.`domain key` + '-copy'
with map, copy
create (copy)-[:isCopyOf]->(map)
2: copy relations
match (s {`domain key` : 'unit-B2'})-[:isPredecessorOf*0..]->(n)
with collect(n) as st, s
match (s)-[:isPredecessorOf*0..]->(t)-[r:`isPredecessorOf`]->(x)
where x in st
with startNode(r) as s, endNode(r) as d
match (s)<-[:isCopyOf]-(source), (d)<-[:isCopyOf]-(dest)
with source, dest
create (source)-[:isPredecessorOf]->(dest)
match (s {`domain key` : 'unit-B2'})-[:isPredecessorOf*0..]->(n)
with collect(n) as st, s
match (s)-[:isPredecessorOf*0..]->(t)-[r:`isConnectedTo`]->(x)
where x in st
with startNode(r) as s, endNode(r) as d
match (s)<-[:isCopyOf]-(source), (d)<-[:isCopyOf]-(dest)
with source, dest
create (source)-[:isConnectedTo]->(dest)
3: connect root of copy tree to target node
match (source{`domain key`:'unit-B1'}), (dest{`domain key`:'unit-B2-copy'})
create (source)-[:isPredecessorOf]->(dest)

How do you run Neo4j? It's probably just a memory configuration issue for transactional memory. For 1M records you need about 4 GB of heap configured.
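For example, in neo4j.conf the heap (and ideally also the page cache) can be sized along these lines; the values are only a sketch and depend on how much RAM the machine has:
# conf/neo4j.conf
dbms.memory.heap.initial_size=4g
dbms.memory.heap.max_size=4g
dbms.memory.pagecache.size=2g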
You should use a label for r and s.
Separate your 2nd statement into two statements.
If you have to do larger transactional updates, you can install the APOC procedures and use apoc.periodic.iterate to execute the updates in batches.
e.g.
call apoc.periodic.iterate('
match (r:Label {`domain key` : "unit-B2"})-[:isPredecessorOf*0..]->(n:`T-Base`)
return distinct n as map
','
create (copy:`T-Base`)
set copy = map, copy.`domain key` = map.`domain key` + "-copy"
with map, copy
create (copy)-[:isCopyOf]->(map)
',{batchSize:10000,iterateList:true})
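The relationship-copy statements can be batched the same way. The following is only a sketch for the isPredecessorOf part, assuming step 1 has already run, so the subtree membership check becomes implicit: end nodes outside the copied subtree have no incoming isCopyOf relationship and are simply skipped.
call apoc.periodic.iterate('
match (s:`T-Base` {`domain key` : "unit-B2"})-[:isPredecessorOf*0..]->(t)-[r:isPredecessorOf]->(x)
return distinct r
','
with startNode(r) as a, endNode(r) as b
match (a)<-[:isCopyOf]-(source), (b)<-[:isCopyOf]-(dest)
create (source)-[:isPredecessorOf]->(dest)
',{batchSize:10000,iterateList:true})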

Related

Longest Path Neo4j returning incorrect path

I have the following graph stored in CSV format:
graphUnioned.csv:
a b
b c
The above graph denotes a path from Node:a to Node:b. Note that the first column in the file denotes the source and the second column denotes the destination. With this logic, the second path in the graph is from Node:b to Node:c, and the longest path in the graph is Node:a to Node:b to Node:c.
I loaded the above csv in Neo4j desktop using the following command:
LOAD CSV WITH HEADERS FROM "file:\\graphUnioned.csv" AS csvLine
MERGE (s:s {s:csvLine.s})
MERGE (o:o {o:csvLine.o})
MERGE (s)-[]->(o)
RETURN *;
And then for finding longest path I run the following command:
match (n:s)
where (n:s)-[]->()
match p = (n:s)-[*1..]->(m:o)
return p, length(p) as L
order by L desc
limit 1;
However, unfortunately this command only gives me the path from Node:a to Node:b and does not return the longest path. Can someone please help me understand where I am going wrong?
There are two mistakes in your CSV import query.
First, you need to use a type when you MERGE a relationship between nodes; that query won't compile otherwise. You likely supplied one and forgot to add it when you pasted it here.
Second, the big one, is that your query is merging nodes with different labels and different properties, and this is majorly throwing it off. Your intent was to create 3 nodes with a longest path connecting them, but your query creates 4 nodes, in two isolated groups of two nodes each:
It creates 2 b nodes, (:s {s:'b'}) and (:o {o:'b'}). Each of them is connected to a different node, because the nodes created from each CSV column are treated differently.
What you should be doing is using the same label and property key for all of the nodes involved, so that the MERGE on the b node matches a single existing node instead of creating two:
LOAD CSV WITH HEADERS FROM "file:\\graphUnioned.csv" AS csvLine
MERGE (s:Node {value:csvLine.s})
MERGE (o:Node {value:csvLine.o})
MERGE (s)-[:REL]->(o)
RETURN *;
You'll also want an index on :Node(value) (or whatever your equivalent is when you import real data) so that your MERGEs and subsequent MATCHes are fast when performing lookups of the nodes by property.
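Depending on your Neo4j version, that index could be created like this (a sketch; the 3.x syntax first, then the 4.x+ form):
CREATE INDEX ON :Node(value);
// Neo4j 4.x and later:
CREATE INDEX node_value FOR (n:Node) ON (n.value);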
Now, to get to your longest path query.
If you are assuming that the start node has no incoming relationships and that your end node has no outgoing relationships, then you can use a query like this:
match (start:Node)
where not ()-->(start)
match p = (start)-[*]->(end)
where not (end)-->()
return p, length(p) as L
order by L desc
limit 1;

Efficiently assigning UUIDs to connected components in Neo4j

I have partitioned my graph into ~400,000 connected components using the algo.unionFind function from the Neo4j Graph Algorithms library.
Each node n within the same connected component has the same n.partition value. However, now I want to assign each connected component a UUID, so that each node n in a connected component will have n.uuid populated with a component UUID. What is the most efficient way of doing this?
Currently I am getting a list of all n.partition values and then going through each partition and running a Cypher query to update all nodes of that partition to have a generated UUID. I'm using the Python wrapper py2neo and this process is quite slow.
Edit:
The Cypher queries I am currently using are:
MATCH (n)
RETURN DISTINCT n.partition AS partition
to get a list of partition ids and then iteratively calling:
MATCH (n)
WHERE n.partition = <PARTITION_ID>
SET n.uuid = <GENERATED_UUID>
on each of the partition ids.
Edit 2:
I am able to get through ~180k/400k of the connected components using the following query:
CALL apoc.periodic.iterate(
"MATCH (n)
WITH n.partition as partition, COLLECT(n) as nodes
RETURN partition, nodes, apoc.create.uuid() as uuid",
"FOREACH (n in nodes | SET n.uuid = uuid)",
{batchSize:1000, parallel:true}
)
before getting a heap error: "neo4j.exceptions.ClientError: Failed to invoke procedure `apoc.periodic.iterate`: Caused by: java.lang.OutOfMemoryError: Java heap space"
The best way would be to install the APOC plug-in for Neo4j so that you can use the UUID function apoc.create.uuid() in Cypher, which lets the UUID be generated and assigned in the same transaction.
To create one uuid per partition, you will need to use WITH to store the uuid in a temporary variable. It is evaluated per row, so you need to create it once you have one row per partition:
USING PERIODIC COMMIT 5000 // commit every 5k changes
MATCH (n)
WITH DISTINCT n.partition as p // will exclude null
WITH p, apoc.create.uuid() as uuid // create reusable uuid
// now just match and assign
MATCH (n)
WHERE n.partition = p
SET n.uuid = uuid
or as InverseFalcon suggested
MATCH (n)
WHERE exists(n.partition) // to filter out nulls
WITH n.partition as p, collect(n) as nodes // collect nodes so each row is 1 partition, and it's nodes
WITH p, nodes, apoc.create.uuid() as uuid // create reusable uuid
FOREACH (n in nodes | SET n.uuid = uuid) // assign uuid to each node in collection
The first query is more periodic commit friendly, since it doesn't need to load everything into memory to start doing assignments. Without the periodic commit statement, though, it will eventually load everything into memory anyway, because it has to hold on to it for the transaction log. Once it hits a commit point, it can clear the transaction log to keep memory use down.
If your data set isn't too large though, the second query should be faster because by holding everything in memory after the first node scan, it doesn't need to run another node scan to find all the nodes. Periodic commit won't help here because if you blow the heap, it will almost certainly be during the initial scan/collect phase.
To do this you'll need to collect nodes by their partition value, which means you'll have a single row per distinct partition. Then you create the UUID (it will execute per row), then you can use FOREACH to apply to each node in the partition:
MATCH (n)
// WHERE exists(n.partition) // only if there are nodes in the graph without partitions
WITH n.partition as partition, collect(n) as nodes
WITH partition, nodes, randomUUID() as uuid
FOREACH (n in nodes | SET n.uuid = uuid)
Depending on the number of nodes in your graph, you may need to combine this with some batch processing, such as apoc.periodic.iterate(), to avoid heap issues.
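A sketch of such a batched version, assuming APOC is installed: the outer statement returns only the distinct partition values, so no node collections have to be held in memory up front.
CALL apoc.periodic.iterate(
"MATCH (n) WHERE exists(n.partition)
 RETURN DISTINCT n.partition AS partition",
"WITH partition, apoc.create.uuid() AS uuid
 MATCH (n) WHERE n.partition = partition
 SET n.uuid = uuid",
{batchSize:100, parallel:false}
)
The trade-off is that, without an index on the partition property, the inner MATCH is a full scan per partition, so this is gentler on the heap but not necessarily faster.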

why is neo4j so slow on this cypher query?

I have a fairly deep tree that consists of an initial "transaction" node (call that the 0th layer of the tree), from which there are 50 edges to the next nodes (call it the 1st layer of the tree), and then from each of those around 35 edges on average to the second layer, and so on...
The initial node is a :txnEvent and all the rest are :mEvent
mEvent nodes have 4 properties, one of them called channel_name
Now, I would like to retrieve all paths that go down to the 4th layer such that those paths contain a node with channel_name = 'A' and also a node with channel_name = 'B'.
This query:
match (n: txnEvent)-[r:TO*1..4]->(m:mEvent) return COUNT(*);
Is telling me there are only 1,667,444 paths to consider.
However, the following query:
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
AND ANY(k IN nodes(p) WHERE k.channel_name='B')
RETURN
EXTRACT (n in nodes(p) | n.channel_name),
EXTRACT (n in nodes(p) | n.step),
EXTRACT (n in nodes(p) | n.event_type),
EXTRACT (n in nodes(p) | n.event_device),
EXTRACT (r in relationships(p) | r.weight )
Takes almost 1 minute to execute (neo4j's UI on port 7474)
For completeness, neo4j is telling me:
"Started streaming 125517 records after 2 ms and completed after 50789 ms, displaying first 1000 rows."
So I'm wondering whether there's something obvious I'm missing. All of the properties that nodes have are indexed by the way. Is the query slow, or is it fast and the streaming is slow?
UPDATE:
This query, that doesn't stream data back:
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
AND ANY(k IN nodes(p) WHERE k.channel_name='B')
RETURN
COUNT(*)
Takes 35s, so even though it's faster, presumably because no data is returned, I feel it's still quite slow.
UPDATE 2:
Ideally this data should go into a jupyter notebook with a python kernel.
Thanks for the PROFILE plan.
Keep in mind that the query you're asking for is a difficult one to process. Since you want paths where at least one node in the path has one property and at least one other node in the path has another property, there is no way to prune paths during expansion. Instead, every possible path has to be determined, and then every node in each of those 1.6 million paths has to be accessed to check for the property (and that has to be done twice for each path, for both properties). Thus the ~10 million db hits for the filter operation.
You could try expanding your heap and pagecache sizes (if you have the RAM to spare), but I don't see any easy ways to tune this query.
As for your question about the query time vs streaming, the problem is the query itself. The message you saw means that the first result was found extremely quickly so the first result was ready in the stream almost immediately. Results are added to the stream as they're found, but the volume of paths needing to be matched and filtered with no ability to prune paths during expansion means it took a very long time for the query to complete.

Inserting a Relation into Neo4j using MERGE or MATCH runs forever

I am experimenting with Neo4j using a simple dataset of Locations. A location can have a relation to another location.
a:Location - [rel] - b:Location
I already have the locations in the database (roughly 700,000+ Location entries).
Now I wanted to add the relation data (170M edges), but I wanted to experiment with the import logic on a smaller set first, so I basically picked 2 nodes that are in the set and tried to create a relationship as follows.
MERGE p =(a:Location {locationid: 3616})-[w:WikiLink]->(b:Location {locationid: 467501})
RETURN p;
and also tried the approach directly from the documentation
MATCH (a:Person),(b:Person)
WHERE a.name = 'Node A' AND b.name = 'Node B'
CREATE (a)-[r:RELTYPE { name : a.name + '<->' + b.name }]->(b)
RETURN r
I tried using a directed merge, an undirected merge, etc. I basically tried multiple variants of the above queries and the result is the same: they run forever, seeming not to complete even after 15 minutes, which is very odd.
Indexes
ON :Location(locationid) ONLINE (for uniqueness constraint)
Constraints
ON (location:Location) ASSERT location.locationid IS UNIQUE
This is what I am currently using:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///edgelist.csv' AS line WITH line
MATCH (a:Location {locationid: toInt(line.locationidone)}), (b:Location {locationid: toInt(line.locationidtwo)})
MERGE (a)-[w:WikiLink {weight: toFloat(line.edgeweight)}]-(b)
RETURN COUNT(w);
If you look at the terminal output below you can see Neo4j reports 258 ms query execution time; the real time is however somewhat above that. This query already takes a few seconds too long in my opinion (the machine this runs on has 48 GB RAM, 16 cores and is relatively new).
I am currently running this query with LIMIT 1000 (before it was LIMIT 1) but the script has already been running for a few minutes. I wonder if I have to switch from MERGE to CREATE. The problem is, I cannot understand the call graph that EXPLAIN gives me in order to determine the bottleneck.
time /usr/local/neo4j/bin/neo4j-shell -file import-relations.cql
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| p |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [Node[758609]{title:"Tehran",locationid:3616,locationlabel:"NIL"},:WikiLink[9422418]{weight:1.2282325516616477E-7},Node[917147]{title:"Khorugh",locationid:467501,locationlabel:"city"}] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row
Relationships created: 1
Properties set: 1
258 ms
real 0m1.417s
user 0m1.497s
sys 0m0.158s
If you haven't:
create constraint on (loc:Location) assert loc.locationid is unique;
Then find both nodes, and create the relationship.
MATCH (a:Location {locationid: 3616}),(b:Location {locationid: 467501})
MERGE p = (a)-[w:WikiLink]->(b)
RETURN p;
or if the locations don't exist yet:
MERGE (a:Location {locationid: 3616})
MERGE (b:Location {locationid: 467501})
MERGE p = (a)-[w:WikiLink]->(b)
RETURN p;
You should also use parameters if you do that from a program.
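For instance, a parameterized version might look like this; fromId and toId are placeholder parameter names, and the $param syntax applies to Neo4j 3.x+ (older versions use {param}):
MATCH (a:Location {locationid: $fromId}), (b:Location {locationid: $toId})
MERGE p = (a)-[w:WikiLink]->(b)
RETURN p;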
Have you indexed the Location nodes on locationid?
CREATE INDEX ON :Location(locationid)
I had a similar problem adding edges to a graph and indexing the nodes led to the linking running over 150x faster.
If the nodes aren't indexed neo4j will do a serial search for the two nodes to link together.
USING PERIODIC COMMIT <value>:
Specifies the number of records (rows) to be committed in a transaction. Since you have a lot of RAM, it is good to use a value greater than 100000. This will reduce the number of transactions committed and might further reduce the overall time.
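Applied to the import query above, that would look roughly like this (100000 is only an illustrative value; tune it to your memory budget):
USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM 'file:///edgelist.csv' AS line
MATCH (a:Location {locationid: toInt(line.locationidone)}), (b:Location {locationid: toInt(line.locationidtwo)})
MERGE (a)-[w:WikiLink {weight: toFloat(line.edgeweight)}]-(b);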

merging nodes into a new one with cypher and neo4j

Using Neo4j - Graph Database Kernel 2.0.0-M02 and the new merge function,
I was trying to merge nodes into a new one (merge does not really merge, but binds to the returned identifier according to the documentation) and then delete the old nodes. At the moment I only care about properties being transferred to the new node, not relationships.
What I have at the moment is the cypher below
merge (n:User {form_id:123}) //I get the nodes with form_id=123 and label User
with n match p=n //subject to change to have the in a collection
create (x) //create a new node
foreach(n in nodes(p): set x=n) //properties of n copied over to x
return n,x
Problems
1. When foreach runs it creates a new x for every n
2. Moving properties from n to x replaces all properties each time with those of the new n,
so if the 1st n node from the merge has the two properties a,b and the second has c,d, then after set x=n all new nodes end up with the c,d properties. I know this is stated in the documentation, so my question is:
Is there a way to merge all properties of N number of nodes (and maybe relationships as well) in a new node with cypher only?
I don't think the Cypher language currently has a syntax that non-destructively copies any and all properties from one node into another.
However, I'll present the solution to a simple situation that may be similar to yours. Let's say that some User nodes have the properties a & b, and some others have c & d. For example:
CREATE (:User { id:1,a: 1,b: 2 }),(:User { id:1,c: 3,d: 4 }),
(:User { id:2,a:10,b:20 }),(:User { id:2,c:30,d:40 });
This is how we would "merge" all User nodes with the same id into a single node:
MATCH (x:User), (y:User)
WHERE x.id=y.id AND has(x.a) AND has(y.c)
SET x.c = y.c, x.d = y.d
DELETE y
RETURN x
You can try this out in the neo4j sandbox at: http://console.neo4j.org/
With Neo4j 3.x it is also possible to merge two nodes into one using a specific APOC procedure.
First you need to download the APOC procedures jar file into your $NEO4J_HOME/plugins folder and restart the Neo4j server.
Then you can call apoc.refactor.mergeNodes this way:
MATCH (x:User), (y:User)
WHERE x.id=y.id
call apoc.refactor.mergeNodes([x,y]) YIELD node
RETURN node
As far as I can tell, the resulting node will have all the properties of both x and y, choosing the values of y when both are set.
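If your APOC version supports a configuration map for apoc.refactor.mergeNodes, you may also be able to control how conflicting properties are handled. A sketch, assuming the properties option ("discard", "overwrite" or "combine") is available in your version:
MATCH (x:User), (y:User)
WHERE x.id = y.id AND id(x) < id(y) // avoid merging a node with itself
CALL apoc.refactor.mergeNodes([x,y], {properties: "combine"}) YIELD node
RETURN node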
