Delete a connected graph with Cypher - neo4j

I want to delete a connected graph related to a particular node in a Neo4j database using Cypher. The use case is to delete a "start" node and all the nodes where a path to the start node exists. To limit the transaction the query has to be iterative and must not disconnect the connected graph.
Until now I am using this query:
OPTIONAL MATCH (start {indexed_prop: $PARAM})--(toDelete)
OPTIONAL MATCH (toDelete)--(toBind)
WHERE NOT(id(start ) = id(toBind)) AND NOT((start)--(toBind))
WITH start, collect(toBind) AS TO_BIND, toDelete limit 10000
DETACH DELETE toDelete
WITH start, TO_BIND
UNWIND TO_BIND AS b
CREATE (start)-[:HasToDelete]->(b)
And call it until deleted node is equal to 0.
Is there a better query for this ?

You could try a mark and delete approach, which is similar to how you would detach and delete the entire connnected graph with a variable match, but instead of DETACH DELETE you can apply a :TO_DELETE label.
Something like this (making up a label to use for the start node, as otherwise it has to comb the entire db looking for a node with the indexed param):
MATCH (start:StartNodeLabel {indexed_prop: $PARAM})-[*]-(toDelete)
SET toDelete:TO_DELETE
If that blows up your heap, you can run it multiple times, with the added predicate WHERE NOT toDelete:TO_DELETE before the SET, and using a combination of LIMIT and/or a limit on the depth of the variable-length relationship.
When you're sure you've labeled every connected node, then it's just a matter of deleting every node in the TO_DELETE label, and you can run that iteratively, or use APOC procedure apoc.periodic.commit() to handle that in batches.

Related

How to access node objects in a collection of paths? (paths of two or more nodes)

I have a graph where some nodes were created out of an error in the app.
I want to delete those nodes (they represent a log), but I can't figure out how to loop thru the nodes.
I don't know how to access nodes in a collection of paths, and I need to do that in order to compare one node to another.
match (o:Order{id:123})
match (o)-[:STATUS_CHANGE*]->(l:Log)-[:STATUS]->(os:OrderStatus)
with collect((l:Log)-[:STATUS]->(os:OrderStatus)) as logs
I want to access each one of the nodes in the paths to perform a comparation. There are 5 or 6 of (l)-[:STATUS]->(os) normally for each Order.
How can I access the (l) and (os) nodes of each path, to perform the comparations between their properties?
For example, if I had this collection of paths in one of the Orders:
(log1)-[:STATUS]->(os1)
(log2)-[:STATUS]->(os2)
(log3)-[:STATUS]->(os3)
(log4)-[:STATUS]->(os2) <-- This is the error
(log5)-[:STATUS]->(os4)
So, from the collection of paths above, I'd want to detach delete the (log4), because the (os2) node is lower than the previous one (os3), and should be greater.
And after that, I want to attach the (log3) to the (log5)
NOTE: Each one of the (os) nodes has an id that represents the "status", and go from 1 to 5. Also, the (log) nodes are ordered by the created datetime.
Any idea on how to do this? Thank you in advance guys!
EDIT
I didn't mention some other scenarios I had. This is one of them:
Based on #cybersam answer, I found out how to work it out.
I had to run 2 separated queries to make it work, but the principle is the same, and is as follows:
Create new relationships:
MATCH(o:Order)-[:STATUS_CHANGE*]->(l:Log)-[:STATUS]->(os:OrderStatus)
WHERE SIZE((o)-[:STATUS_CHANGE*]->()-[:STATUS]->(os)) >= 1
WITH o, os, COLLECT(l)[0] AS keep
WITH o, collect(keep) AS k
FOREACH(i IN range(0,size(k)-1) |
FOREACH(a IN [k[i]] |
FOREACH(b IN [k[i+1]] |
FOREACH(c IN CASE WHEN b IS NOT NULL THEN [1] END | MERGE (a)-[:STATUS_CHANGE]->(b) ))));
Delete exceeded nodes:
MATCH(o:Order)-[:STATUS_CHANGE*]->(l:Log)-[:STATUS]->(os:OrderStatus)
WHERE (os)<-[:STATUS]-()-[:STATUS_CHANGE*]->(l)-[:STATUS]->(os)
WITH o, os, COLLECT(l) AS exceed
UNWIND exceed AS del
detach delete del;
This queries worked on every scenario.
Assuming all your errors follow the same pattern (the unwanted Log nodes are always referencing an "older" OrderStatus), this may work for you:
MATCH (o:Order{id:123})-[:STATUS_CHANGE*]->(l:Log)-[:STATUS]->(os:OrderStatus)
WHERE SIZE(()-[:STATUS]->(os)) > 1
WITH os, COLLECT(l) AS logs
UNWIND logs[1..] AS unwanted
OPTIONAL MATCH (x)-[:STATUS_CHANGE]->(unwanted)-[:STATUS_CHANGE]->(y)
DETACH DELETE unwanted
FOREACH(ignored IN CASE WHEN x IS NOT NULL THEN [1] END | CREATE (x)-[:STATUS_CHANGE]->(y))
This query:
Finds (in order) all relevant OrderStatus nodes having multiple STATUS relationships.
Uses the aggregating function COLLECT to collect (in order) the Log nodes related to each of those OrderStatus nodes.
Uses UNWIND logs[1..] to get the individual unwanted Log nodes.
Uses OPTIONAL MATCH to get the 2 nodes that may need to be connected together, after the unwanted node is deleted.
Uses DETACH DELETE to deleted each unwanted node and its relationships.
Uses FOREACH to connect together the pair of nodes that might have been foiund by the OPTIONAL MATCH.

Efficiently assigning UUIDs to connected components in Neo4j

I have partitioned my graph into ~400,000 connected components using the algo.unionFind function from the Neo4j Graph Algorithms library.
Each node n within the same connected component has the same n.partition value. However, now I want to assigned each connected component a UUID so that each node n in a connected component will have n.uuid populated with a component UUID. What is the most efficient way of doing this?
Currently I am getting a list of all n.partition values and then going through each partition and running a Cypher query to update all nodes of that partition to have a generated UUID. I'm using the Python wrapper py2neo and this process is quite slow.
Edit:
The Cypher queries I am currently using are:
MATCH (n)
RETURN DISTINCT n.partition AS partition
to get a list of partitions ids and then iteratively calling:
MATCH (n)
WHERE n.partition = <PARTITION_ID>
SET n.uuid = <GENERATED_UUID>
on each of the partition ids.
Edit 2:
I am able to get through ~180k/400k of the connected components using the following query:
CALL apoc.periodic.iterate(
"MATCH (n)
WITH n.partition as partition, COLLECT(n) as nodes
RETURN partition, nodes, apoc.create.uuid() as uuid",
"FOREACH (n in nodes | SET n.uuid = uuid)",
{batchSize:1000, parallel:true}
)
before getting a heap error: "neo4j.exceptions.ClientError: Failed to invoke procedure `apoc.periodic.iterate`: Caused by: java.lang.OutOfMemoryError: Java heap space"
The best way would be to install the APOC plug-in to Neo4j so that you can use the UUID function apoc.create.uuid() in Cypher. (so that it can be generated, and assigned, in the same transaction)
To create 1 uuid per partition, you will need to use WITH to store the uuid in a temporary variable. It will be run per row, so you need to do it once you have one partition
USING PERIODIC COMMIT 5000 // commit every 5k changes
MATCH (n)
WITH DISTINCT n.partition as p // will exclude null
WITH p, apoc.create.uuid() as uuid // create reusable uuid
// now just match and assign
MATCH (n)
WHERE n.partition = p
SET n.uuid = uuid
or as InverseFalcon suggested
MATCH (n)
WHERE exists(n.partition) // to filter out nulls
WITH n.partition as p, collect(n) as nodes // collect nodes so each row is 1 partition, and it's nodes
WITH p, nodes, apoc.create.uuid() as uuid // create reusable uuid
FOREACH (n in nodes | SET n.uuid = uuid) // assign uuid to each node in collection
The first query is more periodic commit friendly, since it doesn't need to load everything into memory to start doing assignments. Without the perodic commit statement though, it will eventually load everything into memory as it has to hold on to it for the transaction log. Once it hits a commit point, it can clear the transaction log to keep memory use down.
If your data set isn't too large though, the second query should be faster because by holding everything in memory after the first node scan, it doesn't need to run another node scan to find all the nodes. Periodic commit won't help here because if you blow the heap, it will almost certainly be during the initial scan/collect phase.
To do this you'll need to collect nodes by their partition value, which means you'll have a single row per distinct partition. Then you create the UUID (it will execute per row), then you can use FOREACH to apply to each node in the partition:
MATCH (n)
// WHERE exists(n.partition) // only if there are nodes in the graph without partitions
WITH n.partition as partition, collect(n) as nodes
WITH partition, nodes, randomUUID() as uuid
FOREACH (n in nodes | SET n.uuid = uuid)
Depending on the number of nodes in your graph, you may need to combine this with some batch processing, such as apoc.periodic.iterate(), to avoid heap issues.

Unable to load NODE with id

I recently upgraded my Neo4j database to v. 3 (3.0.6). Since then I am having trouble when trying to delete nodes with Cypher. This is a new problem since the upgrade. They Cypher query is:
MATCH (p) WHERE id(p) = 83624
OPTIONAL MATCH (p)-[r]-(n)
OPTIONAL MATCH (p)-[r2]-(n2)
WHERE NOT ('Entity' in labels(n2))
DELETE r, r2, p, n2
This now results in the error Unable to load NODE with id 83624
The exact query with RETURN instead of DELETE returns the node. I've also tried swapping out DELETE with DETACH DELETE of the nodes, but that gives me the same error.
It looks like this question has been asked before but without a solution. Any idea what is causing this error?
I'm a little confused by this query. Both of your optional matches are identical, in that they will match to any relationship to any node connected to p.
Let me make sure I understand what you're trying to do: Find a node by ID, then delete that node along with any connected node that is not an Entity (as well as the relationships connecting them). Is that right?
If so, this query might work better:
MATCH (p) WHERE id(p) = 83624
OPTIONAL MATCH (p)--(n)
WHERE NOT n:Entity
DETACH DELETE n
DETACH DELETE p
Detaching a node deletes all relationships connected to that node, and you can only delete a node that has zero relationships.
Also, just to note, it's not a good idea to use the internal ids for uniquely identifying nodes. It's recommended to use your own unique IDs instead, and create unique constraints on that property for that label.

How to do traversal in neo4j with cypher queries?

What I'm trying to do is simply start at a node and search for all connected nodes that are a certain label. However I don't want to return the start node. How would I do this?
Example:
...<-[:parent]<-anode<-[created]-user-[created]->anode-[:parent]->anode-....->nodes...
What I would like to do is start at the user node and return all relationships but excluding the user node.
This will return you a list of all nodes connected via created relationships of a distance of up to 10.
MATCH user-[:created*1..10]->(anode:CertainLabel)
RETURN DISTINCT anode
Depending on your graph, you may be able to get rid of the 10, but if it's large and complex removing the max value could cause your query to run very slowly
This is along the lines of what I was looking for.
START u = node(26)
MATCH (u)-[rels*1..10]->(node) unwind rels as r
RETURN DISTINCT id(startNode(r)),endNode(r)

How to get last node created in neo4j?

So I know when you created nodes neo4j has a UUID for each node. I know you can access a particular node by that UUID by accessing the ID. For example:
START n=node(144)
RETURN n;
How would I get the last node that was created? I know I could show all nodes and then run the same command in anotehr query with the corresponding ID, but is there a way to do this quickly? Can I order nodes by id and limit by 1? Is there a simpler way? Either way I have not figured out how to do so through a simple cypher query.
Every time not guaranteed that a new node always has a larger id than all previously created nodes,
So Better way is to set created_at property which stores current time-stamp while creating node.
You can use timestamp() function to store current time stamp
Then,
Match (n)
Return n
Order by n.created_at desc
Limit 1
Please be aware that Neo4j's internal node id is not a UUID. Nor is it guaranteed that a new node always has a larger id than all previously created nodes. The node id is (multiplied with some constant) the offset of the node's location within a store file. Due to space reclaiming a new node might get a lower id number.
BIG FAT WARNING: Do not take any assumption on node ids.
Depending on your requirements you could organize all nodes into a linked list. There is one "magic" node having a specific label, e.g. References that has always a relationship to the latest created node:
CREATE (entryPoint:Reference {to:'latest'}) // create reference node
When a node from your domain is created, you need to take multiple actions:
remove the latest relationships if existing
create your node
connect your new node to the previously latest node
create the reference link
.
MATCH (entryPoint:Reference {to:'latest'})-[r:latest]->(latestNode)
CREATE (domainNode:Person {name:'Foo'}), // create your domain node
(domainNode)-[:previous]->(latestNode), // build up a linked list based on creation timepoint
(entryPoint)-[:latest]->(domainNode) // connect to reference node
DELETE r //delete old reference link
I finally found the answer. The ID() function will return the neo4j ID for a node:
Match (n)
Return n
Order by ID(n) desc
Limit 1;

Resources