Neo4j Nodes Deleted (but not Actually)

I would like to delete all the nodes of a certain label by executing
match (P:ALabel) delete P;
This returns the comment "No data returned." It also states how many nodes were deleted and how long it took (5767 ms). However, the shell seems to stop responding after this, and I am unable to execute any other commands.
I also used this command, suggested by this answer:
match (n:ALabel)
optional match (n)-[r]-()
delete n, r;
Executing this command took slightly longer (16929 ms), but it still does not return.

Depending on the number of changes, you need to choose an appropriate transaction size; otherwise you'll see excessive garbage collection and/or OOM exceptions. Use the LIMIT clause and return the number of deleted nodes. Run this statement multiple times until 0 is returned:
match (n:ALabel)
with n limit 5000
optional match (n)-[r]-()
delete n,r
return count(distinct n)
Here the batch size is 5000 nodes.
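If the APOC library is installed (an assumption; it does not ship with every Neo4j distribution), the same batching can be delegated to apoc.periodic.iterate instead of rerunning the statement by hand:
CALL apoc.periodic.iterate(
  "MATCH (n:ALabel) RETURN n",
  "DETACH DELETE n",
  {batchSize: 5000})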

Related

Why is Neo4j so slow on this Cypher query?

I have a fairly deep tree that starts from an initial "transaction" node (call that the 0th layer of the tree), from which there are 50 edges to the next nodes (call that the 1st layer of the tree), and then from each of those around 35 edges on average to the second layer, and so on...
The initial node is a :txnEvent and all the rest are :mEvent
mEvent nodes have 4 properties, one of them called channel_name
Now, I would like to retrieve all paths that go down to the 4th layer such that those paths contain a node with channel_name==A and also channel_name==B
This query:
match (n: txnEvent)-[r:TO*1..4]->(m:mEvent) return COUNT(*);
Is telling me there are only 1,667,444 paths to consider.
However, the following query:
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
AND ANY(k IN nodes(p) WHERE k.channel_name='B')
RETURN
EXTRACT (n in nodes(p) | n.channel_name),
EXTRACT (n in nodes(p) | n.step),
EXTRACT (n in nodes(p) | n.event_type),
EXTRACT (n in nodes(p) | n.event_device),
EXTRACT (r in relationships(p) | r.weight )
Takes almost 1 minute to execute (neo4j's UI on port 7474)
For completeness, neo4j is telling me:
"Started streaming 125517 records after 2 ms and completed after 50789 ms, displaying first 1000 rows."
So I'm wondering whether there's something obvious I'm missing. All of the properties that nodes have are indexed by the way. Is the query slow, or is it fast and the streaming is slow?
UPDATE:
This query, that doesn't stream data back:
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
AND ANY(k IN nodes(p) WHERE k.channel_name='B')
RETURN
COUNT(*)
Takes 35s, so even though it's faster, presumably because no data is returned, I feel it's still quite slow.
UPDATE 2:
Ideally this data should go into a Jupyter notebook with a Python kernel.
Thanks for the PROFILE plan.
Keep in mind that the query you're asking for is a difficult one to process. Since you want paths where at least one node in the path has one property and at least one other node in the path has another property, there is no way to prune paths during expansion. Instead, every possible path has to be determined, and then every node in each of those 1.6 million paths has to be accessed to check for the property (and that has to be done twice for each path, for both properties). Thus the ~10 million db hits for the filter operation.
You could try expanding your heap and pagecache sizes (if you have the RAM to spare), but I don't see any easy ways to tune this query.
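For reference, these are the relevant settings in conf/neo4j.conf; the values below are placeholders to size against your hardware, not recommendations:
# conf/neo4j.conf -- illustrative values only
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
dbms.memory.pagecache.size=16g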
As for your question about the query time vs streaming, the problem is the query itself. The message you saw means that the first result was found extremely quickly so the first result was ready in the stream almost immediately. Results are added to the stream as they're found, but the volume of paths needing to be matched and filtered with no ability to prune paths during expansion means it took a very long time for the query to complete.
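One side note on the query text itself: EXTRACT() is deprecated (and removed in Neo4j 4.0), so the projections are better written as list comprehensions. This is only a cosmetic modernization and won't change the performance profile:
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
  AND ANY(k IN nodes(p) WHERE k.channel_name='B')
RETURN
  [x IN nodes(p) | x.channel_name],
  [x IN nodes(p) | x.step],
  [x IN nodes(p) | x.event_type],
  [x IN nodes(p) | x.event_device],
  [r IN relationships(p) | r.weight]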

How to ask Neo4j to take cycles into account in an optimised way

This is a follow-up question to:
How to ask Neo4j to take cycles into account
In my previous question, #stdob-- kindly helped me find the following query:
MATCH (n1:S)
OPTIONAL MATCH (n1)-[:R]->(n2:R)<-[:R]-(n3:E)
OPTIONAL MATCH (n3t)-[:R]->(n4:R:L)
WHERE n3t = n3
RETURN labels(n1), labels(n3t), labels(n4);
The above query is a replacement for the following:
MATCH (n1:S)
OPTIONAL MATCH (n1)-[:R]->(n2:R)<-[:R]-(n3:E)-[:R]->(n4:R:L)
RETURN labels(n1), labels(n3), labels(n4);
And I have to use the first one because in my data there's the possibility that n2 and n4 are the same node, and since Neo4j refuses to take the same node twice, it will return null.
While the first query is valid and works, its performance is really bad. It forces the database to restart the search over the whole dataset, and only at the end match up the selected nodes using n3t = n3. To give you a hint of how bad it is: on a dataset on the order of 200k nodes, it takes 5 seconds to return the result, while if I omit the second OPTIONAL MATCH and its WHERE, the result is generated in less than 10 milliseconds for the same query. If anyone's interested, here's the execution plan for the query:
The right branch is the part I mentioned earlier (where I tried to fool Neo4j into taking a node a second time). As you can see, 2M db hits were incurred just to make Neo4j take a node a second time. The actual query for this execution plan is:
PROFILE MATCH (n5_1:Revision:`Account`)<-[:RevisionOf]-(n5_2:Entity:`Account`)
WITH n5_2, n5_1
ORDER BY n5_1.customer_number ASC
LIMIT 100
OPTIONAL MATCH (n5_1)-[:`Main Contact`]->(n4_1:Wrapper)<-[:Wrapper]-(:Revision:`Contact`)<-[:RevisionOf]-(n4_2:Entity:`Contact`)
OPTIONAL MATCH (n4_4)-[:RevisionOf]->(n4_3:Revision:Latest:`Contact`:Active)
WHERE (n4_2) = (n4_4)
RETURN n5_1, n5_2, n4_1, n4_2, n4_3
So my question is: how can I write a Cypher query in which a node is taken a second time without the performance suffering?
For some example data and testbed, please go to the other question.
I posted on your linked question an answer that should give the result table you described. If that fits what you're looking for, this query uses the same approach, and may be the solution for this question:
PROFILE MATCH (n5_1:Revision:`Account`)<-[:RevisionOf]-(n5_2:Entity:`Account`)
WITH n5_2, n5_1
ORDER BY n5_1.customer_number ASC
LIMIT 100
OPTIONAL MATCH (n5_1)-[:`Main Contact`]->(n4_1:Wrapper)<-[:Wrapper]-(:Revision:`Contact`)<-[:RevisionOf]-(n4_2:Entity:`Contact`)
WHERE (n4_2)-[:RevisionOf]->(:Revision:Latest:`Contact`:Active)
OPTIONAL MATCH (n4_2)-[:RevisionOf]->(n4_3:Revision:Latest:`Contact`:Active)
RETURN n5_1, n5_2, n4_1, n4_2, n4_3
This keeps the n4_2 in your last OPTIONAL MATCH, which should solve the performance issue you observed.
As you noted in your previous question, you want to avoid circumstances where the first OPTIONAL MATCH succeeds, but the second fails, leaving the variables as non-null from the first OPTIONAL MATCH when you don't want them to be.
We solve that issue by adding a WHERE after the first OPTIONAL MATCH, forcing the match to succeed only if the pattern you're looking for in the second OPTIONAL MATCH exists off of the last node (this will work even if such a pattern reuses relationships and nodes from the OPTIONAL MATCH).
You can additionally try collecting the tail, prepending a null placeholder so that rows without a tail match survive the UNWIND and filtering:
PROFILE
MATCH (n1:S)
OPTIONAL MATCH (n1)-[:R]->(n2:R)<-[:R]-(n3:E)
WITH n1, [null] + ( (n3)-[:R]->(:R:L) ) as tail
WITH n1, tail, size(tail) as tailSize
UNWIND tail as t
WITH n1, tailSize, t WHERE (tailSize = 2 AND NOT t is NULL) OR tailSize = 1
WITH n1, nodes(t) as nds
WITH n1, nds[0] as n3t, nds[1] as n4
RETURN labels(n1), labels(n3t), labels(n4)

Delete a connected graph with Cypher

I want to delete a connected graph related to a particular node in a Neo4j database using Cypher. The use case is to delete a "start" node together with every node reachable from it by some path. To keep transactions bounded, the query has to be iterative, and it must not disconnect the connected graph (otherwise later iterations could no longer reach the remaining nodes).
Until now I am using this query:
OPTIONAL MATCH (start {indexed_prop: $PARAM})--(toDelete)
OPTIONAL MATCH (toDelete)--(toBind)
WHERE NOT(id(start) = id(toBind)) AND NOT((start)--(toBind))
WITH start, collect(toBind) AS TO_BIND, toDelete limit 10000
DETACH DELETE toDelete
WITH start, TO_BIND
UNWIND TO_BIND AS b
CREATE (start)-[:HasToDelete]->(b)
And call it until the number of deleted nodes equals 0.
Is there a better query for this ?
You could try a mark-and-delete approach, which is similar to how you would detach and delete the entire connected graph with a variable-length match, but instead of DETACH DELETE you apply a :TO_DELETE label.
Something like this (making up a label to use for the start node, as otherwise it has to comb the entire db looking for a node with the indexed param):
MATCH (start:StartNodeLabel {indexed_prop: $PARAM})-[*]-(toDelete)
SET toDelete:TO_DELETE
If that blows up your heap, you can run it multiple times, with the added predicate WHERE NOT toDelete:TO_DELETE before the SET, and using a combination of LIMIT and/or a limit on the depth of the variable-length relationship.
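A sketch of that batched variant (the depth cap of 5 and batch size of 50000 are arbitrary placeholders to tune):
MATCH (start:StartNodeLabel {indexed_prop: $PARAM})-[*..5]-(toDelete)
WHERE NOT toDelete:TO_DELETE
WITH DISTINCT toDelete LIMIT 50000
SET toDelete:TO_DELETE
RETURN count(*)
Rerun it (raising the depth cap if necessary) until it returns 0.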
When you're sure you've labeled every connected node, then it's just a matter of deleting every node with the :TO_DELETE label, and you can run that iteratively, or use the APOC procedure apoc.periodic.commit() to handle that in batches.
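If APOC is available, a minimal sketch of that batched deletion with apoc.periodic.commit (the statement is repeated in fresh transactions until it returns 0):
CALL apoc.periodic.commit(
  "MATCH (n:TO_DELETE) WITH n LIMIT $limit DETACH DELETE n RETURN count(*)",
  {limit: 10000})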

Neo4j 2.3.2 - unable to start database

I have a database populated with about 81MB of CSV data.
The data has some implicit relationships that I wanted to explicitly create, so I ran the following command:
with range(0,9) as numbers
unwind numbers as n
match (ks:KbWordSequence) where ks.kbid ends with tostring(n)
match (kt:KbTextWord {kbid: ks.kbid})
create (kt)-[:SEQUENCE]->(ks)
create (ks)-[:TEXT]->(kt)
On running the code I started to see lots of these messages in the .log file:
2016-03-19 19:27:30.740+0000 WARN [o.n.k.i.c.MonitorGc] GC Monitor: Application threads blocked for 9149ms.
After seeing these GC messages for a while, and seeing the process take up 6 GB of RAM, I killed the Windows process and went to try creating the relationships again.
When I did that I got the following error and the database wouldn't start.
Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase#1dc6ce1' was successfully initialized, but failed to start. Please see attached cause exception.
There's no error in the .log file or any other corresponding message I can see.
Other examples of this kind of error corresponded to a Neo4j db version mismatch, which isn't the case in my situation.
How would I recover from this condition?
I guess the transaction grows too large, since this statement triggers a global operation. First, understand the size of the intended operation:
with range(0,9) as numbers
unwind numbers as n
match (ks:KbWordSequence) where ks.kbid ends with tostring(n)
match (kt:KbTextWord {kbid: ks.kbid})
return count(*)
As a rule of thumb, ~10k to 100k atomic operations is a good transaction size. With that in mind, apply SKIP and LIMIT to control the transaction size:
with range(0,9) as numbers
unwind numbers as n
match (ks:KbWordSequence) where ks.kbid ends with tostring(n)
match (kt:KbTextWord {kbid: ks.kbid})
with ks, kt skip 0 limit 50000
create (kt)-[:SEQUENCE]->(ks)
create (ks)-[:TEXT]->(kt)
return count(*)
and run this statement a couple of times until you get back a value of 0.
Depending on the actual use case, there might be even more efficient approaches that avoid SKIP entirely and detect the not-yet-processed nodes directly in the MATCH, as sketched below.
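A sketch of that idea: filter out pairs that already have the relationship, so each run only touches unprocessed nodes (this assumes :SEQUENCE and :TEXT are always created together, as in the statement above):
with range(0,9) as numbers
unwind numbers as n
match (ks:KbWordSequence) where ks.kbid ends with tostring(n)
match (kt:KbTextWord {kbid: ks.kbid})
where not (kt)-[:SEQUENCE]->(ks)
with ks, kt limit 50000
create (kt)-[:SEQUENCE]->(ks)
create (ks)-[:TEXT]->(kt)
return count(*)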

Best way to delete all nodes and relationships in Cypher

What is the best way to cleanup the graph from all nodes and relationships via Cypher?
At http://neo4j.com/docs/stable/query-delete.html#delete-delete-a-node-and-connected-relationships the example
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n,r
has the note:
This query isn’t for deleting large amounts of data
So, is the following better?
MATCH ()-[r]-() DELETE r
and
MATCH (n) DELETE n
Or is there another way that is better for large graphs?
As you've mentioned, the easiest way is to stop Neo4j, delete the data/graph.db folder, and restart it.
Deleting a large graph via Cypher will always be slower, but it is still doable if you use a proper transaction size to prevent memory issues (remember transactions are built up in memory first, before they get committed). Typically 50-100k atomic operations per transaction is a good idea. You can add a limit to your deletion statement to control tx size and report back how many nodes were deleted. Rerun this statement until it returns 0:
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
WITH n,r LIMIT 50000
DELETE n,r
RETURN count(n) as deletedNodesCount
According to the official documentation:
MATCH (n)
DETACH DELETE n
but the documentation also says "This query isn't for deleting large amounts of data", so it's better to use it with a limit:
match (n)
with n limit 10000
DETACH DELETE n;
I wrote this little script and added it to my NEO/bin folder.
Tested on v3.0.6 community
#!/bin/sh
echo Stopping neo4j
./neo4j stop
echo Erasing ALL data
rm -rf ../data/databases/graph.db
./neo4j start
echo Done
I use it when my LOAD CSV imports are crappy.
Hope it helps
What is the best way to clean up the graph from all nodes and relationships via Cypher?
I've outlined four options below that are current as of July 2022:
Option 1: MATCH (x) DETACH DELETE x
Option 2: CALL {} IN TRANSACTIONS
Option 3: delete data directories
Option 4: delete in code
Option 1: MATCH (x) DETACH DELETE x - works only with small data sets
As you posted in your question, the following works fine, but only if there aren't too many nodes and relationships:
MATCH (x) DETACH DELETE x
If the number of nodes and/or relationships is high enough, this won't work. Here's what "not working" looks like against http://localhost:7474/browser/:
There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.
And here's what shows up in the neo4j console output (or in the logs, if you have logging enabled):
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "neo4j.Scheduler-1"
Option 2: CALL {} IN TRANSACTIONS - does not work as of July 2022
An alternative, available since 4.4 according to neo4j docs, is to use a new CALL {} IN TRANSACTIONS feature:
With 4.4 and newer versions you can utilize the CALL {} IN TRANSACTIONS syntax [...] to delete subsets of the matched records in batches until the full delete is complete
Unfortunately, this doesn't work in my tests. Here's an example attempting to delete relationships only:
MATCH ()-[r]-()
CALL { WITH r DELETE r }
IN TRANSACTIONS OF 1000 ROWS
Running that in browser results in this error:
A query with 'CALL { ... } IN TRANSACTIONS' can only be executed in an implicit transaction, but tried to execute in an explicit transaction.
In code, it produces the same result. Here's an attempt connecting via bolt in Java:
session.executeWrite(tx -> tx.run("MATCH (x) " +
"CALL { WITH x DETACH DELETE x } " +
"IN TRANSACTIONS OF 10000 ROWS"));
which results in this error, identical to what the browser showed:
org.neo4j.driver.exceptions.DatabaseException: A query with 'CALL { ... } IN TRANSACTIONS' can only be executed in an implicit transaction, but tried to execute in an explicit transaction.
at org.neo4j.driver.internal.util.Futures.blockingGet(Futures.java:111)
at org.neo4j.driver.internal.InternalTransaction.run(InternalTransaction.java:58)
at org.neo4j.driver.internal.AbstractQueryRunner.run(AbstractQueryRunner.java:34)
Looking at the documentation for Transactions, it states: "Transactions can be either explicit or implicit." What's the difference? From that same doc:
Explicit transactions:
Are opened by the user.
Can execute multiple Cypher queries in sequence.
Are committed, or rolled back, by the user.
Implicit transactions, sometimes called auto-commit transactions or :auto transactions:
Are opened automatically.
Can execute a single Cypher query.
Are committed automatically when the query finishes successfully.
I can't determine from docs or experimentation how to open an implicit transaction (and thus be able to use the 'CALL { ... } IN TRANSACTIONS' structure), so this is apparently a dead end.
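One caveat to that conclusion: the docs describe an :auto prefix that makes Neo4j Browser run a query in an implicit (auto-commit) transaction, and drivers run single queries the same way via session.run() rather than a transaction function. A sketch of the Browser form, which I haven't verified in these tests:
:auto MATCH (x)
CALL { WITH x DETACH DELETE x }
IN TRANSACTIONS OF 10000 ROWS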
In a recent Neo4j AuraDB Office Hours posted May 31, 2022, they tried using this same feature in AuraDB. It didn't work for them either, though the behavior was different from what I've observed in Neo4j Community. I'm guessing they'll address this at some point; it feels like a bug, but at least for now it's another confirmation that 'CALL { ... } IN TRANSACTIONS' is not the way forward.
Option 3: delete data directories - works with any size data set
This is the easiest, most straightforward mechanism that actually works:
stop the server
manually delete data directories
restart the server
Here's what that looks like:
% ./bin/neo4j stop
% rm -rf data/databases data/transactions
% ./bin/neo4j start
This is pretty simple. You could write a script to capture this as a single command.
Option 4: delete in code - works with any size data set
The manual-delete option works fine, but I needed a way to delete all nodes and relationships in code. Below is a minimal Java program that handles deletion of all nodes and relationships, regardless of how many there are.
This works in Neo4j Community 4.4.3, and since I'm using only basic functionality (no extensions), I assume this would work across a range of other Neo4j versions, and probably AuraDB, too.
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

public class DeleteAll {
    public static void main(String[] args) {
        String boltUri = "...";
        String user = "...";
        String password = "...";
        try (Driver driver = GraphDatabase.driver(boltUri, AuthTokens.basic(user, password));
             Session session = driver.session()) {
            int count = 1;
            while (count > 0) {
                // delete a batch; consume() keeps the Result inside the transaction
                session.executeWrite(tx -> tx.run("MATCH (x) WITH x LIMIT 1000 DETACH DELETE x").consume());
                // count what's left; single() is read inside the transaction function
                count = session.executeRead(tx -> tx.run("MATCH (x) RETURN COUNT(x)").single().values().get(0).asInt());
            }
        }
    }
}
optional match (n)-[p:owner_real_estate_relation]->() with n,p LIMIT 1000 delete p
In a test run, this deleted 50000 relationships, completing after 589 ms.
I performed several tests and the best combination was:
call apoc.periodic.iterate(
  "MATCH p=()-[r]->() RETURN r LIMIT 5000000",
  "DELETE r",
  {batchSize: 10000, parallel: true})
(this code deleted 300,000,000 relationships in 3251s)
It is worth noting that using the "parallel" parameter drastically reduces the time, though parallel relationship deletes take locks on both end nodes, so watch for lock contention on dense graphs.
This was for Neo4j 4.4.1 on an AWS EC2 m5.xlarge instance:
neo4j:
  resources:
    memory: 29000Mi
  configs:
    dbms.memory.heap.initial_size: "20G"
    dbms.memory.heap.max_size: "20G"
    dbms.memory.pagecache.size: "5G"
