Delete every node and edge from Neo4j database - neo4j

I want to delete every node and edge of any type from a Neo4j database. There are different ways of deleting nodes and edges suggested on SO. However, since my database is huge, and since all these methods rely on first querying for edges/nodes and then deleting them, which leads to loading (at least their indexes) into memory, these methods fail for my use case with the out-of-memory error. See the following example.
match ()-[r]->() delete r
match (n) delete n
Neo.TransientError.General.OutOfMemoryError: There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.
For different reasons, I cannot increase the amount of configured memory.
One radical solution is to delete the database's files which would lead to deleting the database effectively (even resetting the indexes). In my use case, this approach has its downsides, e.g., some of our applications rely on a set import path to bulk load data (e.g., Neo4jDesktop\relate-data\dbmss\dbms-...\import\) where deleting and re-creating the database requires updating all those dependent applications.
I was wondering if there is any efficient approach other than these to delete all nodes and edges from a huge Neo4j database, ideally without needing to load or query the nodes/edges first.

If you are using Neo4j 4.3 or above, you can simply drop the database:
DROP DATABASE database_name IF EXISTS // best way
Or you can use the CALL { ... } IN TRANSACTIONS syntax (available from Neo4j 4.4), like this:
MATCH (n)
CALL { WITH n
DETACH DELETE n
} IN TRANSACTIONS OF 10000 ROWS;
The above query might not work in Neo4j Browser. To run it there, prefix it with :auto:
:auto MATCH (n)
CALL { WITH n
DETACH DELETE n
} IN TRANSACTIONS OF 10000 ROWS;
Finally, if your version is older than 4.x, you can try APOC, as suggested in another answer, or simply run this query repeatedly until it returns zero:
MATCH (n)
WITH n LIMIT 10000
DETACH DELETE n
RETURN count(*);
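The "run it until the output is zero" loop can also be scripted. The following is a minimal sketch, not a definitive implementation: `delete_all_in_batches` and `run_query` are hypothetical names, and `run_query` is assumed to be any callable you supply that sends one Cypher statement to the database in its own transaction and returns the count it reports.

```python
from typing import Callable


def build_batch_delete(batch_size: int) -> str:
    """Build the batch-delete statement from the answer for a given batch size."""
    return (
        f"MATCH (n) WITH n LIMIT {int(batch_size)} "
        "DETACH DELETE n RETURN count(*)"
    )


def delete_all_in_batches(run_query: Callable[[str], int],
                          batch_size: int = 10000) -> int:
    """Repeat the batch delete until a pass removes nothing.

    run_query is an assumption: it must execute one statement in its own
    transaction and return the number of deleted nodes.
    """
    query = build_batch_delete(batch_size)
    total = 0
    while True:
        deleted = run_query(query)
        if deleted == 0:
            return total
        total += deleted
```

Each pass commits separately, so heap usage should be bounded by the batch size rather than by the size of the whole graph.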

You can use the apoc.periodic.iterate procedure from APOC; the documentation explains the details:
https://neo4j.com/labs/apoc/4.1/overview/apoc.periodic/apoc.periodic.iterate/#usage-apoc.periodic.iterate
CALL apoc.periodic.iterate(
"MATCH (n) RETURN n",
"DETACH DELETE n",
{batchSize:10000, parallel:true})
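This deletes the nodes and their relationships in batches of 10,000.
Adjust the batch size to fit your available memory. Note that with parallel:true, concurrent batches that delete relationships on shared nodes may deadlock; if that happens, set parallel:false.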

Related

Neo4j Count Distinct Nodes returning more Nodes than the Total

I'm trying to count all the nodes in my graph where a specific relation does not happen.
I have 1816 nodes in my graph.
When I run the following query:
MATCH (n1)-[r]->(n2)
WHERE NOT (n1)-[:wikipedia]->(n2)
RETURN count(distinct n1)
Or:
MATCH (n)-[r]->()
WHERE NOT type(r)='wikipedia'
RETURN count(distinct n)
I get: 2202
Above even the number of nodes!
What is wrong?
Neo4j version 3.5.1
The fact that you found inconsistencies with the consistency check means your Neo4j database is corrupted. The first thing you should do is take it offline and back it up before attempting any restore/repair.
Once you have your backup, you could try deleting the "neostore.counts.db.*" files to force Neo4j to rebuild them, but I would not recommend it since by definition of Neo4j being in an invalid state, it is impossible to know how much actual damage there is (corrupt nodes and what not). I would recommend either restoring from an older backup (if you have any) or using a restore tool like store-utils to rebuild the whole database, throwing out any invalid nodes/relationships.

Neo4j long lasting query to be split/executed in smaller chunks?

My import.csv creates many nodes, and the merge step produces a huge cartesian product that now runs into a transaction timeout because the data has grown so much. I've currently set the transaction timeout to 1 second because every other query is very quick and is not supposed to take longer than one second to finish.
Is there a way to split or execute this specific query in smaller chunks to prevent a timeout?
Upping or disabling the transaction timeout in the neo4j.conf is not an option because the neo4j service needs a restart for every change made in the config.
The query hitting the timeout from my import script:
MATCH (l:NameLabel)
MATCH (m:Movie {id: l.id,somevalue: l.somevalue})
MERGE (m)-[:LABEL {path: l.path}]->(l);
Node counts: 1000 Movie, 2500 NameLabel
You can try installing APOC Procedures and using the procedure apoc.periodic.commit.
call apoc.periodic.commit("
MATCH (l:NameLabel)
WHERE NOT (l)<-[:LABEL]-(:Movie)
WITH l LIMIT {limit}
MATCH (m:Movie {id: l.id,somevalue: l.somevalue})
MERGE (m)-[:LABEL {path: l.path}]->(l)
RETURN count(*)
",{limit:1000})
The query above is executed repeatedly, in separate transactions, until it returns 0.
You can adjust the limit value ({limit: 1000}) as needed.
Note: remember to install APOC Procedures according to the version of Neo4j you are using. Take a look at the Version Compatibility Matrix.
The number of nodes and labels in your database suggest this is an indexing problem. Do you have constraints on both the Movie and Namelabel (which should be NameLabel since it is a node) nodes? The appropriate constraints should be in place and active.
Indexing and Performance
Make sure to have indexes and constraints declared and ONLINE for entities you want to MATCH or MERGE on
Always MATCH and MERGE on a single label and the indexed primary-key property
Prefix your load statements with USING PERIODIC COMMIT 10000
If possible, separate node creation from relationship creation into different statements
If your import is slow or runs into memory issues, see Mark’s blog post on Eager loading.
If your Movie nodes have unique names, declare a uniqueness constraint on that property with CREATE CONSTRAINT (see the docs).
If one of the nodes is not unique but will be used in a relationship definition, use the CREATE INDEX ON statement instead. With such a small dataset it may not be readily apparent how inefficient your queries are. Try the PROFILE command and see how many nodes are being searched. Your MERGE statement should only check a couple of nodes at each step.

Neo4j GC overhead limit

I built a Neo4j graph of about 5 GB. When I try to add a relationship to each node with a Cypher query like match (a)-[:know]-(b),(b)-[:know]-(c) merge (a)-[:maybe_know]-(c), I get a GC overhead limit error. I don't want to increase the memory for Neo4j. Is there some way to update the nodes step by step? For example, first 5000 nodes, then another 5000 nodes, and so on. Or do you have any other suggestions?
As @twobit says, limit your batches to something manageable, but also only match things that have not already been matched. That is, if a and c already know one another, or the maybe_know relationship has already been created between them, never match them again. You could also require the id of one to be greater than the id of the other, which ensures you don't make the same match twice (once in each direction).
match (a)-[:know]-(b),(b)-[:know]-(c)
where a <> c
and not (a)-[:know|maybe_know]-(c)
and id(a) > id(c)
with distinct a, c
limit 1000
merge (a)-[:maybe_know]-(c)
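To see why the id(a) > id(c) filter yields each pair only once, here is a small in-memory sketch of the same matching logic. The function name `candidate_pairs` and the edge-list input are hypothetical, plain integers stand in for Neo4j node ids, and the sketch assumes no self-loops and models only the `know` relationship (not previously created `maybe_know` ones).

```python
from itertools import permutations


def candidate_pairs(know_edges):
    """Return (a, c) pairs that share a neighbour, are not linked, and have a > c."""
    linked = {frozenset(edge) for edge in know_edges}
    neighbours = {}
    for x, y in know_edges:
        # Treat every know edge as undirected, as the query does.
        neighbours.setdefault(x, set()).add(y)
        neighbours.setdefault(y, set()).add(x)
    pairs = set()
    for around in neighbours.values():
        for a, c in permutations(around, 2):
            # a > c keeps one ordering of each pair, so (c, a) is never emitted.
            if a > c and frozenset((a, c)) not in linked:
                pairs.add((a, c))
    return pairs
```

Without the ordering test, every unordered pair would be visited twice, once as (a, c) and once as (c, a).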

Better Way to remove cycles from a path in neo4j graph

I am using Neo4j graph database version 2.1.7. Brief details about the data:
2 million nodes of 6 different types and 5 million relationships of only 5 different types. The graph is mostly connected but contains a few isolated subgraphs.
While resolving paths, I get cycles in the path. To prevent that, I used the solution shared in the question below:
Returning only simple paths in Neo4j Cypher query
Here is the query I am using:
MATCH (n:nodeA{key:905728})
MATCH path = n-[:rel1|rel2|rel3|rel4*0..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA)
WHERE ALL(a in nodes(path) where 1=length (filter (m in nodes(path) where m=a)))
and (length(EXTRACT (p in NODES(path)| p.key)) > 1)
and ((exists ((c)-[:rel5]->(b)) and (not exists((b)-[:rel1|rel2|rel3|rel4]->(:nodeA)) OR ANY (x in nodes(path) where (b)-[]->(x))))
OR (not exists ((c)-[:rel5]->()) and (not exists ((c)-[:rel1|rel2|rel3|rel4]->(:nodeA)) OR ANY (x in nodes(path) where (c)-[]->(x)))))
RETURN distinct EXTRACT (rp in Rels(path)| type(rp)), EXTRACT (p in NODES(path)| p.key);
The above query meets my requirement but is not cost-effective and keeps running when executed against a huge subgraph. I have used the PROFILE command to improve query performance from what I started with, but now I am stuck at this point. The performance has improved, but not to the level I expected from Neo4j :(
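For readers unfamiliar with the ALL(... 1=length(filter(...))) predicate in the query: it keeps a path only if every node occurs in it exactly once. A minimal sketch of the same check outside Cypher, with `is_simple_path` as a hypothetical helper name operating on a list of node keys:

```python
from collections import Counter


def is_simple_path(node_keys):
    """Return True if no node key occurs more than once in the path."""
    return all(count == 1 for count in Counter(node_keys).values())
```

The Cypher predicate rescans the whole node list for every node in the path, so its cost grows quadratically with path length, which is one reason long paths become expensive.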
I don't know that I have a solution, but I have a number of suggestions. Some might speed things up, some might just make the query easier to read.
Firstly, rather than putting exists ((c)-[:rel5]->(b)) in your WHERE, I believe you can put it in your MATCH like this:
MATCH path = n-[:rel1|rel2|rel3|rel4*0..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA), (c)-[:rel5]->(b)
I don't think you need the exists keyword. I think you can just say, for example, (NOT (b)-[:rel1|rel2|rel3|rel4]->(:nodeA))
I'd also suggest thinking about the WITH clause for potential performance improvements.
A couple of notes about your variable-length paths: in *0.. the 0 means that you're potentially matching a zero-length path, i.e. the start node itself. That may or may not be what you want. Also, leaving the variable-length path open-ended can often cause performance problems (as I think you're seeing). If you can possibly cap it, that may help.
Also, if you upgrade to 2.2.1, there are a number of built-in performance improvements in the 2.2.x line, and you also get visual PROFILE output in the console plus a new EXPLAIN command, which shows the execution plan without running the query (PROFILE runs the query and reports the actual db hits).
One thing to consider too is that I don't think you're hitting performance boundaries of Neo4j but rather, perhaps, some boundaries of Cypher. If so, I might suggest you do your querying with the Java APIs that Neo4j provides, for better performance and more control. This can be done either by embedding your database, if you're using a JVM-compatible language, or by writing an unmanaged extension, which lets you do your own querying in Java and expose a custom REST API from the server.
I made a couple more tweaks to my query, as suggested above by Brian, and found an improvement in query response time. It now takes almost 20% of the execution time of my original query, and it makes almost 60% fewer db hits than the query I shared earlier. The updated query is below:
MATCH (n:nodeA{key:905728})
MATCH path = n-[:rel1|rel2|rel3|rel4*1..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA)
WHERE ALL(a in nodes(path) where 1=length (filter (m in nodes(path) where m=a)))
and (length(path) > 0)
and ((exists ((c)-[:rel5]->(b)) and (not ((c)-[:rel1|rel2|rel3|rel4]->()) OR ANY (x in nodes(path) where (c)-[]->(x))))
OR (not exists ((c)-[:rel5]->()) and (not ((c)-[:rel1|rel2|rel3|rel4]->()) OR ANY (x in nodes(path) where (c)-[]->(x)))))
RETURN distinct EXTRACT (rp in Rels(path)| type(rp)), EXTRACT (p in NODES(path)| p.key);
I observed a dramatic improvement when I capped the path from *1.. to *1..15. I also removed one filter from the query, which was taking a long time as well.
However, the query response time increased when querying nodes whose relationships go more than 18-20 levels deep.
I would advise using the PROFILE command often to find the pain points in your query. That will help you resolve issues faster.
Thanks Brian.
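The effect of capping the depth can be illustrated with a small in-memory sketch. This is not how Neo4j evaluates the query; `simple_paths` and the adjacency-dict `graph` are hypothetical, and the function simply enumerates cycle-free paths up to `max_depth` hops so you can see how the cap bounds the search.

```python
def simple_paths(graph, start, max_depth):
    """Yield every cycle-free path from start with 1..max_depth hops."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        if len(path) > 1:
            yield path
        # Only extend the path while it is below the hop cap.
        if len(path) - 1 < max_depth:
            for nxt in graph.get(path[-1], ()):
                # Skipping visited nodes keeps every path simple.
                if nxt not in path:
                    stack.append(path + [nxt])
```

In a densely connected graph the number of candidate paths can grow very quickly with each extra hop, so a lower `max_depth` can prune a large part of the search.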

Neo4j why picking a single node and a single edge take so long?

I am trying to test the speed of Neo4j, so I created an empty database and then populated it with 10,000 users.
Now I run the following query
MATCH (n) RETURN id(n) LIMIT 1;
Surprisingly, it takes 1069 ms!
Then I run the following query (note: I haven't created any edges)
MATCH ()-[r]-() RETURN id(r) LIMIT 1;
which takes 1153ms!
Then I run
MATCH (n) RETURN id(n) SKIP 9900 LIMIT 100
which takes 10427ms.
Is this normal? I think those operations, at least the last one, are quite frequent in an app. I am using a MacBook Air with a 1.7GHz Core i5.
What version of Neo4j are you using?
How do you measure? The Neo4j browser measures multiple roundtrips for additional data.
Also is that the first or a subsequent query?
None of those queries should be that slow. Perhaps you can share your Neo4j configuration?
The first query (MATCH (n) RETURN id(n) LIMIT 1) should be really fast.
The second one goes over all the nodes (or even over the cross product) in your graph and tries to find a relationship; in the end it doesn't find any.
The third one (with SKIP 9900 LIMIT 100) should also be really fast.
Regarding your comment, if you know the first node, your search will be anchored and you don't have to scan all rels in the database.
MATCH (:User {name:"Han"})-[:FRIEND]->(friend)
RETURN friend

Resources