I have a database populated with about 81MB of CSV data.
The data has some implicit relationships that I wanted to explicitly create, so I ran the following command:
with range(0,9) as numbers
unwind numbers as n
match (ks:KbWordSequence) where ks.kbid ends with tostring(n)
match (kt:KbTextWord {kbid: ks.kbid})
create (kt)-[:SEQUENCE]->(ks)
create (ks)-[:TEXT]->(kt)
On running the code I started to see lots of these messages in the .log file:
2016-03-19 19:27:30.740+0000 WARN [o.n.k.i.c.MonitorGc] GC Monitor: Application threads blocked for 9149ms.
After seeing these GC messages for a while, and seeing the process take up 6G of RAM, I killed the Windows process and went to try creating the relationships again.
When I did that I got the following error and the database wouldn't start.
Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase#1dc6ce1' was successfully initialized, but failed to start. Please see attached cause exception.
There's no error in the .log file or any other corresponding message I can see.
Other examples of this kind of error corresponded to a Neo4j db version mismatch, which isn't the case in my situation.
How would I recover from this condition?
I guess the transaction grows too large, since this statement seems to trigger a global operation. First, understand the size of the intended operation:
with range(0,9) as numbers
unwind numbers as n
match (ks:KbWordSequence) where ks.kbid ends with tostring(n)
match (kt:KbTextWord {kbid: ks.kbid})
return count(*)
As a rule of thumb, ~10k to 100k atomic operations per transaction is a good size. With that in mind, apply skip and limit to control the transaction size:
with range(0,9) as numbers
unwind numbers as n
match (ks:KbWordSequence) where ks.kbid ends with tostring(n)
match (kt:KbTextWord {kbid: ks.kbid})
with ks, kt skip 0 limit 50000
create (kt)-[:SEQUENCE]->(ks)
create (ks)-[:TEXT]->(kt)
return count(*)
and run this statement a couple of times, increasing skip by the limit on each run, until you get back a value of 0.
Depending on the actual use case there may be even more efficient approaches that avoid skip altogether and detect the not-yet-processed nodes directly in the match.
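One sketch of such an approach, assuming an existing :SEQUENCE relationship reliably marks a pair as already processed: filter those pairs out in the match, so skip is no longer needed and the statement can simply be re-run unchanged until it returns 0:
with range(0,9) as numbers
unwind numbers as n
match (ks:KbWordSequence) where ks.kbid ends with tostring(n)
match (kt:KbTextWord {kbid: ks.kbid})
where not (kt)-[:SEQUENCE]->(ks)
with ks, kt limit 50000
create (kt)-[:SEQUENCE]->(ks)
create (ks)-[:TEXT]->(kt)
return count(*)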
I am trying to compare two fairly complex Neo4j databases. To that end, I created this script to look at relations:
//REL-COUNT-A-B
match (n)-[r]->(m)
WITH distinct type(r) as REL, count(r) as REL_COUNT, labels(n) as LA, labels(m) as LB
RETURN REL, REL_COUNT, apoc.coll.sort(LA) as LABELS_A, apoc.coll.sort(LB) as LABELS_B order by REL asc
Unfortunately, I get this back:
Neo.DatabaseError.Statement.ExecutionFailed
NOT PART OF CHAIN! RelationshipTraversalCursor[id=4155363, open state with: denseNode=true, next=4155363, , underlying record=Relationship[4155363,used=true,source=3888733,target=5731,type=217,sPrev=4155261,sNext=4155354,tCount=53,tNext=4149012,prop=45167148,!sFirst, tFirst]]
This 'smells like' a db inconsistency, so my reaction was to run a consistency check via neo4j-admin.bat check-consistency --database=neo4j.
This runs for a while, and the check simply stops at around 20%. I've tried a few times with the same result, with this partial log:
Index file: C:\Users\jac\AppData\Local\Neo4j\Relate\Data\dbmss\dbms-a8a1d3cd-3308-41a3-bd5e-0c8cad3d0e82\data\databases\neo4j\schema\index\native-btree-1.0\41\index-41.
.....2022-12-05 22:20:24.924+0000 WARN [o.n.c.ConsistencyCheckService] Index was dirty on startup which means it was not shutdown correctly and need to be cleaned up with a successful recovery.
Index file: C:\Users\jac\AppData\Local\Neo4j\Relate\Data\dbmss\dbms-a8a1d3cd-3308-41a3-bd5e-0c8cad3d0e82\data\databases\neo4j\schema\index\native-btree-1.0\42\index-42.
...... 100%
Consistency check
.................... 10%
.................... 20%
..........Consistency checking failed.Full consistency check did not complete
Any insights from anyone other than 'rebuild the indexes and constraints'?
Thanks
Version info:
{
"edition" : "enterprise",
"version" : "3.2.2"
}
I have a Neo4j database with several million instances of label U and label D. Every U is connected to exactly one D by relationship WITH_D. Several Us may share the same D. My goal is to get a D and a list of all Us connected to it.
Why is it that this first query hangs for an indefinite amount of time...
match (d:D)<-[:WITH_D]-(u:U)
return d, collect(u) limit 1
Whereas this one returns immediately in a few ms?
match (d:D) with d limit 1
match (d)<-[:WITH_D]-(u:U)
return d, collect(u)
The query plan for the first involves node-by-label scan yielding millions of nodes, then "Expand all" yielding millions of nodes, whereas the second one is a node-by-label scan with a filter down to one node, and then "Expand all".
It seems like there are issues with the way limits are handled, i.e. in some cases it is simply not lazy enough.
This leads to a lot of unwieldy subqueries to avoid non-terminating queries. With a database nearing 1 billion nodes, I have encountered this issue many times. Any clues?
Thanks
I believe that the main point here is the place where you are using the LIMIT 1 in the query.
In the first query you are MATCHing ALL possible patterns between :D and :U nodes first. Only at the end of the query is the result limited to 1. That is: you are matching all patterns and using LIMIT "as a filter" over the entire result.
In the second query you are MATCHing :D nodes, limiting to one. Afterwards, you are getting all :Us connected to this single node. That is: the first MATCH is finalized as soon as the first :D node is found. So the LIMIT is applied at read time, not only after the entire result has been computed.
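If you want to confirm this on your own data, prefixing either query with profile shows how many rows flow through each operator; for the rewritten query, the plan should show a single :D row feeding the Expand:
profile
match (d:D) with d limit 1
match (d)<-[:WITH_D]-(u:U)
return d, collect(u)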
I have a neo4j database with ~260000 (EDIT: Incorrect by order of magnitude previously, missing 0) nodes of genes, something along the lines of:
example_nodes: sourceId, targetId
with an index on both sourceId and targetId
I am trying to build the relationships between all the nodes but am constantly running into OOM issues. I've increased my JVM heap size to -Xmx4096m and dbms.memory.pagecache.size=16g on a system with 16G of RAM.
I am assuming I need to optimize my query because it simply cannot complete in any of its current forms. However, I have tried the following three to no avail:
MATCH (start:example_nodes),(end:example_nodes) WHERE start.targetId = end.sourceId CREATE (start)-[r:CONNECT]->(end) RETURN r
(on a subset of the 5000 nodes, this query above completes in only a matter of seconds. It does of course warn: This query builds a cartesian product between disconnected patterns.)
MATCH (start:example_nodes) WITH start MATCH (end:example_nodes) WHERE start.targetId = end.sourceId CREATE (start)-[r:CONNECT]->(end) RETURN r
OPTIONAL MATCH (start:example_nodes) WITH start MATCH (end:example_nodes) WHERE start.targetId = end.sourceId CREATE (start)-[r:CONNECT]->(end) RETURN r
Any ideas how this query could be optimized to succeed would be much appreciated.
--
Edit
In a lot of ways I feel that while the apoc library does indeed solve the memory issues, the function could be optimized if it were to run along the lines of this incredibly simple pseudocode:
for each start_gene
create relationship to end_gene where start_gene.targetId = end_gene.source_id
move on to next once relationship has been created
But I am unsure how to achieve this in cypher.
You can use the apoc library for batching.
call apoc.periodic.commit("
MATCH (start:example_nodes),(end:example_nodes) WHERE not (start)-[:CONNECT]->(end) and id(start) > id(end) AND start.targetId = end.sourceId
with start,end limit {limit}
CREATE (start)-[:CONNECT]->(end)
RETURN count(*)
",{limit:5000})
I would like to delete all the nodes of a certain label by executing
match (P:ALabel) delete P;
This returns the comment "No data returned." It also states how many nodes were deleted and how long it took (5767 ms). However, the shell seems to stop responding after this, and I am unable to execute any other commands.
I also used this command, encouraged from this answer:
match (n:ALabel)
optional match (n)-[r]-()
delete n, r;
Executing this command took slightly longer (16929 ms). It still does not return.
Depending on the amount of changes, you need to choose an appropriate transaction size; otherwise you'll see excessive garbage collection and/or OOM exceptions. Use the LIMIT clause and return the number of deleted nodes. Run this statement multiple times until 0 is returned:
match (n:ALabel)
with n limit 5000
optional match (n)-[r]-()
delete n,r
return count(distinct n)
Here the batch size is 5000 nodes.
What is the best way to clean up the graph and remove all nodes and relationships via Cypher?
At http://neo4j.com/docs/stable/query-delete.html#delete-delete-a-node-and-connected-relationships the example
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n,r
has the note:
This query isn’t for deleting large amounts of data
So, is the following better?
MATCH ()-[r]-() DELETE r
and
MATCH (n) DELETE n
Or is there another way that is better for large graphs?
As you've mentioned, the easiest way is to stop Neo4j, drop the data/graph.db folder, and restart it.
Deleting a large graph via Cypher will always be slower, but it is still doable if you use a proper transaction size to prevent memory issues (remember that transactions are built up in memory first before they get committed). Typically 50-100k atomic operations per transaction is a good idea. You can add a limit to your deletion statement to control tx sizes and report back how many nodes have been deleted. Rerun this statement until a value of 0 is returned:
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
WITH n,r LIMIT 50000
DELETE n,r
RETURN count(n) as deletedNodesCount
According to the official documentation:
MATCH (n)
DETACH DELETE n
but it also says "This query isn’t for deleting large amounts of data", so it's better used with a limit.
match (n)
with n limit 10000
DETACH DELETE n;
I wrote this little script and added it to my NEO/bin folder.
Tested on v3.0.6 community
#!/bin/sh
echo Stopping neo4j
./neo4j stop
echo Erasing ALL data
rm -rf ../data/databases/graph.db
./neo4j start
echo Done
I use it when my LOAD CSV imports are crappy.
Hope it helps
What is the best way to clean up the graph and remove all nodes and relationships via Cypher?
I've outlined four options below that are current as of July 2022:
Option 1: MATCH (x) DETACH DELETE x
Option 2: CALL {} IN TRANSACTIONS
Option 3: delete data directories
Option 4: delete in code
Option 1: MATCH (x) DETACH DELETE x - works only with small data sets
As you posted in your question, the following works fine, but only if there aren't too many nodes and relationships:
MATCH (x) DETACH DELETE x
If the number of nodes and/or relationships is high enough, this won't work. Here's what "not working" looks like against http://localhost:7474/browser/:
There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.
And here's what shows up in the neo4j console output (or in logs, if you have that enabled):
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "neo4j.Scheduler-1"
Option 2: CALL {} IN TRANSACTIONS - does not work as of July 2022
An alternative, available since 4.4 according to neo4j docs, is to use a new CALL {} IN TRANSACTIONS feature:
With 4.4 and newer versions you can utilize the CALL {} IN TRANSACTIONS syntax [...] to delete subsets of the matched records in batches until the full delete is complete
Unfortunately, this doesn't work in my tests. Here's an example attempting to delete relationships only:
MATCH ()-[r]-()
CALL { WITH r DELETE r }
IN TRANSACTIONS OF 1000 ROWS
Running that in browser results in this error:
A query with 'CALL { ... } IN TRANSACTIONS' can only be executed in an implicit transaction, but tried to execute in an explicit transaction.
In code, it produces the same result. Here's an attempt connecting via bolt in Java:
session.executeWrite(tx -> tx.run("MATCH (x) " +
"CALL { WITH x DETACH DELETE x } " +
"IN TRANSACTIONS OF 10000 ROWS"));
which results in this error, identical to what the browser showed:
org.neo4j.driver.exceptions.DatabaseException: A query with 'CALL { ... } IN TRANSACTIONS' can only be executed in an implicit transaction, but tried to execute in an explicit transaction.
at org.neo4j.driver.internal.util.Futures.blockingGet(Futures.java:111)
at org.neo4j.driver.internal.InternalTransaction.run(InternalTransaction.java:58)
at org.neo4j.driver.internal.AbstractQueryRunner.run(AbstractQueryRunner.java:34)
Looking at the documentation for Transactions, it states: "Transactions can be either explicit or implicit." What's the difference? From that same doc:
Explicit transactions:
Are opened by the user.
Can execute multiple Cypher queries in sequence.
Are committed, or rolled back, by the user.
Implicit transactions, sometimes called auto-commit transactions or :auto transactions:
Are opened automatically.
Can execute a single Cypher query.
Are committed automatically when the query finishes successfully.
I can't determine from docs or experimentation how to open an implicit transaction (and thus, to be able to use 'CALL { ... } IN TRANSACTIONS' structure), so this is apparently a dead end.
In a recent Neo4j AuraDB Office Hours posted May 31, 2022, they tried using this same feature in AuraDB. It didn't work for them either, though the behavior was different from what I've observed in Neo4j Community. I'm guessing they'll address this at some point, feels like a bug, but at least for now it's another confirmation that 'CALL { ... } IN TRANSACTIONS' is not the way forward.
Option 3: delete data directories - works with any size data set
This is the easiest, most straightforward mechanism that actually works:
stop the server
manually delete data directories
restart the server
Here's what that looks like:
% ./bin/neo4j stop
% rm -rf data/databases data/transactions
% ./bin/neo4j start
This is pretty simple. You could write a script to capture this as a single command.
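For example, a minimal sketch of such a script, assuming it is run from the Neo4j home directory and uses the default data layout:
#!/bin/sh
# Stop the server, remove all databases and transaction logs, then start again.
./bin/neo4j stop
rm -rf data/databases data/transactions
./bin/neo4j start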
Option 4: delete in code - works with any size data set
The manual-delete option works fine, but I needed a way to delete all nodes and relationships in code.
Below is a minimal Java program that handles deletion of all nodes and relationships, regardless of how many.
This works in Neo4j Community 4.4.3, and since I'm using only basic functionality (no extensions), I assume this would work across a range of other Neo4j versions, and probably AuraDB, too.
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

public class DeleteAllNodes {
    public static void main(String[] args) {
        String boltUri = "...";
        String user = "...";
        String password = "...";
        try (Driver driver = GraphDatabase.driver(boltUri, AuthTokens.basic(user, password));
             Session session = driver.session()) {
            int count = 1;
            while (count > 0) {
                // Delete a batch of up to 1000 nodes (and their relationships), then re-count.
                session.executeWrite(tx -> tx.run("MATCH (x) WITH x LIMIT 1000 DETACH DELETE x").consume());
                count = session.executeWrite(tx -> tx.run("MATCH (x) RETURN COUNT(x)").single().values().get(0).asInt());
            }
        }
    }
}
optional match (n)-[p:owner_real_estate_relation]->() with n,p LIMIT 1000 delete p
In a test run, this deleted 50000 relationships and completed after 589 ms.
I performed several tests and the best combination was
`call apoc.periodic.iterate("MATCH p=()-[r]->() RETURN r,p LIMIT 5000000;","DELETE r;", {batchSize:10000, parallel: true})`
(this code deleted 300,000,000 relationships in 3251s)
It is worth noting that using the "parallel" parameter drastically reduces the time.
This is for Neo4j 4.4.1.
AWS EC2: m5.xlarge
neo4j:
resources:
memory: 29000Mi
configs:
dbms.memory.heap.initial_size: "20G"
dbms.memory.heap.max_size: "20G"
dbms.memory.pagecache.size: "5G"