Neo4j stuck when create relation with big data - neo4j

i'm using this cypher query to create relationship between two nodes in Neo4j
MATCH (first:FIRSTNODE)
with first
MATCH (second:SECONDNODE)
WHERE first.ID = second.ID
CREATE (first)-[:RELATION]->(second)
first has 100.000 of nodes and second has 1.100.000 nodes.
I have imported the csv file and then i've created index of the two tables; but when i try to run the query with the relation neo4j got stuck and stop working.
I noticed that the cpu usage goes at 100% when this happens.
I'm working with an cpu of 8x4.0Ghz and 10Gb of ram and an SSD.
Do you know something that can help me to resolve this problem?
EDIT 1:
Using apoc.periodic.commit it works. But if then i run a second query like this:
call apoc.periodic.commit("
MATCH (third:THIRDNODE)
WHERE NOT (third)-[:RELATION2]->()
WITH third LIMIT {limit}
MATCH (second:SECONDNODE)
WHERE third.ID = second.ID2
CREATE (third)-[:RELATION2]->(second)
RETURN count(*)
", {limit:10000})
it got stuck again

You can try using apoc.periodic.commit from APOC Procedures. The docs about this procedure says:
apoc.periodic.commit(statement,params) - runs the given statement in
separate transactions until it returns 0
Install APOC Procedures and try it:
call apoc.periodic.commit("
MATCH (first:FIRSTNODE),
WHERE NOT (first)-[:RELATION]->()
WITH first LIMIT {limit}
MATCH (second:SECONDNODE)
WHERE first.ID = second.ID
CREATE (first)-[:RELATION]->(second)
RETURN count(*)
", {limit:10000})
Remember to install APOC procedures according the version of Neo4j you are using. Take a look in the version compatibility matrix.

Related

Neo4j - Variable length greater 11 runs forever and query never returns

I'm lost and tried everything I can think of. Maybe you can help me.
I'm trying to find all dependencies for a given software package. In this special case I'm working with the Node.js / JavaScript ecosystem and scraped the whole npm registry. My data model is simple, I've got packages and a package can have multiple versions.
In my database I have 113.339.030 dependency relationships and 19.753.269 versions.
My whole code works fine until I found a package that has so many dependencies (direct and transitive) that all my queries break down. It's called react-scripts. Here you can see the package information.
https://registry.npmjs.org/react-scripts
One visualizer never finishes
https://npm.anvaka.com/#/view/2d/react-scripts
and another one creates a dependency graph so big it's hard to analyze.
https://npmgraph.js.org/?q=react-scripts
At first I tried PostgreSQL with recursive common table expression.
with recursive cte as (
select
child_id
from
dependencies
where
dependencies.parent_id = 16674850
union
select
dependencies.child_id
from
cte
left join dependencies on
cte.child_id = dependencies.parent_id
where
cte.child_id is not null
)
select * from cte;
That returns 1.726 elements which seems to be OK. https://deps.dev/npm/react-scripts/4.0.3/dependencies returns 1.445 dependencies.
However I'd like to get the path to the nodes and that doesn't work well with PostgreSQL and UNION. You'd have to use UNION ALL but the query will be much more complicated and slower. That's why I thought I'd give Neo4j a chance.
My nodes have the properties
version_id: integer
name: string
version: string
I'm starting with what I thought would be a simple query but it's already failing.
Start with version that has version_id 16674850 and give me all its dependencies.
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
return DISTINCT b;
I have an index on version_id.
CREATE INDEX FOR (version:Version) ON (version.version_id)
That works until I set the depth to variable length to or greater 12.
Then the query runs forever. Here is the query plan.
Neo4j runs inside Docker. I've increased some memory settings.
- NEO4J_dbms_memory_heap_initial__size=2G
- NEO4J_dbms_memory_heap_max__size=2G
- NEO4J_dbms_memory_pagecache_size=1G
Any ideas? I'm really lost right now and don't want to give up on my "software dependency analysis graph".
I spent the last 6 weeks on this problem.
Thank you very much!
Edit 28/09/2021
I uploaded a sample data set. Here are the links
https://s3.amazonaws.com/blog.spolytics.com/versions.csv (737.1 MB)
https://s3.amazonaws.com/blog.spolytics.com/dependencies.csv (1.7 GB)
Here is the script to import the data.
neo4j-admin import \
--database=deps \
--skip-bad-relationships \
--id-type=INTEGER \
--nodes=Version=import/versions.csv \
--relationships=DEPENDS_ON=import/dependencies.csv
That might help to do some experiments on your side and to reproduce my problem.
The trouble here is that Cypher is interested in finding all possible path that match a pattern. That can make it problematic for cases when you just want distinct reachable nodes, where you really don't care about expanding to every distinct path, but just finding nodes and ignoring any alternate paths leading to nodes already visited.
Also, the planner is making a bad choice with that cartesian product plan, that can make the problem worse.
I'd recommend using APOC Procedures for this, as there are procs that are optimized to expanding to distinct nodes and ignoring paths to those already visited. apoc.path.subgraphNodes() is the procedure.
Here's an example of use:
MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version', maxLevel:11}) YIELD node as b
RETURN b
The arrow in the relationship filter indicates direction, and since it's pointing right it refers to traversing outgoing relationships. If we were interested in traversing incoming relationships instead, we would have the arrow at the start of the relationship name, pointing to the left.
For the label filter, the prefixed > means the label is an end label, meaning that we are only interested in returning the nodes of that given label.
You can remove the maxLevel config property if you want it to be an unbounded expansion.
More options and details here:
https://neo4j.com/labs/apoc/4.1/graph-querying/expand-subgraph-nodes/
I don’t have a large dataset like yours, but I think you could bring the number of paths down by filtering the b nodes. Does this work , as a start?
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
UNWIND nodes(p) AS node
return COUNT(DISTINCT node)
To check if you can return longer paths, you could do
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN count(p)
Now if that works, I would do :
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN p LIMIT 10
to see whether the paths are correct.
Sometimes UNWIND is causing n issue. To get the set of unique nodes, you could also try APOC
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN apoc.coll.toSet(
apoc.coll.flatten(
COLLECT(nodes(p))
)
) AS unique nodes

neo4j - Create relationships between all nodes in database (Out of memory)

I have a neo4j database with ~260000 (EDIT: Incorrect by order of magnitude previously, missing 0) nodes of genes, something along the lines of:
example_nodes: sourceId, targetId
with an index on both sourceId and targetId
I am trying to build the relationships between all the nodes but am constantly running into OOM issues. I've increased my JVM heap size to -Xmx4096m and dbms.memory.pagecache.size=16g on a system with 16G of RAM.
I am assuming I need to optimize my query because it simply cannot complete in any of its current forms. However, I have tried the following three to no avail:
MATCH (start:example_nodes),(end:example_nodes) WHERE start.targetId = end.sourceId CREATE (start)-[r:CONNECT]->(end) RETURN r
(on a subset of the 5000 nodes, this query above completes in only a matter of seconds. It does of course warn: This query builds a cartesian product between disconnected patterns.)
MATCH (start:example_nodes) WITH start MATCH (end:example_nodes) WHERE start.targetId = end.sourceId CREATE (start)-[r:CONNECT]->(end) RETURN r
OPTIONAL MATCH (start:example_nodes) WITH start MATCH (end:example_nodes) WHERE start.targetId = end.sourceId CREATE (start)-[r:CONNECT]->(end) RETURN r
Any ideas how this query could be optimized to succeed would be much appreciated.
--
Edit
In a lot of ways I feel that while the apoc libary does indeed solve the memory issues, the function could be optimized if it were to run along the lines of this incredibly simple pseudocode:
for each start_gene
create relationship to end_gene where start_gene.targetId = end_gene.source_id
move on to next once relationship has been created
But I am unsure how to achieve this in cypher.
You can use apoc library for batching.
call apoc.periodic.commit("
MATCH (start:example_nodes),(end:example_nodes) WHERE not (start)-[:CONNECT]->(end) and id(start) > id(end) AND start.targetId =
end.sourceId
with start,end limit {limit}
CREATE (start)-[:CONNECT]->(end)
RETURN count(*)
",{limit:5000})

How to batch Neo4j Cypher queries

So i have over 130M nodes of one type and 500K nodes of another type, i am trying to create relationships between them as follows:
MATCH (p:person)
MATCH (f:food) WHERE f.name=p.likes
CREATE (p)-[l:likes]->(f)
The problem is there are 130M relationships created and i would like to do it in a similar fashion to PERIODIC COMMIT when using LOAD CSV
Is there such a functionality for my type of query?
Yes, there is. You'll need the APOC Procedures library installed (download here). You'll be using the apoc.periodic.commit() function in the Job Management section. From the documentation:
CALL apoc.periodic.commit(statement, params) - repeats a batch update
statement until it returns 0, this procedure is blocking
You'll be using this in combination with the LIMIT clause, passing the limit value as the params.
However, for best results, you'll want to make sure your join data (f.name, I think) has an index or a unique constraint to massively cut down on the time.
Here's how you might use it (assuming from your example that a person only likes one food, and that we should only apply this to :persons that don't already have the relationship set):
CALL apoc.periodic.commit("
MATCH (p:person)
WHERE p.likes IS NOT NULL
AND NOT (p)-[:likes]->(:food)
WITH p LIMIT {limit}
MATCH (f:food) WHERE p.likes = f.name
CREATE (p)-[:likes]->(f)
RETURN count(*)
", {limit: 10000})

Best way to delete all nodes and relationships in Cypher

What is the best way to cleanup the graph from all nodes and relationships via Cypher?
At http://neo4j.com/docs/stable/query-delete.html#delete-delete-a-node-and-connected-relationships the example
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n,r
has the note:
This query isn’t for deleting large amounts of data
So, is the following better?
MATCH ()-[r]-() DELETE r
and
MATCH (n) DELETE n
Or is there another way that is better for large graphs?
As you've mentioned the most easy way is to stop Neo4j, drop the data/graph.db folder and restart it.
Deleting a large graph via Cypher will be always slower but still doable if you use a proper transaction size to prevent memory issues (remember transaction are built up in memory first before they get committed). Typically 50-100k atomic operations is a good idea. You can add a limit to your deletion statement to control tx sizes and report back how many nodes have been deleted. Rerun this statement until a value of 0 is returned back:
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
WITH n,r LIMIT 50000
DELETE n,r
RETURN count(n) as deletedNodesCount
According to the official document here:
MATCH (n)
DETACH DELETE n
but it also said This query isn’t for deleting large amounts of data. so it's better use with limit.
match (n)
with n limit 10000
DETACH DELETE n;
Wrote this little script, added it in my NEO/bin folder.
Tested on v3.0.6 community
#!/bin/sh
echo Stopping neo4j
./neo4j stop
echo Erasing ALL data
rm -rf ../data/databases/graph.db
./neo4j start
echo Done
I use it when my LOAD CSV imports are crappy.
Hope it helps
What is the best way to clean up the graph from all nodes and relationships via Cypher?
I've outlined four options below that are current as of July 2022:
Option 1: MATCH (x) DETACH DELETE x
Option 2: CALL {} IN TRANSACTIONS
Option 3: delete data directories
Option 4: delete in code
Option 1: MATCH (x) DETACH DELETE x - works only with small data sets
As you posted in your question, the following works fine, but only if there aren't too many nodes and relationships:
MATCH (x) DETACH DELETE x
If the number of nodes and/or relationships is high enough, this won't work. Here's what "not working" looks like against http://localhost:7474/browser/:
There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.
And here's what
shows up in neo4j console output (or in logs, if you have that enabled):
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "neo4j.Scheduler-1"
Option 2: CALL {} IN TRANSACTIONS - does not work as of July 2022
An alternative, available since 4.4 according to neo4j docs, is to use a new CALL {} IN TRANSACTIONS feature:
With 4.4 and newer versions you can utilize the CALL {} IN TRANSACTIONS syntax [...] to delete subsets of the matched records in batches until the full delete is complete
Unfortunately, this doesn't work in my tests. Here's an example attempting to delete relationships only:
MATCH ()-[r]-()
CALL { WITH r DELETE r }
IN TRANSACTIONS OF 1000 ROWS
Running that in browser results in this error:
A query with 'CALL { ... } IN TRANSACTIONS' can only be executed in an implicit transaction, but tried to execute in an explicit transaction.
In code, it produces the same result. Here's an attempt connecting via bolt in Java:
session.executeWrite(tx -> tx.run("MATCH (x) " +
"CALL { WITH x DETACH DELETE x } " +
"IN TRANSACTIONS OF 10000 ROWS"));
which results in this error, identical to what the browser showed:
org.neo4j.driver.exceptions.DatabaseException: A query with 'CALL { ... } IN TRANSACTIONS' can only be executed in an implicit transaction, but tried to execute in an explicit transaction.
at org.neo4j.driver.internal.util.Futures.blockingGet(Futures.java:111)
at org.neo4j.driver.internal.InternalTransaction.run(InternalTransaction.java:58)
at org.neo4j.driver.internal.AbstractQueryRunner.run(AbstractQueryRunner.java:34)
Looking at the documentation for Transactions, it states: "Transactions can be either explicit or implicit." What's the difference? From that same doc:
Explicit transactions:
Are opened by the user.
Can execute multiple Cypher queries in sequence.
Are committed, or rolled back, by the user.
Implicit transactions, sometimes called auto-commit transactions or :auto transactions:
Are opened automatically.
Can execute a single Cypher query.
Are committed automatically when the query finishes successfully.
I can't determine from docs or experimentation how to open an implicit transaction (and thus, to be able to use 'CALL { ... } IN TRANSACTIONS' structure), so this is apparently a dead end.
In a recent Neo4j AuraDB Office Hours posted May 31, 2022, they tried using this same feature in AuraDB. It didn't work for them either, though the behavior was different from what I've observed in Neo4j Community. I'm guessing they'll address this at some point, feels like a bug, but at least for now it's another confirmation that
'CALL { ... } IN TRANSACTIONS' is not the way forward.
Option 3: delete data directories - works with any size data set
This is the easiest, most straightforward mechanism that actually works:
stop the server
manually delete data directories
restart the server
Here's what that looks like:
% ./bin/neo4j stop
% rm -rf data/databases data/transactions
% ./bin/neo4j start
This is pretty simple. You could write a script to capture this as a single command.
Option 4: delete in code - works with any size data set
Below is a minimal Java program that handles deletion of all nodes and relationships, regardless of how many.
The manual-delete option works fine, but I needed a way to delete all nodes and relationships in code.
This works in Neo4j Community 4.4.3, and since I'm using only basic functionality (no extensions), I assume this would work across a range of other Neo4j versions, and probably AuraDB, too.
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
public static void main(String[] args) throws InterruptedException {
String boltUri = "...";
String user = "...";
String password = "...";
Session session = GraphDatabase.driver(boltUri, AuthTokens.basic(user, password)).session();
int count = 1;
while (count > 0) {
session.executeWrite(tx -> tx.run("MATCH (x) WITH x LIMIT 1000 DETACH DELETE x"));
count = session.executeWrite(tx -> tx.run("MATCH (x) RETURN COUNT(x)").single().values().get(0).asInt());
}
}
optional match (n)-[p:owner_real_estate_relation]->() with n,p LIMIT 1000 delete p
In test run, deleted 50000 relationships, completed after 589 ms.
I performed several tests and the best combination was
`call apoc.periodic.iterate("MATCH p=()-[r]->() RETURN r,p LIMIT 5000000;","DELETE r;", {batchSize:10000, parallel: true}`)
(this code deleted 300,000,000 relationships in 3251s)
It is worth noting that using the "parallel" parameter drastically reduces the time.
This for Neo4j 4.4.1
AWS EC2: m5.xlarge
neo4j:
resources:
memory: 29000Mi
configs:
dbms.memory.heap.initial_size: "20G"
dbms.memory.heap.max_size: "20G"
dbms.memory.pagecache.size: "5G"

How to delete large amount of nodes in cypher

I'm trying to delete 1 million nodes in cyphper at one query using web admin(i.e localhost:7474/browser).
These nodes is labeled as User. I ran following query, then returned Unknown error after waiting about 1minutes.
match (u:User) delete u
This query returned Unknown error every time. and I confirm my PC resources didn't lack.
I'm using Neo4j version 2.0.0 RC1 community edition. and Neo4j Hosted on local.
Is My trying way for deletion nodes wrong?
Thanks
You should do write operations with a reasonable transaction size of ~10-50k atomic operations. Therefore you can use limit and run the statement until all users are gone:
match (u:User) with u limit 1000 delete u
With Neo4j 3.x and forward you can run large delete transactions using APOC too:
call apoc.periodic.iterate("MATCH (u:User) return u", "DETACH DELETE u", {batchSize:1000})
yield batches, total return batches, total
I've found that just removing the neo4j/data folder is the fastest way to delete the db.

Resources