neo4j - Create relationships between all nodes in database (Out of memory) - neo4j

I have a neo4j database with ~260000 (EDIT: Incorrect by order of magnitude previously, missing 0) nodes of genes, something along the lines of:
example_nodes: sourceId, targetId
with an index on both sourceId and targetId
I am trying to build the relationships between all the nodes but am constantly running into OOM issues. I've increased my JVM heap size to -Xmx4096m and dbms.memory.pagecache.size=16g on a system with 16G of RAM.
I am assuming I need to optimize my query because it simply cannot complete in any of its current forms. However, I have tried the following three to no avail:
MATCH (start:example_nodes),(end:example_nodes) WHERE start.targetId = end.sourceId CREATE (start)-[r:CONNECT]->(end) RETURN r
(On a subset of 5000 nodes, the query above completes in a matter of seconds. It does of course warn: This query builds a cartesian product between disconnected patterns.)
MATCH (start:example_nodes) WITH start MATCH (end:example_nodes) WHERE start.targetId = end.sourceId CREATE (start)-[r:CONNECT]->(end) RETURN r
OPTIONAL MATCH (start:example_nodes) WITH start MATCH (end:example_nodes) WHERE start.targetId = end.sourceId CREATE (start)-[r:CONNECT]->(end) RETURN r
Any ideas how this query could be optimized to succeed would be much appreciated.
--
Edit
In a lot of ways I feel that while the apoc library does indeed solve the memory issues, the function could be optimized if it were to run along the lines of this incredibly simple pseudocode:
for each start_gene:
    create relationship to end_gene where start_gene.targetId = end_gene.sourceId
    move on to the next start_gene once the relationship has been created
But I am unsure how to achieve this in cypher.
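For intuition, the pseudocode above is essentially a hash join: build a lookup table on sourceId once, then walk the start genes, which is what the index lets Neo4j do instead of a cartesian product. A toy Python sketch (the node dicts and field values here are illustrative, not the actual data model):

```python
# Build a lookup keyed by sourceId, then do a single pass over the
# start genes -- O(n) with a table instead of O(n^2) without one.
from collections import defaultdict

def link_genes(genes):
    """Return (start, end) id pairs where start['targetId'] == end['sourceId']."""
    by_source = defaultdict(list)
    for gene in genes:
        by_source[gene["sourceId"]].append(gene)

    pairs = []
    for start in genes:
        for end in by_source.get(start["targetId"], []):
            pairs.append((start["id"], end["id"]))
    return pairs

# Made-up miniature dataset: three genes forming a cycle.
genes = [
    {"id": 1, "sourceId": "A", "targetId": "B"},
    {"id": 2, "sourceId": "B", "targetId": "C"},
    {"id": 3, "sourceId": "C", "targetId": "A"},
]
print(link_genes(genes))  # [(1, 2), (2, 3), (3, 1)]
```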

You can use apoc library for batching.
call apoc.periodic.commit("
  MATCH (start:example_nodes), (end:example_nodes)
  WHERE NOT (start)-[:CONNECT]->(end)
    AND id(start) > id(end)
    AND start.targetId = end.sourceId
  WITH start, end LIMIT {limit}
  CREATE (start)-[:CONNECT]->(end)
  RETURN count(*)
", {limit:5000})
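Conceptually, apoc.periodic.commit re-runs the statement in separate transactions until it returns 0, so only {limit} rows are in flight at a time. A rough Python model of that loop (the counter stands in for the graph; no Neo4j API is involved):

```python
# Toy model of apoc.periodic.commit: keep re-running the inner
# statement in its own "transaction" until it reports 0 rows done.
def periodic_commit(process_batch, limit):
    total = 0
    while True:
        done = process_batch(limit)  # stands in for RETURN count(*)
        if done == 0:
            return total
        total += done

pending = {"n": 12345}  # relationships still to create

def process_batch(limit):
    batch = min(limit, pending["n"])
    pending["n"] -= batch
    return batch

print(periodic_commit(process_batch, 5000))  # 12345, in three batches
```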

Related

Neo4j - Variable length greater 11 runs forever and query never returns

I'm lost and have tried everything I can think of. Maybe you can help me.
I'm trying to find all dependencies for a given software package. In this special case I'm working with the Node.js / JavaScript ecosystem and scraped the whole npm registry. My data model is simple, I've got packages and a package can have multiple versions.
In my database I have 113.339.030 dependency relationships and 19.753.269 versions.
My whole code worked fine until I found a package that has so many dependencies (direct and transitive) that all my queries break down. It's called react-scripts. Here you can see the package information:
https://registry.npmjs.org/react-scripts
One visualizer never finishes
https://npm.anvaka.com/#/view/2d/react-scripts
and another one creates a dependency graph so big it's hard to analyze.
https://npmgraph.js.org/?q=react-scripts
At first I tried PostgreSQL with recursive common table expression.
with recursive cte as (
    select child_id
    from dependencies
    where dependencies.parent_id = 16674850
  union
    select dependencies.child_id
    from cte
    left join dependencies on cte.child_id = dependencies.parent_id
    where cte.child_id is not null
)
select * from cte;
That returns 1.726 elements, which seems to be OK. https://deps.dev/npm/react-scripts/4.0.3/dependencies returns 1.445 dependencies.
However, I'd like to get the path to the nodes, and that doesn't work well with PostgreSQL and UNION. You'd have to use UNION ALL, but the query would be much more complicated and slower. That's why I thought I'd give Neo4j a chance.
My nodes have the properties
version_id: integer
name: string
version: string
I'm starting with what I thought would be a simple query but it's already failing.
Start with version that has version_id 16674850 and give me all its dependencies.
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
return DISTINCT b;
I have an index on version_id.
CREATE INDEX FOR (version:Version) ON (version.version_id)
That works until I set the variable-length depth to 12 or greater.
Then the query runs forever. Here is the query plan.
Neo4j runs inside Docker. I've increased some memory settings.
- NEO4J_dbms_memory_heap_initial__size=2G
- NEO4J_dbms_memory_heap_max__size=2G
- NEO4J_dbms_memory_pagecache_size=1G
Any ideas? I'm really lost right now and don't want to give up on my "software dependency analysis graph".
I spent the last 6 weeks on this problem.
Thank you very much!
Edit 28/09/2021
I uploaded a sample data set. Here are the links
https://s3.amazonaws.com/blog.spolytics.com/versions.csv (737.1 MB)
https://s3.amazonaws.com/blog.spolytics.com/dependencies.csv (1.7 GB)
Here is the script to import the data.
neo4j-admin import \
--database=deps \
--skip-bad-relationships \
--id-type=INTEGER \
--nodes=Version=import/versions.csv \
--relationships=DEPENDS_ON=import/dependencies.csv
That might help to do some experiments on your side and to reproduce my problem.
The trouble here is that Cypher is interested in finding all possible paths that match a pattern. That can make it problematic for cases where you just want distinct reachable nodes, where you really don't care about expanding to every distinct path, but just finding nodes and ignoring any alternate paths leading to nodes already visited.
Also, the planner is making a bad choice with that cartesian product plan, that can make the problem worse.
I'd recommend using APOC Procedures for this, as there are procs that are optimized for expanding to distinct nodes and ignoring paths to those already visited. apoc.path.subgraphNodes() is the procedure.
Here's an example of use:
MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version', maxLevel:11}) YIELD node as b
RETURN b
The arrow in the relationship filter indicates direction, and since it's pointing right it refers to traversing outgoing relationships. If we were interested in traversing incoming relationships instead, we would have the arrow at the start of the relationship name, pointing to the left.
For the label filter, the prefixed > means the label is an end label, meaning that we are only interested in returning the nodes of that given label.
You can remove the maxLevel config property if you want it to be an unbounded expansion.
More options and details here:
https://neo4j.com/labs/apoc/4.1/graph-querying/expand-subgraph-nodes/
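To see the difference concretely, here's a toy Python contrast between distinct-node expansion (what apoc.path.subgraphNodes does conceptually) and all-paths matching (what an unrestricted variable-length pattern resembles). The graph is a made-up chain of "diamonds"; each diamond doubles the path count while adding only three nodes:

```python
from collections import deque

def reachable(graph, start, max_depth):
    """BFS with a visited set: each node is expanded at most once."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen - {start}

def count_paths(graph, node, max_depth):
    """Counts every distinct path, like MATCH p = (a)-[*..n]->(b)."""
    if max_depth == 0:
        return 0
    total = 0
    for nxt in graph.get(node, []):
        total += 1 + count_paths(graph, nxt, max_depth - 1)
    return total

graph = {}
for i in range(10):  # ten stacked diamonds: 0 -> {1,2} -> 3 -> {4,5} -> 6 ...
    graph[3 * i] = [3 * i + 1, 3 * i + 2]
    graph[3 * i + 1] = [3 * i + 3]
    graph[3 * i + 2] = [3 * i + 3]

print(len(reachable(graph, 0, 30)))  # 30 distinct nodes
print(count_paths(graph, 0, 30))     # 4092 paths to visit them
```

With real npm-style fan-out the path count explodes far faster than this, which is why the variable-length query hangs at depth 12 while a visited-set expansion stays cheap.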
I don't have a large dataset like yours, but I think you could bring the number of paths down by filtering the b nodes. Does this work, as a start?
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
UNWIND nodes(p) AS node
return COUNT(DISTINCT node)
To check if you can return longer paths, you could do
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN count(p)
Now if that works, I would do :
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN p LIMIT 10
to see whether the paths are correct.
Sometimes UNWIND causes an issue. To get the set of unique nodes, you could also try APOC:
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN apoc.coll.toSet(apoc.coll.flatten(COLLECT(nodes(p)))) AS unique_nodes

How optimised is this Cypher query?

[Edit] I'm using Neo4j 4.2.1
I have this need for a Cypher query that brings back a complete tree given its root node. All nodes and relationships must be fetched and present only once in the returned sets. Here's what I have come to:
MATCH p = (n)-[*..]->(m)
WHERE id(n) = 0
WITH relationships(p) AS r
WITH DISTINCT last(r) AS rel
WITH [startNode(rel), endNode(rel)] AS tmp, rel
UNWIND tmp AS node
RETURN collect(DISTINCT node) AS nodes, collect(DISTINCT rel) AS relationships;
Running the query on our database to get about 820 nodes makes the thing crash for lack of memory (5GB allowed). Hard to believe. So I'm wondering: is this query ill-conceived? Is there a technique I'm using that shouldn't be used for my purpose?
I strongly recommend that you come up with a node property that is guaranteed to be the same on all the nodes in a contiguous tree, if you don't have one already. I'll call that property same_prop. Here's what I do to run queries like the one you're running:
Index same_prop. If you have different node labels, then you need this index created for each different node label you expect to have in the tree.
CREATE INDEX samepropnode FOR (n:your_label) ON (n.same_prop)
is the kind of thing you need in Neo4j 4+. In Neo4j, indices are cheap, and can sometimes speed up queries quite a bit.
Collect all possible values of same_prop and store them in a text file (I use tab-separated values as safer than comma-separated values).
Use the Python driver, or your language of choice that has a Neo4j driver written (strongly recommend Neo4j-provided drivers, not third-party) to write wrapper code that executes a Cypher query something like this:
MATCH (p)-->(c)
USING INDEX p:your_label(same_prop)
WHERE p.same_prop IN [ same_prop_list ]
RETURN DISTINCT
p.datapiece1 AS `first_parent_datapiece`,
p.datapiecen AS `nth_parent_datapiece`,
c.datapiece1 AS `first_child_datapiece`,
c.datapiecen AS `nth_child_datapiece`
It's not a good idea, in general, to return nodes and relationships unless you're debugging.
Then in your Python (for example) code, you're simply going to read in all your same_prop values from the file you got in Step 2, chunk up the values in reasonable size chunks, maybe 1,000 or 10,000, and substitute them in for the [ same_prop_list ] in the Cypher query on-the-fly.
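That chunking step could look roughly like this in Python (the query text is from the answer above; `build_query` and the placeholder values are illustrative, and the actual execution via a Neo4j driver session is omitted):

```python
# Read same_prop values, split them into fixed-size chunks, and build
# one parameterized query per chunk. Passing the list as a parameter
# ($same_prop_list) is the driver-friendly form of substituting it in.
def chunks(values, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(values), size):
        yield values[i:i + size]

def build_query(same_prop_list):
    query = (
        "MATCH (p)-->(c) "
        "USING INDEX p:your_label(same_prop) "
        "WHERE p.same_prop IN $same_prop_list "
        "RETURN DISTINCT p.datapiece1, c.datapiece1"
    )
    return query, {"same_prop_list": same_prop_list}

# Made-up values standing in for the file from Step 2.
values = [f"prop_{i}" for i in range(25)]
print([len(batch) for batch in chunks(values, 10)])  # [10, 10, 5]
```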

How to use With clause for Neo4j Cypher subquery formulation?

I am trying to create a simple Cypher query that should find all instances in the graph matching roughly this structure: (BlogPost A) -> (Term) <- (BlogPost B). This means I am trying to find all pairs of blog posts that are flagged with the same term, and moreover to count the number of shared terms. A term is a mechanism of categorization in this context.
Here is my query proposal:
MATCH (blogA:content {entitySubType:'blog'})
WITH blogA MATCH (blogA) -[]-> (t:term) <-[]- (blogB:content)
WHERE blogB.entitySubType='blog' AND NOT (ID(blogA) = ID(blogB))
RETURN ID(blogA), ID(blogB), count(t) ;
This query ends with null after ~1 day.
Is the usage of blogA in the subquery not possible in the way I am using it? When using the same query with limits I do get results:
MATCH (blogA:content {entitySubType:'blog'})
WITH blogA
LIMIT 10
MATCH (blogA) -[]-> (t:term) <-[]- (blogB:content)
WHERE blogB.entitySubType='blog' AND NOT (ID(blogA) = ID(blogB))
RETURN ID(blogA), ID(blogB), count(t)
LIMIT 20;
My Neo4j instance has ~500GB RAM and the whole graph including all properties is ~30GB with ~15 million vertices in total, of which 101k are blog vertices and 108k are terms.
I would be grateful for every hint about possible problems or suggestions for improvements.
Also make sure to consume that query with a client driver (e.g. Java) that can stream the billions of results. Here is a query that would use the compiled runtime which should be fastest and most memory efficient.
MATCH (blogA:Blog)-[:TAGGED]->(t:Term)<-[:TAGGED]-(blogB:Blog)
WHERE blogA <> blogB
RETURN ID(blogA), ID(blogB), count(t);
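In miniature, the pair counting that this query performs looks like the following (made-up data; `permutations` yields the ordered pairs that the symmetric pattern produces, so each pair shows up in both directions just as in the query's result):

```python
# For each term, every ordered pair of distinct blogs tagged with it
# contributes 1; the total per (blogA, blogB) is their shared-term count.
from collections import Counter
from itertools import permutations

tagged = {  # term -> blogs tagged with it (illustrative data)
    "neo4j": ["blog1", "blog2", "blog3"],
    "cypher": ["blog1", "blog2"],
}

pair_counts = Counter()
for term, blogs in tagged.items():
    for a, b in permutations(blogs, 2):  # ordered pairs, a != b
        pair_counts[(a, b)] += 1

print(pair_counts[("blog1", "blog2")])  # 2 shared terms
print(pair_counts[("blog1", "blog3")])  # 1 shared term
```

The output size is what matters here: with 101k blogs the number of co-tagged pairs can be enormous, which is why streaming the results matters as much as the query plan.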

Neo4j stuck when create relation with big data

I'm using this Cypher query to create relationships between two sets of nodes in Neo4j:
MATCH (first:FIRSTNODE)
with first
MATCH (second:SECONDNODE)
WHERE first.ID = second.ID
CREATE (first)-[:RELATION]->(second)
first has 100.000 nodes and second has 1.100.000 nodes.
I imported the CSV files and then created indexes on the two labels, but when I try to run the relationship query Neo4j gets stuck and stops working.
I noticed that CPU usage goes to 100% when this happens.
I'm working with an 8x4.0GHz CPU, 10GB of RAM, and an SSD.
Do you know something that can help me to resolve this problem?
EDIT 1:
Using apoc.periodic.commit it works. But if I then run a second query like this:
call apoc.periodic.commit("
MATCH (third:THIRDNODE)
WHERE NOT (third)-[:RELATION2]->()
WITH third LIMIT {limit}
MATCH (second:SECONDNODE)
WHERE third.ID = second.ID2
CREATE (third)-[:RELATION2]->(second)
RETURN count(*)
", {limit:10000})
it gets stuck again.
You can try using apoc.periodic.commit from APOC Procedures. The docs for this procedure say:
apoc.periodic.commit(statement,params) - runs the given statement in
separate transactions until it returns 0
Install APOC Procedures and try it:
call apoc.periodic.commit("
  MATCH (first:FIRSTNODE)
  WHERE NOT (first)-[:RELATION]->()
  WITH first LIMIT {limit}
  MATCH (second:SECONDNODE)
  WHERE first.ID = second.ID
  CREATE (first)-[:RELATION]->(second)
  RETURN count(*)
", {limit:10000})
Remember to install APOC Procedures according to the version of Neo4j you are using. Take a look at the version compatibility matrix.

Inserting a Relation into Neo4j using MERGE or MATCH runs forever

I am experimenting with Neo4j using a simple dataset of locations. A location can have a relation to another location.
a:Location - [rel] - b:Location
I already have the locations in the database (roughly 700.000+ Location entries)
Now I wanted to add the relation data (170M Edges), but I wanted to experiment with the import logic with a smaller set first, so I basically picked 2 nodes that are in the set and tried to create a relationship as follows.
MERGE p =(a:Location {locationid: 3616})-[w:WikiLink]->(b:Location {locationid: 467501})
RETURN p;
and also tried the approach directly from the docs:
MATCH (a:Person),(b:Person)
WHERE a.name = 'Node A' AND b.name = 'Node B'
CREATE (a)-[r:RELTYPE { name : a.name + '<->' + b.name }]->(b)
RETURN r
I tried using a directional merge, an undirectional merge, etc. I basically tried multiple variants of the above queries and the result is: they run forever, seeming not to complete even after 15 minutes. Which is very odd.
Indexes
ON :Location(locationid) ONLINE (for uniqueness constraint)
Constraints
ON (location:Location) ASSERT location.locationid IS UNIQUE
This is what I am currently using:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///edgelist.csv' AS line WITH line
MATCH (a:Location {locationid: toInt(line.locationidone)}), (b:Location {locationid: toInt(line.locationidtwo)})
MERGE (a)-[w:WikiLink {weight: toFloat(line.edgeweight)}]-(b)
RETURN COUNT(w);
If you look at the terminal output below you can see Neo4j reports 258ms query execution time; the real time, however, is somewhat above that. This query already takes a few seconds too many in my opinion (the machine this runs on has 48GB RAM, 16 cores, and is relatively new).
I am currently running this query with LIMIT 1000 (before it was LIMIT 1), but the script has already been running for a few minutes. I wonder if I have to switch from MERGE to CREATE. The problem is, I cannot understand the call graph that EXPLAIN gives me in order to determine the bottleneck.
time /usr/local/neo4j/bin/neo4j-shell -file import-relations.cql
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| p |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [Node[758609]{title:"Tehran",locationid:3616,locationlabel:"NIL"},:WikiLink[9422418]{weight:1.2282325516616477E-7},Node[917147]{title:"Khorugh",locationid:467501,locationlabel:"city"}] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row
Relationships created: 1
Properties set: 1
258 ms
real 0m1.417s
user 0m1.497s
sys 0m0.158s
If you haven't:
create constraint on loc:Location assert loc.locationid is unique;
Then find both nodes, and create the relationship.
MATCH (a:Location {locationid: 3616}),(b:Location {locationid: 467501})
MERGE p = (a)-[w:WikiLink]->(b)
RETURN p;
or if the locations don't exist yet:
MERGE (a:Location {locationid: 3616})
MERGE (b:Location {locationid: 467501})
MERGE p = (a)-[w:WikiLink]->(b)
RETURN p;
You should also use parameters if you do that from a program.
Have you indexed the Location nodes on locationid?
CREATE INDEX ON :Location(locationid)
I had a similar problem adding edges to a graph and indexing the nodes led to the linking running over 150x faster.
If the nodes aren't indexed neo4j will do a serial search for the two nodes to link together.
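The difference the index makes is essentially a linear scan versus a hash lookup per node; a toy sketch with made-up data:

```python
# Contrast an unindexed lookup (linear scan) with an indexed one
# (hash lookup). Data and field names are illustrative only.
def serial_search(nodes, locationid):
    """O(n) per lookup -- roughly what a label scan without an index does."""
    for node in nodes:
        if node["locationid"] == locationid:
            return node
    return None

def build_index(nodes):
    """One-time O(n) build, then O(1) per lookup -- like a schema index."""
    return {node["locationid"]: node for node in nodes}

nodes = [{"locationid": i, "title": f"loc{i}"} for i in range(10000)]
index = build_index(nodes)

print(serial_search(nodes, 9999)["title"])  # loc9999
print(index[9999]["title"])                 # loc9999
```

Per edge to link, the unindexed case pays that scan twice (once per endpoint), which is where the reported 150x speedup comes from.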
USING PERIODIC COMMIT <value>:
Specifies the number of records (rows) to be committed in a transaction. Since you have a lot of RAM, it is good to use a value greater than 100000. This will reduce the number of transactions committed and might further reduce the overall time.
