[Edit] I'm using Neo4j 4.2.1
I have this need for a Cypher query that brings back a complete tree given its root node. All nodes and relationships must be fetched and present only once in the returned sets. Here's what I have come to:
MATCH p = (n)-[*..]->(m)
WHERE id(n) = 0
WITH relationships(p) AS r
WITH distinct last(r) as rel
WITH [node IN [startNode(rel), endNode(rel)] | node] AS tmp, rel
UNWIND tmp AS node
RETURN collect(DISTINCT node) AS nodes, collect(distinct rel) AS relationships;
Running the query on our database to get about 820 nodes makes the thing crash for lack of memory (5Gb allowed). Hard to believe. So I'm wondering : Is this query ill-born? Is there one technique I'm using that shouldn't be used for my purpose?
I strongly recommend that you come up with a node property that is guaranteed to be the same on all the nodes in a contiguous tree, if you don't have one already. I'll call that property same_prop. Here's what I do to run queries like the one you're running:
Index same_prop. If you have different node labels, then you need this index created for each different node label you expect to have in the tree.
CREATE INDEX samepropnode FOR (n:your_label) ON (n.same_prop)
is the kind of thing you need in Neo4j 4+. In Neo4j, indices are cheap, and can sometimes speed up queries quite a bit.
Collect all possible values of same_prop and store them in a text file (I use tab-separated values as safer than comma-separated values).
Use the Python driver, or your language of choice that has a Neo4j driver written (strongly recommend Neo4j-provided drivers, not third-party) to write wrapper code that executes a Cypher query something like this:
MATCH (p)-->(c)
USING INDEX p:your_label(same_prop)
WHERE p.same_prop IN [ same_prop_list ]
RETURN DISTINCT
p.datapiece1 AS `first_parent_datapiece`,
p.datapiecen AS `nth_parent_datapiece`,
c.datapiece1 AS `first_child_datapiece`,
c.datapiecen AS `nth_child_datapiece`
It's not a good idea, in general, to return nodes and relationships unless you're debugging.
Then in your Python (for example) code, you're simply going to read in all your same_prop values from the file you got in Step 2, chunk up the values in reasonable size chunks, maybe 1,000 or 10,000, and substitute them in for the [ same_prop_list ] in the Cypher query on-the-fly.
Related
I'm lost and tried everything I can think of. Maybe you can help me.
I'm trying to find all dependencies for a given software package. In this special case I'm working with the Node.js / JavaScript ecosystem and scraped the whole npm registry. My data model is simple, I've got packages and a package can have multiple versions.
In my database I have 113.339.030 dependency relationships and 19.753.269 versions.
My whole code works fine until I found a package that has so many dependencies (direct and transitive) that all my queries break down. It's called react-scripts. Here you can see the package information.
https://registry.npmjs.org/react-scripts
One visualizer never finishes
https://npm.anvaka.com/#/view/2d/react-scripts
and another one creates a dependency graph so big it's hard to analyze.
https://npmgraph.js.org/?q=react-scripts
At first I tried PostgreSQL with recursive common table expression.
with recursive cte as (
select
child_id
from
dependencies
where
dependencies.parent_id = 16674850
union
select
dependencies.child_id
from
cte
left join dependencies on
cte.child_id = dependencies.parent_id
where
cte.child_id is not null
)
select * from cte;
That returns 1.726 elements which seems to be OK. https://deps.dev/npm/react-scripts/4.0.3/dependencies returns 1.445 dependencies.
However I'd like to get the path to the nodes and that doesn't work well with PostgreSQL and UNION. You'd have to use UNION ALL but the query will be much more complicated and slower. That's why I thought I'd give Neo4j a chance.
My nodes have the properties
version_id: integer
name: string
version: string
I'm starting with what I thought would be a simple query but it's already failing.
Start with version that has version_id 16674850 and give me all its dependencies.
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
return DISTINCT b;
I have an index on version_id.
CREATE INDEX FOR (version:Version) ON (version.version_id)
That works until I set the depth to variable length to or greater 12.
Then the query runs forever. Here is the query plan.
Neo4j runs inside Docker. I've increased some memory settings.
- NEO4J_dbms_memory_heap_initial__size=2G
- NEO4J_dbms_memory_heap_max__size=2G
- NEO4J_dbms_memory_pagecache_size=1G
Any ideas? I'm really lost right now and don't want to give up on my "software dependency analysis graph".
I spent the last 6 weeks on this problem.
Thank you very much!
Edit 28/09/2021
I uploaded a sample data set. Here are the links
https://s3.amazonaws.com/blog.spolytics.com/versions.csv (737.1 MB)
https://s3.amazonaws.com/blog.spolytics.com/dependencies.csv (1.7 GB)
Here is the script to import the data.
neo4j-admin import \
--database=deps \
--skip-bad-relationships \
--id-type=INTEGER \
--nodes=Version=import/versions.csv \
--relationships=DEPENDS_ON=import/dependencies.csv
That might help to do some experiments on your side and to reproduce my problem.
The trouble here is that Cypher is interested in finding all possible path that match a pattern. That can make it problematic for cases when you just want distinct reachable nodes, where you really don't care about expanding to every distinct path, but just finding nodes and ignoring any alternate paths leading to nodes already visited.
Also, the planner is making a bad choice with that cartesian product plan, that can make the problem worse.
I'd recommend using APOC Procedures for this, as there are procs that are optimized to expanding to distinct nodes and ignoring paths to those already visited. apoc.path.subgraphNodes() is the procedure.
Here's an example of use:
MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version', maxLevel:11}) YIELD node as b
RETURN b
The arrow in the relationship filter indicates direction, and since it's pointing right it refers to traversing outgoing relationships. If we were interested in traversing incoming relationships instead, we would have the arrow at the start of the relationship name, pointing to the left.
For the label filter, the prefixed > means the label is an end label, meaning that we are only interested in returning the nodes of that given label.
You can remove the maxLevel config property if you want it to be an unbounded expansion.
More options and details here:
https://neo4j.com/labs/apoc/4.1/graph-querying/expand-subgraph-nodes/
I don’t have a large dataset like yours, but I think you could bring the number of paths down by filtering the b nodes. Does this work , as a start?
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
UNWIND nodes(p) AS node
return COUNT(DISTINCT node)
To check if you can return longer paths, you could do
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN count(p)
Now if that works, I would do :
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN p LIMIT 10
to see whether the paths are correct.
Sometimes UNWIND is causing n issue. To get the set of unique nodes, you could also try APOC
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN apoc.coll.toSet(
apoc.coll.flatten(
COLLECT(nodes(p))
)
) AS unique nodes
I have a fairly deep tree that consists of an initial "transaction" node (call that the 0th layer of the tree), from which there are 50 edges to the next nodes (call it the 1st later of the tree), and then from each of those around 35 on average to the second layer, and so on...
The initial node is a :txnEvent and all the rest are :mEvent
mEvent nodes have 4 properties, one of them called channel_name
Now, I would like to retrieve all paths that go down to the 4th layer such that those paths contain a node with channel_name==A and also channel_name==B
This query:
match (n: txnEvent)-[r:TO*1..4]->(m:mEvent) return COUNT(*);
Is telling me there are only 1,667,444 paths to consider.
However, the following query:
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
AND ANY(k IN nodes(p) WHERE k.channel_name='B')
RETURN
EXTRACT (n in nodes(p) | n.channel_name),
EXTRACT (n in nodes(p) | n.step),
EXTRACT (n in nodes(p) | n.event_type),
EXTRACT (n in nodes(p) | n.event_device),
EXTRACT (r in relationships(p) | r.weight )
Takes almost 1 minute to execute (neo4j's UI on port 7474)
For completness, neo4j is telling me:
"Started streaming 125517 records after 2 ms and completed after 50789 ms, displaying first 1000 rows."
So I'm wondering whether there's something obvious I'm missing. All of the properties that nodes have are indexed by the way. Is the query slow, or is it fast and the streaming is slow?
UDATE:
This query, that doesn't stream data back:
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
AND ANY(k IN nodes(p) WHERE k.channel_name='B')
RETURN
COUNT(*)
Takes 35s, so even though it's faster, presumably because no data is returned, I feel it's still quite slow.
UPDATE 2:
Ideally this data should go into a jupyter notebook with a python kernel.
Thanks for the PROFILE plan.
Keep in mind that the query you're asking for is a difficult one to process. Since you want paths where at least one node in the path has one property and at least one other node in the path has another property, there is no way to prune paths during expansion. Instead, every possible path has to be determined, and then every node in each of those 1.6 million paths has to be accessed to check for the property (and that has to be done twice for each path, for both properties). Thus the ~10 million db hits for the filter operation.
You could try expanding your heap and pagecache sizes (if you have the RAM to spare), but I don't see any easy ways to tune this query.
As for your question about the query time vs streaming, the problem is the query itself. The message you saw means that the first result was found extremely quickly so the first result was ready in the stream almost immediately. Results are added to the stream as they're found, but the volume of paths needing to be matched and filtered with no ability to prune paths during expansion means it took a very long time for the query to complete.
I have a highly interconnected graph where starting from a specific node
i want to find all nodes connected to it regardless of the relation type, direction or length. What i am trying to do is to filter out paths that include a node more than 1 times. But what i get is a
Neo.DatabaseError.General.UnknownError: key not found: UNNAMED27
I have managed to create a much simpler database
in neo4j sandbox and get the same message again using the following data:
CREATE (n1:Person { pid:1, name: 'User1'}),
(n2:Person { pid:2, name: 'User2'}),
(n3:Person { pid:3, name: 'User3'}),
(n4:Person { pid:4, name: 'User4'}),
(n5:Person { pid:5, name: 'User5'})
With the following relationships:
MATCH (n1{pid:1}),(n2{pid:2}),(n3{pid:3}),(n4{pid:4}),(n5{pid:5})
CREATE (n1)-[r1:RELATION]->(n2),
(n5)-[r2:RELATION]->(n2),
(n1)-[r3:RELATION]->(n3),
(n4)-[r4:RELATION]->(n3)
The Cypher Query that causes this issue in the above model is
MATCH p= (n:Person{pid:1})-[*0..]-(m)
WHERE ALL(c IN nodes(p) WHERE 1=size(filter(d in nodes(p) where c.pid = d.pid)) )
return m
Can anybody see what is wrong with this query?
The error seems like a bug to me. There is a closed neo4j issue that seems similar, but it was supposed to be fixed in version 3.2.1. You should probably create a new issue for it, since your comments state you are using 3.2.5.
Meanwhile, this query should get the results you seem to want:
MATCH p=(:Person{pid:1})-[*0..]-(m)
WITH m, NODES(p) AS ns
UNWIND ns AS n
WITH m, ns, COUNT(DISTINCT n) AS cns
WHERE SIZE(ns) = cns
return m
You should strongly consider putting a reasonable upper bound on your variable-length path search, though. If you do not do so, then with any reasonable DB size your query is likely to take a very long time and/or run out of memory.
When finding paths, Cypher will never visit the same node twice in a single path. So MATCH (a:Start)-[*]-(b) RETURN DISTINCT b will return all nodes connected to a. (DISTINCT here is redundant, but it can affect query performance. Use PROFILE on your version of Neo4j to see if it cares and which is better)
NOTE: This works starting with Neo4j 3.2 Cypher planner. For previous versions of
the Cypher planner, the only performant way to do this is with APOC, or add a -[:connected_to]-> relation from start node to all children so that path doesn't have to be explored.)
I have Neo4j version 3.0.4 with tree based data inside and I am trying to find solution for following problem:
I want start from root and travel down collecting all nodes. When I found node of type (property) "female" I want to include it in the result and stop going down.
Here is my try to describe the problem and expected result
Notes:
there is relationship between nodes called "relation". Every node has
only 1 parent so it is tree structure.
So far I have:
match p=(root:User {isRoot:true})-[:RELATION*..]-(child:User) return p
Which return the tree structure but without stopping condition
How can achieve the result?
Update 1:
Maybe a better way to describe the desired outcome is - I want every node from a tree going in depth and starting from root (or specific node) that has no direct or indirect parents of type female. Does that make sense?
You have two options here, really. The easy one is to do as InverseFalcon suggests and get all results, then prune using a predicate:
MATCH (root:User {isRoot: true})
WITH root
MATCH p = (root) - [:RELATION*] -> (:User {type: 'female'} )
WHERE ALL(x IN NODES(p)[1..-1] WHERE x.type = 'male')
RETURN NODES(p)
The harder, but better one, especially if your data set is very large or you plan to run a very large number of queries, is to refactor your data model so that instead of a generic -[:RELATION]->, you have particular relationship types that you plan to query against (:DAUGHTER|:SON, e.g.). Relationships in neo4j are much faster to query on than node labels or especially node properties, so design your relationships to accommodate the analysis you'll want to perform.
[EDITED]
Does this work for you?
MATCH p=(root:User {isRoot:true})-[:RELATION*0..]-(:User {type: 'male'})-[:RELATION]-(:User {type: 'female'})
RETURN p;
This query should return all paths that start at the root node and end at a female node, but without going through any other female nodes. I have assumed that the non-female nodes have "male" as the type value. The variable-length relationship pattern specifies 0.. so that a path consisting of a female root node can be returned as well.
I have a graph of a tree structure (well no, more of a DAG because i can have multiple parents) and need to be able to write queries that return all results in a flat list, starting at a particular node(s) and down.
I've reduced one of my use cases to this simple example. In the ascii representation here, n's are my nodes and I've appended their id. p is a permission in my auth system, but all that is pertinent to the question is that it marks the spot from which I need to recurse downwards to collect nodes which should be returned by the query.
There can be multiple root nodes related to p's
The roots, such as n3 below, should be contained in the results, as well as the children
The relationship depth is unbounded.
Graph:
n1
^ ^
/ \
n2 n3<--p
^ ^
/ \
n4 n5
^
/
n6
If it's helpful, here's the cypher I ran to throw together this quick example:
CREATE path=(n1:n{id:1})<-[:HAS_PARENT]-(n2:n{id:2}),
(n1)<-[:HAS_PARENT]-(n3:n{id:3})<-[:HAS_PARENT]-(n4:n{id:4}),
(n3)<-[:HAS_PARENT]-(n5:n{id:5}),
(n4)<-[:HAS_PARENT]-(n6:n{id:6})
MATCH (n{id:3})
CREATE (:p)-[:IN]->(n)
Here is the current best query I have:
MATCH (n:n)<--(:p)
WITH collect (n) as parents, (n) as n
OPTIONAL MATCH (c)-[:HAS_PARENT*]->(n)
WITH collect(c) as children, (parents) as parents
UNWIND (parents+children) as tree
RETURN tree
This returns the correct set of results, and unlike some previous attempts I made which did not use any collect/unwind, the results come back as a single column of data as desired.
Is this the most optimal way of making this type of query? It is surprisingly more complex than I thought the simple scenario called for. I tried some queries where I combined the roots ("parents" in my query) with the "children" using a UNION clause, but I could not find a way to do so without repeating the query for the relationship with p. In my real world queries, that's a much more expensive operation which i've reduced down here for the example, so I cannot run it more than once.
This might suit your needs:
MATCH (c)-[:HAS_PARENT*0..]->(root:n)<--(:p)
RETURN root, COLLECT(c) AS tree
Each result row will contain a distinct root node and a collection if its tree nodes (including the root node).