In my graph each node has a name and the graph is actually a tree, so there exists a /path/to/each/node. Here's the query I currently use to get the path:
MATCH p=(n:Node{id:4})-[:CHILD_OF*0..200]->(r:Root{treeName:"vt"})
RETURN reduce(path = "", node IN nodes(p) | node.name + "/" + path) as path
An actual query is somewhat heavier, but the behavior is the same. So, having a ("")<-("a")<-("b")<-("c")<-("d") path I will get /a/b/c/d/. I don't mind trimming the last /, but I'm really worried about the order of the nodes iterator returned by nodes(p).
So my question is aimed mainly at the Neo4j team: are there any guarantees as to the order? Would it be better if I just returned the Path and then manually extracted each property? I'm using Cypher with an embedded Neo4j distribution, so that won't be a problem.
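To make the ordering concern concrete, here is a minimal Python sketch that mirrors the reduce expression from the query. It assumes that nodes(p) yields the nodes in path order, from the start node (id 4) to the root, which is what the /a/b/c/d/ result above implies. The list of names stands in for nodes(p) and is purely illustrative.

```python
from functools import reduce


def build_path(names_from_node_to_root):
    # Mirrors: reduce(path = "", node IN nodes(p) | node.name + "/" + path)
    # Each name is prepended, so the last name visited ends up first.
    return reduce(lambda path, name: name + "/" + path, names_from_node_to_root, "")


# Names in the assumed iteration order: start node "d" first, root "" last.
names = ["d", "c", "b", "a", ""]
print(build_path(names))  # prints /a/b/c/d/
```

If the iteration order were reversed, the same expression would produce d/c/b/a// instead, so the result depends entirely on that order.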
Related
I'm lost and tried everything I can think of. Maybe you can help me.
I'm trying to find all dependencies for a given software package. In this special case I'm working with the Node.js / JavaScript ecosystem and scraped the whole npm registry. My data model is simple, I've got packages and a package can have multiple versions.
In my database I have 113,339,030 dependency relationships and 19,753,269 versions.
My code worked fine until I found a package that has so many dependencies (direct and transitive) that all my queries break down. It's called react-scripts. Here you can see the package information.
https://registry.npmjs.org/react-scripts
One visualizer never finishes
https://npm.anvaka.com/#/view/2d/react-scripts
and another one creates a dependency graph so big it's hard to analyze.
https://npmgraph.js.org/?q=react-scripts
At first I tried PostgreSQL with a recursive common table expression.
with recursive cte as (
    select
        child_id
    from
        dependencies
    where
        dependencies.parent_id = 16674850
    union
    select
        dependencies.child_id
    from
        cte
        left join dependencies on
            cte.child_id = dependencies.parent_id
    where
        cte.child_id is not null
)
select * from cte;
That returns 1,726 elements, which seems to be OK. https://deps.dev/npm/react-scripts/4.0.3/dependencies returns 1,445 dependencies.
However I'd like to get the path to the nodes and that doesn't work well with PostgreSQL and UNION. You'd have to use UNION ALL but the query will be much more complicated and slower. That's why I thought I'd give Neo4j a chance.
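As a rough illustration of what "the path to the nodes" could mean without enumerating every path, here is a Python sketch of a breadth-first search over (parent_id, child_id) pairs that records one shortest path per reachable node. The function name and the sample edge list are hypothetical; this is not the SQL or Cypher solution, only a sketch of the idea under the assumption that one path per dependency is enough.

```python
from collections import deque


def first_paths(dependencies, start):
    """Return one shortest path from start to every reachable node.

    dependencies is an iterable of (parent_id, child_id) pairs.
    """
    children = {}
    for parent, child in dependencies:
        children.setdefault(parent, []).append(child)

    paths = {start: [start]}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            # Visit each node once; later (longer) paths are ignored.
            if child not in paths:
                paths[child] = paths[node] + [child]
                queue.append(child)

    del paths[start]  # the start node is not its own dependency
    return paths


# Hypothetical edges, including a diamond (1->2->4, 1->3->4) and a cycle (4->2).
sample = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 2)]
print(first_paths(sample, 1))
```

Because each node is expanded only once, the work grows with the number of edges rather than with the number of distinct paths.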
My nodes have the properties
version_id: integer
name: string
version: string
I'm starting with what I thought would be a simple query but it's already failing.
Start with version that has version_id 16674850 and give me all its dependencies.
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
return DISTINCT b;
I have an index on version_id.
CREATE INDEX FOR (version:Version) ON (version.version_id)
That works until I set the maximum depth of the variable-length pattern to 12 or greater.
Then the query runs forever. Here is the query plan.
Neo4j runs inside Docker. I've increased some memory settings.
- NEO4J_dbms_memory_heap_initial__size=2G
- NEO4J_dbms_memory_heap_max__size=2G
- NEO4J_dbms_memory_pagecache_size=1G
Any ideas? I'm really lost right now and don't want to give up on my "software dependency analysis graph".
I spent the last 6 weeks on this problem.
Thank you very much!
Edit 28/09/2021
I uploaded a sample data set. Here are the links
https://s3.amazonaws.com/blog.spolytics.com/versions.csv (737.1 MB)
https://s3.amazonaws.com/blog.spolytics.com/dependencies.csv (1.7 GB)
Here is the script to import the data.
neo4j-admin import \
--database=deps \
--skip-bad-relationships \
--id-type=INTEGER \
--nodes=Version=import/versions.csv \
--relationships=DEPENDS_ON=import/dependencies.csv
That might help to do some experiments on your side and to reproduce my problem.
The trouble here is that Cypher is interested in finding all possible paths that match a pattern. That can make it problematic for cases when you just want distinct reachable nodes, where you really don't care about expanding to every distinct path, but just about finding nodes and ignoring any alternate paths leading to nodes already visited.
Also, the planner is making a bad choice with that cartesian product plan, that can make the problem worse.
I'd recommend using APOC Procedures for this, as there are procs that are optimized for expanding to distinct nodes and ignoring paths to those already visited. apoc.path.subgraphNodes() is the procedure.
Here's an example of use:
MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version', maxLevel:11}) YIELD node as b
RETURN b
The arrow in the relationship filter indicates direction, and since it's pointing right it refers to traversing outgoing relationships. If we were interested in traversing incoming relationships instead, we would have the arrow at the start of the relationship name, pointing to the left.
For the label filter, the prefixed > means the label is an end label, meaning that we are only interested in returning the nodes of that given label.
You can remove the maxLevel config property if you want it to be an unbounded expansion.
More options and details here:
https://neo4j.com/labs/apoc/4.1/graph-querying/expand-subgraph-nodes/
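To see why expanding every path is so much more expensive than visiting each node once, here is a small Python sketch on a made-up layered graph. It is only an illustration of the two strategies, not of how Cypher or APOC are implemented; all names and the graph shape are assumptions.

```python
from collections import deque


def build_layered_graph(layers, width=2):
    """Build a DAG where every node points to all nodes of the next layer."""
    graph = {"root": []}
    previous = ["root"]
    for layer in range(layers):
        current = [f"n{layer}_{i}" for i in range(width)]
        for node in current:
            graph[node] = []
        for parent in previous:
            graph[parent].extend(current)
        previous = current
    return graph


def count_paths(graph, start):
    """Count every distinct path of length >= 1 that begins at start."""
    total = 0
    stack = [start]
    while stack:
        node = stack.pop()
        for child in graph[node]:
            total += 1          # one more path ends at this child
            stack.append(child)  # and is expanded further, even if seen before
    return total


def reachable_nodes(graph, start):
    """Collect distinct reachable nodes, expanding each node only once."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(start)
    return seen


graph = build_layered_graph(layers=10)
print(count_paths(graph, "root"), len(reachable_nodes(graph, "root")))
```

With 10 layers of 2 nodes each there are only 20 reachable nodes but 2,046 paths, and every additional layer doubles the number of paths.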
I don’t have a large dataset like yours, but I think you could bring the number of paths down by filtering the b nodes. Does this work as a start?
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
UNWIND nodes(p) AS node
return COUNT(DISTINCT node)
To check if you can return longer paths, you could do
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN count(p)
Now if that works, I would do :
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN p LIMIT 10
to see whether the paths are correct.
Sometimes UNWIND causes an issue. To get the set of unique nodes, you could also try APOC:
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN apoc.coll.toSet(
apoc.coll.flatten(
COLLECT(nodes(p))
)
) AS unique_nodes
[Edit] I'm using Neo4j 4.2.1
I have this need for a Cypher query that brings back a complete tree given its root node. All nodes and relationships must be fetched and present only once in the returned sets. Here's what I have come to:
MATCH p = (n)-[*..]->(m)
WHERE id(n) = 0
WITH relationships(p) AS r
WITH distinct last(r) as rel
WITH [node IN [startNode(rel), endNode(rel)] | node] AS tmp, rel
UNWIND tmp AS node
RETURN collect(DISTINCT node) AS nodes, collect(distinct rel) AS relationships;
Running the query on our database to get about 820 nodes makes the thing crash for lack of memory (5 GB allowed). Hard to believe. So I'm wondering: is this query ill-conceived? Is there a technique I'm using that shouldn't be used for my purpose?
I strongly recommend that you come up with a node property that is guaranteed to be the same on all the nodes in a contiguous tree, if you don't have one already. I'll call that property same_prop. Here's what I do to run queries like the one you're running:
1. Index same_prop. If you have different node labels, you need to create this index for each node label you expect to have in the tree.
CREATE INDEX samepropnode FOR (n:your_label) ON (n.same_prop)
is the kind of statement you need in Neo4j 4+. In Neo4j, indexes are cheap and can sometimes speed up queries quite a bit.
2. Collect all possible values of same_prop and store them in a text file (I use tab-separated values, as they are safer than comma-separated values).
3. Use the Python driver, or a Neo4j driver for your language of choice (I strongly recommend Neo4j-provided drivers, not third-party ones), to write wrapper code that executes a Cypher query something like this:
MATCH (p:your_label)-->(c)
USING INDEX p:your_label(same_prop)
WHERE p.same_prop IN [ same_prop_list ]
RETURN DISTINCT
p.datapiece1 AS `first_parent_datapiece`,
p.datapiecen AS `nth_parent_datapiece`,
c.datapiece1 AS `first_child_datapiece`,
c.datapiecen AS `nth_child_datapiece`
It's not a good idea, in general, to return whole nodes and relationships unless you're debugging.
4. Then in your Python (for example) code, you simply read in all your same_prop values from the file you created in Step 2, split them into reasonably sized chunks, maybe 1,000 or 10,000 values each, and substitute each chunk for [ same_prop_list ] in the Cypher query on the fly.
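The chunking step could be sketched roughly as follows. This is a minimal sketch, not the answerer's actual code: it swaps the textual substitution for a query parameter ($same_prop_list), and run_query is a hypothetical callable that you would back with whatever driver call you use. The names chunked, run_in_chunks, and the shortened QUERY are assumptions for illustration.

```python
from typing import Callable, Dict, Iterator, List, Sequence

# Shortened version of the query above; $same_prop_list is a query parameter
# (an assumption of this sketch) instead of text pasted into the query.
QUERY = """
MATCH (p:your_label)-->(c)
WHERE p.same_prop IN $same_prop_list
RETURN DISTINCT
  p.datapiece1 AS first_parent_datapiece,
  c.datapiece1 AS first_child_datapiece
"""


def chunked(values: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of values with at most size items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(values), size):
        yield list(values[start:start + size])


def run_in_chunks(
    values: Sequence[str],
    size: int,
    run_query: Callable[[str, Dict[str, List[str]]], List[dict]],
) -> List[dict]:
    """Run QUERY once per chunk and concatenate the returned rows."""
    rows: List[dict] = []
    for chunk in chunked(values, size):
        rows.extend(run_query(QUERY, {"same_prop_list": chunk}))
    return rows
```

Here run_query takes the query text and a parameter dictionary and returns a list of rows; in practice it might be a thin wrapper around a driver session, but that wiring is left out because it depends on your setup.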
I have a simple query
MATCH (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE) RETURN m
and when executing the query "manually" (i.e. using the browser interface to follow edges) I only get a single node as a result as there are no further connections. Checking this with the query
MATCH (n:TYPE {id:123})<-[:CONNECTION]-(m:TYPE)<-[:CONNECTION]-(o:TYPE) RETURN m,o
shows no results and
MATCH (n:TYPE {id:123})<-[:CONNECTION]-(m:TYPE) RETURN m
shows a single node so I have made no mistake doing the query manually.
However, the issue is that the first query takes ages to finish and I do not understand why.
Consequently: What is the reason such a trivial query takes so long even though the maximum result would be one node?
Bonus: How to fix this issue?
As Tezra mentioned, the variable-length pattern match isn't in the same category as the other two queries you listed, because there are no restrictions on any of the nodes in between n and m; they can be of any type. Given that your query is taking a long time, you likely have a fairly dense graph of :CONNECTION relationships between nodes of different types.
If you want to make sure all nodes in your path are of the same label, you need to add that yourself:
MATCH path = (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE)
WHERE all(node in nodes(path) WHERE node:TYPE)
RETURN m
Alternately you can use APOC Procedures, which has a fairly efficient means of finding connected nodes (and restricting nodes in the path by label):
MATCH (n:TYPE {id:123})
CALL apoc.path.subgraphNodes(n, {labelFilter:'TYPE', relationshipFilter:'<CONNECTION'}) YIELD node
RETURN node
SKIP 1 // to avoid returning `n`
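The effect of unrestricted intermediate nodes can be illustrated with a toy graph in Python. The node names, labels, and edges below are made up for the example; the sketch only shows why a fixed two-hop check over TYPE nodes can come back empty while a variable-length traversal still has to explore through nodes of other labels.

```python
# Hypothetical graph: incoming[x] lists the nodes with a CONNECTION pointing at x.
LABELS = {"n": "TYPE", "m": "TYPE", "x": "OTHER", "o": "TYPE"}
INCOMING = {"n": ["m", "x"], "m": [], "x": ["o"], "o": []}


def two_hop_same_label(start):
    """Emulates (start)<-(m:TYPE)<-(o:TYPE): both hops must be TYPE nodes."""
    return [
        (m, o)
        for m in INCOMING[start]
        if LABELS[m] == "TYPE"
        for o in INCOMING[m]
        if LABELS[o] == "TYPE"
    ]


def variable_length(start, require_label_on_path):
    """Find TYPE nodes reachable over incoming edges at any depth.

    With require_label_on_path=True, non-TYPE nodes are never expanded,
    which corresponds to the all(node IN nodes(path) WHERE node:TYPE) filter.
    """
    found = set()
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for prev in INCOMING[node]:
            if prev in seen:
                continue
            if require_label_on_path and LABELS[prev] != "TYPE":
                continue
            seen.add(prev)
            stack.append(prev)
            if LABELS[prev] == "TYPE":
                found.add(prev)
    return found


print(two_hop_same_label("n"))
print(sorted(variable_length("n", require_label_on_path=False)))
print(sorted(variable_length("n", require_label_on_path=True)))
```

In this toy graph the two-hop check finds nothing, the unrestricted traversal finds both m and o (o is reached through the OTHER node x), and the label-restricted traversal finds only m.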
MATCH (n:TYPE {id:123})<-[:CONNECTION]-(m:TYPE)<-[:CONNECTION]-(o:TYPE) RETURN m,o is not a fair test of MATCH (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE) RETURN m because it excludes the possibility of MATCH (n:TYPE {id:123})<-[:CONNECTION]-(m:ANYTHING_ELSE)<-[:CONNECTION]-(o:TYPE) RETURN m,o.
For your main query, you should be returning DISTINCT results: MATCH (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE) RETURN DISTINCT m.
This is for 2 main reasons.
Without DISTINCT, each node is returned once for every possible path to it.
Because of the previous point, that is a lot of extra work for no additional meaningful information.
If you use RETURN DISTINCT, it gives the cypher planner the choice to do a pruning search instead of an exhaustive search.
You can also limit the depth of the exhaustive search using ..# so that it doesn't kill your query if you run against a much older version of Neo4j where the Cypher planner hasn't learned pruning search yet. Example use: MATCH (n:TYPE {id:123})<-[:CONNECTION*..10]-(m:TYPE) RETURN m
I have a highly interconnected graph where, starting from a specific node, I want to find all nodes connected to it regardless of relationship type, direction, or length. What I am trying to do is filter out paths that include a node more than once. But what I get is:
Neo.DatabaseError.General.UnknownError: key not found: UNNAMED27
I have managed to create a much simpler database in the Neo4j sandbox and get the same message again using the following data:
CREATE (n1:Person { pid:1, name: 'User1'}),
(n2:Person { pid:2, name: 'User2'}),
(n3:Person { pid:3, name: 'User3'}),
(n4:Person { pid:4, name: 'User4'}),
(n5:Person { pid:5, name: 'User5'})
With the following relationships:
MATCH (n1{pid:1}),(n2{pid:2}),(n3{pid:3}),(n4{pid:4}),(n5{pid:5})
CREATE (n1)-[r1:RELATION]->(n2),
(n5)-[r2:RELATION]->(n2),
(n1)-[r3:RELATION]->(n3),
(n4)-[r4:RELATION]->(n3)
The Cypher Query that causes this issue in the above model is
MATCH p= (n:Person{pid:1})-[*0..]-(m)
WHERE ALL(c IN nodes(p) WHERE 1=size(filter(d in nodes(p) where c.pid = d.pid)) )
return m
Can anybody see what is wrong with this query?
The error seems like a bug to me. There is a closed neo4j issue that seems similar, but it was supposed to be fixed in version 3.2.1. You should probably create a new issue for it, since your comments state you are using 3.2.5.
Meanwhile, this query should get the results you seem to want:
MATCH p=(:Person{pid:1})-[*0..]-(m)
WITH m, NODES(p) AS ns
UNWIND ns AS n
WITH m, ns, COUNT(DISTINCT n) AS cns
WHERE SIZE(ns) = cns
return m
You should strongly consider putting a reasonable upper bound on your variable-length path search, though. If you do not do so, then with any reasonable DB size your query is likely to take a very long time and/or run out of memory.
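The idea behind the SIZE(ns) = COUNT(DISTINCT n) check can be sketched in Python on the sample data from the question. The sketch enumerates every path from the start node that uses each relationship at most once, ignoring direction, and keeps only those in which no node repeats. It is an illustration of the filter, not of how Cypher evaluates the query; the function names are made up.

```python
# The four RELATION edges from the sample data, as (start pid, end pid).
EDGES = [(1, 2), (5, 2), (1, 3), (4, 3)]


def trails_from(start):
    """Yield the node list of every path from start, direction ignored,
    in which each relationship is used at most once (length 0 included)."""

    def expand(nodes, used):
        yield nodes
        last = nodes[-1]
        for index, (a, b) in enumerate(EDGES):
            if index in used:
                continue
            if a == last:
                nxt = b
            elif b == last:
                nxt = a
            else:
                continue
            yield from expand(nodes + [nxt], used | {index})

    yield from expand([start], frozenset())


def end_nodes_of_simple_paths(start):
    """Keep paths whose node count equals their distinct node count."""
    return sorted(
        {nodes[-1] for nodes in trails_from(start) if len(nodes) == len(set(nodes))}
    )


print(end_nodes_of_simple_paths(1))  # prints [1, 2, 3, 4, 5]
```

On this small tree-shaped sample no path revisits a node, so all five people are returned; the filter only starts to discard paths once the graph contains cycles.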
When finding paths, Cypher will never traverse the same relationship twice in a single path, so the expansion always terminates. So MATCH (a:Start)-[*]-(b) RETURN DISTINCT b will return all nodes connected to a. (DISTINCT removes the duplicate rows you get when b can be reached by more than one path, and it can also affect query performance. Use PROFILE on your version of Neo4j to see whether it matters and which form is better.)
NOTE: This works starting with the Neo4j 3.2 Cypher planner. For previous versions of the Cypher planner, the only performant way to do this is with APOC, or by adding a -[:connected_to]-> relationship from the start node to all children so that the path doesn't have to be explored.
I have the following node structure: Emp[e_id, e_name, e_bossid]. I also have a recursive query that traverses the self-referencing relation e_bossid-[REPORTS_TO]->e_id:
MATCH (e:Employee) WHERE NOT (e)-[:REPORTS_TO]->()
SET e:Root;
MATCH path = (b:Root)<-[:REPORTS_TO*]-(e:Employee)
RETURN path
limit 1000;
However, the result is a path. I would like to have the result in the form of nodes, not a path. I tried to use nodes(path), but it gives me an error:
org.codehaus.jackson.map.JsonMappingException: Reference node not available (through reference chain: java.util.ArrayList[0]->java.util.HashMap["rel"]->java.util.HashMap["nodes(path)"]->java.util.ArrayList[0]->org.neo4j.rest.graphdb.entity.RestNode["restApi"]->org.neo4j.rest.graphdb.RestAPIFacade["direct"]->org.neo4j.rest.graphdb.ExecutingRestAPI["referenceNode"])
When I query without nodes(path) it seems to return only paths.
How should this be done in a Cypher query?
I'm not sure why you would want to get all possible paths in your organizational hierarchy. Maybe what you want to get is a set of paths from the leaves of the tree to the root of the tree, and to return each unique set as a row of nodes.
MATCH (b:Employee)
WHERE NOT (b)-[:REPORTS_TO]->()
MATCH (l:Employee)
WHERE NOT (l)<-[:REPORTS_TO]-()
MATCH p = shortestPath((b)<-[:REPORTS_TO*]-(l))
RETURN nodes(p) as reports
As far as your error goes, that looks like a bug, although I don't know what version of Neo4j you are using. In all likelihood, your query won't complete because your Root employees still carry the Employee label, which means that this pattern: MATCH path = (b:Root)<-[:REPORTS_TO*]-(e:Employee) matches the Root employees on each side of the variable-length traversal.
Give my query a try and let me know what happens.
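For readers who want to see the intended shape of the result, here is a small Python sketch of "one row of nodes per leaf, ordered from the root down". The REPORTS_TO mapping and the employee names are hypothetical, and the sketch assumes every employee has at most one boss, so each leaf has exactly one chain to its root.

```python
# Hypothetical data: employee -> boss, i.e. (employee)-[:REPORTS_TO]->(boss).
REPORTS_TO = {"bob": "alice", "carol": "alice", "dave": "bob"}


def reporting_chains(reports_to):
    """Return, for every leaf employee, the list of names from root to leaf."""
    bosses = set(reports_to.values())
    employees = set(reports_to) | bosses
    # A leaf is an employee nobody reports to.
    leaves = sorted(e for e in employees if e not in bosses)

    chains = []
    for leaf in leaves:
        chain = [leaf]
        while chain[-1] in reports_to:  # climb until the root (no boss)
            chain.append(reports_to[chain[-1]])
        chains.append(list(reversed(chain)))
    return chains


for chain in reporting_chains(REPORTS_TO):
    print(chain)
```

Each printed list corresponds to one row of nodes(p) in the query above, starting at the root employee and ending at a leaf.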