Why does this cypher query never finish - neo4j

This is the query:
MATCH (t:Table)-[*]-(a:Attribute) RETURN t,a
Here is the complete graph:
Here is the query and what happens when I try to execute it:

The reason is that you are performing a variable-length relationship without an upper bound. Cypher will attempt to find every possible path in existence that can be made no matter how long the path, provided that the path begins with a :Table node and ends with an :Attribute node. While a relationship will only be traversed once per path, there's no restriction to using a different relationship to return to a previously traversed node and then using another as-of-yet-untraversed-relationship-in-the-path to leave it and continue traversing.
Even on a small graph, the number of possible paths explodes. You can see for yourself how the number of paths grows, and how the db will get slower as the number of possible paths to explore explodes.
MATCH (:Table)-[*..6]-(:Attribute)
RETURN count(*) as pathsFound
If that finishes quickly, raise the upper bound and run it again. Keep going, and watch how high the path count climbs before the db starts running into trouble.
I'll save you some time, though. I recreated your graph, and you hit the max possible paths when you have an upper bound of 23 hops, returning a count of 1371112 total distinct paths in your graph matching that pattern. The browser alone won't be able to cope with this many rows of data.
Here are two queries you can run to verify it (provided that this is your entire graph):
MATCH (:Table)-[*..23]-(:Attribute)
RETURN count(*) as totalPathsFound
and
MATCH path = (:Table)-[*..23]-(:Attribute)
RETURN length(path) as pathLength, count(*) as pathsFound
ORDER BY pathLength DESC
Note that expanding out and counting the number of possible paths isn't too strenuous; we can get that in a few seconds. But doing property access or additional computations that may multiplicatively increase the number of paths can be a problem, as can streaming back this many rows of data, especially to a browser app.
More to the point, I don't think you really want to process over a million results anyway. What the query is actually doing is likely completely different than what you really want. So you may want to clarify what exactly you want the query to do, because the current approach isn't feasible.
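If what you actually need is which attributes are connected to which tables, rather than every route between them, something along these lines may be closer. This is only a sketch: I'm guessing at the intent of the original query, and the upper bound of 10 hops is an arbitrary assumption.

```cypher
// Assumption: one shortest connection per (table, attribute) pair is enough.
// The upper bound of 10 hops is arbitrary; adjust it for your graph.
MATCH (t:Table), (a:Attribute)
MATCH p = shortestPath((t)-[*..10]-(a))
RETURN t, a, length(p) AS hops
```

This returns at most one row per table/attribute pair instead of one row per possible path.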

Related

Adding a property filter to cypher query explodes memory, why?

I'm trying to write a query that explores a DAG-type graph (a bill of materials) for all construction paths leading down to a specific part number (second MATCH), among all the parts associated with a given product (first MATCH). There is a strange behavior I don't understand:
This query runs in a reasonable time using Neo4j community edition (~2 s):
WITH '12345' as snid, 'ABCDE' as pid
MATCH (m:Product {full_sn:snid})-[:uses]->(p:Part)
WITH snid, pid, collect(p) AS mparts
MATCH path=(anc:Part)-[:has*]->(child:Part)
WHERE ALL(node IN nodes(path) WHERE node IN mparts)
WITH snid, path, relationships(path)[-1] AS rel,
nodes(path)[-2] AS parent, nodes(path)[-1] AS child
RETURN stuff I want
However, to get the query I want, I must add a filter on the child using the part number pid in the second MATCH statement:
MATCH path=(anc:Part)-[:has*]->(child:Part {pn:pid})
And when I try to run the new query, the neo4j browser complains that there is not enough memory (Neo.TransientError.General.OutOfMemoryError). When I run it with EXPLAIN, the db hits are exploding into the 10s of billions, as if I'm asking it for a massive cartesian product. But all I have done is add a restriction on the child, so this should be reducing the search space, shouldn't it?
I also tried adding an index on :Part(pn). Now the profile shown by EXPLAIN looks very efficient, but I still have the same memory error.
If anyone can help me understand why this change between the two queries is causing problems, I'd greatly appreciate it!
Best wishes,
Ben
MATCH path=(anc:Part)-[:has*]->(child:Part)
The * is exploding to every downstream child node.
That's appropriate if that is what's desired. If you make this an optional match and limit to the collect items, this should restrict the return results.
OPTIONAL MATCH path=(anc:Part)-[:has*]->(child:Part)
This is conceptually (and crudely) similar to an outer join in SQL.
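Another thing that may be worth trying is to anchor the variable-length match on a child that was already found through the product, so the expansion has a fixed end node. A sketch reusing the names from the question; whether the planner actually picks a better plan this way is an assumption to check with PROFILE, and the final RETURN is a placeholder for whatever you really need.

```cypher
WITH '12345' AS snid, 'ABCDE' AS pid
MATCH (m:Product {full_sn: snid})-[:uses]->(p:Part)
WITH snid, pid, collect(p) AS mparts
// Pin down the target part first, restricted to this product's parts
MATCH (child:Part {pn: pid})
WHERE child IN mparts
// Then expand upwards from that fixed child only
MATCH path = (anc:Part)-[:has*]->(child)
WHERE all(node IN nodes(path) WHERE node IN mparts)
RETURN path
```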

neo4j CYPHER - Relationship Query doesn't finish

In a 14 GB database I have a few CITES relationships:
MATCH p=()-[r:CITES]->() RETURN count(r)
91
However, when I run
MATCH ()-[r:CITES]-() RETURN count(r)
it loads forever and eventually crashes with a browser window reload (neo4j desktop)
You can see the differences in how each of those queries will execute if you prefix each query with EXPLAIN.
The pattern used for the first query is such that the planner will find that count in the counts store, a transactionally updated store of counts of various things. This is a fast constant time lookup.
The other pattern, when omitting the direction, will not use the count store lookup and will actually have to traverse the graph (starting from every node in the graph), and that will take a long time as your graph grows.
As for what this gives back, it should be twice the number of :CITES relationships in your graph. Without a direction on the relationship, each relationship is found twice, because the same path with the start and end nodes switched also fits the pattern.
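To compare the two plans side by side, run each of these separately. The operator name in the comment is what I'd expect to see in recent Neo4j versions, so treat it as an assumption for yours.

```cypher
// Directed, no labels on either end: the plan should show a
// RelationshipCountFromCountStore operator (a constant-time lookup).
EXPLAIN MATCH ()-[r:CITES]->() RETURN count(r);

// Undirected: expect a scan-and-expand plan instead of the counts store.
EXPLAIN MATCH ()-[r:CITES]-() RETURN count(r);
```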
Neo4j always chooses nodes as start points for query execution. In your query, the query engine is probably touching the whole graph, since you are not adding restrictions on node properties, labels, etc.
I think you should specify a label at least in your first node in the pattern.
MATCH (:Article)-[r:CITES]-() RETURN count(r)

cypher performance for multiple hops

I'm running my Cypher queries on a very large social network (over 1B records). I'm trying to get all paths between two persons with variable relationship lengths. I get a reasonable response time running a query for a single relationship length (between 0.5 and 2 seconds) [the person ids are indexed].
MATCH paths=( (pr1:person)-[*0..1]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
However, when I run the query with multiple lengths (i.e. 2 or more), my response time goes up to several minutes. Assuming that each person has on average the same number of connections, I should be running my queries for 2-3 minutes max (but I get up to 5+ min).
MATCH paths=( (pr1:person)-[*0..2]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
I tried EXPLAIN, but it did not show extreme values for the VarLengthExpand(All) operator.
Maybe the traversal is not using the index for pr2.
Is there any way to improve the performance of my query?
Since variable-length relationship searches have exponential complexity, your *0..2 query might be generating a very large number of paths, which can cause the neo4j server (or your client code, like the neo4j browser) to run a long time or even run out of memory.
This query might be able to finish and show you how many matching paths there are:
MATCH (pr1:person)-[*0..2]-(:person)
WHERE pr1.id='123456'
RETURN COUNT(*);
If the returned number is very large, then you should modify your query to reduce the size of the result. For example, you can add a LIMIT clause after your original RETURN clause to limit the number of returned paths.
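For instance, a minimal sketch of the original query with a cap on the result size (the value 100 is an arbitrary example, not a recommendation):

```cypher
// The limit of 100 is an arbitrary example value.
MATCH paths = (pr1:person)-[*0..2]-(pr2:person)
WHERE pr1.id = '123456'
RETURN paths
LIMIT 100
```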
By the way, the EXPLAIN clause just estimates the query cost, and can be way off. The PROFILE clause performs the actual query, and gives you an accurate accounting of the DB hits (however, if your query never finishes running, then a PROFILE of it will also never finish).
Rather than using the explain, try the "profile" instead.
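A sketch of what that could look like here. Since PROFILE really executes the query, it probably makes sense to profile the counting variant first rather than the one that returns every path:

```cypher
// PROFILE runs the query for real, so keep the result small (a count) while testing.
PROFILE
MATCH (pr1:person)-[*0..2]-(:person)
WHERE pr1.id = '123456'
RETURN count(*) AS pathsFound
```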

Neo4j more specific query slower than more generic one

I'm trying to count all values collected in one subtree of my graph. I thought that the more descriptive path from the root node I provide, the faster the query will run. Unfortunately this isn't true in my case and I can't figure out why.
Original, slow query:
MATCH (s:Sandbox {name: "sandbox"})<--(root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
PROFILE returns 38397 total db hits in 2203 ms.
However without matching top-level node, labeled Sandbox, query is 10 times faster:
MATCH (root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
PROFILE returns 38478 total db hits in 159 ms
To make this clear, in this case the result is the same as I have just one Sandbox.
What is wrong in my first query? How should I model/query the hierarchy like that? I can save sandbox name as property in Metric node, but it seems uglier for me, however executes faster.
Because the 2 queries are not identical.
(For reader visual difference)
MATCH (s:Sandbox {name: "sandbox"})<--(root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
MATCH (root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
So in the second query, Neo4j doesn't care about (root). You never use root, and root is already implied by [:has_metric], so Neo4j can just skip to finding ()-[:has_metric]->(n:Metric)-[:most_recent|prev]. In the first query, now we also have to find these Sandbox nodes! And on top of that, root has to be connected to that too! So Neo4j has to do extra work to prove that that is true. The extra column can also add more rows to the results being processed, which may add more validation checks on the rest of the query.
So long story short, the first query is slower because it is doing more validation work. So, the first query will be a subset of the latter.
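If you want to keep the Sandbox restriction, one thing to try is resolving the root first and only then expanding from it. This is a sketch, not a guaranteed speedup, so compare both versions with PROFILE. Note that DISTINCT collapses duplicate roots, which changes the count if a root is linked to the sandbox by more than one relationship.

```cypher
// Step 1: find the root(s) attached to the named sandbox
MATCH (s:Sandbox {name: "sandbox"})<--(root)
WITH DISTINCT root
// Step 2: expand the metric/value chain from those roots only
MATCH (root)-[:has_metric]->(n:Metric)-[:most_recent|prev*0..]->(v:Value)
RETURN count(v)
```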

Cypher query slow when intermediate node labels are specified

I have the following Cypher query
MATCH (p1:`Article` {article_id:'1234'})--(a1:`Author` {name:'Jones, P'})
MATCH (p2:`Article` {article_id:'5678'})--(a2:`Author` {name:'Jones, P'})
MATCH (p1)-[:WRITTEN_BY]->(c1:`Author`)-[h1:HAS_NAME]->(l1)
MATCH (p2)-[:WRITTEN_BY]->(c2:`Author`)-[h2:HAS_NAME]->(l2)
WHERE l1=l2 AND c1<>a1 AND c2<>a2
RETURN c1.FullName, c2.FullName, h1.distance + h2.distance
On my local Neo4j server, running this query takes ~4 seconds and PROFILE shows >3 million db hits. If I don't specify the Author label on c1 and c2 (it's redundant thanks to the relationship labels), the same query returns the same output in 33ms, and PROFILE shows <200 db hits.
When I run the same two queries on a larger version of the same database that's hosted on a remote server, this difference in performance vanishes.
Both dbs have the same constraints and indexes. Any ideas what else might be going wrong?
Your query has a lot of unnecessary stuff in it, so first off, here's a cleaner version of it that is less likely to get misinterpreted by the planner:
MATCH (name:Name) WHERE NOT name.name = 'Jones, P'
WITH name
MATCH (:`Article` {article_id:'1234'})-[:WRITTEN_BY]->()-[h1:HAS_NAME]->(name)<-[h2:HAS_NAME]-()<-[:WRITTEN_BY]-(:`Article` {article_id:'5678'})
RETURN name.name, h1.distance + h2.distance
There's really only one path you want to find, and you want to find it for any author whose name is not Jones, P. Take advantage of your shared :Name nodes to start your query with the smallest set of definite points and expand paths from there. You are generating a massive cartesian product by stacking all those MATCH statements and then filtering them out.
As for the difference in query performance, it appears that the query planner is trying to use the Author label to build your 3rd and 4th paths, whereas if you leave it out, the planner will only touch the much narrower set of :Articles (fixed by indexed property), then expand relationships through the (incidentally very small) set of nodes that have -[:WRITTEN_BY]-> relationships, and then the (also incidentally very small) set of those nodes that have a -[:HAS_NAME]-> relationship. That decision is based partly on the predictable size of the various sets, so if you have a different number of :Author nodes on the server, the planner will make a smarter choice and not use them.
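If you want to confirm that the planner's choice of starting point is the cause, an index hint can force it to start from the indexed :Article lookup even when the :Author label is present. A sketch for one of the two paths, assuming an index exists on :Article(article_id):

```cypher
// Assumption: an index on :Article(article_id) exists, otherwise the hint fails.
PROFILE
MATCH (p1:Article {article_id: '1234'})-[:WRITTEN_BY]->(c1:Author)-[h1:HAS_NAME]->(l1)
USING INDEX p1:Article(article_id)
RETURN c1.FullName, h1.distance
```

If the db hits drop to roughly what you see without the label, the label scan was the expensive part.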
