Adding a property filter to a Cypher query explodes memory, why? - neo4j

I'm trying to write a query that explores a DAG-type graph (a bill of materials) for all construction paths leading down to a specific part number (second MATCH), among all the parts associated with a given product (first MATCH). There is a strange behavior I don't understand:
This query runs in a reasonable time using Neo4j community edition (~2 s):
WITH '12345' AS snid, 'ABCDE' AS pid
MATCH (m:Product {full_sn:snid})-[:uses]->(p:Part)
WITH snid, pid, collect(p) AS mparts
MATCH path=(anc:Part)-[:has*]->(child:Part)
WHERE ALL(node IN nodes(path) WHERE node IN mparts)
WITH snid, path, relationships(path)[-1] AS rel,
nodes(path)[-2] AS parent, nodes(path)[-1] AS child
RETURN stuff I want
However, to get the query I want, I must add a filter on the child using the part number pid in the second MATCH statement:
MATCH path=(anc:Part)-[:has*]->(child:Part {pn:pid})
And when I try to run the new query, the Neo4j Browser complains that there is not enough memory (Neo.TransientError.General.OutOfMemoryError). When I run it with EXPLAIN, the estimated rows in the plan explode into the tens of billions, as if I'm asking for a massive cartesian product: but all I have done is add a restriction on the child, so this should reduce the search space, shouldn't it?
I also tried adding an index on :Part(pn). Now the plan shown by EXPLAIN looks very efficient, but I still get the same memory error.
If anyone can help me understand why this change between the two queries is causing problems, I'd greatly appreciate it!
Best wishes,
Ben

MATCH path=(anc:Part)-[:has*]->(child:Part)
The unbounded * expands to every downstream child node.
That's appropriate if that's what you want. If you make this an optional match and restrict it to the collected items, that should limit the results:
OPTIONAL MATCH path=(anc:Part)-[:has*]->(child:Part)
This is conceptually (and crudely) similar to a left outer join in SQL.
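Another option that keeps the original query's intent is to bound the expansion depth while leaving the child filter in the pattern. A minimal sketch, assuming the bill of materials never nests deeper than 10 levels (the bound is an assumption; adjust it to your data):
WITH '12345' AS snid, 'ABCDE' AS pid
MATCH (m:Product {full_sn: snid})-[:uses]->(p:Part)
WITH snid, pid, collect(p) AS mparts
// Bounding the variable-length expansion caps how many paths the
// planner can enumerate before the child filter applies.
MATCH path = (anc:Part)-[:has*..10]->(child:Part {pn: pid})
WHERE ALL(node IN nodes(path) WHERE node IN mparts)
RETURN path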

Related

Why does this Cypher query never finish

This is the query:
MATCH (t:Table)-[*]-(a:Attribute) RETURN t,a
(Screenshots of the complete graph and of the failed execution omitted.)
The reason is that you are using a variable-length relationship without an upper bound. Cypher will attempt to find every possible path that can be made, no matter how long, provided that the path begins with a :Table node and ends with an :Attribute node. While a relationship will only be traversed once per path, there is no restriction against using a different relationship to return to a previously visited node and then leaving it via yet another untraversed relationship to continue the path.
Even on a small graph, the number of possible paths explodes. You can see for yourself how the path count grows, and how the db slows down as the number of paths to explore increases.
MATCH (:Table)-[*..6]-(:Attribute)
RETURN count(*) as pathsFound
Now if that finishes quickly, increase the upper bound and run it again, and keep doing that, and see how high you can go, and how large the path count gets, before the db starts running into trouble.
I'll save you some time, though. I recreated your graph, and you hit the maximum number of possible paths at an upper bound of 23 hops, returning a count of 1,371,112 distinct paths matching that pattern. The browser alone won't be able to cope with that many rows of data.
Here are two queries you can run to verify it (provided that this is your entire graph):
MATCH (:Table)-[*..23]-(:Attribute)
RETURN count(*) as totalPathsFound
and
MATCH path = (:Table)-[*..23]-(:Attribute)
RETURN length(path) as pathLength, count(*) as pathsFound
ORDER BY pathLength DESC
Note that expanding out and counting the number of possible paths isn't too strenuous; we can get that count in a few seconds. But doing property access or additional computations that multiplicatively increase the number of paths can be a problem, and streaming back this many rows of data, especially to a browser app, can be a problem.
More to the point, I don't think you really want to process over a million results anyway. What the query is actually doing is likely completely different than what you really want. So you may want to clarify what exactly you want the query to do, because the current approach isn't feasible.
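If the real goal is just "which attributes are reachable from which tables", one sketch (an assumption about intent, not the asker's confirmed requirement) is to match a single shortest path per pair instead of every possible path:
// One row per (table, attribute) pair instead of one row per path.
MATCH p = shortestPath((t:Table)-[*..23]-(a:Attribute))
RETURN t, collect(a) AS attributes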

neo4j CYPHER - Relationship Query doesn't finish

In a 14 GB database I have a few CITES relationships:
MATCH p=()-[r:CITES]->() RETURN count(r)
91
However, when I run
MATCH ()-[r:CITES]-() RETURN count(r)
it runs forever and eventually crashes with a browser window reload (Neo4j Desktop).
You can see the differences in how each of those queries will execute if you prefix each query with EXPLAIN.
The pattern used for the first query is such that the planner will find that count in the counts store, a transactionally updated store of counts of various things. This is a fast constant time lookup.
The other pattern, when omitting the direction, will not use the count store lookup and will actually have to traverse the graph (starting from every node in the graph), and that will take a long time as your graph grows.
As for what this gives back, it should actually be twice the number of :CITES relationships in your graph: without a direction on the relationship, each individual relationship will be found twice, since the same path with the start and end nodes swapped also fits the given pattern.
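If the undirected count is all you need, a small sketch: keep the directed pattern so the count store is still used, then double the result to match what the undirected pattern would return.
// Constant-time count-store lookup; each relationship would be
// matched once per direction by the undirected query.
MATCH ()-[r:CITES]->()
RETURN count(r) * 2 AS undirectedMatches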
Neo4j always chooses nodes as the starting points for query execution. In your query, the engine is probably touching the whole graph, since you are not adding any restrictions on node properties, labels, etc.
You should specify a label on at least the first node in the pattern.
MATCH (:Article)-[r:CITES]-() RETURN count(r)

Neo4j more specific query slower than more generic one

I'm trying to count all values collected in one subtree of my graph. I thought that the more descriptive a path from the root node I provide, the faster the query would run. Unfortunately this isn't true in my case and I can't figure out why.
Original, slow query:
MATCH (s:Sandbox {name: "sandbox"})<--(root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
PROFILE returns 38397 total db hits in 2203 ms.
However without matching top-level node, labeled Sandbox, query is 10 times faster:
MATCH (root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
PROFILE returns 38478 total db hits in 159 ms.
To be clear, the result is the same in this case because I have just one Sandbox.
What is wrong with my first query? How should I model/query a hierarchy like this? I could save the sandbox name as a property on the Metric node, which executes faster, but that seems uglier to me.
Because the two queries are not identical.
(Repeated here so the difference is easy to see:)
MATCH (s:Sandbox {name: "sandbox"})<--(root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
MATCH (root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
So in the second query, Neo4j doesn't care about (root). You never use root, and root is already implied by the [:has_metric] relationship, so Neo4j can skip straight to finding ()-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value). In the first query, Neo4j also has to find the Sandbox nodes, and on top of that prove that root is connected to one of them, which is extra work. The extra column can also add more rows to the results being processed, which may add more validation checks across the rest of the query.
Long story short: the first query is slower because it is doing more validation work, and its results will be a subset of the second query's.
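One way to make the more specific query pay off, sketched under the assumption that :Sandbox(name) is indexed (index creation syntax varies by Neo4j version):
CREATE INDEX ON :Sandbox(name)

// Anchor on the indexed Sandbox node first, then expand outward,
// instead of scanning every has_metric pattern and filtering after.
MATCH (s:Sandbox {name: "sandbox"})<--(root)
MATCH (root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value)
RETURN count(v)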

Cypher directionless query not returning all expected paths

I have a cypher query that starts from a machine node, and tries to find nodes related to it using any of the relationship types I've specified:
match p1=(n:machine)-[:REL1|:REL2|:REL3|:PERSONAL_PHONE|:MACHINE|:ADDRESS*]-(n2)
where n.machine="112943691278177215"
optional match p2=(n2)-[*]->()
return p1,p2
limit 300
The optional match clause is my attempt to traverse outwards in my model from each of the nodes found in p1. (Screenshot of the problematic results omitted.)
You can see that from the starting machine node, it finds a personal_phone node via two app nodes related to the machine. (Diagram of this part of the model omitted.)
So it appeared to be working until I realized that certain paths were somehow being left out of the results. If I run a second query showing me all apps related to that particular personal_phone node, I get the following:
match p1=(n:personal_phone)<-[*]-(n2)
where n.personal_phone="(xxx) xxx-xxxx"
return p1
limit 100
The two apps I have segmented out are the two apps shown in the earlier screenshot.
So why doesn't my original query show the other 7 apps related to the personal_phone?
EDIT: Despite the overly broad optional match combined with the LIMIT 300 statement, the returned results show only 52 nodes and 154 rels. This is because paths that follow relationships in the outward direction stop very quickly. I could have put a max of 2 hops on it but was being lazy.
EDIT 2: The query I finally came up with to give me what I want is this:
match p1=(m:machine)<-[:MACHINE]-(a:app)
where m.machine="112943691278177215"
optional match p2=(a:app)-[:REL1|:REL2|:REL3|:PERSONAL_PHONE|:MACHINE|:ADDRESS*0..3]-(n)
where a<>n and a<>m and m<>n
optional match p3=(n)-[r*]->(n2)
where n2<>n
return distinct n, r, n2
This returns 74 nodes and 220 rels, which seems to be the correct result (387 rows). So it seems my incredibly inefficient query was the reason the graph was being truncated. Not only were the nodes being traversed many times, but the paths being returned contained duplicate information, which consumed the limited rows available for return. I guess my new questions are:
When following multiple hops, should I always explicitly make sure the same nodes aren't traversed via where clauses?
If I was to return p3 instead, it returns 1941 rows to display 74 nodes and 220 rels. There seems to be a lot of duplication present. Is it typically better to use return distinct (like I have above) or is there a way to easily dedupe the nodes and relationships within a path?
So part of your issue here (updated questions) is that you're returning paths, not individual nodes/relationships.
For example, if you do MATCH p=(n)-[*]-() and your data is A->B->C->D, then the results you'll get will be A->B, A->B->C, A->B->C->D, and so on. If on the other hand you did MATCH (n)-[r*]-(m) and then worked with r and m, you could get the same data, but deal with the distinct things on the path rather than having to sort that out later.
It seems you want the nodes and relationships, but you're asking for the paths - so you're getting them. ALL of them. :)
When following multiple hops, should I always explicitly make sure the same nodes aren't traversed via where clauses?
Well, the way you did it, yes; but honestly I haven't ever run into that problem before. Part of the issue, again, is the overly broad query you're running. Lacking any constraint, it ends up roping in the items you've already matched, which causes this problem. Better would be to match against some set of possible labels to narrow the query down. By narrowing it down, you wouldn't have the same issue; for example, something like:
MATCH (n)-[r*]-(m)
WHERE 'foo' in labels(m) or 'bar' in labels(m)
RETURN n, r, m;
Note we're not doing path matching, and we're specifying some range of labels that m could have, without leaving it completely wild-west. I tend to formulate queries this way, so your question #2 never really arises. Presumably you have a reasonable data model that would act as your grounding for that.
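For question #2, if you do need to start from paths, a sketch of one way to dedupe (the 3-hop bound is an assumption, added to keep the expansion finite):
MATCH p = (m:machine {machine: "112943691278177215"})-[*..3]-(n)
// Unwind each path into its elements and collect distinct ones,
// rather than returning every overlapping path as its own row.
UNWIND nodes(p) AS node
UNWIND relationships(p) AS rel
RETURN collect(DISTINCT node) AS nodes, collect(DISTINCT rel) AS rels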

Is this the optimal way of expressing "go through all nodes" queries in Cypher?

I have a quite large social graph in which I execute global queries like this one:
match (n:User)-[r:LIKES]->(k:User)
where not (k:User)-[]->(n:User)
return count(r);
They take a lot of time and memory, so I am curious whether they are expressed in an optimal way. I have a feeling that when I execute such a query, Cypher first matches everything that fits the expression (which takes a lot of memory) and only then starts counting. I would rather go through every node, check the pattern, and update the counter if necessary; that way such queries would not require a lot of memory. So how is such a query actually executed? If it is not optimal, is there a way to make it better (in Cypher)?
If you used the query just as you wrote it, you may not be getting what you think you are. Putting labels on node "variables" can cause them to be treated as fresh (partial) patterns instead of bound nodes. Is your query any faster if you use
MATCH (n:User)-[r:LIKES]->(k:User)
WHERE NOT (n)<--(k)
RETURN count(r)
Here's how this works (not considering internal optimizations, which I don't begin to understand).
For each User node, every outgoing LIKES relationship is followed. If the other end of the LIKES relationship is a User node, the two nodes and the relationship are bound to the names n, k, and r and passed to the WHERE clause. Every outgoing relationship on the bound k node is then tested to see if it connects to the bound n node. If no such relationship is found, the match is considered successful. The count() function in the RETURN clause counts the resulting collection of relationships that were passed from the match.
If you have a densely connected graph, and particularly if there are many relationships other than LIKES between nodes, this can be quite an extensive search.
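If only LIKES relationships matter for the mutuality check, a sketch of a narrower test (an assumption about intent, since the original pattern excludes any relationship back from k to n):
MATCH (n:User)-[r:LIKES]->(k:User)
// Only LIKES edges from k back to n are checked, so dense nodes
// with many other relationship types cost less to verify.
WHERE NOT (k)-[:LIKES]->(n)
RETURN count(r)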
As a further experiment, you might try changing the WHERE clause to read
WHERE NOT (k)-->(n)
and see if it makes any difference. I don't think it will, but I could be wrong.
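For what it's worth, later Neo4j versions (4.x and up) also support an existential subquery that makes the bound-node intent explicit; a sketch, assuming such a version:
MATCH (n:User)-[r:LIKES]->(k:User)
// n and k here are the already-bound nodes from the MATCH above.
WHERE NOT EXISTS { MATCH (k)-->(n) }
RETURN count(r)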
