Neo4j speed of failure

I am using embedded Neo4j with Java and generating queries automatically that vary in the number of relations they try to match and whether these relations are optional or not. This is an example of a query with optional relations:
MATCH (target:C4)-[rvara:IN_LOCATION]->(nvara:LOCATION)
OPTIONAL MATCH (nvara:LOCATION)-[rvarb:CONNECTED]->(nvarb:LOCATION)
OPTIONAL MATCH (nvarc:LOCATION)-[rvarc:CONNECTED]->(nvara:LOCATION)
OPTIONAL MATCH (target:C4)-[rvard:HAS_VALUE]->(nvard:TRUE)
RETURN DISTINCT target, FILTER(x IN [rvara, rvarb, rvarc, rvard]
WHERE x IS NOT NULL ) AS collected
I've noticed that when there is no match found, the query engine can take a long time to determine this. When there is a match found, it finds this much more quickly though the search space should be the same; at least I assume they both have to check all possible matches to return results whether they are empty or not. Is there a way to get a query to fail more quickly if it will not match anything?

If you run this query with EXPLAIN in the browser, you get the plan.
Notice all of the "node by label scan" operators at the top of the plan. I think the reason it takes a long time is that if the query fails and generates nothing, it has to scan all nodes with a particular label over and over. If the query succeeds, since it's an OPTIONAL MATCH, I think it only needs to find one match. So finding one (and skipping the rest of the scan) is always going to be faster than scanning the same population of nodes over and over.
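One way to check this guess, as a minimal sketch: EXPLAIN only shows the planned operators without running the query, whereas PROFILE runs it and reports the rows and db hits for each operator. The block below is the question's query, unchanged, with PROFILE in front. Running it once against data that matches and once against data that does not would show which operators account for the difference.

```cypher
// The question's query, unchanged, with PROFILE in front so that the
// plan is annotated with actual rows and db hits per operator.
PROFILE
MATCH (target:C4)-[rvara:IN_LOCATION]->(nvara:LOCATION)
OPTIONAL MATCH (nvara:LOCATION)-[rvarb:CONNECTED]->(nvarb:LOCATION)
OPTIONAL MATCH (nvarc:LOCATION)-[rvarc:CONNECTED]->(nvara:LOCATION)
OPTIONAL MATCH (target:C4)-[rvard:HAS_VALUE]->(nvard:TRUE)
RETURN DISTINCT target, FILTER(x IN [rvara, rvarb, rvarc, rvard]
WHERE x IS NOT NULL ) AS collected
```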

Related

Adding a property filter to cypher query explodes memory, why?

I'm trying to write a query that explores a DAG-type graph (a bill of materials) for all construction paths leading down to a specific part number (second MATCH), among all the parts associated with a given product (first MATCH). There is a strange behavior I don't understand:
This query runs in a reasonable time using Neo4j community edition (~2 s):
WITH '12345' as snid, 'ABCDE' as pid
MATCH (m:Product {full_sn:snid})-[:uses]->(p:Part)
WITH snid, pid, collect(p) AS mparts
MATCH path=(anc:Part)-[:has*]->(child:Part)
WHERE ALL(node IN nodes(path) WHERE node IN mparts)
WITH snid, path, relationships(path)[-1] AS rel,
nodes(path)[-2] AS parent, nodes(path)[-1] AS child
RETURN stuff I want
However, to get the query I want, I must add a filter on the child using the part number pid in the second MATCH statement:
MATCH path=(anc:Part)-[:has*]->(child:Part {pn:pid})
And when I try to run the new query, the Neo4j browser complains that there is not enough memory (Neo.TransientError.General.OutOfMemoryError). When I run it with EXPLAIN, the db hits explode into the tens of billions, as if I'm asking for a massive Cartesian product. But all I have done is add a restriction on the child, so this should reduce the search space, shouldn't it?
I also tried adding an index on :Part(pn). Now the profile shown by EXPLAIN looks very efficient, but I still have the same memory error.
If anyone can help me understand why this change between the two queries is causing problems, I'd greatly appreciate it!
Best wishes,
Ben
MATCH path=(anc:Part)-[:has*]->(child:Part)
The * is exploding to every downstream child node.
That's appropriate if that is what's desired. If you make this an optional match and limit it to the collected items, this should restrict the returned results.
OPTIONAL MATCH path=(anc:Part)-[:has*]->(child:Part)
This is conceptually (and crudely) similar to an outer join in SQL.
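A different thing to try, sketched here as an assumption rather than a confirmed fix: bind the filtered child first, so that the variable-length pattern is anchored on the few parts with the wanted pn instead of being expanded from every :Part. This is not the OPTIONAL MATCH suggestion above, and whether the planner actually picks a cheaper plan for it is something only EXPLAIN or PROFILE on the real data would show. RETURN path is a placeholder for the RETURN clause that the question leaves out.

```cypher
WITH '12345' AS snid, 'ABCDE' AS pid
MATCH (m:Product {full_sn: snid})-[:uses]->(p:Part)
WITH snid, pid, collect(p) AS mparts
// Bind the target part first; an index on :Part(pn) may turn this into a lookup.
MATCH (child:Part {pn: pid})
WHERE child IN mparts
// Expand upwards from the already-bound child only.
MATCH path = (anc:Part)-[:has*]->(child)
WHERE ALL(node IN nodes(path) WHERE node IN mparts)
RETURN path  // placeholder for the question's elided RETURN clause
```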

Neo4j more specific query slower than more generic one

I'm trying to count all values collected in one subtree of my graph. I thought that the more descriptive path from the root node I provide, the faster the query will run. Unfortunately this isn't true in my case and I can't figure out why.
Original, slow query:
MATCH (s:Sandbox {name: "sandbox"})<--(root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
PROFILE returns 38397 total db hits in 2203 ms.
However, without matching the top-level node labeled Sandbox, the query is 10 times faster:
MATCH (root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
PROFILE returns 38478 total db hits in 159 ms
To make this clear, in this case the result is the same as I have just one Sandbox.
What is wrong with my first query? How should I model or query a hierarchy like that? I could save the sandbox name as a property on the Metric node, which executes faster, but it seems uglier to me.
Because the two queries are not identical.
(Side by side, so the difference is visible:)
MATCH (s:Sandbox {name: "sandbox"})<--(root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
MATCH (root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
So in the second query, Neo4j doesn't care about (root). You never use root, and root is already implied by [:has_metric], so Neo4j can just skip to finding ()-[:has_metric]->(n:Metric)-[:most_recent|prev]. In the first query, now we also have to find these Sandbox nodes! And on top of that, root has to be connected to that too! So Neo4j has to do extra work to prove that that is true. The extra column can also add more rows to the results being processed, which may add more validation checks on the rest of the query.
Long story short, the first query is slower because it is doing more validation work. The results of the first query will also be a subset of the results of the second.

Is it possible to reduce/optimize this query for node degrees?

Given the following Cypher query that returns afferent (inbound) and efferent (outbound) connections, and the sum as the node degree:
START n = node(*)
RETURN n.name, length((n)-->()) AS efferent,
length((n)<--()) AS afferent,
length((n)-->()) + length((n)<--()) AS degree
Is it possible to reduce the query so that the two length() functions are not repeated in the summation in the degree column?
You can resolve and bind the two length computations separately from and before returning by using WITH. Then you can sum those bound values while returning.
START n = node(*)
WITH n, length((n)-->()) AS efferent, length((n)<--()) AS afferent
RETURN n.name, efferent, afferent, efferent + afferent AS degree
You may want to use MATCH (n) instead of START n = node(*) if your Neo4j version is >2.0, but that's not what you're asking about so I'll assume you know what you are doing.
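For Neo4j 2.0 or later, the same query with MATCH (n) in place of START might look like the sketch below. It keeps the length() calls exactly as they appear in the question. Later Neo4j versions moved away from length() on a pattern, so the function name should be treated as version-dependent.

```cypher
// Touch every node, compute each direction once, then reuse the bound values.
MATCH (n)
WITH n, length((n)-->()) AS efferent, length((n)<--()) AS afferent
RETURN n.name, efferent, afferent, efferent + afferent AS degree
```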
EDIT
In Neo4j 1.x START is how you began a query. From 2.x and on, while START is still around, MATCH is the preferred way. If you have Neo4j 2.x and don't know a particular reason why you should use START, then you should use MATCH. Here's a short explanation of why.
Your query is written to touch the entire graph. When that is the intention there is not a very big difference between START n = node(*) and MATCH (n). The execution plans do differ, but I'm not aware that the difference is very important.
If, however, you want to perform your computations only on part of the graph, and you add to your 'starting point pattern' to that effect, then there will be significant differences. If, for example, you want to perform your computation only on nodes with the :User label
START n = node(*)
WHERE n:User
will still pull up all nodes, and then apply a filter to discard those that don't have the label, whereas
MATCH (n)
WHERE n:User
will only pull up the nodes that have that label to begin with.
The general difference is this: WHERE is a dependent clause accompanying START, MATCH, OPTIONAL MATCH or WITH. When it accompanies START or WITH it does not work by modifying the operation but by filtering the results; when it accompanies MATCH and OPTIONAL MATCH it modifies (as often as it can) the operation and therefore doesn't have to filter the results. The difference is that between shouting "Everyone, if you are my child, don't go into the road" and "Kids, don't go into the road".
There are cases when WHERE is not pulled into the MATCH clause. One example is
MATCH (n)
WHERE n:Male OR n:Female
In this case all nodes are pulled up and then filtered, just as if we had used START instead of MATCH.
Sometimes it is easy to know which patterns in the WHERE clause are able to be pulled in to modify the MATCH. This is the case for patterns that you can move into the MATCH clause yourself, by simply rearranging the query. The first MATCH example above could also be expressed
MATCH (n:User)
There is no way, however, to do this for the WHERE clause in the second MATCH example, WHERE n:Male OR n:Female.
That a WHERE pattern cannot be moved into the MATCH clause by reformulating the query is not a reliable indicator that the query planner is unable to make use of it in the match operation. Because Cypher is a declarative language, you ultimately have to trust the query planner to implement the instructions wisely; trust, but verify.
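For the WHERE n:Male OR n:Female case, one reformulation that goes beyond a single MATCH clause is to split the query into one branch per label and combine the branches with UNION. This is a minimal sketch using the labels from the example above, not something the planner is guaranteed to execute more cheaply; PROFILE on both forms would show whether it pays off. UNION (as opposed to UNION ALL) removes duplicate rows, so a node that carries both labels is returned once.

```cypher
// One branch per label, so each MATCH can start from its own label.
MATCH (n:Male)
RETURN n
UNION
MATCH (n:Female)
RETURN n
```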
One other difference between START and MATCH pertains to indexing. If you use 'legacy indexing' then you need to use START to access these indices. The 'new' label indices (about two years old, I believe) have continuously been improved for features and efficiency, and we are running out of reasons to use the old indices. I think the only reason left may be full-text indexing, for which a configured legacy Lucene index is still necessary. In time this feature will also be added to the label indices. Possibly, at that point, the START clause will be removed from Cypher altogether, but that is just the author's speculation.

Cypher directionless query not returning all expected paths

I have a cypher query that starts from a machine node, and tries to find nodes related to it using any of the relationship types I've specified:
match p1=(n:machine)-[:REL1|:REL2|:REL3|:PERSONAL_PHONE|:MACHINE|:ADDRESS*]-(n2)
where n.machine="112943691278177215"
optional match p2=(n2)-[*]->()
return p1,p2
limit 300
The optional match clause is my attempt to traverse outwards in my model from each of the nodes found in p1. The below screenshot shows the part of the results I'm having issues with:
You can see from the starting machine node, it finds a personal_phone node via two app nodes related to the machine. For clarification, this part of the model is designed like so:
So it appeared to be working until I realized that certain paths were somehow being left out of the results. If I run a second query showing me all apps related to that particular personal_phone node, I get the following:
match p1=(n:personal_phone)<-[*]-(n2)
where n.personal_phone="(xxx) xxx-xxxx"
return p1
limit 100
The two apps I have segmented out, are the two apps shown in the earlier image.
So why doesn't my original query show the other 7 apps related to the personal_phone?
EDIT : Despite the overly broad optional match combined with the limit 300 statement, the returned results show only 52 nodes and 154 rels. This is because the paths following relationships with an outward direction are going to stop very quickly. I could have put a max 2 on it but was being lazy.
EDIT 2: The query I finally came up with to give me what I want is this:
match p1=(m:machine)<-[:MACHINE]-(a:app)
where m.machine="112943691278177215"
optional match p2=(a:app)-[:REL1|:REL2|:REL3|:PERSONAL_PHONE|:MACHINE|:ADDRESS*0..3]-(n)
where a<>n and a<>m and m<>n
optional match p3=(n)-[r*]->(n2)
where n2<>n
return distinct n, r, n2
This returns 74 nodes and 220 rels which seems to be the correct result (387 rows). So it seems like my incredibly inefficient query was the reason the graph was being truncated. Not only were the nodes being traversed many times, but the paths being returned contained duplicate information which consumed the limited rows available for return. I guess my new questions are:
1. When following multiple hops, should I always explicitly make sure the same nodes aren't traversed via where clauses?
2. If I was to return p3 instead, it returns 1941 rows to display 74 nodes and 220 rels. There seems to be a lot of duplication present. Is it typically better to use return distinct (like I have above) or is there a way to easily dedupe the nodes and relationships within a path?
So part of your issue here (updated questions) is that you're returning paths, and not individual nodes/relationships.
For example, if you do MATCH p=(n)-[*]-() and your data is A->B->C->D then the results you'll get will be A->B, A->B->C, A->B->C->D and so on. If on the other hand you did MATCH (n)-[r*]-(m) and then worked with r and m, you could get the same data, but deal with the distinct things on the path rather than have to sort that out later.
It seems you want the nodes and relationships, but you're asking for the paths - so you're getting them. ALL of them. :)
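If the paths are still needed for matching, one way to get the distinct pieces afterwards is to unwind each path into its relationships and return those with DISTINCT. The following is a sketch only: it reuses the pattern and property from EDIT 2 of the question, assumes a Neo4j version that has UNWIND, and the aliases source and dest are made up for the example.

```cypher
MATCH p = (m:machine)<-[:MACHINE]-(a:app)-[:REL1|:REL2|:REL3|:PERSONAL_PHONE|:MACHINE|:ADDRESS*0..3]-(n)
WHERE m.machine = "112943691278177215"
// Break every path into its individual relationships ...
UNWIND relationships(p) AS rel
// ... and keep each relationship (with its two end nodes) only once.
RETURN DISTINCT startNode(rel) AS source, rel, endNode(rel) AS dest
```

Each row is then one relationship with its end nodes, no matter how many paths contain it.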
When following multiple hops, should I always explicitly make sure the
same nodes aren't traversed via where clauses?
Well, the way you did it, yes -- but honestly I haven't ever run into that problem before. Part of the issue again is the overly-broad query you're running. Lacking any constraint, it ends up roping in the items you've already matched, which buys you this problem. Perhaps better would be to match some set of possible labels, to narrow your query down. By narrowing it down, you wouldn't have the same issue, for example something like:
MATCH (n)-[r*]-(m)
WHERE 'foo' in labels(m) or 'bar' in labels(m)
RETURN n, r, m;
Note we're not doing path matching, and we're specifying some range of labels that could be m, without leaving it completely wild-west. I tend to formulate queries this way, so your question #2 never really arises. Presumably you have a reasonable data model that would act as your grounding for that.

No START clause VS. n = node(*)

I've read in the Neo4j 2.0 docs that the START clause is optional and
Cypher will try and infer start points from your query
I have experimentally found that
START user = node(*)
MATCH (user:User)-[r:KNOWS]-(user2:User)
RETURN user.username AS username, collect(user2.username) AS username2
gives the same results as
MATCH (user:User)-[r:KNOWS]-(user2:User)
RETURN user.username AS username, collect(user2.username) AS username2
for small data sets.
My question is: is it semantically the same? Will they always return same result set (I'm not talking about the order)? Even for large data sets? Does skipping START guarantee traversing all nodes? If they are semantically equal why would one ever use node(*)?
Your queries are not semantically the same, but they will always return the same result. The reason is that in your first query, having stated the 'universal node pattern' node(*), you immediately limit it with a further pattern in your MATCH clause. In your second query you state this narrower pattern from the start. Since the two MATCH clauses are equivalent and are the narrowest pattern declared in each query, and since the RETURN clauses are the same, the two queries return the same results.
The START clause used to be the way to state the initial pattern for a query, and it was tied up with indexing. Using node(*) or relationship(*) was rarely recommended or useful, but the clause was used for index retrievals, as in START user=node:userIndex(name="Maciej Ziarko"). With 2.0, labels and label indexing were introduced, and this is now the preferred way to bind nodes in a query.
Skipping START will not guarantee traversing all nodes (or perhaps more accurately: binding all nodes), but neither do you need a START clause to do so. Using MATCH (user) (without limiting what is bound to user with labels or relationships) you can still bind every node in your database. It is still rarely recommended or useful.
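To make the contrast concrete, here is a sketch of a label-based counterpart to the legacy index lookup mentioned above. It assumes that the name searched in userIndex corresponds to the username property used in the question, and it uses the 2.x form of CREATE INDEX; later Neo4j versions changed the index-creation syntax.

```cypher
// Create a label index on the property (run once, as its own statement).
CREATE INDEX ON :User(username);

// Bind the node through its label and property; no START clause needed.
MATCH (user:User {username: "Maciej Ziarko"})
RETURN user;
```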
