Cypher directionless query not returning all expected paths - neo4j

I have a cypher query that starts from a machine node, and tries to find nodes related to it using any of the relationship types I've specified:
match p1=(n:machine)-[:REL1|:REL2|:REL3|:PERSONAL_PHONE|:MACHINE|:ADDRESS*]-(n2)
where n.machine="112943691278177215"
optional match p2=(n2)-[*]->()
return p1,p2
limit 300
The optional match clause is my attempt to traverse outwards in my model from each of the nodes found in p1. The below screenshot shows the part of the results I'm having issues with:
You can see from the starting machine node, it finds a personal_phone node via two app nodes related to the machine. For clarification, this part of the model is designed like so:
So it appeared to be working until I realized that certain paths were somehow being left out of the results. If I run a second query showing me all apps related to that particular personal_phone node, I get the following:
match p1=(n:personal_phone)<-[*]-(n2)
where n.personal_phone="(xxx) xxx-xxxx"
return p1
limit 100
The two apps I have segmented out, are the two apps shown in the earlier image.
So why doesn't my original query show the other 7 apps related to the personal_phone?
EDIT : Despite the overly broad optional match combined with the limit 300 statement, the returned results show only 52 nodes and 154 rels. This is because the paths following relationships with an outward direction are going to stop very quickly. I could have put a max 2 on it but was being lazy.
EDIT 2: The query I finally came up with to give me what I want is this:
match p1=(m:machine)<-[:MACHINE]-(a:app)
where m.machine="112943691278177215"
optional match p2=(a:app)-[:REL1|:REL2|:REL3|:PERSONAL_PHONE|:MACHINE|:ADDRESS*0..3]-(n)
where a<>n and a<>m and m<>n
optional match p3=(n)-[r*]->(n2)
where n2<>n
return distinct n, r, n2
This returns 74 nodes and 220 rels which seems to be the correct result (387 rows). So it seems like my incredibly inefficient query was the reason the graph was being truncated. Not only were the nodes being traversed many times, but the paths being returned contained duplicate information which consumed the limited rows available for return. I guess my new questions are:
When following multiple hops, should I always explicitly make sure the same nodes aren't traversed via where clauses?
If I was to return p3 instead, it returns 1941 rows to display 74 nodes and 220 rels. There seems to be a lot of duplication present. Is it typically better to use return distinct (like I have above) or is there a way to easily dedupe the nodes and relationships within a path?

So part of your issue here (updated questions) is that you're returning paths, and not individual nodes/relationships.
For example, if you do MATCH p=(n)-[*]-() and your data is A->B->C->D, then the results you'll get will be A->B, A->B->C, A->B->C->D and so on. If on the other hand you did MATCH (n)-[r*]-(m) and then worked with r and m, you could get the same data, but deal with the distinct things on the path rather than having to sort that out later.
It seems you want the nodes and relationships, but you're asking for the paths - so you're getting them. ALL of them. :)
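To make the duplication concrete, here is a small plain-Python simulation (it does not talk to Neo4j; the edge list and function name are made up for illustration) of what a variable-length match starting at A produces on the A->B->C->D chain. It is only a sketch of the idea, not of how the Cypher engine is implemented.

```python
# Plain-Python simulation of a variable-length match over A->B->C->D.
# EDGES and paths_from are illustrative names, not part of any Neo4j API.
EDGES = [("A", "B"), ("B", "C"), ("C", "D")]


def paths_from(start, edges):
    """Return every directed path (a list of edges) that begins at start,
    never reusing an edge within one path."""
    found = []

    def extend(node, path):
        for edge in edges:
            if edge[0] == node and edge not in path:
                longer = path + [edge]
                found.append(longer)      # every prefix is a result row
                extend(edge[1], longer)

    extend(start, [])
    return found


paths = paths_from("A", EDGES)
for path in paths:
    print(" -> ".join([path[0][0]] + [edge[1] for edge in path]))

# Compare what the paths mention with what is actually distinct.
edge_mentions = sum(len(path) for path in paths)
distinct_edges = {edge for path in paths for edge in path}
distinct_nodes = {node for edge in distinct_edges for node in edge}
print(len(paths), edge_mentions, len(distinct_edges), len(distinct_nodes))
```

Three paths mention six relationships between them, yet only three relationships and four nodes are distinct. On a branching graph the gap grows much faster, which is consistent with 1941 path rows describing only 74 nodes and 220 rels.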
When following multiple hops, should I always explicitly make sure the
same nodes aren't traversed via where clauses?
Well, the way you did it, yes -- but honestly I haven't ever run into that problem before. Part of the issue again is the overly-broad query you're running. Lacking any constraint, it ends up roping in the items you've already matched, which buys you this problem. Perhaps better would be to match some set of possible labels, to narrow your query down. By narrowing it down, you wouldn't have the same issue, for example something like:
MATCH (n)-[r*]-(m)
WHERE 'foo' in labels(m) or 'bar' in labels(m)
RETURN n, r, m;
Note we're not doing path matching, and we're specifying some range of labels that could be m, without leaving it completely wild-west. I tend to formulate queries this way, so your question #2 never really arises. Presumably you have a reasonable data model that would act as your grounding for that.

Related

Adding a property filter to cypher query explodes memory, why?

I'm trying to write a query that explores a DAG-type graph (a bill of materials) for all construction paths leading down to a specific part number (second MATCH), among all the parts associated with a given product (first MATCH). There is a strange behavior I don't understand:
This query runs in a reasonable time using Neo4j community edition (~2 s):
WITH '12345' as snid, 'ABCDE' as pid
MATCH (m:Product {full_sn:snid})-[:uses]->(p:Part)
WITH snid, pid, collect(p) AS mparts
MATCH path=(anc:Part)-[:has*]->(child:Part)
WHERE ALL(node IN nodes(path) WHERE node IN mparts)
WITH snid, path, relationships(path)[-1] AS rel,
nodes(path)[-2] AS parent, nodes(path)[-1] AS child
RETURN stuff I want
However, to get the query I want, I must add a filter on the child using the part number pid in the second MATCH statement:
MATCH path=(anc:Part)-[:has*]->(child:Part {pn:pid})
And when I try to run the new query, the neo4j browser complains that there is not enough memory (Neo.TransientError.General.OutOfMemoryError). When I run it with EXPLAIN, the db hits are exploding into the tens of billions, as if I'm asking it for a massive Cartesian product. But all I have done is add a restriction on the child, so this should be reducing the search space, shouldn't it?
I also tried adding an index on :Part(pn). Now the profile shown by EXPLAIN looks very efficient, but I still have the same memory error.
If anyone can help me understand why this change between the two queries is causing problems, I'd greatly appreciate it!
Best wishes,
Ben
MATCH path=(anc:Part)-[:has*]->(child:Part)
The * is exploding to every downstream child node.
That's appropriate if that is what's desired. If you make this an optional match and limit it to the collected items, this should restrict the returned results.
OPTIONAL MATCH path=(anc:Part)-[:has*]->(child:Part)
OPTIONAL MATCH is conceptually (and crudely) similar to an outer join in SQL.

Cypher: Find any path between nodes

I have a neo4j graph that looks like this:
Nodes:
Blue Nodes: Account
Red Nodes: PhoneNumber
Green Nodes: Email
Graph design:
(:PhoneNumber) -[:PART_OF]->(:Account)
(:Email) -[:PART_OF]->(:Account)
The problem I am trying to solve is to
Find any path that exists between Account1 and Account2.
This is what I have tried so far with no success:
1. MATCH p=shortestPath((a1:Account {accId:'1234'})-[]-(a2:Account {accId:'5678'})) RETURN p;
2. MATCH p=shortestPath((a1:Account {accId:'1234'})-[:PART_OF]-(a2:Account {accId:'5678'})) RETURN p;
3. MATCH p=shortestPath((a1:Account {accId:'1234'})-[*]-(a2:Account {accId:'5678'})) RETURN p;
4. MATCH p=(a1:Account {accId:'1234'})<-[:PART_OF*1..100]-(n)-[:PART_OF]->(a2:Account {accId:'5678'}) RETURN p;
5. Same queries as above without the shortest path function call.
By looking at the graph I can see there is a path between these 2 nodes but none of my queries yield any result. I am sure this is a very simple query but being new to Cypher, I am having a hard time figuring out the right solution. Any help is appreciated.
Thanks.
All those queries are along the right lines, but need some tweaking to work. In the longer term, though, to get a better system for easily searching for connections between accounts, you'll probably want to refactor your graph.
Solution for Now: Making Your Query Work
The path between any two (n:Account) nodes in your graph is going to look something like this:
(a1:Account)<-[:PART_OF]-(:Email)-[:PART_OF]->(ai:Account)<-[:PART_OF]-(:PhoneNumber)-[:PART_OF]->(a2:Account)
Since you have only one type of relationship in your graph, the two nodes will thus be connected by an indeterminate number of patterns like the following:
<-[:PART_OF]-(:Email)-[:PART_OF]->
or
<-[:PART_OF]-(:PhoneNumber)-[:PART_OF]->
So, your two nodes will be connected through an indeterminate number of intermediate (:Account), (:Email), or (:PhoneNumber) nodes, all connected by -[:PART_OF]- relationships of alternating direction. Unfortunately, to my knowledge (and I'd love to be corrected here), using straight Cypher you can't search for a repeated pattern like this in your current graph. So, you'll simply have to use an undirected search to find nodes (a1:Account) and (a2:Account) connected through -[:PART_OF]- relationships. So, at first glance your query would look like this:
MATCH p=shortestPath((a1:Account { accId: {a1_id} })-[:PART_OF*]-(a2:Account { accId: {a2_id} }))
RETURN *
(Notice that here I've used Cypher parameters rather than the literal values you put in the original post.)
That's very similar to your query #3, but, like you said - it doesn't work. I'm guessing what happens is that it doesn't return a result, or returns an out of memory exception? The problem is that since your graph has circular paths in it, and that query will match a path of any length, the matching algorithm will literally go around in circles until it runs out of memory. So, you want to set a limit, like you have in query #4, but without the directions (which is why that query doesn't work).
So, let's set a limit. Your limit of 100 relationships is a little on the large side, especially in a cyclical graph (i.e., one with circles), and could potentially match in the region of 2^100 paths.
As a (very arbitrary) rule of thumb, any query with a potential undirected and unlabelled path length of more than 5 or 6 may begin to cause problems unless you're very careful with your graph design. In your example, it looks like these two nodes are connected via a path length of 8. We also know that for any two nodes, the given minimum path length will be two (i.e., two -[:PART_OF]- relationships, one into and one out of a node labelled either :Email or :PhoneNumber), and that any two accounts, if linked, will be linked via an even number of relationships.
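The even-length observation can be checked on a toy graph. The sketch below is plain Python with a hypothetical four-relationship graph (the node names and the shortest_hops helper are invented for illustration); it runs a breadth-first search that ignores relationship direction and stops at a maximum number of hops, which is roughly what a bounded undirected shortest-path search does conceptually.

```python
from collections import deque

# Hypothetical data: each pair is (source, target) of a PART_OF relationship.
PART_OF = [
    ("email1", "acc1"),
    ("email1", "acc2"),
    ("phone1", "acc2"),
    ("phone1", "acc3"),
]


def shortest_hops(start, goal, edges, max_hops):
    """Breadth-first search that ignores direction and gives up after
    max_hops relationships. Returns the hop count, or None if not found."""
    neighbours = {}
    for src, dst in edges:
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)

    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        if hops == max_hops:
            continue  # do not expand beyond the bound
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


print(shortest_hops("acc1", "acc2", PART_OF, 10))  # 2
print(shortest_hops("acc1", "acc3", PART_OF, 10))  # 4
print(shortest_hops("acc1", "acc3", PART_OF, 3))   # None
```

Because accounts only ever touch emails or phone numbers, every account-to-account distance comes out even, and a bound that is too small simply yields no result rather than a wrong one.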
So, ideally we'd set our relationship length between 2 and 10. However, Cypher's shortestPath() function only supports paths with a minimum length of either 0 or 1, so I've set it between 1 and 10 in the example below (even though we know that in reality, the shortest path will have a length of at least two).
MATCH p=shortestPath((a1:Account { accId: {a1_id} })-[:PART_OF*1..10]-(a2:Account { accId: {a2_id} }))
RETURN *
Hopefully, this will work with your use case, but remember, it may still be very memory intensive to run on a large graph.
Longer Term Solution: Refactor Graph and/or Use APOC
Depending on your use case, a better or longer term solution would be to refactor your graph to be more specific about relationships to speed up query times when you want to find accounts linked only by email or phone number - i.e. -[:ACCOUNT_HAS_EMAIL]- and -[:ACCOUNT_HAS_PHONE]-. You may then also want to use APOC's shortest path algorithms or path finder functions, which will most likely return a faster result than using cypher, and allow you to be more specific about relationship types as your graph expands to take in more data.

Neo4j and Cypher - How can I create/merge chained sequential node relationships (and even better time-series)?

To keep things simple, as part of the ETL on my time-series data, I added a sequence number property to each row corresponding to 0..370365 (370,366 nodes, 5,555,490 properties - not that big). I later added a second property and named it "outeseq" (original) and "ineseq" (second) to see if an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000), but past that, it's just an endless wait. My JVM has 16g max (if it can even use it on a Windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can also get bits of the relationships built with parameters, but haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way to do this.
I profiled the query, but didn't see any reason for it to "blow up".
Another question: I would like each relationship to have a property to represent the difference in the time-stamps of each node or delta-t. Is there a way to take the difference between the two values in two sequential nodes, and assign it to the relationship?....for all of the relationships at the same time?
The last Q, if you have the time - I'd really like to use the raw data and just chain the directed relationships from one node's timestamp to the next nearest node with the minimum delta, but I didn't go straight for this for fear that it would cause a scan of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other db's for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest running them page by page over all the nodes. This seems feasible because outeseq and ineseq are unique and numeric, so you can sort nodes by those properties and run the query against one slice at a time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
You will need to run the query about 13 times, changing {offset} each time, to cover all the data (370,365 consecutive pairs at 30,000 per run). It would be convenient to script this in any language that has a neo4j client.
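As a rough check of that count, the offsets can be worked out in a few lines of Python. This sketch only does the arithmetic; it does not connect to a database, and the variable names are illustrative. It assumes one FORWARD_SEQ relationship per consecutive pair, so one fewer than the number of nodes.

```python
import math

NODE_COUNT = 370_366  # number of BOOK nodes, from the question
BATCH_SIZE = 30_000   # the LIMIT used in the paged query

# One relationship per consecutive (a, b) pair (assumption stated above).
pair_count = NODE_COUNT - 1
batches = math.ceil(pair_count / BATCH_SIZE)

# Values to substitute for the offset parameter, one per run.
offsets = [i * BATCH_SIZE for i in range(batches)]

print(batches)                  # 13
print(offsets[0], offsets[-1])  # 0 360000
```

The final run covers only the remaining 10,365 pairs.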
Updating Relationship's Properties
You can assign timestamp delta to relationships using SET clause following the MATCH. Assuming that a timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
Chaining Nodes With Minimal Delta
When relationships have the delta property, the graph becomes a weighted graph. So we can apply this approach to calculate the shortest path using deltas. Then we just save the length of the shortest path (the sum of deltas) into the relationship between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight = 0, r IN relationships(p) | weight + r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
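For readers unfamiliar with reduce, the same two steps (a delta per relationship, then a running sum along the path) look like this in plain Python. The timestamps are hypothetical values chosen for illustration, listed in sequence order.

```python
from functools import reduce

# Hypothetical timestamps of four sequential nodes, in sequence order.
timestamps = [1000, 1004, 1010, 1011]

# One delta per FORWARD_SEQ relationship: the absolute difference
# between each node and the next one.
deltas = [abs(b - a) for a, b in zip(timestamps, timestamps[1:])]

# Same shape as the reduce() in the query: start at 0, add each delta.
total_delta = reduce(lambda weight, delta: weight + delta, deltas, 0)

print(deltas)       # [4, 6, 1]
print(total_delta)  # 11
```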
Disclaimer: the queries above are not guaranteed to work as written; they just hint at possible approaches to the problem.

cypher delete is taking forever

I am trying to delete data from neo4j using the following query:
MATCH (c:Customer {customerID: '16af89a6-832b-4bef-b026-eafea3873d69'})
MATCH (c)<-[r:DEPT_OF]-(dept:Dept)-[*]-(n2) WITH r, dept, n2 LIMIT 10
DETACH DELETE r, dept, n2;
This statement is taking forever and not deleting anything when I inspect the dept node for example. Is there anything I am missing here?
You have a variable length path without specifying an upper bound in this line:
MATCH (c)<-[r:DEPT_OF]-(dept:Dept)-[*]-(n2) WITH r, dept, n2 LIMIT 10
This will result in a lot of traversals. Does your data model allow for specifying an upper bound on the number of hops to match n2? Also, you should specify a label or labels for n2.
Also, you don't need to include r in the DETACH DELETE statement. Any existing relationships of a node being deleted will also be deleted when using DETACH DELETE.
Edit
The pattern (dept:Dept)-[*]-(n2) indicates a bidirectional path of any length (with no upper bound). To specify an upper bound on the variable length path simply replace the (dept:Dept)-[*]-(n2) piece of the pattern with (dept:Dept)-[*1..3]-(n2). This will limit the length of the paths traversed to a maximum of three relationships between (dept:Dept) and (n2) (although this might not be appropriate for your data model). It would also be good to add labels and a relationship direction to the pattern (appropriate for your data model), something like:
MATCH (c)<-[r:DEPT_OF]-(dept:Dept)<-[:BELONGS_TO*1..2]-(n2:Product) WITH r, dept, n2 LIMIT 10
There are many different issues in your query. Here are the ones I've identified.
1. The number of paths discoverable by a variable length path query (let's assume the lower bound is 0 or 1) is roughly an exponential function of the maximum path length. That is, if every relevant node has M relationships, and the maximum depth being searched (or, if there is no upper bound, the maximum possible depth) is N, then in the worst case the number of possible paths is (M ^ N). For example, if we plug in 5 and 10 for M and N, we get 9,765,625 possible paths (and the same number of nodes and relationships to be deleted). This is probably the main reason why your query takes a long time.
2. A second major concern would be total failure of the query due to an out-of-memory situation in the neo4j engine, due to the potentially huge amount of data that needs to be in memory. You have apparently not encountered this yet, but you might. You could try to minimize the number of found paths by only matching complete paths (that is, paths in which the last node has no other node to connect to). I don’t know your data model, so I can’t show you a Cypher clause to do that for your data. But if you do this, your query would have to be modified to use all the nodes in the found paths rather than just the path end nodes.
3. The second MATCH clause will only match dept nodes that have at least one relationship other than r, because the default lower bound for a variable-length path is a length of 1. Therefore, this query will not delete dept nodes that have no other relationships. You could solve this by specifying a lower bound of 0, as in: [*0..].
4. You have a LIMIT 10 on your WITH clause, so your query is only going to attempt to delete a few dept and n2 nodes. Also, since you are not necessarily deleting complete paths, you may end up with “disconnected subgraphs” that are no longer connected to anything else. So, you should remove the LIMIT clause, even though that would make your query take even longer.
5. It is theoretically possible (but I don't know your data model) for an n2 to be the same as c. If your data allows this to be possible, but you never want your query to delete c, you need to add a WHERE clause right after the relevant MATCH clause to prevent that (see below).
6. Since a MATCH clause filters out any matches where the same relationship is used twice, your second MATCH clause is actually doing extra work to ensure that none of the relationships in each variable length path matches r. Since your use case does not need this checking (after you fix item 5), you could avoid that unneeded check by splitting the second MATCH clause so that r is matched in its own clause.
Here is a sample fix for items 3, 4, 5, 6:
MATCH (c:Customer {customerID: '16af89a6-832b-4bef-b026-eafea3873d69'})
MATCH (c)<-[r:DEPT_OF]-(dept:Dept)
MATCH (dept)-[*0..]-(n2)
WHERE n2 <> c
DETACH DELETE dept, n2;
But, since the above does not solve items 1 or 2, your query could still take a very long time and/or fail. If you provide a more complete idea of your data model, we might be able to solve item 2. However, item 1 is the main issue, and may require rethinking your data model or possibly splitting the deletion into multiple queries.
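The worst-case estimate of M ^ N paths can be sanity-checked with a few lines of Python. This is only an illustration of the arithmetic under the same simplifying assumption (every node has exactly M onward relationships); the function names are invented for the example.

```python
def count_paths(branching, depth):
    """Count paths of exactly `depth` relationships from the root of a tree
    in which every node has `branching` children, by actually walking them."""
    if depth == 0:
        return 1
    return sum(count_paths(branching, depth - 1) for _ in range(branching))


def worst_case_paths(branching, depth):
    """Closed form of the same count: M ^ N."""
    return branching ** depth


# The walk and the formula agree on a small case.
print(count_paths(3, 4), worst_case_paths(3, 4))  # 81 81

# Growth for M = 5 as the maximum depth increases.
for max_depth in (2, 4, 6, 8, 10):
    print(max_depth, worst_case_paths(5, max_depth))
```

Each extra hop multiplies the count by M, which is why even a modest upper bound on the variable-length pattern makes such a large difference.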

what is the difference between one node connected to other after three nodes and 2nd degree?

i am running the queries to find if node a is connected to node b directly or indirectly. for directly i can use
MATCH (n)-[r]->(a) OR MATCH (n)-[r]->(b)
when i use the query
MATCH (b)-[r*1..2]->(a)
the results are different. i am confused to understand what is the difference between below mentioned two queries.
1- OPTIONAL MATCH L=a-->c-->e-->b with a,b,L,p,q,n
2- OPTIONAL MATCH M=(a)-[r*1..2]->(b)
Are both of these queries the same? If they are, why are the results different in my case?
What I want to check is whether a is connected to b at a distance of two hops.
i will be very grateful for your contribution. Thanks in advance
This query:
MATCH (b)-[r*1..2]->(a)
Means match one or two hops away. So the result is different from your first queries, because your first queries match exactly one hop away. This one goes further, so the results are different. Here, "hops" mean relationships not nodes.
This query:
OPTIONAL MATCH L=a-->c-->e-->b with a,b,L,p,q,n
Is very different because you're navigating through 2 intermediate nodes (c and e) with 3 intermediate relationships (a->c, c->e, e->b).
By the way, you can of course use optional match here, but for you it's not needed. If a and b must be connected, then using optional match doesn't really change anything for you here.
So you need to decide whether you want something two relationships away, or something with two intermediate nodes in between (three relationships away). Those are different things.
Another way of writing 2 hops/relationships away would be this:
MATCH p=(a)-[r1]-(m)-[r2]-(b)
RETURN p
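The difference between the patterns can be seen on a toy chain a->c->e->b. The sketch below is plain Python, not Cypher; the edge list and the reachable helper are made up for illustration, and it follows directed edges only. It tracks sets of nodes per hop rather than individual paths, which is enough for an acyclic example like this one.

```python
# Hypothetical directed chain matching the a-->c-->e-->b pattern.
EDGES = [("a", "c"), ("c", "e"), ("e", "b")]


def reachable(start, edges, min_hops, max_hops):
    """Nodes reachable from start by following between min_hops and
    max_hops directed edges (a hop is one relationship)."""
    frontier = {start}
    result = set()
    for hops in range(1, max_hops + 1):
        # Move every node in the frontier one relationship forward.
        frontier = {dst for src, dst in edges if src in frontier}
        if hops >= min_hops:
            result |= frontier
    return result


print(sorted(reachable("a", EDGES, 1, 2)))  # ['c', 'e']  like [r*1..2]
print(sorted(reachable("a", EDGES, 2, 2)))  # ['e']       exactly two hops
print(sorted(reachable("a", EDGES, 3, 3)))  # ['b']       like a-->c-->e-->b
```

On this data b is three relationships from a, so the 1..2 range never reaches it, while the explicit three-relationship pattern does.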

Resources