The database has a graph with the following 3 nodes:
...->(1) ------>(3)-->...
\ ^
\ |
---->(2)---/
Now, I want to get all distinct nodes that are reachable from node 1 to node 3, including themselves where I know exactly unique properties of node 1 and node 3 (the nodes are actually commits from a github repository). So, I came up with the following query:
MATCH (origin:App)
WHERE origin.commit='10cb31b0a72525923c01dc34f8690f311a361d42'
MATCH (destination:App)
WHERE destination.commit='51fde433973463f057ffcbcbab0bc8944ab3ec9c'
MATCH (origin)-[:CHANGED_TO*0..]->(intermediate_commit:App)-[:CHANGED_TO*0..]->(destination)
RETURN distinct intermediate_commit
However, the query never finishes or at least takes too long to complete. I know that I could have used MATCH p=(origin:App)-[:CHANGED_TO*0..]->(destination:App) and then UNWIND and return distinct nodes. The problem is, I believe, it queries different paths implying I am interested in relationships between them too. While in fact I am not interested in paths. What I need is only distinct nodes that match the pattern. My understanding is that querying paths is slower than it could be if I could query only the nodes.
Could you please help to understand what I am missing? Thanks!
The solution was quite simple. Instead of specifying a pattern in MATCH clause, we move the pattern to WHERE clause. Also, I split the pattern into 2 parts. I can't explain why exactly it is faster but my understanding is that when we move pattern to WHERE clause and MATCH only nodes, we let neo4j know that we are interested only in nodes and not in all possible paths that match the pattern.
The full query:
MATCH (origin:App)
WHERE origin.commit='10cb31b0a72525923c01dc34f8690f311a361d42'
MATCH (destination:App)
WHERE destination.commit='51fde433973463f057ffcbcbab0bc8944ab3ec9c'
MATCH (intermediate_commit:App)
WHERE (origin)-[:CHANGED_TO*0..]->(intermediate_commit)
AND (intermediate_commit)-[:CHANGED_TO*0..]->(destination)
RETURN distinct intermediate_commit
Also, if you have a lot of nodes, I believe specifying LIMIT 1 to match origin and destination can also improve the query, like this:
MATCH (origin:App)
WHERE origin.commit='10cb31b0a72525923c01dc34f8690f311a361d42'
WITH origin
LIMIT 1
MATCH (destination:App)
WHERE destination.commit='51fde433973463f057ffcbcbab0bc8944ab3ec9c'
WITH origin, destination
LIMIT 1
MATCH (intermediate_commit:App)
WHERE (origin)-[:CHANGED_TO*0..]->(intermediate_commit)
AND (intermediate_commit)-[:CHANGED_TO*0..]->(destination)
RETURN distinct intermediate_commit
That might be an unbounded path search? Do you really want all paths of any length between the two nodes (e.g. paths spanning the entire graph?)
Does this do what you want?
MATCH (origin:App)
WHERE origin.commit='10cb31b0a72525923c01dc34f8690f311a361d42'
MATCH (destination:App)
WHERE destination.commit='51fde433973463f057ffcbcbab0bc8944ab3ec9c'
MATCH (origin:App)-[:CHANGED_TO *0..1]->(intermediate_commit:App)-[:CHANGED_TO *0..1]->(destination:App)
RETURN distinct intermediate_commit
I bounded the path length to one hop, changing from 0.. to 0..1
(which means minimum 0 hop, up to 1 relationship hop)
The pattern and conditions allow for the possibility of paths that extend past the start or end nodes yet reach them again further down, this is why it doesn't stop when it finds one matching path but keeps expanding beyond it. Remember Cypher is concerned with finding all possible paths that meet the pattern that exist in the graph. And because of your pattern, the check-beyond-the-start-and-end-nodes-without-limit doesn't just happen once, but per potential (intermediate_commit:App) found while expanding, this is why your query isn't returning.
One way you can get what you want, all possible paths but stopping when the node is reached, is to use the APOC path expanders, you can supply the node as a terminator node, which will halt further expansion past it.
MATCH (origin:App)
WHERE origin.commit='10cb31b0a72525923c01dc34f8690f311a361d42'
MATCH (destination:App)
WHERE destination.commit='51fde433973463f057ffcbcbab0bc8944ab3ec9c'
CALL apoc.path.expandConfig(destination, {relationshipFilter:'<CHANGED_TO', terminatorNodes:[origin]}) YIELD path
UNWIND nodes[path] as node
WITH DISTINCT node
WHERE node:App
RETURN node as intermediate_commit
This is expanding backwards from destination to origin, seems like that could be more efficient. Once we have the paths, we can UNWIND the nodes from all paths, keep the distinct ones, and make sure we only take the :App nodes.
Related
I had another thread about this where someone suggested to do
MATCH (p:Person {person_id: '123'})
WHERE ANY(x IN $names WHERE
EXISTS((p)-[:BELONGS]-(:Face)-[:CORRESPONDS]-(:Image)-[:HAS_ACCESS_TO]-(:Dias {group_name: x})))
MATCH path=(p)-[:ASSOCIATED_WITH]-(:Person)
RETURN path
This does what I need it to, returns nodes that fit the criteria without returning the relationships, but now I need to include another param that is a list.
....(:Dias {group_name: x, second_name: y}))
I'm unsure of the syntax.. here's what I tried
WHERE ANY(x IN $names and y IN $names_2 WHERE..
this gives me a syntax error :/
Since the ANY() function can only iterate over a single list, it would be difficult to continue to use that for iteration over 2 lists (but still possible, if you create a single list with all possible x/y combinations) AND also be efficient (since each combination would be tested separately).
However, the new existenial subquery synatx introduced in neo4j 4.0 will be very helpful for this use case (I assume the 2 lists are passed as the parameters names1 and names2):
MATCH (p:Person {person_id: '123'})
WHERE EXISTS {
MATCH (p)-[:BELONGS]-(:Face)-[:CORRESPONDS]-(:Image)-[:HAS_ACCESS_TO]-(d:Dias)
WHERE d.group_name IN $names1 AND d.second_name IN $names2
}
MATCH path=(p)-[:ASSOCIATED_WITH]-(:Person)
RETURN path
By the way, here are some more tips:
If it is possible to specify the direction of each relationship in your query, that would help to speed up the query.
If it is possible to remove any node labels from a (sub)query and still get the same results, that would also be faster. There is an exception, though: if the (sub)query has no variables that are already bound to a value, then you would normally want to specify the node label for the one node that would be used to kick off that (sub)query (you can do a PROFILE to see which node that would be).
I have a simple query
MATCH (n:TYPE {id:123})<-[:CONNECTION*]<-(m:TYPE) RETURN m
and when executing the query "manually" (i.e. using the browser interface to follow edges) I only get a single node as a result as there are no further connections. Checking this with the query
MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:TYPE)<-[n:CONNECTION]-(o:TYPE) RETURN m,o
shows no results and
MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:TYPE) RETURN m
shows a single node so I have made no mistake doing the query manually.
However, the issue is that the first question takes ages to finish and I do not understand why.
Consequently: What is the reason such trivial query takes so long even though the maximum result would be one?
Bonus: How to fix this issue?
As Tezra mentioned, the variable-length pattern match isn't in the same category as the other two queries you listed because there's no restrictions given on any of the nodes in between n and m, they can be of any type. Given that your query is taking a long time, you likely have a fairly dense graph of :CONNECTION relationships between nodes of different types.
If you want to make sure all nodes in your path are of the same label, you need to add that yourself:
MATCH path = (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE)
WHERE all(node in nodes(path) WHERE node:TYPE)
RETURN m
Alternately you can use APOC Procedures, which has a fairly efficient means of finding connected nodes (and restricting nodes in the path by label):
MATCH (n:TYPE {id:123})
CALL apoc.path.subgraphNodes(n, {labelFilter:'TYPE', relationshipFilter:'<CONNECTION'}) YIELD node
RETURN node
SKIP 1 // to avoid returning `n`
MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:TYPE)<-[n:CONNECTION]-(o:TYPE) RETURN m,o Is not a fair test of MATCH (n:TYPE {id:123})<-[:CONNECTION*]<-(m:TYPE) RETURN m because it excludes the possibility of MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:ANYTHING_ELSE)<-[n:CONNECTION]-(o:TYPE) RETURN m,o.
For your main query, you should be returning DISTINCT results MATCH (n:TYPE {id:123})<-[:CONNECTION*]<-(m:TYPE) RETURN DISTINCT m.
This is for 2 main reasons.
Without distinct, each node needs to be returned the number of times for each possible path to it.
Because of the previous point, that is a lot of extra work for no additional meaningful information.
If you use RETURN DISTINCT, it gives the cypher planner the choice to do a pruning search instead of an exhaustive search.
You can also limit the depth of the exhaustive search using ..# so that it doesn't kill your query if you run against a much older version of Neo4j where the Cypher Planner hasn't learned pruning search yet. Example use MATCH (n:TYPE {id:123})<-[:CONNECTION*..10]<-(m:TYPE) RETURN m
So my query is for a "superpath finding problem".
The relevant nodes here are;
route: The overall path object
tlroutesegment: The logical link between the route and the different segments (which compose the full path) (ps: I know this could be better represented using a relationship, however the database is just made this way :S)
oms: The PHYSICAL path segments itself
validochpath: More or less irrelevant for this question; Top level entity of routes
So on to the actual problem I am having; below is a WORKING solution to the above, HOWEVER, I wanted to optimize the query a bit by reducing the # of routes we have to search through in the 4th line here.
MATCH (vp:validochpath {"some ID HERE"})-->(ort:route)<--
(rs:tlroutesegment)-->(oms:oms)
WITH collect(oms) AS omsNodes
MATCH (ort:route)
WHERE ALL(x in omsNodes WHERE (ort)<--(:tlroutesegment)-->(x))
WITH ort
MATCH (ort)--(vp:validochpath)
RETURN *
This is what the new query looks like, as you can see I use the relation to filter out much of the route nodes.
MATCH (vp:validochpath {onepID:"some ID HERE"})-->(ort:route)<--
(rs:tlroutesegment)-->(oms:oms)<--(rs2:tlroutesegment)
WITH rs2, collect(oms) AS omsNodes
MATCH (rs2)-->(ort2:route)
WHERE ALL(x in omsNodes WHERE (x)<--(:tlroutesegment)-->(ort2))
MATCH (ort2)--(vp:validochpath)
RETURN *
The problem is, this query does not seem to filter out any nodes with the WHERE ALL and just returns everything.
In your second query, the WHERE clause accepts all matches.
From the first MATCH clause, we know that rs2 is a tlroutesegment and that all the nodes in omsNodes are related to rs2. From the second MATCH clause, we also know that ort2 is related to rs2. Your WHERE clause is checking that all the nodes in omsNodes are related to a tlroutesegment that is also related to ort2. Since rs2 is a tlroutesegment, this test always succeeds.
If you want to test for the existence of paths with tlroutesegment nodes that are different than rs2, try this WHERE clause:
WHERE ALL(x in omsNodes WHERE
SIZE([(x)<--(y:tlroutesegment)-->(ort2) WHERE y <> rs2 | y]) > 0)
Having this query working in Cypher (Neo4j):
MATCH p=(g:Node)-[:FOLLOWED_BY *2..2]->(g2:Node)
WHERE g.group=10 AND g2.group=10
RETURN p
which returns all possible paths belonging a specific group (group is just a property to classify nodes), I am struggling to get a query that returns the paths in common between both collection of paths. It would be something like this:
MATCH p=(g:Node)-[:FOLLOWED_BY *2..2]->(g2:Node)
WHERE g.group=10 AND g2.group=10
MATCH p=(g3:Node)-[:FOLLOWED_BY *2..2]->(g4:Node)
WHERE g3.group=15 AND g4.group=15
RETURN INTERSECTION(path1, path2)
Of course I made that up. The goal is to get all the paths in common between both queries.
The start/end nodes of your 2 MATCHes have different groups, so they can never find common paths.
Therefore, when you ask for "paths in common", I assume you actually want to find the shared middle nodes (between the 2 sets of 3-node paths). If so, this query should work:
MATCH p1=(g:Node)-[:FOLLOWED_BY *2]->(g2:Node)
WHERE g.group=10 AND g2.group=10
WITH COLLECT(DISTINCT NODES(p1)[1]) AS middle1
MATCH p2=(g3:Node)-[:FOLLOWED_BY *2]->(g4:Node)
WHERE g3.group=15 AND g4.group=15 AND NODES(p2)[1] IN middle1
RETURN DISTINCT NODES(p2)[1] AS common_middle_node;
My graph is a tree structure with root and end nodes, and a line of nodes between them with [:NEXT]-> relationships from one to the next. Some nodes along that path also have [:BRANCH]-> relationships to other root nodes, and through them to other lines of nodes.
What Cypher query will return an ordered list of the nodes on the path from beginning to end, with any BRANCH relationships being included with the records for the nodes that have them?
EDIT: It's not a technical diagram, but the basic structure looks like this:
with each node depicted as a black circle. In this case, I would would want every node depicted here.
How about
MATCH p=(root)-[:NEXT*0..]->(leaf)
OPTIONAL MATCH (leaf)-[:BRANCH]->(branched)
RETURN leaf, branched, length(p) as l
ORDER BY l ASC
see also this graph-gist: http://gist.neo4j.org/?9042990
This query - a bit slow - should work (I guess):
START n=node(startID), child=node(*)
MATCH (n)-[rels*]-(child)
WHERE all(r in rels WHERE type(r) IN ["NEXT", "BRANCH"])
RETURN *
That is based on Neo4j 2.0.x Cypher syntax.
Technically this query will stop at the end of the tree started from startID: that is because the end in the diagram above belongs to a single path, but not the end of all the branches.
I would also recommend to limit the cardinality of the relationships - [rels*1..n] - to prevent the query to go away...
You wont be able to control the order in which the nodes are returned as per the depth first or breadth first algo unless you have a variable to save previous element or kind of recursive call which I dont think is not possible using only Cypher.
What you can do
MATCH p =(n)-[:NEXT*]->(end)
WITH collect(p) as node_paths
MATCH (n1)-[:NEXT]->(m)-[:BRANCH]->(n2)
WITH collect(m) as branch_nodes , node_paths
RETURN branch_nodes,node_paths
Now node_paths consists of all the paths with pattern (node)-[:NEXT]->(node)-[:NEXT]->...(node) . Now you have the paths and branch Nodes(starting point of basically all the paths in the node_paths except the one which will be emerging from root node) , you can arrange the output order accordingly.