Combining depth- and breadth-first traversals in a single cypher query - neo4j

My graph is a tree structure with root and end nodes, and a line of nodes between them with [:NEXT]-> relationships from one to the next. Some nodes along that path also have [:BRANCH]-> relationships to other root nodes, and through them to other lines of nodes.
What Cypher query will return an ordered list of the nodes on the path from beginning to end, with any BRANCH relationships being included with the records for the nodes that have them?
EDIT: It's not a technical diagram, but the basic structure looks like this:
with each node depicted as a black circle. In this case, I would want every node depicted here.

How about
MATCH p=(root)-[:NEXT*0..]->(leaf)
OPTIONAL MATCH (leaf)-[:BRANCH]->(branched)
RETURN leaf, branched, length(p) as l
ORDER BY l ASC
see also this graph-gist: http://gist.neo4j.org/?9042990
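To make the result shape concrete, here is a minimal Python sketch (not Cypher) of what that query returns when the root is pinned to a single start node: one row per node on the NEXT chain, ordered by distance from the root, with the BRANCH target attached or None where there is none. The dictionaries `next_of` and `branches_of` and all node names are made up for illustration.

```python
# Toy graph: NEXT forms a chain, BRANCH points at the root of another chain.
# All node names here are hypothetical.
next_of = {"r": "a", "a": "b", "b": "end", "x": "y"}
branches_of = {"a": ["x"]}


def ordered_with_branches(root):
    """Walk the NEXT chain from root; pair each node with its BRANCH targets."""
    rows = []
    node, depth = root, 0
    while node is not None:
        targets = branches_of.get(node, [])
        if targets:
            rows.extend((node, target, depth) for target in targets)
        else:
            # Mirrors OPTIONAL MATCH producing null when there is no branch.
            rows.append((node, None, depth))
        node, depth = next_of.get(node), depth + 1
    return rows


rows = ordered_with_branches("r")
```

Following a branch would mean calling the same function again with the branch target as the new root.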

This query - a bit slow - should work (I guess):
START n=node(startID), child=node(*)
MATCH (n)-[rels*]-(child)
WHERE all(r in rels WHERE type(r) IN ["NEXT", "BRANCH"])
RETURN *
That is based on Neo4j 2.0.x Cypher syntax.
Technically this query will stop at the end of the tree started from startID: that is because the end in the diagram above belongs to a single path, but not the end of all the branches.
I would also recommend bounding the length of the variable-length relationship, e.g. [rels*1..n], to keep the query from running away.

You won't be able to control the order in which the nodes are returned (depth-first or breadth-first) unless you have a variable to hold the previous element or some kind of recursive call, which I don't think is possible using only Cypher.
What you can do:
MATCH p =(n)-[:NEXT*]->(end)
WITH collect(p) as node_paths
MATCH (n1)-[:NEXT]->(m)-[:BRANCH]->(n2)
WITH collect(m) as branch_nodes , node_paths
RETURN branch_nodes,node_paths
Now node_paths contains all the paths matching the pattern (node)-[:NEXT]->(node)-[:NEXT]->...(node). With the paths and the branch nodes (the starting points of essentially every path in node_paths except the one that starts at the root node), you can arrange the output order accordingly.

Related

Why does the query to find intermediate nodes take so long?

The database has a graph with the following 3 nodes:
...->(1) ------>(3)-->...
       \         ^
        \        |
         ---->(2)---/
Now, I want to get all distinct nodes that lie on any path from node 1 to node 3, including the two endpoints. I know a unique property of node 1 and of node 3 (the nodes are actually commits from a GitHub repository). So, I came up with the following query:
MATCH (origin:App)
WHERE origin.commit='10cb31b0a72525923c01dc34f8690f311a361d42'
MATCH (destination:App)
WHERE destination.commit='51fde433973463f057ffcbcbab0bc8944ab3ec9c'
MATCH (origin)-[:CHANGED_TO*0..]->(intermediate_commit:App)-[:CHANGED_TO*0..]->(destination)
RETURN distinct intermediate_commit
However, the query never finishes or at least takes too long to complete. I know that I could have used MATCH p=(origin:App)-[:CHANGED_TO*0..]->(destination:App) and then UNWIND and return distinct nodes. The problem is, I believe, it queries different paths implying I am interested in relationships between them too. While in fact I am not interested in paths. What I need is only distinct nodes that match the pattern. My understanding is that querying paths is slower than it could be if I could query only the nodes.
Could you please help to understand what I am missing? Thanks!
The solution was quite simple. Instead of specifying a pattern in MATCH clause, we move the pattern to WHERE clause. Also, I split the pattern into 2 parts. I can't explain why exactly it is faster but my understanding is that when we move pattern to WHERE clause and MATCH only nodes, we let neo4j know that we are interested only in nodes and not in all possible paths that match the pattern.
The full query:
MATCH (origin:App)
WHERE origin.commit='10cb31b0a72525923c01dc34f8690f311a361d42'
MATCH (destination:App)
WHERE destination.commit='51fde433973463f057ffcbcbab0bc8944ab3ec9c'
MATCH (intermediate_commit:App)
WHERE (origin)-[:CHANGED_TO*0..]->(intermediate_commit)
AND (intermediate_commit)-[:CHANGED_TO*0..]->(destination)
RETURN distinct intermediate_commit
Also, if you have a lot of nodes, I believe specifying LIMIT 1 to match origin and destination can also improve the query, like this:
MATCH (origin:App)
WHERE origin.commit='10cb31b0a72525923c01dc34f8690f311a361d42'
WITH origin
LIMIT 1
MATCH (destination:App)
WHERE destination.commit='51fde433973463f057ffcbcbab0bc8944ab3ec9c'
WITH origin, destination
LIMIT 1
MATCH (intermediate_commit:App)
WHERE (origin)-[:CHANGED_TO*0..]->(intermediate_commit)
AND (intermediate_commit)-[:CHANGED_TO*0..]->(destination)
RETURN distinct intermediate_commit
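The node-only idea in this answer can be sketched outside Cypher. A node lies between origin and destination exactly when it is reachable forward from origin and backward from destination, so two breadth-first searches and a set intersection are enough, and no paths are ever enumerated. This is a Python illustration under assumptions: the edge list is hypothetical (the 3-node diagram plus one commit before and one after), and it does not claim to be how Neo4j plans the query.

```python
from collections import deque

# Hypothetical CHANGED_TO edges: the 3-node diagram plus commit 0 before
# and commit 4 after the range of interest.
edges = [(0, 1), (1, 2), (1, 3), (2, 3), (3, 4)]


def reachable(start, adjacency):
    """Breadth-first search; returns every node reachable from start, start included."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def intermediate_nodes(origin, destination):
    forward, backward = {}, {}
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)
    # Reachable from origin going forward AND from destination going backward.
    return reachable(origin, forward) & reachable(destination, backward)
```

Each search visits every node and edge at most once, which is the behaviour the question was hoping for.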
That might be an unbounded path search? Do you really want all paths of any length between the two nodes (e.g. paths spanning the entire graph?)
Does this do what you want?
MATCH (origin:App)
WHERE origin.commit='10cb31b0a72525923c01dc34f8690f311a361d42'
MATCH (destination:App)
WHERE destination.commit='51fde433973463f057ffcbcbab0bc8944ab3ec9c'
MATCH (origin:App)-[:CHANGED_TO *0..1]->(intermediate_commit:App)-[:CHANGED_TO *0..1]->(destination:App)
RETURN distinct intermediate_commit
I bounded the path length to one hop, changing from 0.. to 0..1
(which means minimum 0 hop, up to 1 relationship hop)
The pattern and conditions allow for paths that extend past the start or end nodes yet reach them again further down. This is why the query doesn't stop when it finds one matching path but keeps expanding beyond it. Remember that Cypher is concerned with finding all possible paths in the graph that match the pattern. And because of your pattern, the unbounded check beyond the start and end nodes doesn't happen just once, but once per potential (intermediate_commit:App) found while expanding. This is why your query isn't returning.
One way to get what you want (all possible paths, but stopping when the node is reached) is to use the APOC path expanders. You can supply the node as a terminator node, which halts further expansion past it.
MATCH (origin:App)
WHERE origin.commit='10cb31b0a72525923c01dc34f8690f311a361d42'
MATCH (destination:App)
WHERE destination.commit='51fde433973463f057ffcbcbab0bc8944ab3ec9c'
CALL apoc.path.expandConfig(destination, {relationshipFilter:'<CHANGED_TO', terminatorNodes:[origin]}) YIELD path
UNWIND nodes(path) as node
WITH DISTINCT node
WHERE node:App
RETURN node as intermediate_commit
This is expanding backwards from destination to origin, seems like that could be more efficient. Once we have the paths, we can UNWIND the nodes from all paths, keep the distinct ones, and make sure we only take the :App nodes.

Is this the optimal cypher for returning every node of a subtree?

I have a graph of a tree structure (well no, more of a DAG because I can have multiple parents) and need to be able to write queries that return all results in a flat list, starting at a particular node (or nodes) and going down.
I've reduced one of my use cases to this simple example. In the ascii representation here, n's are my nodes and I've appended their id. p is a permission in my auth system, but all that is pertinent to the question is that it marks the spot from which I need to recurse downwards to collect nodes which should be returned by the query.
There can be multiple root nodes related to p's
The roots, such as n3 below, should be contained in the results, as well as the children
The relationship depth is unbounded.
Graph:
      n1
     ^  ^
    /    \
  n2      n3<--p
         ^  ^
        /    \
      n4      n5
      ^
     /
   n6
If it's helpful, here's the cypher I ran to throw together this quick example:
CREATE path=(n1:n{id:1})<-[:HAS_PARENT]-(n2:n{id:2}),
(n1)<-[:HAS_PARENT]-(n3:n{id:3})<-[:HAS_PARENT]-(n4:n{id:4}),
(n3)<-[:HAS_PARENT]-(n5:n{id:5}),
(n4)<-[:HAS_PARENT]-(n6:n{id:6});
MATCH (n{id:3})
CREATE (:p)-[:IN]->(n)
Here is the current best query I have:
MATCH (n:n)<--(:p)
WITH collect (n) as parents, (n) as n
OPTIONAL MATCH (c)-[:HAS_PARENT*]->(n)
WITH collect(c) as children, (parents) as parents
UNWIND (parents+children) as tree
RETURN tree
This returns the correct set of results, and unlike some previous attempts I made which did not use any collect/unwind, the results come back as a single column of data as desired.
Is this the optimal way to write this type of query? It is surprisingly more complex than I thought such a simple scenario called for. I tried some queries where I combined the roots ("parents" in my query) with the "children" using a UNION clause, but I could not find a way to do so without repeating the query for the relationship with p. In my real-world queries that is a much more expensive operation, which I've reduced here for the example, so I cannot run it more than once.
This might suit your needs:
MATCH (c)-[:HAS_PARENT*0..]->(root:n)<--(:p)
RETURN root, COLLECT(c) AS tree
Each result row will contain a distinct root node and a collection of its tree nodes (including the root node).
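For comparison, here is the same "root plus everything below it" collection as a small Python sketch over the example graph from the question. It deduplicates with a visited set, which matters in a DAG: a node reachable through two parents shows up once here, whereas a variable-length match yields one row per path, so in Cypher COLLECT(DISTINCT c) may be needed. The function name `subtree` is made up for illustration.

```python
# HAS_PARENT edges from the example graph, as (child, parent) pairs.
has_parent = [(2, 1), (3, 1), (4, 3), (5, 3), (6, 4)]


def subtree(roots, has_parent):
    """Return each root and all of its descendants exactly once."""
    children = {}
    for child, parent in has_parent:
        children.setdefault(parent, []).append(child)
    seen, stack, result = set(), list(roots), []
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already collected via another parent
        seen.add(node)
        result.append(node)
        stack.extend(children.get(node, []))
    return result
```

With n3 as the only root this collects n3, n4, n5 and n6, matching the flat list the question asks for.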

Find all relations starting with a given node

In a graph where the following nodes
A,B,C,D
have a relationship with each node's successor
(A->B)
and
(B->C)
etc.
How do I make a query that starts with A and gives me all nodes (and relationships) from there outwards?
I do not know the end node (C).
All I know is to start from A and traverse the whole connected graph (with conditions on relationship and node type).
I think, you need to use this pattern:
(n)-[*]->(m) - variable length path of any number of relationships from n to m. (see Refcard)
A sample query would be:
MATCH path = (a:A)-[*]->()
RETURN path
Also have a look at the path functions in the refcard to expand your Cypher query (I don't know what exact conditions you'll need to apply).
To get all the nodes / relationships starting at a node:
MATCH (a:A {id: "id"})-[r*]-(b)
RETURN a, r, b
This will return all the graphs originating with node A / Label A where id = "id".
One caveat - if this graph is large the query will take a long time to run.

Optimizing Cypher Query Neo4j

I want to write a query in Cypher and run it on Neo4j.
The query is:
Given some start vertexes, walk the edges and find all vertexes that are connected to any of the start vertexes.
(start)-[*]->(v)
for every edge E walked
if startVertex(E).someproperty != endVertex(E).someproperty, output E.
The graph may contain cycles.
For example, in the graph above, vertexes are grouped by "group" property. The query should return 7 rows representing the 7 orange colored edges in the graph.
If I write the algorithm by myself it would be a simple depth / breadth first search, and for every edge visited if the filter condition is true, output this edge. The complexity is O(V+E)
But I can't express this algorithm in Cypher since it's a very different language.
Then I wrote this query:
find all reachable vertexes
(start)-[*]->(v), reachable = start + v.
find all edges starting from any of reachable. if an edge ends with any reachable vertex and passes the filter, output it.
match (reachable)-[]->(n) where n in reachable and reachable.someprop != n.someprop
so the Cypher code looks like this:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
The performance of this query is not as good as I expected. There is an index on :Col(schema).
I am running neo4j 2.3.0 docker image from dockerhub on my windows laptop. Actually it runs on a linux virtual machine on my laptop.
My sample data is a small dataset that contains 0.1M vertexes and 0.5M edges. For some starting nodes it takes 60 or more seconds to complete this query. Any advice for optimizing or rewriting the query? Thanks.
The following code block is the logic I want:
VertexQueue1 = (starting vertexes);
VisitedVertexSet = (empty);
EdgeSet1 = (empty);
While (VertexQueue1 is not empty)
{
    Vertex0 = VertexQueue1.pop();
    VisitedVertexSet.add(Vertex0);
    foreach (Edge0 starting from Vertex0)
    {
        Vertex1 = endingVertex(Edge0);
        if (Vertex1.schema <> Vertex0.schema)
        {
            EdgeSet1.put(Edge0);
        }
        if (VisitedVertexSet.notContains(Vertex1)
            and VertexQueue1.notContains(Vertex1))
        {
            VertexQueue1.push(Vertex1);
        }
    }
}
return EdgeSet1;
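The logic above can be written as a short runnable Python sketch. The function name `cross_schema_edges` and the sample data are hypothetical; a single `queued` set stands in for the combined "visited or already in the queue" check, so every vertex is dequeued once and every edge is inspected once, giving the O(V+E) bound mentioned earlier.

```python
from collections import deque


def cross_schema_edges(starts, out_edges, schema):
    """Return edges (u, v) reachable from starts whose endpoints differ in schema."""
    queue = deque(starts)   # starts are assumed to be distinct
    queued = set(starts)    # everything ever enqueued, so nothing is queued twice
    result = []
    while queue:
        u = queue.popleft()
        for v in out_edges.get(u, ()):
            if schema[u] != schema[v]:
                result.append((u, v))
            if v not in queued:
                queued.add(v)
                queue.append(v)
    return result


# Hypothetical data: a cycle a -> b -> c -> a plus an extra edge c -> d.
out_edges = {"a": ["b"], "b": ["c"], "c": ["a", "d"]}
schema = {"a": "s1", "b": "s1", "c": "s2", "d": "s2"}
found = cross_schema_edges(["a"], out_edges, schema)
```

The cycle does not cause trouble because a vertex is never enqueued a second time.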
EDIT:
The profile result shows that expanding all paths has a high cost. Looking at the row counts, it seems that the Cypher execution engine returns all paths, but I only want the distinct edge list.
LEFT one:
match (start:Col {table:"F_XXY_DSMK_ITRPNL_IDX_STAT_W"})
,(start)-[*0..]->(prev:Col)-->(node:Col)
where prev.schema<>node.schema
return distinct prev,node
RIGHT one:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
I think Cypher lets this be much easier than you're expecting it to be, if I'm understanding the query. Try this:
MATCH (start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node
Though I'm not sure why you're comparing the schema property on the nodes. Isn't the schema for the start node fixed by the value that you pass in?
I might not be understanding the query though. If you're looking for more than just the nodes connected to the start node, you could do:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN prev, node
That open-ended variable length relationship specification might be slow, though.
Also note that when Cypher is browsing a particular path it stops when it finds that it has looped back onto some node (EDIT: relationship, not node) in the path matched so far, so cycles aren't really a problem.
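A small Python sketch of that rule, assuming the usual behaviour that a relationship may appear at most once in a matched path: on a two-node cycle the enumeration terminates after two paths, because the only way to continue would reuse an edge. The function name `paths_from` and the edge list are made up for illustration.

```python
def paths_from(start, edges):
    """Enumerate directed paths of length >= 1 from start, never reusing an edge."""
    results = []

    def extend(node, used):
        for index, (src, dst) in enumerate(edges):
            if src == node and index not in used:
                path = used + [index]
                results.append([edges[i] for i in path])
                extend(dst, path)

    extend(start, [])
    return results


# Hypothetical cycle: a -> b -> a.
cycle = [("a", "b"), ("b", "a")]
```

Termination is guaranteed, but the number of paths can still grow very quickly on a densely connected graph, which is where the cost of an open-ended pattern comes from.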
Also, is the DWMDATA value that you're passing in interpolated? If so, you should think about using parameters for security / performance:
http://neo4j.com/docs/stable/cypher-parameters.html
EDIT:
Based on your comment I have a couple of thoughts. First, limiting to DISTINCT path isn't going to help because every path that it finds is distinct. What you want is the distinct set of pairs, I think, which could be achieved by just adding DISTINCT to the query:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINCT prev, node
Here is another way to go about it which may or may not be more efficient, but might at least give you an idea for how to go about things differently:
MATCH
path=(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WITH rels(path) AS rels
UNWIND rels AS rel
WITH DISTINCT rel
WITH startNode(rel) AS start_node, endNode(rel) AS end_node
WHERE start_node.schema <> end_node.schema
RETURN start_node, end_node
I can't say that this would be faster, but here's another way to try:
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH collect(ID(node)) AS node_ids
MATCH (:Col)-[r]->(node:Col)
WHERE ID(node) IN node_ids
WITH DISTINCT r
RETURN startNode(r) AS start_node, endNode(r) AS end_node
I suspect that the problem in all cases is with the open-ended variable length path. I've actually asked on the Slack group to try to get a better understanding of how it works. In the meantime, for all the queries that you try I would suggest prefixing them with the PROFILE keyword to get a report from Neo4j on what parts of the query are slow.
// this is very inefficient!
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH distinct node
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
you might be better off with this:
MATCH (start:Col)
WHERE start.property IN {property_values}
MATCH (node:Col)
WHERE shortestPath((start)-[*]->(node)) IS NOT NULL
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;

Cypher query to find all paths with same relationship type

I'm struggling to find a single clean, efficient Cypher query that will let me identify all distinct paths emanating from a start node such that every relationship in the path is of the same type when there are many relationship types.
Here's a simple version of the model:
CREATE (a), (b), (c), (d), (e), (f), (g),
(a)-[:X]->(b)-[:X]->(c)-[:X]->(d)-[:X]->(e),
(a)-[:Y]->(c)-[:Y]->(f)-[:Y]->(g)
In this model (a) has two outgoing relationship types, X and Y. I would like to retrieve all the paths that link nodes along relationship X as well as all the paths that link nodes along relationship Y.
I can do this programmatically outside of cypher by making a series of queries, the first to
retrieve the list of outgoing relationships from the start node, and then a single query (submitted together as a batch) for each relationship. That looks like:
START n=node(1)
MATCH n-[r]->()
RETURN COLLECT(DISTINCT TYPE(r)) as rels;
followed by:
START n=node(1)
MATCH p = n-[:`reltype_param`*]->()
RETURN p as path;
The above satisfies my need, but requires at minimum 2 round trips to the server (again, assuming I batch together the second set of queries in one transaction).
A single-query approach that works, but is horribly inefficient is the following single Cypher query:
START n=node(1)
MATCH p = n-[r*]->() WHERE
ALL (x in RELATIONSHIPS(p) WHERE TYPE(x) = TYPE(HEAD(RELATIONSHIPS(p))))
RETURN p as path;
That query uses the ALL predicate to filter the relationships along the paths, enforcing that each relationship in the path matches the first relationship in the path. This, however, is really just a filter operation on what is essentially a combinatorial explosion of all possible paths, which is much less efficient than traversing a relationship of a known, given type first.
I feel like this should be possible with a single Cypher query, but I have not been able to get it right.
Here's a minor optimization; at least the non-matching paths will fail fast:
MATCH n-[r]->()
WITH distinct type(r) AS t
MATCH p = n-[r*]->()
WHERE type(r[-1]) = t // last entry matches
RETURN p AS path
This is probably one of those things that should be in the Java API if you want it to be really performant, though.
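To illustrate the "traverse one known type at a time" idea the question describes, here is a Python sketch over the model from the question. It first collects the relationship types leaving the start node, then expands only along one type per traversal, so mixed-type paths are never generated. The function name `same_type_paths` is hypothetical, and skipping already-visited nodes is a simplification of Cypher's per-relationship uniqueness rule.

```python
# The model from the question as (source, relationship type, target) triples.
rels = [
    ("a", "X", "b"), ("b", "X", "c"), ("c", "X", "d"), ("d", "X", "e"),
    ("a", "Y", "c"), ("c", "Y", "f"), ("f", "Y", "g"),
]


def same_type_paths(start, rels):
    """For each type leaving start, list every path that stays on that type."""
    out = {}
    for src, rel_type, dst in rels:
        out.setdefault((src, rel_type), []).append(dst)
    start_types = sorted({t for src, t, _ in rels if src == start})
    paths = []

    def extend(path, rel_type):
        for nxt in out.get((path[-1], rel_type), []):
            if nxt in path:  # simplification: avoid cycles by node
                continue
            longer = path + [nxt]
            paths.append((rel_type, longer))
            extend(longer, rel_type)

    for rel_type in start_types:
        extend([start], rel_type)
    return paths
```

On this model it yields four X paths and three Y paths from a, the same set the two-round-trip approach returns.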
