match a branching path of variable length - neo4j

I have a graph which looks like this:
Here is the link to the graph in the neo4j console:
http://console.neo4j.org/?id=av3001
Basically, you have two branching paths of variable length. I want to match the two paths between the orange node and the yellow nodes. I want to return one row of data for each path, including all traversed nodes. I also want to be able to include different WHERE clauses on different intermediate nodes.
At the end, I need to have a table of data, like this:
a - b - c - d
neo - morpheus - null - leo
neo - morpheus - trinity - cypher
How could I do that?
I have tried using OPTIONAL MATCH, but I can't get the two rows separately.
I have tried using a variable-length path, which returns the two paths but doesn't allow me to access and filter intermediate nodes. Plus, it returns a list and not a table of data.
I've seen this question:
Cypher - matching two different possible paths and return both
It's on the same subject, but the example is very complex; a more generic solution to this simpler problem is what I'm looking for.

You can define your end node with a WHERE clause. In your case, the end node is a node with no outgoing relationships. I'm not sure why you expect a null in the result, as in neo - morpheus - null - leo.
MATCH p=(n:Person{name:"Neo"})-[*]->(end) where not (end)-->()
RETURN extract(x IN nodes(p) | x.name)
Edit:
This may not be the best option, as I am not sure how to do it programmatically. If I use UNWIND, I get back only one row. So this is a dummy solution:
MATCH p=(n{name:"Neo"})-[*]->(end) where not (end)-->()
with nodes(p) as list
return list[0].name,list[1].name,list[2].name,list[3].name

You can use Cypher to match a path like this MATCH p=(:a)-[*]->(:d) RETURN p, and p will be a list of nodes/relationships in the path in the order it was traversed. You can apply WHERE to filter the path just like with node matching, and apply any list functions you need to it.
I will add these examples too
// WHERE on path
MATCH p=(:a)-[*]-(:d)
WHERE NONE(n IN NODES(p) WHERE n.name="Trinity")
WITH NODES(p) AS p
RETURN p[0], p[1], p[2], p[3]
// Split path into columns
MATCH p=(:a)-[*]-(:d)
WITH NODES(p) AS p
RETURN p[0], p[1], p[2], p[3]
// Match path, filter on label
MATCH p=(:a)-[*]-(:d)
WITH NODES(p) AS p
RETURN FILTER(n IN p WHERE "a" IN LABELS(n)) AS a,
       FILTER(n IN p WHERE "b" IN LABELS(n)) AS b,
       FILTER(n IN p WHERE "c" IN LABELS(n)) AS c,
       FILTER(n IN p WHERE "d" IN LABELS(n)) AS d
Unfortunately, you HAVE to explicitly set some logic for each column. You can't make dynamic columns (that I know of). In your table example, what is the rule for which column gets 'null'? In the last example, I set each column to be the set of nodes of a label.

I.m.o. you're asking for extensive post-processing of the results of a simple query (give me all the paths starting from Neo). I say this because:
You state you need to be able to specify specific WHERE clauses for each path (but you don't specify which clauses for which path, indicating this might be a dynamic thing?).
You don't know the size of the longest path beforehand, but you still want the result to be a same-size-for-all-results table. And would any null columns then always be just before the end node? Why (for that makes no real sense other than convenience)?
...
Therefore (and again i.m.o.) you need to process the results in a (Java or whatever you prefer) program. There you'll have full control over the resultset and be able to slice and dice as you wish. Cypher (exactly like SQL in fact) can only do so much and it seems that you're going beyond that.
Hope this helps,
Regards,
Tom
P.S. This may seem like an easy opt-out, but look at how simple your query is as compared to the constructs that have to be wrought trying to answer your logic. So ... separate the concerns.
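The client-side post-processing suggested above could look roughly like the following. This is only a minimal sketch in Python, assuming each path has already been fetched as a list of node names (for example with the extract(...) query from the first answer). The helper names pad_path and paths_to_table are hypothetical, and the rule of putting the nulls just before the end node is an assumption read off the table in the question.

```python
from typing import List, Optional


def pad_path(names: List[str], width: int) -> List[Optional[str]]:
    """Pad one path to `width` columns, inserting None just before the end node."""
    if len(names) > width:
        raise ValueError("path is longer than the requested table width")
    if not names:
        return [None] * width
    missing = width - len(names)
    # Keep every node except the last in place, fill the gap, then append the end node.
    return names[:-1] + [None] * missing + names[-1:]


def paths_to_table(paths: List[List[str]]) -> List[List[Optional[str]]]:
    """Turn variable-length paths into rows that all have the same width."""
    width = max((len(p) for p in paths), default=0)
    return [pad_path(p, width) for p in paths]


# Hypothetical query result: one list of node names per matched path.
paths = [["neo", "morpheus", "leo"], ["neo", "morpheus", "trinity", "cypher"]]
table = paths_to_table(paths)
for row in table:
    print(" - ".join(str(cell) for cell in row))
```

Any per-column filtering (the "different WHERE clauses" from the question) could then be applied to the rows of table in the same program.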

Related

Remove automorphisms of a cypher query output

When doing a Cypher query to retrieve a specific subgraph with automorphisms, let's say
MATCH (a)-[:X]-(b)-[:X]-(c)
RETURN a, b, c
It seems that the default behaviour is to return every retrieved subgraph and all of its automorphisms.
In that example, if (u)-[:X]-(v)-[:X]-(w) is a graph matching the pattern, the output will be u,v,w but also w,v,u, which is the same graph.
Is there a way to retrieve each subgraph only once?
EDIT: It would be great if Cypher had a feature to do that during the search, using some kind of symmetry-breaking condition, as it would reduce the computing time. If that is not the case, how would you post-process to find the desired output?
In the query you are making, (a)-[r:X]-(b) and (a)-[t:X]-(c) refer to a similar pattern, since (b) and (c) can be interchanged. What is the need to repeat the match twice? MATCH (a)-[r:X]-(b) RETURN a, r, b returns all the subgraphs you are looking for.
EDIT
You can do something like the following to find the nodes that have two relationships of type X.
MATCH (a)-[r:X]-(b) WHERE size((a)-[:X]-()) = 2 RETURN a, r, b
For this kind of mirrored pattern, we can add a restriction on the internal graph ids so that only one of the two paths is kept:
MATCH (a)-[:X]-(b)-[:X]-(c)
WHERE id(a) < id(c)
RETURN a, b, c
This will also prevent the case where a = c.
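If the symmetry cannot be broken inside the query, the post-processing the question asks about can be done on the client. The following is a minimal sketch in Python, assuming the unrestricted query returned rows of (a, b, c) node ids as plain integers; the function name drop_mirrored is hypothetical. Unlike the id(a) < id(c) restriction, it keeps rows where a = c.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[int, int, int]


def drop_mirrored(rows: Iterable[Triple]) -> List[Triple]:
    """Keep one row per match, treating (a, b, c) and (c, b, a) as the same subgraph."""
    seen = set()
    unique = []
    for a, b, c in rows:
        # Canonical form: order the two end points, keep the middle node fixed.
        key = (min(a, c), b, max(a, c))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical node ids as they might come back from the unrestricted query.
rows = [(1, 2, 3), (3, 2, 1), (4, 2, 1), (1, 2, 4)]
result = drop_mirrored(rows)
```

This only removes duplicates after the fact, so it does not reduce the matching work the database has to do; the in-query restriction is preferable where it applies.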

neo4j cypher keep ordering imposed by path for later in the query

I am using a query like
MATCH p=((:Start)-[:NEXT*..100]->(n))
WHERE ALL(node IN nodes(p) WHERE ...)
WITH DISTINCT n WHERE (n:RELEVANT)
...
RETURN n.someprop;
Where I want to have the results ordered by the natural ordering arising from the direction of the -[:NEXT]-> relationships.
But the WITH in the third line scrambles up that ordering. Problem is, I need the with to 1. filter for :RELEVANT nodes and 2. to get only distinct such nodes.
Is there some way to preserve the ordering? Maybe assign number ordering on the path and reuse it later with ORDER BY? No idea how to do it.
You're asking for distinct nodes, which indicates that the node might be reachable by multiple paths, and thus might be present at multiple distances from the start node.
Instead of using DISTINCT, you should use min() (or max(), depending on your requirements) on the path length for each n. Since those are aggregation functions, you will only ever get a single row for each n.
MATCH p=((:Start)-[:NEXT*..100]->(n:RELEVANT))
WHERE ALL(node IN nodes(p) WHERE ...)
WITH n, min(length(p)) as distance
WITH n
ORDER BY distance
...
RETURN n.someprop;
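To make the effect of min(length(p)) followed by ORDER BY concrete, here is a small illustration in Python of the same aggregation done by hand. It is only a sketch: the (node, path_length) rows and the function name order_by_min_distance are hypothetical stand-ins for what the MATCH produces before aggregation.

```python
from typing import Dict, Iterable, List, Tuple


def order_by_min_distance(rows: Iterable[Tuple[str, int]]) -> List[str]:
    """Collapse (node, path_length) rows to one entry per node, ordered by shortest distance."""
    best: Dict[str, int] = {}
    for node, length in rows:
        # Keep only the shortest path length seen for each node.
        if node not in best or length < best[node]:
            best[node] = length
    # sorted() is stable, so nodes at equal distance keep their first-seen order.
    return sorted(best, key=best.get)


# Node "c" is reachable by two paths of different length.
rows = [("c", 3), ("a", 1), ("c", 2), ("b", 2)]
ordered = order_by_min_distance(rows)
```

Each node appears once, which is what DISTINCT was meant to achieve, and the distance survives long enough to sort on.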
What if you remove the WHERE clause from the WITH and put the label :RELEVANT in the MATCH? Maybe the WHERE is causing the problem. Try something like this:
MATCH p=((:Start)-[:NEXT*..100]->(n:RELEVANT))
WHERE ALL(node IN nodes(p) WHERE ...)
WITH DISTINCT n
...
RETURN n.someprop;

Complex neo4j cypher query to traverse a graph and extract nodes of a specific label and use them in optional match

I have a huge database, 260GB in size, which stores a ton of transaction information. It has Agent, Customer, Phone and ID_Card as the node labels. The relationships are as follows:
Agent_Send, Customer_Send, Customer_at_Agent, Customer_used_Phone, Customer_used_ID.
A single agent is connected to many customers, and hence hitting the agent node while querying a path is not feasible. Below is my query:
match p=((ph: Phone {Phone_ID : "3851308.0"})-[r:Customer_Send
| Customer_used_ID | Customer_used_Phone *1..5]-(n2))
with nodes(p) as ns
return extract (node in ns | Labels(node) ) as Labels
I am starting with a phone number and trying to extract a big "Customer" network. I am intentionally not touching the "Customer_at_Agent" relationship in the above networked query as it is not optimal as far as performance is concerned.
So, the idea is to extract all the "Customer" labeled nodes from the path and match it with [Customer_at_Agent] relationship.
For instance, something like:
match p=((ph: Phone {Phone_ID : "3851308.0"})-[r:Customer_Send
| Customer_used_ID | Customer_used_Phone *1..5]-(n2))
with nodes(p) as ns
return extract (node in ns | Labels(node) ) as Labels
of "type customer as c "
optional match (c)-[r1:Customer_at_Agent]-(n3)
return distinct p,r1
I am still new to neo4j and cypher and I am not able to figure out a hack to extract only "customer" nodes from the path and use that in the optional match.
Thanks in advance.
Use filter notation instead of extract and you can drop any nodes that aren't labelled right. Try out this query instead:
MATCH p = (ph:Phone {Phone_ID : "3851308.0"}) - [:Customer_Send|:Customer_used_ID|:Customer_used_Phone*1..5] - ()
WITH ph, [node IN NODES(p) WHERE node:Customer] AS customer_nodes
UNWIND customer_nodes AS c_node
OPTIONAL MATCH (c_node) - [r1:Customer_at_Agent] - ()
RETURN ph, COLLECT(DISTINCT r1)
So the second line takes the phone number and the path generated and gives you a list of nodes that have the Customer label as customer_nodes. You then unwind this list so you have individual nodes you can use in path matching. Line 4 performs your optional match and finds the r1 you're interested in, then line 5 will return the phone number node you started with and a collection of all of the r1 relationships that you found on customer nodes hooked up to that phone number.
UPDATE: I added some modifications to clean up your first query line as well. If you aren't going to use an alias (like r or n2 in the first line), then don't assign them in the first place; they can affect performance and cause confusion. Empty nodes and relationships are totally fine if you don't actually have any restrictions to place on them. You also don't need parentheses to mark off a path; they are used as a core part of Cypher's ASCII art to signify nodes, so I find they are more confusing than helpful.

Optimizing Cypher Query Neo4j

I want to write a query in Cypher and run it on Neo4j.
The query is:
Given some start vertexes, walk the edges and find all vertexes that are connected to any of the start vertexes.
(start)-[*]->(v)
for every edge E walked
if startVertex(E).someproperty != endVertex(E).someproperty, output E.
The graph may contain cycles.
For example, in the graph above, vertexes are grouped by "group" property. The query should return 7 rows representing the 7 orange colored edges in the graph.
If I write the algorithm by myself it would be a simple depth / breadth first search, and for every edge visited if the filter condition is true, output this edge. The complexity is O(V+E)
But I can't express this algorithm in Cypher, since it's a very different language.
Then I wrote this query:
find all reachable vertexes
(start)-[*]->(v), reachable = start + v.
find all edges starting from any of reachable. if an edge ends with any reachable vertex and passes the filter, output it.
match (reachable)-[]->(n) where n in reachable and reachable.someprop != n.someprop
so the Cypher code looks like this:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
The performance of this query is not as good as I expected. There is an index on :Col(schema).
I am running neo4j 2.3.0 docker image from dockerhub on my windows laptop. Actually it runs on a linux virtual machine on my laptop.
My sample data is a small dataset that contains 0.1M vertexes and 0.5M edges. For some starting nodes it takes 60 or more seconds to complete this query. Any advice for optimizing or rewriting the query? Thanks.
The following code block is the logic I want:
VertexQueue1 = (starting vertexes);
VisitedVertexSet = (empty);
EdgeSet1 = (empty);
While (VertexQueue1 is not empty)
{
    Vertex0 = VertexQueue1.pop();
    VisitedVertexSet.add(Vertex0);
    foreach (Edge0 starting from Vertex0)
    {
        Vertex1 = endingVertex(Edge0);
        if (Vertex1.schema <> Vertex0.schema)
        {
            EdgeSet1.put(Edge0);
        }
        if (VisitedVertexSet.notContains(Vertex1)
            and VertexQueue1.notContains(Vertex1))
        {
            VertexQueue1.push(Vertex1);
        }
    }
}
return EdgeSet1;
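For reference, the pseudocode above can be written as a runnable breadth-first search. This is a minimal sketch in Python over an in-memory adjacency list, not something that talks to Neo4j; the names cross_schema_edges, graph and schema are hypothetical, and the sample data is made up to include a cycle and an unreachable vertex.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def cross_schema_edges(
    graph: Dict[str, List[str]],
    schema: Dict[str, str],
    starts: Iterable[str],
) -> Set[Edge]:
    """Return every reachable edge whose two end points have different schemas."""
    queue = deque(starts)
    seen = set(queue)  # vertexes already queued or visited
    result: Set[Edge] = set()
    while queue:
        vertex = queue.popleft()
        for neighbour in graph.get(vertex, []):
            if schema[vertex] != schema[neighbour]:
                result.add((vertex, neighbour))
            # Each vertex is queued at most once, so cycles terminate.
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return result


# Made-up sample: a -> b -> c -> a is a cycle, and e is not reachable from a.
graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "e": ["a"]}
schema = {"a": "s1", "b": "s1", "c": "s2", "d": "s2", "e": "s3"}
edges = cross_schema_edges(graph, schema, ["a"])
```

Every vertex and edge is handled once, which is the O(V+E) behaviour described above, in contrast to a query that enumerates every path.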
EDIT:
The profile result shows that expanding all paths has a high cost. Looking at the row counts, it seems that the Cypher execution engine returns all paths, but I only want the distinct edge list.
LEFT one:
match (start:Col {table:"F_XXY_DSMK_ITRPNL_IDX_STAT_W"})
,(start)-[*0..]->(prev:Col)-->(node:Col)
where prev.schema<>node.schema
return distinct prev,node
RIGHT one:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
I think Cypher lets this be much easier than you're expecting it to be, if I'm understanding the query. Try this:
MATCH (start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node
Though I'm not sure why you're comparing the schema property on the nodes. Isn't the schema for the start node fixed by the value that you pass in?
I might not be understanding the query though. If you're looking for more than just the nodes connected to the start node, you could do:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN prev, node
That open-ended variable length relationship specification might be slow, though.
Also note that when Cypher is traversing a particular path, it stops when it finds that it has looped back onto some node (EDIT: relationship, not node) in the path matched so far, so cycles aren't really a problem.
Also, is the DWMDATA value that you're passing in interpolated? If so, you should think about using parameters for security / performance:
http://neo4j.com/docs/stable/cypher-parameters.html
EDIT:
Based on your comment, I have a couple of thoughts. First, limiting to DISTINCT paths isn't going to help, because every path that it finds is distinct. What you want is the distinct set of pairs, I think, which could be achieved by just adding DISTINCT to the query:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINCT prev, node
Here is another way to go about it which may or may not be more efficient, but might at least give you an idea for how to go about things differently:
MATCH
path=(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WITH rels(path) AS rels
UNWIND rels AS rel
WITH DISTINCT rel
WITH startNode(rel) AS start_node, endNode(rel) AS end_node
WHERE start_node.schema <> end_node.schema
RETURN start_node, end_node
I can't say that this would be faster, but here's another way to try:
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH collect(ID(node)) AS node_ids
MATCH (:Col)-[r]->(node:Col)
WHERE ID(node) IN node_ids
WITH DISTINCT r
RETURN startNode(r) AS start_node, endNode(r) AS end_node
I suspect that the problem in all cases is with the open-ended variable length path. I've actually asked on the Slack group to try to get a better understanding of how it works. In the meantime, for all the queries that you try I would suggest prefixing them with the PROFILE keyword to get a report from Neo4j on what parts of the query are slow.
// this is very inefficient!
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH distinct node
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
You might be better off with this:
MATCH (start:Col)
WHERE start.property IN {property_values}
MATCH (node:Col)
WHERE shortestPath((start)-[*]->(node)) IS NOT NULL
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;

Cypher path needs to exclude a certain relation

I have this graph:
A-[:X]->B-> a whole tree of badness
A-[:Y]->C-> a whole tree of goodness
I would like to know how to specify a path starting with A that excludes the :X relationship.
In this case "Y" could be any one of a number of different edge types. I do not want to specify them explicitly.
How do I write a path statement that includes A-[*]-B where * is not :X but can be anything else?
Solution for a fixed number of relationships between A and B
You can exclude a relationship type by matching all relationships from A to B and then filtering out a specific type with WHERE NOT:
MATCH p = (a:Label1)-[]-(b:Label2)
WHERE NOT (a)-[:X]-(b)
RETURN p
Solution for a variable length path between A and B
If you have a variable length path between A and B you cannot put the exact pattern in the WHERE NOT. Instead, you can use a NONE predicate on the path:
MATCH p = (a:Label1)-[*]-(b:Label2)
// this WHERE makes sure that none of the relationships in the
// returned path fulfill the criterion type(relationship) = 'X'
WHERE NONE (r in relationships(p) WHERE type(r) = 'X')
RETURN p
This Cypher query is simpler than the variable-length path query from @MartinPreusse, as it avoids using the RELATIONSHIPS function. Profiling shows that its execution plan is also a bit simpler, so it might be faster.
MATCH p=(a:Label1)-[rels*]-(b:Label2)
WHERE NONE (r IN rels WHERE type(r)= 'X')
RETURN p

Resources