Cypher Query to Collect Arbitrary Depth Nodes and Edge Properties - neo4j

I have a graph that looks like the image below. However, the depth and the number of rollups from the Person to the topmost Rollup is variable depending on how the rollups have been structured by the user. The edges from the Person to the Metric (HAS_METRIC) have the score values and the relationships from the metrics to the Rollup (HAS_PARENT) has the weighting that should be applied by to the value as it is rolled up to a top score.
Ideally, I would like to have a query that produces a table with the rollup and the summed/weighted scores. Like this:
node | value
-------------------
Metric A 23
Metric B 55
Metric C 29
Metric D 78
Rollup A 45.4
Rollup B 58.4
Rollup Tot 51.9
However, I am not understanding how to collect the edge properties for the HAS_PARENTS.
MATCH (p:Person)-[score:HAS_METRIC]->(m:Metric)-[weight:HAS_PARENT]->(ru:Rollup)
-[par_rel:HAS_PARENT*..8]->(ru_par:Rollup)
WITH p, score, m, weight, par_rel, ru, ru_par
RETURN p.uid, score.score, m.uid, weight.weight, ru.uid par_rel.weight, ru_par.uid
This query is giving me a type mismatch because it does not know what to do with the par_rel.weight. Any pointers are appreciated.

I believe what you are searching for is the relationships(path) function. It is one of the default path functions in Cypher. It returns all relationship is a defined path, and you can combine it with one or more Cypher list expressions to get the values you need from the relationships.
Generally speaking, you could do something like:
MATCH p = (n)-[:HAS_PARENT*..8]->()
RETURN [x IN relationships(p) | x.weight] AS weights
You might also find useful the reduce function. E.g.:
...
RETURN reduce(s = 0, x IN relationships(p) | s + x.weight) AS sumWeight
But you need to be careful with your variable length path queries and probably constrain them in order to get only the paths you are interested in.
A good advice would be probably to mark your leaf and root nodes in order to match only paths from a leaf to a/the root, not just intermediate ones. E.g.:
MATCH p = (n)-[:HAS_PARENT*..8]->(root)
WHERE NOT (root)-[:HAS_PARENT]->() AND NOT (n)<-[:HAS_PARENT]-()
...
And of course you can combine these cypher with others in order to return everything you need in one single query.
I hope this helps. Let us know when you succeed.

Related

Remove automorphisms of a cypher query output

When doing a Cypher query to retrieve a specific subgraph with automorphisms, let's say
MATCH (a)-[:X]-(b)-[:X]-(c),
RETURN a, b, c
It seems that the default behaviour is to return every retrieved subgraph and all their automorphisms.
In that exemple, if (u)-[:X]-(v)-[:X]-(w) is a graph matching the pattern, the output will be u,v,w but also w,v,u, which consist in the same graph.
Is there a way to retrieve each subgraph only once ?
EDIT: It would be great if Cypher have a feature to do that in the search, using some kind of symmetry breaking condition as it would reduce the computing time. If that is not the case, how would you post-process to find the desired output ?
In the query you are making, (a)-[r:X]-(b) and (a)-[t:X]-(c) refer to a similar pattern. Since (b) and (c) can be interchanged. What is the need to repeat matching twice? MATCH (a)-[r:X]-(b) RETURN a, r, b returns all the subgraphs you are looking for.
EDIT
You can do something as follows to find the nodes, which are having two relations of type X.
MATCH (a)-[r:X]-(b) WHERE size((a)-[:X]-()) = 2 RETURN a, r, b
For these kind of mirrored patterns, we can add a restriction on the internal graph ids so only one of the two paths is kept:
MATCH (a)-[:X]-(b)-[:X]-(c)
WHERE id(a) < id(c)
RETURN a, b, c
This will also prevent the case where a = c.

neo4j get random path from known node

I have a big neo4j db with info about celebs, all of them have relations with many others, they are linked, dated, married to each other. So I need to get random path from one celeb with defined count of relations (5). I don't care who will be in this chain, the only condition I have I shouldn't have repeated celebs in chain.
To be more clear: I need to get "new" chain after each query, for example:
I try to get chain started with Rita Ora
She has relations with
Drake, Jay Z and Justin Bieber
Query takes random from these guys, for example Jay Z
Then Query takes relations of Jay Z: Karrine
Steffans, Rosario Dawson and Rita Ora
Query can't take Rita Ora cuz
she is already in chain, so it takes random from others two, for
example Rosario Dawson
...
And at the end we should have a chain Rita Ora - Jay Z - Rosario Dawson - other celeb - other celeb 2
Is that possible to do it by query?
This is doable in Cypher, but it's quite tricky. You mention that
the only condition I have I shouldn't have repeated celebs in chain.
This condition could be captured by using node-isomorphic pattern matching, which requires all nodes in a path to be unique. Unfortunately, this is not yet supported in Cypher. It is proposed as part of the openCypher project, but is still work-in-progress. Currently, Cypher only supports relationship uniqueness, which is not enough for this use case as there are multiple relationship types (e.g. A is married to B, but B also collaborated with A, so we already have a duplicate with only two nodes).
APOC solution. If you can use the APOC library, take a look at the path expander, which supports various uniqueness constraints, including NODE_GLOBAL.
Plain Cypher solution. To work around this limitation, you can capture the node uniqueness constraint with a filtering operation:
MATCH p = (c1:Celebrity {name: 'Rita Ora'})-[*5]-(c2:Celebrity)
UNWIND nodes(p) AS node
WITH p, count(DISTINCT node) AS countNodes
WHERE countNodes = 5
RETURN p
LIMIT 1
Performance-wise this should be okay as long as you limit its results because the query engine will basically keep enumerating new paths until one of them passes the filtering test.
The goal of the UNWIND nodes(p) AS node WITH count(DISTINCT node) ... construct is to remove duplicates from the list of nodes by first UNWIND-ing it to separate rows, then aggregating them to a unique collection using DISTINCT. We then check whether the list of unique nodes still has 5 elements - if so, the original list was also unique and we RETURN the results.
Note. Instead of UNWIND and count(DISTINCT ...), getting unique elements from a list could be expressed in other ways:
(1) Using a list comprehension and ranges:
WITH [1, 2, 2, 3, 2] AS l
RETURN [i IN range(0, length(l)-1) WHERE NOT l[i] IN l[0..i] | l[i]]
(2) Using reduce:
WITH [1, 2, 2, 3, 2] AS l
RETURN reduce(acc = [], i IN l | acc + CASE NOT i IN acc WHEN true THEN [i] ELSE [] END)
However, I believe both forms are less readable than the original one.

match a branching path of variable length

I have a graph which looks like this:
Here is the link to the graph in the neo4j console:
http://console.neo4j.org/?id=av3001
Basically, you have two branching paths, of variable length. I want to match the two paths between orange node and yellow nodes. I want to return one row of data for each path, including all traversed nodes. I also want to be able to include different WHERE clauses on different intermediate nodes.
At the end, i need to have a table of data, like this:
a - b - c - d
neo - morpheus - null - leo
neo - morpheus - trinity - cypher
How could i do that?
I have tried using OPTIONAL MATCH, but i can't get the two rows separately.
I have tried using variable length path, which returns the two paths but doesn't allow me to access and filter intermediate nodes. Plus it returns a list, and not a table of data.
I've seen this question:
Cypher - matching two different possible paths and return both
It's on the same subject but the example is very complex, a more generic solution to this simpler problem is what i'm looking for.
You can define what your end node by using WHERE statement. So in your case end node has no outgoing relationship. Not sure why you expect a null on return as you said neo - morpheus - null - leo
MATCH p=(n:Person{name:"Neo"})-[*]->(end) where not (end)-->()
RETURN extract(x IN nodes(p) | x.name)
Edit:
may not the the best option as I am not sure how to do this programmatically. If I use UNWIND I get back only one row. So this is a dummy solution
MATCH p=(n{name:"Neo"})-[*]->(end) where not (end)-->()
with nodes(p) as list
return list[0].name,list[1].name,list[2].name,list[3].name
You can use Cypher to match a path like this MATCH p=(:a)-[*]->(:d) RETURN p, and p will be a list of nodes/relationships in the path in the order it was traversed. You can apply WHERE to filter the path just like with node matching, and apply any list functions you need to it.
I will add these examples too
// Where on path
MATCH p=(:a)-[*]-(:d) WHERE NONE(n in NODES(p) WHERE n.name="Trinity") WITH NODES(p) as p RETURN p[0], p[1], p[2], p[3]
// Spit path into columns
MATCH p=(:a)-[*]-(:d) WITH NODES(p) as p RETURN p[0], p[1], p[2], p[3]
// Match path, filter on label
MATCH p=(:a)-[*]-(:d) WITH NODES(p) as p RETURN FILTER(n in p WHERE "a" in LABELS(n)) as a, FILTER(n in p WHERE "b" in LABELS(n)) as b, FILTER(n in p WHERE "c" in LABELS(n)) as c, FILTER(n in p WHERE "d" in LABELS(n)) as d
Unfortunately, you HAVE to explicitly set some logic for each column. You can't make dynamic columns (that I know of). In your table example, what is the rule for which column gets 'null'? In the last example, I set each column to be the set of nodes of a label.
I.m.o. you're asking for extensive post-processing of the results of a simply query (give me all the paths starting from Neo). I say this because :
You state you need to be able to specify specific WHERE clauses for each path (but you don't specify which clauses for which path ... indicating this might be a dynamic thing ?)
You don't know the size of the longest path beforehand ... but you still want the result to be a same-size-for-all-results table. And would any null columns then always be just before the end node ? Why (for that makes no real sense other then convenience) ?
...
Therefore (and again i.m.o.) you need to process the results in a (Java or whatever you prefer) program. There you'll have full control over the resultset and be able to slice and dice as you wish. Cypher (exactly like SQL in fact) can only do so much and it seems that you're going beyond that.
Hope this helps,
Regards,
Tom
P.S. This may seem like an easy opt-out, but look at how simple your query is as compared to the constructs that have to be wrought trying to answer your logic. So ... separate the concerns.

Is this the optimal cypher for returning every node of a subtree?

I have a graph of a tree structure (well no, more of a DAG because i can have multiple parents) and need to be able to write queries that return all results in a flat list, starting at a particular node(s) and down.
I've reduced one of my use cases to this simple example. In the ascii representation here, n's are my nodes and I've appended their id. p is a permission in my auth system, but all that is pertinent to the question is that it marks the spot from which I need to recurse downwards to collect nodes which should be returned by the query.
There can be multiple root nodes related to p's
The roots, such as n3 below, should be contained in the results, as well as the children
The relationship depth is unbounded.
Graph:
n1
^ ^
/ \
n2 n3<--p
^ ^
/ \
n4 n5
^
/
n6
If it's helpful, here's the cypher I ran to throw together this quick example:
CREATE path=(n1:n{id:1})<-[:HAS_PARENT]-(n2:n{id:2}),
(n1)<-[:HAS_PARENT]-(n3:n{id:3})<-[:HAS_PARENT]-(n4:n{id:4}),
(n3)<-[:HAS_PARENT]-(n5:n{id:5}),
(n4)<-[:HAS_PARENT]-(n6:n{id:6})
MATCH (n{id:3})
CREATE (:p)-[:IN]->(n)
Here is the current best query I have:
MATCH (n:n)<--(:p)
WITH collect (n) as parents, (n) as n
OPTIONAL MATCH (c)-[:HAS_PARENT*]->(n)
WITH collect(c) as children, (parents) as parents
UNWIND (parents+children) as tree
RETURN tree
This returns the correct set of results, and unlike some previous attempts I made which did not use any collect/unwind, the results come back as a single column of data as desired.
Is this the most optimal way of making this type of query? It is surprisingly more complex than I thought the simple scenario called for. I tried some queries where I combined the roots ("parents" in my query) with the "children" using a UNION clause, but I could not find a way to do so without repeating the query for the relationship with p. In my real world queries, that's a much more expensive operation which i've reduced down here for the example, so I cannot run it more than once.
This might suit your needs:
MATCH (c)-[:HAS_PARENT*0..]->(root:n)<--(:p)
RETURN root, COLLECT(c) AS tree
Each result row will contain a distinct root node and a collection if its tree nodes (including the root node).

Optimizing Cypher Query Neo4j

I want to write a query in Cypher and run it on Neo4j.
The query is:
Given some start vertexes, walk edges and find all vertexes that is connected to any of start vertex.
(start)-[*]->(v)
for every edge E walked
if startVertex(E).someproperty != endVertex(E).someproperty, output E.
The graph may contain cycles.
For example, in the graph above, vertexes are grouped by "group" property. The query should return 7 rows representing the 7 orange colored edges in the graph.
If I write the algorithm by myself it would be a simple depth / breadth first search, and for every edge visited if the filter condition is true, output this edge. The complexity is O(V+E)
But I can't express this algorithm in Cypher since it's very different language.
Then i wrote this query:
find all reachable vertexes
(start)-[*]->(v), reachable = start + v.
find all edges starting from any of reachable. if an edge ends with any reachable vertex and passes the filter, output it.
match (reachable)-[]->(n) where n in reachable and reachable.someprop != n.someprop
so the Cypher code looks like this:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
The performance of this query is not good as I thought. There are index on :Col(schema)
I am running neo4j 2.3.0 docker image from dockerhub on my windows laptop. Actually it runs on a linux virtual machine on my laptop.
My sample data is a small dataset that contains 0.1M vertexes and 0.5M edges. For some starting nodes it takes 60 or more seconds to complete this query. Any advice for optimizing or rewriting the query? Thanks.
The following code block is the logic I want:
VertexQueue1 = (starting vertexes);
VisitedVertexSet = (empty);
EdgeSet1 = (empty);
While (VertexSet1 is not empty)
{
Vertex0 = VertexQueue1.pop();
VisitedVertexSet.add(Vertex0);
foreach (Edge0 starting from Vertex0)
{
Vertex1 = endingVertex(Edge0);
if (Vertex1.schema <> Vertex0.schema)
{
EdgeSet1.put(Edge0);
}
if (VisitedVertexSet.notContains(Vertex1)
and VertexQueue1.notContains(Vertex1))
{
VertexQueue1.push(Vertex1);
}
}
}
return EdgeSet1;
EDIT:
The profile result shows that expanding all paths has a high cost. Looking at the row number, it seems that Cypher exec engine returns all paths but I want distint edge list only.
LEFT one:
match (start:Col {table:"F_XXY_DSMK_ITRPNL_IDX_STAT_W"})
,(start)-[*0..]->(prev:Col)-->(node:Col)
where prev.schema<>node.schema
return distinct prev,node
RIGHT one:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
I think Cypher lets this be much easier than you're expecting it to be, if I'm understanding the query. Try this:
MATCH (start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node
Though I'm not sure why you're comparing the schema property on the nodes. Isn't the schema for the start node fixed by the value that you pass in?
I might not be understanding the query though. If you're looking for more than just the nodes connected to the start node, you could do:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN prev, node
That open-ended variable length relationship specification might be slow, though.
Also note that when Cypher is browsing a particular path it stops which it finds that it's looped back onto some node (EDIT relationship, not node) in the path matched so far, so cycles aren't really a problem.
Also, is the DWMDATA value that you're passing in interpolated? If so, you should think about using parameters for security / performance:
http://neo4j.com/docs/stable/cypher-parameters.html
EDIT:
Based on your comment I have a couple of thoughts. First limiting to DISTINCT path isn't going to help because every path that it finds is distinct. What you want is the distinct set of pairs, I think, which I think could be achieved by just adding DISTINCT to the query:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINT prev, node
Here is another way to go about it which may or may not be more efficient, but might at least give you an idea for how to go about things differently:
MATCH
path=(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WITH rels(path) AS rels
UNWIND rels AS rel
WITH DISTINCT rel
WITH startNode(rel) AS start_node, endNode(rel) AS end_node
WHERE start_node.schema <> end_node.schema
RETURN start_node, end_node
I can't say that this would be faster, but here's another way to try:
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH collect(ID(node)) AS node_ids
MATCH (:Col)-[r]->(node:Col)
WHERE ID(node) IN node_ids
WITH DISTINCT r
RETURN startNode(r) AS start_node, endNode(r) AS end_node
I suspect that the problem in all cases is with the open-ended variable length path. I've actually asked on the Slack group to try to get a better understanding of how it works. In the meantime, for all the queries that you try I would suggest prefixing them with the PROFILE keyword to get a report from Neo4j on what parts of the query are slow.
// this is very inefficient!
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH distinct node
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
you might be better off with this:
MATCH (start:Col)
WHERE start.property IN {property_values}
MATCH (node:Col)
WHERE shortestPath((start)-[*]->(node)) IS NOT NULL
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;

Resources