Limiting Neo4j Cypher query results by the sum of a relationship property

Is there a way to limit a cypher query by the sum of a relationship property?
I'm trying to create a Cypher query that returns nodes that are within a distance of 100 of the start node. All the relationships have a distance property; the sum of the distances along a path is the total distance from the start node.
If the WHERE clause could handle aggregate functions, what I'm looking for might look like this:
START n=node(1)
MATCH path = n-[rel:street*]-x
WHERE SUM( rel.distance ) < 100
RETURN x
Is there a way that I can sum the distances of the relationships in the path for the where clause?

Sure, what you want is the equivalent of a HAVING clause in a SQL query.
In Cypher you can chain query segments with WITH and use the results of earlier parts in the next part (see the manual).
For your example one would assume:
START n=node(1)
MATCH n-[rel:street*]-x
WITH SUM(rel.distance) as distance
WHERE distance < 100
RETURN x
Unfortunately, sum doesn't work with collections yet.
So I tried to do it differently (for variable-length paths):
START n=node(1)
MATCH n-[rel:street*]-x
WITH collect(rel.distance) as distances
WITH head(distances) + head(tail(distances)) + head(tail(tail(distances))) as distance
WHERE distance < 100
RETURN x
Unfortunately, head of an empty list doesn't return null (which could be coalesced to 0); it just fails. So this approach only works for fixed-length paths. I don't know whether that is enough for you.

I came across the same problem recently. In more recent versions of Neo4j this is solved by the extract and reduce functions. You could write:
START n=node(1)
MATCH path = (n)-[rel:street*..100]-(x)
WITH extract(r in rel | r.distance) as distances, x
WITH reduce(res = 0, d in distances | res + d) as distance, x
WHERE distance < 100
RETURN x
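On newer Neo4j versions, where START and extract have been deprecated or removed, the same idea can be sketched as below. This is an untested sketch: the hop limit of 10 is an arbitrary assumption to keep the expansion bounded, and id(n) = 1 stands in for the legacy node(1) lookup.

```cypher
// Sum the distance property along each path, keep paths under 100,
// and report the smallest qualifying distance per reachable node.
MATCH path = (n)-[:street*..10]-(x)   // hop limit 10 is an assumption
WHERE id(n) = 1
WITH x, reduce(total = 0, r IN relationships(path) | total + r.distance) AS distance
WHERE distance < 100
RETURN x, min(distance) AS distance
```

Several paths can reach the same node, so aggregating with min(distance) returns each node once instead of once per path.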

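I don't know about a limitation in the WHERE clause, but you can bound the path length directly in the MATCH clause. Note that *..100 limits the number of relationships in the path, not the summed distance: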
START n=node(1)
MATCH path = n-[rel:street*..100]-x
RETURN x
see http://docs.neo4j.org/chunked/milestone/query-match.html#match-variable-length-relationships

Related

Neo4j Cypher Query - return counts of relationships in separate columns

I have experience with Neo4j and Cypher, but still struggle with aggregate functions. I'm trying to pull a CSV out of Neo4j that should look like this:
Location | Number of Node X at Location | Number of Node Y at Location
ABC | 5 | 20
DEF | 39 | 4
Etc. | # | #
My current query looks like this:
MATCH (loc:Location)--(x:Node_X)
RETURN loc.key AS Location, count(x) AS `Number of Node X at Location`, 0 AS `Number of Node Y at Location`
UNION
MATCH (loc:Location)--(y:Node_Y)
RETURN loc.key AS Location, 0 AS `Number of Node X at Location`, count(y) AS `Number of Node Y at Location`
Which yields a table like:
Location | Number of Node X at Location | Number of Node Y at Location
ABC | 5 | 0
DEF | 39 | 0
Etc. | # | #
ABC | 0 | 20
DEF | 0 | 4
Etc. | # | #
I think I'm close, but I get twice as many Location rows as I need, and I'm not sure how to make the results more succinct. Suggestions on this, and general tips for aggregate functions, are appreciated!
I think you can solve it like this, and it works even when the counts are 0:
MATCH (loc:loc1)
RETURN loc.type ,
size((loc)--(:Node_X)) AS xCount,
size((loc)--(:Node_Y)) AS yCount
You can also do
MATCH (loc:loc1)
RETURN loc.type ,
size([(loc)--(x:Node_X) | x]) AS xCount
You can aggregate with distinct here.
MATCH (loc:loc1)--(x:Node_X), (loc)--(y:Node_Y)
RETURN loc.key ,
count(distinct(x)) as NODES_OF_TYPE_X,
count(distinct(y)) as NODES_OF_TYPE_Y
The problem with matching x and y together in the query above is that it changes the cardinality of the result: every match of x is paired with every match of y.
If a location has n1 x nodes and n2 y nodes and you don't use distinct, you get n1*n2 rows for it, so both counts come out as n1*n2.
@PrashantUpadhyay got the answer started, but I think this is the final answer I was looking for. It accounts for cases where counts may be zero, but still includes all Location rows.
MATCH (loc:Location)
OPTIONAL MATCH (loc)--(x:Node_X)
OPTIONAL MATCH (loc)--(y:Node_Y)
RETURN loc.key AS Location,
coalesce(count(distinct(x)), 0) as Node_X,
coalesce(count(distinct(y)), 0) as Node_Y
ORDER BY Location
I'd try this:
MATCH (loc:Location)
with distinct loc
OPTIONAL MATCH (loc)--(x:Node_X)
WITH distinct loc, count(x) AS xnum
OPTIONAL MATCH (loc)--(y:Node_Y)
WITH DISTINCT loc, count(y) AS ynum, xnum
RETURN
DISTINCT loc.key as Location,
xnum as `Number of Node X at Location`,
ynum as `Number of Node Y at Location`

Shortest Paths with Cost Property

I want to look up the top 5 (shortest) paths in my graph (Neo4j 3.0.4) from point A to point Z.
The graph consists of several nodes that are connected by the relation "CONNECTED_BY". This connection has a cost property that should be minimized.
I started with this:
MATCH p=(from:Stop{stopId:'A'}), (to:Stop{stopUri:'Z'}),
path = allShortestPaths((from)-[:CONNECTED_TO*]->(to))
WITH REDUCE (total = 0, r in relationships(p) | total + r.cost) as tt, path
RETURN path, tt
This query always returns the subgraph with the fewest hops; the cost property is not considered. There is another subgraph with more hops that has a lower total cost. What am I doing wrong?
Furthermore, I actually want to get the top 5 subgraphs. If I execute this query:
MATCH p=(from:Stop{stopUri:'A'})-[r:CONNECTED_TO*10]->(to:Stop{stopUri:'Z'}) RETURN p
I can see several paths, but the first query returns just one path.
The path should not contain loops etc., of course.
I want to execute this query via the REST API, so a REST call or Cypher query should do it.
EDIT1:
I want to execute this as a REST call, so I tried the Dijkstra algorithm. This seems to be a good way, but I have to calculate the weight by adding 3 different cost properties of the relationship. How could this be achieved?
allShortestPaths will find the shortest path between two points and then match every path that has the same number of hops. If you want to minimize based on cost rather than traversal length, try something like this:
MATCH (from:Stop{stopId:'A'}), (to:Stop{stopUri:'Z'}),
path = (from)-[:CONNECTED_TO*]->(to)
WITH REDUCE (total = 0, r in relationships(path) | total + r.cost) as cost, path
ORDER BY cost
RETURN path, cost LIMIT 5
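For the three-cost case from EDIT1, one option is to add the properties inside the reduce expression. The sketch below is untested and rests on assumptions: costA, costB and costC are hypothetical property names, the 10-hop limit is arbitrary, and stopUri is used for both endpoints as in the asker's second query. The extra predicate rejects paths that visit a node twice.

```cypher
MATCH (from:Stop {stopUri: 'A'}), (to:Stop {stopUri: 'Z'}),
      path = (from)-[:CONNECTED_TO*..10]->(to)   // hop limit is an assumption
// Keep only paths in which every node appears exactly once (no loops).
WHERE all(n IN nodes(path) WHERE single(m IN nodes(path) WHERE m = n))
WITH path,
     reduce(total = 0, r IN relationships(path) |
            // costA/costB/costC are hypothetical; a missing one counts as 0
            total + coalesce(r.costA, 0) + coalesce(r.costB, 0) + coalesce(r.costC, 0)) AS cost
RETURN path, cost
ORDER BY cost
LIMIT 5
```

This still enumerates every path up to the hop limit before sorting, so it is only practical on small graphs or with a tight bound.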

Clarification on Cypher Dijkstra query: how to use a WHERE clause in the reduce function

I am trying to find the shortest path between two nodes (using the Dijkstra algorithm) where all the relationships have a distance property.
My requirement is that I need to filter paths based on filters provided for nodes or relationships before calling shortestPath.
How can I use a WHERE clause in the reduce function below to filter nodes or relationships?
MATCH p=(a:Place{Name:"US"})-[rels:IS_LOCATED_AT|CARRIES|BELONGS_TO*]-(b:Place{Name:"UK"})
RETURN p as shortestPath,
REDUCE(distance=0, r in rels(p) | distance+r.distance) AS totalDistance
ORDER BY totalDistance ASC
Within a REDUCE expression, you can use CASE to selectively update the accumulator variable (distance, in your example).
As a simple (and silly) example, if you wanted to ignore any distance less than 100 miles:
REDUCE(distance=0, r in rels(p) | CASE WHEN r.distance >= 100 THEN distance+r.distance ELSE distance END) AS totalDistance
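If the goal is to discard whole paths rather than skip individual distances, the filter can go in a WHERE clause on the path before the sum is computed. A sketch, with assumptions: the 6-hop limit, the 100 threshold and the excluded place name "FR" are illustrative values, not from the question.

```cypher
MATCH p = (a:Place {Name: "US"})-[:IS_LOCATED_AT|CARRIES|BELONGS_TO*..6]-(b:Place {Name: "UK"})
// Relationship filter: every hop must satisfy the condition.
WHERE all(r IN relationships(p) WHERE r.distance >= 100)
  // Node filter: the path must not pass through the excluded place.
  AND none(n IN nodes(p) WHERE n.Name = "FR")
RETURN p AS shortestPath,
       reduce(distance = 0, r IN relationships(p) | distance + r.distance) AS totalDistance
ORDER BY totalDistance ASC
LIMIT 1
```

Here all and none decide which paths survive, while reduce only sums the distances of the survivors.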

Optimizing Cypher Query Neo4j

I want to write a query in Cypher and run it on Neo4j.
The query is:
Given some start vertexes, walk the edges and find all vertexes that are reachable from any of the start vertexes.
(start)-[*]->(v)
for every edge E walked
if startVertex(E).someproperty != endVertex(E).someproperty, output E.
The graph may contain cycles.
For example, in the graph above, vertexes are grouped by "group" property. The query should return 7 rows representing the 7 orange colored edges in the graph.
If I wrote the algorithm myself, it would be a simple depth-first or breadth-first search that outputs each visited edge for which the filter condition is true. The complexity is O(V+E).
But I can't express this algorithm in Cypher since it's a very different language.
Then I wrote this query:
find all reachable vertexes
(start)-[*]->(v), reachable = start + v.
find all edges starting from any of reachable. if an edge ends with any reachable vertex and passes the filter, output it.
match (reachable)-[]->(n) where n in reachable and reachable.someprop != n.someprop
so the Cypher code looks like this:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
The performance of this query is not as good as I expected. There is an index on :Col(schema).
I am running neo4j 2.3.0 docker image from dockerhub on my windows laptop. Actually it runs on a linux virtual machine on my laptop.
My sample data is a small dataset that contains 0.1M vertexes and 0.5M edges. For some starting nodes it takes 60 or more seconds to complete this query. Any advice for optimizing or rewriting the query? Thanks.
The following code block is the logic I want:
VertexQueue1 = (starting vertexes);
VisitedVertexSet = (empty);
EdgeSet1 = (empty);
While (VertexQueue1 is not empty)
{
    Vertex0 = VertexQueue1.pop();
    VisitedVertexSet.add(Vertex0);
    foreach (Edge0 starting from Vertex0)
    {
        Vertex1 = endingVertex(Edge0);
        if (Vertex1.schema <> Vertex0.schema)
        {
            EdgeSet1.put(Edge0);
        }
        if (VisitedVertexSet.notContains(Vertex1)
            and VertexQueue1.notContains(Vertex1))
        {
            VertexQueue1.push(Vertex1);
        }
    }
}
return EdgeSet1;
EDIT:
The profile result shows that expanding all paths has a high cost. Looking at the row counts, it seems that the Cypher execution engine returns all paths, but I want the distinct edge list only.
LEFT one:
match (start:Col {table:"F_XXY_DSMK_ITRPNL_IDX_STAT_W"})
,(start)-[*0..]->(prev:Col)-->(node:Col)
where prev.schema<>node.schema
return distinct prev,node
RIGHT one:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
I think Cypher lets this be much easier than you're expecting it to be, if I'm understanding the query. Try this:
MATCH (start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node
Though I'm not sure why you're comparing the schema property on the nodes. Isn't the schema for the start node fixed by the value that you pass in?
I might not be understanding the query though. If you're looking for more than just the nodes connected to the start node, you could do:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN prev, node
That open-ended variable length relationship specification might be slow, though.
Also note that when Cypher is following a particular path, it stops when it finds that it has looped back onto some node (EDIT: relationship, not node) in the path matched so far, so cycles aren't really a problem.
Also, is the DWMDATA value that you're passing in interpolated? If so, you should think about using parameters for security / performance:
http://neo4j.com/docs/stable/cypher-parameters.html
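As a sketch of what the parameterized form could look like: the parameter names schema and table are assumptions, and the client has to supply them alongside the query. The {name} placeholder syntax matches the Neo4j 2.3 used in this thread; later versions write $schema and $table instead.

```cypher
// Parameters passed by the client, e.g.
//   schema: "<value of DWMDATA>", table: "CHK_P_T80_ASSET_ACCT_AMT_DD"
MATCH (start:Col {schema: {schema}, table: {table}})
MATCH (start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINCT prev, node
```

Because the query text no longer changes between calls, the server can reuse the cached plan, and no value is spliced into the query string.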
EDIT:
Based on your comment I have a couple of thoughts. First, limiting to DISTINCT paths isn't going to help, because every path that it finds is distinct. What you want, I think, is the distinct set of pairs, which could be achieved by just adding DISTINCT to the query:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINCT prev, node
Here is another way to go about it which may or may not be more efficient, but might at least give you an idea for how to go about things differently:
MATCH
path=(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-[*]->(node:Col)
WITH rels(path) AS rels
UNWIND rels AS rel
WITH DISTINCT rel
WITH startNode(rel) AS start_node, endNode(rel) AS end_node
WHERE start_node.schema <> end_node.schema
RETURN start_node, end_node
I can't say that this would be faster, but here's another way to try:
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH collect(ID(node)) AS node_ids
MATCH (:Col)-[r]->(node:Col)
WHERE ID(node) IN node_ids
WITH DISTINCT r
RETURN startNode(r) AS start_node, endNode(r) AS end_node
I suspect that the problem in all cases is with the open-ended variable length path. I've actually asked on the Slack group to try to get a better understanding of how it works. In the meantime, for all the queries that you try I would suggest prefixing them with the PROFILE keyword to get a report from Neo4j on what parts of the query are slow.
// this is very inefficient!
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH distinct node
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
you might be better off with this:
MATCH (start:Col)
WHERE start.property IN {property_values}
MATCH (node:Col)
WHERE shortestPath((start)-[*]->(node)) IS NOT NULL
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
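If the APOC procedure library is available (it needs Neo4j 3.x or later, so not the 2.3.0 image from the question), its apoc.path.subgraphNodes procedure does a traversal that visits each node once, which is close to the breadth-first logic the asker describes. The sketch below is untested; whether the start node is part of the yielded set by default is worth checking against your APOC version.

```cypher
MATCH (start:Col {table: "CHK_P_T80_ASSET_ACCT_AMT_DD"})
// '>' follows outgoing relationships of any type; '+Col' keeps only :Col nodes.
CALL apoc.path.subgraphNodes(start, {relationshipFilter: '>', labelFilter: '+Col'})
YIELD node
WITH collect(DISTINCT node) AS reachable
UNWIND reachable AS prev
// Any node one outgoing hop from a reachable node is itself reachable.
MATCH (prev)-->(next:Col)
WHERE prev.schema <> next.schema
RETURN DISTINCT prev, next
```

The point of the design is that paths are never enumerated: the procedure yields nodes, and the edge check is then a single one-hop expansion per node.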

Cypher query to get shortest path between A and B that doesn't go through C, that isn't massively slow? Or recommend alternative to Cypher/Neo4j

I'm working with a graph that has thousands of nodes. Say I have person nodes, and FRIENDS relationships between them. e.g., gus-[:FRIENDS]-skylar
If I wanted to find the shortest friend path between hank and gus as long as they're not separated by more than 20 rels, I could do this:
START hank=node(68), gus=node(66)
MATCH p = shortestPath((hank)-[:FRIENDS*..20]-(gus))
RETURN p
This works and is fast, even when the found shortest path is of length 10 or more.
But say I wanted to find a path from hank to gus that does not go through glenn?
The query I've tried is this:
START hank=node(68), gus=node(66), glenn=node(59)
MATCH p =(hank)-[:FRIENDS*..20]-(gus)
WHERE NOT glenn IN nodes(p)
RETURN p
ORDER BY length(p)
LIMIT 1;
This works on very small graphs (30 or so people), but with thousands of people the JVM runs out of heap space.
So I'm guessing Cypher finds ALL paths between gus and hank of length 20 or less, and then applies the WHERE filter? It's clear why that would be slow.
In an abstract sense, this algorithm should be doable with the same big O runtime, because all that would change is that you check to make sure each node (as you search) isn't the one you want to avoid.
Any suggestions for how to accomplish this? I'm pretty new to Cypher.
If this is not possible with Cypher, can you recommend some other database and graph language "stack"?
Thanks
Can you test the performance of the following query? The main difference is that it compares paths instead of nodes. I've added a direction in the paths as well, as that will speed up the query.
START hank=node(68), gus=node(66), glenn=node(59)
MATCH p = allshortestPaths((hank)-[:FRIENDS]->(gus))
WITH COLLECT(p) AS gusPaths, hank, glenn
MATCH p2 = allshortestPaths((hank)-[:FRIENDS]->(glenn))
WITH COLLECT(p2) AS glennPaths, gusPaths
WITH filter(x IN gusPaths
WHERE NONE (x2 IN glennPaths
WHERE x = x2)) AS filtered
RETURN filtered
ORDER BY length(filtered)
LIMIT 1
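Another option to try is to keep shortestPath and put the exclusion in a WHERE clause on the path. Newer Neo4j versions document that predicates on a shortestPath pattern are evaluated while choosing the path, not after it; whether that holds on your version is an assumption to verify with PROFILE. The id() lookups replace the legacy START clause.

```cypher
MATCH (hank), (gus), (glenn)
WHERE id(hank) = 68 AND id(gus) = 66 AND id(glenn) = 59
// Shortest FRIENDS path of at most 20 hops that never visits glenn.
MATCH p = shortestPath((hank)-[:FRIENDS*..20]-(gus))
WHERE none(n IN nodes(p) WHERE n = glenn)
RETURN p
```

If the planner cannot apply the predicate during the search, it may fall back to an exhaustive search, which would bring back the memory problem described in the question.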
