Neo4j: optimize path generation query

How can I make the query below run in seconds instead of minutes?
I'm new to graph databases. Am I right in saying that node indexing won't help to speed up my query? As I understand it, indexes help to find the starting point of a traversal, not to perform the traversal itself.
Would relationship indexing help in my case?
Query
I have 2,500 nodes of type COLUMN and 52,000 relationships between nodes.
The query below is too slow; I don't even know how slow it is. It runs for more than 5 minutes, then I get a java.net.SocketTimeoutException.
Query
MATCH path = (start:PERSON)-[r:MET_REL*2..5]->(person:PERSON)
WHERE start.ID = '385'
WITH path UNWIND NODES(path) AS col
WITH path,
COLLECT(DISTINCT col.COUNTRY_ID) as distinctCountries
WHERE LENGTH(path) + 1 = SIZE(distinctCountries)
RETURN path
P.S.
Moreover, I want to do [r:MET_REL*2..25] instead of [r:MET_REL*2..5]

Make sure you have an index/constraint on :PERSON(ID)
Please try this:
MATCH path = (start:PERSON)-[:MET_REL*2..5]->(person:PERSON)
WHERE start.ID = '385'
WITH path, reduce(a=[], n in nodes(path) | case when n.COUNTRY_ID IN a then a else a + [n.COUNTRY_ID] end) as countries
WHERE LENGTH(path) + 1 = SIZE(countries)
RETURN path
With APOC there is an apoc.coll.toSet function that you could use on the countries.
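Both queries above apply the distinct-country check to each path after it has been expanded. As a language-neutral illustration of the same condition (no two nodes on a path share a COUNTRY_ID), here is a minimal Python sketch that abandons a branch as soon as a country repeats. The adjacency dict `graph`, the `country` mapping, and the function name `distinct_country_paths` are hypothetical, and this is not a description of how Neo4j evaluates the query.

```python
def distinct_country_paths(graph, country, start, min_hops=2, max_hops=5):
    """Yield paths (lists of node ids) that begin at `start`, have between
    min_hops and max_hops hops, and contain no two nodes with the same country."""
    def walk(path, seen_countries):
        hops = len(path) - 1
        if hops >= min_hops:
            yield list(path)
        if hops == max_hops:
            return
        for nxt in graph.get(path[-1], ()):
            c = country[nxt]
            if c in seen_countries:
                continue  # prune: this branch can never satisfy the condition
            path.append(nxt)
            seen_countries.add(c)
            yield from walk(path, seen_countries)
            seen_countries.remove(c)
            path.pop()

    yield from walk([start], {country[start]})


# Tiny hypothetical data set: "c" shares a country with the start node "a".
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": ["d"], "d": []}
country = {"a": "US", "b": "FR", "c": "US", "d": "DE"}
paths = list(distinct_country_paths(graph, country, "a"))
```

Because a repeated country cuts off the whole subtree below it, the amount of work depends on how many valid prefixes exist rather than on the total number of paths up to the maximum length.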

Related

How could I optimize this cypher query?

When I used this cypher query
match p=(n)-[r*8]-(n)
where id(n)=548
with p
where ALL(x IN nodes(p)[1..length(p)] WHERE SINGLE(y IN nodes(p)[1..length(p)] WHERE x=y))
return count(p)
it took 51922 ms to return the result; it is really a long time. How could I optimize this cypher query? Any help would be appreciated.
Looks like you want a simple circuit with no repeating nodes (except the start and end node).
There's an APOC Procedure to get all simple paths between two nodes, with a maximum path length. It doesn't currently work when the start and end nodes are the same, but if we set the end node as any adjacent node to your start node, and filter to only keep paths of length 7 (since the paths exclude the last hop back to the start node), then we should be able to get the right answer extremely fast.
match (n)--(m)
where id(n) = 548
with distinct n, m
call apoc.algo.allSimplePaths(n, m, '', 7) YIELD path
with path
where length(path) = 7
return count(path)
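To make the "simple circuit" condition concrete, here is a small Python sketch that counts fixed-length circuits through one node while never revisiting an intermediate node. It is only an illustration of the condition, not the APOC implementation; the adjacency dict `adj` and the function name `count_simple_circuits` are hypothetical. Like the undirected Cypher pattern, it counts the two traversal directions of a circuit separately.

```python
def count_simple_circuits(adj, start, length):
    """Count circuits of exactly `length` hops that begin and end at `start`
    and visit no other node more than once (undirected graph as dict of sets)."""
    if length < 3:
        raise ValueError("a simple circuit needs at least 3 hops")

    def walk(node, hops_left, visited):
        if hops_left == 1:
            # The final hop must lead straight back to the start node.
            return 1 if start in adj[node] else 0
        total = 0
        for nxt in adj[node]:
            if nxt not in visited:
                visited.add(nxt)
                total += walk(nxt, hops_left - 1, visited)
                visited.remove(nxt)
        return total

    return walk(start, length, {start})


# Hypothetical example: a square 0-1-2-3-0.
square = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}
four_hop = count_simple_circuits(square, 0, 4)
three_hop = count_simple_circuits(square, 0, 3)
```

Checking `visited` before each hop is what avoids generating all paths of the given length and filtering them afterwards.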

neo4j how to use count(distinct()) over the nodes of path

I am searching for the longest path in my graph, and I want to count the number of distinct nodes on that longest path.
I want to use count(distinct())
I tried two queries.
First is
match p=(primero)-[:ResponseTo*]-(segundo)
with max(length(p)) as lengthPath
match p1=(primero)-[:ResponseTo*]-(segundo)
where length(p1) = lengthPath
return nodes(p1)
The query result is a graph with the path nodes.
But if I tried the query
match p=(primero)-[:ResponseTo*]-(segundo)
with max(length(p)) as lengthPath
match p1=(primero)-[:ResponseTo*]-(segundo)
where length(p1) = lengthPath
return count(distinct(primero))
The result is
count(distinct(primero))
2
How can I use count(distinct()) over the node primero?
Node Primero has a field called id.
You should bind at least one of those nodes, add a direction, and also consider a path-length limit; otherwise this is an extremely expensive query.
match p=(primero)-[:ResponseTo*..30]-(segundo)
with p order by length(p) desc limit 1
unwind nodes(p) as n
return distinct n;

Optimizing Cypher Query Neo4j

I want to write a query in Cypher and run it on Neo4j.
The query is:
Given some start vertexes, walk the edges and find all vertexes that are connected to any of the start vertexes.
(start)-[*]->(v)
for every edge E walked
if startVertex(E).someproperty != endVertex(E).someproperty, output E.
The graph may contain cycles.
For example, in the graph above, vertexes are grouped by "group" property. The query should return 7 rows representing the 7 orange colored edges in the graph.
If I write the algorithm by myself it would be a simple depth / breadth first search, and for every edge visited if the filter condition is true, output this edge. The complexity is O(V+E)
But I can't express this algorithm in Cypher since it's very different language.
Then I wrote this query:
find all reachable vertexes
(start)-[*]->(v), reachable = start + v.
find all edges starting from any of reachable. if an edge ends with any reachable vertex and passes the filter, output it.
match (reachable)-[]->(n) where n in reachable and reachable.someprop != n.someprop
so the Cypher code looks like this:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
The performance of this query is not as good as I expected. There is an index on :Col(schema).
I am running neo4j 2.3.0 docker image from dockerhub on my windows laptop. Actually it runs on a linux virtual machine on my laptop.
My sample data is a small dataset that contains 0.1M vertexes and 0.5M edges. For some starting nodes it takes 60 or more seconds to complete this query. Any advice for optimizing or rewriting the query? Thanks.
The following code block is the logic I want:
VertexQueue1 = (starting vertexes);
VisitedVertexSet = (empty);
EdgeSet1 = (empty);
While (VertexQueue1 is not empty)
{
    Vertex0 = VertexQueue1.pop();
    VisitedVertexSet.add(Vertex0);
    foreach (Edge0 starting from Vertex0)
    {
        Vertex1 = endingVertex(Edge0);
        if (Vertex1.schema <> Vertex0.schema)
        {
            EdgeSet1.put(Edge0);
        }
        if (VisitedVertexSet.notContains(Vertex1)
            and VertexQueue1.notContains(Vertex1))
        {
            VertexQueue1.push(Vertex1);
        }
    }
}
return EdgeSet1;
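For reference, the pseudocode above can be sketched as runnable Python. This is only an illustration of the intended O(V+E) traversal, not something Neo4j executes; the `edges` and `schema` dicts and the function name `cross_schema_edges` are hypothetical stand-ins for the graph data.

```python
from collections import deque


def cross_schema_edges(edges, schema, starts):
    """Breadth-first walk from `starts`; return every reachable edge (v0, v1)
    whose endpoints have different schema values.

    edges:  dict mapping a node to a list of its successor nodes
    schema: dict mapping a node to its schema name
    """
    queue = deque(starts)
    seen = set(starts)  # nodes that are queued or already processed
    result = []
    while queue:
        v0 = queue.popleft()
        for v1 in edges.get(v0, ()):
            if schema[v1] != schema[v0]:
                result.append((v0, v1))
            if v1 not in seen:
                seen.add(v1)
                queue.append(v1)
    return result


# Hypothetical graph with a cycle a -> b -> c -> a.
edges = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
schema = {"a": "S1", "b": "S1", "c": "S2", "d": "S2"}
found = cross_schema_edges(edges, schema, ["a"])
```

Each node is dequeued once, so each outgoing edge is examined once and the cycle does not cause repeated work.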
EDIT:
The profile result shows that expanding all paths has a high cost. Looking at the row counts, it seems that the Cypher execution engine returns all paths, but I want only the distinct edge list.
LEFT one:
match (start:Col {table:"F_XXY_DSMK_ITRPNL_IDX_STAT_W"})
,(start)-[*0..]->(prev:Col)-->(node:Col)
where prev.schema<>node.schema
return distinct prev,node
RIGHT one:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
I think Cypher lets this be much easier than you're expecting it to be, if I'm understanding the query. Try this:
MATCH (start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node
Though I'm not sure why you're comparing the schema property on the nodes. Isn't the schema for the start node fixed by the value that you pass in?
I might not be understanding the query though. If you're looking for more than just the nodes connected to the start node, you could do:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN prev, node
That open-ended variable length relationship specification might be slow, though.
Also note that when Cypher is traversing a particular path, it stops when it finds that it has looped back onto a relationship (EDIT: relationship, not node) already in the path matched so far, so cycles aren't really a problem.
Also, is the DWMDATA value that you're passing in interpolated? If so, you should think about using parameters for security / performance:
http://neo4j.com/docs/stable/cypher-parameters.html
EDIT:
Based on your comment I have a couple of thoughts. First, limiting to DISTINCT paths isn't going to help, because every path that it finds is distinct. What you want, I think, is the distinct set of pairs, which could be achieved by just adding DISTINCT to the query:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINCT prev, node
Here is another way to go about it which may or may not be more efficient, but might at least give you an idea for how to go about things differently:
MATCH
path=(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WITH rels(path) AS rels
UNWIND rels AS rel
WITH DISTINCT rel
WITH startNode(rel) AS start_node, endNode(rel) AS end_node
WHERE start_node.schema <> end_node.schema
RETURN start_node, end_node
I can't say that this would be faster, but here's another way to try:
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH collect(ID(node)) AS node_ids
MATCH (:Col)-[r]->(node:Col)
WHERE ID(node) IN node_ids
WITH DISTINCT r
RETURN startNode(r) AS start_node, endNode(r) AS end_node
I suspect that the problem in all cases is with the open-ended variable length path. I've actually asked on the Slack group to try to get a better understanding of how it works. In the meantime, for all the queries that you try I would suggest prefixing them with the PROFILE keyword to get a report from Neo4j on what parts of the query are slow.
// this is very inefficient!
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH distinct node
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
You might be better off with this:
MATCH (start:Col)
WHERE start.property IN {property_values}
MATCH (node:Col)
WHERE shortestPath((start)-[*]->(node)) IS NOT NULL
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;

Limiting number of paths the query search in cypher query other than limit

I want the query to stop as soon as it finds first 10 paths and return those.
But by default the LIMIT clause finds all the paths and then just returns the first 10.
Because the total number of paths in my case will be around 10k to 20k, it's not practical to do that.
I tried the following two queries, which don't work:
match path = (first:Job)-[:PRECEDES*]->(last:Job)
where first.name = 'xyz' and last.name = 'abc'
return nodes(path) as pathlist
match path1 = (first:Job)-[:PRECEDES*]->(middle:Job)
where first.name = 'xyz'
with middle, path1
match path2 = (middle:Job)-[:PRECEDES*]->(last:Job)
where last.name = 'abc'
return nodes(path1),nodes(path2) as pathlist
Both take forever to complete.
Be sure to have an index in place:
CREATE INDEX ON :Job(name)
By inspecting the statements using PROFILE in neo4j-shell I've found the following being the cheapest variant:
MATCH (a:Job {name:'xyz'}), (b:Job {name:'abc'})
MATCH path=(a)-[:PRECEDES*]->(b)
RETURN nodes(path) LIMIT 10
Please note that I'm talking about Neo4j 2.1.6. Since Cypher's implementation is steadily evolving, an upcoming version might already optimize your statements appropriately.
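The behaviour the question asks for, stopping as soon as the first few paths are found, corresponds to lazy enumeration. Here is a minimal Python sketch of that idea, independent of how any particular Neo4j version handles LIMIT. The successor dict `succ`, the job names, and the function name `iter_paths` are hypothetical, and the sketch avoids repeated nodes rather than repeated relationships.

```python
from itertools import islice


def iter_paths(succ, first, last):
    """Lazily yield simple paths from `first` to `last` using an explicit
    depth-first stack; nothing is computed until the caller asks for it."""
    stack = [(first, [first])]
    while stack:
        node, path = stack.pop()
        if node == last:
            yield path
            continue
        for nxt in succ.get(node, ()):
            if nxt not in path:  # keep the path free of repeated nodes
                stack.append((nxt, path + [nxt]))


# Hypothetical PRECEDES graph with two routes from 'xyz' to 'abc'.
succ = {"xyz": ["m1", "m2"], "m1": ["abc"], "m2": ["abc"], "abc": []}
first_only = list(islice(iter_paths(succ, "xyz", "abc"), 1))
all_paths = list(iter_paths(succ, "xyz", "abc"))
```

With `islice`, the generator is abandoned after the requested number of paths, so the remaining paths are never expanded.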

Cypher query to get shortest path between A and B that doesn't go through C, that isn't massively slow? Or recommend alternative to Cypher/Neo4j

I'm working with a graph that has thousands of nodes. Say I have person nodes, and FRIENDS relationships between them. e.g., gus-[:FRIENDS]-skylar
If I wanted to find the shortest friend path between hank and gus as long as they're not separated by more than 20 rels, I could do this:
START hank=node(68), gus=node(66)
MATCH p = shortestPath((hank)-[:FRIENDS*..20]-(gus))
RETURN p
This works and is fast, even when the found shortest path is of length 10 or more.
But say I wanted to find a path from hank to gus that does not go through glenn?
The query I've tried is this:
START hank=node(68), gus=node(66), glenn=node(59)
MATCH p =(hank)-[:FRIENDS*..20]-(gus)
WHERE NOT glenn IN nodes(p)
RETURN p
ORDER BY length(p)
LIMIT 1;
This works on very small graphs (30 or so people), but if there are thousands, the JVM runs out of heap space.
So I'm guessing Cypher finds ALL paths between gus and hank of length 20 or less, and then applies the WHERE filter? It's clear why that would be slow.
In an abstract sense, this algorithm should be doable with the same big O runtime, because all that would change is that you check to make sure each node (as you search) isn't the one you want to avoid.
Any suggestions for how to accomplish this? I'm pretty new to Cypher.
If this is not possible with Cypher, can you recommend some other database and graph language "stack"?
Thanks
Can you test the performance of the following query? The main difference is that it compares paths instead of nodes. I've added a direction in the paths as well, as that will speed up the query.
START hank=node(68), gus=node(66), glenn=node(59)
MATCH p = allshortestPaths((hank)-[:FRIENDS]->(gus))
WITH COLLECT(p) AS gusPaths, hank, glenn
MATCH p2 = allshortestPaths((hank)-[:FRIENDS]->(glenn))
WITH COLLECT(p2) AS glennPaths, gusPaths
WITH filter(x IN gusPaths
WHERE NONE (x2 IN glennPaths
WHERE x = x2)) AS filtered
RETURN filtered
ORDER BY length(filtered)
LIMIT 1
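The question's observation, that avoiding a node only requires one extra check during the search, can be illustrated with a plain breadth-first search. This is a minimal Python sketch under stated assumptions, not a Cypher or Neo4j feature: the adjacency dict `adj` and the function name `shortest_path_avoiding` are hypothetical, and the graph is treated as undirected by listing each friendship in both directions.

```python
from collections import deque


def shortest_path_avoiding(adj, source, target, avoid, max_hops=20):
    """Return a shortest path from source to target that never visits `avoid`
    and has at most max_hops hops, or None if no such path exists."""
    if source == avoid or target == avoid:
        return None
    parent = {source: None}
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk parent links back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        if depth[node] == max_hops:
            continue
        for nxt in adj.get(node, ()):
            if nxt == avoid or nxt in parent:
                continue  # the only change from an ordinary BFS is `nxt == avoid`
            parent[nxt] = node
            depth[nxt] = depth[node] + 1
            queue.append(nxt)
    return None


# Hypothetical friendship graph.
adj = {
    "hank": ["glenn", "skylar"],
    "glenn": ["hank", "gus"],
    "skylar": ["hank", "walt"],
    "walt": ["skylar", "gus"],
    "gus": ["glenn", "walt"],
}
detour = shortest_path_avoiding(adj, "hank", "gus", avoid="glenn")
direct = shortest_path_avoiding(adj, "hank", "gus", avoid=None)
```

The search still visits each node at most once, so the cost stays in the same order as an unrestricted shortest-path search.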
