Neo4j version: 3.5.16
What kind of API / driver do you use: Python API with py2neo to run the query with graph.run()
Py2neo version: 4.3.0.
Hey all,
I'm trying to optimize a cypher query to retrieve a variable length path.
The graph is created each time data arrives and startNode and endNode are fixed on their name property. Once created the graph, I have a startNode and an endNode and the corolllary/objective is:
"From all the possible paths with a minimum length of X and a maximum length of Y, I want to find the shortest path that yields the highest aggregated relationship value".
What I actually have managed to do is: "get the path of length between X and Y that yields the highest aggregated relationship value" with the following cyper query:
MATCH path = (startN:Batch { name: $startNode })-[:CHANGES_TO*4..7]->(endN:Batch { name: $endNode })
RETURN path,
REDUCE (s = 1, r IN RELATIONSHIPS(path) | s * r.rateValue) AS finalBatchValue
ORDER BY finalBatchValue DESC
LIMIT 1
Howerver, it takes some time to run. ¿Could someone provide ideas on how to optimize this, both to accomplish the objetive of the shortestPath and to optimize the query for running faster if possible?
I tried to make it work with APOC methods like allShortestPaths or Dijkstra with no success; it ended up returning the shortest path and I wasn't able to fix the minimum amount of nodes to consider.
Any help is much appreciated.
Since you are basically looking for strongest one in shortest length. You can come around the performance problem with a trick.
You can query the graph with fixed length than variable length. But you have to query n2-n1+1 times , in your case , 4 times , first with length 4, and then 5 and so on .
You can stop querying if you find a path at any point.
This approach will tremendously decrease the data loaded each time. But you have to hit the graph multiple times.Its most likely that average time taken for four hits approach will be less than single hit with variable length.
The reason being you don't calculate all the paths of higher length if you find a lower length solution.
Since,longer the path gets, the time taken will grow exponentially.
This is not possible only using cypher . One way is writing neo4j procedure in java and using that in cypher query .
The second way is : hitting neo4j using different query.
i am writing python code for your case here ,
query = 'MATCH path = (startN:Batch { name: $startNode })-[:CHANGES_TO*LENGTH_PARAM]->(endN:Batch { name: $endNode })
RETURN path,
REDUCE (s = 1, r IN RELATIONSHIPS(path) | s * r.rateValue) AS finalBatchValue
ORDER BY finalBatchValue DESC
LIMIT 1'
for length in range(4,8):
query = query.replace('LENGTH_PARAM',str(x))
result = graph.run(query)
#if result size > 0
#your implementation
#final_result= result['path']
#return final_result
That's how it works, here in worst case,you need to hit graph four times for each start,end node pair . Network calls increase, but average time taken should be reduced.
With java plugins, it can be reduced to one hit like previous query , as you can do the loop part inside the java code .
Related
we have some data about some nodes that nodes connected to each other with relations we called them ( cable )
the number of nodes is : 349 and the number of cables is : 924
we need to find possible path ( not shortest ) between two nodes and used this :
MATCH p=(n:location)-[*]-(m:location)
WHERE n.lo_id = 70 AND m.lo_id = 486
AND ALL(x IN NODES(p) WHERE SINGLE(y IN NODES(p) WHERE y = x))
return p
but it's failed . i used to explain and saw in plan that in "VarLengthExpand(Into)#neo4j" about
67,837,845,872,747,150,000 estimated rows !!!
what's wrong with this query ?
i'm newbie with neo4j . should i put index on fields or rewrite query ?
would you please help me to make it work and find possible path with a good query between nodes ?
Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED.
It would be better to use APOC path expand config,
https://neo4j.com/labs/apoc/4.1/graph-querying/expand-paths-config/
set uniqueness to NODE_PATH
use bfs: false
use your end node as terminator node
possible set a max depth at least as long as you're testing
What do you want to do with all those billions of paths?
Basically path expansion is degree to the power of hops.
So for graph with average degree of 10, a 10 hop path would be 10^10.
I would really suggest to first figure out what you want to do with all the paths and then express better how you want to guide the expansion.
If you have business rules, you can use the traversal API from a user defined procedure to guide the traversal/expansion at each step. (See the implementation of that APOC procedure on GitHub)
I have a neo4j database with ~260000 (EDIT: Incorrect by order of magnitude previously, missing 0) nodes of genes, something along the lines of:
example_nodes: sourceId, targetId
with an index on both sourceId and targetId
I am trying to build the relationships between all the nodes but am constantly running into OOM issues. I've increased my JVM heap size to -Xmx4096m and dbms.memory.pagecache.size=16g on a system with 16G of RAM.
I am assuming I need to optimize my query because it simply cannot complete in any of its current forms. However, I have tried the following three to no avail:
MATCH (start:example_nodes),(end:example_nodes) WHERE start.targetId = end.sourceId CREATE (start)-[r:CONNECT]->(end) RETURN r
(on a subset of the 5000 nodes, this query above completes in only a matter of seconds. It does of course warn: This query builds a cartesian product between disconnected patterns.)
MATCH (start:example_nodes) WITH start MATCH (end:example_nodes) WHERE start.targetId = end.sourceId CREATE (start)-[r:CONNECT]->(end) RETURN r
OPTIONAL MATCH (start:example_nodes) WITH start MATCH (end:example_nodes) WHERE start.targetId = end.sourceId CREATE (start)-[r:CONNECT]->(end) RETURN r
Any ideas how this query could be optimized to succeed would be much appreciated.
--
Edit
In a lot of ways I feel that while the apoc libary does indeed solve the memory issues, the function could be optimized if it were to run along the lines of this incredibly simple pseudocode:
for each start_gene
create relationship to end_gene where start_gene.targetId = end_gene.source_id
move on to next once relationship has been created
But I am unsure how to achieve this in cypher.
You can use apoc library for batching.
call apoc.periodic.commit("
MATCH (start:example_nodes),(end:example_nodes) WHERE not (start)-[:CONNECT]->(end) and id(start) > id(end) AND start.targetId =
end.sourceId
with start,end limit {limit}
CREATE (start)-[:CONNECT]->(end)
RETURN count(*)
",{limit:5000})
I want to look up the top 5 (shortest) path in my graph (Neo4j 3.0.4) from point A to point Z.
The graph consists several nodes that are connected by the relation "CONNECTED_BY". This connection has a cost property that should be minimized.
I started with this:
MATCH p=(from:Stop{stopId:'A'}), (to:Stop{stopUri:'Z'}),
path = allShortestPaths((from)-[:CONNECTED_TO*]->(to))
WITH REDUCE (total = 0, r in relationships(p) | total + r.cost) as tt, path
RETURN path, tt
This query returns always the subgraph with the least hops, the cost property is not considered. There exists another subgraph with more hops that has a lower total cost. What I am doing wrong?
Furthermore, I acutally want to get the TOP 5 subgraphs. If I execute this query:
MATCH p=(from:Stop{stopUri:'A'})-[r:CONNECTED_TO*10]->(to:Stop{stopUri:'Z'}) RETURN p
I can see several paths, but the first one just returns one path.
The path should not contain loops etc. of course.
I want to execute this query via REST API, so a REST Call or cyhper query should do it.
EDIT1:
I want to execute this as REST Call, so I tried the dijkstra algorithm. This seems to be a good way, but I have to calculate the weight by adding 3 different cost properties in the relation. How this could be achieved?
allShortestPaths will find the shortest path between two points and then match every path that has the same number of hops. If you want to minimize based on cost rather than traversal length, try something like this:
MATCH p=(from:Stop{stopId:'A'}), (to:Stop{stopUri:'Z'}),
path = (from)-[:CONNECTED_TO*]->(to)
WITH REDUCE (total = 0, r in relationships(p) | total + r.cost) as cost, path
ORDER BY cost
RETURN path LIMIT 5
I have a database in Neo4j of modules that I imported through CSV. The data looks something like this. Each module has its name, it's module that is the successor, average time duration and another duration called medtime.
I have been able to import the data and to set the relationships through a Cypher Query script that looks like this:
LOAD CSV WITH HEADERS FROM "file:c:/users/Skelo/Desktop/Neo4J related/Statistic Dependencies/Simple.csv" AS row FIELDTERMINATOR ';'
CREATE (n:Module)
SET n = row, n.name = row.name, n.mafter = row.mafter, n.avgtime = row.avgtime, n.medtime = row.medtime
WITH n
RETURN n
Then I have set the relationships like this:
Match (p:Module),(q:Module)
Where p.mafter = q.name
Merge (p)-[:PRECEEDS]->(q)
Return p,q
Now to the point. I want to calculate the shortest path from a certain module to another, more specifically the time that it takes to get from a module to another and for this, I use the more or less copied part of the script from
http://www.neo4j.org/graphgist?8412907 and that is
MATCH p = (trop:Module {name:'BLSACXAMT0A_00'})-[prec:PRECEEDS*]->(hop:Module {name:'BL_LOAD_CLOSE'})
WITH p, REDUCE(x = 0, a IN NODES(p) | x + a.avgtime) AS cum_duration
ORDER BY cum_duration DESC
LIMIT 1
RETURN cum_duration AS `Total Average Time`
This, however, takes about 50 second to execute and that is outrageous. You can see it on the screenshot right below. The ammount of modules imported into the database is only about 2000 and what I want to achieve, is to successfully work with more than 50 000 nodes and perform such tasks much faster.
Other issue is, that the results are somehow suspicious. The format looks wrong, every number I have in the database has max 4 digits after the decimal point and I am only adding these values to zero, therefore if the result looks like this: 00103,68330,51670, I have serious doubts. Please, help me, if it is wrong, why is it so, and what can I do to correct it.
Neo4j claims that it is efficient and fast, therefore I presume that the fault is in my code (the performance of my computer is more than enough). Please, If you can, help me to shorten this time and explain the patterns needed to perform this.
A few observations that should help:
You have several errors in how you are importing. These errors will create many more nodes than you think, and create the "suspicious" issue you raised:
Your file has multiple rows with the same name, but your import is creating a new Module node every time. Therefore, you are ending up with multiple nodes for some of your modules. You should be using MERGE instead of CREATE.
Your mafter property needs to contain a collection of strings, not a single string.
You are importing the numeric values as strings, so code such as x + a.avgtime is just doing string concatenation, not numeric addition. Furthermore, even if you did attempt to convert your strings to numbers, that would fail because your numbers use a comma instead of a period to indicate the decimal place.
Try this for importing (into an empty DB):
LOAD CSV WITH HEADERS FROM "file:c:/users/Skelo/Desktop/Neo4J related/Statistic Dependencies/Simple.csv" AS row FIELDTERMINATOR ';'
MERGE (n:Module {name: row.name})
ON CREATE SET
n.mafter = [row.mafter],
n.avgtime = TOFLOAT(REPLACE(row.avgtime, ',', '.')),
n.medtime = TOFLOAT(REPLACE(row.medtime, ',', '.'))
ON MATCH SET
n.mafter = n.mafter + row.mafter;
You also need to change your current merge query so that you can handle an mafter that is a collection. Note that the following query is designed to NOT create any new nodes (even if a name in mafter does not yet have a module node).
MATCH (p:Module)
OPTIONAL MATCH (p)-[:PRECEEDS]->(z:Module)
WITH p, COLLECT(z.name) AS existing
WITH p, filter(x IN p.mafter
WHERE NOT x IN existing) AS todo
MATCH (q:Module)
WHERE q.name IN todo
MERGE (p)-[:PRECEEDS]->(q)
RETURN p, q;
You should create an index to speed up the matching of modules by name:
CREATE INDEX ON :Module(name)
Cypher does have a shortestPath function, see http://neo4j.com/docs/stable/query-match.html#_shortest_path. However this calculates the shortest path based on the number of hops and does not take a weight into account.
Neo4j has couple of graph algorithms on board, e.g. Dijekstra or AStar. Unfortunately these are not yet available via cypher. Instead you have two alternatives to use them:
1) write an unmanaged extension to Neo4j and use GraphAlgoFactory in the implmentation. This requires to write same java code and deploy it to the Neo4j server. Using a custom CostEvaluator you can use the avgTime property on your nodes as cost parameter.
2) use the REST API as documented on http://neo4j.com/docs/stable/rest-api-graph-algos.html#rest-api-execute-a-dijkstra-algorithm-and-get-a-single-path. This approach requires to have the weight as a property on the relationship and not on a node (like in your data model)
I am creating simple graph db for tranportation between few cities. My structure is:
Station = physical station
Stop = each station has several stops, depend on time and line ID
Ride = connection between stops
I need to find route from city A to city C, but i has no direct stopconnection, but they are connected thru city B. see picture please, as new user i cant post images to question.
How can I get router from City A with STOP 1 connect RIDE 1 to STOP 2 then
STOP 2 connected by same City B to STOP3 and finnaly from STOP3 by RIDE2 to STOP4 (City C)?
Thank you.
UPDATE
Solution from Vince is ok, but I need set filter to STOP nodes for departure time, something like
MATCH p=shortestPath((a:City {name:'A'})-[*{departuretime>xxx}]-(c:City {name:'C'})) RETURN p
Is possible to do without iterations all matches collection? because its to slow.
If you are simply looking for a single route between two nodes, this Cypher query will return the shortest path between two City nodes, A and C.
MATCH p=shortestPath((a:City {name:'A'})-[*]-(c:City {name:'C'})) RETURN p
In general if you have a lot of potential paths in your graph, you should limit the search depth appropriately:
MATCH p=shortestPath((a:City {name:'A'})-[*..4]-(c:City {name:'C'})) RETURN p
If you want to return all possible paths you can omit the shortestPath clause:
MATCH p=(a:City {name:'A'})-[*]-(c:City) {name:'C'}) RETURN p
The same caveats apply. See the Neo4j documentation for full details
Update
After your subsequent comment.
I'm not sure what the exact purpose of the time property is here, but it seems as if you actually want to create the shortest weighted path between two nodes, based on some minimum time cost. This is different of course to shortestPath, because that minimises on the number of edges traversed only, not the cost of those edges.
You'd normally model the traversal cost on edges, rather than nodes, but your graph has time only on the STOP nodes (and not for example on the RIDE edges, or the CITY nodes). To make a shortest weighted path query work here, we'd need to also model time as a property on all nodes and edges. If you make this change, and set the value to 0 for all nodes / edges where it isn't relevant then the following Cypher query does what I think you need.
MATCH p=(a:City {name: 'A'})-[*]-(c:City {name:'C'})
RETURN p AS shortestPath,
reduce(time=0, n in nodes(p) | time + n.time) AS m,
reduce(time=0, r in relationships(p) | time + r.time) as n
ORDER BY m + n ASC
LIMIT 1
In your example graph this produces a least cost path between A and C:
(A)->(STOP1)-(STOP2)->(B)->(STOP5)->(STOP6)->(C)
with a minimum time cost of 230.
This path includes two stops you have designated "bad", though I don't really understand why they're bad, because their traversal costs are less than other stops that are not "bad".
Or, use Dijkstra
This simple Cypher will probably not be performant on densely connected graphs. If you find that performance is a problem, you should use the REST API and the path endpoint of your source node, and request a shortest weighted path to the target node using Dijkstra's algorithm. Details here
Ah ok, if the requirement is to find paths through the graph where the departure time at every stop is no earlier than the departure time of the previous stop, this should work:
MATCH p=(:City {name:'A'})-[*]-(:City {name:'C'})
MATCH (a:Stop) where a in nodes(p)
MATCH (b:Stop) where b in nodes(p)
WITH p, a, b order by b.time
WITH p as ps, collect(distinct a) as as, collect(distinct b) as bs
WHERE as = bs
WITH ps, last(as).time - head(as).time as elapsed
RETURN ps, elapsed ORDER BY elapsed ASC
This query works by matching every possible path, and then collecting all the stops on each matched path twice over. One of these collections of stops is ordered by departure time, while the other is not. Only if the two collections are equal (i.e. number and order) is the path admitted to the results. This step evicts invalid routes. Finally, the paths themselves are ordered by least elapsed time between the first and last stop, so the quickest route is first in the list.
Normal warnings about performance, etc. apply :)