Shortest paths between two sets of nodes - neo4j

I created a graph in Neo4j with 10 million nodes and 30 million relationships.
Each node is labeled as A (4 million nodes), B (6 million nodes) or C (20 nodes).
Nodes in A lead to nodes in B. Nodes in B lead to other nodes in B, and to nodes in C.
For each node in A, I need to find the closest node (or nodes, if they are the same distance) in C, and add the ID of the C node as a value of a property in the A node.
Any help would be much appreciated.

So we're looking at a model like this (using :LEAD since you didn't specify a relationship type):
(:A)-[:LEAD]->(:B)
(:B)-[:LEAD]->(:B)
(:B)-[:LEAD]->(:C)
APOC Procedures offers the best solution for this one, but it's a two-parter since we first find the closest :C node using the path expander procedures, then rematch using that distance to get the full collection of :C nodes reachable at that distance.
You'll also want to make use of apoc.periodic.iterate() so you can batch this, though you may want to play around with the batchSize.
I'm making some assumptions in this query since you didn't provide much in the way of properties to use in the graph.
CALL apoc.periodic.iterate("MATCH (a:A) RETURN a",
"CALL apoc.path.spanningTree(a, {relationshipFilter:'LEAD>', labelFilter:'/C', limit:1}) YIELD path
WITH a, length(path) as length
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'LEAD>', labelFilter:'/C', maxLevel:length}) YIELD node
WITH a, collect(node.id) as ids
SET a.cIDs = ids",
{batchSize:1000}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
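For readers who want to check the logic outside Neo4j, here is a minimal Python sketch of what the two-step query computes for a single :A node: a breadth-first search that stops at the first level containing any :C node and returns every :C node on that level. The toy graph, the node names and the function name nearest_targets are made up for illustration; this is not how APOC implements it.

```python
def nearest_targets(graph, start, targets):
    """Return (distance, sorted ids) of the target nodes closest to start.

    graph maps a node id to the ids it points to (directed edges).
    Returns (None, []) when no target is reachable.
    """
    seen = {start}
    level = [start]
    distance = 0
    while level:
        # All targets on the current level share the same minimal distance.
        found = sorted(node for node in level if node in targets)
        if found:
            return distance, found
        next_level = []
        for node in level:
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_level.append(neighbour)
        level = next_level
        distance += 1
    return None, []


# Hypothetical toy data: a1 is an :A node, b* are :B nodes, c* are :C nodes.
toy_graph = {
    "a1": ["b1", "b2"],
    "b1": ["c1"],
    "b2": ["b3", "c2"],
    "b3": ["c3"],
}
print(nearest_targets(toy_graph, "a1", {"c1", "c2", "c3"}))
```

Here c1 and c2 are both two hops away, so both are returned, while c3 (three hops) is not.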

Related

Neo4j Cypher complex query optimization

Now I have a graph with millions of nodes and millions of edge relationships. There is a directed relationship between nodes.
Now suppose the node has two states A and B. I want to find all state A nodes on the path that do not have state B.
As shown in the figure below, there are nodes A--K, and then three of them, E, G and J, are of type B, and the others are of type A.
picture link is https://i.stack.imgur.com/a0yOV.jpg
For node E, its upstream and downstream traversal is shown below, so nodes B, H, K do not meet the requirements.
For node G, its upstream and downstream traversal is shown below, so nodes B, D, K do not meet the requirements.
For node J, its upstream and downstream traversal is shown below, so nodes A, B, C, D, F do not meet the requirements.
So finally only node "I" is the node that meets the requirements.
picture link is https://i.stack.imgur.com/A2eqv.jpg
The example above is a DAG, but the actual graph may contain cycles, including spin cycles (case 1), AB cycles (case 2), large loops (case 3), and complex cycles (case 4).
picture link is https://i.stack.imgur.com/NDpED.jpg
This is the Cypher query I came up with:
MATCH (n:A)
WHERE NOT exists((n)-[*]->(:B))
AND NOT exists((n)<-[*]-(:B))
RETURN n;
However, on a graph with millions of nodes and millions of edges this query gets stuck even with a LIMIT of 35, although in the end more than 30,000 nodes should meet the requirements.
Obviously my query uses too much memory: returning just over 30 nodes took up almost all the available memory. How can I write a more efficient query?
Here is an example:
CREATE (a:A{id:'a'})
CREATE (b:A{id:'b'})
CREATE (c:A{id:'c'})
CREATE (d:A{id:'d'})
CREATE (e:B{id:'e'})
CREATE (f:A{id:'f'})
CREATE (g:B{id:'g'})
CREATE (h:A{id:'h'})
CREATE (i:A{id:'i'})
CREATE (j:B{id:'j'})
CREATE (k:A{id:'k'})
MERGE (a)-[:REF]->(c)
MERGE (b)-[:REF]->(c)
MERGE (b)-[:REF]->(d)
MERGE (b)-[:REF]->(e)
MERGE (c)-[:REF]->(f)
MERGE (d)-[:REF]->(g)
MERGE (e)-[:REF]->(g)
MERGE (e)-[:REF]->(h)
MERGE (f)-[:REF]->(i)
MERGE (f)-[:REF]->(j)
MERGE (f)-[:REF]->(k)
MERGE (g)-[:REF]->(k)
MERGE (g)-[:REF]->(j)
Running this query returns node 'i':
MATCH (n:A)
WHERE NOT exists((n)-[*]->(:B))
AND NOT exists((n)<-[*]-(:B))
RETURN n;
But when the graph has 800,000 nodes (400,000 of type A, 400,000 of type B) and over 1.4 million edges, this query never returns a result.
Some thoughts:
I don’t think this global graph search can be solved with a single query. You will need some kind of process to optimise exploration and use the result up to a certain point in subsequent steps.
If you can assign node labels instead of properties to reflect the state of a node, you could use apoc.path.expandConfig to explore paths only until you hit a node with state B.
You don't need to re-investigate state A nodes that you traversed before hitting a node with state B, because they will not meet the requirements.
Another approach could be this, given the fact that all nodes that are on the up or downstream paths from a B node, will not fulfil the requirements. Still assuming that you use labels to distinguish A and B nodes.
MATCH (b:B)
// "<" follows incoming relationships, so this collects the nodes upstream of b.
// "-B" stops the expansion at other :B nodes; they are covered by their own row.
CALL apoc.path.subgraphNodes(b,
{relationshipFilter: "<",
labelFilter: "-B"
}
) YIELD node AS upStreamNode
WITH b, COLLECT(DISTINCT upStreamNode) AS upStreamNodes
// ">" follows outgoing relationships, so this collects the nodes downstream of b.
CALL apoc.path.subgraphNodes(b,
{relationshipFilter: ">",
labelFilter: "-B"}
) YIELD node AS downStreamNode
WITH b, upStreamNodes + COLLECT(DISTINCT downStreamNode) AS upAndDownStreamNodes
RETURN apoc.coll.toSet(apoc.coll.flatten(COLLECT(upAndDownStreamNodes))) AS allNodesThatDoNotFulfillRequirements
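To sanity-check the idea on the example data from the question, here is a small Python sketch (not Cypher, and not tied to APOC) that starts from all B nodes at once, walks downstream and upstream with a visited set, and treats whatever was never reached as the answer. The visited set is what keeps it safe on cyclic graphs, since each node is expanded at most once per direction. The function names are made up for illustration.

```python
from collections import deque

# Edges and B nodes copied from the example in the question.
EDGES = [
    ("a", "c"), ("b", "c"), ("b", "d"), ("b", "e"), ("c", "f"),
    ("d", "g"), ("e", "g"), ("e", "h"), ("f", "i"), ("f", "j"),
    ("f", "k"), ("g", "k"), ("g", "j"),
]
B_NODES = {"e", "g", "j"}


def reachable(sources, adjacency):
    """Multi-source BFS: every node reachable from any source (sources included)."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:  # also terminates on cycles
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def nodes_without_b_on_any_path(edges, b_nodes):
    """Nodes that are neither upstream nor downstream of any B node."""
    downstream, upstream, nodes = {}, {}, set()
    for src, dst in edges:
        downstream.setdefault(src, []).append(dst)
        upstream.setdefault(dst, []).append(src)
        nodes.update((src, dst))
    tainted = reachable(b_nodes, downstream) | reachable(b_nodes, upstream)
    return nodes - tainted


print(nodes_without_b_on_any_path(EDGES, B_NODES))
```

On the example this leaves only node i, which matches the expected result in the question.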

Neo4j: Customize Path Traversal

I am pretty new to Neo4j. I have implemented an example use case with the following setup:
acyclic directed graph
nodes have a property called externalID
Nodes:
Node Type S (Start Node)
Node Type E (End Node)
Node Type I (Intermediate Node)
Relations:
Node Type S can only have outgoing relations to Nodes of Type I
Node Type I can have incoming relations from I and S
Node Type I can have outgoing relations to I and E
Node Type E can only have incoming relations from I
All relations have a weight property assigned which can be any number
With the help of Stack Overflow and several tutorials I was able to formulate a Cypher query that returns all paths from any start node with a given externalID to the end node with the same externalID.
MATCH p=(a:S)-[r*]->(b:E)
WHERE a.externalID=b.externalID
WITH p, relationships(p) as rcoll
RETURN p
The query works reasonably well so far ...
However, I have no idea how to change the behavior on how the graph is scanned for possible paths. Actually I only need a subset of all possible paths. Such paths fulfill the following requirement:
The path traversal is started at a Start Node S with a given capacity C.
if a relationship is traversed the weight property of this relationship is subtracted from the current capacity C (that means negative weights are added)
if the capacity gets negative the path up to this point is invalid (the path up to the previous node is still valid and may continue with other relationships)
if the capacity is still positive continue with another relationship from this point and use the result of C - weight as new C
Can I somehow adjust the query or is there any other possibility with Neo4j to get all paths using the strategy above?
Thanks a lot for your help in advance.
This Cypher query might be suitable for your use case:
MATCH p = (a:S)-[r*]->(b:E)
WHERE a.externalID = b.externalID
WITH
p,
REDUCE(c = a.capacity, r IN RELATIONSHIPS(p) |
CASE WHEN c < 0 THEN -1 ELSE c - r.weight END) AS residual
WHERE residual >= 0
RETURN p;
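The REDUCE function will set residual to a negative value if the capacity is ever reduced below 0, even if subsequent weights would normally cause it to go positive again. Note that the query assumes the starting capacity is stored in a capacity property on the start node.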
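If the "sticky" negative value is hard to picture, the same fold can be written in a few lines of Python. This is only an illustration of the accumulator logic with made-up numbers; the function name residual simply mirrors the alias used in the query.

```python
from functools import reduce


def residual(capacity, weights):
    """Mirror of the REDUCE expression: once the running capacity drops
    below zero it is pinned at -1, so later negative weights cannot rescue it."""
    def step(current, weight):
        return -1 if current < 0 else current - weight
    return reduce(step, weights, capacity)


print(residual(10, [4, 3, 2]))   # capacity never goes negative
print(residual(5, [4, 3, -10]))  # dips below zero after the second hop
print(residual(5, [2, -3, 6]))   # negative weight adds capacity back
```

A path is kept only when the returned value is greater than or equal to 0, which is what the WHERE residual >= 0 filter does.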

Find all relations starting with a given node

In a graph where the following nodes
A,B,C,D
each have a relationship with their successor
(A->B)
and
(B->C)
etc.
How do I write a query that starts at A and returns all nodes (and relationships) reachable from it, going outwards?
I do not know the end node (C).
All I know is that I want to start from A and traverse the whole connected graph (with conditions on relationship and node type).
I think, you need to use this pattern:
(n)-[*]->(m) - variable length path of any number of relationships from n to m. (see Refcard)
A sample query would be:
MATCH path = (a:A)-[*]->()
RETURN path
Also have a look at the path functions in the refcard to extend your Cypher query (I don't know what exact conditions you'll need to apply).
To get all the nodes / relationships starting at a node:
MATCH (a:A {id: "id"})-[r*]-(b)
RETURN a, r, b
This will return every path that starts at the node with label A and id = "id". Because the pattern has no arrow, relationships are followed in both directions.
One caveat - if this graph is large the query will take a long time to run.

Neo4J find route thru more points

I am creating a simple graph DB for transportation between a few cities. My structure is:
Station = physical station
Stop = each station has several stops, depend on time and line ID
Ride = connection between stops
I need to find a route from city A to city C. There is no direct stop connection between them, but they are connected through city B. Please see the picture; as a new user I can't post images in the question.
How can I get a route from City A via STOP 1, connected by RIDE 1 to STOP 2, then from STOP 2 through the shared City B to STOP 3, and finally from STOP 3 by RIDE 2 to STOP 4 (City C)?
Thank you.
UPDATE
The solution from Vince is OK, but I need to filter the STOP nodes by departure time, something like
MATCH p=shortestPath((a:City {name:'A'})-[*{departuretime>xxx}]-(c:City {name:'C'})) RETURN p
Is it possible to do this without iterating over the whole collection of matches? That is too slow.
If you are simply looking for a single route between two nodes, this Cypher query will return the shortest path between two City nodes, A and C.
MATCH p=shortestPath((a:City {name:'A'})-[*]-(c:City {name:'C'})) RETURN p
In general if you have a lot of potential paths in your graph, you should limit the search depth appropriately:
MATCH p=shortestPath((a:City {name:'A'})-[*..4]-(c:City {name:'C'})) RETURN p
If you want to return all possible paths you can omit the shortestPath clause:
MATCH p=(a:City {name:'A'})-[*]-(c:City {name:'C'}) RETURN p
The same caveats apply. See the Neo4j documentation for full details.
Update
After your subsequent comment.
I'm not sure what the exact purpose of the time property is here, but it seems as if you actually want to create the shortest weighted path between two nodes, based on some minimum time cost. This is different of course to shortestPath, because that minimises on the number of edges traversed only, not the cost of those edges.
You'd normally model the traversal cost on edges, rather than nodes, but your graph has time only on the STOP nodes (and not for example on the RIDE edges, or the CITY nodes). To make a shortest weighted path query work here, we'd need to also model time as a property on all nodes and edges. If you make this change, and set the value to 0 for all nodes / edges where it isn't relevant then the following Cypher query does what I think you need.
MATCH p=(a:City {name: 'A'})-[*]-(c:City {name:'C'})
RETURN p AS shortestPath,
reduce(time=0, n in nodes(p) | time + n.time) AS m,
reduce(time=0, r in relationships(p) | time + r.time) as n
ORDER BY m + n ASC
LIMIT 1
In your example graph this produces a least cost path between A and C:
(A)->(STOP1)-(STOP2)->(B)->(STOP5)->(STOP6)->(C)
with a minimum time cost of 230.
This path includes two stops you have designated "bad", though I don't really understand why they're bad, because their traversal costs are less than other stops that are not "bad".
Or, use Dijkstra
This simple Cypher will probably not be performant on densely connected graphs. If you find that performance is a problem, you should use the REST API and the path endpoint of your source node, and request a shortest weighted path to the target node using Dijkstra's algorithm. Details here
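For reference, this is roughly what a shortest weighted path search does, written as a small Python sketch of Dijkstra's algorithm with a priority queue. The graph, the node names and the costs below are invented for illustration and are not the values from your example graph; costs sit on the edges and are assumed to be non-negative.

```python
import heapq


def dijkstra(graph, source, target):
    """Return (total_cost, path) for the cheapest route from source to target,
    or (None, []) if target is unreachable.

    graph maps node -> {neighbour: non-negative edge cost}.
    """
    queue = [(0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue  # a cheaper route to this node was already processed
        settled.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in settled:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return None, []


# Hypothetical transport graph: two ways from A to B, then one ride to C.
toy_graph = {
    "A": {"STOP1": 10, "STOP2": 50},
    "STOP1": {"B": 20},
    "STOP2": {"B": 5},
    "B": {"C": 30},
}
print(dijkstra(toy_graph, "A", "C"))
```

Unlike the ORDER BY ... LIMIT 1 query, this never enumerates every path; it always extends the cheapest partial route found so far, which is why it copes better with densely connected graphs.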
Ah ok, if the requirement is to find paths through the graph where the departure time at every stop is no earlier than the departure time of the previous stop, this should work:
MATCH p=(:City {name:'A'})-[*]-(:City {name:'C'})
MATCH (a:Stop) where a in nodes(p)
MATCH (b:Stop) where b in nodes(p)
WITH p, a, b order by b.time
WITH p as ps, collect(distinct a) as as, collect(distinct b) as bs
WHERE as = bs
WITH ps, last(as).time - head(as).time as elapsed
RETURN ps, elapsed ORDER BY elapsed ASC
This query works by matching every possible path, and then collecting all the stops on each matched path twice over. One of these collections of stops is ordered by departure time, while the other is not. Only if the two collections are equal (i.e. number and order) is the path admitted to the results. This step evicts invalid routes. Finally, the paths themselves are ordered by least elapsed time between the first and last stop, so the quickest route is first in the list.
Normal warnings about performance, etc. apply :)
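The filtering idea in that query (a route is valid only if its stops are already in departure-time order) can be sketched in plain Python as follows. The route names and times are invented, and each route is assumed to be given as the list of departure times of its stops in travel order.

```python
def rank_routes(routes):
    """Keep routes whose departure times never decrease along the path,
    then sort them by elapsed time between the first and last stop."""
    ranked = []
    for name, times in routes.items():
        # Comparing against the sorted copy is the same check as
        # comparing the ordered and unordered collections in the query.
        if times and times == sorted(times):
            ranked.append((times[-1] - times[0], name))
    return sorted(ranked)


# Hypothetical routes: departure times of the stops in travel order.
routes = {
    "via_B_early": [100, 130, 160, 200],
    "via_B_late": [100, 180, 150, 210],  # second ride leaves before arrival
    "direct_slow": [90, 140, 260],
}
print(rank_routes(routes))
```

The route via_B_late is dropped because its times go backwards, and the remaining routes come out quickest first.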

Find last node in unknown amount of relationships

I can find the last node like this:
MATCH p=(a)-->(b)-->(c)
WHERE a.name='Object' AND c:Prime
RETURN c
But how would I find the last node if I don't know how many relationships -->()-->() lie between the two nodes?
I am trying to find the last node's name by its label. The last node doesn't have any outgoing relationships.
This will find c at the end of an arbitrarily long path, where c has no outgoing relationships.
MATCH p=(a)-[*]->(c:Prime)
WHERE a.name='Object'
AND NOT (c)-->()
RETURN c
It is generally advisable to use relationship types (if possible / practical) in your query and to put an upper bound on the number of hops your match can make. The example below follows only relationships of type CONNECTION, in one direction, for at most 5 hops.
MATCH p=(a)-[:CONNECTION*..5]->(c:Prime)
WHERE a.name='Object'
AND NOT (c)-->()
RETURN c

Resources