Shortest path that has to include certain waypoints - neo4j

I’m trying to find the shortest path that connects an arbitrary collection of nodes. Both start and end can be any of the nodes in the collection, as long as they are not the same.
The standard Cypher functions shortestPath() and allShortestPaths() don't help, because they find the shortest path from start to end without requiring it to pass through the waypoints.
The following cypher works, but is there a faster way?
//some collection of nodeids, as waypoints the path has to include
match (n) where id(n) IN [24259,11,24647,28333,196]
with collect(n) as wps
// create possible start and end points
unwind wps as wpstart
unwind wps as wpend
with wps,wpstart,wpend where id(wpstart)<id(wpend)
// find paths that include all nodes in wps
match p=((wpstart)-[*..6]-(wpend))
where ALL(n IN wps WHERE n IN nodes(p))
// return paths, ordered by length
return id(wpstart),id(wpend),length(p) as lp,EXTRACT(n IN nodes(p) | id(n)) order by lp asc
Update 23-10-2015:
With the latest Neo4j version, 2.3.0, it is possible to combine shortestPath() with a WHERE clause that is pulled in somewhere during the evaluation process. You then get a construct like this, in which {wps} is a collection of node IDs.
// unwind the collection to create combinations of all start-end points
UNWIND {wps} AS wpstartid
UNWIND {wps} AS wpendid
WITH wpstartid,wpendid WHERE wpstartid<wpendid
// for each start-end combination, calculate shortestPath() with a WHERE clause
MATCH (wpstart) WHERE id(wpstart)=wpstartid
MATCH (wpend) WHERE id(wpend)=wpendid
MATCH p=shortestPath((wpstart)-[*..5]-(wpend))
WHERE ALL(id IN {wps} WHERE id IN EXTRACT(n IN nodes(p) | id(n)) )
//return the shortest of the shortestPath()s
WITH p, size(nodes(p)) as length order by length limit 1
RETURN EXTRACT(n IN nodes(p) | id(n))
This approach does not always work, because an internal optimization determines at which stage the WHERE clause is applied. So beware, and be prepared to fall back on the brute-force approach at the start of this question.

This is going to be a really unsatisfying answer, but here goes:
I strongly suspect the problem you're asking about is at least as hard as the Hamiltonian path problem: if every node in the graph were a waypoint, you would essentially be asking for a Hamiltonian path. That is a classic graph problem that turns out to be NP-complete. Practically speaking, that means that while it might be possible to implement this, the performance is likely to be horrific.
If you really must implement this, I'd recommend not using Cypher and instead building something with the Neo4j traversal framework. You can find sample code and algorithms online that will do at least a portion of this. But more broadly, if your data is larger than trivial in size, the unsatisfying part of this answer is that I probably wouldn't do it at all.
Better options might be to decompose your graph into smaller sub-problems which you can work independently, or coming up with another heuristic method that gets you close to what you want, but not via this method.
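One way to picture the "decompose into sub-problems" idea is the following minimal Python sketch, which runs outside Neo4j. It computes an unweighted shortest path between every pair of waypoints with breadth-first search, then tries every visiting order of the waypoints and keeps the shortest combined walk. Everything here is an assumption for illustration: the adjacency dict `graph`, the function names, and the toy data are hypothetical. The walk is allowed to revisit nodes and edges, which a single Cypher path does not allow, and the number of orderings still grows factorially with the number of waypoints, so this only helps when the waypoint list is short.

```python
from collections import deque
from itertools import permutations


def bfs_path(graph, start, goal):
    """Return one shortest path (list of nodes) from start to goal, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


def shortest_walk_through(graph, waypoints):
    """Shortest walk visiting every waypoint, trying all visiting orders."""
    # Pairwise shortest paths between waypoints (the independent sub-problems).
    legs = {
        (a, b): bfs_path(graph, a, b)
        for a in waypoints for b in waypoints if a != b
    }
    best = None
    for order in permutations(waypoints):
        walk = [order[0]]
        for a, b in zip(order, order[1:]):
            leg = legs[(a, b)]
            if leg is None:       # these two waypoints are not connected
                walk = None
                break
            walk.extend(leg[1:])  # skip the node shared with the previous leg
        if walk is not None and (best is None or len(walk) < len(best)):
            best = walk
    return best


# Hypothetical undirected toy graph, stored as an adjacency dict.
graph = {
    1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2, 6], 6: [5],
}
walk = shortest_walk_through(graph, [1, 4, 6])
```

In this toy graph the waypoints sit on three different branches, so the best walk has to backtrack through node 2. That is the kind of result a single relationship-unique path pattern cannot express.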

Related

How to efficiently remove nodes that do not connect two different nodes of a targeted type?

I have a Neo4j database with relationships like : (:Person)-[:KNOWS]-(:Target)
I would like to remove all Persons that are not connected to at least two different Targets, using a Cypher query.
I tried a query that, for each node, gets all connected nodes (with an arbitrary path length) and then counts the number of Targets among them. If there are fewer than two, I remove the node.
But the query takes extremely long and never completes:
MATCH (n:Person)
OPTIONAL MATCH (n)-[*]-(t:Target)
WITH n, COUNT(t) AS nb_targets
WHERE NOT n:Target AND nb_targets < 2
RETURN n
The query does not even finish because it is so inefficient...
NB : I have only a few Targets and a lot of Persons
Cypher is interested in returning all possible paths that match the pattern, so it won't do well with an unbounded variable-length query like this.
You can instead use the path expander procs from APOC Procedures, which are designed to be more efficient for these use cases. We can even limit the results per node to 2, since that's the minimum we would need to determine if a node needs to be kept or discarded.
If you needed a query just to return those that didn't have at least 2 targets, then this query should work:
MATCH (n:Person)
WHERE NOT n:Target
CALL apoc.path.subgraphNodes(n, {labelFilter:'>Target', limit:2, optional:true}) YIELD node
WITH n, count(node) AS nb_targets
WHERE nb_targets < 2
RETURN n
[UPDATED]
If this is a real-life scenario (say, identifying suspects who might be in cahoots with known bad guys), then you probably don't care about people who have a large "degree of separation" from the bad guys. Otherwise, you may end up with most of the population falling under suspicion. So, you will probably want to impose a reasonable upper bound on the depth of your search.
It just so happens that the time (and memory) required to search a variable length path goes up exponentially with the depth of the search, so to speed up the search you would want to impose a reasonable upper bound on the depth of the search as well.
Therefore, try using a reasonable upper bound (say, 6). Here is a query to identify people who are not themselves targets and who are connected (by up to a depth of 6) to fewer than 2 targets. To help speed up the search, I also limit the search to just using KNOWS relationships (assuming that is the only relationship type you care about). And note that I count DISTINCT targets, since multiple paths can contain the same target.
MATCH (n:Person) WHERE NOT n:Target
OPTIONAL MATCH (n)-[:KNOWS*..6]-(t:Target)
WITH n, COUNT(DISTINCT t) AS nb_targets
WHERE nb_targets < 2
RETURN n
To eliminate all the people who are not potential suspects (including the ones who are more than 6 hops from any target), you can first identify all the suspects and then delete the people who are neither targets nor suspects:
MATCH (n:Person) WHERE NOT n:Target
OPTIONAL MATCH (n)-[:KNOWS*..6]-(t:Target)
WITH n, COUNT(DISTINCT t) AS nb_targets
WHERE nb_targets >= 2
WITH COLLECT(n) AS suspects
MATCH (m:Person)
WHERE NOT m:Target AND NOT m IN suspects
DETACH DELETE m
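The idea shared by both answers (visit each node only once, bound the depth, and stop as soon as two distinct targets are found) can be sketched in plain Python. This is only an illustration under stated assumptions, not how APOC or Cypher work internally: the adjacency dict `knows`, the `targets` set, and the function name are hypothetical.

```python
from collections import deque


def count_targets_within(knows, targets, start, max_depth=6, stop_at=2):
    """Count distinct targets reachable from start within max_depth hops.

    Stops early once stop_at targets are found, since that is all we need
    to decide whether to keep the person.
    """
    seen = {start}
    found = set()
    queue = deque([(start, 0)])
    while queue and len(found) < stop_at:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth bound
        for neighbour in knows.get(node, ()):
            if neighbour in seen:
                continue
            seen.add(neighbour)
            if neighbour in targets:
                found.add(neighbour)
                if len(found) >= stop_at:
                    break
            queue.append((neighbour, depth + 1))
    return len(found)


# Hypothetical data: people a..e, targets T1 and T2, undirected KNOWS edges.
knows = {
    "a": ["T1", "b"], "b": ["a", "T2"], "c": ["T1"], "d": ["e"], "e": ["d"],
    "T1": ["a", "c"], "T2": ["b"],
}
targets = {"T1", "T2"}
people = ["a", "b", "c", "d", "e"]
to_remove = [p for p in people
             if count_targets_within(knows, targets, p) < 2]
```

Because every node is marked as seen the first time it is reached, the work is bounded by the size of the neighbourhood rather than by the number of distinct paths, which is what makes the unbounded path match so slow.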

Optimizing Cypher Query

I am just starting to work with Neo4j and its query language, Cypher.
I have multiple queries that follow the same pattern.
I am doing some comparisons between a SQL database and Neo4j.
In my Neo4j database I have one label (person) and one relationship type (FRIENDSHIP). Each person has the properties personID, name, email, and phone.
Now I want to find the friends of the n-th degree. I also want to filter out those persons who are also friends of a lower degree.
For example, if I search for friends of the 3rd degree, I want to filter out those who are also friends of the first and/or second degree.
Here my query type:
MATCH (me:person {personID:'1'})-[:FRIENDSHIP*3]-(friends:person)
WHERE NOT (me:person)-[:FRIENDSHIP]-(friends:person)
AND NOT (me:person)-[:FRIENDSHIP*2]-(friends:person)
RETURN COUNT(DISTINCT friends);
I found something similar somewhere.
This query works.
My problem is that this query pattern is much too slow when I search for a higher degree of friendship and/or when the number of persons grows.
So I would really appreciate it if someone could help me optimize this.
If you just wanted to handle depths of 3, this should return the distinct nodes that are 3 degrees away but not also less than 3 degrees away:
MATCH (me:person {personID:'1'})-[:FRIENDSHIP]-(f1:person)-[:FRIENDSHIP]-(f2:person)-[:FRIENDSHIP]-(f3:person)
RETURN apoc.coll.subtract(COLLECT(f3), COLLECT(f1) + COLLECT(f2) + me) AS result;
The above query uses the APOC function apoc.coll.subtract to remove the unwanted nodes from the result. The function also makes sure the collection contains distinct elements.
The following query is more general, and should work for any given depth (by just replacing the number after *). For example, this query will work with a depth of 4:
MATCH p=(me:person {personID:'1'})-[:FRIENDSHIP*4]-(:person)
WITH NODES(p)[0..-1] AS priors, LAST(NODES(p)) AS candidate
UNWIND priors AS prior
RETURN apoc.coll.subtract(COLLECT(DISTINCT candidate), COLLECT(DISTINCT prior)) AS result;
The problem with Cypher's variable-length relationship matching is that it's looking for all possible paths to that depth. This can cause unnecessary performance issues when all you're interested in are the nodes at certain depths and not the paths to them.
APOC's path expander using 'NODE_GLOBAL' uniqueness is a more efficient means of matching to nodes at inclusive depths.
When using 'NODE_GLOBAL' uniqueness, nodes are only ever visited once during traversal. Because of this, when we set the path expander's minLevel and maxLevel to be the same, the results are the nodes at that level that are not present at any lower level, which is exactly the result you're trying to get.
Try this query after installing APOC:
MATCH (me:person {personID:'1'})
CALL apoc.path.expandConfig(me, {uniqueness:'NODE_GLOBAL', minLevel:4, maxLevel:4}) YIELD path
// a single path for each node at depth 4 but not at any lower depth
RETURN COUNT(path)
Of course you'll want to parameterize your inputs (personID, level) when you get the chance.
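To see why visiting each node only once yields "friends at depth n but not at any lower depth", here is a small breadth-first sketch in plain Python. It mirrors the idea described above rather than APOC's actual implementation. The adjacency dict `friends` and the function name are hypothetical.

```python
def nodes_at_exact_depth(graph, start, depth):
    """Nodes whose shortest distance from start is exactly `depth`."""
    visited = {start}
    frontier = {start}
    for _ in range(depth):
        next_frontier = set()
        for node in frontier:
            for neighbour in graph.get(node, ()):
                # A node already reached at a lower level is never revisited.
                if neighbour not in visited:
                    visited.add(neighbour)
                    next_frontier.add(neighbour)
        frontier = next_frontier
    return frontier


# Hypothetical undirected friendship graph.
friends = {
    "me": ["a", "b"], "a": ["me", "b", "c"], "b": ["me", "a", "d"],
    "c": ["a", "e"], "d": ["b", "e"], "e": ["c", "d", "f"], "f": ["e"],
}
third_degree = nodes_at_exact_depth(friends, "me", 3)
```

Each level costs time proportional to the edges touched, whereas enumerating every path of length n and subtracting the shorter ones afterwards grows with the number of paths.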

Cypher query to find the hop depth length of particular relationships

I am trying to find the number of relationships that stem from a parent node, and I am not sure which syntax to use to get at this integer. I can be sure in my code that each child node has only one relationship of a particular type, so this allows me to capture a "true" depth reading.
My attempt is this but I am hoping there is a cleaner way:
MATCH p=(n {id:'123'})-[r:Foo*]->(c)
RETURN length(p)
I am not sure this is the correct syntax because it returns an array of integers with the last index being the true tally length. I am hoping for something that just returns an int instead of this mentioned array.
I am very grateful for help that you may be able to offer.
As Nicole says, in general, finding the longest path between two nodes in a graph is not feasible in any reasonable time. If your graph is very small, it is possible that you will be able to find all paths, and select the one with the most edges but this won't scale to larger graphs.
However there is a trick that you can do in certain circumstances. If your graph contains no directed cycles, you can assign each edge a weight of -1, and then look for the shortest weighted path between the source and target nodes. Since the edge weights are negative a shortest weighted path must correspond to a path with a maximum number of edges between the desired nodes.
Unfortunately, Cypher doesn't yet support shortest weighted path algorithms; however, the Neo4j database engine does. The docs give an example of how to do this. You will also need to implement your own algorithm, such as Bellman-Ford, using the traversal API, because Dijkstra won't work with negative edge weights.
However, please be aware that this trick won't work if your graph contains cycles - it must be a DAG.
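The -1 weight trick can be illustrated with a short Bellman-Ford sketch in plain Python. It deliberately uses an edge list rather than the Neo4j traversal API. The function name and the toy DAG are hypothetical. The final pass raises an error if a directed cycle is reachable from the source, because in that case the trick is not valid.

```python
def longest_path_length(edges, nodes, source, target):
    """Longest path (counted in edges) from source to target in a DAG.

    Gives every edge weight -1 and runs Bellman-Ford; the negated shortest
    distance is the maximum number of edges. Only valid for acyclic graphs.
    """
    INF = float("inf")
    dist = {node: INF for node in nodes}
    dist[source] = 0
    # Relax all edges up to |V| - 1 times, as in standard Bellman-Ford.
    for _ in range(len(nodes) - 1):
        changed = False
        for u, v in edges:
            if dist[u] != INF and dist[u] - 1 < dist[v]:
                dist[v] = dist[u] - 1
                changed = True
        if not changed:
            break
    # One more pass: any further improvement means a cycle is reachable.
    for u, v in edges:
        if dist[u] != INF and dist[u] - 1 < dist[v]:
            raise ValueError("graph contains a directed cycle")
    return None if dist[target] == INF else -dist[target]


# Hypothetical DAG given as directed (from, to) pairs.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("a", "d")]
longest = longest_path_length(edges, nodes, "a", "d")
```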
Your query:
MATCH p=(n {id:'123'})-[r:Foo*]->(c)
RETURN length(p)
is returning the length of ALL possible paths from n to c. You probably are only interested in the shortest path? You can use the shortestPath function to only consider the shortest path from n to c:
MATCH p = shortestPath((n {id:'123'})-[r:Foo*]->(c))
RETURN length(p)

Is it possible to reduce/optimize this query for node degrees?

Given the following Cypher query that returns afferent (inbound) and efferent (outbound) connections, and the sum as the node degree:
START n = node(*)
RETURN n.name, length((n)-->()) AS efferent,
length((n)<--()) AS afferent,
length((n)-->()) + length((n)<--()) AS degree
Is it possible to reduce the query so that the two length() functions are not repeated in the summation in the degree column?
You can resolve and bind the two length computations separately from and before returning by using WITH. Then you can sum those bound values while returning.
START n = node(*)
WITH n, length((n)-->()) AS efferent, length((n)<--()) AS afferent
RETURN n.name, efferent, afferent, efferent + afferent AS degree
You may want to use MATCH (n) instead of START n = node(*) if your Neo4j version is >2.0, but that's not what you're asking about so I'll assume you know what you are doing.
EDIT
In Neo4j 1.x START is how you began a query. From 2.x and on, while START is still around, MATCH is the preferred way. If you have Neo4j 2.x and don't know a particular reason why you should use START, then you should use MATCH. Here's a short explanation of why.
Your query is written to touch the entire graph. When that is the intention there is not a very big difference between START n = node(*) and MATCH (n). The execution plans do differ, but I'm not aware that the difference is very important.
If, however, you want to perform your computations only on part of the graph, and you add to your 'starting point pattern' to that effect, then there will be significant differences. If, for example, you want to perform your computation only on nodes with the :User label
START n = node(*)
WHERE n:User
will still pull up all nodes, and then apply a filter to discard those that don't have the label, whereas
MATCH (n)
WHERE n:User
will only pull up the nodes that have that label to begin with.
The general difference is this: WHERE is a dependent clause accompanying START, MATCH, OPTIONAL MATCH or WITH. When it accompanies START or WITH it does not work by modifying the operation but by filtering the results; when it accompanies MATCH and OPTIONAL MATCH it modifies (as often as it can) the operation and therefore doesn't have to filter the results. The difference is that between shouting "Everyone, if you are my child, don't go into the road" and "Kids, don't go into the road".
There are cases when WHERE is not pulled into the MATCH clause. One example is
MATCH (n)
WHERE n:Male OR n:Female
In this case all nodes are pulled up and then filtered, just as if we had used START instead of MATCH.
Sometimes it is easy to know which patterns in the WHERE clause are able to be pulled in to modify the MATCH. This is the case for patterns that you can move into the MATCH clause yourself, by simply rearranging the query. The first MATCH example above could also be expressed
MATCH (n:User)
There is no way, however, to do this for the WHERE clause in the second MATCH example, WHERE n:Male OR n:Female.
That a WHERE pattern cannot be moved into the MATCH clause by reformulating the query is not a reliable indicator that the query planner is unable to make use of it in the match operation. Because Cypher is a declarative language, you ultimately have to trust the query planner to implement the instructions wisely; trust, but verify.
One other difference between START and MATCH pertains to indexing. If you use 'legacy indexing' then you need to use START to access these indices. The 'new' label indices (about two years old, I believe) have been continuously improved for features and efficiency, and we are running out of reasons to use the old indices. I think the only reason left may be full-text indexing, for which a configured legacy Lucene index is still necessary. In time this feature will also be added to the label indices. Possibly, at that point, the START clause will be removed from Cypher altogether, but that is just the author's speculation.

Get all Routes between two nodes neo4j

I'm working on a project where I have to deal with graphs...
I'm using a graph to get routes by bus and bike between two stops.
The thing is, each of my relationships contains the time needed to travel from its start point to its end point.
To get the shortest path between two nodes, I'm using Cypher's shortestPath function. But sometimes the shortest path is not the fastest...
Is there a way to get all paths between two nodes not linked by a relationship?
Thanks
EDIT:
In fact, I changed my graph to make it easier.
So I still have all my nodes. Now the relationship type corresponds to the time needed to go from one node to another.
Cypher's shortestPath function returns the path that contains the fewest relationships. I would like it to return the path for which the sum of all the types (the times) is the smallest.
Is that possible?
Thanks
In Cypher, to get all paths between two nodes that are not directly linked by a relationship, and to sort them by total weight, you can use the reduce function introduced in 1.9:
start a=node(...), b=node(...) // get your start nodes
match p=a-[r*2..5]->b // match paths (best to provide maximum lengths to prevent queries from running away)
where not(a-->b) // where a is not directly connected to b
with p, relationships(p) as rcoll // just for readability, alias rcoll
return p, reduce(totalTime=0, x in rcoll: totalTime + x.time) as totalTime
order by totalTime
You can throw a limit 1 at the end, if you need only the shortest.
You can use the Dijkstra/Astar algorithm, which seems to be a perfect fit for you. Take a look at http://api.neo4j.org/1.8.1/org/neo4j/graphalgo/GraphAlgoFactory.html
Unfortunately you cannot use those from Cypher.
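To show what Dijkstra buys you over a fewest-hops search, here is a minimal plain-Python sketch. It is not the GraphAlgoFactory API, only an illustration of the algorithm. The adjacency dict `routes` (neighbour plus travel time in minutes) and the function name are hypothetical.

```python
import heapq


def fastest_route(graph, start, goal):
    """Dijkstra: return (total_time, path) for the quickest route, or None."""
    queue = [(0, start, [start])]
    settled = set()
    while queue:
        total, node, path = heapq.heappop(queue)
        if node == goal:
            return total, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, minutes in graph.get(node, ()):
            if neighbour not in settled:
                heapq.heappush(
                    queue, (total + minutes, neighbour, path + [neighbour])
                )
    return None


# Hypothetical stops: the direct A -> B leg is one hop but takes 10 minutes.
routes = {
    "A": [("B", 10), ("C", 2)],
    "C": [("D", 3)],
    "D": [("B", 2)],
    "B": [],
}
result = fastest_route(routes, "A", "B")
```

Here the route with the fewest relationships (A to B directly) is not the fastest one. This is the situation described in the question.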
