I am trying to develop a routing system based on data in a Neo4j 3.0.4 database. The graph contains multiple stops. Some of these stops are scheduled, like bus stops or train stops, but not all of them. The scheduled stops are connected to a schedule node, and each schedule node is connected to an offer.
A subgraph looks like this:
My question is: how can I create a query that returns this subgraph? So far I have written this Cypher query:
MATCH (from:Stop{poiId:'A'}), (to:Stop{poiId:'Z'}) ,
path = allShortestPaths((from)-[r*]->(to))
RETURN path
This returns all shortest paths from stop A to stop Z. Between A and Z there are two more stops, which are included in the returned path. For all stops I also want to get the related schedules, and for those schedules the related offers.
Furthermore, it would be great if it were possible to use constraints based on the schedule node, e.g. allShortestPaths from A to Z where filter(time in schedule.monday WHERE x > 1100).
If that is not possible, is it possible to create a new query with this constraint based on the previous query?
EDIT1: Further information:
The schedules contain departure times for each stop. Based on a desired departure time (or, alternatively, a desired arrival time), I want to calculate the full travel time and return the 5 best connections (shortest travel time).
E.g. I want to start at 7:00: the switch relation has a time cost of 2 minutes, so check schedule 1 for a departure after 7:02. If there is one, take the first departure after 7:02. The connected_by relation has a time cost of 12 minutes, and the last switch_to relation has no time cost, so I arrive at 07:14. Note: if I have to switch the service line while travelling, I have to check the schedule again. If a schedule does not fit the desired time window, exclude it from the result. I want to get the 5 best paths (based on travel time or arrival time); the number of hops is not important. If there is a connection with e.g. 6 stops but less travel time (or an earlier arrival time), prefer that one. I know this is a difficult and big problem, but I have no idea how to start... If there is a way to do this via REST (or, if not, in Java) I would be glad for any hint!
You can use the UNWIND construct in Cypher to get the nodes of a path and use OPTIONAL MATCH to look for schedules & offers.
I created a sample dataset:
CREATE
(offer: Offer),
(sch1: Schedule),
(sch2: Schedule),
(stop1: Stop {name: "stop1"}),
(stop2: Stop {name: "stop2"}),
(stop3: Stop {name: "stop3"}),
(stop4: Stop {name: "stop4"}),
(stop1)-[:SWITCH_TO]->(stop2),
(stop2)-[:CONNECTED_BY]->(stop3),
(stop3)-[:SWITCH_TO]->(stop4),
(stop2)-[:SCHEDULED_BY]->(sch1),
(stop3)-[:SCHEDULED_BY]->(sch2),
(sch1)-[:OFFERED_BY]->(offer),
(sch2)-[:OFFERED_BY]->(offer)
To get the subgraph, you can issue this query:
MATCH
(from:Stop {name:'stop1'}), (to:Stop {name:'stop4'}),
path = allShortestPaths((from)-[r*]->(to))
UNWIND nodes(path) AS stopNode
OPTIONAL MATCH (stopNode)-[sb:SCHEDULED_BY]->(schedule:Schedule)-[ob:OFFERED_BY]-(offer:Offer)
RETURN stopNode, sb, ob, schedule, offer
With this approach, the relationships in r are not part of the returned data, so strictly speaking it does not return the whole subgraph; Neo4j's browser visualization adds those relationships back in, which is why the picture still looks complete.
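If you also want those traversal relationships in the returned data itself, a minimal tweak (just a sketch of the query above) is to return the path as well, since the path value carries its relationships:

MATCH
  (from:Stop {name:'stop1'}), (to:Stop {name:'stop4'}),
  path = allShortestPaths((from)-[r*]->(to))
UNWIND nodes(path) AS stopNode
OPTIONAL MATCH (stopNode)-[sb:SCHEDULED_BY]->(schedule:Schedule)-[ob:OFFERED_BY]-(offer:Offer)
RETURN path, stopNode, sb, ob, schedule, offer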
Anyways, I hope the post contains useful information - let me know how it works for you.
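As for the time-window constraint mentioned in the question, here is a rough sketch of one way to filter whole paths by their schedules. It assumes, as in the question's example, that schedule.monday is a list of numeric departure times and that 1100 is the cut-off; it has not been tuned for performance:

MATCH
  (from:Stop {name:'stop1'}), (to:Stop {name:'stop4'}),
  path = allShortestPaths((from)-[r*]->(to))
UNWIND nodes(path) AS stopNode
OPTIONAL MATCH (stopNode)-[:SCHEDULED_BY]->(schedule:Schedule)
WITH path, collect(schedule) AS schedules
WHERE ALL(s IN schedules WHERE ANY(t IN s.monday WHERE t > 1100))
RETURN path

Stops without a schedule are simply skipped by the OPTIONAL MATCH, so only the scheduled stops have to satisfy the condition.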
In a 14 GB database I have a few CITES relationships:
MATCH p=()-[r:CITES]->() RETURN count(r)
91
However, when I run
MATCH ()-[r:CITES]-() RETURN count(r)
it loads forever and eventually crashes with a browser window reload (Neo4j Desktop).
You can see the differences in how each of those queries will execute if you prefix each query with EXPLAIN.
The pattern used for the first query is such that the planner will find that count in the counts store, a transactionally updated store of counts of various things. This is a fast constant time lookup.
The other pattern, when omitting the direction, will not use the count store lookup and will actually have to traverse the graph (starting from every node in the graph), and that will take a long time as your graph grows.
As for what this gives back, it should actually be twice the number of :CITES relationships in your graph: without a direction on the relationship, each individual relationship is found twice, because the same path with its start and end nodes swapped also fits the given pattern.
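If all you need is the undirected total, you can keep the cheap count-store lookup by counting the directed pattern and doubling it; a small sketch (verify with EXPLAIN that it still avoids a full traversal on your version):

MATCH ()-[r:CITES]->()
RETURN 2 * count(r) AS undirectedCount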
Neo4j always chooses nodes as start points for query execution. In your query, the query engine is probably touching the whole graph, since you are not adding restrictions on node properties, labels, etc.
I think you should at least specify a label on the first node in the pattern.
MATCH (:Article)-[r:CITES]-() RETURN count(r)
I have built a database to store data on moving vehicles that have been divided into stop and trip events, so two node types are Stop and Trip. Stop nodes have attributes including the duration of the stop and its location, and each is connected to that vehicle's next stop by -[:NEXT_STOP]->. Naturally there is only one relationship of this type per stop node. A stop by definition is 1 hour long, and a trip by definition is what lies between stops.
I am trying to build a query that can aggregate stops (and subsequently the intermediate trips) into larger units. That is, a user may wish to define a "tour" that incorporates several trips and stops, beginning and ending at specified locations and bookended by stops of longer duration, say 8 hours.
I can describe this quite easily verbally. Say the user wants all tours (where a tour is bookended by stops of 8 hours) between Sydney and Melbourne. The process would be: find all Stop nodes where the location is Sydney and the duration is over 8 hours. Then, for each of these, check the NEXT_STOP Stop node. If that stop is less than 8 hours in duration, check its next node in turn, until we reach a node with a duration of over 8 hours. Then filter these terminal nodes to keep only those in Melbourne. Ideally, there would be edge conditions defined to stop endless loops, for instance stopping once we reach stops that are a certain period removed from the start.
I have been unable to implement this in Cypher, however. The recursive aspects are straightforward, but I cannot work out how to stipulate the conditions for intermediate nodes (i.e., duration less than 8 hours) or make the matched path non-greedy. For instance, if I start with something like
WITH 4*3600 AS maxdur
MATCH (s1:Stop {location:'Sydney'})-[:NEXT_STOP*1..]->(s2:Stop)
WHERE s1.duration > maxdur AND s2.duration > maxdur
It matches overly long paths, for instance several tours joined together, because it does not check that the intermediate stops are sufficiently short.
The shortestPath algorithms are no help here, because they match the shortest path across all starting nodes subject to the conditions, rather than the shortest path for each individual starting node.
I have a way of doing this off-server, especially if I predetermine the desired tour parameters, but this seems like a problem naturally suited to graph databases.
This snippet may work for you:
WITH 4*3600 AS maxdur
MATCH p = (s1:Stop {location:'Sydney'})-[:NEXT_STOP*]->(s2:Stop)
WHERE
  s1.duration > maxdur AND s2.duration > maxdur AND
  NONE(n IN NODES(p)[1..-1] WHERE n.duration > maxdur)
RETURN p
NODES(p)[1..-1] extracts the interior nodes of each candidate path, and the NONE() predicate filters out any path in which an interior node has a duration exceeding maxdur.
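Building on that, here is a hedged variant that also pins the terminal stop to Melbourne (as in the question) and puts an upper bound on the number of hops so the variable-length expansion cannot run away; the bound of 20 is an arbitrary placeholder:

WITH 4*3600 AS maxdur
MATCH p = (s1:Stop {location:'Sydney'})-[:NEXT_STOP*..20]->(s2:Stop {location:'Melbourne'})
WHERE s1.duration > maxdur AND s2.duration > maxdur
  AND NONE(n IN NODES(p)[1..-1] WHERE n.duration > maxdur)
RETURN p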
To keep things simple, as part of the ETL on my time-series data, I added a sequence number property to each row, corresponding to 0..370365 (370,366 nodes, 5,555,490 properties, not that big). I later added a second sequence property, naming the original one "outeseq" and the second one "ineseq", to see if an outright equality to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000), but past that it's just an endless wait. My JVM has 16 GB max (if it can even use that on a Windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can also get bits of the relationships built with parameters, but haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way to do this.
I profiled the query but didn't see any reason for it to blow up.
Another question: I would like each relationship to have a property representing the difference between the timestamps of its two nodes, i.e. delta-t. Is there a way to take the difference between the values on two sequential nodes and assign it to the relationship, for all of the relationships at the same time?
The last question, if you have the time: I'd really like to use the raw data and just chain directed relationships from each node's timestamp to the nearest next node with the minimum delta, but I didn't run straight at this for fear that it would cause a scan of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other DBs for time series, let me say I have a very specific reason for wanting to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest running them page by page over all the nodes. That seems feasible because outeseq and ineseq are unique and numeric, so you can sort the nodes by those properties and run the query against one slice at a time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
You will need to run the query about 13 times, changing {offset} each time, to cover all the data. It would be convenient to script this in any language that has a Neo4j client.
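If the APOC procedure library happens to be installed (an assumption; it is not mentioned in the question), the batching can also be done server-side with apoc.periodic.iterate; a rough sketch, with an arbitrary batch size:

// Assumes the APOC plugin is available; 10000 is a placeholder batch size
CALL apoc.periodic.iterate(
  'MATCH (a:BOOK) RETURN a',
  'MATCH (b:BOOK {ineseq: a.outeseq}) MERGE (a)-[:FORWARD_SEQ]->(b)',
  {batchSize: 10000});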
Updating Relationship's Properties
You can assign the timestamp delta to relationships using a SET clause following the MATCH. Assuming the timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
Chaining Nodes With Minimal Delta
Once the relationships carry the delta property, the graph becomes a weighted graph, so we can apply this approach to calculate the shortest path using the deltas as weights. Then we just save the length of the shortest path (the sum of the deltas) on a relationship between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight = 0, r IN relationships(p) | weight + r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
Disclaimer: the queries above are not guaranteed to work as-is; they just hint at possible approaches to the problem.
The answer to this question shows how to get a list of all nodes connected to a particular node via a path of known relationship types.
As a follow up to that question, I'm trying to determine if traversing the graph like this is the most efficient way to get all nodes connected to a particular node via any path.
My scenario: I have a tree of groups (group can have any number of children). This I model with IS_PARENT_OF relationships. Groups can also relate to any other groups via a special relationship called role playing. This I model with PLAYS_ROLE_IN relationships.
The most common question I want to ask is MATCH (n {name: "xxx"})-[*]->(o) RETURN o.name, but this seems to be extremely slow even on a small number of nodes (4,000 nodes; it takes 5 s to return an answer). Note that the graph may contain cycles (n-IS_PARENT_OF->o, n<-PLAYS_ROLE_IN-o).
Is connectedness via any path not something that can be indexed?
As a first point: by not using a label and an indexed property for your starting node, the query already has to visit ALL the nodes in the graph and open each node's PropertyContainer to see whether it has the property name with the value "xxx".
Secondly, if you know an approximate maximum depth of parentship, you may want to limit the depth of the search, as in the sketch below.
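Putting both suggestions together, a sketch might look like this (assuming a :Group label, and an arbitrary maximum depth of 10):

MATCH (n:Group {name: 'xxx'})-[:IS_PARENT_OF|:PLAYS_ROLE_IN*..10]->(o)
RETURN DISTINCT o.name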
I would suggest you add a label of your choice to your nodes and index the name property.
Use a label, e.g. :Group, for your starting point and an index on :Group(name).
Then Neo4j can quickly find your starting point without scanning the whole graph.
You can easily see where the time is spent by prefixing your query with PROFILE.
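For example (a sketch), the index suggested above can be created once with:

CREATE INDEX ON :Group(name)

Prefixing the traversal query with PROFILE should then show an index seek on :Group(name) instead of a scan over all nodes.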
Do you really want all arbitrarily long paths from the starting point? Or just all pairs of connected nodes?
If the latter, then this query would be more efficient:
MATCH (n:Group)-[:IS_PARENT_OF|:PLAYS_ROLE_IN]->(m:Group)
RETURN n,m
I'm new to Neo4j and I'm having fun with some data about our solar system in a game (called Elite Dangerous). As a trader, you want to find the most profitable route based on certain criteria. One of them is the number of jumps needed between one system and another. To calculate that, we first need to compute, for every system, the distance to all systems within 30 Ly, so I've devised this query to calculate the distances in question:
MATCH (s1:System), (s2:System)
WITH s1, s2, (sqrt((s2.x-s1.x)^2+(s2.y-s1.y)^2+(s2.z-s1.z)^2)) AS dist
WHERE dist < 30 AND dist > 0
CREATE UNIQUE (s1)-[:IS_DISTANCED_FROM {distance: dist}]-(s2)
RETURN count(dist)
A system has x, y, z coordinates. The query is very slow; even after some hours it hadn't finished. Am I doing something wrong?
I have an index on :System and I'm using version 2.1.6.
My Cypher query failed, but my database is now at 806,777 relationships. Is there a way to clean this up? The relationships don't appear when I query for them afterwards.
Thanks for your help!
Maybe you can try with a subset of your systems, to see where the execution plan is taking the most time.
Can you run this query in the neo4j shell and post the resulting execution plan?
PROFILE MATCH (s1:System)
WITH s1
LIMIT 15
MATCH (s2:System)
WHERE s2 <> s1
WITH s1, s2, (sqrt((s2.x-s1.x)^2+(s2.y-s1.y)^2+(s2.z-s1.z)^2)) AS dist
WHERE dist < 30 AND dist > 0
CREATE UNIQUE (s1)-[:IS_DISTANCED_FROM {distance: dist}]-(s2)
RETURN count(dist)
It is not surprising that your query takes a long time, since it has complexity O(N*N). And indexing does not help, since you are not matching using specific property values.
I would recommend that you calculate the distances programmatically outside of Cypher, and then use Cypher just to create the relationships. This will still be slow, but probably much faster than trying to do everything within Cypher.
Also, you can halve the number of calculations you need to perform by noticing that [the distance from System A to System B] equals [the distance from System B to System A]. You only need to create a single distance relationship between any 2 Systems.
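One common way to enforce that single relationship per pair is to order the two nodes by their internal ids, so each pair is only considered once; a sketch (still O(N*N), but half the CREATE work, and a plain directed CREATE is enough because each pair now occurs only once):

MATCH (s1:System), (s2:System)
WHERE id(s1) < id(s2)
WITH s1, s2, sqrt((s2.x-s1.x)^2 + (s2.y-s1.y)^2 + (s2.z-s1.z)^2) AS dist
WHERE dist < 30
CREATE (s1)-[:IS_DISTANCED_FROM {distance: dist}]->(s2)
RETURN count(*)

When reading the distances back, just match the relationship without a direction.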
Finally, to identify the apparently spurious relationships in your DB, you can try something like this query to take a look at some of them:
MATCH ()-[r]->()
RETURN r
LIMIT 50
If you are REALLY REALLY CERTAIN that you want to get rid of ALL relationships, you can use the following query. To be safe, you may want to first make a backup copy of your DB (shut down the DB server, make a recursive copy of its data/graph.db/ folder, and then restart the server).
MATCH ()-[r]->()
DELETE r
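If deleting roughly 800,000 relationships in a single transaction is too heavy for your machine, a batched variant is possible; this sketch assumes the spurious relationships are the :IS_DISTANCED_FROM ones and should be re-run until it matches nothing:

MATCH ()-[r:IS_DISTANCED_FROM]->()
WITH r LIMIT 100000
DELETE r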