My task is to calculate the total length of roads in a city. I'm using OSM data. After importing it into the database I have the following structure (this seemed logical to me, but I can change it if you think there is a better way):
There is a root node for each road segment (way tag in OSM XML) that holds an ID and a type (I have other types as well, but they are irrelevant here).
The root node is connected to the first node of the road with a relation 'defines'.
Every node is connected to the next one with a relation called 'connected' that has a property 'entity_id', which is the root node's ID. (One node can appear in several road segments, for example at intersections, so I'm trying to avoid cycles with this property.)
I'm pretty new to Neo4j. I only have experience with SQL databases, but based on that I feel that even if my approach worked, it would lose the advantage of the query language (in terms of speed).
So here is what I have so far, but it is not even close. It outputs the same (wrong) number many times instead of one total. I'm pretty sure I have not understood the whole idea of WITH, but I can't figure out what the solution would be:
CREATE (t:Tmp {total:0})
with t
MATCH (e:Entity {type:'road'})
with collect(e) as es, t
unwind es as entity
match p = ()-[r:connected {entity_id:entity.int_id}]->()
with entity, p,t
SET entity.lng = 0
with entity, p, t
unwind nodes(p) as nd
with t,nd,point({longitude:toFloat(nd.lon), latitude: toFloat(nd.lat)}) as point1, entity
SET entity.lng = entity.lng + distance(entity.p, point1)
with t,nd,point({longitude:toFloat(nd.lon), latitude: toFloat(nd.lat)}) as point1, entity
SET entity.p = point1
with entity, t
SET t.total = t.total + entity.lng
return t.total
Your query is returning the current t.total result per node, instead of an overall total value. And it seems to be incorrectly calculating a distance for the first node in a segment (the first node should have a 0 distance). It is also very inefficient. For example, it does not bother to leverage the defines relationship. In a neo4j query, it is vitally important to use the power of relationships to avoid scanning through a lot of irrelevant data.
In addition, there is no mention of a particular "city". Your query is for all Entity nodes. If your DB only contains Entity nodes for a single city, then that is OK. Otherwise, you will need to modify the query to only match Entity nodes for a specific city.
The following query may do what you want (going simply by what I gleaned from your question, and assuming your DB only has data for a single city), using the defines relationship to efficiently match the start node for each Entity segment, and using that start node to efficiently find the connected nodes of interest:
MATCH (entity:Entity {type:'road'})-[:defines]->(start)
MATCH p=(start)-[:connected* {entity_id:entity.int_id}]->(end)
WHERE NOT EXISTS((end)-[:connected {entity_id:entity.int_id}]->())
SET entity.lng = 0
SET entity.p = point({longitude:toFloat(start.lon), latitude: toFloat(start.lat)})
WITH entity, p
UNWIND TAIL(NODES(p)) AS nd
WITH point({longitude:toFloat(nd.lon), latitude: toFloat(nd.lat)}) as pt, entity
SET entity.lng = entity.lng + distance(entity.p, pt)
SET entity.p = pt
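WITH entity, COUNT(*) AS nodeCount
RETURN SUM(entity.lng) AS total
The aggregating function SUM() is used to return the total lng for all entities, so the t node is not needed. Because the UNWIND produces one row per node, the WITH entity, COUNT(*) step first collapses the rows back to one per entity; otherwise each entity's lng would be added to the total once per node. The WHERE clause only matches paths that comprise complete segments. This query also initializes entity.p to the point of the start node, and unwinds only the nodes after the start node, so the start node contributes a distance of 0.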
If there are many Entity nodes with type values other than "road", you may also want to create an index on :Entity(type).
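To sanity-check the numbers that come back from the database, the same accumulation can be reproduced outside Neo4j. The following is only a sketch in plain Python: it assumes each segment is available as an ordered list of (lon, lat) pairs, uses a haversine approximation with an assumed mean Earth radius, and the function names are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius in metres

def haversine_m(lon1, lat1, lon2, lat2):
    """Approximate great-circle distance in metres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))

def segment_length_m(points):
    """Sum distances between consecutive points; the first point adds nothing."""
    return sum(haversine_m(*a, *b) for a, b in zip(points, points[1:]))

def total_length_m(segments):
    """One total over all segments, analogous to the final SUM()."""
    return sum(segment_length_m(points) for points in segments)
```

A segment with a single node has length 0 here, which mirrors the point that the first node of a segment should not contribute any distance.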
I have got a graph that represents several bus/train stops in different cities.
Let's assume I want to go from city A (with stops a1, a2, a3...) to city Z (with stops z1, z2...).
There are several routes (relations) between the nodes and I want to get all paths between the start and the end node. In reality my cost vector would be complex (travel time, waiting time, price and so on), therefore I cannot use shortestPath etc. I managed to write a (quite complex) query that does what I want: in general it looks for every available match with start A and end Z.
I try to avoid looping by filtering out results with special characteristics, e.g.
MATCH (from{name:'a1'}), (to{name:'z1'}),
path = (from)-[:CONNECTED_TO*0..8]->(to)
WHERE ALL(b IN NODES(path) WHERE SINGLE(c IN NODES(path) WHERE b = c))
Now I want to avoid the possibility of visiting one city more than once, e.g. instead of a1-->a2-->d2-->d4-->a3-->a4-->z1 I want to get a1-->a4-->z1.
Therefore I have to check all nodes in the path. If the value of n.city is the same for consecutive nodes, everything is fine. But if I get a path with nodes of the same city that are not consecutive, e.g. cityA-->cityB-->cityA, I want to throw away that path.
How can I do that? Is something like that possible?
I know that this is not really a beautiful approach, but I invested quite a lot of time in finding a better one without throwing away the whole data structure, and I could not find one. It's just a prototype and Neo4j is not my focus. I want to test some tools and products to build some knowledge. I will go ahead with a better approach next time.
Interesting question. The important thing to observe here is that a path that never revisits a city (after leaving it) must have fewer transitions between cities than the number of distinct cities. For example:
AABBC (a "good" path) has 3 distinct cities and 2 transitions
ABBAC (a "bad" path) also has 3 distinct cities but 3 transitions
With this observation in mind, the following query should work (even if the start and end nodes are the same):
MATCH path = ({name:'a1'})-[:CONNECTED_TO*0..8]->({name:'z1'})
WITH path, NODES(path) as ns
WITH path, ns,
REDUCE(s = {cnt: 0, last: ns[0].city}, x IN ns[1..] |
CASE WHEN x.city = s.last THEN s ELSE {cnt: s.cnt+1, last: x.city} END).cnt AS nTransitions
UNWIND ns AS node
WITH path, nTransitions, COUNT(DISTINCT node.city) AS nCities
WHERE nTransitions < nCities
RETURN path;
The REDUCE function is used to calculate the number of transitions in a path.
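The transitions-versus-distinct-cities test can be checked on its own before trusting it inside the query. Below is a small sketch in plain Python (not Cypher); the function names are hypothetical and a path is represented simply as the list of city values of its nodes.

```python
def count_transitions(cities):
    """Count positions where the city differs from the previous node's city."""
    transitions = 0
    for prev, cur in zip(cities, cities[1:]):
        if cur != prev:
            transitions += 1
    return transitions

def never_revisits_city(cities):
    """True when no city is re-entered after it has been left."""
    return count_transitions(cities) < len(set(cities))

good_path = list("AABBC")  # 3 distinct cities, 2 transitions
bad_path = list("ABBAC")   # 3 distinct cities, 3 transitions
```

With these inputs, never_revisits_city(good_path) is true and never_revisits_city(bad_path) is false, matching the two examples above.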
In a graph where the following nodes
A,B,C,D
each have a relationship with their successor
(A->B)
and
(B->C)
etc.
How do I make a query that starts with A and gives me all nodes (and relationships) from there outwards?
I do not know the end node (C).
All I know is to start from A and traverse the whole connected graph (with conditions on relationship and node type).
I think you need to use this pattern:
(n)-[*]->(m) - variable length path of any number of relationships from n to m. (see Refcard)
A sample query would be:
MATCH path = (a:A)-[*]->()
RETURN path
Also have a look at the path functions in the refcard to expand your Cypher query (I don't know what exact conditions you'll need to apply).
To get all the nodes / relationships starting at a node:
MATCH (a:A {id: "id"})-[r*]-(b)
RETURN a, r, b
This will return all the graphs originating with node A / Label A where id = "id".
One caveat - if this graph is large the query will take a long time to run.
I have a "Gene" Label/node type with properties "value" and "geneName"
I have a separate Label/node type called Pathway with property "
I want to go through all the different geneNames and find the average value of all the Genes with that geneName. I need each of those geneNames displayed as a separate row. Bear in mind that I have a lot of geneNames, so I can't name them all in the query. I need to do this inside a certain Pathway.
MATCH (sample)-[:Measures]->(gene)-[:Part_Of]->(pathway)
WHERE pathway.pathwayName = 'Pyrimidine metabolism'
WITH sample, gene, Collect (distinct gene.geneName) AS temp
I have been trying to figure this out all day now and all I can manage to do is retrieve all the rows of geneNames. I'm lost from there.
RETURN extract(n IN temp | RETURN avg(gene.value))
Maybe?
This query should return the average gene value for each distinct gene name:
MATCH (sample)-[:Measures]->(gene)-[:Part_Of]->(pathway:Pathway)
WHERE pathway.pathwayName = 'Pyrimidine metabolism'
RETURN sample, gene.geneName AS name, AVG(gene.value) AS avg;
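When you use an aggregation function (like AVG), the non-aggregated expressions in the same WITH or RETURN clause (i.e., sample and gene.geneName in the above query) act as the grouping key, so you get one row per distinct combination of them. If you want a single average per gene name across all samples, drop sample from the RETURN clause.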
For efficiency, I have also added the label to the pathway nodes so that neo4j can start off by scanning just Pathway nodes instead of all nodes. In addition, you should consider creating an index on :Pathway(pathwayName), so that the search for the Pathway is as fast as possible.
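How the grouping key changes the result can be seen with a few made-up rows. This is a plain-Python sketch, not Cypher; the rows, their values and the helper name average_by are invented for illustration only.

```python
from collections import defaultdict

def average_by(rows, key):
    """Average the 'value' field of rows, grouped by key(row)."""
    acc = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for row in rows:
        entry = acc[key(row)]
        entry[0] += row["value"]
        entry[1] += 1
    return {k: total / n for k, (total, n) in acc.items()}

# Hypothetical measurements: two samples measure g1, one sample measures g2.
rows = [
    {"sample": "s1", "geneName": "g1", "value": 2.0},
    {"sample": "s2", "geneName": "g1", "value": 4.0},
    {"sample": "s1", "geneName": "g2", "value": 10.0},
]

per_name = average_by(rows, lambda r: r["geneName"])
per_sample_and_name = average_by(rows, lambda r: (r["sample"], r["geneName"]))
```

Grouping by gene name alone gives one average per name, while grouping by (sample, name) gives one row per combination, so g1 appears twice.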
I am creating a simple graph DB for transportation between a few cities. My structure is:
Station = physical station
Stop = each station has several stops, depending on time and line ID
Ride = connection between stops
I need to find a route from city A to city C, but it has no direct stop connection; the cities are connected through city B. Please see the picture; as a new user I can't post images in the question.
How can I get a route from City A with STOP 1, connected by RIDE 1 to STOP 2, then
from STOP 2, connected through the same City B, to STOP 3, and finally from STOP 3 by RIDE 2 to STOP 4 (City C)?
Thank you.
UPDATE
The solution from Vince is OK, but I need to set a filter on the STOP nodes for departure time, something like
MATCH p=shortestPath((a:City {name:'A'})-[*{departuretime>xxx}]-(c:City {name:'C'})) RETURN p
Is it possible to do this without iterating over the whole collection of matches? That is too slow.
If you are simply looking for a single route between two nodes, this Cypher query will return the shortest path between two City nodes, A and C.
MATCH p=shortestPath((a:City {name:'A'})-[*]-(c:City {name:'C'})) RETURN p
In general if you have a lot of potential paths in your graph, you should limit the search depth appropriately:
MATCH p=shortestPath((a:City {name:'A'})-[*..4]-(c:City {name:'C'})) RETURN p
If you want to return all possible paths you can omit the shortestPath clause:
MATCH p=(a:City {name:'A'})-[*]-(c:City {name:'C'}) RETURN p
The same caveats apply. See the Neo4j documentation for full details.
Update
After your subsequent comment.
I'm not sure what the exact purpose of the time property is here, but it seems as if you actually want to create the shortest weighted path between two nodes, based on some minimum time cost. This is different of course to shortestPath, because that minimises on the number of edges traversed only, not the cost of those edges.
You'd normally model the traversal cost on edges, rather than nodes, but your graph has time only on the STOP nodes (and not for example on the RIDE edges, or the CITY nodes). To make a shortest weighted path query work here, we'd need to also model time as a property on all nodes and edges. If you make this change, and set the value to 0 for all nodes / edges where it isn't relevant then the following Cypher query does what I think you need.
MATCH p=(a:City {name: 'A'})-[*]-(c:City {name:'C'})
RETURN p AS shortestPath,
reduce(time=0, n in nodes(p) | time + n.time) AS m,
reduce(time=0, r in relationships(p) | time + r.time) as n
ORDER BY m + n ASC
LIMIT 1
In your example graph this produces a least cost path between A and C:
(A)->(STOP1)-(STOP2)->(B)->(STOP5)->(STOP6)->(C)
with a minimum time cost of 230.
This path includes two stops you have designated "bad", though I don't really understand why they're bad, because their traversal costs are less than other stops that are not "bad".
Or, use Dijkstra
This simple Cypher will probably not be performant on densely connected graphs. If you find that performance is a problem, you should use the REST API and the path endpoint of your source node, and request a shortest weighted path to the target node using Dijkstra's algorithm. Details here
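For reference, the idea behind a shortest weighted path can be sketched outside the database. The following is a generic textbook Dijkstra in plain Python, not the Neo4j implementation; the adjacency-list format, the node names and the costs are all invented for illustration and do not correspond to the example graph.

```python
import heapq

def dijkstra(graph, source, target):
    """graph: {node: [(neighbour, cost), ...]} with non-negative costs.
    Returns (total_cost, path), or (inf, []) if target is unreachable."""
    best = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break  # the first time target is popped, its cost is final
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if target not in best:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[target], path[::-1]

# Hypothetical network: going via B is cheaper than the direct-looking route.
example = {
    "A": [("S1", 10), ("S2", 50)],
    "S1": [("B", 100)],
    "S2": [("C", 300)],
    "B": [("C", 20)],
}
```

Unlike enumerating every path and sorting, this only expands the cheapest frontier node at each step, which is why it tends to scale better on densely connected graphs.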
Ah ok, if the requirement is to find paths through the graph where the departure time at every stop is no earlier than the departure time of the previous stop, this should work:
MATCH p=(:City {name:'A'})-[*]-(:City {name:'C'})
MATCH (a:Stop) where a in nodes(p)
MATCH (b:Stop) where b in nodes(p)
WITH p, a, b order by b.time
WITH p as ps, collect(distinct a) as as, collect(distinct b) as bs
WHERE as = bs
WITH ps, last(as).time - head(as).time as elapsed
RETURN ps, elapsed ORDER BY elapsed ASC
This query works by matching every possible path, and then collecting all the stops on each matched path twice over. One of these collections of stops is ordered by departure time, while the other is not. Only if the two collections are equal (i.e. number and order) is the path admitted to the results. This step evicts invalid routes. Finally, the paths themselves are ordered by least elapsed time between the first and last stop, so the quickest route is first in the list.
Normal warnings about performance, etc. apply :)
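The ordering test at the heart of that query (a path is valid only if sorting its stops by departure time leaves them in the same order) can be illustrated with plain numbers. This is a sketch in Python with made-up departure times; the function names are hypothetical.

```python
def departures_in_order(times):
    """True when no stop departs earlier than the stop before it."""
    return all(a <= b for a, b in zip(times, times[1:]))

def same_as_sorted(times):
    """The check expressed as in the query: sorted order equals path order."""
    return sorted(times) == list(times)

def elapsed(times):
    """Time between the first and last stop of a valid path."""
    return times[-1] - times[0]

valid_route = [100, 130, 200]    # hypothetical departure times along a path
invalid_route = [100, 90, 200]   # second stop departs before the first
```

Both formulations agree on these inputs: the first route is accepted with an elapsed time of 100, and the second is rejected.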
I've this kind of data model in the db:
(a)<-[:has_parent]-(b)<-[:has_parent]-(c)<-[:has_parent]-(...)
every parent can have multiple children & this can go on to unknown number of levels.
I want to find these values for every node
the number of descendants it has
the depth [distance from the node] of every descendant
the creation time of every descendant
& I want to rank the returned nodes based on these values. Right now, with no optimization, the query runs very slow (especially when the number of descendants increases).
The Questions:
what can I do in the model to make the query performant (indexing, data structure, ...)
what can I do in the query
what can I do anywhere else?
edit:
the query starts from a specific node using START or MATCH
to clarify:
a. the query may start from any point in the hierarchy, not just the root node
b. every node under the starting node is returned ranked by the total number of descendants it has, the distance (from the returned node) of every descendant & timestamp of every descendant it has.
c. by descendant I mean everything under it, not just its direct children
for example,
here's a sample graph:
http://console.neo4j.org/r/awk6m2
First you need to know how to find the root node. The following statement finds the nodes that have no outbound parent relationship. Be aware that this statement is potentially expensive in a large graph.
MATCH (n)
WHERE NOT ((n)-[:has_parent]->())
RETURN n
Instead you should use an index to find that node:
MATCH (n:Node {name:'abc'})
Starting with our root node, we traverse the inbound parent relationships with variable depth. For each node traversed we calculate the number of children; since this might be zero, an OPTIONAL MATCH is used:
MATCH (root:Node) // line 1-3 to find root node, replace by index lookup
WHERE NOT ((root)-[:has_parent]->())
WITH root
MATCH p =(root)<-[:has_parent*]-() // variable path length match
WITH last(nodes(p)) AS currentNode, length(p) AS currentDepth
OPTIONAL MATCH (currentNode)<-[:has_parent]-(c) // traverse children
RETURN currentNode, currentNode.created, currentDepth, count(c) AS countChildren
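Note that the query above counts direct children. If the total number of descendants and the depth of each one are needed, the traversal amounts to a breadth-first walk from the starting node. The following plain-Python sketch shows that logic on a made-up hierarchy; the dictionary layout and the function name are assumptions for illustration, not part of the data model.

```python
from collections import deque

def descendants_with_depth(children, start):
    """children: {parent: [child, ...]}. Returns {descendant: depth from start}."""
    depths = {}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for child in children.get(node, []):
            # Skip nodes already seen so malformed (cyclic) data cannot loop forever.
            if child not in depths and child != start:
                depths[child] = depth + 1
                queue.append((child, depth + 1))
    return depths

# Hypothetical hierarchy: a has children b and c, b has d, d has e.
tree = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
from_root = descendants_with_depth(tree, "a")
from_b = descendants_with_depth(tree, "b")
```

The number of descendants is then len(from_root), and ranking by depth or by a creation timestamp can be done on the returned mapping.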