Neo4j (Cypher) - Is it possible to use non-implicit aggregation?

My question is fairly straightforward. I've been trying to write a Cypher query which uses an aggregation function - min().
I am trying to obtain the closest node to a particular node using the new Spatial functions offered in Neo4j 3.4. My query currently looks like this:
MATCH (a { agency: "Bus", stop_id: "1234" }), (b { agency: "Train" })
WITH distance(a.location, b.location) AS dist, a.stop_id as orig_stop_id, b.stop_id AS dest_stop_id
RETURN orig_stop_id,min(dist)
The location property is a point property, and this query does almost exactly what I want, except for one thing: I'd also like to include the dest_stop_id field in the result so that I know which node corresponds to this minimal distance. However, Neo4j implicitly aggregates over all fields in the RETURN clause that are not inside an aggregate function, so I just get a list of all pairs (orig_stop_id, dest_stop_id) and their distances rather than the minimum and its corresponding dest_stop_id. Is there any way to specify which fields should be grouped in the result set?
In SQL, GROUP BY allows you to specify this but I haven't been able to find a similar function in Cypher.
Thanks in advance, please let me know if you need any extra information.

This should work:
MATCH (a { agency: "Bus", stop_id: "1234" }), (b { agency: "Train" })
RETURN
a.stop_id AS orig_stop_id,
REDUCE(
s = NULL,
d IN COLLECT({dist: distance(a.location, b.location), sid: b.stop_id}) |
CASE WHEN s.dist < d.dist THEN s ELSE {dist: d.dist, dest_stop_id: d.sid} END
) AS min_data
This query uses REDUCE to get the minimum distance and also the corresponding dest_stop_id at the same time.
The tricky part is that the first time the CASE expression is evaluated, s will be NULL; afterwards, s will be a map. The CASE expression handles the NULL situation via the s.dist < d.dist test: when s is NULL, the comparison evaluates to NULL (which is not true), so the ELSE branch is executed, initializing s to a map.
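You can check that NULL-comparison behaviour in isolation (a standalone snippet, separate from the query above):

```cypher
// a comparison involving NULL yields NULL, which is not true,
// so the CASE falls through to its ELSE branch on the first iteration
RETURN NULL < 5 AS result
```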
NOTE: Ideally, you should use the labels for your nodes in your query, so that the query does not have to scan every node in the DB to find each node. Also, you may want to add the appropriate indexes to further speed up the query.
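For example, if your stops carried a hypothetical :Stop label (adjust to your actual schema), the index and labeled query might look like this:

```cypher
// hypothetical :Stop label and index; adapt to your own labels
CREATE INDEX ON :Stop(stop_id);

MATCH (a:Stop { agency: "Bus", stop_id: "1234" }), (b:Stop { agency: "Train" })
RETURN
  a.stop_id AS orig_stop_id,
  REDUCE(
    s = NULL,
    d IN COLLECT({dist: distance(a.location, b.location), sid: b.stop_id}) |
    CASE WHEN s.dist < d.dist THEN s ELSE {dist: d.dist, dest_stop_id: d.sid} END
  ) AS min_data
```

With the label in place, the planner can start from an index lookup for a instead of scanning every node in the database.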

Seems like you could skip the aggregation function, order by the distance, and take the top result:
MATCH (a { agency: "Bus", stop_id: "1234" }), (b { agency: "Train" })
WITH distance(a.location, b.location) AS dist, a, b
ORDER BY dist ASC
LIMIT 1
RETURN a.stop_id as orig_stop_id, b.stop_id AS dest_stop_id, dist
As others here have mentioned, you really should use labels here (otherwise this is doing all-node scans to find your starting points, which is probably the main performance bottleneck of your query), and have indexes in place so you're using index lookups for both a and b.
EDIT
If you need the nearest when you have multiple starting nodes, you can take the head of the collected elements like so:
MATCH (a { agency: "Bus", stop_id: "1234" }), (b { agency: "Train" })
WITH distance(a.location, b.location) AS dist, a, b
ORDER BY dist ASC
WITH a, head(collect(b {.stop_id, dist})) as b
RETURN a.stop_id as orig_stop_id, b.stop_id AS dest_stop_id, b.dist as dist
We do need to include dist in the map projection from b; otherwise it would be used as a grouping key along with a.
Alternately you could just collect b instead of the map projection and then recalculate with the distance() function per remaining row.
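That alternative (an unchecked sketch) would collect the nodes themselves and recalculate the distance on the single remaining row:

```cypher
MATCH (a { agency: "Bus", stop_id: "1234" }), (b { agency: "Train" })
WITH distance(a.location, b.location) AS dist, a, b
ORDER BY dist ASC
WITH a, head(collect(b)) AS b
RETURN a.stop_id AS orig_stop_id, b.stop_id AS dest_stop_id,
       distance(a.location, b.location) AS dist
```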

You can use COLLECT for aggregation (note this query isn't checked):
MATCH (a { agency: "Bus", stop_id: "1234" }), (b { agency: "Train" })
WITH COLLECT(distance(a.location, b.location)) AS distances, a.stop_id AS stopId
UNWIND distances AS d
WITH min(d) AS minDist, stopId
MATCH (bus { agency: "Bus", stop_id: stopId }), (train { agency: "Train" })
WHERE distance(bus.location, train.location) = minDist
RETURN bus, train, minDist
Hope this will help you.

Related

Neo4j: How to find for each node its next neighbour by distance and create a relationship

I imported a large set of nodes (>16 000) where each node contains information about a location (longitude/latitude geo-data). All nodes have the same label. There are no relationships in this scenario. Now I want to identify for each node its nearest neighbour by distance and create a relationship between these nodes.
This (brute force) way worked well for sets containing about 1000 nodes: (1) I first defined relationships between all nodes containing the distance information. (2) Then I set the property "mindist=false" on all relationships. (3) After that, I identified the nearest neighbour by looking at the distance information of each relationship and set the "mindist" property to "true" where the relationship represents the shortest distance. (4) Finally, I deleted all relationships with "mindist=false".
(1)
match (n1:XXX),(n2:XXX)
where id(n1) <> id(n2)
with n1,n2,distance(n1.location,n2.location) as dist
create(n1)-[R:DISTANCE{dist:dist}]->(n2)
Return R
(2)
match (n1:XXX)-[R:DISTANCE]->(n2:XXX)
set R.mindist=false return R.mindist
(3)
match (n1:XXX)-[R:DISTANCE]->(n2:XXX)
with n1, min(R.dist) as mindist
match (o1:XXX)-[r:DISTANCE]->(o2:XXX)
where o1.name=n1.name and r.dist=mindist
Set r.mindist=TRUE
return r
(4)
match (n)-[R:DISTANCE]->()
where R.mindist=false
delete R return n
With sets containing about 16000 nodes this solution didn't work (memory problems ...). I am sure there is a smarter way to solve this problem (but at this point of time I am still short on experience working with neo4j/cypher). ;-)
You can find the closest neighbour for each node, one by one, in batches using APOC. (This is also a brute-force way, but it runs faster.) It takes around 75 seconds for 7322 nodes.
CALL apoc.periodic.iterate("MATCH (n1:XXX)
RETURN n1", "
WITH n1
MATCH (n2:XXX)
WHERE id(n1) <> id(n2)
WITH n1, n2, distance(n1.location,n2.location) as dist ORDER BY dist LIMIT 1
CREATE (n1)-[r:DISTANCE{dist:dist}]->(n2)", {batchSize:1, parallel:true, concurrency:10})
NOTE: batchSize should always be 1 in this query. You can change concurrency for experimentation.
Our options within Cypher are I think limited to a naive O(n^2) brute-force check of the distance from every node to every other node. If you were to write some custom Java to do it (which you could expose as a Neo4j plugin), you could do the check much quicker.
Still, you can do it with arbitrary numbers of nodes in the graph without blowing out the heap if you use APOC to split the query up into multiple transactions. Note: you'll need to add the APOC plugin to your install.
Let's first create 20,000 points of test data:
WITH range(1, 20000) as ids
WITH [x in ids | { id: x, loc: point({ x: rand() * 100, y: rand() * 100 }) }] as points
UNWIND points as pt
CREATE (p: Point { id: pt.id, location: pt.loc })
We'll probably want a couple of indexes too:
CREATE INDEX ON :Point(id)
CREATE INDEX ON :Point(location)
In general, the following query (don't run it yet...) would, for each Point node, create a list containing the ID and distance to every other Point node in the graph, sort that list so the nearest one is at the top, pluck the first item from the list, and create the corresponding relationship.
MATCH (p: Point)
MATCH (other: Point) WHERE other.id <> p.id
WITH p, [x in collect(other) | { id: x.id, dist: distance(p.location, x.location) }] AS dists
WITH p, head(apoc.coll.sortMaps(dists, '^dist')) AS closest
MATCH (closestPoint: Point { id: closest.id })
MERGE (p)-[:CLOSEST_TO]->(closestPoint)
However, the first two lines there cause a cartesian product of nodes in the graph: for us, it's 400 million rows (20,000 * 20,000) that flow into the rest of the query all of which is happening in memory - hence the blow-up. Instead, let's use APOC and apoc.periodic.iterate to split the query in two:
CALL apoc.periodic.iterate(
"
MATCH (p: Point)
RETURN p
",
"
MATCH (other: Point) WHERE other.id <> p.id
WITH p, [x in collect(other) | { id: x.id, dist: distance(p.location, x.location) }] AS dists
WITH p, head(apoc.coll.sortMaps(dists, '^dist')) AS closest
MATCH (closestPoint: Point { id: closest.id })
MERGE (p)-[:CLOSEST_TO]->(closestPoint)
", { batchSize: 100 })
The first query just returns all Point nodes. apoc.periodic.iterate will then take the 20,000 nodes from that query and split them up into batches of 100 before running the inner query on each of the nodes in each batch. We'll get a commit after each batch, and our memory usage is constrained to whatever it costs to run the inner query.
It's not quick, but it does complete. On my machine it's running about 12 nodes a second on a graph with 20,000 nodes, but the cost grows quadratically as the number of nodes in the graph increases. You'll rapidly hit the point where this approach just doesn't scale well enough.

How do I set relationship data as properties on a node?

I've taken the leap from SQL to Neo4j. I have a few complicated relationships that I need to set as properties on nodes as the first step towards building a recommendation engine.
This Cypher query returns a list of categories and weights.
MATCH (m:Movie {name: "The Matrix"})<-[:TAKEN_FROM]-(i:Image)-[r:CLASSIFIED_AS]->(c:Category) RETURN c.name, avg(r.weight)
This returns
{ "fighting": 0.334, "looking moody": 0.250, "lying down": 0.237 }
How do I set these results as key value pairs on the parent node?
The desired outcome is this:
(m:Movie { "name": "The Matrix", "fighting": 0.334, "looking moody": 0.250, "lying down": 0.237 })
Also, I assume I should process my (m:Movie) nodes in batches so what is the best way of accomplishing this?
Not quite sure how you're getting that output; that RETURN shouldn't produce both of them as key-value pairs in one map. Instead I would expect separate records for each pair, something like: {"c.name":"fighting", "avg(r.weight)":0.334}.
You may need APOC procedures for this, as you need a means to set the property key to the value of the category name. That's a bit tricky, but you can do this by creating a map from the collected pairs, then use SET with += to update the relevant properties:
MATCH (m:Movie {name: "The Matrix"})<-[:TAKEN_FROM]-(:Image)-[r:CLASSIFIED_AS]->(c:Category)
WITH m, c.name as name, avg(r.weight) as weight
WITH m, collect([name, weight]) as category
WITH m, apoc.map.fromPairs(category) as categories
SET m += categories
As far as batching goes, take a look at apoc.periodic.iterate(); it will allow you to iterate over the streamed results of the outer query and execute the inner query on batches of the stream:
CALL apoc.periodic.iterate(
"MATCH (m:Movie)
RETURN m",
"MATCH (m)<-[:TAKEN_FROM]-(:Image)-[r:CLASSIFIED_AS]->(c:Category)
WITH m, c.name as name, avg(r.weight) as weight
WITH m, collect([name, weight]) as category
WITH m, apoc.map.fromPairs(category) as categories
SET m += categories",
{iterateList:true, parallel:false}) YIELD total, batches, errorMessages
RETURN total, batches, errorMessages

Return Relationship belonging to a list Neo4j cypher

I have a dataset containing 3M nodes and more than 5M relationships. There are about 8 different relationship types. Now I want to return 2 nodes if they are inter-connected. Here the 2 nodes are A & B and I would like to see if they are inter-connected.
MATCH (n:WCD_Ent)
USING INDEX n:WCD_Ent(WCD_NAME)
WHERE n.WCD_NAME = "A"
MATCH (m:WCD_Ent)
USING INDEX m:WCD_Ent(WCD_NAME)
WHERE m.WCD_NAME = "B"
MATCH (n) - [r*] - (m)
RETURN n,r,m
This gives me Java Heap Space error.
Another condition I am looking to put in my query is that the path between the 2 nodes A & B contains one particular relationship type (NAME_MATCH) at least once. Could you help me address the same?
Gabor's suggestion is the most important fix; you are blowing up heap space because you are generating a cartesian product of rows to start, then filtering using the pattern. Generate rows using the pattern and you'll be much more space-efficient. If you have an index on WCD_Ent(WCD_NAME), you don't need to specify the index either; that is something you only do if your query is running very slow and a PROFILE shows that the query planner is skipping the index. Try this one instead:
MATCH (n:WCD_Ent { WCD_NAME: "A" })-[r*..5]-(m:WCD_Ent { WCD_NAME: "B" })
WHERE ANY(rel IN r WHERE TYPE(rel) = 'NAME_MATCH')
RETURN n, r, m
The WHERE filter here will check all of the relationships in r (which is a collection, the way you've assigned it) and ensure that at least 1 of them matches the desired type.
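As a standalone illustration of ANY over a collection (hypothetical values, unrelated to the graph):

```cypher
// true, because at least one element satisfies the predicate
RETURN ANY(t IN ['OTHER', 'NAME_MATCH'] WHERE t = 'NAME_MATCH') AS hasMatch
```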
Tore's answer (including the variable relationship upper bound) is the best one for finding whether two nodes are connected and if a certain relationship exists in a path connecting them.
One weakness with most of the solutions given so far is that there is no limitation on the variable relationship match, meaning the query is going to crawl your entire graph attempting to match on all possible paths, instead of only checking that one such path exists and then stopping. This is likely the cause of your heap space error.
Tore's suggestion of adding an upper bound on the variable-length relationships in your match is a great solution, as it also helps in cases where the two nodes aren't connected, preventing you from having to crawl the entire graph. In all cases, the upper bound should prevent the heap from blowing up.
Here are a couple more possibilities. I'm leaving off the relationship upper bound, but that can easily be added in if needed.
// this one won't check for the particular relationship type in the path
// but doesn't need to match on all possible paths, just find connectedness
MATCH (n:WCD_Ent { WCD_NAME: "A" }), (m:WCD_Ent { WCD_NAME: "B" })
RETURN EXISTS((n)-[*]-(m))
// using shortestPath() will only give you a single path back that works
// however the WHERE ANY filter may only be applied after matches are found
// so this may still blow up, not sure
MATCH (n:WCD_Ent { WCD_NAME: "A" }), (m:WCD_Ent { WCD_NAME: "B" })
MATCH p = shortestPath((n)-[r*]-(m))
WHERE ANY(rel IN r WHERE TYPE(rel) = 'NAME_MATCH')
RETURN p
// Adding LIMIT 1 will only return one path result
// Unsure if this will prevent the heap from blowing up though
// The performance and outcome may be identical to the above query
MATCH (n:WCD_Ent { WCD_NAME: "A" }), (m:WCD_Ent { WCD_NAME: "B" })
MATCH (n)-[r*]-(m)
WHERE ANY(rel IN r WHERE TYPE(rel) = 'NAME_MATCH')
RETURN n, r, m
LIMIT 1
Some enhancements:
Instead of the WHERE condition, you can bind the property value inside the pattern.
You can combine the three MATCH conditions into a single one, which makes sure that the query engine will not calculate a Cartesian product of n and m. (You can also use EXPLAIN to visualize the query plan and check this.)
The resulting query:
MATCH (n:WCD_Ent { WCD_NAME: "A" })-[r*]-(m:WCD_Ent { WCD_NAME: "B" })
RETURN n, r, m
Update: Tore Eschliman pointed out that you don't need to specify the indices, so I removed these two lines from the query:
USING INDEX n:WCD_Ent(WCD_NAME)
USING INDEX m:WCD_Ent(WCD_NAME)

Find most frequently used distinct set of terms

Imagine a graph database composed of URLs and tags used to describe them. From this we want to find which sets of tags are most frequently used together and determine which URLs belong in each identified set.
I've tried to create a dataset which simplifies this problem as such in cypher:
CREATE (tech:Tag { name: "tech" }), (comp:Tag { name: "computers" }), (programming:Tag { name: "programming" }), (cat:Tag { name: "cats" }), (mice:Tag { name: "mice" }), (u1:Url { name: "http://u1.com" })-[:IS_ABOUT]->(tech), (u1)-[:IS_ABOUT]->(comp), (u1)-[:IS_ABOUT]->(mice), (u2:Url { name: "http://u2.com" })-[:IS_ABOUT]->(mice), (u2)-[:IS_ABOUT]->(cat), (u3:Url { name: "http://u3.com" })-[:IS_ABOUT]->(tech), (u3)-[:IS_ABOUT]->(programming), (u4:Url { name: "http://u4.com" })-[:IS_ABOUT]->(tech), (u4)-[:IS_ABOUT]->(mice), (u4)-[:IS_ABOUT]->(acc:Tag { name: "accessories" })
Using this as a reference (neo4j console example here), we can look at it and visually identify that the most commonly used tags are tech and mice (the query for this is trivial), each referencing 3 URLs. The most commonly used tag pair is [tech, mice], as it (in this example) is the only pair shared by 2 urls (u4 and u1). It's important to note that this tag pair is a subset of each matched URL's tag set, not the entire set for either. There is no combination of 3 tags shared by any urls.
How can I write a cypher query to identify which tag combinations are the most frequently used together (either in pairs, or in N-sized groups)? Perhaps there's a better way to structure this data which would make analysis easier? Or is this problem not well suited for a graph DB? Been struggling a bit trying to figure this one out, any help or thoughts would be appreciated!
It looks like a combinatorics problem.
// The tags for each URL, sorted by ID
MATCH (U:Url)-[:IS_ABOUT]->(T:Tag)
WITH U, T ORDER BY id(T)
WITH U,
collect(distinct T) as TAGS
// Calc the number of combinations of tags for a node,
// independent of the order of tags
// Since exponentiation is not available in Cypher,
// use the logarithm and exponent
//
WITH U, TAGS,
toInt(floor(exp(log(2) * size(TAGS)))) as numberOfCombinations
// Iterate through all combinations
UNWIND RANGE(0, numberOfCombinations - 1) as combinationIndex
WITH U, TAGS, combinationIndex
// And check for each tag its presence in combination
// Bitwise operations are missing in Cypher,
// therefore we use APOC
// https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_bitwise_operations
//
UNWIND RANGE(0, size(TAGS)-1) as tagIndex
WITH U, TAGS, combinationIndex, tagIndex,
toInt(ceil(exp(log(2) * tagIndex))) as pw2
call apoc.bitwise.op(combinationIndex, "&", pw2) YIELD value
WITH U, TAGS, combinationIndex, tagIndex,
value WHERE value > 0
// Get all combinations of tags for URL
WITH U, TAGS, combinationIndex,
collect(TAGS[tagIndex]) as combination
// Return all the possible combinations of tags, sorted by frequency of use
RETURN combination, count(combination) as freq, collect(U) as urls
ORDER BY freq DESC
I think that it is best to calculate and store the tag combination with the use of this algorithm at the time of tagging. And the query will be something like this:
MATCH (Comb:TagsCombination)<-[:IS_ABOUT]-(U:Url)
WITH Comb, collect(U) as urls, count(U) as freq
MATCH (Comb)-[:CONTAIN]->(T:Tag)
RETURN Comb, collect(T) as Tags, urls, freq ORDER BY freq DESC
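A sketch of how such combinations might be precomputed at tagging time (hypothetical :TagsCombination model, keyed by the sorted tag names; untested). Note this stores only the full tag set of each URL; generating all subsets would follow the combinatorics above.

```cypher
MATCH (u:Url)-[:IS_ABOUT]->(t:Tag)
WITH u, t ORDER BY t.name
WITH u, collect(t.name) AS names
// key the combination node by the sorted, concatenated tag names
MERGE (c:TagsCombination { key: reduce(s = '', n IN names | s + '|' + n) })
MERGE (u)-[:IS_ABOUT]->(c)
WITH c, names
UNWIND names AS name
MATCH (t:Tag { name: name })
MERGE (c)-[:CONTAIN]->(t)
```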
Start at the URL nodes and build a tuple of tag names (order it first so they all group up the same). That'll give you all of the possible combinations of tags that exist. Then, use filters to find out how many URLs match each possible set of tags.
MATCH (u:Url)-[:IS_ABOUT]->(t:Tag)
WITH u, t
ORDER BY t.name
WITH u, [x IN COLLECT(t) | x.name] AS tags
WITH DISTINCT tags
MATCH (u2:Url)
WHERE ALL(name IN tags WHERE (u2)-[:IS_ABOUT]->(:Tag { name: name }))
RETURN tags, count(u2)

Create relationship between nodes having same property value in common, using one Cypher query

Beginning with Neo4j 1.9.2, and using Cypher query language, I would like to create relationships between nodes having a specific property value in common.
I have a set of nodes G having a property H, with no relationships currently existing between G nodes.
In a Cypher statement, is it possible to group G nodes by H property value and create a relationship HR between the nodes belonging to the same group? Note that each group has between 2 and 10 members, and I have more than 15k such groups (15k distinct H values) across about 50k G nodes.
I've tried hard to manage such query without finding a correct syntax. Below is a small sample dataset:
create
(G1 {name:'G1', H:'1'}),
(G2 {name:'G2', H:'1'}),
(G3 {name:'G3', H:'1'}),
(G4 {name:'G4', H:'2'}),
(G5 {name:'G5', H:'2'}),
(G6 {name:'G6', H:'2'}),
(G7 {name:'G7', H:'2'})
return * ;
At the end, I'd like such relationships:
G1-[:HR]-G2-[:HR]-G3-[:HR]-G1
And:
G4-[:HR]-G5-[:HR]-G6-[:HR]-G7-[:HR]-G4
In another case, I may want to massively update the relationships between nodes using/comparing some of their properties. Imagine nodes of type N and nodes of type M, with N nodes related to M by a relationship named :IS_LOCATED_ON. The order of the location can be stored as a property of N nodes (N.relativePosition, a Long from 1 to MAX_POSITION), but we may later need to update the graph model so that N nodes are linked between themselves by a new :PRECEDES relationship, letting us find the next N node on a given set more easily and quickly.
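For that :PRECEDES case, a sketch (using labels, which require Neo4j 2.0+; the names :N, :M and relativePosition are illustrative, and consecutive position values are assumed):

```cypher
// link each N node to the one at the next position on the same M node
MATCH (a:N)-[:IS_LOCATED_ON]->(m:M)<-[:IS_LOCATED_ON]-(b:N)
WHERE b.relativePosition = a.relativePosition + 1
CREATE (a)-[:PRECEDES]->(b)
```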
I'd expect such a language to allow updating massive sets of nodes/relationships by manipulating their properties.
Is it not possible?
If not, is it planned, or might it be planned?
Any help would be greatly appreciated.
Since there's nothing in the data you supplied to get rank, I've played with collections to get one as follows:
START
n=node(*), n2=node(*)
WHERE
HAS(n.H) AND HAS(n2.H) AND n.H = n2.H
WITH n, n2 ORDER BY n2.name
WITH n, COLLECT(n2) as others
WITH n, others, LENGTH(FILTER(x IN others : x.name < n.name)) as rank
RETURN n.name, n.H, rank ORDER BY n.H, n.name;
Building off of that you can then start determining relationships
START
n=node(*), n2=node(*)
WHERE
HAS(n.H) AND HAS(n2.H) AND n.H = n2.H
WITH n, n2 ORDER BY n2.name
WITH n, COLLECT(n2) as others
WITH n, others, LENGTH(FILTER(x IN others : x.name < n.name)) as rank
WITH n, others, rank, COALESCE(
HEAD(FILTER(x IN others : x.name > n.name)),
HEAD(others)
) as next
RETURN n.name, n.H, rank, next ORDER BY n.H, n.name;
Finally ( and slightly more condensed )
START
n=node(*), n2=node(*)
WHERE
HAS(n.H) AND HAS(n2.H) AND n.H = n2.H
WITH n, n2 ORDER BY n2.name
WITH n, COLLECT(n2) as others
WITH n, others, COALESCE(
HEAD(FILTER(x IN others : x.name > n.name)),
HEAD(others)
) as next
CREATE n-[:HR]->next
RETURN n, next;
You can just do it like this, and maybe indicate direction in your relationships:
CREATE
(G1 { name:'G1', H:'1' }),
(G2 { name:'G2', H:'1' }),
(G3 { name:'G3', H:'1' }),
(G4 { name:'G4', H:'2' }),
(G5 { name:'G5', H:'2' }),
(G6 { name:'G6', H:'2' }),
(G7 { name:'G7', H:'2' }),
G1-[:HR]->G2-[:HR]->G3-[:HR]->G1,
G4-[:HR]->G5-[:HR]->G6-[:HR]->G7-[:HR]->G4
See http://console.neo4j.org/?id=ujns0x for an example.