Neo4j Cypher to get Combinations

I am trying to find a way to group combinations together.
Say we have nodes of types Person, Hobby, and Place. Say the graph has the following relationships (merged):
CREATE
(Joe:Person {name: 'Joe'}),
(hike:Hobby {name: 'hike'}),
(eat:Hobby {name: 'eat'}),
(drink:Hobby {name: 'drink'}),
(Mountain:Place {name: 'Mountain'}),
(Lake:Place {name: 'Lake'}),
(DavesBarGrill:Place {name: 'Daves BarGrill'}),
(Diner:Place {name: 'Diner'}),
(Lounge:Place {name: 'Lounge'}),
(DiveBar:Place {name: 'Dive Bar'}),
(Joe)-[:likes]->(hike),
(Joe)-[:likes]->(eat),
(Joe)-[:likes]->(drink),
(hike)-[:canDoAt]->(Mountain),
(hike)-[:canDoAt]->(Lake),
(eat)-[:canDoAt]->(DavesBarGrill),
(eat)-[:canDoAt]->(Diner),
(drink)-[:canDoAt]->(Lounge),
(drink)-[:canDoAt]->(DiveBar)
For a day on which he plans to do each of his hobbies once, there are 8 combinations of places to hike, eat, and drink. I want to be able to capture this in a query.
The naive approach,
MATCH (p:Person)-[:likes]->(h:Hobby)-[:canDoAt]->(pl:Place)
RETURN p, h, pl
will at best be able to group by person and hobby, which causes rows of the same hobby to be grouped together. What I want is to somehow group by combos, i.e.:
//Joe Combo 1// Joe,hike,Mountain
Joe,eat,Daves
Joe,drink,Lounge
//Joe Combo 2// Joe,hike,Lake
Joe,eat,Daves
Joe,drink,Lounge
Is there a way to somehow assign a number to all path matches and then use that assignment to sort?

That's a very good question! I don't have the whole solution yet, but here are some thoughts: as Martin Preusse said, we are looking to generate a Cartesian product.
This is difficult, but you can work around it with a fair bit of hacking, including a double reduce:
WITH [['a', 'b'], [1, 2, 3], [true, false]] AS hs
WITH hs, size(hs) AS numberOfHobbies,
     reduce(acc = 1, h IN hs | acc * size(h)) AS numberOfCombinations,
     [h IN hs | size(h)] AS hLengths
WITH hs, hLengths, numberOfHobbies, range(0, numberOfCombinations - 1) AS combinationIndexes
UNWIND combinationIndexes AS combinationIndex
WITH combinationIndex,
     reduce(acc = [], i IN range(0, numberOfHobbies - 1) |
       acc + toInteger(combinationIndex / reduce(acc2 = 1, j IN range(0, i - 1) | acc2 * hLengths[j])) % hLengths[i]
     ) AS indices,
     reduce(acc = [], i IN range(0, numberOfHobbies - 1) |
       acc + reduce(acc2 = 1, j IN range(0, i - 1) | acc2 * hLengths[j])
     ) AS multipliers,
     reduce(acc = [], i IN range(0, numberOfHobbies - 1) |
       acc + hs[i][
         toInteger(combinationIndex / reduce(acc2 = 1, j IN range(0, i - 1) | acc2 * hLengths[j])) % hLengths[i]
       ]
     ) AS combinations
RETURN combinationIndex, indices, multipliers, combinations
The idea is the following: we multiply the number of potential values; e.g. for ['a', 'b'], [1, 2, 3], [true, false], we calculate n = 2×3×2 = 12, using the first reduce in the query. We then iterate from 0 to n-1 and assign a row to each number, using the formula a×1 + b×2 + c×6, where a, b, c index the respective values, so all are non-negative integers with a < 2, b < 3 and c < 2.
0×1 + 0×2 + 0×6 = 0
1×1 + 0×2 + 0×6 = 1
0×1 + 1×2 + 0×6 = 2
1×1 + 1×2 + 0×6 = 3
0×1 + 2×2 + 0×6 = 4
1×1 + 2×2 + 0×6 = 5
0×1 + 0×2 + 1×6 = 6
1×1 + 0×2 + 1×6 = 7
0×1 + 1×2 + 1×6 = 8
1×1 + 1×2 + 1×6 = 9
0×1 + 2×2 + 1×6 = 10
1×1 + 2×2 + 1×6 = 11
The result is:
╒════════════════╤═════════╤═══════════╤═════════════╕
│combinationIndex│indices │multipliers│combinations │
╞════════════════╪═════════╪═══════════╪═════════════╡
│0 │[0, 0, 0]│[1, 2, 6] │[a, 1, true] │
├────────────────┼─────────┼───────────┼─────────────┤
│1 │[1, 0, 0]│[1, 2, 6] │[b, 1, true] │
├────────────────┼─────────┼───────────┼─────────────┤
│2 │[0, 1, 0]│[1, 2, 6] │[a, 2, true] │
├────────────────┼─────────┼───────────┼─────────────┤
│3 │[1, 1, 0]│[1, 2, 6] │[b, 2, true] │
├────────────────┼─────────┼───────────┼─────────────┤
│4 │[0, 2, 0]│[1, 2, 6] │[a, 3, true] │
├────────────────┼─────────┼───────────┼─────────────┤
│5 │[1, 2, 0]│[1, 2, 6] │[b, 3, true] │
├────────────────┼─────────┼───────────┼─────────────┤
│6 │[0, 0, 1]│[1, 2, 6] │[a, 1, false]│
├────────────────┼─────────┼───────────┼─────────────┤
│7 │[1, 0, 1]│[1, 2, 6] │[b, 1, false]│
├────────────────┼─────────┼───────────┼─────────────┤
│8 │[0, 1, 1]│[1, 2, 6] │[a, 2, false]│
├────────────────┼─────────┼───────────┼─────────────┤
│9 │[1, 1, 1]│[1, 2, 6] │[b, 2, false]│
├────────────────┼─────────┼───────────┼─────────────┤
│10 │[0, 2, 1]│[1, 2, 6] │[a, 3, false]│
├────────────────┼─────────┼───────────┼─────────────┤
│11 │[1, 2, 1]│[1, 2, 6] │[b, 3, false]│
└────────────────┴─────────┴───────────┴─────────────┘
So, for your problem, the query might look like this:
MATCH (p:Person)-[:likes]->(h:Hobby)-[:canDoAt]->(pl:Place)
WITH p, h, collect(pl.name) AS places
WITH p, collect(places) AS hs
WITH hs, size(hs) AS numberOfHobbies,
     reduce(acc = 1, h IN hs | acc * size(h)) AS numberOfCombinations,
     [h IN hs | size(h)] AS hLengths
WITH hs, hLengths, numberOfHobbies, range(0, numberOfCombinations - 1) AS combinationIndexes
UNWIND combinationIndexes AS combinationIndex
WITH reduce(acc = [], i IN range(0, numberOfHobbies - 1) |
       acc + hs[i][
         toInteger(combinationIndex / reduce(acc2 = 1, j IN range(0, i - 1) | acc2 * hLengths[j])) % hLengths[i]
       ]
     ) AS combinations
RETURN combinations
The result looks like this:
╒════════════════════════════════════╕
│combinations │
╞════════════════════════════════════╡
│[Diner, Lounge, Lake] │
├────────────────────────────────────┤
│[Daves BarGrill, Lounge, Lake] │
├────────────────────────────────────┤
│[Diner, Dive Bar, Lake] │
├────────────────────────────────────┤
│[Daves BarGrill, Dive Bar, Lake] │
├────────────────────────────────────┤
│[Diner, Lounge, Mountain] │
├────────────────────────────────────┤
│[Daves BarGrill, Lounge, Mountain] │
├────────────────────────────────────┤
│[Diner, Dive Bar, Mountain] │
├────────────────────────────────────┤
│[Daves BarGrill, Dive Bar, Mountain]│
└────────────────────────────────────┘
Obviously, we would also like to get the person and the names of his/her hobbies:
MATCH (p:Person)-[:likes]->(h:Hobby)-[:canDoAt]->(pl:Place)
WITH p, h, collect([h.name, pl.name]) AS places
WITH p, collect(places) AS hs
WITH p, hs, size(hs) AS numberOfHobbies,
     reduce(acc = 1, h IN hs | acc * size(h)) AS numberOfCombinations,
     [h IN hs | size(h)] AS hLengths
WITH p, hs, hLengths, numberOfHobbies, range(0, numberOfCombinations - 1) AS combinationIndexes
UNWIND combinationIndexes AS combinationIndex
WITH p,
     reduce(acc = [], i IN range(0, numberOfHobbies - 1) |
       acc + [hs[i][
         toInteger(combinationIndex / reduce(acc2 = 1, j IN range(0, i - 1) | acc2 * hLengths[j])) % hLengths[i]
       ]]
     ) AS combinations
RETURN p, combinations
The results:
╒═══════════╤════════════════════════════════════════════════════════════╕
│p │combinations │
╞═══════════╪════════════════════════════════════════════════════════════╡
│{name: Joe}│[[eat, Diner], [drink, Lounge], [hike, Lake]] │
├───────────┼────────────────────────────────────────────────────────────┤
│{name: Joe}│[[eat, Daves BarGrill], [drink, Lounge], [hike, Lake]] │
├───────────┼────────────────────────────────────────────────────────────┤
│{name: Joe}│[[eat, Diner], [drink, Dive Bar], [hike, Lake]] │
├───────────┼────────────────────────────────────────────────────────────┤
│{name: Joe}│[[eat, Daves BarGrill], [drink, Dive Bar], [hike, Lake]] │
├───────────┼────────────────────────────────────────────────────────────┤
│{name: Joe}│[[eat, Diner], [drink, Lounge], [hike, Mountain]] │
├───────────┼────────────────────────────────────────────────────────────┤
│{name: Joe}│[[eat, Daves BarGrill], [drink, Lounge], [hike, Mountain]] │
├───────────┼────────────────────────────────────────────────────────────┤
│{name: Joe}│[[eat, Diner], [drink, Dive Bar], [hike, Mountain]] │
├───────────┼────────────────────────────────────────────────────────────┤
│{name: Joe}│[[eat, Daves BarGrill], [drink, Dive Bar], [hike, Mountain]]│
└───────────┴────────────────────────────────────────────────────────────┘
I might be overthinking this, so any comments are welcome.
An important remark: the fact that this is so complicated in pure Cypher is probably a good sign that you're better off calculating this in the client application.

I'm pretty sure you cannot do this in Cypher. What you are looking for is the Cartesian product of all places, grouped by person and hobby.
A: [ [Joe, hike, Mountain], [Joe, hike, Lake] ]
B: [ [Joe, eat, Daves], [Joe, eat, Diner] ]
C: [ [Joe, drink, Lounge], [Joe, drink, Bar] ]
And you are looking for A x B x C.
As far as I know you can't group the return in Cypher like this. You should return all person, hobby, place rows and do this in a Python script where you build the grouped sets and calculate the Cartesian product.
The problem is that you get a lot of combinations with growing numbers of hobbies and places.
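A minimal sketch of that client-side approach in Python; the tuples below are hardcoded stand-ins for the person, hobby, place rows the query would return:

```python
from itertools import product

# One list of (person, hobby, place) rows per hobby, as in the answer's A, B, C.
a = [('Joe', 'hike', 'Mountain'), ('Joe', 'hike', 'Lake')]
b = [('Joe', 'eat', 'Daves'), ('Joe', 'eat', 'Diner')]
c = [('Joe', 'drink', 'Lounge'), ('Joe', 'drink', 'Bar')]

# Cartesian product A x B x C: every way to pick one place per hobby.
combos = list(product(a, b, c))
```

With two places per hobby this yields 2×2×2 = 8 combinations, matching the count from the original question.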

Related

How in Cypher do I output a row number

Given a query like this
match (u:User)-[]->(p:Project) return 'x' as row_number, u, p
How do I output a resultset like this, with row_number being the row number?
╒════════════╤═════════════════╤══════════════════╕
│"row_number"│"u" │"p" │
╞════════════╪═════════════════╪══════════════════╡
│"x" │{"name":"Martin"}│{"name":"Martin2"}│
├────────────┼─────────────────┼──────────────────┤
│"x" │{"name":"Martin"}│{"name":"Martin1"}│
├────────────┼─────────────────┼──────────────────┤
│"x" │{"name":"Bob"} │{"name":"Bob2"} │
├────────────┼─────────────────┼──────────────────┤
│"x" │{"name":"Bob"} │{"name":"Bob1"} │
└────────────┴─────────────────┴──────────────────┘
I came up with this:
match (u:User)-[]->(p:Project)
with collect([u,p]) as col
unwind range(0, size(col)-1) as un
return un as row, col[un][0] as User, col[un][1] as Project
Which outputs:
╒═════╤═════════════════╤══════════════════╕
│"row"│"User" │"Project" │
╞═════╪═════════════════╪══════════════════╡
│0 │{"name":"Martin"}│{"name":"Martin2"}│
├─────┼─────────────────┼──────────────────┤
│1 │{"name":"Martin"}│{"name":"Martin1"}│
├─────┼─────────────────┼──────────────────┤
│2 │{"name":"Bob"} │{"name":"Bob2"} │
├─────┼─────────────────┼──────────────────┤
│3 │{"name":"Bob"} │{"name":"Bob1"} │
└─────┴─────────────────┴──────────────────┘
This should do it:
MATCH (n)-[]->(m)
WITH COLLECT([n,m]) AS rows
WITH REDUCE(arr=[], i IN RANGE(0,SIZE(rows)-1) |
arr
+[[i]+rows[i]]
) AS rowsWithNumber
UNWIND rowsWithNumber As rowWithNumber
RETURN rowWithNumber[0] AS row_number,
rowWithNumber[1] AS n,
rowWithNumber[2] AS m
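The collect-then-index trick in both queries is `enumerate` in disguise; a quick Python sketch with made-up (user, project) tuples:

```python
# Hypothetical rows, as COLLECT([u, p]) would gather them.
rows = [('Martin', 'Martin2'), ('Martin', 'Martin1'),
        ('Bob', 'Bob2'), ('Bob', 'Bob1')]

# Pair each row with its index, like RANGE(0, SIZE(rows)-1) over the collection.
numbered = [(i, u, p) for i, (u, p) in enumerate(rows)]
```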

neo4j aggregate function by distance

I want to have some aggregated statistics by distance from root. For example,
(A)-[value:20]->(B)-[value:40]->(C)
(A)-[value:0]->(D)-[value:20]->(E)
CREATE (:firm {name:'A'}), (:firm {name:'B'}), (:firm {name:'C'}), (:firm {name:'D'}), (:firm {name:'E'});
MATCH (a:firm {name:'A'}), (b:firm {name:'B'}), (c:firm {name:'C'}), (d:firm {name:'D'}), (e:firm {name:'E'})
CREATE (a)-[:REL {value: 20}]->(b)-[:REL {value: 40}]->(c),
       (a)-[:REL {value: 0}]->(d)-[:REL {value: 20}]->(e);
I want to get the average value of A's immediate neighbors and that of the 2nd layer neighbors, i.e.,
+-------------------+
| distance | avg |
+-------------------+
| 1 | 10 |
| 2 | 30 |
+-------------------+
How should I do it? I have tried the following
MATCH p=(a:firm {name:'A'})-[r:REL*1..2]->(n:firm)
RETURN length(p), sum(r.value);
But I am not sure how to operate on the variable-length path r.
Similarly, is it possible to get the cumulative value? i.e.,
+-------------------+
| name | cum |
+-------------------+
| B | 20 |
| C | 60 |
| D | 0 |
| E | 20 |
+-------------------+
The query below solves the first problem. Note that it also handles paths of unequal length; to demonstrate, I added (E)-[:REL {value: 99}]->(F).
MATCH path=(:firm {name:'A'})-[:REL*]->(leaf:firm)
WHERE NOT (leaf)-[:REL]->(:firm)
WITH COLLECT(path) AS paths, max(length(path)) AS longest
UNWIND RANGE(1, longest) AS depth
WITH depth,
     REDUCE(sum = 0, path IN [p IN paths WHERE length(p) >= depth] |
       sum + relationships(path)[depth-1].value
     ) AS sumAtDepth,
     SIZE([p IN paths WHERE length(p) >= depth]) AS countAtDepth
RETURN depth, sumAtDepth, countAtDepth, sumAtDepth/countAtDepth AS avgAtDepth
returning
╒═══════╤════════════╤══════════════╤════════════╕
│"depth"│"sumAtDepth"│"countAtDepth"│"avgAtDepth"│
╞═══════╪════════════╪══════════════╪════════════╡
│1 │20 │2 │10 │
├───────┼────────────┼──────────────┼────────────┤
│2 │60 │2 │30 │
├───────┼────────────┼──────────────┼────────────┤
│3 │99 │1 │99 │
└───────┴────────────┴──────────────┴────────────┘
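The per-depth aggregation can be cross-checked with a plain-Python sketch over the edge values of the two root-to-leaf paths (including the extra 99 edge):

```python
# Each path is the list of relationship values from the root outward:
# A->B->C carries [20, 40]; A->D->E->F carries [0, 20, 99].
paths = [[20, 40], [0, 20, 99]]

longest = max(len(p) for p in paths)
stats = []
for depth in range(1, longest + 1):
    # Only paths long enough to have an edge at this depth contribute.
    at_depth = [p[depth - 1] for p in paths if len(p) >= depth]
    stats.append((depth, sum(at_depth), len(at_depth), sum(at_depth) // len(at_depth)))
```

This reproduces the table above: depth 1 averages 10, depth 2 averages 30, depth 3 averages 99.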
The second question can be answered as follows:
MATCH (root:firm {name:'A'})
MATCH (descendant:firm) WHERE EXISTS((root)-[:REL*]->(descendant))
WITH descendant,
     [p = (descendant)<-[:REL*]-(root) | p][0] AS path
WITH descendant,
     REDUCE(sum = 0, rel IN relationships(path) | sum + rel.value) AS cumulative
RETURN descendant.name, cumulative ORDER BY descendant.name
returning
╒═════════════════╤════════════╕
│"descendant.name"│"cumulative"│
╞═════════════════╪════════════╡
│"B" │20 │
├─────────────────┼────────────┤
│"C" │60 │
├─────────────────┼────────────┤
│"D" │0 │
├─────────────────┼────────────┤
│"E" │20 │
├─────────────────┼────────────┤
│"F" │119 │
└─────────────────┴────────────┘
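The cumulative values are just running sums of edge values along each path, which a short Python sketch makes explicit:

```python
from itertools import accumulate

# Node names along each root-to-leaf path, paired with the edge values
# leading to them: A->B->C and A->D->E->F.
paths = {('B', 'C'): [20, 40], ('D', 'E', 'F'): [0, 20, 99]}

cumulative = {}
for names, values in paths.items():
    # accumulate gives the running sum of edge values from the root.
    for name, cum in zip(names, accumulate(values)):
        cumulative[name] = cum
```

This matches the result table: B=20, C=60, D=0, E=20, F=119.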
May I suggest you try it with a reduce function; you can retrofit it to your code:
// Match something by name or distance
MATCH
// If you have a condition, put it in here
// WHERE A <> B AND n.name = m.name
// WITH filterItems, collect(m) AS myItems
// reduce will help sum/aggregate whatever you are looking for
RETURN reduce(sum = 0, x IN myItems | sum + x.cost)
LIMIT 10;

Cypher - order by rarity of type of child node

I have this Cypher query:
MATCH (Parent)-[R]-(Child) WHERE ID(Parent)=$parentId
CALL {
WITH Child
RETURN apoc.node.degree(Child) as ChildDegree
}
WITH Parent, Child, R, ChildDegree
RETURN Parent, Child, type(R), ChildDegree
ORDER BY R
LIMIT 35
Which returns limited data (the limit is 35). This limit is something that bothers me. Imagine that Parent has these children:
40 x A
3 x B
2 x C
In this situation my query sometimes returns (35 x A). What I'd like to achieve is to make this query order by the rarest type of child for this parent and for this example return this data:
2 x C
3 x B
30 x A
I tested the query below using the Movie database. The idea:
1. Collect parent, child, R, and child degree, and put all child degrees in a list (collect_nodes, collect_degs).
2. Create a range of indexes to accumulate the sum of child degrees (range_idx).
3. From 0 to the number of rows, compute a running sum of degrees.
4. For each parent, child, R, and child degree, check whether sum_degree <= 35.
5. Return the parent, child, R, and child degree.
Note that you cannot get rows summing to exactly 35, because what you limit is the number of rows rather than the child degrees. Also, please show us sample data to work on so that we can give you the best answer.
MATCH (Parent)-[R]-(Child) WHERE ID(Parent)=$parentId
CALL {
WITH Child
RETURN apoc.node.degree(Child) as ChildDegree
}
WITH Parent, Child, type(R) as R, ChildDegree ORDER BY R, ChildDegree
WITH collect({p:Parent, c:Child, r: R, cd:ChildDegree }) as collect_nodes, collect(ChildDegree) as collect_degs
WITH collect_nodes, collect_degs, RANGE(0, SIZE(collect_degs)-1) AS range_idx
UNWIND range_idx as idx
WITH collect_nodes[idx] as nodes, REDUCE(acc = 0, value in (collect_degs[idx] + collect_degs[..idx]) | acc + value) AS sum_degree
UNWIND nodes as n_set
WITH n_set.p as Parent, n_set.c as Child, n_set.r as R, n_set.cd as ChildDegree WHERE sum_degree <= 35
RETURN Parent, Child, R, ChildDegree
Sample result:
╒═══════════════════════════╤══════════════════════════════════════════════════════════════════════╤══════════╤═════════════╕
│"Parent" │"Child" │"R" │"ChildDegree"│
╞═══════════════════════════╪══════════════════════════════════════════════════════════════════════╪══════════╪═════════════╡
│{"name":"Jessica Thompson"}│{"name":"Paul Blythe"} │"FOLLOWS" │2 │
├───────────────────────────┼──────────────────────────────────────────────────────────────────────┼──────────┼─────────────┤
│{"name":"Jessica Thompson"}│{"name":"James Thompson"} │"FOLLOWS" │3 │
├───────────────────────────┼──────────────────────────────────────────────────────────────────────┼──────────┼─────────────┤
│{"name":"Jessica Thompson"}│{"name":"Angela Scope"} │"FOLLOWS" │3 │
├───────────────────────────┼──────────────────────────────────────────────────────────────────────┼──────────┼─────────────┤
│{"name":"Jessica Thompson"}│{"tagline":"Come as you are","title":"The Birdcage","released":1996} │"REVIEWED"│5 │
├───────────────────────────┼──────────────────────────────────────────────────────────────────────┼──────────┼─────────────┤
│{"name":"Jessica Thompson"}│{"tagline":"It's a hell of a thing, killing a man","title":"Unforgiven│"REVIEWED"│5 │
│ │","released":1992} │ │ │
├───────────────────────────┼──────────────────────────────────────────────────────────────────────┼──────────┼─────────────┤
│{"name":"Jessica Thompson"}│{"tagline":"Break The Codes","title":"The Da Vinci Code","released":20│"REVIEWED"│7 │
│ │06} │ │ │
├───────────────────────────┼──────────────────────────────────────────────────────────────────────┼──────────┼─────────────┤
│{"name":"Jessica Thompson"}│{"tagline":"Pain heals, Chicks dig scars... Glory lasts forever","titl│"REVIEWED"│8 │
│ │e":"The Replacements","released":2000} │ │ │
└───────────────────────────┴──────────────────────────────────────────────────────────────────────┴──────────┴─────────────┘
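The running-sum cutoff in that query can be sketched in Python; the (type, degree) rows below are made up for illustration and already sorted the way the query's ORDER BY leaves them:

```python
from itertools import accumulate

# Hypothetical rows of (relationship type, child degree).
rows = [('C', 5), ('C', 6), ('B', 4), ('B', 7), ('A', 8), ('A', 9)]

degrees = [d for _, d in rows]
running = list(accumulate(degrees))  # running sum of degrees, like sum_degree

# Keep only the prefix whose cumulative degree stays within the budget of 35.
kept = [row for row, s in zip(rows, running) if s <= 35]
```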

Evaluating expressions

I'm doing some past-papers and need to know if I am correct here.
Give step-by-step evaluations of the following expressions:
foo(0,[2,3,1])
foo(0,[4,0,1])
where foo is defined like this:
foo(_,[]) -> [];
foo(Y,[X|_]) when X==Y -> [X];
foo(Y,[X|Xs]) -> [X | foo(Y,Xs) ].
My answers:
1.
Foo(0, [2, 3, 1])
[2 | foo(0, 3, 1) ]
[2, 3| foo(0, 1) ]
[2, 3, 1 | foo (0)]
[2, 3, 1]
2.
Foo(0, [4, 0, 1])
[4 | foo(0, 0,1])
[4, 0]
Am I correct here?
At least the function parameters are wrong; I would say:
1.
foo(0,[2,3,1])
[2|foo(0,[3,1])] % 3rd clause
[2|[3|foo(0,[1])]] % 3rd clause
[2|[3|[1|foo(0,[])]]] % 3rd clause
[2|[3|[1|[]]]] % 1st clause
[2,3,1]
2.
foo(0,[4,0,1])
[4|foo(0,[0,1])] % 3rd clause
[4|[0]] % 2nd clause
[4, 0]
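The corrected trace can be cross-checked with a direct Python translation of foo: take elements up to and including the first occurrence of Y, or the whole list if Y never appears.

```python
def foo(y, xs):
    # 1st clause: foo(_, []) -> []
    if not xs:
        return []
    # 2nd clause: head equals Y -> [X]
    if xs[0] == y:
        return [xs[0]]
    # 3rd clause: keep the head and recurse on the tail
    return [xs[0]] + foo(y, xs[1:])
```

`foo(0, [2, 3, 1])` returns `[2, 3, 1]` and `foo(0, [4, 0, 1])` returns `[4, 0]`, agreeing with the step-by-step evaluations above.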

How to use average function in neo4j with collection

I want to calculate covariance of two vectors as collection
A=[1, 2, 3, 4]
B=[5, 6, 7, 8]
Cov(A,B)= Sigma[(ai-AVGa)*(bi-AVGb)] / (n-1)
My problems with the covariance computation are:
1) I cannot use a nested aggregate function when I write
SUM((ai-avg(a)) * (bi-avg(b)))
2) Alternatively, how can I process two collections with one reduce, such as:
REDUCE(x= 0.0, ai IN COLLECT(a) | bi IN COLLECT(b) | x + (ai-avg(a))*(bi-avg(b)))
3) If it is not possible to process two collections in one reduce, how can I relate their values to calculate covariance when they are separated?
REDUCE(x= 0.0, ai IN COLLECT(a) | x + (ai-avg(a)))
REDUCE(y= 0.0, bi IN COLLECT(b) | y + (bi-avg(b)))
I mean, can I write a nested reduce?
4) Is there any way to do this with UNWIND or extract?
Thank you in advance for any help.
cybersam's answer is totally fine but if you want to avoid the n^2 Cartesian product that results from the double UNWIND you can do this instead:
WITH [1,2,3,4] AS a, [5,6,7,8] AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
Edit:
Not calling anyone out, but let me elaborate more on why you would want to avoid the double UNWIND in https://stackoverflow.com/a/34423783/2848578. Like I said below, UNWINDing k length-n collections in Cypher results in n^k rows. So let's take two length-3 collections over which you want to calculate the covariance.
> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN aa, bb;
| aa | bb
---+----+----
1 | 1 | 4
2 | 1 | 5
3 | 1 | 6
4 | 2 | 4
5 | 2 | 5
6 | 2 | 6
7 | 3 | 4
8 | 3 | 5
9 | 3 | 6
Now we have n^k = 3^2 = 9 rows. At this point, taking the average of these identifiers means we're taking the average of 9 values.
> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
| AVG(aa) | AVG(bb)
---+---------+---------
1 | 2.0 | 5.0
Also as I said below, this doesn't affect the answer because the average of a repeating vector of numbers will always be the same. For example, the average of {1,2,3} is equal to the average of {1,2,3,1,2,3}. It is likely inconsequential for small values of n, but when you start getting larger values of n you'll start seeing a performance decrease.
Let's say you have two length-1000 vectors. Calculating the average of each with a double UNWIND:
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
| AVG(aa) | AVG(bb)
---+---------+---------
1 | 500.0 | 1500.0
714 ms
Is significantly slower than using REDUCE:
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
RETURN REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b;
| e_a | e_b
---+-------+--------
1 | 500.0 | 1500.0
4 ms
To bring it all together, I'll compare the two queries in full on length-1000 vectors:
> WITH RANGE(0, 1000) AS aa, RANGE(1000, 2000) AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS
covariance;
| covariance
---+------------
1 | 83583.5
9105 ms
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
| cov
---+---------
1 | 83583.5
33 ms
[EDITED]
This should calculate the covariance (according to your formula), given your sample inputs:
WITH [1,2,3,4] AS aa, [5,6,7,8] AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS covariance;
This approach is OK when n is small, as is the case with the original sample data.
However, as @NicoleWhite and @jjaderberg point out, when n is not small, this approach will be inefficient. The answer by @NicoleWhite is an elegant general solution.
How do you arrive at collections A and B? The avg function is an aggregating function and cannot be used in the REDUCE context, nor can it be applied to collections. You should calculate your average before you get to that point, but exactly how to do that best depends on how you arrive at the two collections of values. If you are at a point where you have individual result items that you then collect to get A and B, that's the point when you could use avg. For example:
WITH [1, 2, 3, 4] AS aa UNWIND aa AS a
WITH collect(a) AS aa, avg(a) AS aAvg
RETURN aa, aAvg
and for both collections
WITH [1, 2, 3, 4] AS aColl UNWIND aColl AS a
WITH collect(a) AS aColl, avg(a) AS aAvg
WITH aColl, aAvg,[5, 6, 7, 8] AS bColl UNWIND bColl AS b
WITH aColl, aAvg, collect(b) AS bColl, avg(b) AS bAvg
RETURN aColl, aAvg, bColl, bAvg
Once you have the two averages, let's call them aAvg and bAvg, and the two collections, aColl and bColl, you can do
RETURN REDUCE(x = 0.0, i IN range(0, size(aColl) - 1) | x + ((aColl[i] - aAvg) * (bColl[i] - bAvg))) / (size(aColl) - 1) AS covariance
Thank you so much, everyone. However, I wonder which one is most efficient:
1) Nested UNWIND with a range inside reduce (@cybersam)
2) Nested REDUCE (@NicoleWhite)
3) Nested WITH (resetting the query with WITH) (@jjaderberg)
BUT there is an important issue: why is there a difference between your computations and the actual value?
I mean, your covariance equals 1.6666666666666667, but the real-world covariance equals 1.25.
Please check: https://www.easycalculation.com/statistics/covariance.php
Vector X: [1, 2, 3, 4]
Vector Y: [5, 6, 7, 8]
I think this difference arises because some computations use n as the divisor instead of (n-1); growing the divisor from n-1 to n shrinks the result from 1.67 to 1.25.
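The divisor is indeed the whole story; a quick Python check on the two sample vectors, assuming nothing beyond the formula quoted in the question:

```python
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

n = len(a)
mean_a = sum(a) / n
mean_b = sum(b) / n

# Sum of products of deviations: Sigma[(ai - AVGa) * (bi - AVGb)] = 5.0 here.
s = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))

sample_cov = s / (n - 1)   # Bessel-corrected: 5 / 3 = 1.666...
population_cov = s / n     # population form: 5 / 4 = 1.25
```

So the queries above compute the sample covariance (divisor n-1), while the online calculator uses the population covariance (divisor n); neither is wrong, they just answer different questions.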