Calculating similarity index between two movies (Neo4j, Cypher)

As a follow-up to this question:
Multiple relationships in Match Cypher
MATCH (m:Movie { title: "The Matrix" })-[h1:HAS_TAG]->(t:Tag),
(t)<-[h2:HAS_TAG]-(sm:Movie),
(m)-[h:HAS_TAG]->(t0:Tag),
(sm)-[H:HAS_TAG]->(t1:Tag)
WHERE m <> sm
WITH DISTINCT sm, h
RETURN sm, collect(h.weight)
I am having trouble getting the distinct values of h1, h2, H, and h all at the same time.
I want to calculate the similarity index between any two movies, which depends on h1, h2, h, and H: (h1·h2 / |h||H|).
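Spelled out, the goal is the cosine similarity of the two movies' tag-weight vectors, restating the formula above (h1 and h2 are the weights on the shared tags; h and H run over all tags of m and sm, respectively):

similarity(m, sm) = Σ_t [h1(t) · h2(t)] / ( sqrt(Σ[h(t0)²]) · sqrt(Σ[H(t1)²]) )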
MATCH (m:Movie { title: "The Matrix" })-[h1:HAS_TAG]->(t:Tag),
(t)<-[h2:HAS_TAG]-(sm:Movie),
(m)-[h:HAS_TAG]->(t0:Tag),
(sm)-[H:HAS_TAG]->(t1:Tag)
WHERE m <> sm
WITH sum(h1.weight*h2.weight) as num, sm, H, m, h
WITH DISTINCT m, sqrt(sum(h.weight^2)) as den1, sm, H, num
WITH DISTINCT sm, sqrt(sum(H.weight^2)) as den2, den1, num
RETURN num/(den1*den2)
This is all messed up, but I am unable to figure out the right way to solve it. Please help.

This works and gives the correct answer...
MATCH (m:Movie { title: "The Matrix" })-[h1:HAS_TAG]->(t:Tag)<-[h2:HAS_TAG]-(sm)
WHERE m <> sm
WITH SUM(h1.weight * h2.weight) AS num,
SQRT(REDUCE(xDot = 0.0, a IN COLLECT(h1)| xDot + a.weight^2)) AS xLength,
SQRT(REDUCE(yDot = 0.0, b IN COLLECT(h2)| yDot + b.weight^2)) AS yLength, m, sm
RETURN num, xLength, yLength
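If you want the index itself rather than its three components, the division can go directly into the final RETURN (a direct extension of the query above):

RETURN num / (xLength * yLength) AS similarity

Note that here both denominators are built only from the shared tags (h1 and h2), whereas the answer below normalizes over each movie's full tag vector.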

Take a look at this example I generated using the Neo4j Console:
http://console.neo4j.org/?id=aq6cb3
The query should be:
MATCH (m:Movie { title: "The Matrix" })-[h1:HAS_TAG]->(t:Tag),
(t)<-[h2:HAS_TAG]-(sm:Movie),
(m)-[h:HAS_TAG]->(t0:Tag),
(sm)-[H:HAS_TAG]->(t1:Tag)
WHERE m <> sm
WITH m, sm,
collect(DISTINCT h) AS h,
collect(DISTINCT H) AS H,
sum(h1.weight*h2.weight) AS num
WITH m, sm, num,
sqrt(reduce(s = 0.0, x IN h | s +(x.weight^2))) AS den1,
sqrt(reduce(s = 0.0, x IN H | s +(x.weight^2))) AS den2
RETURN m.title, sm.title, (num/(den1*den2)) AS similarity
Which results in the following:
+---------------------------------------------------------------+
| m.title | sm.title | similarity |
+---------------------------------------------------------------+
| "The Matrix" | "The Matrix: Revolutions" | 3.859767091086958 |
| "The Matrix" | "The Matrix: Reloaded" | 1.4380667053087486 |
+---------------------------------------------------------------+
I used the reduce function to aggregate the relationship values from a distinct collection and perform the similarity index calculation.
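To reproduce this locally without the console link, here is a minimal sketch of test data; the tag names and weights are invented, not the actual console dataset:

CREATE (m:Movie {title: "The Matrix"}),
       (r1:Movie {title: "The Matrix: Reloaded"}),
       (r2:Movie {title: "The Matrix: Revolutions"}),
       (t1:Tag {name: "sci-fi"}),
       (t2:Tag {name: "dystopia"}),
       (m)-[:HAS_TAG {weight: 0.9}]->(t1),
       (m)-[:HAS_TAG {weight: 0.7}]->(t2),
       (r1)-[:HAS_TAG {weight: 0.8}]->(t1),
       (r2)-[:HAS_TAG {weight: 0.6}]->(t1),
       (r2)-[:HAS_TAG {weight: 0.5}]->(t2)

Running the query above against this graph produces one similarity row for each movie that shares at least one tag with "The Matrix".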

Related

How to return the value generated inside where clause in Cypher?

I have the following Cypher query:
MATCH (n)-[r]->(k)
WHERE ANY(x IN keys(n)
          WHERE round(apoc.text.levenshteinSimilarity(
                TRIM(REDUCE(mergedString = "", item IN n[x] | mergedString + item + " ")),
                "syn"), 4) > 0.8)
RETURN n, r, k
How can I return the score generated inside the WHERE clause by the similarity function?
I am trying to do this with WITH, without luck:
MATCH (n)-[r]->(k)
WITH *, [x in keys(n) | [x, round(apoc.text.levenshteinSimilarity(TRIM(REDUCE(mergedString = '', item in n[x] | mergedString + item + ' ')), 'syn'), 4)]] as scores
WHERE [s in scores WHERE s[1] >= 0.8]
RETURN n,r,k,[s in scores WHERE s[1] >= 0.8] AS attr_scores
To return only the relevant attributes, i.e. those with a score >= 0.8, update your list comprehension to this:
MATCH (n)-[r]->(k)
WITH *, [x in keys(n) | [x, round(apoc.text.levenshteinSimilarity(TRIM(REDUCE(mergedString = '', item in n[x] | mergedString + item + ' ')), 'syn'), 4)]] as scores
RETURN n,r,k,[s in scores WHERE s[1] >= 0.8 | s] AS attr_scores
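For clarity on the syntax: in a Cypher list comprehension, the WHERE part filters and the part after | projects. A standalone illustration with literal (hypothetical) scores:

RETURN [s IN [['a', 0.9], ['b', 0.5]] WHERE s[1] >= 0.8 | s] AS kept
// returns [["a", 0.9]]

When the projection is the element itself, the trailing | s may also be omitted.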
Finally, together with Charchit Kapoor, we found the best solution:
MATCH (n)-[r]->(k)
UNWIND keys(n) as key
WITH n, r, k, key, round(apoc.text.levenshteinSimilarity(TRIM(REDUCE(mergedString = "", item in n[key] | mergedString + item + " ")), "syn"), 4) as score
WITH n, r, k, collect({key:key, value:n[key], score:score}) as keyScores
WITH n, r, k, [s in keyScores WHERE s.score >= 0.8 | s] AS attr_scores
WHERE size(attr_scores) > 0
RETURN *
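To make the result shape concrete, a hypothetical walk-through (the node is invented; note the query assumes every property value is a list of strings, since REDUCE iterates over n[key]):

// Hypothetical node: (n {name: ["syn"], tags: ["a", "b"]})
// key = "name": REDUCE -> "syn " -> TRIM -> "syn"; levenshteinSimilarity("syn", "syn") = 1.0 -> kept
// key = "tags": REDUCE -> "a b " -> TRIM -> "a b"; levenshteinSimilarity("a b", "syn") = 0.0 -> dropped
// attr_scores = [{key: "name", value: ["syn"], score: 1.0}]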

Extract decorating nodes if they exist, but still return the path if they do not

I have the following graph
                    (y1:Y)
                      ^
                      |
(a1:A) -> (b1:B) -> (c1:C)

                    (e1:E)
                      ^
                      |
                    (d1:D)
                      ^
                      |
(a2:A) -> (b2:B) -> (c2:C)

(a3:A) -> (b3:B) -> (c3:C)
I would like to find the paths between nodes with labels A and C. I can use the query
match p=((:A)-[*]->(:C))
return p
But I also want to get the Y node, and the D and E nodes, if these decorating nodes exist. If I try:
match p=((:A)-[*]->(cc:C)), (cc)-->(yy:Y), (cc)-[*]->(dd:D)-[*]->(ee:E)
return p, yy, dd, ee
then it only returns the path if the C node has Y, D, and E all connected to it.
The output that I need is:
a1->b1->c1, y1, null
a2->b2->c2, null, [[d1, e1]]
a3->b3->c3, null, null
I.e., if a decorating node does not exist, just return null. For the array, it can be null or an empty array. Also, the D and E nodes should be grouped into an array of arrays, since there can be many pairs of D and E.
What is the best way to achieve this?
This should do it, returning an empty array for deDecoration if there aren't any D-E decorations:
MATCH p=((:A)-[*]->(c:C))
WITH p,
HEAD([(c)--(y:Y) | y ]) AS yDecoration,
[(c)-[*]->(d:D)-[*]->(e:E) | [d,e]] AS deDecoration
RETURN p, yDecoration, deDecoration
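For reference, the question's sample graph can be created with something like this (a sketch; the relationship type :R is invented, since the question only draws arrows):

CREATE (a1:A {name: 'a1'})-[:R]->(b1:B {name: 'b1'})-[:R]->(c1:C {name: 'c1'})-[:R]->(y1:Y {name: 'y1'}),
       (a2:A {name: 'a2'})-[:R]->(b2:B {name: 'b2'})-[:R]->(c2:C {name: 'c2'})-[:R]->(d1:D {name: 'd1'})-[:R]->(e1:E {name: 'e1'}),
       (a3:A {name: 'a3'})-[:R]->(b3:B {name: 'b3'})-[:R]->(c3:C {name: 'c3'})

On that graph, the query above returns y1 for c1, [[d1, e1]] for c2, and null plus an empty array for c3, matching the requested output.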
With a test graph containing multiple D-E pairs (pictured in the original answer), this query
MATCH p=((:A)-[*]->(c:C))
WITH REDUCE(s='' , node IN nodes(p) | s + CASE WHEN s='' THEN '' ELSE '->' END + node.name) AS p,
HEAD([(c)--(y:Y) | y.name ]) AS yDecoration,
[(c)-[*]->(d:D)-[*]->(e:E) | [d.name,e.name]] AS deDecoration
RETURN p, yDecoration, deDecoration
returns
╒════════════╤═════════════╤═════════════════════════╕
│"p" │"yDecoration"│"deDecoration" │
╞════════════╪═════════════╪═════════════════════════╡
│"A2->B2->C2"│null │[] │
├────────────┼─────────────┼─────────────────────────┤
│"A1->B1->C1"│null │[["D2","E2"],["D1","E1"]]│
├────────────┼─────────────┼─────────────────────────┤
│"A3->B3->C3"│"Y1" │[] │
└────────────┴─────────────┴─────────────────────────┘

Cypher query doesn't behave as expected with multiple match blocks?

I've got the following query:
MATCH (u:User) WHERE u.username = "ben"
OPTIONAL MATCH (u)-[:HAS]->(pl)
//MATCH (u)-[r1:IS_AT|PREFERS|DESIRES|VALUES]->()<-[]-(fp:FitnessProgram) WHERE NOT (fp)-[:LIMITED_BY]-(pl)
//WITH u, pl, fp, coalesce(r1.importance, 0.5) AS importance
//WITH u, pl, fp, collect({name: fp.name, importance: importance}) AS fpTraits
//WITH u, pl, reduce(s = 0, t IN fpTraits | s + t.importance) AS fpScore order by fpScore
MATCH (u)-[r2:IS_AT|PREFERS|DESIRES|VALUES]->()<-[]-(ns:NutritionalSupplement) WHERE NOT (ns)-[:LIMITED_BY]-(pl)
WITH u, ns, coalesce(r2.importance, 0.5) AS importance
WITH u, ns, collect({name: ns.name, importance: importance}) AS nsTraits
WITH u, ns, reduce(s = 0, t IN nsTraits | s + t.importance) AS nsScore order by nsScore desc limit 5
return u, ns.name, nsScore
As it is, with the four lines commented out, it works correctly and gives me the top 5 nutritional supplements as expected.
If I comment out the bottom block and uncomment the top block, that one works as expected too.
If I have both uncommented, like below, neither block works: I get a bunch of dupes and the scores are all crazy. It seems like the two matches get combined in some way I'm not understanding yet (I'm new to Neo4j)?
MATCH (u:User) WHERE u.username = "ben"
OPTIONAL MATCH (u)-[:HAS]->(pl)
MATCH (u)-[r1:IS_AT|PREFERS|DESIRES|VALUES]->()<-[]-(fp:FitnessProgram) WHERE NOT (fp)-[:LIMITED_BY]-(pl)
WITH u, pl, fp, coalesce(r1.importance, 0.5) AS importance
WITH u, pl, fp, collect({name: fp.name, importance: importance}) AS fpTraits
WITH u, pl, fp, reduce(s = 0, t IN fpTraits | s + t.importance) AS fpScore order by fpScore desc limit 5
MATCH (u)-[r2:IS_AT|PREFERS|DESIRES|VALUES]->()<-[]-(ns:NutritionalSupplement) WHERE NOT (ns)-[:LIMITED_BY]-(pl)
WITH u, fp, fpScore, ns, coalesce(r2.importance, 0.5) AS importance
WITH u, fp, fpScore, ns, collect({name: ns.name, importance: importance}) AS nsTraits
WITH u, fp, fpScore, ns, reduce(s = 0, t IN nsTraits | s + t.importance) AS nsScore order by nsScore desc limit 5
return u, fp.name, fpScore, ns.name, nsScore
What values of fp do you expect to have in the last block? It's not part of the last query, so I don't think it can be in your WITH statements.
You do not need to keep declaring fp in your WITH statements:
MATCH (u)-[r2:IS_AT|PREFERS|DESIRES|VALUES]->()<-[]-(ns:NutritionalSupplement)
WHERE NOT (ns)-[:LIMITED_BY]-(pl)
WITH u, ns, coalesce(r2.importance, 0.5) AS importance
WITH u, ns, collect({name: ns.name, importance: importance}) AS nsTraits
WITH u, ns, reduce(s = 0, t IN nsTraits | s + t.importance) AS nsScore order by nsScore desc limit 5
return u, ns.name, nsScore
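If you do want both top-5 lists in a single query, the general pattern is to finish the first aggregation and collect its rows before starting the second MATCH, so the second block begins from a single row again. A sketch (untested against your data, and using sum() in place of the collect/reduce pair):

MATCH (u:User) WHERE u.username = "ben"
OPTIONAL MATCH (u)-[:HAS]->(pl)
MATCH (u)-[r1:IS_AT|PREFERS|DESIRES|VALUES]->()<-[]-(fp:FitnessProgram)
WHERE NOT (fp)-[:LIMITED_BY]-(pl)
WITH u, pl, fp, sum(coalesce(r1.importance, 0.5)) AS fpScore
ORDER BY fpScore DESC LIMIT 5
WITH u, pl, collect({name: fp.name, score: fpScore}) AS fpResults
MATCH (u)-[r2:IS_AT|PREFERS|DESIRES|VALUES]->()<-[]-(ns:NutritionalSupplement)
WHERE NOT (ns)-[:LIMITED_BY]-(pl)
WITH u, fpResults, ns, sum(coalesce(r2.importance, 0.5)) AS nsScore
ORDER BY nsScore DESC LIMIT 5
RETURN u, fpResults, ns.name, nsScore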

Make a path where next node is not the previous node?

I have ~1.5M nodes in a graph, structured as shown in the picture from the original post.
I run a Cypher query that performs calculations on each relationship traversed:
WITH 1 AS startVal
MATCH x = (c:Currency)-[r:Arb*2]->(m)
WITH x, REDUCE(s = startVal, e IN r | s * e.rate) AS endVal, startVal
RETURN EXTRACT(n IN NODES(x) | n) as Exchanges,
extract ( e IN relationships(x) | startVal * e.rate) AS Rel,
endVal, endVal - startVal AS Profit
ORDER BY Profit DESC LIMIT 5
The problem is that it returns the path ("One")->("hop")->("One"), which is useless for me.
How can I make it not choose the previously visited node as the next node (i.e., "One"->"hop"->any node other than "One")?
I have read that NODE_RECENT should address my issue, but I found no example of how to specify the number of recent nodes via the REST API or APOC procedures.
Is there a Cypher query for my case?
Thank you.
P.S. I am extremely new (less than 2 months) to Neo4j and to coding, so my apologies if there is an obvious simple solution.
I don't know if I understood your question completely, but I believe your problem can be solved by putting a WHERE clause on the MATCH to prevent the undesired relationship from being matched, like this:
WITH 1 AS startVal
MATCH x = (c:Currency)-[r:Arb*2]->(m)
WHERE NOT (m)-[:Arb]->(c)
WITH x, REDUCE(s = startVal, e IN r | s * e.rate) AS endVal, startVal
RETURN EXTRACT(n IN NODES(x) | n) as Exchanges,
extract ( e IN relationships(x) | startVal * e.rate) AS Rel,
endVal, endVal - startVal AS Profit
ORDER BY Profit DESC LIMIT 5
Try inserting this clause after your MATCH clause, to filter out cases where c and m are the same:
WHERE c <> m
[EDITED]
That is:
WITH 1 AS startVal
MATCH x = (c:Currency)-[r:Arb*2]->(m)
WHERE c <> m
WITH x, REDUCE(s = startVal, e IN r | s * e.rate) AS endVal, startVal
RETURN EXTRACT(n IN NODES(x) | n) as Exchanges,
extract ( e IN relationships(x) | startVal * e.rate) AS Rel,
endVal, endVal - startVal AS Profit
ORDER BY Profit DESC LIMIT 5;
After using this query to create test data:
CREATE
(c:Currency {name: 'One'})-[:Arb {rate:1}]->(h:Account {name: 'hop'})-[:Arb {rate:2}]->(t:Currency {name: 'Two'}),
(t)-[:Arb {rate:3}]->(h)-[:Arb {rate:4}]->(c)
the above query produces these results:
+-----------------------------------------------------------------------------------------+
| Exchanges | Rel | endVal | Profit |
+-----------------------------------------------------------------------------------------+
| [Node[8]{name:"Two"},Node[7]{name:"hop"},Node[6]{name:"One"}] | [3,4] | 12 | 11 |
| [Node[6]{name:"One"},Node[7]{name:"hop"},Node[8]{name:"Two"}] | [1,2] | 2 | 1 |
+-----------------------------------------------------------------------------------------+
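The c <> m test handles this 2-hop case, but with longer variable-length patterns any intermediate node could repeat as well. One way to require every node in the path to be distinct (assuming the APOC library is installed) is:

MATCH x = (c:Currency)-[r:Arb*2..4]->(m)
WHERE SIZE(apoc.coll.toSet(NODES(x))) = SIZE(NODES(x))
RETURN x LIMIT 5

Note that Cypher already guarantees each relationship is traversed at most once within a single matched path, so only nodes need this check.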

How to use the average function in Neo4j with collections

I want to calculate the covariance of two vectors represented as collections:
A = [1, 2, 3, 4]
B = [5, 6, 7, 8]
Cov(A, B) = Σ[(ai − avg(A)) × (bi − avg(B))] / (n − 1)
My problems with the covariance computation are:
1) I cannot have a nested aggregate function, as when I write
SUM((ai-avg(a)) * (bi-avg(b)))
2) Or, put another way, how can I process two collections with one REDUCE, such as:
REDUCE(x= 0.0, ai IN COLLECT(a) | bi IN COLLECT(b) | x + (ai-avg(a))*(bi-avg(b)))
3) If it is not possible to process two collections in one REDUCE, how can their values be related to calculate the covariance when they are computed separately?
REDUCE(x= 0.0, ai IN COLLECT(a) | x + (ai-avg(a)))
REDUCE(y= 0.0, bi IN COLLECT(b) | y + (bi-avg(b)))
That is, can I write a nested REDUCE?
4) Is there any way to do this with UNWIND or EXTRACT?
Thank you in advance for any help.
cybersam's answer is totally fine, but if you want to avoid the n^2 Cartesian product that results from the double UNWIND, you can do this instead:
WITH [1,2,3,4] AS a, [5,6,7,8] AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
Edit:
Not calling anyone out, but let me elaborate more on why you would want to avoid the double UNWIND in https://stackoverflow.com/a/34423783/2848578. Like I said below, UNWINDing k length-n collections in Cypher results in n^k rows. So let's take two length-3 collections over which you want to calculate the covariance.
> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN aa, bb;
| aa | bb
---+----+----
1 | 1 | 4
2 | 1 | 5
3 | 1 | 6
4 | 2 | 4
5 | 2 | 5
6 | 2 | 6
7 | 3 | 4
8 | 3 | 5
9 | 3 | 6
Now we have n^k = 3^2 = 9 rows. At this point, taking the average of these identifiers means we're taking the average of 9 values.
> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
| AVG(aa) | AVG(bb)
---+---------+---------
1 | 2.0 | 5.0
Also as I said below, this doesn't affect the answer because the average of a repeating vector of numbers will always be the same. For example, the average of {1,2,3} is equal to the average of {1,2,3,1,2,3}. It is likely inconsequential for small values of n, but when you start getting larger values of n you'll start seeing a performance decrease.
Let's say you have two length-1000 vectors. Calculating the average of each with a double UNWIND:
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
| AVG(aa) | AVG(bb)
---+---------+---------
1 | 500.0 | 1500.0
714 ms
Is significantly slower than using REDUCE:
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
RETURN REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b;
| e_a | e_b
---+-------+--------
1 | 500.0 | 1500.0
4 ms
To bring it all together, I'll compare the two queries in full on length-1000 vectors:
> WITH RANGE(0, 1000) AS aa, RANGE(1000, 2000) AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS covariance;
| covariance
---+------------
1 | 83583.5
9105 ms
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
| cov
---+---------
1 | 83583.5
33 ms
[EDITED]
This should calculate the covariance (according to your formula), given your sample inputs:
WITH [1,2,3,4] AS aa, [5,6,7,8] AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS covariance;
This approach is OK when n is small, as is the case with the original sample data.
However, as @NicoleWhite and @jjaderberg point out, when n is not small, this approach will be inefficient. The answer by @NicoleWhite is an elegant general solution.
How do you arrive at collections A and B? The avg function is an aggregating function and cannot be used in the REDUCE context, nor can it be applied to collections. You should calculate your average before you get to that point, but exactly how to do that best depends on how you arrive at the two collections of values. If you are at a point where you have individual result items that you then collect to get A and B, that's the point when you could use avg. For example:
WITH [1, 2, 3, 4] AS aa UNWIND aa AS a
WITH collect(a) AS aa, avg(a) AS aAvg
RETURN aa, aAvg
and for both collections
WITH [1, 2, 3, 4] AS aColl UNWIND aColl AS a
WITH collect(a) AS aColl, avg(a) AS aAvg
WITH aColl, aAvg,[5, 6, 7, 8] AS bColl UNWIND bColl AS b
WITH aColl, aAvg, collect(b) AS bColl, avg(b) AS bAvg
RETURN aColl, aAvg, bColl, bAvg
Once you have the two averages, let's call them aAvg and bAvg, and the two collections, aColl and bColl, you can do
RETURN REDUCE(x = 0.0, i IN range(0, size(aColl) - 1) | x + ((aColl[i] - aAvg) * (bColl[i] - bAvg))) / (size(aColl) - 1) AS covariance
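Putting those fragments together into one runnable query over the sample data:

WITH [1, 2, 3, 4] AS aColl UNWIND aColl AS a
WITH collect(a) AS aColl, avg(a) AS aAvg
WITH aColl, aAvg, [5, 6, 7, 8] AS bColl UNWIND bColl AS b
WITH aColl, aAvg, collect(b) AS bColl, avg(b) AS bAvg
RETURN REDUCE(x = 0.0, i IN range(0, size(aColl) - 1) | x + ((aColl[i] - aAvg) * (bColl[i] - bAvg))) / (size(aColl) - 1) AS covariance
// returns 1.6666666666666667 for these vectors (divisor n-1)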
Thank you so much, but I wonder which one is most efficient:
1) Nested UNWIND and RANGE inside REDUCE -> @cybersam
2) Nested REDUCE -> @NicoleWhite
3) Nested WITH (resetting the query with WITH) -> @jjaderberg
But the important issue is:
why is there a difference between your computations and the real, actual computation?
I mean, your covariance equals 1.6666666666666667,
but the real-world covariance equals 1.25.
Please check: https://www.easycalculation.com/statistics/covariance.php
Vector X: [1, 2, 3, 4]
Vector Y: [5, 6, 7, 8]
I think this difference arises because some computations do not use (n-1) as the divisor and just use n instead. Growing the divisor from n-1 to n shrinks the result from 1.67 to 1.25.
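Working through the numbers confirms this: with avg(X) = 2.5 and avg(Y) = 6.5,

Σ (xi − avg(X)) × (yi − avg(Y)) = 2.25 + 0.25 + 0.25 + 2.25 = 5

so the sample covariance is 5/(n−1) = 5/3 ≈ 1.667, and the population covariance is 5/n = 5/4 = 1.25. The queries in this thread compute the former; the linked calculator evidently uses the latter.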
