Cosine similarity Vectors that need to have same size - neo4j

I want to calculate the similarity between an open role that needs skills X, Y, Z with proficiency levels W, T, L and different employees. But not all employees will have all of the X, Y, Z skills, so we need to put a 0 where a skill is not present.
What I have is not working, since it only matches when both the role and the employee have the skill. Any idea? Thanks in advance
MATCH (p1:Employee)-[x:HAS_SKILL]->(sk:Personal_Skill)<-[y:REQUIRES_SKILL] -(p2:Role {name:'Role 1-Analytics Manager'})
WITH SUM(x.proficiency * y.proficiency) AS xyDotProduct,
SQRT(REDUCE(xDot = 0.0, a IN COLLECT(x.proficiency) | xDot + a^2)) AS xLength,
SQRT(REDUCE(yDot = 0.0, b IN COLLECT(y.proficiency) | yDot + b^2)) AS yLength,
p1, p2
MERGE (p1)-[s:SIMILARITY]-(p2)
SET s.similarity = xyDotProduct / (xLength * yLength)
RETURN p1.name, s.similarity

The key here is to split the query into several MATCH clauses, use an OPTIONAL MATCH, and use COALESCE() to substitute a default value for a null.
First step is to MATCH on all the skills required of the role.
Next is to MATCH on all employees.
Last is an OPTIONAL MATCH from the employee to the skill, which will give us a null for the HAS_SKILL relationship if the employee doesn't have the skill.
From there, we get the proficiencies, using COALESCE() to give a default of 0 where HAS_SKILL is null.
MATCH (sk:Personal_Skill)<-[y:REQUIRES_SKILL] -(p2:Role {name:'Role 1-Analytics Manager'})
MATCH (p1:Employee)
OPTIONAL MATCH (p1)-[x:HAS_SKILL]->(sk)
WITH p1, COALESCE(x.proficiency, 0) as xProf, y.proficiency as yProf, p2
WITH SUM(xProf * yProf) AS xyDotProduct,
SQRT(REDUCE(xDot = 0.0, a IN COLLECT(xProf) | xDot + a^2)) AS xLength,
SQRT(REDUCE(yDot = 0.0, b IN COLLECT(yProf) | yDot + b^2)) AS yLength,
p1, p2
MERGE (p1)-[s:SIMILARITY]-(p2)
SET s.similarity = xyDotProduct / (xLength * yLength)
RETURN p1.name, s.similarity
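To sanity-check the numbers the query produces, the same arithmetic can be sketched in plain Python. This is only an illustration, not part of the Cypher answer: the function name and the sample proficiencies are made up, and the guard against a zero-length vector is an added assumption (the query above divides by xLength * yLength without such a check).

```python
import math

def cosine_similarity(role_skills, employee_skills):
    """Cosine similarity over the role's skills; skills the employee lacks count as 0."""
    skills = list(role_skills)
    role_vec = [role_skills[s] for s in skills]
    # Mirrors COALESCE(x.proficiency, 0): a missing skill becomes 0.
    emp_vec = [employee_skills.get(s, 0) for s in skills]
    dot = sum(r * e for r, e in zip(role_vec, emp_vec))
    role_len = math.sqrt(sum(r * r for r in role_vec))
    emp_len = math.sqrt(sum(e * e for e in emp_vec))
    if role_len == 0 or emp_len == 0:
        # Assumption: treat "no overlap at all" as similarity 0 instead of dividing by zero.
        return 0.0
    return dot / (role_len * emp_len)

role = {"X": 3, "Y": 4}            # hypothetical required proficiencies
full_match = {"X": 3, "Y": 4, "Q": 5}
partial_match = {"X": 3}
no_match = {}

print(cosine_similarity(role, full_match))
print(cosine_similarity(role, partial_match))
print(cosine_similarity(role, no_match))
```

As in the query, only the skills the role requires enter the vectors, so an employee's extra skills (Q here) do not change the score.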

Related

Finding right path in Cypher Neo4j

I'm working with the Flight Analyzer database (https://neo4j.com/graphgist/flight-analyzer).
It has a few node and relationship types.
Nodes:
Airport
(SEA:Airport { name:'SEA' })
Flight
(f0:Flight { date:'11/30/2015 04:24:12', duration:218, distance:1721, airline:'19977' })
Ticket
(t1f0:Ticket { class:'economy', price:1344.75 })
Relationships
Destination
(f0)-[:DESTINATION]->(ORD)
Origin
(f0)-[:ORIGIN]->(SEA)
Assign
(t1f0)-[:ASSIGN]->(f0)
Now I need to find some paths, and I have a problem with the ORIGIN - FLIGHT - DESTINATION connection.
I need to find all airports that are connected to the LAX airport with a sum of ticket prices < 3000.
I tried
MATCH path = (origin:Airport { name:"LAX" })<-[r:ORIGIN|DESTINATION*..5]->(destination:Airport)
WHERE REDUCE(s = 0, n IN [x IN NODES(path) WHERE 'Flight' IN LABELS(x)] |
s + [(n)<-[:ASSIGN]-(ticket) | ticket.price][0]
) < 3000
RETURN path
but in this solution LAX can be both ORIGIN and DESTINATION. I only want to choose paths that always have the same order: airport1 <- origin - flight1 - destination -> airport2 <- origin - flight2 - destination -> airport3, etc.
I also need to take departure and arrival times into account, so
flight1 date + duration < flight2 date, then flight2 date + duration < flight3 date, etc.
[UPDATED]
This query should check that:
matched paths have alternating ORIGIN/DESTINATION relationships, and
every departing flight lands at least 30 minutes before the next departing flight (if any), and
the sum of the ticket prices of the Flight nodes (which are every other node starting at the second one) < 3000
MATCH p = (origin:Airport {name: 'LAX'})-[:ORIGIN|DESTINATION*..5]-(destination:Airport)
WHERE
ALL(i IN RANGE(0, LENGTH(p)-1) WHERE
TYPE(RELATIONSHIPS(p)[i]) = ['ORIGIN', 'DESTINATION'][i % 2] AND
(i%4 <> 1 OR (i + 2) > LENGTH(p) OR
(apoc.date.parse(NODES(p)[i].date,'m','MM/dd/yyyy hh:mm:ss') + NODES(p)[i].duration + 30) < apoc.date.parse(NODES(p)[i+2].date,'m','MM/dd/yyyy hh:mm:ss'))
) AND
REDUCE(s = 0, n IN [k IN RANGE(1, LENGTH(p), 2) | NODES(p)[k]] |
s + [(n)<-[:ASSIGN]-(ticket) | ticket.price][0]
) < 3000
RETURN p
The query uses the apoc.date.parse function to convert each date into the number of epoch minutes, so that a duration (assumed to also be in minutes) can be added to it.
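The layover and budget checks can also be sketched outside the database. The following Python is a rough illustration under assumptions: the function name, the dict layout, and the second flight's values are hypothetical, the date strings are assumed to use a 24-hour clock, and full timestamps are compared rather than epoch minutes.

```python
from datetime import datetime, timedelta

DATE_FORMAT = "%m/%d/%Y %H:%M:%S"  # assumption: 24-hour clock, as in '11/30/2015 04:24:12'

def is_valid_itinerary(flights, min_layover_minutes=30, budget=3000):
    """flights: dicts with 'date', 'duration' (minutes) and 'price', in travel order."""
    for current, following in zip(flights, flights[1:]):
        landing = datetime.strptime(current["date"], DATE_FORMAT) \
            + timedelta(minutes=current["duration"])
        next_departure = datetime.strptime(following["date"], DATE_FORMAT)
        # Each flight must land at least min_layover_minutes before the next one departs.
        if landing + timedelta(minutes=min_layover_minutes) >= next_departure:
            return False
    # The total ticket price must stay under the budget.
    return sum(f["price"] for f in flights) < budget

first = {"date": "11/30/2015 04:24:12", "duration": 218, "price": 1344.75}
second = {"date": "11/30/2015 09:00:00", "duration": 120, "price": 1000.0}  # hypothetical
print(is_valid_itinerary([first, second]))
```

The first flight lands at 08:02:12, so a second departure at 09:00:00 leaves more than the 30-minute margin and the total price stays below 3000.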
I believe you should create new relationships, such as FLY_TO, from one airport to another, carrying the ticket price and ticket class. That can be useful,
because then you can find flights more easily.
match
(a:Airport )<-[:ORIGIN]-(f:Flight)-[:DESTINATION ]->(b:Airport ),
(f)-[:ASSIGN]-(t:Ticket)
CREATE (a)-[r:FLY_TO {price:t.price,Class:t.class} ]->(b)

How do I find the maximum value between several specific nodes in neo4j?

EXAMPLE:
A unidirectional graph of the following type is given:
CREATE
(b0:Bar {id:1, value: 1}),
(b1:Bar {id:2, value: 4}),
(b2:Bar {id:3, value: 3}),
(b3:Bar {id:4, value: 5}),
(b4:Bar {id:5, value: 9}),
(b5:Bar {id:6, value: 7}),
(b0)-[:NEXT_BAR]->(b1),
(b1)-[:NEXT_BAR]->(b2),
(b2)-[:NEXT_BAR]->(b3),
(b3)-[:NEXT_BAR]->(b4),
(b4)-[:NEXT_BAR]->(b5);
MATCH (b1)->[*1..5]->(b2)->(b3)->[*1..5]->(b4)
WHERE // here you need to write a condition that the maximum value between the value of nodes b3 and b4 is greater than the maximum value of nodes b1 and b2
RETURN //b1_b2_max, b3_b4_max
In other words, the result should be as follows:
b1_b2_max | b3_b4_max
4 | 9
Can you tell me how I can find aggregated information between certain nodes (including these nodes)?
What should my request look like?
You could do something like this to get the right values.
// start with a set of slices you would like to get the max from
WITH [[1,3],[3,5]] AS slices
// match the path you want to get the slices from
MATCH path=(:Bar {id: 1})-[:NEXT_BAR*..5]->(end:Bar)
WHERE NOT (end)-->()
WITH slices, path
// look at the nodes in each slice of the path
UNWIND slices AS slice
// find the max value in the slice
UNWIND nodes(path)[slice[0]..slice[1]] AS b
RETURN 'b' + toString(slice[0]) + '_b' + toString(slice[1]-1) + '_max', max(b.value) AS max_value
Rather than returning the slice and max values in rows you can instead collect them as pairs and convert that to a map using apoc.map.fromPairs. Then access specific values in the map and return them as columns.
WITH [[1,3],[3,5]] AS slices
MATCH path=(:Bar {id: 1})-[:NEXT_BAR*..5]->(end:Bar)
WHERE NOT (end)-->()
WITH slices, path
UNWIND slices AS slice
UNWIND nodes(path)[slice[0]..slice[1]] AS b
WITH ['b' + toString(slice[0]) + '_b' + toString(slice[1]-1) + '_max', max(b.value)] AS pair
WITH collect(pair) AS pairs
RETURN apoc.map.fromPairs(pairs)['b1_b2_max'] AS b1_b2_max,
apoc.map.fromPairs(pairs)['b3_b4_max'] AS b3_b4_max
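For comparison, the slicing logic reads like this in plain Python. This is only a sketch of the idea, using the values from the example graph in path order; the function name is made up.

```python
def slice_maxima(values, slices):
    """Return the max of each half-open slice [start, end), keyed like the Cypher columns."""
    result = {}
    for start, end in slices:
        key = f"b{start}_b{end - 1}_max"
        result[key] = max(values[start:end])
    return result

# Node values along the NEXT_BAR path b0..b5 from the example.
values = [1, 4, 3, 5, 9, 7]
print(slice_maxima(values, [[1, 3], [3, 5]]))
# {'b1_b2_max': 4, 'b3_b4_max': 9}
```

Like nodes(path)[slice[0]..slice[1]] in the query, the Python slice excludes its upper bound, which is why the key uses end - 1.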

Cypher: analog of `sort -u` to merge 2 collections?

Suppose I have a node with a collection in a property, say
START x = node(17) SET x.c = [ 4, 6, 2, 3, 7, 9, 11 ];
and from somewhere (e.g. from a .csv file) I get another collection of values, say
c1 = [ 11, 4, 5, 8, 1, 9 ]
I'm treating my collections as just sets; the order of elements does not matter. What I need is to merge x.c with c1 with some magic operation so that the resulting x.c will contain only the distinct elements from both. The following idea comes to mind (not yet tested):
LOAD CSV FROM "file:///tmp/additives.csv" as row
START x=node(TOINT(row[0]))
MATCH c1 = [ elem IN SPLIT(row[1], ':') | TOINT(elem) ]
SET
x.c = [ newxc IN x.c + c1 WHERE (newx IN x.c AND newx IN c1) ];
This won't work: it will give an intersection, not a collection of distinct items.
More RTFM gives another idea: use REDUCE()? But how?
How can Cypher be extended with a new built-in function UNIQUE() which accepts a collection and returns a collection cleaned of duplicates?
UPD. It seems the FILTER() function is something close, but it gives an intersection again :(
x.c = FILTER( newxc IN x.c + c1 WHERE (newx IN x.c AND newx IN c1) )
WBR,
Andrii
How about something like this...
with [1,2,3] as a1
, [3,4,5] as a2
with a1 + a2 as all
unwind all as a
return collect(distinct a) as unique
Add two collections and return the collection of distinct elements.
dec 15, 2014 - here is an update to my answer...
I started with a node in the neo4j database...
//create a node in the DB with a collection of values on it
create (n:Node {name:"Node 01",values:[4,6,2,3,7,9,11]})
return n
I created a csv sample file with two columns...
Name,Coll
"Node 01","11,4,5,8,1,9"
I created a LOAD CSV statement...
LOAD CSV
WITH HEADERS FROM "file:///c:/Users/db/projects/coll-merge/load_csv_file.csv" as row
// find the matching node
MATCH (x:Node)
WHERE x.name = row.Name
// merge the collections
WITH x.values + split(row.Coll,',') AS combo, x
// process the individual values
UNWIND combo AS value
// use toInt as the values from the csv come in as string
// may be a better way around this but i am a little short on time
WITH toInt(value) AS value, x
// might as well sort 'em so they are all purdy
ORDER BY value
WITH collect(distinct value) AS values, x
SET x.values = values
You could use reduce like this:
with [1,2,3] as a, [3,4,5] as b
return reduce(r = [], x in a + b | case when x in r then r else r + [x] end)
Since Neo4j 3.0, with APOC Procedures you can easily solve this with apoc.coll.union(). In 3.1+ it's a function, and can be used like this:
...
WITH apoc.coll.union(list1, list2) as unionedList
...
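The reduce() variant above is easy to reason about when written out step by step. A minimal Python sketch of the same accumulation (the function name is made up; the sample lists are the ones from the question):

```python
def union_distinct(a, b):
    """Order-preserving union: append each element only if it is not already in the result."""
    result = []
    for x in a + b:
        if x not in result:
            result.append(x)
    return result

merged = union_distinct([4, 6, 2, 3, 7, 9, 11], [11, 4, 5, 8, 1, 9])
print(merged)          # [4, 6, 2, 3, 7, 9, 11, 5, 8, 1]
print(sorted(merged))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]
```

The membership test makes this quadratic in the list length, which is fine for small property collections like the ones discussed here.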

Cypher: find a path which takes the maximum valued step each time

I am trying to write a cypher query that finds a path between nodes a and b such that each step has the maximum timestamp value out of all available alternatives that is less than 15.
Here is my query so far, it does everything except for select the maximum possible timestamp at each step. How do I express this condition?
MATCH path=(a:NODE)-[rs:PARENT*]->(b:NODE)
WHERE a.name = 'SOME_VALUE' and b.name = 'SOME_OTHER_VALUE' AND ALL (r IN rs
WHERE r.timestamp < 15)
RETURN path
This is just awful pseudocode, but I think it expresses what I am looking for
MATCH path=(a:NODE)-[rs:PARENT*]->(b:NODE)
WHERE a.name = 'SOME_VALUE' and b.name = 'SOME_OTHER_VALUE' AND ALL (r IN rs
WHERE r.timestamp < 15 AND r.timestamp = max(allPossibleRsForThisStep))
RETURN path
Can this kind of query be written in cypher?
This won't be fast in Cypher, but it is possible: compute all the maximum values first, then compare each step's value against the maximum stored in a list.
Something like this (not sure if it works)
// Untested sketch: max_vals[i] holds the largest timestamp (< 15) seen at step i
// across all candidate paths, not per branch.
MATCH (a:NODE)-[rs:PARENT*..10]->(b:NODE)
WHERE a.name = 'SOME_VALUE' AND b.name = 'SOME_OTHER_VALUE'
AND ALL(r IN rs WHERE r.timestamp < 15)
UNWIND range(0, size(rs) - 1) AS idx
WITH a, b, idx, max(rs[idx].timestamp) AS maxTs
ORDER BY idx
WITH a, b, collect(maxTs) AS max_vals
MATCH path = (a)-[rs:PARENT*..10]->(b)
WHERE ALL(idx IN range(0, size(rs) - 1) WHERE rs[idx].timestamp = max_vals[idx])
RETURN path
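The behaviour the question asks for (at every node, follow the outgoing relationship with the largest timestamp below 15) is a greedy walk. The Python below is a small sketch of that walk over an in-memory adjacency map, just to pin down the intended semantics. The function name, the data layout, and the sample graph are all hypothetical, and ties between equal timestamps are not handled specially.

```python
def greedy_path(edges, start, goal, limit=15):
    """edges maps node -> list of (timestamp, next_node). Returns the node list or None."""
    path = [start]
    visited = {start}
    node = start
    while node != goal:
        # Only steps with a timestamp below the limit are eligible.
        candidates = [(ts, nxt) for ts, nxt in edges.get(node, []) if ts < limit]
        if not candidates:
            return None  # dead end: no eligible step from this node
        # Take the eligible step with the largest timestamp.
        _, node = max(candidates, key=lambda c: c[0])
        if node in visited:
            return None  # the greedy choice loops back, so the goal is never reached
        visited.add(node)
        path.append(node)
    return path

edges = {
    "a": [(5, "x"), (12, "y"), (20, "z")],
    "x": [(2, "b")],
    "y": [(3, "b"), (14, "w")],
    "w": [(1, "b")],
}
print(greedy_path(edges, "a", "b"))  # ['a', 'y', 'w', 'b']
```

Note that the walk commits to one choice per node, so it can fail to reach the goal even when some other path exists.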

Iterate through Neo4j relationships and return the minimum value of relationships properties

I want to iterate through the relationships between the "beginning node" and the "end node".
Here is my Cypher query:
MATCH (ar1:Article)-[:PART_OF]->()-[:SERIES]->(s1),
(ar2:Article)-[:PART_OF]->()-[:SERIES]->(s2),
(ar1)-[:CREATOR]->(au1:Author),
(ar2)-[:CREATOR]->(au1:Author),
p1 = (au1)-[:CONTRIBUTOR*]->(au2:Author)
WITH REDUCE (weight = 0.0, edge IN relationships(p1) | weight + 1/edge.fdegree) AS
strength_au1_au2_p1,ar1 AS ar1,s1 AS s1,ar2 AS ar2,s2 AS s2,au1 AS au1,au2 AS au2
WHERE s1.name='WWW' AND s2.name='Pods' AND ar2.year >2010.0 AND ar1.year >2010.0
AND strength_au1_au2_p1<5.0
RETURN ar1,s1,ar2,s2,au1,au2,ar1.year AS calc_fuzzy_ar1_year_recent,ar2.year AS
calc_fuzzy_ar2_year_recent,strength_au1_au2_p1 AS calc_fuzzy_length_p1_short
Now I want to iterate through the CONTRIBUTOR* relationships (in p1), get the 'fdegree' of each, and return the minimum fdegree value among the relationships in p1.
Thank you all
Try this:
MATCH (au1:Author)<-[:CREATOR]-(ar1:Article)-[:PART_OF]->()-[:SERIES]->(s1),
(au2:Author)<-[:CREATOR]-(ar2:Article)-[:PART_OF]->()-[:SERIES]->(s2)
WHERE s1.name='WWW' AND s2.name='Pods' AND ar2.year >2010.0 AND ar1.year >2010.0
WITH au1,au2,ar1,ar2,s1,s2
MATCH (au1)-[rels:CONTRIBUTOR*]->(au2:Author)
WHERE REDUCE (weight = 0, edge IN rels | weight + 1/edge.fdegree) < 5.0
RETURN au1,au2,ar1,ar2,s1,s2,
REDUCE (weight = 1000000, edge IN rels |
case when weight < edge.fdegree then weight else edge.fdegree end) as min_degree
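The two REDUCE expressions are plain folds over the relationship list. A hedged Python equivalent, with made-up function names and sample fdegree values, shows what each one accumulates:

```python
from functools import reduce

def path_strength(fdegrees):
    """Sum of 1/fdegree along the path, like the first REDUCE."""
    return reduce(lambda weight, d: weight + 1 / d, fdegrees, 0.0)

def min_fdegree(fdegrees, sentinel=1000000):
    """Smallest fdegree along the path, starting from a large sentinel like the second REDUCE."""
    return reduce(lambda weight, d: weight if weight < d else d, fdegrees, sentinel)

fdegrees = [2.0, 4.0]  # hypothetical fdegree values on a two-step CONTRIBUTOR path
print(path_strength(fdegrees))  # 0.75
print(min_fdegree(fdegrees))    # 2.0
```

The sentinel only works while every fdegree is below it; with an empty list the sentinel itself is returned, which is also what the Cypher expression would yield.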

Resources