Matching all nodes related to a set of other nodes - neo4j

I'm just getting started with neo4j and would like some help trying to solve a problem.
I have a set of Questions that require information (Slots) to answer them.
The rules of the graph (i.e. the Slots required for each Question) are shown below:
Graph diagram here
In a scenario in which I have a set of slots e.g. [Slot A, Slot B] I want to be able to check all Questions that the Slots are related to e.g. [Question 1 , Question 2].
I then want to be able to check for which of the Questions all required Slots are available, e.g. [Question 1]
Is this possible, and if so how should I go about it?

Yes it's possible.
Some data fixtures :
CREATE (q1:Question {name: "Q1"})
CREATE (q2:Question {name: "Q2"})
CREATE (s1:Slot {name: "Slot A"})
CREATE (s2:Slot {name: "Slot B"})
CREATE (s3:Slot {name: "Slot C"})
CREATE (q1)-[:REQUIRES]->(s1)
CREATE (q1)-[:REQUIRES]->(s2)
CREATE (q2)-[:REQUIRES]->(s1)
CREATE (q2)-[:REQUIRES]->(s3)
Find questions related to a slots list :
MATCH p=(q:Question)-[:REQUIRES]->(slot)
WHERE slot.name IN ["Slot A", "Slot B"]
RETURN p
Then, find questions related to a slot list, and return a boolean if the slot list contains all required slots for a question :
MATCH p=(q:Question)-[:REQUIRES]->(slot)
WHERE slot.name IN ["Slot A", "Slot B"]
WITH q, collect(slot) AS slots
RETURN q, ALL(x IN [(q)-[:REQUIRES]->(s) | s] WHERE x IN slots)
╒═════════════╤═══════════════════════════════════════════════════════╕
│"q" │"ALL(x IN [(q)-[:REQUIRES]->(s) | s] WHERE x IN slots)"│
╞═════════════╪═══════════════════════════════════════════════════════╡
│{"name":"Q1"}│true │
├─────────────┼───────────────────────────────────────────────────────┤
│{"name":"Q2"}│false │
└─────────────┴───────────────────────────────────────────────────────┘
A bit of explanation on the expression ALL(x IN [(q)-[:REQUIRES]->(s) | s] WHERE x IN slots):
the ALL predicate checks that a condition is true for every value in a list, for example ALL(x IN [10,20,30] WHERE x > 5)
the extract function takes a list and returns a list of the extracted values; its syntax is extract(x IN <LIST> | <expression>), and it has a shortcut form written with square brackets (a list comprehension). Note that extract() was removed in Neo4j 4.0, so on current versions only the square-bracket form works. For example:
extract(x IN [{name: "Chris", age: 38},{name: "John", age: 27}] | x.age)
// equivalent to the shortcut syntax for extract, with square brackets
[x IN [{name: "Chris", age: 38},{name: "John", age: 27}] | x.age]
Will return [38,27]
Combining it now :
For every path, extract the Slot node
[(q)-[:REQUIRES]->(s) | s]
Returns
[s1, s2]
Is every one of s1 and s2 in the list of slot nodes collected earlier?
ALL(x IN [(q)-[:REQUIRES]->(s) | s] WHERE x IN slots)
Returns true or false
Return only the questions when true :
MATCH p=(q:Question)-[:REQUIRES]->(slot)
WHERE slot.name IN ["Slot A", "Slot B"]
WITH q, collect(slot) AS slots
WITH q WHERE ALL(x IN [(q)-[:REQUIRES]->(s) | s] WHERE x IN slots)
RETURN q
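The same check can be reproduced outside the database as a plain set comparison, which is handy for sanity-checking the query on small data. This is only a minimal Python sketch, assuming the fixture above is mirrored in a dictionary; the names `requires` and `answerable` are hypothetical and not part of any Neo4j API.

```python
# Hypothetical in-memory mirror of the REQUIRES relationships from the fixture.
requires = {
    "Q1": {"Slot A", "Slot B"},
    "Q2": {"Slot A", "Slot C"},
}


def answerable(available, requires):
    """Map each question touching an available slot to whether all its slots are available."""
    available = set(available)
    return {
        question: needed <= available  # subset test, the analogue of ALL(... IN slots)
        for question, needed in requires.items()
        if needed & available  # keep only questions related to at least one given slot
    }


result = answerable(["Slot A", "Slot B"], requires)
print(result)  # {'Q1': True, 'Q2': False}
```

The `needed & available` filter plays the role of the first MATCH ... WHERE, and the subset test plays the role of the ALL predicate.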

Related

Neo4j where predicates on multiple attributes

In Neo4j, for the where predicate, can we have constraints on more than one property? For example, suppose that we have a list of pairs: L = [(23, 'San Diego'), (25, 'Palo Alto'), (21, 'Seattle'), ....], then does Cypher support something similar to the following:
Match (a) where (a.age, a.city) in L return a
The age and city combinations need to be in the L list
Cypher has no tuple type, but it does support maps of key-value pairs (dictionaries).
However, this query will be close to what you have described.
WITH [{age:23, city:'San Diego'}, {age:25, city:'Palo Alto'}, {age:21, city:'Seattle'}] as L
MATCH (p:Person) WHERE {age: p.age, city: p.city} in L
RETURN p
Sample result:
╒═══════════════════════════════════════════╕
│"p" │
╞═══════════════════════════════════════════╡
│{"name":"Andy","city":"San Diego","age":23}│
└───────────────────────────────────────────┘
See below:
https://neo4j.com/docs/cypher-manual/current/syntax/values/#composite-types

How do I find all nodes which shares at least n nodes with a given node

The title is a bit of a mess, but I have an image which accurately describes what I'm trying to achieve.
As an example, I'm working in neo4j's sandbox, with the Movies-dataset.
The task: Find all actors that have worked on a movie with Tom Hanks more than twice (3 or more times).
Here's the query (and result) which shows all actors that have worked with him at all, and the movies they've been part of.
MATCH (p1:Person {name: "Tom Hanks"})-->(m:Movie)<--(p2:Person)
RETURN m, p2
To save you some time, the only person who has worked with Tom at least thrice (3 times) is Meg Ryan to the right.
So, as a freshman at this query language, my immediate thought was to try the following Cypher-query:
MATCH (p1:Person {name: "Tom Hanks"})-->(m:Movie)<--(p2:Person)
WHERE count(m) > 2
RETURN p2, m
This gave an error, telling me that I can't put the count-function there.
I've also tried using the WITH-keyword:
MATCH (p1:Person {name: "Tom Hanks"})-->(m:Movie)<--(p2:Person)
WITH p1, m, p2, count(m) AS common_movie_count
WHERE common_movie_count > 2
RETURN DISTINCT p2, m
...but that didn't help me much, and although it did run, it gave me an empty output (meaning no matches).
For some reason I was allowed to get the names of relevant actors (and the count), as long as I accept getting ALL actors, in table format.
MATCH (p1:Person {name: "Tom Hanks"})-->(m:Movie)<--(p2:Person)
RETURN DISTINCT p2.name AS name, count(m) as common_movie_count
ORDER BY common_movie_count DESC
This query returned the following table (or started with these three results):
| name | common_movie_count |
| -------------- | ------------------ |
| "Meg Ryan" | 3 |
| "Ron Howard" | 2 |
| "Gary Sinise" | 2 |
I want the nodes, not just the names. Also, I only want the relevant nodes (count > 2): in the dataset I intend to apply this to, there will be far too many non-relevant nodes to return them all.
Do you have any ideas or simple solutions to my problem? To me the problems seems so simple that I can't be the first one to run into it, but I can't seem to google my way to a good solution.
You were pretty close here:
MATCH (p1:Person {name: "Tom Hanks"})-->(m:Movie)<--(p2:Person)
WITH p1, m, p2, count(m) AS common_movie_count
WHERE common_movie_count > 2
RETURN DISTINCT p2, m
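except that including m in the WITH clause makes it a grouping key: count(m) is then computed per (p1, m, p2) combination rather than per co-actor, so common_movie_count never reaches 3 and the WHERE clause filters out every row.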
If you do this:
MATCH (p1:Person {name: "Tom Hanks"})-->(m:Movie)<--(p2:Person)
WITH p1, p2, COLLECT( DISTINCT m) AS common_movies
WHERE SIZE(common_movies) > 2
UNWIND common_movies AS m
RETURN p2, m
you would get the expected result
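To see why the grouping key matters, the aggregation can be modeled in plain Python. This is only a sketch on made-up rows (not the real Movies dataset), and `frequent_co_actors` is a hypothetical helper name: grouping by co-actor alone yields the shared-movie count, which is what the COLLECT(DISTINCT m) version does.

```python
from collections import defaultdict

# Illustrative (co_actor, movie) rows, as the MATCH would produce them; made-up data.
rows = [
    ("Actor X", "Movie 1"), ("Actor X", "Movie 2"), ("Actor X", "Movie 3"),
    ("Actor Y", "Movie 1"), ("Actor Y", "Movie 2"),
    ("Actor Z", "Movie 3"),
]


def frequent_co_actors(rows, min_shared=3):
    """Group rows by co-actor only, then keep groups with enough distinct movies."""
    movies_by_actor = defaultdict(set)  # a set mirrors COLLECT(DISTINCT m)
    for actor, movie in rows:
        movies_by_actor[actor].add(movie)
    return {
        actor: sorted(movies)
        for actor, movies in movies_by_actor.items()
        if len(movies) >= min_shared  # same condition as SIZE(common_movies) > 2
    }


shared = frequent_co_actors(rows)
print(shared)  # {'Actor X': ['Movie 1', 'Movie 2', 'Movie 3']}
```

Had the movie been part of the dictionary key, every group would hold a single movie and the threshold could never be met.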

Return all paths and cost between two nodes in neo4j

I already find all paths between two nodes, but I need their cost(weight).
CREATE (a:Location {name: 'A'}),
(b:Location {name: 'B'}),
(c:Location {name: 'C'}),
(h:Location {name: 'H'}),
(j:Location {name: 'J'}),
(a)-[:ROAD {cost: 50}]->(b),
(a)-[:ROAD {cost: 50}]->(c),
(c)-[:ROAD {cost: 40}]->(j),
(j)-[:ROAD {cost: 30}]->(h),
(h)-[:ROAD {cost: 50}]->(b);
MATCH p=(o{name:"A"})-[r*]->(x{name:"B"})
RETURN [x in nodes(p) | x.name] AS list_path
output
╒═════════════════════╕
│"list_path" │
╞═════════════════════╡
│["A","C","J","H","B"]│
├─────────────────────┤
│["A","B"] │
└─────────────────────┘
expected output
path cost
1. [A,C,J,H,B] [0,50,90,120,170]
2. [A,B] [0,50]
Here is my query. It uses node names for the start and end nodes, but I actually need to use node IDs.
query I tried
MATCH p=(o)-[r*]->(x)
WHERE ID(o) =13 AND ID(x) = 14
RETURN [x in nodes(p) | id(x)] as list_path, [y in r | y.cost] as cost
output
╒════════════════╤═════════════╕
│"list_path" │"cost" │
╞════════════════╪═════════════╡
│[13,15,22,20,14]│[50,40,30,50]│
├────────────────┼─────────────┤
│[13,14] │[50] │
└────────────────┴─────────────┘
I need the cost list to start with zero and accumulate along the path, like [0,50,90,120,170].
To start off, you can just prepend a zero to the cost array. Next, I build a list of prefixes of that array and sum each one with the apoc.coll.sum function (from the APOC library) to get the cumulative cost along the path. There are probably other ways to calculate the cumulative sum of the elements in a list, but this is the simplest I could think of:
MATCH p=(o)-[r*0..]->(x)
WHERE ID(o) =13 AND ID(x) = 14
// Prepend a zero
WITH [x in nodes(p) | id(x)] as list_path, [0] + [y in r | y.cost] as cost_path
// Create list of lists
RETURN list_path, [i in range(1, size(cost_path)) | apoc.coll.sum(cost_path[..i])] as cost, apoc.coll.sum(cost_path) as total_cost
ORDER BY total_cost DESC
Output:
╒═══════════╤═══════════════════════════╕
│"list_path"│"cost" │
╞═══════════╪═══════════════════════════╡
│[0,1] │[0.0,50.0] │
├───────────┼───────────────────────────┤
│[0,2,4,3,1]│[0.0,50.0,90.0,120.0,170.0]│
└───────────┴───────────────────────────┘
You can convert the elements in the cost array to integer if that is what you prefer.
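For checking the expected numbers by hand, the "prepend a zero, then take running totals" idea can be written in a few lines of Python. This is a minimal sketch, assuming Python 3.8+ (for the `initial` argument of `itertools.accumulate`); `cumulative_costs` is a hypothetical helper name.

```python
from itertools import accumulate


def cumulative_costs(edge_costs):
    """Return running totals along a path, starting at 0 for the first node."""
    return list(accumulate(edge_costs, initial=0))


print(cumulative_costs([50, 40, 30, 50]))  # [0, 50, 90, 120, 170]
print(cumulative_costs([50]))              # [0, 50]
```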

Cypher - how to walk graph while computing

I'm just starting studying Cypher here..
How would I specify a Cypher query to return the node, from 1 to 3 hops away from the initial node, whose path has the highest average weight?
Example
Graph is:
(I know I'm not using the Cypher's notation here..)
A-[2]-B-[4]-C
A-[3.5]-D
It would return D, because 3.5 > (2+4)/2
And with Graph:
A-[2]-B-[4]-C
A-[3.5]-D
A-[2]-B-[4]-C-[20]-E
A-[2]-B-[4]-C-[20]-E-[80]-F
It would return E, because (2+4+20)/3 > 3.5
and F is more than 3 hops away
One way to write the query, which has the benefit of being easy to read, is
MATCH p=(A {name: 'A'})-[*1..3]-(x)
UNWIND [r IN relationships(p) | r.weight] AS weight
RETURN x.name, avg(weight) AS avgWeight
ORDER BY avgWeight DESC
LIMIT 1
Here we extract the weights in the path into a list, and unwind that list. Try inserting a RETURN there to see what the results look like at that point. Because we unwind we can use the avg() aggregation function. By returning not only the avg(weight), but also the name of the last path node, the aggregation will be grouped by that node name. If you don't want to return the weight, only the node name, then change RETURN to WITH in the query, and add another return clause which only returns the node name.
You can also add something like [n IN nodes(p) | n.name] AS nodesInPath to the return statement to see what the path looks like. I created an example graph based on your question with below query with nodes named A, B, C etc.
CREATE (A {name: 'A'}),
(B {name: 'B'}),
(C {name: 'C'}),
(D {name: 'D'}),
(E {name: 'E'}),
(F {name: 'F'}),
(A)-[:R {weight: 2}]->(B),
(B)-[:R {weight: 4}]->(C),
(A)-[:R {weight: 3.5}]->(D),
(C)-[:R {weight: 20}]->(E),
(E)-[:R {weight: 80}]->(F)
1) To select the possible paths with a length from one to three hops, use MATCH with a variable-length relationship, anchored at the start node:
MATCH p = (A {name: 'A'})-[*1..3]->(T)
2) Then use the reduce function to calculate the average weight of each path, sort by it, and limit the result to one row:
MATCH p = (A {name: 'A'})-[*1..3]->(T)
WITH p, T,
// start from 0.0 so the division is not truncated to an integer
reduce(s = 0.0, r IN relationships(p) | s + r.weight) / length(p) AS weight
RETURN T ORDER BY weight DESC LIMIT 1
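To check the expected answer independently of the database, the same search can be sketched in Python. This is an illustrative model only: the `edges` dictionary mirrors the example graph above, `best_average` is a hypothetical helper, and the traversal follows relationship direction as in the second query.

```python
# Directed weighted edges mirroring the example graph above.
edges = {
    "A": [("B", 2), ("D", 3.5)],
    "B": [("C", 4)],
    "C": [("E", 20)],
    "E": [("F", 80)],
}


def best_average(start, edges, max_hops=3):
    """Return (node, average weight) for the best path of 1..max_hops edges from start."""
    best = None
    stack = [(start, [])]  # (current node, weights collected so far)
    while stack:
        node, weights = stack.pop()
        if weights:  # skip the zero-length path at the start node
            avg = sum(weights) / len(weights)
            if best is None or avg > best[1]:
                best = (node, avg)
        if len(weights) < max_hops:  # stop extending once the hop limit is reached
            for nxt, w in edges.get(node, []):
                stack.append((nxt, weights + [w]))
    return best


print(best_average("A", edges)[0])  # E
```

F is never considered because reaching it from A takes four hops, which matches the reasoning in the question.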

Neo4J: How to find unique nodes from a collection of paths

I am using neo4j to solve a real-time normalization problem. Let's say I have 3 places from 2 different sources. Source 45 gives me 2 places that are in fact duplicates of each other, and source 55 gives me 1 correct identifier. However, for any place identifier (duplicate or not), I want to find the closest set of places that are unique by feed identifier. My data looks like this:
CREATE (a: Place {feedId:45, placeId: 123, name:"Empire State", address: "350 5th Ave", city: "New York", state: "NY", zip: "10118" })
CREATE (b: Place {feedId:45, placeId: 456, name:"Empire State Building", address: "350 5th Ave", city: "New York", state: "NY"})
CREATE (c: Place {feedId:55, placeId: 789, name:"Empire State", address: "350 5th Ave", city: "New York", state: "NY", zip: "10118"})
I have connected these nodes by Matching nodes so I can do some normalization on the data. For instance:
MERGE (m1: Matching:NameAndCity { attr: "EmpireStateBuildingNewYork", cost: 5.0 })
MERGE (a)-[:MATCHES]-(m1)
MERGE (b)-[:MATCHES]-(m1)
MERGE (c)-[:MATCHES]-(m1)
MERGE (m2: Matching:CityAndZip { attr: "NewYork10118", cost: 7.0 })
MERGE (a)-[:MATCHES]-(m2)
MERGE (c)-[:MATCHES]-(m2)
When I want to find what are the closest matches from a start place id, I can run a match on all paths from the start node, ranked by cost, ie:
MATCH p=(a:Place {placeId:789, feedId:55})-[*..4]-(d:Place)
WHERE NONE (n IN nodes(p)
WHERE size([x IN nodes(p)
WHERE n = x]) > 1)
WITH p,
reduce(costAccum = 0, n IN [n IN nodes(p) WHERE n.cost IS NOT NULL] | costAccum + n.cost) AS costAccum
ORDER BY costAccum
RETURN p, costAccum
However, as there are multiple paths to the same places, I get the same node replicated multiple times when querying like this. Is it possible to collect the nodes and their costs, and then only return a distinct subset (for e.g., give me the best result from feed 45 and 55?
How could I return a distinct set of paths, ranked by cost, and unique by the feed identifier? Am I structuring this type of problem wrong?
Please help!
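You can sort the paths by cost first, then collect them per place d and take the head of each collection. The cost has to be computed per path before the aggregation; if it sits in the same WITH as collect(p), it becomes a grouping key and you still get one row per distinct cost. To get one result per feed rather than per place, group by d.feedId instead of d.
MATCH p=(a:Place {placeId:789, feedId:55})-[*..4]-(d:Place)
WITH d, p,
reduce(costAccum = 0, n IN [n IN nodes(p) WHERE n.cost IS NOT NULL] | costAccum + n.cost) AS costAccum
ORDER BY costAccum
// in practice collect() keeps the incoming order, so head(paths) is the cheapest path per place
WITH d, collect(p) AS paths, min(costAccum) AS costAccum
RETURN head(paths) AS p, costAccum
ORDER BY costAccum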
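The "best path per destination" step is a group-wise minimum, which can be sketched in Python for clarity. The rows below are made up to resemble the fixture (place 789 reaching places 123 and 456 through the matching nodes), and `cheapest_per_destination` is a hypothetical helper, not a Neo4j function.

```python
# Made-up (destination, path, cost) rows standing in for the matched paths.
paths = [
    ("place 123", ["789", "m1", "123"], 5.0),
    ("place 123", ["789", "m2", "123"], 7.0),
    ("place 456", ["789", "m1", "456"], 5.0),
    ("place 456", ["789", "m2", "123", "m1", "456"], 12.0),
]


def cheapest_per_destination(paths):
    """Keep the lowest-cost path for each destination, ordered by cost."""
    best = {}
    for dest, path, cost in paths:
        if dest not in best or cost < best[dest][1]:
            best[dest] = (path, cost)
    return sorted(best.items(), key=lambda item: item[1][1])


for dest, (path, cost) in cheapest_per_destination(paths):
    print(dest, path, cost)
```

Using the feed identifier as the dictionary key instead of the destination would give one result per feed.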
