Neo4J: How to find unique nodes from a collection of paths - neo4j

I am using Neo4j to solve a real-time normalization problem. Let's say I have 3 places from 2 different sources. Source 45 gives me 2 places that are in fact duplicates of each other, and source 55 gives me 1 correct identifier. However, for any place identifier (duplicate or not), I want to find the closest set of places that are unique by feed identifier. My data looks like this:
CREATE (a: Place {feedId:45, placeId: 123, name:"Empire State", address: "350 5th Ave", city: "New York", state: "NY", zip: "10118" })
CREATE (b: Place {feedId:45, placeId: 456, name:"Empire State Building", address: "350 5th Ave", city: "New York", state: "NY"})
CREATE (c: Place {feedId:55, placeId: 789, name:"Empire State", address: "350 5th Ave", city: "New York", state: "NY", zip: "10118"})
I have connected these nodes through Matching nodes so that I can normalize the data. For instance:
MERGE (m1: Matching:NameAndCity { attr: "EmpireStateBuildingNewYork", cost: 5.0 })
MERGE (a)-[:MATCHES]-(m1)
MERGE (b)-[:MATCHES]-(m1)
MERGE (c)-[:MATCHES]-(m1)
MERGE (m2: Matching:CityAndZip { attr: "NewYork10118", cost: 7.0 })
MERGE (a)-[:MATCHES]-(m2)
MERGE (c)-[:MATCHES]-(m2)
To find the closest matches from a starting place id, I can match all paths from the start node and rank them by cost, i.e.:
MATCH p=(a:Place {placeId:789, feedId:55})-[*..4]-(d:Place)
WHERE NONE (n IN nodes(p) WHERE size(filter(x IN nodes(p) WHERE n = x)) > 1)
WITH p, reduce(costAccum = 0, n IN filter(n IN nodes(p) WHERE has(n.cost)) | costAccum + n.cost) AS costAccum
ORDER BY costAccum
RETURN p, costAccum
However, as there are multiple paths to the same places, I get the same node replicated multiple times when querying like this. Is it possible to collect the nodes and their costs, and then return only a distinct subset (e.g., give me the best result from feeds 45 and 55)?
How could I return a distinct set of paths, ranked by cost, and unique by the feed identifier? Am I structuring this type of problem wrong?
Please help!

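You can compute the cost of each path, sort by it, and then collect the paths for each place d. Because the rows are sorted before they are collected, the first path in each collection is the best one:
MATCH p=(a:Place {placeId:789, feedId:55})-[*..4]-(d:Place)
// Compute the cost per path first, so that d is the only grouping key below
WITH d, p,
reduce(costAccum = 0, n in filter(n in nodes(p) where has(n.cost)) | costAccum+n.cost) AS costAccum
ORDER BY costAccum
WITH d, collect(p) AS paths, min(costAccum) AS costAccum
RETURN head(paths) AS p, costAccum
ORDER BY costAccum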
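The "sort by cost, then keep the first path per group" idea can be modeled outside Neo4j. The following is a minimal plain-Python sketch, not Neo4j code: the rows are hand-written stand-ins for the paths reachable from place 789 in the sample data, and the helper name best_per_group is hypothetical. Grouping by feedId instead of placeId is one possible way to get a single best result per feed, which is an assumption about what the question is after.

```python
# Plain-Python model of "ORDER BY cost, then take the head of each group".
# The rows are hand-written stand-ins for the (d, p, costAccum) rows the
# MATCH would produce when starting from place 789; nothing is queried.
rows = [
    {"placeId": 123, "feedId": 45, "path": "789-m2-123", "cost": 7.0},
    {"placeId": 123, "feedId": 45, "path": "789-m1-123", "cost": 5.0},
    {"placeId": 456, "feedId": 45, "path": "789-m1-456", "cost": 5.0},
    {"placeId": 456, "feedId": 45, "path": "789-m2-123-m1-456", "cost": 12.0},
]


def best_per_group(rows, key):
    """Return the cheapest row for each distinct value of row[key]."""
    best = {}
    # sorted() is stable, so rows with equal cost keep their input order.
    for row in sorted(rows, key=lambda r: r["cost"]):
        # Only the first (cheapest) row seen for a group is kept.
        best.setdefault(row[key], row)
    return list(best.values())


per_place = best_per_group(rows, "placeId")  # one best path per place
per_feed = best_per_group(rows, "feedId")    # one best path per feed

for row in per_place:
    print(row["placeId"], row["path"], row["cost"])
```

In Cypher terms, switching the grouping key would correspond to grouping by d.feedId rather than by d.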

Related

Matching all nodes related to a set of other nodes - neo4j

I'm just getting started with neo4j and would like some help trying to solve a problem.
I have a set of Questions that require information (Slots) to answer them.
The rules of the graph (i.e. the Slots required for each Question) are shown below:
Graph diagram here
In a scenario in which I have a set of slots e.g. [Slot A, Slot B] I want to be able to check all Questions that the Slots are related to e.g. [Question 1 , Question 2].
I then want to be able to check for which of the Questions all required Slots are available, e.g. [Question 1]
Is this possible, and if so how should I go about it?
Yes, it's possible.
Some data fixtures:
CREATE (q1:Question {name: "Q1"})
CREATE (q2:Question {name: "Q2"})
CREATE (s1:Slot {name: "Slot A"})
CREATE (s2:Slot {name: "Slot B"})
CREATE (s3:Slot {name: "Slot C"})
CREATE (q1)-[:REQUIRES]->(s1)
CREATE (q1)-[:REQUIRES]->(s2)
CREATE (q2)-[:REQUIRES]->(s1)
CREATE (q2)-[:REQUIRES]->(s3)
Find the questions related to a list of slots:
MATCH p=(q:Question)-[:REQUIRES]->(slot)
WHERE slot.name IN ["Slot A", "Slot B"]
RETURN p
Then find the questions related to a slot list, and return a boolean indicating whether the slot list contains all the slots each question requires:
MATCH p=(q:Question)-[:REQUIRES]->(slot)
WHERE slot.name IN ["Slot A", "Slot B"]
WITH q, collect(slot) AS slots
RETURN q, ALL(x IN [(q)-[:REQUIRES]->(s) | s] WHERE x IN slots)
╒═════════════╤═══════════════════════════════════════════════════════╕
│"q" │"ALL(x IN [(q)-[:REQUIRES]->(s) | s] WHERE x IN slots)"│
╞═════════════╪═══════════════════════════════════════════════════════╡
│{"name":"Q1"}│true │
├─────────────┼───────────────────────────────────────────────────────┤
│{"name":"Q2"}│false │
└─────────────┴───────────────────────────────────────────────────────┘
A bit of explanation of the expression ALL(x IN [(q)-[:REQUIRES]->(s) | s] WHERE x IN slots):
The ALL predicate checks that a condition is true for every value in a list, for example ALL (x IN [10,20,30] WHERE x > 5).
The extract shortcut syntax takes a list and returns a list of the extracted values. The syntax is extract(x IN <LIST> | <value to extract>), for example:
extract(x IN [{name: "Chris", age: 38},{name: "John", age: 27}] | x.age)
// equivalent to the shortcut syntax for extract, with square brackets
[x IN [{name: "Chris", age: 38},{name: "John", age: 27}] | x.age]
Both return [38,27].
Combining the two:
For every path matching the pattern, extract the Slot node (this form is a pattern comprehension):
[(q)-[:REQUIRES]->(s) | s]
For Q1, this returns:
[s1, s2]
Are s1 and s2 both in the list of slot nodes collected previously?
ALL(x IN [(q)-[:REQUIRES]->(s) | s] WHERE x IN slots)
This returns true or false.
To return only the questions for which it is true:
MATCH p=(q:Question)-[:REQUIRES]->(slot)
WHERE slot.name IN ["Slot A", "Slot B"]
WITH q, collect(slot) AS slots
WITH q WHERE ALL(x IN [(q)-[:REQUIRES]->(s) | s] WHERE x IN slots)
RETURN q
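The same logic can be checked with a small plain-Python model. This is only a sketch of the semantics, not Neo4j code: the dictionary requires mirrors the fixtures above, and all variable names are made up for illustration.

```python
# Plain-Python model of the fixtures: each question maps to the set of
# slot names it REQUIRES.
requires = {
    "Q1": {"Slot A", "Slot B"},
    "Q2": {"Slot A", "Slot C"},
}
available = {"Slot A", "Slot B"}

# Questions related to at least one available slot (the first MATCH).
related = [q for q, slots in requires.items() if slots & available]

# For each related question, check that every required slot is available,
# mirroring ALL(x IN [...] WHERE x IN slots).
all_available = {q: all(s in available for s in requires[q]) for q in related}

# Keep only the questions for which the check is true.
answerable = [q for q in related if all_available[q]]

print(related)
print(all_available)
print(answerable)
```

Python's all() plays the role of the ALL predicate here, and the comprehension over requires[q] plays the role of the pattern comprehension.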

Cypher: List declared before UNWIND-ing a second list becomes null after UNWIND-ing the second list and executing a MATCH which returns no results

I have the following scenario in a Neo4j db:
There are tasks which can be assigned to different users based on some criteria. There's an optional criterion (some tasks have a filter for user's location, some don't).
I need to find all tasks for a user (if they have a location filter, I need to check user's location as well, if they don't I match only by the rest of the criteria).
I've tried to collect the tasks matching the mandatory criteria, then filter those which don't require the optional filter, then filter those which require the optional filter and match the current user and eventually merge the two lists.
Could you also suggest a more efficient way to do this please?
Here's a minimal example (of course, I have more complex matches after UNWIND)
WITH [{a: 'test'}, {a: 'a', b: 'b'}] AS initialList
WITH [i IN initialList WHERE i.b IS NULL] AS itemsWithoutB, initialList
UNWIND initialList AS item
MATCH (item) WHERE item.a IS NULL
RETURN COLLECT(item) + itemsWithoutB
I would expect here to have the content of itemsWithoutB returned, but I get no records (Response: []).
Note that if the MATCH done after UNWIND does actually return some records, then the content of itemsWithoutB is returned as well.
For example:
WITH [{a: 'test'}, {a: 'a', b: 'b'}] AS initialList
WITH [i IN initialList WHERE i.b IS NULL] AS itemsWithoutB, initialList
UNWIND initialList AS item
MATCH (item) WHERE item.a IS NOT NULL
RETURN COLLECT(item) + itemsWithoutB
this returns:
╒═════════════════════════════════════════════╕
│"COLLECT(item) + itemsWithoutB" │
╞═════════════════════════════════════════════╡
│[{"a":"test"},{"a":"a","b":"b"},{"a":"test"}]│
└─────────────────────────────────────────────┘
Neo4j version: enterprise 3.5.6
What am I missing here, please?
---EDITED---
I'm adding here a more complex example, closer to the real scenario:
Generate initial data:
MERGE (d:Device {code: 'device1', latitude:90.5, longitude: 90.5})-[:USED_BY]->(u:User {name: 'user1'})-[:WORKS_IN]->(c:Country {code: 'RO'})<-[targets:TARGETS]-(:Task {name: 'task1', latitude: 90.5, longitude: 90.5, maxDistance: 1000, maxMinutesAfterLastInRange: 99999})<-[:IN_RANGE {timestamp: datetime()}]-(d)
MERGE (c)<-[:TARGETS]-(:Task {name: 'task2'})
MERGE (c)<-[:TARGETS]-(:Task {name: 'task4', latitude: 10.5, longitude: 10.5, maxDistance: 1, maxMinutesAfterLastInRange: 99999})
CREATE (:User {name: 'user2'})-[:WORKS_IN]->(:Country {code: 'GB'})<-[:TARGETS]-(:Task {name: 'task3'})
Here's a neo4j console link for this example.
I want to be able to use the same query to find the tasks for any user (task1 and task2 should be returned for user1, task3 for user2, and task4 shouldn't be returned for either of them).
The following query works for user1, but doesn't work if I change the user name filter to "user2":
MATCH (user:User {name: "user1"})-[:WORKS_IN]->(country)
OPTIONAL MATCH (device:Device)-[:USED_BY]->(user)
WITH country, device
MATCH (task:Task)-[:TARGETS]->(country)
WITH COLLECT(task) AS filteredTasks, device
WITH [t IN filteredTasks WHERE t.latitude IS NULL OR t.longitude IS NULL] AS matchedTasksWithoutLocationFilter, filteredTasks, device
UNWIND filteredTasks AS task
MATCH (device)-[inRange:IN_RANGE]->(task)
WHERE task.maxMinutesAfterLastInRange IS NOT NULL
AND duration.between(datetime(inRange.timestamp), datetime()).minutes <= task.maxMinutesAfterLastInRange
RETURN COLLECT(task) + matchedTasksWithoutLocationFilter AS matchedTasks
Updated answer based on new information
I think you can do this in one shot and not need list comprehensions.
MATCH (user: User {name: "user1" })-[:WORKS_IN]->(country)<-[:TARGETS]-(task: Task)
OPTIONAL MATCH (task)<-[inRange: IN_RANGE]-(device: Device)-[:USED_BY]->(user)
WITH task, inRange
MATCH (task)
WHERE (task.latitude IS NULL OR task.longitude IS NULL)
OR (inRange IS NOT NULL AND
task.maxMinutesAfterLastInRange IS NOT NULL AND
duration.between(datetime(inRange.timestamp), datetime()).minutes <= task.maxMinutesAfterLastInRange)
RETURN task
For user1:
╒══════════════════════════════════════════════════════════════════════╕
│"task" │
╞══════════════════════════════════════════════════════════════════════╡
│{"name":"task2"} │
├──────────────────────────────────────────────────────────────────────┤
│{"name":"task1","maxDistance":1000,"maxMinutesAfterLastInRange":99999,│
│"latitude":90.5,"longitude":90.5} │
└──────────────────────────────────────────────────────────────────────┘
For user2:
╒════════════════╕
│"task" │
╞════════════════╡
│{"name":"task3"}│
└────────────────┘
Original answer
When your MATCH doesn't return any nodes (in that example, all nodes have an a property), the rest of the query's got no work to do - sort of like a failed inner join in a traditional SQL database.
If you switch to an OPTIONAL MATCH then you'll see results from itemsWithoutB irrespective of whether the MATCH worked. I know your example's synthetic so I'm not sure if that's what you're after - in your example the COLLECT(item) is going to be working off the item from UNWIND, and the result of the OPTIONAL MATCH is basically irrelevant. Still, imagine that these are real nodes with real queries:
WITH [{a: 'test'}, {a: 'a', b: 'b'}] AS initialList
WITH [i IN initialList WHERE i.b IS NULL] AS itemsWithoutB, initialList
UNWIND initialList AS item
OPTIONAL MATCH (item) WHERE item.a IS NULL
RETURN COLLECT(item) + itemsWithoutB
You may need to do some further work to de-duplicate the results.
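The difference between MATCH and OPTIONAL MATCH can be illustrated with a plain-Python model of the row pipeline. This is a sketch under stated assumptions, not Neo4j code: the function collect_plus_carried and its parameters are hypothetical, and it models a MATCH that binds a new variable for each unwound item. It also shows why zero rows reach the RETURN: the carried list is not aggregated, so it acts as a grouping key, and with no input rows there are no groups.

```python
def collect_plus_carried(rows, find_matches, optional):
    """Model UNWIND -> (OPTIONAL) MATCH -> RETURN collect(m) + carried.

    rows: list of (item, carried) pairs, one per unwound item.
    find_matches(item): list of values the MATCH would bind for that item.
    """
    expanded = []
    for item, carried in rows:
        found = find_matches(item)
        if not found and optional:
            found = [None]  # OPTIONAL MATCH keeps the row and binds null
        # A plain MATCH contributes no row at all when nothing is found.
        expanded.extend((m, carried) for m in found)

    groups = {}  # the carried value is the implicit grouping key
    for m, carried in expanded:
        collected = groups.setdefault(tuple(carried), [])
        if m is not None:  # collect() ignores nulls
            collected.append(m)
    return [collected + list(key) for key, collected in groups.items()]


carried = ["task2"]
rows = [("task1", carried), ("task2", carried)]


def no_match(item):
    return []


strict = collect_plus_carried(rows, no_match, optional=False)
lenient = collect_plus_carried(rows, no_match, optional=True)
print(strict)
print(lenient)
```

With the plain MATCH the result has no rows at all, while the OPTIONAL MATCH variant still returns one row containing the carried list.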

How to create all edges using one statement in Cypher?

For example, let's say I have Employees objects like this
Employees {name: "abc", country: "NZ", zipcode: "123456"}
Employees {name: "def", country: "AUS", zipcode: "964573"}
and let's say I have the following Manager objects
Manager { name: "abc", depatment: "product"}
Manager {name: "abc", depatment: "sales"}
Manager {name: "abc", depatment: "marketing"}
and finally the Address objects
Address {zipcode: "964573", street: "Auckland St"}
Now I want to create all the edges where Employees.name = Manager.name and Employees.zipcode = Address.zipcode. However, if Employees.name != Manager.name but Employees.zipcode = Address.zipcode, then I want the edges to be created between Employees and Address only. Similarly, if Employees.zipcode != Address.zipcode but Employees.name = Manager.name, then I want the edges to be created between Employees and Manager only. And I want to achieve all of this in one single statement/query.
Simply put, if there are matching vertices across Employees, Manager and Address, I want all the edges to be created between them, but if there is only a match between two of them, I want the edge to be created between those two vertices as well.
Is it possible to write a single statement that satisfies all the conditions above?
What I tried so far is this
Find the pairs first with MATCH clause and then CREATE a relationship between them.
MATCH (e:Employees),(m:Manager), (a:Address)
WHERE e.name=m.name or e.zipcode = a.zipcode
WITH e,m,a
CREATE (m)-[:REL_NAME]->(e), (e)-[:ADDR_REL]->(a)
This clearly won't work because of the WHERE clause: if e.name = m.name, then e.zipcode = a.zipcode won't be checked, and therefore no edge will be created between employees and address.
The following query avoids producing a cartesian product of all 3 node labels (and will perform better if you have indexes for :Manager(name) and :Address(zipcode)):
MATCH (e:Employees)
OPTIONAL MATCH (m:Manager)
WHERE e.name = m.name
WITH e, COLLECT(m) AS mList
FOREACH(x IN mList | CREATE (x)-[:REL_NAME]->(e))
WITH e
OPTIONAL MATCH (a:Address)
WHERE e.zipcode = a.zipcode
WITH e, COLLECT(a) AS aList
FOREACH(y IN aList | CREATE (e)-[:ADDR_REL]->(y))
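To see why two independent matches behave differently from a single WHERE with OR over all three labels, here is a plain-Python model using the sample data from the question. It is only a sketch of the row semantics, not Neo4j code: the function names are hypothetical, and managers are identified by a department field for readability.

```python
employees = [
    {"name": "abc", "zipcode": "123456"},
    {"name": "def", "zipcode": "964573"},
]
managers = [
    {"name": "abc", "department": "product"},
    {"name": "abc", "department": "sales"},
    {"name": "abc", "department": "marketing"},
]
addresses = [{"zipcode": "964573"}]


def or_over_product(employees, managers, addresses):
    """Model of one MATCH over all triples with WHERE ... OR ...:
    every surviving triple creates BOTH relationships."""
    rel_name, addr_rel = [], []
    for e in employees:
        for m in managers:
            for a in addresses:
                if e["name"] == m["name"] or e["zipcode"] == a["zipcode"]:
                    rel_name.append((m["department"], e["name"]))
                    addr_rel.append((e["name"], a["zipcode"]))
    return rel_name, addr_rel


def two_joins(employees, managers, addresses):
    """Model of two independent matches, one per relationship type."""
    rel_name = [(m["department"], e["name"])
                for e in employees for m in managers
                if e["name"] == m["name"]]
    addr_rel = [(e["name"], a["zipcode"])
                for e in employees for a in addresses
                if e["zipcode"] == a["zipcode"]]
    return rel_name, addr_rel


bad_rel_name, bad_addr_rel = or_over_product(employees, managers, addresses)
good_rel_name, good_addr_rel = two_joins(employees, managers, addresses)
print(len(bad_rel_name), len(bad_addr_rel))
print(good_rel_name, good_addr_rel)
```

In this model the combined condition produces relationships for pairs that do not actually match, plus duplicates, while the two separate joins produce exactly one relationship per matching pair.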

Adding multiple relationships using WITH, WHERE, and UNWIND

I have data in the following structure:
{"id": "1", "name": "A. I. Lazarev", "org": "United States Department of State", "tags": [{"t": "Infrared"}, {"t": "Near-infrared spectroscopy"}, {"t": "Infrared astronomy"}, {"t": "Data collection"}], "pubs": [{"i": "1542417502", "r": 6}], }
{"id": "2", "name": "Stevan Spremo", "tags": [{"t": "Micro-g environment"}, {"t": "Antibiotics"}, {"t": "Bacteriology"}], "pubs": [{"i": "222163962", "r": 0}], }
{"id": "3", "name": "Bricchi G", "pubs": [{"i": "2417067698", "r": 1}, {"i": "2406980973", "r": 1}]}
Some of the rows have tags, some have organizations, some have both, and some have neither.
I'd like to add relationships between (1) authors and tags, (2) authors and organizations, and (3) authors and publications. I have the publications as nodes already, so it should be fairly straightforward to get (3) once I get (1) and (2).
I have been trying to use the following code:
CALL apoc.periodic.iterate(
"CALL apoc.load.json('file:/test.txt') YIELD value AS q RETURN q",
"UNWIND q.id as id
CREATE (a:Author {id:id, name:q.name, citations:q.n_citation, publications:q.n_pubs})
WITH q, a
UNWIND q.tags as tags
MERGE (t:Tag {{name: tags.t}})
CREATE (a)-[:HAS_TAGS]->(t)
WITH q, a
WHERE q.org is not null
MERGE (o:Organization {name: q.org})
CREATE (a)-[:AFFILIATED_WITH]->(o)",
{batchSize:10000, iterateList:true, parallel:false})
The tags and the organizations show up multiple times in the data, but should only have one node each, so I have used MERGE to create unique nodes for these.
The problem with the code above is that it creates duplicate AFFILIATED_WITH relationships: it creates as many AFFILIATED_WITH relationships as there are tags.
How can I change the cypher query so that it isn't creating duplicate relationships?
After this clause:
UNWIND q.tags as tags
your query will have as many data rows as the number of tags for the current q (each row will have q, a, id, tags values). The subsequent operations will be performed once per data row. That is why you are creating too many AFFILIATED_WITH relationships.
To solve your issue, you have to reduce the number of data rows appropriately, at the appropriate time (and this will also speed up your processing, since unnecessarily repeated operations will be avoided). In your case, you can just change the second WITH q, a clause to WITH DISTINCT q, a:
CALL apoc.periodic.iterate(
"CALL apoc.load.json('file:///test.txt') YIELD value AS q RETURN q",
"CREATE (a:Author {id:q.id, name:q.name, citations:q.n_citation, publications:q.n_pubs})
WITH q, a
UNWIND q.tags as tags
MERGE (t:Tag {name: tags.t})
CREATE (a)-[:HAS_TAGS]->(t)
WITH DISTINCT q, a
WHERE q.org is not null
MERGE (o:Organization {name: q.org})
CREATE (a)-[:AFFILIATED_WITH]->(o)",
{batchSize:10000, iterateList:true, parallel:false}
)
I have also simplified the query by removing the unnecessary UNWIND q.id as id clause, and fixed some syntax issues.
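The effect of UNWIND on the number of rows, and of DISTINCT on collapsing them again, can be modeled in plain Python. This is a sketch of the row semantics only, not Neo4j or APOC code: the function affiliated_edges is hypothetical, and the records are cut-down versions of the sample data.

```python
record = {
    "id": "1",
    "org": "United States Department of State",
    "tags": ["Infrared", "Near-infrared spectroscopy",
             "Infrared astronomy", "Data collection"],
}


def affiliated_edges(record, distinct):
    """Return the (author id, org) pairs a CREATE would produce."""
    # UNWIND q.tags: one row per tag, each still carrying the same author.
    rows = [(record["id"], tag) for tag in record["tags"]]
    # WITH q, a keeps one row per tag; WITH DISTINCT q, a collapses them.
    carried = [author_id for author_id, _tag in rows]
    if distinct:
        carried = list(dict.fromkeys(carried))
    # WHERE q.org IS NOT NULL, then CREATE once per remaining row.
    if record.get("org") is None:
        return []
    return [(author_id, record["org"]) for author_id in carried]


without_distinct = affiliated_edges(record, distinct=False)
with_distinct = affiliated_edges(record, distinct=True)
print(len(without_distinct), len(with_distinct))
```

Without DISTINCT the model yields one AFFILIATED_WITH pair per tag; with it, a single pair per author.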
[UPDATED]
If you want to add the AUTHORED relationships (as requested in the comments to this answer), you should do that before you create the AFFILIATED_WITH relationships, since the WHERE q.org is not null clause would filter out the rows whose q has no org. Also, whenever you use CREATE to create a relationship, Cypher requires that you specify a direction for the relationship.
CALL apoc.periodic.iterate(
"CALL apoc.load.json('file:///test.txt') YIELD value AS q RETURN q",
"CREATE (a:Author {id:q.id, name:q.name, citations:q.n_citation, publications:q.n_pubs})
WITH q, a
UNWIND q.tags as tags
MERGE (t:Tag {name: tags.t})
CREATE (a)-[:HAS_TAGS]->(t)
WITH DISTINCT q, a
UNWIND q.pubs as pubs
MERGE (p:Quanta {id: pubs.i})
CREATE (a)-[r:AUTHORED {rank: pubs.r}]->(p)
WITH DISTINCT q, a
WHERE q.org is not null
MERGE (o:Organization {name: q.org})
CREATE (a)-[:AFFILIATED_WITH]->(o)",
{batchSize:10000, iterateList:true, parallel:false}
)

Cypher - how to walk graph while computing

I'm just starting to study Cypher.
How would I write a Cypher query that returns the node, from 1 to 3 hops away from the initial node, whose path has the highest average weight?
Example
Graph is:
(I know I'm not using Cypher's notation here.)
A-[2]-B-[4]-C
A-[3.5]-D
It would return D, because 3.5 > (2+4)/2
And with Graph:
A-[2]-B-[4]-C
A-[3.5]-D
A-[2]-B-[4]-C-[20]-E
A-[2]-B-[4]-C-[20]-E-[80]-F
It would return E, because (2+4+20)/3 > 3.5
and F is more than 3 hops away
One way to write the query, which has the benefit of being easy to read, is
MATCH p=(A {name: 'A'})-[*1..3]-(x)
UNWIND [r IN relationships(p) | r.weight] AS weight
RETURN x.name, avg(weight) AS avgWeight
ORDER BY avgWeight DESC
LIMIT 1
Here we extract the weights in the path into a list, and unwind that list. Try inserting a RETURN there to see what the results look like at that point. Because we unwind we can use the avg() aggregation function. By returning not only the avg(weight), but also the name of the last path node, the aggregation will be grouped by that node name. If you don't want to return the weight, only the node name, then change RETURN to WITH in the query, and add another return clause which only returns the node name.
You can also add something like [n IN nodes(p) | n.name] AS nodesInPath to the return statement to see what the path looks like. I created an example graph based on your question with the query below, with nodes named A, B, C, etc.
CREATE (A {name: 'A'}),
(B {name: 'B'}),
(C {name: 'C'}),
(D {name: 'D'}),
(E {name: 'E'}),
(F {name: 'F'}),
(A)-[:R {weight: 2}]->(B),
(B)-[:R {weight: 4}]->(C),
(A)-[:R {weight: 3.5}]->(D),
(C)-[:R {weight: 20}]->(E),
(E)-[:R {weight: 80}]->(F)
1) To select the possible paths of length one to three from the start node, use MATCH with a variable-length relationship:
MATCH p = (A {name: 'A'})-[*1..3]->(T)
2) Then use the reduce function to calculate the average weight of each path, sort by it, and limit the result to one row (starting the sum at 0.0 avoids integer division when all the weights on a path are integers):
MATCH p = (A {name: 'A'})-[*1..3]->(T)
WITH p, T,
reduce(s = 0.0, r IN relationships(p) | s + r.weight) / length(p) AS weight
RETURN T ORDER BY weight DESC LIMIT 1
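The expected results from the question can be checked with a plain-Python walk over the example graph. This is a minimal sketch, not Neo4j code: the adjacency dictionaries and the function best_end_node are made up for illustration, edges are followed in their stored direction only, and the graph is assumed to have no cycles.

```python
# Directed adjacency lists with weights, mirroring the example graph.
full_graph = {
    "A": [("B", 2), ("D", 3.5)],
    "B": [("C", 4)],
    "C": [("E", 20)],
    "E": [("F", 80)],
}
small_graph = {
    "A": [("B", 2), ("D", 3.5)],
    "B": [("C", 4)],
}


def best_end_node(edges, start, max_hops=3):
    """Return (average weight, end node) of the best path of 1..max_hops."""
    best = None
    stack = [(start, [])]  # (current node, weights collected so far)
    while stack:
        node, weights = stack.pop()
        if weights:
            avg = sum(weights) / len(weights)
            if best is None or avg > best[0]:
                best = (avg, node)
        if len(weights) < max_hops:  # do not walk further than max_hops
            for nxt, w in edges.get(node, []):
                stack.append((nxt, weights + [w]))
    return best


print(best_end_node(small_graph, "A"))
print(best_end_node(full_graph, "A"))
```

F is never considered in the second graph because it is four hops from A, which matches the 1..3 bound in the queries.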
