Adding multiple relationships using WITH, WHERE, and UNWIND - neo4j

I have data in the following structure:
{"id": "1", "name": "A. I. Lazarev", "org": "United States Department of State", "tags": [{"t": "Infrared"}, {"t": "Near-infrared spectroscopy"}, {"t": "Infrared astronomy"}, {"t": "Data collection"}], "pubs": [{"i": "1542417502", "r": 6}], }
{"id": "2", "name": "Stevan Spremo", "tags": [{"t": "Micro-g environment"}, {"t": "Antibiotics"}, {"t": "Bacteriology"}], "pubs": [{"i": "222163962", "r": 0}], }
{"id": "3", "name": "Bricchi G", "pubs": [{"i": "2417067698", "r": 1}, {"i": "2406980973", "r": 1}]}
Some of the rows have tags, some have organizations, some have both, and some have neither.
I'd like to add relationships between (1) authors and tags, (2) authors and organizations, and (3) authors and publications. I have the publications as nodes already, so it should be fairly straightforward to get (3) once I get (1) and (2).
I have been trying to use the following code:
CALL apoc.periodic.iterate(
"CALL apoc.load.json('file:/test.txt') YIELD value AS q RETURN q",
"UNWIND q.id as id
CREATE (a:Author {id:id, name:q.name, citations:q.n_citation, publications:q.n_pubs})
WITH q, a
UNWIND q.tags as tags
MERGE (t:Tag {{name: tags.t}})
CREATE (a)-[:HAS_TAGS]->(t)
WITH q, a
WHERE q.org is not null
MERGE (o:Organization {name: q.org})
CREATE (a)-[:AFFILIATED_WITH]->(o)",
{batchSize:10000, iterateList:true, parallel:false})
The tags and the organizations show up multiple times in the data, but should only have one node each, so I have used MERGE to create unique nodes for these.
The problem with the following code is that it creates duplicate AFFILIATED_WITH relationships - it actually creates the same number of AFFILIATED_WITH relationships as there are tags.
How can I change the cypher query so that it isn't creating duplicate relationships?

After this clause:
UNWIND q.tags as tags
your query will have as many data rows as the number of tags for the current q (each row will have q, a, id, tags values). The subsequent operations will be performed once per data row. That is why you are creating too many AFFILIATED_WITH relationships.
To solve your issue, you have to reduce the number of data rows appropriately, at the appropriate time (and this will also speed up your processing, since unnecessarily repeated operations will be avoided). In your case, you can just change the second WITH q, a clause to WITH DISTINCT q, a:
CALL apoc.periodic.iterate(
"CALL apoc.load.json('file:///test.txt') YIELD value AS q RETURN q",
"CREATE (a:Author {id:q.id, name:q.name, citations:q.n_citation, publications:q.n_pubs})
WITH q, a
UNWIND q.tags as tags
MERGE (t:Tag {name: tags.t})
CREATE (a)-[:HAS_TAGS]->(t)
WITH DISTINCT q, a
WHERE q.org is not null
MERGE (o:Organization {name: q.org})
CREATE (a)-[:AFFILIATED_WITH]->(o)",
{batchSize:10000, iterateList:true, parallel:false}
)
I have also simplified the query by removing the unnecessary UNWIND q.id as id clause, and fixed some syntax issues.
[UPDATED]
If you want to add the AUTHORED relationships (as requested in the comments to this answer), you should do that before you create the AFFILIATED_WITH relationships -- since the WHERE q.org is not null clause would filter out some q nodes. Also, whenever you use CREATE to create a relationship, Cypher requires that you specify a direction for the relationship.
CALL apoc.periodic.iterate(
"CALL apoc.load.json('file:///test.txt') YIELD value AS q RETURN q",
"CREATE (a:Author {id:q.id, name:q.name, citations:q.n_citation, publications:q.n_pubs})
WITH q, a
UNWIND q.tags as tags
MERGE (t:Tag {name: tags.t})
CREATE (a)-[:HAS_TAGS]->(t)
WITH DISTINCT q, a
UNWIND q.pubs as pubs
MERGE (p:Quanta {id: pubs.i})
CREATE (a)-[r:AUTHORED {rank: pubs.r}]->(p)
WITH q, a
WHERE q.org is not null
MERGE (o:Organization {name: q.org})
CREATE (a)-[:AFFILIATED_WITH]->(o)",
{batchSize:10000, iterateList:true, parallel:false}
)

Related

Weird Neo4J Cypher behavior on setting relationship properties

I have a Cypher request for Neo4J of this kind:
MATCH (u:User {uid: $userId})
UNWIND $contextNames as contextName
MERGE (context:Context {name:contextName.name,by:u.uid,uid:contextName.uid})
ON CREATE SET context.timestamp=$timestamp
MERGE (context)-[by:BY]->(u)
SET by.timestamp = $timestamp
My params are:
{
"userId": "15229100-b20e-11e3-80d3-6150cb20a1b9",
"contextNames": [
{
"uid": "822e2580-1f5e-11e9-9ed0-5b93e8900a78",
"name": "fnas"
}
],
"timestamp": "1912811921129"
}
That query above sets 8 parameters because I have (probably) 8 other relationships of the :BY type in relation to that u. Which seems to me illogical as it should only find one relationship between a concrete context and a concrete u, create it if it doesn't exist, and set the property for it
However, when I do this:
MATCH (u:User {uid: $userId})
UNWIND $contextNames as contextName
MERGE (context:Context {name:contextName.name,by:u.uid,uid:contextName.uid})
ON CREATE SET context.timestamp=$timestamp
MERGE (context)-[:BY{timestamp:$timestamp}]->(u)
It either creates a relationship (if the one with the same timestamp doesn't exist) or it simply doesn't do anything (which seems to be the right behavior).
What is the reason for this discrepancy? A bug in Neo4J?

Is there a way to Traverse neo4j Path using Cypher for different relationships

I want to Traverse a PATH in neo4j (preferably using Cypher, but I can write neo4j managed extensions).
Problem -
For any starting node (:Person) I want to traverse hierarchy like
(me:Person)-[:FRIEND|:KNOWS*]->(newPerson:Person)
if the :FRIEND outgoing relationship is present then the path should traverse that, and ignore any :KNOWS outgoing relationships, if :FRIEND relationship does not exist but :KNOWS relationship is present then the PATH should traverse that node.
Right now the problem with above syntax is that it returns both the paths with :FRIEND and :KNOWS - I am not able to filter out a specific direction based on above requirement.
1. Example data set
For the ease of possible further answers and solutions I note my graph creating statement:
CREATE
(personA:Person {name:'Person A'})-[:FRIEND]->(personB:Person {name: 'Person B'}),
(personB)-[:FRIEND]->(personC:Person {name: 'Person C'}),
(personC)-[:FRIEND]->(personD:Person {name: 'Person D'}),
(personC)-[:FRIEND]->(personE:Person {name: 'Person E'}),
(personE)-[:FRIEND]->(personF:Person {name: 'Person F'}),
(personA)-[:KNOWS]->(personG:Person {name: 'Person G'}),
(personA)-[:KNOWS]->(personH:Person {name: 'Person H'}),
(personH)-[:KNOWS]->(personI:Person {name: 'Person I'}),
(personI)-[:FRIEND]->(personJ:Person {name: 'Person J'});
2. Scenario "Optional Match"
2.1 Solution
MATCH (startNode:Person {name:'Person A'})
OPTIONAL MATCH friendPath = (startNode)-[:FRIEND*]->(:Person)
OPTIONAL MATCH knowsPath = (startNode)-[:KNOWS*]->(:Person)
RETURN friendPath, knowsPath;
If you do not need every path to all nodes of the entire path, but only the whole, I recommend using shortestPath() for performance reasons.
2.1 Result
Note the missing node 'Person J', because it owns a FRIENDS relationship to node 'Person I'.
3. Scenario "Expand paths"
3.1 Solution
Alternatively you could use the Expand paths functions of the APOC user library. Depending on the next steps of your process you can choose between the identification of nodes, relationships or both.
MATCH (startNode:Person {name:'Person A'})
CALL apoc.path.subgraphNodes(startNode,
{maxLevel: -1, relationshipFilter: 'FRIEND>', labelFilter: '+Person'}) YIELD node AS friendNodes
CALL apoc.path.subgraphNodes(startNode,
{maxLevel: -1, relationshipFilter: 'KNOWS>', labelFilter: '+Person'}) YIELD node AS knowsNodes
WITH
collect(DISTINCT friendNodes.name) AS friendNodes,
collect(DISTINCT knowsNodes.name) AS knowsNodes
RETURN friendNodes, knowsNodes;
3.2 Explanation
line 1: defining your start node based on the name
line 2-3: Expand from the given startNode following the given relationships (relationshipFilter: 'FRIEND>') adhering to the label filter (labelFilter: '+Person').
line 4-5: Expand from the given startNode following the given relationships (relationshipFilter: 'KNOWS>') adhering to the label filter (labelFilter: '+Person').
line 7: aggregates all nodes by following the FRIEND relationship type (omit the .name part if you need the complete node)
line 8: aggregates all nodes by following the KNOWS relationship type (omit the .name part if you need the complete node)
line 9: render the resulting groups of nodes
3.3 Result
╒═════════════════════════════════════════════╤═════════════════════════════════════════════╕
│"friendNodes" │"knowsNodes" │
╞═════════════════════════════════════════════╪═════════════════════════════════════════════╡
│["Person A","Person B","Person C","Person E",│["Person A","Person H","Person G","Person I"]│
│"Person D","Person F"] │ │
└─────────────────────────────────────────────┴─────────────────────────────────────────────┘
MATCH p = (me:Person)-[:FRIEND|:KNOWS*]->(newPerson:Person)
WITH p, extract(r in relationships(p) | type(r)) AS types
RETURN p ORDER BY types asc LIMIT 1
This is a matter of interrogating the types of outgoing relationships for each node and then making a prioritized decision on which relationships to retain leveraging some nested case logic.
Using the small graph above
MATCH path = (a)-[r:KNOWS|FRIEND]->(b)
WITH a, COLLECT([type(r),a,r,b]) AS rels
WITH a,
rels,
CASE WHEN filter(el in rels WHERE el[0] = "FRIEND") THEN filter(el in rels WHERE el[0] = "FRIEND")
ELSE CASE WHEN filter(el in rels WHERE el[0] = "KNOWS") THEN filter(el in rels WHERE el[0] = "KNOWS") ELSE [''] END END AS search
UNWIND search AS s
RETURN s[1] AS a, s[2] AS r, s[3] AS b
I believe this returns your expected result:
Based on your logic, there should be no traversal to Person G or Person H from Person A, as there is a FRIEND relationship from Person A to Person B that takes precedence.
However there is a traversal from Person H to Person I because of the existence of the singular KNOWS relationship, and then a subsequent traversal from Person I to Person J.

How do I set relationship data as properties on a node?

I've taken the leap from SQL to Neo4j. I have a few complicated relationships that I need to set as properties on nodes as the first step towards building a recommendation engine.
This Cypher query returns a list of categories and weights.
MATCH (m:Movie {name: "The Matrix"})<-[:TAKEN_FROM]-(i:Image)-[r:CLASSIFIED_AS]->(c:Category) RETURN c.name, avg(r.weight)
This returns
{ "fighting": 0.334, "looking moody": 0.250, "lying down": 0.237 }
How do I set these results as key value pairs on the parent node?
The desired outcome is this:
(m:Movie { "name": "The Matrix", "fighting": 0.334, "looking moody": 0.250, "lying down": 0.237 })
Also, I assume I should process my (m:Movie) nodes in batches so what is the best way of accomplishing this?
Not quite sure how you're getting that output, that return shouldn't be returning both of them as key value pairs. Instead I would expect something like: {"c.name":"fighting", "avg(r.weight)":0.334}, with separate records for each pair.
You may need APOC procedures for this, as you need a means to set the property key to the value of the category name. That's a bit tricky, but you can do this by creating a map from the collected pairs, then use SET with += to update the relevant properties:
MATCH (m:Movie {name: "The Matrix"})<-[:TAKEN_FROM]-(:Image)-[r:CLASSIFIED_AS]->(c:Category)
WITH m, c.name as name, avg(r.weight) as weight
WITH m, collect([name, weight]) as category
WITH m, apoc.map.fromPairs(category) as categories
SET m += categories
As far as batching goes, take a look at apoc.periodic.iterate(), it will allow you to iterate on the streamed results of the outer query and execute the inner query on batches of the stream:
CALL apoc.periodic.iterate(
"MATCH (m:Movie)
RETURN m",
"MATCH (m)<-[:TAKEN_FROM]-(:Image)-[r:CLASSIFIED_AS]->(c:Category)
WITH m, c.name as name, avg(r.weight) as weight
WITH m, collect([name, weight]) as category
WITH m, apoc.map.fromPairs(category) as categories
SET m += categories",
{iterateList:true, parallel:false}) YIELD total, batches, errorMessages
RETURN total, batches, errorMessages

How to avoid duplicate nodes when importing JSON into Neo4J

Let's say I have a JSON containing relationships between people:
{
[
{
"name": "mike",
"loves": ["karen", "david", "joy"],
"loved": ["karen", "joy"]
},
{
"name": "karen",
"loves": ["mike", "david", "joy"],
"loved": ["mike"]
},
{
"name": "joy",
"loves": ["karen"],
"loved": ["karen", "david"]
}
]
}
I want to import nodes and relationships into a Neo4J DB. For this sample, there's only one relationship ("LOVES") and the 2 lists each user has just control the arrow's direction. I use the following query to import the JSON:
UNWIND {json} as person
CREATE (p:Person {name: person.username})
FOREACH (l in person.loves | MERGE (v:Person {name: l}) CREATE (p)-[:LOVES]->(v))
FOREACH (f in person.loved | MERGE (v:Person {name: f}) CREATE (v)-[:LOVES]->(p))
My problem is that I now have duplicate nodes (i.e. 2 nodes with {name: 'karen'}). I know I could probably use UNIQUE if I insert records one at a time. But what should I use here when importing a large JSON? (to be clear: the name property would always be unique in the JSON - i.e., there are no 2 "mikes").
[EDITED]
Since you cannot assume that a Person node does not yet exist, you need to MERGE your Person nodes everywhere.
If there is no need to use your loved data (that is, if the loves data is sufficient to create all the necessary relationships):
UNWIND {json} as person
MERGE (p:Person {name: person.name})
FOREACH (l in person.loves | MERGE (v:Person {name: l}) CREATE (p)-[:LOVES]->(v))
On the other hand, if the loved data is needed, then you need to use MERGE when creating the relationships as well (since any relationship might already exist).
UNWIND {json} as person
MERGE (p:Person {name: person.name})
FOREACH (l in person.loves | MERGE (v:Person {name: l}) MERGE (p)-[:LOVES]->(v))
FOREACH (f in person.loved | MERGE (v:Person {name: f}) MERGE (v)-[:LOVES]->(p))
In both cases, you should create an index (or uniqueness constraint) on :Person(name) to speed up the query.

Neo4j Passing distinct nodes through WITH in Cypher

I have the following query, where there are 3 MATCHES, connected with WITH, searching through 3 paths.
MATCH (:File {name: 'A'})-[:FILE_OF]->(:Fun {name: 'B'})-->(ent:CFGEntry)-[:Flows*]->()-->(expr:CallExpr {name: 'C'})-->()-[:IS_PARENT]->(Callee {name: 'd'})
WITH expr, ent
MATCH (expr)-->(:Arg {chNum: '1'})-->(id:Id)
WITH id, ent
MATCH (entry)-[:Flows*]->(:IdDecl)-[:Def]->(sym:Sym)
WHERE id.name = sym.name
RETURN id.name
The query returns two distinct id and one distinct entry, and 7 distinct sym.
The problem is that since in the second MATCH I pass "WITH id, entry", and two distinct id were found, two instances of entry is passed to the third match instead of 1, and the run time of the third match unnecessarily gets doubled at least.
I am wondering if anyone know how I should write this query to just make use of one single instance of entry.
Your best bet will be to aggregate id, but then you'll need to adjust your logic in the third part of your query accordingly:
MATCH (:File {name: 'A'})-[:FILE_OF]->(:Fun {name: 'B'})-->(ent:CFGEntry)-[:Flows*]->()-->(expr:CallExpr {name: 'C'})-->()-[:IS_PARENT]->(Callee {name: 'd'})
WITH expr, ent
MATCH (expr)-->(:Arg {chNum: '1'})-->(id:Id)
WITH collect(id.name) as names, ent
MATCH (entry)-[:Flows*]->(:IdDecl)-[:Def]->(sym:Sym)
WHERE sym.name in names
RETURN sym.name

Resources