How to avoid duplicate nodes when importing JSON into Neo4J

How to avoid duplicate nodes when importing JSON into Neo4J - neo4j

Let's say I have a JSON containing relationships between people:
{
[
{
"name": "mike",
"loves": ["karen", "david", "joy"],
"loved": ["karen", "joy"]
},
{
"name": "karen",
"loves": ["mike", "david", "joy"],
"loved": ["mike"]
},
{
"name": "joy",
"loves": ["karen"],
"loved": ["karen", "david"]
}
]
}
I want to import nodes and relationships into a Neo4J DB. For this sample, there's only one relationship ("LOVES") and the 2 lists each user has just control the arrow's direction. I use the following query to import the JSON:
UNWIND {json} as person
CREATE (p:Person {name: person.username})
FOREACH (l in person.loves | MERGE (v:Person {name: l}) CREATE (p)-[:LOVES]->(v))
FOREACH (f in person.loved | MERGE (v:Person {name: f}) CREATE (v)-[:LOVES]->(p))
My problem is that I now have duplicate nodes (i.e. 2 nodes with {name: 'karen'}). I know I could probably use UNIQUE if I insert records one at a time. But what should I use here when importing a large JSON? (to be clear: the name property would always be unique in the JSON - i.e., there are no 2 "mikes").

[EDITED]
Since you cannot assume that a Person node does not yet exist, you need to MERGE your Person nodes everywhere.
If there is no need to use your loved data (that is, if the loves data is sufficient to create all the necessary relationships):
UNWIND {json} as person
MERGE (p:Person {name: person.name})
FOREACH (l in person.loves | MERGE (v:Person {name: l}) CREATE (p)-[:LOVES]->(v))
On the other hand, if the loved data is needed, then you need to use MERGE when creating the relationships as well (since any relationship might already exist).
UNWIND {json} as person
MERGE (p:Person {name: person.name})
FOREACH (l in person.loves | MERGE (v:Person {name: l}) MERGE (p)-[:LOVES]->(v))
FOREACH (f in person.loved | MERGE (v:Person {name: f}) MERGE (v)-[:LOVES]->(p))
In both cases, you should create an index (or uniqueness constraint) on :Person(name) to speed up the query.

Related

Match paths of node types where nodes may have cycles

I'm trying to find a match pattern to match paths of certain node types. I don't care about the type of relation. Any relation type may match. I only care about the node types.
Of course the following would work:
MATCH (n)-->(:a)-->(:b)-->(:c) WHERE id(n) = 0
But, some of these paths may have relations to themselves. This could be for :b, so I'd also like to match:
MATCH (n)-->(:a)-->(:b)-->(:b)-->(:c) WHERE id(n) = 0
And:
MATCH (n)-->(:a)-->(:b)-->(:b)-->(:b)-->(:c) WHERE id(n) = 0
I can do this with relations easily enough, but I can't figure out how to do this with nodes, something like:
MATCH (n)-->(:a)-->(:b*1..)-->(:c) WHERE id(n) = 0
As a practical example, let's say I have a database with people, cars and bikes. The cars and bikes are "owned" by people, and people have relationships like son, daughter, husband, wife, etc. What I'm looking for is a query that from a specific node, gets all nodes of related types. So:
MATCH (n)-->(:person*1..)-->(:car) WHERE Id(n) = 0
I would expect that to get node "n", it's parents, grandparents, children, grandchildren, all recursively. And then of those people, their cars. If I could assume that I know the full list of relations, and that they only apply to people, I could get this to work as follows:
MATCH
p = (n)-->(:person)-[:son|daughter|husband|wife|etc*0..]->(:person)-->(:car)
WHERE Id(n) = 0
RETURN nodes(p)
What I'm looking for is the same without having to specify the full list of relations; but just the node label.

Edit:
If you want to find the path from one Person node to each Car node, using only the node labels, and assuming nodes may create cycles, you can use apoc.path.expandConfig.
For example:
MERGE (mark:Person {name: "Mark"})
MERGE (lju:Person {name: "Lju"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (zhen:Person {name: "Zhen"})
MERGE (martin:Person {name: "Martin"})
MERGE (joe:Person {name: "Joe"})
MERGE (stefan:Person {name: "Stefan"})
MERGE (alicia:Person {name: "Alicia"})
MERGE (markCar:Car {name: "Mark's car"})
MERGE (ljuCar:Car {name: "Lju's car"})
MERGE (praveenaCar:Car {name: "Praveena's car"})
MERGE (zhenCar:Car {name: "Zhen's car"})
MERGE (zhen)-[:CHILD_OF]-(mark)
MERGE (praveena)-[:CHILD_OF]-(martin)
MERGE (praveena)-[:MARRIED_TO]-(joe)
MERGE (zhen)-[:CHILD_OF]-(joe)
MERGE (alicia)-[:CHILD_OF]-(joe)
MERGE (zhen)-[:CHILD_OF]-(mark)
MERGE (anthony)-[:CHILD_OF]-(rik)
MERGE (martin)-[:CHILD_OF]-(mark)
MERGE (stefan)-[:CHILD_OF]-(zhen)
MERGE (lju)-[:CHILD_OF]-(stefan)
MERGE (markCar)-[:OWNED]-(mark)
MERGE (ljuCar)-[:OWNED]-(lju)
MERGE (praveenaCar)-[:OWNED]-(praveena)
MERGE (zhenCar)-[:OWNED]-(zhen)
Running a query:
MATCH (n:Person{name:'Joe'})
CALL apoc.path.expandConfig(n, {labelFilter: "Person|/Car", uniqueness: "NODE_GLOBAL"})
YIELD path
RETURN path
will return four unique paths from Joe node to the four car nodes. There are several options for uniqueness of the path, see uniqueness
The /CAR makes it a Termination label, i.e. returned paths are only up to this given label.

How to get tree structure that has access rights on each node

I have a tree structure like a folder structure so with a project with nested project without a depth limit, each node has access rights on them.
Here is my graph:
Here is my query:
MATCH (a:Account {name: "bob"})-[r:VIEWER | EDITOR]->(c:Project)
MATCH (c)<-[:IS_PARENT*]-(p)
WHERE (p)<-[:VIEWER | EDITOR]-(a)
WITH TYPE(r) as relation, p, collect(distinct c) AS children
RETURN {name: p.name, Children: [c in children | {name: c.name, access:relation}]}
Here is my result:
And this is what I want to get:
My problem is that the result is split in two results, and nested child isn't nested in cohort.
An other thing that is tricky is that I don't want to get a node if I don't have a relation with it.
For example here I removed the relation between bob and cohort:
So I must not get cohort in my result, like this:
Here is my data if you want to try:
MERGE (project:Project:RootProject {name: "project test"})
MERGE (child1:Project {name: "cohort"})
MERGE (child2:Project {name: "protocol"})
MERGE (child3:Project {name: "experience"})
MERGE (child4:Project {name: "nested child"})
MERGE (project)-[:IS_PARENT]->(child1)
MERGE (project)-[:IS_PARENT]->(child2)
MERGE (project)-[:IS_PARENT]->(child3)
MERGE (child1)-[:IS_PARENT]->(child4)
MERGE (bob:Account {name: "bob"})
MERGE (bob)-[:EDITOR]->(child4)
MERGE (bob)-[:EDITOR]->(child2)
MERGE (bob)-[:VIEWER]->(child3)
MERGE (bob)-[:VIEWER]->(child1)
MERGE (bob)-[:VIEWER]->(project)
I have tried a lot of things but I never get a good result.

Here is my answer. The main thing is to construct the json object as parent then grandparent projects to the root project rather than mixing both (line 10). Notice that I removed getting VIEWER or EDITOR relationship since removing them is also faster.
//Get the root project from bob
MATCH (a:Account {name: "bob"})-[r]->(root:RootProject)
WITH a, root, {name:root.name, access:type(r)} as rootProject
//Get only those projects without nested parent
MATCH (a)-[r]->(p:Project) WHERE EXISTS((p)-[:IS_PARENT]-(root))
WITH a, rootProject, p, type(r) as relations
//Get those projects with another parent (or grand parent of root project)
OPTIONAL MATCH (p)-[:IS_PARENT]->(gp:Project)<-[r]-(a)
//Collect the children and grandchildren
WITH rootProject, collect({children: {name: p.name, access:relations}, grandchild:(case when gp is null then [] else [{name:gp.name, access:type(r)}] end)}) as allChildren
RETURN {name: rootProject.name, access: rootProject.access, children: [c in allChildren]} as projects

It may not meet your expectation, but how about this for scalability?
MATCH (a:Account {name: "bob"})-[r:VIEWER|EDITOR]->(c:Project)
MATCH path=(p)-[:IS_PARENT*]->(c)
WHERE (p)<-[:VIEWER | EDITOR]-(a)
WITH COLLECT(path) AS paths
CALL apoc.convert.toTree(paths, false, {
nodes: {Project: ['name']},
rels: {IS_PARENT: ['name']}
}) YIELD value
RETURN value

Conditional partial merge of pattern into graph

I'm trying to create a relationship that connects a person to a city -> state -> country without recreating the city/state/country nodes and relationships if they do already exist - so I'd end-up with only one USA node in my graph for example
I start with a person
CREATE (p:Person {name:'Omar', Id: 'a'})
RETURN p
then I'd like to turn this into an apoc.do.case statement with apoc
or turn it into one merge statement using unique the constraint that creates a new node if no node is found or otherwise matches an existing node
// first case where the city/state/country all exist
MATCH (locality:Locality{name:"San Diego"})-[:SITUATED_IN]->(adminArea:AdministrativeArea { name: 'California' })-[:SITUATED_IN]->(country:Country { name: 'USA' })
MERGE (p)-[:SITUATED_IN]->(locality)-[:SITUATED_IN]->(adminArea)-[:SITUATED_IN]->(country)
return p
// second case where only state/country exist
MATCH (adminArea:AdministrativeArea { name: 'California' })-[:SITUATED_IN]->(country:Country { name: 'USA' })
MERGE (p)-[:SITUATED_IN]->(locality:Locality{name:"San Diego"})-[:SITUATED_IN]->(adminArea)-[:SITUATED_IN]->(country)
return p
// third case where only country exists
MATCH (country:Country { name: 'USA' })
MERGE (p)-[:SITUATED_IN]->(locality:Locality{name:"San Diego"})-[:SITUATED_IN]->(adminArea:AdministrativeArea { name: 'California' })-[:SITUATED_IN]->(country)
return p
// last case where none of city/state/country exist, so I have to create all nodes + relations
MERGE (p)-[:SITUATED_IN]->(locality:Locality{name:"San Diego"})-[:SITUATED_IN]->(adminArea:AdministrativeArea { name: 'California' })-[:SITUATED_IN]->(country:Country { name: 'USA' })
return p
The key here is I only want to end-up with one (California)->(USA). I don't want those nodes & relationships to get duplicated

Your queries that use MATCH never specify which Person you want. Variable names like p only exist for the life of a query (and sometimes not even that long). So p is unbound in your MATCH queries, and can result in your MERGE clauses creating empty nodes. You need to add MATCH (p:Person {Id: 'a'}) to the start of those queries (assuming all people have unique Id values).
It should NOT be the responsibility of every single query to ensure that all needed localities exist and are connected correctly -- that is way too much complexity and overhead for every query. Instead, you should create the appropriate localities and inter-locality relationships separately -- before you need them. If fact, it should be the responsibility of each query that creates a locality to create all the relationships associated with it.
A MERGE will only not create the specified pattern if every single thing in the pattern already exists, so to avoid duplicates a MERGE pattern should have at most 1 thing that might not already exist. So, a MERGE pattern should have at most 1 relationship, and if it has a relationship then the 2 end nodes should already be bound (by MATCH clauses, for example).
Once the Locality nodes and the inter-locality relationships exist, you can add a person like this:
MATCH (locality:Locality {name: "San Diego"})
MERGE (p:Person {Id: 'a'}) // create person if needed, specifying a unique identifier
ON CREATE SET p.name = 'Omar'; // set other properties as needed
MERGE (p)-[:SITUATED_IN]->(locality) // create relationship if necessary
The above considerations should help you design the code for creating the Locality nodes and the inter-locality relationships.

Finally, the solution I used is much simpler, it's a series of merges.
match (person:Person {Id: 'Omar'}) // that should be present in the graph
merge (country:Country {name: 'USA'})
merge (state:State {name: 'California'})-[:SITUATED_IN]->(country)
merge (city:City {name: 'Los Angeles'})-[:SITUATED_IN]->(state)
merge (person)-[:SITUATED_IN]->(city)
return person;

Adding multiple relationships using WITH, WHERE, and UNWIND

I have data in the following structure:
{"id": "1", "name": "A. I. Lazarev", "org": "United States Department of State", "tags": [{"t": "Infrared"}, {"t": "Near-infrared spectroscopy"}, {"t": "Infrared astronomy"}, {"t": "Data collection"}], "pubs": [{"i": "1542417502", "r": 6}], }
{"id": "2", "name": "Stevan Spremo", "tags": [{"t": "Micro-g environment"}, {"t": "Antibiotics"}, {"t": "Bacteriology"}], "pubs": [{"i": "222163962", "r": 0}], }
{"id": "3", "name": "Bricchi G", "pubs": [{"i": "2417067698", "r": 1}, {"i": "2406980973", "r": 1}]}
Some of the rows have tags, some have organizations, some have both, and some have neither.
I'd like to add relationships between (1) authors and tags, (2) authors and organizations, and (3) authors and publications. I have the publications as nodes already, so it should be fairly straightforward to get (3) once I get (1) and (2).
I have been trying to use the following code:
CALL apoc.periodic.iterate(
"CALL apoc.load.json('file:/test.txt') YIELD value AS q RETURN q",
"UNWIND q.id as id
CREATE (a:Author {id:id, name:q.name, citations:q.n_citation, publications:q.n_pubs})
WITH q, a
UNWIND q.tags as tags
MERGE (t:Tag {{name: tags.t}})
CREATE (a)-[:HAS_TAGS]->(t)
WITH q, a
WHERE q.org is not null
MERGE (o:Organization {name: q.org})
CREATE (a)-[:AFFILIATED_WITH]->(o)",
{batchSize:10000, iterateList:true, parallel:false})
The tags and the organizations show up multiple times in the data, but should only have one node each, so I have used MERGE to create unique nodes for these.
The problem with the following code is that it creates duplicate AFFILIATED_WITH relationships - it actually creates the same number of AFFILIATED_WITH relationships as there are tags.
How can I change the cypher query so that it isn't creating duplicate relationships?

After this clause:
UNWIND q.tags as tags
your query will have as many data rows as the number of tags for the current q (each row will have q, a, id, tags values). The subsequent operations will be performed once per data row. That is why you are creating too many AFFILIATED_WITH relationships.
To solve your issue, you have to reduce the number of data rows appropriately, at the appropriate time (and this will also speed up your processing, since unnecessarily repeated operations will be avoided). In your case, you can just change the second WITH q, a clause to WITH DISTINCT q, a:
CALL apoc.periodic.iterate(
"CALL apoc.load.json('file:///test.txt') YIELD value AS q RETURN q",
"CREATE (a:Author {id:q.id, name:q.name, citations:q.n_citation, publications:q.n_pubs})
WITH q, a
UNWIND q.tags as tags
MERGE (t:Tag {name: tags.t})
CREATE (a)-[:HAS_TAGS]->(t)
WITH DISTINCT q, a
WHERE q.org is not null
MERGE (o:Organization {name: q.org})
CREATE (a)-[:AFFILIATED_WITH]->(o)",
{batchSize:10000, iterateList:true, parallel:false}
)
I have also simplified the query by removing the unnecessary UNWIND q.id as id clause, and fixed some syntax issues.
[UPDATED]
If you want to add the AUTHORED relationships (as requested in the comments to this answer), you should do that before you create the AFFILIATED_WITH relationships -- since the WHERE q.org is not null clause would filter out some q nodes. Also, whenever you use CREATE to create a relationship, Cypher requires that you specify a direction for the relationship.
CALL apoc.periodic.iterate(
"CALL apoc.load.json('file:///test.txt') YIELD value AS q RETURN q",
"CREATE (a:Author {id:q.id, name:q.name, citations:q.n_citation, publications:q.n_pubs})
WITH q, a
UNWIND q.tags as tags
MERGE (t:Tag {name: tags.t})
CREATE (a)-[:HAS_TAGS]->(t)
WITH DISTINCT q, a
UNWIND q.pubs as pubs
MERGE (p:Quanta {id: pubs.i})
CREATE (a)-[r:AUTHORED {rank: pubs.r}]->(p)
WITH q, a
WHERE q.org is not null
MERGE (o:Organization {name: q.org})
CREATE (a)-[:AFFILIATED_WITH]->(o)",
{batchSize:10000, iterateList:true, parallel:false}
)

Neo4j Cypher : How to set StartNode or endNode of a relationship?

Let's say we have two nodes n and m
Is it possible to set m as the startNode for all Relationship with n as the StartNode n-[r]->()
The relationships can have different types.
Is it possible using only one cypher request?

No, you can't re-assign the start node for a certain relationship. What you can do is delete that relationship, and then create new ones that point where you want them to go.
For example:
MATCH (n { id: "startpoint"})-[r]->(), (m {id: "endpoint"})
MERGE (n)-[:newRelationship]->(m)
DELETE r;
This query would have to get much more complicated if the type of :newRelationship could change depending on r

Example Data:
CREATE CONSTRAINT ON (city:City) ASSERT city.name IS UNIQUE;
CREATE CONSTRAINT ON (state:State) ASSERT state.name IS UNIQUE;
MERGE (pb:City {name: 'Paderborn'})
MERGE (state1:State {name: 'Bavaria'})
MERGE (state2:State {name: 'North Rhine-Westphalia'})
MERGE (pb)-[:LOCATED_IN]->(state1);
The following statement will remove the existing relationship and create a new one:
MATCH (n { name: "Paderborn"})-[r]->(), (state {name: "Bavaria"})
MERGE (n)-[:LOCATED_IN]->(state)
DELETE r;

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

How to avoid duplicate nodes when importing JSON into Neo4J - neo4j

Related

Match paths of node types where nodes may have cycles

How to get tree structure that has access rights on each node

Conditional partial merge of pattern into graph

Adding multiple relationships using WITH, WHERE, and UNWIND

Neo4j Cypher : How to set StartNode or endNode of a relationship?

Categories

Resources