Neo4j Cypher: Unable to define relationships with duplicate records - neo4j

I have some user event data and would like to build a graph from it (snapshot of the data not shown).
The _id column contains duplicate values that all refer to the same person; however, there are multiple sessionField records for the same _id.
What I'd want is something like this:
Node A -> sessionNode a1 -> Action Node a11 (with event type as properties, 4 in this case)
-> sessionNode a2 -> Action Node a21 (with event type as properties, 2 in this case)
Node B -> sessionNode b1 -> Action Node b11 (with event type as properties, 3 in this case)
I tried the following code, but being new to graphs I could not reproduce that structure.
Note that session_streams_y holds the same data as _id.
LOAD CSV WITH HEADERS FROM 'file:///df_temp.csv' AS users
CREATE (p:Person {nodeId: users._id, sessionId: users.session_streams_y})
CREATE (sn:Session {sessId: users.sessionField, sessionId: users.session_streams_y})
MATCH (p:Person)
with p as ppl
MATCH (sn:Session)
WITH ppl, sn as ss
WHERE ppl.sessionId=ss.sessionId
MERGE (ppl)-[:Sessions {sess: 'Has Sessions'}]-(ss)
WITH [ppl,ss] as ns
CALL apoc.refactor.mergeNodes(ns) YIELD node
RETURN node
This produces a different graph from the one described above.

Something like this may work for you:
LOAD CSV WITH HEADERS FROM 'file:///df_temp.csv' AS row
MERGE (p:Person {id: row._id})
MERGE (s:Session {id: row.sessionField})
FOREACH(
x IN CASE WHEN s.eventTypes IS NULL OR NOT row.eventType IN s.eventTypes THEN [1] END |
SET s.eventTypes = COALESCE(s.eventTypes, []) + row.eventType)
MERGE (p)-[:HAS_SESSION]->(s)
RETURN p, s
The resulting Person and Session nodes would be unique, each Session node would have an eventTypes list with distinct values, and the appropriate Person and Session nodes would be connected by a HAS_SESSION relationship.
An Action node does not seem to be necessary.
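To make the effect of the three MERGE clauses and the FOREACH guard concrete, here is a minimal plain-Python model of what the query does to each CSV row. It is not Neo4j code, and the sample rows and event names are invented for illustration; only the column names (_id, sessionField, eventType) come from the question.

```python
# Plain-Python model of the MERGE-based import (not Neo4j code).
# The sample rows below are invented for illustration only.
rows = [
    {"_id": "A", "sessionField": "a1", "eventType": "click"},
    {"_id": "A", "sessionField": "a1", "eventType": "view"},
    {"_id": "A", "sessionField": "a1", "eventType": "click"},
    {"_id": "A", "sessionField": "a2", "eventType": "view"},
    {"_id": "B", "sessionField": "b1", "eventType": "click"},
]


def import_rows(rows):
    persons = set()      # models MERGE (p:Person {id: row._id})
    sessions = {}        # models MERGE (s:Session {id: row.sessionField})
    has_session = set()  # models MERGE (p)-[:HAS_SESSION]->(s)
    for row in rows:
        persons.add(row["_id"])
        event_types = sessions.setdefault(row["sessionField"], [])
        # Same guard as the FOREACH/CASE: append only values not yet present.
        if row["eventType"] not in event_types:
            event_types.append(row["eventType"])
        has_session.add((row["_id"], row["sessionField"]))
    return persons, sessions, has_session


persons, sessions, has_session = import_rows(rows)
print(sorted(persons))
print(sessions)
print(sorted(has_session))
```

Duplicate _id values collapse into one person, and a repeated eventType within a session is stored only once.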

Related

Using Unwind and Dumping Data in neo4j - Query Optimization

I am doing batch insertions into Neo4j, but my transactions take a very long time, and the database keeps growing.
In my project, a single case has more than 18,000 records that must be stored in the database, each with relationships to a Target node.
Each record is stored as a Friend node.
The relationships look like this:
Target_Node-[r:followed_by]->Friend_Node
Target_Node-[r:Friends_with]->Friend_Node
Target_Node-[r:Performs_Activity]->Friend_Node
My query runs separately for each case, and it is very likely that all three relationships exist between a given Target and Friend node.
I send 20 records per thread in a single insertion. The query unwinds the array of records and checks whether each record already exists as a Friend or Target node; if it does not, the query creates it as a Friend node and then assigns the relationship to it. If the two nodes already have a relationship and a new relationship type is passed to the query, the new relationship is added between them as well.
The query also checks whether a record has a Location property; if so, it creates a Location node and links the record to it.
Note: the create_rel variable can be Friends_with, Followed_by or Activity_p.
My query is as follows:
"""UNWIND [{id: "1235" , uid : "0"}] as user
UNWIND """+ l +""" as c
OPTIONAL MATCH (n:Target {id : c.id , uid : "0"})
OPTIONAL MATCH (m:Friend {id : c.id , screen_name:c.screen_name, uid : "0"})
WITH coalesce(n, m) as node,user,c // returns first non-null value
CALL apoc.do.when(node is null, "MERGE (n:Friend {id:c.id, name:c.name, profile: c.profile, location:c.location, uid : user.uid}) RETURN n", '', {c:c,user:user}) YIELD value
with coalesce(node, value.n) as y,user,c
MERGE (u:Target {id: user.id , uid : user.uid})
"""+create_rel+"""
foreach (sc in c.cityn | merge(cn:Location {location:sc.location, loc_lower : sc.loc_lower}) merge (y)-[:`located_at`]-(cn))
"""
The database also sometimes raises a TransientError.
Feedback is appreciated; I am still learning and would welcome any suggestions.
Thanks in advance.
I think your main problem lies in how you merge and match the nodes. Ideally, you always want to have a unique identifier for nodes. I can see that Friend node has a property id, which I will assume is unique for every Friend and Target.
First, you want to create a unique constraint on that property:
CREATE CONSTRAINT ON (f:Friend) ASSERT f.id IS UNIQUE;
CREATE CONSTRAINT ON (f:Target) ASSERT f.id IS UNIQUE;
You want something similar for Location nodes as well. It seems that you store both the location value and its lowercase form, so either of them should be unique for each node:
CREATE CONSTRAINT ON (l:Location) ASSERT l.location IS UNIQUE;
Now you can optimize your query like this:
"""UNWIND [{id: "1235" , uid : "0"}] as user
UNWIND """+ l +""" as c
OPTIONAL MATCH (n:Target {id : c.id})
OPTIONAL MATCH (m:Friend {id : c.id})
WITH coalesce(n, m) as node,user,c // returns first non-null value
CALL apoc.do.when(node is null,
"MERGE (n:Friend {id:c.id})
ON CREATE SET n+= {name:c.name, profile: c.profile,
location:c.location, uid : user.uid}
RETURN n", '', {c:c,user:user})
YIELD value
with coalesce(node, value.n) as y,user,c
MERGE (u:Target {id: user.id , uid : user.uid})
"""+create_rel+"""
foreach (sc in c.cityn |
merge(cn:Location {location:sc.location})
ON CREATE SET cn.loc_lower = sc.loc_lower
merge (y)-[:`located_at`]-(cn))
"""
1. You should avoid running multiple write queries (that can touch the same nodes and relationships) concurrently, as that can cause intermittent TransientErrors, as you have seen. (However, queries that cause transient errors can be retried.)
2. You should be passing user and l to your query as parameters, so that the Cypher planner will only need to compile the query once, and to make the query less prone to Cypher-injection attacks. (Also, there is no need to UNWIND a list that will always have just a single map -- you could have directly used the map via WITH {id: "1235" , uid : "0"} AS user. But, as I mentioned, you should just pass the user map as a parameter so you can efficiently change the user without forcing a recompilation.)
3. To avoid recompilation, you also need to make the create_rel string a constant string (so, it might as well be directly in your main query string). Again, you should also pass any variables needed by that as parameters.
4. You should create indexes (or uniqueness constraints) on :Target(id) and :Friend(id), to speed up your MATCH and MERGE clauses.
5. (a) MERGE (u:Target {id: user.id , uid : user.uid}) only needs to be executed once, not per c value. So, it should be executed before the UNWIND.
   (b) Also, it is not strictly necessary for this query to create u, since nothing in the query uses it. So, instead of running this identical MERGE clause once per thread, you should consider taking it out and running it as a separate standalone query.
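The first suggestion notes that queries failing with a transient error can be retried. A minimal sketch of such a retry loop in plain Python follows. The TransientError class here is a hypothetical stand-in for whatever exception your driver raises, and the attempt count and delays are arbitrary assumptions, not recommended values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for the driver's transient error type."""


def run_with_retry(work, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call work() and retry with exponential backoff on TransientError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return work()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Jitter keeps concurrent writers from retrying in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, base_delay))


# Demo: a write that fails twice before succeeding (no real sleeping).
attempts = {"count": 0}


def flaky_write():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("simulated lock conflict")
    return "committed"


result = run_with_retry(flaky_write, sleep=lambda seconds: None)
print(result, attempts["count"])
```

In real use, work would be a function that runs the whole write transaction, so that a retry repeats the complete unit of work.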
Here is a query that combines suggestions #2 and #5a (but you will have to take care of the others yourself), along with some refactoring using pattern comprehension to avoid unnecessary DB hits:
MERGE (u:Target {id: $user.id, uid: $user.uid})
WITH u
UNWIND $l as c
WITH u, c, [(n:Target {id : c.id})-[*0]-()|n] AS nodeList
WITH u, c, CASE WHEN SIZE(nodeList) = 0 THEN [(n:Friend {id : c.id})-[*0]-()|n] ELSE nodeList END AS nodeList
CALL apoc.do.when(SIZE(nodeList) = 0, 'MERGE (n:Friend {id: c.id, name: c.name, profile: c.profile, location: c.location, uid: user.uid}) RETURN n', 'RETURN nodeList[0] AS n', {c:c,user:$user,nodeList:nodeList}) YIELD value
WITH u, c, value.n AS node
FOREACH (sc IN c.cityn | MERGE (cn:Location {location: sc.location, loc_lower: sc.loc_lower}) MERGE (node)-[:located_at]-(cn))
// Put your parameterized create_rel code here
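On the client side, suggestions 2 and 3 amount to keeping the query text constant and sending only data as parameters. The sketch below shows one possible way to do that in Python. The lookup keys ("friends", "followers", "activity") and the shortened query text are assumptions made for illustration, not the full query above; the relationship names are taken from the question.

```python
# Fixed Cypher fragments, one per relationship type. The keys are made up;
# the point is that user input selects a fragment and is never pasted into it.
REL_CLAUSES = {
    "friends": "MERGE (u)-[:Friends_with]->(node)",
    "followers": "MERGE (u)-[:followed_by]->(node)",
    "activity": "MERGE (u)-[:Performs_Activity]->(node)",
}

# Simplified stand-in for the main query, using $user and $l as parameters.
BASE_QUERY = (
    "MERGE (u:Target {id: $user.id, uid: $user.uid}) "
    "WITH u UNWIND $l AS c "
    "MERGE (node:Friend {id: c.id}) "
)

# Only three distinct query strings ever reach the database.
QUERIES = {key: BASE_QUERY + clause for key, clause in REL_CLAUSES.items()}


def prepare(rel_key, user, records):
    """Return (query, parameters). Unknown keys raise KeyError."""
    return QUERIES[rel_key], {"user": user, "l": records}


query, params = prepare(
    "friends",
    {"id": "1235", "uid": "0"},
    [{"id": "42", "name": "example"}],
)
print(query)
print(params)
```

Because the query string does not change when the user or the record list changes, the planner can reuse the compiled plan, and values supplied by callers cannot alter the query structure.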

How to log when a relation already exist?

I have created a hierarchical tree in Neo4j to represent the organization chart of a company (picture not shown).
When I insert many relationships with LOAD CSV, I use this query:
LOAD CSV WITH HEADERS FROM "file:///newRelation.csv" AS row
MERGE (a:Person {name:row.person1Name})
MERGE(b:Person {name:row.person2Name})
FOREACH (t in CASE WHEN NOT EXISTS((a)-[*]->(b)) THEN [1] ELSE [] END |
MERGE (a)-[pr:Manage]->(b) )
With this query, I only create the relationship if the two people do not already have a hierarchical relationship.
How can I save (log) the list of relationships that are not created because the test below fails?
CASE WHEN NOT EXISTS((a)-[*]->(b))
You need to move the existence check to a level above the foreach:
LOAD CSV WITH HEADERS FROM "file:///newRelation.csv" AS row
MERGE (a:Person {name:row.person1Name})
MERGE(b:Person {name:row.person2Name})
WITH a, b, row,
CASE WHEN NOT exists((a)-[*]->(b)) THEN [1] ELSE [] END AS check
FOREACH (t IN check |
MERGE (a)-[pr:Manage]->(b)
)
WITH a, b, row, check WHERE size(check) = 0
RETURN a, b, row
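The logic of that query can be modelled in plain Python: for each row, test whether a directed path already exists, create the relationship if not, and otherwise record the row as skipped. This is only a sketch that assumes rows are processed one at a time in file order; the names in the sample data are invented.

```python
from collections import defaultdict


def has_path(edges, start, goal):
    """Depth-first search for a directed path of length >= 1 from start to goal."""
    stack = list(edges[start])
    seen = set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges[node])
    return False


def load_relations(rows):
    edges = defaultdict(set)  # manager -> direct reports
    skipped = []              # rows the query would RETURN for logging
    for person1, person2 in rows:
        if has_path(edges, person1, person2):
            skipped.append((person1, person2))
        else:
            edges[person1].add(person2)  # models MERGE (a)-[:Manage]->(b)
    return edges, skipped


rows = [("Ann", "Bob"), ("Bob", "Cid"), ("Ann", "Cid"), ("Ann", "Bob")]
edges, skipped = load_relations(rows)
print(skipped)
```

The third row is skipped because Ann already reaches Cid through Bob, and the fourth because the direct relationship already exists.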

Can you create a node and link it to multiple nodes in one query?

I've been trying to create a node and link it to a list of other nodes. I arrived at the following query:
MATCH (s:Subject), (p:Programme {name: 'Bsc. Agriculture' })
Where s.name IN ['Physics (CAPE)', 'Biology (CAPE)', 'Chemistry (CAPE)']
Create (c: Combo {amt:1}), (c)-[:contains]->(s), (p)-[:requires]->(c) return *
but unfortunately the combo node is created three times, and each combo node is linked to a Subject node. Is there any way I can adjust this query so that only one Combo node is created?
The query below uses the aggregation function COLLECT to produce a single row of data (instead of three) per p, so that the first CREATE is executed only once per p, thus producing only a single Combo node and a single requires relationship per p. The FOREACH clause then creates all the required contains relationships.
MATCH (sub:Subject), (p:Programme {name: 'Bsc. Agriculture' })
WHERE sub.name IN ['Physics (CAPE)', 'Biology (CAPE)', 'Chemistry (CAPE)']
WITH p, COLLECT(sub) AS subs
CREATE (p)-[:requires]->(c: Combo {amt:1})
FOREACH(s IN subs | CREATE (c)-[:contains]->(s))
RETURN *
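The difference between the two queries comes down to how many rows reach the CREATE clause. The following plain-Python model (not Neo4j code) illustrates that under the assumption that MATCH yields one row per matching Subject; the dictionary layout is invented for the illustration.

```python
subjects = ["Physics (CAPE)", "Biology (CAPE)", "Chemistry (CAPE)"]
programme = "Bsc. Agriculture"

# MATCH (sub:Subject), (p:Programme ...) yields one row per matching pair.
rows = [(programme, sub) for sub in subjects]


def create_per_row(rows):
    """Original query: CREATE runs once per row, so one Combo per Subject."""
    return [{"required_by": p, "contains": [sub]} for p, sub in rows]


def create_after_collect(rows):
    """WITH p, COLLECT(sub) AS subs: one row per p, then one CREATE per row."""
    grouped = {}
    for p, sub in rows:
        grouped.setdefault(p, []).append(sub)
    return [{"required_by": p, "contains": subs} for p, subs in grouped.items()]


print(len(create_per_row(rows)))        # number of Combo nodes, original query
print(len(create_after_collect(rows)))  # number of Combo nodes, COLLECT version
```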
Your query creates a Combo node for every row that MATCH produces, that is, once per matching Subject.
One solution is to create the Combo node first and then use it in the rest of the query. Because the MATCH still yields one row per Subject, the requires relationship below uses MERGE so that it is created only once:
CREATE (c:Combo {amt:1})
WITH c
MATCH (s:Subject), (p:Programme {name: 'Bsc. Agriculture' })
WHERE s.name IN ['Physics (CAPE)', 'Biology (CAPE)', 'Chemistry (CAPE)']
CREATE (c)-[:contains]->(s)
MERGE (p)-[:requires]->(c)
RETURN *
If you use MERGE instead of CREATE, it should address your requirements and avoid creating duplicate nodes. But this assumes that each merged node is uniquely identified by the properties given in the pattern. MERGE accepts only a single pattern per clause, so each part gets its own MERGE. Try this:
MATCH (s:Subject), (p:Programme {name: 'Bsc. Agriculture' })
WHERE s.name IN ['Physics (CAPE)', 'Biology (CAPE)', 'Chemistry (CAPE)']
MERGE (c:Combo {amt:1})
MERGE (c)-[:contains]->(s)
MERGE (p)-[:requires]->(c)
RETURN *
MERGE creates an element only if it does not already exist. The caution is that an existing element must match the whole pattern (labels and properties) to be reused. There is more on this at https://graphaware.com/neo4j/2014/07/31/cypher-merge-explained.html.

Cypher: Adding properties to relationship as distinct values

I'm trying to write a query that connects people who share the same hobbies and stores the hobbies as an array property on the relationship. (I have already exposed it as a web service running on my local machine, so I receive name and people as parameters.)
My first attempt was:
MERGE (aa:Person{name:$name})
WITH aa, $people as people
FOREACH (person IN people |
MERGE (bb:Person{name:person.name})
MERGE (bb)-[r:SHARESSAMEHOBBY]->(aa)
ON MATCH SET r.hobbies = r.hobbies + person.hobby
ON CREATE SET r.hobbies = [person.hobby])
However, this produced duplicate elements in the property, like ["swimming","swimming"].
I want to store only unique values, so I then tried the following query:
MERGE (aa:Person{name:$name})
WITH aa, $people as people
FOREACH (person IN people |
MERGE (bb:Person{name:person.name})
MERGE (bb)-[r:SHARESSAMEHOBBY]->(aa)
WITH r, COALESCE(r.hobbies, []) + person.hobby AS hobbies
UNWIND hobbies as unwindedHobbies
WITH r, collect(distinct, unwindedHobbies) AS unique
set r.as = unique)
However, it now gives me a syntax error:
errorMessage = "[Neo.ClientError.Statement.SyntaxError] Invalid use of WITH inside FOREACH
Any help is appreciated.
This should work:
MERGE (aa:Person {name: $name})
WITH aa
UNWIND $people AS person
MERGE (bb:Person {name: person.name})
MERGE (bb)-[r:SHARESSAMEHOBBY]-(aa)
WITH r, person, CASE
WHEN NOT EXISTS(r.hobbies) THEN {new: true}
WHEN NOT (person.hobby IN r.hobbies) THEN {add: true}
END AS todo
FOREACH(ignored IN todo.new | SET r.hobbies = [person.hobby])
FOREACH(ignored IN todo.add | SET r.hobbies = r.hobbies + person.hobby);
You actually had 2 issues, and the above query addresses both:
1. If a SHARESSAMEHOBBY relationship already existed in the opposite direction (from aa to bb), the following MERGE clause would have caused the unnecessary creation of a second SHARESSAMEHOBBY relationship (from bb to aa):
MERGE (bb)-[r:SHARESSAMEHOBBY]->(aa)
To avoid this, you should have used a non-directional relationship pattern (which is permitted by MERGE, but not CREATE) to match a relationship in either direction, like this:
MERGE (bb)-[r:SHARESSAMEHOBBY]-(aa)
2. You needed to determine whether it is necessary to initialize a new hobbies list or to add the person.hobby value to an existing r.hobbies list that did not already have that value. The above query uses a CASE expression to assign to todo either NULL, or a map with a key indicating what additional work to do. It then uses a FOREACH clause to execute each thing to do, as appropriate.
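Both points can be illustrated with a small plain-Python model (not Neo4j code). An unordered pair of names stands in for the non-directional relationship, and the two branches correspond to the new and add cases of the CASE expression. The function name and sample people are invented for the sketch.

```python
def share_hobbies(name, people, rels=None):
    """Model of the query: rels maps an unordered name pair to its hobbies list."""
    rels = {} if rels is None else rels
    for person in people:
        # frozenset ignores direction, like MERGE (bb)-[r:SHARESSAMEHOBBY]-(aa).
        key = frozenset((name, person["name"]))
        hobbies = rels.get(key)
        if hobbies is None:
            rels[key] = [person["hobby"]]      # the "new" case
        elif person["hobby"] not in hobbies:
            hobbies.append(person["hobby"])    # the "add" case
        # Otherwise todo is NULL and nothing is written.
    return rels


people = [
    {"name": "Bob", "hobby": "swimming"},
    {"name": "Bob", "hobby": "swimming"},
    {"name": "Bob", "hobby": "chess"},
]
rels = share_hobbies("Ann", people)
# The same pair seen from the other side reuses the existing relationship.
rels = share_hobbies("Bob", [{"name": "Ann", "hobby": "swimming"}], rels)
print(rels)
```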

Order list without scanning every node

When using LIMIT with ORDER BY, every node with the selected label still gets scanned (even with index).
For example, let's say I have the following:
MERGE (:Test {name:'b'})
MERGE (:Test {name:'c'})
MERGE (:Test {name:'a'})
MERGE (:Test {name:'d'})
Running the following gets us :Test {name: 'a'}, however using PROFILE we can see the entire list get scanned, which obviously will not scale well.
MATCH (n:Test)
RETURN n
ORDER BY n.name
LIMIT 1
I have a few sorting options available for this label. The order of nodes within these sorts should not change often; however, I can't cache these lists because each list is personalized for a user, i.e. a user may have hidden :Test {name:'b'}.
Is there a golden rule for something like this? Would creating pointers from node to node for each sort option be a good option here? Something like
(n {name:'a'})-[:ABC_NEXT]->(n {name:'b'})-[:ABC_NEXT]->(n {name:'c'})-...
Would I be able to have multiple sort pointers? Would that be overkill?
Ref:
https://neo4j.com/blog/moving-relationships-neo4j/
http://www.markhneedham.com/blog/2014/04/19/neo4j-cypher-creating-relationships-between-a-collection-of-nodes-invalid-input/
Here's what I ended up doing for anyone interested:
// connect nodes
MATCH (n:Test)
WITH n
ORDER BY n.name
WITH COLLECT(n) AS nodes
FOREACH(i in RANGE(0, size(nodes)-2) |
FOREACH(node1 in [nodes[i]] |
FOREACH(node2 in [nodes[i+1]] |
MERGE (node1)-[:IN_ORDER_NAME]->(node2))))
// create list, point first item to list
CREATE (l:List { name: 'name' })
WITH l
MATCH (n:Test) WHERE NOT (n)<-[:IN_ORDER_NAME]-()
MERGE (l)-[:IN_ORDER_NAME]->(n)
// getting 10 nodes sorted alphabetically
MATCH (:List { name: 'name' })-[:IN_ORDER_NAME*]->(n)
RETURN n
LIMIT 10
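The idea behind the three queries above is a singly linked list kept in sorted order: each node points to its successor, a list node points to the head, and a read only walks as many links as it needs. A minimal plain-Python model of that structure follows. The hidden argument is a hypothetical addition that mirrors the per-user hidden nodes mentioned in the question; it is not part of the queries above.

```python
def link_in_order(names):
    """Return (head, next_of): each name points to its successor in sorted order."""
    ordered = sorted(names)
    next_of = dict(zip(ordered, ordered[1:]))
    head = ordered[0] if ordered else None
    return head, next_of


def first_n(head, next_of, n, hidden=frozenset()):
    """Walk the chain from head, skip hidden names, stop after n results."""
    result = []
    current = head
    while current is not None and len(result) < n:
        if current not in hidden:
            result.append(current)
        current = next_of.get(current)
    return result


head, next_of = link_in_order(["b", "c", "a", "d"])
print(first_n(head, next_of, 2))
print(first_n(head, next_of, 2, hidden={"b"}))
```

The walk touches only the first few links rather than sorting every node, at the cost of maintaining the chain whenever a node is added, removed, or renamed.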

Resources