Using Unwind and Dumping Data in neo4j - Query Optimization - neo4j

I am batch-inserting data into Neo4j, but my transactions take a very long time, and they get slower as the database keeps growing.
In my project, a single case can have more than 18,000 records that must be stored in the database and related to a Target node.
Each record is stored as a Friend node.
The relationships look like this:
Target_Node-[r:followed_by]->Friend_Node
Target_Node-[r:Friends_with]->Friend_Node
Target_Node-[r:Performs_Activity]->Friend_Node
My query runs separately for each case, and it is very likely that all three relationships exist between a Target and a Friend node.
I send 20 records per thread in a single insertion. The query unwinds the array of records and checks whether each record already exists as a Friend or Target node; if not, it creates a Friend node and then attaches the relationship. If the two nodes already have a relationship and a different relationship type is passed to the query, the new relationship is added between them as well.
The query also checks whether a record has a location property; if so, it creates a Location node and relates the record to it.
Note: the create_rel variable can be Friends_with, Followed_by or Activity_p.
My query is as follows:
"""UNWIND [{id: "1235" , uid : "0"}] as user
UNWIND """+ l +""" as c
OPTIONAL MATCH (n:Target {id : c.id , uid : "0"})
OPTIONAL MATCH (m:Friend {id : c.id , screen_name:c.screen_name, uid : "0"})
WITH coalesce(n, m) as node,user,c // returns first non-null value
CALL apoc.do.when(node is null, "MERGE (n:Friend {id:c.id, name:c.name, profile: c.profile, location:c.location, uid : user.uid}) RETURN n", '', {c:c,user:user}) YIELD value
with coalesce(node, value.n) as y,user,c
MERGE (u:Target {id: user.id , uid : user.uid})
"""+create_rel+"""
foreach (sc in c.cityn | merge(cn:Location {location:sc.location, loc_lower : sc.loc_lower}) merge (y)-[:`located_at`]-(cn))
"""
The database also sometimes throws a TransientError.
I am still learning, so feedback and suggestions are very welcome.
Thanks in advance

I think your main problem lies in how you merge and match the nodes. Ideally, you always want to have a unique identifier for nodes. I can see that Friend node has a property id, which I will assume is unique for every Friend and Target.
First, you want to create a unique constraint on that property:
CREATE CONSTRAINT ON (f:Friend) ASSERT f.id IS UNIQUE;
CREATE CONSTRAINT ON (f:Target) ASSERT f.id IS UNIQUE;
You want something similar for Location nodes as well. It seems you store both the location value and its lowercase form, so either of them should be unique for each node.
CREATE CONSTRAINT ON (l:Location) ASSERT l.location IS UNIQUE;
Now you can optimize your query like this:
"""UNWIND [{id: "1235" , uid : "0"}] as user
UNWIND """+ l +""" as c
OPTIONAL MATCH (n:Target {id : c.id})
OPTIONAL MATCH (m:Friend {id : c.id})
WITH coalesce(n, m) as node,user,c // returns first non-null value
CALL apoc.do.when(node is null,
"MERGE (n:Friend {id:c.id})
ON CREATE SET n+= {name:c.name, profile: c.profile,
location:c.location, uid : user.uid}
RETURN n", '', {c:c,user:user})
YIELD value
with coalesce(node, value.n) as y,user,c
MERGE (u:Target {id: user.id , uid : user.uid})
"""+create_rel+"""
foreach (sc in c.cityn |
merge(cn:Location {location:sc.location})
ON CREATE SET cn.loc_lower = sc.loc_lower
merge (y)-[:`located_at`]-(cn))
"""

1. You should avoid running multiple write queries (that can touch the same nodes and relationships) concurrently, as that could cause intermittent TransientErrors, as you have seen. (However, queries that cause transient errors can be retried.)
2. You should be passing user and l to your query as parameters, so that the Cypher planner will only need to compile the query once, and to make the query less prone to Cypher-injection attacks. (Also, there is no need to UNWIND a list that will always have just a single map -- you could have directly used the map via WITH {id: "1235" , uid : "0"} AS user. But, as I mentioned, you should just pass the user map as a parameter so you can efficiently change the user without forcing a recompilation.)
3. To avoid recompilation, you also need to make the create_rel string a constant string (so, it might as well be directly in your main query string). Again, you should also pass any variables needed by that as parameters.
4. You should create indexes (or uniqueness constraints) on :Target(id) and :Friend(id), to speed up your MATCH and MERGE clauses.
5. (a) MERGE (u:Target {id: user.id , uid : user.uid}) only needs to be executed once, not per c value. So, it should be executed before the UNWIND.
(b) Also, it is not strictly necessary for this query to create u, since nothing in the query uses it. So, instead of running this identical MERGE clause once per thread, you should consider taking it out and running it as a separate standalone query.
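The retry mentioned above for transient errors could look roughly like the following. This is only a sketch: TransientFailure is a hypothetical stand-in for whatever transient error type your driver raises, and run_with_retry and its parameters are names made up for illustration. If you use the official driver's managed transaction functions, they may already retry for you, so check that first.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientFailure(Exception):
    """Hypothetical stand-in for the driver's transient error type."""


def run_with_retry(
    work: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (TransientFailure,),
    attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call work() and retry it with exponential backoff on transient errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return work()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x, ... the base delay, plus random jitter so that
            # concurrent writers do not all retry at the same moment.
            sleep(base_delay * (2 ** attempt) * (1.0 + random.random()))
    raise AssertionError("unreachable")
```

Here work would be a small function that runs one batch in its own transaction, so a retry repeats the whole transaction rather than part of it.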
Here is a query that combines suggestions #2 and #5a (but you will have to take care of the others yourself), along with some refactoring using pattern comprehension to avoid unnecessary DB hits:
MERGE (u:Target {id: $user.id, uid: $user.uid})
WITH u
UNWIND $l as c
WITH u, c, [(n:Target {id : c.id})-[*0]-()|n] AS nodeList
WITH u, c, CASE WHEN SIZE(nodeList) = 0 THEN [(n:Friend {id : c.id})-[*0]-()|n] ELSE nodeList END AS nodeList
CALL apoc.do.when(SIZE(nodeList) = 0, 'MERGE (n:Friend {id: c.id, name: c.name, profile: c.profile, location: c.location, uid: user.uid}) RETURN n', 'RETURN nodeList[0] AS n', {c:c,user:$user,nodeList:nodeList}) YIELD value
WITH u, c, value.n AS node
FOREACH (sc IN c.cityn | MERGE (cn:Location {location: sc.location, loc_lower: sc.loc_lower}) MERGE (node)-[:located_at]-(cn))
// Put your parameterized create_rel code here
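On the Python side, keeping the query text constant and sending only parameters could be sketched as below. The helper name batched_params is hypothetical, the batch size of 20 comes from the question, and the query body is a shortened placeholder rather than the full query above. The commented session.run call assumes a session object from the official Neo4j Python driver.

```python
from typing import Any, Dict, Iterator, Mapping, Sequence

# The query text never changes between calls; only $user and $l do.
# Placeholder body: substitute the full parameterized query here.
QUERY = """
MERGE (u:Target {id: $user.id, uid: $user.uid})
WITH u
UNWIND $l AS c
RETURN count(c) AS received
"""


def batched_params(
    user: Mapping[str, Any],
    records: Sequence[Mapping[str, Any]],
    batch_size: int = 20,
) -> Iterator[Dict[str, Any]]:
    """Yield one parameter map per batch of records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield {"user": dict(user), "l": list(records[start:start + batch_size])}


# Assumed usage with a driver session (not executed here):
#   for params in batched_params(user, records):
#       session.run(QUERY, params)
```

Because nothing is concatenated into the query string, the same text is sent every time, which is what lets the planner reuse its compiled plan.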

Related

Pass variable from outside to query used in a Neo4j apoc.do.case procedure

So I am writing a trigger which looks at a certain type of relationship and gets the start and end node of this relationship. Additionally another node is matched.
What the trigger should do depends on a condition, which is why I need apoc.do.case. Both the conditions and the queries need access to the node variables. In the conditions that works fine, which I tested with different queries. But in the queries shown below the variables cannot be accessed; I tested this by accessing one node by id to see what gets created. The problem is that node o is not the previously matched node but gets created as a completely new, empty node. (When using this query for the first condition: Match (c) where id(c) = 62 CREATE (c)-[:is_a]->(o) )
So my problem is how to access the outer variables inside apoc.do.case.
I know that it is possible to use input parameters or another WITH statement inside apoc.do.case; I tried different approaches but did not get it right.
Maybe someone knows how to do it.
CALL apoc.trigger.add('Update drinking',
"UNWIND [rel in $createdRelationships WHERE type(rel) = 'has_alcohol_comsumption'] AS rel
WITH startNode(rel) AS start, endNode(rel) AS end
MATCH (o:customer{label:'drinker'})
CALL apoc.do.case(
[end.label <> 'Never' and not((start)-[:is_a]->(o)),
'CREATE (start)-[:is_a]->(o)',
end.label ='Never' and((start)-[:is_a]->(o)),
'MATCH (start)-[rel2:is_a]->(o) DETACH DELETE rel2' ])
YIELD value return value",
{phase:'before'});
To understand the problem it is enough to create a few nodes:
create(rarely:occurrence_frequency{label:'Rarely'})
create(never:occurrence_frequency{label:'Never'})
create(christina:customer{label:'Christina',has_name:'Christina'})
create(drinker:customer{label:'drinker'})
Match (n:customer {label:'Christina'}) Match (x:occurrence_frequency {label:'Never'}) Create (n)-[:has_alcohol_comsumption]->(x)
So at the beginning Christina does not drink alcohol, but the nodes drinker and Rarely exist. When we change the alcohol consumption as shown below, an is_a relationship to drinker should be created. Conversely, when we change the consumption from Rarely to Never, the is_a relationship should be deleted (which works, but ALL is_a relationships get deleted).
Match (n:customer {label:'Christina'})-[rel:has_alcohol_comsumption]->(m:occurrence_frequency {label:'Never'})
Match (o:occurrence_frequency {label:'Rarely'})
Call apoc.refactor.to(rel, o) Yield input, output
Return input, output
I hope I did not forget anything and that my question is understandable.
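You are not passing the required parameters to apoc.do.case. The inner queries run as separate statements and only see what is in the params map, so start, end and o have to be passed in explicitly. Try this: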
CALL apoc.trigger.add(
'Update drinking',
"UNWIND [rel in $createdRelationships WHERE type(rel) = 'has_alcohol_comsumption'] AS rel
WITH startNode(rel) AS start, endNode(rel) AS end
MATCH (o:customer{label:'drinker'})
CALL apoc.do.case(
[end.label <> 'Never' and not((start)-[:is_a]->(o)),
'CREATE (start)-[:is_a]->(o)',
end.label ='Never' and ((start)-[:is_a]->(o)),
'MATCH (start)-[rel2:is_a]->(o) DETACH DELETE rel2' ],
'',
{start: start, end: end, o: o})
YIELD value return value",
{phase:'before'}
);

Is matching with id performant in Neo4J?

I'm wondering, when I have read the data of a node and I want to match it in another query, which way will have the best performance? Using id like this:
MATCH (n) WHERE ID(n) = 1234
or using an indexed property of the node:
MATCH (n:Label {SomeIndexProperty: 3456})
Which one is better?
The ID is Neo4j's internal technical identifier, and it should not be used as a primary key in your application.
Every node (and relationship) has a technical ID, and it's stable over time.
But if you delete a node, for example node 32, Neo4j can reuse that ID for a new node.
So using it in queries within the same transaction is no problem; beyond that, you should know what you are doing.
The only way to retrieve the technical ID is to use the ID function, as you do in your first query: MATCH (n) WHERE ID(n) = 1234 RETURN n.
The ID is not exposed as a node's property, so you can't do MATCH (n {ID:1234}) RETURN n.
As you have noticed, if you want a WHERE on strict equality, you can put the condition directly on the node.
For example:
MATCH (n:Node) WHERE n.name = 'logisima' RETURN n
MATCH (n:Node {name:'logisima'}) RETURN n
Those two queries are identical: they generate the same query plan, and the second form is just syntactic sugar.
Is it faster to retrieve a node by its ID or by an indexed property?
The easiest way to answer this question is to profile the two queries.
On the one based on the ID, you will see the operator NodeByIdSeek, which costs 1 db hit, and on the one with a unique constraint you will see the operator NodeUniqueIndexSeek with 2 db hits.
So searching a node by its ID is faster.
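If you want to script the profiling comparison described above rather than read the plan by eye, a small helper can add up the db hits of a profiled plan. This sketch assumes the plan arrives as a nested dict with "dbHits" and "children" keys, which is how I understand the Python driver's result summary to expose it; the key names and the helper name total_db_hits are assumptions to verify against your driver version.

```python
from typing import Any, Dict


def total_db_hits(plan: Dict[str, Any]) -> int:
    """Sum db hits over a profiled plan tree (assumed nested-dict shape)."""
    hits = int(plan.get("dbHits", 0))
    for child in plan.get("children", []):
        hits += total_db_hits(child)
    return hits


# Assumed usage with a driver session (not executed here):
#   plan = session.run("PROFILE MATCH (n) WHERE ID(n) = 1234 RETURN n").consume().profile
#   print(total_db_hits(plan))
```

Running it once for each of the two queries gives two numbers to compare on your own data, which is more reliable than a general rule.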

Neo4j Cypher: Create a relationship only if the end node exists

Building on this similar question, I want the most performant way to handle this scenario.
MERGE (n1{id:<uuid>})
SET n1.topicID = <unique_name1>
IF (EXISTS((a:Topic{id:<unique_name1>})) | CREATE UNIQUE (n1)-[:HAS]->(a))
MERGE (n2{id:<uuid>})
SET n2.topicID = <unique_name2>
IF (EXISTS((a:Topic{id:<unique_name2>})) | CREATE UNIQUE (n2)-[:HAS]->(a))
Unfortunately, IF doesn't exist, and EXISTS can't be used to match or find a unique node.
I can't use OPTIONAL MATCH, because then CREATE UNIQUE will throw a null exception (as much as I wish it would ignore null parameters)
I can't use MATCH, because if the topic doesn't exist, I will lose all my rows.
I can't use MERGE, because I don't want to create the node if it doesn't exist yet.
I can't use APOC, because I have no guarantee that it will be available for use on our Neo4j server.
The best solution I have right now is
MERGE (a:TEST{id:1})
WITH a
OPTIONAL MATCH (b:TEST{id:2})
// collect b so that there are no nulls, and rows aren't lost when no match
WITH a, collect(b) AS c
FOREACH(n IN c | CREATE UNIQUE (a)-[:HAS]->(n))
RETURN a
However, this seems kind of complicated and needs 2 WITHs for what is essentially "CREATE UNIQUE relationship if both the start and end node exist" (and the plan contains an Eager operator). Is it possible to do any better? (Using Cypher 3.1)
You can simplify quite a bit:
MERGE (a:TEST{id:1})
WITH a
MATCH (b:TEST{id:2})
CREATE UNIQUE (a)-[:HAS]->(b)
RETURN a;
The (single) WITH clause serves to split the query into 2 "sub-queries".
So, if the MATCH sub-query fails, it only aborts its own sub-query (and any subsequent ones) but does not roll back the previous successful MERGE sub-query.
Note, however, that whenever a final sub-query fails, the RETURN clause would return nothing. You will have to determine if this is acceptable.
Because the above RETURN clause would only return something if b exists, it might make more sense for it to return b, or the path. Here is an example of the latter (p will be assigned a value even if the path already existed):
MERGE (a:TEST{id:1})
WITH a
MATCH (b:TEST{id:2})
CREATE UNIQUE p=(a)-[:HAS]->(b)
RETURN p;
[UPDATE]
In neo4j 4.0+, CREATE UNIQUE is no longer supported, so MERGE needs to be used instead.
Also, if you want to return a even if b does not exist, you can use the APOC procedure apoc.do.when:
MERGE (a:TEST{id:1})
WITH a
OPTIONAL MATCH (b:TEST{id:2})
CALL apoc.do.when(
b IS NOT NULL,
'MERGE (a)-[:HAS]->(b)',
'',
{a: a, b: b}) YIELD value
RETURN a;
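For Neo4j 4.0+ without APOC, the simplified query above with MERGE in place of CREATE UNIQUE could be wrapped like this. It is a sketch, not a tested recipe: link_if_target_exists is a hypothetical helper, session is assumed to be a session from the official Python driver, and the $a_id and $b_id parameter names are made up for illustration.

```python
from typing import Any

# MERGE replaces CREATE UNIQUE here. If the MATCH finds no b, the query
# produces no rows, but the earlier MERGE of a has still taken effect.
LINK_IF_EXISTS = """
MERGE (a:TEST {id: $a_id})
WITH a
MATCH (b:TEST {id: $b_id})
MERGE (a)-[:HAS]->(b)
RETURN a
"""


def link_if_target_exists(session: Any, a_id: Any, b_id: Any) -> bool:
    """Run the query and report whether b existed (a row came back)."""
    result = session.run(LINK_IF_EXISTS, a_id=a_id, b_id=b_id)
    return result.single() is not None
```

The boolean lets the caller tell "linked" apart from "end node missing" without needing a row for a in both cases.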

Find a graph node by a field, update all other fields

I have a Neo4J graph database where I want to store users and relationships between them.
I want to be able to update a User node, which I look up by GUID, with the data contained in a .NET User object. Ideally I'd like to know how to do that in Neo4JClient, but even a plain Cypher query would do.
Ideally I'd like to pass the whole object, without knowing which properties have been modified, and replace all of them, including array properties. The example below, by contrast, knows that PhoneNumber is the property to update.
Something like this:
MATCH (n:`User` {Id:'24d03ce7-8d23-4dc3-a13b-cffc0c7ce0d8'})
MERGE (n {PhoneNumber: '123-123-1234'})
RETURN n
The problem with the code above is that MERGE redefines n, and I get this error:
n already declared (line 2, column 8) "MERGE (n {PhoneNumber: '123-123-1234'})" ^
If all you want to do is completely replace all the properties of existing nodes, do not use MERGE. You should just use MATCH, and SET all the properties. Something like this:
MATCH (n:`User` {Id:'24d03ce7-8d23-4dc3-a13b-cffc0c7ce0d8'})
SET n = {PhoneNumber: '123-123-1234', Age: 32}
RETURN n;
On the other hand, if you want to create a new node iff one with the specified Id does not yet exist, and you also want to completely replace all the properties of the new or existing node, you can do this:
MERGE (n:`User` {Id:'24d03ce7-8d23-4dc3-a13b-cffc0c7ce0d8'})
SET n = {PhoneNumber: '123-123-1234', Age: 32}
RETURN n;
Note: in the above queries, all the existing properties of n would be deleted before the new properties are added. Also, the map assigned to n in the SET clause can be passed to the query as a parameter (so no hardcoding is needed).
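Passing the map as a parameter might look like the sketch below. One thing to watch: because SET n = $props removes every existing property first, the map has to carry Id itself or the node loses its lookup key. The helper name replacement_params and the parameter names $id and $props are hypothetical, and the commented call assumes a session from a Neo4j driver.

```python
from typing import Any, Dict, Mapping

REPLACE_QUERY = """
MATCH (n:User {Id: $id})
SET n = $props
RETURN n
"""


def replacement_params(user_id: str, props: Mapping[str, Any]) -> Dict[str, Any]:
    """Build the parameter map, keeping Id inside the replacement properties."""
    merged = dict(props)     # copy so the caller's map is not modified
    merged["Id"] = user_id   # SET n = $props would otherwise drop Id
    return {"id": user_id, "props": merged}


# Assumed usage with a driver session (not executed here):
#   session.run(REPLACE_QUERY, replacement_params(guid, user_properties))
```

Array properties can go in the map as plain lists, as long as each list holds values of a single primitive type.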

Cypher, create unique relationship for one node

I want to add a "created by" relationship on nodes in my database. Any node should be able to have this relationship, but never more than one.
Right now my query looks something like this:
MATCH (u:User {email: 'my#mail.com'})
MERGE (n:Node {name: 'Node name'})
ON CREATE SET n.name='Node name', n.attribute='value'
CREATE UNIQUE (n)-[:CREATED_BY {date: '2015-02-23'}]->(u)
RETURN n
As far as I understand Cypher, there is no way to achieve what I want: the current query only ensures the relationship is unique between TWO nodes, not for ONE. So, this will create more CREATED_BY relationships when run for another User, and I want to limit the outgoing CREATED_BY relationship to just one for all nodes.
Is there a way to achieve this without running multiple queries involving program logic?
Thanks.
Update
I tried to simplify the query by removing implementation details. If it helps, here's the updated query based on cybersam's response.
MERGE (c:Company {name: 'Test Company'})
ON CREATE SET c.uuid='db764628-5695-40ee-92a7-6b750854ebfa', c.created_at='2015-02-23 23:08:15', c.updated_at='2015-02-23 23:08:15'
WITH c
OPTIONAL MATCH (c)
WHERE NOT (c)-[:CREATED_BY]-()
CREATE (c)-[:CREATED_BY {date: '2015-02-23 23:08:15'}]->(u:User {token: '32ba9d2a2367131cecc53c310cfcdd62413bf18e8048c496ea69257822c0ee53'})
RETURN c
Still not working as expected.
Update #2
I ended up splitting this into two queries.
The problem I found was that there were two possible outcomes:
With OPTIONAL MATCH, the CREATED_BY relationship was created and (n) was returned. But the relationship was created whenever it didn't already exist between (n) and (u), so changing the email attribute created another relationship.
Without OPTIONAL MATCH, and with the WHERE NOT (c)-[:CREATED_BY]-() clause, the node (n) was not found, so no relationship was created (yay!), but without getting (n) back the MERGE query loses all its meaning, I think.
My solution was the following two queries:
MERGE (n:Node {name: 'Name'})
ON CREATE SET n.attribute='value'
WITH n
OPTIONAL MATCH (n)-[r:CREATED_BY]-()
RETURN n, r
Then I had program logic check the value of r; if there was no relationship, I would run the second query.
MATCH (n:Node {name: 'Name'})
MATCH (u:User {email: 'my#email.com'})
CREATE UNIQUE (n)-[:CREATED_BY {date: '2015-02-23'}]->(u)
RETURN n
Unfortunately I couldn't find any real solution to combining this in one single query with Cypher. Sam, thanks! I've selected your answer even though it didn't quite solve my problem, but it was very close.
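For reference, the program logic around the two queries could be sketched in Python as follows. Everything here is illustrative: ensure_created_by is a hypothetical helper, session is assumed to come from the official Python driver, the queries are parameterized variants of the two above (with MERGE instead of CREATE UNIQUE and a LIMIT 1 added), and the check-then-create step is not atomic if the two queries run in separate transactions.

```python
from typing import Any

CHECK_QUERY = """
MERGE (n:Node {name: $name})
ON CREATE SET n.attribute = $attribute
WITH n
OPTIONAL MATCH (n)-[r:CREATED_BY]->()
RETURN n, r
LIMIT 1
"""

LINK_QUERY = """
MATCH (n:Node {name: $name})
MATCH (u:User {email: $email})
MERGE (n)-[:CREATED_BY {date: $date}]->(u)
RETURN n
"""


def ensure_created_by(
    session: Any, name: str, attribute: str, email: str, date: str
) -> bool:
    """Create the CREATED_BY relationship only if the node has none yet."""
    record = session.run(CHECK_QUERY, name=name, attribute=attribute).single()
    if record["r"] is not None:
        return False  # a CREATED_BY relationship already exists; leave it
    session.run(LINK_QUERY, name=name, email=email, date=date).consume()
    return True
```

Running both queries inside one explicit transaction would narrow the window in which another writer could add a second relationship.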
This should work for you:
MERGE (n:Node {name: 'Node name'})
ON CREATE SET n.attribute='value'
WITH n
OPTIONAL MATCH (n)
WHERE NOT (n)-[:CREATED_BY]->()
CREATE UNIQUE (n)-[:CREATED_BY {date: '2015-02-23'}]->(:User {email: 'my#mail.com'})
RETURN n;
I've removed the starting MATCH clause (because I presume you want to create a CREATED_BY relationship even when that User does not yet exist in the DB), and simplified the ON CREATE to remove the redundant setting of the name property.
I have also added an OPTIONAL MATCH that will only match an n node that does not already have an outgoing CREATED_BY relationship, followed by a CREATE UNIQUE clause that fully specifies the User node.

Resources