I'm looking to perform periodic refreshes of node data in my neo4j db. A good example of my needs would be company employees -- where if an employee is terminated, they are removed from the graph completely and new employees are added.
Really, deleting all nodes of this label and ingesting a fresh dataset likely suffices -- but it feels quite ugly. Is there a more elegant solution? My fresh data exists in csv and I want to pull it in daily.
You could put a 'last updated' timestamp on your nodes. Each day, do your update using MERGE. If the CSV row already exists in the database, update the timestamp using the ON MATCH clause of MERGE; if it doesn't exist, MERGE will create a new node (make sure to set a timestamp property of some description there too). E.g.:
MERGE (n:Person {<selection_filter>})
ON CREATE SET <required_properties>, n.lastUpdated = date()
ON MATCH SET n.lastUpdated = date()
After updating the graph with the CSV data, run a query that deletes all nodes whose timestamps are before today's, i.e. those that haven't been updated.
You might find that creating an index on lastUpdated improves performance of the delete query.
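A sketch of that cleanup query, reusing the Person label and lastUpdated property from above (DETACH DELETE also removes each stale node's relationships):
MATCH (n:Person)
WHERE n.lastUpdated < date()
DETACH DELETE n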
If your CSV file is all active employees, then you can do something like this:
// Mark every employee inactive, then re-flag those still in the file
MATCH (e:Employee)
SET e.ActiveToday = false;

LOAD CSV WITH HEADERS FROM "file:///employees.csv" AS line
MERGE (e:Employee {employeeID: line.employeeID})
SET e.ActiveToday = true;

// Anyone not re-flagged is no longer in the file
MATCH (e:Employee {ActiveToday: false})
DETACH DELETE e
The MERGE will create nodes for new employees in the file and match those that already exist. Both creates and matches will have their ActiveToday property updated. From there, you just match those where the property is still false and remove them.
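Assuming employeeID is unique for every employee (reasonable here, but an assumption), a uniqueness constraint will keep the MERGE lookup fast:
CREATE CONSTRAINT ON (e:Employee) ASSERT e.employeeID IS UNIQUE;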
I am working on a StreamSets pipeline that reads data from an active file directory, where .csv files are uploaded remotely, and loads that data into a Neo4j database.
The steps I have used are:
Creating an observation node for each row in the .csv
Creating a CSV node and a relationship between the CSV node and the record
Updating the timestamp taken from the CSV node onto Burn_In_Test nodes (already created in the graph database by a different pipeline) if it is the latest
Creating a relationship from the CSV node to the Burn_In_Test node
Deleting outdated relationships based on the latest timestamp
Now I am doing all of this with a JDBC query, and the Cypher used is:
MERGE (m:OBSERVATION{
SerialNumber: "${record:value('/SerialNumber')}",
Test_Stage: "${record:value('/Test_Stage')}",
CUR: "${record:value('/CUR')}",
VOLT: "${record:value('/VOLT')}",
Rel_Lot: "${record:value('/Rel_Lot')}",
TimestampINT: "${record:value('/TimestampINT')}",
Temp: "${record:value('/Temp')}",
LP: "${record:value('/LP')}",
MON: "${record:value('/MON')}"
})
MERGE (t:CSV{
SerialNumber: "${record:value('/SerialNumber')}",
Test_Stage: "${record:value('/Test_Stage')}",
TimestampINT: "${record:value('/TimestampINT')}"
})
WITH m
MATCH (t:CSV)
WHERE t.SerialNumber = m.SerialNumber AND t.Test_Stage = m.Test_Stage AND t.TimestampINT = m.TimestampINT
MERGE (m)-[:PART_OF]->(t)
WITH t, t.TimestampINT AS TimestampINT
MATCH (rl:Burn_In_Test)
WHERE rl.SerialNumber = t.SerialNumber AND rl.Test_Stage = t.Test_Stage AND rl.TimestampINT < TimestampINT
SET rl.TimestampINT = TimestampINT
WITH t
MATCH (rl:Burn_In_Test)
WHERE rl.SerialNumber = t.SerialNumber AND rl.Test_Stage = t.Test_Stage
MERGE (t)-[:POINTS_TO]->(rl)
WITH rl
MATCH (t:CSV)-[r:POINTS_TO]->(rl)
WHERE t.TimestampINT < rl.TimestampINT
DELETE r
Right now this process is very slow, taking about 15 minutes for 10 records. Can this be optimized further?
Best practice when using MERGE is to merge on a single property and then use SET to add the other properties.
If I assume that SerialNumber is unique for every node (it might not be), it would look like:
MERGE (m:OBSERVATION{SerialNumber: "${record:value('/SerialNumber')}"})
SET m.Test_Stage = "${record:value('/Test_Stage')}",
m.CUR= "${record:value('/CUR')}",
m.VOLT= "${record:value('/VOLT')}",
m.Rel_Lot= "${record:value('/Rel_Lot')}",
m.TimestampINT = "${record:value('/TimestampINT')}",
m.Temp= "${record:value('/Temp')}",
m.LP= "${record:value('/LP')}",
m.MON= "${record:value('/MON')}"
MERGE (t:CSV{
SerialNumber: "${record:value('/SerialNumber')}"
})
SET t.Test_Stage = "${record:value('/Test_Stage')}",
t.TimestampINT = "${record:value('/TimestampINT')}"
WITH m
MATCH (t:CSV)
WHERE t.SerialNumber = m.SerialNumber AND t.Test_Stage = m.Test_Stage AND t.TimestampINT = m.TimestampINT
MERGE (m)-[:PART_OF]->(t)
WITH t, t.TimestampINT AS TimestampINT
MATCH (rl:Burn_In_Test)
WHERE rl.SerialNumber = t.SerialNumber AND rl.Test_Stage = t.Test_Stage AND rl.TimestampINT < TimestampINT
SET rl.TimestampINT = TimestampINT
WITH t
MATCH (rl:Burn_In_Test)
WHERE rl.SerialNumber = t.SerialNumber AND rl.Test_Stage = t.Test_Stage
MERGE (t)-[:POINTS_TO]->(rl)
WITH rl
MATCH (t:CSV)-[r:POINTS_TO]->(rl)
WHERE t.TimestampINT < rl.TimestampINT
DELETE r
Another thing to add: I would probably split this into two queries.
The first would do the import and the second would delete the outdated relationships. Also add unique constraints and indexes where possible.
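For example, assuming SerialNumber really is the unique key for OBSERVATION and CSV nodes (see the caveat above), using the Neo4j 3.x syntax:
CREATE CONSTRAINT ON (o:OBSERVATION) ASSERT o.SerialNumber IS UNIQUE;
CREATE CONSTRAINT ON (c:CSV) ASSERT c.SerialNumber IS UNIQUE;
CREATE INDEX ON :Burn_In_Test(SerialNumber);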
I already have some data in Neo4j, modelled in the following fashion:
(:A {ID:"123", Group:"ABC", Family:"XYZ"})
(:B {ID:"456", Group:"ABC", Family:"XYZ"})
(:A)-[:SCORE {score:'2'}]-(:B)
Please see the attached image for more clarification of how the data currently looks.
Now, I am importing some new data through a CSV file which has 5 columns:
A's ID
B's ID
Score through which A is attached to B
Group
Family
In the new data there can be some new A IDs or some new B IDs.
Question:
I want to create the new nodes of type A and B, create a SCORE relationship between them, and assign the score as the value on that relationship.
There is a chance that existing scores between A and B have changed, so I want to update the previous score with the new one.
How do I write Cypher to achieve this, using CSV as the import?
I used the Cypher query below to model the data the first time:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///ABC.csv" AS line
MERGE (a:A {ID: line.A, Group: line.Group, Family: line.Family})
MERGE (b:B {ID: line.B, Group: line.Group, Family: line.Family})
MERGE (a)-[:SCORE {score: toFloat(line.Score)}]-(b)
Note: Family and Group are the same for both node types 'A' and 'B'.
Thanks in advance.
You can MERGE the relationship and set the score after the fact so it does not create new SCORE relationships for every new value.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///ABC.csv" AS line
MERGE (a:A {ID: line.A, Group:line.Group, Family:line.Family})
MERGE (b:B {ID: line.B, Group:line.Group, Family:line.Family})
MERGE (a)-[score:SCORE]-(b)
SET score.score = toFloat(line.Score)
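If ID alone identifies each node, you could go a step further and merge on just the ID, SETting the remaining properties afterwards, so the MERGE lookup touches only a single (ideally indexed) property. A sketch, assuming the CSV column holding B's ID is named B:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///ABC.csv" AS line
MERGE (a:A {ID: line.A})
SET a.Group = line.Group, a.Family = line.Family
MERGE (b:B {ID: line.B})
SET b.Group = line.Group, b.Family = line.Family
MERGE (a)-[score:SCORE]-(b)
SET score.score = toFloat(line.Score)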
I have the following query to create Person nodes:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "http://192.168.11.121/movie-reco-db/person_node.csv" as row
CREATE (:Person {personId: row.person_id, name: row.name});
I set an index on personId. person_node.csv is the file I exported from a MySQL database. This query works fine, but the problem is that the CSV file will contain new records each time I export it. If I run the query again it creates duplicate nodes, and if I set a UNIQUE index on personId it says:
Node 0 already exists with label Person and property "personId"=[1]
and does not insert the new records. So is there an elegant way to update a record if it already exists, and create a new one if not?
You're looking for the MERGE operation, which attempts a match and, if it doesn't find the thing, creates it. Be aware that MERGE works on the entire pattern: if you merge a node with both a personId and a name, and an existing node has that personId but a slightly different name, MERGE will create a new node.
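A hypothetical illustration of that pitfall:
// Existing node: (:Person {personId: 1, name: "Alicia"})
MERGE (p:Person {personId: 1, name: "Alice"})
// No node matches BOTH properties, so MERGE creates a second
// Person with personId 1 instead of updating the name.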
If you have a unique ID for the node, merge on that, then use ON CREATE SET to add the remaining properties. (ON CREATE only executes when the MERGE creates the node instead of matching an existing entity in your db; there is a complementary ON MATCH clause that only executes when it matches instead of creates.)
Your final query will look like:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "http://192.168.11.121/movie-reco-db/person_node.csv" as row
MERGE (p:Person {personId: row.person_id})
ON CREATE SET p.name = row.name;
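If you haven't already, back this with a uniqueness constraint on personId; it both enforces the uniqueness you wanted and creates the index that keeps the MERGE lookup fast:
CREATE CONSTRAINT ON (p:Person) ASSERT p.personId IS UNIQUE;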
I am relatively new to neo4j.
I have imported a dataset of 12 million records and created a relationship between two types of nodes. When I created the relationship, I forgot to attach a property to it. Now I am trying to set the property on the relationship as follows:
LOAD CSV WITH HEADERS FROM 'file:///FileName.csv' AS row
MATCH (user:User{userID: USERID})
MATCH (order:Order{orderID: OrderId})
MATCH(user)-[acc:ORDERED]->(order)
SET acc.field1=field1,
acc.field2=field2;
But this query is taking too much time to execute. I even tried an index hint on the User node:
MATCH (user:User{userID: USERID}) USING INDEX user:User(userID)
Isn't it possible to create new attributes for the relationship at a later point?
Please let me know, how can I do this operation in a quick and efficient way.
You forgot to prefix your query with USING PERIODIC COMMIT; without it, the query builds up transaction state for 24 million changes (property updates) and won't have enough memory to hold all that state.
You also forgot the row. prefix for the data that comes from your CSV, and those column names are inconsistently spelled.
If you run this from the Neo4j Browser, pay attention to any yellow warning signs.
Run
CREATE CONSTRAINT ON (u:User) ASSERT u.userID IS UNIQUE;
Run
CREATE CONSTRAINT ON (o:Order) ASSERT o.orderID IS UNIQUE;
Run
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///FileName.csv' AS row
WITH row, row.USERID AS userID, row.OrderId AS orderID
MATCH (user:User {userID: userID})
USING INDEX user:User(userID)
MATCH (order:Order {orderID: orderID})
USING INDEX order:Order(orderID)
MATCH (user)-[acc:ORDERED]->(order)
SET acc.field1 = row.field1, acc.field2 = row.field2;
I have a set of CSV files with duplicate data, i.e. the same row might (and does) appear in multiple files. Each row is uniquely identified by one of the columns (id) and has quite a few other columns that indicate properties, as well as required relationships (i.e. ids of other nodes to link to). The files all have the same format.
My problem is that, due to the size and number of the files, I want to avoid processing the rows that already exist - I know that as long as id is the same, the contents of the rows will be the same across the files.
Can any Cypher wizard advise how to write a query that creates the node, sets all the properties, and creates all the relationships if a node with the given id does not exist, but skips the action altogether if such a node is found? I tried MERGE with ON CREATE, something along the lines of:
LOAD CSV WITH HEADERS FROM "..." AS row
MERGE (f:MyLabel {id:row.uniqueId})
ON CREATE SET f....
WITH f,row
MATCH (otherNode:OtherLabel {id : row.otherNodeId})
MERGE (f) -[:REL1] -> (otherNode)
Unfortunately that only avoids setting the properties again; I couldn't work out how to also skip the merging of relationships (only one is shown here, but there are quite a few more).
Thanks in advance!
You can just OPTIONAL MATCH the node first and then skip the rows where it was found by filtering with WHERE ... IS NULL.
Make sure you have an index or constraint on :MyLabel(id).
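E.g., with the legacy 3.x constraint syntax:
CREATE CONSTRAINT ON (m:MyLabel) ASSERT m.id IS UNIQUE;
Then the load itself: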
LOAD CSV WITH HEADERS FROM "..." AS row
OPTIONAL MATCH (existing:MyLabel {id: row.uniqueId})
WITH row, existing
WHERE existing IS NULL
CREATE (f:MyLabel {id: row.uniqueId})
SET f....
WITH f, row
MATCH (otherNode:OtherLabel {id: row.otherNodeId})
MERGE (f)-[:REL1]->(otherNode)
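Because the WITH ... WHERE filter discards every row whose node already exists, the plain CREATE is safe at that point (MERGE-ing on an already-bound variable would raise a 'variable already declared' error), and the relationship MERGEs only run for brand-new nodes, which is exactly the skip behaviour you wanted.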