I'm having issues importing a large volume of data into a Neo4j instance using the Cypher LOAD CSV command. I'm attempting to load roughly 253k user records, each with a unique user_id. My first step was to add a unique constraint on the label to make sure each user would only be created once:
CREATE CONSTRAINT ON (b:User) ASSERT b.user_id IS UNIQUE;
I then tried to run LOAD CSV with periodic commits to pull this data in. That query failed, so I tried merging the User record before setting its properties:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_users.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id)})-[:REGISTERED_TO]->(t)
set p.created=toInt(line.created), p.completed=toInt(line.completed);
Modifying the periodic commit value has made no difference; the same error is returned:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})-[:REGISTERED_TO]->(t);
I receive the following error:
LoadCsvStatusWrapCypherException: Node 9752 already exists with label Person and property "hpcm_uk_buddy_id"=[2446] (Failure when processing URL 'file:/home/data/uk_buddies.csv' on line 253316 (which is the last row in the file). Possibly the last row committed during import is line 253299. Note that this information might not be accurate.)
The numbers roughly match up: the CSV file contains 253315 records in total. The periodic commit doesn't seem to have taken effect either; a count of nodes returns only 5446 rows:
neo4j-sh (?)$ match (n) return count(n);
+----------+
| count(n) |
+----------+
| 5446 |
+----------+
1 row
768 ms
I can understand the number of nodes being incorrect if this ID is only roughly 5000 rows into the CSV file. But is there a technique or command I can use to make this import succeed?
You're falling victim to a common mistake with MERGE, I think. Seriously, this would be in my top 10 FAQs about common problems with Cypher. You're doing this:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})-[:REGISTERED_TO]->(t);
The way MERGE works, that last MERGE matches on the entire pattern, not just on the user node. So you're probably creating duplicate users that you shouldn't be. When you run this MERGE, even if a user with those exact properties already exists, the relationship to the t node doesn't, so Cypher attempts to create a new user node with those attributes to connect to t, which isn't what you want.
The solution is to merge the user individually, then separately merge the relationship path, like this:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})
merge (p)-[:REGISTERED_TO]->(t);
Note the two merges at the end. One creates just the user. If the user already exists, it won't try to create a duplicate, and you should hopefully be OK with your constraint (assuming there aren't two users with the same user_id, but different created values). After you've merged just the user, then you merge the relationship.
The net result of the second query is the same, but shouldn't create duplicate users.
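As a sanity check after re-running the corrected import, a grouping query can confirm there is exactly one node per user_id. This is only a sketch, using the label and property names from the queries above:

```cypher
// Any row returned here indicates a duplicate user
MATCH (u:User)
WITH u.user_id AS id, count(*) AS copies
WHERE copies > 1
RETURN id, copies
ORDER BY copies DESC;
```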
Related
I am very new to Neo4j/cypher/graph databases, and have been trying to follow the Neo4j tutorial to import data I have in a csv and create relationships.
The following code does what I want in terms of reading in the data, creating nodes, and setting properties.
/* Importing data on seller-buyer relationships */
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///customer_rel_table.tsv' AS row
FIELDTERMINATOR '\t'
MERGE (seller:Seller {sellerID: row.seller})
ON CREATE SET seller += {name: row.seller_name,
root_eid: row.vendor_eid,
city: row.city}
MERGE (buyer:Buyer {buyerID: row.buyer})
ON CREATE SET buyer += {name: row.buyer_name};
/* Creating indices for the properties I might want to match on */
CREATE INDEX seller_id FOR (s:Seller) on (s.seller_name);
CREATE INDEX buyer_id FOR (b:Buyer) on (b.buyer_name);
/* Creating constraints to guarantee buyer-seller pairs are not duplicated */
CREATE CONSTRAINT sellerID ON (s:Seller) ASSERT s.sellerID IS UNIQUE;
CREATE CONSTRAINT buyerID on (b:Buyer) ASSERT b.buyerID IS UNIQUE;
Now I have the nodes (sellers and buyers) that I want, and I would like to link buyers and sellers. The code I have tried for this is:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///customer_rel_table.tsv' AS row
MATCH (s:Seller {sellerID: row.seller})
MATCH (b:Buyer {buyerID: row.buyer})
MERGE (s)-[st:SOLD_TO]->(b)
The query runs, but I don't get any relationships:
Query executed in 294ms. Query type: WRITE_ONLY.
No results.
Since I'm not asking it to RETURN anything, I think the "No results" message is correct, but when I look at the metadata for the DB, no relationships appear. Also, my data has ~220K rows, so 294ms seems fast.
EDIT: At @cybersam's prompting, I tried this query:
MATCH p=(:Seller)-[:SOLD_TO]->(:Buyer) RETURN p, which gives No results.
For clarity, there are two fields in my data that are the heart of the relationship:
seller and buyer, where the seller sells stuff to the buyer. The seller identifiers are repeated, but for each seller there are unique seller-buyer pairs.
What do I need to fix in my code to get relationships between the sellers and buyers? Thank you!
Your second query's LOAD CSV clause does not specify FIELDTERMINATOR '\t'. The default terminator is a comma (','). That is probably why it fails to MATCH anything.
Try adding FIELDTERMINATOR '\t' at the end of that clause.
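Putting that together, the second query would become the following (unchanged except for the added FIELDTERMINATOR clause, mirroring the first query):

```cypher
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///customer_rel_table.tsv' AS row
FIELDTERMINATOR '\t'
MATCH (s:Seller {sellerID: row.seller})
MATCH (b:Buyer {buyerID: row.buyer})
MERGE (s)-[st:SOLD_TO]->(b);
```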
I am working on creating a graph database in neo4j for a CALL dataset. The dataset is stored in csv file with following columns: Source, Target, Timestamp, Duration. Here Source and Target are Person id's (numeric), Timestamp is datetime and duration is in seconds (integer).
I modeled my graph where person are nodes(person_id as property) and call as relationship (time and duration as property).
There are around 200,000 nodes and around 70 million relationships. I have a separate csv file with person ids, which I used to create the nodes. I also added a uniqueness constraint on the person ids:
CREATE CONSTRAINT ON ( person:Person ) ASSERT (person.pid) IS UNIQUE
I didn't completely understand how the bulk importer works, so I wrote a python script to split my csv into 70 csvs, where each csv has 1 million rows (saved as calls_0, calls_1, ..., calls_69). I then ran a cypher query manually, changing the filename each time. It worked well (fast enough) for the first few files (around 10), but then I noticed that after adding the relationships from one file, the import got slower for the next file. Now it is taking almost 25 minutes to import a single file.
Can someone point me to an efficient and easy way of doing this?
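For reference, the splitting step described here can be done with a short script. This is only a sketch (the file naming and the one-million-row chunk size are taken from the question; nothing here is Neo4j-specific); it repeats the header row in every chunk so that LOAD CSV WITH HEADERS works on each file:

```python
import csv
import os

def split_csv(src_path, out_dir, rows_per_file=1_000_000, prefix="calls"):
    """Split a headered CSV into chunks of at most rows_per_file data rows,
    repeating the header in every chunk so each file works with
    LOAD CSV WITH HEADERS. Returns the list of chunk paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    chunk = None
    written = rows_per_file  # force a new chunk before the first row
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for row in reader:
            if written >= rows_per_file:
                if chunk is not None:
                    chunk.close()
                path = os.path.join(out_dir, f"{prefix}_{len(paths)}.csv")
                paths.append(path)
                chunk = open(path, "w", newline="")
                writer = csv.writer(chunk)
                writer.writerow(header)  # repeat the header in every chunk
                written = 0
            writer.writerow(row)
            written += 1
    if chunk is not None:
        chunk.close()
    return paths
```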
Here is the cypher query:
:auto USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH toInteger(line.Source) AS Source,
datetime(replace(line.Time,' ','T')) AS time,
toInteger(line.Target) AS Target,
toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
MERGE (p1)-[rel:CALLS {time: time, duration: Duration}]->(p2)
RETURN count(rel)
I am using Neo4j 4.0.3
Your MERGE clause has to check for an existing matching relationship (to avoid creating duplicates). If you added a lot of relationships between Person nodes, that could make the MERGE clause slower.
You should consider whether it is safe for you to use CREATE instead of MERGE.
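If the file is known to contain no duplicate (Source, Target, Timestamp) rows, the swap is a one-line change to the query from the question (a sketch; everything except the last clause is unchanged):

```cypher
:auto USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH toInteger(line.Source) AS Source,
     datetime(replace(line.Time,' ','T')) AS time,
     toInteger(line.Target) AS Target,
     toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
// CREATE never scans for an existing relationship, unlike MERGE
CREATE (p1)-[:CALLS {time: time, duration: Duration}]->(p2);
```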
It is much better if you export the matches using the internal ID of each node, and then create the relationships.
POC
CREATE INDEX ON :Person(`pid`);
CALL apoc.export.csv.query("LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH toInteger(line.Source) AS Source,
     datetime(replace(line.Time,' ','T')) AS time,
     toInteger(line.Target) AS Target,
     toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
RETURN ID(p1) AS ida, ID(p2) AS idb, time, Duration","rels.csv", {});
and then
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///rels.csv' AS row
MATCH (a:Person) WHERE ID(a) = toInteger(row.ida)
MATCH (b:Person) WHERE ID(b) = toInteger(row.idb)
MERGE (a)-[:CALLS {time: row.time, duration: toInteger(row.Duration)}]->(b);
For me this is the best way to do this.
I have created a graph with a constraint on the primary id. In my csv, a primary id appears more than once, but the other properties are different. Based on the other properties, I want to create relationships.
I tried multiple times to change the code, but it does not do what I need.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Trial.csv' AS line FIELDTERMINATOR '\t'
MATCH (n:Trial {id: line.primary_id})
with line.cui= cui
MATCH (m:Intervention)
where m.id = cui
MERGE (n)-[:HAS_INTERVENTION]->(m);
I already have the Intervention nodes in the graph, as well as the Trials. So what I am trying to do is match a trial to an intervention by id and create only the relationship. Instead, it is also creating the nodes.
This is a sample of my data: the same primary id appears with different cuis, and I am trying to match on cui.
You can refer to the following query, which finds the Trial and Intervention nodes by primary_id and cui respectively, and creates the relationship between them:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Trial.csv' AS line FIELDTERMINATOR '\t'
MATCH (n:Trial {id: line.primary_id}), (m:Intervention {id: line.cui})
MERGE (n)-[:HAS_INTERVENTION]->(m);
The behavior you observed is caused by 2 aspects of the Cypher language:
The WITH clause drops all existing variables except for the ones explicitly specified in the clause. Therefore, since your WITH clause does not specify the n node, n becomes an unbound variable after the clause.
The MERGE clause will create its entire pattern if any part of the pattern does not already exist. Since n is not bound to anything, the MERGE clause would go ahead and create the entire pattern (including the 2 nodes).
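To make the scoping concrete, here is the original query again with the WITH typo corrected (AS instead of =) and comments marking where n goes out of scope; this is an annotated sketch, not a fix:

```cypher
MATCH (n:Trial {id: line.primary_id})   // n is bound here
WITH line.cui AS cui                    // only cui survives; n (and line) are dropped
MATCH (m:Intervention)
WHERE m.id = cui
MERGE (n)-[:HAS_INTERVENTION]->(m)      // n is unbound, so MERGE creates the whole pattern
```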
So, you could have fixed the issue by simply specifying the n variable in the WITH clause, as in:
WITH n, line.cui AS cui
But @Raj's query is even better, avoiding the need for WITH entirely.
I am relatively new to neo4j.
I have imported a dataset of 12 million records and created a relationship between two node types. When I created the relationship, I forgot to attach a property to it. Now I am trying to set the property on the relationship as follows:
LOAD CSV WITH HEADERS FROM 'file:///FileName.csv' AS row
MATCH (user:User{userID: USERID})
MATCH (order:Order{orderID: OrderId})
MATCH(user)-[acc:ORDERED]->(order)
SET acc.field1=field1,
acc.field2=field2;
But this query is taking too much time to execute. I even tried an index hint on the User node:
MATCH (user:User{userID: USERID}) USING INDEX user:User(userID)
Isn't it possible to add new attributes to a relationship at a later point?
Please let me know how I can do this operation in a quick and efficient way.
You forgot to prefix your query with USING PERIODIC COMMIT; without it, your query will build up transaction state for 24 million changes (property updates) and won't have enough memory to keep all that state.
You also forgot the row. prefix for the data that comes from your CSV, and those names are inconsistently spelled.
If you run this from neo4j browser pay attention to any YELLOW warning signs.
Run
CREATE CONSTRAINT ON (u:User) ASSERT u.userID IS UNIQUE;
Run
CREATE CONSTRAINT ON (o:Order) ASSERT o.orderID IS UNIQUE;
Run
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///FileName.csv' AS row
with row.USERID as userID, row.OrderId as orderID
MATCH (user:User{userID: userID})
USING INDEX user:User(userID)
MATCH (order:Order{orderID: orderID})
USING INDEX order:Order(orderID)
MATCH(user)-[acc:ORDERED]->(order)
SET acc.field1=row.field1, acc.field2=row.field2;
I have a set of CSV files with duplicate data, i.e. the same row might (and does) appear in multiple files. Each row is uniquely identified by one of the columns (id) and has quite a few other columns that indicate properties, as well as required relationships (i.e. ids of other nodes to link to). The files all have the same format.
My problem is that, due to size and number of the files, I want to avoid processing the rows that already exist - I know that as long as id is the same, the contents of the rows will be the same across the files.
Can any cypher wizard advise how to write a query that would create the node, set all the properties and create all the relationship if a node with given id does not exist, but skip the action altogether if such node is found? I tried with MERGE ON CREATE, something along the lines of:
LOAD CSV WITH HEADERS FROM "..." AS row
MERGE (f:MyLabel {id:row.uniqueId})
ON CREATE SET f....
WITH f,row
MATCH (otherNode:OtherLabel {id : row.otherNodeId})
MERGE (f) -[:REL1] -> (otherNode)
but unfortunately that can only be applied to not setting the properties again, but I couldn't work out how to skip the merging part of relationships (only shown one here, but there are quite a few more).
Thanks in advance!
You can just OPTIONAL MATCH the node and then skip rows where it already exists, using a WHERE ... IS NULL filter.
Make sure you have an index or constraint on :MyLabel(id)
LOAD CSV WITH HEADERS FROM "..." AS row
OPTIONAL MATCH (existing:MyLabel {id:row.uniqueId})
WITH row WHERE existing IS NULL
CREATE (f:MyLabel {id:row.uniqueId})
SET f....
WITH f, row
MATCH (otherNode:OtherLabel {id : row.otherNodeId})
MERGE (f) -[:REL1] -> (otherNode)