I am loading data into Neo4j using LOAD CSV. I have two types of nodes: Director and Company.
The commands below work fine and execute within 50 milliseconds.
LOAD CSV FROM "file:///Director.csv" AS line
CREATE (:Director {DirectorDIN: line[0]})

LOAD CSV FROM "file:///Company.csv" AS line
CREATE (:Company {CompanyCIN: line[0]})
Now I am trying to build the relationship between the two node types, but the query runs seemingly forever. Here is the simple query I am trying:
LOAD CSV FROM "file:///CompanyDirector.csv" AS line
MATCH (c:Company {CompanyCIN: toString(line[0])}), (d:Director {DirectorDIN: toString(line[1])})
CREATE (c)-[:Directed_by]->(d)
I have also tried:
LOAD CSV FROM "file:///CompanyDirector.csv" AS line
MATCH (c:Company {CompanyCIN: line[0]}), (d:Director {DirectorDIN: line[1]})
CREATE (c)-[:Directed_by]->(d)
It also runs seemingly forever. Please let me know what the issue could be here.
Information:
The CSV file does not contain more than 20k records.
CompanyCIN is alphanumeric
DirectorDIN is numeric
I think you forgot to create schema constraints in your database:
CREATE CONSTRAINT ON (n:Company) ASSERT n.CompanyCIN IS UNIQUE;
CREATE CONSTRAINT ON (n:Director) ASSERT n.DirectorDIN IS UNIQUE;
Without those constraints, the complexity of your query is N*M, where N is the number of Company nodes and M the number of Director nodes.
To see what I mean, you can EXPLAIN your query before and after creating those constraints.
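For example (a sketch; the CIN value here is made up, and the exact plan operator names vary by Neo4j version), without the constraint the plan starts with a full label scan, and with it an index seek:

EXPLAIN
MATCH (c:Company {CompanyCIN: "U12345MH2010PTC000001"})
RETURN c
// before: NodeByLabelScan + Filter; after: a unique index seek on :Company(CompanyCIN)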
Moreover, you should also use PERIODIC COMMIT on your LOAD CSV query, like this:
USING PERIODIC COMMIT 5000
LOAD CSV FROM "file:///CompanyDirector.csv" AS line
MATCH (c:Company{CompanyCIN:line[0]})
MATCH (d:Director{DirectorDIN:line[1]})
CREATE (c)-[:Directed_by]->(d)
The main issue was that you did not have indexes on :Company(CompanyCIN) and :Director(DirectorDIN). Without the indexes, Neo4j is forced to evaluate every possible pair of Company and Director nodes for every line in your CSV file. That takes a lot of time.
CREATE INDEX ON :Company(CompanyCIN);
CREATE INDEX ON :Director(DirectorDIN);
By the way, creating the corresponding uniqueness constraints (as suggested by @logisima) has the side effect of creating these indexes, but the issue was not caused by missing uniqueness constraints.
In addition, you should avoid creating duplicate Directed_by relationships by using MERGE instead of CREATE.
This should work better (you can also use the USING PERIODIC COMMIT option, as suggested by @logisima):
USING PERIODIC COMMIT 5000
LOAD CSV FROM "file:///CompanyDirector.csv" AS line
MATCH (c:Company {CompanyCIN:line[0]})
MATCH (d:Director {DirectorDIN:line[1]})
MERGE (c)-[:Directed_by]->(d)
I am new to Neo4j. My data is in CSV files, and I am trying to load it into the database and create relationships.
departments.csv (9 rows): dept_name, dept_no
dept_emp.csv (331603 rows): dept_no, emp_no, from_date, to_date
I have created nodes with labels dept_emp and departments, with all columns as properties. Now I am trying to create relationships between them.
CALL apoc.periodic.iterate("
  LOAD CSV WITH HEADERS FROM 'file:///dept_emp.csv' AS row RETURN row",
  "MATCH (de:dept_emp)
   MATCH (d:departments)
   WHERE de.dept_no = row.dept_no AND d.dept_no = row.dept_no
   MERGE (de)-[:BELONGS_TO]->(d)", {batchSize:10000, parallel:false})
I do have indexes on :dept_emp and :departments
When I run this it takes ages to complete (many days). When I changed the batch size to 10, it created all 331603 relationships, but it kept running until it had processed every batch, which takes far too long. Since the 9 distinct dept_no values all appear within the first rows of dept_emp.csv, all the relationships are created in the first couple of batches, but every subsequent batch still has to scan the 331603 node pairs again. Please help me optimize this.
I have used apoc.periodic.iterate here to cope with larger data in the future. The way the data is related, and the way I am trying to establish the relationships, is what causes the problem: each department has many dept_emp nodes connected to it.
Currently using Neo4j 4.2.1 version
Max heap size is 1G due to my laptop limitations.
There's no need to create nodes in this fashion, i.e. to set properties and then load the same CSV again, match all nodes in the graph, and do a Cartesian join.
Instead:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///departments.csv' AS row
CREATE (d:Department) SET d.deptNo=row.dept_no, d.name=row.dept_name
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///dept_emp.csv' AS row
MATCH (d:Department {deptNo:row.`dept_no`})
WITH d, row // keep row in scope for the MERGE below
MERGE (e:Employee {empNo: row.`emp_no`})
MERGE (e)-[:BELONGS_TO]->(d)
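As in the other answers in this thread, uniqueness constraints on the matched/merged properties should keep these lookups fast. A sketch (this older constraint syntax is deprecated but still accepted on Neo4j 4.2):

CREATE CONSTRAINT ON (d:Department) ASSERT d.deptNo IS UNIQUE;
CREATE CONSTRAINT ON (e:Employee) ASSERT e.empNo IS UNIQUE;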
I'm unsure if I'm using CREATE CONSTRAINT optimally while importing CSV data via LOAD CSV and would appreciate feedback/advice from the more knowledgeable.
I am importing from databases of about 3 and 12 million records. I know that the bulk import function would be faster, but for various reasons, LOAD CSV is the better option for this project. I can let things run for a long time, but want to be sure I'm optimizing as much as possible.
My code is currently:
CREATE CONSTRAINT ON (i:Inventor) ASSERT i.hanId IS UNIQUE;
CREATE CONSTRAINT ON (p:Patent) ASSERT p.patNo IS UNIQUE;
CREATE CONSTRAINT ON (c:Country) ASSERT c.countryCode IS UNIQUE;
// Import Inventors and link them to their country
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///...//names.short" AS row
FIELDTERMINATOR '|'
MERGE (c:Country {countryCode:row.Person_ctry_code})
MERGE (i:Inventor {hanId:row.HAN_ID, name:row.Person_name_clean})
CREATE (i)-[:LivesIn]->(c);
// Load patents and link them to their inventors
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///.../patents.short" as row
FIELDTERMINATOR '|'
MERGE (i:Inventor {hanId:row.HAN_ID})
MERGE (p:Patent {patNo:row.Patent_number})
CREATE (i)-[:Invented]->(p);
Each inventor has a unique hanID, each patent a unique patNo and each country a unique countryCode, although each inventor, patent and country may show up in the data many times.
Is creating the constraints before I begin the LOAD CSV statements optimal?
Are there any obvious ways to improve the speed of my imports?
Thank you very much.
Constraint creation before loading CSV is a good move, as constraints only need to be created once.
As for your import queries, it's best to MERGE only with the unique property, and use ON CREATE to SET additional properties (like an inventor's name).
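For example, the inventor MERGE above could become (a sketch using the same columns from names.short):

MERGE (i:Inventor {hanId: row.HAN_ID})
ON CREATE SET i.name = row.Person_name_clean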
As far as speed improvements go, when you're importing you're likely only doing this once, so speed usually isn't a factor unless it's taking an unusually long time for some reason.
One way you could improve this is to load CSVs with just :Country, just :Inventor, and just :Patent entries, with no repeated rows, and use CREATE instead of MERGE to get them into the db. Then, after all nodes are imported, you can use the queries and CSVs in your description to create relationships, but with MATCH instead of MERGE on all nodes.
Remember that MERGE is shorthand for attempting a MATCH, and if no MATCH, it will CREATE, so creating all your nodes ahead of time with CREATE avoids the extra unnecessary checks to see if the node exists first.
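A node-only pass might look like this (a sketch; countries.short stands in for a hypothetical deduplicated file that reuses the Person_ctry_code column):

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///.../countries.short" AS row
FIELDTERMINATOR '|'
CREATE (:Country {countryCode: row.Person_ctry_code});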
EDIT
cybersam's answer to a different question highlighted something I wasn't previously aware of: apparently indexes are not used for lookups when the lookup value is supplied as a property of something else, such as a CSV row (that should apply to unique properties too).
To get around this, you'll have to alias the properties as values, then use those.
For example, in your query to load :Patent and :Inventor nodes, you would have to do something like this:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///.../patents.short" as row
FIELDTERMINATOR '|'
WITH row.HAN_ID as hanId, row.Patent_number as patNo
MERGE (i:Inventor {hanId:hanId})
MERGE (p:Patent {patNo:patNo})
CREATE (i)-[:Invented]->(p);
I have installed Neo4j Community Edition 3.0.3 on Ubuntu 14.04 on a local Linux server, and I am accessing it through my Windows browser via port 7474 on that server.
Now I have a CSV file containing sales order data in the following format:
Customer_id, Item_id, Order_Date
It has 90000 rows, and both customer_id and item_id become nodes: a total of 60000 nodes (30000 customer_ids + 30000 item_ids) and 90000 relationships (with order_date as the distance attribute). I ran the query below to insert the data from the CSV into my graph database:
LOAD CSV WITH HEADERS FROM "file:///test.csv" AS line
MERGE (n:MyNode {Name:line.Customer})
MERGE (m:MyNode {Name:line.Item})
MERGE (n) -[:TO {dist:line.OrderDate}]-> (m)
I left it running, and after around 7 to 8 hours it was still going. My question is: am I doing anything wrong? Is my query not optimized? Or is this normal? I am new to both Neo4j and Cypher. Please help me with this.
Create a uniqueness constraint
You should create a uniqueness constraint on MyNode.Name:
CREATE CONSTRAINT ON (m:MyNode) ASSERT m.Name IS UNIQUE;
In addition to enforcing the data integrity / uniqueness of MyNode, that will create an index on MyNode.Name, which will speed up the lookups in the MERGE statements. There's a bit more info in the indexes and performance section here.
Using periodic commit
Since Neo4j is a transactional database, the results of your query are built up in memory and the entire query is committed at once. Depending on the size of the data and the resources available on your machine, you may want to use the periodic commit functionality of LOAD CSV to avoid building up the entire statement in memory. Just start your query with USING PERIODIC COMMIT. This will commit results periodically, freeing memory resources while iterating through your CSV file.
Avoiding the eager
One problem with your query is that it contains an eager operation. This hinders the periodic commit functionality, and the transaction will be built up in memory regardless. To avoid the eager operation you can use two passes through the CSV file:
Once to create the nodes:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///test.csv" AS line
MERGE (n:MyNode {Name:line.Customer})
MERGE (m:MyNode {Name:line.Item})
Then again to create the relationships:
LOAD CSV WITH HEADERS FROM "file:///test.csv" AS line
MATCH (n:MyNode {Name:line.Customer})
MATCH (m:MyNode {Name:line.Item})
MERGE (n) -[:TO {dist:line.OrderDate}]-> (m)
See these two posts for more info about the eager operation.
At a minimum you need to create the uniqueness constraint - that should be enough to increase the performance of your LOAD CSV statement.
I have created nodes using the LOAD CSV method in Cypher. The next part is creating relationships between the nodes. For that I have a CSV in the following format:
fromStopName,from,route,toStopName,to
Swargate,1,route1_1,Swargate Corner,2
Swargate Corner,2,route1_1,Hirabaug,3
Hirabaug,3,route1_1,Maruti,4
Maruti,4,route1_1,Mandai,5
Now I would like to use the "route" name as the relationship type between nodes. So I am using the following LOAD CSV command in Cypher:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:C:\\\\busroutes.csv" AS row
MATCH (f {name: row.fromStopName}), (t {name: row.toStopName})
CREATE (f)-[:row.route]->(t)
But it looks like I cannot do that. Instead, if I name the relationship type statically and assign the route from the CSV field as a property, it works:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:C:\\\\busroutes.csv" AS row
MATCH (f {name: row.fromStopName}), (t {name: row.toStopName})
CREATE (f)-[:CONNECTS {route: row.route}]->(t)
I am wondering if this is disallowed to enforce the good practice of having "pure" verb-like relationship types and to avoid creating many variants of the same relationship, like "connected by 1_1" and "connected by 1_2".
Or am I just not finding the right link or not using the correct syntax? I appreciate the help!
Right now you can't, as the relationship type is structural information.
Either use the neo4j-import tool for that.
Or use one CSV file per type and spell out the rel-type.
Or even filter the CSV and do multi-pass:
e.g.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:C:\\\\busroutes.csv" AS row
WITH row WHERE row.route = "route1_1"
MATCH (f {name: row.fromStopName}), (t {name: row.toStopName})
CREATE (f)-[:route1_1]->(t)
There is also a trick using fake conditionals but you still have to spell them out.
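For the record, that trick looks something like this (a sketch; one FOREACH per relationship type, which is why you still have to spell them out):

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:C:\\\\busroutes.csv" AS row
MATCH (f {name: row.fromStopName}), (t {name: row.toStopName})
FOREACH (_ IN CASE WHEN row.route = "route1_1" THEN [1] ELSE [] END |
  CREATE (f)-[:route1_1]->(t))
FOREACH (_ IN CASE WHEN row.route = "route1_2" THEN [1] ELSE [] END |
  CREATE (f)-[:route1_2]->(t))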
I have a list of MATCH statements which are totally unrelated to each other, like:
MATCH (a:Person),(b:InProceedings) WHERE a.identifier = 'person/joseph-valeri' and b.identifier = 'conference/edm2008/paper/209' CREATE (a)-[r:creator]->(b)
MATCH (a:Person),(b:InProceedings) WHERE a.identifier = 'person/nell-duke' and b.identifier = 'conference/edm2008/paper/209' CREATE (a)-[r:creator]->(b)
But if I execute them all at once, I get the following error:
WITH is required between CREATE and MATCH (line 2, column 1)
What changes should I incorporate?
(I am new to Neo4j)
Does this need to happen in a single transaction? In that case, you should match your nodes up front before performing the create:
MATCH (jo:Person{identifier:'person/joseph-valeri'}), (nell:Person{identifier:'person/nell-duke'}), (b:InProceedings{identifier:'conference/edm2008/paper/209'})
CREATE (jo)-[:creator]->(b), (nell)-[:creator]->(b)
If it's just the two creators you could change the create to:
CREATE (jo)-[:creator]->(b)<-[:creator]-(nell)
If this isn't what you want to achieve, then effectively what you have posted is two distinct Cypher statements that you are trying to run as one, and the parser is getting confused.
Post comment edit
Given that you said millions, I think you will find the transaction time of the import prohibitive, so you should investigate the CSV import syntax (and specifically pay attention to PERIODIC COMMIT), assuming you can write to CSV instead of the big Cypher dump.
If for some reason that is not an option and you are starting from empty, then build slowly, creating nodes first. These are going to need variables to keep the speed up (the variables aren't persisted; they are just constants in your Cypher query):
CREATE (a:Person{identifier:'person/joseph-valeri'}),
(b:Person{identifier:'person/nell-duke'}),
(zzz:Person{identifier:'person/do-you-really-want-person-in-all-these-identifiers'}),
(inProca:InProceedings{identifier:'conference/edm2008/paper/209'}),
(inProcb:InProceedings{identifier:'conference/edm2009/paper/209'})
You will have kept track of a, b, ..., zzz in your Python script, allowing you to build the CREATE statement up with:
(a)-[:creator]->(inProca), (zzz)-[:creator]->(inProcb)
Now if all of your nodes already exist and you just want to build the relationships in now, then you have the choice of:
Performing individual MATCHes and CREATEs for each new relationship, executing each one individually. This looks like what your original code was doing. You should move the conditions into the MATCH rather than the WHERE clause.
MATCHing a large set of nodes and CREATEing the new relationships. This is more akin to what my initial code was doing and will require your script to be smart in generating the queries.
MERGEing existing nodes into new relationships.
Whatever you do, you're going to need to batch the writes within the transaction or you will run out of memory. You can advise Neo4j to do this by using the USING PERIODIC COMMIT 50000 syntax; here is a great blog post on it.
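For instance, the MERGE option combined with batching might look like this (a sketch; creators.csv with person and paper columns is a hypothetical file your script would write out):

USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:///creators.csv" AS row
MATCH (a:Person {identifier: row.person})
MATCH (b:InProceedings {identifier: row.paper})
MERGE (a)-[:creator]->(b);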