Best use of CREATE CONSTRAINT during LOAD CSV - neo4j

I'm unsure if I'm using CREATE CONSTRAINT optimally while importing CSV data via LOAD CSV and would appreciate feedback/advice from the more knowledgeable.
I am importing from databases of about 3 and 12 million records. I know that the bulk import function would be faster, but for various reasons, LOAD CSV is the better option for this project. I can let things run for a long time, but want to be sure I'm optimizing as much as possible.
My code is currently:
CREATE CONSTRAINT ON (i:Inventor) ASSERT i.hanId IS UNIQUE;
CREATE CONSTRAINT ON (p:Patent) ASSERT p.patNo IS UNIQUE;
CREATE CONSTRAINT ON (c:Country) ASSERT c.countryCode IS UNIQUE;
// Import Inventors and link them to their country
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///...//names.short" AS row
FIELDTERMINATOR '|'
MERGE (c:Country {countryCode:row.Person_ctry_code})
MERGE (i:Inventor {hanId:row.HAN_ID, name:row.Person_name_clean})
CREATE (i)-[:LivesIn]->(c);
// Load patents and link them to their inventors
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///.../patents.short" as row
FIELDTERMINATOR '|'
MERGE (i:Inventor {hanId:row.HAN_ID})
MERGE (p:Patent {patNo:row.Patent_number})
CREATE (i)-[:Invented]->(p);
Each inventor has a unique hanId, each patent a unique patNo, and each country a unique countryCode, although each inventor, patent, and country may show up in the data many times.
Is creating the constraints before I begin the LOAD CSV statements optimal?
Are there any obvious ways to improve the speed of my imports?
Thank you very much.

Creating the constraints before loading the CSVs is the right move: they only need to be created once, and they must already exist for your MERGE lookups to use the backing indexes during the import.
As for your import queries, it's best to MERGE only with the unique property, and use ON CREATE to SET additional properties (like an inventor's name).
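For example, your inventor import could be restructured like this (a sketch based on your own query):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///...//names.short" AS row
FIELDTERMINATOR '|'
MERGE (c:Country {countryCode:row.Person_ctry_code})
MERGE (i:Inventor {hanId:row.HAN_ID})
ON CREATE SET i.name = row.Person_name_clean
CREATE (i)-[:LivesIn]->(c);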
As far as speed improvements go, when you're importing you're likely only doing this once, so speed usually isn't a factor unless it's taking an unusually long time for some reason.
One way you could improve this is to load CSVs with just :Country, just :Inventor, and just :Patent, with no repeated entries, and use CREATE instead of MERGE to get them into the db. Then, after all nodes are imported, you can use the queries and CSVs in your description to create the relationships, but use MATCH instead of MERGE on all nodes.
Remember that MERGE is shorthand for attempting a MATCH, and if no MATCH, it will CREATE, so creating all your nodes ahead of time with CREATE avoids the extra unnecessary checks to see if the node exists first.
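A sketch of that two-pass approach for inventors (the deduplicated file name inventors.short is illustrative; you would prepare one such file per label):
// Pass 1: nodes only, from a deduplicated file
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///.../inventors.short" AS row
FIELDTERMINATOR '|'
CREATE (:Inventor {hanId:row.HAN_ID, name:row.Person_name_clean});
// Pass 2: relationships only, from the original file
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///...//names.short" AS row
FIELDTERMINATOR '|'
MATCH (i:Inventor {hanId:row.HAN_ID})
MATCH (c:Country {countryCode:row.Person_ctry_code})
CREATE (i)-[:LivesIn]->(c);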
EDIT
cybersam's answer to a different question highlighted something I wasn't previously aware of: apparently indexes are not used for a lookup when the lookup value comes from a property access (and that should apply to unique-constraint-backed indexes too).
To get around this, alias the property values with WITH first, then use the aliases.
For example, in your query to load :Patent and :Inventor nodes, you would have to do something like this:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///.../patents.short" as row
FIELDTERMINATOR '|'
WITH row.HAN_ID as hanId, row.Patent_number as patNo
MERGE (i:Inventor {hanId:hanId})
MERGE (p:Patent {patNo:patNo})
CREATE (i)-[:Invented]->(p);

Related

Neo4j Data Import Slowness

I have to load around 5M records into the Neo4j DB, so I broke the Excel data into chunks of 100K rows. The data is in tabular format and I am using Cypher Shell, but it has been more than 8 hours and it's still stuck on the first chunk.
I'm using:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS from 'file://aa.xlsx' as row
MERGE (p1:L1 {Name: row.sl1})
MERGE (p2:L2 {Name: row.sl2})
MERGE (p3:L3 {Name: row.sl3, Path:row.sl3a})
MERGE (p4:L4 {Name: row.sl4})
MERGE (p5:L4 {Name: row.tl1})
MERGE (p6:L3 {Name: row.tl2})
MERGE (p7:L2 {Name: row.tl3, Path:row.tl3a})
MERGE (p8:L1 {Name: row.tl4})
MERGE (p1)-[:s]->(p2)-[:s]->(p3)-[:s]->(p4)-[:it]->(p5)-[:t]->(p6)-[:t]->(p7)-[:t]->(p8)
Can anyone suggest changes, or an alternate method, to load the data in a faster way?
Data in Excel Format
For importing a large amount of data, you should consider using the import tool instead of Cypher's LOAD CSV clause. That tool can only import into a previously unused database.
If you still want to use LOAD CSV, you need to make some changes.
You are using MERGE improperly, and are probably generating many duplicate nodes and relationships as a result. You may find this answer instructive.
A MERGE clause's entire pattern will be created if anything in the pattern does not already exist.
So, your last MERGE pattern, with its seven relationships, is especially dangerous. It should be split into seven MERGE clauses with individual relationships.
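For instance, using the variables from your query, that last pattern becomes (a sketch):
MERGE (p1)-[:s]->(p2)
MERGE (p2)-[:s]->(p3)
MERGE (p3)-[:s]->(p4)
MERGE (p4)-[:it]->(p5)
MERGE (p5)-[:t]->(p6)
MERGE (p6)-[:t]->(p7)
MERGE (p7)-[:t]->(p8)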
Also, a MERGE pattern that specifies multiple properties is likely bad as well. For example, if all L3 nodes have a unique Name value, then it would be safer to replace this:
MERGE (p3:L3 {Name: row.sl3, Path:row.sl3a})
with something like the following:
MERGE (p3:L3 {Name: row.sl3})
ON CREATE SET p3.Path = row.sl3a
In the above snippet, if the node already exists but row.sl3a is different than the existing Path value, then no additional node is created. In addition, since the node already existed, the ON CREATE option does not execute its SET clause, leaving the original Path value unchanged. You could also choose to use ON MATCH instead, or even just call SET directly if you want to set the value no matter what.
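For the set-it-no-matter-what variant, that would simply be (a sketch):
MERGE (p3:L3 {Name: row.sl3})
SET p3.Path = row.sl3a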
To avoid having to scan through all the nodes with a given label every time MERGE needs to find an existing node, you should create an index or uniqueness constraint for every label/property pair of every node that you are MERGEing (see the statements after this list):
:L1(Name)
:L2(Name)
:L3(Name)
:L4(Name)
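In the constraint syntax used elsewhere in this thread, and assuming Name really is unique per label, that would be (a sketch):
CREATE CONSTRAINT ON (n:L1) ASSERT n.Name IS UNIQUE;
CREATE CONSTRAINT ON (n:L2) ASSERT n.Name IS UNIQUE;
CREATE CONSTRAINT ON (n:L3) ASSERT n.Name IS UNIQUE;
CREATE CONSTRAINT ON (n:L4) ASSERT n.Name IS UNIQUE;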

neo4j load csv taking infinite time to execute my query

I am loading the data into neo4j using LOAD CSV. I have two types of nodes - Director and Company.
The commands below are working fine and execute within 50 milliseconds.
LOAD CSV FROM "file:///Director.csv" AS line
CREATE (:Director {DirectorDIN:line[0]})
LOAD CSV FROM "file:///Company.csv" AS line
CREATE (:Company {CompanyCIN:line[0]})
Now I am trying to build the relationships between the two node types, which is taking an infinite time to execute. Here is the simple query that I am trying:
LOAD CSV FROM "file:///CompanyDirector.csv" AS line
match(c:Company{CompanyCIN:toString(line[0])}),(d:Director{DirectorDIN:toString(line[1])}) create (c)-[:Directed_by]->(d)
I have also tried:
LOAD CSV FROM "file:///CompanyDirector.csv" AS line
match(c:Company{CompanyCIN:line[0]}),(d:Director{DirectorDIN:line[1]}) create (c)-[:Directed_by]->(d)
It is taking an infinite time. Please let me know what the issue could be here.
Information:
The CSV file does not contain more than 20k records.
CompanyCIN is alphanumeric
DirectorDIN is numeric in nature
I think you forgot to create some schema constraints in your database:
CREATE CONSTRAINT on (n:Company) ASSERT n.CompanyCIN IS UNIQUE;
CREATE CONSTRAINT on (n:Director) ASSERT n.DirectorDIN IS UNIQUE;
Without those constraints, the complexity of your query is N*M, where N is the number of Company nodes and M is the number of Director nodes.
To see what I mean, you can EXPLAIN your query before and after the creation of those constraints.
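For example, prefixing the query with EXPLAIN shows the plan without running it (a sketch; operator names vary slightly by version):
EXPLAIN
LOAD CSV FROM "file:///CompanyDirector.csv" AS line
MATCH (c:Company {CompanyCIN:line[0]})
MATCH (d:Director {DirectorDIN:line[1]})
CREATE (c)-[:Directed_by]->(d);
Before the constraints you should see NodeByLabelScan operators; afterwards, index seeks.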
Moreover, you should also use PERIODIC COMMIT in your LOAD CSV query, like this:
USING PERIODIC COMMIT 5000
LOAD CSV FROM "file:///CompanyDirector.csv" AS line
MATCH (c:Company{CompanyCIN:line[0]})
MATCH (d:Director{DirectorDIN:line[1]})
CREATE (c)-[:Directed_by]->(d)
The main issue was that you did not have indexes on :Company(CompanyCIN) and :Director(DirectorDIN). Without the indexes, neo4j is forced to evaluate every possible pair of Company and Director nodes for every line in your CSV file. That takes a lot of time.
CREATE INDEX ON :Company(CompanyCIN);
CREATE INDEX ON :Director(DirectorDIN);
By the way, creating the corresponding uniqueness constraints (as suggested by @logisima) has the side-effect of creating these indexes, but the issue was not caused by missing uniqueness constraints.
In addition, you should avoid creating duplicate Directed_by relationships by using MERGE instead of CREATE.
This should work better (you can use the USING PERIODIC COMMIT option, as suggested by @logisima):
USING PERIODIC COMMIT 5000
LOAD CSV FROM "file:///CompanyDirector.csv" AS line
MATCH (c:Company {CompanyCIN:line[0]})
MATCH (d:Director {DirectorDIN:line[1]})
MERGE (c)-[:Directed_by]->(d)

Cypher Query endless loop

I am new to graph databases and especially Cypher. I am importing data from my CSV. Below is the sample; I pulled in some country data and added the cities and states. Now I am pushing the data for areas.
LOAD CSV WITH HEADERS FROM
"file:///X:/loc.csv" as csvRow
MATCH (ct:city {poc:csvRow.poc})
MERGE (loc:area {eoc: csvRow.eoc, name:csvRow.loc_nme, name_wr:replace(csvRow.loc_nme," ","")})
MERGE (loc)-[:exists_inside]->(ct)
I've already pushed city and country data using the same query and built a relation between them too.
But when I try to create the areas inside the cities, it just keeps going; there is no stopping it (15 minutes have passed).
There are 7000 cities in the data I've got from the internet and 90k areas inside those cities.
Is it just taking time, or have I messed up the query?
After the Update
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
"file:///X:/loc.csv" as csvRow
MATCH (ct:city {poc:csvRow.poc})
MERGE (loc:area {eoc: csvRow.eoc, name:csvRow.loc_nme, name_wr:replace(csvRow.loc_nme," ","")})
MERGE (loc)-[:exists_inside]->(ct)
Okay, your query plan shows NodeByLabelScans and filters are being used to find your nodes, which means that every time you match or merge to a node, it has to scan all nodes with the given labels and perform property access on all of them to find the nodes you're looking for.
You need to add indexes (or unique constraints, depending on if the field is supposed to be unique) on the relevant label/property combinations so those lookups will be quick.
So you'll need one on :city(poc), and probably one on :area(eoc), assuming those properties are meant to be unique.
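In the old index syntax used elsewhere in this thread, that would be (a sketch; adjust if eoc alone does not identify an area):
CREATE INDEX ON :city(poc);
CREATE INDEX ON :area(eoc);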
EDIT
One other big thing I initially missed, you need to add USING PERIODIC COMMIT before the LOAD CSV so the load will batch the writes to the db, that should do the trick here.

Big data import into neo4j

I am importing around 12 million nodes and 13 million relationships.
First I used LOAD CSV with periodic commit 50000 and divided the data into different chunks, but it's still taking too much time.
Then I saw the batch insertion method, but for that I would have to create new data sets in an Excel sheet.
Basically I am importing the data from SQL Server: first I save the data into CSV, then import it into neo4j.
Also, I am using the neo4j community version. I changed the configuration properties following everything I had found on Stack Overflow, but still: with periodic commit 50K the import starts off fast, yet after about 1 million records it takes too much time.
Is there any way to import this data directly from SQL in a short span of time, since neo4j is famous for working fast with big data? Any suggestions or help?
Here is the LOAD CSV used (there is an index on :Number(num)):
USING PERIODIC COMMIT 50000 load csv with headers from "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv"
AS csvLine fieldterminator ';'
Merge (Numbers:Number {num: csvLine.Numbers}) return * ;
USING PERIODIC COMMIT 50000 load csv with headers from "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv"
AS csvLine fieldterminator ';'
MERGE (TermNum:Number {num: csvLine.TermNum})
MERGE (OrigNum:Number {num: (csvLine.OrigNum)})
MERGE (OrigNum)-[r:CALLS ]->(TermNum) return * ;
How long is it taking?
To give you a reference, my db is about 4m nodes, 650,000 unique relationships, and ~10m-15m properties (not as large, but it should provide an idea). It takes me less than 10 minutes to load the nodes file + set multiple labels, and then load the relationships file + create the relationships (all via LOAD CSV). This is also being done on a souped-up computer, but if yours is taking hours, I would make some tweaks.
My suggestions are as follows:
Are you intentionally returning the nodes after the MERGE? I can't imagine you are doing anything with it, but either way, consider removing the RETURN *. With RETURN *, you're returning all nodes, relationships, and paths found in the query and that's bound to slow things down. (http://neo4j.com/docs/stable/query-return.html#return-return-all-elements)
Is the "num" field meant to be unique? If so, consider adding the following constraints (NOTE: this will also create the index, so no need to create it separately). I think this might speed up the MERGE (I'm not sure on that), though see next point.
CREATE CONSTRAINT ON (Numbers:Number) ASSERT Numbers.num IS UNIQUE;
If the num field is unique AND this is a brand new database (i.e. you're starting from scratch when you run this script), then call CREATE to create the nodes, rather than MERGE (for the creation of the nodes only).
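Putting those two points together, the first import might look like this (a sketch assuming a brand-new database and unique num values):
CREATE CONSTRAINT ON (n:Number) ASSERT n.num IS UNIQUE;
USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv"
AS csvLine FIELDTERMINATOR ';'
CREATE (:Number {num: csvLine.Numbers});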
As was already mentioned by Christophe, you should definitely increase the heap size to around 4g.
Let us know how it goes!
EDIT 1
I have not been able to find much relevant information on memory/performance tuning for the Windows version. What I have found leaves me with a couple of questions, and is potentially outdated.
This is potentially outdated, but provides some background on some of the different settings and the differences between Windows and Linux.
http://blog.bruggen.com/2014/02/some-neo4j-import-tweaks-what-and-where.html
Those differences between Windows & Linux have themselves changed from one version to the next (as demonstrated by the following links):
Cypher MATCH query speed,
https://stackoverflow.com/a/29055966/4471711
Michael's response above seems to indicate that if you're NOT running a Java application with Neo4j, you don't need to worry about the heap (-Xmx); however, that doesn't seem right to me given the other information I saw, but perhaps all of that other info predates 2.2.
I have also been through this:
http://neo4j.com/docs/stable/configuration.html
So, what I have done is set both heap (-Xmx in the neo4j.vmoptions) and the pagecache to 32g.
Can you modify your heap settings to 4096MB?
Also, in the second LOAD CSV, are the numbers used in the first two MERGE clauses already in the database? If yes, use MATCH instead.
I would also commit at a level of 10000.
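For reference, on Neo4j 3.x the heap and page cache are set in conf/neo4j.conf (a sketch; the key names differ in older 2.x releases):
# conf/neo4j.conf
dbms.memory.heap.initial_size=4096m
dbms.memory.heap.max_size=4096m
dbms.memory.pagecache.size=4g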

In Neo4j, is there a way to read relationship name dynamically using loadcsv?

I have created nodes using the LOAD CSV method in Cypher. The next part is creating relationships between the nodes. For that I have a CSV in the following format:
fromStopName,from,route,toStopName,to
Swargate,1,route1_1,Swargate Corner,2
Swargate Corner,2,route1_1,Hirabaug,3
Hirabaug,3,route1_1,Maruti,4
Maruti,4,route1_1,Mandai,5
Now I would like to have the "route" name as the relationship between the nodes. So I am using the following LOAD CSV command in Cypher:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:C:\\\\busroutes.csv" AS row
MATCH(f {name:row.fromStopName}),(t {name:row.toStopName}) CREATE f - [:row.route]->t
But it looks like I cannot do that. Instead, if I name the relationship statically and then assign a property from the CSV route field, it works:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:C:\\\\busroutes.csv" AS row
MATCH(f {name:row.fromStopName}),(t {name:row.toStopName}) CREATE f - [:CONNECTS {route: row.route}]->t
I am wondering if this is disabled to enforce the good practice of having "pure" verb-like relationship types and avoiding a multiplicity of near-identical relationships, like "connected by 1_1", "connected by 1_2".
Or maybe I am just not finding the right link or not using the correct syntax. I appreciate any help!
Right now you can't, as the relationship type is structural information.
Either use the neo4j-import tool for that.
Or use one CSV file per relationship type and spell out the rel-type.
Or even filter the CSV and do multiple passes:
e.g.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:C:\\\\busroutes.csv" AS row
with row where row.route = "route1_1"
MATCH(f {name:row.fromStopName}),(t {name:row.toStopName})
CREATE (f)-[:route1_1]->(t)
There is also a trick using fake conditionals but you still have to spell them out.
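That trick looks roughly like this (a sketch; the CASE that yields an empty list makes the CREATE conditional, but you still need one FOREACH per relationship type):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:C:\\\\busroutes.csv" AS row
MATCH (f {name:row.fromStopName}),(t {name:row.toStopName})
FOREACH (ignoreMe IN CASE WHEN row.route = "route1_1" THEN [1] ELSE [] END |
  CREATE (f)-[:route1_1]->(t))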
