Neo4j Data Import Slowness - neo4j

I have to load around 5M records into a Neo4j DB, so I broke the Excel file into chunks of 100K. The data is in tabular format and I am using Cypher Shell for the import, but it has been more than 8 hours and it is still stuck on the first chunk.
I'm using:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS from 'file://aa.xlsx' as row
MERGE (p1:L1 {Name: row.sl1})
MERGE (p2:L2 {Name: row.sl2})
MERGE (p3:L3 {Name: row.sl3, Path:row.sl3a})
MERGE (p4:L4 {Name: row.sl4})
MERGE (p5:L4 {Name: row.tl1})
MERGE (p6:L3 {Name: row.tl2})
MERGE (p7:L2 {Name: row.tl3, Path:row.tl3a})
MERGE (p8:L1 {Name: row.tl4})
MERGE (p1)-[:s]->(p2)-[:s]->(p3)-[:s]->(p4)-[:it]->(p5)-[:t]->(p6)-[:t]->(p7)-[:t]->(p8)
Can anyone suggest changes or an alternative method to load the data faster?
Data in Excel Format

For importing a large amount of data, you should consider using the import tool instead of Cypher's LOAD CSV clause. That tool can only import into a previously unused database.
If you still want to use LOAD CSV, you need to make some changes.
You are using MERGE improperly, and are probably generating many duplicate nodes and relationships as a result. You may find this answer instructive.
A MERGE clause's entire pattern will be created if anything in the pattern does not already exist.
So, your last MERGE pattern, with its seven relationships, is especially dangerous. It should be split into seven MERGE clauses with individual relationships.
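As a rough sketch of that split, the question's query could look like the following. The node clauses are kept exactly as posted; only the final pattern changes. The file name aa.csv is an assumption: LOAD CSV reads delimited text, so this presumes the sheet has been exported from Excel to CSV first.

```cypher
USING PERIODIC COMMIT
// Assumption: the Excel sheet was exported to aa.csv in the import directory
LOAD CSV WITH HEADERS FROM 'file:///aa.csv' AS row
MERGE (p1:L1 {Name: row.sl1})
MERGE (p2:L2 {Name: row.sl2})
MERGE (p3:L3 {Name: row.sl3, Path: row.sl3a})
MERGE (p4:L4 {Name: row.sl4})
MERGE (p5:L4 {Name: row.tl1})
MERGE (p6:L3 {Name: row.tl2})
MERGE (p7:L2 {Name: row.tl3, Path: row.tl3a})
MERGE (p8:L1 {Name: row.tl4})
// One MERGE per relationship: each is matched or created on its own,
// so a single missing relationship no longer recreates the whole chain
MERGE (p1)-[:s]->(p2)
MERGE (p2)-[:s]->(p3)
MERGE (p3)-[:s]->(p4)
MERGE (p4)-[:it]->(p5)
MERGE (p5)-[:t]->(p6)
MERGE (p6)-[:t]->(p7)
MERGE (p7)-[:t]->(p8)
```

Because every node variable is already bound by the time the relationship clauses run, each of those MERGE clauses only has to check for one relationship between two known nodes.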
Also, a MERGE pattern that specifies multiple properties is likely bad as well. For example, if all L3 nodes have a unique Name value, then it would be safer to replace this:
MERGE (p3:L3 {Name: row.sl3, Path:row.sl3a})
with something like the following:
MERGE (p3:L3 {Name: row.sl3})
ON CREATE SET p3.Path = row.sl3a
In the above snippet, if the node already exists but row.sl3a is different than the existing Path value, then no additional node is created. In addition, since the node already existed, the ON CREATE option does not execute its SET clause, leaving the original Path value unchanged. You could also choose to use ON MATCH instead, or even just call SET directly if you want to set the value no matter what.
To avoid having to scan through all the nodes with a given label every time MERGE needs to find an existing node, you should create an index or uniqueness constraint for every label/property pair of every node that you are MERGEing:
:L1(Name)
:L2(Name)
:L3(Name)
:L4(Name)
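A sketch of the corresponding statements, to be run once before the import. The exact syntax depends on the Neo4j version; the form below is the older one, which matches a server that still accepts USING PERIODIC COMMIT. Plain indexes are used here because the question does not say whether Name is unique per label.

```cypher
// Older syntax (Neo4j 3.x, still accepted in 4.x)
CREATE INDEX ON :L1(Name);
CREATE INDEX ON :L2(Name);
CREATE INDEX ON :L3(Name);
CREATE INDEX ON :L4(Name);

// If Name is known to be unique for a label, a uniqueness constraint
// can be used instead of the index for that label, for example:
//   CREATE CONSTRAINT ON (n:L1) ASSERT n.Name IS UNIQUE;
//
// Newer versions use a different form, for example
// (l1_name is a hypothetical index name):
//   CREATE INDEX l1_name FOR (n:L1) ON (n.Name);
```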

Related

How to make relationships between already created nodes of different columns from a single csv file?

I have a single csv file whose contents are as follows -
id,name,country,level
1,jon,USA,international
2,don,USA,national
3,ron,USA,local
4,bon,IND,national
5,kon,IND,national
6,jen,IND,local
7,ken,IND,international
8,ben,GB,local
9,den,GB,international
10,lin,GB,national
11,min,AU,national
12,win,AU,local
13,kin,AU,international
14,bin,AU,international
15,nin,CN,national
16,con,CN,local
17,eon,CN,international
18,fon,CN,international
19,pon,SZN,national
20,zon,SZN,international
First of all I created a constraint on id
CREATE CONSTRAINT idConstraint ON (n:Name) ASSERT n.id IS UNIQUE
Then I created nodes for name, then for country and finally for level as follows -
LOAD CSV WITH HEADERS FROM "file:///demo.csv" AS row
MERGE (name:Name {name: row.name, id: row.id, country:row.country, level:row.level})
MERGE (country:Country {name: row.country})
MERGE (level:Level {type: row.level})
I can see the nodes fine. However, I want to be able to query for things like, for a given country how many names are there? For a given level, how many countries and then how many names for that country are there?
So for that I need to make Relationships between the nodes.
For that I tried like this -
LOAD CSV WITH HEADERS FROM "file:///demo.csv" AS row
MATCH (n:Name {name:row.name}), (c:Country {name:row.country})
CREATE (n)-[:LIVES_IN]->(c)
RETURN n,c
However this gives me a warning as follows -
This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (c))
Moreover the resulting Graph looks slightly wrong - each Name node has 2 relations with a country whereas I would think there would be only one?
I also have a nagging fear that I am not doing things in an optimized or correct way. This is just a demo. In my real dataset, I often cannot run multiple CREATE or MERGE statements together. I have to LOAD the same CSV file again and again to do pretty much everything from creating nodes. When creating relationships, because a cartesian product forms, the command basically gives Java Heap Memory error.
PS. I just started with neo4j yesterday. I really don't know much about it. I have been struggling with this for a whole day, hence thought of asking here.
You can ignore the cartesian product warning, since that exact approach is needed in order to create the relationships that form the patterns you need.
As for the multiple relationships, it's possible you ran the query twice. The second run would have created the duplicate relationships. You could use MERGE instead of CREATE for the relationships, which would ensure that no duplicates are created.
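A minimal sketch of that change, reusing the demo.csv file and the labels from the question. Looking up the Name node by id rather than by name is an assumption on my part: id is the property covered by the uniqueness constraint, so the lookup can use the constraint's index.

```cypher
LOAD CSV WITH HEADERS FROM "file:///demo.csv" AS row
// id is the constrained property, so this lookup can use its index
MATCH (n:Name {id: row.id})
MATCH (c:Country {name: row.country})
// MERGE creates LIVES_IN only when it is not already there,
// so running the import a second time adds nothing
MERGE (n)-[:LIVES_IN]->(c)
```

Relationships that were already duplicated by an earlier CREATE run stay in the graph; MERGE only prevents new duplicates from being added.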

Neo4j: how to avoid node to be created again if it is already in the database?

I have a question about Cypher requests and the update of a database.
I have a Python script that does web scraping and generates a csv at the end. I use this csv to import data into a Neo4j database.
The scraping is done 5 times a day. Every time a new scraping is done the csv is updated: the new data is appended to the previous csv, and so on.
I import the data after each scraping.
Currently, when I import the data after each scraping to update the DB, all the nodes are created again even if they are already in the DB.
For example the first csv gives 5 rows and I insert this in Neo4j.
Next the new scraping gives 2 rows so the csv has now 7 rows. And if I insert the data I will have the first five rows twice in the DB.
I would like to have everything unique and not added if it is already in the database.
For example when I try to create node ARTICLE I do this:
CREATE (a:ARTICLE {id:$id, title:$title, img_url:$img_url, link:$link, sentence:$sentence, published:$published})
I think using MERGE instead of CREATE should solve the problem, but it doesn't, and I can't figure out why.
How can I do this?
A MERGE clause will create its entire pattern if any part of it does not already exist. So, for a MERGE clause to work reasonably, the pattern used with it must only specify the minimum data necessary to uniquely identify a node (or a relationship).
For instance, assuming ARTICLE nodes are supposed to have unique id properties, then you should replace your CREATE clause:
CREATE (a:ARTICLE {id:$id, title:$title, img_url:$img_url, link:$link, sentence:$sentence, published:$published})
with something like this:
MERGE (a:ARTICLE {id:$id})
SET a += {title:$title, img_url:$img_url, link:$link, sentence:$sentence, published:$published}
In the above example, the SET clause will always overwrite the non-id properties. If you want to set those properties only when the node is created, you can use ON CREATE before the SET clause.
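The ON CREATE variant, sketched with the same parameters as above, would look roughly like this:

```cypher
// The node is identified by id only
MERGE (a:ARTICLE {id: $id})
// Runs only when MERGE had to create the node; an existing
// article keeps whatever properties it already has
ON CREATE SET a += {title: $title, img_url: $img_url, link: $link, sentence: $sentence, published: $published}
```

With this form, re-importing a row that is already in the DB leaves the existing node untouched.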
Use MERGE instead of CREATE. You can use it for both nodes and relationships.
MERGE (charlie { name: 'Charlie Sheen', age: 10 })
Create a single node with properties where not all properties match any existing node.
MATCH (a:Person {name: "Martin"}),
(b:Person {name: "Marie"})
MERGE (a)-[r:LOVES]->(b)
Finds or creates a relationship between the nodes.

MERGE the Creation of Nodes from Two geohash Columns in CSV

So I am planning to create a geohash graph with Neo4j.
My CSV contains, for each row, two geohash values: one for the pickup and another for the dropoff, as follows:
What I want is:
a node that has the same geohash as an existing one shouldn't be recreated (so multiple edges are allowed).
one node can be a pickup and a dropoff at the same time
I tried to use MERGE, but it works per column:
load csv from "file:///green_data.csv" as line
merge (pick:pickup {geohash: line[20]})
merge (drop:dropoff {geohash: line[22]})
merge (pick)-[:trip]->(drop)
As you can see, the same geohash node (dr5rkky) is being created twice, once for pickups and once for dropoffs.
How can I avoid that?
load csv from "file:///green_data.csv" as line
MERGE (p:HashNode {geohash: line[20]})
ON CREATE set p.pickup=True
ON MATCH set p.pickup=True
MERGE (d:HashNode {geohash: line[22]})
ON CREATE set d.dropoff=True
ON MATCH set d.dropoff=True
MERGE (p)-[:trip]->(d)
Based on the Neo4j docs:
MERGE either matches existing nodes and binds them, or it creates new data and binds that. It’s like a combination of MATCH and CREATE that additionally allows you to specify what happens if the data was matched or created.
The last part of MERGE is the ON CREATE and ON MATCH. These allow a query to express additional changes to the properties of a node or relationship, depending on if the element was MATCH-ed in the database or if it was CREATE-ed.

create nodes / relationship pointing to the same city

I have an empty neo4j database. I want the city{val:"new york"} node to only have one instance, not two. What is the correct way to CREATE these nodes and relationships so that john and sam are pointing to the same city{val:"new york"} node?
CREATE
(p:person{name:"john"}),
(c:city{val:"new york"}),
(p)-[:LIVES_IN]->(c)
CREATE
(p:person{name:"sam"}),
(c:city{val:"new york"}),
(p)-[:LIVES_IN]->(c)
The data I am importing is in a csv file. I need some way to only create the city if it does not already exist. I tried to replace CREATE with MERGE, but the syntax is unclear.
It is simpler (and safer, since you don't always know if the data already exists) to just always use MERGE in cases where there can be duplicate attempts to create data that you want to be unique.
These 2 blocks of Cypher statements will not create duplicate nodes/relationships, even if you reverse the order (or if the DB already has some of the same data).
MERGE (p:person{name:"john"})
MERGE (c:city{val:"new york"})
MERGE (p)-[:LIVES_IN]->(c);
MERGE (p:person{name:"sam"})
MERGE (c:city{val:"new york"})
MERGE (p)-[:LIVES_IN]->(c);
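Since the data comes from a csv file, the same three clauses can be driven by LOAD CSV. This is only a sketch: the file name people.csv and the column headers name and city are hypothetical, as the question does not show the file layout.

```cypher
// Hypothetical file with a header row: name,city
LOAD CSV WITH HEADERS FROM "file:///people.csv" AS row
MERGE (p:person {name: row.name})
// Every row naming the same city binds to the same city node
MERGE (c:city {val: row.city})
MERGE (p)-[:LIVES_IN]->(c)
```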
Answering my own question. Each line needs its own MERGE clause.
CREATE
(p:person{name:"john"}),
(c:city{val:"new york"}),
(p)-[:LIVES_IN]->(c)
MERGE (p:person{name:"sam"})
MERGE (c:city{val:"new york"})
MERGE (p)-[:LIVES_IN]->(c)
Good related resource: https://neo4j.com/blog/common-confusions-cypher/

Avoiding duplication on nodes with same value neo4j

I have two columns in a csv file, emp_id and mngr_id. The relationship is (emp_id)-[:WORKS_UNDER]->(mngr_id). I want to merge all those nodes where emp_id=mngr_id. How can I do that while creating the nodes?
If I understand correctly, you're looking to ensure that you avoid creating duplicate relationships when iterating over the CSV data and avoid entering a relationship where a person works for themselves.
To avoid creating a relationship where emp_id and mngr_id identify the same person, I would suggest filtering the CSV before processing it to enter the data. It should be much easier to omit any lines in the CSV file where the emp_id and mngr_id are the same value before passing it to Neo4j.
Next, if you're using Cypher to do the importing, something like this may be useful:
MERGE (emp:Person {id: 'emp_id'})
MERGE (mgr:Person {id: 'mngr_id'})
MERGE (emp)-[:WORKS_UNDER]->(mgr)
RETURN emp, mgr
Note that if you run the above query multiple times in a block statement then you'll need unique identifiers for emp and mgr in each query.
Merge is explained well in the Neo4j docs: http://docs.neo4j.org/chunked/stable/query-merge.html
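If the import goes through LOAD CSV instead of one statement per row, the filtering can also be done inside the query. A minimal sketch, assuming a hypothetical file employees.csv with a header row containing emp_id and mngr_id:

```cypher
// Hypothetical file name; headers emp_id and mngr_id as in the question
LOAD CSV WITH HEADERS FROM "file:///employees.csv" AS row
// Skip rows where a person would be their own manager
WITH row WHERE row.emp_id <> row.mngr_id
// A single Person label keyed on id: an id that appears in both
// columns ends up as one node, not two
MERGE (emp:Person {id: row.emp_id})
MERGE (mgr:Person {id: row.mngr_id})
MERGE (emp)-[:WORKS_UNDER]->(mgr)
```

Because employees and managers share one label and are merged on id alone, a manager who also appears as an employee in another row is matched rather than created again.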
