Bulk Insertion in Py2neo - neo4j

I'm writing a custom doc manager for mongo-connector to replicate MongoDB documents to Neo4j, and I would like to create relationships in bulk. I'm using py2neo 2020.0.
Previous versions seem to have had options for this, but this version does not. Is there any way to create nodes and relationships in bulk in py2neo?

I am currently working on bulk load functionality. There will be some new functions available in the next release. Until then, Cypher UNWIND...CREATE queries are your best bet for performance.
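A minimal sketch of the UNWIND...CREATE approach driven from Python, assuming the documents are already available as dictionaries. The label Document, the row keys id and props, and the helper names batched and bulk_create are all hypothetical. The run argument stands in for whatever executes Cypher in your setup (for example graph.run in py2neo or session.run in the neo4j driver, both of which take a query and a parameter dictionary).

```python
from itertools import islice

# Hypothetical label and property names, chosen for illustration only.
CREATE_NODES = """
UNWIND $rows AS row
CREATE (n:Document {id: row.id})
SET n += row.props
"""


def batched(rows, size):
    """Yield successive lists of at most `size` items from `rows`."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_create(run, rows, batch_size=1000):
    """Send one UNWIND...CREATE query per batch and return the row count.

    `run` is any callable taking (query, parameters); it is an assumption
    standing in for the Cypher-executing method of your client library.
    """
    total = 0
    for chunk in batched(rows, batch_size):
        run(CREATE_NODES, {"rows": chunk})
        total += len(chunk)
    return total
```

Sending one query per batch keeps each transaction small while avoiding a round trip per node. The batch size of 1000 is only a starting point to tune.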

I would strongly recommend switching to the neo4j Python driver, as it's supported by Neo4j directly.
In any case, you can also do bulk insert directly in Cypher, and/or call that Cypher from within Python using the neo4j driver.
I recommend importing the nodes first, and then the relationships. It helps if you have a guaranteed unique identifier for the nodes, because then you can set up an index on that property before loading. Then you can load nodes from a CSV (or better yet a TSV) file like so:
// Create constraint on the unique ID - greatly improves performance.
CREATE CONSTRAINT ON (a:my_label) ASSERT a.id IS UNIQUE
;
// Load the nodes, along with any properties you might want, from
// a file in the Neo4j import folder.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///my_nodes.tsv" AS tsvLine FIELDTERMINATOR '\t'
CREATE (:my_label{id: toInteger(tsvLine.id), my_field2: tsvLine.my_field2})
;
// Load relationships.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///my_relationships.tsv" AS tsvLine FIELDTERMINATOR '\t'
MATCH (parent_node:my_label)
WHERE parent_node.id = toInteger(tsvLine.parent)
MATCH (child_node:my_label)
WHERE child_node.id = toInteger(tsvLine.child)
// CREATE requires a relationship type; HAS_CHILD is a placeholder name.
CREATE (parent_node)-[:HAS_CHILD]->(child_node)
;
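If the source data lives in Python, as it does with mongo-connector, the TSV file for the node load can be produced with the standard csv module. This is a sketch under assumptions: the function name write_nodes_tsv is hypothetical, and the column names simply mirror the id and my_field2 properties used in the query above.

```python
import csv
import io


def write_nodes_tsv(docs, fieldnames, out):
    """Write `docs` (dicts) to `out` as tab-separated rows with a header.

    Keys not listed in `fieldnames` are dropped, and missing keys are
    written as empty fields. Returns the number of rows written.
    """
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    count = 0
    for doc in docs:
        writer.writerow({key: doc.get(key, "") for key in fieldnames})
        count += 1
    return count


# Example: write two documents to an in-memory buffer.
buffer = io.StringIO()
write_nodes_tsv(
    [{"id": 1, "my_field2": "a", "extra": "x"}, {"id": 2}],
    ["id", "my_field2"],
    buffer,
)
```

In practice out would be a file opened with newline="" inside the Neo4j import folder, so that LOAD CSV can read it through a file:/// URL.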

Related

Bulk Update neo4j relationship properties through csv

I have a CSV file with three columns:
Follower_id,Following_id,createTime
Each node in my Neo4j graph represents a USER and has multiple properties, one of which is profileId. Two nodes can be connected by a FOLLOW_RELATIONSHIP, and I have to update the createTime property on those relationships. There are a lot of relationships in the graph. I am new to Neo4j and don't know how to do a bulk update efficiently.
You can try something like this:
USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM 'FILEPATH' AS row
WITH row
MATCH (u1:User{profileId: row.Follower_id})
MATCH (u2:User{profileId: row.Following_id})
MERGE (u1)-[r:FOLLOW_RELATIONSHIP]->(u2)
SET r.createTime = row.createTime
FILEPATH is the path of the file on your system, usually within the database directory itself or some web link. You can learn how to set it from this article.

How to create relationships with load CSV from a batch-import file relationships

I have already created the nodes, and I would like to reuse the relationships file from an earlier batch import to create the relationships with LOAD CSV.
This is my relationships CSV file:
You'll need to use LOAD CSV for this (with USING PERIODIC COMMIT), and you'll need to watch out for spaces in both the headers (if you use them) and the fields. trim() may help with the fields.
The headers shouldn't contain : if at all possible.
The biggest obstacle will be taking the relationship type dynamically from the CSV. Cypher currently does not handle relationship types dynamically, so you'll need an alternate approach: install APOC Procedures and use apoc.create.relationship() to handle that.
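The header and field cleanup can also be done before the file reaches LOAD CSV. The sketch below is one possible way to do it in Python. The function names clean_header and clean_row are hypothetical, and the batch-import style headers in the example (such as :START_ID(User)) are only illustrative, since the original file is not shown.

```python
import re


def clean_header(header):
    """Turn a batch-import style header into a plain identifier.

    Strips surrounding whitespace and the leading ':', then replaces any
    run of other non-alphanumeric characters with a single underscore.
    """
    name = header.strip().lstrip(":")
    return re.sub(r"[^0-9A-Za-z_]+", "_", name).strip("_")


def clean_row(headers, values):
    """Pair cleaned headers with whitespace-trimmed field values."""
    return {clean_header(h): v.strip() for h, v in zip(headers, values)}


# Example with illustrative headers and padded values.
row = clean_row(
    [":START_ID(User)", " :END_ID(Post)", ":TYPE"],
    [" 1", "2 ", " ASKED "],
)
```

Rewriting the file with cleaned headers means the Cypher side can refer to row.START_ID_User directly instead of backtick-quoting the original column names.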

Best Way to Bulk Load the Data into Neo4J

We are trying to load millions of nodes and relationships into Neo4j. We are currently using the command below:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:customers.csv" AS row
CREATE (:Customer ....
But it is taking a lot of time.
I found a link that explains how to modify the Neo4j files directly:
http://blog.xebia.com/combining-neo4j-and-hadoop-part-ii/
But that link seems to be very old. Is that process still valid?
There is also an issue in the "neo4j-spark-connector" GitHub repository, which is not fully updated:
https://github.com/neo4j-contrib/neo4j-spark-connector/issues/15
Which of these is the best way?
The fastest way, especially for large data sets, should be through the import tool instead of via Cypher with LOAD CSV.
If you are using LOAD CSV, potentially with MERGE, I highly recommend adding unique constraints. For us it sped up a smallish import (100k nodes) by 100 times or so.
You can also use APOC procedures, which can perform better for large datasets. Below is a sample Cypher query:
CALL apoc.periodic.iterate(
'CALL apoc.load.csv(file_path) YIELD lineNo, map as row, list return row',
'MATCH (post:Post {id:row.`:END_ID(Post)`})
MATCH (owner:User {id:row.`:START_ID(User)`})
MERGE (owner)-[:ASKED]->(post);',
{batchSize:500, iterateList:true, parallel:true}
);
Documentation link:
https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_examples_for_apoc_load_csv

neo4j data insertion taking time

I have successfully installed Neo4j Community Edition 3.0.3 on a local Linux server running Ubuntu 14.04, and I access it from a browser on Windows through port 7474 on that server.
Now I have a csv file having sales order data in the following format:
Customer_id, Item_id, Order_Date
It has 90000 rows, and both customer_id and item_id are nodes: a total of 60000 nodes (30000 customer_ids + 30000 item_ids) and 90000 relationships (with order_date stored as the distance attribute). I ran the query below to insert the data from the CSV into my graph database:
LOAD CSV WITH HEADERS FROM "file:///test.csv" AS line
MERGE (n:MyNode {Name:line.Customer})
MERGE (m:MyNode {Name:line.Item})
MERGE (n) -[:TO {dist:line.OrderDate}]-> (m)
I left it running, and after 7 to 8 hours it had still not finished. Am I doing anything wrong? Is my query not optimized, or is this normal? I am new to both Neo4j and Cypher. Please help.
Create a uniqueness constraint
You should create a uniqueness constraint on MyNode.Name:
CREATE CONSTRAINT ON (m:MyNode) ASSERT m.Name IS UNIQUE;
In addition to enforcing the data integrity / uniqueness of MyNode, that will create an index on MyNode.Name, which will speed up the lookups in the MERGE statements. There's a bit more info in the indexes and performance section here.
Using periodic commit
Since Neo4j is a transactional database, the results of your query are built up in memory and the entire query is committed at once. Depending on the size of the data and the resources available on your machine, you may want to use the periodic commit functionality in LOAD CSV to avoid building up the entire statement in memory. Just start your query with USING PERIODIC COMMIT. This will commit the results periodically, freeing memory resources while iterating through your CSV file.
Avoiding the eager
One problem with your query is that it contains an eager operation. This hinders the periodic commit functionality, so the whole transaction is built up in memory regardless. To avoid the eager operation you can make two passes through the CSV file:
Once to create the nodes:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///test.csv" AS line
MERGE (n:MyNode {Name:line.Customer})
MERGE (m:MyNode {Name:line.Item})
Then again to create the relationships:
LOAD CSV WITH HEADERS FROM "file:///test.csv" AS line
MATCH (n:MyNode {Name:line.Customer})
MATCH (m:MyNode {Name:line.Item})
MERGE (n) -[:TO {dist:line.OrderDate}]-> (m)
See these two posts for more info about the eager operation.
At a minimum you need to create the uniqueness constraint - that should be enough to increase the performance of your LOAD CSV statement.
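The same two-pass split can also be prepared on the client, so that the node pass sees each name only once. The following is a sketch, not part of the answer above: the function name split_nodes_and_rels is hypothetical, and the column names Customer, Item and OrderDate are taken from the query rather than from the stated file header.

```python
import csv
import io


def split_nodes_and_rels(lines, src_col, dst_col, prop_col):
    """Read CSV lines once and return (unique node names, relationships).

    Node names keep first-seen order; each relationship is a dict with
    the keys 'from', 'to' and 'dist', ready to pass as query parameters.
    """
    names = {}  # dict used as an insertion-ordered set
    rels = []
    for row in csv.DictReader(lines):
        src, dst = row[src_col].strip(), row[dst_col].strip()
        names.setdefault(src, None)
        names.setdefault(dst, None)
        rels.append({"from": src, "to": dst, "dist": row[prop_col].strip()})
    return list(names), rels


# Example with a tiny in-memory file.
sample = io.StringIO(
    "Customer,Item,OrderDate\n"
    "c1,i1,2016-01-01\n"
    "c1,i2,2016-01-02\n"
    "c2,i1,2016-01-03\n"
)
nodes, rels = split_nodes_and_rels(sample, "Customer", "Item", "OrderDate")
```

The node list can then feed the first pass and the relationship list the second, each sent in batches as query parameters.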

In Neo4j, is there a way to read relationship name dynamically using loadcsv?

I have created nodes with LOAD CSV in Cypher. The next step is creating relationships between the nodes. For that I have a CSV file in the following format:
fromStopName,from,route,toStopName,to
Swargate,1,route1_1,Swargate Corner,2
Swargate Corner,2,route1_1,Hirabaug,3
Hirabaug,3,route1_1,Maruti,4
Maruti,4,route1_1,Mandai,5
Now I would like to use the "route" value as the relationship type between nodes, so I am using the following LOAD CSV command in Cypher:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:C:\\\\busroutes.csv" AS row
MATCH(f {name:row.fromStopName}),(t {name:row.toStopName}) CREATE f - [:row.route]->t
But it looks like I cannot do that. If I instead name the relationship statically and assign the route field from the CSV as a property, it works:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:C:\\\\busroutes.csv" AS row
MATCH(f {name:row.fromStopName}),(t {name:row.toStopName}) CREATE f - [:CONNECTS {route: row.route}]->t
I am wondering whether this is disabled to enforce the good practice of having "pure" verb-like relationship types and to avoid creating many variants of the same relationship, like "connected by 1_1" and "connected by 1_2".
Or am I just not finding the right documentation or not using the correct syntax? I'd appreciate any help!
Right now you can't, as this is structural information.
You can either use the neo4j-import tool for that,
or use one CSV file per type and spell out the rel-type,
or even filter the CSV and do multiple passes, e.g.:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:C:\\\\busroutes.csv" AS row
WITH row WHERE row.route = "route1_1"
MATCH(f {name:row.fromStopName}),(t {name:row.toStopName})
CREATE (f)-[:route1_1]->(t)
There is also a trick using fake conditionals but you still have to spell them out.
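When the CSV is read from Python, the multi-pass idea can be automated by grouping rows on the route column and generating one statement per relationship type. This is a sketch under assumptions: the function name queries_by_type is hypothetical, and it swaps LOAD CSV for an UNWIND over query parameters. Because the type is spliced into the query text, it is checked against a conservative identifier pattern first.

```python
import re
from collections import defaultdict

# Conservative pattern for a relationship type that is safe to splice in.
_REL_TYPE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def queries_by_type(rows, type_key="route"):
    """Group rows by relationship type; return (query, params) pairs.

    Each query spells out one relationship type literally, since Cypher
    cannot take the type from a parameter or a row value.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[type_key]].append(row)

    result = []
    for rel_type, group in groups.items():
        if not _REL_TYPE.fullmatch(rel_type):
            raise ValueError(f"unsafe relationship type: {rel_type!r}")
        query = (
            "UNWIND $rows AS row\n"
            "MATCH (f {name: row.fromStopName}), (t {name: row.toStopName})\n"
            f"CREATE (f)-[:{rel_type}]->(t)"
        )
        result.append((query, {"rows": group}))
    return result
```

Each returned pair can be passed to whatever executes Cypher in your client. Adding a label and an index to the MATCH patterns would speed up the lookups, as in the original query they scan all nodes.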
