Loading edges to neo4j taking too much time - neo4j

Hi, I am trying to load edge files of approximately 80,000 records each into Neo4j.
I am using:
USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM
"file:///EdgesWriterSong_wrote.csv" AS csvLine
MATCH (writer:Writer { id: toInt(csvLine.WriterId),(songs:Songs { SongId: toInt(csvLine.SongId)
CREATE (writer)-[r:Wrote]->(songs)
It is taking way too much time to load. Is there a quicker way pls?

Your query has syntax errors, but I will assume your actual code looks like this:
USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM "file:///EdgesWriterSong_wrote.csv" AS csvLine
MATCH (writer:Writer { id: toInt(csvLine.WriterId) }),
(songs:Songs { SongId: toInt(csvLine.SongId) })
CREATE (writer)-[r:Wrote]->(songs);
The most obvious reason for slowness with such a simple query would be that you have not yet created indexes for :Writer(id) and :Songs(SongId). Do that by running these two queries (one at a time):
CREATE INDEX ON :Writer(id);
CREATE INDEX ON :Songs(SongId);
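If you are on Neo4j 4.x or later, the old CREATE INDEX ON syntax is deprecated; a sketch of the equivalent statements (the index names are arbitrary) would be:
CREATE INDEX writer_id FOR (w:Writer) ON (w.id);
CREATE INDEX songs_song_id FOR (s:Songs) ON (s.SongId);
Also note that toInt() has been replaced by toInteger() in newer Cypher versions.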

Related

Neo4J - unable to create relationships (30,000)

I've got two CSV files, Job (30,000 entries) and Cat (30 entries), imported into Neo4j, and I am trying to create a relationship between them.
Each Job has a cat_ID, and Cat contains the category name and ID.
After executing the following:
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
MATCH (job:Job {cat_ID: row.cat_ID})
MATCH (cat:category {category: row.category})
CREATE (job)-[r:under]->(cat)
It returns (no changes, no records).
I received a prompt recommending that I index the category, so I ran
Create INDEX ON :Job(cat_id); but I still get the same error.
How do I create a relationship between the two?
I am able to get this to work on a smaller dataset.
You are probably trying to match on non-existing nodes. Try
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
MERGE (job:Job {cat_ID: row.cat_ID})
MERGE (cat:category {category: row.category})
CREATE (job)-[r:under]->(cat)
Have a look in your logs and see if you are running out of memory.
You could try chunking the data set up into smaller pieces with Periodic Commit and see if that helps:
:auto USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
MATCH (job:Job {cat_ID: row.cat_ID})
MATCH (cat:category {category: row.category})
CREATE (job)-[r:under]->(cat)
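If the MERGE version creates relationships but the MATCH version does not, that confirms the nodes (or the exact property values) are missing. A quick diagnostic sketch, using the same file and labels as above, counts how many distinct nodes each row actually matches:
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
OPTIONAL MATCH (job:Job {cat_ID: row.cat_ID})
OPTIONAL MATCH (cat:category {category: row.category})
RETURN row.cat_ID AS cat_ID, count(DISTINCT job) AS jobsMatched, count(DISTINCT cat) AS catsMatched;
A common cause of zero matches is that LOAD CSV reads every field as a string, so if cat_ID was stored as a number you would need to compare against toInteger(row.cat_ID).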

How to improve performance of LOAD CSV in NEO4J

I am using the community edition of neo4j. I am trying to create 50,000 nodes and 93,400 relationships using a CSV file, but the LOAD CSV command in neo4j is taking around 40 minutes to create the nodes and relationships.
I am using the py2neo package in Python to connect and run Cypher queries. The LOAD CSV command looks similar to the one below:
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "file:///Sample.csv" AS row WITH row
MERGE(animal:Animal { name:row.`ANIMAL_NAME`})
ON CREATE SET animal = {name:row.`ANIMAL_NAME`, type:row.`TYPE`, status:row.`Status`, birth_date:row.`DATE`}
ON MATCH SET animal +={name:row.`ANIMAL_NAME`,type:row.`TYPE`,status:row.`Status`,birth_date:row.`DATE`}
MERGE (person:Person { name:row.`PERSON_NAME`})
ON CREATE SET person = {name:row.`PERSON_NAME`, age:row.`AGE`, address:row.`Address`, birth_date:row.`PERSON_DATE`}
ON MATCH SET person += { name:row.`PERSON_NAME`, age:row.`AGE`, address:row.`Address`, birth_date:row.`PERSON_DATE`}
MERGE (person)-[:OWNS]->(animal);
Infrastructure Details:
dbms.memory.heap.max_size=16384M
dbms.memory.heap.initial_size=2048M
dbms.memory.pagecache.size=512M
neo4j_version:3.3.9
How would I get it to work faster? Thanks in advance.
Ideally, you should be using the latest neo4j version, as there have been many performance improvements since 3.3.9. Since you already have indexes on :Animal(name) and :Person(name), the other main issue is probably that the Cypher planner is generating an expensive Eager operation (at least in neo4j 4.0.3) for your query. Whenever you have performance issues, you should use EXPLAIN or PROFILE to see the operations that the Cypher planner generates.
Try using this simpler query (which should do the same thing as yours). Using EXPLAIN in neo4j 4.0.3, this query does not use the Eager operation:
:auto USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "file:///Test.csv" AS row
MERGE(animal:Animal {name: row.`ANIMAL_NAME`})
SET animal += {type:row.`TYPE`, status:row.`Status`, birth_date:row.`DATE`}
MERGE (person:Person { name:row.`PERSON_NAME`})
SET person += {age:row.`AGE`, address:row.`Address`, birth_date:row.`PERSON_DATE`}
MERGE (person)-[:OWNS]->(animal);
The :auto command is required in neo4j 4.x when using USING PERIODIC COMMIT.
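As a quick check (a sketch; EXPLAIN only shows the plan and does not execute anything, so the periodic-commit prefix is dropped here), you can prefix the statement with EXPLAIN, or with PROFILE on a small sample, and confirm that no Eager operator appears:
EXPLAIN
LOAD CSV WITH HEADERS FROM "file:///Test.csv" AS row
MERGE (animal:Animal {name: row.`ANIMAL_NAME`})
SET animal += {type:row.`TYPE`, status:row.`Status`, birth_date:row.`DATE`}
MERGE (person:Person { name:row.`PERSON_NAME`})
SET person += {age:row.`AGE`, address:row.`Address`, birth_date:row.`PERSON_DATE`}
MERGE (person)-[:OWNS]->(animal);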

Create many relationships from CSV on Neo4j is very slow, how to resolve this?

I am trying to move some RDB records into Neo4j. There is a relation table in MySQL, and I would like to turn it into relationships in Neo4j.
I created two kinds of nodes, as follows.
LOAD CSV WITH HEADERS
FROM 'file:///transactions.csv' AS line
CREATE (:Tran {
key_from: line.key
});
LOAD CSV WITH HEADERS
FROM 'file:///master.csv' AS line
CREATE (:Master {
key_to: line.key
, value: line.value
});
The import finished very quickly. Tran has 20,000 records and Master has 270,000 records. Then I tried to import the relation CSV file.
I tried two queries, but neither of them ever finished. There are 5,000 relations; some of them should match existing nodes, while the rest will not match.
Try1:
LOAD CSV WITH HEADERS
FROM 'file:///relation.csv' AS line
OPTIONAL MATCH (t:Tran { key_from: line.key_from })
OPTIONAL MATCH (m:Master { key_to: line.key_to })
CREATE (t)-[r:CONV]->(m);
Try2:
LOAD CSV WITH HEADERS
FROM 'file:///relation.csv' AS line
WITH line.key_from AS key_from,
line.key_to AS key_to
MERGE (t:Tran { key_from: key_from })
MERGE (m:Master { key_to: key_to })
MERGE (t)-[r:CONV]->(m);
Would you tell me the best practice?
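No answer is recorded for this question, but applying the same pattern as the other answers on this page, a plausible approach (a sketch, assuming key_from and key_to are the lookup keys and a reasonably recent Neo4j) is to index both keys, then MATCH the existing nodes and batch the relationship creation; rows whose keys find no node simply produce no relationship:
CREATE INDEX ON :Tran(key_from);
CREATE INDEX ON :Master(key_to);
:auto USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///relation.csv' AS line
MATCH (t:Tran { key_from: line.key_from })
MATCH (m:Master { key_to: line.key_to })
CREATE (t)-[:CONV]->(m);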

Out of memory when creating large number of relationships

I'm new to Neo4J, and I want to try it on some data I've exported from MySQL. I've got the community edition running with neo4j console, and I'm entering commands using the neo4j-shell command line client.
I have 2 CSV files, that I use to create 2 types of node, as follows:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/tmp/updates.csv" AS row
CREATE (:Update {update_id: row.id, update_type: row.update_type, customer_name: row.customer_name, .... });
CREATE INDEX ON :Update(update_id);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/tmp/facts.csv" AS row
CREATE (:Fact {update_id: row.update_id, status: row.status, ..... });
CREATE INDEX ON :Fact(update_id);
This gives me approx 650,000 Update nodes, and 21,000,000 Fact nodes.
Once the indexes are online, I try to create relationships between the nodes, as follows:
MATCH (a:Update)
WITH a
MATCH (b:Fact{update_id:a.update_id})
CREATE (b)-[:FROM]->(a)
This fails with an OutOfMemoryError. I believe this is because Neo4J does not commit the transaction until it completes, keeping it in memory.
What can I do to prevent this? I have read about USING PERIODIC COMMIT but it appears this is only useful when reading the CSV, as it doesn't work in my case:
neo4j-sh (?)$ USING PERIODIC COMMIT
> MATCH (a:Update)
> WITH a
> MATCH (b:Fact{update_id:a.update_id})
> CREATE (b)-[:FROM]->(a);
QueryExecutionKernelException: Invalid input 'M': expected whitespace, comment, an integer or LoadCSVQuery (line 2, column 1 (offset: 22))
"MATCH (a:Update)"
^
Is it possible to create relationships in this way, between large numbers of existing nodes, or do I need to take a different approach?
The Out of Memory exception is normal, as Neo4j will try to commit everything in a single transaction; since you didn't provide your settings, I assume the Java heap is set to its default (512m).
You can, however, batch the process with a kind of pagination; I would also prefer to use MERGE rather than CREATE in this case:
MATCH (a:Update)
WITH a
SKIP 0
LIMIT 50000
MATCH (b:Fact{update_id:a.update_id})
MERGE (b)-[:FROM]->(a)
Increase SKIP by 50,000 after each batch until you have covered all 650k Update nodes.
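On newer Neo4j versions (4.4 and later) the same batching can be done without manually advancing SKIP, by using CALL { ... } IN TRANSACTIONS — a sketch under that assumption:
:auto
MATCH (a:Update)
CALL {
  WITH a
  MATCH (b:Fact {update_id: a.update_id})
  MERGE (b)-[:FROM]->(a)
} IN TRANSACTIONS OF 50000 ROWS;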

Load csv merge performance

I have a performance issue with bulk insert into neo4j.
I have a csv file with 400k rows which produces about 3.5 million rows, and I use the LOAD CSV command with the latest version of neo4j.
I've noticed that when I use the CREATE statement, the load takes about 4 minutes, and without any indexes at all, about 3.5 minutes.
My first question is whether this is the normal rate of nodes/min.
Now, my real problem is that I need to use MERGE, for data integrity reasons, and when I use it, it can take as long as 24 hours, even with indexes.
So two additional questions will be:
Is LOAD CSV recommended for the best-performance load,
and also:
What can I do about this performance issue?
EDIT:
here is the query:
LOAD CSV WITH HEADERS FROM 'file:///import.csv' AS line FIELDTERMINATOR '|'
MERGE (session :Session { session:line.session })
MERGE (hit :Hit { key:line.key,date_time:line.date_time,session:line.session })
MERGE (user :User { id:line.user_id })
MERGE (session2 :Session2 { session2:line.session2 })
MERGE (country :Country{ name:line.country})
MERGE (tv :TV { name:tv.Model })
MERGE (transfer_protocol :Protocol { name:line.transfer_protocol })
MERGE (os :OS { name:line.os_name ,version:line.os_version, row_key:line.os_name+line.os_version})
Sample: session_guid|hit_key_guid|useridguid|session2_guid|PANASONIC|TCP|ANDROID|5.0
The session, user, session2, country, tv, transfer_protocol, and os nodes have unique constraints, and hit has an index.
session1 and session2 can have many hits (1 to 100, average 5).
hit_key_guid is different for each csv line.
It's running really slowly on a pretty strong machine; each 1,000 rows can take up to 10 seconds.
I also checked with the profiler, and there is no "Eager" operation.
thanks
Lior
You should share your data model, your indexes, your LOAD CSV query and also the profile output. Are you using PERIODIC commit?
Make sure that you don't run into the Eager issue, see here:
http://neo4j.com/developer/guide-import-csv/#_load_csv_for_medium_sized_datasets
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
In general, for a dataset your size LOAD CSV is OK; from around 10M rows I'd probably switch to the import tool.
It turned out that the server-side code didn't create the indexes properly; once they were created, the load ran with good performance.
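For reference, the constraints and index described in the question would look something like this in 3.x syntax (a sketch using the property names from the query above; the OS key is assumed to be row_key):
CREATE CONSTRAINT ON (s:Session) ASSERT s.session IS UNIQUE;
CREATE CONSTRAINT ON (s2:Session2) ASSERT s2.session2 IS UNIQUE;
CREATE CONSTRAINT ON (u:User) ASSERT u.id IS UNIQUE;
CREATE CONSTRAINT ON (c:Country) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (t:TV) ASSERT t.name IS UNIQUE;
CREATE CONSTRAINT ON (p:Protocol) ASSERT p.name IS UNIQUE;
CREATE CONSTRAINT ON (o:OS) ASSERT o.row_key IS UNIQUE;
CREATE INDEX ON :Hit(key);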
