I have a cypher script which loads the content from a CSV and then tries to load it into the graph. What it basically looks like is something similar to that:
LOAD CSV WITH HEADERS
FROM 'file:///data.csv'
AS row
MERGE (o:Order {Id:toInteger(row.tr_ID)})
ON CREATE SET
o.Created = toInteger(row.tr_Created)
// read credit card and create a relation to an order (if applicable)
FOREACH (n IN CASE WHEN row.cc_CrdId = 'NULL' THEN [] ELSE [1] END |
MERGE (cc:CreditCard {Id:row.cc_CrdId})
ON CREATE SET
cc.Created = toInteger(row.cc_CrdCreated)
MERGE (o)-[:WITH_CC]->(cc)
)
This script reads the file row by row. But I was wondering if it is somehow possible to execute the logic for multiple rows in parallel.
I saw this article for parallel cypher but don't really understand how I could use it. I don't know if some kind of lock needs to be implemented.
Related
I've got two csv files Job (30,000 entries) and Cat (30 entries) imported into neo4j and am trying to create a relationship between them
Each Job has a cat_ID and Cat contains the category name and ID
after executing the following
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
MATCH (job:Job {cat_ID: row.cat_ID})
MATCH (cat:category {category: row.category})
CREATE (job)-[r:under]->(cat)
it returns (no changes, no records)
I received a prompt recommending that I index the category and so using
Create INDEX ON :Job(cat_id); I did, but I still get the same error
How do I create a relationship between the two?
I am able to get this to work on a smaller dataset
You are probably trying to match on non-existing nodes. Try
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
MERGE (job:Job {cat_ID: row.cat_ID})
MERGE (cat:category {category: row.category})
CREATE (job)-[r:under]->(cat)
Have a look in your logs and see if you are running out of memory.
You could try chunking the data set up into smaller pieces with Periodic Commit and see if that helps:
:auto USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
MATCH (job:Job {cat_ID: row.cat_ID})
MATCH (cat:category {category: row.category})
CREATE (job)-[r:under]->(cat)
I have 2 csv files which I am trying to load into a Neo4j database using cypher: drivers.csv which holds every formula 1 driver and lap times.csv which stores every lap ever raced in F1.
I have managed to load in all of the nodes, although the lap times file is very large so it took quite a long time! I then tried to add relationships after, but there is so many that needs to be added that I gave up on it waiting (it was taking multiple days and still had not loaded in fully).
I’m pretty sure there is a way to load in the nodes and relationships at the same time, which would allow me to use periodic commit for the relationships which I cannot do right now. Essentially I just need to combine the 2 commands into one and after some attempts I can’t seem to work out how to do it?
// load in the lap_times.csv, changing the variable names - about half million nodes (takes 3-4 days)
PERIODIC COMMIT 25000
LOAD CSV WITH HEADERS from 'file:///lap_times.csv'
AS row
MERGE (lt: lapTimes {raceId: row.raceId, driverId: row.driverId, lap: row.lap, position: row.position, time: row.time, milliseconds: row.milliseconds})
RETURN lt;
// add a relationship between laptimes, drivers and races - takes 3-4 days
MATCH (lt:lapTimes),(d:Driver),(r:race)
WHERE lt.raceId = r.raceId AND lt.driverId = d.driverId
MERGE (d)-[rel8:LAPPING_AT]->(lt)
MERGE (r)-[rel9:TIMED_LAP]->(lt)
RETURN type(rel8), type(rel9)
Thanks in advance for any help!
You should review the documentation for indexes here:
https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/
Basically, indexes, once created, allow quick lookups of nodes of a given label, for the given property or properties. If you DON'T have an index and you do a MATCH or MERGE of a node, then for every row of that MATCH or MERGE, it has to do a label scan of all nodes of the given label and check all of their properties to find the nodes, and that becomes very expensive, especially when loading CSVs because those operations are likely happening for each row in the CSV.
For your :lapTimes nodes (though we would recommend you use singular labels in most cases), if there are none of them in your graph to start with, then a CREATE instead of a MERGE is fine. You may want a composite index on :lapTimes(raceId, driverId, lap), since that should uniquely identify the node, if you need to look it up later. Using CREATE instead of MERGE here should process much much faster.
Your second query should be MATCHing on :lapTimes nodes (label scan), and from each doing an index lookup on the :race and :driver nodes, so indexes are key here for performance.
You need indexes on: :race(raceId) and :Driver(driverId).
MATCH (lt:lapTimes)
WITH lt, lt.raceId as raceId, lt.driverId as driverId
MATCH (d:Driver), (r:race)
WHERE r.raceId = raceId AND d.driverId = driverId
MERGE (d)-[:LAPPING_AT]->(lt)
MERGE (r)-[:TIMED_LAP]->(lt)
You might consider CREATE instead of MERGE for the relationships, if you know there are no duplicate entries.
I removed your RETURN because returning the types isn't useful information.
Also, consider using consistent cases for your node labels, and that you are using the same case between the labels in your graph and the indexes you create.
Also, you would probably want to batch these changes instead of trying to process them all at once.
If you install APOC Procedures you can make use of apoc.periodic.iterate(), which can be used to batch changes, which will be faster and easier on your heap. You will still need indexes first.
CALL apoc.periodic.iterate("
MATCH (lt:lapTimes)
WITH lt, lt.raceId as raceId, lt.driverId as driverId
MATCH (d:Driver), (r:race)
WHERE r.raceId = raceId AND d.driverId = driverId
RETURN lt, d, ir",
"MERGE (d)-[:LAPPING_AT]->(lt)
MERGE (r)-[:TIMED_LAP]->(lt)", {}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
Single CSV load
If you want to handle everything all at once in a single CSV load, you can do that, but again you will need indexes first. Here's what you'll need at a minimum:
CREATE INDEX ON :Driver(driverId);
CREATE INDEX ON :Race(raceId);
After those are created, you can use this, assuming you are starting from scratch (I fixed the case of your labels and made them singular:
USING PERIODIC COMMIT 25000
LOAD CSV WITH HEADERS from 'file:///lap_times.csv' AS row
MERGE (d:Driver {driverId:row.driverId})
MERGE (r:Race {raceId:row.raceId})
CREATE (lt:LapTime {raceId: row.raceId, driverId: row.driverId, lap: row.lap, position: row.position, time: row.time, milliseconds: row.milliseconds})
CREATE (d)-[:LAPPING_AT]->(lt)
CREATE (r)-[:TIMED_LAP]->(lt)
I am importing the following to Neo4J:
categories.csv
CategoryName1
CategoryName2
CategoryName3
...
categories_relations.csv
category_parent category_child
CategoryName3 CategoryName10
CategoryName32 CategoryName41
...
Basically, categories_relations.csv shows parent-child relationships between the categories from categories.csv.
I imported the first csv file with the following query which went well and pretty quickly:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///categories.csv' as line
CREATE (:Category {name:line[0]})
Then I imported the second csv file with:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///categories_relations.csv' as line
MATCH (a:Category),(b:Category)
WHERE a.name = line[0] AND b.name = line[1]
CREATE (a)-[r:ISPARENTOF]->(b)
I have about 2 million nodes.
I tried executing the 2nd query and it is taking quite long. Can I make the query execute more quickly?
Confirm you are matching on right property. You are setting only one property for Category node i.e. name while creating
categories. But you are matching on property id in your second
query to create the relationships between categories.
For executing the 2nd query faster you can add an index on the property (here id) which you are matching Category nodes on.
CREATE INDEX ON :Category(id)
If it still takes time, You can refer my answer to Load CSV here
I am trying to load large dataset into neo4j-3 and looking for the options. I found one neo4j-import but the problem with that is it is for initial load only. I have to load 2M records around every week.
I tried loading through shell but having some performance issue, I tried following.
1) Creating constraint upfront.
2) Creating Node and relationships in separate query.
3) Heap space 8G
4) dbms.memory.pagecache 4G
Many times the import just hangs and does nothing for hours.
Edit - CSV load being executed:
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS
FROM "file:///my_sds_39_joe.csv"
AS row
OPTIONAL MATCH (per:Person {UID : "Person."+row.player_cardnum})
WHERE per IS NULL
MERGE (p:Person {CardNumber : row.player_cardnum})
ON CREATE SET p.Creation Date = timestamp(), p.Modification Date = timestamp() ;
EDIT
On a second look, seems like you're trying to implement some kind of conditional logic to your insert.
It looks like what you're trying to do is figure out if a :Person exists with a UID (derived from some concatenation with row.player_cardnum), and in the case where that :Person doesn't exist and the match fails, MERGE a :Person with the CardNumber given by row.player_cardnum.
If this is your goal, you're ALMOST there with your query. The problem is with your WHERE clause.
Understand that WHERE clauses are linked with a preceding MATCH, OPTIONAL MATCH, or WITH, and only affects the linked clause.
With that WHERE on that OPTIONAL MATCH, per will always be null, but more importantly, your row will still exist, and the following MERGE will ALWAYS take place for all rows in the CSV. This is probably the source of your slowdown, as it's creating new :Person nodes for all rows.
If you're trying to null out the row completely when the OPTIONAL MATCH hits on an existing :Person (so the MERGE won't happen in that case), you'll need to add a WITH clause, and make sure your WHERE clause is applied to it instead of the OPTIONAL MATCH.
Additionally, make sure that you have either unique constraints or indexes on Person.UID and Person.CardNumber. As for the UID match, I've heard that indexes are not used when there's some kind of string concatenation of the thing you're matching upon, so you may need to assemble it first and pass it in with a WITH.
Your final query would look like this:
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS
FROM "file:///my_sds_39_joe.csv"
AS row
// first build the UID so we can take advantage of the index
WITH row, "Person." + row.player_cardnum AS UID
OPTIONAL MATCH (per:Person {UID : UID})
// the WHERE now applies to the WITH, which will filter out and null out the row when an OPTIONAL MATCH is found
WITH row, per
WHERE per IS NULL
MERGE (p:Person {CardNumber : row.player_cardnum})
ON CREATE SET p.Creation Date = timestamp(), p.Modification Date = timestamp() ;
i have recently started with neo4j and graph databases.
I am using this Api to make the persistence of my model. I have everything done and working but my problems comes related to efficiency.
So first of all i will talk about the scenary. I have a couple of xml documents which translates to some nodes and relations between the, as i already read that this API still not support a batch insertion, i am creating the nodes and relations once a time.
This is the code i am using for creating a node:
var newEntry = new EntryNode { hash = incremento++.ToString() };
var result = client.Cypher
.Merge("(entry:EntryNode {hash: {_hash} })")
.OnCreate()
.Set("entry = {newEntry}")
.WithParams(new
{
_hash = newEntry.hash,
newEntry
})
.Return(entry => new
{
EntryNode = entry.As<Node<EntryNode>>()
});
As i get it takes time to create all the nodes, i do not understand why the time it takes to create one increments so fats. I have made some tests and am stuck at the point where creating an EntryNode the setence takes 0,2 seconds to resolve, but once it has reached 500 it has incremented to ~2 seconds.
I have also created an index on EntryNode(hash) manually on the console before inserting any data, and made test with both versions, with and without index.
Am i doing something wrong? is this time normal?
EDITED:
#Tatham
Thanks for the answer, really helped. Now i am using the foreach statement in the neo4jclient to create 1000 nodes in just 2 seconds.
On a related topic, now that i create the nodes this way i wanted to also create relationships. This is the code i am trying right now, but got some errors.
client.Cypher
.Match("(e:EntryNode)")
.Match("(p:EntryPointerNode)")
.ForEach("(n in {set} | " +
"FOREACH (e in (CASE WHEN e.hash = n.EntryHash THEN [e] END) " +
"FOREACH (p in pointers (CASE WHEN p.hash = n.PointerHash THEN [p] END) "+
"MERGE ((p)-[r:PointerToEntry]->(ee)) )))")
.WithParam("set", nodesSet)
.ExecuteWithoutResults();
What i want it to do is, given a list of pairs of strings, get the nodes (which are uniques) with the string value as the property "hash" and create a relationship between them. I have tried a couple of variants to do this query but i dont seem to find the solution.
Is this possible?
This approach is going to be very slow because you do a separate HTTP call to Neo4j for every node you are inserting. Each call is then a transaction. Finally, you are also returning the node back, which is probably a waste.
There are two options for doing this in batches instead.
From https://stackoverflow.com/a/21865110/211747, you can do something like this, where you pass in a set of objects and then FOREACH through them in Cypher. This means one, larger, HTTP call to Neo4j and then executing in a single transaction on the DB:
FOREACH (n in {set} | MERGE (c:Label {Id : n.Id}) SET c = n)
http://docs.neo4j.org/chunked/stable/query-foreach.html
The other option, coming soon, is that you will be able to write something like this in Cypher:
LOAD CSV WITH HEADERS FROM 'file://c:/temp/input.csv' AS n
MERGE (c:Label { Id : n.Id })
SET c = n
https://github.com/davidegrohmann/neo4j/blob/2.1-fix-resource-failure-load-csv/community/cypher/cypher/src/test/scala/org/neo4j/cypher/LoadCsvAcceptanceTest.scala