After playing around with toy datasets, this was my first attempt to use data that is relevant to a project at work. In short, after limping along to get nearly all of my data into Neo4j, my last query simply stalled. See the screenshot.
Note: I was prototyping my queries by pasting them into the browser tool, but my longer term plan was to keep all of the commands in a .cql file that I could script on my workstation in order to perform nightly analyses.
To add context to my problem, I am prototyping on my MacBook:
8 GB RAM
2.2 GHz Intel Core i7
OS X 10.9.5
Neo4j 2.2.0 Community
The files I am processing are listed below (rows/columns). I am not importing every column; it was just easier to keep my current datasets as they are.
Ability.csv = 3/1
brm.csv = 276992/34
cont.csv = 80093/17
email chain.csv = 199143/34 (this is the only data I can't get in)
email first last.csv = 77849/20
recs.csv = 77962/20
templates_topics.csv = 29/3
templates.csv = 49/4
topics.csv = 13/1
vendors = 5/1
The only config options I set manually for Neo4j were in neo4j-wrapper.conf, where I set wrapper.java.initmemory and wrapper.java.maxmemory to 4096. I did this after reading about similar problems.
I made these changes out of the gate because, within the browser, I was getting error messages that the database was disconnected while processing my queries.
Lastly, because my data are work-related, I can't provide test data. I can, however, link to my cypher queries.
Constraint and LOAD CSV .cql file
Any help and advice would be greatly appreciated. I am pretty confident this is user error on my end, but I have definitely hit a wall with respect to what my next steps should be.
Avoid the Eager operator in LOAD CSV queries; when the plan contains it, PERIODIC COMMIT is not respected. See this article by Mark Needham for a thorough explanation.
I would split this one up: create the nodes in one pass first, then create each relationship in a second pass (see the sketch after the query plan below):
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///Users/btibert/Dropbox/Projects/bentley-search-neo4j/data/templates.csv" AS row
WITH row
MATCH (r:Vendor {name:row.vendor})
WITH row, r
MERGE (p:Template {name:row.template_clean})
MERGE (v:Version {version:row.template_ver})
MERGE (p)-[:FROM_VERSION]->(v)
MERGE (p)-[:CREATED_BY]->(r);
You can clearly see the Eager operation in the plan below.
It doesn't matter much if you only have a few thousand rows, but once you get towards many hundred thousand or millions of rows, pulling all the data in eagerly takes much more memory.
+----------------+------------------------------------+------------------------------------------------------------------------------------------------+
| Operator | Identifiers | Other |
+----------------+------------------------------------+------------------------------------------------------------------------------------------------+
| EmptyResult | | |
| UpdateGraph(0) | anon[270], anon[301], p, r, row, v | MergePattern |
| UpdateGraph(1) | anon[270], p, r, row, v | MergePattern |
| UpdateGraph(2) | p, r, row, v | MergeNode; row.template_clean; :Template(name); MergeNode; row.template_ver; :Version(version) |
| Eager | r, row | |
| SchemaIndex | r, row | row.vendor; :Vendor(name) |
| LoadCSV | row | |
+----------------+------------------------------------+------------------------------------------------------------------------------------------------+
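A split-up version might look something like this (just a sketch, reusing the file path and properties from the query above; if the relationship pass still shows an Eager operator, each relationship MERGE can go into its own statement, as in the recs.csv examples below):

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///Users/btibert/Dropbox/Projects/bentley-search-neo4j/data/templates.csv" AS row
MERGE (:Template {name: row.template_clean})
MERGE (:Version {version: row.template_ver});

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///Users/btibert/Dropbox/Projects/bentley-search-neo4j/data/templates.csv" AS row
MATCH (r:Vendor {name: row.vendor})
MATCH (p:Template {name: row.template_clean})
MATCH (v:Version {version: row.template_ver})
MERGE (p)-[:FROM_VERSION]->(v)
MERGE (p)-[:CREATED_BY]->(r);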
I would probably change this into an ON CREATE SET variant for the non-key properties:
Also, if you have multiple rows per student, you can use WITH DISTINCT toInt(row.pidm) as pidm, .... to reduce the number of merges it has to run (a sketch follows the next query).
LOAD CSV WITH HEADERS FROM "recs.csv" AS row
WITH row
MERGE (s:Student {pidm:toInt(row.pidm)})
ON CREATE SET s.hash_pidm=toInt(row.hash_pidm), ....;
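A sketch of that DISTINCT variant, using only the pidm and hash_pidm columns that appear above (any other non-key columns would be carried through the WITH and set the same way):

LOAD CSV WITH HEADERS FROM "recs.csv" AS row
WITH DISTINCT toInt(row.pidm) AS pidm, toInt(row.hash_pidm) AS hash_pidm
MERGE (s:Student {pidm: pidm})
ON CREATE SET s.hash_pidm = hash_pidm;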
This one I'd split up into two statements, one for each relationship, otherwise you might get too many matches:
(And you don't need the WITHs in between)
LOAD CSV WITH HEADERS FROM "...recs.csv" AS row
WITH row
MATCH (s:Student {pidm: toInt(row.pidm)} )
MATCH (v:Vendor {name: row.vendor} )
MATCH (a:Ability {name: row.ability} )
WITH row, s, v, a
MERGE (s)-[:PURCHASED_FROM]->(v)
MERGE (s)-[:HAS_ABILITY]->(a);
would become:
LOAD CSV WITH HEADERS FROM "...recs.csv" AS row
MATCH (s:Student {pidm: toInt(row.pidm)} )
MATCH (v:Vendor {name: row.vendor} )
MERGE (s)-[:PURCHASED_FROM]->(v);
LOAD CSV WITH HEADERS FROM "...recs.csv" AS row
MATCH (s:Student {pidm: toInt(row.pidm)} )
MATCH (a:Ability {name: row.ability} )
MERGE (s)-[:HAS_ABILITY]->(a);
Here I would also create the contacts on their own first (again with ON CREATE SET), and create the student relationship in a separate statement:
LOAD CSV WITH HEADERS FROM "....cont.csv" AS row
MERGE (c:Contact {cid:row.cid}) ON CREATE SET ....;
LOAD CSV WITH HEADERS FROM "...cont.csv" AS row
MATCH (s:Student {pidm:toInt(row.pidm)} )
MATCH (c:Contact {cid:row.cid})
MERGE (s)-[:HAS_CONTACT]->(c);
I would also split this one up into two statements:
LOAD CSV WITH HEADERS FROM "...cont.csv" AS row
WITH row WHERE toInt(row.seqnum) = 1
MATCH (s:Student {pidm:toInt(row.pidm)})
MATCH (f:Contact {cid:row.first_cont})
MERGE (s)-[:FIRST]->(f);
LOAD CSV WITH HEADERS FROM "...cont.csv" AS row
WITH row WHERE toInt(row.seqnum) = 1
MATCH (s:Student {pidm:toInt(row.pidm)})
MATCH (l:Contact {cid:row.last_cont})
MERGE (s)-[:LAST]->(l);
Split this one up into email creation and then, later, connecting the email to the student by msgid:
LOAD CSV WITH HEADERS FROM "...brm.csv" AS row
MERGE (e:Email {msgid:row.msgid}) ON CREATE SET ... ;
LOAD CSV WITH HEADERS FROM "file:///Users/btibert/Dropbox/Projects/bentley-search-neo4j/data/brm.csv" AS row
MATCH (s:Student {pidm:toInt(row.pidm)})
MATCH (e:Email {msgid:row.msgid})
MERGE (s)-[:WAS_SENT]->(e);
HTH Michael
Related
Assume we have a CSV describing nodes connected by various relationship types. Is there a way to load the CSV in one query so that each relationship type becomes the actual relationship type name, without breaking the CSV up into separate files (one per relationship type)? (We don't want to add the relationship type as a property on the edge.)
Id1 | Id2 | RelationshipType
1 | 2 | type1
1 | 3 | type2
2 | 3 | type1
...
We want to later display and query the data with queries similar to the ones below:
MATCH l=(p:Id1) - [:type1] - (p:Id2) RETURN l;
MATCH l=(p:Id1) - [:type2] - (p:Id2) RETURN l;
You can do it using the APOC Procedure apoc.create.relationship.
Considering this CSV file:
Id1|Id2|RelationshipType
1|2|type1
1|3|type2
2|3|type1
The LOAD CSV query will be:
LOAD CSV WITH HEADERS FROM 'file:///sample.csv' AS line FIELDTERMINATOR '|'
WITH line
MERGE(node0:Node {id : line.Id1})
MERGE(node1:Node {id : line.Id2})
WITH node0, node1, line
CALL apoc.create.relationship(node0, line.RelationshipType, {}, node1) YIELD rel
RETURN *
The resulting graph will contain the nodes from the CSV connected by relationships whose types come from the RelationshipType column.
Note: remember to install the APOC procedures matching the version of Neo4j you are using. Take a look at the version compatibility matrix.
I am trying to load 500000 nodes, but the query does not execute successfully. Can anyone tell me the limit on the number of nodes in a Neo4j Community Edition database?
I am running this query:
result = session.run("""
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///relationships.csv" AS row
merge (s:Start {ac:row.START})
on create set s.START=row.START
merge (e:End {en:row.END})
on create set e.END=row.END
FOREACH (_ in CASE row.TYPE WHEN "PAID" then [1] else [] end |
MERGE (s)-[:PAID {cr:row.CREDIT}]->(e))
FOREACH (_ in CASE row.TYPE WHEN "UNPAID" then [1] else [] end |
MERGE (s)-[:UNPAID {db:row.DEBIT}]->(e))
RETURN s.START as index, count(e) as connections
order by connections desc
""")
I don't think the community edition is more limited than the enterprise edition in that regard, and most of the limits have been removed in 3.0.
Anyway, I can easily create a million nodes (in one transaction):
neo4j-sh (?)$ unwind range(1, 1000000) as i create (n:Node) return count(n);
+----------+
| count(n) |
+----------+
| 1000000 |
+----------+
1 row
Nodes created: 1000000
Labels added: 1000000
3495 ms
Running that 10 times, I've definitely created 10 million nodes:
neo4j-sh (?)$ match (n) return count(n);
+----------+
| count(n) |
+----------+
| 10000000 |
+----------+
1 row
3 ms
Your problem is most likely related to the size of the transaction: if it's too large, it can result in an OutOfMemory error, and before that it can slow the instance to a crawl because of all the garbage collection. Split the node creation into smaller batches, e.g. with USING PERIODIC COMMIT if you use LOAD CSV.
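For example, a minimal sketch with LOAD CSV (the file name and column are made up for illustration):

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///nodes.csv" AS row
CREATE (:Node {id: row.id});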
Update:
Your query already includes USING PERIODIC COMMIT and only creates 2 nodes and 1 relationship per line of the CSV file, so it most likely has more to do with the performance of the query itself than with the size of the transaction.
You have Start nodes with 2 properties set to the same value from the CSV (ac and START), and End nodes also with 2 properties set to the same value (en and END). Is there a uniqueness constraint on the property used for the MERGE? Without it, as nodes are created, processing each line takes longer and longer because it has to scan all the existing nodes with the wanted label (an O(n^2) algorithm overall, which is pretty bad for 500K nodes).
CREATE CONSTRAINT ON (n:Start) ASSERT n.ac IS UNIQUE;
CREATE CONSTRAINT ON (n:End) ASSERT n.en IS UNIQUE;
That's probably the main improvement to apply.
However, do you actually need to MERGE the relationships (instead of CREATE)? Either the CSV contains a snapshot of the current credit relationships between all Start and End nodes (in which case there's a single relationship per pair), or it contains all transactions and there's no real reason to merge those for the same amount.
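If the CSV contains individual transactions, a variant that keeps the node MERGEs but simply CREATEs the relationships (and drops the aggregated RETURN, see below) might look like this sketch:

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///relationships.csv" AS row
MERGE (s:Start {ac: row.START})
MERGE (e:End {en: row.END})
FOREACH (_ IN CASE row.TYPE WHEN "PAID" THEN [1] ELSE [] END |
  CREATE (s)-[:PAID {cr: row.CREDIT}]->(e))
FOREACH (_ IN CASE row.TYPE WHEN "UNPAID" THEN [1] ELSE [] END |
  CREATE (s)-[:UNPAID {db: row.DEBIT}]->(e));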
Finally, do you actually need to report the sorted, aggregated result from that loading query? It requires more memory and could be split into a separate query, after the loading has succeeded.
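For example, the report could be run afterwards as a standalone query, along these lines (using the properties from the query above):

MATCH (s:Start)-->(e:End)
RETURN s.START AS index, count(e) AS connections
ORDER BY connections DESC;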
I have a CSV file containing activities (a process graph):
startActivityId,Name,endActivityId
1,A,2
2,B,3
3,C,4
4,D,5
so that it will look like this: A->B->C->D
I imported the CSV file into the Neo4j server successfully using this Cypher query:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:path/graph/activity.csv" AS row
CREATE (:Activity {startactivityId:row.startActivityId, Name: row.Name, endActivityId: row.endActivityId});
I then created an index on startActivityId:
CREATE INDEX ON :activity(startActivityId);
Then I wanted to create the relationships between these nodes, so I tried this Cypher query:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:path/graph/activity.csv" AS row
MATCH (startActivity:Activity {startActivityId: row.startActivityId})
MATCH (endActivity:Activity {startActivityId: row.endActivityId})
MERGE (startActivity)-[:LINKS_TO]->(endActivity);
But no relationships are created; nothing happens.
I'm sure I missed something, because I'm new to Cypher, but I can't figure it out.
Any ideas?
I copied your updated csv (and removed the whitespace at the head of the first column) and ran your queries.
neo4j-sh (?)$ USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "file:///Users/jonatan/src/doc/stackexchange/32225817.pdc" as row CREATE (:Activity {startActivityId:row.startActivityId, name:row.Name, endActivityId:row.endActivityId});
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 4
Properties set: 12
Labels added: 4
115 ms
neo4j-sh (?)$ USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "file:///Users/jonatan/src/doc/stackexchange/32225817.pdc" as row MATCH (s:Activity {startActivityId:row.startActivityId}) MATCH (e:Activity {startActivityId:row.endActivityId}) MERGE (s)-[r:LINKS_TO]->(e) RETURN r;
+-------------------+
| r |
+-------------------+
| :LINKS_TO[2084]{} |
| :LINKS_TO[2085]{} |
| :LINKS_TO[2086]{} |
+-------------------+
3 rows
Relationships created: 3
178 ms
Three relationships created. To confirm that they are the right relationships I match and return the path (:Activity)-[:LINKS_TO]->().
neo4j-sh (?)$ MATCH p=(:Activity)-[:LINKS_TO]->() RETURN p;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| p |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| [Node[1415]{name:"A",startActivityId:"1",endActivityId:"2"},:LINKS_TO[2084]{},Node[1416]{name:"B",startActivityId:"2",endActivityId:"3"}] |
| [Node[1416]{name:"B",startActivityId:"2",endActivityId:"3"},:LINKS_TO[2085]{},Node[1417]{name:"C",startActivityId:"3",endActivityId:"4"}] |
| [Node[1417]{name:"C",startActivityId:"3",endActivityId:"4"},:LINKS_TO[2086]{},Node[1418]{name:"D",startActivityId:"4",endActivityId:"5"}] |
+-------------------------------------------------------------------------------------------------------------------------------------------+
3 rows
49 ms
neo4j-sh (?)$
It looks OK to me, not sure what's not working for you.
What does MATCH p=(:Activity)-[r]->() RETURN p; tell you?
I'm having issues importing a large volume of data into a Neo4j instance using the Cypher LOAD CSV command. I'm attempting to load roughly 253k user records, each with a unique user_id. My first step was to add a unique constraint on the label to make sure each user would only be created once:
CREATE CONSTRAINT ON (b:User) ASSERT b.user_id IS UNIQUE;
I then tried to run LOAD CSV with periodic commits to pull this data in.
This query failed, so I tried to MERGE the User record before setting the other properties:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_users.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id)})-[:REGISTERED_TO]->(t)
set p.created=toInt(line.created), p.completed=toInt(line.completed);
Modifying the periodic commit value has made no difference; the same error is returned.
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})-[:REGISTERED_TO]->(t);
I receive the following error:
LoadCsvStatusWrapCypherException: Node 9752 already exists with label Person and property "hpcm_uk_buddy_id"=[2446] (Failure when processing URL 'file:/home/data/uk_buddies.csv' on line 253316 (which is the last row in the file). Possibly the last row committed during import is line 253299. Note that this information might not be accurate.)
The numbers seem to match up roughly; the CSV file contains 253315 records in total. The periodic commit doesn't seem to have taken effect either: a count of nodes returns only 5446.
neo4j-sh (?)$ match (n) return count(n);
+----------+
| count(n) |
+----------+
| 5446 |
+----------+
1 row
768 ms
I can understand the number of nodes being incorrect if this ID is only roughly 5000 rows into the CSV file. But is there any technique or command I can use to make this import succeed?
You're falling victim to a common mistake with MERGE, I think. Seriously, this would be in my top 10 FAQ about common problems with Cypher. See, you're doing this:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})-[:REGISTERED_TO]->(t);
The way MERGE works, that last MERGE matches on the entire pattern, not just on the user node. So you're probably creating duplicate users that you shouldn't be. When you run this MERGE, even if a user with those exact properties already exists, the relationship to the t node doesn't, so it attempts to create a new user node with those attributes to connect to t, which isn't what you want.
The solution is to merge the user individually, then separately merge the relationship path, like this:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})
merge (p)-[:REGISTERED_TO]->(t);
Note the two merges at the end. One creates just the user. If the user already exists, it won't try to create a duplicate, and you should hopefully be OK with your constraint (assuming there aren't two users with the same user_id, but different created values). After you've merged just the user, then you merge the relationship.
The net result of the second query is the same, but it shouldn't create duplicate users.
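If the same user_id could appear with different created or completed values, a variant that merges only on the key and sets the other properties on creation might be safer still (a sketch reusing the file and columns from the query above):

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///home/data/uk_buddies.csv" AS line
MATCH (t:Territory {territory: "uk"})
MERGE (p:User {user_id: toInt(line.user_id)})
ON CREATE SET p.created = toInt(line.created), p.completed = toInt(line.completed)
MERGE (p)-[:REGISTERED_TO]->(t);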
I am trying to get a CSV into Neo4j. As it consists of log entries, I'd like to connect nodes with a NEXT relationship when the corresponding log entries were created at consecutive times.
LOAD CSV WITH HEADERS FROM 'http://localhost/Export.csv' AS line
CREATE (:Entry { date: line[0], ...})
MATCH (n)
RETURN n
ORDER BY n:date
MATCH (a:Entry),(b:Entry),(c:Entry)
WITH p AS min(b:date)
WHERE a:date < b:date AND c.date = p
CREATE (a)-[r:NEXT]->(c)
The last four lines do not work, however. What I'm trying to do is get the earliest entry 'c' out of the group of entries 'b' that have a larger timestamp than 'a'. Can anyone help me out here?
Not sure if I understood your question correctly: you have a CSV file containing log records with a timestamp, one record per line, and you want to interconnect the events to form a linked list based on the timestamp?
In this case I'd split up the process into two steps:
1. using LOAD CSV, create a node with a date property for each line (a sketch for this step follows the statement below)
2. afterwards, connect the entries using e.g. a Cypher statement like this:
// collect all entries ordered by date
MATCH (e:Entry)
WITH e ORDER BY e.date DESC
WITH collect(e) as entries
// for each adjacent pair in the ordered collection, the inner FOREACHes
// bind e1 and e2 to single elements so MERGE can refer to them
FOREACH(i in RANGE(0, length(entries)-2) |
  FOREACH(e1 in [entries[i]] |
    FOREACH(e2 in [entries[i+1]] |
      MERGE (e1)-[:NEXT]->(e2))))
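For the first step, a minimal sketch (assuming the CSV has a header row with a date column; adjust the column reference to whatever the file actually contains):

LOAD CSV WITH HEADERS FROM 'http://localhost/Export.csv' AS line
CREATE (:Entry {date: line.date});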