We are trying to load millions of nodes and relationships into Neo4j. We are currently using the command below:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:customers.csv" AS row
CREATE (:Customer ....
But it is taking us a lot of time.
I did find a link that explains modifying the Neo4j store files directly:
http://blog.xebia.com/combining-neo4j-and-hadoop-part-ii/
But that link seems to be very old, and I wanted to know whether that process is still valid.
There is also an issue on the "neo4j-spark-connector" GitHub repository, which has not been fully updated:
https://github.com/neo4j-contrib/neo4j-spark-connector/issues/15
What is the best way among these?
The fastest way, especially for large data sets, should be through the import tool instead of via Cypher with LOAD CSV.
If you are using LOAD CSV, potentially with MERGE, I highly recommend adding unique constraints - for us it sped up a smallish import (100k nodes) by a factor of 100 or so.
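For reference, here is a rough sketch of what the import-tool route can look like (it runs offline against an empty store, the CSVs need header columns such as :ID / :START_ID / :END_ID, and the relationships file and type here are placeholders; in newer 3.x releases the tool is invoked as neo4j-admin import):
bin/neo4j-import --into data/graph.db --id-type string \
    --nodes:Customer customers.csv \
    --relationships:PURCHASED purchases.csv
And if you stay with LOAD CSV + MERGE, the unique constraint could look like this (the id property is a placeholder for whatever key you merge on):
CREATE CONSTRAINT ON (c:Customer) ASSERT c.id IS UNIQUE;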
You can make use of APOC procedures, which can perform better for large datasets. Below is a sample Cypher query:
CALL apoc.periodic.iterate(
  'CALL apoc.load.csv(file_path) YIELD lineNo, map AS row, list RETURN row',
  'MATCH (post:Post {id: row.`:END_ID(Post)`})
   MATCH (owner:User {id: row.`:START_ID(User)`})
   MERGE (owner)-[:ASKED]->(post)',
  {batchSize: 500, iterateList: true, parallel: true}
);
Below is the documentation link :
https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_examples_for_apoc_load_csv
Related
I'm doing some work with my university, and I've been asked to create a system that builds complete trees with millions of nodes (1 or 2 million at least).
I was trying to create the tree with LOAD CSV using a periodic commit, and it worked well for the creation of just the nodes (70,000 ms on a general-purpose notebook :P). When I tried the same with the edges, it didn't scale as well.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Archi.csv' AS line
MERGE (:Vertex {name: line.from})<-[:EDGE {attr1: toFloat(line.attr1), attr2: toFloat(line.attr2), attr3: toFloat(line.attr3), attr4: toFloat(line.attr4), attr5: toFloat(line.attr5)}]-(:Vertex {name: line.to})
I need to guarantee that a tree is generated in no more than 5 minutes.
Is there a faster method that can deliver that kind of performance?
P.S.: The task doesn't require Neo4j, just a database (either SQL or NoSQL), but I found this NoSQL graph DB and thought it would be nice to implement with Neo4j, since the graph data structure comes for free.
P.P.S.: I'm using Cypher.
I think you should read up on MERGE in the developer documentation again, to make sure you understand exactly what it's doing.
A few things in particular to be aware of...
If the pattern you are merging does not exist, all elements of the pattern will be created, which could result in duplicate :Vertex nodes. If your :Vertex nodes are supposed to be in the database already, if there are no relationships yet, and if you are sure that no relationship repeats itself in your CSV, I strongly urge you to MATCH on the start and end nodes and then CREATE the relationship between them instead of using MERGE.
Remember that a MERGE on a relationship with many attributes will try to match on all of them first, so as the number of relationships between nodes grows, there will be an increasing number of comparisons, which will slow your query down further. CREATE is the better choice if you know that no relationship will be duplicated and you are sure those relationships don't exist yet.
I also urge you to create an index on :Vertex(name), as that will significantly help matching on end nodes.
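For reference, here is roughly how the relationship load could look with MATCH + CREATE, assuming the :Vertex nodes were created in an earlier pass and every CSV line is a distinct edge (the file and property names are taken from your query):
CREATE INDEX ON :Vertex(name);

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Archi.csv' AS line
MATCH (a:Vertex {name: line.from})
MATCH (b:Vertex {name: line.to})
CREATE (a)<-[:EDGE {attr1: toFloat(line.attr1), attr2: toFloat(line.attr2), attr3: toFloat(line.attr3), attr4: toFloat(line.attr4), attr5: toFloat(line.attr5)}]-(b);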
I have a question related to Neo4j.
When I create around 40,000 nodes and around 20,000 relationships, which is certainly less than 15 MB of data, and I then run the query
match (n) optional match (n)-[r]-() return n, r
it starts to load and, after waiting a long time, returns nothing in graphical form. The result pane shows how many nodes and relationships I have, but no graph. I want to see the complete graph of my data; is there any way to visualize what it looks like? When I limit the query to 800 it works.
Is there anything I need to change in the settings or in my system memory?
Any suggestions?
The web console isn't very good beyond the scale of a few hundred nodes. I'd suggest looking at Gephi:
http://gephi.github.io/
Alternatively you could use Linkurious, an online tool:
https://linkurio.us/
If you want to roll your own there are a number of choices out there. I like Sigma.js:
http://sigmajs.org/
Linkurious also has a library based on Sigma:
https://github.com/Linkurious/linkurious.js
EDIT: http://keylines.com/ is another online service like Linkurious
I am importing around 12 million nodes and 13 million relationships.
First I used the CSV import with periodic commit 50000 and divided the data into different chunks, but it is still taking too much time.
Then I looked at the batch insertion method, but for that I would have to create new data sets in an Excel sheet.
Basically I am importing the data from SQL Server: first I save the data as CSV, then import it into Neo4j.
Also, I am using the Neo4j Community edition. I changed all the configuration properties I found recommended on Stack Overflow, but still, while the import initially goes fast with periodic commit 50K, after about 1 million rows it takes too much time.
Is there any way to import this data directly from SQL in a short span of time, given that Neo4j is famous for working fast with big data? Any suggestions or help?
Here is the LOAD CSV I used (there is an index on :Number(num)):
USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv"
AS csvLine FIELDTERMINATOR ';'
MERGE (Numbers:Number {num: csvLine.Numbers}) RETURN *;

USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv"
AS csvLine FIELDTERMINATOR ';'
MERGE (TermNum:Number {num: csvLine.TermNum})
MERGE (OrigNum:Number {num: csvLine.OrigNum})
MERGE (OrigNum)-[r:CALLS]->(TermNum) RETURN *;
How long is it taking?
To give you a reference, my db is about 4m nodes, 650,000 unique relationships, and ~10m-15m properties (not as large, but it should provide an idea). It takes me less than 10 minutes to load in the nodes file and set multiple labels, and then load in the relationships file and set the relationships (all via LOAD CSV). This is also being done on a souped-up computer, but if yours is taking hours, I would make some tweaks.
My suggestions are as follows:
Are you intentionally returning the nodes after the MERGE? I can't imagine you are doing anything with them; either way, consider removing the RETURN *. With RETURN *, you're returning all nodes, relationships, and paths found in the query, and that's bound to slow things down. (http://neo4j.com/docs/stable/query-return.html#return-return-all-elements)
Is the "num" field meant to be unique? If so, consider adding the following constraints (NOTE: this will also create the index, so no need to create it separately). I think this might speed up the MERGE (I'm not sure on that), though see next point.
CREATE CONSTRAINT ON (Numbers:Number) ASSERT Numbers.num IS UNIQUE;
If the num field is unique AND this is a brand-new database (i.e. you're starting from scratch when you run this script), then use CREATE rather than MERGE to create the nodes (for the creation of the nodes only).
As Christophe already mentioned, you should definitely increase the heap size to around 4g.
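Putting those suggestions together, the node-loading statement could look roughly like this (just a sketch reusing the file and property names from your question; it assumes num really is unique and that the database starts empty):
CREATE CONSTRAINT ON (n:Number) ASSERT n.num IS UNIQUE;

USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv"
AS csvLine FIELDTERMINATOR ';'
CREATE (:Number {num: csvLine.Numbers});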
Let us know how it goes!
EDIT 1
I have not been able to find much relevant information on memory/performance tuning for the Windows version. What I have found leaves me with a couple of questions, and is potentially outdated.
This is potentially outdated, but provides some background on some of the different settings and the differences between Windows and Linux.
http://blog.bruggen.com/2014/02/some-neo4j-import-tweaks-what-and-where.html
Those differences between Windows & Linux have themselves changed from one version to the next (as demonstrated with the following links)
Cypher MATCH query speed,
https://stackoverflow.com/a/29055966/4471711
Michael's response above seems to indicate that if you're NOT running a Java application with Neo4j, you don't need to worry about the heap (-Xmx). However, that doesn't seem right to me given the other information I've seen, but perhaps all of that other info predates 2.2.
I have also been through this.
http://neo4j.com/docs/stable/configuration.html
So, what I have done is set both heap (-Xmx in the neo4j.vmoptions) and the pagecache to 32g.
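For what it's worth, in a 3.x-style conf/neo4j.conf those two knobs look roughly like this (just a sketch; on 2.x installs the equivalents live in neo4j-wrapper.conf / neo4j.properties instead, and 32g is simply the value I happened to use):
dbms.memory.heap.initial_size=32g
dbms.memory.heap.max_size=32g
dbms.memory.pagecache.size=32g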
Can you modify your heap settings to 4096 MB?
Also, in the second LOAD CSV, are the numbers used for the first two MERGE statements already in the database? If yes, use MATCH instead.
I would also commit at a level of 10000.
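For example, if all the numbers have already been loaded, the second statement from the question could be rewritten roughly like this (just a sketch; it keeps the MERGE only on the relationship):
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv"
AS csvLine FIELDTERMINATOR ';'
MATCH (TermNum:Number {num: csvLine.TermNum})
MATCH (OrigNum:Number {num: csvLine.OrigNum})
MERGE (OrigNum)-[:CALLS]->(TermNum);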
I'm evaluating Neo4j Community 2.1.3 to store a list of concepts and the relationships between them. I'm trying to load my sample test data (CSV files) into Neo4j using Cypher from the web interface, as described in the online manual.
My data looks something like this:
concepts.csv
id,concept
1,tree
2,apple
3,grapes
4,fruit salad
5,motor vehicle
6,internal combustion engine
relationships.csv
sourceid,targetid
2,1
4,2
4,3
5,6
6,5
And so on... For my sample, I have ~17K concepts and ~16M relationships. Following the manual, I started Neo4J server, and entered this into Cypher:
LOAD CSV WITH HEADERS FROM "file:///data/concepts.csv" AS csvLine
CREATE (c:Concept { id: csvLine.id, concept: csvLine.concept })
This worked fine and loaded my concepts. Then I tried to load my relationships.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///data/relationships.csv" AS csvLine
MATCH (c1:Concept { id: csvLine.sourceid }),(c2:Concept { id: csvLine.targetid })
CREATE (c1)-[:RELATED_TO]->(c2)
This would run for an hour or so, but always stopped with either:
"Unknown error" (no other info!), or
"Neo.TransientError.Transaction.DeadlockDetected" with a detailed message like
"LockClient[695] can't wait on resource RWLock[RELATIONSHIP(572801), hash=267423386] since => LockClient[695] <-[:HELD_BY]- RWLock[NODE(4145), hash=1224203266] <-[:WAITING_FOR]- LockClient[691] <-[:HELD_BY]- RWLock[RELATIONSHIP(572801), hash=267423386]"
It would stop after loading maybe 200-300K relationships. I've done a "sort | uniq" on the relationships.csv so I'm pretty sure there are no duplicates. I looked at the log files in data/log but found no error message.
Has anyone seen this before? BTW, I don't mind losing a small portion of the relationships, so I'll be happy if I can just turn off ACID transactions. I also want to avoid writing code (to use the Java API) at this stage. I just want to load up my data to try it out. Is there any way to do this?
My full data set will have millions of concepts and maybe hundreds of millions of relationships. Does anyone know if Neo4J can handle this amount of data?
Thank you.
You're doing it correctly.
Do you use the neo4j-shell or the browser?
Did you do: create index on :Concept(id);?
If you don't have an index, searching for the concepts will take dramatically longer as the graph grows, because it has to scan all nodes with this label for this id value. You should also check, by prefixing your query with PROFILE, whether it uses an index for both matches.
Never seen that deadlock before despite importing millions of relationships.
Can you share the full stack trace? If you use shell, you might want to do export STACKTRACES=true
Can you use USING PERIODIC COMMIT 1000 ?
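Concretely, that would look something like this (just a sketch reusing the file, label, and property names from your question; create the index first and let it come online before running the relationship load):
CREATE INDEX ON :Concept(id);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///data/relationships.csv" AS csvLine
MATCH (c1:Concept { id: csvLine.sourceid })
MATCH (c2:Concept { id: csvLine.targetid })
CREATE (c1)-[:RELATED_TO]->(c2);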
I'm loading a Neo4j database using Cypher commands piped directly into the neo4j-shell. Some experiments suggest that subgraph batches of about 1000 lines give the optimal throughput (about 3.2ms/line, 300 lines/sec (slow!), Neo4j 2.0.1). I use MATCH statements to bind existing nodes to the loading subgraph. Here's a chopped example:
begin
...
MATCH (domain75ea8a4da9d65189999d895f536acfa5:SubDomain { shorturl: "threeboysandanoldlady.blogspot.com" })
MATCH (domainf47c8afacb0346a5d7c4b8b0e968bb74:SubDomain { shorturl: "myweeview.com" })
MATCH (domainf431704fab917205a54b2477d00a3511:SubDomain { shorturl: "www.computershopper.com" })
CREATE
(article1641203:Article { id: "1641203", url: "http://www.coolsocial.net/sites/www/blackhawknetwork.com.html", type: 4, timestamp: 1342549270, datetime: "2012-07-17 18:21:10"}),
(article1641203)-[:PUBLISHED_IN]->(domaina9b3ed6f4bc801731351b913dfc3f35a),(author104675)-[:WROTE]->(article1641203),
....
commit
Using this (ridiculously slow) method, it takes several hours to load 200K nodes (~370K relationships), and at that point the loading slows down even more. I presume the asymptotic slowdown is due to the overhead of the MATCH statements; they make up half of the subgraph load statements by the time the graph hits 200K nodes. There's got to be a better way of doing this; it just doesn't scale.
I'm going to try rewriting the statements with parameters (refs: What is the most efficient way to insert nodes into a neo4j database using cypher AND http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/). I expect that to help, but it seems that I will still have problems making the subgraph connections. Would using MERGE or CREATE UNIQUE instead of the MATCH statements be the way to go? There must be best practice ways to do this that I'm missing. Any other speed-up ideas?
many thanks
Use MERGE, and do smaller transactions--I've found best results with batches of 50-100 (while doing index lookups). Bigger batches are better when doing CREATE only without MATCH. Also, I recommend using a driver to send your commands over the transactional API (with parameters) instead of via neo4j-shell--it tends to be a fair bit faster.
Alternatively (this might not be applicable to all use cases), keep a local "index" of the node IDs you've created. For only 200k items, this should easily fit in a normal map/dict of string->long. This will save you from taxing the index on the db, and you can do only node-ID-based lookups and CREATE statements, and create the indexes later.
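As an illustration of the parameterised route, one common pattern is to send each batch of rows as a single parameter and UNWIND it server-side (just a sketch; it assumes a Neo4j version with UNWIND, i.e. 2.1+, and the row property names here are illustrative, not taken from your data):
// one statement per batch, sent over the transactional endpoint with
// parameters like {"rows": [{"articleId": "1641203", "url": "...", "domain": "myweeview.com"}, ...]}
UNWIND {rows} AS row
MATCH (d:SubDomain {shorturl: row.domain})
MERGE (a:Article {id: row.articleId})
  ON CREATE SET a.url = row.url
MERGE (a)-[:PUBLISHED_IN]->(d)
(In 3.x and later the parameter is written $rows instead of {rows}.)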
The load2neo plugin worked well for me. Installation was fast and painless, and it has a very Cypher-like command structure that easily supports uniqueness requirements. It works with Neo4j 2.0 labels.
load2neo install + curl usage example:
http://nigelsmall.com/load2neo
load2neo Geoff syntax:
http://nigelsmall.com/geoff
It is much faster (>>10x) than using Cypher via neo4j-shell.
I wasn't able to get parameters in Cypher working through neo4j-shell, despite trying everything I could find via internet search.