Import of large dataset in Neo4j taking really long (>12 hours) with Neo4j import tool - neo4j

I have a large dataset (about 1B nodes and a few billion relationships) that I am trying to import into Neo4j. I am using the Neo4j import tool. The nodes finished importing in an hour, however since then, the importer is stuck in a node index preparation phase (unless I am reading the output below incorrectly) for over 12 hours now.
...
Available memory:
Free machine memory: 184.49 GB
Max heap memory : 26.52 GB
Nodes
[>:23.39 MB/s---|PROPERTIE|NODE:|LAB|*v:37.18 MB/s---------------------------------------------] 1B
Done in 1h 7m 18s 54ms
Prepare node index
[*SORT:11.52 GB--------------------------------------------------------------------------------]881M
...
My question is how can I speed this up? I am thinking the following:
1. Split up the import command for nodes and relationships and do the nodes import.
2. Create indexes on the nodes
3. Do merge/match to get rid of dupes
4. Do rels import.
Will this help? Is there something else I should try? Is the heap size too large (I think not, but would like an opinion)?
Thanks.
UPDATE
I also tried importing exactly half that data on the same machine and it gets stuck again in that phase at roughly the same amount of time (proportionally). So I have mostly eliminated disk space and memory as an issue.
I have also checked my headers (since I noticed that other people ran into this problem when they had incorrect headers) and they seem correct to me. Any suggestions on what I else should be looking at?
FURTHER UPDATE
Ok so now it is getting kind of ridiculous. I reduced my data size down to just one large file (about 3G). It only contains nodes of a single kind and only has ids. So the data looks something like this
1|Author
2|Author
3|Author
and the header (in a separate file) looks like this
authorID:ID(Author)|:LABEL
And my import still gets stuck in the sort phase. I am pretty sure I am doing something wrong here. But I really have no clue what. Here is my command line to invoke this
/var/lib/neo4j/bin/neo4j-import --into data/db/graph.db --id-type string --delimiter "|" \
--bad-tolerance 1000000000 --skip-duplicate-nodes true --stacktrace true --ignore-empty-strings true \
--nodes:Author "data/author/author_header_label.csv,data/author/author_half_label.csv.gz"
Most of the options for bad-tolerance and skip-duplicate-nodes are there to see if I can make it get through the import somehow at least once.

I think I found the issue. I was using some of the tips here http://neo4j.com/developer/guide-import-csv/#_super_fast_batch_importer_for_huge_datasets where it says I can re-use the same csv file with different headers -- once for nodes and once for relationships. I underestimated the 1-n (ness) of the data I was using, causing a lot of duplicates on the ID. That stage was basically almost all spent on trying to sort and then dedupe. Re-working my queries to extract the data split into nodes and rels files, fixed that problem. Thanks for looking into this!
So basically, ideally always having separate files for each type of node and rel will give fastest results (at least in my tests).

Have a look at the batch importer I wrote for a stress test:
https://github.com/graphaware/neo4j-stress-test
I used both neo4j index and in memory map between two commit. It is really fast and works for both version of neo4j.
Ignore the tests and get the batch importer.

Related

Error when importing csv using import tool

I'm trying to load a graph with two nodes (Autor,Paper) and a relation with the import tool, right now I have this two files, which, as far as I understand, they must be:
authors.csv:
:Author(Autor) :Adscription(Autor) :PMID(Paper)
Author1 Department of Hematology. 31207293
Papers.csv
:PMID(Paper) :PaperName(Paper) :AuthorList(Autor)
31207293 A huge paper name Author1,Author2,
These files are stored in /var/lib/neo4j/import
With this in mind, I run the following code
sudo neo4j-admin import --database=graph.db --id-type=STRING --mode=csv --delimiter=" " --nodes :Autor:Paper="authors.csv,Papers.csv"
but I got
WARNING: Max 1024 open files allowed, minimum of 40000 recommended. See the Neo4j manual.
Expected '--nodes' to have at least 1 valid item, but had 0 []
usage: neo4j-admin import [--mode=csv] [--database=<name>]
[--additional-config=<config-file-path>]
[--report-file=<filename>]
[--nodes[:Label1:Label2]=<"file1,file2,...">]
[--relationships[:RELATIONSHIP_TYPE]=<"file1,file2,...">]
Right now, I'm only attempting to load the nodes Paper and Author, I'm able to do this in the browser by means of
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///authors.csv" AS row
MERGE ( c:Autor{ Name:row.Autor , Adscription: row.Adscription, PMID=row.PMID } )
but the time taken by doing so, is long.
This warning is probably not affecting you, but see here for more info.
If you are importing large amounts of data, the reason why your Cypher is taking so long is because of MERGE. If you know that the authors.csv contains a unique entry for each author, then you do not need to do a MERGE since it will never match to an existing node.
Try switching MERGE to CREATE. It should be much faster.

influxdb CLI import failed inserts when using a huge file

I am currently working on NASDAQ data parsing and insertion into the influx database. I have taken care of all the data insertion rules (escaping special characters and organizing the according to the format : <measurement>[,<tag-key>=<tag-value>...] <field-key>=<field-value>[,<field2-key>=<field2-value>...] [unix-nano-timestamp]).
Below is a sample of my data:
apatel17#*****:~/output$ head S051018-v50-U.csv
# DDL
CREATE DATABASE NASDAQData
# DML
# CONTEXT-DATABASE:NASDAQData
U,StockLoc=6445,OrigOrderRef=22159,NewOrderRef=46667 TrackingNum=0,Shares=200,Price=73.7000 1525942800343419608
U,StockLoc=6445,OrigOrderRef=20491,NewOrderRef=46671 TrackingNum=0,Shares=200,Price=73.7800 1525942800344047668
U,StockLoc=952,OrigOrderRef=65253,NewOrderRef=75009 TrackingNum=0,Shares=400,Price=45.8200 1525942800792553625
U,StockLoc=7092,OrigOrderRef=51344,NewOrderRef=80292 TrackingNum=0,Shares=100,Price=38.2500 1525942803130310652
U,StockLoc=7092,OrigOrderRef=80292,NewOrderRef=80300 TrackingNum=0,Shares=100,Price=38.1600 1525942803130395217
U,StockLoc=7092,OrigOrderRef=82000,NewOrderRef=82004 TrackingNum=0,Shares=300,Price=37.1900 1525942803232492698
I have also created the database: NASDAQData inside influx.
The problem I am facing is this:
The file has approximately 13 million rows (12,861,906 to be exact). I am trying to insert this data using the CLI import command as below:
influx -import -path=S051118-v50-U.csv -precision=ns -database=NASDAQData
I usually get upto 5,000,000 lines before I start getting the error for insertion. I have run this code multiple times and sometimes I get the error at 3,000,000 lines as well. To figure out this error, I am running the same code on a part of the file. I divide the data into 500,000 lines each and the code successfully ran for all the smaller files. (all 26 files of 500,000 rows)
Has this happened to somebody else or does somebody know a fix for this problem wherein a huge file shows errors during data insert but if broken down and worked with smaller data size, the import works perfectly.
Any help is appreciated. Thanks
As recommended by influx documentation, it may be necessary to split your data file into several smaller ones as the http request used for issuing your writes can timeout after 5 seconds.
If your data file has more than 5,000 points, it may be necessary to
split that file into several files in order to write your data in
batches to InfluxDB. We recommend writing points in batches of 5,000
to 10,000 points. Smaller batches, and more HTTP requests, will result
in sub-optimal performance. By default, the HTTP request times out
after five seconds. InfluxDB will still attempt to write the points
after that time out but there will be no confirmation that they were
successfully written.
Alternatively you can set a limit on how much points to write per second using the pps option. This should relief some stress off your influxdb.
See:
https://docs.influxdata.com/influxdb/v1.7/tools/shell/#import-data-from-a-file-with-import

Talend- Memory issues. Working with big files

Before admins start to eating me alive, I would like to say to my defense that I cannot comment in the original publications, because I do not have the power, therefore, I have to ask about this again.
I have issues running a job in talend (Open Studio for BIG DATA!). I have an archive of 3 gb. I do not consider that this is too much since I have a computer that has 32 GB in RAM.
While trying to run my job, first I got an error related to heap memory issue, then it changed for a garbage collector error, and now It doesn't even give me an error. (just do nothing and then stops)
I found this SOLUTIONS and:
a) Talend performance
#Kailash commented that parallel is only on the condition that I have to be subscribed to one of the Talend Platform solutions. My comment/question: So there is no other similar option to parallelize a job with a 3Gb archive size?
b) Talend 10 GB input and lookup out of memory error
#54l3d mentioned that its an option to split the lookup file into manageable chunks (may be 500M), then perform the join in many stages for each chunk. My comment/cry for help/question: how can I do that, I do not know how to split the look up, can someone explain this to me a little bit more graphical
c) How to push a big file data in talend?
just to mention that I also went through the "c" but I don't have any comment about it.
The job I am performing (thanks to #iMezouar) looks like this:
1) I have an inputFile MySQLInput coming from a DB in MySQL (3GB)
2) I used the tFirstRows to make it easier for the process (not working)
3) I used the tSplitRow to transform the data form many simmilar columns to only one column.
4) MySQLOutput
enter image description here
Thanks again for reading me and double thanks for answering.
From what I understand, your query returns a lot of data (3GB), and that is causing an error in your job. I suggest the following :
1. Filter data on the database side : replace tSampleRow by a WHERE clause in your tMysqlInput component in order to retrieve fewer rows in Talend.
2. MySQL jdbc driver by default retrieves all data into memory, so you need to use the stream option in tMysqlInput's advanced settings in order to stream rows.

Garbage collection tuning/performance degradation for neo4j bulk .csv import

I am running a bulk import of data into a neo4j instance (I have run against 2.2.0 community and enterprise editions as well as 2.1.7 community) running in server mode. My application creates a bunch of nodes in memory, and will peridoically stop to write a series .csv files and send cypher to the neo4j instance to upload the files. (this was done due to performance issues with running the application using the plain old REST API).
Overall, I'm looking to upload something like 150-5000 million nodes, so this is, in principle, the type of thing that neo4j claims to be able to handle relatively well.
Well, anyway, what I'm noticing when I run this against production data is that the application runs in two states -- one where the csv upload processes between 2k-8k of nodes per second, and one where it processes between 80-200 nodes per second. The two states are interwoven when you look at the upload as a time series, and as time goes on, it spends increasingly long amounts of time in the slow state.
Nodes are created through a series of
MERGE (:{NODE_TYPE} {csvLine.key = n.primaryKey}) on create set [PROPERTY LIST];
statements, and I have indexes on everything that I'm doing merges against. This doesn't feel like a degradation in the insert statements, because the slowdown is not linear, but rather bimodal, this feels like there are garbage collection in the neo4j instance. What is the best way to tune the neo4j JVM garbage collector for frequent bulk inserts?
neo4j.properties:
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
#neostore.relationshipgroupstore.db.mapped_memory=10M
neostore.propertystore.db.mapped_memory=100M
#neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M
neo4j-wrapper.conf:
wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional=-XX:-OmitStackTraceInFastThrow
wrapper.java.additional=-XX:hashCode=5
wrapper.java.initmemory=8194
wrapper.java.maxmemory=8194
This felt like the sweet spot for both the overall heap memory and the neostore stuff. Increasing the overall heap degraded performance. That said, the neo4j garbage collection logs frequently have that GC (Allocation Failure) message.
EDIT: in response to Michael Hunger:
the machine has 64 GB of RAM, and nothing seems to be maxed out. It also seems like only a small number of cores are being used at any time. Garbage collector profiling shows that the garbage collector seems to be running quite frequently.
The exact cypher statements are, for example:
USING PERIODIC COMMIT 110000 LOAD CSV WITH HEADERS FROM 'file:///home/jschirmer/Event_2015_4_773476.csv' AS csvLine MERGE (s:Event {primaryKey: csvLine.primaryKey}) ON CREATE SET s.checkSum= csvLine.checkSum,s.epochTime= toInt(csvLine.epochTime),s.epochTimeCreated= toInt(csvLine.epochTimeCreated),s.epochTimeUpdated= toInt(csvLine.epochTimeUpdated),s.eventDescription= csvLine.eventDescription,s.fileName= csvLine.fileName,s.ip= csvLine.ip,s.lineNumber= toInt(csvLine.lineNumber),s.port= csvLine.port,s.processPid= csvLine.processPid,s.rawEventLine= csvLine.rawEventLine,s.serverId= csvLine.serverId,s.status= toInt(csvLine.status);
USING PERIODIC COMMIT 110000 LOAD CSV WITH HEADERS FROM 'file:///home/jschirmer/Event__File_2015_4_773476.csv' AS csvLine MATCH (n:SC_CSR{primaryKey: csvLine.Event_id}), (s:File{fileName: csvLine.File_id}) MERGE n-[:DATA_SOURCE]->s;
Though there are serveral such statements being made
I have tried a single concurrent transaction as well as running several (~3) such statements in parallel (which gives a roughly 2x improvement). I've tried tuning the periodic commit frequency, and the size of the file. It seems that this maximizes performance when the csv file is roughly 100k lines, which means that really, the periodic commit can be off.
I have not run profile on the staments. I will do that, but I thought that the eager merget problem was avoided by using MERGE ... on create statements.
IN general your config looks ok, what RAM does your machine have?
For the things you merge against I'd recommend constraint instead of an index.
What's your tx size? And how many concurrent tx do you run?
Instead of your generic merge statement (which wouldn't compile) can you share the concrete statements?
Did you profile the statements? Perhaps you run into the eager pipe problem:
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
Do you use periodic commit?
How large are you CSV files?
See: http://neo4j.com/developer/guide-import-csv/

How to explain the performance of Cypher's LOAD CSV clause?

I'm using Cypher's LOAD CSV syntax in Neo4J 2.1.2. So far it's been a huge improvement over the more manual ETL process required in previous versions. But I'm running into some behavior in a single case that's not what I'd expect and I wonder if I'm missing something.
The cypher query being used is this:
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/James/Desktop/import/dependency_sets_short.csv' AS row
MATCH (s:Sense {uid: toInt(row[4])})
MERGE (ds:DependencySet {label: row[2]}) ON CREATE SET ds.optional=(row[3] = 't')
CREATE (s)-[:has]->(ds)
Here's a couple of lines of the CSV:
227303,1,TO-PURPOSE-NOMINAL,t,73830
334471,1,AT-LOCATION,t,92048
334470,1,AT-TIME,t,92048
334469,1,ON-LOCATION,t,92048
227302,1,TO-PURPOSE-INFINITIVE,t,73830
116008,1,TO-LOCATION,t,68204
116007,1,IN-LOCATION,t,68204
227301,1,TO-LOCATION,t,73830
334468,1,ON-DATE,t,92048
116006,1,AT-LOCATION,t,68204
334467,1,WITH-ASSOCIATE,t,92048
Basically, I'm matching a Sense node (previously imported) based on it's ID value which is the fifth column. Then I'm doing a merge to either get a DependencySet node if it exists, or create it. Finally, I'm creating a has edge between the Sense node and the DependencySet node. So far so good, this all works as expected. What's confusing is the performance as the size of the CSV grows.
CSV Lines Time (msec)
------------------------------
500 480
1000 717
2000 1110
5000 1521
10000 2111
50000 4794
100000 5907
200000 12302
300000 35494
400000 Java heap space error
My expectation is that growth would be more-or-less linear, particularly as I'm committing every 500 lines as recommended by the manual, but it's actually closer to polynomial:
What's worse is that somewhere between 300k and 400k rows, it runs into a Java heap space error. Based on the trend from previous imports, I'd expect the import of 400k to take a bit over a minute. Instead, it churns away for about 5-7 minutes before running into the heap space error. It seems like I could split this file into 300,000-line chunks, but isn't that what "USING PERIODIC COMMIT" is supposed to do, more or less? I suppose I could give Neo4J more memory too, but again, it's not clear why I should have to in this scenario.
Also, to be clear, the lookups on both Sense.uid and DependencySet.label are indexed, so the lookup penalty for these should be pretty small. Here's a snippet from the schema:
Indexes
ON :DependencySet(label) ONLINE (for uniqueness constraint)
ON :Sense(uid) ONLINE (for uniqueness constraint)
Any explanations or thoughts on an alternative approach would be appreciated.
EDIT: The problem definitely seems to be in the MATCH and/or CREATE part of the query. If I remove lines 3 and 5 from the Cypher query it performs fine.
I assume that you've already created all the Sense labeled nodes before running this LOAD CSV import. What I think is going on is that as you are matching nodes with the label Sense into memory and creating relationships from the DependencySet to the Sense node via CREATE (s)-[:HAS]->(ds) you are increasing utilization of the available heap.
Another possibility is that the size of your relationship store in your memory mapped settings needs to be increased. In your scenario it looks like the Sense nodes have a high degree of connectivity to other nodes in the graph. When this happens your relationship store for those nodes require more memory. Eventually when you hit 400k nodes the heap is maxed out. Up until that point it needs to do more garbage collection and reads from disk.
Michael Hunger put together an excellent blog post on memory mapped settings for fast LOAD CSV performance. See here: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
That should resolve your problem. I don't see anything wrong with your query.
i believe the line
MATCH (s:Sense {uid: toInt(row[4])})
makes the time paradigm. somewhere around the 200 000 in the x line of your graph, you have no longer all the Sense nodes in the memory but some of them must be cached to disk. thus all the increase in time is simply re-loading data from cache to memory and vise-versa (otherwise it will be still linear if kept in memory).
maybe if you could post you server memory settings, we could dig deeper.
to the problem of java heap error refer to Kenny's answer

Resources