How to load a large csv file into Neo4j - neo4j

I'm trying to load a large CSV file (1,458,644 rows) into Neo4j, but I keep getting this error:
Neo.TransientError.General.OutOfMemoryError: There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop, found through the user interface) or, if you are running an embedded installation, increase the heap by using the '-Xmx' command line flag, and then restart the database.
Even if I change dbms.memory.heap.max_size=1024m (m for megabytes), the same error occurs again!
Note: the size of the CSV is 195.888 KB.
This is my code:
load csv with headers from "file:///train.csv" as line
create (pl:pickup_location {latitude: toFloat(line.pickup_latitude), longitude: toFloat(line.pickup_longitude)}),
       (pt:pickup_time {pickup: line.pickup_datetime}),
       (dl:dropoff_location {latitude: toFloat(line.dropoff_latitude), longitude: toFloat(line.dropoff_longitude)}),
       (dt:dropoff_time {dropoff: line.dropoff_datetime})
create (pl)-[:TLR]->(pt), (dl)-[:TLR]->(dt), (pl)-[:Trip]->(dl);
What should I do?

You should use periodic commits to process the CSV data in batches. For example, this will commit every 10,000 lines (the default batch size is 1,000):
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///train.csv" AS line
CREATE (pl:pickup_location {latitude: toFloat(line.pickup_latitude), longitude: toFloat(line.pickup_longitude)}),
       (pt:pickup_time {pickup: line.pickup_datetime}),
       (dl:dropoff_location {latitude: toFloat(line.dropoff_latitude), longitude: toFloat(line.dropoff_longitude)}),
       (dt:dropoff_time {dropoff: line.dropoff_datetime})
CREATE (pl)-[:TLR]->(pt), (dl)-[:TLR]->(dt), (pl)-[:Trip]->(dl);

I solved the problem by adapting the solution for a limited number of rows from here, so this is my solution:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///train.csv" AS line
WITH line LIMIT 1458644
CREATE (pl:pickup_location {latitude: toFloat(line.pickup_latitude), longitude: toFloat(line.pickup_longitude)}),
       (pt:pickup_time {pickup: line.pickup_datetime}),
       (dl:dropoff_location {latitude: toFloat(line.dropoff_latitude), longitude: toFloat(line.dropoff_longitude)}),
       (dt:dropoff_time {dropoff: line.dropoff_datetime})
CREATE (pl)-[:TLR]->(pt), (dl)-[:TLR]->(dt), (pl)-[:Trip]->(dl);
The downside of this solution is that you need to know the number of rows in your big CSV file (Excel can't open large CSV files).
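If you just need the row count, one option is to let Neo4j count it for you before the import. This is a minimal sketch (not from the original answer) that only reads the file and returns the number of data rows:
LOAD CSV WITH HEADERS FROM "file:///train.csv" AS line
RETURN count(line);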

Related

Error when importing csv using import tool

I'm trying to load a graph with two node types (Autor, Paper) and a relationship using the import tool. Right now I have these two files, which, as far as I understand, should look like this:
authors.csv:
:Author(Autor) :Adscription(Autor) :PMID(Paper)
Author1 Department of Hematology. 31207293
Papers.csv:
:PMID(Paper) :PaperName(Paper) :AuthorList(Autor)
31207293 A huge paper name Author1,Author2,
These files are stored in /var/lib/neo4j/import
With this in mind, I run the following command:
sudo neo4j-admin import --database=graph.db --id-type=STRING --mode=csv --delimiter=" " --nodes :Autor:Paper="authors.csv,Papers.csv"
but I got
WARNING: Max 1024 open files allowed, minimum of 40000 recommended. See the Neo4j manual.
Expected '--nodes' to have at least 1 valid item, but had 0 []
usage: neo4j-admin import [--mode=csv] [--database=<name>]
[--additional-config=<config-file-path>]
[--report-file=<filename>]
[--nodes[:Label1:Label2]=<"file1,file2,...">]
[--relationships[:RELATIONSHIP_TYPE]=<"file1,file2,...">]
Right now I'm only attempting to load the Paper and Author nodes. I'm able to do this in the browser by means of
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///authors.csv" AS row
MERGE (c:Autor { Name: row.Autor, Adscription: row.Adscription, PMID: row.PMID })
but the time this takes is long.
This warning is probably not affecting you, but see here for more info.
If you are importing large amounts of data, the reason your Cypher is taking so long is the MERGE. If you know that authors.csv contains a unique entry for each author, then you do not need MERGE, since it will never match an existing node.
Try switching MERGE to CREATE. It should be much faster.
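For example, a sketch adapted from the query above (assuming each row of authors.csv really is a distinct author):
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///authors.csv" AS row
CREATE (c:Autor { Name: row.Autor, Adscription: row.Adscription, PMID: row.PMID })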

influxdb CLI import failed inserts when using a huge file

I am currently working on parsing NASDAQ data and inserting it into the InfluxDB database. I have taken care of all the data insertion rules (escaping special characters and organizing the data according to the format: <measurement>[,<tag-key>=<tag-value>...] <field-key>=<field-value>[,<field2-key>=<field2-value>...] [unix-nano-timestamp]).
Below is a sample of my data:
apatel17#*****:~/output$ head S051018-v50-U.csv
# DDL
CREATE DATABASE NASDAQData
# DML
# CONTEXT-DATABASE:NASDAQData
U,StockLoc=6445,OrigOrderRef=22159,NewOrderRef=46667 TrackingNum=0,Shares=200,Price=73.7000 1525942800343419608
U,StockLoc=6445,OrigOrderRef=20491,NewOrderRef=46671 TrackingNum=0,Shares=200,Price=73.7800 1525942800344047668
U,StockLoc=952,OrigOrderRef=65253,NewOrderRef=75009 TrackingNum=0,Shares=400,Price=45.8200 1525942800792553625
U,StockLoc=7092,OrigOrderRef=51344,NewOrderRef=80292 TrackingNum=0,Shares=100,Price=38.2500 1525942803130310652
U,StockLoc=7092,OrigOrderRef=80292,NewOrderRef=80300 TrackingNum=0,Shares=100,Price=38.1600 1525942803130395217
U,StockLoc=7092,OrigOrderRef=82000,NewOrderRef=82004 TrackingNum=0,Shares=300,Price=37.1900 1525942803232492698
I have also created the database: NASDAQData inside influx.
The problem I am facing is this:
The file has approximately 13 million rows (12,861,906 to be exact). I am trying to insert this data using the CLI import command as below:
influx -import -path=S051118-v50-U.csv -precision=ns -database=NASDAQData
I usually get up to 5,000,000 lines before I start getting the insertion error. I have run this command multiple times, and sometimes I get the error at 3,000,000 lines as well. To narrow down the error, I ran the same import on parts of the file: I split the data into files of 500,000 lines each, and the import ran successfully for all of the smaller files (all 26 files of 500,000 rows).
Has this happened to somebody else, or does somebody know a fix for this problem, where a huge file shows errors during the insert but, if broken down into smaller files, the import works perfectly?
Any help is appreciated. Thanks
As recommended by the InfluxDB documentation, it may be necessary to split your data file into several smaller ones, as the HTTP request used for issuing your writes can time out after 5 seconds.
If your data file has more than 5,000 points, it may be necessary to split that file into several files in order to write your data in batches to InfluxDB. We recommend writing points in batches of 5,000 to 10,000 points. Smaller batches, and more HTTP requests, will result in sub-optimal performance. By default, the HTTP request times out after five seconds. InfluxDB will still attempt to write the points after that time out but there will be no confirmation that they were successfully written.
Alternatively, you can set a limit on how many points to write per second using the pps option. This should relieve some stress on your InfluxDB instance.
See:
https://docs.influxdata.com/influxdb/v1.7/tools/shell/#import-data-from-a-file-with-import
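A sketch of what the throttled import could look like, reusing the command from the question (the -pps value of 5000 is only an illustrative choice, not a recommendation from this thread):
influx -import -path=S051118-v50-U.csv -precision=ns -database=NASDAQData -pps=5000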

Garbage collection tuning/performance degradation for neo4j bulk .csv import

I am running a bulk import of data into a neo4j instance (I have run against 2.2.0 community and enterprise editions as well as 2.1.7 community) running in server mode. My application creates a bunch of nodes in memory and will periodically stop to write a series of .csv files and send Cypher to the neo4j instance to upload the files. (This was done due to performance issues with running the application using the plain old REST API.)
Overall, I'm looking to upload something like 150-5000 million nodes, so this is, in principle, the type of thing that neo4j claims to be able to handle relatively well.
Well, anyway, what I'm noticing when I run this against production data is that the application runs in two states -- one where the csv upload processes between 2k-8k nodes per second, and one where it processes between 80-200 nodes per second. The two states are interwoven when you look at the upload as a time series, and as time goes on it spends increasingly long amounts of time in the slow state.
Nodes are created through a series of
MERGE (:{NODE_TYPE} {csvLine.key = n.primaryKey}) on create set [PROPERTY LIST];
statements, and I have indexes on everything that I'm doing merges against. This doesn't feel like a degradation in the insert statements, because the slowdown is not linear but rather bimodal; it feels like there is garbage collection going on in the neo4j instance. What is the best way to tune the neo4j JVM garbage collector for frequent bulk inserts?
neo4j.properties:
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
#neostore.relationshipgroupstore.db.mapped_memory=10M
neostore.propertystore.db.mapped_memory=100M
#neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M
neo4j-wrapper.conf:
wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional=-XX:-OmitStackTraceInFastThrow
wrapper.java.additional=-XX:hashCode=5
wrapper.java.initmemory=8194
wrapper.java.maxmemory=8194
This felt like the sweet spot for both the overall heap memory and the neostore stuff. Increasing the overall heap degraded performance. That said, the neo4j garbage collection logs frequently have that GC (Allocation Failure) message.
EDIT: in response to Michael Hunger:
the machine has 64 GB of RAM, and nothing seems to be maxed out. It also seems like only a small number of cores are being used at any time. Garbage collector profiling shows that the garbage collector seems to be running quite frequently.
The exact cypher statements are, for example:
USING PERIODIC COMMIT 110000 LOAD CSV WITH HEADERS FROM 'file:///home/jschirmer/Event_2015_4_773476.csv' AS csvLine MERGE (s:Event {primaryKey: csvLine.primaryKey}) ON CREATE SET s.checkSum= csvLine.checkSum,s.epochTime= toInt(csvLine.epochTime),s.epochTimeCreated= toInt(csvLine.epochTimeCreated),s.epochTimeUpdated= toInt(csvLine.epochTimeUpdated),s.eventDescription= csvLine.eventDescription,s.fileName= csvLine.fileName,s.ip= csvLine.ip,s.lineNumber= toInt(csvLine.lineNumber),s.port= csvLine.port,s.processPid= csvLine.processPid,s.rawEventLine= csvLine.rawEventLine,s.serverId= csvLine.serverId,s.status= toInt(csvLine.status);
USING PERIODIC COMMIT 110000 LOAD CSV WITH HEADERS FROM 'file:///home/jschirmer/Event__File_2015_4_773476.csv' AS csvLine MATCH (n:SC_CSR{primaryKey: csvLine.Event_id}), (s:File{fileName: csvLine.File_id}) MERGE n-[:DATA_SOURCE]->s;
Though there are several such statements being made.
I have tried a single concurrent transaction as well as running several (~3) such statements in parallel (which gives a roughly 2x improvement). I've tried tuning the periodic commit frequency and the size of the file. It seems that performance is maximized when the csv file is roughly 100k lines, which means that, really, the periodic commit could be turned off.
I have not run PROFILE on the statements. I will do that, but I thought that the eager merge problem was avoided by using MERGE ... ON CREATE statements.
In general your config looks OK. What RAM does your machine have?
For the things you merge against I'd recommend a constraint instead of an index (see the example after these questions).
What's your tx size? And how many concurrent tx do you run?
Instead of your generic merge statement (which wouldn't compile) can you share the concrete statements?
Did you profile the statements? Perhaps you run into the eager pipe problem:
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
Do you use periodic commit?
How large are your CSV files?
See: http://neo4j.com/developer/guide-import-csv/
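As an example of the constraint suggestion, a sketch based on the labels and properties in the statements above (this assumes primaryKey and fileName really are unique per node; syntax as in the Neo4j 2.x releases mentioned in the question):
CREATE CONSTRAINT ON (e:Event) ASSERT e.primaryKey IS UNIQUE;
CREATE CONSTRAINT ON (f:File) ASSERT f.fileName IS UNIQUE;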

Loading data from a text file Informix

I want to know if there is any other way to load data from a text file other than using external tables.
Text file looks something like
101 fname1 lname1 D01..
102 fname2 lname2 D02..
I want to load it into a table with columns emp_id, fname, lname, dept etc.
Thanks!
There are three utilities in Informix to load data into the database from flat files:
The LOAD SQL command. Very simple to use, but not very flexible. I would recommend this for small amounts of records (fewer than 10k); see the example after this list.
dbload, a command line utility, a bit more complex than the LOAD SQL command. It gives you more control over how the records are loaded: commit intervals, starting point in the flat file, number of errors before exiting, etc. I'd recommend this utility for small to medium-sized data loads (>10k, <100k).
HPL, or High Performance Loader, a rather complex utility that can load data at a very high rate of speed, but with a lot of overhead. Recommended for large to x-large data loads.
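A minimal LOAD sketch run from DB-Access (this assumes the text file has been saved as pipe-delimited employee.unl -- the file name and delimiter are placeholders -- and that the target table already exists with the columns from the question):
LOAD FROM 'employee.unl' DELIMITER '|'
INSERT INTO employee (emp_id, fname, lname, dept);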
As ceinmart suggested in the comments, you can do it from the server side or from the client side. From the server side you can use DB-Access and the LOAD command. From the client side you can use any tool you like. For such tasks I often use Jython, which can use Python string and csv libraries as well as JDBC database drivers. With Jython you can use the csv module to read data from the file and a PreparedStatement to insert it into the database. In my other answer, Substring in Informix, you will see such a PreparedStatement.

Neo4j throws java heap exception while creating relationships

I did a bulk upload into Neo4j of two different files, say file "A" (containing 10,000 records) and file "B" (containing 9,000 records).
Now I have a third file, say file "C", with 10 million records (rows).
File "C" describes the relationships between file "A" and file "B".
When the processing starts for file "C", it throws a Java heap size exception. I have 4 GB of RAM and increased the heap size up to 3 GB. However, if I reduce file "C" to 2 million records, it works fine.
I am using Neo4j version 1.9.
Please suggest why this is happening and how to solve it.
Thanks in advance :-)
Are you doing this with Neo4j's normal API, or with the bulk inserter? I'm assuming the normal API, and I'm assuming you're doing everything in one transaction. Either use the bulk inserter, or break up your transactions, as transactions are kept in memory until flushed to disk on commit, which is most likely what is causing your heap errors.
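For reference, a hypothetical sketch of the bulk (batch) inserter route on Neo4j 1.9 -- the store path, properties, and the RELATES_TO type are placeholders, not from the question; nothing is kept in a transaction, and everything is flushed when shutdown() is called:
import java.util.Collections;
import java.util.Map;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class BatchInsertSketch {
    public static void main(String[] args) {
        BatchInserter inserter = BatchInserters.inserter("/path/to/graph.db");
        try {
            Map<String, Object> props = Collections.<String, Object>singletonMap("name", "example");
            long a = inserter.createNode(props);  // in practice, reuse the ids created while loading files "A" and "B"
            long b = inserter.createNode(props);
            // one call like this per row of file "C"
            inserter.createRelationship(a, b, DynamicRelationshipType.withName("RELATES_TO"), null);
        } finally {
            inserter.shutdown();  // writes the store to disk; nothing is durable before this
        }
    }
}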
