Neo4j throws Java heap exception while creating relationships

I did a bulk upload into Neo4j of two different files, say file "A" (containing 10,000 records) and file "B" (containing 9,000 records).
Now I have a third file, say file "C" (containing 10 million records (rows)).
File "C" describes the relationships between file "A" and file "B".
When processing starts for file "C", it throws a Java heap size exception. I have 4 GB of RAM and increased the heap size up to 3 GB. However, if I reduce the size of file "C" to 2 million records, it works fine.
I am using Neo4j version 1.9.
Please suggest why this happens and how to solve it.
Thanks in advance :-)

Are you doing this with Neo4j's normal API or with the bulk inserter? I'm assuming the normal API, and that you're doing everything in one transaction. Either use the bulk inserter or break up your transactions: a transaction's changes are kept in memory until they are flushed to disk on commit, which is most likely what is causing your heap errors.
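To illustrate the "break up your transactions" advice, here is a minimal sketch in Python. It only shows the batching logic; `write_batch` is a hypothetical callback (not part of any Neo4j API) that would open a transaction, create the relationships for one batch, and commit. The batch size of 10,000 is an assumption to tune, not a recommended value.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` rows without loading all rows at once."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def import_in_batches(
    rows: Iterable[T],
    write_batch: Callable[[List[T]], None],  # hypothetical: one transaction per call
    batch_size: int = 10_000,                # assumed value, tune for your heap
) -> int:
    """Call `write_batch` once per batch and return the number of batches committed."""
    committed = 0
    for batch in batched(rows, batch_size):
        write_batch(batch)  # only this batch's changes are held in memory
        committed += 1
    return committed
```

With 10 million rows in file "C" and a batch size of 10,000, this would commit 1,000 small transactions instead of a single one that has to fit in the heap.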

Related

Talend- Memory issues. Working with big files

Before the admins start eating me alive, I would like to say in my defense that I cannot comment on the original posts because I do not have the privileges to do so, so I have to ask about this again.
I have issues running a job in Talend (Open Studio for BIG DATA!). I have an archive of 3 GB. I do not consider this too much, since my computer has 32 GB of RAM.
While trying to run my job, I first got an error related to heap memory, then it changed to a garbage collector error, and now it doesn't even give me an error (it just does nothing and then stops).
I found these solutions:
a) Talend performance
#Kailash commented that parallelization is only available if I am subscribed to one of the Talend Platform solutions. My comment/question: is there no other similar option to parallelize a job with a 3 GB archive?
b) Talend 10 GB input and lookup out of memory error
#54l3d mentioned that one option is to split the lookup file into manageable chunks (maybe 500M), then perform the join in many stages, one for each chunk. My comment/cry for help/question: how can I do that? I do not know how to split the lookup; can someone explain this to me a little more graphically?
c) How to push a big file data in talend?
Just to mention, I also went through (c), but I don't have any comments on it.
The job I am performing (thanks to #iMezouar) looks like this:
1) I have an input, MySQLInput, coming from a MySQL DB (3 GB)
2) I used tFirstRows to make it easier for the process (not working)
3) I used tSplitRow to transform the data from many similar columns into only one column.
4) MySQLOutput
Thanks again for reading, and double thanks for answering.
From what I understand, your query returns a lot of data (3 GB), and that is causing an error in your job. I suggest the following:
1. Filter data on the database side: replace tSampleRow by a WHERE clause in your tMysqlInput component in order to retrieve fewer rows in Talend.
2. The MySQL JDBC driver by default retrieves all data into memory, so you need to use the stream option in tMysqlInput's advanced settings in order to stream rows.
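The chunked-lookup idea quoted in (b) of the question can be pictured with a small sketch. This is plain Python rather than Talend configuration, and every name in it (`chunked`, `staged_join`, `read_main`) is hypothetical. It assumes an inner join on a single key, lookup keys that are unique, and a main input that can be re-read once per stage.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def chunked(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield the lookup rows in lists of at most `size` rows."""
    chunk: List[Row] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def staged_join(
    read_main: Callable[[], Iterable[Row]],  # re-opens the main input for each stage
    lookup_rows: Iterable[Row],
    key: str,
    chunk_size: int,
) -> Iterator[Row]:
    """Inner-join main rows against the lookup, holding one lookup chunk in memory at a time."""
    for chunk in chunked(lookup_rows, chunk_size):
        index = {row[key]: row for row in chunk}  # only this chunk is indexed in memory
        for main_row in read_main():              # one pass over the main input per stage
            match = index.get(main_row[key])
            if match is not None:
                yield {**main_row, **match}
```

The trade-off is memory for time: peak memory is bounded by one chunk of the lookup, but the main input is scanned once per chunk.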

Garbage collection tuning/performance degradation for neo4j bulk .csv import

I am running a bulk import of data into a Neo4j instance (I have run against 2.2.0 community and enterprise editions as well as 2.1.7 community) running in server mode. My application creates a bunch of nodes in memory, and will periodically stop to write a series of .csv files and send Cypher to the Neo4j instance to upload the files. (This was done due to performance issues with running the application using the plain old REST API.)
Overall, I'm looking to upload something like 150-5000 million nodes, so this is, in principle, the type of thing that neo4j claims to be able to handle relatively well.
Well, anyway, what I'm noticing when I run this against production data is that the application runs in two states: one where the csv upload processes 2k-8k nodes per second, and one where it processes 80-200 nodes per second. The two states are interwoven when you look at the upload as a time series, and as time goes on, it spends increasingly long amounts of time in the slow state.
Nodes are created through a series of
MERGE (:{NODE_TYPE} {csvLine.key = n.primaryKey}) on create set [PROPERTY LIST];
statements, and I have indexes on everything that I'm doing merges against. This doesn't feel like a degradation in the insert statements, because the slowdown is not linear but bimodal; it feels like garbage collection pauses in the Neo4j instance. What is the best way to tune the Neo4j JVM garbage collector for frequent bulk inserts?
neo4j.properties:
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
#neostore.relationshipgroupstore.db.mapped_memory=10M
neostore.propertystore.db.mapped_memory=100M
#neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M
neo4j-wrapper.conf:
wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional=-XX:-OmitStackTraceInFastThrow
wrapper.java.additional=-XX:hashCode=5
wrapper.java.initmemory=8194
wrapper.java.maxmemory=8194
This felt like the sweet spot for both the overall heap memory and the neostore stuff. Increasing the overall heap degraded performance. That said, the neo4j garbage collection logs frequently have that GC (Allocation Failure) message.
EDIT: in response to Michael Hunger:
the machine has 64 GB of RAM, and nothing seems to be maxed out. It also seems like only a small number of cores are being used at any time. Garbage collector profiling shows that the garbage collector seems to be running quite frequently.
The exact cypher statements are, for example:
USING PERIODIC COMMIT 110000
LOAD CSV WITH HEADERS FROM 'file:///home/jschirmer/Event_2015_4_773476.csv' AS csvLine
MERGE (s:Event {primaryKey: csvLine.primaryKey})
ON CREATE SET s.checkSum = csvLine.checkSum,
  s.epochTime = toInt(csvLine.epochTime),
  s.epochTimeCreated = toInt(csvLine.epochTimeCreated),
  s.epochTimeUpdated = toInt(csvLine.epochTimeUpdated),
  s.eventDescription = csvLine.eventDescription,
  s.fileName = csvLine.fileName,
  s.ip = csvLine.ip,
  s.lineNumber = toInt(csvLine.lineNumber),
  s.port = csvLine.port,
  s.processPid = csvLine.processPid,
  s.rawEventLine = csvLine.rawEventLine,
  s.serverId = csvLine.serverId,
  s.status = toInt(csvLine.status);
USING PERIODIC COMMIT 110000
LOAD CSV WITH HEADERS FROM 'file:///home/jschirmer/Event__File_2015_4_773476.csv' AS csvLine
MATCH (n:SC_CSR {primaryKey: csvLine.Event_id}), (s:File {fileName: csvLine.File_id})
MERGE n-[:DATA_SOURCE]->s;
though there are several such statements being made.
I have tried a single concurrent transaction as well as running several (~3) such statements in parallel (which gives a roughly 2x improvement). I've tried tuning the periodic commit frequency, and the size of the file. It seems that this maximizes performance when the csv file is roughly 100k lines, which means that really, the periodic commit can be off.
I have not run PROFILE on the statements. I will do that, but I thought that the eager merge problem was avoided by using MERGE ... ON CREATE statements.
In general your config looks OK. How much RAM does your machine have?
For the things you merge against, I'd recommend a constraint instead of an index.
What's your tx size? And how many concurrent tx do you run?
Instead of your generic merge statement (which wouldn't compile) can you share the concrete statements?
Did you profile the statements? Perhaps you run into the eager pipe problem:
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
Do you use periodic commit?
How large are your CSV files?
See: http://neo4j.com/developer/guide-import-csv/

Aggregation queries on millions of nodes

I'm trying to query ~8 million nodes in a neo4j database. I can do queries which hit the index for exact matches easily enough, but is there a performant way to do aggregations?
MATCH (r:Resident) RETURN r.forename, count(r.forename) ORDER BY count(r.forename)
This query just sits there until I eventually restart my server. I've read the performance guides and I'm watching vm_stat and it seems to be quickly running out of pages free. I've tried tuning the memory / JVM heap settings to various things, but I'm not sure I completely know what I'm doing ;) I've got an 8 GB MacBook Air with an SSD drive in case that's helpful for suggesting settings. Also, here's my stats on my DB from webadmin:
10,236,226 nodes
56,280,161 properties
10,190,430 relationships
2 relationship types
14,535 MB database disk usage
I inserted 8M nodes with just 1 prop, and got this query down to ~20s without changing the default settings (after warming up the cache--first time took 90s), which is comparable to other databases like postgres (which I also tested).
Some things you could try to do:
raise the sizes in the mmio settings for the appropriate store files (as per the file sizes in data/graph.db/) in conf/neo4j.properties (at the top)
increase the node cache size in neo4j.properties
increase the heap init/max in neo4j-wrapper.conf
make sure you have enough RAM left over

How to explain the performance of Cypher's LOAD CSV clause?

I'm using Cypher's LOAD CSV syntax in Neo4J 2.1.2. So far it's been a huge improvement over the more manual ETL process required in previous versions. But I'm running into some behavior in a single case that's not what I'd expect and I wonder if I'm missing something.
The cypher query being used is this:
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/James/Desktop/import/dependency_sets_short.csv' AS row
MATCH (s:Sense {uid: toInt(row[4])})
MERGE (ds:DependencySet {label: row[2]}) ON CREATE SET ds.optional=(row[3] = 't')
CREATE (s)-[:has]->(ds)
Here's a couple of lines of the CSV:
227303,1,TO-PURPOSE-NOMINAL,t,73830
334471,1,AT-LOCATION,t,92048
334470,1,AT-TIME,t,92048
334469,1,ON-LOCATION,t,92048
227302,1,TO-PURPOSE-INFINITIVE,t,73830
116008,1,TO-LOCATION,t,68204
116007,1,IN-LOCATION,t,68204
227301,1,TO-LOCATION,t,73830
334468,1,ON-DATE,t,92048
116006,1,AT-LOCATION,t,68204
334467,1,WITH-ASSOCIATE,t,92048
Basically, I'm matching a Sense node (previously imported) based on its ID value, which is the fifth column. Then I'm doing a merge to either get a DependencySet node if it exists, or create it. Finally, I'm creating a has edge between the Sense node and the DependencySet node. So far so good, this all works as expected. What's confusing is the performance as the size of the CSV grows.
CSV Lines    Time (msec)
------------------------------
500          480
1000         717
2000         1110
5000         1521
10000        2111
50000        4794
100000       5907
200000       12302
300000       35494
400000       Java heap space error
My expectation is that growth would be more or less linear, particularly as I'm committing every 500 lines as recommended by the manual, but it's actually closer to polynomial.
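One way to check the shape of the growth from the table is to compute the cost of each additional line between consecutive measurements. The sketch below is plain Python over the numbers from the table; the function name is made up for illustration, and the 400,000-line run is left out because it has no timing.

```python
# (csv_lines, elapsed_ms) pairs copied from the table in the question
measurements = [
    (500, 480), (1000, 717), (2000, 1110), (5000, 1521), (10000, 2111),
    (50000, 4794), (100000, 5907), (200000, 12302), (300000, 35494),
]


def marginal_cost_ms_per_line(points):
    """Return (start_lines, end_lines, ms per additional line) for each consecutive pair."""
    result = []
    for (n0, t0), (n1, t1) in zip(points, points[1:]):
        result.append((n0, n1, (t1 - t0) / (n1 - n0)))
    return result


costs = marginal_cost_ms_per_line(measurements)
for n0, n1, cost in costs:
    print(f"{n0:>7} -> {n1:>7}: {cost:.4f} ms per additional line")
```

If the import were linear, the per-line cost would level off at a constant. On these numbers it drops steadily up to 100,000 lines and then rises by roughly a factor of three in each of the next two intervals, which matches the non-linear behaviour described.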
What's worse is that somewhere between 300k and 400k rows, it runs into a Java heap space error. Based on the trend from previous imports, I'd expect the import of 400k to take a bit over a minute. Instead, it churns away for about 5-7 minutes before running into the heap space error. It seems like I could split this file into 300,000-line chunks, but isn't that what "USING PERIODIC COMMIT" is supposed to do, more or less? I suppose I could give Neo4J more memory too, but again, it's not clear why I should have to in this scenario.
Also, to be clear, the lookups on both Sense.uid and DependencySet.label are indexed, so the lookup penalty for these should be pretty small. Here's a snippet from the schema:
Indexes
ON :DependencySet(label) ONLINE (for uniqueness constraint)
ON :Sense(uid) ONLINE (for uniqueness constraint)
Any explanations or thoughts on an alternative approach would be appreciated.
EDIT: The problem definitely seems to be in the MATCH and/or CREATE part of the query. If I remove lines 3 and 5 from the Cypher query it performs fine.
I assume that you've already created all the Sense-labeled nodes before running this LOAD CSV import. What I think is going on is that, as you match nodes with the label Sense into memory and create relationships from the Sense node to the DependencySet node via CREATE (s)-[:has]->(ds), you are increasing utilization of the available heap.
Another possibility is that the size of your relationship store in your memory mapped settings needs to be increased. In your scenario it looks like the Sense nodes have a high degree of connectivity to other nodes in the graph. When this happens, the relationship store for those nodes requires more memory. Eventually, when you hit 400k rows, the heap is maxed out. Up until that point it needs to do more garbage collection and more reads from disk.
Michael Hunger put together an excellent blog post on memory mapped settings for fast LOAD CSV performance. See here: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
That should resolve your problem. I don't see anything wrong with your query.
I believe the line
MATCH (s:Sense {uid: toInt(row[4])})
is what shapes the timing curve. Somewhere around 200,000 on the x-axis of your graph, the Sense nodes no longer all fit in memory and some of them must be cached to disk. The increase in time is then simply data being reloaded from the cache into memory and vice versa (it would still be linear if everything were kept in memory).
If you could post your server memory settings, we could dig deeper.
For the Java heap error, refer to Kenny's answer.

mnesia memory allocation

I was testing the application by inserting some 1000 users, each having 1000 contacts, into a database table under Mnesia, and at some point during insertion I got the following error:
Crash dump was written to: erl_crash.dump
binary_alloc: Cannot allocate 422879872 bytes of memory (of type "binary").
Aborted
I started the erl emulator with erl +MBas af (B = binary allocator, af = a fit) and tried again, but the error was the same.
Note: I am using Erlang R12B, and the system has 8 GB of RAM on Ubuntu 10.04.
How can I solve this?
The record definitions are:
%% database
-record(database,{dbid,guid,data}).
%% changelog
-record(changelog,{dbid,timestamp,changelist,type}).
Here data is a vCard (contact info), dbid and type are "contacts", and guid is an integer automatically generated by the server.
The database table contains the vCard data of all users. If there are 1000 users, each having 1000 contacts, then we will have 10^6 records.
The changelog table records which changes were made to the database table at a given timestamp.
The code for creating the tables is:
mnesia:create_table(database, [{type,bag}, {attributes,Record_of_database},
                               {record_name,database},
                               {index,guid},
                               {disc_copies,[node()]}])
mnesia:create_table(changelog, [{type,set}, {attributes,Record_of_changelog},
                                {record_name,changelog},
                                {index,timestamp},
                                {disc_copies,[node()]}])
The insertion of records into the tables is:
commit_data(DataList = [#database{dbid=DbID}|_]) ->
    io:format("commit data called~n"),
    [mnesia:dirty_write(database,{database,DbId,Guid,Key})|| {database,DbId,Guid,X}<-DataList].
write_changelist(Username,Dbname,Timestamp,ChangeList) ->
    Type="contacts",
    mnesia:dirty_write(changelog,{changelog,DbID,Timestamp,ChangeList,Type}).
I suppose that the list DataList is huge and should not be sent all at once from a remote node. It should be sent in small pieces: the client can send the items of the DataList it generates one by one. Also, because this problem occurs during insertion, I think that we should parallelise the list comprehension. We could have a parallel map where, for each item in the list, the insertion is done in a separate process. I also think that something is still wrong with the list comprehension: variable Key is unbound and variable X is unused. Otherwise, the entire methodology probably needs a change. Let's see what others think. Thanks
This error normally occurs when the ERTS memory allocator called binary_alloc has no memory left to allocate for the binary heap. Check the current binary heap size using erlang:system_info/1, erlang:memory() or erlang:memory(binary). If the binary heap size is huge, then run erlang:garbage_collect() to free all non-referenced binary objects in the binary heap. This will free the memory.
If you use long strings (a string is just a list in Erlang) for the vCard or somewhere else, they consume a lot of memory.
If this is the case, change them to binaries to reduce memory usage (use list_to_binary before inserting into Mnesia).
This may not be helpful, because I don't know your data structure (type, length and so on)...

Resources