Talend: memory issues when working with big files

Before the admins start eating me alive, I would like to say in my defense that I cannot comment on the original posts because I do not have enough reputation, so I have to ask about this again.
I am having issues running a job in Talend (Open Studio for Big Data). I have a 3 GB file, which I do not consider too much since my computer has 32 GB of RAM.
When trying to run my job, I first got a heap memory error, then it changed to a garbage collector error, and now it does not even give me an error (it just does nothing and then stops).
I found these solutions:
a) Talend performance
#Kailash commented that parallelization is only available if you are subscribed to one of the Talend Platform solutions. My comment/question: so is there no other way to parallelize a job over a 3 GB file?
b) Talend 10 GB input and lookup out of memory error
#54l3d mentioned that one option is to split the lookup file into manageable chunks (maybe 500 MB each), then perform the join in stages, one chunk at a time. My comment/cry for help/question: how can I do that? I do not know how to split the lookup; can someone explain this to me a little more graphically (see the sketch after this list)?
c) How to push a big file data in talend?
just to mention that I also went through the "c" but I don't have any comment about it.
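For reference, the split described in (b) can be as simple as a line-based split of the lookup file; a minimal sketch in Python, where the chunk size and file names are placeholders:

CHUNK_BYTES = 500 * 1024 * 1024   # roughly the "500M" suggested above

def split_lookup(path, prefix="lookup_part_"):
    part, size = 0, 0
    out = open(f"{prefix}{part}.csv", "w")
    with open(path) as src:
        for line in src:
            if size >= CHUNK_BYTES:      # current chunk is full, start a new file
                out.close()
                part += 1
                size = 0
                out = open(f"{prefix}{part}.csv", "w")
            out.write(line)
            size += len(line)
    out.close()
    # Each lookup_part_N.csv can then be joined against the main flow in turn.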
The job I am performing (thanks to #iMezouar) looks like this:
1) I have a MySQLInput component reading from a MySQL database (3 GB of data).
2) I used tFirstRows to make the process easier (not working).
3) I used tSplitRow to transform the data from many similar columns into only one column.
4) MySQLOutput
Thanks again for reading, and double thanks for answering.

From what I understand, your query returns a lot of data (3 GB), and that is causing an error in your job. I suggest the following:
1. Filter data on the database side: replace tSampleRow with a WHERE clause in your tMysqlInput component, so that fewer rows are retrieved into Talend.
2. By default, the MySQL JDBC driver retrieves all data into memory, so you need to use the stream option in tMysqlInput's advanced settings in order to stream rows instead.
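Outside of Talend, the same two ideas look roughly like this in Python (a sketch using pymysql's server-side SSCursor, which streams rows instead of buffering the whole result set; the connection details, table, and process() are placeholders):

import pymysql
import pymysql.cursors

conn = pymysql.connect(host="localhost", user="user", password="secret",
                       database="mydb",
                       cursorclass=pymysql.cursors.SSCursor)  # server-side cursor
try:
    with conn.cursor() as cur:
        # The WHERE clause filters on the database side, so fewer rows reach the client.
        cur.execute("SELECT id, payload FROM big_table WHERE created_at >= %s",
                    ("2015-01-01",))
        for row in cur:      # rows are streamed one at a time, not loaded all at once
            process(row)     # placeholder for the per-row logic
finally:
    conn.close()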

Related

Dask - Understanding diagnostics - memory:list

I am working on a fairly complex application that uses the Dask framework, and I am trying to increase its performance. To that end, I am looking at the diagnostics dashboard. I have two use cases: in the first, I have a 1 GB Parquet file split into 50 parts; in the second, I have the first part of the above file, split over 5 parts, which is what was used for the following charts:
The red node is called "memory:list" and I do not understand what it is.
When running the bigger input this seems to block the whole operation.
Finally this is what I see when I go inside those nodes:
I am not sure where I should start looking to understand what is generating this memory:list node, especially given that there is no stack button inside the task, as there often is. Any suggestions?
Red nodes are in memory. So this computation has occurred, and the result is sitting in memory on some machine.
It looks like the type of the piece of data is a Python list object. Also, the name of the task is list-159..., so probably this is the result of calling the list Python function.
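For illustration, here is a minimal sketch (the cluster setup and numbers are made up) of how tasks keyed list-... end up as red, in-memory nodes on the dashboard:

from dask import delayed
from dask.distributed import Client

client = Client()  # local cluster; dashboard at http://localhost:8787

# Wrapping the built-in list in delayed() produces tasks keyed "list-<hash>".
parts = [delayed(range)(i * 10, (i + 1) * 10) for i in range(5)]
as_lists = [delayed(list)(p) for p in parts]

# persist() runs the tasks and keeps their results (Python list objects) in
# worker memory -- those are the red "memory" nodes on the graph page.
held = client.persist(as_lists)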

Import of large dataset in Neo4j taking really long (>12 hours) with Neo4j import tool

I have a large dataset (about 1B nodes and a few billion relationships) that I am trying to import into Neo4j, using the Neo4j import tool. The nodes finished importing in an hour; however, since then the importer has been stuck in the node index preparation phase (unless I am reading the output below incorrectly) for over 12 hours.
...
Available memory:
Free machine memory: 184.49 GB
Max heap memory : 26.52 GB
Nodes
[>:23.39 MB/s---|PROPERTIE|NODE:|LAB|*v:37.18 MB/s---------------------------------------------] 1B
Done in 1h 7m 18s 54ms
Prepare node index
[*SORT:11.52 GB--------------------------------------------------------------------------------]881M
...
My question is: how can I speed this up? I am considering the following:
1. Split up the import command for nodes and relationships and do the nodes import.
2. Create indexes on the nodes
3. Do merge/match to get rid of dupes
4. Do rels import.
Will this help? Is there something else I should try? Is the heap size too large (I think not, but would like an opinion)?
Thanks.
UPDATE
I also tried importing exactly half the data on the same machine, and it gets stuck in that phase again after (proportionally) roughly the same amount of time. So I have mostly ruled out disk space and memory as the issue.
I have also checked my headers (since I noticed that other people ran into this problem when they had incorrect headers) and they seem correct to me. Any suggestions on what else I should be looking at?
FURTHER UPDATE
OK, so now it is getting kind of ridiculous. I reduced my data size down to just one large file (about 3 GB). It only contains nodes of a single kind and only has IDs. So the data looks something like this
1|Author
2|Author
3|Author
and the header (in a separate file) looks like this
authorID:ID(Author)|:LABEL
And my import still gets stuck in the sort phase. I am pretty sure I am doing something wrong here, but I really have no clue what. Here is the command line I invoke it with:
/var/lib/neo4j/bin/neo4j-import --into data/db/graph.db --id-type string --delimiter "|" \
--bad-tolerance 1000000000 --skip-duplicate-nodes true --stacktrace true --ignore-empty-strings true \
--nodes:Author "data/author/author_header_label.csv,data/author/author_half_label.csv.gz"
Most of the options for bad-tolerance and skip-duplicate-nodes are there to see if I can make it get through the import somehow at least once.
I think I found the issue. I was following some of the tips at http://neo4j.com/developer/guide-import-csv/#_super_fast_batch_importer_for_huge_datasets, where it says I can re-use the same csv file with different headers -- once for nodes and once for relationships. I underestimated how one-to-many my data is, which produced a lot of duplicate IDs, and that stage was almost entirely spent trying to sort and then dedupe them. Reworking my queries to extract the data into separate node and relationship files fixed the problem. Thanks for looking into this!
So basically, having separate files for each type of node and relationship gives the fastest results (at least in my tests).
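As an illustration of that split, deduplicating the node IDs into their own file before import could look like the sketch below (Python; the file names and the in-memory set are placeholders -- for billions of rows you would dedupe in the source query or sort externally instead, as was done above):

import csv

seen = set()
with open("authors_raw.csv", newline="") as src, \
     open("author_nodes.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="|")
    writer = csv.writer(dst, delimiter="|")
    for author_id, label in reader:
        if author_id not in seen:   # drop the duplicate IDs that blow up the sort/dedupe phase
            seen.add(author_id)
            writer.writerow([author_id, label])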
Have a look at the batch importer I wrote for a stress test:
https://github.com/graphaware/neo4j-stress-test
I used both a neo4j index and an in-memory map between two commits. It is really fast and works with both versions of neo4j.
Ignore the tests and get the batch importer.

Garbage collection tuning/performance degradation for neo4j bulk .csv import

I am running a bulk import of data into a neo4j instance running in server mode (I have run against the 2.2.0 community and enterprise editions as well as 2.1.7 community). My application creates a bunch of nodes in memory and periodically stops to write a series of .csv files and send Cypher to the neo4j instance to upload the files. (This was done because of performance issues with running the application through the plain old REST API.)
Overall, I'm looking to upload something like 150-5000 million nodes, so this is, in principle, the type of thing that neo4j claims to be able to handle relatively well.
Well, anyway, what I'm noticing when I run this against production data is that the application runs in two states: one where the csv upload processes between 2k-8k nodes per second, and one where it processes between 80-200 nodes per second. The two states are interwoven when you look at the upload as a time series, and as time goes on it spends increasingly long stretches in the slow state.
Nodes are created through a series of
MERGE (:{NODE_TYPE} {csvLine.key = n.primaryKey}) on create set [PROPERTY LIST];
statements, and I have indexes on everything I'm merging against. Because the slowdown is not linear but bimodal, this does not feel like a degradation in the insert statements; it feels like garbage collection pauses in the neo4j instance. What is the best way to tune the neo4j JVM garbage collector for frequent bulk inserts?
neo4j.properties:
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
#neostore.relationshipgroupstore.db.mapped_memory=10M
neostore.propertystore.db.mapped_memory=100M
#neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M
neo4j-wrapper.conf:
wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional=-XX:-OmitStackTraceInFastThrow
wrapper.java.additional=-XX:hashCode=5
wrapper.java.initmemory=8194
wrapper.java.maxmemory=8194
This felt like the sweet spot for both the overall heap memory and the neostore stuff. Increasing the overall heap degraded performance. That said, the neo4j garbage collection logs frequently have that GC (Allocation Failure) message.
EDIT: in response to Michael Hunger:
The machine has 64 GB of RAM, and nothing seems to be maxed out. It also seems like only a small number of cores are in use at any time. Garbage collector profiling shows that the GC runs quite frequently.
The exact cypher statements are, for example:
USING PERIODIC COMMIT 110000
LOAD CSV WITH HEADERS FROM 'file:///home/jschirmer/Event_2015_4_773476.csv' AS csvLine
MERGE (s:Event {primaryKey: csvLine.primaryKey})
ON CREATE SET s.checkSum = csvLine.checkSum,
              s.epochTime = toInt(csvLine.epochTime),
              s.epochTimeCreated = toInt(csvLine.epochTimeCreated),
              s.epochTimeUpdated = toInt(csvLine.epochTimeUpdated),
              s.eventDescription = csvLine.eventDescription,
              s.fileName = csvLine.fileName,
              s.ip = csvLine.ip,
              s.lineNumber = toInt(csvLine.lineNumber),
              s.port = csvLine.port,
              s.processPid = csvLine.processPid,
              s.rawEventLine = csvLine.rawEventLine,
              s.serverId = csvLine.serverId,
              s.status = toInt(csvLine.status);
USING PERIODIC COMMIT 110000
LOAD CSV WITH HEADERS FROM 'file:///home/jschirmer/Event__File_2015_4_773476.csv' AS csvLine
MATCH (n:SC_CSR {primaryKey: csvLine.Event_id}), (s:File {fileName: csvLine.File_id})
MERGE n-[:DATA_SOURCE]->s;
Though there are several such statements being run.
I have tried a single concurrent transaction as well as running several (~3) such statements in parallel (which gives roughly a 2x improvement). I have tried tuning the periodic commit frequency and the size of the file. Performance seems to be maximized when the csv file is roughly 100k lines, which means that, with a commit size of 110000, the periodic commit effectively never kicks in.
I have not run PROFILE on the statements. I will do that, but I thought the eager merge problem was avoided by using MERGE ... ON CREATE statements.
In general your config looks OK. How much RAM does your machine have?
For the things you merge against, I'd recommend a constraint instead of an index.
What's your tx size? And how many concurrent tx do you run?
Instead of your generic merge statement (which wouldn't compile), can you share the concrete statements?
Did you profile the statements? Perhaps you are running into the eager pipe problem:
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
Do you use periodic commit?
How large are your CSV files?
See: http://neo4j.com/developer/guide-import-csv/
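As a concrete illustration of the constraint suggestion, a minimal sketch using the Neo4j Python driver (Bolt needs Neo4j 3.0+; on the 2.x versions in the question you would run the same Cypher from the browser or shell, and the URI/credentials here are placeholders):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    # A uniqueness constraint also creates the backing index, so
    # MERGE (s:Event {primaryKey: ...}) can use a direct unique lookup.
    session.run("CREATE CONSTRAINT ON (e:Event) ASSERT e.primaryKey IS UNIQUE")
driver.close()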

Neo4j In memory configurations, multithreading, and slow writes

How do I improve performance when writing to neo4j? I currently have neo4j set up on a server and am running it in embedded mode. Based on configurations I've found online, I believe my settings keep all the content of my graph database in memory:
neostore.nodestore.db.mapped_memory=0
neostore.relationship.db.mapped_memory=0
neostore.propertystore.db.mapped_memory=0
neostore.propertystore.db.strings.mapped_memory=0
neostore.propertystore.db.arrays.mapped_memory=0
neostore.propertystore.db.index.keys.mapped_memory=0
neostore.propertystore.db.index.mapped_memory=0
node_auto_indexing=true
node_keys_indexable=type,id
cache_type=strong
use_memory_mapped_buffers=false
node_cache_size=12G
relationship_cache_size=12G
node_cache_array_fraction=10
relationship_cache_array_fraction=10
Please let me know if this is incorrect. The problem I am encountering is that when I try to persist information to the graph database, the times are not very quick compared to our MySQL times for the same thing (e.g. adding 250 items takes about 3 s, while in MySQL it takes 1 s). I read online that having multiple indexes can slow down persisting data, so I am looking into that now to see if it is the culprit. But I just wanted to make sure my configuration is in line for running the graph database in memory.
Second question on this topic: if my configuration is good and my database is indeed in memory, is there a way to optimize persisting data, in case this isn't the silver bullet? If we run one thread against our test that executes this functionality, as opposed to 10 threads, the execution times seem to bubble up,
e.g. thread 1 finishes in 1 s, thread 2 in 2 s, thread 3 in 3 s, etc. Is there some special multithreading configuration that I am missing to improve performance when multiple threads are hitting it at one time?
Neo4J version
1.9.1-enterprise
My Jvm configs are
-Xms25G -Xmx25G -XX:+UseNUMA -XX:+UseSerialGC
My Machine Specs:
File system type ext3
Your cache arguments are invalid.
node_cache_size=12G
relationship_cache_size=12G
node_cache_array_fraction=10
relationship_cache_array_fraction=10
These can only be used with the GCR cache. Setting the cache isn't going to put everything in memory for you at startup; you will have to write code to do that yourself. Something like this:
// Warm the caches by touching every node, its properties, and its relationships.
// ("graphDb" is the GraphDatabaseService instance your application already has.)
GlobalGraphOperations ggo = GlobalGraphOperations.at(graphDb);
for (Node n : ggo.getAllNodes()) {
    for (String propertyKey : n.getPropertyKeys()) {
        n.getProperty(propertyKey);          // pull each property into the cache
    }
    for (Relationship relationship : n.getRelationships()) {
        // iterating is enough to load the relationships
    }
}
Beware of the strong cache: if you have a lot of nodes/relationships, your cache will eventually become large, and GC against it will cause long pauses in your system.
My recommendation would be to use the memory-mapped files, as they are handled by the OS and live outside of heap space. They don't provide anywhere near the speed of the caches, but they will provide a speed-up when you have to read from the neo store.
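For example, the zeroed-out settings from the question could be turned back into memory-mapped store files with something like the following (the sizes are only illustrative, not tuned for any particular dataset):

use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=1G
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=1G
neostore.propertystore.db.strings.mapped_memory=500M
neostore.propertystore.db.arrays.mapped_memory=100M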

importing and processing data from a CSV File in Delphi

I had a pre-interview task, which I completed, and the solution works; however, I was marked down and did not get an interview because I used a TADODataset. I basically imported a CSV file to populate the dataset; the data had to be processed in a specific way, so I used filtering and sorting of the dataset to make sure the data was ordered the way I wanted, and then did the logic processing in a while loop. The feedback I received said this was bad because it would be very slow for large files.
My main question here is: if using an in-memory dataset is slow for processing large files, what would have been a better way to access the information from the CSV file? Should I have used string lists or something like that?
It really depends on how "big" the file is and on the resources (in this case RAM) available for the task.
"The feedback that was received said that this was bad as it would be very slow for large files."
CSV files are usually used for moving data around (in most cases that I've encountered, files range from ~1 MB up to ~10 MB, but that's not to say others would not dump more data in CSV format) without worrying too much (if at all) about import/export, since the format is extremely simple.
Suppose you have an 80 MB CSV file. That's a file you want to process in chunks; otherwise (depending on your processing) you can eat up hundreds of MB of RAM. In that case, what I would do is:
// Sketch: Reader, Chunk, MaxLines and ProcessChunk are placeholders.
Reader := TStreamReader.Create('data.csv', TEncoding.UTF8);
Chunk := TStringList.Create;
try
  while not Reader.EndOfStream do
  begin
    // step 1: read up to MaxLines lines (fewer on the last chunk)
    Chunk.Clear;
    while (not Reader.EndOfStream) and (Chunk.Count < MaxLines) do
      Chunk.Add(Reader.ReadLine);
    // step 2: process the information in this chunk
    ProcessChunk(Chunk);
    // step 3: generate output, database inserts, etc.
  end;
finally
  Chunk.Free;
  Reader.Free;
end;
In the above case, you're not loading 80 MB of data into RAM, only a few hundred KB, and the rest of the memory is available for processing, i.e. linked lists, dynamic insert queries (batch inserts), etc.
"...however I was marked down and did not get an interview due to having used a TADODataset."
I'm not surprised; they were probably looking to see whether you are capable of creating algorithm(s) and providing simple solutions on the spot, without using "ready-made" solutions.
They were probably thinking of seeing you use dynamic arrays and creating one(or more) sorting algorithm(s).
"Should I have used String Lists or something like that?"
The response might have been the same, again, I think they wanted to see how you "work".
The interviewer was quite right.
The correct, scalable and fastest solution on any medium file upwards is to use an 'external sort'.
An 'external sort' is a two-stage process: the first stage splits the file into smaller files that are each sorted and of manageable size; the second stage merges these files back into a single sorted file, which can then be processed line by line.
It is extremely efficient on any CSV file with more than, say, 200,000 lines. The amount of memory the process runs in can be controlled, and thus the danger of running out of memory can be eliminated.
I have implemented many such sort processes and in Delphi would recommend a combination of TStringList, TList and TQueue classes.
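For illustration (in Python rather than Delphi, just to keep the sketch short; the same two-stage structure maps onto the TStringList/TQueue approach above), an external sort looks roughly like this:

import heapq
import os
import tempfile

def external_sort(src_path, dst_path, max_lines=100_000):
    # Stage 1: write sorted chunk files of at most max_lines lines each.
    chunk_paths = []
    with open(src_path) as src:
        while True:
            chunk = [line for _, line in zip(range(max_lines), src)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)
    # Stage 2: k-way merge of the sorted chunks into one sorted output file.
    files = [open(p) for p in chunk_paths]
    try:
        with open(dst_path, "w") as dst:
            dst.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)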
Good Luck
