"GC overhead limit exceeded" for long running streaming dataflow job - google-cloud-dataflow

Running my streaming dataflow job for a longer period of time tends to end up in a "GC overhead limit exceeded" error which brings the job to a halt. How can I best proceed to debug this?
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.google.cloud.dataflow.worker.repackaged.com.google.common.collect.HashBasedTable.create (HashBasedTable.java:76)
at com.google.cloud.dataflow.worker.WindmillTimerInternals.<init> (WindmillTimerInternals.java:53)
at com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.start (StreamingModeExecutionContext.java:490)
at com.google.cloud.dataflow.worker.StreamingModeExecutionContext.start (StreamingModeExecutionContext.java:221)
at com.google.cloud.dataflow.worker.StreamingDataflowWorker.process (StreamingDataflowWorker.java:1058)
at com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000 (StreamingDataflowWorker.java:133)
at com.google.cloud.dataflow.worker.StreamingDataflowWorker$8.run (StreamingDataflowWorker.java:841)
at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
at java.lang.Thread.run (Thread.java:745)
Job ID: 2018-02-06_00_54_50-15974506330123401176
SDK: Apache Beam SDK for Java 2.2.0
Scio version: 0.4.7

I've run into this issue a few times. My approach typically starts with trying to isolate the transform step that is causing the memory error in Dataflow. It's a longer process, but you can usually make an educated guess about which is the problematic transform. Remove the transform, execute the pipeline, and check if the error persists.
Once I've identified the problematic transform, I start looking at its implementation for memory inefficiencies. This is usually related to object initialization (memory allocation), or to a design where a transform has a very high fanout and creates a large amount of output. But it could be something as trivial as string manipulation.
From here, it's just a matter of continuing to narrow down the issue. Dataflow does have memory limitations. You could potentially move the workers to beefier Compute Engine instances (sketched below); however, this isn't a scalable solution.
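If you do try larger workers while you investigate, here is a minimal sketch of setting the machine type from Beam Java; the machine type value is only an example, and I'm using the Dataflow runner's worker-pool option (equivalent to passing --workerMachineType as a program argument):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LaunchWithBiggerWorkers {
  public static void main(String[] args) {
    // Parse the usual --project/--region/--runner flags from the command line.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    // Example only: high-memory workers while isolating the leak.
    // Equivalent to passing --workerMachineType=n1-highmem-8 as a program argument.
    options.setWorkerMachineType("n1-highmem-8");

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline under test here ...
    pipeline.run();
  }
}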
You could also consider implementing the pipeline using only Apache Beam Java, which would rule out Scio as the cause. That usually isn't the case, though.

Related

Apache Beam Dataflow BigQuery streaming insertions out-of-memory error

I'm intermittently getting out-of-memory issues on the Dataflow job when inserting data into BigQuery using Apache Beam SDK for Java 2.29.0.
Here is the stack trace
Error message from worker: java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:982)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:1022)
org.apache.beam.sdk.io.gcp.bigquery.BatchedStreamingWrite.flushRows(BatchedStreamingWrite.java:375)
org.apache.beam.sdk.io.gcp.bigquery.BatchedStreamingWrite.access$800(BatchedStreamingWrite.java:69)
org.apache.beam.sdk.io.gcp.bigquery.BatchedStreamingWrite$BatchAndInsertElements.finishBundle(BatchedStreamingWrite.java:271)
Caused by: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
java.base/java.lang.Thread.start0(Native Method)
java.base/java.lang.Thread.start(Thread.java:803)
java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:129)
java.base/java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:724)
com.google.api.client.http.javanet.NetHttpRequest.writeContentToOutputStream(NetHttpRequest.java:188)
com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:117)
com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.lambda$insertAll$1(BigQueryServicesImpl.java:906)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$BoundedExecutorService$SemaphoreCallable.call(BigQueryServicesImpl.java:1492)
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base/java.lang.Thread.run(Thread.java:834)
I tried increasing the worker node size but am still seeing the same issue.
I really recommend upgrading your Beam version to 2.42.0 (the latest at the time of writing).
Also check whether you have aggregations such as GroupByKey that are costly in memory inside a worker.
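As an illustration of the kind of rewrite that reduces per-worker memory pressure (a sketch, not taken from your pipeline): prefer a combiner over GroupByKey followed by a ParDo that materializes the whole group.

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class CombineInsteadOfGroupByKey {
  // GroupByKey hands each worker the full Iterable for a key, which a downstream
  // ParDo then typically materializes in memory to aggregate.
  // A combiner (here Sum) aggregates incrementally, so the whole group never has
  // to fit in memory at once.
  static PCollection<KV<String, Long>> sumPerKey(PCollection<KV<String, Long>> input) {
    return input.apply(Sum.longsPerKey());
  }
}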
You can also use Dataflow Prime, the latest execution engine for Dataflow, which helps prevent errors like OutOfMemory on a worker through vertical autoscaling (see the Dataflow Prime documentation).
Dataflow Prime can be enabled with a program argument; for example, for Beam Java:
--dataflowServiceOptions=enable_prime
Dataflow Prime helps in this case, but you should still check and optimize your job where needed and avoid costly operations where possible (memory leaks, unnecessary aggregation, costly serialization...).
OutOfMemory issues can be very tough to debug because the symptom you see may be totally unrelated to the sources of memory pressure. So your pipeline is throwing this when trying to create a thread in the insertAll method, but it's possible that most of your memory usage is coming from some other part of your pipeline.
There's some in-depth advice on debugging memory issues at https://cloud.google.com/community/tutorials/dataflow-debug-oom-conditions
If the memory pressure is coming from BigQueryIO, take a look at its various config options, such as maxStreamingRowsToBatch.
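Here is a sketch of dialing that option down from Java. The option name comes from the paragraph above; I'm assuming the usual generated setter on BigQueryOptions, so verify it against your Beam version (the equivalent command-line flag would be --maxStreamingRowsToBatch):

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class TuneStreamingInserts {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // View the options through the BigQuery-specific interface and shrink the
    // per-request batch so each insertAll call buffers fewer rows at a time.
    // Setter name assumed from the option name; the value here is only an example.
    options.as(BigQueryOptions.class).setMaxStreamingRowsToBatch(100L);

    // ... build and run your streaming-insert pipeline with these options ...
  }
}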

How can I debug why my Dataflow job is stuck?

I have a Dataflow job that is not making progress - or it is making very slow progress, and I do not know why. How can I start looking into why the job is slow / stuck?
The first resource you should check is the Dataflow documentation. These pages in particular are useful:
Troubleshooting your Pipeline
Common error guidance
If those resources don't help, I'll try to summarize some reasons why your job may be stuck and how you can debug them, separated by which part of the system is causing the trouble. Your job may be:
Job stuck at startup
A job can get stuck being received by the Dataflow service, or starting up new Dataflow workers. Some risk factors for this are:
Did you add a custom setup.py file?
Do you have any dependencies that require a special setup on worker startup?
Are you manipulating the worker container?
To debug this sort of issue I usually open StackDriver Logging and look for the worker-startup logs. These logs are written by the worker as it starts up a Docker container with your code and your dependencies. Any problem here points to an issue with your setup.py, your job submission, staged artifacts, etc.
Another thing you can do is to keep the same setup, and run a very small pipeline that stages everything:
import logging
import apache_beam as beam

with beam.Pipeline(...) as p:  # pass your usual pipeline options here
    (p
     | beam.Create(['test element'])
     | beam.Map(lambda x: logging.info(x)))
If you don't see your logs in StackDriver, then you can continue to debug your setup. If you do see the log in StackDriver, then your job may be stuck somewhere else.
Job seems stuck in user code
Something else that could happen is that your job is performing some operation in user code that is stuck or slow. Some risk factors for this are:
Is your job performing operations that require you to wait for them? (e.g. loading data to an external service, waiting for promises/futures)
Note that some of the builtin transforms of Beam do exactly this (e.g. the Beam IOs like BigQueryIO, FileIO, etc).
Is your job loading very large side inputs into memory? This may happen if you are using View.AsList for a side input.
Is your job loading very large iterables after GroupByKey operations?
A symptom of this kind of issue can be that the pipeline's throughput is lower than you would expect. Another symptom is seeing the following line in the logs:
Processing stuck in step <STEP_NAME>/<...>/<...> for at least <TIME> without outputting or completing in state <STATE>
.... <a stacktrace> ....
In cases like these it makes sense to look at which step is consuming the most time in your pipeline, and inspect the code for that step, to see what may be the problem.
Some tips:
Very large side inputs can be troublesome, so if your pipeline relies on accessing a very large side input, you may need to redesign it to avoid that bottleneck.
It is possible to make asynchronous requests to external services, but I recommend that you commit / finalize that work in the startBundle and finishBundle calls.
If your pipeline's throughput is not what you would normally expect, it may be because you don't have enough parallelism. This can be fixed by a Reshuffle (see the sketch below), by sharding your existing keys into subkeys (Beam often does processing per key, so if you have too few keys, your parallelism will be low), or by using a Combiner instead of GroupByKey + ParDo.
Another reason that your throughput is low may be that your job is waiting too long on external calls. You can try addressing this by trying out batching strategies, or async IO.
In general, there's no silver bullet for improving your pipeline's throughput, and you'll need to experiment.
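Here is a minimal sketch of the Reshuffle approach mentioned in the tips above; the DoFn is a stand-in for whatever expensive step you have identified:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

public class BreakFusionExample {
  // Placeholder for the slow, per-element work you have identified.
  static class ExpensiveDoFn extends DoFn<String, String> {
    @ProcessElement
    public void process(@Element String element, OutputReceiver<String> out) {
      out.output(element.toUpperCase());
    }
  }

  // Reshuffle.viaRandomKey() redistributes elements across workers and breaks fusion,
  // so the expensive step is no longer limited by the parallelism of the steps before it.
  static PCollection<String> withMoreParallelism(PCollection<String> input) {
    return input
        .apply("RedistributeWork", Reshuffle.viaRandomKey())
        .apply("ExpensiveStep", ParDo.of(new ExpensiveDoFn()));
  }
}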
The data freshness or system lag is increasing
First of all, I'd recommend you check out this presentation on watermarks.
For streaming, the advance of the watermark is what drives the pipeline to make progress; thus, it is important to be watchful of things that could cause the watermark to be held back and stall your pipeline downstream. Some reasons why the watermark may become stuck:
One possibility is that your pipeline is hitting an unresolvable error condition. When a bundle fails processing, your pipeline will continue to attempt to execute that bundle indefinitely, and this will hold the watermark back.
When this happens, you will see errors in your Dataflow console, and the error count will keep climbing as the bundle is retried.
You may have a bug when associating timestamps with your data. Make sure that the resolution of your timestamp data is the correct one (a common slip is sketched after this list)!
Although unlikely, it is possible that you've hit a bug in Dataflow. If neither of the other tips helps, please open a support ticket.
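On the timestamp-resolution point, here is a small sketch of the typical slip. The epoch-seconds input is a made-up example; Beam event time is a Joda Instant measured in milliseconds:

import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Instant;

public class AssignEventTimestamps extends DoFn<Long, Long> {
  @ProcessElement
  public void process(@Element Long epochSeconds, OutputReceiver<Long> out) {
    // Wrong: passes epoch seconds where milliseconds are expected, so every event
    // lands in January 1970 and the watermark logic sees nonsense timestamps:
    //   out.outputWithTimestamp(epochSeconds, new Instant(epochSeconds));

    // Right: convert to milliseconds first.
    out.outputWithTimestamp(epochSeconds, new Instant(epochSeconds * 1000L));
  }
}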

"The Dataflow appears to be stuck" for a job usually working

So I had a job running to download some files, and it usually takes about 10 minutes. This one ran for more than an hour before it finally failed with the following error message, the only one given:
Workflow failed. Causes: (3f03d0279dd2eb98): The Dataflow appears to be stuck. Please reach out to the Dataflow team at http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
So here I am :-)
The jobId: 2017-08-29_13_30_03-3908175820634599728
Just out of curiosity, will we be billed for the hour of stuckness? And what was the problem?
I'm working with Dataflow-Version 1.9.0
Thanks Google Dataflow Team
It seems as though the job had all its workers spending nearly all their time doing Java garbage collection (almost 100%, with full GCs of about 7 seconds occurring roughly every 7 seconds).
Your next best step is to get a heap dump of the job by logging into one of the machines and using jmap. Use a heap-dump analysis tool to inspect where all the memory is being allocated. It is best to compare the heap dump of a properly functioning job against the heap dump of a broken job. If you would like further help from Google, feel free to contact Google Cloud Support and share this SO question and the heap dumps. This would be especially useful if you suspect the issue is somewhere within Google Cloud Dataflow.
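If shelling into the worker to run jmap is awkward, an alternative sketch (a general JVM technique, not something specific to Dataflow) is to trigger a dump from inside the JVM using the JDK's HotSpotDiagnosticMXBean, for example from a debug code path:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
  // Writes an .hprof file for the current JVM. The path is a placeholder and must be
  // writable on the worker; you would then copy the file off the machine for analysis.
  public static void dumpHeap(String path) throws Exception {
    HotSpotDiagnosticMXBean bean =
        ManagementFactory.newPlatformMXBeanProxy(
            ManagementFactory.getPlatformMBeanServer(),
            "com.sun.management:type=HotSpotDiagnostic",
            HotSpotDiagnosticMXBean.class);
    // 'true' limits the dump to live objects, which keeps the file smaller.
    bean.dumpHeap(path, true);
  }
}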

Memory profiling on Google Cloud Dataflow

What would be the best way to debug memory issues of a dataflow job?
My job was failing with a GC OOM error, but when I profile it locally I cannot reproduce the exact scenarios and data volumes.
I'm running it now on 'n1-highmem-4' machines, and I don't see the error anymore, but the job is very slow, so obviously using a machine with more RAM is not the solution :)
Thanks for any advice,
G
Please use the options --dumpHeapOnOOM and --saveHeapDumpsToGcsPath (see the docs).
This will only help if one of your workers actually OOMs. Additionally, if a worker isn't OOMing but you nevertheless observe high memory usage, you can run jmap -dump:format=b,file=<file> <pid> against the harness process on the worker to obtain a heap dump at runtime.
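For completeness, a sketch of wiring those two flags up from Java. I'm assuming the standard setters on DataflowPipelineDebugOptions (passing --dumpHeapOnOOM=true --saveHeapDumpsToGcsPath=gs://... as program arguments is equivalent), and the bucket path is a placeholder:

import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class EnableHeapDumps {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // Setter names assumed from the flag names; the bucket is a placeholder.
    DataflowPipelineDebugOptions debug = options.as(DataflowPipelineDebugOptions.class);
    debug.setDumpHeapOnOOM(true);
    debug.setSaveHeapDumpsToGcsPath("gs://my-bucket/heap-dumps");

    // ... create and run the pipeline with these options ...
  }
}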

Performance when building multiple jobs in Jenkins

In Jenkins I have 100 java projects. Each has its own build file.
Every time, I want to clear the build files and compile all source files again.
Using the bulkbuilder plugin, I tried compiling all the jobs, having 100 jobs run in parallel.
But performance is very bad: individually a job takes 1 minute, while in the batch it takes 20 minutes, and the bigger the batch size, the more time it takes. I am running this on a powerful server, so memory and CPU should not be a problem.
Please suggest how I can overcome this, and what configuration needs to be done in Jenkins.
I am launching Jenkins using the WAR file.
Thanks.
Even though you say you have enough memory and CPU resources, you seem to imply there is some kind of bottleneck when you increase the number of parallel running jobs. I think this is understandable: even though I am not a Java developer, I believe most Java build tools are able to parallelize builds internally, i.e. building a single job may well consume more than one CPU core and quite a lot of memory.
Because of this, I suggest you monitor your build server and experiment with different batch sizes to find the optimal number. You should execute e.g. "vmstat 5" while builds are running and see whether you have any idle CPU left. Also keep an eye on disk I/O: if you increase the batch size but disk I/O does not increase, you are already consuming all of the I/O capacity, and increasing the batch size further probably will not help much.
When you have found the optimal batch size (i.e. how many executors to configure for the build server), you can maybe tweak other things to make things faster:
Try to spend as little time checking out code as possible. Instead of deleting the workspace before the build starts, configure the SCM plugin to remove files that are not under version control. If you use Git, you can use a local reference repo, do a shallow clone, or something like that.
You can also try to speed things up by using SSD disks.
You can get more servers, run Jenkins slaves on them and utilize the cpu and I/O capacity of multiple servers instead of only one.
