Apache Beam Dataflow BigQuery streaming insertions out-of-memory error - google-cloud-dataflow

I'm intermittently getting out-of-memory errors in my Dataflow job when inserting data into BigQuery using the Apache Beam SDK for Java 2.29.0.
Here is the stack trace:
Error message from worker: java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:982)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:1022)
org.apache.beam.sdk.io.gcp.bigquery.BatchedStreamingWrite.flushRows(BatchedStreamingWrite.java:375)
org.apache.beam.sdk.io.gcp.bigquery.BatchedStreamingWrite.access$800(BatchedStreamingWrite.java:69)
org.apache.beam.sdk.io.gcp.bigquery.BatchedStreamingWrite$BatchAndInsertElements.finishBundle(BatchedStreamingWrite.java:271)
Caused by: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
java.base/java.lang.Thread.start0(Native Method)
java.base/java.lang.Thread.start(Thread.java:803)
java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:129)
java.base/java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:724)
com.google.api.client.http.javanet.NetHttpRequest.writeContentToOutputStream(NetHttpRequest.java:188)
com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:117)
com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.lambda$insertAll$1(BigQueryServicesImpl.java:906)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$BoundedExecutorService$SemaphoreCallable.call(BigQueryServicesImpl.java:1492)
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base/java.lang.Thread.run(Thread.java:834)
I tried increasing the worker node size, but I'm still seeing the same issue.

I really recommend upgrading your Beam version to 2.42.0 (the latest).
Also check whether you have aggregations such as GroupBy or GroupByKey that are memory-costly inside a worker.
You can also use Dataflow Prime, the latest execution engine for Dataflow; its vertical autoscaling helps prevent errors like OutOfMemory on a worker:
Dataflow Prime
Dataflow Prime can be enabled with a program argument, for example for Beam Java:
--dataflowServiceOptions=enable_prime
Dataflow Prime helps in this case, but you should still review and optimize your job where needed and avoid costly operations where possible (memory leaks, unnecessary aggregations, costly serialization...).
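For reference, a minimal Beam Java sketch of how that flag is typically wired up (the class name is made up, and it assumes the Dataflow runner dependency is on the classpath so the dataflowServiceOptions option is registered):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PrimeEnabledPipeline {
  public static void main(String[] args) {
    // Pass --dataflowServiceOptions=enable_prime together with the usual
    // --runner=DataflowRunner, --project, --region, etc. on the command line;
    // PipelineOptionsFactory parses them generically as long as the Dataflow
    // runner is on the classpath to register the option.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

    Pipeline pipeline = Pipeline.create(options);
    // ... apply the reads, transforms and the BigQuery write here ...
    pipeline.run();
  }
}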

OutOfMemory issues can be very tough to debug because the symptom you see may be totally unrelated to the sources of memory pressure. So your pipeline is throwing this when trying to create a thread in the insertAll method, but it's possible that most of your memory usage is coming from some other part of your pipeline.
There's some in-depth advice on debugging memory issues at https://cloud.google.com/community/tutorials/dataflow-debug-oom-conditions
If the memory pressure is coming from BigQueryIO, take a look at various config options such as maxStreamingRowsToBatch.
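If you prefer to set that from code rather than the command line, here is a minimal sketch; it assumes your Beam version exposes maxStreamingRowsToBatch on the BigQueryOptions pipeline-options interface (it is a pipeline option rather than a setting on BigQueryIO.write() itself), and the value 200 is purely illustrative:

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class TuneStreamingInserts {
  public static void main(String[] args) {
    // Parse the normal pipeline args, then view them as BigQueryOptions so the
    // streaming-insert batch size can be lowered to reduce per-request memory.
    BigQueryOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(BigQueryOptions.class);

    // Smaller batches mean more HTTP requests but less memory held per flush;
    // the right value depends on your row sizes, so measure before settling on one.
    options.setMaxStreamingRowsToBatch(200L);

    // ... create the Pipeline with these options and apply BigQueryIO.write() as usual ...
  }
}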

Related

Lift worker RAM memory limit

I am trying to adjust a build job within Jenkins. The problem is that it keeps failing due to lack of memory; I've adjusted the Java -Xmx setting, but it did not solve the problem.
It turns out I have a RAM limit within the worker. I tried running these commands as part of the build script: "free -m" and "cat /proc/meminfo", and they both confirmed that the job is being run with a 1 GB RAM limit. The server has more, but the build isn't using it, and it keeps failing due to lack of memory.
Please help me fix this: how can I lift that limit? Thank you.
Heap or PermGen?
There are two OutOfMemoryErrors that people usually encounter. The first is related to heap space: java.lang.OutOfMemoryError: Java heap space. When you see this, you need to increase the maximum heap space. You can do this by adding the following to your JVM arguments: -Xmx200m, where you replace 200 with the new heap size in megabytes.
The second is related to PermGen: java.lang.OutOfMemoryError: PermGen space. When you see this, you need to increase the maximum Permanent Generation space, which is used for things like class files and interned strings. You can do this by adding the following to your JVM arguments: -XX:MaxPermSize=128m, where you replace 128 with the new PermGen size in megabytes.
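If you are unsure what limit the build JVM actually ends up with, a tiny check like the following (hypothetical class name) run from within the build will print the effective -Xmx the process sees:

public class JvmMemoryCheck {
  public static void main(String[] args) {
    Runtime rt = Runtime.getRuntime();
    // maxMemory() reflects the effective -Xmx; if this prints roughly 1 GB despite
    // a larger -Xmx, the limit is being imposed from outside the JVM (for example
    // by the agent launcher or a container/cgroup limit), and that is what needs raising.
    System.out.printf("max heap:   %d MB%n", rt.maxMemory() / (1024 * 1024));
    System.out.printf("total heap: %d MB%n", rt.totalMemory() / (1024 * 1024));
    System.out.printf("free heap:  %d MB%n", rt.freeMemory() / (1024 * 1024));
  }
}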
Also note:
Memory Requirements for the Master
The amount of memory Jenkins needs is largely dependent on many factors, which is why the RAM allotted for it can range from 200 MB for a small installation to 70+ GB for a single and massive Jenkins master. However, you should be able to estimate the RAM required based on your project build needs.
Each build node connection will take 2-3 threads, which equals about 2 MB or more of memory. You will also need to factor in CPU overhead for Jenkins if there are a lot of users who will be accessing the Jenkins user interface.
It is generally a bad practice to allocate executors on a master, as builds can quickly overload a master’s CPU/memory/etc and crash the instance, causing unnecessary downtime. Instead, it is advisable to set up agents that the Jenkins master can delegate jobs to, keeping the bulk of the work off of the master itself.
Finally, there is a monitoring plugin from Jenkins that you can use:
https://wiki.jenkins.io/display/JENKINS/Monitoring
Sources:
https://wiki.jenkins.io/display/JENKINS/Monitoring
https://wiki.jenkins.io/display/JENKINS/Builds+failing+with+OutOfMemoryErrors
https://www.jenkins.io/doc/book/hardware-recommendations/#:~:text=The%20amount%20of%20memory%20Jenkins,single%20and%20massive%20Jenkins%20master.

"GC overhead limit exceeded" for long running streaming dataflow job

Running my streaming dataflow job for a longer period of time tends to end up in a "GC overhead limit exceeded" error which brings the job to a halt. How can I best proceed to debug this?
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.google.cloud.dataflow.worker.repackaged.com.google.common.collect.HashBasedTable.create (HashBasedTable.java:76)
at com.google.cloud.dataflow.worker.WindmillTimerInternals.<init> (WindmillTimerInternals.java:53)
at com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.start (StreamingModeExecutionContext.java:490)
at com.google.cloud.dataflow.worker.StreamingModeExecutionContext.start (StreamingModeExecutionContext.java:221)
at com.google.cloud.dataflow.worker.StreamingDataflowWorker.process (StreamingDataflowWorker.java:1058)
at com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000 (StreamingDataflowWorker.java:133)
at com.google.cloud.dataflow.worker.StreamingDataflowWorker$8.run (StreamingDataflowWorker.java:841)
at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
at java.lang.Thread.run (Thread.java:745)
Job ID: 2018-02-06_00_54_50-15974506330123401176
SDK: Apache Beam SDK for Java 2.2.0
Scio version: 0.4.7
I've run into this issue a few times. My approach typically starts with trying to isolate the transform step that is causing the memory error in Dataflow. It's a longer process, but you can usually make an educated guess about which transform is problematic. Remove that transform, execute the pipeline, and check whether the error persists.
Once I've determined the problematic transform, I start looking at its implementation for memory inefficiencies. This is usually related to initializing objects per element (memory allocation) or to a design where a transform has a very high fanout and creates a large amount of output, but it can be something as trivial as string manipulation.
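As an illustration of the per-element allocation case (not the asker's actual pipeline; the Gson usage and the "id" field are made up), the typical fix is to build expensive objects once in @Setup instead of once per element:

import com.google.gson.Gson;
import com.google.gson.JsonObject;
import org.apache.beam.sdk.transforms.DoFn;

// Illustrative only: the point is the one-time @Setup initialization.
class ParseJsonFn extends DoFn<String, String> {
  private transient Gson gson;

  @Setup
  public void setup() {
    // Built once per DoFn instance rather than once per element; constructing
    // parsers, clients or formatters inside @ProcessElement is a common source
    // of GC pressure in high-throughput streaming jobs.
    gson = new Gson();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // Reuse the parser and emit a single field as an example output.
    c.output(gson.fromJson(c.element(), JsonObject.class).get("id").getAsString());
  }
}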
From here, it's just a matter of continuing to isolate the issue. Dataflow does have memory limitations. You could potentially increase the hardware of the Compute Engine instances backing the workers, but this isn't a scalable solution.
You should also consider implementing the pipeline using only Apache Beam Java. This will rule out Scio as the issue, although that usually isn't the case.

What do KilledWorker exceptions mean in Dask?

My tasks are returning with KilledWorker exceptions when using Dask with the dask.distributed scheduler. What do these errors mean?
This error is generated when the Dask scheduler no longer trusts your task, because it was present too often when workers died unexpectedly. It is designed to protect the cluster against tasks that kill workers, for example by segfaults or memory errors.
Whenever a worker dies unexpectedly the scheduler notes which tasks were running on that worker when it died. It retries those tasks on other workers but also marks them as suspicious. If the same task is present on several workers when they die then eventually the scheduler will give up on trying to retry this task, and instead marks it as failed with the exception KilledWorker.
Often this means that your task has some other issue. Perhaps it causes a segmentation fault or allocates too much memory. Perhaps it uses a library that is not threadsafe. Or perhaps it is just very unlucky. Regardless, you should inspect your worker logs to determine why your workers are failing. This is likely a bigger issue than your task failing.
You can control this behavior by modifying the following entry in your ~/.config/dask/distributed.yaml file.
allowed-failures: 3 # number of retries before a task is considered bad

"The Dataflow appears to be stuck" for a job usually working

So I had a job running to download some files, and it usually takes about 10 minutes. This one ran for more than an hour before it finally failed with the following (and only) error message:
Workflow failed. Causes: (3f03d0279dd2eb98): The Dataflow appears to be stuck. Please reach out to the Dataflow team at http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
So here I am :-)
The jobId: 2017-08-29_13_30_03-3908175820634599728
Just out of curiosity, will we be billed for the hour of stuckness? And what was the problem?
I'm working with Dataflow-Version 1.9.0
Thanks Google Dataflow Team
It seems as though the job had all its workers spending almost all of their time doing Java garbage collection (close to 100%, with roughly 7-second full GCs occurring every ~7 seconds).
Your next best step is to get a heap dump of the job by logging into one of the machines and using jmap. Use a heap dump analysis tool to inspect where all the memory is allocated. It is best to compare the heap dump of a properly functioning job against the heap dump of a broken job. If you would like further help from Google, feel free to contact Google Cloud Support and share this SO question and the heap dumps. This would be especially useful if you suspect the issue is somewhere within Google Cloud Dataflow.
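For reference, on the worker VM the dump can usually be taken with something like jmap -dump:live,format=b,file=/tmp/heap.hprof <pid-of-the-java-harness-process>; the resulting .hprof file can then be opened in a heap analysis tool such as Eclipse MAT or VisualVM.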

Memory profiling on Google Cloud Dataflow

What would be the best way to debug memory issues of a dataflow job?
My job was failing with a GC OOM error, but when I profile it locally I cannot reproduce the exact scenarios and data volumes.
I'm running it now on 'n1-highmem-4' machines, and I don't see the error anymore, but the job is very slow, so obviously using a machine with more RAM is not the solution :)
Thanks for any advice,
G
Please use the options --dumpHeapOnOOM and --saveHeapDumpsToGcsPath (see the docs).
This will only help if one of your workers actually OOMs. Additionally, you can try running jmap -dump <PID> against the harness process on the worker to obtain a heap dump at runtime if it isn't OOMing but you nevertheless observe high memory usage.
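Both debug options can also be set programmatically; here is a minimal sketch, assuming they are exposed on DataflowPipelineDebugOptions in your Beam version (the bucket path is made up):

import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class OomDumpOptions {
  public static void main(String[] args) {
    // View the parsed options as the Dataflow debug options interface so the
    // heap-dump flags mentioned above can be set in code as well.
    DataflowPipelineDebugOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DataflowPipelineDebugOptions.class);

    options.setDumpHeapOnOOM(true);
    // Use a GCS location your job's service account can write to.
    options.setSaveHeapDumpsToGcsPath("gs://my-bucket/heap-dumps");

    // ... create the Pipeline with these options as usual ...
  }
}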
