Java OutOfMemoryError using PubsubIO - google-cloud-dataflow

I am writing a simple Dataflow pipeline in Java:
PubsubIO -> ConvertToTableRowDoFn -> BigQueryIO
The pipeline is working -- data arrives in BigQuery as expected -- but I'm seeing OutOfMemoryErrors in the Dataflow worker logs.
One experiment I tried is slowing down the ConvertToTableRowDoFn by adding Thread.sleep(100). I was thinking that this would make the batch sizes smaller for BigQueryIO, but to my surprise, this made the OutOfMemoryErrors more frequent!
This makes me think that something in PubsubIO is reading data too quickly or doing too much buffering. Any tips for how to investigate this, or pointers on how PubsubIO does buffering in the Google Dataflow environment?

Recompiled Beam with FILE_TRIGGERING_RECORD_COUNT = 100000 instead of 500000, and we haven't seen any OOMs since.
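For reference, here is a hedged sketch of that change; it assumes the constant lives in org.apache.beam.sdk.io.gcp.bigquery.BatchLoads, and the exact file and default value should be verified against the Beam release you rebuild:
// One-line patch applied before rebuilding the Beam SDK used by the pipeline.
// Lowering the threshold makes BigQueryIO flush buffered rows to load files more
// often, so each worker holds fewer records in memory at a time.
//   before: static final int FILE_TRIGGERING_RECORD_COUNT = 500000;
//   after:  static final int FILE_TRIGGERING_RECORD_COUNT = 100000;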

Related

How can I debug why my Dataflow job is stuck?

I have a Dataflow job that is not making progress - or it is making very slow progress, and I do not know why. How can I start looking into why the job is slow / stuck?
The first resource you should check is the Dataflow documentation. These pages are particularly useful:
Troubleshooting your Pipeline
Common error guidance
If these resources don't help, I'll summarize some reasons why your job may be stuck and how you can debug them, separated by which part of the system is causing the trouble. Your job may be:
Job stuck at startup
A job can get stuck being received by the Dataflow service, or starting up new Dataflow workers. Some risk factors for this are:
Did you add a custom setup.py file?
Do you have any dependencies that require a special setup on worker startup?
Are you manipulating the worker container?
To debug this sort of issue I usually open StackDriver logging and look for the worker-startup logs. These logs are written by the worker as it starts up the Docker container with your code and your dependencies. If you see any problem here, it would indicate an issue with your setup.py, your job submission, staged artifacts, etc.
Another thing you can do is to keep the same setup, and run a very small pipeline that stages everything:
import logging
import apache_beam as beam

with beam.Pipeline(...) as p:
    (p
     | beam.Create(['test element'])
     | beam.Map(lambda x: logging.info(x)))
If you don't see your logs in StackDriver, then you can continue to debug your setup. If you do see the log in StackDriver, then your job may be stuck somewhere else.
Job seems stuck in user code
Something else that could happen is that your job is performing some operation in user code that is stuck or slow. Some risk factors for this are:
Is your job performing operations that require you to wait for them? (e.g. loading data to an external service, waiting for promises/futures)
Note that some of the built-in transforms of Beam do exactly this (e.g. the Beam IOs like BigQueryIO, FileIO, etc).
Is your job loading very large side inputs into memory? This may happen if you are using View.asList for a side input (see the sketch after this list).
Is your job loading very large iterables after GroupByKey operations?
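As a rough illustration of the side-input risk factor above, here is a hedged Java sketch (names like lookupRows and the TableRow element type are placeholders, and imports are omitted) of a DoFn that reads a whole side input materialized with View.asList; if that side input is very large, every worker must hold the full list in memory:
PCollectionView<List<TableRow>> lookupView =
    lookupRows.apply("SideInputAsList", View.asList());

mainRows.apply("JoinWithSideInput", ParDo.of(new DoFn<TableRow, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // The entire side input list is resident in memory on the worker here.
        List<TableRow> lookup = c.sideInput(lookupView);
        // ... combine c.element() with the lookup data ...
        c.output(c.element());
      }
    }).withSideInputs(lookupView));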
A symptom of this kind of issue can be that the pipeline's throughput is lower than you would expect. Another symptom is seeing the following line in the logs:
Processing stuck in step <STEP_NAME>/<...>/<...> for at least <TIME> without outputting or completing in state <STATE>
.... <a stacktrace> ....
In cases like these it makes sense to look at which step is consuming the most time in your pipeline, and inspect the code for that step, to see what may be the problem.
Some tips:
Very large side inputs can be troublesome, so if your pipeline relies on accessing a very large side input, you may need to redesign it to avoid that bottleneck.
It is possible to make asynchronous requests to external services, but I recommend that you commit / finalize that work in your startBundle and finishBundle calls (see the sketch after these tips).
If your pipeline's throughput is not what you would normally expect, it may be because you don't have enough parallelism. This can be fixed by a Reshuffle, by sharding your existing keys into subkeys (Beam often processes per key, so if you have too few keys, your parallelism will be low), or by using a Combiner instead of GroupByKey + ParDo.
Another reason that your throughput is low may be that your job is waiting too long on external calls. You can try addressing this by trying out batching strategies, or async IO.
In general, there's no silver bullet for improving your pipeline's throughput, and you'll need to experiment.
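To make the tip about asynchronous/external requests concrete, here is a hedged Java sketch (ExternalClient and writeBatch are hypothetical names, and imports are omitted) of a DoFn that buffers elements and commits them to an external service, with finishBundle flushing anything still outstanding before the bundle is considered complete:
class BatchedWriteFn extends DoFn<TableRow, Void> {
  // Hypothetical client for some external service.
  private transient ExternalClient client;
  private transient List<TableRow> buffer;
  private static final int MAX_BATCH_SIZE = 500;

  @Setup
  public void setup() {
    client = ExternalClient.create();
  }

  @StartBundle
  public void startBundle() {
    buffer = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(@Element TableRow row) {
    buffer.add(row);
    if (buffer.size() >= MAX_BATCH_SIZE) {
      flush();
    }
  }

  @FinishBundle
  public void finishBundle() {
    // Commit everything buffered in this bundle before the runner
    // considers these elements processed.
    flush();
  }

  private void flush() {
    if (!buffer.isEmpty()) {
      client.writeBatch(buffer);  // hypothetical blocking call that commits the batch
      buffer.clear();
    }
  }
}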
The data freshness or system lag is increasing
First of all, I'd recommend you check out this presentation on watermarks.
For streaming, the advance of the watermarks is what drives the pipeline to make progress, thus, it is important to be watchful of things that could cause the watermark to be held back, and stall your pipeline downstream. Some reasons why the watermark may become stuck:
One possibility is that your pipeline is hitting an unresolvable error condition. When a bundle fails processing, your pipeline will continue to attempt to execute that bundle indefinitely, and this will hold the watermark back.
When this happens, you will see errors in your Dataflow console, and the error count will keep climbing as the bundle is retried.
You may have a bug when associating timestamps with your data. Make sure that the resolution of your timestamp data is the correct one (see the sketch after this list)!
Although unlikely, it is possible that you've hit a bug in Dataflow. If neither of the other tips helps, please open a support ticket.
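To illustrate the timestamp-resolution point, here is a hedged Java sketch (MyEvent and its epochSeconds field are made-up names, imports omitted); Beam timestamps are Joda Instant values in epoch milliseconds, so passing epoch seconds by mistake stamps every element near January 1970 and plays havoc with windowing and watermark handling:
// Buggy version (for contrast): epoch seconds interpreted as milliseconds.
// events.apply(WithTimestamps.of((MyEvent e) -> new Instant(e.epochSeconds)));

// Fixed version: convert seconds to milliseconds before building the Instant.
PCollection<MyEvent> stamped = events.apply(
    "AssignEventTimestamps",
    WithTimestamps.of((MyEvent e) -> new Instant(e.epochSeconds * 1000L)));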

What can cause Data Freshness to keep increasing in Dataflow?

We have a Dataflow job that has a low system latency and a high "data freshness" (or "data watermark lag").
After upgrading to Beam 2.15 (from 2.12) we see that this metric keeps increasing, which would suggest that something is stuck in the pipeline. However, this is not the case, as all data was consumed (from a PubSub subscription). Permissions also seem ok, as we can consume (unless that is not enough?).
We also checked individual watermarks on all components of the pipeline, and they are ok (very recent).
Thanks!
This is indeed quite odd. Here are some reasons why you might be seeing this:
There may be a bug in a new Beam SDK, or in Dataflow when estimating the watermark.
It may be that you updated the topology of your pipeline, and hit a bug related to watermark calculation for old/new topology.
The job may indeed be stuck, and you may have missed some data that actually did not make it across the pipeline.
My advice, if you're seeing this, is to open a support case with Dataflow support.

Debugging slow reads from BigQuery on Google Cloud Dataflow

Background:
We have a really simple pipeline which reads some data from BigQuery (usually ~300MB), filters/transforms it, and puts it back into BigQuery. In 99% of cases this pipeline finishes in 7-10 minutes and is then restarted again to process a new batch.
Problem:
Recently, the job has started to take >3h once in a while, maybe 2 times in a month out of 2000 runs. When I look at the logs, I can't see any errors and in fact it's only the first step (read from BigQuery) that is taking so long.
Does anyone have a suggestion on how to approach debugging of such cases? Especially since it's really the read from BQ and not any of our transformation code. We are using Apache Beam SDK for Python 0.6.0 (maybe that's the reason!?)
Is it maybe possible to define a timeout for the job?
This is an issue on either Dataflow side or BigQuery side depending on how one looks at it. When splitting the data for parallel processing, Dataflow relies on an estimate of the data size. The long runtime happens when BigQuery sporadically gives a severe under-estimate of the query result size, and Dataflow, as a consequence, severely over-splits the data and the runtime becomes bottlenecked by the overhead of reading lots and lots of tiny file chunks exported by BigQuery.
On one hand, this is the first time I've seen BigQuery produce such dramatically incorrect query result size estimates. However, as size estimates are inherently best-effort and can in general be arbitrarily off, Dataflow should control for that and prevent such oversplitting. We'll investigate and fix this.
The only workaround that comes to mind meanwhile is to use the Java SDK: it uses quite different code for reading from BigQuery that, as far as I recall, does not rely on query size estimates.

BigQueryIO Read performance using withTemplateCompatibility

Apache Beam 2.1.0 had a bug with template pipelines that read from BigQuery, which meant they could only be executed once. More details here: https://issues.apache.org/jira/browse/BEAM-2058
This has been fixed with the release of Beam 2.2.0: you can now read from BigQuery using the withTemplateCompatibility option, and your template pipeline can be run multiple times.
pipeline
    .apply("Read rows from table.",
        BigQueryIO.readTableRows()
            .withTemplateCompatibility()
            .from("<your-table>")
            .withoutValidation())
This implementation seems to come with a huge performance cost to the BigQueryIO read operation: batch pipelines that used to run in 8-11 minutes are now consistently taking 45-50 minutes to complete. The only difference between the two pipelines is the .withTemplateCompatibility().
I am trying to understand the reasons for the huge drop in performance and whether there is any way to improve it.
Thanks.
Solution: based on jkff's input.
pipeline
    .apply("Read rows from table.",
        BigQueryIO.readTableRows()
            .withTemplateCompatibility()
            .from("<your-table>")
            .withoutValidation())
    .apply("Reshuffle", Reshuffle.viaRandomKey())
I suspect this is due to the fact that withTemplateCompatibility comes at the cost of disabling dynamic rebalancing for this read step.
I would expect it to have significant impact only if you're reading a small or moderate amount of data, but performing very heavy processing on it. In this case, try adding a Reshuffle.viaRandomKey() onto your BigQueryIO.read(). It will materialize a temporary copy of the data, but will parallelize downstream processing much better.

Cloud Dataflow Streaming continuously failing to insert

My Dataflow pipeline functions like so:
Read from Pubsub
Transform the data into rows
Write the rows to BigQuery
On occasion, data is passed which fails to insert. That is alright; I know the reason for the failure. But Dataflow continuously attempts to insert this data over and over and over. I would like to limit the number of retries, as it bloats the worker logs with irrelevant information, making it extremely difficult to troubleshoot the problem when the same error repeatedly appears.
When running the pipeline locally I get:
no evaluator registered for Read(PubsubSource)
I would love to be able to test the pipeline locally, but it does not seem that Dataflow supports this option with PubSub.
To clear the errors, I am left with no choice but to cancel the pipeline and run a new job on Google Cloud, which costs time and money. Is there a way to limit the errors? Is there a way to test my pipeline locally? Is there a better approach to debugging the pipeline?
Job ID: 2017-02-08_09_18_15-3168619427405502955
To run the pipeline locally with unbounded data sets, per Pablo's suggestion, use the InProcessPipelineRunner:
dataflowOptions.setRunner(InProcessPipelineRunner.class);
Running the program locally has allowed me to handle errors with exceptions and optimize my workflow rapidly.
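For completeness, a slightly fuller hedged sketch of that setup using the Dataflow 1.x SDK classes the question is based on (in current Apache Beam the equivalent local runner is the DirectRunner); check the option and package names against the SDK version you actually use:
// Build options from command-line args, then force local in-process execution.
DataflowPipelineOptions dataflowOptions =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
dataflowOptions.setRunner(InProcessPipelineRunner.class);
// The pipeline now runs on the local machine, so failed inserts and other exceptions
// surface directly instead of being retried endlessly by the service.
Pipeline p = Pipeline.create(dataflowOptions);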

Resources