How can I debug why my Dataflow job is stuck? - google-cloud-dataflow

I have a Dataflow job that is not making progress - or it is making very slow progress, and I do not know why. How can I start looking into why the job is slow / stuck?

The first resource you should check is the Dataflow documentation. In particular, these pages are useful:
Troubleshooting your Pipeline
Common error guidance
If these resources don't help, I'll try to summarize some reasons why your job may be stuck, and how you can debug it. I'll separate these issues depending on which part of the system is causing the trouble. Your job may be:
Job stuck at startup
A job can get stuck being received by the Dataflow service, or starting up new Dataflow workers. Some risk factors for this are:
Did you add a custom setup.py file?
Do you have any dependencies that require a special setup on worker startup?
Are you manipulating the worker container?
To debug this sort of issue I usually open Stackdriver Logging and look for the worker-startup logs. These logs are written by the worker as it starts up the Docker container with your code and your dependencies. If you see any problem here, it would indicate an issue with your setup.py, your job submission, staged artifacts, etc.
Another thing you can do is to keep the same setup, and run a very small pipeline that stages everything:
import logging
import apache_beam as beam

with beam.Pipeline(...) as p:
    (p
     | beam.Create(['test element'])
     | beam.Map(lambda x: logging.info(x)))
If you don't see your logs in Stackdriver, then you can continue to debug your setup. If you do see the log in Stackdriver, then your job may be stuck somewhere else.
Job seems stuck in user code
Something else that could happen is that your job is performing some operation in user code that is stuck or slow. Some risk factors for this are:
Is your job performing operations that require you to wait for them? (e.g. loading data to an external service, waiting for promises/futures)
Note that some of the builtin transforms of Beam do exactly this (e.g. the Beam IOs like BigQueryIO, FileIO, etc).
Is your job loading very large side inputs into memory? This may happen if you are using View.AsList for a side input.
Is your job loading very large iterables after GroupByKey operations?
A symptom of this kind of issue can be that the pipeline's throughput is lower than you would expect. Another symptom is seeing the following line in the logs:
Processing stuck in step <STEP_NAME>/<...>/<...> for at least <TIME> without outputting or completing in state <STATE>
.... <a stacktrace> ....
In cases like these it makes sense to look at which step is consuming the most time in your pipeline, and inspect the code for that step, to see what may be the problem.
Some tips:
Very large side inputs can be troublesome, so if your pipeline relies on accessing a very large side input, you may need to redesign it to avoid that bottleneck.
It is possible to issue asynchronous requests to external services, but I recommend that you commit / finalize that work in the startBundle and finishBundle calls.
If your pipeline's throughput is not what you would normally expect, it may be because you don't have enough parallelism. This can be fixed with a Reshuffle, by sharding your existing keys into subkeys (Beam often does processing per key, so if you have too few keys, your parallelism will be low), or by using a Combiner instead of GroupByKey + ParDo.
Another reason your throughput may be low is that your job is waiting too long on external calls. You can try addressing this with batching strategies or async IO (see the sketch after these tips).
In general, there's no silver bullet for improving your pipeline's throughput, and you'll need to experiment.
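Here is a rough Java sketch of the batching idea, written against the pre-Beam Dataflow SDK DoFn style used elsewhere on this page (the Python SDK has equivalent start_bundle / finish_bundle hooks). ExternalClient, MyRecord, and the batch size of 500 are hypothetical stand-ins for your own service client, element type, and tuning:
import java.util.ArrayList;
import java.util.List;

// Buffer elements and flush them in batches, finalizing any leftover work in
// finishBundle rather than blocking on one external call per element.
class BatchedWriteFn extends DoFn<MyRecord, Void> {
  private static final int MAX_BATCH_SIZE = 500;   // illustrative value
  private transient List<MyRecord> buffer;
  private transient ExternalClient client;         // hypothetical external-service client

  @Override
  public void startBundle(Context c) {
    buffer = new ArrayList<>();
    client = ExternalClient.create();              // assumption: cheap to create, or cache it statically
  }

  @Override
  public void processElement(ProcessContext c) {
    buffer.add(c.element());                       // no blocking call per element
    if (buffer.size() >= MAX_BATCH_SIZE) {
      client.writeBatch(buffer);                   // one RPC for many elements
      buffer.clear();
    }
  }

  @Override
  public void finishBundle(Context c) {
    if (!buffer.isEmpty()) {
      client.writeBatch(buffer);                   // commit remaining work before the bundle completes
      buffer.clear();
    }
  }
}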
The data freshness or system lag is increasing
First of all, I'd recommend you check out this presentation on watermarks.
For streaming, the advance of the watermark is what drives the pipeline to make progress; thus, it is important to watch for things that could hold the watermark back and stall your pipeline downstream. Some reasons why the watermark may become stuck:
One possibility is that your pipeline is hitting an unresolvable error condition. When a bundle fails processing, your pipeline will continue to attempt to execute that bundle indefinitely, and this will hold the watermark back.
When this happens, you will see errors in your Dataflow console, and their count will keep climbing as the bundle is retried.
You may have a bug when assigning timestamps to your data. Make sure that the resolution of your timestamps is correct (see the sketch after this list)!
Although unlikely, it is possible that you've hit a bug in Dataflow. If neither of the other tips helps, please open a support ticket.
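To illustrate the timestamp point above: a common resolution bug is passing seconds where milliseconds are expected. A rough Beam Java sketch, where MyEvent and getEpochSeconds are hypothetical names:
// Joda-Time Instant takes milliseconds since the epoch. If the source field
// is in seconds and the conversion is forgotten, every element is stamped
// near 1970, which can stall windows or cause data to be dropped as late.
PCollection<MyEvent> stamped = events.apply(
    WithTimestamps.of((MyEvent e) -> new Instant(e.getEpochSeconds() * 1000L)));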

Related

Refusing to split GroupedShuffleRangeTracker proposed split position is out of range

I am sporadically getting the following errors:
W Refusing to split
at '\x00\x00\x00\x15\xbc\x19)b\x00\x01': proposed
split position is out of range
['\x00\x00\x00\x15\x00\xff\x00\xff\x00\xff\x00\xff\x00\x01',
'\x00\x00\x00\x15\xbc\x19)b\x00\x01'). Position of last group
processed was '\x00\x00\x00\x15\xbc\x19)a\x00\x01'.
When it happens, the error is logged every so often and the job never seems to end, although it appears that the work itself did actually complete.
In the last instance I am using 10 workers and have auto scaling disabled. I am using the Python implementation of Apache Beam.
This is not an error, it's part of normal operation of a pipeline. We should probably reduce its logging level to INFO and rephrase it, because it very frequently confuses people.
This message (rather obscurely) signals that Dataflow is trying to apply dynamic rebalancing, but there's no work that can be further subdivided.
I.e. your job is stuck doing something non-parallelizable on a small number of workers, while other workers are staying idle. To investigate this further, one would need to look at the code of your job and the Dataflow job id.

Beam Runner hooks for Throughput-based autoscaling

I'm curious if anyone can point me towards greater visibility into how various Beam Runners manage autoscaling. We seem to be experiencing hiccups during both the 'spin up' and 'spin down' phases, and we're left wondering what to do about it. Here's the background of our particular flow:
1- Binary files arrive on gs://, and object notification duly notifies a PubSub topic.
2- Each file requires about 1Min of parsing on a standard VM to emit about 30K records to downstream areas of the Beam DAG.
3- 'Downstream' components include things like inserts to BigQuery, storage in GS:, and various sundry other tasks.
4- The files in step 1 arrive intermittently, usually in batches of 200-300 every hour, making this - we think - an ideal use case for autoscaling.
What we're seeing, however, has us a little perplexed:
1- It looks like when 'workers=1', Beam bites off a little more than it can chew, eventually causing some out-of-RAM errors, presumably as the first worker tries to process a few of the PubSub messages which, again, take about 60 seconds/message to complete because the 'message' in this case is that a binary file needs to be deserialized in gs.
2- At some point, the runner (in this case, Dataflow with jobId 2017-11-12_20_59_12-8830128066306583836), gets the message additional workers are needed and real work can now get done. During this phase, errors decrease and throughput rises. Not only are there more deserializers for step1, but the step3/downstream tasks are evenly spread out.
3- Alas, the previous step gets cut short when Dataflow senses (I'm guessing) that enough of the PubSub messages are 'in flight' to begin cooling down a little. That seems to come a little too soon, and workers are getting pulled as they chew through the PubSub messages themselves - even before the messages are 'ACK'd'.
We're still thrilled with Beam, but I'm guessing the less-than-optimal spin-up/spin-down phases are resulting in 50% more VM usage than what is needed. What do the runners look for beside PubSub consumption? Do they look at RAM/CPU/etc??? Is there anything a developer can do, beside ACK a PubSub message to provide feedback to the runner that more/less resources are required?
Incidentally, in case anyone doubted Google's commitment to open-source, I spoke about this very topic with an employee there yesterday, and she expressed interest in hearing about my use case, especially if it ran on a non-Dataflow runner! We hadn't yet tried our Beam work on Spark (or elsewhere), but would obviously be interested in hearing if one runner has superior abilities to accept feedback from the workers for THROUGHPUT_BASED work.
Thanks in advance,
Peter
CTO,
ATS, Inc.
Generally, streaming autoscaling in Dataflow works like this:
Upscale: If the pipeline's backlog is more than a few seconds (based on current throughput), the pipeline is upscaled. Here CPU utilization does not directly affect the amount of upscaling. Knowing the CPU utilization (say it is at 90%) does not help answer the question 'how many more workers are required?'. CPU does have an indirect effect, since pipelines fall behind when they don't have enough CPU, which increases the backlog.
Downscale: When the backlog is low (i.e. < 10 seconds), the pipeline is downscaled based on current CPU consumption. Here, CPU does directly influence the amount of downscaling.
I hope the above basic description helps.
Due to the inherent delays involved in starting up new GCE VMs, the pipeline pauses for a minute or two during resizing events. This is expected to improve in the near future.
I will ask specific questions about the job you mentioned in the description.
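On the question of what a developer can do: beyond keeping up with ACKs there is no direct feedback channel to the autoscaler, but you can bound its range through pipeline options. A rough Beam Java sketch (the worker counts are arbitrary example values):
// maxNumWorkers caps how far THROUGHPUT_BASED autoscaling can spin up;
// numWorkers sets the starting size so the first files don't all land on a
// single overloaded VM.
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(
        "--runner=DataflowRunner",
        "--autoscalingAlgorithm=THROUGHPUT_BASED",
        "--numWorkers=3",
        "--maxNumWorkers=10")
    .as(DataflowPipelineOptions.class);
Pipeline pipeline = Pipeline.create(options);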

Cloud Dataflow Streaming continuously failing to insert

My dataflow pipeline functions as so:
Read from Pubsub
Transform data into rows
Write the rows to bigquery
On occasion, data is passed which fails to insert. That is alright; I know the reason for the failure. But Dataflow continuously attempts to insert this data over and over and over. I would like to limit the number of retries, since it bloats the worker logs with irrelevant information and makes it extremely difficult to troubleshoot the actual problem when the same error appears repeatedly.
When running the pipeline locally I get:
no evaluator registered for Read(PubsubSource)
I would love to be able to test the pipeline locally. But it does not seem that Dataflow supports this option with PubSub.
To clear the errors I am left with no other choice than canceling the pipeline and running a new job on the Google Cloud. Which costs time & money. Is there a way to limit the errors? Is there a way to test my pipeline locally? Is there a better approach to debugging the pipeline?
Dataflow UI
Job ID: 2017-02-08_09_18_15-3168619427405502955
To run the pipeline locally with unbounded data sets, on Pablo's suggestion, use the InProcessPipelineRunner.
dataflowOptions.setRunner(InProcessPipelineRunner.class);
Running the program locally has allowed me to handle errors with exceptions and optimize my workflow rapidly.
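For completeness, a minimal sketch of that wiring, using the Dataflow SDK 1.x names from the line above; the streaming flag is an assumption for a PubSub-driven job:
// Run the same Pubsub -> transform -> BigQuery pipeline in-process for debugging.
DataflowPipelineOptions options =
    PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
options.setRunner(InProcessPipelineRunner.class);   // supports unbounded PubsubIO reads locally
options.setStreaming(true);                         // assumption: the job runs in streaming mode
Pipeline p = Pipeline.create(options);
// ... apply the same Read-from-Pubsub, transform, and BigQuery write steps ...
p.run();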

How do I make sure my Dataflow pipeline scales?

We've often seen people write Dataflow pipelines that don't scale well. This is frustrating since Dataflow is meant to scale transparently, but there still are some antipatterns in Dataflow pipelines that make it difficult to scale. What are some common antipatterns and tips for avoiding them?
Scaling Your Dataflow Pipeline
Hi, Reuven Lax here. I’m a member of the Dataflow engineering team, where I lead the design and implementation of our streaming runner. Prior to Dataflow I led the team that built MillWheel for a number of years. MillWheel was described in this VLDB 2013 paper, and is the basis for the streaming technology underlying Dataflow.
Dataflow usually removes the need for you to think too much about how to make a pipeline scale. A lot of work has gone into sophisticated algorithms that can automatically parallelize and tune your pipeline across many machines. However as with any such system, there are some anti-patterns that can bottleneck your pipeline at scale. In this post we will go over three of these anti-patterns, and discuss how to address them. It’s assumed that you are already familiar with the Dataflow programming model. If not, I recommend beginning with our Getting Started guide and Tyler Akidau’s Streaming 101 and Streaming 102 blog posts. You may also read the Dataflow model paper published in VLDB 2015.
Today we’re going to talk about scaling your pipeline - or more specifically, why your pipeline might not scale. When we say scalability, we mean the ability of the pipeline to operate efficiently as input size increases and key distribution changes. The scenario: you’ve written a cool new Dataflow pipeline, which the high-level operations we provide made easy to write. You’ve tested this pipeline locally on your machine using DirectPipelineRunner and everything looks fine. You’ve even tried deploying it on a small number of Compute VMs, and things still look rosy. Then you try and scale up to a larger data volume, and the picture becomes decidedly worse. For a batch pipeline, it takes far longer than expected for the pipeline to complete. For a streaming pipeline, the lag reported in the Dataflow UI keeps increasing as the pipeline falls further and further behind. We’re going to explain some reasons this might happen, and how to address them.
Expensive Per-Record Operations
One common problem we see is pipelines that perform needlessly expensive or slow operations for each record processed. Technically this isn’t a hard scaling bottleneck - given enough resources, Dataflow can still distribute this pipeline on enough machines to make it perform well. However when running over many millions or billions of records, the cost of these per-record operations adds up to an unexpectedly-large number. Usually these problems aren’t noticeable at all at lower scale.
Here’s an example of one such operation, taken from a real Dataflow pipeline.
import javax.json.Json;
...
PCollection<OutType> output = input.apply(ParDo.of(new DoFn<InType, OutType>() {
  public void processElement(ProcessContext c) {
    // A new JsonReader is created here for every single element processed.
    JsonReader reader = Json.createReader(...);
    // Perform some processing on entry.
    ...
  }
}));
At first glance it’s not obvious that anything is wrong with this code, yet when run at scale this pipeline ran extremely slowly.
Since the actual business logic of our code shouldn't have caused a slowdown, we suspected that something was adding per-record overhead to our pipeline. To get more information on this, we had to ssh to the VMs to get actual thread profiles from workers. After a bit of digging, we found threads were often stuck in the following stack trace:
java.util.zip.ZipFile.getEntry(ZipFile.java:308)
java.util.jar.JarFile.getEntry(JarFile.java:240)
java.util.jar.JarFile.getJarEntry(JarFile.java:223)
sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1005)
sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:983)
sun.misc.URLClassPath$1.next(URLClassPath.java:240)
sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:250)
java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader$3.next(URLClassLoader.java:598)
java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
java.util.ServiceLoader$LazyIterator.hasNextService(ServiceLoader.java:354)
java.util.ServiceLoader$LazyIterator.hasNext(ServiceLoader.java:393)
java.util.ServiceLoader$1.hasNext(ServiceLoader.java:474)
javax.json.spi.JsonProvider.provider(JsonProvider.java:89)
javax.json.Json.createReader(Json.java:208)
<.....>.processElement(<filename>.java:174)
Each call to Json.createReader was searching the classpath trying to find a registered JsonProvider. As you can see from the stack trace, this involves loading and unzipping JAR files. Doing this per record on a high-scale pipeline is not likely to perform very well!
The solution here was for the user to create a static JsonReaderFactory and use that to instantiate the individual reader objects. You might be tempted to create a JsonReaderFactory per bundle of records instead, inside Dataflow’s startBundle method. However, while this will work well for a batch pipeline, in streaming mode the bundles may be very small - sometimes just a few records. As a result, we don’t recommend doing expensive work per bundle either. Even if you believe your pipeline will only be used in batch mode, you may in the future want to run it as a streaming pipeline. So future-proof your pipelines, by making sure they’ll work well in either mode!
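For illustration, here is a minimal sketch of that fix, assuming the input elements are JSON strings; the class name is hypothetical, not the original pipeline's code:
import java.io.StringReader;
import javax.json.Json;
import javax.json.JsonObject;
import javax.json.JsonReader;
import javax.json.JsonReaderFactory;

class ParseJsonFn extends DoFn<String, OutType> {
  // The provider lookup (the classpath scan in the stack trace above) now
  // happens once when the class is loaded, not once per record.
  private static final JsonReaderFactory JSON_FACTORY =
      Json.createReaderFactory(null);   // null = default configuration

  @Override
  public void processElement(ProcessContext c) {
    JsonReader reader = JSON_FACTORY.createReader(new StringReader(c.element()));
    JsonObject entry = reader.readObject();
    // Perform some processing on entry, then c.output(...) as before.
  }
}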
Hot Keys
A fundamental primitive in Dataflow is GroupByKey. GroupByKey allows one to group a PCollection of key-value pairs so that all values for a specific key are grouped together to be processed as a unit. Most of Dataflow's built-in aggregating transforms - Count, Top, Combine, etc. - use GroupByKey under the covers. You might have a hot key problem if a single worker is extremely busy (e.g. high CPU use, determined by looking at the set of GCE workers for the job) while other workers are idle, yet the pipeline falls farther and farther behind.
The DoFn that processes the result of a GroupByKey is given an input type of KV<KeyType, Iterable<ValueType>>. This means that the entire set of all values for that key (within the current window if using windowing) is modeled as a single Iterable element. In particular, this means that all values for that key must be processed on the same machine, in fact on the same thread. Performance problems can occur in the presence of hot keys - when one or more keys receive data faster than can be processed on a single cpu. For example, consider the following code snippet
p.apply(Read.from(new UserWebEventSource()))
 .apply(new ExtractBrowserString())
 .apply(Window.<KV<String, Event>>into(FixedWindows.of(Duration.standardSeconds(1))))
 .apply(GroupByKey.<String, Event>create())
 .apply(ParDo.of(new ProcessEventsByBrowser()));
This code keys all user events by the user's web browser, and then processes all events for each browser as a unit. However there is a small number of very popular browsers (such as Chrome, IE, Firefox, Safari), and those keys will be very hot - possibly too hot to process on one CPU. In addition to performance, this is also a scalability bottleneck. Adding more workers to the pipeline will not help if there are four hot keys, since those keys can be processed on at most four workers. You've structured your pipeline so that Dataflow can't scale it up without violating the API contract.
One way to alleviate this is to structure the ProcessEventsByBrowser DoFn as a combiner. A combiner is a special type of user function that allows piecewise processing of the iterable. For example, if the goal was to count the number of events per browser per second, Count.perKey() can be used instead of a ParDo. Dataflow is able to lift part of the combining operation above the GroupByKey, which allows for more parallelism (for those of you coming from the Database world, this is similar to pushing a predicate down); some of the work can be done in a previous stage which hopefully is better distributed.
Unfortunately, while using a combiner often helps, it may not be enough - especially if the hot keys are very hot; this is especially true for streaming pipelines. You might also see this when using the global variants of combine (Combine.globally(), Count.globally(), Top.largest(), among others). Under the covers these operations perform a per-key combine on a single static key, and may not perform well if the volume to this key is too high. To address this we allow you to provide extra parallelism hints using Combine.PerKey.withHotKeyFanout or Combine.Globally.withFanout. These operations will create an extra step in your pipeline to pre-aggregate the data on many machines before performing the final aggregation on the target machines. There's no magic number for these operations, but the general strategy is to split any hot key into enough sub-shards that any single shard is well under the per-worker throughput your pipeline can sustain.
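To make those two ideas concrete, here is a rough Beam Java sketch; events (a browser-keyed PCollection of KV<String, Event>), amountsPerBrowser, and the fanout value of 16 are hypothetical stand-ins rather than code from the original pipeline:
// A combiner instead of GroupByKey + ParDo: Dataflow can lift part of the
// count above the shuffle, so hot browsers are partially aggregated upstream.
PCollection<KV<String, Long>> counts = events
    .apply(Window.<KV<String, Event>>into(FixedWindows.of(Duration.standardSeconds(1))))
    .apply(Count.<String, Event>perKey());

// For a custom combine where a few keys are still too hot, request extra
// pre-aggregation shards explicitly.
PCollection<KV<String, Long>> totals = amountsPerBrowser
    .apply(Combine.<String, Long, Long>perKey(Sum.ofLongs()).withHotKeyFanout(16));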
Large Windows
Dataflow provides a sophisticated windowing facility for bucketing data according to time. This is most useful in streaming pipelines when processing unbounded data, however, it is fully supported for batch, bounded pipelines as well. When a windowing strategy has been attached to a PCollection, any subsequent grouping operation (most notably GroupByKey) performs a separate grouping per window. Unlike other systems that provide only globally-synchronized windows, Dataflow windows the data for each key separately. This is what allows us to provide flexible per-key windows such as sessions. For more information, I recommend that you read the windowing guide in the Dataflow documentation.
As a consequence of the fact that windows are per key, Dataflow buffers elements on the receiver side while waiting for each window to close. If using very-long windows - e.g. a 24-hour fixed window - this means that a lot of data has to be buffered, which can be a performance bottleneck for the pipeline. This can manifest as slowness (like for hot keys), or even as out of memory errors on the workers (visible in the logs). We again recommend using combiners to reduce the data size. The difference between writing this:
pcollection
    .apply(Window.<KV<KeyType, ValueType>>into(FixedWindows.of(Duration.standardDays(1))))
    .apply(GroupByKey.<KeyType, ValueType>create())
    .apply(ParDo.of(new DoFn<KV<KeyType, Iterable<ValueType>>, KV<KeyType, Long>>() {
      public void processElement(ProcessContext c) {
        // Count the buffered values for this key and window by hand.
        c.output(KV.of(c.element().getKey(), (long) Iterables.size(c.element().getValue())));
      }
    }));
… and this ...
pcollection
    .apply(Window.<KV<KeyType, ValueType>>into(FixedWindows.of(Duration.standardDays(1))))
    .apply(Count.<KeyType, ValueType>perKey());
… isn’t just brevity. In the latter snippet Dataflow knows that a count combiner is being applied, and so only needs to store the count so far for each key, no matter how long the window is. In contrast, Dataflow understands less about the first snippet of code and is forced to buffer an entire day’s worth of data on receivers, even though the two snippets are logically equivalent!
If it’s impossible to express your operation as a combiner, then we recommend looking at the triggers API. This will allow you to optimistically process portions of the window before the window closes, and so reduce the size of buffered data.
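As an illustration, here is a hedged sketch of early firings on the one-day window from the snippets above; the one-minute delay, zero allowed lateness, and accumulation mode are arbitrary choices made to show the API shape:
pcollection
    .apply(Window.<KV<KeyType, ValueType>>into(FixedWindows.of(Duration.standardDays(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))    // emit a partial result every minute
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes())                             // each firing includes all data so far
    .apply(Count.<KeyType, ValueType>perKey());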
Note that many of these limitations do not apply to the batch runner. However as mentioned above, you're always better off future proofing your pipeline and making sure it runs well in both modes.
We've talked about hot keys, large windows, and expensive per-record operations. Other guidance can be found in our documentation. Although this post has focused on challenges you may encounter with scaling your pipeline, there are many benefits to Dataflow that are largely transparent -- things like dynamic work rebalancing to minimize straggler effects, throughput-based autoscaling, and job resource management adapt to many different pipeline and data shapes without user intervention. We're always trying to make our system more adaptive, and plan to automatically incorporate some of the above strategies into the core execution engine over time. Thanks for reading, and happy Dataflowing!

Quartz.Net jobs not always running - can't find any reason why

We're using Quartz.Net to schedule about two hundred repeating jobs. Each job uses the same IJob implementing class, but they can have different schedules. In practice, they end up having the same schedule, so we have about two hundred job details, each with their own (identical) repeating/simple trigger, scheduled. The interval is one hour.
The task this job performs is to download an rss feed, and then download all of the media files linked to in the rss feed. Prior to downloading, it wipes the directory where it is going to place the files. A single run of a job takes anywhere from a couple seconds to a dozen seconds (occasionally more).
Our method of scheduling is to call GetScheduler() on a new StdSchedulerFactory (all jobs are scheduled at once into the same IScheduler instance). We follow the scheduling with an immediate Start().
The jobs appear to run fine, but upon closer inspection we are seeing that a minority of the jobs occasionally - or almost never - run.
So, for example, all two hundred jobs should have run at 6:40 pm this evening. Most of them did. But a handful did not. I determine this by looking at the file timestamps, which should certainly be updated if the job runs (because it deletes and redownloads the file).
I've enabled Quartz.Net logging, and added quite a few logging statements to our code as well.
I get log messages that indicate Quartz is creating and executing jobs for roughly one minute after the round of jobs starts.
After that, all activity stops. No jobs run, no log messages are created. Zero.
And then, at the next firing interval, Quartz starts up again and my log files update, and various files start downloading. But - it certainly appears like some JobDetail instances never make it to the head of the line (so to speak) or do so very infrequently. Over the entire weekend, some jobs appeared to update quite frequently, and recently, and others had not updated a single time since starting the process on Friday (it runs in a Windows Service shell, btw).
So ... I'm hoping someone can help me understand this behavior of Quartz.
I need to be certain that every job runs. If its trigger is missed, I need Quartz to run it as soon as possible. From reading the documentation, I thought this would be the default behavior - for a SimpleTrigger with an indefinite repeat count, it would reschedule the job for immediate execution if the trigger window was missed. This doesn't seem to be the case. Is there any way I can determine why Quartz is not firing these jobs? I am logging at the trace level and they simply aren't there. It creates and executes an awful lot of jobs, but if I notice one missing, all I can find is the last time it ran (sometimes it hasn't run for hours or days). There is nothing about why it was skipped (I expected Quartz to log something if it skips a job for any reason), etc.
Any help would really, really be appreciated - I've spent my entire day trying to figure this out.
After reading your post, it sounds a lot like the handful of jobs that are not executing are very likely misfiring. The reason that I believe this:
I get log messages that indicate Quartz is creating and executing jobs for roughly one minute after the round of jobs starts.
In Quartz.NET the default misfire threshold is 1 minute. Chances are, you need to examine your logging configuration to determine why those misfire events are not being logged. Throw open the floodgates on your logging (i.e. set everything to debug, and make sure that you definitely have a logging directive for the Quartz scheduler class), and then rerun your jobs. I'm almost positive that the misfire events are not showing up in your logs because the logging configuration is lacking something. This is understandable, because logging configuration can get very confusing, very quickly.
Also, in the future, you might want to consult the quartz.net forum on google, since that is where some of the more thorny issues are discussed.
http://groups.google.com/group/quartznet?pli=1
As for your other question about setting the policy for what the scheduler should do: I can't specifically help you there, but if you read the API docs closely, and also consult the Google discussion group, you should be able to easily set the misfire policy flag that suits your needs. I believe that Triggers have a MisfireInstruction property which you can configure.
Also, I would argue that misfires introduce a lot of "noise" and should be avoided; perhaps bumping up the thread count on your scheduler would be a way to avoid misfires? The other option would be to stagger your job execution into separate/multiple batches.
Good luck!
