How do I make sure my Dataflow pipeline scales? - google-cloud-dataflow

We've often seen people write Dataflow pipelines that don't scale well. This is frustrating since Dataflow is meant to scale transparently, but there still are some antipatterns in Dataflow pipelines that make it difficult to scale. What are some common antipatterns and tips for avoiding them?

Scaling Your Dataflow Pipeline
Hi, Reuven Lax here. I’m a member of the Dataflow engineering team, where I lead the design and implementation of our streaming runner. Prior to Dataflow I led the team that built MillWheel for a number of years. MillWheel was described in this VLDB 2013 paper, and is the basis for the streaming technology underlying Dataflow.
Dataflow usually removes the need for you to think too much about how to make a pipeline scale. A lot of work has gone into sophisticated algorithms that can automatically parallelize and tune your pipeline across many machines. However as with any such system, there are some anti-patterns that can bottleneck your pipeline at scale. In this post we will go over three of these anti-patterns, and discuss how to address them. It’s assumed that you are already familiar with the Dataflow programming model. If not, I recommend beginning with our Getting Started guide and Tyler Akidau’s Streaming 101 and Streaming 102 blog posts. You may also read the Dataflow model paper published in VLDB 2015.
Today we’re going to talk about scaling your pipeline - or more specifically, why your pipeline might not scale. When we say scalability, we mean the ability of the pipeline to operate efficiently as input size increases and key distribution changes. The scenario: you’ve written a cool new Dataflow pipeline, which the high-level operations we provide made easy to write. You’ve tested this pipeline locally on your machine using DirectPipelineRunner and everything looks fine. You’ve even tried deploying it on a small number of Compute VMs, and things still look rosy. Then you try and scale up to a larger data volume, and the picture becomes decidedly worse. For a batch pipeline, it takes far longer than expected for the pipeline to complete. For a streaming pipeline, the lag reported in the Dataflow UI keeps increasing as the pipeline falls further and further behind. We’re going to explain some reasons this might happen, and how to address them.
Expensive Per-Record Operations
One common problem we see is pipelines that perform needlessly expensive or slow operations for each record processed. Technically this isn’t a hard scaling bottleneck - given enough resources, Dataflow can still distribute this pipeline on enough machines to make it perform well. However, when running over many millions or billions of records, the cost of these per-record operations adds up to an unexpectedly large total. Usually these problems aren’t noticeable at all at lower scale.
Here’s an example of one such operation, taken from a real Dataflow pipeline.
import javax.json.Json;
import javax.json.JsonReader;
...
PCollection<OutType> output = input.apply(ParDo.of(new DoFn<InType, OutType>() {
  @Override
  public void processElement(ProcessContext c) {
    // A new reader is created for every record; this is the expensive call.
    JsonReader reader = Json.createReader(...);
    // Perform some processing on entry.
    ...
  }
}));
At first glance it’s not obvious that anything is wrong with this code, yet when run at scale this pipeline ran extremely slowly.
Since the actual business logic of our code shouldn't have caused a slowdown, we suspected that something was adding per-record overhead to our pipeline. To get more information on this, we had to ssh to the VMs to get actual thread profiles from workers. After a bit of digging, we found threads were often stuck in the following stack trace:
java.util.zip.ZipFile.getEntry(ZipFile.java:308)
java.util.jar.JarFile.getEntry(JarFile.java:240)
java.util.jar.JarFile.getJarEntry(JarFile.java:223)
sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1005)
sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:983)
sun.misc.URLClassPath$1.next(URLClassPath.java:240)
sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:250)
java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader$3.next(URLClassLoader.java:598)
java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
java.util.ServiceLoader$LazyIterator.hasNextService(ServiceLoader.java:354)
java.util.ServiceLoader$LazyIterator.hasNext(ServiceLoader.java:393)
java.util.ServiceLoader$1.hasNext(ServiceLoader.java:474)
javax.json.spi.JsonProvider.provider(JsonProvider.java:89)
javax.json.Json.createReader(Json.java:208)
<.....>.processElement(<filename>.java:174)
Each call to Json.createReader was searching the classpath trying to find a registered JsonProvider. As you can see from the stack trace, this involves loading and unzipping JAR files. Doing this per record on a high-scale pipeline is not likely to perform very well!
The solution here was for the user to create a static JsonReaderFactory and use that to instantiate the individual reader objects. You might be tempted to create a JsonReaderFactory per bundle of records instead, inside Dataflow’s startBundle method. However, while this will work well for a batch pipeline, in streaming mode the bundles may be very small - sometimes just a few records. As a result, we don’t recommend doing expensive work per bundle either. Even if you believe your pipeline will only be used in batch mode, you may in the future want to run it as a streaming pipeline. So future-proof your pipelines, by making sure they’ll work well in either mode!
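The effect of sharing one factory can be illustrated without any JSON library. The following is a minimal sketch, not the pipeline's actual code: `ReaderFactory` is a hypothetical stand-in for `JsonReaderFactory`, and its constructor simply counts how often the simulated expensive lookup runs.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SharedFactorySketch {
    // Counts how often the simulated "expensive lookup" runs.
    static final AtomicInteger LOOKUPS = new AtomicInteger();

    /** Hypothetical stand-in for JsonReaderFactory: costly to create, cheap to use. */
    static final class ReaderFactory {
        ReaderFactory() {
            LOOKUPS.incrementAndGet(); // simulates the classpath scan seen in the stack trace
        }

        String read(String record) {
            return record.trim();
        }
    }

    // Created once when the class is loaded, then reused for every record.
    private static final ReaderFactory SHARED = new ReaderFactory();

    /** Processes n records and returns how many expensive lookups that triggered. */
    static int lookupsFor(boolean useShared, int n) {
        int before = LOOKUPS.get();
        for (int i = 0; i < n; i++) {
            ReaderFactory factory = useShared ? SHARED : new ReaderFactory();
            factory.read(" record " + i);
        }
        return LOOKUPS.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("per-record lookups: " + lookupsFor(false, 1000));
        System.out.println("shared lookups:     " + lookupsFor(true, 1000));
    }
}
```

In a real DoFn the static field would hold the actual factory object; the point of the sketch is only that the lookup cost stops growing with the number of records.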
Hot Keys
A fundamental primitive in Dataflow is GroupByKey. GroupByKey allows one to group a PCollection of key-value pairs so that all values for a specific key are grouped together to be processed as a unit. Most of Dataflow’s built-in aggregating transforms - Count, Top, Combine, etc. - use GroupByKey under the covers. You might have a hot key problem if a single worker is extremely busy (e.g. high CPU use determined by looking at the set of GCE workers for the job) while other workers are idle, yet the pipeline falls farther and farther behind.
The DoFn that processes the result of a GroupByKey is given an input type of KV<KeyType, Iterable<ValueType>>. This means that the entire set of all values for that key (within the current window if using windowing) is modeled as a single Iterable element. In particular, this means that all values for that key must be processed on the same machine, in fact on the same thread. Performance problems can occur in the presence of hot keys - when one or more keys receive data faster than can be processed on a single CPU. For example, consider the following code snippet:
p.apply(Read.from(new UserWebEventSource()))
 .apply(new ExtractBrowserString())
 .apply(Window.<KV<String, Event>>into(FixedWindows.of(Duration.standardSeconds(1))))
 .apply(GroupByKey.<String, Event>create())
 .apply(ParDo.of(new ProcessEventsByBrowser()));
This code keys all user events by the user’s web browser, and then processes all events for each browser as a unit. However, there is a small number of very popular browsers (such as Chrome, IE, Firefox, Safari), and those keys will be very hot - possibly too hot to process on one CPU. In addition to performance, this is also a scalability bottleneck. Adding more workers to the pipeline will not help if there are four hot keys, since those keys can be processed on at most four workers. You’ve structured your pipeline so that Dataflow can’t scale it up without violating the API contract.
One way to alleviate this is to structure the ProcessEventsByBrowser DoFn as a combiner. A combiner is a special type of user function that allows piecewise processing of the iterable. For example, if the goal was to count the number of events per browser per second, Count.perKey() can be used instead of a ParDo. Dataflow is able to lift part of the combining operation above the GroupByKey, which allows for more parallelism (for those of you coming from the database world, this is similar to pushing a predicate down); some of the work can be done in a previous stage, which hopefully is better distributed.
Unfortunately, while using a combiner often helps, it may not be enough when the hot keys are very hot, particularly in streaming pipelines. You might also see this when using the global variants of combine (Combine.globally(), Count.globally(), Top.largest(), among others). Under the covers these operations perform a per-key combine on a single static key, and may not perform well if the volume to this key is too high. To address this, we allow you to provide extra parallelism hints using Combine.PerKey.withHotKeyFanout or Combine.Globally.withFanout. These operations will create an extra step in your pipeline to pre-aggregate the data on many machines before performing the final aggregation on the target machines. There's no magic number for these operations, but the general strategy would be to split any hot key into enough sub-shards so that any single shard is well under the per-worker throughput that your pipeline can sustain.
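To make the pre-aggregation idea concrete, here is a minimal plain-Java sketch of a two-stage count. It is a model of the strategy described above, not Dataflow's implementation: the class and method names are hypothetical, and the round-robin choice of sub-shard is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HotKeyFanoutSketch {
    /** Stage 1: partial counts per (key, sub-shard); each sub-shard could live on a different worker. */
    static Map<String, long[]> preAggregate(List<String> keys, int fanout) {
        Map<String, long[]> partial = new HashMap<>();
        int next = 0;
        for (String key : keys) {
            int shard = next++ % fanout; // round-robin choice, an assumption of this sketch
            partial.computeIfAbsent(key, k -> new long[fanout])[shard]++;
        }
        return partial;
    }

    /** Stage 2: the final aggregation merges at most `fanout` partial counts per key. */
    static Map<String, Long> merge(Map<String, long[]> partial) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, long[]> entry : partial.entrySet()) {
            long sum = 0;
            for (long count : entry.getValue()) {
                sum += count;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    /** Largest number of raw records that any single sub-shard had to absorb. */
    static long maxShardLoad(Map<String, long[]> partial) {
        long max = 0;
        for (long[] shards : partial.values()) {
            for (long count : shards) {
                max = Math.max(max, count);
            }
        }
        return max;
    }

    /** Sample input: one hot key (8 events) and one cooler key (2 events). */
    static List<String> sample() {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 8; i++) keys.add("Chrome");
        for (int i = 0; i < 2; i++) keys.add("Firefox");
        return keys;
    }

    public static void main(String[] args) {
        Map<String, long[]> partial = preAggregate(sample(), 4);
        System.out.println("totals: " + merge(partial));
        System.out.println("max records on one sub-shard: " + maxShardLoad(partial));
    }
}
```

The totals are the same for any fanout; what changes is the largest amount of raw input a single unit of work must absorb, which is the quantity to keep under the sustainable per-worker throughput.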
Large Windows
Dataflow provides a sophisticated windowing facility for bucketing data according to time. This is most useful in streaming pipelines when processing unbounded data; however, it is fully supported for batch, bounded pipelines as well. When a windowing strategy has been attached to a PCollection, any subsequent grouping operation (most notably GroupByKey) performs a separate grouping per window. Unlike other systems that provide only globally-synchronized windows, Dataflow windows the data for each key separately. This is what allows us to provide flexible per-key windows such as sessions. For more information, I recommend that you read the windowing guide in the Dataflow documentation.
As a consequence of the fact that windows are per key, Dataflow buffers elements on the receiver side while waiting for each window to close. If using very long windows - e.g. a 24-hour fixed window - this means that a lot of data has to be buffered, which can be a performance bottleneck for the pipeline. This can manifest as slowness (like for hot keys), or even as out-of-memory errors on the workers (visible in the logs). We again recommend using combiners to reduce the data size. The difference between writing this:
pcollection.apply(Window.into(FixedWindows.of(Duration.standardDays(1))))
    .apply(GroupByKey.<KeyType, ValueType>create())
    .apply(ParDo.of(new DoFn<KV<KeyType, Iterable<ValueType>>, Long>() {
      public void processElement(ProcessContext c) {
        // Count the values by iterating; the grouped Iterable has no size() method.
        long count = 0;
        for (ValueType value : c.element().getValue()) {
          count++;
        }
        c.output(count);
      }
    }));
… and this ...
pcollection.apply(Window.into(FixedWindows.of(Duration.standardDays(1))))
    .apply(Count.perKey());
… isn’t just brevity. In the latter snippet Dataflow knows that a count combiner is being applied, and so only needs to store the count so far for each key, no matter how long the window is. In contrast, Dataflow understands less about the first snippet of code and is forced to buffer an entire day’s worth of data on receivers, even though the two snippets are logically equivalent!
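The difference in retained state can be sketched in plain Java. This is an illustrative model with hypothetical class names, not how the Dataflow runner is implemented: one counter buffers every value until the window closes, the other keeps only a running accumulator.

```java
import java.util.ArrayList;
import java.util.List;

final class CombinerStateSketch {
    /** GroupByKey-style: every value is retained until the window closes. */
    static final class BufferingCounter {
        private final List<String> buffered = new ArrayList<>();

        void add(String value) {
            buffered.add(value);
        }

        long stateSize() {
            return buffered.size();
        }

        long result() {
            return buffered.size();
        }
    }

    /** Combiner-style: only the running count (the accumulator) is retained. */
    static final class CombiningCounter {
        private long count;

        void add(String value) {
            count++;
        }

        long stateSize() {
            return 1; // a single long, regardless of input volume
        }

        long result() {
            return count;
        }
    }

    /** Feeds n values to both counters; returns {bufferResult, bufferState, combineResult, combineState}. */
    static long[] run(int n) {
        BufferingCounter buffering = new BufferingCounter();
        CombiningCounter combining = new CombiningCounter();
        for (int i = 0; i < n; i++) {
            String value = "event-" + i;
            buffering.add(value);
            combining.add(value);
        }
        return new long[] {
            buffering.result(), buffering.stateSize(), combining.result(), combining.stateSize()
        };
    }

    public static void main(String[] args) {
        long[] r = run(100_000);
        System.out.println("buffering: result=" + r[0] + " retained=" + r[1]);
        System.out.println("combining: result=" + r[2] + " retained=" + r[3]);
    }
}
```

Both counters produce the same answer; only the buffering one grows with the length of the window.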
If it’s impossible to express your operation as a combiner, then we recommend looking at the triggers API. This will allow you to optimistically process portions of the window before the window closes, and so reduce the size of buffered data.
Note that many of these limitations do not apply to the batch runner. However, as mentioned above, you're always better off future-proofing your pipeline and making sure it runs well in both modes.
We've talked about hot keys, large windows, and expensive per-record operations. Other guidance can be found in our documentation. Although this post has focused on challenges you may encounter with scaling your pipeline, there are many benefits to Dataflow that are largely transparent -- things like dynamic work rebalancing to minimize straggler effects, throughput-based autoscaling, and job resource management adapt to many different pipeline and data shapes without user intervention. We're always trying to make our system more adaptive, and plan to automatically incorporate some of the above strategies into the core execution engine over time. Thanks for reading, and happy Dataflowing!

Related

Is apache-beam a good choice when the event time ordering has to be preserved when writing to sink?

I'm considering using apache beam to write a streaming pipeline to apply a stream of mutations to replicate events from a source database into a destination database in the order of event time. The source could be either kafka or pubsub.
An example would be something like this except that the order in which the mutations are applied to the sink must be in order in which they arrived.
I did go over some of the previous questions asked on preserving order:
Processing Total Ordering of Events By Key using Apache Beam
Sort elements within a fixed window - Cloud Dataflow - This seems to be the same use case I'm interested in.
I understand that if I go down the Apache Beam road I would have to:
1. choose a windowing strategy that accommodates late data (either fixed windows with an allowed lateness, or the global window with triggers to emit panes and buffering for late data)
2. apply transformations
3. GroupByKey over a single key (so that everything goes to the same worker), sort, and write to the sink
In addition to the above, I would have to make sure the windows (if I follow a fixed window strategy) are executed in order. Step 3 is bound to be the bottleneck.
If [2] above in the list of steps is a lot of computation, then Apache Beam would make sense to take advantage of the parallelism it offers. But if [2] is just a simple one-to-one mapping, does Apache Beam make sense for this replication use case? Please let me know if I'm missing something.
Note: We do have a batch pipeline on dataflow using apache beam to load a datadump on gcs to database where the entire data is on disk and the order in which its written to sink does not matter.
Preserving order is possible, but I'm not sure it's straightforward or efficient.
It also depends on how much data (elements/sec) you're expecting as well as what the sink type is. Potentially you could have the pipeline write out ordered entries to GCS, and the sink just reads the files in, in order, as a secondary process.
Your other option, using parallel writes and making sure the database is only treated as usable up to the output watermark time of the last Beam stage, may be doable, but it is not really the core use case of Dataflow/Apache Beam.
Maybe there could be ways to process the stream out of order, but write to an intermediate sink that can easily be read from in order. i.e. writing out the mutation batches with a step or file number that can easily be used to order the files when applied to the final sink.
The window + write to final sink architecture is going to be difficult to get right, probably too complex for low volume of elements, and too inefficient for large volume. This is a good example of what this could look like.
But again, keep in mind that all these approaches are definitely not the core use case for Dataflow/Apache Beam.
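The intermediate-sink idea mentioned above (write mutation batches tagged with a step number, then apply them in order as a secondary process) might look roughly like this. It is a sketch under stated assumptions: the `Batch` type, the sequence numbering, and the mutation strings are all hypothetical, and a real applier would also need to wait for missing sequence numbers before advancing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class OrderedBatchApplySketch {
    /** A batch of mutations tagged with a sequence number assigned upstream (hypothetical format). */
    static final class Batch {
        final long sequence;
        final List<String> mutations;

        Batch(long sequence, String... mutations) {
            this.sequence = sequence;
            this.mutations = Arrays.asList(mutations);
        }
    }

    /** Returns the mutations in sequence order, regardless of the order the batches arrived in. */
    static List<String> applyInOrder(List<Batch> arrived) {
        List<Batch> sorted = new ArrayList<>(arrived);
        sorted.sort(Comparator.comparingLong(b -> b.sequence));
        List<String> applied = new ArrayList<>();
        for (Batch batch : sorted) {
            applied.addAll(batch.mutations); // stand-in for writing to the destination database
        }
        return applied;
    }

    /** Batches written out of order by parallel workers, then applied in order. */
    static String demo() {
        List<Batch> arrived = Arrays.asList(
            new Batch(3, "DELETE a"),
            new Batch(1, "INSERT a"),
            new Batch(2, "UPDATE a"));
        return String.join(",", applyInOrder(arrived));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```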

Apache Beam/Dataflow Reshuffle

What is the purpose of org.apache.beam.sdk.transforms.Reshuffle? In the documentation the purpose is defined as:
A PTransform that returns a PCollection equivalent to its input but
operationally provides some of the side effects of a GroupByKey, in
particular preventing fusion of the surrounding transforms,
checkpointing and deduplication by id.
What is the benefit of preventing fusion of the surrounding transforms? I thought fusion was an optimization to prevent unnecessary steps. An actual use case would be helpful.
There are a couple of cases when you may want to reshuffle your data. The following is not an exhaustive list, but should give you an idea about why you may reshuffle:
When one of your ParDo transforms has a very high fanout
This means that the parallelism is increased after your ParDo. If you don't break the fusion here, your pipeline will not be able to split data into multiple machines to process it.
Consider the extreme case of a DoFn that generates a million output elements for every input element. Consider that this ParDo receives 10 elements in its input. If you don't break fusion between this high-fanout ParDo and its downstream transforms, it will only be able to run on 10 machines, although you will have millions of elements.
A good way to diagnose this is looking at the number of elements in an input PCollection vs the number of elements of an output PCollection. If the latter is significantly larger than the former, then you may want to consider adding a reshuffle.
When your data is not well balanced across machines
Imagine that your pipeline consumes 9 files of 10MB and one file of 10GB. If each file is read by a single machine, you will have one machine with a lot more data than the others.
If you don't reshuffle this data, most of your machines will be idle while your pipeline runs. Reshuffling it allows you to rebalance the data to be processed more evenly across machines.
A good way to diagnose this is by looking at how many workers are executing work in your pipeline. If the pipeline is slow, and there is only one worker processing data, then you can benefit from a reshuffle.
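The high-fanout case can be modeled with a few lines of plain Java. This is a simplified simulation, not Beam code: it assumes that without a reshuffle every output stays on the worker that held its input, and that a reshuffle redistributes outputs evenly (round-robin here). The class and method names are hypothetical.

```java
final class ReshuffleLoadSketch {
    /** Fused: all outputs of an input element stay on the worker that processed that input. */
    static int[] fusedLoad(int inputs, int fanout, int workers) {
        int[] load = new int[workers];
        for (int in = 0; in < inputs; in++) {
            load[in % workers] += fanout;
        }
        return load;
    }

    /** Reshuffled: outputs are redistributed across all workers (idealized round-robin). */
    static int[] reshuffledLoad(int inputs, int fanout, int workers) {
        int[] load = new int[workers];
        long total = (long) inputs * fanout;
        for (long element = 0; element < total; element++) {
            load[(int) (element % workers)]++;
        }
        return load;
    }

    /** Number of workers that received any work at all. */
    static int busyWorkers(int[] load) {
        int busy = 0;
        for (int l : load) {
            if (l > 0) busy++;
        }
        return busy;
    }

    /** Heaviest load on any single worker. */
    static int maxLoad(int[] load) {
        int max = 0;
        for (int l : load) {
            max = Math.max(max, l);
        }
        return max;
    }

    public static void main(String[] args) {
        int[] fused = fusedLoad(10, 1000, 50);
        int[] reshuffled = reshuffledLoad(10, 1000, 50);
        System.out.println("fused:      busy=" + busyWorkers(fused) + " max=" + maxLoad(fused));
        System.out.println("reshuffled: busy=" + busyWorkers(reshuffled) + " max=" + maxLoad(reshuffled));
    }
}
```

With 10 inputs, a fanout of 1000, and 50 workers, the fused model keeps 40 workers idle, which matches the diagnostic described above: few busy workers despite a large output PCollection.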

Debugging slow reads from BigQuery on Google Cloud Dataflow

Background:
We have a really simple pipeline which reads some data from BigQuery (usually ~300MB), filters/transforms it, and writes it back to BigQuery. In 99% of cases this pipeline finishes in 7-10 minutes and is then restarted to process a new batch.
Problem:
Recently, the job has started to take >3h once in a while, maybe 2 times in a month out of 2000 runs. When I look at the logs, I can't see any errors and in fact it's only the first step (read from BigQuery) that is taking so long.
Does anyone have a suggestion on how to approach debugging of such cases? Especially since it's really the read from BQ and not any of our transformation code. We are using Apache Beam SDK for Python 0.6.0 (maybe that's the reason!?)
Is it maybe possible to define a timeout for the job?
This is an issue on either Dataflow side or BigQuery side depending on how one looks at it. When splitting the data for parallel processing, Dataflow relies on an estimate of the data size. The long runtime happens when BigQuery sporadically gives a severe under-estimate of the query result size, and Dataflow, as a consequence, severely over-splits the data and the runtime becomes bottlenecked by the overhead of reading lots and lots of tiny file chunks exported by BigQuery.
On one hand, this is the first time I've seen BigQuery produce such dramatically incorrect query result size estimates. However, as size estimates are inherently best-effort and can in general be arbitrarily off, Dataflow should control for that and prevent such oversplitting. We'll investigate and fix this.
The only workaround that comes to mind meanwhile is to use the Java SDK: it uses quite different code for reading from BigQuery that, as far as I recall, does not rely on query size estimates.

Significance of creating single pipeline having multiple input sources over multiple pipelines each having separate input sources defined?

I am working on a project which receives requests from multiple clients through pubsub which dataflow pipelines will process in streaming mode to give out the responses. Each flow has some logic in common and also has read/writes from/to BigTable/BigQuery.
What are the pros and cons (on both the development and maintenance side) of using one single pipeline that receives input from all clients versus a separate pipeline for each input?
In terms of development, these have about the same amount of complexity: you probably still have the common code written in one place, or perhaps even the entire pipeline code is identical but you're launching it with different parameters for different clients.
Maintenance-wise, there are pros and cons to both approaches.
One pipeline is likely to be cheaper. E.g. if traffic is overall very low and processing all the clients could fit on 1 machine, then it will actually happen on 1 machine - but if you do separate pipelines, each of them can't use less than 1 machine, so you'll be using at least N all the time.
One pipeline might be easier to observe and monitor in the UI, and easier to deploy. That, though, depends on the structure of the pipeline: are you going to pipe all clients' data through the same transforms, or, say, have 1 read transform per client (say, if each client is reading from a different PubSub topic and writing to a different BigQuery table)? If it's all the same transforms, then you'll get the benefit of launching the pipeline once and not having to do anything at all when a client is added or removed (otherwise, you'll need to update the pipeline).
With several pipelines (one per client), it's easier to isolate the issues with different clients. E.g. you could stop processing individual clients one by one, or update them one by one (say, if you're testing out some experimental code and don't want to break all the clients at the same time if it's wrong). It becomes unlikely that a bug in the pipeline will cause one client's data to mix up with another client's data.

Cloud Dataflow Running really slow when reading/writing from Cloud Storage (GCS)

Since using the release of the latest build of Cloud Dataflow (0.4.150414) our jobs are running really slow when reading from cloud storage (GCS). After running for 20 minutes with 10 VMs we were only able to read in about 20 records when previously we could read in millions without issue.
It seems to be hanging, although no errors are being reported back to the console.
We received an email informing us that the latest build would be slower and that it could be countered by using more VMs but we got similar results with 50 VMs.
Here is the job id for reference: 2015-04-22_22_20_21-5463648738106751600
Instance: n1-standard-2
Region: us-central1-a
Your job seems to be using side inputs to a DoFn. Since there has been a recent change in how Cloud Dataflow SDK for Java handles side inputs, it is likely that your performance issue is related to that. I'm reposting my answer from a related question.
The evidence seems to indicate that there is an issue with how your pipeline handles side inputs. Specifically, it's quite likely that side inputs may be getting re-read from BigQuery again and again, for every element of the main input. This is completely orthogonal to the changes to the type of virtual machines used by Dataflow workers, described below.
This is closely related to the changes made in the Dataflow SDK for Java, version 0.3.150326. In that release, we changed the side input API to apply per window. Calls to sideInput() now return values only in the specific window corresponding to the window of the main input element, and not the whole side input PCollectionView. Consequently, sideInput() can no longer be called from startBundle and finishBundle of a DoFn because the window is not yet known.
For example, the following code snippet has an issue that would cause re-reading side input for every input element.
@Override
public void processElement(ProcessContext c) throws Exception {
  Iterable<String> uniqueIds = c.sideInput(iterableView);
  for (String item : uniqueIds) {
    [...]
  }
  c.output([...]);
}
This code can be improved by caching the side input in a List member variable of the transform (assuming it fits into memory) during the first call to processElement, and using that cached List instead of the side input in subsequent calls.
This workaround should restore the performance you were seeing before, when side inputs could have been called from startBundle. Long-term, we will work on better caching for side inputs. (If this doesn't help fully resolve the issue, please reach out to us via email and share the relevant code snippets.)
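The caching workaround described above can be sketched in plain Java. This is an illustration only: the `Supplier` is a hypothetical stand-in for `c.sideInput(iterableView)`, and the sketch assumes the side input is the same for every element (for example, everything is in a single window). With per-window side inputs, a cache like this would need to be keyed by window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class CachedSideInputSketch {
    // Hypothetical stand-in for c.sideInput(iterableView); each get() models one expensive re-read.
    private final Supplier<Iterable<String>> sideInput;

    // Populated on the first call to processElement, then reused.
    private List<String> cachedIds;

    CachedSideInputSketch(Supplier<Iterable<String>> sideInput) {
        this.sideInput = sideInput;
    }

    /** Returns whether the element appears in the (cached) side input. */
    boolean processElement(String element) {
        if (cachedIds == null) {
            List<String> ids = new ArrayList<>();
            for (String id : sideInput.get()) {
                ids.add(id);
            }
            cachedIds = ids; // assumes the side input fits into memory
        }
        return cachedIds.contains(element);
    }

    /** Processes the given number of elements and returns how often the side input was read. */
    static int readsFor(int elements) {
        AtomicInteger reads = new AtomicInteger();
        Supplier<Iterable<String>> source = () -> {
            reads.incrementAndGet();
            return Arrays.asList("id-1", "id-2");
        };
        CachedSideInputSketch fn = new CachedSideInputSketch(source);
        for (int i = 0; i < elements; i++) {
            fn.processElement("id-" + i);
        }
        return reads.get();
    }

    public static void main(String[] args) {
        System.out.println("side input reads for 1000 elements: " + readsFor(1000));
    }
}
```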
Separately, there was, indeed, an update to the Cloud Dataflow Service around 4/9/15 that changed the default type of virtual machines used by Dataflow workers. Specifically, we reduced the default number of cores per worker because our benchmarks showed this to be cost effective for typical jobs. This is not a slowdown in the Dataflow Service of any kind -- it just runs with fewer resources per worker, by default. Users are still given the option to override both the number of workers and the type of virtual machine used by workers.
We had a similar issue. It is when the side-input is reading from a BigQuery table that has had its data streamed in, rather than bulk loaded. When we copy the table(s), and read from the copies instead everything works fine.
If your tables are streamed, try copying them and reading the copies instead. This is a workaround.
See: Dataflow performance issues
