How do you determine how many resources to provision in a Google Dataflow streaming pipeline? - google-cloud-dataflow

I'm curious how to decide on how to provision resources for Apache Beam pipelines running on Google's Dataflow platform. I've built a streaming pipeline (Beam Java 2.0.0) that takes a PubSub JSON string, transforms it to a BQ TableRow, then routes it to the correct tables. There are also two transforms within the pipeline, one with a 5 minute sliding window every minute and another window with a 1 minute fixed time duration.
For some context, each incoming message is about a 1KB JSON string, and at an extreme peak the pipeline will receive 250,000 messages in one second. My sliding time window could possibly grow to have 5,000,000 million tablerows / minute before it closes (worst case scenario, but that's what we're planning for). Our typical peak traffic usage is about 75k messages / second. However, 90% of the time our pipeline is processing only 30 messages / second.
We're running on dataflow with autoscaling enabled, and by default Google provisions 4 CPUs, 15GB, and 420gb * max_number of workers for streaming pipelines. With 10 max workers set, we're going to be paying for 4.2TB of disk usage a month. That seems a bit overkill, but I don't know what data I should be looking at to verify my theory.
Something I've been thinking about is to instead use 2 CPUs and 7.5 GB of memory with 20GB of SSD per worker, and setting the max number of workers at 50. Under this configuration, we'd have at minimum 4 workers.
Summary of my spiel:
- How do you determine the CPU, RAM, and disk space you need for your streaming pipelines?
- How do you determine that a pipeline should provision SSD resources instead of standard harddrives?
- What metric measurements can I look at to measure performance of my pipeline?

Since pipelines are very different, there is no all purpose general way to say how many workers and what sizes of disks to use. There are several approaches that do work well though:
Dataflow's horizontal scaling is very close to linear. This means
that if you run a sampled pipeline (eg by sampling 10% of your input
traffic) you can very quickly estimate the resources the full
pipeline will need, without overpaying. You can tell if the pipeline is "keeping up" with the input, if the system lag stays low, and the data watermark continues to advance. You can then estimate the
maximum number of workers that your pipeline will need at peak input rate using this strategy. Lets call this number m
Having done the above, you can then rely on autoscaling, having set the maxNumWorkers flag to a number k*m where k will effectively determine how quickly your pipeline can catch up from a backlog at peak load. Eg, at k=1 the pipeline can only keep up with peak load, so a backlog at peak load may never be drained, or wait for non-peak load to drain. at k=2 the pipeline can process 2x the peak load, so it will catch up faster. Of course this is a tradeoff for how many resources you are willing to pay for during backlog, and how much catchup latency you are willing to tolerate.
Autoscaling will also ensure that the pipeline downscales during non-peak load, so that you will not be paying for all of the resources during non-peak times.
A few other notes:
Streaming dataflow tends to perform better with 4 CPU workers vs 2 CPU workers. This is because there is some per-worker overhead, and certain tuning for work parallelism that is optimized to 4 CPU workers.
SSD use should already be enabled by default when using dataflow, as SSDs drastically improve write throughput and lead to much better performance.

Related

GCP Dataflow - Throughput gradually slows down, Workers underutilized

I have a Beam script running in GCP Dataflow. This data flow performs the below steps:
Read a number of files that are PGP encrypted. (Total size more than 100 GB, individual files are of 2 GB in size)
Decrypt the files to form a PCollection
Do a wait() on PCollection
Do some processing on each record in the PCollection before writing into an output file
Behavior seen with GCP Dataflow:
When reading the input files and decrypting the files, it starts with one workers, and then scales upto 30 workers. But, only one worker continues to be utilized, utilization in all other workers is less than 10 %
Initially, throughput was 150K records per second while decryption. So, 90% of the decryption gets completed in 1 hours, which is good. But, then the throughput slows down gradually, even to just 100 records per second. So, it takes another 1-2 hours to complete the remaining 10% of the workload.
Any idea why the workers are underutilized? If there is no utilization, why are they not scaled down? Here, I am paying unnecessarily for a large number of VM-s :-(. Second, why the throughput slows reduction towards the end, and thereby significantly increasing the time for completion?
There is an issue related to the throughput and input behavior of the Cloud Dataflow. I suggest you to track the improvements being made to the autoscaling and utilization behavior of workers here.
The default architecture for Dataflow worker processing and autoscaling is not as responsive in some cases compared to when the Dataflow Streaming Engine feature is enabled. I would recommend you to try running the relevant Dataflow pipeline with Streaming Engine enabled, since it provides a more responsive autoscaling performance based on CPU utilization for your pipeline.
I hope you find the above pieces of information useful.
Can you try to implement your solution without wait() ?
For example,
FileIO.match().filepattern() -> ParDo(DoFn to decrypt files) -> fileIO.readmatches() -> ParDo(DoFn to read files)
See the example here.
This should allow your pipeline to better parallelize.

Dataflow: system lag keeps going up

We are testing Cloud Dataflow which pulls message from Pub/Sub subscription and convert data to BigQuery TableRow and load them to BigQuery as load job in every 1 min 30 sec.
We can see the pipeline works well and can process 500,000 elements per second with 40 workers. But when trying autoscaling, the number of workers unexpectedly goes up to 40 and stay there even if we send only 50,000 messages to Pub/Sub. In this situation, no unacknowledged message and workers' CPU utilizations are bellow 60%. One thing we noticed is that the Dataflow system lag goes up slowly.
Is system lag affects autoscaling?
If so, is there any solutions or ways to debugging this problem?
Is system lag affects autoscaling?
Google does not really expose the specifics of its autoscaling algorithm. Generally, though, it is based on CPU utilization, throughput and backlog. Since you're using Pub/Sub, backlog in by itself should be based on the number of unacknowledged messages. Still, the rate at which these are being consumed (i.e. the throughput at the Pub/Sub read stage) is also taken into account. Now, throughput as a whole relates to the rate at which each stage processes input bytes. As for CPU utilization, if the aforementioned don't "run smoothly", 60% usage is already too high. So, system lag at some stage could be interpreted as the throughput of that stage and therefore should affect autoscaling. Then again, these two should not always be conflated. If for example a worker gets stuck due to a hot key, system lag is high but there's no autoscaling, as the work is not parallelizable. So, all in all, it depends.
If so, is there any solutions or ways to debugging this problem?
The most important tools you have at hand are the execution graph, stackdriver logging and stackdriver monitoring. From monitoring, you should consider jvm, compute and dataflow metrics. gcloud dataflow jobs describe can also be useful, mostly to see how steps are fused and, by extension, see which steps are run in the same worker, like so:
gcloud dataflow jobs describe --full $JOB_ID --format json | jq '.pipelineDescription.executionPipelineStage[] | {"stage_id": .id, "stage_name": .name, "fused_steps": .componentTransform }'
Stackdriver monitoring exposes all three of the main autoscaling components.
Now, how you're going to take advantage of the above obviously depends on the problem. In your case, at first glance I'd say that, if you can work without autoscaling and 40 workers, you should normally expect that you can do the same with autoscaling when you've set maxNumWorkers to 40. Then again, the number of messages alone does not say the full story, their size/content also matters. I think you should start by analyzing your graph, check which step has the highest lag, see what's the input/output ratio and check for messages with severity>=WARNING in your logs. If you shared any of those here maybe we could spot something more specific.

Autoscaling in Google Cloud Dataflow is not working as expected

I am trying to enable autoscaling in my dataflow job as described in this article. I did that by setting the relevant algorithm via the following code:
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED)
After I set this and deployed my job, it always works with the max. number of CPUs available, i.e. if I set max number of workers to 10, then it uses all 10 CPUs although average CPU usage is about 50%. How does this THROUGHPUT_BASED algorithm works and where I am making mistake?
Thanks.
Although Autoscaling tries to reduce both the backlog and CPU, backlog reduction takes priority. Specific values backlog matters, Dataflow calculates 'backlog in seconds' roughly as 'backlog / throughput' and tries to keep it below 10 seconds.
In your case, I think what is preventing downscaling from 10 is due to policy regarding persistent disks (PDs) used for pipeline execution. When max workers is 10, Dataflow uses 10 persistent disks and tries to keep the number of workers at any time such that these disks are distributed roughly equally. As a consequence when the pipeline is at its max workers of 10, it tries to downscale to 5 rather than 7 or 8. In addition, it tries to keep projected CPU after downscaling to no more than 80%.
These two factors might be effectively preventing downscaling in your case. If CPU utilization is 50% with 10 workers, the projected CPU utilization is 100% for 5 workers, so it does not downscale since it is above the target 80%.
Google Dataflow is working on new execution engine that does not depend on persistent disks and does not suffer from the limitation of amout of downscaling.
A work around for this is to set higher max_workers and your pipeline might still stay at 10 or below. But that incurs a small increase in cost for PDs.
Another remote possibility is that sometimes even after upscaling estimated 'backlog seconds' might not stay below 10 seconds even with enough CPU. This could be due to various factors (user code processing, pubsub batching, etc). Would like to hear if that is affecting your pipeline.

How to determine concurrency (threads) while using shoryuken for background jobs?

In my Ruby on Rails application, I'm using shoryouken for background processing. I've many sqs queues (6-7) in my application. One of the queue has 2000-3000 jobs and it takes around 3 hours for the worker to process these 2-3k jobs with a default concurrency of 25. So based on what factors can we decide to increase the concurrency (which is the number of threads to process jobs). Please do comment if anything is unclear in the question.
Concurrency defaults to 25, but can be changed by altering your shoryuken.yml configuration (see below) or by adding the concurrency argument as so: shoryuken -c {desiredCount}
concurrency: 25 # Update with your desired value.
delay: 25 # The delay in seconds to pause a queue when it's empty. Default 0
queues:
- [high_priority, 6]
- [default, 2]
- [low_priority, 1]
You will need to test the optimal value for performance as you'll run into I/O and CPU bottlenecks as number of concurrent threads rises. Once you've reached the optimal value for your instance(s), you'll need to either increase the number of instances running this job or upgrade the instance(s).
If the bottleneck exists instead on your DB or other resource, you'll need to adjust it accordingly. (Not likely to be the case, but included for thoroughness' sake)
EDIT: Optimizing Performance
In response to your question on optimizing the thread count, the quickest/best way to determine the optimal concurrency value is to change concurrency and measure real-world throughput. There's other approaches, but the golden rule for performance is always to measure in a live production environment. Synthetic benchmarks are only helpful to the extent that they mirror real-time performance. (See also: premature optimization).
This is a case where you can easily end up overthinking things (then again, overthinking things is a perennial problem in development). Just measure with the appropriate metrics (CPU utilization, memory utilization, number of jobs completed per minute), and change the number of threads until you either maximize throughput or run into a bottleneck.
If your tasks are CPU bound you'll see your CPU utilization maxing out. If your tasks are I/O bound you'll see that after some point an increase in concurrent threads does not translate to an increase in throughput even though your CPU utilization fails to rise.
An I/O bottleneck can happen when any of the resources you're reading/writing are unable to keep up with your CPU demands. This includes system resources (memory, disk space), your database performance (DB CPU utilization, read/write limits), as well as other APIs you're connecting with. Network capacity is also a theoretical bottleneck but if it was you'd be big enough to have hired someone with experience in this area. Because there's so many different ways for this to happen, the only real way to figure it out what the bottlenecks are is to have your monitoring in place.
Re: formula, the short answer is that there's no one formula that you can use in this case. The long answer is probably yes, but you'd arrive at the optimum value in the course of collecting all the values you'd need to calculate it.
EDIT 2 : Concurrency, Latency, and Throughput
I realized I forgot to add one more piece of advice. When you're working with background tasks that users are not waiting for, your throughput (jobs per unit of time) is the only thing you want to optimize. Do not optimize for individual job time. It also means you cannot profile the current (and presumably un-bound) performance and get useful data because bottlenecks/constraints are target dependent. The constraints that exist for throughput will NOT be the same as the constraints that exist for individual task time.
(Technically speaking, your concurrency setting is your current constraint)
Three main factors are
Number of Cores
Type of Job - I/O or CPU bound
Is there another application or process running on server
Ideally for a cpu bound task keep number of thread to number of cpu cores.
For I/O bound task it requires benchmarking and calculating wait time for an I/O, and then you can decide the optimal value. For rough estimate if you have 4 cores than for I/O bound task you must keep at max 8 threads.
If you have your rails app running on the same then you will need to reduce number of cores.
Increasing the number of cores will not increase your performance if your system doesnt support.
Refer : http://baddotrobot.com/blog/2013/06/01/optimum-number-of-threads/

Is this a good use-case for Dataflow?

We currently are using google taskqueues to batch up requests to store analytics data into Keen and Stathat (more performant with batch puts). In order to consume from the taskqueues, we have a set of process brokers and workers to consume from the taskqueues. Seeing as dataflow is something where we just write the logic for pushing to our analytics solutions and we can specify a batch size to pull when processing in our dataflow program, I was curious if the overhead (seems more taylored to much larger applications) of dataflow is a good fit.
Your use case seems like a good one for Dataflow. Rather than publishing to a task queue you could publish to pubsub as a way to stream your data to your Dataflow job. Your Dataflow job could use Dataflow windows and triggers to batch your data based on size and/or time. You could then write each batch to your datastore.
Dataflow should work well on small datasets. The overhead would likely be in the cost of unused CPU cycles of Dataflow workers. Dataflow allows you to control the number of workers so you can allocate a number of workers suitable for your data size.
Utilization will depend on how evenly your load is spread out in time. If your peak and average loads are quite different then you can make a tradeoff between latency and utilization. If you want to maintain low latency then you can pick the number of workers so that you keep up during peak times. On the other hand if you want to maximize utilization, you can provision the number of workers based on average load. During peak times you would start to accumulate a backlog of messages in pubsub. The system would get rid of that backlog during non-peak times when there was spare capacity.
Right now Dataflow doesn't support writing custom sinks for unbounded data. One way to work around this is to do the writes from a DoFn rather than a sink. This should work just fine provided you can do your writes in an idempotent way so that writing a record multiple times won't cause problems.
Windowing and triggers are a way of dividing your data into finite batches to which aggregations (e.g. grouping, summing, etc...) can be applied. This blog post explains it better than I could (look at the section "windowing").

Resources