Understanding Threading in Google Cloud DataFlow Workers

I made a simple program which waits for 60 seconds per element. I have 300 input elements to process.
Number of threads per worker: 1 in batch mode and 300 in streaming mode, per this document:
https://cloud.google.com/dataflow/docs/resources/faq#beam-java-sdk
In streaming mode, with 1 worker and 300 threads, the job should complete in 2 to 3 minutes, allowing for the overhead of spawning workers etc. My understanding is that there will be one thread for each of the 300 input elements, all sleeping for 60 seconds, and then the job should be done. However, the job takes much longer to complete.
Similarly, in batch mode with 1 worker (1 thread) and 300 input elements, it should take 300 minutes to complete.
Can someone clarify how this works at the worker level?
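For reference, a minimal sketch of the kind of pipeline being described, using the Python SDK (the sleep-based DoFn and the transform names are illustrative assumptions, not the asker's actual code):

import time
import apache_beam as beam

class SleepFn(beam.DoFn):
    # Simulates 60 seconds of work per element.
    def process(self, element):
        time.sleep(60)
        yield element

with beam.Pipeline() as p:
    _ = (p
         | "Create300Elements" >> beam.Create(range(300))
         | "Sleep60Seconds" >> beam.ParDo(SleepFn()))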

There is considerable overhead in starting up and tearing down worker VMs, so it's hard to generalize from a short experiment such as this. In addition, there's no promise that there will be a given number of threads per worker for streaming or batch, as this is an implementation-dependent parameter that may change at any time for any runner (and indeed may even be chosen dynamically).

Related

Dataflow pipeline pauses after scaling down with low CPU utilization

In our streaming pipeline we read data from Pub/Sub, do some validations and then group the data by key in a session window with a 10-second gap. Afterwards the data is processed further and written to Bigtable and Pub/Sub again.
We're using Apache Beam 2.28 and the Dataflow Streaming Engine. During the day we process more data than overnight, and the pipeline automatically scales the number of workers (n2d-standard-4). Mostly it scales up from 2 workers to 4 or 5 to reduce the backlog. After that it scales down again, as the CPU utilization becomes too low for 4 or 5 workers.
It is at this point that the CPU utilization drops to nearly 0% on all workers and the entire pipeline starts lagging behind massively. As a result the number of workers is scaled up again and the pipeline resumes processing. After the backlog is reduced, the number of workers is gradually lowered and the same issue arises.
(screenshot: metrics)
What we notice is that in the GroupByKey step, the input throughput stays more or less the same, but the output throughput drops to 0.
(screenshot: GroupByKey throughput)
I know a GroupByKey can suffer from hot keys, but in that case I would expect the CPU utilization of one worker to be very high while the others have nothing to do.
Does anyone know what might be causing this issue?
The issue was caused by the combination of using a session window with a GroupByKey, how the watermark for an unbounded Pub/Sub source works, and when acknowledgements are sent to Pub/Sub.
Our session window with a gap of 10 seconds sometimes didn't output any messages for a couple of minutes (because no early trigger was configured and messages kept arriving for the same key within the 10-second session gap). Because these steps are part of the first fused stage in the actual execution of our pipeline, this led to some messages not being acknowledged to Pub/Sub (the ack is only sent when the first fused stage is completed). The age of the oldest unacknowledged message on the subscription kept rising, causing the watermark not to advance.
The issue became more pronounced because the acknowledgement deadline was set to 10 minutes. When the number of workers scaled down, this caused the issue described in the original question.
We were able to solve this by adding a Reshuffle before the creation of the session window (with the GroupByKey) and decreasing the acknowledgement deadline.
https://cloud.google.com/blog/products/data-analytics/handling-duplicate-data-in-streaming-pipeline-using-pubsub-dataflow
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization
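A rough Python-SDK sketch of where such a Reshuffle would sit relative to the session window and GroupByKey; the Create and timestamping steps stand in for the actual Pub/Sub read, validation and keying, and all names are illustrative:

import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    grouped = (
        p
        # Stand-in for the Pub/Sub read, validation and keying steps.
        | "CreateKeyedInput" >> beam.Create([(i, ("key", i)) for i in range(5)])
        | "AddTimestamps" >> beam.MapTuple(
            lambda ts, kv: window.TimestampedValue(kv, ts))
        # Reshuffle breaks fusion with the upstream read, so the first fused
        # stage can complete (and the Pub/Sub messages can be acknowledged)
        # before the session window starts holding elements back.
        | "BreakFusion" >> beam.Reshuffle()
        | "SessionWindow" >> beam.WindowInto(window.Sessions(gap_size=10))
        | "GroupByKey" >> beam.GroupByKey())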

Slow Dataflow job draining

I have a Dataflow job that reads from a Pub/Sub subscription and writes to a Redis instance on fixed time windows. The job seems to run well with 4 workers and almost 0s system latency, until I try to drain it, which causes upscaling to 10 workers and takes hours to finish.
I suspect this is caused by the windowing/grouping, since the output collection metric suggests that it keeps producing elements long after the drain is started.
This is the windowing that I'm using:
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterAny, AfterCount, AfterProcessingTime, Repeatedly)
from apache_beam.utils.timestamp import Duration

beam.WindowInto(
    window.FixedWindows(size=120),
    trigger=Repeatedly(
        AfterAny(AfterCount(100), AfterProcessingTime(120))),
    accumulation_mode=AccumulationMode.DISCARDING,
    allowed_lateness=Duration(seconds=2 * 24 * 60 * 60))
Given your large allowed lateness, when you drain your pipeline it causes every window for every key seen in the last 48 hours to close all at once, which is probably why there's so much work and it's upscaling. This could be especially bad if your keys are not often re-used.

GCP Dataflow - Throughput gradually slows down, Workers underutilized

I have a Beam pipeline running in GCP Dataflow. It performs the following steps:
Read a number of PGP-encrypted files (total size more than 100 GB, individual files about 2 GB each)
Decrypt the files to form a PCollection
Do a wait() on the PCollection
Do some processing on each record in the PCollection before writing it to an output file
Behavior seen with GCP Dataflow:
When reading and decrypting the input files, it starts with one worker and then scales up to 30 workers. But only one worker is actually utilized; utilization on all the other workers is less than 10%.
Initially, throughput was 150K records per second during decryption, so 90% of the decryption completes in about 1 hour, which is good. But then the throughput slows down gradually, even to just 100 records per second, so it takes another 1-2 hours to complete the remaining 10% of the workload.
Any idea why the workers are underutilized? If there is no utilization, why are they not scaled down? I am paying unnecessarily for a large number of VMs :-(. Second, why does the throughput drop towards the end, significantly increasing the time to completion?
There is an issue related to the throughput and input behavior of Cloud Dataflow. I suggest you track the improvements being made to the autoscaling and utilization behavior of workers here.
The default architecture for Dataflow worker processing and autoscaling is in some cases not as responsive as when the Dataflow Streaming Engine feature is enabled. I would recommend trying to run the relevant Dataflow pipeline with Streaming Engine enabled, since it provides more responsive autoscaling based on the CPU utilization of your pipeline.
I hope you find the above pieces of information useful.
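If you try Streaming Engine (it applies to streaming pipelines), it is turned on through a pipeline option. A minimal sketch for the Python SDK, assuming the --enable_streaming_engine flag is available in your SDK version; the project, region and bucket values are placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                # placeholder
    "--region=us-central1",                # placeholder
    "--temp_location=gs://my-bucket/tmp",  # placeholder
    "--streaming",
    "--enable_streaming_engine",           # opt in to Streaming Engine
])

with beam.Pipeline(options=options) as p:
    ...  # pipeline transforms go here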
Can you try to implement your solution without wait()?
For example,
FileIO.match().filepattern() -> ParDo(DoFn to decrypt files) -> FileIO.readMatches() -> ParDo(DoFn to read files)
See the example here.
This should allow your pipeline to better parallelize.
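The chain above uses the Java FileIO transforms; a rough sketch of the same idea with the Python SDK's fileio module, where the bucket pattern, the decrypt_pgp placeholder and the DoFn are hypothetical and not part of the original answer:

import apache_beam as beam
from apache_beam.io import fileio

def decrypt_pgp(encrypted_bytes):
    # Placeholder: plug in the actual PGP decryption and record parsing here.
    raise NotImplementedError

class DecryptFileFn(beam.DoFn):
    # One matched file per element, so files can be spread across workers
    # instead of being funneled through a single read/decrypt step.
    def process(self, readable_file):
        encrypted_bytes = readable_file.read()
        for record in decrypt_pgp(encrypted_bytes):
            yield record

with beam.Pipeline() as p:
    records = (
        p
        | "MatchFiles" >> fileio.MatchFiles("gs://my-bucket/input/*.pgp")
        | "ReadMatches" >> fileio.ReadMatches()
        | "DecryptAndRead" >> beam.ParDo(DecryptFileFn()))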

enqueuing jobs using sucker punch

I have a question about enqueuing jobs using Sucker Punch.
I have 2000+ search keywords in my database and I want to know the Google and Bing ranking for each of them. For this I'm using the Authority Labs API. But Authority Labs will only process 1000 POST requests per hour. I'm sending each request to Authority Labs as a background job using Sucker Punch. How can I ensure that only 1000 jobs run in one hour, with the remaining jobs starting only after that hour? I also want to run these jobs daily to analyse rank changes.
Rate limiting is not a concern of your queue system, much less of Sucker Punch, which is not designed to handle advanced delaying/queuing; it just moves asynchronous jobs onto a thread from a thread pool.
If you really want rate limiting, use a real queue system like Sidekiq and put some actual code to work.
Sidekiq Enterprise supports it natively: https://github.com/mperham/sidekiq/wiki/Ent-Rate-Limiting
Sidekiq-throttler seems to provide the same functionality: https://github.com/gevans/sidekiq-throttler
But you can also just delay execution (pre-emptively limiting the rate), either by enqueuing jobs at specific times in the future (each executing 4 minutes after the other) or by enqueuing just one job that executes itself (handling the next outstanding request) and enqueues itself again with a 4-minute delay.
As always with open source, check the code and decide by yourself.
Could you do something like this?
YourProcessingJob.set(wait: 1.hours).perform_later
Possibly in a custom rake task...

How to limit both total number of Quartz jobs and number running on single node in a cluster

I have a 3-node cluster and I need to run a specific Quartz job as follows:
At any given time, there are many (say 30) instances of this job that need to be run.
Limit the number of instances of that Quartz job running across the whole cluster at the same time (to 10, because of system resources).
Limit the number of instances of that Quartz job running on a single server at the same time (to 5, because of CPU load).
How do I limit both the total number of simultaneous job instances to 10 and the number running on any one host to 5? Is this even possible?
Note that I cannot simply limit the number of threads, as I have other jobs that need to run on the same servers at the same time, and those need threads as well.
Thanks.
While not exactly limiting the concurrent job count, you can limit the maximum thread count with the thread pool configuration. See the Quartz Configuration Reference.
The Grails Quartz plugin comes with a handy script for installing the config file:
grails install-quartz-config
org.quartz.threadPool.threadCount
Can be any positive integer, although you should realize that only numbers between 1 and 100 are very practical. This is the number of threads that are available for concurrent execution of jobs. If you only have a few jobs that fire a few times a day, then 1 thread is plenty! If you have tens of thousands of jobs, with many firing every minute, then you probably want a thread count more like 50 or 100 (this highly depends on the nature of the work that your jobs perform, and your systems resources!).
I think that I found the answer: run two (or more) separate Quartz schedulers. A job in the first scheduler would schedule the jobs for the second scheduler, and the second would run them. The second scheduler could then be limited to (in this case) 5 threads, although the first scheduler could have more.
Some information about this can be found in
http://quartz-scheduler.org/documentation/quartz-2.2.x/cookbook/MultipleSchedulers
However, I do not know how to implement two separate Quartz schedulers in Grails. If anyone could help with that I would appreciate it. There is an existing Stack Overflow question about this, although it is unanswered.
