Dataflow: control high fanout between steps

I have three steps in a Dataflow pipeline:
Step 1: reads from Pub/Sub, saves the message to a table, and splits it into multiple events (emitted via context output).
Step 2: for each split event, queries the DB and decorates the event with additional data.
Step 3: publishes the event to another Pub/Sub topic for further processing.
PROBLEM:
After step 1, the message splits into 10K to 20K events.
Now step 2 is running out of database connections (I have a static Hikari connection pool).
It works absolutely fine with less data. I am using an n1-standard-32 machine.
What should I do to limit the input to the next step, so that parallelism is restricted, or to throttle the events flowing into the next step?

I think the basic idea is to reduce parallelism when executing step 2 (if you have massive parallelism, you will need 20k connections for 20k events, because 20k events are processed in parallel).
Ideas include:
Stateful ParDo: execution is serialized per key per window, which means only one connection is needed for a stateful ParDo, because only one element is processed at a given time for a key and a window.
One connection per bundle: you can initialize a connection at startBundle and make elements within the same bundle use the same connection (if my understanding is correct, execution within a bundle is likely serialized). A sketch of this idea follows below.
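As a rough illustration of the second idea, here is a minimal sketch (Beam Java SDK, plain JDBC) of a DoFn that opens one connection per bundle in @StartBundle and closes it in @FinishBundle. The JDBC URL and the enrichWithDbData helper are hypothetical placeholders, and events are assumed to be plain strings here; the real pipeline would use its own event type and borrow from the Hikari pool instead of calling DriverManager directly.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import org.apache.beam.sdk.transforms.DoFn;

public class EnrichPerBundleFn extends DoFn<String, String> {
  private transient Connection connection;

  @StartBundle
  public void startBundle() throws Exception {
    // One connection shared by all elements of this bundle.
    connection = DriverManager.getConnection("jdbc:postgresql://db-host/mydb"); // hypothetical URL
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // Hypothetical helper that queries the DB and decorates the event.
    c.output(enrichWithDbData(c.element(), connection));
  }

  @FinishBundle
  public void finishBundle() throws Exception {
    if (connection != null) {
      connection.close();
      connection = null;
    }
  }

  private String enrichWithDbData(String event, Connection conn) {
    // Placeholder: run the lookup query and merge the result into the event.
    return event;
  }
}
```

With a pooled data source, @StartBundle would borrow a connection and @FinishBundle would return it, so the number of connections in use is bounded by the number of bundles being processed concurrently rather than by the number of elements.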

Related

How does Dataflow manage in-flight work when upscaling a streaming job?

When a Dataflow streaming job with autoscaling enabled is deployed, it starts with a single worker.
Let's assume the pipeline reads Pub/Sub messages, does some DoFn operations, and uploads the results into BQ.
Let's also assume that the Pub/Sub backlog is already fairly large.
So the pipeline starts and pulls some Pub/Sub messages, processing them on the single worker.
After a couple of minutes the service realizes that extra workers are needed and creates them.
Many Pub/Sub messages have already been pulled and are being processed, but not acked yet.
And here is my question: how will Dataflow manage those unacked, in-flight elements?
My observations suggest that Dataflow sends many of those already-in-progress messages to a newly created worker, and we can see the same element being processed at the same time on two workers.
Is this expected behavior?
Another question: what happens next? Does the first one win, or the new one?
I mean, we have the same Pub/Sub message still being processed on the first worker and on the new one.
What if the process on the first worker is faster and finishes first? Will it be acked and sent downstream, or will it be dropped because a new process for this element is underway and only the new one can be finalized?
Dataflow provides exactly-once processing of every record. Funnily enough, this does not mean that user code is run only once per record, whether by the streaming or batch runner.
It might run a given record through a user transform multiple times, or it might even run the same record simultaneously on multiple workers; this is necessary to guarantee at-least-once processing in the face of worker failures. Only one of these invocations can "win" and produce output further down the pipeline.
More information here - https://cloud.google.com/blog/products/data-analytics/after-lambda-exactly-once-processing-in-google-cloud-dataflow-part-1

Orchestration/notification of processing events

I have the following SCDF use case.
I have a couple hundred files to process and put in the DB.
A producer will get a single file, read the first N rows and send them to the source (RabbitMQ), then read the next N rows and send them again, and so on until done.
A consumer will receive these file chunks (from RabbitMQ), do some minor enriching, and write them to the DB (sink).
I will have more than one stream running (say 4, for example) for some parallel processing of these files.
My question is: does SCDF have a mechanism to know when all consumers are done (and hence the queue(s) are exhausted), so I can know when to start some other process (another stream/task/anything) that needs the DB fully populated before it can begin?
Yes, sink1 is the only consumer of source1. In a streaming application, there is no concept of "COMPLETED". By definition, stream processing is logically unbounded, and stream apps (sources and sinks) are designed to run forever. Tasks, on the other hand, are short-lived, finite processes that exit when they are complete. The application logic defines when the task is complete. Processing a file, or a chunk of a file, is the most common use case. A stream can monitor a file system, or a remote file source such as SFTP or S3, and launch a task whenever a new file appears. The task processes the file and marks the execution as COMPLETE.
This type of use case is better suited for task/batch. See https://dataflow.spring.io/docs/recipes/batch/sftp-to-jdbc/ which details the recommended architecture. You can define a composed task to run the ingest and then the next task.

Apache Beam/Dataflow Reshuffle

What is the purpose of org.apache.beam.sdk.transforms.Reshuffle? In the documentation the purpose is defined as:
A PTransform that returns a PCollection equivalent to its input but
operationally provides some of the side effects of a GroupByKey, in
particular preventing fusion of the surrounding transforms,
checkpointing and deduplication by id.
What is the benefit of preventing fusion of the surrounding transforms? I thought fusion was an optimization to avoid unnecessary steps. An actual use case would be helpful.
There are a couple of cases when you may want to reshuffle your data. The following is not an exhaustive list, but it should give you an idea of why you may want to reshuffle:
When one of your ParDo transforms has a very high fanout
This means that the parallelism is increased after your ParDo. If you don't break fusion here, your pipeline will not be able to split the data across multiple machines to process it.
Consider the extreme case of a DoFn that generates a million output elements for every input element. Consider that this ParDo receives 10 elements in its input. If you don't break fusion between this high-fanout ParDo and its downstream transforms, it will only be able to run on 10 machines, even though you will have millions of elements.
A good way to diagnose this is to look at the number of elements in an input PCollection vs. the number of elements in the output PCollection. If the latter is significantly larger than the former, then you may want to consider adding a reshuffle.
When your data is not well balanced across machines
Imagine that your pipeline consumes 9 files of 10MB and one file of 10GB. If each file is read by a single machine, you will have one machine with a lot more data than the others.
If you don't reshuffle this data, most of your machines will be idle while your pipeline runs. Reshuffling it allows you to rebalance the data to be processed more evenly across machines.
A good way to diagnose this is by looking at how many workers are executing work in your pipeline. If the pipeline is slow, and there is only one worker processing data, then you can benefit from a reshuffle.
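For reference, breaking fusion after a high-fanout step in the Beam Java SDK amounts to inserting the transform between the fan-out ParDo and its downstream consumers. A minimal sketch, with a hypothetical ExpandFn standing in for the high-fanout step:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

public class ReshuffleExample {
  // Hypothetical high-fanout DoFn: one input element becomes many outputs.
  static class ExpandFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      for (int i = 0; i < 1_000_000; i++) {
        c.output(c.element() + "-" + i);
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> expanded =
        p.apply(Create.of("a", "b", "c"))
         .apply("HighFanout", ParDo.of(new ExpandFn()))
         // Without this, the fan-out and everything downstream stay fused and
         // run with the parallelism of the tiny input collection. Reshuffling
         // lets the runner redistribute the expanded elements across workers.
         .apply("BreakFusion", Reshuffle.<String>viaRandomKey());

    // ...downstream transforms would be applied to `expanded` here.
    p.run();
  }
}
```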

Is there any way to set numWorkers dynamically in the middle of a running Dataflow job?

I am using Google Dataflow at work.
While using Dataflow, I need to change the number of workers dynamically while a Dataflow batch job is running.
That's mainly because of Cloud Bigtable QPS.
We are using 3 Bigtable cluster nodes, and they can't handle receiving all the traffic from 500 workers at once.
So I need to change the number of workers (from 500 to 25) just before inserting all the processed data into Bigtable.
Is there any way to achieve this goal?
Dataflow does not provide the ability to manually change the resource allocation of a batch job while it is running, however:
1) We plan to incorporate throttling into our autoscaling algorithms, so Dataflow would detect that it needs to downsize while writing to your bigtable. I don't have a concrete ETA, but this is definitely on our roadmap.
2) Meanwhile, you can try to artificially limit the parallelism of your pipeline with a trick like this:
Take your PCollection<Something> (Something being the data type you're writing to bigtable)
Pipe it through a sequence of transforms: ParDo(pair with a random key in 0..25), GroupByKey, ParDo(ungroup and remove random key). You get, again, a PCollection<Something>
Write this collection to Bigtable.
The trick here is that there is no parallelization within a single key after a GroupByKey, so the result of the GroupByKey is a collection of 25 key-value pairs (where the value is an Iterable<Something>) that can't be processed by more than 25 workers in parallel. The ParDos following it will likely get fused together with the write to Bigtable, and will thus have a parallelism of 25.
The caveat is that Dataflow is within its rights to materialize any intermediate collections if it predicts that this will improve performance of the pipeline. It may even do this just for the sake of increasing the degree of parallelism (which goes explicitly against your goal in this example). But if you have an urgent job to run, I believe right now this will probably do what you want.
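A rough sketch of that keying trick in the Beam/Dataflow Java SDK might look like the following. For simplicity it uses String in place of the Something type from the answer; the limit of 25 keys matches the example above.

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class LimitParallelism {
  /** Caps downstream parallelism at roughly numKeys by funneling elements through that many keys. */
  public static PCollection<String> limitTo(PCollection<String> input, int numKeys) {
    return input
        // Pair each element with a random key in [0, numKeys).
        .apply("PairWithRandomKey", ParDo.of(new DoFn<String, KV<Integer, String>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(KV.of(ThreadLocalRandom.current().nextInt(numKeys), c.element()));
          }
        }))
        // Group by the random key: no parallelism within a single key downstream.
        .apply("GroupByRandomKey", GroupByKey.<Integer, String>create())
        // Ungroup and drop the key, yielding the original elements again.
        .apply("Ungroup", ParDo.of(new DoFn<KV<Integer, Iterable<String>>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            for (String element : c.element().getValue()) {
              c.output(element);
            }
          }
        }));
  }
}
```

The resulting collection can then be written to Bigtable; with numKeys = 25, the fused ungroup-and-write stage should run with a parallelism of at most 25.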
Meanwhile, the only long-term solution I can suggest, until we have throttling, is to use a smaller limit on the number of workers, or use a larger Bigtable cluster, or both.
There's a lot of relevant information in the "Analyzing 25 billion stock market events in an hour with NoOps on GCP" talk (DATA & ANALYTICS track) from GCP/Next.
FWIW, you can increase the number of nodes of Bigtable before your batch job, give Bigtable a few minutes to adjust, and then start your job. You can turn down the Bigtable cluster when you're done with the batch job.
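If you go the route of resizing the cluster around the batch job, something along these lines should work with the Cloud Bigtable admin client for Java. The project, instance, and cluster IDs and the node counts below are hypothetical, and the exact admin API may vary by client version; the same resize can also be done from the console or gcloud.

```java
import com.google.cloud.bigtable.admin.v2.BigtableInstanceAdminClient;

public class ResizeBigtableCluster {
  public static void main(String[] args) throws Exception {
    try (BigtableInstanceAdminClient admin =
            BigtableInstanceAdminClient.create("my-project")) { // hypothetical project ID
      // Scale up before starting the Dataflow batch job, give Bigtable a few
      // minutes to rebalance, then launch the job.
      admin.resizeCluster("my-instance", "my-cluster", 30); // hypothetical IDs and node count

      // ...after the batch job finishes, scale back down:
      // admin.resizeCluster("my-instance", "my-cluster", 3);
    }
  }
}
```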

Delayed Processing Jobs in Ruby: How much is not blocking my path

I have this project which still uses delayed_job as its processing job queue. I've recently found an edge case which is making me question a few things: I have this AR object (I'm using MySQL, by the way) which on update sends a message to all the elements of a has_many association. In order to do that, I have to instantiate all the elements of this association and call the message on them. It seemed fair enough to delay the call of this message for each one of them.
Now the association has grown quite a bit, to the point where in an edge case I have 40,000 objects belonging to that association. Sending the message therefore now involves the (synchronous) creation of 40,000 delayed-job jobs. Since these happen inside an after_update callback and not an after_commit, they are thereby (ab)using the same connection, not taking advantage of any context switching. Short version: I have a pipeline of 1 UPDATE statement and 40,000 INSERTs on the same connection. This update is gobbling up quite a few minutes in production for that reason.
Now, there are a lot of ways around this: change the callback to an after_commit, create 1 (synchronous) delayed job which will itself create the 40,000 jobs (I don't want to handle the 40,000 (AR) objects in one job; the 40,000 now will be 120,000 tomorrow, and that's memory Armageddon), etc.
But what I'm really considering is switching my delayed processing queue to Resque or Sidekiq. They use Redis, so write performance is far better. They use something other than MySQL, which means the connections will not block each other. My only issue is: how much would 40,000 writes at once to Redis cost me? And: do any of these options first store the jobs in memory, not blocking the response to the client, and belatedly store them in Redis? So, my real question is: how much would this delaying delay me in such an edge case?
Indeed, Redis can process writes faster than MySQL. Try running redis-benchmark; you'll see figures of 100k+ writes/sec.
does any one of these options first store the jobs in memory, not blocking the response to the client and belatedly stores them in redis?
No, they do it synchronously.
I don't want to handle the 40000 (AR) objects in one job
Maybe you should try a hybrid approach: process chunks of N objects per job. Batch writes should be faster than 40k individual writes, and it scales well (the batch size stays the same whether you have 40k or 400k items).
