Each Dataflow job spends around 2-4 minutes creating and deleting its VMs (the worker pool).
Is there any way to minimize this?
Or
can we create the VMs before the Dataflow job runs, so that the execution time comes down?
Dataflow is fully managed. From the documentation:
You should not attempt to manage or otherwise interact directly with
your Compute Engine Managed Instance Group; the Dataflow service will
take care of that for you. Manually altering any Compute Engine
resources associated with your Dataflow job is an unsupported
operation.
When a Dataflow streaming job with autoscaling enabled is deployed, it starts with a single worker.
Let's assume that the pipeline reads PubSub messages, applies some DoFn operations, and writes the results to BigQuery.
Let's also assume that the PubSub backlog is already fairly large.
So the pipeline starts, pulls some PubSub messages, and processes them on the single worker.
After a couple of minutes it determines that extra workers are needed and creates them.
By then many PubSub messages have already been loaded and are being processed, but have not been acked yet.
Here is my question: how does Dataflow manage those elements that are being processed but not yet acked?
My observations suggest that Dataflow sends many of those in-flight messages to a newly created worker, so the same element is processed at the same time on two workers.
Is this expected behavior?
Another question: what happens next? Does the first attempt win, or the new one?
I mean, the same PubSub message is still being processed on the first worker and also on the new one.
What if the first worker is faster and finishes processing first? Will the message be acked and its output sent downstream, or will it be dropped because a new attempt for this element is running and only the new one can be finalized?
Dataflow provides exactly-once processing of every record. Funnily enough, this does not mean that user code is run only once per record, whether by the streaming or batch runner.
It might run a given record through a user transform multiple times, or it might even run the same record simultaneously on multiple workers; this is necessary to guarantee at-least once processing in the face of worker failures. Only one of these invocations can “win” and produce output further down the pipeline.
More information here - https://cloud.google.com/blog/products/data-analytics/after-lambda-exactly-once-processing-in-google-cloud-dataflow-part-1
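To make the "only one invocation wins" idea concrete, here is a toy model in plain Java. It is not Dataflow's actual commit protocol, and the class and method names (FirstCommitWins, tryCommit) are made up for illustration. It also assumes the winner is simply whichever attempt commits first, which the quoted text does not spell out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: NOT Dataflow's real commit protocol. All names are hypothetical. */
public class FirstCommitWins {
    // IDs of records whose output has already been committed.
    private final Set<String> committed = new HashSet<>();
    // Output that actually made it further down the pipeline.
    private final List<String> downstream = new ArrayList<>();

    /** Returns true if this attempt won, false if another attempt already committed. */
    public synchronized boolean tryCommit(String recordId, String output) {
        if (!committed.add(recordId)) {
            return false; // another attempt for this record already won; discard this output
        }
        downstream.add(output);
        return true;
    }

    /** Returns a copy of everything that was committed downstream. */
    public synchronized List<String> downstream() {
        return new ArrayList<>(downstream);
    }

    public static void main(String[] args) {
        FirstCommitWins sink = new FirstCommitWins();
        // The user code ran twice for the same record, once on each worker.
        boolean workerA = sink.tryCommit("msg-1", "msg-1 processed by worker A");
        boolean workerB = sink.tryCommit("msg-1", "msg-1 processed by worker B");
        System.out.println(workerA + " " + workerB + " " + sink.downstream());
    }
}
```

The practical takeaway is the same as in the answer: the DoFn body may run more than once per message, so any side effects inside it should tolerate repeats, while the pipeline output is produced once.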
I wanted to check a scenario where 30-40 jobs run concurrently in Cloud Dataflow. Is there a setting that lets the workers used by one job be shared with other jobs, or that lets a job use an existing managed instance group as its compute option?
The reason for asking is to find out whether the risk of running out of compute instances or exceeding the quota can be managed.
Cloud Dataflow manages the GCE instances internally. This means that it is unable to share the instances with other jobs. Please see here for more information.
I have a dataflow job, that subscribed to messages from PubSub:
    p.apply("pubsub-topic-read",
        PubsubIO.readMessagesWithAttributes()
            .fromSubscription(options.getPubSubSubscriptionName())
            .withIdAttribute("uuid"));
I see in the docs that there is no guarantee against duplicates, and Beam suggests using withIdAttribute.
This works perfectly until I drain an existing job, wait for it to finish, and start another one; then I see millions of duplicate BigQuery records (my job writes PubSub messages to BigQuery).
Any idea what I'm doing wrong?
I think you should use the update feature instead of draining the pipeline and starting a new one. With drain and restart, state is not shared between the two pipelines, so Dataflow cannot identify messages that PubSub already delivered to the old pipeline. With the update feature you should be able to continue your pipeline without duplicate messages.
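A rough way to picture why the restart loses deduplication: treat the ID-based dedup as a per-pipeline set of already seen IDs. The sketch below is a plain-Java toy, not Beam or Dataflow code, and DedupStateSketch and process are hypothetical names. It only assumes that the "seen" state lives with the pipeline instance, as the answer describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of ID-based deduplication. Hypothetical names, not Beam APIs. */
public class DedupStateSketch {
    // Per-pipeline memory of message IDs that were already passed downstream.
    private final Set<String> seenIds = new HashSet<>();

    /** Returns only the IDs this pipeline instance has not seen before, in order. */
    public List<String> process(List<String> messageIds) {
        List<String> emitted = new ArrayList<>();
        for (String id : messageIds) {
            if (seenIds.add(id)) { // add() returns false when the ID is already known
                emitted.add(id);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        DedupStateSketch oldJob = new DedupStateSketch();
        // The repeated "a" is dropped within the same job.
        System.out.println(oldJob.process(Arrays.asList("a", "b", "a")));

        // "b" is redelivered to the same job (state kept, as with an update): dropped.
        System.out.println(oldJob.process(Arrays.asList("b", "c")));

        // "b" is redelivered to a brand-new job (empty state): emitted again.
        DedupStateSketch newJob = new DedupStateSketch();
        System.out.println(newJob.process(Arrays.asList("b", "c")));
    }
}
```

In this model the new job has no way to know that "b" was already written, which matches the duplicates seen after a drain and restart.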
I have to join data from Google Datastore and Google BigTable to produce a report. I need to execute that operation every minute. Is it possible to accomplish this with Google Cloud Dataflow (assuming the processing itself should not take a long time and/or can be split into independent parallel jobs)?
Should I have an endless loop inside the "main" that creates and executes the same pipeline again and again?
If most of the time in such a scenario is spent bringing up the VMs, is it possible to instruct Dataflow to use customer VMs instead?
Thanks,
If you expect that your job is small enough to complete in 60 seconds you could consider using the Datastore and BigTable APIs from within a DoFn in a Streaming job. Your pipeline might look something like:
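    // Emits one element per minute, forever. CountingInput comes from older SDK
    // versions; in Beam 2.x the equivalent source should be
    // GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1)).
    PCollection<Long> impulse = p.apply(
        CountingInput.unbounded().withRate(1, Duration.standardMinutes(1)));
    PCollection<A> input1 = impulse.apply(ParDo.of(readFromDatastore));
    PCollection<B> input2 = impulse.apply(ParDo.of(readFromBigTable));
    ...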
This produces a single input every minute, forever. Running as a streaming pipeline, the VMs will continue running.
After reading from both APIs you can then window/join as necessary.
I have a dataflow job with Autoscaling enabled, which resized the worker pool to 14 during execution. By the time the job had finished the job log reported 6 OutOfMemoryErrors but the whole pipeline, as well as each execution step, had status succeeded. Can I trust the job status, or could I have data loss due to the worker failures?
You can trust the job status and results, because Dataflow is designed to process data in a way that is resilient to such failures. Further information can be found in the description of Service Optimization and Execution. Specifically:
The Dataflow service is fault-tolerant, and may retry your code
multiple times in the case of worker issues. The Dataflow service may
create backup copies of your code, and can have issues with manual
side effects (such as if your code relies upon or creates temporary
files with non-unique names).
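As an illustration of the temporary-file caveat in that quote, here is a minimal plain-Java sketch of writing scratch data under a unique name, so that a retried or duplicated attempt cannot overwrite another attempt's file. The class name, method name, and the "dofn-scratch-" prefix are all made up for this example. Only Files.createTempFile is a real JDK call.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch only: hypothetical helper for code that may be retried or run twice. */
public class UniqueTempFileSketch {

    /** Writes the payload to a fresh, uniquely named temp file and returns its path. */
    public static Path writeScratch(String payload) {
        try {
            // createTempFile creates a new file with a name that did not exist before,
            // so two attempts on the same machine never share a file.
            Path scratch = Files.createTempFile("dofn-scratch-", ".tmp");
            Files.write(scratch, payload.getBytes(StandardCharsets.UTF_8));
            return scratch;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate the same step being attempted twice.
        Path first = writeScratch("attempt 1");
        Path second = writeScratch("attempt 2");
        System.out.println("same file? " + first.equals(second));
        // Clean up the scratch files once they are no longer needed.
        Files.deleteIfExists(first);
        Files.deleteIfExists(second);
    }
}
```

A fixed name such as /tmp/output.tmp would be the risky pattern here, since a retry or backup copy of the same step could read or clobber a file left by another attempt.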