Deploying Dataflow job that runs for X hours - google-cloud-dataflow

We are deploying/triggering Dataflow streaming jobs through Airflow using a flex template. We want these streaming jobs to run, say, for 24 hours (or until a certain clock time), then stop/cancel on their own. Is there a Dataflow parameter (a pipeline setting like max workers) that will do this?

I think there is no parameter or automatic mechanism to stop or drain a Dataflow job.
You can do that with an Airflow DAG.
For example, you can create a cron DAG in Airflow (running every 24 hours) that is responsible for stopping or draining the Dataflow job; there is a built-in operator to do that:
from airflow.providers.google.cloud.operators.dataflow import DataflowStopJobOperator

stop_dataflow_job = DataflowStopJobOperator(
    task_id="stop-dataflow-job",
    location="europe-west3",
    job_name_prefix="start-template-job",
)
To stop one or more Dataflow pipelines you can use DataflowStopJobOperator. Streaming pipelines are drained by default; setting drain_pipeline to False will cancel them instead. Provide job_id to stop a specific job, or job_name_prefix to stop all jobs with the provided name prefix.
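A minimal sketch of such a DAG (hedged: the schedule, start date, and project ID below are illustrative placeholders, not values from the question):

import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import DataflowStopJobOperator

with DAG(
    dag_id="stop_streaming_dataflow",
    schedule="0 0 * * *",  # once every 24 hours
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    # Drains (the default) every streaming job whose name starts with the prefix.
    DataflowStopJobOperator(
        task_id="stop-dataflow-job",
        project_id="my-project",  # placeholder
        location="europe-west3",
        job_name_prefix="start-template-job",
        drain_pipeline=True,
    )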

Related

Best way to run bitbucket scheduled pipelines on weekdays

The Bitbucket scheduled pipelines UI does not have an option to enter a cron expression; we can only run the pipeline hourly, daily, or weekly. There is an option to create a schedule via an API call with a cron expression in the payload; unfortunately, it does not accept a complex cron expression.
What would be the best way to run the pipelines just on weekdays?
Looking for a better solution than these:
- Have multiple daily pipelines Mon-Fri.
- Have a daily pipeline and a check inside the running logic for the day (a sketch of this check follows below).
Is there a better option?
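One way to picture the second option, as a hedged sketch (assuming the pipeline step runs a Python script; the messages and exit behavior are illustrative, not from the original question):

import datetime
import sys

# Monday is 0 and Sunday is 6; skip the run on Saturday (5) and Sunday (6).
if datetime.datetime.utcnow().weekday() >= 5:
    print("Weekend - nothing to do, exiting.")
    sys.exit(0)

print("Weekday - running the actual pipeline work.")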

Dataflow worker pool creation and deletion time overhead

Each Dataflow job takes around 2-4 minutes for the creation and deletion of the VMs (the worker pool).
Please let me know if there is any way to minimize this.
OR
Can we create the VMs before the execution of the Dataflow job so that the execution time comes down?
Dataflow is fully managed. From the documentation:
You should not attempt to manage or otherwise interact directly with
your Compute Engine Managed Instance Group; the Dataflow service will
take care of that for you. Manually altering any Compute Engine
resources associated with your Dataflow job is an unsupported
operation.

Calling dataflow job automatically on a finite interval

Is it possible to automatically trigger a Dataflow job every 10 minutes? Any insights on how to achieve this?
Yes. This is explained in the blog post Scheduling Dataflow pipelines using App Engine Cron Service or Cloud Functions.
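As a rough illustration of the Cloud Functions approach (hedged: this assumes a classic template already staged in GCS; the project, bucket, and template path are placeholders, and the cron service would invoke the function every 10 minutes):

from googleapiclient.discovery import build

def launch_dataflow(request):
    """HTTP-triggered function that launches a staged Dataflow template."""
    dataflow = build("dataflow", "v1b3")
    dataflow.projects().locations().templates().launch(
        projectId="my-project",      # placeholder
        location="us-central1",      # placeholder
        gcsPath="gs://my-bucket/templates/my-template",  # placeholder template path
        body={"jobName": "scheduled-job", "parameters": {}},
    ).execute()
    return "Job launched"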

Increase the recurring job polling interval for hangfire and enabling/disabling the recurring job process

I am trying to create a background processor windows service using hangfire.
I would like to increase the recurring job polling interval to more than 1 minute (hard-coded by default). The reason is that frequent polling can affect the performance of the database.
Is there a possibility to enable/disable the Hangfire recurring job feature? This is required in case there are multiple instances of the service installed.
When you create a recurring job in Hangfire, even if you have multiple Hangfire servers, the job will not run on two servers at the same time.
You can use a cron expression to define the frequency at which to run your job, as described in the Hangfire docs:
// Runs YourJob at 12:00 every day of every second month (five-field cron).
RecurringJob.AddOrUpdate(() => YourJob(), "0 12 * */2 *");
However, your need may be to avoid triggering a job when the previous instance is still running. For this situation, I would recommend setting a flag (in the DB for example) when your job starts and removing it when it ends. Then check if the flag is present before actually starting your process.
Update
As you stated you want to prevent the RecurringJobScheduler from running on some servers, I have looked into the code and it seems there is no option to do this.
You can check the file BackgroundJobServer.cs where the scheduler is added to the process list and the RecurringJobScheduler.cs where the DB is queried. The value of 1 minute is hardcoded, as specified in the comments.
I think your only option is the pull request you have already made :(

Running periodic Dataflow job

I have to join data from Google Datastore and Google BigTable to produce a report, and I need to execute that operation every minute. Is it possible to accomplish this with Google Cloud Dataflow (assuming the processing itself should not take long and/or can be split into independent parallel jobs)?
Should I have an endless loop inside "main", creating and executing the same pipeline again and again?
If most of the time in such a scenario is taken by bringing up the VMs, is it possible to instruct Dataflow to use customer VMs instead?
Thanks,
If you expect your job to be small enough to complete in 60 seconds, you could consider using the Datastore and BigTable APIs from within a DoFn in a streaming job. Your pipeline might look something like:
// Emit one element per minute, forever (readFromDatastore/readFromBigTable
// are the answer's placeholder DoFns).
PCollection<Long> impulse = p.apply(
    CountingInput.unbounded().withRate(1, Duration.standardMinutes(1)));
PCollection<A> input1 = impulse.apply(ParDo.of(readFromDatastore));
PCollection<B> input2 = impulse.apply(ParDo.of(readFromBigTable));
...
This produces a single input every minute, forever. Running as a streaming pipeline, the VMs will continue running.
After reading from both APIs you can then window/join as necessary.
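For reference, a rough present-day equivalent of that impulse in the Beam Python SDK (hedged: PeriodicImpulse post-dates this answer, and the read_from_datastore / read_from_bigtable callables are placeholders for the actual API calls):

import apache_beam as beam
from apache_beam.transforms.periodicsequence import PeriodicImpulse

def read_from_datastore(ts):
    ...  # query the Datastore API here

def read_from_bigtable(ts):
    ...  # query the BigTable API here

with beam.Pipeline() as p:
    # Emits one timestamped element per minute, indefinitely.
    impulse = p | PeriodicImpulse(fire_interval=60)
    input1 = impulse | "Datastore" >> beam.Map(read_from_datastore)
    input2 = impulse | "BigTable" >> beam.Map(read_from_bigtable)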
