Is it possible to automatically trigger a Dataflow job every 10 minutes? Any insights on how to achieve this?
Yes. This is explained in the blog post Scheduling Dataflow pipelines using App Engine Cron Service or Cloud Functions.
Related
We are deploying/triggering Dataflow streaming jobs through Airflow using a Flex Template. We want these streaming jobs to run, say, for 24 hours (or until a certain clock time), then stop/cancel on their own. Is there a parameter in Dataflow (a pipeline setting like max workers) that will do this?
I think there is no parameter or automatic mechanism in Dataflow to stop or drain a job after a given time.
You can do that with an Airflow DAG.
For example, you can create a cron DAG with Airflow (running every 24 hours) that is responsible for stopping or draining the Dataflow job; there is a built-in operator for that:
from airflow.providers.google.cloud.operators.dataflow import DataflowStopJobOperator

# Drains (or, with drain_pipeline=False, cancels) all Dataflow jobs whose
# name starts with the given prefix in the given region.
stop_dataflow_job = DataflowStopJobOperator(
    task_id="stop-dataflow-job",
    location="europe-west3",
    job_name_prefix="start-template-job",
)
To stop one or more Dataflow pipelines you can use DataflowStopJobOperator. Streaming pipelines are drained by default; setting drain_pipeline to False will cancel them instead. Provide job_id to stop a specific job, or job_name_prefix to stop all jobs with the provided name prefix.
The Bitbucket scheduled pipelines UI does not let us enter a cron expression; we can only run a pipeline hourly, daily or weekly. There is an option to create a schedule via an API call with a cron expression in the payload, but unfortunately it does not accept a complex cron expression.
What could be the best way to achieve running the pipelines just on weekdays?
I'm looking for a better solution than these:
Have multiple daily pipelines, one per weekday (Monday to Friday).
Have a single daily pipeline with a day-of-week check inside the run logic (a rough sketch of this check is below).
Is there a better option?
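For reference, option 2 would look roughly like the following sketch; the script name run_if_weekday.py and the run_pipeline_steps.sh command are placeholders for whatever the pipeline actually does.
# run_if_weekday.py - the daily Bitbucket schedule always triggers this script,
# but the real work only runs Monday to Friday.
import subprocess
import sys
from datetime import datetime, timezone

# Monday is 0 and Sunday is 6, so 5 and 6 are the weekend days.
if datetime.now(timezone.utc).weekday() >= 5:
    print("Weekend - nothing to do")
    sys.exit(0)

print("Weekday - running the actual pipeline steps")
subprocess.run(["./run_pipeline_steps.sh"], check=True)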
Is it possible to call Dataflow using Cloud Run in GCP, or are there any alternative ways to run Dataflow every 30 minutes?
I would be thankful if someone shared reference material for the implementation procedure.
Running dataflow locally on Cloud Run
You could do the following:
Given that you want to invoke Dataflow every 30 minutes or at a regular, predefined interval, consider using Cloud Scheduler. Cloud Scheduler is a fully managed cron-like service that lets you invoke a URL at whatever frequency you want, for example every 30 minutes.
The URL that you are invoking can be a Google Cloud Function. The code inside your function will be the execution code that launches your Dataflow template.
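As a rough sketch of what such a function could look like, assuming a classic template already staged in GCS (the project, region, template path and job name below are placeholders), using the templates.launch method of the Dataflow REST API via the Google API client library:
# main.py - an HTTP-triggered Cloud Function that launches a classic Dataflow template.
from googleapiclient.discovery import build

PROJECT = "my-project"                              # placeholder
REGION = "us-central1"                              # placeholder
TEMPLATE = "gs://my-bucket/templates/my-template"   # placeholder

def launch_dataflow(request):
    # Uses the function's default service account credentials.
    dataflow = build("dataflow", "v1b3")
    response = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={"jobName": "scheduled-dataflow-job", "parameters": {}},
    ).execute()
    return response["job"]["id"]
A Cloud Scheduler job pointed at the function's HTTPS URL with the schedule */30 * * * * then takes care of the 30-minute interval.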
You should consider using a Flex Template and invoking the Dataflow job using the REST API and Cloud Scheduler.
If you want to run a regular Dataflow job, you could follow Romin's advice and use a Cloud Function to execute custom code that launches a Dataflow job, and invoke the function from Cloud Scheduler. That is more complex IMHO; using Flex Templates might just be easier.
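A minimal sketch of that launch call, assuming a Flex Template spec already uploaded to GCS (the project, region, bucket and parameter names are placeholders); the same request body can also be configured directly on a Cloud Scheduler HTTP job with OAuth authentication:
# launch_flex_template.py - calls the Dataflow flexTemplates:launch REST endpoint.
import google.auth
from google.auth.transport.requests import AuthorizedSession

PROJECT = "my-project"   # placeholder
REGION = "us-central1"   # placeholder

credentials, _ = google.auth.default()
session = AuthorizedSession(credentials)

url = (
    "https://dataflow.googleapis.com/v1b3/projects/"
    f"{PROJECT}/locations/{REGION}/flexTemplates:launch"
)
body = {
    "launchParameter": {
        "jobName": "scheduled-flex-job",
        "containerSpecGcsPath": "gs://my-bucket/templates/my-flex-template.json",
        "parameters": {"input": "gs://my-bucket/input/*.csv"},  # placeholder
    }
}

response = session.post(url, json=body)
response.raise_for_status()
print("Launched job:", response.json()["job"]["id"])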
I'm running a Google Cloud Dataflow job and want individual execution times for all the steps in my pipeline, including nested transforms. It is a streaming Dataflow job and the pipeline currently looks like this:
Current dataflow
Can anyone please suggest a solution?
The answer is wall time. You can access this info by clicking on any of the steps in your pipeline (even nested ones).
The elapsed time of a job is the total time it takes to complete your Dataflow job, while wall time is the sum of the time the assigned workers spend running each step. See the image below for more details.
We have multiple Jenkins jobs scheduled at roughly the same time every night.
I would like a summary report of their statuses to be available to me, or sent to me.
I do not want to repeatedly walk through the test suite every day.
Any advice on the topic would be much appreciated.
The Global Build Stats plugin might fit your needs. It does not support scheduled email, but if you need that you could use the REST API it exposes to write your own.
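If you do go the "write your own" route, a small script against Jenkins' standard JSON API is often enough for a nightly summary (this is only a sketch, not the plugin's API; the URL, job names and credentials are placeholders):
# jenkins_summary.py - print the last build result of each nightly job.
import requests

JENKINS_URL = "https://jenkins.example.com"   # placeholder
JOBS = ["nightly-build", "nightly-tests"]     # placeholder job names
AUTH = ("user", "api-token")                  # placeholder credentials

for job in JOBS:
    resp = requests.get(f"{JENKINS_URL}/job/{job}/lastBuild/api/json", auth=AUTH)
    resp.raise_for_status()
    build = resp.json()
    print(f"{job}: {build['result']} (#{build['number']}, {build['duration']} ms)")
The output can then be mailed or posted to chat from a small nightly job of its own.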