I am trying to explore the Dataflow SQL feature: https://cloud.google.com/dataflow/docs/guides/sql/dataflow-sql-intro
I am able to submit and successfully execute a Dataflow job using SQL. Now I want to schedule this job to run every 15 minutes.
I know that for a normal Dataflow job using the Java SDK we can schedule the job using a template.
I am unable to find any documentation on scheduling Dataflow SQL jobs.
Related
We are deploying/triggering Dataflow streaming jobs through Airflow using a Flex Template. We want these streaming jobs to run for, say, 24 hours (or until a certain clock time), then stop/cancel on their own. Is there a parameter in Dataflow (a pipeline setting, like max workers) that will do this?
I think there is no parameter or automatic mechanism to stop or drain a Dataflow job.
You can do that with an Airflow DAG.
For example, you can create a scheduled DAG in Airflow (running every 24 hours) that is responsible for stopping or draining the Dataflow job. There is a built-in operator to do that:
from airflow.providers.google.cloud.operators.dataflow import DataflowStopJobOperator

# Drains (default) or cancels every Dataflow job whose name starts with the given prefix.
stop_dataflow_job = DataflowStopJobOperator(
    task_id="stop-dataflow-job",
    location="europe-west3",
    job_name_prefix="start-template-job",
)
To stop one or more Dataflow pipelines you can use DataflowStopJobOperator. Streaming pipelines are drained by default; setting drain_pipeline to False will cancel them instead. Provide job_id to stop a specific job, or job_name_prefix to stop all jobs with the provided name prefix.
The Bitbucket scheduled pipelines UI does not have an option to enter a cron expression; we can only run the pipeline hourly, daily, or weekly. There is an option to create a schedule via an API call with a cron expression in the payload, but unfortunately it does not accept a complex cron expression.
What would be the best way to run the pipelines only on weekdays?
I am looking for a better solution than these:
Have multiple daily pipelines, Monday through Friday.
Have a daily pipeline with a check inside the run logic for the day of the week (see the sketch after this question).
Is there a better option?
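For what it is worth, a minimal sketch of the second option, with all names hypothetical: the pipeline runs daily, and a small guard step exits early on weekends.

from datetime import datetime, timezone
import sys

def main() -> None:
    # Monday is 0 and Sunday is 6; skip Saturday (5) and Sunday (6).
    if datetime.now(timezone.utc).weekday() >= 5:
        print("Weekend - skipping this scheduled run.")
        sys.exit(0)
    run_pipeline()

def run_pipeline() -> None:
    # Hypothetical entry point for the real pipeline work.
    print("Running the weekday pipeline steps...")

if __name__ == "__main__":
    main()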
Is it possible to call Dataflow using Cloud Run in GCP, or are there any alternative ways to run Dataflow every 30 minutes?
I would be thankful if someone could share reference material for the implementation procedure.
Running dataflow locally on Cloud Run
You could do the following:
Given that you want to invoke Dataflow every 30 minutes, or at any predefined/regular interval, consider using Cloud Scheduler. Cloud Scheduler is a fully managed, cron-like service that lets you, for example, invoke a URL every 30 minutes or at whatever frequency you want.
The URL that you are invoking can be a Google Cloud Function. The code inside your function will be the execution code that launches your Dataflow template.
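A minimal sketch of such a function, assuming an HTTP-triggered Python Cloud Function with google-api-python-client available and Application Default Credentials; the project, region, template path, and job name are placeholders:

from googleapiclient.discovery import build

PROJECT = "my-project"                                   # hypothetical project ID
REGION = "us-central1"                                   # hypothetical region
TEMPLATE_PATH = "gs://my-bucket/templates/my-template"   # hypothetical classic template

def launch_dataflow(request):
    # HTTP Cloud Function that launches a classic Dataflow template.
    dataflow = build("dataflow", "v1b3")
    response = (
        dataflow.projects()
        .locations()
        .templates()
        .launch(
            projectId=PROJECT,
            location=REGION,
            gcsPath=TEMPLATE_PATH,
            body={
                "jobName": "scheduled-dataflow-run",
                "parameters": {},  # template parameters, if any
            },
        )
        .execute()
    )
    return str(response)

Cloud Scheduler can then call the function's URL on a schedule such as */30 * * * * to launch the job every 30 minutes.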
You should consider using a Flex Template and invoking the Dataflow job using the REST API and Cloud Scheduler.
If you want to run a regular Dataflow job, you could follow Romin's advice and use a Cloud Function to execute custom code that launches the Dataflow job, and invoke the function from Cloud Scheduler. This is more complex, IMHO; using Flex Templates might just be easier.
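As a sketch of the Flex Template route, a small script or function can call the flexTemplates.launch REST method; the project, region, and template spec path below are placeholders, and a Cloud Scheduler HTTP job with the proper OAuth/OIDC token could also call this endpoint directly.

from googleapiclient.discovery import build

PROJECT = "my-project"                                          # hypothetical
REGION = "us-central1"                                          # hypothetical
SPEC_PATH = "gs://my-bucket/templates/my-flex-template.json"    # hypothetical

def launch_flex_template():
    # Launches a Dataflow Flex Template via the Dataflow REST API.
    dataflow = build("dataflow", "v1b3")
    body = {
        "launchParameter": {
            "jobName": "scheduled-flex-template-run",
            "containerSpecGcsPath": SPEC_PATH,
            "parameters": {},  # pipeline parameters, if any
        }
    }
    return (
        dataflow.projects()
        .locations()
        .flexTemplates()
        .launch(projectId=PROJECT, location=REGION, body=body)
        .execute()
    )

if __name__ == "__main__":
    print(launch_flex_template())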
Currently I am using a Flex Template to launch a job from a microservice. I am trying to find a better way (than the job polling method) to get the Dataflow job status. Basically, I am trying to have the Dataflow job itself publish its status to Pub/Sub when it completes.
Can someone help me with this?
There is currently no way to make a Dataflow job itself send its status to a Pub/Sub topic.
Instead, you can create a logs export (sink) with inclusion and exclusion filters to send your Dataflow logs to a Pub/Sub topic, and then perform text searches on the Pub/Sub messages to deduce the status of your job. For example, you can create an inclusion filter on "dataflow.googleapis.com%2Fjob-message"; among the received messages, one that contains a string like "Workflow failed." comes from a batch job that failed.
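A minimal sketch of creating such a log sink with the google-cloud-logging client; the project ID, topic name, and exact filter are assumptions you would adapt:

from google.cloud import logging as cloud_logging

PROJECT = "my-project"            # hypothetical project ID
TOPIC = "dataflow-job-status"     # hypothetical Pub/Sub topic

client = cloud_logging.Client(project=PROJECT)

# Inclusion filter on the Dataflow job-message log; narrow it further as needed.
log_filter = (
    f'resource.type="dataflow_step" AND '
    f'logName="projects/{PROJECT}/logs/dataflow.googleapis.com%2Fjob-message"'
)

sink = client.sink(
    "dataflow-job-status-sink",
    filter_=log_filter,
    destination=f"pubsub.googleapis.com/projects/{PROJECT}/topics/{TOPIC}",
)
sink.create()  # then grant the sink's writer identity the Pub/Sub Publisher role on the topic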
Is it possible to automatically run a Dataflow job every 10 minutes? Any insights on how to achieve this?
Yes. This is explained in the blog post Scheduling Dataflow pipelines using App Engine Cron Service or Cloud Functions.
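The same scheduling can also be done with Cloud Scheduler (mentioned in the earlier answers) instead of App Engine cron. A rough sketch with the Python client, where the project, location, and target URL are placeholders:

from google.cloud import scheduler_v1

PROJECT = "my-project"        # hypothetical
LOCATION = "us-central1"      # hypothetical
TARGET_URL = "https://us-central1-my-project.cloudfunctions.net/launch_dataflow"  # hypothetical

client = scheduler_v1.CloudSchedulerClient()
parent = f"projects/{PROJECT}/locations/{LOCATION}"

job = scheduler_v1.Job(
    name=f"{parent}/jobs/launch-dataflow-every-10-min",
    schedule="*/10 * * * *",  # every 10 minutes
    time_zone="Etc/UTC",
    http_target=scheduler_v1.HttpTarget(
        uri=TARGET_URL,
        http_method=scheduler_v1.HttpMethod.POST,
        # For an authenticated target, also set oidc_token with a service account.
    ),
)

client.create_job(parent=parent, job=job)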