Is there a way to update a Dataflow job using the gcloud command? - google-cloud-dataflow

I am trying to write a script to automate the deployment of a Java Dataflow job. The script creates a template and then uses the command
gcloud dataflow jobs run my-job --gcs-location=gs://my_bucket/template
The issue is that I want to update the job if it already exists and is running. I can do the update if I run the job via Maven, but I need to do it via gcloud so I can use one service account for deployment and another for running the job. I tried different things (such as adding --parameters update to the command line), but I always get an error. Is there a way to update a Dataflow job exclusively via gcloud dataflow jobs run?

Referring to the official documentation, which describes gcloud beta dataflow jobs as a group of subcommands for working with Dataflow jobs, there is currently no way to use gcloud to update a job.
As of now, the Apache Beam SDKs provide a way to update an ongoing streaming job on the Dataflow managed service with new pipeline code; you can find more information here. Another way of updating an existing Dataflow job is through the REST API, where you can find a Java example.
Additionally, please follow the feature request regarding recreating a job with gcloud.
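To illustrate the Beam SDK route mentioned above, here is a minimal sketch (not part of the original answer) of launching a Java pipeline with the DataflowRunner and the update option set, so that it replaces the running streaming job that has the same job name; the class name and hard-coded values are placeholders:

// Sketch only: relaunch the pipeline with --update so Dataflow replaces the running
// job of the same name. Assumes the usual Beam + Dataflow runner dependencies and that
// project/region/credentials are supplied via the standard pipeline options (args).
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UpdateStreamingJob {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setJobName("my-job");   // must match the name of the running streaming job
    options.setUpdate(true);        // equivalent to passing --update on the command line

    Pipeline pipeline = Pipeline.create(options);
    // ... rebuild the (update-compatible) pipeline here ...
    pipeline.run();
  }
}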

Related

passing container arguments to cloud run job (not service)

I want to execute a GCP Cloud Run job (not service) that periodically processes some data. I see that when I configure a Cloud Run job I can fill out the "Container, Variables & Secrets, Connections, Security" fields for container arguments. But I want to pass different arguments every time I execute the job, and I am wondering if there is a way. I haven't been able to find one.
If there is no such way, am I supposed to use Cloud Run jobs only when I want them to do the same thing periodically?
But I want to pass different arguments every time I execute them and I am wondering if there is a way.
You can update the job before every execution.
I use the gcloud CLI to run my jobs from a pipeline. I do it in multiple steps: first I run gcloud beta run jobs update my-job (with the updated variables), and then I start an execution using gcloud beta run jobs execute my-job.
There is also a flag on the update command, --execute-now, that you might want to use to start an execution as soon as the job is updated.
As stated in the official documentation:
Unlike a Cloud Run service, which listens for and serves requests, a Cloud Run job only runs its tasks and exits when finished. A job does not listen for or serve requests, and cannot accept arbitrary parameters at execution.
Therefore, there is no way for a Cloud Run job to receive parameters at execution time.
Also note that, at the moment, Cloud Run Jobs is in Preview:
Preview — Using Cloud Run jobs
This feature is covered by the Pre-GA Offerings Terms of the Google Cloud Terms of Service. Pre-GA features might have limited support, and changes to pre-GA features might not be compatible with other pre-GA versions. For more information, see the launch stage descriptions.

How to scale down OpenShift/Kubernetes pods automatically on a schedule?

I have a requirement to scale down OpenShift pods at the end of each business day automatically.
How might I schedule this automatically?
OpenShift, like Kubernetes, is an API-driven application. Essentially all application functionality is exposed over the control-plane API running on the master hosts.
You can use any orchestration tool that is capable of making API calls to perform this activity. Information on calling the OpenShift API directly can be found in the official documentation in the REST API Reference Overview section.
Many orchestration tools have plugins that allow you to interact with the OpenShift/Kubernetes API more natively than making raw network calls. In the case of Jenkins, for example, there is the OpenShift Pipeline Jenkins plugin that lets you perform OpenShift activities directly from Jenkins pipelines. In the case of Ansible, there is the k8s module.
If you combine this with Jenkins' ability to run jobs on a schedule, you have something that meets your requirements.
For something much simpler, you could just schedule Ansible or bash scripts on a server via cron to execute the appropriate API commands against the OpenShift API.
Executing these commands from within OpenShift would also be possible via the CronJob object.

Get list of Dataflow pipeline jobs programmatically using Java SDK

I know there's a gcloud command for this:
gcloud dataflow jobs list --help
NAME
gcloud dataflow jobs list - lists all jobs in a particular project, optionally filtered by region
DESCRIPTION
By default, 100 jobs in the current project are listed; this can be overridden with the gcloud --project flag, and the --limit flag.
Using the --region flag will only list jobs from the given regional endpoint.
But I'd like to retrieve this list programmatically through Dataflow Java SDK.
The problem I'm trying to solve:
I have a Dataflow pipeline in streaming mode, and I want to set the update option (https://cloud.google.com/dataflow/pipelines/updating-a-pipeline) based on whether this job has already been deployed or not.
E.g., when I'm deploying the job for the first time, the code shouldn't set the update flag to true, since there is no existing job to update (otherwise the driver program will complain and fail to launch); on subsequent deployments, the code should query the list of running jobs, recognize that the job is already running, and set the update option (otherwise a DataflowJobAlreadyExistsException is thrown).
I've found org.apache.beam.runners.dataflow.DataflowClient#listJobs(String), which can achieve this.
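For reference, here is a rough sketch of how DataflowClient#listJobs could be used to decide whether to set the update option. The class and helper names below are illustrative, and the sketch assumes the pipeline options already carry the project, region, and credentials:

// Sketch only: decide whether to pass --update by checking if a job with the same
// name is currently running. Helper/class names are placeholders, not from the question.
import java.io.IOException;
import java.util.List;

import com.google.api.services.dataflow.model.Job;
import com.google.api.services.dataflow.model.ListJobsResponse;
import org.apache.beam.runners.dataflow.DataflowClient;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UpdateFlagHelper {

  /** Returns true if a job with the given name is currently running in the configured project/region. */
  static boolean isJobRunning(DataflowPipelineOptions options, String jobName) throws IOException {
    DataflowClient client = DataflowClient.create(options);
    String pageToken = null;
    do {
      ListJobsResponse response = client.listJobs(pageToken);  // pages through all jobs
      List<Job> jobs = response.getJobs();
      if (jobs != null) {
        for (Job job : jobs) {
          if (jobName.equals(job.getName())
              && "JOB_STATE_RUNNING".equals(job.getCurrentState())) {
            return true;
          }
        }
      }
      pageToken = response.getNextPageToken();
    } while (pageToken != null);
    return false;
  }

  public static void main(String[] args) throws IOException {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    // Only ask Dataflow to update an existing job if one with this name is actually running.
    options.setUpdate(isJobRunning(options, options.getJobName()));
    // ... build and run the pipeline with these options ...
  }
}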

How to delete a gcloud Dataflow job?

Dataflow jobs are cluttered all over my dashboard, and I'd like to delete the failed jobs from my project. But in the dashboard, I don't see any option to delete a Dataflow job. I'm looking for something like the following at least:
$ gcloud beta dataflow jobs delete JOB_ID
To delete all jobs,
$ gcloud beta dataflow jobs delete
Can someone please help me with this?
Unfortunately, this is not currently possible. You cannot delete a Dataflow job. This is something that you could request via the public issue tracker (I've wanted it in the past too).
gcloud dataflow jobs --help
COMMANDS
COMMAND is one of the following:
cancel
Cancels all jobs that match the command line arguments.
describe
Outputs the Job object resulting from the Get API.
drain
Drains all jobs that match the command line arguments.
list
Lists all jobs in a particular project.
run
Runs a job from the specified path.
show
Shows a short description of the given job.
As Graham mentions, it is not possible to delete Dataflow jobs. However, note that you can filter the job list to only show the jobs you care about. For example, Status:Running,Succeeded will exclude all failed or cancelled jobs.
On the commandline, you can use --status=(active|terminated|all):
gcloud beta dataflow jobs list --status=active

Google cloud dataflow job status

I run Dataflow jobs within a Unix shell script and need to know each job's final/completion status. Is there any command-line tool to grab the job completion status?
There's indeed a CLI command to retrieve a job's execution status:
gcloud dataflow jobs list --project=<PROJECT_ID> --filter="id=<JOB_ID>" --format="get(state)"
Yes! Dataflow has a CLI that is available as part of gcloud.
You may need to install gcloud alpha components:
$ gcloud components update alpha
After that, you should be able to use gcloud alpha dataflow jobs list to list all the jobs in a project or gcloud alpha dataflow jobs show <JOBID> for more information on a specific job.
You can find more details about this command and others at https://cloud.google.com/sdk/gcloud/reference/alpha/dataflow/jobs/list
