I know there's a gcloud command for this:
gcloud dataflow jobs list --help
NAME
gcloud dataflow jobs list - lists all jobs in a particular project, optionally filtered by region
DESCRIPTION
By default, 100 jobs in the current project are listed; this can be overridden with the gcloud --project flag, and the --limit flag.
Using the --region flag will only list jobs from the given regional endpoint.
But I'd like to retrieve this list programmatically through Dataflow Java SDK.
The problem I'm trying to solve:
I have a Dataflow pipeline in streaming mode, and I want to set the update option (https://cloud.google.com/dataflow/pipelines/updating-a-pipeline) based on whether this job has already been deployed.
For example, when I'm deploying the job for the first time, the code shouldn't set the update flag to true, since there's no existing job to update (otherwise the driver program will complain and fail to launch); on subsequent deployments, the code should query the list of running jobs, recognize that the job is already running, and set the update option so the job gets updated (otherwise a DataflowJobAlreadyExistsException is thrown).
I've found org.apache.beam.runners.dataflow.DataflowClient#listJobs(String), which can achieve this.
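For reference, here is a minimal sketch of how that client could drive the decision. DataflowClient.create(...) and listJobs(pageToken) are the SDK methods mentioned above; the helper name, the JOB_STATE_RUNNING check, and the setUpdate(...) call on DataflowPipelineOptions are my own illustrative assumptions rather than a verified recipe:

import com.google.api.services.dataflow.model.Job;
import com.google.api.services.dataflow.model.ListJobsResponse;
import java.io.IOException;
import org.apache.beam.runners.dataflow.DataflowClient;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UpdateFlagHelper {

  // Illustrative helper (not part of the SDK): returns true if a job with the
  // given name is currently running in the project/region set on the options.
  static boolean isJobRunning(DataflowPipelineOptions options, String jobName) throws IOException {
    DataflowClient client = DataflowClient.create(options);
    String pageToken = null;
    do {
      ListJobsResponse response = client.listJobs(pageToken);
      if (response.getJobs() != null) {
        for (Job job : response.getJobs()) {
          if (jobName.equals(job.getName())
              && "JOB_STATE_RUNNING".equals(job.getCurrentState())) {
            return true;
          }
        }
      }
      pageToken = response.getNextPageToken();
    } while (pageToken != null);
    return false;
  }

  public static void main(String[] args) throws IOException {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    // Only request an update when a job with this name is already running
    // (assumes setUpdate(...) is the programmatic equivalent of the --update flag).
    options.setUpdate(isJobRunning(options, options.getJobName()));
    // ... build and run the streaming pipeline with these options ...
  }
}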
Related
I want to execute a GCP Cloud Run job (not a service) that periodically processes some data. I see that when I configure a Cloud Run job I can fill out the "Container, Variables & Secrets, Connections, Security" fields, including container arguments. But I want to pass different arguments every time I execute the job, and I am wondering if there is a way. I haven't been able to find one.
If there is no such way, am I supposed to use Cloud Run jobs only when I want them to do the same thing periodically?
But I want to pass different arguments every time I execute the job, and I am wondering if there is a way.
You can update the job before every execution.
I use the gcloud CLI to run my jobs from a pipeline, in multiple steps: first I run gcloud beta run jobs update my-job (with the updated variables), and then I start an execution with gcloud beta run jobs execute my-job.
The update command also has an --execute-now flag that you might want to use to start an execution as soon as the job is updated.
As stated in the official documentation:
Unlike a Cloud Run service, which listens for and serves requests, a Cloud Run job only runs its tasks and exits when finished. A job does not listen for or serve requests, and cannot accept arbitrary parameters at execution.
Therefore, there is no way for a Cloud Run job to receive parameters at execution time.
Also note that, at the moment, Cloud Run Jobs is in Preview:
Preview — Using Cloud Run jobs
This feature is covered by the Pre-GA Offerings Terms of the Google Cloud Terms of Service. Pre-GA features might have limited support, and changes to pre-GA features might not be compatible with other pre-GA versions. For more information, see the launch stage descriptions.
I am trying to write a script to automate the deployment of a Java Dataflow job. The script creates a template and then uses the command
gcloud dataflow jobs run my-job --gcs-location=gs://my_bucket/template
The issue is that I want to update the job if it already exists and is running. I can do the update if I run the job via Maven, but I need to do it via gcloud so I can have one service account for deployment and another one for running the job. I tried different things (such as adding --parameters update to the command line), but I always get an error. Is there a way to update a Dataflow job exclusively via gcloud dataflow jobs run?
Referring to the official documentation, which describes gcloud beta dataflow jobs (a group of subcommands for working with Dataflow jobs), there is no way to update a job with gcloud.
As of now, the Apache Beam SDKs provide a way to update an ongoing streaming job on the Dataflow managed service with new pipeline code; you can find more information here. Another way of updating an existing Dataflow job is via the REST API, for which you can find a Java example.
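As a rough illustration of the SDK route (a sketch only, assuming the replacement pipeline is launched from Java with DataflowRunner, and that setUpdate(...)/setJobName(...) mirror the --update and --jobName flags):

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UpdateStreamingJob {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setStreaming(true);
    options.setJobName("my-job");   // illustrative name; must match the running job's name
    options.setUpdate(true);        // ask Dataflow to replace the running job instead of starting a new one

    Pipeline pipeline = Pipeline.create(options);
    // ... rebuild the (updated) pipeline here ...
    pipeline.run();
  }
}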
Additionally, please follow the Feature Request regarding recreating jobs with gcloud.
My dashboard is cluttered with Dataflow jobs, and I'd like to delete the failed jobs from my project. But in the dashboard, I don't see any option to delete a Dataflow job. I'm looking for something like the following at least:
$ gcloud beta dataflow jobs delete JOB_ID
To delete all jobs,
$ gcloud beta dataflow jobs delete
Can someone please help me with this?
Unfortunately, this is not currently possible. You cannot delete a Dataflow job. This is something that you could request via the public issue tracker (I've wanted it in the past too).
gcloud dataflow jobs --help
COMMANDS
COMMAND is one of the following:
cancel
Cancels all jobs that match the command line arguments.
describe
Outputs the Job object resulting from the Get API.
drain
Drains all jobs that match the command line arguments.
list
Lists all jobs in a particular project.
run
Runs a job from the specified path.
show
Shows a short description of the given job.
As Graham mentions, it is not possible to delete Dataflow jobs. However, note that you can filter the job list to only show the jobs you care about. For example, Status:Running,Succeeded will exclude all failed or cancelled jobs.
On the command line, you can use --status=(active|terminated|all):
gcloud beta dataflow jobs list --status=active
I am working on a robotic process automation project where I need to automate 10 different process flows. The robot needs to run 24/7. My solution is hosted in the AWS cloud, and I have 10 cloud machines to run the scripts.
I have a master Jenkins job that retrieves the list of automated jobs to execute from a database, and I have 10 different jobs configured on the Jenkins server. The number of jobs I need to run at the same time varies; it may be N different scripts or N instances of the same script with different data combinations.
The challenge I'm facing is that in the post-build action I'm not able to control which scripts/jobs to run based on the output of the Jenkins master job. Is there any way to run only the jobs I need based on the output of a build command?
I was able to achieve this using the Jenkins Flexible Publish plugin.
I run Dataflow jobs from a Unix shell script, and I need to know each job's final/completion status. Is there any command-line tool to grab the job completion status?
There's indeed a CLI command to retrieve a job's execution status:
gcloud dataflow jobs list --project=<PROJECT_ID> --filter="id=<JOB_ID>" --format="get(state)"
Yes! Dataflow has a CLI that is available as part of gcloud.
You may need to install gcloud alpha components:
$ gcloud components update alpha
After that, you should be able to use gcloud alpha dataflow jobs list to list all the jobs in a project or gcloud alpha dataflow jobs show <JOBID> for more information on a specific job.
You can find more details about this command and others at https://cloud.google.com/sdk/gcloud/reference/alpha/dataflow/jobs/list