Google Cloud Dataflow job status - google-cloud-dataflow

I run Dataflow jobs from a Unix shell script and need to know each job's final/completion status. Is there any command-line tool to grab the job completion status?

There's indeed a CLI to retrieve a job's execution status:
gcloud dataflow jobs list --project=<PROJECT_ID> --filter="id=<JOB_ID>" --format="get(state)"
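In a shell script, that command can be polled until the job reaches a terminal state. A minimal sketch (the project ID is a placeholder, and the terminal state names are an assumption based on what the list command typically prints):
#!/usr/bin/env bash
# Minimal sketch: poll a Dataflow job until it reaches a terminal state.
PROJECT_ID="my-project"   # placeholder
JOB_ID="$1"
while true; do
  STATE=$(gcloud dataflow jobs list --project="$PROJECT_ID" \
    --filter="id=$JOB_ID" --format="get(state)")
  case "$STATE" in
    Done|Failed|Cancelled|Drained)   # assumed terminal state names
      echo "Job $JOB_ID finished with state: $STATE"
      break ;;
    *)
      echo "Job $JOB_ID is in state '$STATE'; waiting..."
      sleep 30 ;;
  esac
done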

Yes! Dataflow has a CLI that is available as part of gcloud.
You may need to install gcloud alpha components:
$ gcloud components update alpha
After that, you should be able to use gcloud alpha dataflow jobs list to list all the jobs in a project or gcloud alpha dataflow jobs show <JOBID> for more information on a specific job.
You can find more details about this command and others at https://cloud.google.com/sdk/gcloud/reference/alpha/dataflow/jobs/list

Related

Is there a way to update a Dataflow job using the gcloud command?

I am trying to write a script to automate the deployment of a Java Dataflow job. The script creates a template and then uses the command
gcloud dataflow jobs run my-job --gcs-location=gs://my_bucket/template
The issue is that I want to update the job if it already exists and is running. I can do the update if I run the job via Maven, but I need to do it via gcloud so I can have one service account for deployment and another one for running the job. I tried different things (adding --parameters update to the command line), but I always get an error. Is there a way to update a Dataflow job exclusively via gcloud dataflow jobs run?
Referring to the official documentation, which describes gcloud beta dataflow jobs (a group of subcommands for working with Dataflow jobs), there is no way to use gcloud to update a job.
As of now, the Apache Beam SDKs provide a way to update an ongoing streaming job on the Dataflow managed service with new pipeline code; you can find more information here. Another way of updating an existing Dataflow job is through the REST API, for which you can find a Java example.
Additionally, please follow the Feature Request regarding recreating a job with gcloud.
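For reference, the Beam SDK route amounts to re-launching the driver with the --update pipeline option under the same job name. A rough sketch for a Maven-based Java pipeline (the main class, project, region, and job name are placeholders):
mvn compile exec:java \
  -Dexec.mainClass=com.example.MyPipeline \
  -Dexec.args="--runner=DataflowRunner --project=my-project --region=us-central1 --jobName=my-job --update"
Note that --jobName has to match the running job for the update to take effect.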

How do I check for build status, running Jenkins jobs from a BitBucket pipeline?

We're using BitBucket to host our Git repositories.
We have defined build jobs in a locally-hosted Jenkins server.
We are wondering whether we could use BitBucket pipelines to trigger builds in Jenkins, after pull request approvals, etc.
Triggering jobs in Jenkins through its REST API is fairly straightforward.
1: curl --request POST --user $username:$api_token --head http://jenkins.mydomain/job/myjob/build
This returns a location response header. By doing a GET on that, we can obtain information about the queued item:
2: curl --user $username:$api_token http://jenkins.mydomain/queue/item/<item#>/api/json
This returns JSON describing the queued item, indicating whether the item is blocked and why. If it's not, it includes the URL for the build. With that, we can check the status of the build itself:
3: curl --user $username:$api_token http://jenkins.mydomain/job/myjob/<build#>/api/json
This will return yet more JSON, indicating whether the job is currently building and, if it's completed, whether the build succeeded.
Now, BitBucket pipeline steps run in Docker containers and have to run on Linux. Our Jenkins build jobs run on a number of platforms, not all of which are Linux. But BitBucket shouldn't care: making the necessary REST API calls can be done from Linux, as I'm doing in the examples above.
But how do we script this?
Do we create a single step that runs a shell script that runs command #1, then repeatedly calls command #2 until the build is started, then repeatedly calls command #3 until the build is done?
Or do we create three steps, one for each? Do BitBucket pipelines provide for looping on steps? Calling a step, waiting for a bit, then calling it again until it succeeds?
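For what it's worth, the single-step approach can be sketched as one shell script that chains the three calls above (assuming jq is available in the pipeline image; the Jenkins URL, job name, and credentials are placeholders):
#!/usr/bin/env bash
# Single-step sketch: trigger the Jenkins job, wait for it to leave the queue,
# then poll the build until it finishes. jq is assumed to be installed.
set -euo pipefail
JENKINS=http://jenkins.mydomain        # placeholder
AUTH="$username:$api_token"

# 1: trigger the job; the queue item URL comes back in the Location header
QUEUE_URL=$(curl -s --request POST --user "$AUTH" --head \
  "$JENKINS/job/myjob/build" | awk 'tolower($1)=="location:" {print $2}' | tr -d '\r')

# 2: poll the queue item until Jenkins assigns it a build
BUILD_URL=""
while [ -z "$BUILD_URL" ]; do
  sleep 10
  BUILD_URL=$(curl -s --user "$AUTH" "${QUEUE_URL}api/json" | jq -r '.executable.url // empty')
done

# 3: poll the build until "result" is populated (SUCCESS, FAILURE, ABORTED, ...)
RESULT=null
while [ "$RESULT" = "null" ]; do
  sleep 30
  RESULT=$(curl -s --user "$AUTH" "${BUILD_URL}api/json" | jq -r '.result')
done
echo "Build finished: $RESULT"
[ "$RESULT" = "SUCCESS" ]   # fail the step on anything other than SUCCESS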
I think you should use either a Bitbucket pipeline or a Jenkins pipeline. Using both will give you too many options and make the project more complex than it should be.

Get list of Dataflow pipeline jobs programmatically using Java SDK

I know there's a gcloud command for this:
gcloud dataflow jobs list --help
NAME
gcloud dataflow jobs list - lists all jobs in a particular project, optionally filtered by region
DESCRIPTION
By default, 100 jobs in the current project are listed; this can be overridden with the gcloud --project flag, and the --limit flag.
Using the --region flag will only list jobs from the given regional endpoint.
But I'd like to retrieve this list programmatically through Dataflow Java SDK.
The problem I'm trying to solve:
I have a Dataflow pipeline in streaming mode and I want to set the update option (https://cloud.google.com/dataflow/pipelines/updating-a-pipeline) accordingly based on if this job has been deployed or not.
E.g., when I'm deploying this job for the first time, the code shouldn't set the update flag to true, since there's no existing job to update (otherwise the driver program will complain and fail to launch); and when the job is already running, the code should be able to query the list of running jobs, recognize that, and set the update option so the job gets updated (otherwise a DataflowJobAlreadyExistsException is thrown).
I've found org.apache.beam.runners.dataflow.DataflowClient#listJobs(String), which can achieve this.
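As a shell-side alternative, if shelling out to gcloud is acceptable in the launch script, the same check can be sketched there before deciding whether to pass --update (the job name, project ID, and the name= filter are assumptions):
#!/usr/bin/env bash
# Sketch: set --update only if a job with the same name is already active.
JOB_NAME="my-streaming-job"   # placeholder
PROJECT_ID="my-project"       # placeholder
RUNNING=$(gcloud dataflow jobs list --project="$PROJECT_ID" --status=active \
  --filter="name=$JOB_NAME" --format="get(id)")
UPDATE_FLAG=""
if [ -n "$RUNNING" ]; then
  UPDATE_FLAG="--update"
fi
# pass $UPDATE_FLAG through to the pipeline's launch command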

How to delete a gcloud Dataflow job?

My dashboard is cluttered with Dataflow jobs, and I'd like to delete the failed ones from my project. But in the dashboard, I don't see any option to delete a Dataflow job. I'm looking for something like the following, at least:
$ gcloud beta dataflow jobs delete JOB_ID
And to delete all jobs:
$ gcloud beta dataflow jobs delete
Can someone please help me with this?
Unfortunately, this is not currently possible. You cannot delete a Dataflow job. This is something that you could request via the public issue tracker (I've wanted it in the past too).
gcloud dataflow jobs --help
COMMANDS
COMMAND is one of the following:
cancel
Cancels all jobs that match the command line arguments.
describe
Outputs the Job object resulting from the Get API.
drain
Drains all jobs that match the command line arguments.
list
Lists all jobs in a particular project.
run
Runs a job from the specified path.
show
Shows a short description of the given job.
As Graham mentions, it is not possible to delete Dataflow jobs. However, note that you can filter the job list to only show the jobs you care about. For example, Status:Running,Succeeded will exclude all failed or cancelled jobs.
On the command line, you can use --status=(active|terminated|all):
gcloud beta dataflow jobs list --status=active
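The --status flag combines with the other list options shown earlier, e.g. (the table fields here are assumed from the get(state) example above):
gcloud beta dataflow jobs list --project=<PROJECT_ID> --status=active --format="table(id,name,state)"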

Distributed execution with Jenkins

I am working on a robotic process automation project where I need to automate 10 different process flows. The robot needs to run 24/7. My solution is hosted in the AWS cloud, and I have 10 cloud machines to run the scripts.
I have a master Jenkins job which retrieves the list of automated jobs to execute from a database, and I have 10 different jobs configured in the Jenkins server. The number of jobs I need to run at the same time varies; it may be N different scripts or N instances of the same script with different data combinations.
The challenge I'm facing is that, in the post-build action, I'm not able to control the list of scripts/jobs to run based on the output of the Jenkins master job. Is there any way to run only the jobs I need based on the output of a build command?
I was able to achieve it using the Jenkins Flexible Publish plugin.
