How to filter gcloud dataflow jobs list to show only draining jobs? - google-cloud-dataflow

I want to get the list of the jobs that have the status "Draining"
# gives 0 results
gcloud dataflow jobs list --filter="status=Draining"
gcloud dataflow jobs list --filter="status=draining"
gcloud dataflow jobs list --filter="Status=Draining"
gcloud dataflow jobs list --filter="Status=draining"
# gives an error, because status can only be "all", "active", "terminated"
gcloud dataflow jobs list --status="Draining"

The --status flag has only three options (active, all, and terminated), which is why you are getting an error on your last command.
You can follow the Google Cloud documentation to learn more about the status field.
To get the list of jobs that are in "Draining" mode, you can use the command below:
gcloud dataflow jobs list --filter='STATE=DRAINING'
In addition to @Mazlum Tosun's answer: you only need to use --region if you want a region-specific resource list.

I confirm @kiran-mathew's comment; you can use the following gcloud command to list all the Dataflow jobs in the draining state:
gcloud dataflow jobs list --region="europe-west1" --filter='STATE=DRAINING'
An example of the result:
JOB_ID                                   NAME      TYPE       CREATION_TIME        STATE     REGION
2022-12-14_15_00_47-1012997933146788402  job-name  Streaming  2022-12-14 23:00:48  Draining  europe-west1
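If you want to feed that list into a script, you can strip the header row and keep only the job IDs; a small sketch:
gcloud dataflow jobs list --region="europe-west1" --filter='STATE=DRAINING' \
  | tail -n +2 | awk '{print $1}'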

Related

Jenkins Build Name Status

I have a deploy job with a prerequisite: another job must have been run successfully with the latest git hash.
Is it possible to check for that condition and then run the prerequisite job if the condition is not met?
You can obtain this information using Jenkins' REST API:
http://JENKINS-HOST:8080/job/JOB-NAME/lastBuild/api/xml
The Git commit hash is under the SHA1 node and the status under the result node. You can use a Jenkins pipeline to run this job if the status is not SUCCESS.
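As a rough illustration of that approach (host, job name, and XML element casing are assumptions; your Jenkins may also require authentication and a crumb token for the POST):
#!/bin/bash
# Placeholders: adjust host, job name and credentials to your Jenkins setup.
LAST_BUILD="http://JENKINS-HOST:8080/job/JOB-NAME/lastBuild/api/xml"
XML=$(curl -s "$LAST_BUILD")

# Extract the build result and the built commit hash from the XML.
RESULT=$(echo "$XML" | grep -o '<result>[^<]*</result>' | sed 's/<[^>]*>//g' | head -n 1)
SHA1=$(echo "$XML" | grep -o '<SHA1>[^<]*</SHA1>' | sed 's/<[^>]*>//g' | head -n 1)
CURRENT_SHA1=$(git rev-parse HEAD)

# Trigger the prerequisite job only if the last build did not succeed
# for the commit we are about to deploy.
if [ "$RESULT" != "SUCCESS" ] || [ "$SHA1" != "$CURRENT_SHA1" ]; then
  curl -s -X POST "http://JENKINS-HOST:8080/job/JOB-NAME/build"
fi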

Programmatically terminating PubSubIO.readMessages from Subscription after configured time?

I am looking to schedule a Dataflow job that uses PubSubIO.readString from a Pub/Sub topic's subscription. How can I have the job terminate after a configured interval? My use case is not to keep the job running through the entire day, so I'm looking to schedule it to start and then stop after a configured interval from within the job.
Pipeline
.apply(PubsubIO.readMessages().fromSubscription("some-subscription"))
From docs:
If you need to stop a running Cloud Dataflow job, you can do so by
issuing a command using either the Cloud Dataflow Monitoring Interface
or the Cloud Dataflow Command-line Interface.
I would assume that you are not interested in stopping jobs manually via the Console, which leaves you with the command-line solution. If you intend to schedule your Dataflow job to run e.g. daily, then you also know at which time you want it to stop (launch time + "configured interval"). In that case, you could configure a cron job to run gcloud dataflow jobs cancel at that time every day. For instance, the following script would cancel all active jobs launched within the day:
#!/bin/bash
gcloud dataflow jobs list --status=active --created-after=-1d \
| awk '{print $1;}' | tail -n +2 \
| while read -r JOB_ID; do gcloud dataflow jobs cancel $JOB_ID; done
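Assuming the script above is saved at a path like /usr/local/bin/cancel-dataflow-jobs.sh (a hypothetical location), the crontab entry to run it every day at, say, 18:00 could look like this:
# m h dom mon dow  command
0 18 * * * /usr/local/bin/cancel-dataflow-jobs.sh >> /var/log/cancel-dataflow-jobs.log 2>&1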
Another solution would be to invoke the gcloud command from within your Java code, using Runtime.getRuntime().exec(). You can schedule this to run after a specific interval using java.util.Timer().schedule() as noted here. This way you can ensure your job is going to stop after the provided time interval regardless of when you launched it.
UPDATE
@RoshanFernando correctly noted in the comments that there's actually an SDK method to cancel a pipeline.
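If you would rather keep the stop logic outside the pipeline code entirely, a plain shell wrapper can approximate the same "launch time + configured interval" behaviour. This is only a sketch: the job name, template location, region and interval are placeholders, and it assumes the job is started from a template:
#!/bin/bash
JOB_NAME="my-streaming-job"                      # hypothetical name
TEMPLATE="gs://my-bucket/templates/my-template"  # hypothetical template path
REGION="europe-west1"
RUN_SECONDS=3600                                 # the "configured interval"

# Launch the job from the template.
gcloud dataflow jobs run "$JOB_NAME" --gcs-location "$TEMPLATE" --region "$REGION"

# Wait for the configured interval, then cancel the still-active job by name.
sleep "$RUN_SECONDS"
JOB_ID=$(gcloud dataflow jobs list --status=active --region "$REGION" \
  | awk -v name="$JOB_NAME" '$2 == name {print $1}' | head -n 1)
if [ -n "$JOB_ID" ]; then
  gcloud dataflow jobs cancel "$JOB_ID" --region "$REGION"
fi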

DirectPipelineRunner in Dataflow to read from Local machine to Google Cloud storage

I tried running a Dataflow pipeline that reads from my local machine (Windows) and writes to Google Cloud Storage using DirectPipelineRunner. The job failed with the error below, a FileNotFoundException (so I believe the Dataflow job is unable to read my location). I am running the job from my local machine against the GCP-based template that I created. I can see it in the GCP Dataflow dashboard, but it fails with the error below. Please help. I also tried the IP or hostname of my local machine along with my local path, but still hit this FileNotFoundException.
Error:
java.io.FileNotFoundException: No files matched spec: C:/data/sampleinput.txt
at org.apache.beam.sdk.io.FileSystems.maybeAdjustEmptyMatchResult(FileSystems.java:172)
at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:158)
at org.apache.beam.sdk.io.FileBasedSource.split(FileBasedSource.java:261)
at com.google.cloud.dataflow.worker.WorkerCustomSources.splitAndValidate(WorkerCustomSources.java:275)
COMMAND TO RUN THE TEMPLATE:
gcloud dataflow jobs run jobname --gcs-location gs://<somebucketname of template>/<templatename> --parameters inputFilePattern=C:/data/sampleinput.txt,outputLocation=gs://<bucketname>/output/outputfile,runner=DirectPipelineRunner
CODE:
PCollection<String> textData = pipeline.apply("Read Text Data", TextIO.read().from(options.getInputFilePattern()));
textData.apply("Write Text Data", TextIO.write().to(options.getOutputLocation()));
The gcloud dataflow jobs run command runs your job on Cloud Dataflow. That means the Dataflow workers will try to find C:/data/sampleinput.txt, which obviously does not exist on those workers.
You can fix this by uploading sampleinput.txt to a bucket and supplying the URI gs://<bucketname>/sampleinput.txt as inputFilePattern. The Dataflow workers will then be able to find your input file, and the job should succeed.
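Concretely, that could look like the following, run from the same Windows machine that has the file (the bucket and template names are the placeholders from the question, and gsutil is assumed to be installed alongside gcloud):
# Upload the local input file to a bucket the Dataflow workers can reach.
gsutil cp C:/data/sampleinput.txt gs://<bucketname>/sampleinput.txt

# Re-run the template, pointing inputFilePattern at the GCS copy.
gcloud dataflow jobs run jobname \
  --gcs-location gs://<somebucketname of template>/<templatename> \
  --parameters inputFilePattern=gs://<bucketname>/sampleinput.txt,outputLocation=gs://<bucketname>/output/outputfile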

Google Dataflow jobs stuck analysing the graph

We have submitted a couple of jobs that seem to be stuck on the graph analysis step.
A weird error appears on top of the Google Dataflow jobs list page:
A job with ID "2018-01-19_03_27_48-15138951989738918594" doesn't exist
Also, trying to list them using the gcloud tool shows them in an Unknown state:
JOB_ID                                    NAME        TYPE       CREATION_TIME        STATE    REGION
2018-01-19_03_27_48-15138951989738918594  myjobname2  Streaming  2018-01-19 12:27:48  Unknown  europe-west1
2018-01-19_03_21_05-1065108814086334743   myjobname   Streaming  2018-01-19 12:21:06  Unknown  europe-west1
Trying to cancel them with the gcloud tool doesn't work either:
$ gcloud beta dataflow jobs --project=myproject cancel 2018-01-19_03_21_05-1065108814086334743
Failed to cancel job [2018-01-19_03_21_05-1065108814086334743]: (9027838c1500ddff): Could not cancel workflow; user does not have sufficient permissions on project: myproject, or the job does not exist in the project.
Any idea?

In Google Cloud, how to view all DataFlow streams sources/sinks matched to gPubSub topics/subscriptions?

We have approximately 100 Google Cloud Pub/Sub topics/subscriptions, Dataflow jobs, and BigQuery/Bigtable tables.
I can list pubsub topics:
gcloud beta pubsub topics list
I could use xargs and for each topic, list its subscriptions:
gcloud beta pubsub topics list-subscriptions $topic_id
I can list all BigQuery tables:
bq ls [project_id:][dataset_id]
and all BigTable tables:
cbt -project $project -instance $instance ls
I can list all running DataFlow jobs:
gcloud beta dataflow jobs list --status=active
but I CANNOT list all sources and sinks:
gcloud beta dataflow jobs describe $job_id
- this doesn't show that information.
If we had 1000 flows, queues, and tables, I don't see how we could easily track this complexity.
My question is: using Google Cloud tools (console and/or CLI), how can I get a bird's-eye map of our system's flow sources & sinks and avoid distributed spaghetti?
Some of this information is available in the console for each job.
If you click on, say, a PubSubIO.Read step, you can see the Pub/Sub topic there. In the Summary of the pipeline, you can see the Pipeline Options, which can contain output table names and other options you specified.
The latter summary is also available via the CLI. It's under "displayData" when you retrieve the job information with "--full".
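For example, you could loop over the active jobs and dump whatever displayData they expose. This is only a sketch: it assumes jq is installed, you may need to add --region for jobs outside the default region, and the exact JSON layout returned by --full can vary, so the jq filter may need adjusting:
#!/bin/bash
# List active job IDs (skip the header row), then dump each job's displayData,
# which is typically where topic, subscription and table names show up.
gcloud dataflow jobs list --status=active \
  | tail -n +2 | awk '{print $1}' \
  | while read -r JOB_ID; do
      echo "=== $JOB_ID ==="
      gcloud dataflow jobs describe "$JOB_ID" --full --format=json \
        | jq '.. | .displayData? // empty'
    done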
