How to cancel a Google Dataflow job that is stuck "cancelling"

I was trying to cancel a Google Dataflow job, but it has been stuck "cancelling" for about 15 minutes now.
When I run the command: gcloud beta dataflow jobs list --status=active
it shows the job as active. I then run the command: gcloud beta dataflow jobs cancel [job id here].
It reports that the job has been cancelled, but the job still appears as active in the status list.
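If the job is regional, the cancel request sometimes needs the job's region to be explicit, and newer gcloud releases also offer a --force flag for jobs stuck in the Cancelling state. A sketch with placeholder job ID and region (check `gcloud dataflow jobs cancel --help` for what your gcloud version supports):

```shell
JOB_ID="my-job-id"     # placeholder: substitute your stuck job's ID
REGION="us-central1"   # placeholder: the region the job runs in

# Retry the cancel with the region made explicit.
gcloud dataflow jobs cancel "$JOB_ID" --region="$REGION"

# On newer gcloud releases, force-cancel a job stuck in Cancelling.
gcloud dataflow jobs cancel "$JOB_ID" --region="$REGION" --force
```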

Google-side issues like this are best reported in the Public Issue Tracker, where the responsible engineering team can be notified.
Including further information in the issue report, such as your project number and the job ID of the stuck Dataflow job, will help resolve the issue more quickly.
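Both identifiers are easy to pull from gcloud; a quick sketch, assuming a placeholder project ID:

```shell
PROJECT_ID="my-project"   # placeholder: your project ID

# Numeric project number, for the issue report.
gcloud projects describe "$PROJECT_ID" --format="value(projectNumber)"

# Job IDs of the active (stuck) Dataflow jobs.
gcloud beta dataflow jobs list --status=active --project="$PROJECT_ID"
```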

Related

Unable to stop a streaming dataflow on google cloud

There is a streaming Dataflow job running on Google Cloud (Apache Beam 2.5). The job was showing some system lag, so I tried to update it with the --update flag. Now the old job is in the Updating state and the new job initiated by the update is in the Pending state.
At this point everything is stuck, and I am unable to stop or cancel either job. The old job is still in the Updating state and no status-change operation is permitted: I tried to change its state using gcloud dataflow jobs cancel and the REST API, but it reports that the job cannot be updated because it is in RELOAD state. The newly initiated job is in the Not Started/Pending state, and I cannot change its state either; it reports that the job is not in a condition to perform this operation.
Please let me know how to stop/cancel/delete this streaming Dataflow job.
Did you try to cancel the job from both the gcloud command-line tool and the web console UI? If nothing works, I think you need to contact Google Cloud Support.
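For reference, the REST route mentioned in the question works by PUTting a new requestedState to the job through the Dataflow v1b3 API. A sketch with curl, using placeholder project and job IDs (regional jobs use a projects/.../locations/.../jobs/... path instead):

```shell
PROJECT="my-project"   # placeholder: your project ID
JOB_ID="my-job-id"     # placeholder: the stuck job's ID

# projects.jobs.update endpoint for non-regional jobs.
URL="https://dataflow.googleapis.com/v1b3/projects/${PROJECT}/jobs/${JOB_ID}"

# Ask the service to transition the job to the cancelled state.
curl -s -X PUT "$URL" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"requestedState": "JOB_STATE_CANCELLED"}'
```

If the service rejects the transition (as in the RELOAD state described above), only Google Cloud Support can clear the job from the backend.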

Gcloud Dataflow Step Execution Time

I'm running a Dataflow job from gcloud and want the individual execution times for all the steps in my dataflow, including nested transforms. I'm using a streaming dataflow, and the pipeline currently looks like this:
[screenshot of the current pipeline]
Can anyone please suggest a solution?
The answer is wall time. You can access this information by clicking any step in your pipeline (even nested ones).
The elapsed time of a job is the total time it takes your Dataflow job to complete, while the wall time of a step is the sum of the time the assigned workers spend running it. [screenshot comparing elapsed time and wall time]
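The same per-step data shown in the UI can also be pulled from the command line with `gcloud dataflow metrics list`; a sketch, assuming a placeholder job ID (metric names vary by SDK version, so inspect the JSON output for the timing metrics of your transforms):

```shell
JOB_ID="my-job-id"   # placeholder: your running job's ID

# Dump all of the job's metrics as JSON; per-transform timing
# metrics appear alongside the element counts.
gcloud dataflow metrics list "$JOB_ID" --format=json
```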

Dataflow Workers unable to connect to Dataflow Service

I am using Google Dataprep to start Dataflow jobs and am facing some difficulties.
For background, we used Dataprep for some weeks and it worked without problems, until we started to have authorization issues with the service account. When we finally solved those, we restarted the jobs we used to launch, but they failed with "The Dataflow appears to be stuck.".
We tried with another, very simple job but hit the same error. Here are the full error messages; the job fails after being stuck for one hour:
Dataflow -
(1ff58651b9d6bab2): Workflow failed. Causes: (1ff58651b9d6b915): The Dataflow appears to be stuck.
Dataprep -
The Dataflow job (ID: 2017-11-15_00_23_23-9997011066491247322) failed. Please
contact Support and provide the Dataprep Job ID 20825 and the Dataflow Job ID.
It seems this kind of error has various origins, and I have no clue where to start.
Thanks in advance
Please check whether there have been any changes to your project's default network. This is a common reason for workers being unable to contact the service, causing one-hour timeouts.
Update:
After looking into this further, the <project-number>-compute@developer.gserviceaccount.com service account for Compute Engine is missing from the 'Editor' role. This account is usually created automatically; it was probably removed later by mistake. See the 'Compute Engine Service Account' section in https://cloud.google.com/dataflow/security-and-permissions.
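A sketch of checking for that binding and restoring it, assuming placeholder project ID and project number; the service-account address follows the <project-number>-compute@developer.gserviceaccount.com pattern described above:

```shell
PROJECT_ID="my-project"         # placeholder: your project ID
PROJECT_NUMBER="123456789012"   # placeholder: your numeric project number
SA="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"

# List the roles currently bound to the default Compute Engine
# service account; empty output means the binding is gone.
gcloud projects get-iam-policy "$PROJECT_ID" \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:${SA}" \
  --format="value(bindings.role)"

# Restore the missing Editor role if it has been removed.
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:${SA}" \
  --role="roles/editor"
```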
We are working on fixes to improve early detection of such missing permissions, so that the failure points to the root cause more clearly.
This implies your other Dataflow jobs would fail similarly as well.
The best route would be to contact Google Support: the issue is on the Dataflow side and would require more research on the Dataflow backend by Google.

Scheduled job does not appear to run and no kernel files are created

I have a scheduled notebook job that has been running without issue for a number of days, however, last night it stopped running. Note that I am able to run the job manually without issue.
I raised a previous question on this topic: How to troubleshoot a DSX scheduled notebook?
Following the above instructions, I noticed that there were no log files created at the times when the job should have run. Because I'm able to run the job manually and there are no kernel logs created at the times the scheduled job should have run, I'm presuming there is an issue with the scheduler service.
Are there any other steps I can perform to investigate this issue?
This sounds like a problem with the scheduling service. I recommend taking it up with DSX support. Currently there is no management UX that tells you why a specific job failed or lets you restart a particular execution (that would be a good enhancement request to raise via https://datascix.uservoice.com/).

Jenkins - monitoring the estimated time of builds

I would like to monitor the estimated build time of all of my builds to catch the cases where this value is shown as 'N/A'.
In these cases the build gets stuck (probably due to network issues in my environment), and no new builds for that job will start until it is killed manually.
What I'm missing is how to get that data for each job, either from the API or some other source.
I would appreciate any suggestions.
Thanks.
For each job, you can click "Trend" on the job's run-history table; it will show you the progress of the currently executing build along with a graph of the "usual" execution times.
Using the API, you can go to http://jenkins/job/<your_job_name>/<build_number>/api/xml (or /json) and the information is under <duration> and <estimatedDuration> fields.
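That endpoint can be scripted to flag jobs with a missing estimate; a sketch with curl and sed, assuming placeholder Jenkins URL and job name (Jenkins reports estimatedDuration as -1 when it has no history to estimate from, which the UI shows as 'N/A'):

```shell
JENKINS="http://jenkins.example.com"   # placeholder: your Jenkins base URL
JOB="my-job"                           # placeholder: your job name

# Fetch the last build's timing fields from the JSON variant of the API.
JSON=$(curl -s "${JENKINS}/job/${JOB}/lastBuild/api/json?tree=building,duration,estimatedDuration")

# Extract estimatedDuration without jq.
EST=$(printf '%s' "$JSON" | sed -n 's/.*"estimatedDuration":\(-\{0,1\}[0-9][0-9]*\).*/\1/p')

if [ "$EST" = "-1" ]; then
  echo "Job ${JOB}: no estimate (N/A) - may be stuck, needs attention"
fi
```

Looping the same check over `http://jenkins/api/json?tree=jobs[name]` covers every job on the server.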
Finally, there is a Jenkins timeout plugin that you can use to automatically take care of "stuck" builds.
