Dataflow jobs failing and showing no logs - google-cloud-dataflow

I created pipelines in Dataflow using the standard template JDBC to BigQuery and there are a few jobs that are unexpectedly failing and not showing any logs.
The thing is, when a job fails because of insufficient resources (for example, the job needed more vCPUs than were available in the region, or there was not enough memory), those kinds of errors are displayed in the logs, as you can see below.
But some jobs just fail with no logs at all, even though the resources are sufficient.
Does anyone know how to find the logs in this case?

Change the severity of the logs. If you choose Default, you should see more logs. Given how the job page looks for that failed job, you will probably also need to look at the worker logs.
Depending on the error, the Diagnostics tab may have a summary of what kind of error made the job fail.
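If the job page itself shows nothing useful, you can also pull the worker logs straight from Cloud Logging. Below is a minimal sketch, assuming the google-cloud-logging Python client library is installed; the job ID is a placeholder, and "dataflow_step" / "job_id" are the standard Cloud Logging resource labels for Dataflow worker logs.
# A minimal sketch: list warning-or-worse worker log entries for one job.
from google.cloud import logging

JOB_ID = "YOUR_DATAFLOW_JOB_ID"  # placeholder

client = logging.Client()
log_filter = (
    'resource.type="dataflow_step" '
    f'AND resource.labels.job_id="{JOB_ID}" '
    'AND severity>=WARNING'
)
for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.severity, entry.payload)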

Related

Dataflow Workers unable to connect to Dataflow Service

I am using Google Dataprep to start Dataflow jobs and am facing some difficulties.
For background, we used Dataprep for a few weeks and it worked without problems before we started to have authorization issues with the service account. When we finally solved those, we restarted the jobs we used to launch, but they failed with "The Dataflow appears to be stuck.".
We tried with another very simple job but hit the same error. Here are the full error messages; the job fails after being stuck for one hour:
Dataflow -
(1ff58651b9d6bab2): Workflow failed. Causes: (1ff58651b9d6b915): The Dataflow appears to be stuck.
Dataprep -
The Dataflow job (ID: 2017-11-15_00_23_23-9997011066491247322) failed. Please
contact Support and provide the Dataprep Job ID 20825 and the Dataflow Job ID.
It seems this kind of error has various origins, and I have no clue where to start.
Thanks in advance.
Please check whether there have been any changes to your project's default network. This is a common reason for workers being unable to contact the service, causing one-hour timeouts.
Update:
After looking into it further, the <project-number>-compute@developer.gserviceaccount.com service account for Compute Engine is missing from the 'Editor' role. This account is usually created automatically, so it was probably removed later by mistake. See the 'Compute Engine Service Account' section in https://cloud.google.com/dataflow/security-and-permissions.
We are working on fixes to improve early detection of such missing permissions so that the failure points to the root cause more clearly.
This implies your other Dataflow jobs would fail in a similar way as well. The best route would be to contact Google Support: the issue is on the Dataflow side and would require some more research into the Dataflow backend by Google.
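Before opening a support case, it may be worth checking the two things mentioned above programmatically: that the project's default network still exists, and that the Compute Engine default service account still holds a role. The sketch below is only an illustration, assuming the google-cloud-compute and google-cloud-resource-manager client libraries; PROJECT_ID and PROJECT_NUMBER are placeholders.
# Hedged sketch: check the default network and the compute default service account.
from google.api_core.exceptions import NotFound
from google.cloud import compute_v1, resourcemanager_v3

PROJECT_ID = "my-project"          # placeholder
PROJECT_NUMBER = "123456789012"    # placeholder
compute_sa = f"serviceAccount:{PROJECT_NUMBER}-compute@developer.gserviceaccount.com"

# 1. Is the project's "default" network still there?
try:
    compute_v1.NetworksClient().get(project=PROJECT_ID, network="default")
    print("default network: present")
except NotFound:
    print("default network: missing (recreate it or point Dataflow at another network)")

# 2. Does the Compute Engine default service account still hold any role?
policy = resourcemanager_v3.ProjectsClient().get_iam_policy(
    request={"resource": f"projects/{PROJECT_ID}"}
)
roles = [b.role for b in policy.bindings if compute_sa in b.members]
print("roles for the compute default service account:", roles or "none - restore roles/editor")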

Dataflow Job fails with "Unable to bring up enough workers"

My Dataflow job is failing with the following message. How should I debug it?
Workflow failed. Causes: (65a939e801f185b6): Unable to bring up enough
workers: minimum 1, actual 0.
The service will output this message when it is unable to allocate a virtual machine from Compute Engine to execute the job. Please check your quota in the console.
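One hedged way to check quota without the console is the Compute Engine regions API, which reports usage against limits per region. This sketch assumes the google-cloud-compute Python client library; the project and region values are placeholders.
# Hedged sketch: print the quotas Dataflow workers typically exhaust first.
from google.cloud import compute_v1

PROJECT_ID = "my-project"   # placeholder
REGION = "us-central1"      # placeholder

region = compute_v1.RegionsClient().get(project=PROJECT_ID, region=REGION)
for quota in region.quotas:
    if quota.metric in ("CPUS", "IN_USE_ADDRESSES", "DISKS_TOTAL_GB"):
        print(f"{quota.metric}: {quota.usage:.0f} used of {quota.limit:.0f}")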
I had problems with the same thing. However, switching zones solved the problem for me. I believe it sometimes gives the same error message when there are no free resources in the zone.
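If switching zones helps, you can pin the zone explicitly when launching the job. The sketch below assumes the Apache Beam Python SDK; the exact flag name is version-dependent (--zone on older SDKs, --worker_zone on newer ones), so treat it as an illustration rather than the exact flags for your SDK.
# Hedged sketch: launch a trivial pipeline with an explicit worker zone.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                 # placeholder
    "--region=europe-west1",                # placeholder
    "--worker_zone=europe-west1-b",         # try another zone if this one has no capacity
    "--temp_location=gs://my-bucket/tmp",   # placeholder
])

# A trivial pipeline just to exercise worker startup in the chosen zone.
with beam.Pipeline(options=options) as p:
    _ = p | beam.Create(["smoke test"]) | beam.Map(print)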

Error creating the GCE VMs or starting Dataflow

I'm getting the following error in the recent jobs I'm trying to submit:
2015-01-07T15:51:56.404Z: (893c24e7fd2fd6de): Workflow failed.
Causes: (893c24e7fd2fd601):
There was a problem creating the GCE VMs or starting Dataflow on the VMs so no data was processed. Possible causes:
1. A failure in user code on the worker.
2. A failure in the Dataflow code.
Next Steps:
1. Check the GCE serial console for possible errors in the logs.
2. Look for similar issues on http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
There are no other errors.
What does this error mean?
Sorry for the trouble.
The Dataflow service starts up VM instances and then launches an agent on those VMs. Those agents then do the heavy lifting of executing your code (e.g. ParDos, reading and writing your data).
The error indicates the job failed because no agents were requesting work. As a result, the service marked the job as a failure because it wasn't making any progress and never would, since there weren't any agents to process your data.
So we need to figure out where in the agent startup process things failed.
The first thing to check is whether the VMs actually started. When you run your job, do you see any VMs created in your project? It might take a minute or two for the VMs to start up, but they should appear shortly after the runner prints the message "Starting worker pool setup". The VMs should be named something like
<PREFIX-OF-JOB-NAME>-<TIMESTAMP>-<random hexadecimal number>-<instance number>
Only a prefix of the job name is used to ensure we don't exceed GCE name limits.
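As an illustration, here is one hedged way to list the instances with that name prefix instead of clicking through the console. It assumes the google-cloud-compute Python client library; the project, zone, and prefix values are placeholders.
# Hedged sketch: list worker VMs whose names start with the job-name prefix.
from google.cloud import compute_v1

PROJECT_ID = "my-project"        # placeholder
ZONE = "us-central1-f"           # placeholder
JOB_NAME_PREFIX = "myjobname"    # the (lower-cased, truncated) job name prefix

for instance in compute_v1.InstancesClient().list(project=PROJECT_ID, zone=ZONE):
    if instance.name.startswith(JOB_NAME_PREFIX):
        print(instance.name, instance.status)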
If the VMs start up, the next thing to do is inspect the worker logs for errors indicating problems launching the agent.
The easiest way to access the logs is through the UI. Go to the Google Cloud Console and select the Dataflow option in the left-hand frame. You should see a list of your jobs. Click on the job in question; this should show you a graph of your job. On the right side you should see a "view logs" button. Click it, and you should see a UI for navigating the logs where you can look for errors.
The second option is to look for the logs on GCS. The location to look for is:
gs://PATH TO YOUR STAGING DIRECTORY/logs/JOB-ID/VM-ID/LOG-FILE
You might see multiple log files. The one we are most interested in is the one that starts with "start_java_worker". If that log file doesn't exist, then the worker didn't make enough progress to actually upload the file, or there might have been a permission problem uploading it.
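If you prefer to check from a script, a minimal sketch of listing the log objects under that staging path follows, assuming the google-cloud-storage Python client library; the bucket name and prefix are placeholders you would fill in from your own staging location.
# Hedged sketch: list worker log files that made it to the staging bucket.
from google.cloud import storage

BUCKET = "my-staging-bucket"               # placeholder
PREFIX = "path/to/staging/logs/JOB-ID/"    # placeholder: staging path plus job ID

client = storage.Client()
for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    marker = "  <-- worker startup log" if "start_java_worker" in blob.name else ""
    print(f"{blob.name} ({blob.size} bytes){marker}")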
If that log file is missing, the best thing to do is to try to SSH into one of the VMs before it gets torn down. You should have about 15 minutes before the job fails and the VMs are deleted.
Once you log in to the VM, you can find all the logs in
/var/log/dataflow/...
The log we care most about at this point is:
/var/log/dataflow/taskrunner/harness/start_java_worker-SOME ID.log
If there is a problem starting the code that runs on the VM that log should tell us. That log and the other logs should also tell us if there is a permission problem that prevents the code running on the worker from being able to access Dataflow.
Please take a look and let us know if you find anything.
Apart from Jeremy Lewi's great answer, I would like to add that I've seen this error appear when the proper Google APIs are not enabled in the Developers Console, as mentioned here, which leads to a permission issue, as Jeremy said.

Jenkins - monitoring the estimated time of builds

I would like to monitor the estimated time of all of my builds to catch the cases where this value is shown as 'N/A'.
In these cases the build gets stuck (probably due to network issues in my environment) and won't start new builds for that job until it is killed manually.
What I am missing is how to get that data for each job, either from the API or from another source.
I would appreciate any suggestions.
Thanks.
For each job, you can click "Trend" on the job run history table, and it will show you the currently executing progress along with a graph of "usual" execution times.
Using the API, you can go to http://jenkins/job/<your_job_name>/<build_number>/api/xml (or /json) and the information is under <duration> and <estimatedDuration> fields.
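As an illustration of turning that into a stuck-build check, the sketch below polls the JSON API and flags a running build whose elapsed time is well past the estimate (or whose estimate is unavailable). It assumes the requests library; the Jenkins URL, job name, credentials, and the 2x threshold are placeholders.
# Hedged sketch: flag a running build that has blown past its estimated duration.
import time
import requests

JENKINS_URL = "http://jenkins"     # placeholder
JOB_NAME = "my-job"                # placeholder
AUTH = ("user", "api-token")       # placeholder

build = requests.get(
    f"{JENKINS_URL}/job/{JOB_NAME}/lastBuild/api/json", auth=AUTH, timeout=30
).json()

if build.get("building"):
    elapsed_ms = time.time() * 1000 - build["timestamp"]
    estimate_ms = build.get("estimatedDuration", -1)
    # Jenkins reports -1 when it has no history to estimate from (shown as "N/A")
    if estimate_ms <= 0 or elapsed_ms > 2 * estimate_ms:
        print(f"{JOB_NAME} #{build['number']} looks stuck: "
              f"{elapsed_ms / 60000:.1f} min elapsed, estimate {estimate_ms / 60000:.1f} min")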
Finally, there is a Jenkins Timeout Plugin that you can use to automatically take care of "stuck" builds.

Jenkins jobs going missing

I have a heavily parallelized build across 45 slaves (plus one master that just handles launches).
The problem I am running into is that about 3% of the jobs disappear.
The project setup is a "master" job that then launches (via the parameterized job plugin) N jobs across N slaves. Most of the time, the console output for the master job is correct with regards to job numbers of the distributed build steps.
Occasionally, however, the job indicated in console actually belongs to a completely different build.
Where do I even start looking to track this down? The Jenkins logs are eerily empty of any information about failed jobs or problems launching jobs.
My best guess at the moment is that the missing jobs were actually queued waiting for executors when something happened to remove them. But I have no evidence to support this.
Thoughts, suggestions, and helpful links are all greatly appreciated.
Here's how you can get more info: http://[jenkins_server]/log/ -> Add new log recorder -> enter a name of your choice -> OK -> Add -> enter hudson.model.Run as Logger -> set Log Level to all -> Save.
Now http://[jenkins_server]/log/[your log name]/ will provide you with more info as far as running your jobs is concerned.
As long as the bug https://issues.jenkins-ci.org/browse/JENKINS-15156 and its linked issues remain open, this will happen in certain cases. It does not matter what you use for parallel or dependent building; it is a core problem you have to live with for now.
I doubt additional logging is a fix or an answer to your problem.
My answer would be: debug and send patches to the devs.
