Scheduled job does not appear to run and no kernel files are created - dsx

I have a scheduled notebook job that has been running without issue for a number of days; however, last night it stopped running. Note that I am able to run the job manually without issue.
I raised a previous question on this topic: How to troubleshoot a DSX scheduled notebook?
Following the above instructions, I noticed that no log files were created at the times when the job should have run. Because I'm able to run the job manually and there are no kernel logs from the times the scheduled job should have run, I'm presuming there is an issue with the scheduler service.
Are there any other steps I can perform to investigate this issue?

This sounds like a problem with the Scheduling service. I recommend taking it up with DSX support. Currently there is no management UX that tells you why a specific job failed or lets you restart a particular execution (that would be a good candidate for an enhancement request via https://datascix.uservoice.com/).

Related

Jenkins job gets stuck in queue. How to make it fail automatically if the worker is offline

On my Jenkins, my worker "slave02" sometimes goes offline and needs to be brought back manually. I won't go into details, because that's not the point of this question.
The scenario so far:
I've intentionally configured a job to run on that exact worker, but obviously it won't start while the worker is offline. I want to be notified when that job gets stuck in the queue. I've tried the Build Timeout Jenkins plugin, configured to fail the build if it takes longer than 5 minutes to complete.
The problem is that the plugin only fails the job 5 minutes after the build has started, which does not help in my case: the build never starts; it sits in the queue waiting to be processed, and that never happens. So my question is: is there a way to check whether that worker is down, automatically fail the build, and send a notification?
I am pretty sure this can be done, but I could not find a thread where this type of scenario is discussed.
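One approach (not from any existing thread, just a sketch of how it could be done) is to poll the node's status through the Jenkins JSON API, which exposes an "offline" flag per computer, and send the notification yourself. The host name, node name, credentials, and mail settings below are placeholders:

import smtplib
from email.message import EmailMessage

import requests

JENKINS_URL = "http://jenkins.example.com"   # placeholder Jenkins host
NODE = "slave02"                             # the worker to watch

def node_is_offline():
    # /computer/<node>/api/json exposes an "offline" boolean for each node
    resp = requests.get(f"{JENKINS_URL}/computer/{NODE}/api/json",
                        auth=("monitor", "api-token"), timeout=30)
    resp.raise_for_status()
    return resp.json().get("offline", False)

def notify(body):
    msg = EmailMessage()
    msg["Subject"] = f"Jenkins worker {NODE} is offline"
    msg["From"] = "jenkins-monitor@example.com"
    msg["To"] = "ops@example.com"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    if node_is_offline():
        notify(f"Builds pinned to {NODE} will sit in the queue until it is back online.")

Run from cron, this at least gives you the notification; failing or removing the queued build itself would still need a separate step (for example, cancelling the queue item through the API).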

How to send warning email when build queue exceeds a particular length?

I manage a Jenkins server with a few hundred projects in the whole ecosystem. Many of the projects rely on upstream servers that, unfortunately, are not always responsive. When I have a lag on these servers, my build queue can get to 10 or more. Is there a plugin or setting to send a warning email when the build queue exceeds a particular length?
I have been unable to find a plugin that does this, but you can query Jenkins for the information as detailed here: Jenkins command to get number of builds in queue.
If you have a Jenkins slave available, you could set up a job that runs every 15 minutes and hits each of the other Jenkins servers with the API call to get build queue counts (this is easy if you have just one master and many slaves).
If you wanted to stay completely outside of Jenkins (not add another job to the mix), you could write a script to poll the Jenkins API for the information. You could then run that script on, say, a 15-minute timer (or some other relevant interval) using cron (or a Windows scheduled task). Admittedly, you then have to dedicate some resources to running this job.
It looks like you could use Python to get the build queue and check the length of the returned list, e.g. via the python-jenkins library's get_queue_info().
I haven't mucked about with the Jenkins API much myself so I'm not sure offhand exactly what the script would need, but it should be simple enough once you dig into it.
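To make that concrete, here is a minimal sketch of such a polling script, assuming the python-jenkins package and placeholder host, credentials, and mail addresses:

import smtplib
from email.message import EmailMessage

import jenkins  # pip install python-jenkins

JENKINS_URL = "http://jenkins.example.com"   # placeholder host
QUEUE_LIMIT = 10                             # warn above this length

def check_queue():
    server = jenkins.Jenkins(JENKINS_URL, username="monitor", password="api-token")
    queue = server.get_queue_info()          # list of items currently in the queue
    if len(queue) <= QUEUE_LIMIT:
        return
    msg = EmailMessage()
    msg["Subject"] = f"Jenkins build queue length is {len(queue)}"
    msg["From"] = "jenkins-monitor@example.com"
    msg["To"] = "ops@example.com"
    msg.set_content("\n".join(item.get("task", {}).get("name", "?") for item in queue))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    check_queue()   # run this every 15 minutes from cron or a scheduled task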

Error creating the GCE VMs or starting Dataflow

I'm getting the following error in the recent jobs I'm trying to submit:
2015-01-07T15:51:56.404Z: (893c24e7fd2fd6de): Workflow failed.
Causes: (893c24e7fd2fd601):
There was a problem creating the GCE VMs or starting Dataflow on the VMs so no data was processed. Possible causes:
1. A failure in user code on the worker.
2. A failure in the Dataflow code.
Next Steps:
1. Check the GCE serial console for possible errors in the logs.
2. Look for similar issues on http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
There are no other errors.
What does this error mean?
Sorry for the trouble.
Dataflow starts up VM instances and then launches an agent on those VMs. Those agents then do the heavy lifting of executing your code (e.g. ParDos, reading and writing your data).
The error indicates the job failed because no agents were requesting work. As a result, the service marked the job as a failure because it wasn't making any progress and never would since there weren't any agents to process your data.
So we need to figure out where in the agent startup process things failed.
The first thing to check is whether the VMs actually started. When you run your job, do you see any VMs created in your project? It might take a minute or two for the VMs to start up, but they should appear shortly after the runner prints out the message "Starting worker pool setup". The VMs should be named something like
<PREFIX-OF-JOB-NAME>-<TIMESTAMP>-<random hexadecimal number>-<instance number>
Only a prefix of the job name is used to ensure we don't exceed GCE name limits.
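If you prefer to check this from a script instead of the console, something along these lines should work; it is only a sketch, assuming Application Default Credentials and the google-api-python-client package, with placeholder project, zone, and prefix values:

from googleapiclient import discovery   # pip install google-api-python-client

PROJECT = "my-project"        # placeholder project ID
ZONE = "us-central1-a"        # placeholder zone
PREFIX = "myjobname"          # prefix of your Dataflow job name

compute = discovery.build("compute", "v1")
result = compute.instances().list(project=PROJECT, zone=ZONE).execute()
for inst in result.get("items", []):
    if inst["name"].startswith(PREFIX):
        print(inst["name"], inst["status"])   # e.g. RUNNING, STAGING, TERMINATED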
If the VMs start up, the next thing to do is inspect the worker logs and look for errors indicating problems launching the agent.
The easiest way to access the logs is using the UI. Go to the Google Cloud Console and then select the Dataflow option in the left hand frame. You should see a list of your jobs. You can click on the job in question. This should show you a graph of your job. On the right side you should see a button "view logs". Please click that. You should then see a UI for navigating the logs and you can look for errors.
The second option is to look for the logs on GCS. The location to look for is:
gs://PATH TO YOUR STAGING DIRECTORY/logs/JOB-ID/VM-ID/LOG-FILE
You might see multiple log files. The one we are most interested in is the one that starts with "start_java_worker". If that log file doesn't exist, then either the worker didn't make enough progress to upload it, or there was a permission problem uploading it.
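As a quick way to see which of those log files were uploaded, you could list the staging location from a script; this is just a sketch, assuming the google-cloud-storage package, with placeholder bucket and job ID values (adjust the prefix if your staging directory includes a sub-path):

from google.cloud import storage   # pip install google-cloud-storage

BUCKET = "my-staging-bucket"               # placeholder bucket name
JOB_ID = "2015-01-07_07_50_00-1234567890"  # placeholder Dataflow job ID

client = storage.Client()
for blob in client.list_blobs(BUCKET, prefix=f"logs/{JOB_ID}/"):
    # the start_java_worker-* files are the worker startup logs we care about
    print(blob.name)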
If that startup log is missing, the best thing to do is to try to SSH into one of the VMs before they are torn down. You should have about 15 minutes before the job fails and the VMs are deleted.
Once you login to the VM you can find all the logs in
/var/log/dataflow/...
The log we care most about at this point is:
/var/log/dataflow/taskrunner/harness/start_java_worker-SOME ID.log
If there is a problem starting the code that runs on the VM, that log should tell us. That log and the other logs should also tell us if there is a permission problem preventing the code running on the worker from accessing Dataflow.
Please take a look and let us know if you find anything.
Apart from Jeremy Lewi's great answer, I would like to add that I've seen this error appear when you don't enable the proper Google APIs in the Developers Console, as mentioned here, which leads to a permission issue, like Jeremy said.

Jenkins - monitoring the estimated time of builds

I would like to monitor the estimated time of all of my builds to catch the cases where this value is shown as 'N/A'.
In these cases the build gets stuck (probably due to network issues in my environment), and Jenkins won't start new builds for that job until the stuck one is killed manually.
What I am missing is how to get that data for each job, either from the API or some other source.
I would appreciate any suggestions.
Thanks.
For each job, you can click "Trend" on the job run history table, and it will show you the currently executing progress along with a graph of "usual" execution times.
Using the API, you can go to http://jenkins/job/<your_job_name>/<build_number>/api/xml (or /json); the information is in the <duration> and <estimatedDuration> fields.
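As a sketch of how you could catch the 'N/A' and stuck cases automatically with that API (placeholder host, credentials, and job name, and an arbitrary twice-the-estimate threshold):

import time

import requests

JENKINS_URL = "http://jenkins.example.com"   # placeholder host

def check_build(job_name, build="lastBuild"):
    url = f"{JENKINS_URL}/job/{job_name}/{build}/api/json"
    data = requests.get(url, auth=("monitor", "api-token"), timeout=30).json()
    est_ms = data.get("estimatedDuration", -1)
    if est_ms <= 0:
        # shown as "N/A" in the UI: no usable history to estimate from
        print(f"{job_name}: no estimated duration available")
        return
    if data.get("building") and time.time() * 1000 - data["timestamp"] > 2 * est_ms:
        print(f"{job_name}: running more than twice its estimate -- possibly stuck")

if __name__ == "__main__":
    check_build("my-job")   # placeholder job name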
Finally, there is a Jenkins Timeout Plugin that you can use to automatically take care of "stuck" builds.

rails backgroundjob running jobs in parallel?

I'm very happy with Bj so far, but I have this one issue:
When one process takes 1 or 2 hours to complete, all other jobs in the queue seem to wait for that one job to finish. Worse still is when uploading to a server that times out regularly.
My question: is Bj running jobs in parallel or one after another?
Thank you,
Damir
BackgroundJob will only allow one worker to run per webserver instance. This is by design to keep things simple. Here is a quote from Bj's README:
If one ignores platform specific details the design of Bj is quite simple: the
main Rails application submits jobs to table, stored in the database. The act
of submitting triggers exactly one of two things to occur:
1) a new long running background runner to be started
2) an existing background runner to be signaled
The background runner refuses to run two copies of itself for a given
hostname/rails_env combination. For example you may only have one background
runner processing jobs on localhost in development mode.
The background runner, under normal circumstances, is managed by Bj itself -
you need do nothing to start, monitor, or stop it - it just works. However,
some people will prefer manage their own background process, see 'External
Runner' section below for more on this.
The runner simply processes each job in a highest priority oldest-in fashion,
capturing stdout, stderr, exit_status, etc. and storing the information back
into the database while logging it's actions. When there are no jobs to run
the runner goes to sleep for 42 seconds; however this sleep is interuptable,
such as when the runner is signaled that a new job has been submitted so,
under normal circumstances there will be zero lag between job submission and
job running for an empty queue.
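To illustrate the behaviour that README describes (a single runner draining the queue serially, highest priority and oldest submission first, with an interruptible 42-second sleep), here is a rough conceptual sketch in Python; it is not Bj's actual code, and the field names are made up:

import threading

wakeup = threading.Event()   # set by the submitter when a new job arrives

def next_job(jobs):
    # highest priority first, then oldest submission
    pending = [j for j in jobs if j["state"] == "pending"]
    if not pending:
        return None
    return min(pending, key=lambda j: (-j["priority"], j["submitted_at"]))

def runner_loop(jobs):
    # one runner per hostname/env: jobs execute one after another, never in parallel
    while True:
        job = next_job(jobs)
        if job is None:
            wakeup.wait(timeout=42)   # sleep, but wake early on new submissions
            wakeup.clear()
            continue
        job["state"] = "running"
        try:
            job["result"] = job["work"]()    # the long-running task itself
            job["state"] = "finished"
        except Exception as exc:
            job["error"] = str(exc)
            job["state"] = "failed"

The point is the serial while loop: a job that takes two hours keeps the runner busy for two hours, and everything behind it in the queue waits.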
You can learn more on the project's GitHub page.
