Jenkins job gets stuck in queue. How to make it fail automatically if the worker is offline

On my Jenkins instance, my worker "slave02" sometimes goes offline and needs to be brought back manually. I won't go into the details, because that's not the point of this question.
The scenario so far:
I've intentionally configured a job to run on that exact worker. Obviously it won't start while the worker is offline, and I want to get notified when the job gets stuck in the queue. I've tried the Build Timeout Jenkins plugin, configured to fail the build if it takes longer than 5 minutes to complete.
The problem is that the plugin only fails the job 5 minutes after the build has started, which does not help in my case: the job never starts, it just sits in the queue waiting to be processed, and that never happens. So my question is: is there a way to check whether that worker is down, automatically fail the build, and send a notification?
I am pretty sure this can be done, but I could not find a thread where this type of scenario is discussed.
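One approach that isn't tied to a plugin would be to watch the node and the queue from outside Jenkins via its REST API. Below is a minimal sketch of such a watchdog; the Jenkins URL, credentials, node name, and threshold are placeholders, and depending on your security setup you may also need to pass a CSRF crumb.

    # Hypothetical watchdog sketch: polls the Jenkins REST API, cancels queue
    # items that have waited too long while the node is offline, and prints a
    # notification. JENKINS_URL, NODE, and AUTH are assumptions for this sketch.
    import time
    import requests

    JENKINS_URL = "http://jenkins.example.com"   # assumed Jenkins URL
    NODE = "slave02"                             # the worker from the question
    AUTH = ("user", "api-token")                 # assumed credentials
    MAX_WAIT_SECONDS = 5 * 60                    # fail after 5 minutes in queue

    def node_is_offline():
        # /computer/<node>/api/json exposes an "offline" boolean for each agent
        r = requests.get(f"{JENKINS_URL}/computer/{NODE}/api/json", auth=AUTH)
        r.raise_for_status()
        return r.json().get("offline", False)

    def cancel_stale_queue_items():
        # /queue/api/json lists queued items with their id and enqueue timestamp
        r = requests.get(f"{JENKINS_URL}/queue/api/json", auth=AUTH)
        r.raise_for_status()
        now_ms = time.time() * 1000
        for item in r.json().get("items", []):
            waited = (now_ms - item["inQueueSince"]) / 1000
            if waited > MAX_WAIT_SECONDS:
                # POST /queue/cancelItem?id=<id> removes the item from the queue
                requests.post(f"{JENKINS_URL}/queue/cancelItem",
                              params={"id": item["id"]}, auth=AUTH)
                print(f"Cancelled queue item {item['id']} "
                      f"({item.get('why', 'no reason given')}); send email here")

    if __name__ == "__main__":
        if node_is_offline():
            cancel_stale_queue_items()

Run from cron every few minutes, this fails the stuck builds by cancelling them and gives you a hook for the notification; it is a sketch of the idea rather than a ready-made solution.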

Related

How to stop a Flink job using REST API

I am trying to deploy a job to Flink from Jenkins. So far I have figured out how to submit the jar file that is created in the build job. Now I want to find any Flink jobs running with the old jar, stop them gracefully, and start a new job with my new jar.
The API has methods to list jobs, cancel jobs, and submit jobs, but there does not seem to be a stop-job endpoint. Any ideas on how to gracefully stop a job using the API?
Even though the stop endpoint is not documented, it does exist and behaves similarly to the cancel one.
Basically, this is the bit missing in the Flink REST API documentation:
Stop Job
DELETE request to /jobs/:jobid/stop.
Stops a job, result on success is {}.
For those who are not aware of the difference between cancelling and stopping (copied from here):
The difference between cancelling and stopping a (streaming) job is the following:
On a cancel call, the operators in a job immediately receive a cancel() method call to cancel them as soon as possible. If operators do not stop after the cancel call, Flink will start interrupting the thread periodically until it stops.
A “stop” call is a more graceful way of stopping a running streaming job. Stop is only available for jobs which use sources that implement the StoppableFunction interface. When the user requests to stop a job, all sources will receive a stop() method call. The job will keep running until all sources properly shut down. This allows the job to finish processing all in-flight data.
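For illustration, calling that undocumented stop endpoint from a script might look like the sketch below; the JobManager address and job id are placeholders.

    # Sketch of the undocumented stop call described above (older Flink REST API).
    # BASE_URL and JOB_ID are placeholders for your JobManager address and job.
    import requests

    BASE_URL = "http://flink-jobmanager:8081"     # assumed JobManager REST address
    JOB_ID = "4c88f503005f79fde0f2d92b4ad3ade4"   # 32-character hex job id

    # DELETE /jobs/:jobid/stop -- returns {} on success, per the answer above
    resp = requests.delete(f"{BASE_URL}/jobs/{JOB_ID}/stop")
    resp.raise_for_status()
    print(resp.json())   # expected: {}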
As I'm using Flink 1.7, below is how to cancel/stop a Flink job in this version.
Already Tested By Myself
Request path:
/jobs/{jobid}
jobid - 32-character hexadecimal string value that identifies a job.
Request method: PATCH
Query parameters:
mode (optional): String value that specifies the termination mode. Supported values are: "cancel" and "stop".
Example
10.xx.xx.xx:50865/jobs/4c88f503005f79fde0f2d92b4ad3ade4?mode=cancel
The host and port are shown when you start the yarn-session.
The job ID is returned when you submit a job.
Ref:
https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html
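A quick sketch of the same Flink 1.7 call from a script, with the host, port, and job id as placeholders:

    # Sketch of the Flink 1.7 termination call described above.
    # HOST:PORT comes from your yarn-session output; JOB_ID from job submission.
    import requests

    BASE_URL = "http://10.xx.xx.xx:50865"         # placeholder host:port
    JOB_ID = "4c88f503005f79fde0f2d92b4ad3ade4"   # placeholder job id

    # PATCH /jobs/{jobid}?mode=cancel  (use mode=stop for a graceful stop)
    resp = requests.patch(f"{BASE_URL}/jobs/{JOB_ID}", params={"mode": "cancel"})
    resp.raise_for_status()
    print(resp.status_code)   # Flink acknowledges the termination request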

Scheduled job does not appear to run and no kernel files are created

I have a scheduled notebook job that had been running without issue for a number of days; however, last night it stopped running. Note that I am able to run the job manually without issue.
I raised a previous question on this topic: How to troubleshoot a DSX scheduled notebook?
Following the above instructions, I noticed that no log files were created at the times when the job should have run. Because I'm able to run the job manually and there are no kernel logs created at the times the scheduled job should have run, I'm presuming there is an issue with the scheduler service.
Are there any other steps I can perform to investigate this issue?
This sounds like a problem with the scheduling service. I recommend taking it up with DSX support. Currently there is no management UX telling you why a specific job failed or letting you restart a particular execution (that would be a good candidate for an enhancement request via https://datascix.uservoice.com/).

How to send warning email when build queue exceeds a particular length?

I manage a Jenkins server with a few hundred projects in the whole ecosystem. Many of the projects rely on upstream servers that, unfortunately, are not always responsive. When there is a lag on these servers, my build queue can grow to 10 or more. Is there a plugin or setting to send a warning email when the build queue exceeds a particular length?
I have been unable to find a plugin that does this, but you can query Jenkins for the information as detailed here: Jenkins command to get number of builds in queue.
If you have a Jenkins slave available, you could set up a job that runs every 15 minutes and hits each of the other Jenkins servers with the API call to get build queue counts (this is easy if you have just one master and many slaves).
If you wanted to stay completely outside of Jenkins (not add another job to the mix) you could write a script to poll the Jenkins API for the information. You could then run that script under, say, a 15 minute (or some other relevant time step) timer using cron (or windows scheduled task). Admittedly then you have to dedicate some resources to running this job.
It looks like you could use Python to get the build queue and check the length of the returned list via get_queue_info().
I haven't mucked about with the Jenkins API much myself so I'm not sure offhand exactly what the script would need, but it should be simple enough once you dig into it.
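As a rough sketch of that external-script idea, assuming the python-jenkins package (which provides get_queue_info()) and an SMTP server you can reach; every URL, credential, and threshold below is a placeholder:

    # Hypothetical cron-driven check: warn by email when the Jenkins build queue
    # grows past a threshold. Assumes the python-jenkins package; all names are
    # placeholders for your environment.
    import smtplib
    from email.message import EmailMessage

    import jenkins  # pip install python-jenkins

    QUEUE_THRESHOLD = 10

    server = jenkins.Jenkins("http://jenkins.example.com",
                             username="user", password="api-token")
    queue_length = len(server.get_queue_info())   # list of queued items

    if queue_length > QUEUE_THRESHOLD:
        msg = EmailMessage()
        msg["Subject"] = f"Jenkins build queue length is {queue_length}"
        msg["From"] = "jenkins-monitor@example.com"
        msg["To"] = "team@example.com"
        msg.set_content(f"The build queue has {queue_length} items waiting.")
        with smtplib.SMTP("smtp.example.com") as smtp:
            smtp.send_message(msg)

Scheduled every 15 minutes under cron or a Windows scheduled task, this covers the "stay outside Jenkins" option described above.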

Sidekiq not adding jobs to queue

Some time ago I wrote a small Ruby application which uses Sidekiq to convert video files and push them to a few online video hosting services. I use two workers and two queues: one to actually convert the files, and a second to publish the converted files. Jobs are pushed to the first queue by a Rails application for conversion, and after successful processing the conversion worker pushes an upload job to the second queue.
Rails -> Converter Queue -> Uploader Queue
Recently I discovered a massive memory leak in the converter library which appears after every few jobs and overloads the whole server, so I did a little hack to avoid this by stopping the whole Sidekiq worker process with an Interrupt exception and starting it again via systemd.
This worked perfectly until yesterday, when I got a notification from my client that files were not being converted. I did some investigation to find out what was failing and found that jobs are not being added to the converter queue. It started failing without any changes in code or services. When Rails adds jobs to the Sidekiq queue it receives a proper job ID, with no exception or warning at all, but the job simply does not appear in any queue. I checked the Redis logs, systemd logs, dmesg, every log I could check, and did not find even the slightest warning; it seems that the jobs get lost in a vacuum. In fact, after more digging and debugging I discovered that if one job is pushed rapidly (100 times in a loop), there is a chance that Sidekiq will add the job to the queue. Sometimes it will add all the jobs, and sometimes not even a single one.
The second queue works perfectly: it picks up every single job that I add to it. When I try to add 1000 new jobs, the second queue queues them all, while the converter queue gets at best 10 jobs. Things get really weird when I try to use another queue: I pushed 100 jobs to a new queue, all of them were added properly, and then I instructed the conversion worker to use that new queue. It worked; I could add new jobs to that queue and all of them seemed to be pushed successfully. But as soon as the worker finished processing all the jobs that had been pushed before it was assigned to this queue, it started failing again. Disabling the code that restarts the worker after every job didn't help at all.
The funny thing is that jobs are in fact pushed to the queue, but only when I push them multiple times, and it seems totally random whether a job is added properly. This bug appeared out of nowhere: for a few months everything worked perfectly, and recently it started failing without any changes in code or server. The logs are perfectly clean, and Sidekiq is used with the same Redis server without any problems by a few other applications; it seems that only this particular worker has this problem. I did not find any references to a similar bug on the web, and I spent two days trying to debug this and find the source of this weird behavior, but I found nothing. Everything seems to work perfectly and jobs simply disappear somewhere between the push and the Redis database.

Blocking a triggered Jenkins job until something *outside* Jenkins is done

I have a Jenkins job which starts a long-running process outside of Jenkins. The job itself is triggered by Gerrit.
If this job is triggered again while the long-running process is ongoing, I need to ensure that the job remains on the Jenkins queue until said process has completed. Effectively I want to ensure that the job never runs in parallel with itself, with the wrinkle that "the job" is really the Jenkins job plus the external long-running process.
I can't find any way to achieve this. The External Resource Dispatcher plugin seems like it could work, but every time I've configured it on our system, Jenkins got extremely unstable (refusing page loads for minutes on end, slave threads dying with NPEs). Everything else I can see, such as the Exclusions plugin, depend on Jenkins itself controlling the entirety of the job.
I've tried hacking something together with node labels - having the job depend on a label "can_run", assigning that label to master, and then having the job execute a Groovy script that removes that label from master. (Theoretically there would be another Jenkins job that adds the label back, which would be triggered by the end of the long-running process.) But it didn't work: if there were any queued instances of the job on Jenkins, they went ahead and started right away even though the label had been removed.
I don't know what else to try! Is there anything other than a required node label being missing which will cause Jenkins to queue the job if it is triggered, but not start it?
I guess the long-running process is triggered and your job returns immediately, which makes it an asynchronous process, right? I would suggest handling the long-running-process detection and waiting logic in your trigger process: every time before you trigger the job, check whether the long-running process is running, and only trigger the job if it is not.
Actually, I am not quite sure what you are trying to do. Basically, because of that long-running process, it is impossible for you to run two of these jobs in parallel. If that is the case, make it a non-parallel job.
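As a rough illustration of the trigger-side gating suggested above (not something from the original answer), the trigger script could skip starting the Jenkins job while the external process is still running; the lock-file path, job URL, and credentials are assumptions for this sketch:

    # Sketch of gating the trigger on the external long-running process.
    # Assumes the external process holds a lock file while it runs, and that the
    # Jenkins job can be started via the remote build API. All names are placeholders,
    # and your security setup may additionally require a build token or CSRF crumb.
    import os
    import sys
    import requests

    LOCK_FILE = "/var/run/long_process.lock"      # assumed lock held by the process
    JENKINS_JOB_URL = "http://jenkins.example.com/job/my-job/build"
    AUTH = ("user", "api-token")

    if os.path.exists(LOCK_FILE):
        # The external process is still running; leave the job untriggered.
        print("Long-running process still active; not triggering the job.")
        sys.exit(0)

    # Otherwise trigger the Jenkins job remotely.
    resp = requests.post(JENKINS_JOB_URL, auth=AUTH)
    resp.raise_for_status()
    print("Job triggered:", resp.status_code)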
