Sidekiq Pro callback when batch is retried? - ruby-on-rails

I'm using Sidekiq Pro for my application, and it's been working great. But I'd like a way to notify my users that a failed job is being retried.
A flow would go something like this:
Batch starts
worker1 runs successfully
worker2 runs successfully
worker3 fails
oncomplete fires, stuff happens
worker3 restarts
onretry fires, notification sent to user
worker3 runs successfully
onsuccess fires, stuff happens
My imaginary onretry doesn't exist in the documentation but I'm hoping there's a way to fake it. I know that I can tell if the batch has failures via the status object, but I don't see a way to get a retry event. Is there such a thing?

The most workable approach is likely a server-side middleware which can detect a retry in progress for a batched job and send an email.
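A minimal sketch of such a middleware, assuming a hypothetical RetryNotifier of your own for the actual notification; it keys off the retry_count field Sidekiq adds to a job's payload once it has failed at least once, and the bid field that Sidekiq Pro sets on batched jobs:

# Server-side middleware sketch: notify when a batched job runs as a retry.
# RetryNotifier is a hypothetical notifier/mailer in your own application.
class BatchRetryNotifier
  def call(worker, job, queue)
    # 'retry_count' appears in the payload once the job has failed at least once;
    # 'bid' is only present on jobs that belong to a Sidekiq Pro batch.
    if job['bid'] && job['retry_count']
      RetryNotifier.notify(
        batch_id:  job['bid'],
        job_class: job['class'],
        attempt:   job['retry_count'].to_i + 1
      )
    end
    yield
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add BatchRetryNotifier
  end
end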

Related

Jenkins job gets stuck in queue. How to make it fail automatically if the worker is offline

So on my Jenkins, my worker "slave02" sometimes goes offline and needs to be brought back manually. I won't go into details, because that's not the point of this question.
The scenario so far:
I've intentionally configured a job to run on that exact worker. But obviously it won't start while the worker is offline. I want to get notified when that job gets stuck in the queue. I've tried the Build Timeout Jenkins Plugin, configured to fail the build if it waits longer than 5 minutes to complete.
The problem is that the plugin only fails the build 5 minutes after it has started, which does not help in my case: the job never starts, it just sits in the queue waiting to be processed, and that never happens. So my question is: is there a way to check whether that worker is down, automatically fail the build, and send a notification?
I am pretty sure that can be done but I could not find a thread where this type of scenario is being discussed.

Sidekiq multiple dependant jobs, when to completed or retry?

In my Rails application, I have a model called Report
Report has one or many chunks (the Chunk model), each of which generates a piece of content based on external service calls (APIs, etc.).
When a user requests a report, I use Sidekiq to queue the chunks' jobs so they run in the background, and notify the user that we will email them the result once the report is generated.
Report uses a state machine, to flag whether or not all the jobs are successfully finished. All the chunks must be completed before we flag the report as ready. If one fails, we need to either try again, or give up at some point.
I defined the states as draft (default), working, and finished. The finished result is a combination of all the service pieces put together; 'draft' is when the chunks are still in the queue and none of them has started generating any content.
How would you tackle this situation with Sidekiq? How do you keep track (live) of which chunks' services are finished, working, or failed, so we can flag the report as finished or failed?
I'd like a way to periodically check the jobs to see where they stand, change the state when they have all finished successfully, or flag the report as failed if all the retries give up.
Thank you
We had a similar need in our application to determine when sidekiq jobs were finished during automated testing.
What we used is the sidekiq-status gem: https://github.com/utgarda/sidekiq-status
Here's the rough usage:
job_id = Job.perform_async()
You'd then pass the job ID to the place where it will try to check the status of the job
Sidekiq::Status::status job_id #=> :working, :queued, :failed, :complete
Hope this helps.
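A slightly fuller sketch of that usage, assuming the gem's client/server middleware is configured as in its README and borrowing the ChunkWorker name from the question above:

require 'sidekiq-status'

class ChunkWorker
  include Sidekiq::Worker
  include Sidekiq::Status::Worker # enables status tracking for this worker

  def perform(chunk_id)
    # ... generate this chunk's content from the external services ...
  end
end

job_id = ChunkWorker.perform_async(42)

# Later, wherever you need to check on it (controller, polling job, ...):
Sidekiq::Status::status(job_id)    # => :queued, :working, :complete or :failed
Sidekiq::Status::complete?(job_id) # => true / false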
This is a Sidekiq Pro feature called Batches.
https://github.com/mperham/sidekiq/wiki/Batches
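A minimal sketch of how a batch could map onto the report/chunk model in the question (Report, ChunkWorker and the mark_finished!/mark_failed! state-machine transitions are illustrative names from the question, not part of the Sidekiq Pro API):

# Callback class invoked by Sidekiq Pro when the batch completes/succeeds.
class ReportCallbacks
  # Fires once every chunk job has run, even if some of them failed.
  def on_complete(status, options)
    report = Report.find(options['report_id'])
    report.mark_failed! if status.failures > 0
  end

  # Fires only when every chunk job has succeeded.
  def on_success(status, options)
    Report.find(options['report_id']).mark_finished!
  end
end

batch = Sidekiq::Batch.new
batch.description = "Generate report #{report.id}"
batch.on(:complete, ReportCallbacks, 'report_id' => report.id)
batch.on(:success,  ReportCallbacks, 'report_id' => report.id)

batch.jobs do
  report.chunks.each { |chunk| ChunkWorker.perform_async(chunk.id) }
end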

How to stop a Flink job using REST API

I am trying to deploy a job to Flink from Jenkins. Thus far I have figured out how to submit the jar file that is created in the build job. Now I want to find any Flink jobs running with the old jar, stop them gracefully, and start a new job utilizing my new jar.
The API has methods to list the jobs, cancel jobs, and submit jobs. However, there does not seem to be a stop job endpoint. Any ideas on how to gracefully stop a job using API?
Even though the stop endpoint is not documented, it does exist and behaves similarly to the cancel one.
Basically, this is the bit missing in the Flink REST API documentation:
Stop Job
DELETE request to /jobs/:jobid/stop.
Stops a job, result on success is {}.
For those who are not aware of the difference between cancelling and stopping (copied from here):
The difference between cancelling and stopping a (streaming) job is the following:
On a cancel call, the operators in a job immediately receive a cancel() method call to cancel them as soon as possible. If the operators do not stop after the cancel call, Flink will start interrupting the thread periodically until it stops.
A “stop” call is a more graceful way of stopping a running streaming job. Stop is only available for jobs
which use sources that implement the StoppableFunction interface. When the user requests to stop a job,
all sources will receive a stop() method call. The job will keep running until all sources properly shut down.
This allows the job to finish processing all inflight data.
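For example, from Ruby with plain Net::HTTP (host, port and job id are placeholders, and the endpoint is the undocumented one described above):

require 'net/http'

# Placeholders: point these at your JobManager and the job you want to stop.
http   = Net::HTTP.new('flink-jobmanager.example.com', 8081)
job_id = '4c88f503005f79fde0f2d92b4ad3ade4'

response = http.send_request('DELETE', "/jobs/#{job_id}/stop")
puts response.code # "200" with body {} on success, per the description above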
Since I'm using Flink 1.7, below is how to cancel/stop a Flink job on this version (I have tested this myself).
Request path:
/jobs/{jobid}
jobid - 32-character hexadecimal string value that identifies a job.
Request method: PATCH
Query parameters:
mode (optional): String value that specifies the termination mode. Supported values are: "cancel", "stop".
Example
10.xx.xx.xx:50865/jobs/4c88f503005f79fde0f2d92b4ad3ade4?mode=cancel
The host and port are shown when you start a yarn-session; the jobid is available when you submit a job.
Ref:
https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html
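The same call from Ruby, again with placeholder host, port and job id:

require 'net/http'

http   = Net::HTTP.new('10.xx.xx.xx', 50865) # JobManager host/port (placeholders)
job_id = '4c88f503005f79fde0f2d92b4ad3ade4'

# mode=cancel or mode=stop, as described above.
response = http.send_request('PATCH', "/jobs/#{job_id}?mode=cancel")
puts response.code # Flink typically answers 202 Accepted once termination is triggered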

Preventing Timeout (H12 error) on Heroku

So I'm using Heroku to host a simple script, which runs whenever a specific page is loaded. The script takes longer than 30 seconds to run, so Heroku returns an H12 error - Request Timeout (https://devcenter.heroku.com/articles/limits#router). I can't use a background process for this task, as I'm using its run time as a loading screen for the user. I know the process will still complete, but I want a 200 code to be sent when the script finishes.
Is there a way to send a single byte every, say, 20 seconds, so that the request doesn't time out, and to stop whenever the script finishes? (Any response from the Heroku app starts a rolling 55-second window that prevents the timeout.) Do I have to run another process simultaneously to check whether the longer process is finished, sending a kind of 'heartbeat' to the requesting page to let it know the process is still running and to keep Heroku from timing out? I'm extremely new to Rails; any and all help is appreciated!
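A hedged sketch of the heartbeat idea described in the question, assuming a Rails controller using ActionController::Live and a hypothetical run_long_script method standing in for the existing task (each streamed byte restarts Heroku's rolling 55-second window):

class LongTaskController < ApplicationController
  include ActionController::Live

  def show
    response.headers['Content-Type'] = 'text/plain'
    worker = Thread.new { run_long_script } # run_long_script is your existing task

    # Thread#join(20) returns nil until the script finishes, so this writes
    # one byte every 20 seconds and Heroku's router never sees 55 idle seconds.
    response.stream.write(' ') until worker.join(20)
    response.stream.write("done\n")
  ensure
    response.stream.close
  end
end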

Locked delayed_job row lingers in the database after capistrano deploy

Whenever I deploy with capistrano or run cap production delayed_job:restart, I end up with the currently-running delayed_job row remaining locked.
The delayed_job process is successfully stopped, a new delayed_job process is started, and a new row is locked by the new process. The problem is that the last process' row is still sitting there & marked as locked. So I have to go into the database manually, delete the row, and then manually add that job back into the queue for the new delayed_job process to get to.
Is there a way for the database cleanup & re-queue of the previous job to happen automatically?
I have the same problem. This happens whenever a job is forcibly killed. Part of the problem is that worker processes are managed by the daemons gem rather than delayed_job itself. I'm currently investigating ways to fix this, such as:
Setting a longer timeout before daemons forcibly terminates (nothing about this in the docs for delayed_job or daemons)
Clearing locks before starting delayed_job workers (sketched below)
I'll post back here when and if I come up with a solution.
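For the second idea, a hypothetical sketch of clearing stale locks from a deploy hook or rake task that runs before delayed_job:start (it assumes the default locked_by format, which contains "host:<hostname>"):

require 'socket'

# Any lock held by a worker on this host is stale once the old process is gone,
# so release it before the new workers start.
host = Socket.gethostname
Delayed::Job
  .where('locked_by LIKE ?', "%host:#{host}%")
  .update_all(locked_by: nil, locked_at: nil)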
Adjust your Daemon wait time or raise an exception on SIGINT.
@John Carney is correct. In short, all delayed_job workers get sent something like a SIGINT (a nice interrupt) on a redeploy. delayed_job workers, by default, will complete their current job (if they are working on one) and then gracefully terminate.
However, if the job that they are working on is a longer-running job, there's an amount of time the Daemon manager waits before it gets annoyed and sends a more serious interrupt signal, like a SIGTERM or SIGKILL. This wait time and what gets sent really depends on your setup and configuration.
When that happens, the delayed_job worker gets killed immediately without being able to finish the job it is working on or even cleanup after itself and mark the job as no longer locked.
This ends up in a "stranded" job that is marked as "locked" but locked to a process/worker that no longer exists. Not good.
That's the crux of the issue and what is happening. To get around this, you have two main options, depending on what your jobs look like (we use both):
1. Raise an exception when an interrupt is received.
You can do this by setting the raise_signal_exceptions configuration to either :term or true:
Delayed::Worker.raise_signal_exceptions = :term
This configuration option accepts :term, true, or false (the default). You can read more in the original commit here.
I would try first with :term and see if that solves your issue. If not, you may need to set it to true.
Setting it to :term or true will gracefully raise an exception and unlock the job so that another delayed_job worker can pick it up and start working on it.
Setting it to true means that your delayed_job workers won't even attempt to finish the current job that they are working on. They will just immediately raise an exception, unlock the job and terminate themselves.
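In a Rails app this setting would typically live in an initializer, e.g. (the path is just a convention):

# config/initializers/delayed_job.rb
# Unlock the current job and raise on SIGTERM so a redeploy doesn't strand locked rows.
Delayed::Worker.raise_signal_exceptions = :term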
2. Adjust how your workers are interrupted/terminated/killed on a redeploy.
This really depends on your redeploy, etc. In our case, we are using Cloud66 to handle deploys so we just had to configure this with them. But this is what ours looks like:
stop_sequence: int, 172800, term, 90, kill # Allows long-running delayed jobs to finish before being killed (i.e. on redeploy). Sends SIGINT, waits 48 hours, sends SIGTERM, waits 90 seconds, sends SIGKILL.
On a redeploy, this tells the Daemon manager to follow these steps with each delayed_job worker:
Send a SIGINT.
Wait 172800 seconds (2 days) - we have very long-running jobs.
Send a SIGTERM, if the worker is still alive.
Wait 90 seconds.
Send a SIGKILL, if the worker is still alive.
Anyway, that should help get you on the right track to configuring this properly for yourself.
We use both methods by setting a lengthy timeout as well as raising an exception when a SIGTERM is received. This ensures that if there is a job that runs past the 2 day limit, it will at least raise an exception and unlock the job, allowing us to investigate instead of just leaving a stranded job that is locked to a process that no longer exists.
