Sidekiq/Rails: How to catch exception AND retry - ruby-on-rails

I have Sidekiq jobs running Selenium. If the job crashes, I need to:
- catch the exception so I can shut down the Selenium driver (otherwise all subsequent jobs will crash as well),
- notify the error handler (Sentry),
- have Sidekiq retry the job.
Today I can catch and notify, but by catching the exception, Sidekiq will not retry the job.
My question is similar to this one, but a) it got no answer and b) that user did not want to notify their error-handling service.
How do I get Sidekiq to retry the job even though I catch the exception?

How about re-raising the exception? It would look something like this:
def perform
  # perform_code
rescue ErrorClass => error
  # handle the error
  raise error
end
This way, Sidekiq will still retry the job (because an error is raised), but your handling code is also executed.
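For the Selenium case described in the question, a minimal sketch of that pattern might look like this (the worker name, the url argument, and the Sentry call (sentry-ruby's Sentry.capture_exception) are assumptions for illustration, not from the original post):
class ScraperWorker
  include Sidekiq::Worker

  def perform(url)
    @driver = Selenium::WebDriver.for(:chrome)
    # ... drive the browser ...
  rescue StandardError => error
    Sentry.capture_exception(error) # notify the error handler
    raise error                     # re-raise so Sidekiq still retries the job
  ensure
    @driver&.quit                   # always shut the driver down so later jobs don't crash
  end
end
The ensure clause takes care of the driver cleanup whether the job succeeds, fails, or is going to be retried.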

Related

Is there a way to run code before Sidekiq is restarted in the middle of a job?

I have a Sidekiq job that runs every 4 minutes.
This job checks whether the code block is already being executed before running it again:
process = ProcessTime.where("name = 'ad_queue_process'").first
# Return if job is running
return if process.is_running == true
If Sidekiq restarts midway through the code block, code that updates the status of the job never runs
# Done running, update the process times and allow it to be ran again
process.update_attributes(is_running: false, last_execution_time: Time.now)
This leads to the job never running unless I run an update statement to set is_running = false.
Is there any way to execute code before Sidekiq is restarted?
Update:
Thanks to @Aaron, and following our discussion (comments below): the ensure block (which is executed by the forked worker threads) is only guaranteed a few milliseconds of run time before the main thread forcefully terminates these worker threads, so that the main thread can do its own "cleanup" up the exception stack and avoid getting SIGKILL-ed by Heroku. Therefore, make sure your ensure code is really fast!
TL;DR:
def perform(*args)
  # your code here
ensure
  process.update_attributes(is_running: false, last_execution_time: Time.now)
end
The ensure above is always called, regardless of whether the method "succeeded" or an Exception was raised. I tested this: see this repl code, and click "Run".
In other words, it is always called even on a SignalException, even if the signal is SIGTERM (the graceful-shutdown signal); the only exception is SIGKILL (a forced, unrescuable shutdown). You can verify this behaviour with my repl code: change Process.kill('TERM', Process.pid) to Process.kill('KILL', Process.pid) and click "Run" again (you'll notice that the puts is never called).
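For reference, here is a small standalone script along the lines of the repl test described above (an approximation written from the description, not the original repl code):
def perform
  Process.kill('TERM', Process.pid) # change 'TERM' to 'KILL' and the puts below never runs
  sleep 1                           # give the signal a moment to be delivered
ensure
  puts 'ensure block still ran'
end

perform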
Looking at Heroku docs, I quote:
When Heroku is going to shut down a dyno (for a restart or a new deploy, etc.), it first sends a SIGTERM signal to the processes in the dyno.
After Heroku sends SIGTERM to your application, it will wait a few seconds and then send SIGKILL to force it to shut down, even if it has not finished cleaning up. In this example, the ensure block does not get called at all, the program simply exits
... which means that the ensure block will be called because it's a SIGTERM and not a SIGKILL, unless shutting down takes a looong time, which may be due to (some reasons I can think of at the moment):
Something inside your perform code (or any Ruby code in the stack, even gems) that also rescues the SignalException (or even rescues the root Exception class, since SignalException is a subclass of Exception) but takes a long time cleaning up (i.e. closing DB connections, or I/O that hangs your application).
Or, your own ensure block above takes a looong time. E.g. while doing the process.update_attributes(...), the DB temporarily hangs, or there is a network delay or timeout; that update might not succeed at all and will run out of time, because (per the quote above) a few seconds after the SIGTERM the application is forcibly stopped by Heroku sending a SIGKILL.
... all of which means that my solution is still not fully reliable, but it should work under normal circumstances.
Handle sidekiq shutdown exception
class SomeWorker
  include Sidekiq::Worker
  sidekiq_options queue: :default

  def perform(params)
    ...
  rescue Sidekiq::Shutdown
    SomeWorker.perform_async(params)
  end
end
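Putting the two ideas together for the original is_running flag, a rough sketch could look like this (the worker name and the acquired guard are assumptions added for illustration, not from the thread):
class AdQueueWorker
  include Sidekiq::Worker
  sidekiq_options queue: :default

  def perform(*args)
    acquired = false
    process = ProcessTime.where("name = 'ad_queue_process'").first
    return if process.is_running == true # a previous run still holds the flag
    process.update_attributes(is_running: true)
    acquired = true
    # ... the actual work ...
  rescue Sidekiq::Shutdown
    self.class.perform_async(*args) # re-enqueue so the interrupted work is not lost
  ensure
    # keep this fast: only a short window is guaranteed before a SIGKILL arrives
    process.update_attributes(is_running: false, last_execution_time: Time.now) if acquired
  end
end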

WorkFusion RPAExpress Exception Handling

I'm using Error Handling in WorkFusion.
Is there a way to see the error message in the catch block, i.e. the "exception occurred" block?
How about using:
<log>exception_msg_var</log>
or
println exception_msg_var
or exporting exceptions to a datastore?
To get the error message in RPA Express, you can keep the code outside of exception handling; the software will then throw the error message on its own. Once you know the type of error (by running the bot once), you can put the solution in the catch block by moving the code inside the exception-handling feature.

What causes dask job failure with CancelledError exception

I have been seeing the error message below for quite some time now but could not figure out what leads to the failure.
Error:
concurrent.futures._base.CancelledError: ('sort_index-f23b0553686b95f2d91d4a3fda85f229', 7)
After restarting the Dask cluster, it runs successfully.
If running a dask-cloudprovider ECSCluster or FargateCluster the concurrent.futures._base.CancelledError can result from a long-running step in computation where there is no output (logging or otherwise) to the Client. In these cases, due to the lack of interaction with the client, the scheduler regards itself as "idle" and times out after the configured cloudprovider.ecs.scheduler_timeout period, which defaults to 5 minutes. The CancelledError error message is misleading, but if you look in the logs for the scheduler task itself it will record the idle timeout.
The solution is to set scheduler_timeout to a higher value, either via config or by passing directly to the ECSCluster/FargateCluster constructor.

Preventing iOS Automation Instruments from automatically retrying the test after failure

When my test runs into a critical failure, such as tapping an invalid element, the Automation Instrument attempts to restart the test from the beginning, which results in a lot of errors and can even lag my system, making it difficult to stop the test. I don't have the repeat option enabled. Is there a way of preventing this behavior?
I reckon what you can do is try to capture when your test runs into a failure by means of a try/catch block.
When your test fails, it will jump inside the catch block and you can stop it there.
Maybe something like this.
try {
  // Run your tests
} catch (exception) {
  UIALogger.logFail("Test failed with error message: " + exception.message);
}
I think that the logFail() method should be enough to keep your tests from running indefinitely.

Is it possible to get the stacktrace of an error in the error_logger handler?

I'm currently writing an error_logger handler and would like to get the stack trace of where the error happened (more precisely, at the place where error_logger:error* was called). But I cannot use erlang:get_stacktrace(), since I'm in a different process.
Does anyone know a way to get a stacktrace here?
Thanks
get_stacktrace() returns "stack back-trace of the last exception". Throw and catch an exception inside error_logger:error() and then you can get the stacktrace.
error() ->
    try throw(a) of
        _ -> a
    catch
        _:_ -> io:format("trace is ~p~n", [erlang:get_stacktrace()])
    end.
I have not fully debugged it, but I suppose that the error functions simply send a message (fire and forget) to the error logger process, so by the time your handler is called, after the message has been received, the sender might be doing something completely different. The message sent might contain the backtrace, but I highly doubt it.
