We have a job to make an API call to an external service. We are facing an issue with error handling with it. Let's say our job looks like this:
class MakeApiRequestJob < ApplicationJob
retry_on ApiClient::RequestThrottledError, wait: :exponentially_longer, attempts: 10
def perform
raise ApiClient::RequestThrottledError
end
end
The sample code above produces an error log like this:
ERROR -- : [ActiveJob] [MakeApiRequestJob] [job id] Error performing MakeApiRequestJob (Job ID: [job id]) from Sidekiq(default) in 25573.04ms: ApiClient::RequestThrottledError (Your request was throttled.):
followed by:
INFO -- : Retrying MakeApiRequestJob in 3 seconds, due to a ApiClient::RequestThrottledError.
The problem we have is that, the first ERROR line triggers an alert in our monitoring system. We know already that our requests can be throttled time to time, and we are comfortable if the request is accepted eventually. Only in case the job is permanently failed, which is when 10 retries have failed, we want to be notified by an alert.
Does anyone manage to manipulate ActiveJob::LogSubscriber and suppress it from producing error lines?
Related
One of my Sidekiq worker classes had validations for data size missing, hence one of the enqueued job is pulling in huge data from the database and failing abruptly with following message and immediately enqueuing another job with the same job_id.
Error performing MyWorkerClass (Job ID:
my_job_id) from Sidekiq(my_queue_name) in
962208.79ms: Sidekiq::Shutdown (Sidekiq::Shutdown):
As soon as I get this message, a new job is enqueued.
Performing MyWorkerClass (Job ID: my_job_id) from
Sidekiq(my_queue_name) with arguments: 1, {"param1"=>"param1_value",
"param2"=>"param2_value", "param3"=>"param3_value"}
I am figuring out a way to fix this problem but for now I want to stop this particular job from running continuously. I couldn't find this job on my sidekiq UI dashboard.
Also I tried to find and delete this job using following methods but couldn't find the job. All the variables printed below are Nil.
a = Sidekiq::Queue.new('my_queue_name').find_job("my_job_id")
b = Sidekiq::ScheduledSet.new.find_job("my_job_id")
c = Sidekiq::RetrySet.new.find_job("my_job_id")
d = Sidekiq::JobSet.new('my_queue_name').find_job("my_job_id")
puts a.inspect
puts b.inspect
puts c.inspect
puts d.inspect
I want help with the following:
How to avoid this abrupt shutdown for long running jobs in the future
Find the long running job and kill it.
Thank you in Advance !
In my app I am trying to perform two worker tasks sequentially.
First, a PDF is being created with Wicked pdf and then, once the PDF is created, to send an email to two different recipients with the PDF attached.
This is what is called in the controller :
PdfWorker.perform_async(#d.id)
MailingWorker.perform_in(1.minutes, #d.id,#d.class.name.to_s)
First worker creates the PDF and second worker sends email.
Here is second worker :
class MailingWorker
include Sidekiq::Worker
sidekiq_options retry: false
def perform(d_id,model)
#d = eval(model).find(d_id)
#model = model
if #d.pdf.present?
ProfessionnelMailer.notification_d(#d).deliver
ClientMailer.notification_d(#d).deliver
else
MailingWorker.perform_in(1.minutes, #d.id, #model.to_s)
end
end
end
The if statement checks if the PDF has been created. If true two mails are sent, otherwise, the same worker is called again one minute later, just to let the Heroku server extra time to process the PDF creation in case it takes more time or a long queue.
Though if the PDF has definitely failed to be processed, the above ends up in an infinite loop.
Is there a way to fix this ?
One option I see is calling the second worker inside the PDF creation worker though I don't really want to nest workers too deep. It makes my controller more clear to have them separate, I can see the sequence of actions. But any advice welcome.
Another option is to use sidekiq_options retry: 5 and request a retry of the controller that could be counted towards the full total of 5 retries, instead of retrying the worker with else MailingWorker.perform_in(1.minutes, #d.id, #model.to_s) but I don't know how to do this. As per this thread https://github.com/mperham/sidekiq/issues/769 it would be to raise an exception but I am not sure how to do this ... (also I am not sure how long the retry will wait before being processed with the exception method, with the solution above I can control the time frame..)
If you do not want to have nested workers, then in MailingWorker instead of enqueuing it again, raise an exception if the PDF is not present.
Also, configure the worker retry option, so that sidekiq will push it to the retry queue and run it again in sometime. According to the documentation,
Sidekiq will retry failures with an exponential backoff using the
formula (retry_count ** 4) + 15 + (rand(30) * (retry_count + 1)) (i.e.
15, 16, 31, 96, 271, ... seconds + a random amount of time). It will
perform 25 retries over approximately 21 days.
Worker code will be more like,
class MailingWorker
include Sidekiq::Worker
sidekiq_options retry: 5
def perform(d_id,model)
#d = eval(model).find(d_id)
#model = model
if #d.pdf.present?
ProfessionnelMailer.notification_d(#d).deliver
ClientMailer.notification_d(#d).deliver
else
raise "PDF not present"
end
end
end
I believe the "correct" and most asynchroneous way to do this is to have two queues, and two workers:
Queue 1: CreatePdfWorker
Queue 2: SendPdfWorker
When the CreatePdfWorker has generated the PDF, it then enqueues the SendPdfWorker with the newly generated PDF and recipients.
This way, each worker can work independently and pluck from the queue asynchroneously, and you're not struggling against the design choices of Sidekiq.
I use a Sidekiq queue to process communications with an unreliable, 3rd party API. Since this API is often down for a couple minutes at a time and then back up again, Sidekiq has been handy. When a connection issue happens, an error is raised and Sidekiq throws the job back in the queue to be retried again later, after some time has passed.
I use NewRelic to not only help debug crashes, but also for monitoring. My problem is that this current methodology above creates errors in NewRelic. If the 3rd party API is down for more than a couple of minutes, the error count accumulates enough to cause notifications to send out through NewRelic.
What I'd like to do is only raise an error from my worker when a certain number of retries have occurred for a job. I'm using sidekiq_retries_exhausted to do this. My problem is that I'm not quite sure how to put jobs back in the queue after they have an error without raising an error.
Does Sidekiq provide any facilities to return a job to a queue, increment the number of retries for the job, and have it sit there until it's due to run again, as if an exception was raised in the worker class?
You raise a specific error and tell the error service to ignore errors of that type. For NewRelic:
https://docs.newrelic.com/docs/agents/ruby-agent/installation-configuration/ruby-agent-configuration#error_collector.ignore_errors
Here is what I did to keep intentional retry errors out of AirBrake:
class TaskWorker
include Sidekiq::Worker
class RetryNotAnError < RuntimeError
end
def perform task_id
task = Task.find(task_id)
task.do_cool_stuff
if task.finished?
#log.debug "Task #{task_id} was successful."
return false
else
#log.debug "Task #{task_id} will try again later."
raise RetryNotAnError, task_id
end
end
end
Tell Airbrake to ignore it:
Airbrake.configure do |config|
config.ignore << 'RetryNotAnError'
end
It's good to make your exception name OBVIOUSLY not an error (e.g. RetryLaterNotAnError), as it will still show up in logs and such, and you don't want to freak people out when they see a bunch of them.
ps. That said, I would really like to see Sidekiq to provide an explicit, errorless retry mechanism.
If using Sidekiq Enterprise, one other option might be to utilize the optional set of additional error types that will then get treated as Sidekiq::Limiter::OverLimit violations.
For my purposes, I've used a new error class and then added it to the list in the config. Here are the notes from the sidekiq-ent code (not in the public sidekiq repo) on how to modify your config file:
# An optional set of additional error types which would be
# treated as a rate limit violation, so the job would automatically
# be rescheduled as with Sidekiq::Limiter::OverLimit.
#
# Sidekiq::Limiter.errors << MyApp::TooMuch
# Sidekiq::Limiter.errors = [Foo::Error, MyApp::Limited]
Inside the specific job you can specify the max_retries, or it will default to 20:
sidekiq_options max_limiter_retries: 10
Inside the job, I'll rescue the "expected" intermittent error that I'd rather not ignore completely and then raise the error I've added to the list, something like this:
rescue RestClient::RequestTimeout => e
raise SidekiqSoftRetry.new(e.inspect)
end
Here's what that looks like in my initialization file-- and Mike Perham was kind enough to respond with the option to update the global retry limit.
class SidekiqSoftRetry < RuntimeError
end
Sidekiq::Limiter::DEFAULT_OPTIONS[:reschedule] = 10
Sidekiq::Limiter.configure do |config|
config.errors.concat(
[
SidekiqSoftRetry,
]
)
end
Is the Delayed Job error hook fired once when the job first has an error or it is fired each time the job has errors on the retries too. My code seems to be firing the hook once on the 1st error and it doesn't fire on the retries errors?
error hook fires after every failed attempt, while failure fires once after number of attempts is greater than max_attempts.
If the error is running only once, check for:
max_attempts is set to one. Try explicitly setting the max attempts:
def max_attempts
3
end
You have an exception in your error hook. Try adding a rescue clause:
def error
# your code
rescue => e
Rails.logger.error "houston we have a problem #{e.message}"
end
Our app is hosted on heroku and we use delayed job when sending info to a remote system (via a GET to a url with some url params)
the remote system returns a success code usually, but it it;s real busy it returns a tryagain code.
suppose the our method is
def send_info
the_url = "http://mydomain.com/dosomething?arg=#{self.someval}"
the_result = open(the_url).read
successflag = get_success_flag_from(the_result)
end
and so somewhere in our code we do
#widget.delay.send_info
and that all works fine.
Except it does not automatically handle the case where the remote said to try back later.
Is there any way for the send_info method (which is what delayed job will execute) to "tell" delayed_job "retry me again"? Do we need to throw some custom exception or something?
Raising any kind of exception ought to cause delayed_job to requeue it (subject to only-trying-so-many-times); if you don't especially need a custom exception you can just raise a RuntimeError.