I have a Rails 5 application using raven-ruby to send exceptions to Sentry which then sends alerts to our Slack.
Raven.configure do |config|
  config.dsn = ENV['SENTRY_DSN']
  config.environments = %w[ production development ]
  config.excluded_exceptions += []
  config.async = lambda { |event|
    SentryWorker.perform_async(event.to_hash)
  }
end
class SentryWorker < ApplicationWorker
  sidekiq_options queue: :default
  def perform(event)
    Raven.send_event(event)
  end
end
It's normal for our Sidekiq jobs to throw exceptions and be retried. These are mostly intermittent API errors and timeouts which clear up on their own in a few minutes. Sentry is dutifully sending these false alarms to our Slack.
I've already added the retry_count to the jobs. How can I prevent Sentry from sending exceptions with a retry_count < N to Slack while still alerting for other exceptions? An example that should not be alerted will have extra context like this:
sidekiq: {
context: Job raised exception,
job: {
args: [{...}],
class: SomeWorker,
created_at: 1540590745.3296254,
enqueued_at: 1540607026.4979043,
error_class: HTTP::TimeoutError,
error_message: Timed out after using the allocated 13 seconds,
failed_at: 1540590758.4266324,
jid: b4c7a68c45b7aebcf7c2f577,
queue: default,
retried_at: 1540600397.5804272,
retry: True,
retry_count: 2
},
}
What are the pros and cons of not sending them to Sentry at all vs sending them to Sentry but not being alerted?
Summary
An option that has worked well for me is configuring Sentry's should_capture alongside Sidekiq's sidekiq_retries_exhausted, with a custom attribute on the exception.
Details
1a. Add the custom attribute
You can add a custom attribute to an exception. You can define this on any error class with attr_accessor:
class SomeError < StandardError
  attr_accessor :ignore
  alias ignore? ignore
end
1b. Rescue the error, set the custom attribute, & re-raise
def perform
  # do something
rescue SomeError => e
  e.ignore = true
  raise e
end
2. Configure should_capture
should_capture lets you capture exceptions only when they meet defined criteria. The exception is passed to it, and you can read the custom attribute from it.
config.should_capture = ->(e) { !(e.respond_to?(:ignore?) && e.ignore?) }
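To see how the predicate behaves on its own, here is a minimal sketch in plain Ruby with no Sentry dependency. The constant name SHOULD_CAPTURE is an assumption made for illustration; the respond_to? guard is there because should_capture can also be handed exceptions (or messages) that never defined the custom attribute.

```ruby
# Error class carrying the custom flag described above.
class SomeError < StandardError
  attr_accessor :ignore
  alias ignore? ignore
end

# Hypothetical name: a callable that returns true when the error should be reported.
# Objects that do not define `ignore?` are always reported.
SHOULD_CAPTURE = lambda do |error|
  !(error.respond_to?(:ignore?) && error.ignore?)
end

flagged = SomeError.new('temporary API hiccup')
flagged.ignore = true

puts SHOULD_CAPTURE.call(flagged)              # flagged error is skipped
puts SHOULD_CAPTURE.call(SomeError.new)        # unflagged error is reported
puts SHOULD_CAPTURE.call(RuntimeError.new)     # unrelated errors are reported
```

Inside Raven.configure this could then be wired up as config.should_capture = SHOULD_CAPTURE.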
3. Flip the custom attribute when retries are exhausted
There are 2 ways to define the behaviour you want when a job dies, depending on the version of Sidekiq being used. If you want to apply it globally and have Sidekiq v5.1+, you can use a death handler. If you want to apply it to a particular worker, or are on a version older than v5.1, you can use sidekiq_retries_exhausted.
sidekiq_retries_exhausted { |_job, ex| ex.ignore = false }
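For the global variant, a death handler could look roughly like the sketch below. The constant name UNIGNORE_ON_DEATH is made up for illustration, and the defined?(Sidekiq) guard is only there so the snippet loads outside a Sidekiq process; in a real app the registration would normally live in an initializer.

```ruby
# Hypothetical handler: clears the custom `ignore` flag once Sidekiq gives up on a job.
# Exceptions that never defined the flag are left untouched.
UNIGNORE_ON_DEATH = lambda do |_job, ex|
  ex.ignore = false if ex.respond_to?(:ignore=)
end

# Death handlers are available in Sidekiq 5.1+.
if defined?(Sidekiq)
  Sidekiq.configure_server do |config|
    config.death_handlers << UNIGNORE_ON_DEATH
  end
end
```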
You can filter out the entire event if the retry_count is < N (this can be done inside the Sidekiq worker you posted). You will lose the data on how often this happens, but the alerts themselves will not be too noisy.
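class SentryWorker < ApplicationWorker
  sidekiq_options queue: :default
  def perform(event)
    # Sidekiq serializes job arguments as JSON, so the hash arrives with string keys.
    retry_count = event.dig('extra', 'sidekiq', 'job', 'retry_count')
    # N is a placeholder for your retry threshold.
    if retry_count.nil? || retry_count >= N
      Raven.send_event(event)
    end
  end
end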
Another idea is to set a different fingerprint depending on whether this is a retry or not. Like this:
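class MyJobProcessor < Raven::Processor
  def process(data)
    # Untested: adjust the key types (symbols vs. strings) to match the event hash.
    retry_count = data.dig(:extra, :sidekiq, :job, :retry_count)
    if (retry_count || 0) < N
      data["fingerprint"] = ["will-retry-again", "{{default}}"]
    end
    # A processor must return the (possibly modified) data.
    data
  end
end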
See https://docs.sentry.io/learn/rollups/?platform=javascript#custom-grouping
I didn't test this, but it should split your issues into two groups, depending on whether Sidekiq will retry them. You can then ignore one group but still look at it whenever you need the data.
A much cleaner approach, if you are trying to ignore exceptions belonging to a certain class, is to add them to excluded_exceptions in your config file:
config.excluded_exceptions += ['ActionController::RoutingError', 'ActiveRecord::RecordNotFound']
In the above example, the exceptions Rails uses to generate 404 responses will be suppressed.
See the docs for more configuration options
From my point of view, the best option is to let Sentry hold all the exceptions and configure its alert rules to decide which ones are sent to Slack.
To configure alerts in Sentry, go to the Alerts option in the main menu of your Sentry account.
In the following picture I configure an alert that only sends a notification to Slack if an exception of type ControllerException occurs more than 10 times.
Using this alert, we only receive the notification in Slack when all the conditions are met.
I'm writing the following module to capture SIGTERM that gets occasionally sent to my Delayed Job workers, and sets a variable called term_now that lets my job gracefully terminate itself before it's complete.
The following code works perfectly if I put it inline in my job, but I need it for several jobs, and when I put it in a module it doesn't work.
I assume it's not working because it only passes term_now one time (when it's false), and even when it returns true it doesn't pass it again, therefore it never stops the job.
module StopJobGracefully
  def self.execute(&block)
    begin
      term_now = false
      old_term_handler = trap('TERM') do
        term_now = true
        old_term_handler.call
      end
      yield(term_now)
    ensure
      trap('TERM', old_term_handler)
    end
  end
end
Here's the working inline code as it's normally used (this is the code I'm trying to convert to a module):
class SMSRentDueSoonJob
  def perform
    begin
      term_now = false
      old_term_handler = trap('TERM') do
        term_now = true
        old_term_handler.call
      end
      User.find_in_batches(batch_size: 1000) do
        if term_now
          raise 'Gracefully terminating job early...'
        end
        # do lots of complicated work here
      end
    ensure
      trap('TERM', old_term_handler)
    end
  end
end
You basically answered it yourself. In the example code you provided, the block will only see term_now as true if the trap fired before yield was called.
What you need is a mechanism for fetching the flag periodically, so that you can check it between iterations of, e.g., find_in_batches.
So instead of yielding the value, your module should expose a term_now method that returns an instance variable such as @term_now.
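A minimal sketch of that idea, under a few assumptions: the flag lives in a module-level instance variable, the reader is named term_now? (my naming, not from the question), and the previous handler is only called when it is callable, since trap can also return a string such as "DEFAULT" when no Ruby handler was installed.

```ruby
module StopJobGracefully
  @term_now = false

  class << self
    # Poll this from inside the job; it reflects the latest state of the flag.
    def term_now?
      @term_now
    end

    def execute
      @term_now = false
      old_term_handler = trap('TERM') do
        @term_now = true
        # Chain to the previous handler only if it is a Proc-like object.
        old_term_handler.call if old_term_handler.respond_to?(:call)
      end
      yield
    ensure
      # Restore whatever was installed before (fall back to the default action).
      trap('TERM', old_term_handler || 'DEFAULT')
    end
  end
end
```

Inside a job this would be used as StopJobGracefully.execute { User.find_in_batches(batch_size: 1000) { raise 'Gracefully terminating job early...' if StopJobGracefully.term_now? } }, so the flag is re-read on every batch instead of being captured once.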
I have scheduled a job
Worker.perform_at(time, args)
And I can fetch the scheduled jobs
job = Sidekiq::ScheduledSet.new.find_job(jid)
job.args # this is the args I passed above
I need to update the args that will be passed to the worker when it is called, i.e. update job.args. How do I do that?
This won't work:
job.args = new_args
Sidekiq::ScheduledSet.new.to_a[0] = job
Updating the job in place is not the way to achieve this. Instead, cancel the job and create a new one with the new args:
job = Sidekiq::ScheduledSet.new.find_job(jid)
time = job.at # or just set the time needed
Sidekiq::Status.cancel jid
Worker.perform_at(time, new_args)
It will also make it easier for you to debug and log the jobs, because editing them on the fly could cause bugs that are very hard to identify.
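The same delete-and-recreate flow can be sketched with Sidekiq's own API from sidekiq/api (the scheduled entry's at and delete methods) instead of a status add-on. The helper name reschedule_with_args is hypothetical, and the set and worker class are passed in as parameters purely so the logic can be exercised without Redis; new_args is assumed to be an array of positional arguments.

```ruby
# Hypothetical helper: replaces a scheduled job with a new one that runs at the
# same time but with different arguments. Returns the new jid, or nil if the
# original job could not be found (for example because it already ran).
def reschedule_with_args(scheduled_set, worker_class, jid, new_args)
  job = scheduled_set.find_job(jid)
  return nil unless job

  run_at = job.at   # time the original job was scheduled for
  job.delete        # remove the old entry from the scheduled set
  worker_class.perform_at(run_at, *new_args)
end
```

In an application this might be called as reschedule_with_args(Sidekiq::ScheduledSet.new, Worker, jid, new_args) after require 'sidekiq/api'. Note that the replacement job gets a new jid, so anything that stored the old one needs updating.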
I have some code that potentially can run for a longer period of time. However if it does I want to kill it, here is what I'm doing at the moment :
def perform
  Timeout.timeout(ENV['JOB_TIMEOUT'].to_i, Exceptions::WorkerTimeout) { do_perform }
end
private
def do_perform
  ...some code...
end
Where JOB_TIMEOUT is an environment variable with a value such as 10.seconds. I've got reports that this still doesn't prevent my job from running longer than it should.
Is there a better way to do this?
I believe delayed_job does some exception handling voodoo with multiple retries etc, not to mention that I think do_perform will return immediately and the job will continue as usual in another thread. I would imagine a better approach is doing flow control inside the worker
def perform
  # A nil timeout will continue with no timeout, protect against unset ENV
  timeout = (ENV['JOB_TIMEOUT'] || 10).to_i
  do_stuff
  begin
    Timeout.timeout(timeout) { do_long_running_stuff }
  rescue Timeout::Error
    clean_up_after_self
    notify_business_logic_of_failure
  end
end
This will work. Added benefits are not coupling delayed_job so tightly with your business logic - this code can be ported to any other job queueing system unmodified.
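One detail worth spelling out: Timeout.timeout treats 0 (like nil) as "no timeout", and nil.to_i is 0, so an unset or non-numeric JOB_TIMEOUT silently disables the limit. A small sketch of a guard around that follows; the helper names job_timeout_seconds and run_with_timeout and the default of 10 seconds are assumptions for illustration.

```ruby
require 'timeout'

# Hypothetical helper: reads JOB_TIMEOUT and falls back to a default when the
# variable is unset or does not start with a positive number.
def job_timeout_seconds(env = ENV, default: 10)
  seconds = env['JOB_TIMEOUT'].to_i
  seconds.positive? ? seconds : default
end

# Hypothetical helper: runs the block under a time limit and reports the outcome
# as a symbol instead of letting Timeout::Error escape.
def run_with_timeout(seconds)
  Timeout.timeout(seconds) { yield }
  :finished
rescue Timeout::Error
  :timed_out
end

outcome = run_with_timeout(job_timeout_seconds) { 1 + 1 }
puts outcome
```

Returning a symbol keeps the cleanup and notification decisions in the caller, in line with keeping the flow control inside the worker.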
all,
I have a custom Delayed::Job setup that uses the success and error callbacks to change the attributes of the object that is being modified in the background. This object is interacting with an external API. To test this, I'm using RSpec with VCR to record external API interactions.
Here's my worker:
class SuperJob < Struct.new(:thingy_id)
  include JobMethods
  def perform
    thing = Thingy.find(thingy_id)
    run_update(thing)
  end
  def success(job)
    thing = Thingy.find_by_job_id(job.id)
    thing.update(job_finished_at: Time.now, job_id: nil)
  end
  def error(job, exception)
    thing = Thingy.find_by_job_id(job.id)
    thing.update(job_id: -1, disabled: true)
  end
end
Here are my DJ settings:
Delayed::Worker.delay_jobs = !Rails.env.test?
Delayed::Worker.max_run_time = 2.minutes
I've successfully used RSpec to test the results of the success callback. What I'd like to do is test the results of the error callback. The external API doesn't have any particular limit on response time, so for my app I'd like to limit the maximum wait time to 2 minutes (as seen in the max_run_time setting for DJ).
Now, how do I test that? The API isn't returning a timeout, so I'm not sure how I need to handle this in VCR. The DJ job isn't running in a queue and I don't particularly want the suite to delay for 2 minutes on every run.
Thoughts or suggestions would be greatly appreciated! Thanks!