How to get sidekiq retry_count from inside a job - ruby-on-rails

I am trying to send an alert whenever the retry_count of a Sidekiq job reaches 5 (to warn an engineer to check why the worker is failing), while the job continues to be retried as usual.
Is there a way to get the retry count for a particular job from inside the job?
I could just use:
sidekiq_retry_in do |count|
  (warn engineer here)
  10 * (count + 1) # (i.e. 10, 20, 30, 40)
end
and send a message from in there, but I think it's a bit of a hack.
Any ideas? Googling didn't surface any results.

There is no way to get the retry count from within the job, by design.
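A server middleware is one commonly used way around this, since it runs outside the job but sees the job payload. The sketch below is only an illustration under stated assumptions: the class name RetryAlertMiddleware, the ALERT_AT threshold and the notifier lambda are hypothetical, and it assumes the payload carries a 'retry_count' key that is absent on the first attempt and starts at 0 on the first retry, so the comparison may need adjusting to the attempt you care about.

```ruby
# Hypothetical server middleware; names and threshold are assumptions, not Sidekiq API.
class RetryAlertMiddleware
  ALERT_AT = 5

  # The notifier is injectable so the alert channel (email, Slack, ...) can be swapped.
  def initialize(notifier = ->(message) { warn(message) })
    @notifier = notifier
  end

  # Sidekiq calls server middleware with the job instance, the payload hash and the queue name.
  def call(_worker, job, _queue)
    if job['retry_count'] == ALERT_AT
      @notifier.call("#{job['class']} #{job['jid']} has reached retry_count #{ALERT_AT}")
    end
    yield # always run the job itself, so retries continue as usual
  end
end

# Registration (requires Sidekiq, shown as a comment to keep the sketch standalone):
# Sidekiq.configure_server do |config|
#   config.server_middleware { |chain| chain.add RetryAlertMiddleware }
# end
```

Because the middleware only inspects the payload and then yields, the job keeps being retried normally after the alert is sent.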

Related

Unable to search/delete a continuously retrying sidekiq job

One of my Sidekiq worker classes was missing validations for data size, so one of the enqueued jobs is pulling a huge amount of data from the database and failing abruptly with the following message, then immediately enqueuing another job with the same job_id.
Error performing MyWorkerClass (Job ID:
my_job_id) from Sidekiq(my_queue_name) in
962208.79ms: Sidekiq::Shutdown (Sidekiq::Shutdown):
As soon as I get this message, a new job is enqueued.
Performing MyWorkerClass (Job ID: my_job_id) from
Sidekiq(my_queue_name) with arguments: 1, {"param1"=>"param1_value",
"param2"=>"param2_value", "param3"=>"param3_value"}
I am working on a fix for this problem, but for now I want to stop this particular job from running continuously. I couldn't find this job on my Sidekiq UI dashboard.
I also tried to find and delete this job using the following methods, but couldn't find it. All the variables printed below are nil.
a = Sidekiq::Queue.new('my_queue_name').find_job("my_job_id")
b = Sidekiq::ScheduledSet.new.find_job("my_job_id")
c = Sidekiq::RetrySet.new.find_job("my_job_id")
d = Sidekiq::JobSet.new('my_queue_name').find_job("my_job_id")
puts a.inspect
puts b.inspect
puts c.inspect
puts d.inspect
I want help with the following:
How to avoid this abrupt shutdown for long-running jobs in the future.
How to find the long-running job and kill it.
Thank you in advance!

Unexpected sidekiq jobs get executed

I'm using sidekiq-cron to run some jobs. I have a parent job which runs only once, and that parent job starts 7 million child jobs. However, my Sidekiq dashboard shows over 42 million jobs enqueued. I checked those enqueued jobs, and they are my child jobs. I'm trying to figure out why so many more jobs than expected are enqueued. I checked the Sidekiq log, and one thing I noticed is that "Cron Jobs - add job with name: new_topic_post_job" shows up many times. new_topic_post_job is the name of the parent job in schedule.yml. The following lines also show up many times:
2019-04-18T17:01:22.558Z 12605 TID-osb3infd0 WARN: Processing recovered job from queue queue:low (queue:low_i-03933b94d1503fec0.nodemodo.com_4): "{\"retry\":false,\"queue\":\"low\",\"backtrace\":true,\"class\":\"WeeklyNewTopicPostCron\",\"args\":[],\"jid\":\"f37382211fcbd4b335ce6c85\",\"created_at\":1555606809.2025042,\"locale\":\"en\",\"enqueued_at\":1555606809.202564}"
2019-04-18T17:01:22.559Z 12605 TID-osb2wh8to WeeklyNewTopicPostCron JID-f37382211fcbd4b335ce6c85 INFO: start
WeeklyNewTopicPostCron is the name of the parent job class. Does this mean my parent job runs multiple times instead of only once? If so, what's the cause? I'm pretty sure the schedule of the cron job is right: I set it to "0 17 * * 4", which means it runs only once a week. I also set retry to false for the parent job and 3 for the child jobs, so even if all child jobs fail, we should still have only 21 million jobs. The following is my cron job setting in schedule.yml:
new_topic_post_job:
  cron: "0 17 * * 4"
  class: "WeeklyNewTopicPostCron"
  queue: low
and this is WeeklyNewTopicPostCron:
class WeeklyNewTopicPostCron
  include Sidekiq::Worker
  sidekiq_options queue: :low, retry: false, backtrace: true
  def perform
    processed_user_ids = Set.new
    TopicFollower.select("id, user_id").find_in_batches(batch_size: 1000000) do |topic_followers|
      new_user_ids = []
      topic_followers.map(&:user_id).each { |user_id| new_user_ids << user_id if processed_user_ids.add?(user_id) }
      batch_size = 1000
      offset = 0
      loop do
        batched_user_ids_for_redis = new_user_ids[offset, batch_size]
        Sidekiq::Client.push_bulk('class' => NewTopicPostSender,
                                  'args' => batched_user_ids_for_redis.map { |user_id| [user_id, 7] }) if batched_user_ids_for_redis.present?
        break if batched_user_ids_for_redis.size < batch_size
        offset += batch_size
      end
    end
  end
end
Most probably your parent sidekiq job is causing the sidekiq process to crash, which then results in a worker restart. On restart sidekiq probably tries to recover the interrupted job and starts processing it again (from the beginning). Some details here:
https://github.com/mperham/sidekiq/wiki/Reliability#recovering-jobs
This probably happens multiple times before the parent job eventually finishes, which is why such an extremely high number of child jobs is created. You can easily verify this by checking the process ID of the Sidekiq process while this job is running; it will most probably keep changing:
ps aux | grep sidekiq
It could be that you have some monit configuration that restarts Sidekiq when memory usage goes too high. Or it might be that this query is causing the process to crash:
TopicFollower.select("id, user_id").find_in_batches(batch_size: 1000000)
Try reducing the batch_size; 1 million feels like too high a number. But my best guess is that the Sidekiq process dies while processing the long-running parent job.
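The de-duplication and slicing done inside the parent job can be tried out in isolation. This is a minimal sketch under stated assumptions: each_new_user_batch is a hypothetical helper, plain arrays stand in for the find_in_batches results, and the pushed array stands in for what push_bulk would receive as 'args'. It uses each_slice in place of the manual offset loop, and the slice size is independent of the database batch size.

```ruby
require 'set'

# Hypothetical helper: yields slices of user ids that have not been seen before.
def each_new_user_batch(user_ids, seen, slice_size)
  fresh = user_ids.select { |id| seen.add?(id) } # Set#add? returns nil for duplicates
  fresh.each_slice(slice_size) { |slice| yield slice }
end

seen = Set.new
pushed = []

# Two small arrays stand in for two database batches.
[[1, 2, 2, 3], [3, 4, 5]].each do |db_batch|
  each_new_user_batch(db_batch, seen, 2) do |slice|
    pushed << slice.map { |user_id| [user_id, 7] } # one entry per push_bulk call
  end
end
```

Note that this only removes duplicates within a single run of the parent job. If the process is restarted and the job starts over, the set is empty again, which is consistent with the duplicate child jobs described above.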

How can I programmatically cancel a Dataflow job that has run for too long?

I'm using Apache Beam on Dataflow through Python API to read data from Bigquery, process it, and dump it into Datastore sink.
Unfortunately, quite often the job just hangs indefinitely and I have to manually stop it. While the data gets written into Datastore and Redis, from the Dataflow graph I've noticed that it's only a couple of entries that get stuck and leave the job hanging.
As a result, when a job with fifteen 16-core machines is left running for 9 hours (normally, the job runs for 30 minutes), it leads to huge costs.
Maybe there is a way to set a timer that would stop a Dataflow job if it exceeds a time limit?
It would be great if you could create a customer support ticket so that we could try to debug this with you.
Maybe there is a way to set a timer that would stop a Dataflow job if
it exceeds a time limit?
Unfortunately the answer is no, Dataflow does not have an automatic way to cancel a job after a certain time. However, it is possible to do this using the APIs. It is possible to wait_until_finish() with a timeout then cancel() the pipeline.
You would do this like so:
import apache_beam as beam
p = beam.Pipeline(options=pipeline_options)
p | ...  # Define your pipeline code
pipeline_result = p.run()  # submits the job and returns without waiting for it to finish
pipeline_result.wait_until_finish(duration=TIME_DURATION_IN_MS)  # blocks for at most this many milliseconds
pipeline_result.cancel()  # If the pipeline has not finished, you can cancel it
To sum up, with the help of @ankitk's answer, this works for me (python 2.7, sdk 2.14):
pipe = beam.Pipeline(options=pipeline_options)
... # main pipeline code
run = pipe.run() # submits the job and returns without waiting for it to finish
run.wait_until_finish(duration=3600000) # waits for at most this long (ms)
run.cancel() # cancels if can be cancelled
Thus, if the job finishes successfully within the duration passed to wait_until_finish(), cancel() will just print an "already closed" warning; otherwise it will cancel the running job.
P.S. If you print the state of the job:
state = run.wait_until_finish(duration=3600000)
logging.info(state)
it will be RUNNING for a job that did not finish within the wait_until_finish() duration, and DONE for a finished job.
Note: this technique will not work when running Beam from within a Flex Template Job...
The run.cancel() method doesn't work if you are writing a template, and I haven't seen any successful workaround for it...

Retry Sidekiq worker from within worker

In my app I am trying to perform two worker tasks sequentially.
First, a PDF is created with Wicked PDF; then, once the PDF is created, an email is sent to two different recipients with the PDF attached.
This is what is called in the controller:
PdfWorker.perform_async(@d.id)
MailingWorker.perform_in(1.minutes, @d.id, @d.class.name.to_s)
First worker creates the PDF and second worker sends email.
Here is second worker :
class MailingWorker
  include Sidekiq::Worker
  sidekiq_options retry: false
  def perform(d_id, model)
    @d = eval(model).find(d_id)
    @model = model
    if @d.pdf.present?
      ProfessionnelMailer.notification_d(@d).deliver
      ClientMailer.notification_d(@d).deliver
    else
      MailingWorker.perform_in(1.minutes, @d.id, @model.to_s)
    end
  end
end
The if statement checks whether the PDF has been created. If so, two mails are sent; otherwise, the same worker is called again one minute later, to give the Heroku server extra time to process the PDF creation in case it takes longer or the queue is long.
However, if the PDF has definitely failed to be processed, the above ends up in an infinite loop.
Is there a way to fix this?
One option I see is calling the second worker inside the PDF creation worker, though I don't really want to nest workers too deep. Keeping them separate makes my controller clearer, since I can see the sequence of actions. But any advice is welcome.
Another option is to use sidekiq_options retry: 5 and request a retry that would count toward the total of 5 retries, instead of re-enqueuing the worker with else MailingWorker.perform_in(1.minutes, @d.id, @model.to_s), but I don't know how to do this. As per this thread https://github.com/mperham/sidekiq/issues/769 it would be to raise an exception, but I am not sure how to do this ... (also I am not sure how long the retry will wait before being processed with the exception method; with the solution above I can control the time frame..)
If you do not want to have nested workers, then in MailingWorker instead of enqueuing it again, raise an exception if the PDF is not present.
Also, configure the worker's retry option, so that Sidekiq will push the job to the retry queue and run it again after some time. According to the documentation,
Sidekiq will retry failures with an exponential backoff using the
formula (retry_count ** 4) + 15 + (rand(30) * (retry_count + 1)) (i.e.
15, 16, 31, 96, 271, ... seconds + a random amount of time). It will
perform 25 retries over approximately 21 days.
The worker code would look more like this:
class MailingWorker
  include Sidekiq::Worker
  sidekiq_options retry: 5
  def perform(d_id, model)
    @d = eval(model).find(d_id)
    @model = model
    if @d.pdf.present?
      ProfessionnelMailer.notification_d(@d).deliver
      ClientMailer.notification_d(@d).deliver
    else
      raise "PDF not present"
    end
  end
end
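To get a feel for how long the retries take with the backoff formula quoted from the documentation, here is a small sketch. retry_delay is a hypothetical helper, not part of Sidekiq, and the jitter is passed in as an argument so that the deterministic part can be inspected on its own.

```ruby
# Hypothetical helper mirroring the quoted formula:
# (retry_count ** 4) + 15 + (rand(30) * (retry_count + 1))
def retry_delay(retry_count, jitter: rand(30))
  (retry_count**4) + 15 + (jitter * (retry_count + 1))
end

# Deterministic part of the delay (in seconds) for the first five retries.
base_delays = (0..4).map { |n| retry_delay(n, jitter: 0) }
# => [15, 16, 31, 96, 271]

# Minimum total wait before a job with retry: 5 runs out of retries.
total_base_wait = base_delays.sum
# => 429
```

So with retry: 5 the job would be retried over roughly seven minutes plus the random component, which gives a rough answer to the timing concern raised in the question.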
I believe the "correct" and most asynchronous way to do this is to have two queues and two workers:
Queue 1: CreatePdfWorker
Queue 2: SendPdfWorker
When the CreatePdfWorker has generated the PDF, it then enqueues the SendPdfWorker with the newly generated PDF and recipients.
This way, each worker can work independently and pluck from the queue asynchronously, and you're not struggling against the design choices of Sidekiq.

Autoscaling Resque workers on Heroku in real time

I would like to scale my dynos up and down automatically depending on the size of the pending list.
I heard about HireFire, but it only scales once every minute, and I need it to be (almost) real time.
I would like to scale my dynos so that the pending list is almost always empty.
I was thinking about doing it myself (with a scheduler (~15s delay) and the Heroku API), because I'm not sure there is anything out there. If not, do you know any monitoring tools which could send an email alert if the queue length exceeds a fixed size (similar to Apdex on New Relic)?
A potential custom code solution is included below. There are also two New Relic plugins that do Resque monitoring. I'm not sure if either does email alerts based on exceeding a certain queue size. Using Resque hooks you could output log messages that could trigger email alerts (or Slack, HipChat, PagerDuty, etc.) via a service like Papertrail or Loggly. This might look something like:
def after_enqueue_pending_check(*args)
  job_count = Resque.info[:pending].to_i
  if job_count > PENDING_THRESHOLD
    Rails.logger.warn('pending queue threshold exceeded')
  end
end
Instead of logging you could send an email but without some sort of rate limiting on the emails you could easily get flooded if the pending queue grows rapidly.
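The rate limiting mentioned above could be as simple as remembering when the last alert went out. The following is a minimal sketch, not a production implementation: AlertThrottle is a hypothetical class, the clock is injectable purely for testing, and the state lives in a single process, so with several workers the timestamp would need to be shared (for example in Redis).

```ruby
# Hypothetical in-process throttle: lets at most one alert through per interval.
class AlertThrottle
  def initialize(min_interval_seconds,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @min_interval = min_interval_seconds
    @clock = clock
    @last_sent_at = nil
  end

  # Runs the block (e.g. sending the email) only if enough time has passed
  # since the last alert. Returns true if the block ran, false otherwise.
  def attempt
    now = @clock.call
    return false if @last_sent_at && (now - @last_sent_at) < @min_interval

    @last_sent_at = now
    yield
    true
  end
end
```

Inside the hook, the email call would then be wrapped in something like throttle.attempt { ... }, where throttle is a long-lived AlertThrottle instance.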
I don't think there is a Heroku add-on or other service that can do the scaling in real time. There is a gem that will do this using the deprecated Heroku API. You can do this using Resque hooks and the Heroku platform-api. This untested example uses the Heroku platform-api to scale the 'worker' dynos up and down. Just as an example, I included 1 worker for every three pending jobs. The downscale will only ever reset the workers to 1 if there are no pending jobs and no working jobs. This is not ideal and should be updated to fit your needs. See here for information about ensuring that you don't lose jobs when scaling down the workers: http://quickleft.com/blog/heroku-s-cedar-stack-will-kill-your-resque-workers
require 'platform-api'
def after_enqueue_upscale(*args)
  heroku = PlatformAPI.connect_oauth('OAUTH_TOKEN')
  worker_count = heroku.formation.info('app-name', 'worker')["quantity"]
  job_count = Resque.info[:pending].to_i
  # one worker for every 3 jobs (minimum of 1)
  new_worker_count = ((job_count / 3) + 1).to_i
  return if new_worker_count <= worker_count
  heroku.formation.update('app-name', 'worker', {"quantity" => new_worker_count})
end
# Resque passes the job arguments to after_perform hooks, so accept them here too
def after_perform_downscale(*args)
  heroku = PlatformAPI.connect_oauth('OAUTH_TOKEN')
  if Resque.info[:pending].to_i == 0 && Resque.info[:working].to_i == 0
    heroku.formation.update('app-name', 'worker', {"quantity" => 1})
  end
end
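The sizing rule used in the upscale hook can be pulled out into a small pure function, which makes it easy to adjust and test without touching the Heroku API. This is only a sketch: desired_worker_count is a hypothetical helper, and the max_workers cap is an assumption added here to avoid scaling without bound; it is not part of the example above.

```ruby
# Hypothetical helper: one worker for every `jobs_per_worker` pending jobs,
# with a minimum of 1 and an assumed upper cap.
def desired_worker_count(pending_jobs, jobs_per_worker: 3, max_workers: 10)
  wanted = (pending_jobs / jobs_per_worker) + 1 # integer division, same rule as the hook
  [wanted, max_workers].min
end

desired_worker_count(0)   # => 1
desired_worker_count(7)   # => 3
desired_worker_count(100) # => 10 (capped)
```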
I'm having a similar issue and have run into "HireFire":
https://www.hirefire.io/.
For Ruby, use:
https://github.com/hirefire/hirefire-resource
In theory it works much like AdeptScale (https://www.adeptscale.com/). However, HireFire can also scale workers and does not limit itself to just dynos. Hope this helps!
