Unexpected sidekiq jobs get executed - ruby-on-rails

I'm using sidekiq cron to run some jobs. I have a parent job which only runs once, and that parent job starts 7 million child jobs. However, in my sidekiq dashboard, it says over 42 million jobs enqueued. I checked those enqueued jobs, they are my child jobs. I'm trying to figure out why so many more jobs than expected are enqueued. I checked the log in sidekiq, one thing I noticed is, "Cron Jobs - add job with name: new_topic_post_job" shows up many times in the log. new_topic_post is the name of the parent job in schedule.yml. Following lines also show up many times
2019-04-18T17:01:22.558Z 12605 TID-osb3infd0 WARN: Processing recovered job from queue queue:low (queue:low_i-03933b94d1503fec0.nodemodo.com_4): "{\"retry\":false,\"queue\":\"low\",\"backtrace\":true,\"class\":\"WeeklyNewTopicPostCron\",\"args\":[],\"jid\":\"f37382211fcbd4b335ce6c85\",\"created_at\":1555606809.2025042,\"locale\":\"en\",\"enqueued_at\":1555606809.202564}"
2019-04-18T17:01:22.559Z 12605 TID-osb2wh8to WeeklyNewTopicPostCron JID-f37382211fcbd4b335ce6c85 INFO: start
WeeklyNewTopicPostCron is the name of the parent job class. Wondering does this mean my parent job runs multiple times instead of only 1? If so, what's the cause? I'm pretty sure the time in the cron job is right, I set it to "0 17 * * 4" which means it only runs once a week. Also I set retry to false for parent job and 3 for child jobs. So even all child jobs fail, we should still only have 21 million jobs. Following is my cron job setting in schedule.yml
new_topic_post_job:
cron: "0 17 * * 4"
class: "WeeklyNewTopicPostCron"
queue: low
and this is WeeklyNewTopicPostCron:
class WeeklyNewTopicPostCron
include Sidekiq::Worker
sidekiq_options queue: :low, retry: false, backtrace: true
def perform
processed_user_ids = Set.new
TopicFollower.select("id, user_id").find_in_batches(batch_size: 1000000) do |topic_followers|
new_user_ids = []
topic_followers.map(&:user_id).each { |user_id| new_user_ids << user_id if processed_user_ids.add?(user_id) }
batch_size = 1000
offset = 0
loop do
batched_user_ids_for_redis = new_user_ids[offset, batch_size]
Sidekiq::Client.push_bulk('class' => NewTopicPostSender,
'args' => batched_user_ids_for_redis.map { |user_id| [user_id, 7] }) if batched_user_ids_for_redis.present?
break if batched_user_ids_for_redis.size < batch_size
offset += batch_size
end
end
end
end

Most probably your parent sidekiq job is causing the sidekiq process to crash, which then results in a worker restart. On restart sidekiq probably tries to recover the interrupted job and starts processing it again (from the beginning). Some details here:
https://github.com/mperham/sidekiq/wiki/Reliability#recovering-jobs
This probably happens multiple times before the parent job eventually finishes, and hence the extremely high number of child jobs are created. You can easily verify this by checking the process id of the sidekiq process while this job is being run and it most probably will keep changing after a while:
ps aux | grep sidekiq
It could be that you have some monit configuration to restart sidekiq in case memory usage goes too high.Or it might be that this query is causing the process to crash:
TopicFollower.select("id, user_id").find_in_batches(batch_size: 1000000)
Try reducing the batch_size. 1million feels like too high a number. But my best guess is that the sidekiq process dies while processing the long running parent process.

Related

Sidekiq - Enqueuing a job to be performed 0.seconds from now

I'm using sidekiq for background job and I enqueue a job like this:
SampleJob.set(wait: waiting_time.to_i.seconds).perform_later(***) ・・・ ①
When waiting_time is nil,
it becomes
SampleJob.set(wait: 0.seconds).perform_later(***)
Of course it works well, but I'm worried about performance because worker enqueued with wait argument is derived by poller,
so I wonder if I should remove set(wait: waiting_time.to_i.seconds) when
waiting_time is nil.
i.e.)
if waiting_time.present?
SampleJob.set(wait: waiting_time.to_i.seconds).perform_later(***)
else
SampleJob.perform_later(***)
end ・・・ ②
Is there any differences in performance or speed between ① and ②?
Thank you in advance.
There is no difference. It looks like this is already considered in the Sidekiq library.
https://github.com/mperham/sidekiq/blob/main/lib/sidekiq/worker.rb#L261
# Optimization to enqueue something now that is scheduled to go out now or in the past
#opts["at"] = ts if ts > now

Unable to search/delete a continuously retrying sidekiq job

One of my Sidekiq worker classes had validations for data size missing, hence one of the enqueued job is pulling in huge data from the database and failing abruptly with following message and immediately enqueuing another job with the same job_id.
Error performing MyWorkerClass (Job ID:
my_job_id) from Sidekiq(my_queue_name) in
962208.79ms: Sidekiq::Shutdown (Sidekiq::Shutdown):
As soon as I get this message, a new job is enqueued.
Performing MyWorkerClass (Job ID: my_job_id) from
Sidekiq(my_queue_name) with arguments: 1, {"param1"=>"param1_value",
"param2"=>"param2_value", "param3"=>"param3_value"}
I am figuring out a way to fix this problem but for now I want to stop this particular job from running continuously. I couldn't find this job on my sidekiq UI dashboard.
Also I tried to find and delete this job using following methods but couldn't find the job. All the variables printed below are Nil.
a = Sidekiq::Queue.new('my_queue_name').find_job("my_job_id")
b = Sidekiq::ScheduledSet.new.find_job("my_job_id")
c = Sidekiq::RetrySet.new.find_job("my_job_id")
d = Sidekiq::JobSet.new('my_queue_name').find_job("my_job_id")
puts a.inspect
puts b.inspect
puts c.inspect
puts d.inspect
I want help with the following:
How to avoid this abrupt shutdown for long running jobs in the future
Find the long running job and kill it.
Thank you in Advance !

Retry Sidekiq worker from within worker

In my app I am trying to perform two worker tasks sequentially.
First, a PDF is being created with Wicked pdf and then, once the PDF is created, to send an email to two different recipients with the PDF attached.
This is what is called in the controller :
PdfWorker.perform_async(#d.id)
MailingWorker.perform_in(1.minutes, #d.id,#d.class.name.to_s)
First worker creates the PDF and second worker sends email.
Here is second worker :
class MailingWorker
include Sidekiq::Worker
sidekiq_options retry: false
def perform(d_id,model)
#d = eval(model).find(d_id)
#model = model
if #d.pdf.present?
ProfessionnelMailer.notification_d(#d).deliver
ClientMailer.notification_d(#d).deliver
else
MailingWorker.perform_in(1.minutes, #d.id, #model.to_s)
end
end
end
The if statement checks if the PDF has been created. If true two mails are sent, otherwise, the same worker is called again one minute later, just to let the Heroku server extra time to process the PDF creation in case it takes more time or a long queue.
Though if the PDF has definitely failed to be processed, the above ends up in an infinite loop.
Is there a way to fix this ?
One option I see is calling the second worker inside the PDF creation worker though I don't really want to nest workers too deep. It makes my controller more clear to have them separate, I can see the sequence of actions. But any advice welcome.
Another option is to use sidekiq_options retry: 5 and request a retry of the controller that could be counted towards the full total of 5 retries, instead of retrying the worker with else MailingWorker.perform_in(1.minutes, #d.id, #model.to_s) but I don't know how to do this. As per this thread https://github.com/mperham/sidekiq/issues/769 it would be to raise an exception but I am not sure how to do this ... (also I am not sure how long the retry will wait before being processed with the exception method, with the solution above I can control the time frame..)
If you do not want to have nested workers, then in MailingWorker instead of enqueuing it again, raise an exception if the PDF is not present.
Also, configure the worker retry option, so that sidekiq will push it to the retry queue and run it again in sometime. According to the documentation,
Sidekiq will retry failures with an exponential backoff using the
formula (retry_count ** 4) + 15 + (rand(30) * (retry_count + 1)) (i.e.
15, 16, 31, 96, 271, ... seconds + a random amount of time). It will
perform 25 retries over approximately 21 days.
Worker code will be more like,
class MailingWorker
include Sidekiq::Worker
sidekiq_options retry: 5
def perform(d_id,model)
#d = eval(model).find(d_id)
#model = model
if #d.pdf.present?
ProfessionnelMailer.notification_d(#d).deliver
ClientMailer.notification_d(#d).deliver
else
raise "PDF not present"
end
end
end
I believe the "correct" and most asynchroneous way to do this is to have two queues, and two workers:
Queue 1: CreatePdfWorker
Queue 2: SendPdfWorker
When the CreatePdfWorker has generated the PDF, it then enqueues the SendPdfWorker with the newly generated PDF and recipients.
This way, each worker can work independently and pluck from the queue asynchroneously, and you're not struggling against the design choices of Sidekiq.

How to get sidekiq retry_count from inside a job

I am trying to send an alert every time retry_count of a sidekiq job reaches 5(to warn an engineer to check why the worker is failing) and then continued being retried as usual.
Is there a way to get the retry count for a particular job from inside the job?
I could just use:
sidekiq_retry_in do |count|
(warn engineer here)
10 * (count + 1) # (i.e. 10, 20, 30, 40)
end
and send a message from in there, but I think its a bit of a hack.
Any ideas? googling didn't surface any results.
There is no way to get the retry count from within the job, by design.

Daemon eats too much CPU when being idle

I am using blue-daemons fork of daemons gem (since the second one looks totally abandoned) along with daemons-rails gem, which wraps daemons for rails.
The problem is that my daemon eats too much CPU when it's idle (10-20 times higher then it's actually performing the job).
By being idle, I mean that I have special flag - Status.active?. If Status.active? is true, then I perform the job, if it's false, then I just sleep 10 secs and iterate next step in the while($running) do block and check status again and again.
I don't want to hard stop job because there is really sensitive data and I don't want the process to break it. Is there any good way to handle that high CPU usaget? I tried Sidekiq, but it looks like it's primary aim is to run jobs on demand or on schedule, but I need the daemon to run on non-stop basis.
$running = true
Signal.trap("TERM") do
$running = false
end
while($running) do
while Status.active? do
..... DO LOTS OF WORK .....
else
sleep 10
end
end

Resources