One of my Sidekiq worker classes was missing validations for data size, so one of the enqueued jobs pulls a huge amount of data from the database, fails abruptly with the following message, and immediately enqueues another job with the same job_id.
Error performing MyWorkerClass (Job ID:
my_job_id) from Sidekiq(my_queue_name) in
962208.79ms: Sidekiq::Shutdown (Sidekiq::Shutdown):
As soon as I get this message, a new job is enqueued.
Performing MyWorkerClass (Job ID: my_job_id) from
Sidekiq(my_queue_name) with arguments: 1, {"param1"=>"param1_value",
"param2"=>"param2_value", "param3"=>"param3_value"}
I am working on a fix for this problem, but for now I want to stop this particular job from running continuously. I couldn't find this job on my Sidekiq UI dashboard.
I also tried to find and delete this job using the following methods, but couldn't find it. All the variables printed below are nil.
a = Sidekiq::Queue.new('my_queue_name').find_job("my_job_id")
b = Sidekiq::ScheduledSet.new.find_job("my_job_id")
c = Sidekiq::RetrySet.new.find_job("my_job_id")
d = Sidekiq::JobSet.new('my_queue_name').find_job("my_job_id")
puts a.inspect
puts b.inspect
puts c.inspect
puts d.inspect
I want help with the following:
How to avoid this abrupt shutdown for long-running jobs in the future.
Find the long-running job and kill it.
Thank you in advance!
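For reference, a job that is actively being processed does not live in any queue or set, which is why the find_job calls above all return nil; busy jobs are only visible through the Workers API. The following is a minimal sketch using the documented Sidekiq::Workers and Sidekiq::ProcessSet APIs (field access shown for Sidekiq 5/6, where the work payload may be a JSON string; newer versions expose it differently), not a drop-in solution:
require 'json'
require 'sidekiq/api'

Sidekiq::Workers.new.each do |process_id, thread_id, work|
  payload = work["payload"]
  payload = JSON.parse(payload) if payload.is_a?(String)
  next unless payload["jid"] == "my_job_id"
  puts "Running on #{process_id}, queue #{work['queue']}, since #{work['run_at']}"
  # There is no API to kill a single running job; you can only quiet or stop
  # the whole Sidekiq process that is executing it:
  Sidekiq::ProcessSet.new.each { |p| p.stop! if p["identity"] == process_id }
end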
I'm using Sidekiq for background jobs and I enqueue a job like this:
SampleJob.set(wait: waiting_time.to_i.seconds).perform_later(***) ・・・ ①
When waiting_time is nil,
it becomes
SampleJob.set(wait: 0.seconds).perform_later(***)
Of course it works well, but I'm worried about performance, because a job enqueued with a wait argument has to be picked up by the scheduled-job poller, so I wonder if I should remove set(wait: waiting_time.to_i.seconds) when waiting_time is nil.
i.e.)
if waiting_time.present?
  SampleJob.set(wait: waiting_time.to_i.seconds).perform_later(***)
else
  SampleJob.perform_later(***)
end ・・・ ②
Is there any differences in performance or speed between ① and ②?
Thank you in advance.
There is no difference. It looks like this is already considered in the Sidekiq library.
https://github.com/mperham/sidekiq/blob/main/lib/sidekiq/worker.rb#L261
# Optimization to enqueue something now that is scheduled to go out now or in the past
#opts["at"] = ts if ts > now
I have a Jenkins pipeline script that executes child jobs in parallel.
Say I have 5 pieces of data (a, b, c, d, e) that have to be run through 3 jobs (J1, J2, J3).
My pipeline script has the following format:
for (int i = 0; i < size; i++) {
    def index = i
    branches["branch${i}"] = {
        build job: 'SampleJob', parameters: [
            string(name: 'param1', value: '${data}'),
            string(name: 'dummy', value: "${index}")
        ]
    }
}
parallel branches
My problem is: say the execution is happening on Job 1 with data 1, 2, 3, 4, 5, and the execution of data 3 fails on Job 1. Then the execution of data 3 should stop right there and should not happen in the subsequent parallel executions on Jobs 2 and 3.
Is there any way I can read the execution status of the parallel jobs from the pipeline script, so that I can block the execution of data 3 in Jobs 2 and 3?
I have been stuck here for a long time. Hoping for a solution from the community. Thanks a lot in advance.
In summary, it sounds like you want to
run multiple jobs in parallel against different pieces of data. I will call the set of related jobs the "batch".
avoid starting a queued job if any of the jobs in the batch have failed
automatically abort a running job if any of the jobs in the batch have failed
The jobs need some way to communicate their failure to the others. Use a shared storage location to store the "failure flag". If the file exists, then one or more of the jobs have failed.
For example, a shared NFS path: /shared/jenkins/jobstate/<BATCH_ID>/failed
At the start of the job, check whether this path exists, and exit if it does. The file doesn't necessarily need to contain any data - its presence is enough.
Since you need running jobs to abort early if the failure flag exists, you will need to poll that location periodically. For example, after each unit of work. Again, if the file exists then exit early.
If you don't use NFS, that's ok. You could also use an object storage bucket. The important thing is that the state is accessible to all the relevant build jobs.
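A rough scripted-pipeline sketch of this idea for the child job (SampleJob). The BATCH_ID parameter and the workUnits loop are assumptions for illustration; the shared path is the NFS location from above:
def flag = "/shared/jenkins/jobstate/${params.BATCH_ID}/failed"

node {
    // Don't even start if a sibling job in the batch has already failed.
    if (fileExists(flag)) {
        error "Batch ${params.BATCH_ID} already failed elsewhere, skipping this run."
    }
    try {
        for (unit in workUnits) {
            // Poll the failure flag between units of work and bail out early.
            if (fileExists(flag)) {
                error "A sibling job in batch ${params.BATCH_ID} failed, aborting early."
            }
            // ... do the real work for this unit ...
        }
    } catch (err) {
        // Raise the flag so queued and running siblings stop too.
        sh "mkdir -p '/shared/jenkins/jobstate/${params.BATCH_ID}' && touch '${flag}'"
        throw err
    }
}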
Name: spring-cloud-dataflow-server
Version: 2.5.0.BUILD-SNAPSHOT
I have created a very simple task. The first run always COMPLETES fine with NO ISSUES. If the task is run again, it FAILS with the following error.
A subsequent launch of the same task fails with the exception below, even though it is a fresh run after the previous execution completed fully. If a task has been run once, can't it be run again?
(log from Task Execution Details - Execution ID: 246)
Caused by: org.springframework.batch.core.repository.JobInstanceAlreadyCompleteException: A job instance already exists and is complete for parameters={-spring.cloud.data.flow.taskappname=composed-task-runner, -spring.cloud.task.executionid=246, -graph=threetasks-t1 && threetasks-t2 && threetasks-t3, -spring.datasource.username=root, -spring.cloud.data.flow.platformname=default, -dataflow-server-uri=http://10.104.227.49:9393, -management.metrics.export.prometheus.enabled=true, -management.metrics.export.prometheus.rsocket.host=prometheus-proxy, -spring.datasource.url=jdbc:mysql://10.110.89.91:3306/mysql, -spring.datasource.driverClassName=org.mariadb.jdbc.Driver, -spring.datasource.password=manager, -management.metrics.export.prometheus.rsocket.port=7001, -management.metrics.export.prometheus.rsocket.enabled=true, -spring.cloud.task.name=threetasks}. If you want to run this job again, change the parameters.
A Job instance in a Spring Batch application requires unique Job parameters, and this is by design.
In this case, since you are using a composed task, you can use the property --increment-instance-enabled=true as part of the composed task definition to handle it. This property makes sure each launch gets a Job instance with unique Job parameters.
You can check the list of properties supported by the Composed Task Runner here.
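For illustration only (the shell syntax and the way the property is passed can vary between SCDF versions; threetasks is the composed task name taken from the log above), the flag can be supplied when launching the composed task from the Data Flow shell, for example:
dataflow:> task launch threetasks --arguments "--increment-instance-enabled=true"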
I'm using sidekiq-cron to run some jobs. I have a parent job which only runs once, and that parent job starts 7 million child jobs. However, my Sidekiq dashboard says over 42 million jobs are enqueued. I checked those enqueued jobs; they are my child jobs. I'm trying to figure out why so many more jobs than expected were enqueued. One thing I noticed in the Sidekiq log is that "Cron Jobs - add job with name: new_topic_post_job" shows up many times. new_topic_post_job is the name of the parent job in schedule.yml. The following lines also show up many times:
2019-04-18T17:01:22.558Z 12605 TID-osb3infd0 WARN: Processing recovered job from queue queue:low (queue:low_i-03933b94d1503fec0.nodemodo.com_4): "{\"retry\":false,\"queue\":\"low\",\"backtrace\":true,\"class\":\"WeeklyNewTopicPostCron\",\"args\":[],\"jid\":\"f37382211fcbd4b335ce6c85\",\"created_at\":1555606809.2025042,\"locale\":\"en\",\"enqueued_at\":1555606809.202564}"
2019-04-18T17:01:22.559Z 12605 TID-osb2wh8to WeeklyNewTopicPostCron JID-f37382211fcbd4b335ce6c85 INFO: start
WeeklyNewTopicPostCron is the name of the parent job class. Does this mean my parent job runs multiple times instead of only once? If so, what's the cause? I'm pretty sure the cron schedule is right: I set it to "0 17 * * 4", which means it runs only once a week. I also set retry to false for the parent job and 3 for the child jobs, so even if all child jobs fail, we should still only have 21 million jobs. The following is my cron job setting in schedule.yml:
new_topic_post_job:
  cron: "0 17 * * 4"
  class: "WeeklyNewTopicPostCron"
  queue: low
and this is WeeklyNewTopicPostCron:
class WeeklyNewTopicPostCron
  include Sidekiq::Worker
  sidekiq_options queue: :low, retry: false, backtrace: true

  def perform
    processed_user_ids = Set.new
    TopicFollower.select("id, user_id").find_in_batches(batch_size: 1000000) do |topic_followers|
      new_user_ids = []
      topic_followers.map(&:user_id).each { |user_id| new_user_ids << user_id if processed_user_ids.add?(user_id) }
      batch_size = 1000
      offset = 0
      loop do
        batched_user_ids_for_redis = new_user_ids[offset, batch_size]
        Sidekiq::Client.push_bulk('class' => NewTopicPostSender,
                                  'args' => batched_user_ids_for_redis.map { |user_id| [user_id, 7] }) if batched_user_ids_for_redis.present?
        break if batched_user_ids_for_redis.size < batch_size
        offset += batch_size
      end
    end
  end
end
Most probably your parent sidekiq job is causing the sidekiq process to crash, which then results in a worker restart. On restart sidekiq probably tries to recover the interrupted job and starts processing it again (from the beginning). Some details here:
https://github.com/mperham/sidekiq/wiki/Reliability#recovering-jobs
This probably happens multiple times before the parent job eventually finishes, and hence the extremely high number of child jobs. You can easily verify this by checking the process id of the Sidekiq process while this job is running; it will most probably keep changing after a while:
ps aux | grep sidekiq
It could be that you have some monit configuration that restarts Sidekiq when memory usage goes too high. Or it might be that this query is causing the process to crash:
TopicFollower.select("id, user_id").find_in_batches(batch_size: 1000000)
Try reducing the batch_size; 1 million feels like too high a number. But my best guess is that the Sidekiq process dies while processing the long-running parent job.
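For illustration, here is a sketch of the worker from the question with smaller database batches and each_slice for the push_bulk calls, so no single iteration holds a million ids in memory (class and argument names are reused from the question; the batch sizes are arbitrary and would need tuning):
class WeeklyNewTopicPostCron
  include Sidekiq::Worker
  sidekiq_options queue: :low, retry: false, backtrace: true

  def perform
    processed_user_ids = Set.new

    # Pull followers in much smaller batches than 1,000,000.
    TopicFollower.select("id, user_id").find_in_batches(batch_size: 10_000) do |topic_followers|
      new_user_ids = topic_followers.map(&:user_id)
                                    .select { |user_id| processed_user_ids.add?(user_id) }

      # Enqueue the child jobs 1,000 at a time.
      new_user_ids.each_slice(1_000) do |slice|
        Sidekiq::Client.push_bulk(
          'class' => NewTopicPostSender,
          'args'  => slice.map { |user_id| [user_id, 7] }
        )
      end
    end
  end
end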
I'm using Apache Beam on Dataflow through the Python API to read data from BigQuery, process it, and dump it into a Datastore sink.
Unfortunately, quite often the job just hangs indefinitely and I have to manually stop it. While the data gets written into Datastore and Redis, from the Dataflow graph I've noticed that it's only a couple of entries that get stuck and leave the job hanging.
As a result, when a job with fifteen 16-core machines is left running for 9 hours (normally, the job runs for 30 minutes), it leads to huge costs.
Maybe there is a way to set a timer that would stop a Dataflow job if it exceeds a time limit?
It would be great if you could create a customer support ticket so that we could try to debug this with you.
Maybe there is a way to set a timer that would stop a Dataflow job if
it exceeds a time limit?
Unfortunately the answer is no, Dataflow does not have an automatic way to cancel a job after a certain time. However, it is possible to do this using the APIs: call wait_until_finish() with a timeout and then cancel() the pipeline.
You would do this like so:
p = beam.Pipeline(options=pipeline_options)
p | ... # Define your pipeline code
pipeline_result = p.run() # doesn't do anything
pipeline_result.wait_until_finish(duration=TIME_DURATION_IN_MS)
pipeline_result.cancel() # If the pipeline has not finished, you can cancel it
To sum up, with the help of @ankitk's answer, this works for me (Python 2.7, SDK 2.14):
pipe = beam.Pipeline(options=pipeline_options)
... # main pipeline code
run = pipe.run() # doesn't do anything
run.wait_until_finish(duration=3600000) # (ms) actually starts a job
run.cancel() # cancels if can be cancelled
Thus, if a job finishes successfully within the duration passed to wait_until_finish(), then cancel() will just print a warning ("already closed"); otherwise it will close the running job.
P.S. if you try to print the state of a job
state = run.wait_until_finish(duration=3600000)
logging.info(state)
it will be RUNNING for a job that didn't finish within wait_until_finish(), and DONE for a finished job.
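Building on that, a small sketch (assuming the standard apache_beam PipelineState constants; the pipeline body is elided just as in the snippet above) that only cancels when the job has not finished within the timeout:
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.runner import PipelineState

pipe = beam.Pipeline(options=PipelineOptions())
# ... main pipeline code, as in the snippet above ...

run = pipe.run()
state = run.wait_until_finish(duration=3600000)  # one hour, in ms
logging.info(state)

# Only cancel when the job has not reached DONE within the timeout.
if state != PipelineState.DONE:
    run.cancel()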
Note: this technique will not work when running Beam from within a Flex Template Job...
The run.cancel() method doesn't work if you are writing a template, and I haven't seen any successful workaround for it...