The problem is that the Sidekiq worker that processes the object runs before the object exists in the database. The job is enqueued in an after_commit callback in the object's model. This is possible because I have two replicated databases, one for reads and one for writes, so the time from enqueue to failure is shorter than the time it takes for the data to be replicated to the read database.
What is the best approach to a solution? I was thinking of adding some wait time between enqueue and processing to ensure that the data is in the slave database. Is that possible in the Sidekiq configuration, or something like that?
You could do a few things:
Implement a check in the worker to make sure the object exists; otherwise, re-enqueue the job. You'll probably want to think this through so you don't accidentally re-enqueue bad jobs forever, but this seems like a good sanity check for you (see the sketch after this list).
Introduce a delay. In particular, Sidekiq can wait to pull jobs from the queue until a specified time.
"Sidekiq allows you to schedule the time when a job will be executed. You use perform_in(interval, *args) or perform_at(timestamp, *args) rather than the standard perform_async(*args):
MyWorker.perform_in(3.hours, 'mike', 1)
MyWorker.perform_at(3.hours.from_now, 'mike', 1)"
See https://github.com/mperham/sidekiq/wiki/Scheduled-Jobs for more details on this option.
Personally I would go for #1 but #2 might be a quicker fix if you're desperate.
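For #1, a minimal sketch of the check-and-re-enqueue approach could look like this. The names ProcessObjectWorker and MyModel, the 5-second delay, and the retry cap are illustrative assumptions, not part of the original question:

class ProcessObjectWorker
  include Sidekiq::Worker

  MAX_ATTEMPTS = 5 # hypothetical cap so bad jobs aren't re-enqueued forever

  def perform(object_id, attempt = 0)
    object = MyModel.find_by(id: object_id)

    if object.nil?
      raise "MyModel #{object_id} still missing after #{attempt} attempts" if attempt >= MAX_ATTEMPTS

      # The replica hasn't caught up yet; check again in a few seconds.
      self.class.perform_in(5.seconds, object_id, attempt + 1)
      return
    end

    # ... actual processing of object ...
  end
end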
I am currently developing a Rails application which takes a long list of links as input, scrapes them using a background worker (Resque), then serves the results to the user. However, in some cases there are numerous URLs, and I would like to be able to make multiple requests in parallel so that the whole run takes much less time, rather than waiting for one request to a page to complete, scraping it, and moving on to the next one.
Is there a way to do this in Heroku/Rails? Where might I find more information?
I've come across resque-pool, but I'm not sure whether it would solve this issue and/or how to implement it. I've also read about using different types of servers to run Rails in order to make concurrency possible, but don't know how to modify my current situation to take advantage of this.
Any help would be greatly appreciated.
Don't use Resque. Use Sidekiq instead.
Resque runs in a single-threaded process, meaning the workers run synchronously, while Sidekiq runs in a multithreaded process, meaning the workers run asynchronously/simultaneously in different threads.
Make sure you assign one URL to scrape per worker. There's no benefit if one worker scrapes multiple URLs sequentially.
With Sidekiq, you can pass the link to a worker, e.g.
LINKS = [...]

LINKS.each do |link|
  ScrapeWorker.perform_async(link)
end
perform_async doesn't actually execute the job right away. Instead, the link is just put in a queue in Redis along with the worker class and so on, and later (it could be milliseconds later) workers are assigned to execute each job in the queue in their own threads by running the perform instance method in ScrapeWorker. Sidekiq will make sure to retry if an exception occurs during the execution of a worker.
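For completeness, a minimal sketch of what that worker class might look like; PageScraper is a hypothetical stand-in for your actual scraping code:

class ScrapeWorker
  include Sidekiq::Worker

  def perform(link)
    # Runs in its own thread; fetches and parses one page per job.
    PageScraper.new(link).scrape
  end
end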
PS: You don't have to pass a link to the worker. You can store the links in a table and then pass the ids of the records to the workers.
More info about Sidekiq
Adding these two lines to your code will also let you wait until the last job is complete before proceeding:
This line ensures that your program waits until at least one job has been enqueued before checking whether all jobs are completed, so that an initially empty queue isn't misinterpreted as the completion of all jobs:
sleep(0.2) until Sidekiq::Queue.new.size > 0 || Sidekiq::Workers.new.size > 0
This line ensures your program waits until all jobs are done:
sleep(0.5) until Sidekiq::Workers.new.size == 0 && Sidekiq::Queue.new.size == 0
How can I make Sidekiq execute a job only after the previously executed job is done? For example:
This morning I triggered the first job:
GoodWorker.perform_async(params) #=> JID-eetc
While it was still in progress, I dynamically triggered another job on the same worker:
GoodWorker.perform_async(params) #=> JID-eetc2
and so on.
What's happening now is that Sidekiq processes the jobs concurrently. Is there a way to perform the jobs one at a time?
Short answer: no.
Long answer: You can use a mutex to guarantee that only one instance of a worker is executing at a time. If you're running on a cluster, you'll need to use Redis or some other medium to maintain the mutex. Otherwise, you might try putting these jobs in their own queue, and firing up a separate instance of Sidekiq that only monitors that queue, with a concurrency of one.
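A rough sketch of the Redis-mutex idea; the lock key, timeouts, and do_work method are placeholders you'd adapt to your job:

class GoodWorker
  include Sidekiq::Worker

  LOCK_KEY = "good_worker:lock" # hypothetical Redis key for the mutex

  def perform(params)
    # NX sets the key only if it doesn't already exist; EX expires it
    # so a crashed worker can't hold the lock forever.
    acquired = Sidekiq.redis { |conn| conn.set(LOCK_KEY, jid, nx: true, ex: 300) }

    unless acquired
      # Another instance is running; check back shortly.
      self.class.perform_in(10.seconds, params)
      return
    end

    begin
      do_work(params) # placeholder for the actual job body
    ensure
      Sidekiq.redis { |conn| conn.del(LOCK_KEY) }
    end
  end
end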
Can you not set up Sidekiq to have only one thread? Then only one job will be executed at a time.
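For example, assuming these jobs live in a dedicated queue named serial (an illustrative name), you could start a separate Sidekiq process with a concurrency of one:

bundle exec sidekiq -c 1 -q serial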
I have a queue that happens to contain wrong parameters for a perform_async worker. I don't want to lose the jobs, but I'd like to edit the arguments so they will succeed next time or on a forced retry.
Is this possible?
Sidekiq stores its jobs in Redis, so maybe try using a GUI for Redis (like http://redisdesktop.com/): find the job you need to update, edit it, and save. This can be done in a loop to update multiple jobs.
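Alternatively, here is a sketch using Sidekiq's own API instead of a GUI: delete the jobs with bad arguments and re-enqueue corrected ones. The queue name, worker class, and argument values below are hypothetical:

require "sidekiq/api"

queue = Sidekiq::Queue.new("default")

queue.each do |job|
  next unless job.klass == "MyWorker" && job.args == ["bad_value"]

  job.delete
  MyWorker.perform_async("good_value") # re-enqueue with fixed arguments
end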
So I have a Resque worker that calls an API; the issue is that the API has a rate limit of 2 requests per second.
Is there a way to add a delay between each job processed in a specific queue?
P.S. the queue could have thousands of pending jobs.
Why not sleep for a given amount of time at the end of the process? Well, perhaps you want your Resque worker to be doing something useful instead. In CPU time, half a second is a lot of time - you could have done something useful there, like processing a job from another queue that's not rate limited.
I have this same problem myself, so I'm motivated to find a solution. It seems like there are two easy-ish ways to do it. The first idea is to use resque-scheduler and pre-compute the time to run each job at before inserting it. This seems error-prone to me. The second is to use a gem like https://github.com/flyerhzm/resque-restriction (disclaimer: I just found it through some googling and haven't used it yet) and rate-limit as you pull jobs off the queue. That seems like a robust solution in theory. Note that if you can't execute a job yet, it never comes off the queue, so you'll pull something else instead - a much more efficient use of your workers.
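A rough sketch of that second approach, based on resque-restriction's documented restrict option; the job class, queue name, and API client below are hypothetical, and the 2-requests/second limit is expressed as a per-60-seconds budget:

require "resque-restriction"

class ApiCallJob
  extend Resque::Plugins::Restriction

  # At most 120 jobs per 60 seconds, i.e. roughly 2 per second.
  restrict per_60: 120

  @queue = :api_calls

  def self.perform(params)
    SomeApi.call(params) # hypothetical API client
  end
end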
Per my comment, I'd recommend just performing a sleep for a given number of seconds at the end of each Resque process method.
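For instance, a minimal version of that; the job class and API client are hypothetical, and with a single worker, sleeping half a second after each call caps throughput at about 2 requests per second:

class ApiJob
  @queue = :api_calls

  def self.perform(params)
    SomeApi.call(params) # hypothetical API client
    sleep 0.5            # stay under the 2 requests/second limit
  end
end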
I have some Rails code that calls model1.func1(). A controller action calls this, where multiple people can be hitting it, as does a scheduled rake task. I want to make sure that model1.func1() cannot be called in parallel; if another thread needs to call it at the same time, it should wait for model1.func1() to finish. I guess I want to queue these calls. I was going to use Sidekiq for this, but with only one worker. I read on a forum that:
Sidekiq is not appropriate for the serial job and I don't want to make
it appropriate. Different tools are useful for different reasons,
jack of all trades master of none, etc.
What do you guys recommend instead?
I would consider beanstalkd with one worker process.
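For example, a minimal sketch using the beaneater gem as the beanstalkd client; the tube name, the JSON payload, and the Model1 lookup are illustrative assumptions:

require "beaneater"
require "json"

beanstalk = Beaneater.new("localhost:11300")
tube = beanstalk.tubes["model1-func1"]

# Producer side (the controller action or the rake task):
tube.put({ model1_id: 42 }.to_json)

# The single worker process: jobs are reserved one at a time,
# so calls to func1 never overlap.
loop do
  job = tube.reserve
  payload = JSON.parse(job.body)
  Model1.find(payload["model1_id"]).func1
  job.delete
end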