Sidekiq handling re-queue when processing large data - ruby-on-rails

See the updated question below.
Original question:
In my current Rails project, I need to parse large xml/csv data file and save it into mongodb.
Right now I use this steps:
Receive uploaded file from user, store the data into mongodb
Use sidekiq to perform async process of the data in mongodb.
After process finished, delete the raw data.
For small and medium data in localhost, the steps above run well. But in heroku, I use hirefire to dynamically scale worker dyno up and down. When worker still processing the large data, hirefire see empty queue and scale down worker dyno. This send kill signal to the process, and leave the process in incomplete state.
I'm searching a better way to do the parsing, allow the parsing process got killed anytime (saving the current state when receiving kill signal), and allow the process got re-queued.
Right now I'm using Model.delay.parse_file and it don't get re-queued.
UPDATE
After reading sidekiq wiki, I found article about job control. Can anyone explain the code, how it works, and how it preserve it's state when receiving SIGTERM signal and the worker get re-queued?
Is there any alternative way to handle job termination, save current state, and continue right from the last position?
Thanks,

Might be easier to explain the process and the high level steps, give a sample implementation (a stripped down version of one that I use), and then talk about throw and catch:
Insert the raw csv rows with an incrementing index (to be able to resume from a specific row/index later)
Process the CSV stopping every 'chunk' to check if the job is done by checking if Sidekiq::Fetcher.done? returns true
When the fetcher is done?, store the index of the currently processed item on the user and return so that the job completes and control is returned to sidekiq.
Note that if a job is still running after a short timeout (default 20s) the job will be killed.
Then when the job runs again simply, start where you left off last time (or at 0)
Example:
class UserCSVImportWorker
include Sidekiq::Worker
def perform(user_id)
user = User.find(user_id)
items = user.raw_csv_items.where(:index => {'$gte' => user.last_csv_index.to_i})
items.each_with_index do |item, i|
if (i+1 % 100) == 0 && Sidekiq::Fetcher.done?
user.update(last_csv_index: item.index)
return
end
# Process the item as normal
end
end
end
The above class makes sure that each 100 items we check that the fetcher is not done (a proxy for if shutdown has been started), and ends execution of the job. Before the execution ends however we update the user with the last index that has been processed so that we can start where we left off next time.
throw catch is a way to implement this above functionality a little cleaner (maybe) but is a little like using Fibers, nice concept but hard to wrap your head around. Technically throw catch is more like goto than most people are generally comfortable with.
edit
Also you could not make call to Sidekiq::Fetcher.done? and record the last_csv_index on each row or on each chunk of rows processed, that way if your worker is killed without having the opportunity to record the last_csv_index you can still resume 'close' to where you left off.

You are trying to address the concept of idempotency, the idea that processing a thing multiple times with potential incomplete cycles does not cause problems. (https://github.com/mperham/sidekiq/wiki/Best-Practices#2-make-your-jobs-idempotent-and-transactional)
Possible steps forward
Split the file up into parts and process those parts with a job per part.
Lift the threshold for hirefire so that it will scale when jobs are likely to have fully completed (10 minutes)
Don't allow hirefire to scale down while a job is working (set a redis key on start and clear on completion)
Track progress of the job as it is processing and pick up where you left off if the job is killed.

Related

Understanding race_condition_ttl in Rails

I am trying to understand the race_condition_ttl directive in Rails when using Rails.cache.fetch.
I have a controller action that looks like this:
def foo
#foo = Rails.cache.fetch("foo-testing", expires_in: 30.seconds, race_condition_ttl: 60.seconds) do
Time.now.to_s
end
#foo # this gets used in a view down the line...
end
Based on what I'm reading in the Rails docs, this value should expire after 30 seconds, but the stale value is allowed to be served for another 60 seconds. However, I can't figure out how to reproduce conditions that will show me this behavior working. Here is how I'm trying to test it.
100.times.map do
t = Thread.new { RestClient.get("http://myenvironment/foo") }
t
end.map {|t| t.join.value }.uniq
I have my Rails app running on a VM behind a standard nginx/unicorn setup. I am trying to spawn 100 threads hitting the site simultaneously to simulate the "dog pile effect". However, when I run my test code, all the threads report the same value back. What I would expect to see is that one thread gets the fresh value, while at least one other thread gets served some stale content.
Any pointers are welcome! Thanks so much.
You are setting race_condition_ttl to 60 seconds which means your threads will only start getting the new value after this time expires, even not taking into account the initial 30 seconds.
Your test doesn't look like it would take 1.5 minutes to run which would be required in order for the threads to start getting the new value. From the Rails Cache docs:
Yes, this process is extending the time for a stale value by another few seconds. Because of extended life of the previous cache, other processes will continue to use slightly stale data for a just a bit longer.
The text implies using a small race_condition_ttl and it makes sense both for its purpose and your test.
UPDATE
Also note that the life of stale cache is extended only if it expired recently. Otherwise a new value is generated and :race_condition_ttl does not play any role.
Without reading source it is not particularly clear how Rails decides when its server is getting hammered or what exactly recently means in the quote above. It seems clear though that the first process (of many) of those waiting to access the cache gets to set the new value while extending life of the previous one. The presence of waiting processes might be the condition Rails looks for. In any case the expected behaviour should be observed after both initial timeout and ttl expire and cache starts serving the updated value. The delay between initial timeout and the time new value starts showing up should be similar to the ttl. Of course the precondition is the server should be hammered around the moment of initial timeout expiration.

Moving a Resque job between queues

Is there anyway to move a resque job between two different queues?
We sometimes get in the situation that we have a big queue and a job that is near the end we find a need to "bump up its priority." We thought it might be an easy way to simply move it to another queue that had a worker waiting for any high priority jobs.
This happens rarely and is usually a case where we get a special call from a customer, so scaling, re-engineering don't seem totally necessary.
There is nothing built-in in Resque. You can use rpoplpush like:
module Resque
def self.move_queue(source, destination)
r = Resque.redis
r.llen("queue:#{source}").times do
r.rpoplpush("queue:#{source}", "queue:#{destination}")
end
end
end
https://gist.github.com/rafaelbandeira3/7088498
If it's a rare occurrence you're probably better off just manually pushing a new job into a shorter queue. You'll want to make sure that your system has a way to identify that the job has already run and to bail out so that when the job in the long queue is finally reached it is not processed again (if double processing is a problem for you).

How to avoid meeting Heroku's API rate limit with delayed job and workless

My Survey model has about 2500 instances and I need to apply the set_state method to each instance twice. I need to apply it the second time only after every instance has had the method applied to it once. (The state of an instance can depend on the state of other instances.)
I'm using delayed_job to create delayed jobs and workless to automatically scale up/down my worker dynos as required.
The set_state method typically takes about a second to execute. So I've run the following at the heroku console:
2.times do
Survey.all.each do |survey|
survey.delay.set_state
sleep(4)
end
end
Shouldn't be any issues with overloading the API, right?
And yet I'm still seeing the following in my logs for each delayed job:
Heroku::API::Errors::ErrorWithResponse: Expected(200) <=> Actual(429 Unknown)
I'm not seeing any infinite loops -- it just returns this message as soon as I create the delayed job.
How can I avoid blowing Heroku's API rate limits?
Reviewing workless, it looks like it incurs an API call per delayed job to check the worker count and potentially a second API call to scale up/down. So if you are running 5000 (2500x2) jobs within a short period, you'll end up with 5000+ API calls. Which would be well in excess of the 1200/requests per hour limit. I've commented over there to hopefully help toward reducing the overall API usage (https://github.com/lostboy/workless/issues/33#issuecomment-20982433), but I think we can offer a more specific solution for you.
In the mean time, especially if your workload is pretty predictable (like this). I'd recommend skipping workless and doing that portion yourself. ie it sounds like you already know WHEN the scaling would need to happen (scale up right before the loop above, scale down right after). If that is the case you could do something like this to emulate the behavior in workless:
require 'heroku-api'
heroku = Heroku::API.new(:api_key => ENV['HEROKU_API_KEY'])
client.post_ps_scale(ENV['APP_NAME'], 'worker', Survey.count)
2.times do
Survey.all.each do |survey|
survey.delay.set_state
sleep(4)
end
end
min_workers = ENV['WORKLESS_MIN_WORKERS'].present? ? ENV['WORKLESS_MIN_WORKERS'].to_i : 0
client.post_ps_scale(ENV['APP_NAME'], 'worker', min_workers)
Note that you'll need to remove workless from these jobs also. I didn't see a particular way to do this JUST for certain jobs though, so you might want to ask on that project if you need that. Also, if this needs to be 2 pass (the first time through needs to finish before the second), the 4 second sleep may in some cases be insufficient but that is a different can of worms.
I hope that helps narrow in on what you needed, but I'm certainly happy to discuss further and/or elaborate on the above as needed. Thanks!

Long running delayed_job jobs stay locked after a restart on Heroku

When a Heroku worker is restarted (either on command or as the result of a deploy), Heroku sends SIGTERM to the worker process. In the case of delayed_job, the SIGTERM signal is caught and then the worker stops executing after the current job (if any) has stopped.
If the worker takes to long to finish, then Heroku will send SIGKILL. In the case of delayed_job, this leaves a locked job in the database that won't get picked up by another worker.
I'd like to ensure that jobs eventually finish (unless there's an error). Given that, what's the best way to approach this?
I see two options. But I'd like to get other input:
Modify delayed_job to stop working on the current job (and release the lock) when it receives a SIGTERM.
Figure out a (programmatic) way to detect orphaned locked jobs and then unlock them.
Any thoughts?
Abort Job Cleanly on SIGTERM
A much better solution is now built into delayed_job. Use this setting to throw an exception on TERM signals by adding this in your initializer:
Delayed::Worker.raise_signal_exceptions = :term
With that setting, the job will properly clean up and exit prior to heroku issuing a final KILL signal intended for non-cooperating processes:
You may need to raise exceptions on SIGTERM signals, Delayed::Worker.raise_signal_exceptions = :term will cause the worker to raise a SignalException causing the running job to abort and be unlocked, which makes the job available to other workers. The default for this option is false.
Possible values for raise_signal_exceptions are:
false - No exceptions will be raised (Default)
:term - Will only raise an exception on TERM signals but INT will wait for the current job to finish.
true - Will raise an exception on TERM and INT
Available since Version 3.0.5.
See this commit where it was introduced.
TLDR:
Put this at the top of your job method:
begin
term_now = false
old_term_handler = trap 'TERM' do
term_now = true
old_term_handler.call
end
AND
Make sure this is called at least once every ten seconds:
if term_now
puts 'told to terminate'
return true
end
AND
At the end of your method, put this:
ensure
trap 'TERM', old_term_handler
end
Explanation:
I was having the same problem and came upon this Heroku article.
The job contained an outer loop, so I followed the article and added a trap('TERM') and exit. However delayed_job picks that up as failed with SystemExit and marks the task as failed.
With the SIGTERM now trapped by our trap the worker's handler isn't called and instead it immediately restarts the job and then gets SIGKILL a few seconds later. Back to square one.
I tried a few alternatives to exit:
A return true marks the job as successful (and removes it from the queue), but suffers from the same problem if there's another job waiting in the queue.
Calling exit! will successfully exit the job and the worker, but it doesn't allow the worker to remove the job from the queue, so you still have the 'orphaned locked jobs' problem.
My final solution was the one given at at the top of my answer, it comprises of three parts:
Before we start the potentially long job we add a new interrupt handler for 'TERM' by doing a trap (as described in the Heroku article), and we use it to set term_now = true.
But we must also grab the old_term_handler which the delayed job worker code set (which is returned by trap) and remember to call it.
We still must ensure that we return control to Delayed:Job:Worker with sufficient time for it to clean up and shutdown, so we should check term_now at least (just under) every ten seconds and return if it is true.
You can either return true or return false depending on whether you want the job to be considered successful or not.
Finally it is vital to remember to remove your handler and install back the Delayed:Job:Worker one when you have finished. If you fail to do this you will keep a dangling reference to the one we added, which can result in a memory leak if you add another one on top of that (for example, when the worker starts this job again).
New to the site, so can't comment on Dave's post, and need to add a new answer.
The issue I have with Dave's approach is that my tasks are long (minutes up to 8 hours), and are not repetitive at all. I can't "ensure to call" every 10 seconds.
Also, I have tried Dave's answer, and the job is always removed from the queue, regardless of what I return -- true or false. I am unclear as to how to keep the job on the queue.
See this this pull request. I think this may work for me. Please feel free to comment on it and support the pull request.
I am currently experimenting with a trap then rescue the exit signal... No luck so far.
That is what max_run_time is for: after max_run_time has elapsed from the time the job was locked, other processes will be able to acquire the lock.
See this discussion from google groups
I ended up having to do this in a few places, so I created a module that I stick in lib/, and then run ExitOnTermSignal.execute { long_running_task } from inside my delayed job's perform block.
# Exits whatever is currently running when a SIGTERM is received. Needed since
# Delayed::Job traps TERM, so it does not clean up a job properly if the
# process receives a SIGTERM then SIGKILL, as happens on Heroku.
module ExitOnTermSignal
def self.execute(&block)
original_term_handler = Signal.trap 'TERM' do
original_term_handler.call
# Easiest way to kill job immediately and having DJ mark it as failed:
exit
end
begin
yield
ensure
Signal.trap 'TERM', original_term_handler
end
end
end
I use a state machine to track the progress of jobs, and make the process idempotent so I can call perform on a given job/object multiple times and be confident it won't re-apply a destructive action. Then update the rake task/delayed_job to release the log on TERM.
When the process restarts it will continue as intended.

Need alternative to filters/observers for Ruby on Rails project

Rails has a nice set of filters (before_validation, before_create, after_save, etc) as well as support for observers, but I'm faced with a situation in which relying on a filter or observer is far too computationally expensive. I need an alternative.
The problem: I'm logging web server hits to a large number of pages. What I need is a trigger that will perform an action (say, send an email) when a given page has been viewed more than X times. Due to the huge number of pages and hits, using a filter or observer will result in a lot of wasted time because, 99% of the time, the condition it tests will be false. The email does not have to be sent out right away (i.e. a 5-10 minute delay is acceptable).
What I am instead considering is implementing some kind of process that sweeps the database every 5 minutes or so and checks to see which pages have been hit more than X times, recording that state in a new DB table, then sending out a corresponding email. It's not exactly elegant, but it will work.
Does anyone else have a better idea?
Rake tasks are nice! But you will end up writing more custom code for each background job you add. Check out the Delayed Job plugin http://blog.leetsoft.com/2008/2/17/delayed-job-dj
DJ is an asynchronous priority queue that relies on one simple database table. According to the DJ website you can create a job using Delayed::Job.enqueue() method shown below.
class NewsletterJob < Struct.new(:text, :emails)
def perform
emails.each { |e| NewsletterMailer.deliver_text_to_email(text, e) }
end
end
Delayed::Job.enqueue( NewsletterJob.new("blah blah", Customers.find(:all).collect(&:email)) )
I was once part of a team that wrote a custom ad server, which has the same requirements: monitor the number of hits per document, and do something once they reach a certain threshold. This server was going to be powering an existing very large site with a lot of traffic, and scalability was a real concern. My company hired two Doubleclick consultants to pick their brains.
Their opinion was: The fastest way to persist any information is to write it in a custom Apache log directive. So we built a site where every time someone would hit a document (ad, page, all the same), the server that handled the request would write a SQL statement to the log: "INSERT INTO impressions (timestamp, page, ip, etc) VALUES (x, 'path/to/doc', y, etc);" -- all output dynamically with data from the webserver. Every 5 minutes, we would gather these files from the web servers, and then dump them all in the master database one at a time. Then, at our leisure, we could parse that data to do anything we well pleased with it.
Depending on your exact requirements and deployment setup, you could do something similar. The computational requirement to check if you're past a certain threshold is still probably even smaller (guessing here) than executing the SQL to increment a value or insert a row. You could get rid of both bits of overhead by logging hits (special format or not), and then periodically gather them, parse them, input them to the database, and do whatever you want with them.
When saving your Hit model, update a redundant column in your Page model that stores a running total of hits, this costs you 2 extra queries, so maybe each hit takes twice as long to process, but you can decide if you need to send the email with a simple if.
Your original solution isn't bad either.
I have to write something here so that stackoverflow code-highlights the first line.
class ApplicationController < ActionController::Base
before_filter :increment_fancy_counter
private
def increment_fancy_counter
# somehow increment the counter here
end
end
# lib/tasks/fancy_counter.rake
namespace :fancy_counter do
task :process do
# somehow process the counter here
end
end
Have a cron job run rake fancy_counter:process however often you want it to run.

Resources