How long should a Sidekiq job last? - ruby-on-rails

In the Sidekiq wiki it is stated:
Make your jobs small and simple
I get simple, I get idempotent and transactional, but what is small? Maybe required Memory and Computing Time is a good measure? My Sidekiq jobs take between 10sec and 30min.
I think that 10sec is okay, but what about the long-running task of 30min? I am loading all the data of a certain type from the database into memory, run lengthy computations on them and then write back the results. All three things in one worker job.
Is that fine? Or should I instead invoke from a worker job, multiple worker jobs that run the small computations? The problem is, these small computations may need some complex hash tables to do the computations and it was suggested not to persist this in Redis, only small simple values.

That depends on how often you want/have to invoke the job and whether it is acceptable to you that it takes so long.
If you run the job in shorter intervals than it takes to finish it certainly is too long.
Splitting it up into multiple workers would only help here if you could improve the total run time (e.g. if some of it can be run at the same time)
So rule of thumb as always: as long as it suits your needs it is ok.
However:
on long jobs you should consider that the job might fail mid-execution for whatever reason (server crashes etc).
Can you still continue this lengthy job there or will it be rolled back properly?
Also: What happens if data is changed while you are executing the job?

Related

Running large amount of long running background jobs in Rails

We're building a web-app where users will be uploading potentially large files that will need to be processed in the background. The task involves calling 3rd-party APIs so each job can take several hours to complete. We're using DelayedJob to run the background jobs. With every user kicking off a background job, each of which will take a few hours to finish, that will add up to a lot of background jobs every quickly. I am wondering what would be the best way to setup the deployment for this? We're currently hosted on DigitalOcean. I've kicked off 10 DelayedJob workers. Each one (when ideal) takes up 157MB. When actively running it utilizes around 900 MB. Our user-base right now is pretty small so it's not an issue but will be one soon. So on a 4GB droplet, I can probably run like 2 or 3 workers at a time. How should we approach this issue? Should we be looking at using DigitalOcean's API to auto-spin cheap droplets on demand? Should we subscribe to high-memory droplets on a monthly basis instead? If we go with auto-spinning droplets, should we stick with DigitalOcean or would Heroku make more sense? Or is the entire approach wrong and should we be approaching it from an entire different direction? Any help/advice would be very much appreciated.
Thanks!
It sounds like you are limited by memory on the number of workers that you can run on your DigitalOcean host.
If you are worried about scaling, I would focus on making the workers as efficient as possible. Have you done any benchmarking to understanding where the 900MB of memory is being allocated? I'm not sure what the nature of these jobs are, but you mentioned large files. Are you reading the contents of these files into memory, or are you streaming them? Are you using a database with SQL you can tune? Are you making many small API calls when you could be using a batch endpoint? Are you assigning intermediary variables that must then be garbage collected? Can you compress the files before you send them?
Look at the job structure itself. I've found that background jobs work best with many smaller jobs rather than one larger job. This allows execution to happen in parallel, and be more load balanced across all workers. You could even have a job that generates other jobs. If you need a job to orchestrate callbacks when a group of jobs finishes there is a DelayedJobGroup plugin at https://github.com/salsify/delayed_job_groups_plugin that allows you to invoke a final job only after the sibling jobs complete. I would aim for an execution time of a single job to be under 30 seconds. This is arbitrary but it illustrates what I mean by smaller jobs.
Some hosting providers like Amazon provide spot instances where you can pay a lower price on servers that do not have guaranteed availability. These pair well with the many fewer jobs approach I mentioned earlier.
Finally, Ruby might not be the right tool for the job. There are faster languages, and if you are limited by memory, or CPU, you might consider writing these jobs and their workers in another language like Javascript, Go or Rust. These can pair well with a Ruby stack, but offload computationally expensive subroutines to faster languages.
Finally, like many scaling issues, if you have more money than time, you can always throw more hardware at it. At least for a while.
I thing memory and time is more problem for you. you have to use sidekiq gem for this process because it will consume less time and memory consumption for doing the same job,because it uses redis as database which is key value pair db.if the problem continues go with java script.

How to correctly use Resque workers?

I have the following tasks to do in a rails application:
Download a video
Trim the video with FFMPEG between a given duration (Eg.: 00:02 - 00:09)
Convert the video to a given format
Move the converted video to a folder
Since I wanted to make this happen in background jobs, I used 1 resque worker that processes a queue.
For the first job, I have created a queue like this
#queue = :download_video that does it's task, and at the end of the task I am going forward to the next task by calling Resque.enqueue(ConvertVideo, name, itemId). In this way, I have created a chain of queues that are enqueued when one task is finished.
This is very wrong, since if the first job starts to enqueue the other jobs (one from another), then everything get's blocked with 1 worker until the first list of queued jobs is finished.
How should this be optimised? I tried adding more workers to this way of enqueueing jobs, but the results are wrong and unpredictable.
Another aspect is that each job is saving a status in the database and I need the jobs to be processed in the right order.
Should each worker do a single job from above and have at least 4 workers? If I double the amount to 8 workers, would it be an improvement?
Have you considered using sidekiq ?
As said in Sidekiq documentation :
resque uses redis for storage and processes messages in a single-threaded process. The redis requirement makes it a little more difficult to set up, compared to delayed_job, but redis is far better as a queue than a SQL database. Being single-threaded means that processing 20 jobs in parallel requires 20 processes, which can take a lot of memory.
sidekiq uses redis for storage and processes jobs in a multi-threaded process. It's just as easy to set up as resque but more efficient in terms of raw processing speed. Your worker code does need to be thread-safe.
So you should have two kind of jobs : download videos and convert videos and any download video job should be done in parallel (you can limit that if you want) and then each stored in one queue (the "in-between queue") before being converted by multiple convert jobs in parallel.
I hope that helps, this link explains quite well the best practices in Sidekiq : https://github.com/mperham/sidekiq/wiki/Best-Practices
As #Ghislaindj noted Sidekiq might be an alternative - largely because it offers plugins that control execution ordering.
See this list:
https://github.com/mperham/sidekiq/wiki/Related-Projects#execution-ordering
Nonetheless, yes, you should be using different queues and more workers which are specific to the queue. So you have a set of workers all working on the :download_video queue and then you other workers attached to the :convert_video queue, etc.
If you want to continue using Resque another approach would be to use delayed execution, so when you enqueue your subsequent jobs you specify a delay parameter.
Resque.enqueue_in(10.seconds, ConvertVideo, name, itemId)
The down-side to using delayed execution in Resque is that it requires the resque-scheduler package, so you're introducing a new dependency:
https://github.com/resque/resque-scheduler
For comparison Sidekiq has delayed execution natively available.
Have you considered merging all four tasks into just one? In this case you can have any number of workers, one will do the job. It will work very predictable, you can even know how much time will take to finish the task. You also don't have problems when one of the subtasks takes longer than all others and it piles up in the queue.

How to add a rate limit to a Resque queue?

So I have a Resque worker that calls an API, the issue is the API has a rate limit of 2 requests per second.
Is there a way to add a delay between each job processed in a specific queue?
P.S. the queue could have thousands of pending jobs.
Why not sleep for a given amount of time at the end of the process? Well, perhaps you want your resque worker to be doing something useful instead. In CPU time, half a second is a lot of time - you could have done something useful there, like process a job from another queue that's not rate limited.
I have this same problem myself, so I'm motivated to find a solution. It seems like there are two easy-ish ways to do it. The first idea is to use resque scheduler and pre-compute the time to run the job at before inserting it. This seems error-prone to me. The second is to use a gem like https://github.com/flyerhzm/resque-restriction (disclaimer: just found it through some googling. haven't used it yet) and rate-limit as you pull jobs off the queue. Seems like a robust solution in theory. Note that if you can't execute the job yet, it never comes off the queue, so you'll pull something else instead - much more efficient use of your workers.
Per my comment, I'd recommend just performing a sleep for a given number of seconds at the end of each Resque process method.

Grails non time based queuing

I need to process files which get uploaded and it can take as little as 1 second or as much as 10 minutes. Currently my solution is to make a quartz job with a timer of 30 seconds and then process and arbitrary job whenever it hits. There are several problems with this.
One: if the job will take less than a few seconds it is wasteful to make things wait 30 seconds for the job queue.
Two: if there is only one long job in the queue it could feasibly try to do it twice.
What I want is a timeless queue. When things are added the are started immediately if there is a free worker. Is there a solution for this? I was looking at jesque, but I couldn't tell if it can do this.
What you are looking for is a basic message queue. There are lots of options out there, but my favorite for Grails is RabbitMQ. The Grails plugin for it is quite good and it performs well in my experience.
In general, message queues allow you to have N producers (things creating jobs") adding work messages to a queue and then M consumers pulling jobs off of the queue and processing them. When a worker completes it's job, it simply asks the queue for the next job to process and if there is none, it just waits for the queue to give it something to do. The queue also keeps track of success / failure of message processing (you can control this) so that you don't give the same message to more than one worker.
This has the advantage of not relying on polling (so you can start processing as soon as things come in) and it's also much more scaleable. You can scale both your producers and consumers up or down as needed, decoupling the inputs from the outputs so that you can take a traffic spike and then work your way through it as you have the resources (workers) available.
To solve problem one just make the job check for new uploaded files every 5 seconds (or 3 seconds, or 1 second). If the check for uploaded files is quick then there is no reason you can't run it often.
For problem two you just need to record when you start processing a file to ensure it doesn't get picked-up twice. You could create a table in the database, or store the information in memory somewhere.

Quartz.Net jobs not always running - can't find any reason why

We're using Quartz.Net to schedule about two hundred repeating jobs. Each job uses the same IJob implementing class, but they can have different schedules. In practice, they end up having the same schedule, so we have about two hundred job details, each with their own (identical) repeating/simple trigger, scheduled. The interval is one hour.
The task this job performs is to download an rss feed, and then download all of the media files linked to in the rss feed. Prior to downloading, it wipes the directory where it is going to place the files. A single run of a job takes anywhere from a couple seconds to a dozen seconds (occasionally more).
Our method of scheduling is to call GetScheduler() on a new StdSchedulerFactory (all jobs are scheduled at once into the same IScheduler instance). We follow the scheduling with an immediate Start().
The jobs appear to run fine, but upon closer inspection we are seeing that a minority of the jobs occasionally - or almost never - run.
So, for example, all two hundred jobs should have run at 6:40 pm this evening. Most of them did. But a handful did not. I determine this by looking at the file timestamps, which should certainly be updated if the job runs (because it deletes and redownloads the file).
I've enabled Quartz.Net logging, and added quite a few logging statements to our code as well.
I get log messages that indicate Quartz is creating and executing jobs for roughly one minute after the round of jobs starts.
After that, all activity stops. No jobs run, no log messages are created. Zero.
And then, at the next firing interval, Quartz starts up again and my log files update, and various files start downloading. But - it certainly appears like some JobDetail instances never make it to the head of the line (so to speak) or do so very infrequently. Over the entire weekend, some jobs appeared to update quite frequently, and recently, and others had not updated a single time since starting the process on Friday (it runs in a Windows Service shell, btw).
So ... I'm hoping someone can help me understand this behavior of Quartz.
I need to be certain that every job runs. If it's trigger is missed, I need Quartz to run it as soon as possible. From reading the documentation, I thought this would be the default behavior - for SimpleTrigger with an indefinite repeat count it would reschedule the job for immediate execution if the trigger window was missed. This doesn't seem to be the case. Is there any way I can determine why Quartz is not firing these jobs? I am logging at the trace level and they just simply aren't there. It creates and executes an awful lot of jobs, but if I notice one missing - all I can find is that it ran it the last time (for example, sometimes it hasn't run for hours or days). Nothing about why it was skipped (I expected Quartz to log something if it skips a job for any reason), etc.
Any help would really, really be appreciated - I've spent my entire day trying to figure this out.
After reading your post, it sounds a lot like the handful of jobs that are not executing are very likely misfiring. The reason that I believe this:
I get log messages that indicate Quartz is creating and executing jobs for roughly one minute after the round of jobs starts.
In Quartz.NET the default misfire threshold is 1 minute. Chances are, you need to examine your logging configuration to determine why those misfire events are not being logged. I bet if you throw open the the floodgates on your logging (ie. set everything to debug, and make sure that you definitely have a logging directive for the Quartz scheduler class), and then rerun your jobs. I'm almost positive that the problem is the misfire events are not showing up in your logs because the logging configuration is lacking something. This is understandable, because logging configuration can get very confusing, very quickly.
Also, in the future, you might want to consult the quartz.net forum on google, since that is where some of the more thorny issues are discussed.
http://groups.google.com/group/quartznet?pli=1
Now, your other question about setting the policy for what the scheduler should do, I can't specifically help you there, but if you read the API docs closely, and also consult the google discussion group, you should be able to easily set the misfire policy flag that suits your needs. I believe that Trigger's have a MisfireInstruction property which you can configure.
Also, I would argue that misfires introduce a lot of "noise" and should be avoided; perhaps bumping up the thread count on your scheduler would be a way to avoid misfires? The other option would be to stagger your job execution into separate/multiple batches.
Good luck!

Resources