How can I configure Delayed jobs to not wait for a task before starting the others? - ruby-on-rails

I am using Delayed jobs for my Ruby app hosted in Heroku to perform a very long task that can take up to 5 minutes.
I've noticed that, in development mode at least, when this task is running the ones that come afterwards are not started until that one finishes. I would like other tasks to be able to start running without having to wait for the other to finish (to have at least 3 concurrent tasks, for example).
I don't wish to increase the number of workers in Heroku ($$$).
I noticed the 'pool' param in delayed jobs but I don't fully understand if this is what I need or how to use it.
https://github.com/collectiveidea/delayed_job/blob/master/README.md
I achieved it using threads in the task code, but maybe this is not the best way to do it.
If you could tell me exactly how I could achieve concurrency in delayed jobs I would really appreciate it.

A DJ worker only runs a single job at a time. If you want concurrent processing of your background jobs, you'll need multiple background workers.
You are way better off implementing sidekiq.

Related

how to continuously deploy with long running jobs

We currently use delayed_job and rails to manage some long running jobs in our system. Some of these jobs take potentially hours to run, but we also like to deploy rather frequently, often many times a day. The problem with this setup is that we have to restart delayed_job during deployment to pick up code changes, so that any new jobs are processed with the latest code.
The solution we've arrived at is that for any job that needs to run for more than some small amount of time, we fork the delayed job so that it returns immediately, and the forked process handles the work. This way a deploy can restart all the delayed job processes, while the long-running 'job' keeps going until it's finished as an orphaned process.
We've looked at sidekiq, but it looks like we'd have the same issue there when trying to deploy new code.
Has anyone developed a solution they would recommend for dealing with long-running background processes that span multiple deployments?

How to correctly use Resque workers?

I have the following tasks to do in a rails application:
Download a video
Trim the video with FFMPEG between a given duration (Eg.: 00:02 - 00:09)
Convert the video to a given format
Move the converted video to a folder
Since I wanted to make this happen in background jobs, I used 1 resque worker that processes a queue.
For the first job, I have created a queue like this
#queue = :download_video that does it's task, and at the end of the task I am going forward to the next task by calling Resque.enqueue(ConvertVideo, name, itemId). In this way, I have created a chain of queues that are enqueued when one task is finished.
This is very wrong, since if the first job starts to enqueue the other jobs (one from another), then everything get's blocked with 1 worker until the first list of queued jobs is finished.
How should this be optimised? I tried adding more workers to this way of enqueueing jobs, but the results are wrong and unpredictable.
Another aspect is that each job is saving a status in the database and I need the jobs to be processed in the right order.
Should each worker do a single job from above and have at least 4 workers? If I double the amount to 8 workers, would it be an improvement?
Have you considered using sidekiq ?
As said in Sidekiq documentation :
resque uses redis for storage and processes messages in a single-threaded process. The redis requirement makes it a little more difficult to set up, compared to delayed_job, but redis is far better as a queue than a SQL database. Being single-threaded means that processing 20 jobs in parallel requires 20 processes, which can take a lot of memory.
sidekiq uses redis for storage and processes jobs in a multi-threaded process. It's just as easy to set up as resque but more efficient in terms of raw processing speed. Your worker code does need to be thread-safe.
So you should have two kind of jobs : download videos and convert videos and any download video job should be done in parallel (you can limit that if you want) and then each stored in one queue (the "in-between queue") before being converted by multiple convert jobs in parallel.
I hope that helps, this link explains quite well the best practices in Sidekiq : https://github.com/mperham/sidekiq/wiki/Best-Practices
As #Ghislaindj noted Sidekiq might be an alternative - largely because it offers plugins that control execution ordering.
See this list:
https://github.com/mperham/sidekiq/wiki/Related-Projects#execution-ordering
Nonetheless, yes, you should be using different queues and more workers which are specific to the queue. So you have a set of workers all working on the :download_video queue and then you other workers attached to the :convert_video queue, etc.
If you want to continue using Resque another approach would be to use delayed execution, so when you enqueue your subsequent jobs you specify a delay parameter.
Resque.enqueue_in(10.seconds, ConvertVideo, name, itemId)
The down-side to using delayed execution in Resque is that it requires the resque-scheduler package, so you're introducing a new dependency:
https://github.com/resque/resque-scheduler
For comparison Sidekiq has delayed execution natively available.
Have you considered merging all four tasks into just one? In this case you can have any number of workers, one will do the job. It will work very predictable, you can even know how much time will take to finish the task. You also don't have problems when one of the subtasks takes longer than all others and it piles up in the queue.

Does Sidekiq execute jobs in the order they are sent to a worker?

I have a rake task which is going to call 4 more rake tasks, in order:
rake:one
rake:two
rake:three
rake:four
Rake tasks one, two, and three are getting data and adding it to my database. Then rake:four is going to do something with that data. But I need to make sure that one, two, and three are complete first. Each rake task is actually spinning up Sidekiq workers to run in the background. In this scenario, would all of the workers created by rake:one finish first, then rake:two, etc?
If not, how can I ensure that the workers are executed in order?
Sidekiq processes jobs in the order which they are created, but by default it processes multiple jobs simultaneously, and there is no guarantee that a given job will finish before another job is started.
Quoting from https://github.com/mperham/sidekiq/wiki/FAQ:
How can I process a certain queue in serial?
You can't, by design. Sidekiq is designed for asynchronous processing
of jobs that can be completed in isolation and independent of each
other. Jobs will be popped off of Redis in the order in which they
were pushed but there's no guarantee that Job #1 will execute fully
before Job #2 is started.
If you need serial execution, you should look into other systems which
give those types of guarantees.
Note you can create a Sidekiq process dedicated to processing a queue
with a single worker. This will give you serial execution but it's a
hack.
Also note you can use third-party extensions for sidekiq to achieve
that goal.
You can simply create one meta rake task, which will include all those tasks in right order.
Or as a less hacky solution: reduce number of workers per queue to 1:
https://github.com/brainopia/sidekiq-limit_fetch#limits
And add all your jobs to this queue

How can I monitor recurrent rake tasks run by heroku scheduler?

I just got the last month heroku bill, and the scheduled rake tasks were a relatively heavy burden. We are pretty early in our development process, so we just developed some rake tasks to get the job done recently, and didn't had much concern in theirs optimization.
Now we want to improve theirs performance and theirs heroku processing hours usage. We use New Relic to monitor the webapp performance, but apparently this type of rake tasks are ignored by default, and it's unclear how to override that.
Anyone had a similiar problem? How can I track the scheduled tasks in close to real time to monitor performance, optimize, and don't get suprise bills?
Whilst you can't really monitor rake tasks that well, there are a few little things you can do. One is the use of logging. Output start and end times of tasks to logs, and you can then see what's been happening duration wise. If you couple this with something like the Papertrail add-on then you can do additional interrogation later on.
As for running the jobs themselves, there's a couple of ways that you can run background processes which are dependant on how they need to run:
If you're needing to run jobs on a schedule, there's a few options available. Firstly there's the Heroku scheduler, which is pretty good, but doesn't guarantee executions will happen. Normally you would use this to kick off a rake task which will bring up a one-off dyno for the duration of the task - therefore you need to ensure in development that these tasks are as efficient as possible.
Alternatively, if you're looking at jobs that need a little more control or using a clock process. Essentially this is a dyno running 24/7 that does nothing but kick off other jobs at preset intervals and times. This would normally be done using the clockwork gem. The downside of this approach is that you need to pay for a clock process all the time.
A third approach, and one that might work is delayed job, with it's runat option, allowing you to queue a job to be run in the future (and jobs can re-queue themselves). There are a few issues with this in that a failure can kill the whole chain, and you need a full time worker running to process them all.
Therefore, in order to minimize your bills, ensure that your rake tasks are as performant and reliable, and then choose the scheduling option that suits you. If you're looking at schedules plus user created events, delayed_job might be the best option. If you're looking at a few tasks running periodically, then go scheduler. If you're looking at running lots of time critical jobs on a regular basis, go with clockwork.
Either way, you should be able to constrain a fair amount of processing into just one or two processes depending on your approach.
I know this question is almost 10 years old, but there is a new way!
You can now monitor your Heroku Scheduler jobs using One-off Dyno Metrics. This Heroku add-on gathers metrics for all detached one-off dynos running in your Heroku app. It was created to be an extension of Heroku's Application Metrics and works out of the box.
when you are running on heroku cedar there is a way to get a free setup for your workers. this is no answer to your monitoring question, but it might be interesting anyways: http://blog.nofail.de/2011/07/heroku-cedar-background-jobs-for-free/
You can force the New Relic agent to start in your rake tasks and report their performance data.
Not the answer to the specific question,but...
One method of reducing overhead is using Unicorn server to get multiple workers working on one dyno. It depends on your set up, but most people who've taken the time to test it can comfortably get 3 - 4 worker processes running concurrently. It's a huge boost in clearing cues or tasks. Just be careful not to max out the allocated memory for the dyno.

Should the resque-scheduler queue be expected to handle /lots/ of delayed jobs?

I am currently using resque and resque-scheduler in an application that will have to handle a lot of recurring jobs - "do this every hour", "do this every day" etc. At the moment, I simply queue up the next run of the job in the job itself, the HourlyJob queue has a .enqueue_at(1.hour.from_now, HourlyJob) etc.
Should I be doing this? It "feels" like I should have a static recurring job using resque-schedulers cron-type functionality that then schedules up say the next 5 minutes worth of delayed jobs... but all I am really doing is moving the work from the (probably fast, redis based) resque-scheduler to my (probably less well implemented, mysql based) code, surely?
Is there anything wrong with how I'm doing it now?
I'd personally use the cron style provided by resque-scheduler, your use case is exactly what it was built for:
Your more directly indicate these are recurring jobs.
Everything is located in the same YAML file rather then multiple job classes/modules.
By queuing the next run of the job inside the actual job:
You run the risk of the next run going missing when your worker/job/server fails.
Your needlessly using more memory in Redis, the scheduler process will not add the jobs to Redis until there ready to be run.
Hops this helps.

Resources