Rails 3 + Heroku + Delayed jobs - Help me understand! - ruby-on-rails

I'm having problems understanding this article: http://blog.darkhax.com/2010/07/30/auto-scale-your-resque-workers-on-heroku .
I don't quite get why I need Redis + Resque when I already have delayed jobs provided by Heroku.
From my understanding, I still have to pay for the workers, correct? What's my main advantage of using that solution?
Regards.

If you don't know why you need Resque, then you don't need it ;)
Resque is for high scalability. delayed_job is fine for smaller-scale stuff, but once you get to the size of, say, GitHub, you will need something like Resque. If delayed_job works for you, then stay with it. You don't need to worry about replacing it until your background job queue reaches around 30,000 or so.

To autoscale Heroku workers using delayed_job, you can hook into the enqueue and after hooks and use the Heroku API to query/update the number of workers.
For the most basic implementation: on enqueue, check whether there are any workers and, if not, add one. On after, check whether there are other delayed jobs and, if not, reduce the workers to 0.
You can obviously make this more sophisticated in the way that you scale.
Here is a basic implementation: https://github.com/phaza/Heroku-Delayed-Job-Autoscale
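
A minimal sketch of that enqueue/after idea, in case it helps. It relies on delayed_job calling enqueue and after on the job object; everything else here is an assumption for illustration: the WorkerAutoscaling module, the scaler object (anything that responds to worker_count and scale_workers, standing in for whatever Heroku API client you use) and the other_pending_jobs callable are hypothetical names, not part of delayed_job or of any Heroku library.

```ruby
# Hypothetical mixin: adds the two delayed_job hooks described above to a job.
module WorkerAutoscaling
  class << self
    # scaler: responds to worker_count and scale_workers(n) (assumed interface).
    # other_pending_jobs: callable returning how many OTHER jobs are still queued.
    attr_accessor :scaler, :other_pending_jobs
  end

  # Called by delayed_job when the job is enqueued: start a worker if none runs.
  def enqueue(_job = nil)
    scaler = WorkerAutoscaling.scaler
    scaler.scale_workers(1) if scaler.worker_count.zero?
  end

  # Called by delayed_job after the job has run: stop workers if nothing is left.
  def after(_job = nil)
    remaining = WorkerAutoscaling.other_pending_jobs.call
    WorkerAutoscaling.scaler.scale_workers(0) if remaining.zero?
  end
end

# Example job (hypothetical) that opts in to the hooks.
FeedRefreshJob = Struct.new(:feed_id) do
  include WorkerAutoscaling

  def perform
    # the real work would go here
  end
end
```

The job that is currently running has not been removed from the queue yet when after fires, so the count you plug in should leave it out.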

hirefireapp is a new-ish simple drop-in solution to auto-scaling workers.
It spawns workers for you based on queue size (configurable) and then "fires" them when they are no longer necessary. You pay for the dyno time (to the nearest second) and for the hirefireapp service. In theory you could roll your own using the open source hirefire gem too.
It also handles scaling the web side if you choose, so you can spawn more web dynos based on current latency.

You can also use Hirefireapp.com to monitor and scale your apps

Related

Scaling Dyno worker size dynamically on Heroku Rails application

I am working on a project that launches a very resource-intensive process via a Rails worker, and it can only be handled properly by a Performance worker on Heroku. 1X workers are killed because they use too much RAM, and 2X workers can barely handle the load, exceeding their RAM limits by up to 160%. A Performance worker does the job fine with no issues.
My question is, is there a way to dynamically switch the Dyno size to Performance before a job initiates and then scale it back down once the job is finished or a queue is empty?
I know HireFire exists, but to my knowledge that service only increases the number of workers based on queue length and similar metrics. Another possible solution I thought about was using the Heroku API, which has a Dyno endpoint, to resize the worker dyno before the job starts and then resize it back down when the job ends.
Does anyone else have other recommendations, ideas or strategies for this issue?
Thanks!
The best way is the one you mentioned: use the Heroku Platform API to scale your Dyno size up before starting the job, and then down again afterwards.
This is because tools like HireFire only work by inspecting stuff like application response time, router queue, etc. -- so there's no way for them to know you're about to run some job and then scale up just for that.
Depending on the specifics of the usage, you may be able to just create a distinct process type in your Procfile that only runs this particular worker and is always sized to Performance, but isn't always running. You could even run it as a one-off dyno instead of scaling it (this can also be done via the API, roughly equivalent to heroku run ...). That said, @rdegges's answer should certainly work.
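
For the scale-up-then-down approach, here is a rough sketch of the wrapper. The with_dyno_size helper and the formation object are hypothetical: formation is assumed to be a thin wrapper of your own around the Heroku Platform API that responds to resize(process_type, size), and the size names are only examples. As far as I know, resizing restarts the dynos of that process type, so this would have to run from somewhere other than the worker being resized.

```ruby
# Hypothetical helper: resize a process type, run the block, then always
# resize back, even if the block raises.
# `formation` must respond to resize(process_type, size) (assumed interface).
def with_dyno_size(formation, process_type:, size:, restore_to:)
  formation.resize(process_type, size)
  yield
ensure
  formation.resize(process_type, restore_to)
end

# Usage (names are illustrative only):
#   with_dyno_size(formation, process_type: 'worker',
#                  size: 'performance-m', restore_to: 'standard-2x') do
#     enqueue_heavy_job_and_wait_until_done
#   end
```

The ensure clause is the point of the design: the dyno should not stay at the expensive size if the job blows up.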

Good background processing options

I am looking for a good background job processor with the following abilities:
Works well with MySQL
Can have priorities
Can easily schedule anything in the background (not just emails)
Can reinitialize a job after completion (a callback would be good; I have a few tasks/jobs that keep running every minute), or even a repetitive scheduler would work
Should not eat up a lot of memory (I have had this experience with DJ)
A few options I am looking into: Resque, DJ, Beanstalkd (I haven't explored them completely)
My production environment is on Amazon EC2 (in case that helps find a better solution)
Please suggest which is a good option. Is there something else apart from these that people use nowadays?
I'd heartily recommend Sidekiq: it's extremely flexible and it uses far fewer resources than Resque or DelayedJob.
It does require Redis (like Resque), but Redis is a valuable addition to any Rails project since it can be reused as a session store and cache. Our primary db is MySQL and we deploy to EC2 :-) We've used delayed_job and Resque in the past, but found them problematic and heavy on resources. Sidekiq uses threads, and a single Sidekiq worker is as efficient as several DJ/Resque workers. Here's an interesting part of the project's README that I can corroborate:
You'll find that you might need 50 200MB resque processes to peg your
CPU whereas one 300MB Sidekiq process will peg the same CPU and
perform the same amount of work. Please see my blog post on Resque's
memory efficiency and how I was able to shrink a Carbon Five client's
resque processing farm from 9 machines to 1 machine.
To sum it all up:
Works well with MySQL: Sidekiq doesn't use MySQL itself (its jobs live in Redis), but it doesn't have problems running alongside MySQL either
You can have priorities by setting up different processing queues
You can easily schedule anything (and there is special DJ-style support for e-mails in particular)
Re-running jobs after completion: not quite sure about that; we use whenever + cron for repetitive jobs
You're gonna love Sidekiq's small memory footprint

Will each Resque queue load the complete application?

I have 16 Resque queues, and when I look at the memory allocation for them, each one shows about 4% of memory, even though all the queues are empty at the time. So out of 100% of my memory, nearly 64% seems to be used just by loading the environment. That's what I think is happening.
My questions are:
1. Does each of these Resque queues load the complete application into memory separately?
2. If yes, can I change the Resque configuration so that all the queues share one copy of the environment loaded in memory?
Thanks in advance
I think you are out of luck if you're using Resque. I believe this is why Sidekiq was developed as a nearly drop-in replacement for Resque. The author of Sidekiq wrote a blog post describing how he improved Resque's memory usage. Here's a little bit from the Sidekiq FAQs:
Why Sidekiq over Multi-threaded Resque?
Back at Carbon Five I worked on improving Resque to use threads
instead of forking. That project was the basis for Sidekiq. I would
suggest using Sidekiq over that fork of Resque for a few reasons:
MT Resque was a one-off for a Carbon Five client and is not supported.
There are a number of bugs that were not solved, e.g. the web UI's
display of worker threads, because they were not important to the
client.
Sidekiq was built from the ground up to use threads via
Celluloid.
Sidekiq has middleware, which lets you do cool things in
the job lifespan. Resque doesn't support middleware like this
natively.
In short, MT Resque: a quick hack to save one client a lot of money, Sidekiq: the well designed solution for the same problem.

How can I monitor recurrent rake tasks run by heroku scheduler?

I just got last month's Heroku bill, and the scheduled rake tasks were a relatively heavy burden. We are pretty early in our development process, so we recently wrote some rake tasks just to get the job done and weren't much concerned with optimizing them.
Now we want to improve their performance and their usage of Heroku processing hours. We use New Relic to monitor the webapp's performance, but apparently this type of rake task is ignored by default, and it's unclear how to override that.
Has anyone had a similar problem? How can I track the scheduled tasks in close to real time to monitor performance, optimize, and not get surprise bills?
Whilst you can't really monitor rake tasks that well, there are a few little things you can do. One is the use of logging. Output start and end times of tasks to logs, and you can then see what's been happening duration wise. If you couple this with something like the Papertrail add-on then you can do additional interrogation later on.
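
As a rough illustration of the logging idea, something like the following could wrap the body of a rake task. log_duration is a hypothetical helper of mine, not part of Rake, Heroku or Papertrail, and the log line format is just one I made up so it is easy to search for later.

```ruby
require 'logger'

# Hypothetical helper: logs when a task starts and, once it ends (successfully
# or not), how long it took.
def log_duration(name, logger: Logger.new($stdout))
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  logger.info("task=#{name} status=started")
  yield
ensure
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  logger.info(format('task=%s status=finished duration=%.2fs', name, elapsed))
end

# Usage inside a rake task (task and class names are illustrative only):
#   task refresh_feeds: :environment do
#     log_duration('refresh_feeds') { FeedRefresher.run }
#   end
```

The monotonic clock is used so the measured duration is not thrown off if the system time changes while the task runs.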
As for running the jobs themselves, there are a couple of ways to run background processes, depending on how they need to run:
If you need to run jobs on a schedule, there are a few options available. Firstly there's the Heroku scheduler, which is pretty good, but doesn't guarantee executions will happen. Normally you would use this to kick off a rake task which will bring up a one-off dyno for the duration of the task, so you need to ensure in development that these tasks are as efficient as possible.
Alternatively, if you're looking at jobs that need a little more control, you can use a clock process. Essentially this is a dyno running 24/7 that does nothing but kick off other jobs at preset intervals and times. This would normally be done using the clockwork gem. The downside of this approach is that you need to pay for a clock process all the time.
A third approach, and one that might work, is delayed_job with its run_at option, which allows you to queue a job to be run in the future (and jobs can re-queue themselves). There are a few issues with this: a failure can kill the whole chain, and you need a full-time worker running to process them all.
Therefore, in order to minimize your bills, ensure that your rake tasks are as performant and reliable as possible, and then choose the scheduling option that suits you. If you're looking at schedules plus user-created events, delayed_job might be the best option. If you're looking at a few tasks running periodically, then go with the scheduler. If you're looking at running lots of time-critical jobs on a regular basis, go with clockwork.
Either way, you should be able to constrain a fair amount of processing into just one or two processes depending on your approach.
I know this question is almost 10 years old, but there is a new way!
You can now monitor your Heroku Scheduler jobs using One-off Dyno Metrics. This Heroku add-on gathers metrics for all detached one-off dynos running in your Heroku app. It was created to be an extension of Heroku's Application Metrics and works out of the box.
When you are running on Heroku Cedar there is a way to get a free setup for your workers. This doesn't answer your monitoring question, but it might be interesting anyway: http://blog.nofail.de/2011/07/heroku-cedar-background-jobs-for-free/
You can force the New Relic agent to start in your rake tasks and report their performance data.
Not the answer to the specific question, but...
One method of reducing overhead is using the Unicorn server to get multiple workers working on one dyno. It depends on your setup, but most people who've taken the time to test it can comfortably get 3-4 worker processes running concurrently. It's a huge boost in clearing queues or tasks. Just be careful not to max out the allocated memory for the dyno.

delayed_job, daemons or other gem for recurring background jobs

I need to build a background job that goes through a list of RSS feeds and analyze them say every 10 minutes.
I have been using delayed_job for handling background jobs and I like it a lot. I believe, though, that it's not built for recurring background jobs. I guess I can schedule the next background job at the end of every one (maybe with begin..rescue just to ensure it gets executed). Or preschedule, say, a month's worth of jobs in advance and have another job that reschedules them every month, etc.
This raised some concerns for me, as I started asking myself: what if the server goes down in the middle of execution and the jobs don't get scheduled?
I have also looked at the Daemons gem, which seems to run simple Ruby scripts with start/stop commands. I like the way delayed_job schedules and handles retries.
What do you recommend using in this case? What do you think the best way to design such a system with recurring background jobs? Also do you know a way I can monitor that background process and get notified if it stops?
I just implemented delayed_job for a similar task (using :run_at => 2.days.from_now) and found it to be a perfect fit. The easiest way to handle your concern about a process failing is to make creating the next job the first step of the job. Also, you can create a has_many relationship to the delayed_job model, which would allow you to access the :last_error. Or look at the "Hooks" section of the README, which has a perfect example for failure.
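
To make the "schedule the next job first" idea concrete, here is a minimal sketch. FeedPollJob, its queue accessor and analyze_feeds are hypothetical names for illustration; queue is assumed to be anything that responds to enqueue(job, run_at: time), which in a real app would be Delayed::Job.

```ruby
# Hypothetical self-rescheduling job.
FeedPollJob = Struct.new(:interval_seconds) do
  class << self
    # Anything responding to enqueue(job, run_at: time), e.g. Delayed::Job.
    attr_accessor :queue
  end

  def perform
    # First step: queue the next run, so a failure further down
    # cannot break the chain.
    next_run = Time.now + interval_seconds
    self.class.queue.enqueue(self.class.new(interval_seconds), run_at: next_run)
    analyze_feeds
  end

  def analyze_feeds
    # placeholder for the real work (fetching and analyzing the RSS feeds)
  end
end

# Usage in a Rails app (assumption): FeedPollJob.queue = Delayed::Job
# followed by Delayed::Job.enqueue(FeedPollJob.new(600)) to start the chain.
```

One thing to watch: if the job fails after queuing the next run and is then retried, the retry will queue another next run, so you may want to guard against duplicates.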
I think that this was a similar question: A cron job for rails: best practices? - not only are there answers, but also links to railscasts about background jobs in rails.
I used cron + delayed_job, but scheduled tasks were supposed to run few times a day, mostly just once.
Take a look at SimpleWorker. It's an elastic scheduling and background processing worker queue. It's cloud based and has persistence and redundancy so you don't need to worry if your servers go down or are restarted.
Very flexible in terms of scheduling, provides great introspection of jobs in the queue as well as notifications on status and errors.
Full disclosure: I work at SimpleWorker.

Resources