custom clock process on heroku - ruby-on-rails

I've just inherited a Rails project and before it was on a typical 'nix server. The decision was made to move it to heroku for the client and it up to me to get the background process working.
Currently it uses Whenever to schedule daily events(email etc) and fire up the delayed job queue on boot.
Heroku provides an example of documentation for a custom clock process using clockwork, can I going by this example use it with whenever? Any pitfalls I might come across? Will I need to create a separate worker dyno?
Scheduled Jobs and Custom Clock Processes in Ruby with Clockwork

Yes -- Heroku's Cedar stack lets you run whatever you want.
The basic building block of the Cedar stack is the dyno. Each dyno gets an ephemeral copy of your application, 512 MB of RAM, and a bunch of shared CPU time. Web dynos are expected to bind an HTTP server to the port specified in the $PORT environment variable, since that's where Heroku will send HTTP requests, but other than that, web dynos are identical to other types of dynos.
Your application tells Heroku how to run its various components by defining them in the Procfile. (See Declaring and Scaling Process Types with Procfile.) The Clock Processes article demonstrates a pattern where you use a worker (i.e. non-web) dyno to enqueue work based on arbitrary criteria. Again, you can do whatever you want here -- just define it in a Procfile and Heroku will happily run it. If you go with a clock process (e.g. a 24x7 whenever), you'll be using a whole dyno ($0.05/hour) to do nothing but schedule work.
In your case, I'd consider switching from Whenever to Heroku Scheduler. Scheduler is basically a Heroku-run cron, where the crontab entries are "spin up a dyno and run this command". You'll still pay $0.05/hour for the extra dynos, but unlike the clock + worker setup, you'll only pay for the time they actually spend running. It cleanly separates periodic tasks from the steady-state web + worker traffic, and it's usually significantly cheaper too.
The only other word of warning is that running periodic tasks in distributed systems is complex and has complex failure modes. Some of the platform incidents (corresponding with the big EC2 outages) have resulted in things like 2 simultaneous clock processes and duplicate scheduler runs. If you're doing something that needs to run serially (like emailing people once a day), consider guarding it with RDBMS locking, and double-checking that it's actually been ~23 hours since your daily job.

Heroku Scheduler is often a bad option for production use because it's unreliable and will skip running its tasks sometimes.
The good news is that if you run a jobs queue dyno with Sidekiq there are scheduling plugins for it, e.g. sidekiq-cron. With that you can use the same dyno for scheduling. And if you don't have a jobs worker yet you need to set it up just for scheduling if you need to run it reliably.
P.S. if you happen to run Delayed::Job for jobs queing there are scheduling plugins for it, too, e.g. this one.

Related

Is it possible to redeploy a Heroku app without restarting some process types

I'm running a Rails app on Heroku, and I have defined a custom process type to perform some long-running jobs, really long-running, a job can easily take something about an hour or more. I know it's better to split it into some small chunks, but that's quite problematic for that task.
And the issue is that when I push a new version — Heroku restarts all the dynos (web, workers, long workers — everything). I wonder is it possible to restart only some process types, e.g. only the web dynos?
No, that isn't possible. The easiest and most scalable way around this would be to split your long-running jobs into smaller chunks.
That way, you would have a lot of very small jobs being processed very quickly. When your app is restarted, you would be able to restart your process, as it wouldn't stop a long-running job.
Alternatively, one-off dynos won't be restarted when your app is deployed.
Using the heroku api, you can programmatically boot one-off dynos. Using that, you could start a one-off dyno for each long-running job you need to process.
That job would be processed (for up to 24 hours, where it would be cycled), and you would be able to deploy your app without restarting it.

On Heroku, using Ruby on Rails, I want to identify individual web servers

When an application on Heroku is restarted, I would like to run some set-up tasks. However, I only want them to be run once.
If I put the tasks in an initializer and I have multiple web dynos set up, every dyno will run the task.
Is it possible to identify individual web dynos, and perhaps which is the first to spin up?
I doubt you can effectively (read: easily) assign any of the web processes unique ownership of the task management, since you never know which Dyno will shutdown first and how long before a new Dyno is spawned.
i.e.:
If the first Dyno to be spawned is the owner, the "lock" on the task management might persist forever... even if the first Dyno closes down and other Dynos keep working.
If the last Dyno to be spawned is the owner, if the owner dies no tasks will be performed until a new Dyno is spawned (which might take a while).
On the other hand, it seems your tasks are already in a persistent storage.
If tasks were checked out using database transactions, or (for non SQL storage) using a single threaded database such as Redis with a List, you could probably have all the Dynos performing the tasks without any task being performed more than once.
...
Another option, much favored for simplicity and for manageability, is to use a single worker Dyno for your background tasks. This Dyno will run a separate script, maybe sharing some code (and maybe not) with the web dynos.
If I were paying Heroku for multiple web Dynos, it would only make sense to deploy a worker Dyno for these tasks.
EDIT (you can ignore this, it's just a few more thoughts)
Even though the best solution is probably using a single worker Dyno, here are a few more thoughts.
Runtime Flags
This thought came to me since you implied that it would be nice to be able to scale down to a single Dyno...
You can use a runtime flag set in the Procfile which will tell a specific Dyno to process tasks or not process tasks. During runtime you can check the ARGV array for the flag.
Here is an experimental example Procfile with this distinction for a Rails application:
web: bundle exec rails server -p $PORT run_tasks
onlyweb: bundle exec rails server -p $PORT
When scaling, make sure you only scale the non task running Dynos:
heroku ps:scale web=1 onlyweb=4
In your application you can check this during runtime:
if ARGV.include? 'run_tasks'
# initialize tasks for background handling.
end
API to activate task handling
Another possibility is to use a combination of Kernel.at_exit and an HTTP API call.
This solution is a bit like setting up a digital game of "tag", and should probably include the following steps:
Create an HTTP route that will accept a "tag". Upon receiving such a request, the receiving Dyno will become the task executing Dyno (will become "IT"), initializing their tasks stack AND storing their UUID in the Database (you can use require 'securerandom'; SecureRandom.uuid for this).
Between each task, have the "IT" Dyno check if it's still "IT" by checking that their UUID is still the one in the Database.
Create an initialization sequence where the last web Dyno takes over the task handling (the new player always becomes "IT").
Setup a callback using Kernel.at_exit that will attempt to "tag" a new Dyno in the event that the exiting Dyno is the task executing Dyno ("IT")...
These steps should ensure that there will always exist a task executing Dyno even while dynamically scaling up or down.
However, Now that I think of it, this might be considered a variant on the initial transactions concept, as the review of the UUID and the task being checked out of the task stack should probably both occurs within a transaction :-/

How long can a sucker_punch worker run for on heroku?

I have sucker_punch worker which is processing a csv file, I initially had a problem with the csv file disappearing when the dyno powered down, to fix that i'm gonna set up s3 for file storage.
But my current concern is whether a dyno powering down will stop my worker in it's tracks.
How can I prevent that?
Since sucker_punch uses a separate thread on the same dyno and does not use an external queue or persistence (the way delayed_job, sidekiq, and resque do) you will be subject to losing the job when your dyno gets rebooted or stopped and you'll have no way to restart the job. On Heroku, dynos are rebooted at least once a day. If you need persistence and the ability to retry a job in the event a dyno goes down, I'd say switch to one of the other job libraries:
https://github.com/collectiveidea/delayed_job
https://github.com/mperham/sidekiq
https://github.com/resque/resque
However, these require using a Heroku Addon. You can get a way with the free version but you will still have to pay for the extra worker process. Other than that you'd have to implement your own persistence and retrying by wrapping sucker_punch. Here's a discussion on adding those features to sucker_punch: https://github.com/brandonhilkert/sucker_punch/issues/21 They basically say to use Sidekiq instead.

Should I have a heroku worker dyno for poll a AWS SQS?

Im confusing about where should I have a script polling an Aws Sqs inside a Rails application.
If I use a thread inside the web app probably it will use cpu cycles to listen this queue forever and then affecting performance.
And if I reserve a single heroku worker dyno it costs $34.50 per month. It makes sense to pay this price for it for a single queue poll? Or it's not the case to use a worker for it?
The script code:
What it does: Listen to converted pdfs. Gets the responde and creates the object into a postgres database.
queue = AWS::SQS::Queue.new(SQSADDR['my_queue'])
queue.poll do |msg|
...
id = received_message['document_id']
#document = Document.find(id)
#document.converted_at = Time.now
...
end
I need help!! Thanks
You have three basic options:
Do background work as part of a worker dyno. This is the easiest, most straightforward option because it's the thing that's most appropriate. Your web processes handle incoming HTTP requests, and your worker process handles the SQS messages. Done.
Do background work as part of your web dyno. This might mean spinning up another thread (and dealing with the issues that can cause in Rails), or it might mean forking a subprocess to do background processing. Whatever happens, bear in mind the 512 MB limit of RAM consumed by a dyno, and since I'm assuming you have only one web dyno, be aware that dyno idling means your app likely isn't running 24x7. Also, this option smells bad because it's generally against the spirit of the 12-factor app.
Do background work as one-off processes. Make e.g. a rake handle_sqs task that processes the queue and exits once it's empty. Heroku Scheduler is ideal: have it run once every 20 minutes or something. You'll pay for the one-off dyno for as long as it runs, but since that's only a few seconds if the queue is empty, it costs less than an always-on worker. Alternately, your web app could use the Heroku API to launch a one-off process, programmatically running the equivalent heroku run rake handle_sqs.

How can I monitor recurrent rake tasks run by heroku scheduler?

I just got the last month heroku bill, and the scheduled rake tasks were a relatively heavy burden. We are pretty early in our development process, so we just developed some rake tasks to get the job done recently, and didn't had much concern in theirs optimization.
Now we want to improve theirs performance and theirs heroku processing hours usage. We use New Relic to monitor the webapp performance, but apparently this type of rake tasks are ignored by default, and it's unclear how to override that.
Anyone had a similiar problem? How can I track the scheduled tasks in close to real time to monitor performance, optimize, and don't get suprise bills?
Whilst you can't really monitor rake tasks that well, there are a few little things you can do. One is the use of logging. Output start and end times of tasks to logs, and you can then see what's been happening duration wise. If you couple this with something like the Papertrail add-on then you can do additional interrogation later on.
As for running the jobs themselves, there's a couple of ways that you can run background processes which are dependant on how they need to run:
If you're needing to run jobs on a schedule, there's a few options available. Firstly there's the Heroku scheduler, which is pretty good, but doesn't guarantee executions will happen. Normally you would use this to kick off a rake task which will bring up a one-off dyno for the duration of the task - therefore you need to ensure in development that these tasks are as efficient as possible.
Alternatively, if you're looking at jobs that need a little more control or using a clock process. Essentially this is a dyno running 24/7 that does nothing but kick off other jobs at preset intervals and times. This would normally be done using the clockwork gem. The downside of this approach is that you need to pay for a clock process all the time.
A third approach, and one that might work is delayed job, with it's runat option, allowing you to queue a job to be run in the future (and jobs can re-queue themselves). There are a few issues with this in that a failure can kill the whole chain, and you need a full time worker running to process them all.
Therefore, in order to minimize your bills, ensure that your rake tasks are as performant and reliable, and then choose the scheduling option that suits you. If you're looking at schedules plus user created events, delayed_job might be the best option. If you're looking at a few tasks running periodically, then go scheduler. If you're looking at running lots of time critical jobs on a regular basis, go with clockwork.
Either way, you should be able to constrain a fair amount of processing into just one or two processes depending on your approach.
I know this question is almost 10 years old, but there is a new way!
You can now monitor your Heroku Scheduler jobs using One-off Dyno Metrics. This Heroku add-on gathers metrics for all detached one-off dynos running in your Heroku app. It was created to be an extension of Heroku's Application Metrics and works out of the box.
when you are running on heroku cedar there is a way to get a free setup for your workers. this is no answer to your monitoring question, but it might be interesting anyways: http://blog.nofail.de/2011/07/heroku-cedar-background-jobs-for-free/
You can force the New Relic agent to start in your rake tasks and report their performance data.
Not the answer to the specific question,but...
One method of reducing overhead is using Unicorn server to get multiple workers working on one dyno. It depends on your set up, but most people who've taken the time to test it can comfortably get 3 - 4 worker processes running concurrently. It's a huge boost in clearing cues or tasks. Just be careful not to max out the allocated memory for the dyno.

Resources