Tell Sidekiq to use all available Heroku workers

I need to batch process a large set of files (millions of database records) as quickly as possible. For this purpose, I split the files into 3 directories and set up Sidekiq with the standard configuration (no config file).
I then started 3 Heroku workers and called 3 methods, which started 3 Sidekiq workers, all with the "default" queue. Initially, Sidekiq used 2 Heroku workers and after a while it decided to use only 1 worker.
How can I force Sidekiq to use all 3 workers to get the job done asap?
Thanks

I found the solution at the bottom of this page: http://manuelvanrijn.nl/blog/2012/11/13/sidekiq-on-heroku-with-redistogo-nano/
# config/sidekiq.yml
:concurrency: 1
# Procfile
web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb
worker: bundle exec sidekiq -e production -C config/sidekiq.yml
Also, if you have many workers and a free / cheap Redis instance, make sure you limit the number of connections from each worker to the Redis server:
# config/initializers/sidekiq.rb
require 'sidekiq'

Sidekiq.configure_client do |config|
  config.redis = { :size => 1 }
end

Sidekiq.configure_server do |config|
  config.redis = { :size => 2 }
end
You can calculate the maximum number of connections here: http://manuelvanrijn.nl/sidekiq-heroku-redis-calc/
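As a rough worked example with the settings above, assuming 3 worker dynos and 1 web dyno: the workers need about 3 x 2 = 6 server connections and the web dyno about 1 x 1 = 1 client connection, roughly 7 in total, which stays under the 10-connection limit of the free Redis To Go Nano plan.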

I wanted to clarify a few things about your question. Your question reads "Tell Sidekiq to use all available Heroku workers". In fact, on each dyno a Sidekiq process executes via a command like bundle exec sidekiq -e production -C config/sidekiq.yml. Each of these Sidekiq processes can run multiple threads, as specified in config/sidekiq.yml with a line like :concurrency: 3, which is what the Sidekiq docs recommend for a Heroku standard-2x dyno, since it only has 1 GB of memory (see https://github.com/mperham/sidekiq/wiki/Heroku for more details).
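A minimal sketch of that per-dyno setting, using the queue name from the question:
# config/sidekiq.yml
:concurrency: 3
:queues:
  - default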
But technically you don't need to tell Sidekiq to use all available Heroku processes. There is another key element here: the Redis server. Your main app publishes jobs to the Redis server, and each Sidekiq process, on whichever dyno it runs, can be configured with the same queue; all of them are then subscribed to that queue and pull jobs from it. This is clearly stated by the creator of Sidekiq on the Sidekiq GitHub page: https://github.com/mperham/sidekiq/issues/3603.
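To illustrate the publishing side, a hedged sketch (HardWorker is a hypothetical job class):
# any process with access to Redis can enqueue to the shared queue;
# every Sidekiq process watching 'default' competes to pull these jobs
Sidekiq::Client.push('class' => 'HardWorker', 'queue' => 'default', 'args' => [42])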
There are a couple of key points for sharing the load. First, restrict the concurrency of a given Sidekiq process to a number like the one mentioned above. Second, limit the connections to the Redis server from within Sidekiq.configure_client. Finally, think of Heroku's load balancing somewhat differently from how an ALB in AWS works. An ALB distributes traffic round-robin to instances in Target Groups, scaled on metrics defined in Launch Templates and Auto Scaling Groups, such as vCPU utilization, memory utilization, and read/write IO. The load balancing here is more like a publish-subscribe system: Sidekiq processes take work when they are able to, subject to their limits on concurrency and on connections to the Redis server.
Finally, Heroku discourages long-running jobs. The longer your job runs, the more memory it eats up. Heroku dynos are expensive: a standard-2x costs about 4x a t3.micro in AWS for the same vCPU and memory (1 GB). Furthermore, in AWS you can create a spot fleet, where you purchase compute for around 10 percent of the on-demand price and execute spot instances as batch jobs; AWS also has a service called AWS Batch. The spot fleet option doesn't exist in Heroku, so it's important to keep price in mind, and therefore how long a job runs. Read this article, where Heroku explains why long-running jobs are a bad fit for its environment: https://devcenter.heroku.com/articles/scaling#understanding-concurrency. Try to keep a job under 2 minutes.
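Given the original question (millions of records), one way to stay under that guideline is to split the work into many short jobs. A minimal sketch, where FileRecord and the batch size of 1,000 are hypothetical:
# app/workers/file_import_job.rb
class FileImportJob
  include Sidekiq::Worker
  sidekiq_options queue: 'default'

  # each job processes one small slice, keeping its runtime short
  def perform(batch_start, batch_size)
    FileRecord.where(id: batch_start...(batch_start + batch_size)).find_each do |record|
      # ... process one record ...
    end
  end
end

# enqueue the whole set as many small jobs:
(0..FileRecord.maximum(:id)).step(1_000) do |start|
  FileImportJob.perform_async(start, 1_000)
end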

Related

Is Puma WEB_CONCURRENCY on a per Dyno basis for Heroku?

I'm using Puma web server on Heroku, and currently have 3 of the standard 2x dynos. The app is Ruby on Rails.
My understanding is that increasing WEB_CONCURRENCY in config/puma.rb increases the number of Puma workers, at the expense of additional RAM usage.
Current Setup:
workers ENV.fetch("WEB_CONCURRENCY") { 5 }
Question:
Are the 5 concurrent workers on a per-dyno basis, or overall?
If I have 3 dynos, does this mean I have 15 workers, or only 5?
I was previously looking for a way to check the current number of existing workers, but couldn't find any commands to do this on Heroku.
Yes, web concurrency is on a per-dyno basis.
Each dyno is an independent container, running on a different server, so you should treat each dyno as an independent server. With 3 dynos and WEB_CONCURRENCY set to 5, you have 15 Puma workers in total, not 5.
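For reference, a minimal config/puma.rb sketch of the setup described (values taken from the question):
# config/puma.rb
# each dyno boots its own Puma master, which forks this many workers,
# so the app-wide total is (number of dynos) x WEB_CONCURRENCY
workers ENV.fetch("WEB_CONCURRENCY") { 5 }

threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
threads threads_count, threads_count

preload_app!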

One Heroku worker process for both delayed_job and sidekiq?

Currently our Heroku app has two dynos: web and worker. The worker dyno is set up to run bundle exec rake jobs:work, which starts up delayed_job. I have some new Sidekiq jobs that I also need to run. (I plan to convert our delayed_job jobs to Sidekiq soon, but haven't yet.) My question is: do I need to add and pay for a third Heroku dyno ("sidekiqworker"?), or is there a way for me to specify that my existing worker dyno run both delayed_job and Sidekiq?
Unfortunately, you will need to pay for a third Heroku dyno. I've experimented with naming both processes "worker", but only one would be registered while the other one wouldn't be. When you add a new process name, Heroku updates and sets that new process type to 0 dynos.
Refer to this for more details: multiple worker/web processes on a single heroku app
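A sketch of what the Procfile would look like with a separate process type for Sidekiq; each named process type below is billed as its own dyno (the sidekiqworker name and the web command are illustrative):
# Procfile
web: bundle exec puma -C config/puma.rb
worker: bundle exec rake jobs:work
sidekiqworker: bundle exec sidekiq -e production -C config/sidekiq.yml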

How does the Sidekiq server process pull jobs from the queue in Redis?

I have two Rails applications running on two different instances (let's say Server1 and Server2), but they have similar code and share the same PostgreSQL DB.
I installed Sidekiq and push jobs onto the queue from both servers, but I'm running the Sidekiq process only on Server1.
I have a single Redis server, running on Server1, which is shared with Server2.
If a job is pushed from Server2, it gets processed by Server1's Sidekiq process, which is what I actually wanted.
My question is
How does the Sidekiq process on Server1 know that a job has been pushed to Redis?
Does the Sidekiq process continuously check the Redis server for new jobs, or does the Redis server notify the Sidekiq process about a new job?
I'm confused and amazed by this!
Could anyone please clarify how the Sidekiq process gets jobs from the Redis server?
It will be helpful for newbies like me.
Sidekiq uses a Redis command named BRPOP.
This command pops an element from a list (which is your job queue); if the list is empty, it blocks until an element appears and then pops/returns it. It also works with multiple queues at the same time.
So no, Sidekiq does not busy-poll Redis for enqueued jobs, and Redis does not push notifications to Sidekiq.
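A small sketch of the underlying command using the redis-rb gem (queue names are illustrative; Sidekiq's actual keys look like queue:<name>):
require 'redis'

redis = Redis.new

# blocks until an element is available on either list, then atomically
# pops and returns [queue_name, payload]; returns nil on timeout
queue, payload = redis.brpop('queue:default', 'queue:mailers', timeout: 2)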
Sidekiq does poll Redis, but for scheduled (future) jobs rather than newly enqueued ones. The default average poll interval is 5 seconds, as defined in the Sidekiq source in lib/sidekiq/config.rb, and it can be overridden in your own configuration:
# lib/sidekiq/config.rb
average_scheduled_poll_interval: 5
By the way, enqueued jobs are stored in Redis as a list, and Sidekiq retrieves them using the BRPOP (blocking right pop) command to avoid any race conditions. This ensures that multiple Sidekiq processes running on different instances retrieve jobs in a coordinated manner.
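If you need a different interval, a hedged sketch for Sidekiq 7.x (on 6.x the same option is set through config.options; check your version's docs):
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  # how often, on average, each process checks the scheduled/retry sets
  config[:average_scheduled_poll_interval] = 10
end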

Why can I see so many workers at Heroku? How do I restrict access to other workers?

When I access resque-web on my Rails app running at Heroku, I can see more than 40 workers.
I have only 1 resque worker connected to my Heroku account. This worker processes all my queues:
resque: env TERM_CHILD=1 COUNT=1 QUEUE=* bundle exec rake resque:workers
Is there a way I can prevent other people's workers from interfering with my queue?
I'm using the Redis Cloud (Redis Labs) add-on from Heroku.
Since your Redis Cloud instance is password-protected, it is unlikely that these are other people's workers. I'd venture a guess that they are simply stale (i.e. dead) workers.
Since Resque workers register and keep their state in Redis, it is not uncommon that when a worker dies, its state information is left behind in Redis. This SO question provides more pointers on how to deal with the situation.
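A minimal console sketch for clearing those registrations, assuming the standard resque gem API; only run it when you are sure the listed workers are dead, since it removes live registrations too:
# Rails console
Resque.workers.each do |worker|
  worker.unregister_worker  # removes this worker's registration from Redis
end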

How can I run Rails background jobs on AWS Elastic Beanstalk?

I just started to use AWS Elastic Beanstalk with my Rails app, and I need to use the Resque gem for background jobs. However, despite searching extensively for how to run a Resque worker on Elastic Beanstalk, I haven't been able to figure it out.
How can I run Rails background jobs with Resque on AWS Elastic Beanstalk? talks about running those as services in Elastic Beanstalk containers, but it is still very confusing.
Here is my .ebextensions resque.config file:
services:
  sysvinit:
    resque_worker:
      enabled: true
      ensureRunning: true
commands:
  resque_starter:
    rake resque:work QUEUE='*'
EDIT
Now my resque.config file looks like this:
container_commands:
  resque_starter: "rake resque:work QUEUE='*'"
services:
  sysvinit:
    resque_worker:
      enabled: true
      ensureRunning: true
      commands:
        resque_starter
And it is still not working.
EDIT 2
container_commands:
  resque_starter:
    command: "rake resque:work QUEUE=sqs_message_sender_queue"
    cwd: /var/app/current/
    ignoreErrors: true
Still it shows 0 workers.
I think it is suboptimal to run a queue worker, like Resque, inside an Elastic Beanstalk web environment. A web environment is intended to host web applications and spawns new instances when traffic and load increase; it would not make sense to end up with multiple Resque workers, each running on one of those instances.
Elastic Beanstalk offers worker environments, which are intended to host code that executes background tasks. These worker environments pull jobs from an Amazon SQS queue (which makes an additional queue solution, like Resque, obsolete). An Amazon SQS queue scales easily and is easier to maintain (AWS just does it for you).
Using worker environments, which come with Amazon SQS queues, makes more sense, as they are supported out of the box and fit nicely into the Elastic Beanstalk landscape. There is also a gem, Active Elastic Job, which makes it simple for Rails >= 4.2 applications to run background tasks in worker environments.
Disclaimer: I'm the author of Active Elastic Job.
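A sketch of wiring that up via the gem's Active Job adapter (adapter name per its README; double-check against the current docs):
# config/environments/production.rb
config.active_job.queue_adapter = :active_elastic_job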
First of all, I would recommend running Resque under supervisord; this helps make sure the worker is restarted if the process dies.
As for running a command on every deploy:
Log in to your Beanstalk instance over SSH and go to the folder /opt/elasticbeanstalk/hooks/appdeploy/.
There you will find the list of hooks that execute on every deploy, and you can put your own script there so it runs on each deploy. With the same approach you can put a script
into the hooks responsible for restarting the application server, giving you the ability to restart your background job without connecting over SSH.
Another option for running the command that starts your background worker is to use container_commands instead of commands.
Also, have a look at the best articles I have found about customizing Beanstalk: http://www.hudku.com/blog/tag/elastic-beanstalk/. It is a good starting point for customizing a Beanstalk environment to your needs.
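A hypothetical supervisord entry for the Resque worker (paths and program name are assumptions):
# /etc/supervisor/conf.d/resque_worker.conf
[program:resque_worker]
command=bundle exec rake resque:work QUEUE=*
directory=/var/app/current
autostart=true
autorestart=true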