Unicorn workers freeze from time to time - ruby-on-rails

A couple of days ago I noticed a strange thing - from time to time server stops processing request for some time. At the top output it looks like this:
ten Unicorn workers process requests;
then, for some reason, they stop doing anything. I mean, all ten workers have 'sleeping' status;
for a ten-fifteen seconds they sleep;
and then suddenly all then workers at the same time start processing requests (lots of them were queued for 10s);
I have the following setup:
nginx, unicorn 4.6.2, postgres, redis for sessions and cache, MRI ruby 2.0.0p353.
My first thought was to blame redid (because if redis doesn't give sessions, all process will wait for it), but it seems it is not the case, because while unicorn workers freeze, redis serving other processes that do background jobs.
I don't understand what is the reason of this strange behaviour.
If someone have some thoughts on the matter I would gladly check it. If you need additional information - just tell me what to do, and I'll try to provide it.
UPDATE:
Unicorn config
strace on unicorn worker
strace on unicorn master
strace on nginx

It turned out (with the help of strace on worker processes) workers were trying to write logs on the disk. Disk was heavy loaded and processes were blocked.

Related

How to safely shut down Sidekiq on Heroku

I guess I need a sanity check here because if I want to prevent any sidekiq jobs from ending prematurely, Heroku Redis should handle this for me?
When I want to push new changes to a production site, I put the application in maintenance mode: heroku maintenance:on. Now when I do this and run heroku ps I can see both my web process and my worker (i.e. sidekiq) are up still (makes sense because its just to prevent users having access to the site).
If I shut down the worker dyno with a command like this: heroku ps:stop worker after the site is in maintenance mode, will this safely stop sidekiq workers before it does down? Also, from Sidekiq's documentation:
https://github.com/mperham/sidekiq/wiki/Deployment#heroku
It mentions a -t N switch where N is a number in seconds but that Heroku has a hard limit of allowing a process 30 seconds to shut down on its own. Am I correct that if I stop the worker process with the heroku command, it will give any currently running jobs N seconds to finish before giving it a SIGTERM signal?
If not, what additional steps do I need to take to make sure Sidekiq has safely shut down?
Sounds like you are fine. Heroku sends SIGTERM when you call ps:stop. Sending SIGTERM tells Sidekiq to shut down within N seconds. Your worker dyno should be safely down within 30 seconds.

Delayed Jobs workers getting timed out on engineyard

I think i'm having a problem where engineyard is adding a timeout to some of my delayed job workers, (seems to be 10 minutes). I have a copy process that can run for > 10 minutes and everytime it gets to that 10 minutes threshold the job is killed. Is there anyway to configure the engineyard timeout for worker instances?? I'm looking through and all I see is timeouts regarding nginx/apache
There isn't a timeout set for the Delayed Job workers, so this is more likely a memory usage issue. Monit tracks the memory consumed by the workers and will restart those that reach a set threshold. Monit's actions will be logged in /var/log/syslog, so this can be checked to confirm if Monit is terminating the workers. The memory threshold is set in the /etc/monit.d/delayed_job.monitrc file(s) and can be increased to fit the workers' requirements. After alteration of the configuration Monit must be reloaded using sudo monit reload.
If you submit a ticket at https://support.cloud.engineyard.com the support staff will be more than happy to help you further diagnose this issue.

How to detect and prevent spawning failing Unicorn workers

Situation: I am using Rails + Unicorn, deploying with Capistrano. Sometimes Rails app fails to start in production mode (though it is not the real production, but a staging env). This usually happens due to errors in deploy scripts or configuration (thus usually not detectable by tests). When this happens, unicorn master process kills the worker that failed and spawns a new one, which also fails and so on and so forth. During all that time unicorn consumes lots of CPU and pollutes logs with the same message.
Manual way (not good): Go to your home page to see if it works. Look at the htop. Tail the logs. Kill unicorn manually. Cons: easy to forget. Logs are polluted, CPU is loaded while you are reacting.
Another solution: Use unicorn's preload_app true. This will cause master process to fail fast. Cons: higher memory consumption in happy scenario.
Best practice: - ???
Is there any way to cleverly detect that unicorn master uselessly tries to spawn failing children and stop it?
You have something like "unicorn start" in your Capistrano script right? Make your Capistrano script ping Unicorn right after invoking that command. If Unicorn does not return an expected response within a timeout, then you know that something went wrong, and you can choose to rollback the deploy or performing some other action.
As for how to ping Unicorn, that depends. If you have Unicorn listening on a TCP socket then you can use curl. If you have Unicorn listening on a Unix domain socket then you have to write a little script that connects to it, like this:
require 'socket'
sock = UNIXSocket.new('/path-to-unicorn.sock')
sock.write("HEAD / HTTP/1.0\r\n")
sock.write("Host: www.foo.com\r\n")
sock.write("Connection: close\r\n")
sock.write("\r\n")
if sock.read !~ /something/
exit 1
end
But it sounds like Phusion Passenger Enterprise solves your problem beautifully. It has this feature called "deployment error resistance". When you deploy a new version and Phusion Passenger detects that it cannot spawn any processes for your new codebase, it will stop trying to spawn your new version and keep the processes for the old versions around indefinitely, until you manually give the signal that it's okay to spawn processes for the new version. In the mean time it will log all errors into the log file so that you can analyze the problem.
I would suggest brushing off your bash skills. The functionality you need is already in Unicorn as it leverages the Unix-y master/worker process.
You need a init.d script. Or at the very least godrb or monit. I recommend the init.d script route AND monitoring. Its more complex, but it can more easily be leveraged by your monitoring software and also gives you an automatic start on reboot.
The gist of it is:
Send the USR2 signal to the unicorn master process, this will fork the master process.
Then send the WINCH to the old master process that gets created, this will kill each worker.
Then you can send the old master process the QUIT signal.
Unicorn Signals
This will spin up a new master process running the new code and label the old one as (old). If it fails the old one should be returned to its prior state and you shouldn't suffer an outage, just a restart error. This is the beauty of unicorn. You can almost get instantaneous deploys of your code.
I'm using a lot of hedge words because I did this work on my apps over a year ago so there are a lot of cobwebs upstairs. Hope this helps!
This is by no mean a correct script. Its a good starting point though ... feel free to update the gist if you can improve upon it! :-)
Example Unicorn Control Script

How to do rolling restart with Unicorn?

Suppose I make a little change to my rails app such as changing the html layout. How would I do a rolling restart with Unicorn? Effectively one would like to bring up unicorn processes(or workers instead?) for the newest version of the rails app and then switch traffic from the old unicorn processes/workers to the new ones atomically. From Google searches I couldn't quite get a concrete definitive explanation of how to do this and all the gotchas surrounding it.
There are multiple methods, but one of them is as follows:
Send SIGUSR2 to the master process. Unicorn start a new master with worker processes, that live in parallel with your old master and old worker processes.
Wait until the new master and worker processes have started.
Kill the old master.
Source: http://unicorn.bogomips.org/SIGNALS.html
This is not very memory friendly though. You temporarily need twice the memory usage.
Phusion Passenger Enterprise supports rolling restarts (along with other cool features) but it restarts processes one-by-one and so does not need as much memory. It is possible to script one-by-one rolling restarts in Unicorn using the TTIN and TTOUT signals but Phusion Passenger does everything automatically for you without scripting.

Pragmatic ways to monitor Resque queues in Rails

I am looking to automate the starting/restarting of queues with Resque in my Ruby on Rails application. (running on JRuby)
I want to ensure the following criteria are met:
Workers are started after I deploy with capistrano
Workers are restarted if they die for whatever reason
Workers eating too much memory are stopped/restarted and can fire me an email alert
Are there tools that current provide this functionality or at least a subset of it? If there isn't anything that restarts the queue/worker, I would like to be notified at minimum so I can manually do it.
The easiest way to do it would be using a program such as God or Monit to get #2 and #3. For #1, you can just setup your Capistrano script to send a kill -INT to all the Resque workers, then the monitoring program will start them up again.
The advantaged to using kill -INT rather than manually stopping and starting the jobs in the Capistrano script is that your deploy won't have to wait for every worker to stop processing its job to start them back up. It also means if you have a long running job, you will quickly have whatever free workers were running on the new code as quickly as possible.
I'm not especially familiar with it, however I believe the god gem is used frequently for process management.

Resources