Unicorn workers dying for no reason - ruby-on-rails

All the unicorn workers are dying silently, no indication as to why, and I can't find any evidence of an external process killing them. I'm new to diagnosing this kind of stuff, and after several hours of research, experimenting, and trying to figure this out, I'm at a dead end.
Background info: it's a Rails 4.1 app on Ruby 2.0, running nginx and unicorn on an Ubuntu 14.04 server.
unicorn.rb
working_directory "/home/deployer/apps/ourapp/current"
pid "/home/deployer/apps/ourapp/current/tmp/pids/unicorn.pid"
stderr_path "/home/deployer/apps/ourapp/current/log/unicorn.log"
stdout_path "/home/deployer/apps/ourapp/current/log/unicorn.log"
listen "/tmp/unicorn.ourapp.sock"
worker_processes 2
timeout 30
excerpt from unicorn.log (last lines before it dies and after restart)
I, [2016-08-28T19:54:01.685757 #19559] INFO -- : worker=1 ready
I, [2016-08-28T19:54:01.817464 #19556] INFO -- : worker=0 ready
I, [2016-08-29T09:19:14.818267 #30343] INFO -- : unlinking existing socket=/tmp/unicorn.ourapp.sock
I, [2016-08-29T09:19:14.818639 #30343] INFO -- : listening on addr=/tmp/unicorn.ourapp.sock fd=10
I, [2016-08-29T09:19:14.818807 #30343] INFO -- : worker=0 spawning...
I, [2016-08-29T09:19:14.824358 #30343] INFO -- : worker=1 spawning...
Some pertinent info:
After a period of time ranging from about 8 - 20 hours, unicorn dies.
There's no error recorded in the unicorn log.
I searched all of /var/log for evidence of processes that were killed, and can only find one unrelated process that was killed a few days ago.
New Relic shows flat memory usage before the last random shutdown, with ruby using around 400 MB. It's currently at 480 MB with no problems, so I don't think it's hitting memory constraints.
Same with CPU usage...ruby was hovering around 0.1% before it died.
The last couple of times it died were in the middle of the night. The only requests coming in were from New Relic and Linode Longview monitoring.
Our production.log shows a last request before dying as a ping from New Relic. It Completed 200 OK in 264ms so it doesn't seem to be a request timing out.
It's happening in staging as well, and the log level is set to debug, and there are no additional clues in the staging logs.
Questions:
What could be killing the Unicorn workers that's not the out-of-memory manager or a shut down signal?
Could it be the OOM or a shut down signal, and it's being recorded in some place that I'm not looking, or just not being recorded for some reason?
Is there a way to capture what's happening to Unicorn in more detail?
I have no idea where to go from here, so any suggestions would be much appreciated.
UPDATE
As suggested, I used strace to find out that unicorn was being killed by an old crontab (I know I should have checked there earlier) added by the previous developers that was intended to restart the server every night. The stop command worked, but the start command was failing.
I still don't know why I wasn't able to find anything in my log searches, but after attaching strace to the main unicorn process (using something like strace -o /tmp/strace.out -s 2000 -fp <unicorn_process_id>), the strace log ended with a clear +++ killed by SIGKILL +++. I searched the logs again, and that led me to the crontab.
The underlying cause is probably pretty specific to my situation, but I'm really glad I know about strace now.
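For anyone curious how a parent process can tell that a child was killed by a signal rather than exiting on its own, here is a minimal Ruby sketch. It is not unicorn's code, and the helper name describe_exit is made up for illustration; it only shows the same information that strace reported (death by SIGKILL) as seen from a supervising process.

```ruby
# Hypothetical helper: turn a Process::Status into a short description.
def describe_exit(status)
  if status.signaled?
    "killed by SIG#{Signal.signame(status.termsig)}"
  else
    "exited with status #{status.exitstatus}"
  end
end

# Fork a child that only sleeps, then kill it with SIGKILL from outside.
child = fork { sleep }
Process.kill("KILL", child)
_pid, status = Process.wait2(child)

puts describe_exit(status) # prints "killed by SIGKILL"
```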

Related

Heroku Error H13

I've been getting this error now on & off for the past couple days since I deployed my application to heroku. It happens both before I started using unicorn as a server as well as afterwards. I can sometimes get it back up and running by using heroku run rake db:migrate then heroku restart but this only fixes it for a couple hours and it's broken again. As for the webpage it's saying "Application error". The logs aren't very helpful but here's what it says each time this error happens:
[2014-10-27T21:13:31.675956 #2] ERROR -- : worker=1 PID:8 timeout (16s > 15s), killing
[2014-10-27T21:13:31.731646 #14] INFO -- : worker=1 ready
[2014-10-27T21:13:31.694690 #2] ERROR -- : reaped #<Process::Status: pid 8 SIGKILL (signal 9)> worker=1
at=error code=H13 desc="Connection closed without response" method=GET
I'm just using the free version of Heroku. I want to make sure it works before upgrading, but is upgrading my only option at this point?
Also I am able to run this locally perfectly fine using either rails server or foreman start.
Heroku docs say this about H13:
H13 - Connection closed without response
This error is thrown when a process in your web dyno accepts a connection, but then closes the socket without writing anything to it.
One example where this might happen is when a Unicorn web server is configured with a timeout shorter than 30s and a request has not been processed by a worker before the timeout happens. In this case, Unicorn closes the connection before any data is written, resulting in an H13.
A couple lines up, you have an error about a process timing out after 15s:
ERROR -- : worker=1 PID:8 timeout (16s > 15s), killing
Heroku help has a section on timeout settings:
Depending on your language you may be able to set a timeout on the app server level. One example is Ruby’s Unicorn. In Unicorn you can set a timeout in config/unicorn.rb like this:
timeout 15
The timer will begin once Unicorn starts processing the request, if 15 seconds pass, then the master process will send a SIGKILL to the worker but no exception will be raised.
That matches the error messages in your log. I'd look into it.

How do I monitor (non-zero-downtime) Unicorn?

I'm finding an awful lot of conflicting information about monitoring Unicorn, with people saying other config scripts are wrong, and posting their own. There seems to be no main config that Just Works™
I'm assuming preload_app and zero-downtime deployment are the main culprit. I'd love to have that, but for now I'm more interested in just getting monitoring running, period. So currently I have all those settings turned off.
Right now I'm using capistrano-unicorn which is a really great gem.
It gives me all the capistrano deploy hooks I need to reload unicorn. The app has successfully deployed already with it.
The main thing I want to do now is...
a) Make sure unicorn starts up automatically on server failure/reboot
b) Monitor unicorn to restart workers that die/hang/whatever.
If I'm using this gem, what might be the best approach to complete my goals (keeping in mind I don't necessarily need zero downtime)?
Thanks
Engine Yard uses Monit, and it's a neat little utility that does exactly what you need!
Here is the configuration for unicorn:
check process unicorn
  with pidfile /path/to/unicorn.pid
  start program = "command to start unicorn"
    as uid yourUID and gid yourGID
  stop program = "command to stop unicorn"
    as uid yourUID and gid yourGID
  if mem > 255.0 MB for 2 cycles then restart
  if cpu > 100% for 2 cycles then restart

check process unicorn_worker0
  with pidfile /path/to/unicorn_worker_0.pid
  if mem > 255.0 MB for 2 cycles then exec "/bin/bash -c '/bin/kill -6 `cat /path/to/unicorn_worker_0.pid` && sleep 1'"
...

How to find out what causes unicorn worker timeouts

People keep complaining that my website hangs on some pages. I checked the unicorn stderr log and found many timeout errors like:
E, [2013-08-14T09:27:32.236478 #30027] ERROR -- : worker=5 PID:11619 timeout (601s > 600s), killing
E, [2013-08-14T09:27:32.252252 #30027] ERROR -- : reaped #<Process::Status: pid=11619,signaled(SIGKILL=9)> worker=5
I, [2013-08-14T09:27:32.266141 #4720] INFO -- : worker=5 ready
There are many error messages like that.
Then I went to the Rails production log and found the exact requests by searching for the unicorn error time minus 601s. These timed-out requests all stalled in the page rendering phase. Their SQL queries had already finished; the request just never completes:
Processing by XXXController#index as HTML
Rendered xxx/index.html.erb within layouts/application (41.4ms)
Rendered shared/_sidebar.html.erb (200.9ms)
There is no "Completed" line. Most requests to these pages are served successfully; I don't know why they hang at random times.
I have no idea what may cause this. Can anybody give me a clue about how to find the real reason the unicorn workers time out?
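The "error time minus 601s" lookup described above can be scripted. This is a minimal sketch, assuming the log line format quoted in this question; the constant TIMEOUT_LINE and the helper timeout_info are hypothetical names. The computed start time is only approximate, since the master checks for stuck workers periodically.

```ruby
require "time"

# Matches the "timeout (601s > 600s), killing" line format quoted above.
TIMEOUT_LINE = /\[(?<at>[^\s\]]+) #\d+\]\s+ERROR -- : worker=(?<worker>\d+) PID:(?<pid>\d+) timeout \((?<elapsed>\d+)s > (?<limit>\d+)s\)/

# Hypothetical helper: returns nil for non-matching lines, otherwise a hash
# with the worker number, its PID, and the approximate request start time.
def timeout_info(line)
  m = TIMEOUT_LINE.match(line)
  return nil unless m

  killed_at = Time.parse(m[:at]) # no zone in the log, so parsed as local time
  {
    worker: m[:worker].to_i,
    pid: m[:pid].to_i,
    killed_at: killed_at,
    started_around: killed_at - m[:elapsed].to_i
  }
end

line = "E, [2013-08-14T09:27:32.236478 #30027] ERROR -- : worker=5 PID:11619 timeout (601s > 600s), killing"
info = timeout_info(line)
puts info[:started_around].strftime("%Y-%m-%dT%H:%M:%S") # prints 2013-08-14T09:17:31
```

The printed timestamp is what to search for in the Rails production log to find the request that was being processed when the worker was killed.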
Update:
We used NSC to forward requests and responses to unicorn. To try to improve the timeout issue, we added nginx between NSC and unicorn. The unicorn worker timeouts still happen, and each one matches an nginx upstream timeout in the nginx error log.
Does anyone know whether there is some kind of bottleneck in unicorn's TCP connection handling?
I'm using Rack::Timeout to time out before unicorn does. Unicorn's timeout uses kill -9 (SIGKILL), which the worker cannot trap, so it gets no chance to log anything or clean up.
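To illustrate the idea of timing out inside the worker before the master's SIGKILL arrives, here is a minimal sketch built on Ruby's stdlib Timeout. It is not the rack-timeout gem's implementation; the class name SoftTimeout, the 0.1 second limit, and the two stand-in apps are all hypothetical. Note that Timeout can interrupt code at arbitrary points, so treat this as a diagnostic aid rather than a fix.

```ruby
require "timeout"

# Hypothetical Rack-style middleware: raises inside the worker when a request
# runs longer than `seconds`, so the failure can be rescued and logged.
class SoftTimeout
  def initialize(app, seconds)
    @app = app
    @seconds = seconds
  end

  def call(env)
    Timeout.timeout(@seconds) { @app.call(env) }
  rescue Timeout::Error
    # Unlike SIGKILL, an exception can be rescued, logged, and answered.
    warn "request timed out after #{@seconds}s"
    [503, { "Content-Type" => "text/plain" }, ["Request timed out\n"]]
  end
end

# Stand-in apps for demonstration; a real app would be the Rails application.
slow_app = ->(_env) { sleep 1; [200, {}, ["done"]] }
fast_app = ->(_env) { [200, {}, ["done"]] }

p SoftTimeout.new(slow_app, 0.1).call({}).first # prints 503
p SoftTimeout.new(fast_app, 0.1).call({}).first # prints 200
```

The limit would need to be shorter than the timeout value in unicorn.rb, otherwise the master kills the worker first.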

Rails oink with Heroku

This question has been asked before but no answer seems to work for me. I will break the problem down into its 3 components:
1) I occasionally receive a Heroku R14 (memory quota exceeded) error (i.e. the site has been up 2 days on Heroku and I got this error twice for a period of about 10-15 minutes [I was too emotional to count the time precisely]).
2) I installed the oink gem as advised by Heroku.
3) Oink definitely logs, as I get messages to that effect in heroku logs and in Webrick when I work locally. However, I am unable to access the logging summary that shows which functions exceed a memory threshold.
The only line that returns a result (but a wrong one) is :
oink --threshold=0 logfile_for_oink
But it returns empty lines as follows:
---- MEMORY THRESHOLD ----
THRESHOLD: 0 MB
-- SUMMARY --
Worst Requests:
Worst Actions:
Aggregated Totals:
Every other attempt - often copying advice already on StackOverflow - returns errors.
I will list the different attempts I have made (so no-one posts a suggestion I may have already tried) after this.
heroku run bundle exec oink --threshold=75 log/*
This line returns the following error:
/app/vendor/bundle/ruby/1.9.1/gems/oink-0.10.1/lib/oink/cli.rb:88:in `block in get_file_listing': Could not find "log/development.log" (RuntimeError)
Every variation on this, such as log/production.rb or /log/* or what have you has failed.
I also tried the advice on the following links to no avail:
Using oink gem with heroku
oink logs command not working on heroku
How can I run oink in heroku?
Can anyone help me?
Heroku prepends the log file with an additional timestamp so oink can't read it. You can use a regex though to fix it.
http://arches.io/2013/07/understand-memory-usage-on-heroku-rails-app-using-oink/
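The linked article is not reproduced here, but the general idea can be sketched in Ruby. This assumes Heroku log lines have the form "<ISO 8601 timestamp> <source>[<dyno>]: <original line>"; the regex HEROKU_PREFIX and the helper strip_heroku_prefix are hypothetical and may need adjusting to your actual log output.

```ruby
# Assumed Heroku prefix: "<ISO 8601 timestamp> <source>[<dyno>]: "
HEROKU_PREFIX = /\A\d{4}-\d{2}-\d{2}T\S+ \w+\[[\w.]+\]: /

# Hypothetical helper: drop the prefix if present, otherwise return the line unchanged.
def strip_heroku_prefix(line)
  line.sub(HEROKU_PREFIX, "")
end

sample = "2013-07-10T18:21:12.123456+00:00 app[web.1]: original log line"
puts strip_heroku_prefix(sample) # prints "original log line"
```

One could save the output of heroku logs to a local file, run each line through this helper, and point oink at the cleaned file instead of at log/* on the dyno.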

Cannot restart unicorn

I have a unicorn + nginx setup and suddenly when I run cap unicorn:upgrade (which sends a USR2 to the master process) it doesn't prefix the .pid file and it doesn't fork a new master process at all. When I open the log file I can see the line
reaped #<Process::Status: pid 32448 exit 10> exec()-ed
Can anyone suggest how to find out what's wrong?
Thanks
Does your unicorn config have preload_app(true) ? You may need to send a QUIT signal if it does.
