People keep reporting that my website hangs on certain pages. I checked the Unicorn stderr log and found many timeout errors like:
E, [2013-08-14T09:27:32.236478 #30027] ERROR -- : worker=5 PID:11619 timeout (601s > 600s), killing
E, [2013-08-14T09:27:32.252252 #30027] ERROR -- : reaped #<Process::Status: pid=11619,signaled(SIGKILL=9)> worker=5
I, [2013-08-14T09:27:32.266141 #4720] INFO -- : worker=5 ready
There are many error messages like that.
Then I went to the Rails production log and found the exact requests by searching for the Unicorn error time minus 601s. These timed-out requests all choked at the page-rendering phase. The SQL for these requests had already finished; the request just never completes:
Processing by XXXController#index as HTML
Rendered xxx/index.html.erb within layouts/application (41.4ms)
Rendered shared/_sidebar.html.erb (200.9ms)
There is no "Completed" line. Most requests to these pages are served successfully; I don't know why they hang there at seemingly random times.
I have no idea what may cause this. Can anybody give me a clue about how to find the real reason the Unicorn workers time out?
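For reference, a quick sketch of the correlation step I'm describing (plain Ruby; the timestamp and the 601s offset come from the example log above):

require "time"

# The ERROR line shows "timeout (601s > 600s)", so the doomed request started
# roughly 601 seconds before the kill was logged.
killed_at = Time.parse("2013-08-14T09:27:32")
puts killed_at - 601  # approximate request start time to search for in production.log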
Update:
We use NSC to forward requests and responses to Unicorn. To try to mitigate the timeout issue, we added nginx between NSC and Unicorn. It turns out the Unicorn worker timeouts still happen, and each one matches an nginx upstream timeout in the nginx error log.
Does anyone know whether there is some kind of bottleneck in Unicorn's TCP connection handling?
I'm using Rack::Timeout to time out before Unicorn does. Unicorn's timeout uses kill -9, and I don't think that gives you any way to do anything about it.
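For what it's worth, a minimal sketch of that setup; the exact API depends on the rack-timeout version (older releases used Rack::Timeout.timeout=, newer ones take options on the middleware or the RACK_TIMEOUT_SERVICE_TIMEOUT environment variable), so treat this as illustrative:

# config/initializers/rack_timeout.rb -- sketch
# Time the request out inside the app, where it raises a visible error,
# well before Unicorn's un-rescuable SIGKILL would fire.
Rails.application.config.middleware.insert_before(
  Rack::Runtime,
  Rack::Timeout,
  service_timeout: 20  # seconds; keep this comfortably below Unicorn's `timeout`
)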
Related
All of the Unicorn workers are dying silently, with no indication as to why, and I can't find any evidence of an external process killing them. I'm new to diagnosing this kind of thing, and after several hours of research and experimenting I'm at a dead end.
Background info: it's a Rails 4.1 app, Ruby 2.0, running nginx and Unicorn on an Ubuntu 14.04 server.
unicorn.rb
working_directory "/home/deployer/apps/ourapp/current"
pid "/home/deployer/apps/ourapp/current/tmp/pids/unicorn.pid"
stderr_path "/home/deployer/apps/ourapp/current/log/unicorn.log"
stdout_path "/home/deployer/apps/ourapp/current/log/unicorn.log"
listen "/tmp/unicorn.ourapp.sock"
worker_processes 2
timeout 30
excerpt from unicorn.log (last lines before it dies and after restart)
I, [2016-08-28T19:54:01.685757 #19559] INFO -- : worker=1 ready
I, [2016-08-28T19:54:01.817464 #19556] INFO -- : worker=0 ready
I, [2016-08-29T09:19:14.818267 #30343] INFO -- : unlinking existing socket=/tmp/unicorn.ourapp.sock
I, [2016-08-29T09:19:14.818639 #30343] INFO -- : listening on addr=/tmp/unicorn.ourapp.sock fd=10
I, [2016-08-29T09:19:14.818807 #30343] INFO -- : worker=0 spawning...
I, [2016-08-29T09:19:14.824358 #30343] INFO -- : worker=1 spawning...
Some pertinent info:
After a period of time ranging from about 8 to 20 hours, Unicorn dies.
There's no error recorded in the unicorn log.
I searched all of /var/log for evidence of processes that were killed, and can only find one unrelated process that was killed a few days ago.
New Relic shows flat memory usage before the last random shutdown, with Ruby using around 400 MB. It's currently at 480 MB with no problems, so I don't think it's hitting memory constraints.
Same with CPU usage: Ruby was hovering around 0.1% before it died.
The last couple of times it died were in the middle of the night. The only requests coming in were from New Relic and Linode Longview monitoring.
Our production.log shows that the last request before it died was a ping from New Relic. It "Completed 200 OK in 264ms", so it doesn't seem to be a request timing out.
It's happening in staging as well, where the log level is set to debug, and there are no additional clues in the staging logs.
Questions:
What could be killing the Unicorn workers, if it's not the out-of-memory killer or a shutdown signal?
Could it be the OOM killer or a shutdown signal that's being recorded somewhere I'm not looking, or just not being recorded for some reason?
Is there a way to capture what's happening to Unicorn in more detail?
I have no idea where to go from here, so any suggestions would be much appreciated.
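One thing I may try for question 3 is pointing Unicorn at its own, more verbose logger so the master/worker lifecycle is recorded in more detail. A sketch, added to the unicorn.rb above (the log path is just an example):

require "logger"

debug_logger = Logger.new("/home/deployer/apps/ourapp/current/log/unicorn_debug.log")
debug_logger.level = Logger::DEBUG
logger debug_logger  # Unicorn's configurator accepts any Logger-like object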
UPDATE
As suggested, I used strace and found that Unicorn was being killed by an old crontab entry (I know, I should have checked there earlier) added by the previous developers, intended to restart the server every night. The stop command worked, but the start command was failing.
I still don't know why I wasn't able to find anything in my log searches, but after attaching strace to the main unicorn process (using something like strace -o /tmp/strace.out -s 2000 -fp <unicorn_process_id>), the strace log ended with a clear +++ killed by SIGKILL +++. I searched the logs again, and that led me to the crontab.
The underlying cause is probably pretty specific to my situation, but I'm really glad I know about strace now.
I am new to AWS Elastic Beanstalk, Rails, Puma, and Nginx.
After deploying my Rails app to Beanstalk, all my API calls work fine, but HTML pages cause an error.
When I open an HTML page, Nginx throws a 502 Bad Gateway error.
Puma log:
Started GET "/admin" for 182.70.76.160 at 2016-04-22 05:13:19 +0000
Processing by Devise::SessionsController#new as HTML
Rendered devise/sessions/new.html.erb within layouts/application (6.1ms)
[18858] ! Terminating timed out worker: 22913
var/app/current/production.log is empty.
I read somewhere that adding SSL could solve this. Is it required to add SSL?
Please help! I am stuck!
STATUS:
My assets were huge, which is what was causing the worker to kill itself. I was using a theme, so I removed all the unnecessary JS, CSS, and images.
Now Puma doesn't terminate, but it does not compile the assets. I had selected Ruby as the application type, so it should do that for me, correct?
Try setting the worker timeout to a higher value in your Puma config. The default value is 60 seconds:
worker_timeout 100
It is also possible that you are creating more workers than the server can handle. Try decreasing the worker count or increasing the server capacity.
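For context, here's a sketch of where those settings live in config/puma.rb; the worker and thread counts are only examples to tune for your instance size:

# config/puma.rb -- sketch
workers Integer(ENV["WEB_CONCURRENCY"] || 2)  # fewer workers on a small instance
threads_count = Integer(ENV["RAILS_MAX_THREADS"] || 5)
threads threads_count, threads_count

worker_timeout 100  # default is 60s; raise it only if requests legitimately take longer
preload_app!

port        ENV["PORT"] || 3000
environment ENV["RACK_ENV"] || "production"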
For now I have moved to EC2, as the Elastic Beanstalk issues weren't getting solved.
I had the same issue on EC2, but I could fix it since I have access to my machine.
Puma workers were timing out because my assets weren't precompiled.
Every time I deploy a new build to the server, I have to run the following:
RAILS_ENV=production rake assets:precompile
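As I understand it, whether missing precompiled assets show up as slow requests (and worker timeouts) or as hard errors depends on the production asset settings; roughly, assuming the standard Rails asset pipeline:

# config/environments/production.rb -- sketch
# With compile disabled, assets must be precompiled at deploy time
# (RAILS_ENV=production rake assets:precompile). With it enabled, a missing
# asset is compiled lazily inside the request, which is the slow path that
# can push a worker past its timeout.
config.assets.compile = false
config.assets.digest = true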
I've been getting this error on and off for the past couple of days, since I deployed my application to Heroku. It happened both before I started using Unicorn as the server and afterwards. I can sometimes get it back up and running with heroku run rake db:migrate and then heroku restart, but this only fixes it for a couple of hours before it breaks again. The web page itself just says "Application error". The logs aren't very helpful, but here's what they say each time this error happens:
[2014-10-27T21:13:31.675956 #2] ERROR -- : worker=1 PID:8 timeout (16s > 15s), killing
[2014-10-27T21:13:31.731646 #14] INFO -- : worker=1 ready
[2014-10-27T21:13:31.694690 #2] ERROR -- : reaped #<Process::Status: pid 8 SIGKILL (signal 9)> worker=1
at=error code=H13 desc="Connection closed without response" method=GET
I'm just using the free version of Heroku; I want to make sure it works before upgrading, but is that my only option at this point?
Also, I am able to run this locally perfectly fine using either rails server or foreman start.
Heroku docs say this about H13:
H13 - Connection closed without response
This error is thrown when a process in your web dyno accepts a connection, but then closes the socket without writing anything to it.
One example where this might happen is when a Unicorn web server is configured with a timeout shorter than 30s and a request has not been processed by a worker before the timeout happens. In this case, Unicorn closes the connection before any data is written, resulting in an H13.
A couple lines up, you have an error about a process timing out after 15s:
ERROR -- : worker=1 PID:8 timeout (16s > 15s), killing
Heroku help has a section on timeout settings:
Depending on your language you may be able to set a timeout on the app server level. One example is Ruby’s Unicorn. In Unicorn you can set a timeout in config/unicorn.rb like this:
timeout 15
The timer will begin once Unicorn starts processing the request; if 15 seconds pass, the master process will send a SIGKILL to the worker, but no exception will be raised.
That matches the error messages in your log. I'd look into it.
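For reference, a config/unicorn.rb sketch along the lines Heroku's docs recommend, assuming preload_app and ActiveRecord (the worker count and timeout values are examples):

# config/unicorn.rb -- sketch
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
timeout 15  # keep this below the Heroku router's 30s limit
preload_app true

before_fork do |server, worker|
  # Let the master turn Heroku's TERM into a graceful QUIT
  Signal.trap("TERM") do
    puts "Unicorn master intercepting TERM and sending myself QUIT instead"
    Process.kill("QUIT", Process.pid)
  end
  defined?(ActiveRecord::Base) && ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  Signal.trap("TERM") do
    puts "Unicorn worker intercepting TERM and doing nothing; wait for master's QUIT"
  end
  defined?(ActiveRecord::Base) && ActiveRecord::Base.establish_connection
end

With a Unicorn timeout shorter than 30s, the worker is killed (and the kill is logged by Unicorn) before the router gives up, which is exactly the combination that produces the H13 you're seeing.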
I'm trying to debug an issue with workers and I saw this message in my log file:
2013-07-14T21:59:07.024756+00:00 app[web.1]: E, [2013-07-14T14:59:07.024559 #2] ERROR -- : worker=1 PID:261 timeout (30s > 29s), killing
2013-07-14T21:59:07.067325+00:00 app[web.1]: E, [2013-07-14T14:59:07.066999 #2] ERROR -- : reaped #<Process::Status: pid 261 SIGKILL (signal 9)> worker=1
2013-07-14T21:59:07.070701+00:00 heroku[router]: at=error code=H13 desc="Connection closed without response" method=POST path=/photos/687 host=dev.tacktile.org fwd="199.83.223.92" dyno=web.1 connect=8ms service=29345ms status=503 bytes=0
2013-07-14T21:59:07.898048+00:00 app[web.1]: I, [2013-07-14T14:59:07.897739 #269] INFO -- : worker=1 ready
If I'm reading this correctly, my worker was killed because it took longer than 30 seconds. I thought only web responses got killed if they took longer than 30 seconds. I'm putting this task into a delayed job and processing it with a worker because I know it's slow.
I hope I'm misunderstanding something.
Your log indicates dyno=web.1, so it looks like the web dyno's connection was terminated after 30 seconds, not a worker dyno as you suggest. Have you read the note attached to the definition of the H13 error, which says:
One example where this might happen is when a Unicorn web server is configured with a timeout shorter than 30s and a request has not been processed by a worker before the timeout happens. In this case, Unicorn closes the connection before any data is written, resulting in an H13.
Perhaps that's related?
PS. Editing my answer: I see that by "worker" you mean "Unicorn worker", I guess? It looks like your Unicorn worker died for some reason (which is perhaps why you got the H13). Heroku won't explicitly kill a sub-process like that, AFAIK.
I'm not a Ruby on Rails expert, but it seems like what you call a "worker" is actually a web process (as evidenced by the dyno name, web.1). I am guessing you use Unicorn, which spawns several processes, each handling a single web request at a time. Each such process is termed a "worker", so it's really a matter of terminology.
As to why it happens: could it be that your web path actually waits for your real worker to complete the request, and thus it too is taking >30sec?
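If that is what's happening, the usual fix is for the controller to enqueue the slow work and return immediately instead of waiting on it. A sketch using delayed_job's .delay proxy (the controller, model, and generate_thumbnails method are made-up names for illustration):

# Hypothetical controller action: enqueue the slow work, respond right away.
class PhotosController < ApplicationController
  def update
    photo = Photo.find(params[:id])
    photo.delay.generate_thumbnails  # runs later in the delayed_job worker process
    head :accepted                   # the web request finishes well under the 30s limit
  end
end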
Having experienced a few periods of downtime, we recently upgraded to a production environment on Heroku (a Crane database plus 2x web dynos); however, we've seen no improvement. In fact, reliability seems to have decreased since upgrading.
The root cause seems to be the following exception:
PG::Error (SSL SYSCALL error: EOF detected
which causes the dyno to fail and - eventually - restart, but not before causing some downtime.
I've no idea what's causing it. Common culprits appear to be Resque and Unicorn, neither of which I'm using. We're on Rails 3.2.11, on Heroku Cedar, using the pg gem 1.14.1.
Logs report the following at crash time:
2013-05-23T19:01:33+00:00 app[heroku-postgres]: source=HEROKU_POSTGRESQL_PINK measure.current_transaction=34490 measure.db_size=38311032bytes measure.tables=19 measure.active-connections=7 measure.waiting-connections=0 measure.index-cache-hit-rate=0.99438 measure.table-cache-hit-rate=0.8824
2013-05-23T19:01:35.123633+00:00 app[web.2]:
2013-05-23T19:01:35.123633+00:00 app[web.2]: PG::Error (SSL SYSCALL error: EOF detected
2013-05-23T19:01:35.123633+00:00 app[web.2]: ):
I have read the following: https://groups.google.com/forum/?fromgroups#!topic/heroku/a6iviwAFgdY but can't find anything that might help.
https://gist.github.com/ktopping/5657474
The above fixes the exception, which is useful (it should declutter my logs and even help speed up reconnecting to the database), but it doesn't actually stop my main issue, which is Heroku web dynos crashing more often than I would like.
I'm investigating some other routes (Unicorn, rack-timeout).