Interpreting Heroku logs: is my worker being killed prematurely?

I'm trying to debug an issue with workers and I saw this message in my log file:
2013-07-14T21:59:07.024756+00:00 app[web.1]: E, [2013-07-14T14:59:07.024559 #2] ERROR -- : worker=1 PID:261 timeout (30s > 29s), killing
2013-07-14T21:59:07.067325+00:00 app[web.1]: E, [2013-07-14T14:59:07.066999 #2] ERROR -- : reaped #<Process::Status: pid 261 SIGKILL (signal 9)> worker=1
2013-07-14T21:59:07.070701+00:00 heroku[router]: at=error code=H13 desc="Connection closed without response" method=POST path=/photos/687 host=dev.tacktile.org fwd="199.83.223.92" dyno=web.1 connect=8ms service=29345ms status=503 bytes=0
2013-07-14T21:59:07.898048+00:00 app[web.1]: I, [2013-07-14T14:59:07.897739 #269] INFO -- : worker=1 ready
If I'm reading this correctly, my worker was killed because it took longer than 30 seconds. I thought only web requests were killed for running longer than 30 seconds. I'm putting this task into a delayed job and processing it with a worker precisely because I know it's slow.
I hope I'm misunderstanding something.

Your log says dyno=web.1, so it looks like the web dyno's connection was terminated after 30 seconds, not a worker dyno as you suggest. Have you read the note attached to the definition of the H13 error, which says:
One example where this might happen is when a Unicorn web server is
configured with a timeout shorter than 30s and a request has not been
processed by a worker before the timeout happens. In this case,
Unicorn closes the connection before any data is written, resulting in
an H13.
Perhaps that's related?
PS. Re-reading your question, I see that by "worker" you probably mean a Unicorn worker. In that case your log shows the Unicorn master killing worker 1 because a request exceeded the 29-second timeout in your Unicorn config ("timeout (30s > 29s), killing"), which is most likely why you got the H13. Heroku won't explicitly kill a sub-process like that, AFAIK.

I'm not a Ruby on Rails expert, but it seems that what you call a "worker" is actually a web process (as evidenced by the dyno name, web.1). I'm guessing you use Unicorn, which spawns several processes, each handling a single web request at a time. Each such process is called a "worker", so it's really a matter of terminology.
As to why it happens: could it be that your web request actually waits for your real background worker to finish the job, and so it too ends up taking more than 30 seconds?
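If that is what's happening, the fix is to have the controller only enqueue the job and return immediately, letting the worker dyno do the slow part. Here is a minimal sketch, assuming delayed_job is already set up; the controller action and the Photo#process! method are hypothetical stand-ins for your slow task:
# Hypothetical controller sketch, not your actual code
class PhotosController < ApplicationController
  def update
    @photo = Photo.find(params[:id])
    # Enqueue the slow work with delayed_job instead of running it inline;
    # the request returns right away and the worker dyno picks the job up.
    @photo.delay.process!
    redirect_to @photo, notice: "Processing started"
  end
end
For this to work on Heroku you also need a worker entry in your Procfile (for delayed_job that is typically worker: bundle exec rake jobs:work) and at least one worker dyno scaled up.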

Related

Transcoding quicktime (.MOV) to mp4 causes huge memory consumption and crashes on heroku?

Converting a ~4.4MB QuickTime (.MOV) video to .mp4 works fine locally, but on Heroku it consumes over 1GB of memory and crashes the dyno.
Steps to reproduce
Take this video, place it in the root of a new or existing rails app.
Add gem 'streamio-ffmpeg', '~> 3.0', '>= 3.0.2' to your Gemfile, run bundle, and push to Heroku so it bundles there as well.
Run heroku run rails c to jump into the Rails console and try to convert the movie:
require 'streamio-ffmpeg'
# Read video file
movie = FFMPEG::Movie.new("IMG_1459.MOV")
# Transcode video and save as an mp4
movie.transcode("IMG_1459.mp4")
The last line runs for a while then crashes.
Error logs
Converting the ~4.4MB video appears to consume over 1GB of memory (!!)
2022-03-05T13:00:24.678737+00:00 heroku[web.1]: Process running mem=1076M(210.3%)
2022-03-05T13:00:24.689256+00:00 heroku[web.1]: Error R15 (Memory quota vastly exceeded)
2022-03-05T13:00:24.691853+00:00 heroku[web.1]: Stopping process with SIGKILL
2022-03-05T13:00:24.810891+00:00 heroku[router]: at=error code=H13 desc="Connection closed without response" method=POST path="/posts" host=secure-inlet-07449.herokuapp.com request_id=7cd160ea-b332-4965-ae3f-563696dd054b fwd="1.145.252.113" dyno=web.1 connect=0ms service=29610ms status=503 bytes=0 protocol=https
2022-03-05T13:00:25.025812+00:00 heroku[web.1]: Process exited with status 137
2022-03-05T13:00:25.121873+00:00 heroku[web.1]: State changed from up to crashed
The logs are from similar code in an actual app.
Update
I think I seriously underestimated how much RAM this would consume. (I had assumed I had a memory leak, or some unclosed connection triggering a loop or similar.) But it may simply be an incredibly memory-intensive task, and therefore a bad idea to run inside the app itself; it should instead be offloaded to an external process (e.g. an EC2 instance or an AWS Lambda).
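Incidentally, the H13 in the log above shows the transcode running inside a web request (the POST /posts held the connection for ~29.6 seconds before the router gave up). A minimal sketch of moving it into a background job, assuming Active Job and a hypothetical TranscodeJob, is below. It keeps the web dyno responsive, but it only relocates the memory problem to the worker dyno, which is why offloading the work off Heroku entirely (as concluded in the update) may still be necessary.
# app/jobs/transcode_job.rb -- hypothetical sketch; job name and file paths are assumptions
require 'streamio-ffmpeg'

class TranscodeJob < ApplicationJob
  queue_as :default

  def perform(input_path, output_path)
    movie = FFMPEG::Movie.new(input_path)
    # Runs on the worker dyno, so no web request is held open while ffmpeg works
    movie.transcode(output_path)
  end
end

# Enqueue from the controller instead of transcoding inline:
# TranscodeJob.perform_later("IMG_1459.MOV", "IMG_1459.mp4")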

R15 Issue on Heroku without exceeding memory limit

I have a Ruby on Rails site running on Heroku's performance-M dynos, with autoscaling set up to 5 dynos.
Recently, we have been receiving abrupt R15 and H12 errors on the site. During these incidents, memory usage is shown well under the quota allowed for the dyno.
Here are the errors shown in the log:
2019-09-16T10:12:08.523336+00:00 app[scheduler.2787]: Command :: identify -format '%wx%h,%[exif:orientation]' '/tmp/897302823996a945884a1d912c28d59520190916-4-1bn5w9k.jpg[0]' 2>/dev/null
2019-09-16T10:12:16.022212+00:00 heroku[scheduler.2787]: Process running mem=1022M(199.7%)
2019-09-16T10:12:16.022295+00:00 heroku[scheduler.2787]: Error R14 (Memory quota exceeded)
2019-09-16T10:12:16.365725+00:00 heroku[router]: at=info method=GET path="/favicon-16x16.png" host=www.site.com request_id=8755a947-ace9-471d-a192-a236785505b4 fwd="45.195.5.37" dyno=web.1 connect=1ms service=2ms status=200 bytes=928 protocol=https
2019-09-16T10:12:19.103405+00:00 heroku[scheduler.2787]: Process running mem=1279M(250.0%)
2019-09-16T10:12:19.103405+00:00 heroku[scheduler.2787]: Error R15 (Memory quota vastly exceeded)
2019-09-16T10:12:19.103405+00:00 heroku[scheduler.2787]: Stopping process with SIGKILL
2019-09-16T10:12:19.427029+00:00 heroku[scheduler.2787]: State changed from up to complete
2019-09-16T10:12:19.388039+00:00 heroku[scheduler.2787]: Process exited with status 137
2019-09-16T10:13:07.886016+00:00 heroku[router]: at=error code=H12 desc="Request timeout" method=GET path="/favicon.ico" host=www.site.com request_id=c7cea0a2-7345-44c6-926e-3ad5a0eb2066 fwd="45.195.5.37" dyno=web.2 connect=1ms service=30000ms status=503 bytes=0 protocol=https
As you can see, just before the R15 error, Paperclip had shelled out to ImageMagick (the identify call in the log) while processing an image.
The beginning of the graphs in the following screenshots shows the Heroku metrics for the affected period:
Heroku Metrics Part 1
Heroku Metrics Part 2
Can anyone please help me figure out why the R15 error, which relates to memory, occurs while the metrics show memory well within the limit? Any advice on how to keep this from happening again would also be appreciated.
Thanks.
Your R15 error occurred on a one-off dyno created by Heroku Scheduler, completely separate from your web dynos. Your request timeouts appear to be unrelated to the memory issues in your scheduled task.
The scheduled task appears to be running on a 1X dyno (mem=1022M(199.7%)). To change this, launch the Heroku Scheduler add-on and change the dyno type.
For your request timeouts, check out Scout or New Relic to find the problematic endpoint and see where in the stack the time is going.
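If the spike really is ImageMagick working through a large image (the identify call in the log), another knob worth trying, purely as an assumption on my part rather than something visible in your logs, is capping ImageMagick's resource usage through its environment variables so oversized images spill to disk instead of RAM, e.g. heroku config:set MAGICK_MEMORY_LIMIT=256MiB MAGICK_MAP_LIMIT=512MiB MAGICK_DISK_LIMIT=1GiB. The values here are illustrative, not tuned for your app.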

Unicorn workers dying for no reason

All the Unicorn workers are dying silently, with no indication as to why, and I can't find any evidence of an external process killing them. I'm new to diagnosing this kind of thing, and after several hours of research and experimenting, I'm at a dead end.
Background info: it's a Rails 4.1 app on Ruby 2.0, running nginx and Unicorn on an Ubuntu 14.04 server.
unicorn.rb
working_directory "/home/deployer/apps/ourapp/current"
pid "/home/deployer/apps/ourapp/current/tmp/pids/unicorn.pid"
stderr_path "/home/deployer/apps/ourapp/current/log/unicorn.log"
stdout_path "/home/deployer/apps/ourapp/current/log/unicorn.log"
listen "/tmp/unicorn.ourapp.sock"
worker_processes 2
timeout 30
excerpt from unicorn.log (last lines before it dies and after restart)
I, [2016-08-28T19:54:01.685757 #19559] INFO -- : worker=1 ready
I, [2016-08-28T19:54:01.817464 #19556] INFO -- : worker=0 ready
I, [2016-08-29T09:19:14.818267 #30343] INFO -- : unlinking existing socket=/tmp/unicorn.ourapp.sock
I, [2016-08-29T09:19:14.818639 #30343] INFO -- : listening on addr=/tmp/unicorn.ourapp.sock fd=10
I, [2016-08-29T09:19:14.818807 #30343] INFO -- : worker=0 spawning...
I, [2016-08-29T09:19:14.824358 #30343] INFO -- : worker=1 spawning...
Some pertinent info:
After a period of time ranging from about 8 to 20 hours, Unicorn dies.
There's no error recorded in the unicorn log.
I searched all of /var/log for evidence of processes that were killed, and can only find one unrelated process that was killed a few days ago.
New Relic shows flat memory usage before the last random shutdown, with Ruby using around 400MB. It's currently at 480MB with no problems, so I don't think it's hitting memory constraints.
Same with CPU usage: Ruby was hovering around 0.1% before it died.
The last couple of times it died were in the middle of the night. The only requests coming in were from New Relic and Linode Longview monitoring.
Our production.log shows a last request before dying as a ping from New Relic. It Completed 200 OK in 264ms so it doesn't seem to be a request timing out.
It's happening in staging as well, where the log level is set to debug, and there are no additional clues in the staging logs.
Questions:
What could be killing the Unicorn workers, if it's not the out-of-memory killer or a shutdown signal?
Could it be the OOM killer or a shutdown signal after all, being recorded somewhere I'm not looking, or simply not being recorded for some reason?
Is there a way to capture what's happening to Unicorn in more detail?
I have no idea where to go from here, so any suggestions would be much appreciated.
UPDATE
As suggested, I used strace and found that Unicorn was being killed by an old crontab (I know, I should have checked there earlier) added by the previous developers, intended to restart the server every night. The stop command worked, but the start command was failing.
I still don't know why I wasn't able to find anything in my log searches, but after attaching strace to the main unicorn process (using something like strace -o /tmp/strace.out -s 2000 -fp <unicorn_process_id>), the strace log ended with a clear +++ killed by SIGKILL +++. I searched the logs again, and that led me to the crontab.
The underlying cause is probably pretty specific to my situation, but I'm really glad I know about strace now.
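For anyone who ends up in the same spot: a nightly stop/start pair like that can be replaced with a single signal to the running master, so there is no separate start step to fail. A sketch of the crontab line, assuming the pid path from the unicorn.rb above (with preload_app left at its default of false, HUP makes Unicorn gracefully restart its workers and pick up new application code):
# crontab sketch -- reload Unicorn nightly by signalling the running master
0 3 * * * kill -HUP $(cat /home/deployer/apps/ourapp/current/tmp/pids/unicorn.pid)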

Heroku Error H13

I've been getting this error on and off for the past couple of days, since I deployed my application to Heroku. It happened both before and after I started using Unicorn as my server. I can sometimes get the app back up by running heroku run rake db:migrate and then heroku restart, but that only fixes it for a couple of hours before it breaks again. The page itself just says "Application error". The logs aren't very helpful, but here's what they show each time this error happens:
[2014-10-27T21:13:31.675956 #2] ERROR -- : worker=1 PID:8 timeout (16s > 15s), killing
[2014-10-27T21:13:31.731646 #14] INFO -- : worker=1 ready
[2014-10-27T21:13:31.694690 #2] ERROR -- : reaped #<Process::Status: pid 8 SIGKILL (signal 9)> worker=1
at=error code=H13 desc="Connection closed without response" method=GET
I'm just using the free tier of Heroku; I want to make sure it works before upgrading, but is upgrading my only option at this point?
Also, I am able to run this perfectly fine locally using either rails server or foreman start.
Heroku docs say this about H13:
H13 - Connection closed without response
This error is thrown when a process in your web dyno accepts a connection, but then closes the socket without writing anything to it.
One example where this might happen is when a Unicorn web server is configured with a timeout shorter than 30s and a request has not been processed by a worker before the timeout happens. In this case, Unicorn closes the connection before any data is written, resulting in an H13.
A couple lines up, you have an error about a process timing out after 15s:
ERROR -- : worker=1 PID:8 timeout (16s > 15s), killing
Heroku help has a section on timeout settings:
Depending on your language you may be able to set a timeout on the app server level. One example is Ruby’s Unicorn. In Unicorn you can set a timeout in config/unicorn.rb like this:
timeout 15
The timer will begin once Unicorn starts processing the request, if 15 seconds pass, then the master process will send a SIGKILL to the worker but no exception will be raised.
That matches the error messages in your log. I'd look into it.
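If the endpoint genuinely needs more than 15 seconds while you work on speeding it up, one stopgap (a sketch, not a recommendation to leave slow requests in place) is to raise Unicorn's timeout while keeping it below Heroku's 30-second router limit:
# config/unicorn.rb -- sketch; WEB_CONCURRENCY is the usual Heroku convention for worker count
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 2)
timeout 25   # below Heroku's 30s router timeout, above the current 15s
The longer-term fix is still to find out why the request takes 16+ seconds; the Rails log timings or an APM tool will usually point at the slow query or external call.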

How to find out what causes unicorn workers to time out

People keep reporting that my website hangs on certain pages. I checked the Unicorn stderr log and found many timeout errors like:
E, [2013-08-14T09:27:32.236478 #30027] ERROR -- : worker=5 PID:11619 timeout (601s > 600s), killing
E, [2013-08-14T09:27:32.252252 #30027] ERROR -- : reaped #<Process::Status: pid=11619,signaled(SIGKILL=9)> worker=5
I, [2013-08-14T09:27:32.266141 #4720] INFO -- : worker=5 ready
There are many error messages like that.
Then I went to the Rails production log and found the exact requests by searching around the Unicorn error time minus 601 seconds. These timed-out requests all choked in the page-rendering phase; their SQL had already finished. The log just never shows them completing:
Processing by XXXController#index as HTML
Rendered xxx/index.html.erb within layouts/application (41.4ms)
Rendered shared/_sidebar.html.erb (200.9ms)
There is no "Completed" line. Most of these requests are served successfully; I don't know why, at random times, they hang there.
I have no idea what might cause this. Can anybody give me a clue as to how to find the real reason the Unicorn workers time out?
Update:
We use NSC to pass requests and responses to Unicorn. To try to improve the timeout issue, we added nginx between NSC and Unicorn. It turns out the Unicorn worker timeouts still happen, and each one matches an nginx upstream timeout in the nginx error log.
Does anyone know whether there is some kind of bottleneck in Unicorn's TCP connection handling?
I'm using Rack::Timeout to time out before Unicorn does. Unicorn's own timeout uses kill -9 (SIGKILL), which gives the worker no chance to log anything or clean up.
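A sketch of that setup, assuming the rack-timeout gem (the configuration API differs between versions; older releases use the class-level setter shown below, newer ones prefer the RACK_TIMEOUT_SERVICE_TIMEOUT environment variable):
# Gemfile
gem "rack-timeout"

# config/initializers/rack_timeout.rb -- fire well before Unicorn's 600s hard timeout
Rack::Timeout.service_timeout = 60
When it fires, Rack::Timeout raises Rack::Timeout::RequestTimeoutException inside the worker, so the Rails log captures a backtrace showing where the request was stuck (in your case, presumably somewhere in the render), rather than the master's silent kill -9.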
