104: Connection reset by peer: nginx + rainbows + over 1 MB uploads - ruby-on-rails

I am running Rainbows! with the ThreadPool model behind nginx (over a unix socket).
On large file uploads I get the following in the nginx error log (and nothing in the application log):
readv() failed (104: Connection reset by peer) while reading upstream
The browser receives the response:
413 Request Entity Too Large
Why does this happen?
"client_max_body_size 80M;" is set at both the http and the server level (just in case) in nginx.
nginx talks to Rainbows! over a unix socket (an upstream block pointing at the socket plus a location with proxy_pass).
I don't see anything in the other logs. I have checked:
rainbows log
foreman log
application log
dmesg and /var/log/messages
This happens when uploading any file larger than roughly 1 MB.
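The relevant parts of the nginx config look roughly like this (socket path and server name are placeholders):
http {
  client_max_body_size 80M;

  upstream rainbows {
    server unix:/path/to/rainbows.sock fail_timeout=0;
  }

  server {
    listen 80;
    server_name example.com;
    # set again at the server level, just in case
    client_max_body_size 80M;

    location / {
      proxy_set_header Host $http_host;
      proxy_pass http://rainbows;
    }
  }
}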

The ECONNRESET (Connection reset by peer) error means that the connection was closed uncleanly by the backend application. This usually happens when the backend application dies, e.g. due to a segmentation fault, or is killed by the OOM killer. To find the exact reason you have to examine your backend logs (if any) and/or the system logs.

Maybe you have client_max_body_size set in your nginx.conf to a value that limits the body to 1 MB, e.g.
client_max_body_size 1M;
In that case you would need to remove it (or raise it) to allow uploads larger than 1 MB.

It turns out Rainbows! has a configuration option called client_max_body_size that defaults to 1 MB.
The option is documented here.
If this option is set, Rainbows! silently responds with 413 to requests that exceed the limit. You might not even know it is the component breaking things unless you run something in front of it.
Rainbows! do
  # let nginx handle max body size
  client_max_body_size nil
end
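With Rainbows!'s own limit disabled, nginx's client_max_body_size (80M in this setup) becomes the effective limit on upload size.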

Related

Preventing uwsgi_response_write_body_do() TIMEOUT

We use uwsgi with the python3 plugin, under nginx, to serve potentially hundreds of megabytes of data per query. Sometimes, when nginx is queried by a client on a slow network connection, a uwsgi worker dies with "uwsgi_response_write_body_do() TIMEOUT !!!".
I understand that the uwsgi Python plugin reads from the iterator our app returns as fast as it can, trying to send the data over the uwsgi-protocol unix socket to nginx. The HTTPS/TCP connection from nginx to the client gets backed up by the slow network connection, so nginx pauses reading from its uwsgi socket. uwsgi then fails some writes towards nginx, logs that message and dies.
Normally we run nginx with uwsgi buffering disabled. I tried enabling buffering, but it doesn't help as the amount of data it might need to buffer is 100s of MBs.
Our data is not simply read out of a file, so we can't use file offload.
Is there a way to configure uwsgi to pause reading from our Python iterator if that unix socket backs up?
The existing question here "uwsgi_response_write_body_do() TIMEOUT - But uwsgi_read_timeout not helping" doesn't help, as we have buffering off.
To answer my own question: adding socket-timeout = 60 to the uwsgi configuration helps for all but the slowest client connection speeds.
That's sufficient, so this question can be closed.
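For reference, in the uwsgi ini file this amounts to roughly the following (socket path and plugin line are placeholders for our setup):
[uwsgi]
plugin = python3
socket = /run/uwsgi/app.sock
# give nginx up to 60 seconds to drain the socket before the worker gives up
socket-timeout = 60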

Rails + Nginx - why should I use fail_timeout=0 for multiple nodes?

In the example nginx config file here https://github.com/defunkt/unicorn/blob/master/examples/nginx.conf
you can see the following:
# The only setting we feel strongly about is the fail_timeout=0
# directive in the "upstream" block. max_fails=0 also has the same
# effect as fail_timeout=0 for current versions of nginx and may be
# used in its place.
As I understand it, they think that with a single server in the upstream block, all users would get a 504 Gateway Timeout error if one of the requests is killed by a timeout or returns something that is considered a bad response (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream).
So in upstream block they have:
upstream app_server {
  # fail_timeout=0 means we always retry an upstream even if it failed
  # to return a good HTTP response (in case the unicorn master nukes a
  # single worker for timing out).

  # for UNIX domain socket setups:
  server unix:/path/to/.unicorn.sock fail_timeout=0;

  # for TCP setups, point these to your backend servers
  # server 192.168.0.7:8080 fail_timeout=0;
  # server 192.168.0.8:8080 fail_timeout=0;
  # server 192.168.0.9:8080 fail_timeout=0;
}
I am using the least_conn directive in the upstream block. So if one of the unicorns is down, it answers very quickly with, for example, a 500 error, and because of that 99% of all requests get sent to that node. In other words, if one node is down, the whole app is down.
I am thinking of trying something like this:
upstream app_server {
  least_conn;
  server 192.168.0.7:8080 fail_timeout=10s max_fails=5;
  server 192.168.0.8:8080 fail_timeout=10s max_fails=5;
  server 192.168.0.9:8080 fail_timeout=10s max_fails=5;
}
According to the nginx docs (http://nginx.org/en/docs/http/ngx_http_upstream_module.html#server) this means a server will be marked as down for the next 10 seconds if it returns 5 failed responses within a 10-second window. I do not see any flaws. What do you think? I have barely found any examples where fail_timeout is not 0.
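On the location side, which responses trigger a retry on another node is controlled by proxy_next_upstream, so what I have in mind there is roughly this (the defaults, written out explicitly):
location / {
  # error timeout is the default: retry another node on connection errors
  # and timeouts, but not on every HTTP 500 the app returns
  proxy_next_upstream error timeout;
  proxy_pass http://app_server;
}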

nginx passenger server error

I am running Rails applications with nginx + Passenger.
After nginx starts serving, I can access the app,
but after some time, maybe an hour or half a day, it gives me the following message:
Internal server error
An error occurred while starting the web application. It sent an unknown response type "".
Then I need to reboot the server to get nginx serving normally again.
My server is running on AliYun and its memory size is only 512M. Is that too small to run Passenger,
or is something wrong with the configuration?
It's only a workaround and you should find the actual problem (by monitoring memory usage, processor usage, open file handles, etc.), but until then you can use the passenger_max_requests directive.
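For example, in the nginx config (the value is only an illustration; tune it for your app):
# recycle each Passenger process after it has served 1000 requests,
# so a slow leak cannot exhaust the 512M of RAM
passenger_max_requests 1000;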

Passenger error when lots of concurrent calls

While doing a load test I found Passenger throwing the error below when lots of concurrent requests first hit the server. On the client side it gives a 502 error code. However, after some requests, say 1000-2000 requests, it works fine.
2013/07/23 11:22:46 [error] 14131#0: *50226 connect() to /tmp/passenger.1.0.14107/generation-0/request failed (11: Resource temporarily unavailable) while connecting to upstream, client: 10.251.18.167, server: 10.*, request: "GET /home HTTP/1.0", upstream: "passenger:/tmp/passenger.1.0.14107/generation-0/request:", host: hostname
Server details:
Passenger 4.0.10
Ruby 1.9.3/2.0
Server: EC2 m1.xlarge
64-bit, 4 cores, 15 GB RAM
Ubuntu 12.04 LTS
It's a web server that serves dynamic web pages for a Rails application.
Can somebody suggest what the issue might be?
A "temporarily unavailable" error in that context means the socket backlog is full. That can happen if your app cannot handle your requests fast enough. What happens is that the queue grows and grows, until it's full, and then you start getting those errors. In the mean time your users' response times grow and grow until they get an error. This is probably an application-level problem so it's best to try starting there. Try figuring out why your app is slow, at which request it is slow, and fix that. Or maybe you need to scale to more servers.

Nginx + unicorn (rails) often gives "Connection refused" in nginx error log

At work we're running some high-traffic sites in Rails. We often get a problem with the following being spammed in the nginx error log:
2011/05/24 11:20:08 [error] 90248#0: *468577825 connect() to unix:/app_path/production/shared/system/unicorn.sock failed (61: Connection refused) while connecting to upstream
Our setup is nginx on the frontend server (load balancing), and unicorn on our 4 app servers. Each unicorn is running with 8 workers. The setup is very similar to the one GitHub uses.
Most of our content is cached; when a request hits nginx it looks for the page in memcached and serves that if it can find it - otherwise the request goes to Rails.
I can solve the above issue - SOMETIMES - by doing a pkill of the unicorn processes on the servers, followed by:
cap production unicorn:check (removing all the pids)
cap production unicorn:start
Do you guys have any clue how I can debug this issue? We don't have any significantly high load on our database server when these problems occur.
Something killed your unicorn process on one of the servers, or it timed out. Or you have an old app server in your upstream app_server { } block that is no longer valid. Nginx will retry it from time to time. The default is to retry another upstream if it gets a connection error, so hopefully your clients didn't notice anything.
I don't think this is an nginx issue for me; restarting nginx didn't help. It seems to be gunicorn. A quick and dirty way to avoid this is to recycle the gunicorn instances when the system is not being used, say at 1 AM, if that is an acceptable maintenance window. I run gunicorn as an upstart service that comes back up if killed, so a pkill script takes care of the recycle/respawn:
start on runlevel [2345]
stop on runlevel [06]
respawn
respawn limit 10 5
exec /var/web/proj/server.sh
I am starting to wonder if this is at all related to memory allocation. I have MongoDB running on the same system and it reserves all the memory for itself but it is supposed to yield if other applications require more memory.
Other things worth a try are getting rid of eventlet or other dependent modules when running gunicorn. uWSGI can also be used as an alternative to gunicorn.
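For the nightly recycle itself, a cron entry is enough (file name and time are just examples); upstart's respawn then brings gunicorn straight back up:
# /etc/cron.d/recycle-gunicorn (example): kill gunicorn at 1 AM
0 1 * * * root pkill -f gunicorn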
