We use uwsgi with the python3 plugin, under nginx, to serve potentially hundreds of megabytes of data per query. Sometimes, when nginx is queried by a client over a slow network connection, a uwsgi worker dies with "uwsgi_response_write_body_do() TIMEOUT !!!".
I understand that the uwsgi python plugin reads from the iterator our app returns as fast as it can, trying to send the data over the uwsgi-protocol unix socket to nginx. The HTTPS/TCP connection from nginx to the client backs up on a slow network connection, so nginx pauses reading from its uwsgi socket. uwsgi then fails some writes towards nginx, logs that message and dies.
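The arrangement described above looks roughly like this on the application side; a minimal WSGI sketch with made-up chunk counts and sizes, not our actual code:

```python
# Minimal WSGI app returning a large streaming iterator.
# The uwsgi python plugin pulls chunks from this iterator and writes
# them to the unix socket toward nginx as fast as the socket accepts them.

def generate_chunks(n_chunks=1000, chunk_size=65536):
    # Stand-in for our real data source (which is not a file on disk)
    for _ in range(n_chunks):
        yield b"x" * chunk_size

def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return generate_chunks()
```

When the unix socket toward nginx stops accepting writes, uwsgi is stuck mid-iteration with data it cannot flush, which is where the timeout fires.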
Normally we run nginx with uwsgi buffering disabled. I tried enabling buffering, but it doesn't help, as the amount of data it might need to buffer is 100s of MBs.
Our data is not simply read out of a file, so we can't use file offload.
Is there a way to configure uwsgi to pause reading from our python iterator if that unix socket backs up?
The existing question here "uwsgi_response_write_body_do() TIMEOUT - But uwsgi_read_timeout not helping" doesn't help, as we have buffering off.
To answer my own question: adding socket-timeout = 60 helps for all but the slowest client connection speeds.
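In ini terms the change is one line; a sketch assuming an ini-style uwsgi config, with the socket path as a placeholder and other options elided:

```ini
[uwsgi]
plugin = python3
socket = /run/app/uwsgi.sock
; allow a blocked write toward nginx to wait up to 60s for the
; socket to drain before it is treated as a fatal timeout
socket-timeout = 60
```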
That's sufficient so this question can be closed.
So I am building a web application for university which has a very high tick rate (clients receiving data from a node server more than 30 times per second via socketio). This works well in docker. Now I have installed nginx, configured it, and everything works well (no exposed ports, socket still running, etc.), but nginx now logs every single socket connection from every single client in the docker terminal (at two clients, well above 60 log lines per second). I think this also leads to performance issues and causes small lag for the clients. I did not find any solutions in their docs.
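For reference, nginx can switch off access logging per location; a sketch assuming the socketio traffic is proxied under /socket.io/ (the path and upstream name are assumptions, not taken from the setup above):

```nginx
# Hypothetical location block; /socket.io/ and node_backend are placeholders
location /socket.io/ {
    access_log off;          # stop logging every polling/upgrade request
    proxy_pass http://node_backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}
```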
The application servers used by Ruby web applications that I know have the concept of worker processes. For example, Unicorn has this on the unicorn.rb configuration file, and for mongrel it is called servers, set usually on your mongrel_cluster.yml file.
My two questions about it:
1) Does every worker/server work as a web server and spawn a process/thread/fiber each time it receives a request, or does it block a new request if another one is already running?
2) Is this different from application server to application server? (Like unicorn, mongrel, thin, webrick...)
This is different from app server to app server.
Mongrel (at least as of a few years ago) would have several worker processes, and you would use something like Apache to load balance between the worker processes; each would listen on a different port. And each mongrel worker had its own queue of requests, so if it was busy when apache gave it a new request, the new request would go in the queue until that worker finished its request. Occasionally, we would see problems where a very long request (generating a report) would have other requests pile up behind it, even if other mongrel workers were much less busy.
Unicorn has a master process and just needs to listen on one port, or a unix socket, and uses only one request queue. That master process only assigns requests to worker processes as they become available, so the problem we had with Mongrel is much less of an issue. If one worker takes a really long time, it won't have requests backing up behind it specifically, it just won't be available to help with the master queue of requests until it finishes its report or whatever the big request is.
Webrick shouldn't even be considered; it's designed to run as just one worker in development, reloading everything all the time.
off the top of my head, so don't take this as "truth"
ruby (MRI) servers:
unicorn, passenger and mongrel all use 'workers', which are separate processes; all of these workers are started when you launch the master process and they persist until the master process exits. If you have 10 workers and they are all handling requests, then request 11 will be blocked waiting for one of them to complete.
webrick only runs a single process as far as I know, so request 2 would be blocked until request 1 finishes
thin: I believe it uses 'event I/O' to handle http, but is still a single process server
jruby servers:
trinidad, torquebox are multi-threaded and run on the JVM
see also puma: multi-threaded for use with jruby or rubinious
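As an illustration of the preforked-worker model described above, a minimal unicorn.rb might look like the sketch below; the socket path, worker count and timeout are placeholders, not recommendations:

```ruby
# unicorn.rb -- sketch only
worker_processes 10                      # 10 preforked workers; request 11 waits
listen "/tmp/unicorn.sock", backlog: 64  # single shared listen socket/queue
timeout 30                               # kill a worker stuck longer than 30s
preload_app true                         # load the app once in the master
```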
I think GitHub best explains unicorn in their (old, but valid) blog post https://github.com/blog/517-unicorn.
I think it puts backlog requests in a queue.
We have a popular iPhone app where people duel each other a la Wordfeud. We have almost 1 M registered users today.
During peak hours the app gets really long response times, and there are also quite a lot of time outs. We have tried to find the bottleneck, but have had a hard time doing so.
CPU, memory and I/O are all under 50 % on all servers. The problem ONLY appears during peak hours.
Our setup
1 VPS with nginx (1.1.9) as load balancer
4 front servers with Ruby (1.9.3p194) on Rails (3.2.5) / Unicorn (4.3.1)
1 database server with PostgreSQL 9.1.5
The database logs don't show enough long request times to explain all the timeouts shown in the nginx error log.
We have also tried to build and run the app directly against the front servers (during peak hour when all other users are running against the load balancer). The surprising thing is that the app bypassing the load balancer is quick as a bullet even under peak hours.
NGINX SETTINGS
worker_processes=16
worker_connections=4096
multi_accept=on
LINUX SETTINGS
fs.file-max=13184484
net.ipv4.tcp_rmem="4096 87380 4194304"
net.ipv4.tcp_wmem="4096 16384 4194304"
net.ipv4.ip_local_port_range="32768 61000"
Why is the app bypassing the load balancer so fast?
Can nginx as load balancer be the bottle neck?
Is there any good way to compare timeouts in nginx with timeouts in the unicorns to see where the problem resides?
Depending on your settings nginx might be the bottleneck...
Check/tune the following settings in nginx:
the worker_processes setting (should be equal to the number of cores/cpus)
the worker_connections setting (should be very high if you have lots of connections at peak)
set multi_accept on;
if on linux, in nginx make sure you're using epoll (the use epoll; directive)
check/tune the following settings of your OS:
number of allowed open file descriptors (sysctl -w fs.file-max=999999 on linux)
tcp read and write buffers (sysctl -w net.ipv4.tcp_rmem="4096 4096 16777216" and sysctl -w net.ipv4.tcp_wmem="4096 4096 16777216" on linux)
local port range (sysctl -w net.ipv4.ip_local_port_range="1024 65536" on linux)
Update:
so you have 16 workers and 4096 connections per worker
which means a maximum of 4096*16=65536 concurrent connections
you probably have multiple requests per browser (ajax, css, js, the page itself, any images on the page, ...), let's say 4 requests per browser
that allows for slightly over 16k concurrent users; is that enough for your peaks?
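The back-of-the-envelope math above, spelled out (the 4-requests-per-browser figure is the guess from the text, not a measurement):

```python
workers = 16
worker_connections = 4096
max_concurrent = workers * worker_connections       # total connection capacity
requests_per_browser = 4                            # assumed average per user
concurrent_users = max_concurrent // requests_per_browser
print(max_concurrent, concurrent_users)             # 65536 16384
```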
How do you set up your upstream server group and what is the load balancing method you use?
It's hard to imagine that Nginx itself is the bottleneck. Is it possible that some upstream app servers get hit much more than others and start to refuse connections because their backlog is full? See this load balancing issue on Heroku and see if you can get more help there.
Since version 1.2.2, nginx provides least_conn. That might be an easy fix. I haven't tried it myself yet.
Specifies that a group should use a load balancing method where a request is passed to the server with the least number of active connections, taking into account weights of servers. If there are several such servers, they are tried using a weighted round-robin balancing method.
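In config terms, a least_conn upstream looks like the sketch below; the server addresses, ports and weights are placeholders:

```nginx
upstream app_servers {
    least_conn;                     # pick the server with fewest active connections
    server 10.0.0.1:8080 weight=2;  # weights are still taken into account
    server 10.0.0.2:8080;
}

server {
    location / {
        proxy_pass http://app_servers;
    }
}
```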
At work we're running some high traffic sites in rails. We often get a problem with the following being spammed in the nginx error log:
2011/05/24 11:20:08 [error] 90248#0: *468577825 connect() to unix:/app_path/production/shared/system/unicorn.sock failed (61: Connection refused) while connecting to upstream
Our setup is nginx on the frontend server (load balancing), and unicorn on our 4 app servers. Each unicorn is running with 8 workers. The setup is very similar to the one GitHub uses.
Most of our content is cached, and when the request hits nginx it looks for the page in memcached and serves it if it can find it - otherwise the request goes to rails.
I can solve the above issue - SOMETIMES - by doing a pkill of the unicorn processes on the servers followed by a:
cap production unicorn:check (removing all the pid's)
cap production unicorn:start
Do you guys have any clue as to how I can debug this issue? We don't have any significantly high load on our database server when these problems occur.
Something killed your unicorn process on one of the servers, or it timed out. Or you have an old app server in your upstream app_server { } block that is no longer valid. Nginx will retry it from time to time. The default is to re-try another upstream if it gets a connection error, so hopefully your clients didn't notice anything.
I don't think this is an nginx issue for me; restarting nginx didn't help. It seems to be gunicorn... A quick and dirty way to avoid this is to recycle the gunicorn instances when the system is not being used, say 1AM for example, if that is an acceptable maintenance window. I run gunicorn as a service that will come back up if killed, so a pkill script takes care of the recycle/respawn:
start on runlevel [2345]
stop on runlevel [06]
respawn
respawn limit 10 5
exec /var/web/proj/server.sh
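With that respawn stanza in place, the nightly recycle could be a single cron line; a sketch only, the pkill pattern and file path are assumptions:

```
# /etc/cron.d/recycle-gunicorn -- upstart's respawn brings the service back up
0 1 * * * root /usr/bin/pkill -f gunicorn
```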
I am starting to wonder if this is at all related to memory allocation. I have MongoDB running on the same system and it reserves all the memory for itself but it is supposed to yield if other applications require more memory.
Other things worth trying are getting rid of eventlet or other dependent modules when running gunicorn. uWSGI can also be used as an alternative to gunicorn.
I'm using erlang as a bridge between services and I was wondering what advice people had for handling downed connections?
I'm taking input from local files and piping them out to AMQP, and it's conceivable that the AMQP broker could go down. For that case I would want to keep retrying to connect to the AMQP server, but I don't want to peg the CPU with those connection attempts. My inclination is to put a sleep into the reboot of the AMQP code. Wouldn't that 'hack' essentially circumvent the purpose of failing quickly and letting erlang handle it? More generally, should the erlang supervisor behavior be used for handling downed connections?
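The sleep-between-retries idea is just bounded exponential backoff; sketched here in Python for brevity rather than Erlang, with the connect callable standing in for the real AMQP client:

```python
import time

def connect_with_backoff(connect, base_delay=1.0, max_delay=30.0):
    """Keep retrying connect() without pegging the CPU.

    Sleeps between attempts, doubling the delay up to max_delay.
    """
    delay = base_delay
    while True:
        try:
            return connect()
        except ConnectionError:
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
```

In Erlang the same effect is usually achieved with a timer before the reconnect attempt, or by leaning on a supervisor's restart intensity limits.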
I think it's reasonable to code your own semantics for handling connections to an external server yourself. Supervisors are best suited to handling crashed/locked/otherwise unhealthy processes in your own process tree, not reconnections to an external service.
Is your process that pipes the local files in the same process tree as the AMQP broker or is it a separate service?