How many requests can Puma buffer simultaneously?

I understand a benefit of Puma over other Rails web servers is how it handles slow clients. While a Puma server receives and downloads a (potentially slow) request, it can still receive and download other requests that might download quicker and be passed on to a worker for processing before the slow request has finished being received.
But I can't find any information about what, if any, limits there are to this.
Can Puma download any number of requests at the same time? If 1000 slow requests hit it at the same time, would the 1001st request reach a Puma worker first assuming it wasn't a slow request?
I guess what I'm interested in generally is what impact multiple slow requests have on other requests, including each other - because I'm working on an application that's likely to involve plenty of 'slow requests' (image uploads from phones via 3G).
This great article by Nate Berkopec helps explain in principle how Puma helps with slow clients: "In clustered mode, then, Puma can deal with slow requests (thanks to a separate master process whose responsibility it is to download requests and pass them on)..." Any more light anyone can shed would be very welcome.

There are a number of considerations, such as the IO polling system, memory, and concurrency.
IO polling system
Edit (Sep. 9th, 2020): By now Puma runs on nio4r and should no longer be subject to the limits of the select system call (where file descriptor values are limited to 1023).
As far as I know, Puma uses the select system call (unlike iodine or passenger, which also protect you from slow clients but use kqueue or epoll).
The select system call is limited on most systems (usually to 1024 open file descriptors, i.e. maxfd). I would assume that creates a limit.
However, I know Puma is working on replacing the select system call with something both portable and effective (such as leveraging the nio4r gem).
I don't know if that has been achieved yet, but it would remove this limit and probably improve performance.
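For context, here is a minimal sketch of the reactor pattern that nio4r enables (illustrative only, not Puma's actual internals): one selector watches every registered socket and only wakes up for the ones that are readable, so thousands of slow connections cost little more than their buffers.

require 'nio'    # gem install nio4r
require 'socket'

selector = NIO::Selector.new      # backed by epoll/kqueue where available
server   = TCPServer.new(9090)
selector.register(server, :r)

loop do
  # Blocks until at least one registered IO is readable; slow clients that
  # haven't sent anything yet simply stay registered and cost nothing here.
  selector.select do |monitor|
    if monitor.io == server
      client = server.accept_nonblock
      selector.register(client, :r)
    else
      data = monitor.io.read_nonblock(16_384, exception: false)
      if data == :wait_readable
        next                             # not actually ready, keep waiting
      elsif data.nil?
        selector.deregister(monitor.io)  # client disconnected
        monitor.io.close
      else
        # buffer `data` until a full request has arrived, then hand the
        # parsed request off to a worker thread (omitted)
      end
    end
  end
end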
Memory
Slow clients still consume memory as they slowly fill the buffer with their header data or slowly download the buffered data that was sent (keeping the buffer in memory until the download is complete).
Memory limitations will always place limits on slow client handling.
Some of those limits can be alleviated, for example by sending static files using X-Sendfile (supported by iodine, and by Puma or passenger when running under nginx)... but the underlying memory constraint isn't really something you can fix.
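For example, if the slow downloads are files on disk, Rails can hand the actual transfer to the front-end server instead of streaming the bytes through Ruby. A sketch of the standard Rails setting (which header to use depends on the proxy you run behind):

# config/environments/production.rb
config.action_dispatch.x_sendfile_header = 'X-Sendfile'         # Apache with mod_xsendfile
# config.action_dispatch.x_sendfile_header = 'X-Accel-Redirect' # nginx

With that set, send_file in a controller only writes a header, and the proxy performs the slow transfer outside the Ruby process.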
Concurrency
Puma handles slow clients within Ruby's GIL (Global Interpreter Lock). This means that no other threads / instructions can execute while Puma is handling slow clients.
This is often a non-issue, but a large enough number of slow clients increases the cost of context switching and system calls. This could (potentially) slow a server considerably.
Both Passenger and iodine perform slow client buffering outside of the GIL, allowing these system calls to be truly concurrent (when multiple CPU cores are available).
This mitigates the issue but doesn't fully solve it.
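One common mitigation on the Puma side is clustered mode: each worker process gets its own GIL and its own set of buffered connections, so slow clients in one worker can't stall threads in another. An illustrative config/puma.rb (values are examples, not recommendations):

# config/puma.rb
workers 2        # separate processes, each with its own GIL
threads 1, 16    # per-worker thread pool for application code
preload_app!     # load the app once, then fork workers (copy-on-write friendly)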
Conclusion and Caveats
The biggest issue is usually the IO polling system. A solution for this is on Puma's roadmap (per the edit above, it has since been implemented via nio4r).
The other issues (memory limits and concurrency limits) are relatively less important, but they can't be mitigated without using language extensions (the iodine server is written in C and Passenger is written in C++).
Since Puma doesn't (currently) require any language extensions (except for its integrated HTTP parsers in C and Java), these issues remain.
I should point out that I'm the author for the iodine HTTP/Websocket server, so I'm somewhat biased.

Related

Why are multiple Ruby processes on an EC2 server causing 100% CPU utilisation?

I have a Rails application which has 100% CPU utilization most of the time.
I am not able to figure out why there is so much load on the server. I am using the Puma web server with a default configuration, and am running multiple background jobs using the sucker-punch gem. There are 7 files which are using sucker punch jobs with 5 workers:
include SuckerPunch::Job
workers 5
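With 7 such files at 5 workers each, that's up to 35 job threads on top of Puma's own thread pool. For context, each of those files looks roughly like this (the class name and perform body here are placeholders):

# app/jobs/example_job.rb -- placeholder name and body
class ExampleJob
  include SuckerPunch::Job
  workers 5

  def perform(record_id)
    # long-running work runs here on one of the job's 5 threads
    ActiveRecord::Base.connection_pool.with_connection do
      # ... process the record ...
    end
  end
end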
I ran the top -i command and found the following processes running on the server. I can see multiple Ruby processes on the server. Can someone tell me whether this is normal behavior on a server, or if something is wrong?
Some Ways to Reduce Resource Contention
Your user space load is high (~48%), so you'll probably want to reduce the number of workers in your web application, increase the number of CPUs available on your instance, move to a version of Ruby that has better concurrency and real multi-core support (e.g. Rubinius or JRuby), or some combination of these options. Depending on what your code is actually doing, you may also need to re-architect your application to offload expensive I/O from the application server.
In addition, your steal time is quite high (~41%), so your EC2 instance is probably overloaded. Simply moving your application to a less-loaded instance may free up sufficient resources to reduce application wait times.

Async App Server versus Multiple Blocking Servers

tl;dr Many Rails apps or one Vertx/Play! app?
I've been having discussions with other members of my team on the pros and cons of using an async app server such as the Play! Framework (built on Netty) versus spinning up multiple instances of a Rails app server.
I know that Netty is asynchronous/non-blocking, meaning during a database query, network request, or something similar an async call will allow the event loop thread to switch from the blocked request to another request ready to be processed/served. This will keep the CPUs busy instead of blocking and waiting.
I'm arguing in favor of using something such as the Play! Framework or Vertx.io, something that is non-blocking and scalable. My team members, on the other hand, are saying that you can get the same benefit by using multiple instances of a Rails app, which out of the box comes with only one thread and doesn't have true concurrency the way apps on the JVM do. They are saying just use enough app instances to match the performance of one Play! application (or however many Play! apps we use), and when a Rails app blocks the OS will switch processes to a different Rails app. In the end, they are saying that the CPUs will be doing the same amount of work and we will get the same performance.
So here are my questions:
Are there any logical fallacies in the arguments above? Would the OS manage the Rails app instances as well as Netty (which also runs on the JVM, which maps threads to cores very well) manages requests in its event loop?
Would the OS be as performant in switching on blocking calls as would something like Netty or Vertx, or even something built on Ruby's own EventMachine?
With enough Rails app instances to match the performance of the Play! apps, would there be a noticeable cost difference in running the servers? If there is no cost difference, it wouldn't really matter which method is used, in my opinion. Shoot, if it were financially cheaper to spin up a million Rails apps than one Play! app, I would rather do that.
What are some other benefits to using either of these approaches that I may be failing to ask about?
Both approaches can and have worked. So if switching would incur a high development cost and/or schedule hit then it's probably not worth the effort...yet. Make the switch when the costs become unacceptably high. Think of using microservices as a gradual switching strategy.
If you are early on in your development cycle then making the switch early may make sense. Rewriting is a pain.
Or perhaps you'll never have to switch and rails will work for your use case like a charm. And you've been so successful at making your customers happy that the cash is just rolling in.
Some of the downsides of the blocking, single-threaded server approach:
Increased memory usage. Sources: multiple processes, memory leaks, lack of shared data structures (which increases communication costs and brings up consistency issues).
Lack of parallelism. This has two consequences: more boxes and more latency. You'll potentially need a much larger box count to handle the same load. So if you need to scale and have money concerns then this can be a problem. If it isn't a concern then it doesn't matter. Within a server it means increased latency, the sort of latency which can't be improved by multiplying processes, which may be a killer argument depending on your app.
Some examples of those who have made such a switch from Rails to Node.js and Go:
LinkedIn Moved From Rails To Node: 27 Servers Cut And Up To 20x Faster : http://highscalability.com/blog/2012/10/4/linkedin-moved-from-rails-to-node-27-servers-cut-and-up-to-2.html
Why Timehop Chose Go to Replace Our Rails App : https://medium.com/building-timehop/why-timehop-chose-go-to-replace-our-rails-app-2855ea1912d
How We Moved Our API From Ruby to Go and Saved Our Sanity : http://blog.parse.com/learn/how-we-moved-our-api-from-ruby-to-go-and-saved-our-sanity/
How We Went from 30 Servers to 2: Go : http://www.iron.io/blog/2013/03/how-we-went-from-30-servers-to-2-go.html
These posts represent arguments that are probably illustrative of what your group is going through. The decision is unfortunately not an obvious one.
It depends on the nature of what you are building, the nature of your team, the nature of resources, the nature of your skills, the nature of your goals and how you value all the different tradeoffs.
Would costs really drop? Isn't the same amount of computation done no matter the number of servers?
Depends on the type and scale of the work being done. Typically web services are IO bound, waiting on responses from other services like databases, caches, etc.
If you are using a single-threaded server, the process is blocked on IO a lot, so it spends a lot of time doing nothing. In contrast, a nonblocking server can handle many, many requests while the single-threaded server is blocked. You can keep adding processes, but there are only so many processes a single machine can run. A nonblocking server can run with the same number of processes while keeping the CPU as busy as possible handling requests. It's often possible to handle higher loads on smaller, cheaper machines when using nonblocking servers.
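As a rough back-of-the-envelope illustration of that difference (the numbers here are invented):

# Hypothetical request: 5 ms of CPU work + 95 ms waiting on IO
cpu_ms, io_ms = 5, 95

# A blocking process handles requests strictly one after another:
blocking_rps_per_process = 1000.0 / (cpu_ms + io_ms)   # => 10 requests/sec

# A nonblocking server overlaps the IO waits, so one core is limited
# mainly by the CPU portion of each request:
nonblocking_rps_per_core = 1000.0 / cpu_ms              # => 200 requests/sec

# Blocking processes needed to match one nonblocking core:
(nonblocking_rps_per_core / blocking_rps_per_process).ceil   # => 20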
If your expected request rate can be handled by an acceptable number of boxes and you don't expect huge spikes then you would be fine with single threaded servers. Nonblocking servers are great at soaking up load spikes without necessarily having to add machines.
If your work is such that response latencies don't really matter then you can get by with fewer nodes.
If your workload is CPU bound then you'll need more boxes anyway, because there won't be the same opportunity to overlap work: the servers won't be spending their time blocked on IO.

Ruby rack app on 1 CPU - how many processes should I run?

I'm quite new to Ruby web apps (coming from java).
I have VPS that has 1 CPU and 2GB of RAM and would like to play with some rails/sinatra stuff.
I'm using Ruby 2.1.0 MRI
How does the number of CPUs map to the number of web server processes I need to run? I use Puma as a web server and have the default threads (0,16) set up. But I noticed there is also a "workers" option that forks additional processes to better handle multiple requests.
Do I understand correctly that for such a setup (1 CPU) there is no point in running 2 web server processes, and the only reasonable setup is 1 process with threads?
Oh now this is a pretty big question!
The number of processes and threads isn't necessarily linked to the number of CPUs. It's more a case of the amount of memory available, the number of concurrent requests, and the amount of 'locking' stuff that's going on.
If you're going to have long running requests that block other requests, then having additional processes can help with that. You can still have more than one process with a single CPU.
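For example, on a 1-CPU / 2 GB box a reasonable starting point might look like this (illustrative values to tune, not a recommendation):

# config/puma.rb -- illustrative starting point for 1 CPU / 2 GB RAM
workers 2        # a second process so one slow/blocking request can't stall everything
threads 0, 16    # each process still serves concurrent requests on threads

Note that each additional worker costs roughly a full copy of the app's memory, so on 2 GB the worker count is bounded by RAM at least as much as by the single CPU.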
There are a number of different servers in Ruby that handle scaling in different ways, Unicorn, Puma, Thin are some of them. Doing a search on Unicorn vs Puma vs Thin can turn up some useful blog posts on the topic.
Here's a couple
http://ylan.segal-family.com/blog/2012/08/20/better-performance-on-heroku-thins-vs-unicorn-vs-puma/
https://www.engineyard.com/articles/rails-server
https://www.ruby-forum.com/topic/1822610
And some information on concurrency in Ruby
http://merbist.com/2011/02/22/concurrency-in-ruby-explained/
The TL;DR answer is: it depends!

How to deploy a threadsafe asynchronous Rails app?

I've read tons of material around the web about thread safety and performance in different versions of ruby and rails and I think I understand those things quite well at this point.
What seems to be oddly missing from the discussions is how to actually deploy an asynchronous Rails app. When talking about threads and synchronicity in an app, there are two things people want to optimize:
utilizing all CPU cores with minimal RAM usage
being able to serve new requests while previous requests are waiting on IO
Point 1 is where people get (rightly) excited about JRuby. For this question I am only trying to optimize point 2.
Say this is the only controller in my app:
class TheController < ActionController::Base
  def fast
    render :text => "hello"
  end

  def slow
    render :text => User.count.to_s
  end
end
fast has no IO and can serve hundreds or thousands of requests per second, and slow has to send a request over the network, wait for work to be done, then receive the answer over the network, and is therefore much slower than fast.
So an ideal deployment would allow hundreds of requests to fast to be fulfilled while a request to slow is waiting on IO.
What seems to be missing from the discussions around the web is which layer of the stack is responsible for enabling this concurrency. thin has a --threaded flag, which will "Call the Rack application in threads [experimental]" -- does that start a new thread for each incoming request? Spool up rack app instances in threads that persist and wait for incoming requests?
Is thin the only way or are there others? Does the ruby runtime matter for optimizing point 2?
The right approach for you depends heavily on what your slow method is doing.
In a perfect world, you could use something like the sinatra-synchrony gem to handle each request in a fiber. You'd only be limited by the maximum number of fibers. Unfortunately, the stack size on fibers is hardcoded, and it is easy to overrun in a Rails app. Additionally, I've read a few horror stories of the difficulties of debugging fibers, due to the automatic yielding after async IO has been initiated. Race conditions are still possible when using fibers, as well. Currently, fibered Ruby is a bit of a ghetto, at least on the front-end of a web app.
A more pragmatic solution that doesn't require code changes is to use a Rack server that has a pool of worker threads, such as Rainbows! or Puma. I believe Thin's --threaded flag handles each request in a new thread, but spinning up a native OS thread is not cheap. Better to use a thread pool with the pool size set sufficiently high. In Rails, don't forget to set config.threadsafe! in production.
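In the Rails 3.x era this question comes from, that setting looks roughly like the following (the application name is a placeholder):

# config/environments/production.rb (Rails 3.x)
YourApp::Application.configure do
  # Stop wrapping each request in a global lock so a threaded server
  # (Puma, Rainbows!, etc.) can actually serve requests concurrently:
  config.threadsafe!
end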
If you're OK with changing code, you can check out Konstantin Haase's excellent talk on real-time Rack. He discusses using the EventMachine::Deferrable class to produce a response outside of the traditional request/response cycle that Rack is built on. This seems really neat, but you have to rewrite the code in an async style.
Also take a look at Cramp and Goliath. These let you implement your slow method in a separate Rack app that is hosted alongside your Rails app, but you will probably have to rewrite your code to work in the Cramp/Goliath handlers as well.
As for your question about the Ruby runtime, it also depends on the work that slow is doing. If you're doing CPU-heavy computation, then you run the risk of the GIL giving you issues. If you're doing IO, then the GIL shouldn't get in your way. (I say shouldn't because I believe I've read about issues with the older mysql gem blocking the GIL.)
Personally, I've had success using sinatra-synchrony for a backend, mashup web service. I can issue several requests to external web services in parallel, and wait for all of them to return. Meanwhile, the frontend Rails server uses a thread pool, and makes requests directly to the backend. Not perfect, but it works well enough right now.

Ruby on Rails: How would I handle 10 concurrent users? Do I need more CPU?

Sorry if this might seem obvious. I've monitored that a web request on my Rails app uses 30-33% of CPU every time. For example, if I load a web page, then 30% of CPU is used. Does that mean that my box can only handle 3 concurrent web requests, and will stall if there are more than 3 web requests (i.e. I'll get a 100% CPU)?
If so, does that also mean that if I want to handle more than 3 concurrent web requests, then I'll have to get more servers to handle the load using a load balancer? (e.g. to handle 6 concurrent web requests, I'll need 2 servers; for 9 concurrent requests, I'll need 3 servers; for 12, I'll need 4 servers -- and so on?)
I think you should start with load tests. I wouldn't trust manual testing that much.
Load tests tell you how long the response takes for each client, and how many clients simply time out.
Also you will be able to measure the improvements objectively for any changes that you make.
Look at ab, or httperf; there are many other tools available.
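For example, a quick ab run against the page in question might look like this (the URL and numbers are placeholders):

# 1000 requests total, 10 concurrent clients
ab -n 1000 -c 10 http://your-server/some/page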
Stephan
Your Apache or Nginx in front of Passenger will queue requests until a Passenger worker becomes available. You can limit the number of concurrent workers so your server never stalls (but new visitors will have to wait longer until it's their turn).
It's difficult to tell based on this information. It depends very much on the web server stack you're using and which environment you're running. Different servers (Mongrel, Webrick, Apache using various mechanisms, Unicorn) all have different memory characteristics. Different environments (development vs. test vs. production) all exhibit radically different memory usage characteristics.
