I am working to optimize the speed of a Rails application we have in production.
It is built on Apache, Passenger, and Rails 2.3.5, running on Ubuntu on an m1.large EC2 instance (7.5 GB RAM, 4 Compute Units). We will be switching to nginx in the near term, but we have some dependencies on Apache right now.
I'm running load tests against it with ab and httperf.
We are consistently seeing about 45 requests/sec across 30 - 200 concurrent users for a simple API request that fetches a single record from an indexed database table with 100k records. That seems very slow to me.
We are focusing on both application code optimization as well as server configuration optimization.
I am currently focused on server configuration optimization. Here's what I've done so far:
Passenger: adjusted PassengerMaxPoolSize to (90% of total memory) / (memory per Passenger process, ~230 MB).
Apache: adjusted MaxClients to a max of 256.
I made both changes independently and together, and saw no impact on req/sec.
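For concreteness, here is the pool-size arithmetic above as a quick Ruby sketch (the 230 MB figure is the measured per-process footprint):

# Back-of-the-envelope PassengerMaxPoolSize for the m1.large above
total_memory_mb = 7.5 * 1024                          # 7680 MB on an m1.large
per_process_mb  = 230                                 # measured Passenger process footprint
puts (total_memory_mb * 0.9 / per_process_mb).floor   # => 30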
Another note: this seems to scale roughly linearly as servers are added, so it doesn't appear to be a database issue: 2 servers gave ~90 req/sec, 3 servers around 120 req/sec, and so on.
Any tips? It just seems like we should be getting better performance from adding more processes.
Related
I've read all of the articles I can find on Heroku about Puma and dyno types and I can't get a straight answer.
I see some mentions that the number of Puma workers should be determined by the number of cores. I can't find anywhere that Heroku reveals how many cores a performance-M or performance-L dyno has.
In this article, Heroku hinted at an approach:
https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server
I think they're suggesting setting the threads to 1 and increasing the number of Puma workers until you start to see R14 (memory) errors, then backing off; then increasing the number of threads until the CPU maxes out, although I don't think Heroku reports on CPU utilization.
Can anyone provide guidance?
(I also want to decide whether I should use one performance-L or multiple performance-M dynos, but I think that will be clear once I figure out how to set the workers & threads)
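For reference, the knobs in question live in config/puma.rb; here's a minimal sketch of what I mean, assuming Heroku's conventional WEB_CONCURRENCY and RAILS_MAX_THREADS env vars:

# config/puma.rb (sketch; tune the env vars per the experiment described above)
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))           # Puma worker processes
threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 1))
threads threads_count, threads_count                       # min and max threads per worker
preload_app!                                               # load the app before forking, for copy-on-write savings

on_worker_boot do
  # each forked worker needs its own database connections
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
end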
The roadmap I've worked out currently looks like this:
heroku run "cat /proc/cpuinfo" --size performance-m --app yourapp
heroku run "cat /proc/cpuinfo" --size performance-l --app yourapp
Write down the processor information you get.
Google the Intel processor's model type, family, model, and stepping numbers, and look up how many cores the processor has (or simulates).
Take a look at https://devcenter.heroku.com/articles/dynos#process-thread-limits
Do some small experiments with standard-2X / standard-1X dynos to determine a baseline PUMA_WORKER value.
Do your math like this:
(max threads your desired dyno type can support) / (max threads the baseline dyno can support) x (the `PUMA_WORKER` value from your experiment on the baseline dyno) - (number of CPU cores)
For example, if PUMA_WORKER is 3 on my standard-2X baseline dyno, then the PUMA_WORKER number I would start testing with on performance-m would be:
16384 / 512 * 3 - 4 = 92
You should also consider how much memory your app consumes and pick the lowest one.
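The same arithmetic as Ruby, in case it helps (the helper name is mine):

# Hypothetical helper for the rule of thumb above
def starting_puma_workers(dyno_max_threads:, baseline_max_threads:, baseline_workers:, cpu_cores:)
  dyno_max_threads / baseline_max_threads * baseline_workers - cpu_cores
end

puts starting_puma_workers(dyno_max_threads: 16_384,   # performance-m thread limit
                           baseline_max_threads: 512,  # standard-2X thread limit
                           baseline_workers: 3,        # PUMA_WORKER found by experiment
                           cpu_cores: 4)               # cores seen in /proc/cpuinfo
# => 92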
EDIT: This answer was written before heroku ps:exec was available. You can read the official documentation to learn how to SSH into a running dyno; it is much easier than before.
I'm currently facing the same issue for an application running in production on AWS (we are using ECS), and trying to find the right fit between:
Quantity of vCPU / RAM per instance
Number of instances
Number of puma_threads running per instance (each instance runs a single Puma process)
In order to better understand how our application consumes the pool of puma_threads, we did the following:
Exported Puma metrics to CloudWatch (threads running + backlog); we then saw that at around 15 concurrent threads, the backlog starts to grow (a sketch of the export follows this list).
Compared this with vCPU usage; we saw that our vCPU was never above 25%.
Using these two pieces of information together, we decided to take the actions described above.
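Purely as illustration, a minimal sketch of that metric export, assuming the code runs inside a Puma server in cluster mode (where Puma.stats is available); the namespace and metric names here are made up:

require "json"
require "aws-sdk-cloudwatch"   # gem "aws-sdk-cloudwatch"; region/credentials come from the environment

# Sum running threads and backlog across all Puma workers, then push to CloudWatch
stats   = JSON.parse(Puma.stats)            # cluster mode: {"workers"=>..., "worker_status"=>[...]}
status  = stats.fetch("worker_status", [])
running = status.sum { |w| w.dig("last_status", "running").to_i }
backlog = status.sum { |w| w.dig("last_status", "backlog").to_i }

Aws::CloudWatch::Client.new.put_metric_data(
  namespace: "MyApp/Puma",                  # hypothetical namespace
  metric_data: [
    { metric_name: "ThreadsRunning", value: running },
    { metric_name: "Backlog",        value: backlog }
  ]
)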
Finally, I would like to share this article, which I found very interesting on this topic.
I've found the following at Docs: Scaling Puppet:
Are you using the default webserver?
WEBrick, the default web server used to enable Puppet’s web services connectivity, is essentially a reference implementation, and becomes unreliable beyond about ten managed nodes. In any sort of production environment serving many nodes, you should switch to a more efficient web server implementation such as Passenger or Mongrel.
Where does the number 10 come from in "ten managed nodes"?
I have a little over 20 nodes and I might soon have little over 30. Should I change to Passenger or not?
You should change to Passenger when you start having problems with WEBrick (or a little before). When that happens for you will depend on your workload.
The biggest problem with WEBrick is that it's single-threaded and blocking; once it's started working on a request, it cannot handle any other requests until it's done with the first one. Thus, what will make the difference to you is how much of the time Puppet spends processing requests.
Each time a client asks for its catalog, that's a request. Each separate file retrieved via puppet:/// URLs is also a request. If you're using Puppet lightly, each catalog won't take too long to generate, you won't be distributing many files on any given Puppet run, and each client won't be taking more than four to six seconds of server time every hour. If each client takes four seconds of server time per hour, 10 clients have a 5% chance of collisions[0]--of at least one client having to wait while another's request is processed. For 20 or 30 clients, those chances are 19% and 39%, respectively. As long as each request is short, you might be able to live with some contention, but the odds of collisions increase pretty quickly, so if you've got more than, say, 50 hosts (75% collision chance) you really ought to be using Passenger, unless active performance measurement shows that you're doing okay.
If, however, you're working your Puppet master harder--taking longer to generate catalogs, serving lots of files, serving large files, or whatever--you need to switch to Passenger sooner. I inherited a set of about thirty hosts with a WEBrick Puppet master where things were doing okay, but when I started deploying new systems, all of the Puppet traffic caused by a fresh deployment (including a couple of gigabyte files[1]) was preventing other hosts from getting their updates, so that's when I was forced to switch to Passenger.
In short, you'll probably be okay with 30 nodes if you're not doing anything too intense with Puppet, but at that point you need to be monitoring the performance of at least your Puppet master and preferably your clients' update status, too, so you'll know when you start running beyond the capabilities of WEBrick.
[0] This is a standard birthday-paradox calculation: if n is the number of clients and s is the average number of seconds of server time each client uses per hour, then there are 3600/s one-request "slots" per hour, and the chance of at least one collision during an hour is 1 - (3600/s)! / ((3600/s)^n * ((3600/s) - n)!).
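A quick Ruby check of those percentages, using the product form of the same formula to avoid enormous factorials:

# Collision odds from footnote [0]: n clients, each using s seconds of server time per hour
def collision_chance(clients, seconds_per_hour)
  slots = 3600.0 / seconds_per_hour                 # 900 one-request slots when s = 4
  p_clear = (0...clients).inject(1.0) { |p, k| p * (slots - k) / slots }
  1.0 - p_clear
end

[10, 20, 30, 50].each do |n|
  printf("%2d clients: %2.0f%%\n", n, collision_chance(n, 4) * 100)
end
# prints 5%, 19%, 39%, and 75% for 10, 20, 30, and 50 clients respectively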
[1] Puppet isn't really a good avenue for distributing files of this size in any case. I eventually switched to putting them on an NFS share that all of the hosts had access to.
For 20-30 nodes, there shouldn't be any problem. Note that Passenger provides some additional features. It may be faster at serving the nodes, but I am not sure how much improvement you will see if you have only 30 nodes.
You should change to Passenger if you are using more than a hundred nodes. I started seeing problems when the number of nodes requesting service from the puppetmaster reached about 200. In my case, with the default web server, a random ~5% of the nodes couldn't receive the catalog during the hourly run.
Preface: please don't start a discussion on premature optimization, or anything related. I'm just trying to understand what kind of performance I can get from a single server with Rails.
I have been benchmarking Ruby on Rails 3, and it seems the highest throughput I can get is around 100 requests per second.
I used Phusion Passenger with nginx, and Ruby 1.8.7.
This is on an EC2 m1.large instance:
7.5 GB memory
4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
850 GB instance storage
64-bit platform
I/O Performance: High
API name: m1.large
The page was a very simple action that wrote a single row into mysql.
user = User.new
user.name = "test"
user.save
I am assuming no caching (memcached, etc.); I just want to get a feel for the raw numbers.
I used Apache Bench on the same EC2 instance, with varying numbers of requests (1,000 to 10,000) and varying concurrency levels (1/5/10/25/50/100).
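A typical invocation from that sweep looked something like this (the path is a placeholder for the action above):

ab -n 5000 -c 25 'http://localhost/users/create'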
The EC2 m1.large instance is really not that fast, so these numbers aren't surprising. If you want performance, you can either spring for a larger instance, as there are now some with 80 ECUs, or try a different provider.
I've found that Linode generally offers higher performance for the same price, but is not as flexible and doesn't scale to very large numbers of servers as well. It's a much better fit if you're in the "under 20 servers" phase of rolling out.
Also, don't think that the MySQL insert is a no-cost operation.
I was attempting to evaluate various Rails server solutions. First on my list was an nginx + Passenger system. I spun up an EC2 instance with 8 GB of RAM and 2 processors, installed nginx and Passenger, and added this to the nginx.conf file:
passenger_max_pool_size 30;            # allow up to 30 application processes
passenger_pool_idle_time 0;            # never shut down processes for being idle
rails_framework_spawner_idle_time 0;   # keep the framework spawner alive indefinitely
rails_app_spawner_idle_time 0;         # keep the app spawner alive indefinitely
rails_spawn_method smart;              # preload framework and app so processes can share code
I added a little "awesome" controller to Rails that would just render :text => (2+2).to_s
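In other words, something along these lines (a sketch; only the rendered string comes from the original test):

class AwesomeController < ApplicationController
  # Deliberately trivial action: no database, no view, just four bytes of text
  def index
    render :text => (2 + 2).to_s
  end
end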
Then I spun up a little box and ran this to test it:
ab -n 5000 -c 5 'http://server/awesome'
And the CPU while this was running on the box looked pretty much like this:
05:29:12 PM CPU %user %nice %system %iowait %steal %idle
05:29:36 PM all 62.39 0.00 10.79 0.04 21.28 5.50
And I'm noticing that it takes only 7-10 simultaneous requests to bring the CPU to <1% idle, and of course this begins to seriously drag down response times.
So I'm wondering, is a lot of CPU load just the cost of doing business with Rails? Can it only serve a half dozen or so super-cheap requests simultaneously, even with a giant pile of RAM and a couple of cores? Are there any great perf suggestions to get me serving 15-30 simultaneous requests?
Update: I tried the same thing on one of the "super mega lots and lots of CPUs" EC2 instances. Holy crap, was that a lot of CPU power. The sweet spot seemed to be about 2 simultaneous requests per CPU; I was able to get up to about 630 requests/second at 16 simultaneous requests. I don't know if that's actually cost-efficient compared to a lot of little boxes, though.
I must say that my Rails app got a massive boost, from supporting about 20 concurrent users initially to about 80, after adding some memcached servers (4 mediums at EC2). I run a high-traffic sports site that really took off a few months ago. Database size is about 6 GB with heavy updates/inserts.
MySQL (RDS large high usage) cache also helped a bit.
I've tried playing with the Passenger settings but got some curious results; for example, each thread eats up 250 MB of RAM, which is odd considering the application isn't that big.
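If it helps, Passenger ships with a tool that reports real (private) memory per process, which is more trustworthy than plain ps for this kind of measurement:

passenger-memory-stats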
You can also save massive $ by using spot instances, but don't rely entirely on them: their pricing seems to spike on occasion. I'd use Auto Scaling with two policies, one with spot instances and another with on-demand (read: reserved) instances.
I have been working to deploy a relatively large Rails app (Rails 2.3.5), and some recent load testing showed that the site's throughput is way below the expected level of traffic.
We were running on a standard 32-bit server with 3 GB of RAM and CentOS, with Ruby Enterprise Edition (latest build), Passenger (latest build) and nginx (latest build). With only one or two users the site runs fine (as you would expect), but when we try to ramp up the load to ~50 concurrent requests it completely dies (Apache Bench reports ~2.3 req/sec, which is terrible).
We are running RPM (New Relic) and trying to determine where the load issue is, but it's pretty evenly distributed across Rails, SQL, and memcached, so we're more or less going through and optimizing the codebase.
Out of sheer desperation we spun up a large EC2 instance (Ubuntu 9.10, 7.5 GB RAM, 2 Compute Units/Cores) and set up the same configuration as the original server, and while there were more resources we were still seeing pathetic results.
So, after spending too much time trying to optimize and playing with caching configuration etc., I decided to test the throughput of some Mongrels, and ta-da, they performed much, much better than Passenger.
Currently the configuration is 15 Mongrels proxied via nginx, and we seem to be just meeting our load requirements, but it's not quite enough to make me comfortable with going live... What I was wondering is whether anyone knows of some possible causes for this?
My configuration for passenger/nginx was:
nginx workers: tried between 1 and 10, though usually three.
Passenger max pool size: 10 - 30 (yes, these numbers are quite high).
Passenger global queueing: tried both on and off.
nginx gzip: on.
It might pay to note that we had increased nginx's client_max_body_size to 200m to allow for large file uploads.
Anyway, suggestions would be really appreciated. While the Mongrels are working fine, this changes how we do things a lot, and I would really prefer to use Passenger - besides, wasn't it supposed to make this easier and perform better?
Maybe your SQL connection pool size is too small? The pool essentially limits the parallelism of database work in your application, which in turn builds up to much-increased load as soon as you put work on your app stack...
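If so, the knob lives in config/database.yml; a sketch, assuming you want at least one connection per Mongrel or Passenger process (the database name is a placeholder):

production:
  adapter: mysql
  database: myapp_production   # placeholder
  pool: 15                     # e.g. one connection for each of the 15 Mongrels above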
As a first step I would deploy a minimal "Hello World" type Rails application to your environment and see what throughput you get with that. Doing that will at least tell you if your problem is with the environment or somewhere in your application.