How can I reduce the average Apache process size?

I'm running an API to a database, written in Python (Flask-Restless) and served using apache2 and mod_wsgi, as per the Flask documentation.
The app is hosted on AWS EC2 instances inside an Auto Scaling group.
We're currently hosting on m3.medium instances which have:
1 vCPU
3.75 GB RAM
A regular problem we have is memory errors: Apache uses up all available memory on the instances and causes critical memory-allocation failures. This issue with Apache is well documented.
The docs and other S.O. questions explain that I can prevent excess memory usage by finding the average size of an Apache process and then limiting MaxRequestWorkers (a.k.a. MaxClients) to ensure that I can't start more workers than I have memory for. However, most docs don't show how to reduce the RAM that Apache uses in the first place.
I'm using the following command to get the average process size:
sudo ps -ylC apache2 | awk '{x += $8;y += 1} END {print "Apache Memory Usage (MB): "x/1024; print "Average Process Size (MB): "x/((y-1)*1024)}'
This currently comes out at ~130 MB per process. Is that an unreasonably large number?
sudo free -m shows that I have ~3000 MB of memory free, so I can run about 3000/130 ≈ 23 workers before I use too much memory.
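For reference, the same arithmetic can be scripted as a convenience wrapper around the two commands above (the variable names are my own):
# hypothetical helper: derive a MaxRequestWorkers cap from the current averages
AVG_MB=$(sudo ps -ylC apache2 --no-headers | awk '{x += $8} END {print int(x / (NR * 1024))}')
FREE_MB=$(free -m | awk '/^Mem:/ {print $4}')
echo "Suggested MaxRequestWorkers: $((FREE_MB / AVG_MB))"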
I can successfully limit the number of workers to prevent excess memory usage; however, when I do this, the result is simply that I drop ~100 requests/min with 504 errors, which is unacceptable in my scenario.
Clearly I have two options: increase the amount of RAM in my servers, or reduce the amount of memory that each Apache process consumes.
I have already removed unnecessary modules from Apache, but I'm not sure what else I can do to reduce the memory footprint of each apache2 process. I want to run 50-100 workers, which would require up to 13 GB of RAM; that seems excessive.
The web app is used to manage a system of about 30,000 internet-connected products. Most requests either log a connection to the system or return a JSON array of data to the user.
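One thing I'm considering, since I'm on mod_wsgi, is making sure the app runs in daemon mode with an explicit process/thread count, roughly like this (the names and paths are placeholders, not my actual config):
# run the Flask app in a fixed-size pool of daemon processes instead of
# letting every Apache worker embed its own Python interpreter
WSGIDaemonProcess myapi processes=4 threads=15 display-name=%{GROUP}
WSGIProcessGroup myapi
WSGIScriptAlias / /var/www/myapi/myapi.wsgi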
My Apache2.conf file is as follows:
Timeout 120
KeepAlive on
MaxKeepAliveRequests 100
KeepAliveTimeout 5
ThreadsPerChild 20
MaxRequestWorkers 100
ServerLimit 100
User ${APACHE_RUN_USER}
Group ${APACHE_RUN_GROUP}
HostnameLookups Off
LogLevel warn
IncludeOptional mods-enabled/*.load
IncludeOptional mods-enabled/*.conf
Include ports.conf
AccessFileName .htaccess
IncludeOptional conf-enabled/*.conf
IncludeOptional sites-enabled/*.conf
What steps can I take now, given that I don't think my app ought to require servers with 13+ GB of RAM, and that I don't want to reduce MaxRequestWorkers because doing so causes the system to reject about 100 requests/min?

Related

Nginx + Unicorn : difference between both worker_processes

I have a Rails app running in an Nginx + Unicorn environment.
I see that both have a "worker_processes" setting in their config files, and I wonder what the optimal configuration is.
If I have 4 cores, should I put 4 for both? Or 1 for nginx and 4 for unicorn?
(by the way, I am using sidekiq too, so what about sidekiq concurrency?)
Nginx
Nginx is an event-based server, which means that one operating system (OS) process can manage a very large number of connections. This is possible because the usual state of a connection is waiting: while a connection is waiting for the other side, or for a packet of data to be sent or received, nginx can process other connections. One nginx worker can handle thousands or even tens of thousands of connections, so even worker_processes 1 can be enough.
More nginx workers let you use more CPU cores (which can be important if nginx is the main CPU consumer). More workers are also good if nginx does a lot of disk IO.
Summary: you can safely start with 1 worker and increase up to the number of CPU cores.
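As a minimal sketch (the connection limit shown is nginx's stock default, not a tuned value):
# nginx.conf: one worker is often enough; "auto" matches the core count
worker_processes auto;
events {
    worker_connections 1024;  # simultaneous connections handled per worker
}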
Unicorn
A Unicorn worker is a little different from an nginx worker because one worker = one request. The number of unicorn workers determines how many Ruby processes execute at the same time, and that number depends on your application.
For example, suppose your application is CPU-bound (doing only math). In that case, running more workers than you have CPU cores can cause problems.
But a typical application talks to a database and sleeps while waiting for its answer. If the database lives on another server (so request processing doesn't eat our CPU), Ruby sleeps and the CPU idles. In this case we can increase the number of workers to CPU*3...CPU*5, or even CPU*20.
Summary: the best way to find this number is load testing your real application. Set a number of unicorn workers, start a load test with the same level of concurrency, and if the server feels good, increase the number of workers and test again.
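A small sketch of how you might parameterise this for load testing (UNICORN_WORKERS is an environment variable of my own invention, not a unicorn convention):
# config/unicorn.rb: read the worker count from the environment so each
# load-test run can try a different value without editing the file
worker_processes Integer(ENV['UNICORN_WORKERS'] || 4)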
Sidekiq
Sidekiq's concurrency is similar to unicorn workers. If your tasks are CPU-bound, set the number of threads close to the number of CPUs. If they are I/O-bound, the number of threads can be greater than the number of CPU cores. As with unicorn, the server's other workloads matter too; just remember that the number of CPU cores does not change if you run sidekiq on the same server as unicorn :)
Summary: same as unicorn.
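For example, in sidekiq's config file (the value itself is only a placeholder you'd settle on by testing):
# config/sidekiq.yml: fewer threads for CPU-bound jobs, more for I/O-bound ones
:concurrency: 10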
There is no absolute best answer. (If there were, the software would tune itself automatically.)
What the best possible solution is depends on your operating system, environment, processor, memory, discs, the buffer cache of the OS, the caching policy in nginx, hit rates, your application, and probably many other factors.
Which, not very surprisingly, is actually what the documentation of nginx says, too:
http://nginx.org/r/worker_processes
The optimal value depends on many factors including (but not limited to) the number of CPU cores, the number of hard disk drives that store data, and load pattern. When one is in doubt, setting it to the number of available CPU cores would be a good start (the value “auto” will try to autodetect it).
As for unicorn, a quick search for "worker_processes unicorn" reveals the following as the first hit:
http://bogomips.org/unicorn/TUNING.html
worker_processes should be scaled to the number of processes your backend system(s) can support. DO NOT scale it to the number of external network clients your application expects to be serving. unicorn is NOT for serving slow clients, that is the job of nginx.
worker_processes should be at least the number of CPU cores on a dedicated server (unless you do not have enough memory). If your application has occasionally slow responses that are /not/ CPU-intensive, you may increase this to workaround those inefficiencies.
…
Never, ever, increase worker_processes to the point where the system runs out of physical memory and hits swap. Production servers should never see heavy swap activity.
https://bogomips.org/unicorn/Unicorn/Configurator.html#method-i-worker_processes
sets the current number of #worker_processes to nr. Each worker process will serve exactly one client at a time. You can increment or decrement this value at runtime by sending SIGTTIN or SIGTTOU respectively to the master process without reloading the rest of your Unicorn configuration. See the SIGNALS document for more information.
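For instance, assuming a conventional pid-file location (the path is an assumption), you can resize the worker pool of a live master like this:
# add one unicorn worker, then later remove one, without a restart
kill -TTIN $(cat /var/run/unicorn.pid)   # increment worker_processes by 1
kill -TTOU $(cat /var/run/unicorn.pid)   # decrement worker_processes by 1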
In summary:
for nginx, it is best to keep it at or below the number of CPUs (and I'd probably not count the hyperthreaded ones, especially if you have other stuff running on the same server) and/or discs,
whereas for unicorn, it looks like it probably has to be at least the number of CPUs; in addition, if you have sufficient memory, and depending on your workload, you may want to increase it well beyond the raw number of CPUs.
The general rule of thumb is to use one worker process per core your server has. So setting worker_processes 4; would be optimal in your scenario, for both the nginx and Unicorn config files, as shown in the examples here:
nginx.conf
# One worker process per CPU core is a good guideline.
worker_processes 4;
unicorn.rb
# The number of worker processes you have here should equal the number of CPU
# cores your server has.
worker_processes (ENV['RAILS_ENV'] == 'production' ? 4 : 1)
More information on Sidekiq concurrency can be found here:
You can tune the amount of concurrency in your sidekiq process. By default, one sidekiq process creates 25 threads. If that's crushing your machine with I/O, you can adjust it down:
sidekiq -c 10
Don't set the concurrency higher than 50. I've seen stability issues with concurrency of 100, for example. Note that ActiveRecord has a connection pool which needs to be properly configured in config/database.yml to work well with heavy concurrency. Set the pool setting to something close or equal to the number of threads:
production:
  adapter: mysql2
  database: foo_production
  pool: 25

How do I determine the right number of Puma workers and threads to run on a Heroku Performance dyno?

I've read all of the articles I can find on Heroku about Puma and dyno types and I can't get a straight answer.
I see some mentions that the number of Puma workers should be determined by the number of cores. I can't find anywhere that Heroku reveals how many cores a performance-M or performance-L dyno has.
In this article, Heroku hinted at an approach:
https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server
I think they're suggesting to set the threads to 1 and increase the number of Puma workers until you start to see R14 (memory) errors, then back off. And then increase the number of threads until the CPU maxes out, although I don't think Heroku reports on CPU utilization.
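For concreteness, that approach maps onto a config/puma.rb along these lines (WEB_CONCURRENCY and RAILS_MAX_THREADS are the conventional Heroku variable names; the defaults are just starting points):
# config/puma.rb: start with 1 thread per worker, raise workers until R14
# errors appear, then raise threads; both knobs are set per-dyno via env vars
workers Integer(ENV['WEB_CONCURRENCY'] || 2)
threads_count = Integer(ENV['RAILS_MAX_THREADS'] || 1)
threads threads_count, threads_count
preload_app!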
Can anyone provide guidance?
(I also want to decide whether I should use one performance-L or multiple performance-M dynos, but I think that will be clear once I figure out how to set the workers & threads)
The roadmap I've currently figured out is like this:
heroku run "cat /proc/cpuinfo" --size performance-m --app yourapp
heroku run "cat /proc/cpuinfo" --size performance-l --app yourapp
Write down the processor information you get.
Google the model type, family, model, and stepping number of the Intel processor, and look up how many cores that processor has or simulates.
Take a look at this: https://devcenter.heroku.com/articles/dynos#process-thread-limits
Do some small experiments with standard-2X / standard-1X dynos to determine a baseline PUMA_WORKER value.
Do your math like this:
(max threads your desired dyno type supports) / (max threads the baseline dyno supports) * (your experimental PUMA_WORKER value on the baseline dyno) - (number of CPU cores)
For example, if PUMA_WORKER is 3 on my standard-2X baseline dyno, then the PUMA_WORKER number I would start testing with on performance-m would be:
16384 / 512 * 3 - 4 = 92
You should also consider how much memory your app consumes and pick the lower of the two numbers.
EDIT: my answer was written before ps:exec was available. You can read the official documentation to learn how to SSH into running dyno(s); it should be quite a bit easier than before.
Currently facing the same issue for an application running in production on AWS (we are using ECS), and trying to find the right fit between:
Quantity of vCPU / RAM per instance
Number of instances
Number of puma_threads running per instance (each instance runs a single puma process)
In order to better understand how our application consumes its pool of puma_threads, we did the following:
Exported puma metrics to CloudWatch (threads running + backlog); we then saw that at around 15 concurrent threads the backlog starts to grow.
Compared this with vCPU usage, and saw that our vCPU never went above 25%.
Using these two pieces of information together, we decided on the actions described above.
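For anyone who wants the same numbers without a full CloudWatch pipeline, puma's control interface exposes them directly, assuming the control app is enabled in the puma config (the state-file path here is an assumption from our setup):
# sample running threads and backlog from a live puma via its control socket;
# requires activate_control_app in config/puma.rb
bundle exec pumactl --state /tmp/puma.state stats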
Finally, I would like to share this article, which I found very interesting on this topic.

Rails application servers

I've been reading about how different Rails application servers work for a while, and some things have me confused, probably because of my lack of knowledge in this field. Anyway, the following things confuse me:
Puma server has the following line about its clustered mode workers number in its readme:
On a ruby implementation that offers native threads, you should tune this number to match the number of cores available
So if I have, let's say, 2 cores and use Rubinius as the Ruby implementation, should I still use more than one process, considering that Rubinius uses native threads and doesn't have a global interpreter lock, and thus uses all the CPU cores anyway, even with one process?
My understanding is that I'd then only need to increase the thread pool of that single process if I upgraded to a machine with more cores and memory; if that's not correct, please explain it to me.
I've read some articles on using Server-Sent Events with puma, which, as far as I understand, block a puma thread since the browser keeps the connection open. So if I have 16 threads and 16 people are using my site, would the 17th have to wait for one of those 16 to leave before connecting? That's not very efficient, is it? Or what am I missing?
If I have a 1-core machine with 3 GB of RAM (just for the sake of the question) and use unicorn as my application server, and one worker takes 300 MB of memory with insignificant CPU usage, how many workers should I have? Some say the number of workers should equal the number of cores, but if I set the worker count to, let's say, 7 (since I have enough RAM for it), it will be able to handle 7 concurrent requests, won't it? So is it just a question of memory, CPU usage, and the amount of RAM? Or what am I missing?

Rails: Server Monitoring - Ruby Running 17 processes?

I am monitoring my server with New Relic, and my app's memory consumption is rather high, about 1 GB, even though I am currently the only visitor to the site. When I drill down, I see that most of the consumption is due to Ruby; it says 17 instances are running. What does this mean, and how can I lower it?
Unicorn is configured to run a certain number of instances by default. You can configure this number explicitly in config/unicorn.rb using worker_processes 4 (to run 4 instances). Each unicorn instance loads the entire stack of your application and keeps it in memory; a mid-sized Rails application tends to weigh in at ~100 MB and up, and it should stay around that level provided there aren't any memory leaks. Memory consumption is generally driven by the number of dependencies and the complexity of the application.
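For example, on a small box you might drop the count explicitly (the number here is illustrative):
# config/unicorn.rb: run 4 workers instead of 17
worker_processes 4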

High traffic rails perf tuning

I was evaluating various Rails server solutions. First on my list was an nginx + Passenger system. I spun up an EC2 instance with 8 GB of RAM and 2 processors, installed nginx and Passenger, and added this to the nginx.conf file:
passenger_max_pool_size 30;
passenger_pool_idle_time 0;
rails_framework_spawner_idle_time 0;
rails_app_spawner_idle_time 0;
rails_spawn_method smart;
I added a little "awesome" controller to rails that would just render :text => (2+2).to_s
Then I spun up a little box and ran this to test it:
ab -n 5000 -c 5 'http://server/awesome'
And the CPU while this was running on the box looked pretty much like this:
05:29:12 PM CPU %user %nice %system %iowait %steal %idle
05:29:36 PM all 62.39 0.00 10.79 0.04 21.28 5.50
And I'm noticing that it takes only 7-10 simultaneous requests to bring the CPU to <1% idle, and of course this begins to seriously drag down response times.
So I'm wondering, is a lot of CPU load just the cost of doing business with Rails? Can it only serve a half dozen or so super-cheap requests simultaneously, even with a giant pile of RAM and a couple of cores? Are there any great perf suggestions to get me serving 15-30 simultaneous requests?
Update: I tried the same thing on one of the "super mega lots and lots of CPUs" EC2 instances. Holy crap, was that a lot of CPU power. The sweet spot seemed to be about 2 simultaneous requests per CPU; I was able to get up to about 630 requests/second at 16 simultaneous requests. I don't know whether that's actually cost-efficient compared to a lot of little boxes, though.
I must say that my Rails app got a massive boost, from supporting about 20 concurrent users initially to about 80, after adding some memcached servers (4 mediums on EC2). I run a high-traffic sports site that really took off a few months ago. The database is about 6 GB in size with heavy updates/inserts.
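For reference, wiring Rails to a memcached pool is a one-liner; the hostnames here are placeholders for your own cache nodes:
# config/environments/production.rb: point the Rails cache at memcached
config.cache_store = :mem_cache_store, 'cache1.example.com', 'cache2.example.com'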
The MySQL (RDS large, high-usage) cache also helped a bit.
I've tried playing with the Passenger settings but got some curious results: for example, each thread eats up 250 MB of RAM, which is odd considering the application isn't that big.
You can also save massive $ by using spot instances, but don't rely on them entirely: their pricing seems to spike on occasion. I'd use Auto Scaling with two policies, one with spot instances and another with on-demand (read: reserved) instances.
