Ruby on Rails: CPU 100% and server crashed

I have an 8-core / 16 GB RAM server.
But when I load test this server, the CPU reaches 100% and the server crashes partway through the test.
My landing page sends 250+ HTTP requests per user.
The server is configured with nginx.
Please comment with any details you need and I will edit this post.

Related

Passenger uses more PostgreSQL connections than expected

This is a hard issue that has been happening in production for a long time, and we have no clue where it's coming from. We can sometimes reproduce it on localhost; Heroku Enterprise support has been clueless about it.
On our production database, we currently have the following setup:
Passenger Standalone, threading disabled, limited to 25 processes MAX. No min setup.
3 web dynos
Grouping SELECT * FROM pg_stat_activity by client_addr and counting the number of connections per instance shows that more than one PSQL connection is opened per Passenger process during our peak days.
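The count comes from a query along these lines (a sketch; pg_stat_activity and client_addr are standard PostgreSQL, the ordering is only for readability):
SELECT client_addr, count(*) AS connections
FROM pg_stat_activity
GROUP BY client_addr
ORDER BY connections DESC;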
Assumptions:
A single address is about a single Dyno (Confirmed by Heroku staff)
Passenger does not spawn more than 25 processes at the time (confirmed with passenger-status during those peaks)
Here is a screenshot of what SELECT * FROM pg_stat_activity; looks like:
In the screenshot, we can see that there are 45 PSQL connections coming from the same dyno that runs Passenger. If we follow our previous logic, there should be no more than one connection per Passenger process, so 25.
The logs don't look unusual; nothing mentions either a dyno crash or a process crash.
Here is a screenshot of our passenger-status output for the same dyno (at a different time, just to show that no more than 25 processes are created for one dyno):
And finally, one of the responses we got from Heroku support (amazing support, by the way):
I have also seen previous reports of Passenger utilising more connections than expected, but most were closed due to difficulty reproducing, unfortunately.
The Passenger documentation explains that Passenger manages the ActiveRecord connections itself.
Any leads appreciated. Thanks!
Various information:
Ruby Version: 2.4.x
Rails Version: 5.1.x
Passenger Version: 5.3.x
PG Version: 10.x
ActiveRecord Version: 5.1.x
If you need any more info, just let me know in the comments, I will happily update this post.
One last thing: we use ActionCable. I've read somewhere that Passenger handles socket connections in an unusual way (it opens a somewhat hidden process to keep the connection alive). This is one of our leads, but so far we've had no luck reproducing it on localhost. If anyone can confirm how Passenger handles ActionCable connections, it would be much appreciated.
Update 1 (01/10/2018):
Experimented:
Disabled the NewRelic auto-explain feature, as explained here: https://devcenter.heroku.com/articles/forked-pg-connections#disabling-new-relic-explain
Ran a Passenger server locally with min and max pool size set to 3 (more makes my computer burn), then killed processes with various signals (SIGKILL, SIGTERM) to see whether connections are closed properly. They are.
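For reference, the local check looked roughly like this (a sketch; the PID is a placeholder and the psql query is just a quick way to watch the connection count):
passenger-status                                    # list application processes and their PIDs
kill -TERM 12345                                    # 12345 is a placeholder PID; also tried kill -KILL
psql -c "SELECT count(*) FROM pg_stat_activity;"    # connection count drops as expected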
We finally managed to fix the issue we had with Passenger. We had actually been living with this issue for a very long time.
The fix
If you use ActionCable, and your default cable route is /cable, then change the Procfile from:
web: bundle exec passenger start -p $PORT --max-pool-size $PASSENGER_MAX_POOL_SIZE
to
web: bundle exec passenger start -p $PORT --max-pool-size $PASSENGER_MAX_POOL_SIZE --unlimited-concurrency-path /cable
Explanation
Before the change, each socket connection (ActionCable) would occupy a whole process in Passenger.
But a socket should not take a whole process: a single process can handle a great many open socket connections (more than ten thousand at a time for some big names). Fortunately, we have far fewer socket connections, but still.
After the change, we essentially told Passenger not to dedicate a whole process to each socket connection, but rather to dedicate one process to handling all the socket connections.
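For context, the /cable path in the flag is wherever ActionCable is mounted. In a default Rails 5 app that comes from something like this (a sketch of the usual setup, not our exact config):
# config/routes.rb
Rails.application.routes.draw do
  # Default ActionCable mount point; it must match the path passed to
  # --unlimited-concurrency-path so Passenger treats it as unlimited-concurrency.
  mount ActionCable.server => '/cable'
end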
Documentation
The in-depth documentation on how to do Sockets with Passenger: https://www.phusionpassenger.com/library/config/standalone/tuning_sse_and_websockets/
The flag to pass to Passenger: https://www.phusionpassenger.com/library/config/standalone/reference/#--unlimited-concurrency-path-unlimited_concurrency_paths
Some metrics, after 3 weeks with the fix
Number of forked processes on Passenger dramatically decreased (from 75 processes to ~ 15 processes)
Global memory usage on the web dynos dramatically decreased (related to previous point on forked Passenger processes)
The global number of PSQL connections dramatically decreased and has been steady for two days, even across deployments (from 150 to ~30 connections)
Number of PSQL connections per dyno dramatically decreased (from ~50 per dyno to less than 10 per dyno)
The number of Redis connections decreased and has been steady for two days (even after deployment)
Average memory usage on PostgreSQL dramatically decreased and has been steady for two days.
The overall throughput is a bit higher than usual (Throughput is the number of requests handled per minute)

403 errors from load balancer while new instances are booting

I have a RoR app running on Elastic Beanstalk. I have occasionally seen 403 errors from Passenger for a while. Most of the time 1 server is running, but this gets increased to 3 or 4 instances in busy periods during the day.
Session stickiness is not turned on.
I have noticed that when a new server is started the ELB is sending requests to it before bundle install has finished.
If I ssh to the newly started server I can see in /var/app/current/ that the app has not yet been installed and if I run top it looks like bundler is running and compiling things with cc1, etc.
/var/app/support/log/passenger.log shows that requests to valid URLs within my Rails app are being received and responded to with 404. Hardly surprising, because the app isn't there yet.
After 5-10 minutes all of the compiling is complete and the app files appear in /var/app/current and all is well.
This doesn't seem quite right to me. How do I set up the ELB / my rails app so that the ELB can tell when it is ready to receive requests?
I found the answer to this. There was no application health check URL set. In that case the ELB just pings the instance to see if it's healthy, i.e. it checks that the instance has booted rather than whether Rails is up and running. Setting the health check URL to '/login/' fixed it for me, because that path returns a 404 until Rails is running and a 200 afterwards.
Elastic Beanstalk demands 2 correct responses before it deems an instance healthy, and it checks the instance every 5 minutes. This means an instance can take a while to start serving requests: boot time + waiting for the next poll from the ELB + 5 minutes before it sees any real traffic.
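For reference, the health check URL can also be set in an .ebextensions config file rather than in the console (a sketch; the namespace is the standard aws:elasticbeanstalk:application one, and /login/ is just the path that happened to work for my app):
# .ebextensions/healthcheck.config
option_settings:
  - namespace: aws:elasticbeanstalk:application
    option_name: Application Healthcheck URL
    value: /login/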

nginx passenger server error

I am running Rails applications with nginx + Passenger.
After nginx starts serving, I can access the app,
but after some time, maybe an hour or half a day, it gives me the following message:
Internal server error
An error occurred while starting the web application. It sent an unknown response type "".
Then I need to reboot the server to get nginx serving normally again.
My server is running on AliYun and it only has 512 MB of memory; is that too small to run Passenger?
Or is something wrong with the configuration?
It's only a workaround and you should find the actual problem (by monitoring memory usage, processor usage, open file handles, etc.), but until then you can use the passenger_max_requests directive.
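A sketch of what that looks like in the nginx config (the threshold of 1000 is arbitrary; passenger_max_requests recycles an application process after it has served that many requests, which caps the damage from slow memory leaks):
# inside the server or location block that runs the Passenger app
passenger_max_requests 1000;   # restart each application process after 1000 requests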

Nginx bottleneck as load balancer?

We have a popular iPhone app where people duel each other a la Wordfeud. We have almost 1 M registered users today.
During peak hours the app gets really long response times, and there are also quite a lot of timeouts. We have tried to find the bottleneck, but have had a hard time doing so.
CPU, memory and I/O are all under 50 % on all servers. The problem ONLY appears during peak hours.
Our setup
1 VPS with nginx (1.1.9) as load balancer
4 front servers with Ruby (1.9.3p194) on Rails (3.2.5) / Unicorn (4.3.1)
1 database server with PostgreSQL 9.1.5
The database logs don't show enough long request times to explain all the timeouts shown in the nginx error log.
We have also tried building and running the app directly against the front servers (during peak hours, while all other users are going through the load balancer). The surprising thing is that the app bypassing the load balancer is quick as a bullet, even during peak hours.
NGINX SETTINGS
worker_processes=16
worker_connections=4096
multi_accept=on
LINUX SETTINGS
fs.file-max=13184484
net.ipv4.tcp_rmem="4096 87380 4194304"
net.ipv4.tcp_wmem="4096 16384 4194304"
net.ipv4.ip_local_port_range="32768 61000"
Why is the app bypassing the load balancer so fast?
Can nginx as load balancer be the bottleneck?
Is there any good way to compare timeouts in nginx with timeouts in the unicorns to see where the problem resides?
Depending on your settings nginx might be the bottleneck...
Check/tune the following settings in nginx (a combined config sketch follows the OS list below):
the worker_processes setting (should be equal to the number of cores/cpus)
the worker_connections setting (should be very high if you have lots of connections at peak)
set multi_accept on;
if on Linux, make sure nginx is using epoll (the use epoll; directive)
check/tune the following settings of your OS:
number of allowed open file descriptors (sysctl -w fs.file-max=999999 on Linux)
tcp read and write buffers (sysctl -w net.ipv4.tcp_rmem="4096 4096 16777216" and sysctl -w net.ipv4.tcp_wmem="4096 4096 16777216" on Linux)
local port range (sysctl -w net.ipv4.ip_local_port_range="1024 65536" on Linux)
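Put together, the nginx side of those recommendations looks roughly like this (a sketch; 16 workers assumes a 16-core box, and the connection count is just an example):
# /etc/nginx/nginx.conf
worker_processes 16;            # one worker per core
events {
    worker_connections 4096;    # raise if you expect more concurrent connections per worker
    multi_accept on;            # accept as many new connections as possible at once
    use epoll;                  # efficient event mechanism on Linux
}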
Update:
so you have 16 workers and 4096 connections per worker,
which means a maximum of 4096 * 16 = 65536 concurrent connections.
You probably have multiple requests per browser (ajax, css, js, the page itself, any images on the page, ...); let's say 4 requests per browser.
That allows for slightly over 16k concurrent users; is that enough for your peaks?
How do you set up your upstream server group and what is the load balancing method you use?
It's hard to imagine that nginx itself is the bottleneck. Is it possible that some upstream app servers get hit much more than others and start to refuse connections because their backlog is full? See this load balancing issue on Heroku and see if you can get more help there.
Since version 1.2.2, nginx provides least_conn. That might be an easy fix (see the sketch after the quoted documentation). I haven't tried it myself yet.
Specifies that a group should use a load balancing method where a request is passed to the server with the least number of active connections, taking into account weights of servers. If there are several such servers, they are tried using a weighted round-robin balancing method.
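A sketch of what that looks like in the load balancer config (the upstream name and backend addresses are hypothetical):
upstream rails_frontends {
    least_conn;                  # send each request to the backend with the fewest active connections
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
    server 10.0.0.13:8080;
    server 10.0.0.14:8080;
}
server {
    listen 80;
    location / {
        proxy_pass http://rails_frontends;
    }
}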

Apache doesn't use all bandwidth

I'm using Apache 2.4.1 on Windows and I'm trying to optimize the loading speed of my website www.xgclan.com with http://www.webpagetest.org.
I noticed that the download time is quite long in the report.
Today I downloaded the Windows 8 preview to my server to use as a mirror test; I put it on my Apache server and tried downloading it over my home connection, and the speed was only 500 KB/s.
My server has a 100 Mb/s duplex connection and Task Manager indicates that only 7% of the bandwidth is used.
I have 120 Mb/s down at home and I ran a speed test to make sure it's not an issue with my home connection.
Downloading works fine on the server itself, so I think it's an issue with Apache or Windows Server 2008 R2.
Can anybody help me so I can use my full 100 Mb/s upload?
This issue was caused by EnableSendfile being on; after turning it off I was able to use the full connection speed.
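The change is a one-liner in the Apache config (a sketch; put it in httpd.conf or the relevant vhost/directory block; EnableSendfile is a core Apache directive):
# httpd.conf
# Sendfile hands file delivery to the kernel; on some Windows setups it throttles throughput.
EnableSendfile Off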
