I've deployed my Ruby on Rails app on an AWS EC2 instance using Nginx and Puma. The app has a page that runs lots of queries in loops (I know that's bad, but we'll be improving it soon).
The problem is that this page returns a 502 Bad Gateway error and crashes the Puma server. I checked the processes on the server: the ruby process runs at 100% CPU for a few seconds, and then Puma crashes.
I'm unsure why this is happening, as the same page with the same data loads on my local PC in 6-7 seconds.
Is this some AWS limit on processes?
Or is it something on the Puma side?
Without further information, it's not possible to say exactly what's causing the problem.
As an educated guess, I'd say it could be an out-of-memory issue: if the kernel's OOM killer is terminating the process, you should see it logged in dmesg or /var/log/syslog.
I found the issue after multiple hours of debugging. It was a very rare edge case that put the server into an infinite loop, causing memory usage to grow until the process died.
I used top -i to investigate the increasing memory.
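For anyone curious, the bug had roughly this shape (a hypothetical reconstruction, not the actual code):

# Hypothetical reconstruction of the failure mode: a paging loop whose
# cursor fails to advance for one edge-case value, so the loop never
# exits and `results` grows until the machine runs out of memory.
# (Don't run this as-is: it loops forever by design.)
next_index = { 0 => 1, 1 => 2, 2 => 2 }  # the bug: index 2 maps back to itself
results = []
index = 0
until index.nil?
  results << "record-#{index}"  # every iteration allocates more memory
  index = next_index[index]     # stays at 2 forever, never becomes nil
end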
Thank you all for suggestions and responses.
Related
I have a Rails app running on the Passenger web server. From time to time I get a "queue full" error from Passenger.
I have no idea what could cause this, because there are no long-running requests in the production log, nor can I see anything like that in the New Relic monitor.
There are no memory leaks either; memory and CPU consumption are around their usual averages. Any ideas how to debug this situation?
I know about the passenger-status --show=requests flag, but it is only somewhat helpful because I don't know which requests it includes: only queued ones, or running (possibly hung) ones as well?
I'm planning a Rails app with a very content-rich, interactive page that many users will connect to.
Development went well, and small-scale testing on the dev servers went off without a hitch.
Problems started when we began alpha testing with selected groups of people. The server would suddenly grind to a halt, and Nginx would stop because its queue was full. I was at a loss for a while, but after looking around I came to the conclusion that live Action Cable connections were completely eating up my memory. It gets especially bad when a user reloads the page (which subscribes to Action Cable) multiple times, causing additional processes to become active and completely stopping the server; the only cure is an nginx restart.
I currently run alpha testing on a 2-core, 1 GB RAM VPS with SSD storage, with perhaps 20 concurrent users at most. Should I be running into performance problems at this load, or should tuning the code, Redis, or Passenger fix this?
I know it's hard to say anything definitive without more specifics, but can a ballpark estimate be made from this information?
After some googling and testing of Nginx settings, adding this directive to the nginx settings for Passenger seems to have dramatically improved the performance issue:
location /special_websocket_endpoint {
    passenger_app_group_name foo_websocket;
    passenger_force_max_concurrent_requests_per_process 0;
}
Setting passenger_force_max_concurrent_requests_per_process to 0 tells Passenger that processes in this app group can handle an unlimited number of concurrent requests, so long-lived WebSocket connections no longer fill up the queue. More info here:
https://www.phusionpassenger.com/library/config/nginx/tuning_sse_and_websockets/
20 concurrent users, even with multiple tabs per user, is still fewer than about 100 concurrent WebSocket connections, which is not that many.
The first thing I'd look for is leaks: cases where, for some reason, a WebSocket connection or other resource (open files, etc.) does not get freed when the actual user disconnects. Make sure you're running fresh versions of Rails and Passenger, as there was a bug in Rails causing similar behaviour (see https://blog.phusion.nl/2016/07/07/actioncable-under-stress-p1/ for details).
Also, while Action Cable + Passenger inside nginx lets you run everything in a single process, that is not a good idea when you expect real load.
If you run a plain nginx in front of separate Rails servers for regular requests and for the cable endpoint, at least the other parts of the app will keep more or less working under such conditions.
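For example, the Rails guides show running Action Cable as its own server via a small rackup file; a minimal sketch (run it under a separate Puma, e.g. bundle exec puma -p 28080 cable/config.ru):

# cable/config.ru - standalone Action Cable server, following the
# layout from the Rails guides
require_relative '../config/environment'
Rails.application.eager_load!

run ActionCable.server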
I used the 'unicorn-worker-killer' gem with some additional modifications for Ruby GC from here: http://blog.newrelic.com/2013/05/28/unicorn-rawk-kick-gc-out-of-the-band/
But after following the instructions both there and in the gem's README (https://github.com/kzk/unicorn-worker-killer), and deploying to the production server, my application's performance degraded gradually (the wiring I used is sketched after this list):
App server response time went from 350 ms avg to 1100 ms
Page loading time went from 6 s avg to 13 s
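The wiring I followed looks roughly like the gem's README example (the exact thresholds below are illustrative, not my production values):

# config.ru (excerpt) - unicorn-worker-killer wiring per the gem's README
require 'unicorn/worker_killer'

# Restart a worker after it serves between 3072 and 4096 requests
# (randomized so all workers don't restart at once)
use Unicorn::WorkerKiller::MaxRequests, 3072, 4096

# Restart a worker once its memory use crosses roughly 240-260 MB
use Unicorn::WorkerKiller::Oom, 240 * (1024**2), 260 * (1024**2)

require ::File.expand_path('../config/environment', __FILE__)
run YourApp::Application  # hypothetical app name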
Also, my Heroku setup is:
6 web dynos with 1 GB memory each
1 worker dyno at 1x size
3 unicorn worker processes
a DB connection limit of 40, with the DB pool set to 2 on Heroku
Please help me figure out how to optimize the page loading time and app server response time. Any ideas?
There isn't enough information here to diagnose your issue.
I would recommend against the unicorn-worker-killer gem. Unless you have a specific problem (hung workers or memory leaks), you're better off running a normal application server like unicorn, configured for Heroku.
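If it helps, a baseline unicorn setup for Heroku, along the lines of Heroku's own documentation, looks roughly like this (WEB_CONCURRENCY and the timeout are values you'd tune for your app):

# config/unicorn.rb - baseline Heroku unicorn config (sketch)
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
timeout 15
preload_app true

before_fork do |server, worker|
  # Heroku sends TERM on shutdown; have the master translate it into
  # unicorn's graceful QUIT.
  Signal.trap 'TERM' do
    puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
    Process.kill 'QUIT', Process.pid
  end
  defined?(ActiveRecord::Base) and ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  # Workers ignore TERM and wait for the master's QUIT, so in-flight
  # requests can finish.
  Signal.trap 'TERM' do
    puts 'Unicorn worker intercepting TERM and doing nothing. Wait for master to send QUIT'
  end
  defined?(ActiveRecord::Base) and ActiveRecord::Base.establish_connection
end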
If you want to diagnose your performance problems and load times, the best thing to do is use a service like New Relic (available as a Heroku add-on). It will let you measure your request times, drill down into what specifically is the bottleneck, and then fix that.
We are running a big server (6 cores / 12 threads, 64 GB RAM, two SSDs in RAID 0) for our Rails app, deployed with nginx/Passenger.
Unfortunately, pages are taking forever to load, something between 10 and 40 seconds. Yet the server is under very light load, with a load average of 0.61 0.56 0.53. The RAM numbers also look odd: free -ml reports 57 GB (of 64 GB) used, whereas htop reports only 4 GB (of 64 GB).
We have checked our production log, and Rails requests take something like 100-200 ms to complete, so almost nothing.
How can we identify the bottleneck?
This question is fairly vague, but I'll see if I can give you some pointers.
My first guess is that your app is spending a lot of time on database-related work; see below for my advice.
As for the odd memory usage, are you looking at the correct part of the free -ml output? To clarify, you want the -/+ buffers/cache: line for the accurate number. The 57 GB figure almost certainly includes the kernel's page cache, and the buffers/cache-adjusted value should be close to the 4 GB that htop shows.
You might also check whether any of your Passenger workers are hanging, as that is a fairly common issue with Passenger. You can do this by running strace -p $pid on your Passenger workers; if a worker is hung, the output will show it obviously not doing anything.
As for troubleshooting response time within Rails itself, I would highly suggest looking into New Relic (http://newrelic.com/). You can often see exactly which part of your app is causing the bad response time by breaking down how much time is spent in each part. It's a simple gem to install, and once reporting is working it's pretty invaluable for issues like this.
In the end, the bottleneck was Passenger; passenger-status is pretty useful for showing how many requests are left in the queue.
Our server is pretty decent, so we just increased the maximum number of Passenger processes in nginx.conf to 600, resulting in:
passenger_root /usr/local/rvm/gems/ruby-2.0.0-p195/gems/passenger-4.0.5;
passenger_ruby /usr/local/rvm/wrappers/ruby-2.0.0-p195/ruby;
passenger_max_pool_size 600;
After performing load testing against an app hosted on Heroku, I am finding that the most DB-intensive request takes 50-200 ms depending on load. It never gets slower, no matter the load. However, seemingly at random, the request will outright time out (30 s or more).
On Heroku, why might a relatively high-performing query/request work perfectly 8 times out of 10 and outright time out 2 times out of 10 as load increases?
If this is starting to seem like a question for Heroku itself, I'd first like to answer the question of whether "bad code" could somehow cause this issue, or whether it is clearly a problem on their end.
A bit more info:
Multiple Dynos
Cedar Stack
Dedicated Heroku DB (16 connections, 1.7 GB RAM, 1 comp. unit)
Rails 3.0.7
Thanks in advance.
Since you have multiple dynos and a dedicated DB instance, and are paying hundreds of dollars a month for their service, you should ask Heroku.
Edit: I should have added that when you check your logs, you can look for the lines from the Heroku routing layer, which takes HTTP requests and sends them to your app. Adding up the times in those lines shows how much time is being spent outside your app. Unfortunately, I don't know how easy it is to collect large volumes of those logs for a load test.
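For reference, a Cedar router log line looks roughly like this (values illustrative): connect is the time the router spent connecting to your dyno, and service is the time your app took to respond.

2012-10-11T03:47:20+00:00 heroku[router]: at=info method=GET path=/ host=myapp.herokuapp.com fwd="1.2.3.4" dyno=web.1 connect=1ms service=18ms status=200 bytes=13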