Passenger processes stuck maxing CPU after hitting 100%

The Setup:
* Ubuntu 18.04 LTS
* Apache 2.4.29
* Passenger 6.0.16
* Ruby 2.3.8
* Rails 4.2.x
I have both staging and prod servers with the same setup on AWS EC2; they are both running the same kernel/build. I upgraded the Ruby/Rails version of my app from Ruby 2.1.x -> 2.3.8, and Rails 4.0 -> 4.2, first on staging then on production.
On staging, everything was working fine; pages were loading quickly and without issue. On prod, pages would start out loading quickly but would soon degrade: user CPU would max out at 99%+, eventually causing the app to go down and become unresponsive. The only solution was to restart Apache, roughly every 30 minutes.
After a LOT of digging and testing, top -c showed that a Passenger RubyApp process would hit 100% CPU and soon after would stay "locked" at max CPU, even if no one was using the site. I've tried changing different settings in both Apache and Passenger, but nothing seems to work. Effectively, as soon as a few people hit the site in a particular way, ANY of the spawned Passenger processes that reach 100% stay pegged there and either never wind down or never exit, burning CPU as if there were some I/O issue.
Right now Passenger and Apache configs are exactly the same on staging/prod and are the defaults.
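For reference, these are the kinds of Passenger directives I was toggling in the Apache vhost while testing (everything is back to the defaults now). The directive names are standard Passenger-for-Apache settings; the values here are illustrative, not a fix:

    # Illustrative values only; we are currently running the defaults on both servers
    PassengerMaxPoolSize 6        # cap on concurrent app processes
    PassengerPoolIdleTime 300     # shut down processes idle for 5 minutes
    PassengerMaxRequests 1000     # recycle a process after N requests (band-aid for leaks/runaways)

(PassengerMaxRequestTime, which would kill requests that run too long, is an Enterprise-only directive.)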
Screenshot of top in prod with a few users on the site.
And one from staging with roughly the same number of users.
Staging looks far more like what I'd expect from a Rails app -- higher memory use than CPU. AWS Support was also baffled, since prod is on an XL instance and staging is on a Micro, yet both run the same AWS kernel version. Here's the AWS monitoring around CPU usage... prod was updated on the 20th, but not many people used it over the weekend, and it really became a problem on Monday during working hours.
Any ideas why this is happening on one server and not the other? It's no particular request that causes it; literally any request (or 2-3 coming in tandem) will cause the CPU to spike to 100% and get stuck there.
TIA.
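One way to narrow down what a pegged process is doing (a triage sketch; <PID> is a placeholder for one of the stuck processes):

    sudo passenger-status                   # pool overview: processes, sessions, CPU/memory
    sudo passenger-status --show=requests   # what each process is handling right now
    sudo strace -fp <PID>                   # a pegged process that emits no syscalls at all
                                            #   is usually spinning in a pure-Ruby busy loop

If strace stays silent, a Ruby-level backtrace via gdb (see the Elastic Beanstalk thread below) is usually the fastest way to find the loop.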

Related

Rails 5 app on EC2 keeps shutting down every few days.

We upgraded from a t2.micro to a t2.small EC2 instance when I noticed the Rails app was shutting down and I'd have to restart Unicorn.
Since EC2 doesn't report memory utilization out of the box, I installed the Perl monitoring scripts per the AWS docs, and I can see that we hit 87% memory utilization in the last hour, even though we have a tiny amount of traffic.
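For reference, the cron entry from the AWS docs looks roughly like this (the install path is wherever you unpacked the scripts; the one here is an assumption):

    */5 * * * * ~/aws-scripts-mon/mon-put-instance-data.pl --mem-util --mem-used --mem-avail --from-cron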
What are the main issues that could be causing this?

Elastic Beanstalk Ruby processes consuming CPU

I have had a Rails 3 app deployed on Elastic Beanstalk for close to 2 years now. For the most part, I haven't had any issues; however, I recently upgraded to one of their new Ruby configurations (64bit Amazon Linux 2014.09 v1.0.9 running Ruby 2.1 (Passenger Standalone)) and I've been fighting an issue for several days where one or more Ruby processes will consume the CPU to the point where my site becomes unresponsive. I was using a single m3.medium instance, but I've since moved to an m3.large, which only buys me some time to manually log into the EC2 instance and kill the runaway process(es). I'd say this happens once or twice a day.
The only thing I had an issue with when moving to the new Ruby config was that I had to add the following to my .ebextensions folder so Nokogiri could install (with bundle install):
    commands:
      build_nokogiri:
        command: "bundle config build.nokogiri --use-system-libraries"
I don't think this would cause the hanging processes, but I could be wrong. I also don't want to rule out something unrelated to the Elastic Beanstalk upgrade, but I can't think of any other significant change that would cause this problem. I realize this isn't a whole lot of information, but has anyone experienced anything similar? Anyone have suggestions for tracing these processes to their root cause?
Thanks in advance!
Since you upgraded your Beanstalk configuration, I guess you also upgraded the Ruby/Rails version, which bumped up all the gem versions. The performance issue probably originates from one of those changes (and not from the hardware change).
So this brings us into the domain of RoR performance troubleshooting:
1. Check the Beanstalk logs for errors. If you're lucky, you'll find a configuration issue this way. Give it an hour.
2. Assuming all is well there, try to set up the exact same versions on your localhost (Passenger + Ruby 2.1 + the same gem versions). If you're lucky, you'll reproduce the same slowness and be able to debug it.
3. If you'd like to go straight for production debugging, I suggest you install New Relic (or another application monitoring tool) and then drill into the details of the slowness in its dashboard; I've found it extremely useful. A quick-start sketch follows.
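A hypothetical quick start for step 3 (assumes Rails; the agent reads its license key from config/newrelic.yml, which you download from the New Relic dashboard):

    gem install newrelic_rpm    # or add it to your Gemfile and run bundle install
    # then drop the newrelic.yml from your New Relic account into config/newrelic.yml
    # and redeploy; data shows up in the dashboard a few minutes later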
I was able to resolve my runaway Ruby process issue by SSHing into my EC2 instance and installing/running gdb. Here's a link - http://isotope11.com/blog/getting-a-ruby-backtrace-from-gnu-debugger - with the steps I followed. I did have to sudo yum install gdb beforehand.
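In outline, the procedure from that link goes roughly like this (<PID> is the runaway process; it assumes the Ruby binary isn't stripped of symbols, and note that rb_backtrace prints to the process's own stdout/stderr, which usually lands in your app or Passenger log rather than in gdb):

    sudo yum install gdb        # one-time install
    sudo gdb -p <PID>           # attach to the spinning Ruby process
    (gdb) call rb_backtrace()   # dump the Ruby-level stack
    (gdb) detach
    (gdb) quit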
gdb uncovered an infinite loop in a section of my code that was looping through days in a date range.

Can you reload a Rails app on Passenger in the same seamless way as you can reload one on Unicorn?

With Unicorn, you can restart and reload a Rails app with kill -USR2 [master process], which doesn't kill the process immediately but starts a new master process + worker processes in the background. When the new master is ready, you can shut off the old master with kill -QUIT. This lets you restart your website without any visitors noticing a slowdown in request handling.
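A minimal sketch of that sequence (the pidfile paths are assumptions; on USR2, Unicorn renames the old master's pidfile to *.oldbin):

    kill -USR2 $(cat /tmp/unicorn.pid)          # new master + workers start alongside the old
    # ...wait until the new workers are serving requests, then:
    kill -QUIT $(cat /tmp/unicorn.pid.oldbin)   # gracefully stop the old master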
But with Passenger, you restart the Rails app with touch tmp/restart.txt, which, as far as I can tell, causes the app to become unresponsive for the few seconds it takes to restart.
Is there a way to use Passenger, but also have the Rails app restart seamlessly?
Rolling restarts are available in Phusion Passenger Enterprise.
This is the "licensed version" klochner talked about, but it wasn't released until August. Phusion Passenger Enterprise fully automates rolling restarts (Unicorn requires some manual scripting to make rolling restarts behave in a good way). It also includes a bunch of other useful features such as deployment error resistance, live IRB console, etc.
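In the Apache integration, enabling it is (per the Enterprise docs) a single directive:

    PassengerRollingRestarts on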
No. [now yes - see hongli's response]
You're asking for rolling restarts, where the new server processes are brought up before the old ones are killed. Passenger (the free version) won't drop requests, but they will get queued and delayed whenever you deploy.
Rolling restarts have supposedly already been implemented and are available in the licensed version, but not yet in the free version. I've been unable to figure out how to get the licensed version.
Follow this google groups thread for more info:
https://groups.google.com/forum/#!msg/phusion-passenger/hNvU-ZE7_WY/gOF9XWmhHy0J
You could try running two standalone Passenger processes and manually bringing one down while the other stays up, but I don't think that's the answer you were looking for.
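A rough sketch of that workaround (ports and pidfile paths are made up, and you'd need the proxy in front to health-check both instances):

    passenger start --port 3001 --daemonize --pid-file /tmp/passenger.3001.pid
    passenger start --port 3002 --daemonize --pid-file /tmp/passenger.3002.pid
    # restart one at a time while the proxy keeps routing to the other:
    passenger stop  --pid-file /tmp/passenger.3001.pid
    passenger start --port 3001 --daemonize --pid-file /tmp/passenger.3001.pid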

Deploy Rails app to EC2

My setup: Rails 2.3.10, Ruby 1.8.7 on Windows
The last time I deployed a Rails app from Windows to Linux on Slicehost, I used Capistrano, Nginx, Mongrel, and SVN. That was 3 years ago; fast forward to now, I'm still on Windows for development and am now looking to deploy to EC2. A quick search turns up tools like Rubber and Chef, which aren't easy to grasp with a quick read. It seems like Rubber and Chef are designed for multi-instance EC2 deployments, which will be useful when I need to scale.
I'm also new to Passenger, but it seems to be the default way to deploy Rails apps nowadays. One thing that isn't clear to me is whether Passenger is a replacement for Mongrel: in my old setup, I configured Nginx to forward Rails requests to a cluster of Mongrel processes, but I don't see anything like that for Passenger.
Any insights are much appreciated.
We use something like what you're describing for our production server: EC2 + Apache + Passenger. We haven't had any need to use the fancy deployment tools you describe - plain old Capistrano (plus capistrano-ext so we can use it for multiple environments) does the job just fine. I've looked at Rubber (not Chef), but deemed it needlessly automagical and too poorly documented, and I'm really not sure what it offers that can't be done just as well with roles in Capistrano.
Passenger has been great. It's an "overseer" that manages a collection of Mongrel-like workers (I had thought the workers were Mongrels, but on further reading I don't think they are; the Passenger comparisons page even compares its requests-per-second to a Mongrel cluster, so...), starting them up as needed, culling them under low load, restarting them if they crash, etc. It's actually very similar to the Nginx + Mongrel cluster setup you described, but probably a bit better, since Passenger has an understanding of the underlying workers that Nginx/Apache on their own don't. And you'll have to make a few minor tweaks to get Capistrano playing nicely with Passenger.
And if possible, pair Passenger with Ruby Enterprise Edition (from the same guys who made Passenger). It's a much faster version of Ruby, mostly due to a rewritten, configurable garbage collector. You'll have to tune your GC settings to get the most out of it.
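REE's GC tuning is done through environment variables. As a sketch, these are the widely circulated "37signals" starting values (illustrative, not a recommendation; with Apache + Passenger you'd typically put them in a small wrapper script and point PassengerRuby at it):

    export RUBY_HEAP_MIN_SLOTS=600000
    export RUBY_GC_MALLOC_LIMIT=59000000
    export RUBY_HEAP_FREE_MIN=100000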
Hope this helps!
Both might help:
http://ginzametrics.com/deploy-rails-app-to-ec2-with-rubber.html
Hosting rails on ec2
Rubystack allows you to have the same Rails environment for development on Windows and for deployment on Linux. We also have EC2 images (scroll to the bottom) and it is completely free, so you may want to give it a try.
Also, this may not work for you, but depending on your requirements, you may want to go for a PaaS solution like Heroku.

Resque: Slow worker startup and Forking

I'm currently moving my application from a Linode setup to EC2. Redis is installed on a remote instance, with various worker instances interacting with the queue. That's all going great.
My problem is the amount of time it takes for a worker to be 'instantiated', and slow forking. Starting a worker will usually take between 30 seconds and a minute (from god.rb starting the worker rake task to the worker actively picking up work from the queue). I could live with that, but I haven't experienced such a wait on my current Linode production box, so I believe it's a symptom of a bigger problem. The next issue is that jobs that took a second or less in my previous environment now seem to take about 5 to 10 times longer.
I'm assuming this must be some sort of issue with my Ubuntu install on EC2? One notable difference is that I'm running REE 1.8.7-2010.01 in my new setup, and REE 1.8.6 on the old Linode boxes.
Anyone else experienced these issues?
It turns out I had overestimated the CPU power of an EC2 small instance. Moved my workers to a large instance and all is well.
