We upgraded from a t2.micro to a t2.small EC2 instance when I noticed our Rails app was shutting down and I had to keep restarting Unicorn.
Since EC2 doesn't report memory utilization out of the box, I installed the Perl monitoring scripts per the AWS docs, and I see that we hit 87% memory utilization in the last hour, even though we have only a tiny amount of traffic.
What are the main issues that could be causing this?
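As a cross-check on the figure the monitoring scripts report, memory utilization can be computed directly from /proc/meminfo. This is only a rough sketch: the function name is made up for illustration, and it assumes "used" means MemTotal minus MemAvailable (falling back to MemFree + Buffers + Cached on older kernels). The AWS scripts may define utilization differently.

```python
def memory_utilization(meminfo_text):
    """Return used memory as a percentage, given the text of /proc/meminfo.

    Assumption: "used" = MemTotal - MemAvailable. On kernels without
    MemAvailable, approximate it as MemFree + Buffers + Cached.
    """
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if not parts:
            continue
        try:
            values[key.strip()] = int(parts[0])  # values are in kB
        except ValueError:
            continue  # skip anything that is not a plain integer

    total = values["MemTotal"]
    available = values.get("MemAvailable")
    if available is None:
        available = (
            values["MemFree"]
            + values.get("Buffers", 0)
            + values.get("Cached", 0)
        )
    return 100.0 * (total - available) / total


# Usage on the instance itself:
#   with open("/proc/meminfo") as f:
#       print(round(memory_utilization(f.read()), 1))
```

If this number is much lower than what the dashboard shows, the high reading may mostly be page cache rather than memory held by Unicorn workers.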
I am running some services with Google Cloud Run. While performance has been satisfactory, there's a recurrent issue with extremely slow startup time, which leads to occasionally dropped requests when new containers can't spin up in time.
Currently, with first gen execution environments and startup CPU boost enabled, Google's dashboard reports around 18 to 50 seconds of startup time. Image is based on ruby:3.0.2, and it runs a Ruby on Rails 6 application. In a development environment, startup (timed from run to container accepting requests) never seems to take more than 5 seconds.
I want to know what tools are available to diagnose this issue, and if there are any obvious pitfalls with my specific case that I might be missing.
I've tried playing around with the service's configuration options, to no avail. The biggest suspect is a startup bash script that handles migrations on the first boot, and asset compiling on development. However, I've tried building with an empty script, and the problem persists. I also think the container images might be too large (around 700 MB), but I haven't gotten around to slimming them down, nor have I found evidence that this is the problem.
The Setup:
* Ubuntu 18.04 LTS
* Apache 2.4.29
* Passenger 6.0.16
* Ruby 2.3.8
* Rails 4.2.x
I have both staging and prod servers with the same setup on AWS EC2; they are both running the same kernel/build. I upgraded the Ruby/Rails version of my app from Ruby 2.1.x -> 2.3.8, and Rails 4.0 -> 4.2, first on staging then on production.
On staging, everything was working fine; pages were loading quickly and without issue. On prod, pages would start by loading quickly but pretty soon would degrade. The user CPU would max out at 99%+ eventually causing the app to go down and be unresponsive. The only solution was to restart Apache, roughly every 30min.
After a LOT of digging and testing, top -c showed that Passenger RubyApp would hit 100% CPU and soon after would stay "locked" at max CPU for each process, even if no one was using the site. I've been trying to change different settings in both Apache and Passenger, but nothing seems to work. Effectively, as soon as we get a few people hitting the site in a particular way, ANY of the spawned Passenger processes that hit 100% stay at high CPU and never shut down or exit; they just keep burning CPU, as if there were some IO issue.
Right now Passenger and Apache configs are exactly the same on staging/prod and are the defaults.
Screenshots of the example top in prod with a few users using it.
And roughly same amount of people using on staging.
Staging looks far more accurate in terms of a Rails app -- I'd expect to see higher memory use than CPU. AWS Support was also baffled, as prod is on an XL and staging is on a Micro instance, and the AWS kernel versions were the same. Here's AWS monitoring around CPU usage... prod was updated on the 20th, but not a lot of people used it over the weekend, and really became a problem on Monday during working hours.
Any ideas why this is happening on one server but not the other? It's no particular request that causes it; it's literally any request (or 2-3 requests coming in tandem) that will cause the CPU to spike to 100% and get stuck.
TIA.
We have several rails apps using passenger and apache on some ubuntu servers that get heavy load occasionally. We get datadog alerts that memory usage is high, get on the server, and do a top to see that passenger and ruby are using lots of memory, but how should I go about figuring out which one of the passenger/rails apps is the culprit? Or at least a list of apps using above a given threshold of memory?
I have only one RoR app running on my server (and it's on nginx), and I think you're looking for
ps auxf
it shows me this for my one passenger instance:
nginx 28279 0.0 10.2 452128 107264 ? Sl Apr03 0:01 Passenger RackApp: /srv/http/redmine
The fourth column (10.2) is memory usage in % (the third, 0.0, is CPU usage), and the last column shows the directory of the application. More about output here.
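To get from that output to "which app is above a threshold", the ps lines can be grouped by application directory. The sketch below is one way to do that in Python. The function name is made up for illustration, and it assumes the process titles look like "Passenger RackApp: /path" or "Passenger RubyApp: /path" and that the input uses the standard 11-column ps aux layout.

```python
import re
from collections import defaultdict

# Assumption: Passenger app processes are titled "Passenger RackApp: <dir>"
# or "Passenger RubyApp: <dir>", as seen in ps/top output.
APP_TITLE = re.compile(r"Passenger (?:RackApp|RubyApp): (\S+)")


def passenger_memory_by_app(ps_output, min_percent=0.0):
    """Sum %MEM and RSS per Passenger app from `ps aux` / `ps auxf` text.

    Returns a list of (app_dir, mem_percent, rss_kb, process_count),
    largest memory share first, keeping only apps at or above min_percent.
    """
    totals = defaultdict(lambda: [0.0, 0, 0])
    for line in ps_output.splitlines():
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        match = APP_TITLE.search(fields[10])
        if not match:
            continue  # header line or a non-Passenger process
        entry = totals[match.group(1)]
        entry[0] += float(fields[3])  # %MEM
        entry[1] += int(fields[5])    # RSS in kB
        entry[2] += 1

    rows = [
        (app, round(mem, 1), rss, count)
        for app, (mem, rss, count) in totals.items()
        if mem >= min_percent
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Usage on the server:
#   import subprocess
#   out = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
#   for row in passenger_memory_by_app(out, min_percent=5.0):
#       print(row)
```

Summing RSS across processes overcounts memory shared between workers, so treat the totals as a ranking rather than exact figures. Passenger also ships a passenger-memory-stats command that reports per-process memory.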
With Unicorn, you can restart and reload a Rails app with kill -USR2 [master process], which doesn't kill the process immediately, but starts a new master process + worker processes in the background. When the new master is ready, you can shut off the old master with kill -QUIT. This lets you restart your website without having any visitors notice a slowdown in request handling.
But with Passenger, you restart the Rails app with touch tmp/restart.txt, which, as far as I can tell, causes the Rails app to become unresponsive for the few seconds it takes to restart the Rails application.
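For reference, the Unicorn sequence described above can be sketched in Python roughly as follows. The function names and the pid file argument are made up for illustration. "Ready" is approximated here as "a new master has written its PID to the pid file", which is an assumption; a real deploy script would probably also check that the new workers answer requests before sending QUIT.

```python
import os
import signal
import time


def read_pid(path):
    """Return the PID stored in a pid file, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def rolling_restart(pid_file, kill=os.kill, timeout=60.0, poll=0.5,
                    sleep=time.sleep):
    """USR2 the old master, wait for a new master, then QUIT the old one.

    Assumption: the new master is considered up once pid_file holds a PID
    different from the old master's.
    """
    old_pid = read_pid(pid_file)
    if old_pid is None:
        raise RuntimeError("no running master found in %s" % pid_file)

    kill(old_pid, signal.SIGUSR2)  # start a new master plus workers

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new_pid = read_pid(pid_file)
        if new_pid is not None and new_pid != old_pid:
            kill(old_pid, signal.SIGQUIT)  # gracefully stop the old master
            return new_pid
        sleep(poll)

    raise TimeoutError("new master did not appear; old master left running")
```

The kill and sleep parameters exist only so the function can be exercised without a real Unicorn process. If the new master never appears, the old one is left running, so a failed deploy does not take the site down.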
Is there a way to use Passenger, but also have the Rails app restart seamlessly?
Rolling restarts are available in Phusion Passenger Enterprise.
This is the "licensed version" klochner talked about, but it wasn't released until August. Phusion Passenger Enterprise fully automates rolling restarts (Unicorn requires some manual scripting to make rolling restarts behave in a good way). It also includes a bunch of other useful features such as deployment error resistance, live IRB console, etc.
No. [now yes - see hongli's response]
You're asking for rolling restarts, where the new server processes are brought up before the old ones are killed. Passenger (the free version) won't drop requests, but they will get queued and delayed whenever you deploy.
Rolling restarts have supposedly already been implemented and are available in the licensed version, but have not yet been released for the free version. I've been unable to figure out how to get the licensed version.
Follow this google groups thread for more info:
https://groups.google.com/forum/#!msg/phusion-passenger/hNvU-ZE7_WY/gOF9XWmhHy0J
You could try running two standalone passenger processes and manually bring one down while the other stays up, but I don't think that's the answer you were looking for.
I'm currently moving my application from a Linode setup to EC2. Redis is currently installed on a remote instance with various worker instances interacting with the queue. That's all going fantastic.
My problem is with the amount of time it takes for a worker to be 'instantiated' and slow forking. Starting a worker will usually take between 30 seconds and a minute (from god.rb starting the worker rake task to the worker actively starting work on the queue). I could live with that, but I've not experienced such a wait time on my current Linode production box, so I believe it's a symptom of a bigger problem. The next issue is that jobs that took a second or less in my previous environment now seem to take about 5 to 10 times longer.
I'm assuming this must be some sort of issue with my Ubuntu install on EC2? One notable difference is that I'm running REE 1.8.7-2010.01 in my new setup, and REE 1.8.6 on the old Linode boxes.
Anyone else experienced these issues?
It turns out I had overestimated the CPU power of an EC2 small instance. Moved my workers to a large instance and all is well.