Rails app (mod_rails) hangs every few hours - ruby-on-rails

I'm running a Rails app through Phusion Passenger (mod_rails) which will run smoothly for a while, then suddenly slow to a crawl (one or two requests per hour) and become unresponsive. CPU usage is low throughout the whole ordeal, although I'm not sure about memory.
Does anyone know where I should start to diagnose/fix the problem?
Update: restarting the app every now and then does fix the problem, although I'm looking for a more long-term solution. Memory usage gradually increases (initially ~30 MB per instance, 40 MB after an hour, and 60 or 70 MB by the time it crashes).

New Relic can show you combined memory usage. Engine Yard recommends tools like Rack::Bug, MemoryLogic or Oink. Here's a nice article on something similar that you might find useful.

If restarting the app cures the problem, looking at its resource usage would be a good place to start.

Sounds like you have a memory leak of some sort. If you'd like to bandaid the issue you can try setting the PassengerMaxRequests to something a bit lower until you figure out what's going on.
http://www.modrails.com/documentation/Users%20guide%20Apache.html#PassengerMaxRequests
This will restart your instances, individually, after they've served a set number of requests. You may have to fiddle with it to find the sweet spot where they are restarting automatically before they lock up.
Other tips are:
- Go through your plugins/gems and make sure they are up to date
- Check for heavy actions and requests where there is a lot of memory consumption (New Relic is great for this)
- You may want to consider switching to REE as it has better garbage collection
Finally, you may want to set a cron job that looks at your currently running passenger instances and kills them if they are over a certain memory threshold. Passenger will handle restarting them.
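The cron idea above could be sketched roughly like this. It is only a sketch under assumptions: the process-name pattern, the 300 MB limit and the TERM signal are all placeholders, so check what `ps` actually shows for your Passenger version before relying on it.

```ruby
# Sketch of a cron-driven memory check. The process-name pattern and the
# 300 MB limit are assumptions; adjust them to what `ps` shows on your host.
APP_PATTERN = /Passenger (RubyApp|RackApp)|Rails:/ # hypothetical match for app processes
LIMIT_MB = 300                                     # hypothetical threshold

# Parse `ps -eo pid=,rss=,args=` output into [pid, rss_mb, command] rows.
# ps reports RSS in kilobytes.
def parse_ps(output)
  output.each_line.map do |line|
    pid, rss_kb, command = line.strip.split(/\s+/, 3)
    next if command.nil? # skip blank or malformed lines
    [pid.to_i, rss_kb.to_i / 1024.0, command]
  end.compact
end

# Return the PIDs of matching processes whose resident memory exceeds the limit.
def bloated_pids(output, pattern: APP_PATTERN, limit_mb: LIMIT_MB)
  parse_ps(output)
    .select { |_pid, rss_mb, command| command =~ pattern && rss_mb > limit_mb }
    .map(&:first)
end

# Ask each bloated process to exit; Passenger spawns a replacement on demand.
def kill_bloated(output, signal: "TERM")
  bloated_pids(output).each do |pid|
    warn "killing #{pid} (over #{LIMIT_MB} MB)"
    Process.kill(signal, pid)
  end
end

# From cron, the script would end with something like:
#   kill_bloated(`ps -eo pid=,rss=,args=`)
```

Keeping the parsing separate from the killing means you can dry-run `bloated_pids` against real `ps` output and see what would be selected before letting cron kill anything.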

Related

Quick and dirty way to solve/kill memory increases on Heroku

I have an app running on Heroku with a few thousand visitors per day. I do not update it very often as it runs well anyway. Recently, however, I started getting memory increases in a way I have never had before. I am 98% sure it does not have to do with any changes in the code that I have done, as I have not done anything to the code in quite a while. I know from experience that tracing down memory issues is extremely difficult and time-consuming - and at the moment I don't have the time to do it.
Considering the fact that I get this stair-case increase in memory over time (over the course of a few hours) once a day, is there a quick and dirty way of just restarting the server once it starts doing so, so it won't slow down the server for those hours? Like
RestartApp if ServerMemory > 500 Mb
or the likes of it?
I am running ruby 2.4.7 (is that likely an issue in terms of memory increases?) and Rails 4.2.10
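For what it's worth, the `RestartApp if ServerMemory > 500 Mb` idea could be approximated inside the app with a small Rack-style middleware. This is a minimal sketch, not a tested recipe: `MemoryGuard` and its options are hypothetical names, and it assumes the platform or app server starts a fresh process after this one exits, which you should verify for your setup.

```ruby
# Rack-style middleware sketch: after each request, check this process's
# resident memory and ask it to exit once it passes a limit, relying on the
# platform or process manager to start a fresh one. All names are hypothetical.
class MemoryGuard
  def initialize(app, limit_mb: 500, rss_reader: nil, on_exceed: nil)
    @app = app
    @limit_mb = limit_mb
    # Default reader shells out to ps, which reports RSS in kilobytes.
    @rss_reader = rss_reader || -> { `ps -o rss= -p #{Process.pid}`.to_i / 1024.0 }
    # Default action: send SIGTERM to ourselves.
    @on_exceed = on_exceed || -> { Process.kill("TERM", Process.pid) }
  end

  def call(env)
    response = @app.call(env)
    # Check after the request so the current response is still returned.
    @on_exceed.call if @rss_reader.call > @limit_mb
    response
  end
end
```

It would be enabled with something like `use MemoryGuard, limit_mb: 500` in config.ru. Shelling out to `ps` on every request adds overhead, so a real version would probably check only every N requests.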

Compounding performance problems in Ruby on Rails app

Let me start off by saying I understand performance is hard, and even harder through a forum, but I hope someone can point me towards the next step of figuring out this problem we're having in production on a Ruby on Rails environment. I've been seeing this for a while, but now with our usage down due to Covid, it's become more clear that it's not a user-load issue, but something in the infrastructure.
The day starts off fine, but midday, things slow more and more... until a passenger restart clears everything up for 24/48/96 hours.
Checking the obvious items:
Server memory usage is in check. Memory use increases during this time from 20%-30%. So some increase, but definitely not swapping (I've seen that happen before)
CPU usage is similar. It increases from 1% at 6am to 18% at worst case, but still well within the capabilities of the server.
Passenger usage. Looking at passenger-status, there's never a backlog of requests. Because of covid, our user-base is down, and even at its slowest, I'm seeing 2-3 passenger processes serving content (out of ~40 max), and no long-running requests.
Delayed job workers (3) are running, but no jobs at the time. They don't seem to be using a significant amount of cpu or mem
Watching the apache/rails logs, nothing looks awry. No DoSes, no unexpected load.
Looking at long-running transactions in NewRelic, there are some that are taking 20-30-50 sec, but after the passenger restart, they are back to 1-3 sec like usual. All the time is in Ruby.
So this is where I'm stuck. I can see where the problem is (185 sec):
but if I restart passenger and re-run, that same code will take less than a second. And that's just the example that I see from today's traces. Yesterday, it looked like it was a different controller having problems.
Any recommendations on my next steps to troubleshoot? I don't know if I should instrument a specific controller, because it's not just one method that's causing the problem (afaik). I think I'm seeing the symptoms, not the cause - and no idea how to see what's really going on.
Thanks in advance
-Mike
Rails 4.2.11.1
Ruby 2.4.9
Passenger 6.0.5 on Apache 2.4.43

ActiveRecord::QueryCache#call slow on Heroku with pg:backups

Lately we've had trouble with our Rails 4.2.7.1 app: every night we start seeing a bunch of really slow ActiveRecord::QueryCache#call calls, even though our traffic is relatively low in the middle of the night:
We're running on Heroku using Puma and the app is very job heavy, for which we use Sidekiq. During the day it works fine but every night we get these spikes of extremely slow response times via the API that seem to originate with ActiveRecord::QueryCache#call.
The only thing I can find from our app that might be causing this is we have heroku pg:backups enabled, and the night of the above image the backup began running at 3:06 which is the exact time you see that first ActiveRecord::QueryCache#call spike in the newrelic graph. The backup finished an hour later, however (around the biggest spike), but as you can see the spikes continued until around 5am.
Could this be caused by pg:backups? (Our database is about 19GB), or could it be something else entirely? Is there a good way to avoid this cache call or speed it up? I don't fully understand WHY it would be so slow or exist at all in the transaction list. Any recommendations?
Funnily enough, we've been investigating this lately after seeing similar behaviour. There is a definite performance hit caused by pg:backups on large databases. Notice the big spike just after 1am, when backup kicks in:
DB size is >100GB
It's not that surprising, and in fact Heroku do have documentation on this, which suggests that you should only use pg:backups for databases under 20GB.
For larger databases, creating a follower and taking the backup from that is preferable. Annoyingly for high availability databases, it doesn't appear that you can read from the standby.
I can't shed much light on ActiveRecord::QueryCache though, so the rest of this post is speculation, and maybe the starting point for further investigation. Happy to delete/amend if someone more knowledgeable can weigh in :-)
Heroku's docs do say that the backup process will evict well cached data from non-Postgres caches, so this could represent your workers repopulating that cache many times over.
It may also be worth having a look at this. Could your workers be reusing connections and receiving dirty query caches?

Can i limit apache+passenger memory usage on server without swap space

I'm running a Rails application with Apache+Passenger on virtual servers that do not have any swap space configured.
The site gets a decent amount of traffic, with 200K+ daily requests, and sometimes the whole system runs out of memory, causing odd behaviour across the whole system.
Is there any way to configure Apache or Passenger not to run out of memory (e.g. gracefully restarting Passenger instances when they start using, say, more than 300M of memory)?
The servers have 4GB of memory and currently I'm using Passenger's PassengerMaxRequests option, but it does not seem to be the most solid solution here.
At the moment I also cannot switch to nginx, so that is not an option for freeing up some room.
Any clever ideas I'm probably missing are welcome.
Edit: My temporary solution
I did not go with restarting Rails instances when they exceed a certain amount of memory usage. Engine Yard wrote a great blog post on the ActiveRecord memory bloat issue. This is our main suspect on the subject. As I did not have much time to optimize the application, I set PassengerMaxRequests to 300 and added an extra 2GB of memory to the server. Things have been good since then. At first I was worried that restarting Rails instances continuously would make it slow, but it does not seem to have an impact I should worry about.
If you mean "limiting" as killing those processes and if this is the only application on the server and it is a Linux, then you have two choices:
Set maximum amount of memory one process can have:
# ulimit -m
unlimited
Or use cgroups for similar behavior:
http://en.wikipedia.org/wiki/Cgroups
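The same per-process limit can be read and set from inside Ruby with `Process.getrlimit` and `Process.setrlimit`. The sketch below uses the address-space limit (the `ulimit -v` equivalent) rather than the resident-set limit shown by `ulimit -m`, because recent Linux kernels do not enforce the latter; that substitution and the 300 MB figure are my assumptions, not part of the original suggestion.

```ruby
# Sketch: cap the address space of the current process, the Ruby analogue of
# `ulimit -v`. The 300 MB figure is an assumption.
LIMIT_BYTES = 300 * 1024 * 1024

# Report the current [soft, hard] limits. Process::RLIM_INFINITY means "unlimited".
def address_space_limits
  Process.getrlimit(Process::RLIMIT_AS)
end

# Lower the soft limit, never asking for more than the hard limit allows.
# Allocations beyond the limit then fail instead of exhausting the machine.
def cap_address_space(bytes = LIMIT_BYTES)
  _soft, hard = address_space_limits
  Process.setrlimit(Process::RLIMIT_AS, [bytes, hard].min, hard)
end
```

Note that address space is usually much larger than resident memory for a Ruby process, so the cap needs generous headroom, and a process that hits it will typically fail with allocation errors rather than restart cleanly.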
I would advise against restarting instances (if that is possible) that go over the "memory limit", because that may put your system in infinite loops where a process repeatedly reaches that limit and restarts.
Maybe you could write a simple daemon that constantly watches the processes, and kills any that go over a certain amount of memory. Be sure to log any information about the process that did this so you can fix the issue whenever it comes up.
I'd also look into getting some real swap space on there... This seems like a bad hack.
I have a problem where passenger processes end up going out of control and consuming too much memory. I wrote the following script which has been helping to keep things under control until I find the real solution http://www.codeotaku.com/blog/2012-06/passenger-memory-check/index. It might be helpful.
Passenger web instances don't contain important state (generally speaking), so killing them isn't normally a problem, and Passenger will restart them as and when required.
I don't have a canned solution but you might want to use two commands that ship with Passenger to keep track of memory usage and nr of processes: passenger-status and sudo passenger-memory-stats, see
Passenger users guide for Nginx or
Passenger users guide for Apache.

How can I find out why my app is slow?

I have a simple Rails app deployed on a 500 MB Slicehost VPS. I'm the only one who uses the app. When I run it on my laptop, it's fast enough. But the deployed version is insanely slow. It takes 6 to 10 seconds to load the login screen.
I would like to find out why it's so slow. Is it my code? (Don't think so because it's much faster locally, but maybe.) Is it Slicehost's server being overloaded? Is it the Internet?
Can someone suggest a technique or set of steps I can take to help narrow down the cause of this problem?
Update:
Sorry forgot to mention. I'm running it under CentOS 5 using Phusion Passenger (AKA mod_rails or mod_rack).
If it is only slow the first time you load it, that is probably because Passenger killed the process due to inactivity. I don't remember all the details, but I do recall reading about people who used cron jobs to keep at least one process alive, to avoid the lag that occurs when Passenger needs to reload the environment.
Edit: more details here
Specifically - pool idle time defaults to 2 minutes which means after two minutes of idling passenger would have to reload the environment to serve the next request.
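A cron-driven keep-alive like the one mentioned above might look roughly like this. It is a minimal sketch: the URL is a placeholder, and the function name and injectable `fetcher` parameter are my own invention so the logic can be tried without a network.

```ruby
require "net/http"
require "uri"

# Keep-alive sketch for cron: request one page so an application process
# stays warm. The URL is a placeholder.
KEEPALIVE_URL = "http://localhost/login" # hypothetical

# Fetch the URL and return [status_code, seconds_taken]. The fetcher is
# injectable so the function can be exercised without a real server.
def keepalive_ping(url = KEEPALIVE_URL, fetcher: ->(uri) { Net::HTTP.get_response(uri) })
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  response = fetcher.call(URI.parse(url))
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [response.code.to_i, elapsed]
end

# From cron, run more often than the pool idle time, ending the script with:
#   code, seconds = keepalive_ping
#   warn "keep-alive got #{code} in #{seconds.round(2)}s"
```

Logging the elapsed time is optional, but it also tells you whether the ping itself hit a cold process.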
First, find out if there's a particularly slow response from the server. Use Firefox and the Firebug plugin to see how long each component (including JavaScript and graphics) takes to download. Assuming the main page itself is what is taking all the time, you can start profiling the application. You'll need to find a good profiler, and as I don't actually work in Ruby on Rails, I can't suggest any: google "profile ruby on rails" for some options.
As YenTheFirst points out, the server software and config you're using may contribute to a slowdown, but A) slicehost doesn't choose that, you do, as Slicehost just provides very raw server "slices" that you can treat as dedicated machines. B) you're unlikely to see a script that runs instantly suddenly take 6 seconds just because it's running as CGI. Something else must be going on. Check how much RAM you're using: have you gone into swap? Is the login slow only the first time it's hit indicating some startup issue, or is it always that slow? Is static content served slow? That'd tend to mean some network issue (either on the Slicehost side, or your local network) is slowing things down, assuming you're not in swap.
When you say "fast enough" you're being vague: does the laptop version take 1 second to the Slicehost 6? That wouldn't be entirely surprising, if the laptop is decent: after all, the reason slices are cheap is because they're a fraction of a full server. You're using probably 1/32 of an 8 core machine at Slicehost, as opposed to both cores of a modern laptop. The Slicehost cores are quick, but your laptop could be a screamer compared to 1/4 of a core. :)
Try to pinpoint where the slowness lies
1/ Is the application slow, or the infrastructure (network + web server)?
put a static file on your web server, and access it through your browser
2/ If it is fast, it is probably a problem with application + server configuration.
database access is slow
try a page with a simple loop: is it slow?
3/ If it is slow, it is probably your infrastructure. You can check:
bad network connection: do a packet capture (with Wireshark for example) and look for retransmissions, duplicate packets, etc.
DNS resolution is slow?
server is misconfigured?
etc.
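The static-versus-dynamic comparison in the steps above can be scripted. This is only a sketch of that rule of thumb: the function names, the one-second threshold and the example URLs are all assumptions.

```ruby
require "benchmark"
require "net/http"
require "uri"

# Time a single GET request, in seconds.
def time_request(url)
  Benchmark.realtime { Net::HTTP.get_response(URI.parse(url)) }
end

# Apply the rule of thumb: a slow static file points at infrastructure,
# a fast static file with a slow dynamic page points at the application.
# The one-second threshold is an arbitrary assumption.
def diagnose(static_seconds, dynamic_seconds, slow_after: 1.0)
  if static_seconds > slow_after
    :infrastructure
  elsif dynamic_seconds > slow_after
    :application
  else
    :no_obvious_bottleneck
  end
end

# Example (URLs are placeholders):
#   diagnose(time_request("http://example.com/robots.txt"),
#            time_request("http://example.com/login"))
```

Run it a few times and from more than one network, since a single measurement can be skewed by a cold process or a DNS lookup.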
What is Slicehost using to serve it?
Fast options are things like: Mongrel, or Apache's mod_rails (also called Phusion Passenger or something like that)
These are dedicated servers (or plugins to servers) which run an instance of your rails app.
If your host isn't using that, then it's probably defaulting to CGI. Rails comes with a simple CGI script that will serve the page, but it reloads the app for every page.
(edit: I suspect that this is the most likely case, that your app is running off of the CGI in /webapp_directory/public/dispatch.cgi, which would explain the slowness. This tends to be a default deployment on many hosts, since it doesn't require extra configuration on their part, but it doesn't give good performance)
If your host supports "Fast CGI", rails supports that too. Fast CGI will open a CGI session, and keep it open for multiple pages, so you get much better performance, but it's not nearly as good as Mongrel or mod_rails.
Secondly, is it in 'production' or 'development' mode? The easy way to tell is to go to a page in your app that gives an error. If it shows you a stack trace, it's in development mode, which is slower than production mode. Mongrel and mod_rails have startup options to determine whether to run the app in production or development mode.
Finally, if your database is slow for whatever reason, that will be a big bottleneck as well. If you do have a good deployment (Mongrel/mod_rails/etc.) in production mode, try looking into that.
Do you have a lot of data in your DB? I would double check that you have indexed all the appropriate columns, because this can make a huge difference. On your local dev system, you probably have a lot more memory than on your 500 MB slice, which would result in the DB running a lot slower on the slice if you have big, unindexed tables. You can also enable the slow query log in MySQL to pinpoint columns without indexes.
Other than that, yes- passenger will need to spool up a process for you if you have not been using the site recently. If this is the case, you should see a significant speed increase on second, and especially third and later page loads.
You might want to run a local virtual machine with 500 MB. Are you doing a lot of client-server interaction? Delays over the WAN are significant
You might want to check out RPM (there's a free "lite" version too) and/or New Relic's Tune Up.
Your CPU time is guaranteed by Slicehost using the Xen virtualization system, so it's not that. Don't have the other answers for you, sorry! Might try 'top' on a console while you're trying to access the page.
If you are using FireFox and doing localhost testing (or maybe even on LAN) you may want to try editing the network.dns.disableIPv6 setting.
Type about:config in the address bar and filter for network.dns.disableIPv6 and double-click to set to true.
This bug has been reported mainly from Vista OS's, but some others as well.
You could try running 'top' when you SSH in to see which process is heavy. If you also have problems logging in, you could try looking at the Statistics in the Slicehost manager.
If you discover it is MySQL's fault, consider decreasing the number of servers it can spawn.
512 MB seems decent for a Rails application, so it is also worth checking whether something is misconfigured.

Resources