Quick and dirty way to solve/kill memory increases on Heroku - ruby-on-rails

I have an app running on Heroku with a few thousand visitors per day. I do not update it very often, as it runs well anyway. Recently, however, I started seeing memory increases in a way I never have before. I am 98% sure it has nothing to do with changes in the code, as I have not touched the code in quite a while. I know from experience that tracking down memory issues is extremely difficult and time-consuming, and at the moment I don't have the time to do it.
Given that I get this staircase increase in memory (over the course of a few hours) once a day, is there a quick and dirty way to just restart the server once that starts happening, so it won't slow down the server for those hours? Something like
RestartApp if ServerMemory > 500 Mb
or the likes of it?
I am running ruby 2.4.7 (is that likely an issue in terms of memory increases?) and Rails 4.2.10
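One quick-and-dirty option, assuming the app is served by Puma (Heroku's recommended server for Rails): the puma_worker_killer gem does roughly the RestartApp-if-memory-above-X check described above, restarting workers when total RAM use crosses a threshold. A minimal sketch of its setup, with the 500 MB figure plugged in:

# Gemfile: gem 'puma_worker_killer'
# config/initializers/puma_worker_killer.rb
PumaWorkerKiller.config do |config|
  config.ram           = 500  # MB allowed before workers get restarted
  config.frequency     = 60   # check every 60 seconds
  config.percent_usage = 0.98 # act once usage hits 98% of that limit
end
PumaWorkerKiller.start

If the app isn't on Puma, the blunter equivalent is restarting the dyno on a schedule (heroku ps:restart from a scheduled task). Either way this only masks the symptom; the staircase growth is still worth chasing down when there is time.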

Related

Compounding performance problems in Ruby on Rails app

Let me start off by saying I understand performance is hard, and even harder to diagnose through a forum, but I hope someone can point me towards the next step in figuring out this problem we're having in production in a Ruby on Rails environment. I've been seeing this for a while, but now with our usage down due to Covid, it has become clearer that it's not a user-load issue, but something in the infrastructure.
The day starts off fine, but by midday things slow down more and more... until a Passenger restart clears everything up for 24/48/96 hours.
Checking the obvious items:
Server memory usage is in check. Memory use increases during this time from 20% to 30%, so there is some growth, but definitely no swapping (I've seen that happen before).
CPU usage is similar. It increases from 1% at 6am to 18% in the worst case, but stays well within the capabilities of the server.
Passenger usage. Looking at passenger-status, there's never a backlog of requests. Because of Covid our user base is down, and even at its slowest I'm seeing 2-3 Passenger processes serving content (out of ~40 max), and no long-running requests.
Delayed Job workers (3) are running, but had no jobs at the time. They don't seem to be using a significant amount of CPU or memory.
Watching the Apache/Rails logs, nothing looks awry. No DoSes, no unexpected load.
Looking at long-running transactions in New Relic, there are some taking 20, 30, even 50 seconds, but after the Passenger restart they are back to the usual 1-3 seconds. All the time is spent in Ruby.
So this is where I'm stuck. I can see in the trace where the problem is (185 sec), but if I restart Passenger and re-run, that same code takes less than a second. And that's just the example from today's traces. Yesterday it looked like a different controller was having problems.
Any recommendations on my next steps to troubleshoot? I don't know if I should instrument a specific controller, because it's not just one method that's causing the problem (afaik). I think I'm seeing the symptoms, not the cause - and no idea how to see what's really going on.
Thanks in advance
-Mike
Rails 4.2.11.1
Ruby 2.4.9
Passenger 6.0.5 on Apache 2.4.43
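Since everything clears up after a Passenger restart and all the time is in Ruby, one cheap next step (a sketch, not something from the original post) is a tiny Rack middleware that logs GC and heap deltas around slow requests, so you can see whether the midday slowdowns correlate with GC churn or heap growth rather than with any one controller:

# lib/gc_instrumentation.rb (hypothetical file/class name)
class GcInstrumentation
  def initialize(app)
    @app = app
  end

  def call(env)
    before  = GC.stat
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    after   = GC.stat
    if elapsed > 5 # only log requests slower than 5 seconds
      Rails.logger.info(
        "[gc] path=#{env['PATH_INFO']} time=#{elapsed.round(2)}s " \
        "major_gc=#{after[:major_gc_count] - before[:major_gc_count]} " \
        "minor_gc=#{after[:minor_gc_count] - before[:minor_gc_count]} " \
        "live_slots=#{after[:heap_live_slots]}"
      )
    end
    [status, headers, body]
  end
end

# and in config/application.rb:  config.middleware.use GcInstrumentation

If the slow requests show large major GC counts or a steadily climbing live_slots figure, that points at memory bloat in the Ruby processes rather than at the database or Passenger itself.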

ActiveRecord::QueryCache#call slow on Heroku with pg:backups

Lately we've had trouble on our Rails 4.2.7.1 app every night: we start seeing a bunch of really slow ActiveRecord::QueryCache#call calls even though our traffic is relatively low in the middle of the night.
We're running on Heroku using Puma and the app is very job heavy, for which we use Sidekiq. During the day it works fine but every night we get these spikes of extremely slow response times via the API that seem to originate with ActiveRecord::QueryCache#call.
The only thing I can find in our app that might be causing this is that we have heroku pg:backups enabled, and on the night in question the backup began running at 3:06, which is the exact time the first ActiveRecord::QueryCache#call spike shows up in the New Relic graph. The backup finished an hour later (around the biggest spike), but the spikes continued until around 5am.
Could this be caused by pg:backups? (Our database is about 19GB.) Or could it be something else entirely? Is there a good way to avoid this cache call or speed it up? I don't fully understand WHY it would be so slow or why it shows up in the transaction list at all. Any recommendations?
Funnily enough, we've been investigating this lately after seeing similar behaviour. There is a definite performance hit caused by pg:backups on large databases: in our case (DB size >100GB) there is a big spike just after 1am, when the backup kicks in.
It's not that surprising, and in fact Heroku do have documentation on this, which suggests that you should only use pg:backups for databases under 20GB.
For larger databases, creating a follower and taking the backup from that is preferable. Annoyingly for high availability databases, it doesn't appear that you can read from the standby.
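For reference, the follower route looks roughly like this with the Heroku CLI (the plan and the HEROKU_POSTGRESQL_GREEN_URL attachment name are placeholders; heroku pg:info will show your actual ones):

heroku addons:create heroku-postgresql:standard-2 --follow DATABASE_URL --app your-app
# wait for the follower to catch up, then capture backups from it instead of the primary
heroku pg:backups:capture HEROKU_POSTGRESQL_GREEN_URL --app your-app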
I can't shed much light on ActiveRecord::QueryCache though, so the rest of this post is speculation, and maybe a starting point for further investigation. Happy to delete/amend if someone more knowledgeable can weigh in :-)
Heroku's docs do say that the backup process will evict well cached data from non-Postgres caches, so this could represent your workers repopulating that cache many times over.
It may also be worth having a look at this. Could your workers be reusing connections and receiving dirty query caches?
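If you want to rule out that last speculation, here's a small sketch (assuming the workers are the Sidekiq ones mentioned in the question; this is speculative, not a known fix): clear the query cache around every job so a reused connection can't hand back stale cached results.

# Sidekiq server middleware: wraps every job
class ClearQueryCacheMiddleware
  def call(_worker, _job, _queue)
    ActiveRecord::Base.connection.clear_query_cache
    yield
  ensure
    ActiveRecord::Base.connection.clear_query_cache
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add ClearQueryCacheMiddleware
  end
end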

Rails Server Memory Leak/Bloating Issue

We are running two Rails applications on a server with 4GB of RAM. Both applications use Rails 3.2.1, and whether run in development or production mode, they eat up RAM at an incredible speed, consuming up to 1.07GB of RAM each day. Keeping the server running for just 4 days triggered all the memory alarms in monitoring, and we had just 98MB of RAM free.
We tried ActiveRecord optimizations related to bloating, but with no effect. Please help us figure out how we can trace which controller is at fault.
We are using a MySQL database and the WEBrick server.
Thanks!
This is incredibly hard to answer without looking into the project itself. Though I am quite sure you won't be using WEBrick in your target production build (right?), so check whether it behaves the same under Passenger or whatever server you choose.
Also, without knowing the details of the project, I would suggest looking at features like PDF generation, CSV parsing, etc. I've seen a case where generating PDF files ate resources in a similar fashion, leaving around 5MB of un-garbage-collected memory after each run.
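If you can reproduce one of those suspect code paths in a console, the memory_profiler gem gives a quick breakdown of what gets allocated and retained (assuming your Ruby is new enough for it, 2.1+; the PDF call below is a hypothetical stand-in for whatever code you suspect):

require 'memory_profiler'

report = MemoryProfiler.report do
  InvoicePdfGenerator.new(order).render # hypothetical suspect code path
end

# Prints allocated vs. retained memory grouped by gem, file and location;
# "retained" objects are the ones that survive garbage collection.
report.pretty_print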
Good luck.

Heroku: request taking 100ms intermittently times out

After performing load testing against an app hosted on Heroku, I am finding that the most DB-intensive request takes 50-200ms depending upon load. It never gets slower, no matter the load. However, seemingly at random, the request will outright time out (30s or more).
On Heroku, why might a relatively high-performing query/request work perfectly 8 times out of 10 and outright time out 2 times out of 10 as load increases?
If this is starting to seem like a question for Heroku itself, I'm looking to first answer the question of whether "bad code" could somehow cause this issue -- or if it is clearly a problem on their end.
A bit more info:
Multiple Dynos
Cedar Stack
Dedicated Heroku DB (16 connections, 1.7 GB RAM, 1 comp. unit)
Rails 3.0.7
Thanks in advance.
Since you have multiple dynos and a dedicated DB instance, and are paying hundreds of dollars a month for the service, you should ask Heroku.
Edit: I should have added that when you check your logs, you can look for the lines that come from the Heroku routing layer ("router" in the log source). That is the layer that takes HTTP requests and sends them to your app. You can add up the times those lines report to see how much time is being spent outside your app. Unfortunately I don't know how easy it is to get large volumes of those logs for a load test.
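As a rough illustration of adding those up (assuming you've drained the logs to a file; connect= and service= are the timing fields the router lines report):

# sum_router_times.rb — a sketch, not a polished tool
totals = Hash.new(0)
count  = 0
File.foreach('router.log') do |line|
  next unless line.include?('heroku[router]')
  count += 1
  line.scan(/(connect|service)=(\d+)ms/) { |field, ms| totals[field] += ms.to_i }
end
abort 'no router lines found' if count.zero?
totals.each do |field, ms|
  puts format('%s: %.1fs total, %.0fms average', field, ms / 1000.0, ms.to_f / count)
end

Comparing those totals with the times your Rails logs report shows how much is being spent outside the app itself.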

Rails app (mod_rails) hangs every few hours

I'm running a Rails app through Phusion Passenger (mod_rails) which will run smoothly for a while, then suddenly slow to a crawl (one or two requests per hour) and become unresponsive. CPU usage is low throughout the whole ordeal, although I'm not sure about memory.
Does anyone know where I should start to diagnose/fix the problem?
Update: restarting the app every now and then does fix the problem, although I'm looking for a more long-term solution. Memory usage gradually increases (initially ~30mb per instance, becomes 40mb after an hour, gets to 60 or 70mb by the time it crashes).
New Relic can show you combined memory usage. Engine Yard recommends tools like Rack::Bug, MemoryLogic or Oink. Here's a nice article on something similar that you might find useful.
If restarting the app cures the problem, looking at its resource usage would be a good place to start.
Sounds like you have a memory leak of some sort. If you'd like to band-aid the issue, you can try setting PassengerMaxRequests to something a bit lower until you figure out what's going on.
http://www.modrails.com/documentation/Users%20guide%20Apache.html#PassengerMaxRequests
This will restart your instances, individually, after they've served a set number of requests. You may have to fiddle with it to find the sweet spot where they are restarting automatically before they lock up.
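For reference, it's a one-line directive in the Apache/Passenger configuration (1000 is just an arbitrary starting point to tune from):

# inside your VirtualHost / Passenger configuration
PassengerMaxRequests 1000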
Other tips are:
-Go through your plugins/gems and make sure they are up to date
-Check for heavy actions and requests where there is a lot of memory consumption (NewRelic is great for this)
-You may want to consider switching to REE as it has better garbage collection
Finally, you may want to set a cron job that looks at your currently running passenger instances and kills them if they are over a certain memory threshold. Passenger will handle restarting them.
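A rough, cron-able sketch of that last idea (the threshold is arbitrary and the parsing is a guess — adjust the regexes to whatever your passenger-memory-stats output actually looks like):

#!/usr/bin/env ruby
# kill_bloated_passengers.rb — run from cron every few minutes
THRESHOLD_MB = 300

`passenger-memory-stats`.each_line do |line|
  # only consider application processes, not Apache or Passenger core processes
  next unless line =~ /Rails:|Rack:/
  pid    = line[/\A\s*(\d+)/, 1]
  # take the last "NNN.N MB" figure on the line (the private/resident column)
  rss_mb = line.scan(/(\d+(?:\.\d+)?)\s*MB/).flatten.last.to_f
  next unless pid && rss_mb > THRESHOLD_MB
  puts "killing Passenger process #{pid} at #{rss_mb.round} MB"
  Process.kill('TERM', pid.to_i) rescue nil # Passenger spawns a replacement
end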
