Lately we've had trouble on our Rais 4.2.7.1 app every night, we start seeing a bunch of really slow ActiveRecord::QueryCache#call calls even though our traffic is relatively low in the middle of the night:
We're running on Heroku using Puma and the app is very job heavy, for which we use Sidekiq. During the day it works fine but every night we get these spikes of extremely slow response times via the API that seem to originate with ActiveRecord::QueryCache#call.
The only thing I can find from our app that might be causing this is we have heroku pg:backups enabled, and the night of the above image the backup began running at 3:06 which is the exact time you see that first ActiveRecord::QueryCache#call spike in the newrelic graph. The backup finished an hour later, however (around the biggest spike), but as you can see the spikes continued until around 5am.
Could this be caused by pg:backups? (Our database is about 19GB), or could it be something else entirely? Is there a good way to avoid this cache call or speed it up? I don't fully understand WHY it would be so slow or exist at all in the transaction list. Any recommendations?
Funnily enough, we've been investigating this lately after seeing similar behaviour. There is a definite performance hit caused by pg:backups on large databases. Notice the big spike just after 1am, when backup kicks in:
DB size is >100GB
It's not that surprising, and in fact Heroku do have documentation on this, which suggests that you should only use pg:backups for databases under 20GB.
For larger databases, creating a follower and taking the backup from that is preferable. Annoyingly for high availability databases, it doesn't appear that you can read from the standby.
I can't shed much light on ActiveRecord::QueryCache though, so the rest of this post is speculation, and maybe the starting point for further investigation. Happy to delete/amend if someone more knowledgable can weigh in :-)
Heroku's docs do say that the backup process will evict well cached data from non-Postgres caches, so this could represent your workers repopulating that cache many times over.
It may also be worth having a look at this. Could your workers be reusing connections and receiving dirty query caches?
Related
I have an app running on Heroku with a few thousand visitors per day. I do not update it very often as it runs well anyway. Recently, however, I started getting memory increases in a way I have never had before. I am 98% sure it does not have to do with any changes in the code that I have done, as I have not done anything to the code in quite a while. I know from experience that tracing down memory issues is extremely difficult and time-consuming - and at the moment I don't have the time to do it.
Considering the fact that I get this stair-case increase in memory over time (over the course of a few hours) once a day, is there a quick and dirty way of just restarting the server once it starts doing so, so it won't slow down the server for those hours? Like
RestartApp if ServerMemory > 500 Mb
or the likes of it?
I am running ruby 2.4.7 (is that likely an issue in terms of memory increases?) and Rails 4.2.10
Let me start off by saying I understand performance is hard, and even harder through a forum, but I hope someone can point me towards the next step of figuring out this problem we're having in production on a Ruby on Rails environment. I've been seeing this for a while, but now with our usage down due to Covid, it's become more clear that it's not a user-load issue, but something in the infrastructure.
The day starts off fine, but midday, things slow more and more... until a passenger restart clears everything up for 24/48/96 hours.
Checking the obvious items:
Server memory usage is in check. Memory use increases during this time from 20%-30%. So some increase, but definitely not swapping (I've seen that happen before)
CPU usage is similar. It increases from 1% at 6am to 18% at worst case, but still well within the capabilities of the server.
Passenger usage. Looking at passenger-status, there's never a backlog of requests. Because of covid, our user-base is down, and even at its slowest, I'm seeing 2-3 passenger processes serving content (out of ~40 max), and no long-running requests.
Delayed job workers (3) are running, but no jobs at the time. They don't seem to be using a significant amount of cpu or mem
Watching the apache/rails logs, nothing looks awry. No DoSes, no unexpected load.
Looking at long-running transactions in NewRelic, there are some that are taking 20-30-50 sec, but after the passenger restart, they are back to 1-3 sec like usual. All the time is in Ruby.
So this is where I'm stuck. I can see where the problem is (185 sec):
but if I restart passenger and re-run, that same code will take less than a second. And that's just the example that I see from today's traces. Yesterday, it looked like it was a different controller having problems.
Any recommendations on my next steps to troubleshoot? I don't know if I should instrument a specific controller, because it's not just one method that's causing the problem (afaik). I think I'm seeing the symptoms, not the cause - and no idea how to see what's really going on.
Thanks in advance
-Mike
Rails 4.2.11.1
Ruby 2.4.9
Passenger 6.0.5 on Apache 2.4.43
I have a server on Heroku - 3 dynos, 2 processes each.
The server does 2 things:
It responds to requests from the browser (AJAX and some web pages), based on data stored in a postgresql database
It exposes a REST API to update the data in the database. This API is called by another server. The rate of calls is limited: The other server only calls my server through a queue with a single worker, which makes sure the other server doesn't issue more than one request in parallel to my server (I verified that indeed it doesn't).
When I look at new relic, I see the following graph, which suggests that even though I keep the other server at one parallel request at most, it still loads my server which creates peaks.
I'd expect that since the rate of calls from the other server is limited, my server will not get overloaded, since a request will only start when the previous request ended (I'm guessing that maybe the database gets overloaded if it gets an update request and returns but continue processing after that).
What can explain this behaviour?
Where else can I look at in order to understand what's going on?
Is there a way to avoid this behaviour?
There are whole lot of directions this investigation could go, but from your screenshot and some inferences, I have two guesses.
A long query—You'd see this graph if your other server or a browser occasionally hits a slow query. If it's just a long read query and your DB isn't hitting its limits, it should only affect the process running the query, but if the query is taking an exclusive lock, all dynos will have to wait on it. Since the spikes are so regular, first think of anything you have running on a schedule - if the cadence matches, you probably have your culprit. The next simple thing to do is run heroku pg:long-running-queries and heroku pg:seq-scans. The former shows queries that might need optimization, and the latter shows full table scans you can probably fix with a different query or a better index. You can find similar information in NewRelic's Database tab, which has time and throughput graphs you can try to match agains your queueing spikes. Finally, look at NewRelic's Transactions tab.
There are various ways to sort - slowest average response time is probably going to help, but check out all the options and see if any transactions stand out.
Click on a suspicious transaction and look at the graph on the right. If you see spikes matching your queueing buildups, that could be it, but since it looks to be affecting your whole site, watch out for several transactions seeing correlated slowdowns.
Check out the transaction traces at the bottom. Something in there taking a long time to run is as close to a smoking gun as you'll get. This should correlate with pg:long-running-queries.
Look at the breakdown table between the graph and the transaction traces. Check for things that are taking a long time (eg. a 2 second external request) or happening often (eg, a partial that gets rendered 2500 times per request). Those are places for caching or optimization.
Garbage collection—This is less likely because Ruby GCs all the time and there's no reason it would show spikes on that regular cadence, but if there's a regular request that allocates a ton of objects, both building the objects and cleaning them up will take time. It would only affect one dyno at once, and it would be correlated with a long or highly repetitive query in your NewRelic investigation. You can see some stats about this in NewRelic's Ruby VM tab.
Take a look at your dyno and DB memory usage too. Both are printed to the Heroku logs, and if you add Librato, they'll build some automatic graphs that are quite helpful. If your dyno is swapping, performance will suffer and you should either upgrade to a bigger dyno or run fewer processes per dyno. Processes will typically accumulate memory as they run and never quite release as much as you'd like, so tune it so that right before a restart, your dyno is just under its available RAM. Similarly for the DB, if you're hitting swap there, query performance will suffer and you should upgrade.
Other things it could be, but probably isn't in this case:
Sleeping dynos—Heroku puts a dyno to sleep if it hasn't served a request in a while, but only if you have just 1 dyno running. You have 3, so this isn't it.
Web Server Concurrency—If at any given moment, there are more requests than available processes, requests will be queued. The obvious fix is to increase the available dynos/processes, which will put more load on your DB and potentially move the issue there. Since some regular request is visible every time, I'm guessing request volume is low and this also isn't your problem.
Heroku Instability—Sometimes, for no obvious reason, Heroku starts queueing requests more than it should and doesn't report any issues at status.heroku.com. Restarting the dynos typically fixes that temporarily while Heroku gets their head back on straight.
We're in the process of improving performance of the our rails app hosted at Heroku (rails 3.2.8 and ruby 1.9.3). During this we've come across one alarming problem for which the source seems to be extremely difficult to track. Let me quickly explain how we experience the problem and how we've tried to isolate it.
--
Since around June we've experienced weird lag behavior in Time to First Byte all over the site. The problems is obvious from using the site (sometimes the application doesn't respond for 10-20 seconds), and it's also present in waterfall analysis via webpagetest.org.
We're based in Denmark but get this result from any host.
To confirm the problem we've performed a benchmark test where we send 300 identical requests to a simple page and measured the response time.
If we send 300 requests to the front page the median response time is below 1 second, which is fairly good. What scares us is that 60 requests takes more that double that time and 40 of those takes more than 4 seconds. Some requests take as much as 16 seconds.
None of these slow requests show up in New Relic, which we use for performance monitoring. No request queuing shows up and the results are the same no matter how high we scale our web processes.
Still, we couldn't reject that the problem was caused by application code, so we tried another experiment where we responded to the request via rack middleware.
By placing this middleware (TestMiddleware) at the beginning of the rack stack, we returned a request before it even hit the application, ensuring that none of the following middleware or the rails app could cause the delay.
Middleware setup:
$ heroku run rake middleware
use Rack::Cache
use ActionDispatch::Static
use TestMiddleware
use Rack::Rewrite
use Rack::Lock
use Rack::Runtime
use Rack::MethodOverride
use ActionDispatch::RequestId
use Rails::Rack::Logger
use ActionDispatch::ShowExceptions
use ActionDispatch::DebugExceptions
use ActionDispatch::RemoteIp
use Rack::Sendfile
use ActionDispatch::Callbacks
use ActiveRecord::ConnectionAdapters::ConnectionManagement
use ActiveRecord::QueryCache
use ActionDispatch::Cookies
use ActionDispatch::Session::DalliStore
use ActionDispatch::Flash
use ActionDispatch::ParamsParser
use ActionDispatch::Head
use Rack::ConditionalGet
use Rack::ETag
use ActionDispatch::BestStandardsSupport
use NewRelic::Rack::BrowserMonitoring
use Rack::RailsExceptional
use OmniAuth::Builder
run AU::Application.routes
We then ran the same script to document response time and got pretty much the same result. The median response time was around 130ms (obviously faster because it doesn't hit the app. But still 60 requests took more than 400ms and 25 requests took more than 1 second. Again, with some requests as slow as 16 seconds.
One explanation could be related to slow hops on the network or DNS setup, but the results of traceroute looks perfectly OK.
This result was confirmed from running the response script on another rails 3.2 and ruby 1.9.3 application hosted on Heroku - no weird behavior at all.
The DNS setup follows Heroku's recommendations.
--
We're confused to say the least. Could there be something fishy with Heroku's routing network?
Why the heck are we seeing this weird behavior? How do we get rid of it? And why can't we see it in New Relic?
It Turned out that it was a kind of request queuing. Sometimes, that web server was busy, and since heroku just routs randomly incoming requests randomly to any dyno, then I could end up in a queue behind a dyno, which was totally stuck due to e.g. database problems. The strange thing is, that this was hardly noticeable in new relic (it's a good idea to uncheck all other resources when viewing thins in their charts, then the queuing suddenly appears)
EDIT 21/2 2013: It has turned out, that the reason why it wasn't hardly noticeable in Newrelic was, that it wasn't measured! http://rapgenius.com/Lemon-money-trees-rap-genius-response-to-heroku-lyrics
We find this very frustrating, and we ended up leaving Heroku in favor of dedicated servers. This gave us 20 times better performance at a 1/10 of the cost. Additionally I must say that we are disappointed by Heroku who at the time this happened, denied that the slowness was due to their infrastructure even though we suspected it and highlighted it several times. We even got answers like this back:
Heroku 28/8 2012: "If you're not seeing request queueing or other slowness reported in New Relic, then this is likely not a server-side issue. Heroku's internal routing should take <1ms. None of our monitoring systems are indicating any routing problems currently."
Additionally we spoke to Newrelic who also seemed unaware of the issue, even though they according to them selfs has a very close work relationship with Heroku.
Newrelic 29/8 2012: "It looks like whatever is causing this is happening before the Ruby agent's visibility starts. The queue time that the agent records is from the time the request enters a dyno, so the slow down is occurring before then."
The bottom-line was, that we ended up spending hours and hours on optimizing code that wasn't really the bottleneck. Additionally running with a too high dyno scale in a desperate try to boost our performance, but the only thing that we really got from this was bigger receipts from both Heroku and Newrelic - NOT COOL. I'm glad that we changed.
PS. At that time there even was a bug that caused newrelic pro to be charged on ALL dynos even though we, (according to Newrelics own advice), had disabled the monitoring on our background worker processes. It took a lot of time and many emails before the mistake was admitted by both parties.
PPS. If you are not aware of the current ongoing discussion, then here is the link http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics
EDIT 26/2 2013
Heroku has just announced in their newsletter, that Newrelic has released an update that apparently should cast some light on the situation at Heroku.
EDIT 8/4 2013
Heroku has just released an FAQ over the topic
traceroute is not a good measure of problems in the network, its a tool that can find failures along the network, but it will not show you the best view.
Try just putting up a static webpage and hit it with the ip address from your webpage tester. If it is still slow, blame the network.
If for some reason it is fast, then you have a different issue.
I'm running a Rails app through Phusion Passenger (mod_rails) which will run smoothly for a while, then suddenly slow to a crawl (one or two requests per hour) and become unresponsive. CPU usage is low throughout the whole ordeal, although I'm not sure about memory.
Does anyone know where I should start to diagnose/fix the problem?
Update: restarting the app every now and then does fix the problem, although I'm looking for a more long-term solution. Memory usage gradually increases (initially ~30mb per instance, becomes 40mb after an hour, gets to 60 or 70mb by the time it crashes).
New Relic can show you combined memory usage. Engine Yard recommends tools like Rack::Bug, MemoryLogic or Oink. Here's a nice article on something similar that you might find useful.
If restarting the app cures the problem, looking at its resource usage would be a good place to start.
Sounds like you have a memory leak of some sort. If you'd like to bandaid the issue you can try setting the PassengerMaxRequests to something a bit lower until you figure out what's going on.
http://www.modrails.com/documentation/Users%20guide%20Apache.html#PassengerMaxRequests
This will restart your instances, individually, after they've served a set number of requests. You may have to fiddle with it to find the sweet spot where they are restarting automatically before they lock up.
Other tips are:
-Go through your plugins/gems and make sure they are up to date
-Check for heavy actions and requests where there is a lot of memory consumption (NewRelic is great for this)
-You may want to consider switching to REE as it has better garbage collection
Finally, you may want to set a cron job that looks at your currently running passenger instances and kills them if they are over a certain memory threshold. Passenger will handle restarting them.