I have a production app on Heroku with Postgres, serving a Rails API. Every day between 7-8am there are one or more long-running requests (about 20s or more), and it only happens then. According to the logs, the time is spent in the database
There are no scheduled jobs
Backups are not scheduled anywhere close to that time
Traffic is at its lowest at that time
Memory is stable
There are other requests between the daily restart and that, so the dynos are not "cold"
It is not always the same endpoint, and it doesn't occur any other time
I'm not sure if it means anything, but the timezone is Singapore (GMT+8), so 7-8am local is 11pm to midnight UTC, i.e. right before midnight UTC.
Has anyone else experienced this or has ideas to troubleshoot?
EDIT to add details:
You can see here (Scout APM) that it precedes the high throughput, basically before the start of the work day, so it's not due to load.
In fact there is pretty much no load at all. Nor is it the first request since the server restart (at 4am). The slow request at 7:46am shown here was repeated at night (same URL, same query string) and finished in under a second
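A minimal sketch of one way to capture more detail from the app side (the initializer name and the 1s threshold are just placeholders, not anything already in the app): subscribe to Rails' sql.active_record notifications and log any query over the threshold, so whatever runs at 7-8am shows up with its full SQL.

# config/initializers/slow_sql_logger.rb (hypothetical file name)
SLOW_SQL_THRESHOLD_MS = 1_000 # assumption: anything over 1s is worth logging

ActiveSupport::Notifications.subscribe("sql.active_record") do |_name, start, finish, _id, payload|
  duration_ms = (finish - start) * 1000.0
  if duration_ms > SLOW_SQL_THRESHOLD_MS
    Rails.logger.warn(
      "[slow-sql] #{duration_ms.round}ms name=#{payload[:name]} sql=#{payload[:sql]}"
    )
  end
end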
Related
I deployed a Vue.js app and a Kotlin server app. Cloud Run promises to put a service to sleep if no requests arrive for a certain time. I hadn't opened my app for a day, and when I opened it, it was available almost immediately. Since I know how long it takes to spin up when started locally, I don't quite trust that Cloud Run really put the app to sleep and spun it up that fast.
I'd love to know a way to really see how long the spin-up took - also to improve startup time for the backend service.
After the service has been inactive for some time, note the current time and then request the service URL.
Then go to the logs for the Cloud Run service and use this filter:
resource.type="cloud_run_revision"
resource.labels.service_name="$SERVICE_NAME"
Look for the log entry with the normal app output after your request, check its time and compare it with the recorded time.
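If you want the client side of that comparison captured automatically, something like this records when the request went out and how long the round trip took, so you can line it up against the first log entry of the new container instance (a rough sketch; the URL is a placeholder):

# cold_start_probe.rb -- hypothetical helper script
require "net/http"
require "uri"
require "time"

uri = URI("https://my-service-xyz.a.run.app/")  # placeholder service URL
sent_at = Time.now.utc
response = Net::HTTP.get_response(uri)
elapsed = Time.now.utc - sent_at

puts "request sent at:  #{sent_at.iso8601(3)}"
puts "response status:  #{response.code}"
puts "total round trip: #{(elapsed * 1000).round} ms"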
You can't know when the instance will be evicted or whether it is kept in memory. It could happen quickly, or take hours or days before eviction. It's "serverless".
As for the startup time: when I test, I deploy a new revision and try it out. In the logging service, the first log entry of the new revision gives me the cold start duration (usually 300+ ms, compared to the usual 20-50 ms with a warm start).
The billed instance time is the sum of all the containers' running times. A container is considered "running" while it is processing requests.
So I am working on a pretty high traffic Rails/Heroku/Postgres app (backend only). After running for hours, days, or sometimes weeks, the database will randomly start taking 120 seconds to perform queries that usually take 2-3 seconds, and it clears up as soon as the app is restarted and everyone is essentially "kicked off". What could cause a database to start taking a ridiculously long time to perform all queries? The database is not running out of memory, it is being vacuumed regularly, and it is not running out of connections. There are around 500 users at times, the dynos are autoscaling, and the web server is Passenger. However, this is probably something with PG, as it is happening at the query level.
I have a server on Heroku - 3 dynos, 2 processes each.
The server does 2 things:
It responds to requests from the browser (AJAX and some web pages), based on data stored in a postgresql database
It exposes a REST API to update the data in the database. This API is called by another server. The rate of calls is limited: The other server only calls my server through a queue with a single worker, which makes sure the other server doesn't issue more than one request in parallel to my server (I verified that indeed it doesn't).
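Roughly, the single-worker arrangement on the other server looks like this (a simplified sketch with placeholder names, not the actual code): one thread drains a queue, so at most one request is in flight against my server at any time.

require "net/http"
require "uri"
require "json"

UPDATE_QUEUE = Queue.new  # thread-safe FIFO from Ruby's stdlib

worker = Thread.new do
  loop do
    payload = UPDATE_QUEUE.pop  # blocks until work arrives
    uri = URI("https://my-app.herokuapp.com/api/update")  # placeholder endpoint
    Net::HTTP.post(uri, payload.to_json, "Content-Type" => "application/json")
    # the next request is only sent after this one returns
  end
end

# producers elsewhere just enqueue:
UPDATE_QUEUE << { record_id: 42, status: "done" }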
When I look at New Relic, I see the following graph, which suggests that even though I keep the other server to at most one parallel request, it still loads my server and creates peaks.
I'd expect that since the rate of calls from the other server is limited, my server would not get overloaded, since a request only starts when the previous request has ended (I'm guessing that maybe the database gets overloaded if it receives an update request, returns, but continues processing after that).
What can explain this behaviour?
Where else can I look in order to understand what's going on?
Is there a way to avoid this behaviour?
There are a whole lot of directions this investigation could go, but from your screenshot and some inferences, I have two guesses.
A long query—You'd see this graph if your other server or a browser occasionally hits a slow query. If it's just a long read query and your DB isn't hitting its limits, it should only affect the process running the query, but if the query is taking an exclusive lock, all dynos will have to wait on it. Since the spikes are so regular, first think of anything you have running on a schedule - if the cadence matches, you probably have your culprit. The next simple thing to do is run heroku pg:long-running-queries and heroku pg:seq-scans (a console sketch of roughly what the former reports is below, after this list). The former shows queries that might need optimization, and the latter shows full table scans you can probably fix with a different query or a better index. You can find similar information in NewRelic's Database tab, which has time and throughput graphs you can try to match against your queueing spikes. Finally, look at NewRelic's Transactions tab.
There are various ways to sort - slowest average response time is probably going to help, but check out all the options and see if any transactions stand out.
Click on a suspicious transaction and look at the graph on the right. If you see spikes matching your queueing buildups, that could be it, but since it looks to be affecting your whole site, watch out for several transactions seeing correlated slowdowns.
Check out the transaction traces at the bottom. Something in there taking a long time to run is as close to a smoking gun as you'll get. This should correlate with pg:long-running-queries.
Look at the breakdown table between the graph and the transaction traces. Check for things that are taking a long time (e.g. a 2-second external request) or happening often (e.g. a partial that gets rendered 2500 times per request). Those are places for caching or optimization.
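If you'd rather poke at this from a Rails console than the Heroku CLI, something along these lines shows the same kind of information heroku pg:long-running-queries is based on (a sketch against Postgres's pg_stat_activity view; column names assume a reasonably modern Postgres, and the 5-second threshold is arbitrary):

# Rails console sketch: list queries that have been running for more than 5 seconds.
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT pid,
         now() - query_start AS duration,
         state,
         query
  FROM pg_stat_activity
  WHERE state <> 'idle'
    AND query_start < now() - interval '5 seconds'
  ORDER BY duration DESC
SQL

rows.each do |row|
  puts "#{row['pid']}  #{row['duration']}  #{row['query'][0, 120]}"
end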
Garbage collection—This is less likely because Ruby GCs all the time and there's no reason it would show spikes on that regular cadence, but if there's a regular request that allocates a ton of objects, both building the objects and cleaning them up will take time. It would only affect one dyno at a time, and it would be correlated with a long or highly repetitive query in your NewRelic investigation. You can see some stats about this in NewRelic's Ruby VM tab.
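If you want to see that per-request rather than in aggregate, a tiny piece of Rack middleware around the app can log GC activity for each request (a sketch; the middleware name is made up, and GC.stat key names vary a bit between Ruby versions):

# Hypothetical middleware: log how many GC runs and object allocations each
# request caused, to spot the "allocates a ton of objects" case.
class GcStatsLogger
  def initialize(app)
    @app = app
  end

  def call(env)
    before = GC.stat
    response = @app.call(env)
    after = GC.stat
    Rails.logger.info(
      "[gc] path=#{env['PATH_INFO']} " \
      "gc_runs=#{after[:count] - before[:count]} " \
      "allocated=#{after[:total_allocated_objects] - before[:total_allocated_objects]}"
    )
    response
  end
end

# config/application.rb (or an initializer):
# config.middleware.use GcStatsLogger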
Take a look at your dyno and DB memory usage too. Both are printed to the Heroku logs, and if you add Librato, they'll build some automatic graphs that are quite helpful. If your dyno is swapping, performance will suffer and you should either upgrade to a bigger dyno or run fewer processes per dyno. Processes will typically accumulate memory as they run and never quite release as much as you'd like, so tune it so that right before a restart, your dyno is just under its available RAM. Similarly for the DB, if you're hitting swap there, query performance will suffer and you should upgrade.
Other things it could be, but probably isn't in this case:
Sleeping dynos—Heroku puts a dyno to sleep if it hasn't served a request in a while, but only if you have just 1 dyno running. You have 3, so this isn't it.
Web Server Concurrency—If at any given moment, there are more requests than available processes, requests will be queued. The obvious fix is to increase the available dynos/processes, which will put more load on your DB and potentially move the issue there. Since some regular request is visible every time, I'm guessing request volume is low and this also isn't your problem.
Heroku Instability—Sometimes, for no obvious reason, Heroku starts queueing requests more than it should and doesn't report any issues at status.heroku.com. Restarting the dynos typically fixes that temporarily while Heroku gets their head back on straight.
My users are seeing occasional request timeouts on Heroku. Unfortunately I cannot consistently reproduce them, which makes them really hard to debug. There's plenty of opportunity to improve performance - e.g. by reducing the huge number of database queries per request and by adding more caching - but without profiling that's a shot in the dark.
According to our New Relic analytics, many requests take between 1 and 5 seconds on the server. I know that's too slow, but it's nowhere near the 30 seconds needed for the timeout.
The error tab on New Relic shows me several different database queries where the timeout occurs, but these aren't particularly slow queries and it can be different queries for each crash. Also, for the same URL it sometimes does and sometimes doesn't show a database query.
How do I find out what's going on in these particular cases? E.g. how do I see how much time it was spending in the database when the timeout occurred, as opposed to the time it spends in the database when there's no error?
One hypothesis I have is that the database gets locked in some cases; perhaps a combination of reading and writing.
You may have already seen it, but Heroku has a doc with some good background about request timeouts.
If your requests are taking a long time, and the processes servicing them are not being killed before the requests complete, then they should be generating transaction traces that will provide details about individual transactions that took too long.
If you're using Unicorn, it's possible that this is not happening because the requests are taking long enough that they're hitting up against Unicorn's timeout (after which the workers servicing those requests will be forcibly killed, not giving the New Relic agent enough time to report back in).
I'd recommend a two-step approach:
Configure the rack-timeout middleware to have a timeout below Heroku's 30s timeout. If this works, it will terminate requests taking longer than the timeout by raising a Timeout::Error, and such requests should generate transaction traces in New Relic (a config sketch for both steps follows below).
If that yields nothing (which it might, because rack-timeout relies on Ruby's stdlib Timeout class, which has some limitations), you can try bumping the Unicorn request handling timeout up from its default of 60s (assuming you're using Unicorn). Be aware that long-running requests will tie up a Unicorn worker for a longer period in this case, which may further slow down your site, so use this as a last resort.
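For concreteness, the two pieces of configuration look roughly like this (a sketch only; the file names are placeholders and the exact option names differ between rack-timeout and Unicorn versions, so check the READMEs for the versions you run):

# config/initializers/rack_timeout.rb -- step 1: time out requests below
# Heroku's 30s router timeout so New Relic gets a trace for them.
Rails.application.config.middleware.insert_before(
  Rack::Runtime, Rack::Timeout, service_timeout: 25  # seconds; option name varies by gem version
)

# config/unicorn.rb -- step 2 (last resort): give workers longer than the
# default 60s before Unicorn kills them, so slow requests can finish and report.
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
timeout 90
preload_app true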
Two years late here. I have minimal experience with Ruby, but for Django the issue with Gunicorn is that it does not properly handle slow clients on Heroku because requests are not pre-buffered, meaning a server connection can be left waiting (blocked). This article might be helpful to you, although it applies primarily to Gunicorn and Python.
You're pretty clearly hitting the issue with long-running requests. Check out http://artsy.github.com/blog/2013/02/17/impact-of-heroku-routing-mesh-and-random-routing/ and upgrade to NewRelic RPM 3.5.7.59 - the wait time measurement will be reported accurately.
After performing load testing against an app hosted on Heroku, I am finding that the most DB intensive request takes 50-200ms depending upon load. It never gets slower, no matter the load. However, seemingly at random, the request will outright timeout (30s or more).
On Heroku, why might a relatively high performing query/request work perfectly 8 times out of 10 and outright timeout 2 times out of 10 as load increases?
If this is starting to seem like a question for Heroku itself, I'm looking to first answer the question of whether "bad code" could somehow cause this issue -- or if it is clearly a problem on their end.
A bit more info:
Multiple Dynos
Cedar Stack
Dedicated Heroku DB (16 connections, 1.7 GB RAM, 1 comp. unit)
Rails 3.0.7
Thanks in advance.
Since you have multiple dynos and a dedicated DB instance and are paying hundreds of dollars a month for their service, you should ask Heroku.
Edit: I should have added that when you check your logs, you can look for a line that says "routing". That is the Heroku routing layer that takes HTTP requests and sends them to your app. You can add those up to see how much time is being spent outside your app. Unfortunately I don't know how easy it is to get large volumes of those logs for a load test.
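If you do collect the logs, a small script can total up the router timings for you (a sketch assuming the usual heroku[router] log line format with connect= and service= fields, and a log file saved locally, e.g. via heroku logs redirected to a file):

# parse_router_times.rb -- hypothetical helper: sum and average the router's
# connect= and service= fields from saved Heroku logs.
connect_ms = []
service_ms = []

File.foreach(ARGV.fetch(0, "app.log")) do |line|
  next unless line.include?("heroku[router]")
  connect_ms << $1.to_i if line =~ /connect=(\d+)ms/
  service_ms << $1.to_i if line =~ /service=(\d+)ms/
end

def summary(label, values)
  return if values.empty?
  puts "#{label}: count=#{values.size} avg=#{values.sum / values.size}ms max=#{values.max}ms"
end

summary("connect", connect_ms)
summary("service", service_ms)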