I am working on the backend of a fairly high-traffic Rails/Heroku/Postgres app. After running for hours, days, or sometimes weeks, the database will suddenly start taking 120 seconds to perform queries that usually take 2-3 seconds. It clears up as soon as the app is restarted and everyone is essentially "kicked off". What could cause a database to start taking a ridiculously long time to perform all queries? The database is not running out of memory, it is being vacuumed regularly, and it is not running out of connections. There are around 500 users at times, dynos are autoscaling, and the web server is Passenger. However, this is probably something with Postgres, as it is happening at the query level.
Related
I have a production app on Heroku with Postgres, serving a Rails API. Every day between 7 and 8am there are one or more long-running requests (about 20s or more), and they only occur then. According to the logs, the time is spent in the database.
There are no scheduled jobs
Backups are not scheduled anywhere close to that time
Traffic is at its lowest at that time
Memory is stable
There are other requests between the daily restart and that, so the dynos are not "cold"
It is not always the same endpoint, and it doesn't occur any other time
I'm not sure if it means anything, but the timezone is Singapore (GMT+8), so 7-8am local time is right before midnight UTC.
Has anyone else experienced this, or does anyone have ideas for troubleshooting?
EDIT to add details:
You can see here (Scout APM) that it precedes the high throughput, basically before the start of the work day, so it's not due to load.
In fact, there is pretty much no load at all. Nor is it the first request since the server restart (at 4am). The slow request at 7:46am here was repeated at night (same URL, same query string) and finished in under a second.
I'm using Heroku for a production Rails application.
I'm monitoring it with Scout and noticed that request times can be up to 4 times slower or faster after a deploy to production.
I took some screenshots this time, but this has happened multiple times. If I'm lucky, it will be fast after a deploy.
The deployment only contained a CSS update.
Heroku's stats also show slower response times:
I'd guess that slower times are due to cache being cleared when you deploy and the server restarts. When each request is cached again you would expect it to speed back up.
I can't offer an explanation for the faster times though.
While testing my website's loading speed, I found that the site sometimes loads very quickly and sometimes takes a long time to start loading. When I looked into it in detail, I found that on some requests the wait time was only a few hundred milliseconds, while on the slow requests the wait time was 5 to 30 seconds.
What could cause this kind of variation, from a few hundred milliseconds to 30 seconds or more? And how can I improve it?
The site is built on ASP.NET MVC 3 with a Microsoft SQL Server database.
What patterns are there? That is, are the same URLs always slow and other URLs always fast, or does it appear to be random?
Look at what else is running on the server. Is it a dedicated server or a VPS?
Look at the DB performance: is it consistent, and which queries take the longest time, the most CPU, the most IO, etc.?
How busy is the site? Do the slowdowns coincide with the app pool being recycled or started up?
After load testing an app hosted on Heroku, I am finding that the most DB-intensive request takes 50-200ms depending on load. It never gets slower than that, no matter the load. However, seemingly at random, the request will time out outright (30s or more).
On Heroku, why might a relatively high performing query/request work perfectly 8 times out of 10 and outright timeout 2 times out of 10 as load increases?
If this is starting to seem like a question for Heroku itself, I'm looking to first answer the question of whether "bad code" could somehow cause this issue -- or if it is clearly a problem on their end.
A bit more info:
Multiple Dynos
Cedar Stack
Dedicated Heroku DB (16 connections, 1.7 GB RAM, 1 comp. unit)
Rails 3.0.7
Thanks in advance.
Since you have multiple dynos and a dedicated DB instance, and are paying hundreds of dollars a month for their service, you should ask Heroku.
Edit: I should have added that when you check your logs, you can look for lines that say "routing". That is the Heroku routing layer, which takes HTTP requests and sends them to your app. You can add those up to see how much time is being spent outside your app. Unfortunately, I don't know how easy it is to get large volumes of those logs for a load test.
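To make the "add those up" step concrete, here is a rough Ruby sketch that totals the timing fields on router lines. It assumes the key=value router format (connect=...ms and service=...ms on lines tagged heroku[router]); older log formats may use different field names, so adjust the patterns to whatever your logs actually contain. The method name and the sample lines are made up for illustration and are not real log output.

```ruby
# Sum router timings from Heroku log lines.
# Assumption: router lines contain "heroku[router]" plus connect=<n>ms and
# service=<n>ms fields. Field names may differ on older log formats.

# Hypothetical helper: returns request count and summed timings in ms.
def router_totals(lines)
  totals = { requests: 0, connect_ms: 0, service_ms: 0 }
  lines.each do |line|
    next unless line.include?("heroku[router]")

    connect = line[/\bconnect=(\d+)ms/, 1]
    service = line[/\bservice=(\d+)ms/, 1]
    next unless connect && service # skip router lines without timings

    totals[:requests] += 1
    totals[:connect_ms] += connect.to_i
    totals[:service_ms] += service.to_i
  end
  totals
end

# Made-up sample lines, including one timed-out request and one app line.
SAMPLE_ROUTER_LINES = [
  'heroku[router]: at=info method=GET path="/report" dyno=web.1 connect=1ms service=180ms status=200 bytes=512',
  'heroku[router]: at=info method=GET path="/report" dyno=web.2 connect=2ms service=95ms status=200 bytes=512',
  'app[web.1]: Completed 200 OK in 170ms (Views: 10.0ms | ActiveRecord: 150.0ms)',
  'heroku[router]: at=error code=H12 desc="Request timeout" method=GET path="/report" dyno=web.1 connect=0ms service=30000ms status=503 bytes=0'
].freeze

totals = router_totals(SAMPLE_ROUTER_LINES)
puts "requests=#{totals[:requests]} connect=#{totals[:connect_ms]}ms service=#{totals[:service_ms]}ms"
```

Comparing the router's totals against the times your app reports for the same requests gives a rough idea of how much time is spent outside the app. To run it on real data, save the log output to a file and pass File.readlines(path) instead of the sample array.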
Ok. I know I don't have a lot of information. That is, essentially, the reason for my question. I am building a game using Flash/Flex and Rails on the back-end. Communication between the two is via WebORB.
Here is what is happening. When I start the client, an operation calls the server every 60 seconds (not much, right?), which results in two database SELECTs, an UPDATE, and a response to the client.
This repeats every 60 seconds. I deployed a test version on Heroku, and New Relic's RPM told me that response time degraded over time. That is with one client performing one task every 60 seconds. Over several hours, the response time drifted from 150ms to over 900ms.
I have been able to reproduce this in my development environment (my MacBook Pro), so it isn't a problem on Heroku's side.
I am not doing anything sophisticated (by design) in the server app. An action gets called, gets some data from the database, performs an AR update and then returns a response. No caching, etc.
Any thoughts? Anyone? I'd really appreciate it.
What does the development log say is slow for those requests: the view or the DB? If it's the DB, check how many records there are in the database and see how to optimize the queries. Maybe you need to index some fields.
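As a quick way to answer the view-or-DB question, here is a small Ruby sketch that scans Rails "Completed" log lines and reports which part dominated each slow request. It assumes the usual "Completed 200 OK in 150ms (Views: ...ms | ActiveRecord: ...ms)" line shape; the method names, the threshold, and the sample lines are hypothetical, so check them against your own log before relying on the output.

```ruby
# Classify slow Rails requests as view-bound or DB-bound from log lines.
# Assumption: lines look like
#   Completed 200 OK in 150ms (Views: 10.0ms | ActiveRecord: 120.0ms)

# Hypothetical helper: returns a hash of timings, or nil for other lines.
def parse_completed(line)
  total = line[/Completed \d{3} .*? in (\d+)ms/, 1]
  return nil unless total

  {
    total_ms: total.to_i,
    # A missing field yields nil, and nil.to_f is 0.0.
    view_ms: line[/Views: ([\d.]+)ms/, 1].to_f,
    db_ms: line[/ActiveRecord: ([\d.]+)ms/, 1].to_f
  }
end

# Hypothetical helper: keeps requests at or above threshold_ms and tags
# each with whichever of DB or view time was larger.
def slow_requests(lines, threshold_ms)
  lines.map { |line| parse_completed(line) }
       .compact
       .select { |req| req[:total_ms] >= threshold_ms }
       .map { |req| req.merge(dominant: req[:db_ms] >= req[:view_ms] ? :db : :view) }
end

# Made-up sample lines for illustration.
SAMPLE_RAILS_LINES = [
  'Started GET "/tasks" for 127.0.0.1',
  'Completed 200 OK in 152ms (Views: 12.3ms | ActiveRecord: 20.1ms)',
  'Completed 200 OK in 930ms (Views: 15.0ms | ActiveRecord: 880.4ms)',
  'Completed 200 OK in 640ms (Views: 600.2ms | ActiveRecord: 18.9ms)'
].freeze

slow_requests(SAMPLE_RAILS_LINES, 500).each do |req|
  puts "#{req[:total_ms]}ms total, db=#{req[:db_ms]}ms, view=#{req[:view_ms]}ms, dominant=#{req[:dominant]}"
end
```

If most slow requests come out DB-bound, the indexing and query advice applies. If neither number accounts for the total, the time is going somewhere else in the request.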
Are you running locally in development or production mode? I've seen Rails apps degrade faster over time (in memory usage) in development mode. I'm not sure whether one can run an app on Heroku in development mode, but if I were you I would look into that.