Recently our ruby on rails application was upgraded to 2.3.8. We also replaced Mongrel/Mongrel cluster with Phusion Passenger during upgrade.
Whenever we try to deploy our application, it seems to be responding faster initially but response time gradually increases. We also noticed that cpu usage on the database box spiking to 400% and lot of requests waiting in global queue. This seems to be happening only in our production environment.
Can anyone let me know how I should go about debugging this issue?
Is there anyway we can restrict the number of connections between passenger and DB?
Also is there a way we can setup connection pooling in passenger?
Thanks,
Sivakumar.
I don't think the issue is with Passenger. If you are having a large spike on your DB box CPU, your issue probably lies there. Without more information about your database it's hard to give you specifics, but here are a few things you could try:
Run top and check how much of your total memory the mysqld or other database process is consuming. If this amount is not high then you probably need to tune MySQL settings to take advantage of the RAM on your DB box.
Analyze your running database queries with the mytop command. You might have some queries that are hogging all of your system resources or causing lots of swapping.
Look through your MySQL slow log and see if you have queries that are taking over 1 second to run.
Check your database engines. Are you using MYISAM and/or InnoDB? You might need to tune your database differently so that it can allocate the proper amount of memory and resources to each engine.
Consult a DBA. They'll be able to take a look at your usage and application and tell you more definitively if the issue lies with your database or application.
P.S: I would recommend upgrading to Passenger 3, it performs better than Passenter 2.
Related
My database is very cpu constrained, and I can't find the root cause of the issue. I currently have two applications servers each wit a Rails api connecting to PostgreSQL via the ruby-pg gem. Both application server also have sidekiq running background jobs, and I have a handful of support servers processing new posts from a national feed via sidekiq. If I were running out of memory, the solution would seemingly be straight forward. Any general ideas why I am CPU constrained?
Database Specs:
Rackspace 8GB Performance Tier cloud VM (8GB RAM, 8x Core CPU, SSD)
Debian 7 Wheezy Linux OS
PostgreSQL 9.1 with PostGIS extension
Possible Problems:
PostgreSQL 9.1 is bad at indexes
The database has nearly 10GB of indexes. I am going to upgrade my database to PostgreSQL version >= 9.2. In version 9.2, index only scans were introduced.
Too many connections
In the postgresql.conf, I have set max connection equal to '500'. Usually throughout the day, only 175 connections are utilized, but during peak times, sidekiq tasks will increase the current connections to 350. How many connections are recommended with an 8GB server instance?
Idol Connections
When I take a look at pg_stat_activity in the psql console, I see sidekiq is leaving a lot of IDLE connections. Could these connections result in CPU inflation? Does the fix exist in the api or in sidekiq?
Need a more powerful server
Maybe there is not a bug. I might need to simply increase the server instance. Again this would make more sense if I was memory bound. However, both app servers and 3 of the support sidekiq servers are 4gb performance tier instances. Essentially, servers that interact with the database have combined more than double the resources of the database. Should this even matter?
Additional questions:
What tools/techniques should I employ to troubleshoot the issue?
Any basic settings in the postgresql.conf related to cpu usage?
Are there any known issues related to rails, sidekiq, or the pg gem that could be a contributing factor? (I havent seen any open issues.)
Are there any general postgreSQL guideline for CPU usage?
Any other ideas thoughts that might help my search?
You are using massively too many concurrent connections. PostgreSQL will be wasting lots of its time on housekeeping and juggling concurrent queries. All the concurrent work will be fighting for CPU and buffer space, there'll be heavy contention on spinlocks, and it'll all generally be a mess.
On an 8 core machine, you should probably not have more than 20 actively working connections if you're mostly CPU constrained. If you're I/O limited, you can go higher, but 350 is just ridiculous.
If possible, put a PgBouncer in transaction pooling mode in front of your PostgreSQL instance, so queries get queued up and executed rapidly in series instead of slowly in parallel.
See number of database connections (Pg wiki).
Additionally, PostGIS can be very CPU-heavy. It sometimes needs to do very complex calculations. I suggest using the auto_explain module to record long running queries, and using pg_stat_statements / pg_stat_plans to record what's taking up resources. Examine these queries to see if they need improvement.
Your idle in transaction sessions must be dealt with, too. Depending on why they're idle and whether they have a transaction ID or not, they might be causing serious table bloat. They're also creating unnecessary signalling overhead within PostgreSQL, as it has to do more co-ordination with backends that're actively doing things. Finally, the number of open transactions its self increases the cost of some internal housekeeping operations.
So. Your DB will probably perform better if you reduce the connection counts, put a PgBouncer in transaction pooling mode in front, and fix those idle connections.
Most likely you are CPU constrained because your work needs a lot of CPU. :)
9.1 is not generally bad at indexes. There may be some specific issues, as all versions might, which exactly what they are might change from version to version.
Index-only-scans are mostly a benefit when you are IO constrained. I wouldn't hold out much hope for that being a magic bullet for you.
350 connections are certainly not helpful, but probably are not very harmful, either. But when they are harmful, it can be downright catastrophic. The correct value is more determined by the number of cores, not the amount of RAM. If it is easy to throttle down the sidekiq connections, do it even if you can't prove that it helps.
If the connections are just IDLE, not IDLE in transaction, then they probably aren't very harmful, but again there are a few cases where they can be. That is pretty much the same issue as the number of connections.
The connection you showed from top was idle in transaction. That status shouldn't be taking up much CPU, so that probably means it is rapidly cycling through statements and top just happens to catch it while it is between them. But you didn't say how many similar lines there were in top, if it is just that one it suggests your code is not running concurrently and 7 of you 8 CPUs are wasted.
Regarding the db server versus the other servers, if the database is fundamentally the limit, beating on it with a bigger hammer is not going to help. Often there is some flexibility about where computation is done. If you can get the app servers to do more computation that is currently done on the db and let the db focus on ACID issues, that would be good. But no one but you can know if that is possible or feasible.
My first stop would be to use pg_stat_statements to see what SQL statements are taking the most time. Maybe just adding an index to the slowest/most frequent query would make the problem magically go away.
We are running 2 rails application on server with 4GB of ram. Both servers use rails 3.2.1 and when run in either development or production mode, the servers eat away ram at incredible speed consuming up-to 1.07GB ram each day. Keeping the server running for just 4 days triggered all memory alarms in monitoring and we had just 98MB ram free.
We tried active-record optimization related to bloating but still no effect. Please help us figure out how can we trace the issue that which of the controller is at fault.
Using mysql database and webrick server.
Thanks!
This is incredibly hard to answer, without looking into the project details itself. Though I am quite sure you won't be using Webrick in your target production build(right?), so check if it behaves the same under Passenger or whatever is your choice.
Also without knowing the details of the project I would suggest looking at features like generating pdfs, csv parsing, etc. Seen a case, where generating pdf files have been eating resources in a similar fashion, leaving like 5mb of not garbage collected memory for each run.
Good luck.
i'm running a rails application with apache+passenger on virtual servers that do not have any swap space configured.
The site gets decent amount of traffic with 200K+ daily requests and sometimes the whole system runs out of memory causing odd behaviour on whole system.
The question is that is there any way to configure apache or passenger not to run out of memory (e.g. gracefully restarting passenger instances when they start using, say more than 300M of memory).
Servers have 4GB of memory and currently i'm using passenger's PassengerMaxRequests option but it does not seem to be the most solid solution here.
At the moment, i also cannot switch to nginx so that is not an option to preserve some room.
Any clever ideas i'm probably missing are welcome.
Edit: My temporary solution
I did not go with restarting Rails instances when they exceed certain amount of memory usage. Engine Yard wrote great blog post on the ActiveRecord memory bloat issue. This is our main suspect on the subject. As i did not have much time to optimize application, i set PassengerMaxRequests to 300 and added extra 2GB memory to server. Things have been good since then. At first i was worried that re-starting Rails instances continuously makes it slow but it does not seem to have impact i should worry about.
If you mean "limiting" as killing those processes and if this is the only application on the server and it is a Linux, then you have two choices:
Set maximum amount of memory one process can have:
# ulimit -m
unlimited
Or use cgroups for similar behavior:
http://en.wikipedia.org/wiki/Cgroups
I would advise against restarting instances (if that is possible) that go over the "memory limit", because that may put your system in infinite loops where a process repeatedly reaches that limit and restarts.
Maybe you could write a simple daemon that constantly watches the processes, and kills any that go over a certain amount of memory. Be sure to log any information about the process that did this so you can fix the issue whenever it comes up.
I'd also look into getting some real swap space on there... This seems like a bad hack.
I have a problem where passenger processes end up going out of control and consuming too much memory. I wrote the following script which has been helping to keep things under control until I find the real solution http://www.codeotaku.com/blog/2012-06/passenger-memory-check/index. It might be helpful.
Passenger web instances don't contain important state (generally speaking) so killing them isn't normally a process, and passenger will restart them as and when required.
I don't have a canned solution but you might want to use two commands that ship with Passenger to keep track of memory usage and nr of processes: passenger-status and sudo passenger-memory-stats, see
Passenger users guide for Nginx or
Passenger users guide for Apache.
I have recently installed Nginx + Thin on my deployment server, but i am not sure how this will perform in last requests & responses situation. lets say 1000/req per sec.
so the speed on thin is good with 10-100 req /per sec
I wanted to know on higher volumes of data being processed on the request/response cluster.
Guide me on this :-)
Multiple thin processes and nginx are capable of providing lots of speed, depending on what your application is doing. So, the problem will be your application code, the speed of your application server, and your database server.
Scaling Rails has been recently covered in depth by the Scaling Rails Screencasts. I recommend you start there. My 5 step program to scaling Rails would be:
First step is to have the tools to look at what is slow in your application. Do not spend time optimizing everything in your application when you don't know what the problem is.
The easiest way to be able to handle lots of requests/second is with page caching.
If you can't do that, cache everything possible (fragment caching, use memcached to cache data, etc), to speed up your application.
After that, optimize your application as best as possible, make SQL queries fast, index everything, etc.
If you still need more speed, throw more hardware at the problem. Get a big, powerful database server, a bunch of app servers, and proxy your requests across them. You can start here, too, but it will only delay the optimization process.
If you have a single server I think that the main key is, apart from everything already mentioned, is don't skimp on the specs of it. Trying to get too much to run on too little is just a recipe for disaster.
It is also a good idea to get monit or God monitoring your thin instances, I started out with God, but it leaked memory pretty bad on Ruby 1.8.6 so I stop using it in favour of monit. Monit is written in C I believe and has a tiny memory footprint so I'd recommend that one.
If all that seems like a bit much to keep nginx and thin playing nicely you may want to look into an all in one solution like Passenger or LiteSpeed. I have very little experience with these so can offer no substancial advice for them.
I have a simple Rails app deployed on a 500 MB Slicehost VPN. I'm the only one who uses the app. When I run it on my laptop, it's fast enough. But the deployed version is insanely slow. It take 6 to 10 seconds to load the login screen.
I would like to find out why it's so slow. Is it my code? (Don't think so because it's much faster locally, but maybe.) Is it Slicehost's server being overloaded? Is it the Internet?
Can someone suggest a technique or set of steps I can take to help narrow down the cause of this problem?
Update:
Sorry forgot to mention. I'm running it under CentOS 5 using Phusion Passenger (AKA mod_rails or mod_rack).
If it is just slow on the first time you load it is probably because of passenger killing the process due to inactivity. I don't remember all the details but I do recall reading people who used cron jobs to keep at least one process alive to avoid this lag that can occur with passenger needed to reload the environment.
Edit: more details here
Specifically - pool idle time defaults to 2 minutes which means after two minutes of idling passenger would have to reload the environment to serve the next request.
First, find out if there's a particularly slow response from the server. Use Firefox and the Firebug plugin to see how long each component (including JavaScript and graphics) takes to download. Assuming the main page itself is what is taking all the time, you can start profiling the application. You'll need to find a good profiler, and as I don't actually work in Ruby on Rails, I can't suggest any: google "profile ruby on rails" for some options.
As YenTheFirst points out, the server software and config you're using may contribute to a slowdown, but A) slicehost doesn't choose that, you do, as Slicehost just provides very raw server "slices" that you can treat as dedicated machines. B) you're unlikely to see a script that runs instantly suddenly take 6 seconds just because it's running as CGI. Something else must be going on. Check how much RAM you're using: have you gone into swap? Is the login slow only the first time it's hit indicating some startup issue, or is it always that slow? Is static content served slow? That'd tend to mean some network issue (either on the Slicehost side, or your local network) is slowing things down, assuming you're not in swap.
When you say "fast enough" you're being vague: does the laptop version take 1 second to the Slicehost 6? That wouldn't be entirely surprising, if the laptop is decent: after all, the reason slices are cheap is because they're a fraction of a full server. You're using probably 1/32 of an 8 core machine at Slicehost, as opposed to both cores of a modern laptop. The Slicehost cores are quick, but your laptop could be a screamer compared to 1/4 of core. :)
Try to pint point where the slowness lies
1/ application is slow, or infrastructure (network + web server)
put a static file on your web server, and access it through your browser
2/ If it is fast, it is probable a problem with application + server configuration.
database access is slow
try a page with a simpel loop: is it slow?
3/ If it slow, it is probably your infrastructure. You can check:
bad network connection: do a packet capture (with Wireshark for example) and look for retransmissions, duplicate packets, etc.
DNS resolution is slow?
server is misconfigured?
etc.
What is Slicehost using to serve it?
Fast options are things like: Mongrel, or apache's mod_rails (also called passenger phusion or
something like that)
These are dedicated servers (or plugins to servers) which run an instance of your rails app.
If your host isn't using that, then it's probably defaulting to CGI. Rails comes with a simple CGI script that will serve the page, but it reloads the app for every page.
(edit: I suspect that this is the most likely case, that your app is running off of the CGI in /webapp_directory/public/dispatch.cgi, which would explain the slowness. This tends to be a default deployment on many hosts, since it doesn't require extra configuration on their part, but it doesn't give good performance)
If your host supports "Fast CGI", rails supports that too. Fast CGI will open a CGI session, and keep it open for multiple pages, so you get much better performance, but it's not nearly as good as Mongrel or mod_rails.
Secondly, is it in 'production' or 'development' mode? The easy way to tell is to go to a page in your app that gives an error. If it shows you a stack trace, it's in development mode, which is slower than production mode. Mongrel and mod_rails have startup options to determine whether to run the app in production or development mode.
Finally, if your database is slow for whatever reason, that will be a big bottleneck as well. If you do have a good deployment (Mongrel/mod_rails/etc.) in production mode, try looking into that.
Do you have a lot of data in your DB? I would double check that you have indexed all the appropriate columns- because this can make a huge difference. On your local dev system, you probably have a lot more memory than on your 500 mb slice, which would result in the DB running a lot slower if you have big, un indexed tables. You can also run the slow queries logger in MySql to pinpoint columns without indexes.
Other than that, yes- passenger will need to spool up a process for you if you have not been using the site recently. If this is the case, you should see a significant speed increase on second, and especially third and later page loads.
You might want to run a local virtual machine with 500 MB. Are you doing a lot of client-server interaction? Delays over the WAN are significant
You might want to check out RPM (there's a free "lite" version too) and/or New Relic's Tune Up.
Your CPU time is guaranteed by Slicehost using the Xen virtualization system, so it's not that. Don't have the other answers for you, sorry! Might try 'top' on a console while you're trying to access the page.
If you are using FireFox and doing localhost testing (or maybe even on LAN) you may want to try editing the network.dns.disableIPv6 setting.
Type about:config in the address bar and filter for network.dns.disableIPv6 and double-click to set to true.
This bug has been reported mainly from Vista OS's, but some others as well.
You could try running 'top' when you SSH in to see which process is heavy. If you also have problems logging you, perhaps you may try getting Statistics in the Slicehost manager.
If you discover it is MySQL's fault, consider decreasing the number of servers it can spawn.
512 seems decent for Rails application, you might have to check if you misconfigured too.