Rails response time and server health monitoring tools - ruby-on-rails

I have a Rails 3 application running in production. What tools are available to monitor response time and server health?
I am aware of New Relic, but response time monitoring is only available in the Gold version, which is $200/month.
For monitoring the health of the system I know of Server Density.
Any other tools?

I have been using ServerDensity for one year and I'm extremely satisfied. Recently a GitHub user released a Passenger plugin for ServerDensity.
If you want something more Ruby-oriented, have a look at Scout.

The Request Log Analyzer gem can be useful, is free, and works by analyzing Rails log files. Thus, there's no chance of it having any negative impact on your application's performance.

Related

Heroku configuration for Ruby on Rails application

I've set a client up with Heroku for their Ruby on Rails application and have had a great deal of trouble over the years with the application not running well, no matter how much money we spend on additional resources. I find their documentation highly confusing and have never been able to understand their specific terminology. We are constantly getting "H12" errors, "R14" errors, etc. The memory usage and dyno loads are constantly spiking. And yet this is a small-to-medium-sized business without a massive amount of traffic. I'm wondering if anybody out there who does understand the ins and outs of Heroku can look this configuration over and tell me if it makes sense:
DB_POOL: 10
MALLOC_ARENA_MAX: 2
RAILS_MAX_THREADS: 5
WEB_CONCURRENCY: 4
Ruby 2.7
Rails 6.0
Puma
8 2x web dynos
5 1x worker dynos
$50 Postgres standard 0 database
$15 Memcachier
$10 Rediscloud
...etc addons
Your WEB_CONCURRENCY is too high for your Standard-2x dynos. The recommended default is 2: https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server#recommended-default-puma-process-and-thread-configuration
This is likely contributing to your R14 errors as higher web concurrency means more memory usage. So you need to either lower your web concurrency (which may mean you also need to increase the # of dynos to compensate) or you need to use bigger dynos.
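For reference, a minimal sketch of a config/puma.rb along the lines of Heroku's recommended setup (the values and defaults here are illustrative, driven by the config vars above):
# config/puma.rb -- minimal sketch of the Heroku-recommended Puma setup;
# WEB_CONCURRENCY and RAILS_MAX_THREADS come from the dyno's config vars.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads threads_count, threads_count

preload_app!

port        ENV.fetch("PORT", 3000)
environment ENV.fetch("RACK_ENV", "production")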
You already have MALLOC_ARENA_MAX=2, but I'm not sure if you are using jemalloc. You might want to try that too.
Of course, you may also have other memory issues in your app - check out some tips here. I also recommend adding a monitoring tool like AppSignal as it's capable of tracking memory allocations per transaction.
For mitigating H12s:
Ensure you have installed something like the rack-timeout gem, which drops a long-running request at the dyno level and thus avoids the H12 error (you get a Rack::TimeoutError exception instead). Set the timeout to 15s so that it is well under the 30s H12 timeout (a minimal sketch follows this list).
Investigate your slow transactions. A monitoring tool is key here, e.g. New Relic (start with the lowest-priced paid plan - the free plan does not allow transaction tracing). Here is their blog post on how to trace transactions.
When you've identified the problem - fix it!
if the bottleneck is external:
check for external API limits and throttling
add timeouts and make app resilient to slow external responses
if the bottleneck is due to the database:
optimize slow queries
check cache hit rates
check for the # of waiting connections and db locks -> if the number of waiting connections is consistently above 0 for X minutes, that indicates you have some long locks that you'll need to investigate. Waiting connections are easiest to track over time with Librato (the free plan should do fine)
if the bottleneck is other app code:
add more custom instrumentation to get more insights, e.g. following New Relic's custom instrumentation instructions
address app code issues
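To illustrate the rack-timeout suggestion above, here is a minimal sketch (assuming a recent version of the gem, which wires itself into the Rails middleware stack and reads its timeout from an environment variable):
# Gemfile -- rack-timeout adds its middleware to Rails automatically
gem "rack-timeout"
# Then set the service timeout as a config var on the dynos, e.g.
# RACK_TIMEOUT_SERVICE_TIMEOUT=15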
I want to stress the importance of monitoring tools for diagnosing issues and determining optimal resource usage. Figuring out the correct concurrency configs and the correct size and # of dynos to run is virtually impossible without proper monitoring tools. Hopefully you already have some covered by the unlisted add-ons, but if you do not, I'll summarize my recommendations and mention a couple of other tips:
To get more metrics info, ensure you have enabled log-runtime-metrics
Also enable Ruby language metrics
Add a monitoring tool that can track Ruby memory allocations like AppSignal. Scout APM can do this too but I think their plans capable of this are more expensive (requires Scout Insights feature)
Add the lowest-paid version of New Relic. This is my go-to tool for transaction tracing. AppSignal can do this too if you don't want to pay for another tool, but I find it easier with New Relic.
Add Librato. It offers some great charts out of the box, including a set of Postgres charts in its own dashboard.
Set alerts in your monitoring apps to warn you about things like response times so you can look into them!
And of course, make all your changes in staging first AND load test them to see the impacts of your changes before attempting in production!
Update: I also just noticed that you said you are using Standard-0 Postgres, which means it has a 120 connection limit. So if you end up lowering your WEB_CONCURRENCY and increasing the # of dynos, watch out for your total connections to that database. Beyond just the fact that there is a limit, more connections also mean more overhead for your db anyway so if you are close to your connection limit, you are more likely to see db performance suffer. You may want to upgrade to another plan that has a higher connection limit or use pgbouncer as your connection pooler to avoid connection limits.
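As a rough illustration of the connection arithmetic (these numbers are hypothetical, assuming the ActiveRecord pool equals RAILS_MAX_THREADS and the worker dynos use a similar pool size):
# Rough connection math -- illustrative only, not a recommendation
web_dynos        = 8
web_concurrency  = 2   # Puma workers per dyno after lowering WEB_CONCURRENCY
threads_per_proc = 5   # RAILS_MAX_THREADS, typically also the AR pool size
worker_dynos     = 5
worker_pool      = 5   # assumed pool size per worker dyno

web_dynos * web_concurrency * threads_per_proc   # => 80 web connections
worker_dynos * worker_pool                       # => 25 worker connections
# => 105 total, already close to the Standard-0 limit of 120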

Unexplained high response time on Heroku environment

I'm using a paid Heroku ($35/mo 2x web dyno, $50/mo silver postgresdb) plan for a small internal app with typically 1-2 concurrent users. Database snapshot size is less than 1MB.
The app is Rails 4.1. In the last two weeks, there's been a significant performance drop in the production env, where Chrome dev tools reports response times like 8s.
Typical:
Total: 8.13 s
Stalled: 3.643 ms
DNS Lookup: 2.637 ms
Initial connection: 235.532 ms
SSL: 133.738 ms
Request sent: 0.546 ms
Waiting (TTFB): 3.43 s
Content Download: 4.47 s
I'm using a Nitrous dev environment and get sub-1s responses on the dev server with non-precompiled assets (and a mirrored db).
I'm a novice programmer and am not clear on how to debug this. Why would I see 800% slower performance than the dev environment on an $85/mo+ Heroku plan? Given my current programming skill level, my app is probably poorly optimized (a few N+1 queries in there...) but how bad can it be when production has 1-2 concurrent users??
Sample from logs if it helps:
sample#current_transaction=16478 sample#db_size=18974904bytes sample#tables=28 sample#active-connections=6 sample#waiting-connections=0 sample#index-cache-hit-rate=0.99929 sample#table-cache-hit-rate=0.99917 sample#load-avg-1m=0.365 sample#load-avg-5m=0.45 sample#load-avg-15m=0.445 sample#read-iops=37.587 sample#write-iops=36.7 sample#memory-total=15405616kB sample#memory-free=1409236kB sample#memory-cached=12980840kB sample#memory-postgres=497784kB
Sample from server logs:
Completed 200 OK in 78ms (Views: 40.9ms | ActiveRecord: 26.6ms)
I'm seeing similar numbers on the dev server, but the actual visual performance is night and day. The dev server responds as you would expect - sub-1s response and render. The production server has a 4-5s delay.
I don't think it's related to the ISP, as suggested, because I've been traveling and have seen identical performance problems from the USA and Europe.
Add the New Relic add-on to your app's ecosystem on Heroku and explore what's going on. You can do it from Heroku's dashboard. Choose the free plan and you will get a two-week trial of full functionality. After the trial period there will be some constraints, but it's still enough for measuring the performance of a small app.
You can also add the Logentries add-on to get access to the app's log history via a web interface.
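On the Ruby side, a minimal sketch of what the app itself needs (assuming the standard newrelic_rpm agent; the Heroku add-on provisions the license key as a config var):
# Gemfile -- the New Relic agent for a Rails app
gem "newrelic_rpm"
# The Heroku add-on sets NEW_RELIC_LICENSE_KEY, which the agent picks up
# from the environment, so little or no newrelic.yml tweaking is needed.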

Bulk publishing Issue in Fatwire

I'm currently performing a bulk publish using Mirror Queue for around 11,000 assets.
I have tried the same in lower environments and it was successful.
No errors are logged in the source and destination futuretense logs. The asset updates were logged at the destination for a long time, and then no further updates appeared. The publishing process still keeps running at the source.
There are no stuck threads, and memory consumption looks good in both environments.
Any input will be much appreciated.
What is the Fatwire version? Please note that mirror-to-server publishing is no longer supported in Oracle WebCenter Sites 11.1.1.8. Real-time publishing is more robust and less error-prone.

Monitor Multiple Rails Applications

Are there any tools that I can run on my server to monitor multiple rails applications?
I need to monitor the number of requests each application receives, how much memory each application is using, how much of the CPU is being used, and other similar stats. I need to see the stats for each individual Rails application.
I recommend you try NewRelic RPM.
The free version:
RPM Lite is the most widely used solution for basic web application monitoring. RPM Lite provides application monitoring for unlimited Java, Ruby or JRuby applications, for unlimited users, for an unlimited time. What a deal! With RPM Lite you can identify overall app health, app response time, throughput, Apdex SLA scoring, cluster breakdown, and Notes. You can also see where web transactions are spending the most time, isolate the worst offenders, and determine where to focus your remediation efforts.
Later edit:
An alternative to NewRelic RPM is ScoutApp, which has a lot of plugins covering all your required features.
If you need something that can run on your server, there is also the munin plugins gem that you may try. If you need a user-monitoring tool (kind of like Google Analytics), you can use the RailStat gem.
The Request Log Analyzer gem can be useful, is free, and works by analyzing Rails log files. Thus, there's no chance of it having any negative impact on your application's performance.

How can I find out why my app is slow?

I have a simple Rails app deployed on a 500 MB Slicehost VPS. I'm the only one who uses the app. When I run it on my laptop, it's fast enough. But the deployed version is insanely slow. It takes 6 to 10 seconds to load the login screen.
I would like to find out why it's so slow. Is it my code? (Don't think so because it's much faster locally, but maybe.) Is it Slicehost's server being overloaded? Is it the Internet?
Can someone suggest a technique or set of steps I can take to help narrow down the cause of this problem?
Update:
Sorry forgot to mention. I'm running it under CentOS 5 using Phusion Passenger (AKA mod_rails or mod_rack).
If it is just slow the first time you load it, that is probably because Passenger killed the process due to inactivity. I don't remember all the details, but I do recall reading about people who used cron jobs to keep at least one process alive to avoid the lag that can occur when Passenger needs to reload the environment.
Edit: more details here
Specifically, the pool idle time defaults to 2 minutes, which means that after two minutes of idling Passenger has to reload the environment to serve the next request.
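If that turns out to be the cause, one option (assuming an Apache + Passenger setup, as mentioned in the question) is to tell Passenger never to shut down idle processes:
# Apache/Passenger configuration -- a value of 0 disables idle shutdown
PassengerPoolIdleTime 0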
First, find out if there's a particularly slow response from the server. Use Firefox and the Firebug plugin to see how long each component (including JavaScript and graphics) takes to download. Assuming the main page itself is what is taking all the time, you can start profiling the application. You'll need to find a good profiler, and as I don't actually work in Ruby on Rails, I can't suggest any: google "profile ruby on rails" for some options.
As YenTheFirst points out, the server software and config you're using may contribute to a slowdown, but A) Slicehost doesn't choose that, you do, as Slicehost just provides very raw server "slices" that you can treat as dedicated machines, and B) you're unlikely to see a script that runs instantly suddenly take 6 seconds just because it's running as CGI. Something else must be going on. Check how much RAM you're using: have you gone into swap? Is the login slow only the first time it's hit, indicating some startup issue, or is it always that slow? Is static content served slowly? That would tend to mean some network issue (either on the Slicehost side or your local network) is slowing things down, assuming you're not in swap.
When you say "fast enough" you're being vague: does the laptop version take 1 second to the Slicehost 6? That wouldn't be entirely surprising, if the laptop is decent: after all, the reason slices are cheap is because they're a fraction of a full server. You're using probably 1/32 of an 8 core machine at Slicehost, as opposed to both cores of a modern laptop. The Slicehost cores are quick, but your laptop could be a screamer compared to 1/4 of core. :)
Try to pinpoint where the slowness lies:
1/ Is the application slow, or the infrastructure (network + web server)?
put a static file on your web server, and access it through your browser
2/ If that is fast, it is probably a problem with the application + server configuration:
is database access slow?
try a page with a simple loop (see the sketch after this list): is it slow?
3/ If the static file is slow, it is probably your infrastructure. You can check:
bad network connection: do a packet capture (with Wireshark, for example) and look for retransmissions, duplicate packets, etc.
is DNS resolution slow?
is the server misconfigured?
etc.
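For the "simple loop" test mentioned above, a throwaway action along these lines (a hypothetical controller, not part of the original app) exercises the app server without touching the database:
# app/controllers/diagnostics_controller.rb -- hypothetical test-only action
class DiagnosticsController < ApplicationController
  def loop_test
    total = 0
    1_000_000.times { |i| total += i }
    render :text => "done: #{total}"  # old-style render, fine for a Rails 2.x-era app
  end
end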
What is Slicehost using to serve it?
Fast options are things like Mongrel, or Apache's mod_rails (also called Phusion Passenger or something like that).
These are dedicated servers (or plugins to servers) which run an instance of your rails app.
If your host isn't using that, then it's probably defaulting to CGI. Rails comes with a simple CGI script that will serve the page, but it reloads the app for every page.
(edit: I suspect that this is the most likely case, that your app is running off of the CGI in /webapp_directory/public/dispatch.cgi, which would explain the slowness. This tends to be a default deployment on many hosts, since it doesn't require extra configuration on their part, but it doesn't give good performance)
If your host supports "Fast CGI", rails supports that too. Fast CGI will open a CGI session, and keep it open for multiple pages, so you get much better performance, but it's not nearly as good as Mongrel or mod_rails.
Secondly, is it in 'production' or 'development' mode? The easy way to tell is to go to a page in your app that gives an error. If it shows you a stack trace, it's in development mode, which is slower than production mode. Mongrel and mod_rails have startup options to determine whether to run the app in production or development mode.
Finally, if your database is slow for whatever reason, that will be a big bottleneck as well. If you do have a good deployment (Mongrel/mod_rails/etc.) in production mode, try looking into that.
Do you have a lot of data in your DB? I would double-check that you have indexed all the appropriate columns, because this can make a huge difference. On your local dev system, you probably have a lot more memory than on your 500 MB slice, which would make the DB run a lot slower if you have big, unindexed tables. You can also enable the slow query log in MySQL to pinpoint columns without indexes.
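For example, adding a missing index is a one-line change in a migration (the table and column names here are purely illustrative):
# Hypothetical migration -- replace the table/column with your own slow ones
class AddIndexToOrdersUserId < ActiveRecord::Migration
  def self.up
    add_index :orders, :user_id
  end

  def self.down
    remove_index :orders, :user_id
  end
end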
Other than that, yes- passenger will need to spool up a process for you if you have not been using the site recently. If this is the case, you should see a significant speed increase on second, and especially third and later page loads.
You might want to run a local virtual machine with 500 MB of RAM. Are you doing a lot of client-server interaction? Delays over the WAN are significant.
You might want to check out RPM (there's a free "lite" version too) and/or New Relic's Tune Up.
Your CPU time is guaranteed by Slicehost using the Xen virtualization system, so it's not that. Don't have the other answers for you, sorry! Might try 'top' on a console while you're trying to access the page.
If you are using Firefox and doing localhost testing (or maybe even testing on the LAN), you may want to try editing the network.dns.disableIPv6 setting.
Type about:config in the address bar, filter for network.dns.disableIPv6, and double-click to set it to true.
This bug has been reported mainly on Vista, but on some other OSes as well.
You could try running 'top' when you SSH in to see which process is heavy. If you also have problems logging in, you might try checking the statistics in the Slicehost manager.
If you discover it is MySQL's fault, consider decreasing the number of servers it can spawn.
512 MB seems decent for a Rails application, but you might want to check whether something is misconfigured too.
