How to set up alerts for memory usage in Heroku? - ruby-on-rails

That, basically. I have a Rails 5 application, but more generally I'm trying to find a way to get email alerts when my dynos reach a certain memory usage threshold. I'm using one web dyno and one worker dyno. I can't find anything.

Heroku has an 'Application metrics' page which can also alert you on some conditions. However, there is no option for alerting on memory usage. We asked Heroku support, and this is their answer:
Yes, unfortunately there's not a built-in way to receive alerts about
memory usage. However, you could potentially set up your Papertrail
instance to alert you if your app emits an R14 or an R15. Keep in mind
that those are cases where it might be too late to take any corrective
action on memory as it is already alerting a disruption to the app.
For more granularity, you'll need to enable the log-runtime-metrics
lab so that your memory usage is printed to your application logs:
https://devcenter.heroku.com/articles/log-runtime-metrics.
You can also use a log drain to parse, chart, and setup alerts for
those errors. Librato is the most common tool I see customers using
for that kind of workflow: https://elements.heroku.com/addons/librato.
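As a rough illustration of the log-based approach described in that reply, here is a minimal Ruby sketch that pulls the memory samples out of a log-runtime-metrics line and checks them against a quota. The method names, the 80% ratio, and the sample line are assumptions made for this example; only the `sample#memory_total=...MB` key/value style comes from the log-runtime-metrics output.

```ruby
# Extract the memory samples (in MB) from one log-runtime-metrics line.
# Values that are not reported in MB (for example page counts) are ignored.
def memory_samples(line)
  line.scan(/sample#(memory_\w+)=([\d.]+)MB/).to_h { |key, value| [key, value.to_f] }
end

# True when memory_total has reached the given share of the dyno quota.
# quota_mb and ratio are assumptions you would tune for your dyno type.
def over_threshold?(line, quota_mb:, ratio: 0.8)
  total = memory_samples(line)["memory_total"]
  !total.nil? && total >= quota_mb * ratio
end

# Hypothetical log line in the log-runtime-metrics style.
line = "source=web.1 dyno=heroku.1 sample#memory_total=450.50MB " \
       "sample#memory_rss=440.00MB sample#memory_swap=0.00MB " \
       "sample#memory_pgpgin=348836pages"

puts "web.1 is close to its memory quota" if over_threshold?(line, quota_mb: 512)
```

A check like this would run wherever the log drain delivers lines; how the alert is sent (email, chat, pager) is left open here.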
Relying on Papertrail is good enough for us for now, but we are still exploring other options. One option is to have the Rails app report its own memory usage, for example through a system call such as pmap #{Process.pid} | tail -1. Another option is to use Monit (https://mmonit.com/), but it is not very easy to set up and configure. (You will need a custom buildpack to run it on Heroku.) Heroku also supports some third-party add-ons for monitoring that can send alerts, like New Relic.
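A minimal sketch of the "app reports its own memory usage" idea, assuming a Linux dyno where /proc/<pid>/status is readable. It reads the VmRSS field instead of shelling out to pmap; the method names and the 400 MB threshold are hypothetical.

```ruby
# Parse the resident set size (in kB) out of /proc/<pid>/status text.
def parse_rss_kb(status_text)
  match = status_text.match(/^VmRSS:\s+(\d+)\s+kB/)
  match ? match[1].to_i : nil
end

# Resident memory of the given process in MB, or nil when /proc is unavailable.
def current_rss_mb(pid = Process.pid)
  path = "/proc/#{pid}/status"
  return nil unless File.readable?(path)

  kb = parse_rss_kb(File.read(path))
  kb && (kb / 1024.0).round(1)
end

MEMORY_WARN_MB = 400 # hypothetical threshold, tune for your dyno size

# Returns a warning message when usage is at or above the limit, else nil.
def memory_warning(rss_mb, limit_mb = MEMORY_WARN_MB)
  return nil if rss_mb.nil? || rss_mb < limit_mb

  "memory usage #{rss_mb}MB exceeds #{limit_mb}MB"
end

warning = memory_warning(current_rss_mb)
warn(warning) if warning
```

In a Rails app this could be called periodically (for example from a scheduled job) and the message handed to whatever mailer or notifier you already use.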

You can't do this in Heroku itself, but the best solution I've found so far is to use AppSignal.
They allow you to set up custom alerts for any metric they track, including host memory usage.
AppSignal is a pretty solid APM in general too, so you can ditch New Relic, Scout, or whatever other tool you might be using as well.

Related

Heroku configuration for Ruby on Rails application

I've set a client up with Heroku for their Ruby on Rails application. Over the years we have had a great deal of trouble with the application not running well, regardless of how much money we spend on additional resources, and I find Heroku's documentation and terminology highly confusing. We are constantly getting "H12" and "R14" errors, and memory usage and dyno load are constantly spiking, even though this is a small to medium-sized business without a massive amount of traffic. Can anybody who understands the ins and outs of Heroku look this configuration over and tell me whether it makes sense?
DB_POOL: 10
MALLOC_ARENA_MAX: 2
RAILS_MAX_THREADS: 5
WEB_CONCURRENCY: 4
Ruby 2.7
Rails 6.0
Puma
8 2x web dynos
5 1x worker dynos
$50 Postgres standard 0 database
$15 Memcachier
$10 Rediscloud
...etc addons
Your WEB_CONCURRENCY is too high for your Standard-2x dynos. The recommended default is 2: https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server#recommended-default-puma-process-and-thread-configuration
This is likely contributing to your R14 errors as higher web concurrency means more memory usage. So you need to either lower your web concurrency (which may mean you also need to increase the # of dynos to compensate) or you need to use bigger dynos.
You already have MALLOC_ARENA_MAX=2, but I'm not sure whether you are using jemalloc. You might want to try that too.
Of course, you may also have other memory issues in your app - check out some tips here. I also recommend adding a monitoring tool like AppSignal as it's capable of tracking memory allocations per transaction.
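To make the memory arithmetic concrete, here is a rough sketch. The dyno quotas (512 MB for Standard-1x, 1024 MB for Standard-2x) are Heroku's documented limits; the 300 MB per Puma worker figure is purely hypothetical and should be replaced with what your own metrics show. Because preloaded workers share some memory, treat the result as a rough upper bound rather than a measurement.

```ruby
# Memory quotas per dyno type, in MB.
DYNO_MEMORY_MB = { "standard-1x" => 512, "standard-2x" => 1024 }.freeze

# Naive estimate: every Puma worker process uses the same amount of memory.
def estimated_usage_mb(workers:, per_worker_mb:)
  workers * per_worker_mb
end

# True when the estimate stays within the dyno's quota.
def fits_in_dyno?(dyno_type, workers:, per_worker_mb:)
  estimated_usage_mb(workers: workers, per_worker_mb: per_worker_mb) <=
    DYNO_MEMORY_MB.fetch(dyno_type)
end

PER_WORKER_MB = 300 # hypothetical figure, measure your own app

[4, 2].each do |workers|
  usage = estimated_usage_mb(workers: workers, per_worker_mb: PER_WORKER_MB)
  fits = fits_in_dyno?("standard-2x", workers: workers, per_worker_mb: PER_WORKER_MB)
  puts "WEB_CONCURRENCY=#{workers}: ~#{usage}MB, fits in standard-2x: #{fits}"
end
```

Under that assumption, four workers need about 1200 MB and exceed the 1024 MB quota (which is where R14 errors come from), while two workers need about 600 MB and fit.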
For mitigating H12s:
Ensure you have installed something like the rack-timeout gem, which ensures that a long-running request is dropped at the dyno level and thus avoids the H12 error (you get a Rack::Timeout::RequestTimeoutException instead). Set the timeout to 15s so that it is well under the 30s H12 timeout.
Investigate your slow transactions. A monitoring tool is key here, e.g. New Relic (start with the lowest-priced paid plan, since the free plan does not allow transaction tracing). Here is their blog post on how to trace transactions
When you've identified the problem - fix it!
if the bottleneck is external:
check for external API limits and throttling
add timeouts and make app resilient to slow external responses
if the bottleneck is due to the database:
optimize slow queries
check cache hit rates
check for the # of waiting connections and db locks -> if the number of waiting connections is consistently above 0 for X minutes, that indicates you have some long locks that you'll need to investigate. Waiting connections is easiest to track over time with Librato (free plan should do fine)
if the bottleneck is other app code:
add more custom instrumentation to get more insights, e.g. New Relic instructions
address app code issues
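For the "add timeouts and make app resilient to slow external responses" step in the list above, a minimal sketch using Ruby's standard Net::HTTP. The helper names, the 2s/5s timeouts, and the fallback behaviour are assumptions for illustration; the point is only that the combined budget stays well under a 15s request timeout.

```ruby
require "net/http"
require "uri"

# Build an HTTP client with explicit connect and read timeouts (in seconds).
def build_client(url, open_timeout: 2, read_timeout: 5)
  uri = URI(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = open_timeout
  http.read_timeout = read_timeout
  http
end

# Fetch the body, returning the fallback instead of hanging or raising
# when the external service is slow or unreachable.
def fetch_with_fallback(url, fallback: nil)
  uri = URI(url)
  build_client(url).request(Net::HTTP::Get.new(uri)).body
rescue Net::OpenTimeout, Net::ReadTimeout, SocketError, SystemCallError
  fallback
end
```

A call such as fetch_with_fallback("https://api.example.com/rates", fallback: "{}") (a made-up URL) then returns within a few seconds either way, so a slow third party cannot push the request toward the H12 limit on its own.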
I want to stress the importance of monitoring tools for diagnosing issues and determining optimal resource usage. Figuring out the correct concurrency configuration and the correct size and # of dynos to run is virtually impossible without proper monitoring tools. Hopefully some of the add-ons you didn't list already cover this, but if not, I'll summarize my recommendations and mention a couple of other tips:
To get more metrics info, ensure you have enabled log-runtime-metrics
Also enable Ruby language metrics
Add a monitoring tool that can track Ruby memory allocations like AppSignal. Scout APM can do this too but I think their plans capable of this are more expensive (requires Scout Insights feature)
Add the lowest-priced paid plan of New Relic. This is my go-to tool for transaction tracing. AppSignal can do this too if you don't want to pay for another tool, but I find it easier with New Relic.
Add Librato. It offers some great charts out of the box, including a set of Postgres charts in its own dashboard.
Set alerts in your monitoring apps to warn you about things like response times so you can look into them!
And of course, make all your changes in staging first AND load test them to see the impacts of your changes before attempting in production!
Update: I also just noticed that you said you are using Standard-0 Postgres, which means it has a 120 connection limit. So if you end up lowering your WEB_CONCURRENCY and increasing the # of dynos, watch out for your total connections to that database. Beyond just the fact that there is a limit, more connections also mean more overhead for your db anyway so if you are close to your connection limit, you are more likely to see db performance suffer. You may want to upgrade to another plan that has a higher connection limit or use pgbouncer as your connection pooler to avoid connection limits.
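The connection arithmetic can be sketched as follows, using the numbers from the question. It assumes each Puma thread holds at most one database connection, so a process opens no more than the smaller of RAILS_MAX_THREADS and DB_POOL; worker dynos, consoles, and one-off tasks are not counted and would add to the total. The method name is made up for this example.

```ruby
STANDARD_0_CONNECTION_LIMIT = 120

# Upper bound on connections opened by the web dynos.
def max_web_connections(dynos:, web_concurrency:, max_threads:, db_pool:)
  per_process = [max_threads, db_pool].min
  dynos * web_concurrency * per_process
end

scenarios = {
  "current (8 dynos, WEB_CONCURRENCY=4)"  => { dynos: 8,  web_concurrency: 4 },
  "lowered (8 dynos, WEB_CONCURRENCY=2)"  => { dynos: 8,  web_concurrency: 2 },
  "scaled (16 dynos, WEB_CONCURRENCY=2)"  => { dynos: 16, web_concurrency: 2 }
}

scenarios.each do |label, config|
  total = max_web_connections(max_threads: 5, db_pool: 10, **config)
  status = total > STANDARD_0_CONNECTION_LIMIT ? "over" : "within"
  puts "#{label}: up to #{total} connections, #{status} the limit"
end
```

With these inputs the current setup can reach 160 connections, halving WEB_CONCURRENCY brings it to 80, and doubling the dyno count to compensate brings it back to 160, which is why a connection pooler or a larger plan becomes relevant.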

How does adding a 3rd party logging service affect how much I have to pay to Heroku

This may sound very newbie, but I've just added a centralized logging service (Splunk Storm, free version) to my Rails app on Heroku and it completely changed my life. I don't know why I never thought of this before.
I can just read all the logs from the web interface without running heroku logs --tail, which spawns a new dyno every time I do it.
Which makes me curious: does adding this type of logging service affect how much I have to pay to Heroku? I mean, it's sending out packets every time something happens.
Nope!
Bandwidth is included in the dyno pricing (including the one you get for free).
There is a soft limit at 2TB of bandwidth, but you're unlikely to come anywhere near that from logging.

Identifying poor performance in an Application

We are in the process of building a high-performance web application.
Unfortunately, there are times when performance unexpectedly degrades and we want to be able to monitor this so that we can proactively fix the problem when it occurs, as opposed to waiting for a user to report the problem.
So far, we are putting in place system monitors for metrics such as server memory usage, CPU usage and for gathering statistics on the database.
Whilst these show the overall health of the system, they don't help us when one particular user's session is slow. We have implemented tracing into our C# application which is particularly useful when identifying issues where data is the culprit, but for performance reasons tracing will be off by default and only enabled when trying to fix a problem.
So my question is are there any other best-practices that we should be considering (WMI for instance)? Is there anything else we should consider building into our web app that will benefit us without itself becoming a performance burden?
This depends a lot on your application, but I would always suggest adding your application metrics to your monitoring, for example the number of recent picture uploads or the number of concurrent users. I think you get the idea. Seeing the application-specific metrics in combination with your server metrics like memory or CPU sometimes gives valuable insights.
In addition to system health monitoring (using Nagios) of parameters such as load, disk space, etc., we have built in a REST service, called from Nagios, that provides statistics on:
transactions per second (which makes sense in our case)
the number of active sessions
the number of errors in the logs per minute
....
in short, anything that is specific to the application(s).
We also monitor the time it takes for a (dummy) round-trip transaction, as if a user or system were performing the business function.
All this data is sent back to Nagios, where we then configure alert levels and notifications.
We find that monitoring the number of Error entries in the logs gives some excellent short term warnings of a major crash/issue on the way for a lot of systems.
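A small sketch of the "errors in the logs per minute" statistic, written in Ruby for consistency with the rest of this page. The log format ("<ISO 8601 timestamp> <LEVEL> <message>"), the method names, and the threshold are all assumptions; a real service would read from the actual log source and expose the counts to the monitoring system.

```ruby
# Count ERROR entries per minute. Each line is assumed to look like
# "2024-05-01T10:00:05Z ERROR something went wrong".
def errors_per_minute(lines)
  lines.each_with_object(Hash.new(0)) do |line, counts|
    timestamp, level, _message = line.split(" ", 3)
    next unless level == "ERROR"

    # The first 16 characters of the timestamp identify the minute.
    counts[timestamp[0, 16]] += 1
  end
end

# Minutes whose error count reached the (hypothetical) alert threshold.
def noisy_minutes(counts, threshold:)
  counts.select { |_minute, count| count >= threshold }.keys
end

sample_lines = [
  "2024-05-01T10:00:05Z ERROR db timeout",
  "2024-05-01T10:00:40Z INFO request served",
  "2024-05-01T10:00:59Z ERROR db timeout",
  "2024-05-01T10:01:10Z ERROR upstream unavailable"
]

counts = errors_per_minute(sample_lines)
noisy_minutes(counts, threshold: 2).each do |minute|
  puts "error spike at #{minute}: #{counts[minute]} errors"
end
```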
Many of our customers use Systems and Application Monitor, which handles the health monitoring, along with Synthetic End User Monitor, which runs continuous synthetic transactions to show you the performance of a web application from the end-user's perspective. It works for apps outside and behind the firewall. Users often tell us that SEUM will reveal availability problems from certain locations, or at certain times of day. You can download a free trial at SolarWinds.com.

Best way to run rails with long delays

I'm writing a Rails web service that interacts with various pieces of hardware scattered throughout the country.
When a call is made to the web service, the Rails app then attempts to contact the appropriate piece of hardware, get the needed information, and reply to the web client. The time between the client's call and the reply may be up to 10 seconds, depending upon lots of factors.
I do not want to split the web service call in two (ask for information, answer immediately with a pending reply, then force another api call to get the actual results).
I basically see two options: either run JRuby and use multithreading, or run several regular Ruby instances and hope that not many people try to use the service at the same time. JRuby seems like the much better solution, but it still doesn't seem to be mainstream or to have out-of-the-box support at Heroku and EngineYard. The multiple-instance solution seems like a total kludge.
1) Am I right about my two options? Is there a better one I'm missing?
2) Is there an easy deployment option for JRuby?
I do not want to split the web service call in two (ask for information, answer immediately with a pending reply, then force another api call to get the actual results).
From an engineering perspective, this seems like it would be the best alternative.
Why don't you want to do it?
There's a third option: If you host your Rails app with Passenger and enable global queueing, you can do this transparently. I have some actions that take several minutes, with no issues (caveat: some browsers may time out, but that may not be a concern for you).
If you're worried about browser timeout, or you cannot control the deployment environment, you may want to process it in the background:
User requests data
You enter request into a queue
Your web service returns a "ticket" identifier to check the progress
A background process processes the jobs in the queue
The user polls back, referencing the "ticket" id
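The ticket workflow above can be sketched with Ruby's standard library alone. This is an in-process illustration, not a production setup: the TicketQueue class and its method names are made up for this example, results live only in memory, and a real deployment would normally use a persistent job queue with separate worker processes.

```ruby
require "securerandom"

# Minimal in-memory ticket queue: enqueue returns a ticket id immediately,
# a background thread does the slow work, and status is polled by ticket.
class TicketQueue
  def initialize(&work)
    @work = work
    @jobs = Queue.new
    @results = {}
    @lock = Mutex.new
    @worker = Thread.new { process_jobs }
  end

  # Register the request and hand back a ticket without waiting for the work.
  def enqueue(payload)
    ticket = SecureRandom.uuid
    @lock.synchronize { @results[ticket] = { status: :pending } }
    @jobs << [ticket, payload]
    ticket
  end

  # What the polling endpoint would return for a given ticket.
  def status(ticket)
    @lock.synchronize { @results.fetch(ticket, { status: :unknown }).dup }
  end

  # Stop accepting jobs and wait for the queued ones to finish.
  def shutdown
    @jobs.close
    @worker.join
  end

  private

  def process_jobs
    # pop returns nil once the queue is closed and empty, ending the loop.
    while (job = @jobs.pop)
      ticket, payload = job
      outcome =
        begin
          { status: :done, result: @work.call(payload) }
        rescue StandardError => e
          { status: :failed, error: e.message }
        end
      @lock.synchronize { @results[ticket] = outcome }
    end
  end
end

queue = TicketQueue.new { |device_id| "reading from #{device_id}" }
ticket = queue.enqueue("sensor-7")
queue.shutdown
puts queue.status(ticket).inspect
```

The web action that receives the request would call enqueue and return the ticket; a second action would call status with that ticket each time the client polls.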
As far as hosting in JRuby, I've deployed a couple of small internal applications using the glassfish gem, but I'm not sure how much I would trust it for customer-facing apps. Just make sure you run config.threadsafe! in production.rb. I've heard good things about Trinidad, too.
You can also run the web service call in a delayed background job so that it's not tying up a web server process and can even be run on a separate physical box. This is also a much more scalable approach. If you make the web call using AJAX, you can ping the server every second or two to see if your results are ready. That way your client is not held in limbo while the results are being calculated, and the request does not time out.

Quality Control / Log Monitoring

One of the articles I really enjoyed reading recently was Quality Control by Last.FM. In the spirit of this article, I was wondering if anyone else had favorite monitoring setups for web type applications. Or maybe if you don't believe in Log Monitoring, why?
I'm looking for a mix of opinion slash experience here I guess.
We get a bunch of email/pager alerts from an older host/app/network monitoring environment that get gradually more abusive depending on severity of the problem/time taken to respond. Fortunately we all have thick skins and very broad senses of humour. :)
We use log4net, and normally write both to log files and the database. However, when we've been tracking down a particularly difficult problem, we've enabled the email appender, so that critical log messages went straight to a developer's email account. This allowed us to figure out what was happening more immediately.
In addition, our infrastructure team has several tools they use to monitor system uptime, event logs, etc., to give them early warning when something is about to go down. We've also helped them implement custom monitoring scripts that test specific functionality of our code.
