Restart Heroku dynos when their RAM limit is exceeded - ruby-on-rails

I have a memory leak problem with my server (which is written in Ruby on Rails).
I want to implement a temporary solution that restarts the dynos automatically when their memory usage exceeds a threshold. What is the best way to do this, and is it risky?

There is a great solution for this if you're using Puma as your server:
https://github.com/schneems/puma_worker_killer
It can restart your server when RAM usage exceeds some threshold, for example:
PumaWorkerKiller.config do |config|
  config.ram = 1024 # mb
  config.frequency = 5 # seconds
  config.percent_usage = 0.98
  config.rolling_restart_frequency = 12 * 3600 # 12 hours in seconds
end
PumaWorkerKiller.start
Also, to prevent data corruption and other funny issues in your DB, I would suggest making sure your writes are covered by atomic transactions.
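For example, a minimal sketch (with hypothetical model names) of wrapping related writes in a single transaction, so a worker being killed mid-request cannot leave half-written records behind:
# Hypothetical example: either both records are written or neither is,
# so a restart in the middle cannot leave partial data behind.
ActiveRecord::Base.transaction do
  order = Order.create!(user_id: user.id, total: total)
  Payment.create!(order_id: order.id, amount: total)
end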

Related

Redis consumes max memory on large DEL & HMSET function

Issue: after an expected clear and rebuild of specific Redis keys, the worker dynos don't release memory (until the dyno is restarted).
I am experiencing an issue where my Heroku worker dynos hit 95%-100% of max memory usage during a delete and rebuild of about 4000 keys. I have a scheduled rebuild that starts every day at 4:00am. Based on the logs, I assume the DEL of the keys plus the rebuild of the keys takes about ~1490 seconds.
Jun 29 04:01:41 app app/worker.2: 4 TID-...io8w RedisWorker JID-...cd2a7 INFO: start
Jun 29 04:06:28 app app/worker.1: 4 TID-...mtks RedisWorker JID-...bf170 INFO: start
Jun 29 04:26:32 app app/worker.1: 4 TID-...mtks RedisWorker JID-...bf170 INFO: done: 1203.71 sec
Jun 29 04:26:33 app app/worker.2: 4 TID-...io8w RedisWorker JID-...cd2a7 INFO: done: 1490.938 sec
Memory usage hovers at the max until the dyno restarts (which is scheduled) or we deploy. Example image: Heroku Memory Usage
This is, at a high level, what gets triggered at 4am:
def full_clear
  RedisWorker.delete_keys("key1*")
  RedisWorker.delete_keys("key2*")
  RedisWorker.delete_keys("key3*")
  self.build
  return true
end

def build
  # ... rebuilds keys based on models ...
  return true
end

def self.delete_keys(regex)
  $redis.scan_each(match: regex) do |key|
    $redis.del(key)
  end
end
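For what it's worth, the deletions could also be batched so that each DEL round trip removes many keys at once; a rough sketch (the method name and batch size are just illustrative), assuming the same $redis connection:
# Hypothetical batched variant: collect the matching keys from SCAN,
# then delete them in slices instead of issuing one DEL per key.
def self.delete_keys_in_batches(pattern, batch_size = 500)
  keys = []
  $redis.scan_each(match: pattern) { |key| keys << key }
  keys.each_slice(batch_size) { |slice| $redis.del(*slice) }
end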
What I have researched so far, and my thoughts:
After Redis DEL is invoked, the memory doesn't seem to be released?
Could there be a better implementation of finding all keys that match and doing a batch delete (see the sketch after delete_keys above)?
I am using the defaults for Puma; would configuring Puma + Sidekiq to better match our resources be the best first step? (See Deploying Rails Applications with the Puma Web Server.) After a restart, memory is only at about 30%-40% until the next full rebuild (even during heavy use of HMSET).
I noticed that my ObjectSpace counts are considerably lower after the dyno gets restarted, and stay that way for the rest of the day until the next scheduled full_rebuild.
Any thoughts on how I can figure out what's causing the dynos to hold on to memory? It seems isolated to Sidekiq / the worker dynos being used to rebuild Redis.
Solution:
I installed New Relic to see if there was potential memory bloat. One of our most-used function calls was doing an N+1 query. Fixing the N+1 query dropped our 60k calls in New Relic to ~5k.
GC also wasn't collecting because it wasn't hitting our threshold. There may be room for GC optimization later on, but for now our immediate issue has been resolved.
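For reference, the usual shape of an N+1 fix (a sketch with hypothetical model names, not the actual code from this app) is to eager-load the association instead of querying it once per record:
# N+1: one query for the parents, then one extra query per parent.
posts = Post.limit(100)
posts.each { |post| puts post.comments.size }

# Eager-loaded: the comments are fetched in a single additional query.
posts = Post.includes(:comments).limit(100)
posts.each { |post| puts post.comments.size }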
I also reached out to Heroku for their thoughts and this is what was discussed:
Memory usage on the dyno will be managed by the Ruby VM, and it is likely that you're keeping too much information in memory during the key rebuild. You should look into freeing the memory used to generate the key-values after the data has been added to Redis.
Spending time fixing your N+1 queries will definitely help!

Rails 4 Puma concurrency with Capistrano

I am trying to deploy a concurrent Rails 4 Puma app with Capistrano and was confused by the example in the capistrano-puma gem.
From the snippet on GitHub:
set :puma_threads, [0, 16]
set :puma_workers, 0
What is the difference between threads and workers in Puma?
What does 0 Puma workers mean, and what does [0, 16] threads mean?
Which parameters do I use to achieve concurrency? My aim is to implement simple SSE to send notifications. What are the best parameters to use in Puma?
I am sorry if these are simple questions, but I am having a hard time finding resources online, even on the official site. If someone can point me to an article which answers my questions, I am happy to accept it. Thanks.
Though I couldn't find it clearly documented, set :puma_workers, 0 means Puma runs in single (non-clustered) mode, i.e. no separate worker processes are forked.
The worker count is the number of processes, or instances, of your application that are running.
Each instance can run multiple threads. So if you have 2 workers running with a max of 16 threads each, your server can serve 2 * 16 = 32 requests at a time, and if the average response time of a request is 100 ms, the number of requests it could serve per second = (1000/100) * 32 = 320 rps, approximately.
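As an illustration, a minimal config/puma.rb sketch corresponding to those Capistrano settings (the 2 workers / 0-to-16 threads values are just an example, not a recommendation):
# Sketch of a clustered Puma config: several worker processes, each running
# a pool of threads; total concurrency is roughly workers * max threads.
workers Integer(ENV["WEB_CONCURRENCY"] || 2)
threads_count = Integer(ENV["RAILS_MAX_THREADS"] || 16)
threads 0, threads_count

preload_app!

on_worker_boot do
  # Each forked worker needs its own ActiveRecord connection.
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end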

sidekiq-pro batches don't appear to release redis memory after batches complete

We are using Sidekiq Pro 1.7.3 and Sidekiq 3.1.4, Ruby 2.0, and Rails 4.0.5 on Heroku, with the RedisGreen add-on with 1.75 GB of memory.
We run a lot of Sidekiq batch jobs, probably around 2 million jobs a day. What we've noticed is that Redis memory steadily increases over the course of a week. I would have expected that when the queues are empty and no workers are busy, Redis would have low memory usage, but it appears to stay high. I'm forced to do a flushdb pretty much every week or so because we approach our Redis memory limit.
I've had a series of correspondence with RedisGreen and they suggested I reach out to the Sidekiq community. Here are some stats from RedisGreen:
Here's a quick summary of RAM use across your database:
The vast majority of keys in your database are simple values taking up 2 bytes each.
200MB is being consumed by "queue:low", the contents of your low-priority sidekiq queue.
The next largest key is "dead", which occupies about 14MB.
And:
We just ran an analysis of your database - here is a summary of what we found in 23129 keys:
18448 strings with 1048468 bytes (79.76% of keys, avg size 56.83)
6 lists with 41642 items (00.03% of keys, avg size 6940.33)
4660 sets with 3325721 members (20.15% of keys, avg size 713.67)
8 hashs with 58 fields (00.03% of keys, avg size 7.25)
7 zsets with 1459 members (00.03% of keys, avg size 208.43)
It appears that you have quite a lot of memory occupied by sets. For example - each of these sets have more than 10,000 members and occupies nearly 300KB:
b-3819647d4385b54b-jids
b-3b68a011a2bc55bf-jids
b-5eaa0cd3a4e13d99-jids
b-78604305f73e44ba-jids
b-e823c15161b02bde-jids
These look like Sidekiq Pro "batches". It seems like some of your batches are getting filled up with very large numbers of jobs, which is causing the additional memory usage that we've been seeing.
Let me know if that sounds like it might be the issue.
Don't be afraid to open a Sidekiq issue or email prosupport@sidekiq.org directly.
Sidekiq Pro Batches have a default expiration of 3 days. If you set the Batch's expires_in setting longer, the data will sit in Redis longer. Unlike jobs, batches do not disappear from Redis once they are complete. They need to expire over time. This means you need enough memory in Redis to hold N days of Batches, usually not a problem for most people, but if you have a busy Sidekiq installation and are creating lots of batches, you might notice elevated memory usage.
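As an illustration only (a sketch; check the Sidekiq Pro batch documentation for the exact API), keeping expires_in short limits how long completed batch data sits in Redis:
# Sketch, assuming the expires_in setting mentioned above; RebuildWorker
# and the 1-day value are hypothetical.
batch = Sidekiq::Batch.new
batch.description = "nightly redis rebuild"
batch.expires_in = 24 * 3600 # seconds; shorter retention means less lingering batch data
batch.jobs do
  4_000.times { |i| RebuildWorker.perform_async(i) }
end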

How to tune a Ruby on Rails application running on Heroku which uses production level Heroku Postgres?

The company I work for decided to move their entire stack to Heroku. The main motivation was its ease of use: no sysadmin, no crying. But I still have some questions about it...
I'm running some load and stress tests on both the application platform and the Postgres service, using the Blitz add-on from Heroku. I hit the site with between 1 and 250 users. I got some very interesting results and I need help evaluating them.
The Test Stack:
Application specifications
There's nothing particularly special about it.
Rails 4.0.4
Unicorn
database.yml set up to connect to Heroku postgres.
Not using cache.
Database
It's a Standard Tengu (Heroku's naming conventions will kill me one day :) properly connected to the application.
Heroku configs
I applied everything in unicorn.rb as described in the "Deploying Rails Applications With Unicorn" article (sketched after the config values below). I have 2 regular web dynos.
WEB_CONCURRENCY : 2
DB_POOL : 5
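For context, a unicorn.rb along the lines of that article looks roughly like this (a sketch, not my exact file); WEB_CONCURRENCY drives the number of worker processes, and each forked worker re-establishes its own ActiveRecord connection:
# config/unicorn.rb (sketch)
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
timeout 15
preload_app true

before_fork do |server, worker|
  defined?(ActiveRecord::Base) and ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  defined?(ActiveRecord::Base) and ActiveRecord::Base.establish_connection
end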
Data
episodes table: ~100,000 rows
episode_urls table: ~300,000 rows
episode_images table: ~75,000 rows
Code
episodes_controller.rb
def index
  @episodes = Episode.joins(:program).where(programs: { channel_id: 1 }).limit(100).includes(:episode_image, :episode_urls)
end
episodes/index.html.erb
<% @episodes.each do |t| %>
  <% if !t.episode_image.blank? %>
    <li><%= image_tag(t.episode_image.image(:thumb)) %></li>
  <% end %>
  <li><%= t.episode_urls.first.mas_path if !t.episode_urls.first.blank? %></li>
  <li><%= t.title %></li>
<% end %>
Scenario #1:
Web dynos : 2
Duration : 30 seconds
Timeout : 8000 ms
Start users : 10
End users : 10
Result:
HITS 100.00% (484)
ERRORS 0.00% (0)
TIMEOUTS 0.00% (0)
This rush generated 218 successful hits in 30.00 seconds and we
transferred 6.04 MB of data in and out of your app. The average hit
rate of 7.27/second translates to about 627,840 hits/day.
Scenario #2:
Web dynos : 2
Duration : 30 seconds
Timeout : 8000 ms
Start users : 20
End users : 20
Result:
HITS 100.00% (484)
ERRORS 0.00% (0)
TIMEOUTS 0.00% (0)
This rush generated 365 successful hits in 30.00 seconds and we
transferred 10.12 MB of data in and out of your app. The average hit
rate of 12.17/second translates to about 1,051,200 hits/day. The
average response time was 622 ms.
Scenario #3:
Web dynos : 2
Duration : 30 seconds
Timeout : 8000 ms
Start users : 50
End users : 50
Result:
HITS 100.00% (484)
ERRORS 0.00% (0)
TIMEOUTS 0.00% (0)
This rush generated 371 successful hits in 30.00 seconds and we
transferred 10.29 MB of data in and out of your app. The average hit
rate of 12.37/second translates to about 1,068,480 hits/day. The
average response time was 2,631 ms.
Scenario #4:
Web dynos : 4
Duration : 30 seconds
Timeout : 8000 ms
Start users : 50
End users : 50
Result:
HITS 100.00% (484)
ERRORS 0.00% (0)
TIMEOUTS 0.00% (0)
This rush generated 484 successful hits in 30.00 seconds and we
transferred 13.43 MB of data in and out of your app. The average hit
rate of 16.13/second translates to about 1,393,920 hits/day. The
average response time was 1,856 ms.
Scenario #5:
Web dynos : 4
Duration : 30 seconds
Timeout : 8000 ms
Start users : 150
End users : 150
Result:
HITS 71.22% (386)
ERRORS 0.00% (0)
TIMEOUTS 28.78% (156)
This rush generated 386 successful hits in 30.00 seconds and we
transferred 10.76 MB of data in and out of your app. The average hit
rate of 12.87/second translates to about 1,111,680 hits/day. The
average response time was 5,446 ms.
Scenario #6:
Web dynos : 10
Duration : 30 seconds
Timeout : 8000 ms
Start users : 150
End users : 150
Result:
HITS 73.79% (428)
ERRORS 0.17% (1)
TIMEOUTS 26.03% (151)
This rush generated 428 successful hits in 30.00 seconds and we
transferred 11.92 MB of data in and out of your app. The average hit
rate of 14.27/second translates to about 1,232,640 hits/day. The
average response time was 4,793 ms. You've got bigger problems,
though: 26.21% of the users during this rush experienced timeouts or
errors!
General Summary:
The "Hit Rate" never goes beyond the number of 15 even though 150 users sends request to the application.
Increasing number of web dynos does not help handling requests.
Questions:
When I use caching with Memcached (the MemCachier add-on from Heroku), even 2 web dynos can handle >180 hits per second. I'm just trying to understand what the dynos and the Postgres service can do without caching, so that I can work out how to tune them. How do I do that?
Standard Tengu is said to support 200 concurrent connections. So why does it never reach that number?
If having a production-level DB and increasing the number of web dynos won't help scale my app, what's the point of using Heroku?
Probably the most important question: What am I doing wrong? :)
Thank you for even reading this crazy question!
I eventually figured out the issue.
Firstly, remember my code in the view:
<% @episodes.each do |t| %>
  <% if !t.episode_image.blank? %>
    <li><%= image_tag(t.episode_image.image(:thumb)) %></li>
  <% end %>
  <li><%= t.episode_urls.first.mas_path if !t.episode_urls.first.blank? %></li>
  <li><%= t.title %></li>
<% end %>
Here I'm fetching each episode's episode_image inside my iteration. Even though I was using includes in my controller, there was a big mistake in my table schema: I did not have an index on episode_id in my episode_images table! This was causing extremely high query times. I found it using New Relic's database reports. All other query times were 0.5 ms or 2-3 ms, but episode.episode_image was taking almost 6,500 ms!
I don't know much about the relationship between query time and application execution, but once I added the index to my episode_images table I could clearly see the difference. If your database schema is designed properly, you probably won't face any problems scaling via Heroku; but no number of dynos can help you with a badly designed database.
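For reference, adding that index is a one-line migration (a sketch, assuming the table and column names above):
# Sketch of the missing index on episode_images.episode_id (Rails 4 style migration).
class AddEpisodeIdIndexToEpisodeImages < ActiveRecord::Migration
  def change
    add_index :episode_images, :episode_id
  end
end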
For people who might run into the same problem, I would like to share some of my findings about the relationship between Heroku web dynos, Unicorn workers and PostgreSQL active connections:
Basically, Heroku provides you a dyno, which is a kind of small virtual machine with 1 core and 512 MB of RAM. Your Unicorn server runs inside that little virtual machine. Unicorn has a master process and worker processes. Each of your Unicorn workers holds its own permanent connection to your existing PostgreSQL server (don't forget to check out this). It basically means that when you have a Heroku dyno up with 3 Unicorn workers running on it, you have at least 4 active connections. If you have 2 web dynos, you have at least 8 active connections.
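To make the arithmetic explicit, a quick sketch with those assumed numbers (2 dynos, 3 workers each, one connection per worker plus one for the master):
# Back-of-envelope count of active Postgres connections.
web_dynos            = 2
unicorn_workers      = 3                   # per dyno
connections_per_dyno = unicorn_workers + 1 # workers plus the Unicorn master
puts web_dynos * connections_per_dyno      # => 8, well under a 200-connection limit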
Let's say you have a Standard Tengu Postgres plan with a 200 concurrent connection limit. If you have problematic queries due to bad DB design, neither the DB nor more dynos can save you without caching... If you have long-running queries, you have no choice other than caching, I think.
All of the above are my own findings; if there is anything wrong with them, please let me know in the comments.

php-fpm "pool seems busy error". Why am I getting this?

I have a server with 64gb RAM using apache + fastcgi to connect to php-fpm.
I am running some load tests with ApacheBench: 500k requests at 200 requests/sec (the goal is 10k/sec per server). I keep getting the "pool seems busy" error and am at a loss as to how to configure FPM properly to handle even 200 requests/sec. It feels like I'm missing something obvious.
fpm-config:
pm = dynamic
pm.max_children = 8192
pm.start_servers = 2048
pm.min_spare_servers = 2048
pm.max_spare_servers = 2048
pm.max_requests = 8000
apache config:
<IfModule worker.c>
StartServers 2048
ServerLimit 8175
MaxClients 8175
MinSpareThreads 2048
MaxSpareThreads 2048
ThreadsPerChild 25
MaxRequestsPerChild 8000
</IfModule>
What am I doing wrong?
My initial gut reaction is that a max of 8000 children seems to be quite a large number of processes to have running, unless you have a lot of wait time per request. After a while the large number of processes will actually cause a degradation in performance, since context switches end up swapping the running processes in and out of CPU time, which takes time. Unless you have a lot of external service calls with processes waiting a lot, this seems a little excessive. Additionally, assuming 20 MB allocated over the course of a request, you are using 60+% of your free RAM just to serve start_servers.
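To make that back-of-envelope estimate concrete (a quick sketch in Ruby; the ~20 MB per child figure is the assumption stated above):
# Rough memory footprint of the configured pool at ~20 MB per child.
mb_per_child  = 20
start_servers = 2048
max_children  = 8192

puts start_servers * mb_per_child / 1024.0 # => 40.0 GB just for start_servers (~63% of 64 GB)
puts max_children  * mb_per_child / 1024.0 # => 160.0 GB if the pool ever reached max_children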
As for the "pool seems busy" error, I don't know offhand. It's tough (for me) to say without getting deeper into the environment. What is your free CPU time like and your memory utilization when you are running AB?
I also wonder if there is a system limit on the number of connections an individual process (like the FPM) can have open... Check ulimit -a
