Rails, Puma, Sidekiq: how to calculate total DB connections?

I am getting ActiveRecord::ConnectionTimeoutError once or twice a day. Could someone help me calculate how many connections my application is making to my DB, and suggest how to optimise my connections?
Here is my configuration:
AWS
Database : MySQL
Version : 5.7.23
Provider : AWS RDS (db.m5.large, vCPU: 2, RAM: 8GB)
3 web servers, each with the configuration below
# database.yml
pool: 20
# puma.rb
RAILS_MAX_THREADS : 5
WEB_CONCURRENCY : 2
1 Sidekiq server with the configuration below
# sidekiq
concurrency: 25
I also checked the maximum number of connections my DB is able to handle:
# MySQL Max connections ("show global variables like 'max_connections';")
624

The total number of connections to a database equals the number of connections per server times the number of servers.
Total DB Connections = Connections per server * server count.
Connections per server = AR Database Pool Size * Processes per server (usually set with WEB_CONCURRENCY or SIDEKIQ_COUNT)
So for the web servers you have:
AR Database Pool Size = 20
Processes per server = 2
Server Count = 3
Total DB Connections(Web Server) = 20 * 2 * 3 = 120
Then for the Sidekiq server:
AR Database Pool Size = 20
Processes per server = 1
Server Count = 1
Total DB Connections(Sidekiq Server) = 20 * 1 * 1 = 20
So the total expected DB connections should be 140, which is way below the limit of the RDS instance.
My guess is that you are getting the ActiveRecord::ConnectionTimeoutError because your Sidekiq concurrency setting is higher than the AR connection pool size. Every Sidekiq thread needs an ActiveRecord database connection, so setting the AR pool size to a number smaller than Sidekiq's concurrency means some Sidekiq threads will block waiting for a free database connection. In your case, at some point you may have 25 threads trying to access the database through a pool that can hand out at most 20 connections; if a thread can't get a free connection within 5 seconds (the default checkout timeout), you get a connection timeout error.
In Sidekiq the total DB connections should be
minimum(Threads That Need a Database Connection, AR Database Pool Size) * Processes per Server (WEB_CONCURRENCY or SIDEKIQ_COUNT) * Server Count.
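Plugging your numbers into these formulas looks like this (a quick sketch; the variable-style layout is just for illustration):
# Rough sanity check against MySQL's max_connections (624).
# Numbers are the ones from the question.

# Upper bound from the pool sizes: every pooled connection could be opened.
web_max     = 20 * 2 * 3   # pool * processes per server * servers => 120
sidekiq_max = 20 * 1 * 1   # => 20
web_max + sidekiq_max      # => 140, well under 624

# Connections Sidekiq can actually use: capped by the smaller of the
# number of threads that need a connection and the AR pool size.
[25, 20].min * 1 * 1       # => 20, so 25 threads compete for 20 connections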
Additionally the Sidekiq documentation states that
Starting in Rails 5, RAILS_MAX_THREADS can be used to configure Rails and Sidekiq concurrency. Note that ActiveRecord has a connection pool which needs to be properly configured in config/database.yml to work well with heavy concurrency. Set pool equal to the number of threads pool: <%= ENV['RAILS_MAX_THREADS'] || 10 %>
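In practice that means sizing the pool to the thread count of each process. A sketch of database.yml along those lines (assuming you export RAILS_MAX_THREADS=25 on the Sidekiq server and 5 on the web servers, or alternatively drop Sidekiq's concurrency to 20):
# config/database.yml (sketch)
production:
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>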
Most of this answer is based on the Sidekiq in Practice email series from Nate Berkopec

Related

What is the best way to performance test an SQS consumer to find the max TPS that one host can handle?

I have an SQS consumer running in EventConsumerService that needs to handle up to 3K TPS successfully, sometimes upwards of 20K TPS (or 1.2 million messages per minute). For each message processed, I make a REST call to DataService's TCP VIP. I'm trying to perform a load test to find the max TPS that one host can handle in EventConsumerService without overstraining:
Request volume on dependencies, DynamoDB storage, etc
CPU utilization in both EventConsumerService and DataService
Network connections per host
IO stats due to overlogging
DLQ size must be minimal; currently I am seeing my DLQ grow to 500K messages due to 500 Service Unavailable exceptions thrown from DataService, so something must be wrong.
Approximate age of oldest message. I do not want a message sitting in the queue for over X minutes.
Fatals and latency of the REST call to DataService
Active threads
This is how I am performing the performance test:
I set up both my consumer and the other service on one host, the reason being I want to understand the load on both services per host.
I use a TPS generator to fill the SQS queue with a million messages
The EventConsumerService service is already running in production. Once messages started filling the SQS queue, I could immediately see requests being sent to DataService.
Here are the parameters I am tuning to find messagesPolledPerSecond:
messagesPolledPerSecond = (numberOfHosts * numberOfPollers * messageFetchSize) * (1000/(sleepTimeBetweenPollsPerMs+receiveMessageTimePerMs))
messagesInSurge / messagesPolledPerSecond = ageOfOldestMessageSLA
ageOfOldestMessage + settingsUpdatedLatency < latencySLA
The variables for SqsConsumer which I kept constant are:
numberOfHosts = 1
ReceiveMessageTimePerMs = 60 ms? It's out of my control
Max thread pool size: 300
Other factors are all game:
Number of pollers (default 1), I set to 150
Sleep time between polls (default 100 ms), I set to 0 ms
Sleep time when no messages (default 1000 ms), ???
message fetch size (default 1), I set to 10
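Plugging the constants and the values above into the formula gives a rough idea of the poll rate these settings target (a back-of-the-envelope sketch; variable names mirror the formula):
# Back-of-the-envelope poll rate for the settings above.
number_of_hosts             = 1
number_of_pollers           = 150
message_fetch_size          = 10
sleep_time_between_polls_ms = 0
receive_message_time_ms     = 60   # roughly fixed, out of our control

messages_polled_per_second =
  (number_of_hosts * number_of_pollers * message_fetch_size) *
  (1000.0 / (sleep_time_between_polls_ms + receive_message_time_ms))
# => 25000.0 messages polled per second, far above the 3K TPS target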
However, with the above parameters, I am seeing a large number of messages being sent to the DLQ due to server errors, so clearly I have set the values too high. This testing methodology seems highly inefficient, and I am unable to find the optimal TPS that does not cause such a tremendous number of messages to be sent to the DLQ and does not cause such a high approximate age of the oldest message.
Any guidance on how best to test this is appreciated. It'd be very helpful if we could set up a time to chat; PM me directly.

Restart Heroku dynos when their RAM is exceeded

I have a memory leak problem with my server (which is written in Ruby on Rails).
I want to implement a temporary solution that restarts the dynos automatically when their memory usage is exceeded. What is the best way to do this? And is it risky?
There is a great solution for this if you're using Puma as your server:
https://github.com/schneems/puma_worker_killer
You can restart your workers when their RAM usage exceeds some threshold, for example:
PumaWorkerKiller.config do |config|
  config.ram                       = 1024        # mb
  config.frequency                 = 5           # seconds
  config.percent_usage             = 0.98
  config.rolling_restart_frequency = 12 * 3600   # 12 hours in seconds
end
PumaWorkerKiller.start
Also, to prevent data corruption and other funny issues in your DB, I would suggest making sure your related writes are wrapped in atomic transactions.
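In Rails that usually just means wrapping related writes in an ActiveRecord transaction, so a worker that gets killed mid-request rolls back cleanly (a minimal sketch; the model names and attributes are illustrative):
# Either all of these statements commit, or none of them do.
ActiveRecord::Base.transaction do
  order.update!(status: "paid")
  Payment.create!(order: order, amount: order.total)
end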

Neo4j node creation speed

I have a fresh neo4j setup on my laptop, and creating new nodes via the REST API seems to be quite slow (~30-40 ms average). I've Googled around a bit, but can't find any real benchmarks for how long it "should" take; there's this post, but that only lists relative performance, not absolute performance. Is neo4j inherently limited to only adding ~30 new nodes per second (outside of batch mode), or is there something wrong with my configuration?
Config details:
Neo4j version 2.2.5
Server is on my mid-range 2014 laptop, running Ubuntu 15.04
OpenJDK version 1.8
Calls to the server are also from my laptop (via localhost:7474), so there shouldn't be any network latency involved
I'm calling neo4j via Clojure/Neocons; method used is "create" in the class clojurewerkz.neocons.rest.nodes
Using Cypher seems to be even slower; eg. calling "PROFILE CREATE (you:Person {name:"Jane Doe"}) RETURN you" via the HTML interface returns "Cypher version: CYPHER 2.2, planner: RULE. 5 total db hits in 54 ms."
Neo4j performance characteristics are a tricky area.
Measuring performance
First of all: a lot depends on how the server is configured. Measuring anything on a laptop is the wrong way to do it.
Before measuring performance you should check the following:
You have appropriate server hardware (see the requirements)
Client and server are on the same local network.
Neo4j is properly configured (memory mapping, webserver thread pool, Java heap size, etc.)
The server OS is properly configured (Linux TCP stack, maximum number of open files, etc.)
The server is warmed up. Neo4j is written in Java, so you should do an appropriate warmup before measuring numbers (i.e. apply some load for ~15 minutes).
And one last thing: the Enterprise edition. Neo4j Enterprise edition has some advanced features that can improve performance a lot (e.g. the HPC cache).
Neo4j internally
Internally, Neo4j consists of:
Storage
Core API
Traversal API
Cypher API
Everything is performed without any additional network requests. The Neo4j server is built on top of this solid foundation.
So, when you are making a request to the Neo4j server, you are measuring:
Latency between client and server
JSON serialization costs
Web server (Jetty)
Additional modules for managing locks, transactions, etc.
And Neo4j itself
So, the bottom line here is: Neo4j is pretty fast by itself when used in embedded mode, but going through the Neo4j server involves additional costs.
Numbers
We did some internal Neo4j testing and measured several cases.
Create nodes
Here we are using the vanilla transactional Cypher REST API.
Threads: 2
Node per transaction: 1000
Execution time: 1635
Total nodes created: 7000000
Nodes per second: 7070
Threads: 5
Node per transaction: 750
Execution time: 852
Total nodes created: 7000000
Nodes per second: 8215
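For context, batching like this through the transactional endpoint looks roughly like the following (a Ruby sketch against Neo4j 2.x; the label, property names and batch size are illustrative, and it assumes auth is disabled):
require "net/http"
require "json"
require "uri"

COMMIT_URI = URI("http://localhost:7474/db/data/transaction/commit")

# Creates one batch of nodes inside a single transaction.
def create_batch(props)
  body = {
    statements: [{
      statement:  "UNWIND {props} AS p CREATE (n:Person) SET n = p",
      parameters: { props: props }
    }]
  }
  Net::HTTP.post(COMMIT_URI, body.to_json, "Content-Type" => "application/json")
end

# 10,000 nodes, 1,000 per transaction.
(1..10_000).each_slice(1000) do |batch|
  create_batch(batch.map { |i| { name: "Person #{i}" } })
end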
Huge database sync
This one uses a custom-developed unmanaged extension, with a binary protocol between server and client and some concurrency.
But this is still the Neo4j server (in fact, a Neo4j cluster).
Node count: 80.32M (80 320 000)
Relationship count: 80.30M (80 300 000)
Property count: 257.78M (257 780 000)
Consumed time: 2142 seconds
Per second:
Nodes - 37497
Relationships - 37488
Properties - 120345
These numbers show Neo4j's true power.
My numbers
I tried to measure performance myself just now, with a fresh, unconfigured database (2.2.5) on Ubuntu 14.04 (in a VM).
Results:
$ ab -p post_loc.txt -T application/json -c 1 -n 10000 http://localhost:7474/db/data/node
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests
Server Software: Jetty(9.2.4.v20141103)
Server Hostname: localhost
Server Port: 7474
Document Path: /db/data/node
Document Length: 1245 bytes
Concurrency Level: 1
Time taken for tests: 14.082 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 14910000 bytes
Total body sent: 1460000
HTML transferred: 12450000 bytes
Requests per second: 710.13 [#/sec] (mean)
Time per request: 1.408 [ms] (mean)
Time per request: 1.408 [ms] (mean, across all concurrent requests)
Transfer rate: 1033.99 [Kbytes/sec] received
101.25 kb/s sent
1135.24 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 19
Processing: 1 1 1.3 1 53
Waiting: 0 1 1.2 1 53
Total: 1 1 1.3 1 54
Percentage of the requests served within a certain time (ms)
50% 1
66% 1
75% 1
80% 1
90% 2
95% 2
98% 3
99% 4
100% 54 (longest request)
This run creates 10,000 nodes with no properties, using the REST API from 1 thread.
As you can see, even on my laptop in a Linux VM with default settings, Neo4j is able to create nodes in 4 ms or less (99th percentile).
Note: I warmed up the database beforehand (created and deleted 100K nodes).
Bolt
If you are looking for the best Neo4j performance, you should follow the development of Bolt. This is the new binary protocol for the Neo4j server.
More info: here, here and here.
One other thing to try is to run ./bin/neo4j-shell. Since there's no HTTP connection, it can help you understand how much time is spent in Neo4j itself and how much comes from the HTTP interface.
When I do that on 2.2.2 my CREATEs are generally around 10ms.
I'm not sure what the ideal is and if there is configuration which can improve the performance.

How to tune a Ruby on Rails application running on Heroku which uses production level Heroku Postgres?

The company I work for decided to move its entire stack to Heroku. The main motivation was its ease of use: no sysadmin, no crying. But I still have some questions about it...
I'm running some load and stress tests on both the application platform and the Postgres service, using the blitz add-on from Heroku. I hit the site with between 1 and 250 users. I got some very interesting results and I need help evaluating them.
The Test Stack:
Application specifications
There's nothing particularly special about it.
Rails 4.0.4
Unicorn
database.yml set up to connect to Heroku postgres.
Not using cache.
Database
It's a Standard Tengu (Heroku's naming conventions will kill me one day :) properly connected to the application.
Heroku configs
I applied everything in unicorn.rb as described in the "Deploying Rails Applications With Unicorn" article. I have 2 regular web dynos.
WEB_CONCURRENCY : 2
DB_POOL : 5
Data
episodes table: ~100,000 rows
episode_urls table: ~300,000 rows
episode_images table: ~75,000 rows
Code
episodes_controller.rb
def index
  @episodes = Episode.joins(:program).where(programs: { channel_id: 1 }).limit(100).includes(:episode_image, :episode_urls)
end
episodes/index.html.erb
<% @episodes.each do |t| %>
  <% if !t.episode_image.blank? %>
    <li><%= image_tag(t.episode_image.image(:thumb)) %></li>
  <% end %>
  <li><%= t.episode_urls.first.mas_path if !t.episode_urls.first.blank? %></li>
  <li><%= t.title %></li>
<% end %>
Scenario #1:
Web dynos : 2
Duration : 30 seconds
Timeout : 8000 ms
Start users : 10
End users : 10
Result:
HITS 100.00% (484)
ERRORS 0.00% (0)
TIMEOUTS 0.00% (0)
This rush generated 218 successful hits in 30.00 seconds and we
transferred 6.04 MB of data in and out of your app. The average hit
rate of 7.27/second translates to about 627,840 hits/day.
Scenario #2:
Web dynos : 2
Duration : 30 seconds
Timeout : 8000 ms
Start users : 20
End users : 20
Result:
HITS 100.00% (484)
ERRORS 0.00% (0)
TIMEOUTS 0.00% (0)
This rush generated 365 successful hits in 30.00 seconds and we
transferred 10.12 MB of data in and out of your app. The average hit
rate of 12.17/second translates to about 1,051,200 hits/day. The
average response time was 622 ms.
Scenario #3:
Web dynos : 2
Duration : 30 seconds
Timeout : 8000 ms
Start users : 50
End users : 50
Result:
HITS 100.00% (484)
ERRORS 0.00% (0)
TIMEOUTS 0.00% (0)
This rush generated 371 successful hits in 30.00 seconds and we
transferred 10.29 MB of data in and out of your app. The average hit
rate of 12.37/second translates to about 1,068,480 hits/day. The
average response time was 2,631 ms.
Scenario #4:
Web dynos : 4
Duration : 30 seconds
Timeout : 8000 ms
Start users : 50
End users : 50
Result:
HITS 100.00% (484)
ERRORS 0.00% (0)
TIMEOUTS 0.00% (0)
This rush generated 484 successful hits in 30.00 seconds and we
transferred 13.43 MB of data in and out of your app. The average hit
rate of 16.13/second translates to about 1,393,920 hits/day. The
average response time was 1,856 ms.
Scenario #5:
Web dynos : 4
Duration : 30 seconds
Timeout : 8000 ms
Start users : 150
End users : 150
Result:
HITS 71.22% (386)
ERRORS 0.00% (0)
TIMEOUTS 28.78% (156)
This rush generated 386 successful hits in 30.00 seconds and we
transferred 10.76 MB of data in and out of your app. The average hit
rate of 12.87/second translates to about 1,111,680 hits/day. The
average response time was 5,446 ms.
Scenario #6:
Web dynos : 10
Duration : 30 seconds
Timeout : 8000 ms
Start users : 150
End users : 150
Result:
HITS 73.79% (428)
ERRORS 0.17% (1)
TIMEOUTS 26.03% (151)
This rush generated 428 successful hits in 30.00 seconds and we
transferred 11.92 MB of data in and out of your app. The average hit
rate of 14.27/second translates to about 1,232,640 hits/day. The
average response time was 4,793 ms. You've got bigger problems,
though: 26.21% of the users during this rush experienced timeouts or
errors!
General Summary:
The "Hit Rate" never goes beyond the number of 15 even though 150 users sends request to the application.
Increasing number of web dynos does not help handling requests.
Questions:
When I use caching and memcached (the Memcachier add-on from Heroku), even 2 web dynos can handle >180 hits per second. I'm just trying to understand what the dynos and the Postgres service can do without a cache, so that I can figure out how to tune them. How do I do that?
Standard Tengu is said to allow 200 concurrent connections, so why does it never reach that number?
If having a production-level db and increasing web dynos won't help scale my app, what's the point of using Heroku?
Probably the most important question: What am I doing wrong? :)
Thank you for even reading this crazy question!
I finally figured out the issue.
Firstly, remember my code in the view:
<% @episodes.each do |t| %>
  <% if !t.episode_image.blank? %>
    <li><%= image_tag(t.episode_image.image(:thumb)) %></li>
  <% end %>
  <li><%= t.episode_urls.first.mas_path if !t.episode_urls.first.blank? %></li>
  <li><%= t.title %></li>
<% end %>
Here I'm getting each episode's episode_image inside my iteration. Even though I was using includes in my controller, there was a big mistake in my table schema: I did not have an index on episode_id in my episode_images table! This was causing extremely high query times. I found it using New Relic's database reports. All other query times were 0.5 ms or 2-3 ms, but episode.episode_image was taking almost 6,500 ms!
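The fix is a one-line migration (a sketch; the migration class name is illustrative):
class AddIndexToEpisodeImages < ActiveRecord::Migration
  def change
    add_index :episode_images, :episode_id
  end
end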
I don't know much about the relationship between query time and application execution, but as soon as I added the index to my episode_images table, I could clearly see the difference. If your database schema is designed properly, you probably won't face any problems scaling via Heroku. But no number of dynos can help you with a badly designed database.
For people who might run into the same problem, I would like to share some of my findings about the relationship between Heroku web dynos, Unicorn workers and PostgreSQL active connections:
Basically, Heroku provides you with a dyno, which is a kind of small virtual machine with 1 core and 512MB of RAM. Your Unicorn server runs inside that little virtual machine. Unicorn has a master process and worker processes, and each of your Unicorn workers holds its own permanent connection to your PostgreSQL server (don't forget to check out this). It basically means that when you have a Heroku dyno up with 3 Unicorn workers running on it, you have at least 4 active connections. If you have 2 web dynos, you have at least 8 active connections.
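Those per-worker connections come from the standard fork hooks in config/unicorn.rb; the pattern from the Heroku Unicorn article looks roughly like this, trimmed to the database-related parts (a sketch, not a full config):
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
preload_app true

before_fork do |server, worker|
  # The master's connection is not shared with the workers.
  defined?(ActiveRecord::Base) && ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  # Each worker opens its own connection after forking,
  # which is why every extra worker adds to the active connection count.
  defined?(ActiveRecord::Base) && ActiveRecord::Base.establish_connection
end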
Let's say you have a Standard Tengu Postgres plan with a 200 concurrent connections limit. If you have problematic queries caused by bad db design, neither the db nor more dynos can save you without a cache... If you have long-running queries, you have no choice other than caching, I think.
All of the above are my own findings; if there is anything wrong with them, please let me know in the comments.

Inserting data into a MySQL db from a CSV file using Sidekiq and smarter_csv

I'm using Sidekiq (https://github.com/mperham/sidekiq) for background processing in my Rails application. I need to insert 75,000 records into a MySQL db from a CSV file. I'm using smarter_csv (https://github.com/tilo/smarter_csv) in conjunction with Sidekiq to insert the data into the db in chunks. I have the following questions:
Is the maximum number of workers for Sidekiq 25?
What is the maximum possible pool size for a MySQL db, and what pool size should I use for the minimum possible transfer time?
Thanks
sidekiq -c 50 starts Sidekiq with 50 processor threads (the default is 25), so 25 is not a hard limit.
MySQL accepts 100 connections by default. If you change the pool size in database.yml, make sure you enter a value less than or equal to the number of connections MySQL can handle. I don't know what the optimal value is; I think it depends on the amount of RAM available.
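For the chunked inserts themselves, the usual smarter_csv + Sidekiq pattern looks something like this (a sketch; the worker and model names are illustrative, and it assumes the CSV headers match your column names):
require 'smarter_csv'

class CsvImportWorker
  include Sidekiq::Worker

  # Each job receives one chunk (an array of row hashes) and inserts it.
  def perform(rows)
    rows.each { |attrs| Record.create!(attrs) }
  end
end

# Enqueue one job per chunk of 500 rows.
SmarterCSV.process('data.csv', chunk_size: 500) do |chunk|
  CsvImportWorker.perform_async(chunk)
end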
