I have a microservice that receives webhooks for processing, and it is currently getting pounded by the sender of those webhooks. Right now I insert each webhook into the db for later processing, but the traffic is so bursty at times that I can't keep up with the flood of requests, and I can't scale any further because I'm out of db connections.
The current thought is to take the webhooks and throw them into a Kafka queue for processing instead. With Kafka I can scale the number of frontend workers up to whatever I need to handle the deluge of requests, and I get Kafka's replayability. Once the webhooks go to Kafka, the frontend web server no longer needs a pool of db connections, since it literally just takes the request and puts it on the queue.
Does anyone know how to remove db connectivity from Puma, or have an alternative way to do this?
Currently running
ruby 2.6.3
rails 6.0.1
puma 3.11
I ended up using Puma's before_fork and on_worker_boot hooks in the Puma config so that the database connection is not re-established for those particular web workers.
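A minimal sketch of what such a Puma config might look like, assuming cluster mode with preload_app!. The worker count is a placeholder, and this is one reading of the approach described above rather than the exact config that was used:

```ruby
# config/puma.rb (sketch; the worker count is an assumed placeholder)
workers 4
preload_app!

before_fork do
  # Close any connections the master opened while preloading the app,
  # so no database sockets are carried into the forked workers.
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end

on_worker_boot do
  # Intentionally empty: the usual ActiveRecord::Base.establish_connection
  # call is left out, so these workers do not reconnect at boot.
end
```

Active Record connects lazily, so a worker would still open a connection if request code touches a model. This only keeps connections at zero if the webhook endpoint avoids Active Record entirely.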
I am using Nginx with Phusion Passenger with a single-threaded Rails application. Here's the catch: within that application, I am using multi-threaded sidekiq to perform some background jobs. Typically in my database.yml, I would only need to set the pool value to 1. Here's an example:
default: &default
  adapter: mysql2
  encoding: utf8
  collation: utf8_unicode_ci
  pool: 1
  username: username
  password: password
  host: localhost
The reasoning is this: for each TCP socket connection opened, when an HTTP request comes in through that socket, nginx takes the request and passes it to Passenger. Passenger detects that it's a Rails app and spawns a Rails instance, which generates the HTML response. That response is sent back to nginx, which passes it back to the client (browser). So with a single-threaded Rails app, each Passenger instance needs only one database connection.
But in my sidekiq.yml, I have set concurrency to 5:
:concurrency: 5
This means for each passenger rack instance, I will have 5 concurrent threads handled by sidekiq plus the one connection for the main app, that is a total of 6 database connections for one passenger instance.
When I look at passenger-status, I notice that max_pool_size is set to 6:
----------- General information -----------
Max pool size : 6
So does that mean passenger will never spawn more than 6 Rails instances concurrently? And if that's the case, does that mean my math is correct: 6 (instances) * 6 (database connections: 5 for sidekiq and 1 for main app) = 36 (total database connections possible for my rails app to handle concurrently).
Right now my mysql database is configured to handle 151 max concurrent connections.
SHOW VARIABLES LIKE "max_connections";
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 151 |
+-----------------+-------+
I just want to make sure my math is correct regarding passenger, rails and sidekiq.
First of all, your Sidekiq processes and your web server (in your case, Passenger) are separate. Passenger's thread pool size has no effect on your Sidekiq concurrency; instead, your Sidekiq configuration specifies a separate concurrency. So, we'll consider the two separately:
Passenger
The ActiveRecord database pool value is the number of database connections that your web process will use, in total across all threads. If your Passenger server is set up in multi-process mode, then your max connections from your web processes is db pool size * passenger pool size. On the other hand, if you set it up in multi-threaded mode (which I'd recommend if possible), your max connections is just db pool size multiplied by however many processes are running. A Puma setup with two worker processes and a pool of 15, for example, would use up to 30 connections.
So, if you're using multi-threaded mode, a pool size of 1 is absolutely not sufficient -- you'll want at least as big a pool as you expect to have threads. In multi-process mode, 1 might work but I doubt it's really worth straying from the default of 5, until you encounter issues.
Sidekiq
Sidekiq always runs in multi-threaded mode (you can technically run multiple processes as well, but I'll assume you aren't). So, like above, you want your connection pool to be at least as big as the number of threads. This might mean that you technically need two different values for your db pool value depending on whether the Rails env is spinning up for Passenger, or for Sidekiq -- see this issue on the Sidekiq repo or this helpful Heroku guide for more information on how to address that.
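One way to get two different pool values is to derive the pool size from the environment at boot. The sketch below is only an illustration: the variable names DB_POOL and SIDEKIQ_CONCURRENCY are hypothetical, and the fallback of 5 simply mirrors the values used in this thread.

```ruby
# Pick an ActiveRecord pool size based on which kind of process is booting.
# DB_POOL and SIDEKIQ_CONCURRENCY are hypothetical variable names.
def db_pool_size(env, sidekiq_server:)
  if sidekiq_server
    # Sidekiq needs at least one connection per worker thread.
    Integer(env.fetch("SIDEKIQ_CONCURRENCY", 5))
  else
    # The web process only needs enough for its own threads.
    Integer(env.fetch("DB_POOL", 5))
  end
end

puts db_pool_size({ "SIDEKIQ_CONCURRENCY" => "10" }, sidekiq_server: true)
puts db_pool_size({ "DB_POOL" => "1" }, sidekiq_server: false)
```

In a real app the flag could come from Sidekiq.server?, and the result could feed the pool key in database.yml through ERB.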
In summary
Don't forget that, aside from all the above, you may easily have multiple servers all running the same Rails app, but only one database with one connection limit. If you're running Passenger in multi-process mode with a max of 6 processes and set your db pool size to 5, then each web server node will use up to 30 connections. If a node is also running a Sidekiq server, add 5 to that. You will probably not need more than one Sidekiq server, so 4 web nodes * 30 connections + 1 Sidekiq process * 5 connections = 125 maximum connections, well within your MySQL connection limit.
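The arithmetic in this summary can be written out as a small helper. This is just a sketch of the calculation, with parameter names chosen for illustration:

```ruby
# Worst-case number of database connections across the whole deployment.
def max_db_connections(web_nodes:, passenger_processes:, db_pool:,
                       sidekiq_processes:, sidekiq_pool:)
  web = web_nodes * passenger_processes * db_pool   # every web process fills its pool
  sidekiq = sidekiq_processes * sidekiq_pool        # every Sidekiq process fills its pool
  web + sidekiq
end

total = max_db_connections(web_nodes: 4, passenger_processes: 6, db_pool: 5,
                           sidekiq_processes: 1, sidekiq_pool: 5)
puts total          # 125
puts total <= 151   # true: under the MySQL max_connections shown in the question
```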
I reviewed the Passenger documentation again, and while the answer above answers the question, I want to add a little more detail:
HTTP client via TCP sends a request to Nginx
Phusion Passenger loaded into Nginx checks if request should be handled by Passenger. If so, request is sent to Passenger Core.
Passenger core, using load balancing rules, determines which process a request should be forwarded to.
Passenger core also takes care of application spawning: if it determines that having more application processes is necessary or beneficial, then it will make that happen subject to user-configured limits: the core will never spawn more processes than a user-configured maximum.
Passenger core also has monitoring and statistics: passenger-memory-stats and passenger-status
Passenger core restarts an application process if it crashes.
UstRouter sits idle and does not consume resources if you did not configure it to send data to Union Station, a monitoring web service
Watchdog monitors Passenger Core and UstRouter. If either of them crashes, it is restarted by the Watchdog.
passenger-memory-stats lists the three aforementioned processes as well as the spawned rack apps:
------ Passenger processes ------
PID VMSize Private Name
---------------------------------
18355 419.1 MB ? Passenger watchdog
18358 1096.5 MB ? Passenger core
18363 427.2 MB ? Passenger ust-router
18700 818.9 MB 256.2 MB Passenger RubyApp: myapp_rack_rails
24783 686.9 MB 180.2 MB Passenger RubyApp: myapp_rack_rails
passenger-status shows that the max_pool_size is 6. That is, at most there will be 6 rack apps spawned by Passenger Core:
----------- General information -----------
Max pool size : 6
App groups : 2
Processes : 3
As stated in another answer, the ActiveRecord database pool value is the number of database connections that your web process will use, in total across all threads.
But since I am using the free Passenger server, which runs in multi-process mode, my max connections from my web processes is db pool size * passenger pool size. With a Passenger pool size of 6 and a db pool size of 1, that is 6 * 1 = 6 maximum database connections.
Sidekiq always runs in multi-threaded mode.
If someone wants to use sidekiq they must configure the number of threads they want to run on or use the default (25). If they are using a database (likely) then to not hit a connection timeout error they will need to have at least as many connections in their database pool as sidekiq threads. Currently they must configure these two values in two different places, database pool in database.yml for ActiveRecord, and sidekiq connections either via command line or the sidekiq yml file. This is a problem as it is difficult to remember when you are modifying one value that you need to modify both.
I've been searching for an answer on this but I couldn't find one.
How does the Puma master process communicate with the workers? How does the master process send a request to a worker? Is this done with shared memory? A Unix socket?
Thanks!
The master doesn't deal with requests, it merely monitors the workers and restarts them when necessary.
The workers, independently, pull requests from some queueing system, e.g. a TCP port or Unix socket.
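To make that concrete, here is a small sketch of the general pre-fork pattern, not Puma's actual code: the parent opens the listening socket once, the forked workers inherit it, and each worker calls accept on it directly. The function name and the one-line "protocol" are invented for the example, and it needs a platform with fork (Linux, macOS).

```ruby
require "socket"

# The parent ("master") binds the socket, forks workers, and never calls
# accept itself. The kernel hands each connection to exactly one worker.
def run_prefork_demo(worker_count: 2, requests: 4)
  server = TCPServer.new("127.0.0.1", 0) # port 0: let the OS pick a free port
  port = server.addr[1]

  worker_pids = Array.new(worker_count) do
    fork do
      loop do
        client = server.accept            # workers pull connections themselves
        client.gets                       # read the one-line "request"
        client.puts "handled by #{Process.pid}"
        client.close
      end
    end
  end

  # Act as the HTTP client: each reply names the worker that served it.
  replies = Array.new(requests) do
    TCPSocket.open("127.0.0.1", port) do |sock|
      sock.puts "ping"
      sock.gets.chomp
    end
  end

  { worker_pids: worker_pids, replies: replies }
ensure
  # The master's only job here is supervision: stop and reap the workers.
  worker_pids&.each do |pid|
    Process.kill("TERM", pid)
    Process.wait(pid)
  end
  server&.close
end

puts run_prefork_demo[:replies]
```

Every reply carries the PID of a worker, never the parent, which matches the description above: the master only supervises.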
We have a popular iPhone app where people duel each other a la Wordfeud. We have almost 1 M registered users today.
During peak hours the app gets really long response times, and there are also quite a lot of time outs. We have tried to find the bottleneck, but have had a hard time doing so.
CPU, memory and I/O are all under 50 % on all servers. The problem ONLY appears during peak hours.
Our setup
1 VPS with nginx (1.1.9) as load balancer
4 front servers with Ruby (1.9.3p194) on Rails (3.2.5) / Unicorn (4.3.1)
1 database server with PostgreSQL 9.1.5
The database logs don't show enough long request times to explain all the timeouts shown in the nginx error log.
We have also tried to build and run the app directly against the front servers (during peak hour when all other users are running against the load balancer). The surprising thing is that the app bypassing the load balancer is quick as a bullet even under peak hours.
NGINX SETTINGS
worker_processes=16
worker_connections=4096
multi_accept=on
LINUX SETTINGS
fs.file-max=13184484
net.ipv4.tcp_rmem="4096 87380 4194304"
net.ipv4.tcp_wmem="4096 16384 4194304"
net.ipv4.ip_local_port_range="32768 61000"
Why is the app bypassing the load balancer so fast?
Can nginx as load balancer be the bottleneck?
Is there any good way to compare timeouts in nginx with timeouts in the unicorns to see where the problem resides?
Depending on your settings nginx might be the bottleneck...
Check/tune the following settings in nginx:
the worker_processes setting (should be equal to the number of cores/cpus)
the worker_connections setting (should be very high if you have lots of connections at peak)
set multi_accept on;
if on linux, make sure nginx is using epoll (the use epoll; directive)
check/tune the following settings of your OS:
number of allowed open file descriptors (sysctl -w fs.file-max=999999 on linux)
tcp read and write buffers (sysctl -w net.ipv4.tcp_rmem="4096 4096 16777216" and sysctl -w net.ipv4.tcp_wmem="4096 4096 16777216" on linux)
local port range (sysctl -w net.ipv4.ip_local_port_range="1024 65535" on linux)
Update:
so you have 16 workers and 4096 connections per workers
which means a maximum of 4096*16=65536 concurrent connections
you probably have multiple requests per browser (ajax, css, js, page itself, any images on the page, ...), let's say 4 requests per browser
that allows for slightly over 16k concurrent users, is that enough for your peaks?
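The estimate above works out as follows. This is only a sketch of the arithmetic; the figure of 4 requests per user is the guess made in the answer, and the function name is made up for the example:

```ruby
# Rough upper bound on concurrent users nginx can serve with these settings.
def max_concurrent_users(worker_processes:, worker_connections:, requests_per_user:)
  max_connections = worker_processes * worker_connections
  max_connections / requests_per_user # integer division
end

puts 16 * 4096  # 65536 connections in total
puts max_concurrent_users(worker_processes: 16, worker_connections: 4096,
                          requests_per_user: 4)  # 16384
```

When nginx acts as a reverse proxy, the connections it opens to the upstream servers also count against worker_connections, so the practical ceiling is likely lower than this figure.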
How do you set up your upstream server group and what is the load balancing method you use?
It's hard to imagine that Nginx itself is the bottleneck. Is it possible that some upstream app servers get hit much more than others and start refusing connections because their backlog is full? See this load balancing issue on Heroku and see if you can get more help there.
Since version 1.2.2, nginx provides the least_conn directive. That might be an easy fix. I haven't tried it myself yet.
Specifies that a group should use a load balancing method where a
request is passed to the server with the least number of active
connections, taking into account weights of servers. If there are
several such servers, they are tried using a weighted round-robin
balancing method.
I am struggling to get munin reporting working when running a Tsung load test.
My set up is as follows.
Web site staging server (staging4):
2 CPUs
Tsung server
2 CPUs
My Tsung server has an SSH tunnel to staging4 on port 4950. See my tsung.xml configuration below:
<monitoring>
<monitor host="localhost" type="munin">
<munin port="4950" />
</monitor>
</monitoring>
When I start my load test I get the following error message every 10 seconds:
=INFO REPORT==== 16-Nov-2011::16:04:09 ===
ts_os_mon_munin:(4:<0.72.0>) CPU usage value from munin too high, skip (host "ip-10-48-177-212.housetrip.com" , cpu 8761644.1)
I may be wrong, but I think this is because our staging4 server has 2 CPUs and so the resulting CPU % is greater than 100%.
I checked through the Tsung code and there didn't seem to be an option to set the number of CPUs referenced in the monitoring XML element https://github.com/processone/tsung/blob/master/src/tsung_controller/ts_config.erl
However there does seem to be a CPU setting on the munin plugin wrapper https://github.com/processone/tsung/blob/master/src/tsung_controller/ts_os_mon_munin.erl
Has anyone come across this before? Is there any way I can get the munin values to be returned in my log file?
Any suggestions would be greatly appreciated.
Many thanks
I haven't worked with munin, but I know that Tsung doesn't handle multicore CPUs very well.
To avoid Tsung crashes when running massive Tsung load from a client I used this workaround on a 4 core CPU.
<clients>
<client host="myhostname" use_controller_vm="false" weight="1"/>
<client host="myhostname" use_controller_vm="false" weight="1"/>
<client host="myhostname" use_controller_vm="false" weight="1"/>
<client host="myhostname" use_controller_vm="false" weight="1"/>
</clients>
As you can see, the trick is to set up one Tsung client Erlang node per available core.
Maybe this trick can solve your munin problem also.