Terminating timed out worker of Puma in EBS on Rails app - ruby-on-rails

I've been running a Rails app on Elastic Beanstalk for quite a long time and hadn't faced this issue until now. My server keeps restarting with logs like:
[4412] ! Terminating timed out worker: 8646
[4412] ! Out-of-sync worker list, no 8646 worker
[4412] ! Out-of-sync worker list, no 8646 worker
In the EBS log, I am getting:
Environment health has transitioned from Ok to Warning. 50.0 % of the requests to the ELB are failing with HTTP 5xx. Insufficient request rate (3.0 requests/min) to determine application health (5 minutes ago). 1 out of 1 instances are impacted. See instance health for details.
My instance is a t2.medium (2 CPU cores). Is this because of heavy load on the server? Do I need to upgrade my instance? I am totally stuck.
FYI: I included Puma only by adding it to the Gemfile, since Rails 4+ handles the rest.
Any suggestions would be appreciated. Thanks.
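For context, when Puma is added only via the Gemfile it runs with its defaults; an explicit config/puma.rb along these lines makes the worker count and the timeout behind "Terminating timed out worker" visible (a sketch with illustrative values for a 2-core instance, not taken from the original setup):
# config/puma.rb -- a sketch; values are illustrative for a 2-core instance
workers 2                 # one worker process per CPU core
threads 1, 5              # min/max threads per worker
worker_timeout 60         # seconds before the master terminates a stuck worker
preload_app!
on_worker_boot do
  # each forked worker needs its own database connection
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
end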

Related

Ruby on Rails - Unicorn - Abort Query after Worker Timeout

We have a RoR 6 application using Unicorn (tested on 5.8 and 6.1) and a MySQL (5.7, soon switching to 8) database.
An issue that we have been running into lately is the following:
a user makes a request to a page
the request times out (the reasons are not important here) and returns a 504 - in particular, the unicorn master kills the unicorn worker after 60s
however, the query keeps running even after the worker is killed and reforked - this is particularly bad for suboptimal queries that put a lot of load on the db
I was wondering if there is an established way to safely kill any queries still running on the connection of a timed-out Unicorn worker?
Thank you!
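One approach worth considering (my own assumption, not something from the question): MySQL 5.7.8+ supports a per-session max_execution_time that makes the server itself kill long-running SELECT statements, so a query cannot outlive the 60s worker budget by much. A sketch of setting it from Unicorn's after_fork hook:
# config/unicorn.rb -- a sketch; 60000 ms mirrors the 60s worker timeout
after_fork do |server, worker|
  ActiveRecord::Base.establish_connection
  # MySQL >= 5.7.8 only; applies to read-only SELECT statements (milliseconds)
  ActiveRecord::Base.connection.execute("SET SESSION max_execution_time = 60000")
end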

Passenger uses more PostgreSQL connections than expected

This is a hard issue that has been happening in production for a long time, and we have no clue where it's coming from. We can sometimes reproduce it on localhost; Heroku Enterprise support has been clueless about it.
On our production database, we currently have the following setup:
Passenger Standalone, threading disabled, limited to 25 processes max, no minimum configured.
3 web dynos
A SELECT * FROM pg_stat_activity grouped by client_addr, counting the number of connections per instance, shows that more than one PostgreSQL connection is opened per Passenger process during our peak days.
Assumptions:
A single address corresponds to a single dyno (confirmed by Heroku staff)
Passenger does not spawn more than 25 processes at a time (confirmed with passenger-status during those peaks)
Here is a screenshot of what the SELECT * FROM pg_stat_activity; output looks like:
In the screenshot, we can see that there are 45 PSQL connections coming from the same dyno that runs Passenger. If we follow our previous logic, there should be no more than one connection per Passenger process, so at most 25.
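For reference, a grouped count along these lines is one way to get that per-address number from a Rails console (a sketch; it assumes ActiveRecord is connected to the same production database):
# Count PostgreSQL connections per client address (one address ~= one dyno)
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT client_addr, count(*) AS connections
  FROM pg_stat_activity
  GROUP BY client_addr
  ORDER BY connections DESC
SQL
rows.each { |r| puts "#{r['client_addr']}: #{r['connections']}" }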
The logs don't look unusual; nothing mentions a dyno crash or a process crash.
Here is a screenshot of our passenger-status output for the same dyno (taken at a different time, just to show that no more than 25 processes are created for one dyno):
And finally, one of the responses we got from Heroku support (amazing support, by the way):
I have also seen previous reports of Passenger utilising more connections than expected, but most were closed due to difficulty reproducing, unfortunately.
In the Passenger documentation, it's explained that Passenger handles the ActiveRecord connections itself.
Any leads appreciated. Thanks!
Various information:
Ruby Version: 2.4.x
Rails Version: 5.1.x
Passenger Version: 5.3.x
PG Version: 10.x
ActiveRecord Version: 5.1.x
If you need any more info, just let me know in the comments, I will happily update this post.
One last thing: we use ActionCable. I've read somewhere that Passenger handles socket connections in an unusual way (it opens a somewhat hidden process to keep the connection alive). This is one of our leads, but so far no luck reproducing it on localhost. If anyone can confirm how Passenger handles ActionCable connections, it would be much appreciated.
Update 1 (01/10/2018):
Experimented:
Disable NewRelic Auto-Explain feature as explained here: https://devcenter.heroku.com/articles/forked-pg-connections#disabling-new-relic-explain
Ran a Passenger server locally with min and max pool size set to 3 (more makes my computer burn), then killed processes with various signals (SIGKILL, SIGTERM) to see whether connections are closed properly. They are.
We finally managed to fix the issue we had on Passenger. We have had this issue for a very long time actually.
The fix
If you use ActionCable, and your default cable route is /cable, then change the Procfile from:
web: bundle exec passenger start -p $PORT --max-pool-size $PASSENGER_MAX_POOL_SIZE
to
web: bundle exec passenger start -p $PORT --max-pool-size $PASSENGER_MAX_POOL_SIZE --unlimited-concurrency-path /cable
Explanation
Before the change, each socket connection (ActionCable) would take one single process in Passenger.
But a socket is actually something that should not take up a whole process. A single process can handle many open socket connections (more than ten thousand at a time for some big names). Fortunately, we have far fewer socket connections, but still.
After the change, we basically told Passenger not to take a whole process to handle one socket connection, but rather to dedicate a single process to handling all the socket connections.
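For context, the /cable path above is just the ActionCable mount point. In Rails 5 it is mounted at /cable by default, or explicitly in config/routes.rb; whichever path you use is the one to pass to --unlimited-concurrency-path (a sketch, in case your app mounts it elsewhere):
# config/routes.rb -- only needed if you mount ActionCable explicitly
Rails.application.routes.draw do
  mount ActionCable.server => "/cable"
end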
Documentation
The in-depth documentation on how to do Sockets with Passenger: https://www.phusionpassenger.com/library/config/standalone/tuning_sse_and_websockets/
The flag to pass to Passenger: https://www.phusionpassenger.com/library/config/standalone/reference/#--unlimited-concurrency-path-unlimited_concurrency_paths
Some metrics, after 3 weeks with the fix
Number of forked processes on Passenger dramatically decreased (from 75 processes to ~ 15 processes)
Global memory usage on the web dynos dramatically decreased (related to previous point on forked Passenger processes)
The global number of PSQL connections dramatically decreased (from 150 to ~30 connections) and has been steady for two days (even after deployment).
Number of PSQL connections per dyno dramatically decreased (from ~50 per dyno to fewer than 10 per dyno)
The number of Redis connections decreased and has been steady for two days (even after deployment)
Average memory usage on PostgreSQL dramatically decreased and has been steady for two days.
The overall throughput is a bit higher than usual (Throughput is the number of requests handled per minute)

Why does my Amazon EC2 server go down every day for a few minutes? Errors 503 and 502. It's a Rails app

I don't know which error causes the problem. When I check, the server is down with error 503. In the Google Chrome log, I have the following error:
503 Service Unavailable: Back-end server is at capacity
While the server is down, I can't connect via SSH to see the error log. After a few minutes the server works again and I can get to the nginx error log.
In the log, I have common errors, like:
ActiveRecord::RecordNotFound (Couldn't find Attachment with 'id'=4240)
I know how to solve these and I don't think they are the problem.
But I have this error too:
Sending 502 response: application did not send a complete response
Process (pid=31880, group=/home/ubuntu/........./current/public) no longer exists! Detaching it from the pool.
I think this is the problem, but the causes and solutions I found online don't seem to fix it.
This problem started after I created a Load Balancer and switched to HTTPS.
Before that, it never happened.
About my server and app:
Amazon Ec2 instance;
Using Classic Load Balancer (with Amazon Certificate Manager in https port);
Using Route 53;
Not using Elastic IP;
OS: Ubuntu 14.04.2 LTS
ruby -v: 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
rails -v: Rails 4.2.3
nginx -v: nginx/1.8.0
passenger -v: Phusion Passenger version 5.0.10
Load Balancer Health Check is set up like this:
Ping Target
HTTP:80/index.html
Timeout 5 seconds
Interval 30 seconds
Unhealthy threshold 5
Healthy threshold 5
Health Check Information:
I took this screenshot from the Load Balancer MONITORING tab. It shows the Unhealthy Hosts (Count). Why was my host unhealthy?
SOLUTION
In my case, the problem was in the assets precompile task.
I have a lot of assets in my app, and when I deployed with Capistrano, it exhausted the server.
On the other hand, sometimes the assets were precompiled after the deploy, during page load. But this task is very slow and returns 502, 503 and 504 errors.
It also brings the server down, because CPU utilization goes to 100% and average latency rises.
To solve it, I removed the assets precompile task from Capistrano. I precompile the assets on my local PC and push them all to the master branch in Git. When I run cap production deploy, the precompile task does not run. More details in this post.
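For what it's worth, one way to make sure Capistrano never precompiles on the server (assuming Capistrano 3 with the capistrano-rails gem; adjust if your setup differs) is simply not to load its assets tasks in the Capfile:
# Capfile -- a sketch for Capistrano 3 + capistrano-rails
require "capistrano/setup"
require "capistrano/deploy"
require "capistrano/bundler"
require "capistrano/rails/migrations"
# require "capistrano/rails/assets"   # left out so deploy:assets:precompile is never defined or run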
I made some changes to my Load Balancer Health Check settings:
Ping Target HTTP:80/elb/index.html (I created this folder and file in the public folder)
Timeout 5 seconds
Interval 30 seconds
Unhealthy threshold 2
Healthy threshold 10
Idle timeout: 65 seconds (equal my nginx timeout)
With this, I hope the assets precompile task never runs on the server again.

403 errors from load balancer while new instances are booting

I have a RoR app running on Elastic Beanstalk. I have occasionally seen 403 errors from Passenger for a while. Most of the time one server is running, but this increases to 3 or 4 instances during busy periods of the day.
Session stickiness is not turned on.
I have noticed that when a new server is started the ELB is sending requests to it before bundle install has finished.
If I ssh to the newly started server I can see in /var/app/current/ that the app has not yet been installed and if I run top it looks like bundler is running and compiling things with cc1, etc.
/var/app/support/log/passenger.log shows that requests to valid URLs within my Rails app are being received and responded to with 404. Hardly surprising, because the app isn't there yet.
After 5-10 minutes all of the compiling is complete and the app files appear in /var/app/current and all is well.
This doesn't seem quite right to me. How do I set up the ELB / my rails app so that the ELB can tell when it is ready to receive requests?
I found the answer to this. There was no application health check URL set. In this case the ELB pings the instance to see if it's healthy, i.e. it checks that the instance is booted rather than whether Rails is up and running. Setting the health check URL to '/login/' fixed it for me, because this gives a 404 until Rails is running and a 200 afterwards.
Elastic Beanstalk demands 2 correct responses before it deems an instance healthy, and it checks the instance every 5 minutes. This means an instance can take a while to start serving requests, i.e. boot time + waiting for the next poll from the ELB + 5 minutes before it sees any real traffic.
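If you'd rather not rely on /login's status codes, a dedicated lightweight endpoint works the same way; a minimal sketch (the /healthcheck path and the proc-based route are my own illustration, not part of the original setup):
# config/routes.rb -- hypothetical health-check endpoint for the ELB to poll
Rails.application.routes.draw do
  # returns 200 only once Rails is booted and routing requests
  get "/healthcheck", to: proc { [200, { "Content-Type" => "text/plain" }, ["ok"]] }
end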

Nginx + unicorn (rails) often gives "Connection refused" in nginx error log

At work we're running some high traffic sites in rails. We often get a problem with the following being spammed in the nginx error log:
2011/05/24 11:20:08 [error] 90248#0: *468577825 connect() to unix:/app_path/production/shared/system/unicorn.sock failed (61: Connection refused) while connecting to upstream
Our setup is nginx on the frontend server (load balancing), and unicorn on our 4 app servers. Each unicorn is running with 8 workers. The setup is very similar to the one GitHub uses.
Most of our content is cached, and when the request hits nginx it looks for the page in memcached and serves that if it can find it - otherwise the request goes to Rails.
I can solve the above issue - SOMETIMES - by doing a pkill of the unicorn processes on the servers followed by a:
cap production unicorn:check (removing all the pid's)
cap production unicorn:start
Do you have any clue how I can debug this issue? We don't have any significantly high load on our database server when these problems occur.
Something killed your Unicorn process on one of the servers, or it timed out. Or you have an old app server in your upstream app_server { } block that is no longer valid. Nginx will retry it from time to time. The default is to retry another upstream if it gets a connection error, so hopefully your clients didn't notice anything.
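As a debugging aid, it's worth confirming that the socket path and worker timeout in the Unicorn config match what nginx's upstream block expects; a config/unicorn.rb sketch (paths and counts are illustrative, taken from the error message above rather than the actual config):
# config/unicorn.rb -- a sketch; the listen path mirrors the socket nginx reports as refused
worker_processes 8
listen "/app_path/production/shared/system/unicorn.sock"
timeout 60                  # the master kills workers that run longer than this
preload_app true
before_fork do |server, worker|
  defined?(ActiveRecord::Base) && ActiveRecord::Base.connection.disconnect!
end
after_fork do |server, worker|
  defined?(ActiveRecord::Base) && ActiveRecord::Base.establish_connection
end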
I don't think this is an nginx issue for me; restarting nginx didn't help. It seems to be gunicorn. A quick and dirty way to avoid this is to recycle the gunicorn instances when the system is not being used, say at 1 AM, if that is an acceptable maintenance window. I run gunicorn as a service that will come back up if killed, so a pkill script takes care of the recycle/respawn:
start on runlevel [2345]
stop on runlevel [06]
respawn
respawn limit 10 5
exec /var/web/proj/server.sh
I am starting to wonder if this is at all related to memory allocation. I have MongoDB running on the same system and it reserves all the memory for itself but it is supposed to yield if other applications require more memory.
Other things worth trying are getting rid of eventlet or other dependent modules when running gunicorn. uWSGI can also be used as an alternative to gunicorn.
