I am running a Spring Cloud Gateway (version 3.0.3) project with Reactor Netty (version 2.5.7). My system has an 8-core processor and 16 GB of RAM. I have a route which calls the HttpBin API. When I run a load test at a constant throughput of 25 TPS, I get around 0.75% 502 (Bad Gateway) errors.
What might be the issue?
I tried increasing Netty's worker thread count to 200, but there was no change in throughput; I am still getting around 0.75% 502 errors.
Update: After upgrading Spring Cloud Gateway to version 4.0.0, the 502 error count is reduced; it is now 0.15%.
I have a few NodeJS servers on GCP Cloud Run, and when there are a lot of requests, Google adds more instances. I've noticed that when they are added, we temporarily get a lot of 503 errors, though the 503 errors could simply be because instances are being added.
Things I have tried to do to fix this:
Reduced the concurrency from 1000 to 500, then to 300. We are still getting 503 errors, but in some cases fewer.
Used health checks to ping an endpoint to make sure Express is running before sending traffic. This also might have helped, but we are still getting quite a few 503 errors.
I expect that we should be able to fix this so that we don't have any 503 errors, but we are still getting quite a few.
What else should I try?
Update #1:
It takes about 10 to 20 seconds for an instance to start.
Update #2:
I noticed that the 503 errors happen in groups, and that the failing requests take more than just a few milliseconds, which means the instances are already running when the 503 errors occur. That suggests GCP Run is likely adding instances because of the 503 errors, and the 503 errors are not happening because instances are being added.
What can cause a 503 error all of a sudden for an instance that is already running?
ALB is giving me 502 errors randomly during a Playwright test. When I debugged and checked the ALB logs in Athena, I found that elb_status_code is 502.
The access log entry also says the request_processing_time is 0.0, the target_processing_time is 0.005, and the response_processing_time is -1.
As per this AWS documentation:
https://aws.amazon.com/premiumsupport/knowledge-center/elb-alb-troubleshoot-502-errors/
If the elb_status_code is "502" and the target_status_code is "-", then your load balancer is the source of the HTTP 502 errors
But I couldn't figure out how to fix it. Can someone please help me with it?
I had the same issue; it was caused by keep-alive timeouts (the server used a shorter timeout than the load balancer).
Take a look at this post:
AWS Load Balancer 502
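If your backend happens to be nginx (an assumption; the same idea applies to any HTTP server), a minimal sketch of the fix is to keep idle connections open on the server for longer than the load balancer's idle timeout, so the backend never closes a connection the ALB is about to reuse:

# Sketch only: assumes an nginx backend and the ALB's default 60-second idle timeout.
http {
    keepalive_timeout 75s;    # must be longer than the ALB idle timeout (60 s by default)
    keepalive_requests 1000;  # allow many requests on one kept-alive connection
}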
I don't know which error causes the problem. When I check, the server is down with a 503 error. In the Google Chrome log, I have the following error:
503 Service Unavailable: Back-end server is at capacity
While the server is down, I can't connect via SSH to see the error log. After a few minutes the server works again and I can get to the nginx error log.
In the log, I have common errors, like:
ActiveRecord::RecordNotFound (Couldn't find Attachment with 'id'=4240)
I know how to solve those, and I think these errors are not the problem.
But I have this error too:
Sending 502 response: application did not send a complete response
Process (pid=31880, group=/home/ubuntu/........./current/public) no longer exists! Detaching it from the pool.
I think this is the problem, but the causes and solutions I found on the internet do not appear to solve it.
This problem started happening after I created a Load Balancer and switched to HTTPS.
Before that, it never happened.
About my server and app:
Amazon EC2 instance;
Using Classic Load Balancer (with Amazon Certificate Manager in https port);
Using Route 53;
Not using an Elastic IP;
OS: Ubuntu 14.04.2 LTS
ruby -v: 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
rails -v: Rails 4.2.3
nginx -v: nginx/1.8.0
passenger -v: Phusion Passenger version 5.0.10
Load Balancer Health Check is set up like this:
Ping Target
HTTP:80/index.html
Timeout 5 seconds
Interval 30 seconds
Unhealthy threshold 5
Healthy threshold 5
Health Check Information:
I get this screenshot from the Load Balancer MONITORING tab. It shows the Unhealthy Hosts (Count). Why was my host unhealthy?
SOLUTION
In my case, the problem was the assets precompile task.
I have a lot of assets in my app, and when I deployed with Capistrano, precompiling them exhausted the server.
On the other hand, sometimes the assets were precompiled after the deploy, during page load. That task is very slow and returns 502, 503 and 504 errors.
It also brings the servers down, because CPU utilization goes to 100% and the average latency grows as well.
To solve it, I removed the assets precompile task from Capistrano. I precompile the assets on my local PC and push them all to the Git master branch. When I run cap production deploy, the precompile task does not run. More details in this post.
I made some changes to my Load Balancer Health Check settings:
Ping Target HTTP:80/elb/index.html (I created this folder and file in the public folder)
Timeout 5 seconds
Interval 30 seconds
Unhealthy threshold 2
Healthy threshold 10
Idle timeout: 65 seconds (equal to my nginx timeout)
With this, I hope the assets precompile task never runs on the server again.
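For reference, a minimal sketch of how the nginx server block can serve that health check file and keep the timeout in line with the ELB settings above (the deploy path is hypothetical, and the values simply mirror what is described in this answer):

# Sketch only: assumes nginx serves the Rails public/ directory and the ELB idle timeout is 65 seconds.
server {
    listen 80;
    root /home/ubuntu/myapp/current/public;   # hypothetical deploy path

    keepalive_timeout 65s;                     # match the ELB idle timeout

    # Serve the health check file statically so the ELB check never waits on Rails
    location = /elb/index.html {
        access_log off;
    }
}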
While doing a load test, I found Passenger throwing the error below when lots of concurrent requests first hit the server. On the client side it gives a 502 error code. However, after some requests, say 1000-2000, it works fine.
2013/07/23 11:22:46 [error] 14131#0: *50226 connect() to /tmp/passenger.1.0.14107/generation-0/request failed (11: Resource temporarily unavailable) while connecting to upstream, client: 10.251.18.167, server: 10.*, request: "GET /home HTTP/1.0", upstream: "passenger:/tmp/passenger.1.0.14107/generation-0/request:", host: hostname
Server Details.
Passenger 4.0.10
ruby 1.9.3/2.0
Server: EC2 m1.xlarge
64-bit, 4 cores, 15 GB RAM
Ubuntu 12.04 LTS
It's a web server which serves dynamic web pages for a Rails application.
Can somebody suggest what the issue might be?
A "temporarily unavailable" error in that context means the socket backlog is full. That can happen if your app cannot handle your requests fast enough. What happens is that the queue grows and grows, until it's full, and then you start getting those errors. In the mean time your users' response times grow and grow until they get an error. This is probably an application-level problem so it's best to try starting there. Try figuring out why your app is slow, at which request it is slow, and fix that. Or maybe you need to scale to more servers.
We have a popular iPhone app where people duel each other, a la Wordfeud. We have almost 1M registered users today.
During peak hours the app gets really long response times, and there are also quite a lot of timeouts. We have tried to find the bottleneck, but have had a hard time doing so.
CPU, memory and I/O are all under 50% on all servers. The problem ONLY appears during peak hours.
Our setup
1 VPS with nginx (1.1.9) as load balancer
4 front servers with Ruby (1.9.3p194) on Rails (3.2.5) / Unicorn (4.3.1)
1 database server with PostgreSQL 9.1.5
The database logs don't show enough long request times to explain all the timeouts shown in the nginx error log.
We have also tried to build and run the app directly against the front servers (during peak hours, when all other users are running against the load balancer). The surprising thing is that the app bypassing the load balancer is quick as a bullet even during peak hours.
NGINX SETTINGS
worker_processes=16
worker_connections=4096
multi_accept=on
LINUX SETTINGS
fs.file-max=13184484
net.ipv4.tcp_rmem="4096 87380 4194304"
net.ipv4.tcp_wmem="4096 16384 4194304"
net.ipv4.ip_local_port_range="32768 61000"
Why is the app bypassing the load balancer so fast?
Can nginx as load balancer be the bottleneck?
Is there any good way to compare timeouts in nginx with timeouts in the unicorns to see where the problem resides?
Depending on your settings, nginx might be the bottleneck...
Check/tune the following settings in nginx (a combined nginx example is sketched after the list below):
the worker_processes setting (should be equal to the number of cores/CPUs)
the worker_connections setting (should be very high if you have lots of connections at peak)
set multi_accept on;
if on Linux, make sure nginx is using epoll (the use epoll; directive)
Check/tune the following settings of your OS:
number of allowed open file descriptors (sysctl -w fs.file-max=999999 on Linux)
tcp read and write buffers (sysctl -w net.ipv4.tcp_rmem="4096 4096 16777216" and sysctl -w net.ipv4.tcp_wmem="4096 4096 16777216" on Linux)
local port range (sysctl -w net.ipv4.ip_local_port_range="1024 65536" on Linux)
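Putting those nginx directives together, a minimal sketch (values are illustrative, not tuned for this specific setup):

worker_processes 16;          # one per core/CPU

events {
    worker_connections 4096;  # per worker; raise it if peaks need more concurrent connections
    multi_accept on;          # accept as many new connections as possible at once
    use epoll;                # efficient event notification on Linux
}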
Update:
so you have 16 workers and 4096 connections per worker,
which means a maximum of 4096*16 = 65536 concurrent connections.
You probably have multiple requests per browser (ajax, css, js, the page itself, any images on the page, ...), let's say 4 requests per browser.
That allows for slightly over 16k concurrent users; is that enough for your peaks?
How do you set up your upstream server group, and which load balancing method do you use?
It's hard to imagine that nginx itself is the bottleneck. Is it possible that some upstream app servers get hit much more than others and start to refuse connections because their backlog is full? See this load balancing issue on Heroku and see if you can get more help there.
Since version 1.2.2, nginx provides least_conn. That might be an easy fix; I haven't tried it myself yet. From the nginx docs:
Specifies that a group should use a load balancing method where a request is passed to the server with the least number of active connections, taking into account weights of servers. If there are several such servers, they are tried using a weighted round-robin balancing method.
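A minimal sketch of what that would look like in the load balancer's nginx config (the upstream names and ports are placeholders for the four front servers, not the asker's real addresses):

upstream app_servers {
    least_conn;                   # pick the backend with the fewest active connections
    server front1.internal:8080;  # placeholder addresses for the Unicorn front servers
    server front2.internal:8080;
    server front3.internal:8080;
    server front4.internal:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_servers;
    }
}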