Why is my load balancer responding with Bad Gateway? - amazon-elb

I have no web server running on my EC2 machine, but I still get 502 Bad Gateway from the load balancer in front of it.
Why do I get a Bad Gateway error from the load balancer, but only a timeout (and no Bad Gateway error) when there is no load balancer in front of the EC2 machine?

The load balancer regularly runs health checks against its target machines, i.e. it sends an HTTP or TCP request (as you have configured it). This way it knows which machines in its target pool are healthy and can take requests, and which can't. It's supposed to balance the load between multiple machines, after all.
When your EC2 machine does not have a running web server, its health check fails and it's seen as unavailable by the load balancer. Since apparently there's no other healthy machine in the pool, the load balancer cannot successfully forward requests to anything, and so it answers with a 502 Bad Gateway status itself.
The difference to just timing out when you access your EC2 machine directly is that with a load balancer there's still something that can accept HTTP requests and return an appropriate HTTP error code. When you simply have no web server whatsoever, nothing can accept the connection, so it can only time out.
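If you want to see exactly why the load balancer marks the target unhealthy, something like the following helps (a sketch assuming an ALB/NLB target group; the ARN, IP, port, and path are placeholders):

    # inspect why the target is unhealthy (ALB/NLB target group; ARN is a placeholder)
    aws elbv2 describe-target-health \
      --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/my-targets/abc123
    # the output includes a reason code such as Target.FailedHealthChecks or Target.Timeout
    # (for a Classic ELB the equivalent is: aws elb describe-instance-health --load-balancer-name my-elb)

    # reproduce the health check by hand from inside the VPC
    # (IP, port, and path are whatever the health check is configured with)
    curl -v http://10.0.1.23:80/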

Related

AWS ELB/ECS Http response headers changed

Some context here:
An old Symfony app runs on multiple EC2 instances and handles millions of requests each day without issues.
For dev purposes, the app was put into a container, and that container is used locally by the developers without having to install all the requirements. The dockerized app uses the same nginx/supervisor/php-fpm configs as the production EC2 instances.
To simplify some dev processes, it was decided to create multiple dev environments using AWS Fargate instead of EC2 instances.
The image is pushed to ECR and is deployed to clusters using the FARGATE launch type.
The approach is perhaps overkill, since we have 1 cluster running only 1 service with 1 task. That service uses an ELB -> target group.
The application works fine, but after some time (hours or days), some requests are returned with different headers. The response is JSON, but the content type is returned as HTML, and other headers are dropped from the response, like access-control-allow-headers, access-control-allow-credentials, and access-control-allow-methods, triggering a CORS error in the client's browser.
The weird part is that if 1 page creates 10 requests to this service, 9 will work correctly, but 1 request will return 200 with different headers. That endpoint will consistently behave the same way for any user until the task is restarted.
The response headers are set by the Symfony app. I also tried forcing those headers by adding them in the nginx config by default for every response, and the result is the same.
The docker image exposes port 80 to the service.
The load balancer has the rule to forward HTTPS (443) traffic to port 80, so traffic can reach the container.
The load balancer has HTTP/2 enabled.
The only notable difference besides the EC2/Fargate implementations is the load balancer. The production load balancer is an old Classic Load Balancer with only HTTP/1 enabled, and the new ones are Application Load Balancers using HTTP/2.
This is driving me crazy. Has anyone experienced something like this?
(Screenshots: incorrect headers vs. correct headers.)
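Since the only notable difference is HTTP/2 on the new Application Load Balancer, one quick check (a sketch; the URL and endpoint are placeholders) is to request the failing endpoint over both protocol versions and compare the returned headers:

    # compare response headers for the failing endpoint over both protocol versions
    curl -sI --http1.1 https://dev.example.com/api/failing-endpoint
    curl -sI --http2   https://dev.example.com/api/failing-endpoint

Note that over HTTP/2 curl prints header names in lowercase; what matters is whether the content-type and access-control-* headers differ between the two runs.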

Docker Swarm load balance testing using Chrome

I've tried a simple single-node swarm, just like in the Docker tutorial part 3, and I found that if I use curl, I jump between the two replicas, but if I use Chrome, then once I open the page any following requests are handled by the same replica. I'm sure I'm actually hitting it only once, because the counter increases only by 1.
What is happening? Is it some kind of feature in Docker Swarm load balancing? If so, how does it work? No specific request headers are sent to the server, so how would the load balancer recognize me? It can't be the IP, because if I use incognito mode I'm handled by a different replica, and I stay stuck to it as long as I'm in incognito.
It's not a Swarm thing, it's a Chrome thing. Curl acts like you'd expect: each command is a new TCP connection that shows up as a new connection going through the Swarm VIP load balancer.
Chrome (and other browsers) has lots of methods to keep TCP connections open for future requests (HTTP keep-alive, etc.). This is why it stays connected to the same container: the connection is persistent through the LB to the replica. The LB will only shift to the "next in the round-robin pool" for a new connection.
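You can reproduce the difference with curl alone, since curl reuses a connection for multiple URLs given in one invocation. A sketch, assuming the tutorial service is published on port 4000:

    # two separate curl processes: two TCP connections, typically two different replicas
    curl http://localhost:4000
    curl http://localhost:4000

    # one curl process fetching the URL twice: the TCP connection is reused,
    # so both responses come from the same replica
    curl http://localhost:4000 http://localhost:4000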

GCP Load Balancer: 502 Server Error, "failed_to_connect_to_backend"

I have a dockerized Go application running on two GCP instances. Everything works fine when I use them via their individual external IPs, but when they're put behind the load balancer, they're either slow to answer or the load balancer returns a 502 server error. The health checks seem to be OK, so I really don't understand.
In the logs, the error thrown is
failed_to_connect_to_backend
I've already seen other answers to this question, but none of them seems to cover my case. I cannot modify the way the application is served, so it doesn't seem to be a timeout thing.
To troubleshoot a 502 response from the load balancer due to "failed_to_connect_to_backend", I would check the following:
1) Usually the "failed_to_connect_to_backend" error message indicates that the load balancer is failing to connect to backends, so investigating the URL map rules is a good place to start. I would suggest reviewing your load balancer's URL map to make sure that host rules, path matchers, and path rules are correctly defined and comply with the descriptions in this article.
2) Also check whether the backend instances are exhausting their resources. If a backend server is overwhelmed, it will refuse incoming requests, potentially causing the load balancer to give up on it and return the specific 502 error you're experiencing. For Apache you could use this link, and for nginx this link. Also check how many established connections are present at any one time using netstat together with the watch command (see the sketch after this list).
3) I would also recommend testing again with an HTTP(S) request sent directly to the instance, requesting the same URL that is reporting 502. You can do this test from another VM instance in your VPC network.
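A rough sketch of the commands for points 2) and 3); the port, local address, and instance IP are placeholders:

    # point 2: watch how many established connections the backend holds over time
    watch -n 2 "netstat -tn | grep ':80 ' | grep -c ESTABLISHED"

    # point 3: request the URL that returns 502 directly against the instance,
    # from another VM in the same VPC network
    curl -v http://10.132.0.5/path-returning-502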
Check whether or not your backend blocks Google's Cloud CDN IP addresses. Those addresses can be found here: https://cloud.google.com/compute/docs/faq#find_ip_range
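The documented source ranges for Google's load balancer and health-check traffic are 130.211.0.0/22 and 35.191.0.0/16; a sketch of a firewall rule that allows them (rule name, network, and port are placeholders):

    gcloud compute firewall-rules create allow-lb-health-checks \
      --network=default \
      --allow=tcp:80 \
      --source-ranges=130.211.0.0/22,35.191.0.0/16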
This happened to me more than once. I was using Apache on my servers, and the issue was not CPU, but configuration.
I am using Apache mpm_event in combination with PHP-FPM, and there are several settings that limit the maximum number of requests Apache and FPM will allow.
In my case I increased MaxRequestWorkers in the Apache MPM config from the default 150 to 600, and pm.max_children in the PHP-FPM config to 80 (I don't remember what the default was here).
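For reference, the relevant directives look roughly like this; the file paths and PHP version are assumptions and vary by distro:

    # /etc/apache2/mods-available/mpm_event.conf (assumed path):
    #   MaxRequestWorkers 600
    # /etc/php/7.4/fpm/pool.d/www.conf (assumed path):
    #   pm.max_children = 80

    # validate the Apache config, then restart both services
    sudo apachectl configtest
    sudo systemctl restart apache2 php7.4-fpm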
This worked as expected, hope this helps you to extrapolate to your own stack.
Just encountered 502 errors myself on access to a Prometheus pod running on my GKE Standard cluster (exposed through IAP).
The issue was that the configured external HTTP(S) Load Balancer's health check was coming back unhealthy, despite the Prometheus pod running as expected. After digging into it I found that the GCP auto-generated health check was faulty: it was checking the URL / instead of /-/ready. When I deleted the Prometheus k8s Ingress resource (which auto-generates GCP's LB and health check) and recreated it, the issue was resolved (after a few minutes of resource propagation).
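If recreating the Ingress is not an option, the auto-generated health check can also be inspected and repointed by hand; a sketch (the health-check name is a placeholder, and GKE may overwrite manual edits when it next syncs the Ingress):

    # list the health checks the Ingress controller created
    gcloud compute health-checks list

    # point the relevant one at Prometheus' readiness endpoint instead of /
    gcloud compute health-checks update http k8s-be-30900--0123456789abcdef \
      --request-path=/-/ready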

AWS Load Balancer EC2 health check request timed out failure

I'm trying to get down and dirty with DevOps and I'm running into a health check request timed out failure. The problem is my Elastic Load Balancer sends a health check to my EC2 instance and gets a network timeout. I'm not sure what I did wrong. I am following this tutorial and I have completed all the steps up to and including "Using a Elastic Load Balancer". My EC2 instance seems to be working fine and I am able to successfully curl localhost on port 9292 from within the EC2 instance.
EC2 instance security group setup:
Elastic Load Balancer setup:
My target group for the ELB routing has port 9292 open via HTTP and here's a screenshot of the target in my target group that is unhealthy.
Health check config:
I have a VPC that my EC2 instance is a part of and my ELB is connected to the same VPC. I do not have Apache installed and I do not have nginx installed. To my understanding, I do not need these. I have a Rails Puma server running and I can send successful curl requests to the server.
My hunch is that my ELB is not allowed to reach my EC2 instance, resulting in a network timeout and a failed health check. I'm unable to find the cause for this. Any ideas? This SO post didn't help much. Are my security groups misconfigured? What else could potentially block a routing request from ELB to my EC2 instance?
Also, is there a way to view network requests / logs for my EC2 instance? I keep seeing VPC flow logging but I feel like there are simpler alternatives.
Here's something I posted in the AWS forums but to no avail.
UPDATE: I can curl the private IP of the target just fine from within an EC2 instance. I don't think it's the target instance; I think it's something to do with the security group setup. I am unable to identify why, though, because I have basically allowed all traffic from the load balancer to the EC2 instance.
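One way to see whether the load balancer's health-check probes reach the instance at all (a sketch; the interface name is an assumption) is to watch the app port on the instance while the checks run:

    # on the EC2 instance: show incoming connection attempts on the app port
    sudo tcpdump -ni eth0 'tcp port 9292'

    # in parallel, confirm the app answers locally
    curl -v http://localhost:9292/

If nothing from the load balancer's private IPs shows up while the health checks keep failing, the traffic is being blocked before it reaches the instance (security group, network ACL, or, as it turned out below, the subnet).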
I made my mistake during the "Set up your VPC" step. I had just finished creating a subnet for an RDS instance. When I then launched an instance, the default subnet that AWS chose after I switched to my VPC was the subnet I had made for RDS, which was NOT a public subnet. Therefore, no request from any EC2 instance or from my load balancer could reach it, because I had only set up my public subnet to take requests.
The solution was to create a new instance and this time, pick the correct public subnet. My original EC2 instance was associated with a private subnet while the load balancer was pointing to the public subnet.
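To confirm which subnet an instance actually landed in and whether that subnet's route table points at an internet gateway, something like this helps (instance and subnet IDs are placeholders):

    # which subnet is the instance in?
    aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
      --query 'Reservations[].Instances[].SubnetId'

    # does that subnet's route table contain an internet gateway (igw-...) route?
    aws ec2 describe-route-tables \
      --filters Name=association.subnet-id,Values=subnet-0123456789abcdef0 \
      --query 'RouteTables[].Routes[]'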
Here's a link to a hand-drawn image that helped me pinpoint my problem; hopefully it can help anyone else who's having trouble setting up. I didn't put the image here directly because it's bigger than 2 MB.
Glad to answer any further questions too!

Docker Swarm Late Server Startup

I've been using Docker Swarm for a while and I'm really pleased with how simple it is to set up a swarm cluster and to run replicated services. However, I've faced a problem that seems like a blocker in my use case.
I'm using Docker 1.12 and swarm mode.
My problem is that the internal IPVS load balancer sends requests to tasks that are still in "health status: starting", while my application is not yet properly started.
My application takes some time to start, but the Docker Swarm load balancer starts sending requests as soon as the container is in the "running" state.
After running some tests I realized that if I scale up by one instance, the new instance is available to the load balancer immediately, and a client may get a connection-refused response if the load balancer sends the request to the still-starting server.
I've implemented the health check and I was expecting a particular instance to only become available to the load balancer after the first successful health check.
Is there any way to configure the load balancer or the scheduler to only send requests to instances that are properly started?
Best Regards,
Bruno Vale
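The usual approach (a sketch; the image name, port, and health endpoint are assumptions) is to define a HEALTHCHECK in the image so Swarm can distinguish "running" from "healthy", and then verify on your Docker version that tasks are only added to the service VIP once the check passes:

    # add to the image's Dockerfile (curl must be available inside the image):
    #   HEALTHCHECK --interval=5s --timeout=3s --retries=3 \
    #     CMD curl -f http://localhost:8080/health || exit 1

    # rebuild and recreate the service from the health-checked image
    docker build -t myapp:healthcheck .
    docker service create --name myapp --replicas 3 -p 8080:8080 myapp:healthcheck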
