GCP Load Balancer: 502 Server Error, "failed_to_connect_to_backend" - docker

I have a dockerized Go application running on two GCP instances, everything works fine when using them with their individual external IPs, but when put through the load balancer, they're either slow to answer or it answers a 502 server error. The health checks seems to be ok, so I really don't understand.
In the logs, the error thrown is
failed_to_connect_to_backend
I've already seen other answers on this question, but none of them seems to provide an answer for my case. I cannot modify the way the application is served, so it doesn't seems to be a timeout thing.

To troubleshoot 502 response from the Load Balancer due to "failed_to_connect_to_backend." I would check the followings:
1) Usually, "failed_to_connect_to_backend" error message indicates that the load balancer is failing to connect to backends, investigating URL map rules is also a good point to start. I would also suggest reviewing your Load Balancer's URL map to make sure that Host rules, Path matcher, and Path rules are correctly defined and comply with descriptions in this article.
2) Also check if the backend instances are exhausting their resources, If a backend server is overwhelmed, it will refuse incoming requests, potentially causing the load balancer to give up on it and return the specific 502 error you're experiencing. For Apache, you could use this link and nginx this link. Also, check the output on how many established connections are present at any one time using 'netstat' and watch command.
3) I would also recommend testing again with the HTTP(S) request directly to the instance, request the same URL that reporting 502. You might do this test in another VM instance in your VPC network.

checking whether your backend block google's cloud cdn ip address or not.those addresses can be found here:https://cloud.google.com/compute/docs/faq#find_ip_range

this happened to me more than once, I was using apache in my servers, and the issue was not of CPU, but of configuration,
I am using apache mpm_event in combination with php-fpm and there are many settings that will limit the max amount of requests that you want apache and fpm to allow.
In my case I increased in Apache MPM config MaxRequestWorkers from the default 150 to 600, and in PHP FPM config pm.max_children to 80 (I don't remember what was the default here)
This worked as expected, hope this helps you to extrapolate to your own stack.

Just encountered 502 errors myself on access to a Prometheus pod running on my GKE Standard cluster (exposed through IAP).
The issue was that the configured External HTTP/S Load Balancer's health check was coming back unhealthy. This was despite the Prometheus pod running as expected. After digging into the issue I found out that the GCP auto-generated health check was faulty, it was checking URL / instead of /-/ready. When I deleted the Prometheus k8s Ingress resource (which auto-generates GCPs LB and Health Check) and recreated it - the issue was resolved (after a few minutes of resource propagation).

Related

AWS ELB/ECS Http response headers changed

Some context here:
An old Symfony app is used in multiple EC2 instances. Handles millions of requests each day without issues.
For dev purposes, the app was added to a container and that container is used locally by the developers without having to install all the requirements. The dockerized app uses the same nginx/supervisor/php-fpm configs that productive ec2 instances.
To make easier some dev processes, it was decided to create multiple dev environments using AWS Fargate, instead of EC2 instances.
The image is pushed to ECR and is deployed using FARGATE strategy to clusters.
The approach perhaps is too much, since we have 1 Cluster running 1 service only with 1 task. That Service uses an ELB -> Target group.
The application is working fine, but after some time (hours, or days), some requests are returned with different headers. The response is a JSON, but the content type is returned as HTML, other headers are dropped from the request like access-control-allow-headers, access-control-allow-credentials, access-control-allow-methods, triggering a CORS error in the client's browser.
The weird part is that if 1 page creates 10 requests to this service, 9 will work correctly, but 1 request will return 200 with different headers. That endpoint consistently will behave in the same way to any user until the task is restarted.
The response headers are returned by the Symfony app. I also tried to force those headers including those in nginx config by default for every response, and the result is the same.
The docker image exposes port 80 to the service.
The load balancer has the rule to forward HTTPS (443) traffic to port 80, so traffic can reach the container.
The load balancer has enabled the use of HTTP/2
The only notable difference besides EC2/Fargate implementations is the load balancer. The production load balancer is an old class load balancer with only HTTP/1 enabled and the new ones are Applications load balancers using HTTP/2.
This is driving me crazy. Has anyone experienced something like this?
Incorrect headers
Correct headers

Why is my lb responding with bad gateway?

I have no webserver runnning on my ec2 machine, but I still get 502 bad gateway from the load balancer in front of it.
Why do I get bad gateway error from the load balancer, but no bad gateway error, when there is no load balancer in front of the ec2 machine, but just a time out.
The load balancer regularly does health checks on its target machines, i.e. it sends an HTTP or TCP request (as you have configured it). This way it knows what machines in its target pool are healthy and can take requests and which can't. It's supposed to balance the load between multiple machines after all.
When your EC2 machine does not have a running web server, its health check fails and it's seen as unavailable by the load balancer. Since apparently there's no other healthy machine in the pool, the load balancer cannot forward any requests to anything, and thus answers with a 502 Bad Gateway status.
The difference to just timing out when you try to access your EC2 machine directly is that in the case of a load balancer, there's still something that can accept and handle HTTP requests and return appropriate HTTP error codes. When you simply have no web server whatsoever, the connection cannot be accepted by anything and thus can only time out.

Cannot access Keycloak account-console in Kubernetes (403)

I have found a strange behavior in Keycloak when deployed in Kubernetes, that I can't wrap my head around.
Use-case:
login as admin:admin (created by default)
click on Manage account
(manage account dialog screenshot)
I have compared how the (same) image (quay.io/keycloak/keycloak:17.0.0) behaves if it runs on Docker or in Kubernetes (K3S).
If I run it from Docker, the account console loads. In other terms, I get a success (204) for the request
GET /realms/master/protocol/openid-connect/login-status-iframe.html/init?client_id=account-console
From the same image deployed in Kubernetes, the same request fails with error 403. However, on this same application, I get a success (204) for the request
GET /realms/master/protocol/openid-connect/login-status-iframe.html/init?client_id=security-admin-console
Since I can call security-admin-console, this does not look like an issue with the Kubernetes Ingress gateway nor with anything related to routing.
I've then thought about a Keycloak access-control configuration issue, but in both cases I use the default image without any change. I cross-checked to be sure, it appears that the admin user and the account-console client are configured exactly in the same way in both the docker and k8s applications.
I have no more idea about what could be the problem, do you have any suggestion?
Try to set ssl_required = NONE in realm table in Keycloak database to your realm (master)
So we found that it was the nginx ingress controller causing a lot of issues. While we were able to get it working with nginx, via X-Forwarded-Proto etc., but it was a bit complicated and convoluted. Moving to haproxy instead resolved this problem. As well, make sure you are interfacing with the ingress controller over https or that may cause issues with keycloak.
annotations:
kubernetes.io/ingress.class: haproxy
...

Azure Cloud Service microservice to K8 Migration

I am in the process of evaluating moving a very large Azure Cloud Service (Web Role) microservice architecture to AKS and have been working through the necessary code and build changes to support it.
In order to replicate the production environment locally for the developers, we run nginx on the host with SSL offloading and DNS (hosted in Azure) A records pointing to 127.0.0.1. When running in the Azure Emulator, the net affect is the ability for both the developer to visit the various web front ends in their browser (i.e. https://myapp.mydomain.dev) as well as hit the various API's in the solution (Web API 2) in Postman/cURL, etc.
Additionally due to how the networking of the Azure Emulator works, the apps themselves can resolve each other through nginx on the host (i.e. MVC app at https://myapp.mydomain.dev can obtain a token from the IdP web API at https://identity.mydomain.dev and then use that token at the API at https://api.mydomain.dev). This is the critical piece and the source of my question.
All attempts at getting the containers themselves to resolve each other the same way the host OS can (browser/Postman, SSL offloading via nginx) have failed. Many of the instructions out there are understandably for linux containers but having adapted the various networking docker-compose settings for the windows container equivalent have not yet yielded an success. In order to keep the development environments aligned with the real work systems, which are tenantized and make sure of the default mapping in nginx to catch all incoming traffic and route it to a specific user facing app/container, it is not as simple as determining a "static" method of addressing these on startup and why the effort was put in to produce the development environments we have today.
Right now when one service (container) attempts to communication with another, it ultimately results in a resolution error as all requests resolve to https://127.0.0.1 due to the DNS A records hosted in Azure for the domain. Since this migration will be a longer term project, the environments need to co-exist so changing the way that DNS is resolved (real DNS A records pointing to 127.0.0.1), host running nginx and handling SSL offloading to the various webroles normally running in the Azure Emulator is not an option.
Is there a way (with Windows containers) to either:
Allow the container to utilize nginx on the host OS transparently (app must still call the API at https://api.mydomain.dev), which will cause the traffic to be routed properly to the correct container/port defined in the docker-compose file?
OR
Run nginx on each container, allowing each container to then resolve and route appropriately without knowing the IP of the other container, possibly through an alias which could be added to the containers nginx.conf before the service starts?
The platform utilizes OAuth2/OIDC and it is critical to maintain the full URL to the other services from the applications perspective. Beyond mirroring production and sandbox environments, this URL's are utilized for redirect URL and post logout redirect URL validation among other things so using "https://myContainerNameForOtherContainerAlias" is not a workable solution.
Will I have the same problem when setting up the AKS environment as well?

Getting Neo4J running on OpenShift

I am trying to get the Bitnami Neo4j image running on OpenShift (testing on my local Minishift), but I am unable to connect. I am following the steps outlined in this issue (now closed), however, now I cannot access the external IP for the load balancer.
Here are the steps I have taken:
Deploy Image (bitnami/neo4j)
Create service for the load balancer,
using the YAML supplied in the issue mentioned
Get the external IP
address for the LB (oc get services) The command in step 3 lists 2
of the same IP addresses, and when I attempt to go to this IP in my
browser it times out.
I can create a route that points to port 7374 on the IP of the LB, but
then I get the same error as reported in the aforementioned issue.
(ServiceUnavailable: WebSocket connection failure. Due to security
constraints in your web browser, the reason for the failure is not
available to this Neo4j Driver. Please use your browsers development
console to determine the root cause of the failure. Common)
Configure neo4j to accept non-local connections. E.g.:
dbms.connector.bolt.address=0.0.0.0:7687
Source: https://neo4j.com/developer/kb/explanation-of-error-websocket-connection-failure/

Resources