AWS ELB/ECS Http response headers changed - docker

Some context here:
An old Symfony app is used in multiple EC2 instances. Handles millions of requests each day without issues.
For dev purposes, the app was containerized, and developers use that container locally so they don't have to install all the requirements. The dockerized app uses the same nginx/supervisor/php-fpm configs as the production EC2 instances.
To simplify some dev processes, it was decided to create multiple dev environments using AWS Fargate instead of EC2 instances.
The image is pushed to ECR and deployed to the clusters using the FARGATE launch type.
The approach is perhaps overkill, since we have 1 cluster running only 1 service with 1 task. That service uses an ELB -> target group.
The application works fine, but after some time (hours or days), some requests are returned with different headers. The response body is JSON, but the content type is returned as HTML, and other headers are dropped from the response, such as access-control-allow-headers, access-control-allow-credentials and access-control-allow-methods, which triggers a CORS error in the client's browser.
The weird part is that if 1 page makes 10 requests to this service, 9 will work correctly, but 1 request will return 200 with different headers. That endpoint will then consistently behave the same way for every user until the task is restarted.
The response headers are set by the Symfony app. I also tried forcing those headers by adding them to the nginx config by default for every response, and the result is the same.
The docker image exposes port 80 to the service.
The load balancer has the rule to forward HTTPS (443) traffic to port 80, so traffic can reach the container.
The load balancer has HTTP/2 enabled.
The only notable difference, besides EC2 versus Fargate, is the load balancer. The production load balancer is an old Classic Load Balancer with only HTTP/1 enabled, and the new ones are Application Load Balancers using HTTP/2.
This is driving me crazy. Has anyone experienced something like this?
Incorrect headers
Correct headers
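One way to narrow this down is to hit the affected endpoint repeatedly and record which responses come back with the wrong content type or without the CORS headers. The sketch below is only an illustration: the function names and the URL you pass to probe are hypothetical, and the header list is taken from the description above. Note that urllib speaks HTTP/1.1, so comparing its results with what the browser sees over HTTP/2 may help show whether the protocol plays a role.

```python
import urllib.request

# CORS headers that the question reports as going missing.
EXPECTED_CORS = (
    "access-control-allow-headers",
    "access-control-allow-credentials",
    "access-control-allow-methods",
)


def header_problems(headers, expected_type="application/json"):
    """Return a list of human-readable problems found in a header mapping."""
    lowered = {name.lower(): value for name, value in headers.items()}
    problems = []
    ctype = lowered.get("content-type", "")
    if not ctype.lower().startswith(expected_type):
        problems.append(f"content-type is {ctype!r}")
    for name in EXPECTED_CORS:
        if name not in lowered:
            problems.append(f"missing {name}")
    return problems


def probe(url, attempts=20):
    """Request `url` several times; return (attempt index, problems) for bad responses."""
    bad = []
    for i in range(attempts):
        with urllib.request.urlopen(url, timeout=10) as resp:
            problems = header_problems(dict(resp.headers.items()))
        if problems:
            bad.append((i, problems))
    return bad
```

Calling probe with the URL of the misbehaving endpoint, once through the load balancer and once directly against the task if it is reachable, should indicate whether the headers are already wrong when they leave the container.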

Related

Why is my lb responding with bad gateway?

I have no web server running on my EC2 machine, but I still get a 502 Bad Gateway from the load balancer in front of it.
Why do I get a Bad Gateway error from the load balancer, but only a timeout and no Bad Gateway error when there is no load balancer in front of the EC2 machine?
The load balancer regularly does health checks on its target machines, i.e. it sends an HTTP or TCP request (as you have configured it). This way it knows what machines in its target pool are healthy and can take requests and which can't. It's supposed to balance the load between multiple machines after all.
When your EC2 machine does not have a running web server, its health check fails and it's seen as unavailable by the load balancer. Since apparently there's no other healthy machine in the pool, the load balancer cannot forward any requests to anything, and thus answers with a 502 Bad Gateway status.
The difference to just timing out when you try to access your EC2 machine directly is that in the case of a load balancer, there's still something that can accept and handle HTTP requests and return appropriate HTTP error codes. When you simply have no web server whatsoever, the connection cannot be accepted by anything and thus can only time out.
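The health check described above can be sketched as a plain TCP probe. This is a minimal illustration of the idea, not how any particular load balancer implements it; the function names and the default timeout are assumptions.

```python
import socket


def tcp_health_check(host, port, timeout=2.0):
    """Return True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers both "connection refused" and a timed-out connect.
        return False


def healthy_targets(targets, timeout=2.0):
    """Keep only the (host, port) pairs that pass the TCP check."""
    return [(host, port) for host, port in targets
            if tcp_health_check(host, port, timeout)]
```

If healthy_targets returns an empty list, a proxy in front of the machines has nothing to forward to, but it is still able to answer the client with an HTTP error status, which a bare machine without a web server cannot do.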

Azure Cloud Service microservice to K8 Migration

I am in the process of evaluating moving a very large Azure Cloud Service (Web Role) microservice architecture to AKS and have been working through the necessary code and build changes to support it.
In order to replicate the production environment locally for the developers, we run nginx on the host with SSL offloading and DNS (hosted in Azure) A records pointing to 127.0.0.1. When running in the Azure Emulator, the net effect is that the developer can both visit the various web front ends in their browser (i.e. https://myapp.mydomain.dev) and hit the various APIs in the solution (Web API 2) with Postman/cURL, etc.
Additionally due to how the networking of the Azure Emulator works, the apps themselves can resolve each other through nginx on the host (i.e. MVC app at https://myapp.mydomain.dev can obtain a token from the IdP web API at https://identity.mydomain.dev and then use that token at the API at https://api.mydomain.dev). This is the critical piece and the source of my question.
All attempts at getting the containers themselves to resolve each other the same way the host OS can (browser/Postman, SSL offloading via nginx) have failed. Many of the instructions out there are understandably for Linux containers, but adapting the various docker-compose networking settings to their Windows container equivalents has not yet yielded any success. In order to keep the development environments aligned with the real-world systems, which are tenantized and make use of the default mapping in nginx to catch all incoming traffic and route it to a specific user-facing app/container, it is not as simple as determining a "static" method of addressing these on startup, which is why the effort was put in to produce the development environments we have today.
Right now, when one service (container) attempts to communicate with another, it ultimately results in a resolution error, as all requests resolve to https://127.0.0.1 due to the DNS A records hosted in Azure for the domain. Since this migration will be a longer-term project, the environments need to co-exist, so changing the current setup (real DNS A records pointing to 127.0.0.1, with nginx on the host handling SSL offloading to the various web roles normally running in the Azure Emulator) is not an option.
Is there a way (with Windows containers) to either:
Allow the container to utilize nginx on the host OS transparently (app must still call the API at https://api.mydomain.dev), which will cause the traffic to be routed properly to the correct container/port defined in the docker-compose file?
OR
Run nginx in each container, allowing each container to then resolve and route appropriately without knowing the IP of the other container, possibly through an alias which could be added to the container's nginx.conf before the service starts?
The platform utilizes OAuth2/OIDC, and it is critical to maintain the full URL to the other services from the application's perspective. Beyond mirroring production and sandbox environments, these URLs are used for redirect URL and post-logout redirect URL validation, among other things, so using "https://myContainerNameForOtherContainerAlias" is not a workable solution.
Will I have the same problem when setting up the AKS environment as well?

google run - does each container instance get 443, if that is what I need

I am trying to understand Google Cloud Run for deploying Docker containers on demand. I may put a load balancer on 443 and so on, but assuming there is no load balancer, will I be able to get port 443 for all of, say, tens or hundreds of instances? Thanks!
It's serverless! It's mysterious and powerful!! In fact, you only have to worry about your code (here, your container with Cloud Run). You have to host a web server that answers HTTP requests (plain HTTP, not HTTPS; by default on port 8080, but you can change it). That's all!!
Then deploy it. The deployment creates a service and a revision. Each new deployment creates a new revision (a unique set of container + parameters; this way, if the new container and/or the new parameters of the new revision break your service, you can easily roll back to a previous stable revision).
When you serve traffic, Cloud Run sits behind the GFE (Google Front End), a Google-wide proxy in charge of SSL management (that's why you don't have to worry about HTTPS in your container) and of routing the traffic to your Cloud Run revisions. The Cloud Run engine is in charge of instance creation (because Cloud Run scales to 0) and of load balancing the traffic between all the created instances. You have nothing to do, it's native.
So, take it easy, that's the future for the developers!
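As a rough illustration of what the container has to do, here is a minimal plain-HTTP server sketch using only the Python standard library. Cloud Run passes the port to listen on in the PORT environment variable (8080 by default); the handler, the response body and the function names are made up for the example.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def listen_port(environ=os.environ):
    """Port to listen on: the PORT environment variable, else 8080."""
    return int(environ.get("PORT", "8080"))


class Hello(BaseHTTPRequestHandler):
    """Answers every GET with a small plain-text body over plain HTTP."""

    def do_GET(self):
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main():
    # Bind to all interfaces; TLS is terminated in front of the container.
    HTTPServer(("0.0.0.0", listen_port()), Hello).serve_forever()
```

Calling main() as the container's entrypoint is enough; there is no certificate handling and no port 443 inside the container.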

Docker Swarm load balance testing using Chrome

I've tried running a simple single-node swarm just like in Docker tutorial part 3, and I've found that if I use curl I jump between the two replicas, but if I use Chrome, then once I open the page, all following requests are handled by the same replica. I'm sure I'm actually hitting it only once, because the counter increases only by 1.
What is happening? Is it some kind of feature in Docker Swarm load balancing? If so, how would it work? No specific request headers are sent to the server, so how would the load balancer recognize me? It can't be the IP, because if I use incognito mode I'll be handled by a different replica and I'll stick to it as long as I'm in incognito.
It's not a Swarm thing, it's a Chrome thing. Curl acts like you'd expect: each command opens a new TCP connection, which goes through the Swarm VIP load balancer as a new connection.
Chrome (and other browsers) use several methods to keep TCP connections open for future requests (HTTP keep-alives, etc.). This is why it stays connected to the same container: the connection is persistent through the LB to the replica. The LB will only shift to the "next in the round-robin pool" for a new connection.
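The difference between the two clients can be reproduced locally. The sketch below is an illustration only (the server, handler and function names are made up): a tiny HTTP/1.1 server reports the client's source port, so requests sent over one kept-alive connection all report the same port, while a fresh connection per request behaves like separate curl invocations.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class PeerPortHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets the server keep connections alive

    def do_GET(self):
        # Reply with the TCP source port of the client connection.
        body = str(self.client_address[1]).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def start_server():
    """Start the demo server on a free local port in a background thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), PeerPortHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ports_with_reused_connection(port, n=3):
    """Browser-like: send n requests over one persistent connection."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    seen = []
    for _ in range(n):
        conn.request("GET", "/")
        seen.append(int(conn.getresponse().read()))
    conn.close()
    return seen


def ports_with_fresh_connections(port, n=3):
    """curl-like: open a new TCP connection for every request."""
    seen = []
    for _ in range(n):
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/")
        seen.append(int(conn.getresponse().read()))
        conn.close()
    return seen
```

A connection-level load balancer only picks a backend when a connection is opened, so the first pattern would stay on one replica and the second would be spread across the pool.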

GCP Load Balancer: 502 Server Error, "failed_to_connect_to_backend"

I have a dockerized Go application running on two GCP instances. Everything works fine when using them through their individual external IPs, but when put behind the load balancer, they're either slow to answer or the load balancer answers with a 502 server error. The health checks seem to be OK, so I really don't understand.
In the logs, the error thrown is
failed_to_connect_to_backend
I've already seen other answers on this question, but none of them seem to provide an answer for my case. I cannot modify the way the application is served, so it doesn't seem to be a timeout thing.
To troubleshoot a 502 response from the load balancer due to "failed_to_connect_to_backend", I would check the following:
1) Usually, the "failed_to_connect_to_backend" error message indicates that the load balancer is failing to connect to the backends. Investigating the URL map rules is a good place to start: review your load balancer's URL map to make sure that the host rules, path matcher, and path rules are correctly defined and comply with the descriptions in this article.
2) Also check whether the backend instances are exhausting their resources. If a backend server is overwhelmed, it will refuse incoming requests, potentially causing the load balancer to give up on it and return the specific 502 error you're experiencing. For Apache, you could use this link, and for nginx this link. Also, check how many established connections are present at any one time using the 'netstat' and 'watch' commands.
3) I would also recommend testing again with an HTTP(S) request sent directly to the instance, requesting the same URL that is reporting the 502. You can run this test from another VM instance in your VPC network.
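For the netstat check in step 2, a small helper can count the established connections on the serving port. This is a sketch that assumes the usual Linux `netstat -ant` column layout (protocol, queues, local address, foreign address, state); the function name is hypothetical.

```python
def count_established(netstat_output, local_port):
    """Count ESTABLISHED TCP connections whose local port is `local_port`."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a tcp/tcp6 row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_addr, state = fields[3], fields[5]
        if state == "ESTABLISHED" and local_addr.rsplit(":", 1)[-1] == str(local_port):
            count += 1
    return count
```

The input can come from subprocess.run(["netstat", "-ant"], capture_output=True, text=True).stdout, sampled every few seconds and compared with the worker limits configured on the web server.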
Check whether your backend blocks Google's Cloud CDN IP addresses. Those addresses can be found here: https://cloud.google.com/compute/docs/faq#find_ip_range
This happened to me more than once. I was using Apache on my servers, and the issue was not CPU but configuration.
I am using Apache mpm_event in combination with PHP-FPM, and there are many settings that limit the maximum number of requests that Apache and FPM will allow.
In my case I increased MaxRequestWorkers in the Apache MPM config from the default 150 to 600, and pm.max_children in the PHP-FPM config to 80 (I don't remember what the default was).
This worked as expected, hope this helps you to extrapolate to your own stack.
Just encountered 502 errors myself on access to a Prometheus pod running on my GKE Standard cluster (exposed through IAP).
The issue was that the configured External HTTP/S Load Balancer's health check was coming back unhealthy, despite the Prometheus pod running as expected. After digging into the issue, I found out that the GCP auto-generated health check was faulty: it was checking the URL / instead of /-/ready. When I deleted the Prometheus k8s Ingress resource (which auto-generates GCP's LB and health check) and recreated it, the issue was resolved (after a few minutes of resource propagation).

Resources