Docker Swarm load balance testing using Chrome

I've tried setting up a simple single-node swarm, just like in Docker tutorial part 3, and I've found that if I use curl I jump between the two replicas, but if I use Chrome then once I open the page, all following requests are handled by the same replica. I'm sure I'm actually hitting it only once, because the counter increases by only 1.
What is happening? Is it some kind of feature in Docker Swarm load balancing? If so, how would it work? No specific request headers are sent to the server, so how would the load balancer recognize me? It can't be the IP, because if I use incognito mode I'll be handled by a different replica, and I'll be stuck with it for as long as I stay in incognito.

It's not a Swarm thing, it's a Chrome thing. Curl acts like you'd expect: each command opens a new TCP connection, which shows up as a new connection going through the Swarm VIP load balancer.
Chrome (and other browsers) have lots of ways to keep TCP connections open for future requests (HTTP keep-alive, etc.). That's why it stays connected to the same container: the connection is persistent through the LB to the replica, and the LB only moves on to the "next in the round-robin pool" for a new connection.
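You can see the difference with curl alone (a sketch; it assumes the tutorial's service is published on localhost:4000 and the page echoes the container hostname):

    # Separate invocations = separate TCP connections, so the VIP load balancer
    # rotates between replicas:
    for i in 1 2 3 4; do curl -s http://localhost:4000/ | grep -i hostname; done

    # One invocation given several URLs reuses the same connection (keep-alive),
    # so every response comes back from the same replica:
    curl -s http://localhost:4000/ http://localhost:4000/ http://localhost:4000/ | grep -i hostname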

Related

AWS ELB/ECS Http response headers changed

Some context here:
An old Symfony app runs on multiple EC2 instances and handles millions of requests each day without issues.
For dev purposes, the app was put into a container so that developers can use it locally without having to install all the requirements. The dockerized app uses the same nginx/supervisor/php-fpm configs as the production EC2 instances.
To simplify some dev processes, it was decided to create multiple dev environments using AWS Fargate instead of EC2 instances.
The image is pushed to ECR and deployed to clusters using the FARGATE strategy.
The approach is perhaps overkill, since we have 1 cluster running only 1 service with 1 task. That service sits behind an ELB -> target group.
The application works fine, but after some time (hours, or days) some requests come back with different headers. The response body is JSON, but the content type is returned as HTML, and other headers are dropped from the response, such as access-control-allow-headers, access-control-allow-credentials and access-control-allow-methods, triggering a CORS error in the client's browser.
The weird part is that if 1 page makes 10 requests to this service, 9 will work correctly, but 1 request will return 200 with different headers. That endpoint will then consistently behave the same way for any user until the task is restarted.
The response headers are set by the Symfony app. I also tried forcing those headers in the nginx config, by default for every response, and the result is the same.
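For what it's worth, the nginx attempt looked roughly like this (a sketch reconstructed from the description; the header values and the php-fpm address are placeholders, not the original config):

    location ~ \.php$ {
        # "always" makes nginx add the header on every response code
        add_header Access-Control-Allow-Headers     "Content-Type, Authorization" always;
        add_header Access-Control-Allow-Credentials "true" always;
        add_header Access-Control-Allow-Methods     "GET, POST, OPTIONS" always;
        include        fastcgi_params;
        fastcgi_pass   127.0.0.1:9000;
    }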
The docker image exposes port 80 to the service.
The load balancer has the rule to forward HTTPS (443) traffic to port 80, so traffic can reach the container.
The load balancer has HTTP/2 enabled.
The only notable difference besides the EC2/Fargate implementations is the load balancer. The production load balancer is an old Classic Load Balancer with only HTTP/1 enabled, and the new ones are Application Load Balancers using HTTP/2.
This is driving me crazy. Has anyone experienced something like this?
(Screenshots of the incorrect and the correct response headers were attached to the original post.)

google run - does each container instance get 443, if that is what I need

I am trying to understand Cloud Run for deploying Docker containers on demand. I could put a load balancer at 443 and all that, but without a load balancer, will I be able to get 443 for all of, say, tens or hundreds of instances? Thanks!
It's serverless! It's mysterious and powerful!! In fact, you only have to worry about your code (here, your container with Cloud Run). You have to host a webserver (plain HTTP, not HTTPS, on port 8080 by default, but you can change it) that answers HTTP requests. That's all!!
Then deploy it. The deployment creates a service and a revision. Each new deployment creates a new revision (a unique set of container + parameters; this way, if the new container and/or the new params of a revision break your service, you can easily roll back to a previous stable revision).
When you serve traffic, Cloud Run sits behind GFE (Google Front End), a Google-wide proxy in charge of SSL management (that's why you don't have to worry about HTTPS in your container) and of routing the traffic to your Cloud Run revisions. The Cloud Run engine is in charge of creating instances (because Cloud Run scales to 0) and of load balancing the traffic between all the created instances. You have nothing to do; it's native.
So, take it easy, that's the future for the developers!
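To make it concrete, a deployment could look something like this (a sketch; the service name, image and region are placeholders):

    # Deploy the container; Cloud Run gives the service an https:// URL on 443
    # and scales instances behind it automatically.
    gcloud run deploy my-service \
        --image gcr.io/my-project/my-image \
        --port 8080 \
        --region us-central1 \
        --allow-unauthenticated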

Docker services stop communicating after some time

I have 6 containers running together in Docker Swarm: Kafka+Zookeeper, MongoDB, A, B, C and Interface. Interface is the only access point from the public - it is the only container that publishes a port, 5683. The interface container connects to A, B and C during startup. I am using a docker-compose file + docker stack deploy, and each service has a name which the interface uses as a hostname. Everything starts successfully and works fine. After some time (20 mins, 1 h, ...) I am no longer able to make requests to the interface. The interface receives my requests, but the application has lost its connection to service A, B, C, or all of them. If I restart the interface, it is able to reconnect to services A, B, C.
I first thought it was an application problem, so I exposed 2 new ports on each service (interface, A, B, C) and connected a profiler and debugger to them. The application is running properly: no leaks, no blocked threads, working normally and waiting for connections. The debugger showed me that when I make a request to the interface and the interface tries to call service A, a "Connection reset by peer" exception is thrown.
During this debugging I found out something interesting. I attached the debugger to the interface when the services started, and the debugger was also disconnected after some time, and I was not able to reconnect it until I made a request to the container's application. The reported problem was a failed handshake.
Another interesting thing I found was that I was not able to reach the interface either. So I used Wireshark to see what was going on: the SYN - ACK was fine, then the application posted some data and the interface responded with FIN, ACK. I assume the same thing happens when the interface tries to call service A and A FINs the connection. The codebase of Interface, A, B and C is the same as far as the netty server is concerned.
Finally, I don't think it's an application issue. Why? I tried deploying the containers not as services: I ran each container separately, published the ports of each, and set the service endpoints to localhost (no overlay network). And it works, the containers run without problems. Also, I didn't mention at the beginning that the Java applications (interface, A, B, C) run without problems when they run as standalone applications - not in Docker.
Could you please help me figure out what the issue could be? Why does Docker close sockets when the overlay network is used?
I am using the newest Docker; I also tried older versions.
Finally, I was able to solve the problem.
What was happening, one more time: the Interface opens a permanent TCP connection to A, B, C. When you run these services A, B, C as standalone Java applications, everything works. When we dockerize them and run them in swarm, it works only for a few minutes. The strange part was that the connection between the Interface and another service was interrupted at the moment a client made a request to the Interface.
After many, many unsuccessful tests and debugging of each container, I tried running each Docker container separately, with mapped ports, and specified localhost as the endpoint (each container exposed its ports and the interface connected to localhost). A funny thing happened: it worked. When you run containers like this, a different network driver is used - the bridge driver. If you run them in swarm, the overlay network driver is used.
So it had to be something with the Docker network, not with the application itself. The next step was a tcpdump from each container after a couple of minutes, when it should stop working. It was very interesting:
Client -> Interface (OK, request accepted)
Interface -> A (forwards the request because it belongs to A)
Interface -> A [POST]
A -> Interface [RESET]
A was resetting the open TCP connection after a couple of minutes without communication. Why?
Docker uses IP Virtual Server (IPVS), and IPVS maintains its own connection table. The default timeout for CLOSE_WAIT connections in the IPVS table is 60 seconds. Hence, when the server sends something after 60 seconds, the IPVS connection is no longer available, the packet looks invalid for a new TCP session, and it gets an RST. On the client side, the connection remains forever in the FIN_WAIT2 state, because the app still has the socket open; the kernel's fin_wait timer kicks in only for orphaned TCP sockets.
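If you want to check the timeouts IPVS is actually using on a Swarm node, something like this should show them (a sketch: ipvsadm must be installed, and the service VIP rules live inside Docker's network namespaces under /var/run/docker/netns/, so the exact namespace to enter depends on your setup):

    # Print the configured IPVS timeouts (TCP / TCP-FIN / UDP)
    sudo ipvsadm -L --timeout

    # List the virtual services programmed inside the ingress sandbox namespace
    sudo nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -L -n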
This is what I read about it and how I understand it. I am not sure my explanation of the problem is correct, but based on these assumptions I implemented a ping-pong between the Interface and the A, B, C services whenever there is no communication for <60 seconds. And it's working.
Got the same issue.
I specified
endpoint_mode: dnsrr
in the properties of the service that plays the "server" role, and it works just fine.
https://forums.docker.com/t/tcp-timeout-that-occurs-only-in-docker-swarm-not-simple-docker-run/58179
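For reference, placing that option in a stack/compose file might look like this (a sketch; the service and image names are placeholders):

    version: "3.3"
    services:
      service-a:
        image: my-registry/service-a:latest
        deploy:
          endpoint_mode: dnsrr   # DNS round-robin instead of a virtual IP through IPVS
          replicas: 1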

Docker Swarm Late Server Startup

I've been using Docker Swarm for a while and I'm really pleased with how simple it is to set up a swarm cluster and run replicated services. However, I've run into a problem that seems like a blocker for my use case.
I'm using docker 1.12 and swarm mode.
My problem is that the internal IPVS load balancer sends requests to tasks whose health status is still "starting", even though my application is not yet properly started.
My application takes some time to start, but the Docker Swarm load balancer starts sending requests as soon as the container is in the "running" state.
After running some tests I realized that if I scale up by one instance, that instance is available to the load balancer immediately, and a client may get a connection-refused response if the load balancer sends the request to the still-starting server.
I've implemented a health check, and I was expecting an instance to only become available to the load balancer after its first successful health check.
Is there any way to configure the load balancer or the scheduler to only send requests to instances that have properly started?
Best Regards,
Bruno Vale
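(For reference, this is the kind of health check being described. Docker 1.12 added the Dockerfile HEALTHCHECK instruction; the endpoint, port and intervals below are placeholders, and whether the 1.12 routing mesh keeps a "starting" task out of the VIP pool is exactly the open question here.)

    # Mark the container healthy only once the app answers on its health endpoint.
    HEALTHCHECK --interval=10s --timeout=5s --retries=3 \
      CMD curl -f http://localhost:8080/health || exit 1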

dns is slow when accessing webapp, normal when pinging?

(This is a follow up to rails app fast on server, but slow when accessed from another machine.)
I have a Rails web app that's incredibly slow when I access it via its hostname, but runs at normal speed when I access it via its IP address (or via localhost, if I access it on the same server machine it's running on). This makes me think the problem is with DNS. (Also, all these machines are on the same corporate intranet.)
However, when I ping the hostname from a terminal, the ping seems to run fine. Does the fact that pinging works suggest that the problem is not with the DNS? (I don't really know much about DNS or servers and networking, so I'm kind of floundering around a bit here.)
Update to add: I also ran a simple "Hello world" Sinatra app, and this also runs super slowly when accessed via hostname (but not when accessed via IP address).
A fast ping from your terminal suggests that DNS resolution between you and your DNS server is fine and that the network between you and the server is fine.
This still says nothing about DNS on your server. Does your server perform any network operations of its own? If so, you need to make sure those destinations are reachable.
I suggest you deploy a simple "hello world" Rails application there and see whether the issue is Rails-related (server-wide) or specific to your application (very easy to do).
The other suggestion is to profile your Rails app and see which operation is taking the time to complete.
Your ping command is probably using cached DNS instead of hitting the DNS server every time. Google "flushdns" to find the right syntax for purging the cache on your particular operating system, then try it. You'll need to do this every time if you want to use ping to gauge DNS response times.
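For example, these are the usual cache-flush commands, plus a way to time a lookup directly (the hostname is a placeholder):

    # Flush the local DNS cache (pick the one for your OS):
    ipconfig /flushdns                      # Windows
    sudo dscacheutil -flushcache            # macOS (some versions also need: sudo killall -HUP mDNSResponder)
    sudo systemd-resolve --flush-caches     # Linux with systemd-resolved

    # Time a lookup against your resolver; the "Query time" line in dig's
    # output shows how long the DNS server itself took to answer.
    dig myapp.corp.example.com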
