Kubernetes Service selector change doesn't take effect on connected clients

I want to display a maintenance page for an application running under Kubernetes while a deployment is in progress. In this “maintenance” window I back up the database, apply schema changes, and then deploy the new version.
I thought I could change the service selector so that it points to an nginx container serving a simple maintenance page while the deployment progresses. Once the deployment has succeeded, I would switch the selector back to point to the pods that do the actual work.
My problem with this approach is that unless I close and reopen the browser that is currently looking at the site, I never see the maintenance page; I’m guessing the browser is keeping a connection open. The public service address doesn’t change throughout this process.
I’m testing this locally on a Docker Kubernetes installation using a Service of type NodePort.
Any ideas on how to get it working or am I flogging a dead horse with this approach?
Regards
Lee

This happens due to a combination of how browsers and k8s services work.
Browsers cache TCP connections to servers: when requesting a page they will leave the TCP connection open, and if the user later requests more pages from the same domain, the browser will reuse the already-open TCP connection to save time.
The k8s service load balancing operates at the TCP layer. When a new TCP connection is received, it will be assigned to a pod from the Service, and it will keep talking to that pod for the entire TCP connection's lifetime.
So, the issue is your browser is keeping TCP connections open to your old pods, even if you modify the service.
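To make the effect concrete, here is a small Python simulation of per-connection balancing. It is only a sketch: `PerConnectionBalancer`, the pod names and the connection ids are hypothetical and do not correspond to any Kubernetes API. The point is only that a backend is chosen when a connection opens and is not revisited afterwards.

```python
class PerConnectionBalancer:
    """Toy model of Layer 4 balancing: one backend per TCP connection."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._pinned = {}   # connection id -> backend chosen at connect time
        self._next = 0      # round-robin counter for new connections

    def set_backends(self, backends):
        # Models a selector change: only future connections are affected.
        self.backends = list(backends)

    def request(self, conn_id):
        # An existing connection keeps talking to the backend it was given.
        if conn_id not in self._pinned:
            self._pinned[conn_id] = self.backends[self._next % len(self.backends)]
            self._next += 1
        return self._pinned[conn_id]


service = PerConnectionBalancer(["app-pod"])
first = service.request("browser-keepalive-conn")    # connection opens here
service.set_backends(["maintenance-pod"])             # selector is switched
second = service.request("browser-keepalive-conn")   # same open connection
fresh = service.request("new-browser-conn")          # brand-new connection

print(first, second, fresh)  # app-pod app-pod maintenance-pod
```

The browser that kept its connection open still reaches `app-pod` after the switch, while a new connection lands on `maintenance-pod`, which matches the behavior described in the question.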
How can we fix this?
Non-solution #1: have the browser not cache connections. As far as I know there's no way to do this, and you don't want it anyway because it'll make your site slower. Also, HTTP caching headers have no impact on this. Browsers always cache TCP connections. A no-cache header will make the browser request the page again, but over the already-open connection.
Non-solution #2: have k8s kill TCP connections when updating the service. This is not possible and is not desirable either because this behavior is what makes "graceful shutdown / request draining" deployment strategies work. See issue.
Solution #1: Use Layer 7 (HTTP) load balancing instead of Layer 4 (TCP) load balancing, such as nginx-ingress. L7 load balancing routes traffic to pods "per HTTP request", instead of "per TCP connection", so you won't have this problem even if browsers keep TCP connections open.
Solution #2: do this from your application instead of from k8s. For example, have an "in-maintenance" DB flag, check it on every request and serve the maintenance page if it's set.
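As a rough illustration of Solution #2, the sketch below wraps a Python WSGI application in a middleware that checks a flag on every request. Everything here is an assumption made for the example: the application is not necessarily written in Python, `maintenance_middleware` and `real_app` are hypothetical names, and the `flag` dictionary stands in for the "in-maintenance" database flag.

```python
from wsgiref.util import setup_testing_defaults

MAINTENANCE_PAGE = b"<h1>Down for maintenance</h1>"


def maintenance_middleware(app, is_in_maintenance):
    """Wrap a WSGI app and serve a 503 page whenever the flag is set."""
    def wrapped(environ, start_response):
        # Checked on every request, so open keep-alive connections see it too.
        if is_in_maintenance():
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/html; charset=utf-8"),
                ("Content-Length", str(len(MAINTENANCE_PAGE))),
                ("Retry-After", "120"),
            ])
            return [MAINTENANCE_PAGE]
        return app(environ, start_response)
    return wrapped


def real_app(environ, start_response):
    # Hypothetical stand-in for the actual application.
    body = b"Hello from the real application"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call(app):
    """Invoke a WSGI app once and return (status, body) for inspection."""
    environ = {}
    setup_testing_defaults(environ)
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    body = b"".join(app(environ, start_response))
    return seen["status"], body


flag = {"in_maintenance": False}   # stand-in for the database flag
app = maintenance_middleware(real_app, lambda: flag["in_maintenance"])

normal = call(app)                 # served by the real application
flag["in_maintenance"] = True
during = call(app)                 # served by the maintenance page
```

Because the check runs per request rather than per connection, it does not matter whether the browser reuses an existing TCP connection. In a real deployment the flag lookup would probably need caching so that it does not add a database round trip to every request.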

Here is how Services in Kubernetes work: they are basically dumb load balancers that forward requests to pods, typically in a round-robin or random fashion depending on the proxy mode, and they select which pods to forward to based on labels, as you have already figured out.
Now here is how HTTP/TCP works: I open the browser to visit your website www.example.com, TCP goes through its SYN, SYN-ACK, ACK handshake, and I receive the data.
In your case, once I open your website I get a reply from a certain pod based on how the Service routed me, and that's it; no further communication is made.
Afterwards you remove the functional pods from the Service and add the maintenance page, which will only be shown to new clients connecting to your website.
That is, if I requested your website and then you changed all the code and restarted NGINX, I would not receive the new content unless I refreshed.

First of all make sure the content you are serving is not cached.
Second, make sure to close all open TCP connections when you shut down your pods. The steps should be as follows:
1. Change the service selector to route traffic to the maintenance pods.
2. Gracefully shut down the running pods (this includes closing all open TCP connections).
3. Do the maintenance.
4. Change the service selector back.
As an alternative approach, you can use an ingress controller. That won't have this problem, because it doesn't maintain an open TCP connection to the pods.
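The selector switch in the steps above could be scripted roughly as follows. This is a sketch under assumptions: the Service name `web` and the `app` label values are hypothetical, and it only builds the arguments for `kubectl patch service` without running anything. It also assumes both sets of pods are distinguished by the value of the same label key, since the default patch strategy merges keys rather than replacing the whole selector.

```python
import json


def selector_patch(labels):
    """Build a patch body that sets spec.selector on a Service."""
    return json.dumps({"spec": {"selector": labels}})


def kubectl_patch_args(service_name, labels):
    """Return the argv for `kubectl patch service <name> -p <json>`."""
    return ["kubectl", "patch", "service", service_name,
            "-p", selector_patch(labels)]


# Hypothetical names: a Service "web" whose pods carry an "app" label.
to_maintenance = kubectl_patch_args("web", {"app": "maintenance"})
back_to_app = kubectl_patch_args("web", {"app": "web"})

print(to_maintenance)
print(back_to_app)
```

Each list can be handed to `subprocess.run` once the names match the real cluster. The graceful shutdown of the old pods still has to happen between the two patches for connected clients to notice the change.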

Related

launch/scale a containerized app on http request

Sorry if my subject was already handled elsewhere but I don't see where to start the search.
I have an app split into 3 containers: front (Angular) / back (Node.js) / mysql. This is a demo that will be available on a website.
The app will be served by another server, and I want to launch it in a separate window with an HTTP request from my website (a button). As the user will test with his own data (a video file and management of fictitious users), I want to erase everything after he leaves.
Question 1: Is it possible to launch the set of containers on an HTTP request (and how)?
Question 2: How do I erase the data? (By destroying the containers after a timeout?)
Question 3: Is launching a set of containers for each user a good way to handle several users at the same time? I looked at Kubernetes but didn't find a metric to scale up on HTTP requests. Moreover, how do I redirect each user to his own set of containers?
Launching on an HTTP request is not how Kubernetes is normally used. You usually deploy there with the kubectl command, and it takes a few minutes to start all the pods you need and for the services to become accessible.
Destroying a pod (in Kubernetes, running containers are called pods) is usually done with a kubectl command as well.
Creating a pod per user is certainly not what Kubernetes was designed for.
Kubernetes has autoscaling on load, but it is based on a load balancer, and every pod in the autoscaler should be able to serve any request. So Kubernetes is more like a constantly monitoring, automated DevOps engineer that also autoscales when necessary (most often on CPU usage, but not limited to CPU).

Connection reset by peer on Azure

I have a web application running on an App Service on Azure cloud.
On the back end I'm using a TCP connection to our database (Neo4j graph DB). The best practice is to open the TCP connection and keep it alive so that queries are more responsive.
The issue I encountered is that the database is logging the exception "Connection reset by peer".
Reading on the web, I found out that Azure may have a TCP idle timeout configured by default, reportedly 4 minutes, which could be the root cause of my issue.
Does anyone know how to configure TCP keep-alive so that connections stay open indefinitely for App Services on Azure?
I found out how to do it on Google Cloud but nothing about Azure.
Thank you in advance.
OaicStef
From everything I can find that is not an adjustable setting. Here is the forum link that says it will not be changing and that is a couple years old at this point. https://social.msdn.microsoft.com/Forums/en-US/32b76114-67a4-4e6b-ac45-61b0f0a0829f/changing-the-4-minute-request-time-out-for-app-services?forum=windowsazurewebsitespreview
I think you are going to have to add logic to your app that tests the connection, if it has been closed then either reopen it or create a new one. I don't know what language you are using to make any suggestions there.
Edit
I will add that the total number of TCP connections that can be open on a single App Service is about 6k, at least on the S1 tier. Keep that in mind, because if you don't have pooling on the server side or you are not disposing of connections, you will exhaust the TCP pool and start getting errors. I recommend you configure an alert for that.
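Since the language in use is not known, here is a language-neutral idea sketched in Python for the "test the connection and reopen it" logic suggested above. All names are hypothetical: `ReconnectingClient` is not part of any Neo4j driver, `connect` is whatever factory creates a connection in the real application, and a real driver will likely raise its own exception types rather than Python's built-in `ConnectionError`.

```python
class ReconnectingClient:
    """Reuse one connection, but reopen it if the peer has dropped it."""

    def __init__(self, connect, retries=1):
        self._connect = connect   # hypothetical factory returning a connection
        self._retries = retries   # how many reconnect attempts per operation
        self._conn = None

    def run(self, operation):
        """Call operation(conn); on a dropped connection, reconnect and retry."""
        attempts = 0
        while True:
            if self._conn is None:
                self._conn = self._connect()
            try:
                return operation(self._conn)
            except ConnectionError:
                # Covers ConnectionResetError and BrokenPipeError.
                self._conn = None
                attempts += 1
                if attempts > self._retries:
                    raise


class FakeConnection:
    """Test double: a connection that may already have been reset."""

    def __init__(self, broken):
        self.broken = broken

    def query(self, text):
        if self.broken:
            raise ConnectionResetError("Connection reset by peer")
        return "ok: " + text


# The first connection has been idle too long; the second one is healthy.
connections = iter([FakeConnection(broken=True), FakeConnection(broken=False)])
client = ReconnectingClient(lambda: next(connections))
result = client.run(lambda conn: conn.query("RETURN 1"))
print(result)  # ok: RETURN 1
```

Retrying is only safe for operations that can be repeated without side effects, so writes may need more care than this sketch takes.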

Load balancer and http download streams on docker swarm

I have put together an architecture that at high level is best described below
Five node docker swarm cluster
Have say 5 instances of my dockerized micro service running one copy on each of the swarm nodes
The service offers functionality via REST end points
One such piece of functionality is downloads, and they work perfectly. I wrote some code in Scala/Play Framework, dockerized the service and deployed it.
I also know that since I use swarm, it internally does load balancing per request for me.
I have some questions about WebSockets and about how the load balancer does not ruin things during a download.
I start a 5GB file download and it works. I am using an HTTP stream or chunked transfer; I guess it does not matter. Once my REST endpoint for download is hit, the TCP connection remains open until the server closes it. Is that the reason swarm load balancing does not interfere? In short, each time a client makes an HTTP call, swarm load balances it, but once the TCP socket is established, as in the download example, the request is served by one node because the connection is not re-established during the download?
If a client opens a WebSocket, it will hit one of the swarm nodes where the service is running, and since the WebSocket connection stays open, the same service instance will push the notifications?
If for some reason the WebSocket dies, the client might establish a new connection, but the request might end up on some other service instance and will stay there until a new connection is established again?
Are the above 3 points correct in my understanding? Is there some reading material or blog post that elaborates on this?
Maybe use nginx as a proxy load balancer in ip_hash mode. From the nginx documentation:
Specifies that a group should use a load balancing method where requests are distributed between servers based on client IP addresses. The first three octets of the client IPv4 address, or the entire IPv6 address, are used as a hashing key. The method ensures that requests from the same client will always be passed to the same server except when this server is unavailable. In the latter case client requests will be passed to another server. Most probably, it will always be the same server as well.
http://nginx.org/en/docs/http/ngx_http_upstream_module.html#ip_hash
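The idea behind the quoted description can be sketched in Python as below. This is not nginx's implementation: the hash function (`zlib.crc32`) is a stand-in chosen for the example, and `pick_server` and the server names are hypothetical. Only the choice of key, the first three octets of an IPv4 address or the whole IPv6 address, follows the documentation.

```python
import ipaddress
import zlib


def pick_server(client_ip, servers):
    """Map a client address to a server using an ip_hash-style key."""
    addr = ipaddress.ip_address(client_ip)
    if addr.version == 4:
        key = addr.packed[:3]    # first three octets of the IPv4 address
    else:
        key = addr.packed        # the entire IPv6 address
    # crc32 is an assumption for this sketch, not the hash nginx uses.
    return servers[zlib.crc32(key) % len(servers)]


servers = ["node-1", "node-2", "node-3", "node-4", "node-5"]
a = pick_server("203.0.113.7", servers)
b = pick_server("203.0.113.200", servers)
print(a == b)  # True: both clients share the first three octets
```

Because the key ignores the last IPv4 octet, every client in the same /24 network lands on the same server, which is worth keeping in mind when many users sit behind one NAT or proxy.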

AWS ELB HealthCheck Improvements

All,
We recently had an issue where the ELB HealthCheck failed to cover a certain scenario, which caused an application impact.
Can anyone suggest a fault-tolerant approach to handle this?
We have a Node.js app running on port 80.
We have 3 instances in the Target Group, which is registered with the ELB.
The ELB HealthCheck was configured to hit the root path on port 80 and report success if it gets HTTP 200.
Recently the disk on one of the nodes was 100% full on the application mount, while the root mount still had space.
Although the HealthCheck was succeeding as far as the ELB was concerned, the server didn't respond for any other services, so it was effectively unhealthy. This means that some requests succeeded but others failed (those that were routed to the disk-filled server).
We did receive notifications from other monitoring systems about the disk filling up, but due to overwhelming emails and limited resources they were missed.
Is there any way we can improve the HealthCheck strategy so that these scenarios are reported to the Auto Scaling Group or ELB, so that these nodes are removed and replaced automatically?
Rather than just checking that the index.htm page is returning a 200 response, you can configure Elastic Load Balancing to point to a custom Health Check page (e.g. healthcheck.php).
You could run some code on that page to test the general health of the application (database connectivity, disk space, free memory). If everything checks out OK, return a 200 response. If something is wrong, return a 500 response. This will cause the Load Balancer to treat the instance as Unhealthy and it will stop serving traffic to the instance.
If Auto Scaling is configured to use the ELB Health Check, then Auto Scaling will terminate the unhealthy instance and automatically replace it with a new instance.
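A deeper health check of the kind described above might look like the following Python sketch. The application in the question is Node.js, so this only illustrates the logic: the `/healthcheck` path, the `MOUNTS` list, the 5% threshold and all function names are assumptions, and a real check would probably also test database connectivity and memory.

```python
import json
import shutil
from http.server import BaseHTTPRequestHandler

MOUNTS = ["/"]              # hypothetical: add the application data mount here
MIN_FREE_FRACTION = 0.05    # hypothetical threshold: at least 5% free space


def check_disks(mounts, min_free_fraction, usage=shutil.disk_usage):
    """Return a list of problems, one per mount that is too full."""
    problems = []
    for mount in mounts:
        stats = usage(mount)
        if stats.total and stats.free / stats.total < min_free_fraction:
            problems.append(f"{mount}: low disk space")
    return problems


def health_response(problems):
    """Translate the problem list into an HTTP status code and JSON body."""
    status = 200 if not problems else 500
    body = json.dumps({"healthy": not problems, "problems": problems})
    return status, body.encode("utf-8")


class HealthHandler(BaseHTTPRequestHandler):
    """Serves the health check on /healthcheck for the load balancer."""

    def do_GET(self):
        if self.path != "/healthcheck":
            self.send_error(404)
            return
        status, body = health_response(check_disks(MOUNTS, MIN_FREE_FRACTION))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

The handler can be served with `http.server.HTTPServer(("", 8080), HealthHandler).serve_forever()`, where the port is again an assumption. Listing the application mount in `MOUNTS` is what would have caught the full-disk case from the question, since the root mount alone still had space.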

Asynchronously send file over TCP connection

so I'm making an iOS app, but this is more of a general networking question.
So what I have is one phone that acts as the server and then a bunch of phones connect to the phone as the client. Basically it's a game/music sharer.
It's kind of hard to really get into the semantics of it, but that isn't important.
What is important is that the server and client are repeatedly sending each other commands and positions rapidly over a TCP connection, and sometimes the client wants to send the server a music file (4MB usually) to play as the music.
The problem I initially encountered was that when sending the large file, it would hang the sending of commands from the client to the server.
My naive solution was to create another socket to connect to the server to send the file to the server, the server would check the IP of the new socket, and if it has the IP of an existing connection then it would just tie it to that connection, receive the file, and then disconnect the socket.
But the problem with this is that the socket takes 1-2 seconds to connect, and I'm aware that man-in-the-middle attacks can occur.
Is there a more elegant solution to this problem?
I would not call your solution naive, this is largely how FTP works, separating data and control paths is a good design pattern in my view.
I wouldn't worry about the man in the middle thing. If you wanted, you could add a command to the client that it responds to over the data connection with a secret the server supplies, this would let you associate the connections without using the ip addressing.
If the delay is a problem then why not establish both connections at the start, the overhead of a few tcp connections on an operating system is not usually significant.
You could also use the two connections for both commands and data, alternating between them. Since both the server and client know when a connection is busy they can choose to use the idle one. The advantage of this is that it will keep both connections busy to ensure they are both known to be working.
You probably should also use a different thread for each socket but I suspect you are doing this since it won't work too well without it.
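The secret-based association of the two connections mentioned above could be sketched like this. The app in question is an iOS app, so the Python here only illustrates the bookkeeping on the server side: `TransferRegistry` and the client id are hypothetical names, and the transport that carries the token over each socket is left out.

```python
import secrets


class TransferRegistry:
    """Server-side map from one-time secrets to known control connections."""

    def __init__(self):
        self._pending = {}   # token -> client id of the control connection

    def issue_token(self, client_id):
        """Create a secret and send it to the client over the control socket."""
        token = secrets.token_hex(16)
        self._pending[token] = client_id
        return token

    def claim(self, presented_token):
        """Called when a data socket presents a token; each token works once."""
        return self._pending.pop(presented_token, None)


registry = TransferRegistry()
token = registry.issue_token("player-1")    # sent over the command connection
owner = registry.claim(token)               # presented on the data connection
replayed = registry.claim(token)            # second use is rejected
print(owner, replayed)  # player-1 None
```

Matching on a token instead of the IP address also keeps working when several clients share one address, for example behind the same router.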

Resources