Is it normal to have to stop and start my Azure Kubernetes cluster to fix problems?

I have encountered 3 different problems so far with Azure Kubernetes where stopping and starting the cluster fixed the problem.
Is this normal? Should stopping and starting the cluster be done nightly or something?

Related

Digital Ocean Kubernetes Node scale -> 503

I'm running a Rails container in a DigitalOcean Kubernetes cluster with a horizontal pod autoscaler.
Sometimes when our nodes are scaled, customers encounter 503s.
When the replica count is increased but the node count stays the same, this does not happen.
I have set the readiness probe to a fairly high value (30s), and as I said, the scheduling of new pods in general seems to work without errors.
Has anyone seen this behaviour or maybe solved it?
Any help or hints would be appreciated.

How to kill a multi-container pod if one container fails?

I'm using the Jenkins Kubernetes Plugin, which starts pods in a Kubernetes cluster that serve as Jenkins agents. Each pod contains 3 containers, which provide the slave logic, a Docker socket, and the gcloud command line tool.
The usual workflow is that the slave does its job and notifies the master that it has completed. Then the master terminates the pod. However, if the slave container crashes due to a lost network connection, the container terminates with error code 255, while the other two containers keep running, and so does the pod. This is a problem because the pods have large CPU requests, and the setup is cheap only as long as the slaves run just when they have to; having multiple machines running for 24 hours or over the weekend does noticeable financial damage.
I'm aware that starting multiple containers in the same pod is not the fine art of Kubernetes, but it's acceptable if I know what I'm doing, and I assume I do. I'm sure it's hard to solve this differently given the way the Jenkins Kubernetes Plugin works.
Can I make the pod terminate if one container fails, without it respawning? A solution with a timeout is acceptable as well, though less preferred.

Disclaimer: I have rather limited knowledge of Kubernetes, but given the question:
Maybe you can run a fourth container that exposes one simple "liveness" endpoint.
It can run ps -ef, or use any other way to contact the 3 existing containers, just to make sure they're alive.
This endpoint could return "OK" only if all the containers are running, and "ERROR" if at least one of them was detected as crashed.
Then you could set up a Kubernetes liveness probe so that it would stop the pod when that fourth container returns an error.
Of course, if this 4th process crashes by itself for any reason (it shouldn't, unless there is a bug or something), the liveness probe won't get a response and Kubernetes is supposed to stop the pod anyway, which is probably what you really want to achieve.
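
A minimal sketch of what that fourth container could run, in Python. Everything named here is an assumption for illustration: the process names in WATCHED, port 8080, and the idea that the pod sets shareProcessNamespace: true so that ps can actually see the other containers' processes (by default it cannot). One caveat worth checking before relying on this: a failing liveness probe makes Kubernetes restart the container the probe belongs to, so on its own it covers the detection part, not necessarily tearing down the whole pod.

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical process names, one per watched container; adjust to your images.
WATCHED = ["jenkins-agent", "dockerd", "gcloud"]


def running_commands():
    """Return the command line of every process that ps can see."""
    result = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()[1:]  # skip the header row


def health(watched, commands):
    """Return (200, "OK") if every watched name appears in some command line."""
    missing = [name for name in watched if not any(name in c for c in commands)]
    if missing:
        return 500, "ERROR: not running: " + ", ".join(missing)
    return 200, "OK"


class LivenessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health(WATCHED, running_commands())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())


def serve(port=8080):
    """Block forever, answering liveness requests on the given port."""
    HTTPServer(("0.0.0.0", port), LivenessHandler).serve_forever()
```

Calling serve() as the container's entry point and pointing an httpGet liveness probe at that port would give the "OK"/"ERROR" behaviour described above; any non-2xx status counts as a probe failure.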

Docker Swarm CPU overload on deploy with Spring Boot containers

I have created a number of Spring Boot applications, which all work like magic in isolation or when started up manually one after the other.
My challenge is that I want to deploy a stack with all the services in a Docker Swarm.
Initially I didn't understand what was going on, as it seemed like all my containers were hanging.
It turns out that starting a single Spring Boot application maxes out my CPU for a good many seconds (20s+ to start up).
Now the issue is that Docker Swarm launches 10 of these containers simultaneously, my load average goes above 80, and the system grinds to a halt. The container HEALTHCHECKs start timing out and eventually Docker restarts them. This is an endless cycle that may or may not stabilize, and if it does stabilize it takes a minimum of 30 minutes. So much for micro services vs big fat Java EE applications :(
Is there any way to convince Docker to roll out the containers one by one? I'm sure this will help a lot.
There is a rolling update parameter - https://docs.docker.com/engine/swarm/swarm-tutorial/rolling-update/ - but it does not seem applicable to the initial deployment.
Your help will be greatly appreciated.
I've also tried systemd (which isn't ideal for distributed micro services). It worked slightly better than Docker, but it has the same issue when deploying all the applications at once.
Initially I wanted to try Kubernetes, but I've got enough on my plate and if I can get away with Docker Swarm, that would be awesome.
Thanks!
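
The "one by one" rollout asked about above can be sketched outside of Swarm as a small driver loop. This is only an illustration under stated assumptions, not a Swarm feature: start and is_healthy are hypothetical callables you would supply yourself (for example, start might shell out to docker service scale <service>=1, and is_healthy might poll the application's own health URL), and the timeout values are placeholders.

```python
import time


def start_one_by_one(services, start, is_healthy,
                     timeout=120.0, poll=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Start each service only after the previous one reports healthy.

    start(name) and is_healthy(name) are caller-supplied (hypothetical) hooks.
    Raises TimeoutError if a service is not healthy within `timeout` seconds.
    """
    started = []
    for name in services:
        start(name)
        deadline = clock() + timeout
        while not is_healthy(name):
            if clock() >= deadline:
                raise TimeoutError(f"{name} not healthy after {timeout}s")
            sleep(poll)
        started.append(name)
    return started
```

Because only one application is booting at any time, the CPU spike from each startup no longer overlaps with the others, at the cost of a longer total deployment time.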

Timeouts accessing services on swarm published ports

We're using Docker in Swarm mode to host a number of services. Recently we've hit an issue where we get intermittent connection timeouts (sometimes as often as every other request) when trying to access some services.
We've upgraded the environment to the latest version of Docker (currently Docker version 17.03.0-ce, build 3a232c8), done a staggered reboot of all servers (trying to maintain uptime if possible even though this environment is technically a test environment) and tried stopping / starting services as well, but the issue still persists.
I'm confident the issue is not related to the services running in Docker, as we're seeing it on various services which have until recently been running without issue. I think it's more likely an environmental issue, or some problem with Docker's internal routing in the overlay network, but I'm not sure how to prove or solve this.
Any advice on how to diagnose or solve this would be greatly appreciated!

Run multiple instances on a Mesos slave node

I'm building an Apache Mesos cluster with 3 masters and 3 slaves. I installed Docker on the slave nodes and it's able to create instances which are visible in Marathon. Then I tried to install an HAProxy server on top of it, but that didn't work out that well, so I deleted it.
The problem is that since then I'm only able to scale my application to a maximum of 3 instances, the exact number of nodes. When I want to scale to 5, there are 2 instances that are stuck at the 'deploying' stage.
Does anyone know how to fix this issue so that I'm able to create more instances again?
Thank you

To do that, you really need to set up Marathon service discovery with HAProxy, because ports that are not known in advance will be bound to your containers on the same slave machine.
First, install HAProxy on every slave. If you need SSL, you will need to build HAProxy with SSL support.
Then, once the HAProxy service is running, follow this well-explained tutorial to enable Marathon service discovery on every slave:
HAProxy marathon Service discovery
Pay close attention to the tutorial; it is very well explained and straightforward.
