How to increase Node Drain Timeout for AKS node upgrade rollout

Problem:
PDB with maxUnavailable of 1
Pods have a LONG termination grace period of 15 hours (a technical requirement of the business case at hand, which handles stateful connections; it is also imposed by an external dependency, so there is no way for me to change it)
The node drain timeout appears to be only 1 hour
During an upgrade, a pod that needs to be evicted from a node can take longer than the node drain timeout, which yields the following error:
(UpgradeFailed) Drain of NODE_NAME did not complete pods [STS_NAME:POD_NAME]: Pod
POD_NAME still in state Running on node NODE_NAME, pod termination grace period 15h0m0s was
greater than remaining per node drain timeout. See http://aka.ms/aks/debugdrainfailures
Code: UpgradeFailed
After that, the cluster is left in a failed state.
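For what it's worth, this is roughly how I confirm the failed state afterwards; the resource group and cluster name are placeholders:

# Shows "Failed" once the upgrade gives up on the drain.
az aks show --resource-group myRG --name myCluster \
  --query provisioningState --output tsv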
Since the pods' grace period is not in my control, I would like to increase the node drain timeout to 31 hours, as there can be two of those long-grace-period pods on a single node. I haven't been able to find anything about the node drain timeout, though.
I can't even figure out whether it's part of Kubernetes or AKS specifically.
How do I increase the per-node drain timeout so that my long-grace-period pods don't interrupt my node upgrade operations?
EDIT:
In the kubectl CLI reference, the drain command takes a --timeout parameter. As I don't invoke the drain myself, I don't see how this helps me; it led me to believe that, if anywhere, this needs to be dealt with on the AKS side of things.
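For reference, this is the flag I mean, if one were invoking the drain manually (the node name is a placeholder):

# kubectl's own drain timeout; not applicable here, since AKS performs the
# drain itself during an upgrade.
kubectl drain "$NODE_NAME" --timeout=31h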

Not an answer to the actual question, but a possible workaround (sketched in commands after the list of steps):
Scale up to double the number of nodes you need to run the workload
Manually evict the first half of the nodes that need upgrading
Start the upgrade
The upgrade fails somewhere in the second half
Manually evict the second half of the nodes
Start the upgrade again
The upgrade completes
Scale back down to the required number of nodes
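A rough sketch of those steps with the az CLI and kubectl; the resource group, cluster, node pool, node names, node counts and target version are all placeholders:

# 1. Double the node pool so the workload has somewhere to go.
az aks nodepool scale --resource-group myRG --cluster-name myCluster \
  --name nodepool1 --node-count 6

# 2. Manually drain the first half of the old nodes; 31h covers two
#    15-hour-grace-period pods per node.
kubectl drain aks-nodepool1-12345678-vmss000000 \
  --ignore-daemonsets --delete-emptydir-data --timeout=31h

# 3. Start the upgrade; it fails once it reaches the still-populated second half.
az aks upgrade --resource-group myRG --name myCluster --kubernetes-version 1.27.7

# 4. Drain the second half the same way, re-run the upgrade, then scale back down.
az aks nodepool scale --resource-group myRG --cluster-name myCluster \
  --name nodepool1 --node-count 3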
Disadvantages:
Doubled infrastructure cost for the duration of the upgrades
Lots of manual steps for upscaling, evicting, upgrading, evicting again, upgrading again, downscaling
Might require additional quota requests to actually perform due to possible vCore quota limits
The manual nature of this upgrade prevents the cluster from auto-upgrading successfully
At least doubles the time for the entire operation, because the entire workload needs to be evicted completely twice, instead of just once
It's a terrible workaround, but a workaround.

Related

Cloud Run - OpenBLAS Warnings and application restarts (Not a cold start issue)

Problem
I have had an application running on a Cloud Run instance for 5 months now.
The application has a startup time of about 3 minutes, and once startup is over it does not need much RAM.
Here are two snapshots of docker stats when I run the app locally (screenshots omitted):
When the app is idle
When the app is receiving 10 requests per second (which is way over our use case for now):
There aren't any problems when I run the app locally; however, problems arise when I deploy it on Cloud Run. I keep receiving "OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k" messages, followed by a restart of the app. This is a problem because, as I said, the app takes up to 3 minutes to restart, during which requests take a long time to be handled.
I already fixed the cold start issue by using a minimum instance count of 1 AND using a Google Cloud Scheduler job to query the service every minute.
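For context, that mitigation is just the built-in minimum-instances setting plus the scheduler ping; roughly this, where the service name and region are placeholders:

# Keep one instance warm at all times to avoid cold starts.
gcloud run services update my-service --region europe-west1 --min-instances 1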
Examples
Here are examples of what I see in the logs.
In the second example, the warnings came once again just after the application restarted, which caused a second restart in a row; this happens quite often.
Also note that these warnings/restarts do not necessarily happen when users are connected to the app; they can happen when the only activity comes from the Google Cloud Scheduler.
I tried increasing the allocated resources to 4 CPUs and 4 GB of RAM (which is huge overkill), and yet the problem remains.
Update 02/21
As of 01/01/21 we stopped seeing this behavior from our Cloud Run service (maybe due to an update, I don't know). I did contact GCP support, but they just told me to raise an issue on the OpenBLAS GitHub repo; since I can't reproduce the behavior, I did not do so. I'll leave the question open, as nothing I did really worked.
OpenBLAS performs high-performance compute optimizations and needs to know the CPU's capabilities to tune itself as well as possible.
However, when you run a container on Cloud Run, it runs inside the gVisor sandbox, which increases the security and isolation of all the containers running on the same serverless platform.
This sandbox intercepts low-level kernel calls and discards the abnormal/dangerous ones. I guess that is why OpenBLAS can't determine the L2 cache size. In your local environment you don't have this sandbox, so the CPU info can be accessed directly.
Why does it restart? It could be a problem with OpenBLAS or a problem with Cloud Run (a suspicious kernel call, so the platform kills the instance and restarts it).
I don't have an immediate solution because I don't know OpenBLAS. I saw similar behavior with TensorFlow Serving, and TensorFlow offers a version compiled without any CPU optimizations: less efficient, but more portable and resilient to different environment constraints. If a similar build exists for OpenBLAS, it would be worth testing.
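Along the same lines, one thing that might be worth trying (an assumption on my part, not something I have verified on Cloud Run) is to limit how much OpenBLAS tunes itself against the detected CPU by pinning its thread count through an environment variable:

# Pin OpenBLAS to a single thread so it depends less on CPU topology detection
# inside the gVisor sandbox. Service name and region are placeholders.
gcloud run services update my-service --region europe-west1 \
  --update-env-vars OPENBLAS_NUM_THREADS=1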

Prometheus alerting when a pod is running for too long

I have run into a bit of trouble with what seems to be an easy question.
My scenario:
I have a k8s Job which can be run at any time (not a CronJob) and which in turn creates a pod to perform some tasks. Once the pod performs its task it completes, thus completing the Job that spawned it.
What I want:
I want to alert via Prometheus if the pod is in a running state for more than 1h, signalling that the task is taking too much time.
I'm interested in alerting ONLY when the pod's running duration exceeds 1h, and in having no alerts triggered once the pod is no longer running.
What I tried:
The following Prometheus metric, which is an instant vector that can be either 0 (pod not running) or 1 (pod running):
kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}
I then tried to use this metric with the following formula, which computes for how long the metric was 1 during the past day:
(1 - avg_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d])) * 86400 > 3600
Because these pods come and go and are not always present, I'm encountering the following problems:
The expression above starts at the value 86400 and only eventually drops once the container is running, so it would trigger an alert right away
The pod eventually goes away, and I would not like to send out false alerts for pods which are no longer running (even though they took over 1h to run)
Thanks to the suggestion from HelloWorld, I think this would be the best solution to achieve what I wanted:
(sum_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d:1s]) > 3600) and (kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}==1)
Count the number of seconds the pod was running during the past day (or 6h/3h) and verify whether that exceeds 1h (3600s)
AND
check whether the pod is still running, so that it doesn't take into account old pods or pods that have already terminated.
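For reference, a minimal sketch of how that expression could be wired into an alerting rule; the file, group and alert names are placeholders I made up:

# Write a rule file around the final expression and validate it with promtool.
cat > pod-running-too-long.rules.yml <<'EOF'
groups:
  - name: pod-duration
    rules:
      - alert: PodRunningTooLong
        expr: |
          (sum_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d:1s]) > 3600)
          and
          (kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"} == 1)
        labels:
          severity: warning
        annotations:
          summary: "Pod matching .+POD-A.+ has been running for more than 1h"
EOF

# Check the rule file syntax before loading it into Prometheus.
promtool check rules pod-running-too-long.rules.yml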

Jenkins: jobs in queue are stuck and not triggered to be restarted

For a while now, our Jenkins has been experiencing critical problems. We have hung jobs, and our job scheduler does not trigger builds. After a restart of the Jenkins service everything is back to normal, but after some period of time all the problems return (this period can be a week, a day, or even less). Any idea where we can start looking? I'd appreciate any help on this issue.
Muatik made a good point in his comment: the recommended approach is to run jobs on agent (slave) nodes. If you already do that, you can look at the following:
Jenkins master machine CPU, RAM and hard disk usage. Access the machine and/or use a plugin like Java Melody. I have seen missing graphics in build test results and stuck builds caused by running out of hard disk space. You could also have hit the RAM or CPU limit for the slaves/jobs you are executing. You may need more heap space.
Look at Jenkins log files, start with severe exceptions. If the files are too big or you see logrotate exceptions, you can change the logging levels, so that fewer exceptions are logged. For more details see my article on this topic. Try to fix exceptions that you see logged.
Go through recently made changes that could be the cause of such behavior, for example new plugins or changes to config files (jenkins.xml).
Look at TCP connections. Run netstat -a: are there suspicious connections (many sockets stuck in CLOSE_WAIT)? See the snippet after this list.
Delete old builds that you do not need.
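For the TCP-connection check above, something along these lines gives a quick picture when run on the Jenkins master:

# Count sockets stuck in CLOSE_WAIT; a large and growing number is suspicious.
netstat -an | grep -c CLOSE_WAIT

# Show which remote endpoints those connections belong to.
netstat -an | awk '$NF == "CLOSE_WAIT" {print $5}' | sort | uniq -c | sort -rn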
We had been facing this issue for the last 4 months and tried everything: changing CPU and memory resources, increasing the desired nodes in the ASG. But nothing seemed to work.
Solution:
1. Go to Manage Jenkins -> System Configuration -> Maven project configuration
2. In the "Usage" field, select "Only build jobs with label expressions matching this node"
Doing this resolved it, and Jenkins is working like a rocket now :)

What's the main advantage of using replicas in Docker Swarm Mode?

I'm struggling to understand the idea of replica instances in Docker Swarm Mode. I've read that it's a feature that helps with high availability.
However, Docker automatically starts a new task on a different node if a node goes down, even with only 1 replica defined for the service, which also provides high availability.
So what's the advantage of having 3 replica instances rather than 1 for an arbitrary service? My assumption was that with more replicas, Docker spends less time creating a new instance on another node in the event of a failure, which aids performance. Is this correct?
What Makes a System Highly Available?
One of the goals of high availability is to eliminate single points of failure in your infrastructure. A single point of failure is a component of your technology stack that would cause a service interruption if it became unavailable.
Let's take your example of a replica that consists of a single instance. Now let's suppose there is a failure. Docker Swarm will notice that the service failed and restart it. The service restarts, but a restart isn't instant. Let's say the restart takes 5 seconds. For those 5 seconds your service is unavailable. Single point of failure.
What if you had a replica count of 3 instances? Now when one of them fails (no service is perfect), Docker Swarm will notice that one of the instances is unavailable and create a new one. During that time you still have 2 healthy instances serving requests. To a user of your service, it appears as if there was no downtime. This component is no longer a single point of failure.
ROMANARMY's answer is very good, and I just wanted to mention that the replicas can be placed on different nodes, so if one of your servers goes down (becomes unavailable), the container (replica) on another server can keep running without a problem.
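For completeness, a minimal sketch of declaring a replicated service; the service name, image and published port are arbitrary:

# Run three replicas of a stateless web service; Swarm spreads them across
# nodes and reschedules a task if it fails or its node goes down.
docker service create --name web --replicas 3 --publish 8080:80 nginx:alpine

# See which nodes the replicas landed on.
docker service ps web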

How many Remote Nodes can Jenkins manage

How many remote nodes can Jenkins manage? Are there any limitations/memory issues?
Which is more effective:
1) 100 nodes with 1 executor per node?
2) 5 nodes with 20 executors per node?
Thanks.
As far as I know, there is no limit on the number of nodes one can have, although your system might eventually feel like saying "enough is enough!". You can run into OS limits such as the number of processes per user (we hit this issue recently, not with Jenkins but with another application: RAM and disk space were fine but the system stopped responding, and we started getting "cannot fork()" errors), the total number of open files, etc. Some of these limits can be raised, but raising them may not be allowed or feasible. You can check the current values as shown below.
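For example, a quick way to check those limits on a given machine (exact values vary per system):

# Maximum number of processes the current user may create.
ulimit -u

# Maximum number of open file descriptors per process.
ulimit -n

# System-wide limit on open file handles (Linux).
cat /proc/sys/fs/file-max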
If a resource (in your case, nodes) is not a constraint, which process wouldn't like to run wild? :) In practice, you generally won't have the flexibility to opt for the first option. In the second case, where you have 5 nodes with 20 executors, all you have to make sure of is not to tie jobs to a particular node unless you have a compelling reason.
Some slaves are faster, while others are slow. Some slaves are closer (network-wise) to the master, others are far away. So doing good build distribution is a challenge. Currently, Jenkins employs the following strategy:
If a project is configured to stick to one computer, that's always honored.
Jenkins tries to build a project on the same computer that it was previously built.
Jenkins tries to move long builds to slaves, because the amount of network interaction between a master and a slave tends to be logarithmic to the duration of a build (IOW, even if project A takes twice as long to build as project B, it won't require double network transfer.) So this strategy reduces the network overhead.
You should also have a look at these links:
https://wiki.jenkins-ci.org/display/JENKINS/Least+Load+Plugin
https://wiki.jenkins-ci.org/display/JENKINS/Gearman+Plugin
