I am writing a Jenkins global pipeline library where I have a stage to deploy my Docker image to a K8s cluster.
So after building my Docker image during the CI process, I am promoting (deploying) the image to multiple environments (sequentially, lower to higher).
So, to get the correct status of the deployment after running
kubectl apply -f Application-k8s-file.yaml
I used the following command in a shell step:
kubectl rollout status deployment deployment_name
Things go well if my deployment has no errors, but if the deployment has some error (maybe a code bug, or the application does not start), then this command kubectl rollout status deployment <deployment name> runs indefinitely (as K8s retries the deployment again and again) and my Jenkins job runs indefinitely (until the job timeout).
So as a workaround I tried putting a timeout on this command, with a calculation something like this:
timeout = (number of pods * liveness probe time) + (number of pods * 10) seconds
Not sure if this calculation is correct or not.
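For illustration, here is the same calculation written as shell arithmetic; the replica count and probe value below are hypothetical placeholders, not figures from a real cluster:
# hypothetical example: 3 replicas, 45s liveness initialDelaySeconds, 10s buffer per pod
pods=3
liveness_delay=45
timeout_value=$(( pods * liveness_delay + pods * 10 ))   # 165 seconds
echo "rollout timeout: ${timeout_value}s"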
My code snippet looks like this:
sh(returnStdout: true, script: "#!/bin/sh +e\n timeout --preserve-status ${timeout_value} kubectl rollout status deployment ${deploymentName} --kubeconfig='/opt/kubernetes-secrets/${env}/kubeconfig' 2>tmpfile; echo \$? > tmpfile1")
// the shell step writes the exit code to tmpfile1, so read back that same file
def readThisFile = readFile "tmpfile1"
def var = readThisFile.trim().toInteger()
if (var == 0) {
    echo "deployment successful"
} else {
    // do something else
}
This works well initially, but later I found that the kubectl rollout status deployment command doesn't give exit code 0 until all the new pods are scheduled and the old ones are completely terminated, which sometimes takes time.
What I basically want is a minimal calculated timeout value.
My K8s file has parameters like this:
spec:
  minReadySeconds: 30
  livenessProbe:
    httpGet:
      path: /ping
      port: 80
    initialDelaySeconds: 45
    periodSeconds: 5
    timeoutSeconds: 60
  name: test-dummyservice
  ports:
  - containerPort: 80
  readinessProbe:
    httpGet:
      path: /health
      port: 80
    initialDelaySeconds: 60
    periodSeconds: 120
    timeoutSeconds: 60
I did not find anything specific related to this in the K8s documentation. Is anyone facing the same challenge?
You should take a look at progressDeadlineSeconds. Once the rollout exceeds this deadline, kubectl rollout status exits with an error:
kubectl rollout status deployment ng
Waiting for rollout to finish: 2 out of 7 new replicas have been updated...
error: deployment "ng" exceeded its progress deadline
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#progress-deadline-seconds
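As a sketch (the deployment name ng and the 600-second value are just examples), you can set the deadline on an existing Deployment with kubectl patch, after which the rollout status command fails fast once the deadline is exceeded:
kubectl patch deployment ng -p '{"spec":{"progressDeadlineSeconds":600}}'
kubectl rollout status deployment ng   # exits with a non-zero code once the deadline is exceeded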
You can add the --timeout flag as shown below; here is the doc: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands
kubectl rollout status deployment deployment_name --watch --timeout=5m
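Since the command exits with a non-zero code when the timeout is reached, you can branch on it directly; a minimal shell sketch:
if kubectl rollout status deployment deployment_name --watch --timeout=5m; then
  echo "deployment successful"
else
  echo "rollout did not complete within 5 minutes"
fi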
If you don't want to wait for the rollout to finish then you can use --watch=false.
kubectl rollout status deployment deployment_name --watch=false
Now you can check with this command repeatedly, for a chosen duration and at a specific interval.
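For example, a minimal polling loop along these lines (the 15-second interval and 20 attempts are arbitrary placeholders):
for i in $(seq 1 20); do
  status=$(kubectl rollout status deployment deployment_name --watch=false)
  echo "$status"
  case "$status" in
    *"successfully rolled out"*) break ;;   # stop polling once the rollout is complete
  esac
  sleep 15
done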
I'm working on the setup of a new Rails project, hosted with Google Kubernetes Engine. Everything was going fine until I switched my deployed server to production mode, with RAILS_ENV=production.
My Kubernetes pods don't reach the ready state anymore. The readiness probe is apparently forbidden from hitting the server, since it returns a 403 code.
When I run kubectl describe pod <name> on a stuck pod, I get this:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m25s default-scheduler Successfully assigned front to gke-interne-pool
Normal Pulling 5m24s kubelet Pulling image "registry/image:latest"
Normal Pulled 5m24s kubelet Successfully pulled image "registry/image:latest"
Normal Created 5m24s kubelet Created container front
Normal Started 5m24s kubelet Started container front
Warning Unhealthy 11s (x19 over 4m41s) kubelet Readiness probe failed: HTTP probe failed with statuscode: 403
The output of kubectl logs <name> for this pod indeed shows no requests from the probe.
But when I launch a console with kubectl exec -it deploy/front -- bash, I can run curl -s http://localhost:3000, which works perfectly, shows up in the logs, and returns 200.
My setup works in development mode but not in production, so the Rails 6 app config is the main suspect. Something that I don't understand about the production mode of Rails 6 forbids my readiness probes from contacting my pod.
Just in case, here is the readiness part of deployment.yaml:
spec:
  containers:
  - name: front
    image: registry/image:latest
    ports:
    - containerPort: 3000
    readinessProbe:
      httpGet:
        path: "/"
        port: 3000
      initialDelaySeconds: 30
      periodSeconds: 15
Another potential cause of readiness 403 errors with Rails 6 is the allowed hosts list defined in config.
Typically, you'll have a line of code in config/environments/production.rb that looks something like:
config.hosts << "www.mydomain.com"
This leads to the rejection of any requests that come from hosts other than "www.mydomain.com". The readiness checks come from a private IP address within your cluster so they're going to be rejected given the above config.
One way to get around this is by adding an additional hosts entry that allows traffic from the cluster's private 10.x.x.x address range:
config.hosts << "www.mydomain.com"
config.hosts << /\A10\.\d+\.\d+\.\d+\z/
I can't spot a specific error that explains why your implementation fails after switching to RAILS_ENV=production. But after looking into the 403 error, I found a workaround that seems to have worked for some users in their use case; you can try it by writing your YAML like this:
readinessProbe:
  httpGet:
    path: "/"
    port: 3000
    scheme: "HTTP"
  initialDelaySeconds: 30
  periodSeconds: 15
Even though I was not able to find an error in your deployment, the 403 could be related to permissions, so validate the route that you configured in Rails and the permissions it has, and check whether these change between the production and development environments.
As a last option, confirm your suspicions about the Rails app itself, because I don't see anything else that should change behaviour when switching the environment variable to production.
I'm trying to evaluate the performance of one of my Go servers running inside a pod. However, I'm receiving an error saying "too many open files". Is there any way to set the ulimit in Kubernetes?
ubuntu#ip-10-0-1-217:~/ppu$ kubectl exec -it go-ppu-7b4b679bf5-44rf7 -- /bin/sh -c 'ulimit -a'
core file size (blocks) (-c) unlimited
data seg size (kb) (-d) unlimited
scheduling priority (-e) 0
file size (blocks) (-f) unlimited
pending signals (-i) 15473
max locked memory (kb) (-l) 64
max memory size (kb) (-m) unlimited
open files (-n) 1048576
POSIX message queues (bytes) (-q) 819200
real-time priority (-r) 0
stack size (kb) (-s) 8192
cpu time (seconds) (-t) unlimited
max user processes (-u) unlimited
virtual memory (kb) (-v) unlimited
file locks (-x) unlimited
Deployment file.
---
apiVersion: apps/v1
kind: Deployment                 # Type of Kubernetes resource
metadata:
  name: go-ppu                   # Name of the Kubernetes resource
spec:
  replicas: 1                    # Number of pods to run at any given time
  selector:
    matchLabels:
      app: go-ppu                # This deployment applies to any Pods matching the specified label
  template:                      # This deployment will create a set of pods using the configurations in this template
    metadata:
      labels:                    # The labels that will be applied to all of the pods in this deployment
        app: go-ppu
    spec:                        # Spec for the container which will run in the Pod
      containers:
      - name: go-ppu
        image: ppu_test:latest
        imagePullPolicy: Never
        ports:
        - containerPort: 8081    # Should match the port number that the Go application listens on
        livenessProbe:           # To check the health of the Pod
          httpGet:
            path: /health
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 35
          periodSeconds: 30
          timeoutSeconds: 20
        readinessProbe:          # To check if the Pod is ready to serve traffic or not
          httpGet:
            path: /readiness
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 35
          timeoutSeconds: 20
Pods info:
ubuntu#ip-10-0-1-217:~/ppu$ kubectl get pods
NAME READY STATUS RESTARTS AGE
go-ppu-7b4b679bf5-44rf7 1/1 Running 0 18h
ubuntu#ip-10-0-1-217:~/ppu$ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 100.64.0.1 <none> 443/TCP 19h
ppu-service LoadBalancer 100.64.171.12 74d35bb2a5f30ca13877-1351038893.us-east-1.elb.amazonaws.com 8081:32623/TCP 18h
When I used Locust to test the performance of the server, I received the following error.
# fails Method Name Type
3472 POST /supplyInkHistory ConnectionError(MaxRetryError("HTTPConnectionPool(host='74d35bb2a5f30ca13877-1351038893.us-east-1.elb.amazonaws.com', port=8081): Max retries exceeded with url: /supplyInkHistory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x....>: Failed to establish a new connection: [Errno 24] Too many open files',))",),)
Have a look at https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/
You can set sysctls this way, but you need to enable a few features to make it work.
securityContext:
  sysctls:
  - name: fs.file-max
    value: "YOUR VALUE HERE"
There have been a few cases regarding setting the --ulimit argument; you can find them here or check this article. This resource limit can be set by Docker during container startup. As you added the google-kubernetes-engine tag, this answer relates to the GKE environment, although on other clouds it could work similarly.
If you would like to raise the limit for open files, you can modify the configuration file /etc/security/limits.conf. However, please note this will not persist across reboots.
A second option would be to edit /etc/init/docker.conf and restart the docker service. By default it has a few limits like nofile or nproc; you can adjust them here.
Another option could be to use an instance template. The instance template would include a start-up script that sets the required limit.
After that, you would need to use this new instance template for the instance group in GKE. More information here and here.
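A minimal sketch of such a start-up script, assuming the goal is simply to raise the node's open-file limits (the values below are placeholders, not recommendations):
#!/bin/bash
# raise the node-wide and per-process open-file limits (placeholder values)
echo "fs.file-max = 2097152" >> /etc/sysctl.conf
sysctl -p
echo "* soft nofile 1048576" >> /etc/security/limits.conf
echo "* hard nofile 1048576" >> /etc/security/limits.conf
# the container runtime may need a restart to pick up the new limits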
The "azds up" command times out before all the steps are done. I have a large Angular app that typically takes 5 minutes+ when npm install is executed. When I execute azds up this is what I get:
Step 1/9 : FROM node
Step 2/9 : ENV PORT 80
Step 3/9 : WORKDIR /app
Step 4/9 : COPY package*.json ./
Step 5/9 : RUN npm install --silent
Waiting for container...
and then it returns to the command line.
Is there a configuration in the azds.yaml where I can tell azds/helm to wait for a longer period of time?
Thanks!
To begin with, I will give some examples that might be helpful in your case.
An example .yaml for a rolling upgrade:
spec:
  minReadySeconds: 60
  progressDeadlineSeconds: 600
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 50%
      maxUnavailable: 50%
minReadySeconds: the boot-up time of your application; Kubernetes waits this long before creating the next Pod. For example, if minReadySeconds is 60, then after a Pod becomes healthy the Deployment must wait 60 seconds before updating the next Pod.
progressDeadlineSeconds: the timeout value for updating a Pod, in this example 10 minutes. If the rollout fails to progress within 10 minutes, the Deployment is marked as failed and finished, and none of the subsequent steps are invoked.
maxSurge: controls how many extra resources can be created during the rollout (absolute value or %).
maxUnavailable: sets the maximum number of Pods that can be unavailable during a rolling update (absolute value or %).
An example .yaml for liveness & readiness probes:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 60
  timeoutSeconds: 1
  periodSeconds: 10
  failureThreshold: 3
The above manifest configures a livenessProbe, which confirms whether a container is running properly or not.
It probes the health check by making an HTTP GET request to /healthz on port 8080.
The probe sets initialDelaySeconds=60, which means it will not be called until 60 seconds after all the containers in the Pod are created. timeoutSeconds=1 means the probe must respond within a 1-second timeout. periodSeconds=10 means the probe is invoked every 10 seconds. If the probe fails more than 3 times (failureThreshold=3), the container is considered unhealthy and is restarted.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 120
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 3
The above readinessProbe is even more important than the liveness probe in a production environment. The readinessProbe confirms whether the service can accept traffic or not. If this probe fails, the internal load balancer never sends traffic to this Pod; only once the probe succeeds does traffic to this Pod start.
I have 1 node with 3 pods. I want to rollout a new image in 1 of the three pods and the other 2 pods stay with the old image. Is it possible?
Second question: I tried rolling out a new image that contains an error, and I already defined maxUnavailable. But Kubernetes still rolled out all the pods. I thought Kubernetes would stop rolling out the remaining pods once it discovered an error in the first pod. Do we need to manually stop the rollout?
Here is my deployment script.
# Service setup
apiVersion: v1
kind: Service
metadata:
  name: semantic-service
spec:
  ports:
  - port: 50049
  selector:
    app: semantic
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: semantic-service
spec:
  selector:
    matchLabels:
      app: semantic
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: semantic
    spec:
      containers:
      - name: semantic-service
        image: something/semantic-service:v2
As #David Maze wrote in the comment, you can consider using a canary deployment, where it is possible to distinguish deployments of different releases or configurations of the same component by using multiple labels and then tracking those labels to point to different releases; more information about canary deployments can be found here. Another way to achieve your goal is a Blue/Green deployment, in case you want two environments that are as identical as possible, with a straightforward way to switch between the Blue and Green environments and roll back deployments at any moment in time.
The answer to the second question depends on what kind of error the image contains and how Kubernetes identifies the issue in the Pod, since the maxUnavailable: 1 parameter only states the maximum number of Pods that can be unavailable during the update. During a Deployment update, the deployment controller creates a new Pod and then deletes an old one, as long as the number of available Pods matches the rollingUpdate strategy parameters.
Additionally, Kubernetes uses liveness/readiness probes to check whether a Pod is ready (alive) during a deployment update and leaves the old Pod running until the probes have succeeded on the new replica. I would suggest checking the probes to identify the status of the Pods while the deployment rolls out updates across your cluster, as shown below.
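For instance, while the rollout is in progress you can watch the Pods and look for probe-failure events with standard kubectl commands (the deployment name and label below are taken from the manifest in the question):
kubectl rollout status deployment semantic-service
kubectl get pods -l app=semantic -w
# probe failures appear as Warning events on the affected Pod
kubectl describe pod <pod-name>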
Regarding question 1:
I have 1 node with 3 pods. I want to rollout a new image in 1 of the
three pods and the other 2 pods stay with the old image. Is it
possible?
Answer:
Change your maxSurge in your strategy to 0:
replicas: 3
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1   # from the 3 replicas, 1 can be unavailable
    maxSurge: 0         # you can't have more than the 3 replicas of pods at a time
Regarding question 2:
I tried rolling out a new image that contains error and I already
define the maxUnavailable. But kubernetes still rollout all pods. I
thought kubernetes will stop rolling out the whole pods, once
kubernetes discover an error in the first pod. Do we need to manually
stop the rollout?
A) In order for Kubernetes to stop rolling out the remaining pods, use minReadySeconds to specify how long a newly created pod must stay healthy before it is considered ready (and use liveness/readiness probes like #Nick_Kh suggested).
If one of the probes fails before the minReadySeconds interval has finished, then the whole rollout is blocked.
So with a combination of maxSurge = 0 and the setup of minReadySeconds and liveness / readiness probes you can achieve your desired state: 3 pods: 2 with the old image and 1 pod with the new image.
B) In the case of A, you don't need to stop the rollout manually.
But in cases when you do have to, you can run:
$ kubectl rollout pause deployment <name>
Debug the non functioning pods and take the relevant action.
If you decide to revert the rollout you can run:
$ kubectl rollout undo deployment <name> --to-revision=1
(View revisions with: $ kubectl rollout history deployment <name>).
Notice that after you have paused the rollout, you need to resume it with:
$ kubectl rollout resume deployment <name>
even if you decide to undo and return to the previous revision.
I have the following liveness probe in my service deployment.yaml
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: 9081
    scheme: HTTP
  initialDelaySeconds: 180
  timeoutSeconds: 10
  periodSeconds: 10
  successThreshold: 1
I want to test that the probe actually triggers a Pod restart; what is the easiest way to make it fail?
Possibly in a programmatic way.
Update:
To clarify the question: I don't want to change the application code, nor pause the container that is running.
I was wondering if it's possible to somehow block the endpoint/port at runtime, maybe using a Kubernetes or Docker command.
You could define your liveness probe as follows
livenessProbe:
  exec:
    command:
    - /bin/bash
    - '-c'
    - /liveness-probe.sh
  initialDelaySeconds: 10
  periodSeconds: 60
And create a shell script named liveness-probe.sh in your root path that contains:
#!/bin/bash
#exit 0 #Does not fail and does not trigger a pod restart
exit 1 #Triggers pod restart
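If you instead deploy the script with exit 0 active, you can force a failure later at runtime without touching the application, assuming a shell and sed are available in the image and the file is writable inside the container:
# flip the script so the next probe invocation exits 1 and the pod is restarted
kubectl exec -it <pod-name> -- sed -i 's/^exit 0/exit 1/' /liveness-probe.sh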
If you have the ability to change the underlying application's code, simply change the /health endpoint to make it return a 400-level (or higher) HTTP status code.
If not, you'll have to make your application fail somehow, probably by logging into the pod using kubectl exec and making changes that affect the application's health.
This is entirely dependent on your application, and kubernetes will simply do what you tell it to.
If you can get to the host where the pod is running, doing a docker pause on the container will pause all the processes in the container, which should fail the liveness probes.
Note: I have not tried this myself, but based on the documentation of docker pause here, it sounds like it should work.
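A rough sketch of that approach, assuming you can SSH to the node and it uses the Docker runtime:
# on the node that runs the pod
docker ps | grep <pod-name>      # find the container ID
docker pause <container-id>      # all processes freeze, so the liveness probe times out
docker unpause <container-id>    # resume once you have observed the restart behaviour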