OpenWhisk Invoker Agent Suspend Time - serverless

Deployed Openwhisk on Rancher with kubernetesContainerFactory and Invoker Agent enabled. Invoked hello.js action and wsk created pods wskowdev-invoker-00-1-prewarm-nodejs10 and wskowdev-invoker-00-2-prewarm-nodejs10. Pod 1 got removed after 15 minutes but pod 2 never gets removed by the invoker agent. My understanding is after a specified period of time the pods should be suspended. Please clarify how the invoker agent and pod suspension/removal works.

One of the pre-warmed containers was removed after because it was used. The other is still pristine and remains around/not reclaimed. This is called a stem cell container. A container is subject to garbage collection later once it is specialized (initialized with user code).
The default idle timeout is 10 minutes though, not 15 per this configuration, unless overridden.
https://github.com/apache/incubator-openwhisk/blob/2f0155fb750ce8b5eef6d5b0f4e2e2db40e5a037/core/invoker/src/main/resources/application.conf#L103-L110
The number of stemcells is specific to a runtime and determined by the runtime manifest. For example: https://github.com/apache/incubator-openwhisk/blob/ce45d54c824ef6c3e5d98ce0b220b924c81e688b/ansible/files/runtimes.json#L45-L50
Once a stem cell container is used, a nanny process will replace it with a new container so that the number of pre-warmed containers should generally be constant.

Related

what would happen if i restart a node with some pods running

Assume that there are some pods from Deployments/StatefulSet/DaemonSet, etc. running on a Kubernetes node.
Then I restarted the node directly, and then start docker, start kubelet with the same parameters.
What would happen to those pods?
Are they recreated with metadata saved locally from kubelet? Or use info retrieved from api-server? Or recovered from OCI runtime and behaves like nothing happened?
Is it that only stateless pod(no --local-data) can be recovered normally? If any of them has a local PV/dir, would they be connected back normally?
What if I did not restart the node for a long time? Would api-server assign other nodes to create those pods? What is the default timeout value? How can I configure this?
As far as I know:
apiserver
^
|(sync)
V
kubelet
^
|(sync)
V
-------------
| CRI plugin |(like api)
| containerd |(like api-server)
| runc |(low-level binary which manages container)
| c' runtime |(container runtime where containers run)
-------------
When kubelet received a PodSpec from kube-api-server, it calls CRI like a remote service, the steps be like:
create PodSandbox(a.k.a 'pause' image, always 'stopped')
create container(s)
run container(s)
So I guess that as the node and docker being restarted, steps 1 and 2 are already done, containers are at 'stopped' status; Then as kubelet being restarted, it pulls latest info from kube-api-server, found out that container(s) are not in 'running' state, so it calls CRI to run container(s), then everything are back to normal.
Please help me confirm.
Thank you in advance~
Good questions. A few things first; a Pod is not pinned to a certain node. The nodes is mostly seen as a "server farm" that Kubernetes can use to run its workload. E.g. you give Kubernetes a set of nodes and you also give a set of e.g. Deployment - that is desired state of applications that should run on your servers. Kubernetes is responsible for scheduling these Pods and also keep them running when something in the cluster is changed.
Standalone pods is not managed by anything, so if a Pod crashes it is not recovered. You typically want to deploy your stateless apps as Deployments, that then initiates ReplicaSets that manage a set of Pods - e.g. 4 Pods - instances of your app.
Your desired state; a Deployment with e.g. replicas: 4 is saved in the etcd database within the Kubernetes control plane.
Then a set of controllers for Deployment and ReplicaSet is responsible for keeping 4 replicas of your app alive. E.g. if a node becomes unresponsible (or dies), new pods will be created on other Nodes, if they are managed by the controllers for ReplicaSet.
A Kubelet receives a PodSpecs that are scheduled to the node, and then keep these pods alive by regularly health checks.
Is it that only stateless pod(no --local-data) can be recovered normally?
Pods should be seen as emphemeral - e.g. can disappear - but is recovered by a controller that manages them - unless deployed as standalone Pod. So don't store local data within the pod.
There is also StatefulSet pods, those are meant for stateful workload - but distributed stateful workload, typically e.g. 3 pods, that use Raft to replicate data. The etcd database is an example of distributed database that uses Raft.
The correct answer: it depends.
Imagine, you've got 3 nodes cluster, where you created a Deployment with 3 replicas, and 3-5 standalone pods.
Pods are created and scheduled to nodes.
Everything is up and running.
Let's assume that worker node node1 has got 1 deployment replica and 1 or more standalone pods.
The general sequence of node restart process as follows:
The node gets restarted, for ex. using sudo reboot
After restart, the node starts all OS processes in the order specified by systemd dependencies
When dockerd is started it does nothing. At this point all previous containers has Exited state.
When kubelet is started it requests the cluster apiserver for the list of Pods with node property equals its node name.
After getting the reply from apiserver, kubelet starts containers for all pods described in the apiserver reply using Docker CRI.
When pause container starts for each Pod from the list, it gets new IP address configured by CNI binary, deployed by Network addon Daemonset's Pod.
After kube-proxy Pod is started on the node, it updates iptables rules to implement Kubernetes Services desired configuration, taking to account new Pods' IP addresses.
Now things become a bit more complicated.
Depending on apiserver, kube-controller-manager and kubelet configuration, they reacts on the fact that node is not responding with some delay.
If the node restarts fast enough, kube-controller-manager doesn't evict the Pods and they all remain scheduled on the same node increasing their RESTARTS number after their new containers become Ready.
Example 1.
The cluster is created using Kubeadm with Flannel network addon on Ubuntu 18.04 VM created in GCP.
Kubernetes version is v1.18.8
Docker version is 19.03.12
After the node is restarted, all Pods' containers are started on the node with new IP addresses. Pods keep their names and location.
If node is stopped for a long time, the pods on the node stays in Running state, but connection attempts are obviously timed out.
If node remains stopped, after approximately 5 minutes pods scheduled on that node were evicted by kube-controller-manager and terminated. If I would start node before that eviction all pods were remained on the node.
In case of eviction, standalone Pods disappear forever, Deployments and similar controllers create necessary number of pods to replace evicted Pods and kube-scheduler puts them to appropriate nodes. If new Pod can't be scheduled on another node, for ex. due to lack of required volumes it will remain in Pending state, until the scheduling requirements were satisfied.
On a cluster created using Ubuntu 18.04 Vagrant box and Virtualbox hypervisor with host-only adapter dedicated for Kubernetes networking, pods on stopped node remains in the Running, but with Readiness: false state even after two hours, and were never evicted. After starting the node in 2 hours all containers were restarted successfully.
This configuration's behavior is the same all the way from Kubernetes v1.7 till the latest v1.19.2.
Example 2.
The cluster is created in Google cloud (GKE) with the default kubenet network addon:
Kubernetes version is 1.15.12-gke.20
Node OS is Container-Optimized OS (cos)
After the node is restarted (it takes around 15-20 seconds) all pods are started on the node with new IP addresses. Pods keep their names and location. (same with example 1)
If the node is stopped, after short period of time (T1 equals around 30-60 seconds) all pods on the node change status to Terminating. Couple minutes later they disappear from the Pods list. Pods managed by Deployment are rescheduled on other nodes with new names and ip addresses.
If the node pool is created with Ubuntu nodes, apiserver terminates Pods later, T1 equals around 2-3 minutes.
The examples show that the situation after worker node gets restarted is different for different clusters, and it's better to run the experiment on a specific cluster to check if you can get the expected results.
How to configure those timeouts:
How Can I Reduce Detecting the Node Failure Time on Kubernetes?
Kubernetes recreate pod if node becomes offline timeout
When the node is restarted and there are pods scheduled on it, managed by Deployment or ReplicaSet, those controllers will take care of scheduling desired number of replicas on another, healthy node. So if you have 2 replicas running on restarted node, they will be terminated and scheduled on other node.
Before restarting a node you should use kubectl cordon to mark the node as unschedulable and give kubernetes time to reschedule pods.
Stateless pods will not be rescheduled on any other node, they will be terminated.

when terminate a pod, does kubelet stop containers in parallel?

As far as I know, docker stop stops containers one by one even though apply multiple containers to it at a time. So does kubelet behave like this?
Pod termination prefectly described on POD lifecycle page
User sends command to delete Pod, with default grace period (30s)
The Pod in the API server is updated with the time beyond which the Pod is considered “dead” along with the grace period.
Pod shows up as “Terminating” when listed in client commands
(simultaneous with 3) When the Kubelet sees that a Pod has been marked as terminating because the time in 2 has been set, it begins the Pod shutdown process.
If one of the Pod’s containers has defined a preStop hook, it is invoked inside of the container. If the preStop hook is still running after the grace period expires, step 2 is then invoked with a small (2 second) extended grace period.
The container is sent the TERM signal. Note that not all containers in the Pod will receive the TERM signal at the same time and may each require a preStop hook if the order in which they shut down matters.
(simultaneous with 3) Pod is removed from endpoints list for service, and are no longer considered part of the set of running Pods for replication controllers. Pods that shutdown slowly cannot continue to serve traffic as load balancers (like the service proxy) remove them from their rotations.
When the grace period expires, any processes still running in the Pod are killed with SIGKILL.
The Kubelet will finish deleting the Pod on the API server by setting grace period 0 (immediate deletion). The Pod disappears from the API and is no longer visible from the client.

What steps does docker swarm take when doing a rolling update with start-first?

When docker swarm does a rolling update with stop-first on multiple running container instances, it takes -among others- the following steps in order for each container in a row:
Remove the container from its internal load balancer
Send a SIGTERM signal to the container.
With respect to the stop-grace-period, send a SIGKILL signal.
Start a new container
Add the new container to its internal load balancer
But which order of steps are taken when I want to do a rolling update with start-first?
Will the old- and new- container be available through the loadbalancer at the same time (until the old one has stopped and removed from the lb)?
Or will the new container first be started and not be added to the loadbalancer until the old container is stopped and removed from the loadbalancer?
The latter would nescesarry for processes that are bounded to a specific instance of a service (container).
But which order of steps are taken when I want to do a rolling update
with start-first?
It's basically the reverse. New container starts, added to LB, then the old one is removed from LB and sent shutdown signal.
Will the old- and new- container be available through the loadbalancer
at the same time (until the old one has stopped and removed from the
lb)?
Yes.
A reminder that most of this will not be seamless (or near zero downtime) unless you (at a minimum) have healthchecks enabled in the service. I talk about this a little in this YouTube video.

How to add containers to a Kubernetes pod on runtime

I have a number of Jobs
running on k8s.
These jobs run a custom agent that copies some files and sets up the environment for a user (trusted) provided container to run.
This agent runs on the side of the user container, captures the logs, waits for the container to exit and process the generated results.
To achieve this, we mount Docker's socket /var/run/docker.sock and run as a privileged container, and from within the agent, we use docker-py to interact with the user container (setup, run, capture logs, terminate).
This works almost fine, but I'd consider it a hack. Since the user container was created by calling docker directly on a node, k8s is not aware of it's existence. This has been causing troubles since our monitoring tools interact with K8s, and don't get visibility to these stand-alone user containers. It also makes pod scheduling harder to manage, since the limits (cpu/memory) for the user container are not accounted as the requests for the pod.
I'm aware of init containers but these don't quite fit this use case, since we want to keep the agent running and monitoring the user container until it completes.
Is it possible for a container running on a pod, to request Kubernetes to add additional containers to the same pod the agent is running? And if so, can the agent also request Kubernetes to remove the user container at will (e.g. certain custom condition was met)?
From this GitHub issue, it seems that the answer is that adding or removing containers to a pod is not possible, since the container list in the pod spec is immutable.
In kubernetes 1.16, there is an alpha feature that would allow for creation of ephemeral containers which could be "added" to running pods. Note, that this requires a feature gate to be enabled on relevant components e.g. kubelet. This may be hard to enable on control plane for cloud provider managed services such as EKS.
API Reference 1.16
Simple tutorial
I don't think you can alter a running pod like that but you can certainly define your own pod and run it programmatically using API
What I mean is you should define a pod with the user container and whatever other containers you wish and run it as a unit. It's possible you'll need to play around with liveness checks to have post processing completed after your user container dies
You can share data between multiple containers in a pod using shared volumes. this would let your agent container read from log files written on the user container, and drop config files into the shared volume for setup.
This way you could run the user container and the agent container as a Job with both containers in the pod. When both containers exit, the job will be completed.
You seem to indicate above that you are manually terminating the user container. That wouldn't be supported via shared volume unless you did something like forcing users to terminate their execution at the presence of a file on the shared volume.
Is it possible for a container running on a pod, to request Kubernetes
to add additional containers to the same pod the agent is running? And
if so, can the agent also request Kubernetes to remove the user
container at will (e.g. certain custom condition was met)?
I'm not aware of any way to add containers to existing Job pod definitions. There's no replicas option for Jobs so you couldn't hack it by changing replicas from 0->1 like you potentially could on a Deployment.
I'm not aware of any way to use kubectl to delete a container but not the whole pod. See kubectl delete.
If you want to kill the user container (rather than having it run to completion), you'll have to get on the host and use docker kill <sha> on the user container. Make sure to set .spec.template.spec.restartPolicy = "Never" on the user container or k8s will restart it.
I'd recommend:
Having a shared volume to transfer logs to the agent and so the agent can set up the user container
Making user containers expect to exit on their own and read configs from the shared volume
I don't know what workloads you are doing or how users are making containers so that may not be possible. If you're not able to dictate how users build their containers, the above may not work.
Another option is providing a binary that acts as a command API on the user container. This binary could accept commands like "setup", "run", "terminate", "transfer logs" via RPC and it would be the main process in their docker container.
Then you could make the build process for users something like:
FROM your-container-with-binary:latest
put whatever you want in this
container and set ENV JOB_PATH=/path/to/executable/code (or put code
in specific location)
Lots of moving parts to this whichever way you make it happen.
You can inject containers to pods dynamically via : https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/
An admission controller is a piece of code that intercepts requests to
the Kubernetes API server prior to persistence of the object, but
after the request is authenticated and authorized. The controllers
consist of the list below, are compiled into the kube-apiserver
binary, and may only be configured by the cluster administrator. In
that list, there are two special controllers: MutatingAdmissionWebhook
and ValidatingAdmissionWebhook. These execute the mutating and
validating (respectively) admission control webhooks which are
configured in the API.
Admission controllers may be “validating”, “mutating”, or both.
Mutating controllers may modify the objects they admit; validating
controllers may not.
And you can inject additional runtime requirements to pods via : https://kubernetes.io/docs/concepts/workloads/pods/podpreset/
A Pod Preset is an API resource for injecting additional runtime
requirements into a Pod at creation time. You use label selectors to
specify the Pods to which a given Pod Preset applies.

Openshift PaaS/Kubernetes Docker Container Monitoring and Orchestration

Kubernetes deployment and replication controller give the ability to self heal by ensuring a minimum number of replicas is/are present.
Also the auto scaling features, allows to increase replicas given a specific cpu threshold.
Are there tools available that would provide flexibility in the auto-healing and auto-scale features?
Example :
Auto-adjust number of replicas during peak hours or days.
When the pod dies, and is due to external issues, prevent the system from re-creating container and wait for a condition to succeed, i.e. ping or telnet test.
You can block pod startup by waiting for external services in an entrypoint script or init container. That's the closest that exists today to waiting for external conditions.
There is no time based autoscaler today, although it would be possible to script it failure easily on a schedule.
In Openshift, you can easily scale your app by running this command in a cron job.
Scale command
oc scale dc app --replicas=5
And of course, scale it down changing the numer of replicas.
Autoscale
This is what Openshift for developers write about autoscaling.
OpenShift also supports automatic scaling, defining upper and lower thresholds for CPU usage by pod.
If the upper threshold is consistently exceeded by the running pods for your application, a new instance of your application will be started. When CPU usage drops back below the lower threshold, because your application is no longer working as hard, the number of instances will be scaled back again.
I think Kubernetes now released version 1.3 which allows autoscale but integrated yet in Openshift.
Health Check
What it comes to health check, Openshift has:
readiness checks Checks the status of the test you configure before the router start to send traffic to it.
liveness probe: liveness probe is run periodically once traffic has been switched to an instance of your application to ensure it is still behaving correctly. If the liveness probe fails, OpenShift will automatically shut down that instance of your application and replace it with a new one.
You can perform this kind of tests (HTTP check, Container execution check and TCP socket check)
So e this tolos I guess you can créate some readiness check and liveness check to ensure that the status of your pod is running properly, if not a new deployment will be triggered until readiness status comes to ok.

Resources