I'm running a Kubernetes service, accessed via exec, which has a few pods in a StatefulSet.
If I kill the master pod that the service is using during an exec session, it exits with code 137. I want it to be forwarded to another pod immediately after the kill, or to apply a wait before exiting. Any help would be appreciated. Thank you.
Exit code 137 means your process exited due to SIGKILL (137 = 128 + 9), usually because the system ran out of RAM. Unfortunately, no delay is possible with SIGKILL; the kernel just drops your process and that's that. Kubernetes does detect it rapidly, and if you're using a Service-based network path it will usually react within 1-2 seconds. I would recommend looking into why your process is being hard-killed and fixing that :)
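If you want to confirm that the OOM killer was involved, a quick check you can run (the pod and namespace names below are placeholders):

    # Reason will show OOMKilled if the container ran out of memory
    kubectl describe pod my-pod -n my-namespace | grep -A 5 'Last State'

    # Or pull the termination reason directly
    kubectl get pod my-pod -n my-namespace \
        -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'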
I'm using the Jenkins Kubernetes Plugin, which starts pods in a Kubernetes cluster to serve as Jenkins agents. The pods contain three containers in order to provide the agent (slave) logic, a Docker socket, and the gcloud command-line tool.
The usual workflow is that the agent does its job and notifies the master that it has completed, and the master then terminates the pod. However, if the agent container crashes due to a lost network connection, that container terminates with exit code 255, while the other two containers keep running and so does the pod. This is a problem because the pods have large CPU requests: the setup is cheap when agents run only as long as they have to, but having multiple machines running for 24 hours or over a weekend is noticeable financial damage.
I'm aware that starting multiple containers in the same pod is not considered good Kubernetes practice, but it's OK if I know what I'm doing, and I assume I do. I'm also fairly sure it's hard to solve this differently given the way the Jenkins Kubernetes Plugin works.
Can I make the pod terminate if one container fails, without it being respawned? A solution with a timeout would be acceptable as well, though it's less preferred.
Disclaimer: I have rather limited knowledge of Kubernetes, but given the question:
Maybe you can run a fourth container that exposes a simple "liveness" endpoint.
It can run ps -ef or use any other means of checking on the three existing containers, just to make sure they're alive.
The endpoint would return "OK" only if all the containers are running, and "ERROR" if at least one of them is detected as crashed.
Then you could set up a Kubernetes liveness probe so that it stops the pod when that fourth container returns an error.
Of course, if this fourth process crashes by itself for any reason (it shouldn't, unless there is a bug or something), the liveness probe won't respond and Kubernetes is supposed to stop the pod anyway, which is probably what you really want to achieve.
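A rough sketch of what that could look like in the pod spec, assuming the hypothetical fourth container serves its health check over HTTP on port 8080 at /healthz (all names, images, and ports here are placeholders):

    containers:
    # ... the three existing containers ...
    - name: watchdog              # hypothetical fourth container
      image: my-watchdog:latest   # placeholder image
      livenessProbe:
        httpGet:
          path: /healthz          # returns OK only while all peer containers look alive
          port: 8080
        periodSeconds: 10
        failureThreshold: 3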
I am running an OpenCPU-based image on OpenShift. Every time the pod starts, after just a few seconds it crashes with the error:
command terminated with non-zero exit code: Error executing in Docker Container: 137
The Events tab shows only the three events below, and the terminal logs don't show anything either.
Back-off restarting the failed container
Pod sandbox changed, it will be killed and re-created.
Killing container with id docker://opencpu-test-temp:Need to kill Pod
I really have no clue why the container gets restarted every few seconds. The image runs just fine locally.
Can anyone give me a clue on how to debug this issue?
Exit code 137 is often memory related in a Docker context.
The actual error is from the process that is isolated in the Docker container. It means that the process was killed with a SIGKILL (137 = 128 + 9). Source
From bobcares.com:
Error 137 in Docker denotes that the container was ‘KILL’ed by ‘oom-killer’ (Out of Memory). This happens when there isn’t enough memory in the container for running the process.
‘OOM killer’ is a proactive process that jumps in to save the system when its memory level goes too low, by killing the resource-abusive processes to free up memory for the system.
Try checking the memory config of the container, and the available memory on the host that is launching the pod. Is there nothing in the OpenCPU container log?
Check the setting rlimit.as in the config file /etc/opencpu/server.conf inside the image. This limit is the "per request" memory limit for your OpenCPU instance (I realize that your problem is at startup, so this is perhaps not too likely to be the cause).
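Regarding the container memory config mentioned above: if the OOM killer is involved, it's worth checking whether the pod has an explicit memory limit and whether it's large enough for OpenCPU. A minimal sketch of such a limit in the pod spec (the container name, image, and sizes are placeholders):

    containers:
    - name: opencpu
      image: opencpu/base         # placeholder image; use whatever the deployment actually runs
      resources:
        requests:
          memory: "512Mi"
        limits:
          memory: "2Gi"           # the container is OOM-killed once it exceeds this limit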
I have a universal React app hosted in a Docker container in a minikube (Kubernetes) dev environment. I use VirtualBox, and I actually have more microservices on this VM.
In this React app, I use pm2 to restart the app on changes to server code, and webpack HMR to hot-reload client code on changes to client code.
Every 15-45 seconds or so, pm2 logs the message below, indicating that the app exited due to a SIGKILL.
App [development] with id [0] and pid [299], exited with code [0] via signal [SIGKILL]
I can't for the life of me figure out why it is happening. It is relatively frequent, but not so frequent that it happens every second. It's quite annoying because each time it happens, my webpack bundle has to recompile.
What are some reasons why pm2 might receive a SIGKILL in this type of dev environment? Also, what are some possible ways of debugging this?
I noticed that my services that use pm2 to restart on server changes do NOT have this problem when they are just backend services, i.e. when they don't use webpack. In addition, I don't see these SIGKILL problems in the prod version of the app. That suggests to me that there is some problem with the combination of the webpack HMR setup, pm2, and minikube/Docker.
I've tried the app locally (not in Docker/minikube) and it works fine without any SIGKILLs, so it can't be webpack HMR on its own. Does Kubernetes kill services that use a lot of memory? (Maybe it thinks my app is using a lot of memory.) If that's not the case, what might be some reasons Kubernetes or Docker would send a SIGKILL? Is there any way to debug this?
Any guidance is greatly appreciated. Thanks
I can't quite tell from the error message you posted, but usually this is a result of the kernel OOM Killer (Out of Memory Killer) taking out your process. This can be either because your process is just using up too much memory, or you have a cgroup setting on your container that is overly aggressive and causing it to get killed. You may also have under-allocated memory to your VirtualBox instance.
Normally you'll see Docker reporting that the container exited with code 137 in the output of docker ps -a.
dmesg or your syslogs on the node in question may show the kernel OOM killer output.
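In a minikube setup the node is the VM itself, so something along these lines should show whether the OOM killer fired (run inside minikube ssh; the exact output will vary):

    # Last exit status of the containers
    docker ps -a --format 'table {{.Names}}\t{{.Status}}'

    # Kernel OOM killer messages, if any
    dmesg | grep -i -E 'killed process|out of memory'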
What's the best way, in a script, to wait for a job or pod to complete in Kubernetes or Google Container Engine?
In particular, it would be better to be notified rather than polling kubectl for status, but I'd be happy with a fairly efficient loop that doesn't let anything slip through the cracks. Essentially, I'd like the equivalent of a plain docker run, since that blocks until the command terminates, but I don't want to use Docker directly in this case.
I looked at Github Issue #1899 but it looks unresolved as yet.
It's not really what it was designed for, but you could run kubectl attach $POD. It'll show you the output of the pod while it's running and automatically terminate once the pod is done running.
Of course, you'll have to handle the error that it prints if the pod is already done running, since it's only really meant for use on pods that are currently running.
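Newer kubectl versions also have kubectl wait, which blocks until a condition is met and avoids a hand-rolled polling loop. A sketch (the job and pod names are placeholders):

    # Block until the job finishes, or give up after the timeout
    kubectl wait --for=condition=complete job/my-job --timeout=600s

    # For a bare pod, wait until it reports Ready
    kubectl wait --for=condition=ready pod/my-pod --timeout=120s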
I am implementing a graceful-shutdown feature in Go for when Kubernetes does a rolling update on Google Container Engine. Does anyone know what process signal is sent to the running pods when kubectl rolling-update starts?
I've listened for os.Kill, os.Interrupt, syscall.SIGTERM, syscall.SIGKILL, and syscall.SIGSTOP, but none of those signals was raised during kubectl rolling-update.
I would really appreciate your answers.
I found a solution! I was using a shell script as the ENTRYPOINT and executing the Go binary from that script, so the process ID of the Go binary was not 1 (the shell script's PID was 1 instead). Docker sends SIGTERM only to PID 1, and the shell does not propagate it to its child processes. So I changed my ENTRYPOINT to execute the Go binary directly, and now I get SIGTERM in my Go code. Refer to this link.
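For reference, a minimal sketch of the kind of Go handler involved, assuming an HTTP service (the listen address and shutdown timeout are placeholders):

    package main

    import (
        "context"
        "log"
        "net/http"
        "os"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        srv := &http.Server{Addr: ":8080"} // placeholder server

        // Serve in the background.
        go func() {
            if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
                log.Fatalf("server error: %v", err)
            }
        }()

        // SIGTERM is what Kubernetes sends to PID 1 when it terminates the pod.
        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
        sig := <-sigs
        log.Printf("received %v, shutting down gracefully", sig)

        // Give in-flight requests time to finish before the grace period expires.
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("shutdown error: %v", err)
        }
    }

The signal only reaches this handler when the binary runs as PID 1 (or the wrapping script forwards it), e.g. with an exec-form ENTRYPOINT such as ENTRYPOINT ["/app/server"] in the Dockerfile, where /app/server is a placeholder path.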