Sidecar containers in Kubernetes Jobs? - monitoring

We use Kubernetes Jobs for a lot of batch computing here and I'd like to instrument each Job with a monitoring sidecar to update a centralized tracking system with the progress of a job.
The only problem is, I can't figure out what the semantics are (or are supposed to be) of multiple containers in a job.
I gave it a shot anyway (with an alpine sidecar that printed "hello" every second), and after my main task completed, the Jobs are considered successful and kubectl get pods in Kubernetes 1.2.0 shows:
NAME READY STATUS RESTARTS AGE
job-69541b2b2c0189ba82529830fe6064bd-ddt2b 1/2 Completed 0 4m
job-c53e78aee371403fe5d479ef69485a3d-4qtli 1/2 Completed 0 4m
job-df9a48b2fc89c75d50b298a43ca2c8d3-9r0te 1/2 Completed 0 4m
job-e98fb7df5e78fc3ccd5add85f8825471-eghtw 1/2 Completed 0 4m
And if I describe one of those pods
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 24 Mar 2016 11:59:19 -0700
Finished: Thu, 24 Mar 2016 11:59:21 -0700
Getting the YAML of the job then shows information per container:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2016-03-24T18:59:29Z
    message: 'containers with unready status: [pod-template]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  containerStatuses:
  - containerID: docker://333709ca66462b0e41f42f297fa36261aa81fc099741e425b7192fa7ef733937
    image: luigi-reduce:0.2
    imageID: docker://sha256:5a5e15390ef8e89a450dac7f85a9821fb86a33b1b7daeab9f116be252424db70
    lastState: {}
    name: pod-template
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: docker://333709ca66462b0e41f42f297fa36261aa81fc099741e425b7192fa7ef733937
        exitCode: 0
        finishedAt: 2016-03-24T18:59:30Z
        reason: Completed
        startedAt: 2016-03-24T18:59:29Z
  - containerID: docker://3d2b51436e435e0b887af92c420d175fafbeb8441753e378eb77d009a38b7e1e
    image: alpine
    imageID: docker://sha256:70c557e50ed630deed07cbb0dc4d28aa0f2a485cf7af124cc48f06bce83f784b
    lastState: {}
    name: sidecar
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2016-03-24T18:59:31Z
  hostIP: 10.2.113.74
  phase: Running
So it looks like my sidecar would need to watch the main process (how?) and exit gracefully once it detects it is alone in the pod? If this is correct, then are there best practices/patterns for this (should the sidecar exit with the return code of the main container? but how does it get that?)?
** Update **
After further experimentation, I've also discovered the following:
If there are two containers in a pod, the Job is not considered successful until all containers in the pod have returned with exit code 0.
Additionally, if restartPolicy: OnFailure is set on the pod spec, then any container in the pod that terminates with a non-zero exit code will be restarted in the same pod. A monitoring sidecar could use this to count the number of retries and delete the Job after a certain number, to work around the lack of a max-retries option in Kubernetes Jobs at the moment.

You can use the Downward API to figure out your own pod name from within the sidecar, and then retrieve your own pod from the apiserver to look up the exit status of the other containers. Let me know how this goes.
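For illustration, here is a rough sketch of that approach. It assumes the main container is named pod-template (as in the status output above), that the sidecar image has curl and jq available, and that the pod's service account is allowed to read its own Pod object; adjust the names to your setup.

# sidecar container in the Job's pod template
- name: sidecar
  image: my-monitoring-sidecar            # placeholder image with curl and jq installed
  env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name          # Downward API: this pod's name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  command: ["/bin/sh", "-c"]
  args:
  - |
    TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
    CA=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    while true; do
      # read this pod from the apiserver and extract the main container's exit code
      CODE=$(curl -s --cacert "$CA" -H "Authorization: Bearer $TOKEN" \
        https://kubernetes.default.svc/api/v1/namespaces/$POD_NAMESPACE/pods/$POD_NAME \
        | jq -r '.status.containerStatuses[] | select(.name=="pod-template") | .state.terminated.exitCode')
      if [ -n "$CODE" ] && [ "$CODE" != "null" ]; then
        echo "main container exited with code $CODE"   # report to your tracking system here
        exit "$CODE"                                   # propagate the main container's exit code
      fi
      sleep 5
    done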

Related

GKE problem when running cronjob by pulling image from Artifact Registry

I created a cronjob with the following spec in GKE:
# cronjob.yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: collect-data-cj-111
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Allow
  startingDeadlineSeconds: 100
  suspend: false
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: collect-data-cj-111
            image: collect_data:1.3
          restartPolicy: OnFailure
I create the cronjob with the following command:
kubectl apply -f collect_data.yaml
When I later watch if it is running or not (as I scheduled it to run every 5th minute for the sake of testing), here is what I see:
$ kubectl get pods --watch
NAME READY STATUS RESTARTS AGE
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0/1 Pending 0 0s
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0/1 Pending 0 1s
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0/1 ContainerCreating 0 1s
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0/1 ErrImagePull 0 3s
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0/1 ImagePullBackOff 0 17s
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0/1 ErrImagePull 0 30s
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0/1 ImagePullBackOff 0 44s
It does not seem to be able to pull the image from Artifact Registry. I have both GKE and Artifact Registry created under the same project.
What could be the reason? After spending several hours in the docs I still could not make progress, and I am quite new to the world of GKE.
If you recommend checking anything, I would really appreciate it if you also described where in GCP I should check/verify it.
ADDENDUM:
When I run the following command:
kubectl describe pods
The output is quite large but I guess the following message should indicate the problem.
Failed to pull image "collect_data:1.3": rpc error: code = Unknown
desc = failed to pull and unpack image "docker.io/library/collect_data:1.3":
failed to resolve reference "docker.io/library/collect_data:1.3": pull
access denied, repository does not exist or may require authorization:
server message: insufficient_scope: authorization failed
How do I solve this problem step by step?
From the error shared, I can tell that the image is not being pulled from Artifact Registry. By default, GKE pulls images directly from Docker Hub unless specified otherwise, and since there is no collect_data image there, the pull fails.
The correct way to specify an image stored in Artifact Registry is as follows:
image: <location>-docker.pkg.dev/<project>/<repo-name>/<image-name:tag>
Be aware that the registry format has to be set to "docker" if you are using a docker-containerized image.
Take a look at the Quickstart for Docker guide, where it is specified how to pull and push docker images to Artifact Registry along with the permissions required.
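For illustration, the container section of the CronJob above would then reference the full Artifact Registry path instead of the bare image name (the location, project and repository below are placeholders, not values from the question):

containers:
- name: collect-data-cj-111
  # hypothetical Artifact Registry path: <location>-docker.pkg.dev/<project>/<repo>/<image>:<tag>
  image: us-central1-docker.pkg.dev/my-project/my-repo/collect_data:1.3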

Running a docker container which uses GPU from kubernetes fails to find the GPU

I want to run a docker container which uses GPU (it runs a cnn to detect objects on a video), and then run that container on Kubernetes.
I can run the container from docker alone without problems, but when I try to run the container from Kubernetes it fails to find the GPU.
I run it using this command:
kubectl exec -it namepod /bin/bash
This is the problem that I get:
kubectl exec -it tym-python-5bb7fcf76b-4c9z6 /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@tym-python-5bb7fcf76b-4c9z6:/opt# cd servicio/
root@tym-python-5bb7fcf76b-4c9z6:/opt/servicio# python3 TM_Servicev2.py
Try to load cfg: /opt/darknet/cfg/yolov4.cfg, weights: /opt/yolov4.weights, clear = 0
CUDA status Error: file: ./src/dark_cuda.c : () : line: 620 : build time: Jul 30 2021 - 14:05:34
CUDA Error: no CUDA-capable device is detected
python3: check_error: Unknown error -1979678822
root@tym-python-5bb7fcf76b-4c9z6:/opt/servicio#
EDIT.
I followed all the steps on the Nvidia docker 2 guide and downloaded the Nvidia plugin for Kubernetes.
However, when I deploy it on Kubernetes it stays as "Pending" and never actually starts. I don't get an error anymore, but it never starts.
The pod appears like this:
gpu-pod 0/1 Pending 0 3m19s
EDIT 2.
I ended up reinstalling everything, and now my pod appears as Completed but not Running, like this:
default gpu-operator-test 0/1 Completed 0 62m
Answering Wiktor.
when I run this command:
kubectl describe pod gpu-operator-test
I get:
Name: gpu-operator-test
Namespace: default
Priority: 0
Node: pdi-mc/192.168.0.15
Start Time: Mon, 09 Aug 2021 12:09:51 -0500
Labels: <none>
Annotations: cni.projectcalico.org/containerID: 968e49d27fb3d86ed7e70769953279271b675177e188d52d45d7c4926bcdfbb2
cni.projectcalico.org/podIP:
cni.projectcalico.org/podIPs:
Status: Succeeded
IP: 192.168.10.81
IPs:
IP: 192.168.10.81
Containers:
cuda-vector-add:
Container ID: docker://d49545fad730b2ec3ea81a45a85a2fef323edc82e29339cd3603f122abde9cef
Image: nvidia/samples:vectoradd-cuda10.2
Image ID: docker-pullable://nvidia/samples#sha256:4593078cdb8e786d35566faa2b84da1123acea42f0d4099e84e2af0448724af1
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 09 Aug 2021 12:10:29 -0500
Finished: Mon, 09 Aug 2021 12:10:30 -0500
Ready: False
Restart Count: 0
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9ktgq (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-9ktgq:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
I'm using this configuration file to create the pod
apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda10.2"
    resources:
      limits:
        nvidia.com/gpu: 1
Addressing two topics here:
The error you saw at the beginning:
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
This means that you tried to use a deprecated form of the kubectl exec command. The proper syntax is:
$ kubectl exec (POD | TYPE/NAME) [-c CONTAINER] [flags] -- COMMAND [args...]
See here for more details.
According to the official docs, the gpu-operator-test pod should run to completion:
You can see that the pod's status is Succeeded and also:
State: Terminated
Reason: Completed
Exit Code: 0
Exit Code: 0 means that the specified container command completed successfully.
More details can be found in the official docs.
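If you want to confirm that the GPU was actually used, you can check the completed pod's logs, which should show something like "Test PASSED" for the vectoradd sample (a quick sanity check, not part of the original answer):

kubectl logs gpu-operator-test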

CrashLoopBackOff while deploying pod using image from private registry

I am trying to create a pod using my own docker image on localhost.
This is the dockerfile used to create the image :
FROM centos:8
RUN yum install -y gdb
RUN yum group install -y "Development Tools"
CMD ["/usr/bin/bash"]
The yaml file used to create the pod is this :
---
apiVersion: v1
kind: Pod
metadata:
  name: server
  labels:
    app: server
spec:
  containers:
  - name: server
    imagePullPolicy: Never
    image: localhost:5000/server
    ports:
    - containerPort: 80
root#node1:~/test/server# docker images | grep server
server latest 82c5228a553d 3 hours ago 948MB
localhost.localdomain:5000/server latest 82c5228a553d 3 hours ago 948MB
localhost:5000/server latest 82c5228a553d 3 hours ago 948MB
The image has been pushed to localhost registry.
Following is the error I receive.
root#node1:~/test/server# kubectl get pods
NAME READY STATUS RESTARTS AGE
server 0/1 CrashLoopBackOff 5 5m18s
The output of describe pod :
root#node1:~/test/server# kubectl describe pod server
Name: server
Namespace: default
Priority: 0
Node: node1/10.0.2.15
Start Time: Mon, 07 Dec 2020 15:35:49 +0530
Labels: app=server
Annotations: cni.projectcalico.org/podIP: 10.233.90.192/32
cni.projectcalico.org/podIPs: 10.233.90.192/32
Status: Running
IP: 10.233.90.192
IPs:
IP: 10.233.90.192
Containers:
server:
Container ID: docker://c2982e677bf37ff11272f9ea3f68565e0120fb8ccfb1595393794746ee29b821
Image: localhost:5000/server
Image ID: docker-pullable://localhost.localdomain:5000/server#sha256:6bc8193296d46e1e6fa4cb849fa83cb49e5accc8b0c89a14d95928982ec9d8e9
Port: 80/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 07 Dec 2020 15:41:33 +0530
Finished: Mon, 07 Dec 2020 15:41:33 +0530
Ready: False
Restart Count: 6
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-tb7wb (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-tb7wb:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-tb7wb
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m default-scheduler Successfully assigned default/server to node1
Normal Pulled 4m34s (x5 over 5m59s) kubelet Container image "localhost:5000/server" already present on machine
Normal Created 4m34s (x5 over 5m59s) kubelet Created container server
Normal Started 4m34s (x5 over 5m59s) kubelet Started container server
Warning BackOff 56s (x25 over 5m58s) kubelet Back-off restarting failed container
I get no logs :
root#node1:~/test/server# kubectl logs -f server
root#node1:~/test/server#
I am unable to figure out whether the issue is with the container or with the YAML file for creating the pod. Any help would be appreciated.
Posting this as Community Wiki.
As pointed out by @David Maze in the comment section:
If docker run exits immediately, a Kubernetes Pod will always go into CrashLoopBackOff state. Your Dockerfile needs to COPY in or otherwise install an application and set its CMD to run it.
The root cause can also be determined from the exit code. In the "3) Check the exit code" article you can find a few exit codes like 0, 1, 128 and 137 with descriptions.
3.1) Exit Code 0
This exit code implies that the specified container command completed 'successfully', but too often for Kubernetes to accept as working.
In short, your container was created, every action mentioned was executed, and as there was nothing else to do, it exited with Exit Code 0.
A CrashLoopBackOff error occurs when a pod startup fails repeatedly in Kubernetes.
Your image, based on centos with a few additional installations, did not leave any process running in the background, so it was categorized as Completed. As this happened so fast, Kubernetes restarted it and it fell into a loop.
$ kubectl run centos --image=centos
$ kubectl get po -w
NAME READY STATUS RESTARTS AGE
centos 0/1 CrashLoopBackOff 1 5s
centos 0/1 Completed 2 17s
centos 0/1 CrashLoopBackOff 2 31s
centos 0/1 Completed 3 46s
centos 0/1 CrashLoopBackOff 3 58s
centos 1/1 Running 4 88s
centos 0/1 Completed 4 89s
centos 0/1 CrashLoopBackOff 4 102s
$ kubectl describe po centos | grep 'Exit Code'
Exit Code: 0
But if you used sleep 3600 in your container, the sleep command would keep running for an hour. After that time it would also exit with Exit Code 0.
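For illustration, one way to keep a debugging container like this alive is to give it a long-running foreground command in the pod spec (a sketch based on the spec above, not part of the original answer):

spec:
  containers:
  - name: server
    image: localhost:5000/server
    imagePullPolicy: Never
    # keep a foreground process running so the container does not exit immediately
    command: ["sleep", "infinity"]
    ports:
    - containerPort: 80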
Hope this clarifies it.

Kubernetes pods hanging in Init state

I am facing a weird issue with my pods. I am launching around 20 pods in my environment, and every time some random 3-4 of them hang with Init:0/1 status. On checking the pod status, the init container shows a Running status (it should terminate once its task is finished) and the app container shows the Waiting/PodInitializing stage. The same init container image and specs are used across all 20 pods, but this issue happens with some random pods every time. And on terminating these stuck pods, they get stuck in the Terminating state.
If I SSH to the node the pod was launched on and run docker ps, it shows the init container in a Running state, but running docker exec throws an error that the container doesn't exist. This init container pulls configs from a Consul server, and on checking the volume (obtained from docker inspect) I found that it has pulled all the key-value pairs correctly and saved them under the defined file name. I have checked resources on all the nodes and more than enough is available.
Below is detailed example of on the pod acting like this.
Kubectl Version :
kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.0", GitCommit:"925c127ec6b946659ad0fd596fa959be43f0cc05", GitTreeState:"clean", BuildDate:"2017-12-15T21:07:38Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T09:42:01Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Pods :
kubectl get pods -n dev1|grep -i session-service
session-service-app-75c9c8b5d9-dsmhp 0/1 Init:0/1 0 10h
session-service-app-75c9c8b5d9-vq98k 0/1 Terminating 0 11h
Pods Status :
kubectl describe pods session-service-app-75c9c8b5d9-dsmhp -n dev1
Name: session-service-app-75c9c8b5d9-dsmhp
Namespace: dev1
Node: ip-192-168-44-18.ap-southeast-1.compute.internal/192.168.44.18
Start Time: Fri, 27 Apr 2018 18:14:43 +0530
Labels: app=session-service-app
pod-template-hash=3175746185
release=session-service-app
Status: Pending
IP: 100.96.4.240
Controlled By: ReplicaSet/session-service-app-75c9c8b5d9
Init Containers:
initpullconsulconfig:
Container ID: docker://c658d59995636e39c9d03b06e4973b6e32f818783a21ad292a2cf20d0e43bb02
Image: shr-u-nexus-01.myops.de:8082/utils/app-init:1.0
Image ID: docker-pullable://shr-u-nexus-01.myops.de:8082/utils/app-init#sha256:7b0692e3f2e96c6e54c2da614773bb860305b79922b79642642c4e76bd5312cd
Port: <none>
Args:
-consul-addr=consul-server.consul.svc.cluster.local:8500
State: Running
Started: Fri, 27 Apr 2018 18:14:44 +0530
Ready: False
Restart Count: 0
Environment:
CONSUL_TEMPLATE_VERSION: 0.19.4
POD: sand
SERVICE: session-service-app
ENV: dev1
Mounts:
/var/lib/app from shared-volume-sidecar (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-bthkv (ro)
Containers:
session-service-app:
Container ID:
Image: shr-u-nexus-01.myops.de:8082/sand-images/sessionservice-init:sitv12
Image ID:
Port: 8080/TCP
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/etc/appenv from shared-volume-sidecar (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-bthkv (ro)
Conditions:
Type Status
Initialized False
Ready False
PodScheduled True
Volumes:
shared-volume-sidecar:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
default-token-bthkv:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-bthkv
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
Container Status on Node :
sudo docker ps|grep -i session
c658d5999563 shr-u-nexus-01.myops.de:8082/utils/app-init#sha256:7b0692e3f2e96c6e54c2da614773bb860305b79922b79642642c4e76bd5312cd "/usr/bin/consul-t..." 10 hours ago Up 10 hours k8s_initpullconsulconfig_session-service-app-75c9c8b5d9-dsmhp_dev1_c2075f2a-4a18-11e8-88e7-02929cc89ab6_0
da120abd3dbb gcr.io/google_containers/pause-amd64:3.0 "/pause" 10 hours ago Up 10 hours k8s_POD_session-service-app-75c9c8b5d9-dsmhp_dev1_c2075f2a-4a18-11e8-88e7-02929cc89ab6_0
f53d48c7d6ec shr-u-nexus-01.myops.de:8082/utils/app-init#sha256:7b0692e3f2e96c6e54c2da614773bb860305b79922b79642642c4e76bd5312cd "/usr/bin/consul-t..." 10 hours ago Up 10 hours k8s_initpullconsulconfig_session-service-app-75c9c8b5d9-vq98k_dev1_42837d12-4a12-11e8-88e7-02929cc89ab6_0
c26415458d39 gcr.io/google_containers/pause-amd64:3.0 "/pause" 10 hours ago Up 10 hours k8s_POD_session-service-app-75c9c8b5d9-vq98k_dev1_42837d12-4a12-11e8-88e7-02929cc89ab6_0
On running Docker exec (same result with kubectl exec) :
sudo docker exec -it c658d5999563 bash
rpc error: code = 2 desc = containerd: container not found
A Pod can be stuck in Init status due to many reasons.
PodInitializing or Init status means that the Pod contains an init container that hasn't finalized (init containers are specialized containers that run before app containers in a Pod; they can contain utilities or setup scripts). If the pod's status is Init:0/1, it means one init container has not finalized; Init:N/M means the Pod has M init containers and N have completed so far.
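For context, a minimal Pod with one init container might look like the sketch below (a generic illustration, not the questioner's spec; the images and command are placeholders, and the Consul address is taken from the question):

apiVersion: v1
kind: Pod
metadata:
  name: init-demo
spec:
  initContainers:
  - name: wait-for-consul              # must exit 0 before the app container starts
    image: busybox
    command: ["sh", "-c", "until nslookup consul-server.consul.svc.cluster.local; do sleep 2; done"]
  containers:
  - name: app
    image: nginx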
Gathering information
For these scenarios the best approach is to gather information, as the root cause can be different in every PodInitializing issue.
kubectl describe pods pod-XXX: with this command you can get a lot of information about the pod and check whether there are any meaningful events. Note down the init container name.
kubectl logs pod-XXX: this command prints the logs for a container in a pod or specified resource.
kubectl logs pod-XXX -c init-container-xxx: this is the most precise option, as it prints the logs of the init container itself. You can get the init container name by describing the pod, and use it in place of "init-container-xxx" (for example "copy-default-config").
The output of kubectl logs pod-XXX -c init-container-xxx can reveal meaningful information about the issue.
In the referenced example, the root cause was that the init container couldn't download the plugins from Jenkins (timeout); from there you can check the connection config, proxy or DNS, or just modify the yaml to deploy the container without the plugins.
Additional:
kubectl describe node node-XXX: describing the pod will give you the name of its node, which you can also inspect with this command.
kubectl get events: lists the cluster events.
journalctl -xeu kubelet | tail -n 10: kubelet logs on systemd (journalctl -xeu docker | tail -n 1 for docker).
Solutions
The solution depends on the information gathered once the root cause is found.
When you find a log with an insight of the root cause, you can investigate that specific root cause.
Some examples:
1 > In that case it happened when the init container was deleted; it can be fixed by deleting the pod so it gets recreated, or by redeploying it. Same scenario in 1.1.
2 > If you found "bad address 'kube-dns.kube-system'", the PVC may not have been recycled correctly; the solution provided in 2 is running /opt/kubernetes/bin/kube-restart.sh.
3 > There, a sh file was not found; the solution would be to modify the yaml file, or remove the container if it is unnecessary.
4 > A FailedSync was found, and it was solved by restarting docker on the node.
In general you can modify the yaml, for example to avoid using an outdated URL, try to recreate the affected resource, or just remove the init container that causes the issue from your deployment. However the specific solution will depend on the specific root cause.
My problem was related to the ebs-csi-controller (AWS EKS 1.24).
The EBS add-on needs access to an IAM role, and in my case the role's trust relationship was broken. It uses OIDC, so I had to add my cluster's OIDC provider manually in the IAM identity providers section.
kubectl logs deployment/ebs-csi-controller -n kube-system -c ebs-plugin
helped diagnose this, as well as
https://aws.amazon.com/premiumsupport/knowledge-center/eks-troubleshoot-ebs-volume-mounts/

How to determine the reason of a CrashLoopBackOff error in a Spring Boot application deployed on Kubernetes

I have a Spring Boot application, deployed on a docker container on Kubernetes. The application works well for some time (hours) but at a certain moment it starts restarting like crazy showing a CrashLoopBackOff error state.
This is the info I get from the dead pod:
Port: 8080/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Fri, 11 Aug 2017 10:15:03 +0200
Finished: Fri, 11 Aug 2017 10:16:22 +0200
Ready: False
Restart Count: 7
...
Volume Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-bhk8f (ro)
Environment Variables:
JAVA_OPTS: -Xms512m -Xmx1792m
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
...
QoS Class: BestEffort
Tolerations: <none>
No events.
Is there any way to get more detailed information about the cause of the crashes?
Is error code 137 an out-of-memory error? I have kept increasing the memory of the Java process from -Xmx768m up to 1792m, but the errors keep showing up. Could it be something else?
One weird fact I need to figure out: the application runs well at first, then after some hours the pod is killed, and from then on every restart is killed after only a few seconds of execution.
kubectl logs podName containerName will provide you with the container logs, which should give you additional information about the cause of the error.
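For illustration, a few commands that are often useful here (the pod and container names are placeholders). Exit code 137 is 128 + 9 (SIGKILL), which in Kubernetes frequently means the container was OOM-killed:

# logs from the current container instance
kubectl logs my-pod -c my-container
# logs from the previous, crashed instance (often more useful when the pod keeps restarting)
kubectl logs my-pod -c my-container --previous
# check the last state; Reason: OOMKilled means the container exceeded its memory limit
kubectl describe pod my-pod | grep -A 5 'Last State'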
