I started working on Docker and Kubernetes recently.
I ran into a problem which I don't fully understand.
When I apply my svc.yaml (Service) and rc.yaml (ReplicationController), the pods get created but their status is Terminated.
I tried checking the possible reason for failure by using the command
docker ps -a
954c3ee817f9   localhost:5000/HelloService   "/bin/sh -c ./startSe"   2 minutes ago   Exited (127) 2 minutes ago   k8s_HelloService.523e3b04_HelloService-64789_default_40e92b63-707a-11e7-9b96-080027f96241_195f2fee
Then I tried running:
docker run -i -t localhost:5000/HelloService
/bin/sh: ./startService.sh: not found
What is the possible reason I am getting these errors?
Dockerfile:
FROM alpine:3.2
VOLUME /tmp
ADD HelloService-0.0.1-SNAPSHOT.jar app.jar
VOLUME /etc
ADD /etc/ /etc/
ADD startService.sh /startService.sh
RUN chmod 700 /startService.sh
ENTRYPOINT ./startService.sh
startService.sh
#!/bin/sh
touch /app.jar
java -Djava.security.egd=file:/dev/./urandom -Xms256m -Xmx256m -jar /app.jar
Also, I would like to know if there is any specific way I can access the logs from Kubernetes for the terminated pods.
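One commonly suggested way to get at those logs (a sketch, using the pod name that appears in the update below) is kubectl's --previous flag, plus kubectl describe for the recent events:
kubectl logs HelloService-522qw --previous
kubectl describe pod HelloService-522qw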
Update:
On running the command below:
kubectl describe pods HelloService-522qw
24s 24s 1 {default-scheduler } Normal Scheduled Successfully assigned HelloService-522qw to ssantosh.centos7
17s 17s 1 {kubelet ssantosh.centos7} spec.containers{HelloService} Normal Created Created container with docker id b550557f4c17; Security:[seccomp=unconfined]
17s 17s 1 {kubelet ssantosh.centos7} spec.containers{HelloService} Normal Started Started container with docker id b550557f4c17
18s 16s 2 {kubelet ssantosh.centos7} spec.containers{HelloService} Normal Pulling pulling image "localhost:5000/HelloService"
18s 16s 2 {kubelet ssantosh.centos7} spec.containers{HelloService} Normal Pulled Successfully pulled image "localhost:5000/HelloService"
15s 15s 1 {kubelet ssantosh.centos7} spec.containers{HelloService} Normal Created Created container with docker id d30b10211b1b; Security:[seccomp=unconfined]
14s 14s 1 {kubelet ssantosh.centos7} spec.containers{HelloService} Normal Started Started container with docker id d30b10211b1b
12s 11s 2 {kubelet ssantosh.centos7} spec.containers{HelloService} Warning BackOff Back-off restarting failed docker container
12s 11s 2 {kubelet ssantosh.centos7} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "HelloService" with CrashLoopBackOff: "Back-off 10s restarting failed container=HelloService pod=HelloService-522qw_default(1e951b45-7116-11e7-9b96-080027f96241)"
The issue was that there is no Java as part of the alpine image.
So I modified
FROM alpine:3.2
to
FROM anapsix/alpine-java
You need a JDK available in the image. You also need to update the Dockerfile and remove the . in front of the startService.sh command, like below:
ENTRYPOINT /startService.sh
This will fix this error message:
/bin/sh: ./startService.sh: not found
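Putting the two fixes together, a minimal sketch of the corrected Dockerfile (keeping the layout from the question; anapsix/alpine-java is just one Java-capable base image, and an absolute path in the ENTRYPOINT is one way to avoid the "not found" error):
FROM anapsix/alpine-java
VOLUME /tmp
ADD HelloService-0.0.1-SNAPSHOT.jar app.jar
VOLUME /etc
ADD /etc/ /etc/
ADD startService.sh /startService.sh
RUN chmod 700 /startService.sh
# absolute path, so the script is found regardless of the working directory
ENTRYPOINT ["/startService.sh"]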
When trying to create Pods that can use a GPU, I get the error: exec: "nvidia-smi": executable file not found in $PATH.
To explain the error from the beginning, my main goal was to create JupyterHub environments that can use a GPU. I installed Zero to JupyterHub for Kubernetes and followed these steps to be able to use the GPU. When I check my node, the GPU seems schedulable by Kubernetes. So far everything seemed fine.
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
NAME GPUs
arge-server 1
However, when I logged in to JupyterHub and tried to open the profile that uses the GPU, I got the error [Warning] 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. So I checked the Pods and found that they were all in the "Waiting: PodInitializing" state.
kubectl get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
nvidia-dcgm-x5rqs 0/1 Init:0/1 2 6d20h
nvidia-device-plugin-daemonset-jhjhb 0/1 Init:0/1 0 6d20h
gpu-feature-discovery-pd4xv 0/1 Init:0/1 2 6d20h
nvidia-dcgm-exporter-7mjgt 0/1 Init:0/1 2 6d20h
nvidia-operator-validator-9xjmv 0/1 Init:Error 10 26m
After that, I took a closer look at the Pod nvidia-operator-validator-9xjmv, which is where the error starts, and I saw that the toolkit-validation container was throwing a CrashLoopBackOff error. Here is the relevant part of the output:
kubectl describe pod nvidia-operator-validator-9xjmv -n gpu-operator-resources
Name: nvidia-operator-validator-9xjmv
Namespace: gpu-operator-resources
.
.
.
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
.
.
.
toolkit-validation:
Container ID: containerd://e7d004f0809cbefdae5407ea42eb659972ea7eefa5dd6e45e968cbf3ed22bf2e
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:a07fd1c74e3e469ac316d17cf79635173764fdab3b681dbc282027a23dbbe227
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 18 Nov 2021 12:55:00 +0300
Finished: Thu, 18 Nov 2021 12:55:00 +0300
Ready: False
Restart Count: 16
Environment:
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hx7ls (ro)
.
.
.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 58m default-scheduler Successfully assigned gpu-operator-resources/nvidia-operator-validator-9xjmv to arge-server
Normal Pulled 58m kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
Normal Created 58m kubelet Created container driver-validation
Normal Started 58m kubelet Started container driver-validation
Normal Pulled 56m (x5 over 58m) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
Normal Created 56m (x5 over 58m) kubelet Created container toolkit-validation
Normal Started 56m (x5 over 58m) kubelet Started container toolkit-validation
Warning BackOff 3m7s (x255 over 58m) kubelet Back-off restarting failed container
Then, I looked at the logs of the container and I got the following error.
kubectl logs -n gpu-operator-resources -f nvidia-operator-validator-9xjmv -c toolkit-validation
time="2021-11-18T09:29:24Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
toolkit is not ready
For similar issues, it was suggested to delete the failed Pod and the deployment. However, doing so did not fix my problem. Do you have any suggestions?
I have:
Ubuntu 20.04
Kubernetes v1.21.6
Docker 20.10.10
NVIDIA-SMI 470.82.01
CUDA 11.4
CPU: Intel Xeon E5-2683 v4 (32) @ 2.097GHz
GPU: NVIDIA GeForce RTX 2080 Ti
Memory: 13815MiB / 48280MiB
Thanks in advance.
In case you are still having the issue: we just had the same issue on our cluster, and the "dirty" fix is to do this:
rm /run/nvidia/driver
ln -s / /run/nvidia/driver
kubectl delete pod -n gpu-operator nvidia-operator-validator-xxxxx
The reason is that the init pod of nvidia-operator-validator tries to execute nvidia-smi within a chroot from /run/nvidia/driver, which is a tmpfs (so it doesn't persist across reboots) and is not populated when performing a manual install of the drivers.
We do hope for a better fix from Nvidia.
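A quick sanity check for this workaround (a sketch, assuming the drivers were installed manually on the host as described above):
ls -l /run/nvidia/driver               # should now be a symlink pointing at /
chroot /run/nvidia/driver nvidia-smi   # should print the usual nvidia-smi table
If the chrooted nvidia-smi works, the toolkit-validation init container should succeed on its next restart.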
I am using a Kubernetes cluster to deploy an image with kubectl create -f dummy.yaml. My image is public on Docker Hub; its size is 1.3 GB.
The image is pulled successfully, but it does not run; the pod is in CrashLoopBackOff.
When I run the deployment creation command kubectl create -f dummy.yaml I get:
NAME READY STATUS RESTARTS AGE
dummy-ser-5459bf444d-9b7sz 0/1 CrashLoopBackOff 118 10h
I tried using
command: [ "/bin/bash", "-c", "--" ]
args: [ "while true; do sleep 30; done;" ]
in my YAML file. It works with a 700 MB image, but it shows CrashLoopBackOff when I use it with the other, 1.3 GB image. It seems the container cannot run after pulling, even though the image is pulled successfully.
The output of kubectl describe pods shows:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 12m default-scheduler Successfully assigned dummy-ser-7797db4cd4-djqdz to node02
Normal SuccessfulMountVolume 12m kubelet, node02 MountVolume.SetUp succeeded for volume "default-token-8p9lq"
Normal Created 1m (x4 over 2m) kubelet, node02 Created container
Normal Started 1m (x4 over 2m) kubelet, node02 Started container
Warning BackOff 53s (x8 over 2m) kubelet, node02 Back-off restarting failed container
Normal Pulling 41s (x5 over 12m) kubelet, node02 pulling image "xxx/dummyenc:bani"
Normal Pulled 40s (x5 over 2m) kubelet, node02 Successfully pulled image "xxx
Thank you in advance.
I fixed this problem. I got the error because the image was not compatible with the hardware I was trying to run it on, an ARMv7 Raspberry Pi. I had created the image on 64-bit Ubuntu using docker build, so that image cannot run on the Raspberry Pi.
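For anyone hitting the same architecture mismatch, one common fix (a sketch, not the exact command used here; xxx/dummyenc:bani is just the placeholder image name from the events above) is to build an ARM variant of the image with Docker buildx:
# build the image for the Raspberry Pi's CPU architecture and push it to the registry
docker buildx build --platform linux/arm/v7 -t xxx/dummyenc:bani --push .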
I have set up a new Kubernetes v1.5 cluster.
I locally created a new Docker image using:
# MAIN IMAGE
FROM gcr.io/google_containers/nginx-slim
I created it using the command:
docker build -t myapp:1 .
I can see that the image is available and running:
docker ps | grep app
d6fc0508e56b myapp:1 "nginx -g 'daemon ..." 31 seconds ago Up 30 seconds 0.0.0.0:32354->80/tcp
Now I am trying to use the same image in a Kubernetes deployment:
kubectl run app-deployment --image myapp:1 --replicas=1 --port=80
But doing so does not start the pod, and I get the error:
19s 3s 2 {kubelet 10.0.0.17} spec.containers{app-deployment} Normal Pulling pulling image "myapp:1"
18s 2s 2 {kubelet 10.0.0.17} spec.containers{app-deployment} Warning Failed Failed to pull image "myapp:1": unauthorized: authentication required
18s 2s 2 {kubelet 10.0.0.17} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "app-deployment" with ErrImagePull: "unauthorized: authentication required"
The files /root/.docker/config.json and /var/lib/kubelet/.dockercfg are currently empty. Is there something I've missed in setting up Kubernetes?
Since you are building a custom Docker image, you have to build it on every node of your cluster that the scheduler could place the pod on.
Furthermore, you need to specify an imagePullPolicy of IfNotPresent in your PodSpec, to tell the kubelet not to try to download the image if it is already present.
This should make your image work, but I strongly suggest you push your image to a Docker registry and let the nodes pull it from there.
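A minimal sketch of such a PodSpec (illustrative only, reusing the name, image and port from the kubectl run command above rather than the Deployment it generates):
apiVersion: v1
kind: Pod
metadata:
  name: app-deployment
spec:
  containers:
  - name: app-deployment
    image: myapp:1
    # do not contact a registry when the image is already on the node
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 80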
I'm new to OpenShift and I have a showstopper:
On my computer I created a Docker image called restservice and I successfully tested it:
docker run -d -p 8080:8080 restservice
Then I created an app in OpenShift Online with the image:
oc new-app restservice
I can see the deployment pod starting, and afterwards the creation of the running pod fails.
With
oc describe pod restservice-2-50n0h
I get the following error:
...
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1m 1m 1 {default-scheduler } Normal Scheduled Successfully assigned restservice-2-50n0h to ip-172-31-54-238.us-west-2.compute.internal
41s 41s 1 {kubelet ip-172-31-54-238.us-west-2.compute.internal} spec.containers{restservice} Normal Pulling pulling image "restservice:latest"
39s 39s 1 {kubelet ip-172-31-54-238.us-west-2.compute.internal} spec.containers{restservice} Warning Failed Failed to pull image "restservice:latest": unauthorized: authentication required
39s 39s 1 {kubelet ip-172-31-54-238.us-west-2.compute.internal} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "restservice" with ErrImagePull: "unauthorized: authentication required"
55s 9s 2 {kubelet ip-172-31-54-238.us-west-2.compute.internal} Warning FailedSync Error syncing pod, skipping: failed to "SetupNetwork" for "restservice-2-50n0h_wgbeckmann" with SetupNetworkError: "Failed to setup network for pod \"restservice-2-50n0h_wgbeckmann(06f892b4-7568-11e7-914e-0a69cdf75e6f)\" using network plugins \"cni\": CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 1 (iptables-restore: line 3 failed\n)\n'; Skipping pod"
I have no idea what authentication is needed.
The missing step is to push the image to the OpenShift Online registry.
So these are the steps:
Build the image on the local computer
docker build -t restservice .
Tag it with registry/username/image-name
docker tag restservice registry.starter-us-west-2.openshift.com/myusername/myrestservice
Get your token for logging into the OpenShift registry
oc whoami -t
sr3grwkegr3kjrk42k2jrg34kb5k43g5k4jg53 (sr3... is the output)
Log in to the OpenShift registry
docker login -u name@mail.com -p sr3grwkegr3kjrk42k2jrg34kb5k43g5k4jg53 https://registry.starter-us-west-2.openshift.com
Push the image to the registry
docker push registry.starter-us-west-2.openshift.com/myusername/myrestservice
Create the new app with the image
oc new-app myrestservice
That's all.
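If you want to double-check that the push landed in your project before creating the app, the image stream should now be visible (a sketch, using the name from the steps above):
oc get imagestream myrestservice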
I have a working installation with Kubernetes 1.1.1 running on Debian.
I also have a private registry running nicely on v2.
I am facing a weird problem.
I am defining a pod on the master:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: docker-registry.hiberus.com:5000/debian:ssh
  imagePullSecrets:
  - name: myregistrykey
I also have the secret on my master
myregistrykey kubernetes.io/dockercfg 1 44m
And my config.json looks like this:
{
  "auths": {
    "https://docker-registry.hiberus.com:5000": {
      "auth": "anNhdXJhOmpzYXVyYQ==",
      "email": "jsaura@heraldo.es"
    }
  }
}
And so I base64-encoded the credentials and created my secret.
Simple as hell.
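(As an aside: on newer Kubernetes versions the same kind of secret can also be created straight from the command line instead of hand-crafting the dockercfg; a sketch with the server and email from the config.json above and the credentials replaced by placeholders:)
kubectl create secret docker-registry myregistrykey \
  --docker-server=https://docker-registry.hiberus.com:5000 \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=jsaura@heraldo.es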
on my node the image gets pulled without any problem
docker images
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
docker-registry.hiberus.com:5000/debian ssh 3b332951c107 29 minutes ago 183.3 MB
golang 1.4 2819d1d84442 7 days ago 562.7 MB
debian latest 91bac885982d 8 days ago 125.1 MB
gcr.io/google_containers/pause 0.8.0 2c40b0526b63 7 months ago 241.7 kB
but my container does not start
./kubectl describe pod nginx
Name: nginx
Namespace: default
Image(s): docker-registry.hiberus.com:5000/debian:ssh
Node: 192.168.29.122/192.168.29.122
Start Time: Wed, 18 Nov 2015 17:08:53 +0100
Labels: app=nginx
Status: Running
Reason:
Message:
IP: 172.17.0.2
Replication Controllers:
Containers:
nginx:
Container ID: docker://3e55ab118a3e5d01d3c58361abb1b23483d41be06741ce747d4c20f5abfeb15f
Image: docker-registry.hiberus.com:5000/debian:ssh
Image ID: docker://3b332951c1070ba2d7a3bb439787a8169fe503ed8984bcefd0d6c273d22d4370
State: Waiting
Reason: CrashLoopBackOff
Last Termination State: Terminated
Reason: Error
Exit Code: 0
Started: Wed, 18 Nov 2015 17:08:59 +0100
Finished: Wed, 18 Nov 2015 17:08:59 +0100
Ready: False
Restart Count: 2
Environment Variables:
Conditions:
Type Status
Ready False
Volumes:
default-token-ha0i4:
Type: Secret (a secret that should populate this volume)
SecretName: default-token-ha0i4
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
───────── ──────── ───── ──── ───────────── ────── ───────
16s 16s 1 {kubelet 192.168.29.122} implicitly required container POD Created Created with docker id 4a063be27162
16s 16s 1 {kubelet 192.168.29.122} implicitly required container POD Pulled Container image "gcr.io/google_containers/pause:0.8.0" already present on machine
16s 16s 1 {kubelet 192.168.29.122} implicitly required container POD Started Started with docker id 4a063be27162
16s 16s 1 {kubelet 192.168.29.122} spec.containers{nginx} Pulling Pulling image "docker-registry.hiberus.com:5000/debian:ssh"
15s 15s 1 {scheduler } Scheduled Successfully assigned nginx to 192.168.29.122
11s 11s 1 {kubelet 192.168.29.122} spec.containers{nginx} Created Created with docker id 36df2dc8b999
11s 11s 1 {kubelet 192.168.29.122} spec.containers{nginx} Pulled Successfully pulled image "docker-registry.hiberus.com:5000/debian:ssh"
11s 11s 1 {kubelet 192.168.29.122} spec.containers{nginx} Started Started with docker id 36df2dc8b999
10s 10s 1 {kubelet 192.168.29.122} spec.containers{nginx} Pulled Container image "docker-registry.hiberus.com:5000/debian:ssh" already present on machine
10s 10s 1 {kubelet 192.168.29.122} spec.containers{nginx} Created Created with docker id 3e55ab118a3e
10s 10s 1 {kubelet 192.168.29.122} spec.containers{nginx} Started Started with docker id 3e55ab118a3e
5s 5s 1 {kubelet 192.168.29.122} spec.containers{nginx} Backoff Back-off restarting failed docker container
It loops internally, trying to start, but it never does.
The weird thing is that if I do a docker run command on my node manually, the container starts without any problem, but when using the pod, the image is pulled but the container never starts.
Am I doing something wrong?
If I use a public image for my pod, it starts without any problem. This only happens to me when using private images.
I have also moved from Debian to Ubuntu: no luck, same problem.
I have also linked the secret to the default service account, still no luck.
I cloned the latest git version and compiled it, no luck.
It is clear to me that the problem is related to using a private registry, but I have applied and followed all the info I have read and still no luck.
A Docker container exits when its main process exits.
Could you share the container logs?
If you do docker ps -a you should see all running and exited containers.
Run docker container logs container_id
Also try running your container in interactive and in daemon mode and see if it fails only in daemon mode.
Running in daemon mode:
docker run -d -t Image_name
Running in interactive mode:
docker run -it Image_name
For interactive daemon mode: docker run -idt Image_name
Refer to: Why docker container exits immediately
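To illustrate the point about the main process: if the debian:ssh image does not start a long-running foreground process on its own, the container will keep exiting immediately no matter how the image was pulled. A minimal sketch (reusing the pod definition from the question) that keeps the container alive just for debugging:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: docker-registry.hiberus.com:5000/debian:ssh
    # keep a foreground process running so the container does not exit right away
    command: ["sleep", "infinity"]
  imagePullSecrets:
  - name: myregistrykey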