Kubernetes GPU Pod error: validating toolkit installation: exec: "nvidia-smi": executable file not found in $PATH

When trying to create Pods that can use a GPU, I get the error: exec: "nvidia-smi": executable file not found in $PATH.
To explain the error from the beginning: my main goal was to create JupyterHub environments that can use a GPU. I installed Zero to JupyterHub for Kubernetes and followed these steps to enable GPU support. When I check my nodes, the GPU appears to be schedulable by Kubernetes, so everything seemed fine up to this point.
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
NAME GPUs
arge-server 1
However, when I logged in to JupyterHub and tried to open the profile that uses the GPU, I got an error: [Warning] 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. So I checked the Pods and found that they were all in the "Waiting: PodInitializing" state.
kubectl get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
nvidia-dcgm-x5rqs 0/1 Init:0/1 2 6d20h
nvidia-device-plugin-daemonset-jhjhb 0/1 Init:0/1 0 6d20h
gpu-feature-discovery-pd4xv 0/1 Init:0/1 2 6d20h
nvidia-dcgm-exporter-7mjgt 0/1 Init:0/1 2 6d20h
nvidia-operator-validator-9xjmv 0/1 Init:Error 10 26m
After that, I took a closer look at the Pod nvidia-operator-validator-9xjmv, which was the beginning of the error, and I saw that the toolkit-validation container was throwing a CrashLoopBackOff error. Here is the relevant part of the log:
kubectl describe pod nvidia-operator-validator-9xjmv -n gpu-operator-resources
Name: nvidia-operator-validator-9xjmv
Namespace: gpu-operator-resources
.
.
.
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
.
.
.
toolkit-validation:
Container ID: containerd://e7d004f0809cbefdae5407ea42eb659972ea7eefa5dd6e45e968cbf3ed22bf2e
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:a07fd1c74e3e469ac316d17cf79635173764fdab3b681dbc282027a23dbbe227
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 18 Nov 2021 12:55:00 +0300
Finished: Thu, 18 Nov 2021 12:55:00 +0300
Ready: False
Restart Count: 16
Environment:
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hx7ls (ro)
.
.
.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 58m default-scheduler Successfully assigned gpu-operator-resources/nvidia-operator-validator-9xjmv to arge-server
Normal Pulled 58m kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
Normal Created 58m kubelet Created container driver-validation
Normal Started 58m kubelet Started container driver-validation
Normal Pulled 56m (x5 over 58m) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
Normal Created 56m (x5 over 58m) kubelet Created container toolkit-validation
Normal Started 56m (x5 over 58m) kubelet Started container toolkit-validation
Warning BackOff 3m7s (x255 over 58m) kubelet Back-off restarting failed container
Then, I looked at the logs of the container and I got the following error.
kubectl logs -n gpu-operator-resources -f nvidia-operator-validator-9xjmv -c toolkit-validation
time="2021-11-18T09:29:24Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
toolkit is not ready
For similar issues, it was suggested to delete the failed Pod and the deployment. However, doing so did not fix my problem. Do you have any suggestions?
I have:
Ubuntu 20.04
Kubernetes v1.21.6
Docker 20.10.10
NVIDIA-SMI 470.82.01
CUDA 11.4
CPU: Intel Xeon E5-2683 v4 (32) @ 2.097GHz
GPU: NVIDIA GeForce RTX 2080 Ti
Memory: 13815MiB / 48280MiB
Thanks in advance.

In case you are still having the issue: we just had the same problem on our cluster. The "dirty" fix is the following:
rm /run/nvidia/driver
ln -s / /run/nvidia/driver
kubectl delete pod -n gpu-operator nvidia-operator-validator-xxxxx
The reason is that the init container of the nvidia-operator-validator Pod tries to execute nvidia-smi within a chroot at /run/nvidia/driver, which is a tmpfs (so it doesn't persist across reboots) and is not populated when the drivers are installed manually.
Let's hope for a better fix from NVIDIA.
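Before applying the symlink workaround, it may help to confirm that the chroot target really lacks nvidia-smi. The following is a minimal sketch, not part of the GPU Operator: the function name has_nvidia_smi and the list of bin directories it searches are assumptions about common driver layouts.

```shell
#!/bin/sh
# Hypothetical helper: report whether an executable nvidia-smi exists under
# a given root directory, roughly what a chroot into that root would need.
# The searched bin directories are an assumption, not an exhaustive list.
has_nvidia_smi() {
    root=${1%/}   # strip one trailing slash so "/" becomes ""
    for dir in usr/bin usr/local/bin bin; do
        if [ -x "$root/$dir/nvidia-smi" ]; then
            echo "found: $root/$dir/nvidia-smi"
            return 0
        fi
    done
    echo "nvidia-smi not found under $1"
    return 1
}
```

On the GPU node, has_nvidia_smi /run/nvidia/driver checks the path the validator uses, and has_nvidia_smi / checks the host installation. If only the second call succeeds, the situation matches the one described above.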

Related

CrashLoopBackOff while deploying pod using image from private registry

I am trying to create a pod using my own docker image on localhost.
This is the dockerfile used to create the image :
FROM centos:8
RUN yum install -y gdb
RUN yum group install -y "Development Tools"
CMD ["/usr/bin/bash"]
The yaml file used to create the pod is this :
---
apiVersion: v1
kind: Pod
metadata:
name: server
labels:
app: server
spec:
containers:
- name: server
imagePullPolicy: Never
image: localhost:5000/server
ports:
- containerPort: 80
root@node1:~/test/server# docker images | grep server
server latest 82c5228a553d 3 hours ago 948MB
localhost.localdomain:5000/server latest 82c5228a553d 3 hours ago 948MB
localhost:5000/server latest 82c5228a553d 3 hours ago 948MB
The image has been pushed to localhost registry.
Following is the error I receive.
root@node1:~/test/server# kubectl get pods
NAME READY STATUS RESTARTS AGE
server 0/1 CrashLoopBackOff 5 5m18s
The output of describe pod :
root@node1:~/test/server# kubectl describe pod server
Name: server
Namespace: default
Priority: 0
Node: node1/10.0.2.15
Start Time: Mon, 07 Dec 2020 15:35:49 +0530
Labels: app=server
Annotations: cni.projectcalico.org/podIP: 10.233.90.192/32
cni.projectcalico.org/podIPs: 10.233.90.192/32
Status: Running
IP: 10.233.90.192
IPs:
IP: 10.233.90.192
Containers:
server:
Container ID: docker://c2982e677bf37ff11272f9ea3f68565e0120fb8ccfb1595393794746ee29b821
Image: localhost:5000/server
Image ID: docker-pullable://localhost.localdomain:5000/server@sha256:6bc8193296d46e1e6fa4cb849fa83cb49e5accc8b0c89a14d95928982ec9d8e9
Port: 80/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 07 Dec 2020 15:41:33 +0530
Finished: Mon, 07 Dec 2020 15:41:33 +0530
Ready: False
Restart Count: 6
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-tb7wb (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-tb7wb:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-tb7wb
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m default-scheduler Successfully assigned default/server to node1
Normal Pulled 4m34s (x5 over 5m59s) kubelet Container image "localhost:5000/server" already present on machine
Normal Created 4m34s (x5 over 5m59s) kubelet Created container server
Normal Started 4m34s (x5 over 5m59s) kubelet Started container server
Warning BackOff 56s (x25 over 5m58s) kubelet Back-off restarting failed container
I get no logs :
root@node1:~/test/server# kubectl logs -f server
root@node1:~/test/server#
I am unable to figure out whether the issue is with the container or yaml file for creating pod. Any help would be appreciated.

Posting this as Community Wiki.
As pointed out by @David Maze in the comments section:
If docker run exits immediately, a Kubernetes Pod will always go into CrashLoopBackOff state. Your Dockerfile needs to COPY in or otherwise install an application and set its CMD to run it.
The root cause can also be determined from the Exit Code. In the "3) Check the exit code" article, you can find a few exit codes such as 0, 1, 128 and 137, each with a description.
3.1) Exit Code 0
This exit code implies that the specified container command completed ‘successfully’, but too often for Kubernetes to accept as working.
In short, your container was created, every action it was given was executed and, as there was nothing else to do, it exited with Exit Code 0.
A CrashLoopBackOff error occurs when a pod startup fails repeatedly in Kubernetes.
Your image, based on centos with a few additional installations, left no running process behind, so the container was categorized as Completed. As this happens so fast, Kubernetes restarted it and it fell into a loop.
$ kubectl run centos --image=centos
$ kubectl get po -w
NAME READY STATUS RESTARTS AGE
centos 0/1 CrashLoopBackOff 1 5s
centos 0/1 Completed 2 17s
centos 0/1 CrashLoopBackOff 2 31s
centos 0/1 Completed 3 46s
centos 0/1 CrashLoopBackOff 3 58s
centos 1/1 Running 4 88s
centos 0/1 Completed 4 89s
centos 0/1 CrashLoopBackOff 4 102s
$ kubectl describe po centos | grep 'Exit Code'
Exit Code: 0
But if you use sleep 3600 as the container command, sleep keeps running for an hour. After that time the container would also exit with Exit Code 0.
Hope that clarifies it.
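The exit-code reasoning above can be sketched as a small lookup. This is an illustration only: the function name describe_exit_code and the message wording are hypothetical and are not Kubernetes output. The one convention it relies on is that an exit code of 128+N usually means the process was killed by signal N (so 137 is signal 9, SIGKILL).

```shell
#!/bin/sh
# Sketch: translate a container exit code into a rough diagnosis.
# Message texts are illustrative; they are not produced by Kubernetes.
describe_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "completed: the main process finished; give the container a long-running command"
    elif [ "$code" -gt 128 ] && [ "$code" -le 255 ]; then
        # Shell convention: 128+N means the process was killed by signal N.
        echo "killed by signal $((code - 128))"
    else
        echo "error exit $code: inspect the output of kubectl logs --previous"
    fi
}
```

For the pod in the question, describe_exit_code 0 points at the CMD, which starts bash with no terminal attached and therefore returns immediately.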

kubernetes 1.12.2 failed to load Kubelet config file /var/lib/kubelet/config.yaml

Environment:
Kubernetes 1.12.2
Docker 18.9.0
microk8s.kubectl
$ k get all
NAME READY STATUS RESTARTS AGE
pod/mysql-0 1/1 Running 0 72s
pod/nginx-ingress-microk8s-controller-c2pgz 0/1 CrashLoopBackOff 129 22h
pod/web-0 1/1 Running 0 78s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.152.183.1 <none> 443/TCP 70m
service/mysql-service ClusterIP None <none> 3306/TCP 72s
service/nginx-service ClusterIP None <none> 80/TCP 78s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/nginx-ingress-microk8s-controller 1 1 0 1 0 <none> 2d22h
NAME DESIRED CURRENT AGE
statefulset.apps/mysql 1 1 72s
statefulset.apps/web 1 1 78s
/var/log/syslog:
failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory
Error syncing pod f0ab0f74-e6f2-11e8-8410-482ae31e6a94 ("nginx-ingress-microk8s-controller-c2pgz_default(f0ab0f74-e6f2-11e8-8410-482ae31e6a94)"), skipping: failed to "StartContainer" for "nginx-ingress-microk8s" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=nginx-ingress-microk8s pod=nginx-ingress-microk8s-controller-c2pgz_default(f0ab0f74-e6f2-11e8-8410-482ae31e6a94)"
What is nginx-ingress-microk8s-controller-c2pgz? Who started it?

You mentioned in the comments that the cause is related to kubeadm init failing.
The /var/lib/kubelet/config.yaml config file is populated only after:
A successful cluster initialization (kubeadm init) on the master node.
On a worker node, a successful join to the cluster (kubeadm join).
So if the problem is with kubeadm init, you should check the command's output (it would also be great if you could paste it in the question).
Make sure you don't run kubeadm init with the --ignore-preflight-errors=all flag.
I'm not familiar with your specific error, but to make the answer more helpful, here are some possible solutions:
Make sure all requirements for kubeadm are in place.
Check the firewall rules: make sure you don't block egress traffic and that the ingress rule for port 6443 is open for the worker node (relevant for the joining phase).
Make sure that the required ports are not occupied.
Try restarting the kubelet with systemctl restart kubelet and check the latest logs with: sudo journalctl -u kubelet -n 100 --no-pager.
Check whether Docker can be updated to a newer, more stable version.
Try running kubeadm reset and make sure you re-run kubeadm init with the latest version or with a specific stable version by adding --kubernetes-version=X.Y.Z.
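As a first check before working through the list above, one could confirm whether the kubelet config file exists at all. The helper below is a hedged sketch; the name check_kubelet_config is hypothetical, and the default path is simply the one from the error message.

```shell
#!/bin/sh
# Sketch: check that the kubelet config file exists and is non-empty.
# The default path is taken from the error message in the question.
check_kubelet_config() {
    config=${1:-/var/lib/kubelet/config.yaml}
    if [ -s "$config" ]; then
        echo "ok: $config exists"
    else
        echo "missing: $config (run kubeadm init on the control plane or kubeadm join on a worker)"
        return 1
    fi
}
```

Calling check_kubelet_config with no argument inspects the default location; a path can be passed to check a copy elsewhere.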

As per RtmY's answer, this works only once the cluster initialization completes correctly.
After running the following:
kubeadm init --pod-network-cidr=192.168.0.0/16
it worked successfully.

After I updated the kubelet, I could no longer find /var/lib/kubelet/config.yaml.
Running "systemctl status kubelet|journalctl -xe" showed:
failed to load Kubelet config file /var/lib/kubelet/config.yaml
As described in the link below, I copied config.yaml from another working worker node, and it worked!
https://github.com/kubernetes/kubernetes/issues/65863#issuecomment-403003592

Pod gets into status of CrashLoopBackOff and gets restarted repeatedly - Exit code is 0

I have a docker container that is running fine when I run it using docker run. I am trying to put that container inside a pod but I am facing issues. The first run of the pod shows status as "Completed". And then the pod keeps restarting with CrashLoopBackoff status. The exit code however is 0.
Here is the result of kubectl describe pod :
Name: messagingclientuiui-6bf95598db-5znfh
Namespace: mgmt
Node: db1mgr0deploy01/172.16.32.68
Start Time: Fri, 03 Aug 2018 09:46:20 -0400
Labels: app=messagingclientuiui
pod-template-hash=2695115486
Annotations: <none>
Status: Running
IP: 10.244.0.7
Controlled By: ReplicaSet/messagingclientuiui-6bf95598db
Containers:
messagingclientuiui:
Container ID: docker://a41db3bcb584582e9eacf26b02c7ef26f57c2d43b813f44e4fd1ba63347d3fc3
Image: 172.32.1.4/messagingclientuiui:667-I20180802-0202
Image ID: docker-pullable://172.32.1.4/messagingclientuiui@sha256:89a002448660e25492bed1956cfb8fff447569e80ac8b7f7e0fa4d44e8abee82
Port: 9087/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 03 Aug 2018 09:50:06 -0400
Finished: Fri, 03 Aug 2018 09:50:16 -0400
Ready: False
Restart Count: 5
Environment Variables from:
mesg-config ConfigMap Optional: false
Environment: <none>
Mounts:
/docker-mount from messuimount (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-2pthw (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
messuimount:
Type: HostPath (bare host directory volume)
Path: /mon/monitoring-messui/docker-mount
HostPathType:
default-token-2pthw:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-2pthw
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m default-scheduler Successfully assigned messagingclientuiui-6bf95598db-5znfh to db1mgr0deploy01
Normal SuccessfulMountVolume 4m kubelet, db1mgr0deploy01 MountVolume.SetUp succeeded for volume "messuimount"
Normal SuccessfulMountVolume 4m kubelet, db1mgr0deploy01 MountVolume.SetUp succeeded for volume "default-token-2pthw"
Normal Pulled 2m (x5 over 4m) kubelet, db1mgr0deploy01 Container image "172.32.1.4/messagingclientuiui:667-I20180802-0202" already present on machine
Normal Created 2m (x5 over 4m) kubelet, db1mgr0deploy01 Created container
Normal Started 2m (x5 over 4m) kubelet, db1mgr0deploy01 Started container
Warning BackOff 1m (x8 over 4m) kubelet, db1mgr0deploy01 Back-off restarting failed container
kubectl get pods
NAME READY STATUS RESTARTS AGE
messagingclientuiui-6bf95598db-5znfh 0/1 CrashLoopBackOff 9 23m
I am assuming we need a loop to keep the container running in this case. But I don't understand why it worked when run with Docker and does not work inside a pod. Shouldn't it behave the same?
How do we generally debug a CrashLoopBackOff status, apart from running kubectl describe pod and kubectl logs?

A container terminates with exit code 0 when its main process finishes successfully and nothing else is left running. To keep the container running, add these to the container spec in the deployment configuration:
command: ["sh"]
stdin: true
Replace sh with bash or any other shell that the image may have.
Then you can drop inside the container with exec:
kubectl exec -it <pod-name> -- sh
Add the -c <container-name> argument if the pod has more than one container.
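Put together, the two settings above could be tried on a throwaway Pod. The sketch below only prints a manifest; the function name render_debug_pod, its arguments and the container name "debug" are hypothetical, and tty: true is an extra setting that is not part of this answer.

```shell
#!/bin/sh
# Sketch: print a Pod manifest whose main process is an interactive shell
# that stays open because stdin is kept attached.
# render_debug_pod and the container name "debug" are hypothetical.
render_debug_pod() {
    name=$1
    image=$2
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $name
spec:
  containers:
  - name: debug
    image: $image
    command: ["sh"]
    stdin: true
    tty: true
EOF
}
```

The output can be piped to kubectl, for example render_debug_pod debug-shell <image> | kubectl apply -f -, after which kubectl exec -it debug-shell -- sh opens a second shell in the running container.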

Are you sure you ran your software as docker run ... -d ... <command>, that it kept running, and that you use exactly the same command in your pod? If you compare against something that runs on Docker with -it and without -d, you might find yourself in a pinch: such programs expect a terminal to communicate with the user and exit if no tty is available (hint: a pod's container can be run with tty: true).
It is very unlikely that software runs in a detached Docker container but not in Kubernetes.

pod creation stuck in ContainerCreating state

I have created a k8s cluster with RHEL7 with kubernetes packages GitVersion:"v1.8.1". I'm trying to deploy wordpress on my custom cluster. But pod creation is always stuck in ContainerCreating state.
[phani@k8s-master]$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
default wordpress-766d75457d-zlvdn 0/1 ContainerCreating 0 11m
kube-system etcd-k8s-master 1/1 Running 0 1h
kube-system kube-apiserver-k8s-master 1/1 Running 0 1h
kube-system kube-controller-manager-k8s-master 1/1 Running 0 1h
kube-system kube-dns-545bc4bfd4-bb8js 3/3 Running 0 1h
kube-system kube-proxy-bf4zr 1/1 Running 0 1h
kube-system kube-proxy-d7zvg 1/1 Running 0 34m
kube-system kube-scheduler-k8s-master 1/1 Running 0 1h
kube-system weave-net-92zf9 2/2 Running 0 34m
kube-system weave-net-sh7qk 2/2 Running 0 1h
Docker Version:1.13.1
Pod status from the describe command:
Normal Scheduled 18m default-scheduler Successfully assigned wordpress-766d75457d-zlvdn to worker1
Normal SuccessfulMountVolume 18m kubelet, worker1 MountVolume.SetUp succeeded for volume "default-token-tmpcm"
Warning DNSSearchForming 18m kubelet, worker1 Search Line limits were exceeded, some dns names have been omitted, the applied search line is: default.svc.cluster.local svc.cluster.local cluster.local
Warning FailedCreatePodSandBox 14m kubelet, worker1 Failed create pod sandbox.
Warning FailedSync 25s (x8 over 14m) kubelet, worker1 Error syncing pod
Normal SandboxChanged 24s (x8 over 14m) kubelet, worker1 Pod sandbox changed, it will be killed and re-created.
In the kubelet log on the worker I observed the following error:
error: failed to run Kubelet: failed to create kubelet: misconfiguration: kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"
But the kubelet is stable; no problems are seen on the worker.
How do I solve this problem?
I checked for a CNI failure but couldn't find anything.
~]# ls /opt/cni/bin
bridge cnitool dhcp flannel host-local ipvlan loopback macvlan noop ptp tuning weave-ipam weave-net weave-plugin-2.3.0
In the journal logs, the messages below appear repeatedly. It seems the scheduler is trying to create the container all the time.
Jun 08 11:25:22 worker1 kubelet[14339]: E0608 11:25:22.421184 14339 remote_runtime.go:115] StopPodSandbox "47da29873230d830f0ee21adfdd3b06ed0c653a0001c29289fe78446d27d2304" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jun 08 11:25:22 worker1 kubelet[14339]: E0608 11:25:22.421212 14339 kuberuntime_manager.go:780] Failed to stop sandbox {"docker" "47da29873230d830f0ee21adfdd3b06ed0c653a0001c29289fe78446d27d2304"}
Jun 08 11:25:22 worker1 kubelet[14339]: E0608 11:25:22.421247 14339 kuberuntime_manager.go:580] killPodWithSyncResult failed: failed to "KillPodSandbox" for "7f1c6bf1-6af3-11e8-856b-fa163e3d1891" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Jun 08 11:25:22 worker1 kubelet[14339]: E0608 11:25:22.421262 14339 pod_workers.go:182] Error syncing pod 7f1c6bf1-6af3-11e8-856b-fa163e3d1891 ("wordpress-766d75457d-spdrb_default(7f1c6bf1-6af3-11e8-856b-fa163e3d1891)"), skipping: failed to "KillPodSandbox" for "7f1c6bf1-6af3-11e8-856b-fa163e3d1891" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"

Failed create pod sandbox.
... is almost always a CNI failure; I would check on the node that all the weave containers are happy, and that /opt/cni/bin is present (or its weave equivalent)
You may have to check both the journalctl -u kubelet.service as well as the docker logs for any containers running to discover the full scope of the error on the node.

It seems to work after removing $KUBELET_NETWORK_ARGS from /etc/systemd/system/kubelet.service.d/10-kubeadm.conf.
I removed $KUBELET_NETWORK_ARGS and restarted the worker node, and then the pods were deployed successfully.

As Matthew said, it's most likely a CNI failure.
First, find the node this pod is running on:
kubectl get po wordpress-766d75457d-zlvdn -o wide
Next, on the node where the pod is located, check /etc/cni/net.d: if there is more than one .conf file, you can delete one and restart the node.
Source: https://github.com/kubernetes/kubeadm/issues/578.
Note that this is only one of the possible solutions.
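The check described above can be sketched as a small counter run on the affected node. The name count_cni_configs is hypothetical, and counting .conflist files alongside .conf files is an assumption that goes beyond what this answer states.

```shell
#!/bin/sh
# Sketch: count CNI network configuration files in a directory.
# Counting *.conflist in addition to *.conf is an assumption.
count_cni_configs() {
    dir=${1:-/etc/cni/net.d}
    count=0
    for f in "$dir"/*.conf "$dir"/*.conflist; do
        # An unmatched glob stays literal, so test that the file exists.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

A result greater than 1 from count_cni_configs suggests that more than one network plugin configuration is present, which is the situation this answer describes.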

While hopefully it's no one else's problem, for me, this happened when part of my filesystem was full.
I had pods stuck in ContainerCreating only on one node in my cluster. I also had a bunch of pods which I expected to shutdown, but hadn't. Someone recommended running
sudo systemctl status kubelet -l
which showed me a bunch of lines like
Jun 18 23:19:56 worker01 kubelet[1718]: E0618 23:19:56.461378 1718 kuberuntime_manager.go:647] createPodSandbox for pod "REDACTED(2c681b9c-cf5b-11eb-9c79-52540077cc53)" failed: mkdir /var/log/pods/2c681b9c-cf5b-11eb-9c79-52540077cc53: no space left on device
I confirmed that I was out of space with
$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 189G 0 189G 0% /dev
tmpfs 189G 0 189G 0% /sys/fs/cgroup
/dev/mapper/vg01-root 20G 7.0G 14G 35% /
/dev/mapper/vg01-tmp 4.0G 34M 4.0G 1% /tmp
/dev/mapper/vg01-home 4.0G 72M 4.0G 2% /home
/dev/mapper/vg01-varlog 10G 10G 20K 100% /var/log
/dev/mapper/vg01-varlogaudit 2.0G 68M 2.0G 4% /var/log/audit
I just had to clear out that dir (and did some manual cleanup on all the pending pods and pods that were stuck running).
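To spot this condition quickly, the df output can be filtered for nearly full filesystems. The sketch below is hypothetical (the name full_filesystems and the default threshold of 90 percent are assumptions); it expects df -P style output on stdin and does not handle mount points that contain spaces.

```shell
#!/bin/sh
# Sketch: read `df -P` style output on stdin and print the mount point and
# usage of every filesystem at or above a usage threshold (default 90%).
# Assumes the usage percentage is column 5 and the mount point column 6.
full_filesystems() {
    threshold=${1:-90}
    awk -v limit="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # drop the percent sign before comparing
        if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

On a node, df -P | full_filesystems 90 lists only the filesystems that need attention; with the output shown above it would report /var/log.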

Error syncing pod,failed for registry.access.redhat.com (Kubernetes)

kubectl create -f web.yml
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
httpd 0/1 ContainerCreating 0 1h kube-node2
[root@kube-master pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
httpd 0/1 ContainerCreating 0 1h kube-node2
[root@kube-master pods]# kubectl describe pods httpd
Name: httpd
Namespace: default
Node: kube-node2/10.10.0.102
Start Time: Mon, 30 Oct 2017 17:47:38 +0600
Labels: app=webserver
Status: Pending
IP:
Controllers:
Containers:
httpd:
Container ID:
Image: webserver
Image ID:
Port: 80/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Volume Mounts:
Environment Variables:
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
No volumes.
QoS Class: BestEffort
Tolerations:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1h 5m 16 {kubelet kube-node2} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "POD" with ErrImagePull: "image pull failed for registry.access.redhat.com/rhel7/pod-infrastructure:latest, this may be because there are no credentials on this request. details: (open /etc/docker/certs.d/registry.access.redhat.com/redhat-ca.crt: no such file or directory)"
1h 8s 271 {kubelet kube-node2} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "POD" with ImagePullBackOff: "Back-off pulling image \"registry.access.redhat.com/rhel7/pod-infrastructure:latest\""
I expected the image to be pulled from Docker Hub, but the event says:
Error syncing pod, skipping: failed to "StartContainer" for "POD" with ErrImagePull: "image pull failed for registry.access.redhat.com/rhel7/pod-infrastructure:latest, this may be because there are no credentials on this request. details: (open /etc/docker/certs.d/registry.access.redhat.com/redhat-ca.crt: no such file or directory)"
Why? Please suggest a solution.

I encountered the same problem and found that the rhsm-related software was not installed on the machine. You can run "yum install rhsm" to solve this problem.

For me the file /etc/rhsm/ca/redhat-uep.pem was missing.
I had to uninstall and reinstall docker/kubernetes on the minion to get the file back and it worked again. What a pain.
My environment is on CentOS Linux release 7.4.1708
And these rpms.
kubernetes-master-1.5.2-0.7.git269f928.el7.x86_64
kubernetes-1.5.2-0.7.git269f928.el7.x86_64
kubernetes-node-1.5.2-0.7.git269f928.el7.x86_64
kubernetes-client-1.5.2-0.7.git269f928.el7.x86_64
docker-1.12.6-71.git3e8e77d.el7.centos.1.x86_64
docker-client-1.12.6-71.git3e8e77d.el7.centos.1.x86_64
docker-common-1.12.6-71.git3e8e77d.el7.centos.1.x86_64
There is no rhsm in CentOS.
This post has the alternative to rhsm for CentOS.
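To tell a missing certificate from a broken link, the path from the error message can be inspected directly. This is a sketch under an assumption: on these hosts redhat-ca.crt is taken to be a symlink into /etc/rhsm/ca/, which would explain why the missing redhat-uep.pem produces the error. The function name check_registry_ca is hypothetical.

```shell
#!/bin/sh
# Sketch: report whether the registry CA certificate is present, missing,
# or a symlink whose target does not exist.
# The default path is the one quoted in the image pull error.
check_registry_ca() {
    cert=${1:-/etc/docker/certs.d/registry.access.redhat.com/redhat-ca.crt}
    if [ -L "$cert" ] && [ ! -e "$cert" ]; then
        echo "dangling symlink: $cert -> $(readlink "$cert")"
        return 1
    elif [ ! -e "$cert" ]; then
        echo "missing: $cert"
        return 1
    fi
    echo "ok: $cert"
}
```

A "dangling symlink" result names the target file that has to be restored, for example by installing the package that provides it.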

I hit the same issue on CentOS 7.
yum install rhsm did not work for me; it gave the following output:
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: centos.ustc.edu.cn
* extras: mirrors.zju.edu.cn
* updates: centos.ustc.edu.cn
No package rhsm available.
Error: Nothing to do
But yum install subscription-manager works well for me.
