After a few days of running dockerd on a Kubernetes host, where pods are managed by the kubelet, dockerd degrades and consumes a lot of resources (about 50% of memory, roughly 4 GB).
When it gets into this state, it cannot act on commands for containers that docker ps reports as running. Checking ps -ef on the host also shows that these containers don't map to any underlying host processes.
docker exec yields:
level=error msg="Error running exec in container: rpc error: code = 2 desc = containerd: container not found"
Cannot kill container 6a8d4....8: rpc error: code = 14 desc = grpc: the connection is unavailable
level=fatal msg="open /var/run/docker/libcontainerd/containerd/7657...4/65...6/process.json: no such file or directory"
Looking through the process tree on the host, there are many defunct processes whose parent PID is dockerd. Any pointers on what the issue might be or where to look further?
I have enabled debug logging on dockerd to see whether the issue recurs. Restarting dockerd fixes the issue.
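To put a number on the defunct processes described above, one option is to filter ps output by parent PID and process state. This is only a minimal sketch, assuming a procps-style ps that accepts -eo pid,ppid,stat,comm; the function name zombies_of is made up for illustration.

```shell
# zombies_of PPID: read "PID PPID STAT COMMAND" rows on stdin and print the
# rows whose parent is PPID and whose state starts with Z (zombie/defunct).
zombies_of() {
  awk -v parent="$1" '$2 == parent && $3 ~ /^Z/ { print }'
}

# Usage on a live host (assumes pidof is available and returns one PID):
#   ps -eo pid,ppid,stat,comm | zombies_of "$(pidof dockerd)" | wc -l
```

Tracking this count over time may show whether the zombies accumulate steadily or appear in bursts around specific pod events.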
It sounds like a container is misbehaving and Docker is not able to reap it. I would look at what has been scheduled on the nodes where you see the problem. The errors you are seeing suggest that the Docker daemon is not responding to API requests issued by the docker CLI. Some pointers:
Have the containers exited successfully or with an error?
Did the containers get killed for some reason?
Check the kubelet logs.
Check the kube-scheduler logs.
Follow the logs of the containers on your node: docker logs -f <containerid>
I have a Kubernetes cluster set up at home on two bare-metal machines.
I used kubespray to install both; it uses kubeadm behind the scenes.
The problem is that all containers in the cluster have restartPolicy: no, which breaks my cluster when I restart the main node.
After a reboot, I have to manually run "docker container start" for every container in the "kube-system" namespace to make it work.
Does anyone have an idea where the problem might be coming from?
Docker provides restart policies to control whether your containers start automatically when they exit or when Docker restarts. Here your containers have the restart policy no, which means Docker never restarts them automatically under any circumstance.
You need to change the restart policy to always, which restarts the container whenever it stops. If the container is manually stopped, it is restarted only when the Docker daemon restarts or the container itself is manually restarted.
You can change the restart policy of an existing container using docker update. Pass the name of the container to the command. You can find container names by running docker ps -a.
docker update --restart=always <CONTAINER NAME>
Restart policy details:
Keep the following in mind when using restart policies:
A restart policy only takes effect after a container starts successfully. In this case, starting successfully means that the container is up for at least 10 seconds and Docker has started monitoring it. This prevents a container which does not start at all from going into a restart loop.
If you manually stop a container, its restart policy is ignored until the Docker daemon restarts or the container is manually restarted. This is another attempt to prevent a restart loop.
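Before changing anything, it may help to see which containers actually carry the no policy. The sketch below assumes docker inspect is given a Go template that prints each container's name and restart policy name; the helper policy_is_no is a hypothetical name and only filters that text.

```shell
# policy_is_no: read "NAME POLICY" rows on stdin and print the names of
# containers whose restart policy is "no".
policy_is_no() {
  awk '$2 == "no" { print $1 }'
}

# Usage on a live host (requires a running Docker daemon and at least one container):
#   docker inspect -f '{{.Name}} {{.HostConfig.RestartPolicy.Name}}' $(docker ps -aq) | policy_is_no
```

Note that docker inspect prints container names with a leading slash, so the output looks like /my-container.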
I am answering my own question:
It probably wasn't very clear, but I was talking about the kube-system pods that manage the whole cluster and that should start automatically when the machine restarts.
It turns out those pods (e.g. coredns, kube-proxy) intentionally have a restart policy of "no", and it is the kubelet service on the node that brings the whole cluster back up when you restart the node.
https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
In my case, kubelet could not start because the cri-dockerd process was missing.
Check the issue I opened at kubespray:
You can check the kubelet logs like so:
journalctl -u kubelet -f
I'm having trouble with Docker EE. Last week I successfully started 3 containers that I built. Now I need to run a simple Node.js container, so I ran docker run -d node, but it exits immediately and I get the following error:
time="2020-11-16T11:25:05+01:00" level=error msg="Error waiting for container: failed to shutdown container: container 8a5e6905d6432a9e0ab4dc46b50654e6afe4a6f297dd478d4b07b0dd69e00009 encountered an error during hcsshim::System::waitBackground: failure in a Windows system call: The virtual machine or container with the specified identifier is not running. (0xc0370110): subsequent terminate failed container 8a5e6905d6432a9e0ab4dc46b50654e6afe4a6f297dd478d4b07b0dd69e00009 encountered an error during hcsshim::System::waitBackground: failure in a Windows system call: The virtual machine or container with the specified identifier is not running. (0xc0370110)"
I'm on Windows Server 2019 Standard. Where should I start looking?
I am trying to learn Docker and swarm mode. I created a swarm with 3 nodes and completed an example using VirtualBox and docker-machine. Once I restarted my machine, all nodes were shown as stopped. I started all nodes using
docker-machine start node1 node2 node3
All nodes started, but I am still not able to list nodes, even on the master node, and I get the error below:
docker@node1:~$ docker node ls
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
Also, the swarm state on node1 (master) is pending:
Swarm: pending
NodeID: c93hv5pixlfiei7q9qneuiuen
Error: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
I get this error every time I restart my machine, which forces me to set everything up from scratch each time.
Is there any way I can avoid setting up the cluster again and again?
Thanks
You must start the Docker service somewhere in your boot configuration.
Preventing
Demote the node you are going to switch off or remove from the swarm:
# find node id
$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
o1iz67ehuenfzbyg2gjxayaee hostA Ready Active Reachable 20.10.6
fic857lrupfemxqie5rvq63yt * hostB Ready Active Leader 20.10.6
$ docker node demote o1iz67ehuenfzbyg2gjxayaee
Manager o1iz67ehuenfzbyg2gjxayaee demoted in the swarm.
# from now on, the node can safely leave the swarm
$ docker swarm leave --force
Reacting
Restart if there are no healthy nodes.
Stop and then start the Docker engine (NOT restart) and init the swarm again. Validate the firewall ruleset afterwards, as Docker overwrites it.
$ systemctl stop docker
$ systemctl start docker
Drain the node that left if there is a healthy manager node.
Reference https://cynici.wordpress.com/2018/05/31/docker-info-rpc-error-on-manager-node/
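The "more than half of the managers" condition in the error message is a majority rule, so the arithmetic can be sketched as follows. The function names quorum, tolerated_failures and quorum_table are illustrative only; the formulas are the usual majority calculation, not something taken from the answers here.

```shell
# quorum N: smallest number of managers that is more than half of N.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: how many of N managers can be offline while a
# majority is still online.
tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

# quorum_table: print the two values for 1 to 5 managers.
quorum_table() {
  for n in 1 2 3 4 5; do
    echo "managers=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
  done
}
```

With a single manager the tolerated failure count is zero, so the swarm has no leader whenever that one manager is unreachable.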
Please check the firewall on Linux:
If you want to promote a node to manager, check that port 2377 is accepting requests on that node. Only then can the node work as a manager. Otherwise you will get an error like the one below:
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
Solution: add port 2377 to the firewall.
firewall-cmd --zone=public --add-port=2377/tcp --permanent
success
firewall-cmd --reload
success
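Port 2377/tcp is the only port named in this answer. Docker's swarm documentation also lists 7946/tcp, 7946/udp and 4789/udp for node-to-node and overlay traffic, so a dry-run helper that prints the matching firewall-cmd calls might look like the sketch below. The function name swarm_firewall_cmds is hypothetical, and the extra ports are an addition to what the answer states.

```shell
# swarm_firewall_cmds [ZONE]: print (do not run) firewall-cmd calls for the
# swarm ports. 2377/tcp comes from the answer above; 7946/tcp, 7946/udp and
# 4789/udp are assumptions taken from Docker's swarm documentation.
swarm_firewall_cmds() {
  for port in 2377/tcp 7946/tcp 7946/udp 4789/udp; do
    echo "firewall-cmd --zone=${1:-public} --add-port=$port --permanent"
  done
  echo "firewall-cmd --reload"
}

# Usage: review the output first, then run it as root if it looks right:
#   swarm_firewall_cmds public
#   swarm_firewall_cmds public | sh
```

Printing the commands instead of running them keeps the helper safe to try on a machine where firewalld is not installed.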
I see that the Docker daemon uses a lot of CPU. As I understand it, the kubelet and dockerd communicate with each other to maintain the state of the cluster. But does dockerd for some reason do extra runtime work after containers are started that would spike CPU? For example, to get information to report to the kubelet?
"But does dockerd for some reason do extra runtime work after containers are started that would spike CPU?"
Not really, unless another container or process is constantly calling the Docker API or running docker commands from the CLI.
The kubelet talks to the Docker daemon through a Docker shim to do everything it needs to run containers, so I would check whether the kubelet is doing extra work, such as repeatedly scheduling and then evicting or stopping containers.
One of my ICP nodes appears to be running, but the services on that node are unresponsive and will at times return a 504 Gateway Timeout.
When I SSH into the unresponsive node and run journalctl -u kubelet -f I am seeing error messages such as transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused
Furthermore, when I run top, I see dockerd using an unusually high percentage of my CPU.
What is causing this behavior and how can I return my node to its normal working condition?
These errors might be due to a known issue with Docker where an old containerd reference is used even after the containerd daemon was restarted. This defect causes the Docker daemon to go into an internal error loop that uses a high amount of CPU resources and logs a high number of errors. For more information about this error, please see the "Refresh containerd remotes on containerd restarted" pull request against the Moby project.
To work around this issue, use the host operating system command to restart the docker service on the node. After some time, the services should resume.
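To confirm that a node is in this error loop before restarting anything, one approach is to count how often the socket error from the question appears in recent kubelet logs. This is a rough sketch: the function name count_containerd_refused is made up, and the systemctl command assumes a systemd-based host.

```shell
# count_containerd_refused: read log lines on stdin and print how many
# contain the containerd socket "connection refused" error quoted above.
count_containerd_refused() {
  grep -cF 'docker-containerd.sock: connect: connection refused' || true
}

# Usage on the affected node (assumes systemd and journalctl):
#   journalctl -u kubelet --since "10 minutes ago" | count_containerd_refused
# If the count is high, apply the workaround:
#   sudo systemctl restart docker
```

The || true keeps the exit status at zero when there are no matches, since grep -c still prints 0 in that case.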