Docker commands do not respond anymore

Most Docker commands never complete; I have to interrupt them manually with Ctrl+C. Even simple commands like docker ps or docker info do not respond.
However, docker help and docker version still work.
I suspect something like a deadlock involving a particular container, so commands related to containers never finish.
How can I handle such a situation?
My Docker version is 1.12.3, and I don't use Swarm mode. The docker logs command doesn't work either. Using dmesg I can see a lot of I/O errors, but I don't know whether they are related to my problem:
[12898.121287] loop: Write error at byte offset 8882749440, length 4096.
[12898.122837] loop: Write error at byte offset 8883666944, length 4096.
[12898.124685] loop: Write error at byte offset 8882814976, length 4096.
[12898.126459] loop: Write error at byte offset 8883404800, length 4096.
[12898.128201] loop: Write error at byte offset 8883470336, length 4096.
[12898.129921] loop: Write error at byte offset 8883535872, length 4096.
[12898.131774] loop: Write error at byte offset 8883601408, length 4096.
[12898.133594] loop: Write error at byte offset 8883732480, length 4096.
[12917.269786] loop: Write error at byte offset 8883798016, length 4096.
[12917.270331] quiet_error: 632 callbacks suppressed
[12917.270334] Buffer I/O error on device dm-6, logical block 1313320
[12917.270540] lost page write due to I/O error on dm-6
[12917.270543] Buffer I/O error on device dm-6, logical block 1313321
[12917.270740] lost page write due to I/O error on dm-6
[12917.270742] Buffer I/O error on device dm-6, logical block 1313322
[12917.270957] lost page write due to I/O error on dm-6
[12917.270959] Buffer I/O error on device dm-6, logical block 1313323
[12917.271177] lost page write due to I/O error on dm-6
[12917.271179] Buffer I/O error on device dm-6, logical block 1313324
[12917.271377] lost page write due to I/O error on dm-6
[12917.271379] Buffer I/O error on device dm-6, logical block 1313325
[12917.271573] lost page write due to I/O error on dm-6
[12917.301759] loop: Write error at byte offset 8883863552, length 4096.
[12917.312038] loop: Write error at byte offset 8883929088, length 4096.
[12917.312396] Buffer I/O error on device dm-6, logical block 1313328
[12917.312635] lost page write due to I/O error on dm-6
[12917.312638] Buffer I/O error on device dm-6, logical block 1313329
[12917.312867] lost page write due to I/O error on dm-6
[12917.312869] Buffer I/O error on device dm-6, logical block 1313330
[12917.313121] lost page write due to I/O error on dm-6
[12917.313123] Buffer I/O error on device dm-6, logical block 1313331
[12917.313346] lost page write due to I/O error on dm-6
[13090.853726] INFO: task kworker/u8:0:17212 blocked for more than 120 seconds.
[13090.854055] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Using the command sudo systemctl status -l docker, the following messages are printed, but I cannot tell if they are related:
dockerd[1344]: time="2016-11-24T17:49:01.184874648+01:00" level=warning msg="libcontainerd: container c9f35af1836bf856001ca6156663f713c1217a697e8d2451927c67797fb5a770 restart canceled"
dockerd[1344]: time="2016-11-24T17:49:02.627116016+01:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers : [nameserver 8.8.8.8 nameserver 8.8.4.4]"
dockerd[1344]: time="2016-11-24T17:49:02.627152661+01:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers : [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
dockerd[1344]: time="2016-11-24T18:19:51.472701647+01:00" level=warning msg="libcontainerd: container c9f35af1836bf856001ca6156663f713c1217a697e8d2451927c67797fb5a770 restart canceled"
dockerd[1344]: time="2016-11-24T18:19:56.712126199+01:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers : [nameserver 8.8.8.8 nameserver 8.8.4.4]"
dockerd[1344]: time="2016-11-24T18:19:56.712159759+01:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers : [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
dockerd[1344]: time="2016-11-24T18:34:24.301786606+01:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers : [nameserver 8.8.8.8 nameserver 8.8.4.4]"
dockerd[1344]: time="2016-11-24T18:34:24.302208751+01:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers : [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
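While the daemon is wedged like this, wrapping client calls in coreutils timeout avoids blocking the shell on Ctrl+C; a small diagnostic sketch (the 5-second limit is arbitrary):

```shell
# Probe a hanging daemon: `timeout` kills the client after 5 seconds,
# and exit status 124 distinguishes "hung" from an ordinary error.
probe() {
  if timeout 5 "$@" > /dev/null 2>&1; then
    echo "responding"
  elif [ $? -eq 124 ]; then
    echo "hung (timed out)"
  else
    echo "error"
  fi
}
probe docker info
```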

This Docker-commands-hanging bug appeared after I deleted a container.
The dockerd daemon was left in an abnormal state: it could not be started (sudo service docker start) after having been stopped (sudo service docker stop).
# sudo service docker start
Redirecting to /bin/systemctl start docker.service
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
# journalctl -xe
kernel: device-mapper: ioctl: unable to remove open device docker-253:0-19468577-d6f74dd67f106d6bfa483df4ee534dd9545dc8ca
...
systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
systemd[1]: Failed to start Docker Application Container Engine.
systemd[1]: Unit docker.service entered failed state.
systemd[1]: docker.service failed.
polkitd[896]: Unregistered Authentication Agent for unix-process:22551:34177094 (system bus name :1.290, object path /org
kernel: dev_remove: 41 callbacks suppressed
kernel: device-mapper: ioctl: unable to remove open device docker-253:0-19468577-fc63401af903e22d05a4518e02504527f0d7883f9d997d7d97fdfe72ba789863
...
dockerd[22566]: time="2016-11-28T10:18:09.840268573+01:00" level=fatal msg="Error starting daemon: timeout"
systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
systemd[1]: Failed to start Docker Application Container Engine.
Moreover, many zombie Docker processes could be observed with ps -eax | grep docker (a "Z" in the STAT column), for example docker-proxy processes.
After rebooting the server and restarting Docker, the zombie processes disappeared and the Docker commands worked again.
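For reference, a quick way to list zombie processes (a generic sketch; on the box above the defunct entries were docker-proxy processes):

```shell
# A zombie's STAT field starts with "Z"; print its PID and command name.
ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^Z/ {print $1, $3}'
```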

I just had a similar issue. Rebooting the server did not work for me. The problem appeared after I installed a new container that had some kind of error; afterwards, most Docker commands did not respond. I fixed it by executing the following command:
docker system prune -a
This removes all stopped containers and all unused networks and images, in my case including the container I had just added. More information:
https://docs.docker.com/engine/reference/commandline/system_prune/

I had the same problem (commands not responding) and fixed it by increasing the resources allocated to Docker:
Docker Desktop -> Preferences -> Advanced
In my case, I increased:
Memory from 2 GB to 8 GB
Swap from 1 GB to 2 GB
Try different values according to your machine.

The symptoms you describe look like something I struggled with as well.
Here is what I did; I hope it helps!
After checking that the service was not responding, using:
systemctl status docker.service
I used the following command to get it working again:
sudo dockerd --debug

Restarting my PC worked for me

Related

Is it possible to free space used by docker container on restart?

I am trying to fetch a large number of URLs with Selenium WebDriver (the selenium/standalone-chrome:96.0 image), running in a container on an EC2 instance with 30 GB of storage. I put a lot of effort into avoiding disk-space leaks during this process, but finally gave up. After a while the container runs out of space and I get an error from WebDriver like selenium.common.exceptions.WebDriverException: Message: unknown error: cannot create temp dir for user data dir
As a workaround I can force the container to exit after a while, so Docker restarts it (with the restart: always policy), but the disk space is not reclaimed, and sooner or later the Docker restart manager throws an error like
restartmanger wait error: mount /dev/mapper/docker-259:3-394503-72f7b76024003665f890079f6f681414587483fa2f30e0f080c027cd516ba7d2:/var/lib/docker/devicemapper/mnt/72f7b76024003665f890079f6f681414587483fa2f30e0f080c027cd516ba7d2: input/output error\nFailed to mount; and leaves the container stopped.
Is there any technique to reclaim disk space on container restart?
UPDATE
Creating/closing the webdriver, performed around each driver.get():
import logging

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

logger = logging.getLogger(__name__)
driver = None

def create_webdriver():
    global driver
    try:
        logger.info("WebDriver: creating...")
        options = Options()
        options.add_argument("start-maximized")
        options.add_argument("enable-automation")
        options.add_argument("--headless")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-infobars")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--disable-browser-side-navigation")
        options.add_argument("--disable-gpu")
        driver = webdriver.Chrome(options=options)
    except Exception:
        logger.exception("WebDriver: exception while creating, cannot manage, exiting.")
        exit(1)

def close_webdriver():
    global driver
    if driver is not None:
        driver.quit()
        driver = None
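Given the per-fetch create/close above, another pattern (a sketch, not from the original post) is to reuse one driver for a batch of pages and quit() it periodically, since driver.quit() is what removes Chrome's temporary user-data directories. Here make_driver stands in for any zero-argument callable returning a fresh webdriver.Chrome, and the batch size is arbitrary:

```python
def fetch_in_batches(urls, make_driver, pages_per_driver=100):
    """Visit each URL, recycling the driver every `pages_per_driver`
    pages so quit() can delete Chrome's temp user-data directories."""
    driver = make_driver()
    try:
        for i, url in enumerate(urls):
            if i and i % pages_per_driver == 0:
                driver.quit()            # frees the temp profile on disk
                driver = make_driver()   # start with a fresh profile
            driver.get(url)
    finally:
        driver.quit()
```

This bounds the number of leaked temp dirs to one driver's worth at any time, at the cost of restarting Chrome every batch.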
UPDATE2
It seems there is no disk-space leak after all, but rather an issue with the Docker devicemapper filesystem on the EC2 instance. I carefully investigated disk and Docker space usage during the process and found no problems:
df (excerpt):
devtmpfs        16323728     120 16323608   1% /dev
tmpfs           16333664       0 16333664   0% /dev/shm
/dev/nvme0n1p1   8189348 1919080  6170020  24% /

docker system df:
TYPE           TOTAL  ACTIVE  SIZE     RECLAIMABLE
Images         9      1       9.033GB  8.571GB (94%)
Containers     8      6       144.8MB  -2B (0%)
Local Volumes  0      0       0B       0B
Build Cache    0      0       0B       0B
but the container still fails with
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot create temp dir for user data dir
and exits; Docker cannot restart it, and there are errors in /var/log/docker:
time="2021-12-26T01:36:06.030765815Z" level=error msg="Driver devicemapper couldn't return diff size of container 258399ca6d95cb3510e5e02fec9253b2f22852e8a3553cfad8774b9f913ed279: Failed to mount; dmesg: <3>[ 3761.830462] Buffer I/O error on dev dm-8, logical block 2185471, lost async page write\n<4>[ 3761.839429] JBD2: recovery failed\n<3>[ 3761.843623] EXT4-fs (dm-8): error loading journal\n: mount /dev/mapper/docker-259:3-394503-26a311e2927d080ef4895f43d7dcd6ddaa26e5c0d8e71b6eb46bcdc8d1601194:/var/lib/docker/devicemapper/mnt/26a311e2927d080ef4895f43d7dcd6ddaa26e5c0d8e71b6eb46bcdc8d1601194: input/output error"
time="2021-12-26T01:36:25.009915383Z" level=info msg="ignoring event" container=f47ab38bdab172205bd30c3cdbc6723162e4422ef4dcda23f6fec0ac99a20035 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
time="2021-12-26T01:36:25.010710566Z" level=info msg="shim disconnected" id=f47ab38bdab172205bd30c3cdbc6723162e4422ef4dcda23f6fec0ac99a20035
time="2021-12-26T01:36:25.010797187Z" level=error msg="copy shim log" error="read /proc/self/fd/36: file already closed"
time="2021-12-26T01:36:28.788036177Z" level=warning msg="error locating sandbox id c1e0abc725ee3e88f388042a34b8e46db09a8fd8024774862899d0f7d9af721b: sandbox c1e0abc725ee3e88f388042a34b8e46db09a8fd8024774862899d0f7d9af721b not found"
time="2021-12-26T01:36:28.788396052Z" level=error msg="Error unmounting device 8de02009e67a0fea87313b35b117eaed6cf654837532e04ce16a6fc0846d1954: invalid argument" storage-driver=devicemapper
time="2021-12-26T01:36:28.788426923Z" level=error msg="error unmounting container" container=f47ab38bdab172205bd30c3cdbc6723162e4422ef4dcda23f6fec0ac99a20035 error="invalid argument"
time="2021-12-26T01:36:28.789562261Z" level=error msg="f47ab38bdab172205bd30c3cdbc6723162e4422ef4dcda23f6fec0ac99a20035 cleanup: failed to delete container from containerd: no such container"
time="2021-12-26T01:36:28.794739546Z" level=error msg="restartmanger wait error: mount /dev/mapper/docker-259:3-394503-8de02009e67a0fea87313b35b117eaed6cf654837532e04ce16a6fc0846d1954:/var/lib/docker/devicemapper/mnt/8de02009e67a0fea87313b35b117eaed6cf654837532e04ce16a6fc0846d1954: input/output error\nFailed to mount; dmesg: <3>[ 3784.574178] Buffer I/O error on dev dm-10, logical block 1048578, lost async page write\n<4>[ 3784.583183] JBD2: recovery failed\n<3>[ 3784.587446] EXT4-fs (dm-10): error loading journal\n\ngithub.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).MountDevice\n\t/builddir/build/BUILD/docker-20.10.7-3.71.amzn1/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:2392\ngithub.com/docker/docker/daemon/graphdriver/devmapper.(*Driver).Get\n\t/builddir/build/BUILD/docker-20.10.7-3.71.amzn1/src/github.com/docker/docker/daemon/graphdriver/devmapper/driver.go:208\ngithub.com/docker/docker/layer.(*referencedRWLayer).Mount\n\t/builddir/build/BUILD/docker-20.10.7-3.71.amzn1/src/github.com/docker/docker/layer/mounted_layer.go:104\ngithub.com/docker/docker/daemon.(*Daemon).Mount\n\t/builddir/build/BUILD/docker-20.10.7-3.71.amzn1/src/github.com/docker/docker/daemon/daemon.go:1320\ngithub.com/docker/docker/daemon.(*Daemon).conditionalMountOnStart\n\t/builddir/build/BUILD/docker-20.10.7-3.71.amzn1/src/github.com/docker/docker/daemon/daemon_unix.go:1360\ngithub.com/docker/docker/daemon.(*Daemon).containerStart\n\t/builddir/build/BUILD/docker-20.10.7-3.71.amzn1/src/github.com/docker/docker/daemon/start.go:145\ngithub.com/docker/docker/daemon.(*Daemon).handleContainerExit.func1\n\t/builddir/build/BUILD/docker-20.10.7-3.71.amzn1/src/github.com/docker/docker/daemon/monitor.go:84\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1374"
SOLVED
It really was an issue with the default Amazon Linux AMI Docker configuration, which uses the devicemapper storage driver, on the EC2 instance. A clean Docker install on Ubuntu 18.04 with the overlay2 storage driver solved the issue completely.
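As a side note, the storage driver can also be pinned explicitly in /etc/docker/daemon.json rather than relying on the distribution default (a sketch; switching drivers hides previously created images and containers, and overlay2 needs a suitable backing filesystem such as ext4 or xfs with ftype=1):

```json
{
  "storage-driver": "overlay2"
}
```

Restart the daemon afterwards and verify the active driver with docker info.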

Cannot run nodetool commands and cqlsh to Scylla in Docker

I am new to Scylla and I am following the instructions to try it in a container as per this page: https://hub.docker.com/r/scylladb/scylla/.
The following command ran fine.
docker run --name some-scylla --hostname some-scylla -d scylladb/scylla
I see the container is running.
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e6c4e19ff1bd scylladb/scylla "/docker-entrypoint.…" 14 seconds ago Up 13 seconds 22/tcp, 7000-7001/tcp, 9042/tcp, 9160/tcp, 9180/tcp, 10000/tcp some-scylla
However, I'm unable to use nodetool or cqlsh. I get the following output.
$ docker exec -it some-scylla nodetool status
Using /etc/scylla/scylla.yaml as the config file
nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)
See 'nodetool help' or 'nodetool help <command>'.
and
$ docker exec -it some-scylla cqlsh
Connection error: ('Unable to connect to any servers', {'172.17.0.2': error(111, "Tried connecting to [('172.17.0.2', 9042)]. Last error: Connection refused")})
Any ideas?
Update
Looking at docker logs some-scylla, I see some errors; the last one is as follows.
2021-10-03 07:51:04,771 INFO spawned: 'scylla' with pid 167
Scylla version 4.4.4-0.20210801.69daa9fd0 with build-id eb11cddd30e88ef39c32c847e70181b5cf786355 starting ...
command used: "/usr/bin/scylla --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --developer-mode=1 --overprovisioned --listen-address 172.17.0.2 --rpc-address 172.17.0.2 --seed-provider-parameters seeds=172.17.0.2 --blocked-reactor-notify-ms 999999999"
parsed command line options: [log-to-syslog: 0, log-to-stdout: 1, default-log-level: info, network-stack: posix, developer-mode: 1, overprovisioned, listen-address: 172.17.0.2, rpc-address: 172.17.0.2, seed-provider-parameters: seeds=172.17.0.2, blocked-reactor-notify-ms: 999999999]
ERROR 2021-10-03 07:51:05,203 [shard 6] seastar - Could not setup Async I/O: Resource temporarily unavailable. The most common cause is not enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that number or reducing the amount of logical CPUs available for your application
2021-10-03 07:51:05,316 INFO exited: scylla (exit status 1; not expected)
2021-10-03 07:51:06,318 INFO gave up: scylla entered FATAL state, too many start retries too quickly
Update 2
The reason for the error is described on the Docker Hub page linked above. I had to start the container specifying the number of CPUs with --smp 1, as follows.
docker run --name some-scylla --hostname some-scylla -d scylladb/scylla --smp 1
According to the above page:
This command will start a Scylla single-node cluster in developer mode
(see --developer-mode 1) limited by a single CPU core (see --smp).
Production grade configuration requires tuning a few kernel parameters
such that limiting number of available cores (with --smp 1) is the
simplest way to go.
Multiple cores requires setting a proper value to the
/proc/sys/fs/aio-max-nr. On many non production systems it will be
equal to 65K. ...
As you have found out, in order to use additional CPU cores you'll need to increase the fs.aio-max-nr kernel parameter.
You may run, as root:
# sysctl -w fs.aio-max-nr=65535
which should be enough for most systems. Should you still see an error preventing Scylla from using all of your CPU cores, increase the value further.
Note that the above configuration is not persistent. Edit /etc/sysctl.conf to make it persistent across reboots.
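For the persistence step, a sysctl drop-in file is an equivalent approach (a sketch; the file name and the value 1048576 are illustrative, size it to your core count):

```
# /etc/sysctl.d/99-aio.conf
fs.aio-max-nr = 1048576
```

Apply it without rebooting via sudo sysctl --system.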

Error running auditd inside centos docker container: "Unable to set initial audit startup state to 'enable', exiting"

I'm trying to create a docker container with systemd enabled and install auditd on it.
I'm using the standard centos/systemd image provided in dockerhub.
But when I try to start auditd, it fails.
Here is the list of commands that I have done to create and get into the docker container:
docker run -d --rm --privileged --name systemd -v /sys/fs/cgroup:/sys/fs/cgroup:ro centos/systemd
docker exec -it systemd bash
Now, inside the docker container:
yum install audit
systemctl start auditd
I'm receiving the following error:
Job for auditd.service failed because the control process exited with error code. See "systemctl status auditd.service" and "journalctl -xe" for details.
Then I run:
systemctl status auditd.service
And I'm getting this info:
auditd[182]: Error sending status request (Operation not permitted)
auditd[182]: Error sending enable request (Operation not permitted)
auditd[182]: Unable to set initial audit startup state to 'enable', exiting
auditd[182]: The audit daemon is exiting.
auditd[181]: Cannot daemonize (Success)
auditd[181]: The audit daemon is exiting.
systemd[1]: auditd.service: control process exited, code=exited status=1
systemd[1]: Failed to start Security Auditing Service.
systemd[1]: Unit auditd.service entered failed state.
systemd[1]: auditd.service failed.
Do you guys have any ideas on why this is happening?
Thank you.
See this discussion:
At the moment, auditd can be used inside a container only for aggregating
logs from other systems. It cannot be used to get events relevant to the
container or the host OS. If you want to aggregate only, then set
local_events=no in auditd.conf.
Container support is still under development.
Also see this:
local_events
This yes/no keyword specifies whether or not to include local events. Normally you want local events so the default value is yes. Cases where you would set this to no is when you want to aggregate events only from the network. At the moment, this is useful if the audit daemon is running in a container. This option can only be set once at daemon start up. Reloading the config file has no effect.
So at least as of Thu, 19 Jul 2018, this feature was not supported; you have to wait for container support to land.
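For completeness, the aggregation-only setting quoted above is a one-line change (path as on CentOS 7; verify it in your image):

```
# /etc/audit/auditd.conf
local_events = no
```

After editing, restart with systemctl restart auditd; auditd should then only aggregate remote events and skip enabling the kernel audit state that fails inside the container.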

Docker stopped all of sudden in CentOS 7

I was running docker on my CentOS 7 machine.
Today I was trying to upgrade a container, so I stopped it and tried to pull the new image.
I got the following error:
Error getting v2 registry: Get https://registry-1.docker.io/v2/: proxyconnect tcp: dial tcp: lookup https_proxy=http: no such host"
I checked the proxy settings for the machine (cat /etc/environment) and for Docker (cat /etc/systemd/system/docker.service.d/http-proxy.conf); both are set correctly.
I enabled daemon logs for Docker, and the logs say:
Sep 14 10:43:18 myCentOsServer kernel: [4913751.074277] docker0: port 1(veth1e3300a) entered disabled state
Sep 14 10:43:18 myCentOsServer kernel: [4913751.084599] docker0: port 1(veth1e3300a) entered disabled state
Sep 14 10:43:18 myCentOsServer kernel: [4913751.084888] docker0: port 1(veth1e3300a) entered disabled state
Sep 14 10:43:18 myCentOsServer NetworkManager[794]: <info> [1505349798.0267] device (veth1e3300a): released from master device docker0
Sep 14 10:44:48 myCentOsServer dockerd[29136]: time="2017-09-14T10:44:48.802236300+10:00" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: proxyconnect tcp: dial tcp: lookup https_proxy=http: no such host"
I tried the commands below, but they hang as well.
systemctl daemon-reload
systemctl restart docker
Any idea what the issue might be?
Thanks in advance.
I was finally able to solve this issue.
The issue was with my Docker mount points. Mine was set to /var/lib/docker, and I suspect it got corrupted when I did a data-volume export.
Steps I followed:
1) Navigated to /var/lib/docker, took a backup of the images, containers and volumes folders, and deleted them.
2) Reloaded the daemon.
3) Restarted Docker.
Now it is working fine.
The bad news is that I lost the data dump I had taken from one of the containers (using volumes-from).
But it was a dev version of the software, so I reinstalled it and redid the setup.
This happens sometimes on CentOS. You can simply restart the Docker service:
systemctl restart docker.service

Error creating default "bridge" network: cannot create network (docker0): conflicts with network (docker0): networks have same bridge name

After stopping Docker it refused to start again, complaining that another bridge called docker0 already exists:
level=warning msg="devmapper: Base device already exists and has filesystem xfs on it. User specified filesystem will be ignored."
level=info msg="[graphdriver] using prior storage driver \"devicemapper\""
level=info msg="Graph migration to content-addressability took 0.00 seconds"
level=info msg="Firewalld running: false"
level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: cannot create network fa74b0de61a17ffe68b9a8f7c1cd698692fb56f6151a7898d66a30350ca0085f (docker0): conflicts with network bb9e0aab24dd1f4e61f8e7a46d4801875ade36af79d7d868c9a6ddf55070d4d7 (docker0): networks have same bridge name"
docker.service: Main process exited, code=exited, status=1/FAILURE
Failed to start Docker Application Container Engine.
docker.service: Unit entered failed state.
docker.service: Failed with result 'exit-code'.
Deleting the bridge with ip link del docker0 and then starting Docker leads to the same result, just with another ID.
In my case, I had downgraded my OS (CentOS Atomic Host) and came across this error message. The Docker version on the older CentOS Atomic was 1.9.1. I did not have any running containers or pulled images before the downgrade.
I simply ran the commands below and Docker was happy again:
sudo rm -rf /var/lib/docker/network
sudo systemctl start docker
The problem seems to be in /var/docker/network/: a lot of sockets are stored there that reference the bridge by its old ID. To solve the problem you can delete all the sockets and the interface, then start Docker, but all your containers will refuse to work since their sockets are gone. In my case I did not care about my stateless containers anyway, so this fixed it:
ip link del docker0
rm -rf /var/docker/network/*
mkdir /var/docker/network/files
systemctl start docker
# delete all containers
docker ps -a | cut -d' ' -f 1 | xargs -n 1 echo docker rm -f
# recreate all containers
It may sound obvious, but you may want to consider rebooting, especially if there was a major system update recently.
This worked for me: I hadn't rebooted my VM after installing some kernel updates, which probably left many network modules in an undefined state.
