I am trying to fetch a large number of URLs with Selenium WebDriver (selenium/standalone-chrome:96.0 image), running in a container on an EC2 instance with 30GB of storage. I put a lot of effort into avoiding disk space leaks during this process, but finally gave up. After a while the container runs out of space and I get an error from WebDriver like selenium.common.exceptions.WebDriverException: Message: unknown error: cannot create temp dir for user data dir
As a workaround I can force the container to exit after a while, so that Docker restarts it (with the restart: always policy), but the disk space is not reclaimed, and sooner or later the Docker restart manager throws an error like
restartmanger wait error: mount /dev/mapper/docker-259:3-394503-72f7b76024003665f890079f6f681414587483fa2f30e0f080c027cd516ba7d2:/var/lib/docker/devicemapper/mnt/72f7b76024003665f890079f6f681414587483fa2f30e0f080c027cd516ba7d2: input/output error\nFailed to mount; and leaves container stopped.
Is there any technique to reclaim disk space on container restart?
UPDATE
Creating/closing the WebDriver, performed after each driver.get():
import logging
import sys

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

logger = logging.getLogger(__name__)
driver = None  # module-level handle shared by both functions


def create_webdriver():
    global driver
    try:
        logger.info("WebDriver: creating...")
        options = Options()
        options.add_argument("start-maximized")
        options.add_argument("enable-automation")
        options.add_argument("--headless")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-infobars")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--disable-browser-side-navigation")
        options.add_argument("--disable-gpu")
        driver = webdriver.Chrome(options=options)
    except Exception:
        logger.exception("WebDriver: exception while creating, can not manage, exiting.")
        sys.exit(1)


def close_webdriver():
    global driver
    if driver is not None:
        # quit() ends the session and shuts down the browser process
        driver.quit()
        driver = None
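One way to rule out leftover Chrome profile directories is to give each session an explicit profile directory and delete it yourself after quit(). The following is only a minimal sketch under assumptions: chrome_profile_dir and fetch are hypothetical helper names, and nothing in this thread confirms that profile directories were the cause of the error. --user-data-dir is a standard Chrome command-line switch.

```python
import contextlib
import shutil
import tempfile


@contextlib.contextmanager
def chrome_profile_dir(prefix="chrome-profile-"):
    """Yield a fresh directory and delete it afterwards, even on error."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)


def fetch(url):
    """Open one URL in a throwaway Chrome profile and return the page source."""
    # Imported here so the helper above stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    with chrome_profile_dir() as profile:
        options = Options()
        options.add_argument("--headless")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument(f"--user-data-dir={profile}")
        driver = webdriver.Chrome(options=options)
        try:
            driver.get(url)
            return driver.page_source
        finally:
            driver.quit()  # quit before the profile directory is removed
```

With this layout the directory's lifetime is tied to the with block, so a crash between get() and quit() should still clean up the profile.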
UPDATE2
It seems that there is no disk space leak after all, but rather some issue with the Docker devicemapper filesystem on the EC2 instance. I carefully monitored disk and Docker space usage during the process and found no problems:
devtmpfs 16323728 120 16323608 1% /dev
tmpfs 16333664 0 16333664 0% /dev/shm
/dev/nvme0n1p1 8189348 1919080 6170020 24% /
TYPE            TOTAL   ACTIVE   SIZE      RECLAIMABLE
Images          9       1        9.033GB   8.571GB (94%)
Containers      8       6        144.8MB   -2B (0%)
Local Volumes   0       0        0B        0B
Build Cache     0       0        0B        0B
but the container still fails with
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot create temp dir for user data dir
and exits; Docker can't restart it, and there are errors in /var/log/docker:
time="2021-12-26T01:36:06.030765815Z" level=error msg="Driver devicemapper couldn't return diff size of container 258399ca6d95cb3510e5e02fec9253b2f22852e8a3553cfad8774b9f913ed279: Failed to mount; dmesg: <3>[ 3761.830462] Buffer I/O error on dev dm-8, logical block 2185471, lost async page write\n<4>[ 3761.839429] JBD2: recovery failed\n<3>[ 3761.843623] EXT4-fs (dm-8): error loading journal\n: mount /dev/mapper/docker-259:3-394503-26a311e2927d080ef4895f43d7dcd6ddaa26e5c0d8e71b6eb46bcdc8d1601194:/var/lib/docker/devicemapper/mnt/26a311e2927d080ef4895f43d7dcd6ddaa26e5c0d8e71b6eb46bcdc8d1601194: input/output error"
time="2021-12-26T01:36:25.009915383Z" level=info msg="ignoring event" container=f47ab38bdab172205bd30c3cdbc6723162e4422ef4dcda23f6fec0ac99a20035 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
time="2021-12-26T01:36:25.010710566Z" level=info msg="shim disconnected" id=f47ab38bdab172205bd30c3cdbc6723162e4422ef4dcda23f6fec0ac99a20035
time="2021-12-26T01:36:25.010797187Z" level=error msg="copy shim log" error="read /proc/self/fd/36: file already closed"
time="2021-12-26T01:36:28.788036177Z" level=warning msg="error locating sandbox id c1e0abc725ee3e88f388042a34b8e46db09a8fd8024774862899d0f7d9af721b: sandbox c1e0abc725ee3e88f388042a34b8e46db09a8fd8024774862899d0f7d9af721b not found"
time="2021-12-26T01:36:28.788396052Z" level=error msg="Error unmounting device 8de02009e67a0fea87313b35b117eaed6cf654837532e04ce16a6fc0846d1954: invalid argument" storage-driver=devicemapper
time="2021-12-26T01:36:28.788426923Z" level=error msg="error unmounting container" container=f47ab38bdab172205bd30c3cdbc6723162e4422ef4dcda23f6fec0ac99a20035 error="invalid argument"
time="2021-12-26T01:36:28.789562261Z" level=error msg="f47ab38bdab172205bd30c3cdbc6723162e4422ef4dcda23f6fec0ac99a20035 cleanup: failed to delete container from containerd: no such container"
time="2021-12-26T01:36:28.794739546Z" level=error msg="restartmanger wait error: mount /dev/mapper/docker-259:3-394503-8de02009e67a0fea87313b35b117eaed6cf654837532e04ce16a6fc0846d1954:/var/lib/docker/devicemapper/mnt/8de02009e67a0fea87313b35b117eaed6cf654837532e04ce16a6fc0846d1954: input/output error\nFailed to mount; dmesg: <3>[ 3784.574178] Buffer I/O error on dev dm-10, logical block 1048578, lost async page write\n<4>[ 3784.583183] JBD2: recovery failed\n<3>[ 3784.587446] EXT4-fs (dm-10): error loading journal\n\ngithub.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).MountDevice\n\t/builddir/build/BUILD/docker-20.10.7-3.71.amzn1/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:2392\ngithub.com/docker/docker/daemon/graphdriver/devmapper.(*Driver).Get\n\t/builddir/build/BUILD/docker-20.10.7-3.71.amzn1/src/github.com/docker/docker/daemon/graphdriver/devmapper/driver.go:208\ngithub.com/docker/docker/layer.(*referencedRWLayer).Mount\n\t/builddir/build/BUILD/docker-20.10.7-3.71.amzn1/src/github.com/docker/docker/layer/mounted_layer.go:104\ngithub.com/docker/docker/daemon.(*Daemon).Mount\n\t/builddir/build/BUILD/docker-20.10.7-3.71.amzn1/src/github.com/docker/docker/daemon/daemon.go:1320\ngithub.com/docker/docker/daemon.(*Daemon).conditionalMountOnStart\n\t/builddir/build/BUILD/docker-20.10.7-3.71.amzn1/src/github.com/docker/docker/daemon/daemon_unix.go:1360\ngithub.com/docker/docker/daemon.(*Daemon).containerStart\n\t/builddir/build/BUILD/docker-20.10.7-3.71.amzn1/src/github.com/docker/docker/daemon/start.go:145\ngithub.com/docker/docker/daemon.(*Daemon).handleContainerExit.func1\n\t/builddir/build/BUILD/docker-20.10.7-3.71.amzn1/src/github.com/docker/docker/daemon/monitor.go:84\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1374"
SOLVED
It really was an issue with the default Amazon Linux AMI Docker configuration, which uses the devicemapper storage driver on the EC2 instance. A clean install of Docker on Ubuntu 18.04 with the overlay2 storage driver solved the issue completely.
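For reference, the storage driver is selected with the storage-driver key in /etc/docker/daemon.json. The sketch below only builds the new file content and does not write it anywhere; with_storage_driver is a hypothetical helper name. Two caveats: the post does not confirm that overlay2 works on the original Amazon Linux AMI, and images and containers created under one storage driver are not visible after switching to another.

```python
import json


def with_storage_driver(existing_json, driver="overlay2"):
    """Return daemon.json text with "storage-driver" set, keeping other keys."""
    config = json.loads(existing_json) if existing_json.strip() else {}
    config["storage-driver"] = driver
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Starting from an empty or missing daemon.json.
    print(with_storage_driver(""))
```

After changing the file the Docker daemon has to be restarted, and docker info should then report the new driver.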
Related
Docker-CE 19.03.8
Swarm init
Setup: 1 manager node, nothing more.
We deploy many new stacks per day, and sometimes I see the following line:
level=error msg="Failed to allocate network resources for node sdlk0t6pyfb7lxa2ie3w7fdzr" error="could not find network allocator state for network qnkxurc5etd2xrkb53ry0fu59" module=node node.id=yp0u6n9c31yh3xyekondzr4jc
After 2 to 3 days, no new services can be started because there are no free VIPs.
I see the following lines in my logs:
level=error msg="Could not parse VIP address while releasing"
level=error msg="error deallocating vip" error="invalid CIDR address: " vip.addr= vip.network=oqcsj99taftdu3b0t3nrgbgy1
level=error msg="Event api.EventUpdateTask: Failed to get service idid0u7vjuxf2itpv8n31da57 for task 6vnc8jdkgxwxqbs3ixly2i6u4 state NEW: could not find service idid0u7vjuxf2itpv8n31da57" module=node ...
level=error msg="Event api.EventUpdateTask: Failed to get service sbjb7nk0wk31c2ayg8x898fhr for task noo21whnbwkyijnqavseirfg0 state NEW: could not find service sbjb7nk0wk31c2ayg8x898fhr" module=node ...
level=error msg="Failed to find network y73pnq85mjpn1pon38pdbtaw2 on node sdlk0t6pyfb7lxa2ie3w7fdzr" module=node node.id=yp0u6n9c31yh3xyekondzr4jc
We tried to investigate this using debug mode.
Here are some lines that stood out to me:
level=debug msg="Remove interface veth84e7185 failed: Link not found"
level=debug msg="Remove interface veth64c3a65 failed: Link not found"
level=debug msg="Remove interface vethf1703f1 failed: Link not found"
level=debug msg="Remove interface vethe069254 failed: Link not found"
level=debug msg="Remove interface veth2b81763 failed: Link not found"
level=debug msg="Remove interface veth0bf3390 failed: Link not found"
level=debug msg="Remove interface veth2ed04cc failed: Link not found"
level=debug msg="Remove interface veth0bc27ef failed: Link not found"
level=debug msg="Remove interface veth444343f failed: Link not found"
level=debug msg="Remove interface veth036acf9 failed: Link not found"
level=debug msg="Remove interface veth62d7977 failed: Link not found"
and
level=debug msg="Request address PoolID:10.0.0.0/24 App: ipam/default/data, ID: GlobalDefault/10.0.0.0/24, DBIndex: 0x0, Bits: 256, Unselected: 60, Sequence: (0xf7dfeeee, 1)->(0xedddddb7, 1)->(0x77777777, 3)->(0x77777775, 1)->(0x77ffffff, 1)->(0xffd55555, 1)->end Curr:233 Serial:true PrefAddress:<
When the Unselected value drops to 0, no new containers can be deployed; they are stuck in the NEW state.
Has anyone experienced something like this? Or can someone help me?
We believe the problem has something to do with the release of the 10.0.0.0/24 (our ingress) addresses.
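To watch the pool shrink over time, the Unselected counter can be pulled out of the "Request address" debug lines. This is a minimal sketch assuming the line layout shown above; free_addresses is a hypothetical helper name, and the regular expression is fitted to that one sample line only.

```python
import re

# Matches the pool and the Unselected counter in an IPAM "Request address" line.
_UNSELECTED = re.compile(r"PoolID:(?P<pool>\S+).*?Unselected:\s*(?P<free>\d+)")


def free_addresses(log_line):
    """Return (pool, unselected_count), or None if the line does not match."""
    match = _UNSELECTED.search(log_line)
    if match is None:
        return None
    return match.group("pool"), int(match.group("free"))


if __name__ == "__main__":
    sample = (
        'level=debug msg="Request address PoolID:10.0.0.0/24 '
        "App: ipam/default/data, ID: GlobalDefault/10.0.0.0/24, "
        'DBIndex: 0x0, Bits: 256, Unselected: 60, Sequence: ..."'
    )
    print(free_addresses(sample))
```

Feeding the daemon's debug log through this and alerting when the count approaches 0 would give some warning before tasks start to hang in NEW.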
Did you try to stop and restart the Docker daemon?
sudo service docker stop
sudo service docker start
Also, you may find it useful to have a look at the magnificent documentation on https://dockerswarm.rocks/
I usually use this sequence to update a service
export DOMAIN=xxxx.xxxxx.xxx
docker stack rm $service_name
export NODE_ID=$(docker info -f '{{.Swarm.NodeID}}')
# export environment vars if needed
# update data if needed
docker node update --label-add $service_name.$service_name-data=true $NODE_ID
docker stack deploy -c $service_name.yml $service_name
If you see your containers stuck in the NEW state, you are probably affected by this problem: https://github.com/moby/moby/issues/37338 reported by cintiadr:
Docker stack fails to allocate IP on an overlay network, and gets stuck in NEW current state #37338
Reproducing it:
Create a swarm cluster (1 manager, 1 worker). I created AWS t2.large Amazon linux instances, installed docker using their docs, version 18.06.1-ce.
# Deploy a new overlay network from a stack (docker-network.yml)
$ ./deploy-network.sh
Deploy 60 identical services attaching to that network - 3 replicas each - from stacks (docker-network.yml)
$ ./deploy-services.sh
You can verify that all services are happily running.
Now let's bring the worker down.
Run:
docker node update --availability drain <node id> && docker node rm --force <node id>
Note: drain is an async operation (something I wasn't aware of), so to reproduce this use case you shouldn't wait for the drain to complete.
Create a new worker (completely new node/machine), and join the cluster.
You are going to see that very few services are actually able to start. All others will be continuously rejected because no IP is available.
In past versions (17 I believe), the containers wouldn't be rejected (but rather be stuck in NEW).
How to avoid that problem?
If you drain and patiently wait for all the containers to be terminated before removing the node, it appears that this problem is completely avoided.
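The "drain, wait, then remove" advice can be sketched as a small script. This is only an illustration under assumptions: drain_and_remove and running_tasks are hypothetical names, the polling interval and timeout are arbitrary, and it relies on docker node ps printing a current state that begins with "Running" for tasks that have not stopped yet.

```python
import subprocess
import time


def running_tasks(node_id, run=subprocess.run):
    """Count tasks on the node whose current state is still Running."""
    result = run(
        ["docker", "node", "ps", node_id, "--format", "{{.CurrentState}}"],
        capture_output=True, text=True, check=True,
    )
    return sum(1 for line in result.stdout.splitlines()
               if line.startswith("Running"))


def drain_and_remove(node_id, poll_seconds=5.0, timeout_seconds=600.0,
                     run=subprocess.run):
    """Drain a node, wait until its tasks have stopped, then remove it."""
    run(["docker", "node", "update", "--availability", "drain", node_id],
        check=True)
    deadline = time.monotonic() + timeout_seconds
    while running_tasks(node_id, run=run) > 0:
        if time.monotonic() > deadline:
            raise TimeoutError(f"tasks still running on node {node_id}")
        time.sleep(poll_seconds)
    # Same removal command as in the issue, but only after the drain finished.
    run(["docker", "node", "rm", "--force", node_id], check=True)
```

The run parameter exists only so the command runner can be swapped out when trying the logic without a real cluster.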
Is there a way to run replicas on an in-memory tmpfs on the host? I hit this problem (infinite restart loop):
time="2018-11-02T21:55:05Z" level=fatal msg="Error running start replica command: failed to find extents, error: invalid argument"
Is the service able to work on disks mounted in memory?
Currently the OpenEBS Jiva storage engine supports only file systems that support extent mapping (ext4, XFS, etc.).
tmpfs does not support extent mapping, hence it fails.
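Before pointing a replica at a directory, it may help to check which file system backs it. The sketch below looks a path up in text with the layout of /proc/mounts on Linux; filesystem_type and EXTENT_CAPABLE are hypothetical names, and the set lists only the two file systems the answer names explicitly.

```python
import os

# Only the file systems named in the answer; its "etc." is left open.
EXTENT_CAPABLE = {"ext4", "xfs"}


def filesystem_type(path, mounts_text):
    """Return the fs type of the longest mount point that contains path."""
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount on the same point shadow an earlier one.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        fs_type = filesystem_type("/var/openebs", handle.read())
    print(fs_type, fs_type in EXTENT_CAPABLE)
```

The path /var/openebs in the usage part is just an example; mount points containing spaces are escaped in /proc/mounts and are not handled here.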
I have an AWS EC2 instance with EBS mounted as /docker:
root#ip-10-0-0-235:~# lsblk /dev/xvdc1
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvdc1 202:33 0 100G 0 part /docker
Then I have /etc/docker/daemon.json with:
{
  "graph": "/docker/"
}
After the Docker version was updated, I had to change graph to data-root as mentioned here:
{
  "data-root": "/docker/"
}
But after that change I ran into another error:
Nov 22 09:15:31 ip-10-0-0-235 dockerd[5671]: time="2017-11-22T09:15:31.168591095Z" level=info msg="libcontainerd: new containerd process, pid: 5679"
Nov 22 09:15:32 ip-10-0-0-235 dockerd[5671]: time="2017-11-22T09:15:32.170033835Z" level=warning msg="failed to rename /docker/tmp for background deletion: %!s(<nil>). Deleting synchronously"
Nov 22 09:15:32 ip-10-0-0-235 dockerd[5671]: time="2017-11-22T09:15:32.172631277Z" level=error msg="[graphdriver] prior storage driver aufs failed: invalid argument"
Nov 22 09:15:32 ip-10-0-0-235 dockerd[5671]: Error starting daemon: error initializing graphdriver: invalid argument
What's the correct way to mount it now?
Docker 17.05.0-ce, running on Ubuntu 16.04.3.
P.S. While writing this question, I found an odd mounted partition:
root#ip-10-0-0-235:~# mount | grep xvd
/dev/xvda1 on / type ext4 (rw,relatime,discard,data=ordered)
/dev/xvda1 on /docker/aufs type ext4 (rw,relatime,discard,data=ordered)
/dev/xvdb1 on /jenkins type ext4 (rw,relatime,data=ordered)
/dev/xvdc1 on /docker type ext4 (rw,relatime,data=ordered)
/dev/xvda1 on /docker/aufs and /dev/xvdc1 on /docker. That doesn't look correct...
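The oddity in the mount output can be detected mechanically: list every mount below the data root whose device differs from the device mounted at the data root itself. This is only a diagnostic sketch, with foreign_submounts as a hypothetical helper name; it does not say why the extra mount exists or how to fix it.

```python
import re

# Matches lines of the form "DEVICE on MOUNTPOINT type FSTYPE (options)".
_MOUNT_LINE = re.compile(r"^(?P<dev>\S+) on (?P<mp>\S+) type (?P<fs>\S+)")


def foreign_submounts(mount_output, root):
    """Return (device, mount_point) pairs under root backed by another device."""
    root = root.rstrip("/") or "/"
    prefix = root if root.endswith("/") else root + "/"
    entries = []
    for line in mount_output.splitlines():
        match = _MOUNT_LINE.match(line.strip())
        if match:
            entries.append((match.group("dev"), match.group("mp")))
    root_devices = {dev for dev, mp in entries if mp == root}
    return [
        (dev, mp) for dev, mp in entries
        if mp != root and mp.startswith(prefix) and dev not in root_devices
    ]
```

Run against the output above with root set to /docker, it reports the /docker/aufs mount that comes from /dev/xvda1.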
I am trying to install Docker on Ubuntu 16.04 on a remote web server. However, I get an error that is really frustrating me. I have installed Docker many times before, including on this OS, but this has never happened.
I am stuck at sudo apt-get install -y docker-engine; docker-engine cannot be installed:
~# systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since So 2017-03-05 17:47:20 CET; 32s ago
Docs: https://docs.docker.com
Main PID: 18194 (code=exited, status=1/FAILURE)
dockerd[18194]: time="2017-03-05T17:47:20.567753592+01:00" level=error msg="'overlay' not found as a supported filesystem on this host. Please e
dockerd[18194]: time="2017-03-05T17:47:20.569299675+01:00" level=error msg="'overlay' not found as a supported filesystem on this host. Please e
dockerd[18194]: time="2017-03-05T17:47:20.591796895+01:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
dockerd[18194]: time="2017-03-05T17:47:20.592394882+01:00" level=warning msg="Your kernel does not support oom control"
dockerd[18194]: time="2017-03-05T17:47:20.592410368+01:00" level=warning msg="Your kernel does not support memory swappiness"
dockerd[18194]: time="2017-03-05T17:47:20.592421460+01:00" level=warning msg="Your kernel does not support kernel memory limit"
dockerd[18194]: time="2017-03-05T17:47:20.592427398+01:00" level=warning msg="Unable to find cpu cgroup in mounts"
dockerd[18194]: time="2017-03-05T17:47:20.592458649+01:00" level=warning msg="Unable to find cpuset cgroup in mounts"
dockerd[18194]: time="2017-03-05T17:47:20.592490516+01:00" level=warning msg="mountpoint for pids not found"
dockerd[18194]: Error starting daemon: Devices cgroup isn't mounted
I added root to the group. I also found advice to add GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1" to the file /etc/default/grub, but that file does not exist!
I also tried sudo apt-get install cgroupfs-mount, but without success :-(
Thank you for your help!
So, folks: unfortunately there is no solution. The reason is the architecture of the remote web server, a hosted VM.
My provider told me that all VMs in one environment share the same kernel, and therefore Docker can't access it.
The only option now is to switch to another server.
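On hosts like this it can save time to check, before installing, whether the cgroup controllers that dockerd complains about are mounted at all. The sketch below scans text in the /proc/mounts layout for cgroup v1 mounts, which is the layout of that era; cgroup_v1_controllers is a hypothetical helper name and the list of controller names is an assumption, not taken from this thread.

```python
# Controller names to look for in the mount options of cgroup v1 mounts.
KNOWN_CONTROLLERS = {
    "cpu", "cpuacct", "cpuset", "devices", "freezer", "memory",
    "blkio", "pids", "net_cls", "net_prio", "perf_event", "hugetlb",
}


def cgroup_v1_controllers(mounts_text):
    """Return the set of cgroup v1 controllers that appear as mounted."""
    mounted = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[2] == "cgroup":
            mounted.update(KNOWN_CONTROLLERS.intersection(fields[3].split(",")))
    return mounted


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        controllers = cgroup_v1_controllers(handle.read())
    print("devices cgroup mounted:", "devices" in controllers)
```

If the devices controller is missing here and the kernel is controlled by the hosting provider, that would be consistent with the outcome described above.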
Most Docker commands never finish; I have to interrupt them manually with CTRL+C. Even simple commands like docker ps or docker info do not respond.
However, docker help and docker version still work.
I think there is something like a deadlock with a particular container, so commands related to containers won't complete.
How should I handle such a situation?
My Docker version is 1.12.3. I don't use Swarm mode. The docker logs command doesn't work either. Using dmesg I can see a lot of I/O errors, but I don't know whether they are related to my problem:
[12898.121287] loop: Write error at byte offset 8882749440, length 4096.
[12898.122837] loop: Write error at byte offset 8883666944, length 4096.
[12898.124685] loop: Write error at byte offset 8882814976, length 4096.
[12898.126459] loop: Write error at byte offset 8883404800, length 4096.
[12898.128201] loop: Write error at byte offset 8883470336, length 4096.
[12898.129921] loop: Write error at byte offset 8883535872, length 4096.
[12898.131774] loop: Write error at byte offset 8883601408, length 4096.
[12898.133594] loop: Write error at byte offset 8883732480, length 4096.
[12917.269786] loop: Write error at byte offset 8883798016, length 4096.
[12917.270331] quiet_error: 632 callbacks suppressed
[12917.270334] Buffer I/O error on device dm-6, logical block 1313320
[12917.270540] lost page write due to I/O error on dm-6
[12917.270543] Buffer I/O error on device dm-6, logical block 1313321
[12917.270740] lost page write due to I/O error on dm-6
[12917.270742] Buffer I/O error on device dm-6, logical block 1313322
[12917.270957] lost page write due to I/O error on dm-6
[12917.270959] Buffer I/O error on device dm-6, logical block 1313323
[12917.271177] lost page write due to I/O error on dm-6
[12917.271179] Buffer I/O error on device dm-6, logical block 1313324
[12917.271377] lost page write due to I/O error on dm-6
[12917.271379] Buffer I/O error on device dm-6, logical block 1313325
[12917.271573] lost page write due to I/O error on dm-6
[12917.301759] loop: Write error at byte offset 8883863552, length 4096.
[12917.312038] loop: Write error at byte offset 8883929088, length 4096.
[12917.312396] Buffer I/O error on device dm-6, logical block 1313328
[12917.312635] lost page write due to I/O error on dm-6
[12917.312638] Buffer I/O error on device dm-6, logical block 1313329
[12917.312867] lost page write due to I/O error on dm-6
[12917.312869] Buffer I/O error on device dm-6, logical block 1313330
[12917.313121] lost page write due to I/O error on dm-6
[12917.313123] Buffer I/O error on device dm-6, logical block 1313331
[12917.313346] lost page write due to I/O error on dm-6
[13090.853726] INFO: task kworker/u8:0:17212 blocked for more than 120 seconds.
[13090.854055] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Using the command sudo systemctl status -l docker, the following messages are printed, but I cannot tell if they are related:
dockerd[1344]: time="2016-11-24T17:49:01.184874648+01:00" level=warning msg="libcontainerd: container c9f35af1836bf856001ca6156663f713c1217a697e8d2451927c67797fb5a770 restart canceled"
dockerd[1344]: time="2016-11-24T17:49:02.627116016+01:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers : [nameserver 8.8.8.8 nameserver 8.8.4.4]"
dockerd[1344]: time="2016-11-24T17:49:02.627152661+01:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers : [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
dockerd[1344]: time="2016-11-24T18:19:51.472701647+01:00" level=warning msg="libcontainerd: container c9f35af1836bf856001ca6156663f713c1217a697e8d2451927c67797fb5a770 restart canceled"
dockerd[1344]: time="2016-11-24T18:19:56.712126199+01:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers : [nameserver 8.8.8.8 nameserver 8.8.4.4]"
dockerd[1344]: time="2016-11-24T18:19:56.712159759+01:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers : [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
dockerd[1344]: time="2016-11-24T18:34:24.301786606+01:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers : [nameserver 8.8.8.8 nameserver 8.8.4.4]"
dockerd[1344]: time="2016-11-24T18:34:24.302208751+01:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers : [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
The hanging Docker commands started after I deleted a container.
The daemon dockerd was in an abnormal state: it couldn't be started (sudo service docker start) after having been stopped (service docker stop).
# sudo service docker start
Redirecting to /bin/systemctl start docker.service
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
# journalctl -xe
kernel: device-mapper: ioctl: unable to remove open device docker-253:0-19468577-d6f74dd67f106d6bfa483df4ee534dd9545dc8ca
...
systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
systemd[1]: Failed to start Docker Application Container Engine.
systemd[1]: Unit docker.service entered failed state.
systemd[1]: docker.service failed.
polkitd[896]: Unregistered Authentication Agent for unix-process:22551:34177094 (system bus name :1.290, object path /org
kernel: dev_remove: 41 callbacks suppressed
kernel: device-mapper: ioctl: unable to remove open device docker-253:0-19468577-fc63401af903e22d05a4518e02504527f0d7883f9d997d7d97fdfe72ba789863
...
dockerd[22566]: time="2016-11-28T10:18:09.840268573+01:00" level=fatal msg="Error starting daemon: timeout"
systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
systemd[1]: Failed to start Docker Application Container Engine.
Moreover, many zombie Docker processes could be observed using ps -eax | grep docker (presence of a "Z" in the "STAT" column), for example docker-proxies.
After rebooting the server and restarting Docker, the zombie processes disappeared and Docker commands were working again.
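The zombie check can be scripted instead of read by eye. This is a minimal sketch assuming the output of ps -eo pid,stat,comm with its header line; zombie_processes is a hypothetical helper name.

```python
import subprocess


def zombie_processes(ps_output, name_filter="docker"):
    """Return (pid, command) for zombie processes whose name matches."""
    zombies = []
    for line in ps_output.splitlines()[1:]:  # first line is the header
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, stat, command = fields
        # A STAT value starting with "Z" marks a zombie (defunct) process.
        if stat.startswith("Z") and name_filter in command:
            zombies.append((int(pid), command.strip()))
    return zombies


if __name__ == "__main__":
    output = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, command in zombie_processes(output):
        print(pid, command)
```

Zombies cannot be killed directly; they disappear once their parent reaps them or exits, which fits the observation that a reboot cleared them.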
I just had a similar issue. Rebooting the server did not work for me. The issue appeared right after I added a new container that had some kind of error. After that, most Docker commands did not respond. I fixed it by executing the following command:
docker system prune -a
This removes all stopped containers, along with unused networks and images; in my case that included the container I had just added. More information:
https://docs.docker.com/engine/reference/commandline/system_prune/
I had the same problem (commands not responding) and fixed it by increasing the resources allocated to Docker.
Docker Desktop -> Preferences -> Advanced
In my case, I increased:
Memory from 2GB to 8GB
Swap from 1GB to 2GB
Try different values according to your machine.
From the symptoms you describe, this looks like something I struggled with as well.
I did the following; hope it helps!
After checking that the service was not responding, using:
systemctl status docker.service
I used the following command to get it working:
sudo dockerd --debug
Restarting my PC worked for me