How to use loop devices locally in Docker

I want to use loop devices in a Docker container locally. That means, when running a couple of containers, each of them should have, for instance, a /dev/loop0 connected to a file local to that container. I tried:
[root@600bbfb452d1 /]# mknod /dev/loop20 b 7 20
[root@600bbfb452d1 /]# dd if=/dev/random of=loopfile1 bs=1M count=2
[root@600bbfb452d1 /]# losetup /dev/loop20 loopfile1
[root@600bbfb452d1 /]# losetup -a | grep 20
/dev/loop20: [0049]:3553002 (/loopfile1)
So far so good. But going back to the host I can see:
[loewe@linux-2 ~]$ losetup -a | grep 20
/dev/loop20: []: (/loopfile1)
The loop device /dev/loop20 was also created in the host's /dev - as I feared, because of the tmpfs mount - and worse, the container-local file "loopfile1" is attached to the host's loop device.
I tried to umount the /dev filesystem in the container but didn't succeed (device busy, but no process visible with lsof).
Any idea what I am doing wrong?
BTW: using iSCSI devices in a container should have the same problem.
Thanks Heiko

Related

Docker device-cgroups-rule, mknod and mount

I'm attempting to implement what is described here:
https://docs.docker.com/engine/reference/commandline/create/#dealing-with-dynamically-created-devices---device-cgroup-rule
Similar to that page, I am creating (and then starting) a container as follows:
docker create --device-cgroup-rule='b 8:* rmw' --name my-container my-image
Quoting from the above page
Then, a user could ask udev to execute a script that would docker exec
my-container mknod newDevX c 42 the required device when it is
added.
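For reference, a sketch of what such a udev hook could look like (the rule file path, helper script name, and the reuse of my-container are assumptions for illustration, not taken from the docs). A rule, e.g. in /etc/udev/rules.d/90-docker-device.rules, reacts to newly added sd* block devices:
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd*", RUN+="/usr/local/bin/docker-mknod.sh %k %M %m"
and the helper it calls recreates the matching node inside the running container:
#!/bin/sh
# docker-mknod.sh (hypothetical helper): udev passes kernel name, major and minor numbers
docker exec my-container mknod "/dev/$1" b "$2" "$3"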
Within the container (docker exec -it my-container sh) I then mknod a device:
mknod /dev/sdc1 b 8 33
The device is reported by lsblk as follows:
sdc 8:32 1 500M 0 disk
└─sdc1 8:33 1 500M 0 part
mknod succeeds but mounting /dev/sdc1 gives an error:
$ mount /dev/sdc1 /mnt
mount: /mnt: permission denied.
I also tried various other things like
mknod with -m
docker start with --cap-add=CAP_MKNOD
EDIT:
I also tried starting with --privileged but without the /dev/sdc1 precreated, and it worked. It must have something to do with capabilities or other differences between privileged and non-privileged mode. I tried with --cap-add=CAP_MKNOD and CAP_SYS_ADMIN, but it now reports a different message:
$ mount /dev/sdc1 /mnt
mount: /mnt: cannot mount /dev/sdc1 read-only.
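For comparison, the working but far less isolated variant from the edit would look roughly like this (a sketch reusing the names from above; --privileged lifts the device cgroup restrictions and keeps CAP_SYS_ADMIN, so it is not a recommendation):
# Privileged container: the host's block devices are visible, no manual mknod needed
docker create --privileged --name my-container my-image
docker start my-container
docker exec -it my-container mount /dev/sdc1 /mnt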

Rancher (Docker) disk usage cleanup

The Rancher system started to use a heavy amount of disk space. Kubernetes was set up by Rancher's RKE.
Disk usage is already over 5 TB, yet I have only 10-12 ReplicaSets, and their real data is bound to PVs backed by NFS (which has a size of only 10 GB).
df -h --total clearly shows what takes up so much space:
Filesystem Size Used Avail Use% Mounted on
overlay 98G 78G 16G 84% /var/lib/docker/overlay2/84db..somehash/merged
I have ~50-60 entries like these.
How can I clean these up? Is there any maintenance feature in Rancher for this? I couldn't find any.
Kubernetes's garbage collection should be cleaning up your nodes.
This looks a lot like an issue I saw with some log collectors like Splunk and Datadog.
If the following usage numbers do not match up, use the script below to release the stale file descriptors.
df -h /var/lib/docker
docker system df
Workaround:
ps aux | grep dockerd        # <<== note this pid
cd /proc/$(pidof dockerd)/fd
ls -l | grep var.log.journal | grep 'deleted.$' | awk '{print $9}' | while read x; do : > "$x"; done

Can a mounted volume in Kubernetes be accessed from the host os filesystem

My real question is: if secrets are mounted as volumes in pods, can they be read if someone gains root access to the host OS?
For example by accessing /var/lib/docker and drilling down to the volume.
If someone has root access to your host with containers, he can do pretty much whatever he wants... Don't forget that pods are just a bunch of containers, which in fact are processes with pids. So for example, if I have a pod called sleeper:
kubectl get pods sleeper-546494588f-tx6pp -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
sleeper-546494588f-tx6pp 1/1 Running 1 21h 10.200.1.14 k8s-node-2 <none>
running on the node k8s-node-2. With root access to this node, I can check what pid this pod and its containers have (I am using containerd as the container engine, but the points below are very similar for Docker or any other container engine):
[root@k8s-node-2 /]# crictl -r unix:///var/run/containerd/containerd.sock pods --name sleeper-546494588f-tx6pp -q
ec27f502f4edd42b85a93503ea77b6062a3504cbb7ac6d696f44e2849135c24e
[root@k8s-node-2 /]# crictl -r unix:///var/run/containerd/containerd.sock ps -p ec27f502f4edd42b85a93503ea77b6062a3504cbb7ac6d696f44e2849135c24e
CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT POD ID
70ca6950de10b 8ac48589692a5 2 hours ago Running sleeper 1 ec27f502f4edd
[root@k8s-node-2 /]# crictl -r unix:///var/run/containerd/containerd.sock inspect 70ca6950de10b | grep pid | head -n 1
"pid": 24180,
And finally, with that information (the pid), I can access the "/" mountpoint of this process and check its contents, including secrets:
[root@k8s-node-2 /]# ll /proc/24180/root/var/run/secrets/kubernetes.io/serviceaccount/
total 0
lrwxrwxrwx. 1 root root 13 Nov 14 13:57 ca.crt -> ..data/ca.crt
lrwxrwxrwx. 1 root root 16 Nov 14 13:57 namespace -> ..data/namespace
lrwxrwxrwx. 1 root root 12 Nov 14 13:57 token -> ..data/token
[root@k8s-node-2 serviceaccount]# cat /proc/24180/root/var/run/secrets/kubernetes.io/serviceaccount/namespace ; echo
default
[root@k8s-node-2 serviceaccount]# cat /proc/24180/root/var/run/secrets/kubernetes.io/serviceaccount/token | cut -d'.' -f 1 | base64 -d ;echo
{"alg":"RS256","kid":""}
[root@k8s-node-2 serviceaccount]# cat /proc/24180/root/var/run/secrets/kubernetes.io/serviceaccount/token | cut -d'.' -f 2 | base64 -d 2>/dev/null ;echo
{"iss":"kubernetes/serviceaccount","kubernetes.io/serviceaccount/namespace":"default","kubernetes.io/serviceaccount/secret.name":"default-token-6sbz9","kubernetes.io/serviceaccount/service-account.name":"default","kubernetes.io/serviceaccount/service-account.uid":"42e7f596-e74e-11e8-af81-525400e6d25d","sub":"system:serviceaccount:default:default"}
It is one of the reasons why it is super important to properly secure access to your kubernetes infrastructure.

Can I use "mount" inside a Docker Alpine container?

I am Dockerising an old project. A feature in the project pulls in user-specified Git repos, and since the size of a repo could cause the filing system to be overwhelmed, I created a local filing system of a fixed size, and then mounted it. This was intended to prevent the web host from having its file system filled up.
The general approach is this:
IMAGE=filesystem/image.img
MOUNT_POINT=filesystem/mount
SIZE=20
PROJECT_ROOT=`pwd`
MOUNTCMD=/bin/mount
# Number of M to set aside for this filing system
dd if=/dev/zero of=$IMAGE bs=1M count=$SIZE &> /dev/null
# Format: the -F permits creation even though it's not a "block special device"
mkfs.ext3 -F -q $IMAGE
# Mount if the filing system is not already mounted
$MOUNTCMD | cut -d ' ' -f 3 | grep -q "^${PROJECT_ROOT}/${MOUNT_POINT}$"
if [ $? -ne 0 ]; then
    # -p Create all parent dirs as necessary
    mkdir -p $MOUNT_POINT
    /bin/mount -t ext3 $IMAGE $MOUNT_POINT
fi
This works fine in a local or remote Linux VM. However, I'd like to run this shell code, or something like it, inside a container. Part of the reason I'd like to do that is to contain all the fiddly stuff inside a container, so that building a new host machine is kept as simple as possible (in my view, setting up custom mounts and cron-restart rules on the host works against that).
So, this command does not work inside a container ("filesystem" is an on-host Docker volume)
mount -t ext3 filesystem/image.img filesystem/mount
mount: can't setup loop device: No space left on device
It also does not work on a container folder ("filesystem2" is a container directory):
dd if=/dev/zero of=filesystem2/image.img bs=1M count=20
mount -t ext3 filesystem2/image.img filesystem2/mount
mount: can't setup loop device: No space left on device
I wonder whether containers just don't have the right internal machinery to do mounting, and thus whether I should change course. I'd prefer not to spend too much time on this (I'm just moving a project to a Docker-only server) which is why I would like to get mount working if I can.
Other options
If that's not possible, then a size-limited Docker volume, that works with both Docker and Swarm, may be an alternative I'd need to look into. There are conflicting reports on the web as to whether this actually works (see this question).
There is a suggestion here to say this is supported in Flocker. However, I am hesitant to use that, as it appears to be abandoned, presumably having been affected by ClusterHQ going bust.
This post indicates I can use --storage-opt size=120G with docker run. However, it does not look like it is supported by docker service create (unless perhaps the option has been renamed).
Update
As per the comment convo, I made some progress; I found that adding --privileged to the docker run enables mounting, at the cost of removing security isolation. A helpful commenter says that it is better to use the more fine-grained control of --cap-add SYS_ADMIN, allowing the container to retain some of its isolation.
However, Docker Swarm has not yet implemented either of these flags, so I can't use this solution. This lengthy feature request suggests to me that this feature is not going to be added in a hurry; it's been pending for two years already.
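For readers not constrained by Swarm, the commenter's suggestion would look something like this (a sketch; the image name and paths are placeholders, and on AppArmor-enabled hosts an extra --security-opt apparmor=unconfined may also be needed):
# Keep CAP_SYS_ADMIN so the container may perform the loop mount itself
docker run --rm -it --cap-add SYS_ADMIN my-image \
    sh -c 'mount -t ext3 /filesystem/image.img /filesystem/mount && df -h /filesystem/mount'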
You won't be able to safely do this inside of a container. Docker removes the mount privilege from containers because using this you could mount the host filesystem and escape the container. However, you can do this outside of the container and mount the filesystem into the container as a volume using the default local driver. The size option isn't supported by most filesystems, tmpfs being one of the few exceptions. Most of them use the size of the underlying device which you defined with the image file creation command:
dd if=/dev/zero of=filesystem/image.img bs=1M count=$SIZE
I had trouble getting docker to create the loop device dynamically, so here's the process to create it manually:
$ sudo losetup --find --show ./vol-image.img
/dev/loop0
$ sudo mkfs -t ext3 /dev/loop0
mke2fs 1.43.4 (31-Jan-2017)
Creating filesystem with 10240 1k blocks and 2560 inodes
Filesystem UUID: 25c95fcd-6c78-4b8e-b923-f808517b28df
Superblock backups stored on blocks:
8193
Allocating group tables: done
Writing inode tables: done
Creating journal (1024 blocks): done
Writing superblocks and filesystem accounting information: done
When defining the volume, mount options are passed almost verbatim from the mount command you would run on the command line:
docker volume create --driver local --opt type=ext3 \
--opt device=filesystem/image.img app_vol
docker service create --mount type=volume,src=app_vol,dst=/filesystem/mount ...
or in a single service create command:
docker service create \
--mount type=volume,src=app_vol,dst=/filesystem/mount,volume-driver=local,volume-opt=type=ext3,volume-opt=device=filesystem/image.img ...
With docker run, the command looks like:
$ docker run -it --rm --mount type=volume,dst=/data,src=ext3vol,volume-driver=local,volume-opt=type=ext3,volume-opt=device=/dev/loop0 busybox /bin/sh
/ # ls -al /data
total 17
drwxr-xr-x 3 root root 1024 Sep 19 14:39 .
drwxr-xr-x 1 root root 4096 Sep 19 14:40 ..
drwx------ 2 root root 12288 Sep 19 14:39 lost+found
The only prerequisite is that you create this file and loop device before creating the service, and that this file is accessible wherever the service is scheduled. I would also suggest making all of the paths in these commands fully qualified rather than relative to the current directory. I'm pretty sure there are a few places that relative paths don't work.
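For instance, the volume definition above with fully qualified paths might look like this (the /srv/app prefix is only a placeholder):
docker volume create --driver local --opt type=ext3 \
    --opt device=/srv/app/filesystem/image.img app_vol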
I have found a size-limiting solution I am happy with, and it does not use the Linux mount command at all. I've not implemented it yet, but the tests documented below are satisfying enough. Readers may wish to note the minor warnings at the end.
I had not tried mounting Docker volumes prior to asking this question, since part of my research stumbled on a Stack Overflow poster casting doubt on whether Docker volumes can be made to respect a size limitation. My test indicates that they can, but you may wish to test this on your own platform to ensure it works for you.
Size limit on Docker container
The below commands have been cobbled together from various sources on the web.
To start with, I create a volume like so, with a 20m size limit:
docker volume create \
--driver local \
--opt o=size=20m \
--opt type=tmpfs \
--opt device=tmpfs \
hello-volume
I then create an Alpine Swarm service with a mount on this container:
docker service create \
--mount source=hello-volume,target=/myvol \
alpine \
sleep 10000
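Outside of Swarm, the same volume can also be attached with plain docker run, which is handy for a quick check (this variant is not part of the walkthrough below):
docker run -it --rm --mount source=hello-volume,target=/myvol alpine sh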
We can ensure the container is mounted by getting a shell on the single container in this service:
docker exec -it amazing_feynman.1.lpsgoyv0jrju6fvb8skrybqap sh
/ # ls -l /myvol
total 0
OK, great. So, while remaining in this shell, let's try slowly overwhelming this disk, in 5m increments. We can see that it fails on the fifth try, which is what we would expect:
/ # cd /myvol
/myvol # ls
/myvol # dd if=/dev/zero of=image1 bs=1M count=5
5+0 records in
5+0 records out
/myvol # dd if=/dev/zero of=image2 bs=1M count=5
5+0 records in
5+0 records out
/myvol # ls -l
total 10240
-rw-r--r-- 1 root root 5242880 Sep 16 13:11 image1
-rw-r--r-- 1 root root 5242880 Sep 16 13:12 image2
/myvol # dd if=/dev/zero of=image3 bs=1M count=5
5+0 records in
5+0 records out
/myvol # dd if=/dev/zero of=image4 bs=1M count=5
5+0 records in
5+0 records out
/myvol # ls -l
total 20480
-rw-r--r-- 1 root root 5242880 Sep 16 13:11 image1
-rw-r--r-- 1 root root 5242880 Sep 16 13:12 image2
-rw-r--r-- 1 root root 5242880 Sep 16 13:12 image3
-rw-r--r-- 1 root root 5242880 Sep 16 13:12 image4
/myvol # dd if=/dev/zero of=image5 bs=1M count=5
dd: writing 'image5': No space left on device
1+0 records in
0+0 records out
/myvol #
Finally, let's see if we can get an error by overwhelming the disk in one go, in case the limitation only applies to newly opened file handles in a full disk:
/ # cd /myvol
/myvol # rm *
/myvol # dd if=/dev/zero of=image1 bs=1M count=21
dd: writing 'image1': No space left on device
21+0 records in
20+0 records out
It turns out we can, so that looks pretty robust to me.
Nota bene
The volume is created with a type and a device of "tmpfs", which sounded to me worryingly like a RAM disk. I've successfully checked that the volume remains connected and intact after a system reboot, so it looks good to me, at least for now.
However, I'd say that when it comes to organising your data persistence systems, don't just copy what I have. Make sure the volume is robust enough for your use case before you put it into production, and of course, make sure you include it in your back-up process.
(This is for Docker version 18.06.1-ce, build e68fc7a).

'No space left on device' after I changed Docker's storage base directory with DOCKER_OPTIONS

I changed Docker's storage base directory from /var/lib/docker to /home/docker by changing DOCKER_OPTIONS in /etc/default/docker as explained in this other question. After that, I rsynced the old /var/lib/docker to the new place.
Here is my Docker configuration file:
# Docker Upstart and SysVinit configuration file
# ....
# Customize location of Docker binary (especially for development testing).
#DOCKER="/usr/local/bin/docker"
# Use DOCKER_OPTS to modify the daemon startup options.
DOCKER_OPTS="--dns 8.8.8.8 --dns 8.8.4.4 -g /home/docker"
# If you need Docker to use an HTTP proxy, it can also be specified here.
#export http_proxy="http://127.0.0.1:3128/"
# This is also a handy place to tweak where Docker's temporary files go.
#export TMPDIR="/mnt/bigdrive/docker-tmp"
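(On more recent Docker releases the same relocation is usually done with data-root in /etc/docker/daemon.json instead of the -g flag; a sketch, assuming a systemd-managed daemon:)
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "data-root": "/home/docker",
  "dns": ["8.8.8.8", "8.8.4.4"]
}
EOF
sudo systemctl restart docker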
Everything was working fine after I rebooted. However, I started getting a "no space left on device" in my containers from time to time. When this error happens, if my container is up, I can't even do a mkdir. If the container is down and I try to start it, I get the following:
Error response from daemon: rpc error: code = 2 desc = "oci runtime
error: could not synchronise with container process: can't create
pivot_root dir , error mkdir .pivot_root: no space left on device"
However, I have space:
Filesystem Size Used Avail Use% Mounted on
udev 32G 4,0K 32G 1% /dev
tmpfs 6,3G 1,6M 6,3G 1% /run
/dev/sda1 92G 56G 32G 64% /
none 4,0K 0 4,0K 0% /sys/fs/cgroup
none 5,0M 0 5,0M 0% /run/lock
none 32G 472K 32G 1% /run/shm
none 100M 0 100M 0% /run/user
/dev/sda5 1,6T 790G 762G 51% /home
I'm suspecting that perhaps I haven't done the storage migration correctly. Does someone know what might be happening?
Running out of disk space can also mean running out of inodes. You can check those with df -i. This post on Unix.SE walks you through the steps required to increase the number of inodes available. Short of that, you can delete files to free up the inodes.
You can try cleaning up images that aren't in use. This fixed the problem for me:
docker images -aq -f 'dangling=true' | xargs docker rmi
As well as volumes. This will remove dangling volumes:
docker volume ls -q -f 'dangling=true' | xargs docker volume rm
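On newer Docker versions a single command covers dangling images, stopped containers, unused networks and the build cache (not from the original answer; note that -a and --volumes remove more than just dangling objects, so use them with care):
docker system prune
docker system prune -a --volumes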
https://success.docker.com/article/error-message-no-space-left-on-device-in-default-machine
