Mount sysfs with write permissions on an unprivileged Docker container

When running docker run -v /sys:/sys:rw $NAME $COMMAND, the container tries to write to a file in /sys, but this fails with a "Permission denied" error.
The container user is root, and on the host root has write permission to the files.
The issue goes away with --privileged, but that option has serious security drawbacks.
How can a container get write access to sysfs without using --privileged?
I tried adding all the capabilities with --cap-add, but it didn't help.

Docker mounts the /sys filesystem read-only. A thread on the Docker forum suggests that an alternative container runtime, Sysbox, can help, but I haven't tried it:
https://forums.docker.com/t/unable-to-mount-sys-inside-the-container-with-rw-access-in-unprivileged-mode/97043
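On a stock Docker install you can confirm the read-only mount from inside an ordinary container by inspecting /proc/mounts; a minimal check (the exact mount options may vary by host):
docker run --rm alpine grep ' /sys ' /proc/mounts
sysfs /sys sysfs ro,nosuid,nodev,noexec,relatime 0 0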

Related

Docker volume over fuse: Transport endpoint is not connected

So I have this remote folder /mnt/shared mounted with fuse. It is mostly available, except for occasional disconnections from time to time.
The actual mounted folder /mnt/shared becomes available again when the re-connection happens.
The issue is that I put this folder into a docker volume to make it available to my app: /shared. When I start the container, the volume is available.
But if a disconnection happens in between, even once the /mnt/shared folder on the host machine is available again, the /shared folder is not accessible from the container, and I get:
user@machine:~$ docker exec -it e313ec554814 bash
root@e313ec554814:/app# ls /shared
ls: cannot access '/shared': Transport endpoint is not connected
In order to get it to work again, the only solution I have found is to docker restart e313ec554814, which brings downtime to my app and is therefore not an acceptable solution.
So my questions are:
Is this somehow a docker "bug" not to reconnect to the mounted folder when it is available again?
Can I execute this task manually, without having to restart the whole container?
Thanks
I would try the following solution.
If you currently mount the volume into your container like so:
docker run -v /mnt/shared:/shared my-image
then create an intermediate directory /mnt/base/shared and mount that instead:
docker run -v /mnt/base/shared:/base/shared my-image
and adjust your code to refer to the new path, or create a link from /base/shared to /shared inside the container.
Explanation:
The problem is that the mounted directory /mnt/shared is probably deleted on the host machine when there is a disconnection, and a new directory is created once the connection is back. But the container started running with a mapping to the old directory, which was deleted. By creating an intermediate directory and mapping to that instead, you avoid the stale mapping.
Another solution that might work is to mount the directory using bind-propagation=shared, e.g.:
--mount type=bind,source=/mnt/shared,target=/shared,bind-propagation=shared
See the docker docs explaining bind-propagation.
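Spelled out for the question's paths, the full command might look like this (my-image stands in for the actual image, as in the answer above); with shared propagation, re-mounting /mnt/shared on the host propagates into the already-running container:
docker run -d --mount type=bind,source=/mnt/shared,target=/shared,bind-propagation=shared my-image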

Docker with SELinux enabled - relabeling content in /usr is not allowed

I have Docker on CentOS 7 with SELinux set to enforcing on the host, and the Docker daemon is started with the --selinux-enabled flag.
When I try to run the following command:
docker run -it -v /usr/local/xya/log:/usr/local/xya/log:z centos/systemd touch /usr/local/xya/log/test
I get the following error:
docker: Error response from daemon: error setting label on mount source '/usr/local/xya/log': relabeling content in /usr is not allowed.
As per some articles (http://jaormx.github.io/2018/selinux-and-docker-notes/), the z flag is supposed to make /usr writable; I'm not sure what I am missing.
Docker version 19.03.3, build a872fc2f86
CentOS version: CentOS Linux release 7.5.1804
I recently had a similar (albeit different) issue, and I found Juan's SELinux and docker notes helpful.
I'm having trouble finding the documentation that highlighted the following point, but I recall seeing it and was able to get around my issue by accepting it as truth; I will update this if/when I stumble across it again: not everything within /usr or /etc will grant you write access under SELinux, at least not in the context of Docker.
You can access the /etc and /usr directories within an SELinux context, but you cannot obtain write access everywhere, so z and Z will occasionally give you unable-to-relabel issues when spinning up containers with volume mounts from those locations. However, if you have SELinux-protected files elsewhere, e.g. in a user's home directory, Docker can relabel those files appropriately; that is, you would be able to write to those SELinux-protected files and directories with the z or Z flags.
If you need to write within the /usr or /etc directories and are getting the unable-to-relabel alert, the --privileged flag or the --security-opt label:disable flag should be used instead of the z syntax. This gives you write access, but you need to remove the z from your volume mount, as Docker would otherwise still give you the unable-to-relabel error.
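Applied to the question's command, a sketch of the label:disable variant (note the volume no longer carries :z):
docker run -it --security-opt label:disable -v /usr/local/xya/log:/usr/local/xya/log centos/systemd touch /usr/local/xya/log/test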
Note: you can also invoke privileged mode in the docker-compose.yml via privileged: true for a given service, as sketched below.
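A minimal compose sketch of that note (the service name app is illustrative):
services:
  app:
    image: centos/systemd
    privileged: true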
The image has no permission to edit or create new files in the /usr folder; per the docs, you can start the container with the --privileged parameter.

Adding a user to the docker group lets that user access unauthorized directories in a container by mounting them via -v

On my CentOS system, I added a user to the docker group, and I found that such a user can access any folder by attaching it to a container via docker run -it -v path-to-directory:directory-in-container. For example, I have a folder with mode 700 that only root can access, but if someone without permission to access this folder runs a container and mounts the folder into it, he can access it inside the container, as the sketch below shows. How can I prevent such a user from attaching unauthorized directories to a Docker container? My Docker version is 17.03.0-ce, OS CentOS 7.0. Thanks!
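A sketch of the problem (the path /root/secret is illustrative): because the process in the container runs as root, the bind mount bypasses the host's mode-700 restriction:
docker run --rm -v /root/secret:/mnt alpine ls /mnt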
You should follow the usual container security principles for setting up external volume permissions. But if you simply want to test container features,
you can set the path-to-directory access mode to 777, i.e. world-readable-writable. Then no additional owner/group or access-mode setup is needed for the container volume mapping:
chmod 777 /path/to/directory
Docker daemon runs as root and normally starts the containers as root, with users inside mapping one-to-one to the host users, so anybody in the docker group has effective root permissions.
There is an option to tell dockerd to run the containers via subusers of a specific user, see https://docs.docker.com/engine/security/userns-remap/. That prevents full root access, but everybody accessing the Docker daemon will be running their containers under that user, and if that user is not them, they won't be able to usefully mount things in their home.
Also, I believe it is incompatible with --privileged containers, but of course those give you full root access via other means anyway. A sketch of enabling the remapping follows.
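A minimal sketch of enabling the remapping daemon-wide, assuming a systemd-managed Docker (daemon.json is the documented home for the userns-remap setting):
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "userns-remap": "default"
}
EOF
sudo systemctl restart docker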

Using SMB shares as docker volumes

I'm new to docker and docker-compose.
I'm trying to run a service using docker-compose on my Raspberry PI. The data this service uses is stored on my NAS and is accessible via samba.
I'm currently using this bash script to launch the container:
sudo mount -t cifs -o user=test,password=test //192.168.0.60/test /mnt/test
docker-compose up --force-recreate -d
Where the docker-compose.yml file simply creates a container from an image and binds its own local /home/test folder to the /mnt/test folder on the host.
This works perfectly fine when launched from the script. However, I'd like the container to restart automatically when the host reboots, so I specified always as the restart policy. After a reboot, then, the container starts automatically without anyone having mounted the remote folder, and as a result the service does not work correctly.
What would be the best approach to solve this issue? Should I use a volume driver to mount the remote share (I'm on an ARM architecture, so my choices are limited)? Is there a way to run a shell script on the host when starting the docker-compose process? Should I mount the remote folder from inside the container?
Thanks
What would be the best approach to solve this issue?
As @Frap suggested, use systemd units to manage the mount and the service, and the dependencies between them.
This document discusses how you could set up a Samba mount as a systemd unit. Under Raspbian, it should look something like:
[Unit]
Description=Mount Share at boot
Wants=network-online.target
After=network-online.target
Before=docker.service

[Mount]
What=//192.168.0.60/test
Where=/mnt/test
Options=credentials=/etc/samba/creds/myshare,rw
Type=cifs
TimeoutSec=30

[Install]
WantedBy=multi-user.target
RequiredBy=docker.service
Place this in /etc/systemd/system/mnt-test.mount (systemd requires the unit file name to match the Where= path), and then:
systemctl enable mnt-test.mount
systemctl start mnt-test.mount
The Wants= and After=network-online.target lines cause systemd to wait until the network is available before trying to access this share. The Before=docker.service line causes systemd to launch Docker only after this share has been mounted. RequiredBy=docker.service means that if you start docker.service, this share will be mounted first (if it wasn't already), and that if the mount fails, Docker will not start.
This is using a credentials file rather than specifying the username/password in the unit itself; a credentials file would look like:
username=test
password=test
You could just replace the credentials option with username= and password=.
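With the credentials from the question's mount command, that line of the [Mount] section would become:
Options=username=test,password=test,rw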
Should I mount the remote folder from inside the container?
A standard Docker container can't mount filesystems. You can create a privileged container (by adding --privileged to the docker run command line), but that's generally a bad idea (because that container now has unrestricted root access to your host).
I finally "solved" my own issue by defining a script to run in the /etc/rc.local file. It will launch the mount and docker-compose up commands on every reboot.
Being just 2 lines of code and not dependent on any particular Unix flavor, it felt to me like the most portable solution, barring a docker-only solution that I was unable to find.
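A sketch of those two lines (the compose project path is a placeholder for wherever docker-compose.yml lives):
mount -t cifs -o user=test,password=test //192.168.0.60/test /mnt/test
cd /path/to/compose/project && docker-compose up --force-recreate -d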
Thanks all for the answers

Docker root access to host system

When I run a container as a normal user I can map and modify directories owned by root on my host filesystem. This seems to be a big security hole. For example I can do the following:
$ docker run -it --rm -v /bin:/tmp/a debian
root@14da9657acc7:/# cd /tmp/a
root@f2547c755c14:/tmp/a# mv df df.orig
root@f2547c755c14:/tmp/a# cp ls df
root@f2547c755c14:/tmp/a# exit
Now my host filesystem will execute the ls command when df is typed (a mostly harmless example). I cannot believe that this is the desired behavior, but it is happening on my system (Debian stretch). The docker binary has normal permissions (755, not setuid).
What am I missing?
Maybe it is good to clarify a bit more: I am not at the moment interested in what the container itself does or can do, nor am I concerned with root access inside the container.
Rather, I notice that anyone on my system who can run a docker container can use it to gain root access to my host system and read/write whatever they want as root, effectively giving all users root access. That is obviously not what I want. How can I prevent this?
There are many Docker security features available; the specific one that will help you here is user namespaces.
Basically you need to enable user namespaces on the host machine, with the Docker daemon stopped beforehand:
dockerd --userns-remap=default &
Note that this forbids containers from running in privileged mode (a good thing from a security standpoint). When you run a container, you can additionally restrict it to the current non-privileged user:
docker run -it --rm -v /bin:/tmp/a --user UID:GID debian
Regardless, try entering a container afterwards with your original command of
docker run -it --rm -v /bin:/tmp/a debian
If you attempt to manipulate the host filesystem that was mapped into the volume (in this case /bin), where files and directories are owned by root, you will receive a Permission denied error, as sketched below. This shows that user namespaces provide the security functionality you are looking for.
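An illustrative transcript under the remapped daemon (the prompt and container name are representative, not real output):
root@container:/# touch /tmp/a/x
touch: cannot touch '/tmp/a/x': Permission denied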
I recommend going through the Docker lab on this security feature at https://github.com/docker/labs/tree/master/security/userns. I have done all of the labs and have opened issues and PRs there to keep them accurate, so I can vouch for them.
Access to run docker commands on a host is access to root on that host. This is by design, since the functionality to mount filesystems and isolate an application requires root capabilities on Linux. The security vulnerability here is any sysadmin granting docker access to users they wouldn't otherwise trust with root access on that host. Adding users to the docker group should therefore be done with care.
I still see Docker as a security improvement when used correctly, since applications run inside a container are restricted from what they can do to the host. The ability to cause damage is given with explicit options to running the container, like mounting the root filesystem as a rw volume, direct access to devices, or adding capabilities to root that permit escaping the namespace. Barring the explicit creation of those security holes, an application run inside a container has much less access than it would if it was run outside of the container.
If you still want to try locking down users with access to docker, there are some additional security features. User namespacing is one of those which prevents root inside of the container from having root access on the host. There's also interlock which allows you to limit the commands available per user.
You're missing that containers run as uid 0 internally by default, so this is expected. If you want to restrict permissions more inside the container, build the image with a USER statement in the Dockerfile; this makes the container setuid to the named user at runtime instead of running as root, as sketched below.
Note that the uid of this user is not necessarily predictable, as it is assigned inside the image you build and won't necessarily map to anything on the outside system. However, the point is, it won't be root.
Refer to the Dockerfile reference for more information.
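A minimal Dockerfile sketch of that approach (the user name appuser is illustrative):
FROM debian
RUN useradd --create-home appuser
USER appuser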
