What does 'rw' mean when bind mounting a directory - docker

When starting a container and specifying a volume you can optionally append a third field that's a comma separated list of options like rw.
docker run -v /some-host/path:/some-container/path:rw
These same options are applicable in docker-compose.yml:
services:
  myService:
    image: some/image
    volumes:
      - /some-host/path:/some-container/path:rw
I thought that specifying rw would mean that the container would be able to read from and write to that directory (regardless of user). Contrary to my belief, when the host directory doesn't exist, Docker creates it as drwxr-xr-x 2 root root no matter what I specify. The application in the container is not running as root though, so when it tries to write to the mounted directory it gets Permission denied.
I've dug through the Docker documentation, and even found this GitHub issue describing the same problem, but can't find anything definitive that explains the expected behavior.
So what exactly does rw (read/write) mean when specified as the third option for bind-mounted directories?

As DavidMaze says in the comments
in the same way that / on your host is mounted read-write but isn’t world-writable on every file; if it were mounted read-only nobody could write any file.
And the docs:
If neither 'rw' or 'ro' is specified then the volume is mounted in read-write mode.
And
If you supply an absolute path for the host-dir, Docker bind-mounts to the path you specify.
The directory is mounted as rw by default. To write to a directory, an rw mount is not enough: you also need file permissions on it. On the other hand, having full file permissions is not enough if the directory is mounted read-only. Think of it as two layers of permissions.
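As a rough sketch of that second layer (the host path and UID/GID 1000 are just assumptions here), pre-creating the host directory with ownership that matches the container's user is enough for the rw mount to behave as expected:
mkdir -p /some-host/path
chown 1000:1000 /some-host/path        # match the UID/GID the app runs as inside the container
docker run -v /some-host/path:/some-container/path:rw some/image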
Also:
There is clear value in the ability to make bind mounts read-only, though. Containers are one example: an administrator may wish to create a container in which processes may be running as root. It may be useful for that container to have access to filesystems on the host, but the container should not necessarily have write access to those filesystems.

Related

Does docker container maintain volume data?

This might come across as a stupid question, but I am unable to figure something out about Docker volumes. Going through the official documentation I can see that we can map the host machine's file system into the container for persistent storage. Following the instructions I was able to successfully mount a folder in my container.
Once I exec bash into the container, I can see the mapped directory structure there as expected. My question is, how is the data mapped between these two paths, that is, from the container to the mounted volume on the host OS? Is the data duplicated, or does the container store the data directly on the host volume, with the mapped path acting as something like a symlink?
This question comes up because we are trying to keep a large amount of data on a mounted disk but accessible by the container, with the assumption that mounting a volume would store the data directly on the disk and nothing on the container.
The Docker documentation refers to this type of mount as a "bind mount"; that's also a technical Linux term that allows one part of the filesystem to also appear somewhere else, and there's a mount --bind option you can use outside of Docker (usually a pretty specialized option).
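Outside of Docker, the same mechanism looks roughly like this (requires root; the paths are only illustrative):
mkdir -p /mnt/alias
mount --bind /var/log /mnt/alias    # /mnt/alias now shows the same files as /var/log
umount /mnt/alias                   # detach the bind mount again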
On native Linux, the host content and the container-visible content are literally the exact same disk content. If you have a bind-mounted host directory or a named Docker volume mounted over a container directory, all reads and writes will use that mounted content, and in fact nothing will be written to the container filesystem on that path.
You mention symlinks; these are always resolved as filenames in their respective filesystem space. If the mounted filesystem has a symlink passwd -> /etc/passwd then reading it will yield the host's password file on the host, and the container's password file inside the container. If it has a symlink f -> ../f then it will look at the directory above the mount point in whichever filesystem it's being resolved in.
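A quick way to see this for yourself, as a rough sketch (the paths and the alpine image are assumptions):
mkdir -p /tmp/share
ln -s /etc/hostname /tmp/share/hn
cat /tmp/share/hn                                            # resolves against the host filesystem
docker run --rm -v /tmp/share:/share alpine cat /share/hn    # resolves against the container filesystem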
On non-Linux this process is a little bit more technically complex since there is typically a Linux virtual machine involved in the mix. This usually manifests as file synchronization appearing slow. For data you don't need to directly access as a human, storing it in a named Docker volume will usually be faster.
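If you go the named-volume route, a minimal sketch (the volume and image names are made up) looks like:
docker volume create appdata
docker run -d -v appdata:/var/lib/app some/image    # data lives in Docker's storage area, not a host path you pick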

Why do docker-compose volumes have to be declared twice when not pointing to an actual folder on the host?

In a docker-compose.yml file, do I really need to specify the volumes twice, inside and outside a service? If yes, why? (The docker-compose part of the docs doesn't have much information on that.)
I have the feeling that, in the case shown here where the myapp volume is not explicitly a folder on the host machine, we have to set it twice, but if it actually is a folder on the host machine, specifying it only inside the frontend service block is enough.
In addition, when the volume is specified outside the service block, it's almost always written as key: without any actual value (or sometimes as key: {}), which confuses me.
Moreover, when I run docker-compose down -v it actually only applies on volumes which are not explicitly specified as a folder on the host machine, according to the doc:
-v, --volumes      Remove named volumes declared in the `volumes` section of the
                   Compose file and anonymous volumes attached to containers.
So maybe the declaration of a volume outside a service is for making this volume identifiable, hence 'removable'. And on the other hand, it will never be removable if it's not set outside the service?
This is a whole bunch of questions - let's try to answer them sequentially:
1. Do I really need to specify the volumes twice (inside and outside the services section)?
This is not a duplicate specification: outside you declare the volume and inside you specify how to mount it into a container. A volume has an independent life cycle from services. It can be mounted by several services and it will retain data if services are restarted.
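You can see that independent life cycle from the command line; as a sketch, assuming a compose file that declares a named volume:
docker-compose up -d     # creates the named volume if it doesn't exist yet
docker-compose down      # removes the containers...
docker volume ls         # ...but the named volume is still listed
docker-compose up -d     # new containers reattach to the same data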
2. When the volume is specified outside the service block, it's almost always written as key: {}
This key-only notation is the default and does not require any driver configuration. However, if you needed to e.g. connect to NFS, you would have something like:
volumes:
  example:
    driver_opts:
      type: "nfs"
      o: "addr=10.40.0.199,nolock,soft,rw"
      device: ":/docker/example"
Also, please differentiate between bind mounts and regular volumes. While regular volumes are managed independently from services (and containers), e.g. with docker volume ls, bind mounts are mere mappings between the host and the container file system. They are tied to the container they are mounted into.
3. When I run docker-compose down -v it actually only applies on volumes which are not explicitly specified as a folder on the host machine
Yes, this won't remove bind mounts, since bind mounts are mere host-container filesystem mappings and therefore Docker does not create an independent volume entity for them.
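As a sketch (the bind-mounted path is only an example), you can verify this yourself:
docker-compose down -v   # named volumes declared in the compose file are removed
docker volume ls         # they no longer appear here
ls ./data                # a bind-mounted host directory is left untouched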
For deeper understanding, please consider this excerpt from the documentation:
Bind mounts have been around since the early days of Docker. Bind mounts have limited functionality compared to volumes. When you use a bind mount, a file or directory on the host machine is mounted into a container. The file or directory is referenced by its absolute path on the host machine. By contrast, when you use a volume, a new directory is created within Docker's storage directory on the host machine, and Docker manages that directory's contents.

"a bind mount won't copy the container contents to the host automatically, unlike a named volume"

Need clarity on a comment here:
The only 'problem' with a bind mount is that it won't copy the container contents to the host automatically, unlike a named volume.
docs.docker.com/compose/compose-file/#volumes
Is this accurate? If yes, then:
how does one get the container's "new data" (e.g. a growing database) into the host when using a bind mount (to persist the data in case of a container restart)?
how did Docker persist data across container restarts before there were named volumes?
The only 'problem' with a bind mount is that it won't copy the container contents to the host automatically, unlike a named volume.
Is this accurate?
Close to accurate, but I can see the confusion. Host volumes, aka bind mounts, do not have an initialization feature from Docker. With anonymous and named volumes, Docker will initialize the volume with the contents of the image at that path. This initialization includes ownership and permissions, which helps avoid permission errors. This initialization only runs when the container is created and the volume is new or empty, so subsequent containers will not pick up changes made to the image in newer image versions.
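A minimal sketch of the difference (some/image and the /var/www path are placeholders for an image that ships content at that path):
docker run --rm -v webdata:/var/www some/image ls /var/www            # new named volume, seeded from the image
mkdir -p ./empty
docker run --rm -v "$(pwd)/empty:/var/www" some/image ls /var/www     # bind mount, stays empty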
If yes, then:
how does one get the container's "new data" (e.g. a growing database) into the host when using a bind mount (to persist the data in case of a container restart)?
Reads and writes from the app in the container will continue through to the host filesystem used in the bind mount as expected. It's only the initialization step that doesn't run.
how did Docker persist data across container restarts before there were named volumes?
There were data containers, mounting volumes from other containers, but this was inflexible (all volume paths were fixed to the path in the data container) and mixed management of persistent data with ephemeral containers, and has therefore been phased out.
Volumes are used to handle data persistence between containers. A single container restarting (rather than being replaced) will still have all the container specific filesystem changes. The docker rm command deletes these filesystem changes, along with container logs and metadata/configuration of the container.
The container specific changes are the read/write top layer of an overlay filesystem used by docker. Volume mounts are all separate mounts into subdirectories of this overlay filesystem (just like /home or /var are often separate filesystem mounts in the / filesystem of a Linux host, all reads and writes to those other paths go to a separate underlying filesystem).
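You can see that split directly; a rough sketch (container and volume names are made up, assumes the alpine image):
docker run -d --name demo -v demodata:/data alpine sh -c 'echo v > /data/a; echo c > /tmp/b; sleep 300'
docker diff demo      # lists the change under /tmp (writable container layer) but not /data/a (volume)
docker rm -f demo     # deletes the container layer; the demodata volume survives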
If you're going to mount a volume into a container, and you want that volume to reliably contain some content from the image, you need to manually copy it there at container startup time. One way to do this is with an entrypoint wrapper script:
#!/bin/sh
# Copy data into a possibly-mounted location
cp -a /app/static /var/www
# Then run the image's CMD
exec "$@"
You'd include this in your image's Dockerfile:
# Must use JSON-array syntax
ENTRYPOINT ["/app/entrypoint.sh"]
CMD same as it was before
There are two important details about Docker named volumes' initialization behavior to be aware of here. The first, which you note, is that Docker only copies content into a volume for Docker named volumes; it doesn't happen for bind mounts, and it doesn't happen in other environments like Kubernetes.
The second, more subtle detail is that the initialization only happens the first time the container runs. If there's already content in a volume that you mount into a container, it will hide whatever the image had at that path. In other SO questions you can see this manifest as, for example, "I added a package to my Node package.json file, but when I put the node_modules directory in a volume, it ignores the update" or "I'm using a volume to export content to an nginx proxy but it doesn't update".
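As a rough sketch of that second detail (the image tags and path are hypothetical):
docker run --rm -v nodemods:/app/node_modules my-app:v1 ls /app/node_modules    # volume seeded from v1
docker run --rm -v nodemods:/app/node_modules my-app:v2 ls /app/node_modules    # still shows v1's contents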
I think @BMitch having the accepted answer is correct, but I will just try to add in some details with the hope of being useful.
Is this accurate? If yes, then:
Given it is my claim being scrutinised - I totally defer to @BMitch here :)!
However I would also add:
https://github.com/docker/compose/issues/4581#issuecomment-389559090
Provides a layman explanation of how named volumes / host volumes behave
My explanation needs to be updated to reflect the notion of 'initialization'
https://stackoverflow.com/a/40030535/3080207
This is how I would recommend setting up volumes in docker-compose at the moment, courtesy of @kaiser
how does one get the container's "new data" (e.g. a growing database) into the host when using a bind mount (to persist the data in case of a container restart)?
Both host volumes and named volumes can achieve this.
I think the point of contention is what you want to happen on the:
first run of the container
subsequent runs of the container and
the location/accessibility of the volume on the host system.
Once a volume is attached to a container (be it a named volume or bind mount), whatever is stored to that volume should be persisted between restarts - that effectively comes for free. This assumes the same docker-compose config, and no manual removal of volumes.
Previously it was a bit limiting to use a named volume, as you couldn't tail logs or edit code directly from the host as easily as you could with a bind mount, but it seems that problem is resolved / has a workaround now.
Bind mounts are able to persist data between restarts. I personally find that bind mounts do what I want 99% of the time; that being said, named volumes can now 'do it all' and I'd be using those moving forward.
There are differences between them though, and I'm sure they'll still bite people occasionally, requiring them to reach out to actual experts, instead of users like me :).

How can I share data between host and container using mounts

I've been attempting to share data between my host and my container. I've been reading a lot about volumes and I believe I have misunderstood some of the fundamentals around sharing data.
Here's how I've been doing it (with Docker Compose)
version: "2"
services:
  my-server:
    volumes:
      - type: bind
        source: ./test/
        target: /var/logs
The problem with this approach is that the initial creation of the mount destroys any data in the target folder. So for example if my image was built from another image that had some logs in that folder (for whatever reason), the logs would be destroyed.
This is a major problem with my use case. I need to mount a volume (a folder, basically) so that I can share data between my host and guest, similar to how a shared folder with a VM would work.
I've looked into named volumes but from what I understand, named and anonymous volumes are designed to share data between containers, and not to share data with the host (which is what I need for my use case).
So besides bind mounts, is it possible to share data between the host and container?
This is not really a Docker problem. I think you'll run into this with any mount. Basically you are already using the correct mechanism for sharing data between the host and your container.
When you mount something in Linux, the mount target (i.e. the path at which you mount something) is always replaced with the root of whatever you mount. It does not merge the contents of the mount target with the contents of the (in this case) bind-mounted directory. I'm surprised that works with VM shared folders, because you run a high risk of a collision, e.g. the same file in both locations. How would it resolve that? File system mounts are not the same as a Dropbox-like synchronisation of files between two locations.
I suggest that you do your bind mount to somewhere else in your container which has no contents, and then modify your in-container workflow to handle this. In your example it sounds like you are attempting to collect logs. It also sounds like the container's configured log directory might have some contents which you want copied to the host. You could achieve this by having your container init itself by configuring a new log directory before starting your services/running anything, and copying any existing logs to that location. This new location would be the bind mount. Your init script could also detect if the bind mount was already used in this fashion and not sync over the data. This is really an application-specific problem.
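As a sketch of that init idea (the paths and the marker file are assumptions, not something Docker provides):
#!/bin/sh
# /data/logs is the (initially empty) bind mount; /var/logs is what the image ships with
if [ ! -f /data/logs/.synced ]; then
    cp -a /var/logs/. /data/logs/
    touch /data/logs/.synced     # don't sync over the data on later starts
fi
# reconfigure the service to log to /data/logs, then start it
exec "$@"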

Mount non-existing host directory into non-root container

Let's say I have a container running as a non-root user and I want to bind-mount a volume directory from the host into that container. The container will then write to that directory. Say the directory on the host is /tmp/container/data. If that path does not exist on the host, I observe that it gets created (by Docker) with ownership root. As a consequence the container is not able to write anything into that directory (access denied) because my container is not running as user root.
Of course I can take care of creating the /tmp/container/data directory with the correct permissions on the host side before starting the container, but this solution obviously does not scale: I would have to do it for each and every container where I want to use a bind volume from the host whose directory does not exist.
So my question is: what's the best way to use bind mounts from the host for directories that do not yet exist, while still letting a non-root container have write access to the volume?
You accurately described the normal behavior of Docker: a non-existent bind-mount path will be initialized by the Docker engine as an empty directory owned by root. Note that this doesn't happen with swarm mode, which will fail to schedule the container on that host instead.
Options to use to avoid this include:
Using named volumes. These get initialized to the directory permissions in the image at that location. This is as easy as changing the full path on the host to a short name of the volume.
Run the container as root, and make the entrypoint fix the permissions and drop to the user before launching the application (a sketch follows this list). Something similar to this is done in a jenkins-docker project I threw out on GitHub recently.
Include a script in the container with permissions setuid-root which performs the chown of the directory.
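For the second option, a minimal entrypoint sketch (the user name and path are assumptions, and gosu or su-exec has to be installed in the image):
#!/bin/sh
# Runs as root: fix ownership of the bind mount, then drop privileges for the app
chown -R appuser:appuser /some-container/path
exec gosu appuser "$@"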
