Sharing RO data between Docker containers without copying it? - docker

I have one Docker image data-image which contains about 80 GB of data at /data. I would like a container from another image runner-image to have read-only access to this data. The data is never modified.
The recommended way to share data between containers is with a volume. However per the docs:
If you start a container which creates a new volume, and the container has files or directories in the directory to be mounted such as /app/, the directory’s contents are copied into the volume.
When I start a container which creates the volume:
docker run --volume data-volume:/data:ro data-image
the Docker daemon grinds to a halt because it is copying all of the data into the volume.
Once it eventually finishes, I can indeed see the data from the runner image:
docker run --volume data-volume:/data:ro runner-image ls /data
This shows the data as expected, but the initial copy is slow and expensive. How can I avoid that?
(data-image is the image holding the data, runner-image is the image which would like access to the data, data-volume is the name of the volume created here - but maybe volumes are not the best choice here.)
Is it possible for one container to share read-only data with another, without first copying it? Thanks for any help!

Related

How does volume mount from container to host and vice versa work?

docker run -ti --rm -v DataVolume3:/var ubuntu
Let's say I have a volume DataVolume3 which pulls in the contents of /var in the ubuntu container.
even after killing this ubuntu container the volume remains and I can use this volume DataVolume3 to mount it to other containers.
This means with the deletion of container the volume mounts are not deleted.
How does this work?
Does that volume mount mean that it copies the contents of /var into some local directory? It does not look like a symbolic link.
If I have the container running and I create a file in the container, does the same file get copied to the host path?
How does this whole process of volume mounting from container to host and host to container work?
Volumes are used for persistent storage, and they persist independently of the lifecycle of the container.
We can go through a demo to understand it clearly.
First, let's create a container using the named volumes approach as:
docker run -ti --rm -v DataVolume3:/var ubuntu
This will create a docker volume named DataVolume3 and it can be viewed in the output of docker volume ls:
docker volume ls
DRIVER VOLUME NAME
local DataVolume3
Docker stores the information about these named volumes in the directory /var/lib/docker/volumes/ (*):
ls /var/lib/docker/volumes/
1617af4bce3a647a0b93ed980d64d97746878564b141f30b6110d0818bf32b76 DataVolume3
Next, let's write some data from the ubuntu container at the mounted path /var:
root@2b67a89a0050:/# echo "hello" > /var/file1
root@2b67a89a0050:/# cat /var/file1
hello
We can see this data with cat even after deleting the container:
cat /var/lib/docker/volumes/DataVolume3/_data/file1
hello
Note: Although we can access the volume's data on the host as shown above, doing so is not a recommended practice.
The next time another container uses the same volume, the data from the volume is mounted at the container directory specified in the -v flag.
(*) The location may vary by OS, as pointed out by David; the docker volume inspect command should show it.
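Rather than assuming /var/lib/docker/volumes, the mount point can be read from docker volume inspect. A minimal sketch follows; the helper name volume_mountpoint is made up for illustration, and DataVolume3 is the volume from the demo above.

```shell
# Hypothetical helper: print the path Docker reports for a named volume.
# On Docker Desktop (macOS/Windows) that path is inside the Docker VM,
# so it may not be reachable from the host shell.
volume_mountpoint() {
    docker volume inspect --format '{{ .Mountpoint }}' "$1"
}

# Usage (requires a running Docker daemon):
#   volume_mountpoint DataVolume3
```

On a default Linux install this would typically print the _data directory used in the cat example above.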
Docker has a concept of a named volume. By default the storage for this lives somewhere on your host system and you can't directly access it from outside Docker (*). A named volume has its own lifecycle, it can be independently docker volume rm'd, and if you start another container mounting the same volume, it will have the same persistent content.
The docker run -v option takes some unit of storage, either a named volume or a specific host directory, and mounts it (as in the mount(8) command) in a specific place in the container filesystem. This will hide what was originally in the image and replace it with the volume content.
As you note, if the thing you mount is an empty named volume, it will get populated from the image content at container initialization time. There are some really important caveats on this functionality:
Named volume initialization happens only if the volume is totally empty.
The contents of the named volume never automatically update.
If the volume isn't empty, the volume contents completely replace what's in the image, even if the image content has changed since the volume was first populated.
The initialization happens only on native Docker, and not for example in Kubernetes.
The initialization happens only on named volumes, and not for bind-mounted host directories.
With all of these caveats, I'd avoid relying on this functionality.
If you need to mount a volume into a container, assume it will be empty when your entrypoint or the main container command starts. If you need a particular directory layout or file structure there, an entrypoint script can create it; if you're expecting it to hold particular data, keep a copy of it somewhere else in your image and copy it in if it's not already there (or, perhaps, always).
(*) On native Linux you can find a filesystem location for it, but accessing this isn't a best practice. On other OSes this will be hidden inside a virtual machine or other opaque storage. If you need to directly access the data (or inject config files, or read log files) a docker run -v /host/path:/container/path bind mount is a better choice.
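The "copy it in if it's not already there" advice could be sketched as an entrypoint helper like the one below. The function name seed_if_empty and the paths /opt/seed-data and /data are assumptions for illustration, not anything Docker provides.

```shell
#!/bin/sh
# Hypothetical entrypoint helper: seed a mounted directory from a copy
# of the data kept elsewhere in the image, but only when it is empty.
seed_if_empty() {
    src="$1"   # copy of the data baked into the image (assumed path)
    dst="$2"   # mount point that may be an empty volume (assumed path)
    mkdir -p "$dst"
    # ls -A also lists dotfiles, so any existing entry counts as "not empty"
    if [ -z "$(ls -A "$dst")" ]; then
        # "src/." copies the directory's contents rather than the directory itself
        cp -a "$src/." "$dst/"
    fi
}

# In an entrypoint script this would be followed by the main command:
#   seed_if_empty /opt/seed-data /data
#   exec "$@"
```

Because the check runs at container start rather than at volume creation, it behaves the same for named volumes, bind mounts, and Kubernetes volumes.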
Volumes are part of neither the container nor the host. Well, technically everything resides on the host machine, but the Docker directories are normally accessible only to root, and the files in these directories are managed separately by Docker.
"Volumes are stored in a part of the host filesystem which is managed by Docker (/var/lib/docker/volumes/ on Linux)."
Hence a volume acts as storage shared between the container and the host. Anything written on either side lands in the volume directory (under /var/lib/docker/volumes). Nothing is copied back and forth; the directory is mounted into the container, so both sides see the same files.
As volumes can be shared across different containers, deleting a container does not cascade to the volumes associated with it.
To remove unused volumes:
docker volume prune
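Before pruning, it can help to see which volumes are no longer referenced by any container. A small sketch; the helper name list_unused_volumes is hypothetical.

```shell
# Hypothetical helper: print the names of volumes that no container references.
list_unused_volumes() {
    docker volume ls --quiet --filter dangling=true
}

# Usage (requires a running Docker daemon):
#   list_unused_volumes
```

In recent Docker versions, docker volume prune removes only anonymous volumes unless --all is given, so a named volume such as DataVolume3 may show up in this list and still survive a plain prune.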

Docker volume from existent container

I'm kind of stuck,
I have a docker container that is running, and that container runs some elasticsearch inside.
But I forgot to use volume on the first deploy. So my container has lots of data inside, in a single folder in /app/data.
I would like to use the same container but use volume on that folder, without losing data inside...
So it will be possible to rebuild other containers to use the same volume.
Have you some tips to share?
The important thing is not to remove your container, or you'll lose all that data. I think docker cp is your friend here. Copy the data to the host, then start another container with a volume.
Once you've secured your data, you can stop and remove the first container.
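The copy step might look like the sketch below. The helper name copy_out, the container name my-es, the image name my-es-image, and the host directory ./es-data are all placeholders; only /app/data comes from the question.

```shell
# Hypothetical helper: copy a directory out of a container onto the host.
# docker cp also works on stopped containers, so the container can be
# stopped first to avoid copying files that are still being written.
copy_out() {
    container="$1"; src="$2"; dest="$3"
    mkdir -p "$dest"
    # The trailing "/." copies the directory's contents into $dest
    docker cp "$container:$src/." "$dest"
}

# Usage (names are placeholders):
#   docker stop my-es
#   copy_out my-es /app/data ./es-data
#   docker run -d -v "$PWD/es-data:/app/data" my-es-image
```

Here the new container bind-mounts the host copy; a named volume could be used instead, with the data copied into it from a helper container.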

In docker, can I publish a volume with initial data?

I want to share a file storage between two containers. From the documentation, I've seen that you can create and use volumes like this:
docker volume create --name DataVolume1
docker run -ti --rm -v DataVolume1:/datavolume1 ubuntu
However, I want containers to be able to access an initial set of shared data. Does docker support publishing of volumes? If not, does this mean I should write the initial data manually, after creating the volume, or is there another solution for publishing the data along with the images?
With a named volume (not with a host volume, aka bind mount) docker will initialize an empty named volume to the contents of the image at the location you mount it. So if you have files in your image at /datavolume1, and DataVolume1 is empty, docker will copy those files into the named volume.
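One way to rely on this behavior is to run a throwaway container from the image that holds the data, so the empty volume gets populated before other containers mount it. A sketch under stated assumptions: the helper name seed_volume and the image name my-data-image are hypothetical, and the image is assumed to contain a true command.

```shell
# Hypothetical helper: start and immediately exit a container so that Docker
# initializes an empty named volume from the image content at the mount path.
# Has no effect on the volume's contents if the volume is not empty.
seed_volume() {
    image="$1"; volume="$2"; path="$3"
    docker run --rm --volume "$volume:$path" "$image" true
}

# Usage (image name is a placeholder):
#   seed_volume my-data-image DataVolume1 /datavolume1
#   docker run -ti --rm -v DataVolume1:/datavolume1 ubuntu
```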

When to use auto mapped Docker data volume

What is the main purpose of a Docker data volume created by the -v option without a specified host path? For example, docker run -v /data -ti my-image. The docs say it creates a new filesystem mapped to the host filesystem to persist data (at some random-ish location). I understand that. But containers also persist all data when they are stopped and started again. So what is the difference between persisted data in a stopped container and a data volume?
I understand use-case for its advanced usage to map specific host file with -v /data:/data/host.
Off the top of my head:
If you are planning on using docker commit at some point, then an ephemeral volume like that can be used to intentionally prevent some contents from getting committed to the new filesystem image (because the contents of volumes are not preserved as part of the image).
If you will be generating a lot of temporary data and you are worried about filling up the root container filesystem, using a volume will give you more space (because your data won't be sharing space with operating system files).
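To see where such an automatically created volume ended up, the container's mounts can be inspected. A minimal sketch; the helper name container_mounts and the container name scratch are made up, and my-image is the image from the question.

```shell
# Hypothetical helper: print a container's mounts as JSON, including the
# generated name and host path of any anonymous volume.
container_mounts() {
    docker inspect --format '{{ json .Mounts }}' "$1"
}

# Usage (requires a running Docker daemon):
#   docker run -d --name scratch -v /data my-image
#   container_mounts scratch
```

Note that an anonymous volume outlives its container unless the container is removed with docker rm -v or was started with --rm.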

Docker base image filesystem

Docker can't modify the base image's filesystem, and it doesn't copy it either. How does it store the changes made while a container is in use? I see that it stores files under /var/lib/docker, but how can it store filesystem changes without modifying the image? What is the methodology?
It stores changes in a new writable filesystem layer on top of the image, using a copy-on-write mechanism:
Those changes disappear after a docker rm (unless you docker commit right after a docker stop).
If you want some persistence, you would need to use a volume or use a data volume container.
When doing a docker run, you can mount a volume from your host or mount one from a data container.
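The two options might be sketched as follows. The helper names, the container name datastore, and the paths are placeholders for illustration.

```shell
# Hypothetical helper: run an image with a host directory bind-mounted in.
run_with_host_dir() {
    host_dir="$1"; container_dir="$2"; image="$3"
    docker run --rm --volume "$host_dir:$container_dir" "$image"
}

# Hypothetical helper: run an image with all volumes of a data container attached.
run_with_data_container() {
    data_container="$1"; image="$2"
    docker run --rm --volumes-from "$data_container" "$image"
}

# Usage (names and paths are placeholders):
#   run_with_host_dir /srv/data /data ubuntu
#   docker create --volume /data --name datastore busybox
#   run_with_data_container datastore ubuntu
```

The data container only needs to be created, not running, for --volumes-from to work.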
