What is the purpose of Dockerfile command "Volume"? - docker

When a Dockerfile contains VOLUME instruction (say) VOLUME [/opt/apache2/www, ...] (hope this path exists in real installation), it means this path is going to be mounted to something (right?). And this VOLUME instruction is for the image and not for one instance of it (container) but for every instance.
Anyway irrespective of whether an image contains a VOLUME defined or not, at the time of starting a container the run command can create a volume by mapping a local host path to a container path.
docker run --name understanding_volumes -v /localhost/path1:/opt/apache2/www -v /localhost/path2:/any/container/path image_name
The above should make it clear that though /any/container/path is not defined as a VOLUME in Dockerfile, we are able to mount it while running container.
That said, this SOF question throws some light on it - What is the purpose of defining VOLUME mount points within DockerFile rather than adhoc cmd-line -v?. Here one benefit of VOLUME instruction is mentioned. Which is, other containers can benefit from it. Using the --from-container (could not find this option for docker run --help, not sure if the answer meant --volumes-from) Anyway thus the mount point is accessible to other container in some kind of automatic way. Great.
My first question is, is the other volume path /any/container/path image_name mounted on to the container understanding_volumes also available to the second container using --from-container or --volumes-from (whichever option is correct)?
My next question is, is the use of VOLUME instruction just to let the other containers link to this path --> that is to make the data on /opt/apache2/www available to other containers through easy linking. So it's just sharing out. Or is there any data that can be made available to first container too.

Defining a volume in a Dockerfile has the advantage of specifying the volume location inside the image definition as documentation from the image creator to the user of the image. That's just about the only upside.
It was added to docker very early on, quite possibly when data containers were the only way to persist data. We now have a solution for named volumes that has obsoleted data containers. We have also added the compose file to define how containers are run in an easy to understand and reuse syntax.
While there is the one upside of self documented images, there are quite a few downsides, to the point that I strongly recommend against defining a volume inside the image to my clients and anyone publishing images for general reuse:
The volume is forced on the end user, there's no way to undefine a volume in the image.
If the volume is not defined at runtime (with a -v or compose file), the user will see anonymous volumes in their docker volume ls that have no association to what created them. These are almost always useless wastes of disk space.
They break the ability to extend the image since any changes to a volume in an image after the VOLUME line are typically ignored by docker. This means a user can never add their own initial volume data, which is very confusing because docker gives no warning that it is ignoring the user changes during the image build.
If you need to have a volume as a user a runtime, you can always define it with a -v or compose file, even if that volume is not defined in the Dockerfile. Many users have the misconception that you must define it in the image to be able to make it a named volume at runtime.
The ability to use --volumes-from is unaffected by defining the volume in the image, but I'd encourage you to avoid this capability. It does not exist in swarm mode, and you can get all the same capabilities along with more granularity by using a named volume that you mount in two containers.

Related

Docker force usage of volume

I know that you can specify a volume inside the dockerfile, but I see the problem that the user is not required to create such a volume.
What if he forgot to specify a volume and than there are many, possibly expensive to create, files saved there, but they are not persistent, because there is no volume specified?
So my question is if it is possible to force the user to create a volume for that mountpoint, or at least check at start time (inside the container) if there is a volume mounted, so that it can react to the missing volume?
EDIT: With the new information that there are automatic created unnnamed volumes I would also accept a user-side solution (not changing the container in such a ways that he checks the volume, but a docker-deamon settings which warn/prevents me from creating unnamed volumes by mistake.
I think the VOLUME declaration is the best you can do here.
In general, a container cannot force itself to be run with any particular options. You could make a similar argument that a container "must" be run with published port or with an attached stdin to be useful, but Docker doesn't allow an image to force these on either. (And more importantly, an image can't require direct access to the host filesystem, host networking, or privileged mode.)
As #masseyb notes in a comment, the key effect of the Dockerfile VOLUME directive is to create a new anonymous volume on the given directory if nothing else is mounted there. docker volume ls will show it and you should be able to use the volume ID directly in docker run -v options, so you won't actually lose data here. (There doesn't seem to be a command to give a name to the volume, surprisingly.)
In principle it's possible to check some things in an entrypoint wrapper script, but that won't work well for this volume case. The container can't tell whether a directory is an automatically-created anonymous volume or a new empty named volume.
(Also remember that volumes, including automatically-created anonymous volumes, are never committed to images. In your Dockerfile you can't change the directory content after you declare it a VOLUME; if an end user tries to docker commit a derived image it won't include the volume data. Unless you're sure it's what you want, I usually advise against declaring VOLUME. The case you describe in the question is pretty much the one case where it's useful.)

What is the actual advantage of declaring a VOLUME in a Dockerfile? [duplicate]

First of all, I want to make it clear I've done due diligence in researching this topic. Very closely related is this SO question, which doesn't really address my confusion.
I understand that when VOLUME is specified in a Dockerfile, this instructs Docker to create an unnamed volume for the duration of the container which is mapped to the specified directory inside of it. For example:
# Dockerfile
VOLUME ["/foo"]
This would create a volume to contain any data stored in /foo inside the container. The volume (when viewed via docker volume ls) would show up as a random jumble of numbers.
Each time you do docker run, this volume is not reused. This is the key point causing confusion here. To me, the goal of a volume is to contain state persistent across all instances of an image (all containers started from it). So basically if I do this, without explicit volume mappings:
#!/usr/bin/env bash
# Run container for the first time
docker run -t foo
# Kill the container and re-run it again. Note that the previous
# volume would now contain data because services running in `foo`
# would have written data to that volume.
docker container stop foo
docker container rm foo
# Run container a second time
docker run -t foo
I expect the unnamed volume to be reused between the 2 run commands. However, this is not the case. Because I did not explicitly map a volume via the -v option, a new volume is created for each run.
Here's important part number 2: Since I'm required to explicitly specify -v to share persistent state between run commands, why would I ever specify VOLUME in my Dockerfile? Without VOLUME, I can do this (using the previous example):
#!/usr/bin/env bash
# Create a volume for state persistence
docker volume create foo_data
# Run container for the first time
docker run -t -v foo_data:/foo foo
# Kill the container and re-run it again. Note that the previous
# volume would now contain data because services running in `foo`
# would have written data to that volume.
docker container stop foo
docker container rm foo
# Run container a second time
docker run -t -v foo_data:/foo foo
Now, truly, the second container will have data mounted to /foo that was there from the previous instance. I can do this without VOLUME in my Dockerfile. From the command line, I can turn any directory inside the container into a mount to either a bound directory on the host or a volume in Docker.
So my question is: What is the point of VOLUME when you have to explicitly map named volumes to containers via commands on the host anyway? Either I'm missing something or this is just confusing and obfuscated.
Note that all of my assertions here are based on my observations of how docker behaves, as well as what I've gathered from the documentation.
Instructions like VOLUME and EXPOSE are a bit anachronistic. Named volumes as we know them today were introduced in Docker 1.9, almost three years ago.
Before Docker 1.9, running a container whose image had one or more VOLUME instructions (or using the --volume option) was the only way to create volumes for data sharing or persistence. In fact, it used to be a best practice to create data-only containers whose sole purpose was to hold one or more volumes, and then share those volumes with your application containers using the --volumes-from option. Here's some articles that describe this outdated pattern.
Docker Data Containers
Why Docker Data Containers (Volumes!) are Good
Also, check out moby/moby#17798 (Data-only containers obsolete with docker 1.9.0?) where the change from data-only containers to named volumes was discussed.
Today, I consider the VOLUME instruction as an advanced tool that should only be used for specialized cases, and after careful thought. For example, the official postgres image declares a VOLUME at /var/lib/postgresql/data. This can improve the performance of postgres containers out of the box by keeping the database data out of the layered filesystem. Docker doesn't have to search through all the layers of the container image for file requests at /var/lib/postgresql/data.
However, the VOLUME instruction does come at a cost.
Users might not be aware of the unnamed volumes being created, and continuing to take up storage space on their Docker host after containers are removed.
There is no way to remove a volume declared in a Dockerfile. Downstream images cannot add data to paths where volumes exist.
The latter issue results in problems like these.
How to “undeclare” volumes in docker image?
GitLab on Docker: how to persist user data between deployments?
For the GitLab question, someone wants to extend the GitLab image with pre-configured data for testing purposes, but it's impossible to commit that data in a downstream image because of the VOLUME at /var/opt/gitlab in the parent image.
tl;dr: VOLUME was designed for a world before Docker 1.9. Best to just leave it out.

Creating a volume in a dockerfile without a persistent volume (claim) in Kubernetes?

I have an application that I am converting into a docker container.
I am going to test some different configuration for the application regarding persisted vs non persisted storage.
E.g. in one scenario I am going to create a persisted volume and mount some data into that volume.
In another scenario I am going to test not having any persisted volume (and accept that any date generated while the container is running is gone when its stopped/restarted).
Regarding the first scenario that works fine. But when I am testing the second scenario - no persisted storage - I am not quite sure what to do on the docker side.
Basically does it make any sense do define a volume in my Dockerfile when I don't plan to have any persisted volumes in kubernetes?
E.g. here is the end of my Dockerfile
...
ENTRYPOINT ["./bin/run.sh"]
VOLUME /opt/application-x/data
So does it make any sense at all to have the last line when I don't create and kubernetes volumes?
Or to put it in another way, are there scenarios where creating a volume in a dockerfile makes sense even though no corresponding persistent volumes are created?
It usually doesn’t make sense to define a VOLUME in your Dockerfile.
You can use the docker run -v option or Kubernetes’s container volume mount setting on any directory in the container filesystem space, regardless of whether or not its image originally declared it as a VOLUME. Conversely, a VOLUME can leak anonymous volumes in an iterative development sequence, and breaks RUN commands later in the Dockerfile.
In the scenario you describe, if you don’t have a VOLUME, everything is straightforward: if you mount something on to that path, in either plain Docker or Kubernetes, storage uses the mounted volume, and if not, data stays in the container filesystem and is lost when the container exits (which you want). I think if you do have a VOLUME then the container runtime will automatically create an anonymous volume for you; the overall behavior will be similar (it’s hard for other containers to find/use the anonymous volume) but in plain Docker at least you need to remember to clean it up.

What is the practical purpose of VOLUME in Dockerfile?

First of all, I want to make it clear I've done due diligence in researching this topic. Very closely related is this SO question, which doesn't really address my confusion.
I understand that when VOLUME is specified in a Dockerfile, this instructs Docker to create an unnamed volume for the duration of the container which is mapped to the specified directory inside of it. For example:
# Dockerfile
VOLUME ["/foo"]
This would create a volume to contain any data stored in /foo inside the container. The volume (when viewed via docker volume ls) would show up as a random jumble of numbers.
Each time you do docker run, this volume is not reused. This is the key point causing confusion here. To me, the goal of a volume is to contain state persistent across all instances of an image (all containers started from it). So basically if I do this, without explicit volume mappings:
#!/usr/bin/env bash
# Run container for the first time
docker run -t foo
# Kill the container and re-run it again. Note that the previous
# volume would now contain data because services running in `foo`
# would have written data to that volume.
docker container stop foo
docker container rm foo
# Run container a second time
docker run -t foo
I expect the unnamed volume to be reused between the 2 run commands. However, this is not the case. Because I did not explicitly map a volume via the -v option, a new volume is created for each run.
Here's important part number 2: Since I'm required to explicitly specify -v to share persistent state between run commands, why would I ever specify VOLUME in my Dockerfile? Without VOLUME, I can do this (using the previous example):
#!/usr/bin/env bash
# Create a volume for state persistence
docker volume create foo_data
# Run container for the first time
docker run -t -v foo_data:/foo foo
# Kill the container and re-run it again. Note that the previous
# volume would now contain data because services running in `foo`
# would have written data to that volume.
docker container stop foo
docker container rm foo
# Run container a second time
docker run -t -v foo_data:/foo foo
Now, truly, the second container will have data mounted to /foo that was there from the previous instance. I can do this without VOLUME in my Dockerfile. From the command line, I can turn any directory inside the container into a mount to either a bound directory on the host or a volume in Docker.
So my question is: What is the point of VOLUME when you have to explicitly map named volumes to containers via commands on the host anyway? Either I'm missing something or this is just confusing and obfuscated.
Note that all of my assertions here are based on my observations of how docker behaves, as well as what I've gathered from the documentation.
Instructions like VOLUME and EXPOSE are a bit anachronistic. Named volumes as we know them today were introduced in Docker 1.9, almost three years ago.
Before Docker 1.9, running a container whose image had one or more VOLUME instructions (or using the --volume option) was the only way to create volumes for data sharing or persistence. In fact, it used to be a best practice to create data-only containers whose sole purpose was to hold one or more volumes, and then share those volumes with your application containers using the --volumes-from option. Here's some articles that describe this outdated pattern.
Docker Data Containers
Why Docker Data Containers (Volumes!) are Good
Also, check out moby/moby#17798 (Data-only containers obsolete with docker 1.9.0?) where the change from data-only containers to named volumes was discussed.
Today, I consider the VOLUME instruction as an advanced tool that should only be used for specialized cases, and after careful thought. For example, the official postgres image declares a VOLUME at /var/lib/postgresql/data. This can improve the performance of postgres containers out of the box by keeping the database data out of the layered filesystem. Docker doesn't have to search through all the layers of the container image for file requests at /var/lib/postgresql/data.
However, the VOLUME instruction does come at a cost.
Users might not be aware of the unnamed volumes being created, and continuing to take up storage space on their Docker host after containers are removed.
There is no way to remove a volume declared in a Dockerfile. Downstream images cannot add data to paths where volumes exist.
The latter issue results in problems like these.
How to “undeclare” volumes in docker image?
GitLab on Docker: how to persist user data between deployments?
For the GitLab question, someone wants to extend the GitLab image with pre-configured data for testing purposes, but it's impossible to commit that data in a downstream image because of the VOLUME at /var/opt/gitlab in the parent image.
tl;dr: VOLUME was designed for a world before Docker 1.9. Best to just leave it out.

Deploy web app in docker data container vs volume

I'm confused about common consensus that one shouldn't use data containers. I have specific use case that I want to accomplish.
I want to have docker nginx container and behind it some other container with application. To run newest version of my app I want to download ready container from my private docker registry. The application is for now purely static html, javascript something.
So my plan is to create docker image which will hold the files, and will specify a named volume in some /webapp folder. The nginx container will serve this volume. I do not see any other way how to move bunch of files to remote system the "docker containerized" way. Am I not actually creating cursed data container?
Anyway what happens during app containers exchange? When I stop the app container the volume remains accesible, as it is placed on host. When I pull and start new version of app container. The volume will be created again and prefiled with image files stored at the same location, replacing the content on host so the nginx container will server from now new version of the application.Right? What happens when I will reference volume that does not exist yet from the nginx container.
It seem that named values are not automatically filed with the content of the image. As well I'm not sure how to create named volume in docker file as this syntax taken from here doesn't work
FROM training/webapp
VOLUME webapp:/webapp
I think you might want what i have described here https://stackoverflow.com/a/41576040/3625317
The problem with volumes is, that when a container is recreated, not docker-compose down but rather docker-compose pull + up, the new container will not have your "new code stored in the volume" but rather, due to the recycled volume, still the old anon volume. The point is, you will need a anon-volume for the code anyway, since you want it redeployable, not a named volume since you want the code to be exchangeable.
On re-create the anon-volume is not removed, that said, lets say you have the image:v1 right now and you pull image:v2 and then do a docker-compose up. It will recreate your container based on image:v2 - when this finished, you will have a new container, but the code is still from the old container, which was based on image:v1, since the anon-volume has not been replaced, it was re-assigned. docker-compose down && docker-compose up will resolve that for you - but you have to keep this in mind when dealing with your idea. (down removes anon-volumes)
In general, there is a pro / con, see my other post.
Data-containers in general have a other meaning and have been replaced by so called named volumes. Data-containers have been used to establish a volume-mount which is "named" and not based on a anon-volume.
In the past, you had to create a container with a volume, and later use a container-name based mount of this volume ( the container would be the static / name part ), today, you just create a named volume name and mount by this volume-name, no need for a busybox killed after start based container-name based volume mount.

Resources