I am looking at Managing Data in Containers. There are two ways to manage data in Docker.
Data Volumes, and
Data Volume Containers
https://docs.docker.com/userguide/dockervolumes/
My question is: What are the pros and cons of these two methods?
I wouldn't think of them as different methods.
Volumes are the mechanism to bypass the Union File System, thereby allowing data to be easily shared with other containers and the host. Data-containers simply wrap a volume (or volumes) to provide a handy name which you can use in --volumes-from statements to share data between containers. You can't have data-containers without data volumes.
There are basically three ways you can manage data within a container and it would perhaps be best to outline and provide some case-by-case examples as to when and why you would use these.
First, you have the option to use the Union File System. Each running container has an associated writable layer provided by the UFS, so if I run a container based on an image of my choice, the writes I perform during that session can be committed back to the image and persisted, becoming a permanent part of it. So if you have a Debian image and run apt-get update && apt-get install -y python, you can commit the result back to the image, share it with others and save everyone the time needed for all those network requests to get an up-to-date container with Python pre-installed.
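A minimal sketch of that commit workflow, assuming a local Docker daemon. The function name, the container name and the target tag are made up for illustration; the package name is the one from the example above (recent Debian releases call it python3 instead).

```shell
# Sketch: install Python in a Debian container, then commit the
# container's writable layer as a new image. All names are placeholders.
bake_python_image() {
  name="$1"      # temporary container name, e.g. pybuild
  target="$2"    # tag for the resulting image, e.g. my/debian-python

  # Run the install in a fresh container; the writes land in its UFS layer.
  docker run --name "$name" debian \
    sh -c 'apt-get update && apt-get install -y python' || return 1

  # Persist that writable layer as a new image.
  docker commit "$name" "$target" || return 1

  # The temporary container is no longer needed.
  docker rm "$name"
}
```

Calling bake_python_image pybuild my/debian-python would leave an image that can be pushed and shared like any other.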
Secondly, you can use volumes. When the container runs, writes to the directories that are designated as volumes are kept separate from the UFS and remain associated with the container. As long as the associated container exists, so does the volume. Say you had a container whose entry point is a process that produces logs at /var/logs/myapp. Without volumes, the data written by the process could inadvertently be committed back to the image, needlessly adding to its size. With a volume, as long as the container exists, you can still access the logs and inspect what happened if the process crashes and brings down the container. By its very nature, data stored in volumes associated with such containers is meant to be transient: discard the container and the data generated by the process is gone. If the container's image is updated and you're dealt a new one, or you have no need for the generated logs anymore, you can simply remove and recreate the container and effectively flush the generated logs from disk.
While this seems great, what happens with, say, data that's written by a database? Surely it's not something you'd keep as part of the UFS, but you can't simply have it flushed when you update the DB image or switch over from foo/postgresql to bar/postgresql and end up with a new container in each case. Clearly that's unacceptable, and that's where the third option comes in: a persistent, named container with associated volumes, using the full scope of volume capabilities, such as being able to share them with other containers even when the associated container isn't actually running. With this pattern, you can have a dbdata container with /var/lib/postgresql/data configured as a volume. You can then reliably have transient database containers and remove and re-create them freely without losing important data.
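The dbdata pattern could be sketched as follows. The function names and the image and container names in the usage note are placeholders; /var/lib/postgresql/data is the path from the paragraph above.

```shell
# Sketch of the data-container pattern; all names are placeholders.
create_data_container() {
  # $1 = data container name, $2 = image to base it on
  # "docker create" makes the container without starting it; it only
  # needs to exist for its volume to be shareable.
  docker create -v /var/lib/postgresql/data --name "$1" "$2" /bin/true
}

run_db_with_data() {
  # $1 = data container, $2 = database image, $3 = name for the DB container
  # --volumes-from mounts every volume of the data container at the same path.
  docker run -d --volumes-from "$1" --name "$3" "$2"
}
```

For example, create_data_container dbdata postgres followed by run_db_with_data dbdata postgres db1. The db1 container can later be removed and replaced by one from a different image while dbdata, and the volume it holds, stays in place.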
So, to recap some facts about volumes:
Volumes are associated with containers
Writing to volume directories writes directly to the volume itself, bypassing the UFS
This makes it possible to share volumes independently across several containers
Volumes can be destroyed once the last associated container is removed (docker rm -v deletes them together with the container)
If you don't want to lose important data stored in volumes when removing transient containers, associate the volume with a permanent, named container and share it with the non-persisting containers to retain the data
Then, as a general rule of thumb:
Data which you want to become a permanent feature for every container environment should be written to UFS and committed to the image
Data which is generated by the container's managed process should be written to a volume
Data written to a volume which you don't want to accidentally lose if you remove a container should be associated with a named container which you intend to keep, and then shared with other transient containers which can be safely removed afterwards
Data containers offer:
A sensible layer of abstraction for your storage. The files that make up your volume are stored and managed for you by Docker.
Data containers are very handy to share using the special "--volumes-from" directive.
The data is automatically cleaned up by deleting the data container.
Projects like Flocker demonstrate how the storage associated with a container may eventually be shared between Docker hosts.
Can use the "docker cp" command to pull files out of the container onto the host.
Data volume mappings offer:
Simpler to understand
Explicit and direct control over where the data is stored.
I have experienced file ownership and permission issues. For example Docker processes running as root within a container create files owned by root on the file-system. (Yes, I understand that data volumes store their data the same way, but using the "docker cp" command pulls out the files owned by me :-))
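The docker cp route mentioned above could look roughly like this; the function name, container name and paths are hypothetical.

```shell
# Sketch: copy a path out of a data container onto the host.
# The container does not need to be running.
pull_from_container() {
  # $1 = container, $2 = path inside the container, $3 = host destination
  docker cp "$1:$2" "$3"
}
```

For example, pull_from_container dbdata /var/lib/postgresql/data ./pgdata would copy the directory into ./pgdata on the host.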
In conclusion I think it really boils down to how much control you wish to exert over the underlying storage. Personally I like the abstraction and indirection provided by Docker, for the same reason I like port mapping.
I have an application that I am converting into a docker container.
I am going to test some different configurations for the application regarding persisted vs. non-persisted storage.
E.g. in one scenario I am going to create a persisted volume and mount some data into that volume.
In another scenario I am going to test not having any persisted volume (and accept that any data generated while the container is running is gone when it's stopped/restarted).
Regarding the first scenario that works fine. But when I am testing the second scenario - no persisted storage - I am not quite sure what to do on the docker side.
Basically, does it make any sense to define a volume in my Dockerfile when I don't plan to have any persisted volumes in Kubernetes?
E.g. here is the end of my Dockerfile
...
ENTRYPOINT ["./bin/run.sh"]
VOLUME /opt/application-x/data
So does it make any sense at all to have the last line when I don't create any Kubernetes volumes?
Or to put it in another way, are there scenarios where creating a volume in a dockerfile makes sense even though no corresponding persistent volumes are created?
It usually doesn’t make sense to define a VOLUME in your Dockerfile.
You can use the docker run -v option or Kubernetes’s container volume mount setting on any directory in the container filesystem space, regardless of whether or not its image originally declared it as a VOLUME. Conversely, a VOLUME can leak anonymous volumes in an iterative development sequence, and breaks RUN commands later in the Dockerfile.
In the scenario you describe, if you don’t have a VOLUME, everything is straightforward: if you mount something on to that path, in either plain Docker or Kubernetes, storage uses the mounted volume, and if not, data stays in the container filesystem and is lost when the container exits (which you want). I think if you do have a VOLUME then the container runtime will automatically create an anonymous volume for you; the overall behavior will be similar (it’s hard for other containers to find/use the anonymous volume) but in plain Docker at least you need to remember to clean it up.
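For the plain-Docker cleanup mentioned above, a small sketch (the function names are made up; the docker subcommands and flags are standard ones):

```shell
# List volumes that no container references any more. Leaked anonymous
# volumes show up here with long hexadecimal names.
list_dangling_volumes() {
  docker volume ls -q -f dangling=true
}

# Run a throwaway container. With --rm, Docker removes the container on
# exit together with any anonymous volumes that were created for it.
run_throwaway() {
  docker run --rm "$@"
}
```

Running iterative tests through something like run_throwaway my-image avoids accumulating anonymous volumes in the first place; list_dangling_volumes shows what is already left over.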
I have been reading about Docker, and one of the first things that I read about docker was that it runs images in a read-only manner. This has raised this question in my mind, what happens if I need users to upload files? In that case where would the file go (are they appended to the image)? or in other words, how to handle uploaded files?
Docker containers are meant to be immutable and replaceable - you should be able to stop a container and replace it with a newer version without any ill effects. It's bad practice to store any configuration or operational data inside the container.
The situation you describe with file uploads would typically be resolved with a volume, which mounts a folder from the host filesystem into the container. Any modifications performed by the container to the mounted folder would persist on the host filesystem. When the container is replaced, the folder is re-mounted when the new container is started.
It may be helpful to read up on volumes: https://docs.docker.com/storage/volumes/
Docker containers use file systems similar to their underlying operating system; in your case that appears to be Windows Nano Server (Windows optimized for use in a container).
Any uploads to your container will be placed at the path you provided when uploading the file.
But this data is ephemeral: it persists only as long as that container exists, and is gone once the container is removed or replaced.
To use persistent storage you must provide a volume for your Docker container. You can think of volumes as external disks attached to a container that mount on a path inside the container. This will persist data regardless of container state.
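As a rough sketch of attaching such a volume for uploads, assuming a Linux container; the function name, volume name, upload path and image name are all hypothetical:

```shell
# Sketch: create a named volume and mount it at the application's
# upload directory.
run_with_uploads() {
  # $1 = volume name, $2 = upload path inside the container, $3 = image
  docker volume create "$1" >/dev/null || return 1
  docker run -d -v "$1:$2" "$3"
}
```

For example, run_with_uploads uploads /app/uploads my-web-image. Replacing the container later and mounting the same volume again keeps the uploaded files.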
Let's say you are trying to dockerise a database (couchdb for example).
Then there are at least two assets you consider volumes for:
database files
log files
Let's further say you want to keep the db-files private but want to expose the log-files for later processing.
As far as I understand the documentation, you have two options:
First option
define managed volumes for both, log- and db-files within the db-image
import these in a second container (you will get both) and work with the logs
Second option
create data container with a managed volume for the logs
create the db-image with a managed volume for the db-files only
import logs-volume from data container when running db-image
Two questions:
Are both options really valid/possible?
What is the better way to do it?
br volker
The answer to question 1 is that, yes both are valid and possible.
My answer to question 2 is that I would consider a different approach entirely. Which one to choose depends on whether or not this is a mission-critical system where data loss must be avoided.
Mission critical
If you absolutely cannot lose your data, then I would recommend that you bind mount a reliable disk into your database container. Bind mounting is essentially mounting a part of the Docker Host filesystem into the container.
So taking the database files as an example, you could imagine these steps:
Create a reliable disk e.g. NFS that is backed-up on a regular basis
Attach this disk to your Docker host
Bind mount this disk into my database container which then writes database files to this disk.
So following the above example, let's say I have created a reliable disk that is shared over NFS and mounted on my Docker Host at /reliable/disk. To use that with my database I would run the following Docker command:
docker run -d -v /reliable/disk:/data/db my-database-image
This way I know that the database files are written to reliable storage. Even if I lose my Docker Host, I will still have the database files and can easily recover by running my database container on another host that can access the NFS share.
You can do exactly the same thing for the database logs:
docker run -d -v /reliable/disk/data/db:/data/db -v /reliable/disk/logs/db:/logs/db my-database-image
Additionally you can easily bind mount these volumes into other containers for separate tasks. You may want to consider bind mounting them as read-only into other containers to protect your data:
docker run -d -v /reliable/disk/logs/db:/logs/db:ro my-log-processor
This would be my recommended approach if this is a mission critical system.
Not mission critical
If the system is not mission critical and you can tolerate a higher potential for data loss, then I would look at Docker Volume API which is used precisely for what you want to do: managing and creating volumes for data that should live beyond the lifecycle of a container.
The nice thing about the docker volume command is that it lets you create named volumes, and if you name them well it can be quite obvious to people what they are used for:
docker volume create db-data
docker volume create db-logs
You can then mount these volumes into your container from the command line:
docker run -d -v db-data:/db/data -v db-logs:/logs/db my-database-image
These volumes will survive beyond the lifecycle of your container and are stored on the filesystem of your Docker host. You can use:
docker volume inspect db-data
To find out where the data is being stored and back-up that location if you want to.
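If only the storage location is needed, for instance to feed a backup script, the inspect output can be narrowed with a format template. A small sketch; the function name is made up, and on setups where Docker runs inside a VM the reported path lives inside that VM:

```shell
# Print just the host directory that backs a named volume.
volume_mountpoint() {
  # $1 = volume name, e.g. db-data
  docker volume inspect --format '{{ .Mountpoint }}' "$1"
}
```

For example, volume_mountpoint db-data prints the directory to back up.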
You may also want to look at something like Docker Compose which will allow you to declare all of this in one file and then create your entire environment through a single command.
I have a Python app using a SQLite database (it's a data collector that runs daily by cron). I want to deploy it, probably on AWS or Google Container Engine, using Docker. I see three main steps:
1. Containerize and test the app locally.
2. Deploy and run the app on AWS or GCE.
3. Backup the DB periodically and download back to a local archive.
Recent posts (on Docker, StackOverflow and elsewhere) say that since 1.9, Volumes are now the recommended way to handle persisted data, rather than the "data container" pattern. For future compatibility, I always like to use the preferred, idiomatic method, however Volumes seem to be much more of a challenge than data containers. Am I missing something??
Following the "data container" pattern, I can easily:
Build a base image with all the static program and config files.
From that image create a data container image and copy my DB and backup directory into it (simple COPY in the Dockerfile).
Push both images to Docker Hub.
Pull them down to AWS.
Run the data and base images, using "--volumes-from" to refer to the data.
Using "docker volume create":
I'm unclear how to copy my DB into the volume.
I'm very unclear how to get that volume (containing the DB) up to AWS or GCE... you can't PUSH/PULL a volume.
Am I missing something regarding Volumes?
Is there a good overview of using Volumes to do what I want to do?
Is there a recommended, idiomatic way to backup and download data (either using the data container pattern or volumes) as per my step 3?
When you first use an empty named volume, it receives a copy of the image's data at the mount point where it's first used (unlike a host-based volume, which completely overlays the mount point with the host directory). So you can initialize the volume contents in your main image as a volume, upload that image to your registry, pull it down to your target host, create a named volume on that host, and point your image at that named volume; it will be populated with your initial data. Using docker-compose makes the last two steps easy, but it's really two commands at most: docker volume create <vol-name> and docker run -v <vol-name>:/mnt <image>.
Retrieving the data from a container-based volume or a named volume is an identical process: you need to mount the volume in a container and run an export/backup to your outside location. The only difference is in the command line: instead of --volumes-from <container-id> you have -v <vol-name>:/mnt. You can use this same process to import data into the volume as well, removing the need to initialize the app image with data in its volume.
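The export/import process described above might be sketched like this for a named volume. The function names, the busybox helper image and the archive naming are assumptions for illustration, not part of the original answer:

```shell
# Export: archive a named volume into a host directory.
backup_volume() {
  # $1 = volume name, $2 = absolute host directory for the archive
  docker run --rm -v "$1:/mnt:ro" -v "$2:/backup" busybox \
    tar cf "/backup/$1.tar" -C /mnt .
}

# Import: unpack an archive into a (possibly empty) named volume.
restore_volume() {
  # $1 = volume name, $2 = absolute host directory holding the archive
  docker run --rm -v "$1:/mnt" -v "$2:/backup:ro" busybox \
    tar xf "/backup/$1.tar" -C /mnt
}
```

For a data container, the -v "$1:/mnt" part would be swapped for --volumes-from <container-id>, with the tar paths adjusted to wherever that container's volume is mounted. The resulting archive is an ordinary file, so it can be downloaded to a local machine or uploaded to a new host and restored there.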
The biggest advantage of the new process is that it clearly separates data from containers. You can purge all the containers on the system without fear of losing data, and any volumes listed on the system are clear in their name, rather than a randomly assigned name. Lastly, named volumes can be mounted anywhere on the target, and you can pick and choose which of the volumes you'd like to mount if you have multiple data sources (e.g. config files vs databases).
I'm a bit confused about data-only docker containers. I read it's a bad practice to mount directories directly to the source-os: https://groups.google.com/forum/#!msg/docker-user/EUndR1W5EBo/4hmJau8WyjAJ
And I get how I make data-only containers: http://container42.com/2014/11/18/data-only-container-madness/
And I see somewhat similar question like mine: How to deal with persistent storage (e.g. databases) in docker
But what if I have a LAMP server setup, with everything nicely set up with data containers, not linking them 'directly' to my source OS, and I make a backup once in a while...
Then someone comes by and restarts my server. How do I set up my Docker (data-only) containers again, so I don't lose any data?
Actually, even though it was shykes who said it was considered a "hack" in that link you provide, note the date. Several eons worth of Docker years have passed since that post about volumes, and it's no longer considered bad practice to mount volumes on the host. In fact, here is a link to the very same shykes saying that he has "definitely used them at large scale in production for several years with no issues". Mount a host OS directory as a docker volume and don't worry about it. This means that your data persists across docker restarts/deployments/whatever. It's right there on the disk of the host, and doesn't go anywhere when your container goes away.
I've been using docker volumes that mount host OS directories for data storage (database persistent storage, configuration data, et cetera) for as long as I've been using Docker, and it's worked perfectly. Furthermore, it appears shykes no longer considers this to be bad practice.
Docker containers will persist on disk until they are explicitly deleted with docker rm. If your server restarts you may need to restart your service containers, but your data containers will continue to exist and their volumes will be available to other containers.
docker rm alone doesn't remove the actual data (which lives on in /var/lib/docker/vfs/dir)
Only docker rm -v would clear out the data as well.
The only issue is that, after a docker rm, a new docker run would re-create an empty volume in /var/lib/docker/vfs/dir.
In theory, you could use symlinks to redirect the new volume folders to the old ones, but that supposes you noted which volumes were associated with which data container... before the docker rm.
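One way to note that association before removing anything is to dump the container's mount metadata. A sketch, assuming a Docker version whose inspect output has a Mounts field; the function names are made up:

```shell
# Print the mounts of a container (host source, destination, and so on)
# as JSON, so the mapping can be saved before the container is removed.
container_mounts() {
  docker inspect --format '{{ json .Mounts }}' "$1"
}

# Remove a container together with its anonymous volumes.
remove_container_and_volumes() {
  docker rm -v "$1"
}
```

For example, container_mounts dbdata > dbdata-mounts.json keeps a record of the host directories before a plain docker rm dbdata.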
It's worth noting that the volumes you create with "data-only containers" are essentially still directories on your host OS, just in a different location (/var/lib/docker/...). One benefit is that you get to label your volumes with friendly identifiers and thus you don't have to hardcode your directory paths.
The downside is that administrative work like backing up specific data volumes is a bit of a hassle now since you have to manually inspect metadata to find the directory location. Also, if you accidentally wipe your docker installation or all of your docker containers, you'll lose your data volumes.