Huge files in Docker containers

I need to create a Docker image (and consequently containers from that image) that uses large files (containing genomic data, reaching ~10GB in size).
How am I supposed to optimize their usage? Am I supposed to include them in the container (such as COPY large_folder large_folder_in_container)? Is there a better way of referencing such files? The point is that it sounds strange to me to push such a container (which would be >10GB) to my private repository. I wonder if there is a way of attaching a sort of volume to the container, without packing all those GBs together.
Thank you.

Is there a better way of referencing such files?
If you already have some way to distribute the data I would use a "bind mount" to attach a volume to the containers.
docker run -v /path/to/data/on/host:/path/to/data/in/container <image> ...
That way you can change the image and you won't have to re-download the large data set each time.
If you wanted to use the registry to distribute the large data set, but want to manage changes to the data set separately, you could use a data volume container with a Dockerfile like this:
FROM tianon/true
COPY dataset /dataset
VOLUME /dataset
From your application container you can attach that volume using:
docker run -d --name dataset <data volume image name>
docker run --volumes-from dataset <image> ...
Either way, I think Docker volumes (https://docs.docker.com/engine/tutorials/dockervolumes/) are what you want.
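On newer Docker versions, a named volume can serve the same role as the data volume container above; a rough equivalent (the volume name is arbitrary) would be to let the data image seed an empty named volume on first use:
docker volume create dataset
# the first run seeds the empty named volume from the image's /dataset contents
docker run --rm -v dataset:/dataset <data volume image name>
docker run -v dataset:/dataset <image> ...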

Am I supposed to include them in the container (such as COPY large_folder large_folder_in_container)?
If you do so, that would include them in the image, not the container: you could launch 20 containers from that image and the actual disk space used would still be 10 GB.
If you were to make another image from your first image, the layered filesystem will reuse the layers from the parent image, and the new image would still be "only" 10GB.
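As a rough sketch of that layer reuse (image names here are hypothetical), a second image built on top of the first only adds its own layers, so the 10 GB of data is stored once:
# hypothetical ~10 GB parent image that already contains the genomic data
FROM my-genomics-base:latest
# only this small layer is added; the 10 GB layers beneath it are reused
COPY analysis_scripts/ /opt/scripts/
docker images will report roughly 10 GB for both images, but docker system df shows the space is only counted once on disk.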

I was having trouble with a 900MB JSON file; changing the memory limit in the preferences fixed it.

Related

can I move a single docker image to another directory?

I'd like to keep some less used images on an external disk.
Is that possible?
Or should I move all images to an external disk changing some base path?
All of the Docker images are stored in an opaque, backend-specific format inside the /var/lib/docker directory. You can't move some of the images to a different location, only the entire Docker storage tree. See for example How to change the docker image installation directory?.
If you have images you only rarely use, you can docker rmi them for now and then docker pull them again from Docker Hub or another repository when you need them.
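If the goal is simply to keep everything on an external disk, one approach (a sketch, assuming a Linux host with systemd and a recent Docker where the daemon option is called data-root; older releases used the -g/--graph flag) is to move the whole storage tree and point the daemon at it:
# /etc/docker/daemon.json
{
  "data-root": "/mnt/external-disk/docker"
}
sudo systemctl stop docker
sudo rsync -a /var/lib/docker/ /mnt/external-disk/docker/
sudo systemctl start docker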

Why does docker have to create an image from a dockerfile then create a container from the image instead of creating a container from a Dockerfile?

Why does docker have to create an image from a dockerfile then create a container from the image instead of creating a container directly from a Dockerfile?
What is the purpose/benefit of creating the image first from the Dockerfile then from that create a container?
-----EDIT-----
This question, What is the difference between a Docker image and a container?, does not answer my question.
My question is: Why do we need to create a container from an image and not a dockerfile? What is the purpose/benefit of creating the image first from the Dockerfile then from that create a container?
The Dockerfile is the recipe to create an image.
The image is a virtual filesystem.
The container is a running process on a host machine.
You don't want every host to build its own image based on the recipe. It's easier for some hosts to just download an image and work with that.
Creating an image can be very expensive. I have complicated Dockerfiles that may take hours to build, may download 50 GB of data, yet still only create a 200 MB image that I can send to different hosts.
Spinning up a container from an existing image is very cheap.
If all you had was the Dockerfile in order to spin up containers, the entire workflow would become very cumbersome.
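For example (registry and image names here are made up), the expensive build happens once on a build machine, and every other host only pulls and runs the resulting image:
# on the build machine: slow, runs the whole Dockerfile
docker build -t registry.example.com/myapp:1.0 .
docker push registry.example.com/myapp:1.0
# on each application host: fast, no Dockerfile needed
docker pull registry.example.com/myapp:1.0
docker run -d registry.example.com/myapp:1.0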
Images and Containers are two different concepts.
Basically, images are like a snapshot of a filesystem, along with some meta-data.
A container is a process that is actually running (and which is based on an image). As soon as the process ends, your container does not exist anymore (well, it is stopped, to be exact).
You can view the image as the base that your container will run on.
Thus, your Dockerfile will create an image (which is static) that you can store locally or push to a repository, to be able to use it later.
The container cannot be "stored" because it is a "living" thing.
You can think of Images vs Containers similar to Classes vs Objects or the Definition vs Instance. The image contains the filesystem and default settings for creating the container. The container contains the settings for a specific instance, and when running, the namespaces and running process.
As for why you'd want to separate them, efficiency and portability. Since we have separate images, we also have inheritance, where one image extends another. The key detail of that inheritance is that filesystem layers in the image are not copied for each image. Those layers are static, and you can only change them by creating a new image with new layers. Using the overlay filesystem (or one of the other union filesystem drivers) we can append additional changes to that filesystem with our new image. Containers do the same when they run the image. That means you can have a 1 GB base image, extend it with a child image with 100 MB of changes, and run 5 containers that each write 1 MB of files, and the overall disk space used on the docker host is only 1.105 GB rather than 7.6 GB.
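One way to see this on a real host is docker ps --size, which separates each container's own writable layer from the shared read-only image size (the numbers below are illustrative):
docker ps --size
# the SIZE column reads e.g. "1.05MB (virtual 1.1GB)": about 1 MB written by this
# container, plus 1.1 GB of image layers shared with every container using that image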
The portability part comes into play when you use registries, e.g. Docker Hub. The image is the part of the container that is generic, reusable, and transferable. It's not associated with an instance on any host. So you can push and pull images, but containers are tightly bound to the host they are running on, named volumes on that host, networks defined on that host, etc.

What is the purpose of Dockerfile command "Volume"?

When a Dockerfile contains a VOLUME instruction, say VOLUME [/opt/apache2/www, ...] (hope this path exists in a real installation), it means this path is going to be mounted to something (right?). And this VOLUME instruction is for the image, not for one instance of it (a container), but for every instance.
Anyway, irrespective of whether an image contains a VOLUME definition or not, at the time of starting a container the run command can create a volume by mapping a local host path to a container path.
docker run --name understanding_volumes -v /localhost/path1:/opt/apache2/www -v /localhost/path2:/any/container/path image_name
The above should make it clear that even though /any/container/path is not defined as a VOLUME in the Dockerfile, we are able to mount it while running the container.
That said, this SO question throws some light on it - What is the purpose of defining VOLUME mount points within DockerFile rather than adhoc cmd-line -v?. There, one benefit of the VOLUME instruction is mentioned: other containers can benefit from it. Using --from-container (I could not find this option in docker run --help; perhaps the answer meant --volumes-from), the mount point is made accessible to other containers in some kind of automatic way. Great.
My first question is: is the other volume path /any/container/path, mounted onto the container understanding_volumes, also available to the second container using --from-container or --volumes-from (whichever option is correct)?
My next question is: is the use of the VOLUME instruction just to let other containers link to this path, i.e. to make the data at /opt/apache2/www available to other containers through easy linking? So it's just sharing out. Or is there any data that can be made available to the first container too?
Defining a volume in a Dockerfile has the advantage of specifying the volume location inside the image definition as documentation from the image creator to the user of the image. That's just about the only upside.
It was added to docker very early on, quite possibly when data containers were the only way to persist data. We now have a solution for named volumes that has obsoleted data containers. We have also added the compose file to define how containers are run in an easy to understand and reuse syntax.
While there is the one upside of self documented images, there are quite a few downsides, to the point that I strongly recommend against defining a volume inside the image to my clients and anyone publishing images for general reuse:
The volume is forced on the end user, there's no way to undefine a volume in the image.
If the volume is not defined at runtime (with a -v or compose file), the user will see anonymous volumes in their docker volume ls that have no association to what created them. These are almost always useless wastes of disk space.
They break the ability to extend the image since any changes to a volume in an image after the VOLUME line are typically ignored by docker. This means a user can never add their own initial volume data, which is very confusing because docker gives no warning that it is ignoring the user changes during the image build.
If you need to have a volume as a user at runtime, you can always define it with a -v or compose file, even if that volume is not defined in the Dockerfile. Many users have the misconception that you must define it in the image to be able to make it a named volume at runtime.
The ability to use --volumes-from is unaffected by defining the volume in the image, but I'd encourage you to avoid this capability. It does not exist in swarm mode, and you can get all the same capabilities along with more granularity by using a named volume that you mount in two containers.
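A minimal sketch of that last suggestion (the volume and image names are made up): both containers mount the same named volume, so whatever one writes the other sees, with no VOLUME line or --volumes-from involved:
docker volume create www-data
docker run -d --name producer -v www-data:/opt/apache2/www my-content-image
docker run -d --name consumer -v www-data:/opt/apache2/www my-web-image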

Image of a data volume using docker

I am very interested in reproducible data science work. To that end, I am now exploring Docker as a platform which enables bundling of code, data and environment's settings. My first simple attempt is a Docker image which contains the data it needs (link).
However, this is only the first step, in this example, the data is part of the image, and thus when the image is loaded into a container, the data is already there. My next objective is to decouple the code of the analysis and the data. As far as I understand, that would mean to have two containers, one with the code (code) and one with the data (data).
For the code I use a simple Dockerfile:
FROM continuumio/miniconda3
RUN conda install ipython
and for the data:
FROM atlassian/ubuntu-minimal
COPY data.csv /tmp
where data.csv is a data file I'm copying to the image.
After building these two images I can run them as described in this solution:
docker run -i -t --name code --net=data-testing --net-alias=code drorata/minimal-python /bin/bash
docker run -i -t --name data --net=data-testing --net-alias=data drorata/data-image /bin/bash
after starting a network: docker network create data-testing
After these steps I can ping one container from the other, and probably also access data.csv from code. But I have this feeling this is a suboptimal solution and cannot be considered good practice.
What is considered a good practice to have a container that can access data? I read a little about data volumes but I don't understand how to utilize them and how to turn them into images.
The use of a container as data storage is largely considered outdated and deprecated at this point. You should be using data volumes instead.
But a data volume is not something that you can turn into an image. Really, there is no need for this.
If you want to deliver a .csv file to someone and let them use it in their docker container, just give them the .csv file.
The easiest way to get the file into the container and be able to use it is with a host mounted volume.
Using the -v flag on docker run, you can specify a local folder or file to be mounted into the docker container.
Say, for example, your docker image expects to find a file at /data/input.csv. When you call docker run and you want to provide your own input.csv file, you would do something like
docker run -v /my/file/path/input.csv:/data/input.csv my-image
I'm not providing all of the options from your example here, but I am illustrating the -v flag. This will take input.csv from your local filesystem and mount it into the docker container at /data/input.csv; now your container will be able to use your copy of that data.
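If you would rather have the data live in a named volume than on a host path (for example so it survives independently of any container or host directory layout), a minimal sketch is to create the volume, seed it from a throwaway container, then mount it into the analysis container:
docker volume create genomics-data
# copy the local file into the volume using a temporary helper container
docker run --rm -v genomics-data:/data -v "$(pwd)":/src alpine cp /src/data.csv /data/
# mount the populated volume into the analysis container
docker run -i -t -v genomics-data:/data drorata/minimal-python /bin/bash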

Deploy a docker app using volume create

I have a Python app using a SQLite database (it's a data collector that runs daily by cron). I want to deploy it, probably on AWS or Google Container Engine, using Docker. I see three main steps:
1. Containerize and test the app locally.
2. Deploy and run the app on AWS or GCE.
3. Backup the DB periodically and download back to a local archive.
Recent posts (on Docker, StackOverflow and elsewhere) say that since 1.9, Volumes are now the recommended way to handle persisted data, rather than the "data container" pattern. For future compatibility, I always like to use the preferred, idiomatic method, however Volumes seem to be much more of a challenge than data containers. Am I missing something??
Following the "data container" pattern, I can easily:
Build a base image with all the static program and config files.
From that image create a data container image and copy my DB and backup directory into it (simple COPY in the Dockerfile).
Push both images to Docker Hub.
Pull them down to AWS.
Run the data and base images, using "--volumes-from" to refer to the data.
Using "docker volume create":
I'm unclear how to copy my DB into the volume.
I'm very unclear how to get that volume (containing the DB) up to AWS or GCE... you can't PUSH/PULL a volume.
Am I missing something regarding Volumes?
Is there a good overview of using Volumes to do what I want to do?
Is there a recommended, idiomatic way to backup and download data (either using the data container pattern or volumes) as per my step 3?
When you first use an empty named volume, it will receive a copy of the image's data at the location where it is first mounted (unlike a host based volume, which completely overlays the mount point with the host directory). So you can initialize the volume contents in your main image as a volume, upload that image to your registry and pull it down to your target host, create a named volume on that host, and point your image to that named volume (using docker-compose makes the last two steps easy; it's really 2 commands at most, docker volume create <vol-name> and docker run -v <vol-name>:/mnt <image>), and it will be populated with your initial data.
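Spelled out (image, file and volume names here are placeholders), that flow might look roughly like this:
# in the app image: bake the initial data in; /data is what the volume will cover
FROM python:3
COPY app.db /data/app.db
VOLUME /data
# on the target host: the empty named volume is seeded from /data on first use
docker volume create app-data
docker run -d -v app-data:/data registry.example.com/collector:latest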
Retrieving the data from a container based volume or a named volume is an identical process: you need to mount the volume in a container and run an export/backup to your outside location. The only difference is in the command line; instead of --volumes-from <container-id> you have -v <vol-name>:/mnt. You can use this same process to import data into the volume as well, removing the need to initialize the app image with data in its volume.
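A concrete backup command along those lines (volume name and paths are illustrative): mount the volume read-only together with a host directory, and tar the contents out:
docker run --rm -v app-data:/mnt:ro -v "$(pwd)/backups":/backup alpine \
    tar czf /backup/app-data-$(date +%F).tar.gz -C /mnt .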
The biggest advantage of the new process is that it clearly separates data from containers. You can purge all the containers on the system without fear of losing data, and any volumes listed on the system are clear in their name, rather than a randomly assigned name. Lastly, named volumes can be mounted anywhere on the target, and you can pick and choose which of the volumes you'd like to mount if you have multiple data sources (e.g. config files vs databases).
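If docker-compose is in the picture, the same named volume can be declared in the compose file (service, image and volume names here are placeholders):
# docker-compose.yml
version: "2"
services:
  collector:
    image: registry.example.com/collector:latest
    volumes:
      - app-data:/data
volumes:
  app-data: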
