I am very interested in reproducible data science work. To that end, I am exploring Docker as a platform that enables bundling code, data and environment settings. My first simple attempt is a Docker image which contains the data it needs (link).
However, this is only the first step. In this example the data is part of the image, so when the image is loaded into a container, the data is already there. My next objective is to decouple the analysis code from the data. As far as I understand, that would mean having two containers, one with the code (code) and one with the data (data).
For the code I use a simple Dockerfile:
FROM continuumio/miniconda3
RUN conda install ipython
and for the data:
FROM atlassian/ubuntu-minimal
COPY data.csv /tmp
where data.csv is a data file I'm copying to the image.
After building these two images I can run them as described in this solution:
docker run -i -t --name code --net=data-testing --net-alias=code drorata/minimal-python /bin/bash
docker run -i -t --name data --net=data-testing --net-alias=data drorata/data-image /bin/bash
after starting a network: docker network create data-testing
After these steps I can ping one container from the other, and probably also access data.csv from code. But I have the feeling this is a suboptimal solution and cannot be considered good practice.
What is considered a good practice to have a container that can access data? I read a little about data volumes but I don't understand how to utilize them and how to turn them into images.

The use of a container as data storage is largely considered outdated and deprecated at this point. You should be using data volumes instead.
But a data volume is not something that you can turn into an image. Really, there is no need for this.
If you want to deliver a .csv file to someone and let them use that in their Docker container, just give them the .csv file.
The easiest way to get the file into the container and be able to use it is with a host-mounted volume.
Using the -v flag on docker run, you can specify a local folder or file to be mounted into the Docker container.
Say, for example, your docker image expects to find a file at /data/input.csv. When you call docker run and you want to provide your own input.csv file, you would do something like
docker run -v /my/file/path/input.csv:/data/input.csv my-image
I'm not providing all of the options that you are showing in your example, but I am illustrating the -v flag. This will take your local filesystem's input.csv and mount it into the Docker container. Now your container will be able to use your copy of that data.
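As a minimal sketch of the host-mounted approach: the wrapper below binds a single local file to the path the image expects, read-only so the container cannot alter your copy. The function name run_with_csv and the DOCKER override variable are made up for illustration; my-image and /data/input.csv are the placeholders used in this answer.

```shell
# Hypothetical helper: mount one host file at the path the image expects.
# Set DOCKER="echo docker" to print the command instead of running it.
run_with_csv() {
    host_csv="$1"   # absolute path to your local input.csv
    image="$2"      # image that reads /data/input.csv
    # ":ro" makes the mount read-only inside the container.
    ${DOCKER:-docker} run --rm -v "${host_csv}:/data/input.csv:ro" "${image}"
}

# Usage (commented out so sourcing this snippet has no side effects):
# run_with_csv /my/file/path/input.csv my-image
```

Mounting the file rather than its whole folder keeps the container's view limited to the one input it needs.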


Best practice to connect my own code into a standard docker image in kubernetes

I have a lot of standard runtime Docker images, like python3 with TensorFlow 1.7 installed, and I want to use these standard images to run customers' code that lives outside of them. The scenario seems quite similar to serverless. So what is the best way to put the code into the runtime containers?
Right now I am trying to use a persistent volume to mount the code into the runtime, but it takes a lot of work. Is there an easier solution for this?
What is the workflow for Google Machine Learning Engine or FloydHub? I think what I want is similar. They have a command-line tool that combines the local code with a standard environment.
Following cloud native practices, code should be immutable, and releases and their dependencies uniquely identifiable for repeatability, replicability, etc. In short: you should really create images with your src code.
In your case, that would mean basing your Dockerfile on upstream python3 or TF images, there are a couple projects that may help with the workflow for above (code+build-release-run): -- looks like better suited for your case -- more golang friendly afaics
Hope it helps --jjo
One of the best practices is NOT to mount the code from a volume into it, but create a client-specific image that uses your TensorFlow image as a base image:
# Your base image comes in here.
FROM aisensiy/tensorflow:1
# Copy the client code into your image.
COPY src /src
# The cluster may run your containers with an arbitrary
# UID and GID 0 (OpenShift does this by default), so
# change the group to 0 and give the group the same
# permissions as the owner. Do this while still root.
RUN chgrp -R 0 /src && \
    chmod -R g=u /src
# Drop root: set the default user to nobody.
USER nobody
CMD ["/usr/bin/python", ...]
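To check locally that such an image copes with an arbitrary UID in group 0, one option is to override the user at run time with docker run --user. This is only a sketch: the helper name, the image tag client-image and the UID 12345 are placeholders, not part of the answer above.

```shell
# Hypothetical helper: run an image as an arbitrary UID with GID 0 and
# print the effective identity with `id`.
# Set DOCKER="echo docker" to print the command instead of running it.
run_as_arbitrary_uid() {
    image="$1"
    uid="${2:-12345}"   # any non-root UID; 12345 is an arbitrary placeholder
    ${DOCKER:-docker} run --rm --user "${uid}:0" "${image}" id
}

# Usage:
# run_as_arbitrary_uid client-image
```

If the code directory was prepared with group 0 and g=u permissions, the process should be able to read it under this identity as well.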
Some more best practices:
Always log to stdout instead of log files.
One process per container. If you need multiple local
processes, co-locate them into a single pod.
Even more best practices are provided in the OpenShift documentation:
The code file can be passed from stdin when the container is being started. This way you can run arbitrary code when starting the container.
Please see below for example:
root@node-1:~# cat
print("This line will be printed.")
root@node-1:~# docker run --rm -i python python <
This line will be printed.
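The same idea also works without a file on disk, for instance with a here-document. A minimal sketch, assuming the stock python image; the helper name run_python_stdin and the DOCKER override are invented here. The "-" argument tells the interpreter to read the program from standard input, and -i keeps the container's stdin open.

```shell
# Hypothetical helper: feed a Python program to a container over stdin.
# Set DOCKER="echo docker" to print the command instead of running it.
run_python_stdin() {
    image="${1:-python}"
    ${DOCKER:-docker} run --rm -i "${image}" python -
}

# Usage:
# run_python_stdin python <<'EOF'
# print("This line will be printed.")
# EOF
```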
If this is your case,
You have a docker image with code in it.
Aim: To update the code inside docker image.
Run a bash session with the docker image with a directory in your file system mounted as volume.
Place the updated code in the volume directory.
From the docker bash session replace the real code with updated code from the volume.
Save the current state of container as new docker image.
Sample Commands
Assume ~/my-dir in your file system has the new code
$ docker run -it --volume ~/my-dir:/workspace --workdir /workspace my-docker-image bash
Now a new bash session will start inside docker container.
Assuming you have the code in '/code/' inside docker container,
You can simply update the code by
$ cp -r /workspace/. /code/
Or you can create a new directory and place the code there.
$ mkdir -p /my-new-dir && cp -r /workspace/. /my-new-dir/
Now the docker container contains updated code. But changes will be reset if you close the container and again run the image. To create a docker image with latest code, save this state of container using docker commit.
Open a new tab in the terminal.
$ docker ps
Will list all running docker containers.
Find CONTAINER ID of your docker container and save it.
$ docker commit id-of-your-container new-docker-image-name
Now run the docker image with latest code
$ docker run -it new-docker-image-name
Note: It is recommended to remove the old docker image using docker rmi command as docker images are heavy.
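The interactive steps above can also be scripted with docker create, docker cp and docker commit, so no bash session inside the container is needed. This is a sketch under assumptions: the function name, the scratch container name code-update-tmp and the DOCKER override are hypothetical, and /code/ is the in-container location assumed above.

```shell
# Hypothetical helper: copy new code into a stopped container made from
# the old image, then commit that container as a new image.
# Set DOCKER="echo docker" to print the commands instead of running them.
update_code_in_image() {
    old_image="$1"   # image that currently holds the code in /code/
    src_dir="$2"     # host directory with the new code
    new_image="$3"   # name for the resulting image
    tmp="code-update-tmp"   # scratch container name (placeholder)
    ${DOCKER:-docker} create --name "$tmp" "$old_image" &&
    ${DOCKER:-docker} cp "${src_dir}/." "${tmp}:/code/" &&
    ${DOCKER:-docker} commit "$tmp" "$new_image" &&
    ${DOCKER:-docker} rm "$tmp"
}

# Usage:
# update_code_in_image my-docker-image ~/my-dir new-docker-image-name
```

A committed image records no build recipe, so a Dockerfile with a COPY step is usually easier to reproduce later; the script is mainly useful for quick one-off updates.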
We're dealing with a similar challenge also. Our approach is to build a static docker image where Tensorflow, Python, etc are built once and maintained.
Each user has a PVC (persistent volume claim) where large files that may change such as datasets and workspaces live.
Then we have a bash shell that launches the cluster resources and syncs the workspace using ksync (like rsync for a kubernetes cluster).

How to deal with files of web applications in docker?

How do you guys deal with files of web applications for your docker containers? We are using same application for >400 customers. It's the same application with enabled/disabled modules (there are extra files).
I am currently using this approach: build the images, e.g. for Mysql, nginx+php, and then start the container with specific prepared application folder:
docker create -v /dbdata --name dbstore x/mysql /bin/true
docker run -d --volumes-from dbstore --name db1 x/mysql
docker run -d -P --name web --link db1:db1 -v /webapp:/opt/webapp x/webapp php-start index.php
IMHO, this wastes space.
I think it's a little bit complex to create >100 tags (revisions) of a webapp docker data container.
Please advise how to manage this problem.
First, recent versions of Docker let you create and use named volumes. This means that "data-only containers" are antiquated and no longer necessary, and in fact are considered an anti-pattern these days. It's pretty straightforward to create and use a named volume:
docker volume create --name=foo
docker run -d -v "foo:/dbdata" --name "db1" x/mysql
You can view your volumes with:
docker volume ls
As far as your main question, you could take advantage of Docker's union filesystem (which could also more simply be called a "shared layer") design. What this means is that if you create two containers from the ubuntu image (e.g. docker run -d --name=one ubuntu and docker run -d --name=two ubuntu), both of those containers are going to use the same filesystem objects in the base ubuntu image. So for example the /etc/passwd file in both of those containers point to the same /etc/passwd data stored on disk. This is part of what is meant by the term "union filesystem" in the context of Docker.
So just take this knowledge a step further and "bake" those modules into your base image for use by all of the containers for your different customers. That just means creating your own image from a Dockerfile which uses FROM wordpress:latest at the top. Continuing with the WordPress example, if you wanted to make a bunch of WP plugins available, you could just store them in /var/www/html/wp-plugins (or whatever) and only enable certain ones in your configuration. Since they're baked into the image you have created (and you used the same image to create all of your different containers), all of those module files point to the same exact data stored on disk, via the union filesystem. Of course, if someone changes the code in one of their modules, for example, that individual container will store the changes in its own writable layer, but the unchanged base files will all come from the same data, not taking up any extra space. You can substitute in whichever CMS you're using.
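A rough sketch of what "one image, per-customer configuration" could look like on the command line. The ENABLED_MODULES environment variable is purely hypothetical (the application would have to read it itself), as are the helper name and the DOCKER override; x/webapp is the image name from the question.

```shell
# Hypothetical helper: start one container per customer from the SAME image,
# passing the list of enabled modules as configuration.
# Set DOCKER="echo docker" to print the command instead of running it.
start_customer() {
    customer="$1"   # used to build a unique container name
    modules="$2"    # comma-separated module list (hypothetical convention)
    ${DOCKER:-docker} run -d --name "web-${customer}" \
        -e "ENABLED_MODULES=${modules}" x/webapp
}

# Usage:
# start_customer acme blog,shop
# start_customer globex blog
```

Because every container is created from the same image, the module files exist once on disk; docker ps --size can be used to see how small each container's own writable layer is compared with the shared image.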
Now, where I work, I've recently created a Docker-based hosting system for people to use. The issue is that we wanted each and every customer to have their own copy of the CMS filesystem. Even though the union filesystem means that changes to the base image would be stored in their own image layers, that wasn't good enough for the guy that signs my paycheck. They wanted each customer to have their own EBS volume with their own copy of the CMS filesystem on it. So in that situation, where you want each and every customer to have their own volume (for example in order to transport them for backup, or move to a new host, etc), then you won't be able to get around the issue of using extra storage for those files.
It depends:
If the files are static and you want to be able to move the container around easily, then I keep the files in the container by just copying them into the web location as a single directory.
If you have a reliable external location and you change the files more regularly (for example by using some kind of CMS), you could just run an apache or nginx container and mount the volume.

build a data container or point to the existing files in docker?

Currently I am building a Ghost blog in Docker from the official ghost Docker image.
As pointed out, there are two ways to link the data.
1. You can point the image to your existing content on your host:
docker run --name some-ghost -v /path/to/ghost/blog:/var/lib/ghost ghost
2. Alternatively you can use a data container that has a volume that points to /var/lib/ghost and then reference it:
docker run --name some-ghost --volumes-from some-ghost-data ghost
Previously I used the first way, and I am puzzled why we would want to build the data container. Is it better than the first way?
The idea of a Data container is the following (quoting Raman Gupta, link below)
“this data logically exists within a data-only container and I (probably) don’t care where it physically exists on my host”
To complete this statement I would add: as long as I can access it and back it up. That's just a matter of where it is, and how you want to access it. Thanks to --volumes-from you can attach to the volumes of another container so, to give you an example, if you wanted to back up the ghost "data" with a data container, you would have to do something like:
docker run --rm --volumes-from some-ghost my-backup-image > some-ghost-backup.tar.gz
The my-backup-image would be doing something like: tar c /var/lib/ghost | gzip (I did not try or run it, but that's the basic idea). Note that there is no -t flag here, since a TTY would mangle the binary archive written to stdout. And you could also use this to manage the data volume and have a common way to access/export (backup) volumes no matter who's using it :
Raman Gupta writes about it there :
But if you care about where the actual data is, and/or it has to be accessible easily on the host, that's fine too.
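The backup idea can be tried without building a dedicated backup image, by running tar from a stock image. This is a sketch, not a tested backup procedure: the helper name and the DOCKER override are invented, and ubuntu is used only because it ships tar and gzip.

```shell
# Hypothetical helper: stream a gzip-compressed tar of a path that lives in
# another container's volumes to stdout.
# Set DOCKER="echo docker" to print the command instead of running it.
backup_volumes_from() {
    container="$1"   # container whose volumes should be attached
    path="$2"        # path inside those volumes to archive
    # No -t flag: a TTY would corrupt the binary stream on stdout.
    ${DOCKER:-docker} run --rm --volumes-from "${container}" ubuntu \
        tar czf - "${path}"
}

# Usage:
# backup_volumes_from some-ghost /var/lib/ghost > some-ghost-backup.tar.gz
```

The same function works whether some-ghost keeps its data in a data container's volume or in a host-mounted directory, which is the "common way to access volumes" point made above.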

In Docker, how can I share files between containers and then save them to an image?

I want to commit the data in a container's shared volume to an image. I cannot seem to do it? I kind of get the impression this perhaps is not possible in Docker but that seems totally at odds with the whole philosophy of not leaving data on the host so part of me thinks there must be a way to do this.
1. Terminal 1
Start up a container in Terminal 1 with a volume.
$ docker run -it -v /data ubuntu:14.10 /bin/bash
root@19fead4f6a68:/# echo "Hello Docker Volumes." > /data/foo.txt
2. Terminal 2
Start up a second container in Terminal 2. The file from container 1 is there, so Docker volumes are all working.
$ docker run -it --volumes-from 19fead4f6a68 ubuntu:14.10 /bin/bash
root@5c7cdbfc67d8:/# cat /data/foo.txt
Hello Docker Volumes.
3. Terminal 3
My understanding is that I can only commit diffs to images so I check what the diffs are on both the containers. For some bizarre reason my changes do not show up!??
$ docker diff 19fead4f6a68
A /data
$ docker diff 5c7cdbfc67d8
A /data
4. Back in Terminal 1
I create a file outside of the volume folder
root@19fead4f6a68:/# echo "Docker you are a very strange beast...." > /var/beast.txt
5. Back in Terminal 3
We now have some changes we can commit although I am rather frustrated as this is not the data from the volume I needed to share with my other container.
$ docker diff 19fead4f6a68
A /data
C /var
A /var/beast.txt
Clearly this is by design. Does anyone have any idea why Docker doesn't allow me to save volume data to a commit? Is there any way at all to share files between containers and then save them to an image? I feel like there must be something I am missing, especially with the aim of sharing data whilst avoiding host dependencies.
Volumes are outside of container images. That's exactly what they are for - bringing data inside a container that isn't in the image.
From the Docker docs:
A data volume is a specially-designated directory within one or more containers that bypasses the Union File System to provide several useful features for persistent or shared data:
Data volumes can be shared and reused between containers
Changes to a data volume are made directly
Changes to a data volume will not be included when you update an image
If you want to save some changes as part of an image, make the changes inside the image and not in a volume. If you want to share changes across multiple containers, put that data in a volume but you have to make your own arrangements for snapshots, rollback, etc., because Docker doesn't have that feature.
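One possible "own arrangement" for snapshots is to copy the volume contents out of the container and bake them into a new image with a small Dockerfile. The sketch below assumes that docker cp can read the volume path of the given container; the function name, the snapshot directory name, the ubuntu base image and the DOCKER override are all illustrative choices.

```shell
# Hypothetical helper: snapshot a volume path of a container into an image.
# Set DOCKER="echo docker" to print the docker commands instead of running them.
snapshot_volume_to_image() {
    container="$1"   # container that has the volume mounted
    vol_path="$2"    # path of the volume inside the container, e.g. /data
    new_image="$3"   # tag for the resulting image
    build_dir="$4"   # scratch directory used as the build context
    mkdir -p "${build_dir}" &&
    # Copy the volume contents into the build context.
    ${DOCKER:-docker} cp "${container}:${vol_path}" "${build_dir}/snapshot" &&
    # Two-line Dockerfile that bakes the snapshot into an image layer.
    printf 'FROM ubuntu\nCOPY snapshot %s\n' "${vol_path}" > "${build_dir}/Dockerfile" &&
    ${DOCKER:-docker} build -t "${new_image}" "${build_dir}"
}

# Usage:
# snapshot_volume_to_image 19fead4f6a68 /data data-snapshot ./snapshot-build
```

The resulting image holds a frozen copy of the data; containers started from it no longer share changes with the original volume.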
Maybe you would be interested in Flocker.
It looks as though there is an open issue around adding volume layers to docker:

Volume and data persistence

What is the best way to persist container data with Docker? I would like to be able to retain some data and get it back when restarting my container. I have read this interesting post but it does not exactly answer my question.
As far as I understand, I only have one option:
docker run -v /home/host/app:/home/container/app
This will mount the host folder into the container.
Is there any other option? FYI, I don't use linking containers (--link )
Using volumes is the best way of handling data which you want to keep from a container. Using the -v flag works well and you shouldn't run into issues with this.
You can also use the VOLUME instruction in the Dockerfile, which means you will not have to add any more options at run time. However, such volumes are quite tightly coupled to the specific container: you'd need to use docker start rather than docker run to get the data back (or, of course, -v pointing at the volume which was created in the past, likely in /var/ somewhere).
A common way of handling volumes is to create a data volume container with volumes defined by -v. Then, when you create your app container, use the --volumes-from flag. This will make your new container use the same volumes as the container you used the -v on (your data volume container). Of course this may seem like you're shifting the issue somewhere else.
This makes it quite simple to share volumes over multiple containers. Perhaps you have a container for your application, and another for logstash.
Create a volume container. This format of -v creates a volume in a directory such as /var/lib/docker/volumes/d3b0d5b781b7f92771b7342824c9f136c883af321a6e9fbe9740e18b93f29b69,
which is then bind mounted at /container/path/vol inside the container:
docker run -v /container/path/vol --name volbox ubuntu
I can now use this container as my volume.
docker run -it --volumes-from volbox --name foobox ubuntu /bin/bash
root@foobox# ls /container/path/vol
Now, if I distribute these two containers, they will just work. The volume will always be available to foobox, regardless which host it is deployed to.
The snag of course comes if you don't want your storage to be in /var/lib/docker/volumes...
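For the original question (keeping data across container restarts without tying it to a host path), a named volume is another option alongside a host mount or a volume container. A sketch, assuming a Docker release that supports named volumes; the helper name, the volume name in the usage line and the DOCKER override are placeholders, and /home/container/app is the path from the question.

```shell
# Hypothetical helper: create a named volume and start a container using it.
# Set DOCKER="echo docker" to print the commands instead of running them.
run_with_named_volume() {
    volume="$1"   # name of the volume, chosen by you
    image="$2"    # application image
    # Older releases spelled this `docker volume create --name=<volume>`.
    ${DOCKER:-docker} volume create "${volume}" &&
    ${DOCKER:-docker} run -d -v "${volume}:/home/container/app" "${image}"
}

# Usage:
# run_with_named_volume appdata my-app-image
```

A new container started later with the same volume name sees the same data. By default the volume still lives under Docker's own data directory, so this does not by itself address the storage-location snag mentioned above.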
I suggest you take a look at some of the excellent posts by Michael Crosby
and the docker docs
