Containers with pipelines: should/can you keep your data separate from the container - docker

I am very new to containers and I was wondering if there is a "best practice" for the following situation:
Let's say I have developed a general pipeline using multiple software tools to analyze next generation sequencing data (I work in science). I decided to make a container for this pipeline so I can share it easily with colleagues. The container would have the required tools and their dependencies installed, as well as all the scripts to run the pipeline. There would be some wrapper/master script to run the whole pipeline, something like: bash run-pipeline.sh -i input data.txt
My question is: if you are using a container for this purpose, do you need to place your data INSIDE the container, OR can you run the pipeline on data that is placed outside the container? In other words, do you have to put your input data inside the container before running the pipeline on it?
I'm struggling to find a case example.
Thanks.

To me the answer is obvious - the data belongs outside the image.
The reason is that if you build an image with the data inside, how are your colleagues going to use it with their data?
It does not make sense to talk about the data being inside or outside the container. The data will be inside the container. The only question is how did it get there?
My recommended process is something like:
1. Create an image with all your scripts, required tools, dependencies, etc., but no data. For simplicity, let us name this image pipeline.
2. Bind mount the data into the container:
   docker container create --mount type=bind,source=/path/to/data/files/on/host,target=/srv/data,readonly=true pipeline
Of course, replace /path/to/data/files/on/host with the appropriate path. You can store your data in one place and your colleagues in another. You make a substitution appropriate for you and they will have to make a substitution appropriate for them.
However inside the container, the data will be at /srv/data. Your scripts can just assume that it will be there.
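As a rough sketch of how a colleague might launch the pipeline against their own data, the wrapper below passes a host directory as a read-only bind mount at /srv/data. It uses docker run rather than docker container create so that the container starts immediately. The function name run_pipeline is hypothetical, and the sketch assumes that run-pipeline.sh sits in the image's working directory.

```shell
#!/bin/sh
# run_pipeline HOST_DATA_DIR [pipeline args...]
# Starts the "pipeline" image with HOST_DATA_DIR mounted read-only at /srv/data.
# run_pipeline is an illustrative name, not part of Docker.
run_pipeline() {
    host_dir=$1
    shift
    # --rm removes the container once the pipeline has finished.
    docker run --rm \
        --mount "type=bind,source=${host_dir},target=/srv/data,readonly" \
        pipeline bash run-pipeline.sh "$@"
}
```

A call could look like run_pipeline /path/to/data/files/on/host -i /srv/data/data.txt, where the input file name is a placeholder. The source path must be absolute and must already exist, because --mount does not create it. Since the mount is read-only, the pipeline would have to write its results somewhere else, for example to a second, writable bind mount.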

To handle the described scenario, I would recommend using files to exchange data between your processing steps. To bring the files into your container, you can mount a local directory into it. That also gives your containers some persistence. The following example shows how to mount a local directory into your containers.
version: '3.2'
services:
  container1:
    image: "your.image1"
    volumes:
      - "./localpath:/container/internal"
  container2:
    image: "your.image2"
    volumes:
      - "./localpath:/container/internal"
  container3:
    image: "your.image3"
    volumes:
      - "./localpath:/container/internal"
The example uses a docker compose file to describe the dependencies between your containers. You can implement the same without docker-compose. Then you have to specify your container mounts in your docker run command.
https://docs.docker.com/engine/reference/commandline/run/
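Without docker-compose, the same mount can be expressed with the -v flag of docker run. The sketch below runs the three images from the compose file one after another against the same host directory, so that each step can read the files the previous one wrote. Running them sequentially is an assumption made for illustration, and run_steps is a hypothetical name.

```shell
#!/bin/sh
# run_steps HOST_DIR
# Runs each image in turn with HOST_DIR mounted at /container/internal.
# Stops at the first step that exits with a non-zero status.
run_steps() {
    host_dir=$1
    for image in your.image1 your.image2 your.image3; do
        docker run --rm -v "${host_dir}:/container/internal" "$image" || return 1
    done
}
```

It could be called as run_steps "$(pwd)/localpath". The host side of -v should be an absolute path; a bare name would be treated as a named volume rather than a directory.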

Related

How to expose files from a docker container through a webserver

I have this website which uses angular for the frontend and has a NodeJs backend. The backend serves the angular files and handles client calls.
As it is now, they are both packaged and deployed as one docker image. This means that if I change the frontend, I also need to build the backend in order to create a new image. So it makes sense to separate them.
But if I create an image for the backend and frontend, how can the backend serve files from the frontend container?
Is this the right approach?
I think I would like to have the frontend inside a docker image, so I can do stuff like rollback easily (which is not possible with docker volumes for example)!
Yes! Giving each of them its own container is the way to go! This lets us deploy/deliver faster and also separates the build pipelines, which makes the steps clearer to everyone involved.
I wouldn't bother having the backend serve the frontend files. I usually build my frontend image on a webserver (e.g. nginx:alpine), since the frontend and backend can then be deployed separately to different machines or systems. And don't forget to use multi-stage builds to minimize image size.
But if you must do that, I guess you can use docker-compose to put them on one network and then forward requests for those static files from the backend to the frontend webserver. (Just some hacks, there must be a better way to handle this from more advanced people here :P)
I have something similar, an Emberjs running in one docker container that connects to nodejs that is running in its own container (not to mention the DB that runs on a third container). It all works rather well.
I would recommend that you create your containers using docker-compose, which will automatically create the network so that both containers can talk to each other using the service name and port (service_name:port).
I also set it up so that the code is mapped from a folder on my machine to a folder in the container. This allows me to easily change stuff, work with Git, etc.
Here is a snippet of my docker-compose file as an example:
version: "3"
services:
  # .... (other services omitted)
  ember_gui:
    image: danlynn/ember-cli
    container_name: ember_dev
    depends_on:
      - node_server
    volumes:
      - ./Ember:/myapp
    command: ember server
    ports:
      - "4200:4200"
      - "7020:7020"
      - "7357:7357"
Here I create an ember_gui service, which creates a container named ember_dev based on an existing image from docker hub. Then it tells docker that this container depends on another container that needs to be started first; that one (node_server) is defined in the same docker-compose file but not shown in the snippet. After that, I map the ./Ember directory to the /myapp folder in the container so that I can share the code. Finally, I start the ember server and open some ports.

entry point of docker container dependent on local file system and not in the image

I am working on a docker container that is being created from a generic image. The entry point of this container is dependent on a file in the local file system and not in the generic image. My docker-compose file looks something like this:
service_name:
  image: base_generic_image
  container_name: container_name
  entrypoint:
    - "/user/dlc/bin"
    - "-p"
    - "localFolder/fileName.ext"
    - more parameters
The challenge that I am facing is removing this dependency and adding it to the base_generic_image at run time so that I can deploy it independently. Should I add this file to the base generic image and then proceed (this file is not required by others), or should this be done when creating the container? If so, what is the best way of going about it?
You should create a separate image for each part of your application. These can be based on the base image if you'd like; the Dockerfile might look like
FROM base_generic_image
COPY dlc /usr/bin
CMD ["dlc"]
Your Docker Compose setup might have a subdirectory for each component and could look like
servicename:
  image: my/servicename
  build:
    context: ./servicename
  command: ["dlc", "-p", ...]
In general Docker volumes and bind-mounts are good things to use for persistent data (when absolutely required; stateless containers with external databases are often easier to manage), getting log files out of containers, and pushing complex configuration into containers. The actual program that's being run generally should be built into the base image. The goal is that you can take the image and docker run it on a clean system without any of your development environment on it.

Docker swarm having some shared volume

I will try to describe my desired functionality:
I'm running docker swarm over docker-compose.
In the docker-compose file I have services; for simplicity let's call them A, B, and C.
Assume service C includes shared code modules that need to be accessible to services A and B.
My questions are:
1. Must each service that needs access to the shared volume mount C's volume into its own local folder (using the volumes section as below), or can the code be accessed without mounting/copying it to a path in the local container?
2. In docker swarm, it can happen that two instances of services A and B reside on computer X while service C resides on computer Y.
Is it true that, because the services are all maintained under the same docker swarm stack, they will communicate with service C without problems?
If not, which definitions are needed to achieve it?
My structure is something like that:
version: "3.4"
services:
  A:
    build: .
    volumes:
      - C:/usr/src/C
    depends_on:
      - C
  B:
    build: .
    volumes:
      - C:/usr/src/C
    depends_on:
      - C
  C:
    image: repository.com/C:1.0.0
    volumes:
      - C:/shared_code
volumes:
  C:
If what you’re sharing is code, you should build it into the actual Docker images, and not try to use a volume for this.
You’re going to encounter two big problems. One is getting a volume correctly shared in a multi-host installation. The second is a longer-term issue: what are you going to do if the shared code changes? You can’t just redeploy the C module with the shared code, because the volume that holds the code already exists; you need to separately update the code in the volume, restart the dependent services, and hope they both work. Actually baking the code into the images makes it possible to test the complete setup before you try to deploy it.
Sharing code is an anti-pattern in a distributed model like Swarm. Like David says, you'll need that code in the image builds, even if there's duplicate data. There are lots of ways to have images built on top of others to limit the duplicate data.
If you still need to share data between containers in swarm on a file system, you'll need to look at some shared storage like AWS EFS (multi-node read/write) plus REX-Ray to get your data to the right containers.
Also, depends_on doesn't work in swarm. Your apps in a distributed system need to handle the lack of connection to other services in a predictable way. Maybe they just exit (and swarm will re-create them) or go into a retry loop in code, etc. depends_on is meant for the local docker-compose CLI in development, where you want to spin up an app and its dependencies by doing something like docker-compose up api.
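The retry loop can live in the application itself or in the container's entrypoint script. Here is a minimal shell sketch of the latter. wait_for is a hypothetical helper, and the command used to check the other service is left to the caller.

```shell
#!/bin/sh
# wait_for MAX_ATTEMPTS DELAY_SECONDS command [args...]
# Re-runs the command until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on success and 1 after giving up.
wait_for() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}
```

An entrypoint could then run something like wait_for 30 2 nc -z C 8080 && exec my-app, where the port, the use of nc, and my-app are placeholders. If the helper gives up and the container exits, swarm can re-create it, which matches the "just exit" strategy described above.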

docker-compose: where to store configuration for services?

I'm building an ELK (elasticsearch/logstash/kibana) stack using docker-compose/docker-machine. The plan is to deploy it to a digitalocean droplet and, if needed, use Swarm to scale it.
It works really well, but I'm a bit confused where I should store configuration for the services (e.g. configuration files for logstash, or the SSL certs for nginx).
At first, I just mounted a host directory as volume. The problem with that is that all the configuration files have to be available on the docker host, so I have to sync them to the digitalocean droplet.
Then I thought I had a very smart idea: create a data container with all the configuration, and let the other services access it using volumes_from:
config:
  volumes:
    - /conf
  build:
    context: .
    # this just copies the conf folder into the image
    dockerfile: /dockerfiles/config/Dockerfile
logstash:
  image: logstash:2.2
  volumes_from:
    - config
The problem with this approach became obvious quite fast: every time I change any configuration, I need to stop all containers that are linked to the config container, recreate the config image and container, and then start up the services again. Not great for uptime :(.
So, what's the best way? Ideally, the configuration files would be inside a container, so I can just ship it to wherever.
One common solution to this problem is to put a load balancer in front of the services. That way when you want to change the configuration you can start a new container and the load balancer will pick it up, then stop the old container. No downtime, and it lets you reload the config.
Another option might be to use a named volume. Then you can just modify the contents of the named volume and any containers using it will see the new files. However if you are using multiple nodes with swarm, you'll need to use a volume driver that supports multi-host volumes.
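One way to change the contents of a named volume is to mount it into a short-lived helper container and copy the files in. The following is a sketch under a few assumptions: the volume name elk_conf and the alpine helper image are arbitrary choices, and sync_conf is a purely illustrative function name.

```shell
#!/bin/sh
# sync_conf HOST_CONF_DIR [VOLUME_NAME]
# Copies the contents of HOST_CONF_DIR into a named volume (default: elk_conf).
sync_conf() {
    src_dir=$1
    volume=${2:-elk_conf}
    # Make sure the volume exists before mounting it.
    docker volume create "$volume" >/dev/null || return 1
    # Mount the host directory read-only and copy everything into the volume.
    docker run --rm \
        -v "${src_dir}:/src:ro" \
        -v "${volume}:/conf" \
        alpine cp -a /src/. /conf/
}
```

After sync_conf "$(pwd)/conf", any service that mounts elk_conf sees the new files. Whether a running service picks up the change without a restart depends on the application.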
Did you consider using the extension mechanism to override a settings file? Put a second file, docker-compose.override.yml, in the same directory as the main compose file, or use explicit extension within the compose file. See
https://docs.docker.com/compose/extends/
That way you could integrate a configuration file in a transparent way, or control the parameters you want to change via environment variables that are different in the overriding composition.

Overwrite files with `docker run`

Maybe I'm missing this when reading the docs, but is there a way to overwrite files on the container's file system when issuing a docker run command?
Something akin to the Dockerfile COPY command? The key desire here is to be able to take a particular Docker image, and spin several of the same image up, but with different configuration files. (I'd prefer to do this with environment variables, but the application that I'm Dockerizing is not partial to that.)
You have a few options. Using something like docker-compose, you could automatically build a unique image for each container using your base image as a template. For example, if you had a docker-compose.yml that looked like:
container0:
  build: container0
container1:
  build: container1
And then inside container0/Dockerfile you had:
FROM larsks/thttpd
COPY index.html /index.html
And inside container0/index.html you had whatever content you
wanted, then running docker-compose build would generate unique
images for each entry (and running docker-compose up would start
everything up).
I've put together an example of the above
here.
Using just the Docker command line, you can use host volume mounts,
which allow you to mount files into a container as well as
directories. Using my thttpd as an example again, you could use the
following -v argument to override /index.html in the container
with the content of your choice:
docker run -v "$(pwd)/index.html:/index.html" larsks/thttpd
And you could accomplish the same thing with docker-compose via the
volume entry:
container0:
  image: larsks/thttpd
  volumes:
    - ./container0/index.html:/index.html
container1:
  image: larsks/thttpd
  volumes:
    - ./container1/index.html:/index.html
I would suggest that using the build mechanism makes more sense if you are trying to override many files, while using volumes is fine for one or two files.
A key difference between the two mechanisms is that when building images, each container has its own copy of the files, while with volume mounts, changes made to the file within the container will be reflected on the host filesystem.
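For the original goal of starting several containers from one image with different configuration files, the volume approach can be wrapped in a small loop. This is only a sketch: start_all is a hypothetical helper, and it assumes one subdirectory per container (as in the compose example) holding that container's index.html.

```shell
#!/bin/sh
# start_all BASE_DIR NAME...
# Starts one larsks/thttpd container per NAME, each with its own
# BASE_DIR/NAME/index.html mounted read-only over /index.html.
start_all() {
    base_dir=$1
    shift
    for name in "$@"; do
        docker run -d --name "$name" \
            -v "${base_dir}/${name}/index.html:/index.html:ro" \
            larsks/thttpd || return 1
    done
}
```

A call such as start_all "$(pwd)" container0 container1 would start both containers. The :ro suffix keeps a container from modifying the host file; drop it if changes made inside the container should show up on the host.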

Resources