docker compose: use GPU if available, else start container without one

I'm using docker compose to run a container:
version: "3.9"
services:
  app:
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [ gpu ]
The container can benefit from the presence of a GPU, but it does not strictly need one. Using the above docker-compose.yaml results in an error
Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
when being used on a machine without a GPU. Is it possible to specify "use a GPU, if one is available, else start the container without one"?

@herku, there are no conditional statements in docker compose. In 2018 the feature was considered out of scope: https://github.com/docker/compose/issues/5756
Anyway, you can check this answer for options on how to work around the problem:
https://stackoverflow.com/a/50393225/3730077
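A minimal sketch of one such workaround (my own illustration, not taken verbatim from the linked answer): keep the GPU reservation in a separate override file and pass that file only on hosts that actually have a GPU. The file names are just a convention.

docker-compose.yml (base definition, starts everywhere):

version: "3.9"
services:
  app:
    image: nvidia/cuda:11.0.3-base-ubuntu20.04

docker-compose.gpu.yml (override carrying the GPU reservation):

version: "3.9"
services:
  app:
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

On a GPU host: docker compose -f docker-compose.yml -f docker-compose.gpu.yml up
On a host without a GPU: docker compose up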

Related

docker-compose: Reserve a different GPU for each scaled container

I have a docker-compose file that looks like the following:
version: "3.9"
services:
  api:
    build: .
    ports:
      - "5000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              count: 1
When I run docker-compose up, this runs as intended, using the first GPU on the machine.
However, if I run docker-compose up --scale api=2, I would expect each docker container to reserve one GPU on the host.
The actual behaviour is that both containers receive the same GPU, meaning that they compete for resources. Additionally, I also get this behaviour if I have two containers specified in the docker-compose.yml, both with count: 1. If I manually specify device_ids for each container, it works.
How can I make it so that each docker container reserves exclusive access to 1 GPU? Is this a bug or intended behaviour?
The behavior of docker-compose when a scale is requested is to create additional containers following the exact specification provided by the service.
Very few specification parameters vary during the creation of the additional containers, and the devices, which are part of the host_config set of parameters, are copied without modification.
docker-compose is a Python project, so if this is an important feature for you, you can try to implement it yourself. The logic that drives the lifecycle of the services (creation, scaling, etc.) resides in compose/services.py.
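As a practical workaround, building on the asker's own observation that explicit device_ids work, you can define one service per GPU instead of scaling a single service. A rough sketch, assuming a host with two GPUs (ids 0 and 1):

version: "3.9"
services:
  api-gpu0:
    build: .
    ports:
      - "5000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  api-gpu1:
    build: .
    ports:
      - "5000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]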

How to declare docker volumes to exist on an external hard drive in docker compose

Due to space limitations on my local machine I need to ensure my docker containers store their data on my external hard drive.
The docker project I am using has a docker compose file and it specifies a number of volumes like so:
version: "2"
volumes:
  pgdata:
  cache:
services:
  postgres:
    image: "openmaptiles/postgis:${TOOLS_VERSION}"
    volumes:
      - pgdata:/var/lib/postgresql/data
Those volumes ultimately exist on my local machine. I'd like to change their location to somewhere on my external drive, e.g. /Volumes/ExternalDrive/docker/.
How do I go about this?
I have read the docker documentation on volumes and docker-compose but can't find a way to specify the path of where a volume should exist.
If anyone could point the way I would be most grateful.
You can explore and test more features related to volumes using the CLI help and then transpose them to compose.
docker volume create --help
https://docs.docker.com/engine/reference/commandline/volume_create/
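For instance, the CLI equivalent of the bind-mounted named volume shown in the compose file below would look roughly like this (the device path is a placeholder):

docker volume create --driver local \
  --opt type=none \
  --opt o=bind \
  --opt device=/path/to/the/external/storage \
  pgdata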
Note that the example below might not work on Windows, since the built-in local driver on Windows does not support any options; but if you're running Docker on Linux, the compose file below should do the job:
version: "2"
volumes:
  pgdata:
    driver: local
    driver_opts:
      type: 'none'
      o: 'bind'
      device: '/path/to/the/external/storage'
  cache:
services:
  postgres:
    image: "openmaptiles/postgis:${TOOLS_VERSION}"
    volumes:
      - pgdata:/var/lib/postgresql/data
You might also consider changing the Docker launch options to store its data in a location of your choice. Here's a guide: https://linuxconfig.org/how-to-move-docker-s-default-var-lib-docker-to-another-directory-on-ubuntu-debian-linux
Alternatively, if you're more comfortable solving this in the OS rather than in Docker, you could try some tricks at the filesystem level, such as creating a symbolic link at /var/lib/docker/volumes/ that points at your external storage. Be careful and back up everything first. I personally never tried this, but I believe it should be transparent to the Docker storage driver.
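Purely as a hypothetical sketch of that symlink idea on a Linux host (untested, paths are placeholders, back everything up first):

# Stop Docker, move the volumes directory to the external drive, then symlink it back
sudo systemctl stop docker
sudo mv /var/lib/docker/volumes /mnt/external/docker-volumes
sudo ln -s /mnt/external/docker-volumes /var/lib/docker/volumes
sudo systemctl start docker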

How to run tensorflow with gpu support in docker-compose?

I want to create a neural network in tensorflow 2.x that trains on a GPU, and I want to set up all the necessary infrastructure inside a docker-compose network (assuming that this is actually possible for now). As far as I know, in order to train a tensorflow model on a GPU, I need the CUDA toolkit and the NVIDIA driver. Installing these dependencies natively on my computer (OS: Ubuntu 18.04) is always quite a pain, as there are many version dependencies between tensorflow, CUDA and the NVIDIA driver. So, I was trying to find a way to create a docker-compose file that contains a service for tensorflow, CUDA and the NVIDIA driver, but I am getting the following error:
# Start the services
sudo docker-compose -f docker-compose-test.yml up --build
Starting vw_image_cls_nvidia-driver_1 ... done
Starting vw_image_cls_nvidia-cuda_1 ... done
Recreating vw_image_cls_tensorflow_1 ... error
ERROR: for vw_image_cls_tensorflow_1 Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: for tensorflow Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: Encountered errors while bringing up the project.
My docker-compose file looks as follows:
# version 2.3 is required for NVIDIA runtime
version: '2.3'
services:
  nvidia-driver:
    # NVIDIA GPU driver used by the CUDA Toolkit
    image: nvidia/driver:440.33.01-ubuntu18.04
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Do we need this volume to make the driver accessible by other containers in the network?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
    networks:
      - net
  nvidia-cuda:
    depends_on:
      - nvidia-driver
    image: nvidia/cuda:10.1-base-ubuntu18.04
    volumes:
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need to create an additional volume for this service to be accessible by the tensorflow service?
    devices:
      # Do we need to list the devices here, or only in the tensorflow service? Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
    networks:
      - net
  tensorflow:
    image: tensorflow/tensorflow:2.0.1-gpu # Does this ship with cuda10.0 installed or do I need a separate container for it?
    runtime: nvidia
    restart: always
    privileged: true
    depends_on:
      - nvidia-cuda
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Volumes related to source code and config files
      - ./src:/src
      - ./configs:/configs
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need an additional volume from the nvidia-cuda service?
    command: import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000]))); print("SUCCESS")
    devices:
      # Devices listed here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
      - /dev/nvidia-uvm-tools
    networks:
      - net
volumes:
  nvidia_driver:
networks:
  net:
    driver: bridge
And my /etc/docker/daemon.json file looks as follows:
{"default-runtime":"nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
So, it seems like the error is somehow related to configuring the nvidia runtime, but more importantly, I am almost certain that I didn't set up my docker-compose file correctly. So, my questions are:
Is it actually possible to do what I am trying to do?
If yes, did I set up my docker-compose file correctly (see comments in docker-compose.yml)?
How do I fix the error message I received above?
Thank you very much for your help, I highly appreciate it.
I agree that installing all tensorflow-gpu dependencies is rather painful. Fortunately, it's rather easy with Docker, as you only need the NVIDIA Driver and the NVIDIA Container Toolkit (a sort of plugin). The rest (CUDA, cuDNN) ships inside the Tensorflow images, so you don't need it on the Docker host.
The driver can be deployed as a container too, but I do not recommend that for a workstation. It is meant to be used on servers where there is no GUI (X-server, etc.). The subject of the containerized driver is covered at the end of this post; for now, let's see how to start tensorflow-gpu with docker-compose. The process is the same regardless of whether you have the driver in a container or not.
How to launch Tensorflow-GPU with docker-compose
Prerequisites:
docker & docker-compose
NVIDIA Container Toolkit & NVIDIA Driver
To enable GPU support for a container you need to create the container with NVIDIA Container Toolkit. There are two ways you can do that:
You can configure Docker to always use the nvidia container runtime. It is fine to do so, as it behaves just like the default runtime unless some NVIDIA-specific environment variables are present (more on that later). This is done by placing "default-runtime": "nvidia" into Docker's daemon.json:
/etc/docker/daemon.json:
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
You can select the runtime during container creation. With docker-compose this is only possible with compose file format version 2.3.
Here is a sample docker-compose.yml to launch Tensorflow with GPU:
version: "2.3" # the only version where 'runtime' option is supported
services:
  test:
    image: tensorflow/tensorflow:2.3.0-gpu
    # Make Docker create the container with NVIDIA Container Toolkit
    # You don't need it if you set 'nvidia' as the default runtime in
    # daemon.json.
    runtime: nvidia
    # the lines below are here just to test that TF can see GPUs
    entrypoint:
      - /usr/local/bin/python
      - -c
    command:
      - "import tensorflow as tf; tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)"
By running this with docker-compose up you should see a line with the GPU specs in it. It appears at the end and looks like this:
test_1 | 2021-01-23 11:02:46.500189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 1624 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
And that is all you need to launch an official Tensorflow image with GPU.
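For a quick one-off check without compose, the same test can be run directly with docker run (a sketch, assuming the nvidia runtime is installed as described above):

docker run --rm --runtime=nvidia tensorflow/tensorflow:2.3.0-gpu \
  python -c "import tensorflow as tf; print(tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None))"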
NVIDIA Environment Variables and custom images
As I mentioned, the NVIDIA Container Toolkit behaves like the default runtime unless some variables are present. These are listed and explained here. You only need to care about them if you build a custom image and want to enable GPU support in it. Official Tensorflow images with GPU inherit them from the CUDA images they use as a base, so you only need to start the image with the right runtime as in the example above.
If you are interested in customising a Tensorflow image, I wrote another post on that.
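For illustration only (not from the post), the same NVIDIA variables can also be set from compose on an image that does not already carry them; the image name here is a placeholder:

version: "2.3"
services:
  myapp:
    image: my-custom-image
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility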
Host Configuration for NVIDIA driver in container
As mentioned in the beginning, this is not something you want on a workstation. The process requires you to start the driver container when no other display driver is loaded (via SSH, for example). Furthermore, at the moment of writing only Ubuntu 16.04, Ubuntu 18.04 and CentOS 7 were supported.
There is an official guide and below are extractions from it for Ubuntu 18.04.
Edit 'root' option in NVIDIA Container Toolkit settings:
sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml
Disable the Nouveau driver modules:
sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
&& sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
&& sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"
If you are using an AWS kernel, ensure that the i2c_core kernel module is enabled:
sudo tee /etc/modules-load.d/ipmi.conf <<< "i2c_core"
Update the initramfs:
sudo update-initramfs -u
Now it's time to reboot for the changes to take effect. After the reboot, check that no nouveau or nvidia modules are loaded. The commands below should return nothing:
lsmod | grep nouveau
lsmod | grep nvidia
Starting driver in container
The guide offers a command to run the driver; I prefer docker-compose. Save the following as driver.yml:
version: "3.0"
services:
  driver:
    image: nvidia/driver:450.80.02-ubuntu18.04
    privileged: true
    restart: unless-stopped
    volumes:
      - /run/nvidia:/run/nvidia:shared
      - /var/log:/var/log
    pid: "host"
    container_name: nvidia-driver
Use docker-compose -f driver.yml up -d to start the driver container. It will take a couple of minutes to compile the modules for your kernel. You may use docker logs nvidia-driver -f to follow the process; wait for the 'Done, now waiting for signal' line to appear. Otherwise, use lsmod | grep nvidia to see whether the driver modules are loaded. When it's ready you should see something like this:
nvidia_modeset 1183744 0
nvidia_uvm 970752 0
nvidia 19722240 17 nvidia_uvm,nvidia_modeset
Docker Compose v1.27.0+, compose file format version 3.x (as of 2022):
version: "3.6"
services:
  jupyter-8888:
    image: "tensorflow/tensorflow:latest-gpu-jupyter"
    env_file: "env-file"
    deploy:
      resources:
        reservations:
          devices:
            - driver: "nvidia"
              device_ids: ["0"]
              capabilities: [gpu]
    ports:
      - 8880:8888
    volumes:
      - workspace:/workspace
      - data:/data
If you want to specify different GPU ids, e.g. 0 and 3:
device_ids: ['0', '3']
Managed to get it working by installing WSL2 on my Windows machine and using VS Code along with the Remote-Containers extension. Here is a collection of articles that helped a lot with the installation of WSL2 and using VS Code from within it:
https://learn.microsoft.com/en-us/windows/wsl/install-win10
ubuntu.com/blog/getting-started-with-cuda-on-ubuntu-on-wsl-2
https://code.visualstudio.com/docs/remote/containers
With the Remote-Containers extension of VS Code, you can then set up your devcontainer based on a docker-compose file (or just a Dockerfile, as I did), which is probably better explained in the third link above. One thing for myself to remember is that when defining the .devcontainer.json file you need to make sure to set
// Optional arguments passed to ``docker run ... ``
"runArgs": [
    "--gpus", "all"
]
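For context, a minimal hypothetical .devcontainer/devcontainer.json carrying that setting (the name and image are placeholders chosen for illustration):

{
    "name": "tf-gpu-dev",
    "image": "tensorflow/tensorflow:latest-gpu",
    // Optional arguments passed to ``docker run``
    "runArgs": [
        "--gpus", "all"
    ]
}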
Before VS Code, I had used PyCharm for a long time, so switching to VS Code was quite a pain at first, but VS Code along with WSL2, the Remote-Containers and Pylance extensions has made it quite easy to develop in a container with GPU support. As far as I know, PyCharm doesn't support debugging inside a container in WSL at the moment; see:
https://intellij-support.jetbrains.com/hc/en-us/community/posts/360009752059-Using-docker-compose-interpreter-on-wsl-project-Windows-
https://youtrack.jetbrains.com/issue/WI-53325

Saving logs from docker container to windows file system

I just started learning Docker and am interested in saving logs from a container to my local machine (for storage/review).
Is there a way to save /var/lib/docker/containers/CONTAINER_ID/CONTAINER_ID-json.log to the Windows filesystem?
I tried specifying a volume in my docker-compose.yml file running the image "dtf":
services:
  web:
    image: dtf
    ports:
      - '5000:5000'
    logging:
      driver: "json-file"
      options:
        max-size: "1k"
        max-file: "3"
    volumes:
      - C:\logs:/var/lib/docker/containers/
From what I understood about Docker volumes, I should be able to access the .log file at C:\logs, but I'm not sure how to correctly write the path to the file itself (the /CONTAINER_ID/ part).
For this you need to look up docker volumes. You can expose a part of the host file system to the docker container.
Check out Docker logging strategies which illustrates different approaches to perform logging. The recommended method is Docker logging driver, please check out more at Configuring logging drivers.
As shown in Better ways of handling logging in containers, you can link the log folder with a host folder via a data volume container using this command:
# docker run -ti -v /dev/log:/dev/log fedora sh
The above solution was taken from this Stack Overflow answer; it is reproduced here in case the original link becomes obsolete due to deletion or similar.
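As an alternative sketch (my own, not from the cited answer): instead of reaching for Docker's internal /var/lib/docker/containers path, bind-mount a host folder and have the application write its log files there; /app/logs is a hypothetical in-container log directory.

services:
  web:
    image: dtf
    ports:
      - '5000:5000'
    volumes:
      - C:\logs:/app/logs   # the app must be configured to write its logs to /app/logs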

docker stack ignoring unsupported options

I am running Docker Server Version: 18.06.0-ce on CentOS 7.5.
I have a docker-compose file running a db2 server with the following sample definition:
version: "3.7"
services:
  db2exp:
    image: db2
    ports:
      - "50000:50000"
    networks:
      - lmnet
    ipc: host
    cap_add:
      - IPC_LOCK
      - IPC_OWNER
    environment:
      - DB2INSTANCE=db2inst1
      - DB2PASSWD=db2inst1
      - LICENSE=accept
    volumes:
      - db2data:/home
When using docker-compose up, I do not have problems with starting the db2 service. However when I try to use docker stack, I get the following message:
docker stack deploy test --compose-file docker-compose.yml
Ignoring unsupported options: cap_add, ipc
This causes db2start to return SQL1042C An unexpected system error occurred.
It would be ideal if what runs in compose runs in stack. What, if any, can be done so that the db2 container can be used in a docker stack environment and not just docker-compose?
If it matters, I have docker-compose version 1.23.0-rc1, build 320e4819.
Thanks in advance.
This is not supported by swarm mode currently, as the error message you've shown and the documentation indicate. Personally, I'd question whether you really want to have your database running in swarm mode. Docker does not migrate the volume for you, so you wouldn't see your data if the container were rescheduled on another node.
You can follow the progress of getting this added to Swarm Mode in the github issues, there are several, including:
https://github.com/moby/moby/issues/24862
https://github.com/moby/moby/issues/25885
The hacky solution I've seen, if you really need to run this from swarm mode, is to schedule a container with the docker socket mounted and the docker binaries in the image, which then executes a docker run command directly against the local engine. E.g.:
version: "3.7"
services:
  db2exp-wrapper:
    image: docker:stable
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    command: docker run --rm --cap-add IPC_LOCK --cap-add IPC_OWNER -p 50000:50000 ... db2
I don't really recommend the above solution; sticking with docker-compose would likely be a better fit for your use case. Downsides of this approach include publishing the port only on that single host, and the potential security risk of anyone else having access to that docker socket.
