Building a Docker image with the CUDA runtime

I'm building an image that needs to test GPU usability during the build. GPU containers run fine:
$ docker run --rm --runtime=nvidia nvidia/cuda:9.2-devel-ubuntu18.04 nvidia-smi
Wed Aug  7 07:53:25 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54                 Driver Version: 396.54                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:04:00.0 Off |                  N/A |
| 24%   43C    P8    17W / 250W |   2607MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
but it fails when building with the GPU:
$ cat Dockerfile
FROM nvidia/cuda:9.2-devel-ubuntu18.04
RUN nvidia-smi
# RUN build something
# RUN tests require GPU
$ docker build .
Sending build context to Docker daemon 2.048kB
Step 1/2 : FROM nvidia/cuda:9.2-devel-ubuntu18.04
---> cdf6d16df818
Step 2/2 : RUN nvidia-smi
---> Running in 88f12f9dd7a5
/bin/sh: 1: nvidia-smi: not found
The command '/bin/sh -c nvidia-smi' returned a non-zero code: 127
I'm new to Docker, but I think sanity checks belong in the image build. So how can I build a Docker image with the CUDA runtime available?

Configuring the Docker daemon with --default-runtime=nvidia solved the problem.
Please refer to this wiki for more info.
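For reference, a minimal sketch of that daemon configuration, assuming nvidia-docker2 / nvidia-container-runtime is already installed (the runtime path below is its usual default; note that tee overwrites any existing daemon.json, so merge by hand if you already have one):
$ sudo tee /etc/docker/daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
$ sudo systemctl restart docker
$ docker build .    # with the default runtime set to nvidia, RUN nvidia-smi can work at build time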

Maybe it's because you are using the RUN instruction in the Dockerfile. I'd try CMD (see the documentation for this instruction) or ENTRYPOINT, since those are what docker run executes with its arguments.
I think RUN instructions are for build-time jobs that need to execute before the container becomes available, rather than for a process whose output you want to see.
Good luck with that,
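To illustrate the distinction this answer is drawing, here is a sketch only (the image tag cuda-test is just a placeholder, and whether RUN sees the GPU at build time still depends on the daemon's default runtime, as in the previous answer):
$ cat > Dockerfile <<'EOF'
FROM nvidia/cuda:9.2-devel-ubuntu18.04
# RUN executes at build time; without --default-runtime=nvidia there is no GPU here
RUN echo "build-time step, no GPU assumed"
# CMD executes at run time, where --runtime=nvidia / --gpus all takes effect
CMD ["nvidia-smi"]
EOF
$ docker build -t cuda-test .
$ docker run --rm --runtime=nvidia cuda-test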

Related

How to get my docker container to recognize my nvidia GPU in cloud config script

More details:
I mostly followed these instructions to set up my init script:
https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus
I used the Docker base image:
nvidia/cuda:11.2.1-runtime-ubuntu20.04
My cloud-init ExecStart command for the Docker part is currently (referred to below as MY COMMAND):
docker run --rm --name=myapp -dit -p 80:80 --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidiactl:/dev/nvidiactl <Docker Container Name> <Uvicorn Startup Command>
After SSHing into the running VM,
I did the following to help log which processes are using the GPU:
sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia
On the first pass, I get the following output from /var/lib/nvidia/bin/nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
When I go to the root directory (by doing cd .. twice) and run MY COMMAND, I get the same issue: no processes using the GPU.
However, when I run MY COMMAND in the /home/{username} directory, I get the following output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P0    28W /  70W |  14734MiB / 15109MiB |     13%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4694      C   /usr/bin/python3                 2329MiB |
|    0   N/A  N/A      4695      C   /usr/bin/python3                 2329MiB |
|    0   N/A  N/A      4696      C   /usr/bin/python3                 2521MiB |
|    0   N/A  N/A      4700      C   /usr/bin/python3                 5221MiB |
|    0   N/A  N/A      4701      C   /usr/bin/python3                 2329MiB |
+-----------------------------------------------------------------------------+
My basic question is: how do I do the same thing in my cloud-config script that I was able to do manually in the VM?
I already tried adding a user to my cloud-init script, as in the example from the Google link, and starting Docker with the -u flag, but that ran into permission issues (specifically PermissionError: [Errno 13] Permission denied: '/.cache').
Edit:
I never found a solution that I fully understood, but it turned out I needed to run
sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia
in the same directory where I was running my Docker command, using ExecStartPre.
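For anyone trying to reproduce this, a rough sketch of what that can look like as a systemd drop-in for the unit that runs the container (the unit name myapp.service, the working directory, and the drop-in file name are all hypothetical; in a cloud-init user-data file the same ExecStartPre lines go directly into the unit definition):
$ sudo mkdir -p /etc/systemd/system/myapp.service.d
$ sudo tee /etc/systemd/system/myapp.service.d/10-nvidia-mounts.conf <<'EOF'
[Service]
# run the remount steps before ExecStart, from the directory the container command uses
WorkingDirectory=/home/myuser
ExecStartPre=/bin/mount --bind /var/lib/nvidia /var/lib/nvidia
ExecStartPre=/bin/mount -o remount,exec /var/lib/nvidia
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl restart myapp.service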

Cannot run nvidia-smi inside Docker without sudo

I installed nvidia-docker2 following the instructions here. When I run the following command, I get the expected output shown below.
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:0B:00.0  On |                  N/A |
| 24%   31C    P8    13W / 250W |    222MiB / 11011MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
However, running the above command without "sudo" results in the following error for me:
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create
failed: runc create failed: unable to start container process: error during container
init: error running hook #0: error running hook: exit status 1, stdout: , stderr:
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1:
cannot open shared object file: no such file or directory: unknown.
Can anyone please help me with how I can solve this problem?
Add your user to the docker group:
sudo usermod -aG docker your_user
Update:
Check here https://github.com/NVIDIA/nvidia-docker/issues/539
Maybe something from the comments will help you.
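If the docker-group route is what you need, the full sequence looks roughly like this (the group change only applies to new login sessions, so either log out and back in or use newgrp):
$ sudo usermod -aG docker "$USER"
$ newgrp docker    # or log out and log back in
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi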
try adding "sudo" to you docker command.
e.g sudo docker-compose ...

Using GPU inside docker container - CUDA Version: N/A and torch.cuda.is_available returns False

I'm trying to use the GPU from inside my Docker container. I'm using Docker version 19.03 on Ubuntu 18.04.
Outside the Docker container, if I run nvidia-smi, I get the output below.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
If I run the same thing inside a container created from the nvidia/cuda Docker image, I get the same output as above and everything runs smoothly; torch.cuda.is_available() returns True.
But if I run the same nvidia-smi command inside any other Docker container, it gives the following output, where you can see that the CUDA Version shows as N/A. Inside these containers, torch.cuda.is_available() also returns False.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I have installed nvidia-container-toolkit using the following commands.
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install nvidia-container-toolkit
sudo systemctl restart docker
I started my containers using the following commands:
sudo docker run --rm --gpus all nvidia/cuda nvidia-smi
sudo docker run -it --rm --gpus all ubuntu nvidia-smi
docker run --rm --gpus all nvidia/cuda nvidia-smi should NOT report CUDA Version: N/A if everything (i.e. the NVIDIA driver, the CUDA toolkit, and nvidia-container-toolkit) is installed correctly on the host machine, given that docker run --rm --gpus all nvidia/cuda nvidia-smi itself runs correctly.
I also had a problem with CUDA Version: N/A inside the container, which I had some luck in solving.
Please see my answer at https://stackoverflow.com/a/64422438/2202107 (obviously you need to adjust and install the matching/correct versions of everything).
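As an additional host-side sanity check (assuming the libnvidia-container CLI is present, which the nvidia-container-toolkit packages normally pull in), you can compare what the driver reports with what the container hook will expose:
$ nvidia-smi                          # driver and CUDA version as seen on the host
$ nvidia-container-cli info           # what the NVIDIA container hook reports
$ dpkg -l | grep nvidia-container     # confirm the toolkit packages are installed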
For anybody arriving here looking for how to do this with Docker Compose, add the following to your service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          capabilities:
            - gpu
            - utility  # nvidia-smi
            - compute  # CUDA. Required to avoid "CUDA version: N/A"
            - video    # NVDEC/NVENC. For instance to use a hardware accelerated ffmpeg. Skip it if you don't need it
Doc: https://docs.docker.com/compose/gpu-support
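For context, a minimal sketch of a complete compose file using that block (the service name gpu-check and the image are placeholders; count: all is one valid way to reserve every GPU):
$ cat > docker-compose.yml <<'EOF'
services:
  gpu-check:
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu, utility, compute]
EOF
$ docker compose up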

Vglrun does not work in a Docker container

Note: this is on a headless AWS box accessed over VNC; the current desktop I'm running on is DISPLAY=:1.0.
I am trying to build a container that can hold an OpenGL application, but I'm having trouble getting vglrun to work correctly. I am currently running it with --gpus all on the docker run line as well:
# xhost +si:localuser:root
# docker run --rm -it \
-e DISPLAY=unix$DISPLAY \
-v /tmp/.X11-unix:/tmp/.X11-unix \
--gpus all centos:7 \
sh -c "yum install epel-release -y && \
yum install -y VirtualGL glx-utils && \
vglrun glxgears"
No protocol specified
[VGL] ERROR: Could not open display :0
On the host:
$ nvidia-smi
Tue Jan 28 22:32:24 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P8    16W / 150W |     56MiB /  7618MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2387      G   /usr/bin/X                                    55MiB |
+-----------------------------------------------------------------------------+
I can confirm that running glxgears without vglrun works fine, but the application I'm trying to build into Docker inherently uses vglrun. I have also tried the NVIDIA container nvidia/opengl:1.1-glvnd-runtime-centos7 with no success.
Running it with vglrun -d :1.0 glxgears or vglrun -d unix:1.0 glxgears gives me this error:
Error: couldn't get an RGB, Double-buffered visual
What am I doing wrong here? Does vglrun not work in a container?
EDIT: It seems I was approaching this problem the wrong way. It works when I'm on the primary :0 display, but when using VNC to view display :1, the Mesa drivers get used instead of the NVIDIA ones. Is there a way I can use the GPU on spawned VNC displays?
I ran into the same problem and solved it by setting the variable VGL_DISPLAY:
docker run ... -e VGL_DISPLAY=$DISPLAY ...
Then it worked! Please try it.
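Putting that together with the original command from the question, the run line becomes roughly the following (a sketch only; whether the NVIDIA GPU actually drives the VNC display is a separate issue, as the edit in the question notes):
$ xhost +si:localuser:root
$ docker run --rm -it \
    -e DISPLAY=unix$DISPLAY \
    -e VGL_DISPLAY=$DISPLAY \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    --gpus all centos:7 \
    sh -c "yum install epel-release -y && \
           yum install -y VirtualGL glx-utils && \
           vglrun glxgears"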

Jupyter notebook running in Docker on a remote server: Keras not using GPU

I'm setting up a Jupyter notebook to run on a remote server, but my code appears not to be using the GPU. It looks like TensorFlow is identifying the GPU, but Keras is missing it somehow. Is there something in my setup process leading to this?
I installed nvidia-docker via the GitHub instructions:
# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
$ sudo systemctl restart docker
I'm SSHing into my server:
ssh me@serverstuff
And then on the server running:
docker run -it -p 9999:9999 --name mycontainer -v /mydata:/mycontainer/mydata ufoym/deepo bash
jupyter notebook --ip 0.0.0.0 --port 9999 --no-browser --allow-root
And then opening up a new command prompt on my desktop and running:
ssh -N -f -L localhost:9999:serverstuff:9999 me@serverstuff
Then I sign in, open localhost:9999 in my browser, and log in with the provided token successfully.
But when I run DL training in my notebook, the speed suggests it isn't using the GPU.
!nvidia-smi
gives:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.86       Driver Version: 430.86       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 730      WDDM | 00000000:01:00.0 N/A |                  N/A |
| 25%   41C    P8    N/A /  N/A |    551MiB /  2048MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+
and
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
gives:
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7106107654095923441
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 13064397814134284140
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 14665046342845873047
physical_device_desc: "device: XLA_GPU device"
]
and
from keras import backend as K
K.tensorflow_backend._get_available_gpus()
gives:
[]
Try installing another image. I also had problems with custom images, so I went with an official NVIDIA image:
docker pull nvcr.io/nvidia/tensorflow:19.08-py3
There are other versions as well; you can check them out here.
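A sketch of how the run command from the question might look against that image, assuming Jupyter is available inside it (the NGC TensorFlow containers ship with it); note the --gpus all flag, which the docker run in the question does not pass and which is what exposes the GPU to the container:
$ docker pull nvcr.io/nvidia/tensorflow:19.08-py3
$ docker run --gpus all -it -p 9999:9999 -v /mydata:/mycontainer/mydata \
    nvcr.io/nvidia/tensorflow:19.08-py3 \
    jupyter notebook --ip 0.0.0.0 --port 9999 --no-browser --allow-root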
