Note: This is a headless AWS box accessed over VNC; the desktop I'm currently running on is DISPLAY=:1.0.
I am trying to build a container that can run an OpenGL application, but I'm having trouble getting vglrun to work correctly. I am currently passing --gpus all on the docker run line as well:
# xhost +si:localuser:root
# docker run --rm -it \
-e DISPLAY=unix$DISPLAY \
-v /tmp/.X11-unix:/tmp/.X11-unix \
--gpus all centos:7 \
sh -c "yum install epel-release -y && \
yum install -y VirtualGL glx-utils && \
vglrun glxgears"
No protocol specified
[VGL] ERROR: Could not open display :0
On the host:
$ nvidia-smi
Tue Jan 28 22:32:24 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44 Driver Version: 440.44 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 00000000:00:1E.0 Off | 0 |
| N/A 30C P8 16W / 150W | 56MiB / 7618MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2387 G /usr/bin/X 55MiB |
+-----------------------------------------------------------------------------+
I can confirm that running glxgears without vglrun works fine, but the application I'm trying to build into Docker inherently uses vglrun. I have also tried the NVIDIA container nvidia/opengl:1.1-glvnd-runtime-centos7 with no success.
Running it with vglrun -d :1.0 glxgears or vglrun -d unix:1.0 glxgears gives me this error:
Error: couldn't get an RGB, Double-buffered visual
What am I doing wrong here? Does vglrun not work in a container?
EDIT: It seems I was approaching this problem the wrong way. It works when I'm on the primary :0 display, but when using VNC to view display :1, the Mesa drivers get used instead of the NVIDIA ones. Is there a way to use the GPU on spawned VNC displays?
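For reference, a quick way to check which OpenGL driver a given display is actually using is glxinfo (from glx-utils / mesa-utils); a sketch, using the display numbers from above:
$ DISPLAY=:0 glxinfo | grep -i "opengl renderer"   # GPU-attached X server
$ DISPLAY=:1 glxinfo | grep -i "opengl renderer"   # VNC desktop (reports a Mesa renderer in this case)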
I ran into the same problem and solved it by setting the VGL_DISPLAY variable:
docker run ... -e VGL_DISPLAY=$DISPLAY ...
Then it worked. Please give it a try.
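Applied to the command from the question, that looks roughly like this (a sketch; VGL_DISPLAY is the X display VirtualGL performs the 3D rendering on, passed through from the host as in this answer):
# xhost +si:localuser:root
# docker run --rm -it \
    -e DISPLAY=unix$DISPLAY \
    -e VGL_DISPLAY=$DISPLAY \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    --gpus all centos:7 \
    sh -c "yum install -y epel-release && \
           yum install -y VirtualGL glx-utils && \
           vglrun glxgears"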
Related
I am trying to use the base images provided by NVIDIA that let us use their GPUs via Docker containers. Because I am using Docker, there is no need for me to have the CUDA Toolkit or cuDNN on my system; all I need is the right driver, which I have.
I can run the official PyTorch Docker containers and they utilize my GPU. However, when I run anything using the base images from NVIDIA, I get the following warning:
$ docker run --gpus all -it --rm -p 8000:8000 ubuntu-cuda-gpu:latest
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
The application executes; it just uses the CPU. But I want to be able to use my GPU, as I can when I run the same code (a simple PyTorch example) using the official PyTorch Docker images.
The base image used is:
FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04
# Setup
RUN apt update && \
    apt install -y bash \
        build-essential \
        git \
        curl \
        ca-certificates \
        python3 \
        python3-pip && \
    rm -rf /var/lib/apt/lists
# Your stuff
RUN python3 -m pip install --no-cache-dir --upgrade pip && \
    python3 -m pip install --no-cache-dir \
        torch \
        transformers \
        ...
If I just run the image without any machine learning code and execute nvidia-smi, I get this output:
$ docker run --gpus all -it --rm -p 8000:8000 ubuntu-cuda-gpu:latest nvidia-smi
Sat Jun 12 19:15:21 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80 Driver Version: 460.80 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3060 Off | 00000000:01:00.0 Off | N/A |
| 0% 31C P8 9W / 170W | 14MiB / 12053MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
This leads me to believe that at least something is set up correctly. But why am I not able to use my GPU, and how can I make sure that I can?
I am on Ubuntu 20.04.
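A quick way to narrow this down is to compare the CUDA version the PyTorch wheel in the image was built against with what the driver reports inside that same container; a sketch, using the ubuntu-cuda-gpu:latest image built from the Dockerfile above:
$ docker run --rm --gpus all ubuntu-cuda-gpu:latest nvidia-smi
$ docker run --rm --gpus all ubuntu-cuda-gpu:latest \
    python3 -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
One common cause of Error 804 is a container CUDA runtime that is newer than what the host driver supports, since the forward-compatibility path it falls back to only works on data-center GPUs; matching the torch/CUDA version to the driver (or upgrading the driver) is the usual fix.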
I'm trying to use the GPU from inside my Docker container. I'm using Docker version 19.03 on Ubuntu 18.04.
If I run nvidia-smi outside the Docker container, I get the output below.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 30C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
If I run the same thing inside a container created from the nvidia/cuda Docker image, I get the same output as above and everything runs smoothly; torch.cuda.is_available() returns True.
But if I run the same nvidia-smi command inside any other Docker container, it gives the following output, where you can see that the CUDA Version shows as N/A. Inside these containers, torch.cuda.is_available() also returns False.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 30C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I have installed nvidia-container-toolkit using the following commands.
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install nvidia-container-toolkit
sudo systemctl restart docker
I started my containers using the following commands:
sudo docker run --rm --gpus all nvidia/cuda nvidia-smi
sudo docker run -it --rm --gpus all ubuntu nvidia-smi
docker run --rm --gpus all nvidia/cuda nvidia-smi should NOT report CUDA Version: N/A if everything (i.e., the NVIDIA driver, CUDA toolkit, and nvidia-container-toolkit) is installed correctly on the host machine.
Given that docker run --rm --gpus all nvidia/cuda nvidia-smi returns correct output: I also had a problem with CUDA Version: N/A inside the container, which I had luck in solving.
Please see my answer https://stackoverflow.com/a/64422438/2202107 (obviously you need to adjust and install the matching/correct versions of everything).
For anybody arriving here looking for how to do this with Docker Compose, add this to your service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          capabilities:
            - gpu
            - utility # nvidia-smi
            - compute # CUDA. Required to avoid "CUDA version: N/A"
            - video   # NVDEC/NVENC. For instance to use a hardware accelerated ffmpeg. Skip it if you don't need it
Doc: https://docs.docker.com/compose/gpu-support
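For plain docker run rather than Compose, the equivalent is to request those driver capabilities explicitly; a sketch (the --gpus quoting may need adjusting for your shell):
docker run --rm --gpus 'all,"capabilities=compute,utility"' ubuntu nvidia-smi
docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility ubuntu nvidia-smi
Both forms ask the NVIDIA runtime to mount the compute libraries as well as the utilities, which is typically what makes nvidia-smi report a CUDA version instead of N/A.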
I want to use Docker 19.03 or above in order to have GPU support. I currently have Docker 19.03.12 on my system. I can run this command to check that the NVIDIA drivers are working:
docker run -it --rm --gpus all ubuntu nvidia-smi
Wed Jul 1 14:25:55 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64 Driver Version: 430.64 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 107... Off | 00000000:01:00.0 Off | N/A |
| 26% 54C P5 13W / 180W | 734MiB / 8119MiB | 39% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Also, when run locally, my module works with GPU support just fine. But if I build a Docker image and try to run it, I get this message:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
I am using CUDA 9.0 with TensorFlow 1.12.0, but I can switch to CUDA 10.0 with TensorFlow 1.15.
As I understand it, the problem is that I am probably using an older Dockerfile whose commands are not compatible with the new GPU-enabled Docker versions (19.03 and above).
The actual commands are these:
FROM nvidia/cuda:9.0-base-ubuntu16.04
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        cuda-command-line-tools-9-0 \
        cuda-cublas-9-0 \
        cuda-cufft-9-0 \
        cuda-curand-9-0 \
        cuda-cusolver-9-0 \
        cuda-cusparse-9-0 \
        libcudnn7=7.0.5.15-1+cuda9.0 \
        libnccl2=2.2.13-1+cuda9.0 \
        libfreetype6-dev \
        libhdf5-serial-dev \
        libpng12-dev \
        libzmq3-dev \
        pkg-config \
        software-properties-common \
        unzip \
        && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
RUN apt-get update && \
    apt-get install nvinfer-runtime-trt-repo-ubuntu1604-4.0.1-ga-cuda9.0 && \
    apt-get update && \
    apt-get install libnvinfer4=4.1.2-1+cuda9.0
I could not find a base Dockerfile for basic GPU usage either.
In this answer there was a proposal to expose libcuda.so.1, but it did not work in my case.
So, is there any workaround for this problem, or a base Dockerfile I can adapt?
My system is Ubuntu 16.04.
Edit:
I just noticed that nvidia-smi from within Docker does not display any CUDA version:
CUDA Version: N/A
in contrast with the one run locally. So this probably means that no CUDA is loaded inside Docker for some reason.
tldr;
A base Dockerfile which seems to work with Docker 19.03+ and CUDA 10 is this:
FROM nvidia/cuda:10.0-base
which can be combined with TF 1.14, but for some reason I could not find TF 1.15.
I just used this Dockerfile to test it:
FROM nvidia/cuda:10.0-base
CMD nvidia-smi
longer answer:
Well, after a lot of trial and error (and frustration) I managed to make it work with Docker 19.03.12 + CUDA 10 (although with TF 1.14, not 1.15).
I used the code from this post and the base Dockerfiles provided there.
First I checked nvidia-smi from within Docker using this Dockerfile:
FROM nvidia/cuda:10.0-base
CMD nvidia-smi
$ docker build -t gpu_test .
...
$ docker run -it --gpus all gpu_test
Fri Jul 3 07:31:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64 Driver Version: 430.64 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 107... Off | 00000000:01:00.0 Off | N/A |
| 45% 65C P2 142W / 180W | 8051MiB / 8119MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
which finally seems to find the CUDA libraries: CUDA Version: 10.1.
Then I made a minimal Dockerfile to test that the TensorFlow binary libraries load successfully within Docker:
FROM nvidia/cuda:10.0-base
# The following just declare variables and are ultimately used to select the Python version
ARG USE_PYTHON_3_NOT_2=True
ARG _PY_SUFFIX=${USE_PYTHON_3_NOT_2:+3}
ARG PYTHON=python${_PY_SUFFIX}
ARG PIP=pip${_PY_SUFFIX}
RUN apt-get update && apt-get install -y \
    ${PYTHON} \
    ${PYTHON}-pip
RUN ${PIP} install tensorflow_gpu==1.14.0
COPY bashrc /etc/bash.bashrc
RUN chmod a+rwx /etc/bash.bashrc
WORKDIR /src
COPY *.py /src/
ENTRYPOINT ["python3", "tf_minimal.py"]
and tf_minimal.py was simply:
import tensorflow as tf
print(tf.__version__)
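If you also want to verify that TensorFlow can actually see the GPU (not just that its libraries load), a slightly extended script might look like this; tf.test.is_gpu_available() is the TF 1.x API, so this is a sketch assuming TF 1.14:
import tensorflow as tf

print(tf.__version__)
# TF 1.x check: True if a CUDA-capable GPU is visible to TensorFlow
print("GPU available:", tf.test.is_gpu_available())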
and for completeness, here is the bashrc file I am using:
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# ==============================================================================
export PS1="\[\e[31m\]tf-docker\[\e[m\] \[\e[33m\]\w\[\e[m\] > "
export TERM=xterm-256color
alias grep="grep --color=auto"
alias ls="ls --color=auto"
echo -e "\e[1;31m"
cat<<TF
________ _______________
___ __/__________________________________ ____/__ /________ __
__ / _ _ \_ __ \_ ___/ __ \_ ___/_ /_ __ /_ __ \_ | /| / /
_ / / __/ / / /(__ )/ /_/ / / _ __/ _ / / /_/ /_ |/ |/ /
/_/ \___//_/ /_//____/ \____//_/ /_/ /_/ \____/____/|__/
TF
echo -e "\e[0;33m"
if [[ $EUID -eq 0 ]]; then
cat <<WARN
WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.
To avoid this, run the container by specifying your user's userid:
$ docker run -u \$(id -u):\$(id -g) args...
WARN
else
cat <<EXPL
You are running this container as user with ID $(id -u) and group $(id -g),
which should map to the ID and group for your user on the Docker host. Great!
EXPL
fi
# Turn off colors
echo -e "\e[m"
I'm setting up a Jupyter notebook to run on a remote server, but my code appears not to be using the GPU. It looks like TensorFlow is identifying the GPU, but Keras is missing it somehow. Is there something in my setup process causing this?
I installed NVIDIA Docker via the GitHub instructions:
# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
$ sudo systemctl restart docker
I'm ssh'ing into my server:
ssh me@serverstuff
And then on the server running:
docker run -it -p 9999:9999 --name mycontainer -v /mydata:/mycontainer/mydata ufoym/deepo bash
jupyter notebook --ip 0.0.0.0 --port 9999 --no-browser --allow-root
And then opening up a new command prompt on my desktop and running:
ssh -N -f -L localhost:9999:serverstuff:9999 me@serverstuff
Then I sign in, open localhost:9999 in my browser, and log in with the provided token successfully.
But when I run DL training in my notebook, the speed suggests it isn't using the GPU.
!nvidia-smi
gives:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.86 Driver Version: 430.86 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 730 WDDM | 00000000:01:00.0 N/A | N/A |
| 25% 41C P8 N/A / N/A | 551MiB / 2048MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
and
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
gives:
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7106107654095923441
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 13064397814134284140
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 14665046342845873047
physical_device_desc: "device: XLA_GPU device"
]
and
from keras import backend as K
K.tensorflow_backend._get_available_gpus()
gives:
[]
Try installing another image. I also had problems with custom images, so I went with an official NVIDIA image:
docker pull nvcr.io/nvidia/tensorflow:19.08-py3
There are other versions as well; you can check them out here.
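A sketch of how the suggested image could be run with GPU access and the same notebook port and volume as in the question (note that --gpus all, or the NVIDIA runtime, is needed for the container to see the GPU at all):
docker run --gpus all -it -p 9999:9999 \
    -v /mydata:/mycontainer/mydata \
    nvcr.io/nvidia/tensorflow:19.08-py3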
I'm building an image which requires testing GPU usability during the build. GPU containers run fine:
$ docker run --rm --runtime=nvidia nvidia/cuda:9.2-devel-ubuntu18.04 nvidia-smi
Wed Aug 7 07:53:25 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54 Driver Version: 396.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 00000000:04:00.0 Off | N/A |
| 24% 43C P8 17W / 250W | 2607MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
but the build fails when it needs the GPU:
$ cat Dockerfile
FROM nvidia/cuda:9.2-devel-ubuntu18.04
RUN nvidia-smi
# RUN build something
# RUN tests require GPU
$ docker build .
Sending build context to Docker daemon 2.048kB
Step 1/2 : FROM nvidia/cuda:9.2-devel-ubuntu18.04
---> cdf6d16df818
Step 2/2 : RUN nvidia-smi
---> Running in 88f12f9dd7a5
/bin/sh: 1: nvidia-smi: not found
The command '/bin/sh -c nvidia-smi' returned a non-zero code: 127
I'm new to Docker, but I think we need sanity checks when building an image. So how can I build a Docker image with the CUDA runtime available?
Configuring the Docker daemon with --default-runtime=nvidia solved the problem.
Please refer to this wiki for more info.
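Concretely, this is usually done in /etc/docker/daemon.json; a sketch, assuming the nvidia runtime binary installed by the NVIDIA container packages, followed by a daemon restart:
sudo tee /etc/docker/daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker
With the default runtime set to nvidia, RUN steps during docker build also go through the NVIDIA runtime, which is why nvidia-smi becomes available at build time.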
Maybe it's because you are using the RUN command in the Dockerfile. I'd try CMD (see the documentation for this command) or ENTRYPOINT, since those are what docker run invokes with its arguments.
I think RUN commands are for build-time steps you need to execute before the container is available, rather than for a process with output and such.
Good luck with that.