I'm setting up a Jupyter notebook to run on a remote server, but my code doesn't appear to be using the GPU. It looks like TensorFlow is identifying the GPU, but Keras is somehow missing it. Is there something in my setup process causing this?
I installed nvidia-docker via the GitHub instructions:
# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
$ sudo systemctl restart docker
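As a sanity check at this point, something like the following should show the GPU from inside a throwaway container (the image tag here is only an example; pick one that matches the driver's CUDA version):
# Verify that Docker itself can see the GPU via the NVIDIA container toolkit
$ sudo docker run --rm --gpus all nvidia/cuda:10.2-base nvidia-smi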
I'm ssh'ing into my server:
ssh me@serverstuff
And then on the server running:
docker run -it -p 9999:9999 --name mycontainer -v /mydata:/mycontainer/mydata ufoym/deepo bash
jupyter notebook --ip 0.0.0.0 --port 9999 --no-browser --allow-root
And then opening up a new command prompt on my desktop and running:
ssh -N -f -L localhost:9999:serverstuff:9999 me@serverstuff
Then signing in, opening localhost:9999 in my browser, and logging in with the provided token successfully.
But when I run DL training in my notebook, the speed suggests it isn't using the GPU.
!nvidia-smi
gives:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.86 Driver Version: 430.86 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 730 WDDM | 00000000:01:00.0 N/A | N/A |
| 25% 41C P8 N/A / N/A | 551MiB / 2048MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
and
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
gives:
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7106107654095923441
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 13064397814134284140
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 14665046342845873047
physical_device_desc: "device: XLA_GPU device"
]
and
from keras import backend as K
K.tensorflow_backend._get_available_gpus()
gives:
[]
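For reference, a quick way to check whether this TensorFlow build was compiled with CUDA at all (a sketch, assuming TF 1.x; the helpers differ in TF 2.x) is:
# Run inside the container: was this TensorFlow build compiled with CUDA,
# and can it see a real (non-XLA) GPU device?
python -c "import tensorflow as tf; print(tf.test.is_built_with_cuda()); print(tf.test.is_gpu_available())"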
Try installing another image. I also had problems with custom images, so I went with an official NVIDIA image:
docker pull nvcr.io/nvidia/tensorflow:19.08-py3
There are other versions as well; you can check them out here.
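A sketch of how that image could be started with GPU access (assuming Docker 19.03+ with the NVIDIA container toolkit; the port and volume mapping are taken from your original command):
# Start the NGC TensorFlow image with GPU access
docker run --gpus all -it -p 9999:9999 --name mycontainer -v /mydata:/mycontainer/mydata nvcr.io/nvidia/tensorflow:19.08-py3 bash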
Related
I installed nvidia-docker2 following the instructions here. When I run the following command, I get the expected output, as shown.
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:0B:00.0 On | N/A |
| 24% 31C P8 13W / 250W | 222MiB / 11011MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
However, running the above command without "sudo" results in the following error for me:
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create
failed: runc create failed: unable to start container process: error during container
init: error running hook #0: error running hook: exit status 1, stdout: , stderr:
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1:
cannot open shared object file: no such file or directory: unknown.
Can anyone please help me with how I can solve this problem?
Add your user to the docker group:
sudo usermod -aG docker your_user
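Note that the group change only takes effect in a new login session. A sketch of applying it without logging out:
# Pick up the new group membership in the current terminal (or just log out and back in)
newgrp docker
# The command should then work without sudo
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi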
Update:
Check here https://github.com/NVIDIA/nvidia-docker/issues/539
Maybe something from the comments will help you.
Try adding "sudo" to your docker command,
e.g. sudo docker-compose ...
Problem description
When I run vulkaninfo in docker, it complains:
Cannot create Vulkan instance.
This problem is often caused by a faulty installation of the Vulkan driver or attempting to use a GPU that does not support Vulkan.
ERROR at /build/vulkan-tools-1.3.204.0~rc3-1lunarg20.04/vulkaninfo/vulkaninfo.h:649:vkCreateInstance failed with ERROR_INCOMPATIBLE_DRIVER
It seems like this problem was caused by the driver, so I ran nvidia-smi to check:
Sun Apr 17 03:12:54 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02 Driver Version: 510.60.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:68:00.0 Off | N/A |
| 32% 37C P8 12W / 250W | 18MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1550 G 16MiB |
+-----------------------------------------------------------------------------+
It seems the driver works well. I also checked the environment variable NVIDIA_DRIVER_CAPABILITIES and ran lspci | grep -i vga:
$ echo ${NVIDIA_DRIVER_CAPABILITIES}
all
$ lspci|grep -i vga
19:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
1a:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
67:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
68:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
If I install mesa-vulkan-drivers, vulkaninfo works fine but does not recognize the NVIDIA GPU:
$ apt install mesa-vulkan-drivers
$ vulkaninfo --summary
Devices:
========
GPU0:
apiVersion = 4198582 (1.1.182)
driverVersion = 1 (0x0001)
vendorID = 0x10005
deviceID = 0x0000
deviceType = PHYSICAL_DEVICE_TYPE_CPU
deviceName = llvmpipe (LLVM 12.0.0, 256 bits)
driverID = DRIVER_ID_MESA_LLVMPIPE
driverName = llvmpipe
driverInfo = Mesa 21.2.6 (LLVM 12.0.0)
conformanceVersion = 1.0.0.0
deviceUUID = 00000000-0000-0000-0000-000000000000
driverUUID = 00000000-0000-0000-0000-000000000000
Reproduction details
Host system information:
OS: Ubuntu 16.04 xenial
Kernel: x86_64 Linux 4.4.0-210-generic
CPU: Intel Core i9-10940X CPU @ 4.8GHz
GPU: NVIDIA GeForce RTX 2080 Ti x 4 (driver 510.60.02)
Docker information:
version: 20.10.7
nvidia-container-toolkit version: 1.9.0-1
Docker start command:
$ docker run --rm -it \
--runtime=nvidia \
-e NVIDIA_DRIVER_CAPABILITIES=all \
-e NVIDIA_VISIBLE_DEVICES=3 \
<docker_image_name> /bin/bash
Dockerfile:
FROM nvidia/cudagl:11.4.2-base-ubuntu20.04
ENV NVIDIA_DRIVER_CAPABILITIES compute,graphics,utility
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
libx11-xcb-dev \
libxkbcommon-dev \
libwayland-dev \
libxrandr-dev \
libegl1-mesa-dev \
wget && \
rm -rf /var/lib/apt/lists/*
RUN wget -O - http://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add - && \
wget -O /etc/apt/sources.list.d/lunarg-vulkan-focal.list http://packages.lunarg.com/vulkan/lunarg-vulkan-focal.list && \
apt update && \
apt install -y vulkan-sdk
EDIT1:
I am running over SSH on a headless server, and I want to do offline rendering in Docker.
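One additional check that may be relevant (an assumption on my part, not output from the runs above) is whether the NVIDIA Vulkan ICD is visible inside the container at all, since the loader falls back to llvmpipe when it only finds Mesa ICDs:
# Inside the container: list the Vulkan ICD manifests the loader can find.
# The NVIDIA driver normally ships one named nvidia_icd.json.
$ ls /usr/share/vulkan/icd.d/ /etc/vulkan/icd.d/ 2>/dev/null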
I am trying to use the base images provided by NVIDIA that let us use their GPUs via Docker containers. Because I am using Docker, there is no need for me to have the CUDA Toolkit or cuDNN on my system. All I need is the right driver, which I have.
I can run the official PyTorch Docker containers, and those containers utilize my GPU. However, when I run anything using the base images from NVIDIA, I get the following warning:
$ docker run --gpus all -it --rm -p 8000:8000 ubuntu-cuda-gpu:latest
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
The application executes; it just uses the CPU. But I want to be able to use my GPU, as I can when I run the same code (a simple PyTorch example) using the official PyTorch Docker images.
The base image used is:
FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04
# Setup
RUN apt update && \
apt install -y bash \
build-essential \
git \
curl \
ca-certificates \
python3 \
python3-pip && \
rm -rf /var/lib/apt/lists
# Your stuff
RUN python3 -m pip install --no-cache-dir --upgrade pip && \
python3 -m pip install --no-cache-dir \
torch \
transformers \
...
If I just run the image without any machine learning code and execute nvidia-smi, I get the following output:
$ docker run --gpus all -it --rm -p 8000:8000 ubuntu-cuda-gpu:latest nvidia-smi
Sat Jun 12 19:15:21 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80 Driver Version: 460.80 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3060 Off | 00000000:01:00.0 Off | N/A |
| 0% 31C P8 9W / 170W | 14MiB / 12053MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
This leads me to believe that at least something is right. But why am I not able to use my GPU, and how can I make sure that I can?
I am on Ubuntu 20.04.
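A quick way to compare what the container's driver supports with the CUDA version this PyTorch build expects (a sketch, run inside the container) would be:
# Driver version and the highest CUDA version it supports
$ nvidia-smi
# CUDA version this PyTorch build was compiled against, and whether it can initialize the GPU
$ python3 -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"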
I'm trying to use the GPU from inside my Docker container. I'm using Docker version 19.03 on Ubuntu 18.04.
Outside the docker container if I run nvidia-smi I get the below output.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 30C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
If I run the same thing inside a container created from the nvidia/cuda Docker image, I get the same output as above and everything runs smoothly. torch.cuda.is_available() returns True.
But if I run the same nvidia-smi command inside any other Docker container, it gives the following output, where you can see that the CUDA Version shows as N/A. Inside those containers, torch.cuda.is_available() also returns False.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 30C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I have installed nvidia-container-toolkit using the following commands.
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install nvidia-container-toolkit
sudo systemctl restart docker
I started my containers using the following commands:
sudo docker run --rm --gpus all nvidia/cuda nvidia-smi
sudo docker run -it --rm --gpus all ubuntu nvidia-smi
docker run --rm --gpus all nvidia/cuda nvidia-smi should NOT return CUDA Version: N/A if everything (aka nvidia driver, CUDA toolkit, and nvidia-container-toolkit) is installed correctly on the host machine.
Given that docker run --rm --gpus all nvidia/cuda nvidia-smi returns correctly: I also had the problem of CUDA Version: N/A inside the container, which I had luck in solving:
Please see my answer https://stackoverflow.com/a/64422438/2202107 (obviously you need to adjust and install the matching/correct versions of everything)
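A quick way to see which of those pieces are actually installed on the host (a sketch; package names can vary by distro):
# Host-side versions of the driver and the container runtime pieces
nvidia-smi --query-gpu=driver_version --format=csv,noheader
dpkg -l | grep -E 'nvidia-(docker|container)'
nvidia-container-cli info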
For anybody arriving here looking for how to do this with Docker Compose, add this to your service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          capabilities:
            - gpu
            - utility # nvidia-smi
            - compute # CUDA. Required to avoid "CUDA version: N/A"
            - video   # NVDEC/NVENC. For instance to use a hardware accelerated ffmpeg. Skip it if you don't need it
Doc: https://docs.docker.com/compose/gpu-support
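Once the service is up, you can verify it from inside the running container (the service name is just a placeholder):
# Replace <service> with the name of your service in docker-compose.yml
docker compose exec <service> nvidia-smi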
Note: this is on a headless AWS box accessed over VNC; the current desktop I'm running on is DISPLAY=:1.0.
I am trying to build a container that can hold an OpenGL application, but I'm having trouble getting vglrun to work correctly. I am currently running it with --gpus all on the docker run line as well:
# xhost +si:localuser:root
# docker run --rm -it \
-e DISPLAY=unix$DISPLAY \
-v /tmp/.X11-unix:/tmp/.X11-unix \
--gpus all centos:7 \
sh -c "yum install epel-release -y && \
yum install -y VirtualGL glx-utils && \
vglrun glxgears"
No protocol specified
[VGL] ERROR: Could not open display :0
On the host:
$ nvidia-smi
Tue Jan 28 22:32:24 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44 Driver Version: 440.44 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 00000000:00:1E.0 Off | 0 |
| N/A 30C P8 16W / 150W | 56MiB / 7618MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2387 G /usr/bin/X 55MiB |
+-----------------------------------------------------------------------------+
I can confirm that running glxgears without vglrun works fine, but the application I'm trying to build into Docker inherently uses vglrun. I have also tried the NVIDIA container nvidia/opengl:1.1-glvnd-runtime-centos7, with no success.
Running it with vglrun -d :1.0 glxgears or vglrun -d unix:1.0 glxgears gives me this error:
Error: couldn't get an RGB, Double-buffered visual
What am I doing wrong here? Does vglrun not work in a container?
EDIT: It seems I was approaching this problem the wrong way. It works when I'm on the primary :0 display, but when I use VNC to view display :1, the Mesa drivers get used instead of the NVIDIA ones. Is there a way I can use the GPU on spawned VNC displays?
I ran into the same problem and solved it by setting the variable VGL_DISPLAY:
docker run ... -e VGL_DISPLAY=$DISPLAY ...
Then it worked! Please give it a try.
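Combined with the original command, a sketch might look like this (VGL_DISPLAY tells VirtualGL which X display to use for 3D rendering; whether $DISPLAY or the GPU-backed :0 is the right value depends on your setup):
docker run --rm -it \
    -e DISPLAY=unix$DISPLAY \
    -e VGL_DISPLAY=$DISPLAY \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    --gpus all centos:7 \
    sh -c "yum install -y epel-release && \
           yum install -y VirtualGL glx-utils && \
           vglrun glxgears"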