Old docker containers are not usable (no GPU) after updating the GPU driver in the host machine

Today, we updated the GPU driver for our host machine, and the new containers that we created are all working fine. However, all of our existing docker containers give the following error when running the nvidia-smi command inside:
Failed to initialize NVML: Driver/library version mismatch
How can we rescue these old containers? Our previous GPU driver version on the host machine was 384.125, and it is now 430.64.
Host Configuration
nvidia-smi gives
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64 Driver Version: 430.64 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-DGXS... Off | 00000000:07:00.0 On | 0 |
| N/A 40C P0 39W / 300W | 182MiB / 32505MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-DGXS... Off | 00000000:08:00.0 Off | 0 |
| N/A 40C P0 39W / 300W | 12MiB / 32508MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-DGXS... Off | 00000000:0E:00.0 Off | 0 |
| N/A 39C P0 40W / 300W | 12MiB / 32508MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-DGXS... Off | 00000000:0F:00.0 Off | 0 |
| N/A 40C P0 38W / 300W | 12MiB / 32508MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1583 G /usr/lib/xorg/Xorg 169MiB |
+-----------------------------------------------------------------------------+
nvcc --version gives
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
dpkg -l | grep -i docker gives
ii dgx-docker-cleanup 1.0-1 amd64 DGX Docker cleanup script
rc dgx-docker-options 1.0-7 amd64 DGX docker daemon options
ii dgx-docker-repo 1.0-1 amd64 docker repository configuration file
ii docker-ce 5:18.09.2~3-0~ubuntu-xenial amd64 Docker: the open-source application container engine
ii docker-ce-cli 5:18.09.2~3-0~ubuntu-xenial amd64 Docker CLI: the open-source application container engine
ii nvidia-container-runtime 2.0.0+docker18.09.2-1 amd64 NVIDIA container runtime
ii nvidia-docker 1.0.1-1 amd64 NVIDIA Docker container tools
rc nvidia-docker2 2.0.3+docker18.09.2-1 all nvidia-docker CLI wrapper
docker version gives
Client:
Version: 18.09.2
API version: 1.39
Go version: go1.10.6
Git commit: 6247962
Built: Sun Feb 10 04:13:50 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.2
API version: 1.39 (minimum version 1.12)
Go version: go1.10.6
Git commit: 6247962
Built: Sun Feb 10 03:42:13 2019
OS/Arch: linux/amd64
Experimental: false

I ran into this issue as well. In my case, I had the line:
apt install -y nvidia-cuda-toolkit
in my Dockerfile. Removing this line resolved the issue. In general, I would recommend using an NVIDIA-provided base image that is compatible with the drivers on your host machine, as in the sketch below.
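For example, a minimal sketch of that approach, assuming the nvidia runtime from nvidia-container-runtime is registered with this Docker 18.09 daemon (the image tag is illustrative; pick one whose CUDA version the 430.64 driver supports, i.e. 10.1 or lower):
# Pull an NVIDIA-provided base image and let the runtime inject the host driver at start time.
docker pull nvidia/cuda:10.1-base
docker run --rm --runtime=nvidia nvidia/cuda:10.1-base nvidia-smi
# On Docker 19.03+ with the NVIDIA Container Toolkit, --gpus all replaces --runtime=nvidia.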

Related

How to properly use current Docker Desktop, WSL 2, and NVidia GPU support

First time posting a question - suggestions welcome to improve the post!
OS: Windows 10 Enterprise, 21H2 19044.2468
WSL2: Ubuntu-20.04
NVIDIA Hardware: A2000 8GB Ampere-based laptop GPU
NVIDIA Driver: Any from 517.88 to 528.24 (current prod. release)
DOCKER DESKTOP: Any from 4.9 to 4.16
I get the following error when I try to use GPU-enabled docker containers:
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create failed:
runc create failed: unable to start container process: error during container init:
error running hook #0: error running hook: exit status 1, stdout: , stderr:
Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: WSL environment detected but no adapters were found: unknown.
I have searched for this error fairly extensively and have found many questions and answers, but they differ from my situation because:
They deal with beta Nvidia drivers, which is dated
They deal with docker version 19 and prior, not docker desktop 20 and up
They are based on the windows insider channel, which is dated
They are relevant for older and less supported hardware - this is on an Ampere based card
They relate to the nvidia-docker2 package, which is deprecated
They may relate to using the docker toolkit which does not follow the Nvidia instructions.
I followed the instructions at https://docs.nvidia.com/cuda/wsl-user-guide/index.html carefully. I also followed instructions at https://docs.docker.com/desktop/windows/wsl/. This install was previously working for me, and suddenly stopped one day without my intervention. Obviously, something changed, but I wiped all the layers of software here and reinstalled to no avail. Then I tried on other systems which previously were working, but which were outside the enterprise domain, and they too had this issue.
Relevant info from nvidia-smi in windows, wsl, and the dxdiag:
PS C:\> nvidia-smi
Mon Jan 30 10:37:23 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 517.88 Driver Version: 517.88 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A200... WDDM | 00000000:01:00.0 On | N/A |
| N/A 55C P8 14W / N/A | 1431MiB / 8192MiB | 6% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1880 C+G C:\Windows\System32\dwm.exe N/A |
| 0 N/A N/A 5252 C+G ...3d8bbwe\CalculatorApp.exe N/A |
| 0 N/A N/A 6384 C+G ...werToys.ColorPickerUI.exe N/A |
| 0 N/A N/A 6764 C+G ...2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 7024 C+G ...artMenuExperienceHost.exe N/A |
| 0 N/A N/A 10884 C+G ...ll\Dell Pair\DellPair.exe N/A |
| 0 N/A N/A 11044 C+G ...\AppV\AppVStreamingUX.exe N/A |
| 0 N/A N/A 11944 C+G ...me\Application\chrome.exe N/A |
| 0 N/A N/A 12896 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 14372 C+G ...---------<edit>---------- N/A |
| 0 N/A N/A 14748 C+G ...\PowerToys.FancyZones.exe N/A |
| 0 N/A N/A 15472 C+G ...ontend\Docker Desktop.exe N/A |
| 0 N/A N/A 16500 C+G ...werToys.PowerLauncher.exe N/A |
| 0 N/A N/A 17356 C+G ...---------<edit>---------- N/A |
| 0 N/A N/A 17944 C+G ...ge\Application\msedge.exe N/A |
| 0 N/A N/A 18424 C+G ...t\Teams\current\Teams.exe N/A |
| 0 N/A N/A 18544 C+G ...y\ShellExperienceHost.exe N/A |
| 0 N/A N/A 18672 C+G ...oft OneDrive\OneDrive.exe N/A |
| 0 N/A N/A 19216 C+G ...t\Teams\current\Teams.exe N/A |
| 0 N/A N/A 20740 C+G ...ck\app-4.29.149\slack.exe N/A |
| 0 N/A N/A 22804 C+G ...lPanel\SystemSettings.exe N/A |
+-----------------------------------------------------------------------------+
On the WSL side
$ nvidia-smi
Mon Jan 30 10:35:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01 Driver Version: 517.88 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A200... On | 00000000:01:00.0 On | N/A |
| N/A 54C P8 14W / N/A | 1276MiB / 8192MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 23 G /Xwayland N/A |
+-----------------------------------------------------------------------------+
You can see from this that the card is in WDDM mode, which addresses a point raised in answers to previous questions as well.
cat /proc/version
Linux version 5.15.79.1-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC)
9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Wed Nov 23 01:01:46 UTC 2022
The WSL kernel version seems fine.
Running dxdiag in windows produced a result with WHQL signed drivers with no problems reported. Everything in the hardware is exactly as it was when the containers worked previously.
This led me to query lspci with:
$ lspci | grep -i nvidia
$
So this points to the root of the issue. My card is certainly CUDA capable, and my system has the proper drivers and CUDA toolkit, but WSL2 doesn't list any NVIDIA devices. There is only a 3D controller: Microsoft Corp. Basic Render Driver. Because of this I ran update-pciids:
$ sudo update-pciids
... (it pulled 270k)
Done.
$ lspci | grep -i nvidia
$
Blank again...
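(As an aside: on a standard WSL2 GPU setup the adapter is not expected to appear on the PCI bus at all; it is exposed through the /dev/dxg paravirtualization device, with the driver libraries mounted in from Windows. The paths below are the usual WSL locations, not taken from this machine, so treat this as a rough sanity check rather than a definitive test.)
# Check the WSL GPU device and the Windows-provided driver libraries.
ls -l /dev/dxg
ls /usr/lib/wsl/lib | grep -i nvidia
nvidia-smi   # should report the Windows driver version, as shown above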
I reverted drivers to the 514.08 version to see if it would help. No dice.
Out of curiosity, I tried this entire clean install process with and without the linux CUDA install, with and without the windows CUDA toolkit, and with and without the docker nvidia toolkit. None of these changed the behavior. I tried with about 10 different versions of the nvidia driver and nvidia cuda toolkit as well. All have the same behavior, but in the CLI mode I can run current releases of everything. I just can't publish that as the fix, see below.
I was able to work around this by installing the Docker CLI directly within Ubuntu, and then using the <edit: docker engine / docker-ce packages currently sourced from Docker's repo. However, this option does not work with solutions that mount the docker-data folder onto an external hard drive mounted in Windows, which is a requirement for the system (there are many reasons for this, but Docker Desktop does solve them with its WSL repo locations). /edit>
I replicated this issue with a Windows 11 machine, brought the machine fully up to date, and the issue still persisted. Thus, I have ruled out the windows 10 build version as the root cause. Plus, both systems worked previously, and neither had any updates applied manually. Both may have had automatic updates applied, however, the enterprise system cannot be controlled, and the Windows 11 machine was brought fully up to date without fixing the issue. Both machines have the issue where lspci does not see the card.
Am I just an idiot? Have I failed to set something up properly? I don't understand how my own setup could fail after it was working properly. I have followed NVIDIA and Docker directions to a T. So many processes update dependencies prior to installing that it feels impossible to track. But at the same time, something had to have changed, and others must be experiencing this issue. Thanks for reading along this far! Suggestions on revision welcome.

Hardware Acceleration for Headless Chrome on WSL2 Docker

My goal is to server-side render a frame of a WebGL2 application in a performant way.
To do so, I'm using Puppeteer with headless Chrome inside a Docker container running Ubuntu 20.04.4 on WSL2 on Windows 10 22H2.
However, I can't get any hardware acceleration to become active.
It seems that Chrome doesn't detect my Nvidia GTX1080 card, while on the host system with headless: false it's being used and increases render performance drastically.
I have followed the tutorial here to install the CUDA toolkit on WSL2.
A container spun up using sudo docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark uses the GTX1080.
Running nvidia-smi inside the container shows the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.65 Driver Version: 527.37 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| N/A 49C P8 10W / N/A | 871MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
Using flags like
--ignore-gpu-blocklist
--enable-gpu-rasterization
--enable-zero-copy
--use-gl=desktop / --use-gl=egl
did not solve the problem.
I also tried libosmesa6 with --use-gl=osmesa which had no effect.
I'm starting my container with docker-compose, including the following section in the service's block:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
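Before digging further into Chrome flags, it may be worth confirming the reservation actually reaches the container. A quick check, assuming a hypothetical service name of chrome (substitute your own):
# Start the service and run nvidia-smi inside it; the GTX 1080 should be listed.
docker-compose up -d chrome
docker-compose exec chrome nvidia-smi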
Chrome version is HeadlessChrome/108.0.5351.0.
The Chrome WebGL context's WEBGL_debug_renderer_info tells me
vendor: Google Inc. (Google)
renderer: ANGLE (Google, Vulkan 1.3.0 (SwiftShader Device (Subzero) (0x0000C0DE)), SwiftShader driver)
Visiting chrome://gpu with Puppeteer shows me the following:
[screenshot of chrome://gpu output]
Any ideas what might still be missing to have Chrome use the GPU for hardware acceleration?

Cannot find CUDA device even if installed on the system when accessed from pytorch

NVCC version
C:\Users\acute>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:24:09_Pacific_Daylight_Time_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0
NVIDIA SMI version
C:\Users\acute>nvidia-smi
Mon May 02 00:12:20 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 431.02 Driver Version: 431.02 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1650 WDDM | 00000000:01:00.0 Off | N/A |
| N/A 43C P8 5W / N/A | 134MiB / 4096MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
C:\Users\acute>pip freeze
certifi==2021.10.8
charset-normalizer==2.0.12
idna==3.3
numpy==1.22.3
Pillow==9.1.0
requests==2.27.1
torch==1.11.0+cu113
torchaudio==0.11.0+cu113
torchvision==0.12.0+cu113
typing_extensions==4.2.0
urllib3==1.26.9
Accessing from torch
>>> torch.cuda.is_available()
C:\Users\acute\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\cuda\__init__.py:82: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\c10\cuda\CUDAFunctions.cpp:112.)
return torch._C._cuda_getDeviceCount() > 0
False
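For reference, a quick way to compare the two CUDA versions involved (nothing below is specific to this machine; the commands are generic):
# The driver reports the highest CUDA version it supports (10.2 in the output above),
# while the installed wheels are built against CUDA 11.3 (the +cu113 suffix), so
# comparing the two is the first sanity check.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"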

nvidia driver support on ubuntu in docker from host windows - 'Found no NVIDIA driver on your system' error

I have built a Docker image: ubuntu20.04 + py38 + torch, various libs (llvmlite, cuda, pyyaml, etc.) + my Flask app. The app uses torch and needs the NVIDIA driver inside the container. The host machine is Win10 x64.
When running the container and testing it with Postman, this error appeared:
<head>
<title>AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx // Werkzeug Debugger</title>
On my machine nvidia-smi is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 442.92 Driver Version: 442.92 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 166... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 40C P8 3W / N/A | 382MiB / 6144MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 6212 C+G ...ta\Local\Postman\app-7.31.0\Postman.exe N/A |
| 0 6752 C+G ...are\Brave-Browser\Application\brave.exe N/A |
+-----------------------------------------------------------------------------+
This has been asked many times on SO, and the traditional answers are that NVIDIA GPU acceleration is not available to a Linux Docker container running on a Windows host.
I found similar answers, and I have read the question and answers to this question, but those solutions involve an Ubuntu host + a Docker image with Ubuntu inside.
this link instructs how to use nvidia-docker2, but nvidia-docker2 is deprecated according to this answer
The official nvidia-docker repo has instructions - but for linux host only.
But there is also this Docker with the WSL 2 (Linux) backend installed - can it be used?
Is there still a way to make an Ubuntu container use the NVIDIA GPU from a host Windows machine?
It looks like you can now run Docker in Ubuntu with the Windows Subsystem for Linux (WSL 2) and do GPU pass-through.
This link goes through installation, setup and running a TensorFlow Jupyter notebook with GPU support:
https://ubuntu.com/blog/getting-started-with-cuda-on-ubuntu-on-wsl-2
Note - I haven't done this myself yet.
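As a rough sketch of the verification step from that guide (the image tag is the one used elsewhere on this page and is an assumption, not something I have run here): once WSL 2, the Windows NVIDIA driver, Docker, and the NVIDIA Container Toolkit are installed, a CUDA container launched from the Ubuntu shell should see the host GPU:
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi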

Cannot use GPU on Minikube with Docker driver

Goal:
I'm trying to use Nvidia GPU capabilities on a Minikube cluster that uses the default Docker driver.
Problem:
I'm able to use nvidia-docker with the default docker context, but when switching to minikube docker-env I get the following error:
$ docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled
Environment:
Ubuntu 18.04
Minikube v1.10.0
Docker version:
$ docker version
Client: Docker Engine - Community
Version: 19.03.10
API version: 1.40
Go version: go1.13.10
Git commit: 9424aeaee9
Built: Thu May 28 22:16:49 2020
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 19.03.2
API version: 1.40 (minimum version 1.12)
Go version: go1.12.9
Git commit: 6a30dfca03
Built: Wed Sep 11 22:45:55 2019
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.3.3-14-g449e9269
GitCommit: 449e926990f8539fd00844b26c07e2f1e306c760
runc:
Version: 1.0.0-rc10
GitCommit:
docker-init:
Version: 0.18.0
GitCommit:
Nvidia Container Runtime version:
$ nvidia-container-runtime --version
runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
spec: 1.0.1-dev
Additional Info:
The cluster was created with:
minikube start --cpus 3 --memory 8G
The following minikube addons are currently enabled:
$ minikube addons list
|-----------------------------|----------|--------------|
| ADDON NAME | PROFILE | STATUS |
|-----------------------------|----------|--------------|
| dashboard | minikube | disabled |
| default-storageclass | minikube | enabled ✅ |
| efk | minikube | disabled |
| freshpod | minikube | disabled |
| gvisor | minikube | disabled |
| helm-tiller | minikube | disabled |
| ingress | minikube | disabled |
| ingress-dns | minikube | disabled |
| istio | minikube | disabled |
| istio-provisioner | minikube | disabled |
| logviewer | minikube | disabled |
| metallb | minikube | disabled |
| metrics-server | minikube | disabled |
| nvidia-driver-installer | minikube | enabled ✅ |
| nvidia-gpu-device-plugin | minikube | enabled ✅ |
| registry | minikube | disabled |
| registry-aliases | minikube | disabled |
| registry-creds | minikube | disabled |
| storage-provisioner | minikube | enabled ✅ |
| storage-provisioner-gluster | minikube | disabled |
|-----------------------------|----------|--------------|
And this is a working example outside the minikube context:
$ docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Fri Jun 5 09:23:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59 Driver Version: 440.59 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 106... Off | 00000000:01:00.0 On | N/A |
| 0% 51C P8 6W / 120W | 1293MiB / 6077MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
This is a community wiki answer. Feel free to edit and expand it if needed.
Nvidia GPU is not officially supported with the docker driver for Minikube. This leaves you with two possible options:
Try to use the NVIDIA Container Toolkit and the NVIDIA device plugin. This is a workaround and might not be the best solution for your use case.
Use the KVM2 driver or the None driver. These two are officially supported and documented; a rough sketch of the KVM2 route is shown below.
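A rough sketch of the KVM2 route, assuming the flags documented for minikube versions around v1.10 (names may differ in newer releases, and GPU passthrough also requires IOMMU/VFIO configured on the host):
# Recreate the cluster on the kvm2 driver with GPU passthrough, reusing the original sizing.
minikube start --driver=kvm2 --kvm-gpu --cpus 3 --memory 8G
minikube addons enable nvidia-gpu-device-plugin
minikube addons enable nvidia-driver-installer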
I hope it helps.
