nvidia driver support on ubuntu in docker from host windows - 'Found no NVIDIA driver on your system' error - docker

I have built a Docker image: Ubuntu 20.04 + Python 3.8 + torch, plus various libs (llvmlite, cuda, pyyaml, etc.) + my Flask app. The app uses torch, and it needs the NVIDIA driver inside the container. The host machine is Windows 10 x64.
When running the container and testing it with Postman, this error appeared:
<head>
<title>AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx // Werkzeug Debugger</title>
On my machine nvidia-smi is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 442.92 Driver Version: 442.92 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 166... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 40C P8 3W / N/A | 382MiB / 6144MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 6212 C+G ...ta\Local\Postman\app-7.31.0\Postman.exe N/A |
| 0 6752 C+G ...are\Brave-Browser\Application\brave.exe N/A |
+-----------------------------------------------------------------------------+
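Inside the container, the failure can be reproduced without going through Flask/Postman at all (a minimal check, assuming python3 is on the image's PATH and torch is installed; the container name is a placeholder):
$ docker exec -it <my-flask-container> python3 -c "import torch; print(torch.cuda.is_available())"
# expected to print False when the container cannot see the driver; calling .cuda() on a tensor raises the same AssertionError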
It has been asked many times on SO, and the traditional answer is that NVIDIA cannot provide GPU acceleration to a Linux Docker container from a Windows host.
I found similar answers. I have read the question and answers to this question, but those solutions involve an Ubuntu host + a Docker image with Ubuntu inside.
This link explains how to use nvidia-docker2, but nvidia-docker2 is deprecated according to this answer.
The official nvidia-docker repo has instructions - but for a Linux host only.
But I also have Docker Desktop installed with the WSL (Linux) backend - can that be used?
Is there still a way to make an Ubuntu container use the NVIDIA GPU from a Windows host machine?

It looks like you can now run Docker in Ubuntu with the Windows Subsystem for Linux (WSL 2) and do GPU pass-through.
This link goes through installation, setup and running a TensorFlow Jupyter notebook with GPU support:
https://ubuntu.com/blog/getting-started-with-cuda-on-ubuntu-on-wsl-2
Note - I haven't done this myself yet.
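Once WSL 2 and a WSL-capable Windows NVIDIA driver are installed, a quick smoke test would be something along these lines (a sketch based on the Docker/NVIDIA docs - as noted, I haven't run it myself):
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
# should print the same GPU table you see on the Windows side if pass-through is working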

Related

How to properly use current Docker Desktop, WSL 2, and NVidia GPU support

First time posting a question - suggestions welcome to improve the post!
OS: Windows 10 Enterprise, 21H2 19044.2468
WSL2: Ubuntu-20.04
NVIDIA Hardware: A2000 8GB Ampere-based laptop GPU
NVIDIA Driver: Any from 517.88 to 528.24 (current prod. release)
DOCKER DESKTOP: Any from 4.9 to 4.16
I get the following error when I try to use GPU-enabled docker containers:
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create failed:
runc create failed: unable to start container process: error during container init:
error running hook #0: error running hook: exit status 1, stdout: , stderr:
Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: WSL environment detected but no adapters were found: unknown.
I have searched this error somewhat extensively, and have found many questions and answers. Such questions and answers are different because:
They deal with beta Nvidia drivers, which is dated
They deal with docker version 19 and prior, not docker desktop 20 and up
They are based on the windows insider channel, which is dated
They are relevant for older and less supported hardware - this is on an Ampere based card
They relate to the nvidia-docker2 package, which is deprecated
They may relate to using the Docker toolkit in a way that does not follow the NVIDIA instructions.
I followed the instructions at https://docs.nvidia.com/cuda/wsl-user-guide/index.html carefully. I also followed instructions at https://docs.docker.com/desktop/windows/wsl/. This install was previously working for me, and suddenly stopped one day without my intervention. Obviously, something changed, but I wiped all the layers of software here and reinstalled to no avail. Then I tried on other systems which previously were working, but which were outside the enterprise domain, and they too had this issue.
Relevant info from nvidia-smi in Windows, nvidia-smi in WSL, and dxdiag:
PS C:\> nvidia-smi
Mon Jan 30 10:37:23 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 517.88 Driver Version: 517.88 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A200... WDDM | 00000000:01:00.0 On | N/A |
| N/A 55C P8 14W / N/A | 1431MiB / 8192MiB | 6% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1880 C+G C:\Windows\System32\dwm.exe N/A |
| 0 N/A N/A 5252 C+G ...3d8bbwe\CalculatorApp.exe N/A |
| 0 N/A N/A 6384 C+G ...werToys.ColorPickerUI.exe N/A |
| 0 N/A N/A 6764 C+G ...2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 7024 C+G ...artMenuExperienceHost.exe N/A |
| 0 N/A N/A 10884 C+G ...ll\Dell Pair\DellPair.exe N/A |
| 0 N/A N/A 11044 C+G ...\AppV\AppVStreamingUX.exe N/A |
| 0 N/A N/A 11944 C+G ...me\Application\chrome.exe N/A |
| 0 N/A N/A 12896 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 14372 C+G ...---------<edit>---------- N/A |
| 0 N/A N/A 14748 C+G ...\PowerToys.FancyZones.exe N/A |
| 0 N/A N/A 15472 C+G ...ontend\Docker Desktop.exe N/A |
| 0 N/A N/A 16500 C+G ...werToys.PowerLauncher.exe N/A |
| 0 N/A N/A 17356 C+G ...---------<edit>---------- N/A |
| 0 N/A N/A 17944 C+G ...ge\Application\msedge.exe N/A |
| 0 N/A N/A 18424 C+G ...t\Teams\current\Teams.exe N/A |
| 0 N/A N/A 18544 C+G ...y\ShellExperienceHost.exe N/A |
| 0 N/A N/A 18672 C+G ...oft OneDrive\OneDrive.exe N/A |
| 0 N/A N/A 19216 C+G ...t\Teams\current\Teams.exe N/A |
| 0 N/A N/A 20740 C+G ...ck\app-4.29.149\slack.exe N/A |
| 0 N/A N/A 22804 C+G ...lPanel\SystemSettings.exe N/A |
+-----------------------------------------------------------------------------+
On the WSL side
$ nvidia-smi
Mon Jan 30 10:35:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01 Driver Version: 517.88 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A200... On | 00000000:01:00.0 On | N/A |
| N/A 54C P8 14W / N/A | 1276MiB / 8192MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 23 G /Xwayland N/A |
+-----------------------------------------------------------------------------+
You can see from this that the card is in WDDM mode, which addresses the TCC/WDDM point raised in answers to previous questions as well.
$ cat /proc/version
Linux version 5.15.79.1-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC)
9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Wed Nov 23 01:01:46 UTC 2022
The WSL kernel version seems fine.
Running dxdiag in windows produced a result with WHQL signed drivers with no problems reported. Everything in the hardware is exactly as it was when the containers worked previously.
This led me to query the lspci with:
$ lspci | grep -i nvidia
$
So this points to the root of the issue. My card is certainly CUDA capable, and my system has the proper drivers and CUDA toolkit, but WSL2 doesn't see any NVIDIA devices. There is only a 3D controller: MS Corp Basic Render Driver. As a result, I ran update-pciids:
$ sudo update-pciids
... (it pulled 270k)
Done.
$ lspci | grep -i nvidia
$
Blank again...
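For reference, the WSL2 GPU path surfaces through the dxg paravirtualization device and the driver libraries that Windows mounts into the distro, so these are the other places I would expect the adapter to show up (a sketch; the paths are the standard WSL2 locations):
$ ls -l /dev/dxg
$ ls /usr/lib/wsl/lib
# nvidia-smi, libcuda.so and friends normally live under /usr/lib/wsl/lib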
I reverted drivers to the 514.08 version to see if it would help. No dice.
Out of curiosity, I tried this entire clean install process with and without the Linux CUDA install, with and without the Windows CUDA toolkit, and with and without the Docker NVIDIA toolkit. None of these changed the behavior. I tried about 10 different versions of the NVIDIA driver and NVIDIA CUDA toolkit as well. All have the same behavior, except that with the CLI-only setup I can run current releases of everything. I just can't publish that as the fix; see below.
I was able to work around this by installing the Docker CLI directly within Ubuntu (edit: the Docker Engine / docker-ce packages currently sourced from Docker's repo). However, this option does not work with solutions that mount the docker-data folder onto an external hard drive mounted in Windows, which is a requirement for this system. (There are many reasons for this, but Docker Desktop does solve them with its WSL repo locations.)
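For completeness, the workaround boils down to installing Docker Engine plus the NVIDIA Container Toolkit inside the WSL Ubuntu distro and bypassing Docker Desktop; roughly like this (a sketch from the Docker/NVIDIA install guides, not a verified recipe - the toolkit package needs NVIDIA's apt repo configured first):
$ curl -fsSL https://get.docker.com | sudo sh     # Docker Engine (docker-ce) from Docker's repo
$ sudo apt-get install -y nvidia-container-toolkit
$ sudo service docker start
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi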
I replicated this issue with a Windows 11 machine, brought the machine fully up to date, and the issue still persisted. Thus, I have ruled out the windows 10 build version as the root cause. Plus, both systems worked previously, and neither had any updates applied manually. Both may have had automatic updates applied, however, the enterprise system cannot be controlled, and the Windows 11 machine was brought fully up to date without fixing the issue. Both machines have the issue where lspci does not see the card.
Am I just an idiot? Have I failed to set something up properly? I don't understand how my own setup could fail after it was working properly. I have followed the NVIDIA and Docker directions to a T. So many processes update dependencies prior to installing that it feels impossible to track. But at the same time, something had to have changed, and others must be experiencing this issue. Thanks for reading along this far! Suggestions on revision welcome.

Bug after installing NVIDIA drivers on Ubuntu PREEMPT_RT kernel

I installed NVIDIA drivers on my Ubuntu 20.04 PC with the PREEMPT_RT kernel patch, 5.15.79-rt54. For the installation I used this tutorial: https://gist.github.com/pantor/9786c41c03a97bca7a52aa0a72fa9387.
After installation I'm getting this bug: BUG: scheduling while atomic: irq/88-s-nvidia/773/0x00000002. The bug appears every few seconds or every few minutes.
I'm aware that NVIDIA drivers are not supported on realtime kernels, but maybe someone has found a solution or workaround for this problem? I don't have much experience with Ubuntu kernels, but the PREEMPT_RT kernel is required to control the 6-DOF robotic arm I'm working with. I also need CUDA for the image processing that will determine the movement of the robot. I was considering using a second PC that runs the PREEMPT_RT kernel and commands the robot velocities: I would do all calculations on the first machine (with the NVIDIA GPU and a generic kernel) and transfer the data over TCP/IP, but I'm afraid that it would add too much latency to the system. I apologize if I didn't provide enough information, but I didn't really know what would be helpful in this situation.
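(If I do end up splitting the work across two machines, a rough first-order check of the extra network latency would be something like the following on a wired LAN, reading the rtt min/avg/max line. This is only a sketch; the real control-loop latency would also include serialization and the robot controller's cycle time.)
$ ping -c 100 <second-pc-ip>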
I tried installing an older version of the NVIDIA drivers (520.56.06), but it didn't help.
This is the nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06 Driver Version: 520.56.06 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| N/A 38C P8 5W / N/A | 243MiB / 4096MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1629 G /usr/lib/xorg/Xorg 59MiB |
| 0 N/A N/A 2154 G /usr/lib/xorg/Xorg 68MiB |
| 0 N/A N/A 2278 G /usr/bin/gnome-shell 104MiB |
+-----------------------------------------------------------------------------+

Hardware Acceleration for Headless Chrome on WSL2 Docker

My target is to server-side render a frame of a WebGL2 application in a performant way.
To do so, I'm using Puppeteer with headless Chrome inside a Docker container running Ubuntu 20.04.4 on WSL2 on Windows 10 22H2.
However, I can't get any hardware acceleration to become active.
It seems that Chrome doesn't detect my Nvidia GTX1080 card, while on the host system with headless: false it's being used and increases render performance drastically.
I have followed the tutorial here to install the CUDA toolkit on WSL2.
A container spun up using sudo docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark uses the GTX1080.
Running nvidia-smi inside the container shows the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.65 Driver Version: 527.37 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| N/A 49C P8 10W / N/A | 871MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
Using flags like
--ignore-gpu-blocklist
--enable-gpu-rasterization
--enable-zero-copy
--use-gl=desktop / --use-gl=egl
did not solve the problem.
I also tried libosmesa6 with --use-gl=osmesa which had no effect.
I'm starting my container with docker-compose including the following section into the service's block:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
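For reference, one way to see what the NVIDIA runtime actually injects into that service's container is a quick library listing (a sketch; chrome is a placeholder service name, and it assumes ldconfig exists in the image):
$ docker-compose run --rm chrome sh -c 'nvidia-smi; ldconfig -p | grep -iE "nvidia|egl|vulkan"'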
Chrome version is HeadlessChrome/108.0.5351.0.
The Chrome WebGL context's WEBGL_debug_renderer_info tells me
vendor: Google Inc. (Google)
renderer: ANGLE (Google, Vulkan 1.3.0 (SwiftShader Device (Subzero) (0x0000C0DE)), SwiftShader driver)
Visiting chrome://gpu with Puppeteer shows me the following:
[screenshot of chrome://gpu omitted]
Any ideas what might still be missing to have Chrome use the GPU for hardware acceleration?

Vulkan does not detect GPU when running Unity build in Docker container

Running Unity builds on my PC usually works fine.
However, when I try to run Unity builds within a Docker container, I get a segmentation error: Segmentation fault (core dumped). I am using Ubuntu 20.04 with an NVIDIA GTX 1080 and have installed all the required dependencies such as the NVIDIA Docker toolkit.
Looking at the logs generated by Unity, it seems that my Nvidia GPU is not detected by Vulkan.
[Vulkan init] SelectPhysicalDevice requestedDeviceIndex=-1 xrDevice=(nil)
[Vulkan init] Physical Device 0xfe9930 [0]: "llvmpipe (LLVM 12.0.0, 256 bits)" deviceType=4 vendorID=10005 deviceID=0
[Vulkan init] Selected physical device (nil)
Caught fatal signal - signo:11 code:1 errno:0 addr:(nil)
Looking at the output of vulkaninfo, only llvmpipe is detected as a physical device.
GPU0:
VkPhysicalDeviceProperties:
---------------------------
apiVersion = 4198582 (1.1.182)
driverVersion = 1 (0x0001)
vendorID = 0x10005
deviceID = 0x0000
deviceType = PHYSICAL_DEVICE_TYPE_CPU
deviceName = llvmpipe (LLVM 12.0.0, 256 bits)
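Side note: Vulkan enumerates only llvmpipe when it cannot find a hardware ICD, so whether the NVIDIA ICD file is visible inside the container seems relevant here. A quick check, assuming the standard Ubuntu location for the ICD JSON:
$ ls /usr/share/vulkan/icd.d/
$ cat /usr/share/vulkan/icd.d/nvidia_icd.json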
In my Dockerfile I set the following NVIDIA settings:
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES all
and used --gpus='all,"capabilities=compute,utility,graphics,display"' -e DISPLAY when starting the container.
Also running nvidia-smi within the container works.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.203.03 Driver Version: 450.203.03 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:15:00.0 On | N/A |
| 12% 28C P8 18W / 250W | 644MiB / 11170MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Any ideas on resolving this problem? Thanks!
I don't know what the problem was, but I found a workaround.
Instead of using ubuntu:20.04 as the base image, I am now using unityci/editor:ubuntu-2022.1.20f1-base-1 in my Dockerfile.
I am using the following settings in a docker compose file to start the container:
unity:
  build:
    dockerfile: unity.Dockerfile
  volumes:
    - /tmp/.X11-unix:/tmp/.X11-unix
    - ${XAUTHORITY}:${XAUTHORITY}
    - $XDG_RUNTIME_DIR:$XDG_RUNTIME_DIR
  environment:
    - XAUTHORITY
    - DISPLAY
    - XDG_RUNTIME_DIR
    - NVIDIA_VISIBLE_DEVICES=all
    - NVIDIA_DRIVER_CAPABILITIES=all
  deploy:
    resources:
      reservations:
        devices:
          - capabilities: [gpu]
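Starting the service and then checking which physical device Vulkan enumerates inside the container confirms whether the workaround took effect (a sketch; it assumes vulkaninfo/vulkan-tools is available in the image):
$ docker compose run --rm unity sh -c 'vulkaninfo | grep -E "deviceName|deviceType"'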

CUDA Version mismatch in Docker with WSL2 backend

I am trying to use Docker (Docker Desktop for Windows 10 Pro) with the WSL2 backend (Windows Subsystem for Linux (WSL), Ubuntu 20.04.4 LTS).
That part seems to be working fine, except I would like to pass my GPU (Nvidia RTX A5000) through to my docker container.
Before I even get that far, I am still trying to set things up. I found a very good tutorial aimed at 18.04 and found that all the steps are the same for 20.04, just with some version numbers bumped.
At the end, I can see that my CUDA versions do not match (screenshot omitted).
The real issue is when I try to run the test command as shown on the docker website:
docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
I get this error:
--> docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380:
starting container process caused: process_linux.go:545: container init caused: Running
hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli:
requirement error: unsatisfied condition: cuda>=11.6, please update your driver to a
newer version, or use an earlier cuda container: unknown.
... and I just don't know what to do, or how I can fix this.
Can someone explain how to get the GPU to pass through to a Docker container successfully?
I had the same issue on Ubuntu when I tried to run the container:
s.evloev@some-pc:~$ docker run --gpus all --rm nvidia/cuda:11.7.0-base-ubuntu18.04
docker: Error response from daemon: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.7, please update your driver to a newer version, or use an earlier cuda container: unknown.
In my case it occurred when I tried to launch a Docker image whose NVIDIA CUDA version was higher than the one installed on my host.
When I checked the CUDA version installed on my host, I found that it is version 11.3.
s.evloev@some-pc:~$ nvidia-smi
Thu Jul 21 15:06:33 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
|... |
+-----------------------------------------------------------------------------+
So when I run the same CUDA version (11.3), it works well:
s.evloev@some-pc:~$ docker run -it --gpus all --rm nvidia/cuda:11.3.0-base-ubuntu18.04 nvidia-smi
Thu Jul 21 12:13:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:65:00.0 Off | N/A |
| 0% 44C P8 7W / 180W | 1404MiB / 8110MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
The comment from @RobertCrovella resolved this:
please update your driver to a newer version when using WSL, the driver in your WSL setup is not something you install in WSL, it is provided by the driver on the windows side. Your WSL driver is 472.84 and this is too old to work with CUDA 11.6 (it only supports up to CUDA 11.4). So you would need to update your windows side driver to the latest one possible for your GPU, if you want to run a CUDA 11.6 test case. Regarding the "mismatch" of CUDA versions, this provides general background material for interpretation.
Downloading the most current Nvidia driver:
Version: R510 U3 (511.79) WHQL
Release Date: 2022.2.14
Operating System: Windows 10 64-bit, Windows 11
Language: English (US)
File Size: 640.19 MB
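To confirm the new Windows driver is what WSL (and therefore the containers) actually sees, re-running nvidia-smi inside the distro and reading the CUDA Version field in the header is enough - as I understand it, that field reports the highest CUDA runtime the installed driver supports:
$ nvidia-smi | head -n 4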
Now I am able to support CUDA 11.6, and the test from the docker documentation now works:
--> docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6
> Compute 8.6 CUDA device: [NVIDIA RTX A5000]
65536 bodies, total time for 10 iterations: 58.655 ms
= 732.246 billion interactions per second
= 14644.916 single-precision GFLOP/s at 20 flops per interaction
Thank you for the quick response!
