Bug after installing NVIDIA drivers on Ubuntu PREEMPT_RT kernel

I installed NVIDIA drivers on my Ubuntu 20.04 PC with the PREEMPT_RT kernel patch (5.15.79-rt54). For the installation I used this tutorial: https://gist.github.com/pantor/9786c41c03a97bca7a52aa0a72fa9387.
After installation I'm getting this bug: BUG: scheduling while atomic: irq/88-s-nvidia/773/0x00000002. It appears every few seconds to every few minutes.
I'm aware that NVIDIA drivers are not supported on real-time kernels, but maybe someone has found a solution or workaround for this problem? I don't have much experience with Ubuntu kernels, but the PREEMPT_RT kernel is required to control the 6-DOF robotic arm I'm working with. I also need CUDA for the image processing that will determine the robot's movement. I was considering using a second PC that runs the PREEMPT_RT kernel and commands the robot velocities: I would do all calculations on the first machine (with the NVIDIA GPU and a generic kernel) and transfer the data over TCP/IP, but I'm afraid that would add too much latency to the system (rough measurement sketch below). I apologize if I didn't provide enough information; I didn't really know what would be helpful in this situation.
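To get a feel for how much latency the extra network hop would add, I could measure the TCP round-trip time with something like this rough sketch (it assumes a trivial echo server listening on the real-time PC; the address and port are placeholders):

import socket
import statistics
import time

RT_PC_ADDR = ("192.168.0.50", 5005)  # placeholder address/port of the real-time PC

def measure_rtt(n=1000, payload=b"x" * 64):
    # Assumes a trivial echo server on the real-time PC that sends the payload back.
    with socket.create_connection(RT_PC_ADDR) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # avoid Nagle buffering delay
        samples = []
        for _ in range(n):
            t0 = time.perf_counter()
            s.sendall(payload)
            remaining = len(payload)
            while remaining:  # read until the full echo has come back
                chunk = s.recv(remaining)
                if not chunk:
                    raise ConnectionError("echo server closed the connection")
                remaining -= len(chunk)
            samples.append(time.perf_counter() - t0)
    samples.sort()
    print(f"median RTT: {statistics.median(samples) * 1e6:.0f} us, "
          f"p99: {samples[int(0.99 * n)] * 1e6:.0f} us")

if __name__ == "__main__":
    measure_rtt()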
I tried installing an older version of the NVIDIA drivers (520.56.06), but it didn't help.
This is the nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06 Driver Version: 520.56.06 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| N/A 38C P8 5W / N/A | 243MiB / 4096MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1629 G /usr/lib/xorg/Xorg 59MiB |
| 0 N/A N/A 2154 G /usr/lib/xorg/Xorg 68MiB |
| 0 N/A N/A 2278 G /usr/bin/gnome-shell 104MiB |
+-----------------------------------------------------------------------------+

Related

How to properly use current Docker Desktop, WSL 2, and NVidia GPU support

First time posting a question - suggestions welcome to improve the post!
OS: Windows 10 Enterprise, 21H2 19044.2468
WSL2: Ubuntu-20.04
NVIDIA Hardware: A2000 8GB Ampere-based laptop GPU
NVIDIA Driver: Any from 517.88 to 528.24 (current prod. release)
DOCKER DESKTOP: Any from 4.9 to 4.16
I get the following error when I try to use GPU-enabled docker containers:
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create failed:
runc create failed: unable to start container process: error during container init:
error running hook #0: error running hook: exit status 1, stdout: , stderr:
Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: WSL environment detected but no adapters were found: unknown.
I have searched this error somewhat extensively and have found many questions and answers. They differ from my case because:
They deal with beta NVIDIA drivers, which are outdated
They deal with Docker version 19 and prior, not Docker Desktop 20 and up
They are based on the Windows Insider channel, which is dated
They are relevant for older and less supported hardware; this is an Ampere-based card
They relate to the nvidia-docker2 package, which is deprecated
They may relate to using the Docker toolkit, which does not follow the NVIDIA instructions.
I followed the instructions at https://docs.nvidia.com/cuda/wsl-user-guide/index.html carefully. I also followed instructions at https://docs.docker.com/desktop/windows/wsl/. This install was previously working for me, and suddenly stopped one day without my intervention. Obviously, something changed, but I wiped all the layers of software here and reinstalled to no avail. Then I tried on other systems which previously were working, but which were outside the enterprise domain, and they too had this issue.
Relevant info from nvidia-smi in windows, wsl, and the dxdiag:
PS C:\> nvidia-smi
Mon Jan 30 10:37:23 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 517.88 Driver Version: 517.88 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A200... WDDM | 00000000:01:00.0 On | N/A |
| N/A 55C P8 14W / N/A | 1431MiB / 8192MiB | 6% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1880 C+G C:\Windows\System32\dwm.exe N/A |
| 0 N/A N/A 5252 C+G ...3d8bbwe\CalculatorApp.exe N/A |
| 0 N/A N/A 6384 C+G ...werToys.ColorPickerUI.exe N/A |
| 0 N/A N/A 6764 C+G ...2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 7024 C+G ...artMenuExperienceHost.exe N/A |
| 0 N/A N/A 10884 C+G ...ll\Dell Pair\DellPair.exe N/A |
| 0 N/A N/A 11044 C+G ...\AppV\AppVStreamingUX.exe N/A |
| 0 N/A N/A 11944 C+G ...me\Application\chrome.exe N/A |
| 0 N/A N/A 12896 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 14372 C+G ...---------<edit>---------- N/A |
| 0 N/A N/A 14748 C+G ...\PowerToys.FancyZones.exe N/A |
| 0 N/A N/A 15472 C+G ...ontend\Docker Desktop.exe N/A |
| 0 N/A N/A 16500 C+G ...werToys.PowerLauncher.exe N/A |
| 0 N/A N/A 17356 C+G ...---------<edit>---------- N/A |
| 0 N/A N/A 17944 C+G ...ge\Application\msedge.exe N/A |
| 0 N/A N/A 18424 C+G ...t\Teams\current\Teams.exe N/A |
| 0 N/A N/A 18544 C+G ...y\ShellExperienceHost.exe N/A |
| 0 N/A N/A 18672 C+G ...oft OneDrive\OneDrive.exe N/A |
| 0 N/A N/A 19216 C+G ...t\Teams\current\Teams.exe N/A |
| 0 N/A N/A 20740 C+G ...ck\app-4.29.149\slack.exe N/A |
| 0 N/A N/A 22804 C+G ...lPanel\SystemSettings.exe N/A |
+-----------------------------------------------------------------------------+
On the WSL side
$ nvidia-smi
Mon Jan 30 10:35:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01 Driver Version: 517.88 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A200... On | 00000000:01:00.0 On | N/A |
| N/A 54C P8 14W / N/A | 1276MiB / 8192MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 23 G /Xwayland N/A |
+-----------------------------------------------------------------------------+
You can see from this that the card is in WDDM mode, which addresses other answers to previous questions as well.
cat /proc/version
Linux version 5.15.79.1-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC)
9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Wed Nov 23 01:01:46 UTC 2022
The WSL kernel version seems fine.
Running dxdiag in Windows produced a result with WHQL-signed drivers and no problems reported. Everything in the hardware is exactly as it was when the containers worked previously.
This led me to query the lspci with:
$ lspci | grep -i nvidia
$
So this points to the root of the issue. My card is certainly CUDA capable, and my system has the proper drivers and CUDA toolkit, but WSL2 doesn't see any NVIDIA devices. There is only a 3D controller: Microsoft Corp. Basic Render Driver. As a result, I ran update-pciids:
$ sudo update-pciids
... (it pulled 270k)
Done.
$ lspci | grep -i nvidia
$
Blank again...
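For what it's worth, the GPU plumbing WSL2 normally uses (the /dev/dxg paravirtualization device and the driver libraries under /usr/lib/wsl/lib) can be checked with a rough sketch like the one below; these paths are my assumption about the standard WSL2 layout, not something taken from the docs above:

import glob
import os

# Paths assumed from the usual WSL2 GPU setup; adjust if the distro layout differs.
checks = {
    "/dev/dxg (WSL GPU paravirtualization device)": os.path.exists("/dev/dxg"),
    "/usr/lib/wsl/lib/libcuda.so*": bool(glob.glob("/usr/lib/wsl/lib/libcuda.so*")),
    "/usr/lib/wsl/lib/libnvidia-ml.so*": bool(glob.glob("/usr/lib/wsl/lib/libnvidia-ml.so*")),
}
for name, present in checks.items():
    print(f"{name}: {'present' if present else 'MISSING'}")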
I reverted drivers to the 514.08 version to see if it would help. No dice.
Out of curiosity, I tried this entire clean install process with and without the linux CUDA install, with and without the windows CUDA toolkit, and with and without the docker nvidia toolkit. None of these changed the behavior. I tried with about 10 different versions of the nvidia driver and nvidia cuda toolkit as well. All have the same behavior, but in the CLI mode I can run current releases of everything. I just can't publish that as the fix, see below.
I was able to work around this by installing docker cli directly within Ubuntu, and then using the <edit: docker engine / docker-ce packages currently sourced from docker's repo. However, this option does not work with solutions to mount the docker-data folder to an external hard drive mounted in windows, which is a requirement for the system. (there are many reasons for this, but Docker Desktop does solve these with its WSL repo locations). /edit>
I replicated this issue with a Windows 11 machine, brought the machine fully up to date, and the issue still persisted. Thus, I have ruled out the windows 10 build version as the root cause. Plus, both systems worked previously, and neither had any updates applied manually. Both may have had automatic updates applied, however, the enterprise system cannot be controlled, and the Windows 11 machine was brought fully up to date without fixing the issue. Both machines have the issue where lspci does not see the card.
Am I just an idiot? Have I failed to set something up properly? I don't understand how my own setup could fail after it was working properly. I have followed the NVIDIA and Docker directions to a T. So many processes update dependencies prior to installing that it feels impossible to track. But at the same time, something had to have changed, and others must be experiencing this issue. Thanks for reading along this far! Suggestions on revision welcome.

Hardware Acceleration for Headless Chrome on WSL2 Docker

My goal is to server-side render a frame of a WebGL2 application in a performant way.
To do so, I'm using Puppeteer with headless Chrome inside a Docker container running Ubuntu 20.04.4 on WSL2 on Windows 10 22H2.
However, I can't get any hardware acceleration to become active.
It seems that Chrome doesn't detect my Nvidia GTX1080 card, while on the host system with headless: false it's being used and increases render performance drastically.
I have followed the tutorial here to install the CUDA toolkit on WSL2.
A container spun up using sudo docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark uses the GTX1080.
Running nvidia-smi inside the container shows the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.65 Driver Version: 527.37 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| N/A 49C P8 10W / N/A | 871MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
Using flags like
--ignore-gpu-blocklist
--enable-gpu-rasterization
--enable-zero-copy
--use-gl=desktop / --use-gl=egl
did not solve the problem.
I also tried libosmesa6 with --use-gl=osmesa which had no effect.
I'm starting my container with docker-compose, including the following section in the service's block:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
Chrome version is HeadlessChrome/108.0.5351.0.
The Chrome WebGL context's WEBGL_debug_renderer_info tells me
vendor: Google Inc. (Google)
renderer: ANGLE (Google, Vulkan 1.3.0 (SwiftShader Device (Subzero) (0x0000C0DE)), SwiftShader driver)
Visiting chrome://gpu with Puppeteer shows me the following:
(screenshot of the chrome://gpu report omitted)
Any ideas what might still be missing to have Chrome use the GPU for hardware acceleration?
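In case it helps narrow things down, a quick check of which GL/EGL/driver libraries the container can actually resolve is something like the rough sketch below (run inside the container; the library names are my guesses at what hardware-accelerated rendering would need, not something confirmed):

import ctypes.util

# Library base names are assumptions; find_library uses the ldconfig cache on Linux.
for lib in ("EGL", "GL", "GLESv2", "vulkan", "cuda", "nvidia-ml"):
    print(f"{lib}: {ctypes.util.find_library(lib) or 'not found'}")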

Cannot find CUDA device from PyTorch even though CUDA is installed on the system

NVCC version
C:\Users\acute>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:24:09_Pacific_Daylight_Time_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0
NVIDIA SMI version
C:\Users\acute>nvidia-smi
Mon May 02 00:12:20 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 431.02 Driver Version: 431.02 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1650 WDDM | 00000000:01:00.0 Off | N/A |
| N/A 43C P8 5W / N/A | 134MiB / 4096MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
C:\Users\acute>pip freeze
certifi==2021.10.8
charset-normalizer==2.0.12
idna==3.3
numpy==1.22.3
Pillow==9.1.0
requests==2.27.1
torch==1.11.0+cu113
torchaudio==0.11.0+cu113
torchvision==0.12.0+cu113
typing_extensions==4.2.0
urllib3==1.26.9
Accessing from torch
>>> torch.cuda.is_available()
C:\Users\acute\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\cuda\__init__.py:82: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\c10\cuda\CUDAFunctions.cpp:112.)
return torch._C._cuda_getDeviceCount() > 0
False
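To make the relevant versions easier to compare, here is a small sketch that prints what the wheel was built against next to what the runtime sees; the part I suspect matters is comparing this with nvidia-smi, since my driver reports CUDA 10.2 while the installed wheel is a cu113 build:

import torch

print("torch:", torch.__version__)             # e.g. 1.11.0+cu113
print("built for CUDA:", torch.version.cuda)   # CUDA runtime the wheel was compiled against
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
# Compare "built for CUDA" with the "CUDA Version" shown by nvidia-smi;
# a driver that only supports CUDA 10.2 is too old to run a cu113 build.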

Is it correct that "nvidia-smi" on Docker does not show "Processes"?

Is it normal that when I run "nvidia-smi" in Docker, nothing shows up in the "Processes" section?
I'm building an environment for deep learning with Docker + GPU on Ubuntu.
I think it's almost done, but there is one thing that bothers me.
When I do "nvidia-smi" on Ubuntu, I see "processes".
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 166... Off | 00000000:01:00.0 On | N/A |
| 42% 37C P8 8W / 125W | 249MiB / 5936MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1127 G /usr/lib/xorg/Xorg 35MiB |
| 0 2006 G /usr/lib/xorg/Xorg 94MiB |
| 0 2202 G /usr/bin/gnome-shell 97MiB |
| 0 6565 G /usr/lib/firefox/firefox 2MiB |
| 0 7875 G /usr/lib/firefox/firefox 2MiB |
| 0 10070 G /usr/lib/firefox/firefox 2MiB |
+-----------------------------------------------------------------------------+
When I do 'nvidia-smi' on Docker, I don't see the 'processes'.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 166... Off | 00000000:01:00.0 On | N/A |
| 42% 36C P8 8W / 125W | 342MiB / 5936MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
When I run "Jupyter Notebook", the GPU seems to be running.
"It's by design," I read in an article written a few years ago.
Is it still "by design" that processes don't show up today?
Or is it because I haven't done the installation correctly?
Lend me your wisdom.
Thanks in advance!
Yes, you will not be able to see them, because the driver is not aware of the PID namespace. You can peruse the thread, and the workaround using Python in particular, at
https://github.com/NVIDIA/nvidia-docker/issues/179#issuecomment-598059213
(I presume you are not using a VM, since persistence mode is OFF in the log shown).
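For reference, the workaround discussed in that thread is roughly of this shape: query NVML directly through the Python bindings instead of relying on nvidia-smi's PID mapping. A minimal sketch, assuming the bindings are installed (e.g. pip install nvidia-ml-py3):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    mem = proc.usedGpuMemory
    mem_mib = mem // (1024 * 1024) if mem is not None else "N/A"
    print(f"pid={proc.pid} gpu_memory={mem_mib} MiB")
pynvml.nvmlShutdown()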
A shim driver allows in-docker nvidia-smi to show the correct process list without modifying anything.
https://github.com/matpool/mpu

nvidia-smi does not display memory usage [closed]

I want to use nvidia-smi to monitor my GPU for my machine-learning/AI projects. However, when I run nvidia-smi in my cmd, git bash or powershell, I get the following results:
$ nvidia-smi
Sun May 28 13:25:46 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 376.53 Driver Version: 376.53 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 WDDM | 0000:28:00.0 On | N/A |
| 0% 49C P2 36W / 166W | 7240MiB / 8192MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 7676 C+G ...ost_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 8580 C+G Insufficient Permissions N/A |
| 0 9704 C+G ...x86)\Google\Chrome\Application\chrome.exe N/A |
| 0 10532 C ...\Anaconda3\envs\tensorflow-gpu\python.exe N/A |
| 0 11384 C+G Insufficient Permissions N/A |
| 0 12896 C+G C:\Windows\explorer.exe N/A |
| 0 13868 C+G Insufficient Permissions N/A |
| 0 14068 C+G Insufficient Permissions N/A |
| 0 14568 C+G Insufficient Permissions N/A |
| 0 15260 C+G ...osoftEdge_8wekyb3d8bbwe\MicrosoftEdge.exe N/A |
| 0 16912 C+G ...am Files (x86)\Dropbox\Client\Dropbox.exe N/A |
| 0 18196 C+G ...I\AppData\Local\hyper\app-1.3.3\Hyper.exe N/A |
| 0 18228 C+G ...oftEdge_8wekyb3d8bbwe\MicrosoftEdgeCP.exe N/A |
| 0 20032 C+G ...indows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
+-----------------------------------------------------------------------------+
The GPU Memory Usage column shows N/A for every single process. Also, there are a lot more processes listed than in most examples I found on the Internet. What could be the reason for this?
I am running an NVIDIA GTX 1070 by ASUS, and my OS is Windows 10 Pro.
If you run nvidia-smi -q you will see the following:
Processes
Process ID : 6564
Type : C+G
Name : C:\Windows\explorer.exe
Used GPU Memory : Not available in WDDM driver model
Not available in WDDM driver model => WDDM stands for Windows Display Driver Model. You can switch to TCC and obtain the information with the command nvidia-smi -dm 1; however, this operation can only be performed if you do not have any display attached to the GPU. So... it's mostly impossible...
By the way, don't worry about the high memory usage: TensorFlow reserves as much GPU memory as it can to speed up its processing. If you prefer finer-grained control over the memory used, use the following (it may slow down your computations a little):
import tensorflow as tf  # TF 1.x API

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once
sess = tf.Session(config=config)
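For reference, a rough TensorFlow 2.x equivalent of the snippet above (which uses the 1.x API) would be:

import tensorflow as tf

# Ask TensorFlow to grow GPU memory on demand instead of grabbing it all up front.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)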
You can create a dual boot with Ubuntu, or just forget about this.
