RTX 3080 LHR Missing gpu__dram_throughput CUDA metric - machine-learning

As part of a machine learning project, we are optimizing some custom CUDA kernels.
We are trying to profile them using Nsight Compute, but we encounter the following error on the LHR RTX 3080 when running a simple wrapper around the CUDA kernel:
==ERROR== Failed to access the following 4 metrics: dram__cycles_active.avg.pct_of_peak_sustained_elapsed, dram__cycles_elapsed.avg.per_second, gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed, gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
==ERROR== Failed to profile kernel "kernel" in process 20204
Running a diff of the metrics available on an RTX 3080 Ti (non-LHR) versus an RTX 3080 (LHR) via nv-nsight-cu-cli --devices 0 --query-metrics, we notice the following metrics are missing on the RTX 3080 LHR:
gpu__compute_memory_request_throughput
gpu__compute_memory_throughput
gpu__dram_throughput
All of these are required for even basic memory profiling in Nsight Compute; every other metric is present. Is this a limitation of LHR cards? Why would they not be present?
Details:
Gigabyte RTX 3080 Turbo (LHR)
CUDA version: 11.5
Driver version: 497.29
Windows 10
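For completeness, this is roughly how we query and request the metrics (my_kernel_app is just a placeholder for our wrapper binary):
# list every metric the profiler exposes on device 0; we search this output for the names above
nv-nsight-cu-cli --devices 0 --query-metrics
# request one of the failing metrics explicitly while profiling the wrapper
nv-nsight-cu-cli --metrics gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed my_kernel_app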

I saw your post on the NVIDIA developer forums, and from what it looks like, NVIDIA didn't intend this, so I'd just go with what works (non-LHR) for now until they fix it. Quadro and Tesla cards are also supported by Nsight Compute, so they might be a stopgap solution.
So to answer the main question:
Will buying a non-LHR GPU address this problem?
For right now, yes, buying a non-LHR 3080 should fix the issue.

As per the NVIDIA forums, this is an unintended bug that is fixed by upgrading from CUDA 11.5 to CUDA 11.6; under 11.6, all profiling works correctly with all metrics available.
Working configuration:
Gigabyte RTX 3080 Turbo (LHR)
CUDA version: 11.6
Driver version: 511.23
Windows 10
We don't know why these metrics were unavailable, but the version update is definitely the correct fix.
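A quick sanity check after the upgrade (my_kernel_app is a placeholder binary name; the newer CLI is called ncu, but the older nv-nsight-cu-cli name works the same way):
# the four metrics from the error message are listed again
ncu --devices 0 --query-metrics
# and a default profile of the wrapper completes without the ==ERROR== lines above
ncu my_kernel_app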

Related

vulkaninfo failed with VK_ERROR_INITIALIZATION_FAILED

OS: ubuntu 18.04
GPU: Geforce GTX 1060
Driver: Nvidia Driver 440.82
Vulkan Package: libvulkan1/bionic-updates,now 1.1.70+dfsg1-1ubuntu0.18.04.1 amd64
Nvidia-smi shows the configuration correctly.
However, when I invoke vulkaninfo, I get /build/vulkan-UL09PJ/vulkan-1.1.70+dfsg1/demos/vulkaninfo.c:2700: failed with VK_ERROR_INITIALIZATION_FAILED
It seems Vulkan cannot detect the physical device. Any idea why?
I guess you might be invoking vulkaninfo from an SSH terminal; running it directly on the physical machine (in a local display session) may solve it.
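If you have to stay in the SSH session, pointing vulkaninfo at an existing local X display sometimes works as well (a hedged sketch; it assumes an X session is running on :0 and owned by your user):
# from inside the SSH session, borrow the local display
DISPLAY=:0 vulkaninfo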

Nvidia drivers (440, 450) cannot find GeForce 2080 Ti (Ubuntu 20.04)

I am trying to get Ubuntu 20.04 running on a desktop computer with a GeForce 2080 Ti, and I have had no luck with various versions of the Nvidia drivers (440 from the PPA, latest 450 from the Nvidia website).
However, I could not get it to work:
nvidia-smi --> No devices were found.
In /var/log/Xorg.0.log --> (EE) Failed to initialize the NVIDIA GPU at PCI:7:0:0.
dmesg --> NVRM: GPU 0000:07:00.0: RmInitAdapter failed!
Other info that could help:
I have a GUI when I prime-select intel
Secure boot is disabled
nouveau is blacklisted
Thanks for the help,

Run NVIDIA for GPGPU, Intel for graphics simultaneously

I have a laptop running Ubuntu 18.04 with both Intel and NVIDIA graphics cards
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
01:00.0 VGA compatible controller: NVIDIA Corporation GM204M [GeForce GTX 970M] (rev a1)
I would like to use the Intel card for my actual graphics display, and my NVIDIA card for simultaneously running GPGPU things (e.g. TensorFlow models, other CUDA stuff, OpenCL). Is this possible? How would I go about this?
Ideally, I'd be able to turn the NVIDIA GPU on and off easily, so that I can just turn it on when I need to run something on it, and turn it off after to save power.
Currently, I have it set up with nvidia-prime so that I can switch between one card or the other (I need to reboot in between). However, if I've loaded the Intel card for graphics (prime-select intel), then the NVIDIA kernel drivers never get loaded and I can't access the NVIDIA GPU (nvidia-smi doesn't work).
I tried loading the NVIDIA kernel module with sudo modprobe nvidia when running the graphics on Intel, but I get ERROR: could not insert 'nvidia': No such device.
Yes, this is indeed possible. It is called "Nvidia Optimus" and means that the integrated Intel GPU is used by default to save power and the dedicated Nvidia GPU is used only for high-performance applications. Here are guides on how to set it up in Linux:
The Ultimate Guide to Setting Up Nvidia Optimus on Linux
archlinux: Nvidia Optimus
Short answer: You can try my modified version of prime-select, which adds a 'hybrid' profile (graphics on Intel, TensorFlow and other CUDA stuff on the Nvidia GPU). https://github.com/lperez31/prime-select-hybrid
Long answer:
I ran into the same issue and found several blogs describing different solutions, but I wanted something more straightforward, and I didn't want to have to switch between profiles every time I needed TensorFlow to run on the Nvidia GPU.
When setting the 'intel' profile, prime-select blacklists three modules: nvidia, nvidia-drm and nvidia-modeset. It also removes the three aliases to these modules. Thus, when running in the 'intel' profile, the sudo modprobe nvidia command is expected to fail; had the alias not been removed, that command would have done the trick.
In order to use Intel for graphics and the Nvidia GPU for TensorFlow, the 'hybrid' profile in the modified version of prime-select above blacklists the nvidia-drm and nvidia-modeset modules, but not the nvidia module. Thus the Nvidia driver is loaded, but since nvidia-drm (Direct Rendering Manager) is not, graphics remain on the Intel GPU.
If you don't want to use my version of prime-select, you could just edit /usr/bin/prime-select and comment out the following two lines:
blacklist nvidia
alias nvidia off
With these lines commented out, the nvidia-smi command should run even in the 'intel' profile, you should be able to run CUDA workloads on the Nvidia GPU, and your graphics will stay on Intel.
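Either way, you can verify the result with standard commands (nothing specific to my script; glxinfo comes from the mesa-utils package):
lsmod | grep nvidia     # nvidia should be loaded, nvidia_drm should not
nvidia-smi              # the GPU should now be visible for CUDA work
glxinfo | grep vendor   # rendering should still report the Intel driver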

ML training process is not on GPU

I just moved from AWS to Google Cloud Platform because of its lower GPU price. I followed the instructions on the website to create a Compute Engine instance with a K80 GPU and installed the latest TensorFlow, Keras, CUDA driver and cuDNN; everything went very well. However, when I try to train my model, the training process still runs on the CPU.
NVIDIA-SMI 387.26 Driver Version: 387.26
Cuda compilation tools, release 9.1, V9.1.85
TensorFlow version: 1.4.1
cuDNN: cudnn-9.1-linux-x64-v7
Ubuntu 16.04
Perhaps you installed the CPU-only version of TensorFlow?
Google Cloud's Compute Engine now offers a VM OS image with all the needed software preinstalled, for an easier/faster way to get started: https://cloud.google.com/deep-learning-vm/
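A quick way to check what you actually installed and whether TensorFlow 1.x can see the K80 (a hedged sketch using only standard TF 1.x calls):
# 'tensorflow-gpu' is the CUDA-enabled package; plain 'tensorflow' is CPU-only
pip list | grep -i tensorflow
# a working GPU setup lists a '/device:GPU:0' (Tesla K80) entry here
python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"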

How to run GPGPU inside a Docker image with a kernel and GPU driver version different from the host

I have a machine with several GPUs. My idea is to attach them to different Docker instances in order to use those instances for CUDA (or OpenCL) calculations.
My goal is to set up a Docker image with a fairly old Ubuntu and fairly old AMD video drivers (13.04). The reason is simple: upgrading to a newer driver version breaks my OpenCL program (due to buggy AMD Linux drivers).
So the question is the following: is it possible to run a Docker image with an old Ubuntu, an old kernel (3.14, for example) and an old AMD (fglrx) driver on a fresh Arch Linux setup with a fresh 4.2 kernel and the newer AMD (fglrx) drivers from the repository?
P.S. I tried this answer (with Nvidia cards) and unfortunately deviceQuery inside the Docker image doesn't see any CUDA devices (as happened for some commenters on the original answer)...
P.P.S. My setup:
CPU: Intel Xeon E5-2670
GPUs:
1 x Radeon HD 7970
$ lspci -nn | grep Rad
83:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X] [1002:6798]
83:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT HDMI Audio [Radeon HD 7970 Series] [1002:aaa0]
2 x GeForce GTX Titan Black
With Docker you rely on operating-system-level virtualization, which means all containers use the same kernel as the host. If you wish to run a different kernel for each container, you'll probably have to use full virtualization, e.g. KVM or VirtualBox. If your setup supports Intel VT-d, you can pass the GPU through as a PCIe device to the container (the better terminology in this case is virtual machine).
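You can make the shared-kernel point visible with a one-liner (assumes Docker is installed; the image tag is only an example):
uname -r                                 # kernel version on the Arch host
docker run --rm ubuntu:14.04 uname -r    # the 'old Ubuntu' container reports the very same kernel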
