vulkaninfo failed with VK_ERROR_INITIALIZATION_FAILED - nvidia

OS: Ubuntu 18.04
GPU: GeForce GTX 1060
Driver: NVIDIA driver 440.82
Vulkan package: libvulkan1/bionic-updates,now 1.1.70+dfsg1-1ubuntu0.18.04.1 amd64
nvidia-smi shows the configuration correctly.
However, when I invoke vulkaninfo, I get /build/vulkan-UL09PJ/vulkan-1.1.70+dfsg1/demos/vulkaninfo.c:2700: failed with VK_ERROR_INITIALIZATION_FAILED
It seems Vulkan cannot detect the physical device. Any idea why?

I guess you might be invoking vulkaninfo from an SSH terminal; running it directly on the physical machine, in a local display session, may solve it.
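For anyone checking the same thing, a minimal sketch, assuming an X session is running locally on display :0 (the display number is an assumption):
export DISPLAY=:0                 # point vulkaninfo at the local X session instead of the SSH one
vulkaninfo | grep -i deviceName   # should list the GTX 1060 if the loader can reach the GPU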

Related

RTX 3080 LHR Missing gpu__dram_throughput CUDA metric

As part of a machine learning project, we are optimizing some custom CUDA kernels.
We are trying to profile them with Nsight Compute, but encounter the following error on the LHR RTX 3080 when running a simple wrapper around the CUDA kernel:
==ERROR== Failed to access the following 4 metrics: dram__cycles_active.avg.pct_of_peak_sustained_elapsed, dram__cycles_elapsed.avg.per_second, gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed, gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
==ERROR== Failed to profile kernel "kernel" in process 20204
Running a diff of the metrics available on an RTX 3080 Ti (non-LHR) versus an RTX 3080 (LHR) via nv-nsight-cu-cli --devices 0 --query-metrics, we notice the following metrics are missing on the RTX 3080 LHR:
gpu__compute_memory_request_throughput
gpu__compute_memory_throughput
gpu__dram_throughput
All of these are required for even basic memory profiling with Nsight Compute; every other metric is present. Is this a limitation of LHR cards? Why would they be missing?
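For reference, the comparison above can be reproduced roughly like this (the output file names are illustrative; run the same command on each machine and diff the two files):
nv-nsight-cu-cli --devices 0 --query-metrics > metrics_3080_lhr.txt
nv-nsight-cu-cli --devices 0 --query-metrics > metrics_3080ti.txt
Diffing the two lists is what showed the three gpu__*_throughput metrics missing on the LHR card.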
Details:
Gigabyte RTX 3080 Turbo (LHR)
CUDA version: 11.5
Driver version: 497.29
Windows 10
I saw your post on the NVIDIA developer forums, and from the look of it NVIDIA didn't intend this, so I'd go with what works (non-LHR) for now until they fix it. Quadro and Tesla cards are supported by Nsight Compute, so they might be a stopgap solution.
So to answer the main question:
Will buying a non-LHR GPU address this problem?
For right now, yes, buying a non-LHR 3080 should fix the issue.
As per the NVIDIA forums, this is an unintended bug that is fixed by upgrading from CUDA 11.5 to CUDA 11.6, after which profiling works correctly with all metrics available.
Successful conditions:
Gigabyte RTX 3080 Turbo (LHR)
CUDA version: 11.6
Driver version: 511.23
Windows 10
We don't know why these metrics were unavailable, but the version update is definitely the correct fix.

Nvidia drivers (440, 450) cannot find GeForce 2080 Ti (Ubuntu 20.04)

I am trying to get Ubuntu 20.04 running on a desktop computer with a GeForce 2080 Ti, but I have had no luck with various versions of the NVIDIA drivers (440 from the PPA, the latest 450 from the NVIDIA website):
nvidia-smi --> No devices were found.
In /var/log/Xorg.0.log --> (EE) Failed to initialize the NVIDIA GPU at PCI:7:0:0.
dmesg --> NVRM: GPU 0000:07:00.0: RmInitAdapter failed!
Other info that could help:
I get a GUI when I run prime-select intel
Secure boot is disabled
nouveau is blacklisted
Thanks for the help.
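For anyone hitting the same symptom, a few generic checks worth running (standard Ubuntu commands, not from the original thread, purely illustrative):
lsmod | grep nvidia              # is the nvidia kernel module loaded at all?
sudo dkms status                 # were the nvidia modules built for the running kernel?
dmesg | grep -iE 'nvrm|nvidia'   # kernel-side errors such as the RmInitAdapter failure above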

Installing DPDK on Ubuntu 18.04 with an Intel XL710

I'm trying to make DPDK work on my machine without success. The machine is running Ubuntu 18.04 and the NIC I'm trying to bind is an Intel XL710. I'm completely new to DPDK and not an expert on Linux.
Additional context: I need DPDK in order to get more bandwidth when using a USRP SDR (Software Defined Radio), that has this capability.
What I've done so far:
Added default_hugepagesz=1G hugepagesz=1G hugepages=8 to the grub config
Cloned and compiled DPDK 19, installed with make install. Result: Installation in /usr/local/ complete
Got the status of the devices and drivers using ./dpdk-devbind.py -s. The relevant line I get from this command is: 0000:02:00.0 'Ethernet Controller XL710 for 40GbE QSFP+ 1583' if=enp2s0f0 drv=i40e unused=
When I try to bind the device (even though it is already bound? shouldn't I use a different driver/option for that?) using sudo ./dpdk-devbind.py -b i40e 0000:02:00.0, I get:
Warning: no supported DPDK kernel modules are loaded
Notice: 0000:02:00.0 already bound to driver i40e, skipping
What am I missing?
Thanks in advance for the help.
Before binding the i40e NIC to a DPDK poll-mode driver (PMD), you need to load the uio or vfio kernel module, as shown below:
modprobe uio
insmod ./x86_64-native-linux-gcc/kmod/igb_uio.ko
or
modprobe vfio-pci
Take a look at the link to understand why a kernel module must be loaded before ports are bound to DPDK.
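Putting it together, a minimal sketch of the full sequence (the PCI address and interface name are taken from the question, the script path matches the one used above, and vfio-pci is just one of the two options):
sudo modprobe vfio-pci                              # load the userspace I/O driver
sudo ip link set enp2s0f0 down                      # the kernel interface must be down before rebinding
sudo ./dpdk-devbind.py -b vfio-pci 0000:02:00.0     # bind to vfio-pci instead of the kernel i40e driver
./dpdk-devbind.py -s                                # the NIC should now appear under the DPDK-compatible drivers section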

Windows Server 2019 Hyper-V Discrete Device Assignment (DDA) of an NVIDIA Tesla V100 to Ubuntu 18.04 LTS or CentOS guest: not found

Does anyone have experience using DDA to pass an NVIDIA Tesla through to a Linux guest on Hyper-V? This setup works perfectly when the guest is Windows 10, and from what I have read it should also be supported with Ubuntu 18.04 LTS or CentOS 7/8 as the guest operating system. However, the driver fails to detect the Tesla at install time despite it appearing on the virtual PCI bus.
Thanks
A combination of installing the latest kernel (5.0.0-1028-azure or 5.3.0-26-generic) on Ubuntu 18.04.3 LTS, then shutting down the guest and increasing the high MMIO space to 33 GB with:
Set-VM -HighMemoryMappedIoSpace 33GB -VMName vm-name
and then restarting, fixed the issue.

NVIDIA-SMI failed. Couldn't communicate with NVIDIA driver [duplicate]

This question already has answers here:
Error: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver
I am running a cloud instance on a GPU node. I installed CUDA, and nvidia-smi showed the driver details and memory utilization. After a couple of days, I started getting this error:
"NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."
I installed the latest driver (NVIDIA 375.39 for Tesla M40 GPUs) and I still face the same issue. Is there any way to
i) debug why nvidia-smi is not able to communicate with the driver?
ii) check if the driver is running properly?
This is an operating-system issue, so the solution depends on your OS. For example, if you are running Ubuntu 16.04 the solution might be something like this:
Uninstall / purge all Nvidia drivers
sudo apt-get remove --purge nvidia* && sudo apt autoremove
Download the NVIDIA driver from NVIDIA's website (.run file)
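A hedged sketch of that second step (the .run file name assumes the 375.39 driver mentioned above; the installer should be run from a text console with the display manager stopped):
sudo systemctl stop lightdm                  # or gdm/sddm, whichever display manager is in use
sudo sh ./NVIDIA-Linux-x86_64-375.39.run     # run the downloaded installer
sudo reboot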
I met the same problem as you and solved it by changing a firmware setting: reboot the system, enter the BIOS, set the Secure Boot option to disabled, then reboot. After that it was OK.

Resources