Nvidia-smi fails on AWS's g4ad server with deep learning AMI - nvidia

I'm trying to use AWS's g4ad instance with deep learning AMI (Deep Learning AMI (Ubuntu 16.04) Version 50.0) and when I'm trying to execute:
nvidia-smi
I'm getting:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Nevertheless, it seems that CUDA is working fine. When executing nvcc --version, I'm getting:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

Apparently, g4ad instances don't have an Nvidia GPU, but an AMD GPU. With g4dn it works perfectly.

Related

Lower CUDA version in container is not working for higher host cuda [duplicate]

I am very confused by the different CUDA versions shown by running which nvcc and nvidia-smi. I have both cuda9.2 and cuda10 installed on my ubuntu 16.04. Now I set the PATH to point to cuda9.2. So when I run
$ which nvcc
/usr/local/cuda-9.2/bin/nvcc
However, when I run
$ nvidia-smi
Wed Nov 21 19:41:32 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 106... Off | 00000000:01:00.0 Off | N/A |
| N/A 53C P0 26W / N/A | 379MiB / 6078MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1324 G /usr/lib/xorg/Xorg 225MiB |
| 0 2844 G compiz 146MiB |
| 0 15550 G /usr/lib/firefox/firefox 1MiB |
| 0 19992 G /usr/lib/firefox/firefox 1MiB |
| 0 23605 G /usr/lib/firefox/firefox 1MiB |
So am I using cuda9.2 as which nvcc suggests, or am I using cuda10 as nvidia-smi suggests? I saw this answer but it does not provide direct answer to the confusion, it just asks us to reinstall the CUDA Toolkit, which I already did.
CUDA has 2 primary APIs, the runtime and the driver API. Both have a corresponding version (e.g. 8.0, 9.0, etc.)
The necessary support for the driver API (e.g. libcuda.so on linux) is installed by the GPU driver installer.
The necessary support for the runtime API (e.g. libcudart.so on linux, and also nvcc) is installed by the CUDA toolkit installer (which may also have a GPU driver installer bundled in it).
In any event, the (installed) driver API version may not always match the (installed) runtime API version, especially if you install a GPU driver independently from installing CUDA (i.e. the CUDA toolkit).
The nvidia-smi tool gets installed by the GPU driver installer, and generally has the GPU driver in view, not anything installed by the CUDA toolkit installer.
Recently (somewhere between 410.48 and 410.73 driver version on linux) the powers-that-be at NVIDIA decided to add reporting of the CUDA Driver API version installed by the driver, in the output from nvidia-smi.
This has no connection to the installed CUDA runtime version.
nvcc, the CUDA compiler-driver tool that is installed with the CUDA toolkit, will always report the CUDA runtime version that it was built to recognize. It doesn't know anything about what driver version is installed, or even if a GPU driver is installed.
Therefore, by design, these two numbers don't necessarily match, as they are reflective of two different things.
If you are wondering why nvcc -V displays a version of CUDA you weren't expecting (e.g. it displays a version other than the one you think you installed) or doesn't display anything at all, version wise, it may be because you haven't followed the mandatory instructions in step 7 (prior to CUDA 11) (or step 6 in the CUDA 11 linux install guide) of the cuda linux install guide
Note that although this question mostly has linux in view, the same concepts apply to windows CUDA installs. The driver has a CUDA driver version associated with it (which can be queried with nvidia-smi, for example). The CUDA runtime also has a CUDA runtime version associated with it. The two will not necessarily match in all cases.
In most cases, if nvidia-smi reports a CUDA version that is numerically equal to or higher than the one reported by nvcc -V, this is not a cause for concern. That is a defined compatibility path in CUDA (newer drivers/driver API support "older" CUDA toolkits/runtime API). For example if nvidia-smi reports CUDA 10.2, and nvcc -V reports CUDA 10.1, that is generally not cause for concern. It should just work, and it does not necessarily mean that you "actually installed CUDA 10.2 when you meant to install CUDA 10.1"
If nvcc command doesn't report anything at all (e.g. Command 'nvcc' not found...) or if it reports an unexpected CUDA version, this may also be due to an incorrect CUDA install, i.e the mandatory steps mentioned above were not performed correctly. You can start to figure this out by using a linux utility like find or locate (use man pages to learn how, please) to find your nvcc executable. Assuming there is only one, the path to it can then be used to fix your PATH environment variable. The CUDA linux install guide also explains how to set this. You may need to adjust the CUDA version in the PATH variable to match your actual CUDA version desired/installed.
Similarly, when using docker, the nvidia-smi command will generally report the driver version installed on the base machine, whereas other version methods like nvcc --version will report the CUDA version installed inside the docker container.
Similarly, if you have used another installation method for the CUDA "toolkit" such as Anaconda, you may discover that the version indicated by Anaconda does not "match" the version indicated by nvidia-smi. However, the above comments still apply. Older CUDA toolkits installed by Anaconda can be used with newer versions reported by nvidia-smi, and the fact that nvidia-smi reports a newer/higher CUDA version than the one installed by Anaconda does not mean you have an installation problem.
Here is another question that covers similar ground. The above treatment does not in any way indicate that this answer is only applicable if you have installed multiple CUDA versions intentionally or unintentionally. The situation presents itself any time you install CUDA. The version reported by nvcc and nvidia-smi may not match, and that is expected behavior and in most cases quite normal.
nvcc is in the CUDA bin folder - as such check if the CUDA bin folder has been added to your $PATH.
Specifically, ensure that you have carried out the CUDA Post-Installation actions (e.g. from here):
Add the CUDA Bin to $PATH (i.e. add the following line to your ~/.bashrc)
export PATH=/usr/local/cuda-10.1/bin:/usr/local/cuda-10.1/NsightCompute-2019.1${PATH:+:${PATH}}
PS. Ensure the following two paths above, exist first: /usr/local/cuda-10.1/bin and /usr/local/cuda-10.1/NsightCompute-2019.1 (the NsightCompute path could have a slightly different ending depending on the version of Nsight compute installed...
Update $LD_LIBRARY_PATH (i.e. add the following line to your ~/bashrc).
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64\
${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
After this, both nvcc and nvidia-smi (or nvtop) report the same version of CUDA...
If you are using cuda 10.2 :
export PATH=/usr/local/cuda-10.2/bin:/opt/nvidia/nsight-compute/2019.5.0${PATH:+:${PATH}}
might help because when I checked, there was no directory for nsight-compute in cuda-10.2.
I am not sure if this was just the problem with me or else why wouldn't they mention it in the official documentation.
Adding onto Robert Crovella's answer...
The difference between the device driver and the runtime driver is that, with device driver you will be able to run compiled CUDA C code. That is, you can download CUDA powered applications and they will be able to successfully execute their code on your GPU.
Whereas, with the runtime driver you will be able to able to compile the CUDA C code, which then will be executed with the help of the device driver on your GPU.
Section 2.2.3 - Cuda Development Toolkit
nvidia-smi can show a “different CUDA version” from the one that is reported by nvcc. Because they are reporting two different things:
nvidia-smi shows that maximum available CUDA version support for a given GPU driver.
And the 2nd thing which nvcc -V reports is the CUDA version that is currently being used by the system.
In short
nvidia-smi shows the highest version of CUDA supported by your driver. nvcc -V shows the version of the current CUDA installation. As long as your driver-supported version is higher than your installed version, it's fine. You can even have several versions of CUDA installed at the same time.

Can I run a Docker container with CUDA 10 when host has CUDA 9?

Im deploying an application in a docker container that requires CUDA 10. This is necessary to run some of the underlying pytorch functionality that the application uses.
However, the host server is running docker ce 17, Nvidia-docker v 1.0 with CUDA version 9, and I will not be able to upgrade the host.
I’m under the impression that I’m handcuffed to the v1 nvidia docker runtime and CUDA version available on the host.
Is there a way to run CUDA 10 on the container so I can leverage the functionality of this toolkit?
In the general case, any specific CUDA version will require a minimum GPU driver version. That is covered in places like here and here (table 1). So to use CUDA 9.0 you would need at least a GPU driver version that supports CUDA 9.0, such as a R384 driver. To use CUDA 10.0 you would need at least a GPU driver version that supports CUDA 10.0, such as a R410 driver.
The usage of containers doesn't fundamentally change this. If you want to use a container that has CUDA 10 code in it, your base machine needs a driver that supports CUDA 10.
NVIDIA did start publishing compatibility libraries that allow modifications to the above statements. These compatibility libraries are available but not installed by default with a CUDA toolkit install. These compatibility libraries only work in certain cases, and they have certain requirements to be usable. The compatibility libraries are documented here.
One of the specific requirements for use of these compatibility libraries is that the GPU(s) in use must be Tesla-brand GPUs. GeForce, Quadro, Jetson, and Titan family GPUs are not supported by these compatibility libraries.
Furthermore, the libraries only work with certain combination of CUDA toolkit versions, and GPU driver versions installed on the base machine. This "compatibility matrix" is documented here (Table 3). Only the specific combinations of CUDA toolkit versions with installed driver versions will be usable for compatibility. To pick one example, if you wish to use CUDA 10.0, and your base machine has a Tesla GPU with a R396 driver installed, there is no compatibility support. In the same setup, however, if you wish to use CUDA 10.1, there is compatibility support for that.
If you have satisfied the requirements for compatibility usage, then the remaining step would be to install the compatibility libraries (or build your container from a base container that has the compatibility libraries already installed).
For a package manager CUDA install method, the method to install the compatibility libraries is simple (example on Ubuntu, installing the CUDA 10.1 compatibility to match CUDA 10.1 toolkit install):
sudo apt-get install cuda-compat-10.1
Make sure to match the version to the CUDA toolkit version that you are using (that you installed with the package manager method, or that was already installed in your container).
This compatibility "path" only began in the CUDA 9.0 timeframe. Systems that are equipped with drivers that predate CUDA 9.0 will not be usable in any way for this compatibility path. There are also various functional limitations and restrictions, which are covered in the documentation.
When this "compatibility path" is correctly installed and in use, the overall system configuration can "appear" to be violating the rules indicated at the top of this answer. For example a CUDA 10.1 application could possibly be running on a machine that had only a R396 driver installed.
For the specific question in view here, OP eventually indicated that the base machine had a Quadro GPU, so this "compatibility path" does not apply, and the only way to run e.g. a CUDA 10.0 container would be if a CUDA 10.0-capable driver is installed in the base machine, e.g. R410 or later driver.

Nvidia-smi command not found - nvidia drivers installed

~$ nvidia-smi
nvidia-smi: command not found
~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148
Its worked before. I made changes in grub using the command
gksu gedit /etc/default/grub
but revert it back. (I don't know if it could have effect)
Reason
The most common reason for any “command not found” error is that the software being accessed is not installed on the system. Check out the example below where the tool is not installed on the system and the error “nvidia-smi command not found” is thrown:
Solution
The best way of resolving this error is to install the “nvidia-utils” package which will also contain the “nvidia-smi” tool inside it. To install this package, run the command in the terminal:
$ sudo apt install nvidia-utils-515
I got the detailed solution form this post: https://itslinuxfoss.com/fix-nvidia-smi-command-not-found-error/

CUDA driver version is insufficient for CUDA runtime version - OpenCV - GPU Toolkit

I am trying to run the CUDA GPU Toolkit 7.5 built with OpenCV 3.1.0 .
My graphic card is : Nvidia Quadro FX 5800 . Driver version : 341.92 (Latest available version for the same)
Nvidia classifies my Graphics card in the legacy category with the 1.3 compute capability.
I keep getting the error in the title. and can understand the driver mismatch.
I updated to the latest driver for the graphics card.
My question is what version of the GPU toolkit should i build opencv with ? that would also be compatible with VS 2013 C++ env. I tried building it with CUDA toolkit 6.0 and its not compatible with VS 2013.
Sticky situation any advice would be appreciated.
This was fixed by building OpenCV with 1.3 compute capability. Dont let Cmake choose it automatically. CUDA_ARCH_PTX was set to 1.3 ->(which is the compute capability of my legacy graphics card).

Can not find CUDA_NPP_LIBRARY_ROOT_DIR in cmake

I am trying to install opencv 2.2 with cuda enabled. I have installed CUDA and also have compiled and tested some programs with it.
Now I want to install opencv with gpu. I downloaded cmake-gui and set CUDA on. When I hit the configure button it does find cuda toolkit but gives me this error:
CMake Error at modules/gpu/FindNPP.cmake:102 (message):
NPP headers/libraries are not found. Please specify
CUDA_NPP_LIBRARY_ROOT_DIR in CMake or set $NPP_ROOT_DIR.
Call Stack (most recent call first):
modules/gpu/CMakeLists.txt:38 (find_package)
I have installed everything in the default directory.
My cuda-toolkit version is 4.0. Also I have opencv 2.2 without gpu installed.
And in case it might help:
My OS is ubuntu 11.04, my gpu is geforce gtx550, and my cuda driver version is 280.13

Resources