Feature level 12_2 on RTX 3070 Ti - DirectX

I upgraded from a GTX 1050 Ti to an RTX 3070 Ti without reinstalling the driver.
The reported feature level goes up to 12_1 but not 12_2. Why?
Do I need to reinstall the driver, or do something else, to get 12_2?
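In case it helps, one way to check what the installed driver actually exposes is to dump a dxdiag text report and pull out the feature-level lines; this is only a rough sketch (the report path is arbitrary, and dxdiag can keep writing for a moment after it returns):
import os
import subprocess
import time

report = os.path.abspath("dxdiag_report.txt")  # arbitrary output path
if os.path.exists(report):
    os.remove(report)

# Ask dxdiag to write a plain-text report and wait for the file to appear.
subprocess.run(["dxdiag", "/t", report], check=True)
while not os.path.exists(report):
    time.sleep(1)
time.sleep(2)  # give dxdiag a moment to finish writing

# Each display-device section of the report lists its Direct3D feature levels.
with open(report, "rb") as f:
    data = f.read()
text = data.decode("utf-16") if data[:2] in (b"\xff\xfe", b"\xfe\xff") else data.decode("utf-8", "ignore")
for line in text.splitlines():
    if "Feature Levels" in line or "Driver Version" in line:
        print(line.strip())
If the report still tops out at 12_1 after a clean driver install, the limitation may be the Windows build rather than the GPU, since 12_2 also needs a sufficiently recent Windows version and driver.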

Related

Is it normal that the $DISPLAY environment variable value is :1?

Well, I have some problems and this may be the root of them. I just want to know whether this output is normal.
~$ echo $DISPLAY
:1
Is there any problem? According to my research, it should be :0.
Why is $DISPLAY sometimes :0 and sometimes :1
:0 is usually the local display (i.e. the main display of the computer when you sit in front of it).
:1 is often used by services like SSH when you enable display forwarding and log into a remote computer.
Although it is my local display, $DISPLAY is :1. So how can I fix the problem (if it is a problem)?
Additional information: while I log in, the screen flickers.
System Info:
~$ neofetch
********@***************
-------------------------
OS: Ubuntu 20.04.4 LTS x86_64
Host: 20URS0BG00 ThinkPad T15g Gen 1
Kernel: 5.13.0-30-generic
Uptime: 2 hours, 32 mins
Packages: 3895 (dpkg), 26 (snap)
Shell: bash 5.0.17
Resolution: 1920x1080, 1920x1080, 1920x1080
DE: GNOME
WM: Mutter
WM Theme: Adwaita
Theme: Yaru-dark [GTK2/3]
Icons: Yaru [GTK2/3]
Terminal: x-terminal-emul
CPU: Intel i7-10750H (12) @ 5.000GHz
GPU: NVIDIA 01:00.0 NVIDIA Corporation Device 1e91
GPU: Intel UHD Graphics
Memory: 6676MiB / 31894MiB
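To see which X display sockets actually exist and which one this session is using, a small diagnostic sketch like the following can help (nothing here changes the system):
import glob
import os
import subprocess

# X creates one socket per display under /tmp/.X11-unix (X0 -> :0, X1 -> :1).
for sock in sorted(glob.glob("/tmp/.X11-unix/X*")):
    print("display socket:", sock)

# What this session is using.
print("DISPLAY =", os.environ.get("DISPLAY"))
print("session type =", os.environ.get("XDG_SESSION_TYPE"))

# List all login sessions; on Ubuntu 20.04 the GDM greeter commonly keeps :0
# for itself, so a user session on :1 is not, by itself, a problem.
subprocess.run(["loginctl", "list-sessions"], check=False)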

Why is multi-GPU faster than single-GPU in Caffe training?

Same hardware/software environment, same net and solver; the only difference is the command line.
When the command line is:
caffe-master/build/tools/caffe train --solver=solver_base.prototxt --gpu=6
it takes about 50 seconds per 100 iterations.
When the command line is:
caffe-master/build/tools/caffe train --solver=solver_base.prototxt --gpu=4,5,6,7
it takes about 48 seconds per 100 iterations.
Intuitively, multi-GPU training should take more time per iteration than single-GPU because of overhead such as parameter replication, so can anyone tell me why? Thanks very much!
Env:
2 * Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
8 * NVIDIA Tesla V100 PCIe 16GB
Caffe 1.0.0 / use_cudnn on
CUDA 9.0.176
cuDNN 6.0.21
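Not an answer, but a quick sanity check on those numbers: Caffe 1.0's data-parallel training runs a full solver batch on every GPU per iteration, so the effective batch size scales with the GPU count. A rough throughput comparison, using a hypothetical batch size of 64:
# Hypothetical batch size from the train prototxt; substitute the real one.
batch_size = 64

def images_per_sec(secs_per_100_iters, num_gpus):
    # Each iteration processes batch_size images on every GPU.
    return 100 * batch_size * num_gpus / secs_per_100_iters

print("--gpu=6       :", images_per_sec(50.0, 1), "images/s")
print("--gpu=4,5,6,7 :", images_per_sec(48.0, 4), "images/s")
# Nearly equal wall time per 100 iterations therefore means roughly 4x the
# data processed, not 4 GPUs doing the same amount of work as 1.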

Installing TensorFlow-GPU

I'm trying to install tensorflow-gpu. The problem is that I have the nvidia-375.82 driver, while TensorFlow expects 375.66.
When I got this error:
ImportError: libnvidia-fatbinaryloader.so.375.66: cannot open shared object file: No such file or directory
I tried to make a symlink:
sudo ln -s /usr/lib/nvidia-375/libnvidia-fatbinaryloader.so.375.82 /usr/lib/nvidia-375/libnvidia-fatbinaryloader.so.375.66
That avoids the ImportError, but nothing more. If I try to run something like:
import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
I get the result computed on the CPU, and it prints:
2017-10-07 15:56:03.329769: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.329832: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.329850: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.329864: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.329878: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.429055: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2017-10-07 15:56:03.429198: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: sklert-new-comp
2017-10-07 15:56:03.429226: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: sklert-new-comp
2017-10-07 15:56:03.429317: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 375.66.0
2017-10-07 15:56:03.429384: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.82 Wed Jul 19 21:16:49 PDT 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
"""
2017-10-07 15:56:03.429446: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 375.82.0
2017-10-07 15:56:03.429473: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] kernel version 375.82.0 does not match DSO version 375.66.0 -- cannot find working devices in this configuration
Device mapping: no known devices.
2017-10-07 15:56:03.430336: I tensorflow/core/common_runtime/direct_session.cc:300] Device mapping:
MatMul: (MatMul): /job:localhost/replica:0/task:0/cpu:0
2017-10-07 15:56:03.467133: I tensorflow/core/common_runtime/simple_placer.cc:872] MatMul: (MatMul)/job:localhost/replica:0/task:0/cpu:0
b: (Const): /job:localhost/replica:0/task:0/cpu:0
2017-10-07 15:56:03.467201: I tensorflow/core/common_runtime/simple_placer.cc:872] b: (Const)/job:localhost/replica:0/task:0/cpu:0
a: (Const): /job:localhost/replica:0/task:0/cpu:0
2017-10-07 15:56:03.467226: I tensorflow/core/common_runtime/simple_placer.cc:872] a: (Const)/job:localhost/replica:0/task:0/cpu:0
[[ 22. 28.]
[ 49. 64.]]
Is there any way to use TensorFlow with the GPU without downgrading the driver?
...
It seems the problem is not in TensorFlow but in the NVIDIA drivers:
sudo dmesg | grep NVRM
[ 1.267417] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 375.82 Wed Jul 19 21:16:49 PDT 2017 (using threaded interrupts)
[ 108.803115] NVRM: API mismatch: the client has the version 375.66, but
NVRM: this kernel module has the version 375.82. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
[ 1419.021917] NVRM: API mismatch: the client has the version 375.66, but
NVRM: this kernel module has the version 375.82. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
Some of the driver libraries have a different version:
locate 375.66
/usr/lib/i386-linux-gnu/libcuda.so.375.66
/usr/lib/i386-linux-gnu/libnvidia-opencl.so.375.66
/usr/lib/nvidia-375/libnvidia-fatbinaryloader.so.375.66
/usr/lib/x86_64-linux-gnu/libcuda.so.375.66
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.375.66
/usr/lib32/nvidia-375/libnvidia-fatbinaryloader.so.375.66
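A compact way to see the mismatch in one place is to compare the kernel module's version with the user-space libraries the loader picks up; this is just a diagnostic sketch, with paths as they appear on this machine:
import glob

# Kernel-side driver version (the same text the TensorFlow log quotes).
with open("/proc/driver/nvidia/version") as f:
    print(f.read().strip())

# User-space driver libraries actually installed; adjust the globs if your
# driver files live elsewhere.
for pattern in ("/usr/lib/x86_64-linux-gnu/libcuda.so.*",
                "/usr/lib/nvidia-375/libnvidia-fatbinaryloader.so.*"):
    for path in sorted(glob.glob(pattern)):
        print(path)

# A 375.82 kernel module with 375.66 user-space libraries is exactly the API
# mismatch dmesg reports; installing one driver version end-to-end (so both
# sides match) is the usual fix, a symlink only masks the ImportError.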

Tensorflow app freezes in docker container

I have a TensorFlow app that runs fine on Ubuntu 16.04, but when I attempt to run it in the tensorflow/tensorflow Docker image with nvidia-docker, it gets to this point and then freezes:
2017-07-12 22:06:10.917255: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-12 22:06:10.917289: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-12 22:06:11.023765: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-07-12 22:06:11.024133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Quadro M4000
major: 5 minor: 2 memoryClockRate (GHz) 0.7725
pciBusID 0000:00:05.0
Total memory: 7.93GiB
Free memory: 7.87GiB
2017-07-12 22:06:11.024159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-07-12 22:06:11.024168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-07-12 22:06:11.024190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro M4000, pci bus id: 0000:00:05.0)
Since it's not outputting an error message, I don't know where to start; any suggestions for something I might be missing or steps to troubleshoot this further?
I verified that my nvidia-docker installation is functioning correctly.
It turns out the application was running; it just appeared frozen because output from Python apps running in Docker containers tends to get stuck in the stdout buffer and never shows up in the docker logs. To fix the problem I passed -u to python; I can see my application's output in docker logs now and all is well.
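For reference, besides passing -u, the same effect can be obtained from the environment or from inside the program; a minimal sketch:
import sys

# Equivalents to `python -u` when running inside a container:
#   * set PYTHONUNBUFFERED=1 in the environment (e.g. an ENV line in the
#     Dockerfile or -e PYTHONUNBUFFERED=1 on docker run), or
#   * force line buffering / explicit flushes in the program itself.
sys.stdout.reconfigure(line_buffering=True)  # Python 3.7+

print("step 1 finished")              # now shows up promptly in `docker logs`
print("step 2 finished", flush=True)  # a per-call flush also works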

Torque jobs cannot find GPU when CUDA_VISIBLE_DEVICES is not 0

I'm facing a strange problem with Torque assignment of GPUs.
I'm running Torque 6.1.0 on a single machine that has two NVIDIA GTX Titan X GPUs. I'm using pbs_sched for scheduling. nvidia-smi output at rest is as follows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:03:00.0      On |                  N/A |
| 22%   40C    P8    15W / 250W |      0MiB / 12204MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:04:00.0     Off |                  N/A |
| 22%   33C    P8    14W / 250W |      0MiB / 12207MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I have a simple test script to assess GPU assignment as follows:
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1:gpus=1:reseterr:exclusive_process
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
deviceQuery is the utility that comes with CUDA. When I run it from the command line, it correctly finds both GPUs. When I restrict to one device from the command-line like this...
CUDA_VISIBLE_DEVICES=0 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
#or
CUDA_VISIBLE_DEVICES=1 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
... it also correctly finds one or the other GPU.
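The same manual check can also be scripted, for what it's worth; a minimal sketch reusing the deviceQuery binary above:
import os
import subprocess

DEVICE_QUERY = os.path.expanduser(
    "~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery")

for dev in ("0", "1"):
    print("=== CUDA_VISIBLE_DEVICES =", dev, "===")
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=dev)
    # Inside the child process the selected GPU is renumbered as device 0.
    subprocess.run([DEVICE_QUERY], env=env, check=False)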
When I submit test.sh to the queue with qsub, and when no other jobs are running, it again works correctly. Here's the output:
CUDA_VISIBLE_DEVICES: 0
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX TITAN X" CUDA Driver Version / Runtime Version 8.0 / 8.0 CUDA Capability Major/Minor version number: 5.2 Total amount of global memory: 12204 MBytes (12796887040 bytes) (24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores GPU Max Clock rate: 1076 MHz (1.08 GHz) Memory Clock rate: 3505 Mhz Memory Bus Width: 384-bit L2 Cache Size: 3145728 bytes Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: No Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Disabled Device supports Unified Addressing (UVA): Yes Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0 Compute Mode:
< Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX TITAN X Result = PASS
However, if a job is already running on gpu0 (so the new job is assigned CUDA_VISIBLE_DEVICES=1), the new job cannot find any GPUs. Output:
CUDA_VISIBLE_DEVICES: 1
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
Anyone know what is going on here?
I think I've solved my own problem, but unfortunately I tried two things at once and I don't want to go back and confirm which one fixed it. It was one of the following:
Removing the --enable-cgroups option from Torque's configure script before building.
Running these steps in the Torque install process:
make packages
sh torque-package-server-linux-x86_64.sh --install
sh torque-package-mom-linux-x86_64.sh --install
sh torque-package-clients-linux-x86_64.sh --install
For the second option, I know these steps are properly documented in the Torque install instructions. However, I have a simple setup with just a single node (the compute node and the server are the same machine). I thought 'make install' would do everything the package installs do for that single node, but maybe I was mistaken.
