OpenCV Python on WSL 2 error: "(-217:Gpu API call) CUDA driver version is insufficient for CUDA runtime version" - opencv

I am trying to run YOLO v4 on OpenCV 4.6 Python with CUDA on WSL 2 with Ubuntu 20.04 (Focal Fossa):
import cv2 as cv
import time
Conf_threshold = 0.4
NMS_threshold = 0.4
COLORS = [(0, 255, 0), (0, 0, 255), (255, 0, 0),
          (255, 255, 0), (255, 0, 255), (0, 255, 255)]
class_name = []
with open('classes.txt', 'r') as f:
    class_name = [cname.strip() for cname in f.readlines()]
# print(class_name)
net = cv.dnn.readNet('yolov4-tiny.weights', 'yolov4-tiny.cfg')
net.setPreferableBackend(cv.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv.dnn.DNN_TARGET_CUDA_FP16)
model = cv.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/255, swapRB=True)
cap = cv.VideoCapture('output.avi')
starting_time = time.time()
frame_counter = 0
while True:
    ret, frame = cap.read()
    frame_counter += 1
    if ret == False:
        break
    classes, scores, boxes = model.detect(frame, Conf_threshold, NMS_threshold)
    ...
I got the following error:
Traceback (most recent call last):
  File "yolov4.py", line 28, in <module>
    classes, scores, boxes = model.detect(frame, Conf_threshold, NMS_threshold)
cv2.error: OpenCV(4.6.0-dev) /home/tong/source/opencv/modules/dnn/src/cuda4dnn/csl/memory.hpp:54: error: (-217:Gpu API call) CUDA driver version is insufficient for CUDA runtime version in function 'ManagedPtr'
I have checked the Nvidia driver by using nvidia-smi.exe:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 516.59 Driver Version: 516.59 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... WDDM | 00000000:0D:00.0 On | N/A |
| 24% 51C P0 44W / 215W | 2711MiB / 8192MiB | 6% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
As well as the CUDA toolkit packages installed, using dpkg -l | grep cuda-toolkit:
ii cuda-toolkit-11-6 11.6.2-1 amd64 CUDA Toolkit 11.6 meta-package
ii cuda-toolkit-11-6-config-common 11.6.55-1 all Common config package for CUDA Toolkit 11.6.
ii cuda-toolkit-11-7 11.7.0-1 amd64 CUDA Toolkit 11.7 meta-package
ii cuda-toolkit-11-7-config-common 11.7.60-1 all Common config package for CUDA Toolkit 11.7.
ii cuda-toolkit-11-config-common 11.7.60-1 all Common config package for CUDA Toolkit 11.
ii cuda-toolkit-config-common 11.7.60-1 all Common config package for CUDA Toolkit.
OpenCV should have been compiled with CUDA 11.7, according to CMakeCache.txt:
// Version of CUDA as computed from nvcc.
CUDA_VERSION:STRING=11.7
nvcc also points to 11.7:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
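For reference, here is a quick runtime check from Python (just a sketch, assuming the cv2.cuda module is exposed in my build) that prints what the OpenCV build itself reports and whether a CUDA device is visible from inside WSL:
import cv2 as cv

# Show the CUDA-related lines of OpenCV's build configuration.
for line in cv.getBuildInformation().splitlines():
    if 'CUDA' in line or 'cuDNN' in line:
        print(line)

# Ask the cv2.cuda module what it can actually see at runtime.
count = cv.cuda.getCudaEnabledDeviceCount()
print('CUDA-enabled devices:', count)
if count > 0:
    cv.cuda.printCudaDeviceInfo(0)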
How can I fix this problem?

Related

Even Easier Introduction to CUDA - not printing after memory initialization

I'm following the Even Easier Introduction to CUDA tutorial. I have literally copied and pasted the complete code. add.cu compiles; however, when I run it, it doesn't print anything. I put in some more print statements and narrowed it down:
printf("Hi\n");
for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
}
printf("Bye");
It prints "Hi", but never prints "Bye". So something seems to be wrong with the memory initialization. What is going wrong here?
I solved the problem myself. Basically, my device drivers were screwed up. To check if you have the same problem, open Command Prompt as administrator and run nvidia-smi. If you have the same problem, it will give you an error saying it failed to communicate because your device drivers are out of date or incorrect.
Download the latest Nvidia driver for your computer (I found mine on Dell Drivers & Downloads) and install it. Now, when you run nvidia-smi as admin in Command Prompt, it should give you a whole bunch of details about your setup (driver version, CUDA version, etc.) like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 441.14 Driver Version: 441.14 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 105... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 30C P8 N/A / N/A | 78MiB / 4096MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
You should now be able to compile and run Cuda scripts with unified memory.

Dynamically decide which GPU to run on - TF on NVIDIA docker

I have a queue of models, of which I allow only 2 to be executed in parallel, since I have 2 GPUs.
For that, at the beginning of my code I try to determine which GPU is available by using GPUtil. Maybe it's relevant: this code is run inside a docker container that was launched using the --runtime=nvidia flag.
The code that determines which GPU to run on, looks like this:
import os
import GPUtil
gpu1, gpu2 = GPUtil.getGPUs()
available_gpu = gpu1 if gpu1.memoryFree > gpu2.memoryFree else gpu2
os.environ['CUDA_VISIBLE_DEVICES'] = str(available_gpu.id)
import tensorflow as tf
Now, I launched two scripts this way (with a slight delay until the first one occupied a GPU) but both of them tried to use the same GPU!
I went further to examine the problem - I manually set the os.environ['CUDA_VISIBLE_DEVICES'] = '1' and let the model run.
As it was training, I checked the output of nvidia-smi and saw the following
user@server:~$ docker exec awesome_gpu_container nvidia-smi
Mon Mar 12 06:59:27 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 Ti Off | 00000000:01:00.0 On | N/A |
| 0% 50C P2 131W / 280W | 5846MiB / 6075MiB | 81% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 00000000:03:00.0 Off | N/A |
| 0% 39C P8 14W / 200W | 2MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
And I notice that while I've set the visible device to be 1, it is actually running on GPU 0.
I stress again that my goal is, while queuing multiple models, for each one that starts running to decide for itself which GPU to use.
I explored allow_soft_placement=True, but that allocated the memory on both GPUs so I stopped the process.
Bottom line, how can I make sure my training scripts only use one GPU, and make them choose the free one?
As described in the CUDA programming guide, the default device enumeration used by CUDA is "fastest first":
CUDA_DEVICE_ORDER
FASTEST_FIRST, PCI_BUS_ID, (default is FASTEST_FIRST)
FASTEST_FIRST causes CUDA to guess which device is
fastest using a simple heuristic, and make that device 0, leaving the
order of the rest of the devices unspecified.
PCI_BUS_ID orders devices by PCI bus ID in ascending order.
If you set CUDA_DEVICE_ORDER=PCI_BUS_ID the CUDA ordering will match the device ordering shown by nvidia-smi.
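As a sketch of how this combines with per-script device selection in Python (adapting the GPUtil snippet above; the index used here is just an example), both variables have to be set before TensorFlow is imported:
import os

# Make CUDA's device numbering match nvidia-smi, then expose only
# the GPU this script has picked. Both must be set before TensorFlow
# (and hence CUDA) is first imported.
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '1'  # example: pick the second GPU

import tensorflow as tf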
Since you are using docker, you can also enforce stronger isolation with our runtime:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 ...
But that's at container startup time.

Unsure whether tensorflow-gpu actually uses GPU

I am currently trying to run a Convolutional Neural Network using Keras on a tensorflow backend, with the help of a Udemy course on deep learning. However, it is running extremely slowly, taking around 1,000s per epoch while the lecturer's machine takes around 60s (he's running it on a CPU, by the way).
The CNN is a simple image recognition network that recognizes whether an image is of a cat or a dog. The training and test data consist of a total of 10,000 images, all images together take up 237 MB on my SSD.
When I run the CNN in a Python shell, I get the following output:
Epoch 1/25
2017-05-28 13:23:03.967337: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-05-28 13:23:03.967574: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-28 13:23:03.968153: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-28 13:23:03.968329: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-28 13:23:03.968576: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-28 13:23:04.505726: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:887] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.835
pciBusID 0000:28:00.0
Total memory: 8.00GiB
Free memory: 6.68GiB
2017-05-28 13:23:04.505944: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:908] DMA: 0
2017-05-28 13:23:04.506637: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:918] 0: Y
2017-05-28 13:23:04.506895: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:28:00.0)
2684/8000 [=========>....................] - ETA: 845s - loss: 0.5011 - acc: 0.7427
This should indicate that tensorflow is using the GPU for its computations. However, when I check on nvidia-smi, I get the following output:
$ nvidia-smi
Sun May 28 13:25:46 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 376.53 Driver Version: 376.53 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 WDDM | 0000:28:00.0 On | N/A |
| 0% 49C P2 36W / 166W | 7240MiB / 8192MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 7676 C+G ...ost_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 8580 C+G Insufficient Permissions N/A |
| 0 9704 C+G ...x86)\Google\Chrome\Application\chrome.exe N/A |
| 0 10532 C ...\Anaconda3\envs\tensorflow-gpu\python.exe N/A |
| 0 11384 C+G Insufficient Permissions N/A |
| 0 12896 C+G C:\Windows\explorer.exe N/A |
| 0 13868 C+G Insufficient Permissions N/A |
| 0 14068 C+G Insufficient Permissions N/A |
| 0 14568 C+G Insufficient Permissions N/A |
| 0 15260 C+G ...osoftEdge_8wekyb3d8bbwe\MicrosoftEdge.exe N/A |
| 0 16912 C+G ...am Files (x86)\Dropbox\Client\Dropbox.exe N/A |
| 0 18196 C+G ...I\AppData\Local\hyper\app-1.3.3\Hyper.exe N/A |
| 0 18228 C+G ...oftEdge_8wekyb3d8bbwe\MicrosoftEdgeCP.exe N/A |
| 0 20032 C+G ...indows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
+-----------------------------------------------------------------------------+
Note that every single process is displayed as using both CPU and GPU (Type C+G), while the tensorflow process is the only one that uses only the CPU (Type C).
Is there any sensible explanation for this? I have been trying to fix this issue for the last week but have gotten nowhere.
I am running a Windows 10 Pro machine with an Nvidia GTX 1070 by Asus, 24 GB RAM and an Intel Xeon X5670 CPU @ 2.93GHz. I created my Anaconda environment with the following commands:
conda create -n tensorflow-gpu python=3.5 anaconda
source activate tensorflow-gpu
conda install theano
conda install mingw libpython
pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/windows/gpu/tensorflow_gpu-1.1.0-cp35-cp35m-win_amd64.whl
pip install keras
conda update --all
I also installed the CUDA Toolkit and cuDNN and added their respective folders to my %PATH%.
Every and any help would be greatly appreciated.
[EDIT]
Here is the code, in case anything is wrong with it:
# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense
# Defining the CNN
classifier = Sequential()
# Convolution 1
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))
# Convolution 2
classifier.add(Conv2D(32, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))
# Flatten + MLP
classifier.add(Flatten())
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dense(units = 1, activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
# Fitting the CNN to the images
from keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)
test_datagen = ImageDataGenerator(rescale = 1./255)
training_set = train_datagen.flow_from_directory('dataset/training_set',
                                                 target_size = (64, 64),
                                                 batch_size = 32,
                                                 class_mode = 'binary')
test_set = test_datagen.flow_from_directory('dataset/test_set',
                                            target_size = (64, 64),
                                            batch_size = 32,
                                            class_mode = 'binary')
classifier.fit_generator(training_set,
                         steps_per_epoch = 8000,
                         epochs = 25,
                         validation_data = test_set,
                         validation_steps = 2000)
It does not have anything to do with your machine; I discussed the problem in this post on Udemy. Everyone seems to have the same issue and wonders how it could take only 20 minutes on the instructor's machine. The answer is simple: the instructor posted different source code than what he presented in the video!
Check the docs for steps_per_epoch:
steps_per_epoch: Total number of steps (batches of samples) to yield
from generator before declaring one epoch finished and starting the
next epoch. It should typically be equal to the number of unique
samples of your dataset divided by the batch size.
Currently, for a single epoch you take 8000 * 32 = 256,000 images. That's the number of samples you are processing in every epoch, which makes no sense whatsoever when your dataset is merely 10,000 images (20k with augmentation).
If you check the video, you'll see the instructor is using samples_per_epoch, meaning 32x less data. Case solved.
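In other words, a sketch of the corrected call against the code above (assuming 8000 training and 2000 test images with batch_size = 32, as implied by the original numbers):
classifier.fit_generator(training_set,
                         steps_per_epoch = 8000 // 32,    # ~250 batches per epoch
                         epochs = 25,
                         validation_data = test_set,
                         validation_steps = 2000 // 32)   # ~62 validation batches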
My reply has nothing to do with the Udemy example; it's simply about checking whether the GPU is being utilized. On Linux, the PSensor utility allows you to observe GPU load and temperature (which is a good indication of usage). I'm not sure how to confirm GPU usage on Windows; perhaps someone else can help with that.
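A framework-level check that works regardless of OS (a sketch using the TF 1.x API matching the tensorflow_gpu-1.1.0 wheel installed above) is to ask TensorFlow directly which devices it sees and where it places ops:
import tensorflow as tf
from tensorflow.python.client import device_lib

# A working GPU setup should list a "/gpu:0" device alongside the CPU.
print(device_lib.list_local_devices())

# Log the device each op is actually placed on.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0, 3.0], name='a')
    print(sess.run(a))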

Torque jobs cannot find GPU when CUDA_VISIBLE_DEVICES not equal 0

I'm facing a strange problem with Torque assignment of GPUs.
I'm running Torque 6.1.0 on a single machine that has two NVIDIA GTX Titan X GPUs. I'm using pbs_sched for scheduling. nvidia-smi output at rest is as follows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:03:00.0 On | N/A |
| 22% 40C P8 15W / 250W | 0MiB / 12204MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 33C P8 14W / 250W | 0MiB / 12207MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I have a simple test script to assess GPU assignment as follows:
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1:gpus=1:reseterr:exclusive_process
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
deviceQuery is the utility that comes with CUDA. When I run it from the command line, it correctly finds both GPUs. When I restrict to one device from the command-line like this...
CUDA_VISIBLE_DEVICES=0 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
#or
CUDA_VISIBLE_DEVICES=1 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
... it also correctly finds one or the other GPU.
When I submit test.sh to the queue with qsub, and when no other jobs are running, it again works correctly. Here's the output:
CUDA_VISIBLE_DEVICES: 0
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX TITAN X"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 12204 MBytes (12796887040 bytes)
(24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores
GPU Max Clock rate: 1076 MHz (1.08 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX TITAN X Result = PASS
However, if a job is already running on gpu0 (i.e. if the new job is assigned CUDA_VISIBLE_DEVICES=1), the new job cannot find any GPUs. Output:
CUDA_VISIBLE_DEVICES: 1
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
Anyone know what is going on here?
I think I've solved my own problem, but unfortunately I tried two things at once and don't want to go back and confirm which one solved the issue. It's one of the following:
Removing the --enable-cgroups option from Torque's configure script before building.
Running these steps in the Torque install process:
make packages
sh torque-package-server-linux-x86_64.sh --install
sh torque-package-mom-linux-x86_64.sh --install
sh torque-package-clients-linux-x86_64.sh --install
For the second option, I know that these steps are properly documented in the Torque install instructions. However, I have a simple setup with just a single node (the compute node and the server are the same machine). I thought that 'make install' should do everything that the package installs do for that single node, but maybe I was mistaken.

Tensorflow Bazel 0.3.0 build CUDA 8.0 GTX 1070 fails

Here are my specs:
GTX 1070
Driver 367 (installed from .run)
Ubuntu 16.04
CUDA 8.0 (installed from .run)
Cudnn 5
Bazel 0.3.0 (potential problem?)
gcc 4.9.3
Tensorflow installed from source
To verify versions:
volcart@volcart-Precision-Tower-7910:~/$ nvidia-smi
Fri Aug 5 15:03:32 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35 Driver Version: 367.35 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 0000:03:00.0 On | N/A |
| 0% 38C P8 11W / 185W | 495MiB / 8113MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 20303 G /usr/lib/xorg/Xorg 280MiB |
| 0 20909 G compiz 114MiB |
| 0 21562 G ...s-passed-by-fd --v8-snapshot-passed-by-fd 98MiB |
+-----------------------------------------------------------------------------+
volcart@volcart-Precision-Tower-7910:~/$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Wed_May__4_21:01:56_CDT_2016
Cuda compilation tools, release 8.0, V8.0.26
volcart@volcart-Precision-Tower-7910:~/$ bazel version
Build label: 0.3.0
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Jun 10 11:38:23 2016 (1465558703)
Build timestamp: 1465558703
Build timestamp as int: 1465558703
volcart@volcart-Precision-Tower-7910:~/$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.9/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.9.3-13ubuntu2' --with-bugurl=file:///usr/share/doc/gcc-4.9/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.9 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.9 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.9-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.9.3 (Ubuntu 4.9.3-13ubuntu2)
I did switch bazel versions, so I executed bazel clean successfully.
I can verify CUDA is functional via ~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery$
volcart@volcart-Precision-Tower-7910:~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1070"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8113 MBytes (8507162624 bytes)
(15) Multiprocessors, (128) CUDA Cores/MP: 1920 CUDA Cores
GPU Max Clock rate: 1797 MHz (1.80 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1070
Result = PASS
When I run ./configure I accept all the defaults.
The current errors:
When I build the training example I get this:
volcart@volcart-Precision-Tower-7910:/usr/local/lib/python2.7/dist-packages/tensorflow$ sudo bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
Sending SIGTERM to previous Bazel server (pid=7108)... done.
.
INFO: Found 1 target...
...
./tensorflow/core/platform/default/logging.h: In instantiation of 'std::string* tensorflow::internal::Check_LTImpl(const T1&, const T2&, const char*) [with T1 = int; T2 = long unsigned int; std::string = std::basic_string<char>]':
tensorflow/core/common_runtime/gpu/gpu_device.cc:567:5: required from here
./tensorflow/core/platform/default/logging.h:197:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
TF_DEFINE_CHECK_OP_IMPL(Check_LT, < )
^
./tensorflow/core/platform/macros.h:54:29: note: in definition of macro 'TF_PREDICT_TRUE'
#define TF_PREDICT_TRUE(x) (x)
^
./tensorflow/core/platform/default/logging.h:197:1: note: in expansion of macro 'TF_DEFINE_CHECK_OP_IMPL'
TF_DEFINE_CHECK_OP_IMPL(Check_LT, < )
^
ERROR: /usr/local/lib/python2.7/dist-packages/tensorflow/tensorflow/cc/BUILD:199:1: Linking of rule '//tensorflow/cc:tutorials_example_trainer' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -o bazel-out/local_linux-opt/bin/tensorflow/cc/tutorials_example_trainer ... (remaining 805 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
bazel-out/local_linux-opt/bin/tensorflow/cc/_objs/tutorials_example_trainer/tensorflow/cc/tutorials/example_trainer.o: In function `tensorflow::example::ConcurrentSteps(tensorflow::example::Options const*, int)':
example_trainer.cc:(.text._ZN10tensorflow7example15ConcurrentStepsEPKNS0_7OptionsEi+0x517): undefined reference to `google::protobuf::internal::empty_string_'
bazel-out/local_linux-opt/bin/tensorflow/core/kernels/libidentity_reader_op.lo(identity_reader_op.o): In function `tensorflow::IdentityReader::SerializeStateLocked(std::string*)':
identity_reader_op.cc:(.text._ZN10tensorflow14IdentityReader20SerializeStateLockedEPSs[_ZN10tensorflow14IdentityReader20SerializeStateLockedEPSs]+0x36): undefined reference to `google::protobuf::MessageLite::SerializeToString(std::string*) const'
bazel-out/local_linux-opt/bin/tensorflow/core/kernels/libwhole_file_read_ops.lo(whole_file_read_ops.o): In function `tensorflow::WholeFileReader::SerializeStateLocked(std::string*)':
And when I try to build the pip package I get this:
volcart@volcart-Precision-Tower-7910:/usr/local/lib/python2.7/dist-packages/tensorflow$ sudo bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
WARNING: /usr/local/lib/python2.7/dist-packages/tensorflow/util/python/BUILD:11:16: in includes attribute of cc_library rule //util/python:python_headers: 'python_include' resolves to 'util/python/python_include' not in 'third_party'. This will be an error in the future.
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/public/bit_depth.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/public/gemmlowp.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/public/map.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/public/output_stages.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/profiling/instrumentation.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/profiling/profiler.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
INFO: Found 1 target...
INFO: From Compiling external/protobuf/src/google/protobuf/util/internal/utility.cc [for host]:
...
INFO: From Compiling tensorflow/core/distributed_runtime/tensor_coding.cc:
tensorflow/core/distributed_runtime/tensor_coding.cc: In member function 'bool tensorflow::TensorResponse::ParseTensorSubmessage(google::protobuf::io::CodedInputStream*, tensorflow::TensorProto*)':
tensorflow/core/distributed_runtime/tensor_coding.cc:123:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (num_bytes != buf.size()) return false;
^
ERROR: /usr/local/lib/python2.7/dist-packages/tensorflow/tensorflow/core/kernels/BUILD:1498:1: undeclared inclusion(s) in rule '//tensorflow/core/kernels:batchtospace_op_gpu':
this rule is missing dependency declarations for the following files included by 'tensorflow/core/kernels/batchtospace_op_gpu.cu.cc':
'/usr/local/cuda-8.0/include/cuda_runtime.h'
'/usr/local/cuda-8.0/include/host_config.h'
'/usr/local/cuda-8.0/include/builtin_types.h'
'/usr/local/cuda-8.0/include/device_types.h'
'/usr/local/cuda-8.0/include/host_defines.h'
'/usr/local/cuda-8.0/include/driver_types.h'
'/usr/local/cuda-8.0/include/surface_types.h'
'/usr/local/cuda-8.0/include/texture_types.h'
'/usr/local/cuda-8.0/include/vector_types.h'
'/usr/local/cuda-8.0/include/library_types.h'
'/usr/local/cuda-8.0/include/channel_descriptor.h'
'/usr/local/cuda-8.0/include/cuda_runtime_api.h'
'/usr/local/cuda-8.0/include/cuda_device_runtime_api.h'
'/usr/local/cuda-8.0/include/driver_functions.h'
'/usr/local/cuda-8.0/include/vector_functions.h'
'/usr/local/cuda-8.0/include/vector_functions.hpp'
'/usr/local/cuda-8.0/include/common_functions.h'
'/usr/local/cuda-8.0/include/math_functions.h'
'/usr/local/cuda-8.0/include/math_functions.hpp'
'/usr/local/cuda-8.0/include/math_functions_dbl_ptx3.h'
'/usr/local/cuda-8.0/include/math_functions_dbl_ptx3.hpp'
'/usr/local/cuda-8.0/include/cuda_surface_types.h'
'/usr/local/cuda-8.0/include/cuda_texture_types.h'
'/usr/local/cuda-8.0/include/device_functions.h'
'/usr/local/cuda-8.0/include/device_functions.hpp'
'/usr/local/cuda-8.0/include/device_atomic_functions.h'
'/usr/local/cuda-8.0/include/device_atomic_functions.hpp'
'/usr/local/cuda-8.0/include/device_double_functions.h'
'/usr/local/cuda-8.0/include/device_double_functions.hpp'
'/usr/local/cuda-8.0/include/sm_20_atomic_functions.h'
'/usr/local/cuda-8.0/include/sm_20_atomic_functions.hpp'
'/usr/local/cuda-8.0/include/sm_32_atomic_functions.h'
'/usr/local/cuda-8.0/include/sm_32_atomic_functions.hpp'
'/usr/local/cuda-8.0/include/sm_35_atomic_functions.h'
'/usr/local/cuda-8.0/include/sm_60_atomic_functions.h'
'/usr/local/cuda-8.0/include/sm_60_atomic_functions.hpp'
'/usr/local/cuda-8.0/include/sm_20_intrinsics.h'
'/usr/local/cuda-8.0/include/sm_20_intrinsics.hpp'
'/usr/local/cuda-8.0/include/sm_30_intrinsics.h'
'/usr/local/cuda-8.0/include/sm_30_intrinsics.hpp'
'/usr/local/cuda-8.0/include/sm_32_intrinsics.h'
'/usr/local/cuda-8.0/include/sm_32_intrinsics.hpp'
'/usr/local/cuda-8.0/include/sm_35_intrinsics.h'
'/usr/local/cuda-8.0/include/surface_functions.h'
'/usr/local/cuda-8.0/include/texture_fetch_functions.h'
'/usr/local/cuda-8.0/include/texture_indirect_functions.h'
'/usr/local/cuda-8.0/include/surface_indirect_functions.h'
'/usr/local/cuda-8.0/include/device_launch_parameters.h'
'/usr/local/cuda-8.0/include/cuda_fp16.h'
'/usr/local/cuda-8.0/include/math_constants.h'
'/usr/local/cuda-8.0/include/curand_kernel.h'
'/usr/local/cuda-8.0/include/curand.h'
'/usr/local/cuda-8.0/include/curand_discrete.h'
'/usr/local/cuda-8.0/include/curand_precalc.h'
'/usr/local/cuda-8.0/include/curand_mrg32k3a.h'
'/usr/local/cuda-8.0/include/curand_mtgp32_kernel.h'
'/usr/local/cuda-8.0/include/cuda.h'
'/usr/local/cuda-8.0/include/curand_mtgp32.h'
'/usr/local/cuda-8.0/include/curand_philox4x32_x.h'
'/usr/local/cuda-8.0/include/curand_globals.h'
'/usr/local/cuda-8.0/include/curand_uniform.h'
'/usr/local/cuda-8.0/include/curand_normal.h'
'/usr/local/cuda-8.0/include/curand_normal_static.h'
'/usr/local/cuda-8.0/include/curand_lognormal.h'
'/usr/local/cuda-8.0/include/curand_poisson.h'
'/usr/local/cuda-8.0/include/curand_discrete2.h'.
nvcc warning : option '--relaxed-constexpr' has been deprecated and replaced by option '--expt-relaxed-constexpr'.
nvcc warning : option '--relaxed-constexpr' has been deprecated and replaced by option '--expt-relaxed-constexpr'.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 138.913s, Critical Path: 102.63s
I saw some people complaining about bazel 0.3.1; maybe you need to downgrade to 0.3.0. The error you gave is not very informative: that's just the parent script saying that the child script failed, so there should be more info on the console with the actual error.
I went through the setup steps two days ago for a GTX 1080 and it worked with this config:
Ubuntu 16.04
Nvidia Driver: nvidia-367.35 (installed from .run file)
Bazel 0.3.0
gcc: 4.9.3 (default with 16.04)
CUDA 8.0.27 (installed from .run file into default dirs)
compute capability: (use default values for config)
