TensorFlow build with Bazel 0.3.0 and CUDA 8.0 on a GTX 1070 fails

Here are my specs:
GTX 1070
Driver 367 (installed from .run)
Ubuntu 16.04
CUDA 8.0 (installed from .run)
cuDNN 5
Bazel 0.3.0 (potential problem?)
gcc 4.9.3
TensorFlow installed from source
To verify versions:
volcart@volcart-Precision-Tower-7910:~/$ nvidia-smi
Fri Aug 5 15:03:32 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35 Driver Version: 367.35 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 0000:03:00.0 On | N/A |
| 0% 38C P8 11W / 185W | 495MiB / 8113MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 20303 G /usr/lib/xorg/Xorg 280MiB |
| 0 20909 G compiz 114MiB |
| 0 21562 G ...s-passed-by-fd --v8-snapshot-passed-by-fd 98MiB |
+-----------------------------------------------------------------------------+
volcart@volcart-Precision-Tower-7910:~/$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Wed_May__4_21:01:56_CDT_2016
Cuda compilation tools, release 8.0, V8.0.26
volcart@volcart-Precision-Tower-7910:~/$ bazel version
Build label: 0.3.0
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Jun 10 11:38:23 2016 (1465558703)
Build timestamp: 1465558703
Build timestamp as int: 1465558703
volcart@volcart-Precision-Tower-7910:~/$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.9/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.9.3-13ubuntu2' --with-bugurl=file:///usr/share/doc/gcc-4.9/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.9 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.9 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.9-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.9.3 (Ubuntu 4.9.3-13ubuntu2)
Since I switched Bazel versions, I ran bazel clean, which completed successfully.
I can verify that CUDA is functional via ~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery:
volcart@volcart-Precision-Tower-7910:~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1070"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8113 MBytes (8507162624 bytes)
(15) Multiprocessors, (128) CUDA Cores/MP: 1920 CUDA Cores
GPU Max Clock rate: 1797 MHz (1.80 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1070
Result = PASS
When I run ./configure, I accept all the defaults.
The current errors:
When I build the training example, I get this:
volcart@volcart-Precision-Tower-7910:/usr/local/lib/python2.7/dist-packages/tensorflow$ sudo bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
Sending SIGTERM to previous Bazel server (pid=7108)... done.
.
INFO: Found 1 target...
...
./tensorflow/core/platform/default/logging.h: In instantiation of 'std::string* tensorflow::internal::Check_LTImpl(const T1&, const T2&, const char*) [with T1 = int; T2 = long unsigned int; std::string = std::basic_string<char>]':
tensorflow/core/common_runtime/gpu/gpu_device.cc:567:5: required from here
./tensorflow/core/platform/default/logging.h:197:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
TF_DEFINE_CHECK_OP_IMPL(Check_LT, < )
^
./tensorflow/core/platform/macros.h:54:29: note: in definition of macro 'TF_PREDICT_TRUE'
#define TF_PREDICT_TRUE(x) (x)
^
./tensorflow/core/platform/default/logging.h:197:1: note: in expansion of macro 'TF_DEFINE_CHECK_OP_IMPL'
TF_DEFINE_CHECK_OP_IMPL(Check_LT, < )
^
ERROR: /usr/local/lib/python2.7/dist-packages/tensorflow/tensorflow/cc/BUILD:199:1: Linking of rule '//tensorflow/cc:tutorials_example_trainer' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -o bazel-out/local_linux-opt/bin/tensorflow/cc/tutorials_example_trainer ... (remaining 805 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
bazel-out/local_linux-opt/bin/tensorflow/cc/_objs/tutorials_example_trainer/tensorflow/cc/tutorials/example_trainer.o: In function `tensorflow::example::ConcurrentSteps(tensorflow::example::Options const*, int)':
example_trainer.cc:(.text._ZN10tensorflow7example15ConcurrentStepsEPKNS0_7OptionsEi+0x517): undefined reference to `google::protobuf::internal::empty_string_'
bazel-out/local_linux-opt/bin/tensorflow/core/kernels/libidentity_reader_op.lo(identity_reader_op.o): In function `tensorflow::IdentityReader::SerializeStateLocked(std::string*)':
identity_reader_op.cc:(.text._ZN10tensorflow14IdentityReader20SerializeStateLockedEPSs[_ZN10tensorflow14IdentityReader20SerializeStateLockedEPSs]+0x36): undefined reference to `google::protobuf::MessageLite::SerializeToString(std::string*) const'
bazel-out/local_linux-opt/bin/tensorflow/core/kernels/libwhole_file_read_ops.lo(whole_file_read_ops.o): In function `tensorflow::WholeFileReader::SerializeStateLocked(std::string*)':
And when I try to build the pip package I get this:
volcart@volcart-Precision-Tower-7910:/usr/local/lib/python2.7/dist-packages/tensorflow$ sudo bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
WARNING: /usr/local/lib/python2.7/dist-packages/tensorflow/util/python/BUILD:11:16: in includes attribute of cc_library rule //util/python:python_headers: 'python_include' resolves to 'util/python/python_include' not in 'third_party'. This will be an error in the future.
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/public/bit_depth.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/public/gemmlowp.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/public/map.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/public/output_stages.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/profiling/instrumentation.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/profiling/profiler.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
INFO: Found 1 target...
INFO: From Compiling external/protobuf/src/google/protobuf/util/internal/utility.cc [for host]:
...
INFO: From Compiling tensorflow/core/distributed_runtime/tensor_coding.cc:
tensorflow/core/distributed_runtime/tensor_coding.cc: In member function 'bool tensorflow::TensorResponse::ParseTensorSubmessage(google::protobuf::io::CodedInputStream*, tensorflow::TensorProto*)':
tensorflow/core/distributed_runtime/tensor_coding.cc:123:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (num_bytes != buf.size()) return false;
^
ERROR: /usr/local/lib/python2.7/dist-packages/tensorflow/tensorflow/core/kernels/BUILD:1498:1: undeclared inclusion(s) in rule '//tensorflow/core/kernels:batchtospace_op_gpu':
this rule is missing dependency declarations for the following files included by 'tensorflow/core/kernels/batchtospace_op_gpu.cu.cc':
'/usr/local/cuda-8.0/include/cuda_runtime.h'
'/usr/local/cuda-8.0/include/host_config.h'
'/usr/local/cuda-8.0/include/builtin_types.h'
'/usr/local/cuda-8.0/include/device_types.h'
'/usr/local/cuda-8.0/include/host_defines.h'
'/usr/local/cuda-8.0/include/driver_types.h'
'/usr/local/cuda-8.0/include/surface_types.h'
'/usr/local/cuda-8.0/include/texture_types.h'
'/usr/local/cuda-8.0/include/vector_types.h'
'/usr/local/cuda-8.0/include/library_types.h'
'/usr/local/cuda-8.0/include/channel_descriptor.h'
'/usr/local/cuda-8.0/include/cuda_runtime_api.h'
'/usr/local/cuda-8.0/include/cuda_device_runtime_api.h'
'/usr/local/cuda-8.0/include/driver_functions.h'
'/usr/local/cuda-8.0/include/vector_functions.h'
'/usr/local/cuda-8.0/include/vector_functions.hpp'
'/usr/local/cuda-8.0/include/common_functions.h'
'/usr/local/cuda-8.0/include/math_functions.h'
'/usr/local/cuda-8.0/include/math_functions.hpp'
'/usr/local/cuda-8.0/include/math_functions_dbl_ptx3.h'
'/usr/local/cuda-8.0/include/math_functions_dbl_ptx3.hpp'
'/usr/local/cuda-8.0/include/cuda_surface_types.h'
'/usr/local/cuda-8.0/include/cuda_texture_types.h'
'/usr/local/cuda-8.0/include/device_functions.h'
'/usr/local/cuda-8.0/include/device_functions.hpp'
'/usr/local/cuda-8.0/include/device_atomic_functions.h'
'/usr/local/cuda-8.0/include/device_atomic_functions.hpp'
'/usr/local/cuda-8.0/include/device_double_functions.h'
'/usr/local/cuda-8.0/include/device_double_functions.hpp'
'/usr/local/cuda-8.0/include/sm_20_atomic_functions.h'
'/usr/local/cuda-8.0/include/sm_20_atomic_functions.hpp'
'/usr/local/cuda-8.0/include/sm_32_atomic_functions.h'
'/usr/local/cuda-8.0/include/sm_32_atomic_functions.hpp'
'/usr/local/cuda-8.0/include/sm_35_atomic_functions.h'
'/usr/local/cuda-8.0/include/sm_60_atomic_functions.h'
'/usr/local/cuda-8.0/include/sm_60_atomic_functions.hpp'
'/usr/local/cuda-8.0/include/sm_20_intrinsics.h'
'/usr/local/cuda-8.0/include/sm_20_intrinsics.hpp'
'/usr/local/cuda-8.0/include/sm_30_intrinsics.h'
'/usr/local/cuda-8.0/include/sm_30_intrinsics.hpp'
'/usr/local/cuda-8.0/include/sm_32_intrinsics.h'
'/usr/local/cuda-8.0/include/sm_32_intrinsics.hpp'
'/usr/local/cuda-8.0/include/sm_35_intrinsics.h'
'/usr/local/cuda-8.0/include/surface_functions.h'
'/usr/local/cuda-8.0/include/texture_fetch_functions.h'
'/usr/local/cuda-8.0/include/texture_indirect_functions.h'
'/usr/local/cuda-8.0/include/surface_indirect_functions.h'
'/usr/local/cuda-8.0/include/device_launch_parameters.h'
'/usr/local/cuda-8.0/include/cuda_fp16.h'
'/usr/local/cuda-8.0/include/math_constants.h'
'/usr/local/cuda-8.0/include/curand_kernel.h'
'/usr/local/cuda-8.0/include/curand.h'
'/usr/local/cuda-8.0/include/curand_discrete.h'
'/usr/local/cuda-8.0/include/curand_precalc.h'
'/usr/local/cuda-8.0/include/curand_mrg32k3a.h'
'/usr/local/cuda-8.0/include/curand_mtgp32_kernel.h'
'/usr/local/cuda-8.0/include/cuda.h'
'/usr/local/cuda-8.0/include/curand_mtgp32.h'
'/usr/local/cuda-8.0/include/curand_philox4x32_x.h'
'/usr/local/cuda-8.0/include/curand_globals.h'
'/usr/local/cuda-8.0/include/curand_uniform.h'
'/usr/local/cuda-8.0/include/curand_normal.h'
'/usr/local/cuda-8.0/include/curand_normal_static.h'
'/usr/local/cuda-8.0/include/curand_lognormal.h'
'/usr/local/cuda-8.0/include/curand_poisson.h'
'/usr/local/cuda-8.0/include/curand_discrete2.h'.
nvcc warning : option '--relaxed-constexpr' has been deprecated and replaced by option '--expt-relaxed-constexpr'.
nvcc warning : option '--relaxed-constexpr' has been deprecated and replaced by option '--expt-relaxed-constexpr'.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 138.913s, Critical Path: 102.63s

I saw some people complaining about Bazel 0.3.1; maybe you need to downgrade to 0.3.0. The error you posted is not very informative: it's just the parent script saying that a child script failed. There should be more information on the console with the actual error.
I went through the setup steps two days ago for a GTX 1080 and it worked with this config:
Ubuntu 16.04
Nvidia Driver: nvidia-367.35 (installed from .run file)
Bazel 0.3.0
gcc: 4.9.3 (default with 16.04)
CUDA 8.0.27 (installed from .run file into default dirs)
compute capability: (use default values for config)

Related

OpenCV Python on WSL 2 error: "(-217:Gpu API call) CUDA driver version is insufficient for CUDA runtime version"

I am trying to run YOLO v4 on OpenCV 4.6 Python with CUDA on WSL 2 with Ubuntu 20.04 (Focal Fossa):
import cv2 as cv
import time
Conf_threshold = 0.4
NMS_threshold = 0.4
COLORS = [(0, 255, 0), (0, 0, 255), (255, 0, 0),
          (255, 255, 0), (255, 0, 255), (0, 255, 255)]
class_name = []
with open('classes.txt', 'r') as f:
    class_name = [cname.strip() for cname in f.readlines()]
# print(class_name)
net = cv.dnn.readNet('yolov4-tiny.weights', 'yolov4-tiny.cfg')
net.setPreferableBackend(cv.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv.dnn.DNN_TARGET_CUDA_FP16)
model = cv.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/255, swapRB=True)
cap = cv.VideoCapture('output.avi')
starting_time = time.time()
frame_counter = 0
while True:
    ret, frame = cap.read()
    frame_counter += 1
    if ret == False:
        break
    classes, scores, boxes = model.detect(frame, Conf_threshold, NMS_threshold)
    ...
I got the following error:
Traceback (most recent call last):
  File "yolov4.py", line 28, in <module>
    classes, scores, boxes = model.detect(frame, Conf_threshold, NMS_threshold)
cv2.error: OpenCV(4.6.0-dev) /home/tong/source/opencv/modules/dnn/src/cuda4dnn/csl/memory.hpp:54: error: (-217:Gpu API call) CUDA driver version is insufficient for CUDA runtime version in function 'ManagedPtr'
I have checked the Nvidia driver by using nvidia-smi.exe:
NVIDIA-SMI 516.59 Driver Version: 516.59 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... WDDM | 00000000:0D:00.0 On | N/A |
| 24% 51C P0 44W / 215W | 2711MiB / 8192MiB | 6% Default |
| | | N/A
I also checked the installed CUDA toolkit packages with dpkg -l | grep cuda-toolkit:
ii cuda-toolkit-11-6 11.6.2-1 amd64 CUDA Toolkit 11.6 meta-package
ii cuda-toolkit-11-6-config-common 11.6.55-1 all Common config package for CUDA Toolkit 11.6.
ii cuda-toolkit-11-7 11.7.0-1 amd64 CUDA Toolkit 11.7 meta-package
ii cuda-toolkit-11-7-config-common 11.7.60-1 all Common config package for CUDA Toolkit 11.7.
ii cuda-toolkit-11-config-common 11.7.60-1 all Common config package for CUDA Toolkit 11.
ii cuda-toolkit-config-common 11.7.60-1 all Common config package for CUDA Toolkit.
OpenCV should be compiled against CUDA 11.7 according to CMakeCache.txt:
// Version of CUDA as computed from nvcc.
CUDA_VERSION:STRING=11.7
nvcc also points to 11.7:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
How can I fix this problem?
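One way to narrow this down is to check, at runtime, whether the cv2 module that is actually imported was built with CUDA and whether it can see a CUDA device at all. A minimal diagnostic sketch, assuming the same Python environment the YOLO script runs in:

import cv2 as cv

# Show the CUDA/cuDNN lines from the build configuration of the imported cv2 module.
build_info = cv.getBuildInformation()
print('\n'.join(line for line in build_info.splitlines()
                if 'CUDA' in line or 'cuDNN' in line))

# Ask the CUDA runtime linked by OpenCV how many devices it can see.
# A count of 0 (or an exception here) points at a driver/runtime mismatch
# rather than at the DNN module configuration.
print('CUDA-enabled device count:', cv.cuda.getCudaEnabledDeviceCount())

If the device count is 0 while nvidia-smi.exe reports CUDA 11.7, the libcuda.so visible inside WSL is likely older than the 11.7 runtime OpenCV was built against.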

Tensorflow-serving docker container adds the GPU device but GPU has 0% utilization

Hi, I'm having issues with dockerized TF Serving seeing but not using my GPU.
It adds the GPU as device 0, allocates memory on it, but then loads the ML model into CPU device memory and runs inference using only the CPUs. GPU-util on nvidia-smi never leaves 0%.
Could anyone help me figure out why this is happening, and what should be changed?
The setup:
OS: Amazon/Deep Learning AMI (Ubuntu 18.04) on EC2 g4dn.xlarge
GPU: Tesla T4
Model: pretrained gpt2-xl tensorflow from huggingface, which I froze into a SavedModel and uploaded to S3.
Docker: came stock with the Deep Learning AMI. I've already checked and confirmed that nvidia-smi runs containerized, so it's not an nvidia+docker issue.
TF Serving: I use the below Dockerfile to pull the latest-gpu image and download the model directly into it at buildtime:
FROM tensorflow/serving:latest-gpu
RUN apt-get update
ENV TZ=America
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get update
RUN apt-get install -y awscli
ENV AWS_ACCESS_KEY_ID=...
ENV AWS_SECRET_ACCESS_KEY=...
ARG model_name
ENV MODEL_NAME=$model_name
# Use AWS CLI to download the SavedModel into the docker container from S3 bucket
RUN aws s3 cp s3://v3-models/models/pretrained_tf_serving/${MODEL_NAME} /models/${MODEL_NAME} --recursive
EXPOSE 8500
I build and run the above Dockerfile with these commands:
#!/bin/bash
# first build the image with the model_name arg, and tag it as xl-serving
docker build -t xl-serving --build-arg model_name=gpt2-xl ../../model_server
# then run it with gpus, exposing gRPC port
docker run -it --rm --gpus all --runtime nvidia -p 8500:8500 xl-serving
Running the serving container prints this output. Notice that the GPU is added.
2020-11-06 05:25:34.671071: I tensorflow_serving/model_servers/server.cc:87] Building single TensorFlow model file config: model_name: gpt2-xl model_base_path: /models/gpt2-xl
2020-11-06 05:25:34.671274: I tensorflow_serving/model_servers/server_core.cc:464] Adding/updating models.
2020-11-06 05:25:34.671295: I tensorflow_serving/model_servers/server_core.cc:575] (Re-)adding model: gpt2-xl
2020-11-06 05:25:34.771644: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: gpt2-xl version: 1}
2020-11-06 05:25:34.771673: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: gpt2-xl version: 1}
2020-11-06 05:25:34.771687: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: gpt2-xl version: 1}
2020-11-06 05:25:34.771724: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/gpt2-xl/1
2020-11-06 05:25:35.222512: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2020-11-06 05:25:35.222545: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:234] Reading SavedModel debug info (if present) from: /models/gpt2-xl/1
2020-11-06 05:25:35.222672: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-06 05:25:35.223994: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-06 05:25:35.262238: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:35.263132: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-11-06 05:25:35.263149: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-11-06 05:25:35.263236: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:35.264122: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:35.264948: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-11-06 05:25:36.185140: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-06 05:25:36.185165: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2020-11-06 05:25:36.185171: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2020-11-06 05:25:36.185334: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:36.186222: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:36.187046: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:36.187852: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13896 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-11-06 05:25:37.279837: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:199] Restoring SavedModel bundle.
2020-11-06 05:25:56.154008: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:183] Running initialization op on SavedModel bundle at path: /models/gpt2-xl/1
2020-11-06 05:25:57.551535: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:303] SavedModel load for tags { serve }; Status: success: OK. Took 22777844 microseconds.
2020-11-06 05:25:57.832736: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /models/gpt2-xl/1/assets.extra/tf_serving_warmup_requests
2020-11-06 05:25:57.835030: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: gpt2-xl version: 1}
2020-11-06 05:25:57.838329: I tensorflow_serving/model_servers/server.cc:367] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2020-11-06 05:25:57.840415: I tensorflow_serving/model_servers/server.cc:387] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
I then hit this server with a single, non-batched gRPC call. It successfully runs and returns a correct GPT2 output. However, it takes as long as the same setup takes on a CPU. htop shows that 8 GB of RAM (the gpt2-xl model size) is loaded into host memory. It then shows the TF Serving process running and maxing out one or two CPU cores. It appears to run only on the CPU.
This is what nvidia-smi looks like while the call is running. Notice the allocated memory, and 0% GPU-Util:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 36C P0 26W / 70W | 14240MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 13357 C tensorflow_model_server 14221MiB |
+-----------------------------------------------------------------------------+
I've scoured the web and can't find any advice for this. The closest I found was this GitHub issue: GPU utilization with TF serving #1440, for which the fixes did not work for me. They were dealing with low GPU utilization; I'm dealing with 0%.
Any advice on what the issue is?
Thank you very much. I've been banging my head against the wall for days on this, so I very much appreciate your help :)
Update #1 :
I've written a python script (below) to use tensorflow==2.3.0 to load the model and run it. It's running in a conda env with CUDA=11.0. It successfully runs inference on the GPU, and it's a good 15x faster than what I'm getting on TF-serving.
import tensorflow as tf
import numpy as np
model = tf.saved_model.load('/home/ubuntu/models/gpt2-xl/1/')
servable = model.signatures["forward"]
# Create input tensor
tensor_in = tf.constant([[198, 15667, 6530, 25, 29437, 1706, 1610, 977, 948, 33611]])
# Run a loop of 10 inferences on the model, to predict the next 10 tokens.
for i in range(10):
    pred = servable(tensor_in)
    logits = pred['output_0']
    logits = logits[:, -1, :] / 0.8
    next_id = tf.random.categorical(tf.nn.log_softmax(logits, axis=-1), num_samples=1)
    next_id = tf.dtypes.cast(next_id, tf.int32).numpy()
    tensor_in = np.concatenate((tensor_in, next_id), axis=1)
Up next: I will try running TF Serving outside of a container. Update to come...
How did you save your model? Add clear_devices=True when saving the model and try again.
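For reference, below is a minimal sketch of what re-exporting a TF1-style frozen model with clear_devices=True could look like; the checkpoint paths, export directory, and the omitted signature_def_map are placeholders and assumptions, not the asker's actual files.

import tensorflow as tf

# Hypothetical paths -- substitute the real meta graph/checkpoint and a new version dir.
meta_path = '/path/to/model.ckpt.meta'
ckpt_path = '/path/to/model.ckpt'
export_dir = '/models/gpt2-xl-cleared/1'

with tf.compat.v1.Session(graph=tf.compat.v1.Graph()) as sess:
    # clear_devices=True drops any hard-coded /device:CPU:0 or /device:GPU:0
    # placements that were baked into the graph when it was frozen.
    saver = tf.compat.v1.train.import_meta_graph(meta_path, clear_devices=True)
    saver.restore(sess, ckpt_path)

    builder = tf.compat.v1.saved_model.builder.SavedModelBuilder(export_dir)
    # In practice, also pass the model's signature_def_map (e.g. the "forward"
    # signature used in the question) so TF Serving can find an entry point.
    builder.add_meta_graph_and_variables(
        sess,
        [tf.compat.v1.saved_model.tag_constants.SERVING],
        clear_devices=True)
    builder.save()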

Ubuntu 18.04: GNU parallel can't find and use the full number of cores on a local system

I am using GNU parallel (version 20200522) on Ubuntu Linux 18.04.4 and running jobs on all cores of a local server minus 2 cores, that is I am using the -j-2 parameter.
find /folder/ -type f -iname "*.pdf" | parallel -j-2 --nice 2 "script.sh {1} {1/.}; mv -f -v {1} /folder2/; mv -f {1/.}.txt /folder3/" :::: -
However, the program shows
Error: Cannot run any jobs.
I tried using the -j100% parameter and saw that it uses just 1 core (1 job), so I deduce that, for GNU parallel, 100% of the available cores on this system is just one core.
If I use the -j5 parameter (which does not imply autodetection of the total number of cores), everything is alright, parallel launches 5 jobs and uses 5 cores.
The interesting part is that the file /root/.parallel/tmp/sshlogin/MACHINE_NAME/cpuspec contains the following:
1
6
6
which means, I think, that GNU parallel should see 6 available cores.
I have tried deleting the cpuspec file and running parallel again to redetect the total number of cores, but the cpuspec file and the behavior of the program remain the same.
On different systems, deleting the cpuspec file solved all issues, but on this particular system it is not working. The virtual machine was copied from another server with a different configuration, which is why I need to delete the cpuspec file.
What should I do to get GNU parallel to correctly detect the number of cores on the system, so that I can use the -j-2 parameter?
Update 21.07:
After deleting the folder containing the cpuspec file once again, running the parallel --number-of-sockets/cores/threads commands, and using the -S 6/: parameter just once, the problem seems to have resolved itself. Now GNU parallel correctly detects the number of cores and the -j-2 parameter works.
I am not sure what fixed it, but I am no longer able to reproduce the bug.
Ole, thank you for your answer. If I meet the bug again or if I am able to reproduce it, I will post it here.
And here is the output to the commands:
parallel --number-of-sockets
1
parallel --number-of-cores
6
parallel --number-of-threads
6
cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
stepping : 2
microcode : 0xffffffff
cpu MHz : 2397.218
cache size : 15360 KB
physical id : 0
siblings : 6
core id : 0
cpu cores : 6
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl cpuid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm pti fsgsbase bmi1 avx2 smep bmi2 erms xsaveopt
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 4794.43
clflush size : 64
cache_alignment : 64
address sizes : 42 bits physical, 48 bits virtual
power management:
And it repeats itself for 5 more cores.
You may have found a bug. Please post the output of:
cat /proc/cpuinfo
parallel --number-of-sockets
parallel --number-of-cores
parallel --number-of-threads
Also see if you can make an MCVE.
As a workaround you can use -S 6/: to force GNU Parallel to detect 6 cores on your system.
find /folder/ -type f -iname "*.pdf" |
parallel -S 6/: -j-2 --nice 2 "script.sh {1} {1/.}; mv -f -v {1} /folder2/; mv -f {1/.}.txt /folder3/"
(Also, :::: - can be left out completely: if there is no ::: or ::::, GNU Parallel reads from stdin.)

Dynamically decide which GPU to run on - TF on NVIDIA docker

I have a queue of models, which I allow only 2 to be executed in parallel, since I have 2 GPUs.
For that, at the beginning of my code I try to determine which GPU is available by using GPUtil. Maybe it's relevant: this code is run inside a docker container that was launched using the --runtime=nvidia flag.
The code that determines which GPU to run on, looks like this:
import os
import GPUtil
gpu1, gpu2 = GPUtil.getGPUs()
available_gpu = gpu1 if gpu1.memoryFree > gpu2.memoryFree else gpu2
os.environ['CUDA_VISIBLE_DEVICES'] = str(available_gpu.id)
import tensorflow as tf
Now, I launched two scripts this way (with a slight delay until the first one occupied a GPU) but both of them tried to use the same GPU!
I went further to examine the problem - I manually set the os.environ['CUDA_VISIBLE_DEVICES'] = '1' and let the model run.
As it was training, I checked the output of nvidia-smi and saw the following
user@server:~$ docker exec awesome_gpu_container nvidia-smi
Mon Mar 12 06:59:27 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 Ti Off | 00000000:01:00.0 On | N/A |
| 0% 50C P2 131W / 280W | 5846MiB / 6075MiB | 81% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 00000000:03:00.0 Off | N/A |
| 0% 39C P8 14W / 200W | 2MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
And I notice that although I've set the visible device to 1, it is actually running on GPU 0.
I stress again that my goal is, while queuing multiple models, for each one that starts running to decide for itself which GPU to use.
I explored allow_soft_placement=True, but that allocated the memory on both GPUs so I stopped the process.
Bottom line, how can I make sure my training scripts only use one GPU, and make them choose the free one?
As described in the CUDA programming guide, the default device enumeration used by CUDA is "fastest first":
CUDA_DEVICE_ORDER
FASTEST_FIRST, PCI_BUS_ID (default is FASTEST_FIRST)
FASTEST_FIRST causes CUDA to guess which device is
fastest using a simple heuristic, and make that device 0, leaving the
order of the rest of the devices unspecified.
PCI_BUS_ID orders devices by PCI bus ID in ascending order.
If you set CUDA_DEVICE_ORDER=PCI_BUS_ID, the CUDA ordering will match the device ordering shown by nvidia-smi.
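Applied to the snippet from the question, that amounts to pinning the device order before anything initializes CUDA; a minimal sketch reusing the question's GPUtil logic:

import os

# Enumerate devices by PCI bus ID so the ids match nvidia-smi and GPUtil.
# Both variables must be set before TensorFlow (and hence CUDA) is imported.
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'

import GPUtil

gpu1, gpu2 = GPUtil.getGPUs()
available_gpu = gpu1 if gpu1.memoryFree > gpu2.memoryFree else gpu2
os.environ['CUDA_VISIBLE_DEVICES'] = str(available_gpu.id)

import tensorflow as tf  # imported only after the environment is set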
Since you are using docker, you can also enforce a stronger isolation with our runtime:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 ...
But that's at container startup time.

Torque jobs cannot find GPU when CUDA_VISIBLE_DEVICES not equal 0

I'm facing a strange problem with Torque assignment of GPUs.
I'm running Torque 6.1.0 on a single machine that has two NVIDIA GTX Titan X GPUs. I'm using pbs_sched for scheduling. nvidia-smi output at rest is as follows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:03:00.0 On | N/A |
| 22% 40C P8 15W / 250W | 0MiB / 12204MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 33C P8 14W / 250W | 0MiB / 12207MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I have a simple test script to assess GPU assignment as follows:
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1:gpus=1:reseterr:exclusive_process
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
deviceQuery is the utility that comes with CUDA. When I run it from the command line, it correctly finds both GPUs. When I restrict to one device from the command-line like this...
CUDA_VISIBLE_DEVICES=0 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
#or
CUDA_VISIBLE_DEVICES=1 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
... it also correctly finds one or the other GPU.
When I submit test.sh to the queue with qsub, and when no other jobs are running, it again works correctly. Here's the output:
CUDA_VISIBLE_DEVICES: 0
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX TITAN X" CUDA Driver Version / Runtime Version 8.0 / 8.0 CUDA Capability Major/Minor version number: 5.2 Total amount of global memory: 12204 MBytes (12796887040 bytes) (24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores GPU Max Clock rate: 1076 MHz (1.08 GHz) Memory Clock rate: 3505 Mhz Memory Bus Width: 384-bit L2 Cache Size: 3145728 bytes Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: No Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Disabled Device supports Unified Addressing (UVA): Yes Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0 Compute Mode:
< Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX TITAN X Result = PASS
However, if a job is already running on gpu0 (i.e. if the new job is assigned CUDA_VISIBLE_DEVICES=1), the new job cannot find any GPUs. Output:
CUDA_VISIBLE_DEVICES: 1
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
Anyone know what is going on here?
I think I've solved my own problem, but unfortunately I tried two things at once. I don't want to go back and confirm which solved the issue. It's one of the following:
Remove the --enable-cgroups option from Torque's configure script before building.
Run these steps in the Torque install process:
make packages
sh torque-package-server-linux-x86_64.sh --install
sh torque-package-mom-linux-x86_64.sh --install
sh torque-package-clients-linux-x86_64.sh --install
For the second option, I know that these steps are properly documented in the Torque install instructions. However, I have a simple setup with just a single node (the compute node and server are the same machine). I thought that 'make install' should do everything that the package installs do for that single node, but maybe I was mistaken.
