Installing TensorFlow-GPU - nvidia

I'm trying to install tensorflow-gpu. The problem is that I have the nvidia-375.82 driver, while TensorFlow requires 375.66.
When I got this error
ImportError: libnvidia-fatbinaryloader.so.375.66: cannot open shared object file: No such file or directory
I tried to create a symlink:
sudo ln -s /usr/lib/nvidia-375/libnvidia-fatbinaryloader.so.375.82 /usr/lib/nvidia-375/libnvidia-fatbinaryloader.so.375.66
It helps to avoid the ImportError, but nothing more. If I try to run something:
import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
I get the result computed on the CPU, and it prints:
2017-10-07 15:56:03.329769: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.329832: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.329850: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.329864: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.329878: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.429055: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2017-10-07 15:56:03.429198: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: sklert-new-comp
2017-10-07 15:56:03.429226: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: sklert-new-comp
2017-10-07 15:56:03.429317: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 375.66.0
2017-10-07 15:56:03.429384: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.82 Wed Jul 19 21:16:49 PDT 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
"""
2017-10-07 15:56:03.429446: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 375.82.0
2017-10-07 15:56:03.429473: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] kernel version 375.82.0 does not match DSO version 375.66.0 -- cannot find working devices in this configuration
Device mapping: no known devices.
2017-10-07 15:56:03.430336: I tensorflow/core/common_runtime/direct_session.cc:300] Device mapping:
MatMul: (MatMul): /job:localhost/replica:0/task:0/cpu:0
2017-10-07 15:56:03.467133: I tensorflow/core/common_runtime/simple_placer.cc:872] MatMul: (MatMul)/job:localhost/replica:0/task:0/cpu:0
b: (Const): /job:localhost/replica:0/task:0/cpu:0
2017-10-07 15:56:03.467201: I tensorflow/core/common_runtime/simple_placer.cc:872] b: (Const)/job:localhost/replica:0/task:0/cpu:0
a: (Const): /job:localhost/replica:0/task:0/cpu:0
2017-10-07 15:56:03.467226: I tensorflow/core/common_runtime/simple_placer.cc:872] a: (Const)/job:localhost/replica:0/task:0/cpu:0
[[ 22. 28.]
[ 49. 64.]]
Is there any way to use TensorFlow with the GPU without downgrading?
UPDATE:
It seems the problem is not in TensorFlow but in the NVIDIA drivers:
sudo dmesg | grep NVRM
[ 1.267417] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 375.82 Wed Jul 19 21:16:49 PDT 2017 (using threaded interrupts)
[ 108.803115] NVRM: API mismatch: the client has the version 375.66, but
NVRM: this kernel module has the version 375.82. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
[ 1419.021917] NVRM: API mismatch: the client has the version 375.66, but
NVRM: this kernel module has the version 375.82. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
Some driver libraries have a different version:
locate 375.66
/usr/lib/i386-linux-gnu/libcuda.so.375.66
/usr/lib/i386-linux-gnu/libnvidia-opencl.so.375.66
/usr/lib/nvidia-375/libnvidia-fatbinaryloader.so.375.66
/usr/lib/x86_64-linux-gnu/libcuda.so.375.66
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.375.66
/usr/lib32/nvidia-375/libnvidia-fatbinaryloader.so.375.66
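One way to attack this without downgrading, sketched under the assumption that the stale 375.66 user-space libraries came from an old package install (the package names below are Ubuntu 16.04 assumptions and may differ on your system):

# Which libcuda.so does the dynamic linker actually resolve?
ldconfig -p | grep libcuda

# Reinstall the user-space driver packages so every component matches
# the 375.82 kernel module (package names are an assumption):
sudo apt-get install --reinstall nvidia-375 libcuda1-375
sudo ldconfig

If libcuda.so still resolves to the 375.66 copy in /usr/lib/x86_64-linux-gnu, that stale file is what TensorFlow dlopens, and replacing it should make the DSO version match the kernel module.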

Related

Nvidia NVML Driver/library version mismatch: dkms, modules, drivers and modinfo versions

I have the Nvidia NVML Driver/library version mismatch error, for which I recommend reading the following answers:
https://stackoverflow.com/a/67105064/1782553
https://stackoverflow.com/a/71672261/1782553
However, after purging and re-installing the nvidia packages (necessary because of the issue described here), I still got the error. A reboot resolved it, but I'd like to understand why.
Also, I'm not sure I understand the meaning of the four following commands and the NVIDIA versions they display.
Before reboot
$ cat /sys/module/nvidia/version
515.65.01
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.65.01 Wed Jul 20 14:00:58 UTC 2022
GCC version: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
$ dkms status
nvidia, 515.65.07, 5.4.0-131-generic, x86_64: installed
$ modinfo nvidia | grep ^version
version: 515.65.07
After reboot
$ cat /sys/module/nvidia/version
515.65.07
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.65.07 Thu Sep 22 00:22:12 UTC 2022
GCC version: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
$ dkms status
nvidia, 515.65.07, 5.4.0-131-generic, x86_64: installed
$ modinfo nvidia | grep ^version
version: 515.65.07
I understand there's a difference between the loaded driver version and the driver version that will be loaded after the next reboot, but I would have guessed that the modinfo version would match the cat /proc/... version, or that the dkms version would match the cat /sys/... version, but they don't…
Could someone make a brief summary of what each command is reading?
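A brief summary, as I understand it (a sketch, not an authoritative reference):

# /sys/module/nvidia/version: the version of the nvidia module currently
# loaded in the running kernel (sysfs reflects live kernel state).
cat /sys/module/nvidia/version

# /proc/driver/nvidia/version: also the *loaded* module, i.e. the NVRM
# banner it registered when inserted, plus the compiler that built it.
cat /proc/driver/nvidia/version

# dkms status: what DKMS has built and installed into the module tree for
# each kernel, i.e. what will be loaded on the next boot or modprobe.
dkms status

# modinfo nvidia: reads the .ko file on disk that modprobe would pick now;
# after an upgrade this differs from the loaded module until reload/reboot.
modinfo nvidia | grep ^version

This matches your output: before the reboot the kernel still had the old 515.65.01 module loaded (the /sys and /proc reads), while dkms and modinfo already pointed at the freshly installed 515.65.07 on disk; after the reboot all four agree.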

Keras running in Docker very slow and crashes - ValueError: Feature my_feature is not in features dictionary

I can run a Keras neural net locally on my Windows 10 laptop fine.
But the same code running in Docker is extremely slow and always crashes with the error:
ValueError: Feature my_feature is not in features dictionary.
The feature not found is always the target feature.
There are version differences between the laptop and the container, but I'm not convinced this has a bearing.
Laptop
Windows 10 Enterprise 64bit
Intel Core i7-7820HQ @ 2.90GHz
16GB RAM
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
λ pip list | grep tensorflow
tensorflow 2.0.0
tensorflow-estimator 2.0.1
λ pip list | grep pandas
pandas 0.23.3
pandas-ml 0.6.1
λ pip list | grep numpy
numpy 1.17.4
Docker
# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
VERSION_CODENAME=stretch
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Python 3.6.10 (default, Apr 23 2020, 15:40:23)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
root@modelbuilder:~# pip list | grep tensorflow
tensorflow 2.3.0
tensorflow-estimator 2.3.0
root@modelbuilder:~# pip list | grep pandas
pandas 0.24.0
pandas-ml 0.6.1
root@modelbuilder:~# pip list | grep numpy
numpy 1.19.2
I verified what was mentioned here: ValueError: Feature not in features dictionary
The target is not being fed into the feature columns, the features correspond, etc., and that mistake would also fail locally.
Any help will be much appreciated.
Figured this out.
Crash issue:
I had mistakenly created a feature column for the target, so I removed the target from the features used to build the columns.
Slow Docker:
I was running model.fit() over and over (many times).
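For anyone hitting the same thing, a minimal sketch of the fix (the column names and data here are hypothetical, not from the original code): pop the target out of the DataFrame before building the feature columns and the input pipeline, and call model.fit() once with epochs rather than in a loop.

import pandas as pd
import tensorflow as tf

# Hypothetical data: "my_feature" is the target and must not become a
# feature column or stay inside the features dict.
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0],
                   "x2": [4.0, 5.0, 6.0],
                   "my_feature": [0, 1, 0]})

target = df.pop("my_feature")  # removes the target from the features

# Build feature columns from the remaining (true) feature names only.
feature_columns = [tf.feature_column.numeric_column(c) for c in df.columns]

dataset = tf.data.Dataset.from_tensor_slices((dict(df), target.values)).batch(32)

# Train once, with epochs, instead of calling fit() many times in a loop:
# model.fit(dataset, epochs=10)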

I can't find the configuration files of Wireshark on CentOS 7

I installed Wireshark on CentOS 7, but I cannot find the configuration files. By default they should be in "/usr/local/share/Wireshark", but that directory doesn't exist.
[root@localhost share]# tshark -v
TShark 1.10.14 (Git Rev Unknown from unknown)
Copyright 1998-2015 Gerald Combs <gerald@wireshark.org> and contributors.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Compiled (64-bit) with GLib 2.54.2, with libpcap, with libz 1.2.7, with POSIX
capabilities (Linux), without libnl, with SMI 0.4.8, with c-ares 1.10.0, with
Lua 5.1, without Python, with GnuTLS 3.3.26, with Gcrypt 1.5.3, with MIT
Kerberos, without GeoIP.
Running on Linux 3.10.0-862.el7.x86_64, with locale en_US.UTF-8, with libpcap
version 1.5.3, with libz 1.2.7.
Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Built using gcc 4.8.5 20150623 (Red Hat 4.8.5-36).
[root@localhost ~]# cd /usr/local/share/
[root@localhost share]# ll
total 0
drwxr-xr-x. 2 root root 28 Mar 14 16:39 applications
drwxr-xr-x. 2 root root 6 Apr 11 2018 info
drwxr-xr-x. 21 root root 243 Feb 11 2019 man
[root@localhost ~]# ll /etc/ | grep wireshark
[root@localhost ~]#
To check where tshark config files are stored, use tshark -G folders. As an example, this is what I see on my system:
ubuntu$ tshark -G folders
Temp: /tmp
Personal configuration: /home/rj/.config/wireshark
Global configuration: /usr/share/wireshark
System: /etc
Program: /usr/bin
Personal Plugins: /home/rj/.local/lib/wireshark/plugins/2.6
Global Plugins: /usr/lib/x86_64-linux-gnu/wireshark/plugins/2.6
Personal Lua Plugins: /home/rj/.local/lib/wireshark/plugins
Global Lua Plugins: /usr/lib/x86_64-linux-gnu/wireshark/plugins
Extcap path: /usr/lib/x86_64-linux-gnu/wireshark/extcap
MaxMind database path: /usr/share/GeoIP
MaxMind database path: /var/lib/GeoIP
MaxMind database path: /usr/share/GeoIP
More information is available on using Wireshark configuration files, as well as in the official docs, which detail the file types.
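If you only want the configuration directories out of that listing, a trivial filter (the label text may vary slightly between versions):

tshark -G folders | grep -i 'configuration'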

Missing nvcc compiler - theano

I use Ubuntu 14.04 and CUDA 7.5. I get the CUDA version information using $ nvcc --version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
$PATH and $LD_LIBRARY_PATH are below:
$ echo $PATH
/usr/local/cuda-7.5/bin:/usr/local/cuda-7.5/bin/:/opt/ros/indigo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
$ echo $LD_LIBRARY_PATH
/usr/local/cuda-7.5/lib64
I installed Theano. I can use it with the CPU but not the GPU. This guide says:
Testing Theano with GPU

To see if your GPU is being used, cut and paste the following program into a file and run it.

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

The program just computes the exp() of a bunch of random numbers. Note that we use the shared function to make sure that the input x is stored on the graphics device.
If I run this program (in check1.py) with device=cpu, my computer takes a little over 3 seconds, whereas on the GPU it takes just over 0.64 seconds. The GPU will not always produce the exact same floating-point numbers as the CPU. As a benchmark, a loop that calls numpy.exp(x.get_value()) takes about 46 seconds.

$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
[Elemwise{exp,no_inplace}()]
Looping 1000 times took 3.06635117531 seconds
Result is [ 1.23178029 1.61879337 1.52278066 ..., 2.20771813 2.29967761 1.62323284]
Used the cpu

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.638810873032 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323296]
Used the gpu

Note that GPU operations in Theano require for now floatX to be float32 (see also below).
When I run the GPU command without sudo, it throws a permission-denied error:
/theano/gof/cmodule.py", line 741, in refresh
files = os.listdir(root)
OSError: [Errno 13] Permission denied: '/home/user/.theano/compiledir_Linux-3.16--generic-x86_64-with-Ubuntu-14.04-trusty-x86_64-2.7.6-64/tmp077r7U'
If I run it with sudo, the compiler cannot find the nvcc path:
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
How can I fix this error?
Try running
chown -R user /home/user/.theano
chmod -R 775 /home/user/.theano
This will fix the permissions on the folder that your Python script can't access: the first command makes the folder belong to your user, and the second makes it readable, writable, and executable by that user.
Regarding this error only:
You can check where your nvcc is installed; the default path is /usr/local/cuda/bin. If you can see it there, then do as below:
$ export PATH="/usr/local/cuda/bin:$PATH"
$ source ~/.bashrc
This worked for me, and now nvcc is no longer missing.
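Note also that in the question the nvcc error only appeared under sudo, and sudo resets PATH by default. A hedged workaround sketch (secure_path in /etc/sudoers may still override PATH on some systems):

# Preserve PATH and LD_LIBRARY_PATH across sudo so nvcc stays visible:
sudo env "PATH=$PATH" "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" \
    THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py

With the chown fix above, though, sudo should not be needed at all.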

Cross Compile OpenCV for Rpi2 with Java bindings

What I'm trying to do is cross-compile OpenCV from an x86 host machine to an ARM target machine (a Raspberry Pi 2) with Java bindings.
All I've achieved is compiling OpenCV with Java bindings for the x86 platform, or OpenCV with no Java bindings for the ARM platform. However, I cannot compile OpenCV with Java bindings for the ARM platform.
I've kind of followed thousands of guides to do this. This is from OpenCV's official site, and seems to be very simple: http://docs.opencv.org/2.4/doc/tutorials/introduction/crosscompilation/arm_crosscompile_with_cmake.html
My host machine is the following:
$ uname -a:
Linux ubuntu 4.2.0-16-generic #19-Ubuntu SMP Thu Oct 8 14:46:51 UTC 2015 i686 i686 i686 GNU/Linux
I've installed gcc and g++ cross compilation tools (gnueabi and gnueabihf):
$ sudo apt-get install gcc-arm-linux-gnueabi
$ sudo apt-get install g++-arm-linux-gnueabi
$ sudo apt-get install gcc-arm-linux-gnueabihf
$ sudo apt-get install g++-arm-linux-gnueabihf
$ which arm-linux-gnueabihf-gcc
/usr/bin/arm-linux-gnueabihf-gcc
$ which arm-linux-gnueabihf-g++
/usr/bin/arm-linux-gnueabihf-g++
Since I want to compile OpenCV with the Java bindings, I installed the JDK and Ant:
$ sudo apt-get install openjdk-7-jre
$ sudo apt-get install openjdk-7-jdk
$ sudo apt-get install ant
Then I added these lines to my .bashrc file:
JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
PATH=$JAVA_HOME/bin:$PATH
export PATH
Then I execute:
$ source $HOME/.bashrc
I've downloaded OpenCV's source code and moved to the platforms/linux folder as the official guide does:
$ cd ~/opencv/platforms/linux
$ mkdir -p build_hardfp
$ cd build_hardfp
Then, in the "opencv/platforms/arm-gnueabi.toolchain.cmake" file, I replaced these lines:
set(CMAKE_C_COMPILER arm-linux-gnueabi${FLOAT_ABI_SUFFIX}-gcc-${GCC_COMPILER_VERSION})
set(CMAKE_CXX_COMPILER arm-linux-gnueabi${FLOAT_ABI_SUFFIX}-g++-${GCC_COMPILER_VERSION})
with these:
set(CMAKE_C_COMPILER /usr/bin/arm-linux-gnueabihf-gcc)
set(CMAKE_CXX_COMPILER /usr/bin/arm-linux-gnueabihf-g++)
This is in order to use the ARM compiler instead of the x86 one.
Then I try to get cmake ready:
$ cmake -DBUILD_SHARED_LIBS=OFF -D BUILD_NEW_PYTHON_SUPPORT=NO -DCMAKE_TOOLCHAIN_FILE=../arm-gnueabi.toolchain.cmake ../../..
CMake's output says it will use the ARM cross compiler for the ARM target, but that it will not build the Java bindings:
...
-- Platform:
-- Host: Linux 4.2.0-16-generic i686
-- Target: Linux 1 arm
-- CMake: 3.2.2
-- CMake generator: Unix Makefiles
-- CMake build tool: /usr/bin/make
-- Configuration: Release
...
C++ Compiler: /usr/bin/arm-linux-gnueabihf-g++ (ver 5.2.1)
...
-- OpenCV modules:
-- To be built: core flann imgproc highgui features2d calib3d ml video legacy objdetect photo gpu ocl nonfree contrib stitching superres ts videostab
-- Disabled: world
-- Disabled by dependency: -
-- Unavailable: androidcamera dynamicuda java python viz
So I tried to set the CMake compiler variables myself, without using CMake's toolchain file:
$ export CMAKE_C_COMPILER=/usr/bin/arm-linux-gnueabihf-gcc
$ export CMAKE_CXX_COMPILER=/usr/bin/arm-linux-gnueabihf-g++
$ cmake -DBUILD_SHARED_LIBS=OFF -D BUILD_NEW_PYTHON_SUPPORT=NO ../../..
Now CMake's output says it will include Java support, but it won't use the ARM cross compiler:
...
-- Platform:
-- Host: Linux 4.2.0-16-generic i686
-- CMake: 3.2.2
-- CMake generator: Unix Makefiles
-- CMake build tool: /usr/bin/make
-- Configuration: Release
...
C++ Compiler: /usr/bin/c++ (ver 5.2.1)
...
-- OpenCV modules:
-- To be built: core flann imgproc highgui features2d calib3d ml video legacy objdetect photo gpu ocl nonfree contrib java stitching superres ts videostab
-- Disabled: world
-- Disabled by dependency: -
-- Unavailable: androidcamera dynamicuda python viz
Of course, if I run make with this latest CMake configuration, this is the .so file I get:
$ readelf -h lib/libopencv_java249.so | grep Machine
Machine: Intel 80386
whereas it should say:
Machine: ARM
So, once again: I can compile OpenCV with Java bindings for the x86 platform, or OpenCV with no Java bindings for the ARM platform, but not both.
How should I do this?
Thank you!
UPDATE 1:
@Notlikethat I forgot to mention that I had already tried that (i.e. using the ARM JDK instead of the x86 one). I did not mention it because I thought I should be using x86.
However, I have tried it again:
I downloaded the ARM JDK, set the JAVA_HOME and PATH variables to point at this new JDK, and tried the cmake command again.
The result is the same: it lets me compile for ARM without Java bindings, or for x86 with Java bindings.
UPDATE 2:
I've added the following variables to the "arm-gnueabi.toolchain.cmake" file:
set(JAVA_HOME /usr/lib/jvm/jdk1.7.0_60_ARM)
set(JAVA_AWT_LIBRARY $JAVA_HOME/include/jawt.h)
set(JAVA_JVM_LIBRARY $JAVA_HOME/jre/lib/arm/jvm.cfg)
set(JAVA_INCLUDE_PATH $JAVA_HOME/include/jni.h)
set(JAVA_INCLUDE_PATH2 $JAVA_HOME/include/linux/jni_md.h)
set(JAVA_AWT_INCLUDE_PATH $JAVA_HOME/include/jawt.h)
Now if I execute:
$ cmake -DBUILD_SHARED_LIBS=OFF -D BUILD_NEW_PYTHON_SUPPORT=NO -DCMAKE_TOOLCHAIN_FILE=../arm-gnueabi.toolchain.cmake ../../..
the output shows that the java module is still unavailable, but at least one of its key dependencies (JNI) is now found:
...
-- Java:
-- ant: NO
-- JNI: $JAVA_HOME/include/jni.h $JAVA_HOME/include/linux/jni_md.h $JAVA_HOME/include/jawt.h
-- Java tests: NO
...
I'm pretty sure the problem here is that ant is not found, which I can't understand. Ant is installed and on the PATH:
$ echo $PATH
/usr/lib/jvm/jdk1.7.0_60_ARM/bin:/opt/apache/ant/apache-ant-1.9.6/bin:...
I've retried by adding the following variables to the "arm-gnueabi.toolchain.cmake" file, without success:
set(ANT_HOME /opt/apache/ant/apache-ant-1.9.6)
set(JAVA_ANT $ANT_HOME/bin/ant)
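One detail stands out in the output above: the JNI line prints a literal $JAVA_HOME, which suggests CMake never expanded the variable; CMake only expands ${VAR}, not $VAR. A hedged sketch of the toolchain additions with that fixed (paths are the ones from this question; FindJNI conventionally wants include directories and library files, and ANT_EXECUTABLE is, as far as I can tell, the variable OpenCV's build checks):

set(JAVA_HOME /usr/lib/jvm/jdk1.7.0_60_ARM)

# Use ${VAR} so CMake actually expands the variable; point the include
# variables at directories and the library variables at .so files:
set(JAVA_INCLUDE_PATH     ${JAVA_HOME}/include)
set(JAVA_INCLUDE_PATH2    ${JAVA_HOME}/include/linux)
set(JAVA_AWT_INCLUDE_PATH ${JAVA_HOME}/include)
set(JAVA_AWT_LIBRARY      ${JAVA_HOME}/jre/lib/arm/libjawt.so)
set(JAVA_JVM_LIBRARY      ${JAVA_HOME}/jre/lib/arm/server/libjvm.so)

# Point the build at ant explicitly instead of relying on PATH:
set(ANT_EXECUTABLE /opt/apache/ant/apache-ant-1.9.6/bin/ant)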
