Docker GPU-enabled version (19.03+) does not load TensorFlow successfully - docker

I want to use Docker 19.03 or above in order to have GPU support. I currently have Docker 19.03.12 on my system. I can run this command to check that the NVIDIA drivers are working:
docker run -it --rm --gpus all ubuntu nvidia-smi
Wed Jul 1 14:25:55 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64 Driver Version: 430.64 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 107... Off | 00000000:01:00.0 Off | N/A |
| 26% 54C P5 13W / 180W | 734MiB / 8119MiB | 39% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Also, when run locally my module works with GPU support just fine. But if I build a Docker image and try to run it, I get this message:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
I am using CUDA 9.0 with TensorFlow 1.12.0, but I can switch to CUDA 10.0 with TensorFlow 1.15.
As I understand it, the problem is probably that I am using an older Dockerfile whose commands are not compatible with the new GPU-enabled Docker versions (19.03 and above).
The actual commands are these:
FROM nvidia/cuda:9.0-base-ubuntu16.04
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
cuda-command-line-tools-9-0 \
cuda-cublas-9-0 \
cuda-cufft-9-0 \
cuda-curand-9-0 \
cuda-cusolver-9-0 \
cuda-cusparse-9-0 \
libcudnn7=7.0.5.15-1+cuda9.0 \
libnccl2=2.2.13-1+cuda9.0 \
libfreetype6-dev \
libhdf5-serial-dev \
libpng12-dev \
libzmq3-dev \
pkg-config \
software-properties-common \
unzip \
&& \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN apt-get update && \
apt-get install nvinfer-runtime-trt-repo-ubuntu1604-4.0.1-ga-cuda9.0 && \
apt-get update && \
apt-get install libnvinfer4=4.1.2-1+cuda9.0
I could not find a base Dockerfile for basic GPU usage either.
In this answer there was a proposal for exposing libcuda.so.1 but it did not work in my case.
So, is there any workaround for this problem or a base dockerfile to adjust to?
My system is Ubuntu 16.04.
Edit:
I just noticed that nvidia-smi from within docker does not display any cuda version:
CUDA Version: N/A
in contrast with the one run locally. So this probably means that no CUDA is loaded inside Docker for some reason.

tldr;
A base Dockerfile which seems to work with docker 19.03+ & cuda 10 is this:
FROM nvidia/cuda:10.0-base
which can be combined with TF 1.14, but for some reason I could not get TF 1.15 working.
I just used this Dockerfile to test it:
FROM nvidia/cuda:10.0-base
CMD nvidia-smi
longer answer:
Well, after a lot of trial and error (and frustration) I managed to make it work for docker 19.03.12 + CUDA 10 (although with TF 1.14, not 1.15).
I followed the code from this post and used the base Dockerfiles provided there.
First I tried to check the nvidia-smi from within docker using Dockerfile:
FROM nvidia/cuda:10.0-base
CMD nvidia-smi
$ docker build -t gpu_test .
...
$ docker run -it --gpus all gpu_test
Fri Jul 3 07:31:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64 Driver Version: 430.64 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 107... Off | 00000000:01:00.0 Off | N/A |
| 45% 65C P2 142W / 180W | 8051MiB / 8119MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
which finally seems to find the CUDA binaries: CUDA Version: 10.1.
Then, I made a minimal Dockerfile which could test the successful loading of tensorflow binary libraries within docker:
FROM nvidia/cuda:10.0-base
# The following just declare variables and ultimately select which python/pip to use
ARG USE_PYTHON_3_NOT_2=True
ARG _PY_SUFFIX=${USE_PYTHON_3_NOT_2:+3}
ARG PYTHON=python${_PY_SUFFIX}
ARG PIP=pip${_PY_SUFFIX}
RUN apt-get update && apt-get install -y \
${PYTHON} \
${PYTHON}-pip
RUN ${PIP} install tensorflow_gpu==1.14.0
COPY bashrc /etc/bash.bashrc
RUN chmod a+rwx /etc/bash.bashrc
WORKDIR /src
COPY *.py /src/
ENTRYPOINT ["python3", "tf_minimal.py"]
and tf_minimal.py was simply:
import tensorflow as tf
print(tf.__version__)
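For completeness, a build-and-run sketch for this test image might look like the following (the tag tf_gpu_test is just a name I picked here, and --gpus all is still needed at run time):
$ docker build -t tf_gpu_test .
$ docker run -it --rm --gpus all tf_gpu_test
# should print 1.14.0 if the TensorFlow binaries load correctly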
and for completeness, here is the bashrc file I am using:
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# ==============================================================================
export PS1="\[\e[31m\]tf-docker\[\e[m\] \[\e[33m\]\w\[\e[m\] > "
export TERM=xterm-256color
alias grep="grep --color=auto"
alias ls="ls --color=auto"
echo -e "\e[1;31m"
cat<<TF
________ _______________
___ __/__________________________________ ____/__ /________ __
__ / _ _ \_ __ \_ ___/ __ \_ ___/_ /_ __ /_ __ \_ | /| / /
_ / / __/ / / /(__ )/ /_/ / / _ __/ _ / / /_/ /_ |/ |/ /
/_/ \___//_/ /_//____/ \____//_/ /_/ /_/ \____/____/|__/
TF
echo -e "\e[0;33m"
if [[ $EUID -eq 0 ]]; then
cat <<WARN
WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.
To avoid this, run the container by specifying your user's userid:
$ docker run -u \$(id -u):\$(id -g) args...
WARN
else
cat <<EXPL
You are running this container as user with ID $(id -u) and group $(id -g),
which should map to the ID and group for your user on the Docker host. Great!
EXPL
fi
# Turn off colors
echo -e "\e[m"

Related

vulkaninfo failed to work in docker and did not recognize NVIDIA GPU

Problem description
When I run vulkaninfo in docker, it complains:
Cannot create Vulkan instance.
This problem is often caused by a faulty installation of the Vulkan driver or attempting to use a GPU that does not support Vulkan.
ERROR at /build/vulkan-tools-1.3.204.0~rc3-1lunarg20.04/vulkaninfo/vulkaninfo.h:649:vkCreateInstance failed with ERROR_INCOMPATIBLE_DRIVER
It seems like this problem was caused by the driver, so I ran nvidia-smi to check:
Sun Apr 17 03:12:54 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02 Driver Version: 510.60.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:68:00.0 Off | N/A |
| 32% 37C P8 12W / 250W | 18MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1550 G 16MiB |
+-----------------------------------------------------------------------------+
It seems the driver works well. I also checked the environment variable NVIDIA_DRIVER_CAPABILITIES and ran lspci | grep -i vga:
$ echo ${NVIDIA_DRIVER_CAPABILITIES}
all
$ lspci|grep -i vga
19:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
1a:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
67:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
68:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
If I install mesa-vulkan-drivers, vulkaninfo works fine but does not recognize the NVIDIA GPU:
$ apt install mesa-vulkan-drivers
$ vulkaninfo --summary
Devices:
========
GPU0:
apiVersion = 4198582 (1.1.182)
driverVersion = 1 (0x0001)
vendorID = 0x10005
deviceID = 0x0000
deviceType = PHYSICAL_DEVICE_TYPE_CPU
deviceName = llvmpipe (LLVM 12.0.0, 256 bits)
driverID = DRIVER_ID_MESA_LLVMPIPE
driverName = llvmpipe
driverInfo = Mesa 21.2.6 (LLVM 12.0.0)
conformanceVersion = 1.0.0.0
deviceUUID = 00000000-0000-0000-0000-000000000000
driverUUID = 00000000-0000-0000-0000-000000000000
Reproduction details
Host system information:
OS: Ubuntu 16.04 xenial
Kernel: x86_64 Linux 4.4.0-210-generic
CPU: Intel Core i9-10940X CPU @ 4.8GHz
GPU: NVIDIA GeForce RTX 2080 Ti x 4 (driver 510.60.02)
Docker information:
version: 20.10.7
nvidia-container-toolkit version: 1.9.0-1
Docker start command:
$ docker run --rm -it \
--runtime=nvidia \
-e NVIDIA_DRIVER_CAPABILITIES=all \
-e NVIDIA_VISIBLE_DEVICES=3 \
<docker_image_name> /bin/bash
Dockerfile:
FROM nvidia/cudagl:11.4.2-base-ubuntu20.04
ENV NVIDIA_DRIVER_CAPABILITIES compute,graphics,utility
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
libx11-xcb-dev \
libxkbcommon-dev \
libwayland-dev \
libxrandr-dev \
libegl1-mesa-dev \
wget && \
rm -rf /var/lib/apt/lists/*
RUN wget -O - http://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add - && \
wget -O /etc/apt/sources.list.d/lunarg-vulkan-focal.list http://packages.lunarg.com/vulkan/lunarg-vulkan-focal.list && \
apt update && \
apt install -y vulkan-sdk
EDIT1:
I am running ssh on a headless server and I want to do offline rendering in docker.

NVIDIA cuda enabled docker container issue - UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount()

I am trying to use the base images provided by NVIDIA that let us use their GPUs via Docker containers. Because I am using Docker, there is no need for me to have the CUDA Toolkit or cuDNN on my system. All I need to have is the right driver - which I have.
I can run the official PyTorch Docker containers and the containers utilize my GPU. However, when I run anything using the base images from NVIDIA, I get the following warning:
$ docker run --gpus all -it --rm -p 8000:8000 ubuntu-cuda-gpu:latest
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
The application executes, it just uses the CPU. But I want to be able to use my GPU like I can when I run the same code (it is a simple PyTorch example) using the official PyTorch Docker images.
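A minimal sketch of how I check whether the GPU is actually usable inside this image (assuming the image tag from the command above):
$ docker run --gpus all -it --rm ubuntu-cuda-gpu:latest \
    python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
# prints "True 1" when the container can actually initialize CUDA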
The base image used is -
FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04
# Setup
RUN apt update && \
apt install -y bash \
build-essential \
git \
curl \
ca-certificates \
python3 \
python3-pip && \
rm -rf /var/lib/apt/lists
# Your stuff
RUN python3 -m pip install --no-cache-dir --upgrade pip && \
python3 -m pip install --no-cache-dir \
torch \
transformers \
...
If I just run the image without any machine learning code and execute nvidia-smi, I get this output:
$ docker run --gpus all -it --rm -p 8000:8000 ubuntu-cuda-gpu:latest nvidia-smi
Sat Jun 12 19:15:21 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80 Driver Version: 460.80 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3060 Off | 00000000:01:00.0 Off | N/A |
| 0% 31C P8 9W / 170W | 14MiB / 12053MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
This leads me to believe that at least something is right. But why am I not able to use my GPU, and how can I make sure that I can?
I am on Ubuntu 20.04.

OpenCV claims to find "wrong" cuda version

I built OpenCV 3.4 with CUDA 10.0 support in /usr/local/opencv_custom like this:
cmake -D CMAKE_BUILD_TYPE=RELEASE
-D CMAKE_INSTALL_PREFIX=/usr/local/opencv_custom
-D OPENCV_GENERATE_PKGCONFIG=ON
-D OPENCV_DNN_CUDA=ON
-D INSTALL_C_EXAMPLES=ON
-D INSTALL_PYTHON_EXAMPLES=ON
-D OPENCV_EXTRA_MODULES_PATH=/home/ohmnibot/opencv_contrib/modules
-D BUILD_EXAMPLES=ON
-D BUILD_opencv_python2=OFF
-D WITH_FFMPEG=1
-D WITH_CUDA=ON
-D WITH_OPENGL=ON
-D ENABLE_FAST_MATH=1
-D CUDA_FAST_MATH=1
-D WITH_CUBLAS=0
-D WITH_LAPACK=OFF
-D BUILD_opencv_cudacodec=OFF
-D CUDA_VERSION=10.0 ..
All is fine, but when I try to include this OpenCV version in my CMakeLists.txt like this
set(OpenCV_DIR "/usr/local/opencv_custom")
find_package(OpenCV REQUIRED)
I get this error:
Could NOT find CUDA: Found unsuitable version "10.2", but required is exact
version "10.0" (found /usr/local/cuda-10.0)
Now, I had CUDA 10.2 installed, but I removed every possible trace of it and only CUDA 10.0 exists on my system (as far as I know).
Also, I just don't get this message... CUDA 10.0 has been found, but it's unsuitable because it's the wrong version, even though... well, it's the right one?
I don't know, this just confuses the hell out of me... any help is deeply appreciated.
I'm working with catkin on an Ubuntu 18.04 system with a GTX 1650.
Old cuda versions have been removed with
sudo apt-get --purge remove "*cublas*" "cuda*" "nsight*"
sudo rm -rf /usr/local/cuda*
nvcc -V output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66 Driver Version: 450.66 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1650 Off | 00000000:07:00.0 On | N/A |
| 35% 29C P8 8W / 75W | 447MiB / 3908MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Just in case someone ever manages to run into a similar error:
delete all the build files
rebuild
enjoy
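A rough sketch of what those steps look like in practice, assuming a standard out-of-source build directory (adjust paths and cmake flags to your own setup):
$ cd ~/opencv/build            # wherever your OpenCV build directory is
$ rm -rf ./*                   # delete all previously generated build files
$ cmake ..                     # re-run cmake with the same -D flags you used before
$ make -j"$(nproc)"
$ sudo make install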
I have run into this problem a few times. The accepted answer has worked for me a few times, but not always. I think adding the following brings value to anyone who has similar problems.
Check the values of variables PATH and LD_LIBRARY_PATH
You can check the value like this:
echo ${PATH}
echo ${LD_LIBRARY_PATH}
I had the wrong CUDA version there and I corrected it using a text editor like this:
nano ~/.bashrc
Search for PATH (in nano, press Ctrl+W to find). I had these at the very bottom:
PATH=$PATH:/usr/local/cuda-11.8/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.8/lib64
and surprisingly I had the wrong CUDA version there. I corrected it with the text editor to what I wanted (in my case 11.6).
Note that this only changes the variables. You must check that you really have the right version of CUDA at /usr/local/cuda-* and that your /usr/local/cuda is pointing to the right place. We will check that next:
Check that cuda points to the correct version
As shown here,
cd /usr/local
ls -l | grep cuda
then see which versions of CUDA are there. I have only one now, but I used to have many:
/usr/local$ ls -l | grep cuda
lrwxrwxrwx 1 root root 21 tammi 10 15:15 cuda -> /usr/local/cuda-11.6/
drwxr-xr-x 16 root root 4096 tammi 10 15:16 cuda-11.6
This hints that I don't have any conflicting CUDA installations. But you might, if you have several versions installed.
This answer might be helpful, too.
Check that OpenCV is searching for the correct version
When you're running the configuration step of the OpenCV build, check that -D CUDA_VERSION is right:
cd build-opencv
cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D WITH_TBB=ON -D ENABLE_FAST_MATH=1 -D CUDA_FAST_MATH=1 -D CUDA_GENERATION=AUTO -D WITH_CUBLAS=1 -D WITH_CUDA=ON -D BUILD_opencv_cudacodec=OFF -D WITH_CUDNN=ON -D OPENCV_DNN_CUDA=ON -D CUDA_ARCH_BIN=8.6 -D CUDA_GENERATION=Auto -D WITH_V4L=ON -D WITH_QT=OFF -D WITH_OPENGL=OFF -D WITH_GSTREAMER=ON -D OPENCV_GENERATE_PKGCONFIG=ON -D OPENCV_PC_FILE_NAME=opencv.pc -D OPENCV_ENABLE_NONFREE=ON -D OPENCV_EXTRA_MODULES_PATH=../opencv_contrib-4.x/modules ../opencv-4.x -D INSTALL_PYTHON_EXAMPLES=OFF -D INSTALL_C_EXAMPLES=OFF -D BUILD_EXAMPLES=OFF -D CUDA_VERSION=11.6 -D CUDA_NVCC_FLAGS="-ccbin gcc-8"
Then, double-check that the output printed after this command is what you expect, and possibly save it to a text file. It is really helpful:
-- CUDA detected: 11.6
-- CUDA NVCC target flags: -ccbin gcc-8;-gencode;arch=compute_86,code=sm_86;-D_FORCE_INLINES
(skipping many other lines...)
-- NVIDIA CUDA: YES (ver 11.6, CUFFT CUBLAS FAST_MATH)
-- NVIDIA GPU arch: 86
-- NVIDIA PTX archs:
--
-- cuDNN: YES (ver 8.7.0)
--
Add -D CUDA_TOOLKIT_ROOT_DIR
This answer and this thread might be outdated but if nothing else helps you it doesn't hurt to check them.
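The idea there is to point CMake explicitly at the toolkit you want. A sketch (the path is just an example for my 11.6 installation; keep the rest of your usual flags):
$ cmake -D WITH_CUDA=ON \
        -D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.6 \
        ..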
Check that you really made a clean build
I rebuilt my OpenCV a couple of times, but it didn't really help. The reason might be that I didn't remove all OpenCV-related files. So this time I ran:
sudo make uninstall
as well as removing associated files:
sudo rm -r /usr/local/include/opencv2 /usr/local/include/opencv /usr/include/opencv /usr/include/opencv2 /usr/local/share/opencv /usr/local/share/OpenCV /usr/share/opencv /usr/share/OpenCV /usr/local/bin/opencv* /usr/local/lib/libopencv*
If you want to be super careful, please also check the output of this
sudo find / -name "*opencv*"
as different versions of OpenCV might have different install directories.
Other
Then, you also have the option of uninstalling OpenCV, removing all build files, uninstalling CUDA and starting from scratch. Make sure you follow the CUDA installation guides carefully and perform all pre- and post-installation steps. In addition, make sure you have followed all installation steps of OpenCV here, here and here. Please check the output of all steps for errors that might have been left unseen earlier.
Remember to make a clean build of your application after reinstalling CUDA and OpenCV as well as restarting your PC and/or refreshing your environment variables.

Vglrun does not work in a Docker container

Note: This is on a headless AWS box accessed over VNC; the current desktop I'm running on is DISPLAY=:1.0.
I am trying to build a container that can hold an OpenGL application, but I'm having trouble getting vglrun to work correctly. I am currently running it with --gpus all on the docker run line as well:
# xhost +si:localuser:root
# docker run --rm -it \
-e DISPLAY=unix$DISPLAY \
-v /tmp/.X11-unix:/tmp/.X11-unix \
--gpus all centos:7 \
sh -c "yum install epel-release -y && \
yum install -y VirtualGL glx-utils && \
vglrun glxgears"
No protocol specified
[VGL] ERROR: Could not open display :0
On the host:
$ nvidia-smi
Tue Jan 28 22:32:24 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44 Driver Version: 440.44 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 00000000:00:1E.0 Off | 0 |
| N/A 30C P8 16W / 150W | 56MiB / 7618MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2387 G /usr/bin/X 55MiB |
+-----------------------------------------------------------------------------+
I can confirm that running glxgears without vglrun works fine, but the application I'm trying to build into Docker inherently uses vglrun. I have also tried the NVIDIA container nvidia/opengl:1.1-glvnd-runtime-centos7 with no success.
Running it with vglrun -d :1.0 glxgears or vglrun -d unix:1.0 glxgears gives me this error:
Error: couldn't get an RGB, Double-buffered visual
What am I doing wrong here? does vglrun not work in a container?
EDIT: It seems I was approaching this problem the wrong way: it works when I'm on the primary :0 display, but when using VNC to view display :1, the Mesa drivers get used instead of the NVIDIA ones. Is there a way I can use the GPU on spawned VNC displays?
I met the same problem and solved it by setting the variable VGL_DISPLAY:
docker run ... -e VGL_DISPLAY=$DISPLAY ...
Then it worked! Please try it.
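In other words, a sketch based on the docker run command from the question, adding only the -e VGL_DISPLAY=$DISPLAY flag:
docker run --rm -it \
    -e DISPLAY=unix$DISPLAY \
    -e VGL_DISPLAY=$DISPLAY \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    --gpus all centos:7 \
    sh -c "yum install epel-release -y && \
    yum install -y VirtualGL glx-utils && \
    vglrun glxgears"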

jupyter notebook running in docker on remote server: keras not using gpu

I'm setting up a Jupyter notebook to run on a remote server, but my code appears not to be using the GPU. It looks like TensorFlow is identifying the GPU, but Keras is missing it somehow. Is there something in my setup process leading to this?
I installed nvidia-docker via the GitHub instructions:
# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
$ sudo systemctl restart docker
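For reference, the usual sanity check after those install steps (the CUDA image tag here is just an example) would be something like:
$ sudo docker run --rm --gpus all nvidia/cuda:10.2-base nvidia-smi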
I'm ssh'ing into my server:
ssh me@serverstuff
And then on the server running:
docker run -it -p 9999:9999 --name mycontainer -v /mydata:/mycontainer/mydata ufoym/deepo bash
jupyter notebook --ip 0.0.0.0 --port 9999 --no-browser --allow-root
And then opening up a new command prompt on my desktop and running:
ssh -N -f -L localhost:9999:serverstuff:9999 me@serverstuff
Then I sign in, open up localhost:9999 in my browser, and log in with the provided token successfully.
But when I run DL training in my notebook, the speed is such that it doesn't seem to be using the GPU.
!nvidia-smi
gives:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.86 Driver Version: 430.86 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 730 WDDM | 00000000:01:00.0 N/A | N/A |
| 25% 41C P8 N/A / N/A | 551MiB / 2048MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
and
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
gives:
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7106107654095923441
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 13064397814134284140
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 14665046342845873047
physical_device_desc: "device: XLA_GPU device"
]
and
from keras import backend as K
K.tensorflow_backend._get_available_gpus()
gives:
[]
Try installing another image. I also had problems with custom images, so I went with a direct NVIDIA image:
docker pull nvcr.io/nvidia/tensorflow:19.08-py3
There are other versions as well; you can check them out here.
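For example, a quick way to confirm that the GPU is visible inside that image (tf.test.is_gpu_available is the TF 1.x API; the exact tag may differ) might be:
docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:19.08-py3 \
    python3 -c "import tensorflow as tf; print(tf.test.is_gpu_available())"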
