Keras with TensorFlow backend not using GPU - docker

I built the GPU version of the Docker image https://github.com/floydhub/dl-docker with Keras version 2.0.0 and TensorFlow version 0.12.1. I then ran the MNIST tutorial https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py but realized that Keras is not using the GPU. Below is the output I get:
root@b79b8a57fb1f:~/sharedfolder# python test.py
Using TensorFlow backend.
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
2017-09-06 16:26:54.866833: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 16:26:54.866855: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 16:26:54.866863: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 16:26:54.866870: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 16:26:54.866876: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Can anyone let me know whether some settings need to be changed before Keras will use the GPU? I am very new to all of this, so do let me know if I need to provide more information.
I have installed the pre-requisites as mentioned on the page
Install Docker following the installation guide for your platform: https://docs.docker.com/engine/installation/
I am able to launch the docker image
docker run -it -p 8888:8888 -p 6006:6006 -v /sharedfolder:/root/sharedfolder floydhub/dl-docker:cpu bash
GPU Version Only: Install Nvidia drivers on your machine either from Nvidia directly or follow the instructions here. Note that you don't have to install CUDA or cuDNN. These are included in the Docker container.
I am able to run the last step
cv@cv-P15SM:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.66 Mon May 1 15:29:16 PDT 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
GPU Version Only: Install nvidia-docker: https://github.com/NVIDIA/nvidia-docker, following the instructions here. This will install a replacement for the docker CLI. It takes care of setting up the Nvidia host driver environment inside the Docker containers and a few other things.
I am able to run the step here
# Test nvidia-smi
cv@cv-P15SM:~$ nvidia-docker run --rm nvidia/cuda nvidia-smi
Thu Sep 7 00:33:06 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 780M    Off  | 0000:01:00.0     N/A |                  N/A |
| N/A   55C    P0    N/A /  N/A |    310MiB /  4036MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
+-----------------------------------------------------------------------------+
I am also able to run the nvidia-docker command to launch a gpu supported image.
What I have tried
I have tried the following suggestions:
Check if you have completed step 9 of this tutorial ( https://github.com/ignaciorlando/skinner/wiki/Keras-and-TensorFlow-installation ). Note: Your file paths may be completely different inside that docker image, you'll have to locate them somehow.
I appended the suggested lines to my bashrc and have verified that the bashrc file is updated.
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64' >> ~/.bashrc
echo 'export CUDA_HOME=/usr/local/cuda-8.0' >> ~/.bashrc
The second suggestion was to add the following lines to my Python file:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID" # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"]="0"
Both steps, done separately or together, unfortunately did not solve the issue. Keras is still running with the CPU version of tensorflow as its backend. However, I might have found the possible cause. I checked which versions of tensorflow are installed with the following commands and found two of them.
This is the CPU version
root@08b5fff06800:~# pip show tensorflow
Name: tensorflow
Version: 1.3.0
Summary: TensorFlow helps the tensors flow
Home-page: http://tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /usr/local/lib/python2.7/dist-packages
Requires: tensorflow-tensorboard, six, protobuf, mock, numpy, backports.weakref, wheel
And this is the GPU version
root@08b5fff06800:~# pip show tensorflow-gpu
Name: tensorflow-gpu
Version: 0.12.1
Summary: TensorFlow helps the tensors flow
Home-page: http://tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /usr/local/lib/python2.7/dist-packages
Requires: mock, numpy, protobuf, wheel, six
Interestingly, the output shows that Keras is using tensorflow version 1.3.0, which is the CPU version, and not 0.12.1, the GPU version:
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
import tensorflow as tf
print('Tensorflow: ', tf.__version__)
Output
root@08b5fff06800:~/sharedfolder# python test.py
Using TensorFlow backend.
Tensorflow: 1.3.0
I guess now I need to figure out how to have keras use the gpu version of tensorflow.

It is never a good idea to have both tensorflow and tensorflow-gpu packages installed side by side (the one single time it happened to me accidentally, Keras was using the CPU version).
Regarding "I guess now I need to figure out how to have keras use the gpu version of tensorflow": you should simply remove both packages from your system and then re-install tensorflow-gpu [UPDATED after comment]:
pip uninstall tensorflow tensorflow-gpu
pip install tensorflow-gpu
Moreover, it is puzzling why you seem to use the floydhub/dl-docker:cpu container, while according to the instructions you should be using the floydhub/dl-docker:gpu one...
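To confirm whether both packages really are installed side by side, you can query the installed distributions from Python. This is only a minimal sketch, assuming Python 3.8+ (for importlib.metadata, so it will not run on the Python 2.7 from the question); the helper names installed_distribution_names and find_tf_conflict are made up for illustration and are not part of pip, Keras or TensorFlow.

```python
from importlib import metadata

# Distributions that both provide the same `tensorflow` import.
TF_DISTRIBUTIONS = ("tensorflow", "tensorflow-gpu")


def installed_distribution_names():
    """Return the lower-cased names of all installed distributions."""
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            names.add(name.lower())
    return names


def find_tf_conflict(installed_names):
    """Return the TensorFlow distributions present in `installed_names`.

    More than one entry means two packages compete for the same
    `tensorflow` import, which is the situation described above.
    """
    return sorted(n for n in TF_DISTRIBUTIONS if n in installed_names)


if __name__ == "__main__":
    found = find_tf_conflict(installed_distribution_names())
    if len(found) > 1:
        print("Conflict, both installed:", ", ".join(found))
    else:
        print("No conflict; found:", found or "no TensorFlow package")
```

If it reports a conflict, the uninstall and re-install steps above apply.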

I had a similar issue: Keras didn't use my GPU. I had installed tensorflow-gpu into conda according to the instructions, but after installing keras the GPU was simply not listed as an available device. I realized that installing keras also pulls in the tensorflow package, so I ended up with both tensorflow and tensorflow-gpu. I then found that a keras-gpu package is available. After completely uninstalling keras, tensorflow and tensorflow-gpu, and then installing tensorflow-gpu and keras-gpu, the problem was solved.
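To see which devices TensorFlow lists as available, something like the following can help. It is a sketch, not an official recipe: it relies on tensorflow.python.client.device_lib, an internal module path that has been around for many releases but is not a documented stable API, and the wrapper name list_tf_devices is my own.

```python
def list_tf_devices():
    """Return (name, device_type) pairs for the devices TensorFlow can see.

    Returns an empty list if TensorFlow cannot be imported at all.
    """
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return []
    return [(d.name, d.device_type) for d in device_lib.list_local_devices()]


if __name__ == "__main__":
    devices = list_tf_devices()
    for name, kind in devices:
        print(kind, name)
    if not any(kind == "GPU" for _, kind in devices):
        print("No GPU device visible to TensorFlow")
```

If only CPU entries show up, Keras is most likely importing a CPU-only build.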

In the future, you can try using virtual environments to separate tensorflow CPU and GPU, for example:
conda create --name tensorflow python=3.5
activate tensorflow
pip install tensorflow
AND
conda create --name tensorflow-gpu python=3.5
activate tensorflow-gpu
pip install tensorflow-gpu

This worked for me:
Install tensorflow v2.2.0
pip install tensorflow==2.2.0
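Also remove tensorflow-gpu (if it's present). Since TensorFlow 2.1, the plain tensorflow pip package includes GPU support, so a separate tensorflow-gpu package is not needed.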

Related

Yolov5 Training keep running on local system

I recently bought a GPU (RTX 3060 Ti); before that I used to work on Google Colab (free version). I have downloaded yolov5 on my local machine, made an environment variable for it and installed the required dependencies. To test my GPU, I ran training for 3 epochs with the same dataset I use on Colab, where it takes only around 30 seconds to complete (on a Tesla T4, which has around 2000 fewer CUDA cores than the RTX 3060 Ti). On my GPU, on the other hand, it kept running for around 3 hours without finishing, so I interrupted it.
Screenshot of Yolo in VS Code
The code I ran on my local machine is:
# !git clone https://github.com/ultralytics/yolov5 # clone
# %cd yolov5
%pip install -qr requirements.txt # install
import torch
import utils
display = utils.notebook_init() # checks
# Train YOLOv5s on COCO128 for 3 epochs
!python train.py --img 412 --batch 16 --epochs 3 --data train_data/data.yaml --weights yolov5s.pt

Getting different RGB values after saving and re-loading an image via PIL in Ubuntu 18.04 vs any other OS

Whenever I load an image, save it with 90% quality, reload the saved image and then print the sum of its RGB matrix, I get one value on Ubuntu 18.04.5 and CentOS 8.2, and a different value on Ubuntu 20.04+, Fedora 33 and Windows 10. I have tested this with the same versions of pillow/PIL, numpy and python on all of the above operating systems, but the discrepancy remains.
from PIL import Image
import numpy as np
img = Image.open('Sp_D_CNN_A_art0024_ani0032_0268.jpg')
np.sum(np.array(img))
OUTPUT : 28586794 (Same for all the OS)
img.save('temp.jpg', 'JPEG', quality = 90)
tempimg = Image.open('temp.jpg')
np.sum(np.array(tempimg))
OUTPUT : 28588237 (for Ubuntu 18.04.5 and CentOS 8.2.2004)
28588547 (for Ubuntu 20.04+, Fedora 33, Windows 10 20H2)
Now, the difference here might look very slight, but after further processing by my Error Level Analysis algorithm it becomes huge. Because I trained my segmentation model on Google Colab (which uses Ubuntu 18.04.5 in its runtime), the generated mask comes out very inaccurate on Ubuntu 20.04+, Fedora 33 and Windows 10 20H2.
Why is that happening and how can I fix it?
The culprit is the underlying libjpeg version used by the pillow (PIL) library. Although Ubuntu 18.04.5 and Ubuntu 20.04.1 both have the same libjpeg package, the pillow version installed by default on Ubuntu 18.04.5 ships its own binaries. So a solution is to remove the current pillow and reinstall it by building it from source, so that it is guaranteed to use the system's libjpeg.
First remove current pillow module
python3 -m pip uninstall Pillow
Then just install the dependency packages listed here:
https://pillow.readthedocs.io/en/stable/installation.html#building-on-linux
And then run:
python3 -m pip install Pillow --no-binary :all:
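Before and after rebuilding, it may help to check which libjpeg your Pillow build is actually linked against, and to compare that across machines. A minimal sketch, assuming a reasonably recent Pillow that provides PIL.features.version and knows the "libjpeg_turbo" feature; the function name pillow_jpeg_info is just an illustrative helper.

```python
def pillow_jpeg_info():
    """Return Pillow's version and the libjpeg it was built against.

    Returns None if Pillow is not installed.
    """
    try:
        import PIL
        from PIL import features
    except ImportError:
        return None
    return {
        "pillow": PIL.__version__,
        # Version string of the JPEG library, or None without JPEG support.
        "libjpeg": features.version("jpg"),
        # True if the build uses libjpeg-turbo rather than plain libjpeg.
        "libjpeg_turbo": features.check("libjpeg_turbo"),
    }


if __name__ == "__main__":
    print(pillow_jpeg_info())
```

If the reported libjpeg values differ between two systems, differing pixel sums after a JPEG re-save are to be expected.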

VersionMismatchWarning: Mismatched versions found - blosc

I cannot do a 'pip install blosc' on Windows. I develop on Windows and have my workers and scheduler running on VMs with dask-docker. Does anyone have any ideas? It seems like dask really wants all Linux all the time.
blosc
+-----------------------+---------+
| | version |
+-----------------------+---------+
| client | None |
| scheduler | 1.9.1 |
| tcp://127.0.0.1:38323 | 1.9.1 |
+-----------------------+---------+
(venv) D:\dev\code\datacrunch>pip install -U blosc
Collecting blosc
Using cached blosc-1.9.1.tar.gz (809 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
Building wheels for collected packages: blosc
Building wheel for blosc (PEP 517) ... error
ERROR: Command errored out with exit status 1:
command: 'd:\dev\code\netsense.support\datacrunch\venv\scripts\python.exe' 'd:\dev\code\netsense.support\datacrunch\venv\lib\site-packages\pip_vendor\pep517_in_process.py' build_wheel 'C:\Users\H166631\AppData\Local\Temp\tmpwgt4t634'
cwd: C:\Users\H166631\AppData\Local\Temp\pip-install-r1476vwy\blosc
Complete output (162 lines):
Not searching for unused variables given on the command line.
-- The C compiler identification is unknown
CMake Error at CMakeLists.txt:3 (ENABLE_LANGUAGE):
No CMAKE_C_COMPILER could be found.
The compression has to match throughout the dask cluster, and because you don't have blosc installed you run into some issues. As a side note, there is an effort to improve the messaging of the error in PR #3742. I can think of two solutions:
1. Switch to conda instead of pip (though this is perhaps a non-starter for you)
2. Use a different compression (one that you have installed or can easily install on your machine)
For option 2, you can set the compression programmatically like this:
import dask
import distributed
dask.config.set({'distributed.comm.compression': 'lz4'})
Or on the CLI:
DASK_DISTRIBUTED__COMM__COMPRESSION=zlib dask-worker
Or with the dask config file. For more info, I would recommend reading through: https://docs.dask.org/en/latest/configuration.html and https://docs.dask.org/en/latest/configuration-reference.html#distributed.comm.compression
You can always just not install blosc on your linux machines. Dask is happy to run on Windows. It's even happy (to a certain extent) to mix between Windows and Linux. But it's not happy if you have libraries on some of your machines that you don't have on others. Library uniformity is key.
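To compare machines, you could run a small check on the client and on each worker and see where the results differ. This is a sketch using only the standard library; the list of module names is my assumption about what might be installed, not a list taken from dask itself, and available_compressors is a made-up helper name.

```python
import importlib.util

# Candidate compression modules (import names); assumed for illustration.
CANDIDATES = ("blosc", "lz4", "snappy", "zstandard", "zlib")


def available_compressors(candidates=CANDIDATES):
    """Map each candidate module name to whether it can be imported here."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in candidates
    }


if __name__ == "__main__":
    for name, present in sorted(available_compressors().items()):
        print(f"{name:10s} {'installed' if present else 'missing'}")
```

Any module that shows up as installed on the workers but missing on the client is a candidate for the kind of mismatch in the warning above.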

XGB via Scikit learn API doesn't seem to be running in GPU although compiled to run for GPU

It appears although XGB is compiled to run on GPU, when called/executed via Scikit learn API, it doesn't seem to be running on GPU.
Please advise if this is expected behaviour
As far as I can tell, the Scikit learn API does not currently support GPU. You need to use the learning API (e.g. xgboost.train(...)). This also requires you to first convert your data into xgboost DMatrix.
Example:
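import xgboost
# "grow_gpu" selects the GPU tree updater (requires a GPU-enabled build)
params = {"updater": "grow_gpu"}
# The learning API expects the data wrapped in a DMatrix
train = xgboost.DMatrix(x_train, label=y_train)
clf = xgboost.train(params, train, num_boost_round=10)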
UPDATE:
The Scikit Learn API now supports GPU via the **kwargs argument:
http://xgboost.readthedocs.io/en/latest/python/python_api.html#id1
I couldn't get this working with the pip-installed XGBoost, but I pulled the most recent XGBoost from GitHub (git clone --recursive https://github.com/dmlc/xgboost) and compiled it with the PLUGIN_UPDATER_GPU flag, which allowed me to use the GPU with the sklearn API. I also had to change some NVCC flags to get it working on my GTX960: the defaults caused build errors and then runtime errors due to an architecture mismatch. After it built, I installed it with pip install -e python-package/ from within the repo directory. To use the Scikit learn API (with either grow_gpu or grow_gpu_hist):
import xgboost as xgb
model = xgb.XGBClassifier(
    max_depth=5,
    objective='binary:logistic',
    **{"updater": "grow_gpu"}
)
model.fit(train_x, train_y)
If anyone is interested in the process to fix the build with the GPU flag, here is the process that I went through on Ubuntu 14.04.
i) git clone --recursive https://github.com/dmlc/xgboost
ii) cd into xgboost and run make -j4 to build the multi-threaded version, if no GPU support is desired
iii) To build with GPU support, edit make/config.mk to enable PLUGIN_UPDATER_GPU
iv) Edit the Makefile: in the NVCC section (line 101), use the flag --gpu-architecture=sm_xx matching your GPU's compute capability (5.2 for the GTX 960), i.e. change
#CODE = $(foreach ver,$(COMPUTE),-gencode arch=compute_$(ver),code=sm_$(ver))
to
CODE = --gpu-architecture=sm_52
v) Run ./build.sh. It should report that it completed in multi-threaded mode; otherwise the NVCC build probably failed (or there was another error; look further up in the output for it)
vi) In the virtualenv (if desired) in the same directory run pip install -e python-package/
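For a different card, the sm_xx value in step iv) is just the compute capability with the dot removed. A tiny sketch of that mapping; gpu_architecture_flag is a hypothetical helper, not something shipped with XGBoost or CUDA.

```python
def gpu_architecture_flag(compute_capability):
    """Turn a compute capability such as "5.2" into an NVCC architecture flag."""
    major, minor = compute_capability.strip().split(".")
    if not (major.isdigit() and minor.isdigit()):
        raise ValueError(f"not a compute capability: {compute_capability!r}")
    return f"--gpu-architecture=sm_{major}{minor}"


# The GTX 960 from the steps above has compute capability 5.2.
print(gpu_architecture_flag("5.2"))  # --gpu-architecture=sm_52
```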
These are some things that caused some nvcc errors for me:
i) Installing/updating the Cuda Toolkit by downloading the cuda toolkit .deb from Nvidia (version 8.0 worked for me, and is required in some cases?).
ii) Install/update cuda
sudo apt-get update
sudo apt-get install cuda
iii) Add nvcc to your path. Mine was in /usr/local/cuda/bin/
iv) A restart may be required if running nvidia-smi does not work due to some of the cuda/driver/toolkit updates.

What is the correct way to have Tensorflow available in a docker container or docker image?

I was trying to run a simple Docker container with Tensorflow available (first with CPU). I thought it would be a good idea to set up my Docker image only once (i.e. not update the Tensorflow version every time I run a container).
To do this, it was suggested that I do the following in my Dockerfile (the comment came from the source that gave me the suggestion):
# This means you derive your docker image from the tensorflow docker image
FROM gcr.io/tensorflow/tensorflow
However, when I ran my Docker container and did pip list, I didn't see Tensorflow anywhere, and when I ran my script I got the familiar error:
ImportError: No module named 'tensorflow'
I thought of a way to solve this by just having my Dockerfile explicitly pip3 install tensorflow. I planned to make a bash script and have my Dockerfile run it:
# bash script intall_tensorflow.sh
# to install Tensorflow in container
export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.12.0rc1-py3-none-any.whl
pip3 install --upgrade $TF_BINARY_URL
and then just add to the docker file:
RUN sh intall_tensorflow.sh
However, my intuition tells me this might be wrong or too hacky. Why would I need the tensorflow base image FROM gcr.io/tensorflow/tensorflow in the first place if I am just going to manually install Tensorflow later anyway?
I tried researching online what gcr.io/tensorflow/tensorflow might be doing but I have not found anything super useful. Does someone know what is the proper way to have Tensorflow available in a Docker container from the image itself (i.e. from building the Docker image)?
Sorry if I'm being really dense but it just feels I'm doing something wrong and I couldn't find something online that addressed my question.
After looking at the answer it seems that the main issue might be that python 3 cannot find tensorflow for some reason but python 2 can. Does that mean that I need to directly install TensorFlow myself (with pip in the docker image) for the right version of TensorFlow to be available?
Judging by your usage of pip3, are you using python 3? That might be causing your issues. I tried to recreate your problem, but python 2 seems to be working fine:
user@computer:~$ docker run -it gcr.io/tensorflow/tensorflow /bin/bash
root@61bb0f99582b:/notebooks# python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
>>>
root@61bb0f99582b:/notebooks# python3
Python 3.4.3 (default, Oct 14 2015, 20:28:29)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named 'tensorflow'
>>>
If this for some reason still causes you issues, you can also just install it yourself the way you describe. A good thing about docker is that it caches images when it creates them from Dockerfiles, so you don't end up reinstalling tensorflow every time you build an image. This article explains some of the concepts.
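The python vs python3 comparison from the session above can also be scripted, which is handy inside a container. A minimal sketch using only the standard library; interpreter_has_module is an illustrative helper name, and the interpreter names "python" and "python3" are assumed to be on PATH as in the session above.

```python
import subprocess
import sys


def interpreter_has_module(interpreter, module):
    """Return True if `interpreter -c "import module"` exits successfully."""
    try:
        result = subprocess.run(
            [interpreter, "-c", f"import {module}"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The interpreter itself could not be started (e.g. not on PATH).
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("current interpreter:", sys.executable)
    for interpreter in ("python", "python3"):
        found = interpreter_has_module(interpreter, "tensorflow")
        print(f"{interpreter}: tensorflow {'found' if found else 'not found'}")
```

If only one interpreter finds the module, either run your script with that interpreter or install tensorflow for the other one with its matching pip.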
