Dynamically decide which GPU to run on - TF on NVIDIA Docker

I have a queue of models, of which I allow only 2 to execute in parallel, since I have 2 GPUs.
For that, at the beginning of my code I try to determine which GPU is available using GPUtil. Maybe it's relevant: this code runs inside a Docker container that was launched with the --runtime=nvidia flag.
The code that determines which GPU to run on looks like this:
import os
import GPUtil
gpu1, gpu2 = GPUtil.getGPUs()
available_gpu = gpu1 if gpu1.memoryFree > gpu2.memoryFree else gpu2
os.environ['CUDA_VISIBLE_DEVICES'] = str(available_gpu.id)
import tensorflow as tf
Now, I launched two scripts this way (with a slight delay, so the first one could occupy a GPU), but both of them tried to use the same GPU!
I went further to examine the problem: I manually set os.environ['CUDA_VISIBLE_DEVICES'] = '1' and let the model run.
As it was training, I checked the output of nvidia-smi and saw the following:
user#server:~$ docker exec awesome_gpu_container nvidia-smi
Mon Mar 12 06:59:27 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 Ti Off | 00000000:01:00.0 On | N/A |
| 0% 50C P2 131W / 280W | 5846MiB / 6075MiB | 81% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 00000000:03:00.0 Off | N/A |
| 0% 39C P8 14W / 200W | 2MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
And I notice that while I've set the visible device to be 1, it is actually running on 0.
I stress again: my goal is that, while queuing multiple models, each one that starts running decides for itself which GPU to use.
I explored allow_soft_placement=True, but that allocated memory on both GPUs, so I stopped the process.
Bottom line: how can I make sure my training scripts use only one GPU, and make them choose the free one?

As described in the CUDA programming guide, the default device enumeration used by CUDA is "fastest first":
CUDA_DEVICE_ORDER
FASTEST_FIRST, PCI_BUS_ID (default is FASTEST_FIRST)
FASTEST_FIRST causes CUDA to guess which device is fastest using a simple heuristic, and make that device 0, leaving the order of the rest of the devices unspecified.
PCI_BUS_ID orders devices by PCI bus ID in ascending order.
If you set CUDA_DEVICE_ORDER=PCI_BUS_ID, the CUDA ordering will match the device ordering shown by nvidia-smi.
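Applied to your selection snippet, a minimal sketch of the fix (same GPUtil logic as in the question; the only change is pinning the device order before anything initializes CUDA):
import os

# Make CUDA enumerate devices in PCI bus order, matching nvidia-smi and GPUtil
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'

import GPUtil

gpu1, gpu2 = GPUtil.getGPUs()
available_gpu = gpu1 if gpu1.memoryFree > gpu2.memoryFree else gpu2
os.environ['CUDA_VISIBLE_DEVICES'] = str(available_gpu.id)

# Import TensorFlow only after both variables are set, so CUDA picks them up
import tensorflow as tf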
Since you are using Docker, you can also enforce stronger isolation with our runtime:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 ...
But that's at container startup time.

Related

any doable approach to use multiple GPUs, multiple processes with tensorflow?

I am using a Docker container to run my experiment. I have multiple GPUs available and I want to use all of them for my experiment, i.e. utilize all GPUs for one program. To do so, I used tf.distribute.MirroredStrategy as suggested on the TensorFlow site, but it is not working; the full error messages are in the gist.
Here is the available GPU info:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:6A:00.0 Off | 0 |
| N/A 31C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:6B:00.0 Off | 0 |
| N/A 31C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:6C:00.0 Off | 0 |
| N/A 34C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:6D:00.0 Off | 0 |
| N/A 34C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
My current attempt
Here is my attempt using tf.distribute.MirroredStrategy:
import tensorflow as tf

device_type = "GPU"
devices = tf.config.experimental.list_physical_devices(device_type)
devices_names = [d.name.split("e:")[1] for d in devices]  # "/physical_device:GPU:0" -> "GPU:0"

strategy = tf.distribute.MirroredStrategy(devices=devices_names[:3])
with strategy.scope():
    # model and opt are defined earlier in the script
    model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
The above attempt does not work and gives the error listed in the gist above. I haven't found another way of using multiple GPUs for a single experiment.
Does anyone have a workable approach to make this happen? Any thoughts?
Is MirroredStrategy the proper way to distribute the workload?
The approach is correct, as long as the GPUs are on the same host. The TensorFlow manual has examples of how tf.distribute.MirroredStrategy can be used with Keras to train on the MNIST set.
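For reference, a minimal sketch along the lines of the documentation's Keras/MNIST example (assuming a TF 2.x installation):
import tensorflow as tf

# MirroredStrategy replicates the model onto every visible GPU by default
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Build and compile the model inside the scope so its variables are mirrored
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, batch_size=256, epochs=2)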
Is MirroredStrategy the only strategy?
No, there are multiple strategies that can be used to achieve the workload distribution. For example, tf.distribute.MultiWorkerMirroredStrategy can be used to distribute the work across multiple devices through multiple workers.
The TF documentation explains the strategies and their limitations, and provides some examples to help kick-start the work.
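As an illustration, a sketch of the multi-worker variant (stable as tf.distribute.MultiWorkerMirroredStrategy from TF 2.4; the TF_CONFIG hosts, ports and task index below are placeholders, and each worker runs the same script with its own index):
import json
import os

import tensorflow as tf

# TF_CONFIG must be set before the strategy is created; values are placeholders
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},  # index 1 on the second worker
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="sgd", loss="mse")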
The strategy is throwing an error
According to the issue on GitHub, the ValueError: SyncOnReadVariable does not support 'assign_add' ..... is a bug in TensorFlow which is fixed in TF 2.4.
You can try to upgrade the TensorFlow libraries with:
pip install --ignore-installed --upgrade tensorflow
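A quick way to confirm the upgrade took effect (assuming, per the issue above, that the SyncOnReadVariable fix landed in 2.4):
import tensorflow as tf

# Should print 2.4.0 or later for the SyncOnReadVariable fix
print(tf.__version__)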
Implementing variables that are not aware of the distributed strategy
If you have tried the standard example from the documentation and it works fine, but your model does not, you might have variables that are incorrectly set up, or you are using distributed variables that do not support the aggregation functions required by the distributed strategy.
As per the TF documentation:
"A distributed variable is a variable created on multiple devices. As discussed in the glossary, mirrored variable and SyncOnRead variable are two examples."
To better understand how to implement custom support for distributed variables, check the following page in the documentation.

Tensorflow-serving docker container adds the GPU device but GPU has 0% utilization

Hi, I'm having issues with dockerized TF Serving seeing but not using my GPU.
It adds the GPU as device 0, allocates memory on it, but then loads the ML model into CPU device memory and runs inference using only the CPUs. GPU-util on nvidia-smi never leaves 0%.
Could anyone help me figure out why this is happening, and what should be changed?
The setup:
OS: Amazon/Deep Learning AMI (Ubuntu 18.04) on EC2 g4dn.xlarge
GPU: Tesla T4
Model: pretrained gpt2-xl tensorflow from huggingface, which I froze into a SavedModel and uploaded to S3.
Docker: came stock with Deep Learning AMI. I've already checked and confirmed that nvidia-smi runs containerized, so it's not a nvidia+docker issue.
TF Serving: I use the below Dockerfile to pull the latest-gpu image and download the model directly into it at build time:
FROM tensorflow/serving:latest-gpu
RUN apt-get update
ENV TZ=America
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get update
RUN apt-get install -y awscli
ENV AWS_ACCESS_KEY_ID=...
ENV AWS_SECRET_ACCESS_KEY=...
ARG model_name
ENV MODEL_NAME=$model_name
# Use AWS CLI to download the SavedModel into the docker container from S3 bucket
RUN aws s3 cp s3://v3-models/models/pretrained_tf_serving/${MODEL_NAME} /models/${MODEL_NAME} --recursive
EXPOSE 8500
I build and run the above Dockerfile with these commands:
#!/bin/bash
# first build the image with the model_name arg, and tag it as xl-serving
docker build -t xl-serving --build-arg model_name=gpt2-xl ../../model_server
# then run it with gpus, exposing gRPC port
docker run -it --rm --gpus all --runtime nvidia -p 8500:8500 xl-serving
Running the serving container prints this output. Notice that the GPU is added.
2020-11-06 05:25:34.671071: I tensorflow_serving/model_servers/server.cc:87] Building single TensorFlow model file config: model_name: gpt2-xl model_base_path: /models/gpt2-xl
2020-11-06 05:25:34.671274: I tensorflow_serving/model_servers/server_core.cc:464] Adding/updating models.
2020-11-06 05:25:34.671295: I tensorflow_serving/model_servers/server_core.cc:575] (Re-)adding model: gpt2-xl
2020-11-06 05:25:34.771644: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: gpt2-xl version: 1}
2020-11-06 05:25:34.771673: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: gpt2-xl version: 1}
2020-11-06 05:25:34.771687: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: gpt2-xl version: 1}
2020-11-06 05:25:34.771724: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/gpt2-xl/1
2020-11-06 05:25:35.222512: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2020-11-06 05:25:35.222545: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:234] Reading SavedModel debug info (if present) from: /models/gpt2-xl/1
2020-11-06 05:25:35.222672: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-06 05:25:35.223994: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-06 05:25:35.262238: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:35.263132: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-11-06 05:25:35.263149: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-11-06 05:25:35.263236: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:35.264122: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:35.264948: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-11-06 05:25:36.185140: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-06 05:25:36.185165: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2020-11-06 05:25:36.185171: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2020-11-06 05:25:36.185334: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:36.186222: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:36.187046: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:36.187852: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13896 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-11-06 05:25:37.279837: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:199] Restoring SavedModel bundle.
2020-11-06 05:25:56.154008: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:183] Running initialization op on SavedModel bundle at path: /models/gpt2-xl/1
2020-11-06 05:25:57.551535: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:303] SavedModel load for tags { serve }; Status: success: OK. Took 22777844 microseconds.
2020-11-06 05:25:57.832736: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /models/gpt2-xl/1/assets.extra/tf_serving_warmup_requests
2020-11-06 05:25:57.835030: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: gpt2-xl version: 1}
2020-11-06 05:25:57.838329: I tensorflow_serving/model_servers/server.cc:367] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2020-11-06 05:25:57.840415: I tensorflow_serving/model_servers/server.cc:387] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
I then hit this server with a single, non-batched gRPC call. It successfully runs and returns a correct GPT-2 output. However, it takes as long as the same setup takes on a CPU. htop shows that 8 GB of RAM (the gpt2-xl model size) is loaded into CPU memory. It then shows the TF Serving process running and maxing out one or two CPU cores. It appears to run only on the CPU.
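The call is along these lines (a sketch, since the exact client code isn't shown above; the input tensor name "input_ids" is a placeholder that would need to match the SavedModel's signature):
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "gpt2-xl"
request.model_spec.signature_name = "forward"
# "input_ids" is a placeholder; the real name comes from the SavedModel signature
request.inputs["input_ids"].CopyFrom(
    tf.make_tensor_proto([[198, 15667, 6530, 25, 29437]], dtype=tf.int32))

response = stub.Predict(request, 60.0)  # 60-second timeout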
This is what nvidia-smi looks like while the call is running. Notice the allocated memory, and 0% GPU-Util:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 36C P0 26W / 70W | 14240MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 13357 C tensorflow_model_server 14221MiB |
+-----------------------------------------------------------------------------+
I've scoured the web and can't find any advice for this. The closest I found was this GitHub issue: GPU utilization with TF serving #1440, but the fixes did not work for me. They were dealing with low GPU utilization; I'm dealing with 0%.
Any advice on what the issue is?
Thank you very much. I've been banging my head against the wall for days on this, so I very much appreciate your help :)
Update #1:
I've written a Python script (below) that uses tensorflow==2.3.0 to load the model and run it. It's running in a conda env with CUDA 11.0. It successfully runs inference on the GPU, and it's a good 15x faster than what I'm getting from TF Serving.
import tensorflow as tf
import numpy as np

model = tf.saved_model.load('/home/ubuntu/models/gpt2-xl/1/')
servable = model.signatures["forward"]

# Create input tensor
tensor_in = tf.constant([[198, 15667, 6530, 25, 29437, 1706, 1610, 977, 948, 33611]])

# Run a loop of 10 inferences on the model, to predict the next 10 tokens.
for i in range(10):
    pred = servable(tensor_in)
    logits = pred['output_0']
    logits = logits[:, -1, :] / 0.8
    next_id = tf.random.categorical(tf.nn.log_softmax(logits, axis=-1), num_samples=1)
    next_id = tf.dtypes.cast(next_id, tf.int32).numpy()
    tensor_in = np.concatenate((tensor_in, next_id), axis=1)
Up next: I'll try running TF Serving outside of a container. Update to come...
How did you save your model? Add clear_devices=True when saving the model and try again.
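If you no longer have the original export script, one rough way to re-save the model with placements cleared is something like this (a sketch using the TF 1.x-compatible SavedModel APIs; the output path is hypothetical, and the "serve" tag is taken from the logs above):
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

export_dir = "/models/gpt2-xl/1"          # existing SavedModel path, from the logs above
resave_dir = "/models/gpt2-xl-cleared/1"  # hypothetical output path

with tf.Session(graph=tf.Graph()) as sess:
    # Load the existing SavedModel and keep its signatures
    meta_graph_def = tf.saved_model.loader.load(sess, ["serve"], export_dir)
    builder = tf.saved_model.builder.SavedModelBuilder(resave_dir)
    # clear_devices=True strips hard-coded device placements from the graph
    builder.add_meta_graph_and_variables(
        sess, ["serve"],
        signature_def_map={k: v for k, v in meta_graph_def.signature_def.items()},
        clear_devices=True)
    builder.save()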

Even Easier Introduction to CUDA - not printing after memory initialization

I'm following the Even Easier Introduction to CUDA tutorial. I have literally copied and pasted the complete code. add.cu compiles, however, when I run it, it doesn't print anything. I put in some more print statements and narrowed it down:
printf("Hi\n");
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
printf("Bye");
It prints "Hi", but never prints "Bye". So something seems to be wrong with the memory initialization. What is going wrong here?
I solved the problem myself. Basically, my device drivers were messed up. To check whether you have the same problem, run Command Prompt as administrator and run nvidia-smi. If you have the same problem, it will give you an error saying it failed to communicate because your device drivers are out of date or wrong.
Download the latest NVIDIA driver for your computer (I found mine on Dell Drivers & Downloads) and install it. Now, when you run nvidia-smi as admin in Command Prompt, it should give you a whole bunch of details about your setup (driver version, CUDA version, etc.) like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 441.14 Driver Version: 441.14 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 105... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 30C P8 N/A / N/A | 78MiB / 4096MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
You should now be able to compile and run CUDA programs with unified memory.

Unsure whether tensorflow-gpu actually uses GPU

I am currently trying to run a convolutional neural network using Keras on a TensorFlow backend, with the help of a Udemy course on deep learning. However, it is running extremely slowly, taking around 1,000s per epoch, while the lecturer's machine takes around 60s (and he's running it on a CPU, by the way).
The CNN is a simple image recognition network that recognizes whether an image is of a cat or a dog. The training and test data consist of a total of 10,000 images, all images together take up 237 MB on my SSD.
When I run the CNN in a Python shell, I get the following output:
Epoch 1/25
2017-05-28 13:23:03.967337: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-05-28 13:23:03.967574: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-28 13:23:03.968153: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-28 13:23:03.968329: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-28 13:23:03.968576: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-28 13:23:04.505726: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:887] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.835
pciBusID 0000:28:00.0
Total memory: 8.00GiB
Free memory: 6.68GiB
2017-05-28 13:23:04.505944: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:908] DMA: 0
2017-05-28 13:23:04.506637: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:918] 0: Y
2017-05-28 13:23:04.506895: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:28:00.0)
2684/8000 [=========>....................] - ETA: 845s - loss: 0.5011 - acc: 0.7427
This should indicate that tensorflow is using the GPU for its computations. However, when I check on nvidia-smi, I get the following output:
$ nvidia-smi
Sun May 28 13:25:46 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 376.53 Driver Version: 376.53 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 WDDM | 0000:28:00.0 On | N/A |
| 0% 49C P2 36W / 166W | 7240MiB / 8192MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 7676 C+G ...ost_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 8580 C+G Insufficient Permissions N/A |
| 0 9704 C+G ...x86)\Google\Chrome\Application\chrome.exe N/A |
| 0 10532 C ...\Anaconda3\envs\tensorflow-gpu\python.exe N/A |
| 0 11384 C+G Insufficient Permissions N/A |
| 0 12896 C+G C:\Windows\explorer.exe N/A |
| 0 13868 C+G Insufficient Permissions N/A |
| 0 14068 C+G Insufficient Permissions N/A |
| 0 14568 C+G Insufficient Permissions N/A |
| 0 15260 C+G ...osoftEdge_8wekyb3d8bbwe\MicrosoftEdge.exe N/A |
| 0 16912 C+G ...am Files (x86)\Dropbox\Client\Dropbox.exe N/A |
| 0 18196 C+G ...I\AppData\Local\hyper\app-1.3.3\Hyper.exe N/A |
| 0 18228 C+G ...oftEdge_8wekyb3d8bbwe\MicrosoftEdgeCP.exe N/A |
| 0 20032 C+G ...indows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
+-----------------------------------------------------------------------------+
Note that every single process is displayed as using both CPU and GPU (type C+G), while the TensorFlow process is the only one using just the CPU (type C).
Is there any sensible explanation for this? I have been trying to fix this issue for the whole last week but have gotten nowhere.
I am running a Windows 10 Pro machine with an Nvidia GTX 1070 by Asus, 24GB RAM and an Intel Xeon X5670 CPU @ 2.93GHz. I created my Anaconda environment with the following commands:
conda create -n tensorflow-gpu python=3.5 anaconda
source activate tensorflow-gpu
conda install theano
conda install mingw libpython
pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/windows/gpu/tensorflow_gpu-1.1.0-cp35-cp35m-win_amd64.whl
pip install keras
conda update --all
I also installed the CUDA Toolkit and CUDNN and included their respective folders to my %PATH%
Every and any help would be greatly appreciated.
[EDIT]
The code, in case anything is wrong with it:
# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense

# Defining the CNN
classifier = Sequential()

# Convolution 1
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))

# Convolution 2
classifier.add(Conv2D(32, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))

# Flatten + MLP
classifier.add(Flatten())
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dense(units = 1, activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fitting the CNN to the images
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)
test_datagen = ImageDataGenerator(rescale = 1./255)

training_set = train_datagen.flow_from_directory('dataset/training_set',
                                                 target_size = (64, 64),
                                                 batch_size = 32,
                                                 class_mode = 'binary')
test_set = test_datagen.flow_from_directory('dataset/test_set',
                                            target_size = (64, 64),
                                            batch_size = 32,
                                            class_mode = 'binary')

classifier.fit_generator(training_set,
                         steps_per_epoch = 8000,
                         epochs = 25,
                         validation_data = test_set,
                         validation_steps = 2000)
It does not have anything to do with your machine; I discussed the problem in this post on Udemy. Everyone seems to have the same issue and wonders how it could be 20 minutes on the instructor's machine. The answer is simple: the instructor posted different source code than what he presented in the video!
Check the docs for steps_per_epoch:
steps_per_epoch: Total number of steps (batches of samples) to yield from the generator before declaring one epoch finished and starting the next epoch. It should typically be equal to the number of unique samples of your dataset divided by the batch size.
Currently, a single epoch takes 8000 * 32 = 256,000 images. That's the number of samples you are processing in every epoch, which makes no sense whatsoever when you consider your dataset is merely 10,000 images (20k with augmentation).
If you check the video, you'll see the instructor is using samples_per_epoch, meaning 32x less data. Case solved.
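For example, the corrected call would look roughly like this (a sketch based on the question's batch size of 32 and 8000 training / 2000 test images):
classifier.fit_generator(training_set,
                         steps_per_epoch = 8000 // 32,   # 250 batches = one pass over the data
                         epochs = 25,
                         validation_data = test_set,
                         validation_steps = 2000 // 32)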
My reply has nothing to do with the Udemy example; it's simply about checking whether the GPU is being utilized. On Linux, the Psensor utility lets you observe GPU load and temperature (which is a good indication of usage). I'm not sure how to confirm GPU usage on Windows; perhaps someone else can help with that.
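For a platform-independent check, you can also ask TensorFlow itself which devices it registered (a small sketch using the TF 1.x device_lib API, matching the version installed in the question):
from tensorflow.python.client import device_lib

# Lists the CPU and GPU devices the runtime registered; a working GPU setup
# shows a device with device_type "GPU" (e.g. /gpu:0)
print(device_lib.list_local_devices())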

Torque jobs cannot find GPU when CUDA_VISIBLE_DEVICES not equal 0

I'm facing a strange problem with Torque assignment of GPUs.
I'm running Torque 6.1.0 on a single machine that has two NVIDIA GTX Titan X GPUs. I'm using pbs_sched for scheduling. nvidia-smi output at rest is as follows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:03:00.0 On | N/A |
| 22% 40C P8 15W / 250W | 0MiB / 12204MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 33C P8 14W / 250W | 0MiB / 12207MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I have a simple test script to assess GPU assignment as follows:
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1:gpus=1:reseterr:exclusive_process
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
deviceQuery is the utility that comes with CUDA. When I run it from the command line, it correctly finds both GPUs. When I restrict to one device from the command-line like this...
CUDA_VISIBLE_DEVICES=0 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
#or
CUDA_VISIBLE_DEVICES=1 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
... it also correctly finds one or the other GPU.
When I submit test.sh to the queue with qsub, and when no other jobs are running, it again works correctly. Here's the output:
CUDA_VISIBLE_DEVICES: 0
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX TITAN X"
  CUDA Driver Version / Runtime Version:         8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 12204 MBytes (12796887040 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1076 MHz (1.08 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z):        1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers: 1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers: 2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
    < Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX TITAN X Result = PASS
However, if a job is already running on gpu0 (i.e. if it is assigned CUDA_VISIBLE_DEVICES=1), the job cannot find any GPUs. Output:
CUDA_VISIBLE_DEVICES: 1
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
Anyone know what is going on here?
I think I've solved my own problem, but unfortunately I tried two things at once and don't want to go back and confirm which one solved the issue. It's one of the following:
Remove the --enable-cgroups option from Torque's configure script before building.
Running these steps in the Torque install process:
make packages
sh torque-package-server-linux-x86_64.sh --install
sh torque-package-mom-linux-x86_64.sh --install
sh torque-package-clients-linux-x86_64.sh --install
For the second option, I know that these steps are properly documented in the Torque install instructions. However, I have a simple setup with just a single node (the compute node and server are the same machine). I thought that 'make install' would do everything the package installs do for that single node, but maybe I was mistaken.
