Why is my GPU slower than CPU when training LSTM/RNN models?

My machine has the following spec:
CPU: Xeon E5-1620 v4
GPU: Titan X (Pascal)
Ubuntu 16.04
Nvidia driver 375.26
CUDA toolkit 8.0
cuDNN 5.1
I benchmarked the following Keras examples with TensorFlow as the backend, for reference:
SCRIPT NAME                   GPU       CPU
stateful_lstm.py              5 sec     5 sec
babi_rnn.py                   10 sec    12 sec
imdb_bidirectional_lstm.py    240 sec   116 sec
imdb_lstm.py                  113 sec   106 sec
My GPU clearly outperforms my CPU on non-LSTM models:
SCRIPT NAME       GPU      CPU
cifar10_cnn.py    12 sec   123 sec
imdb_cnn.py       5 sec    119 sec
mnist_cnn.py      3 sec    47 sec
Has anyone else experienced this?

If you use Keras, use CuDNNLSTM in place of LSTM, or CuDNNGRU in place of GRU. In my case (2 Tesla M60), I am seeing a 10x performance boost. By the way, I am using batch size 128, as suggested by @Alexey Golyshev.

Your batch size is too small. Try increasing it.
Results for my GTX 1050 Ti:
imdb_bidirectional_lstm.py
batch_size      time
32 (default)    252
64              131
96              87
128             66
imdb_lstm.py
batch_size      time
32 (default)    108
64              50
96              34
128             25
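One way to read these numbers, as a rough sketch: assuming the times are seconds per epoch and that the script trains on the 25,000 reviews of the Keras IMDB training split (neither is stated above), you can work out the average time per training step.

```python
import math

# Assumption: 25,000 training reviews (the size of the Keras IMDB training split).
N_TRAIN = 25_000

# Timings reported above for imdb_bidirectional_lstm.py on a GTX 1050 Ti,
# assumed here to be seconds per epoch.
timings = {32: 252, 64: 131, 96: 87, 128: 66}


def steps_per_epoch(n_samples, batch_size):
    """Number of weight updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


def seconds_per_step(n_samples, batch_size, epoch_seconds):
    """Average wall-clock time of a single training step."""
    return epoch_seconds / steps_per_epoch(n_samples, batch_size)


for batch_size, epoch_seconds in timings.items():
    steps = steps_per_epoch(N_TRAIN, batch_size)
    per_step = seconds_per_step(N_TRAIN, batch_size, epoch_seconds)
    print(f"batch_size={batch_size:>3}  steps={steps:>3}  {per_step:.3f} s/step")
```

Under those assumptions a step costs roughly a third of a second whether it processes 32 or 128 sequences, which would mean the GPU has plenty of idle capacity at the default batch size.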

Just a tip.
Using a GPU pays off when
1. your neural network model is big.
2. your batch size is big.
That's what I found from googling.

I ran into a similar issue:
Test 1
CPU: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
Ubuntu 14.04
imdb_bidirectional_lstm.py: 155s
Test 2
GPU: GTX 860m
Nvidia Driver: 369.30
CUDA Toolkit: v8.0
cuDNN: v6.0
imdb_bidirectional_lstm.py: 450s
Analysis
When I watched the GPU load curve, I noticed one interesting thing: for the LSTM, GPU load jumps quickly between ~80% and ~10%.
[Figure: GPU load]
This is mainly due to the sequential computation in the LSTM layer. Remember that an LSTM processes its input sequentially and computes hidden states iteratively; in other words, you must wait for the hidden state at time t-1 before you can calculate the hidden state at time t.
That is a poor fit for a GPU, which has many small cores that are good at doing computations in parallel; sequential computation can't fully use their computing power. That's why we see GPU load around 10% - 20% most of the time.
But in the backpropagation phase, the GPU can run the derivative computations in parallel, so we see GPU load peak around 80%.
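To make the dependency on the previous time step concrete, here is a minimal sketch using a toy scalar RNN rather than a real LSTM (no gates, and the weights w_x and w_h are arbitrary illustrative values). The loop over time steps has to run in order, while the work inside each step is independent across the batch.

```python
import math


def rnn_forward(batch, w_x=0.5, w_h=0.9):
    """Run a toy scalar RNN over a batch of equal-length sequences.

    w_x and w_h are arbitrary illustrative weights, not values from any model.
    Returns the final hidden state per sequence and the number of time
    steps that had to be executed one after another.
    """
    seq_len = len(batch[0])
    hidden = [0.0] * len(batch)  # one hidden state per sequence
    sequential_steps = 0
    for t in range(seq_len):  # cannot be parallelised: step t needs step t-1
        # Every sequence in the batch is independent, so this inner
        # computation is the part a GPU can spread across its cores.
        hidden = [math.tanh(w_x * seq[t] + w_h * h)
                  for seq, h in zip(batch, hidden)]
        sequential_steps += 1
    return hidden, sequential_steps


small_batch = [[0.1, 0.2, 0.3, 0.4]] * 2
large_batch = [[0.1, 0.2, 0.3, 0.4]] * 64

_, steps_small = rnn_forward(small_batch)
_, steps_large = rnn_forward(large_batch)
print(steps_small, steps_large)  # prints: 4 4
```

The number of sequential steps depends only on the sequence length, so a larger batch gives the GPU more parallel work per step without adding steps.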

Related

Does number of samples affect the GPU memory?

I am trying to train a CNN for video frame prediction. My images are large (10 * 480 * 1440 * 3). I want to know whether the number of samples I use for training affects GPU memory use, or whether only a batch (plus the network parameters) needs to fit into GPU memory.
The problem is that when I load 100 samples for training with batch_size = 1, I can train the model. However, when I increase the number of samples to 200, I run out of GPU memory.
My machine configuration is:
GPU: A100 NVIDIA 40 GB memory
System memory: 1008 GB
I would appreciate any suggestion to solve this issue.
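A quick back-of-the-envelope check may help here. This sketch assumes the samples are stored as float32, which the question does not state, and simply computes how much memory the input array alone would occupy if all samples were held on the device at once.

```python
import math

BYTES_PER_FLOAT32 = 4  # assumption: samples are stored as float32
SAMPLE_SHAPE = (10, 480, 1440, 3)  # shape given in the question


def sample_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Memory footprint of one sample with the given shape."""
    return math.prod(shape) * bytes_per_element


for n_samples in (1, 100, 200):
    gigabytes = n_samples * sample_bytes(SAMPLE_SHAPE) / 1e9
    print(f"{n_samples:>3} samples: {gigabytes:.2f} GB")
```

If the footprint of the full dataset shows up in GPU memory use, that would hint that the whole array is being copied to the device rather than one batch at a time; how the data is fed is not described in the question, so this is only something to check.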

How to check bus utilization / bus load for GPU during ML inference?

I am running ML inference for image recognition on the GPU using onnxruntime, and I am seeing an upper limit on how much batching improves performance: inference time drops up to a batch_size of around 8, and beyond that it stays constant. I assume this must be because some GPU resource is fully utilized, as I don't see any such limitation mentioned in the ONNX documentation.
I tried using the pynvml.smi package to get nvidia_smi and printed some utilization figures during inference like this:
from pynvml.smi import nvidia_smi
utilization_percent = nvidia_smi.getInstance().DeviceQuery()['gpu'][0]['utilization']
gpu_util.append(utilization_percent['gpu_util'])
mem_util.append(utilization_percent['memory_util'])
What I see is that gpu_util and memory_util stay within 25% for the entire inference run, even at batch sizes like 32 or 64, so these are unlikely to be the cause of the bottleneck.
I assume, then, that a bus load limitation might be causing this. I did not find any option in nvidia-smi to print the GPU bus load.
How can I find the bus load during the inference?

Why is my CPU doing matrix operations faster than GPU instead?

When I tried to verify that the GPU does matrix operations faster than the CPU, I got unexpected results. According to my experiment, the CPU performs better than the GPU, which confuses me.
I ran a matrix multiplication on the CPU and on the GPU. The programming environment is MXNet and cuda-10.1.
with gpu:
import mxnet as mx
from mxnet import nd
x = nd.random.normal(shape=(100000,100000),ctx=mx.gpu())
y = nd.random.normal(shape=(100000,100000),ctx=mx.gpu())
%timeit nd.dot(x,y)
50.8 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
with cpu:
x1 = nd.random.normal(shape=(100000,100000),ctx=mx.cpu())
y1 = nd.random.normal(shape=(100000,100000),ctx=mx.cpu())
%timeit nd.dot(x1,y1)
33.4 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Why is the CPU faster? My CPU is an i5-6300HQ and my GPU is an NVIDIA GTX 950M.
TLDR: Your matrix multiplication is actually not running :)
MXNet is an asynchronous framework: it piles work requests into a queue that its execution engine processes asynchronously, on a need-to-run basis. So what you're measuring is only the time it took to send the request, not to execute it. That's why it is so small (microseconds on a 100k*100k matrix would be surprisingly fast) and roughly equal for both CPU and GPU. To force execution, you need to add a call that forces production of a result, for example a print or nd.dot(x, y).wait_to_read(). For code very similar to your benchmark, see https://github.com/ThomasDelteil/MXNetParisWorkshop/blob/master/FromNDArrayToTrainedModel.ipynb
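The measurement pitfall can be reproduced without MXNet. The sketch below is only an analogy built on the standard library: a thread pool stands in for the execution engine, and slow_matmul is a hypothetical placeholder that just sleeps. Timing the submission alone corresponds to `%timeit nd.dot(x, y)`; timing until the result is available corresponds to adding `.wait_to_read()`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_matmul():
    """Stand-in for an expensive operation (hypothetical, it just sleeps)."""
    time.sleep(0.2)
    return "result"


with ThreadPoolExecutor(max_workers=1) as engine:
    start = time.perf_counter()
    future = engine.submit(slow_matmul)  # returns as soon as the work is queued
    enqueue_seconds = time.perf_counter() - start

    future.result()  # blocks until the work has actually finished
    total_seconds = time.perf_counter() - start

print(f"time to enqueue:    {enqueue_seconds:.6f} s")
print(f"time to completion: {total_seconds:.6f} s")
```

The first number stays tiny no matter how expensive the queued work is, which is why the CPU and GPU timings in the question look almost identical.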
Extra comments:
The gain from using a GPU instead of a CPU grows with the size of the parallelism opportunity. On simple tasks, that gain can be small to nonexistent. CPU core frequencies are actually 2 to 3 times higher than GPU frequencies (your i5-6300HQ runs at 2.3GHz with a 3.2GHz boost, while your GTX 950M runs at 0.9GHz with a 1.1GHz boost).
MXNet ndarray is very fast at matrix algebra on CPU, because (1) its asynchronous paradigm optimizes the order of computation, (2) its C++ backend runs things in parallel, and (3) I believe the default MXNet build comes with Intel MKL, which significantly boosts the algebra capacities of Intel CPUs (https://medium.com/apache-mxnet/mxnet-boosts-cpu-performance-with-mkl-dnn-b4b7c8400f98). Its ability to run compute on GPU within the same API is also a big strength over NumPy, for example.
I don't think your test will run on GPU: instantiating such a big matrix on an NVIDIA Tesla V100 (16GB mem, 4x more than a GTX 950M) runs into a "large tensor size error".
I don't know the module you're using, but your CPU can access main memory much more quickly and also keeps a lot of data in cache. Your GPU needs longer to load the data into GPU memory, and it also takes longer to be called from your CPU.
That's always the downside of GPU computation. When you can load a large chunk of data into GPU memory, there's a good chance of it being faster.
By the way, that's why deep learning frameworks work in batches. When you can't work with batches, I'd always use the CPU. You also have some potential for performance improvements with multiprocessing.
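On the earlier remark that the test probably cannot run on the GPU at all, the arithmetic is easy to check. This sketch assumes float32 elements, which is the default dtype of nd.random.normal, and compares the element count against the 32-bit signed integer limit; the idea that a given MXNet build indexes tensors with 32-bit integers is an assumption here, not something confirmed above.

```python
N = 100_000  # matrix side length used in the question
BYTES_PER_FLOAT32 = 4  # float32 is the default dtype of nd.random.normal
GPU_MEMORY_BYTES = 16 * 10**9  # the 16GB Tesla V100 mentioned above
INT32_MAX = 2**31 - 1  # assumption: tensors are indexed with 32-bit integers

elements = N * N
matrix_bytes = elements * BYTES_PER_FLOAT32
total_bytes = 3 * matrix_bytes  # x, y and their product

print(f"elements per matrix: {elements:,}")
print(f"one matrix:          {matrix_bytes / 1e9:.0f} GB")
print(f"x, y and x.y:        {total_bytes / 1e9:.0f} GB")
print(f"fits in 16 GB:       {total_bytes <= GPU_MEMORY_BYTES}")
print(f"exceeds int32 index: {elements > INT32_MAX}")
```

A single 100,000 x 100,000 float32 matrix is already 40 GB, so neither a GTX 950M nor a 16GB V100 could hold the operands, which is consistent with the microsecond timings measuring nothing but queueing.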

iOS Metal compute pipeline slower than CPU implementation for search task

I ran a simple experiment: I implemented a naive character search algorithm that searches 1,000,000 rows of 50 characters each (a 50-million-character map) on both the CPU and the GPU (using the iOS 8 Metal compute pipeline).
The CPU implementation uses a simple loop; the Metal implementation gives each kernel 1 row to process (source code below).
To my surprise, the Metal implementation is on average 2-3 times slower than the simple, linear CPU version (if I use 1 core) and 3-4 times slower if I use 2 cores (each of them searching half of the database)!
I experimented with different numbers of threads per group (16, 32, 64, 128, 512), yet I still get very similar results.
iPhone 6:
CPU 1 core: approx 0.12 sec
CPU 2 cores: approx 0.075 sec
GPU: approx 0.35 sec (release mode, validation disabled)
I can see the Metal shader spending more than 90% of its time accessing memory (see below).
What can be done to optimise it?
Any insights will be appreciated, as there are not many sources in the internet (besides standard Apple programming guides), providing details on memory access internals & trade-offs specific to the Metal framework.
METAL IMPLEMENTATION DETAILS:
Host code gist:
https://gist.github.com/lukaszmargielewski/0a3b16d4661dd7d7e00d
Kernel (shader) code:
https://gist.github.com/lukaszmargielewski/6b64d06d2d106d110126
GPU frame capture profiling results:
The GPU shader is also striding vertically through memory, whereas the CPU is moving horizontally. Consider the addresses actually touched more or less concurrently by each thread executing in lockstep in your shader as you read charTable. The GPU will probably run a good deal faster if your charTable matrix is transposed.
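The effect of transposing the table can be sketched by listing the offsets that neighbouring threads read at the same moment. This is only an illustration: it assumes one byte per character and one thread per row, and the two helper functions are made up for the example rather than taken from the linked shader.

```python
ROW_LENGTH = 50       # characters per row, as in the question
N_ROWS = 1_000_000    # rows in the table, as in the question


def address_row_major(thread_id, step, row_length=ROW_LENGTH):
    """Offset read by a thread when each row is stored contiguously."""
    return thread_id * row_length + step


def address_transposed(thread_id, step, n_rows=N_ROWS):
    """Offset read by a thread when the table is stored column by column."""
    return step * n_rows + thread_id


threads = range(4)  # a few neighbouring threads running in lockstep
step = 0            # all of them are reading their first character

row_major = [address_row_major(t, step) for t in threads]
transposed = [address_transposed(t, step) for t in threads]
print(row_major)   # [0, 50, 100, 150]
print(transposed)  # [0, 1, 2, 3]
```

With the original layout, threads that run together touch addresses 50 bytes apart; with the transposed layout they read adjacent bytes, which memory hardware can usually serve in far fewer transactions.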
Also, because this code executes in a SIMD fashion, each GPU thread will probably have to run the loop to the full search phrase length, whereas the CPU will get to take advantage of early outs. The GPU code might actually run a little faster if you remove the early outs and just keep the code simple. Much depends on the search phrase length and likelihood of a match.
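A toy cost model shows why early outs help the CPU more than the GPU. It is a simplification with made-up data: the search is reduced to a prefix comparison, and a lockstep group is assumed to stay busy until its slowest thread finishes.

```python
def comparisons_until_mismatch(row, phrase):
    """Character comparisons a naive prefix match performs with an early out."""
    count = 0
    for row_char, phrase_char in zip(row, phrase):
        count += 1
        if row_char != phrase_char:
            break
    return count


def lockstep_cost(rows, phrase, group_size):
    """Lane-steps consumed when each group runs until its slowest thread ends."""
    total = 0
    for start in range(0, len(rows), group_size):
        group = rows[start:start + group_size]
        slowest = max(comparisons_until_mismatch(r, phrase) for r in group)
        total += slowest * len(group)
    return total


rows = ["abcd", "xbcd", "xbcd", "xbcd"]  # made-up data: only the first row matches
phrase = "abcd"

cpu_cost = sum(comparisons_until_mismatch(r, phrase) for r in rows)
gpu_cost = lockstep_cost(rows, phrase, group_size=4)
print(cpu_cost, gpu_cost)  # prints: 7 16
```

In this model a single matching row makes the whole group pay for the full phrase length, so the early out saves the GPU nothing while still adding a branch.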
I'll take a guess too: the GPU isn't optimized for if/else and doesn't predict branches (it probably executes both sides). Try rewriting the algorithm in a more linear way, without conditionals or with as few as possible.

Why my OpenCV CUDA is running slower than CPU for simple thresholding?

My CPU is Intel Core2 Duo T5550, GPU is GeForce 8400M G. CUDA version 5.5.22, OpenCV version 2.4.8.
The test code is as follows:
double t = (double)getTickCount();
gpu::threshold(src, dst, thres, binMax, THRESH_BINARY);
t = ((double)getTickCount() - t)/getTickFrequency();
cout << "Times passed in seconds: " << t << endl;
For a 3648*2736 image, the result is
CPU: Times passed in seconds: 0.0136336
GPU: Times passed in seconds: 0.0217714
Thanks!
Perhaps this is not surprising.
Your GeForce 8400M G is an old mobile card with only 8 cores (see the GeForce 8M series specifications), so you cannot extract much parallelism out of it.
Bluntly speaking, GPUs have an advantage over multicore CPUs when you can extract massive parallelism with a large number of cores. In other words, to quickly build an Egyptian pyramid with slow slaves (GPU cores), you need a large number of slaves. If you have only very few slow slaves (8 in your case), then perhaps it is better to have even fewer (2 CPU cores, for example), but much faster, slaves.
EDIT
I just remembered coming across this post:
Finding minimum in GPU slower than CPU
which may help convince you that bad implementations (as underlined by Abid Rahman and Mailerdaimon) may lead to GPU codes that are slower than CPU ones. The situation is even worse if, as pointed out in the answer to the post above, you are hosting also the X display on your already limited GeForce 8400M G card.
In addition to what @JackOLantern said:
Every copy operation involving the GPU takes time, a lot of time compared to just computing on the CPU. This is why @Abid Rahman K's comment is a good idea: he suggested testing again with more complex code. The advantage of the GPU is fast parallel processing; one of its disadvantages is the relatively slow transfer rate when copying data to and from the GPU.
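To get a feel for how much the copies can weigh, here is a rough sketch. It assumes an 8-bit single-channel image, which the question does not state, and the bus bandwidths are hypothetical values picked for illustration, not measurements of this laptop.

```python
WIDTH, HEIGHT = 3648, 2736  # image size from the question
BYTES_PER_PIXEL = 1         # assumption: 8-bit single-channel image
CPU_SECONDS = 0.0136336     # CPU timing reported in the question


def round_trip_seconds(n_bytes, bandwidth_bytes_per_s):
    """Time to upload n_bytes to the GPU and download the same amount back."""
    return 2 * n_bytes / bandwidth_bytes_per_s


image_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL

for gb_per_s in (1, 2, 4, 8):  # hypothetical effective bus bandwidths
    copy_seconds = round_trip_seconds(image_bytes, gb_per_s * 1e9)
    print(f"{gb_per_s} GB/s: copies take {copy_seconds:.4f} s "
          f"({copy_seconds / CPU_SECONDS:.0%} of the CPU run)")
```

For an operation as cheap as thresholding, the copies alone can take a time comparable to the whole CPU run at the lower bandwidths, so the GPU only wins when much more work is done per byte transferred.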

Resources