iOS Metal compute pipeline slower than CPU implementation for search task

I ran a simple experiment: I implemented a naive character search over 1,000,000 rows of 50 characters each (a 50-million-character map) on both the CPU and the GPU (using the iOS 8 Metal compute pipeline).
The CPU implementation uses a simple loop; the Metal implementation gives each kernel thread 1 row to process (source code below).
To my surprise, the Metal implementation is on average 2-3 times slower than the simple, linear CPU version (if I use 1 core) and 3-4 times slower if I employ 2 cores (each of them searching half of the database)!
I experimented with different numbers of threads per group (16, 32, 64, 128, 512), yet I still get very similar results.
iPhone 6:
CPU 1 core: approx 0.12 sec
CPU 2 cores: approx 0.075 sec
GPU: approx 0.35 sec (release mode, validation disabled)
I can see the Metal shader spending more than 90% of its time accessing memory (see below).
What can be done to optimise it?
Any insights will be appreciated, as there are not many sources on the internet (besides the standard Apple programming guides) that provide details on memory access internals and trade-offs specific to the Metal framework.
Host code gist:
Kernel (shader) code:
GPU frame capture profiling results:

The GPU shader is also striding vertically through memory, whereas the CPU is moving horizontally. Consider the addresses actually touched more or less concurrently by each thread executing in lockstep in your shader as you read charTable. The GPU will probably run a good deal faster if your charTable matrix is transposed.
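To make the layout argument concrete, here is a minimal sketch (not the asker's kernel) of which byte offsets a group of lockstep threads touches when every thread reads the same column of its own row. The table size, the function names, and the one-byte-per-character assumption are all hypothetical.

```python
# Hypothetical sizes, far smaller than the 1,000,000 x 50 table in the question.
ROWS, COLS = 8, 50  # 8 threads (one per row), 50 one-byte characters per row


def row_major_offset(row, col):
    # Original layout: the characters of one row are contiguous in memory.
    return row * COLS + col


def transposed_offset(row, col):
    # Transposed layout: character `col` of every row is contiguous in memory.
    return col * ROWS + row


def offsets_touched(offset_fn, col):
    # Offsets read by threads 0..ROWS-1 when they all read column `col` in lockstep.
    return [offset_fn(thread, col) for thread in range(ROWS)]


strided = offsets_touched(row_major_offset, 0)     # neighbours are COLS bytes apart
contiguous = offsets_touched(transposed_offset, 0)  # neighbours are 1 byte apart
print(strided)
print(contiguous)
```

In the row-major case neighbouring threads read addresses 50 bytes apart, so each read is likely to fall in a different memory transaction; in the transposed case the same reads cover one contiguous run.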
Also, because this code executes in a SIMD fashion, each GPU thread will probably have to run the loop to the full search phrase length, whereas the CPU will get to take advantage of early outs. The GPU code might actually run a little faster if you remove the early outs and just keep the code simple. Much depends on the search phrase length and likelihood of a match.

I'll take a guess too: the GPU isn't optimized for if/else. It doesn't predict branches (it probably executes both sides), so try to rewrite the algorithm in a more linear way, without any conditionals or with as few as possible.
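As an illustration of the "no early out, minimal branching" idea, the sketch below compares a row against a phrase in two ways. It is written in Python only to show the logic; the function names are hypothetical, and it assumes the search checks whether a row starts with the phrase, which may differ from the asker's kernel.

```python
def row_matches_early_out(row, phrase):
    # CPU-style comparison: stop at the first mismatching character.
    for i, ch in enumerate(phrase):
        if row[i] != ch:
            return False
    return True


def row_matches_branch_free(row, phrase):
    # GPU-style comparison: always walk the full phrase and accumulate
    # differences with XOR/OR, so every thread performs the same steps.
    mismatches = 0
    for i in range(len(phrase)):
        mismatches |= ord(row[i]) ^ ord(phrase[i])
    return mismatches == 0


rows = ["hello world", "help me now", "hello there"]
hits_early = [row_matches_early_out(r, "hello") for r in rows]
hits_flat = [row_matches_branch_free(r, "hello") for r in rows]
print(hits_early, hits_flat)
```

Both versions return the same answer; the second simply trades the data-dependent exit for a fixed amount of work per row, which is the trade-off the answers above describe.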


How to check bus utilization / bus load for GPU during ML inference?

I am running ML inference for image recognition on the GPU using onnxruntime, and I am seeing an upper limit on how much performance improvement batching of images gives me: inference time drops up to a batch_size of around 8, and beyond that the time remains constant. I assume this must be because some GPU resource is fully utilized, as I don't see any such limitation mentioned in the ONNX documentation.
I tried using the package pynvml.smi to get nvidia_smi and printed some utilization factors during inference, like this:
from pynvml.smi import nvidia_smi
utilization_percent = nvidia_smi.getInstance().DeviceQuery()['gpu'][0]['utilization']
gpu_util.append(utilization_percent['gpu_util'])
mem_util.append(utilization_percent['memory_util'])
What I do see is that gpu_util and memory_util stay within 25% for the entire run of my inference, even at batch sizes like 32 or 64, so these are unlikely to be the cause of the bottleneck.
I assume then, that it must be the bus load limitation that might be causing this. I did not find any option within nvidia-smi to print the GPU bus load.
How can I find the bus load during the inference?

Why is my CPU doing matrix operations faster than GPU instead?

When I tried to verify that the GPU does matrix operations faster than the CPU, I got unexpected results. The CPU performs better than the GPU according to my experiment, which confuses me.
I used the CPU and the GPU to do matrix multiplication respectively. The programming environment is MXNet and cuda-10.1.
with gpu:
import mxnet as mx
from mxnet import nd
x = nd.random.normal(shape=(100000,100000),ctx=mx.gpu())
y = nd.random.normal(shape=(100000,100000),ctx=mx.gpu())
50.8 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
with cpu:
x1 = nd.random.normal(shape=(100000,100000),ctx=mx.cpu())
y1 = nd.random.normal(shape=(100000,100000),ctx=mx.cpu())
33.4 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Why is the CPU faster? My CPU model is i5-6300HQ and my GPU model is NVIDIA GTX 950M.
TLDR: Your matrix multiplication is actually not running :)
MXNet is an asynchronous framework: work requests are piled onto a queue and processed asynchronously, on a need-to-run basis, by its execution engine. So what you're measuring is only the time it took to send the request, not to execute it. That's why it is so small (microseconds on a 100k*100k matrix would be surprisingly fast) and roughly equal for both CPU and GPU. To force execution, you need to add a call that forces production of a result, for example a print or nd.dot(x, y).wait_to_read(). See here for code very similar to your benchmark.
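The effect can be reproduced without MXNet. The sketch below is only an analogy built on Python's standard `concurrent.futures`: `slow_matmul_stand_in` is a hypothetical placeholder for the real computation, `submit` plays the role of enqueueing the operation, and `result()` plays the role of `wait_to_read()`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_matmul_stand_in():
    # Hypothetical stand-in for an expensive matrix product.
    time.sleep(0.2)
    return 42


with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    future = pool.submit(slow_matmul_stand_in)   # returns as soon as the work is queued
    enqueue_seconds = time.perf_counter() - start

    result = future.result()                     # blocks until the work has finished
    total_seconds = time.perf_counter() - start

print(f"enqueue only: {enqueue_seconds:.6f} s")
print(f"enqueue + wait: {total_seconds:.6f} s")
```

Timing only the first half measures the cost of queueing the request, which is what the benchmark in the question appears to be doing.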
Extra comments:
The gain of using a GPU over a CPU comes with the size of the parallelism opportunity. On simple tasks, that gain can be small to non-existent. CPU core frequencies are actually 2 to 3 times higher than GPU frequencies (your i5-6300HQ does 2.3GHz with 3.2GHz boost ability while your GTX 950M does 0.9GHz with 1.1GHz boost ability).
MXNet ndarray is very fast at matrix algebra on CPU, because (1) its asynchronous paradigm optimizes the order of computation, (2) its C++ backend runs things in parallel and (3) I believe the default MXNet build comes with Intel MKL, which significantly boosts the algebra capabilities of Intel CPUs.
Its ability to run compute on GPU within the same API is also a big strength over NumPy, for example.
I don't think your test will run on GPU: instantiating such a big matrix on an NVIDIA Tesla V100 (16GB mem, 4x more than a GTX 950M) runs into a "large tensor size error".
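A back-of-the-envelope check shows why a matrix of this shape is a problem. This sketch assumes 4-byte float32 elements (an assumption about the dtype used in the question); the 16 GB figure is the one quoted above. The link to the exact error message is a guess, not something confirmed here.

```python
rows = cols = 100_000
bytes_per_element = 4            # assumption: float32 elements

elements = rows * cols
matrix_bytes = elements * bytes_per_element
matrix_gb = matrix_bytes / 1e9

v100_memory_gb = 16              # figure quoted in the answer above

print(f"{elements:,} elements, {matrix_gb:.0f} GB per matrix")
print("fits on the V100:", matrix_gb <= v100_memory_gb)
# The element count is also larger than a signed 32-bit index can address,
# which may be related to the "large tensor size" wording of the error.
print("exceeds 2**31 - 1 elements:", elements > 2**31 - 1)
```

Under that assumption a single operand needs about 40 GB, before counting the second operand and the result.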
I don't know the module you're using but your CPU can access your memory way quicker and also saves a lot of stuff in cache. Your GPU has longer times to load the data into GPU memory and also takes longer to be called from your CPU.
That's always the downside of GPU computation. When you can load a bunch of data into GPU memory, there's a good chance of being faster.
Btw, that's why deep learning frameworks work in batches. When you can't work with batches, I'd always use the CPU. You also have some potential for performance improvements with multiprocessing.

MPSImageGaussianBlur in a rendering pipeline

I've got an MPSImageGaussianBlur object doing work on each frame of a compute pass (blurring the contents of an intermediate texture).
While the app is still running at 60fps no problem, I see an increase of ~15% in CPU usage when enabling the blur pass. I'm wondering if this is normal?
I'm just curious as to what could be going on under the hood of MPSImageGaussianBlur's encodeToCommandBuffer: operation that would see so much CPU utilization. In my (albeit naive) understanding, I'd imagine there would just be some simple encoding along the lines of:
MPSImageGaussianBlur.encodeToCommandBuffer: pseudo-method :
func encodeToCommandBuffer(commandBuffer: MTLCommandBuffer, sourceTexture: MTLTexture, destinationTexture: MTLTexture) {
    let encoder = commandBuffer.computeCommandEncoder()
    encoder.setTexture(sourceTexture, atIndex: 0)
    encoder.setTexture(destinationTexture, atIndex: 1)
    // kernel weights would be built at initialization and
    // present here as a `kernelWeights` property
    encoder.setTexture(self.kernelWeights, atIndex: 2)
    let threadgroupsPerGrid = ...
    let threadsPerThreadgroup = ...
    encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    // an encoder must be ended before the command buffer is committed
    encoder.endEncoding()
}
Most of the 'performance magic' would be implemented in the algorithms running in the compute kernel function. I can appreciate that bit, because performance (on the GPU) is pretty fantastic regardless of the blurRadius I initialize the MPSImageGaussianBlur with.
Some probably irrelevant details about my specific setup:
MPSImageGaussianBlur initialized with blur radius 8 pixels.
The texture I'm blurring is 128 by 128 pixels.
Performing all rendering in an MTKViewDelegate's drawInMTKView: method.
I hope this question is somewhat clear in its intent.
MPSImageGaussianBlur is internally a complex multipass algorithm. It spends some time allocating textures out of its internal texture cache to hold the intermediate data. There is the overhead of multiple kernel launches to be managed. Some resources, like the Gaussian blur kernel weights, also need to be set up. When you commit the command buffer, all these textures need to be wired down (iOS) and some other work needs to be done. So, it is not quite as simple as you imagine.
The texture you are using is small enough that the relatively fixed CPU overhead can start to become an appreciable part of the time.
Filing a radar on the CPU cost of MPSImageGaussianBlur would cause Apple to spend an hour or two looking at whether something can be improved, and would be worth your time.
I honestly would not be surprised if, under the hood, the GPU was being used less than you would think for the kernel. In my first experiences with Metal compute I found performance underwhelming and fell back on NEON again. It was counterintuitive. I really wouldn't be surprised if the CPU hit was NEON. I saw the same using the MPS Gaussian blur. It would be nice to get this confirmed. NEON has a lot of memory and instruction features that are friendlier to this use case.
Also, an indicator that this might be the case is that these filters don't run on OS X Metal. If it were just compute shaders, I'm sure they could run. But NEON code can't run on the simulator.

Why my OpenCV CUDA is running slower than CPU for simple thresholding?

My CPU is Intel Core2 Duo T5550, GPU is GeForce 8400M G. CUDA version 5.5.22, OpenCV version 2.4.8.
The test code is as follows:
double t = (double)getTickCount();
gpu::threshold(src, dst, thres, binMax, THRESH_BINARY);
t = ((double)getTickCount() - t)/getTickFrequency();
cout << "Times passed in seconds: " << t << endl;
For a 3648*2736 image, the result is
CPU: Times passed in seconds: 0.0136336
GPU: Times passed in seconds: 0.0217714
Perhaps this is not surprising.
Your GeForce 8400M G is an old mobile card with only 8 cores (see the GeForce 8M series specifications), so you cannot extract much parallelism out of it.
Bluntly speaking, GPUs are advantageous over multicore CPUs when you can extract massive parallelism from a large number of cores. In other words, to build an Egyptian pyramid quickly with slow workers (GPU cores) you need a large number of them. If you have only very few slow workers (8 in your case), then perhaps it is better to have even fewer (2 CPU cores, for example), but much faster, workers.
I remembered just now to have bumped into this post
Finding minimum in GPU slower than CPU
which may help convince you that bad implementations (as underlined by Abid Rahman and Mailerdaimon) may lead to GPU codes that are slower than CPU ones. The situation is even worse if, as pointed out in the answer to the post above, you are hosting also the X display on your already limited GeForce 8400M G card.
In addition to what @JackOLantern said:
Every copy operation involving the GPU takes time! A lot of time compared to just computing on the CPU. This is why @Abid Rahman K's comment is a good idea: he suggested testing again with more complex code. The advantage of the GPU is fast parallel processing; one of its disadvantages is the relatively slow transfer rate when copying data to and from the GPU.

Is a CUDA-programmed GPU suitable for implementation of OpenCV adaptive threshold?

On my system, for a 5 MP image with a large window size (75px) it takes a whopping 140 ms (roughly 20 times as much as linear operations) to complete and I am looking to optimize it. I have noticed that the OpenCV gpu module does not implement a gpu version of the adaptiveThreshold so I have been thinking of implementing that algorithm for the GPU myself.
Can I hope for any speedup if I implement an adaptive threshold algorithm in CUDA, based on a large window size (50px+) and a large image (5 MP+), ignoring the overhead for loading memory into the GPU?
adaptiveThreshold documentation on
Building on Eric's answer:
The NPP CUDA library does not implement adaptiveThreshold, but it can be used to get an adaptive threshold in a VERY straightforward way (I just tested it and, anecdotally, it works):
1. Run a box filter on src (i.e. compute the mean window value for every pixel) and store the result in an intermediate image tmp.
2. Subtract a number K from each pixel in tmp.
3. Run a compare function between src and tmp into dst. The end.
The code may look like this (here K=0, 2nd step omitted):
nppiFilterBox_8u_C1R(, oDeviceSrc.pitch(),, oDeviceDst.pitch(),
oSizeROI, oAdapThreshWindowSize,oAnchor);
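For readers who want to check the logic of the three steps without a GPU, here is a minimal pure-Python reference sketch. It is not the NPP code: the function names are hypothetical, and the border handling (averaging only the pixels that fall inside the image) is an assumption that may differ from what NPP or OpenCV do at the edges.

```python
def box_mean(img, radius):
    # Step 1: mean of the (2*radius+1)^2 window around each pixel,
    # using only the pixels that fall inside the image (assumed border rule).
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0, 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def adaptive_threshold(img, radius, k, max_value=255):
    tmp = box_mean(img, radius)
    # Steps 2 and 3: subtract K from the local mean, then compare src against it.
    return [[max_value if img[y][x] > tmp[y][x] - k else 0
             for x in range(len(img[0]))]
            for y in range(len(img))]


image = [
    [10, 10, 10, 10],
    [10, 200, 10, 10],
    [10, 10, 10, 10],
]
result = adaptive_threshold(image, radius=1, k=0)
for row in result:
    print(row)
```

Only the bright pixel exceeds its local mean, so it is the only one set to 255. A CPU reference like this can be useful for validating the output of a GPU version on small images.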
Also, Wikipedia claims that applying a box filter 3 times in a row approximates a Gaussian filter to 97% accuracy.
Yes, this algorithm can be optimized on the GPU. I would expect to see an excellent speedup.
For ADAPTIVE_THRESH_MEAN_C, you could use a standard parallel reduction to calculate the arithmetic mean. For ADAPTIVE_THRESH_GAUSSIAN_C, you might use a kernel that performs per-pixel gaussian attenuation combined with a standard parallel reduction for the sum.
Implementing this in CUDA should give you a satisfying performance gain.
Since your window size is large, this operation should be compute-bound. The theoretical best-case time for a 5 MP image with a 75px window on a Tesla K20X GPU should be about
5e6 * 75 * 75 / 3.95 Tflops = 7ms
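The estimate above can be reproduced in a few lines. This sketch assumes one floating-point operation per window element per pixel and uses the 3.95 Tflops peak figure quoted above; a real kernel will not reach peak throughput, so treat the result as a lower bound.

```python
pixels = 5e6                 # 5 MP image
window = 75                  # 75 x 75 window
peak_flops = 3.95e12         # peak figure quoted above for the Tesla K20X

operations = pixels * window * window   # assumed: one operation per window element
seconds = operations / peak_flops

print(f"{operations:.4e} operations")
print(f"{seconds * 1e3:.1f} ms")
```

That works out to roughly 7 ms, compared with the 140 ms reported in the question.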
Here's a white paper about image convolution. It shows how to implement a high-performance box filter with CUDA.
NVIDIA's NPP library also provides a function nppiFilterBox(), which can be used to implement ADAPTIVE_THRESH_MEAN_C directly.
For ADAPTIVE_THRESH_GAUSSIAN_C, the function nppiFilter() with a proper mask could be used.
NPP doc pp.1009
