MPSImageGaussianBlur in a rendering pipeline - metal

I've got a MPSImageGaussianBlur object doing work on each frame of a compute pass (Blurring the contents of an intermediate texture).
While the app is still running at 60fps no problem, I see an increase of ~15% in CPU usage when enabling the blur pass. I'm wondering if this is normal?
I'm just curious as to what could be going on under the hood of MPSImageGaussianBlur's encodeToCommandBuffer: operation that would see so much CPU utilization. In my (albeit naive) understanding, I'd imagine there would just be some simple encoding along the lines of:
MPSImageGaussianBlur.encodeToCommandBuffer: pseudo-method :
func encodeToCommandBuffer(commandBuffer: MTLCommandBuffer, sourceTexture: MTLTexture, destinationTexture: MTLTexture) {
let encoder = commandBuffer.computeCommandEncoder()
encoder.setComputePipelineState(...)
encoder.setTexture(sourceTexture, atIndex: 0)
encoder.setTexture(destinationTexture, atIndex: 1)
// kernel weights would be built at initialization and
// present here as a `kernelWeights` property
encoder.setTexture(self.kernelWeights, atIndex: 2)
let threadgroupsPerGrid = ...
let threadsPerThreadgroup = ...
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()
}
Most of the 'performance magic' would be implemented on the algorithms running in the compute kernel function. I can appreciate that bit because performance (on the GPU) is pretty fantastic independent of the blurRadius I initialize the MPSImageGaussianBlur with.
Some probably irrelevant details about my specific setup:
MPSImageGaussianBlur initialized with blur radius 8 pixels.
The texture I'm blurring is 128 by 128 pixels.
Performing all rendering in an MTKViewDelegate's drawInMTKView: method.
I hope this question is somewhat clear in it's intent.

MPSGaussianBlur is internally a complex multipass algorithm. It is spending some time allocating textures out of its internal texture cache to hold the intermediate data. There is the overhead of multiple kernel launches to be managed. Also some resources like Gaussian blur kernel weights need to be set up. When you commit the command buffer, all these textures need to be wired down (iOS) and some other work needs to be done. So, it is not quite as simple as you imagine.
The texture you are using is small enough that the relatively fixed CPU overhead can start to become an appreciable part of the time.
Filing a radar on the CPU cost of MPSGassianBlur would cause Apple to spend an hour or two looking if something can be improved, and will be worth your time.

I honestly would not be surprised if under the hood the gpu was being less accessed than you would think for the kernel. In my first experiences with metal compute I found performance underwhelming and fell back again on neon. It was counter intuitive. I really wouldn't be surprised if the cpu hit was neon. I saw the same using mps Gaussian. It would be nice to get this confirmed. Neon has a lot of memory and instruction features that are friendlier to this use case.
Also, an indicator that this might be the case is that these filters don't run on OS X Metal. If it were just compute shaders I'm sure they could run. But Neon code can't run on the simulator.

Related

Calculating matrices on CPU takes up most of the frame time

I am writing a simple engine for a simple game, so far I enjoy my little hobby project but the game grew and it now has roughly 800 of game objects at a time in a scene.
Every object, just like in Unity, has a transform component that calculates transformation matrix when the component is initialized. I started to notice that with 800 objects it takes 5.4 milliseconds just to update each matrix (for example if every object has moved) without any additional components or anything else.
I use GLKit math library, which for some reason faster than using native simd types. using simd types triples the time of calculation
here is a pice of code that runs it
let Translation : GLKMatrix4 = GLKMatrix4MakeTranslation(position.x, position.y, position.z)
let Scale : GLKMatrix4 = GLKMatrix4MakeScale(scale.x, scale.y, scale.z)
let Rotation : GLKMatrix4 = GLKMatrix4MakeRotationFromEulerVector(rotation)
//Produce model matrix
let SRT = GLKMatrix4Multiply(Translation, GLKMatrix4Multiply(Rotation, Scale))
Question: I am looking for a way to optimize this so I can use more game objects. and utilize more components on my objects
There could be multiple bottlenecks in your program.
Optimise your frame dependencies to avoid stalls as much as possible, e.g. by precomputing frame data on CPU. This is a good resource to learn about this technique.
Make sure that all matrices are stored in one MTLBuffer which is indexed from your vertex stage
On Apple silicon and iOS use MTLResourceStorageModeShared
If you really want to scale to tens of thousands of objects, then compute your matrices in a compute shader to store them in an MTLBuffer. Then, use indirect rendering to issue your draw calls.
In general, learn about AZDO.
Learn about compute shaders: https://developer.apple.com/documentation/metal/basic_tasks_and_concepts/performing_calculations_on_a_gpu
Learn about indirect rendering:
https://developer.apple.com/documentation/metal/indirect_command_buffers

iOS Metal compute pipeline slower than CPU implementation for search task

I made simple experiment, by implementing naive char search algorithm searching 1.000.000 rows of 50 characters each (50 mil char map) on both CPU and GPU (using iOS8 Metal compute pipeline).
CPU implementation uses simple loop, Metal implementation gives each kernel 1 row to process (source code below).
To my surprise, Metal implementation is on average 2-3 times slower than simple, linear CPU (if I use 1 core) and 3-4 times slower if I employ 2 cores (each of them searching half of database)!
I experimented with diffrent threads per group (16, 32, 64, 128, 512) yet still get very similar results.
iPhone 6:
CPU 1 core: approx 0.12 sec
CPU 2 cores: approx 0.075 sec
GPU: approx 0.35 sec (relEase mode, validation disabled)
I can see Metal shader spending more than 90% of accessing memory (see below).
What can be done to optimise it?
Any insights will be appreciated, as there are not many sources in the internet (besides standard Apple programming guides), providing details on memory access internals & trade-offs specific to the Metal framework.
METAL IMPLEMENTATION DETAILS:
Host code gist:
https://gist.github.com/lukaszmargielewski/0a3b16d4661dd7d7e00d
Kernel (shader) code:
https://gist.github.com/lukaszmargielewski/6b64d06d2d106d110126
GPU frame capture profiling results:
The GPU shader is also striding vertically through memory, whereas the CPU is moving horizontally. Consider the addresses actually touched more or less concurrently by each thread executing in lockstep in your shader as you read charTable. The GPU will probably run a good deal faster if your charTable matrix is transposed.
Also, because this code executes in a SIMD fashion, each GPU thread will probably have to run the loop to the full search phrase length, whereas the CPU will get to take advantage of early outs. The GPU code might actually run a little faster if you remove the early outs and just keep the code simple. Much depends on the search phrase length and likelihood of a match.
I'll take my guesses too, gpu isn't optimized for if/else, it doesn't predict branches (it probably execute both), try to rewrite the algorithm in a more linear way without any conditional or reduce them to bare minimum.

Is a CUDA-programmed GPU suitable for implementation of OpenCV adaptive threshold?

On my system, for a 5 MP image with a large window size (75px) it takes a whopping 140 ms (roughly 20 times as much as linear operations) to complete and I am looking to optimize it. I have noticed that the OpenCV gpu module does not implement a gpu version of the adaptiveThreshold so I have been thinking of implementing that algorithm for the GPU myself.
Can I hope for any speedup if I implement an adaptive threshold algorithm in CUDA, based on a large window size (50px+) and a large image (5 MP+), ignoring the overhead for loading memory into the GPU?
adaptiveThreshold documentation on opencv.org:
http://docs.opencv.org/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold
Building on Eric's answer:
The Npp CUDA library does not implement adaptiveThreshold but it seems beneficial to getting an adaptive threshold in a VERY straightforward way (just tested it and anecdotally works):
Run a box filter on src (i.e. compute mean window value for every pixel),
store in an intermediate image tmp.
Subtract a number K from each pixel in tmp
Run a compare function between src and
tmp into dst. The end.
The code may look like this (here K=0, 2nd step omitted):
nppiFilterBox_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
oDeviceIntermediate.data(), oDeviceDst.pitch(),
oSizeROI, oAdapThreshWindowSize,oAnchor);
nppiCompare_8u_C1R(oDeviceSrc.data(),oDeviceSrc.pitch(),
oDeviceDst.data(),oDeviceDst.pitch(),
oDeviceResult.data(),oDeviceResult.pitch(),
oSizeROI,NPP_CMP_LESS);
Also, wikipedia claims that applying a box filter 3 times in a row approximates a Gaussian filter to 97% accuracy.
Yes, this algorithm can be optimized on the GPU. I would expect to see an excellent speedup.
For ADAPTIVE_THRESH_MEAN_C, you could use a standard parallel reduction to calculate the arithmetic mean. For ADAPTIVE_THRESH_GAUSSIAN_C, you might use a kernel that performs per-pixel gaussian attenuation combined with a standard parallel reduction for the sum.
Implementation by CUDA should give you a satisfied performance gain.
Since your window size is large, this operation should be compute-bounded. The theoretical peak performance of a 5 MP image with 75px window on a Tesla K20X GPU should be about
5e6 * 75 * 75 / 3.95 Tflops = 7ms
Here's a white paper about image convolution. It shows how to implement a high performance box filer with CUDA.
http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf
Nvidia cuNPP library also provides a function nppiFilterBox(), which can be used to implement ADAPTIVE_THRESH_MEAN_C directly.
http://docs.nvidia.com/cuda/cuda-samples/index.html#box-filter-with-npp
For ADAPTIVE_THRESH_GAUSSIAN_C, the function nppiFilter() with a proper mask could be used.
NPP doc pp.1009 http://docs.nvidia.com/cuda/pdf/NPP_Library.pdf

OpenGL ES best practices for conditionals

Apple says in their Best Practices For Shaders to avoid branching if possible, and especially branching on values calculated within the shader. So I replaced some if statements with the built-in clamp() function. My question is, are clamp(), min(), and max() likely to be more efficient, or are they merely convenience (i.e. macro) functions that simply expand to if blocks?
I realize the answer may be implementation dependent. In any case, the functions are obviously cleaner and make plain the intent, which the compiler could do something with.
Historically speaking GPUs have supported per-fragment instructions such as MIN and MAX for much longer than they have supported arbitrary conditional branching. One example of this in desktop OpenGL is the GL_ARB_fragment_program extension (now superseded by GLSL) which explicitly states that it doesn't support branching, but it does provide instructions for MIN and MAX as well as some other conditional instructions.
I'd be pretty confident that all GPUs will still have dedicated hardware for these operations given how common min(), max() and clamp() are in shaders. This isn't guaranteed by the specification because an implementation can optimize code however it sees fit, but in the real world you should use GLSL's built-in functions rather than rolling your own.
The only exception would be if your conditional was being used to avoid a large amount of additional fragment processing. At some point the cost of a branch will be less than the cost of running all the code in the branch, but the balance here will be very hardware dependent and you'd have to benchmark to see if it actually helps in your application on its target hardware. Here's the kind of thing I mean:
void main() {
vec3 N = ...;
vec3 L = ...;
float NDotL = dot(N, L);
if (NDotL > 0.0)
{
// Lots of very intensive code for an awesome shadowing algorithm that we
// want to avoid wasting time on if the fragment is facing away from the light
}
}
Just clamping NDotL to 0-1 and then always processing the shadow code on every fragment only to multiply through your final shadow term by NDotL is a lot of wasted effort if NDotL was originally <= 0, and we can theoretically avoid this overhead with a branch. The reason this kind of thing is not always a performance win is that it is very dependent on how the hardware implements shader branching.

Advantages of cv::Matx

I noticed that a new data structure cv::Matx was added to the new OpenCV version, intended for small matrices of known size at compilation time, for example
cv::Matx31f // matrix 3x1 of float type
Checking the documentation I saw that most of matrix operations are available, but still I don't see the advantages of using this new type instead of the old cv::Mat.
When should I use Matx instead of Mat?
Short answer: cv::Mat uses the heap to store its data, while cv::Matx uses the stack.
A cv::Mat uses dynamic memory allocation (on the heap). This is appropriate for big matrices (like images) and lets you do things like shallow copies of a matrix, which is the default behavior of cv::Mat.
However, for the small matrices that cv::Matx is designed for, heap allocation would be very expensive compared to doing the same thing on the stack. I have seen a block of math reduce processing time by over 75% by switching to using stack-allocated types (e.g. cv::Point and cv::Matx) instead of cv::Mat.
It's about memory management and not wasting (in some cases important) memory or just reservation of memory for an object you'll use later.
That's how I understand it – may be someone else can give a better explanation.
This is a late late answer, but it is still an interesting question!
dom's answer is quite accurate, and the heap/stack reference in user1460044's is also interesting.
From a practical point of view, I wouldn't use Matx (or Vec), except if it were completely necessary. The major advantages of Matx are
Using the stack (efficient![1])
Initialization.
The problem is, at the end you will have to move your Matx data to a Mat to do most of stuff, and so, you will be back at the heap again.
On the other hand, the "cool initialization" of a Matx can be done in a normal Mat:
// Matx initialization:
Matx31f A(1.f,2.f,3.f);
// Mat initialization:
Mat B = (Mat_<float>(3,1) << 1.f, 2.f, 3.f);
Also, there is a difference in initialization (beyond the heap/stack) stuff. If you try to put 5 values into the Matx31, it will crash (runtime exception), while calling the Mat_::operator<< with 5 values will only store the first three.
[1] Efficient if your program has to create lots of matrices of less than ~10 elements. In that case use Matx matrices.
There are 2 other reasons why I prefer Matx to Mat:
Readability: people reading the code can immediately see the size of the matrices, for example:
cv::Matx34d transform = ...;
It's clear that this is a 3x4 matrix, so it contains a 3D transformation of type (R,t), where R is a rotation matrix (as opposed to say, axis-angle).
Similarly, accessing an element is more natural with transform(i,j) vs transform.at<double>(i,j).
Easy debugging. Since the elements for Matx are allocated on the stack in an array of known length, IDEs or debuggers can display the entire contents nicely when stepping through the code.

Resources