Calculating matrices on CPU takes up most of the frame time - ios

I am writing a simple engine for a simple game, so far I enjoy my little hobby project but the game grew and it now has roughly 800 of game objects at a time in a scene.
Every object, just like in Unity, has a transform component that calculates transformation matrix when the component is initialized. I started to notice that with 800 objects it takes 5.4 milliseconds just to update each matrix (for example if every object has moved) without any additional components or anything else.
I use GLKit math library, which for some reason faster than using native simd types. using simd types triples the time of calculation
here is a pice of code that runs it
let Translation : GLKMatrix4 = GLKMatrix4MakeTranslation(position.x, position.y, position.z)
let Scale : GLKMatrix4 = GLKMatrix4MakeScale(scale.x, scale.y, scale.z)
let Rotation : GLKMatrix4 = GLKMatrix4MakeRotationFromEulerVector(rotation)
//Produce model matrix
let SRT = GLKMatrix4Multiply(Translation, GLKMatrix4Multiply(Rotation, Scale))
Question: I am looking for a way to optimize this so I can use more game objects. and utilize more components on my objects

There could be multiple bottlenecks in your program.
Optimise your frame dependencies to avoid stalls as much as possible, e.g. by precomputing frame data on CPU. This is a good resource to learn about this technique.
Make sure that all matrices are stored in one MTLBuffer which is indexed from your vertex stage
On Apple silicon and iOS use MTLResourceStorageModeShared
If you really want to scale to tens of thousands of objects, then compute your matrices in a compute shader to store them in an MTLBuffer. Then, use indirect rendering to issue your draw calls.
In general, learn about AZDO.
Learn about compute shaders: https://developer.apple.com/documentation/metal/basic_tasks_and_concepts/performing_calculations_on_a_gpu
Learn about indirect rendering:
https://developer.apple.com/documentation/metal/indirect_command_buffers

Related

PETSc vectorize operations with neighboring vector values

I'm implementing finite difference algorithm from uFDTD book. Many FDM equations involve operations on adjoined vector elements.
For example, an update equation for electric field
ez[m] = ez[m] + (hy[m] - hy[m-1]) * imp0
uses adjoined vector values hy[m] and hy[m-1].
How can I implement these operations in PETSc efficiently? Is there something beyond local vector loops and scatterers?
If my goal was efficiency, I would call a stencil engine. There are many many many papers, and sometimes even open source code, for example, Devito. The idea is that PETSc manages the data structure and parallelism. Then you can feed the local data brick to your favorite stencil computer.

How to use apple's Accelerate framework in swift in order to compute the FFT of a real signal?

How am I supposed to use the Accelerate framework to compute the FFT of a real signal in Swift on iOS?
Available example on the web
Apple’s Accelerate framework seems to provide functions to compute the FFT of a signal efficiently.
Unfortunately, most of the examples available on the Internet, like Swift-FFT-Example and TempiFFT, crash if tested extensively and call the Objective C API.
The Apple documentation answers many questions, but also leads to some others (Is this piece mandatory? Why do I need this call to convert?).
Threads on Stack Overflow
There are few threads addressing various aspects of the FFT with concrete examples. Notably FFT Using Accelerate In Swift, DFT result in Swift is different than that of MATLAB and FFT Calculating incorrectly - Swift.
None of them address directly the question “What is the proper way to do it, starting from 0”?
It took me one day to figure out how to properly do it, so I hope this thread can give a clear explanation of how you are supposed to use Apple's FFT, show what are the pitfalls to avoid, and help developers save precious hours of their time.
TL ; DR : If you need a working implementation to copy past here is a gist.
What is FFT?
The Fast Fourier transform is an algorithm that take a signal in the time domain -- a collection of measurements took at a regular, usual small, interval of time -- and turn it into a signal expressed into the phase domain (a collection of frequency).
The ability to express the signal along time lost by the transformation (the transformation is invertible, which means no information is lost by computing the FFT and you can apply a IFFT to get the original signal back), but we get the ability to distinguish between frequencies that the signal contained. This is typically used to display the spectrograms of the music you are listening to on various hardware and youtube videos.
The FFT works with complexe numbers. If you don't know what they are, lets just pretend it is a combination of a radius and an angle. There is one complex number per point on a 2D plane. Real numbers (your usual floats) can be saw as a position on a line (negative on the left, positive on the right).
Nb: FFT(FFT(FFT(FFT(X))) = X (up to a constant depending on your FFT implementation).
How to compute the FFT of a real signal.
Usual you want to compute the FFT of a small window of an audio signal. For sake of the example, we will take a small 1024 samples window. You would also prefer to use a power of two, otherwise things gets a little bit more difficult.
var signal: [Float] // Array of length 1024
First, you need to initialize some constants for the computation.
// The length of the input
length = vDSP_Length(signal.count)
// The power of two of two times the length of the input.
// Do not forget this factor 2.
log2n = vDSP_Length(ceil(log2(Float(length * 2))))
// Create the instance of the FFT class which allow computing FFT of complex vector with length
// up to `length`.
fftSetup = vDSP.FFT(log2n: log2n, radix: .radix2, ofType: DSPSplitComplex.self)!
Following apple's documentation, we first need to create a complex array that will be our input.
Dont get mislead by the tutorial. What you usual want is to copy your signal as the real part of the input, and keep the complex part null.
// Input / Output arrays
var forwardInputReal = [Float](signal) // Copy the signal here
var forwardInputImag = [Float](repeating: 0, count: Int(length))
var forwardOutputReal = [Float](repeating: 0, count: Int(length))
var forwardOutputImag = [Float](repeating: 0, count: Int(length))
Be careful, the FFT function do not allow to use the same splitComplex as input and output at the same time. If you experience crashs, this may be the cause. This is why we define both an input and an output.
Now, we have to be careful and "lock" the pointer to this four arrays, as showed in the documentation example. If you simply use &forwardInputReal as argument of your DSPSplitComplex, the pointer may become invalidated at the following line and you will likely experience sporadic crash of your app.
forwardInputReal.withUnsafeMutableBufferPointer { forwardInputRealPtr in
forwardInputImag.withUnsafeMutableBufferPointer { forwardInputImagPtr in
forwardOutputReal.withUnsafeMutableBufferPointer { forwardOutputRealPtr in
fforwardOutputImag.withUnsafeMutableBufferPointer { forwardOutputImagPtr in
// Input
let forwardInput = DSPSplitComplex(realp: forwardInputRealPtr.baseAddress!, imagp: forwardInputImagPtr.baseAddress!)
// Output
var forwardOutput = DSPSplitComplex(realp: forwardOutputRealPtr.baseAddress!, imagp: forwardOutputImagPtr.baseAddress!)
// FFT call goes here
}
}
}
}
Now, the finale line: the call to your fft:
fftSetup.forward(input: forwardInput, output: &forwardOutput)
The result of your FFT is now available in forwardOutputReal and forwardOutputImag.
If you want only the amplitude of each frequency, and you don't care about the real and imaginary part, you can declare alongside the input and output an additional array:
var magnitudes = [Float](repeating: 0, count: Int(length))
add right after your fft compute the amplitude of each "bin" with:
vDSP.absolute(forwardOutput, result: &magnitudes)

MPSImageGaussianBlur in a rendering pipeline

I've got a MPSImageGaussianBlur object doing work on each frame of a compute pass (Blurring the contents of an intermediate texture).
While the app is still running at 60fps no problem, I see an increase of ~15% in CPU usage when enabling the blur pass. I'm wondering if this is normal?
I'm just curious as to what could be going on under the hood of MPSImageGaussianBlur's encodeToCommandBuffer: operation that would see so much CPU utilization. In my (albeit naive) understanding, I'd imagine there would just be some simple encoding along the lines of:
MPSImageGaussianBlur.encodeToCommandBuffer: pseudo-method :
func encodeToCommandBuffer(commandBuffer: MTLCommandBuffer, sourceTexture: MTLTexture, destinationTexture: MTLTexture) {
let encoder = commandBuffer.computeCommandEncoder()
encoder.setComputePipelineState(...)
encoder.setTexture(sourceTexture, atIndex: 0)
encoder.setTexture(destinationTexture, atIndex: 1)
// kernel weights would be built at initialization and
// present here as a `kernelWeights` property
encoder.setTexture(self.kernelWeights, atIndex: 2)
let threadgroupsPerGrid = ...
let threadsPerThreadgroup = ...
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()
}
Most of the 'performance magic' would be implemented on the algorithms running in the compute kernel function. I can appreciate that bit because performance (on the GPU) is pretty fantastic independent of the blurRadius I initialize the MPSImageGaussianBlur with.
Some probably irrelevant details about my specific setup:
MPSImageGaussianBlur initialized with blur radius 8 pixels.
The texture I'm blurring is 128 by 128 pixels.
Performing all rendering in an MTKViewDelegate's drawInMTKView: method.
I hope this question is somewhat clear in it's intent.
MPSGaussianBlur is internally a complex multipass algorithm. It is spending some time allocating textures out of its internal texture cache to hold the intermediate data. There is the overhead of multiple kernel launches to be managed. Also some resources like Gaussian blur kernel weights need to be set up. When you commit the command buffer, all these textures need to be wired down (iOS) and some other work needs to be done. So, it is not quite as simple as you imagine.
The texture you are using is small enough that the relatively fixed CPU overhead can start to become an appreciable part of the time.
Filing a radar on the CPU cost of MPSGassianBlur would cause Apple to spend an hour or two looking if something can be improved, and will be worth your time.
I honestly would not be surprised if under the hood the gpu was being less accessed than you would think for the kernel. In my first experiences with metal compute I found performance underwhelming and fell back again on neon. It was counter intuitive. I really wouldn't be surprised if the cpu hit was neon. I saw the same using mps Gaussian. It would be nice to get this confirmed. Neon has a lot of memory and instruction features that are friendlier to this use case.
Also, an indicator that this might be the case is that these filters don't run on OS X Metal. If it were just compute shaders I'm sure they could run. But Neon code can't run on the simulator.

Advantages of cv::Matx

I noticed that a new data structure cv::Matx was added to the new OpenCV version, intended for small matrices of known size at compilation time, for example
cv::Matx31f // matrix 3x1 of float type
Checking the documentation I saw that most of matrix operations are available, but still I don't see the advantages of using this new type instead of the old cv::Mat.
When should I use Matx instead of Mat?
Short answer: cv::Mat uses the heap to store its data, while cv::Matx uses the stack.
A cv::Mat uses dynamic memory allocation (on the heap). This is appropriate for big matrices (like images) and lets you do things like shallow copies of a matrix, which is the default behavior of cv::Mat.
However, for the small matrices that cv::Matx is designed for, heap allocation would be very expensive compared to doing the same thing on the stack. I have seen a block of math reduce processing time by over 75% by switching to using stack-allocated types (e.g. cv::Point and cv::Matx) instead of cv::Mat.
It's about memory management and not wasting (in some cases important) memory or just reservation of memory for an object you'll use later.
That's how I understand it – may be someone else can give a better explanation.
This is a late late answer, but it is still an interesting question!
dom's answer is quite accurate, and the heap/stack reference in user1460044's is also interesting.
From a practical point of view, I wouldn't use Matx (or Vec), except if it were completely necessary. The major advantages of Matx are
Using the stack (efficient![1])
Initialization.
The problem is, at the end you will have to move your Matx data to a Mat to do most of stuff, and so, you will be back at the heap again.
On the other hand, the "cool initialization" of a Matx can be done in a normal Mat:
// Matx initialization:
Matx31f A(1.f,2.f,3.f);
// Mat initialization:
Mat B = (Mat_<float>(3,1) << 1.f, 2.f, 3.f);
Also, there is a difference in initialization (beyond the heap/stack) stuff. If you try to put 5 values into the Matx31, it will crash (runtime exception), while calling the Mat_::operator<< with 5 values will only store the first three.
[1] Efficient if your program has to create lots of matrices of less than ~10 elements. In that case use Matx matrices.
There are 2 other reasons why I prefer Matx to Mat:
Readability: people reading the code can immediately see the size of the matrices, for example:
cv::Matx34d transform = ...;
It's clear that this is a 3x4 matrix, so it contains a 3D transformation of type (R,t), where R is a rotation matrix (as opposed to say, axis-angle).
Similarly, accessing an element is more natural with transform(i,j) vs transform.at<double>(i,j).
Easy debugging. Since the elements for Matx are allocated on the stack in an array of known length, IDEs or debuggers can display the entire contents nicely when stepping through the code.

vDSP: Do the FFT functions include windowing?

I am working on implementing an algorithm using vDSP.
1) take FFT
2) take log of square of absolute value (can be done with lookup table)
3) take another FFT
4) take absolute value
I'm not sure if it is up to me to throw the incoming data through a windowing function before I run the FFT on it.
vDSP_fft_zrip(setupReal, &A, stride, log2n, direction);
that is my FFT function
Do I need to throw the data through vDSP_hamm_window(...) first?
The iOS Accelerate library function vDSP_fft_zrip() does not include applying a window function (unless you count the implied rectangular window due to the finite length parameter).
So you need to apply your chosen window function (there are many different ones) first.
It sounds like you're doing cepstral analysis and yes, you do need a window function prior to the first FFT. I would suggest a simple Hann or Hamming window.
I don't have any experience with your particular library, but in every other FFT library I know of it's up to you to window the data first. If nothing else, the library can't know what window you wish to use, and sometimes you don't want to use a window (if you're using the FFT for overlap-add filtering, or if you know the signal is exactly periodic in the transform block).
Also, just offhand, it seems like if you're doing 2 FFTs, the overhead of calling a logarithm function is relatively minor.

Resources