Objective C float vs int, CGPoint vs custom int based struct performance - ios

Based on the arguments in this post: Performance of Built-in types, can I conclude that my custom implementation of an int-based point structure is faster or more efficient than the float-based CGPoint? I have reviewed many posts about performance differences between types, but have not found one that covers the case where the values are wrapped in a structure.
Thanks.
// Coord
typedef struct {
int x;
int y;
} Coord;
CG_INLINE Coord CoordMake(int x, int y){
Coord coord; coord.x = x; coord.y = y; return coord;
}
CG_INLINE bool CoordEqualToCoord(Coord coord, Coord anotherCoord) {
return coord.x == anotherCoord.x && coord.y == anotherCoord.y;
}
CG_INLINE CGPoint CGPointForCoord(Coord coord) {
return CGPointMake(coord.x, coord.y);
}
EDIT: I have run purely arithmetic tests, and the differences are negligible until millions of iterations, which my application will not come close to doing. I will continue to use the Coord typedef but will remove the struct for a few of the reasons @meaning-matters suggests. For the record, the tests did show that the int-based structure was about 30% faster, but 30% of 0.0001 seconds is not something anyone should care about. I am still interested in the points and counterpoints on which implementation is better.

It depends on what you are doing with it. For ordinary arithmetic, throughput can be similar. Integer latency is usually a bit lower. On some processors, the latency to the L1 cache is better for GPRs than for FPRs. So, for many tests, the results will come out the same or give a small edge to integer computation. The balance will flip the other way for double vs. int64_t computation on 32-bit machines. (If you are writing CPU vector code and can get away with 16-bit computation, then it would be much faster to use integers.)
However, when calculating coordinates or addresses in order to load or store data, integer is clearly better on a CPU. The reason is that a load or store instruction can take an integer operand as an index into an array, but not a floating-point one. To use floating-point coordinates you must, at a minimum, convert to integer first and then load or store, so it should always be slower. Typically some rounding will also be needed (e.g. a floor() operation), and maybe some non-trivial operation to account for edging modes, such as the CL_ADDRESS_REPEAT addressing mode. Contrast that with a simple AND operation, which may be all that is necessary to achieve the same thing with integers, and it should be clear that integer is a much cheaper format.
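To make the comparison concrete, here is a minimal C++ sketch of the two paths for a repeat-style edge mode. The function names and the power-of-two width are assumptions made for illustration; they are not taken from OpenCL or any other API.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Integer path: for a power-of-two width, repeat addressing is a single AND.
// (Hypothetical helper; assumes widthPow2 is a power of two.)
inline std::uint32_t wrapRepeatInt(std::int32_t x, std::uint32_t widthPow2) {
    assert(widthPow2 != 0 && (widthPow2 & (widthPow2 - 1u)) == 0);
    return static_cast<std::uint32_t>(x) & (widthPow2 - 1u);
}

// Floating-point path: floor, wrap into [0, width), then convert to an
// integer before the value can be used as an array index.
inline std::uint32_t wrapRepeatFloat(float x, std::uint32_t width) {
    assert(width != 0);
    const float w = static_cast<float>(width);
    const float f = std::floor(x);
    const float wrapped = f - std::floor(f / w) * w;
    return static_cast<std::uint32_t>(wrapped);
}
```

Both helpers map coordinate 9 to index 1 and coordinate -1 (or -0.5 for the float version) to index 7 on a width of 8, but the float version needs two floor operations, a divide, a multiply, and a conversion to get there.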
On GPUs, which emphasize floating-point computation a bit more and may not invest much in integer computation (even though it is easier), the story is quite different. There you can expect texture unit hardware to use the floating point value directly to find the required data. The floating point arithmetic to find the right coordinate is built in to the hardware and therefore "free" (if we ignore energy consumption considerations) and graphics APIs like GL or CL are built around it.
Generally speaking, though ubiquitous in graphics APIs, floating point is a numerically inferior choice of coordinate system for sampled data. It lumps too much precision into one corner of the image and may cause quantization errors and inconsistencies at the far corners of the image, leading to reduced precision for linear sampling and unexpected rounding effects. For large enough images, some pixels may become unaddressable by the coordinate system, because no floating-point number exists that references that position. The default rounding mode, round to nearest with ties to even, is probably undesirable for coordinate systems, because linear filtering will often place the coordinate halfway between two integer values, resulting in a round up for even pixels and a round down for odd ones. In the worst case, where all coordinates are halfway cases and the stride is 1, this causes pixel duplication rather than the expected result. Floating point is nice in that it is somewhat easier to use.
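Both effects can be reproduced in a few lines of C++. This is a small sketch that assumes IEEE 754 single precision and the default rounding mode; the helper names are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Rounds the halfway coordinates 0.5, 1.5, 2.5, ... using the current
// rounding mode (round to nearest, ties to even, by default) and returns
// the pixel index chosen for each one.
std::vector<int> roundHalfwayCoordinates(int count) {
    assert(count >= 0);
    std::vector<int> pixels;
    for (int i = 0; i < count; ++i) {
        const float coordinate = static_cast<float>(i) + 0.5f;
        pixels.push_back(static_cast<int>(std::nearbyint(coordinate)));
    }
    return pixels;
}

// True if a 32-bit float can represent this pixel index exactly.
bool floatCanAddress(std::int32_t pixel) {
    return static_cast<std::int64_t>(static_cast<float>(pixel)) == pixel;
}
```

For the first four halfway coordinates the chosen pixels are 0, 2, 2, 4: pixel 2 is sampled twice while pixels 1 and 3 are skipped. Likewise, index 16777217 (2^24 + 1) has no exact single-precision representation, so floatCanAddress returns false for it.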
A fixed-point coordinate system allows for consistent coordinate precision and rounding across the entire surface and will avoid these problems. Modulo overflow feeds nicely into some common edging modes. Precision is predictable.
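As a rough illustration, a fixed-point coordinate can be as simple as an unsigned integer with an implied binary point. The 16.16 layout and the names below are assumptions chosen for the sketch, not something prescribed by any graphics API.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned 16.16 fixed-point coordinate: the high 16 bits are the pixel
// index and the low 16 bits are the sub-pixel fraction (an assumed layout).
using Fixed16 = std::uint32_t;
constexpr Fixed16 kFixedOne = 1u << 16;

constexpr Fixed16 fixedFromPixel(std::uint32_t pixel) { return pixel << 16; }
constexpr std::uint32_t fixedFraction(Fixed16 c) { return c & 0xFFFFu; }

// Repeat-style wrap for a power-of-two width: shift, then AND.
inline std::uint32_t fixedPixel(Fixed16 c, std::uint32_t widthPow2) {
    assert(widthPow2 != 0 && (widthPow2 & (widthPow2 - 1u)) == 0);
    return (c >> 16) & (widthPow2 - 1u);
}
```

Every coordinate has the same 1/65536 sub-pixel resolution regardless of where it lies on the surface, and stepping past the last pixel wraps around through ordinary unsigned overflow or the AND mask.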

A quick search suggests that 32-bit int and float operations are equally fast on ARM processors (each taking one CPU cycle). Please check for yourself and run a simple test, as Zev Eisenberg correctly suggests.
Beyond that, it is not a good idea to start writing your own CGPoint replacement using ints, for the following reasons (to name a few):
Incorrect results: rounding or truncating coordinates to integers will cause all kinds of weird and unpleasant side effects.
Incompatibility with the multitude of iOS libraries.
A big waste of time.
Not faster.
A messy code base (Knuth is right, as Zaph points out).
As always when trying to optimise: take a step back and investigate whether your current method or algorithm is the best choice (possibly for different scenarios in your application). That is how you commonly get massive improvements of hundreds of percent.

Related

How to use apple's Accelerate framework in swift in order to compute the FFT of a real signal?

How am I supposed to use the Accelerate framework to compute the FFT of a real signal in Swift on iOS?
Available example on the web
Apple’s Accelerate framework seems to provide functions to compute the FFT of a signal efficiently.
Unfortunately, most of the examples available on the Internet, like Swift-FFT-Example and TempiFFT, crash when tested extensively, and they call the Objective-C API.
The Apple documentation answers many questions, but also raises some others (Is this piece mandatory? Why do I need this call to convert?).
Threads on Stack Overflow
There are a few threads addressing various aspects of the FFT with concrete examples, notably FFT Using Accelerate In Swift, DFT result in Swift is different than that of MATLAB, and FFT Calculating incorrectly - Swift.
None of them directly addresses the question “What is the proper way to do it, starting from zero?”
It took me one day to figure out how to do it properly, so I hope this thread can give a clear explanation of how you are supposed to use Apple's FFT, show which pitfalls to avoid, and help developers save precious hours of their time.
TL;DR: If you need a working implementation to copy and paste, here is a gist.
What is FFT?
The Fast Fourier Transform is an algorithm that takes a signal in the time domain (a collection of measurements taken at a regular, usually small, time interval) and turns it into a signal expressed in the frequency domain (a collection of frequencies).
The ability to read the signal along time is lost by the transformation (although the transformation is invertible, which means that no information is lost by computing the FFT and you can apply an IFFT to get the original signal back), but we gain the ability to distinguish the frequencies that the signal contains. This is typically used to display the spectrograms of the music you are listening to on various hardware and in YouTube videos.
The FFT works with complex numbers. If you don't know what they are, let's just pretend each one is a combination of a radius and an angle. There is one complex number per point on a 2D plane. Real numbers (your usual floats) can be seen as positions on a line (negative on the left, positive on the right).
NB: FFT(FFT(FFT(FFT(X)))) = X (up to a constant depending on your FFT implementation).
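To see what the transform actually produces, here is a naive O(n^2) discrete Fourier transform in C++. It is only an illustrative sketch: it is not the Accelerate API, it is far slower than a real FFT, and the function names are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive discrete Fourier transform of a real signal: one complex number
// (a "bin") per frequency.
std::vector<std::complex<double>> naiveDft(const std::vector<double>& signal) {
    assert(!signal.empty());
    const std::size_t n = signal.size();
    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<std::complex<double>> bins(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                -twoPi * static_cast<double>(k * t) / static_cast<double>(n);
            sum += signal[t] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        bins[k] = sum;
    }
    return bins;
}

// Amplitude of each bin, i.e. the radius of each complex number.
std::vector<double> binMagnitudes(const std::vector<std::complex<double>>& bins) {
    std::vector<double> magnitudes;
    for (const auto& bin : bins) magnitudes.push_back(std::abs(bin));
    return magnitudes;
}
```

For a 64-sample cosine that completes exactly 5 cycles, the magnitudes are close to zero everywhere except bin 5 and its mirror image, bin 59, which each hold n/2 = 32.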
How to compute the FFT of a real signal.
Usually you want to compute the FFT of a small window of an audio signal. For the sake of the example, we will take a small window of 1024 samples. You should also prefer a power of two; otherwise things get a little more difficult.
var signal: [Float] // Array of length 1024
First, you need to initialize some constants for the computation.
// The length of the input
let length = vDSP_Length(signal.count)
// The base-2 logarithm of two times the length of the input.
// Do not forget this factor 2.
let log2n = vDSP_Length(ceil(log2(Float(length * 2))))
// Create the instance of the FFT class, which allows computing the FFT of complex vectors
// with length up to `length`.
let fftSetup = vDSP.FFT(log2n: log2n, radix: .radix2, ofType: DSPSplitComplex.self)!
Following Apple's documentation, we first need to create a complex array that will be our input.
Don't be misled by the tutorial. What you usually want is to copy your signal into the real part of the input and keep the imaginary part zero.
// Input / Output arrays
var forwardInputReal = [Float](signal) // Copy the signal here
var forwardInputImag = [Float](repeating: 0, count: Int(length))
var forwardOutputReal = [Float](repeating: 0, count: Int(length))
var forwardOutputImag = [Float](repeating: 0, count: Int(length))
Be careful: the FFT function does not allow using the same split complex as input and output at the same time. If you experience crashes, this may be the cause. This is why we define both an input and an output.
Now, we have to be careful and "lock" the pointers to these four arrays, as shown in the documentation example. If you simply use &forwardInputReal as the argument of your DSPSplitComplex, the pointer may become invalid on the following line and you will likely experience sporadic crashes in your app.
forwardInputReal.withUnsafeMutableBufferPointer { forwardInputRealPtr in
forwardInputImag.withUnsafeMutableBufferPointer { forwardInputImagPtr in
forwardOutputReal.withUnsafeMutableBufferPointer { forwardOutputRealPtr in
forwardOutputImag.withUnsafeMutableBufferPointer { forwardOutputImagPtr in
// Input
let forwardInput = DSPSplitComplex(realp: forwardInputRealPtr.baseAddress!, imagp: forwardInputImagPtr.baseAddress!)
// Output
var forwardOutput = DSPSplitComplex(realp: forwardOutputRealPtr.baseAddress!, imagp: forwardOutputImagPtr.baseAddress!)
// FFT call goes here
}
}
}
}
Now, the final line: the call to your FFT:
fftSetup.forward(input: forwardInput, output: &forwardOutput)
The result of your FFT is now available in forwardOutputReal and forwardOutputImag.
If you want only the amplitude of each frequency, and you don't care about the real and imaginary parts, you can declare an additional array alongside the input and output:
var magnitudes = [Float](repeating: 0, count: Int(length))
and, right after your FFT call, compute the amplitude of each "bin" with:
vDSP.absolute(forwardOutput, result: &magnitudes)

Input of a fixed point DSP

I'm new to working with DSPs and fixed point, and I really need to know:
1. Does the fixed-point DSP itself convert floating-point numbers to Q format, or does another device do that before feeding the DSP?
2. Who specifies the Q format to be used? Does each DSP come with a specified Q format, or does the programmer choose it in code?
3. Can I get an idea of how to perform a simple, say 4 by 4, fixed-point matrix multiplication in C++?
Thanks in anticipation
The format is usually fixed for a given DSP, e.g. Motorola DSP 56k family uses a 24 bit signed fractional format (Q23).
Fixed point is really just the same as an ordinary integer but there's an implicit scale factor. For most operations this makes no difference, e.g. load/store/add/subtract all work the same way regardless of whether the data is integer or fixed point.
When it comes to multiplication or division however the implicit scaling factor needs to be taken into account - typically there will be a shift after the operation to correct for this. DSP instructions take care of this automatically, whereas normal CPUs have to do this explicitly.
When you're doing e.g. a 4x4 matrix multiply you just use the DSP's native fixed point arithmetic instructions and the scaling is all taken care of automatically.
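On an ordinary CPU, where the shift has to be written out by hand, a 4x4 fixed-point multiply might look like the sketch below. It assumes the Q15 format (16-bit signed, 15 fraction bits) purely for illustration; the type and function names are hypothetical, and a real DSP toolchain would normally provide its own intrinsics instead.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

using Q15 = std::int16_t;  // 1 sign bit, 15 fraction bits (assumed format)
using Mat4 = std::array<std::array<Q15, 4>, 4>;

// Saturate a wide intermediate value to the Q15 range.
inline Q15 saturateQ15(std::int64_t v) {
    return static_cast<Q15>(
        std::min<std::int64_t>(32767, std::max<std::int64_t>(-32768, v)));
}

// Convert a real value in roughly [-1, 1) to Q15 by applying the implicit
// scale factor of 2^15.
inline Q15 toQ15(double value) {
    assert(std::isfinite(value));
    return saturateQ15(std::llround(value * 32768.0));
}

// c = a * b. Each product of two Q15 values is a Q30 value, so the
// accumulated sum is rounded and shifted right by 15 to get back to Q15.
Mat4 multiplyQ15(const Mat4& a, const Mat4& b) {
    Mat4 c{};
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            std::int64_t acc = 0;
            for (int k = 0; k < 4; ++k) {
                acc += static_cast<std::int32_t>(a[i][k]) *
                       static_cast<std::int32_t>(b[k][j]);
            }
            // Assumes >> on a negative value is an arithmetic shift, which
            // holds on mainstream compilers (and is guaranteed since C++20).
            c[i][j] = saturateQ15((acc + (1 << 14)) >> 15);
        }
    }
    return c;
}
```

Multiplying 0.5 times the identity by a matrix filled with 0.25 yields a matrix filled with 0.125, and results that fall outside the Q15 range saturate instead of wrapping around.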

OpenGL ES best practices for conditionals

Apple says in their Best Practices For Shaders to avoid branching if possible, and especially branching on values calculated within the shader. So I replaced some if statements with the built-in clamp() function. My question is, are clamp(), min(), and max() likely to be more efficient, or are they merely convenience (i.e. macro) functions that simply expand to if blocks?
I realize the answer may be implementation dependent. In any case, the functions are obviously cleaner and make plain the intent, which the compiler could do something with.
Historically speaking GPUs have supported per-fragment instructions such as MIN and MAX for much longer than they have supported arbitrary conditional branching. One example of this in desktop OpenGL is the GL_ARB_fragment_program extension (now superseded by GLSL) which explicitly states that it doesn't support branching, but it does provide instructions for MIN and MAX as well as some other conditional instructions.
I'd be pretty confident that all GPUs will still have dedicated hardware for these operations given how common min(), max() and clamp() are in shaders. This isn't guaranteed by the specification because an implementation can optimize code however it sees fit, but in the real world you should use GLSL's built-in functions rather than rolling your own.
The only exception would be if your conditional was being used to avoid a large amount of additional fragment processing. At some point the cost of a branch will be less than the cost of running all the code in the branch, but the balance here will be very hardware dependent and you'd have to benchmark to see if it actually helps in your application on its target hardware. Here's the kind of thing I mean:
void main() {
vec3 N = ...;
vec3 L = ...;
float NDotL = dot(N, L);
if (NDotL > 0.0)
{
// Lots of very intensive code for an awesome shadowing algorithm that we
// want to avoid wasting time on if the fragment is facing away from the light
}
}
Just clamping NDotL to 0-1 and then always processing the shadow code on every fragment only to multiply through your final shadow term by NDotL is a lot of wasted effort if NDotL was originally <= 0, and we can theoretically avoid this overhead with a branch. The reason this kind of thing is not always a performance win is that it is very dependent on how the hardware implements shader branching.
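The two alternatives can be compared side by side in a CPU-side C++ analogy. This is only a sketch of the logic, not shader code: shadowTerm is a hypothetical stand-in for the expensive shadowing algorithm, and nothing here says which variant is faster on a given GPU.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for the expensive shadowing code in the shader.
inline float shadowTerm(float nDotL) { return 0.5f + 0.5f * nDotL; }

// Variant 1: branch, skipping the expensive work for fragments that face
// away from the light.
inline float shadeWithBranch(float nDotL) {
    assert(nDotL >= -1.0f && nDotL <= 1.0f);
    if (nDotL > 0.0f) {
        return nDotL * shadowTerm(nDotL);
    }
    return 0.0f;
}

// Variant 2: no branch; always do the work, then multiply by the clamped
// value so that back-facing fragments end up at zero anyway.
inline float shadeBranchless(float nDotL) {
    assert(nDotL >= -1.0f && nDotL <= 1.0f);
    const float clamped = std::min(std::max(nDotL, 0.0f), 1.0f);
    return clamped * shadowTerm(nDotL);
}
```

Both variants return the same value for every input in [-1, 1]; the only difference is whether shadowTerm runs for fragments facing away from the light, which is exactly the cost the branch tries to avoid.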

Advantages of cv::Matx

I noticed that a new data structure cv::Matx was added to the new OpenCV version, intended for small matrices of known size at compilation time, for example
cv::Matx31f // matrix 3x1 of float type
Checking the documentation I saw that most of matrix operations are available, but still I don't see the advantages of using this new type instead of the old cv::Mat.
When should I use Matx instead of Mat?
Short answer: cv::Mat uses the heap to store its data, while cv::Matx uses the stack.
A cv::Mat uses dynamic memory allocation (on the heap). This is appropriate for big matrices (like images) and lets you do things like shallow copies of a matrix, which is the default behavior of cv::Mat.
However, for the small matrices that cv::Matx is designed for, heap allocation would be very expensive compared to doing the same thing on the stack. I have seen a block of math reduce processing time by over 75% by switching to using stack-allocated types (e.g. cv::Point and cv::Matx) instead of cv::Mat.
It's about memory management: not wasting memory (which matters in some cases), or just reserving memory for an object you'll use later.
That's how I understand it. Maybe someone else can give a better explanation.
This is a late late answer, but it is still an interesting question!
dom's answer is quite accurate, and the heap/stack reference in user1460044's is also interesting.
From a practical point of view, I wouldn't use Matx (or Vec) unless it were completely necessary. The major advantages of Matx are:
Using the stack (efficient![1])
Initialization.
The problem is that, in the end, you will have to move your Matx data to a Mat to do most things, and so you will be back on the heap again.
On the other hand, the "cool initialization" of a Matx can be done in a normal Mat:
// Matx initialization:
Matx31f A(1.f,2.f,3.f);
// Mat initialization:
Mat B = (Mat_<float>(3,1) << 1.f, 2.f, 3.f);
Also, there is a difference in initialization (beyond the heap/stack issue). If you try to put 5 values into the Matx31, it will crash (runtime exception), while calling the Mat_::operator<< with 5 values will only store the first three.
[1] Efficient if your program has to create lots of matrices of less than ~10 elements. In that case use Matx matrices.
There are 2 other reasons why I prefer Matx to Mat:
Readability: people reading the code can immediately see the size of the matrices, for example:
cv::Matx34d transform = ...;
It's clear that this is a 3x4 matrix, so it contains a 3D transformation of type (R,t), where R is a rotation matrix (as opposed to say, axis-angle).
Similarly, accessing an element is more natural with transform(i,j) vs transform.at<double>(i,j).
Easy debugging. Since the elements for Matx are allocated on the stack in an array of known length, IDEs or debuggers can display the entire contents nicely when stepping through the code.

What is "vectorization"?

Several times now I've encountered this term in MATLAB, Fortran, and elsewhere, but I've never found an explanation of what it means and what it does. So I'm asking here: what is vectorization, and what does it mean, for example, that "a loop is vectorized"?
Many CPUs have "vector" or "SIMD" instruction sets which apply the same operation simultaneously to two, four, or more pieces of data. Modern x86 chips have the SSE instructions, many PPC chips have the "Altivec" instructions, and even some ARM chips have a vector instruction set, called NEON.
"Vectorization" (simplified) is the process of rewriting a loop so that instead of processing a single element of an array N times, it processes (say) 4 elements of the array simultaneously N/4 times.
I chose 4 because it's what modern hardware is most likely to directly support for 32-bit floats or ints.
The difference between vectorization and loop unrolling:
Consider the following very simple loop that adds the elements of two arrays and stores the results to a third array.
for (int i=0; i<16; ++i)
C[i] = A[i] + B[i];
Unrolling this loop would transform it into something like this:
for (int i=0; i<16; i+=4) {
C[i] = A[i] + B[i];
C[i+1] = A[i+1] + B[i+1];
C[i+2] = A[i+2] + B[i+2];
C[i+3] = A[i+3] + B[i+3];
}
Vectorizing it, on the other hand, produces something like this:
for (int i=0; i<16; i+=4)
addFourThingsAtOnceAndStoreResult(&C[i], &A[i], &B[i]);
Where "addFourThingsAtOnceAndStoreResult" is a placeholder for whatever intrinsic(s) your compiler uses to specify vector instructions.
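As one possible concrete version of that placeholder, the sketch below uses the x86 SSE intrinsic _mm_add_ps to add four floats per iteration, with a scalar loop for leftover elements and for targets without SSE. The function name and the fallback structure are choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define EXAMPLE_HAVE_SSE 1
#endif

// Adds two float arrays element by element, four elements at a time when
// SSE is available at compile time.
void addArrays(float* c, const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr && c != nullptr);
    std::size_t i = 0;
#ifdef EXAMPLE_HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);      // load 4 floats from A
        const __m128 vb = _mm_loadu_ps(b + i);      // load 4 floats from B
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));   // add and store 4 results
    }
#endif
    // Scalar tail: handles the remaining elements, or every element when
    // SSE is not available.
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

The unaligned load and store variants are used so that the arrays need no special alignment; the scalar tail keeps the function correct when the length is not a multiple of four.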
Terminology:
Note that most modern ahead-of-time compilers are able to auto vectorize very simple loops like this, which can often be enabled via a compile option (on by default with full optimization in modern C and C++ compilers, like gcc -O3 -march=native). OpenMP #pragma omp simd is sometimes helpful to hint the compiler, especially for "reduction" loops like summing an FP array where vectorization requires pretending that FP math is associative.
More complex algorithms still require help from the programmer to generate good vector code; we call this manual vectorization, often with intrinsics like x86 _mm_add_ps that map to a single machine instruction as in SIMD prefix sum on Intel cpu or How to count character occurrences using SIMD. Or even use SIMD for short non-looping problems like Most insanely fastest way to convert 9 char digits into an int or unsigned int or How to convert a binary integer number to a hex string?
The term "vectorization" is also used to describe a higher level software transformation where you might just abstract away the loop altogether and just describe operating on arrays instead of the elements that comprise them. e.g. writing C = A + B in some language that allows that when those are arrays or matrices, unlike C or C++. In lower-level languages like that, you could describe calling BLAS or Eigen library functions instead of manually writing loops as a vectorized programming style. Some other answers on this question focus on that meaning of vectorization, and higher-level languages.
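Built-in C++ arrays do not support whole-array expressions, but the standard library type std::valarray does, which gives a small taste of this higher-level style. The sketch below is only an illustration; whether the compiler turns the expression into SIMD instructions is implementation dependent.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Element-wise sum written as a whole-array expression, with no explicit
// loop over the elements.
std::valarray<float> addWhole(const std::valarray<float>& a,
                              const std::valarray<float>& b) {
    assert(a.size() == b.size());
    return std::valarray<float>(a + b);
}
```

Calling addWhole on {1, 2, 3, 4} and {10, 20, 30, 40} produces {11, 22, 33, 44}; the loop still exists, but it lives inside the library rather than in the caller's code.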
Vectorization is the term for converting a scalar program to a vector program. Vectorized programs can run multiple operations from a single instruction, whereas scalar programs can only operate on one pair of operands at a time.
From wikipedia:
Scalar approach:
for (i = 0; i < 1024; i++)
{
C[i] = A[i]*B[i];
}
Vectorized approach:
for (i = 0; i < 1024; i+=4)
{
C[i:i+3] = A[i:i+3]*B[i:i+3];
}
Vectorization is widely used in scientific computing, where huge chunks of data need to be processed efficiently.
In real-world programming, I know it is used in NumPy (I am not sure about others).
NumPy (a package for scientific computing in Python) uses vectorization for fast manipulation of n-dimensional arrays, which is generally slower when done with Python's built-in options for handling arrays.
Although plenty of explanations are out there, here is how vectorization is defined on the NumPy documentation page:
Vectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among which are:
vectorized code is more concise and easier to read
fewer lines of code generally means fewer bugs
the code more closely resembles standard mathematical notation
(making it easier, typically, to correctly code mathematical
constructs)
vectorization results in more “Pythonic” code. Without
vectorization, our code would be littered with inefficient and
difficult to read for loops.
Vectorization, in simple words, means optimizing the algorithm so that it can utilize SIMD instructions in the processors.
AVX, AVX2 and AVX-512 are Intel instruction sets that perform the same operation on multiple data elements in one instruction. For example, AVX-512 means you can operate on 16 integer values (4 bytes each) at a time. That means that if you have a vector of 16 integers and you want to double each value and then add 10 to it, you can either load the values into general-purpose registers [a, b, c] 16 times and perform the same operation each time, or you can load all 16 values into a SIMD register [xmm, ymm, zmm] and perform the operation once. This speeds up computation on vector data.
In vectorization we use this to our advantage, by remodelling our data so that we can perform SIMD operations on it and speed up the program.
The only problem with vectorization is handling conditions, because conditions branch the flow of execution. This can be handled by masking, that is, by modelling the condition as an arithmetic operation. For example, if we want to add 10 to a value only if it is greater than 100, we can either write:
if(x[i] > 100) x[i] += 10; // this will branch execution flow.
or we can model the condition as an arithmetic operation, creating a condition vector c:
c[i] = -(x[i] > 100); // mask: all bits set (-1) where the condition holds, 0 otherwise
x[i] = x[i] + (c[i] & 10); // adds 10 only where the mask is all ones
This is a very trivial example, though. Here c is our masking vector, which we use to perform a bitwise operation based on its value. This avoids branching the execution flow and enables vectorization.
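One detail matters for the masking trick: the mask has to be all ones (not just 1) where the condition holds, otherwise ANDing it with 10 gives 0. A minimal C++ sketch of the idea, with a made-up function name, could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Adds 10 to every element greater than 100 without an if statement in the
// loop body, which leaves the loop in a form compilers can vectorize easily.
void addTenWhereAbove100(std::int32_t* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        // The comparison yields 1 or 0; negating it gives -1 (all bits set)
        // or 0, which is the mask.
        const std::int32_t mask = -static_cast<std::int32_t>(x[i] > 100);
        x[i] += mask & 10;  // 10 where the mask is set, 0 elsewhere
    }
}
```

Applied to {50, 100, 101, 250}, the function leaves the first two values unchanged and turns the last two into 111 and 260.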
Vectorization is as important as parallelization, so we should make use of it as much as possible. All modern processors have SIMD instructions for heavy compute workloads. We can optimize our code to use these SIMD instructions through vectorization, much as we parallelize our code to run on the multiple cores available on modern processors.
I would like to close by mentioning OpenMP, which lets you vectorize code using pragmas. I consider it a good starting point. The same can be said for OpenACC.
It refers to the ability to do a single mathematical operation on a list, or "vector", of numbers in a single step. You see it often with Fortran because that's associated with scientific computing, which is associated with supercomputing, where vectorized arithmetic first appeared. Nowadays almost all desktop CPUs offer some form of vectorized arithmetic, through technologies like Intel's SSE. GPUs also offer a form of vectorized arithmetic.
This explanation from Intel is, I think, easy to grasp:
Vectorization is the process of converting an algorithm from operating
on a single value at a time to operating on a set of values at one
time. Modern CPUs provide direct support for vector operations where a
single instruction is applied to multiple data (SIMD).
For example, a CPU with a 512 bit register could hold 16 32-bit
single precision doubles and do a single calculation.
16 times faster than executing a single instruction at a time. Combine
this with threading and multi-core CPUs leads to orders of magnitude
performance gains.
Link https://software.intel.com/en-us/articles/vectorization-a-key-tool-to-improve-performance-on-modern-cpus
In Java there is an option for this, to be included in JDK 15 in 2020 or later, in JDK 16 in 2021. See this official issue.
hope you are well!
Vectorization refers to all the techniques that convert a scalar implementation, in which a single operation processes a single entity at a time, into a vector implementation, in which a single operation processes multiple entities at the same time.
Vectorization is a technique for optimizing code to work efficiently with huge chunks of data. It is seen in scientific libraries such as NumPy and pandas, and you can also use it when working with MATLAB, image processing, NLP, and much more. Overall, it improves the runtime and memory use of the program.
Hope you may get your answer!
Thank you. 🙂
I would define vectorisation as a feature of a given language where the responsibility for how to iterate over the elements of a collection can be delegated from the programmer (e.g. an explicit loop over the elements) to some method provided by the language (e.g. an implicit loop).
Now, why would we ever want to do that?
Code readability. In some (but not all!) cases, operating over the entire collection at once rather than on its elements is easier to read and quicker to code;
Some interpreted languages (R, Python, Matlab, but not Julia, for example) are really slow at processing explicit loops. In these cases vectorisation uses compiled instructions under the hood for this element-by-element processing and can be several orders of magnitude faster than processing each programmer-specified loop operation;
Most modern CPUs (and, nowadays, GPUs) have built-in parallelization that is exploitable when we use the vectorisation method provided by the language rather than our self-implemented order of operations on the elements;
In a similar way our programming language of choice will likely use for some vectorisation operations (e.g. matrix operations) software libraries (e.g. BLAS/LAPACK) that exploit multi-threading capabilities of the CPU, another form of parallel computation.
Note that for points 3 and 4 some languages (Julia notably) allow these hardware parallelizations to be exploited also using programmer-defined order processing (e.g. for loops), but this happens automatically and under the hood when using the vectorisation method provided by the language.
Now, while vectorisation has many advantages, sometimes an algorithm is more intuitively expressed using an explicit loop than vectorisation (where perhaps we need to resort to complex linear algebra operations, identity and diagonal matrices... all to retain our "vectorised" approach), and if using an explicit ordering form has no computational disadvantages, this one should be preferred.
See the two answers above. I just wanted to add that the reason for wanting to do vectorization is that these operations can easily be performed in parallel by supercomputers and multiprocessors, yielding a big performance gain. On single-processor computers there will be no performance gain.

Resources