clang optimisation flags for matrix/vector calculations

I looked at the disassembled code that clang generates for glm (a matrix/vector library for 3D calculations) operations.
I noticed clang doing some 'vectorization' for double-precision operations, e.g. combining two multiplications into one SIMD instruction.
However, for single-precision calculations the code looks very bad to me. The instructions used are from the SSE instruction sets and the registers are XMM ones, but every multiplication is done on a single float at a time, and even groups of assignments (e.g. a matrix assignment) are carried out by a long run of movss instructions. Those bad assignments even show up in the double-precision code.
Why is that, and are there any command-line arguments that would encourage clang to do better? I know compilers don't do magic, but a linear list of 16 memory-adjacent assignments should be optimizable in many ways, I would guess.

Seeing XMM registers in your assembly is not proof of vectorization: nowadays every scalar floating-point operation, double or single precision, is performed in a SIMD register.
Vectorization is not trivial for a compiler. clang provides options such as
clang -fslp-vectorize-aggressive file.c
It may help; otherwise you can look for an alternative: there are a lot of libraries for matrix multiplication, such as MKL, Boost uBLAS, PLASMA, etc. As far as I remember, GLM is old, and good alternatives exist.
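As an illustration (not from the original post; the file name and flags are examples), here is the kind of straight-line, memory-adjacent code the question describes, along with a build line that lets clang's SLP vectorizer merge the scalar movss operations. In recent clang the SLP vectorizer is already on by default at -O2/-O3 and can also be requested explicitly with -fslp-vectorize.

    /* mat.c -- illustrative sketch only.
       Build with:  clang -O3 -march=native -S mat.c
       With SLP vectorization enabled, the four scalar stores below are typically
       merged into a single 16-byte load and store instead of four movss pairs. */
    void copy_row(float *dst, const float *src)
    {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
    }

For actual floating-point arithmetic (as opposed to plain copies), -ffast-math may additionally be needed so the compiler is allowed to reassociate operations.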

Related

Performance cost of float ↔︎ half conversion in Metal

I have a Metal-based Core Image convolution kernel that was using half precision variables for keeping track of sums and weights. However, I now figured that the range of 16-bit half is not enough in some cases, which means I need 32-bit float for some variables.
Now I'm wondering what's more performant:
use half as much as possible (for the samplers and most local vars) and only convert to float when needed (which means quite a lot, inside the loop)
or change all samplers and local vars to float type so that no conversion is necessary.
The latter would mean that all arithmetic is performed in 32-bit precision, though it would only be needed for some operations.
Is there any documentation or benchmark I can run to find the cost of float ↔︎ half conversion in Metal?
I believe you should go with option A:
use half as much as possible (for the samplers and most local vars) and only convert to float when needed (which means quite a lot, inside the loop)
based on the discussion in the WWDC 2016 talk "Advanced Metal Shader Optimization".
The section from around 17:17 to 18:58 is the relevant part for this topic. The speaker, Fiona, mentions a couple of important things:
A8 and later GPUs have 16-bit registers, so 32-bit floating-point formats (like float) use twice as many registers, which means twice as much bandwidth, energy, etc. Using half therefore saves registers (which is always good) and energy.
On A8 and later GPUs, "data type conversions are typically free, even between float and half [emphasis added]." Fiona even anticipates the question you might be asking yourself about all of those conversions and says that it is still probably fast, because the conversions are free. Furthermore, according to the Metal Shading Language Specification Version 2.3 (p. 218):
For textures that have half-precision floating-point pixel color values, the conversions from half to float are lossless
so you don't have to worry about losing precision either.
There are some other relevant points in that section that are worth looking into as well, but I believe this is enough to justify going with option A.
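To make the idea concrete, here is a minimal sketch of option A, assuming a plain Metal compute kernel rather than the asker's actual Core Image kernel (all names are invented, and edge handling is omitted): texture reads stay in half, and only the accumulators that need the wider range use float.

    #include <metal_stdlib>
    using namespace metal;

    // Hypothetical 1-D convolution sketch, not the asker's kernel. Sampled values
    // stay in half; only the running sums use float, because half's range can
    // overflow when many weighted samples are added together.
    kernel void convolve1d(texture2d<half, access::read>  src [[texture(0)]],
                           texture2d<half, access::write> dst [[texture(1)]],
                           constant float *weights [[buffer(0)]],
                           constant int   &radius  [[buffer(1)]],
                           uint2 gid [[thread_position_in_grid]])
    {
        float4 sum  = 0.0f;   // float accumulator: needs the wider range
        float  wsum = 0.0f;
        for (int dx = -radius; dx <= radius; ++dx) {
            half4 s = src.read(uint2(uint(int(gid.x) + dx), gid.y)); // stays half
            float w = weights[dx + radius];
            sum  += float4(s) * w;   // half -> float conversion, typically free on A8+
            wsum += w;
        }
        dst.write(half4(sum / wsum), gid);   // convert back to half once, at the end
    }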

Precision of Q31 and SP for FFT ARM Cortex-M7

I would like to understand whether using fixed point Q31 is better than floating-point (single precision) for DSP applications where accuracy is important.
More details: I am currently working with an ARM Cortex-M7 microcontroller and I need to perform an FFT with high accuracy using the CMSIS library. I understand that single precision has 24 bits of mantissa while Q31 has 31 bits, so the precision of Q31 should be better, but I read everywhere that for algorithms involving multiplication and so on, the floating-point representation should be used, and I do not understand why.
Thanks in advance.
Getting the maximum value out of fixed point (those extra 6 or 7 bits of mantissa accuracy), as well as avoiding a ton of possible underflow and overflow problems, requires knowing precisely the bounds (min and max) of every arithmetic operation in your CMSIS algorithms for every valid set of input data.
In practice, a complete error analysis turns out to be difficult, and the added operations needed to rescale all intermediate values to optimal ranges cost so much performance, that only a narrow set of cases seems worth the effort over using IEEE single or double, which the M7 supports in hardware, and where the floating-point exponent range hides an enormous amount (but not all!) of the intermediate-result scaling issues.
But for some simpler DSP algorithms, analyzing and fixing the scaling sometimes isn't a problem. It's hard to tell which without working out the numeric range of every arithmetic operation in your particular algorithm. Sometimes the work required to use integer arithmetic has to be done anyway, because the available processor doesn't support floating-point arithmetic well, or at all.
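For illustration (these helpers are hypothetical, not CMSIS functions), here is what the bookkeeping difference looks like in C: a Q31 multiply needs a 64-bit intermediate and an explicit shift, and the programmer has to prove the result stays inside [-1, 1), whereas the float version lets the exponent absorb the scaling.

    #include <stdint.h>

    /* Hypothetical Q1.31 multiply (not from CMSIS). Both inputs represent values
       in [-1, 1). The 64-bit product has 62 fractional bits, so shifting right by
       31 returns to Q1.31; if |a*b| can reach 1.0 the result overflows, which is
       why every operation in the signal chain needs its own range analysis. */
    static inline int32_t q31_mul(int32_t a, int32_t b)
    {
        return (int32_t)(((int64_t)a * (int64_t)b) >> 31);
    }

    /* The float version needs no such analysis: the exponent rescales
       automatically, at the cost of only 24 bits of significand. */
    static inline float f32_mul(float a, float b)
    {
        return a * b;
    }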

Eigen not vectorizing matrix multiplication in iOS?

I'm using the Eigen library to do some computation on an iPad 2 (i.e. a Cortex-A9). It seems that some operations are vectorized using NEON instructions, while others aren't.
Operations that I've tried that get vectorized: dot products, vector and matrix additions and subtractions.
Operations that don't get vectorized: matrix multiplication.
I'm using these operations inside the same project and same file, so the compiler options are the same. I'm using -O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp.
All matrices that I'm using have Dynamic sizes. Is there anything I'm doing wrong, or is this the expected behaviour?
Thanks.
When you use -mfpu=neon, gcc/clang will vectorize integer operations but not floating-point ones, because NEON is not 100% IEEE-compliant (it doesn't support denormal numbers). You have to specify -ffast-math to make gcc/clang vectorize floating-point code with NEON. However, you must be careful, as -ffast-math can affect the numerical results.
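A quick way to check this (the file name and build lines are just illustrations): compile a minimal multiply with and without -ffast-math and look for NEON instructions such as vmul.f32 or vmla.f32 in the generated assembly.

    // mul.cpp -- illustrative only.
    //   clang++ -O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp -S mul.cpp
    //   clang++ -O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp -ffast-math -S mul.cpp
    #include <Eigen/Dense>

    Eigen::MatrixXf multiply(const Eigen::MatrixXf &a, const Eigen::MatrixXf &b)
    {
        return a * b;   // compare the assembly produced by the two builds
    }

Since -ffast-math also enables flush-to-zero and reassociation, verify that the numerical differences it introduces are acceptable for your application.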

Why are the elements of the matrix and vector types in the F# Powerpack mutable?

F# is often promoted as a functional language where data is immutable by default, however the elements of the matrix and vector types in the F# Powerpack are mutable. Why is this?
Furthermore, why are sparse matrices implemented as immutable, as opposed to normal matrices?
The standard array type ('T[]) in F# is also mutable. You're mostly correct -- F# is a functional language where data immutability is encouraged but not required. Basically, F# allows you to write both mutable/imperative code and immutable/functional code; it's up to you to decide the best way to implement the code for your specific application.
Another reason for having mutable arrays and matrices is performance -- it is possible to implement very fast algorithms with immutable types, but users writing scientific computations usually care about one thing only: achieving maximum performance. That being the case, it follows that arrays and matrices should be mutable.
For truly high performance, mutability is required in one specific case: provided that your code is perfectly optimized and that you master everything it is doing, down to your program's cache (L1, L2) access patterns, nothing beats a low-level, to-the-metal approach.
This happens mostly when you have one well-specified problem that stays constant for 20 years, i.e. mostly in scientific tasks.
As soon as you depart from that specific case, 99.99% of the time the bottlenecks arise from a representation that is too low level (induced by a low-level language), in which you can't express the final, real-world optimization trade-offs of the problem at hand.
Bottom line: for performance, the following approach is the only way (I think):
High-level / algorithmic optimization first
Once every high-level avenue has been explored, low-level optimization
You can see, as a consequence of that:
You should never optimize anything without FIRST measuring the impact: improvements should only be made if they yield enormous performance gains and/or do not degrade your domain logic.
If your problem is stable and well defined, you will eventually reach the point where you have no choice but to go low level and play with memory/mutability.

Logarithm with SSE, or switch to FPU?

I'm doing some statistics calculations. I need them to be fast, so I rewrote most of it to use SSE. I'm pretty much new to it, so I was wondering what the right approach here is:
To my knowledge, there is no log2 or ln function in SSE, at least not up to SSE4.1, which is the latest version supported by the hardware I use.
Is it better to:
extract 4 floats, and do FPU calculations on them to determine entropy - I won't need to load any of those values back into SSE registers, just sum them up into another float
find a function for SSE that does log2
There seem to be a few SSE log2 implementations around, e.g. this one.
There is also the Intel Approximate Maths Library, which has a log2 function among others; it's old (2000), but it's SSE2 and should still work reasonably well.
See also:
sse_mathfun - SSE vector math library
avx_mathfun - AVX vector math library
libmvec - vector math library added in glibc 2.22
There is no SSE instruction that implements a logarithm function. The closest thing in x86 is the x87 FYL2X instruction, which computes y * log2(x), but it is scalar and relatively slow. If you're thinking about using a logarithm function like log or log10 from the C standard library, it's worth taking a look at the implementation used in an open-source libc. You can also easily roll your own logarithm approximation that operates across all elements in an SSE register.
Such a function is usually implemented as a polynomial approximation (for example a Taylor or minimax polynomial) that meets some accuracy specification over a limited range of input arguments. You can then use logarithm identities to map an arbitrary input argument into that acceptable range. In addition, you can parameterize the base of the logarithm by taking advantage of the identity:
log_y(x) = log_a(x) / log_a(y)
where a is the base of the logarithm routine that you created.
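A minimal sketch of that idea (my own illustration, not code from the libraries linked above): pull the exponent out of the IEEE-754 bit pattern, approximate log2 of the remaining mantissa in [1, 2) with a small polynomial, and add the two. The quadratic below is only accurate to roughly two decimal digits; a real implementation would use a higher-order minimax polynomial.

    #include <emmintrin.h>  /* SSE2 */

    /* Approximate log2 of four positive, normal floats at once.
       Zero, negative, denormal, Inf and NaN inputs are not handled. */
    static inline __m128 log2_approx_ps(__m128 x)
    {
        const __m128i exp_mask  = _mm_set1_epi32(0x7f800000);
        const __m128i mant_mask = _mm_set1_epi32(0x007fffff);
        const __m128i one_bits  = _mm_set1_epi32(0x3f800000);

        __m128i xi = _mm_castps_si128(x);

        /* unbiased exponent as float: e = ((bits >> 23) & 0xff) - 127 */
        __m128i ei = _mm_sub_epi32(
            _mm_srli_epi32(_mm_and_si128(xi, exp_mask), 23),
            _mm_set1_epi32(127));
        __m128 e = _mm_cvtepi32_ps(ei);

        /* mantissa m in [1, 2): keep the mantissa bits, force the exponent field to 0 */
        __m128 m = _mm_castsi128_ps(
            _mm_or_si128(_mm_and_si128(xi, mant_mask), one_bits));

        /* crude quadratic fit of log2(1 + t) for t = m - 1 in [0, 1) */
        __m128 t = _mm_sub_ps(m, _mm_set1_ps(1.0f));
        __m128 p = _mm_mul_ps(t, _mm_sub_ps(_mm_set1_ps(1.3347f),
                                            _mm_mul_ps(_mm_set1_ps(0.3347f), t)));

        return _mm_add_ps(e, p);   /* log2(x) = e + log2(m) */
    }

Per the base-change identity above, multiplying the result by ln 2 (about 0.6931) turns it into a natural logarithm.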
