Logarithm with SSE, or switch to FPU?

I'm doing some statistics calculations. I need them to be fast, so I rewrote most of it to use SSE. I'm pretty much new to it, so I was wondering what the right approach here is:
To my knowledge, there is no log2 or ln function in SSE, at least not up to 4.1, which is the latest version supported by the hardware I use.
Is it better to:
extract 4 floats, and do FPU calculations on them to determine entropy - I won't need to load any of those values back into SSE registers, just sum them up into another float
find a function for SSE that does log2

There seem to be a few SSE log2 implementations around, e.g. this one.
There is also the Intel Approximate Math Library, which has a log2 function among others - it's old (2000), but it's SSE2 and it should still work reasonably well.
See also:
sse_mathfun - SSE vector math library
avx_mathfun - AVX vector math library
libmvec - vector math library added in glibc 2.22

There is no SSE instruction that computes a logarithm. There's also no single x86 instruction that performs a fast general-purpose logarithm (the x87 FYL2X instruction computes y*log2(x), but it is slow, microcoded, and scalar). If you're thinking about using a logarithm function like log or log10 from the C standard library, it's worth taking a look at how it is implemented in an open-source libm. You can quite easily roll your own logarithm approximation that operates across all elements in an SSE register.
Such a function is often implemented using a polynomial approximation, such as a truncated Taylor or minimax series, that meets some accuracy specification over a limited range of input arguments. You can then take advantage of logarithm identities to reduce a general input argument into the acceptable input range of your routine. In addition, you can parameterize the base of the logarithm via the change-of-base property:
log_y(x) = log_a(x) / log_a(y)
where a is the base of the logarithm routine that you created.
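If you end up rolling your own, a minimal SSE2 sketch of this recipe might look like the following. The quadratic and its coefficients are illustrative only (chosen so the fit is exact at m = 1 and m = 2, with absolute error below about 0.01), and there is no handling of zero, negative, denormal, or NaN inputs:

#include <emmintrin.h>  /* SSE2 */

/* log2 of 4 packed floats: peel the exponent off with integer ops,
   reduce the mantissa m to [1, 2), approximate log2(m) there with a
   small polynomial, then add the exponent back in. */
static __m128 log2_ps(__m128 x)
{
    __m128i xi = _mm_castps_si128(x);
    /* unbiased exponent e = (bits >> 23) - 127, converted to float */
    __m128 e = _mm_cvtepi32_ps(
        _mm_sub_epi32(_mm_srli_epi32(xi, 23), _mm_set1_epi32(127)));
    /* mantissa in [1, 2): keep the fraction bits, force the exponent
       field to that of 1.0f */
    __m128 m = _mm_castsi128_ps(
        _mm_or_si128(_mm_and_si128(xi, _mm_set1_epi32(0x007FFFFF)),
                     _mm_set1_epi32(0x3F800000)));
    /* quadratic -m^2/3 + 2m - 5/3, exact at both ends of [1, 2] */
    __m128 p = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(-0.33333333f), m),
                          _mm_set1_ps(2.0f));
    p = _mm_sub_ps(_mm_mul_ps(p, m), _mm_set1_ps(1.66666667f));
    return _mm_add_ps(p, e);  /* log2(x) = log2(m) + e */
}

For ln or log10, scale the result by the appropriate constant from the identity above, e.g. ln(x) = log2(x) * 0.69314718f. A higher-order minimax polynomial (as in sse_mathfun) buys full single-precision accuracy at modest extra cost.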

Related

Performance cost of float ↔︎ half conversion in Metal

I have a Metal-based Core Image convolution kernel that was using half precision variables for keeping track of sums and weights. However, I now figured that the range of 16-bit half is not enough in some cases, which means I need 32-bit float for some variables.
Now I'm wondering what's more performant:
use half as much as possible (for the samplers and most local vars) and only convert to float when needed (which means quite a lot, inside the loop)
or change all samplers and local vars to float type so that no conversion is necessary.
The latter would mean that all arithmetic is performed in 32-bit precision, even though it is only needed for some operations.
Is there any documentation or benchmark I can run to find the cost of float ↔︎ half conversion in Metal?
I believe you should go with option A:
use half as much as possible (for the samplers and most local vars) and only convert to float when needed (which means quite a lot, inside the loop)
based on the discussion in the WWDC 2016 talk entitled "Advanced Metal Shader Optimization".
The section from around 17:17 to 18:58 is the relevant one for this topic. The speaker, Fiona, mentions a couple of important points:
A8 and later GPUs have 16-bit registers, so 32-bit floating-point formats (like float) use twice as many registers, which means twice the bandwidth, energy, and so on. Using half therefore saves registers (which is always good) and energy
On A8 and later GPUs, "data type conversions are typically free, even between float and half [emphasis added]." Fiona even poses the questions you might be asking yourself about exactly this concern with all of the conversions, and says it is still probably fast because the conversions are free. Furthermore, according to the Metal Shading Language Specification, Version 2.3 (p. 218):
For textures that have half-precision floating-point pixel color values, the conversions from half to float are lossless
so you don't have to worry about losing precision there either.
There are some other relevant points in that section worth looking into as well, but I believe this is enough to justify going with option A.
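As a concrete illustration of option A, here is a minimal sketch in Metal. The kernel name and texture bindings are invented for the example; the point is that the texture traffic and per-pixel locals stay half while only the running sum is float:

#include <metal_stdlib>
using namespace metal;

/* Sum one row of a half-precision texture: 16-bit reads and locals keep
   register pressure low, the 32-bit accumulator provides the range, and
   the half-to-float conversion at the accumulate is effectively free on
   A8 and later GPUs. */
kernel void row_sum(texture2d<half, access::read>  src [[texture(0)]],
                    texture2d<half, access::write> dst [[texture(1)]],
                    uint2 gid [[thread_position_in_grid]])
{
    float sum = 0.0f;                          /* float: needs the range */
    for (uint x = 0; x < src.get_width(); ++x) {
        half4 c = src.read(uint2(x, gid.y));   /* half: cheap and compact */
        sum += float(c.r);                     /* convert only here */
    }
    dst.write(half4(half(sum), 0.0h, 0.0h, 1.0h), gid);
}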

Precision of Q31 and SP for FFT ARM Cortex-M7

I would like to understand whether using fixed point Q31 is better than floating-point (single precision) for DSP applications where accuracy is important.
In more detail: I am currently working with an ARM Cortex-M7 microcontroller and I need to perform an FFT with high accuracy using the CMSIS library. I understand that SP has 24 bits for the mantissa while Q31 has 31 bits, so the precision of Q31 should be better; but I read everywhere that for algorithms that involve multiplication and so on, the floating-point representation should be used, and I do not understand why.
Thanks in advance.
Getting maximum value out of fixed point (that extra 6 or 7 bits of mantissa accuracy), as well as avoiding a ton of possible underflow and overflow problems, requires knowing precisely the bounds (min and max) of every arithmetic operation in your CMSIS algorithms for every valid set of input data.
In practice, a complete error analysis turns out to be difficult, and the added operations needed to rescale all intermediate values to optimal ranges cost so much performance, that only a narrow set of cases seems worth the effort over using IEEE single or double, both of which the M7 supports in hardware, and whose floating-point exponent range hides an enormous amount (but not all!) of intermediate-result scaling issues.
But for some simpler DSP algorithms, analyzing and fixing the scaling isn't a problem. It's hard to tell which without analyzing the numeric range of every arithmetic operation in the algorithm you need. Sometimes the work required to use integer arithmetic has to be done anyway, because the available processor doesn't support floating-point arithmetic well, or at all.
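To make the scaling issue concrete, here is a sketch of the basic fixed-point multiply (the standard technique, not CMSIS source). The product of two Q31 values is a Q62 quantity, so every multiply forces a shift that discards low-order bits, and sums of such products overflow unless the data has been pre-scaled; this is why fixed-point FFTs typically scale down at every butterfly stage:

#include <stdint.h>

/* Multiply two Q31 fixed-point values.  The exact product needs 62
   fractional bits, so it is computed in a 64-bit intermediate and shifted
   right by 31 to return to Q31, discarding the low bits.  Nothing here
   guards against overflow of subsequent additions; that headroom has to
   be designed in by pre-scaling the inputs. */
static inline int32_t q31_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 31);
}

A float multiply, by contrast, renormalizes automatically: the exponent absorbs the scaling, at the price of only 24 bits of significand.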

Objective-C Quadratic/Polynomial regression (Linest function in excel)

The Objective-C math library seems pretty basic.
I'm looking for some statistics analysis functions like the Excel function "linest" to retrieve the quadratic or polynomial regressions of a data set with a given order.
Is there any function similar to the "linest" function for objective-c? Or a known statistics library/framework?
I have a hard time believing I'm the first person to stumble upon this problem in iOS.
I spent several days getting through the math and getting it into code because I couldn't find a math library for iOS with the function I needed. I wouldn't recommend anyone do that again; it wasn't a walk in the park, so I published my solution on my GitHub. You can find it here:
https://github.com/KingIsulgard/iOS-Polynomial-Regression
It's easy to use, just give the x values and y values of the data and the order of polynomial you want to get and voila, you got it.
Hope this might help some people. Feel free to improve if you can. I'm just happy it finally worked.
The standard math library in general only gives you an interface to the elementary mathematical operations that are implemented in the FPU part of a CPU.
For linear regression you need either your own algorithm (it is not that complicated to implement in a handful of loops) or, more likely, a dedicated statistics library.
Writing your own algorithm for higher order or general regression is simple if a QR decomposition algorithm is available, for instance via bindings for LAPACK or similar. Then to solve
minimize sum (b[0]*f[0](x[k])+...+b[n]*f[n](x[k])-y[k])^2
one just has to construct the matrix [X|Y], where X[k,j] = f[j](x[k]) is the matrix of the values of the ansatz functions and Y[k] = y[k] is the column vector of the values to approximate. Apply the QR algorithm to [X|Y], extract the R factor from the result, and solve for b in
R*[b|-1]' = 0
via back-substitution on the rows above the last one (the last row cannot be zeroed; its remaining entry is the size of the residual).
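If a LAPACK binding is acceptable (on iOS and macOS one ships inside the Accelerate framework), the whole procedure collapses into one call to dgels, which performs the QR factorization and back-substitution internally. A minimal sketch for the monomial ansatz f[j](x) = x^j, with a hypothetical polyfit wrapper and the classic CLAPACK interface assumed:

#include <stdlib.h>
#include <math.h>
#include <Accelerate/Accelerate.h>  /* CLAPACK interface on iOS/macOS */

/* Least-squares fit of y ~ b[0] + b[1]*x + ... + b[order]*x^order over m
   samples.  Builds the column-major design matrix X[k][j] = x_k^j and lets
   dgels_ solve min ||X*b - y||.  Returns LAPACK's info (0 on success). */
int polyfit(const double *x, const double *y, int m, int order, double *b)
{
    __CLPK_integer M = m, N = order + 1, nrhs = 1, lwork = -1, info;
    char trans = 'N';
    double wkopt;
    double *A = malloc(sizeof(double) * M * N);
    double *rhs = malloc(sizeof(double) * M);

    for (int j = 0; j < N; ++j)          /* Vandermonde matrix, column-major */
        for (int k = 0; k < M; ++k)
            A[j * M + k] = pow(x[k], j);
    for (int k = 0; k < M; ++k)
        rhs[k] = y[k];

    /* workspace query, then the actual QR solve */
    dgels_(&trans, &M, &N, &nrhs, A, &M, rhs, &M, &wkopt, &lwork, &info);
    lwork = (__CLPK_integer)wkopt;
    double *work = malloc(sizeof(double) * lwork);
    dgels_(&trans, &M, &N, &nrhs, A, &M, rhs, &M, work, &lwork, &info);

    for (int j = 0; j < N; ++j)          /* first N entries now hold b */
        b[j] = rhs[j];
    free(A); free(rhs); free(work);
    return (int)info;
}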

clang optimisation flags for matrix vector calculations

I looked at disassembled code generated by clang from glm (a matrix vector library for 3d calculations) operations.
I noticed clang doing some vectorization for double-precision operations, e.g. combining two multiplications into one SIMD instruction.
However, for single-precision calculations the code looks very bad to me. The instructions used are from the SSE instruction sets and the registers are XMM ones, but every multiplication is done on one single float at a time, and even groups of assignments (e.g. matrix assignment) are carried out by a long run of movss instructions. Those poor assignments even show up in double-precision code.
Why is that, and are there any command-line arguments that would encourage clang to do better? I know compilers don't do magic, but a linear list of 16 memory-adjacent assignments ought to be optimizable in many ways, I'd guess?
Seeing xmm registers in your disassembly is not proof of vectorization: every scalar floating-point operation, double or single, is performed in a SIMD register these days.
Vectorization is not trivial for the compiler. clang provides options such as
clang -fslp-vectorize-aggressive file.c
It may help; otherwise you can look for alternatives, as there are many libraries for matrix computations: MKL, Boost's numeric libraries, PLASMA, etc. As I recall, GLM is old; good alternatives exist.
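Note that newer clang releases have dropped that specific flag; the SLP vectorizer is enabled by default at -O2 and above. What tends to matter more is allowing a wider instruction set and relaxed floating-point ordering, and asking the compiler to report what it vectorized. A hypothetical invocation (the file name is invented):
clang -O3 -march=native -ffast-math -Rpass=slp-vectorizer -Rpass=loop-vectorize matvec.c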

can smt/z3 be used for optimization

Can SMT solver efficiently find a solution (or an assignment) for the pseudo-Boolean problem as described as follows:
sum_{i=1..m} w_i * f_i(x1, x2, ..., xn)
where each f_i(x1, x2, ..., xn) is a Boolean function, and w_i is a weight of Int type.
For your convenience, I have highlighted the contents of pages 1 and 3, which are enough to specify the pseudo-Boolean problem.
SMT solvers typically address the question: given a logical formula, optionally using functions and predicates from underlying theories (such as the theory of arithmetic, the theory of bit-vectors, or arrays), is the formula satisfiable or not? They typically don't expose a way for you to specify objective functions and typically don't have built-in optimization procedures.
Some special cases are formulas that only use Booleans, or a combination of Booleans and either bit-vectors or integers. Pseudo-Boolean constraints can be formulated with integers, encoded (with some care, taking overflow semantics into account) using bit-vectors, or encoded directly into SAT. For some formulas using bounded integers that fall into the class of pseudo-Boolean problems, Z3 will try automatic reductions to bit-vectors. This applies only to benchmarks in the SMT-LIB2 format tagged as QF_LIA, or when you explicitly invoke a tactic that performs this reduction (the "qflia" tactic should apply).
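For instance, a single pseudo-Boolean constraint can be phrased over the integers with ite terms. A toy instance in SMT-LIB2 (the constraint is invented for illustration):

(declare-const x1 Bool)
(declare-const x2 Bool)
(declare-const x3 Bool)
; 3*x1 + 2*x2 - x3 <= 4, mapping each Boolean to {0, weight} via ite
(assert (<= (+ (ite x1 3 0) (ite x2 2 0) (ite x3 (- 1) 0)) 4))
(check-sat)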
While Z3 does not directly expose objective functions, the question of augmenting SMT solvers with objective functions is actively pursued in the research community. One approach, suggested by Nieuwenhuis and Oliveras at SAT 2006, was to build solving for the "weighted max-SMT" problem into the solver as a custom theory. Yices comes with built-in features for weighted max-SMT; Z3 does not, but it is possible to write a custom theory that performs the backtracking search of a weighted max-SMT solver. Nothing comes out of the box, though.
Sometimes people try to specify objective functions using quantified formulas. In theory one could hope that quantifier-elimination procedures could then solve for the objective. This is generally pretty bad when it comes to performance: quantifier elimination is overkill, and the routines (that we have) will not be efficient.
For your problem, if you want to find an optimized (maximum or minimum) result of the sum, yes, Z3 has this ability. You can use the Optimize class of the Z3 library instead of the Solver class. It provides methods for maximization and minimization: you pass in the term to be optimized, and the Optimize object's model gives you the solution. This works with the C# API (the Microsoft.Z3 library). For your convenience, here is a snippet:
Context ctx = new Context();
Optimize opt = ctx.MkOptimize();             // optimizing solver instead of Solver
opt.Assert(/* anything you need to assert */);
Optimize.Handle h = opt.MkMaximize(/* the term to maximize */);  // or opt.MkMinimize(...)
if (opt.Check() == Status.SATISFIABLE)
    Console.WriteLine(h.Value);              // optimal value of the objective
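The same capability is exposed through Z3's SMT-LIB2 front end via the (maximize ...) and (minimize ...) commands, so the toy encoding from the first answer becomes an optimization query by adding, e.g., (maximize (+ (ite x1 3 0) (ite x2 2 0))) before (check-sat).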
