I'm using the Eigen library to do some computation on an iPad 2 (i.e., a Cortex-A9). It seems that some operations are vectorized using NEON instructions, while others aren't.
Operations that I've tried that get vectorized: dot products, vector and matrix additions and subtractions.
Operations that don't get vectorized: matrix multiplication.
I'm using these operations inside the same project and same file, so the compiler options are the same. I'm using -O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp.
All matrices that I'm using have Dynamic sizes. Is there anything I'm doing wrong, or is this the expected behaviour?
Thanks.
When you use -mfpu=neon, gcc/clang will vectorize integer operations but not floating-point ones, because NEON is not 100% IEEE-compliant (it doesn't support denormal numbers). You have to specify -ffast-math to make gcc/clang vectorize floating-point code with NEON. Be careful, though: -ffast-math can change the numerical results.
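To check this claim on your own toolchain, a small test loop is handy. The sketch below is only an illustration: the function name axpy and the file name in the comment are made up, and __restrict is a compiler extension (supported by gcc and clang) rather than standard C++. Compile it with and without -ffast-math and compare the generated assembly.

```cpp
#include <cstddef>

// Simple float loop that an auto-vectorizer can target: y[i] += a * x[i].
// Hypothetical build line (flags from the question, plus -ffast-math):
//   g++ -O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp -ffast-math -S axpy.cpp
// __restrict tells the compiler that x and y do not overlap.
void axpy(float a, const float* __restrict x, float* __restrict y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

If the loop was vectorized, the assembly should contain NEON instructions operating on q registers rather than one scalar VFP multiply and add per element.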
Related
I would like to understand whether using fixed point Q31 is better than floating-point (single precision) for DSP applications where accuracy is important.
In more detail: I am currently working with an ARM Cortex-M7 microcontroller and I need to perform an FFT with high accuracy using the CMSIS library. I understand that single precision has 24 bits of mantissa while Q31 has 31 bits, so the precision of Q31 should be better. However, I read everywhere that floating-point should be used for algorithms that involve multiplication and similar operations, and I do not understand why.
Thanks in advance.
Getting maximum value out of fixed point (that extra 6 or 7 bits of mantissa accuracy), as well as avoiding a ton of possible underflow and overflow problems, requires knowing precisely the bounds (min and max) of every arithmetic operation in your CMSIS algorithms for every valid set of input data.
In practice, a complete error analysis turns out to be difficult, and the extra operations needed to rescale all intermediate values to optimal ranges cost so much performance that only a narrow set of cases seems worth the effort compared with using IEEE single or double precision. The M7 supports both in hardware, and the floating-point exponent range hides an enormous amount (but not all!) of the numerical scaling issues in intermediate results.
For some simpler DSP algorithms, though, analyzing and fixing the scaling isn't a problem. It is hard to tell which without working out the numeric range of every arithmetic operation in the algorithm you need. Sometimes the work required to use integer arithmetic has to be done anyway, because the available processors support floating-point arithmetic poorly or not at all.
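The scaling issue can be made concrete with a small sketch. The code below is not CMSIS code; the helper names (to_q31, from_q31, mul_q31, rel_err_q31, rel_err_f32) are invented for illustration, and only the q31_t type name mirrors the CMSIS-DSP typedef. It compares the relative rounding error of Q31 and single precision for a value near full scale and for a small value.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

using q31_t = std::int32_t;  // local alias mirroring the CMSIS-DSP typedef

constexpr double kQ31Scale = 2147483648.0;  // 2^31

// Convert a real value in [-1, 1) to Q31, rounding and saturating.
q31_t to_q31(double x) {
    double scaled = std::round(x * kQ31Scale);
    if (scaled > 2147483647.0) scaled = 2147483647.0;
    if (scaled < -2147483648.0) scaled = -2147483648.0;
    return static_cast<q31_t>(scaled);
}

double from_q31(q31_t v) { return static_cast<double>(v) / kQ31Scale; }

// Q31 multiply with saturation. The right shift of a negative value is an
// arithmetic shift on common compilers (guaranteed from C++20 onward).
q31_t mul_q31(q31_t a, q31_t b) {
    std::int64_t p = (static_cast<std::int64_t>(a) * b) >> 31;
    if (p > std::numeric_limits<q31_t>::max()) p = std::numeric_limits<q31_t>::max();
    return static_cast<q31_t>(p);
}

// Relative error after rounding x to Q31 (absolute step is fixed at 2^-31).
double rel_err_q31(double x) {
    return std::fabs(from_q31(to_q31(x)) - x) / std::fabs(x);
}

// Relative error after rounding x to float (step scales with the exponent).
double rel_err_f32(double x) {
    return std::fabs(static_cast<double>(static_cast<float>(x)) - x) / std::fabs(x);
}
```

For an input such as 0.7, Q31 is the more accurate format. For an input such as 1e-6, the fixed Q31 step leaves only about 11 significant bits, while float keeps its full 24. That is why intermediate values must be rescaled toward full scale to get the benefit of Q31.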
In an iOS app, I need to solve a linear equation Ax = B, where A is a sparse matrix with 40K rows and 10K columns.
Accelerate has a Sparse Solver package, but it is still in beta:
https://developer.apple.com/documentation/accelerate/sparse_solvers
I wonder if I can use BLAS to solve the linear equation? BLAS contains functions to define a sparse matrix, but I don't see any solver functions.
NLopt is a solver for optimization, which implements different optimization algorithms and is implemented in different languages.
In order to use the LD_LBFGS algorithm in Julia, does the variable have to be a vector as opposed to a matrix?
If yes, once we need to optimize an objective which is a univariate function of a matrix variable, do we have to vectorize the matrix to be able to use this package?
Yes, NLopt only understands vectors of decision variables. If your code is more naturally expressed in terms of matrices, then you should convert the vector into a matrix in the function and derivative evaluation callbacks using reinterpret.
I looked at disassembled code generated by clang from glm (a matrix vector library for 3d calculations) operations.
I noticed clang doing some 'vectorization' for double precision operations, eg. coercing two multiplications in one SIMD instruction.
However, for single precision calculations, the code seems very bad to me. The instructions used are from the SSE instruction sets, and the registers are XMM ones, but every multiplication is done for one single float at a time, and even groups of assignments (e.g. matrix assignment) are carried out by a large bunch of movss statements. Those bad assignments occur in the double precision code as well.
Why is that, are there any command line arguments that would motivate clang doing better? I know compilers do no magic, but a linear list of 16 memory-adjacent assignments should be optimizable in many ways I guess?
Seeing xmm registers in your assembly is not proof of vectorization: every scalar floating-point operation, double or single precision, is now performed in SIMD registers.
Vectorization is not trivial for a compiler. Clang provides options such as
clang -fslp-vectorize-aggressive file.c
That may help. Otherwise you can look for an alternative: there are many libraries for matrix multiplication, such as MKL, boost-numeric, PLASMA, etc. As far as I recall, GLM is old, and good alternatives exist.
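The difference between scalar and packed use of an xmm register can be seen directly with intrinsics. This is a minimal sketch: the Vec4 struct and both function names are invented for illustration and are not part of GLM. _mm_mul_ss corresponds to the scalar mulss instruction, and _mm_mul_ps to the packed mulps.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

struct Vec4 { float v[4]; };

// Scalar multiply in an xmm register: only lane 0 is multiplied,
// lanes 1..3 are copied unchanged from a.
Vec4 mul_scalar_lane(const Vec4& a, const Vec4& b) {
    __m128 r = _mm_mul_ss(_mm_loadu_ps(a.v), _mm_loadu_ps(b.v));
    Vec4 out;
    _mm_storeu_ps(out.v, r);
    return out;
}

// Packed multiply: all four lanes are multiplied by one instruction.
Vec4 mul_packed(const Vec4& a, const Vec4& b) {
    __m128 r = _mm_mul_ps(_mm_loadu_ps(a.v), _mm_loadu_ps(b.v));
    Vec4 out;
    _mm_storeu_ps(out.v, r);
    return out;
}
```

When reading compiler output, look for the ps/pd suffixes (mulps, movaps, addpd) to confirm that code was vectorized. The ss/sd forms (mulss, movss) are scalar even though they use the same registers.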
I'm doing some statistics calculations. I need them to be fast, so I rewrote most of it to use SSE. I'm pretty much new to it, so I was wondering what the right approach here is:
To my knowledge, there is no log2 or ln function in SSE, at least not up to 4.1, which is the latest version supported by the hardware I use.
Is it better to:
extract 4 floats, and do FPU calculations on them to determine entropy - I won't need to load any of those values back into SSE registers, just sum them up to another float
find a function for SSE that does log2
There seem to be a few SSE log2 implementations around, e.g. this one.
There is also the Intel Approximate Maths Library which has a log2 function among others - it's old (2000) but it's SSE2 and it should still work reasonably well.
See also:
sse_mathfun - SSE vector math library
avx_mathfun - AVX vector math library
libmvec - vector math library added in glibc 2.22
There is no SSE instruction that implements a logarithm function. Apart from the legacy x87 instruction fyl2x, which computes y * log2(x) on the x87 stack, no other x86 instruction computes a generic logarithm either. If you're thinking about using a logarithm function like log or log10 from the C standard library, it's worth taking a look at the implementation that is used in an open-source library like libc. You can easily roll your own logarithm approximation that operates across all elements in an SSE register.
Such a function is often implemented using a polynomial approximation, such as a Taylor series, that is valid within some accuracy specification over a certain range of input arguments. You can then take advantage of logarithm properties to map a generic input argument into the acceptable input range for your logarithm routine. In addition, you can parameterize the base of the logarithm by taking advantage of the property:
log_y(x) = log_a(x) / log_a(y)
Where a is the base of the logarithm routine that you created.