Precision of Q31 and SP for FFT ARM Cortex-M7 - signal-processing

I would like to understand whether using fixed point Q31 is better than floating-point (single precision) for DSP applications where accuracy is important.
More details, I am currently working with ARM Cortex-M7 microcontroller and I need to perform FFT with high accuracy using CMSIS library. I understand that the SP has 24 bits for the mantissa while the Q31 has 31 bits, therefore, the precision of the Q31 should be better, but I read everywhere that for algorithms that require multiplication and so on, the floating-point representation should be used, which I do not understand why.
Thanks in advance.

Getting maximum value out of fixed point (that extra 6 or 7 bits of mantissa accuracy), as well as avoiding a ton of possible underflow and overflow problems, requires knowing precisely the bounds (min and max) of every arithmetic operation in your CMSIS algorithms for every valid set of input data.
In practice, both a complete error analysis turns out to be difficult, and the added operations needed to rescale all intermediate values to optimal ranges reduces performance so much, that only a narrower set of cases seems worth the effort, over using either IEEE signal or double, which the M7 supports in hardware, and where the floating point exponent range hides an enormous amount (but not all !!) of intermediate result numerical scaling issues.
But for some more simple DSP algorithms, sometimes analyzing and fixing the scaling isn't a problem. Hard to tell which without disassembling the numeric range of every arithmetic operation in your needed algorithm. Sometimes the work required to use integer arithmetic needs to be done because the processors available don't support floating point arithmetic well or at all.

Related

Performance cost of float ↔︎ half conversion in Metal

I have a Metal-based Core Image convolution kernel that was using half precision variables for keeping track of sums and weights. However, I now figured that the range of 16-bit half is not enough in some cases, which means I need 32-bit float for some variables.
Now I'm wondering what's more performant:
use half as much as possible (for the samplers and most local vars) and only convert to float when needed (which means quite a lot, inside the loop)
or change all samplers and local vars to float type so that no conversion is necessary.
The former would mean that all arithmetic is performed in 32-bit precision, though it would only be needed for some operations.
Is there any documentation or benchmark I can run to find the cost of float ↔︎ half conversion in Metal?
I believe you should go with option A:
use half as much as possible (for the samplers and most local vars) and only convert to float when needed (which means quite a lot, inside the loop)
based on the discussion in the WWDC 2016 talk entitled "Advanced Metal Shader Optimization" linked here.
Between around 17:17-18:58 is the relevant section for this topic. The speaker Fiona mentions a couple of things of importance:
A8 and later GPUs have 16-bit registers, which means that 32-bit floating-point formats (like float) use twice as many registers, which means twice as much bandwidth, energy, etc. So using half saves registers (which is always good) and energy
On A8 and later GPUs, "data type conversions are typically free, even between float and half [emphasis added]." Fiona even poses questions you might be asking yourself covering what I believe you are concerned about with all of the conversions and says that it's still probably fast because the conversions are free. Furthermore, according to the Metal Shading Language Specification Version 2.3 (pg. 218)
For textures that have half-precision floating-point pixel color values, the conversions from half to float are lossless
so that you don't have to worry about precision being lost as well.
There are some other relevant points that are worth looking into as well in that section, but I believe this is enough to justify going with option A

How do the cuSPARSE and cuBLAS libraries deal with memory allocated using cudaMallocPitch?

I am implementing a simple routine that performs sparse matrix - dense matrix multiplication using cusparseScsrmm from cuSPARSE. This is part of a bigger application that could allocate memory on GPU using cudaMalloc (more than 99% of the time) or cudaMallocPitch (very rarely used). I have a couple of questions regarding how cuSPARSE deals with pitched memory:
1) I passed in pitched memory into the cuSPARSE routine but the results were incorrect (as expected, since there is no way to pass in the pitch as an argument). Is there a way to get these libraries working with memory allocated using cudaMallocPitch?
2) What is the best way to deal with this? Should I just add a check in the calling function, to enforce that the memory not be allocated using pitched mode?
For sparse matrix operations, the concept of pitched data has no relevance anyway.
For dense matrix operations most operations don't directly support a "pitch" to the data per se, however various operations can operate on a sub-matrix. With a particular caveat, it should be possible for such operations to handle pitched or unpitched data. Any time you see a CUBLAS (or CUSPARSE) operation that accepts "leading dimension" arguments, those arguments could be used to encompass a pitch in the data.
Since the "leading dimension" parameter is specified in matrix elements, and the pitch is (usually) specified in bytes, the caveat here is that the pitch is evenly divisible by the size of the matrix element in question, so that the pitch (in bytes) can be converted to a "leading dimension" parameter specified in matrix elements. I would expect that this would be typically possible for char, int, float, double and similar types, as I believe the pitch quantity returned by cudaMallocPitch will usually be evenly divisible by 16. But there is no stated guarantee of this, so proper run-time checking is advised, if you intend to use this approach.
For example, it should be possible to perform a CUBLAS matrix-matrix multiply (gemm) on pitched data, with appropriate specification of the lda, ldb and ldc parameters.
The operation you indicate does offer such leading dimension parameters for the dense matrices involved.
If 99% of your use-cases don't use pitched data, I would either not support pitched data at all, or else, for operations where no leading dimension parameters are available, copy the pitched data to an unpitched buffer for use in the desired operation. A device-to-device pitched to unpitched copy can run at approximately the rate of memory bandwidth, so it might be fast enough to not be a serious issue for 1% of the use cases.

Complex interpolation on an FPGA

I have a problem in that I need to implement an algorithm on an FPGA that requires a large array of data that is too large to fit into block or distributed memory. The array contains complex fixed-point values, and it turns out that I can do a good job by reducing the total number of stored values through decimation and then linearly interpolating the interim values on demand.
Though I have DSP blocks (and so fixed-point hardware multipliers) which could be used trivially for real and imaginary part interpolation, I actually want to do the interpolation on the amplitude and angle (of the polar form of the complex number) and then convert the result to real-imaginary form. The data can be stored in polar form if it improves things.
I think my question boils down to this: How should I quickly convert between polar complex numbers and real-imaginary complex numbers (and back again) on an FPGA (noting availability of DSP hardware)? The solution need not be exact, just close, but be speed optimised. Alternatively, better strategies are gladly received!
edit: I know about cordic techniques, so this would be how I would do it in the absence of a better idea. Are there refinements specific to this problem I could invoke?
Another edit: Following from #mbschenkel's question, and some more thinking on my part, I wanted to know if there were any known tricks specific to the problem of polar interpolation.
In my case, the dominant variation between samples is a phase rotation, with a slowly varying amplitude. Since the sampling grid is known ahead of time and is regular, one trick could be to precompute some complex interpolation factors. So, for two complex values a and b, if we wish to find (N-1) intermediate equally spaced values, we can precompute the factor
scale = (abs(b)/abs(a))**(1/N)*exp(1j*(angle(b)-angle(a)))/N)
and then find each intermediate value iteratively as val[n] = scale * val[n-1] where val[0] = a.
This works well for me as I need the samples in order and I compute them all. For small variations in amplitude (i.e. abs(b)/abs(a) ~= 1) and 0 < n < N, (abs(b)/abs(a))**(n/N) is approximately linear (though linear is not necessarily better).
The above is all very good, but still results in a complex multiplication. Are there other options for approximating this? I'm interested in resource and speed constraints, not accuracy. I know I can do the rotation with CORDIC, but still need a pair of multiplications for the scaling, so I'm adding lots of complexity and resource usage for potentially limited results. I don't really have a feel for the convergence of CORDIC, so perhaps I just truncate early, or use lots of resources to converge quickly.

Does variables with large integer values affect the performance of SMT?

I am wondering whether large integer values have impact on the performance of SMT. Sometimes I need to work with large values. Mostly I do arithmetic operations (mainly addition and multiplication) on them (i.e., different integer terms) and need to compare the resultant value with constraints (i.e., some other integer term).
Large integers and/or rationals in the input problem is not a definitive indicator of hardness.
Z3 may generate large numbers internally even when the input contains only small numbers.
I have observed many examples where Z3 spends a lot of time processing large rational numbers.
A lot of time is spent computing the GCD of the numerator and denominator.
Each GCD computation takes a relatively small amount of time, but on hard problems Z3 will perform millions of them.
Note that, Z3 uses rational numbers for solving pure integer problems, because it uses a Simplex-based algorithm for solving linear arithmetic.
If you post your example, I can give you a more precise answer.

Lua: subtracting decimal numbers doesn't return correct precision

I am using Lua 5.1
print(10.08 - 10.07)
Rather than printing 0.01, above prints 0.0099999999999998.
Any idea how to get 0.01 form this subtraction?
You got 0.01 from the subtraction. It is just in the form of a repeating decimal with a tiny amount of lost precision.
Lua uses the C type double to represent numbers. This is, on nearly every platform you will use, a 64-bit binary floating point value with approximately 23 decimal digits of precision. However, no amount of precision is sufficient to represent 0.01 exactly in binary. The situation is similar to attempting to write 1/3 in decimal.
Furthermore, you are subtracting two values that are very similar in magnitude. That all by itself causes an additional loss of precision.
The solution depends on what your context is. If you are doing accounting, then I would strongly recommend that you not use floating point values to represent account values because these small errors will accumulate and eventually whole cents (or worse) will appear or disappear. It is much better to store accounts in integer cents, and divide by 100 for display.
In general, the answer is to be aware of the issues that are inherent to floating point, and one of them is this sort of small loss of precision. It is easily handled by rounding answers to an appropriate number of digits for display, and never comparing results of calculations for equality.
Some resources for background:
The semi-official explanation at the Lua Users Wiki
This great page of IEEE floating point calculators where you can enter values in decimal and see how they are represented, and vice-versa.
Wiki on IEEE floating point.
Wiki on floating point numbers in general.
What Every Computer Scientist Should Know About Floating-Point Arithmetic is the most complete discussion of the fine details.
Edit: Added the WECSSKAFPA document after all. It really is worth the study, although it will likely seem a bit overwhelming on the first pass. In the same vein, Knuth Volume 2 has extensive coverage of arithmetic in general and a large section on floating point. And, since lhf reminded me of its existence, I inserted the Lua wiki explanation of why floating point is ok to use as the sole numeric type as the first item on the list.
Use Lua's string.format function:
print(string.format('%.02f', 10.08 - 10.07))
Use an arbitrary precision library.
Use a Decimal Number Library for Lua.

Resources