How to implement decay towards zero in signed fixed point math, in sse? - vectorization

There are many decay-like physical events (for example body friction or charge leak), that are usually modelled in iterators like x' = x * 0.99, which is usually very easy to write in floating point arithmetics.
However, i have a demand to do this in 16-bit "8.8" signed fixed point manner, in sse. For efficient implementation on typical ALU mentioned formula can be rewritten as x = x - x/128; or x = x - (x>>7) where >> is "arithmetic", sign-extending right shift.
And i stuck here, because _mm_sra_epi16() produces totally counterintuitive behaviour, which is easily verifiable by following example:
#include <cstdint>
#include <iostream>
#include <emmintrin.h>
using namespace std;
int main(int argc, char** argv) {
cout << "required: ";
for (int i = -1; i < 7; ++i) {
cout << hex << (0x7fff >> i) << ", ";
}
cout << endl;
cout << "produced: ";
__m128i a = _mm_set1_epi16(0x7fff);
__m128i b = _mm_set_epi16(-1, 0, 1, 2, 3, 4, 5, 6);
auto c = _mm_sra_epi16(a, b);
for (auto i = 0; i < 8; ++i) {
cout << hex << c.m128i_i16[i] << ", ";
}
cout << endl;
return 0;
}
Output would be as follows:
required: 0, 7fff, 3fff, 1fff, fff, 7ff, 3ff, 1ff,
produced: 0, 0, 0, 0, 0, 0, 0, 0,
It only applies first shift to all, like it is actually _mm_sra1_epi16 function, accidentely named sra and given __m128i second argument bu a funny clause for no reason. So this cannot be used in SSE.
On other hand, i heard that division algorithm is enormously complex, thus _mm_div_epi16 is absent in SSE and also cannot be used.
What to do and how to implement/vectorize that popular "decay" technique?

x -= x>>7 is trivial to implement with SSE2, using a constant shift count for efficiency. This compiles to 2 instructions if AVX is available, otherwise a movdqa is needed to copy v before a destructive right-shift.
__m128i downscale(__m128i v){
__m128i dec = _mm_srai_epi16(v, 7);
return _mm_sub_epi16(v, dec);
}
GCC even auto-vectorizes it (Godbolt).
void foo(short *__restrict a) {
for (int i=0 ; i<10240 ; i++) {
a[i] -= a[i]>>7; // inner loop uses the same psraw / psubw
}
}
Unlike float, fixed-point has constant absolute precision over the full range, not constant relative precision. So for small positive numbers, v>>7 will be zero and your decrement will stall. (Negative inputs underflow to -1, because arithmetic right shift rounds towards -infinity.)
If small inputs where the shift can underflow to 0, you might want to OR with _mm_set1_epi16(1) to make sure the decrement is non-zero. Negligible effect on large-ish inputs. However, that will eventually make a downscale chain go from 0 to -1. (And then back up to 0, because -1 | 1 == -1 in 2's complement).
__m128i downscale_nonzero(__m128i v){
__m128i dec = _mm_srai_epi16(v, 7);
dec = _mm_or_si128(dec, _mm_set1_epi16(1));
return _mm_sub_epi16(v, dec);
}
If starting negative, the sequence would be -large, logarithmic until -128, linear until -4, -3, -2, -1, 0, -1, 0, -1, ...
Your code got all-zeros because _mm_sra_epi16 uses the low 64 bits of the 2nd source vector as a 64-bit shift count that applies to all elements. Read the manual. So you shifted all the bits out of each 16-bit element.
It's not idiotic, but per-element shift counts require AVX2 (for 32/64-bit elements) or AVX512BW for _mm_srav_epi16 or 64-bit arithmetic right shifts, which would make sense for the way you're trying to use it. (But the shift count is unsigned, so -1 also going to shift out all the bits).
Indeed, that instruction should be named _mm_sra1_epi16()
Yup, that would make sense. But remember that when these were named, AVX2 _mm_srav_* didn't exist yet. Also, that specific name would not be ideal because 1 and i are not the most visually distinct. (i for immediate, for the psraw xmm1, imm16 form instead of the psraw xmm1, xmm2/m128 form of the asm instruction: http://felixcloutier.com/x86/PSRAW:PSRAD:PSRAQ.html).
The other way it makes sense is that the MMX/SSE2 asm instruction has two forms: immediate (with the same count for all elements of course), and vector. Instead of forcing you to broadcast the count to all element, the vector version takes the scalar count in the bottom of a vector register. I think the intended use-case is after a movd xmm0, eax or something.
If you need per-element-variable shift counts without AVX512, see various Q&As about emulating it, e.g. Shifting 4 integers right by different values SIMD.
Some of the workarounds use multiplies by powers of 2 for variable left-shift, and then a right shift to put the data where needed. (But you need to somehow get the 1<<n SIMD vector prepared, so this works if the same set of counts is reused for many vectors, or especially if it's a compile-time constant).
With 16-bit elements, you can use just one _mm_mulhi_epi16 to do runtime-variable right shift counts with no precision loss or range limits. mulhi(x*y) is exactly like (x*(int)y) >> 16, so you can use y=1<<14 to right shift by 16-14 = 2 in that element.

Related

Vector Matrix multiplication via ARM NEON

I have a task - to multiply big row vector (10 000 elements) via big column-major matrix (10 000 rows, 400 columns). I decided to go with ARM NEON since I'm curious about this technology and would like to learn more about it.
Here's a working example of vector matrix multiplication I wrote:
//float* vec_ptr - a pointer to vector
//float* mat_ptr - a pointer to matrix
//float* out_ptr - a pointer to output vector
//int matCols - matrix columns
//int vecRows - vector rows, the same as matrix
for (int i = 0, max_i = matCols; i < max_i; i++) {
for (int j = 0, max_j = vecRows - 3; j < max_j; j+=4, mat_ptr+=4, vec_ptr+=4) {
float32x4_t mat_val = vld1q_f32(mat_ptr); //get 4 elements from matrix
float32x4_t vec_val = vld1q_f32(vec_ptr); //get 4 elements from vector
float32x4_t out_val = vmulq_f32(mat_val, vec_val); //multiply vectors
float32_t total_sum = vaddvq_f32(out_val); //sum elements of vector together
out_ptr[i] += total_sum;
}
vec_ptr = &myVec[0]; //switch ptr back again to zero element
}
The problem is that it's taking very long time to compute - 30 ms on iPhone 7+ when my goal is 1 ms or even less if it's possible. Current execution time is understandable since I launch multiplication iteration 400 * (10000 / 4) = 1 000 000 times.
Also, I tried to process 8 elements instead of 4. It seems to help, but numbers still very far from my goal.
I understand that I might make some horrible mistakes since I'm newbie with ARM NEON. And I would be happy if someone can give me some tip how I can optimize my code.
Also - is it worth doing big vector-matrix multiplication via ARM NEON? Does this technology fit well for such purpose?
Your code is completely flawed: it iterates 16 times assuming both matCols and vecRows are 4. What's the point of SIMD then?
And the major performance problem lies in float32_t total_sum = vaddvq_f32(out_val);:
You should never convert a vector to a scalar inside a loop since it causes a pipeline hazard that costs around 15 cycles everytime.
The solution:
float32x4x4_t myMat;
float32x2_t myVecLow, myVecHigh;
myVecLow = vld1_f32(&pVec[0]);
myVecHigh = vld1_f32(&pVec[2]);
myMat = vld4q_f32(pMat);
myMat.val[0] = vmulq_lane_f32(myMat.val[0], myVecLow, 0);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[1], myVecLow, 1);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[2], myVecHigh, 0);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[3], myVecHigh, 1);
vst1q_f32(pDst, myMat.val[0]);
Compute all the four rows in a single pass
Do a matrix transpose (rotation) on-the-fly by vld4
Do vector-scalar multiply-accumulate instead of vector-vector multiply and horizontal add that causes the pipeline hazards.
You were asking if SIMD is suitable for matrix operations? A simple "yes" would be a monumental understatement. You don't even need a loop for this.

OpenCV 2.4.7 Mat::convertTo() 32-bit to 16-bit truncate

I've noticed that using convertTo to convert a matrix from 32-bit to 16-bit "rounds" number to the upper boud. So, values bigger than 0x0000FFFF in the source matrix will be set as 0xFFFF in the destination matrix.
What I want for my application is instead to mask the values, setting in the destination just the 2 LSB of the values.
Here is an example:
Mat mat32;
Mat mat16;
mat32 = Mat(2,2,CV_32SC1);
for(int y = 0; y < 2; y++)
for(int x = 0; x < 2; x++)
mat32.at<unsigned int>(cv::Point(x,y)) = 0x0000FFFE + (y*2+x);
mat32.convertTo(mat16, CV_16UC1);
The matrixes have these values:
32 bits matrix:
0000FFFE 0000FFFF
00010000 00010001
16 bits matrix:
0000FFFE 0000FFFF
0000FFFF 0000FFFF
In the second row of 16-bit matrix I want to have
00000000 00000001
I can do this by scanning the source matrix value-by-value and masking the values, but the performances are low.
Is there an OpenCV function that does this?
Thanks to everyone!
MIX
This can be done, but this requires a somewhat dirty trick, so it is up to you to use this approach or not. So this is how it can be done:
For this example lets create 1000x1000 32-bit matrix and set all its values to 65541 (=256*256+5). So after the conversion we expect to have a matrix filled with fives.
Mat M1(1000, 1000, CV_32S, Scalar(65541));
And here is the trick:
Mat M2(1000, 1000, CV_16SC2, M1.data);
We created matrix M2 over the same memory buffer as M1, but M2 'think' that this is a buffer of 2-channel 16-bit image. Now the last thing to do is to copy the channel you need to the place you need. This can be done by split() or mixChannels() functions. For example:
Mat M3(1000, 1000, CV_16S);
int fromto[] = {0,0};
Mat inpu[] = {M2}, outpu[] = {M3};
mixChannels(inpu, 1, outpu, 1, fromto, 1);
cout << M3.at<short>(10,10) << endl;
Ye I know that the format of mixChannels looks weird and makes the code even less readable, but it works... If you prefer split() function:
vector<Mat> v;
split(M2,v);
cout << v[0].at<short>(10,10) << " " << v[1].at<short>(10,10) << endl;
There is no OpenCV function (that I know of) which does the conversion like you want, so either you code it yourself or like you said you go through a masking step first to remove the 16 high bits.
The mask can be applied using the bitwise_and in C++ or cvAndS in C. See here.
You could also have made your hand-written code more efficient. In general, you should avoid OpenCV pixel accessors in loops because they have bad performance. I don't have an OpenCV install at hand so this could be slighlty off -- the idea is to use the data field directly, and step which is the number of bytes per row:
for(int y = 0; y < mat32.height; ++) {
int* row = (int*)( (char*)mat32.data + y * mat32.step);
for(int x = 0; x < mat32.step/ 4)
row[x] &= 0xffff;
Then, once the mask is applied, all values fit in 16 bits, and convertTo will just truncate the 16 upper bits.
The other solution is to code the conversion by hand:
mat16.resize( mat32.size() );
for(int y = 0; y < mat32.height; ++) {
const int* row32 = (const int*)( (char*)mat32.data + y * mat32.step);
short* row16 = (short*) ( (char*)mat16.data + y * mat16.step);
for(int x = 0; x < mat32.step/ 4)
row16[x] = short(row32[x]);

Calculating constants for CRC32 using PCLMULQDQ

I'm reading through the following paper on how to implement CRC32 efficiently using the PCLMULQDQ instruction introduced in Intel Westmere and AMD Bulldozer:
V. Gopal et al. "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction." 2009. http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf
I understand the algorithm, but one thing I'm not sure about is how to calculate the constants $k_i$. For example, they provide the constant values for the IEEE 802.3 polynomial:
k1 = x^(4*128+64) mod P(x) = 0x8833794C
k4 = x^128 mod P(x) = 0xE8A45605
mu = x^64 div P(x) = 0x104D101DF
and so on. I can just use these constants as I only need to support the one polynomial, but I'm interested: how did they calculate those numbers? I can't just use a typical bignum implementation (e.g. the one provided by Python) because the arithmetic must happen in GF(2).
It's just like regular division, except you exclusive-or instead of subtract. So start with the most significant 1 in the dividend. Exclusive-or the dividend by the polynomial, lining up the most significant 1 of the polynomial with that 1 in the dividend to turn it into a zero. Repeat until you have eliminated all of the 1's above the low n bits, where n is the order of the polynomial. The result is the remainder.
Make sure that your polynomial has the high term in the n+1th bit. I.e., use 0x104C11DB7, not 0x4C11DB7.
If you want the quotient (which you wrote as "div"), then keep track of the positions of the 1's you eliminated. That set, shifted down by n, is the quotient.
Here is how:
/* Placed in the public domain by Mark Adler, Jan 18, 2014. */
#include <stdio.h>
#include <inttypes.h>
/* Polynomial type -- must be an unsigned integer type. */
typedef uintmax_t poly_t;
#define PPOLY PRIxMAX
/* Return x^n mod p(x) over GF(2). x^deg is the highest power of x in p(x).
The positions of the bits set in poly represent the remaining powers of x in
p(x). In addition, returned in *div are as many of the least significant
quotient bits as will fit in a poly_t. */
static poly_t xnmodp(unsigned n, poly_t poly, unsigned deg, poly_t *div)
{
poly_t mod, mask, high;
if (n < deg) {
*div = 0;
return poly;
}
mask = ((poly_t)1 << deg) - 1;
poly &= mask;
mod = poly;
*div = 1;
deg--;
while (--n > deg) {
high = (mod >> deg) & 1;
*div = (*div << 1) | high; /* quotient bits may be lost off the top */
mod <<= 1;
if (high)
mod ^= poly;
}
return mod & mask;
}
/* Compute and show x^n modulo the IEEE 802.3 CRC-32 polynomial. If d is true,
also show the low bits of the quotient. */
static void show(unsigned n, int showdiv)
{
poly_t div;
printf("x^%u mod p(x) = %#" PPOLY "\n", n, xnmodp(n, 0x4C11DB7, 32, &div));
if (showdiv)
printf("x^%u div p(x) = %#" PPOLY "\n", n, div);
}
/* Compute the constants required to use PCLMULQDQ to compute the IEEE 802.3
32-bit CRC. These results appear on page 16 of the Intel paper "Fast CRC
Computation Using PCLMULQDQ Instruction". */
int main(void)
{
show(4*128+64, 0);
show(4*128, 0);
show(128+64, 0);
show(128, 0);
show(96, 0);
show(64, 1);
return 0;
}

iOS Accelerate low-pass FFT filter mirroring result

I am trying to port an existing FFT based low-pass filter to iOS using the Accelerate vDSP framework.
It seems like the FFT works as expected for about the first 1/4 of the sample. But then after that the results seem wrong, and even more odd are mirrored (with the last half of the signal mirroring most of the first half).
You can see the results from a test application below. First is plotted the original sampled data, then an example of the expected filtered results (filtering out signal higher than 15Hz), then finally the results of my current FFT code (note that the desired results and example FFT result are at a different scale than the original data):
The actual code for my low-pass filter is as follows:
double *lowpassFilterVector(double *accell, uint32_t sampleCount, double lowPassFreq, double sampleRate )
{
double stride = 1;
int ln = log2f(sampleCount);
int n = 1 << ln;
// So that we get an FFT of the whole data set, we pad out the array to the next highest power of 2.
int fullPadN = n * 2;
double *padAccell = malloc(sizeof(double) * fullPadN);
memset(padAccell, 0, sizeof(double) * fullPadN);
memcpy(padAccell, accell, sizeof(double) * sampleCount);
ln = log2f(fullPadN);
n = 1 << ln;
int nOver2 = n/2;
DSPDoubleSplitComplex A;
A.realp = (double *)malloc(sizeof(double) * nOver2);
A.imagp = (double *)malloc(sizeof(double) * nOver2);
// This can be reused, just including it here for simplicity.
FFTSetupD setupReal = vDSP_create_fftsetupD(ln, FFT_RADIX2);
vDSP_ctozD((DSPDoubleComplex*)padAccell,2,&A,1,nOver2);
// Use the FFT to get frequency counts
vDSP_fft_zripD(setupReal, &A, stride, ln, FFT_FORWARD);
const double factor = 0.5f;
vDSP_vsmulD(A.realp, 1, &factor, A.realp, 1, nOver2);
vDSP_vsmulD(A.imagp, 1, &factor, A.imagp, 1, nOver2);
A.realp[nOver2] = A.imagp[0];
A.imagp[0] = 0.0f;
A.imagp[nOver2] = 0.0f;
// Set frequencies above target to 0.
// This tells us which bin the frequencies over the minimum desired correspond to
NSInteger binLocation = (lowPassFreq * n) / sampleRate;
// We add 2 because bin 0 holds special FFT meta data, so bins really start at "1" - and we want to filter out anything OVER the target frequency
for ( NSInteger i = binLocation+2; i < nOver2; i++ )
{
A.realp[i] = 0;
}
// Clear out all imaginary parts
bzero(A.imagp, (nOver2) * sizeof(double));
//A.imagp[0] = A.realp[nOver2];
// Now shift back all of the values
vDSP_fft_zripD(setupReal, &A, stride, ln, FFT_INVERSE);
double *filteredAccell = (double *)malloc(sizeof(double) * fullPadN);
// Converts complex vector back into 2D array
vDSP_ztocD(&A, stride, (DSPDoubleComplex*)filteredAccell, 2, nOver2);
// Have to scale results to account for Apple's FFT library algorithm, see:
// http://developer.apple.com/library/ios/#documentation/Performance/Conceptual/vDSP_Programming_Guide/UsingFourierTransforms/UsingFourierTransforms.html#//apple_ref/doc/uid/TP40005147-CH202-15952
double scale = (float)1.0f / fullPadN;//(2.0f * (float)n);
vDSP_vsmulD(filteredAccell, 1, &scale, filteredAccell, 1, fullPadN);
// Tracks results of conversion
printf("\nInput & output:\n");
for (int k = 0; k < sampleCount; k++)
{
printf("%3d\t%6.2f\t%6.2f\t%6.2f\n", k, accell[k], padAccell[k], filteredAccell[k]);
}
// Acceleration data will be replaced in-place.
return filteredAccell;
}
In the original code the library was handling non power-of-two sizes of input data; in my Accelerate code I am padding out the input to the nearest power of two. In the case of the sample test below the original sample data is 1000 samples so it's padded to 1024. I don't think that would affect results but I include that for the sake of possible differences.
If you want to experiment with a solution, you can download the sample project that generates the graphs here (in the FFTTest folder):
FFT Example Project code
Thanks for any insight, I've not worked with FFT's before so I feel like I am missing something critical.
If you want a strictly real (not complex) result, then the data before the IFFT must be conjugate symmetric. If you don't want the result to be mirror symmetric, then don't zero the imaginary component before the IFFT. Merely zeroing bins before the IFFT creates a filter with a huge amount of ripple in the passband.
The Accelerate framework also supports more FFT lengths than just powers of 2.

kissfft scaling

I am looking to compute a fast correlation using FFTs and the kissfft library, and scaling needs to be precise. What scaling is necessary (forward and backwards) and what value do I use to scale my data?
The 3 most common FFT scaling factors are:
1.0 forward FFT, 1.0/N inverse FFT
1.0/N forward FFT, 1.0 inverse FFT
1.0/sqrt(N) in both directions, FFT & IFFT
Given any possible ambiguity in the documentation, and for whatever scaling the user expects to be "correct" for their purposes, best to just feed a pure sine wave of known (1.0 float or 255 integer) amplitude and exactly periodic in the FFT length to the FFT (and/or IFFT) in question, and see if the scaling matches one of the above, is maybe different from one of the above by 2X or sqrt(2), or the desired scaling is something completely different.
e.g. Write a unit test for kissfft in your environment for your data types.
multiply each frequency response by 1/sqrt(N), for an overall scaling of 1/N
In pseudocode:
ifft( fft(x)*conj( fft(y) )/N ) == circular_correlation(x,y)
At least this is true for kisfft with floating point types.
The output of the following c++ example code should be something like
the circular correlation of [1, 3i, 0 0 ....] with itself = (10,0),(1.19796e-10,3),(-4.91499e-08,1.11519e-15),(1.77301e-08,-1.19588e-08) ...
#include <complex>
#include <iostream>
#include "kiss_fft.h"
using namespace std;
int main()
{
const int nfft=256;
kiss_fft_cfg fwd = kiss_fft_alloc(nfft,0,NULL,NULL);
kiss_fft_cfg inv = kiss_fft_alloc(nfft,1,NULL,NULL);
std::complex<float> x[nfft];
std::complex<float> fx[nfft];
memset(x,0,sizeof(x));
x[0] = 1;
x[1] = std::complex<float>(0,3);
kiss_fft(fwd,(kiss_fft_cpx*)x,(kiss_fft_cpx*)fx);
for (int k=0;k<nfft;++k) {
fx[k] = fx[k] * conj(fx[k]);
fx[k] *= 1./nfft;
}
kiss_fft(inv,(kiss_fft_cpx*)fx,(kiss_fft_cpx*)x);
cout << "the circular correlation of [1, 3i, 0 0 ....] with itself = ";
cout
<< x[0] << ","
<< x[1] << ","
<< x[2] << ","
<< x[3] << " ... " << endl;
kiss_fft_free(fwd);
kiss_fft_free(inv);
return 0;
}

Resources