Substitute a byte with another one - sse

I am having difficulty writing code for this seemingly easy problem.
Given a vector of packed 8-bit integers, substitute one byte value with another wherever it is present.
For instance, I want to substitute 0x06 with 0x01, so I can do the following, with res as the input, to find 0x06:
// Bytes to be manipulated
res = _mm_set_epi8(0x00, 0x03, 0x02, 0x06, 0x0F, 0x02, 0x02, 0x06, 0x0A, 0x03, 0x02, 0x06, 0x00, 0x00, 0x02, 0x06);
// Target value and substitution
val = _mm_set1_epi8(0x06);
sub = _mm_set1_epi8(0x01);
// Find the target
sse = _mm_cmpeq_epi8(res, val);
// Isolate target
sse = _mm_and_si128(res, sse);
// Isolate remaining bytes
adj = _mm_andnot_si128(sse, res);
Now I don't know how to proceed to OR those two parts together: I need to remove the target and put the replacement byte in its place.
What SIMD instruction am I missing here?
As with my other questions, I am limited to AVX; I have no better processor.

What you essentially need to do is to set all bytes of the input which you want to substitute to zero, set all other bytes of the substitution vector to zero, and OR the two results. You already have a mask for that from _mm_cmpeq_epi8. Overall, this can be done like this:
__m128i mask = _mm_cmpeq_epi8(inp, val);
return _mm_or_si128(_mm_and_si128(mask, sub), _mm_andnot_si128(mask, inp));
Since the last combination of and/andnot/or is very common, SSE4.1 introduced an instruction which (essentially) combines these into one:
__m128i mask = _mm_cmpeq_epi8(inp, val);
return _mm_blendv_epi8(inp, sub, mask);
In fact, clang 5.0 and later are smart enough to replace the first variant with the second when compiling with optimization: https://godbolt.org/z/P-tcik
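To check an intrinsic version against something simpler, the and/andnot/or combination can be modelled one byte lane at a time in plain C. This is only a sketch: replace_byte_lane and replace_bytes are hypothetical helper names, not intrinsics, and the loop merely stands in for the 16 lanes of an __m128i.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one byte lane. mask is 0xFF where the lane equals val
   (what _mm_cmpeq_epi8 produces for that lane), otherwise 0x00. */
static uint8_t replace_byte_lane(uint8_t in, uint8_t val, uint8_t sub)
{
    uint8_t mask = (in == val) ? 0xFF : 0x00;
    /* (mask & sub) | (~mask & in): the and / andnot / or combination */
    return (uint8_t)((mask & sub) | (~mask & in));
}

/* Apply the lane model to a whole buffer (hypothetical helper). */
static void replace_bytes(uint8_t *buf, size_t n, uint8_t val, uint8_t sub)
{
    for (size_t i = 0; i < n; ++i)
        buf[i] = replace_byte_lane(buf[i], val, sub);
}
```

Lanes that do not match pass through unchanged because their mask is all zeros, which is the same selection that the blend instruction performs.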
N.B.: If the substitution value is in fact 0x01 you can exploit the fact that the mask (the result of the comparison) is 0x00 or 0xff (which is -0x01), i.e., you can zero out the values you want to substitute and then subtract the mask:
__m128i val = _mm_set1_epi8(0x06);
__m128i mask = _mm_cmpeq_epi8(inp, val);
return _mm_sub_epi8(_mm_andnot_si128(mask, inp), mask);
This can save either loading the 0x01 vector from memory or wasting a register for it. And depending on your architecture it may have a slightly better throughput.
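The subtract trick relies on byte arithmetic wrapping modulo 256. A minimal scalar sketch of a single lane, with the hypothetical name replace_with_one_lane, shows the arithmetic; it only applies when the substitution value is 0x01.

```c
#include <stdint.h>

/* Scalar model of one byte lane of the subtract trick (sub == 0x01 only). */
static uint8_t replace_with_one_lane(uint8_t in, uint8_t val)
{
    uint8_t mask = (in == val) ? 0xFF : 0x00; /* 0xFF is -1 as a signed byte */
    uint8_t kept = (uint8_t)(~mask & in);     /* zero out the matching lane */
    return (uint8_t)(kept - mask);            /* 0 - (-1) == 1 modulo 256 */
}
```

For a non-matching lane the mask is zero, so the subtraction leaves the input byte untouched.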

Related

How to implement decay towards zero in signed fixed point math, in sse?

There are many decay-like physical phenomena (for example body friction or charge leakage) that are usually modelled with iterations like x' = x * 0.99, which is very easy to write in floating-point arithmetic.
However, I need to do this in 16-bit "8.8" signed fixed point, in SSE. For an efficient implementation on a typical ALU, the formula can be rewritten as x = x - x/128; or x = x - (x>>7), where >> is an "arithmetic", sign-extending right shift.
And I am stuck here, because _mm_sra_epi16() produces totally counterintuitive behaviour, which is easily verified with the following example:
#include <cstdint>
#include <iostream>
#include <emmintrin.h>
using namespace std;
int main(int argc, char** argv) {
    cout << "required: ";
    for (int i = -1; i < 7; ++i) {
        cout << hex << (0x7fff >> i) << ", ";
    }
    cout << endl;
    cout << "produced: ";
    __m128i a = _mm_set1_epi16(0x7fff);
    __m128i b = _mm_set_epi16(-1, 0, 1, 2, 3, 4, 5, 6);
    auto c = _mm_sra_epi16(a, b);
    for (auto i = 0; i < 8; ++i) {
        cout << hex << c.m128i_i16[i] << ", ";
    }
    cout << endl;
    return 0;
}
Output would be as follows:
required: 0, 7fff, 3fff, 1fff, fff, 7ff, 3ff, 1ff,
produced: 0, 0, 0, 0, 0, 0, 0, 0,
It only applies the first shift count to all elements, as if it were actually a _mm_sra1_epi16 function that was accidentally named sra and given a __m128i second argument by some odd quirk, for no reason. So this cannot be used in SSE.
On the other hand, I heard that the division algorithm is enormously complex, so _mm_div_epi16 is absent from SSE and also cannot be used.
What can I do, and how do I implement/vectorize that popular "decay" technique?
x -= x>>7 is trivial to implement with SSE2, using a constant shift count for efficiency. This compiles to 2 instructions if AVX is available, otherwise a movdqa is needed to copy v before a destructive right-shift.
__m128i downscale(__m128i v){
    __m128i dec = _mm_srai_epi16(v, 7);
    return _mm_sub_epi16(v, dec);
}
GCC even auto-vectorizes it (Godbolt).
void foo(short *__restrict a) {
    for (int i=0 ; i<10240 ; i++) {
        a[i] -= a[i]>>7; // inner loop uses the same psraw / psubw
    }
}
Unlike float, fixed-point has constant absolute precision over the full range, not constant relative precision. So for small positive numbers, v>>7 will be zero and your decrement will stall. (Negative inputs underflow to -1, because arithmetic right shift rounds towards -infinity.)
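The stall is easy to reproduce with a scalar model of one 16-bit lane. The sketch below is an illustration under stated assumptions, not part of the SSE code: sra7, decay_step and decay_limit are hypothetical names, and the shift is written so that it floors without depending on how a given C compiler treats >> on negative values.

```c
#include <stdint.h>

/* Arithmetic (floor) right shift by 7 for one 16-bit lane, written so it
   does not rely on implementation-defined >> of negative values. */
static int16_t sra7(int16_t x)
{
    return (int16_t)(x >= 0 ? x >> 7 : ~(~x >> 7));
}

/* One decay step, x -= x >> 7. */
static int16_t decay_step(int16_t x)
{
    return (int16_t)(x - sra7(x));
}

/* Iterate until the value stops changing and return where it got stuck. */
static int16_t decay_limit(int16_t x)
{
    for (;;) {
        int16_t next = decay_step(x);
        if (next == x)
            return x;
        x = next;
    }
}
```

In this model a chain that starts at 0x7fff gets stuck at 127, the largest value whose shift result is zero, while a chain that starts negative reaches 0.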
If small inputs, where the shift can underflow to 0, matter to you, you might want to OR the decrement with _mm_set1_epi16(1) to make sure it is non-zero. This has a negligible effect on large-ish inputs. However, it will eventually make a downscale chain go from 0 to -1 (and then back up to 0, because -1 | 1 == -1 in 2's complement).
__m128i downscale_nonzero(__m128i v){
    __m128i dec = _mm_srai_epi16(v, 7);
    dec = _mm_or_si128(dec, _mm_set1_epi16(1));
    return _mm_sub_epi16(v, dec);
}
If starting negative, the sequence would be -large, logarithmic until -128, linear until -4, -3, -2, -1, 0, -1, 0, -1, ...
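The tail of that sequence can be checked with a scalar model of one lane. This is a sketch with hypothetical names (sra7_floor, decay_step_nonzero), assuming the usual 2's complement representation for the bitwise OR on negative values.

```c
#include <stdint.h>

/* Floor right shift by 7 for one 16-bit lane, independent of how the
   compiler implements >> on negative values. */
static int16_t sra7_floor(int16_t x)
{
    return (int16_t)(x >= 0 ? x >> 7 : ~(~x >> 7));
}

/* One step of the non-zero-decrement variant: dec = (x >> 7) | 1. */
static int16_t decay_step_nonzero(int16_t x)
{
    int16_t dec = (int16_t)(sra7_floor(x) | 1);
    return (int16_t)(x - dec);
}
```

Small positive values now count down by 1 per step, and once the chain reaches 0 it alternates between -1 and 0.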
Your code got all-zeros because _mm_sra_epi16 uses the low 64 bits of the 2nd source vector as a 64-bit shift count that applies to all elements. Read the manual. So you shifted all the bits out of each 16-bit element.
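A scalar model makes that behaviour concrete. The sketch below uses a hypothetical function sra_epi16_model that takes arrays in memory order (element 0 first); it is meant to mirror the documented semantics of the instruction, where a count above 15 fills every element with its sign bit.

```c
#include <stdint.h>

/* Scalar model of _mm_sra_epi16: one 64-bit count, taken from the low
   four 16-bit elements of count_vec, is applied to every element of a. */
static void sra_epi16_model(int16_t out[8], const int16_t a[8],
                            const int16_t count_vec[8])
{
    uint64_t count = 0;
    for (int i = 0; i < 4; ++i)
        count |= (uint64_t)(uint16_t)count_vec[i] << (16 * i);

    for (int i = 0; i < 8; ++i) {
        if (count > 15) {
            out[i] = (int16_t)(a[i] < 0 ? -1 : 0); /* only sign bits remain */
        } else {
            int n = (int)count;
            /* floor shift without relying on >> of negative values */
            out[i] = (int16_t)(a[i] >= 0 ? a[i] >> n : ~(~a[i] >> n));
        }
    }
}
```

With the counts from the question, _mm_set_epi16(-1, 0, 1, 2, 3, 4, 5, 6) puts 6, 5, 4, 3 in the low four elements, so the combined count is 0x0003000400050006 and every element of 0x7fff becomes 0.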
It's not idiotic, but per-element shift counts require AVX2 (for 32/64-bit elements) or AVX-512 (AVX512BW for _mm_srav_epi16, AVX512F for 64-bit arithmetic right shifts), which would make sense for the way you're trying to use it. (But the shift count is unsigned, so -1 is also going to shift out all the bits.)
Indeed, that instruction should be named _mm_sra1_epi16()
Yup, that would make sense. But remember that when these were named, AVX2 _mm_srav_* didn't exist yet. Also, that specific name would not be ideal because 1 and i are not the most visually distinct. (i is for immediate: the psraw xmm1, imm8 form instead of the psraw xmm1, xmm2/m128 form of the asm instruction: http://felixcloutier.com/x86/PSRAW:PSRAD:PSRAQ.html).
The other way it makes sense is that the MMX/SSE2 asm instruction has two forms: immediate (with the same count for all elements of course), and vector. Instead of forcing you to broadcast the count to all elements, the vector version takes the scalar count in the bottom of a vector register. I think the intended use-case is after a movd xmm0, eax or something.
If you need per-element-variable shift counts without AVX512, see various Q&As about emulating it, e.g. Shifting 4 integers right by different values SIMD.
Some of the workarounds use multiplies by powers of 2 for variable left-shift, and then a right shift to put the data where needed. (But you need to somehow get the 1<<n SIMD vector prepared, so this works if the same set of counts is reused for many vectors, or especially if it's a compile-time constant).
With 16-bit elements, you can use just one _mm_mulhi_epi16 to do runtime-variable right shift counts with no precision loss or range limits. mulhi(x, y) is exactly like (x*(int)y) >> 16, so you can use y=1<<14 to right shift by 16-14 = 2 in that element.
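The multiply-high trick can be sanity-checked with a scalar model of one lane. This is a sketch with hypothetical names (mulhi_epi16_lane, shift_right_via_mulhi). In this signed model it covers counts from 2 to 16, because a count of 1 would need y = 1<<15, which does not fit in a signed 16-bit lane.

```c
#include <stdint.h>

/* Scalar model of one lane of _mm_mulhi_epi16: the high 16 bits of the
   signed 32-bit product. */
static int16_t mulhi_epi16_lane(int16_t x, int16_t y)
{
    int32_t p = (int32_t)x * (int32_t)y;
    /* floor shift without relying on >> of negative values */
    return (int16_t)(p >= 0 ? p >> 16 : ~(~p >> 16));
}

/* Arithmetic right shift of x by count (2..16) using one multiply-high. */
static int16_t shift_right_via_mulhi(int16_t x, unsigned count)
{
    int16_t y = (int16_t)(1 << (16 - count)); /* per-lane multiplier */
    return mulhi_epi16_lane(x, y);
}
```

In a vectorized version the multipliers would sit in one vector, one per element, which is why the approach suits counts that are reused across many input vectors.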

Converting Bytes to signed short and int value to bytes

I have two bytes stored as int values, for example 254 = 0xFE and 112 = 0x70.
I need to convert them to a signed short. In this case the signed short value should be -400.
Then, after changing that value, I have an integer, for example -410, that I need to convert back to two bytes.
How could I achieve that on iOS?
If the bytes are in the native architecture endianness, then it's as simple as
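uint8_t *p = someAddress;   // points at the two bytes
short value = *(short *)p;  // read both bytes as a native-endian short
value = -410;               // the new value from the question
*(short *)p = value;        // write it back as two bytes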
However, if the bytes are in a foreign endianness, you have to assemble the integer byte by byte, which is slower. Here is one, of many, examples.
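A byte-by-byte conversion might look like the sketch below. It assumes the two bytes are in big-endian order (high byte first), which is what makes 0xFE, 0x70 come out as -400. The function names be_bytes_to_i16 and i16_to_be_bytes are hypothetical.

```c
#include <stdint.h>

/* Combine two big-endian bytes into a signed 16-bit value. */
static int16_t be_bytes_to_i16(uint8_t hi, uint8_t lo)
{
    int32_t v = ((int32_t)hi << 8) | lo; /* 0 .. 65535 */
    if (v >= 0x8000)
        v -= 0x10000;                    /* map the upper half to negatives */
    return (int16_t)v;
}

/* Split a signed 16-bit value into two big-endian bytes. */
static void i16_to_be_bytes(int16_t value, uint8_t out[2])
{
    uint16_t u = (uint16_t)value;        /* well-defined wrap modulo 65536 */
    out[0] = (uint8_t)(u >> 8);          /* high byte */
    out[1] = (uint8_t)(u & 0xFF);        /* low byte */
}
```

Because it works on byte values rather than on memory layout, this version gives the same result regardless of the host's endianness.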

Calculating constants for CRC32 using PCLMULQDQ

I'm reading through the following paper on how to implement CRC32 efficiently using the PCLMULQDQ instruction introduced in Intel Westmere and AMD Bulldozer:
V. Gopal et al. "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction." 2009. http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf
I understand the algorithm, but one thing I'm not sure about is how to calculate the constants $k_i$. For example, they provide the constant values for the IEEE 802.3 polynomial:
k1 = x^(4*128+64) mod P(x) = 0x8833794C
k4 = x^128 mod P(x) = 0xE8A45605
mu = x^64 div P(x) = 0x104D101DF
and so on. I can just use these constants as I only need to support the one polynomial, but I'm interested: how did they calculate those numbers? I can't just use a typical bignum implementation (e.g. the one provided by Python) because the arithmetic must happen in GF(2).
It's just like regular division, except you exclusive-or instead of subtract. So start with the most significant 1 in the dividend. Exclusive-or the dividend by the polynomial, lining up the most significant 1 of the polynomial with that 1 in the dividend to turn it into a zero. Repeat until you have eliminated all of the 1's above the low n bits, where n is the order of the polynomial. The result is the remainder.
Make sure that your polynomial has the high term in the n+1th bit. I.e., use 0x104C11DB7, not 0x4C11DB7.
If you want the quotient (which you wrote as "div"), then keep track of the positions of the 1's you eliminated. That set, shifted down by n, is the quotient.
Here is how:
/* Placed in the public domain by Mark Adler, Jan 18, 2014. */
#include <stdio.h>
#include <inttypes.h>
/* Polynomial type -- must be an unsigned integer type. */
typedef uintmax_t poly_t;
#define PPOLY PRIxMAX
/* Return x^n mod p(x) over GF(2). x^deg is the highest power of x in p(x).
   The positions of the bits set in poly represent the remaining powers of x in
   p(x). In addition, returned in *div are as many of the least significant
   quotient bits as will fit in a poly_t. */
static poly_t xnmodp(unsigned n, poly_t poly, unsigned deg, poly_t *div)
{
    poly_t mod, mask, high;
    if (n < deg) {
        *div = 0;
        return (poly_t)1 << n; /* x^n is already reduced when n < deg */
    }
    mask = ((poly_t)1 << deg) - 1;
    poly &= mask;
    mod = poly;
    *div = 1;
    deg--;
    while (--n > deg) {
        high = (mod >> deg) & 1;
        *div = (*div << 1) | high; /* quotient bits may be lost off the top */
        mod <<= 1;
        if (high)
            mod ^= poly;
    }
    return mod & mask;
}
/* Compute and show x^n modulo the IEEE 802.3 CRC-32 polynomial. If showdiv is
   true, also show the low bits of the quotient. */
static void show(unsigned n, int showdiv)
{
    poly_t div;
    printf("x^%u mod p(x) = %#" PPOLY "\n", n, xnmodp(n, 0x4C11DB7, 32, &div));
    if (showdiv)
        printf("x^%u div p(x) = %#" PPOLY "\n", n, div);
}
/* Compute the constants required to use PCLMULQDQ to compute the IEEE 802.3
   32-bit CRC. These results appear on page 16 of the Intel paper "Fast CRC
   Computation Using PCLMULQDQ Instruction". */
int main(void)
{
    show(4*128+64, 0);
    show(4*128, 0);
    show(128+64, 0);
    show(128, 0);
    show(96, 0);
    show(64, 1);
    return 0;
}
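For dividends small enough to fit in 64 bits, the long division described in the prose can also be written out literally, which may help when checking the idea by hand. This is a sketch with hypothetical names (gf2_top_bit, gf2_divmod). It assumes poly is non-zero and includes its high term, e.g. 0x104C11DB7.

```c
#include <stdint.h>

/* Index of the most significant set bit, or -1 for zero. */
static int gf2_top_bit(uint64_t v)
{
    int n = -1;
    while (v) {
        v >>= 1;
        ++n;
    }
    return n;
}

/* Divide dividend by poly over GF(2). Returns the remainder and stores the
   quotient in *quot. poly must be non-zero and include its high term. */
static uint64_t gf2_divmod(uint64_t dividend, uint64_t poly, uint64_t *quot)
{
    int deg = gf2_top_bit(poly);
    uint64_t q = 0;
    for (int bit = 63; bit >= deg; --bit) {
        if ((dividend >> bit) & 1) {
            dividend ^= poly << (bit - deg); /* cancel this 1 */
            q |= (uint64_t)1 << (bit - deg); /* record where it was */
        }
    }
    *quot = q;
    return dividend;
}
```

For example, dividing x^3 + x + 1 (0xB) by x + 1 (0x3) gives quotient x^2 + x (0x6) and remainder 1. Powers such as x^64 do not fit in a 64-bit dividend, which is why the program above shifts in one bit at a time instead.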

NEON acceleration for 12-bit to 8-bit

I have a buffer of 12-bit data (stored as 16-bit values)
and need to convert it to 8-bit (shift right by 4).
How can NEON accelerate this processing?
Thank you for your help
Brahim
I took the liberty of assuming a few things, explained below, but this kind of code (untested, may require a few modifications) should provide a good speedup compared to a naive non-NEON version:
#include <arm_neon.h>
#include <stdint.h>
void convert(const uint16_t *restrict input, // the buffer to convert
             uint8_t *restrict output,       // the buffer in which to store the result
             int sz) {                       // their (common) size
    /* Assuming the buffer size is a multiple of 8 */
    for (int i = 0; i < sz; i += 8) {
        // Load a vector of 8 16-bit values:
        uint16x8_t v = vld1q_u16(input + i);
        // Shift it by 4 to the right, narrowing it to 8-bit values.
        uint8x8_t shifted = vshrn_n_u16(v, 4);
        // Store it in the output buffer
        vst1_u8(output + i, shifted);
    }
}
Things I assumed here:
that you're working with unsigned values. If it's not the case, it will be easy to adapt anyway (uint* -> int*, *_u8->*_s8 and *_u16->*_s16)
as the values are loaded 8 by 8, I assumed the buffer length was a multiple of 8 to avoid edge cases. If that's not the case, you should probably pad it artificially to a multiple of 8.
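As an alternative to padding, the elements left over after the last full group of 8 could be handled by a plain scalar loop, which also serves as a reference to test the NEON version against. A minimal sketch, with the hypothetical name convert_scalar:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: drop the low 4 bits of each 12-bit sample.
   Assumes unsigned samples whose upper 4 bits are zero. */
static void convert_scalar(const uint16_t *input, uint8_t *output, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        output[i] = (uint8_t)(input[i] >> 4);
}
```

A caller could run the vector loop over the first sz - sz % 8 elements and pass the remaining ones to this function.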
Finally, the 2 resource pages used from the NEON documentation:
about loads and stores of vectors.
about shifting vectors.
Hope this helps!
prototype : void dataConvert(void * pDst, void * pSrc, unsigned int count);
1:
vld1.16 {q8-q9}, [r1]!
vld1.16 {q10-q11}, [r1]!
vqrshrn.u16 d16, q8, #4
vqrshrn.u16 d17, q9, #4
vqrshrn.u16 d18, q10, #4
vqrshrn.u16 d19, q11, #4
vst1.16 {q8-q9}, [r0]!
subs r2, #32
bgt 1b
q flag : saturation
r flag : rounding
change u16 to s16 in case of signed data.
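What the two flags change for a single unsigned element can be modelled in scalar C. This is a sketch with the hypothetical name qrshrn_u16_by4: rounding adds half of the discarded range before shifting, and saturation clamps the result to 255.

```c
#include <stdint.h>

/* Scalar model of one element of vqrshrn.u16 with a shift of 4. */
static uint8_t qrshrn_u16_by4(uint16_t x)
{
    uint32_t rounded = ((uint32_t)x + 8u) >> 4;         /* r: round to nearest */
    return (uint8_t)(rounded > 255u ? 255u : rounded);  /* q: saturate */
}
```

For true 12-bit input, saturation only matters for values from 0x0FF8 upwards, where rounding would otherwise carry into a ninth bit.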

What are the upper and lower limits and types of pixel values in OpenCV?

What are the upper and lower limits of pixel values in OpenCV and how can I get them?
The only limits I could figure out are CV_8U type Mat's, where the lower limit for pixel values in a channel is 0, the upper is 255. What are these values for other Mat's?
Say CV_32F, CV_32S?
OpenCV Equivalent C/C++ data types:
CV_8U -> unsigned char (min = 0, max = 255)
CV_8S -> signed char (min = -128, max = 127)
CV_16U -> unsigned short (min = 0, max = 65535)
CV_16S -> short (min = -32768, max = 32767)
CV_32S -> int (min = -2147483648, max = 2147483647)
CV_32F -> float
CV_64F -> double
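The limits in this list can be obtained from the standard C headers rather than typed by hand. The sketch below is plain C and does not call OpenCV: the enum pixel_depth and the function depth_range are hypothetical stand-ins for the type codes above, and the floating-point entries report the largest finite values of float and double.

```c
#include <float.h>
#include <stdint.h>

/* Hypothetical stand-ins for the OpenCV type codes listed above. */
enum pixel_depth {
    DEPTH_8U, DEPTH_8S, DEPTH_16U, DEPTH_16S, DEPTH_32S, DEPTH_32F, DEPTH_64F
};

/* Store the representable range of a depth in *lo and *hi.
   Returns 1 on success, 0 for an unknown depth. */
static int depth_range(enum pixel_depth d, double *lo, double *hi)
{
    switch (d) {
    case DEPTH_8U:  *lo = 0;         *hi = UINT8_MAX;  return 1;
    case DEPTH_8S:  *lo = INT8_MIN;  *hi = INT8_MAX;   return 1;
    case DEPTH_16U: *lo = 0;         *hi = UINT16_MAX; return 1;
    case DEPTH_16S: *lo = INT16_MIN; *hi = INT16_MAX;  return 1;
    case DEPTH_32S: *lo = INT32_MIN; *hi = INT32_MAX;  return 1;
    case DEPTH_32F: *lo = -FLT_MAX;  *hi = FLT_MAX;    return 1;
    case DEPTH_64F: *lo = -DBL_MAX;  *hi = DBL_MAX;    return 1;
    }
    return 0;
}
```

A double holds every one of these bounds exactly, so the same return type works for the integer and floating-point depths.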
Check this tutorial for data type ranges.
One thing to consider is that when displaying images of type CV_32F or CV_64F with imshow or cvShowImage, OpenCV expects values to be normalized between 0.0 and 1.0. Otherwise, it saturates the pixel values.
CV_32F means a 32 bit floating point number. CV_32S means a 32 bit signed integer. I'm sure you can guess what CV_64F stands for. The internet is full of references for the ranges that different data types can take on, here is 32S for instance.

Resources