Related
I am looking for efficient AVX (AVX512) implementation of
// Given
float u[8];
float v[8];
// Compute
float a[8];
float b[8];
// Such that
for ( int i = 0; i < 8; ++i )
{
a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
b[i] = fabs(u[i]) < fabs(v[i]) ? u[i] : v[i];
}
I.e., I need to select element-wise into a from u and v based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.
I had this exact same problem just the other day. The solution I came up with (using AVX only) was:
// take the absolute value of u and v
__m256 sign_bit = _mm256_set1_ps(-0.0f);
__m256 u_abs = _mm256_andnot_ps(sign_bit, u);
__m256 v_abs = _mm256_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__m256 u_ge_v = _mm256_cmp_ps(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
__m256 a = _mm256_blendv_ps(u, v, u_ge_v);
__m256 b = _mm256_blendv_ps(v, u, u_ge_v);
The AVX512 equivalent would be:
// take the absolute value of u and v
__m512 sign_bit = _mm512_set1_ps(-0.0f);
__m512 u_abs = _mm512_andnot_ps(sign_bit, u);
__m512 v_abs = _mm512_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__mmask16 u_ge_v = _mm512_cmp_ps_mask(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
__m512 a = _mm512_mask_blend_ps(u_ge_v, u, v);
__m512 b = _mm512_mask_blend_ps(u_ge_v, v, u);
As Peter Cordes suggested in the comments above, there are other approaches as well like taking the absolute value followed by a min/max and then reinserting the sign bit, but I couldn't find anything that was shorter/lower latency than this sequence of instructions.
Actually, there is another approach using AVX512DQ's VRANGEPS via the _mm512_range_ps() intrinsic. Intel's intrinsic guide describes it as follows:
Calculate the max, min, absolute max, or absolute min (depending on control in imm8) for packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst. imm8[1:0] specifies the operation control: 00 = min, 01 = max, 10 = absolute max, 11 = absolute min. imm8[3:2] specifies the sign control: 00 = sign from a, 01 = sign from compare result, 10 = clear sign bit, 11 = set sign bit.
Note that there appears to be a typo in the above; actually imm8[3:2] == 10 is "absolute min" and imm8[3:2] == 11 is "absolute max" if you look at the details of the per-element operation:
CASE opCtl[1:0] OF
0: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src1[31:0] : src2[31:0]
1: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src2[31:0] : src1[31:0]
2: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src1[31:0] : src2[31:0]
3: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src2[31:0] : src1[31:0]
ESAC
CASE signSelCtl[1:0] OF
0: dst[31:0] := (src1[31] << 31) OR (tmp[30:0])
1: dst[31:0] := tmp[63:0]
2: dst[31:0] := (0 << 31) OR (tmp[30:0])
3: dst[31:0] := (1 << 31) OR (tmp[30:0])
ESAC
RETURN dst
So you can get the same result with just two instructions:
auto a = _mm512_range_ps(v, u, 0x7); // 0b0111 = sign from compare result, absolute max
auto b = _mm512_range_ps(v, u, 0x6); // 0b0110 = sign from compare result, absolute min
The argument order (v, u) is a bit unintuitive, but it's needed in order to get the same behavior that you described in the OP in the event that the elements have equal absolute value (namely, that the value from u is passed through to a, and v goes to b).
On Skylake and Ice Lake Xeon platforms (probably any of the Xeons that have dual FMA units, probably?), VRANGEPS has throughput 2, so the two checks can issue and execute simultaneously, with latency of 4 cycles. This is only a modest latency improvement on the original approach, but the throughput is better and it requires fewer instructions/uops/instruction cache space.
clang does a pretty reasonable job of auto-vectorizing it with -ffast-math and the necessary __restrict qualifiers: https://godbolt.org/z/NMvN1u. and both inputs to ABS them, compare once, vblendvps twice on the original inputs with the same mask but the other sources in the opposite order to get min and max.
That's pretty much what I was thinking before checking what compilers did, and looking at their output to firm up the details I hadn't thought through yet. I don't see anything more clever than that. I don't think we can avoid abs()ing both a and b separately; there's no cmpps compare predicate that compares magnitudes and ignores the sign bit.
// untested: I *might* have reversed min/max, but I think this is right.
#include <immintrin.h>
// returns min_abs
__m256 minmax_abs(__m256 u, __m256 v, __m256 *max_result) {
const __m256 signbits = _mm256_set1_ps(-0.0f);
__m256 abs_u = _mm256_andnot_ps(signbits, u);
__m256 abs_v = _mm256_andnot_ps(signbits, v); // strip the sign bit
__m256 maxabs_is_v = _mm256_cmp_ps(abs_u, abs_v, _CMP_LT_OS); // u < v
*max_result = _mm256_blendv_ps(v, u, maxabs_is_v);
return _mm256_blendv_ps(u, v, maxabs_is_v);
}
You'd do the same thing with AVX512 except you compare into a mask instead of another vector.
// returns min_abs
__m512 minmax_abs512(__m512 u, __m512 v, __m512 *max_result) {
const __m512 absmask = _mm512_castsi512_ps(_mm512_set1_epi32(0x7fffffff));
__m512 abs_u = _mm512_and_ps(absmask, u);
__m512 abs_v = _mm512_and_ps(absmask, v); // strip the sign bit
__mmask16 maxabs_is_v = _mm512_cmp_ps_mask(abs_u, abs_v, _CMP_LT_OS); // u < v
*max_result = _mm512_mask_blend_ps(maxabs_is_v, v, u);
return _mm512_mask_blend_ps(maxabs_is_v, u, v);
}
Clang compiles the return statement in an interesting way (Godbolt):
.LCPI2_0:
.long 2147483647 # 0x7fffffff
minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*): # #minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*)
vbroadcastss zmm2, dword ptr [rip + .LCPI2_0]
vandps zmm3, zmm0, zmm2
vandps zmm2, zmm1, zmm2
vcmpltps k1, zmm3, zmm2
vblendmps zmm2 {k1}, zmm1, zmm0
vmovaps zmmword ptr [rdi], zmm2 ## store the blend result
vmovaps zmm0 {k1}, zmm1 ## interesting choice: blend merge-masking
ret
Instead of using another vblendmps, clang notices that zmm0 already has one of the blend inputs, and uses merge-masking with a regular vector vmovaps. This has zero advantage of Skylake-AVX512 for 512-bit vblendmps (both single-uop instructions for port 0 or 5), but if Agner Fog's instruction tables are right, vblendmps x/y/zmm only ever runs on port 0 or 5, but a masked 256-bit or 128-bit vmovaps x/ymm{k}, x/ymm can run on any of p0/p1/p5.
Both are single-uop / single-cycle latency, unlike AVX2 vblendvps based on a mask vector which is 2 uops. (So AVX512 is an advantage even for 256-bit vectors). Unfortunately, none of gcc, clang, or ICC turn the _mm256_cmp_ps into _mm256_cmp_ps_mask and optimize the AVX2 intrinsics to AVX512 instructions when compiling with -march=skylake-avx512.)
s/512/256/ to make a version of minmax_abs512 that uses AVX512 for 256-bit vectors.
Gcc goes even further, and does the questionable "optimization" of
vmovaps zmm2, zmm1 # tmp118, v
vmovaps zmm2{k1}, zmm0 # tmp118, tmp114, tmp118, u
instead of using one blend instruction. (I keep thinking I'm seeing a store followed by a masked store, but no, neither compiler is blending that way).
I am using Lua on Redis and want to compare two signed 64-bit numbers, which are stored in two 8-byte/character strings.
How can I compare them using the libraries available in Redis?
http://redis.io/commands/EVAL#available-libraries
I'd like to know >/< and == checks. I think this probably involves pulling two 32-bit numbers for each 64-bit int, and doing some clever math on those, but I am not sure.
I have some code to make this less abstract. a0, a1, b0, b1 are all 32 bit numbers used to represent the msb & lsb's of two 64-bit signed int 64s:
-- ...
local comp_int64s = function (a0, a1, b0, b1)
local cmpres = 0
-- TOOD: Real comparison
return cmpres
end
local l, a0, a1, b0, b1
a0, l = bit.tobit(struct.unpack("I4", ARGV[1]))
a1, l = bit.tobit(struct.unpack("I4", ARGV[1], 5))
b0, l = bit.tobit(struct.unpack("I4", blob))
b1, l = bit.tobit(struct.unpack("I4", blob, 5))
print("Cmp result", comp_int64s(a0, a1, b0, b1))
EDIT: Added code
I came up with a method that looks like it's working. It's a little ugly though.
The first step is to compare top 32 bits as 2 compliment #’s
MSB sign bit stays, so numbers keep correct relations
-1 —> -1
0 —> 0
9223372036854775807 = 0x7fff ffff ffff ffff -> 0x7ffff ffff = 2147483647
So returning the result from the MSB's works unless they are equal, then the LSB's need to get checked.
I have a few cases to establish the some patterns:
-1 = 0xffff ffff ffff ffff
-2 = 0xffff ffff ffff fffe
32 bit is:
-1 -> 0xffff ffff = -1
-2 -> 0xffff fffe = -2
-1 > -2 would be like -1 > -2 : GOOD
And
8589934591 = 0x0000 0001 ffff ffff
8589934590 = 0x0000 0001 ffff fffe
32 bit is:
8589934591 -> ffff ffff = -1
8589934590 -> ffff fffe = -2
8589934591 > 8589934590 would be -1 > -2 : GOOD
The sign bit on MSB’s doesn’t matter b/c negative numbers have the same relationship between themselves as positive numbers. e.g regardless of sign bit, lsb values of 0xff > 0xfe, always.
What about if the MSB on the lower 32 bits is different?
0xff7f ffff 7fff ffff = -36,028,799,166,447,617
0xff7f ffff ffff ffff = -36,028,797,018,963,969
32 bit is:
-..799.. -> 0x7fff ffff = 2147483647
-..797.. -> 0xffff ffff = -1
-..799.. < -..797.. would be 2147483647 < -1 : BAD!
So we need to ignore the sign bit on the lower 32 bits. And since the relationships are the same for the LSBs regardless of sign, just using
the lowest 32 bits unsigned works for all cases.
This means I want signed for the MSB's and unsigned for the LSBs - so chaging I4 to i4 for the LSBs. Also making big endian official and using '>' on the struct.unpack calls:
-- ...
local comp_int64s = function (as0, au1, bs0, bu1)
if as0 > bs0 then
return 1
elseif as0 < bs0 then
return -1
else
-- msb's equal comparing lsbs - these are unsigned
if au1 > bu1 then
return 1
elseif au1 < bu1 then
return -1
else
return 0
end
end
end
local l, as0, au1, bs0, bu1
as0, l = bit.tobit(struct.unpack(">i4", ARGV[1]))
au1, l = bit.tobit(struct.unpack(">I4", ARGV[1], 5))
bs0, l = bit.tobit(struct.unpack(">i4", blob))
bu1, l = bit.tobit(struct.unpack(">I4", blob, 5))
print("Cmp result", comp_int64s(as0, au1, bs0, bu1))
Comparing is a simple string compare s1 == s2.
Greater than is when not s1 == s2 and i1 < i2.
Less than is the real work. string.byte allows to get single bytes as unsigned char. In case of unsigned integer, you would just have to check bytes-downwards: b1==b2 -> check next byte; through all bytes -> false (equal); b1>b2 -> false (greater than); b1<b2 -> true. Signed requires more steps: first check the sign bit (uppermost byte >127). If sign 1 is set but not sign 2, integer 1 is negative but not integer 2 -> true. The opposite would obviously result in false. When both signs are equal, you can do the unsigned processing.
When you can pack more bytes to an integer, it's fine too, but you have to adjust the sign bit check. When you have LuaJIT, you can use the ffi library to cast your string into a byte array into an int64.
I need set some bits in ByteData at position counted in bits.
How I can do this?
Eg.
var byteData = new ByteData(1024);
var bitData = new BitData(byteData);
// Offset in bits: 387
// Number of bits: 5
// Value: 3
bitData.setBits(387, 5, 3);
Yes it is quite complicated. I dont know dart, but these are the general steps you need to take. I will label each variable as a letter and also use a more complicated example to show you what happens when the bits overflow.
1. Construct the BitData object with a ByteData object (A)
2. Call setBits(offset (B), bits (C), value (D));
I will use example values of:
A: 11111111 11111111 11111111 11111111
B: 7
C: 10
D: 00000000 11111111
3. Rather than using an integer with a fixed length of bits, you could
use another ByteData object (D) containing your bits you want to write.
Also create a mask (E) containing the significant bits.
e.g.
A: 11111111 11111111 11111111 11111111
D: 00000000 11111111
E: 00000011 11111111 (2^C - 1)
4. As an extra bonus step, we can make sure the insignificant
bits are really zero by ANDing with the bitmask.
D = D & E
D 00000000 11111111
E 00000011 11111111
5. Make sure D and E contain at least one full zero byte since we want
to shift them.
D 00000000 00000000 11111111
E 00000000 00000011 11111111
6. Work out these two integer values:
F = The extra bit offset for the start byte: B mod 8 (e.g. 7)
G = The insignificant bits: size(D) - C (e.g. 14)
7. H = G-F which should not be negative here. (e.g. 14-7 = 7)
8. Shift both D and E left by H bits.
D 00000000 01111111 10000000
E 00000001 11111111 10000000
9. Work out first byte number (J) floor(B / 8) e.g. 0
10. Read the value of A at this index out and let this be K
K = 11111111 11111111 11111111
11. AND the current (K) with NOT E to set zeros for the new bits.
Then you can OR the new bits over the top.
L = (K & !E) | D
K & !E = 11111110 00000000 01111111
L = 11111110 01111111 11111111
12. Write L to the same place you read it from.
There is no BitData class, so you'll have to do some of the bit-pushing yourself.
Find the corresponding byte offset, read in some bytes, mask out the existing bits and set the new ones at the correct bit offset, then write it back.
The real complexity comes when you need to store more bits than you can read/write in a single operation.
For endianness, if you are treating the memory as a sequence of bits with arbitrary width, I'd go for little-endian. Endianness only really makes sense for full-sized (2^n-bit, n > 3) integers. A 5 bit integer as the one you are storing can't have any endianness, and a 37 bit integer also won't have any natural way of expressing an endianness.
You can try something like this code (which can definitely be optimized more):
import "dart:typed_data";
void setBitData(ByteBuffer buffer, int offset, int length, int value) {
assert(value < (1 << length));
assert(offset + length < buffer.lengthInBytes * 8);
int byteOffset = offset >> 3;
int bitOffset = offset & 7;
if (length + bitOffset <= 32) {
ByteData data = new ByteData.view(buffer);
// Can update it one read/modify/write operation.
int mask = ((1 << length) - 1) << bitOffset;
int bits = data.getUint32(byteOffset, Endianness.LITTLE_ENDIAN);
bits = (bits & ~mask) | (value << bitOffset);
data.setUint32(byteOffset, bits, Endianness.LITTLE_ENDIAN);
return;
}
// Split the value into chunks of no more than 32 bits, aligned.
do {
int bits = (length > 32 ? 32 : length) - bitOffset;
setBitData(buffer, offset, bits, value & ((1 << bits) - 1));
offset += bits;
length -= bits;
value >>= bits;
bitOffset = 0;
} while (length > 0);
}
Example use:
main() {
var b = new Uint8List(32);
setBitData(b.buffer, 3, 8, 255);
print(b.map((v)=>v.toRadixString(16)));
setBitData(b.buffer, 13, 6*4, 0xffffff);
print(b.map((v)=>v.toRadixString(16)));
setBitData(b.buffer, 47, 21*4, 0xaaaaaaaaaaaaaaaaaaaaa);
print(b.map((v)=>v.toRadixString(16)));
}
I have a specific requirement to convert a stream of bytes into a character encoding that happens to be 6-bits per character.
Here's an example:
Input: 0x50 0x11 0xa0
Character Table:
010100 T
000001 A
000110 F
100000 SPACE
Output: "TAF "
Logically I can understand how this works:
Taking 0x50 0x11 0xa0 and showing as binary:
01010000 00010001 10100000
Which is "TAF ".
What's the best way to do this programmatically (pseudo code or c++). Thank you!
Well, every 3 bytes, you end up with four characters. So for one thing, you need to work out what to do if the input isn't a multiple of three bytes. (Does it have padding of some kind, like base64?)
Then I'd probably take each 3 bytes in turn. In C#, which is close enough to pseudo-code for C :)
for (int i = 0; i < array.Length; i += 3)
{
// Top 6 bits of byte i
int value1 = array[i] >> 2;
// Bottom 2 bits of byte i, top 4 bits of byte i+1
int value2 = ((array[i] & 0x3) << 4) | (array[i + 1] >> 4);
// Bottom 4 bits of byte i+1, top 2 bits of byte i+2
int value3 = ((array[i + 1] & 0xf) << 2) | (array[i + 2] >> 6);
// Bottom 6 bits of byte i+2
int value4 = array[i + 2] & 0x3f;
// Now use value1...value4, e.g. putting them into a char array.
// You'll need to decode from the 6-bit number (0-63) to the character.
}
Just in case if someone is interested - another variant that extracts 6-bit numbers from the stream as soon as they appear there. That is, results can be obtained even if less then 3 bytes are currently read. Would be useful for unpadded streams.
The code saves the state of the accumulator a in variable n which stores the number of bits left in accumulator from the previous read.
int n = 0;
unsigned char a = 0;
unsigned char b = 0;
while (read_byte(&byte)) {
// save (6-n) most significant bits of input byte to proper position
// in accumulator
a |= (b >> (n + 2)) & (077 >> n);
store_6bit(a);
a = 0;
// save remaining least significant bits of input byte to proper
// position in accumulator
a |= (b << (4 - n)) & ((077 << (4 - n)) & 077);
if (n == 4) {
store_6bit(a);
a = 0;
}
n = (n + 2) % 6;
}
Let's say i have this bit field value: 10101001
How would i test if any other value differs in any n bits. Without considering
the positions?
Example:
10101001
10101011 --> 1 bit different
10101001
10111001 --> 1 bit different
10101001
01101001 --> 2 bits different
10101001
00101011 --> 2 bits different
I need to make a lot of this comparisons so i'm primarily looking for perfomance but any
hint is very welcome.
Take the XOR of the two fields and do a population count of the result.
if you XOR the 2 values together, you are left only with the bits that are different.
You then only need to count the bits which are still 1 and you have your answer
in c:
unsigned char val1=12;
unsigned char val2=123;
unsigned char xored = val1 ^ val2;
int i;
int numBits=0;
for(i=0; i<8; i++)
{
if(xored&1) numBits++;
xored>>=1;
}
although there are probably faster ways to count the bits in a byte
(you could for instance use a lookuptable for 256 values)
Just like everybody else said, use XOR to determine what's different and then use one of these algorithms to count.
This gets the bit difference between the values and counts the bits three at a time:
public static int BitDifference(int a, int b) {
int cnt = 0, bits = a ^ b;
while (bits != 0) {
cnt += (0xE994 >> ((bits & 7) << 1)) & 3;
bits >>= 3;
}
return cnt;
}
XOR the numbers, then the problem becomes a matter of counting the 1s in the result.
In Java:
Integer.bitCount(a ^ b)
Comparison is performed with XOR, as others already answered.
counting can be performed in several ways:
shift left and addition.
lookup in a table.
logic formulas that you can find with Karnaugh maps.