SSE instruction to sum 32 bit integers to 64 bit - sse

I'm looking for an SSE instruction which takes two arguments of four 32-bit integers in __m128i, computes the sum of corresponding pairs, and returns the result as two 64-bit integers in __m128i.
Is there an instruction for this?

There are no SSE add operations with carry. The way to do this is to first unpack the 32-bit integers (punpckldq/punpckhdq) against an all-zeroes helper vector, giving 4 vectors of zero-extended 64-bit integers, and then use 64-bit vertical addition (paddq).
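A minimal intrinsics sketch of that approach (assuming unsigned inputs, since unpacking with zeros zero-extends; the function and variable names are mine, not from the answer):

#include <immintrin.h>

// Widening 32+32 -> 64-bit add of corresponding elements, using unpack
// against an all-zeroes vector (zero-extension) followed by paddq.
static inline void add_u32_to_u64(__m128i a, __m128i b,
                                  __m128i *sum_lo, __m128i *sum_hi)
{
    __m128i zero = _mm_setzero_si128();
    __m128i a_lo = _mm_unpacklo_epi32(a, zero);  // punpckldq: a0, a1 zero-extended to 64 bits
    __m128i a_hi = _mm_unpackhi_epi32(a, zero);  // punpckhdq: a2, a3 zero-extended to 64 bits
    __m128i b_lo = _mm_unpacklo_epi32(b, zero);
    __m128i b_hi = _mm_unpackhi_epi32(b, zero);
    *sum_lo = _mm_add_epi64(a_lo, b_lo);         // paddq: a0+b0, a1+b1
    *sum_hi = _mm_add_epi64(a_hi, b_hi);         // paddq: a2+b2, a3+b3
}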

SSE only has this for byte->word and word->dword. (pmaddubsw (SSSE3) and pmaddwd (MMX/SSE2), which vertically multiply v1 * v2, then horizontally add neighbouring pairs.)
I'm not clear on what you want the outputs to be. You have 8 input integers (two vectors of 4), and 2 output integers (one vector of two). Since there's no insn that does any kind of 32+32 -> 64b vector addition, let's just look at how to zero-extend or sign-extend the low two 32b elements of a vector to 64b. You can combine this into whatever you need, but keep in mind there's no add-horizontal-pairs phaddq, only vertical paddq.
phaddd is similar to what you want, but without the widening: low half of the result is the sum of horizontal pairs in the first operand, high half is the sum of horizontal pairs in the second operand. It's pretty much only worth using if you need all those results, and you're not going to combine them further. (i.e. it's usually faster to shuffle and vertical add instead of running phadd to horizontally sum a vector accumulator at the end of a reduction. And if you're going to sum everything down to one result, do normal vertical sums until you're down to one register.) phaddd could be implemented in hardware to be as fast as paddd (single cycle latency and throughput), but it isn't in any AMD or Intel CPU.
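For illustration, the shuffle-plus-vertical-add horizontal sum of a single __m128i accumulator mentioned above looks like this (a sketch; the helper name is mine):

#include <immintrin.h>

// Horizontal sum of the four 32-bit elements of v using shuffles and
// vertical adds instead of phaddd.
static inline int hsum_epi32(__m128i v)
{
    __m128i hi64  = _mm_unpackhi_epi64(v, v);                          // move high 64 bits down
    __m128i sum64 = _mm_add_epi32(v, hi64);                            // v0+v2, v1+v3, ...
    __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast element 1
    __m128i sum32 = _mm_add_epi32(sum64, hi32);                        // total in element 0
    return _mm_cvtsi128_si32(sum32);
}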
Like Mysticial commented, SSE4.1 pmovzxdq / pmovsxdq are exactly what you need, and can even do it on the fly as part of a load from a 64b memory location (containing two 32b integers).
SSE4.1 was introduced with Intel Penryn, 2nd gen Core2 (45nm die shrink core2), the generation before Nehalem. Falling back to a non-vector code path on CPUs older than that might be ok, depending on how much you care about not being slow on CPUs that are already old and slow.
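With SSE4.1 available, a sketch of the widening add using those instructions' intrinsics (assuming signed inputs and that only the low two elements of each vector are wanted; use _mm_cvtepu32_epi64 for the unsigned case; names are mine):

#include <immintrin.h>

// SSE4.1: widen the low two 32-bit elements of each input to 64 bits
// (pmovsxdq), then do a 64-bit vertical add (paddq).
static inline __m128i add_lo_i32_to_i64(__m128i a, __m128i b)
{
    __m128i a64 = _mm_cvtepi32_epi64(a);   // pmovsxdq: sign-extend a0, a1
    __m128i b64 = _mm_cvtepi32_epi64(b);   // pmovsxdq: sign-extend b0, b1
    return _mm_add_epi64(a64, b64);        // paddq
}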
Without SSE4.1:
Unsigned zero-extension is easy. Like pmdj answered, just use punpck* lo and hi to unpack with zero.
If your integers are signed, you'll have to do the sign-extension manually.
There is no psraq, only psrad (Packed Shift Right Arithmetic Dword) and psraw. If there were, you could unpack the vector with itself and then arithmetic-right-shift by 32b.
Instead, we probably need to generate a vector where each element is turned into its sign bit. Then blend that with an unpacked vector (but pblendw is SSE4.1 too, so we'd have to use por).
Or better, unpack the original vector with a vector of sign-masks.
# input in xmm0
movdqa xmm1, xmm0
movdqa xmm2, xmm0
psrad xmm0, 31 ; xmm0 = all-ones or all-zeros depending on sign of input elements
; xmm0 = signmask; xmm1 = orig; xmm2 = orig
punpckldq xmm1, xmm0 ; xmm1 = sign-extend(lo64(orig))
punpckhdq xmm2, xmm0 ; xmm2 = sign-extend(hi64(orig))
This should run with 2 cycle latency for both results on Intel SnB or IvB. Haswell and later only have one shuffle port (so they can't do both punpck insns in parallel), so xmm2 will be delayed for another cycle there. Pre-SnB Intel CPUs usually bottleneck on the frontend (decoders, etc) with vector instructions, because they often average more than 4B per insn.
Shifting the original instead of the copy shortens the dependency chain for whatever produces xmm0, for CPUs without move elimination (handling mov instructions at the register-rename stage, so they're zero latency. Intel-only, and only on IvB and later.) With 3-operand AVX instructions, you wouldn't need the movdqa, or the 3rd register, but then you could just use vpmovsx for the low64 anyway. To sign-extend the high 64, you'd probably psrldq byte-shift the high 64 down to the low 64.
Or movhlps or punpckhqdq self,self to use a shorter-to-encode instruction. (or AVX2 vpmovsx to a 256b reg, and then vextracti128 the upper 128, to get both 128b results with only two instructions.)
Unlike GP-register shifts (e.g. sar eax, 31), vector shifts saturate the count instead of masking it. Leaving the original sign bit as the LSB (shifting by 31) instead of a copy of it (shifting by 32) works fine, too. It has the advantage of not requiring a big comment in the code explaining this for people who would worry when they saw psrad xmm0, 32.
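For reference, the same manual sign-extension expressed with SSE2 intrinsics (a sketch of the asm above; names are mine):

#include <immintrin.h>

// SSE2-only sign extension of all four 32-bit elements of v to 64 bits.
static inline void sign_extend_epi32_to_epi64(__m128i v, __m128i *lo, __m128i *hi)
{
    __m128i signs = _mm_srai_epi32(v, 31);   // psrad: all-ones for negative elements, else all-zeros
    *lo = _mm_unpacklo_epi32(v, signs);      // punpckldq: sign_extend(v0), sign_extend(v1)
    *hi = _mm_unpackhi_epi32(v, signs);      // punpckhdq: sign_extend(v2), sign_extend(v3)
}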

Related

Bitwise xor of two 256-bit integers

I have an AVX CPU (which doesn't support AVX2), and I want to compute the bitwise XOR of two 256-bit integers.
Since _mm256_xor_si256 is only available on AVX2, can I load these 256 bits as __m256 using _mm256_load_ps and then do a _mm256_xor_ps? Will this generate the expected result?
My major concern is: if the memory content is not a valid floating-point number, will _mm256_load_ps load the bits into the register exactly as they are in memory?
Thanks.
First of all, if you're doing other things with your 256b integers (like adding/subtracting/multiplying), getting them into vector registers just for the occasional XOR may not be worth the overhead of transferring them. If you have two numbers already in registers (using up 8 total registers), it's only four xor instructions to get the result (and 4 mov instructions if you need to avoid overwriting the destination). The destructive version can run at one per 1.33 clock cycles on SnB, or one per clock on Haswell and later. (xor can run on any of the 4 ALU ports). So if you're just doing a single xor in between some add/adc or whatever, stick with integers.
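For example, with the 256-bit numbers kept as four 64-bit limbs each (a layout I'm assuming for illustration, not something from the question), the pure integer version is just four XORs, which the compiler keeps in registers once this is inlined into surrounding bignum code:

#include <stdint.h>

// 256-bit XOR kept entirely in integer registers: four 64-bit limbs per
// number, four xor instructions.
static inline void xor256(uint64_t dst[4], const uint64_t a[4], const uint64_t b[4])
{
    dst[0] = a[0] ^ b[0];
    dst[1] = a[1] ^ b[1];
    dst[2] = a[2] ^ b[2];
    dst[3] = a[3] ^ b[3];
}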
Storing to memory in 64b chunks and then doing a 128b or 256b load would cause a store-forwarding failure, adding another several cycles of latency. Using movq / pinsrq would cost more execution resources than xor. Going the other way isn't as bad: 256b store -> 64b loads is fine for store forwarding. movq / pextrq still suck, but would have lower latency (at the cost of more uops).
FP load/store/bitwise operations are architecturally guaranteed not to generate FP exceptions, even when used on bit patterns that represent a signalling NaN. Only actual FP math instructions list math exceptions:
VADDPS
SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid,
Precision, Denormal.
VMOVAPS
SIMD Floating-Point Exceptions
None.
(From Intel's insn ref manual. See the x86 wiki for links to that and other stuff.)
On Intel hardware, either flavour of load/store can go to the FP or integer domain without extra delay. AMD likewise behaves the same whichever flavour of load/store is used, regardless of where the data is going to / coming from.
Different flavours of vector move instruction actually matter for register<-register moves. On Intel Nehalem, using the wrong mov instruction can cause a bypass delay. On AMD Bulldozer-family, where moves are handled by register renaming rather than actually copying the data (like Intel IvB and later), the dest register inherits the domain of whatever wrote the src register.
No existing design I've read about has handled movapd any differently from movaps. Presumably Intel created movapd as much for decode simplicity as for future planning (e.g. to allow for the possibility of a design where there's a double domain and a single domain, with different forwarding networks). (movapd is movaps with a 66h prefix, just like the double version of every other SSE instruction just has the 66h prefix byte tacked on. Or F2 instead of F3 for scalar instructions.)
Apparently AMD designs tag FP vectors with auxiliary info, because Agner Fog found a large delay when using the output of addps as the input for addpd, for example. I don't think movaps between two addpd instructions, or even xorps would cause that problem, though: only actual FP math. (FP bitwise boolean ops are integer-domain on Bulldozer-family.)
Theoretical throughput on Intel SnB/IvB (the only Intel CPUs with AVX but not AVX2):
256b operations with AVX xorps
VMOVDQU ymm0, [A]
VXORPS ymm0, ymm0, [B]
VMOVDQU [result], ymm0
3 fused-domain uops can issue at one per 0.75 cycles since the pipeline width is 4 fused-domain uops. (Assuming the addressing modes you use for B and result can micro-fuse, otherwise it's 5 fused-domain uops.)
load port: 256b loads / stores on SnB take 2 cycles (split into 128b halves), but this frees up the AGU on port 2/3 to be used by the store. There's a dedicated store-data port, but store-address calculation needs the AGU from a load port.
So with only 128b or smaller loads/stores, SnB/IvB can sustain two memory ops per cycle (with at most one of them being a store). With 256b ops, SnB/IvB can theoretically sustain two 256b loads and one 256b store per two cycles. Cache-bank conflicts usually make this impossible, though.
Haswell has a dedicated store-address port, and can sustain two 256b loads and one 256b store per one cycle, and doesn't have cache bank conflicts. So Haswell is much faster when everything's in L1 cache.
Bottom line: In theory (no cache-bank conflicts) this should saturate SnB's load and store ports, processing 128b per cycle. Port5 (the only port xorps can run on) is needed once every two clocks.
128b ops
VMOVDQU xmm0, [A]
VMOVDQU xmm1, [A+16]
VPXOR xmm0, xmm0, [B]
VPXOR xmm1, xmm1, [B+16]
VMOVDQU [result], xmm0
VMOVDQU [result+16], xmm1
This will bottleneck on address generation, since SnB can only sustain two 128b memory ops per cycle. It will also use 2x as much space in the uop cache, and more x86 machine code size. Barring cache-bank conflicts, this should run with a throughput of one 256b-xor per 3 clocks.
In registers
Between registers, one 256b VXORPS and two 128b VPXOR per clock would saturate SnB. On Haswell, three AVX2 256b VPXOR per clock would give the most XOR-ing per cycle. (XORPS and PXOR do the same thing, but XORPS's output can forward to the FP execution units without an extra cycle of forwarding delay. I guess only one execution unit has the wiring to have an XOR result in the FP domain, so Intel CPUs post-Nehalem only run XORPS on one port.)
Z Boson's hybrid idea:
VMOVDQU ymm0, [A]
VMOVDQU ymm4, [B]
VEXTRACTF128 xmm1, ymm0, 1
VEXTRACTF128 xmm5, ymm4, 1
VPXOR xmm0, xmm0, xmm4
VPXOR xmm1, xmm1, xmm5
VMOVDQU [res], xmm0
VMOVDQU [res+16], xmm1
Even more fused-domain uops (8) than just doing 128b-everything.
Load/store: two 256b loads leave two spare cycles for two store addresses to be generated, so this can still run at two loads/one store of 128b per cycle.
ALU: two port-5 uops (vextractf128), two port0/1/5 uops (vpxor).
So this still has a throughput of one 256b result per 2 clocks, but it's saturating more resources and has no advantage (on Intel) over the 3-instruction 256b version.
There is no problem using _mm256_load_ps to load integers. In fact in this case it's better than using _mm256_load_si256 (which does work with AVX) because you stay in the floating point domain with _mm256_load_ps.
#include <x86intrin.h>
#include <stdio.h>
int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};
    __m256 a8 = _mm256_loadu_ps((float*)a);   // load the integer bit patterns as packed floats
    __m256 b8 = _mm256_loadu_ps((float*)b);
    __m256 c8 = _mm256_xor_ps(a8, b8);        // bitwise XOR in the FP domain
    int c[8];
    _mm256_storeu_ps((float*)c, c8);
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
}
If you want to stay in the integer domain you could do
#include <x86intrin.h>
#include <stdio.h>
int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};
    __m256i a8 = _mm256_loadu_si256((__m256i*)a);
    __m256i b8 = _mm256_loadu_si256((__m256i*)b);
    __m128i a8lo = _mm256_castsi256_si128(a8);        // low 128 bits (free, no instruction)
    __m128i a8hi = _mm256_extractf128_si256(a8, 1);   // high 128 bits
    __m128i b8lo = _mm256_castsi256_si128(b8);
    __m128i b8hi = _mm256_extractf128_si256(b8, 1);
    __m128i c8lo = _mm_xor_si128(a8lo, b8lo);         // 128-bit integer-domain XORs
    __m128i c8hi = _mm_xor_si128(a8hi, b8hi);
    int c[8];
    _mm_storeu_si128((__m128i*)&c[0], c8lo);
    _mm_storeu_si128((__m128i*)&c[4], c8hi);
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
}
The _mm256_castsi256_si128 intrinsic is free.
You will probably find that there is little or no difference in performance compared to using 2 x _mm_xor_si128. It's even possible that the AVX implementation will be slower, since _mm256_xor_ps has a reciprocal throughput of 1 on SnB/IvB/Haswell, whereas _mm_xor_si128 has a reciprocal throughput of 0.33.

Difference between the AVX instructions vxorpd and vpxor

According to the Intel Intrinsics Guide,
vxorpd ymm, ymm, ymm: Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.
vpxor ymm, ymm, ymm: Compute the bitwise XOR of 256 bits (representing integer data) in a and b, and store the result in dst.
What is the difference between the two? It appears to me that both instructions would do a bitwise XOR on all 256 bits of the ymm registers. Is there any performance penalty if I use vxorpd for integer data (and vice versa)?
Combining some comments into an answer:
Other than performance, they have identical behaviour (I think even with a memory argument: same lack of alignment requirements for all AVX instructions).
On Nehalem to Broadwell, (V)PXOR can run on any of the 3 ALU execution ports, p0/p1/p5. (V)XORPS/D can only run on p5.
Some CPUs have a "bypass delay" between integer and FP "domains". Agner Fog's microarch docs say that on SnB / IvB, the bypass delay is sometimes zero. e.g. when using the "wrong" type of shuffle or boolean operation. On Haswell, his examples show that orps has no extra latency when used on the result of an integer instruction, but that por has an extra 1 clock of latency when used on the result of addps.
On Skylake, FP booleans can run on any port, but the bypass delay depends on which port they happened to run on. (See Intel's optimization manual for a table.) Port 5 has no bypass delay between FP math ops, but ports 0 and 1 do. Since the FMA units are on ports 0 and 1, the uop issue stage will usually assign booleans to port 5 in FP-heavy code, because it can see that lots of uops are queued up for p0/p1 but p5 is less busy. (How are x86 uops scheduled, exactly?)
I'd recommend not worrying about this. Tune for Haswell, and Skylake will do fine. Or just always use VPXOR on integer data and VXORPS on FP data, and Skylake will do fine (but Haswell might not).
On AMD Bulldozer / Piledriver / Steamroller there is no "FP" version of the boolean ops (see pg. 182 of Agner Fog's microarch manual). There's a delay for forwarding data between execution units: 1 cycle for ivec->fp or fp->ivec, 10 cycles for int->ivec (eax -> xmm0), 8 cycles for ivec->int (8 and 10 on Bulldozer; 4 and 5 on Steamroller, for movd/pinsrw/pextrw). So anyway, you can't avoid the bypass delay on AMD by using the appropriate boolean insn. XORPS does take one less byte to encode than PXOR or XORPD (non-VEX versions; VEX versions all take 4 bytes).
In any case, bypass delays are just extra latency, not reduced throughput. If these ops aren't part of the longest dep chain in your inner loop, or if you can interleave two iterations in parallel (so you have multiple dependency chains going at once for out-of-order-execution), then PXOR may be the way to go.
On Intel CPUs before Skylake, packed-integer instructions can always run on more ports than their floating-point counterparts, so prefer integer ops.
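In intrinsics terms, matching the boolean to the data's domain looks like this (a minimal sketch; note that vpxor on ymm registers requires AVX2, while vxorpd on ymm only requires AVX):

#include <immintrin.h>

// Bitwise XOR on integer data (vpxor, AVX2) vs. FP data (vxorpd, AVX).
__m256i xor_int(__m256i a, __m256i b) { return _mm256_xor_si256(a, b); }
__m256d xor_fp (__m256d a, __m256d b) { return _mm256_xor_pd(a, b); }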

Hardware implementation for integer data processing

I am currently trying to implement a data path which processes grayscale image data expressed as unsigned integers in the range 0 - 255. (Just for your information, my goal is to implement a Discrete Wavelet Transform in an FPGA.)
During the data processing, intermediate values will be negative as well. As an example, one of the calculations is
result = 48 - floor((66+39)/2)
The floor function is used to keep the processing in integers. For the above case, the result is -4, which is out of the 0~255 range.
Having mentioned the above case, I have a series of basic questions.
To deal with the negative intermediate numbers, do I need to represent all the data as the 'equivalent unsigned number' in 2's complement for the hardware design? e.g. -4 d = 1111 1100 b.
If I represent the data as 2's complement for the signed numbers, will I need 9 bits as opposed to 8 bits? Or, how many bits will I need to process the data properly? (With 8 bits, I cannot represent any number above 127 in 2's complement.)
How does negative-number division work if I use bitwise shifting? If I want to divide the result, -4, by 4 by shifting it right by 2 bits, the result becomes 63 in decimal, 0011 1111 in binary, instead of -1. How can I resolve this problem? (See the C sketch below.)
Any help would be appreciated!
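On the shifting question: the difference is between a logical right shift (zeros shifted in) and an arithmetic right shift (sign bit replicated). A minimal C illustration of the -4 example (note that right-shifting a negative signed value is technically implementation-defined in C, though mainstream compilers do an arithmetic shift):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int8_t  s = -4;              // 1111 1100 in two's complement
    uint8_t u = (uint8_t)s;      // same bit pattern, 0xFC

    int8_t  arith = s >> 2;      // arithmetic shift: sign bit replicated -> -1
    uint8_t logic = u >> 2;      // logical shift: zeros shifted in -> 63 (0x3F)

    printf("arithmetic: %d, logical: %u\n", arith, logic);
    return 0;
}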
If you can choose to use VHDL, then you can use the fixed point library to represent your numbers and choose your rounding mode, as well as allowing bit extensions etc.
In Verilog, well, I'd think twice. I'm not a Verilogger, but the arithmetic rules for mixing signed and unsigned datatypes seem fraught with foot-shooting opportunities.
Another option to consider might be MyHDL as that gives you a very powerful verification environment and allows you to spit out VHDL or Verilog at the back end as you choose.

Is there a way to force PMULHRSW to treat 0x8000 as 1.0 instead of -1.0?

To process 8-bit pixels, to do things like gamma correction without losing information, we normally upsample the values, work in 16 bits or whatever, and then downsample them to 8 bits.
Now, this is a somewhat new area for me, so please excuse incorrect terminology etc.
For my needs I have chosen to work in "non-standard" Q15, where I only use the upper half of the range (0.0-1.0), and 0x8000 represents 1.0 instead of -1.0. This makes it much easier to calculate things in C.
But I ran into a problem with SSSE3. It has the PMULHRSW instruction which multiplies Q15 numbers, but it uses the "standard" Q15 range of [-1, 1-2⁻¹⁵], so multiplying (my) 0x8000 (1.0) by 0x4000 (0.5) gives 0xC000 (-0.5), because it thinks 0x8000 is -1. This is quite annoying.
What am I doing wrong? Should I keep my pixel values in the 0000-7FFF range? Doesn't this kind of defeat the purpose of it being a fixed-point format? Is there a way around this? Maybe some trick?
Is there some kind of definitive treatise on Q15 which discusses all this?
Personally, I'd go with the solution of limiting the max value to 0x7FFF (~0.99something).
You don't have to jump through hoops getting the processor to work the way you'd like it to
You don't have to spend a long time documenting the ins and outs of your "weird" code, as operating over 0-0x7FFF will be immediately recognisable to the readers of your code - Q-format is understood (in my experience) to run from -1.0 to +1.0-one lsb. The arithmetic doesn't work out so well otherwise, as the value of 1 lsb is different on each side of the 0!
Unless you can imagine yourself successfully arguing, to a panel of argumentative code reviewers, that that extra bit is critical to the operation of the algorithm rather than just "the last 0.01% of performance", stick to code everyone can understand, and which maps to the hardware you have available.
Alternatively, re-arrange your previous operation so that the pixels all come out as the negative of what you originally had, or change the following operations to take in the negative of what you previously sent them. Then use values from -1.0 to 0.0 in Q15 format.
If you are sure that you won’t use any number “bigger” than $8000, the only problem would be when at least one of the multipliers is $8000 (–1, though you wish it were 1).
In this case the solution is rather simple:
pmulhrsw xmm0, xmm1
psignw xmm0, xmm0
Or, absolutely equivalent in our case (Thanks, Peter Cordes!):
pmulhrsw xmm0, xmm1
pabsw xmm0, xmm0
This will revert the negative values from multiplying by –1 to their positive values.
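The same fix with SSSE3 intrinsics (a sketch; the function name is mine):

#include <immintrin.h>

// Multiply in "standard" Q15 with pmulhrsw, then flip any lanes that came
// out negative because 0x8000 was interpreted as -1.0 rather than 1.0.
static inline __m128i q15_custom_mul(__m128i a, __m128i b)
{
    __m128i p = _mm_mulhrs_epi16(a, b);   // pmulhrsw
    return _mm_abs_epi16(p);              // pabsw (or _mm_sign_epi16(p, p))
}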

Efficient way to create a bit mask from multiple numbers possibly using SSE/SSE2/SSE3/SSE4 instructions

Suppose I have 16 ASCII characters (hence 16 8-bit numbers) in a 128-bit variable/register. I want to create a bit mask in which those bits will be high whose bit positions (indexes) are represented by those 16 characters.
For example, if the string formed from those 16 characters is "CAD...", then in the bit mask the 67th bit, 65th bit, 68th bit and so on should be 1. The rest of the bits should be 0. What is an efficient way to do it, especially using SIMD instructions?
I know that one technique is addition, like this: 2^(67-1)+2^(65-1)+2^(68-1)+...
But this will require a large number of operations. I want to do it in one/two operations/instructions if possible.
Please let me know a solution.
SSE4.2 contains one instruction that performs almost what you want: PCMPISTRM with immediate operand 0. One of its operands should contain your ASCII characters, the other a constant vector with values like 32, 33, ..., 47. You get the result in the 16 least significant bits of XMM0. Since you need 128 bits, this instruction has to be executed 8 times with different constant vectors (6 times if you only need printable ASCII characters). After each PCMPISTRM, use bitwise OR to accumulate the result in some XMM register.
There are 2 disadvantages to this method: (1) you need to read Intel's Architectures Software Developer's Manual to understand PCMPISTRM's details, because that's probably the most complicated SSE instruction ever, and (2) this instruction is pretty slow (throughput of 1/2 on Nehalem, 1/3 on Sandy Bridge, 1/4 on Bulldozer), so you'll hardly get any significant speed improvement over the 'brute force' method.
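For comparison, the scalar "brute force" baseline is just a loop setting one bit per character, using the 1-based bit positions from the question (a sketch; names are mine):

#include <stdint.h>

// Set one bit of a 128-bit mask (two 64-bit halves) per input character.
static void make_mask(const uint8_t chars[16], uint64_t mask[2])
{
    mask[0] = mask[1] = 0;
    for (int i = 0; i < 16; i++) {
        unsigned bit = chars[i] - 1u;          // 'A' (65) -> bit 64, 'C' (67) -> bit 66, ...
        mask[bit >> 6] |= 1ULL << (bit & 63);
    }
}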
