Bitwise xor of two 256-bit integers - sse

I have an AVX CPU (which doesn't support AVX2), and I want to compute the bitwise XOR of two 256-bit integers.
Since _mm256_xor_si256 is only available with AVX2, can I load these 256 bits as a __m256 using _mm256_load_ps and then do a _mm256_xor_ps? Will this generate the expected result?
My major concern is: if the memory content is not a valid floating-point number, will _mm256_load_ps fail to load the bits into the register exactly as they are in memory?
Thanks.

First of all, if you're doing other things with your 256b integers (like adding/subtracting/multiplying), getting them into vector registers just for the occasional XOR may not be worth the overhead of transferring them. If you have two numbers already in registers (using up 8 total registers), it's only four xor instructions to get the result (and 4 mov instructions if you need to avoid overwriting the destination). The destructive version can run at one per 1.33 clock cycles on SnB, or one per clock on Haswell and later (xor can run on any of the 4 ALU ports). So if you're just doing a single XOR in between some add/adc or whatever, stick with integers.
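For reference, here's what the scalar route looks like in C (a minimal sketch; the u256 struct and the helper name are my own, not from the question):

#include <stdint.h>

typedef struct { uint64_t limb[4]; } u256;   // 256-bit integer as four 64-bit limbs

static inline u256 xor256(u256 a, u256 b)
{
    u256 r;
    for (int i = 0; i < 4; ++i)
        r.limb[i] = a.limb[i] ^ b.limb[i];   // typically compiles to four 64-bit xor instructions (plus movs)
    return r;
}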
Storing to memory in 64b chunks and then doing a 128b or 256b load would cause a store-forwarding failure, adding another several cycles of latency. Using movq / pinsrq would cost more execution resources than xor. Going the other way isn't as bad: 256b store -> 64b loads is fine for store forwarding. movq / pextrq still suck, but would have lower latency (at the cost of more uops).
FP load/store/bitwise operations are architecturally guaranteed not to generate FP exceptions, even when used on bit patterns that represent a signalling NaN. Only actual FP math instructions list math exceptions:
VADDPS
SIMD Floating-Point Exceptions: Overflow, Underflow, Invalid, Precision, Denormal.
VMOVAPS
SIMD Floating-Point Exceptions: None.
(From Intel's insn ref manual. See the x86 wiki for links to that and other stuff.)
On Intel hardware, either flavour of load/store can go to FP or integer domain without extra delay. AMD similarly behaves the same whichever flavour of load/store is used, regardless of where the data is going to / coming from.
Different flavours of vector move instruction actually matter for register<-register moves. On Intel Nehalem, using the wrong mov instruction can cause a bypass delay. On AMD Bulldozer-family, where moves are handled by register renaming rather than actually copying the data (like Intel IvB and later), the dest register inherits the domain of whatever wrote the src register.
No existing design I've read about has handled movapd any differently from movaps. Presumably Intel created movapd as much for decode simplicity as for future planning (e.g. to allow for the possibility of a design where there's a double domain and a single domain, with different forwarding networks). (movapd is movaps with a 66h prefix, just like the double version of every other SSE instruction just has the 66h prefix byte tacked on. Or F2 instead of F3 for scalar instructions.)
Apparently AMD designs tag FP vectors with auxiliary info, because Agner Fog found a large delay when using the output of addps as the input for addpd, for example. I don't think movaps between two addpd instructions, or even xorps would cause that problem, though: only actual FP math. (FP bitwise boolean ops are integer-domain on Bulldozer-family.)
Theoretical throughput on Intel SnB/IvB (the only Intel CPUs with AVX but not AVX2):
256b operations with AVX xorps
VMOVDQU ymm0, [A]
VXORPS ymm0, ymm0, [B]
VMOVDQU [result], ymm0
3 fused-domain uops can issue at one per 0.75 cycles since the pipeline width is 4 fused-domain uops. (Assuming the addressing modes you use for B and result can micro-fuse, otherwise it's 5 fused-domain uops.)
load port: 256b loads / stores on SnB take 2 cycles (split into 128b halves), but this frees up the AGU on port 2/3 to be used by the store. There's a dedicated store-data port, but store-address calculation needs the AGU from a load port.
So with only 128b or smaller loads/stores, SnB/IvB can sustain two memory ops per cycle (with at most one of them being a store). With 256b ops, SnB/IvB can theoretically sustain two 256b loads and one 256b store per two cycles. Cache-bank conflicts usually make this impossible, though.
Haswell has a dedicated store-address port, and can sustain two 256b loads and one 256b store per one cycle, and doesn't have cache bank conflicts. So Haswell is much faster when everything's in L1 cache.
Bottom line: In theory (no cache-bank conflicts) this should saturate SnB's load and store ports, processing 128b per cycle. Port5 (the only port xorps can run on) is needed once every two clocks.
128b ops
VMOVDQU xmm0, [A]
VMOVDQU xmm1, [A+16]
VPXOR xmm0, xmm0, [B]
VPXOR xmm1, xmm1, [B+16]
VMOVDQU [result], xmm0
VMOVDQU [result+16], xmm1
This will bottleneck on address generation, since SnB can only sustain two 128b memory ops per cycle. It will also use 2x as much space in the uop cache, and more x86 machine code size. Barring cache-bank conflicts, this should run with a throughput of one 256b-xor per 3 clocks.
In registers
Between registers, one 256b VXORPS and two 128b VPXOR per clock would saturate SnB. On Haswell, three AVX2 256b VPXOR per clock would give the most XOR-ing per cycle. (XORPS and PXOR do the same thing, but XORPS's output can forward to the FP execution units without an extra cycle of forwarding delay. I guess only one execution unit has the wiring to produce an XOR result in the FP domain, so Intel CPUs post-Nehalem only run XORPS on one port.)
Z Boson's hybrid idea:
VMOVDQU ymm0, [A]
VMOVDQU ymm4, [B]
VEXTRACTF128 xmm1, ymm0, 1
VEXTRACTF128 xmm5, ymm4, 1
VPXOR xmm0, xmm0, xmm4
VPXOR xmm1, xmm1, xmm5
VMOVDQU [res], xmm0
VMOVDQU [res+16], xmm1
Even more fused-domain uops (8) than just doing 128b-everything.
Load/store: two 256b loads leave two spare cycles for two store addresses to be generated, so this can still run at two loads/one store of 128b per cycle.
ALU: two port-5 uops (vextractf128), two port0/1/5 uops (vpxor).
So this still has a throughput of one 256b result per 2 clocks, but it's saturating more resources and has no advantage (on Intel) over the 3-instruction 256b version.

There is no problem using _mm256_load_ps to load integers. In fact in this case it's better than using _mm256_load_si256 (which does work with AVX) because you stay in the floating point domain with _mm256_load_ps.
#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};
    __m256 a8 = _mm256_loadu_ps((float*)a);   // load the 256 bits as packed floats
    __m256 b8 = _mm256_loadu_ps((float*)b);
    __m256 c8 = _mm256_xor_ps(a8, b8);        // bitwise XOR in the FP domain
    int c[8];
    _mm256_storeu_ps((float*)c, c8);
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
}
If you want to stay in the integer domain you could do
#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};
    __m256i a8 = _mm256_loadu_si256((__m256i*)a);
    __m256i b8 = _mm256_loadu_si256((__m256i*)b);
    __m128i a8lo = _mm256_castsi256_si128(a8);       // low 128 bits (free, no instruction)
    __m128i a8hi = _mm256_extractf128_si256(a8, 1);  // high 128 bits
    __m128i b8lo = _mm256_castsi256_si128(b8);
    __m128i b8hi = _mm256_extractf128_si256(b8, 1);
    __m128i c8lo = _mm_xor_si128(a8lo, b8lo);        // 128-bit integer XOR on each half
    __m128i c8hi = _mm_xor_si128(a8hi, b8hi);
    int c[8];
    _mm_storeu_si128((__m128i*)&c[0], c8lo);
    _mm_storeu_si128((__m128i*)&c[4], c8hi);
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
}
The _mm256_castsi256_si128 intrinsics are free.

You will probably find that there is little or no difference in performance than if you used 2 x _mm_xor_si128. It's even possible that the AVX implementation will be slower, since _mm256_xor_ps has a reciprocal throughput of 1 on SB/IB/Haswell, whereas _mm_xor_si128 has a reciprocal throughput of 0.33.

What is maximum (ideal) memory bandwidth of an OpenCL device?

My OpenCL device memory-relevant specs are:
Max compute units 20
Global memory channels (AMD) 8
Global memory banks per channel (AMD) 4
Global memory bank width (AMD) 256 bytes
Global Memory cache line size 64 bytes
Does it mean that, to utilize my device at its full memory-wise potential, it needs to have 8 work items on different CUs constantly reading 64-byte memory chunks? Are memory channels arranged so that they allow different CUs to access memory simultaneously? Are memory reads of 64 bytes always considered single reads, or only if the address is % 64 == 0?
Does memory bank quantity/width have anything to do with memory bandwidth, and is there a way to reason about memory performance with respect to memory banks when writing a kernel?
Memory bank quantity is useful to hint about strided access pattern performance and bank conflicts.
The cache line width must be the L2 cache line between L2 and the CU (L1). 64 bytes per cycle means 64 GB/s per compute unit (assuming there is only 1 active cache line per CU at a time and a 1 GHz clock; there can be multiple, like 4 of them per L1, too). With 20 compute units, total "L2 to L1" bandwidth must be 1.28 TB/s, but its main advantage over global memory must be fewer clock cycles to fetch data.
If you need to utilize global memory, then you need to approach bandwidth limits between L2 and main memory. That is related to memory channel width, number of memory channels and frequency.
GDDR channel width is 64 bits; HBM channel width is 128 bits. A single stack of HBM v1 has 8 channels, so that's a total of 1024 bits, or 128 bytes. 128 bytes per cycle means 128 GB/s per GHz. More stacks mean more bandwidth: if 8 GB of memory is made of two stacks, then it's 256 GB/s.
If your data-set fits inside L2 cache, then you expect more bandwidth under repeated access.
But the true performance (instead of on paper) can be measured by a simple benchmark that does pipelined memory copy between two arrays.
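For example, a copy-bandwidth microbenchmark kernel can be as simple as the sketch below (the kernel name and the float4 element width are assumptions); time the enqueue on the host and divide the bytes moved by the elapsed seconds:

__kernel void copy_bw(__global const float4 *src, __global float4 *dst)
{
    size_t i = get_global_id(0);
    dst[i] = src[i];    // one 16-byte read and one 16-byte write per work item
}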
Total performance by 8 work items depends on the capability of the compute unit. If it allows only 32 bytes per clock per work item, then you may need more work items. The compute unit must have some optimization phase, like packing similar addresses into one big memory access per CU. So you can even achieve max performance using only a single work group (but with multiple work items, not just 1; the number depends on how big an object each work item is accessing, and on the CU's capability). You can benchmark this with an array-summation or reduction kernel. Just 1 compute unit is generally more than enough to utilize global memory bandwidth, unless its single L2-L1 bandwidth is lower than the global memory bandwidth. But that may not be true for the highest-end cards.
What is the parallelism between L2 and L1 for your card? Only 1 active line at a time? Then you probably require 8 work items distributed over 8 work groups.
According to AMD's RDNA datasheet, each shader is capable of keeping 10-20 requests in flight, so if the L1-L2 communication of 1 RDNA compute unit is enough to use all the bandwidth of global memory, then even just a few work items from a single work group should be enough.
L1-L2 bandwidth:
It says 4 lines are active between each L1 and the L2, so it must have 256 GB/s per compute unit. 4 work groups running on different CUs should be enough for a 1 TB/s main memory. I guess OpenCL has no access to this information, and it can change for new cards, so the best thing would be to benchmark various settings, from 1 CU to N CUs and from 1 work item to N work items. It shouldn't take much time to measure under no contention (i.e. when the whole GPU server is dedicated only to you).
Shader bandwidth:
If these are per-shader limits, then a single shader can use all of its own CU's L1-L2 bandwidth, especially when reading.
It also says the L0-L1 cache line size is 128 bytes, so one work item could use a data type that wide.
N-way set-associative caches (L1, L2 above) and direct-mapped caches (maybe the texture cache?) use modulo mapping. But LRU (L0 here) may not require modulo access. Since you need global memory bandwidth, you should look at the L2 cache line, which is n-way set-associative, hence the modulo. Even if the data is already in L0, the OpenCL spec may not let you do non-modulo-x access to it. Also, you don't have to think about alignment if the array is of the type of the data you need to work with.
If you don't want to fiddle with microbenchmarking and don't know how many work items are required, then you can use async work-group copy commands in the kernel. The async copy implementation uses just the required amount of shaders (or no shaders at all, depending on hardware). Then you can access the local memory quickly from a single work item.
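A minimal sketch of that approach (the CHUNK size and kernel shape are assumptions, not from the question):

#define CHUNK 1024

__kernel void tile_process(__global const float *src, __global float *dst)
{
    __local float tile[CHUNK];
    size_t base = get_group_id(0) * CHUNK;
    event_t e = async_work_group_copy(tile, src + base, CHUNK, 0);  // runtime decides how to move the data
    wait_group_events(1, &e);
    // ... work on tile[] from fast local memory, even from a single work item ...
    e = async_work_group_copy(dst + base, tile, CHUNK, 0);
    wait_group_events(1, &e);
}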
But a single work item may need an unrolled loop to do the pipelining and use all the bandwidth of its CU. A single read/write operation will not fill the pipeline, so the latency stays visible (not hidden behind other latencies).
Note: the L2 clock frequency can be different from the main memory frequency, not just 1 GHz. There could be an L3 cache or something else to adapt to a different frequency there. Perhaps it's the GPU frequency, like 2 GHz. Then all of the L1 and L0 bandwidths are also higher, like 512 GB/s per L1-L2 link. You may need to query CL_DEVICE_MAX_CLOCK_FREQUENCY for this. In any case, just 1 CU looks capable of using 90% of the bandwidth of high-end cards. An RX 6800 XT has 512 GB/s main memory bandwidth and a 2 GHz GPU clock, so it can likely do it with only 1 CU.

1-to-4 broadcast and 4-to-1 reduce in AVX-512

I need to do the following two operations:
float x[4];
float y[16];

// 1-to-4 broadcast
for (int i = 0; i < 16; ++i)
    y[i] = x[i / 4];

// 4-to-1 reduce-add
for (int i = 0; i < 16; ++i)
    x[i / 4] += y[i];
What would be an efficient AVX-512 implementation?
For the reduce-add, just do in-lane shuffles and adds (vmovshdup / vaddps / vpermilps imm8/vaddps) like in Fastest way to do horizontal float vector sum on x86 to get a horizontal sum in each 128-bit lane, and then vpermps to shuffle the desired elements to the bottom. Or vcompressps with a constant mask to do the same thing, optionally with a memory destination.
Once packed down to a single vector, you have a normal SIMD 128-bit add.
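In intrinsics that sequence might look like the sketch below (assuming x and y are exactly the arrays from the question; the helper name and the vpermps index constant are mine):

#include <immintrin.h>

void reduce_4to1(float x[4], const float y[16])
{
    __m512 v = _mm512_loadu_ps(y);
    __m512 t = _mm512_add_ps(v, _mm512_movehdup_ps(v));                        // vmovshdup + vaddps
    __m512 s = _mm512_add_ps(t, _mm512_permute_ps(t, _MM_SHUFFLE(1,0,3,2)));   // vpermilps imm8 + vaddps
    // element 0 of each 128-bit lane now holds that lane's horizontal sum
    __m512i idx  = _mm512_setr_epi32(0,4,8,12, 0,0,0,0, 0,0,0,0, 0,0,0,0);
    __m128  sums = _mm512_castps512_ps128(_mm512_permutexvar_ps(idx, s));      // vpermps, keep the low 128 bits
    _mm_storeu_ps(x, _mm_add_ps(_mm_loadu_ps(x), sums));                       // x[i] += lane-sum i
}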
If your arrays are actually larger than 16, instead of vpermps you could use vpermt2ps to take every 4th element from each of two source vectors, setting you up to do the += part into x[] with 256-bit vectors. (Or combine again with another shuffle into 512-bit vectors, but that will probably bottleneck on shuffle throughput on SKX.)
On SKX, vpermt2ps is only a single uop, with 1c throughput / 3c latency, so it's very efficient for how powerful it is. On KNL it has 2c throughput, worse than vpermps, but maybe still worth it. (KNL doesn't have AVX512VL, but for adding to x[] with 256-bit vectors you (or a compiler) can use AVX1 vaddps ymm if you want.)
See https://agner.org/optimize/ for instruction tables.
For the load:
Is this done inside a loop, or repeatedly? (i.e. can you keep a shuffle-control vector in a register?) If so, you could:
do a 128->512 broadcast with VBROADCASTF32X4 (single uop for a load port).
do an in-lane shuffle with vpermilps zmm,zmm,zmm to broadcast a different element within each 128-bit lane. (This has to be separate from the broadcast-load, because a memory-source vpermilps can only have an m512 or m32bcst source. Instructions typically have their memory-broadcast granularity equal to their element size, which unfortunately isn't useful at all in cases like this. And vpermilps takes the control vector as the memory operand, not the source data.)
This is slightly better than vpermps zmm,zmm,zmm because the shuffle has 1 cycle latency instead of 3 (on Skylake-avx512).
Even outside a loop, loading a shuffle-control vector might still be your best bet.
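Putting those two steps together, the broadcast side might look like this sketch (the control-vector constant and helper name are mine; keep ctrl in a register if this runs in a loop):

#include <immintrin.h>

void broadcast_1to4(float y[16], const float x[4])
{
    __m512i ctrl = _mm512_setr_epi32(0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3);  // per-lane element selector
    __m512  v    = _mm512_broadcast_f32x4(_mm_loadu_ps(x));   // vbroadcastf32x4: [x0 x1 x2 x3] in every lane
    _mm512_storeu_ps(y, _mm512_permutevar_ps(v, ctrl));       // vpermilps zmm,zmm,zmm: in-lane broadcast
}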

SSE instruction to sum 32 bit integers to 64 bit

I'm looking for an SSE instruction which takes two arguments of four 32-bit integers in __m128i, computes the sums of corresponding pairs, and returns the result as two 64-bit integers in an __m128i.
Is there an instruction for this?
There are no SSE operations with carry. The way to do this is to first unpack the 32-bit integers (punpckldq/punpckhdq) into 4 groups of 64-bit integers by using an all-zeroes helper vector, and then use 64-bit pairwise addition.
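A sketch of that approach in intrinsics, assuming unsigned inputs (the helper and variable names are mine):

#include <emmintrin.h>

static inline void add_widen_u32(__m128i a, __m128i b, __m128i *lo, __m128i *hi)
{
    __m128i zero = _mm_setzero_si128();
    __m128i a_lo = _mm_unpacklo_epi32(a, zero);   // punpckldq: zero-extend a0, a1 to 64-bit
    __m128i a_hi = _mm_unpackhi_epi32(a, zero);   // punpckhdq: zero-extend a2, a3 to 64-bit
    __m128i b_lo = _mm_unpacklo_epi32(b, zero);
    __m128i b_hi = _mm_unpackhi_epi32(b, zero);
    *lo = _mm_add_epi64(a_lo, b_lo);              // paddq: a0+b0, a1+b1 as 64-bit sums
    *hi = _mm_add_epi64(a_hi, b_hi);              //        a2+b2, a3+b3
}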
SSE only has this for byte->word and word->dword. (pmaddubsw (SSSE3) and pmaddwd (MMX/SSE2), which vertically multiply v1 * v2, then horizontally add neighbouring pairs.)
I'm not clear on what you want the outputs to be. You have 8 input integers (two vectors of 4), and 2 output integers (one vector of two). Since there's no insn that does any kind of 32+32 -> 64b vector addition, let's just look at how to zero-extend or sign-extended the low two 32b elements of a vector to 64b. You can combine this into whatever you need, but keep in mind there's no add-horizontal-pairs phaddq, only vertical paddq.
phaddd is similar to what you want, but without the widening: low half of the result is the sum of horizontal pairs in the first operand, high half is the sum of horizontal pairs in the second operand. It's pretty much only worth using if you need all those results, and you're not going to combine them further. (i.e. it's usually faster to shuffle and vertical add instead of running phadd to horizontally sum a vector accumulator at the end of a reduction. And if you're going to sum everything down to one result, do normal vertical sums until you're down to one register.) phaddd could be implemented in hardware to be as fast as paddd (single cycle latency and throughput), but it isn't in any AMD or Intel CPU.
Like Mysticial commented, SSE4.1 pmovzxdq / pmovsxdq are exactly what you need, and can even do it on the fly as part of a load from a 64b memory location (containing two 32b integers).
SSE4.1 was introduced with Intel Penryn, 2nd gen Core2 (45nm die shrink core2), the generation before Nehalem. Falling back to a non-vector code path on CPUs older than that might be ok, depending on how much you care about not being slow on CPUs that are already old and slow.
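With SSE4.1 available, a sketch for the low two elements of each input could be as simple as this (names are mine; use _mm_cvtepu32_epi64 instead for unsigned inputs):

#include <smmintrin.h>

static inline __m128i add_widen_lo_s32(__m128i a, __m128i b)
{
    __m128i a64 = _mm_cvtepi32_epi64(a);   // pmovsxdq: a0, a1 -> sign-extended 64-bit
    __m128i b64 = _mm_cvtepi32_epi64(b);
    return _mm_add_epi64(a64, b64);        // paddq
}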
Without SSE4.1:
Unsigned zero-extension is easy. Like pmdj answered, just use punpck* lo and hi to unpack with zero.
If your integers are signed, you'll have to do the sign-extension manually.
There is no psraq, only psrad (Packed Shift Right Arithmetic Dword) and psraw. If there was, you could unpack with itself and then arithmetic right shift by 32b.
Instead, we probably need to generate a vector where each element is turned into its sign bit. Then blend that with an unpacked vector (but pblendw is SSE4.1 too, so we'd have to use por).
Or better, unpack the original vector with a vector of sign-masks.
# input in xmm0
movdqa    xmm1, xmm0
movdqa    xmm2, xmm0
psrad     xmm0, 31     ; xmm0 = sign mask: all-ones or all-zeros depending on the sign of each input element; xmm1=orig, xmm2=orig
punpckldq xmm1, xmm0   ; xmm1 = sign-extend(lo64(orig))
punpckhdq xmm2, xmm0   ; xmm2 = sign-extend(hi64(orig))
This should run with 2 cycle latency for both results on Intel SnB or IvB. Haswell and later only have one shuffle port (so they can't do both punpck insns in parallel), so xmm2 will be delayed for another cycle there. Pre-SnB Intel CPUs usually bottleneck on the frontend (decoders, etc) with vector instructions, because they often average more than 4B per insn.
Shifting the original instead of the copy shortens the dependency chain for whatever produces xmm0, for CPUs without move elimination (handling mov instructions at the register-rename stage, so they're zero latency. Intel-only, and only on IvB and later.) With 3-operand AVX instructions, you wouldn't need the movdqa, or the 3rd register, but then you could just use vpmovsx for the low64 anyway. To sign-extend the high 64, you'd probably psrldq byte-shift the high 64 down to the low 64.
Or movhlps or punpckhqdq self,self to use a shorter-to-encode instruction. (or AVX2 vpmovsx to a 256b reg, and then vextracti128 the upper 128, to get both 128b results with only two instructions.)
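That AVX2 variant, sketched with intrinsics (the helper name is mine):

#include <immintrin.h>

static inline void sign_extend_all4_avx2(__m128i v, __m128i *lo, __m128i *hi)
{
    __m256i wide = _mm256_cvtepi32_epi64(v);    // vpmovsxdq ymm, xmm: widen all four elements
    *lo = _mm256_castsi256_si128(wide);         // low two 64-bit results (free)
    *hi = _mm256_extracti128_si256(wide, 1);    // vextracti128: high two 64-bit results
}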
Unlike GP-register shifts (e.g. sar eax, 31) , vector shifts saturate the count instead of masking. Leaving the original sign bit as the LSB (shifting by 31) instead of a copy of it (shifting by 32) works fine, too. It has the advantage of not requiring a big comment in with the code explaining this for people who would worry when they saw psrad xmm0, 32.

Difference between the AVX instructions vxorpd and vpxor

According to the Intel Intrinsics Guide,
vxorpd ymm, ymm, ymm: Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.
vpxor ymm, ymm, ymm: Compute the bitwise XOR of 256 bits (representing integer data) in a and b, and store the result in dst.
What is the difference between the two? It appears to me that both instructions would do a bitwise XOR on all 256 bits of the ymm registers. Is there any performance penalty if I use vxorpd for integer data (and vice versa)?
Combining some comments into an answer:
Other than performance, they have identical behaviour (I think even with a memory argument: same lack of alignment requirements for all AVX instructions).
On Nehalem to Broadwell, (V)PXOR can run on any of the 3 ALU execution ports, p0/p1/p5. (V)XORPS/D can only run on p5.
Some CPUs have a "bypass delay" between integer and FP "domains". Agner Fog's microarch docs say that on SnB / IvB, the bypass delay is sometimes zero. e.g. when using the "wrong" type of shuffle or boolean operation. On Haswell, his examples show that orps has no extra latency when used on the result of an integer instruction, but that por has an extra 1 clock of latency when used on the result of addps.
On Skylake, FP booleans can run on any port, but bypass delay depends on which port they happened to run on. (See Intel's optimization manual for a table). Port5 has no bypass delay between FP math ops, but port 0 or port 1 do. Since the FMA units are on port 0 and 1, the uop issue stage will usually assign booleans to port5 in FP heavy code, because it can see that lots of uops are queued up for p0/p1 but p5 is less busy. (How are x86 uops scheduled, exactly?).
I'd recommend not worrying about this. Tune for Haswell, and Skylake will do fine. Or just always use VPXOR on integer data and VXORPS on FP data; then Skylake will do fine (but Haswell might not).
On AMD Bulldozer / Piledriver / Steamroller there is no "FP" version of the boolean ops (see p. 182 of Agner Fog's microarch manual). There's a delay for forwarding data between execution units: 1 cycle for ivec->fp or fp->ivec, and 10 cycles for int->ivec (eax -> xmm0) / 8 cycles for ivec->int on Bulldozer (4 and 5 cycles on Steamroller, for movd/pinsrw/pextrw). So anyway, you can't avoid the bypass delay on AMD by using the appropriate boolean insn. XORPS does take one less byte to encode than PXOR or XORPD (the non-VEX versions; VEX versions all take 4 bytes).
In any case, bypass delays are just extra latency, not reduced throughput. If these ops aren't part of the longest dep chain in your inner loop, or if you can interleave two iterations in parallel (so you have multiple dependency chains going at once for out-of-order-execution), then PXOR may be the way to go.
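For example, an XOR reduction over a buffer can be unrolled with two accumulators so the two chains overlap (a sketch only; the function name, the AVX2 intrinsics, and the assumption that n is even and the data 32-byte aligned are mine):

#include <immintrin.h>
#include <stddef.h>

static __m256i xor_reduce(const __m256i *p, size_t n)   // n = number of __m256i elements, assumed even
{
    __m256i acc0 = _mm256_setzero_si256();
    __m256i acc1 = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 2) {
        acc0 = _mm256_xor_si256(acc0, p[i]);      // dependency chain 1
        acc1 = _mm256_xor_si256(acc1, p[i + 1]);  // dependency chain 2
    }
    return _mm256_xor_si256(acc0, acc1);          // combine the chains once at the end
}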
On Intel CPUs before Skylake, packed-integer instructions can always run on more ports than their floating-point counterparts, so prefer integer ops.

Why is RAM in powers of 2?

Why is the amount of RAM always a power of 2?
512, 1024, etc.
Specifically, what is the difference between using 512, 768, and 1024 RAM for an Android emulator?
Memory is closely tied to the CPU, so making their size a power of two means that multiple modules can be packed requiring a minimum of logic in order to switch between them; only a few bits from the end need to be checked (since the binary representation of the size is 1000...0000 regardless of its size) instead of many more bits were it not a power of two.
Hard drives are not tied to the CPU and not packed in the same manner, so exactness of their size is not required.
from https://superuser.com/questions/235030/why-are-ram-size-usually-in-powers-of-2-512-mb-1-2-4-8-gb
as referenced by BrajeshKumar in the comments on the OP. Thanks Brajesh!
Because computers deal with binary values such as 0 and 1: registers are either on (1) or off (0).
So if you use powers of 2, your hardware will use 100% of the registers.
If computers used ternary values in their circuits, then we'd have memory, processors and anything else in powers of 3.
I think it is related to the number of bits in an address bus (or the bits used to select between address spaces). n bits can address 2^n bytes, so whenever the number of address bits increases to n+1, the addressable space automatically doubles. Manufacturers use their maximum address capacity when adding memory chips to the design.
In the Android emulator, the increase in RAM may make your program more efficient, because when your application exceeds the available RAM, a part of ROM (non-volatile storage) is used instead, and it is slower.
