I have an AVX CPU (which doesn't support AVX2), and I want to compute the bitwise XOR of two 256-bit integers.
Since _mm256_xor_si256 is only available with AVX2, can I load these 256 bits as __m256 using _mm256_load_ps and then do a _mm256_xor_ps? Will this generate the expected result?
My major concern is that if the memory content is not a valid floating-point number, will _mm256_load_ps fail to load the bits into the register exactly as they are in memory?
Thanks.
First of all, if you're doing other things with your 256b integers (like adding/subtracting/multiplying), getting them into vector registers just for the occasional XOR may not be worth the overhead of transferring them. If you have two numbers already in registers (using up 8 total registers), it's only four xor instructions to get the result (and 4 mov instructions if you need to avoid overwriting the destination). The destructive version can run at one per 1.33 clock cycles on SnB, or one per clock on Haswell and later. (xor can run on any of the 3 ALU ports on SnB, or any of the 4 ALU ports on Haswell.) So if you're just doing a single xor in between some add/adc or whatever, stick with integers.
Storing to memory in 64b chunks and then doing a 128b or 256b load would cause a store-forwarding failure, adding another several cycles of latency. Using movq / pinsrq would cost more execution resources than xor. Going the other way isn't as bad: 256b store -> 64b loads is fine for store forwarding. movq / pextrq still suck, but would have lower latency (at the cost of more uops).
FP load/store/bitwise operations are architecturally guaranteed not to generate FP exceptions, even when used on bit patterns that represent a signalling NaN. Only actual FP math instructions list math exceptions:
VADDPS
SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid,
Precision, Denormal.
VMOVAPS
SIMD Floating-Point Exceptions
None.
(From Intel's insn ref manual. See the x86 wiki for links to that and other stuff.)
On Intel hardware, either flavour of load/store can go to FP or integer domain without extra delay. AMD similarly behaves the same whichever flavour of load/store is used, regardless of where the data is going to / coming from.
Different flavours of vector move instruction actually matter for register<-register moves. On Intel Nehalem, using the wrong mov instruction can cause a bypass delay. On AMD Bulldozer-family, where moves are handled by register renaming rather than actually copying the data (like Intel IvB and later), the dest register inherits the domain of whatever wrote the src register.
No existing design I've read about has handled movapd any differently from movaps. Presumably Intel created movapd as much for decode simplicity as for future planning (e.g. to allow for the possibility of a design where there's a double domain and a single domain, with different forwarding networks). (movapd is movaps with a 66h prefix, just like the double version of every other SSE instruction just has the 66h prefix byte tacked on. Or F2 instead of F3 for scalar instructions.)
Apparently AMD designs tag FP vectors with auxiliary info, because Agner Fog found a large delay when using the output of addps as the input for addpd, for example. I don't think movaps between two addpd instructions, or even xorps would cause that problem, though: only actual FP math. (FP bitwise boolean ops are integer-domain on Bulldozer-family.)
Theoretical throughput on Intel SnB/IvB (the only Intel CPUs with AVX but not AVX2):
256b operations with AVX xorps
VMOVDQU ymm0, [A]
VXORPS ymm0, ymm0, [B]
VMOVDQU [result], ymm0
These 3 fused-domain uops can issue at one iteration per 0.75 cycles, since the pipeline width is 4 fused-domain uops per clock. (Assuming the addressing modes you use for B and result can micro-fuse, otherwise it's 5 fused-domain uops.)
load port: 256b loads / stores on SnB take 2 cycles (split into 128b halves), but this frees up the AGU on port 2/3 to be used by the store. There's a dedicated store-data port, but store-address calculation needs the AGU from a load port.
So with only 128b or smaller loads/stores, SnB/IvB can sustain two memory ops per cycle (with at most one of them being a store). With 256b ops, SnB/IvB can theoretically sustain two 256b loads and one 256b store per two cycles. Cache-bank conflicts usually make this impossible, though.
Haswell has a dedicated store-address port, and can sustain two 256b loads and one 256b store per one cycle, and doesn't have cache bank conflicts. So Haswell is much faster when everything's in L1 cache.
Bottom line: In theory (no cache-bank conflicts) this should saturate SnB's load and store ports, processing 128b per cycle. Port5 (the only port xorps can run on) is needed once every two clocks.
128b ops
VMOVDQU xmm0, [A]
VMOVDQU xmm1, [A+16]
VPXOR xmm0, xmm0, [B]
VPXOR xmm1, xmm1, [B+16]
VMOVDQU [result], xmm0
VMOVDQU [result+16], xmm1
This will bottleneck on address generation, since SnB can only sustain two 128b memory ops per cycle. It will also use 2x as much space in the uop cache, and more x86 machine code size. Barring cache-bank conflicts, this should run with a throughput of one 256b-xor per 3 clocks.
In registers
Between registers, one 256b VXORPS and two 128b VPXOR per clock would saturate SnB. On Haswell, three AVX2 256b VPXOR per clock would give the most XOR-ing per cycle. (XORPS and PXOR do the same thing, but XORPS's output can forward to the FP execution units without an extra cycle of forwarding delay. I guess only one execution unit has the wiring to have an XOR result in the FP domain, so Intel CPUs post-Nehalem only run XORPS on one port.)
Z Boson's hybrid idea:
VMOVDQU ymm0, [A]
VMOVDQU ymm4, [B]
VEXTRACTF128 xmm1, ymm0, 1
VEXTRACTF128 xmm5, ymm4, 1
VPXOR xmm0, xmm0, xmm4
VPXOR xmm1, xmm1, xmm5
VMOVDQU [res], xmm0
VMOVDQU [res+16], xmm1
Even more fused-domain uops (8) than just doing 128b-everything.
Load/store: two 256b loads leave two spare cycles for two store addresses to be generated, so this can still run at two loads/one store of 128b per cycle.
ALU: two port-5 uops (vextractf128), two port0/1/5 uops (vpxor).
So this still has a throughput of one 256b result per 2 clocks, but it's saturating more resources and has no advantage (on Intel) over the 3-instruction 256b version.
There is no problem using _mm256_load_ps to load integers. In fact in this case it's better than using _mm256_load_si256 (which does work with AVX) because you stay in the floating point domain with _mm256_load_ps.
#include <x86intrin.h>
#include <stdio.h>
int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};
    /* Load the integer bit patterns as packed floats; no conversion happens. */
    __m256 a8 = _mm256_loadu_ps((float*)a);
    __m256 b8 = _mm256_loadu_ps((float*)b);
    __m256 c8 = _mm256_xor_ps(a8,b8);
    int c[8];
    _mm256_storeu_ps((float*)c, c8);
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
    return 0;
}
If you want to stay in the integer domain you could do
#include <x86intrin.h>
#include <stdio.h>
int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};
    __m256i a8 = _mm256_loadu_si256((__m256i*)a);
    __m256i b8 = _mm256_loadu_si256((__m256i*)b);
    /* Split each 256-bit value into its low and high 128-bit halves. */
    __m128i a8lo = _mm256_castsi256_si128(a8);
    __m128i a8hi = _mm256_extractf128_si256(a8, 1);
    __m128i b8lo = _mm256_castsi256_si128(b8);
    __m128i b8hi = _mm256_extractf128_si256(b8, 1);
    /* XOR the halves with 128-bit integer instructions. */
    __m128i c8lo = _mm_xor_si128(a8lo, b8lo);
    __m128i c8hi = _mm_xor_si128(a8hi, b8hi);
    int c[8];
    _mm_storeu_si128((__m128i*)&c[0],c8lo);
    _mm_storeu_si128((__m128i*)&c[4],c8hi);
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
    return 0;
}
The _mm256_castsi256_si128 intrinsics are free.
You will probably find that there is little or no difference in performance compared with using 2 x _mm_xor_si128. It's even possible that the AVX implementation will be slower, since _mm256_xor_ps has a reciprocal throughput of 1 on SB/IB/Haswell, whereas _mm_xor_si128 has a reciprocal throughput of 0.33.
According to the Intel Intrinsics Guide,
vxorpd ymm, ymm, ymm: Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.
vpxor ymm, ymm, ymm: Compute the bitwise XOR of 256 bits (representing integer data) in a and b, and store the result in dst.
What is the difference between the two? It appears to me that both instructions would do a bitwise XOR on all 256 bits of the ymm registers. Is there any performance penalty if I use vxorpd for integer data (and vice versa)?
Combining some comments into an answer:
Other than performance, they have identical behaviour (I think even with a memory argument: same lack of alignment requirements for all AVX instructions).
On Nehalem to Broadwell, (V)PXOR can run on any of the 3 ALU execution ports, p0/p1/p5. (V)XORPS/D can only run on p5.
Some CPUs have a "bypass delay" between integer and FP "domains". Agner Fog's microarch docs say that on SnB / IvB, the bypass delay is sometimes zero, e.g. when using the "wrong" type of shuffle or boolean operation. On Haswell, his examples show that orps has no extra latency when used on the result of an integer instruction, but that por has an extra 1 clock of latency when used on the result of addps.
On Skylake, FP booleans can run on any port, but bypass delay depends on which port they happened to run on. (See Intel's optimization manual for a table). Port5 has no bypass delay between FP math ops, but port 0 or port 1 do. Since the FMA units are on port 0 and 1, the uop issue stage will usually assign booleans to port5 in FP heavy code, because it can see that lots of uops are queued up for p0/p1 but p5 is less busy. (How are x86 uops scheduled, exactly?).
I'd recommend not worrying about this. Tune for Haswell and Skylake will do fine. Or just always use VPXOR on integer data and VXORPS on FP data, and Skylake will do fine (but Haswell might not).
On AMD Bulldozer / Piledriver / Steamroller there is no "FP" version of the boolean ops. (see pg. 182 of Agner Fog's microarch manual.) There's a delay for forwarding data between execution units (of 1 cycle for ivec->fp or fp->ivec, 10 cycles for int->ivec (eax -> xmm0), 8 cycles for ivec->int. (8,10 on bulldozer. 4, 5 on steamroller for movd/pinsrw/pextrw)) So anyway, you can't avoid the bypass delay on AMD by using the appropriate boolean insn. XORPS does take one less byte to encode than PXOR or XORPD (non-VEX version. VEX versions all take 4 bytes.)
In any case, bypass delays are just extra latency, not reduced throughput. If these ops aren't part of the longest dep chain in your inner loop, or if you can interleave two iterations in parallel (so you have multiple dependency chains going at once for out-of-order-execution), then PXOR may be the way to go.
On Intel CPUs before Skylake, packed-integer instructions can always run on more ports than their floating-point counterparts, so prefer integer ops.
I am currently trying to implement a data path that processes grayscale image data expressed as unsigned integers from 0 to 255. (For your information, my goal is to implement a Discrete Wavelet Transform on an FPGA.)
During processing, intermediate values can also be negative. As an example, one of the calculations is
result = 48 - floor((66+39)/2)
The floor function is used to keep the data processing in integers. For the above case, the result is -4, which is outside the range 0 to 255.
Given the above case, I have a series of basic questions.
To deal with negative intermediate numbers, do I need to represent all the data as the 'equivalent unsigned number' in 2's complement for the hardware design? e.g. -4 d = 1111 1100 b.
If I represent signed numbers in 2's complement, will I need 9 bits as opposed to 8 bits? Or, how many bits will I need to process the data properly? (With 8 bits, I cannot represent any number above 127 in 2's complement.)
How does division of a negative number work if I use bitwise shifting? If I want to divide the result, -4, by 4 by shifting it right by 2 bits, the result becomes 63 in decimal, 0011 1111 in binary, instead of -1. How can I resolve this problem?
Any help would be appreciated!
If you can choose to use VHDL, then you can use the fixed point library to represent your numbers and choose your rounding mode, as well as allowing bit extensions etc.
In Verilog, well, I'd think twice. I'm not a Verilogger, but the arithmetic rules for mixing signed and unsigned datatypes seem fraught with foot-shooting opportunities.
Another option to consider might be MyHDL as that gives you a very powerful verification environment and allows you to spit out VHDL or Verilog at the back end as you choose.
To process 8-bit pixels, to do things like gamma correction without losing information, we normally upsample the values, work in 16 bits or whatever, and then downsample them to 8 bits.
Now, this is a somewhat new area for me, so please excuse incorrect terminology etc.
For my needs I have chosen to work in "non-standard" Q15, where I only use the upper half of the range (0.0-1.0), and 0x8000 represents 1.0 instead of -1.0. This makes it much easier to calculate things in C.
But I ran into a problem with SSSE3. It has the PMULHRSW instruction, which multiplies Q15 numbers, but it uses the "standard" Q15 range of [-1, 1-2⁻¹⁵], so multiplying (my) 0x8000 (1.0) by 0x4000 (0.5) gives 0xC000 (-0.5), because it thinks 0x8000 is -1. This is quite annoying.
What am I doing wrong? Should I keep my pixel values in the 0000-7FFF range? Doesn't this kind of defeat the purpose of it being a fixed-point format? Is there a way around this? Maybe some trick?
Is there some kind of definitive treatise on Q15 which discusses all this?
Personally, I'd go with the solution of limiting the max value to 0x7FFF (~0.99something).
You don't have to jump through hoops getting the processor to work the way you'd like it.
You don't have to spend a long time documenting the ins and outs of your "weird" code, as operating over 0-0x7FFF will be immediately recognisable to the readers of your code - Q-format is understood (in my experience) to run from -1.0 to +1.0-one lsb. The arithmetic doesn't work out so well otherwise, as the value of 1 lsb is different on each side of the 0!
Unless you can imagine yourself successfully arguing, to a panel of argumentative code reviewers, that that extra bit is critical to the operation of the algorithm rather than just "the last 0.01% of performance", stick to code everyone can understand, and which maps to the hardware you have available.
Alternatively, re-arrange your previous operation so that the pixels all come out as the negative of what you originally had. Or change the following operations to take in the negative of what you previously sent them. Then use values from -1.0 to 0.0 in Q15 format.
If you are sure that you won’t use any number “bigger” than $8000, the only problem would be when at least one of the multipliers is $8000 (–1, though you wish it were 1).
In this case the solution is rather simple:
pmulhrsw xmm0, xmm1
psignw xmm0, xmm0
Or, absolutely equivalent in our case (Thanks, Peter Cordes!):
pmulhrsw xmm0, xmm1
pabsw xmm0, xmm0
This will revert the negative values from multiplying by –1 to their positive values.
Suppose I have 16 ASCII characters (hence 16 8-bit numbers) in a 128-bit variable/register. I want to create a bit mask in which the bits that are set are those whose positions (indexes) are given by the 16 characters.
For example, if the string formed from those 16 characters is "CAD...", then in the bit mask the 67th bit, 65th bit, 68th bit and so on should be 1. The rest of the bits should be 0. What is an efficient way to do this, especially using SIMD instructions?
I know that one technique is addition, like this: 2^(67-1)+2^(65-1)+2^(68-1)+...
But this requires a large number of operations. I want to do it in one or two operations/instructions if possible.
Please let me know a solution.
SSE4.2 contains one instruction that performs almost what you want: PCMPISTRM with immediate operand 0. One of its operands should contain your ASCII characters, and the other a constant vector with values like 32, 33, ... 47. You get the result in the 16 least significant bits of XMM0. Since you need 128 bits, this instruction should be executed 8 times with different constant vectors (6 times if you need only printable ASCII characters). After each PCMPISTRM, use bitwise OR to accumulate the result in some XMM register.
There are 2 disadvantages of this method: (1) you need to read Intel's architectures software developer's manual to understand PCMPISTRM's details, because that's probably the most complicated SSE instruction ever, and (2) this instruction is pretty slow (throughput of 1/2 on Nehalem, 1/3 on Sandy Bridge, 1/4 on Bulldozer), so you'll hardly get any significant speed improvement over the 'brute force' method.