According to the Intel Intrinsics Guide,
vxorpd ymm, ymm, ymm: Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.
vpxor ymm, ymm, ymm: Compute the bitwise XOR of 256 bits (representing integer data) in a and b, and store the result in dst.
What is the difference between the two? It appears to me that both instructions would do a bitwise XOR on all 256 bits of the ymm registers. Is there any performance penalty if I use vxorpd for integer data (and vice versa)?
Combining some comments into an answer:
Other than performance, they have identical behaviour (I think even with a memory argument: same lack of alignment requirements for all AVX instructions).
On Nehalem to Broadwell, (V)PXOR can run on any of the 3 ALU execution ports, p0/p1/p5. (V)XORPS/D can only run on p5.
Some CPUs have a "bypass delay" between integer and FP "domains". Agner Fog's microarch docs say that on SnB / IvB, the bypass delay is sometimes zero. e.g. when using the "wrong" type of shuffle or boolean operation. On Haswell, his examples show that orps has no extra latency when used on the result of an integer instruction, but that por has an extra 1 clock of latency when used on the result of addps.
On Skylake, FP booleans can run on any port, but bypass delay depends on which port they happened to run on. (See Intel's optimization manual for a table). Port5 has no bypass delay between FP math ops, but port 0 or port 1 do. Since the FMA units are on port 0 and 1, the uop issue stage will usually assign booleans to port5 in FP heavy code, because it can see that lots of uops are queued up for p0/p1 but p5 is less busy. (How are x86 uops scheduled, exactly?).
I'd recommend not worrying about this. Tune for Haswell, and Skylake will do fine. Or just always use VPXOR on integer data and VXORPS on FP data; Skylake will do fine (but Haswell might not).
On AMD Bulldozer / Piledriver / Steamroller there is no "FP" version of the boolean ops (see p. 182 of Agner Fog's microarch manual). There's a delay for forwarding data between execution units: 1 cycle for ivec->fp or fp->ivec, 10 cycles for int->ivec (eax -> xmm0) and 8 cycles for ivec->int on Bulldozer (4 and 5 respectively on Steamroller, for movd/pinsrw/pextrw). So either way you can't avoid the bypass delay on AMD by picking the appropriate boolean insn. XORPS does take one less byte to encode than PXOR or XORPD (non-VEX versions; the VEX versions all take 4 bytes).
In any case, bypass delays are just extra latency, not reduced throughput. If these ops aren't part of the longest dep chain in your inner loop, or if you can interleave two iterations in parallel (so you have multiple dependency chains going at once for out-of-order-execution), then PXOR may be the way to go.
On Intel CPUs before Skylake, packed-integer instructions can always run on more ports than their floating-point counterparts, so prefer integer ops.
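For example, a rough sketch with intrinsics (the helper names are mine; the integer version assumes AVX2):

#include <immintrin.h>

// FP data: an FP-domain boolean (vxorpd) keeps the result in the FP domain,
// so it forwards to add/mul/FMA without a bypass delay on Haswell.
static inline __m256d negate_pd(__m256d x) {
    return _mm256_xor_pd(x, _mm256_set1_pd(-0.0));   // flip the sign bits
}

// Integer data: vpxor (AVX2) stays in the integer domain and can run on
// more ports than vxorpd on pre-Skylake Intel CPUs.
static inline __m256i toggle_epi64(__m256i x, __m256i mask) {
    return _mm256_xor_si256(x, mask);
}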
My OpenCL device memory-relevant specs are:
Max compute units: 20
Global memory channels (AMD): 8
Global memory banks per channel (AMD): 4
Global memory bank width (AMD): 256 bytes
Global memory cache line size: 64 bytes
Does this mean that, to use my device's full memory potential, it needs 8 work items on different CUs constantly reading 64-byte chunks of memory? Are memory channels arranged so that different CUs can access memory simultaneously? Are 64-byte memory reads always counted as single reads, or only if the address is % 64 == 0?
Does the memory bank quantity/width have anything to do with memory bandwidth, and is there a way to reason about memory performance with respect to memory banks when writing a kernel?
Memory bank quantity is useful to hint about strided access pattern performance and bank conflicts.
The cache line width must be the L2 cache line size between L2 and the CU (L1). 64 bytes per cycle means 64 GB/s per compute unit (assuming only 1 active cache line per CU at a time and a 1 GHz clock); there can be multiple active lines per L1, e.g. 4. With 20 compute units, the total "L2 to L1" bandwidth must be 1.28 TB/s, but its main advantage over global memory is fewer clock cycles to fetch data.
If you need to utilize global memory, then you need to approach bandwidth limits between L2 and main memory. That is related to memory channel width, number of memory channels and frequency.
GDDR channel width is 64 bits, HBM channel width is 128 bits. A single stack of HBM v1 has 8 channels, so that's a total of 1024 bits or 128 bytes. 128 bytes per cycle means 128 GB/s per GHz. More stacks mean more bandwidth: if 8 GB of memory is made of two stacks, then it's 256 GB/s.
If your data-set fits inside L2 cache, then you expect more bandwidth under repeated access.
But the true performance (instead of on paper) can be measured by a simple benchmark that does pipelined memory copy between two arrays.
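For example, a minimal OpenCL C copy kernel (a sketch: each work item moves one float16, i.e. 64 bytes; the host is assumed to size the buffers, enqueue the kernel a few times and time it with profiling events, counting 2 x bytes / time since every element is read once and written once):

__kernel void copy_bw(__global const float16 *src, __global float16 *dst)
{
    size_t i = get_global_id(0);
    dst[i] = src[i];
}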
Total performance by 8 work items depends on the capability of the compute unit. If it allows only 32 bytes per clock per work item, then you may need more work items. The compute unit must have some optimization phase, like packing similar addresses into one big memory access per CU. So you can even achieve max performance using only a single work group (but with multiple work items, not just 1; the number depends on how big an object each work item accesses and on the CU's capability). You can benchmark this on an array-summation or reduction kernel. Just 1 compute unit is generally more than enough to use the global memory bandwidth, unless its single L2-L1 bandwidth is lower than the global memory bandwidth. But that may not be true for the highest-end cards.
What is the parallelism between L2 and L1 for your card? Only 1 active line at a time? Then you probably require 8 work items distributed over 8 work groups.
According to AMD's datasheet on RDNA, each shader is capable of 10-20 requests in flight, so if the L1-L2 traffic of 1 RDNA compute unit is enough to use all of the global memory bandwidth, then even just a few work items from a single work group should be enough.
L1-L2 bandwidth:
It says 4 lines are active between each L1 and the L2. So it must have 256 GB/s per compute unit. 4 work groups running on different CUs should be enough for 1 TB/s of main memory bandwidth. I guess OpenCL has no access to this information, and it can change for new cards, so the best thing would be to benchmark various settings, from 1 CU to N CUs and from 1 work item to N work items. It shouldn't take much time to measure under no contention (i.e. the whole GPU server is dedicated to you).
Shader bandwidth:
If these are per-shader limits, then a single shader can use all of its own CU L1-L2 bandwidth, especially when reading.
It also says the L0-L1 cache line size is 128 bytes, so 1 work item could use a data type that wide.
N-way set-associative caches (L1 and L2 here) and direct-mapped caches (maybe the texture cache?) use modulo mapping, but the LRU-managed L0 may not require modulo access. Since you need global memory bandwidth, you should look at the L2 cache line, which is n-way set-associative, hence the modulo. Even if data is already in L0, the OpenCL spec may not let you do non-modulo-x access to data. Also, you don't have to think about alignment if the array is of the type of the data you work with.
If you don't want to fiddle with microbenchmarking and don't know how many work items are required, then you can use async work-group copy commands in the kernel. The async copy implementation uses just the required number of shaders (or no shaders at all, depending on hardware). Then you can access the local memory fast, from a single work item.
But a single work item may require an unrolled loop to do the pipelining and use all the bandwidth of its CU. A single read/write operation will not fill the pipeline, so the latency stays visible (not hidden behind other latencies).
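A sketch of what that unrolling might look like (hypothetical kernel; n is assumed to be a multiple of 4):

__kernel void copy_single(__global const float16 *src, __global float16 *dst, int n)
{
    // several loads in flight before the stores, so the latencies overlap
    for (int i = 0; i < n; i += 4) {
        float16 a = src[i], b = src[i+1], c = src[i+2], d = src[i+3];
        dst[i] = a; dst[i+1] = b; dst[i+2] = c; dst[i+3] = d;
    }
}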
Note: the L2 clock frequency can be different from the main memory frequency, not just 1 GHz. There could be an L3 cache or something else adapting to a different frequency in there. Perhaps it's the GPU frequency, like 2 GHz; then all of the L1/L0 bandwidths are also higher, like 512 GB/s per L1-L2 link. You may need to query CL_DEVICE_MAX_CLOCK_FREQUENCY for this. Either way, just 1 CU looks capable of using around 90% of the bandwidth of high-end cards. An RX 6800 XT has 512 GB/s of main memory bandwidth and a 2 GHz GPU clock, so it can likely saturate it with only 1 CU.
This may be a duplicate, and I apologise if so, but I really want a definitive answer, as it seems to change depending on where I look.
Is it acceptable to say that a gigabyte is 1024 megabytes, or should it be 1000 megabytes? I am taking computer science at GCSE, and a typical exam question could be how many bytes are in a kilobyte; I believe the exam board, AQA, gives the answer to such a question as 1024, not 1000. How is this? Are both correct? Which one should I go with?
Thanks in advance- this has got me rather bamboozled!
The sad fact is that it depends on who you ask. But computer terminology is slowly being aligned with normal terminology, in which kilo is 10^3 (1,000), mega is 10^6 (1,000,000), and giga is 10^9 (1,000,000,000).
This is reflected in the International System of Quantities and the International Electrotechnical Commission, which define gigabyte as 10^9 and use gibibyte for the computer-specific 1024 x 1024 x 1024 value.
The reason it "depends who you ask" is that for many years, specifically in relation to "bytes" of storage, the prefixes kilo, mega, and giga meant 1024, 1024^2, and 1024^3. But that flies in the face of normal convention with regard to these prefixes. So again, computer terminology is being aligned with non-computer terminology.
The term gigabyte is commonly used to mean either 1000^3 bytes or 1024^3 bytes depending on the context. Disk manufacturers prefer the decimal term while memory manufacturers use the binary.
Decimal definition
1 GB = 1,000,000,000 bytes (= 1000^3 B = 10^9 B)
Based on powers of 10, this definition uses the prefix as defined in the International System of Units (SI). This is the recommended definition by the International Electrotechnical Commission (IEC). This definition is used in networking contexts and most storage media, particularly hard drives, flash-based storage, and DVDs, and is also consistent with the other uses of the SI prefix in computing, such as CPU clock speeds or measures of performance.
Binary definition
1 GiB = 1,073,741,824 bytes (= 1024^3 B = 2^30 B).
The binary definition uses powers of the base 2, as is the architectural principle of binary computers. This usage is widely promulgated by some operating systems, such as Microsoft Windows in reference to computer memory (e.g., RAM). This definition is synonymous with the unambiguous unit gibibyte.
The difference between units based on decimal and binary prefixes increases as a semi-logarithmic (linear-log) function—for example, the decimal kilobyte value is nearly 98% of the kibibyte, a megabyte is under 96% of a mebibyte, and a gigabyte is just over 93% of a gibibyte value. This means that a 300 GB (279 GiB) hard disk might be indicated variously as 300 GB, 279 GB or 279 GiB, depending on the operating system.
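If you want to check those percentages yourself, a quick C snippet:

#include <stdio.h>

int main(void) {
    // decimal prefix divided by binary prefix, as a percentage
    printf("kB/KiB: %.2f%%\n", 100.0 * 1e3 / 1024.0);                    // ~97.66
    printf("MB/MiB: %.2f%%\n", 100.0 * 1e6 / (1024.0 * 1024));           // ~95.37
    printf("GB/GiB: %.2f%%\n", 100.0 * 1e9 / (1024.0 * 1024 * 1024));    // ~93.13
    printf("300 GB = %.1f GiB\n", 300e9 / (1024.0 * 1024 * 1024));       // ~279.4
    return 0;
}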
The Wikipedia article https://en.wikipedia.org/wiki/Gigabyte has a good writeup of the confusion surrounding the usage of the term.
I have an AVX CPU (which doesn't support AVX2), and I want to compute the bitwise XOR of two 256-bit integers.
Since _mm256_xor_si256 is only available with AVX2, can I load these 256 bits as __m256 using _mm256_load_ps and then do a _mm256_xor_ps? Will this produce the expected result?
My major concern is: if the memory content is not a valid floating-point number, will _mm256_load_ps load the bits into the register exactly as they are in memory?
Thanks.
First of all, if you're doing other things with your 256b integers (like adding/subtracting/multiplying), getting them into vector registers just for the occasional XOR may not be worth the overhead of transferring them. If you have two numbers already in registers (using up 8 total registers), it's only four xor instructions to get the result (and 4 mov instructions if you need to avoid overwriting the destination). The destructive version can run at one result per 1.33 clock cycles on SnB, or one per clock on Haswell and later (where integer xor can run on any of the 4 ALU ports). So if you're just doing a single xor in between some add/adc or whatever, stick with integers.
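For example (a sketch; the struct and helper names are mine):

#include <stdint.h>

// 256-bit value held as four 64-bit limbs in GP registers / memory.
typedef struct { uint64_t limb[4]; } u256;

static inline u256 xor256(u256 a, u256 b) {
    // compiles to four scalar xor instructions, no vector registers involved
    u256 r;
    for (int i = 0; i < 4; i++)
        r.limb[i] = a.limb[i] ^ b.limb[i];
    return r;
}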
Storing to memory in 64b chunks and then doing a 128b or 256b load would cause a store-forwarding failure, adding another several cycles of latency. Using movq / pinsrq would cost more execution resources than xor. Going the other way isn't as bad: 256b store -> 64b loads is fine for store forwarding. movq / pextrq still suck, but would have lower latency (at the cost of more uops).
FP load/store/bitwise operations are architecturally guaranteed not to generate FP exceptions, even when used on bit patterns that represent a signalling NaN. Only actual FP math instructions list math exceptions:
VADDPS
SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid,
Precision, Denormal.
VMOVAPS
SIMD Floating-Point Exceptions
None.
(From Intel's insn ref manual. See the x86 wiki for links to that and other stuff.)
On Intel hardware, either flavour of load/store can feed the FP or the integer domain without extra delay. AMD likewise behaves the same whichever flavour of load/store is used, regardless of where the data is going to / coming from.
Different flavours of vector move instruction actually matter for register<-register moves. On Intel Nehalem, using the wrong mov instruction can cause a bypass delay. On AMD Bulldozer-family, where moves are handled by register renaming rather than actually copying the data (like Intel IvB and later), the dest register inherits the domain of whatever wrote the src register.
No existing design I've read about has handled movapd any differently from movaps. Presumably Intel created movapd as much for decode simplicity as for future planning (e.g. to allow for the possibility of a design where there's a double domain and a single domain, with different forwarding networks). (movapd is movaps with a 66h prefix, just like the double version of every other SSE instruction just has the 66h prefix byte tacked on. Or F2 instead of F3 for scalar instructions.)
Apparently AMD designs tag FP vectors with auxiliary info, because Agner Fog found a large delay when using the output of addps as the input for addpd, for example. I don't think movaps between two addpd instructions, or even xorps would cause that problem, though: only actual FP math. (FP bitwise boolean ops are integer-domain on Bulldozer-family.)
Theoretical throughput on Intel SnB/IvB (the only Intel CPUs with AVX but not AVX2):
256b operations with AVX xorps
VMOVDQU ymm0, [A]
VXORPS ymm0, ymm0, [B]
VMOVDQU [result], ymm0
3 fused-domain uops can issue at one per 0.75 cycles since the pipeline width is 4 fused-domain uops. (Assuming the addressing modes you use for B and result can micro-fuse, otherwise it's 5 fused-domain uops.)
load port: 256b loads / stores on SnB take 2 cycles (split into 128b halves), but this frees up the AGU on port 2/3 to be used by the store. There's a dedicated store-data port, but store-address calculation needs the AGU from a load port.
So with only 128b or smaller loads/stores, SnB/IvB can sustain two memory ops per cycle (with at most one of them being a store). With 256b ops, SnB/IvB can theoretically sustain two 256b loads and one 256b store per two cycles. Cache-bank conflicts usually make this impossible, though.
Haswell has a dedicated store-address port, and can sustain two 256b loads and one 256b store per one cycle, and doesn't have cache bank conflicts. So Haswell is much faster when everything's in L1 cache.
Bottom line: In theory (no cache-bank conflicts) this should saturate SnB's load and store ports, processing 128b per cycle. Port5 (the only port xorps can run on) is needed once every two clocks.
128b ops
VMOVDQU xmm0, [A]
VMOVDQU xmm1, [A+16]
VPXOR xmm0, xmm0, [B]
VPXOR xmm1, xmm1, [B+16]
VMOVDQU [result], xmm0
VMOVDQU [result+16], xmm1
This will bottleneck on address generation, since SnB can only sustain two 128b memory ops per cycle. It will also use 2x as much space in the uop cache, and more x86 machine code size. Barring cache-bank conflicts, this should run with a throughput of one 256b-xor per 3 clocks.
In registers
Between registers, one 256b VXORPS and two 128b VPXOR per clock would saturate SnB. On Haswell, three AVX2 256b VPXOR per clock would give the most XOR-ing per cycle. (XORPS and PXOR do the same thing, but XORPS's output can forward to the FP execution units without an extra cycle of forwarding delay. I guess only one execution unit has the wiring to put an XOR result in the FP domain, so Intel CPUs post-Nehalem only run XORPS on one port.)
Z Boson's hybrid idea:
VMOVDQU ymm0, [A]
VMOVDQU ymm4, [B]
VEXTRACTF128 xmm1, ymm0, 1
VEXTRACTF128 xmm5, ymm4, 1
VPXOR xmm0, xmm0, xmm4
VPXOR xmm1, xmm1, xmm5
VMOVDQU [res], xmm0
VMOVDQU [res+16], xmm1
Even more fused-domain uops (8) than just doing 128b-everything.
Load/store: two 256b loads leave two spare cycles for two store addresses to be generated, so this can still run at two loads/one store of 128b per cycle.
ALU: two port-5 uops (vextractf128), two port0/1/5 uops (vpxor).
So this still has a throughput of one 256b result per 2 clocks, but it's saturating more resources and has no advantage (on Intel) over the 3-instruction 256b version.
There is no problem using _mm256_load_ps to load integers. In fact in this case it's better than using _mm256_load_si256 (which does work with AVX) because you stay in the floating point domain with _mm256_load_ps.
#include <x86intrin.h>
#include <stdio.h>
int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};

    __m256 a8 = _mm256_loadu_ps((float*)a);
    __m256 b8 = _mm256_loadu_ps((float*)b);
    __m256 c8 = _mm256_xor_ps(a8, b8);

    int c[8];
    _mm256_storeu_ps((float*)c, c8);
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
}
If you want to stay in the integer domain you could do
#include <x86intrin.h>
#include <stdio.h>
int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};

    __m256i a8 = _mm256_loadu_si256((__m256i*)a);
    __m256i b8 = _mm256_loadu_si256((__m256i*)b);

    __m128i a8lo = _mm256_castsi256_si128(a8);
    __m128i a8hi = _mm256_extractf128_si256(a8, 1);
    __m128i b8lo = _mm256_castsi256_si128(b8);
    __m128i b8hi = _mm256_extractf128_si256(b8, 1);

    __m128i c8lo = _mm_xor_si128(a8lo, b8lo);
    __m128i c8hi = _mm_xor_si128(a8hi, b8hi);

    int c[8];
    _mm_storeu_si128((__m128i*)&c[0], c8lo);
    _mm_storeu_si128((__m128i*)&c[4], c8hi);
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
}
The _mm256_castsi256_si128 intrinsics are free.
You will probably find that there is little or no difference in performance compared with using 2 x _mm_xor_si128. It's even possible that the AVX implementation will be slower, since _mm256_xor_ps has a reciprocal throughput of 1 on SnB/IvB/Haswell, whereas _mm_xor_si128 has a reciprocal throughput of 0.33.
I'm looking for an SSE instruction which takes two arguments of four 32-bit integers in __m128i, computes the sums of corresponding pairs, and returns the result as two 64-bit integers in __m128i.
Is there an instruction for this?
There are no SSE operations with carry. The way to do this is to first unpack the 32-bit integers (punpckldq/punpckhdq) into 4 groups of 64-bit integers by using an all-zeroes helper vector, and then use 64-bit pairwise addition.
SSE only has this for byte->word and word->dword. (pmaddubsw (SSSE3) and pmaddwd (MMX/SSE2), which vertically multiply v1 * v2, then horizontally add neighbouring pairs.)
I'm not clear on what you want the outputs to be. You have 8 input integers (two vectors of 4), and 2 output integers (one vector of two). Since there's no insn that does any kind of 32+32 -> 64b vector addition, let's just look at how to zero-extend or sign-extend the low two 32b elements of a vector to 64b. You can combine this into whatever you need, but keep in mind there's no add-horizontal-pairs phaddq, only vertical paddq.
phaddd is similar to what you want, but without the widening: low half of the result is the sum of horizontal pairs in the first operand, high half is the sum of horizontal pairs in the second operand. It's pretty much only worth using if you need all those results, and you're not going to combine them further. (i.e. it's usually faster to shuffle and vertical add instead of running phadd to horizontally sum a vector accumulator at the end of a reduction. And if you're going to sum everything down to one result, do normal vertical sums until you're down to one register.) phaddd could be implemented in hardware to be as fast as paddd (single cycle latency and throughput), but it isn't in any AMD or Intel CPU.
Like Mysticial commented, SSE4.1 pmovzxdq / pmovsxdq are exactly what you need, and can even do it on the fly as part of a load from a 64b memory location (containing two 32b integers).
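With intrinsics that looks something like this (a sketch; the helper name is mine):

#include <smmintrin.h>  // SSE4.1
#include <stdint.h>

// Widen the low two 32-bit elements of each input to 64 bits, then add.
// _mm_loadl_epi64 (movq) grabs just 64 bits (two ints) from memory;
// _mm_cvtepi32_epi64 is pmovsxdq, use _mm_cvtepu32_epi64 (pmovzxdq) for unsigned.
static inline __m128i add_pairs_lo64(const int32_t *pa, const int32_t *pb) {
    __m128i a = _mm_cvtepi32_epi64(_mm_loadl_epi64((const __m128i*)pa));
    __m128i b = _mm_cvtepi32_epi64(_mm_loadl_epi64((const __m128i*)pb));
    return _mm_add_epi64(a, b);   // paddq: two 64-bit sums
}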
SSE4.1 was introduced with Intel Penryn, 2nd gen Core2 (45nm die shrink core2), the generation before Nehalem. Falling back to a non-vector code path on CPUs older than that might be ok, depending on how much you care about not being slow on CPUs that are already old and slow.
Without SSE4.1:
Unsigned zero-extension is easy. Like pmdj answered, just use punpck* lo and hi to unpack with zero.
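In intrinsics (a sketch; the helper name is mine):

#include <emmintrin.h>  // SSE2

// Zero-extend the four 32-bit elements of v to 64 bits by interleaving with zeros:
// lo = [zext(v0), zext(v1)], hi = [zext(v2), zext(v3)].
static inline void zext32to64(__m128i v, __m128i *lo, __m128i *hi) {
    __m128i zero = _mm_setzero_si128();
    *lo = _mm_unpacklo_epi32(v, zero);   // punpckldq
    *hi = _mm_unpackhi_epi32(v, zero);   // punpckhdq
}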
If your integers are signed, you'll have to do the sign-extension manually.
There is no psraq, only psrad (Packed Shift Right Arithmetic Dword) and psraw. If there was, you could unpack with itself and then arithmetic right shift by 32b.
Instead, we probably need to generate a vector where each element is turned into its sign bit. Then blend that with an unpacked vector (but pblendw is SSE4.1 too, so we'd have to use por).
Or better, unpack the original vector with a vector of sign-masks.
# input in xmm0
movdqa xmm1, xmm0
movdqa xmm2, xmm0
psrad xmm0, 31 ; xmm0 = sign mask (all-ones or all-zeros per element); xmm1=orig ; xmm2=orig
punpckldq xmm1, xmm0 ; xmm1 = sign-extend(lo64(orig))
punpckhdq xmm2, xmm0 ; xmm2 = sign-extend(hi64(orig))
This should run with 2 cycle latency for both results on Intel SnB or IvB. Haswell and later only have one shuffle port (so they can't do both punpck insns in parallel), so xmm2 will be delayed for another cycle there. Pre-SnB Intel CPUs usually bottleneck on the frontend (decoders, etc) with vector instructions, because they often average more than 4B per insn.
Shifting the original instead of the copy shortens the dependency chain for whatever produces xmm0, for CPUs without move elimination (handling mov instructions at the register-rename stage, so they're zero latency. Intel-only, and only on IvB and later.) With 3-operand AVX instructions, you wouldn't need the movdqa, or the 3rd register, but then you could just use vpmovsx for the low64 anyway. To sign-extend the high 64, you'd probably psrldq byte-shift the high 64 down to the low 64.
Or movhlps or punpckhqdq self,self to use a shorter-to-encode instruction. (or AVX2 vpmovsx to a 256b reg, and then vextracti128 the upper 128, to get both 128b results with only two instructions.)
Unlike GP-register shifts (e.g. sar eax, 31), vector shifts saturate the count instead of masking it. Leaving the original sign bit as the LSB (shifting by 31) instead of a copy of it (shifting by 32) works fine, too. It has the advantage of not requiring a big comment in the code explaining this for people who would worry when they saw psrad xmm0, 32.
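The same unpack-with-signmask idea with intrinsics (a sketch; the helper name is mine):

#include <emmintrin.h>  // SSE2

// Sign-extend the four 32-bit elements of v to 64 bits: broadcast each element's
// sign bit with an arithmetic shift, then interleave it above the element.
static inline void sext32to64(__m128i v, __m128i *lo, __m128i *hi) {
    __m128i sign = _mm_srai_epi32(v, 31);   // psrad: 0 or -1 per element
    *lo = _mm_unpacklo_epi32(v, sign);      // punpckldq
    *hi = _mm_unpackhi_epi32(v, sign);      // punpckhdq
}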
Paraphrasing from the book "Programming Pearls" (about the C language on older machines, since the book is from the late 90s):
Integer arithmetic operations (+, -, *) can take around 10 nano seconds whereas the % operator takes up to 100 nano seconds.
Why is there that much difference?
How does a modulus operator work internally?
Is it same as division (/) in terms of time?
The modulus/modulo operation is usually understood as the integer equivalent of the remainder operation - a side effect or counterpart to division.
Except for some degenerate cases (where the divisor is a power of the operating base - i.e. a power of 2 for most number formats) this is just as expensive as integer division!
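For example (a sketch; with an unsigned operand and a constant power-of-two divisor, compilers reduce % to a single AND):

#include <stdint.h>

uint32_t mod8(uint32_t x)  { return x % 8u; }    // same as x & 7u: one AND instruction
uint32_t mod10(uint32_t x) { return x % 10u; }   // needs a real division (or the
                                                 // compiler's multiply-by-reciprocal trick)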
So the question is really, why is integer division so expensive?
I don't have the time or expertise to analyze this mathematically, so I'm going to appeal to grade school maths:
Consider the number of lines of working out in the notebook (not including the inputs) required for:
Equality: (Boolean operations) essentially none - in computer "big O" terms this is known as O(1)
Addition: two, working right to left, one line for the output and one line for the carry. This is an O(N) operation.
Long multiplication: n*(n+1) + 2: two lines for each of the digit products (one for the total, one for the carry) plus a final total and carry. So O(N^2), but with a fixed N (32 or 64), and it can be pipelined in silicon to less than that.
Long division: unknown, depends upon the argument size - it's a recursive descent, and some instances descend faster than others (1,000,000 / 500,000 requires fewer lines than 1,000 / 7). Also, each step is essentially a series of multiplications to isolate the closest factors. (Multiple algorithms exist.) Feels like O(N^3) with variable N.
So in simple terms, this should give you a feel for why division and hence modulo is slower: computers still have to do long division in the same stepwise fashion that you did in grade school.
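To make that concrete, here is a sketch of schoolbook (restoring) binary long division: one compare/subtract step per bit, which is why a 32-bit divide costs tens of cycles while add and multiply are pipelined down to one or a few:

#include <stdint.h>

// Restoring binary long division (assumes d != 0): 32 iterations of shift,
// compare, subtract.
static uint32_t divmod_u32(uint32_t n, uint32_t d, uint32_t *rem) {
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);        // bring down the next bit
        if (r >= d) { r -= d; q |= 1u << i; } // does d go into r?
    }
    *rem = r;   // n % d
    return q;   // n / d
}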
If this makes no sense to you, you may have been brought up on school math a little more modern than mine (30+ years ago).
The Order/Big O notation used above as O(something) expresses the complexity of a computation in terms of the size of its inputs, and expresses a fact about its execution time. http://en.m.wikipedia.org/wiki/Big_O_notation
O(1) executes in constant (but possibly large) time. O(N) takes as much time as the size of its data, so if the data is 32 bits it takes 32 times the O(1) time of one of its N steps, and O(N^2) takes N times N (N squared) the time of its N steps (or possibly N times MN for some constant M). Etc.
In the above working I have used O(N) rather than O(N^2) for addition since the 32 or 64 bits of the first input are calculated in parallel by the CPU. In a hypothetical 1 bit machine a 32 bit addition operation would be O(32^2) and change. The same order reduction applies to the other operations too.