Is there an AVX, AVX2, or AVX512 function like _mm256_mulhi_epu16, but for 8-bit?

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3967,3970&text=_mm256_mulhi_epu16
Essentially, what I need is "_mm256_mulhi_epu8" (which does not seem to exist), which would
"Multiply the packed unsigned 8-bit integers in a and b, producing intermediate 16-bit integers, and store the high 8 bits of the intermediate integers in dst."
Is there a way to do that with any of the 256-bit or 512-bit instruction sets on x86?
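As far as I can tell there is no per-element 8-bit multiply at any vector width, so the usual workaround is to widen each byte to 16 bits, multiply, keep the high 8 bits of each 16-bit product, and re-pack. A minimal AVX2 sketch (the helper name is mine, not a real intrinsic):

#include <immintrin.h>

/* Emulates the missing "_mm256_mulhi_epu8": high 8 bits of the 16-bit products
   of unsigned bytes.  unpack/pack operate within 128-bit lanes, so the element
   order of the result matches the inputs. */
static inline __m256i mulhi_epu8_avx2(__m256i a, __m256i b)
{
    const __m256i zero = _mm256_setzero_si256();
    __m256i a_lo = _mm256_unpacklo_epi8(a, zero);   /* zero-extend bytes to words */
    __m256i a_hi = _mm256_unpackhi_epi8(a, zero);
    __m256i b_lo = _mm256_unpacklo_epi8(b, zero);
    __m256i b_hi = _mm256_unpackhi_epi8(b, zero);
    __m256i lo = _mm256_srli_epi16(_mm256_mullo_epi16(a_lo, b_lo), 8);  /* high byte of each product */
    __m256i hi = _mm256_srli_epi16(_mm256_mullo_epi16(a_hi, b_hi), 8);
    return _mm256_packus_epi16(lo, hi);             /* results are <= 255, so packus is exact */
}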

Related

How does an 8-bit processor interpret the 2 bytes of a 16-bit number as a single piece of info?

Assume the 16-bit number is 256.
So,
byte 1 = some binary number
byte 2 = some binary number
But byte 1 also represents an 8-bit number (which could be an independent value), and so does byte 2.
So how does the processor know that bytes 1 and 2 represent the single number 256 and not two separate numbers?
The processor would need another, wider type for that. I guess you could implement a software equivalent, but to the processor these two bytes would still have individual values.
The processor could also have a special integer representation and machine instructions that handle these numbers. For example, most modern machines use two's-complement integers to represent negative numbers. In two's complement, the most significant bit is used to differentiate negative numbers, so a two's-complement 8-bit integer has a range of -128 (1000 0000) to 127 (0111 1111).
You could easily have the most significant bit mean something else, so for example, when the MSB is 0 we have integers from 0 (0000 0000) to 127 (0111 1111); when the MSB is 1 we have integers from 256 (1000 0000) to 256 + 127 (1111 1111). Whether this is efficient or good architecture is another story.
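A minimal C sketch of the "software equivalent" mentioned above (the type and helper names are mine): the program decides, by convention, that two bytes form one 16-bit value; the hardware still only sees individual bytes plus an add-with-carry.

#include <stdint.h>

/* Two bytes treated, by convention, as one 16-bit number (e.g. 256 is hi=0x01, lo=0x00). */
typedef struct { uint8_t lo, hi; } u16_soft;

static uint16_t u16_value(u16_soft x)
{
    return (uint16_t)(((uint16_t)x.hi << 8) | x.lo);   /* hi*256 + lo */
}

/* 16-bit addition done 8 bits at a time, the way an 8-bit CPU would (ADD, then ADC). */
static u16_soft u16_add(u16_soft a, u16_soft b)
{
    u16_soft r;
    uint16_t lo = (uint16_t)a.lo + b.lo;      /* add the low bytes first...      */
    r.lo = (uint8_t)lo;
    r.hi = (uint8_t)(a.hi + b.hi + (lo >> 8)); /* ...then the high bytes + carry */
    return r;
}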

Why does the pixel shader return a float4 when the back buffer format is DXGI_FORMAT_B8G8R8A8_UNORM?

Alright, so this has been bugging me for a while now, and I could not find anything on MSDN that goes into the specifics that I need.
This is more of a 3-part question, so here goes:
1) When creating the swap chain, applications specify a back-buffer pixel format, most often either B8G8R8A8 or R8G8B8A8. This gives 8 bits per color channel, so a total of 4 bytes is used per pixel... so why does the pixel shader have to return a color as a float4, when a float4 is actually 16 bytes?
2) When binding textures to the pixel shader, my textures are in DXGI_FORMAT_B8G8R8A8_UNORM format, so why does the sampler need a float4 per pixel to work?
3) Am I missing something here? Am I overthinking this, or what?
Please provide links to support your claims, preferably from MSDN!
GPUs are designed to perform calculations on 32-bit floating point data, at least if they want to support D3D11. As of D3D10 you can also perform 32-bit signed and unsigned integer operations. There's no requirement or language support for types smaller than 4 bytes in HLSL, so there's no "byte/char" or "short" for 1- and 2-byte integers or lower-precision floating point.
Any DXGI formats that use the "FLOAT", "UNORM" or "SNORM" suffix are non-integer formats, while "UINT" and "SINT" are unsigned and signed integer. Any reads performed by the shader on the first three types will be provided to the shader as 32-bit floating point, irrespective of whether the original format was 8-bit UNORM/SNORM or 10/11/16/32-bit floating point. Data in vertices is usually stored at a lower precision than full-fat 32-bit floating point to save memory, but by the time it reaches the shader it has already been converted to 32-bit float.
On output (to UAVs or render targets) the GPU compresses the "float" or "uint" data to whatever format the target was created in. If you try outputting float4(4.4, 5.5, 6.6, 10.1) to a target that is 8-bit normalised, it'll simply be clamped to (1.0, 1.0, 1.0, 1.0) and only consume 4 bytes per pixel.
So to answer your questions:
1) Because shaders only operate on 32-bit types, but the GPU will compress/truncate your output as necessary to be stored in the resource you currently have bound, according to its type. It would be madness to have special keywords and types for every format that the GPU supported.
2) The "sampler" doesn't "need a float4 per pixel to work". I think you're mixing your terminology. Declaring the texture as Texture2D<float4> is really just stating that the texture has four components and is of a format that is not an integer format. "float" doesn't necessarily mean the source data is 32-bit float (or even floating point at all), merely that the data has a fractional component to it (e.g. 0.54, 1.32). Equally, declaring a texture as Texture2D<uint4> doesn't necessarily mean the source data is 32-bit unsigned, just that it contains four components of unsigned integer data. Either way, the data will be converted to 32-bit float or 32-bit integer for use inside the shader.
3) You're missing the fact that the GPU decompresses textures / vertex data on reads and compresses it again on writes. The amount of storage used for your vertices/texture data is only as much as the format that you create the resource in, and has nothing to do with the fact that the shader operates on 32-bit floats / integers.
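A plain-C sketch of the 8-bit UNORM conversion described above (the function names are mine, not part of the D3D API): reads expand each byte to a float in [0,1]; render-target writes clamp the float and quantise it back to a byte.

#include <stdint.h>

static float unorm8_to_float(uint8_t u)
{
    return u / 255.0f;                      /* 0..255 -> 0.0..1.0 on read */
}

static uint8_t float_to_unorm8(float f)
{
    if (f < 0.0f) f = 0.0f;                 /* clamp, like the render-target write */
    if (f > 1.0f) f = 1.0f;
    return (uint8_t)(f * 255.0f + 0.5f);    /* quantise, round to nearest */
}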

SSE instruction to sum 32-bit integers to 64-bit

I'm looking for an SSE instruction which takes two arguments of four 32-bit integers in __m128i, computes the sums of corresponding pairs, and returns the result as two 64-bit integers in __m128i.
Is there an instruction for this?
There are no SSE add operations with carry. The way to do this is to first unpack the 32-bit integers (punpckldq/punpckhdq) against an all-zeroes helper vector, giving zero-extended 64-bit integers, and then add the corresponding 64-bit elements (paddq).
SSE only has this for byte->word and word->dword. (pmaddubsw (SSSE3) and pmaddwd (MMX/SSE2), which vertically multiply v1 * v2, then horizontally add neighbouring pairs.)
I'm not clear on what you want the outputs to be. You have 8 input integers (two vectors of 4), and 2 output integers (one vector of two). Since there's no insn that does any kind of 32+32 -> 64b vector addition, let's just look at how to zero-extend or sign-extend the low two 32b elements of a vector to 64b. You can combine this into whatever you need, but keep in mind there's no add-horizontal-pairs phaddq, only vertical paddq.
phaddd is similar to what you want, but without the widening: low half of the result is the sum of horizontal pairs in the first operand, high half is the sum of horizontal pairs in the second operand. It's pretty much only worth using if you need all those results, and you're not going to combine them further. (i.e. it's usually faster to shuffle and vertical add instead of running phadd to horizontally sum a vector accumulator at the end of a reduction. And if you're going to sum everything down to one result, do normal vertical sums until you're down to one register.) phaddd could be implemented in hardware to be as fast as paddd (single cycle latency and throughput), but it isn't in any AMD or Intel CPU.
Like Mysticial commented, SSE4.1 pmovzxdq / pmovsxdq are exactly what you need, and can even do it on the fly as part of a load from a 64b memory location (containing two 32b integers).
SSE4.1 was introduced with Intel Penryn, 2nd gen Core2 (45nm die shrink core2), the generation before Nehalem. Falling back to a non-vector code path on CPUs older than that might be ok, depending on how much you care about not being slow on CPUs that are already old and slow.
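In intrinsics, that SSE4.1 path could look like this (the helper name is mine); swap in _mm_cvtepu32_epi64 if the inputs are unsigned:

#include <smmintrin.h>   /* SSE4.1 */

/* Sign-extend the low two 32-bit elements of each input (pmovsxdq),
   then add the 64-bit elements vertically (paddq). */
static inline __m128i add_lo2_epi32_widen(__m128i a, __m128i b)
{
    __m128i a64 = _mm_cvtepi32_epi64(a);
    __m128i b64 = _mm_cvtepi32_epi64(b);
    return _mm_add_epi64(a64, b64);          /* two 64-bit sums, no overflow */
}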
Without SSE4.1:
Unsigned zero-extension is easy. Like pmdj answered, just use punpck* lo and hi to unpack with zero.
If your integers are signed, you'll have to do the sign-extension manually.
There is no psraq, only psrad (Packed Shift Right Arithmetic Dword) and psraw. If there was, you could unpack with itself and then arithmetic right shift by 32b.
Instead, we probably need to generate a vector where each element is turned into its sign bit. Then blend that with an unpacked vector (but pblendw is SSE4.1 too, so we'd have to use por).
Or better, unpack the original vector with a vector of sign-masks.
; input in xmm0
movdqa    xmm1, xmm0
movdqa    xmm2, xmm0
psrad     xmm0, 31       ; xmm0 = all-ones or all-zeros depending on sign of each input element
                         ; now: xmm0 = signmask ; xmm1 = orig ; xmm2 = orig
punpckldq xmm1, xmm0     ; xmm1 = sign-extend(lo64(orig))
punpckhdq xmm2, xmm0     ; xmm2 = sign-extend(hi64(orig))
This should run with 2 cycle latency for both results on Intel SnB or IvB. Haswell and later only have one shuffle port (so they can't do both punpck insns in parallel), so xmm2 will be delayed for another cycle there. Pre-SnB Intel CPUs usually bottleneck on the frontend (decoders, etc) with vector instructions, because they often average more than 4B per insn.
Shifting the original instead of the copy shortens the dependency chain for whatever produces xmm0, for CPUs without move elimination (handling mov instructions at the register-rename stage, so they're zero latency. Intel-only, and only on IvB and later.) With 3-operand AVX instructions, you wouldn't need the movdqa, or the 3rd register, but then you could just use vpmovsx for the low64 anyway. To sign-extend the high 64, you'd probably psrldq byte-shift the high 64 down to the low 64.
Or movhlps or punpckhqdq self,self to use a shorter-to-encode instruction. (or AVX2 vpmovsx to a 256b reg, and then vextracti128 the upper 128, to get both 128b results with only two instructions.)
Unlike GP-register shifts (e.g. sar eax, 31), vector shifts saturate the count instead of masking it. Leaving the original sign bit as the LSB (shifting by 31) instead of a copy of it (shifting by 32) works fine, too. It has the advantage of not requiring a big comment in the code explaining this for people who would worry when they saw psrad xmm0, 32.
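The same widening add without SSE4.1, using the psrad sign-mask trick from the asm above (intrinsics sketch; the helper name is mine):

#include <emmintrin.h>   /* SSE2 */

static inline __m128i add_lo2_epi32_widen_sse2(__m128i a, __m128i b)
{
    __m128i a64 = _mm_unpacklo_epi32(a, _mm_srai_epi32(a, 31));  /* sign-extend lo64(a) */
    __m128i b64 = _mm_unpacklo_epi32(b, _mm_srai_epi32(b, 31));  /* sign-extend lo64(b) */
    return _mm_add_epi64(a64, b64);
}

For unsigned inputs, unpack against _mm_setzero_si128() instead of the sign mask.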

Hardware implementation for integer data processing

I am currently trying to implement a datapath which processes image data expressed in grayscale as unsigned integers 0 - 255. (Just for your information, my goal is to implement a Discrete Wavelet Transform on an FPGA.)
During the data processing, intermediate values can be negative as well. As an example, one of the calculations is
result = 48 - floor((66+39)/2)
The floor function is used to keep the processing in integers. For the above case, the result is -4, which is out of the 0~255 range.
Having mentioned above case, I have a series of basic questions.
To deal with the negative intermediate numbers, do I need to represent all the data as the 'equivalent unsigned number' in 2's complement for the hardware design? e.g. -4 d = 1111 1100 b.
If I represent the data as 2's complement for the signed numbers, will I need 9 bits as opposed to 8 bits? Or how many bits will I need to process the data properly? (With 8 bits, I cannot represent any number above 127 in 2's complement.)
How does negative-number division work if I use bitwise shifting? If I want to divide the result, -4, by 4 by shifting it right by 2 bits, the result becomes 63 in decimal (0011 1111 in binary) instead of -1. How can I resolve this problem?
Any help would be appreciated!
If you can choose to use VHDL, then you can use the fixed point library to represent your numbers and choose your rounding mode, as well as allowing bit extensions etc.
In Verilog, well, I'd think twice. I'm not a Verilogger, but the arithmetic rules for mixing signed and unsigned datatypes seem fraught with foot-shooting opportunities.
Another option to consider might be MyHDL as that gives you a very powerful verification environment and allows you to spit out VHDL or Verilog at the back end as you choose.
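As a side note on the shifting question, a small C check illustrates the difference (illustrative only, the real design is in an HDL; right-shifting a negative value is implementation-defined in C but arithmetic on common compilers): a logical shift of the bit pattern of -4 gives 63, while an arithmetic shift, which signed types get, gives -1.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int8_t  s = -4;                 /* 1111 1100 in two's complement */
    uint8_t u = (uint8_t)s;         /* same bit pattern, 0xFC        */

    printf("%d\n", s >> 2);         /* arithmetic shift: -1          */
    printf("%d\n", u >> 2);         /* logical shift:    63          */
    return 0;
}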

Packing a 16-bit floating point variable into two 8-bit variables (aka half into two bytes)

I code on XNA and only have access to shader model 3, hence no bit-shift operators. I need to pack two random 16-bit floating point variables (meaning NOT in range [0,1] but ANY RANDOM FLOAT VARIABLE) into two 8-bit variables. There is no way to normalize them.
I thought about doing bitshifting manually but I can't find a good article on how to convert a random decimal float (not [0,1]) into binary and back.
Thanks
This is not really a good idea - a 16-bit float already has very limited range and precision. Remember that 8 bits leaves you with just 256 possible values!
Getting an 8-bit value into a shader is trivial; passing it in as a colour is one method. You can use each channel as a normalised range, from 0 to 1.
Of course, you say you don't want to normalise your values. So I assume you want to maintain the nice floating-point property of a wide range with better precision closer to zero.
(Now would be a good time to read some background info on floating-point. Especially about half-precision floating-point and minifloats and microfloats.)
One way to do that would be to encode your values using a logarithm and an exponent (to encode and decode, respectively). This is basically exactly what the floating-point format itself does. The exact maths will depend on the precision and the range that you desire - (which 256 values will you represent?) - so I will leave it as an exercise.
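A rough C sketch of that log/exp idea (all constants and names here are my own arbitrary choices, not anything standard): the sign goes in the top bit, and log2 of the magnitude is mapped onto the remaining 7 bits over a chosen range. Which 256 values you cover is the real design decision.

#include <math.h>
#include <stdint.h>

#define LOG_MIN -8.0f    /* smallest representable magnitude ~ 2^-8; pick for your data */
#define LOG_MAX  8.0f    /* largest representable magnitude  ~ 2^+8 */

static uint8_t encode8(float x)
{
    if (x == 0.0f) return 0;                                   /* reserve 0 for zero */
    float l = log2f(fabsf(x));
    if (l < LOG_MIN) l = LOG_MIN;
    if (l > LOG_MAX) l = LOG_MAX;
    int m = 1 + (int)((l - LOG_MIN) / (LOG_MAX - LOG_MIN) * 126.0f + 0.5f);  /* 1..127 */
    return (uint8_t)((x < 0.0f ? 0x80 : 0x00) | (uint8_t)m);   /* sign bit | magnitude code */
}

static float decode8(uint8_t b)
{
    int m = b & 0x7F;
    if (m == 0) return 0.0f;
    float mag = exp2f(LOG_MIN + (m - 1) / 126.0f * (LOG_MAX - LOG_MIN));
    return (b & 0x80) ? -mag : mag;
}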
