XMM register values - sse

I am finding it hard to interpret the values of the XMM registers in the register window of Visual Studio. The window displays the following:
XMM0 = 00000000000000004018000000000000 XMM1 = 00000000000000004020000000000000
XMM2 = 00000000000000000000000000000000 XMM3 = 00000000000000000000000000000000
XMM4 = 00000000000000000000000000000000 XMM5 = 00000000000000000000000000000000
XMM6 = 00000000000000000000000000000000 XMM7 = 00000000000000000000000000000000
XMM00 = +0.00000E+000 XMM01 = +2.37500E+000 XMM02 = +0.00000E+000
XMM03 = +0.00000E+000 XMM10 = +0.00000E+000 XMM11 = +2.50000E+000
XMM12 = +0.00000E+000 XMM13 = +0.00000E+000
From the code that I am running, the values of XMM0 and XMM1 should be 6 and 8 (or the other way round). The register value shown here is: XMM01 = +2.37500E+000
What does this translate to?

Yes, it looks like:
XMM0 = { 6.0, 0.0 } // 6.0 = 0x4018000000000000 (double precision)
XMM1 = { 8.0, 0.0 } // 8.0 = 0x4020000000000000 (double precision)
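The reason you are having problems interpreting this is that your debugger is only displaying each 128-bit XMM register in hex and then below that as 4 x single-precision floats, but you are evidently using double-precision floats. The value XMM01 = +2.37500E+000 is just the upper 32 bits of the double 6.0 (0x40180000) reinterpreted as a single-precision float.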
I'm not familiar with the Visual Studio debugger, but there should ideally be a way to change the representation of your XMM registers - you may have to look at the manual or online help for this.
Note that in general using double precision with SSE is rarely of any value, particularly if you have a fairly modern x86 CPU with two FPUs.
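To check the interpretation outside the debugger, here is a minimal Python sketch that reinterprets the two hex dumps from the question both ways. The helper name xmm_views is made up for this illustration, and it assumes the register window prints the most significant byte first.

```python
import struct

def xmm_views(hex128):
    """Return (doubles, floats) for a 128-bit register printed as 32 hex digits."""
    # Assumption: the dump shows the most significant byte first, so reverse it
    # to get little-endian memory order, where element 0 comes first.
    raw = bytes.fromhex(hex128)[::-1]
    doubles = struct.unpack("<2d", raw)  # 2 x 64-bit double precision
    floats = struct.unpack("<4f", raw)   # 4 x 32-bit single precision
    return doubles, floats

xmm0 = xmm_views("00000000000000004018000000000000")
xmm1 = xmm_views("00000000000000004020000000000000")
print(xmm0)  # ((6.0, 0.0), (0.0, 2.375, 0.0, 0.0))
print(xmm1)  # ((8.0, 0.0), (0.0, 2.5, 0.0, 0.0))
```

The second element of the float view matches the XMM01 and XMM11 values in the register window, while the double view gives the expected 6 and 8.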

Related

Extract 4 bits vs 2Bit of Bluetooth HEX Data, why same method results in error

This is a follow-on question to this SO question (Extract 4 bits of Bluetooth HEX Data), which has an accepted answer. I want to understand why the method I was using (example below), which works for my data, does not work when applied to the data in that question.
To decode Cycling Power Data, the first 2 bytes are the flags, which are used to determine what capabilities the power meter provides.
guard let characteristicData = characteristic.value else { return -1 }
var byteArray = [UInt8](characteristicData)
// This is the output from the Sensor (In Decimal and Hex)
// DEC [35, 0, 25, 0, 96, 44, 0, 33, 229] Hex:{length = 9, bytes = 0x23001900602c0021e5} FirstByte:100011
/// First 2 bytes are the flags
let flags = byteArray[1]<<8 + byteArray[0]
This is meant to concatenate the first 2 bytes into the flags field. I then mask the flags field to get the relevant bit position.
eg: to get power balance, I do (flags & 0x01 > 0)
This method works and I'm a happy camper.
However, why does this same method not work when I use it on the data from SO Extract 4 bits of Bluetooth HEX Data? This is decoding Bluetooth FTMS data (different from the above).
guard let characteristicData = characteristic.value else { return -1 }
let byteArray = [UInt8](characteristicData)
let nsdataStr = NSData.init(data: (characteristic.value)!)
print("pwrFTMS 2ACC Feature Array:[\(byteArray.count)]\(byteArray) Hex:\(nsdataStr)")
PwrFTMS 2ACC Feature Array:[8][2, 64, 0, 0, 8, 32, 0, 0] Hex:{length = 8, bytes = 0x0240000008200000}
Based on the specs, the returned data has 2 fields, each of them 4 octets long.
doing
byteArray[3]<<24 + byteArray[2]<<16 + byteArray[1]<<8 + byteArray[0]
to join the first 4 bytes gives the wrong value to start the decoding from.
edit: Added clarification
There is a problem with this code that you say works... but it seems to work "accidentally":
let flags = byteArray[1]<<8 + byteArray[0]
This results in a UInt8, but the flags field in the first table is 16 bits. Note that byteArray[1] << 8 always evaluates to 0, because you are shifting all of the bits of the byte out of the byte. It appeared to work because the only bit you were interested in was in byteArray[0].
So you need to convert it to 16-bit (or larger) first and then shift it:
let flags = (UInt16(byteArray[1]) << 8) + UInt16(byteArray[0])
Now flags is UInt16
Similarly when you do 4 bytes, you need them to be 32-bit values, before you shift. So
let flags = UInt32(byteArray[3]) << 24
+ UInt32(byteArray[2]) << 16
+ UInt32(byteArray[1]) << 8
+ UInt32(byteArray[0])
but since that's just reading a 32-bit value from a sequence of bytes that are in little endian byte order, and all current Apple devices (and the vast majority of all other modern computers) are little endian machines, here is an easier way:
let flags = byteArray.withUnsafeBytes {
$0.bindMemory(to: UInt32.self)[0]
}
In summary, in both cases you had only been preserving byte 0 in your shift-add, because the other shifts all evaluated to 0 due to shifting the bits completely out of the byte. It just so happened that in the first case byteArray[0] contained the information you needed. In general, it's necessary to first promote the value to the size you need for the result, and then shift it.
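The effect can be reproduced outside Swift. The following is a small Python model, not Swift code: shl_u8 is a hypothetical helper that truncates a shift result to 8 bits to mimic a UInt8 shift, and Python's unbounded integers stand in for UInt32. The parentheses around each shift are needed in Python, where + binds more tightly than <<.

```python
def shl_u8(value, shift):
    """Left shift with the result truncated to 8 bits, mimicking a UInt8 shift."""
    return (value << shift) & 0xFF

data = bytes([2, 64, 0, 0, 8, 32, 0, 0])  # FTMS feature bytes from the question

# Shifting within 8 bits: every shifted term becomes 0, so only byte 0 survives.
narrow = shl_u8(data[3], 24) + shl_u8(data[2], 16) + shl_u8(data[1], 8) + data[0]

# Widening before the shift keeps all four bytes.
wide = (data[3] << 24) + (data[2] << 16) + (data[1] << 8) + data[0]

# Same result by reading each 4-octet field as a little-endian integer.
first = int.from_bytes(data[0:4], "little")
second = int.from_bytes(data[4:8], "little")

print(narrow, wide, first, second)  # 2 16386 16386 8200
```

Here 16386 is 0x4002, so the narrow version silently drops the 0x40 carried in byte 1.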

Memory allocation and solve linear systems in Julia

I'm using Julia 1.5.0. Consider the following code:
using LinearAlgebra
using Distributions
using BenchmarkTools
function solve_b!(A, tol_iters)
b = [1.0 2.0]'
luA = lu!(A)
x = [0.0; 0.0]
for i =1:tol_iters
A[1,1] += 0.001
A[2,2] += 0.001
luA = lu!(A)
ldiv!(x, luA, b)
end
end
A = rand(2,2)
solve_b!(A, 1000)
If I run this with julia --track-allocation=user, I see that most of the memory allocation comes from b = [1.0 2.0]' and x = [0.0; 0.0]. That is, when I look at the .mem file, I see the following:
96 b = [1.0 2.0]'
0 luA = lu!(A)
96 x = [0.0; 0.0]
The memory allocation increases as I increase tol_iters.
Can someone explain why? I'm using lu! and ldiv!, so I would expect the update to be in-place. Therefore there should not be any additional memory allocation associated with the number of iterations.

SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

I am looking for an efficient AVX (AVX512) implementation of
// Given
float u[8];
float v[8];
// Compute
float a[8];
float b[8];
// Such that
for ( int i = 0; i < 8; ++i )
{
a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
b[i] = fabs(u[i]) < fabs(v[i]) ? u[i] : v[i];
}
I.e., I need to select element-wise into a from u and v based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.
I had this exact same problem just the other day. The solution I came up with (using AVX only) was:
// take the absolute value of u and v
__m256 sign_bit = _mm256_set1_ps(-0.0f);
__m256 u_abs = _mm256_andnot_ps(sign_bit, u);
__m256 v_abs = _mm256_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__m256 u_ge_v = _mm256_cmp_ps(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
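__m256 a = _mm256_blendv_ps(v, u, u_ge_v); // blendv takes its 2nd source where the mask is set: u where |u| >= |v|, else v
__m256 b = _mm256_blendv_ps(u, v, u_ge_v); // v where |u| >= |v|, else u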
The AVX512 equivalent would be:
// take the absolute value of u and v
__m512 sign_bit = _mm512_set1_ps(-0.0f);
__m512 u_abs = _mm512_andnot_ps(sign_bit, u);
__m512 v_abs = _mm512_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__mmask16 u_ge_v = _mm512_cmp_ps_mask(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
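__m512 a = _mm512_mask_blend_ps(u_ge_v, v, u); // mask_blend takes its 2nd source where the mask bit is set: u where |u| >= |v|, else v
__m512 b = _mm512_mask_blend_ps(u_ge_v, u, v); // v where |u| >= |v|, else u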
As Peter Cordes suggested in the comments above, there are other approaches as well like taking the absolute value followed by a min/max and then reinserting the sign bit, but I couldn't find anything that was shorter/lower latency than this sequence of instructions.
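A scalar model can help check the operand order of the blends. The Python sketch below is only an illustration: blendv and select_abs are made-up helper names, and it assumes the documented behaviour where the blend takes its second source wherever the mask is set. It reproduces the loop from the question for a few sample values, including a tie in absolute value.

```python
def blendv(a, b, mask):
    """Model of a vector blend: take b[i] where mask[i] is set, else a[i]."""
    return [bi if m else ai for ai, bi, m in zip(a, b, mask)]

def select_abs(u, v):
    """Return (a, b): a gets the larger-magnitude element, b the smaller one."""
    mask = [abs(x) >= abs(y) for x, y in zip(u, v)]  # |u| >= |v| per element
    a = blendv(v, u, mask)  # u where the mask is set, else v
    b = blendv(u, v, mask)  # v where the mask is set, else u
    return a, b

u = [1.0, -3.0, 2.0, -2.0]
v = [-2.0, 1.0, -2.0, 0.5]
a, b = select_abs(u, v)
print(a)  # [-2.0, -3.0, 2.0, -2.0]
print(b)  # [1.0, 1.0, -2.0, 0.5]
```

In the third element the magnitudes are equal, and u goes to a while v goes to b, as the loop in the question requires.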
Actually, there is another approach using AVX512DQ's VRANGEPS via the _mm512_range_ps() intrinsic. Intel's intrinsic guide describes it as follows:
Calculate the max, min, absolute max, or absolute min (depending on control in imm8) for packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst. imm8[1:0] specifies the operation control: 00 = min, 01 = max, 10 = absolute max, 11 = absolute min. imm8[3:2] specifies the sign control: 00 = sign from a, 01 = sign from compare result, 10 = clear sign bit, 11 = set sign bit.
Note that there appears to be a typo in the above; actually imm8[1:0] == 10 is "absolute min" and imm8[1:0] == 11 is "absolute max" if you look at the details of the per-element operation:
CASE opCtl[1:0] OF
0: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src1[31:0] : src2[31:0]
1: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src2[31:0] : src1[31:0]
2: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src1[31:0] : src2[31:0]
3: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src2[31:0] : src1[31:0]
ESAC
CASE signSelCtl[1:0] OF
0: dst[31:0] := (src1[31] << 31) OR (tmp[30:0])
1: dst[31:0] := tmp[31:0]
2: dst[31:0] := (0 << 31) OR (tmp[30:0])
3: dst[31:0] := (1 << 31) OR (tmp[30:0])
ESAC
RETURN dst
So you can get the same result with just two instructions:
auto a = _mm512_range_ps(v, u, 0x7); // 0b0111 = sign from compare result, absolute max
auto b = _mm512_range_ps(v, u, 0x6); // 0b0110 = sign from compare result, absolute min
The argument order (v, u) is a bit unintuitive, but it's needed in order to get the same behavior that you described in the OP in the event that the elements have equal absolute value (namely, that the value from u is passed through to a, and v goes to b).
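The per-element pseudocode above can be transcribed into a scalar Python model to check the immediates and the tie behaviour. This is only a sketch: range_ps here is a plain Python function named after the intrinsic, and it ignores the NaN and signed-zero special cases that the real instruction handles.

```python
import math

def range_ps(src1, src2, imm8):
    """Scalar model of one VRANGEPS lane (NaN and signed-zero cases ignored)."""
    op = imm8 & 0b11               # imm8[1:0]: operation control
    sign_ctl = (imm8 >> 2) & 0b11  # imm8[3:2]: sign control
    if op == 0:    # min
        tmp = src1 if src1 <= src2 else src2
    elif op == 1:  # max
        tmp = src2 if src1 <= src2 else src1
    elif op == 2:  # absolute min
        tmp = src1 if abs(src1) <= abs(src2) else src2
    else:          # absolute max
        tmp = src2 if abs(src1) <= abs(src2) else src1
    if sign_ctl == 0:
        return math.copysign(tmp, src1)  # sign from src1
    if sign_ctl == 1:
        return tmp                       # sign from compare result
    if sign_ctl == 2:
        return abs(tmp)                  # clear sign bit
    return -abs(tmp)                     # set sign bit

u = [1.0, -3.0, 2.0, -2.0]
v = [-2.0, 1.0, -2.0, 0.5]
a = [range_ps(y, x, 0x7) for x, y in zip(u, v)]  # (v, u), absolute max
b = [range_ps(y, x, 0x6) for x, y in zip(u, v)]  # (v, u), absolute min
print(a)  # [-2.0, -3.0, 2.0, -2.0]
print(b)  # [1.0, 1.0, -2.0, 0.5]
```

With the (v, u) order, the tie in the third element sends u to a and v to b, matching the scalar loop in the question.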
On Skylake and Ice Lake Xeon platforms (probably any of the Xeons that have dual FMA units), VRANGEPS has a throughput of 2 per cycle, so the two operations can issue and execute simultaneously, with a latency of 4 cycles. This is only a modest latency improvement on the original approach, but the throughput is better and it requires fewer instructions/uops/instruction cache space.
clang does a pretty reasonable job of auto-vectorizing it with -ffast-math and the necessary __restrict qualifiers: https://godbolt.org/z/NMvN1u. It ANDs both inputs with a mask to take their absolute values, compares once, and uses vblendvps twice on the original inputs with the same mask but with the sources in the opposite order to get min and max.
That's pretty much what I was thinking before checking what compilers did, and looking at their output to firm up the details I hadn't thought through yet. I don't see anything more clever than that. I don't think we can avoid abs()ing both u and v separately; there's no cmpps compare predicate that compares magnitudes and ignores the sign bit.
// untested. blendv(a, b, mask) takes b where the mask is set, so this
// returns max_abs and stores min_abs through the pointer.
#include <immintrin.h>
// returns max_abs
__m256 minmax_abs(__m256 u, __m256 v, __m256 *min_result) {
const __m256 signbits = _mm256_set1_ps(-0.0f);
__m256 abs_u = _mm256_andnot_ps(signbits, u);
__m256 abs_v = _mm256_andnot_ps(signbits, v); // strip the sign bit
__m256 maxabs_is_v = _mm256_cmp_ps(abs_u, abs_v, _CMP_LT_OS); // |u| < |v|
*min_result = _mm256_blendv_ps(v, u, maxabs_is_v); // u where |u| < |v|, else v
return _mm256_blendv_ps(u, v, maxabs_is_v); // v where |u| < |v|, else u
}
You'd do the same thing with AVX512 except you compare into a mask instead of another vector.
// returns max_abs, stores min_abs through the pointer
__m512 minmax_abs512(__m512 u, __m512 v, __m512 *min_result) {
const __m512 absmask = _mm512_castsi512_ps(_mm512_set1_epi32(0x7fffffff));
__m512 abs_u = _mm512_and_ps(absmask, u);
__m512 abs_v = _mm512_and_ps(absmask, v); // strip the sign bit
__mmask16 maxabs_is_v = _mm512_cmp_ps_mask(abs_u, abs_v, _CMP_LT_OS); // |u| < |v|
*min_result = _mm512_mask_blend_ps(maxabs_is_v, v, u); // u where |u| < |v|, else v
return _mm512_mask_blend_ps(maxabs_is_v, u, v); // v where |u| < |v|, else u
}
Clang compiles the return statement in an interesting way (Godbolt):
.LCPI2_0:
.long 2147483647 # 0x7fffffff
minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*): # #minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*)
vbroadcastss zmm2, dword ptr [rip + .LCPI2_0]
vandps zmm3, zmm0, zmm2
vandps zmm2, zmm1, zmm2
vcmpltps k1, zmm3, zmm2
vblendmps zmm2 {k1}, zmm1, zmm0
vmovaps zmmword ptr [rdi], zmm2 ## store the blend result
vmovaps zmm0 {k1}, zmm1 ## interesting choice: blend merge-masking
ret
Instead of using another vblendmps, clang notices that zmm0 already holds one of the blend inputs, and uses merge-masking with a regular vector vmovaps. This has zero advantage on Skylake-AVX512 for 512-bit vectors (both are single-uop instructions for port 0 or 5). But if Agner Fog's instruction tables are right, vblendmps x/y/zmm only ever runs on port 0 or 5, while a masked 256-bit or 128-bit vmovaps x/ymm{k}, x/ymm can run on any of p0/p1/p5.
Both are single-uop / single-cycle latency, unlike AVX2 vblendvps based on a mask vector, which is 2 uops. (So AVX512 is an advantage even for 256-bit vectors.) Unfortunately, none of gcc, clang, or ICC turn the _mm256_cmp_ps into _mm256_cmp_ps_mask and optimize the AVX2 intrinsics to AVX512 instructions when compiling with -march=skylake-avx512.
s/512/256/ to make a version of minmax_abs512 that uses AVX512 for 256-bit vectors.
Gcc goes even further, and does the questionable "optimization" of
vmovaps zmm2, zmm1 # tmp118, v
vmovaps zmm2{k1}, zmm0 # tmp118, tmp114, tmp118, u
instead of using one blend instruction. (I keep thinking I'm seeing a store followed by a masked store, but no, neither compiler is blending that way).

Converting decimal number to flag values

I have some constants like so:
interesting = 0x1
choked = 0x2
remote_interested = 0x4
remote_choked = 0x8
supports_extensions = 0x10
local_connection = 0x20
handshake = 0x40
connecting = 0x80
queued = 0x100
on_parole = 0x200
seed = 0x400
optimistic_unchoke = 0x800
rc4_encrypted = 0x100000
plaintext_encrypted = 0x200000
and the documentation tells me 'The flags attribute tells you in which state the peer is in. It is set to any combination of the enums above' so basically I call the dll and it fills in the structure with a decimal number representing the flag values, a few examples:
2086227
170
2098227
106
How do I determine the flags from the decimal number?
In order to determine which flags were set, you need to use the bitwise AND operation (bit32.band() in Lua 5.2). For example:
function hasFlags(int, ...)
local all = bit32.bor(...)
return bit32.band(int, all) == all
end
if hasFlags(2086227, interesting, local_connection) then
-- do something that has interesting and local_connection
end
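The same bitwise AND test can also list every flag that is set in a value. Here is a minimal Python sketch of that idea (not Lua). The FLAGS table copies the constants from the question, while decode_flags and has_flags are names made up for this illustration.

```python
# Flag constants copied from the question.
FLAGS = {
    "interesting": 0x1,
    "choked": 0x2,
    "remote_interested": 0x4,
    "remote_choked": 0x8,
    "supports_extensions": 0x10,
    "local_connection": 0x20,
    "handshake": 0x40,
    "connecting": 0x80,
    "queued": 0x100,
    "on_parole": 0x200,
    "seed": 0x400,
    "optimistic_unchoke": 0x800,
    "rc4_encrypted": 0x100000,
    "plaintext_encrypted": 0x200000,
}

def decode_flags(value):
    """Return the names of all known flags whose bit is set in value."""
    return [name for name, bit in FLAGS.items() if value & bit]

def has_flags(value, *bits):
    """True if every given flag bit is set in value."""
    wanted = 0
    for bit in bits:
        wanted |= bit
    return value & wanted == wanted

print(decode_flags(170))
# ['choked', 'remote_choked', 'local_connection', 'connecting']
print(decode_flags(106))
# ['choked', 'remote_choked', 'local_connection', 'handshake']
print(has_flags(170, FLAGS["choked"], FLAGS["connecting"]))  # True
```

Bits that are set in the value but missing from the table are simply skipped by decode_flags.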

F#/"Accelerator v2" DFT algorithm implementation probably incorrect

I'm trying to experiment with software defined radio concepts. From this article I've tried to implement a GPU-parallelism Discrete Fourier Transform.
I'm pretty sure I could pre-calculate 90 degrees of the sin(i) cos(i) and then just flip and repeat rather than what I'm doing in this code and that that would speed it up. But so far, I don't even think I'm getting correct answers. An all-zeros input gives a 0 result as I'd expect, but all 0.5 as inputs gives 78.9985886f (I'd expect a 0 result in this case too). Basically, I'm just generally confused. I don't have any good input data and I don't know what to do with the result or how to verify it.
This question is related to my other post here
open Microsoft.ParallelArrays
open System
// X64MulticoreTarget is faster on my machine, unexpectedly
let target = new DX9Target() // new X64MulticoreTarget()
ignore(target.ToArray1D(new FloatParallelArray([| 0.0f |]))) // Dummy operation to warm up the GPU
let stopwatch = new System.Diagnostics.Stopwatch() // For benchmarking
let Hz = 50.0f
let fStep = (2.0f * float32(Math.PI)) / Hz
let shift = 0.0f // offset, once we have to adjust for the last batch of samples of a stream
// If I knew that the periodic function is periodic
// at whole-number intervals, I think I could keep
// shift within a smaller range to support streams
// without overflowing shift - but I haven't
// figured that out
//let elements = 8192 // maximum for a 1D array - makes sense as 2^13
//let elements = 7240 // maximum on my machine for a 2D array, but why?
let elements = 7240
// need good data!!
let buffer : float32[,] = Array2D.init<float32> elements elements (fun i j -> 0.5f) //(float32(i * elements) + float32(j)))
let input = new FloatParallelArray(buffer)
let seqN : float32[,] = Array2D.init<float32> elements elements (fun i j -> (float32(i * elements) + float32(j)))
let steps = new FloatParallelArray(seqN)
let shiftedSteps = ParallelArrays.Add(shift, steps)
let increments = ParallelArrays.Multiply(fStep, steps)
let cos_i = ParallelArrays.Cos(increments) // Real component series
let sin_i = ParallelArrays.Sin(increments) // Imaginary component series
stopwatch.Start()
// From the documentation, I think ParallelArrays.Multiply does standard element by
// element multiplication, not matrix multiplication
// Then we sum each element for each complex component (I don't understand the relationship
// of this, or the importance of the generalization to complex numbers)
let real = target.ToArray1D(ParallelArrays.Sum(ParallelArrays.Multiply(input, cos_i))).[0]
let imag = target.ToArray1D(ParallelArrays.Sum(ParallelArrays.Multiply(input, sin_i))).[0]
printf "%A in " ((real * real) + (imag * imag)) // sum the squares for the presence of the frequency
stopwatch.Stop()
printfn "%A" stopwatch.ElapsedMilliseconds
ignore (System.Console.ReadKey())
I share your surprise that your answer is not closer to zero. I'd suggest writing naive code to perform your DFT in F# and seeing if you can track down the source of the discrepancy.
Here's what I think you're trying to do:
let N = 7240
let F = 1.0f/50.0f
let pi = single System.Math.PI
let signal = [| for i in 1 .. N*N -> 0.5f |]
let real =
seq { for i in 0 .. N*N-1 -> signal.[i] * (cos (2.0f * pi * F * (single i))) }
|> Seq.sum
let img =
seq { for i in 0 .. N*N-1 -> signal.[i] * (sin (2.0f * pi * F * (single i))) }
|> Seq.sum
let power = real*real + img*img
Hopefully you can use this naive code to get a better intuition for how the accelerator code ought to behave, which could guide you in your testing of the accelerator code. Keep in mind that part of the reason for the discrepancy may simply be the precision of the calculations - there are ~52 million elements in your arrays, so accumulating a total error of 79 may not actually be too bad. FWIW, I get a power of ~0.05 when running the above single precision code, but a power of ~4e-18 when using equivalent code with double precision numbers.
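For test data with a known answer, a small double-precision reference can be handy. The Python sketch below (not F#) computes the same single-bin sums; dft_power is a name invented here, and the 5000-sample length is an arbitrary choice that covers a whole number of 50-sample periods. A constant input should give a power close to 0, and a cosine at the probed frequency should give a power close to (n/2)^2.

```python
import math

def dft_power(signal, cycles_per_sample):
    """Power of one DFT bin: squared real sum plus squared imaginary sum."""
    w = 2.0 * math.pi * cycles_per_sample
    real = math.fsum(s * math.cos(w * i) for i, s in enumerate(signal))
    imag = math.fsum(s * math.sin(w * i) for i, s in enumerate(signal))
    return real * real + imag * imag

n = 5000                      # 100 whole periods of a 50-sample cycle
freq = 1.0 / 50.0             # cycles per sample, as in the naive F# code
constant = [0.5] * n          # no energy at freq: power should be ~0
tone = [math.cos(2.0 * math.pi * freq * i) for i in range(n)]

p_const = dft_power(constant, freq)
p_tone = dft_power(tone, freq)
print(p_const)                # very close to 0
print(p_tone, (n / 2) ** 2)   # close to each other
```

Feeding the same two inputs to the accelerator code gives a concrete pass/fail check rather than an unexplained number.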
Two suggestions:
ensure you're not somehow confusing degrees with radians
try doing it sans-parallelism, or just with F#'s asyncs for parallelism
(In F#, if you have an array of floats
let a : float[] = ...
then you can 'add a step to all of them in parallel' to produce a new array with
let aShift = a |> Array.map (fun x -> async { return x + shift })
|> Async.Parallel |> Async.RunSynchronously
(though I expect this might be slower than just doing a synchronous loop).)
