SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

I am looking for an efficient AVX (or AVX-512) implementation of
// Given
float u[8];
float v[8];
// Compute
float a[8];
float b[8];
// Such that
for ( int i = 0; i < 8; ++i )
{
a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
b[i] = fabs(u[i]) < fabs(v[i]) ? u[i] : v[i];
}
I.e., I need to select element-wise into a from u and v based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.

I had this exact same problem just the other day. The solution I came up with (using AVX only) was:
// take the absolute value of u and v
__m256 sign_bit = _mm256_set1_ps(-0.0f);
__m256 u_abs = _mm256_andnot_ps(sign_bit, u);
__m256 v_abs = _mm256_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__m256 u_ge_v = _mm256_cmp_ps(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
__m256 a = _mm256_blendv_ps(v, u, u_ge_v);
__m256 b = _mm256_blendv_ps(u, v, u_ge_v);
The AVX512 equivalent would be:
// take the absolute value of u and v
__m512 sign_bit = _mm512_set1_ps(-0.0f);
__m512 u_abs = _mm512_andnot_ps(sign_bit, u);
__m512 v_abs = _mm512_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__mmask16 u_ge_v = _mm512_cmp_ps_mask(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
__m512 a = _mm512_mask_blend_ps(u_ge_v, v, u);
__m512 b = _mm512_mask_blend_ps(u_ge_v, u, v);
As Peter Cordes suggested in the comments above, there are other approaches as well like taking the absolute value followed by a min/max and then reinserting the sign bit, but I couldn't find anything that was shorter/lower latency than this sequence of instructions.
Actually, there is another approach using AVX512DQ's VRANGEPS via the _mm512_range_ps() intrinsic. Intel's intrinsic guide describes it as follows:
Calculate the max, min, absolute max, or absolute min (depending on control in imm8) for packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst. imm8[1:0] specifies the operation control: 00 = min, 01 = max, 10 = absolute max, 11 = absolute min. imm8[3:2] specifies the sign control: 00 = sign from a, 01 = sign from compare result, 10 = clear sign bit, 11 = set sign bit.
Note that there appears to be a typo in the above; actually imm8[1:0] == 10 is "absolute min" and imm8[1:0] == 11 is "absolute max" if you look at the details of the per-element operation:
CASE opCtl[1:0] OF
0: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src1[31:0] : src2[31:0]
1: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src2[31:0] : src1[31:0]
2: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src1[31:0] : src2[31:0]
3: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src2[31:0] : src1[31:0]
ESAC
CASE signSelCtl[1:0] OF
0: dst[31:0] := (src1[31] << 31) OR (tmp[30:0])
1: dst[31:0] := tmp[31:0]
2: dst[31:0] := (0 << 31) OR (tmp[30:0])
3: dst[31:0] := (1 << 31) OR (tmp[30:0])
ESAC
RETURN dst
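To make those imm8 values easier to read at the call sites below, the two bit fields can be spelled out as named constants (these names are my own, not from Intel's headers):
// imm8 layout for _mm512_range_ps, following the pseudo-code above:
// bits [1:0] select the operation, bits [3:2] select the sign handling.
#define RANGE_OP_MIN      0x0
#define RANGE_OP_MAX      0x1
#define RANGE_OP_ABS_MIN  0x2  // per the pseudo-code (the prose description has these two swapped)
#define RANGE_OP_ABS_MAX  0x3
#define RANGE_SIGN_SRC1   (0x0 << 2)
#define RANGE_SIGN_CMP    (0x1 << 2)  // sign taken from the compare result
#define RANGE_SIGN_CLEAR  (0x2 << 2)
#define RANGE_SIGN_SET    (0x3 << 2)
// e.g. 0x7 == (RANGE_SIGN_CMP | RANGE_OP_ABS_MAX), 0x6 == (RANGE_SIGN_CMP | RANGE_OP_ABS_MIN)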
So you can get the same result with just two instructions:
auto a = _mm512_range_ps(v, u, 0x7); // 0b0111 = sign from compare result, absolute max
auto b = _mm512_range_ps(v, u, 0x6); // 0b0110 = sign from compare result, absolute min
The argument order (v, u) is a bit unintuitive, but it's needed in order to get the same behavior that you described in the OP in the event that the elements have equal absolute value (namely, that the value from u is passed through to a, and v goes to b).
On Skylake and Ice Lake Xeon platforms (probably any of the Xeons with dual FMA units?), VRANGEPS has a throughput of 2 per clock, so the two operations can issue and execute simultaneously, with a latency of 4 cycles. This is only a modest latency improvement over the original approach, but the throughput is better and it needs fewer instructions/uops and less instruction-cache space.

clang does a pretty reasonable job of auto-vectorizing it with -ffast-math and the necessary __restrict qualifiers: https://godbolt.org/z/NMvN1u. It ANDs both inputs to ABS them, compares once, then does vblendvps twice on the original inputs with the same mask but the sources in opposite orders, to get the min and the max.
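For reference, the scalar source being auto-vectorized would be something along these lines (a sketch with assumed names and signature; the exact function in the Godbolt link may differ):
#include <math.h>
// Scalar reference: a[] gets the larger-magnitude element, b[] gets the smaller one.
// The __restrict qualifiers promise the compiler that the outputs don't alias the inputs.
void select_minmax_abs(float *__restrict a, float *__restrict b,
const float *__restrict u, const float *__restrict v, int n)
{
for (int i = 0; i < n; ++i) {
a[i] = fabsf(u[i]) >= fabsf(v[i]) ? u[i] : v[i];
b[i] = fabsf(u[i]) < fabsf(v[i]) ? u[i] : v[i];
}
}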
That's pretty much what I was thinking before checking what compilers did, and looking at their output to firm up the details I hadn't thought through yet. I don't see anything more clever than that. I don't think we can avoid abs()ing both u and v separately; there's no cmpps compare predicate that compares magnitudes and ignores the sign bit.
// untested: I *might* have reversed min/max, but I think this is right.
#include <immintrin.h>
// returns the max-abs elements; the min-abs elements go to *min_result
__m256 minmax_abs(__m256 u, __m256 v, __m256 *min_result) {
const __m256 signbits = _mm256_set1_ps(-0.0f);
__m256 abs_u = _mm256_andnot_ps(signbits, u);
__m256 abs_v = _mm256_andnot_ps(signbits, v); // strip the sign bit
__m256 maxabs_is_v = _mm256_cmp_ps(abs_u, abs_v, _CMP_LT_OS); // u < v
*min_result = _mm256_blendv_ps(v, u, maxabs_is_v);
return _mm256_blendv_ps(u, v, maxabs_is_v);
}
You'd do the same thing with AVX512 except you compare into a mask instead of another vector.
// returns the max-abs elements; the min-abs elements go to *min_result
__m512 minmax_abs512(__m512 u, __m512 v, __m512 *min_result) {
const __m512 absmask = _mm512_castsi512_ps(_mm512_set1_epi32(0x7fffffff));
__m512 abs_u = _mm512_and_ps(absmask, u);
__m512 abs_v = _mm512_and_ps(absmask, v); // strip the sign bit
__mmask16 maxabs_is_v = _mm512_cmp_ps_mask(abs_u, abs_v, _CMP_LT_OS); // u < v
*min_result = _mm512_mask_blend_ps(maxabs_is_v, v, u);
return _mm512_mask_blend_ps(maxabs_is_v, u, v);
}
Clang compiles the return statement in an interesting way (Godbolt):
.LCPI2_0:
.long 2147483647 # 0x7fffffff
minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*): # @minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*)
vbroadcastss zmm2, dword ptr [rip + .LCPI2_0]
vandps zmm3, zmm0, zmm2
vandps zmm2, zmm1, zmm2
vcmpltps k1, zmm3, zmm2
vblendmps zmm2 {k1}, zmm1, zmm0
vmovaps zmmword ptr [rdi], zmm2 ## store the blend result
vmovaps zmm0 {k1}, zmm1 ## interesting choice: blend merge-masking
ret
Instead of using another vblendmps, clang notices that zmm0 already has one of the blend inputs, and uses merge-masking with a regular vector vmovaps. This has zero advantage on Skylake-AVX512 for 512-bit vblendmps (both are single-uop instructions for port 0 or 5), but if Agner Fog's instruction tables are right, vblendmps x/y/zmm only ever runs on port 0 or 5, while a masked 256-bit or 128-bit vmovaps x/ymm{k}, x/ymm can run on any of p0/p1/p5.
Both are single-uop / single-cycle latency, unlike AVX2 vblendvps based on a mask vector, which is 2 uops. (So AVX512 is an advantage even for 256-bit vectors. Unfortunately, none of gcc, clang, or ICC turn the _mm256_cmp_ps into _mm256_cmp_ps_mask and optimize the AVX2 intrinsics to AVX512 instructions when compiling with -march=skylake-avx512.)
s/512/256/ to make a version of minmax_abs512 that uses AVX512 for 256-bit vectors.
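For example (an untested sketch with the same caveats as above; needs AVX512VL):
// returns the max-abs elements; the min-abs elements go to *min_result
__m256 minmax_abs256(__m256 u, __m256 v, __m256 *min_result) {
const __m256 absmask = _mm256_castsi256_ps(_mm256_set1_epi32(0x7fffffff));
__m256 abs_u = _mm256_and_ps(absmask, u);
__m256 abs_v = _mm256_and_ps(absmask, v); // strip the sign bit
__mmask8 maxabs_is_v = _mm256_cmp_ps_mask(abs_u, abs_v, _CMP_LT_OS); // u < v
*min_result = _mm256_mask_blend_ps(maxabs_is_v, v, u);
return _mm256_mask_blend_ps(maxabs_is_v, u, v);
}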
Gcc goes even further, and does the questionable "optimization" of
vmovaps zmm2, zmm1 # tmp118, v
vmovaps zmm2{k1}, zmm0 # tmp118, tmp114, tmp118, u
instead of using one blend instruction. (I keep thinking I'm seeing a store followed by a masked store, but no, neither compiler is blending that way).

Minimizing memory usage in Julia function

This function is a workhorse which I want to optimize. Any idea on how its memory usage can be limited would be great.
function F(len, rNo, n, ratio = 0.5)
s = zeros(len); m = copy(s); d = copy(s);
s[rNo]=1
rNo ≤ len-1 && (m[rNo + 1] = s[rNo+1] = -n[rNo])
rNo > 1 && (m[rNo - 1] = s[rNo-1] = n[rNo-1])
r=1
while true
for i ∈ 2:len-1
d[i] = (n[i]*m[i+1] - n[i-1]*m[i-1])/(r+1)
end
d[1] = n[1]*m[2]/(r+1);
d[len] = -n[len-1]*m[len-1]/(r+1);
for i ∈ 1:len
s[i]+=d[i]
end
sum(abs.(d))/sum(abs.(m)) < ratio && break #converged
m = copy(d); r+=1
end
return reshape(s, 1, :)
end
It calculates rows of a special matrix exponential which I stack later.
Although the full method is quite a bit faster than the built-in exp thanks to the special properties, it takes up far more memory as measured by @time.
Since I am a noob in memory management and also in Julia, I am sure it can be optimized quite a bit.
Am I doing something obviously wrong?
I think most of your allocations come from sum(abs.(d))/sum(abs.(m)) < ratio && break #converged. If you replace it with sum(abs, d)/sum(abs, m) < ratio && break #converged those allocations should go away (it will also be a speed boost).
Your other allocations can be removed by replacing m = copy(d) with m .= d which does an element-wise copy.
There are also a couple of style things where I think you could make this a nicer function to read and use. My changes would be as follows
function F(rNo, v, ratio = 0.5)
len = length(v)
s = zeros(len+1); m = copy(s); d = copy(s);
s[rNo]=1
rNo ≤ len && (m[rNo + 1] = s[rNo+1] = -v[rNo])
rNo > 1 && (m[rNo - 1] = s[rNo-1] = v[rNo-1])
r=1
while true
for i ∈ 2:len
d[i] = (v[i]*m[i+1] - v[i-1]*m[i-1]) / (r+1)
end
d[1] = v[1]*m[2]/(r+1);
d[end] = -v[end]*m[end-1]/(r+1);
s .+= d
sum(abs, d)/sum(abs, m) < ratio && break #converged
m .= d; r+=1
end
return reshape(s, 1, :)
end
The most notable change is removing len from the arguments. Including an array length argument is common in C (and probably other languages) where finding the length of an array is hard, but in Julia length is cheap (O(1)), and adding extra arguments is just more clutter and confusion for the people using the function. I also made use of the fact that Julia is able to turn s[end] into s[length(s)] to make this a little cleaner. Also, in general when using Julia you should look for ways to use dotted operations rather than writing for loops. The for loops will be fast, but why take 3 lines to do what you could in 1 shorter line? (I also renamed n to v since to me n is a number and v is a vector, but that is pure preference.)
I hope this helps.

Octave fminunc doesn't converge

I'm trying to use fminunc in Octave for a logistic problem, but it doesn't work. It says that I didn't define variables, but actually I did. If I define the variables directly in the costFunction, and not in the main script, it doesn't complain, but the function doesn't really work. In fact the exitFlag is equal to -3 and it doesn't converge at all.
Here's my function:
function [jVal, gradient] = cost(theta, X, y)
X = [1,0.14,0.09,0.58,0.39,0,0.55,0.23,0.64;1,-0.57,-0.54,-0.16,0.21,0,-0.11,-0.61,-0.35;1,0.42,0.45,-0.41,-0.6,0,-0.44,0.38,-0.29];
y = [1;0;1];
theta = [0.8;0.2;0.6;0.3;0.4;0.5;0.6;0.2;0.4];
jVal = 0;
jVal = costFunction2(X, y, theta); %this is another function that gives me jVal. I'm quite sure it is
%correct because I use it also with other algorithms and it
%works perfectly
m = length(y);
xSize = size(X, 2);
gradient = zeros(xSize, 1);
sig = X * theta;
h = 1 ./(1 + exp(-sig));
for i = 1:m
for j = 1:xSize
gradient(j) = (1/m) * sum(h(i) - y(i)) .* X(i, j);
end
end
Here's my main:
theta = [0.8;0.2;0.6;0.3;0.4;0.5;0.6;0.2;0.4];
options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, functionVal, exitFlag] = fminunc(@cost, theta, options)
If I run it:
optTheta =
0.80000
0.20000
0.60000
0.30000
0.40000
0.50000
0.60000
0.20000
0.40000
functionVal = 0.15967
exitFlag = -3
How can I resolve this problem?
You are not in fact using fminunc correctly. From the documentation:
-- fminunc (FCN, X0)
-- fminunc (FCN, X0, OPTIONS)
FCN should accept a vector (array) defining the unknown variables,
and return the objective function value, optionally with gradient.
'fminunc' attempts to determine a vector X such that 'FCN (X)' is a
local minimum.
What you are passing is not a handle to a function that accepts a single vector argument. Instead, what you are passing (i.e. @cost) is a handle to a function that takes three arguments.
You need to 'convert' this into a handle to a function that takes only one input, and does what you want under the hood. The easiest way to do this is by 'wrapping' your cost function into an anonymous function that only takes one argument, and calls the cost function in the appropriate way, e.g.
fminunc( @(t) cost(t, X, y), theta, options )
Note: This assumes X and y are defined in the scope where you do this 'wrapping' business

Which sigmoid function to use for increasing contrast of an image?

Is there a standard sigmoidal function used to increase the contrast in a gray level bitmap?
Currently I am using the following. This would be applied to gray levels represented at values between 0 and 1 inclusive.
static double ContrastCurve(double val, double k = 1)
{
Func<double,double> logistic_func = (double x) => 1.0 / (1.0 + Math.Exp(-k * (x - 0.5)));
var low = logistic_func(0);
var high = logistic_func(1);
var range = high - low;
var value = logistic_func(val);
return (value - low) / range;
}
This is the logistic function applied to a value between 0 and 1 with the output normalized so that the output is also in [0...1]. This function works but it is completely arbitrary, something I just made up, so the k param has no official name or meaning in image processing literature and so forth.
If there is a function that is standard I would prefer that, but I haven't found anything that seems definitive. Code such as this link seems just as ad hoc to me.
As Mark Setchell's comment notes, ImageMagick uses the following function citing "Fundamentals of Image Processing", Hany Farid:
g(u) = 1 / [1 + exp(-α*u + β)]
scaled such that for domain [0..1] its range is [0..1].
This is essentially a two-parameter version of the function defined in the code in the question, i.e. the code in the question implements the same function but makes the substitution α = k and β = k/2, which yields a one-parameter function f where f(0.5) = 0.5 when scaled such that f(0) = 0 and f(1) = 1.
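For reference, here is a small C sketch of that two-parameter form, normalized over [0, 1] the same way as the code in the question (the alpha/beta parameter names are mine):
#include <math.h>
/* ImageMagick-style sigmoid g(u) = 1 / (1 + exp(-alpha*u + beta)),
rescaled so that inputs in [0, 1] map onto outputs in [0, 1]. */
static double contrast_curve(double u, double alpha, double beta)
{
double g0 = 1.0 / (1.0 + exp(beta)); /* g(0) */
double g1 = 1.0 / (1.0 + exp(beta - alpha)); /* g(1) */
double gu = 1.0 / (1.0 + exp(beta - alpha * u));
return (gu - g0) / (g1 - g0);
}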

How would one create a bitwise rotation function in dart?

I'm in the process of creating a cryptography package for Dart (https://pub.dev/packages/steel_crypt). Right now, most of what I've done is either exposed from PointyCastle or simple-ish algorithms where bitwise rotations are unnecessary or replaceable by >> and <<.
However, as I move toward complicated cryptography solutions, which I can do mathematically, I'm unsure of how to implement bitwise rotation in Dart with maximum efficiency. Because of the nature of cryptography, the speed part is emphasized and uncompromising, in that I need the absolute fastest implementation.
I've ported a method of bitwise rotation from Java. I'm pretty sure this is correct, but unsure of the efficiency and readability:
My tested implementation is below:
int INT_BITS = 64; //Dart ints are 64 bit
static int leftRotate(int n, int d) {
//In n<<d, last d bits are 0.
//To put the first d bits of n at
//last, do bitwise-or of n<<d with
//n >> (INT_BITS - d)
return (n << d) | (n >> (INT_BITS - d));
}
static int rightRotate(int n, int d) {
//In n>>d, first d bits are 0.
//To put the last d bits of n at
//first, we do bitwise-or of n>>d with
//n << (INT_BITS - d)
return (n >> d) | (n << (INT_BITS - d));
}
EDIT (for clarity): Dart has no unsigned shift operators, meaning that >> is an arithmetic (sign-extending) right shift, which bears more significance than I might have thought. It poses a challenge that other languages don't in terms of devising an answer. The accepted answer below explains this and also shows the correct method of bitwise rotation.
As pointed out, Dart has no >>> (unsigned right shift) operator, so you have to rely on the signed shift operator.
In that case,
int rotateLeft(int n, int count) {
const bitCount = 64; // make it 32 for JavaScript compilation.
assert(count >= 0 && count < bitCount);
if (count == 0) return n;
// Dart's >> is an arithmetic (sign-extending) shift, so mask off the
// sign-extended upper bits to emulate an unsigned, zero-filling shift.
return (n << count) |
((n >> (bitCount - count)) & ((1 << count) - 1));
}
should work.
This code only works for the native VM. When compiling to JavaScript, numbers are doubles, and bitwise operations are only done on 32-bit numbers.

Inferring latches in Verilog/SystemVerilog

The statements in procedural blocks execute sequentially, so why don't any of block1, block2, or block3 infer a latch?
module testing(
input logic a, b, c,
output logic x, y, z, v
);
logic tmp_ref, tmp1, tmp2, tmp3;
//reference
always_comb begin: ref_block
tmp_ref = a & b;
x = tmp_ref ^ c;
end
always_comb begin: block1
y = tmp1 ^ c;
tmp1 = a & b;
end
always @(*) begin: block2
tmp2 <= a & b;
z = tmp2 ^ c;
end
always @(c) begin: block3
tmp3 = a & b;
v = tmp3 ^ c;
end
endmodule: testing
In block1 y is calculated using the blocking assignment before the new value of tmp1 is available.
In block2 tmp2 is calculated using a non-blocking assignment, which postpones the update until the always block finishes. Meanwhile, z is calculated using a blocking assignment, so the new value of tmp2 is not yet available.
In block3 there is an incomplete sensitivity list and still no latch.
Here is the synthesis result from Quartus II 14.1: no latch is inferred for any of these blocks.
Only when I add this block is a latch inferred:
//infers a latch
always @(*) begin: block4
if (c == 1'b1) begin
tmp4 = a & b;
w = tmp4 ^ c;
end
end
Can someone please explain why an incomplete sensitivity list or using a variable before its value is updated does not infer a latch in a combinatorial block?
The type of assignment used in a combinatorial block will not affect synthesis. The use of non-blocking (<=) assignments may result in RTL (pre-synthesis) to gates (post-synthesis) simulator mismatches.
The same is true for sensitivity lists: synthesis will give the behaviour of an auto-generated (complete) sensitivity list.
In a clocked process (@(posedge clk)) use non-blocking (<=) to get the simulation behaviour of a flip-flop. It is possible to use blocking (=) as well to have combinatorial code inside the clocked process, but mixing styles is considered a bad coding practice. The combinatorial code can just be moved to a separate combinatorial block (always @*).
A latch is a basic memory element, if the circuit does not need memory then it will not be inferred.
For example:
always @* begin
v = (a & b) ^ c;
end
v is completely defined by the inputs; there is no memory involved. Compare with:
always @* begin
if (c == 1'b1) begin
w = (a & b) ^ c;
end
end
When c is 0, w must hold its value, therefore a latch is inferred.
It is worth noting that latches are not inherently bad, but care must be taken with the timing of when they open and close to ensure they capture the correct data. Inferred latches are therefore typically seen as bad and usually come from poor coding.
SystemVerilog has the following syntax for semantically implying design intent:
always_latch begin
if (c == 1'b1) begin
w = (a & b) ^ c;
end
end
