Multiplying a vector component with an array in ArrayFire - arrayfire

I'm getting an error while trying to multiply a vector component with an array (element-wise multiplication or broadcast). The docs show that this overloaded case for * should be fine:
AFAPI array operator* (const float &lhs, const array &rhs)
Multiplies two arrays or an array and a value. (const array&, const
array&)
But according to the error message below, perhaps vect(0) needs to be further flattened or reduced so that the sizes are consistent?
The error statement is clear:
Invalid dimension for argument 1 Expected: ldims == rdims
Below is the code:
#include <arrayfire.h>
#include <cstdlib>

int main(int argc, char *argv[])
{
    int device = argc > 1 ? atoi(argv[1]) : 0;
    af::setDevice(device);
    af::info();

    int n = 3;
    int N = 5;

    // Create the arrays:
    af::array matrix = af::constant(0, n, n, f32); // 3 x 3 float array of zeros
    af::array vect = af::seq(1, N);                // A column vector of floats: {1.0, ..., 5.0}

    // Show the arrays:
    af_print(matrix);
    af_print(vect);

    // Print a single component of the vector:
    af_print(vect(0));

    // This line produces the error (see below):
    af_print(vect(0) * matrix); // Why doesn't this work?

    // But something like this is fine:
    af_print(1.0 * matrix);

    return 0;
}
Producing the output:
ArrayFire v3.3.2
ATI Radeon HD 6750M
matrix [3 3 1 1]
0.0000 0.0000 0.0000
0.0000 0.0000 0.0000
0.0000 0.0000 0.0000
vect [5 1 1 1]
1.0000
2.0000
3.0000
4.0000
5.0000
vect(0) [1 1 1 1]
1.0000
The dims() shown by af_print(), [3 3 1 1] for matrix and [1 1 1 1] for vect(0), make me suspicious, but I'm not sure how to flatten any further. One would think this would be a common way of using the ArrayFire API.
The error exception that is thrown is:
libc++abi.dylib: terminating with uncaught exception of type
af::exception: ArrayFire Exception (Invalid input size:203): In
function getOutDims In file src/backend/ArrayInfo.cpp:173
Invalid dimension for argument 1 Expected: ldims == rdims
In function af::array af::operator*(const af::array &, const af::array
&)
Adding a use-case to clarify:
In practice I am constructing a final array by summation of coeff(k) * (a 2-d slice of a 3-d array Z):
for (int j = 0; j < indx.dims(0); ++j)
    final += coeff(indx(j)) * Z(af::span, af::span, indx(j));
I'll look into using a gfor but initially just wanted to get the correct numerical output. Note also that the vector indx is predefined, e.g., indx = {1, 2, 4, 7, ...}, and its elements are not necessarily consecutive; this allows specific terms to be selected.

ArrayFire does not implicitly broadcast a one-element array against a larger array in element-wise operations (the case that is failing here). Only array-and-host-value operations, such as 1.0 * matrix, are supported implicitly.
To do what you are doing, you will need to use the tile() function as shown below.
af_print(tile(vect(0), matrix.dims()) * matrix);
Since the dimensions being tiled are 1, tile() is handled by the JIT engine: no extra memory is used and the entire computation is done in a single kernel, so there is no performance hit either.
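If copying the value back to the host is acceptable, here is an alternative sketch (this relies on af::array::scalar<T>(), which copies the element to the host and therefore synchronizes):
af::array v0 = vect(0);       // materialize the single-element view
float c = v0.scalar<float>(); // device -> host copy of that element
af_print(c * matrix);         // now the float-times-array overload applies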

Since the OP added a use case after the previous answer, here is how to write a fully vectorized version in ArrayFire.
array coeffs = moddims(coeff(indx), 1, 1, indx.elements());
array final = sum(Z(span, span, indx) * tile(coeffs, Z.dims(0), Z.dims(1)), 2);
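As a self-contained sketch of that vectorized form (the sizes, the randu data, and the particular indx values are made up purely for illustration):
#include <arrayfire.h>
using namespace af;

int main() {
    const int n = 3, N = 8;
    array Z     = randu(n, n, N, f32); // stack of N slices
    array coeff = randu(N, f32);       // one coefficient per slice
    int idx_host[] = {1, 2, 4, 7};     // hypothetical subset of terms
    array indx(4, idx_host);           // index vector on the device

    // Reshape the selected coefficients to 1 x 1 x K so they tile across each slice.
    array coeffs = moddims(coeff(indx), 1, 1, indx.elements());
    array result = sum(Z(span, span, indx) * tile(coeffs, Z.dims(0), Z.dims(1)), 2);
    af_print(result);
    return 0;
}
The key point is that coeff(indx) is reshaped to 1 x 1 x K so that tile() can broadcast one coefficient across each selected slice before the reduction along dimension 2.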

Related

Different return value for same argument of ceil() function

#include <bits/stdc++.h>
using namespace std;

int main() {
    int ans = ceil(1.5);
    printf("%d\n", ans);
    ans = ceil(3 / 2);
    printf("%d", ans);
}
Output:
2
1
Why does this code print different answers in my editor (VS Code)?
Actually you are sending different arguments to ceil.
3 / 2 is first evaluated to the integer 1: since 3 and 2 are both integers, the operator / performs integer division.
So the second call is actually ceil(1).
When you pass 3/2 as the argument, you are actually passing 1. The program evaluates 3/2 as an int, so the result is 1, and the second ans is therefore computed from ceil(1.0).
Instead of ceil(3 / 2), you need ceil(3.0 / 2.0). Then the division is done in double, the result is 1.5, and the second ans is computed from ceil(1.5).
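A minimal sketch of the difference, using only the standard library:
#include <cmath>
#include <cstdio>

int main() {
    std::printf("%d\n", (int)std::ceil(1.5));     // 2
    std::printf("%d\n", (int)std::ceil(3 / 2));   // 3 / 2 is integer division -> ceil(1) -> 1
    std::printf("%d\n", (int)std::ceil(3.0 / 2)); // 3.0 / 2 is 1.5 -> 2
    return 0;
}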

SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

I am looking for an efficient AVX (or AVX512) implementation of
// Given
float u[8];
float v[8];
// Compute
float a[8];
float b[8];
// Such that
for (int i = 0; i < 8; ++i)
{
    a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
    b[i] = fabs(u[i]) <  fabs(v[i]) ? u[i] : v[i];
}
I.e., I need to select element-wise into a from u and v based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.
I had this exact same problem just the other day. The solution I came up with (using AVX only) was:
// take the absolute value of u and v
__m256 sign_bit = _mm256_set1_ps(-0.0f);
__m256 u_abs = _mm256_andnot_ps(sign_bit, u);
__m256 v_abs = _mm256_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__m256 u_ge_v = _mm256_cmp_ps(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
__m256 a = _mm256_blendv_ps(v, u, u_ge_v);
__m256 b = _mm256_blendv_ps(u, v, u_ge_v);
The AVX512 equivalent would be:
// take the absolute value of u and v
__m512 sign_bit = _mm512_set1_ps(-0.0f);
__m512 u_abs = _mm512_andnot_ps(sign_bit, u);
__m512 v_abs = _mm512_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__mmask16 u_ge_v = _mm512_cmp_ps_mask(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
__m512 a = _mm512_mask_blend_ps(u_ge_v, v, u);
__m512 b = _mm512_mask_blend_ps(u_ge_v, u, v);
As Peter Cordes suggested in the comments above, there are other approaches as well like taking the absolute value followed by a min/max and then reinserting the sign bit, but I couldn't find anything that was shorter/lower latency than this sequence of instructions.
Actually, there is another approach using AVX512DQ's VRANGEPS via the _mm512_range_ps() intrinsic. Intel's intrinsic guide describes it as follows:
Calculate the max, min, absolute max, or absolute min (depending on control in imm8) for packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst. imm8[1:0] specifies the operation control: 00 = min, 01 = max, 10 = absolute max, 11 = absolute min. imm8[3:2] specifies the sign control: 00 = sign from a, 01 = sign from compare result, 10 = clear sign bit, 11 = set sign bit.
Note that there appears to be a typo in the above; actually imm8[1:0] == 10 is "absolute min" and imm8[1:0] == 11 is "absolute max" if you look at the details of the per-element operation:
CASE opCtl[1:0] OF
0: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src1[31:0] : src2[31:0]
1: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src2[31:0] : src1[31:0]
2: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src1[31:0] : src2[31:0]
3: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src2[31:0] : src1[31:0]
ESAC
CASE signSelCtl[1:0] OF
0: dst[31:0] := (src1[31] << 31) OR (tmp[30:0])
1: dst[31:0] := tmp[63:0]
2: dst[31:0] := (0 << 31) OR (tmp[30:0])
3: dst[31:0] := (1 << 31) OR (tmp[30:0])
ESAC
RETURN dst
So you can get the same result with just two instructions:
auto a = _mm512_range_ps(v, u, 0x7); // 0b0111 = sign from compare result, absolute max
auto b = _mm512_range_ps(v, u, 0x6); // 0b0110 = sign from compare result, absolute min
The argument order (v, u) is a bit unintuitive, but it's needed in order to get the same behavior that you described in the OP in the event that the elements have equal absolute value (namely, that the value from u is passed through to a, and v goes to b).
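For reference, a small sketch of how those immediates decode under the corrected encoding (the constant names are made up):
// imm8[1:0] = op control: 10 = absolute min, 11 = absolute max (per the corrected table)
// imm8[3:2] = sign control: 01 = sign from compare result
constexpr int RANGE_ABSMAX_SIGNCMP = (0b01 << 2) | 0b11; // 0x7, used for a
constexpr int RANGE_ABSMIN_SIGNCMP = (0b01 << 2) | 0b10; // 0x6, used for b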
On Skylake and Ice Lake Xeon platforms (probably any of the Xeons that have dual FMA units?), VRANGEPS has a throughput of 2 per clock, so the two checks can issue and execute simultaneously, with a latency of 4 cycles. This is only a modest latency improvement over the original approach, but the throughput is better and it requires fewer instructions/uops and less instruction-cache space.
clang does a pretty reasonable job of auto-vectorizing it with -ffast-math and the necessary __restrict qualifiers: https://godbolt.org/z/NMvN1u. It ANDs both inputs to ABS them, compares once, and then uses vblendvps twice on the original inputs with the same mask but the sources in opposite order, to get min and max.
That's pretty much what I was thinking before checking what compilers did, and looking at their output to firm up the details I hadn't thought through yet. I don't see anything more clever than that. I don't think we can avoid abs()ing both inputs separately; there's no cmpps compare predicate that compares magnitudes and ignores the sign bit.
// untested: I *might* have reversed min/max, but I think this is right.
#include <immintrin.h>

// returns min_abs
__m256 minmax_abs(__m256 u, __m256 v, __m256 *max_result) {
    const __m256 signbits = _mm256_set1_ps(-0.0f);
    __m256 abs_u = _mm256_andnot_ps(signbits, u);
    __m256 abs_v = _mm256_andnot_ps(signbits, v); // strip the sign bit
    __m256 maxabs_is_v = _mm256_cmp_ps(abs_u, abs_v, _CMP_LT_OS); // u < v
    *max_result = _mm256_blendv_ps(v, u, maxabs_is_v);
    return _mm256_blendv_ps(u, v, maxabs_is_v);
}
You'd do the same thing with AVX512 except you compare into a mask instead of another vector.
// returns min_abs
__m512 minmax_abs512(__m512 u, __m512 v, __m512 *max_result) {
    const __m512 absmask = _mm512_castsi512_ps(_mm512_set1_epi32(0x7fffffff));
    __m512 abs_u = _mm512_and_ps(absmask, u);
    __m512 abs_v = _mm512_and_ps(absmask, v); // strip the sign bit
    __mmask16 maxabs_is_v = _mm512_cmp_ps_mask(abs_u, abs_v, _CMP_LT_OS); // u < v
    *max_result = _mm512_mask_blend_ps(maxabs_is_v, v, u);
    return _mm512_mask_blend_ps(maxabs_is_v, u, v);
}
Clang compiles the return statement in an interesting way (Godbolt):
.LCPI2_0:
.long 2147483647 # 0x7fffffff
minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*): # #minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*)
vbroadcastss zmm2, dword ptr [rip + .LCPI2_0]
vandps zmm3, zmm0, zmm2
vandps zmm2, zmm1, zmm2
vcmpltps k1, zmm3, zmm2
vblendmps zmm2 {k1}, zmm1, zmm0
vmovaps zmmword ptr [rdi], zmm2 ## store the blend result
vmovaps zmm0 {k1}, zmm1 ## interesting choice: blend merge-masking
ret
Instead of using another vblendmps, clang notices that zmm0 already has one of the blend inputs, and uses merge-masking with a regular vector vmovaps. This has zero advantage on Skylake-AVX512 for 512-bit vblendmps (both are single-uop instructions for port 0 or 5), but if Agner Fog's instruction tables are right, vblendmps x/y/zmm only ever runs on port 0 or 5, while a masked 256-bit or 128-bit vmovaps x/ymm{k}, x/ymm can run on any of p0/p1/p5.
Both are single-uop / single-cycle latency, unlike AVX2 vblendvps based on a mask vector, which is 2 uops. (So AVX512 is an advantage even for 256-bit vectors.) Unfortunately, none of gcc, clang, or ICC turn the _mm256_cmp_ps into _mm256_cmp_ps_mask and optimize the AVX2 intrinsics to AVX512 instructions when compiling with -march=skylake-avx512.
s/512/256/ to make a version of minmax_abs512 that uses AVX512 for 256-bit vectors.
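Applying that substitution literally, a sketch of the 256-bit variant might look like this (it additionally needs AVX512VL for the masked 256-bit forms):
#include <immintrin.h>

// returns min_abs; same structure as minmax_abs512 but on __m256 (AVX512F + AVX512VL)
__m256 minmax_abs256(__m256 u, __m256 v, __m256 *max_result) {
    const __m256 absmask = _mm256_castsi256_ps(_mm256_set1_epi32(0x7fffffff));
    __m256 abs_u = _mm256_and_ps(absmask, u);
    __m256 abs_v = _mm256_and_ps(absmask, v); // strip the sign bit
    __mmask8 maxabs_is_v = _mm256_cmp_ps_mask(abs_u, abs_v, _CMP_LT_OS); // u < v
    *max_result = _mm256_mask_blend_ps(maxabs_is_v, v, u);
    return _mm256_mask_blend_ps(maxabs_is_v, u, v);
}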
Gcc goes even further, and does the questionable "optimization" of
vmovaps zmm2, zmm1 # tmp118, v
vmovaps zmm2{k1}, zmm0 # tmp118, tmp114, tmp118, u
instead of using one blend instruction. (I keep thinking I'm seeing a store followed by a masked store, but no, neither compiler is blending that way).

SystemVerilog constraint for mapping between two 2D arrays

There are two MxN 2D arrays:
rand bit [M-1:0] src [N-1:0];
rand bit [M-1:0] dst [N-1:0];
Both of them will be randomized separately so that each has exactly P bits set to 1'b1 and the rest set to 1'b0.
A third MxN array of integers named 'map' establishes a one to one mapping between the two arrays 'src' and 'dst'.
rand int [M-1:0] map [N-1:0];
I need a constraint for 'map' such that, after randomization, for each element where src[i][j] == 1'b1, map[i][j] == M*k+l for some k, l with dst[k][l] == 1. The (k, l) pair must be unique for each non-zero element of map.
To give an example:
Let M = 3 and N = 2.
Let src be
[1 0 1
0 1 0]
Let dst be
[0 1 1
1 0 0]
Then one possible randomization of 'map' will be:
[3 0 1
0 2 0]
In the above map:
3 indicates pointing from src[0,0] to dst[1,0] (3 = 1*M+0)
1 indicates pointing from src[0,2] to dst[0,1] (1 = 0*M+1)
2 indicates pointing from src[1,1] to dst[0,2] (2 = 0*M+2)
This is very difficult to express as a SystemVerilog constraint because:
there is no way to conditionally select elements of an array to be unique, and
you cannot have random variables as part of an index expression into an array element.
Since you are randomizing src and dst separately, it might be easier to compute the pointers and then randomly choose the pointers to fill in the map.
module top;
  parameter M=3, N=4, P=4;
  bit [M-1:0] src [N];
  bit [M-1:0] dst [N];
  int map [N][M];
  int pointers[$];

  initial begin
    assert( randomize(src) with {src.sum() with ($countones(item)) == P;} );
    assert( randomize(dst) with {dst.sum() with ($countones(item)) == P;} );
    foreach(dst[K,L]) if (dst[K][L]) pointers.push_back(K*M+L);
    pointers.shuffle();
    foreach(map[I,J]) if (src[I][J]) map[I][J] = pointers.pop_back();
    $displayb("%p\n%p", src, dst);
    $display("%p", map);
  end
endmodule

Simple Matrix multiplication in opencv fails

I do the following
Mat xOld, xNew;
for (uint i = 0; i < inliers.size(); i++) {
    if (inliers[i]) {
        double xOld_arr[3] = {kpOld[i].pt.x, kpOld[i].pt.y, 1};
        double xNew_arr[3] = {kpNew[i].pt.x, kpNew[i].pt.y, 1};
        Mat xo(1, 3, CV_64FC1, xOld_arr), xn(1, 3, CV_64FC1, xNew_arr);
        xNew.push_back(xn);
        xOld.push_back(xo);
    }
}
xNew = xNew.t();
cout << F.size() << " " << xNew.size();
Mat t = xNew * F;
Output is
[3 x 3] [24 x 3]OpenCV Error: Assertion failed (a_size.width == len) in gemm, file /home/flex/test/opencv/modules/core/src/matmul.cpp, line 1537
terminate called after throwing an instance of 'cv::Exception'
what(): /home/flex/test/opencv/modules/core/src/matmul.cpp:1537: error: (-215) a_size.width == len in function gemm
What am I missing? When I multiply the matrices, shouldn't this be correct, since xNew has the same number of columns as F has rows?
what type is F?
So F is 3 rows by 3 cols, and xNew (after the transpose) is 3 rows by 24 cols. You are trying to multiply (in rows x columns notation) 3x24 * 3x3, which is not defined. Matrix multiplication works as N x M times M x O, giving an N x O matrix. So you should be able to multiply the two matrices if you don't transpose, but I can't tell you whether that is the multiplication you want.
Maybe the confusion comes from this line: xn(1,3,CV_64FC1,xNew_arr) creates a matrix with 1 row and 3 columns, and that row is later appended to xNew.
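A minimal sketch of the dimension bookkeeping with stand-in matrices (the ones/eye data is only there to make it runnable):
#include <opencv2/core.hpp>
#include <iostream>

int main() {
    cv::Mat xNew = cv::Mat::ones(24, 3, CV_64FC1); // stand-in: one homogeneous point per row
    cv::Mat F = cv::Mat::eye(3, 3, CV_64FC1);      // stand-in 3x3 matrix

    cv::Mat t = xNew * F;       // (24x3)*(3x3) -> 24x3: defined without the transpose
    cv::Mat t2 = F * xNew.t();  // (3x3)*(3x24) -> 3x24: defined if column-points are wanted
    // cv::Mat bad = xNew.t() * F; // (3x24)*(3x3): inner dimensions 24 != 3 -> the gemm assertion

    std::cout << t.rows << "x" << t.cols << " " << t2.rows << "x" << t2.cols << std::endl; // 24x3 3x24
    return 0;
}
Which of these products is the one you actually want depends on whether the points live in the rows or the columns of xNew.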

Scaling a number between two values

If I am given a floating point number but do not know beforehand what range the number will be in, is it possible to scale that number in some meaningful way to be in another range? I am thinking of checking to see if the number is in the range 0<=x<=1 and if not scale it to that range and then scale it to my final range. This previous post provides some good information, but it assumes the range of the original number is known beforehand.
You can't scale a number into a range if you don't know the range.
Maybe what you're looking for is the modulo operator. Modulo is basically the remainder of division; the operator in most languages is %.
0 % 5 == 0
1 % 5 == 1
2 % 5 == 2
3 % 5 == 3
4 % 5 == 4
5 % 5 == 0
6 % 5 == 1
7 % 5 == 2
...
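For floating-point values the same idea uses std::fmod; a minimal sketch:
#include <cmath>
#include <cstdio>

int main() {
    // std::fmod is the floating-point counterpart of the integer % shown above.
    std::printf("%f\n", std::fmod(7.25, 5.0)); // 2.250000
    std::printf("%f\n", std::fmod(12.5, 5.0)); // 2.500000
    return 0;
}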
Sure, it is not possible. You can define a range and ignore all values outside it, or you can collect statistics to find the range at run time (e.g., via histogram analysis).
Is this really about image processing? There are lots of related problems in the image segmentation field.
You want to scale a single random floating point number to be between 0 and 1, but you don't know the range of the number?
What should 99.001 be scaled to? If the range of the random number was [99, 100], then our scaled-number should be pretty close to 0. If the range of the random number was [0, 100], then our scaled-number should be pretty close to 1.
In the real world, you always have some sort of information about the range (either the range itself, or how wide it is). Without further info, the answer is "No, it can't be done."
I think the best you can do is something like this:
double scale(double x) {
    if (x < -1) return -2 - 1 / x;
    if (x > 1) return 2 - 1 / x;
    return x;
}
This function is monotonic, and has a range of -2 to 2, but it's not strictly a scaling.
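A quick self-contained check of that claim (the sample values are arbitrary):
#include <cstdio>

double scale(double x) {
    if (x < -1) return -2 - 1 / x;
    if (x > 1) return 2 - 1 / x;
    return x;
}

int main() {
    const double samples[] = {-1000, -2, -1, 0, 1, 2, 1000};
    for (double x : samples)
        std::printf("scale(%g) = %g\n", x, scale(x)); // stays inside (-2, 2) and increases
    return 0;
}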
I am assuming that you have the result of some 2-dimensional measurements and want to display them in color or grayscale. For that, I would first want to find the maximum and minimum and then scale between these two values.
static double[][] scale(double[][] in, double outMin, double outMax) {
    double inMin = Double.POSITIVE_INFINITY;
    double inMax = Double.NEGATIVE_INFINITY;
    for (double[] inRow : in) {
        for (double d : inRow) {
            if (d < inMin)
                inMin = d;
            if (d > inMax)
                inMax = d;
        }
    }
    double inRange = inMax - inMin;
    double outRange = outMax - outMin;
    double[][] out = new double[in.length][];
    for (int i = 0; i < in.length; i++) {
        double[] inRow = in[i];
        double[] outRow = new double[inRow.length];
        for (int j = 0; j < inRow.length; j++) {
            double normalized = (inRow[j] - inMin) / inRange; // 0 .. 1
            outRow[j] = outMin + normalized * outRange;
        }
        out[i] = outRow;
    }
    return out;
}
This code is untested and just shows the general idea. It further assumes that all your input data is in a "reasonable" range, away from infinity and NaN.
