How does _mm_mul_ps() add two __m128? - sse

I´m doing a program that takes two matrix 4x4 and multiply them using Intrinsics.
What I understand until now:
MMX/SSE instructions set allow you to accelerate computing. In particular it uses a 4 bytes elements vector.
__m128 represents a 16 bytes vector (4 elements of 4 bytes). Furthermore, __m128 data needs to be aligned in order to work.
Where I get lost is here:
Function _mm_mul_ps(_m128, _m128) that (as I have read) takes two vectors of 16 bytes of 4 flotats of 4 bytes. It multiply "one to one" the two vectors and returns a _m128. But, what does that _m128 vector contains exactly (the result of what)?
Function _mm_hadd_ps(_m128, _m128) adds two 16 bytes vectors (each one of 4 bytes floats). It "adds horizontaly" this way: vectorA(a1, a2, a3,a4) + vectorB(b1, b2, b3, b4) = vectorResult(a1 + a2, a3 + a4, b1 + b2, b3 + b4)
What I´m trying to do:
// Stores the result of multiply on row of A by one column of B
_declspec (align(16)) __m128 aux;
// Horizontal add
for(int i = 0; i < 4; i++){
for (int j = 0; j < 4; j++){
aux= _mm_mul_ps(vectorA[i], vectorB[j]);
// Add results
aux = _mm_hadd_ps(aux, aux);
aux = _mm_hadd_ps(aux,aux);
}
}
I can´t see how the functions work (I don´t have a "mental image").

Related

Calculate sum of row but its initial row number and row count

Let's say I have a column of numbers:
1
2
3
4
5
6
7
8
Is there a formula that can calculate sum of numbers starting from n-th row and adding to the sum k numbers, for example start from 4th row and add 3 numbers down the row, i.e. PartialSum(4, 3) would be 4 + 5 + 6 = 15
BTW I can't use App Script as now it has some type of error Error code RESOURCE_EXHAUSTED. and in general I have had issue of stabile work with App Script before too.
As Tanaike mentioned, the error code when using Google Apps Script was just a temporary bug that seems to be solved at this moment.
Now, I can think of 2 possible solutions for this using custom functions:
Solution 1
If your data follows a specific numeric order one by one just like the example provided in the post, you may want to consider using the following code:
function PartialSum(n, k) {
let sum = n;
for(let i=1; i<k; i++)
{
sum = sum + n + i;
}
return sum;
}
Solution 2
If your data does not follow any particular order and you just want to sum a specific number of rows that follow the row you select, then you can use:
function PartialSum(n, k) {
let ss = SpreadsheetApp.getActiveSheet();
let r = ss.getRange(n, 1); // Set column 1 as default (change it as needed)
let sum = n;
for(let i=1; i<k; i++)
{
let val = ss.getRange(n + i, 1).getValue();
sum = sum + val;
}
return sum;
}
Result:
References:
Custom Functions in Google Sheets
Formula:
= SUM( OFFSEET( initialCellName, 0, 0, numberOfElementsInColumn, 0) )
Example add 7 elements starting from A5 cell:
= SUM( OFFSEET( A5, 0, 0, 7, 0) )

SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

I am looking for efficient AVX (AVX512) implementation of
// Given
float u[8];
float v[8];
// Compute
float a[8];
float b[8];
// Such that
for ( int i = 0; i < 8; ++i )
{
a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
b[i] = fabs(u[i]) < fabs(v[i]) ? u[i] : v[i];
}
I.e., I need to select element-wise into a from u and v based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.
I had this exact same problem just the other day. The solution I came up with (using AVX only) was:
// take the absolute value of u and v
__m256 sign_bit = _mm256_set1_ps(-0.0f);
__m256 u_abs = _mm256_andnot_ps(sign_bit, u);
__m256 v_abs = _mm256_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__m256 u_ge_v = _mm256_cmp_ps(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
__m256 a = _mm256_blendv_ps(u, v, u_ge_v);
__m256 b = _mm256_blendv_ps(v, u, u_ge_v);
The AVX512 equivalent would be:
// take the absolute value of u and v
__m512 sign_bit = _mm512_set1_ps(-0.0f);
__m512 u_abs = _mm512_andnot_ps(sign_bit, u);
__m512 v_abs = _mm512_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__mmask16 u_ge_v = _mm512_cmp_ps_mask(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
__m512 a = _mm512_mask_blend_ps(u_ge_v, u, v);
__m512 b = _mm512_mask_blend_ps(u_ge_v, v, u);
As Peter Cordes suggested in the comments above, there are other approaches as well like taking the absolute value followed by a min/max and then reinserting the sign bit, but I couldn't find anything that was shorter/lower latency than this sequence of instructions.
Actually, there is another approach using AVX512DQ's VRANGEPS via the _mm512_range_ps() intrinsic. Intel's intrinsic guide describes it as follows:
Calculate the max, min, absolute max, or absolute min (depending on control in imm8) for packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst. imm8[1:0] specifies the operation control: 00 = min, 01 = max, 10 = absolute max, 11 = absolute min. imm8[3:2] specifies the sign control: 00 = sign from a, 01 = sign from compare result, 10 = clear sign bit, 11 = set sign bit.
Note that there appears to be a typo in the above; actually imm8[3:2] == 10 is "absolute min" and imm8[3:2] == 11 is "absolute max" if you look at the details of the per-element operation:
CASE opCtl[1:0] OF
0: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src1[31:0] : src2[31:0]
1: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src2[31:0] : src1[31:0]
2: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src1[31:0] : src2[31:0]
3: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src2[31:0] : src1[31:0]
ESAC
CASE signSelCtl[1:0] OF
0: dst[31:0] := (src1[31] << 31) OR (tmp[30:0])
1: dst[31:0] := tmp[63:0]
2: dst[31:0] := (0 << 31) OR (tmp[30:0])
3: dst[31:0] := (1 << 31) OR (tmp[30:0])
ESAC
RETURN dst
So you can get the same result with just two instructions:
auto a = _mm512_range_ps(v, u, 0x7); // 0b0111 = sign from compare result, absolute max
auto b = _mm512_range_ps(v, u, 0x6); // 0b0110 = sign from compare result, absolute min
The argument order (v, u) is a bit unintuitive, but it's needed in order to get the same behavior that you described in the OP in the event that the elements have equal absolute value (namely, that the value from u is passed through to a, and v goes to b).
On Skylake and Ice Lake Xeon platforms (probably any of the Xeons that have dual FMA units, probably?), VRANGEPS has throughput 2, so the two checks can issue and execute simultaneously, with latency of 4 cycles. This is only a modest latency improvement on the original approach, but the throughput is better and it requires fewer instructions/uops/instruction cache space.
clang does a pretty reasonable job of auto-vectorizing it with -ffast-math and the necessary __restrict qualifiers: https://godbolt.org/z/NMvN1u. and both inputs to ABS them, compare once, vblendvps twice on the original inputs with the same mask but the other sources in the opposite order to get min and max.
That's pretty much what I was thinking before checking what compilers did, and looking at their output to firm up the details I hadn't thought through yet. I don't see anything more clever than that. I don't think we can avoid abs()ing both a and b separately; there's no cmpps compare predicate that compares magnitudes and ignores the sign bit.
// untested: I *might* have reversed min/max, but I think this is right.
#include <immintrin.h>
// returns min_abs
__m256 minmax_abs(__m256 u, __m256 v, __m256 *max_result) {
const __m256 signbits = _mm256_set1_ps(-0.0f);
__m256 abs_u = _mm256_andnot_ps(signbits, u);
__m256 abs_v = _mm256_andnot_ps(signbits, v); // strip the sign bit
__m256 maxabs_is_v = _mm256_cmp_ps(abs_u, abs_v, _CMP_LT_OS); // u < v
*max_result = _mm256_blendv_ps(v, u, maxabs_is_v);
return _mm256_blendv_ps(u, v, maxabs_is_v);
}
You'd do the same thing with AVX512 except you compare into a mask instead of another vector.
// returns min_abs
__m512 minmax_abs512(__m512 u, __m512 v, __m512 *max_result) {
const __m512 absmask = _mm512_castsi512_ps(_mm512_set1_epi32(0x7fffffff));
__m512 abs_u = _mm512_and_ps(absmask, u);
__m512 abs_v = _mm512_and_ps(absmask, v); // strip the sign bit
__mmask16 maxabs_is_v = _mm512_cmp_ps_mask(abs_u, abs_v, _CMP_LT_OS); // u < v
*max_result = _mm512_mask_blend_ps(maxabs_is_v, v, u);
return _mm512_mask_blend_ps(maxabs_is_v, u, v);
}
Clang compiles the return statement in an interesting way (Godbolt):
.LCPI2_0:
.long 2147483647 # 0x7fffffff
minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*): # #minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*)
vbroadcastss zmm2, dword ptr [rip + .LCPI2_0]
vandps zmm3, zmm0, zmm2
vandps zmm2, zmm1, zmm2
vcmpltps k1, zmm3, zmm2
vblendmps zmm2 {k1}, zmm1, zmm0
vmovaps zmmword ptr [rdi], zmm2 ## store the blend result
vmovaps zmm0 {k1}, zmm1 ## interesting choice: blend merge-masking
ret
Instead of using another vblendmps, clang notices that zmm0 already has one of the blend inputs, and uses merge-masking with a regular vector vmovaps. This has zero advantage of Skylake-AVX512 for 512-bit vblendmps (both single-uop instructions for port 0 or 5), but if Agner Fog's instruction tables are right, vblendmps x/y/zmm only ever runs on port 0 or 5, but a masked 256-bit or 128-bit vmovaps x/ymm{k}, x/ymm can run on any of p0/p1/p5.
Both are single-uop / single-cycle latency, unlike AVX2 vblendvps based on a mask vector which is 2 uops. (So AVX512 is an advantage even for 256-bit vectors). Unfortunately, none of gcc, clang, or ICC turn the _mm256_cmp_ps into _mm256_cmp_ps_mask and optimize the AVX2 intrinsics to AVX512 instructions when compiling with -march=skylake-avx512.)
s/512/256/ to make a version of minmax_abs512 that uses AVX512 for 256-bit vectors.
Gcc goes even further, and does the questionable "optimization" of
vmovaps zmm2, zmm1 # tmp118, v
vmovaps zmm2{k1}, zmm0 # tmp118, tmp114, tmp118, u
instead of using one blend instruction. (I keep thinking I'm seeing a store followed by a masked store, but no, neither compiler is blending that way).

How to export/convert line projection to excel table and order the Y coornidate

I wrote a code that can get line projection (intensity profile) of an image, and I would like to convert/export this line projection (intensity profile) to excel table, and then order all the Y coordinate. For example, except the maximum and minimum values of all the Y coordinate, I would like to know largest 5 coordinate value and smallest coordinate value.
Is there any code can reach this function? Thanks,
image line_projection
Realimage imgexmp
imgexmp := GetFrontImage()
number samples = 256, xscale, yscale, xsize, ysize
GetSize( imgexmp, xsize, ysize )
line_projection := CreateFloatImage( "line projection", Xsize, 1 )
line_projection = 0
line_projection[icol,0] += imgexmp
line_projection /= samples
ShowImage( line_projection )
Finding a 'sorted' list of values
If you need to sort though large lists of values (i.e. large images) the following might not be very sufficient. However, if your aim is to get the "x highest" values with a relatively small number of X, then the following code is just fine:
number nFind = 10
image test := GetFrontImage().ImageClone()
Result( "\n\n" + nFind + " highest values:\n" )
number x,y,v
For( number i=0; i<nFind; i++ )
{
v = max(test,x,y)
Result( "\t" + v + " at " + x + "\n" )
test[x,y] = - Infinity()
}
Working with a copy and subsequently "removing" the maximum value by changing that pixel value. The max command is fast - even for large images -, but the for-loop iteration and setting of individual pixels is slow. Hence this script is too slow for a complete 'sorting' of the data if it is big, but it can quickly get you the n 'highest' values.
This is a non-coding answer:
If you havea LinePlot display in DigitalMicrograph, you can simply copy-paste that into Excel to get the numbers.
i.e. with the LinePlot image front most, preses CTRL + C to copy
(make sure there are no ROIs on it).
Switch to Excel and press CTRL + V. Done.
==>

How do you convert 8-bit bytes to 6-bit characters?

I have a specific requirement to convert a stream of bytes into a character encoding that happens to be 6-bits per character.
Here's an example:
Input: 0x50 0x11 0xa0
Character Table:
010100 T
000001 A
000110 F
100000 SPACE
Output: "TAF "
Logically I can understand how this works:
Taking 0x50 0x11 0xa0 and showing as binary:
01010000 00010001 10100000
Which is "TAF ".
What's the best way to do this programmatically (pseudo code or c++). Thank you!
Well, every 3 bytes, you end up with four characters. So for one thing, you need to work out what to do if the input isn't a multiple of three bytes. (Does it have padding of some kind, like base64?)
Then I'd probably take each 3 bytes in turn. In C#, which is close enough to pseudo-code for C :)
for (int i = 0; i < array.Length; i += 3)
{
// Top 6 bits of byte i
int value1 = array[i] >> 2;
// Bottom 2 bits of byte i, top 4 bits of byte i+1
int value2 = ((array[i] & 0x3) << 4) | (array[i + 1] >> 4);
// Bottom 4 bits of byte i+1, top 2 bits of byte i+2
int value3 = ((array[i + 1] & 0xf) << 2) | (array[i + 2] >> 6);
// Bottom 6 bits of byte i+2
int value4 = array[i + 2] & 0x3f;
// Now use value1...value4, e.g. putting them into a char array.
// You'll need to decode from the 6-bit number (0-63) to the character.
}
Just in case if someone is interested - another variant that extracts 6-bit numbers from the stream as soon as they appear there. That is, results can be obtained even if less then 3 bytes are currently read. Would be useful for unpadded streams.
The code saves the state of the accumulator a in variable n which stores the number of bits left in accumulator from the previous read.
int n = 0;
unsigned char a = 0;
unsigned char b = 0;
while (read_byte(&byte)) {
// save (6-n) most significant bits of input byte to proper position
// in accumulator
a |= (b >> (n + 2)) & (077 >> n);
store_6bit(a);
a = 0;
// save remaining least significant bits of input byte to proper
// position in accumulator
a |= (b << (4 - n)) & ((077 << (4 - n)) & 077);
if (n == 4) {
store_6bit(a);
a = 0;
}
n = (n + 2) % 6;
}

How could I test if two bit patterns differs in any N bits (position doesn't matter)

Let's say i have this bit field value: 10101001
How would i test if any other value differs in any n bits. Without considering
the positions?
Example:
10101001
10101011 --> 1 bit different
10101001
10111001 --> 1 bit different
10101001
01101001 --> 2 bits different
10101001
00101011 --> 2 bits different
I need to make a lot of this comparisons so i'm primarily looking for perfomance but any
hint is very welcome.
Take the XOR of the two fields and do a population count of the result.
if you XOR the 2 values together, you are left only with the bits that are different.
You then only need to count the bits which are still 1 and you have your answer
in c:
unsigned char val1=12;
unsigned char val2=123;
unsigned char xored = val1 ^ val2;
int i;
int numBits=0;
for(i=0; i<8; i++)
{
if(xored&1) numBits++;
xored>>=1;
}
although there are probably faster ways to count the bits in a byte
(you could for instance use a lookuptable for 256 values)
Just like everybody else said, use XOR to determine what's different and then use one of these algorithms to count.
This gets the bit difference between the values and counts the bits three at a time:
public static int BitDifference(int a, int b) {
int cnt = 0, bits = a ^ b;
while (bits != 0) {
cnt += (0xE994 >> ((bits & 7) << 1)) & 3;
bits >>= 3;
}
return cnt;
}
XOR the numbers, then the problem becomes a matter of counting the 1s in the result.
In Java:
Integer.bitCount(a ^ b)
Comparison is performed with XOR, as others already answered.
counting can be performed in several ways:
shift left and addition.
lookup in a table.
logic formulas that you can find with Karnaugh maps.

Resources