SIMD zero vector test - SSE

Is there a quick way to check whether a SIMD vector is a zero vector (all components equal ±zero)? I am currently using an algorithm based on shifts that runs in log2(N) time, where N is the dimension of the vector. Is there anything faster? Note that my question is broader than the proposed answer (see the tags): it refers to vectors of all types (integer, float, double, ...).

How about this straightforward AVX code? I think it's O(N), and I don't know how you could possibly do better without making assumptions about the input data: you have to actually read every value to know if it's 0, so it's about doing as much of that as possible per cycle.
You should be able to massage the code to your needs. It treats both +0 and -0 as zero. It works for unaligned memory addresses, but aligning to 32-byte addresses will make the loads faster. You may need to add something to deal with the remaining elements if size isn't a multiple of 8.
#include <immintrin.h>
#include <stdint.h>

uint64_t num_non_zero_floats(const float *mem_addr, int size) {
    uint64_t num_non_zero = 0;
    __m256 zeros = _mm256_setzero_ps();
    for (int i = 0; i != size; i += 8) {
        __m256 vec = _mm256_loadu_ps(mem_addr + i);
        // _CMP_NEQ_UQ: lane is all-ones where vec differs from 0.0
        // (+0.0 and -0.0 compare equal, so both count as zero)
        __m256 comparison_out = _mm256_cmp_ps(zeros, vec, _CMP_NEQ_UQ); // 3 cycles latency, throughput 1
        uint64_t bits_non_zero = _mm256_movemask_ps(comparison_out);    // 2-3 cycles latency
        num_non_zero += __builtin_popcountll(bits_non_zero);
    }
    return num_non_zero;
}

If you want to test floats for +/-0.0, then you can check for all the bits being zero, except the sign bit: any set bit anywhere except the sign bit means the float is non-zero. (http://www.h-schmidt.net/FloatConverter/IEEE754.html)
Agner Fog's asm optimization guide points out that you can test a float or double for zero using integer instructions:
; Example 17.4b
mov eax, [rsi]
add eax, eax ; shift out the sign bit
jz IsZero
For vectors, though, using ptest with a sign-bit mask is better than using paddd to get rid of the sign bit. Actually, test dword [rsi], 0x7fffffff may be more efficient than Agner Fog's load/add sequence, but the 32-bit immediate probably stops the load from micro-fusing on Intel, and it may have a larger code size.
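The scalar sign-bit-masking test above can be sketched in C with a safe type-pun (an illustration of the idea, not of the exact asm sequences):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: test a single float for +/-0.0 with integer ops.
   memcpy is the well-defined way to reinterpret the bits;
   masking with 0x7fffffff ignores the sign bit. */
int is_zero_float(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bits & 0x7FFFFFFFu) == 0;
}
```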
x86 PTEST (SSE4.1) does a bitwise AND and sets flags based on the result.
movdqa xmm0, [mask]
.loop:
ptest xmm0, [rsi+rcx]
jnz nonzero
add rcx, 16 ; count up towards zero
jl .loop ; with rsi pointing one past the end of the array
...
nonzero:
Or cmov could be useful to consume the flags set by ptest.
IDK if it'd be possible to use a loop-counter instruction that didn't set the zero flag, so you could do both tests with one jump instruction or something. Probably not. And the extra uop to merge the flags (or the partial-flags stall on earlier CPUs) would cancel out the benefit.
@Iwillnotexist Idonotexist: re one of your comments on the OP: you can't just movemask without doing a pcmpeqd first, or a cmpps. The non-zero bit might not be in the high bit! You probably knew that, but one of your comments seemed to leave it out.
I do like the idea of ORing together multiple values before actually testing. You're right that sign-bits would OR with other sign-bits, and then you ignore them the same way you would if you were testing one at a time. A loop that PORs 4 or 8 vectors before each PTEST would probably be faster. (PTEST is 2 uops, and can't macro-fuse with a jcc.)
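The OR-before-testing idea can be sketched in C intrinsics. This uses SSE2 compare+movemask in place of PTEST so the sketch builds with baseline x86-64 flags; the structure (four PORs per test, sign bits masked off) is the same:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Sketch: is every float in p[0..n) equal to +/-0.0?
   n is assumed to be a multiple of 16; OR four vectors together,
   clear the sign bits, then test the group once. */
int all_zero(const float *p, size_t n) {
    const __m128i sign = _mm_set1_epi32((int)0x80000000); /* sign bits are don't-cares */
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        v = _mm_or_si128(v, _mm_loadu_si128((const __m128i *)(p + i + 4)));
        v = _mm_or_si128(v, _mm_loadu_si128((const __m128i *)(p + i + 8)));
        v = _mm_or_si128(v, _mm_loadu_si128((const __m128i *)(p + i + 12)));
        v = _mm_andnot_si128(sign, v);                    /* clear sign bits */
        __m128i z = _mm_cmpeq_epi32(v, _mm_setzero_si128());
        if (_mm_movemask_epi8(z) != 0xFFFF)               /* some lane was non-zero */
            return 0;
    }
    return 1;
}
```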


Delphi Warning W1073 Combining signed type and unsigned 64-bit type - treated as an unsigned type

I get the subject warning on the following line of code:
SelectedFilesSize := SelectedFilesSize +
UInt64(IdList.GetPropertyValue(TShellColumns.Size)) *
ifthen(Selected, 1, -1);
Specifically, the IDE highlights the third line.
SelectedFilesSize is declared as UInt64.
The code appears to work when I run it; if I select an item, its file size is added to the total, if I deselect a file its size is subtracted.
I know I can suppress this warning with {$WARN COMBINING_SIGNED_UNSIGNED64 OFF}.
Can someone explain? Will there be an unforeseen impact if SelectedFilesSize gets huge? Or an impact on a specific target platform?
Delphi 10.3, Win32 and Win64 targets
This will work here, but the warning is right.
If you multiply a UInt64 by -1, you are actually multiplying it by $FFFFFFFFFFFFFFFF. The full result would be a 128-bit value, but the lower 64 bits are the same as for a signed multiplication (that is also why the code generator often produces an imul opcode, even for unsigned multiplication: the lower bits will be correct, only the unused higher bits won't be). The upper 64 bits won't be used anyway, so they don't matter.
If you add that (actually negative) value to another UInt64 (e.g. SelectedFilesSize), the 64 bit result will be correct again. The CPU does not discriminate between positive or negative values when adding. The resulting CPU flags (carry, overflow) will indicate overflow, but if you ignore that by not using range or overflow checks, your code will be fine.
Your code will likely produce a runtime error if range or overflow checks are on, though.
In other words, this works because any excess upper bits (the 64th bit and above) can be ignored. Otherwise, the values would be wrong. See the example.
Example
Say your IdList.GetPropertyValue(TShellColumns.Size) is 420. Then you are performing:
$00000000000001A4 * $FFFFFFFFFFFFFFFF = $00000000000001A3FFFFFFFFFFFFFE5C
This is a huge but positive number; fortunately, the lower 64 bits ($FFFFFFFFFFFFFE5C) can be interpreted as -420 (a truly negative value in 128 bits would be $FFFFFFFFFFFFFFFFFFFFFFFFFFFFFE5C, or -420).
Now say your SelectedFilesSize is 100000 (hex $00000000000186A0). Then you get:
$00000000000186A0 + $FFFFFFFFFFFFFE5C = $00000000000184FC
(or actually $100000000000184FC, but the top bit, the carry, is ignored).
$00000000000184FC is 99580 in decimal, so exactly the value you wanted.
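The same wraparound can be demonstrated in C, where unsigned arithmetic is defined modulo 2^64: adding size * (uint64_t)-1 is exactly the same as subtracting size. (adjust and selected are illustrative names, not from the Delphi code.)

```c
#include <stdint.h>

/* Sketch: add the size when selected, subtract it when deselected,
   using only an unsigned multiply-by-(-1) as in the Delphi code.
   The multiplication and addition both wrap mod 2^64, so the low
   64 bits come out exactly right. */
uint64_t adjust(uint64_t total, uint64_t size, int selected) {
    return total + size * (selected ? (uint64_t)1 : (uint64_t)-1);
}
```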

32 bit multiplication on 24 bit ALU

I want to port a 32 x 32-bit unsigned multiplication to a 24-bit DSP (it's a linear congruential generator, so I'm not allowed to truncate; I also don't want to replace the current LCG with a 24-bit one yet). The available data types are 24- and 48-bit ints.
Only the 32 LSBs of the result are needed. Do you know any hacks to implement this in fewer multiplies, masks and shifts than the usual way?
The line looks like this:
//val is an int(32 bit)
val = (1664525 * val) + 1013904223;
An outline would be (in my current compiler style):
static uint48_t val = SEED;
...
val = 0xFFFFFFFFUL & ((1664525UL * val) + 1013904223UL);
and hopefully the compiler will recognise:
it can use a multiply and accumulate command
it only needs a reduced multiply algorithm due to the "high word" of the constant being zero
the AND could be effected by resetting the upper bits or multiplying a constant and restoring
...other stuff depends on your {mystery dsp} target
Note
if you scale up the coefficients by 2^16, you can get truncation for free, but due to lack of info
you will have to explore/decide if it is better overall.
(This is more an elaboration why two multiplications 24×24→n, 31<n are enough for 32×32→min(n, 40).)
The question discloses amazingly little about the capabilities to build a method
32×21→32 in fewer [24×24] multiplies, masks and shifts than the usual way on:
24 and 48 bit ints & DSP (I read high throughput, non-high latency 24×24→48).
As far as there indeed is a 24×24→48 multiply (or even 24×24+56→56 MAC) and one factor is less than 24 bits, the question is pointless, a second multiply being the compelling solution.
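For the LCG in the question, where the multiplier 1664525 fits comfortably in 24 bits, that two-multiply decomposition can be sketched in C, emulating the 24x24->48 multiply with uint64_t (a real DSP would use its hardware multiply/MAC unit):

```c
#include <stdint.h>

/* Sketch: one 32-bit LCG step built from two 24-bit-friendly multiplies.
   Split val into its low 24 bits and high 8 bits; A*lo24 is a 24x24->48
   multiply, A*hi8 is even smaller. Everything above bit 31 is discarded,
   which is allowed because only the 32 LSBs are needed. */
uint32_t lcg_step(uint32_t val) {
    const uint64_t A = 1664525u;                  /* < 2^24 (in fact < 2^21) */
    uint64_t lo24 = val & 0xFFFFFFu;              /* low 24 bits of val */
    uint64_t hi8  = val >> 24;                    /* high 8 bits of val */
    uint64_t prod = A * lo24 + ((A * hi8) << 24); /* two small multiplies */
    return (uint32_t)(prod + 1013904223u);        /* keep the 32 LSBs */
}
```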
The usual composition of a 24<n<48×24<m<48→24<p multiply from 24×24→48 uses three of the latter; a compiler should know as well as a coder that "the fourth multiply" would yield bits with a significance/position exceeding the combined lengths of the lower parts of the factors.
So, is it possible to generate "the long product" using just a second 24×24→48?
Let the (bytes of the) factors be w_xyz and W_XYZ, respectively; the underscores suggesting "the Ws" being the lower significance bits in the higher significance words/ints if interpreted as 24bit ints. The first 24×24→48 gives the sum of
  zX
 yXzY
xXyYzZ
 xYyZ
  xZ, and what is still needed is
 wZ +
 zW.
This can be computed using one combined multiplication of
((w<<16)|(z & 0xff)) × ((W<<16)|(Z & 0xff)). (Never mind the 17th bit of wZ+zW "running" into wW.)
(In the first revision of this answer, I foolishly produced wZ and zW separately - their sum is wanted in the end, anyway.)
(Annoyingly, this is about all you can do for 24×24→24 as a base operation too - beyond this "combining multiplication", you need four instead of one.)
Another angle to explore is choosing a different PRNG.
It may have to be >24 bits (tell!).
On a 24 bit machine, XorShift* (or even XorShift+) 48/32 seems worth a look.

Summing 3 lanes in a NEON float32x4_t

I'm vectorizing an inner loop with ARM NEON intrinsics (llvm, iOS). I'm generally using float32x4_ts. My computation finishes with the need to sum three of the four floats in this vector.
I can drop back to C floats at this point and vst1q_f32 to get the four values out and add up the three I need. But I figure it may be more effective if there's a way to do it directly with the vector in an instruction or two, and then just grab a single lane result, but I couldn't figure out any clear path to doing this.
I'm new to NEON programming, and the existing "documentation" is pretty horrific. Any ideas? Thanks!
You should be able to use the VFP unit for such a task. NEON and VFP share the same register bank, meaning you don't need to shuffle registers around to take advantage of one unit, and they can also have different views of the same register bits.
Your float32x4_t is 128 bits, so it must sit in a quad (Q) register. If you are solely using ARM intrinsics, you wouldn't know which one you are using. The problem is that if it is sitting above Q7, VFP can't see it as single precision (for the curious reader: I kept this simple since there are differences between VFP versions, and this is the bare minimum requirement). So it would be best to move your float32x4_t to a fixed register like Q0. After this you can just sum registers like S0, S1, S2 with vadd.f32 and move the result back to an ARM register.
Some warnings... VFP and NEON are theoretically different execution units sharing the same register bank and pipeline. I am not sure if this approach is any better than others; needless to say, you should benchmark. Also, this approach isn't streamlined with NEON intrinsics, so you would probably need to craft your code with inline assembly.
I wrote a simple snippet to see how this could look, and came up with this:
#include "arm_neon.h"
float32_t sum3() {
register float32x4_t v asm ("q0");
float32_t ret;
asm volatile(
"vadd.f32 s0, s1\n"
"vadd.f32 s0, s2\n"
"vmov %[ret], s0\n"
: [ret] "=r" (ret)
:
:);
return ret;
}
objdump of it looks like this (compiled with gcc -O3 -mfpu=neon -mfloat-abi=softfp):
00000000 <sum3>:
0: ee30 0a20 vadd.f32 s0, s0, s1
4: ee30 0a01 vadd.f32 s0, s0, s2
8: ee10 3a10 vmov r0, s0
c: 4770 bx lr
e: bf00 nop
I really would like to hear your impressions if you give this a go!
Can you zero-out the fourth element? Perhaps just by copying it and using vset_lane_f32?
If so, you can use the answers from Sum all elements in a quadword vector in ARM assembly with NEON like:
float32x2_t r = vadd_f32(vget_high_f32(input), vget_low_f32(input));
return vget_lane_f32(vpadd_f32(r, r), 0); // vpadd adds adjacent elements
Though this actually does a bit more work than you need, so it might be faster to just extract the three floats with vget_lane_f32 and add them.
It sounds like you want to use (some version of) VLD1 to load zero into your extra lane (unless you can arrange for it to be zero already), followed by two VPADD instructions to pairwise-sum four lanes into two and then two lanes into one.

Efficient conversion of an array of singles to an array of doubles in Delphi 2010

I need to implement a wrapper layer between a high-level application and a low-level subsystem that uses slightly different typing:
The application produces an array of single-precision vectors:
unit unApplication;
type
  TVector = record
    x, y, z: single;
  end;
  TVectorArray = array of TVector;
function someFunc: TVectorArray;
[...]
[...]
while the subsystem expects an array of double-precision vectors. I also implemented an implicit conversion from TVector to TVectorD:
unit unSubSystem;
type
  TVectorD = record
    x, y, z: double;
    class operator Implicit(value: TVector): TVectorD; inline;
  end;
  TVectorDArray = array of TVectorD;
procedure otherFunc(points: TVectorDArray);
implementation
class operator TVectorD.Implicit(value: TVector): TVectorD;
begin
  result.x := value.x;
  result.y := value.y;
  result.z := value.z;
end;
What I am currently doing is this:
uses unApplication, unSubSystem, ...;
procedure ConvertValues;
var
  singleVecArr: TVectorArray;
  doubleVecArr: TVectorDArray;
  i: Integer;
begin
  singleVecArr := someFunc;
  SetLength(doubleVecArr, Length(singleVecArr));
  for i := 0 to Length(singleVecArr) - 1 do
    doubleVecArr[i] := singleVecArr[i];
end;
Is there a more efficient way to perform these kinds of conversion?
First of all I would say that you should not attempt any optimisation without first timing. In this case I don't mean timing alternative algorithms, I mean timing the code in question and assessing what proportion of the total time is spent there.
My instincts tell me that the code you show will run for a tiny proportion of the overall time and so optimising it will yield no discernible benefits. I think if you do anything meaningful with each element of this array then that must be true since the cost of converting from single to double will be small compared to floating point operations.
Finally, if perchance this code is a bottleneck, you should consider not converting at all. My assumption is that you are using standard Delphi floating point operations, which map to the x87 FPU. All such floating point operations happen inside the x87 floating point stack; values are converted on entry to 64-bit or, more usually, 80-bit precision. I don't think it would be any slower to load a single than to load a double; in fact it may even be faster due to memory read performance.
Assuming that the conversion indeed is the bottleneck, then one way of speeding up the conversion may be to use SSE# instead of the FPU, provided the necessary instruction sets can be assumed to be present on the computers on which this code will run.
For instance, the following would convert one single Vector into one double Vector:
procedure SingleToDoubleVector(var S: TVector; var D: TVectorD);
// @S in EAX
// @D in EDX
asm
  movups   xmm0, [eax]    // Load S into xmm0
  movhlps  xmm1, xmm0     // Copy high 2 singles of xmm0 into xmm1
  cvtps2pd xmm2, xmm0     // Convert low two singles of xmm0 into doubles in xmm2
  cvtss2sd xmm3, xmm1     // Convert lowest single in xmm1 into a double in xmm3
  movupd   [edx], xmm2    // Store the two doubles in xmm2 to D.x and D.y
  movsd    [edx+16], xmm3 // Store the double in xmm3 to D.z
end;
I am not saying that this bit of code is the most efficient way to do it and there are many caveats with using assembly code in general and this code in particular. Note that this code makes assumptions about the alignment of the fields in your records. (It does not make assumptions regarding the alignment of the record as a whole.)
Also, for best results, you would control the alignment of your array/record elements in memory and write the entire conversion loop in assembly, to reduce overheads. Whether this is what you want/can do is another question.
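For illustration, the whole-array version of that cvtps2pd idea can be sketched in C with SSE2 intrinsics (a sketch, not the Delphi asm above; the function name and the multiple-of-4 assumption are mine):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Sketch: convert floats to doubles four at a time with cvtps2pd.
   n is assumed to be a multiple of 4; a scalar tail loop would
   handle any remaining elements. */
void floats_to_doubles(const float *src, double *dst, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128 s = _mm_loadu_ps(src + i);
        _mm_storeu_pd(dst + i,     _mm_cvtps_pd(s));                   /* low two  */
        _mm_storeu_pd(dst + i + 2, _mm_cvtps_pd(_mm_movehl_ps(s, s))); /* high two */
    }
}
```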
If modifying the source to produce doubles rather than singles is not possible, you can try threading the process. Divide the array into two or four equal-sized chunks (depending on processor count) and have each thread convert one chunk. Doing this can realize almost a two- or four-fold speedup.
Also, is the Length call evaluated on each loop iteration? Maybe hoist it into a variable to avoid the repeated call.

How could I guess a checksum algorithm?

Let's assume that I have some packets with a 16-bit checksum at the end. I would like to guess which checksum algorithm is used.
For a start, from dump data I can see that changing one byte in the packet's payload totally changes the checksum, so I can assume that it isn't some kind of simple XOR or sum.
Then I tried several variations of CRC16, but without much luck.
This question might be biased more towards cryptography, but I'm really interested in any easy-to-understand statistical tools to find out which CRC this might be. I might even resort to trying different CRC algorithms one by one if everything else fails.
Background story: I have a serial RFID protocol with some kind of checksum. I can replay messages without problems and interpret the results (without verifying the checksum), but I can't send modified packets because the device drops them on the floor.
Using existing software, I can change the payload of the RFID chip. However, the unique serial number is immutable, so I don't have the ability to check every possible combination. Although I could generate dumps of values incrementing by one, that would not be enough to make an exhaustive search applicable to this problem.
Dump files with data are available if the question itself isn't enough :-)
Need reference documentation? A Painless Guide to CRC Error Detection Algorithms is a great reference, which I found after asking the question here.
In the end, after the very helpful hint in the accepted answer that it's CCITT, I used this CRC calculator and XORed the generated checksum with the known checksum to get 0xffff, which led me to the conclusion that the final XOR is 0xffff instead of CCITT's 0x0000.
There are a number of variables to consider for a CRC:
Polynomial
No of bits (16 or 32)
Normal (LSB first) or Reverse (MSB first)
Initial value
How the final value is manipulated (e.g. subtracted from 0xffff, or XORed with a constant)
Typical CRCs:
LRC: Polynomial=0x81; 8 bits; Normal; Initial=0; Final=as calculated
CRC16: Polynomial=0xa001; 16 bits; Normal; Initial=0; Final=as calculated
CCITT: Polynomial=0x1021; 16 bits; Reverse; Initial=0xffff; Final=0x1d0f
Xmodem: Polynomial=0x1021; 16 bits; Reverse; Initial=0; Final=0x1d0f
CRC32: Polynomial=0xedb88320; 32 bits; Normal; Initial=0xffffffff; Final=inverted value
ZIP32: Polynomial=0x04c11db7; 32 bits; Normal; Initial=0xffffffff; Final=as calculated
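As a starting point for experimenting with the parameters above, here is a bit-at-a-time sketch of one common variant, CRC-16/CCITT-FALSE (poly 0x1021, initial 0xffff, no final XOR); the other variants differ only in the knobs listed above:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: bitwise CRC-16/CCITT-FALSE.
   To try other variants, change the initial value, the polynomial,
   the bit direction, or apply a final XOR to the return value. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len) {
    uint16_t crc = 0xFFFF;                 /* initial value */
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;     /* feed next byte, MSB first */
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;                            /* apply any final XOR here */
}
```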
The first thing to do is to get some samples by changing, say, the last byte. This will help you figure out the number of bytes in the CRC.
Is this a "homemade" algorithm? In that case it may take some time. Otherwise, try the standard algorithms.
Try changing either the MSB or the LSB of the last byte, and see how this changes the CRC. This will give an indication of the direction.
To make it more difficult, there are implementations that manipulate the CRC so that it will not affect the communications medium (protocol).
From your comment about RFID, it implies that the CRC is communications related. Usually CRC16 is used for communications, though CCITT is also used on some systems.
On the other hand, if this is UHF RFID tagging, then there are a few CRC schemes - a 5 bit one and some 16 bit ones. These are documented in the ISO standards and the IPX data sheets.
IPX: Polynomial=0x8005; 16 bits; Reverse; Initial=0xffff; Final=as calculated
ISO 18000-6B: Polynomial=0x1021; 16 bits; Reverse; Initial=0xffff; Final=as calculated
ISO 18000-6C: Polynomial=0x1021; 16 bits; Reverse; Initial=0xffff; Final=as calculated
Data must be padded with zeroes to make a multiple of 8 bits
ISO CRC5: Polynomial=custom; 5 bits; Reverse; Initial=0x9; Final=shifted left by 3 bits
Data must be padded with zeroes to make a multiple of 8 bits
EPC class 1: Polynomial=custom 0x1021; 16 bits; Reverse; Initial=0xffff; Final=post processing of 16 zero bits
Here is your answer: having worked through your logs, the CRC is the CCITT one. The first byte 0xd6 is excluded from the CRC.
It might not be a CRC, it might be an error correcting code like Reed-Solomon.
ECC codes are often a substantial fraction of the size of the original data they protect, depending on the error rate they need to handle. If the size of the messages is more than about 16 bytes, 2 bytes of ECC wouldn't be enough to be useful. So if the message is large, you're most likely correct that it's some sort of CRC.
I'm trying to crack a similar problem here and I found a pretty neat website that will take your file and run checksums on it with 47 different algorithms and show the results. If the algorithm used to calculate your checksum is any of these algorithms, you would simply find it among the list of checksums produced with a simple text search.
The website is https://defuse.ca/checksums.htm
You would have to try every possible checksum algorithm and see which one generates the same result. However, there is no guarantee as to what content was included in the checksum. For example, some algorithms skip white space, which leads to different results.
I really don't see why somebody would want to know that, though.
