Check all bytes of a __m128i for a match of a single byte using SSE/AVX/AVX2

I'm looking for efficient ways to compute the following function:
Input: __m128i data, uint8_t in;
Output: boolean indicating if any byte in data is in.
I'm essentially using them to implement a space/time efficient stack for bytes with capacity 8. My most efficient solution is to first compute a __m128i tmp with all bytes set to in, then check whether any byte in tmp xor data is zero.

Yes, AVX2 has efficient byte-broadcast. SSSE3 pshufb with an all-zero mask is just as cheap, but you have to create the shuffle-control vector. AVX512BW/F even has single-instruction vpbroadcastb/w/d/q x/y/zmm, r32. (With optional masking, so you can zero some or merge with an existing vector if you want, e.g. insert at a position using a single-bit mask.)
Fortunately, compilers know how to do this when implementing _mm_set1_epi8 so we can leave it to the compiler.
Then it just boils down to the usual pcmpeqb/pmovmskb to get an integer which will have a 1 bit for matching elements, which you can branch on.
// 0 for not found, non-zero for found. (Bit position tells you where).
unsigned contains(__m128i data, uint8_t needle) {
__m128i k = _mm_set1_epi8(needle);
__m128i cmp = _mm_cmpeq_epi8(data, k); // vector mask
return _mm_movemask_epi8(cmp); // integer bitmask
}
As you'd expect, with AVX2 enabled all compilers turn this into the following asm (Godbolt):
contains(long long __vector(2), unsigned char):
vmovd xmm1, edi
vpbroadcastb xmm1, xmm1
vpcmpeqb xmm0, xmm0, xmm1
vpmovmskb eax, xmm0
ret
Except MSVC, which wastes an instruction on movsx eax, dl first. (Windows x64 passes the 2nd arg in RDX, vs. x86-64 System V passing the first integer arg in RDI.)
Without AVX2, you'll get something like this with SSSE3 or above
# gcc8.3 -O3 -march=nehalem
contains(long long __vector(2), unsigned char):
movd xmm1, edi
pxor xmm2, xmm2
pshufb xmm1, xmm2 # _mm_shuffle_epi8(needle, _mm_setzero_si128())
pcmpeqb xmm0, xmm1
pmovmskb eax, xmm0
ret
Or with just SSE2 (baseline for x86-64):
contains(long long __vector(2), unsigned char):
mov DWORD PTR [rsp-12], edi
movd xmm1, DWORD PTR [rsp-12] # gcc's tune=generic strategy is still store/reload /facepalm
punpcklbw xmm1, xmm1 # duplicate to low 2 bytes
punpcklwd xmm1, xmm1 # duplicate to low 4 bytes
pshufd xmm1, xmm1, 0 # broadcast
pcmpeqb xmm1, xmm0
pmovmskb eax, xmm1
ret
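Since the bit position in the mask tells you where the match is, here is a minimal sketch of turning the mask into an index. It assumes an x86 target with SSE2. The helper name first_match_index is hypothetical, and __builtin_ctz is a GCC/Clang builtin (MSVC would need _BitScanForward instead).

```c
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_cmpeq_epi8, _mm_movemask_epi8 */
#include <stdint.h>

/* Same function as in the answer: bit i of the result is set
   if byte i of data equals needle. */
unsigned contains(__m128i data, uint8_t needle) {
    __m128i k = _mm_set1_epi8((char)needle);   /* broadcast the byte */
    __m128i cmp = _mm_cmpeq_epi8(data, k);     /* 0xFF in matching lanes */
    return (unsigned)_mm_movemask_epi8(cmp);   /* one bit per byte lane */
}

/* Hypothetical helper: index (0..15) of the lowest matching byte,
   or -1 if no byte matches. __builtin_ctz is a GCC/Clang builtin. */
int first_match_index(__m128i data, uint8_t needle) {
    unsigned mask = contains(data, needle);
    return mask ? __builtin_ctz(mask) : -1;
}
```

Branching on contains(...) != 0 answers the original yes/no question. The index is only needed if you also want to know which slot matched.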
Related:
How to compare two vectors using SIMD and get a single boolean result? and many duplicates
How can I count the occurrence of a byte in array using SIMD?
SIMD/SSE: How to check that all vector elements are non-zero (pxor+ptest+jcc = 4 uops vs. pcmpeqb+pmovmskb + macro-fused test/jcc = 3 uops.)
The indices of non-zero bytes of an SSE/AVX register (finding the match positions)
How to count character occurrences using SIMD (like memchr but counting matches instead of finding the first, using AVX2. With efficient accumulation of the counts, and efficient horizontal sums.)

Related

No memory source operand for crc32 in Delphi 11? [duplicate]

I am trying to assemble the code below using yasm. I have put 'here' comments where yasm reports the error "error: invalid size for operand 2". Why is this error happening ?
segment .data
a db 25
b dw 0xffff
c dd 3456
d dq -14
segment .bss
res resq 1
segment .text
global _start
_start:
movsx rax, [a] ; here
movsx rbx, [b] ; here
movsxd rcx, [c] ; here
mov rdx, [d]
add rcx, rdx
add rbx, rcx
add rax, rbx
mov [res], rax
ret
For most instructions, the width of the register operand implies the width of the memory operand, because both operands have to be the same size. e.g. mov rdx, [d] implies mov rdx, qword [d] because you used a 64-bit register.
But the same movsx / movzx mnemonics are used for the byte-source and word-source opcodes, so it's ambiguous unless the source is a register (like movzx eax, cl). Another example is crc32 r32, r/m8 vs. r/m16 vs. r/m32. (Unlike movsx/zx, its source size can be as wide as the operand-size.)
movsx / movzx with a memory source always need the width of the memory operand specified explicitly.
The movsxd mnemonic is supposed to imply a 32-bit source size. movsxd rcx, [c] assembles with NASM, but apparently not with YASM. YASM requires you to write dword, even though it doesn't accept byte, word, or qword there, and it doesn't accept movsx rcx, dword [c] either (i.e. it requires the movsxd mnemonic for 32-bit source operands).
In NASM, movsx rcx, dword [c] assembles to movsxd, but movsxd rcx, word [c] is still rejected. i.e. in NASM, plain movsx is fully flexible, but movsxd is still rigid. I'd still recommend using dword to make the width of the load explicit, for the benefit of humans.
movsx rax, byte [a]
movsx rbx, word [b]
movsxd rcx, dword [c]
Note that the "operand size" of the instruction (as determined by the operand-size prefix to make it 16-bit, or REX.W=1 to make it 64-bit) is the destination width for movsx / movzx. Different source sizes use different opcodes.
In case it's not obvious, there's no movzxd because 32-bit mov already zero-extends to 64-bit implicitly. movsxd eax, ecx is encodeable, but not recommended (use mov instead).
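As a rough model of what the three corrected loads compute, the sketch below mimics sign and zero extension in C, using the values from the question's data section. The function names are hypothetical, and the unsigned-to-signed casts assume the usual two's-complement behaviour of mainstream compilers.

```c
#include <stdint.h>

/* movsx rax, byte [a]: sign-extend 8 bits to 64 */
int64_t movsx_byte(uint8_t m) { return (int64_t)(int8_t)m; }

/* movsx rbx, word [b]: sign-extend 16 bits to 64 */
int64_t movsx_word(uint16_t m) { return (int64_t)(int16_t)m; }

/* movsxd rcx, dword [c]: sign-extend 32 bits to 64 */
int64_t movsxd_dword(uint32_t m) { return (int64_t)(int32_t)m; }

/* movzx, for contrast: the upper bits are filled with zeros */
int64_t movzx_word(uint16_t m) { return (int64_t)m; }

/* The sum the question's program stores into res. */
int64_t question_sum(void) {
    int64_t rax = movsx_byte(25);       /* a db 25     */
    int64_t rbx = movsx_word(0xffff);   /* b dw 0xffff */
    int64_t rcx = movsxd_dword(3456);   /* c dd 3456   */
    int64_t rdx = -14;                  /* d dq -14    */
    rcx += rdx;
    rbx += rcx;
    rax += rbx;
    return rax;
}
```

The word 0xffff becomes -1 under movsx but 65535 under movzx, which is why the source width has to be unambiguous.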
In AT&T syntax, you need to explicitly specify both the source and destination width in the mnemonic, like movsbq (%rsi), %rax. GAS won't let you write movsb (%rsi), %eax to infer a destination width (operand-size) because movsb/movsw/etc are the mnemonics for string-move instructions with implicit (%rsi), (%rdi) operands.
Fun fact: GAS and clang do allow it for things like movzb (%rsi), %eax as movzbl, but GAS only has extra logic to allow disambiguation (not just inferring size) based on operands when it's necessary, like movsd (%rsi), %xmm0 vs. movsd. (Clang12.0.1 actually does accept movsb (%rcx), %eax as movsbl, but GAS 2.36.1 doesn't, so for portability it's best to be explicit with sign-extension, and not a bad idea for zero-extension too.)
Other stuff about your source code:
NASM/YASM allow you to use the segment keyword instead of section, but really you're giving ELF section names, not executable segment names. Also, you can put read-only data in section .rodata (which is linked as part of the text segment). What's the difference of section and segment in ELF file format.
You can't ret from _start. It's not a function, it's your ELF entry point. The first thing on the stack is argc, not a valid return address. Use this to exit cleanly:
xor edi,edi
mov eax, 231
syscall ; sys_exit_group(0)
See the x86 tag wiki for links to more useful guides (and debugging tips at the bottom).

Inline asm (32) emulation of move (copy memory) command

I have two two-dimensional arrays with dynamic sizes (guess that's the proper wording). I copy the content of first one into the other using:
dest:=copy(src,0,4*x*y);
// src,dest:array of array of longint; x,y:longint;
// setlength(both arrays,x,y); //x and y are max 15 bit positive!
It works. However I'm unable to reproduce this in asm. I tried the following variations to no avail... Could someone enlighten me...
MOV ESI,src; MOV EDI,dest; MOV EBX,y; MOV EAX,x; MUL EBX;
PUSH DS; POP ES; MOV ECX,EAX; CLD; REP MOVSD;
Also tried with LEA (didn't expect that to work since it should fetch the pointer address not the array address), no workie, and tried with:
p1:=@src[0,0]; p2:=@dest[0,0]; //being no-type pointers
MOV ESI,p1; MOV EDI,p2... (the same asm)
Hints pls? Btw it's delphi 6. The error is, of course, access violation.
This is really a three-fold question.
What's the structure of a dynamic array.
Which instructions in asm will copy the array.
I'm throwing random assembler at the CPU, why doesn't it work?
Structure of a dynamic array
See: http://docwiki.embarcadero.com/RADStudio/Seattle/en/Internal_Data_Formats
To quote:
Dynamic Array Types
On the 32-bit platform, a dynamic-array variable occupies 4 bytes of memory (and 8 bytes on 64-bit) that contain a pointer to the dynamically allocated array. When the variable is empty (uninitialized) or holds a zero-length array, the pointer is nil and no dynamic memory is associated with the variable. For a nonempty array, the variable points to a dynamically allocated block of memory that contains the array in addition to a 32-bit (64-bit on Win64) length indicator and a 32-bit reference count. The table below shows the layout of a dynamic-array memory block.
Dynamic array memory layout (32-bit and 64-bit)
Offset 32-bit   -8         -4       0
Offset 64-bit   -12        -8       0
Contents        refcount   count    start of data
So the dynamic array variable is a pointer to the middle of the above structure.
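To make the "pointer to the middle" idea concrete, here is a small C model of the 32-bit layout from the table. It is only an illustration of the layout, not Delphi's actual runtime code; the names DynArrayHeader, dynarray_new, dynarray_count and dynarray_free are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed 32-bit layout: refcount at offset -8, count at -4, data at 0. */
typedef struct {
    int32_t refcount;
    int32_t count;
} DynArrayHeader;

/* Allocate header plus data and return a pointer to the data,
   i.e. to the middle of the allocated block. */
int32_t *dynarray_new(int32_t count) {
    DynArrayHeader *h = malloc(sizeof *h + (size_t)count * sizeof(int32_t));
    if (!h) return NULL;
    h->refcount = 1;
    h->count = count;
    return (int32_t *)(h + 1);
}

/* Read the element count stored just in front of the data.
   A nil (NULL) variable models an empty array. */
int32_t dynarray_count(const int32_t *data) {
    return data ? ((const DynArrayHeader *)data - 1)->count : 0;
}

/* Free the whole block by stepping back to the header. */
void dynarray_free(int32_t *data) {
    if (data) free((DynArrayHeader *)data - 1);
}
```

Reading the count is what mov ecx,[esi-4] does in assembler when esi holds the array variable.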
How do I access this in asm
Let's assume the array holds records of type TMyRec
You'll need to run this code for every inner array in the outer array to do the deep copy. I leave this as an exercise for the reader (you can do the other part in Pascal).
type
TDynArr = array of TMyRec;
procedure SlowButBasicMove(const Source: TDynArr; var dest);
asm
//insert register pushes, see below.
mov esi,Source //esi = pointer to source data
mov edi,Dest //edi = pointer to dest
sub esi,8 //step back to the header in front of the data
mov ebx,[esi] //ebx = refcount (just in case)
mov ecx,[esi+4] //ecx = element count
mov edx,SizeOf(TMyRec) //anywhere from 1 to zillions
imul ecx,edx //ecx = number of bytes in array (mul only takes one operand)
//// now we can start moving
xor ebx,ebx //ebx = 0, the byte offset into the data
add esi,8 //esi = @data again
@loop:
mov eax,[esi+ebx] //Get data from source
mov [edi+ebx],eax //copy it to dest (the value in eax, not the pointer in esi)
add ebx,4 //4 bytes at a time, so the byte count must be a multiple of 4
cmp ebx,ecx //is ebx < number of bytes?
jl @loop //jl, not jle, so we stop before reading past the end
//Done copying.
//insert register pops, see below
end;
That's the copy done, however in order for the system not to crash, you need to save and restore the non volatile registers (all but EAX, ECX, EDX), see: http://docwiki.embarcadero.com/RADStudio/Seattle/en/Program_Control
push ebx
push esi
push edi
--- insert code shown above
//restore non-volatile registers
pop edi
pop esi
pop ebx //note the restoring must happen in the reverse order of the push.
See Jeff Duntemann's book Assembly Language Step-by-Step if you're completely new to asm.
You will get access violations if:
you try to read from a wrong address.
you try to write to a wrong address.
you read past the end of the array.
you write to memory you haven't claimed before using GetMem or whatever means.
if you write past the end of your buffer.
if you do not restore all non-volatile registers
Remember you're directly dealing with the CPU. Delphi will not assist you in any way.
Really fast code will use some form of SSE to move 16 bytes per instruction in an unrolled loop; see the Fastcode project for examples of optimized assembler.
Random assembler
In assembler you need to know exactly what you want to do, how to do it, and what the CPU does.
Set a breakpoint and run your code. Press ctrl + alt + C and behold the CPU-debug window.
This will allow you to see the code Delphi generates.
You can single step through the code to see what the CPU does.
see: http://www.plantation-productions.com/Webster/index.html
For more reading.
Dynamic Arrays differ from Static Arrays, especially when it comes to multi-dimensionality.
Refer to this reference for internal formats.
The point is that an Array Of Array Of LongInt of dimensions X and Y (in this order!) is a pointer to an array of X pointers that point to an array of Y LongInts.
Since it seems, from your comments, that you have already allocated the space for all elements in Dest, I assume you want to do a Deep Copy.
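The pointer-to-pointers layout described above can be sketched in C as follows, purely as an illustration of why a single flat block copy cannot work. The function name deep_copy is hypothetical, and the sketch assumes every destination row is already allocated, as setLength does in the question.

```c
#include <stdint.h>
#include <string.h>

/* src and dest each point to an array of x row pointers;
   every row holds y 32-bit integers (LongInt).
   The rows are separate allocations, so they are copied one by one. */
void deep_copy(int32_t *const *src, int32_t *const *dest, size_t x, size_t y) {
    for (size_t i = 0; i < x; i++) {
        memcpy(dest[i], src[i], y * sizeof(int32_t));
    }
}
```

A single REP MOVSD starting at the outer pointer would copy the row pointers (and run off the end of that small block) rather than the integers, which matches the access violation in the question.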
Here is a sample program, where the assembly has been made as simple as possible for the sake of clarity.
Program Test;
Var Src, Dest: Array Of Array Of LongInt;
X, Y, I, J: Integer;
Begin
X := 4;
Y := 2;
setLength(Src, X, Y);
setLength(Dest, X, Y);
For I := 0 To X-1 Do
For J := 0 To Y-1 Do
Src[I,J] := I*Y + J;
{$ASMMODE intel}
Asm
cld ;Be sure movsd increments the registers
mov esi, DWORD PTR [Src] ;Src pointer
mov edi, DWORD PTR [Dest] ;Dest pointer
mov ecx, DWORD PTR [X] ;Repeat for X times
;The number of elements in Src
@_copy:
push esi ;Save these for later
push edi
push ecx
mov ecx, DWORD PTR [Y] ;Repeat for Y times
;The number of element in a Src[i] array
mov edi, DWORD PTR [edi] ;Get the pointer to the Dest[i] array
mov esi, DWORD PTR [esi] ;Get the pointer to the Src[i] array
rep movsd ;Copy sub array
pop ecx ;Restore
pop edi
pop esi
add esi, 04h ;Go from Src[i] to Src[i+1]
add edi, 04h ;Go from Dest[i] to Dest[i+1]
loop @_copy
End;
For I := 0 To X-1 Do
Begin
WriteLn();
For J := 0 To Y-1 Do
Begin
Write(' ');
Write(Dest[I,J]);
End;
End;
End.
Note 1 This source code is intended to be compiled with Free Pascal.
Donation of Spare Time(TM) for downloading and installing Delphi are welcome!
Note 2 This source code is for illustration purposes only. That should be obvious, and it has already been stated above, but somehow not everybody got it.
If the OP wanted a fast way to copy the array, they should have stated so.
Note 3 I don't save the clobbered registers, this is bad practice, my bad; I forgot, as there are no subroutines, no optimizations and no reason for the compiler to pass data in the registers between the two fors.
This is left as an exercise to the reader.

Why is 64 bit Delphi app calculating different results than 32 bit build?

We recently started the process of creating 64 bit builds of our applications. During comparison testing we found that the 64 bit build is calculating differently. I have a code sample that demonstrates the difference between the two builds.
var
currPercent, currGross, currCalcValue : Currency;
begin
currGross := 1182.42;
currPercent := 1.45;
currCalcValue := (currGross * (currPercent * StrToCurr('.01')));
ShowMessage(CurrToStr(currCalcValue));
end;
If you step through this in the 32 bit version, currCalcValue is calculated with 17.1451 while the 64 bit version comes back with 17.145.
Why isn't the 64 bit build calculating out the extra decimal place? All variables are defined as 4 decimal currency values.
Here's my SSCCE based on your code. Note the use of a console application. Makes life much simpler.
{$APPTYPE CONSOLE}
uses
SysUtils;
var
currPercent, currGross, currCalcValue : Currency;
begin
currGross := 1182.42;
currPercent := 1.45;
currCalcValue := (currGross * (currPercent * StrToCurr('.01')));
Writeln(CurrToStr(currCalcValue));
Readln;
end.
Now look at the code that is generated. First 32 bit:
Project3.dpr.13: currCalcValue := (currGross * (currPercent * StrToCurr('.01')));
0041C409 8D45EC lea eax,[ebp-$14]
0041C40C BADCC44100 mov edx,$0041c4dc
0041C411 E8A6A2FEFF call @UStrLAsg
0041C416 8B1504E74100 mov edx,[$0041e704]
0041C41C 8B45EC mov eax,[ebp-$14]
0041C41F E870AFFFFF call StrToCurr
0041C424 DF7DE0 fistp qword ptr [ebp-$20]
0041C427 9B wait
0041C428 DF2DD83E4200 fild qword ptr [$00423ed8]
0041C42E DF6DE0 fild qword ptr [ebp-$20]
0041C431 DEC9 fmulp st(1)
0041C433 DF2DE03E4200 fild qword ptr [$00423ee0]
0041C439 DEC9 fmulp st(1)
0041C43B D835E4C44100 fdiv dword ptr [$0041c4e4]
0041C441 DF3DE83E4200 fistp qword ptr [$00423ee8]
0041C447 9B wait
And the 64 bit:
Project3.dpr.13: currCalcValue := (currGross * (currPercent * StrToCurr('.01')));
0000000000428A0E 488D4D38 lea rcx,[rbp+$38]
0000000000428A12 488D1513010000 lea rdx,[rel $00000113]
0000000000428A19 E84213FEFF call @UStrLAsg
0000000000428A1E 488B4D38 mov rcx,[rbp+$38]
0000000000428A22 488B155F480000 mov rdx,[rel $0000485f]
0000000000428A29 E83280FFFF call StrToCurr
0000000000428A2E 4889C1 mov rcx,rax
0000000000428A31 488B0510E80000 mov rax,[rel $0000e810]
0000000000428A38 48F7E9 imul rcx
0000000000428A3B C7C110270000 mov ecx,$00002710
0000000000428A41 48F7F9 idiv rcx
0000000000428A44 488BC8 mov rcx,rax
0000000000428A47 488B0502E80000 mov rax,[rel $0000e802]
0000000000428A4E 48F7E9 imul rcx
0000000000428A51 C7C110270000 mov ecx,$00002710
0000000000428A57 48F7F9 idiv rcx
0000000000428A5A 488905F7E70000 mov [rel $0000e7f7],rax
Note that the 32 bit code performs the arithmetic on the FPU, but the 64 bit code performs it using integer arithmetic. That's the key difference.
In the 32 bit code, the following calculation is performed:
Convert '0.01' to currency, which is 100, allowing for the fixed point shift of 10,000.
Load 14,500 into the FPU.
Multiply by 100 giving 1,450,000.
Multiply by 11,824,200 giving 17,145,090,000,000.
Divide by 10,000^2 giving 171,450.9.
Round to the nearest integer giving 171,451.
Store that in your currency variable. Hence the result is 17.1451.
Now, in the 64 bit code, it's a little different. Because we use 64 bit integers all the way. It looks like this:
Convert '0.01' to currency, which is 100.
Multiply by 14,500 which is 1,450,000.
Divide by 10,000 which is 145.
Multiply by 11,824,200 giving 1,714,509,000.
Divide by 10,000 which is 171,450. Uh-oh, loss of precision here.
Store that in your currency variable. Hence the result is 17.145.
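So the 64 bit compiler divides by 10,000 after each multiplication, presumably to avoid overflow, which is much more likely in a 64 bit integer than in a floating point register. Each of those divisions is an idiv, which truncates, whereas the 32 bit code divides once and lets fistp round to the nearest integer. In this example the first division is exact (1,450,000 / 10,000 = 145), so the digit is lost in the final division, where 171,450.9 is truncated to 171,450.
Were it to do the calculation like this:
100 * 14,500 * 11,824,200 / 10,000 / 10,000
and round the result instead of truncating it, it would get the right answer.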
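The two instruction sequences can be modelled roughly in C to reproduce both results. This is a sketch only: the function names are hypothetical, double stands in for the 80-bit x87 registers, and adding 0.5 before the cast approximates fistp's round-to-nearest for non-negative values.

```c
#include <stdint.h>

/* Currency values are 64-bit integers scaled by 10,000. */
#define CURR_SCALE 10000

/* 64 bit code path: every multiplication is followed at once by a
   truncating division (C's integer division truncates, like idiv). */
int64_t curr_mul_trunc(int64_t a, int64_t b) {
    return a * b / CURR_SCALE;
}

int64_t calc_int_like(int64_t gross, int64_t percent, int64_t hundredth) {
    return curr_mul_trunc(gross, curr_mul_trunc(percent, hundredth));
}

/* 32 bit code path: multiply everything in floating point, divide once
   by 10,000^2, then round to the nearest integer.
   Assumes non-negative inputs. */
int64_t calc_fpu_like(int64_t gross, int64_t percent, int64_t hundredth) {
    double t = (double)percent * (double)hundredth * (double)gross;
    return (int64_t)(t / ((double)CURR_SCALE * CURR_SCALE) + 0.5);
}
```

With gross = 11,824,200, percent = 14,500 and hundredth = 100 (the scaled values from the steps above), calc_int_like returns 171,450 and calc_fpu_like returns 171,451, i.e. 17.145 versus 17.1451.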
This has been fixed as of XE5u2 and as of current writing XE6u1.

FLD instruction x64 bit

I have a little problem with FLD instruction in x64 bit ...
I want to load a Double value onto the FPU stack, into the st0 register, but it seems to be impossible.
In Delphi x32, I can use this code :
function DoSomething(X:Double):Double;
asm
FLD X
// Do Something ..
FST Result
end;
Unfortunately, in x64, the same code does not work.
Delphi follows the Microsoft x64 calling convention.
So if the arguments of a function/procedure are float/double, they are passed in the XMM0L, XMM1L, XMM2L, and XMM3L registers.
But you can put var before the parameter as a workaround, like this:
function DoSomething(var X:Double):Double;
asm
FLD qword ptr [X]
// Do Something ..
FST Result
end;
In x64 mode floating point parameters are passed in xmm-registers. So when Delphi tries to compile FLD X, it becomes FLD xmm0 but there is no such instruction. You first need to move it to memory.
The same goes with the result, it should be passed back in xmm0.
Try this (not tested):
function DoSomething(X:Double):Double;
var
Temp : double;
asm
MOVQ qword ptr Temp,X
FLD Temp
//do something
FSTP Temp // store and pop, so the x87 stack is left balanced
MOVQ xmm0,qword ptr Temp
end;
You don't need to use legacy x87 stack registers in x86-64 code, because SSE2 is baseline, a required part of the x86-64 ISA. You can and should do your scalar FP math using addsd, mulsd, sqrtsd and so on, on XMM registers. (Or addss for float)
The Windows x64 calling convention passes float/double FP args in XMM0..3, if they're one of the first four args to the function. (i.e. the 3rd total arg goes in xmm2 if it's FP, rather than the 3rd FP arg going in xmm2.) It returns FP values in XMM0.
Only use x87 if you actually need 80-bit precision inside your function. (Instructions like fsin and fyl2x are not fast, and can usually be done just as well by normal math libraries using SSE/SSE2 instructions.)
function times2(X:Double):Double;
asm
addsd xmm0, xmm0 // upper 8 bytes of XMM0 are ignored
ret
end;
Storing to memory and reloading into an x87 register costs you about 10 cycles of latency for no benefit. SSE/SSE2 scalar instructions are just as fast as, or faster than, their x87 equivalents, and easier to program for and optimize because you never need fxch; it's a flat register design instead of stack-based (see https://agner.org/optimize/). Also, you have 16 XMM registers.
Of course, you usually don't need inline asm at all. It could be useful for manually-vectorizing if the compiler doesn't do that for you.

Printing character from register

I'm using the NASM assembler.
The value returned to the eax register is supposed to be a character, but when I attempt to print the integer representation it's a value that looks like a memory address. I was expecting the decimal representation of the letter. For example, if character 'a' was moved to eax I should see 97 being printed (the decimal representation of 'a'). But this is not the case.
section .data
int_format db "%d", 0
;-----------------------------------------------------------
mov eax, dword[ebx + edx]
push eax
push dword int_format
call _printf ;prints a strange number
add esp, 8
xor eax, eax
mov eax, dword[ebx + edx]
push eax
call _putchar ;prints the correct character!
add esp, 4
So what gives here? ultimately I want to compare the character so it is important that eax gets the correct decimal representation of the character.
mov eax, dword[ebx + edx]
You are loading a dword (32 bits) from the address ebx+edx. If you want a single character, you need to load a byte. For that, you can use the movzx instruction:
movzx eax, byte[ebx + edx]
This will load a single byte to the low byte of eax (i.e. al) and zero out the rest of the register.
Another option would be to mask out the extra bytes after loading the dword, e.g.:
and eax, 0ffh
or
movzx eax, al
As for putchar, it works because it interprets the passed value as char, i.e. it ignores the high three bytes present in the register and considers only the low byte.
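The difference between the two loads can be sketched in C. This assumes a little-endian machine such as x86, and the function names load_dword and load_byte_zx are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Models mov eax, dword [ebx + edx]: reads four consecutive bytes. */
uint32_t load_dword(const char *base, size_t off) {
    uint32_t v;
    memcpy(&v, base + off, sizeof v);
    return v;
}

/* Models movzx eax, byte [ebx + edx]: reads one byte and
   zero-extends it to 32 bits. */
uint32_t load_byte_zx(const char *base, size_t off) {
    return (unsigned char)base[off];
}
```

For the string "abcd", load_byte_zx gives 97, while load_dword gives 0x64636261 on little-endian x86: the character plus its three neighbours, which is the strange number printf shows. Masking that dword with 0xff recovers 97, as the and eax, 0ffh option does.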

Resources