I want a universal way of detecting specific CPU features. For this task I've written the function below, which takes the EAX leaf number, a register name and a bit number, and returns true or false. It works fine for MMX/SSEx/AVX (EAX=1), but it does not detect AVX2 (EAX=7).
CPU: i5-4670k
OS: Windows 7
DetectCPUFeature('1','EDX',23) //DETECTS MMX CORRECTLY
DetectCPUFeature('1','EDX',25) //DETECTS SSE CORRECTLY
DetectCPUFeature('1','EDX',26) //DETECTS SSE2 CORRECTLY
DetectCPUFeature('1','ECX',0) //DETECTS SSE3 CORRECTLY
DetectCPUFeature('1','ECX',9) //DETECTS SSSE3 CORRECTLY
DetectCPUFeature('1','ECX',19) //DETECTS SSE4.1 CORRECTLY
DetectCPUFeature('1','ECX',20) //DETECTS SSE4.2 CORRECTLY
DetectCPUFeature('1','ECX',28) //DETECTS AVX CORRECTLY
DetectCPUFeature('7','EBX',5) //DOES NOT DETECT AVX2!
function DetectCPUFeature(EAX_Leaf_HEX,Register_Name:string;Bit:byte):boolean;
var _eax,_ebx,_ecx,_edx,EAX_Leaf,_Result: Longword;
x:integer;
Binary_mask:string;
Decimal_mask:int64;
begin
EAX_Leaf:=HexToInt(EAX_Leaf_HEX);
Binary_mask:='1';
for x:=1 to Bit do Binary_mask:=Binary_mask+'0';
Decimal_mask:=BinToInt(Binary_mask);
if AnsiUpperCase(Register_Name)='EDX' then
begin
asm
mov eax,EAX_Leaf // https://en.wikipedia.org/wiki/CPUID
db $0F,$A2 // db $0F,$A2 = CPUID instruction
mov _Result,edx
end;
end;
if AnsiUpperCase(Register_Name)='ECX' then
begin
asm
mov eax,EAX_Leaf
db $0F,$A2
mov _Result,ecx
end;
end;
if AnsiUpperCase(Register_Name)='EBX' then
begin
asm
mov eax,EAX_Leaf
db $0F,$A2
mov _Result,ebx
end;
end;
if (_Result and Decimal_mask) = Decimal_mask then DetectCPUFeature:=true
else DetectCPUFeature:=false;
end;
This sort of code, mixing asm with Pascal, is very dubious. The asm blocks modify registers and fail to restore them, which could easily conflict with the compiler's register usage. My strong advice is that you should never mix asm and Pascal in this way. Always use pure Pascal or pure asm.
What you need is a function that will perform the CPUID instruction and return you all the registers in a structure. You can then pick out what you want from that using Pascal code.
In addition, as @J... points out, you need to specify the sub-leaf value in the ECX register before invoking the CPUID instruction. That is a requirement for a number of the more recently added CPUID arguments.
This is the function you need:
type
TCPUID = record
EAX: Cardinal;
EBX: Cardinal;
ECX: Cardinal;
EDX: Cardinal;
end;
function GetCPUID(Leaf, Subleaf: Cardinal): TCPUID;
asm
push ebx
push edi
mov edi, ecx
mov ecx, edx
cpuid
mov [edi+$0], eax
mov [edi+$4], ebx
mov [edi+$8], ecx
mov [edi+$c], edx
pop edi
pop ebx
end;
I've written this for 32 bit code, but if you need to support 64 bit code also that support is easy enough to add.
function GetCPUID(Leaf, Subleaf: Cardinal): TCPUID;
asm
{$IF Defined(CPUX86)}
push ebx
push edi
mov edi, ecx
mov ecx, edx
cpuid
mov [edi+$0], eax
mov [edi+$4], ebx
mov [edi+$8], ecx
mov [edi+$c], edx
pop edi
pop ebx
{$ELSEIF Defined(CPUX64)}
mov r9,rcx
mov ecx,r8d
mov r8,rbx
mov eax,edx
cpuid
mov [r9+$0], eax
mov [r9+$4], ebx
mov [r9+$8], ecx
mov [r9+$c], edx
mov rbx, r8
{$ELSE}
{$Message Fatal 'GetCPUID has not been implemented for this architecture.'}
{$IFEND}
end;
With this at hand you can call CPUID passing any value as input, and retrieve all 4 registers of output, with which you can then do whatever you please.
Your code to create a bitmask is extremely inefficient and very far from idiomatic. Use 1 shl N to create a value with a single bit set, in position N.
Code like this:
if (_Result and Decimal_mask) = Decimal_mask then DetectCPUFeature:=true
else DetectCPUFeature:=false;
is also some way from idiomatic. That would normally be written like this:
DetectCPUFeature := value and mask <> 0;
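Putting the two remarks together, a minimal sketch of a bit test in pure Pascal might look like the following. `BitIsSet` is a hypothetical helper name, not something from the question or from any library, and it assumes `Bit` is in the range 0..31.

```pascal
// Hypothetical helper (name chosen for illustration only).
// Returns True when bit number Bit (0..31) is set in Value.
function BitIsSet(Value: Cardinal; Bit: Integer): Boolean;
begin
  // Cardinal(1) shl Bit builds a mask with exactly one bit set.
  Result := (Value and (Cardinal(1) shl Bit)) <> 0;
end;
```

For instance, `BitIsSet($20, 5)` is True, because $20 has only bit 5 set.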
You might end up with a wrapper function that looks like this:
type
TCPUIDRegister = (regEAX, regEBX, regECX, regEDX);
function GetCPUIDRegister(CPUID: TCPUID; Reg: TCPUIDRegister): Cardinal;
begin
case Reg of
regEAX:
Result := CPUID.EAX;
regEBX:
Result := CPUID.EBX;
regECX:
Result := CPUID.ECX;
regEDX:
Result := CPUID.EDX;
end;
end;
function CPUFeatureEnabled(Leaf, Subleaf: Cardinal; Reg: TCPUIDRegister; Bit: Integer): Boolean;
var
value: Cardinal;
begin
value := GetCPUIDRegister(GetCPUID(Leaf, Subleaf), Reg);
Result := value and (1 shl Bit) <> 0;
end;
While David's answer is excellent, the reason the function fails is that the ECX register is not set to zero (required for fetching extended info in the CPUID call).
See: How to detect New Instruction support in the 4th generation Intel® Core™ processor family
where AVX2 is found by (emphasis mine)
CPUID.(EAX=07H, ECX=0H):EBX.AVX2[bit 5]==1
The following correctly returns the extended information and identifies AVX2 support.
if AnsiUpperCase(Register_Name)='EBX' then
begin
asm
push ecx { push ecx to stack}
mov ecx, 0 { set ecx to zero}
mov eax,EAX_Leaf
db $0F,$A2
mov _Result,ebx
pop ecx { restore ecx}
end;
end;
The other asm blocks leave ECX uninitialized in the same way. Leaf 1 does not take a sub-leaf, so those calls happen to work, but ECX should be set explicitly for any leaf that does.
Taken from Synopse Informatique:
type
/// the potential features, retrieved from an Intel CPU
// - see https://en.wikipedia.org/wiki/CPUID#EAX.3D1:_Processor_Info_and_Feature_Bits
TALIntelCpuFeature =
( { in EDX }
cfFPU, cfVME, cfDE, cfPSE, cfTSC, cfMSR, cfPAE, cfMCE,
cfCX8, cfAPIC, cf_d10, cfSEP, cfMTRR, cfPGE, cfMCA, cfCMOV,
cfPAT, cfPSE36, cfPSN, cfCLFSH, cf_d20, cfDS, cfACPI, cfMMX,
cfFXSR, cfSSE, cfSSE2, cfSS, cfHTT, cfTM, cfIA64, cfPBE,
{ in ECX }
cfSSE3, cfCLMUL, cfDS64, cfMON, cfDSCPL, cfVMX, cfSMX, cfEST,
cfTM2, cfSSSE3, cfCID, cfSDBG, cfFMA, cfCX16, cfXTPR, cfPDCM,
cf_c16, cfPCID, cfDCA, cfSSE41, cfSSE42, cfX2A, cfMOVBE, cfPOPCNT,
cfTSC2, cfAESNI, cfXS, cfOSXS, cfAVX, cfF16C, cfRAND, cfHYP,
{ extended features in EBX, ECX }
cfFSGS, cf_b01, cfSGX, cfBMI1, cfHLE, cfAVX2, cf_b06, cfSMEP, cfBMI2,
cfERMS, cfINVPCID, cfRTM, cfPQM, cf_b13, cfMPX, cfPQE, cfAVX512F,
cfAVX512DQ, cfRDSEED, cfADX, cfSMAP, cfAVX512IFMA, cfPCOMMIT,
cfCLFLUSH, cfCLWB, cfIPT, cfAVX512PF, cfAVX512ER, cfAVX512CD,
cfSHA, cfAVX512BW, cfAVX512VL, cfPREFW1, cfAVX512VBMI);
/// all features, as retrieved from an Intel CPU
TALIntelCpuFeatures = set of TALIntelCpuFeature;
var
/// the available CPU features, as recognized at program startup
ALCpuFeatures: TALIntelCpuFeatures;
{**}
type
_TRegisters = record
eax,ebx,ecx,edx: cardinal;
end;
{***************************************************************}
procedure _GetCPUID(Param: Cardinal; var Registers: _TRegisters);
{$IF defined(CPU64BITS)}
asm // ecx=param, rdx=Registers (Linux: edi,rsi)
.NOFRAME
mov eax, ecx
mov r9, rdx
mov r10, rbx // preserve rbx
xor ebx, ebx
xor ecx, ecx
xor edx, edx
cpuid
mov _TRegisters(r9).&eax, eax
mov _TRegisters(r9).&ebx, ebx
mov _TRegisters(r9).&ecx, ecx
mov _TRegisters(r9).&edx, edx
mov rbx, r10
end;
{$else}
asm
push esi
push edi
mov esi, edx
mov edi, eax
pushfd
pop eax
mov edx, eax
xor eax, $200000
push eax
popfd
pushfd
pop eax
xor eax, edx
jz @nocpuid
push ebx
mov eax, edi
xor ecx, ecx
cpuid
mov _TRegisters(esi).&eax, eax
mov _TRegisters(esi).&ebx, ebx
mov _TRegisters(esi).&ecx, ecx
mov _TRegisters(esi).&edx, edx
pop ebx
@nocpuid:
pop edi
pop esi
end;
{$ifend}
{******************************}
procedure _TestIntelCpuFeatures;
var regs: _TRegisters;
begin
regs.edx := 0;
regs.ecx := 0;
_GetCPUID(1,regs);
PIntegerArray(@ALCpuFeatures)^[0] := regs.edx;
PIntegerArray(@ALCpuFeatures)^[1] := regs.ecx;
_GetCPUID(7,regs);
PIntegerArray(@ALCpuFeatures)^[2] := regs.ebx;
PByteArray(@ALCpuFeatures)^[12] := regs.ecx;
end;
initialization
_TestIntelCpuFeatures;
I need to process a file that comes from the old Mac era (old Motorola CPU). The bytes are big endian, so I have a function that swaps an Int64 to Intel little endian. The function is written in ASM and works on a 32 bit CPU but not on 64. For 64 bit I have a different function that is not ASM. I want to combine the functions using IFDEF. Can I do this? Will it be a problem?
interface
function SwapInt64(Value: Int64): Int64; assembler;
implementation
{$IFDEF CPUx86}
function SwapInt64(Value: Int64): Int64; assembler; { Does not work on 64 bit }
asm
MOV EDX,[DWORD PTR EBP + 12]
MOV EAX,[DWORD PTR EBP + 8]
BSWAP EAX
XCHG EAX,EDX
BSWAP EAX
end;
{$else}
function SwapInt64 (Value: Int64): Int64;
var P: PInteger;
begin
Result := (Value shl 32) or (Value shr 32);
P := @Result;
P^ := (Swap(P^) shl 16) or (Swap(P^ shr 16));
Inc(P);
P^ := (Swap(P^) shl 16) or (Swap(P^ shr 16));
end;
{$ENDIF}
I think the compiler will correctly compile/call the appropriate function no matter if one is ASM and the other is Pascal.
What you are proposing is perfectly fine. It is a quite reasonable approach.
If you want a 64 bit swap in asm, for x64, that's quite simple:
function SwapInt64(Value: Int64): Int64;
asm
MOV RAX,RCX
BSWAP RAX
end;
Combine this with the 32 bit version using conditional, as you have done in the question.
function SwapInt64(Value: Int64): Int64;
{$IF Defined(CPUX86)}
asm
MOV EDX,[DWORD PTR EBP + 12]
MOV EAX,[DWORD PTR EBP + 8]
BSWAP EAX
XCHG EAX,EDX
BSWAP EAX
end;
{$ELSEIF Defined(CPUX64)}
asm
MOV RAX,RCX
BSWAP RAX
end;
{$ELSE}
{$Message Fatal 'Unsupported architecture'}
{$IFEND}
Or include a Pascal implementation in the {$ELSE} block.
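As a rough sketch of what such a Pascal fallback could look like: the name `SwapInt64Pascal` is made up here so that it does not clash with the asm versions, and this is just one possible approach, not the code from the question.

```pascal
// Hypothetical pure-Pascal fallback (illustrative name).
// Reverses the byte order of a 64-bit integer without any asm.
function SwapInt64Pascal(Value: Int64): Int64;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to 7 do
  begin
    // Append the lowest remaining byte of Value to the low end of Result,
    // pushing the bytes taken earlier towards the high end.
    Result := (Result shl 8) or (Value and $FF);
    Value := Value shr 8;
  end;
end;
```

It is slower than BSWAP, but it compiles on any target, which is all the {$ELSE} branch needs.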
The approach of swapping the bytes in a separate routine that cannot be inlined is a bit silly if performance is what you're after.
A better way is to assume you've got a block of data in which all dwords/qwords need to have their endianness changed.
This would look something like this.
For dwords
function SwapDWords(var Data; size: cardinal): boolean;
{$IFDEF CPUX64}
asm
//Data in RCX, Size in EDX
mov EDX,EDX //zero-extend Size into RDX
xor EAX,EAX //failure
test EDX,3
jz @MultipleOf4
@error:
ret
@MultipleOf4:
test EDX,EDX
jz @done
add RCX,RDX //RCX now points just past the data
neg RDX //negative offset, counting up to zero
@loop:
mov R8d, [RCX+RDX]
bswap R8d
mov [RCX+RDX],R8d
add RDX,4 //advance to the next dword
jnz @loop
@done:
inc EAX //success
ret
end;
{$ELSE}
{$Message Fatal 'SwapDWords is implemented for x64 only'}
{$ENDIF}
For qwords
function SwapQWords(var Data; size: cardinal): boolean;
{$IFDEF CPUX64}
asm
//Data in RCX, Size in EDX
mov EDX,EDX //zero-extend Size into RDX
xor EAX,EAX //failure
test EDX,7
jz @MultipleOf8
@error:
ret
@MultipleOf8:
test EDX,EDX
jz @done
add RCX,RDX //RCX now points just past the data
neg RDX //negative offset, counting up to zero
@loop:
mov R8, [RCX+RDX]
bswap R8
mov [RCX+RDX],R8
add RDX,8 //advance to the next qword
jnz @loop
@done:
inc EAX //success
ret
end;
{$ELSE}
{$Message Fatal 'SwapQWords is implemented for x64 only'}
{$ENDIF}
If you're already on 64 bit, then you have SSE2, and can use the 128-bit SSE registers.
Now you can process 4 dwords at a time, effectively unrolling the loop 4 times.
See: http://www.asmcommunity.net/forums/topic/?id=29743
movdqu xmm5,[RCX+RDX] //load 16 bytes (MOVNTPD can only store, not load)
movdqu xmm0, xmm5
movdqu xmm1, xmm5
pxor xmm5, xmm5
punpckhbw xmm0, xmm5 //interleave '0' with bytes of original
punpcklbw xmm1, xmm5 //so they become words
pshuflw xmm0, xmm0, 27 //swap the words by shuffling
pshufhw xmm0, xmm0, 27 //27 = B00_01_10_11
pshuflw xmm1, xmm1, 27
pshufhw xmm1, xmm1, 27
packuswb xmm1, xmm0 //make the words back into bytes.
movntpd [RCX+RDX], xmm1 //non-temporal store to keep the cache clean; needs a 16-byte aligned address
Simply use either LEToN() or BEtoN()
Use the LE variant if the source data is little endian (e.g. 32 or 64-bit x86 Mac, modern ARM); use the BE variant if the source data (e.g. a file from disk) is in big endian format.
Depending on the target architecture, either a swap or nothing at all will be inlined, and it is usually fairly optimal for single conversions. For block-oriented solutions see the posted SSE code (or Agner Fog's).
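Note that `BEtoN` and `LEtoN` are Free Pascal RTL functions (System unit). As far as I know Delphi does not provide them, so the sketch below assumes FPC. `ReadBigEndianInt64` is a hypothetical wrapper name used only for illustration.

```pascal
// Free Pascal only: BEtoN is declared in the System unit.
// ReadBigEndianInt64 is a hypothetical wrapper name.
function ReadBigEndianInt64(const Raw: Int64): Int64;
begin
  // Converts a big-endian value to the byte order of the target CPU.
  // On a big-endian target this returns Raw unchanged.
  Result := BEtoN(Raw);
end;
```

Raw would typically be the eight bytes read straight from the old Mac file.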
People say that Delphi produces quite nicely optimized code for integer operations. I tried the following example in Delphi 2007 and looked at the assembly code produced by the compiler.
program p1000;
{$APPTYPE CONSOLE}
procedure test;
var
arr: array of integer;
i: integer;
begin
SetLength(arr, 100);
for i := 0 to High(arr) do
begin
if (i = High(arr)) then
begin
arr[i] := -9;
end;
end;
end;
begin
test;
readln;
end.
When the build configuration is set to DEBUG, I can set a breakpoint and use the shortcut Ctrl+Alt+D to see the assembly code, like this:
Project3.dpr.11: for i := 0 to High(arr) do
004045A1 8B45FC mov eax,[ebp-$04]
004045A4 E8F7FAFFFF call @DynArrayHigh
004045A9 8BF0 mov esi,eax
004045AB 85F6 test esi,esi
004045AD 7C1D jl $004045cc
004045AF 46 inc esi
004045B0 33DB xor ebx,ebx
Project3.dpr.13: if (i = High(arr)) then
004045B2 8B45FC mov eax,[ebp-$04]
004045B5 E8E6FAFFFF call @DynArrayHigh
004045BA 3BD8 cmp ebx,eax
004045BC 750A jnz $004045c8
Project3.dpr.15: arr[i] := -9;
004045BE 8B45FC mov eax,[ebp-$04]
004045C1 C70498F7FFFFFF mov [eax+ebx*4],$fffffff7
Project3.dpr.17: end;
004045C8 43 inc ebx
Project3.dpr.11: for i := 0 to High(arr) do
004045C9 4E dec esi
004045CA 75E6 jnz $004045b2
As far as I can understand, it calls High() function again and again in the loop:
Project3.dpr.13: if (i = High(arr)) then
004045B2 8B45FC mov eax,[ebp-$04]
004045B5 E8E6FAFFFF call @DynArrayHigh
004045BA 3BD8 cmp ebx,eax
When building configuration is set to RELEASE, the breakpoint isn't available, so I press F8/F7 to step into the loop.
00404589 6A64 push $64
0040458B 8D45FC lea eax,[ebp-$04]
0040458E B901000000 mov ecx,$00000001
00404593 8B1554454000 mov edx,[$00404554]
00404599 E8B6FCFFFF call @DynArraySetLength
0040459E 83C404 add esp,$04
004045A1 8B45FC mov eax,[ebp-$04]
004045A4 E8F7FAFFFF call @DynArrayHigh
004045A9 8BF0 mov esi,eax
004045AB 85F6 test esi,esi
004045AD 7C1D jl $004045cc
004045AF 46 inc esi
004045B0 33DB xor ebx,ebx
004045B2 8B45FC mov eax,[ebp-$04]
004045B5 E8E6FAFFFF call @DynArrayHigh
004045BA 3BD8 cmp ebx,eax
004045BC 750A jnz $004045c8
004045BE 8B45FC mov eax,[ebp-$04]
004045C1 C70498F7FFFFFF mov [eax+ebx*4],$fffffff7
004045C8 43 inc ebx
004045C9 4E dec esi
004045CA 75E6 jnz $004045b2
004045CC 33C0 xor eax,eax
004045BC 750A jnz $004045c8
Again, the same call @DynArrayHigh is produced...
So my question is: why can't the compiler optimize this and simply save the High() value in a register or local variable, given that the array size does not change?
This is not an answer but rather a (self-destructing) comment :)
In my view, the compiler must not attempt to optimize this.
Why should the compiler attempt to optimize a (non-deterministic) function such as High, as opposed to any other (such as Length)?
The dynamic array length might be changed inside the loop, either by SetLength or by other means. The array might be re-initialized at run time, and your code might depend on that:
for i := 0 to High(arr) do
begin
if (i = High(arr)) then
arr[i] := -9
else
if foo() then
arr := nil; // or SetLength(arr, 0);
if High(arr) = -1 then Exit; // arr is nil
end;
How do you suggest this should be optimized? Should the compiler even attempt to optimize this?
I don't see anything special about the High function, even if the compiler translates it to @DynArrayHigh.
If you want your code to be optimized, optimize it yourself, e.g.:
var
arrHigh: Integer;
arrHigh := High(arr);
for i := 0 to arrHigh do
if i = arrHigh then...
I am trying to get DUnit2 working under 64 bits, but I am stumped as to what this method does, let alone how to convert it to 64 bits. Pure Pascal would be better, but since it refers to the stack (ebp), it might not be possible.
function CallerAddr: Pointer; assembler;
const
CallerIP = $4;
asm
mov eax, ebp
call IsBadPointer
test eax,eax
jne @@Error
mov eax, [ebp].CallerIP
sub eax, 5 // 5 bytes for call
push eax
call IsBadPointer
test eax,eax
pop eax
je @@Finish
@@Error:
xor eax, eax
@@Finish:
end;
function RtlCaptureStackBackTrace(FramesToSkip: ULONG; FramesToCapture: ULONG;
out BackTrace: Pointer; BackTraceHash: PULONG): USHORT; stdcall;
external 'kernel32.dll' name 'RtlCaptureStackBackTrace' delayed;
function CallerAddr: Pointer;
begin
// Skip 2 Frames, one for the return of CallerAddr and one for the
// return of RtlCaptureStackBackTrace
if RtlCaptureStackBackTrace(2, 1, Result, nil) > 0 then
begin
if not IsBadPointer(Result) then
Result := Pointer(NativeInt(Result) - 5)
else
Result := nil;
end
else
begin
Result := nil;
end;
end;
function CallerAddr: Pointer; assembler;
const
CallerIP = $4;
asm
mov rax, rcx // For int.. XMM0 for float
call IsBadPointer
test rax,rax
jne @@Error
mov rax, [rcx].CallerIP
sub rax, 5 // 5 bytes for call
push rax
call IsBadPointer
test rax,rax
pop rax
je @@Finish
@@Error:
xor rax, rax
@@Finish:
end;
I have written an asm function in Delphi 7, but the compiler transforms my code into something else:
function f(x: Cardinal): Cardinal; register;
label err;
asm
not eax
mov edx,eax
shr edx, 1
and eax, edx
bsf ecx, eax
jz err
mov eax, 1
shl eax, cl
mov edx, eax
add edx, edx
or eax, edx
ret
err:
xor eax, eax
end;
// compiled version
f:
push ebx // !!!
not eax
mov edx,eax
shr edx, 1
and eax, edx
bsf ecx, eax
jz +$0e
mov eax, 1
shl eax, cl
mov edx, eax
add edx, edx
or eax, edx
ret
err:
xor eax, eax
mov eax, ebx // !!!
pop ebx // !!!
ret
// the almost equivalent without asm
function f(x: Cardinal): Cardinal;
var
c: Cardinal;
begin
x := not x;
x := x and x shr 1;
if x <> 0 then
begin
c := bsf(x); // bitscanforward
x := 1 shl c;
Result := x or (x shl 1)
end
else
Result := 0;
end;
Why does it generate push ebx and pop ebx? And why does it do mov eax, ebx?
It seems that it generates the partial stack frame because of the mov eax, ebx.
This simple test generates mov eax, edx but doesn't generate that stack frame:
function asmtest(x: Cardinal): Cardinal; register;
label err;
asm
not eax
and eax, 1
jz err
ret
err:
xor eax, eax
end;
// compiled
asmtest:
not eax
and eax, $01
jz +$01
ret
xor eax, eax
mov eax, edx // !!!
ret
It seems that it has something to do with the label err. If I remove that I don't get the mov eax, * part.
Why does this happen?
Made a bug report on Quality Central.
The practical advice is: do not use the label keyword in asm code, use @@-prefixed labels:
function f(x: Cardinal): Cardinal; register;
asm
not eax
mov edx,eax
shr edx, 1
and eax, edx
bsf ecx, eax
jz @@err
mov eax, 1
shl eax, cl
mov edx, eax
add edx, edx
or eax, edx
ret
@@err:
xor eax, eax
end;
Updated:
I have not found the bug report in the BASM area. It looks like a bug, but I have used BASM for many years and never thought about using the label keyword in such a way. In fact I have never used the label keyword in Delphi at all. :)
Well... back then, the Delphi manual used to say something about compiler optimization and similar craziness:
The compiler generates stack frames only for nested routines, for routines having local variables and for routines with stack parameters.
The auto-generated initialization and finalization code for routines includes:
PUSH EBP ; If Locals <> 0 or Params <> 0
MOV EBP,ESP ; If Locals <> 0 or Params <> 0
SUB ESP,Locals ; If Locals <> 0
...
MOV ESP,EBP ; If Locals <> 0
POP EBP ; If Locals <> 0 or Params <> 0
RET Params ; Always
If local variables contain variants, long strings or interfaces, they are initialized to zero but aren't finalized afterwards.
Locals is the size of the local variables, Params the size of the parameters. If both Locals and Params are zero, no initialization code is generated and the finalization code contains only a RET instruction.
Maybe that has got something to do with it all...