FPU instructions that check precision

Using the fldcw instruction, it's possible to set the precision of the x87 FPU to 24, 53, or 64 bits. However, after some testing, I'm starting to think that very few x87 operations actually honor that setting.
I haven't tested every operation, but on this test machine it looks like only fdiv and fsqrt stop computing at the selected precision, while all other operations (fadd, fsub, fmul, ...) always compute at full extended precision.
If that is the case, I would expect it to be because those two instructions (fdiv and fsqrt) are significantly slower than most other x87 instructions, so stopping early pays off when lower precision is sufficient. What I'm really wondering is whether this has always been the case or whether it's a quirk of the very recent processor in my test machine.
Edit: here's Delphi code to show it:
program Project1;
uses
windows,dialogs,sysutils;
{$R *.res}
const
test_mul:single=1234567890.0987654321;
var
i:longint;
s:single absolute i;
s1,s2,s3:single;
procedure test_24;
asm
mov word([esp-2]),$103f // 24-bit precision, round to nearest (RC bits 10-11 = 00)
fldcw word([esp-2])
fld [s]
fmul [test_mul]
fstp [s1]
end;
procedure test_53;
asm
mov word([esp-2]),$123f // 53-bit precision, round to nearest (RC bits 10-11 = 00)
fldcw word([esp-2])
fld [s]
fmul [test_mul]
fstp [s2]
end;
procedure test_64;
asm
mov word([esp-2]),$133f // 64-bit precision, round to nearest (RC bits 10-11 = 00)
fldcw word([esp-2])
fld [s]
fmul [test_mul]
fstp [s3]
end;
begin
i:=0;
repeat
test_24;
test_53;
test_64;
if (s1<>s2) or (s2<>s3) then begin
showmessage('Error at step:'+inttostr(i));
break;
end;
inc(i);
until i=0;
showmessage('No difference found between precisions');
end.
Edit 2: false alarm. I was storing the results as single instead of extended, so the test couldn't catch the difference. Here's a fixed test, thanks to Hans Passant for catching my mistake:
program Project1;
uses
windows,dialogs,sysutils;
{$R *.res}
const
test_mul:single=1234567890.0987654321;
var
i:longint;
errors:cardinal;
s:single absolute i;
s1,s2,s3:extended;
procedure test_24;
asm
mov word([esp-2]),$103f // 24-bit precision, round to nearest (RC bits 10-11 = 00)
fldcw word([esp-2])
fld [s]
fmul [test_mul]
fstp [s1]
end;
procedure test_53;
asm
mov word([esp-2]),$123f // 53-bit precision, round to nearest (RC bits 10-11 = 00)
fldcw word([esp-2])
fld [s]
fmul [test_mul]
fstp [s2]
end;
procedure test_64;
asm
mov word([esp-2]),$133f // 64-bit precision, round to nearest (RC bits 10-11 = 00)
fldcw word([esp-2])
fld [s]
fmul [test_mul]
fstp [s3]
end;
begin
errors:=0;
i:=0;
repeat
test_24;
test_53;
test_64;
if (s1<>s2) or (s2<>s3) then begin
inc(errors);
end;
inc(i);
until i=0;
showmessage('Number of differences between precisions: '+inttostr(errors));
end.
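For reference, the precision and rounding fields of the control words used in the tests above can be decoded with a few shifts and masks. This is only a minimal sketch based on the documented x87 control word layout (bits 8-9 are precision control, bits 10-11 are rounding control); the helper names PrecisionBits, RoundingName and Show are made up for illustration.

```pascal
program DecodeControlWord;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Precision control lives in bits 8-9 of the x87 control word.
function PrecisionBits(cw: Word): Integer;
begin
  case (cw shr 8) and 3 of
    0: Result := 24;
    2: Result := 53;
    3: Result := 64;
  else
    Result := -1; // encoding 1 is reserved
  end;
end;

// Rounding control lives in bits 10-11 of the x87 control word.
function RoundingName(cw: Word): string;
const
  Names: array[0..3] of string = ('nearest', 'down', 'up', 'truncate');
begin
  Result := Names[(cw shr 10) and 3];
end;

procedure Show(cw: Word);
begin
  Writeln('$', IntToHex(cw, 4), ': ', PrecisionBits(cw),
    ' bit, round ', RoundingName(cw));
end;

begin
  Show($103F);
  Show($123F);
  Show($133F);
end.
```

For all three values the rounding field decodes as 0, which is round to nearest; only the precision field differs.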

Related

converting ASM instruction RDRand to Win64

I have this function (RDRand, written by David Heffernan) that seems to work OK in 32 bit but fails in 64 bit:
function TryRdRand(out Value: Cardinal): Boolean;
{$IF defined(CPU64BITS)}
asm .noframe
{$else}
asm
{$ifend}
db $0f
db $c7
db $f1
jc @success
xor eax,eax
ret
@success:
mov [eax],ecx
mov eax,1
end;
The documentation for the instruction is here: https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-software-implementation-guide
In particular, it says:
Essentially, developers invoke this instruction with a single operand:
the destination register where the random value will be stored. Note
that this register must be a general purpose register, and the size of
the register (16, 32, or 64 bits) will determine the size of the
random value returned.
After invoking the RDRAND instruction, the caller must examine the
carry flag (CF) to determine whether a random value was available at
the time the RDRAND instruction was executed. As Table 3 shows, a
value of 1 indicates that a random value was available and placed in
the destination register provided in the invocation. A value of 0
indicates that a random value was not available. In current
architectures the destination register will also be zeroed as a side
effect of this condition.
My knowledge of ASM is quite limited. What did I miss?
Also, I do not quite understand these instructions:
...
xor eax,eax
ret
...
What do they do exactly?

If you want a function that performs exactly the same then I think that looks like this:
function TryRdRand(out Value: Cardinal): Boolean;
asm
{$if defined(WIN64)}
.noframe
// rdrand eax
db $0f
db $c7
db $f0
jnc @fail
mov [rcx],eax
{$elseif defined(WIN32)}
// rdrand ecx
db $0f
db $c7
db $f1
jnc @fail
mov [eax],ecx
{$else}
{$Message Fatal 'TryRdRand not implemented for this platform'}
{$endif}
mov eax,1
ret
@fail:
xor eax,eax
end;
The suggestion made by Peter Cordes of implementing a retry loop in the asm looks sensible to me. I will not attempt to implement that here, since I think it is somewhat outside the scope of your question.
Also, Peter points out that in x64 you can read a 64 bit random value with the REX.W=1 prefix. That would look like this:
function TryRdRand(out Value: NativeUInt): Boolean;
asm
{$if defined(WIN64)}
.noframe
// rdrand rax
db $48 // REX.W = 1
db $0f
db $c7
db $f0
jnc @fail
mov [rcx],rax
{$elseif defined(WIN32)}
// rdrand ecx
db $0f
db $c7
db $f1
jnc @fail
mov [eax],ecx
{$else}
{$Message Fatal 'TryRdRand not implemented for this platform'}
{$endif}
mov eax,1
ret
@fail:
xor eax,eax
end;
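The retry loop mentioned above can also be written at the Pascal level rather than in the asm itself. The following is only a sketch of that idea: FakeTryRdRand is a hypothetical stand-in with the same signature as TryRdRand (so the example runs on any CPU), and both the name RdRandWithRetry and the limit of 10 attempts are assumptions, not part of the original answer.

```pascal
program RdRandRetryDemo;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

type
  // Same shape as the 32-bit TryRdRand shown above.
  TTryRandom = function(out Value: Cardinal): Boolean;

var
  FakeCalls: Integer = 0;

// Hypothetical stand-in for TryRdRand: fails twice, then succeeds.
function FakeTryRdRand(out Value: Cardinal): Boolean;
begin
  Inc(FakeCalls);
  Result := FakeCalls > 2;
  if Result then
    Value := $DEADBEEF
  else
    Value := 0;
end;

// Calls TryFunc up to MaxTries times and stops at the first success.
function RdRandWithRetry(TryFunc: TTryRandom; MaxTries: Integer;
  out Value: Cardinal): Boolean;
var
  i: Integer;
begin
  Value := 0;
  for i := 1 to MaxTries do
    if TryFunc(Value) then
    begin
      Result := True;
      Exit;
    end;
  Result := False;
end;

var
  R: Cardinal;
begin
  if RdRandWithRetry(FakeTryRdRand, 10, R) then
    Writeln('got value after ', FakeCalls, ' tries')
  else
    Writeln('no random value available');
end.
```

In real code the first argument would be TryRdRand itself. Keeping the loop bounded matters because RDRAND can legitimately report that no value is available.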

How to return var parameter in Delphi ASM

I'm far from having knowledge about ASM, so forgive me if this is a stupid question.
I have this function:
function SwapDWord(const AValue: DWORD): DWORD;
asm
BSWAP EAX
end;
How would I convert it to a procedure in ASM:
procedure SwapDWordVar(var AValue: DWORD);
asm
// ???
end;
I do not want to use AValue := SwapDWord(AValue); which I could. I want to do this in ASM.
I tried many silly things while looking at system.pas, trying to understand which register(s) to use, but nothing worked. It always returns the original AValue.

Possible variant:
procedure SwapDWordVar(var AValue: DWORD);
asm
mov edx, [eax]
bswap edx
mov [eax], edx
end;
You might find Guido Gybels' articles useful.

Just try
procedure SwapDWordVar(var AValue: DWORD);
asm
mov edx, dword ptr [AValue]
bswap edx
mov dword ptr [AValue], edx
end;
Note that this version will compile and work for both Win32 and Win64.
But note that it won't be faster than AValue := SwapDWord(AValue), since most of the time will be spent calling the function, not accessing the memory.
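For comparison, the same in-place swap can be written without any asm. This is a sketch, not a replacement for the BSWAP versions above; the name SwapDWordVarPas is made up, and Cardinal is used in place of DWORD so the program needs no Windows unit.

```pascal
program SwapDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Reverses the byte order of a 32-bit value in place using shifts and masks.
procedure SwapDWordVarPas(var AValue: Cardinal);
begin
  AValue := (AValue shr 24)
    or ((AValue shr 8) and $0000FF00)
    or ((AValue shl 8) and $00FF0000)
    or ((AValue and $000000FF) shl 24);
end;

var
  V: Cardinal;
begin
  V := $12345678;
  SwapDWordVarPas(V);
  Writeln(IntToHex(V, 8)); // prints 78563412
end.
```

A version like this can be marked inline, which removes the call overhead discussed above.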

What is the Delphi equivalent to the C __builtin_clz()?

Quoted from https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html,
— Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.
What is the Delphi equivalent to the C __builtin_clz() ? If there isn't, how to implement it efficiently in Delphi?
Actually, I want to use it to calculate the base-2 logarithm of an integer.

If you only care about 32 bit code then it goes like this:
function __builtin_clz(x: Cardinal): Cardinal;
asm
BSR EAX,EAX
NEG EAX
ADD EAX,31
end;
Or if you want to support 64 bit code as well then it would be:
function __builtin_clz(x: Cardinal): Cardinal;
{$IF Defined(CPUX64)}
asm
BSR ECX,ECX
NEG ECX
ADD ECX,31
MOV EAX,ECX
{$ENDIF}
{$IF Defined(CPUX86)}
asm
BSR EAX,EAX
NEG EAX
ADD EAX,31
{$ENDIF}
end;
It's likely that an asm guru could trim this down a little, but BSR (bit scan reverse) is the key instruction.
For the mobile compilers, I don't know how to do this efficiently.
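Where BSR is not available, a plain Pascal fallback is possible. The sketch below uses a binary search over the bit positions; the names ClzPortable and FloorLog2 are made up for illustration, and returning 32 for an input of 0 is a choice made here (the GCC builtin leaves that case undefined).

```pascal
program ClzDemo;
{$APPTYPE CONSOLE}

// Counts leading zero bits of a 32-bit value without any asm.
function ClzPortable(x: Cardinal): Integer;
begin
  if x = 0 then
  begin
    Result := 32; // assumption: defined result for the undefined case
    Exit;
  end;
  Result := 0;
  // Each test halves the remaining search range; the shift never loses
  // set bits because the tested upper bits are known to be zero.
  if (x and $FFFF0000) = 0 then begin Inc(Result, 16); x := x shl 16; end;
  if (x and $FF000000) = 0 then begin Inc(Result, 8); x := x shl 8; end;
  if (x and $F0000000) = 0 then begin Inc(Result, 4); x := x shl 4; end;
  if (x and $C0000000) = 0 then begin Inc(Result, 2); x := x shl 2; end;
  if (x and $80000000) = 0 then Inc(Result, 1);
end;

// Base-2 logarithm rounded down, valid for x > 0.
function FloorLog2(x: Cardinal): Integer;
begin
  Result := 31 - ClzPortable(x);
end;

begin
  Writeln(ClzPortable(1));    // prints 31
  Writeln(ClzPortable($10000)); // prints 15
  Writeln(FloorLog2(1000));   // prints 9
end.
```

The last line shows the use case from the question: the base-2 logarithm is 31 minus the leading zero count.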

Why is 64 bit Delphi app calculating different results than 32 bit build?

We recently started the process of creating 64 bit builds of our applications. During comparison testing we found that the 64 bit build is calculating differently. I have a code sample that demonstrates the difference between the two builds.
var
currPercent, currGross, currCalcValue : Currency;
begin
currGross := 1182.42;
currPercent := 1.45;
currCalcValue := (currGross * (currPercent * StrToCurr('.01')));
ShowMessage(CurrToStr(currCalcValue));
end;
If you step through this in the 32 bit version, currCalcValue is calculated as 17.1451, while the 64 bit version comes back with 17.145.
Why isn't the 64 bit build calculating the extra decimal place? All variables are defined as 4 decimal currency values.

Here's my SSCCE based on your code. Note the use of a console application. Makes life much simpler.
{$APPTYPE CONSOLE}
uses
SysUtils;
var
currPercent, currGross, currCalcValue : Currency;
begin
currGross := 1182.42;
currPercent := 1.45;
currCalcValue := (currGross * (currPercent * StrToCurr('.01')));
Writeln(CurrToStr(currCalcValue));
Readln;
end.
Now look at the code that is generated. First 32 bit:
Project3.dpr.13: currCalcValue := (currGross * (currPercent * StrToCurr('.01')));
0041C409 8D45EC lea eax,[ebp-$14]
0041C40C BADCC44100 mov edx,$0041c4dc
0041C411 E8A6A2FEFF call @UStrLAsg
0041C416 8B1504E74100 mov edx,[$0041e704]
0041C41C 8B45EC mov eax,[ebp-$14]
0041C41F E870AFFFFF call StrToCurr
0041C424 DF7DE0 fistp qword ptr [ebp-$20]
0041C427 9B wait
0041C428 DF2DD83E4200 fild qword ptr [$00423ed8]
0041C42E DF6DE0 fild qword ptr [ebp-$20]
0041C431 DEC9 fmulp st(1)
0041C433 DF2DE03E4200 fild qword ptr [$00423ee0]
0041C439 DEC9 fmulp st(1)
0041C43B D835E4C44100 fdiv dword ptr [$0041c4e4]
0041C441 DF3DE83E4200 fistp qword ptr [$00423ee8]
0041C447 9B wait
And the 64 bit:
Project3.dpr.13: currCalcValue := (currGross * (currPercent * StrToCurr('.01')));
0000000000428A0E 488D4D38 lea rcx,[rbp+$38]
0000000000428A12 488D1513010000 lea rdx,[rel $00000113]
0000000000428A19 E84213FEFF call @UStrLAsg
0000000000428A1E 488B4D38 mov rcx,[rbp+$38]
0000000000428A22 488B155F480000 mov rdx,[rel $0000485f]
0000000000428A29 E83280FFFF call StrToCurr
0000000000428A2E 4889C1 mov rcx,rax
0000000000428A31 488B0510E80000 mov rax,[rel $0000e810]
0000000000428A38 48F7E9 imul rcx
0000000000428A3B C7C110270000 mov ecx,$00002710
0000000000428A41 48F7F9 idiv rcx
0000000000428A44 488BC8 mov rcx,rax
0000000000428A47 488B0502E80000 mov rax,[rel $0000e802]
0000000000428A4E 48F7E9 imul rcx
0000000000428A51 C7C110270000 mov ecx,$00002710
0000000000428A57 48F7F9 idiv rcx
0000000000428A5A 488905F7E70000 mov [rel $0000e7f7],rax
Note that the 32 bit code performs the arithmetic on the FPU, but the 64 bit code performs it using integer arithmetic. That's the key difference.
In the 32 bit code, the following calculation is performed:
1. Convert '0.01' to currency, which is 100, allowing for the fixed point shift of 10,000.
2. Load 14,500 into the FPU.
3. Multiply by 100 giving 1,450,000.
4. Multiply by 11,824,200 giving 17,145,090,000,000.
5. Divide by 10,000^2 giving 171,450.9.
6. Round to the nearest integer giving 171,451.
7. Store that in your currency variable. Hence the result is 17.1451.
Now, in the 64 bit code, it's a little different, because we use 64 bit integers all the way. It looks like this:
1. Convert '0.01' to currency, which is 100.
2. Multiply by 14,500 which is 1,450,000.
3. Divide by 10,000 which is 145.
4. Multiply by 11,824,200 giving 1,714,509,000.
5. Divide by 10,000 which is 171,450. Uh-oh, loss of precision here.
6. Store that in your currency variable. Hence the result is 17.145.
So the issue is that the 64 bit compiler divides by 10,000 at each intermediate step, presumably to avoid overflow, which is much more likely in a 64 bit integer than in a floating point register.
Were it to do the calculation like this:
100 * 14,500 * 11,824,200 / 10,000 / 10,000
it would get the right answer, provided the final quotient of 171,450.9 is rounded to the nearest integer rather than truncated.
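The arithmetic above can be checked with plain Int64 values standing in for the scaled Currency representation. This is a sketch of the numbers only, not of what the compiler emits; the variable names are made up. Note that integer division truncates, so deferring the division is not enough by itself: the final quotient also has to be rounded to the nearest integer to reproduce the FPU result.

```pascal
program CurrencyOrderDemo;
{$APPTYPE CONSOLE}

const
  Scale = 10000; // Currency stores values multiplied by 10,000

var
  Factor, Pct, Gross: Int64;
  Stepwise, Product, Truncated, Rounded: Int64;
begin
  Factor := 100;      // 0.01 scaled
  Pct := 14500;       // 1.45 scaled
  Gross := 11824200;  // 1182.42 scaled

  // Divide after every multiplication, as in the 64 bit listing.
  Stepwise := ((Factor * Pct) div Scale) * Gross div Scale;

  // Multiply everything first, then divide once by 10,000^2.
  Product := Factor * Pct * Gross;
  Truncated := Product div (Int64(Scale) * Scale);
  Rounded := (Product + (Int64(Scale) * Scale) div 2)
    div (Int64(Scale) * Scale);

  Writeln(Stepwise);  // prints 171450
  Writeln(Truncated); // prints 171450
  Writeln(Rounded);   // prints 171451
end.
```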

This was fixed in XE5 Update 2 and, at the time of writing, in XE6 Update 1.

How can I bit-reflect a byte in Delphi?

Is there an easy way to bit-reflect a byte variable in Delphi so that the most significant bit (MSB) becomes the least significant bit (LSB) and vice versa?

In code you can do it like this:
function ReverseBits(b: Byte): Byte;
var
i: Integer;
begin
Result := 0;
for i := 1 to 8 do
begin
Result := (Result shl 1) or (b and 1);
b := b shr 1;
end;
end;
But a lookup table would be much more efficient, and only consume 256 bytes of memory.
function ReverseBits(b: Byte): Byte; inline;
const
Table: array [Byte] of Byte = (
0,128,64,192,32,160,96,224,16,144,80,208,48,176,112,240,
8,136,72,200,40,168,104,232,24,152,88,216,56,184,120,248,
4,132,68,196,36,164,100,228,20,148,84,212,52,180,116,244,
12,140,76,204,44,172,108,236,28,156,92,220,60,188,124,252,
2,130,66,194,34,162,98,226,18,146,82,210,50,178,114,242,
10,138,74,202,42,170,106,234,26,154,90,218,58,186,122,250,
6,134,70,198,38,166,102,230,22,150,86,214,54,182,118,246,
14,142,78,206,46,174,110,238,30,158,94,222,62,190,126,254,
1,129,65,193,33,161,97,225,17,145,81,209,49,177,113,241,
9,137,73,201,41,169,105,233,25,153,89,217,57,185,121,249,
5,133,69,197,37,165,101,229,21,149,85,213,53,181,117,245,
13,141,77,205,45,173,109,237,29,157,93,221,61,189,125,253,
3,131,67,195,35,163,99,227,19,147,83,211,51,179,115,243,
11,139,75,203,43,171,107,235,27,155,91,219,59,187,123,251,
7,135,71,199,39,167,103,231,23,151,87,215,55,183,119,247,
15,143,79,207,47,175,111,239,31,159,95,223,63,191,127,255
);
begin
Result := Table[b];
end;
This is more than 10 times faster than the version of the code that operates on individual bits.
Finally, I don't normally like to comment too negatively on accepted answers when I have a competing answer. In this case there are very serious problems with the answer that you accepted that I would like to state clearly for you and also for any future readers.
You accepted @Arioch's answer at the time when it contained the same Pascal code as can be seen in this answer, together with two assembler versions. It turns out that those assembler versions are much slower than the Pascal version. They are twice as slow as the Pascal code.
It is a common fallacy that converting high level code to assembler results in faster code. If you do it badly then you can easily produce code that runs more slowly than the code emitted by the compiler. There are times when it is worth writing code in assembler but you must not ever do so without proper benchmarking.
What is particularly egregious about the use of assembler here is that it is so obvious that the table based solution will be exceedingly fast. It's hard to imagine how that could be significantly improved upon.

function BitFlip(B: Byte): Byte;
const
N: array[0..15] of Byte = (0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15);
begin
Result := N[B div 16] or N[B mod 16] shl 4;
end;

function ByteReverseLoop(b: byte): byte;
var i: integer;
begin
Result := 0; // actually not needed, just to make compiler happy
for i := 1 to 8 do
begin
Result := Result shl 1;
if Odd(b) then Result := Result or 1;
b := b shr 1;
end;
end;
If speed is important, you can use a lookup table. You fill it once at program start and then just read values from it. Since you only need to map a byte to a byte, the table takes 256 x 1 = 256 bytes of memory. And given that recent Delphi versions support inline functions, this gives you speed, readability, and reliability (by encapsulating the array lookup in a function, you can be sure you will not change the values through some typo).
Var ByteReverseLUT: array[byte] of byte;
function ByteReverse(b: byte): byte; inline;
begin Result := ByteReverseLUT[b] end;
{Unit/program initialization}
var b: byte;
for b := Low(ByteReverseLUT) to High(ByteReverseLUT)
do ByteReverseLUT[b] := ByteReverseLoop(b);
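Put together as a complete console program, the fragment above might look like the following sketch. It reuses the names ByteReverseLUT, ByteReverse and ByteReverseLoop from this answer; the FillLUT procedure and the self-check at the end are additions for illustration.

```pascal
program LutDemo;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
uses
  SysUtils;

var
  ByteReverseLUT: array[Byte] of Byte;

// Slow bit-by-bit reversal, used only to fill the table.
function ByteReverseLoop(b: Byte): Byte;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to 8 do
  begin
    Result := (Result shl 1) or (b and 1);
    b := b shr 1;
  end;
end;

// Fast path: a single table lookup.
function ByteReverse(b: Byte): Byte; inline;
begin
  Result := ByteReverseLUT[b];
end;

// Fills the table once; call before the first ByteReverse.
procedure FillLUT;
var
  i: Integer;
begin
  for i := 0 to 255 do
    ByteReverseLUT[i] := ByteReverseLoop(i);
end;

var
  i, Bad: Integer;
begin
  FillLUT;
  Writeln(IntToHex(ByteReverse($01), 2)); // prints 80
  Writeln(IntToHex(ByteReverse($AA), 2)); // prints 55
  // Reversing twice must give back the original value.
  Bad := 0;
  for i := 0 to 255 do
    if ByteReverse(ByteReverse(i)) <> i then
      Inc(Bad);
  Writeln('round-trip failures: ', Bad);
end.
```

Using an Integer loop counter for the fill avoids any doubt about a Byte counter running from 0 to 255.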
Speed comparison of several implementations that were mentioned on this forum.
AMD Phenom2 x710 / Win7 x64 / Delphi XE2 32-bit {$O+}
Pascal AND original: 12494
Pascal AND reversed: 33459
Pascal IF original: 46829
Pascal IF reversed: 45585
Asm SHIFT 1: 15802
Asm SHIFT 2: 15490
Asm SHIFT 3: 16212
Asm AND 1: 19408
Asm AND 2: 19601
Asm AND 3: 19802
Pascal AND unrolled: 10052
Asm Shift unrolled: 4573
LUT, called: 3192
Pascal math, called: 4614
http://pastebin.ca/2304708
Note: the LUT (lookup table) timings are probably rather optimistic here. Because the test runs in a tight loop, the whole table stays in the L1 CPU cache. In real computations this function would most probably be called much less frequently, and the L1 cache would not keep the entire table.
The results for the inlined Pascal function calls are bogus: Delphi did not call them at all, having detected that they had no side effects. Funnily enough, the timings still differed.
Asm Shift unrolled: 4040
LUT, called: 3011
LUT, inlined: 977
Pascal unrolled: 10052
Pas. unrolled, inlined: 849
Pascal math, called: 4614
Pascal math, inlined: 6517
And below the explanation:
Project1.dpr.427: d := BitFlipLUT(i)
0044AC45 8BC3 mov eax,ebx
0044AC47 E89CCAFFFF call BitFlipLUT
Project1.dpr.435: d := BitFlipLUTi(i)
Project1.dpr.444: d := MirrorByte(i);
0044ACF8 8BC3 mov eax,ebx
0044ACFA E881C8FFFF call MirrorByte
Project1.dpr.453: d := MirrorByteI(i);
0044AD55 8BC3 mov eax,ebx
Project1.dpr.460: d := MirrorByte7Op(i);
0044ADA3 8BC3 mov eax,ebx
0044ADA5 E8AEC7FFFF call MirrorByte7Op
Project1.dpr.462: d := MirrorByte7OpI(i);
0044ADF1 0FB6C3 movzx eax,bl
All calls to inlined functions were eliminated.
Yet Delphi made three different decisions about passing the parameters:
For the 1st call, it eliminated parameter passing together with the function call.
For the 2nd call, it kept parameter passing even though the function was not called.
For the 3rd call, it kept a modified form of parameter passing, which proved longer than the function call itself! Weird! :-)

Using brute force can be simple and effective.
This routine is NOT on par with David's LUT solution.
Update
Added array of byte as input and result assigned to array of byte as well.
This shows better performance for the LUT solution.
function MirrorByte(b : Byte) : Byte; inline;
begin
Result :=
((b and $01) shl 7) or
((b and $02) shl 5) or
((b and $04) shl 3) or
((b and $08) shl 1) or
((b and $10) shr 1) or
((b and $20) shr 3) or
((b and $40) shr 5) or
((b and $80) shr 7);
end;
Update 2
Googling a little, found BitReverseObvious.
function MirrorByte7Op(b : Byte) : Byte; inline;
begin
Result :=
{$IFDEF WIN64} // This is slightly better in x64 than the code in x32
(((b * UInt64($80200802)) and UInt64($0884422110)) * UInt64($0101010101)) shr 32;
{$ENDIF}
{$IFDEF WIN32}
((b * $0802 and $22110) or (b * $8020 and $88440)) * $10101 shr 16;
{$ENDIF}
end;
This one is closer to the LUT solution, even faster in one test.
To sum up, MirrorByte7Op() is 5-30% slower than LUT in 3 of the tests, 5% faster in one test.
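Before benchmarking a multiply-and-mask trick like this, it is worth checking it against a straightforward reference for all 256 inputs. The sketch below does that for the 32 bit formula. It is a variant, not the code above verbatim: it uses Int64 intermediates and an explicit mask so that it does not depend on 32 bit wraparound or on range checking being off, and the names Mirror7 and ReverseLoop are made up.

```pascal
program Mirror7Check;
{$APPTYPE CONSOLE}

// Reference implementation: reverse one bit at a time.
function ReverseLoop(b: Byte): Byte;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to 8 do
  begin
    Result := (Result shl 1) or (b and 1);
    b := b shr 1;
  end;
end;

// Multiply-and-mask reversal; the result is taken from
// bits 16..23 of the final product.
function Mirror7(b: Byte): Byte;
var
  t: Int64;
begin
  t := ((Int64(b) * $0802) and $22110) or ((Int64(b) * $8020) and $88440);
  Result := Byte(((t * $10101) shr 16) and $FF);
end;

var
  i, Mismatches: Integer;
begin
  Mismatches := 0;
  for i := 0 to 255 do
    if Mirror7(i) <> ReverseLoop(i) then
      Inc(Mismatches);
  Writeln('mismatches: ', Mismatches);
end.
```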
Code to benchmark:
uses
System.Diagnostics;
const
cBit : Byte = $AA;
cLoopMax = 1000000000;
var
sw : TStopWatch;
arrB : array of byte;
b : Byte; // receives the result in the "array in" tests
i : Integer;
begin
SetLength(arrB,cLoopMax);
for i := 0 TO Length(arrB) - 1 do
arrB[i]:= System.Random(256);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
b := b;
end;
sw.Stop;
WriteLn('Loop ',b:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
b := ReflectBits(arrB[i]);
end;
sw.Stop;
WriteLn('RB array in: ',b:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
b := MirrorByte(arrB[i]);
end;
sw.Stop;
WriteLn('MB array in: ',b:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
b := MirrorByte7Op(arrB[i]);
end;
sw.Stop;
WriteLn('MB7Op array in : ',b:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
arrB[i] := ReflectBits(arrB[i]);
end;
sw.Stop;
WriteLn('RB array in/out: ',arrB[0]:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
arrB[i]:= MirrorByte(arrB[i]);
end;
sw.Stop;
WriteLn('MB array in/out: ',arrB[0]:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
arrB[i]:= MirrorByte7Op(arrB[i]);
end;
sw.Stop;
WriteLn('MB7Op array in/out: ',arrB[0]:3,' ',sw.ElapsedMilliSeconds);
ReadLn;
end.
Result of benchmark (XE3, i7 CPU 870):
                                       32 bit      64 bit
----------------------------------------------------------
Byte assignment (= empty loop)         599 ms     2117 ms
MirrorByte to byte, array in          6991 ms     8746 ms
MirrorByte7Op to byte, array in       1384 ms     2510 ms
ReverseBits to byte, array in          945 ms     2119 ms
----------------------------------------------------------
ReverseBits array in/out              1944 ms     3721 ms
MirrorByte7Op array in/out            1790 ms     3856 ms
BitFlipNibble array in/out            1995 ms     6730 ms
MirrorByte array in/out               7157 ms     8894 ms
ByteReverse array in/out             38246 ms    42303 ms
I added some of the other proposals in the last part of the table (all inlined). It is probably fairest to test in a loop with an array as input and an array as result. ReverseBits (LUT) and MirrorByte7Op are comparable in speed, followed by BitFlipNibble (LUT), which underperforms a bit in x64.
Note: I added a new algorithm for the x64 part of MirrorByte7Op. It makes better use of the 64 bit registers and has fewer instructions.
