We recently started the process of creating 64 bit builds of our applications. During comparison testing we found that the 64 bit build is calculating differently. I have a code sample that demonstrates the difference between the two builds.
var
currPercent, currGross, currCalcValue : Currency;
begin
currGross := 1182.42;
currPercent := 1.45;
currCalcValue := (currGross * (currPercent * StrToCurr('.01')));
ShowMessage(CurrToStr(currCalcValue));
end;
If you step through this in the 32 bit version, currCalcValue is calculated as 17.1451, while the 64 bit version comes back with 17.145.
Why isn't the 64 bit build calculating out the extra decimal place? All variables are defined as 4 decimal currency values.
Here's my SSCCE based on your code. Note the use of a console application. Makes life much simpler.
{$APPTYPE CONSOLE}
uses
SysUtils;
var
currPercent, currGross, currCalcValue : Currency;
begin
currGross := 1182.42;
currPercent := 1.45;
currCalcValue := (currGross * (currPercent * StrToCurr('.01')));
Writeln(CurrToStr(currCalcValue));
Readln;
end.
Now look at the code that is generated. First 32 bit:
Project3.dpr.13: currCalcValue := (currGross * (currPercent * StrToCurr('.01')));
0041C409 8D45EC lea eax,[ebp-$14]
0041C40C BADCC44100 mov edx,$0041c4dc
0041C411 E8A6A2FEFF call @UStrLAsg
0041C416 8B1504E74100 mov edx,[$0041e704]
0041C41C 8B45EC mov eax,[ebp-$14]
0041C41F E870AFFFFF call StrToCurr
0041C424 DF7DE0 fistp qword ptr [ebp-$20]
0041C427 9B wait
0041C428 DF2DD83E4200 fild qword ptr [$00423ed8]
0041C42E DF6DE0 fild qword ptr [ebp-$20]
0041C431 DEC9 fmulp st(1)
0041C433 DF2DE03E4200 fild qword ptr [$00423ee0]
0041C439 DEC9 fmulp st(1)
0041C43B D835E4C44100 fdiv dword ptr [$0041c4e4]
0041C441 DF3DE83E4200 fistp qword ptr [$00423ee8]
0041C447 9B wait
And the 64 bit:
Project3.dpr.13: currCalcValue := (currGross * (currPercent * StrToCurr('.01')));
0000000000428A0E 488D4D38 lea rcx,[rbp+$38]
0000000000428A12 488D1513010000 lea rdx,[rel $00000113]
0000000000428A19 E84213FEFF call @UStrLAsg
0000000000428A1E 488B4D38 mov rcx,[rbp+$38]
0000000000428A22 488B155F480000 mov rdx,[rel $0000485f]
0000000000428A29 E83280FFFF call StrToCurr
0000000000428A2E 4889C1 mov rcx,rax
0000000000428A31 488B0510E80000 mov rax,[rel $0000e810]
0000000000428A38 48F7E9 imul rcx
0000000000428A3B C7C110270000 mov ecx,$00002710
0000000000428A41 48F7F9 idiv rcx
0000000000428A44 488BC8 mov rcx,rax
0000000000428A47 488B0502E80000 mov rax,[rel $0000e802]
0000000000428A4E 48F7E9 imul rcx
0000000000428A51 C7C110270000 mov ecx,$00002710
0000000000428A57 48F7F9 idiv rcx
0000000000428A5A 488905F7E70000 mov [rel $0000e7f7],rax
Note that the 32 bit code performs the arithmetic on the FPU, but the 64 bit code performs it using integer arithmetic. That's the key difference.
In the 32 bit code, the following calculation is performed:
Convert '0.01' to currency, which is 100, allowing for the fixed point shift of 10,000.
Load 14,500 into the FPU.
Multiply by 100 giving 1,450,000.
Multiply by 11,824,200 giving 17,145,090,000,000.
Divide by 10,000^2 giving 171,450.9.
Round to the nearest integer giving 171,451.
Store that in your currency variable. Hence the result is 17.1451.
Now, in the 64 bit code, it's a little different because we use 64 bit integers all the way. It looks like this:
Convert '0.01' to currency, which is 100.
Multiply by 14,500 which is 1,450,000.
Divide by 10,000 which is 145.
Multiply by 11,824,200 giving 1,714,509,000.
Divide by 10,000, which is 171,450.9, but idiv truncates this to 171,450. Uh-oh, loss of precision here.
Store that in your currency variable. Hence the result is 17.145.
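So the issue is that the 64 bit compiler divides by 10,000 at each intermediate step using truncating integer division, presumably to avoid overflow, which is much more likely in a 64 bit integer than in a floating point register. In this particular example the intermediate division happens to be exact (1,450,000 / 10,000 = 145), so the digit is lost in the final idiv, which truncates where the FPU's fistp rounds to nearest under the default rounding mode.
Were it to do the calculation like this:
100 * 14,500 * 11,824,200 / 10,000 / 10,000
and round the final quotient the way the FPU does, it would get the right answer.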
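The arithmetic in the two step lists can be reproduced with plain Int64 values. This is only a sketch of the scaled-integer model, not the compiler's actual code: `MulTrunc` and `MulRound` are hypothetical helper names, and `MulRound` rounds ties upward for non-negative inputs, which is a simplification of what the FPU does.

```pascal
program CurrencyScaling;
{$APPTYPE CONSOLE}

const
  Scale = 10000; // Currency is a 64 bit integer scaled by 10,000

// Hypothetical helper: multiply two scaled values and truncate,
// which is what an imul followed by idiv does.
function MulTrunc(A, B: Int64): Int64;
begin
  Result := (A * B) div Scale;
end;

// Hypothetical helper: multiply two scaled values and round to nearest.
// Assumes non-negative inputs; ties round up.
function MulRound(A, B: Int64): Int64;
begin
  Result := (A * B + Scale div 2) div Scale;
end;

var
  Gross, Percent, Factor: Int64;
begin
  Gross := 11824200; // 1182.42
  Percent := 14500;  // 1.45
  Factor := 100;     // 0.01
  Writeln(MulTrunc(Gross, MulTrunc(Percent, Factor))); // 171450, i.e. 17.145
  Writeln(MulRound(Gross, MulRound(Percent, Factor))); // 171451, i.e. 17.1451
end.
```

Both lines use the same intermediate product of 145; only the handling of the final quotient 171,450.9 differs.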
This has been fixed as of XE5u2 and, at the time of writing, XE6u1.
Is there an easy way to bit-reflect a byte variable in Delphi so that the most significant bit (MSB) becomes the least significant bit (LSB) and vice versa?
In code you can do it like this:
function ReverseBits(b: Byte): Byte;
var
i: Integer;
begin
Result := 0;
for i := 1 to 8 do
begin
Result := (Result shl 1) or (b and 1);
b := b shr 1;
end;
end;
But a lookup table would be much more efficient, and only consume 256 bytes of memory.
function ReverseBits(b: Byte): Byte; inline;
const
Table: array [Byte] of Byte = (
0,128,64,192,32,160,96,224,16,144,80,208,48,176,112,240,
8,136,72,200,40,168,104,232,24,152,88,216,56,184,120,248,
4,132,68,196,36,164,100,228,20,148,84,212,52,180,116,244,
12,140,76,204,44,172,108,236,28,156,92,220,60,188,124,252,
2,130,66,194,34,162,98,226,18,146,82,210,50,178,114,242,
10,138,74,202,42,170,106,234,26,154,90,218,58,186,122,250,
6,134,70,198,38,166,102,230,22,150,86,214,54,182,118,246,
14,142,78,206,46,174,110,238,30,158,94,222,62,190,126,254,
1,129,65,193,33,161,97,225,17,145,81,209,49,177,113,241,
9,137,73,201,41,169,105,233,25,153,89,217,57,185,121,249,
5,133,69,197,37,165,101,229,21,149,85,213,53,181,117,245,
13,141,77,205,45,173,109,237,29,157,93,221,61,189,125,253,
3,131,67,195,35,163,99,227,19,147,83,211,51,179,115,243,
11,139,75,203,43,171,107,235,27,155,91,219,59,187,123,251,
7,135,71,199,39,167,103,231,23,151,87,215,55,183,119,247,
15,143,79,207,47,175,111,239,31,159,95,223,63,191,127,255
);
begin
Result := Table[b];
end;
This is more than 10 times faster than the version of the code that operates on individual bits.
Finally, I don't normally like to comment too negatively on accepted answers when I have a competing answer. In this case there are very serious problems with the answer that you accepted that I would like to state clearly for you and also for any future readers.
You accepted @Arioch's answer at the time when it contained the same Pascal code as can be seen in this answer, together with two assembler versions. It turns out that those assembler versions are much slower than the Pascal version: they are twice as slow as the Pascal code.
It is a common fallacy that converting high level code to assembler results in faster code. If you do it badly then you can easily produce code that runs more slowly than the code emitted by the compiler. There are times when it is worth writing code in assembler but you must not ever do so without proper benchmarking.
What is particularly egregious about the use of assembler here is that it is so obvious that the table based solution will be exceedingly fast. It's hard to imagine how that could be significantly improved upon.
function BitFlip(B: Byte): Byte;
const
N: array[0..15] of Byte = (0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15);
begin
Result := N[B div 16] or N[B mod 16] shl 4;
end;
function ByteReverseLoop(b: byte): byte;
var i: integer;
begin
Result := 0; // actually not needed, just to make compiler happy
for i := 1 to 8 do
begin
Result := Result shl 1;
if Odd(b) then Result := Result or 1;
b := b shr 1;
end;
end;
If speed is important, then you can use a lookup table. You fill it once at program start and then just take values from the table. Since you only need to map a byte to a byte, that takes 256 x 1 = 256 bytes of memory. And given that recent Delphi versions support inline functions, that provides speed, readability and reliability (by encapsulating the array lookup in a function you can be sure you will not change the values due to some typo).
var
  ByteReverseLUT: array[Byte] of Byte;
function ByteReverse(b: Byte): Byte; inline;
begin
  Result := ByteReverseLUT[b];
end;
{ Unit/program initialization: fill the table once }
var
  b: Byte;
begin
  for b := Low(ByteReverseLUT) to High(ByteReverseLUT) do
    ByteReverseLUT[b] := ByteReverseLoop(b);
end.
Speed comparison of several implementations that were mentioned on this forum.
AMD Phenom2 x710 / Win7 x64 / Delphi XE2 32-bit {$O+}
Pascal AND original: 12494
Pascal AND reversed: 33459
Pascal IF original: 46829
Pascal IF reversed: 45585
Asm SHIFT 1: 15802
Asm SHIFT 2: 15490
Asm SHIFT 3: 16212
Asm AND 1: 19408
Asm AND 2: 19601
Asm AND 3: 19802
Pascal AND unrolled: 10052
Asm Shift unrolled: 4573
LUT, called: 3192
Pascal math, called: 4614
http://pastebin.ca/2304708
Note: LUT (lookup table) timings are probably rather optimistic here. Because the test runs in a tight loop, the whole table was pulled into the L1 CPU cache. In real computations this function would most probably be called much less frequently, and the L1 cache would not keep the entire table.
The results for the inlined Pascal function calls are bogus: Delphi did not call the functions at all, having detected that they had no side effects. Funnily enough, the timings still differed.
Asm Shift unrolled: 4040
LUT, called: 3011
LUT, inlined: 977
Pascal unrolled: 10052
Pas. unrolled, inlined: 849
Pascal math, called: 4614
Pascal math, inlined: 6517
And below the explanation:
Project1.dpr.427: d := BitFlipLUT(i)
0044AC45 8BC3 mov eax,ebx
0044AC47 E89CCAFFFF call BitFlipLUT
Project1.dpr.435: d := BitFlipLUTi(i)
Project1.dpr.444: d := MirrorByte(i);
0044ACF8 8BC3 mov eax,ebx
0044ACFA E881C8FFFF call MirrorByte
Project1.dpr.453: d := MirrorByteI(i);
0044AD55 8BC3 mov eax,ebx
Project1.dpr.460: d := MirrorByte7Op(i);
0044ADA3 8BC3 mov eax,ebx
0044ADA5 E8AEC7FFFF call MirrorByte7Op
Project1.dpr.462: d := MirrorByte7OpI(i);
0044ADF1 0FB6C3 movzx eax,bl
All calls to inlined functions were eliminated.
Yet Delphi made three different decisions about passing the parameters:
For the 1st call it eliminated the parameter passing together with the function call.
For the 2nd call it kept the parameter passing, even though the function was not called.
For the 3rd call it kept a modified form of parameter passing, which took longer than the function call itself! Weird! :-)
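One common way to stop a benchmark from being optimized away like this is to make every result contribute to something that is printed. The sketch below assumes that approach; `ChecksumOfReversed` is a hypothetical helper, the loop count is arbitrary, and whether a given Delphi version still simplifies the loop is something to confirm in the disassembly.

```pascal
program KeepResultLive;
{$APPTYPE CONSOLE}
uses
  System.Diagnostics;

// Same mask-and-shift routine as MirrorByte in this thread.
function MirrorByte(b: Byte): Byte; inline;
begin
  Result :=
    ((b and $01) shl 7) or ((b and $02) shl 5) or
    ((b and $04) shl 3) or ((b and $08) shl 1) or
    ((b and $10) shr 1) or ((b and $20) shr 3) or
    ((b and $40) shr 5) or ((b and $80) shr 7);
end;

// Hypothetical helper: sums the reversed bytes so that every call
// feeds into the value that is returned.
function ChecksumOfReversed(Count: Integer): Int64;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to Count - 1 do
    Inc(Result, MirrorByte(Byte(i)));
end;

var
  sw: TStopwatch;
  Sum: Int64;
begin
  sw := TStopwatch.StartNew;
  Sum := ChecksumOfReversed(100000000);
  sw.Stop;
  // Printing the checksum makes the computed results observable.
  Writeln('checksum ', Sum, '  ', sw.ElapsedMilliseconds, ' ms');
end.
```

The checksum also doubles as a sanity check: over any full range of 256 consecutive inputs the reversed bytes are a permutation of 0..255, so they sum to 32640.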
Using brute force can be simple and effective.
This routine is NOT on par with David's LUT solution.
Update
Added array of byte as input and result assigned to array of byte as well.
This shows better performance for the LUT solution.
function MirrorByte(b : Byte) : Byte; inline;
begin
Result :=
((b and $01) shl 7) or
((b and $02) shl 5) or
((b and $04) shl 3) or
((b and $08) shl 1) or
((b and $10) shr 1) or
((b and $20) shr 3) or
((b and $40) shr 5) or
((b and $80) shr 7);
end;
Update 2
After a little googling, I found BitReverseObvious.
function MirrorByte7Op(b : Byte) : Byte; inline;
begin
Result :=
{$IFDEF WIN64} // This is slightly better in x64 than the code in x32
(((b * UInt64($80200802)) and UInt64($0884422110)) * UInt64($0101010101)) shr 32;
{$ENDIF}
{$IFDEF WIN32}
((b * $0802 and $22110) or (b * $8020 and $88440)) * $10101 shr 16;
{$ENDIF}
end;
This one is closer to the LUT solution, even faster in one test.
To sum up, MirrorByte7Op() is 5-30% slower than LUT in 3 of the tests, 5% faster in one test.
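Since the multiply-and-mask formula is hard to verify by eye, an exhaustive check over all 256 inputs is cheap. The sketch below is an adaptation, not the routine that was benchmarked: `Reverse7Op` evaluates the 32 bit formula in Int64 and masks the result so that it does not depend on overflow or range checking settings, and `ReverseLoop` and `CountMismatches` are hypothetical names.

```pascal
program Check7Op;
{$APPTYPE CONSOLE}

// Reference: reverse one bit at a time, as in the loop version.
function ReverseLoop(b: Byte): Byte;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to 8 do
  begin
    Result := Byte((Result shl 1) or (b and 1));
    b := b shr 1;
  end;
end;

// The 32 bit multiply-and-mask formula, evaluated in Int64 so the
// final multiplication cannot overflow; "and $FF" keeps bits 16..23.
function Reverse7Op(b: Byte): Byte;
var
  x: Int64;
begin
  x := b;
  Result := Byte((((x * $0802) and $22110) or ((x * $8020) and $88440))
    * $10101 shr 16 and $FF);
end;

// Hypothetical helper: counts inputs where the two routines disagree.
function CountMismatches: Integer;
var
  v: Integer;
begin
  Result := 0;
  for v := 0 to 255 do
    if Reverse7Op(Byte(v)) <> ReverseLoop(Byte(v)) then
      Inc(Result);
end;

begin
  Writeln('mismatches: ', CountMismatches);
end.
```

The same harness could be pointed at any of the other routines in this thread by swapping the function that is compared against `ReverseLoop`.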
Code to benchmark:
uses
System.Diagnostics;
const
cBit : Byte = $AA;
cLoopMax = 1000000000;
var
sw : TStopWatch;
arrB : array of byte;
i : Integer;
b : Byte; // receives each result; it was used below but never declared
begin
SetLength(arrB,cLoopMax);
for i := 0 TO Length(arrB) - 1 do
arrB[i]:= System.Random(256);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
b := b;
end;
sw.Stop;
WriteLn('Loop ',b:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
b := ReverseBits(arrB[i]);
end;
sw.Stop;
WriteLn('RB array in: ',b:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
b := MirrorByte(arrB[i]);
end;
sw.Stop;
WriteLn('MB array in: ',b:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
b := MirrorByte7Op(arrB[i]);
end;
sw.Stop;
WriteLn('MB7Op array in : ',b:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
arrB[i] := ReverseBits(arrB[i]);
end;
sw.Stop;
WriteLn('RB array in/out: ',arrB[0]:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
arrB[i]:= MirrorByte(arrB[i]);
end;
sw.Stop;
WriteLn('MB array in/out: ',arrB[0]:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
arrB[i]:= MirrorByte7Op(arrB[i]);
end;
sw.Stop;
WriteLn('MB7Op array in/out: ',arrB[0]:3,' ',sw.ElapsedMilliSeconds);
ReadLn;
end.
Result of benchmark (XE3, i7 CPU 870):
                                      32 bit      64 bit
--------------------------------------------------------
Byte assignment (= empty loop)        599 ms     2117 ms
MirrorByte to byte, array in         6991 ms     8746 ms
MirrorByte7Op to byte, array in      1384 ms     2510 ms
ReverseBits to byte, array in         945 ms     2119 ms
--------------------------------------------------------
ReverseBits array in/out             1944 ms     3721 ms
MirrorByte7Op array in/out           1790 ms     3856 ms
BitFlipNibble array in/out           1995 ms     6730 ms
MirrorByte array in/out              7157 ms     8894 ms
ByteReverse array in/out            38246 ms    42303 ms
I added some of the other proposals in the last part of the table (all inlined). It is probably fairest to test in a loop with an array as input and an array as result. ReverseBits (LUT) and MirrorByte7Op are comparable in speed, followed by BitFlipNibble (LUT), which underperforms a bit in x64.
Note: I added a new algorithm for the x64 part of MirrorByte7Op. It makes better use of the 64 bit registers and has fewer instructions.