Different optimizations in Math.Sum in Win32/64 - delphi

I have the following code
const
NumIterations = 10000000;
var
i, j : Integer;
x : array[1..100] of Double;
Start : Cardinal;
S : Double;
begin
for i := Low(x) to High(x) do x[i] := i;
Start := GetTickCount;
for i := 1 to NumIterations do S := System.Math.Sum(x);
ShowMessage('Math.Sum: ' + IntToStr(GetTickCount - Start));
Start := GetTickCount;
for i := 1 to NumIterations do begin
S := 0;
for j := Low(x) to High(x) do S := S + x[j];
end;
ShowMessage('Simple Sum: ' + IntToStr(GetTickCount - Start));
end;
When compiled for Win32 Math.Sum is considerably faster than the simple loop, as Math.Sum is written in Assembler and uses four-fold loop unrolling.
But when compiled for Win64, Math.Sum is considerably slower than the simple loop, because in 64-bit Math.Sum uses Kahan summation. This is an optimization for accuracy minimizing pile-up of errors during the summation process, but is considerably slower than even the simple loop.
I.e. when compiling for Win32 I get code optimized for speed, when compiling the same code for Win64 I get code optimized for accuracy. This is not exactly what I naively would expect.
Is there any sensible reason for this difference between Win32/64? Double is always 8 byte, so the accuracy should be identical in Win32/64.
Is Math.Sum still implemented identically (Assembler and loop unrolling in Win32, Kahan summation in Win64) in current versions of Delphi? I use Delphi-XE5.

Is Math.Sum still implemented identically (Assembler and loop unrolling in Win32, Kahan summation in Win64) in current versions of Delphi? I use Delphi-XE5.
Yes (Delphi 10.3.2).
Is there any sensible reason for this difference between Win32/64? Double is always 8 byte, so the accuracy should be identical in Win32/64.
32-bit Delphi for Win32 uses the old FPU, while the 64-bit compiler uses SSE instructions. When the 64-bit compiler was introduced in XE2, many of the old assembly routines was not ported to 64-bit. Instead, some routines were ported with similar functionality as other modern compilers.
You can enhance the 64-bit implementation a bit by introducing a Kahan summation function:
program TestKahanSum;
{$APPTYPE CONSOLE}
uses
System.SysUtils,Math,Diagnostics;
function KahanSum(const input : TArray<Double>): Double;
var
sum,c,y,t : Double;
i : Integer;
begin
sum := 0.0;
c := 0.0;
for i := Low(input) to High(input) do begin
y := input[i] - c;
t := sum + y;
c := (t - sum) - y;
sum := t;
end;
Result := sum;
end;
var
dArr : TArray<Double>;
res : Double;
i : Integer;
sw : TStopWatch;
begin
SetLength(dArr,100000000);
for i := 0 to High(dArr) do dArr[i] := Pi;
sw := TStopWatch.StartNew;
res := Math.Sum(dArr);
WriteLn('Math.Sum:',res,' [ms]:',sw.ElapsedMilliseconds);
sw := TStopWatch.StartNew;
res := KahanSum(dArr);
WriteLn('KahanSum:',res,' [ms]:',sw.ElapsedMilliseconds);
sw := TStopWatch.StartNew;
res := 0;
for i := 0 to High(dArr) do res := res + dArr[i];
WriteLn('NaiveSum:',res,' [ms]:',sw.ElapsedMilliseconds);
ReadLn;
end.
64-bit:
Math.Sum: 3.14159265358979E+0008 [ms]:492
KahanSum: 3.14159265358979E+0008 [ms]:359
NaiveSum: 3.14159265624272E+0008 [ms]:246
32-bit:
Math.Sum: 3.14159265358957E+0008 [ms]:67
KahanSum: 3.14159265358979E+0008 [ms]:958
NaiveSum: 3.14159265624272E+0008 [ms]:277
Pi with 15 digits is 3.14159265358979
The 32-bit math assembly routine is accurate to 13 digits in this example, while the 64-bit math routine is accurate to 15 digits.
Conclusion:
The 64 bit implementation is slower (by a factor of two compared to a naive summation), but more accurate than the 32-bit math routine.
Introducing an enhanced Kahan summation routine improves performance by 35%.

Having the very same RTL function not behave the same when switching a compilation target is an awful bug. It should not change the behavior. Even worse, Win64/pascal Sum() over single or double does not behave the same! sum(single) is naive summing, whereas sum(double) uses Kahan... :(
You would better either use plain + operator, or create your own Kahan sum function.
I can confirm that the bug is still there in Delphi 10.3.

Related

Delphi fastest FileSize for sizes > 10gb

Wanted to check with you experts if there are any drawbacks in this funtion. Will it work properly on the various Windows OS ? I am using Delphi Seattle (32 and 64 bit exe's). I am using this instead of Findfirst for its speed.
function GetFileDetailsFromAttr(pFileName:WideString):int64;
var
wfad: TWin32FileAttributeData;
wSize:LARGE_INTEGER ;
begin
Result:=0 ;
if not GetFileAttributesEx(pwidechar(pFileName), GetFileExInfoStandard,#wfad) then
exit;
wSize.HighPart:=wfad.nFileSizeHigh ;
wSize.LowPart:=wfad.nFileSizeLow ;
result:=wsize.QuadPart ;
end;
The typical googled samples shown with this command does not work for filesize > 9GB
function GetFileAttributesEx():Int64 using
begin
...
result:=((&wfad.nFileSizeHigh) or (&wfad.nFileSizeLow))
Code with variant record is correct.
But this code
result:=((&wfad.nFileSizeHigh) or (&wfad.nFileSizeLow))
is just wrong, result cannot overcome 32-bit border
Code from link in comment
result := Int64(info.nFileSizeLow) or Int64(info.nFileSizeHigh shl 32);
is wrong because it does not account how compiler works with 32 and 64-bit values. Look at the next example showing how to treat this situation properly (for value d, e):
var
a, b: DWord;
c, d, e: Int64;
wSize:LARGE_INTEGER ;
begin
a := 1;
b := 1;
c := Int64(a) or Int64(b shl 32);
d := Int64(a) or Int64(b) shl 32;
wSize.LowPart := a;
wSize.HighPart := b;
e := wsize.QuadPart;
Caption := Format('$%x $%x $%x', [c, d, e]);
Note that in the expression for c 32-bit value is shifted by 32 bits left and looses set bit, then zero transforms to 64-bit.
Unbound to how you get the filesize: it would even be faster if you'd use a type (manual) that exists for ~25 years already to assign the filesize directly to the function's result instead of using an intermediate variable:
Int64Rec(result).Hi:= wfad.nFileSizeHigh;
Int64Rec(result).Lo:= wfad.nFileSizeLow;
end;
In case this isn't obvious to anyone here's what the compilation looks like:
Above: the intermediate variable w: LARGE_INTEGER first gets assigned the two 32bit parts and then is assigned itself to the function's result. Cost: 10 instructions.
Above: the record Int64Rec is used to cast the function's result and assign both 32bit parts directly, without the need of any other variable. Cost: 6 instructions.
Environment used: Delphi 7.0 (Build 8.1), compiler version 15.0, Win32 executable, code optimization: on.

Delphi Tokyo 64-bit flushes denormal numbers to zero?

During a short look at the source code of system.math, I discovered that
the 64-bit version Delphi Tokyo 10.2.3 flushes denormal IEEE-Doubles to zero, as can be seen from then following program;
{$apptype console}
uses
system.sysutils, system.math;
var
x: double;
const
twopm1030 : UInt64 = $0000100000000000; {2^(-1030)}
begin
x := PDouble(#twopm1030)^;
writeln(x);
x := ldexp(1,-515);
writeln(x*x);
x := ldexp(1,-1030);
writeln(x);
end.
For 32-bit the output is as expected
8.69169475979376E-0311
8.69169475979376E-0311
8.69169475979376E-0311
but with 64-bit I get
8.69169475979375E-0311
0.00000000000000E+0000
0.00000000000000E+0000
So basically Tokyo can handle denormal numbers in 64-bit mode, the constant is written correctly, but from arithmetic operations or even with ldexp a denormal result is flushed to zero.
Can this observation be confirmed on other systems? If yes, where it is documented? (The only info I could find about zero-flushing is,
that Denormals become zero when stored in a Real48).
Update: I know that for both 32- and 64-bit the single overload is used. For 32-bit the x87 FPU is used and the ASM code is virtually identical for all precisions (single, double, extended). The FPU always returns a 80-bit extended which is stored in a double without premature truncation. The 64-bit code does precision adjustment before storing.
Meanwhile I filed an issue report (https://quality.embarcadero.com/browse/RSP-20925), with the focus on the inconsistent results for 32- or 64-bit.
Update:
There is only a difference in how the compiler treats the overloaded selection.
As #Graymatter found out, the LdExp overload called is the Single type for both the 32-bit and the 64-bit compiler. The only difference is the codebase, where the 32-bit compiler is using asm code, while the 64-bit compiler has a purepascal implementation.
To fix the code to use the correct overload, explicitly define the type for the LdExp() first argument like this it works (64-bit):
program Project116;
{$APPTYPE CONSOLE}
uses
system.sysutils, system.math;
var
x: double;
const
twopm1030 : UInt64 = $0000100000000000; {2^(-1030)}
begin
x := PDouble(#twopm1030)^;
writeln(x);
x := ldexp(Double(1),-515);
writeln(x*x);
x := ldexp(Double(1),-1030);
writeln(x);
ReadLn;
end.
Outputs:
8.69169475979375E-0311
8.69169475979375E-0311
8.69169475979375E-0311
I would say that this behaviour should be reported as a RTL bug, since the overloaded function selected in your case is the Single type. The resulting type is a Double and the compiler should definitely adapt accordingly.
since the 32-bit and the 64-bit compiler should produce the same result.
Note, the Double(1) typecast for floating point types, was introduced in Delphi 10.2 Tokyo. For solutions in prevoius versions, see What is first version of Delphi which allows typecasts like double(10).
The problem here is that Ldexp(single) is returning different results depending on whether the ASM code is being called or whether the pascal code is called. In both cases, the compiler is calling the Single version of the overload because the type isn't specified in the call.
Your pascal code which is executed in the Win64 scenario tries to deal with the exponent less than -126 but the method is still not able to correctly calculate the result because single numbers are limited to an 8 bit exponent. The assembler seems to get around this but I didn't look into it in much detail as to why that's the case.
function Ldexp(const X: Single; const P: Integer): Single;
{ Result := X * (2^P) }
{$IFNDEF X86ASM}
var
T: Single;
I: Integer;
const
MaxExp = 127;
MinExp = -126;
FractionOfOne = $00800000;
begin
T := X;
Result := X;
case T.SpecialType of
fsDenormal,
fsNDenormal,
fsPositive,
fsNegative:
begin
FClearExcept;
I := P;
if I > MaxExp then
begin
T.BuildUp(False, FractionOfOne, MaxExp);
Result := Result * T;
I := I - MaxExp;
if I > MaxExp then I := MaxExp;
end
else if I < MinExp then
begin
T.BuildUp(False, FractionOfOne, MinExp);
Result := Result * T;
I := I - MinExp;
if I < MinExp then I := MinExp;
end;
if I <> 0 then
begin
T.BuildUp(False, FractionOfOne, I);
Result := Result * T;
end;
FCheckExcept;
end;
// fsZero,
// fsNZero,
// fsInf,
// fsNInf,
// fsNaN:
else
;
end;
end;
{$ELSE X86ASM}
{$IF defined(CPUX86) and defined(IOS)} // iOS/Simulator
...
{$ELSE}
asm // StackAlignSafe
PUSH EAX
FILD dword ptr [ESP]
FLD X
FSCALE
POP EAX
FSTP ST(1)
FWAIT
end;
{$ENDIF}
{$ENDIF X86ASM}
As LU RD suggested, you can get around the problem by forcing the methods to call the Double overload. There is a bug but that bug is that the ASM code doesn't match the pascal code in Ldexp(const X: Single; const P: Integer), not that a different overload is being called.

How can I bit-reflect a byte in Delphi?

Is there an easy way to bit-reflect a byte variable in Delphi so that the most significant bit (MSB) gets the least significant bit (LSB) and vice versa?
In code you can do it like this:
function ReverseBits(b: Byte): Byte;
var
i: Integer;
begin
Result := 0;
for i := 1 to 8 do
begin
Result := (Result shl 1) or (b and 1);
b := b shr 1;
end;
end;
But a lookup table would be much more efficient, and only consume 256 bytes of memory.
function ReverseBits(b: Byte): Byte; inline;
const
Table: array [Byte] of Byte = (
0,128,64,192,32,160,96,224,16,144,80,208,48,176,112,240,
8,136,72,200,40,168,104,232,24,152,88,216,56,184,120,248,
4,132,68,196,36,164,100,228,20,148,84,212,52,180,116,244,
12,140,76,204,44,172,108,236,28,156,92,220,60,188,124,252,
2,130,66,194,34,162,98,226,18,146,82,210,50,178,114,242,
10,138,74,202,42,170,106,234,26,154,90,218,58,186,122,250,
6,134,70,198,38,166,102,230,22,150,86,214,54,182,118,246,
14,142,78,206,46,174,110,238,30,158,94,222,62,190,126,254,
1,129,65,193,33,161,97,225,17,145,81,209,49,177,113,241,
9,137,73,201,41,169,105,233,25,153,89,217,57,185,121,249,
5,133,69,197,37,165,101,229,21,149,85,213,53,181,117,245,
13,141,77,205,45,173,109,237,29,157,93,221,61,189,125,253,
3,131,67,195,35,163,99,227,19,147,83,211,51,179,115,243,
11,139,75,203,43,171,107,235,27,155,91,219,59,187,123,251,
7,135,71,199,39,167,103,231,23,151,87,215,55,183,119,247,
15,143,79,207,47,175,111,239,31,159,95,223,63,191,127,255
);
begin
Result := Table[b];
end;
This is more than 10 times faster than the version of the code that operates on individual bits.
Finally, I don't normally like to comment too negatively on accepted answers when I have a competing answer. In this case there are very serious problems with the answer that you accepted that I would like to state clearly for you and also for any future readers.
You accepted #Arioch's answer at the time when it contained the same Pascal code as can be seen in this answer, together with two assembler versions. It turns out that those assembler versions are much slower than the Pascal version. They are twice as slow as the Pascal code.
It is a common fallacy that converting high level code to assembler results in faster code. If you do it badly then you can easily produce code that runs more slowly than the code emitted by the compiler. There are times when it is worth writing code in assembler but you must not ever do so without proper benchmarking.
What is particularly egregious about the use of assembler here is that it is so obvious that the table based solution will be exceedingly fast. It's hard to imagine how that could be significantly improved upon.
function BitFlip(B: Byte): Byte;
const
N: array[0..15] of Byte = (0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15);
begin
Result := N[B div 16] or N[B mod 16] shl 4;
end;
function ByteReverseLoop(b: byte): byte;
var i: integer;
begin
Result := 0; // actually not needed, just to make compiler happy
for i := 1 to 8 do
begin
Result := Result shl 1;
if Odd(b) then Result := Result or 1;
b := b shr 1;
end;
end;
If speed is important, then you can use lookup table. You feel it once on program start and then you just take a value from table. Since you're only needing to map byte to byte, that would take 256x1=256 bytes of memory. And given recent Delphi versions support inline functions, that would provide for both speed, readability and reliability (incapsulating array lookup in the function you may be sure you would not change the values due to some typo)
Var ByteReverseLUT: array[byte] of byte;
function ByteReverse(b: byte): byte; inline;
begin Result := ByteReverseLUT[b] end;
{Unit/program initialization}
var b: byte;
for b := Low(ByteReverseLUT) to High(ByteReverseLUT)
do ByteReverseLUT[b] := ByteReverseLoop(b);
Speed comparison of several implementations that were mentioned on this forum.
AMD Phenom2 x710 / Win7 x64 / Delphi XE2 32-bit {$O+}
Pascal AND original: 12494
Pascal AND reversed: 33459
Pascal IF original: 46829
Pascal IF reversed: 45585
Asm SHIFT 1: 15802
Asm SHIFT 2: 15490
Asm SHIFT 3: 16212
Asm AND 1: 19408
Asm AND 2: 19601
Asm AND 3: 19802
Pascal AND unrolled: 10052
Asm Shift unrolled: 4573
LUT, called: 3192
Pascal math, called: 4614
http://pastebin.ca/2304708
Note: LUT (lookup table) timings are probably rather optimistic here. Due to running in tight loop the whole table was sucked into L1 CPU cache. In real computations this function most probably would be called much less frequently and L1 cache would not keep the table entirely.
Pascal inlined function calls result are bogus - Delphi did not called them, detecting they had no side-effects. But funny - the timings were different.
Asm Shift unrolled: 4040
LUT, called: 3011
LUT, inlined: 977
Pascal unrolled: 10052
Pas. unrolled, inlined: 849
Pascal math, called: 4614
Pascal math, inlined: 6517
And below the explanation:
Project1.dpr.427: d := BitFlipLUT(i)
0044AC45 8BC3 mov eax,ebx
0044AC47 E89CCAFFFF call BitFlipLUT
Project1.dpr.435: d := BitFlipLUTi(i)
Project1.dpr.444: d := MirrorByte(i);
0044ACF8 8BC3 mov eax,ebx
0044ACFA E881C8FFFF call MirrorByte
Project1.dpr.453: d := MirrorByteI(i);
0044AD55 8BC3 mov eax,ebx
Project1.dpr.460: d := MirrorByte7Op(i);
0044ADA3 8BC3 mov eax,ebx
0044ADA5 E8AEC7FFFF call MirrorByte7Op
Project1.dpr.462: d := MirrorByte7OpI(i);
0044ADF1 0FB6C3 movzx eax,bl
All calls to inlined functions were eliminated.
Yet about passing the parameters Delphi made three different decisions:
For the 1st call it eliminated parameter passing together with function call
For the 2nd call it kept parameter passing, despite function was not called
For the 3rd call it kept changed parameter passing, which proved longer then function call itself! Weird! :-)
Using brute force can be simple and effective.
This routine is NOT on par with David's LUT solution.
Update
Added array of byte as input and result assigned to array of byte as well.
This shows better performance for the LUT solution.
function MirrorByte(b : Byte) : Byte; inline;
begin
Result :=
((b and $01) shl 7) or
((b and $02) shl 5) or
((b and $04) shl 3) or
((b and $08) shl 1) or
((b and $10) shr 1) or
((b and $20) shr 3) or
((b and $40) shr 5) or
((b and $80) shr 7);
end;
Update 2
Googling a little, found BitReverseObvious.
function MirrorByte7Op(b : Byte) : Byte; inline;
begin
Result :=
{$IFDEF WIN64} // This is slightly better in x64 than the code in x32
(((b * UInt64($80200802)) and UInt64($0884422110)) * UInt64($0101010101)) shr 32;
{$ENDIF}
{$IFDEF WIN32}
((b * $0802 and $22110) or (b * $8020 and $88440)) * $10101 shr 16;
{$ENDIF}
end;
This one is closer to the LUT solution, even faster in one test.
To sum up, MirrorByte7Op() is 5-30% slower than LUT in 3 of the tests, 5% faster in one test.
Code to benchmark:
uses
System.Diagnostics;
const
cBit : Byte = $AA;
cLoopMax = 1000000000;
var
sw : TStopWatch;
arrB : array of byte;
i : Integer;
begin
SetLength(arrB,cLoopMax);
for i := 0 TO Length(arrB) - 1 do
arrB[i]:= System.Random(256);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
b := b;
end;
sw.Stop;
WriteLn('Loop ',b:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
b := ReflectBits(arrB[i]);
end;
sw.Stop;
WriteLn('RB array in: ',b:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
b := MirrorByte(arrB[i]);
end;
sw.Stop;
WriteLn('MB array in: ',b:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
b := MirrorByte7Op(arrB[i]);
end;
sw.Stop;
WriteLn('MB7Op array in : ',arrB[0]:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
arrB[i] := ReflectBits(arrB[i]);
end;
sw.Stop;
WriteLn('RB array in/out: ',arrB[0]:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
arrB[i]:= MirrorByte(arrB[i]);
end;
sw.Stop;
WriteLn('MB array in/out: ',arrB[0]:3,' ',sw.ElapsedMilliSeconds);
sw := TStopWatch.StartNew;
for i := 0 to Pred(cLoopMax) do
begin
arrB[i]:= MirrorByte7Op(arrB[i]);
end;
sw.Stop;
WriteLn('MB7Op array in/out: ',arrB[0]:3,' ',sw.ElapsedMilliSeconds);
ReadLn;
end.
Result of benchmark (XE3, i7 CPU 870):
32 bit 64 bit
--------------------------------------------------
Byte assignment (= empty loop) 599 ms 2117 ms
MirrorByte to byte, array in 6991 ms 8746 ms
MirrorByte7Op to byte, array in 1384 ms 2510 ms
ReverseBits to byte, array in 945 ms 2119 ms
--------------------------------------------------
ReverseBits array in/out 1944 ms 3721 ms
MirrorByte7Op array in/out 1790 ms 3856 ms
BitFlipNibble array in/out 1995 ms 6730 ms
MirrorByte array in/out 7157 ms 8894 ms
ByteReverse array in/out 38246 ms 42303 ms
I added some of the other proposals in the last part of the table (all inlined). It is probably most fair to test in a loop with an array in and an array as result. ReverseBits (LUT) and MirrorByte7Op are comparable in speed followed by BitFlipNibble (LUT) which underperforms a bit in x64.
Note: I added a new algorithm for the x64 bit part of MirrorByte7Op. It makes better use of the 64 bit registers and has fewer instructions.

Does Delphi have isqrt?

I'm doing some heavy work on large integer numbers in UInt64 values, and was wondering if Delphi has an integer square root function.
Fow now I'm using Trunc(Sqrt(x*1.0)) but I guess there must be a more performant way, perhaps with a snippet of inline assembler? (Sqrt(x)with x:UInt64 throws an invalid type compiler error in D7, hence the *1.0 bit.)
I am very far from an expert on assembly, so this answer is just me fooling around.
However, this seems to work:
function isqrt(const X: Extended): integer;
asm
fld X
fsqrt
fistp #Result
fwait
end;
as long as you set the FPU control word's rounding setting to "truncate" prior to calling isqrt. The easiest way might be to define the helper function
function SetupRoundModeForSqrti: word;
begin
result := Get8087CW;
Set8087CW(result or $600);
end;
and then you can do
procedure TForm1.FormCreate(Sender: TObject);
var
oldCW: word;
begin
oldCW := SetupRoundModeForSqrti; // setup CW
// Compute a few million integer square roots using isqrt here
Set8087CW(oldCW); // restore CW
end;
Test
Does this really improve performance? Well, I tested
procedure TForm1.FormCreate(Sender: TObject);
var
oldCW: word;
p1, p2: Int64;
i: Integer;
s1, s2: string;
const
N = 10000000;
begin
oldCW := SetupRoundModeForSqrti;
QueryPerformanceCounter(p1);
for i := 0 to N do
Tag := isqrt(i);
QueryPerformanceCounter(p2);
s1 := inttostr(p2-p1);
QueryPerformanceCounter(p1);
for i := 0 to N do
Tag := trunc(Sqrt(i));
QueryPerformanceCounter(p2);
s2 := inttostr(p2-p1);
Set8087CW(oldCW);
ShowMessage(s1 + #13#10 + s2);
end;
and got the result
371802
371774.
Hence, it is simply not worth it. The naive approach trunc(sqrt(x)) is far easier to read and maintain, has superior future and backward compatibility, and is less prone to errors.
I believe that the answer is no it does not have an integer square root function and that your solution is reasonable.
I'm a bit surprised at the need to multiple by 1.0 to convert to a floating point value. I think that must be a Delphi bug and more recent versions certainly behave as you would wish.
This is the code I end up using, based on one of the algorhythms listed on wikipedia
type
baseint=UInt64;//or cardinal for the 32-bit version
function isqrt(x:baseint):baseint;
var
p,q:baseint;
begin
//get highest power of four
p:=0;
q:=4;
while (q<>0) and (q<=x) do
begin
p:=q;
q:=q shl 2;
end;
//
q:=0;
while p<>0 do
begin
if x>=p+q then
begin
dec(x,p);
dec(x,q);
q:=(q shr 1)+p;
end
else
q:=q shr 1;
p:=p shr 2;
end;
Result:=q;
end;

Is There A Fast GetToken Routine For Delphi?

In my program, I process millions of strings that have a special character, e.g. "|" to separate tokens within each string. I have a function to return the n'th token, and this is it:
function GetTok(const Line: string; const Delim: string; const TokenNum: Byte): string;
{ LK Feb 12, 2007 - This function has been optimized as best as possible }
var
I, P, P2: integer;
begin
P2 := Pos(Delim, Line);
if TokenNum = 1 then begin
if P2 = 0 then
Result := Line
else
Result := copy(Line, 1, P2-1);
end
else begin
P := 0; { To prevent warnings }
for I := 2 to TokenNum do begin
P := P2;
if P = 0 then break;
P2 := PosEx(Delim, Line, P+1);
end;
if P = 0 then
Result := ''
else if P2 = 0 then
Result := copy(Line, P+1, MaxInt)
else
Result := copy(Line, P+1, P2-P-1);
end;
end; { GetTok }
I developed this function back when I was using Delphi 4. It calls the very efficient PosEx routine that was originally developed by Fastcode and is now included in the StrUtils library of Delphi.
I recently upgraded to Delphi 2009 and my strings are all Unicode. This GetTok function still works and still works well.
I have gone through the new libraries in Delphi 2009 and there are many new functions and additions to it.
But I have not seen a GetToken function like I need in any of the new Delphi libraries, in the various fastcode projects, and I can't find anything with a Google search other than Zarko Gajic's: Delphi Split / Tokenizer Functions, which is not as optimized as what I already have.
Any improvement, even 10% would be noticeable in my program. I know an alternative is StringLists and to always keep the tokens separate, but this has a big overhead memory-wise and I'm not sure if I did all that work to convert whether it would be any faster.
Whew. So after all this long winded talk, my question really is:
Do you know of any very fast implementations of a GetToken routine? An assembler optimized version would be ideal?
If not, are there any optimizations that you can see to my code above that might make an improvement?
Followup: Barry Kelly mentioned a question I asked a year ago about optimizing the parsing of the lines in a file. At that time I hadn't even thought of my GetTok routine which was not used for the that read or parsing. It is only now that I saw the overhead of my GetTok routine which led me to ask this question. Until Carl Smotricz and Barry's answers, I had never thought of connecting the two. So obvious, but it just didn't register. Thanks for pointing that out.
Yes, my Delim is a single character, so obviously I have some major optimization I can do. My use of Pos and PosEx in the GetTok routine (above) blinded me to the idea that I can do it faster with a character by character search instead, with bits of code like:
while (cp^ > #0) and (cp^ <= Delim) do
Inc(cp);
I'm going to go through everyone's answers and try the various suggestions and compare them. Then I'll post the results.
Confusion: Okay, now I'm really perplexed.
I took Carl and Barry's recommendation to go with PChars, and here is my implementation:
function GetTok(const Line: string; const Delim: string; const TokenNum: Byte): string;
{ LK Feb 12, 2007 - This function has been optimized as best as possible }
{ LK Nov 7, 2009 - Reoptimized using PChars instead of calls to Pos and PosEx }
{ See; https://stackoverflow.com/questions/1694001/is-there-a-fast-gettoken-routine-for-delphi }
var
I: integer;
PLine, PStart: PChar;
begin
PLine := PChar(Line);
PStart := PLine;
inc(PLine);
for I := 1 to TokenNum do begin
while (PLine^ <> #0) and (PLine^ <> Delim) do
inc(PLine);
if I = TokenNum then begin
SetString(Result, PStart, PLine - PStart);
break;
end;
if PLine^ = #0 then begin
Result := '';
break;
end;
inc(PLine);
PStart := PLine;
end;
end; { GetTok }
On paper, I don't think you can do much better than this.
So I put both routines to the task and used AQTime to see what's happening. The run I had included 1,108,514 calls to GetTok.
AQTime timed the original routine at 0.40 seconds. The million calls to Pos took 0.10 seconds. A half a million of the TokenNum = 1 copies took 0.10 seconds. The 600,000 PosEx calls only took 0.03 seconds.
Then I timed my new routine with AQTime for the same run and exactly the same calls. AQTime reports that my new "fast" routine took 3.65 seconds, which is 9 times as long. The culprit according to AQTime was the first loop:
while (PLine^ <> #0) and (PLine^ <> Delim) do
inc(PLine);
The while line, which was executed 18 million times, was reported at 2.66 seconds. The inc line, executed 16 million times, was said to take 0.47 seconds.
Now I thought I knew what was happening here. I had a similar problem with AQTime in a question I posed last year: Why is CharInSet faster than Case statement?
Again it was Barry Kelly who clued me in. Basically, an instrumenting profiler like AQTime does not necessarily do the job for microoptimization. It adds an overhead to each line which may swamp the results which is shown clearly in these numbers. The 34 million lines executed in my new "optimized code" overwhelm the several million lines of my original code, with apparently little or no overhead from the Pos and PosEx routines.
Barry gave me a sample of code using QueryPerformanceCounter to check that he was correct, and in that case he was.
Okay, so let's do the same now with QueryPerformanceCounter to prove that my new routine is faster and not 9 times slower as AQTime says it is. So here I go:
function TimeIt(const Title: string): double;
var i: Integer;
start, finish, freq: Int64;
Seconds: double;
begin
QueryPerformanceCounter(start);
for i := 1 to 250000 do
GetTokOld('This is a string|that needs|parsing', '|', 1);
for i := 1 to 250000 do
GetTokOld('This is a string|that needs|parsing', '|', 2);
for i := 1 to 250000 do
GetTokOld('This is a string|that needs|parsing', '|', 3);
for i := 1 to 250000 do
GetTokOld('This is a string|that needs|parsing', '|', 4);
QueryPerformanceCounter(finish);
QueryPerformanceFrequency(freq);
Seconds := (finish - start) / freq;
Result := Seconds;
end;
So this will test 1,000,000 calls to GetTok.
My old procedure with the Pos and PosEx calls took 0.29 seconds.
The new one with PChars took 2.07 seconds.
Now I am completely befuddled! Can anyone tell me why the PChar procedure is not only slower, but is 8 to 9 times slower!?
Mystery solved! Andreas said in his answer to change the Delim parameter from a string to a Char. I'll always be using just a Char, so at least for my implementation this is very possible. I was amazed at what happened.
The time for the 1 million calls went down from 1.88 seconds to .22 seconds.
And surprisingly, the time for my original Pos/PosEx routine went UP from .29 to .44 seconds when I changed it's Delim parameter to a Char.
Frankly, I'm disappointed by Delphi's optimizer. That Delim is a constant parameter. The optimizer should have noticed that the same conversion is happening within the loop and should have moved it out so that it would only be done once.
Double checking my Code generation parameters, yes I do have Optimization True and String format checking Off.
Bottom line is that the new PChar routine with Andrea's fix is about 25% faster than my original (.22 versus .29).
I still want to follow up on the other comments here and test them out.
Turning off optimization and turning on String format checking only increases the time from .22 to .30. It adds about the same to the original.
The advantage to using assembler code, or calling routines written in assembler like Pos or PosEx is that they are NOT subject to what code generation options you have set. They will always run the same way, a pre-optimized and non-bloated way.
I have reaffirmed in the last couple of days, that the best way to compare code for microoptimization is to look at and compare the Assembler code in the CPU window. It would be nice if Embarcadero could make that window a bit more convenient, and allow us to copy portions to the clipboard or to print sections of it.
Also, I unfairly slammed AQTime earlier in this post, thinking that the extra time added for my new routine was solely because of the instrumentation it added. Now that I go back and check with the Char parameter instead of String, the while loop is down to .30 seconds (from 2.66) and the inc line is down to .14 seconds (from .47). Strange that the inc line would go down as well. But I'm getting worn out from all this testing already.
I took Carl's idea of looping by characters, and rewrote that code with that idea. It makes another improvement, down to .19 seconds from .22. So here is now the best so far:
function GetTok(const Line: string; const Delim: Char; const TokenNum: Byte): string;
{ LK Nov 8, 2009 - Reoptimized using PChars instead of calls to Pos and PosEx }
{ See; https://stackoverflow.com/questions/1694001/is-there-a-fast-gettoken-routine-for-delphi }
var
I, CurToken: Integer;
PLine, PStart: PChar;
begin
CurToken := 1;
PLine := PChar(Line);
PStart := PLine;
for I := 1 to length(Line) do begin
if PLine^ = Delim then begin
if CurToken = TokenNum then
break
else begin
CurToken := CurToken + 1;
inc(PLine);
PStart := PLine;
end;
end
else
inc(PLine);
end;
if CurToken = TokenNum then
SetString(Result, PStart, PLine - PStart)
else
Result := '';
end;
There still may be some minor optimizations to this, such as the CurToken = Tokennum comparison, which should be the same type, Integer or Byte, whichever is faster.
But let's say, I'm happy now.
Thanks again to the StackOverflow Delphi community.
It makes a big difference what "Delim" is expected to be. If it's expected to be a single character, you're far better off stepping through the string character by character, ideally through a PChar, and testing specifically.
If it's a long string, Boyer-Moore and similar searches have a set-up phase for skip tables, and the best way would be to build the tables once, and reuse them for each subsequent find. That means you need state between calls, and this function would be better off as a method on an object instead.
You might be interested in this answer I gave to a question some time before, about the fastest way to parse a line in Delphi. (But I see that it is you that asked the question! Nevertheless, in solving your problem, I would hew to how I described parsing, not using PosEx like you are using, depending on what Delim normally looks like.)
UPDATE: OK, I spent about 40 minutes looking at this. If you know the delimiter is going to be a character, you're pretty much always better off with the second version (i.e. PChar scanning), but you have to pass Delim as a character. At the time of writing, you're converting the PLine^ expression - of type Char - to a string for comparison with Delim. That will be very slow; even indexing into the string, with Delim[1] will also be somewhat slow.
However, depending on how large your lines are, and how many delimited pieces you want to pull out, you may be better off with a resumable approach, rather than skipping unwanted delimited pieces inside the tokenizing routine. If you call GetTok with successively increasing indexes, like you are currently doing in your mini benchmark, you'll end up with O(n*n) performance, where n is the number of delimited sections. That can be turned into O(n) if you save the state of the scan and restore it for the next iteration, or pack all extracted items into an array.
Here's a version that does all tokenization once, and returns an array. It needs to tokenize twice though, in order to know how large to make the array. On the other hand, only the second tokenization needs to extract the strings:
// Do all tokenization up front.
function GetTok4(const Line: string; const Delim: Char): TArray<string>;
var
cp, start: PChar;
count: Integer;
begin
// Count sections
count := 1;
cp := PChar(Line);
start := cp;
while True do
begin
if cp^ <> #0 then
begin
if cp^ <> Delim then
Inc(cp)
else
begin
Inc(cp);
Inc(count);
end;
end
else
begin
Inc(count);
Break;
end;
end;
SetLength(Result, count);
cp := start;
count := 0;
while True do
begin
if cp^ <> #0 then
begin
if cp^ <> Delim then
Inc(cp)
else
begin
SetString(Result[count], start, cp - start);
Inc(cp);
Inc(count);
end;
end
else
begin
SetString(Result[count], start, cp - start);
Break;
end;
end;
end;
Here's the resumable approach. The loads and stores of the current position and delimiter character do have a cost, though:
type
TTokenizer = record
private
FSource: string;
FCurrPos: PChar;
FDelim: Char;
public
procedure Reset(const ASource: string; ADelim: Char); inline;
function GetToken(out AResult: string): Boolean; inline;
end;
procedure TTokenizer.Reset(const ASource: string; ADelim: Char);
begin
FSource := ASource; // keep reference alive
FCurrPos := PChar(FSource);
FDelim := ADelim;
end;
function TTokenizer.GetToken(out AResult: string): Boolean;
var
cp, start: PChar;
delim: Char;
begin
// copy members to locals for better optimization
cp := FCurrPos;
delim := FDelim;
if cp^ = #0 then
begin
AResult := '';
Exit(False);
end;
start := cp;
while (cp^ <> #0) and (cp^ <> Delim) do
Inc(cp);
SetString(AResult, start, cp - start);
if cp^ = Delim then
Inc(cp);
FCurrPos := cp;
Result := True;
end;
Here's the full program I used for benchmarking.
Here are the results:
*** count=3, Length(src)=200
GetTok1: 595 ms
GetTok2: 547 ms
GetTok3: 2366 ms
GetTok4: 407 ms
GetTokBK: 226 ms
*** count=6, Length(src)=350
GetTok1: 1587 ms
GetTok2: 1502 ms
GetTok3: 6890 ms
GetTok4: 679 ms
GetTokBK: 334 ms
*** count=9, Length(src)=500
GetTok1: 3055 ms
GetTok2: 2912 ms
GetTok3: 13766 ms
GetTok4: 947 ms
GetTokBK: 446 ms
*** count=12, Length(src)=650
GetTok1: 4997 ms
GetTok2: 4803 ms
GetTok3: 23021 ms
GetTok4: 1213 ms
GetTokBK: 543 ms
*** count=15, Length(src)=800
GetTok1: 7417 ms
GetTok2: 7173 ms
GetTok3: 34644 ms
GetTok4: 1480 ms
GetTokBK: 653 ms
Depending on the characteristics of your data, whether the delimiter is likely to be a character or not, and how you work with it, different approaches may be faster.
(I made a mistake in my earlier program, I wasn't measuring the same operations for each style of routine. I updated the pastebin link and benchmark results.)
Your new function (the one with PChar) should declare "Delim" as Char and not as String. In your current implementation the compiler has to convert the PLine^ char into a string to compare it with "Delim". And that happens in a tight loop resulting is an enormous performance hit.
function GetTok(const Line: string; const Delim: Char{<<==}; const TokenNum: Byte): string;
{ LK Feb 12, 2007 - This function has been optimized as best as possible }
{ LK Nov 7, 2009 - Reoptimized using PChars instead of calls to Pos and PosEx }
{ See; http://stackoverflow.com/questions/1694001/is-there-a-fast-gettoken-routine-for-delphi }
var
I: integer;
PLine, PStart: PChar;
begin
PLine := PChar(Line);
PStart := PLine;
inc(PLine);
for I := 1 to TokenNum do begin
while (PLine^ <> #0) and (PLine^ <> Delim) do
inc(PLine);
if I = TokenNum then begin
SetString(Result, PStart, PLine - PStart);
break;
end;
if PLine^ = #0 then begin
Result := '';
break;
end;
inc(PLine);
PStart := PLine;
end;
end; { GetTok }
Delphi compiles to VERY efficient code; in my experience, it was very difficult to do better in assembler.
I think you should just point a PChar (they still exist, don't they? I parted ways with Delphi around 4.0) at the beginning of the string and increment it while counting "|"s, until you've found n-1 of them. I suspect that will be faster than calling PosEx repeatedly.
Take note of that position, then increment the pointer some more until you hit the next pipe. Pull out your substring. Done.
I'm only guessing, but I wouldn't be surprised if this was close to the quickest this problem can be solved.
EDIT: Here's what I had in mind. This code is, alas, uncompiled and untested, but it should demonstrate what I meant.
In particular, Delim is treated as a single char, which I believe makes a world of difference if that will fulfill the requirements, and the character at PLine is tested only once. Finally, there's no more comparison against TokenNum; I believe it's faster to decrement a counter to 0 for counting delimiters.
function GetTok(const Line: string; const Delim: string; const TokenNum: Byte): string;
var
Del: Char;
PLine, PStart: PChar;
Nth, I, P0, P9: Integer;
begin
Del := Delim[1];
Nth := TokenNum + 1;
P0 := 1;
P9 := Line.length + 1;
PLine := PChar(line);
for I := 1 to P9 do begin
if PLine^ = Del then begin
if Nth = 0 then begin
P9 := I;
break;
end;
Dec(Nth);
if Nth = 0 then P0 := I + 1
end;
Inc(PLine);
end;
if (Nth <= 1) or (TokenNum = 1) then
Result := Copy(Line, P0, P9 - P0);
else
Result := ''
end;
Using assembler would be a micro-optimization. There are much greater gains to be had by optimizing the algorithm. Not doing work beats doing work in the fastest possible way, every time.
One example would be if you have places in your program where you need several tokens of the same line. Another procedure that returns an array of tokens which you can then index into should be faster than calling your function more than once, especially if you let the procedure not return all tokens, but only as many as you need.
But in general I agree with Carl's answer (+1), using a PChar for scanning would probably be faster than your current code.
This is a function that I've had in my personal library for quite some time that I use extensively. I believe this is the most current version of it. I've had multiple versions in the past being optimized for a variety of different reasons. This one tries to take into account Quoted strings, but if that code is removed it makes the function a slight bit faster.
I actually have a number of other routines, CountSections and ParseSectionPOS being a couple of examples.
Unfortuately this routine is ansi/pchar based only. Although I don't think it would be difficult to move it to unicode. Maybe I've already done that...I'll have to check on that.
Note: This routine is 1 based in the ParseNum indexing.
function ParseSection(ParseLine: string; ParseNum: Integer; ParseSep: Char; QuotedStrChar:char = #0) : string;
var
wStart, wEnd : integer;
wIndex : integer;
wLen : integer;
wQuotedString : boolean;
begin
result := '';
wQuotedString := false;
if not (ParseLine = '') then
begin
wIndex := 1;
wStart := 1;
wEnd := 1;
wLen := Length(ParseLine);
while wEnd <= wLen do
begin
if (QuotedStrChar <> #0) and (ParseLine[wEnd] = QuotedStrChar) then
wQuotedString := not wQuotedString;
if not wQuotedString and (ParseLine[wEnd] = ParseSep) then
begin
if wIndex=ParseNum then
break
else
begin
inc(wIndex);
wStart := wEnd+1;
end;
end;
inc(wEnd);
end;
result := copy(ParseLine, wStart, wEnd-wStart);
if (length(result) > 0) and (QuotedStrChar <> #0) and (result[1] = QuotedStrChar) then
result := AnsiDequotedStr(result, QuotedStrChar);
end;
end; { ParseSection }
In your code, I think this is the only line that can be optimized:
Result := copy(Line, P+1, MaxInt)
If you calculate the new Length there, it might get a bit faster, but not the 10% you are looking for.
Your tokenizing algorithm seems pretty OK.
For optimizing it, I would run it through a profiler (like AQTime from AutomatedQA) with a representative subset of your production data. That will point you to the weakest spot.
The only RTL function that comes close is this one in the Classes unit:
procedure TStrings.SetDelimitedText(const Value: string);
It tokenizes, but uses both QuoteChar and Delimiter, but you only use a Delimiter.
It uses the SetString function in the System unit which is a pretty fast way to set the content of a string based on a PChar/PAnsiChar/PUnicodeChar and a length.
That might get you some improvement as well; on the other hand, Copy is really fast too.
I'm not the person always blaming the algorithm, but if I look at the first piece of source,
the problem is that for string N, you do the POS/posexes for string 1..n-1 again too.
This means for N items, you do sum (n, n-1,n-2...1) POSes (=+/- 0.5*N^2) , while only N are needed.
If you simply cache the position of the last found result, e.g. in a record that is passed by VAR parameter, you can gain a lot.
type
TLastPosition = record
elementnr : integer; // last tokennumber
elementpos: integer; // character index of last match
end;
and then something
if tokennum=(lastposition.elementnr+1) then
begin
newpos:=posex(delim,line,lastposition.elementpos);
end;
Unfortunately, I don't have the time now to write it out, but I hope you get the idea

Resources