In Delphi Alexandria RTL, is ScanChar() badly written? - delphi

In the Delphi Alexandria RTL, they have this function:
function ScanChar(const S: string; var Pos: Integer; Ch: Char): Boolean;
var
C: Char;
begin
if (Ch = ' ') and ScanBlanks(S, Pos) then
Exit(True);
Result := False;
if Pos <= High(S) then
begin
C := S[Pos];
if C = Ch then
Result := True
else if (Ch >= 'a') and (Ch <= 'z') and (C >= 'a') and (C <= 'z') then
Result := Char(Word(C) xor $0020) = Char(Word(Ch) xor $0020)
else if Ch.IsLetter and C.IsLetter then
Result := ToUpper(C) = ToUpper(Ch);
if Result then
Inc(Pos);
end;
end;
I can't understand the purpose of this comparison:
else if (Ch >= 'a') and (Ch <= 'z') and (C >= 'a') and (C <= 'z') then
Result := Char(Word(C) xor $0020) = Char(Word(Ch) xor $0020)
It looks like it's the same as doing this:
else if (Ch >= 'a') and (Ch <= 'z') and (C >= 'a') and (C <= 'z') then
Result := c = Ch
Is this true?

else if (Ch >= 'a') and (Ch <= 'z') and (C >= 'a') and (C <= 'z') then
Result := Char(Word(C) xor $0020) = Char(Word(Ch) xor $0020)
The purpose of this comparison is optimization: when both characters are plain ASCII letters, it makes the comparison faster by avoiding an expensive call to the WinAPI through the ToUpper function, which can handle Unicode characters.
Or at least that is what would happen if the comparison itself were not badly broken.
The comparison checks whether both characters are lowercase and fall into the range between the small letter a (ASCII value 97) and the small letter z (ASCII value 122). What it should actually check is that both characters fall into the range between the capital letter A (ASCII value 65) and the small letter z, covering the whole range of ASCII letters regardless of their case. (There are a few non-letter characters in that range, but they are not relevant, as the Result assignment would never yield True for any of them.)
Once that is fixed, we also need to fix the Result assignment expression, as it will not properly compare lowercase and uppercase letters. To do that we can simply apply the or operator with $0020 to both characters, which turns uppercase letters into lowercase and leaves lowercase letters as-is. As previously mentioned, at this point in the code, non-letter characters in that range can be safely ignored.
Correct code for that part of the ScanChar function would be:
...
else
if (Ch >= 'A') and (Ch <= 'z') and (C >= 'A') and (C <= 'z') then
Result := Word(Ch) or $0020 = Word(C) or $0020
else
...
Note: Even though the original ScanChar function contains incorrect code, the result of the function is still correct, because for the same letter in different case the code always goes through the ToUpper branch of the if statement.
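As a rough, standalone illustration of the two points above, the sketch below first checks that, inside 'a'..'z', the XOR test used by the RTL can never disagree with a plain equality test, and then shows an or-based ASCII case-insensitive comparison. SameAsciiLetter is a hypothetical helper, not an RTL function. Because it is meant to be usable on its own, it adds an explicit letter check that the snippet above does not need.

```pascal
program AsciiCaseDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper (not part of the RTL): True when A and B are the
// same ASCII letter, ignoring case.
function SameAsciiLetter(A, B: Char): Boolean;
var
  LA, LB: Word;
begin
  LA := Ord(A) or $0020; // setting bit 5 maps 'A'..'Z' onto 'a'..'z'
  LB := Ord(B) or $0020;
  Result := (LA = LB) and (LA >= Ord('a')) and (LA <= Ord('z'));
end;

var
  X, Y: Char;
  Differs: Boolean;
begin
  // XOR with $20 maps distinct values to distinct values, so within
  // 'a'..'z' the XOR comparison is equivalent to X = Y.
  Differs := False;
  for X := 'a' to 'z' do
    for Y := 'a' to 'z' do
      if ((Ord(X) xor $0020) = (Ord(Y) xor $0020)) <> (X = Y) then
        Differs := True;
  Writeln('XOR test ever differs from X = Y: ', Differs);

  Writeln(SameAsciiLetter('q', 'Q')); // same letter, different case
  Writeln(SameAsciiLetter('[', '{')); // not letters, so rejected
end.
```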

It is not exactly the same as C = Ch, but the result is the same, I suppose.
The comparison is redundant, IMHO. It is using XOR to convert lowercase ASCII letters into uppercase ASCII letters (as they differ by only 1 bit), and then comparing the uppercase letters for equality. But the following comparison using IsLetter+ToUpper does the same thing, just for any letters, not just ASCII letters.

Related

Compare multiple values at a time

I need to check if N values are equal.
var
A, B, C, D : Integer;
begin
...
if(A = B) and (B = C) and (C = D) then
ShowMessage('Same value');
end;
Is there a shorter way to compare N values?
I mean something like:
var
A, B, C, D : Integer;
begin
...
if SameValue([A, B, C, D]) then
ShowMessage('Same value');
end;
Well, the best you can achieve is basically your own suggestion.
You would implement this using an open array parameter:
function AllEqual(const AValues: array of Integer): Boolean;
var
i: Integer;
begin
for i := 1 to High(AValues) do
if AValues[i] <> AValues[0] then
Exit(False);
Result := True;
end;
The correctness of this implementation is obvious:
If the number of values in the array is 0 or 1, it returns True.
Otherwise, and in general, it returns False iff the array contains two non-equal values.
AValues[0] is only accessed if High(AValues) >= 1, in which case the 0th value exists.
A function like this one is straightforward to implement for ordinal types. For real types (floating-point values), it becomes much more subtle, at least if you want to compare the elements with epsilons (like the SameValue function does in the Delphi RTL). Indeed, then you get different behaviour depending on if you compare every element against the first element, or if you compare every element against its predecessor.
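The difference between the two epsilon strategies can be seen in a small sketch. Both helpers below are hypothetical illustrations (they are not RTL functions) and use a plain absolute-difference test rather than the RTL's SameValue.

```pascal
program EpsilonChainDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: compares every element against the first one.
function AllNearFirst(const V: array of Double; Eps: Double): Boolean;
var
  i: Integer;
begin
  for i := 1 to High(V) do
    if Abs(V[i] - V[0]) > Eps then
      Exit(False);
  Result := True;
end;

// Hypothetical helper: compares every element against its predecessor.
function AllNearPrev(const V: array of Double; Eps: Double): Boolean;
var
  i: Integer;
begin
  for i := 1 to High(V) do
    if Abs(V[i] - V[i - 1]) > Eps then
      Exit(False);
  Result := True;
end;

begin
  // Each element is within 0.75 of its neighbour, but the last element
  // has drifted 1.5 away from the first.
  Writeln(AllNearFirst([0.0, 0.75, 1.5], 1.0)); // FALSE
  Writeln(AllNearPrev([0.0, 0.75, 1.5], 1.0));  // TRUE
end.
```

With exact comparison the two strategies always agree. With a tolerance, small differences can accumulate along the chain, so the choice has to be made deliberately.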
Andreas' answer is correct, I'd like to add a different approach though:
uses Math;
function AllEqual(const AValues: array of Integer): Boolean;
begin
Result := (MinIntValue(AValues) = MaxIntValue(AValues));
end;
function AllEqualF(const AValues: array of Double; Epsilon: Double): Boolean;
begin
Result := ((MaxValue(AValues)- MinValue(AValues)) <= Epsilon);
end;
There is a quite simple and very fast way to compare integers for equality without needing an additional method: bitwise operators.
Of course, this could also be put in a method with an open array parameter.
There are at least two options. In the second one you can also replace "xor" with "-" (the last case below). Replacing "or" with "+" instead is not safe in general for signed integers, because xor terms of opposite sign can cancel out (for example with the values 0, -1, -2, -2), and doing both replacements at once ruins the equality test.
However, the resulting condition is not shorter than the original (only the last case has the same length), and all parentheses are vital except those around the first xor/-. Here is the testing code:
program Project1;{$APPTYPE CONSOLE}
uses Math; var a, b, c, d, x : Integer; s: string;
begin
Randomize;
repeat
x := Random(10) - 5;
a := x + Sign(Random() - 0.5);
b := x + Sign(Random() - 0.5);
c := x + Sign(Random() - 0.5);
d := x + Sign(Random() - 0.5);
Writeln(a, ' ', b, ' ', c, ' ', d);
Writeln((A = B) and (B = C) and (C = D));
Writeln(a or b or c or d = a and b and c and d);
Writeln(a xor b or (b xor c) or (c xor d) = 0);
Writeln(a - b or (b - c) or (c - d) = 0);
Readln(s);
until s <> '';
end.
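Putting the xor/or variant into an open array function, as suggested above, might look like the following sketch. AllEqualBits is a hypothetical name. The last two lines show one case where replacing "or" with "+" gives a wrong answer for signed integers.

```pascal
program BitwiseAllEqual;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: ORs together the XOR differences of neighbouring
// elements; the accumulated value stays 0 only if all elements are equal.
function AllEqualBits(const V: array of Integer): Boolean;
var
  i, Diff: Integer;
begin
  Diff := 0;
  for i := 1 to High(V) do
    Diff := Diff or (V[i - 1] xor V[i]);
  Result := Diff = 0;
end;

var
  a, b, c, d: Integer;
begin
  Writeln(AllEqualBits([7, 7, 7, 7])); // TRUE
  Writeln(AllEqualBits([7, 7, 8, 7])); // FALSE

  // With "+" instead of "or", the terms -1, 1 and 0 cancel out.
  a := 0; b := -1; c := -2; d := -2;
  Writeln((a xor b) + (b xor c) + (c xor d) = 0); // TRUE, although the values differ
  Writeln(AllEqualBits([a, b, c, d]));            // FALSE
end.
```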

Delphi: What are faster pure Pascal approachs to find the position of a character in a Unicode string?

Background
Added Later
I have made a pure Pascal function to find the position of a character in a Unicode string as follows:
function CharPosEx(const chChr: Char; const sStr: string;
const iOffset: Integer=1): Integer;
var
PStr : PChar;
PRunIdx: PChar;
PEndIdx: PChar;
iLenStr: Integer;
begin
Result := 0;
iLenStr := Length(sStr);
if (iLenStr = 0) or (iOffset <= 0) or (iOffset > iLenStr) then Exit;
PStr := Pointer(sStr);
PEndIdx := @PStr[iLenStr - 1];
PRunIdx := @PStr[iOffset - 1];
repeat
if PRunIdx^ = chChr then begin
Result := PRunIdx - PStr + 1;
Exit;
end;
Inc(PRunIdx);
until PRunIdx > PEndIdx;
end;
I decided not to use the built-in StrUtils.PosEx() because I want to create a UTF16_CharPosEx function based on an optimized pure Pascal CharPosEx function. I'm trying to find a faster generic solution like the pure Pascal approaches of the Fastcode Project.
The Original Statements
According to the accepted answer to the question, Delphi: fast Pos with 64-bit, the fastest pure Pascal function to find the position of a substring in a string is PosEx_Sha_Pas_2() of the Fastcode Project.
For the fastest pure Pascal function to find the position of a character in a string, I noticed that the Fastcode Project has CharPos(), CharPosIEx(), and CharPosEY() for a left-to-right matching, as well as CharPosRev() for a right-to-left matching.
However, the problem is that all Fastcode functions were developed before Delphi 2009, which was the first Delphi release that supports Unicode.
I'm interested in CharPos() and CharPosEY(). I want to re-benchmark them because some optimization techniques are useless nowadays, such as the loop unrolling technique that was occasionally used in Fastcode functions.
However, I am using Delphi XE3 here and cannot recompile the benchmark project for each of the CharPos family challenges, so I cannot conclude which one is the fastest.
Questions
Does anyone here know, or can anyone conclude, which is the fastest pure Pascal implementation for each of the mentioned Fastcode challenges, especially for CharPos() and CharPosEY()?
Other approaches out of the Fastcode Project solution are welcome.
Notes
The term Unicode string used here refers to a string whose type is UnicodeString, regardless of its encoding scheme.
If encoding scheme matters, what I mean is the fixed-width 16-bit encoding scheme (UCS-2).
Many of the solutions for finding a character in a string among the Fastcode examples use a technique that reads the string in larger chunks into a register and then analyzes the register bytes for a match. This works fine when the characters are single bytes, but is not optimal when the characters are 16-bit Unicode.
Some examples even use a lookup table, but that is also not optimal in a Unicode string search.
I find that the Fastcode pure Pascal PosEx_Sha_Pas_2 string search routine works very well in both 32-bit and 64-bit mode, even for a single-character search.
You might as well use that routine.
I stripped the parts that are not needed out of PosEx_Sha_Pas_2 to create CharPosEx_LU_Pas and gained a few percent in execution time:
function CharPosEx_LU_Pas(c: Char; const S: string; Offset: Integer = 1): Integer;
var
len: Integer;
p, pStart, pStop: PChar;
label
Loop0, Loop4,
TestT, Test0, Test1, Test2, Test3, Test4,
AfterTestT, AfterTest0,
Ret;
begin;
p := Pointer(S);
if (p = nil) or (Offset < 1) then
begin;
Exit(0);
end;
len := PLongInt(PByte(p) - 4)^; // <- Modified to fit 32/64 bit
if (len < Offset) then
begin;
Exit(0);
end;
pStop := p + len;
pStart := p;
p := p + Offset + 3;
if p < pStop then
goto Loop4;
p := p - 4;
goto Loop0;
Loop4:
if c = p[-4] then
goto Test4;
if c = p[-3] then
goto Test3;
if c = p[-2] then
goto Test2;
if c = p[-1] then
goto Test1;
Loop0:
if c = p[0] then
goto Test0;
AfterTest0:
if c = p[1] then
goto TestT;
AfterTestT:
p := p + 6;
if p < pStop then
goto Loop4;
p := p - 4;
if p < pStop then
goto Loop0;
Exit(0);
Test3:
p := p - 2;
Test1:
p := p - 2;
TestT:
p := p + 2;
if p <= pStop then
goto Ret;
Exit(0);
Test4:
p := p - 2;
Test2:
p := p - 2;
Test0:
Inc(p);
Ret:
Result := p - pStart;
end;
I claim no originality to this snippet as it was a simple task to strip out those code parts not needed from PosEx_Sha_Pas_2.
Benchmark 32 bit (101 character string, last character matches):
50000000 repetitions.
System.Pos: 1547 ms
PosEX_Sha_Pas_2: 1292 ms
CharPosEx: 2315 ms
CharPosEx_LU_Pas: 1103 ms
SysUtils.StrScan: 2666 ms
Benchmark 64 bit (101 character string, last character matches):
50000000 repetitions.
System.Pos: 20928 ms
PosEX_Sha_Pas_2: 1783 ms
CharPosEx: 2874 ms
CharPosEx_LU_Pas: 1728 ms
SysUtils.StrScan: 3115 ms

Delphi - Loop through the String

I'm trying to find out if a String is of "mnemonic type"...
My mnemonic type consists of letters from 'a' to 'z' and from 'A' to 'Z', digits from '0' to '9', and additionally '_'.
I wrote the code below. It should result in True if the given string matches my mnemonic pattern, otherwise False:
TRes := True;
for I := 0 to (AString.Length - 1) do
begin
if not ((('0' <= AString[I]) and (AString[I] <= '9'))
or (('a' <= AString[I]) and (AString[I] <= 'z'))
or (('A' <= AString[I]) and (AString[I] <= 'Z'))
or (AString[I] = '_')) then
TRes := False;
end;
This code always results in False.
I'm assuming that since you tagged the question XE5, and used zero-based indexing, your strings are zero-based. But perhaps that assumption was mistaken.
Your logic is fine, although it is rather hard to read. The code in the question is already doing what you intend. At least the if statement does indeed perform the test that you intend.
Let's just rewrite your code to make it easier to understand. I'm going to lay it out differently, and use a local loop variable to represent each character:
for C in AString do
begin
if not (
(('0' <= C) and (C <= '9')) // C is in range 0..9
or (('a' <= C) and (C <= 'z')) // C is in range a..z
or (('A' <= C) and (C <= 'Z')) // C is in range A..Z
or (C = '_') // C is _
) then
TRes := False;
end;
When written like that I'm sure that you will agree that it performs the test that you intend.
To make the code easier to understand however, I would write an IsValidIdentifierChar function:
function IsValidIdentifierChar(C: Char): Boolean;
begin
Result := ((C >= '0') and (C <= '9'))
or ((C >= 'A') and (C <= 'Z'))
or ((C >= 'a') and (C <= 'z'))
or (C = '_');
end;
As @TLama says, you can write IsValidIdentifierChar more concisely using CharInSet:
function IsValidIdentifierChar(C: Char): Boolean;
begin
Result := CharInSet(C, ['0'..'9', 'a'..'z', 'A'..'Z', '_']);
end;
Then you can build your loop on top of this function:
TRes := True;
for C in AString do
if not IsValidIdentifierChar(C) then
begin
TRes := False;
break;
end;
The String type is 1-based. Dynamic arrays are 0-based. It is better to use for ... in, so that you are safe with future Delphi versions.
Testing for ranges of possible character values can be done more efficiently (and more concisely) with CharInSet.
function IsMnemonic( AString: string ): Boolean;
var
Ch: Char;
begin
for Ch in AString do
if not CharInSet( Ch, [ '_', '0'..'9', 'A'..'Z', 'a'..'z' ] ) then
Exit( False );
Result := True;
end;
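If an index-based loop is preferred, a minimal sketch could look like this. It assumes 1-based strings (the default on the desktop compilers), which is why the loop runs from 1 to Length(S) rather than from 0. IsMnemonicIdx is a hypothetical name.

```pascal
program MnemonicIndexDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical index-based variant of the check; assumes 1-based strings.
function IsMnemonicIdx(const S: string): Boolean;
var
  I: Integer;
begin
  for I := 1 to Length(S) do
    case S[I] of
      '0'..'9', 'A'..'Z', 'a'..'z', '_': ; // allowed character, keep scanning
    else
      Exit(False);
    end;
  Result := True;
end;

begin
  Writeln(IsMnemonicIdx('Abc_123')); // TRUE
  Writeln(IsMnemonicIdx('abc-123')); // FALSE, '-' is not allowed
end.
```

Like IsMnemonic above, this sketch returns True for an empty string.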

Pos() within utf8 string boundaries

I'd like to have a Pos() adapted to be used specifying boundaries within the Source string, rather than have it perform the search in the entire data.
Let's say I have a string which is 100 chars long, I want to perform the Pos only between the 5th and 20th character of the (unicode/utf8) string.
The code should be adapted from the ASM Fastcode implementation in Delphi, and obviously avoid pre-copying the portion of the string to a temporary one, as the purpose is to make it faster than that.
My scenario:
I have a string which is accessed many times, and each time a portion of it is copied to another temporary string, then a Pos is performed on it. I want to avoid the intermediate copy every time, and instead perform the Pos within the boundaries I specify.
Edit: question edited after new one was deemed a duplicate.
I would still like a solution that expands on the current XE3 FastCode assembly implementation, as that would fit my goal here.
Here is an alternative that is not based on asm.
It will also work on a 64-bit application.
function PosExUBound(const SubStr, Str: UnicodeString; Offset,EndPos: Integer): Integer; overload;
var
I, LIterCnt, L, J: NativeInt;
PSubStr, PS: PWideChar;
begin
L := Length(SubStr);
if (EndPos > Length(Str)) then
EndPos := Length(Str);
{ Calculate the number of possible iterations. Not valid if Offset < 1. }
LIterCnt := EndPos - Offset - L + 1;
{- Only continue if the number of iterations is positive or zero (there is space to check) }
if (Offset > 0) and (LIterCnt >= 0) and (L > 0) then
begin
PSubStr := PWideChar(SubStr);
PS := PWideChar(Str);
Inc(PS, Offset - 1);
Dec(L);
I := 0;
J := L;
repeat
if PS[I + J] <> PSubStr[J] then
begin
Inc(I);
J := L;
Dec(LIterCnt);
if (LIterCnt < 0)
then Exit(0);
end
else
if (J > 0) then
Dec(J)
else
Exit(I + Offset);
until false;
end;
Result := 0;
end;
I will leave it as an exercise to implement an AnsiString overloaded version.
BTW, the purepascal parts of the Pos() functions in XE3 are, to put it mildly, poorly written. See QC111103 Inefficient loop in Pos() for purepascal. Give it a vote if you like.
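When testing a bounded search like the one above, it can help to compare it against a deliberately simple reference. The sketch below is such a reference, not an optimized routine. NaiveBoundedPos is a hypothetical name, and the sketch assumes that EndPos is inclusive, which is how the iteration count in the answer's code reads.

```pascal
program BoundedPosDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical reference implementation: returns the 1-based index of the
// first occurrence of SubStr lying entirely within Str[Offset..EndPos],
// or 0 if there is none.
function NaiveBoundedPos(const SubStr, Str: string; Offset, EndPos: Integer): Integer;
var
  I, J, L: Integer;
begin
  Result := 0;
  L := Length(SubStr);
  if EndPos > Length(Str) then
    EndPos := Length(Str);
  if (L = 0) or (Offset < 1) then
    Exit;
  // The last possible start position is EndPos - L + 1.
  for I := Offset to EndPos - L + 1 do
  begin
    J := 1;
    while (J <= L) and (Str[I + J - 1] = SubStr[J]) do
      Inc(J);
    if J > L then
      Exit(I);
  end;
end;

begin
  Writeln(NaiveBoundedPos('cd', 'abcdefcd', 1, 8)); // 3
  Writeln(NaiveBoundedPos('cd', 'abcdefcd', 4, 8)); // 7
  Writeln(NaiveBoundedPos('cd', 'abcdefcd', 4, 7)); // 0, the match at 7 would end past EndPos
end.
```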

Searching for Unicode chars from a raw byte array - Free Pascal\Lazarus or Delphi

I don't want to bore people with the explanation of why and how, so I'll just jump right in.
I have an array of bytes containing raw byte data. The array is 1000 bytes. I want to go through that array of 1000 bytes and extract UTF-16 Unicode characters only that might resemble a filename but I don't know where, exactly, in that array of 1000 bytes the characters appear.
I have read the Lazarus Unicode Page and this but am still somewhat unsure of the syntactical approach to my problem. I understand that a Unicode char can be up to 4 bytes in size but is commonly two (a letter and a space).
I have used UTF8encode(WideCharLenToString(@MyArray, SomeIntValue)) with success in other areas where I KNOW certain Unicode chars exist, further to this thread that I asked about and which is now solved. But I now need to "hunt" for them within the array, for a different reason, e.g. "Look at the first 16 bytes. Are they Unicode? If not, look at the next 16. Are they Unicode? If so, convert them to a string and display them".
Can anyone help me?
Without knowing the actual layout of the bytes, or the formatting of the filename (does it have a drive letter and path, does it use UNC paths, or is it just a file name by itself?), hunting for the boundaries of the filename string is going to be difficult.
If you can assume that the filename always begins with a drive letter and path, then you can loop through the array one byte at a time until you decode a six-byte UTF-16 sequence that consists of a character between 'a'-'z' or 'A'-'Z' followed by ':' and '\' characters. If you find that, keep decoding UTF-16 sequences until you encounter a decoded null character or a binary value that is not a valid UTF-16 sequence, e.g.:
var
Buffer: array[0..1000-1] of Byte;
I: Integer;
PCh: PWord;
Hi, Lo: Word;
Ch: Cardinal;
PStart: PWideChar;
Len: Integer;
FileName: WideString;
begin
...
I := 0;
while I <= (SizeOf(Buffer)-6) do
begin
PCh := PWord(@Buffer[I]);
if not (((PCh^ >= Ord('a')) and (PCh^ <= Ord('z'))) or ((PCh^ >= Ord('A')) and (PCh^ <= Ord('Z')))) then
begin
Inc(I);
Continue;
end;
Inc(PCh);
if PCh^ <> Ord(':') then
begin
Inc(I);
Continue;
end;
Inc(PCh);
if PCh^ <> Ord('\') then
begin
Inc(I);
Continue;
end;
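PStart := PWideChar(@Buffer[I]);
Len := 3; // the drive letter, ':' and '\' matched above are part of the name
Inc(I, 6);
while I <= (SizeOf(Buffer)-2) do
begin
PCh := PWord(@Buffer[I]); // re-read at the current offset on every pass
if (PCh^ < $D800) or (PCh^ > $DFFF) then
begin
Ch := Cardinal(PCh^);
Inc(I, 2);
if Ch = 0 then Break;
Inc(Len);
end else
begin
if PCh^ > $DBFF then Break;
if (I+4) > SizeOf(Buffer) then Break; // no room left for a low surrogate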
Hi := PCh^;
Inc(PCh);
if (PCh^ < $DC00) or (PCh^ > $DFFF) then Break;
Lo := PCh^;
Ch := ((Cardinal(Hi) - $D800) * $400) + (Cardinal(Lo) - $DC00) + $10000;
if Ch > $10FFFF then Break;
Inc(I, 4);
Inc(Len, 2);
end;
end;
SetString(FileName, PStart, Len);
if Len > 0 then
begin
... use FileName as needed...
end;
end;
...
end;
UTF-16 codepoints are either 2 bytes or 4 bytes long. It's not a letter and a space; in isolation, most 16-bit words are valid UTF-16 characters. (Codepoints with values between D800 and DBFF need to be followed by a value in the range DC00-DFFF to make one complete Unicode character.) If you're just looking for valid UTF-16, it's unlikely you'll make much headway. You'll need to look for specific patterns found in filenames, like .ext (which would be encoded in UTF-16 as either \00.\00e\00x\00t or .\00e\00x\00t\00, depending on whether it's big-endian or little-endian).
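Hunting for such a pattern could be sketched as follows, assuming little-endian UTF-16 and a pattern made of plain ASCII characters. FindAsciiAsUtf16LE and the sample buffer are hypothetical and only illustrate the byte layout. The scan moves one byte at a time, so a match at an odd offset is found as well.

```pascal
program FindUtf16Ext;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: returns the 0-based byte offset of the first place
// where the ASCII text Pat appears encoded as UTF-16LE (each character
// followed by a zero byte), or -1 if it does not appear.
function FindAsciiAsUtf16LE(const Buf: array of Byte; const Pat: AnsiString): Integer;
var
  I, J: Integer;
  Match: Boolean;
begin
  Result := -1;
  if Pat = '' then
    Exit;
  for I := 0 to Length(Buf) - 2 * Length(Pat) do
  begin
    Match := True;
    for J := 1 to Length(Pat) do
      if (Buf[I + 2 * (J - 1)] <> Ord(Pat[J])) or
         (Buf[I + 2 * (J - 1) + 1] <> 0) then
      begin
        Match := False;
        Break;
      end;
    if Match then
      Exit(I);
  end;
end;

const
  // Three junk bytes followed by 'a.txt' in UTF-16LE.
  Sample: array[0..12] of Byte =
    ($FF, $00, $12, $61, $00, $2E, $00, $74, $00, $78, $00, $74, $00);
begin
  Writeln(FindAsciiAsUtf16LE(Sample, '.txt')); // 5
end.
```

For big-endian data the zero byte would come before each character instead of after it.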

Resources