Delphi XE AnsiStrings with escaped combining diacritical marks

What is the best way to convert a Delphi XE AnsiString containing escaped combining diacritical marks like "Fu\u0308rst" into a friendly WideString "Fürst"?
I am aware of the fact that this is not always possible for all combinations, but the common Latin blocks should be supported without building silly conversion tables on my own. I guess the solution can be found somewhere in the new Characters unit, but I don't get it.

I think you need to perform Unicode normalization on your string.
I don't know if there's a specific call in Delphi XE RTL to do this, but the WinAPI call NormalizeString should help you here, with mode NormalizationKC:
NormalizationKC
Unicode normalization form KC, compatibility composition. Transforms
each base plus combining characters to
the canonical precomposed equivalent
and all compatibility characters to
their equivalents. For example, the ligature ﬁ becomes f + i; similarly, A + ¨ + ﬁ + n becomes Ä + f + i + n.
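As a minimal sketch of how the call described above might be wrapped, assuming Windows Vista or later (where Normaliz.dll is available) and a Unicode Delphi version: the wrapper name NormalizeKC is hypothetical, and the constant value 5 for NormalizationKC is taken from the Windows SDK headers rather than from this answer. The API is asked for an estimated buffer size first, and a return value of zero or less is treated as a failure.

```pascal
uses
  Windows, SysUtils;

const
  NormalizationKC = 5; // NORM_FORM value as defined in the Windows SDK

function NormalizeString(NormForm: Integer; lpSrcString: PWideChar;
  cwSrcLength: Integer; lpDstString: PWideChar;
  cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

// Hypothetical wrapper name; not part of the RTL.
function NormalizeKC(const S: UnicodeString): UnicodeString;
var
  Estimated, Actual: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // A nil destination with length 0 asks the API for an estimated size.
  Estimated := NormalizeString(NormalizationKC, PWideChar(S), Length(S), nil, 0);
  if Estimated <= 0 then
    RaiseLastOSError;
  SetLength(Result, Estimated);
  Actual := NormalizeString(NormalizationKC, PWideChar(S), Length(S),
    PWideChar(Result), Length(Result));
  // A value <= 0 signals an error (for example, a buffer that is too small).
  // A more robust version would retry with a larger buffer instead of raising.
  if Actual <= 0 then
    RaiseLastOSError;
  SetLength(Result, Actual);
end;
```

With this sketch, NormalizeKC('Fu'#$0308'rst') should yield 'Fürst' with a precomposed ü.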

Here is the complete code that solved my problem:
function Unescape(const s: AnsiString): string;
var
  i: Integer;
  j: Integer;
  c: Integer;
begin
  // Make result at least large enough. This prevents too many reallocs
  SetLength(Result, Length(s));
  i := 1;
  j := 1;
  while i <= Length(s) do begin
    if s[i] = '\' then begin
      if i < Length(s) then begin
        // escaped backslash?
        if s[i + 1] = '\' then begin
          Result[j] := '\';
          inc(i, 2);
        end
        // convert hex number to WideChar
        else if (s[i + 1] = 'u') and (i + 1 + 4 <= Length(s))
          and TryStrToInt('$' + string(Copy(s, i + 2, 4)), c) then begin
          inc(i, 6);
          Result[j] := WideChar(c);
        end else begin
          raise Exception.CreateFmt('Invalid code at position %d', [i]);
        end;
      end else begin
        raise Exception.Create('Unexpected end of string');
      end;
    end else begin
      Result[j] := WideChar(s[i]);
      inc(i);
    end;
    inc(j);
  end;
  // Trim result in case we reserved too much space
  SetLength(Result, j - 1);
end;
const
  NormalizationC = 1;
function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR; cwSrcLength: Integer;
  lpDstString: LPWSTR; cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';
function Normalize(const s: string): string;
var
  newLength: integer;
begin
  // For the Latin combining sequences handled here, the NormalizationC result
  // is not longer than the input. Rare characters can expand under NFC; in that
  // case the buffer is too small and NormalizeString returns a value <= 0.
  SetLength(Result, Length(s));
  newLength := NormalizeString(NormalizationC, PChar(s), Length(s), PChar(Result), Length(Result));
  if newLength <= 0 then
    RaiseLastOSError; // do not pass a negative length to SetLength
  SetLength(Result, newLength);
end;
function UnescapeAndNormalize(const s: AnsiString): string;
begin
  Result := Normalize(Unescape(s));
end;
Thank you all! I am sure that my first experience with StackOverflow won't be my last one :-)

Are they always escaped like this? Always in a number of 4 digits?
How is the \ character itself escaped?
Assuming the \ character is escaped by \xxxx where xxxx is the code for the \ character, you can easily loop through the string:
function Unescape(s: AnsiString): WideString;
var
i: Integer;
j: Integer;
c: Integer;
begin
// Make result at least large enough. This prevents too many reallocs
SetLength(Result, Length(s));
i := 1; j := 1;
while i <= Length(s) do
begin
// If a '\' is found, typecast the following 4 digit integer to widechar
if s[i] = '\' then
begin
if (s[i+1] <> 'u') or not TryStrToInt(Copy(s, i+2, 4), c) then
raise Exception.CreateFmt('Invalid code at position %d', [i]);
Inc(i, 6);
Result[j] := WideChar(c);
end
else
begin
Result[j] := WideChar(s[i]);
Inc(i);
end;
Inc(j);
end;
// Trim result in case we reserved too much space
SetLength(Result, j-1);
end;
Use it like this:
MessageBoxW(0, PWideChar(Unescape('\u0252berhaupt')), nil, MB_OK);
This code is tested in Delphi 2007, but should work in XE as well due to the explicit use of Ansistring and Widestring.
[edit] Code is ok. Highlighter fails.

If I'm not mistaken, Delphi XE now supports regular expressions. I don't use them that often, though, but it seems a good way to parse the string and then replace all escaped values. Maybe someone has a good example of how to do this in Delphi with regular expressions?
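A possible sketch of the regular-expression approach, assuming the TRegEx record from the RegularExpressions unit that ships with Delphi XE (the unit is named System.RegularExpressions from XE2 on). TEscapeDecoder and UnescapeWithRegEx are hypothetical names chosen for illustration. Unlike the loop-based code, this sketch does not treat \\ as an escaped backslash, so it only fits input that contains no such sequences.

```pascal
uses
  SysUtils, RegularExpressions; // System.RegularExpressions in XE2 and later

type
  // TMatchEvaluator is a method pointer ("of object"),
  // so the callback has to live in a class.
  TEscapeDecoder = class
    function Decode(const Match: TMatch): string;
  end;

function TEscapeDecoder.Decode(const Match: TMatch): string;
begin
  // Group 1 holds the four hex digits; the '$' prefix makes StrToInt read hex.
  Result := Char(StrToInt('$' + Match.Groups[1].Value));
end;

// Hypothetical helper: replaces every \uXXXX sequence by its UTF-16 code unit.
function UnescapeWithRegEx(const S: string): string;
var
  Decoder: TEscapeDecoder;
begin
  Decoder := TEscapeDecoder.Create;
  try
    Result := TRegEx.Replace(S, '\\u([0-9A-Fa-f]{4})', Decoder.Decode);
  finally
    Decoder.Free;
  end;
end;
```

The result still contains the separate combining mark, so it would need to be normalized afterwards, just like the output of the loop-based Unescape.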

GolezTrol, you forgot the '$' prefix. Without it, TryStrToInt parses the four digits as decimal rather than hexadecimal:
if (s[i+1] <> 'u') or not TryStrToInt('$'+Copy(s, i+2, 4), c) then

Related

how to convert hexa to dec using delphi and hex to octal?

function HexToDec(Str: string): Integer;
var
i, M: Integer;
begin
Result:=0;
M:=1;
Str:=AnsiUpperCase(Str);
for i:=Length(Str) downto 1 do
begin
case Str[i] of
'1'..'9': Result:=Result+(Ord(Str[i])-Ord('0'))*M;
'A'..'F': Result:=Result+(Ord(Str[i])-Ord('A')+10)*M;
end;
M:=M shl 4;
end;
end;
procedure TForm1.Button1Click(Sender: TObject);
begin
if Edit1.Text<>'' then
Label2.Caption:=IntToStr(HexToDec(Edit1.Text));
end;
How can I use it without a function? I want to use the result again on another line. And what about hex to octal: must I convert from hex to decimal and then from decimal to octal?

Delphi can do this already, so you don't need to write a function parsing the number. It is quite simple, actually:
function HexToDec(const Str: string): Integer;
begin
if (Str <> '') and ((Str[1] = '-') or (Str[1] = '+')) then
Result := StrToInt(Str[1] + '$' + Copy(Str, 2, MaxInt))
else
Result := StrToInt('$' + Str);
end;
Note that this also handles signed input: a leading '-' or '+' is moved in front of the '$', so '-1234' is parsed as -$1234 and '+1234' as +$1234.
How to use it without a function, because I want to use the result again on another line?
If you want to re-use the value, assign the result of HexToDec to a variable and use that in IntToStr.
FWIW, in your function, there is no need to call AnsiUpperCase, because all hex digits fall in the ASCII range anyway. A much simpler UpperCase should work too.

My first comment would be that you are not converting hex to decimal with your function (although you are converting to decimal as an intermediate) but rather hex to integer. IntToStr then converts integer to base 10, effectively. To generalise what you want, I would create two functions, StrBaseToInt and IntToStrBase, where Base is meant to imply e.g. 16 for hex, 10 for dec, 8 for octal, etc., assuming the convention adopted by hex that A=10, and so on up to (possibly) Z = 35, making the maximum possible base 36.
I don't handle + or - but that could be added easily.
In the reverse function, again for simplicity of illustration, I have omitted support for negative values.
Edit
Thanks to Rudy for this improvement
Edit 2 - Overflow test added, as per comments
function StrBaseToInt(const Str: string; const Base : integer): Integer;
var
i, iVal, iTest: Longword;
begin
if (Base > 36) or (Base < 2) then raise Exception.Create('Invalid Base');
Result:=0;
iTest := 0;
for i:=1 to Length(Str) do
begin
case Str[i] of
'0'..'9': iVal := (Ord(Str[i])-Ord('0'));
'A'..'Z': iVal := (Ord(Str[i])-Ord('A')+10);
'a'..'z': iVal := (Ord(Str[i])-Ord('a')+10);
else raise Exception.Create( 'Illegal character found');
end;
if iVal < Base then
begin
Result:=Result * Base + iVal;
if Result < iTest then // overflow test!
begin
raise Exception.Create( 'Overflow occurred');
end
else
begin
iTest := Result;
end;
end
else
begin
raise Exception.Create( 'Illegal character found');
end;
end;
end;
Then, for example your HexToOct function would look like this
function HexToOct( Value : string ) : string;
begin
Result := IntToStrBase( StrBaseToInt( Value, 16), 8 );
end;
Additional
A general function would be
function BaseToBase( const Value : string; const FromBase, ToBase : integer ) : string;
begin
Result := IntToStrBase( StrBaseToInt( Value, FromBase ),ToBase );
end;
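The IntToStrBase function used by HexToOct and BaseToBase is not shown above. A minimal sketch of what it might look like follows; this is an assumption for illustration, not the answer author's code. It takes a non-negative value and uses the same digit convention as StrBaseToInt (0..9, then A..Z).

```pascal
uses
  SysUtils;

// Hypothetical reconstruction of the reverse function; negative values
// are not supported, matching the simplification stated in the answer.
function IntToStrBase(Value: Cardinal; Base: Integer): string;
const
  Digits: string = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';
begin
  if (Base < 2) or (Base > 36) then
    raise Exception.Create('Invalid Base');
  if Value = 0 then
  begin
    Result := '0';
    Exit;
  end;
  Result := '';
  // Peel off the least significant digit and prepend it until nothing is left.
  while Value > 0 do
  begin
    Result := Digits[(Value mod Cardinal(Base)) + 1] + Result;
    Value := Value div Cardinal(Base);
  end;
end;
```

Under this sketch, IntToStrBase(255, 16) gives 'FF' and IntToStrBase(255, 8) gives '377'.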

Delphi XE3 - Remove Ansi Code / Color from string

I'm struggling with dealing with Ansi code strings. I'm getting the [32m, [37m, [K etc chars.
Is there a quicker way to eliminate/strip the ansi codes from the strings I get rather than doing it with the loop through chars searching for the beginning and end points of the ansi codes?
I know the declaration is something like this: #27'['#x';'#y';'#z'm';
where x, y, z... are the ANSI codes. So I assume I should be searching for #27 until I find "m;"
Are there any already made functions to achieve what I want? My search returned nothing except this article.
Thanks

You can treat this protocol very fast with code like this (simplest finite state machine):
var
  s: AnsiString;
  i: Integer;
  InColorCode: Boolean;
begin
  s := 'test'#27'['#5';'#30';'#47'm colored text';
  InColorCode := False;
  for i := 1 to Length(s) do
    if InColorCode then
      case s[i] of
        #0: TextAttrib := Normal;
        ...
        #47: TextBG := White;
        'm': InColorCode := False;
      else
        // I do nothing here for ';', '[' and other chars.
        // Treat them if necessary.
      end
    else if s[i] = #27 then
      InColorCode := True
    else
      ; // output s[i] with the current attributes
end;
Clearing string from ESC-codes:
procedure StripEscCode(var s: AnsiString);
const
StartChar: AnsiChar = #27;
EndChar: AnsiChar = 'm';
var
i, cnt: integer;
InEsc: Boolean;
begin
Cnt := 0;
InEsc := False;
for i := 1 to Length(s) do
if InEsc then begin
InEsc := s[i] <> EndChar;
Inc(cnt)
end
else begin
InEsc := s[i] = StartChar;
if InEsc then
Inc(cnt)
else
s[i - cnt] :=s[i];
end;
setLength(s, Length(s) - cnt);
end;
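StripEscCode only ends a sequence at 'm', so a code such as [K from the question would swallow text up to the next 'm'. As a hedged sketch, assuming the input uses standard ANSI/ECMA-48 CSI sequences (ESC, '[', parameter and intermediate bytes in the range #32..#63, then one final byte in the range #64..#126), a variant that stops at any final byte could look like this. The name StripAnsiCsi is hypothetical.

```pascal
// Hypothetical helper: removes CSI sequences such as ESC[32m or ESC[K.
function StripAnsiCsi(const S: AnsiString): AnsiString;
var
  i, Len, OutLen: Integer;
begin
  Len := Length(S);
  SetLength(Result, Len);
  OutLen := 0;
  i := 1;
  while i <= Len do
  begin
    if (S[i] = #27) and (i < Len) and (S[i + 1] = '[') then
    begin
      Inc(i, 2); // skip ESC and '['
      // skip parameter and intermediate bytes, e.g. '3', '2', ';'
      while (i <= Len) and (S[i] in [#32..#63]) do
        Inc(i);
      // skip the final byte, e.g. 'm' or 'K'
      if (i <= Len) and (S[i] in [#64..#126]) then
        Inc(i);
    end
    else
    begin
      // ordinary character: copy it to the output
      Inc(OutLen);
      Result[OutLen] := S[i];
      Inc(i);
    end;
  end;
  SetLength(Result, OutLen);
end;
```

It returns a new string instead of modifying the argument in place, which keeps the loop simple at the cost of one extra allocation.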

Is there an ANSI version of StrToInt?

It seems there is no Ansi overload for StrToInt. Is this right? Or maybe I am missing something.
StrToInt insists on converting my AnsiStrings to string.

You are correct. There is no ANSI version of StrToInt. The place to find ANSI versions of standard function is the AnsiStrings unit, and there's nothing there.
Either write your own function to do the job, or accept the conversion required to use StrToInt.
It's not too hard to write your own function. It might look like this:
uses
SysConst; // for SInvalidInteger
....
{$OVERFLOWCHECKS OFF}
{$RANGECHECKS OFF}
function AnsiStrToInt(const s: AnsiString): Integer;
procedure Error;
begin
raise EConvertError.CreateResFmt(@SInvalidInteger, [s]);
end;
var
Index, Len, Digit: Integer;
Negative: Boolean;
begin
Index := 1;
Result := 0;
Negative := False;
Len := Length(s);
while (Index <= Len) and (s[Index] = ' ') do
inc(Index);
if Index > Len then
Error;
case s[Index] of
'-','+':
begin
Negative := s[Index] = '-';
inc(Index);
if Index > Len then
Error;
end;
end;
while Index <= Len do
begin
Digit := ord(s[Index]) - ord('0');
if (Digit < 0) or (Digit > 9) then
Error;
Result := Result * 10 + Digit;
if Result < 0 then
Error;
inc(Index);
end;
if Negative then
Result := -Result;
end;
This is a cut-down version of that found in StrToInt. It does not handle hexadecimal and is a bit more stringent regarding errors. Before using this code I'd want to test whether or not this really is your bottleneck.
It is quite interesting that this code, based on that in the RTL source, is incapable of returning low(Integer). It's not too hard to fix that up, but it would make the code more complex.
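One way the Low(Integer) limitation could be addressed is to accumulate the digits in an Int64 and range-check only at the end. The following is a sketch under that assumption, not the RTL's approach; the name AnsiStrToIntFull and the error message text are made up for illustration.

```pascal
uses
  SysUtils;

// Hypothetical variant that accepts the full Integer range,
// including Low(Integer) = -2147483648.
function AnsiStrToIntFull(const S: AnsiString): Integer;
var
  Index, Len, Digit: Integer;
  Negative: Boolean;
  Acc: Int64;

  procedure Error;
  begin
    raise EConvertError.CreateFmt('''%s'' is not a valid integer value',
      [string(S)]);
  end;

begin
  Len := Length(S);
  Index := 1;
  while (Index <= Len) and (S[Index] = ' ') do
    Inc(Index);
  Negative := False;
  if (Index <= Len) and (S[Index] in ['+', '-']) then
  begin
    Negative := S[Index] = '-';
    Inc(Index);
  end;
  if Index > Len then
    Error; // empty input, or a sign without digits
  Acc := 0;
  while Index <= Len do
  begin
    Digit := Ord(S[Index]) - Ord('0');
    if (Digit < 0) or (Digit > 9) then
      Error;
    Acc := Acc * 10 + Digit;
    // 2147483648 is the largest magnitude that can still be valid (negated)
    if Acc > Int64(High(Integer)) + 1 then
      Error;
    Inc(Index);
  end;
  if Negative then
    Acc := -Acc;
  if Acc > High(Integer) then
    Error; // +2147483648 does not fit into an Integer
  Result := Integer(Acc);
end;
```

The 64-bit accumulator costs a little speed on 32-bit targets, so it is worth measuring if the conversion really is the bottleneck.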

The code is actually very simple (hex strings aren't supported, but you probably don't need them):
function AnsiStrToInt(const S: RawByteString): Integer;
var
P: PByte;
Negative: Boolean;
Digit: Integer;
begin
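P:= PByte(PAnsiChar(S)); // the PAnsiChar cast yields a valid #0 terminator even for an empty string, where Pointer(S) would be nil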
// skip leading spaces
while (P^ = Ord(' ')) do Inc(P);
Negative:= False;
if (P^ = Ord('-')) then begin
Negative:= True;
Inc(P);
end
else if (P^ = Ord('+')) then Inc(P);
if P^ = 0 then
raise Exception.Create('No data');
Result:= 0;
repeat
if Cardinal(Result) > Cardinal(High(Result) div 10) then
raise Exception.Create('Integer overflow');
Digit:= P^ - Ord('0');
if (Digit < 0) or (Digit > 9) then
raise Exception.Create('Invalid char');
Result:= Result * 10 + Digit;
if (Result < 0) then begin
if not Negative or (Cardinal(Result) <> Cardinal(Low(Result))) then
raise Exception.Create('Integer overflow');
end;
Inc(P);
until (P^ = 0);
if Negative then Result:= -Result;
end;

I followed this tip:
How to convert AnsiString to UnicodeString in Delphi XE4
Example:
var
a : AnsiString;
b : String;
c : Integer;
begin
a := '123';
b := String(a);
c := StrToInt(b);

What is the fastest way of stripping non alphanumeric characters from a string in Delphi7?

The characters allowed are A to Z, a to z, 0 to 9. The least amount of code or a single function would be best as the system is time critical on response to input.

If I understand you correctly you could use a function like this:
function StripNonAlphaNumeric(const AValue: string): string;
var
SrcPtr, DestPtr: PChar;
begin
SrcPtr := PChar(AValue);
SetLength(Result, Length(AValue));
DestPtr := PChar(Result);
while SrcPtr[0] <> #0 do begin
if SrcPtr[0] in ['a'..'z', 'A'..'Z', '0'..'9'] then begin
DestPtr[0] := SrcPtr[0];
Inc(DestPtr);
end;
Inc(SrcPtr);
end;
SetLength(Result, DestPtr - PChar(Result));
end;
This will use PChar for highest speed (at the cost of less readability).
Edit: Re the comment by gabr about using DestPtr[0] instead of DestPtr^: This should compile to the same bytes anyway, but there are nice applications in similar code, where you need to look ahead. Suppose you would want to replace newlines, then you could do something like
function ReplaceNewlines(const AValue: string): string;
var
SrcPtr, DestPtr: PChar;
begin
SrcPtr := PChar(AValue);
SetLength(Result, Length(AValue));
DestPtr := PChar(Result);
while SrcPtr[0] <> #0 do begin
if (SrcPtr[0] = #13) and (SrcPtr[1] = #10) then begin
DestPtr[0] := '\';
DestPtr[1] := 't';
Inc(SrcPtr);
Inc(DestPtr);
end else
DestPtr[0] := SrcPtr[0];
Inc(SrcPtr);
Inc(DestPtr);
end;
SetLength(Result, DestPtr - PChar(Result));
end;
and therefore I don't usually use the ^.

uses JclStrings;
S := StrKeepChars('mystring', ['A'..'Z', 'a'..'z', '0'..'9']);

Just to add a remark.
The solution using a set is fine in Delphi 7, but it can cause some problems in Delphi 2009, because a set cannot hold the 16-bit Char type: set elements are limited to 256 values, so the compiler narrows the characters to AnsiChar.
A solution you can use is:
case key of
'A'..'Z', 'a'..'z', '0'..'9' : begin end; // No action
else
Key := #0;
end;
But the most versatile way is of course:
if not ValidChar(key) then
Key := #0;
In that case you can use ValidChar in multiple locations and if it need to be changed you only have to change it once.
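ValidChar is not defined in the answer. A minimal sketch of such a function, assuming the allowed characters from the question (A to Z, a to z, 0 to 9), could use a case statement so that it compiles the same way in Delphi 7 and in Unicode versions. In Delphi 2009 and later, CharInSet from SysUtils would be an alternative.

```pascal
// Hypothetical implementation of the ValidChar helper mentioned above.
function ValidChar(const C: Char): Boolean;
begin
  // A case statement works on Char ranges directly and avoids
  // building a set of Char, which is narrowed to AnsiChar in Delphi 2009+.
  case C of
    'A'..'Z', 'a'..'z', '0'..'9':
      Result := True;
  else
    Result := False;
  end;
end;
```

Because the rule lives in one place, changing the allowed characters later only means editing the case labels.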

OnKeypress event
begin
if not (key in ['A'..'Z','a'..'z','0'..'9']) then
Key := #0;
end;

StringReplace alternatives to improve performance

I am using StringReplace to replace &gt; and &lt; by the characters themselves in a generated XML, like this:
StringReplace(xml.Text,'&gt;','>',[rfReplaceAll]) ;
StringReplace(xml.Text,'&lt;','<',[rfReplaceAll]) ;
The thing is, it takes way too long to replace every occurrence of &gt;.
Do you propose any better idea to make it faster?

If you're using Delphi 2009, this operation is about 3 times faster with TStringBuilder than with StringReplace. It's Unicode safe, too.
I used the text from http://www.CodeGear.com with all occurrences of "<" and ">" changed to "&lt;" and "&gt;" as my starting point.
Including string assignments and creating/freeing objects, these took about 25ms and 75ms respectively on my system:
function TForm1.TestStringBuilder(const aString: string): string;
var
sb: TStringBuilder;
begin
StartTimer;
sb := TStringBuilder.Create;
sb.Append(aString);
sb.Replace('&gt;', '>');
sb.Replace('&lt;', '<');
Result := sb.ToString();
FreeAndNil(sb);
StopTimer;
end;
function TForm1.TestStringReplace(const aString: string): string;
begin
StartTimer;
Result := StringReplace(aString,'&gt;','>',[rfReplaceAll]) ;
Result := StringReplace(Result,'&lt;','<',[rfReplaceAll]) ;
StopTimer;
end;

Try FastStrings.pas from Peter Morris.

You should definitely look at the Fastcode project pages: http://fastcode.sourceforge.net/
They ran a challenge for a faster StringReplace (Ansi StringReplace challenge), and the 'winner' was 14 times faster than the Delphi RTL.
Several of the fastcode functions have been included within Delphi itself in recent versions (D2007 on, I think), so the performance improvement may vary dramatically depending on which Delphi version you are using.
As mentioned before, you should really be looking at a Unicode-based solution if you're serious about processing XML.

The problem is that you are iterating over the entire string twice (once to replace &gt; by > and once more to replace &lt; by <).
You should iterate with a for loop and, whenever you find a &, simply check ahead for gt; or lt;, do the replacement immediately and then skip the 3 following characters ((g|l)t;). This way the work is done in time proportional to the size of the string xml.Text.
Here is a simple C# example, as I do not know Delphi, but it should be enough for you to get the general idea.
String s = "&lt;xml&gt;test&lt;/xml&gt;";
char[] input = s.ToCharArray();
char[] res = new char[s.Length];
int j = 0;
for (int i = 0, count = input.Length; i < count; ++i)
{
if (input[i] == '&')
{
if (i < count - 3)
{
if (input[i + 1] == 'l' || input[i + 1] == 'g')
{
if (input[i + 2] == 't' && input[i + 3] == ';')
{
res[j++] = input[i + 1] == 'l' ? '<' : '>';
i += 3;
continue;
}
}
}
}
res[j++] = input[i];
}
Console.WriteLine(new string(res, 0, j));
This outputs:
<xml>test</xml>

When you are dealing with multiline text files, you can gain some performance by processing them line by line. This approach reduced the replacement time by about 90% on an XML file larger than 1 MB.
procedure ReplaceMultilineString(xml: TStrings);
var
i: Integer;
line: String;
begin
for i:=0 to xml.Count-1 do
begin
line := xml[i];
line := StringReplace(line, '&gt;', '>', [rfReplaceAll]);
line := StringReplace(line, '&lt;', '<', [rfReplaceAll]);
xml[i] := line;
end;
end;

Untested conversion of the C# code written by Jorge Ferreira.
function ReplaceLtGt(const s: string): string;
var
  inPtr, outPtr: integer;
begin
  SetLength(Result, Length(s));
  inPtr := 1;
  outPtr := 1;
  while inPtr <= Length(s) do begin
    if (s[inPtr] = '&') and ((inPtr + 3) <= Length(s)) and
       (s[inPtr+1] in ['l', 'g']) and (s[inPtr+2] = 't') and
       (s[inPtr+3] = ';') then
    begin
      if s[inPtr+1] = 'l' then
        Result[outPtr] := '<'
      else
        Result[outPtr] := '>';
      Inc(inPtr, 4); // skip the whole entity: '&', 'l' or 'g', 't', ';'
    end
    else begin
      Result[outPtr] := s[inPtr]; // copy from the source string, not from Result
      Inc(inPtr);
    end;
    Inc(outPtr);
  end;
  SetLength(Result, outPtr - 1);
end;

Systools (Turbopower, now open source) has a ReplaceStringAllL function that does all of them in a string.

This works like a charm and is very fast:
Function NewStringReplace(const S, OldPattern, NewPattern: string; Flags: TReplaceFlags): string;
var
OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
PatLength,NewPatLength,P,i,PatCount,PrevP: Integer;
c,d: pchar;
begin
PatLength:=Length(OldPattern);
if PatLength=0 then begin
Result:=S;
exit;
end;
if rfIgnoreCase in Flags then begin
Srch:=AnsiUpperCase(S);
OldPat:=AnsiUpperCase(OldPattern);
end else begin
Srch:=S;
OldPat:=OldPattern;
end;
PatLength:=Length(OldPat);
if Length(NewPattern)=PatLength then begin
//Result length will not change
Result:=S;
P:=1;
repeat
P:=PosEx(OldPat,Srch,P);
if P>0 then begin
for i:=1 to PatLength do
Result[P+i-1]:=NewPattern[i];
if not (rfReplaceAll in Flags) then exit;
inc(P,PatLength);
end;
until p=0;
end else begin
//Different pattern length -> Result length will change
//To avoid creating a lot of temporary strings, we count how many
//replacements we're going to make.
P:=1; PatCount:=0;
repeat
P:=PosEx(OldPat,Srch,P);
if P>0 then begin
inc(P,PatLength);
inc(PatCount);
if not (rfReplaceAll in Flags) then break;
end;
until p=0;
if PatCount=0 then begin
Result:=S;
exit;
end;
NewPatLength:=Length(NewPattern);
SetLength(Result,Length(S)+PatCount*(NewPatLength-PatLength));
P:=1; PrevP:=0;
c:=pchar(Result); d:=pchar(S);
repeat
P:=PosEx(OldPat,Srch,P);
if P>0 then begin
for i:=PrevP+1 to P-1 do begin
c^:=d^;
inc(c); inc(d);
end;
for i:=1 to NewPatLength do begin
c^:=NewPattern[i];
inc(c);
end;
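if not (rfReplaceAll in Flags) then begin
// single replacement: copy the rest of S after the match before leaving
inc(d,PatLength);
Move(d^,c^,(Length(S)-P-PatLength+1)*SizeOf(Char));
exit;
end;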
inc(P,PatLength);
inc(d,PatLength);
PrevP:=P-1;
end else begin
for i:=PrevP+1 to Length(S) do begin
c^:=d^;
inc(c); inc(d);
end;
end;
until p=0;
end;
end;
