Decode UTF-8 encoded Cyrillic with Delphi 2007

I am working in Delphi 2007 (no Unicode support) and I am retrieving XML and JSON data from the Google Analytics API. Below is some UTF-8 encoded data that I get for a URL referral path:
ga:referralPath=/add/%D0%9F%D0%B8%D0%B6%D0%B0%D0%BC
When I decode it using this decoder it properly generates this:
ga:referralPath=/add/Пижам
Is there a function I can use in Delphi 2007 which will perform this decoding?
UPDATE
This data corresponds to a URL. Ultimately, I want to store it in a SQL Server database (out of the box, with no character-set settings modified) and then be able to generate HTML pages with a working link to this page. (Note: I am only dealing with the URL referral path in this example; obviously, a source would be needed to make a valid URL link.)

D2007 supports Unicode, just not to the extent that D2009+ does. Unicode in D2007 is handled using WideString and the few RTL support functions that do exist.
The URL contains percent-encoded UTF-8 byte octets. Simply convert those sequences into their binary representation and then use UTF8Decode() to decode the UTF-8 data to a WideString. For example:
function HexToBits(C: Char): Byte;
begin
case C of
'0'..'9': Result := Byte(Ord(C) - Ord('0'));
'a'..'f': Result := Byte(10 + (Ord(C) - Ord('a')));
'A'..'F': Result := Byte(10 + (Ord(C) - Ord('A')));
else
raise Exception.Create('Invalid encoding detected');
end;
end;
var
sURL: String;
sWork: UTF8String;
C: Char;
B: Byte;
wDecoded: WideString;
I: Integer;
begin
sURL := 'ga:referralPath=/add/%D0%9F%D0%B8%D0%B6%D0%B0%D0%BC';
sWork := sURL;
I := 1;
while I <= Length(sWork) do
begin
if sWork[I] = '%' then
begin
if (I+2) > Length(sWork) then
raise Exception.Create('Incomplete encoding detected');
sWork[I] := Char((HexToBits(sWork[I+1]) shl 4) or HexToBits(sWork[I+2]));
Delete(sWork, I+1, 2);
end;
Inc(I);
end;
wDecoded := UTF8Decode(sWork);
...
end;
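As a cross-check for the expected result, the same two steps (turn each %XX triple into one byte, then decode the whole byte sequence as UTF-8) can be sketched outside Delphi. This is a minimal Python 3 sketch, not part of the original answer; the function name percent_decode_utf8 is made up for illustration.

```python
import string

def percent_decode_utf8(text: str) -> str:
    """Turn each %XX triple into one byte, then decode all bytes as UTF-8."""
    raw = bytearray()
    i = 0
    while i < len(text):
        if text[i] == '%':
            # A '%' must be followed by exactly two hex digits.
            if i + 2 >= len(text):
                raise ValueError('Incomplete encoding detected')
            pair = text[i + 1:i + 3]
            if not all(c in string.hexdigits for c in pair):
                raise ValueError('Invalid encoding detected')
            raw.append(int(pair, 16))
            i += 3
        else:
            # Unescaped characters are copied through as their UTF-8 bytes.
            raw.extend(text[i].encode('utf-8'))
            i += 1
    return raw.decode('utf-8')

print(percent_decode_utf8('ga:referralPath=/add/%D0%9F%D0%B8%D0%B6%D0%B0%D0%BC'))
# prints: ga:referralPath=/add/Пижам
```

In Python the standard library function urllib.parse.unquote performs the same decoding, so the hand-written loop is only there to mirror the Delphi code step by step.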

You can use the following code, which uses the Windows API. Note that the result is an ANSI string in the system code page (CP_ACP), so any character that code page cannot represent is lost:
function Utf8ToStr(const Source : string) : string;
var
i, len : integer;
TmpBuf : array of byte;
begin
SetLength(Result, 0);
if Source = '' then Exit; // @Source[1] is not valid for an empty string
i := MultiByteToWideChar(CP_UTF8, 0, @Source[1], Length(Source), nil, 0);
if i = 0 then Exit;
SetLength(TmpBuf, i * SizeOf(WCHAR));
Len := MultiByteToWideChar(CP_UTF8, 0, @Source[1], Length(Source), @TmpBuf[0], i);
if Len = 0 then Exit;
i := WideCharToMultiByte(CP_ACP, 0, @TmpBuf[0], Len, nil, 0, nil, nil);
if i = 0 then Exit;
SetLength(Result, i);
i := WideCharToMultiByte(CP_ACP, 0, @TmpBuf[0], Len, @Result[1], i, nil, nil);
SetLength(Result, i);
end;

Related

Is there a way to get just the ANSI characters from a string? Utf8decode fails when string contains emojis

First I get a TMemoryStream from an HTTP request, which contains the body of the response.
Then I load it in a TStringList and save the text in a widestring (also tried with ansistring).
The problem is that I need to convert the string: the users' language is Spanish, so accented vowels are very common, and I need to store the text correctly.
lServerResponse := TStringList.Create;
lServerResponse.LoadFromStream(lResponseMemoryStream);
lStringResponse := lServerResponse.Text;
lDecodedResponse := Utf8Decode(lStringResponse );
If the response (a part of it) is "Hólá Múndó", lStringResponse value will be "HÃ³lÃ¡ MÃºndÃ³", and lDecodedResponse will be "Hólá Múndó".
But if the user adds any emoji (lStringResponse value will be "HÃ³lÃ¡ MÃºndÃ³ ðŸ˜€" if the emoji is 😀) Utf8Decode fails and returns an empty string.
Is there a way to get just the ANSI characters from a string (or MemoryStream)?, or removing whatever Utf8Decode can't convert?
Thanks for your time.

TMemoryStream is just raw bytes. There is no reason to load that stream into a TStringList just to extract a (Wide|Ansi)String from it. You can assign the bytes directly to an AnsiString/UTF8String using SetString() instead, e.g.:
var
lStringResponse: UTF8String;
lDecodedResponse: WideString;
begin
SetString(lStringResponse, PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
lDecodedResponse := UTF8Decode(lStringResponse);
end;
Just make sure the HTTP content really is encoded as UTF-8, or else this approach will not work.
That being said - UTF8Decode() (and UTF8Encode()) in Delphi 7 DO NOT support Unicode codepoints above U+FFFF, which means they DO NOT support Emojis at all. That was fixed in Delphi 2009.
To work around that issue in earlier versions, you can use the Win32 API MultiByteToWideChar() function instead, eg:
uses
..., Windows;
function My_UTF8Decode(const S: UTF8String): WideString;
var
WLen: Integer;
begin
WLen := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(S), Length(S), nil, 0);
if WLen > 0 then
begin
SetLength(Result, WLen);
MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(S), Length(S), PWideChar(Result), WLen);
end else
Result := '';
end;
var
lStringResponse: UTF8String;
lDecodedResponse: WideString;
begin
SetString(lStringResponse, PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
lDecodedResponse := My_UTF8Decode(lStringResponse);
end;
Alternatively:
uses
..., Windows;
function My_UTF8Decode(const S: PAnsiChar; const SLen: Integer): WideString;
var
WLen: Integer;
begin
WLen := MultiByteToWideChar(CP_UTF8, 0, S, SLen, nil, 0);
if WLen > 0 then
begin
SetLength(Result, WLen);
MultiByteToWideChar(CP_UTF8, 0, S, SLen, PWideChar(Result), WLen);
end else
Result := '';
end;
var
lDecodedResponse: WideString;
begin
lDecodedResponse := My_UTF8Decode(PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
end;
Or, use a 3rd party Unicode conversion library, like ICU or libiconv, which handle this for you.
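To see why code points above U+FFFF are a special case, it helps to look at the sizes involved: such a character takes four bytes in UTF-8 and two 16-bit code units (a surrogate pair) in UTF-16, which is what a WideString stores. The Python 3 sketch below is only an illustration; the helper names utf16_units and strip_non_bmp are made up. The second helper shows the fallback the question asks about, dropping everything outside the Basic Multilingual Plane, under the assumption that losing those characters is acceptable.

```python
def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as integers."""
    data = text.encode('utf-16-le')
    return [int.from_bytes(data[i:i + 2], 'little') for i in range(0, len(data), 2)]

def strip_non_bmp(text: str) -> str:
    """Drop every code point above U+FFFF (this removes most emoji)."""
    return ''.join(ch for ch in text if ord(ch) <= 0xFFFF)

emoji = '\U0001F600'  # the grinning face emoji from the question

print(len(emoji.encode('utf-8')))            # 4
print([hex(u) for u in utf16_units(emoji)])  # ['0xd83d', '0xde00']
print(strip_non_bmp('H\u00f3l\u00e1 ' + emoji) == 'H\u00f3l\u00e1 ')  # True
```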

Detect the status of a printer paper

I need to get paper status information from a printer. I have a list of ESC/POS commands.
I'm trying to send these commands with the Escape function:
http://msdn.microsoft.com/en-us/library/windows/desktop/dd162701%28v=vs.85%29.aspx
This is my code
type
TPrnBuffRec = record
bufflength: Word;
Buff_1: array[0..255] of Char;
end;
procedure TFTestStampa.SpeedButton2Click(Sender: TObject);
var
Buff: TPrnBuffRec;
BuffOut: TPrnBuffRec;
TestInt: Integer;
cmd : string;
begin
printer.BeginDoc;
try
TestInt := PassThrough;
if Escape(Printer.Handle, QUERYESCSUPPORT, SizeOf(TESTINT),
@testint, nil) > 0 then
begin
cmd := chr(10) + chr(04) + '4';
StrPCopy(Buff.Buff_1, cmd);
Buff.bufflength := StrLen(Buff.Buff_1);
Escape(Printer.Canvas.Handle, Passthrough, 0, @buff,
@buffOut);
ShowMessage( conver(strPas(buffOut.Buff_1)) );
end
finally
printer.EndDoc;
end;
end;
function TFTestStampa.Conver(s: string): String;
var
i: Byte;
t : String;
begin
t := '';
for i := 1 to Length(s) do
t := t + IntToHex(Ord(s[i]), 2) + ' ';
Result := t;
end;
The problem is that I always get the same string back, whatever command I send.
Can you give me an example of the Escape function with a non-nil last parameter?
Are there alternative ways to obtain the paper status?

I suppose you are using Delphi 2009 or later and based your example on this source, so your problem might be caused by Unicode parameters. Since Delphi 2009, the string type is an alias for UnicodeString, whereas in earlier versions it is AnsiString. The same holds for Char, which is WideChar in Delphi 2009 and later and AnsiChar before that.
If so, you have a problem at least with your buffer data length: Char = WideChar takes 2 bytes, and StrLen returns the number of characters, which is not the data size in bytes (number of characters * 2).
I hope this will fix your problem, but I can't verify it, because I don't have your printer :)
type
TPrinterData = record
DataLength: Word;
Data: array [0..255] of AnsiChar; // let's use 1 byte long AnsiChar
end;
function Convert(const S: AnsiString): string;
var
I: Integer; // 32-bit integer is more efficient than 8-bit byte type
T: string; // here we keep the native string data type
begin
T := '';
for I := 1 to Length(S) do
T := T + IntToHex(Ord(S[I]), 2) + ' ';
Result := T;
end;
procedure TFTestStampa.SpeedButton2Click(Sender: TObject);
var
TestInt: Integer;
Command: AnsiString;
BufferIn: TPrinterData;
BufferOut: TPrinterData;
begin
Printer.BeginDoc;
try
TestInt := PASSTHROUGH;
if Escape(Printer.Handle, QUERYESCSUPPORT, SizeOf(TestInt), @TestInt, nil) > 0 then
begin
Command := Chr(10) + Chr(04) + '4';
StrPCopy(BufferIn.Data, Command);
BufferIn.DataLength := Length(Command); // length in bytes, since Command is an AnsiString
FillChar(BufferOut.Data, Length(BufferOut.Data), #0);
BufferOut.DataLength := 0;
Escape(Printer.Canvas.Handle, PASSTHROUGH, 0, @BufferIn, @BufferOut);
ShowMessage(Convert(StrPas(BufferOut.Data)));
end
finally
Printer.EndDoc;
end;
end;

Delphi XE AnsiStrings with escaped combining diacritical marks

What is the best way to convert a Delphi XE AnsiString containing escaped combining diacritical marks like "Fu\u0308rst" into a friendly WideString "Fürst"?
I am aware of the fact that this is not always possible for all combinations, but the common Latin blocks should be supported without building silly conversion tables on my own. I guess the solution can be found somewhere in the new Characters unit, but I don't get it.

I think you need to perform Unicode normalization on your string.
I don't know if there's a specific call in the Delphi XE RTL to do this, but the WinAPI call NormalizeString should help you here, with mode NormalizationKC:
NormalizationKC
Unicode normalization form KC, compatibility composition. Transforms each base plus combining characters to the canonical precomposed equivalent and all compatibility characters to their equivalents. For example, the ligature ﬁ becomes f + i; similarly, A + ¨ + ﬁ + n becomes Ä + f + i + n.
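The difference between canonical composition (form C) and compatibility composition (form KC) is easy to see on the two examples above. The following is a small Python 3 sketch using the standard unicodedata module, offered only as an illustration of the normalization forms, not of the WinAPI call itself.

```python
import unicodedata

decomposed = 'Fu\u0308rst'  # 'u' followed by U+0308 COMBINING DIAERESIS
composed = unicodedata.normalize('NFC', decomposed)

print(composed == 'F\u00fcrst')         # True: 'u' + U+0308 became U+00FC
print(len(decomposed), len(composed))   # 6 5

ligature = '\ufb01'  # LATIN SMALL LIGATURE FI
print(unicodedata.normalize('NFKC', ligature))              # fi
print(unicodedata.normalize('NFC', ligature) == ligature)   # True: form C keeps the ligature
```

If only the combining marks need to be folded into precomposed letters, form C is enough; form KC additionally rewrites compatibility characters such as ligatures, which may or may not be wanted.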

Here is the complete code that solved my problem:
function Unescape(const s: AnsiString): string;
var
i: Integer;
j: Integer;
c: Integer;
begin
// Make result at least large enough. This prevents too many reallocs
SetLength(Result, Length(s));
i := 1;
j := 1;
while i <= Length(s) do begin
if s[i] = '\' then begin
if i < Length(s) then begin
// escaped backslash?
if s[i + 1] = '\' then begin
Result[j] := '\';
inc(i, 2);
end
// convert hex number to WideChar
else if (s[i + 1] = 'u') and (i + 1 + 4 <= Length(s))
and TryStrToInt('$' + string(Copy(s, i + 2, 4)), c) then begin
inc(i, 6);
Result[j] := WideChar(c);
end else begin
raise Exception.CreateFmt('Invalid code at position %d', [i]);
end;
end else begin
raise Exception.Create('Unexpected end of string');
end;
end else begin
Result[j] := WideChar(s[i]);
inc(i);
end;
inc(j);
end;
// Trim result in case we reserved too much space
SetLength(Result, j - 1);
end;
const
NormalizationC = 1;
function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR; cwSrcLength: Integer;
lpDstString: LPWSTR; cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';
function Normalize(const s: string): string;
var
newLength: integer;
begin
// in NormalizationC mode the result string won't grow longer than the input string
SetLength(Result, Length(s));
newLength := NormalizeString(NormalizationC, PChar(s), Length(s), PChar(Result), Length(Result));
SetLength(Result, newLength);
end;
function UnescapeAndNormalize(const s: AnsiString): string;
begin
Result := Normalize(Unescape(s));
end;
Thank you all! I am sure that my first experience with StackOverflow won't be my last one :-)

Are they always escaped like this? Always in a number of 4 digits?
How is the \ character itself escaped?
Assuming the \ character is escaped by \uxxxx, where xxxx is the code for the \ character, you can easily loop through the string:
function Unescape(s: AnsiString): WideString;
var
i: Integer;
j: Integer;
c: Integer;
begin
// Make result at least large enough. This prevents too many reallocs
SetLength(Result, Length(s));
i := 1; j := 1;
while i <= Length(s) do
begin
// If a '\' is found, typecast the following 4 digit integer to widechar
if s[i] = '\' then
begin
if (s[i+1] <> 'u') or not TryStrToInt(Copy(s, i+2, 4), c) then
raise Exception.CreateFmt('Invalid code at position %d', [i]);
Inc(i, 6);
Result[j] := WideChar(c);
end
else
begin
Result[j] := WideChar(s[i]);
Inc(i);
end;
Inc(j);
end;
// Trim result in case we reserved too much space
SetLength(Result, j-1);
end;
Use like this
MessageBoxW(0, PWideChar(Unescape('\u0252berhaupt')), nil, MB_OK);
This code is tested in Delphi 2007, but should work in XE as well due to the explicit use of Ansistring and Widestring.
[edit] Code is ok. Highlighter fails.

If I'm not mistaken, Delphi XE now supports regular expressions. I don't use them that often, though, but it seems a good way to parse the string and then replace all escaped values. Maybe someone has a good example of how to do this in Delphi with regular expressions?
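To show what a regular-expression approach could look like, here is a minimal sketch in Python 3 rather than Delphi, so the pattern and the replacement callback are the parts that carry over. It assumes the same escape rules as the accepted code above (\uXXXX with four hex digits, and \\ for a literal backslash); unlike that code it leaves any other backslash sequence untouched instead of raising. The names unescape and unescape_and_normalize are made up for illustration.

```python
import re
import unicodedata

# Either \uXXXX (four hex digits, captured in group 1) or an escaped backslash.
_ESCAPE = re.compile(r'\\(?:u([0-9A-Fa-f]{4})|\\)')

def unescape(text: str) -> str:
    """Replace \\uXXXX with the character it names and \\\\ with a single backslash."""
    def replace(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        return '\\'
    return _ESCAPE.sub(replace, text)

def unescape_and_normalize(text: str) -> str:
    """Unescape, then compose base letters with their combining marks (form C)."""
    return unicodedata.normalize('NFC', unescape(text))

print(unescape_and_normalize(r'Fu\u0308rst') == 'F\u00fcrst')  # True
```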

GolezTrol,
you forgot the '$' prefix, which makes TryStrToInt parse the four digits as hexadecimal:
if (s[i+1] <> 'u') or not TryStrToInt('$'+Copy(s, i+2, 4), c) then

Delphi: Encoding Strings as Python do

I want to encode strings the way Python does.
The Python code is this:
def EncodeToUTF(inputstr):
    uns = inputstr.decode('iso-8859-2')
    utfs = uns.encode('utf-8')
    return utfs
This is very simple.
But in Delphi I don't understand how to do the encoding, and how to force the right character set first (no matter which computer the code runs on).
I tried this test code to see the conversion:
procedure TForm1.Button1Click(Sender: TObject);
var
w : WideString;
buf : array[0..2048] of WideChar;
i : integer;
lc : Cardinal;
begin
lc := GetThreadLocale;
Caption := IntToStr(lc);
StringToWideChar(Edit1.Text, buf, SizeOF(buf));
w := buf;
lc := MakeLCID(
MakeLangID( LANG_ENGLISH, SUBLANG_ENGLISH_US),
0);
Win32Check(SetThreadLocale(lc));
Edit2.Text := WideCharToString(PWideChar(w));
Caption := IntToStr(AnsiCompareText(Edit1.Text, Edit2.Text));
end;
The input is "árvíztűrő tükörfúrógép", the Hungarian accent test phrase.
The local lc is 1038 (Hungarian); the new lc is 1033.
But the result is always 0 (identical strings) and the accents are unchanged: I don't lose ő and ű, which don't exist in English.
What am I doing wrong? How can I do the same thing Python does?
Thanks for any help, links, etc.:
dd

Windows uses codepage 28592 for ISO-8859-2. If you have a buffer containing ISO-8859-2 encoded bytes, then you have to decode the bytes to UTF-16 first, and then encode the result to UTF-8. Depending on which version of Delphi you are using, you can either:
1) on pre-D2009, use MultiByteToWideChar() and WideCharToMultiByte():
function EncodeToUTF(const inputstr: AnsiString): UTF8String;
var
ret: Integer;
uns: WideString;
begin
Result := '';
if inputstr = '' then Exit;
ret := MultiByteToWideChar(28592, 0, PAnsiChar(inputstr), Length(inputstr), nil, 0);
if ret < 1 then Exit;
SetLength(uns, ret);
MultiByteToWideChar(28592, 0, PAnsiChar(inputstr), Length(inputstr), PWideChar(uns), Length(uns));
ret := WideCharToMultiByte(65001, 0, PWideChar(uns), Length(uns), nil, 0, nil, nil);
if ret < 1 then Exit;
SetLength(Result, ret);
WideCharToMultiByte(65001, 0, PWideChar(uns), Length(uns), PAnsiChar(Result), Length(Result), nil, nil);
end;
2a) on D2009+, use SysUtils.TEncoding.Convert():
function EncodeToUTF(const inputstr: RawByteString): UTF8String;
var
enc: TEncoding;
buf: TBytes;
begin
Result := '';
if inputstr = '' then Exit;
enc := TEncoding.GetEncoding(28592);
try
buf := TEncoding.Convert(enc, TEncoding.UTF8, BytesOf(inputstr));
if Length(buf) > 0 then
SetString(Result, PAnsiChar(@buf[0]), Length(buf));
finally
enc.Free;
end;
end;
2b) on D2009+, alternatively define a new string typedef, put your data into it, and assign it to a UTF8String variable. No manual encoding/decoding needed, the RTL will handle everything for you:
type
Latin2String = type AnsiString(28592);
var
inputstr: Latin2String;
outputstr: UTF8String;
begin
// put the ISO-8859-2 encoded bytes into inputstr, then...
outputstr := inputstr;
end;
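For comparison, the decode-then-encode sequence that all three Delphi variants implement looks like this in Python 3, where byte strings and text are separate types. This is a sketch for checking results, not a replacement for the Delphi code; the function name encode_to_utf8 is made up, and the sample bytes are simply the test phrase from the question encoded as ISO-8859-2.

```python
def encode_to_utf8(data: bytes) -> bytes:
    """Reinterpret ISO-8859-2 bytes as text, then encode that text as UTF-8."""
    text = data.decode('iso-8859-2')   # bytes -> Unicode text
    return text.encode('utf-8')        # Unicode text -> UTF-8 bytes

phrase = '\u00e1rv\u00edzt\u0171r\u0151 t\u00fck\u00f6rf\u00far\u00f3g\u00e9p'
latin2 = phrase.encode('iso-8859-2')
utf8 = encode_to_utf8(latin2)

print(utf8.decode('utf-8') == phrase)          # True
print(encode_to_utf8(b'\xf5') == b'\xc5\x91')  # True: 0xF5 is U+0151 in ISO-8859-2
```

The second check is a handy test vector for the Delphi functions as well: the single ISO-8859-2 byte $F5 should come out as the two UTF-8 bytes $C5 $91.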

If you're using Delphi 2009 or newer every input from the default VCL controls will be UTF-16, so no need to do any conversions on your input.
If you're using Delphi 2007 or older (as it seems), you are at the mercy of Windows, because the VCL is ANSI and Windows has a fixed code page that determines which characters can be used in, e.g., a TEdit.
You can change the system-wide default ANSI CP in the control panel though, but that requires a reboot each time you do.
In Delphi 2007 you have some chance to use TNTUnicode controls or some similar solution to get the Text from the UI to your code.
In Delphi 2009 and newer there are also plenty of Unicode and character set handling routines in the RTL.
The conversion between character sets can be done with SysUtils.TEncoding:
http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/SysUtils_TEncoding.html

The Python code in your question returns a string in UTF-8 encoding. To do this with pre-2009 Delphi versions you can use code similar to:
procedure TForm1.Button1Click(Sender: TObject);
var
Src, Dest: string;
Len: integer;
buf : array[0..2048] of WideChar;
begin
Src := Edit1.Text;
Len := MultiByteToWideChar(CP_ACP, 0, PChar(Src), Length(Src), @buf[0], 2048);
buf[Len] := #0;
SetLength(Dest, 2048);
SetLength(Dest, WideCharToMultiByte(CP_UTF8, 0, @buf[0], Len, PChar(Dest),
2048, nil, nil));
Edit2.Text := Dest;
end;
Note that this doesn't change the current thread locale, it simply passes the correct code page parameters to the API.

There are encoding tools in Open XML library. There is cUnicodeCodecsWin32 unit with functions like: EncodingToUTF16().
My code that converts between ISO Latin2 and UTF-8 looks like:
s2 := EncodingToUTF16('ISO-8859-2', s);
s2utf8 := UTF16ToEncoding('UTF-8', s2);

Reading web pages / unicode

I have this function in Delphi 2009/2010.
It returns garbage. If I change the Char and PChar types to AnsiChar and PAnsiChar, it returns the text, but all foreign Unicode text is garbage. It is driving me bananas.
I have been trying all kinds of things for 2 days now.
I thought I understood this Unicode stuff, but I guess I do not.
Help please.
Thanks,
Philippe Watel
function GetInetFileAsString(const fileURL: string): string;
const
C_BufferSize = 1024;
var
sAppName: string;
hSession,
hURL: HInternet;
Buffer: array[0..C_BufferSize] of Char;
BufferLen: DWORD;
strPageContent: string;
strTemp: string;
begin
Result := '';
sAppName := ExtractFileName(Application.ExeName);
hSession := InternetOpen(PChar(sAppName), INTERNET_OPEN_TYPE_PRECONFIG, nil,
nil, 0);
try
hURL := InternetOpenURL(hSession, PChar(fileURL), nil, 0, 0, 0);
try
strPageContent := '';
repeat
InternetReadFile(hURL, @Buffer, SizeOf(Buffer), BufferLen);
SetString(strTemp, PChar(@Buffer), BufferLen div SizeOf(Char));
strPageContent := strPageContent + strTemp;
until BufferLen = 0;
Result := strPageContent;
finally
InternetCloseHandle(hURL)
end
finally
InternetCloseHandle(hSession)
end
end;

Starting in Delphi 2009, String is an alias for UnicodeString, which holds UTF-16 data. An HTML page, on the other hand, is typically encoded using a multi-byte Ansi encoding instead (usually UTF-8 nowadays, but not always). Your current code will only work if the HTML is encoded as UTF-16, which is very rare. You should not be reading the raw HTML bytes into a UnicodeString directly. You need to first download the entire data into a TBytes, RawByteString, TMemoryStream, or other suitable byte container of your choosing, and then perform an Ansi->Unicode conversion afterwards, based on the charset that is specified in the HTTP "Content-Type" response header. You can use the Accept-charset request header to tell the server which charset you prefer the data be sent as, and if the server is not able to use that charset then it should send a 406 Not Acceptable response (though it MIGHT still send a successful response in an unacceptable charset if it chooses to ignore your request header, so you should account for that).
Try something like this:
function GetInetFileAsString(const fileURL: string): string;
const
C_BufferSize = 1024;
var
sAppName: string;
hSession, hURL: HInternet;
Buffer: array of Byte;
BufferLen: DWORD;
strHeader: String;
strPageContent: TStringStream;
begin
Result := '';
SetLength(Buffer, C_BufferSize);
sAppName := ExtractFileName(Application.ExeName);
hSession := InternetOpen(PChar(sAppName), INTERNET_OPEN_TYPE_PRECONFIG, nil, nil, 0);
try
strHeader := 'Accept-Charset: utf-8'#13#10;
hURL := InternetOpenURL(hSession, PChar(fileURL), PChar(strHeader), Length(strHeader), 0, 0);
try
strPageContent := TStringStream.Create('', TEncoding.UTF8);
try
repeat
if not InternetReadFile(hURL, PByte(Buffer), Length(Buffer), BufferLen) then
Exit;
if BufferLen = 0 then
Break;
strPageContent.WriteBuffer(PByte(Buffer)^, BufferLen);
until False;
Result := strPageContent.DataString;
// or, use HttpQueryInfo(HTTP_QUERY_CONTENT_TYPE) to get
// the Content-Type header, parse out its "charset" attribute,
// and convert strPageContent.Memory to UTF-16 accordingly...
finally
strPageContent.Free;
end;
finally
InternetCloseHandle(hURL);
end
finally
InternetCloseHandle(hSession);
end;
end;
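The alternative mentioned in the code comment (read the charset attribute from the Content-Type header and decode the raw bytes accordingly) can be sketched as follows. This is an illustrative Python 3 sketch of the logic only, not WinInet code; the function names are made up, and falling back to UTF-8 when the header names no charset or an unknown one is an assumption rather than something the answer prescribes.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = 'utf-8') -> str:
    """Extract the charset attribute from a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() returns None when no charset attribute is present.
    return msg.get_content_charset() or default  # assumed fallback: UTF-8

def decode_body(body: bytes, content_type: str) -> str:
    """Decode raw response bytes using the charset named in the header."""
    charset = charset_from_content_type(content_type)
    try:
        return body.decode(charset)
    except LookupError:
        # Unknown charset name: assumed fallback, replacing undecodable bytes.
        return body.decode('utf-8', errors='replace')

print(charset_from_content_type('text/html; charset=ISO-8859-2'))          # iso-8859-2
print(decode_body(b'\xf5', 'text/html; charset=ISO-8859-2') == '\u0151')   # True
```

The key point carries over to the Delphi version: collect the whole response as bytes first, and choose the decoding only once the declared charset is known.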

My first thought is to add the correct AcceptEncoding/CharSet header to the request:
e.g:
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
