Reading web pages / unicode

I have this function in Delphi 2009/2010.
It returns garbage. If I change the Char and PChar types to AnsiChar and PAnsiChar, it returns the text, but all foreign (non-ASCII) text is garbage. It is driving me bananas.
I have been trying all kinds of things for two days now.
I thought I understood this Unicode crap, but I guess I do not.
Help, please.
Thanks,
Philippe Watel
function GetInetFileAsString(const fileURL: string): string;
const
  C_BufferSize = 1024;
var
  sAppName: string;
  hSession,
  hURL: HInternet;
  Buffer: array[0..C_BufferSize] of Char;
  BufferLen: DWORD;
  strPageContent: string;
  strTemp: string;
begin
  Result := '';
  sAppName := ExtractFileName(Application.ExeName);
  hSession := InternetOpen(PChar(sAppName), INTERNET_OPEN_TYPE_PRECONFIG, nil,
    nil, 0);
  try
    hURL := InternetOpenURL(hSession, PChar(fileURL), nil, 0, 0, 0);
    try
      strPageContent := '';
      repeat
        InternetReadFile(hURL, @Buffer, SizeOf(Buffer), BufferLen);
        SetString(strTemp, PChar(@Buffer), BufferLen div SizeOf(Char));
        strPageContent := strPageContent + strTemp;
      until BufferLen = 0;
      Result := strPageContent;
    finally
      InternetCloseHandle(hURL)
    end
  finally
    InternetCloseHandle(hSession)
  end
end;

Starting in Delphi 2009, String is an alias for UnicodeString, which holds UTF-16 data. An HTML page, on the other hand, is typically encoded using a multi-byte Ansi encoding instead (usually UTF-8 nowadays, but not always). Your current code will only work if the HTML is encoded as UTF-16, which is very rare.
You should not be reading the raw HTML bytes into a UnicodeString directly. You need to first download the entire data into a TBytes, RawByteString, TMemoryStream, or other suitable byte container of your choosing, and then perform an Ansi->Unicode conversion afterwards, based on the charset that is specified in the HTTP "Content-Type" response header.
You can use the Accept-Charset request header to tell the server which charset you prefer the data be sent as, and if the server is not able to use that charset then it should send a 406 Not Acceptable response (though it MIGHT still send a successful response in an unacceptable charset if it chooses to ignore your request header, so you should account for that).
Try something like this:
function GetInetFileAsString(const fileURL: string): string;
const
  C_BufferSize = 1024;
var
  sAppName: string;
  hSession, hURL: HInternet;
  Buffer: array of Byte;
  BufferLen: DWORD;
  strHeader: String;
  strPageContent: TStringStream;
begin
  Result := '';
  SetLength(Buffer, C_BufferSize);
  sAppName := ExtractFileName(Application.ExeName);
  hSession := InternetOpen(PChar(sAppName), INTERNET_OPEN_TYPE_PRECONFIG, nil, nil, 0);
  try
    strHeader := 'Accept-Charset: utf-8'#13#10;
    hURL := InternetOpenURL(hSession, PChar(fileURL), PChar(strHeader), Length(strHeader), 0, 0);
    try
      strPageContent := TStringStream.Create('', TEncoding.UTF8);
      try
        repeat
          if not InternetReadFile(hURL, PByte(Buffer), Length(Buffer), BufferLen) then
            Exit;
          if BufferLen = 0 then
            Break;
          strPageContent.WriteBuffer(PByte(Buffer)^, BufferLen);
        until False;
        Result := strPageContent.DataString;
        // or, use HttpQueryInfo(HTTP_QUERY_CONTENT_TYPE) to get
        // the Content-Type header, parse out its "charset" attribute,
        // and convert strPageContent.Memory to UTF-16 accordingly...
      finally
        strPageContent.Free;
      end;
    finally
      InternetCloseHandle(hURL);
    end
  finally
    InternetCloseHandle(hSession);
  end;
end;
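The charset-based route mentioned in the comment inside the function starts with pulling the charset name out of the Content-Type value. Here is a minimal sketch of just that parsing step. ExtractCharset is a hypothetical helper name (it is not part of WinInet or the RTL), and the input is assumed to be the string obtained via HttpQueryInfo(HTTP_QUERY_CONTENT_TYPE). Mapping the returned name to a code page or TEncoding is deliberately left out.

```pascal
program CharsetDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns the lowercased charset named in a
// Content-Type value, or '' if the header carries no charset parameter.
function ExtractCharset(const ContentType: string): string;
const
  Key = 'charset=';
var
  P: Integer;
begin
  Result := '';
  // LowerCase keeps the length, so P is also valid for ContentType.
  P := Pos(Key, LowerCase(ContentType));
  if P = 0 then
    Exit;
  // Take everything after 'charset=' ...
  Result := Copy(ContentType, P + Length(Key), MaxInt);
  // ... and cut it at the next parameter separator, if any.
  P := Pos(';', Result);
  if P > 0 then
    Result := Copy(Result, 1, P - 1);
  Result := LowerCase(Trim(Result));
  // The value may be quoted, e.g. charset="utf-8".
  if (Length(Result) >= 2) and (Result[1] = '"') and
     (Result[Length(Result)] = '"') then
    Result := Copy(Result, 2, Length(Result) - 2);
end;

begin
  Writeln(ExtractCharset('text/html; charset=ISO-8859-1')); // iso-8859-1
  Writeln(ExtractCharset('text/html'));                     // (empty line)
end.
```

If no charset is present, the caller still has to pick a fallback (for HTML, possibly by looking at a meta tag in the page itself), so an empty result should be treated as "unknown" rather than as UTF-8.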

My first thought is to add the correct Accept-Charset header to the request, e.g.:
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

Related

Is there a way to get just the ANSI characters from a string? Utf8decode fails when string contains emojis

First I get a TMemoryStream from an HTTP request, which contains the body of the response.
Then I load it into a TStringList and save the text in a WideString (I also tried an AnsiString).
The problem is that I need to convert the string: the users' language is Spanish, so vowels with accent marks are very common, and I need to store that information.
lServerResponse := TStringList.Create;
lServerResponse.LoadFromStream(lResponseMemoryStream);
lStringResponse := lServerResponse.Text;
lDecodedResponse := Utf8Decode(lStringResponse );
If the response (a part of it) is "Hólá Múndó", lStringResponse value will be "HÃ³lÃ¡ MÃºndÃ³", and lDecodedResponse will be "Hólá Múndó".
But if the user adds any emoji (lStringResponse value will be "HÃ³lÃ¡ MÃºndÃ³ ðŸ˜€" if the emoji is 😀) Utf8Decode fails and returns an empty string.
Is there a way to get just the ANSI characters from a string (or a TMemoryStream), or to remove whatever Utf8Decode can't convert?
Thanks for your time.

TMemoryStream is just raw bytes. There is no reason to load that stream into a TStringList just to extract a (Wide|Ansi)String from it. You can assign the bytes directly to an AnsiString/UTF8String using SetString() instead, eg:
var
  lStringResponse: UTF8String;
  lDecodedResponse: WideString;
begin
  SetString(lStringResponse, PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
  lDecodedResponse := UTF8Decode(lStringResponse);
end;
Just make sure the HTTP content really is encoded as UTF-8, or else this approach will not work.
That being said - UTF8Decode() (and UTF8Encode()) in Delphi 7 DO NOT support Unicode codepoints above U+FFFF, which means they DO NOT support Emojis at all. That was fixed in Delphi 2009.
To work around that issue in earlier versions, you can use the Win32 API MultiByteToWideChar() function instead, eg:
uses
  ..., Windows;

function My_UTF8Decode(const S: UTF8String): WideString;
var
  WLen: Integer;
begin
  WLen := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(S), Length(S), nil, 0);
  if WLen > 0 then
  begin
    SetLength(Result, WLen);
    MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(S), Length(S), PWideChar(Result), WLen);
  end else
    Result := '';
end;

var
  lStringResponse: UTF8String;
  lDecodedResponse: WideString;
begin
  SetString(lStringResponse, PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
  lDecodedResponse := My_UTF8Decode(lStringResponse);
end;
Alternatively:
uses
  ..., Windows;

function My_UTF8Decode(const S: PAnsiChar; const SLen: Integer): WideString;
var
  WLen: Integer;
begin
  WLen := MultiByteToWideChar(CP_UTF8, 0, S, SLen, nil, 0);
  if WLen > 0 then
  begin
    SetLength(Result, WLen);
    MultiByteToWideChar(CP_UTF8, 0, S, SLen, PWideChar(Result), WLen);
  end else
    Result := '';
end;

var
  lDecodedResponse: WideString;
begin
  lDecodedResponse := My_UTF8Decode(PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
end;
Or, use a 3rd party Unicode conversion library, like ICU or libiconv, which handle this for you.
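For context on the U+FFFF limit mentioned above: a code point beyond it takes four bytes in UTF-8 and two UTF-16 code units (a surrogate pair) in a WideString, and that is the case the older decoder does not handle. The arithmetic can be sketched as follows; SurrogatePairHex is a hypothetical name used only for illustration, and the function assumes its input lies in the range U+10000..U+10FFFF.

```pascal
program SurrogateDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Illustration only: formats the UTF-16 surrogate pair for a code point
// above U+FFFF as two 4-digit hex values. No range checking is done.
function SurrogatePairHex(CodePoint: Cardinal): string;
var
  HighUnit, LowUnit: Word;
begin
  Dec(CodePoint, $10000);                            // 20 significant bits remain
  HighUnit := Word($D800 or (CodePoint shr 10));     // upper 10 bits
  LowUnit := Word($DC00 or (CodePoint and $3FF));    // lower 10 bits
  Result := IntToHex(HighUnit, 4) + ' ' + IntToHex(LowUnit, 4);
end;

begin
  // U+1F600 is the emoji from the question.
  Writeln(SurrogatePairHex($1F600)); // D83D DE00
end.
```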

How can I get a regular Delphi string from a stream after retrieving an object from Amazon S3?

I am putting a JSON string into Amazon S3 using the TAmazonStorageService class UploadObject method. When I retrieve the object it is placed in a stream (I am using a TStringStream), which appears to be coded in UTF-16 LE. If I then attempt to load that JSON into a memo, a TStringList, or any other similar object I get just the first character, the open curly brace of the JSON. On the other hand, if I write it to a file I get the entire JSON (UTF-16 LE encoded). I am assuming that because UTF-16 LE encodes each character with two bytes, and the second byte is always 0, Delphi is assuming that the 0 is the end of file marker.
How can I get a regular Delphi string (WideString), or even an AnsiString, from the TStringStream? Or is there another stream I should use to get a WideString or AnsiString?
Here is pseudo code that represents the upload:
procedure StorePayload( AmazonConnectionInfo: TAmazonConnectionInfo; JSONString: String;
  PayloadMemTable: TFDAdaptedDataSet;
  PayloadType: String; PayloadVersion: Integer);
var
  AmazonStorageService: TAmazonStorageService;
  ab: TBytes;
  ResponseInfo: TCloudResponseInfo;
  ss: TStringStream;
  Guid: TGuid;
begin
  Guid := TGuid.NewGuid;
  AmazonStorageService := TAmazonStorageService.Create( AmazonConnectionInfo );
  try
    // Write payload to S3
    ResponseInfo := TCloudResponseInfo.Create;
    try
      ss := TStringStream.Create( JSONString );
      try
        ab := StringToBytes( ss.DataString );
        if AmazonStorageService.UploadObject( BucketName, Guid.ToString, ab, false, nil, nil, amzbaPrivate, ResponseInfo ) then
          PayloadMemTable.AppendRecord( [Guid.ToString, PayloadType, PayloadVersion, now() ] );
      finally
        ss.Free;
      end;
    finally
      ResponseInfo.Free;
    end;
  finally
    AmazonStorageService.Free;
  end;
end;
And here is pseudo code that represents the retrieval of the JSON:
function RetrievePayload( AmazonConnectionInfo: TAmazonConnectionInfo ): String;
var
  AmazonStorageService: TAmazonStorageService;
  ObjectName: string;
  ResponseInfo: TCloudResponseInfo;
  ss: TStringStream;
  OptParams: TAmazonGetObjectOptionals;
begin
  // I tried with and without the TAmazonGetObjectOptionals
  OptParams := TAmazonGetObjectOptionals.Create;
  OptParams.ResponseContentEncoding := 'ANSI';
  OptParams.ResponseContentType := 'text/plain';
  AmazonStorageService := TAmazonStorageService.Create( AmazonConnectionInfo );
  try
    ss := TStringStream.Create( );
    try
      ResponseInfo := TCloudResponseInfo.Create;
      try
        if not AmazonStorageService.GetObject( BucketName, PayloadID, OptParams,
          ss, ResponseInfo, amzrNotSpecified ) then
          raise Exception.Create('Error retrieving item ' + ObjectName);
        Result := ss.DataString;
        // The memo will contain only {
        Form1.Memo1.Lines.Text := ss.DataString;
      finally
        ResponseInfo.Free;
      end;
    finally
      ss.Free;
    end;
  finally
    AmazonStorageService.Free;
  end;
end;

In Delphi 2009 and later, String is a UTF-16 UnicodeString, however TStringStream operates on 8-bit ANSI by default (for backwards compatibility with pre-Unicode Delphi versions).
There is no need for StorePayload() to use TStringStream at all. You are storing a String into the stream just to read a String back out from it. So just use the original String as-is.
Using StringToBytes() is unnecessary, too. You can, and should, use TEncoding.UTF8 instead, as UTF-8 is the preferred encoding for JSON data, eg:
procedure StorePayload( AmazonConnectionInfo: TAmazonConnectionInfo; JSONString: String;
  PayloadMemTable: TFDAdaptedDataSet;
  PayloadType: String; PayloadVersion: Integer);
var
  AmazonStorageService: TAmazonStorageService;
  ab: TBytes;
  ResponseInfo: TCloudResponseInfo;
  Guid: TGuid;
begin
  Guid := TGuid.NewGuid;
  AmazonStorageService := TAmazonStorageService.Create( AmazonConnectionInfo );
  try
    // Write payload to S3
    ResponseInfo := TCloudResponseInfo.Create;
    try
      ab := TEncoding.UTF8.GetBytes( JSONString );
      if AmazonStorageService.UploadObject( BucketName, Guid.ToString, ab, false, nil, nil, amzbaPrivate, ResponseInfo ) then
        PayloadMemTable.AppendRecord( [Guid.ToString, PayloadType, PayloadVersion, Now() ] );
    finally
      ResponseInfo.Free;
    end;
  finally
    AmazonStorageService.Free;
  end;
end;
Conversely, when RetrievePayload() calls GetObject() later, you can use TEncoding.UTF8 with TStringStream to decode the String, eg:
function RetrievePayload( AmazonConnectionInfo: TAmazonConnectionInfo ): String;
var
  AmazonStorageService: TAmazonStorageService;
  ResponseInfo: TCloudResponseInfo;
  ss: TStringStream;
begin
  AmazonStorageService := TAmazonStorageService.Create( AmazonConnectionInfo );
  try
    // Decode the downloaded bytes as UTF-8 instead of the default ANSI
    ss := TStringStream.Create( '', TEncoding.UTF8 );
    try
      ResponseInfo := TCloudResponseInfo.Create;
      try
        if not AmazonStorageService.GetObject( BucketName, PayloadID, ss, ResponseInfo, amzrNotSpecified ) then
          raise Exception.Create('Error retrieving item ' + PayloadID);
        Result := ss.DataString;
        Form1.Memo1.Text := Result;
      finally
        ResponseInfo.Free;
      end;
    finally
      ss.Free;
    end;
  finally
    AmazonStorageService.Free;
  end;
end;
If you need to retrieve any pre-existing bucket objects that have already been uploaded as UTF-16, RetrievePayload() could use TEncoding.Unicode instead:
ss := TStringStream.Create( '', TEncoding.Unicode );
However, that won't work for newer objects uploaded as UTF-8. So, a more flexible solution would be to retrieve the raw bytes using a TMemoryStream or TBytesStream, then analyze the bytes to determine whether UTF-8 or UTF-16 was used, and then use TEncoding.UTF8.GetString() or TEncoding.Unicode.GetString() to decode the bytes to a String.
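One way that byte analysis might look is sketched below. DecodePayload is a hypothetical helper, and its heuristic rests on an assumption that holds for JSON objects but not for arbitrary text: the first character is ASCII (such as '{'), so BOM-less UTF-16LE data has a zero in its second byte, while UTF-8 data does not.

```pascal
program DecodeDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: decodes bytes holding either UTF-8 or UTF-16LE text.
// Assumption: the text starts with an ASCII character, e.g. '{'.
function DecodePayload(const Bytes: TBytes): string;
var
  Encoding: TEncoding;
  BomLen: Integer;
begin
  Result := '';
  if Length(Bytes) = 0 then
    Exit;
  // Let the RTL look for a BOM first; it returns the BOM length (0 if none).
  Encoding := nil;
  BomLen := TEncoding.GetBufferEncoding(Bytes, Encoding);
  if BomLen = 0 then
  begin
    // No BOM: fall back to the zero-byte heuristic described above.
    if (Length(Bytes) >= 2) and (Bytes[1] = 0) then
      Encoding := TEncoding.Unicode
    else
      Encoding := TEncoding.UTF8;
  end;
  // Decode everything after the BOM, if anything is left.
  if Length(Bytes) > BomLen then
    Result := Encoding.GetString(Bytes, BomLen, Length(Bytes) - BomLen);
end;

begin
  Writeln(DecodePayload(TEncoding.UTF8.GetBytes('{"a":1}')));
  Writeln(DecodePayload(TEncoding.Unicode.GetBytes('{"a":1}')));
end.
```

Both calls should print the same JSON text. In RetrievePayload(), the input would be the Bytes of a TBytesStream truncated to its Size, since that stream's internal buffer can be larger than the downloaded data.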

Delphi Clientdataset conversion?

I have been hounded by this problem for a few days. I have two ClientDataSets with data in them, and I want to convert the OleVariant data to a string using two functions I found here on Stack Overflow.
The purpose of converting to a string is to be able to transfer the string to another location, convert it back to an OleVariant, and assign it to another ClientDataSet.
To simulate it, I created a sample app with the following partial code (see the block below).
The code executes properly, but when I switch the Windows locale to Japanese (which is the requirement), I get a data packet mismatch when assigning the data to the second dataset. However, if I do this in the Japanese locale:
clientdataset2.data := clientdataset1.data
it works fine. In the English locale, the code works just fine.
Is there a problem in the string conversion, or is there anything I can do? I would really appreciate help with this.
//to simulate the conversion
TempData := ClientDataSet1.Data;
TempString := OleVariantToString(ClientDataset1.Data);
TempData2 := StringToOleVariant(TempString);
ClientDataSet2.Data := TempData2; //mismatch in data packet happens here in japanese locale

//conversion functions
function TForm1.OleVariantToString(const Value: OleVariant): string;
var
  ss: TStringStream;
  Size: integer;
  Data: PByteArray;
begin
  Result := '';
  if Length(Value) = 0 then
    Exit;
  ss := TStringStream.Create;
  try
    Size := VarArrayHighBound(Value, 1) - VarArrayLowBound(Value, 1) + 1;
    Data := VarArrayLock(Value);
    try
      ss.Position := 0;
      ss.WriteBuffer(Data^, Size);
      ss.Position := 0;
      Result := ss.DataString;
    finally
      VarArrayUnlock(Value);
    end;
  finally
    ss.Free;
  end;
end;

function TForm1.StringToOleVariant(const Value: string): OleVariant;
var
  ss: TStringStream;
  MyBuffer: Pointer;
begin
  Result := null;
  if Value = '' then
    Exit;
  ss := TStringStream.Create(Value);
  try
    Result := VarArrayCreate([0, ss.Size - 1], varByte);
    MyBuffer := VarArrayLock(Result);
    try
      ss.Position := 0;
      ss.ReadBuffer(MyBuffer^, ss.Size);
    finally
      VarArrayUnlock(Result);
    end;
  finally
    ss.Free;
  end;
end;

Streaming is already implemented, so you can use:
Writing: TClientDataSet.SaveToFile or TClientDataSet.SaveToStream
Reading: TClientDataSet.LoadFromFile or TClientDataSet.LoadFromStream
procedure SaveToStream(Stream: TStream; Format: TDataPacketFormat = dfBinary);
procedure SaveToFile(const FileName: string = ''; Format: TDataPacketFormat = dfBinary);
procedure LoadFromStream(Stream: TStream);
procedure LoadFromFile(const FileName: string = '');
The TDataPacketFormat options are:
dfBinary: Information is encoded in binary format.
dfXML: Information is encoded in XML, with extended characters encoded using an escape sequence.
dfXMLUTF8: Information is encoded in XML, with extended characters represented using UTF-8.
Using dfXMLUTF8, you should have no problems with non-ANSI character sets.
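As a rough sketch of how those methods could replace the two conversion functions from the question: the unit and routine names below (DataSetXml, DataSetToXml, XmlToDataSet) are hypothetical, and the code assumes Delphi 2009 or later, where TStringStream accepts a TEncoding. It has not been checked against a Japanese locale.

```pascal
unit DataSetXml;

interface

uses
  SysUtils, Classes, DBClient; // Datasnap.DBClient in versions with unit scope names

function DataSetToXml(DataSet: TClientDataSet): string;
procedure XmlToDataSet(const Xml: string; DataSet: TClientDataSet);

implementation

// Saves the data packet as UTF-8 XML and returns it as a Delphi string.
function DataSetToXml(DataSet: TClientDataSet): string;
var
  Stream: TStringStream;
begin
  Stream := TStringStream.Create('', TEncoding.UTF8);
  try
    DataSet.SaveToStream(Stream, dfXMLUTF8);
    Result := Stream.DataString; // bytes decoded as UTF-8, not as the system locale
  finally
    Stream.Free;
  end;
end;

// Re-encodes the string as UTF-8 bytes and loads them into the dataset.
procedure XmlToDataSet(const Xml: string; DataSet: TClientDataSet);
var
  Stream: TStringStream;
begin
  Stream := TStringStream.Create(Xml, TEncoding.UTF8);
  try
    DataSet.LoadFromStream(Stream);
  finally
    Stream.Free;
  end;
end;

end.
```

The point of the design is that the string only ever carries text that was produced by an explicit UTF-8 decode, so the active Windows code page should no longer influence the round trip.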

Why doesn't TStringStream remove the BOM when converting to a string?

We have a library function that goes like this:
class function TFileUtils.ReadTextStream(const AStream: TStream): string;
var
  StringStream: TStringStream;
begin
  StringStream := TStringStream.Create('', TEncoding.Unicode);
  try
    // This is WRONG since CopyFrom might rewind the stream (see Remy's comment)
    StringStream.CopyFrom(AStream, AStream.Size - AStream.Position);
    Result := StringStream.DataString;
  finally
    StringStream.Free;
  end;
end;
When I check the string that is returned by the function the first Char is the (little-endian) BOM.
Why doesn't TStringStream ignore the BOM?
Is there a better way to do this? I don't need backwards compatibility with older Delphi versions, a working solution for XE2 would be fine.

The BOM has to be coming from the source TStream, as TStringStream does not write a BOM. If you want to skip a BOM that is present in the source, you have to do that manually before copying the data, eg:
class function TFileUtils.ReadTextStream(const AStream: TStream): string;
var
  StreamPos, StreamSize: Int64;
  Buf: TBytes;
  NumBytes: Integer;
  Encoding: TEncoding;
  StringStream: TStringStream;
begin
  Result := '';
  StreamPos := AStream.Position;
  StreamSize := AStream.Size - StreamPos;
  // Anything available to read?
  if StreamSize < 1 then Exit;
  // Read the first few bytes from the stream...
  SetLength(Buf, 4);
  NumBytes := AStream.Read(Buf[0], Length(Buf));
  if NumBytes < 1 then Exit;
  Inc(StreamPos, NumBytes);
  Dec(StreamSize, NumBytes);
  // Detect the BOM. If you know for a fact what the TStream data is encoded as,
  // you can assign the Encoding variable to the appropriate TEncoding object and
  // GetBufferEncoding() will check for that encoding's BOM only...
  SetLength(Buf, NumBytes);
  Encoding := nil;
  Dec(NumBytes, TEncoding.GetBufferEncoding(Buf, Encoding));
  // If any non-BOM bytes were read then rewind the stream back to that position...
  if NumBytes > 0 then
  begin
    AStream.Seek(-NumBytes, soCurrent);
    Dec(StreamPos, NumBytes);
    Inc(StreamSize, NumBytes);
  end else
  begin
    // Anything left to read after the BOM?
    if StreamSize < 1 then Exit;
  end;
  // Now read and decode whatever is left in the stream...
  StringStream := TStringStream.Create('', Encoding);
  try
    StringStream.CopyFrom(AStream, StreamSize);
    Result := StringStream.DataString;
  finally
    StringStream.Free;
  end;
end;

Apparently TStreamReader doesn't suffer from the same problem:
var
  StreamReader: TStreamReader;
begin
  StreamReader := TStreamReader.Create(AStream);
  try
    Result := StreamReader.ReadToEnd;
  finally
    StreamReader.Free;
  end;
end;
TStringList also works (thanks whosrdaddy):
var
  Strings: TStringList;
begin
  Strings := TStringList.Create;
  try
    Strings.LoadFromStream(AStream);
    Result := Strings.Text;
  finally
    Strings.Free;
  end;
end;
I also measured both methods and TStreamReader seems to be about twice as fast.

Decode UTF-8 encoded Cyrillic with Delphi 2007

I am working in Delphi 2007 (no Unicode support) and I am retrieving XML and JSON data from the Google Analytics API. Below is some UTF-8 encoded data that I get for a URL referral path:
ga:referralPath=/add/%D0%9F%D0%B8%D0%B6%D0%B0%D0%BC
When I decode it using this decoder it properly generates this:
ga:referralPath=/add/Пижам
Is there a function I can use in Delphi 2007 which will perform this decoding?
UPDATE
This data corresponds to a URL. Ultimately, what I want to do is store it in a SQL Server database (out of the box, with no character-set settings modified), and then be able to produce HTML pages with a working link to this page (note: I am only dealing with the URL referral path in this example; obviously a source would be needed to make a valid URL link).

D2007 supports Unicode, just not to the extent that D2009+ does. Unicode in D2007 is handled using WideString and the few RTL support functions that do exist.
The URL contains percent-encoded UTF-8 byte octets. Simply convert those sequences into their binary representation and then use UTF8Decode() to decode the UTF-8 data to a WideString. For example:
function HexToBits(C: Char): Byte;
begin
  case C of
    '0'..'9': Result := Byte(Ord(C) - Ord('0'));
    'a'..'f': Result := Byte(10 + (Ord(C) - Ord('a')));
    'A'..'F': Result := Byte(10 + (Ord(C) - Ord('A')));
  else
    raise Exception.Create('Invalid encoding detected');
  end;
end;

var
  sURL: String;
  sWork: UTF8String;
  C: Char;
  B: Byte;
  wDecoded: WideString;
  I: Integer;
begin
  sURL := 'ga:referralPath=/add/%D0%9F%D0%B8%D0%B6%D0%B0%D0%BC';
  sWork := sURL;
  I := 1;
  while I <= Length(sWork) do
  begin
    if sWork[I] = '%' then
    begin
      if (I+2) > Length(sWork) then
        raise Exception.Create('Incomplete encoding detected');
      sWork[I] := Char((HexToBits(sWork[I+1]) shl 4) or HexToBits(sWork[I+2]));
      Delete(sWork, I+1, 2);
    end;
    Inc(I);
  end;
  wDecoded := UTF8Decode(sWork);
  ...
end;

You can use the following code, which uses the Windows API:
function Utf8ToStr(const Source : string) : string;
var
  i, len : integer;
  TmpBuf : array of byte;
begin
  SetLength(Result, 0);
  i := MultiByteToWideChar(CP_UTF8, 0, @Source[1], Length(Source), nil, 0);
  if i = 0 then Exit;
  SetLength(TmpBuf, i * SizeOf(WCHAR));
  Len := MultiByteToWideChar(CP_UTF8, 0, @Source[1], Length(Source), @TmpBuf[0], i);
  if Len = 0 then Exit;
  i := WideCharToMultiByte(CP_ACP, 0, @TmpBuf[0], Len, nil, 0, nil, nil);
  if i = 0 then Exit;
  SetLength(Result, i);
  i := WideCharToMultiByte(CP_ACP, 0, @TmpBuf[0], Len, @Result[1], i, nil, nil);
  SetLength(Result, i);
end;
