HttpGetText(), autodetect charset, and convert source to UTF8 - delphi

I'm using HttpGetText with Synapse for Delphi 7 Professional to get the source of a web page - but feel free to recommend any component and code.
The goal is to save some time by 'unifying' non-ASCII characters to a single charset, so I can process it with the same Delphi code.
So I'm looking for something similar to "Select All and Convert To UTF without BOM in Notepad++", if you know what I mean. ANSI instead of UTF8 would also be okay.
The web pages come in three encodings: UTF-8, "ISO-8859-1 = Windows-1252 = ANSI", and plain HTML4 without a charset declaration, i.e. with HTML-encoded characters such as &Aring; in the content.
If I need to code a PHP page that does the conversion, that's fine too. Whatever is the least code / time.

When you retrieve a webpage, its Content-Type header (or sometimes a <meta> tag inside the HTML itself) tells you which charset is being used for the data. You would decode the data to Unicode using that charset, then you can encode the Unicode to whatever you need for your processing.
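As a rough sketch of the first half of that advice, the charset can be pulled out of the Content-Type value with plain string functions, which should also compile in Delphi 7. The helper names CharsetFromContentType and CodePageFromCharset are made up for this illustration, the mapping covers only the encodings mentioned in the question, and the <meta> tag fallback is not handled.

```pascal
program CharsetSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: extracts the charset token from a Content-Type value
// such as 'text/html; charset=ISO-8859-1'. Returns '' when none is present.
function CharsetFromContentType(const ContentType: string): string;
var
  P: Integer;
begin
  Result := '';
  P := Pos('charset=', LowerCase(ContentType));
  if P = 0 then
    Exit;
  Result := Copy(ContentType, P + Length('charset='), MaxInt);
  // Drop any parameters that follow the charset.
  P := Pos(';', Result);
  if P > 0 then
    Result := Copy(Result, 1, P - 1);
  // Strip whitespace and optional quotes, then normalize the case.
  Result := Trim(Result);
  Result := StringReplace(Result, '"', '', [rfReplaceAll]);
  Result := UpperCase(Result);
end;

// Hypothetical helper: maps a few charset names to Windows code page numbers.
// Returns 0 for anything it does not know, so the caller can pick a fallback.
function CodePageFromCharset(const Charset: string): Integer;
begin
  if (Charset = 'UTF-8') or (Charset = 'UTF8') then
    Result := 65001
  else if Charset = 'ISO-8859-1' then
    Result := 28591
  else if Charset = 'WINDOWS-1252' then
    Result := 1252
  else
    Result := 0;
end;

begin
  Writeln(CharsetFromContentType('text/html; charset=ISO-8859-1'));
  Writeln(CodePageFromCharset(CharsetFromContentType('text/html; charset="utf-8"')));
end.
```

The resulting code page number is what a conversion routine (for example one built on MultiByteToWideChar) would take as input.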

I instead did the reverse conversion directly after retrieving the HTML using GpTextStream. Making the documents conform to ISO-8859-1 made them processable using straight up Delphi, which saved quite a bit of code changes. On output all the data was converted to UTF-8 :)
Here's some code. Perhaps not the prettiest solution but it certainly got the job done in less time. Note that this is for the reverse conversion.
procedure UTF8FileTo88591(fileName: string);
const
  bufsize = 1024 * 1024;
var
  fs1, fs2: TFileStream;
  ts1, ts2: TGpTextStream;
  buf: PChar;
  siz: integer;

  procedure LG2(ss: string);
  begin
    // don't log for now.
  end;

begin
  fs1 := TFileStream.Create(fileName, fmOpenRead);
  fs2 := TFileStream.Create(fileName + '_ISO88591.txt', fmCreate);
  // compatible enough for my purposes with default 'Windows/Notepad' CP 1252 ANSI and Swe ANSI codepage, Latin1 etc.
  // also works for ASCII sources with htmlencoded accent chars, naturally
  try
    LG2('Files opened OK.');
    GetMem(buf, bufsize);
    ts1 := TGpTextStream.Create(fs1, tsaccRead, [], CP_UTF8);
    ts2 := TGpTextStream.Create(fs2, tsaccWrite, [], ISO_8859_1);
    try
      siz := ts1.Read(buf^, bufsize);
      LG2(inttostr(siz) + ' bytes read.');
      if siz > 0 then
        ts2.Write(buf^, siz);
    finally
      LG2('Bytes read and written OK.');
      FreeAndNil(ts1);
      FreeAndNil(ts2);
    end;
  finally
    FreeAndNil(fs1);
    FreeAndNil(fs2);
    FreeMem(buf);
    LG2('Everything freed OK.');
  end;
end; // UTF8FileTo88591

Related

How to correct encode a string to UTF8 in delphi10?

I am trying to replace some wildcards in a html code to send it via mailing.
Problem is when I try to replace the string with wildcard 'España$country$' with the string 'España', the result would be 'EspañaEspa?a'. I had the same problem before in Delphi 7 and I solved it by using the function 'UTF8Encode('España')' but it does not work on Delphi 10.
I have tried with 'España', 'UTF8Encode('España')' and 'AnsiToUTF8('España')'. I also tried to change the function StringReplace with ReplaceStr and ReplaceText, with same result.
......
var htmlText : TStringList;
......
htmlText := TStringList.Create;
htmlText.LoadFromFile('path.html');
htmlText.StringReplace(htmlText.Text, '$country$', UTF8Encode('España'), [rfReplaceAll]);
htmlText.SaveToFile('anotherpath.html');
......
This "stringreplace" along with "utf8encode" works well in Delphi7, showing 'España', but not in delphi 10, where you can read 'Espa?a' in the anotherpath.html.
The Delphi 7 string type, and consequently TStrings, did not support Unicode. Which is why you needed to use UTF8Encode.
Since Delphi 2009, Unicode is supported, string maps to UnicodeString, and TStrings is a collection of such strings. Note that UnicodeString is internally encoded as UTF-16, although that's not a detail that you need to be concerned with here.
Since you are now using a Delphi that supports Unicode, your code can be much simpler. You can now write it like this:
htmlText.Text := StringReplace(htmlText.Text, '$country$', 'España', [rfReplaceAll]);
Note that if you wish the file to be encoded as UTF-8 when you save it you need to specify that when you save it. Like this:
htmlText.SaveToFile('anotherpath.html', TEncoding.UTF8);
And you may also need to specify the encoding when loading the file in case it does not include a UTF-8 BOM:
htmlText.LoadFromFile('path.html', TEncoding.UTF8);

Problems with unicode text

I use delphi xe3 and i have small problem !! but i don't how to fix it..
problem is with this letter "è" this letter is inside a file path "C:\lène.mp4"
i save this path into a tstringlist , when i save this tstringlist to a file the path will be shown fine inside the txt file ..
but when I try to load it using tstringlist it is shown as "Ã¨" (whether shown in a memo or stored in a variable), so in this case it becomes an invalid path ..
but adding the path (string) directly to the tstringlist and then passing it to the path variable works fine
but loading from the file and passing to the path variable doesn't work (getting "Ã¨" instead of "è")
normally i will work with a lot of unicode strings, but for now i'm struggling with that letter
this will not work ..
var
resp : widestring;
xfiles : tstringlist;
begin
xfiles := tstringlist.Create;
try
xfiles.LoadFromFile('C:\Demo6-out.txt'); // this file contains only "C:\lène.mp4"
resp := (xfiles.Strings[0]);
// if i save xfiles to a file "path string" will be saved fine ... !
finally
xfiles.Free ;
end;
but like this it work ..
var
resp : widestring;
xfiles : tstringlist;
begin
xfiles := tstringlist.Create;
try
xfiles.Add('C:\lène.mp4');
resp := (xfiles.Strings[0]);
finally
xfiles.Free ;
end;
i'm really confused
First, you should be using UnicodeString instead of WideString. UnicodeString was introduced in Delphi 2009, and is much more efficient than WideString. The RTL uses UnicodeString (almost) everywhere it previously used AnsiString prior to 2009.
Second, something else introduced in Delphi 2009 is SysUtils.TEncoding, which is used for Byte<->Character conversions. Several existing RTL classes, including TStrings/TStringList, were updated to support TEncoding when converting bytes to/from strings.
What happens when you load a file into TStringList is that an internal TEncoding object is assigned to help convert the file's raw bytes to UnicodeString values. Which implementation of TEncoding it uses depends on the character encoding that LoadFromFile() thinks the file is using, if not explicitly stated (LoadFromFile() has an optional AEncoding parameter). If the file has a UTF BOM, a matching TEncoding is used, whether that be TEncoding.UTF8 or TEncoding.(BigEndian)Unicode. If no BOM is present, and the AEncoding parameter is not used, then TEncoding.Default is used, which represents the OS's default charset locale (and thus provides backwards compatibility with existing pre-2009 code).
When saving a TStringList to file, if the list was previously loaded from a file then the same TEncoding used for loading is used for saving, otherwise TEncoding.Default is used (again, for backwards compatibility), unless overwritten by the optional AEncoding parameter of SaveToFile().
In your first example, the input file is most likely encoded in UTF-8 without a BOM. So LoadFromFile() would use TEncoding.Default to interpret the file's bytes. "Ã¨" is the result of the UTF-8 encoded form of "è" (bytes 0xC3 0xA8) being misinterpreted as Windows-1252 instead of UTF-8. So, you would have to load the file like this instead:
xfiles.LoadFromFile('C:\Demo6-out.txt', TEncoding.UTF8);
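To see the misinterpretation in isolation, here is a small console sketch for Delphi 2009 or later that decodes the same two bytes both ways and prints the resulting UTF-16 code units. The program and the DumpCodeUnits helper are made up for illustration; only TEncoding and its GetString/GetEncoding methods come from the RTL.

```pascal
program MojibakeSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: prints the UTF-16 code units of S as 4-digit hex.
procedure DumpCodeUnits(const Caption, S: string);
var
  I: Integer;
begin
  Write(Caption, ':');
  for I := 1 to Length(S) do
    Write(' ', IntToHex(Ord(S[I]), 4));
  Writeln;
end;

var
  Raw: TBytes;
  Cp1252: TEncoding;
begin
  // The UTF-8 encoding of "è" (U+00E8).
  SetLength(Raw, 2);
  Raw[0] := $C3;
  Raw[1] := $A8;

  // Correct interpretation: one character.
  DumpCodeUnits('as UTF-8', TEncoding.UTF8.GetString(Raw));

  // Wrong interpretation: each byte becomes its own character.
  Cp1252 := TEncoding.GetEncoding(1252);
  try
    DumpCodeUnits('as Windows-1252', Cp1252.GetString(Raw));
  finally
    Cp1252.Free; // encodings obtained via GetEncoding are owned by the caller
  end;
end.
```

Decoded as UTF-8 the bytes should yield the single code unit 00E8; decoded as Windows-1252 they should yield 00C3 00A8, which is the "Ã¨" seen in the memo.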
In your second example, you are not loading a file or saving a file. You are simply assigning a string literal (which is Unicode-aware in D2009+) to a UnicodeString variable (inside of the TStringList) and then assigning that to a WideString variable (WideString and UnicodeString use the same UTF-16 character encoding; they just use different memory management). So there are no data conversions being performed.
Recommended reading: Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

How to avoid wrong characters reading UTF-8 emails with Indy 10.6 and Delphi 7

I am reading email with Indy 10.6.1.5187 and Delphi 7.
I have only problems with UTF-8 encoded emails, which translate to wrong characters in the customers computers.
I have read a lot about this problem, but I have not found a solution, except decoding the raw email by myself.
I wonder if there is a way to get correct emails when the sender encodes them in UTF-8.
Thanks.
UTF8 string is received like it was an Ansi string. You have to decode it.
You have to receive the message text in a UTF8String (which is just an AnsiString, i.e. String, in Delphi 7), then convert it from UTF-8 to AnsiString or (preferably) WideString. You can use the UTF8Decode() or Utf8ToAnsi() function to decode the email body.
If you use the UTF8Decode() function, you will still need WideString-aware controls to display the received message.
If you use the Utf8ToAnsi() function, the result might not contain characters that are not part of the users local codepage.
So you will use something like:
var
ustrEmailBody: UTF8String;
wstrDecoded: WideString;
begin
...
// ustrEmailBody now contains the email body
wstrDecoded := UTF8Decode(ustrEmailBody);
SomeUnicodeAwareMemo.Text := wstrDecoded;
or
var
ustrEmailBody: UTF8String;
astrDecoded: AnsiString;
begin
...
// ustrEmailBody now contains the email body
astrDecoded := Utf8ToAnsi(ustrEmailBody);
SomeMemo.Text := astrDecoded; // the memo might display '?' in place of unknown characters
For further information, see the documentation of the UTF8Decode() and Utf8ToAnsi() functions in the Delphi help.

How to convert text to UTF-8 Delphi

I have a function that returns an HTML page from Internet, but the Cyrillic symbols are displayed with some others unknown characters.
How can I convert the text and be able to see the normal Cyrillic symbols?
I'm with Delphi 2009 and im using indy to send HTTP request and get back response from the server.
(i think i have indy9)
This is how i take the HTML page
http := TIDHttp.Create(nil);
http.HandleRedirects := true;
http.ReadTimeout := 5000;
http.Request.ContentType:='multipart/form-data';
param:=TIdMultiPartFormDataStream.Create;
param.AddFormField('subcat_id','501');
param.AddFormField('reg_id','1');
text:=http.Post('example.com',param);
I don't know if indy has any functions that gets the page with any unicode.
You have not given enough information, but I will try to suggest this: If possible, load the data in a Stream and then create a StringList and load it like this:
var
MS:TMemoryStream;
SL: TStringList;
(...)
begin
MS:=TMemoryStream.Create;
SL:=TStringList.Create;
// Load your string to MS
SL.LoadFromStream(MS, TEncoding.UTF8);
(...)
MS.Free;
SL.Free;
end;
Comment if there is a problem.
Your question title seems to be out of sync with the question body. Assuming you want to decode a UTF-8 encoded HTML page, your friend is the function UTF8Decode. The opposite operation is done by UTF8Encode. These functions were available as early as Delphi 7 (correct me if D6 applies too). Check out the "See Also" section of the article; there are buffer-handling entry points for more convenience too.
Indy 9 does not support Delphi 2009. Make sure you are using the latest Indy 10 release instead. In Indy 10, the version of TIdHTTP.Post() (and TIdHTTP.Get()) that returns a String will automatically decode the data to Unicode using whatever charset is specified by the server, either in the HTTP Content-Type header, or in a <meta> tag within the HTML itself.

How can a text file be converted from ANSI to UTF-8 with Delphi 7?

I have written a program with Delphi 7 which searches for *.srt files on a hard drive. This program lists the path and name of these files in a memo. Now I need to convert these files from ANSI to UTF-8, but I haven't succeeded.
The Utf8Encode function takes a WideString string as parameter and returns a Utf-8 string.
Sample:
procedure ConvertANSIFileToUTF8File(AInputFileName, AOutputFileName: TFileName);
var
Strings: TStrings;
begin
Strings := TStringList.Create;
try
Strings.LoadFromFile(AInputFileName);
Strings.Text := UTF8Encode(Strings.Text);
Strings.SaveToFile(AOutputFileName);
finally
Strings.Free;
end;
end;
Take a look at GpTextStream which looks like it works with Delphi 7. It has the ability to read/write unicode files in older versions of Delphi (although does work with Delphi 2009) and should help with your conversion.
var
Latin1Encoding: TEncoding;
begin
Latin1Encoding := TEncoding.GetEncoding(28591);
try
MyTStringList.SaveToFile('some file.txt', Latin1Encoding);
finally
Latin1Encoding.Free;
end;
end;
Please read the whole answer before you start coding.
The proper answer to the question - and it is not the easy one - basically consists of three steps:
1. You have to determine the ANSI code page used on your computer. You can achieve this by using the GetACP() function from the Windows API. (Important: you have to retrieve the code page as soon as possible after the file name retrieval, because it can be changed by the user.)
2. You must convert your ANSI string to Unicode by calling the MultiByteToWideChar() Windows API function with the correct CodePage parameter (retrieved in the previous step). After this step you have a UTF-16 string (practically a WideString) containing the file name list.
3. You have to convert the Unicode string to UTF-8 using UTF8Encode() or the WideCharToMultiByte() Windows API. This function will return the UTF-8 string you need.
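The three steps above can be sketched roughly as follows. The function name AnsiToUtf8ViaCodePage is invented for this example; GetACP, MultiByteToWideChar and UTF8Encode are the calls named in the steps. Error handling is reduced to returning an empty string, which a real program would probably want to replace.

```pascal
program AnsiToUtf8Sketch;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Hypothetical helper: converts S from the given ANSI code page to UTF-8.
// Returns '' if S is empty or the conversion fails.
function AnsiToUtf8ViaCodePage(const S: AnsiString; CodePage: UINT): UTF8String;
var
  W: WideString;
  Len: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // First call: ask how many UTF-16 code units the result needs.
  Len := MultiByteToWideChar(CodePage, 0, PAnsiChar(S), Length(S), nil, 0);
  if Len = 0 then
    Exit;
  SetLength(W, Len);
  // Second call: perform the ANSI -> UTF-16 conversion (step 2).
  MultiByteToWideChar(CodePage, 0, PAnsiChar(S), Length(S), PWideChar(W), Len);
  // UTF-16 -> UTF-8 (step 3).
  Result := UTF8Encode(W);
end;

var
  Utf8: UTF8String;
begin
  // Step 1: GetACP returns the current ANSI code page of the system.
  Utf8 := AnsiToUtf8ViaCodePage('abc', GetACP);
  Writeln(Length(Utf8), ' byte(s) after conversion');
end.
```

Passing the code page as a parameter, instead of calling GetACP inside the helper, keeps the function usable for text whose code page is known from elsewhere.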
However, although this solution returns a UTF-8 string containing the input ANSI string, it is probably not the best way to solve your problem: the file names may already have been corrupted when the ANSI functions returned them, so proper file names are not guaranteed.
The proper solution to your problem is far more complicated:
If you want to be sure that your file name list is exactly clean, you have to make sure it won't get converted to ANSI at all. You can do this by explicitly using the "W" versions of the file handling APIs. In this case - of course - you cannot use TFileStream and other ANSI file handling objects; you have to call the Windows API directly.
It is not that hard, but if you already have a complex framework built on e.g. TFileStream it could be a bit of a pain in the #ss. In this case the best solution is to create a TStream descendant that uses the appropriate API's.
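One possible shape for such a TStream descendant is sketched below. The unit name and the TWideFileStream class are invented for this illustration, and the stream is read-only to keep it short. It assumes that THandleStream accepts the handle returned by CreateFileW (in Delphi 7 the constructor parameter is declared as Integer, so the value is converted implicitly).

```pascal
unit WideFileStreamSketch;

interface

uses
  Windows, SysUtils, Classes;

type
  // Hypothetical read-only stream that opens its file through CreateFileW,
  // so the file name never passes through the ANSI code page.
  TWideFileStream = class(THandleStream)
  private
    FOpened: Boolean;
  public
    constructor Create(const FileName: WideString);
    destructor Destroy; override;
  end;

implementation

constructor TWideFileStream.Create(const FileName: WideString);
var
  H: THandle;
begin
  H := CreateFileW(PWideChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
  if H = INVALID_HANDLE_VALUE then
    RaiseLastOSError;
  inherited Create(H);
  // Remember that we own a valid handle, so Destroy knows to close it.
  FOpened := True;
end;

destructor TWideFileStream.Destroy;
begin
  // THandleStream does not close the handle itself.
  if FOpened then
    FileClose(Handle);
  inherited Destroy;
end;

end.
```

Existing code that accepts a TStream can then be handed a TWideFileStream in place of a TFileStream, provided the file name is kept in a WideString all the way from the directory scan.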
I hope my answer helps you or anyone who has to deal with the same problem. (I had to not so long ago.)
I did only this:
procedure TForm1.FormCreate(Sender: TObject);
begin
Strings := TStringList.Create;
end;
procedure TForm1.Button3Click(Sender: TObject);
begin
Strings.Text := UTF8Encode(Memo1.Text);
Strings.SaveToFile('new.txt');
end;
Verified with Notepad++ UTF8 without BOM
Did you mean ASCII?
UTF-8 is backwards compatible with ASCII.
http://en.wikipedia.org/wiki/UTF-8
