I am trying to get this code to work. It's a standard search and replace function.
I get no errors at all but nothing changes in the text file for some reason.
Here is the full code:
procedure FileReplaceString(const FileName, searchstring, replacestring: string);
var
fs: TFileStream;
S: string;
begin
fs := TFileStream.Create(FileName, fmOpenread or fmShareDenyNone);
try
SetLength(S, fs.Size);
fs.ReadBuffer(S[1], fs.Size);
finally
fs.Free;
end;
S := StringReplace(S, SearchString, replaceString, [rfReplaceAll, rfIgnoreCase]);
fs := TFileStream.Create(FileName, fmCreate);
try
fs.WriteBuffer(S[1], Length(S));
finally
fs.Free;
end;
end;
procedure TForm1.Button1Click(Sender: TObject);
var Path, FullPath:string;
begin
Path:= ExtractFilePath(Application.ExeName);
FullPath:= Path + 'test.txt';
FileReplaceString(FullPath,'changethis','withthis');
end;
The reason is that S, searchstring, and replacestring are Unicode strings (so, e.g., "Test" is 54 00 65 00 73 00 74 00) while the text file probably is a UTF-8 or ANSI file (so, e.g., "Test" is 54 65 73 74).
This means that the value stored in S will be highly corrupt (you take the bytes of a UTF-8 text and interpret them as the bytes of a Unicode text)! In the Test example, you will get 敔瑳?? where the two last characters are random (why?).
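The misinterpretation is easy to reproduce in isolation. This is a minimal sketch, assuming a Unicode-enabled Delphi (2009 or later); the helper name BytesAsUtf16 is made up for illustration and is not part of the RTL.

```pascal
program Utf16Misread;
{$APPTYPE CONSOLE}

uses
  System.SysUtils; // plain SysUtils on versions before XE2

// Hypothetical helper: treats raw bytes as UTF-16LE code units, which is
// effectively what reading file bytes straight into a UnicodeString does.
function BytesAsUtf16(const Bytes: TBytes): string;
begin
  Result := TEncoding.Unicode.GetString(Bytes);
end;

var
  S: string;
begin
  // 'Test' as stored in an ANSI or UTF-8 file: 54 65 73 74
  S := BytesAsUtf16(TBytes.Create($54, $65, $73, $74));
  // Four bytes collapse into two UTF-16 code units
  Writeln(Length(S), ' ', IntToHex(Ord(S[1]), 4), ' ', IntToHex(Ord(S[2]), 4));
  // prints: 2 6554 7473
end.
```

In the question's code, S is sized to fs.Size characters, which is 2 × fs.Size bytes, but only fs.Size bytes are read into it. The second half of the buffer is never written.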
To test this hypothesis, simply declare S as AnsiString instead, then it should work.
Of course, if you need Unicode support, you need to do some UTF-8 encoding/decoding. The simplest solution to your problem would be to use the TStringList; then you get everything you need for free.
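A TStringList version could look roughly like the sketch below. It assumes the file either carries a BOM or is in the default ANSI code page, because that is how LoadFromFile decides how to decode it when no encoding is given. If the encoding matters, pass an explicit TEncoding to both LoadFromFile and SaveToFile. Note that round-tripping through Text may normalise line breaks.

```pascal
program ReplaceInFile;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes; // SysUtils, Classes before XE2

procedure FileReplaceString(const FileName, SearchString, ReplaceString: string);
var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    // Decodes the file into a UnicodeString (BOM-detected or default ANSI)
    Lines.LoadFromFile(FileName);
    Lines.Text := StringReplace(Lines.Text, SearchString, ReplaceString,
      [rfReplaceAll, rfIgnoreCase]);
    // Encodes the text again while writing it back
    Lines.SaveToFile(FileName);
  finally
    Lines.Free;
  end;
end;

var
  FullPath: string;
begin
  FullPath := ExtractFilePath(ParamStr(0)) + 'test.txt';
  if FileExists(FullPath) then
    FileReplaceString(FullPath, 'changethis', 'withthis');
end.
```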
Related
I have this string where I need to make some characters capital so I use that UpCase command... But what if I need to make small character from capital one? What do I use in that case?
UpCase is not locale aware and only handles the 26 letters of the English language. If that is really all you need then you can create equivalent LoCase functions like this:
function LoCase(ch: AnsiChar): AnsiChar; overload;
begin
case ch of
'A'..'Z':
Result := AnsiChar(Ord(ch) + Ord('a')-Ord('A'));
else
Result := ch;
end;
end;
function LoCase(ch: WideChar): WideChar; overload;
begin
case ch of
'A'..'Z':
Result := WideChar(Ord(ch) + Ord('a')-Ord('A'));
else
Result := ch;
end;
end;
You should learn how to find the solution on your own, not how to use Google or stackoverflow :)
You have the source of the UpCase function in System.pas. Take a look at how it works. All this does is subtract 32 from the lower case characters. If you want the opposite, add 32 instead of subtracting it. The Delphi help will tell you what Dec or Inc does.
var
S: string;
I: Integer;
begin
S := 'ABCd';
for I := 1 to Length(S) do
if CharInSet(S[I], ['A'..'Z']) then // if you know that input is upper case, you could skip this line
Inc(S[I], 32); // this line converts to lower case
end;
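For whole strings, SysUtils already ships two functions that cover both cases: LowerCase only touches the ASCII letters 'A'..'Z', while AnsiLowerCase follows the locale. A short sketch, assuming a Unicode Delphi running on Windows:

```pascal
program LowerDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils; // plain SysUtils on versions before XE2

begin
  // ASCII-only conversion, like the hand-written loop above
  Writeln(LowerCase('ABCd'));                  // prints: abcd
  // U+00C9 (E with acute) is outside A..Z, so LowerCase leaves it alone
  Writeln(LowerCase(#$00C9) = #$00C9);         // prints: TRUE
  // The locale-aware variant maps it to U+00E9
  Writeln(AnsiLowerCase(#$00C9) = #$00E9);     // prints: TRUE
end.
```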
In the olden times, i had a function that would convert a WideString to an AnsiString of the specified code-page:
function WideStringToString(const Source: WideString; CodePage: UINT): AnsiString;
...
begin
...
// Convert source UTF-16 string (WideString) to the destination using the code-page
strLen := WideCharToMultiByte(CodePage, 0,
PWideChar(Source), Length(Source), //Source
PAnsiChar(cpStr), strLen, //Destination
nil, nil);
...
end;
And everything worked. I passed the function a unicode string (i.e. UTF-16 encoded data) and converted it to an AnsiString, with the understanding that the bytes in the AnsiString represented characters from the specified code-page.
For example:
TUnicodeHelper.WideStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 1252);
would return the Windows-1252 encoded string:
The qùíçk brown fôx jumped ovêr the lázÿ dog
Note: Information was of course lost during the conversion from the full Unicode character set to the limited confines of the Windows-1252 code page:
Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ (before)
The qùíçk brown fôx jumped ovêr the lázÿ dog (after)
But the Windows WideCharToMultiByte does a pretty good job of best-fit mapping, as it is designed to do.
Now the after times
Now we are in the after times. WideString is now a pariah, with UnicodeString being the goodness. It's an inconsequential change; as the Windows function only needed a pointer to a series of WideChar anyway (which a UnicodeString also is). So we change the declaration to use UnicodeString instead:
function WideStringToString(const Source: UnicodeString; CodePage: UINT): AnsiString;
begin
...
end;
Now we come to the return value. i have an AnsiString that contains the bytes:
54 68 65 20 71 F9 ED E7 The qùíç
6B 20 62 72 6F 77 6E 20 k brown
66 F4 78 20 6A 75 6D 70 fôx jump
65 64 20 6F 76 EA 72 20 ed ovêr
74 68 65 20 6C E1 7A FF the lázÿ
20 64 6F 67 dog
In the olden times that was fine. I kept track of what code-page the AnsiString actually contained; i had to remember that the returned AnsiString was not encoded using the computer's locale (e.g. Windows 1258), but instead is encoded using another code-page (the CodePage code page).
But in Delphi XE6 an AnsiString also secretly contains the codepage:
codePage: 1258
length: 44
value: The qùíçk brown fôx jumped ovêr the lázÿ dog
This code-page is wrong. Delphi is specifying the code-page of my computer, rather than the code-page that the string is. Technically this is not a problem, i always understood that the AnsiString was in a particular code-page, i just had to be sure to pass that information along.
So when i wanted to decode the string, i had to pass along the code-page with it:
s := TUnicodeHelper.StringToWideString(s, 1252);
with
function StringToWideString(s: AnsiString; CodePage: UINT): UnicodeString;
begin
...
MultiByteToWideChar(...);
...
end;
Then one person screws everything up
The problem was that in the olden times i declared a type called Utf8String:
type
Utf8String = type AnsiString;
Because it was common enough to have:
function TUnicodeHelper.WideStringToUtf8(const s: UnicodeString): Utf8String;
begin
Result := WideStringToString(s, CP_UTF8);
end;
and the reverse:
function TUnicodeHelper.Utf8ToWideString(const s: Utf8String): UnicodeString;
begin
Result := StringToWideString(s, CP_UTF8);
end;
Now in XE6 i have a function that takes a Utf8String. If some existing code somewhere were to take a UTF-8 encoded AnsiString, and try to convert it to UnicodeString using Utf8ToWideString, it would fail:
s: AnsiString;
s := UnicodeStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', CP_UTF8);
...
ws: UnicodeString;
ws := Utf8ToWideString(s); //Delphi will treat s as CP1252, and convert it to UTF8
Or worse, is the breadth of existing code that does:
s: Utf8String;
s := UnicodeStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', CP_UTF8);
The returned string will become totally mangled:
the function returns AnsiString(1252) (AnsiString tagged as encoded using the current codepage)
the return result is being stored in an AnsiString(65001) string (Utf8String)
Delphi converts the UTF-8 encoded string into UTF-8 as though it was 1252.
How to move forward
Ideally my UnicodeStringToString(string, codePage) function (which returns an AnsiString) could set the CodePage inside the string to match the actual code-page using something like SetCodePage:
function UnicodeStringToString(s: UnicodeString; CodePage: UINT): AnsiString;
begin
...
WideCharToMultiByte(...);
...
//Adjust the codepage contained in the AnsiString to match reality
//SetCodePage(Result, CodePage, False); SetCodePage only works on RawByteString
if Length(Result) > 0 then
PStrRec(PByte(Result) - SizeOf(StrRec)).codePage := CodePage;
end;
Except that manually mucking around with the internal structure of an AnsiString is horribly dangerous.
So what about returning RawByteString?
It has been said, over and over, by a lot of people who aren't me, that RawByteString is meant to be the universal recipient; it wasn't meant to be used as a return parameter:
function UnicodeStringToString(s: UnicodeString; CodePage: UINT): RawByteString;
begin
...
WideCharToMultiByte(...);
...
//Adjust the codepage contained in the AnsiString to match reality
SetCodePage(Result, CodePage, False); //SetCodePage only works on RawByteString
end;
This has the virtue of being able to use the supported and documented SetCodePage.
But if we're going to cross a line, and start returning RawByteString, surely Delphi already has a function that can convert a UnicodeString to a RawByteString string and vice versa:
function WideStringToString(const s: UnicodeString; CodePage: UINT): RawByteString;
begin
Result := SysUtils.Something(s, CodePage);
end;
function StringToWideString(const s: RawByteString; CodePage: UINT): UnicodeString;
begin
Result := SysUtils.SomethingElse(s, CodePage);
end;
But what is it?
Or what else should i do?
This was a long-winded set of background for a trivial question. The real question is, of course, what should i be doing instead? There is a lot of code out there that depends on the UnicodeStringToString and the reverse.
tl;dr:
I can convert a UnicodeString to UTF-8 by doing:
Utf8Encode('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ');
and i can convert a UnicodeString to the current code-page by using:
AnsiString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ');
But how do i convert a UnicodeString to an arbitrary (unspecified) code-page?
My feeling is that since everything really is an AnsiString:
Utf8String = AnsiString(65001);
RawByteString = AnsiString(65535);
i should bite the bullet, bust open the AnsiString structure, and poke the correct code-page into it:
function StringToAnsi(const s: UnicodeString; CodePage: UINT): AnsiString;
begin
LocaleCharsFromUnicode(CodePage, ..., s, ...);
...
if Length(Result) > 0 then
PStrRec(PByte(Result) - SizeOf(StrRec)).codePage := CodePage;
end;
Then the rest of the VCL will fall in line.
In this particular case, using RawByteString is an appropriate solution:
function WideStringToString(const Source: UnicodeString; CodePage: UINT): RawByteString;
var
strLen: Integer;
begin
Result := ''; // make sure an empty Source yields an empty result
strLen := LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), nil, 0, nil, nil);
if strLen > 0 then
begin
SetLength(Result, strLen);
LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), PAnsiChar(Result), strLen, nil, nil);
SetCodePage(Result, CodePage, False);
end;
end;
This way, the RawByteString holds the codepage, and assigning the RawByteString to any other string type, whether that be AnsiString or UTF8String or whatever, will allow the RTL to automatically convert the RawByteString data from its current codepage to the destination string's codepage (which includes conversions to UnicodeString).
If you absolutely must return an AnsiString (which I do not recommend), you can still use SetCodePage() via a typecast:
function WideStringToString(const Source: UnicodeString; CodePage: UINT): AnsiString;
var
strLen: Integer;
begin
Result := ''; // make sure an empty Source yields an empty result
strLen := LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), nil, 0, nil, nil);
if strLen > 0 then
begin
SetLength(Result, strLen);
LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), PAnsiChar(Result), strLen, nil, nil);
SetCodePage(PRawByteString(@Result)^, CodePage, False);
end;
end;
The reverse is much easier, just use the codepage already stored in a (Ansi|RawByte)String (just make sure those codepages are always accurate), since the RTL already knows how to retrieve and use the codepage for you:
function StringToWideString(const Source: AnsiString): UnicodeString;
begin
Result := UnicodeString(Source);
end;
function StringToWideString(const Source: RawByteString): UnicodeString;
begin
Result := UnicodeString(Source);
end;
That being said, I would suggest dropping the helper functions altogether and just use typed strings instead. Let the RTL handle conversions for you:
type
Win1252String = type AnsiString(1252);
var
s: UnicodeString;
a: Win1252String;
begin
s := 'Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ';
a := Win1252String(s);
s := UnicodeString(a);
end;
var
s: UnicodeString;
u: UTF8String;
begin
s := 'Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ';
u := UTF8String(s);
s := UnicodeString(u);
end;
I think that returning a RawByteString is probably as good as you'll get. You could do it using AnsiString as you outlined but RawByteString captures the intent better. In this scenario a RawByteString morally counts as a parameter in the sense of the official Embarcadero advice. It is just an output rather than an input. The real key is not to use it as a variable.
You could code it like this:
function MBCSString(const s: UnicodeString; CodePage: Word): RawByteString;
var
enc: TEncoding;
bytes: TBytes;
begin
enc := TEncoding.GetEncoding(CodePage);
try
bytes := enc.GetBytes(s);
SetLength(Result, Length(bytes));
Move(Pointer(bytes)^, Pointer(Result)^, Length(bytes));
SetCodePage(Result, CodePage, False);
finally
enc.Free;
end;
end;
Then
var
s: AnsiString;
....
s := MBCSString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 1252);
Writeln(StringCodePage(s));
s := MBCSString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 1251);
Writeln(StringCodePage(s));
s := MBCSString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 65001);
Writeln(StringCodePage(s));
outputs 1252, 1251, and then 65001 as you would expect.
And you could use LocaleCharsFromUnicode if you prefer. Of course, you need to take its documentation with a pinch of salt: LocaleCharsFromUnicode is a wrapper for the WideCharToMultiByte function. Amazing that text was ever written since LocaleCharsFromUnicode surely only exists to be cross-platform.
However, I wonder if you may be making a mistake in attempting to keep ANSI encoded text in AnsiString variables in your program. Normally you would encode to ANSI as late as possible (at the interop boundary), and likewise decode as early as possible.
If you simply have to do this then perhaps there is a better solution that avoids the dreaded AnsiString completely. Instead of storing the text in an AnsiString, store it in TBytes. You already have data structures that keep track of encoding, so why not keep them. Replace the record that contains code page and AnsiString with one containing code page and TBytes. Then you would have no fear of anything recoding your text behind your back. And your code will be ready to use on the mobile compilers.
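As a rough illustration of that idea, the record below pairs a code page with the encoded bytes. The names TEncodedText, EncodeText and DecodeText are invented for this sketch; only TEncoding.GetEncoding, GetBytes and GetString are actual RTL calls.

```pascal
program EncodedTextDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils; // plain SysUtils on versions before XE2

type
  // Hypothetical container: the bytes plus the code page they are encoded in
  TEncodedText = record
    CodePage: Integer;
    Data: TBytes;
  end;

function EncodeText(const S: string; CodePage: Integer): TEncodedText;
var
  Enc: TEncoding;
begin
  Enc := TEncoding.GetEncoding(CodePage); // caller owns this instance
  try
    Result.CodePage := CodePage;
    Result.Data := Enc.GetBytes(S);
  finally
    Enc.Free;
  end;
end;

function DecodeText(const T: TEncodedText): string;
var
  Enc: TEncoding;
begin
  Enc := TEncoding.GetEncoding(T.CodePage);
  try
    Result := Enc.GetString(T.Data);
  finally
    Enc.Free;
  end;
end;

var
  T: TEncodedText;
begin
  T := EncodeText('f'#$00F4'x', 1252);          // "fox" with o-circumflex
  Writeln(Length(T.Data));                      // prints: 3
  Writeln(DecodeText(T) = 'f'#$00F4'x');        // prints: TRUE
end.
```

Because TBytes carries no code-page tag, the RTL never converts it implicitly; every conversion is an explicit call.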
Grovelling through System.pas, i found the built-in function SetAnsiString that does what i want:
procedure SetAnsiString(Dest: _PAnsiStr; Source: PWideChar; Length: Integer; CodePage: Word);
It's also important to note that this function does push the CodePage into the internal StrRec structure for me:
PStrRec(PByte(Dest) - SizeOf(StrRec)).codePage := CodePage;
This allows me to write something like:
function WideStringToString(const Source: UnicodeString; DestinationCodePage: Word): AnsiString;
var
strLen: Integer;
begin
strLen := Length(Source);
if strLen = 0 then
begin
Result := '';
Exit;
end;
//Delphi XE6 has a function to convert a unicode string to a tagged AnsiString
SetAnsiString(@Result, @Source[1], strLen, DestinationCodePage);
end;
So when i call:
actual := WideStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 850);
i get the resulting AnsiString:
codePage: $0352 (850)
elemSize: $0001 (1)
refCnt: $00000001 (1)
length: $0000002C (44)
contents: 'The qùíçk brown fôx jumped ovêr the láZÿ dog'
An AnsiString with the appropriate code-page already stuffed in the secret codePage member.
The other way
class function TUnicodeHelper.ByteStringToUnicode(const Source: RawByteString; CodePage: UINT): UnicodeString;
var
wideLen: Integer;
dw: DWORD;
begin
{
See http://msdn.microsoft.com/en-us/library/dd317756.aspx
Code Page Identifiers
for a list of code pages supported in Windows.
Some common code pages are:
CP_UTF8 (65001) utf-8 "Unicode (UTF-8)"
CP_ACP (0) The system default Windows ANSI code page.
CP_OEMCP (1) The current system OEM code page.
1252 Windows-1252 "ANSI Latin 1; Western European (Windows)", this is what most of us in north america use in Windows
437 IBM437 "OEM United States", this is your "DOS fonts"
850 ibm850 "OEM Multilingual Latin 1; Western European (DOS)", the format accepted by Fincen for LCTR/STR
28591 iso-8859-1 "ISO 8859-1 Latin 1; Western European (ISO)", Windows-1252 is a super-set of iso-8859-1, adding things like euro symbol, bullet and ellipses
20127 us-ascii "US-ASCII (7-bit)"
}
if Length(Source) = 0 then
begin
Result := '';
Exit;
end;
// Determine real size of final, string in symbols
// wideLen := MultiByteToWideChar(CodePage, 0, PAnsiChar(Source), Length(Source), nil, 0);
wideLen := UnicodeFromLocaleChars(CodePage, 0, PAnsiChar(Source), Length(Source), nil, 0);
if wideLen = 0 then
begin
dw := GetLastError;
raise EConvertError.Create('[StringToWideString] Could not get wide length of UTF-16 string. Error '+IntToStr(dw)+' ('+SysErrorMessage(dw)+')');
end;
// Allocate memory for UTF-16 string
SetLength(Result, wideLen);
// Convert source string to UTF-16 (WideString)
// wideLen := MultiByteToWideChar(CodePage, 0, PAnsiChar(Source), Length(Source), PWChar(wideStr), wideLen);
wideLen := UnicodeFromLocaleChars(CodePage, 0, PAnsiChar(Source), Length(Source), PWChar(Result), wideLen);
if wideLen = 0 then
begin
dw := GetLastError;
raise EConvertError.Create('[StringToWideString] Could not convert string to UTF-16. Error '+IntToStr(dw)+' ('+SysErrorMessage(dw)+')');
end;
end;
Note: Any code released into public domain. No attribution required.
I'm generating texture atlases for rendering Unicode texts in my app. Source texts are stored in ANSI codepages (1250, 1251, 1254, 1257, etc). I want to be able to generate all the symbols from each ANSI codepage.
Here is the outline of the code I would expect to have:
for I := 0 to 255 do
begin
anChar := AnsiChar(I); //obtain AnsiChar
//Apply codepage without converting the chars
//<<--- this part does not work, showing:
//"E2033 Types of actual and formal var parameters must be identical"
SetCodePage(anChar, aCodepages[K], False);
//Assign AnsiChar to UnicodeChar (automatic conversion)
uniChar := anChar;
//Here we get Unicode character index
uniCode := Ord(uniChar);
end;
The code above does not work (E2033) and I'm not sure it is a proper solution at all. Perhaps there's a much shorter version.
What is the proper way of converting AnsiChar into Unicode with specific codepage in mind?
I would do it like this:
function AnsiCharToWideChar(ac: AnsiChar; CodePage: UINT): WideChar;
begin
if MultiByteToWideChar(CodePage, 0, @ac, 1, @Result, 1) <> 1 then
RaiseLastOSError;
end;
I think you should avoid using strings for what is in essence a character operation. If you know up front which code pages you need to support then you can hard code the conversions into a lookup table expressed as an array constant.
Note that all the characters that are defined in the ANSI code pages map to Unicode characters from the Basic Multilingual Plane and so are represented by a single UTF-16 character. Hence the size assumptions of the code above.
However, the assumption that you are making, and that this answer persists, is that a single byte represents a character in an ANSI character set. That's a valid assumption for many character sets, for example the single byte western character sets like 1252. But there are character sets like 932 (Japanese), 949 (Korean) etc. that are double byte character sets. Your entire approach breaks down for those code pages. My guess is that you only wish to support single byte character sets.
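If you want to guard against double byte code pages at run time, one option on Windows is to ask GetCPInfo for the maximum character size. This is only a sketch; IsSingleByteCodePage is a name made up here, and it reports False both for multi-byte code pages and for code pages the system does not know.

```pascal
program CodePageCheck;
{$APPTYPE CONSOLE}

uses
  Winapi.Windows; // plain Windows on versions before XE2

// Hypothetical helper: True only when every character of the code page
// is encoded as exactly one byte.
function IsSingleByteCodePage(CodePage: UINT): Boolean;
var
  Info: TCPInfo;
begin
  Result := False;
  if GetCPInfo(CodePage, Info) then
    Result := Info.MaxCharSize = 1;
end;

begin
  Writeln(IsSingleByteCodePage(1252)); // prints: TRUE
  Writeln(IsSingleByteCodePage(932));  // prints: FALSE
end.
```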
If you are writing cross-platform code then you can replace MultiByteToWideChar with UnicodeFromLocaleChars.
You can also do it in one step for all characters. Here is an example for codepage 1250:
var
encoding: TEncoding;
bytes: TBytes;
unicode: TArray<Word>;
I: Integer;
S: string;
begin
SetLength(bytes, 256);
for I := 0 to 255 do
bytes[I] := I;
SetLength(unicode, 256);
encoding := TEncoding.GetEncoding(1250); // change codepage as needed
try
S := encoding.GetString(bytes);
for I := 0 to 255 do
unicode[I] := Word(S[I+1]); // as long as strings are 1-based
finally
encoding.Free;
end;
end;
Here is the code I have found to be working well:
var
I: Byte;
anChar: AnsiString;
Tmp: RawByteString;
uniChar: Char;
uniCode: Word;
begin
for I := 0 to 255 do
begin
anChar := AnsiChar(I);
Tmp := anChar;
SetCodePage(Tmp, aCodepages[K], False);
uniChar := UnicodeString(Tmp)[1];
uniCode := Word(uniChar);
<...snip...>
end;
When I try the code below there seems to be different output in XE2 compared to D2009.
procedure TForm1.Button1Click(Sender: TObject);
var Outfile:textfile;
myByte: Byte;
begin
assignfile(Outfile,'test_chinese.txt');
Rewrite(Outfile);
for myByte in TEncoding.UTF8.GetPreamble do write(Outfile, AnsiChar(myByte));
//This is the UTF-8 BOM
Writeln(Outfile,utf8string('总结'));
Writeln(Outfile,'°C');
Closefile(Outfile);
end;
Compiling with XE2 on a Windows 8 PC gives in WordPad
??
C
txt hex code: EF BB BF 3F 3F 0D 0A B0 43 0D 0A
Compiling with D2009 on a Windows XP PC gives in Wordpad
总结
°C
txt hex code: EF BB BF E6 80 BB E7 BB 93 0D 0A B0 43 0D 0A
My question is why it differs, and how can I save Chinese characters to a text file using the old text file I/O?
Thanks!
In XE2 onwards, AssignFile() has an optional CodePage parameter that sets the codepage of the output file:
function AssignFile(var F: File; FileName: String; [CodePage: Word]): Integer; overload;
Write() and Writeln() both have overloads that support UnicodeString and WideChar inputs.
So, you can create a file that has its codepage set to CP_UTF8, and then Write/ln() will automatically convert Unicode strings to UTF-8 when writing them to the file.
The downside is that you will not be able to write the UTF-8 BOM using AnsiChar values anymore, because the individual bytes will get converted to UTF-8 and thus not be written correctly. You can get around that by writing the BOM as a single Unicode character (which is what it really is: U+FEFF) instead of as individual bytes.
This works in XE2:
procedure TForm1.Button1Click(Sender: TObject);
var
Outfile: TextFile;
begin
AssignFile(Outfile, 'test_chinese.txt', CP_UTF8);
Rewrite(Outfile);
//This is the UTF-8 BOM
Write(Outfile, #$FEFF);
Writeln(Outfile, '总结');
Writeln(Outfile, '°C');
CloseFile(Outfile);
end;
With that said, if you want something that is more compatible and reliable between D2009 and XE2, use TStreamWriter instead:
procedure TForm1.Button1Click(Sender: TObject);
var
Outfile: TStreamWriter;
begin
Outfile := TStreamWriter.Create('test_chinese.txt', False, TEncoding.UTF8);
try
Outfile.WriteLine('总结');
Outfile.WriteLine('°C');
finally
Outfile.Free;
end;
end;
Or do the file I/O manually:
procedure TForm1.Button1Click(Sender: TObject);
var
Outfile: TFileStream;
BOM: TBytes;
procedure WriteBytes(const B: TBytes);
begin
if Length(B) > 0 then Outfile.WriteBuffer(B[0], Length(B));
end;
procedure WriteStr(const S: UTF8String);
begin
if S <> '' then Outfile.WriteBuffer(S[1], Length(S));
end;
procedure WriteLine(const S: UTF8String);
begin
WriteStr(S);
WriteStr(sLineBreak);
end;
begin
Outfile := TFileStream.Create('test_chinese.txt', fmCreate);
try
WriteBytes(TEncoding.UTF8.GetPreamble);
WriteLine('总结');
WriteLine('°C');
finally
Outfile.Free;
end;
end;
You really shouldn't use the old text I/O anymore.
Anyway, you can use TEncoding to get the UTF-8 TBytes like this:
procedure TForm1.Button1Click(Sender: TObject);
var Outfile:textfile;
Bytes: TBytes;
myByte: Byte;
begin
assignfile(Outfile,'test_chinese.txt');
Rewrite(Outfile);
for myByte in TEncoding.UTF8.GetPreamble do write(Outfile, AnsiChar(myByte));
//This is the UTF-8 BOM
Bytes := TEncoding.UTF8.GetBytes('总结');
for myByte in Bytes do begin
Write(Outfile, AnsiChar(myByte));
end;
Writeln(Outfile,'°C');
Closefile(Outfile);
end;
I'm not sure if there is an easier way to write TBytes to a Textfile, maybe somebody else has a better idea.
Edit:
For a pure binary file (File instead of TextFile type) you can use BlockWrite.
There are a couple of tell-tale signs that may tell you what went wrong when dealing with Unicode. In your case you're seeing "?" in the resulting output file: you get question marks when you try to convert something from Unicode to a code page and the target code page can't represent the requested characters.
Looking at the hex dump it's obvious (counting line terminators) that the question marks are the result of saving the two Chinese characters to the file. The two chars got converted to exactly two question marks. This tells you that Writeln() decided to give you a helping hand and converted the text from UTF-8 (a Unicode representation) to your local code page. The Delphi team probably decided to do this since the old I/O routines are not supposed to be Unicode compatible; since you're writing a UTF-8 string using the old I/O routines, they're helping you by converting this to your code page. You might not welcome that helping hand, but it doesn't mean it was wrong to do so: it's undocumented territory.
Since you now know why that's happening, you know what to do to stop it. Let WriteLn() know you're sending something that doesn't need converting. You'll discover that's not particularly easy, since Delphi XE2 apparently "helps you out" whatever you do. For example, stuff like this doesn't just change the string type, it converts to AnsiString, going through the code-page conversion routine that gets you question marks:
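The substitution can be observed without any file I/O. The sketch below encodes the two Chinese characters to code page 1252, chosen here only as an example of a code page that cannot represent them, and dumps the resulting bytes. EncodeTo is a name made up for illustration.

```pascal
program QuestionMarks;
{$APPTYPE CONSOLE}

uses
  System.SysUtils; // plain SysUtils on versions before XE2

// Hypothetical helper: encodes S in the given code page
function EncodeTo(const S: string; CodePage: Integer): TBytes;
var
  Enc: TEncoding;
begin
  Enc := TEncoding.GetEncoding(CodePage);
  try
    Result := Enc.GetBytes(S);
  finally
    Enc.Free;
  end;
end;

var
  B: Byte;
begin
  // U+603B U+7ED3 are the two characters from the question.
  // Code page 1252 has no mapping for them, so each one is replaced by
  // the code page's default character, normally '?' ($3F).
  for B in EncodeTo(#$603B#$7ED3, 1252) do
    Write(IntToHex(B, 2), ' ');
  Writeln;
end.
```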
AnsiString(UTF8String('Whatever Unicode'));
Because of this, and if you need one-liner solutions, you could try a conversion routine, something like this:
function FakeConvert(const InStr: UTF8String): AnsiString;
var N: Integer;
begin
N := Length(InStr);
SetLength(Result, N);
if N > 0 then Move(InStr[1], Result[1], N); // skip empty input; indexing [1] would be out of range
end;
You'll then be able to do:
Writeln(Outfile,FakeConvert('总结'));
And it'll do what you expect (I did actually try it before posting!)
Of course the only TRUE answer to this question is, since you upgraded all the way to Delphi XE2:
Stop using deprecated I/O routines, move to TStream based
Good day! I'm using Delphi XE and Indy TIdHTTP. Using the Get method I retrieve a remote directory listing, and I need to parse it: get the list of files with their sizes and timestamps, and distinguish files from subdirectories. Is there a good routine to do that? Thank you in advance! Vojtech
Here is the sample:
<head>
<title>127.0.0.1 - /</title>
</head>
<body>
<H1>127.0.0.1 - /</H1><hr>
<pre>
Mittwoch, 30. März 2011 12:01 <dir> SubDir<br />
Mittwoch, 9. Februar 2005 17:14 113 file.txt<br />
</pre>
<hr>
</body>
Given the code sample, I guess the fastest way to parse it would be like this:
Identify the <pre>...</pre> block containing all the listing lines. Should be easy.
Put everything between the <pre> and </pre> into a TStringList. Each line is a file or folder, and the format is very simple.
Extract the links from each line, extract the date, time and size if you need it. Best done with a regex (you've got Delphi XE so you've got built-in Regex).
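The last step could be sketched as below. It assumes each line has already been cut out of the <pre> block, with the trailing <br /> and any anchor tags stripped, so that it reads "timestamp, then <dir> or a size, then the name". The type TListingEntry, the function ParseListingLine and the pattern itself are illustrative, not a general parser for every server's listing format.

```pascal
program ListingParser;
{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.RegularExpressions; // RegularExpressions (no prefix) on Delphi XE

type
  TListingEntry = record
    Valid: Boolean;   // False when the line did not match
    Stamp: string;    // raw timestamp text, left unparsed here
    IsDir: Boolean;
    Size: Int64;      // 0 for directories
    Name: string;
  end;

function ParseListingLine(const Line: string): TListingEntry;
const
  // group 1: everything up to and including hh:mm
  // group 2: "<dir>" or the size in bytes
  // group 3: the file or directory name
  Pattern = '^\s*(.+?\d{1,2}:\d{2})\s+(<dir>|\d+)\s+(.+?)\s*$';
var
  M: TMatch;
begin
  Result.Valid := False;
  Result.Stamp := '';
  Result.IsDir := False;
  Result.Size := 0;
  Result.Name := '';
  M := TRegEx.Match(Line, Pattern, [roIgnoreCase]);
  if not M.Success then
    Exit;
  Result.Valid := True;
  Result.Stamp := M.Groups[1].Value;
  Result.IsDir := SameText(M.Groups[2].Value, '<dir>');
  if not Result.IsDir then
    Result.Size := StrToInt64(M.Groups[2].Value);
  Result.Name := M.Groups[3].Value;
end;

var
  E: TListingEntry;
begin
  E := ParseListingLine('Mittwoch, 9. Februar 2005 17:14 113 file.txt');
  if E.Valid then
    Writeln(E.Name, ' ', E.Size); // prints: file.txt 113
end.
```

Turning Stamp into a TDateTime is left out on purpose, since the day and month names depend on the server's locale.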
This should give you a good start and idea using DOM:
uses
MSHTML,
ActiveX,
ComObj;
procedure DocumentFromString(Document: IHTMLDocument2; const S: WideString);
var
v: OleVariant;
begin
v := VarArrayCreate([0, 0], varVariant);
v[0] := S;
Document.Write(PSafeArray(TVarData(v).VArray));
Document.Close;
end;
function StripMultipleChar(const S: string; const C: Char): string;
begin
Result := S;
while Pos(C + C, Result) <> 0 do
Result := StringReplace(Result, C + C, C, [rfReplaceAll]);
end;
procedure TForm1.Button1Click(Sender: TObject);
var
Document: IHTMLDocument2;
Elements: IHTMLElementCollection;
Element: IHTMLElement;
I: Integer;
Line: string;
begin
Document := CreateComObject(CLASS_HTMLDocument) as IHTMLDocument2;
DocumentFromString(Document, '<head>...'); // your HTML here
Elements := Document.all.tags('A') as IHTMLElementCollection;
for I := 0 to Elements.length - 1 do
begin
Element := Elements.item(I, '') as IHTMLElement;
Memo1.Lines.Add('A HREF=' + Element.getAttribute('HREF', 2));
Memo1.Lines.Add('A innerText=' + Element.innerText);
// Text is returned immediately before the element
Line := (Element as IHTMLElement2).getAdjacentText('beforeBegin');
// Line => "Mittwoch, 30. März 2011 12:01 <dir>" OR:
// Line => "Mittwoch, 9. Februar 2005 17:14 113"...
// I don't know what is the actual delimiter:
// It could be [space] or [tab] so we need to normalize the Line
// If it's tabs then it's easier because the timestamps also contains spaces
Line := Trim(Line);
Line := StripMultipleChar(Line, #32); // strip multiple Spaces sequences
Line := StripMultipleChar(Line, #9); // strip multiple Tabs sequences
// TODO: ParseLine (from right to left)
Memo1.Lines.Add(Line);
Memo1.Lines.Add('-------------');
end;
end;
Output:
A HREF=/SubDir/
A innerText=SubDir
Mittwoch, 30. März 2011 12:01 <dir>
-------------
A HREF=/file.txt
A innerText=file.txt
Mittwoch, 9. Februar 2005 17:14 113
-------------
EDIT:
I have changed the StripMultipleChar implementation to be simpler, although I believe the former version was better optimized for speed. Considering that the lines are very short, there will not be much difference in performance.