In Delphi 7, I have a WideString encoded with Base64 (received from a web service that returns a WideString):
PD94bWwgdmVyc2lvbj0iMS4wIj8+DQo8c3RyaW5nPtiq2LPYqjwvc3RyaW5nPg==
When I decode it, the result is not read as UTF-8:
<?xml version="1.0"?>
<string>طھط³طھ</string>
But when I decode it with base64decode.org, the result is correct:
<?xml version="1.0"?>
<string>تست</string>
I used the DecodeString function from the EncdDecd unit.
The problem you have is that you are using DecodeString. That function, in Delphi 7, treats the decoded binary data as being ANSI encoded. And the problem is that your text is UTF-8 encoded.
To continue with the EncdDecd unit you have a couple of options. You can switch to DecodeStream. For instance, this code will produce a UTF-8 encoded text file with your data:
{$APPTYPE CONSOLE}
uses
  Classes,
  EncdDecd;
const
  Data = 'PD94bWwgdmVyc2lvbj0iMS4wIj8+DQo8c3RyaW5nPtiq2LPYqjwvc3RyaW5nPg==';
var
  Input: TStringStream;
  Output: TFileStream;
begin
  Input := TStringStream.Create(Data);
  try
    Output := TFileStream.Create('C:\desktop\out.txt', fmCreate);
    try
      DecodeStream(Input, Output);
    finally
      Output.Free;
    end;
  finally
    Input.Free;
  end;
end.
Or you could continue with DecodeString, but then immediately decode the UTF-8 text to a WideString. Like this:
{$APPTYPE CONSOLE}
uses
  Classes,
  EncdDecd;
const
  Data = 'PD94bWwgdmVyc2lvbj0iMS4wIj8+DQo8c3RyaW5nPtiq2LPYqjwvc3RyaW5nPg==';
var
  Utf8: AnsiString;
  wstr: WideString;
begin
  Utf8 := DecodeString(Data);
  wstr := UTF8Decode(Utf8);
end.
If the content of the file can be represented in your application's prevailing ANSI locale then you can convert that WideString to a plain AnsiString.
var
wstr: WideString;
str: string; // alias to AnsiString
....
wstr := ... // as before
str := wstr;
However, I really don't think that using ANSI encoded text is going to lead to a very fruitful programming life. I encourage you to embrace Unicode solutions.
Judging by the content of the decoded data, it is XML. Which is usually handed to an XML parser. Most XML parsers will accept UTF-8 encoded data, so you quite probably can base64 decode to a memory stream using DecodeStream and then hand that stream off to your XML parser. That way you don't need to decode the UTF-8 to text and can let the XML parser deal with that aspect.
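For what it's worth, the stream-to-parser route could look something like the sketch below. It assumes the IXMLDocument support from the XMLDoc unit is available in your Delphi 7 edition and that the default MSXML DOM vendor is in use, which means a console program has to call CoInitialize(nil) (ActiveX unit) before using it. Base64XmlRootText is a name made up for illustration.

```delphi
uses
  Classes, XMLIntf, XMLDoc, EncdDecd;

// Hypothetical helper: Base64-decodes to raw bytes and lets the XML parser
// interpret the UTF-8, so no UTF8Decode call is needed in our own code.
// Assumes COM has already been initialised by the caller (CoInitialize).
function Base64XmlRootText(const Base64Data: string): WideString;
var
  Input: TStringStream;
  Decoded: TMemoryStream;
  Doc: IXMLDocument;
begin
  Input := TStringStream.Create(Base64Data);
  try
    Decoded := TMemoryStream.Create;
    try
      DecodeStream(Input, Decoded);   // raw decoded bytes, no ANSI conversion
      Decoded.Position := 0;          // rewind before handing it to the parser
      Doc := NewXMLDocument;
      Doc.LoadFromStream(Decoded);    // the parser deals with the encoding
      Result := Doc.DocumentElement.Text;
    finally
      Decoded.Free;
    end;
  finally
    Input.Free;
  end;
end;
```

With the data from the question, Base64XmlRootText(Data) should hand back the text of the string element as a WideString.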
As an addendum to David Heffernan's awesome answer, and Remy Lebeau's note on how it's broken on Delphi 7, I would like to add a function that will help any developer stuck on Delphi 7.
Since UTF8Decode() is broken in Delphi 7, I found a function in a forum that solved my problem:
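// Requires the Windows unit (MultiByteToWideChar, CP_UTF8).
function UTF8ToWideString(const S: AnsiString): WideString;
var
  BufSize: Integer;
begin
  Result := '';
  if Length(S) = 0 then Exit;
  // First call: ask how many WideChars the UTF-8 input needs.
  BufSize := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(S), Length(S), nil, 0);
  SetLength(Result, BufSize);
  // Second call: perform the actual conversion into Result.
  MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(S), Length(S), PWideChar(Result), BufSize);
end;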
So now, you can use DecodeString, and then decode the UTF-8 text to a WideString using this function:
begin
Utf8 := DecodeString(Data);
wstr := UTF8ToWideString(Utf8);
end.
My simple code is:
var
TMPStream : TStringStream;
myencoding: TEncoding;
...
try
myencoding := TEncoding.GetEncoding(CP_UTF8);
TMPStream := TStringStream.Create('', myencoding);
try
(IBQueryTMP.FieldByName('MYTEXT') as TBlobField).SaveToStream(TMPStream);
TMPStream.SaveToFile(ExtractFilePath(Application.ExeName)+'myfile.txt');
except
Showmessage ('Error');
end;
finally
TMPStream.Free;
myencoding.Free;
end;
As I see in the file (for instance, in Notepad++), the codepage is UTF-16 Little Endian, although it was declared UTF-8. The database is UTF-8, too.
What's wrong?
Consider this program:
{$APPTYPE CONSOLE}
begin
Writeln('АБВГДЕЖЅZЗИІКЛМНОПҀРСТȢѸФХѾЦЧШЩЪЫЬѢѤЮѦѪѨѬѠѺѮѰѲѴ');
end.
The output on my console which uses the Consolas font is:
????????Z??????????????????????????????????????
The Windows console is quite capable of supporting Unicode as evidenced by this program:
{$APPTYPE CONSOLE}
uses
Winapi.Windows;
const
Text = 'АБВГДЕЖЅZЗИІКЛМНОПҀРСТȢѸФХѾЦЧШЩЪЫЬѢѤЮѦѪѨѬѠѺѮѰѲѴ';
var
NumWritten: DWORD;
begin
WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE), PChar(Text), Length(Text), NumWritten, nil);
end.
for which the output is:
АБВГДЕЖЅZЗИІКЛМНОПҀРСТȢѸФХѾЦЧШЩЪЫЬѢѤЮѦѪѨѬѠѺѮѰѲѴ
Can Writeln be persuaded to respect Unicode, or is it inherently crippled?
Just set the console output codepage through the SetConsoleOutputCP() routine with codepage cp_UTF8.
program Project1;
{$APPTYPE CONSOLE}
uses
System.SysUtils,Windows;
Const
Text = 'АБВГДЕЖЅZЗИІКЛМНОПҀРСТȢѸФХѾЦЧШЩЪЫЬѢѤЮѦѪѨѬѠѺѮѰѲѴ';
VAR
NumWritten: DWORD;
begin
ReadLn; // Make sure Consolas font is selected
try
WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE), PChar(Text), Length(Text), NumWritten, nil);
SetConsoleOutputCP(CP_UTF8);
WriteLn;
WriteLn('АБВГДЕЖЅZЗИІКЛМНОПҀРСТȢѸФХѾЦЧШЩЪЫЬѢѤЮѦѪѨѬѠѺѮѰѲѴ');
except
on E: Exception do
Writeln(E.ClassName, ': ', E.Message);
end;
ReadLn;
end.
Outputs:
АБВГДЕЖЅZЗИІКЛМНОПҀРСТȢѸФХѾЦЧШЩЪЫЬѢѤЮѦѪѨѬѠѺѮѰѲѴ
АБВГДЕЖЅZЗИІКЛМНОПҀРСТȢѸФХѾЦЧШЩЪЫЬѢѤЮѦѪѨѬѠѺѮѰѲѴ
WriteLn() translates Unicode UTF16 strings to the selected output codepage (cp_UTF8) internally.
Update:
The above works in Delphi-XE2 and above.
In Delphi-XE you need an explicit conversion to UTF-8 to make it work properly.
WriteLn(UTF8String('АБВГДЕЖЅZЗИІКЛМНОПҀРСТȢѸФХѾЦЧШЩЪЫЬѢѤЮѦѪѨѬѠѺѮѰѲѴ'));
Addendum:
If output is written to the console in another code page before SetConsoleOutputCP(cp_UTF8) is called, the OS will not output the UTF-8 text correctly.
This can be fixed by closing and reopening the stdout handle.
Another option is to declare a new text output handler for UTF-8.
var
toutUTF8: TextFile;
...
SetConsoleOutputCP(CP_UTF8);
AssignFile(toutUTF8,'',cp_UTF8); // Works in XE2 and above
Rewrite(toutUTF8);
WriteLn(toutUTF8,'АБВГДЕЖЅZЗИІКЛМНОПҀРСТȢѸФХѾЦЧШЩЪЫЬѢѤЮѦѪѨѬѠѺѮѰѲѴ');
The System unit declares a variable named AlternateWriteUnicodeStringProc that allows customisation of how Writeln performs output. This program:
{$APPTYPE CONSOLE}
uses
Winapi.Windows;
function MyAlternateWriteUnicodeStringProc(var t: TTextRec; s: UnicodeString): Pointer;
var
  NumberOfCharsWritten, NumOfBytesWritten: DWORD;
begin
  Result := @t;
  if t.Handle = GetStdHandle(STD_OUTPUT_HANDLE) then
    WriteConsole(t.Handle, Pointer(s), Length(s), NumberOfCharsWritten, nil)
  else
    WriteFile(t.Handle, Pointer(s)^, Length(s)*SizeOf(WideChar), NumOfBytesWritten, nil);
end;
var
UserFile: Text;
begin
AlternateWriteUnicodeStringProc := MyAlternateWriteUnicodeStringProc;
Writeln('АБВГДЕЖЅZЗИІКЛМНОПҀРСТȢѸФХѾЦЧШЩЪЫЬѢѤЮѦѪѨѬѠѺѮѰѲѴ');
Readln;
end.
produces this output:
АБВГДЕЖЅZЗИІКЛМНОПҀРСТȢѸФХѾЦЧШЩЪЫЬѢѤЮѦѪѨѬѠѺѮѰѲѴ
I'm sceptical of how I've implemented MyAlternateWriteUnicodeStringProc and how it would interact with classic Pascal I/O. However, it appears to behave as desired for output to the console.
The documentation of AlternateWriteUnicodeStringProc currently says, wait for it, ...
Embarcadero Technologies does not currently have any additional information. Please help us document this topic by using the Discussion page!
WriteConsoleW seems to be a quite magical function.
procedure WriteLnToConsoleUsingWriteFile(CP: Cardinal; AEncoding: TEncoding; const S: string);
var
Buffer: TBytes;
NumWritten: Cardinal;
begin
Buffer := AEncoding.GetBytes(S);
// This is a side effect and should be avoided ...
SetConsoleOutputCP(CP);
WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), Buffer[0], Length(Buffer), NumWritten, nil);
WriteLn;
end;
procedure WriteLnToConsoleUsingWriteConsole(const S: string);
var
NumWritten: Cardinal;
begin
WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE), PChar(S), Length(S), NumWritten, nil);
WriteLn;
end;
const
Text = 'АБВГДЕЖЅZЗИІКЛМНОПҀРСТȢѸФХѾЦЧШЩЪЫЬѢѤЮѦѪѨѬѠѺѮѰѲѴ';
begin
ReadLn; // Make sure Consolas font is selected
// Works, but changing the console CP is necessary
WriteLnToConsoleUsingWriteFile(CP_UTF8, TEncoding.UTF8, Text);
// Doesn't work
WriteLnToConsoleUsingWriteFile(1200, TEncoding.Unicode, Text);
// This does and doesn't need the CP anymore
WriteLnToConsoleUsingWriteConsole(Text);
ReadLn;
end.
So in summary:
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), ...) supports UTF-16.
WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), ...) doesn't support UTF-16.
My guess would be that in order to support different ANSI encodings the classic Pascal I/O uses the WriteFile call.
Also keep in mind that when used on a file instead of the console it has to work as well:
unicode text file output differs between XE2 and Delphi 2009?
That means that blindly using WriteConsole breaks output redirection. If you use WriteConsole you should fall back to WriteFile like this:
procedure WriteLnToConsoleUsingWriteConsole(const S: string);
var
  NumWritten: Cardinal;
  Bytes: TBytes;
begin
  // WriteConsole fails when stdout is redirected; fall back to WriteFile.
  if not WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE), PChar(S), Length(S),
    NumWritten, nil) then
  begin
    Bytes := TEncoding.UTF8.GetBytes(S);
    WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), Bytes[0], Length(Bytes),
      NumWritten, nil);
  end;
  WriteLn;
end;
Note that output redirection with any encoding works fine in cmd.exe. It just writes the output stream to the file unchanged.
PowerShell, however, expects either ANSI output or the correct preamble (BOM) at the start of the output; otherwise the file will be mis-encoded. PowerShell will also always convert the output to UTF-16 with a preamble.
MSDN recommends using GetConsoleMode to find out whether the standard handle is a console handle, and it also mentions the BOM:
WriteConsole fails if it is used with a standard handle that is
redirected to a file. If an application processes multilingual output
that can be redirected, determine whether the output handle is a
console handle (one method is to call the GetConsoleMode function and
check whether it succeeds). If the handle is a console handle, call
WriteConsole. If the handle is not a console handle, the output is
redirected and you should call WriteFile to perform the I/O. Be sure to
prefix a Unicode plain text file with a byte order mark. For more
information, see Using Byte Order Marks.
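A minimal sketch of that MSDN advice, assuming Delphi XE2 or later (unit scope names, TEncoding). WriteLnUnicode and IsConsoleHandle are hypothetical names, and UTF-8 is just one possible choice for the redirected case. The sketch does not write a BOM; if the redirected output is meant to be a standalone Unicode text file, you would write TEncoding.UTF8.GetPreamble once before the first line.

```delphi
uses
  Winapi.Windows, System.SysUtils;

// True when H refers to a real console, False when it is redirected
// to a file or pipe (GetConsoleMode fails for non-console handles).
function IsConsoleHandle(H: THandle): Boolean;
var
  Mode: DWORD;
begin
  Result := GetConsoleMode(H, Mode);
end;

procedure WriteLnUnicode(const S: string);
var
  H: THandle;
  Line: string;
  Bytes: TBytes;
  NumWritten: DWORD;
begin
  H := GetStdHandle(STD_OUTPUT_HANDLE);
  Line := S + sLineBreak; // never empty, so Bytes[0] below is always valid
  if IsConsoleHandle(H) then
    // Console: hand the UTF-16 text straight to the console API.
    WriteConsole(H, PChar(Line), Length(Line), NumWritten, nil)
  else
  begin
    // Redirected: write encoded bytes (UTF-8 is an assumption of this sketch).
    Bytes := TEncoding.UTF8.GetBytes(Line);
    WriteFile(H, Bytes[0], Length(Bytes), NumWritten, nil);
  end;
end;
```

Unlike the "try WriteConsole, then fall back" version, this decides up front which API to use, so a WriteConsole failure for some other reason is not silently treated as redirection.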
I am developing a server and a mobile client that communicate over HTTP. The server is written in Delphi 7 (because it has to be compatible with old code); the client is a mobile application written in XE6. The server sends the client a stream of data that contains strings. The problem is with the encoding.
On the server I try to pass strings in UTF8:
//Writes string to stream
procedure TStreamWrap.WriteString(Value: string);
var
BytesCount: Longint;
UTF8: string;
begin
UTF8 := AnsiToUtf8(Value);
BytesCount := Length(UTF8);
WriteLongint(BytesCount); //It writes Longint to FStream: TStream
if BytesCount > 0 then
FStream.WriteBuffer(UTF8[1], BytesCount);
end;
As it's written in Delphi 7, Value is a single-byte string.
On the client I read the string as UTF-8 and convert it to Unicode:
//Reads string from current position of stream
function TStreamWrap.ReadString: string;
var
BytesCount: Longint;
UTF8: String;
begin
BytesCount := ReadLongint;
if BytesCount = 0 then
Result := ''
else
begin
SetLength(UTF8, BytesCount);
FStream.Read(Pointer(UTF8)^, BytesCount);
Result := UTF8ToUnicodeString(UTF8);
end;
end;
But it doesn't work: when I display the string with ShowMessage, the letters are wrong. So how do I store a string in Delphi 7 and restore it in XE6 in the mobile app? Should I add a BOM at the beginning of the data representing the string?
To read your UTF8 encoded string in your mobile application you use a byte array and the TEncoding class. Like this:
function TStreamWrap.ReadString: string;
var
ByteCount: Longint;
Bytes: TBytes;
begin
ByteCount := ReadLongint;
if ByteCount = 0 then
begin
Result := '';
exit;
end;
SetLength(Bytes, ByteCount);
FStream.Read(Pointer(Bytes)^, ByteCount);
Result := TEncoding.UTF8.GetString(Bytes);
end;
This code does what you need in XE6, but of course, this code will not compile in Delphi 7 because it uses TEncoding. What's more, your TStreamWrap.WriteString implementation does what you want in Delphi 7, but is broken in XE6.
Now it looks like you are using the same code base for both Delphi 7 and Delphi XE6 versions. Which means that you may need to use some conditional compilation to handle the treatment of text which differs between these versions.
Personally I would do this by following the example of TEncoding. What you need is a function that converts a native Delphi string to a UTF-8 encoded byte array, and a corresponding function in the reverse direction.
So, let's consider the string to bytes function. I cannot remember whether or not Delphi 7 has a TBytes type. I suspect not. So let us define it:
{$IFNDEF UNICODE} // definitely use a better conditional than this in real code
type
TBytes = array of Byte;
{$ENDIF}
Then we can define our function:
function StringToUTF8Bytes(const s: string): TBytes;
{$IFDEF UNICODE}
begin
Result := TEncoding.UTF8.GetBytes(s);
end;
{$ELSE}
var
UTF8: UTF8String;
begin
UTF8 := AnsiToUtf8(s);
SetLength(Result, Length(UTF8));
Move(Pointer(UTF8)^, Pointer(Result)^, Length(Result));
end;
{$ENDIF}
The function in the opposite direction should be trivial for you to produce.
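In case it helps, here is a sketch of the reverse function, mirroring StringToUTF8Bytes above. UTF8BytesToString is a hypothetical name. The non-Unicode branch relies on Utf8ToAnsi, so on Delphi 7 any character that does not exist in the current ANSI code page is lost in the conversion.

```delphi
uses
  SysUtils;

{$IFNDEF UNICODE} // same placeholder conditional as above
type
  TBytes = array of Byte;
{$ENDIF}

// Converts a UTF-8 encoded byte array to a native Delphi string.
function UTF8BytesToString(const Bytes: TBytes): string;
{$IFNDEF UNICODE}
var
  UTF8: UTF8String;
{$ENDIF}
begin
  if Length(Bytes) = 0 then
  begin
    Result := '';
    Exit;
  end;
{$IFDEF UNICODE}
  // Unicode compilers: TEncoding does the decoding to UTF-16.
  Result := TEncoding.UTF8.GetString(Bytes);
{$ELSE}
  // Delphi 7: copy the bytes into an 8-bit string, then convert to ANSI.
  SetLength(UTF8, Length(Bytes));
  Move(Bytes[0], UTF8[1], Length(Bytes));
  Result := Utf8ToAnsi(UTF8);
{$ENDIF}
end;
```

ReadString can then read ByteCount bytes into a TBytes and return UTF8BytesToString(Bytes) without any conditional code of its own.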
Once you have the differences in handling of text encoding between the two Delphi versions encapsulated, you can then write conditional free code in the rest of your program. For example, you would code WriteString like this:
procedure TStreamWrap.WriteString(const Value: string);
var
UTF8: TBytes;
ByteCount: Longint;
begin
UTF8 := StringToUTF8Bytes(Value);
ByteCount := Length(UTF8);
WriteLongint(ByteCount);
if ByteCount > 0 then
FStream.WriteBuffer(Pointer(UTF8)^, ByteCount);
end;
Instead of
Utf8 : String;
Use
Utf8 : Utf8String;
on the client. The conversion is then automatic.
EDIT: Since the client is on a mobile platform, and Embarcadero has decided to eliminate the 8-bit strings in mobile compilers, the above won't work for this particular case. But in other cases where you have an 8-bit UTF-8 encoded string, the Utf8String can be used to seamlessly convert back and forth between UTF-8 and Unicode strings without the need to use explicit UTF-8 conversion functions. Just use it like
UnicodeStringVariable := Utf8StringVariable;
or
Utf8StringVariable := UnicodeStringVariable;
and the compiler will insert the appropriate conversion.
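A tiny illustration of those conversions, for desktop compilers from Delphi 2009 on (not the mobile compilers mentioned in the edit). The function names are made up. The explicit casts perform the same conversion as the plain assignments; they are only there to keep the compiler's implicit string cast warnings quiet.

```delphi
// Hypothetical helper: number of bytes S occupies once encoded as UTF-8.
function Utf8ByteCount(const S: UnicodeString): Integer;
var
  Encoded: UTF8String;
begin
  Encoded := UTF8String(S);  // compiler converts UTF-16 -> UTF-8
  Result := Length(Encoded); // Length of a UTF8String counts bytes
end;

// Hypothetical helper: True when S survives UTF-16 -> UTF-8 -> UTF-16.
function RoundTripsThroughUtf8(const S: UnicodeString): Boolean;
var
  Encoded: UTF8String;
  Decoded: UnicodeString;
begin
  Encoded := UTF8String(S);
  Decoded := UnicodeString(Encoded); // compiler converts UTF-8 -> UTF-16
  Result := Decoded = S;
end;
```

For a three-letter Arabic word such as #$062A#$0633#$062A, each letter takes two bytes in UTF-8, so Utf8ByteCount gives 6 while Length of the UnicodeString is 3.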
Is it possible to convert the XML to UTF-8 encoding in Delphi 6?
Currently this is what I am doing:
Fill the TXMLDocument with AnsiString data.
At the end, convert the data to UTF-8 using WideStringVariable = AnsiToUtf8(Doc.XML.Text);
Save the value of WideStringVariable to a file using TFileStream, adding the UTF-8 BOM at the beginning of the file.
CODE:
Procedure SaveAsUTF8( const Name:String; Data: TStrings );
const
cUTF8 = $BFBBEF;
var
W_TXT: WideString;
fs: TFileStream;
wBOM: Integer;
begin
if TRIM(Data.Text) <> '' then begin
W_TXT:= AnsiToUTF8(Data.Text);
fs:= Tfilestream.create( Name, fmCreate );
try
wBOM := cUTF8;
fs.WriteBUffer( wBOM, sizeof(wBOM)-1);
fs.WriteBuffer( W_TXT[1], Length(W_TXT)*Sizeof( W_TXT[1] ));
finally
fs.free
end;
end;
end;
If I open the file in Notepad++ or another editor that detects encoding, it shows UTF-8 with BOM. However, the text does not seem to be properly encoded.
What is wrong and how can I fix it?
UPDATE: XML Properties:
XMLDoc.Version := '1.0';
XMLDoc.Encoding := 'UTF-8';
XMLDoc.StandAlone := 'yes';
You can save the file using standard SaveToFile method over the TXMLDocument variable: http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/XMLDoc_TXMLDocument_SaveToFile.html
Whether the resulting file is actually UTF-8 is something you have to check with local tools such as the aforementioned Notepad++, a hex editor, or anything similar.
If you insist on using an intermediate string and a file stream, you should use the proper variable type. AnsiToUTF8 returns a UTF8String, and that is the type to use.
Compiling 'WideStringVar := AnsiStringSource' would issue a compiler warning, and it is a proper warning. Googling for "Delphi WideString", or reading the Delphi manuals on the topic, shows that WideString (aka the Microsoft OLE BSTR) keeps its data in UTF-16 format. http://delphi.about.com/od/beginners/l/aa071800a.htm
Thus assigning an 8-bit source to a UTF-16 string necessarily converts the data, so dumping WideString data cannot, by the very definition of WideString, be dumping UTF-8 text.
Procedure SaveAsUTF8( const Name:String; Data: TStrings );
const
  cUTF8: array [1..3] of byte = ($EF,$BB,$BF);
var
  W_TXT: UTF8String;
  fs: TFileStream;
  Trimmed: AnsiString;
begin
  Trimmed := TRIM(Data.Text);
  if Trimmed <> '' then begin
    W_TXT:= AnsiToUTF8(Trimmed);
    fs:= TFileStream.Create( Name, fmCreate );
    try
      fs.WriteBuffer( cUTF8[1], sizeof(cUTF8) );
      fs.WriteBuffer( W_TXT[1], Length(W_TXT)*Sizeof( W_TXT[1] ));
    finally
      fs.free
    end;
  end;
end;
BTW, this code of yours would not create even an empty file if the source data was empty. That looks rather suspicious, though it is up to you to decide whether it is an error with respect to the rest of your program.
The proper "uploading" of the received file or stream to the web is yet another issue (to be asked as a separate question on a Q&A site like SO), related to testing conformance with HTTP. As a starting point, you can read some hints at WWW server reports error after POST Request by Internet Direct components in Delphi
In order to have the correct encoding inside the document, you should set it by using the Encoding property in your XML Document, like this:
myXMLDocument.Encoding := 'UTF-8';
I hope this helps.
You simply need to call the SaveToFile method of the document:
XMLDoc.SaveToFile(FileName);
Since you specified the encoding already, the component will use that encoding.
This won't include a BOM, but that's generally what you want for an XML file. The content of the file will specify the encoding.
As regards your SaveAsUTF8 method, it is not needed, but it is easy to fix. And that may be instructive to you.
The problem is that you are converting to UTF-16 when you assign to a WideString variable. You should instead put the UTF-8 text into an AnsiString variable. Changing the type of the variable that you named W_TXT to AnsiString is enough.
The function might look like this:
Procedure SaveAsUTF8(const Name: string; Data: TStrings);
const
UTF8BOM: array [0..2] of AnsiChar = #$EF#$BB#$BF;
var
utf8: AnsiString;
fs: TFileStream;
begin
utf8 := AnsiToUTF8(Data.Text);
fs:= Tfilestream.create(Name, fmCreate);
try
fs.WriteBuffer(UTF8BOM, SizeOf(UTF8BOM));
fs.WriteBuffer(Pointer(utf8)^, Length(utf8));
finally
fs.free;
end;
end;
Another solution (for Delphi 2009 and later, where TStreamWriter and TEncoding are available):
procedure SaveAsUTF8(const Name: string; Data: TStrings);
var
fs: TFileStream;
vStreamWriter: TStreamWriter;
begin
fs := TFileStream.Create(Name, fmCreate);
try
vStreamWriter := TStreamWriter.Create(fs, TEncoding.UTF8);
try
vStreamWriter.Write(Data.Text);
finally
vStreamWriter.Free;
end;
finally
fs.free;
end;
end;
I'm trying to save some lines of text to a TFileStream in a code page different from my system's, such as Cyrillic, using Delphi XE. However, I can't find any code sample that produces such encoded files.
I tried using the same code as TStrings.SaveToStream however I'm not sure I implemented it correctly (the WriteBom part for example) and would like to know how it would be done elsewhere. Here is my code:
FEncoding := TEncoding.GetEncoding(1251);
FFilePool := TObjectDictionary<string,TFileStream>.Create([doOwnsValues]);
//...
procedure WriteToFile(const aFile, aText: string);
var
Preamble, Buffer: TBytes;
begin
// Create the file if it doesn't exist
if not FFilePool.ContainsKey(aFile) then
begin
// Create the file
FFilePool.Add(aFile, TFileStream.Create(aFile, fmCreate));
// Write the BOM
Preamble := FEncoding.GetPreamble;
if Length(Preamble) > 0 then
FFilePool[aFile].WriteBuffer(Preamble[0], Length(Preamble));
end;
// Write to the file
Buffer := FEncoding.GetBytes(aText);
FFilePool[aFile].WriteBuffer(Buffer[0], Length(Buffer));
end;
Thanks in advance.
I'm not sure what example you are looking for; maybe the following can help. The example converts Unicode strings (SL) to ANSI Cyrillic:
procedure SaveCyrillic(SL: TStrings; Stream: TStream);
var
CyrillicEncoding: TEncoding;
begin
CyrillicEncoding := TEncoding.GetEncoding(1251);
try
SL.SaveToStream(Stream, CyrillicEncoding);
finally
CyrillicEncoding.Free;
end;
end;
If I understand correctly, it's pretty simple. Declare an AnsiString with an affinity for Cyrillic 1251:
type
// The code page for ANSI-Cyrillic is 1251
CyrillicString = type AnsiString(1251);
Then assign your Unicode string to one of these:
var
UnicodeText: string;
CyrillicText: CyrillicString;
....
CyrillicText := UnicodeText;
You can then write CyrillicText to a stream in the traditional manner:
if Length(CyrillicText)>0 then
Stream.WriteBuffer(CyrillicText[1], Length(CyrillicText));
There should be no BOM for an ANSI encoded text file.
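Putting the pieces together, here is a sketch of the question's write step reduced to its essentials, assuming Delphi XE and that code page 1251 is installed on the machine. CyrillicBytes and AppendCyrillic are hypothetical names. It skips the preamble entirely, since there is none to write for an ANSI code page, and it guards against empty text, where Buffer[0] in the question's code would index an empty array.

```delphi
uses
  Classes, SysUtils;

// Encodes aText as Windows-1251 bytes; returns an empty array for ''.
function CyrillicBytes(const aText: string): TBytes;
var
  Enc: TEncoding;
begin
  Result := nil;
  if aText = '' then
    Exit;
  Enc := TEncoding.GetEncoding(1251); // caller owns this instance
  try
    Result := Enc.GetBytes(aText);
  finally
    Enc.Free;
  end;
end;

// Appends aText to Stream as Windows-1251, without any BOM.
procedure AppendCyrillic(Stream: TStream; const aText: string);
var
  Buffer: TBytes;
begin
  Buffer := CyrillicBytes(aText);
  if Length(Buffer) > 0 then
    Stream.WriteBuffer(Buffer[0], Length(Buffer));
end;
```

In the question's WriteToFile this would replace both the preamble block and the final GetBytes/WriteBuffer pair with a single AppendCyrillic(FFilePool[aFile], aText) call. If many small writes are expected, creating the encoding once and reusing it, as the question does with FEncoding, avoids the repeated GetEncoding calls.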