I'm still using Delphi 7 (I know) and I need to encode a UTF-8 XML document in Base64.
I create the XML using IXMLDocument, which supports UTF-8 (that is, if I save to a file).
Since I'm using Indy 10 to HTTP POST the XML request, I tried using TIdEncoderMIME to Base64 encode the XML. But some UTF-8 characters are not encoded correctly.
Try1:
XMLText := XML.XML.Text;
EncodedXML := TIdEncoderMIME.EncodeBytes(ToBytes(XMLText));
In the above case most probably some UTF8 information/characters are already lost when the XML is saved to a string.
Try2:
XMLStream := TMemoryStream.Create;
XML.SaveToStream(XMLStream);
EncodedXML := TIdEncoderMIME.EncodeStream(XMLStream);
//or
EncodedXML := TIdEncoderMIME.EncodeStream(XMLStream, XMLStream.Size);
Both of the above give back EncodedXML = '' (empty string).
What am I doing wrong?
Try using the TIdEncoderMIME.EncodeString() method instead. It has an AByteEncoding parameter that you can use to specify the desired byte encoding that Indy should encode the string characters as, such as UTF-8, before it then base64 encodes the resulting bytes:
XMLText := XML.XML.Text;
EncodedXML := TIdEncoderMIME.EncodeString(XMLText, IndyTextEncoding_UTF8);
Also note that in Delphi 2007 and earlier, where string is AnsiString, there is also an optional ASrcEncoding parameter that you can use to specify the encoding of the AnsiString (for instance, if it is already UTF-8), so that it can be decoded to Unicode properly before being encoded to the specified byte encoding (or, in the case where the two encodings are the same, the AnsiString can be base64 encoded as-is):
XMLText := XML.XML.Text;
EncodedXML := TIdEncoderMIME.EncodeString(XMLText, IndyTextEncoding_UTF8, IndyTextEncoding_UTF8);
You are getting data loss when using EncodeBytes() because you are using ToBytes() without specifying any encoding parameters for it. ToBytes() has similar AByteEncoding and ASrcEncoding parameters.
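As a rough sketch of what that could look like, the first attempt can be kept on EncodeBytes() by passing both encodings to ToBytes(). The wrapper name XmlTextToBase64 is hypothetical, and the three-argument form of ToBytes() is only expected to compile where string is AnsiString (Delphi 2007 and earlier). It also assumes the AnsiString already holds UTF-8 bytes; if it holds text in the system ANSI code page instead, the third argument should probably be left out so that Indy falls back to its default source encoding.

```pascal
uses
  IdGlobal, IdCoderMIME;

// Hypothetical helper: Base64-encodes XML text held in an AnsiString.
// Assumption: XMLText already contains UTF-8 encoded bytes.
function XmlTextToBase64(const XMLText: string): string;
begin
  Result := TIdEncoderMIME.EncodeBytes(
    ToBytes(XMLText,
            IndyTextEncoding_UTF8,    // AByteEncoding: bytes to produce
            IndyTextEncoding_UTF8));  // ASrcEncoding: how the AnsiString is encoded
end;
```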
In the case where you tried to encode a TMemoryStream, you simply forgot to reset the stream's Position back to 0 after calling SaveToStream(), so there was nothing for EncodeStream() to encode. That is why it returned a blank base64 string:
XMLStream := TMemoryStream.Create;
try
XML.SaveToStream(XMLStream);
XMLStream.Position := 0; // <-- add this
EncodedXML := TIdEncoderMIME.EncodeStream(XMLStream);
finally
XMLStream.Free;
end;
I'm trying to use a TStringList to load a CSV file generated by Google Contacts. When I open this file in a text editor like Sublime Text, I can see the contents properly, with 75 lines. This is a sample from the Google Contacts file:
Name,Given Name,Additional Name,Family Name,Yomi Name,Given Name Yomi,Additional Name Yomi,Family Name Yomi,Name Prefix,Name Suffix,Initials,Nickname,Short Name,Maiden Name,Birthday,Gender,Location,Billing Information,Directory Server,Mileage,Occupation,Hobby,Sensitivity,Priority,Subject,Notes,Group Membership,Phone 1 - Type,Phone 1 - Value,Phone 2 - Type,Phone 2 - Value,Phone 3 - Type,Phone 3 - Value
H,H,,,,,,,,,,,,, 1-01-01,,,,,,,,,,,,* My Contacts ::: Importado 01/02/16,,,,,,
H - ?,H,-,?,,,,,,,,,,, 1-01-01,,,,,,,,,,,,* My Contacts ::: Importado 01/02/16,Mobile,031-863-64393,,,,
H - ?,H,-,?,,,,,,,,,,,,,,,,,,,,,,,* My Contacts ::: Importado 01/02/16,Mobile,031-986-364393,,,,
But when I try to load this same file using TStringList, this is what I see in the TStringList.Text property:
'ÿþN'#$D#$A
Here is my code :
procedure Tform1.loadfile;
var sl : tstringlist;
begin
sl := tstringlist.create;
sl.loadfromfile('c:\google.csv');
showmessage('lines : '+inttostr(sl.count)+' / text : '+ sl.text);
end;
This is the result I get:
'1 / 'ÿþN'#$D#$A'
What is happening here ?
Thanks
According to the hex dump you provided, the BOM indicates that your file is encoded using UTF-16LE. You have a few options in front of you, as I see it:
Switch to Unicode and use the TnT Unicode controls to work with this file.
Read the file as an array of bytes, convert it to ANSI, and then continue using ANSI encoded text. Obviously you'll lose information for any characters that cannot be encoded by your ANSI code page. A cheap way to do this would be to read the file as a byte array, copy the content after the first two bytes (the BOM) into a WideString, and then assign that WideString to an ANSI string.
Port your program to a Unicode version of Delphi (anything later than Delphi 2007) and work natively with Unicode.
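The cheap conversion described in the second option could be sketched roughly as follows. The function name LoadUtf16LEAsAnsi is made up for illustration, and the code assumes the file really is UTF-16LE; characters that have no equivalent in the system ANSI code page will be replaced during the final assignment.

```pascal
uses
  Classes, SysUtils;

// Hypothetical helper: reads a UTF-16LE file and returns it as ANSI text.
function LoadUtf16LEAsAnsi(const FileName: string): AnsiString;
var
  ms: TMemoryStream;
  ws: WideString;
  p: PAnsiChar;
  Offset, CharCount: Integer;
begin
  ms := TMemoryStream.Create;
  try
    ms.LoadFromFile(FileName);
    p := ms.Memory;
    Offset := 0;
    // Skip the UTF-16LE byte order mark (bytes FF FE) if it is present.
    if (ms.Size >= 2) and (p[0] = #$FF) and (p[1] = #$FE) then
      Offset := 2;
    // Number of whole 16-bit code units after the BOM.
    CharCount := Integer((ms.Size - Offset) div SizeOf(WideChar));
    SetLength(ws, CharCount);
    if CharCount > 0 then
      Move(p[Offset], ws[1], CharCount * SizeOf(WideChar));
    // Implicit WideString -> AnsiString conversion uses the system code page.
    Result := ws;
  finally
    ms.Free;
  end;
end;
```

The result can then be assigned to TStringList.Text, which splits it into lines in the usual way.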
I rather suspect that you are not very familiar with text encodings. If you were then I think you would have been able to answer the question yourself. That's just fine but I urge you to take the time to learn about this issue properly. If you rush into coding now, before having a sound grounding, you are sure to make a mess of it. And we've seen so many people make that same mistake. Please don't add to the list of text encoding casualties.
Thanks to David's information, I was able to do this with the procedure below. Because Delphi 2007 does not have native Unicode support, it needs a helper function like this one to do the conversion:
procedure loadUnicodeFile( const filename: String; strings: TStringList);
  Procedure SwapWideChars( p: PWideChar );
  Begin
    While p^ <> #0000 Do Begin
      // p^ := Swap( p^ ); //<<< D3
      p^ := WideChar( Swap( Word(p^)));
      Inc( p );
    End; { While }
  End; { SwapWideChars }
Var
  ms: TMemoryStream;
  wc: WideChar;
  pWc: PWideChar;
Begin
  ms:= TMemoryStream.Create;
  try
    ms.LoadFromFile( filename );
    // append a terminating #0 so the buffer can be treated as a PWideChar
    ms.Seek( 0, soFromEnd );
    wc := #0000;
    ms.Write( wc, sizeof(wc));
    pWc := ms.Memory;
    If pWc^ = #$FEFF Then // normal byte order mark
      Inc(pWc)
    Else If pWc^ = #$FFFE Then Begin // byte order is big-endian
      SwapWideChars( pWc );
      Inc( pWc );
    End { If }
    Else; // no byte order mark
    strings.Text := WideCharToString( pWc );
  finally
    ms.Free;
  end;
End;
I have an .URL file which contains the following text which contains a German Umlaut character:
[InternetShortcut]
URL=http://edn.embarcadero.com/article/44358
[MyApp]
Notes=Special Test geändert
Icon=default
Title=Bug fix list for RAD Studio XE8
I try to load the text with TMemIniFile:
uses System.IniFiles;
//
procedure TForm1.Button1Click(Sender: TObject);
var
BookmarkIni: TMemIniFile;
begin
// The error occurs here:
BookmarkIni := TMemIniFile.Create('F:\Bug fix list for RAD Studio XE8.url',
TEncoding.UTF8);
try
// Some code here
finally
BookmarkIni.Free;
end;
end;
This is the error message text from the debugger:
Project MyApp.exe raised exception class EEncodingError with message
'No mapping for the Unicode character exists in the target multi-byte
code page'.
When I remove the word with the German Umlaut character "geändert" from the .URL file then there is NO error.
But that's why I use TMemIniFile, because TIniFile does not work here when the text in the .URL file contains Unicode characters. (There could also be other Unicode characters in the .URL file).
So why do I get an exception here in TMemIniFile.Create?
EDIT: Found the culprit: The .URL file is in ANSI format. The error does not happen when the .URL file is in UTF-8 format. But what can I do when the file is in ANSI format?
EDIT2: I've created a workaround which does work BOTH with ANSI and UTF-8 files:
procedure TForm1.Button1Click(Sender: TObject);
var
BookmarkIni: TMemIniFile;
BookmarkIni_: TIniFile;
ThisFileIsAnsi: Boolean;
begin
try
ThisFileIsAnsi := False;
BookmarkIni := TMemIniFile.Create('F:\Bug fix list for RAD Studio XE8.url',
TEncoding.UTF8);
except
BookmarkIni_ := TIniFile.Create('F:\Bug fix list for RAD Studio XE8.url');
ThisFileIsAnsi := True;
end;
try
// Some code here
finally
if ThisFileIsAnsi then
BookmarkIni_.Free
else
BookmarkIni.Free;
end;
end;
What do you think?
It is not possible, in general, to auto-detect the encoding of a file from its contents.
A clear demonstration of this is given by this article from Raymond Chen: The Notepad file encoding problem, redux. Raymond uses the example of a file containing these two bytes:
D0 AE
Raymond goes on to show that this is a well formed file with the following four encodings: ANSI 1252, UTF-8, UTF-16BE and UTF-16LE.
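To make the ambiguity concrete, here is a small console sketch that decodes those same two bytes with each of the four encodings and prints the resulting code points. It assumes a Unicode-enabled Delphi with TEncoding (the thread mentions XE8); the program and procedure names are only illustrative.

```pascal
program DecodeDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

// Decodes Data with Enc and prints each resulting UTF-16 code unit in hex.
procedure Show(const Name: string; Enc: TEncoding; const Data: TBytes);
var
  S: string;
  I: Integer;
begin
  S := Enc.GetString(Data);
  Write(Name, ':');
  for I := 1 to Length(S) do
    Write(' U+', IntToHex(Ord(S[I]), 4));
  Writeln;
end;

var
  Data: TBytes;
  Ansi1252: TEncoding;
begin
  Data := TBytes.Create($D0, $AE);
  // GetEncoding returns a new instance that the caller must free.
  Ansi1252 := TEncoding.GetEncoding(1252);
  try
    Show('ANSI 1252', Ansi1252, Data);
    Show('UTF-8', TEncoding.UTF8, Data);
    Show('UTF-16BE', TEncoding.BigEndianUnicode, Data);
    Show('UTF-16LE', TEncoding.Unicode, Data);
  finally
    Ansi1252.Free;
  end;
end.
```

Every call decodes without error: code page 1252 yields two characters (U+00D0, U+00AE), UTF-8 yields one (U+042E), and the two UTF-16 variants yield U+D0AE and U+AED0 respectively. Nothing in the bytes themselves says which reading was intended.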
The take-home lesson here is that you have to know the encoding of your file: either agree on it by convention with whoever writes the file, or enforce the presence of a BOM.
You need to decide what the encoding of the file is, once and for all. There's no foolproof way to auto-detect this, so you'll have to enforce it from the code that creates these files.
If the creation of this file is outside your control, then you are more or less out of luck. You can try to rely on the BOM (byte order mark) at the beginning of the file (which should be there if it is a UTF-8 file). I can't see from the documentation of TMemIniFile what the Create constructor without an encoding parameter assumes about the encoding of the file (my guess is that it follows the BOM and, if there is none, assumes ANSI, i.e. the system code page).
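If you do want to rely on the BOM, a check along these lines could be used to decide which encoding to pass. This is only a sketch: HasUtf8Bom is a hypothetical helper name, and a file without a BOM may of course still be UTF-8.

```pascal
uses
  System.Classes, System.SysUtils;

// Hypothetical helper: True if the file starts with the UTF-8 BOM (EF BB BF).
function HasUtf8Bom(const FileName: string): Boolean;
var
  fs: TFileStream;
  Buf: array[0..2] of Byte;
begin
  Result := False;
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // Files shorter than three bytes cannot contain the BOM.
    if fs.Read(Buf, SizeOf(Buf)) = SizeOf(Buf) then
      Result := (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF);
  finally
    fs.Free;
  end;
end;
```

A caller could then pass TEncoding.UTF8 when the function returns True and TEncoding.Default (the system ANSI code page) otherwise, accepting that BOM-less UTF-8 files will be misread.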
One thing you can do - if you decide to stick to your current method - is to change your code to:
procedure TForm1.Button1Click(Sender: TObject);
var
BookmarkIni: TCustomIniFile;
begin
// The error occurs here:
try
BookmarkIni := TMemIniFile.Create('F:\Bug fix list for RAD Studio XE8.url',
TEncoding.UTF8);
except
BookmarkIni := TIniFile.Create('F:\Bug fix list for RAD Studio XE8.url');
end;
try
// Some code here
finally
BookmarkIni.Free;
end;
end;
You don't need two separate variables, because TIniFile and TMemIniFile (as well as TRegistryIniFile) share a common ancestor: TCustomIniFile. By declaring your variable as this common ancestor, you can instantiate (create) it as any of the class types that inherit from TCustomIniFile. The actual (run-time) type is determined by which constructor you call to create it.
But first, you should try to use
BookmarkIni := TMemIniFile.Create('F:\Bug fix list for RAD Studio XE8.url');
i.e. without any encoding specified, and see if it works with both ANSI and UTF-8 files.
EDIT: Here's a test program to verify my claim made in the comments:
program Project21;
{$APPTYPE CONSOLE}
uses
IniFiles, System.SysUtils;
const
FileName = 'F:\Bug fix list for RAD Studio XE8.url';
var
TXT : TextFile;
procedure Test;
var
BookmarkIni: TCustomIniFile;
begin
try
BookmarkIni := TMemIniFile.Create(FileName,TEncoding.UTF8);
except
BookmarkIni := TIniFile.Create(FileName);
end;
try
Writeln(BookmarkIni.ReadString('MyApp','Notes','xxx'))
finally
BookmarkIni.Free;
end;
end;
begin
try
AssignFile(TXT,FileName); REWRITE(TXT);
try
WRITELN(TXT,'[InternetShortcut]');
WRITELN(TXT,'URL=http://edn.embarcadero.com/article/44358');
WRITELN(TXT,'[MyApp]');
WRITELN(TXT,'Notes=The German a umlaut consists of the following two ANSI characters: '#$C3#$A4);
WRITELN(TXT,'Icon=default');
WRITELN(TXT,'Title=Bug fix list for RAD Studio XE8');
finally
CloseFile(TXT)
end;
Test;
ReadLn
except
on E: Exception do
Writeln(E.ClassName, ': ', E.Message);
end;
end.
The rule of thumb: to read data (a file, a stream, whatever) correctly you must know the encoding! The best solution is to let the user choose the encoding or to force one, e.g. UTF-8.
Moreover, knowing only that a file is "ANSI" does not make things much easier unless you also know the code page.
A must read - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Another approach is to try to detect the encoding (as browsers do with sites when no encoding is specified). Detecting UTF encodings is relatively easy if a BOM exists, but more often than not it is omitted. Take a look at Mozilla's universalchardet or chsdet.
When I save the content of a TStringList object to a file, the file is correctly saved as UTF-8, but with a BOM by default.
My code is:
myFile := TStringList.Create;
try
myFile.Text := myData;
myFile.saveToFile('myfile.dat', TEncoding.UTF8)
finally
FreeAndNil(myFile);
end;
In this example the file "myfile.dat" appears with "UTF-8 BOM" encoding.
How can I save the file without BOM?
You simply have to set the property TStrings.WriteBOM to false.
The documentation tells us about this:
Will cause SaveToStream or SaveToFile to write a BOM.
Set WriteBOM to True to cause SaveToStream to write a BOM (byte-order mark) to the stream and to cause SaveToFile to write a BOM to the file.
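Applied to the code from the question, that could look like the sketch below. The wrapper procedure SaveUtf8WithoutBom is a hypothetical name used only to keep the example self-contained; it assumes a Delphi version in which TStrings has the WriteBOM property.

```pascal
uses
  System.Classes, System.SysUtils;

// Hypothetical helper: saves Data as UTF-8 without a byte order mark.
procedure SaveUtf8WithoutBom(const FileName, Data: string);
var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    Lines.WriteBOM := False;  // suppress the preamble on save
    Lines.Text := Data;
    Lines.SaveToFile(FileName, TEncoding.UTF8);
  finally
    Lines.Free;
  end;
end;
```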
You can achieve this by creating your own encoding class descended from TUTF8Encoding and overriding the GetPreamble method :-
type
TUTF8EncodingNoBOM = class(TUTF8Encoding)
public
function GetPreamble: TBytes; override;
end;
function TUTF8EncodingNoBOM.GetPreamble: TBytes;
begin
SetLength(Result, 0);
end;
I have a text file that contains many NULL CHARACTERS and its encoding is UTF8.
I loaded the file using RichEdit1.Lines.LoadFromFile(FileName, Encoding), but it stopped after the first null character and didn't load the rest of the file.
Can anyone help? How can I remove null characters from a text file?
**BTW My text file encoding is UTF8.
Reading the file shouldn't be a problem. Rather, the problem is more likely when you try to store the data in a rich-edit control. Those controls don't accept arbitrary binary data. You need to ensure you only put text in that control.
Load the file into an ordinary string or stream:
var
s: string;
ss: TStringStream;
s := TFile.ReadAllText(FileName);
Then remove the invalid characters. #0 is the notation in Delphi to represent a null character. Ordinarily, we might use StringReplace to remove characters:
s := StringReplace(s, #0, '', [rfReplaceAll]);
However, it's not binary-safe; it stops at null characters. Instead, you'll need a different function for removing those characters. I've demonstrated that before. Call that function to adjust the string:
RemoveNullCharacters(s);
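The function referred to above is not reproduced in this text. A minimal stand-in with the same name and call shape might look like this; it is an assumption about what such a function does, not the original implementation. It compacts the string in place by copying every non-null character towards the front and then truncating.

```pascal
// Hypothetical stand-in: removes every #0 character from S in place.
procedure RemoveNullCharacters(var S: string);
var
  Src, Dst: Integer;
begin
  Dst := 0;
  for Src := 1 to Length(S) do
    if S[Src] <> #0 then
    begin
      Inc(Dst);
      S[Dst] := S[Src];  // Dst never exceeds Src, so nothing unread is overwritten
    end;
  SetLength(S, Dst);
end;
```

Because it indexes the string by position rather than treating it as a null-terminated PChar, embedded null characters do not cut the scan short.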
Finally, put the data in the rich-edit control:
ss := TStringStream.Create(s);
try
RichEdit1.Lines.LoadFromStream(ss, Encoding);
finally
ss.Free;
end;
Are you sure it is a UTF-8 file and not a UTF-16 ("Unicode") file? As you may know, UTF-16 uses at least two bytes per character, and for plain ASCII characters one of those two bytes is zero, so ordinary Western text saved as UTF-16 is full of null bytes.
Have you tried opening the file with the IDE editor? Open it, select all the text (Ctrl+A), copy it (Ctrl+C), create a new empty text file and paste (Ctrl+V) the text.
Save the new file and try the RichEdit with this new file.