How to read a text file that contains 'NULL CHARACTER' in Delphi?

I have a UTF-8 encoded text file that contains many null characters.
I loaded the file using RichEdit1.Lines.LoadFromFile(FileName, Encoding), but loading stopped at the first null character and the rest of the file was not loaded.
Is there any way around this? How can I remove the null characters from a text file?

Reading the file shouldn't be a problem. Rather, the problem is more likely when you try to store the data in a rich-edit control. Those controls don't accept arbitrary binary data. You need to ensure you only put text in that control.
Load the file into an ordinary string or stream:
var
s: string;
ss: TStringStream;
s := TFile.ReadAllText(FileName);
Then remove the invalid characters. #0 is the notation in Delphi to represent a null character. Ordinarily, we might use StringReplace to remove characters:
s := StringReplace(s, #0, '', [rfReplaceAll]);
However, it's not binary-safe; it stops at null characters. Instead, you'll need a different function for removing those characters. I've demonstrated that before. Call that function to adjust the string:
RemoveNullCharacters(s);
Finally, put the data in the rich-edit control:
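// Encode the stream with the same encoding that LoadFromStream will decode with;
// False means the stream does not take ownership of Encoding.
ss := TStringStream.Create(s, Encoding, False);
try
  RichEdit1.Lines.LoadFromStream(ss, Encoding);
finally
  ss.Free;
end;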
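The RemoveNullCharacters function mentioned above is not shown in this answer. A minimal sketch of what such a helper might look like follows; the name and signature are assumptions taken from the call above, and it assumes 1-based strings and Delphi XE2 or later (drop the System. prefix on older versions):

```delphi
program RemoveNullsDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

// Hypothetical helper (not the original author's code): removes every #0
// from S in place by compacting the remaining characters to the front.
procedure RemoveNullCharacters(var S: string);
var
  ReadIdx, WriteIdx: Integer;
begin
  WriteIdx := 0;
  for ReadIdx := 1 to Length(S) do
    if S[ReadIdx] <> #0 then
    begin
      Inc(WriteIdx);
      S[WriteIdx] := S[ReadIdx];
    end;
  // Drop the unused tail left over after compaction.
  SetLength(S, WriteIdx);
end;

var
  Sample: string;
begin
  Sample := 'ab'#0'cd'#0;
  RemoveNullCharacters(Sample);
  Writeln(Length(Sample)); // prints 4
end.
```

Because the loop works from Length(S) rather than scanning for a terminator, embedded null characters do not cut the processing short.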

Are you sure the file is UTF-8 and not UTF-16 (which Windows tools often label "Unicode")? UTF-16 uses two bytes for most characters, and for plain ASCII text one byte of each pair is a null byte, so a UTF-16 file read as a single-byte encoding appears to be full of null characters.
Have you tried opening the file in the IDE editor? Open it, select all the text (Ctrl+A), copy it (Ctrl+C), create a new empty text file, and paste the text (Ctrl+V).
Save the new file and try the RichEdit with this new file.

Related

Problems with unicode text

I use Delphi XE3 and I have a small problem that I don't know how to fix.
The problem is with the letter "è", which appears in a file path: "C:\lène.mp4".
I save this path into a TStringList. When I save the TStringList to a file, the path is shown correctly in the text file.
But when I load it back using a TStringList, it is shown as "Ã¨" (whether I display it in a memo or assign it to a variable), so the path becomes invalid.
If I add the path string directly to the TStringList and then pass it to the path variable, it works fine.
But loading it from the file and passing it to the path variable doesn't work (I get "Ã¨" instead of "è").
I will normally be working with a lot of Unicode strings, but for now I'm struggling with that one letter.
This does not work:
var
  resp : widestring;
  xfiles : tstringlist;
begin
  xfiles := tstringlist.Create;
  try
    xfiles.LoadFromFile('C:\Demo6-out.txt'); // this file contains only "C:\lène.mp4"
    resp := (xfiles.Strings[0]);
    // if I save xfiles to a file, the path string is saved fine
  finally
    xfiles.Free;
  end;
end;
But like this it works:
var
  resp : widestring;
  xfiles : tstringlist;
begin
  xfiles := tstringlist.Create;
  try
    xfiles.Add('C:\lène.mp4');
    resp := (xfiles.Strings[0]);
  finally
    xfiles.Free;
  end;
end;
I'm really confused.
First, you should be using UnicodeString instead of WideString. UnicodeString was introduced in Delphi 2009, and is much more efficient than WideString. The RTL uses UnicodeString (almost) everywhere it previously used AnsiString prior to 2009.
Second, something else introduced in Delphi 2009 is SysUtils.TEncoding, which is used for Byte<->Character conversions. Several existing RTL classes, including TStrings/TStringList, were updated to support TEncoding when converting bytes to/from strings.
What happens when you load a file into TStringList is that an internal TEncoding object is assigned to help convert the file's raw bytes to UnicodeString values. Which implementation of TEncoding it uses depends on the character encoding that LoadFromFile() thinks the file is using, if not explicitly stated (LoadFromFile() has an optional AEncoding parameter). If the file has a UTF BOM, a matching TEncoding is used, whether that be TEncoding.UTF8 or TEncoding.(BigEndian)Unicode. If no BOM is present, and the AEncoding parameter is not used, then TEncoding.Default is used, which represents the OS's default charset locale (and thus provides backwards compatibility with existing pre-2009 code).
When saving a TStringList to file, if the list was previously loaded from a file then the same TEncoding used for loading is used for saving, otherwise TEncoding.Default is used (again, for backwards compatibility), unless overwritten by the optional AEncoding parameter of SaveToFile().
In your first example, the input file is most likely encoded in UTF-8 without a BOM. So LoadFromFile() would use TEncoding.Default to interpret the file's bytes. "Ã¨" is the result of the UTF-8 encoded form of "è" (byte octets 0xC3 0xA8) being misinterpreted as Windows-1252 instead of UTF-8. So, you would have to load the file like this instead:
xfiles.LoadFromFile('C:\Demo6-out.txt', TEncoding.UTF8);
In your second example, you are not loading a file or saving a file. You are simply assigning a string literal (which is Unicode-aware in D2009+) to a UnicodeString variable (inside of the TStringList) and then assigning that to a WideString variable (WideString and UnicodeString use the same UTF-16 character encoding; they just use different memory managers). So there are no data conversions being performed.
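The misinterpretation described above can be reproduced without any file. The following is a small sketch, assuming Delphi XE2 or later on Windows; the program and variable names are illustrative only. It encodes "lène" as UTF-8 and then decodes the same bytes once as Windows-1252 and once as UTF-8:

```delphi
program MojibakeDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

var
  Bytes: TBytes;
  Cp1252: TEncoding;
  Wrong, Right: string;
begin
  // 'l' + U+00E8 (è) + 'ne' encodes to the UTF-8 bytes 6C C3 A8 6E 65.
  Bytes := TEncoding.UTF8.GetBytes('l'#$00E8'ne');

  // Decoding as Windows-1252 turns each byte into its own character.
  Cp1252 := TEncoding.GetEncoding(1252);
  try
    Wrong := Cp1252.GetString(Bytes);
  finally
    Cp1252.Free; // encodings returned by GetEncoding are owned by the caller
  end;

  // Decoding as UTF-8 restores the original four characters.
  Right := TEncoding.UTF8.GetString(Bytes);

  // Print code points rather than glyphs to avoid console code page issues.
  Writeln(IntToHex(Ord(Wrong[2]), 4), ' ', IntToHex(Ord(Wrong[3]), 4)); // prints 00C3 00A8
  Writeln(IntToHex(Ord(Right[2]), 4));                                  // prints 00E8
end.
```

U+00C3 and U+00A8 are the characters "Ã" and "¨", which is exactly the garbage seen when the file is loaded with the wrong encoding.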
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

How to save TStringList with UNIX line endings?

I cannot figure how to save the lines of a TStringList using UNIX line endings (LF) instead of the default CRLF ones.
I've tried to use StringReplace() on the stringList.Text property without any success :-(
StringList.Text is a property that generates the text every time it is read. So when you assign the modified text back to the stringlist, you undo your changes. When you get the text again, the stringlist will just build a new string with its default line break.
This line break can be changed by setting the LineBreak property of the stringlist.
The default value for LineBreak is the sLineBreak constant, which is #13#10 on Windows and #10 on POSIX platforms such as Linux and macOS.
Otherwise, if you save StringList.Text in a string variable, you can use StringReplace to change that string or, even better, use AdjustLineBreaks.
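As a rough illustration of the LineBreak approach, assuming Delphi XE2 or later (the output file name is a placeholder, not something from the question):

```delphi
program UnixLineBreakDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils, System.Classes;

var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    // Use LF instead of the platform default when building Text.
    Lines.LineBreak := #10;
    Lines.Add('first');
    Lines.Add('second');

    // Text is now 'first'#10'second'#10, with no #13 anywhere.
    Writeln(Pos(#13, Lines.Text)); // prints 0

    // SaveToFile writes the same generated text, so the file gets LF endings.
    Lines.SaveToFile('unix.txt'); // hypothetical file name
  finally
    Lines.Free;
  end;
end.
```

Setting LineBreak before saving avoids the round trip through Text that undoes a StringReplace.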
One more possibility is to use Jedi Code Library ( http://jcl.sf.net ) with split/join functionality in their version of string list.
var so : TJclStringList; // PODO style, requires finally-free-end
si : iJclStringList; // ref-counted interface for method chaining (aka Fluent API style)
s : String;
...
s := so.Join(^J);
s := si.Join(^J);

Saving a string with null characters to a file

I have a string that contains null characters.
I've tried to save it to a file with this code:
myStringList.Text := myString;
myStringList.SaveToFile('c:\myfile');
Unfortunately myStringList.Text is empty if the source string has a null character at the beginning.
I thought only C strings were terminated by a null character, and that Delphi was always fine with embedded nulls.
How to save the content of the string to a file?
I think you mean "save a string that has #0 characters in it".
If that's the case, don't try and put it in a TStringList. In fact, don't try to save it as a string at all; just like in C, a NULL character (#0 in Delphi) causes the string to be truncated at times. Use a TFileStream and write it directly as byte content:
var
FS: TFileStream;
begin
FS := TFileStream.Create('C:\MyFile', fmCreate);
try
FS.Write(myString[1], Length(myString) * SizeOf(Char));
finally
FS.Free;
end;
end;
To read it back:
var
FS: TFileStream;
begin
FS := TFileStream.Create('C:\MyFile', fmOpenRead);
try
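// FS.Size is in bytes, so convert to a character count before sizing the string.
// An empty file would make MyString[1] invalid, so guard against it.
SetLength(MyString, FS.Size div SizeOf(Char));
if Length(MyString) > 0 then
FS.Read(MyString[1], Length(MyString) * SizeOf(Char));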
finally
FS.Free;
end;
end;
When you set the Text property of a TStrings object, the new value is parsed as a null-terminated string. Therefore when the code reaches your null character, the parsing stops.
I'm not sure why the Delphi RTL code was designed that way, and it's not documented, but that's just how setting the Text property works.
You can avoid this by using the Add method rather than the Text property.
myStringList.Clear;
myStringList.Add(myString);
myStringList.SaveToFile(FileName);
About writing strings to a file in general: I still see people creating streams or stringlists just to write some text to a file, and then destroying the stream or stringlist.
Delphi 7 didn't have IOUtils.pas yet, so if you are still on that version you're missing out.
There's a handy TFile record with class methods that lets you write text to a file with a single line, without requiring temporary variables:
TFile.WriteAllText('out.txt','hi');
Upgrading makes your life as a Delphi developer a lot easier. This is just a tiny example.

Read From Text File in Delphi 2009

I have a text file with UTF-8 encoding, and I created an application in Delphi 2009 with an open dialog, a memo and a button, and wrote this code:
if OpenTextFileDialog1.Execute then
Memo1.Lines.LoadFromFile(OpenTextFileDialog1.FileName);
When I run my application, click the button and select my text file, I see this in the memo:
"Œ ط¯ط± ط¢ظ…â€چظˆط²ط´â€Œ ع©â€چط´â€چط§ظˆط±ط²غŒâ€Œ: ط±"
The characters are not shown correctly.
How can I solve this problem?
If the file does not have a UTF-8 BOM at the beginning, then you need to tell LoadFromFile() which encoding the file uses, e.g.:
Memo1.Lines.LoadFromFile(OpenTextFileDialog1.FileName, TEncoding.UTF8);
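Whether a given file actually starts with a UTF-8 BOM can be checked with TEncoding.GetBufferEncoding. The sketch below is only an illustration under the assumption of Delphi XE2 or later; HasUtf8Bom is a hypothetical helper name, and the byte arrays stand in for the first bytes of a file:

```delphi
program BomDetectDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

// Hypothetical helper: returns True when Buffer starts with the UTF-8 BOM.
function HasUtf8Bom(const Buffer: TBytes): Boolean;
var
  Detected: TEncoding;
  BomLength: Integer;
begin
  Detected := nil; // nil asks GetBufferEncoding to detect the encoding from the BOM
  BomLength := TEncoding.GetBufferEncoding(Buffer, Detected);
  Result := (BomLength = 3) and (Detected = TEncoding.UTF8);
end;

begin
  // EF BB BF is the UTF-8 BOM, followed here by the ASCII text 'hi'.
  Writeln(HasUtf8Bom(TBytes.Create($EF, $BB, $BF, $68, $69))); // prints TRUE
  // The same text without a BOM: the encoding has to be passed explicitly.
  Writeln(HasUtf8Bom(TBytes.Create($68, $69)));                // prints FALSE
end.
```

When the check returns False for a file known to be UTF-8, passing TEncoding.UTF8 to LoadFromFile as shown above is the way to go.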
It is possible to select an encoding format in the OpenTextFile Dialog.
OpenTextFileDialog.Encodings represents a list of encodings that can be used, default list: ANSI, ASCII, Unicode, BigEndian, UTF8 and UTF7.
// Optionally add Encoding formats to the list:
FMyEncoding := TMyEncoding.Create;
OpenTextFileDialog1.Encodings.AddObject('MyEncoding', FMyEncoding);
// Don't forget to free FMyEncoding
var
  Encoding : TEncoding;
  EncIndex : Integer;
  Filename : String;
begin
  if OpenTextFileDialog1.Execute(Self.Handle) then
  begin
    Filename := OpenTextFileDialog1.FileName;
    EncIndex := OpenTextFileDialog1.EncodingIndex;
    Encoding := OpenTextFileDialog1.Encodings.Objects[EncIndex] as TEncoding;
    // No Encoding found in Objects, probably a default Encoding:
    if not Assigned(Encoding) then
      Encoding := StandardEncodingFromName(OpenTextFileDialog1.Encodings[EncIndex]);
    // Check that the file exists
    if FileExists(Filename) then
      // Display the contents in a memo based on the selected encoding.
      Memo1.Lines.LoadFromFile(FileName, Encoding);
  end;
end;

HttpGetText(), autodetect charset, and convert source to UTF8

I'm using HttpGetText with Synapse for Delphi 7 Professional to get the source of a web page - but feel free to recommend any component and code.
The goal is to save some time by 'unifying' non-ASCII characters to a single charset, so I can process it with the same Delphi code.
So I'm looking for something similar to "Select All and Convert To UTF without BOM in Notepad++", if you know what I mean. ANSI instead of UTF8 would also be okay.
Webpages are encoded in 3 charsets: UTF8, "ISO-8859-1=Win 1252=ANSI" and straight up the alley HTML4 without charset spec, ie. htmlencoded Å type characters in the content.
If I need to code a PHP page that does the conversion, that's fine too. Whatever is the least code / time.
When you retrieve a webpage, its Content-Type header (or sometimes a <meta> tag inside the HTML itself) tells you which charset is being used for the data. You would decode the data to Unicode using that charset, then you can encode the Unicode to whatever you need for your processing.
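On a Unicode-enabled Delphi (2009 or later, so not Delphi 7 as used in the question), the decode-then-encode step can be sketched with TEncoding. ToUtf8 is a hypothetical helper, and mapping the charset name from the Content-Type header to a Windows code page number (for example 28591 for ISO-8859-1) is assumed to happen elsewhere:

```delphi
program Latin1ToUtf8Demo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

// Hypothetical helper: re-encodes Raw from the given Windows code page
// into UTF-8, going through Unicode internally.
function ToUtf8(const Raw: TBytes; SourceCodePage: Integer): TBytes;
var
  Source: TEncoding;
begin
  Source := TEncoding.GetEncoding(SourceCodePage);
  try
    Result := TEncoding.Convert(Source, TEncoding.UTF8, Raw);
  finally
    Source.Free; // encodings returned by GetEncoding are owned by the caller
  end;
end;

var
  Utf8Bytes: TBytes;
begin
  // $C5 is "Å" in ISO-8859-1; in UTF-8 it becomes the two bytes C3 85.
  Utf8Bytes := ToUtf8(TBytes.Create($C5), 28591);
  Writeln(IntToHex(Utf8Bytes[0], 2), ' ', IntToHex(Utf8Bytes[1], 2)); // prints C3 85
end.
```

The same helper works in the other direction by swapping the source and destination encodings in the Convert call.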
I instead did the reverse conversion directly after retrieving the HTML using GpTextStream. Making the documents conform to ISO-8859-1 made them processable using straight up Delphi, which saved quite a bit of code changes. On output all the data was converted to UTF-8 :)
Here's some code. Perhaps not the prettiest solution but it certainly got the job done in less time. Note that this is for the reverse conversion.
procedure UTF8FileTo88591(fileName: string);
const bufsize=1024*1024;
var
fs1,fs2: TFileStream;
ts1,ts2: TGpTextStream;
buf:PChar;
siz:integer;
procedure LG2(ss:string);
begin
//dont log for now.
end;
begin
fs1 := TFileStream.Create(fileName,fmOpenRead);
fs2 := TFileStream.Create(fileName+'_ISO88591.txt',fmCreate);
//compatible enough for my purposes with default 'Windows/Notepad' CP 1252 ANSI and Swe ANSI codepage, Latin1 etc.
//also works for ASCII sources with htmlencoded accent chars, naturally
try
LG2('Files opened OK.');
GetMem(buf,bufsize);
ts1 := TGpTextStream.Create(fs1,tsaccRead,[],CP_UTF8);
ts2 := TGpTextStream.Create(fs2,tsaccWrite,[],ISO_8859_1);
try
siz:=ts1.Read(buf^,bufsize);
LG2(inttostr(siz)+' bytes read.');
if siz>0 then ts2.Write(buf^,siz);
finally
LG2('Bytes read and written OK.');
FreeAndNil(ts1);
FreeAndNil(ts2);
end;
finally
FreeAndNil(fs1);
FreeAndNil(fs2);
FreeMem(buf);
LG2('Everything freed OK.');
end;
end; // UTF8FileTo88591
