Writing to a TMemoryStream and losing Unicode - Delphi

I'm getting weird behavior when copying a TMemoryStream (containing a Unicode string) to another TMemoryStream, using Delphi XE2:
I have two instances of TMemoryStream. The first instance (SourceMS) contains Unicode text. I write some arbitrary data to the second stream (DestMS) and then copy the contents of the first stream to the second, like this:
var
  SomeInt: Integer;
  SomeByte: Byte;
  SourceMS, DestMS: TMemoryStream;
begin
  ...
  DestMS.Write(SomeInt, SizeOf(SomeInt));
  DestMS.Write(SomeByte, SizeOf(SomeByte));
  SourceMS.SaveToFile('c:\SourceMS.txt'); // SourceMS.txt contains the Unicode chars
  DestMS.CopyFrom(SourceMS, 0); // copy the whole content of SourceMS to DestMS
  DestMS.SaveToFile('c:\DestMS.txt'); // DestMS.txt DOES NOT contain Unicode chars
end;
How can I copy the contents of the first stream to the second without losing the Unicode (i.e., without an implicit conversion)?
When I say "losing unicode", I mean: the Unicode string is indeed copied to the second stream, but its encoding is lost; I get ANSI chars only.

It seems that DestMS is just some arbitrary bytes and that SourceMS is where your Unicode content resides. If you append the source to the destination, the BOM from the source will not be at the beginning of the memory stream. When Windows opens the saved text file, it won't see the BOM (because it isn't at the beginning of the file), so it won't know that characters later in the file should be treated as Unicode.
It appears that you are trying to insert some content in front of the Unicode content.
If this is true, you could place the Unicode content in a Unicode-compliant control, add the characters at the beginning, and then capture the content from the control. This would keep the BOM at the beginning of the byte stream.

Here's what could happen, judging exclusively from the five lines of code that were posted. TMemoryStream does not alter the bytes in any way, so we have to assume the raw bytes were copied faithfully from one stream to the other. Both files should contain the exact same bytes, yet when viewed in a text viewer those same bytes are not interpreted the same way.
I can only imagine one such case:
One of the files has a BOM, most likely UTF-8.
The other file has no BOM, so it's interpreted as ANSI.
It doesn't even matter which file has the BOM: going through such a process changes the way the bytes are interpreted. According to Wikipedia, the vast majority of code pages are supersets of ASCII, meaning that all bytes that fit in 7 bits are interpreted exactly the same way in both UTF-8 and ANSI. The "Unicode" characters the OP complains about are certainly either in the "extended" ANSI range (8-bit) or, in UTF-8, composed of 2 or more bytes. This gives the failure modes:
If the original is an ANSI file that contains extended (non-ASCII) characters and those are interpreted as UTF-8, the result will probably look a bit like garbage: two (or more) characters of the original file will seem to be replaced by some weird character.
If the original is UTF-8, then all international characters are represented using a minimum of two bytes: interpreted as ANSI, those bytes are rendered as distinct characters, according to the code page of the PC. For instance, 'é' encoded as UTF-8 is the two bytes $C3 $A9, which a Windows-1252 viewer displays as 'é'.

CopyFrom does indeed copy the whole source stream into the target stream, but it writes starting at the current position of the target. The arbitrary data written before still exists!
You should set DestMS.Position := 0 before you call CopyFrom.
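A minimal sketch of that fix, using the names from the question. Note that rewinding DestMS means the copy overwrites the arbitrary bytes written earlier; that is the point, since the BOM from SourceMS has to land at offset 0 for a text viewer to honour it. The Size trim is an extra safety step, not part of the original answer:

  DestMS.Position := 0;           // rewind so the copy starts at offset 0
  DestMS.CopyFrom(SourceMS, 0);   // Count = 0 copies the entire source stream
  DestMS.Size := DestMS.Position; // trim any leftover bytes past the copy
  DestMS.SaveToFile('c:\DestMS.txt');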

Related

Encode ASCII Files

This question should have a very simple answer, yes or no, I guess.
If I save my string list from a 64-bit Unicode Delphi app like this:
StringList.SaveToFile(FileName, TEncoding.ASCII);
is there any limitation or difference in file layout compared to writing the file with the statement
StringList.SaveToFile(FileName);
or
StringList.SaveToFile(FileName, TEncoding.UTF8);
I'm worried about line length and control character issues between the versions... An answer of NO would make me happy.
UTF-8 and the Windows 'Ansi' code pages are all supersets of ASCII. As such, if the string list only contains characters in the ASCII range, the three statements you listed will be equivalent, provided you precede the last one with this:
StringList.WriteBOM := False;
This is because by default, TStrings will write out a small marker (a BOM) to denote UTF-8 text.
The difference is simply in the encoding used. This, in turn, leads to differences in size: ASCII files will be smaller than UTF-16 files (which is what you get with TEncoding.Unicode), and UTF-8 files can be anywhere from the same size as ASCII to larger than UTF-16.
I guess you are asking whether using ASCII or UTF-8 in any way damages the text that is written. Well, ASCII will if the text contains non-ASCII characters: ASCII can only encode 128 characters.
On the other hand, UTF-8 is a full encoding of Unicode, which means that
StringList.SaveToFile(FileName, TEncoding.UTF8);
StringList.LoadFromFile(FileName, TEncoding.UTF8);
results in the list having exactly the same content as it did before the save.
You ask if lines can be truncated by SaveToFile. They cannot.
Another point to make is that 32/64 bit is not relevant here. The code behaves in exactly the same way under 32 and 64 bit. The issues are always to do with encoding.
I would also note that the title of your question is somewhat misleading: when you encode with TEncoding.UTF8, you do not have an ASCII file.
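As a quick sanity check, here is a minimal sketch (file names are just examples) that saves the same ASCII-only list all three ways; comparing the resulting files shows them byte-for-byte identical:

var
  SL: TStringList;
begin
  SL := TStringList.Create;
  try
    SL.Add('ASCII only: 0123456789');
    SL.WriteBOM := False;                       // suppress the 3-byte UTF-8 BOM
    SL.SaveToFile('ascii.txt', TEncoding.ASCII);
    SL.SaveToFile('default.txt');               // default (ANSI) encoding
    SL.SaveToFile('utf8.txt', TEncoding.UTF8);
    // With ASCII-only content and WriteBOM = False, all three files
    // contain exactly the same bytes.
  finally
    SL.Free;
  end;
end;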

Problem reading a TStream in Delphi XE

In the previous versions of Delphi, the following code:
var InBuf: array[1..45] of Byte;
Count := InStream.Read(InBuf, SizeOf(InBuf));
filled the variable InBuf with the correct values (every byte had a value). Now, in Delphi XE, every second byte of the array is 0, I suppose because the Byte data type is twice as big due to its Unicode nature in Delphi XE. But my streams are already generated and need to pass through this procedure, so I need another type (maybe?) that is half the size of Byte, or another solution if someone has faced this problem. Thanks
What has happened here, with >99% probability, is that you have written the stream from a string variable. Unicode strings with UTF-16 encoding have two bytes per character, whereas older versions of Delphi used ANSI encodings with one byte per character.
English text, when encoded as UTF-16, has exactly the pattern you observe: every second byte is zero.
In order to solve this you will need to investigate the section of code that writes to the stream.
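Here is a hypothetical sketch of how that pattern typically arises, and the usual remedy of converting to an 8-bit string type before writing (variable names are illustrative):

var
  U: string;      // UnicodeString in Delphi 2009+/XE: 2 bytes per Char
  A: AnsiString;  // 1 byte per AnsiChar, like the Delphi 7 string type
  OutStream: TMemoryStream;
begin
  OutStream := TMemoryStream.Create;
  try
    U := 'Hello';
    // Writing the Unicode string emits UTF-16LE bytes:
    // 'H' #0 'e' #0 'l' #0 'l' #0 'o' #0 -- every second byte is zero.
    OutStream.Write(U[1], Length(U) * SizeOf(Char));
    // Converting first reproduces the old one-byte-per-character layout.
    A := AnsiString(U);
    OutStream.Write(A[1], Length(A) * SizeOf(AnsiChar));
  finally
    OutStream.Free;
  end;
end;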

ReadLn working with WideString (UTF-8 files)

I use Delphi 7.
I need to read a UTF-8 file line by line; each line contains a word and its weight (a number).
So I need to read each line, then split it on a separator (the tab character) and save the parts in memory.
So:
1) Is there a library for working with UTF-8 files in Delphi (third-party, maybe)?
2) Will functions such as PosEx operate correctly on WideString? If they won't, can you also give a link to a third-party library for working with WideStrings?
If it is really UTF-8 that you are dealing with, then you should not need anything special to read and process it. You should be able to treat the data as PChar or even as a normal Delphi 7 string. If you try to show the contents in some kind of message box, then you may need to do some conversion. For example, I don't believe the Delphi 7 message box method would display UTF-8 strings correctly if the string contained any byte values over 127 ($7F); for something like that, you would need to convert to UTF-16 and call the Windows API MessageBoxW or something similar. Otherwise, though, UTF-8 strings can be treated in many situations the same as single-byte ANSI strings.
I don't think UTF-8 is typically referred to as "widestring". I might be wrong, but I think that typically means UTF-16.
If your file is encoded as UTF-8, and the characters you're looking for are ASCII, then there's no need to use WideString at all. ASCII is a subset of UTF-8, and any ASCII character is guaranteed not to interfere with the special encoding used for other characters in UTF-8. The number characters 0 through 9 and the tab character are all ASCII.
The JCL comes with various functions and classes for dealing with Unicode, if you find you really need to use them.
If most of your input is UTF-8, it might be worthwhile to change your code page on startup from the default to UTF-8 (code page 65001). This makes every AnsiString-to-WideString conversion effectively a lossless UTF-8-to-UTF-16 conversion.
With D7, you will need a set of so-called "Unicode" components: components based on the Win32 API -W functions. Delphi's own components only did this starting with the watershed D2009 release, which switched the default string type to UTF-16.
If you want to invest heavily in Unicode support, upgrading might be a smart thing to do.
WideString is a UTF-16 implementation (a COM BSTR-compatible one); it can't store UTF-8 strings, and if you assign an 8-bit string to it, the string will be converted to UTF-16. But unless you explicitly use the proper conversion function, Delphi will interpret the 8-bit string using the current code page.
A UTF-8 string can be stored in a Delphi AnsiString (the default string type in Delphi 7), but string manipulation functions are designed for ANSI code pages, not UTF-8. The difference is that UTF-8 is a multi-byte character set: beyond the first 127 ASCII characters, more than one byte is needed to encode a given "character", while many ANSI code pages (especially those for European languages) require only one byte each, encoding at most 255 "characters" (whereas UTF-8 can encode the whole Unicode set).
If you're just looking for the tab character, AFAIK you could simply use an AnsiString, but you have to ensure that any byte above $80 you may need to look for is not part of a multi-byte sequence. If you have more complex processing needs, it may be easier to find libraries that work on UTF-16 strings rather than UTF-8. As Rob Kennedy said, the JCL is a good starting point as a free library implementing UTF string manipulation.
You could simply read the file as-is into a normal TStringList via its LoadFrom...() methods and then loop through the list as needed. If loading the entire file into memory at once is not an option, you can open the file with a TFileStream and use the TStreamReader.ReadLine() method to read the stream line by line.
If you need to decode a given UTF-8 sequence to UTF-16 for processing, I would suggest calling the Win32 API MultiByteToWideChar() function directly, only because the RTL's UTF8Decode() function has a broken UTF-8 implementation in older Delphi versions (not sure about D7, but it definitely does in D6).
The nice thing about either loading approach is that both are encoding-aware in D2009 and later, which means that if you ever upgrade, you can make a couple of very small code changes to tell the RTL that the data is UTF-8 and it will decode it to UTF-16 for you automatically; the rest of your processing code can then remain the same (assuming you are not doing anything Ansi-specific).
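A minimal Delphi 7 style sketch of that whole-file approach, assuming the file fits in memory and each line has the form word<TAB>weight (file and variable names are illustrative):

var
  Lines: TStringList;
  Line, WordText, WeightText: string;  // AnsiString in D7; holds raw UTF-8 bytes
  I, TabPos: Integer;
begin
  Lines := TStringList.Create;
  try
    Lines.LoadFromFile('words.txt');   // D7 loads the raw bytes, no decoding
    for I := 0 to Lines.Count - 1 do
    begin
      Line := Lines[I];
      if (I = 0) and (Copy(Line, 1, 3) = #$EF#$BB#$BF) then
        Delete(Line, 1, 3);            // strip a UTF-8 BOM, if present
      TabPos := Pos(#9, Line);         // tab is ASCII, so this is UTF-8 safe
      if TabPos > 0 then
      begin
        WordText := Copy(Line, 1, TabPos - 1);         // UTF-8 encoded word
        WeightText := Copy(Line, TabPos + 1, MaxInt);  // the numeric weight
        // store WordText / StrToFloat(WeightText) as needed
      end;
    end;
  finally
    Lines.Free;
  end;
end;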

How to read Unicode characters accurately

I have a text file containing what I am told are Unicode characters, for example:
\320\222\320\21015-25'ish per main or \320\222\320\21020-40'ish per starter
Which should read:
£15-25'ish per main or £20-40'ish per main starter
However, when viewing this text in Firefox, the output is mangled with various unwanted characters.
So, are these really Unicode characters? And if so, how can I convert them to a form that displays correctly?
You need to:
know the encoding of the text file
read the data without losing information (either by reading it as binary or by reading it as text with the right encoding)
write the data with the right encoding (either by writing it out in binary and specifying the original encoding, or writing it out as text in an encoding which you also specify in the headers)
Try to separate the problem into "reading" and/or "writing". Do you know the encoding of the file? What do you have to do with the file? When you've written it with backslashes, is that actually what's in the file (i.e., an escaped form), or is it actually just a "normal" text encoding such as UTF-8?
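Expressed in Delphi terms, and assuming Delphi 2009 or later with a genuinely UTF-8 file (file names are placeholders), the read-and-write round trip would look like this:

var
  SL: TStringList;
begin
  SL := TStringList.Create;
  try
    // Decode using the encoding the file was actually written in.
    SL.LoadFromFile('input.txt', TEncoding.UTF8);
    // ...inspect or transform the now correctly decoded text...
    // Re-encode explicitly on the way out.
    SL.SaveToFile('output.txt', TEncoding.UTF8);
  finally
    SL.Free;
  end;
end;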

Delphi 2009 Unicode + ANSI problem

I'm porting an ISAPI (page producers) application from Delphi 7 to Delphi 2009. The pages are based on HTML files in UTF-8.
Everything goes well except when OnHTMLTag is fired and I replace a transparent tag with a value containing special characters such as accented characters (áé...). Those characters are replaced in the output with an � character.
What's wrong?
As part of your debugging procedure, you should go find out exactly what byte value(s) the browser receives for the question-mark character.
As you should know, Delphi 2009's string type is Unicode, whereas in all previous versions it was ANSI. Delphi 7 introduced the Utf8String type, but Delphi 2009 made that type special. If you're not using that type for holding strings that are encoded as UTF-8, you should start doing so. Values held in Utf8String variables are converted to UnicodeString values automatically when you assign one to the other.
If you're storing your UTF-8-encoded strings in ordinary AnsiString variables, then they will be converted to Unicode using the default system code page if you assign them to a UnicodeString. That's not what you want.
If you're assigning UTF-8-encoded literals to variables of type string, stop that. That type expects its values to be encoded as UTF-16, just like WideString always has.
If you are loading your files into a TStrings descendant with LoadFromFile, then you need to start using that method's second parameter, which tells it what encoding to use; UTF-8-encoded files should use TEncoding.UTF8. Without it, the RTL checks for a BOM and otherwise falls back to the system ANSI code page (TEncoding.Default), which will mangle BOM-less UTF-8 files.
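For example, a sketch of the loading step, assuming the template files are UTF-8 (the file name is a placeholder):

var
  Page: TStringList;
begin
  Page := TStringList.Create;
  try
    // The second parameter tells the RTL how to decode the bytes;
    // without it, BOM-less UTF-8 would be misread as ANSI.
    Page.LoadFromFile('template.html', TEncoding.UTF8);
    // Page now holds properly decoded UTF-16 strings, so OnHTMLTag
    // replacements can be plain string values.
  finally
    Page.Free;
  end;
end;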
This is probably a character encoding issue.
The Delphi IDE usually uses Windows-1252 or UTF-16 to encode source code.
HTML often uses UTF-8.
You probably need some transliteration between those encodings.
For that, you need to find out exactly which encodings are used (as Rob mentions).
Or revert to HTML-escaping the accented characters (as Ralph mentions).
Can you post a small app that shows the problem? (you can email me, about anything that has jeroen in the username and pluimers.com in the domain name will arrive in my mailbox).
--jeroen
Thank you for your help. After some testing, the problem turned out to be very simple (or rather, silly):
response.ContentType := 'text/html; charset=UTF-8';
There is no need to translate manually between UnicodeString, Utf8String, AnsiString, and WideString. Delphi 2009's string handling is close to perfect.
