ReadLn working with WideString (utf-8 files) - delphi

I use delphi 7.
I need to read a utf-8 file line by line, each line contain a word and its weight (a number)
So I need to read every next line, then divide a line by a separator (tab char) and save this in memory.
So,
1) is there a library to work with utf-8 files in Delphi (3-rd party maybe)
2) will functions operate ok with widestring? I use PosEx. So, if they won't, can you also give a link to 3-rd party library to work with widestrings?

If it is really UTF-8 that you are dealing with, then you should not need anything special as far as reading and processing them. You should be able to treat them as pchar or even as a normal Delphi 7 string. If you try to show the contents in some kind of message box, then you may need to do some conversions. For example, I don't believe the Delphi 7 message box method would display UTF-8 strings correctly if the string contained any byte values over 127 (0x7f). For something like that, you would need to convert to UTF-16 and call the Windows API MessageBoxW or something similar. Otherwise, though, UTF-8 strings can be treated in many situations the same as single byte ANSI strings.
I don't think UTF-8 is typically referred to as "widestring". I might be wrong, but I think that typically means UTF-16.

If your file is encoded as UTF-8, and the characters you're looking for are ASCII, then there's no need to use WideString at all. ASCII is a subset of UTF-8, and any ASCII character is guaranteed not to interfere with the special encoding used for other characters in UTF-8. The number characters 0 through 9 and the tab character are all ASCII.
The JCL comes with various functions and classes for dealing with Unicode, if you find you really need to use them.

If most of your input is UTF-8, it might be worthwhile to change your codepage on startup from the "default" to utf8 (codepage 65001). This will make all ansistring->widestring conversions effectively become a lossless utf-8->utf-16.
With D7, you will need a set of so called "unicode" components, components that base themselves on the winapi -W functions. Delphi's own components only do this with the watershed D2009 release that switches the default string type to UTF-16.
If you want to heavily invest in Unicode support, upgrading might be a smart thing to do

WideString is an UTF-16 implementation (a COM BSTR compatible one), it can't store UTF-8 strings, if you assign an 8 bit string it will be converted to UTF-16. But unless you use explicitly the proper conversion function, Delphi will interpret the 8 bit string using the current codepage.
An UTF-8 string can be stored in a Delphi AnsiString (the default string type in Delphi 7), but string manipulation functions are designed for ANSI codepages, not UTF-8. The difference is that UTF-8 is a multi byte character set. But the first 127 ANSI characters, more than one byte is needed to encode a given "character", while many ANSI codepages (especially those for European languages) only require one byte, encoding only 255 "characters" (while UTF-8 can encode the whole Unicode set).
If you're just looking for the tab character AFAIK you could use simply an AnsiString, but you have to ensure that any byte above $80 you may need to look for is not part of a multibyte sequence. If you have more complex processing needs, it may be easier to find libraries working on UTF-16 strings than UTF-8. As Rob Kennedy said, JCL is a good starting point as a free library implementing UTF string manipulation.

You could simply read the file as-is into a normal TStringList via its LoadFrom...() methods, then loop through the list as needed. If loading the entire file into memory at one time is not an option, then you can open the file using a TFileStream and then use the TStreamReader.ReadLine() method to read the stream line-by-line.
If you need to decode a given UTF-8 sequence to UTF-16 for processing, then I would suggest using the Win32 API MultiByteToWideChar() function directly, only because the RTL's UTF8Decode() function has a broken UTF-8 implementation in older Delphi versions (not sure about D7, but it definately does in D6).
The nice thing about either loading approach is that they are both encoding-aware in D2009 and later, which means that if you ever upgrade, you can make a couple of very small code changes to tell the RTL that the data is UTF-8, and it will decode it to UTF-16 for you automatically, and then the rest of your processing code can remain the same (assuming you are not doing anything that is Ansi-specific).

Related

Converting special characters in TStringList

I am using Delphi 7 and have a routine which takes a csv file with a series of records and imports them. This is done by loading it into a TStringList with MyStringList.LoadFromFile(csvfile) and then getting each line with line = MyStringList[i].
This has always worked fine but I have now discovered that special characters are not picked up correctly. For example, Rue François Coppée comes out as Rue François Coppée - the accented French characters are the problem.
Is there a simple way to solve this?
Your file is encoded as UTF-8. For instance consider the ç. As you can see from the link, this is encoded in UTF-8 as 0xC3 0xA7. And in Windows-1252, 0xC3 encodes à and 0xA7 encodes §.
Whether or not you can handle this easily using your ANSI Delphi depends on the prevailing code page under which your program runs.
If you are using Windows 1252 then you will be fine. You just need to decode the UTF-8 encoded text with a call to UTF8Decode.
If you are using a different locale then life gets more difficult. Those characters may not be present in your locale's character set and in that case you cannot represent them in a Delphi string variable which is encoded using the prevailing ANSI charset. If this is the case then you need to use Unicode.
If you care about handling international text then you need to either:
Upgrade to a modern Delphi which has Unicode support, or
Stick to Delphi 7 and use WideString and the TNT Unicode components.
Probably it's not in UTF8 encoding. Try to convert it:
Text := UTF8Encode(Text);
Regards,

What should I use? UTF8 or UTF16?

I have to distribute my app internationally.
Let's say I have a control (like a memo) where the user enters some text. The user can be Japanese, Russian, Canadian, etc.
I want to save the string to disk as TXT file for later use. I will use MY OWN function to write the text and not something like TMemo.SaveToFile().
How do I want to save the string to disk? In UTF8 or UTF16 format?
The main difference between them is that UTF8 is backwards compatible with ASCII. As long as you only use the first 128 characters, an application that is not Unicode aware can still process the data (which may be an advantage or disadvantage, depending on your scenario). In particular, when switching to UTF16 every API function needs to be adjusted for 16bit strings, while with UTF8 you can often leave old API functions untouched if they don't do any string processing.
Also UTF8 does not depend on endianess, while UTF16 does, which may complicate string I/O.
A common misconception is that UTF16 is easier to process because each character always occupies exactly two bytes. That is, unfortunately, not true. UTF16 is a variable-length encoding where a character may either take up 2 or 4 bytes. So any difficulties associated with UTF8 regarding variable-length issues apply to UTF16 just as well.
Finally, storage sizes: Another common myth about UTF16 is that it is more storage-efficient than UTF8 for most foreign languages. UTF8 takes less storage for all European languages, which can be encoded with one or two bytes per character. Non-BMP characters take up 4 bytes in both UTF8 and UTF16. The only case in which UTF16 takes less storage is if your text mainly consists of characters from the range U+0800 through U+FFFF, where the characters for Chinese, Japanese and Hindi are stored.
James McNellis gave an excellent talk at BoostCon 2014, discussing the various trade-offs between different encodings in great detail. Even though the talk is titled Unicode in C++, the entire first half is actually language agnostic. A video recording of the full talk is available at Boostcon's Youtube channel, while the slides can be found on github.
Depends on the language of your data.
If your data is mostly in western languages and you want to reduce the amount of storage needed, go with UTF-8 as for those languages it will take about half the storage of UTF-16. You will pay a penalty when reading the data as it will be / needs to be converted to UTF-16 which is the Windows default and used by Delphi's (Unicode) string.
If your data is mostly in non-western languages, UTF-8 can take more storage than UTF-16 as it may take up to 6 4 bytes per character for some. (see comment by #KennyTM)
Basically: do some tests with representative samples of your users' data and see which performs better, both in storage requirements and load times. We have had some surprises with UTF-16 being slower than we thought. The performance gain of not having to transform from UTF-8 to UTF-16 was lost because of disk access as the data volume in UTF-16 is greater.
First of all, be aware that the standard encoding under Windows is UCS2 (until Windows 2000) or UTF-16 (since XP), and that Delphi native "string" type uses the same native format since Delphi 2009 (string=UnicodeString char=WideChar).
In all cases, it is it unsafe to assume 1 WideChar == 1 Unicode character - this is the surrogate problem.
About UTF-8 or UTF-16 choice, it depends on the storage itself:
If your file is a plain text file (including XML) you may use either UTF-8 or UTF-16 - but you will have to use a BOM at the beginning of the file, otherwise applications (like Notepad) may be confused at opening - for XML this is handled by your library (if it is not, change to another library);
If you are sure that your content is mostly 7 bit ASCII, use UTF-8 and the associated BOM;
If your file is some kind of database or a custom binary format, certainly the best format is UTF-16/UCS2, i.e. the default Delphi 2009+ string layout, and certainly the default database API layout;
Some file formats require or prefer UTF-8 (like JSON or even SQLite3), even if UTF-8 files can be bigger than UTF-16 for Asiatic characters.
For instance, we used UTF-8 for our Client-Server framework, since we use JSON as exchange format (which requires UTF-8), and since SQlite3 likes UTF-8. Of course, we had to write some dedicated functions and classes, to avoid conversion to/from string (which is slow for the string=UnicodeString type since Delphi 2009, and may loose some data when used with string=AnsiString type before Delphi 2009. See this post and this unit). The easiest is to rely on the string=UnicodeString type, use the RTL functions which handles directly UTF-16 encoding, and avoid conversions. And do not forget about your previous question.
If disk space and read/write speed is a problem, consider using compression instead of changing the encoding. There are real-time compression around (faster than ZIP), like LZO or our SynLZ.

Delphi 2006 translate from/into French/Dutch/German with a single ansi codepage

I need to make some translations from/into the French/Dutch/German languages using Delphi 2006 (without any third party units/components).
These 3 languages have the code page 1252. Our database is UTF-8 compliant, so at this moment I rely on the fact that all the values from the tables are UTF-8. Should I be confident on this assuming? This will work well, or I should worry about UTF-8 -> code page 1252 differences, if there are any? I didn't understand the difference between UTF-8 and code pages(for example I understood that the first 127 bytes are the same, and begging with the 128th byte are different).
Second, I need to make a search on some fields. Can I rely on ANSIUpperCase function from D2006? Or should I do a custom function, to treat each special character?
LE: data is stored in UTF-8 format.
Thanks in advance!
The database being UTF8-compliant doesn't mean the data is actually stored in UTF8. E.g. in Firebird (which is UTF8-compliant) you can declare tables using ANSI character sets.
You'll need to convert from UTF8 to ANSI 1252 and vice versa. E.g. with UTF8Encode and UTF8Decode routines.

Delphi XE - should I use String or AnsiString?

I finally upgraded to Delphi XE. I have a library of units where I use strings to store plain ANSI characters (chars between A and U). I am 101% sure that I will never ever use UNICODE characters in those places.
I want to convert all other libraries to Unicode, but for this specific library I think it will be better to stick with ANSI. The advantage is the memory requirement as in some cases I load very large TXT files (containing ONLY Ansi characters). The disadvantage might be that I have to do lots and lots of typecasts when I make those libraries to interact with normal (unicode) libraries.
There are some general guidelines to show when is good to convert to Unicode and when to stick with Ansi?
The problem with general guidelines is that something like this can be very specific to a person's situation. Your example here is one of those.
However, for people Googling and arriving here, some general guidelines are:
Yes, convert to Unicode. Don't try to keep an old app fully using AnsiStrings. The reason is that the whole VCL is Unicode, and you shouldn't try to mix the two, because you will convert every time you assign a Unicode string to an ANSI string, and that is a lossy conversion. Trying to keep the old way because it's less work (or some similar reason) will cause you pain; just embrace the new string type, convert, and go with it.
Instead of randomly mixing the two, explicitly perform any conversions you need to, once - for example, if you're loading data from an old version of your program you know it will be ANSI, so read it into a Unicode string there, and that's it. Ever after, it will be Unicode.
You should not need to change the type of your string variables - string pre-D2009 is ANSI, and in D2009 and alter is Unicode. Instead, follow compiler warnings and watch which string methods you use - some still take an AnsiString parameter and I find it all confusing. The compiler will tell you.
If you use strings to hold bytes (in other words, using them as an array of bytes because a character was a byte) switch to TBytes.
You may encounter specific problems for things like encryption (strings are no longer byte/characters, so 'character' for 'character' you may get different output); reading text files (use the stream classes and TEncoding); and, frankly, miscellaneous stuff. Search here on SO, most things have been asked before.
Commenters, please add more suggestions... I mostly use C++Builder, not Delphi, and there are probably quite a few specific things for Delphi I don't know about.
Now for your specific question: should you convert this library?
If:
The values between A and U are truly only ever in this range, and
These values represent characters (A really is A, not byte value 65 - if so, use TBytes), and
You load large text files and memory is a problem
then not converting to Unicode, and instead switching your strings to AnsiStrings, makes sense.
Be aware that:
There is an overhead every time you convert from ANSI to Unicode
You could use UTF8String, which is a specific type of AnsiString that will not be lossy when converted, and will still store most text (Roman characters) in a single byte
Changing all the instances of string to AnsiString could be a bit of work, and you will need to check all the methods called with them to see if too many implicit conversions are being performed (for performance), etc
You may need to change the outer layer of your library to use Unicode so that conversion code or ANSI/Unicode compiler warnings are not visible to users of your library
If you convert to Unicode, sets of characters (can't remember the syntax, maybe if 'S' in MySet?) won't work. From your description of characters A to U, I could guess you would like to use this syntax.
My recommendation? Personally, the only reason I would do this from the information you've given is the memory use, and possibly performance depending on what you're doing with this huge amount of A..Us. If that truly is significant, it's both the driver and the constraint, and you should convert to ANSI.
You should be able to wrap up the conversion at the interface between this unit and its clients. Use AnsiString internally and string everywhere else and you should be fine.
In general only use AnsiString if it is important that the Chars are single bytes, Otherwise the use of string ensures future compatibility with Unicode.
You need to check all libraries anyway because all Windows API functions in Delhpi XE replaced by their unicode-analogues, etc. If you will never use UNICODE you need to use Delphi 7.
Use AnsiString explicitly everywhere in this unit and then you'll get compiler warning errors (which you should never ignore) for String to AnsiString conversion errors if you happen to access the routines incorrectly.
Alternately, perhaps preferably depending on your situation, simply convert everything to UTF8.
Stick with Ansi strings ONLY if you do not have the time to convert the code properly. The use of Ansi strings is really only for backward compatibility - to my knowledge C# does not have an equiavalent to Ansi strings. Otherwise use the standard Unicode strings. If you have a look on my web-site I have a whole strings routines unit (about 5,000 LOC) that works with both Delphi 2007 (non-Uniocde) and XE (Unicode) with only "string" interfaces and contains almost all of the conversion issues you might face.

delphi 2009 unicode + ansi problem

I'm porting an isapi (pageproducers) application from delphi 7 to delphi 2009, the pages are based on html files in UTF8.
Everything goes well except when Onhtmltag is fired and I replace a transparent tag with any value with special characters like accented characters (áé...) Those characters are replaced in the output with an � character.
What's wrong?
As part of your debugging procedure, you should go find out exactly what byte value(s) the browser receives for the question-mark character.
As you should know, Delphi 2009's string type is Unicode, whereas all previous version were ANSI. Delphi 7 introduced the Utf8String type, but Delphi 2009 made that type special. If you're not using that type for holding strings that are encoded as UTF-8, then you should start doing so. Values held in Utf8String variables will be converted to UnicodeString values automatically when you assign one to the other.
If you're storing your UTF-8-encoded strings in ordinary AnsiString variables, then they will be converted to Unicode using the default system code page if you assign them to a UnicodeString. That's not what you want.
If you're assigning UTF-8-encoded literals to variables of type string, stop that. That type expects its values to be encoded as UTF-16, just like WideString always has.
If you are loading your files into a TStrings descendant with LoadFromFile, then you need to start using that method's second parameter, which tells it what encoding to use. UTF-8-encoded files should use TEncoding.UTF8. The default is TEncoding.Unicode, which is little-endian UTF-16.
This is probably a character encoding issue.
The Delphi IDE usually uses Windows-1252 or UTF-16 to encode source code.
HTML often uses UTF-8.
You probably need some transliteration between those encodings.
For that you need to find out what encodings are used exactly (like Rob mentions).
Or revert to HTML escaping accented characters (like Ralph mentions)
Can you post a small app that shows the problem? (you can email me, about anything that has jeroen in the username and pluimers.com in the domain name will arrive in my mailbox).
--jeroen
Thank you for your help, after some test the problem was very very simple (or stupid also)
response.contenttype := 'text/html charset=UTF-8'
No need to translate manually between unicodestring utf8string ansistring widestring. Delphi 2009 string usage is near to perfect.

Resources