I'm porting an ISAPI (page producers) application from Delphi 7 to Delphi 2009. The pages are based on HTML files encoded as UTF-8.
Everything goes well except when OnHTMLTag is fired and I replace a transparent tag with a value containing special characters such as accented letters (áé...). Those characters are replaced in the output with a � character.
What's wrong?
As part of your debugging procedure, you should go find out exactly what byte value(s) the browser receives for the question-mark character.
As you should know, Delphi 2009's string type is Unicode, whereas all previous versions were ANSI. Delphi 7 introduced the Utf8String type, but Delphi 2009 made that type special. If you're not using that type for holding strings that are encoded as UTF-8, then you should start doing so. Values held in Utf8String variables will be converted to UnicodeString values automatically when you assign one to the other.
If you're storing your UTF-8-encoded strings in ordinary AnsiString variables, then they will be converted to Unicode using the default system code page if you assign them to a UnicodeString. That's not what you want.
If you're assigning UTF-8-encoded literals to variables of type string, stop that. That type expects its values to be encoded as UTF-16, just like WideString always has.
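To make the difference between the two 8-bit types concrete, here is a minimal sketch, assuming Delphi 2009 or later. The program name and the helper MakeUtf8EAcute are made up for illustration. The bytes are set one by one so the result does not depend on how the source file is saved.

```pascal
program Utf8StringDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns "é" (U+00E9) as its two UTF-8 bytes.
function MakeUtf8EAcute: UTF8String;
begin
  SetLength(Result, 2);
  Result[1] := AnsiChar($C3);
  Result[2] := AnsiChar($A9);
end;

var
  Utf8: UTF8String;
  Ansi: AnsiString;
  FromUtf8, FromAnsi: string; // UnicodeString in Delphi 2009 and later
begin
  Utf8 := MakeUtf8EAcute;

  // The same two bytes, but held in a string tagged with the ANSI code page.
  SetLength(Ansi, 2);
  Ansi[1] := AnsiChar($C3);
  Ansi[2] := AnsiChar($A9);

  FromUtf8 := string(Utf8); // decoded as UTF-8
  FromAnsi := string(Ansi); // decoded with the system code page

  Writeln('From UTF8String: ', Length(FromUtf8), ' char(s), first = U+',
    IntToHex(Ord(FromUtf8[1]), 4));
  // System dependent: on a Windows-1252 machine this should be 2 chars.
  Writeln('From AnsiString: ', Length(FromAnsi), ' char(s)');
end.
```

The first line of output should report a single character, U+00E9. What the second line reports depends on the machine's ANSI code page, which is exactly the problem.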
If you are loading your files into a TStrings descendant with LoadFromFile, then you need to start using that method's second parameter, which tells it what encoding to use. UTF-8-encoded files should use TEncoding.UTF8. If you omit it, the file is checked for a byte-order mark, and a file without one is read with TEncoding.Default, which is the system ANSI code page.
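A small sketch of the LoadFromFile call, assuming Delphi 2009 or later. The file name, the sample bytes and the helper names are made up for the example. It writes a BOM-less UTF-8 file first so that it can run on its own.

```pascal
program LoadUtf8File;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  SampleFile = 'utf8_sample.txt'; // hypothetical file name
  // "café" encoded as UTF-8, without a byte-order mark
  SampleBytes: array[0..4] of Byte = ($63, $61, $66, $C3, $A9);

procedure WriteSample;
var
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(SampleFile, fmCreate);
  try
    Stream.WriteBuffer(SampleBytes, SizeOf(SampleBytes));
  finally
    Stream.Free;
  end;
end;

function LoadFirstLine: string;
var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    // The second parameter states how the bytes in the file are encoded.
    Lines.LoadFromFile(SampleFile, TEncoding.UTF8);
    Result := Lines[0];
  finally
    Lines.Free;
  end;
end;

begin
  WriteSample;
  Writeln(Length(LoadFirstLine), ' characters'); // 4: c, a, f, U+00E9
end.
```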
This is probably a character encoding issue.
The Delphi IDE usually uses Windows-1252 or UTF-16 to encode source code.
HTML often uses UTF-8.
You probably need some transliteration between those encodings.
For that you need to find out what encodings are used exactly (like Rob mentions).
Or revert to HTML escaping accented characters (like Ralph mentions)
Can you post a small app that shows the problem? (you can email me, about anything that has jeroen in the username and pluimers.com in the domain name will arrive in my mailbox).
--jeroen
Thank you for your help. After some tests, the problem turned out to be very simple (or stupid):
Response.ContentType := 'text/html; charset=UTF-8';
There is no need to translate manually between UnicodeString, UTF8String, AnsiString and WideString. Delphi 2009 string handling is close to perfect.
I am using Delphi 7 and have a routine which takes a csv file with a series of records and imports them. This is done by loading it into a TStringList with MyStringList.LoadFromFile(csvfile) and then getting each line with line := MyStringList[i].
This has always worked fine but I have now discovered that special characters are not picked up correctly. For example, Rue François Coppée comes out as Rue FranÃ§ois CoppÃ©e - the accented French characters are the problem.
Is there a simple way to solve this?
Your file is encoded as UTF-8. For instance, consider the ç. It is encoded in UTF-8 as 0xC3 0xA7. And in Windows-1252, 0xC3 encodes Ã and 0xA7 encodes §.
Whether or not you can handle this easily using your ANSI Delphi depends on the prevailing code page under which your program runs.
If you are using Windows 1252 then you will be fine. You just need to decode the UTF-8 encoded text with a call to UTF8Decode.
If you are using a different locale then life gets more difficult. Those characters may not be present in your locale's character set and in that case you cannot represent them in a Delphi string variable which is encoded using the prevailing ANSI charset. If this is the case then you need to use Unicode.
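As a rough sketch of the Windows-1252 case, assuming Delphi 7, where string is AnsiString and #$C3 in a literal is a single byte. The helper name Utf8LineToAnsi is made up. Each line read from the TStringList would be passed through it.

```pascal
program DecodeUtf8Line;
{$APPTYPE CONSOLE}

// Hypothetical helper: decodes one UTF-8 encoded line into the
// program's ANSI code page (via an intermediate WideString).
function Utf8LineToAnsi(const Raw: string): string;
begin
  Result := UTF8Decode(Raw);
end;

var
  Raw: string;      // the bytes exactly as LoadFromFile delivered them
  Wide: WideString;
begin
  // "Rue François Coppée" spelled out as UTF-8 bytes
  Raw := 'Rue Fran'#$C3#$A7'ois Copp'#$C3#$A9'e';
  Wide := UTF8Decode(Raw);
  Writeln(Length(Raw), ' bytes, ', Length(Wide), ' characters'); // 21 bytes, 19 characters
  // Under Windows-1252 every character survives, so this is also 19.
  Writeln(Length(Utf8LineToAnsi(Raw)), ' ANSI characters');
end.
```

Under a code page that lacks ç or é the last conversion would substitute other characters, which is the situation the WideString route avoids.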
If you care about handling international text then you need to either:
Upgrade to a modern Delphi which has Unicode support, or
Stick to Delphi 7 and use WideString and the TNT Unicode components.
Probably it's not in UTF8 encoding. Try to convert it:
Text := UTF8Encode(Text);
I need to make some translations from/into the French/Dutch/German languages using Delphi 2006 (without any third party units/components).
These 3 languages use code page 1252. Our database is UTF-8 compliant, so at this moment I rely on the fact that all the values from the tables are UTF-8. Can I be confident in this assumption? Will this work well, or should I worry about differences between UTF-8 and code page 1252, if there are any? I don't understand the difference between UTF-8 and code pages (for example, I understood that the first 127 bytes are the same, and that they differ beginning with the 128th byte).
Second, I need to make a search on some fields. Can I rely on ANSIUpperCase function from D2006? Or should I do a custom function, to treat each special character?
LE: data is stored in UTF-8 format.
Thanks in advance!
The database being UTF8-compliant doesn't mean the data is actually stored in UTF8. E.g. in Firebird (which is UTF8-compliant) you can declare tables using ANSI character sets.
You'll need to convert from UTF8 to ANSI 1252 and vice versa. E.g. with UTF8Encode and UTF8Decode routines.
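For a plain string-to-string round trip, the RTL also has AnsiToUtf8 and Utf8ToAnsi. A minimal sketch, assuming a pre-2009 Delphi such as Delphi 2006 and a system ANSI code page of 1252:

```pascal
program AnsiUtf8RoundTrip;
{$APPTYPE CONSOLE}

var
  Ansi, Back: string; // AnsiString before Delphi 2009
  Utf8: UTF8String;
begin
  Ansi := 'caf' + Chr($E9);  // "café" in Windows-1252: 4 bytes
  Utf8 := AnsiToUtf8(Ansi);  // é becomes the two bytes $C3 $A9
  Back := Utf8ToAnsi(Utf8);  // and back to a single byte

  // Assuming code page 1252, the lengths are 4 and 5.
  Writeln('ANSI bytes:  ', Length(Ansi));
  Writeln('UTF-8 bytes: ', Length(Utf8));
  Writeln('Round trip preserved the text: ', Back = Ansi);
end.
```

The round trip is only lossless for characters that exist in the ANSI code page, so it fits French, Dutch and German but not arbitrary Unicode text.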
I use delphi 7.
I need to read a utf-8 file line by line, each line contain a word and its weight (a number)
So I need to read every next line, then divide a line by a separator (tab char) and save this in memory.
So,
1) is there a library to work with utf-8 files in Delphi (3-rd party maybe)
2) will functions operate ok with widestring? I use PosEx. So, if they won't, can you also give a link to 3-rd party library to work with widestrings?
If it is really UTF-8 that you are dealing with, then you should not need anything special as far as reading and processing them. You should be able to treat them as pchar or even as a normal Delphi 7 string. If you try to show the contents in some kind of message box, then you may need to do some conversions. For example, I don't believe the Delphi 7 message box method would display UTF-8 strings correctly if the string contained any byte values over 127 (0x7f). For something like that, you would need to convert to UTF-16 and call the Windows API MessageBoxW or something similar. Otherwise, though, UTF-8 strings can be treated in many situations the same as single byte ANSI strings.
I don't think UTF-8 is typically referred to as "widestring". I might be wrong, but I think that typically means UTF-16.
If your file is encoded as UTF-8, and the characters you're looking for are ASCII, then there's no need to use WideString at all. ASCII is a subset of UTF-8, and any ASCII character is guaranteed not to interfere with the special encoding used for other characters in UTF-8. The number characters 0 through 9 and the tab character are all ASCII.
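To illustrate, here is a short sketch of splitting one line at the tab without decoding anything, assuming Delphi 7. The procedure name SplitLine and the sample line are invented for the example.

```pascal
program SplitWordWeight;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: splits "word<TAB>weight". The word is returned as
// raw UTF-8 bytes. Only the ASCII tab and digits are ever inspected.
procedure SplitLine(const Line: string; out AWord: string; out Weight: Integer);
var
  TabPos: Integer;
begin
  TabPos := Pos(#9, Line);
  if TabPos = 0 then
    raise Exception.Create('No tab separator found');
  AWord := Copy(Line, 1, TabPos - 1);
  Weight := StrToInt(Copy(Line, TabPos + 1, MaxInt));
end;

var
  AWord: string;
  Weight: Integer;
begin
  // "café", a tab, then "42"; the é is the two UTF-8 bytes $C3 $A9
  SplitLine('caf'#$C3#$A9#9'42', AWord, Weight);
  Writeln(Length(AWord), ' bytes, weight ', Weight); // 5 bytes, weight 42
end.
```

This works because a tab byte ($09) can never appear inside a multi-byte UTF-8 sequence, whose bytes are all $80 or above.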
The JCL comes with various functions and classes for dealing with Unicode, if you find you really need to use them.
If most of your input is UTF-8, it might be worthwhile to change your codepage on startup from the "default" to utf8 (codepage 65001). This will make all ansistring->widestring conversions effectively become a lossless utf-8->utf-16.
With D7, you will need a set of so called "unicode" components, components that base themselves on the winapi -W functions. Delphi's own components only do this with the watershed D2009 release that switches the default string type to UTF-16.
If you want to heavily invest in Unicode support, upgrading might be a smart thing to do
WideString is an UTF-16 implementation (a COM BSTR compatible one), it can't store UTF-8 strings, if you assign an 8 bit string it will be converted to UTF-16. But unless you use explicitly the proper conversion function, Delphi will interpret the 8 bit string using the current codepage.
A UTF-8 string can be stored in a Delphi AnsiString (the default string type in Delphi 7), but string manipulation functions are designed for ANSI codepages, not UTF-8. The difference is that UTF-8 is a multi-byte encoding: except for the first 128 ASCII characters, more than one byte is needed to encode a given "character", while many ANSI codepages (especially those for European languages) only require one byte and so encode at most 256 "characters" (while UTF-8 can encode the whole Unicode set).
If you're just looking for the tab character AFAIK you could use simply an AnsiString, but you have to ensure that any byte above $80 you may need to look for is not part of a multibyte sequence. If you have more complex processing needs, it may be easier to find libraries working on UTF-16 strings than UTF-8. As Rob Kennedy said, JCL is a good starting point as a free library implementing UTF string manipulation.
You could simply read the file as-is into a normal TStringList via its LoadFrom...() methods, then loop through the list as needed. If loading the entire file into memory at one time is not an option, then you can open the file using a TFileStream and then use the TStreamReader.ReadLine() method to read the stream line-by-line.
If you need to decode a given UTF-8 sequence to UTF-16 for processing, then I would suggest using the Win32 API MultiByteToWideChar() function directly, only because the RTL's UTF8Decode() function has a broken UTF-8 implementation in older Delphi versions (not sure about D7, but it definitely does in D6).
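A minimal sketch of calling the API directly. The wrapper name Utf8ToWide and the local constant are made up. It uses the usual two-call pattern: ask for the required length, then convert.

```pascal
program Utf8ToWideDemo;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

const
  Utf8CodePage = 65001; // same value as the Win32 CP_UTF8 constant

// Hypothetical wrapper around MultiByteToWideChar for UTF-8 input.
function Utf8ToWide(const S: AnsiString): WideString;
var
  Len: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // First call: how many UTF-16 code units are needed?
  Len := MultiByteToWideChar(Utf8CodePage, 0, PAnsiChar(S), Length(S), nil, 0);
  if Len <= 0 then
    Exit; // the API reported a failure
  SetLength(Result, Len);
  // Second call: do the conversion into the allocated buffer.
  MultiByteToWideChar(Utf8CodePage, 0, PAnsiChar(S), Length(S),
    PWideChar(Result), Len);
end;

var
  Source: AnsiString;
  Wide: WideString;
begin
  Source := 'caf';
  Source := Source + AnsiChar($C3) + AnsiChar($A9); // UTF-8 bytes of "é"
  Wide := Utf8ToWide(Source);
  Writeln(Length(Wide), ' characters, last = U+',
    IntToHex(Ord(Wide[Length(Wide)]), 4)); // 4 characters, last = U+00E9
end.
```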
The nice thing about either loading approach is that they are both encoding-aware in D2009 and later, which means that if you ever upgrade, you can make a couple of very small code changes to tell the RTL that the data is UTF-8, and it will decode it to UTF-16 for you automatically, and then the rest of your processing code can remain the same (assuming you are not doing anything that is Ansi-specific).
Output: Period: from 11-Ê®¶þÔÂ-10 to 13-Ê®¶þÔÂ-10
The above output is from a line like this:
FormatDateTime('dd-mmm-yy', dateValue)
The IDE is Delphi 2007 and we are trying to gear up our app to the Chinese market.
How can I display the correct characters?
With the setting turned to Hindi (India), instead of the funny characters I get "?".
I'm trying to display the date on a report, using ReportBuilder 11.
Any help will be much appreciated.
The characters seem to be correct, only IMO they have been rendered wrong.
Here's what I've done:
copied the string as presented by the OP ("11-Ê®¶þÔÂ-10 to 13-Ê®¶þÔÂ-10");
pasted it into a blank plain-text editor window with CP 1252 (Windows Latin-1) and saved;
opened the text file in a browser;
the text showed up the same as the browser chose the same codepage, so I turned on the automatic detection of character encoding, hinting it that the contents was Chinese;
the text changed to "11-十二月-10 to 13-十二月-10" (hope your browser displays correct Chinese characters here, mine does anyway) and the codepage changed to GB18030 (and I then tried GB2312, but the text wouldn't change);
well, I was curious and searched for "十二月", and it turned out to stand for "December", quite suitable for the context unless the month names had been mixed up.
So, this is why I think it's a text rendering (or whatever you call it, I'm not really sure about the term) problem.
EDIT: Of course, it must have had something to do with the data type chosen for storing the string. If the function result is AnsiString and the variable is WideString, then maybe the characters get converted as WideChars and so they are no longer one-byte compounds of multi-byte characters but are multi-byte characters on their own? At least that's what happened when the OP posted them here.
I don't know actually, but if it is so then I doubt if they can be rendered correctly unless converted back and rendered as part of an AnsiString.
Another solution is to use TntControls. They're a set of standard Delphi controls enhanced to support Unicode. You'll have to go through all your form files and replace
Button1: TButton
Label1: TLabel
with TTntButton, TTntLabel et cetera.
Please note, that as things stand, it's not only Chinese which will not work. Try any language using symbols other than standard European set (latin + stress marks etc), for instance Russian.
But
By replacing the controls, you'll solve one part of the problem. Another part is that everywhere where you use "string" or "AnsiString" and "char/pchar" or "AnsiChar/PAnsiChar", you can store only strings in default system encoding.
For instance, if your system encoding ("Language for non-unicode programs") is EN/US, Russian characters will be replaced with question marks when you assign them to "string" variable:
a: WideString;
b: string;
...
a := 'ЯУЭФЫЦ'; //WideString can store international characters
b := a; //string cannot, so the data is lost - you cannot restore it from just "b"
To store string data which is independent of system encoding, use WideString/WideChar/PWideChar and appropriate functions. If you have
a, b: WideString;
...
a := UpperCase(b);
then unicode information will still be lost because UpperCase() accepts "string":
function UpperCase(const S: string): string;
Your WideString will be converted to "string" (losing all international characters), given to UpperCase, then the result will be converted back to WideString but it's already too late.
Therefore you have to replace all string functions with Wide versions:
a := WideUpperCase(b);
(for some functions, their wide versions are unavailable or called differently, TntControls also contain a bunch of wide function versions)
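A small sketch of the difference, assuming a pre-2009 Delphi (Delphi 7 to 2007). The character is built from its code point so the source file encoding does not matter.

```pascal
program WideCaseDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Lower, Upper: WideString;
  Narrow: string;
begin
  Lower := WideChar($0436);       // Cyrillic small letter zhe
  Upper := WideUpperCase(Lower);  // stays in UTF-16 the whole time
  Writeln('WideUpperCase: U+', IntToHex(Ord(Upper[1]), 4)); // U+0416

  // UpperCase takes "string", so the WideString is converted to the
  // system code page first. What comes out is system dependent; on a
  // Western European system the letter has already become "?" here.
  Narrow := UpperCase(Lower);
  Writeln('UpperCase result, first byte: ', Ord(Narrow[1]));
end.
```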
The Chinese Market requires support for multi-byte character sets (either WideChar or Unicode).
The Delphi 2007 RTL/VCL only supports single-byte character sets (there is very limited support for WideChar in the RTL and VCL).
The easiest for you is to upgrade to a Delphi version that supports Unicode (Delphi 2009 was the first version that supports Unicode, the current Delphi version is Delphi XE).
Or you will need to update all your components to support WideChar, and rewrite the portions of RTL/VCL for which you need WideChar support.
--jeroen
Did you install Far East charset support in Windows? In Windows versions before 7 (or Vista) those charsets are not installed by default in Western versions; you have to add them in Control Panel -> Regional Settings, IIRC.
Unfortunately, with a non-Unicode version of Delphi, which characters can be displayed depends on the current codepage. If it is not one of the Chinese ones, for example, it cannot display the characters you need. What characters are actually displayed depends on how the codes you're using are mapped in the current codepage. You could use a multi-lingual version of Windows to switch fully to the locale you need, or you have to use a Unicode version of Delphi (from 2009 onwards).
I have to convert a large legacy application to Delphi 2009 which uses strings, AnsiStrings, WideStrings and UTF8 data all over the place and I have a hard time to understand how the new string types work and how they should be used.
The application fully supported Unicode using TntUnicodeControls and there are 3rd party DLLs which require strings in specific encodings, mostly UTF8 and UTF16, making the conversion task not as trivial as one would suspect.
I especially have problems with the C DLL calls and choosing the right type.
I also get the impression that there are many implicit string conversions happening, because one of the DLL seems to always receive UTF-8 encoded strings, no matter how the Delphi string is encoded.
Can someone please provide a short overview about the new Delphi 2009 string types UnicodeString and RawByteString, perhaps some usage hints and possible pitfalls when converting a pre 2009 application?
See Delphi and Unicode, a white paper written by Marco Cantù and I guess
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), written by Joel.
One pitfall is that the default Win32 API calls have been mapped to the W (wide string) versions instead of the A (ANSI) versions; for example, ShellExecute now maps to ShellExecuteW rather than ShellExecuteA. If your code is doing tricky pointer code assuming the internal layout of AnsiString, it will break. A fallback is to substitute PChar with PAnsiChar, Char with AnsiChar, string with AnsiString, and append A at the end of the Win32 API call for that portion of code. After the code actually compiles and runs normally, you could refactor your code to use string (UnicodeString).
Watch my CodeRage 4 talk on "Using Unicode and Other Encodings in your Programs" this Friday, or wait until the replay of it is available online.
I'm going to cover some encodings and explain about the string format.
The slides will be available shortly (I'll try to get them online today) and contain a lot of references to stuff you should read on the internet (but I must admit I forgot the link to Joel on Unicode that eed3si9n posted).
Will edit this answer today with the uploads and the links.
Edit:
If you have a small sample where you can show that your C/C++ DLL receives the strings UTF8 encoded, but thought they should be encoded otherwise, please post it (mail me; almost anything at the pluimers dot com gets to me, especially if you use my first name before the at sign).
Session materials can be downloaded now, including the "Using Unicode and Other Encodings in your Programs" session.
These are links from that session:
Read these:
Marco Cantu, Whitepaper “Delphi and Unicode”
Marco Cantu, Presentation “Delphi and Unicode”
Nick Hodges, Whitepaper “Delphi in a Unicode World”
Relevant on-line help topics:
What's New in Delphi and C++Builder 2009
String Types: Base: ShortString, AnsiString, WideString, UnicodeString
String Types: Unicode (including internal memory layouts of the string types)
String Types: Enabling for Unicode
String Types: RawByteString (AnsiString with CodePage $ffff)
String Types: UTF8String (AnsiString with CodePage 65001)
String <-> PChar conversions: PChar fundamentals
String <-> PChar conversions: Returning a PChar Local Variable
String <-> PChar conversions: Passing a Local Variable as a PChar
Hope this gets you going. If not, mail me and I'll try to extend the answer here.
Note that it does not only hit real string code. It also hits code where PChar is used to trawl through buffers, or to interface with APIs.
E.g. initialization code of headers that load the DLL dynamically (GetProcAddress/LoadLibrary).
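A sketch of the kind of loader code meant here, assuming Delphi 2009 or later. kernel32.dll and GetTickCount are only convenient stand-ins for a real DLL and its entry point. The file name follows the Unicode mapping, while exported names in a DLL are always 8-bit, so the PAnsiChar is spelled out.

```pascal
program DynamicLoadDemo;
{$APPTYPE CONSOLE}

uses
  Windows;

type
  TGetTickCount = function: DWORD; stdcall;

var
  Lib: HMODULE;
  GetTicks: TGetTickCount;
begin
  // LoadLibrary takes a PChar, which is PWideChar from Delphi 2009 on.
  Lib := LoadLibrary('kernel32.dll');
  if Lib = 0 then
    Halt(1);
  try
    // Export names are ANSI regardless of the Delphi version.
    @GetTicks := GetProcAddress(Lib, PAnsiChar('GetTickCount'));
    if Assigned(GetTicks) then
      Writeln('Tick count: ', GetTicks())
    else
      Writeln('Entry point not found');
  finally
    FreeLibrary(Lib);
  end;
end.
```

Old headers that kept the export name in a string variable and passed PChar(Name) are the ones that need attention.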
It seems almost all my problems come from the automatic conversion on assignments to UTF8String.
I already had old code using UTF8String just to help me think which type of string a variable should contain.
When starting to port my application, I replaced AnsiString with UTF8String for the same reason, but the code depended on UTF8String being just an alias to (classic) AnsiString
Now with the automatic conversion that assumption is no longer true, which created many problems.
Be careful if you use UTF8String when porting from pre-2009 Delphi code!
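Here is a rough sketch of that pitfall, assuming Delphi 2009 or later. The variable names are made up. The first assignment transcodes; the second part shows one way to only relabel the bytes, using RawByteString and SetCodePage with Convert set to False.

```pascal
program Utf8AssignPitfall;
{$APPTYPE CONSOLE}

const
  Utf8CodePage = 65001; // same value as the Win32 CP_UTF8 constant

var
  Legacy: AnsiString;     // old code kept UTF-8 bytes in here
  Converted: UTF8String;
  Raw: RawByteString;
begin
  // The UTF-8 bytes of "é", stored the pre-2009 way.
  SetLength(Legacy, 2);
  Legacy[1] := AnsiChar($C3);
  Legacy[2] := AnsiChar($A9);

  // Delphi 2009+: the bytes are treated as ANSI text and re-encoded as
  // UTF-8. On a Windows-1252 system the length should grow from 2 to 4.
  Converted := Legacy;
  Writeln('Before: ', Length(Legacy), ' bytes, after: ', Length(Converted),
    ' bytes');

  // Relabel instead of convert: the bytes stay as they are.
  SetLength(Raw, 2);
  Raw[1] := AnsiChar($C3);
  Raw[2] := AnsiChar($A9);
  SetCodePage(Raw, Utf8CodePage, False);
  Writeln('Code page: ', StringCodePage(Raw), ', bytes: ', Length(Raw));
end.
```

The last line should report code page 65001 and still 2 bytes.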
Another thing to watch out for when passing string between dlls built with different versions of Delphi or C++ Builder is that, starting with 2009, the StrRec part of AnsiStringBase gained two extra fields; codePage and elemSize. They are 2 bytes each (short ints), so the size of StrRec is now 12 bytes instead of 8. This can cause invalid pointer exception problems with memory allocation and destruction, even when the data part of the string seems to transfer ok.
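The two extra header fields can be read back with the RTL functions StringCodePage and StringElementSize. A tiny sketch, assuming Delphi 2009 or later:

```pascal
program StringHeaderDemo;
{$APPTYPE CONSOLE}

var
  A: AnsiString;
  U: UnicodeString;
begin
  A := 'abc';
  U := 'abc';
  // elemSize: bytes per element stored in the string header
  Writeln('AnsiString element size:    ', StringElementSize(A)); // 1
  Writeln('UnicodeString element size: ', StringElementSize(U)); // 2
  // codePage: depends on the system for AnsiString
  Writeln('AnsiString code page:    ', StringCodePage(A));
  Writeln('UnicodeString code page: ', StringCodePage(U));
end.
```

A DLL built with an older compiler knows nothing about these fields, which is why handing it a long string allocated by a 2009 module (or the reverse) goes wrong.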