Delphi routine to display arbitrary bytes in arbitrary encoding in arbitrary language

I have some byte streams that may or may not be encoded as 1) extended ASCII, 2) UTF-8, or 3) UTF-16. And they may be in English, French, or Chinese. I would like to write a simple program that allows the user to enter a byte stream and then pick one of the encodings and one of the languages and see what the string would look like when interpreted in that manner. Or simply interpret each string in each of the 9 possible ways and display them all. I would like to avoid having to switch regionalizations repeatedly. I'm using Delphi 2007. Is what I am trying to do even possible?

In Delphi 2009 or later, this would be easier, since it supports Unicode and can do most of this transparently. For older versions, you have to do a bit more manual work.
The first thing you want to do is convert the text to a common encoding, preferably UTF-16, since that's the native encoding on Windows. For that, you use the MultiByteToWideChar function. For UTF-8 to UTF-16, the language doesn't matter; for "extended ASCII", you will need to choose an appropriate source code page (e.g. Windows-1252 for English and French, and GB2312 or Big5 or some other Chinese code page, depending on what you expect to receive). To store the result, you can use a WideString, which stores UTF-16 directly.
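A minimal sketch of that conversion step, assuming Delphi 2007 and a byte stream held in an AnsiString. The helper names BytesToWide and Utf16LEBytesToWide are made up for illustration, and the code page numbers in the comments are examples you would adjust to your data:

```delphi
uses
  Windows, SysUtils;

// Hypothetical helper: decode raw bytes from a given Windows code page
// (e.g. 1252, 936 for GBK, 950 for Big5, or CP_UTF8) into UTF-16.
function BytesToWide(const Bytes: AnsiString; CodePage: UINT): WideString;
var
  Len: Integer;
begin
  Result := '';
  if Bytes = '' then
    Exit;
  // First call: ask how many WideChars the result needs.
  Len := MultiByteToWideChar(CodePage, 0, PAnsiChar(Bytes), Length(Bytes),
    nil, 0);
  if Len = 0 then
    RaiseLastOSError;
  SetLength(Result, Len);
  // Second call: convert into the allocated buffer.
  if MultiByteToWideChar(CodePage, 0, PAnsiChar(Bytes), Length(Bytes),
    PWideChar(Result), Len) = 0 then
    RaiseLastOSError;
end;

// Hypothetical helper: bytes that are already UTF-16 (little-endian)
// need no conversion, only a copy. A trailing odd byte is dropped.
function Utf16LEBytesToWide(const Bytes: AnsiString): WideString;
begin
  SetLength(Result, Length(Bytes) div 2);
  if Length(Result) > 0 then
    Move(Bytes[1], Result[1], Length(Result) * SizeOf(WideChar));
end;
```

Calling BytesToWide once per candidate code page (and Utf16LEBytesToWide for the UTF-16 case) gives one WideString per interpretation, which can then be displayed side by side.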
Once you have that, you have to draw the text somehow. That requires you to either get a Unicode-capable control (a label is likely sufficient), write one, or call the appropriate Windows API function directly to draw. This is where it can get a bit messy, because there are several functions for doing that. TextOutW is probably the simplest choice here, but another option would be DrawText. Make sure you explicitly call the W version of these functions in order to work with Unicode. (See also the related question "How do I draw Unicode text?")
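As a rough illustration of the direct API route, here is a sketch that draws a WideString on a VCL canvas with TextOutW. The procedure name DrawWideText is hypothetical; the font is selected into the device context explicitly because reading Canvas.Handle alone does not guarantee that the canvas font is currently selected:

```delphi
uses
  Windows, Graphics;

// Hypothetical helper: draw UTF-16 text on a VCL canvas via the W API.
procedure DrawWideText(Canvas: TCanvas; X, Y: Integer; const S: WideString);
var
  DC: HDC;
  OldFont: HGDIOBJ;
begin
  if S = '' then
    Exit;
  DC := Canvas.Handle;
  // Select the canvas font ourselves so the text uses Canvas.Font.
  OldFont := SelectObject(DC, Canvas.Font.Handle);
  try
    SetTextColor(DC, ColorToRGB(Canvas.Font.Color));
    SetBkMode(DC, TRANSPARENT);
    // Explicit W version: the text is passed as UTF-16.
    Windows.TextOutW(DC, X, Y, PWideChar(S), Length(S));
  finally
    SelectObject(DC, OldFont);
  end;
end;
```

A typical place to call it would be the OnPaint handler of a TPaintBox, for example DrawWideText(PaintBox1.Canvas, 0, 0, DecodedText), where PaintBox1 and DecodedText are placeholders.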
Take note: Due to CJK unification - the encoding of equivalent Chinese Hanzi, Japanese Kanji, and Korean Hanja characters at the same code points in Unicode - you need to pick a font that matches the expected kind of Chinese, traditional or simplified, in order to get expected rendering. To quote a somewhat related post by Michael Kaplan:
What it comes down to is that there are many characters which can have
four different possible looks:
Japanese will default to using MS UI Gothic (fallback to PMingLIU, then SimSun, then Gulim)
Korean will default to using Gulim (fallback to PMingLiu, then MS UI Gothic, then SimSun)
Simplified Chinese will default to using SimSun (fallback to PMingLiu, then MS UI Gothic, then Batang)
Traditional Chinese will default to using PMingLiu (fallback to SimSun, then MS Mincho, then Batang)
Unless you have a specific font you want/need to use, pick the first font in the list for the language variant you want to use, since these are standard fonts (on XP, you will need to enable East Asian Language support before they are available, on Vista and above, they are always included). If you do not do this, then Windows may either not render the characters at all (showing the missing character glyph instead), or it may use an inappropriate fallback (e.g. PMingLiu for Simplified Chinese) - the exact behavior depends on the API function you use to render the text.
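The advice to pick the first font in each list could be captured in a small lookup like the one below. This is only a sketch: the type and function names are invented for illustration, and the font names are taken directly from the quoted list.

```delphi
type
  // Hypothetical enumeration of the four CJK variants named in the quote.
  TCjkKind = (ckJapanese, ckKorean, ckSimplifiedChinese,
    ckTraditionalChinese);

// Returns the first-choice font from the quoted list for each variant.
function DefaultCjkFontName(Kind: TCjkKind): string;
begin
  case Kind of
    ckJapanese:           Result := 'MS UI Gothic';
    ckKorean:             Result := 'Gulim';
    ckSimplifiedChinese:  Result := 'SimSun';
    ckTraditionalChinese: Result := 'PMingLiU';
  else
    Result := '';
  end;
end;
```

Before drawing, you would assign the result to the font in use, for example Canvas.Font.Name := DefaultCjkFontName(ckSimplifiedChinese).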

Related

How to detect if user selected .txt file is Unicode/UTF-8 format and Convert to ANSI

My non-Unicode Delphi 7 application allows users to open .txt files.
Sometimes users try to open UTF-8/Unicode .txt files, which causes a problem.
I need a function that detects if the user is opening a txt file with UTF-8 or Unicode encoding and Converts it to the system's default code page (ANSI) encoding automatically when possible so that it can be used by the app.
In cases when converting is not possible, the function should return an error.
The ReturnAsAnsiText(filename) function should open the txt file and perform detection and conversion in steps like this:
If the byte stream has no byte values over $7F, it's ANSI; return as is
If the byte stream has byte values over $7F, convert from UTF-8
If the stream has BOM; try Unicode conversion
If conversion to the system's current code page is not possible, return NULL to indicate an error.
It will be an OK limit for this function, that the user can open only those files that match their region/codepage (Control Panel Regional Region Settings for non-Unicode apps).
The conversion function ReturnAsAnsiText, as you designed, will have a number of issues:
The Delphi 7 application may not be able to open files whose filename uses UTF-8 or UTF-16.
UTF-8 (and other Unicode) usage has increased significantly since 2019. Current web pages are between 98% and 100% UTF-8 depending on the language.
Your design will incorrectly translate some text that a standards-compliant converter would handle.
Creating the ReturnAsAnsiText is beyond the scope of an answer, but you should look at locating a library you can use instead of creating a new function. I haven't used Delphi 2005 (I believe that is 7), but I found this MIT licensed library that may get you there. It has a number of caveats:
It doesn't support all forms of BOM.
It doesn't support all encodings.
There is no universal "best-fit" behavior for single-byte character sets.
There are other issues that are tangentially described in this question. You wouldn't use an external command, but I used one here to demonstrate the point:
% iconv -f utf-8 -t ascii//TRANSLIT < hello.utf8
^h'elloe
iconv: (stdin):1:6: cannot convert
% iconv -f utf-8 -t ascii < hello.utf8
iconv: (stdin):1:0: cannot convert
Enabling TRANSLIT in standards-based libraries supports converting characters like é to ASCII e, but it still fails on characters like π, since no ASCII character is similar in form.
Your required answer would need massive UTF-8 and UTF-16 translation tables for every supported code page and BMP, and would still be unable to reliably detect the source encoding.
Notepad has trouble with this issue.
The solution as requested, would probably entail more effort than you put into the original program.
Possible solutions
Add a text editor into your program. If you write it, you will be able to read it.
The following solution pushes the translation to established tables provided by Windows.
Use native Win32 API calls to translate strings, with functions like WideCharToMultiByte. Even this has its drawbacks (quoting from the referenced page: the note is more relevant to the topic, but the caution is important for security):
Caution  Using the WideCharToMultiByte function incorrectly can compromise the security of your application. Calling this function can easily cause a buffer overrun because the size of the input buffer indicated by lpWideCharStr equals the number of characters in the Unicode string, while the size of the output buffer indicated by lpMultiByteStr equals the number of bytes. To avoid a buffer overrun, your application must specify a buffer size appropriate for the data type the buffer receives.
Data converted from UTF-16 to non-Unicode encodings is subject to data loss, because a code page might not be able to represent every character used in the specific Unicode data. For more information, see Security Considerations: International Features.
Note  The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page, unless legacy standards or data formats prevent the use of Unicode. If using Unicode is not possible, applications should tag the data stream with the appropriate encoding name when protocols allow it. HTML and XML files allow tagging, but text files do not.
This solution still has the "guess the encoding" problem, but if a BOM is present, this is one of the best translators possible.
Simply require the text file to be saved in the local code page.
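The WideCharToMultiByte option above can be sketched as follows, assuming Delphi 7 and text that has already been decoded to a WideString. The function name WideToAnsiChecked is made up for illustration. The buffer is sized in bytes by a first call, which addresses the overrun caution, and the lpUsedDefaultChar output is used to report data loss:

```delphi
uses
  Windows;

// Hypothetical helper: convert UTF-16 text to the system ANSI code page.
// Returns False if the call fails or if any character had to be replaced
// by the default character (that is, data was lost).
function WideToAnsiChecked(const W: WideString; out A: AnsiString): Boolean;
var
  Len: Integer;
  UsedDefault: BOOL;
begin
  A := '';
  Result := True;
  if W = '' then
    Exit;
  // First call: required output size in BYTES, not characters.
  Len := WideCharToMultiByte(CP_ACP, 0, PWideChar(W), Length(W),
    nil, 0, nil, nil);
  if Len = 0 then
  begin
    Result := False;
    Exit;
  end;
  SetLength(A, Len);
  UsedDefault := False;
  // Second call: convert and record whether a default char was used.
  Len := WideCharToMultiByte(CP_ACP, 0, PWideChar(W), Length(W),
    PAnsiChar(A), Len, nil, @UsedDefault);
  // Note: with flags = 0, "best fit" substitutions are not reported here.
  Result := (Len > 0) and not UsedDefault;
  if not Result then
    A := '';
end;
```

A False result corresponds to the "return an error" case in the question. Best-fit substitutions, where Windows silently maps a character to a similar-looking one, are not flagged by this sketch.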
Other thoughts:
ANSI code pages, ASCII, and UTF-8 all differ for byte values above 127, and the control characters are handled differently.
In UTF-16, every other byte (zero first) of ASCII-encoded text is 0. This is not covered in your "rules".
You simply have to search for the Turkish i to understand the complexities of Unicode translations and comparisons.
Leverage any expectations of the file contents to establish a coherent baseline comparison to make an educated guess.
For example, if it is a .csv file, find a comma in the various formats...
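For the BOM case specifically, detection is mechanical. The sketch below assumes Delphi 7 with the file contents loaded into an AnsiString; the type and function names are invented for illustration. It only recognizes the UTF-8 and UTF-16 byte order marks, and files without a BOM still need the kind of educated guess described above:

```delphi
type
  // Hypothetical result type for the check below.
  TBomKind = (bkNone, bkUtf8, bkUtf16LE, bkUtf16BE);

// Inspect the first bytes of the data for a UTF-8 or UTF-16 BOM.
// Caveat: a UTF-32 LE BOM (FF FE 00 00) would be reported as UTF-16 LE.
function DetectBom(const Data: AnsiString): TBomKind;
begin
  Result := bkNone;
  if (Length(Data) >= 3) and (Data[1] = #$EF) and (Data[2] = #$BB) and
    (Data[3] = #$BF) then
    Result := bkUtf8
  else if (Length(Data) >= 2) and (Data[1] = #$FF) and (Data[2] = #$FE) then
    Result := bkUtf16LE
  else if (Length(Data) >= 2) and (Data[1] = #$FE) and (Data[2] = #$FF) then
    Result := bkUtf16BE;
end;
```

When the result is bkNone, the data could still be UTF-8 or UTF-16 without a BOM, so this check is a first step rather than a full detector.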
Bottom Line
There is no perfect general solution, only specific solutions tailored to your specific needs, which were extremely broad in the question.

Extended ASCII characters displayed as ? (question mark)

I have a form with a bunch of flags (static images) and below each flag is a tick box. The user selects the tick box to allow them to use a particular language. At design-time, I've set the checkbox captions for each language in their localised equivalent, in this example "Español" (Spanish).
For nearly every language this is displayed just fine at runtime, but for a couple of languages this changes to "Espa?ol". Specifically, this happens when I select Lithuanian and use:
// Note: 1063 = ((SUBLANG_DEFAULT shl 10) or LANG_LITHUANIAN)
SetThreadLocale(1063);
Curiously, if I simply re-apply the caption with the following line in the form's OnShow handler, then it displays correctly as "Español".
tbLangSpanish.Caption := 'Español'; // Strange, it now corrects itself!
The above code might be improved slightly by checking to see whether the runtime caption has a "?" character in it and only then re-apply the caption. The rest of the application displays Lithuanian perfectly (with labels being set at runtime).
Note that "ñ" is extended ASCII code 241. This issue affects a couple of other extended characters such as "ç" (character 231) in "Français". Of interest is that some extended ASCII characters are displayed correctly eg. "¾" (character 190).
Is this a bug in the IDE (using Delphi 7) or just a fact of life with legacy ASCII (i.e. non-Unicode) characters? Is there a preferred way to detect incompatible design-time extended ASCII characters at runtime (perhaps based on locale)?
None of the searches I performed gave any explanation about why a character would display as "?". I'm assuming this is because the requested character must be missing from the current Windows codepage, but no reference I could find explicitly says what is displayed when this happens (nor how to overcome the problem if you cannot use UNICODE).
The ? character is what happens when a conversion from one code page to another fails because the target code page does not contain the required character. This is an inevitable consequence of programming against the ANSI Win32 API. You simply cannot represent all characters in all languages.
The only realistic way forward is to use Unicode. You have two main options starting from Delphi 7:
Stick to Delphi 7 and use the TNT Unicode components.
Upgrade to a modern version of Delphi which has native support for Unicode.

Display specific regional characters

I need to display LST ISO/IEC 8859-13 codepage characters in a window. Currently I'm using the ShowMessage function for this purpose. Everything is displayed fine when the Windows locale is from this region, but how do I deal with it when the locale is, for example, English UK? In this case I just get "?" instead of the character. There should be some way to show regional characters, since MS Word displays them without the correct locale. But how to do that?
You have two viable, tractable options:
Upgrade to a Unicode version of Delphi that has built in support for international text, or
Use the TNT Unicode controls that graft that support onto pre-Unicode Delphi by using the COM WideString type which is encoded using Unicode.
Word has no problems doing this because it uses the native Unicode API of Windows. On the other hand Delphi 7 uses the ANSI API that exists solely to provide compatibility with Windows 95/98/ME.
Short version:
you must also set the Font.Charset property if you want to be (more) sure that a particular component will display characters in a given charset.
Long version (sorry: i am prone to be wordy)
Using Unicode (and you should switch to a Unicode version of Delphi, if you haven't done so yet) does not guarantee that the fonts installed on a foreign PC will contain all the symbols you want to display.
Moreover, using Unicode does nothing to force the operating system to choose a font that actually supports the charset you need: even if there is an installed font able to display Cyrillic characters, Windows will NOT choose that font just because you are asking it to render a string containing Cyrillic Unicode code points: it will still use the default system fonts.
So there is always the possibility that you will need to ask your customers to install a font supporting the charset your application needs. If this can be a serious issue, you should consider distributing the required fonts along with your binaries (be careful with font copyrights).
Second: if there are components in your application that you are SURE will always show Russian text, then in those components you MUST assign Font.Charset = RUSSIAN_CHARSET. This is the way of telling Windows "I really need to display Cyrillic chars in this component, so choose an appropriate font, regardless of which side of the planet you are running on".
It is a common misconception that the Charset property is useless in Unicode programs. It is quite the opposite.
Another common error is to assume that the "XYZ" font is identical on all Windows installations in the world, so if I can see Cyrillic chars with Tahoma on my PC, then I am safe using Tahoma for displaying Cyrillic in the rest of the world. It is quite the opposite: a different Unicode subset gets installed depending on the computer locale.
And since, AFAIK, ShowMessage() uses the system default font, you can't use this procedure for displaying messages containing "strange" characters: you need to write your own ShowMessage dialog box.
EDIT: here is an example demonstrating what I am saying
just drop a TPaintBox component on a form, name it "pbox", and write this OnPaint event handler:
(remember to save the source in utf-8 format, otherwise the russian symbols will be mangled)
procedure TForm1.pboxPaint(Sender: TObject);
begin
  // First pass: Fixedsys with the default charset
  pbox.Canvas.Font.Name := 'Fixedsys';
  pbox.Canvas.TextOut(0, 0, 'Это русский');
  // Second pass: same font name, but request the Russian charset
  pbox.Canvas.Font.Name := 'Fixedsys';
  pbox.Canvas.Font.Charset := RUSSIAN_CHARSET;
  pbox.Canvas.TextOut(0, 20, 'Это русский');
end;
On an Italian PC (and I guess on any West European or American PC) the Fixedsys font does not normally contain the Russian characters: the first TextOut will insist on using the Fixedsys font and will write garbage. On my PC I get a sequence of black square boxes, for example.
The second textout is made after having set charset=RUSSIAN_CHARSET, so windows will know that we need the russian symbols and so chooses another font. The second TextOut is not using the FixedSys font I wanted to use, but at least it is readable!
On a russian installation of windows, both TextOut calls will correctly render the russian text using the FixedSys font, since russian installations of windows have a russian version of the fixedsys font. and Windows knows it.
You can install more than one locale on a Windows system. If you are using the matching locale then it is the default locale and you can use a dialog with a text field which uses the correct locale / character set. On your development system, where English UK is installed, add the missing language(s).
Unicode is nicer, but not required to display characters from non-default character sets (computers were able to handle many character sets before Unicode was invented). Even MS Wordpad was able to display characters from different codepages, including multi-byte character sets (Korean, Japanese, Chinese) long before Unicode existed.
ShowMessage cannot be used because it sticks to the default locale, but it can easily be replaced with a custom dialog-style form.

FormatDateTime with chinese location - wrong characters... Delphi 2007

Output: Period: from 11-Ê®¶þÔÂ-10 to 13-Ê®¶þÔÂ-10
The above output is from a line like this:
FormatDateTime('dd-mmm-yy', dateValue)
The IDE is Delphi 2007 and we are trying to gear up our app to the Chinese market.
How can I display the correct characters?
With the setting turned to Hindi (India), instead of the funny characters I get "?".
I'm trying to display the date on a report, using ReportBuilder 11.
Any help will be much appreciated.
The characters seem to be correct, only IMO they have been rendered wrong.
Here's what I've done:
copied the string as presented by the OP ("11-Ê®¶þÔÂ-10 to 13-Ê®¶þÔÂ-10");
pasted it into a blank plain-text editor window with CP 1252 (Windows Latin-1) and saved;
opened the text file in a browser;
the text showed up the same as the browser chose the same codepage, so I turned on the automatic detection of character encoding, hinting it that the contents was Chinese;
the text changed to "11-十二月-10 to 13-十二月-10" (hope your browser displays correct Chinese characters here, mine does anyway) and the codepage changed to GB18030 (and I then tried GB2312, but the text wouldn't change);
well, I was curious and searched for "十二月", and it turned out to stand for "December", quite suitable for the context unless the month names had been mixed up.
So, this is why I think it's a text rendering (or whatever you call it, I'm not really sure about the term) problem.
EDIT: Of course, it must have had something to do with the data type chosen for storing the string. If the function result is AnsiString and the variable is WideString, then maybe the characters get converted as WideChars and so they are no longer one-byte compounds of multi-byte characters but are multi-byte characters on their own? At least that's what happened when the OP posted them here.
I don't know actually, but if it is so then I doubt if they can be rendered correctly unless converted back and rendered as part of an AnsiString.
Another solution is to use TntControls. They're a set of standard Delphi controls enhanced to support Unicode. You'll have to go through all your form files and replace
Button1: TButton
Label1: TLabel
with TTntButton, TTntLabel et cetera.
Please note, that as things stand, it's not only Chinese which will not work. Try any language using symbols other than standard European set (latin + stress marks etc), for instance Russian.
But
By replacing the controls, you'll solve one part of the problem. Another part is that everywhere where you use "string" or "AnsiString" and "char/pchar" or "AnsiChar/PAnsiChar", you can store only strings in default system encoding.
For instance, if your system encoding ("Language for non-unicode programs") is EN/US, Russian characters will be replaced with question marks when you assign them to "string" variable:
a: WideString;
b: string;
...
a := 'ЯУЭФЫЦ'; //WideString can store international characters
b := a; //string cannot, so the data is lost - you cannot restore it from just "b"
To store string data which is independent of system encoding, use WideString/WideChar/PWideChar and appropriate functions. If you have
a, b: WideString;
...
a := UpperCase(b);
then unicode information will still be lost because UpperCase() accepts "string":
function UpperCase(const S: string): string;
Your WideString will be converted to "string" (losing all international characters), given to UpperCase, then the result will be converted back to WideString but it's already too late.
Therefore you have to replace all string functions with Wide versions:
a := WideUpperCase(b);
(for some functions, their wide versions are unavailable or called differently, TntControls also contain a bunch of wide function versions)
The Chinese Market requires support for multi-byte character sets (either WideChar or Unicode).
The Delphi 2007 RTL/VCL only supports single-byte character sets (there is very limited support for WideChar in the RTL and VCL).
The easiest option for you is to upgrade to a Delphi version that supports Unicode (Delphi 2009 was the first version that supports Unicode, the current Delphi version is Delphi XE).
Or you will need to update all your components to support WideChar, and rewrite the portions of RTL/VCL for which you need WideChar support.
--jeroen
Did you install Far East charset support in Windows? In Windows before 7 (or Vista) those charsets are not installed by default in Western versions; you have to add them in Control Panel -> Regional Settings, IIRC.
Unfortunately, with a non-Unicode version of Delphi, which characters can be displayed depends on the current codepage. If it is not one of the Chinese ones, for example, it cannot display the characters you need. What characters are actually displayed depends on how the codes you're using are mapped in the current codepage. You could use a multi-lingual version of Windows to switch fully to the locale you need, or you have to use a Unicode version of Delphi (from 2009 onwards).

Finding System Fonts with Delphi

What is the best way to find all the system fonts a user has available so they can be displayed in a dropdown selection box?
I would also like to distinguish between Unicode and non-Unicode fonts.
I am using Delphi 2009 which is fully Unicode enabled, and would like a Delphi solution.
The Screen.Fonts property is populated via the EnumFontFamiliesEx API function. Look in Forms.pas for an example of calling that function.
The callback function that it calls will receive a TNewTextMetricEx record, and one of the members of that record is a TFontSignature. The fsUsb field indicates which Unicode subranges the font claims to support.
The system doesn't actually have "Unicode fonts." Even the fonts that have the word Unicode in their names don't have glyphs for all Unicode characters. You can distinguish between bitmap, printer, and TrueType fonts, but beyond that, the best you can do is to figure out whether the font you're considering supports the characters you want. And if the font isn't what you'd consider a "Unicode font," but it supports all the characters you need, then what difference does it make? To get this information, you may be interested in GetFontUnicodeRanges.
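To make the fsUsb idea concrete, here is a sketch for Delphi 2009 that enumerates fonts and keeps those whose signature claims Cyrillic coverage. The names CyrillicFontProc and ListCyrillicFonts are invented, and the bit number for Cyrillic (bit 9 of the Unicode subset bitfield) is taken from the Windows font signature documentation as I understand it, so treat it as an assumption to verify. Only TrueType fonts are checked, since other font types do not supply a font signature to the callback:

```delphi
uses
  Windows, Classes;

// Callback in the same style as the one in Forms.pas. Data carries the
// TStrings that collects matching face names.
function CyrillicFontProc(var LogFont: TLogFont;
  var Metric: TNewTextMetricEx; FontType: Integer;
  Data: Pointer): Integer; stdcall;
const
  USB_CYRILLIC_BIT = 9; // assumed bit index for the Cyrillic subrange
begin
  if ((FontType and TRUETYPE_FONTTYPE) <> 0) and
    ((Metric.ntmFontSig.fsUsb[0] and (1 shl USB_CYRILLIC_BIT)) <> 0) then
    // The same face can be reported once per charset, so skip duplicates.
    if TStrings(Data).IndexOf(LogFont.lfFaceName) < 0 then
      TStrings(Data).Add(LogFont.lfFaceName);
  Result := 1; // non-zero: continue enumeration
end;

// Hypothetical helper: fill Fonts with faces that claim Cyrillic support.
procedure ListCyrillicFonts(Fonts: TStrings);
var
  DC: HDC;
  LF: TLogFont;
begin
  DC := GetDC(0);
  try
    FillChar(LF, SizeOf(LF), 0);
    LF.lfCharSet := DEFAULT_CHARSET; // enumerate all charsets
    EnumFontFamiliesEx(DC, LF, @CyrillicFontProc, LPARAM(Fonts), 0);
  finally
    ReleaseDC(0, DC);
  end;
end;
```

For example, ListCyrillicFonts(ListBox1.Items) would populate a list box. A set bit only means the font claims to support the range, not that every character in it has a glyph.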
The Microsoft technology for displaying text with different fonts based on which fonts contain which characters is Uniscribe, particularly font fallback. I'm not aware of any Delphi support for Uniscribe; I started writing a set of import units for it once, but my interests are fickle, and I moved on to something else before I completed it. Michael Kaplan's blog talks about Uniscribe sometimes, so that's another place to look.
I can answer half your question, you can get a list of the Fonts that your current environment has access to as a string list from the global Screen object
i.e.
Listbox1.Items.AddStrings(Screen.Fonts);
You can look in the forms.pas source to see how Codegear fill Screen.Fonts by enumerating the Windows fonts. The returned LOGFONT structure has a charset member, but this does not provide a simple 'Unicode' determination.
As far as I know Windows cannot tell you explicitly if a font is 'Unicode'. Moreover if you try to display Unicode text in a 'non-Unicode' font Windows may substitute a different font, so it is difficult to say whether a font will or will not display Unicode. For example I have an ancient Arial Black font file which contains no Unicode glyphs, but if I use this to display Japanese text in a D2009 memo, the Japanese shows up correctly in Arial and the rest in Arial Black. In other examples, the usual empty squares may show up.

Resources