Unicode Precomposition and Decomposition with Delphi - delphi

The Wikipedia entry for Subversion contains a paragraph about problems with different ways of Unicode encoding:
While Subversion stores filenames as Unicode, it does not specify if
precomposition or decomposition is used for certain accented
characters (such as é). Thus, files added in SVN clients running on
some operating systems (such as OS X) use decomposition encoding,
while clients running on other operating systems (such as Linux) use
precomposition encoding, with the consequence that those accented
characters do not display correctly if the local SVN client is not
using the same encoding as the client used to add the files
While this describes a specific problem with Subversion client implementations, I am not sure if the underlying Unicode composition problem could also appear with regular Delphi applications. I guess that the problem can only arise if Delphi applications are able to use both Unicode encoding ways (maybe in Delphi XE2). If yes, what could Delphi developers do to avoid it?

There is a minor display issue in that many fonts used on Windows won't render the decomposed form in the ideal way, by using the combined glyph for both the letter and the diacritical. Instead it falls back to rendering the letter and than overlaying the standalone diacritical mark on top, which typically results in a less visually pleasing, potentially-lopsided grapheme.
However that is not the issue the Subversion bug referenced from wiki is talking about. It's actually completely fine to check in filenames to SVN that contain composed or decomposed character sequences; SVN neither knows nor cares about composition, it just uses the Unicode code points as-is. As long as the backend filesystem leaves filenames in the same state as they were put in, all is fine.
Windows and Linux both have filesystems that are equally blind to composition. Mac OS X, unfortunately, does not. Both HFS+ and UFS filesystems perform ‘normalisation’ to decomposed form before storing an incoming filename, so the filename you get back won't necessarily be the same sequence of Unicode code points you put in.
It is this [IMO: insane] behaviour that confuses SVN—and many other programs—when being run on OS X. It's particularly likely to bite because Apple happened to choose decomposed (NFD) as their normalisation form, whereas most of the rest of the world uses composed (NFC) characters.
(And it's not even real NFD, but an incompatible Apple-only variant. Joy.)
The best way to cope with this is, if you can, is never to rely on the exact filename something's stored under. If you only ever read a file from a given name, that's fine, as it'll be normalised to match the filesystem at the time. But if you're reading a directory listing and trying to match filenames you find in there to what you expected the filename to be—which is what Subversion is doing—you're going to get mismatches.
To do a filename match reliably you would have to detect that you're running on OS X, and manually normalise both the filename and the string to some normal form (NFC or NFD) before doing the comparison. You shouldn't do this on other OSes which treat the two forms as different.

AFAICT, both encodings should produce the same results when displaying, and both are valid Unicode, so I don't quite see the problem there. A display routine should be able to handle both if decomposition is encountered for. A code point é should display as-is, while e´ should only display as é in decomposition mode.
The problem is not display, IMO, it is comparison, either for equality (which fails if both use a different encoding) or lexically, i.e. for sorting. That is why one should normalize to one encoding, as David says. That way there are no abmiguities anymore.

The same problem could arise in any application that deals with text. How to avoid it depends on what operations the application is performing and the question lacks specific details. Mostly I think you'd solve such problems by normalizing the text. This involves using a single preferred representation whenever you encounter ambiguity of encoding.

Related

Are there any keyboard character code differences between PC and Mac?

So I'm working on a site in PHP/JS and also a database. I have a co-worker that sends me documents written on Apple devices and I'm on a PC. Since I don't have access to a Mac, I'd like to know if spaces and punctuation are identical typed on different keyboards.
I want to be able to copy the contents of the documents and paste it in the database, however I don't want to assume that the PC dash character is the same as a Mac dash (that might be an actual minus character).. or that a PC space turns out to be a Mac narrow/en space.
I could just test a received document, but she works all over the place and never knows where she wrote what.
This is a programming question because I'm pasting mathematical expressions where such characters make a difference.. and also using PHP and JavaScript to interpret those characters.
The main issue is the character encoding in the document. Mostly likely that's a Unicode encoding (e.g. UTF-8) which is fully cross-platform.
Someone using a U.S. keyboard layout (and probably most others) intending to type something like dash/hyphen/minus would most likely produce HYPHEN-MINUS U+002D. There are, of course, ways of typing EN DASH U+2013, EM DASH U+2014, SMALL EM DASH U+FE58, HYPHEN U+2010, and others, but the user would have to do that deliberately. It wouldn't be done routinely just because they're using a Mac.
Also, some editors or word processors may do "smart substitutions", replacing the ASCII characters with fancier (more typographically correct) non-ASCII ones. That would be independent of Mac vs. PC. If it does that, the character would still come across to the PC as such, but if your use of the document data is sensitive to such differences (as is apparently the case), then that would be problematic.
It would be very unlikely that Space would routinely be anything other than a normal SPACE U+0020. There are, of course, ways of typing variants such as NO-BREAK SPACE U+00A0, EN SPACE U+2002, EM SPACE U+2003, etc., but the user would have to go out of their way to type those. And I doubt smart substitutions would replace normal spaces.

Loading textfile into stringlist with firemonkey on osx when the encoding is unknown

I am having a hard time to load a textfile into a stringlist in firemonkey on osx when the encoding of the textfile in not known.
When I just use list.loadfromfile(filename), I get most of the time an exception regarding encoding.
list.loadfromfile(filename,TEncoding.unicode) will also fail when the file is in ansi, and opposite.
There is no issue on Windows, list.loadfromfile(filename) just works, but not on osx.
I cant specify the encoding, because it will be unknown (user provide the text files).
Any clue how I can get around this encoding issue when running the app on a mac?
In general this is not possible. It is quite possible to create a single file that is valid when interpreted in all common encodings. This has been discussed many times, for instance: The Notepad file encoding problem, redux.
I'm assuming that you are working with files that do not contain byte order marks, BOMs. Obviously if your input files contained BOMs then you could simply check the BOM and be done.
With that assumption stated, the right solution to the problem, in a perfect world, is to know the encoding. Either pick a specific encoding which your program requires, or arrange for the user to tell you the encoding when they supply the file.
If, for whatever reason, you cannot do that then the next best thing to do is to use heuristics to attempt to guess the encoding used. I'm not aware of any Pascal code to do this. But you should be able to put something together that will work reasonably well. This answer gives an outline of a basic strategy: https://stackoverflow.com/a/20747074

Error encoding special characters

Bit of a strange problem (at least for me). In my Grails app I'm sending emails with some special characters (east European letters). Values of strings with special characters that I get from database are valid but the ones I create in application have "?".
Even more confusing is the fact that in development everything works fine, but when I deploy app to Tomcat instance I get the question marks.
I've set up everything to encode to UTF-8. At least I beleave so - obviously I'm missing something.
It sounds like you don't have the operating system language
packs installed for the languages you're trying to display.
While it appears as if the files themselves are saved properly, and the JVM
'understands' them because the character sets are supported, the GUIs
you're using can't display the corresponding encoding because the
underlying OS isn't displaying them.
I've experienced similar problems and the solution that
worked for me was to turn on the corresponding languages in the OS.

Binary Serialized File - Delphi

I am trying to deserialize an old file format that was serialized in Delphi, it uses binary seralization. I know nothing about the structure of the file except some very high level records that are in it.
What steps would you take to solve this problem? Any tools etc?
A good hexeditor, and use the gray matter to identify structures.
If you get a hint what kind of file it is, you can search for more specialized tools.
Running the unix/Linux "file" command can be good too (*) See Barry's comment below for how it works. It can be a quick check for common filetypes like DBF,ZIP etc hidden by using a different extension.
(*) there are 3rd party builds for windows, but they might lag in versions. If you can do it on a recent *nix distro, it is advised to do so.
The serialization process simply loops over all published properties and streams their value to a text file. If you do not know the exact classes that were streamed to the file you will have a very hard time deserializing the file. (if not impossible)
A good hex editor is first. If the file is read without buffering (eg read directly from a TFileStream) you could gain some information when using ProcMon from SysInternals; You can see exactly what data is read in what chunks and thus determine more quickly where the boundaries are between the structures you already identified.

Terminology and concepts surrounding the use of code pages

I'm in the process of researching code pages and have come across many conflicting uses of terminology, even amongst different Wikipedia entries. I just can't find a source of information that spells out the entire character handling process from start to finish. Could someone well versed in this field suggest ways in which the following information is inaccurate or incorrect:
The process of character representation as far as I understand:
We start with sets of symbols (not sure of the correct terminology here, possibly 'scripts') that are not associated with any specific platform. 'The Cyrillic alphabet' is understood to refer to the same entity in the context of Windows as in Linux, for example.
Members of these sets are selected, generally in bunches, by vendors to form a platform specific character set. The platform might assign these various codes such as GDI values on Windows (eg. 0 for ANSI_CHARSET and the other codes mentioned here: http://asa.diac24.net/wiki/index.php?title=ASS:fe&printable=yes). I cannot find much information on these sets such as whether they are in fact coded character sets or if they are simply unordered and abstract.
From these sets, individual code pages are developed that appear to have a one to one mapping with GDI values. Since these GDI values appear to represent sets that are platform dependent, does this mean Windows code pages are essentially a coded version of each individual set?
I've been having trouble reconciling this idea with a link shown to me earlier (which I've lost) that showed a one to many mapping between these GDI charsets and code pages across different platforms. Is this accurate, do these GDI values point to sets from which different code pages across different platforms can be developed?
Each code page maps a member of an abstract character set onto an integer to represent its position in the set. In the case of the 'simpler' code pages mentioned on the above webpage, these can be referred to using the more precise 'character map' term. Is this term worth considering or is the distinction too subtle and unimportant?
A font resolves a code point to a glyph if it contains one for that code point, otherwise it reports a failure. I've also read that a font may return its own blank glyph for those code points which it doesn't support. Can an application distinguish between this blank glyph and a successful resolution, ie. does the font return an error code of sorts with this blank glyph?
I believe that's the extent of my confusion. Any clarification in this regard would be invaluable. Thanks in advance.
You are essentially correct:
Start with the number of known characters.
Select a subset of this characters (a character set)
Map these to bit patterns (code page and encoding)
Render these to an output device by combining the character with a glyph (ie. using a font, a bit pattern, and a codepage/encoding that maps bit pattern to character).
Across platforms, there are similar code pages. And even across many code pages there are similar mappings of value to character. For example, Windows Latin, Mac Roman and unicode share characters for the first 127 values. There is some standardization (eg. http://en.wikipedia.org/wiki/Shift_JIS for Japanese) of codepages so that machines can interact.
Generally for new development, you should be using a unicode codepage with one of the popular encodings. UTF8 is popular on most modern systems. UTF16LE is used for Windows system calls ending in W.
This might be a good match: http://mihai-nita.net/2006/08/06/basic-lingo/

Resources