How does "cut and paste" affect character encoding and what can go wrong? - character-encoding

I have a document A in encoding A displayed in tool A and a document B in encoding B displayed in tool B. If I cut and paste (part of) B into A what might be the resultant character encoding? I realise this depends on tool A and tool B and the information held in the paste buffer (which presumably can contain an encoding?) and the operating system.
What should high-quality tools do? And in practice, how many of the common tools (e.g. Word, TextPad, various IDEs, etc.) do a good job?

First of all, a text editor's internal representation of text has no bearing on how the text is encoded (serialized) when you save the file. So a document is not "in" an encoding; it's a sequence of abstract characters. When the document is saved to a file (or transmitted over the network) then it gets encoded.
It's up to each application to decide what it puts on the clipboard. Typically, a Windows app that knows what it's doing will put a number of different representations on the clipboard. When you paste in the other app, the app will look for the representation that best suits its needs.
In your case, a text editor (that knows what it's doing) will put a Unicode representation of a selected string onto the clipboard (where Unicode, in Windows, is typically moved around as UTF-16, but that's not important). When you paste in the other app, it will insert that sequence of Unicode characters into the document at the selection point.
There's an app floating around called "ClipSpy" that will help you see what I'm talking about, interactively.
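If you don't have ClipSpy handy, you can enumerate the clipboard formats yourself. Here is a minimal Python sketch, assuming Windows and the pywin32 package (the output labels are illustrative):

    # List every format currently on the clipboard, similar to what ClipSpy shows.
    import win32clipboard

    win32clipboard.OpenClipboard()
    try:
        fmt = 0
        while True:
            fmt = win32clipboard.EnumClipboardFormats(fmt)
            if fmt == 0:
                break
            try:
                name = win32clipboard.GetClipboardFormatName(fmt)
            except Exception:
                name = "<predefined format>"  # CF_TEXT, CF_UNICODETEXT, etc. have no registered name
            print(fmt, name)
    finally:
        win32clipboard.CloseClipboard()

Copy some styled text from a word processor and you will typically see several entries: plain text, Unicode text, RTF, HTML, and so on.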

I observed the following behavior when I looked into Unicode normalization: When copying a canonically decomposed string (NFD) in Firefox in macOS 10.15.7, the string is normalized to NFC when pasting it in Chrome. What's weird is that the pasting affects the content of the clipboard: When pasting the string in Firefox again, it's then also canonically composed there. If I don't paste it anywhere else before pasting it in Firefox again, the NFD form survives. Interestingly, the problem doesn't occur in the other direction: When copying a canonically decomposed string in Chrome, it's pasted in NFD form anywhere I can tell. My conclusion is that Firefox stores text to the clipboard differently from other applications. One way to play around with this yourself is to copy 'mañana' === 'mañana' to your JavaScript console. The statement returns false if the NFD form of the string on the right survived the copy & paste.
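The same check can be done outside the browser with nothing but the standard library. A small Python sketch, using the same 'mañana' example as above:

    # Compare the composed (NFC) and decomposed (NFD) forms of the same text.
    import unicodedata

    nfc = "ma\u00f1ana"                              # 'ñ' as the single code point U+00F1
    nfd = unicodedata.normalize("NFD", nfc)          # 'n' followed by COMBINING TILDE U+0303

    print(nfc == nfd)                                # False: different code point sequences
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True once both are in the same form
    print(len(nfc), len(nfd))                        # 6 vs. 7 code points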

This is a very good question. When you copy/paste, exactly what is copied/pasted: CHARACTERS or BYTES? And if BYTES, what encoding are they in?
From the answers, it sounds like the answer is "it depends". Different programs will put different things in the clipboard, sometimes placing multiple representations.
Then the pasting program needs to pick the best one and "do the right thing" with it.

Following my conversation with @Kaspar Etter, I did some testing. Here is what I found:
Copy from and Paste to:
Firefox:
Firefox to Firefox: NO normalization
Other apps to Firefox: NO normalization
Firefox to other apps: normalization
Even if we use AppleScript, JXA, or Python to directly read the system clipboard that contains the text copied from Firefox (see the sketch after this list), the text is still normalized. Since copying and pasting from Firefox to Firefox does not involve normalization, Firefox probably does not normalize the text during the copy process. I have no idea when the normalization happens.
Safari (macOS, not iOS):
Safari to Safari: normalization
Other apps to Safari: normalization
Safari to other apps: NO normalization
For Safari (macOS), the normalization also happens at least on Canvas by instructure.com. In the fill-in-the-blank questions of Classic Quizzes, when students type Hebrew words and hit "submit", the input is normalized but the answer key is not. In New Quizzes, however, both the input and the answer key are normalized. It's a mystery to me.
Chrome:
Chrome to Chrome: NO normalization
Other apps to Chrome: NO normalization (except from Firefox, whose behavior above takes over)
Chrome to other apps: NO normalization (except into Safari, whose behavior above takes over)
Conclusion: Firefox and Safari behave in the opposite way. Chrome behaves normally and consistently (except when it is overridden by Firefox and Safari).
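To reproduce these tests without a browser in the middle, you can read the system clipboard directly. A minimal Python sketch, assuming macOS (it shells out to the built-in pbpaste command and needs Python 3.8+ for unicodedata.is_normalized):

    # Read the macOS clipboard and report which normalization form(s) it is in.
    import subprocess
    import unicodedata

    text = subprocess.run(["pbpaste"], capture_output=True, text=True).stdout

    for form in ("NFC", "NFD"):
        print(form, unicodedata.is_normalized(form, text))

Copy a decomposed string from each browser in turn and compare the output.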

Related

Are there any keyboard character code differences between PC and Mac?

So I'm working on a site in PHP/JS and also a database. I have a co-worker who sends me documents written on Apple devices, and I'm on a PC. Since I don't have access to a Mac, I'd like to know if spaces and punctuation are identical when typed on different keyboards.
I want to be able to copy the contents of the documents and paste them into the database; however, I don't want to assume that the PC dash character is the same as a Mac dash (which might be an actual minus character), or that a PC space turns out to be a Mac narrow/en space.
I could just test a received document, but she works all over the place and never knows where she wrote what.
This is a programming question because I'm pasting mathematical expressions where such characters make a difference, and I'm also using PHP and JavaScript to interpret those characters.
The main issue is the character encoding in the document. Most likely that's a Unicode encoding (e.g. UTF-8), which is fully cross-platform.
Someone using a U.S. keyboard layout (and probably most others) intending to type something like dash/hyphen/minus would most likely produce HYPHEN-MINUS U+002D. There are, of course, ways of typing EN DASH U+2013, EM DASH U+2014, SMALL EM DASH U+FE58, HYPHEN U+2010, and others, but the user would have to do that deliberately. It wouldn't be done routinely just because they're using a Mac.
Also, some editors or word processors may do "smart substitutions", replacing the ASCII characters with fancier (more typographically correct) non-ASCII ones. That would be independent of Mac vs. PC. If it does that, the character would still come across to the PC as such, but if your use of the document data is sensitive to such differences (as is apparently the case), then that would be problematic.
It would be very unlikely that Space would routinely be anything other than a normal SPACE U+0020. There are, of course, ways of typing variants such as NO-BREAK SPACE U+00A0, EN SPACE U+2002, EM SPACE U+2003, etc., but the user would have to go out of their way to type those. And I doubt smart substitutions would replace normal spaces.
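If you would rather not assume anything, you can scan the incoming text for dash and space variants before it goes into the database. A small Python sketch (the fold table is illustrative, not exhaustive):

    # Report dash/space variants and fold them down to their ASCII equivalents.
    import unicodedata

    FOLD = {
        "\u2013": "-",   # EN DASH
        "\u2014": "-",   # EM DASH
        "\u2212": "-",   # MINUS SIGN
        "\u2010": "-",   # HYPHEN
        "\u00a0": " ",   # NO-BREAK SPACE
        "\u2002": " ",   # EN SPACE
        "\u2003": " ",   # EM SPACE
    }

    def report_and_fold(text: str) -> str:
        for ch in sorted(set(text) & set(FOLD)):
            print(f"found {unicodedata.name(ch)} (U+{ord(ch):04X})")
        return "".join(FOLD.get(ch, ch) for ch in text)

    print(report_and_fold("x \u2212 y \u2013 z\u00a0w"))   # prints the findings, then "x - y - z w"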

Loading textfile into stringlist with firemonkey on osx when the encoding is unknown

I am having a hard time loading a text file into a stringlist in FireMonkey on OS X when the encoding of the text file is not known.
When I just use list.loadfromfile(filename), most of the time I get an exception regarding encoding.
list.loadfromfile(filename, TEncoding.Unicode) will also fail when the file is in ANSI, and vice versa.
There is no issue on Windows, where list.loadfromfile(filename) just works, but not on OS X.
I can't specify the encoding because it is unknown (the user provides the text files).
Any clue how I can get around this encoding issue when running the app on a Mac?
In general this is not possible. It is quite possible to create a single file that is valid when interpreted in all common encodings. This has been discussed many times, for instance: The Notepad file encoding problem, redux.
I'm assuming that you are working with files that do not contain byte order marks, BOMs. Obviously if your input files contained BOMs then you could simply check the BOM and be done.
With that assumption stated, the right solution to the problem, in a perfect world, is to know the encoding. Either pick a specific encoding which your program requires, or arrange for the user to tell you the encoding when they supply the file.
If, for whatever reason, you cannot do that then the next best thing to do is to use heuristics to attempt to guess the encoding used. I'm not aware of any Pascal code to do this. But you should be able to put something together that will work reasonably well. This answer gives an outline of a basic strategy: https://stackoverflow.com/a/20747074
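For reference, the strategy from that answer looks roughly like this. It is shown as a Python sketch rather than Pascal, and the single-byte fallback code page is an assumption, so pick whatever "ANSI" means for your users:

    # Heuristic: honour a BOM if present, then try strict UTF-8, then fall back to a
    # single-byte "ANSI" code page (which never fails, it can only mis-interpret).
    import codecs

    def load_text(path: str, ansi_fallback: str = "cp1252") -> str:
        raw = open(path, "rb").read()
        if raw.startswith(codecs.BOM_UTF8):
            return raw[len(codecs.BOM_UTF8):].decode("utf-8")
        if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
            return raw.decode("utf-16")        # the utf-16 codec reads and strips the BOM
        try:
            return raw.decode("utf-8")         # text that validates as strict UTF-8 almost certainly is UTF-8
        except UnicodeDecodeError:
            return raw.decode(ansi_fallback)

The same steps translate naturally to a TEncoding-based implementation in Delphi.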

Delphi encoding

The company I work for has a program that is no longer supported called QADisplay. Inside this program is a tool for annotating images. It's very similar to Photoshop in that it takes a layer-based approach to the annotations, with each annotation as its own class in Delphi 7. These annotations are stored as the base image and a text file with the information describing the contents of the annotation.
The issue is that the text that is displayed in the annotations is somehow encoded in the text file. For example, if the annotation displays as "Arial" (without the quotes), the text file will be written as:
TEXT (Type of annotation)
5 (Length of the literal string, in this case: Arial)
07)I86P (The encoded string)
What I need to do is extract all of the text from the annotations in preparation for the installation of our new software system.
I am not familiar with Delphi and do not have access to the source code. I have tried to disassemble the executable but haven't had much luck there. Does anyone have any ideas on how to approach decoding this? I've googled around a bit (Arial "07)I86P") and found some results relating to virus scan error logs and things of that nature, but nothing that was helpful for the issue I'm having.
That is not a standard text encoding. Maybe it is encrypted?
Without documentation or contact with the original developers, you will have to reverse engineer the app. Using a disassembler/debugger like IDA, if you can pause the app after it loads 07)I86P into memory, you can follow the code as it processes the characters, which will help you reconstruct the decode algorithm.

why can't I use secureTextEntry with a UTF-8 keyboard?

All,
I ran into this problem where, for a UITextField that has secureTextEntry=YES, I cannot get any UTF-8 keyboards (Japanese, Arabic, etc.) to show; only non-UTF-8 ones do (English, French, etc.). I did a lot of searching on Google, on this site, and on the Apple dev forums and see others with the same problem, but short of implementing my own UITextField, nobody seems to have a reasonable solution or an answer as to whether this is a bug or intended behavior.
And if this is intended behavior, why? Is there a standard, a white paper, SOMETHING someplace that I can look at and then point to when I go to my Product Manager and say we cannot support UTF-8 passwords?
Thanks,
I was unable to find anything in Apple's documentation to explain why this should be the case, but after creating a test project it does indeed appear to be so. At a guess, I imagine secure text entry is disallowed for any language using composite characters because it would make character input difficult.
For instance, for Japanese input, should each kana character be hidden after it is typed? Or just kanji characters? If the latter, the length of time characters remain onscreen is long enough to make secure input almost moot. Similarly for other languages using composite input methods.
This post includes code for manually implementing your own secure input behaviour.

Unicode Precomposition and Decomposition with Delphi

The Wikipedia entry for Subversion contains a paragraph about problems with different ways of Unicode encoding:
While Subversion stores filenames as Unicode, it does not specify if precomposition or decomposition is used for certain accented characters (such as é). Thus, files added in SVN clients running on some operating systems (such as OS X) use decomposition encoding, while clients running on other operating systems (such as Linux) use precomposition encoding, with the consequence that those accented characters do not display correctly if the local SVN client is not using the same encoding as the client used to add the files.
While this describes a specific problem with Subversion client implementations, I am not sure if the underlying Unicode composition problem could also appear with regular Delphi applications. I guess that the problem can only arise if Delphi applications are able to use both Unicode encoding ways (maybe in Delphi XE2). If yes, what could Delphi developers do to avoid it?
There is a minor display issue in that many fonts used on Windows won't render the decomposed form in the ideal way, by using the combined glyph for both the letter and the diacritical. Instead the renderer falls back to drawing the letter and then overlaying the standalone diacritical mark on top, which typically results in a less visually pleasing, potentially lopsided grapheme.
However that is not the issue the Subversion bug referenced from wiki is talking about. It's actually completely fine to check in filenames to SVN that contain composed or decomposed character sequences; SVN neither knows nor cares about composition, it just uses the Unicode code points as-is. As long as the backend filesystem leaves filenames in the same state as they were put in, all is fine.
Windows and Linux both have filesystems that are equally blind to composition. Mac OS X, unfortunately, does not. Both HFS+ and UFS filesystems perform ‘normalisation’ to decomposed form before storing an incoming filename, so the filename you get back won't necessarily be the same sequence of Unicode code points you put in.
It is this [IMO: insane] behaviour that confuses SVN—and many other programs—when being run on OS X. It's particularly likely to bite because Apple happened to choose decomposed (NFD) as their normalisation form, whereas most of the rest of the world uses composed (NFC) characters.
(And it's not even real NFD, but an incompatible Apple-only variant. Joy.)
The best way to cope with this, if you can, is never to rely on the exact filename something's stored under. If you only ever read a file from a given name, that's fine, as it'll be normalised to match the filesystem at the time. But if you're reading a directory listing and trying to match filenames you find in there to what you expected the filename to be—which is what Subversion is doing—you're going to get mismatches.
To do a filename match reliably you would have to detect that you're running on OS X, and manually normalise both the filename and the string to some normal form (NFC or NFD) before doing the comparison. You shouldn't do this on other OSes which treat the two forms as different.
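The comparison itself is the same in any language. Here is what it looks like as a Python sketch (the platform check and the choice of NFC are just one way to do it):

    # Compare an expected filename against one read from a directory listing,
    # normalizing both on macOS, where the filesystem may have decomposed the name.
    import sys
    import unicodedata

    def same_filename(expected: str, found: str) -> bool:
        if sys.platform == "darwin":
            expected = unicodedata.normalize("NFC", expected)
            found = unicodedata.normalize("NFC", found)
        return expected == found

    # NFD "résumé.txt" vs. NFC "résumé.txt": equal on macOS, distinct elsewhere
    print(same_filename("re\u0301sume\u0301.txt", "r\u00e9sum\u00e9.txt"))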
AFAICT, both encodings should produce the same results when displaying, and both are valid Unicode, so I don't quite see the problem there. A display routine should be able to handle both, provided decomposition is catered for. The single code point é should display as-is, while an e followed by a combining acute accent should only display as é when the renderer handles decomposed sequences.
The problem is not display, IMO, it is comparison, either for equality (which fails if the two strings use different forms) or lexically, i.e. for sorting. That is why one should normalize to one form, as David says. That way there are no ambiguities anymore.
The same problem could arise in any application that deals with text. How to avoid it depends on what operations the application is performing and the question lacks specific details. Mostly I think you'd solve such problems by normalizing the text. This involves using a single preferred representation whenever you encounter ambiguity of encoding.
