So I'm working on a site in PHP/JS and also a database. I have a co-worker who sends me documents written on Apple devices, and I'm on a PC. Since I don't have access to a Mac, I'd like to know whether spaces and punctuation come out identical when typed on different keyboards.
I want to be able to copy the contents of the documents and paste them into the database. However, I don't want to assume that the PC dash character is the same as a Mac dash (which might be an actual minus character), or that a PC space turns out to be a Mac narrow/en space.
I could just test a received document, but she works all over the place and never knows where she wrote what.
This is a programming question because I'm pasting mathematical expressions where such characters make a difference, and also using PHP and JavaScript to interpret those characters.
The main issue is the character encoding in the document. Most likely that's a Unicode encoding (e.g. UTF-8), which is fully cross-platform.
Someone using a U.S. keyboard layout (and probably most others) intending to type something like dash/hyphen/minus would most likely produce HYPHEN-MINUS U+002D. There are, of course, ways of typing EN DASH U+2013, EM DASH U+2014, SMALL EM DASH U+FE58, HYPHEN U+2010, and others, but the user would have to do that deliberately. It wouldn't be done routinely just because they're using a Mac.
Also, some editors or word processors may do "smart substitutions", replacing the ASCII characters with fancier (more typographically correct) non-ASCII ones. That would be independent of Mac vs. PC. If an editor does that, the substituted character would still come across to the PC intact, but if your use of the document data is sensitive to such differences (as is apparently the case), then that would be problematic.
It would be very unlikely that Space would routinely be anything other than a normal SPACE U+0020. There are, of course, ways of typing variants such as NO-BREAK SPACE U+00A0, EN SPACE U+2002, EM SPACE U+2003, etc., but the user would have to go out of their way to type those. And I doubt smart substitutions would replace normal spaces.
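If you do want to defend against smart-substituted punctuation before the text hits your database, one option is to fold the common look-alikes back to ASCII. Here is a minimal Python sketch (which variants matter is my assumption; the same table is easy to port to PHP or JavaScript):

    # Map common typographic variants back to their ASCII counterparts.
    # The choice of variants below is an assumption; extend as needed.
    ASCII_FOLD = {
        0x2010: "-",  # HYPHEN
        0x2013: "-",  # EN DASH
        0x2014: "-",  # EM DASH
        0x2212: "-",  # MINUS SIGN
        0x00A0: " ",  # NO-BREAK SPACE
        0x2002: " ",  # EN SPACE
        0x2003: " ",  # EM SPACE
        0x2018: "'",  # LEFT SINGLE QUOTATION MARK
        0x2019: "'",  # RIGHT SINGLE QUOTATION MARK
        0x201C: '"',  # LEFT DOUBLE QUOTATION MARK
        0x201D: '"',  # RIGHT DOUBLE QUOTATION MARK
    }

    def fold_punctuation(text: str) -> str:
        """Replace known typographic variants with plain ASCII equivalents."""
        return text.translate(ASCII_FOLD)

    print(fold_punctuation("x \u2212 y \u2013 z"))  # prints "x - y - z"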
All,
I ran into this problem where, for a UITextField that has secureTextEntry=YES, I cannot get any UTF-8 keyboards (Japanese, Arabic, etc.) to show; only non-UTF-8 ones do (English, French, etc.). I did a lot of searching on Google, on this site, and on the Apple dev forums, and I see others with the same problem, but short of implementing my own UITextField, nobody seems to have a reasonable solution or an answer as to whether this is a bug or intended behavior.
And if this is intended behavior, why? Is there a standard, a white paper, SOMETHING someplace that I can look at and then point to when I go to my Product Manager and say we cannot support UTF-8 passwords?
Thanks,
I was unable to find anything in Apple's documentation to explain why this should be the case, but after creating a test project it does indeed appear to be so. At a guess, I imagine secure text entry is disallowed for any language using composite characters because it would make character input difficult.
For instance, for Japanese input, should each kana character be hidden after it is typed? Or just kanji characters? If the latter, the length of time characters remain onscreen is long enough to make secure input almost moot. Similarly for other languages using composite input methods.
This post includes code for manually implementing your own secure input behaviour.
The Wikipedia entry for Subversion contains a paragraph about problems with different ways of Unicode encoding:
While Subversion stores filenames as Unicode, it does not specify if precomposition or decomposition is used for certain accented characters (such as é). Thus, files added in SVN clients running on some operating systems (such as OS X) use decomposition encoding, while clients running on other operating systems (such as Linux) use precomposition encoding, with the consequence that those accented characters do not display correctly if the local SVN client is not using the same encoding as the client used to add the files.
While this describes a specific problem with Subversion client implementations, I am not sure if the underlying Unicode composition problem could also appear with regular Delphi applications. I guess the problem can only arise if Delphi applications are able to produce both composition forms (maybe in Delphi XE2). If so, what could Delphi developers do to avoid it?
There is a minor display issue in that many fonts used on Windows won't render the decomposed form in the ideal way, by using the combined glyph for both the letter and the diacritical. Instead it falls back to rendering the letter and then overlaying the standalone diacritical mark on top, which typically results in a less visually pleasing, potentially lopsided grapheme.
However that is not the issue the Subversion bug referenced from wiki is talking about. It's actually completely fine to check in filenames to SVN that contain composed or decomposed character sequences; SVN neither knows nor cares about composition, it just uses the Unicode code points as-is. As long as the backend filesystem leaves filenames in the same state as they were put in, all is fine.
Windows and Linux both have filesystems that are equally blind to composition. Mac OS X, unfortunately, does not. Both HFS+ and UFS filesystems perform ‘normalisation’ to decomposed form before storing an incoming filename, so the filename you get back won't necessarily be the same sequence of Unicode code points you put in.
It is this [IMO: insane] behaviour that confuses SVN—and many other programs—when being run on OS X. It's particularly likely to bite because Apple happened to choose decomposed (NFD) as their normalisation form, whereas most of the rest of the world uses composed (NFC) characters.
(And it's not even real NFD, but an incompatible Apple-only variant. Joy.)
The best way to cope with this, if you can, is never to rely on the exact filename something's stored under. If you only ever read a file from a given name, that's fine, as it'll be normalised to match the filesystem at the time. But if you're reading a directory listing and trying to match the filenames you find there against what you expected the filename to be (which is what Subversion is doing), you're going to get mismatches.
To do a filename match reliably you would have to detect that you're running on OS X, and manually normalise both the filename and the string to some normal form (NFC or NFD) before doing the comparison. You shouldn't do this on other OSes which treat the two forms as different.
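A minimal Python sketch of such a match, assuming NFC as the chosen form and a platform check for macOS:

    import os
    import sys
    import unicodedata

    def normalize_for_match(name: str) -> str:
        # On macOS, fold names to NFC before comparing; elsewhere leave
        # them alone, since those filesystems treat the forms as distinct.
        if sys.platform == "darwin":
            return unicodedata.normalize("NFC", name)
        return name

    def name_exists(directory: str, expected: str) -> bool:
        # Compare a directory listing against an expected filename,
        # tolerating the filesystem's composition changes.
        target = normalize_for_match(expected)
        return any(normalize_for_match(entry) == target
                   for entry in os.listdir(directory))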
AFAICT, both forms should produce the same results when displayed, and both are valid Unicode, so I don't quite see the problem there. A display routine should be able to handle both: the precomposed code point é displays as-is, while the decomposed sequence e + combining acute accent should likewise be rendered as é.
The problem is not display, IMO, it is comparison, either for equality (which fails if the two strings use different forms) or lexically, i.e. for sorting. That is why one should normalize to one form, as David says. That way there are no ambiguities anymore.
The same problem can arise in any application that deals with text. How to avoid it depends on what operations the application is performing, and the question lacks specific details. Mostly, I think you'd solve such problems by normalizing the text: pick a single preferred representation and convert to it whenever you encounter ambiguity of encoding.
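For instance, a sketch that normalizes at the input boundary, assuming NFC as the preferred representation:

    import unicodedata

    def canonicalize(text: str) -> str:
        # Convert all incoming text to one preferred form (NFC here).
        return unicodedata.normalize("NFC", text)

    composed = "\u00e9"     # é as a single precomposed code point
    decomposed = "e\u0301"  # e followed by a combining acute accent

    print(composed == decomposed)                              # False
    print(canonicalize(composed) == canonicalize(decomposed))  # True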
I'm in the process of researching code pages and have come across many conflicting uses of terminology, even amongst different Wikipedia entries. I just can't find a source of information that spells out the entire character handling process from start to finish. Could someone well versed in this field suggest ways in which the following information is inaccurate or incorrect:
The process of character representation as far as I understand:
We start with sets of symbols (not sure of the correct terminology here, possibly 'scripts') that are not associated with any specific platform. 'The Cyrillic alphabet' is understood to refer to the same entity in the context of Windows as in Linux, for example.
Members of these sets are selected, generally in bunches, by vendors to form a platform-specific character set. The platform might assign these sets various codes, such as GDI values on Windows (e.g. 0 for ANSI_CHARSET and the other codes mentioned here: http://asa.diac24.net/wiki/index.php?title=ASS:fe&printable=yes). I cannot find much information on these sets, such as whether they are in fact coded character sets or whether they are simply unordered and abstract.
From these sets, individual code pages are developed that appear to have a one-to-one mapping with GDI values. Since these GDI values appear to represent sets that are platform-dependent, does this mean Windows code pages are essentially a coded version of each individual set?
I've been having trouble reconciling this idea with a link shown to me earlier (which I've lost) that showed a one-to-many mapping between these GDI charsets and code pages across different platforms. Is this accurate? Do these GDI values point to sets from which different code pages across different platforms can be developed?
Each code page maps a member of an abstract character set onto an integer to represent its position in the set. In the case of the 'simpler' code pages mentioned on the above webpage, these can be referred to using the more precise term 'character map'. Is this term worth considering, or is the distinction too subtle and unimportant?
A font resolves a code point to a glyph if it contains one for that code point; otherwise it reports a failure. I've also read that a font may return its own blank glyph for code points it doesn't support. Can an application distinguish between this blank glyph and a successful resolution, i.e. does the font return an error code of sorts with this blank glyph?
I believe that's the extent of my confusion. Any clarification in this regard would be invaluable. Thanks in advance.
You are essentially correct:
Start with the full set of known characters.
Select a subset of these characters (a character set).
Map these to bit patterns (code page and encoding).
Render these to an output device by combining the character with a glyph (i.e. using a font, a bit pattern, and a code page/encoding that maps bit patterns to characters).
Across platforms, there are similar code pages. And even across many code pages there are similar mappings of value to character. For example, Windows Latin, Mac Roman and Unicode share characters for the first 128 values. There is some standardization of code pages (e.g. http://en.wikipedia.org/wiki/Shift_JIS for Japanese) so that machines can interact.
Generally for new development, you should be using a Unicode code page with one of the popular encodings. UTF-8 is popular on most modern systems. UTF-16LE is used for Windows system calls ending in W.
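A quick Python illustration of one character serialized under several of these code pages and encodings (the byte values shown are standard for each encoding):

    text = "\u00e9"  # é, LATIN SMALL LETTER E WITH ACUTE

    print(text.encode("utf-8"))      # b'\xc3\xa9'  two bytes
    print(text.encode("utf-16-le"))  # b'\xe9\x00'  one 16-bit code unit
    print(text.encode("cp1252"))     # b'\xe9'      Windows Latin code page
    print(text.encode("mac-roman"))  # b'\x8e'      Mac Roman code page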
This might be a good match: http://mihai-nita.net/2006/08/06/basic-lingo/
We're implementing a blog for a site which supports six different languages and five of them have non-Latin characters in their alphabets. We are not sure whether we should have them encoded (that is what we're doing at the moment)
Létání s potravinami: Co je dovoleno? becomes l%c3%a9t%c3%a1n%c3%ad-s-potravinami-co-je-dovoleno and the browser displays it as létání-s-potravinami-co-je-dovoleno.
or if we should replace them with their Latin "counterparts" (similar looking letters)
Létání s potravinami: Co je dovoleno? becomes letani-s-potravinami-co-je-dovoleno.
I can't find a definitive answer as to what's better from an SEO perspective. Search engine optimization is very important for us. Which approach would you suggest?
Most of the time, search engines handle Latin counterparts well, although results for e.g. "létání" and "letani" sometimes differ slightly.
So, in terms of SEO, almost no harm is done: as long as your site has good content, good markup and all that other stuff, it won't suffer from having Latin URLs.
You don't always know what combination of system, browser and plugins your users have, so make URLs as easy as possible: websites generally use plain Latin in URLs, because non-Latin symbols can choke anything from the server through the browser to any plugin and break the user's experience.
And I can't stress this enough: users before SEO!
"what's better from SEO perspective"
Who's your audience? Americans who think all those extra letters are a mistake?
Or folks who read (and search) for "non-ASCII" letters because those non-ASCII letters are part of their language?
SEO is a bad thing to chase. Complete, correct, consistent and usable is what you want to build first.
I suggest you replace them with their Latin counterparts, because it's user-friendly and your website will be accessible from every computer (keyboard layouts change from one computer to another, but all of them have Latin letters). From an SEO perspective, I don't think it's going to be a problem.
Pawel, first of all, you should decide whether you're going to optimize for global Google (google.com) or the Polish one.
In accordance with the URI specification, RFC 3986, only 7-bit ASCII characters are allowed, and characters among those the specification designates as reserved must be properly escaped. If you want to represent other characters, or reserved URI characters, then you should be using IRIs, RFC 3987. Keep in mind that HTTP is not compatible with IRIs, however.
When in doubt, RTFM.
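Concretely, that escaping is just percent-encoding the UTF-8 bytes of everything outside the unreserved set, which is what the first option in the question shows. A Python sketch:

    from urllib.parse import quote, unquote

    slug = "létání-s-potravinami-co-je-dovoleno"

    encoded = quote(slug)  # percent-encodes the UTF-8 bytes of non-ASCII characters
    print(encoded)                   # l%C3%A9t%C3%A1n%C3%AD-s-potravinami-co-je-dovoleno
    print(unquote(encoded) == slug)  # True: the round trip is lossless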
Another issue is that there are Unicode code points whose glyphs look very much alike in most fonts, which is absolutely ideal for phishers. Stick to ASCII and the glyphs are visibly different when the characters are.
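A small Python illustration of one such confusable pair, Latin a versus Cyrillic а; the glyphs are near-identical in most fonts, but the code points are not:

    import unicodedata

    latin = "a"          # U+0061
    cyrillic = "\u0430"  # U+0430

    print(latin == cyrillic)           # False: different code points
    print(unicodedata.name(latin))     # LATIN SMALL LETTER A
    print(unicodedata.name(cyrillic))  # CYRILLIC SMALL LETTER A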
I have a document A in encoding A displayed in tool A and a document B in encoding B displayed in tool B. If I cut and paste (part of) B into A what might be the resultant character encoding? I realise this depends on tool A and tool B and the information held in the paste buffer (which presumably can contain an encoding?) and the operating system.
What should high-quality tools do? And in practice, how many of the common tools (e.g. Word, TextPad, various IDEs, etc.) do a good job?
First of all, a text editor's internal representation of text has no bearing on how the text is encoded (serialized) when you save the file. So a document is not "in" an encoding; it's a sequence of abstract characters. When the document is saved to a file (or transmitted over the network) then it gets encoded.
It's up to each application to decide what it puts on the clipboard. Typically, a Windows app that knows what it's doing will put a number of different representations on the clipboard. When you paste in the other app, that app will look for the representation that best suits its needs.
In your case, a text editor (that knows what it's doing) will put a Unicode representation of a selected string onto the clipboard (where Unicode, in Windows, is typically moved around as UTF-16, but that's not important). When you paste in the other app, it will insert that sequence of Unicode characters into the document at the selection point.
There's an app floating around called "ClipSpy" that will help you see what I'm talking about, interactively.
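If you'd rather script the inspection than use a GUI, here is a minimal sketch using the pywin32 bindings (Windows only; predefined formats such as CF_UNICODETEXT have no registered name, so the name lookup fails for them):

    import win32clipboard  # from the pywin32 package

    win32clipboard.OpenClipboard()
    try:
        fmt = 0
        # Walk every representation the copying app put on the clipboard.
        while True:
            fmt = win32clipboard.EnumClipboardFormats(fmt)
            if fmt == 0:
                break
            try:
                name = win32clipboard.GetClipboardFormatName(fmt)
            except Exception:
                name = "<predefined format>"
            print(fmt, name)
    finally:
        win32clipboard.CloseClipboard()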
I observed the following behavior when I looked into Unicode normalization: When copying a canonically decomposed string (NFD) in Firefox in macOS 10.15.7, the string is normalized to NFC when pasting it in Chrome. What's weird is that the pasting affects the content of the clipboard: When pasting the string in Firefox again, it's then also canonically composed there. If I don't paste it anywhere else before pasting it in Firefox again, the NFD form survives. Interestingly, the problem doesn't occur in the other direction: When copying a canonically decomposed string in Chrome, it's pasted in NFD form anywhere I can tell. My conclusion is that Firefox stores text to the clipboard differently from other applications. One way to play around with this yourself is to copy 'mañana' === 'mañana' to your JavaScript console. The statement returns false if the NFD form of the string on the right survived the copy & paste.
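If you want to reproduce this, here is a sketch for macOS that reads the clipboard through the pbpaste utility and reports its normalization form (unicodedata.is_normalized requires Python 3.8+):

    import subprocess
    import unicodedata

    # Read the current clipboard text on macOS via pbpaste.
    text = subprocess.run(["pbpaste"], capture_output=True, text=True).stdout

    for form in ("NFC", "NFD"):
        print(form, unicodedata.is_normalized(form, text))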
This is a very good question. When you copy/paste, exactly what is copied/pasted: CHARACTERS or BYTES? And if BYTES, what encoding are they in?
From the answers, it sounds like the answer is "it depends". Different programs will put different things in the clipboard, sometimes placing multiple representations.
Then the pasting program needs to pick the best one and "do the right thing" with it.
Following my conversation with @Kaspar Etter, I did some testing. Here is what I found:
Copy from and Paste to:
Firefox:
Firefox to Firefox: NO normalization
Other apps to Firefox: NO normalization
Firefox to other apps: normalization
Even if we use AppleScript, JXA, or Python to directly read the system clipboard that contains the text copied from Firefox, the text is still normalized. Since copying and pasting from Firefox to Firefox does not involve normalization, Firefox probably does not normalize the text during the copy process. I have no idea when the normalization happens.
Safari (macOS, not iOS):
Safari to Safari: normalization
Other apps to Safari: normalization
Safari to other apps: NO normalization
For Safari (macOS), the normalization also happens at least on Canvas by instructure.com. In the fill-in-the-blank questions of Classic Quizzes, when students typed Hebrew words in quizzes and hit "submit", the input was normalized but the answer key was not. In New Quizzes, however, both the input and the answer key are normalized. It's a mystery to me.
Chrome:
Chrome to Chrome: NO normalization
Other apps to Chrome: NO normalization (except from Firefox, which normalizes as noted above)
Chrome to other apps: NO normalization (except pasting into Safari, which normalizes as noted above)
Conclusion: Firefox and Safari behave in opposite ways. Chrome behaves normally and consistently (except when overridden by Firefox or Safari).