Bit of a strange problem (at least for me). In my Grails app I'm sending emails with some special characters (Eastern European letters). The string values with special characters that I get from the database are valid, but the ones I create in the application show up as "?".
Even more confusing is the fact that in development everything works fine, but when I deploy the app to a Tomcat instance I get the question marks.
I've set everything up to encode to UTF-8. At least I believe so - obviously I'm missing something.
It sounds like you don't have the operating system language packs installed for the languages you're trying to display. While the files themselves appear to be saved properly, and the JVM 'understands' them because the character sets are supported, the GUIs you're using can't display the corresponding characters because the underlying OS can't render them.
I've experienced similar problems, and the solution that worked for me was to turn on the corresponding languages in the OS.
If I use a domain such as www.äöü.com, is there any way to avoid it being displayed as www.xn--4ca0bs.com in users’ browsers?
Domains such as www.xn--4ca0bs.com cause a lot of confusion with average internet users, I guess.
This is entirely up to the browser. In fact, IDNs are pretty much a browser-only technology. Domain names cannot contain non-ASCII characters, so the actual domain name is always the Punycode encoded xn--... form. It's up to the browser to prettify this, but many choose to not do so to avoid domain name spoofing using lookalike Unicode characters.
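If you want to see that mapping for yourself, Java ships a java.net.IDN class that performs roughly the conversion browsers do. A minimal sketch (the class name is mine, and the expected output is taken from the example domain in the question):

    import java.net.IDN;

    public class PunycodeDemo {
        public static void main(String[] args) {
            // Convert the Unicode form to the Punycode (ACE) form that is
            // actually registered in DNS.
            String ascii = IDN.toASCII("www.äöü.com");
            System.out.println(ascii);                 // expected: www.xn--4ca0bs.com (per the question)

            // And back again: this is the "prettified" form a browser may choose to show.
            System.out.println(IDN.toUnicode(ascii));  // www.äöü.com
        }
    }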
From a security perspective, Unicode domains can be problematic because many Unicode characters are difficult to distinguish from common ASCII characters (or indeed other Unicode characters).
It is possible to register domains such as "xn--pple-43d.com", which is equivalent to "аpple.com". It may not be obvious at first glance, but "аpple.com" uses the Cyrillic "а" (U+0430) rather than the ASCII "a" (U+0061). This is known as a homograph attack.
Fortunately, modern browsers have mechanisms in place to limit IDN homograph attacks. The IDN Policy page for Chrome highlights the conditions under which an IDN is displayed in its native Unicode form. Generally speaking, the Unicode form will be hidden if a domain label contains characters from multiple different languages. The "аpple.com" domain described above will appear in its Punycode form, "xn--pple-43d.com", to limit confusion with the real "apple.com".
For more information see this blog post by Xudong Zheng.
Internet Explorer 8.0 on Windows 7 displays your UTF-8 domain just fine.
Google Chrome 19 on the other hand doesn't.
Read more here: An Introduction to Multilingual Web Addresses #phishing.
Different browsers do things differently, possibly because some use the system codepage/locale/encoding, while others use their own settings or a list of allowed characters.
Read that article carefully; it explains how each browser decides what to display.
If you are targeting a specific language, you can get away with it and make it work.
So I'm working on a site in PHP/JS, along with a database. I have a co-worker who sends me documents written on Apple devices, and I'm on a PC. Since I don't have access to a Mac, I'd like to know whether spaces and punctuation are identical when typed on different keyboards.
I want to be able to copy the contents of the documents and paste them into the database, but I don't want to assume that the PC dash character is the same as a Mac dash (which might be an actual minus character), or that a PC space turns out to be a Mac narrow/en space.
I could just test a received document, but she works all over the place and never knows where she wrote what.
This is a programming question because I'm pasting mathematical expressions where such characters make a difference, and I'm also using PHP and JavaScript to interpret those characters.
The main issue is the character encoding in the document. Most likely that's a Unicode encoding (e.g. UTF-8), which is fully cross-platform.
Someone using a U.S. keyboard layout (and probably most others) intending to type something like dash/hyphen/minus would most likely produce HYPHEN-MINUS U+002D. There are, of course, ways of typing EN DASH U+2013, EM DASH U+2014, SMALL EM DASH U+FE58, HYPHEN U+2010, and others, but the user would have to do that deliberately. It wouldn't be done routinely just because they're using a Mac.
Also, some editors or word processors may do "smart substitutions", replacing the ASCII characters with fancier (more typographically correct) non-ASCII ones. That would be independent of Mac vs. PC. If it does that, the character would still come across to the PC as such, but if your use of the document data is sensitive to such differences (as is apparently the case), then that would be problematic.
It would be very unlikely that Space would routinely be anything other than a normal SPACE U+0020. There are, of course, ways of typing variants such as NO-BREAK SPACE U+00A0, EN SPACE U+2002, EM SPACE U+2003, etc., but the user would have to go out of their way to type those. And I doubt smart substitutions would replace normal spaces.
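If you still want to defend against stray typographic dashes or exotic spaces before the text reaches the database, a simple substitution pass is enough. Here is an illustrative sketch in Java rather than PHP/JS (the class name and the exact character lists are my own, mirroring the code points mentioned above); the same character classes translate directly to PHP's preg_replace or a JavaScript regex:

    public class DashSpaceNormalizer {
        // EN DASH, EM DASH, SMALL EM DASH, HYPHEN  ->  HYPHEN-MINUS
        private static final String DASHES = "[\u2013\u2014\uFE58\u2010]";
        // NO-BREAK SPACE and the U+2000..U+200A spaces (EN SPACE, EM SPACE, ...)  ->  SPACE
        private static final String SPACES = "[\u00A0\u2000-\u200A]";

        public static String normalize(String input) {
            return input.replaceAll(DASHES, "-")
                        .replaceAll(SPACES, " ");
        }

        public static void main(String[] args) {
            System.out.println(normalize("3\u20134\u00A0cm"));  // prints "3-4 cm"
        }
    }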
I am having a hard time loading a text file into a string list in FireMonkey on OS X when the encoding of the text file is not known.
When I just use list.loadfromfile(filename), I get an encoding-related exception most of the time.
list.loadfromfile(filename, TEncoding.unicode) will also fail when the file is ANSI, and vice versa.
There is no issue on Windows; list.loadfromfile(filename) just works, but not on OS X.
I can't specify the encoding, because it will be unknown (users provide the text files).
Any clue how I can get around this encoding issue when running the app on a Mac?
In general this is not possible. It is quite possible to create a single file that is valid when interpreted in all common encodings, yet means something different in each, so no detection scheme can be certain. This has been discussed many times, for instance: The Notepad file encoding problem, redux.
I'm assuming that you are working with files that do not contain byte order marks, BOMs. Obviously if your input files contained BOMs then you could simply check the BOM and be done.
With that assumption stated, the right solution to the problem, in a perfect world, is to know the encoding. Either pick a specific encoding which your program requires, or arrange for the user to tell you the encoding when they supply the file.
If, for whatever reason, you cannot do that then the next best thing to do is to use heuristics to attempt to guess the encoding used. I'm not aware of any Pascal code to do this. But you should be able to put something together that will work reasonably well. This answer gives an outline of a basic strategy: https://stackoverflow.com/a/20747074
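For what it's worth, here is a rough sketch of that strategy in Java rather than Pascal, purely to illustrate the heuristic: check for a BOM, then attempt a strict UTF-8 decode, and finally fall back to a single-byte code page. The class name is mine, and the windows-1252 fallback is an assumption about the users' likely ANSI code page:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class EncodingGuesser {
        public static String readWithGuessedEncoding(Path file) throws IOException {
            byte[] b = Files.readAllBytes(file);

            // 1. Byte order marks are unambiguous, so honour them first.
            if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
                return new String(b, 3, b.length - 3, StandardCharsets.UTF_8);
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
                return new String(b, 2, b.length - 2, StandardCharsets.UTF_16LE);
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
                return new String(b, 2, b.length - 2, StandardCharsets.UTF_16BE);

            // 2. Try a strict UTF-8 decode; text in a single-byte code page is
            //    very unlikely to be valid UTF-8 by accident.
            try {
                return StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(b))
                        .toString();
            } catch (CharacterCodingException e) {
                // 3. Fall back to an assumed ANSI code page.
                return new String(b, Charset.forName("windows-1252"));
            }
        }
    }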
I am using Protege 4.3 to create and organize an Ontology which contains Chinese characters.
Some Chinese characters are displayed properly, but others are displayed as little squares. The little squares do not always occur; for example, if I click on the []-[]-[]-cheatsheet-[]-[]-[]-[]-[], I can see that the same Chinese characters are displayed without problem.
Do you know what I can do to make Protege 4.3 display Chinese characters correctly and consistently?
I guess I could have done further homework for this question. What follows is a post that comes close to the final solution. (I have to post this as an answer because it is too long to fit in a comment.)
To be specific, I found the following feedback post in the Protege Mailing List Archive:
[p4-feedback] Protege 4.2.0 Chinese Display Problem:
https://mailman.stanford.edu/pipermail/p4-feedback/2012-June/004721.html
I know this problem and have even fixed it on one occasion. But I don't truly understand it or know what to do about it. I am sorry that I don't have good information on this problem but I will give you my best current understanding.
In my experience, when this happens the character information is correctly encoded in the OWL file. The problem is exclusively a display problem. This is consistent with your description of the problem - in some of the screens the individuals are displaying correctly.
I believe that the problem has to do with the configuration of fonts in the Java virtual machine. If you change the instance of Java that Protege is using, the problem will manifest in different ways or it will go away. When I worked on this problem before (it has happened a couple of times) I gathered some web pages. Unfortunately only one of them is still valid, but perhaps it is part of the solution.
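As a quick way to check whether the JVM that Protege runs under can even find a font with the required glyphs, a small diagnostic along these lines (just a sketch; the sample string is a placeholder for the characters that show up as squares) lists every installed font able to render them:

    import java.awt.Font;
    import java.awt.GraphicsEnvironment;

    public class FontCheck {
        public static void main(String[] args) {
            String sample = "中文";  // placeholder: put the characters that render as squares here
            for (Font f : GraphicsEnvironment.getLocalGraphicsEnvironment().getAllFonts()) {
                // canDisplayUpTo returns -1 when the font has a glyph for every character
                if (f.canDisplayUpTo(sample) == -1) {
                    System.out.println(f.getFontName());
                }
            }
        }
    }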
I will post my own investigation results after trying the suggested approach above.
PS: A useful owl example is provided here - some unicode characters do not display correctly in Protege
The Wikipedia entry for Subversion contains a paragraph about problems with different ways of Unicode encoding:
While Subversion stores filenames as Unicode, it does not specify if precomposition or decomposition is used for certain accented characters (such as é). Thus, files added in SVN clients running on some operating systems (such as OS X) use decomposition encoding, while clients running on other operating systems (such as Linux) use precomposition encoding, with the consequence that those accented characters do not display correctly if the local SVN client is not using the same encoding as the client used to add the files.
While this describes a specific problem with Subversion client implementations, I am not sure whether the underlying Unicode composition problem could also appear in regular Delphi applications. I guess that the problem can only arise if Delphi applications are able to use both composition forms (maybe in Delphi XE2). If so, what could Delphi developers do to avoid it?
There is a minor display issue in that many fonts used on Windows won't render the decomposed form in the ideal way, by using the combined glyph for both the letter and the diacritical. Instead the renderer falls back to rendering the letter and then overlaying the standalone diacritical mark on top, which typically results in a less visually pleasing, potentially lopsided grapheme.
However that is not the issue the Subversion bug referenced from wiki is talking about. It's actually completely fine to check in filenames to SVN that contain composed or decomposed character sequences; SVN neither knows nor cares about composition, it just uses the Unicode code points as-is. As long as the backend filesystem leaves filenames in the same state as they were put in, all is fine.
Windows and Linux both have filesystems that are equally blind to composition. Mac OS X, unfortunately, does not. Both HFS+ and UFS filesystems perform ‘normalisation’ to decomposed form before storing an incoming filename, so the filename you get back won't necessarily be the same sequence of Unicode code points you put in.
It is this [IMO: insane] behaviour that confuses SVN—and many other programs—when being run on OS X. It's particularly likely to bite because Apple happened to choose decomposed (NFD) as their normalisation form, whereas most of the rest of the world uses composed (NFC) characters.
(And it's not even real NFD, but an incompatible Apple-only variant. Joy.)
The best way to cope with this, if you can, is never to rely on the exact filename something is stored under. If you only ever read a file from a given name, that's fine, as it'll be normalised to match the filesystem at the time. But if you're reading a directory listing and trying to match filenames you find there against what you expected the filename to be—which is what Subversion is doing—you're going to get mismatches.
To do a filename match reliably you would have to detect that you're running on OS X, and manually normalise both the filename and the string to some normal form (NFC or NFD) before doing the comparison. You shouldn't do this on other OSes which treat the two forms as different.
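A Delphi application would call an equivalent normalization API (for example the Windows NormalizeString function or an ICU binding), but the principle is easy to show; this sketch uses java.text.Normalizer purely as an illustration, and the class name is mine:

    import java.text.Normalizer;

    public class FilenameMatch {
        public static void main(String[] args) {
            String composed   = "caf\u00E9";    // 'é' as a single precomposed code point (NFC)
            String decomposed = "cafe\u0301";   // 'e' followed by COMBINING ACUTE ACCENT (NFD)

            // A naive comparison fails, even though both render as "café".
            System.out.println(composed.equals(decomposed));  // false

            // Normalizing both sides to the same form before comparing succeeds.
            String a = Normalizer.normalize(composed,   Normalizer.Form.NFC);
            String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(a.equals(b));                  // true
        }
    }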
AFAICT, both encodings should produce the same result when displayed, and both are valid Unicode, so I don't quite see the problem there. A display routine should be able to handle both if decomposition is catered for: a precomposed é displays as-is, while e´ (e plus a combining acute) only displays as é when the decomposed form is handled.
The problem is not display, IMO, it is comparison, either for equality (which fails if the two use different forms) or lexically, i.e. for sorting. That is why one should normalize to one form, as David says. That way there are no ambiguities anymore.
The same problem could arise in any application that deals with text. How to avoid it depends on what operations the application is performing and the question lacks specific details. Mostly I think you'd solve such problems by normalizing the text. This involves using a single preferred representation whenever you encounter ambiguity of encoding.