If I use a domain such as www.äöü.com, is there any way to avoid it being displayed as www.xn--4ca0bs.com in users’ browsers?
Domains such as www.xn--4ca0bs.com cause a lot of confusion among average internet users, I guess.
This is entirely up to the browser. In fact, IDNs are pretty much a browser-only technology. Domain names cannot contain non-ASCII characters, so the actual domain name is always the Punycode encoded xn--... form. It's up to the browser to prettify this, but many choose to not do so to avoid domain name spoofing using lookalike Unicode characters.
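As a rough illustration, here is a minimal sketch of that round trip using Python's standard-library "idna" codec (which implements the older IDNA 2003 rules), applied to the domain from the question:

```python
# Minimal sketch: the Unicode form vs. the Punycode (xn--) form that is
# actually stored in DNS. Uses Python's built-in "idna" codec.
unicode_host = "www.äöü.com"

# Encode to the ASCII-compatible form that goes on the wire.
ascii_host = unicode_host.encode("idna").decode("ascii")
print(ascii_host)  # www.xn--4ca0bs.com

# Decode back to the Unicode form a browser may (or may not) choose to display.
print(ascii_host.encode("ascii").decode("idna"))  # www.äöü.com
```

Whether the user ever sees the Unicode form is, as noted, purely a display decision made by the browser.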
From a security perspective, Unicode domains can be problematic because many Unicode characters are difficult to distinguish from common ASCII characters (or indeed other Unicode characters).
It is possible to register domains such as "xn--pple-43d.com", which is equivalent to "аpple.com". It may not be obvious at first glance, but "аpple.com" uses the Cyrillic "а" (U+0430) rather than the ASCII "a" (U+0061). This is known as a homograph attack.
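To make the lookalike concrete, here is a small sketch in plain Python comparing the two names; the Punycode output should match the form quoted above:

```python
# The spoofed name differs from the real one only in its first code point,
# even though many fonts render the two identically.
real_name  = "apple.com"        # starts with U+0061 LATIN SMALL LETTER A
spoof_name = "\u0430pple.com"   # starts with U+0430 CYRILLIC SMALL LETTER A

print(real_name == spoof_name)                    # False
print(spoof_name.encode("idna").decode("ascii"))  # xn--pple-43d.com
```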
Fortunately, modern browsers have mechanisms in place to limit IDN homograph attacks. Chrome's IDN policy page highlights the conditions under which an IDN is displayed in its native Unicode form. Generally speaking, the Unicode form will be hidden if a domain label contains characters from multiple different languages. The "аpple.com" domain described above will therefore appear in its Punycode form, "xn--pple-43d.com", to limit confusion with the real "apple.com".
For more information see this blog post by Xudong Zheng.
Internet Explorer 8.0 on Windows 7 displays your UTF-8 domain just fine.
Google Chrome 19 on the other hand doesn't.
Read more here: An Introduction to Multilingual Web Addresses #phishing.
Different browsers do things differently, possibly because some use the system code page/locale/encoding, while others use their own settings or a list of allowed characters.
Read that article carefully, it explains how each browser works when making a decision.
If you are targeting a specific language, you can get away with it and make it work.
All,
I ran into this problem where, for a UITextField that has secureTextEntry=YES, I cannot get any UTF-8 keyboards (Japanese, Arabic, etc.) to show; only non-UTF-8 ones do (English, French, etc.). I did a lot of searching on Google, on this site, and on the Apple dev forums, and I see others with the same problem, but short of implementing my own UITextField, nobody seems to have a reasonable solution or an answer as to whether this is a bug or intended behavior.
And if this is intended behavior, why? Is there a standard, a white paper, SOMETHING someplace that I can look at and then point to when I go to my Product Manager and say we cannot support UTF-8 passwords?
Thanks,
I was unable to find anything in Apple's documentation to explain why this should be the case, but after creating a test project it does indeed appear to be so. At a guess, I imagine secure text entry is disallowed for any language using composite characters because it would make character input difficult.
For instance, for Japanese input, should each kana character be hidden after it is typed? Or just kanji characters? If the latter, the length of time characters remain onscreen is long enough to make secure input almost moot. Similarly for other languages using composite input methods.
This post includes code for manually implementing your own secure input behaviour.
The Wikipedia entry for Subversion contains a paragraph about problems with different ways of Unicode encoding:
While Subversion stores filenames as Unicode, it does not specify if precomposition or decomposition is used for certain accented characters (such as é). Thus, files added in SVN clients running on some operating systems (such as OS X) use decomposition encoding, while clients running on other operating systems (such as Linux) use precomposition encoding, with the consequence that those accented characters do not display correctly if the local SVN client is not using the same encoding as the client used to add the files.
While this describes a specific problem with Subversion client implementations, I am not sure whether the underlying Unicode composition problem could also appear in regular Delphi applications. I guess the problem can only arise if Delphi applications are able to produce both Unicode composition forms (maybe in Delphi XE2). If so, what can Delphi developers do to avoid it?
There is a minor display issue in that many fonts used on Windows won't render the decomposed form in the ideal way, by using the combined glyph for both the letter and the diacritical. Instead they fall back to rendering the letter and then overlaying the standalone diacritical mark on top, which typically results in a less visually pleasing, potentially lopsided grapheme.
However that is not the issue the Subversion bug referenced from wiki is talking about. It's actually completely fine to check in filenames to SVN that contain composed or decomposed character sequences; SVN neither knows nor cares about composition, it just uses the Unicode code points as-is. As long as the backend filesystem leaves filenames in the same state as they were put in, all is fine.
Windows and Linux both have filesystems that are equally blind to composition. Mac OS X, unfortunately, does not. Both HFS+ and UFS filesystems perform ‘normalisation’ to decomposed form before storing an incoming filename, so the filename you get back won't necessarily be the same sequence of Unicode code points you put in.
It is this [IMO: insane] behaviour that confuses SVN—and many other programs—when being run on OS X. It's particularly likely to bite because Apple happened to choose decomposed (NFD) as their normalisation form, whereas most of the rest of the world uses composed (NFC) characters.
(And it's not even real NFD, but an incompatible Apple-only variant. Joy.)
The best way to cope with this, if you can, is never to rely on the exact filename something's stored under. If you only ever read a file from a given name, that's fine, as it'll be normalised to match the filesystem at the time. But if you're reading a directory listing and trying to match the filenames you find there against what you expected the filename to be (which is what Subversion is doing), you're going to get mismatches.
To do a filename match reliably you would have to detect that you're running on OS X, and manually normalise both the filename and the string to some normal form (NFC or NFD) before doing the comparison. You shouldn't do this on other OSes which treat the two forms as different.
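As a sketch of what such a comparison might look like, assuming Python and its standard unicodedata module (the helper name is made up for illustration):

```python
import unicodedata

def same_filename(expected: str, found_on_disk: str) -> bool:
    # Normalise both names to a single form (NFC here) so that a precomposed
    # "é" (U+00E9) matches the decomposed "e" + U+0301 that HFS+ hands back.
    return (unicodedata.normalize("NFC", expected)
            == unicodedata.normalize("NFC", found_on_disk))

print("caf\u00e9" == "cafe\u0301")               # False: different code points
print(same_filename("caf\u00e9", "cafe\u0301"))  # True after normalisation
```

As noted above, you would only want to apply this composition-insensitive comparison on OS X, where the filesystem itself normalises names.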
AFAICT, both encodings should produce the same result when displayed, and both are valid Unicode, so I don't quite see the problem there. A display routine should be able to handle both forms if decomposition is encountered. A precomposed é code point should display as-is, while the decomposed sequence e + combining acute should only display as é if the renderer handles decomposition.
The problem is not display, IMO, it is comparison, either for equality (which fails if the two strings use different forms) or lexically, i.e. for sorting. That is why one should normalize to one form, as David says. That way there are no ambiguities anymore.
The same problem could arise in any application that deals with text. How to avoid it depends on what operations the application is performing and the question lacks specific details. Mostly I think you'd solve such problems by normalizing the text. This involves using a single preferred representation whenever you encounter ambiguity of encoding.
I'm in the process of researching code pages and have come across many conflicting uses of terminology, even amongst different Wikipedia entries. I just can't find a source of information that spells out the entire character handling process from start to finish. Could someone well versed in this field suggest ways in which the following information is inaccurate or incorrect:
The process of character representation as far as I understand:
We start with sets of symbols (not sure of the correct terminology here, possibly 'scripts') that are not associated with any specific platform. 'The Cyrillic alphabet' is understood to refer to the same entity in the context of Windows as in Linux, for example.
Members of these sets are selected, generally in bunches, by vendors to form a platform-specific character set. The platform might assign these sets various codes, such as GDI values on Windows (e.g. 0 for ANSI_CHARSET and the other codes mentioned here: http://asa.diac24.net/wiki/index.php?title=ASS:fe&printable=yes). I cannot find much information on these sets, such as whether they are in fact coded character sets or whether they are simply unordered and abstract.
From these sets, individual code pages are developed that appear to have a one-to-one mapping with GDI values. Since these GDI values appear to represent sets that are platform dependent, does this mean Windows code pages are essentially a coded version of each individual set?
I've been having trouble reconciling this idea with a link shown to me earlier (which I've lost) that showed a one-to-many mapping between these GDI charsets and code pages across different platforms. Is this accurate? Do these GDI values point to sets from which different code pages across different platforms can be developed?
Each code page maps a member of an abstract character set onto an integer to represent its position in the set. In the case of the 'simpler' code pages mentioned on the above webpage, these can be referred to using the more precise 'character map' term. Is this term worth considering or is the distinction too subtle and unimportant?
A font resolves a code point to a glyph if it contains one for that code point, otherwise it reports a failure. I've also read that a font may return its own blank glyph for those code points which it doesn't support. Can an application distinguish between this blank glyph and a successful resolution, ie. does the font return an error code of sorts with this blank glyph?
I believe that's the extent of my confusion. Any clarification in this regard would be invaluable. Thanks in advance.
You are essentially correct:
Start with the full set of known characters.
Select a subset of these characters (a character set)
Map these to bit patterns (code page and encoding)
Render these to an output device by combining the character with a glyph (i.e. using a font, a bit pattern, and a code page/encoding that maps the bit pattern to a character).
Across platforms, there are similar code pages. And even across many code pages there are similar mappings of value to character. For example, Windows Latin, Mac Roman and Unicode all agree on the characters for the first 128 values. There is some standardization of code pages (e.g. http://en.wikipedia.org/wiki/Shift_JIS for Japanese) so that machines can interoperate.
Generally, for new development, you should be using Unicode with one of the popular encodings. UTF-8 is popular on most modern systems. UTF-16LE is used for Windows system calls ending in W.
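As a brief sketch of those steps, using encodings that ship with Python (the sample string is arbitrary):

```python
text = "Déjà vu"

print(text.encode("cp1252"))     # Windows Latin code page: one byte per character
print(text.encode("mac_roman"))  # Mac Roman: same ASCII range, different bytes for é/à
print(text.encode("utf-8"))      # UTF-8: ASCII unchanged, é and à become two bytes each
print(text.encode("utf-16-le"))  # UTF-16LE, the form used by Windows "W" system calls
```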
This might be a good match: http://mihai-nita.net/2006/08/06/basic-lingo/
We're implementing a blog for a site which supports six different languages and five of them have non-Latin characters in their alphabets. We are not sure whether we should have them encoded (that is what we're doing at the moment)
Létání s potravinami: Co je dovoleno? becomes l%c3%a9t%c3%a1n%c3%ad-s-potravinami-co-je-dovoleno and the browser displays it as létání-s-potravinami-co-je-dovoleno.
or if we should replace them with their Latin "counterparts" (similar looking letters)
Létání s potravinami: Co je dovoleno? becomes letani-s-potravinami-co-je-dovoleno.
I can't find a definitive answer as to what's better from an SEO perspective. Search engine optimization is very important for us. Which approach would you suggest?
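For reference, here is a sketch of the two slug strategies being compared, using only Python's standard library; the slug helper is simplified for illustration:

```python
import unicodedata
from urllib.parse import quote

title = "Létání s potravinami: Co je dovoleno?"

def basic_slug(text: str) -> str:
    # Keep only letters/digits, join words with hyphens, lowercase the result.
    words = "".join(c if c.isalnum() or c.isspace() else " " for c in text).split()
    return "-".join(words).lower()

# Option 1: keep the Unicode letters; browsers display them, but on the wire
# the URL is percent-encoded.
unicode_slug = basic_slug(title)
print(quote(unicode_slug))   # l%C3%A9t%C3%A1n%C3%AD-s-potravinami-co-je-dovoleno

# Option 2: decompose and drop the combining marks to get an ASCII-only slug.
ascii_slug = unicodedata.normalize("NFKD", unicode_slug)
ascii_slug = ascii_slug.encode("ascii", "ignore").decode("ascii")
print(ascii_slug)            # letani-s-potravinami-co-je-dovoleno
```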
Most of the time, search engines deal with Latin counterparts well, although sometimes results for e.g. "létání" and "letani" differ slightly.
So, in terms of SEO, almost no harm is done: as long as your site has good content, good markup and all that other stuff, it won't suffer from having Latin URLs.
You don't always know what combination of system, browser and plugins your users have, so make URLs as easy as possible. Most websites use plain Latin characters in URLs because non-Latin symbols can choke anything from the server through the browser to any plugin, breaking the user's experience.
And I can't stress this enough; Users before SEO!
"what's better from SEO perspective"
Who's your audience? Americans who think all those extra letters are a mistake?
Or folks who read (and search) for "non-ASCII" letters because those non-ASCII letters are part of their language?
SEO is a bad thing to chase. Complete, correct, consistent and usable is what you want to build first.
Well, I suggest you replace them with their Latin counterparts, because it's user friendly and your website will be accessible on every single computer (keyboards change from one computer to another, but all of them have Latin letters). From an SEO perspective, I don't think it's going to be a problem.
Pawel, first of all, you should decide whether you're going to optimize for global Google (google.com) or the Polish one.
In accordance with the URI specification, RFC 3986, only 7-bit ASCII characters are allowed, and characters the specification defines as reserved must be properly percent-escaped when used as data. If you want to represent other characters, or reserved URI characters as data, then you should be using an IRI, RFC 3987. Keep in mind that HTTP is not compatible with IRIs, however.
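As a rough sketch of that IRI-to-URI mapping using Python's standard library (the example address is made up): the host goes through IDNA, and the path is percent-encoded as UTF-8.

```python
from urllib.parse import quote, urlsplit, urlunsplit

iri = "http://www.äöü.com/Létání-s-potravinami"

parts = urlsplit(iri)
host = parts.hostname.encode("idna").decode("ascii")  # www.xn--4ca0bs.com
path = quote(parts.path, safe="/")                    # /L%C3%A9t%C3%A1n%C3%AD-...

uri = urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))
print(uri)  # http://www.xn--4ca0bs.com/L%C3%A9t%C3%A1n%C3%AD-s-potravinami
```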
When in doubt RTFM.
Another issue is that there are Unicode code points whose glyphs look very much alike in most fonts, which is absolutely ideal for phishers. Stick to ASCII and the glyphs are visibly different when the characters are.
As the title says, how SEO friendly is a URL containing Unicode characters.
Edit: To clarify, I meant URL with non-ASCII characters but valid Unicode.
If I were Google or another search engine authority, I wouldn't consider Unicode URLs an advantage. I have been using Unicode URLs on my Persian website for more than two years, but believe me, I only did it because I felt I was forced to. We know Google handles Unicode URLs very well, but I can't read the Unicode words in the URLs when I'm working with them in Google Webmaster Tools. Here is an example:
http://www.learnfast.ir/%D9%88%D8%A8%D9%84%D8%A7%DA%AF-%D8%A2%DA%AF%D9%87%DB%8C
There are only two Farsi words in that messy, lengthy URL.
I believe other Unicode URL users don't like this either, but they do it only for SEO, not for categorizing their content or directing their users to the right address. Of course Unicode is excellent for crawling content, but there should be other means of indexing Unicode URLs. Moreover, English is our international language, isn't it? It can be put to good use in URLs. (Sorry for so many words from an amateur webmaster.)
All URLs can be represented as Unicode. Unicode just defines a range of code-points from U+0000 to U+10FFFF, which allows you to define any characters.
If what you mean is "How SEO friendly are URLs containing characters above U+007F" then they should be as good as anything else, as long as the word is correct. However, they won't be very easy for most users to type if that's a concern, and may not be supported by all internet browsers/libraries/proxies etc. so I'd tend to steer clear.
FWIW, Amazon (Japan) uses Unicode URL for their product pages.
http://www.amazon.co.jp/任天堂-193706011-Wiiスポーツ-リゾート-「Wiiモーションプラス」1個同梱/dp/B001DLXXCC/ref=pd_bxgy_vg_img_a
(As you can see, it causes trouble with systems like the Stackoverflow wiki formatter)
If we consider that URLs containing the searched keywords get higher placement in the search results, and you're targeting Unicode search terms, then it may actually help.
But of course this is hardly the most important thing when it comes to position in search results.
I would guess that from an SEO point of view it would be a really bad idea, unless you are specifically looking to target Unicode search terms.