I was fooling around on my phone and decided to try putting an emoji in the URL bar of Google Chrome. I entered 😀.com, where the emoji is Unicode code point U+1F600. Chrome ended up evaluating that as http://xn--e28h.com/, which took me to a "webpage unavailable" screen (ERR_NAME_NOT_RESOLVED). I looked up xn--e28h on GoDaddy and it was unavailable.
Here are my questions:
1. Why did 😀 turn into xn--e28h? I don't see any relation to the Unicode code point.
2. Why are domains of this format unavailable on GoDaddy?
3. Bonus question: why can't we put emojis in domain names?
1. Host names in DNS are conventionally restricted to ASCII, so internationalized domain names (IDN) are converted to ASCII with an encoding called Punycode. The xn-- prefix says that the label is an encoded name, and since the whole name in this case is a single Unicode code point, the rest (e28h) just looks incomprehensible: it is a compact encoding of the code point's numeric value, not its hexadecimal form. RFC 3492 (Punycode) is a good place to start reading more about this.
2. Most (if not all) top-level domains have rules on which Unicode characters they allow for names in that TLD. For example, .SE only allows those characters that are used in one of the official languages of Sweden. This is entirely a policy thing, so the "why" gets fuzzy.
3. See 2.
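To make the relation between the emoji and xn--e28h concrete, here is a minimal sketch using the `punycode` codec from Python's standard library. The helper names `to_ace_label` and `from_ace_label` are made up for this illustration, and the sketch skips the normalization and validation steps that real IDNA processing performs in a browser or registrar.

```python
# Sketch: reproduce the xn--e28h label with Python's built-in "punycode" codec.
# Real IDNA handling also normalizes and validates labels; that is omitted here.

ACE_PREFIX = "xn--"  # marks a label as Punycode-encoded


def to_ace_label(label: str) -> str:
    """Encode one domain label; pure-ASCII labels are returned unchanged."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ace_label(label: str) -> str:
    """Decode one label back to Unicode if it carries the xn-- prefix."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


emoji = "\U0001F600"  # the grinning face from the question
print(to_ace_label(emoji))                  # xn--e28h
print(from_ace_label("xn--e28h") == emoji)  # True
```

The four characters after the prefix are a variable-length base-36 encoding of the code point value, which is why they bear no visible resemblance to 1F600.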
I've just encountered something that I don't quite understand. I received a document (administrative memo from my employer) containing a web address. The address is not a clickable hyperlink, it is just text.
What is interesting is that when the address is copied and pasted into a web browser address bar, the browser attempts to contact a different web address than the one in the pasted text. The address text initially appears to be pasted correctly into the address bar, but as soon as I hit enter the text changes to something else.
Please note that this is not a matter of simple web site redirection. I know this because if I manually type in the same address (instead of copy & pasting it from the original document), the "correct" address is loaded. It is only following the copy/paste/load process that text appears to be magically changing.
I have also noticed that if I copy & paste the address first into a Notepad text file, save the text file, close, re-open, and then copy/paste to the web browser, the "correct" site then loads. Of note, when I save, Notepad warns that there are characters in Unicode format which will be lost. So I assume that there is some hidden unicode text that is being stripped out when I save as plain text.
But, in Notepad if I enable the "Show Unicode Control Characters" option, I see nothing. So what could be going on here?
To get really specific, the domain transforms like this: http://www.aaaaaaaaaa-usa.com/bbbbb/ddddddtools.html ==> www.xn--aaaaaaaaaausa-km6g.com. (The browser of course reports that it cannot find the IP address of the server)
For compatibility, domain names should be ASCII text, so there is a standard (IDN) for converting other characters to ASCII; each converted label starts with the prefix xn--. In your case the "hyphen" in the pasted address is presumably not the ASCII hyphen-minus but a look-alike Unicode character, which is why the browser converts the name and why retyping it by hand works.
Additionally, there have been phishing attacks that used letters from other alphabets that look like Latin letters in order to deceive users. So some browsers choose to display the ASCII name instead of the intended name. (This varies from browser to browser, and usually applies only to selected look-alike characters.)
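A quick way to check a pasted address for such hidden characters is to list everything outside ASCII. The sketch below is only an illustration: the helper `non_ascii_chars` is a made-up name, and the `pasted` string is an assumption reconstructed by decoding the xn--aaaaaaaaaausa-km6g label from the question, which points to U+2010 HYPHEN in place of the ASCII hyphen.

```python
import unicodedata


def non_ascii_chars(text: str):
    """Return (index, code point, Unicode name) for every non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]


# Assumption: the memo contained U+2010 HYPHEN rather than "-" (U+002D).
pasted = "aaaaaaaaaa\u2010usa"
typed = "aaaaaaaaaa-usa"

print(non_ascii_chars(pasted))  # [(10, 'U+2010', 'HYPHEN')]
print(non_ascii_chars(typed))   # []

# Encoding the pasted label reproduces what the browser tried to resolve.
print("xn--" + pasted.encode("punycode").decode("ascii"))  # xn--aaaaaaaaaausa-km6g
```

Saving the text as plain ANSI in Notepad replaces or drops the non-ASCII character, which would explain why the address works after that round trip.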
If I navigate to the following URL, which contains non-ASCII (UTF-8 encoded) characters, I get different results in different web browsers:
http://example.com/lörickè
Firefox 37 - Shows the correct URL as above.
Chrome 42 - Shows the correct URL as above.
Edge - Shows the correct URL as above.
IE 11 - Shows percent encoded URL http://example.com/l%c3%b6rick%c3%a8/
Where can I find a list of browsers and versions that support this feature, and are there any announcements about whether the new Microsoft Edge browser supports it?
This StackOverflow post highlights the above issue for those interested.
What is shown in browser address bars is not necessarily what is used internally.
If you enter http://example.com/lörickè in Firefox, it gets shown like that, but it actually gets percent-encoded and becomes http://example.com/l%C3%B6rick%C3%A8. This is for usability reasons (or, if IRIs are not supported, like in HTTP/1.1, for transforming an IRI into a URI), so users don’t necessarily have to enter the correct URL (with percent-encoding), and don’t get confused by seeing these cryptic parts.
You can easily check what really gets used by copy-pasting the URL from the address bar into a text document.
So the browsers from your example probably all use the same URI (i.e., the percent-encoded one); most of them just decided to display the un-encoded variant instead.
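The difference between the displayed and the transmitted form can be reproduced with Python's `urllib.parse`. This is a sketch of the percent-encoding step only, not of what any particular browser does internally; the variable names are chosen for illustration.

```python
from urllib.parse import quote, unquote

# "/lörickè" as shown in the address bar, written with escapes to be explicit.
displayed_path = "/l\u00f6rick\u00e8"

# quote() encodes the text as UTF-8 and percent-escapes every non-ASCII byte;
# "/" is left alone by default.
sent_path = quote(displayed_path)
print(sent_path)  # /l%C3%B6rick%C3%A8

# Decoding restores the human-readable form a browser may choose to display.
print(unquote(sent_path) == displayed_path)  # True
```

IE 11 shows the same bytes with lowercase hex digits (%c3%b6); percent-encoded octets are case-insensitive, so both spellings identify the same resource.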
After researching on Google and SO, there seem to be conflicting opinions on this.
We have run into a problem with Google Chrome substituting %7C for the | separator, whereas Firefox and Safari do not.
Here's an example:
http://www.example.com/page1|sub-page2|sub-page-3
Are there any strict rules to follow when choosing a separator character for semantic URLs and are there any strong arguments against (or workarounds when) using |?
| is not a valid character in a URL. Modern browsers will silently encode it to %7C when sending, and may or may not display this change in the address bar. Similarly, servers will silently decode the character for you.
This would have been a problem in the last millennium, when browsers would crash just because you didn't specify http://, but today you can use whatever you want and the browser will take care of it. However, automatic parsers such as Markdown may not agree that something like http://example.com/test|fish is a valid URL. In this case, it looks like it does, but try that on my forums and it will complain at you.
Internet Explorer and Chrome show the URL-encoded form in the address bar after a page request has been made; %7C is the safe way of writing a pipe ('|'),
so it's not a problem that Chrome is doing this.
As a cheeky fix to make all browsers behave the same way, why not use %7C as your separator from the get-go instead of a pipe? All browsers should then interpret it as a pipe behind the scenes, but display it as %7C in the address bar.
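The encoding and decoding described in these answers can be sketched with Python's `urllib.parse`. The path is the example from the question; splitting on | after decoding is an assumption about how the server side would consume it.

```python
from urllib.parse import quote, unquote

path = "/page1|sub-page2|sub-page-3"

# "|" is not allowed unescaped in a URL, so quote() turns it into %7C.
# Letters, digits, "-" and "/" pass through unchanged.
encoded = quote(path)
print(encoded)  # /page1%7Csub-page2%7Csub-page-3

# A server that decodes the path sees the original separator again,
# whichever form the browser put on the wire.
segments = unquote(encoded).lstrip("/").split("|")
print(segments)  # ['page1', 'sub-page2', 'sub-page-3']
```

Because both spellings decode to the same path, emitting %7C in your own links keeps the address bar consistent across browsers without changing what the server receives.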
We have built a "redirect" engine into our product so our customers can add/edit/delete custom redirects without us having to maintain a bunch of rewrite rules on the server.
Some issues are arising in the URLs that get passed into our code. We are pulling these from the CGI.QUERY_STRING property populated by Coldfusion (it picks up on 404's thrown by IIS/Coldfusion, which appends the bad URL as a query string like ?404;http://www.mysite.com:80/nonexistent-file.cfm).
What we see is that some special characters get an additional character (an Â) thrown in. Take this URL (%A9 is the copyright symbol):
http://www.mysite.com/%A9/
The CGI.QUERY_STRING is reporting this as:
http://www.mysite.com:80/Â©/
I have no idea where this extra "Â" is coming from. I have a hunch that this is being brought in by IIS, but it could also be with Coldfusion as it has to populate the CGI variable.
Any ideas as to why this is happening and how to fix it? It appears that not all percent-encoded/special characters do this...
EDIT:
I am probably giving up on my exact problem, however, it would be beneficial still to know why either IIS or Coldfusion is throwing in this extra character (especially for certain escape sequences over others).
Wow... that's a tough one. Usually folks design sites to use alphanumerics plus the tilde (~) and dash (-). I'm not even sure the RFC allows a copyright symbol in that part of the URL; as far as I know, non-ASCII characters are only supposed to appear there in percent-encoded form. This article might shed some light on it for you. As for IIS, you might be able to add a specific rewrite rule that takes care of the issue. Personally I would avoid these characters in the path part of the URL.
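One plausible source of the extra Â, offered as a guess rather than a confirmed description of IIS or ColdFusion: somewhere in the chain the © is re-encoded as UTF-8 (two bytes, C2 A9) and those bytes are then read back as Latin-1 or Windows-1252. The sketch below reproduces that effect in Python; the variable names are illustrative only.

```python
from urllib.parse import unquote_to_bytes

# %A9 is a single byte, which is the copyright sign in Latin-1.
raw = unquote_to_bytes("%A9")
symbol = raw.decode("latin-1")
print(symbol)  # ©

# Re-encoded as UTF-8, the same character becomes two bytes: C2 A9.
utf8_bytes = symbol.encode("utf-8")
print(utf8_bytes)  # b'\xc2\xa9'

# Reading those UTF-8 bytes as Latin-1 yields the stray "Â".
misread = utf8_bytes.decode("latin-1")
print(misread)  # Â©
```

Under this assumption only characters in the range U+0080 to U+00BF pick up a leading Â, while ASCII escapes such as %20 survive unchanged, which would match the observation that not all escape sequences are affected.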
What's the proper way to handle unicode characters in an iOS app that calls the foursquare API?
Our current setup calls the foursquare API from our iOS app, and returns XML (yes, we're changing this to JSON).
While testing our app, we discovered the hard way that this foursquare location borked our app, apparently because we were not set up to handle the two emoji characters in the venue name.
What's the proper way to handle unicode characters at each level?
In Objective-C, as we call the foursquare API?
In our WCF web services, as we return XML data to the app?
In SQL Server 2008, as we store a place name that may contain unicode characters?
On the database side, I know that we need to make some changes to the Collation settings, among other things.
What changes, if any, are needed to support Unicode 6.1 correctly in our iOS app and web services?
Thanks!
I'm guessing you are converting a Unicode string to another encoding and then back to Unicode somewhere along the way. If the converter doesn't know what a Unicode character is, it can do something strange (e.g. it might convert one of the new sad pussycat emoji characters to a square root sign).
This might happen without you explicitly doing it, depending on how your database is set up, the XML encoding you use, your HTTP server configuration, etc.
Unicode 6.1 just adds characters; the encodings are exactly the same (i.e. some codes that didn't map to anything now map to things).
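The kind of silent damage described above can be demonstrated in a few lines. This is a language-neutral illustration written in Python, not the Objective-C, WCF, or SQL Server code from the question, and the venue name is a hypothetical stand-in.

```python
# Hypothetical venue name containing an accented letter and an emoji
# (U+1F63F, a crying cat face).
venue = "Caf\u00e9 \U0001F63F"

# A Unicode encoding such as UTF-8 survives the round trip intact.
utf8_round_trip = venue.encode("utf-8").decode("utf-8")
print(utf8_round_trip == venue)  # True

# Passing through a legacy encoding replaces what it cannot represent.
lossy = venue.encode("ascii", errors="replace").decode("ascii")
print(lossy)  # Caf? ?

# Emoji lie outside the Basic Multilingual Plane, so UTF-16 needs two
# 16-bit code units (a surrogate pair) for this single character.
utf16_units = len("\U0001F63F".encode("utf-16-le")) // 2
print(utf16_units)  # 2
```

The practical takeaway is to keep one Unicode encoding end to end (HTTP response, XML declaration, database column type) so that no layer has to guess or substitute.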