Unicode domain name pitfalls (URL)

Yahoo's performance rules and stackoverflow.com both advise serving static content from a separate domain that you don't assign cookies to. http://developer.yahoo.com/performance/rules.html#cookie_free http://sstatic.net
Based on the desire for a static-only domain, I thought it would be cool to have the domain name made up of Unicode characters. From what I understand, the pitfalls of Unicode domain names include: difficulty of typing, and automatic conversion to punycode (a safeguard introduced after the paypal.com homograph incident).
For example, if I wanted to link my stylesheet:
<link rel="stylesheet" href="http://☺.com/s.css">
....
<script src="http://☺.com/s.js"></script>
Considering I only plan to link to static content, are there any issues or pitfalls?
Do all browsers natively support Unicode-to-punycode conversion? It has been unclear to me whether Internet Explorer versions before 7 support punycode. Also, would IE display a notice if you are simply linking to server content on a Unicode domain?
Bonus: is there anywhere to find a list of the Unicode characters that are legal in URLs? Supposedly some characters aren't permitted. Or would a URL containing non-permitted characters simply be translated to punycode immediately, therefore not affecting my situation?

Read http://www.ietf.org/rfc/rfc3454.txt (stringprep, the string-preparation rules that IDNA's nameprep profile is built on).
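A quick way to watch the conversion itself (a minimal sketch; the host name below is a placeholder, and symbol-only names such as ☺ may be rejected under modern IDNA2008 rules even though browsers have historically accepted some of them) is the WHATWG URL API in any modern browser or Node:

// The parser applies the Unicode -> punycode (IDNA) conversion for you,
// so .hostname shows exactly what goes on the wire in the Host header.
new URL("http://bücher.example/s.css").hostname;
// => "xn--bcher-kva.example"
new URL("http://bücher.example/s.css").href;
// => "http://xn--bcher-kva.example/s.css"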

Related

Language-specific characters in URLs

Colleagues from work have created an API endpoint which uses language-specific characters in the URL. The API URL looks like:
http://somedomain.com/someapi/somemethod/zażółć/gęślą/jaźń
Is this OK or is it a bad approach?
Technically, that's not a valid URL, but web browsers and other clients finesse it. The script the characters come from is not an issue, but structural characters like "/?#" could be. You'll have to consider what to do when they show up in data that you are "pasting" into your URLs.
An HTTP URL is:
an ASCII-encoded scheme (in this case the protocol "http")
a punycode-encoded, ASCII-encoded domain
a %-encoded, ASCII-encoded, server-defined sequence of octets for the path, optional query, and optional hash.
See RFC 3986
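As a sketch of that decomposition (using the WHATWG URL API, available in browsers and Node; the parser applies the encodings described above):

const u = new URL("http://somedomain.com/someapi/somemethod/zażółć/gęślą/jaźń");
u.protocol; // "http:" — the ASCII scheme
u.hostname; // "somedomain.com" — already ASCII, so no punycode step is needed
u.pathname; // the UTF-8, %-encoded path, identical to the one quoted further down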
The assumption that everyone makes—quite reasonably because it is the predominant practice—is that the path, query, and hash are text. There is no text but encoded text. So, some character encoding is involved. Where %-encoding is needed outside of structural characters, browsers are going to assume UTF-8. If you don't want browsers to do the %-encoding, use valid URLs by doing it yourself with the character encoding that you are using.
As the world is standardizing on UTF-8 (where applicable), the HTML DOM has followed suit with the encodeURIComponent function. Clients using JavaScript in a web browser are likely to use this function, either directly or through some library.
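For instance, building the path with encodeURIComponent (a sketch; the segment values are taken from the question) produces exactly the octets a browser puts on the wire:

const segments = ["zażółć", "gęślą", "jaźń"];
const path = segments.map(encodeURIComponent).join("/");
// => "za%C5%BC%C3%B3%C5%82%C4%87/g%C4%99%C5%9Bl%C4%85/ja%C5%BA%C5%84"
// Encoding each segment separately also means any "/" inside a data value
// becomes %2F instead of starting a new path segment.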
Here is the UTF-8 encoded, %-encoded (and then, on the wire, ASCII-encoded) version of your URL that my browser created:
http://somedomain.com/someapi/somemethod/za%C5%BC%C3%B3%C5%82%C4%87/g%C4%99%C5%9Bl%C4%85/ja%C5%BA%C5%84
(You can see this yourself using your browser's dev tools [F12 key, network tab] or a packet sniffer [e.g., Wireshark or Fiddler]. What you gave as a URL is never seen on the wire.)
Your server application probably understands that just fine. In any case, it is your server's rules that the client complies with. If your API uses UTF-8 encoded, %-encoded URLs then just document that. (But phrase it in a way that doesn't confuse people who do that already without knowing.)

How to transform an encoded URL into readable text?

It's about Bangla Unicode text, but this can be a problem for any language not written in Latin glyphs.
I'm the host of a Bangla blog with all its texts and categories in Bangla (I prefer not to say Bengali, because the name of the language is Bangla rather than Bengali).
So a category named in Bangla, "বাংলা", gets a URL like:
http://www.example.com/category/বাংলা
But whenever I copy the URL from the address bar and paste it into a chat panel or somewhere else, it turns into strange characters, for example:
http://www.example.com/category/%E0%A6%B8%E0%A7%8D%E0%A6%A8%E0*
(* it's just an example, not the exact encoded gibberish for the word "বাংলা")
So, in many cases I get encoded URLs like the one above, with no trace of which Unicode text they actually refer to. Recently I've been getting some 404 errors logged by one of my plugins. There I found a URI like:
/category/%E0%A6%B8%E0%A7%8D%E0%A6%A8%E0%A6%BE%E0%A7%9F%E0%A7%81%E0%A6%AC%E0%A6%BF%E0%A6%A6%E0%A7%8D%E0%A6%AF%E0
I used Jetpack's Omnisearch to look for a match, but the result is empty. I can't even trace which category it is that's producing such a 404.
So here comes the question:
How can I transform the encoded URL to readable glyphs?
http://www.example.com/category/বাংলা
isn't a URL; URLs can only contain ASCII characters. This is an IRI.
http://www.example.com/category/%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE
is the URI representation of that IRI. They are otherwise equivalent. A browser may display the ‘pretty’ IRI version in the user interface, but put the URI version on the clipboard so that you can paste it into other tools that don't support IRI.
The 404 address you pasted translates to:
/category/স্নায়ুবিদ্য�
where the last character is a � because it is an invalid, truncated UTF-8 sequence. (This is probably why the request failed.) Someone may have mis-pasted a partial URI here.
If you're using JavaScript, you can do:
decodeURIComponent(url);
This decodes the %-escaped UTF-8 back into the original, readable characters.
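For example (a sketch; the strings are the ones from this question):

decodeURIComponent("%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE");
// => "বাংলা" — the %-escaped UTF-8 decoded back to readable glyphs

encodeURIComponent("বাংলা");
// => "%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE" — the reverse direction (IRI -> URI)

// The truncated 404 path throws, because its final %E0 begins a UTF-8
// sequence that is never completed:
try {
  decodeURIComponent("%E0%A6%B8%E0%A7%8D%E0%A6%A8%E0%A6%BE%E0%A7%9F%E0%A7%81%E0%A6%AC%E0%A6%BF%E0%A6%A6%E0%A7%8D%E0%A6%AF%E0");
} catch (e) {
  console.log(e.name); // "URIError" ("URI malformed")
}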

URL containing non-visual characters

My crawler engine seems to have a problem with a specific customer's site.
At that site, there are redirects to URLs that look like this:
http://example.com/dir/aaa$0081 aaa.php
(Showing the URL as non-encoded, with $0081 being two bytes represented using HEX.)
Now, this is when inspecting the buffer returned after using the WinInet Windows API call HttpQueryInfo, so the two bytes actually represent a WideChar at this point.
Now, I can see that e.g. $0081 is a non-visual control character:
Latin-1 Supplement (Unicode block)
The problem is that if I use the URL "as-is" (URL-encoded) for future requests to the server, it responds with 400 or 404. (On the other hand, if it is removed entirely, it works and the server delivers the correct page and response...)
I suspect that Firefox/IE/etc. are stripping non-visible control characters from URLs before making the HTTP requests... (At least the IEHTTPHeaders and FF Live HTTP Headers add-ins don't show any non-visible characters.)
I was wondering if anyone can point to a standard for this? From what I can see, non-visible characters should not appear in URLs, so I am thinking a solution might be (in this and future cases) to remove them. But it is not a topic that seems widely discussed on the net.
In the example given, $0081 is just five ASCII characters. But if you mean that this is just what it looks like and you have (somehow) inferred that the actual URL contains U+0081, then what should happen, and does happen at least on Firefox, is that it is %-encoded (“URL encoded”) as %C2%81 (formed by %-encoding the two bytes of the UTF-8 encoded form of U+0081). Firefox shows this as empty in its address bar, since U+0081 is a control character, but the server actually gets %C2%81 and must take it from there.
I have no idea of where the space comes from, but a URL must not contain a space, except as %-encoded (%20).
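As a small illustration of both points (a sketch; the path fragment is the one from the question):

encodeURIComponent("\u0081");
// => "%C2%81" — the two UTF-8 bytes of U+0081, %-encoded, which is what Firefox sends

encodeURIComponent("aaa\u0081 aaa.php");
// => "aaa%C2%81%20aaa.php" — the space likewise becomes %20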
The relevant standard is Internet-standard STD 66, URI Generic Syntax. (Currently RFC 3986. Beware: people still often refer to older RFCs as “standard” in this issue.)

Foreign characters in meta tags not displaying correctly

I have a site that is replicated in many languages. The site itself displays characters correctly, but when viewing the source, the meta tags show the "unknown character" question mark instead of the foreign character.
What do I need to do differently for meta tags?
I have this tag already:
<meta http-equiv="content-type" content="application/xhtml+xml; charset=utf-8" />
I changed the charset to iso-8859-1 and it works now.
Then it means that you have saved the file as ISO-8859-1 (or possibly as CP-1252 when on Windows) instead of UTF-8. In any decent text editor / IDE you should be able to configure the default file encoding and/or use the Save As option to set the desired encoding. Also, don't forget to set the HTTP response headers accordingly. How to do this depends on the webserver used and/or the server-side language in question (if any).
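For instance, a minimal sketch with Node's built-in http module (the file name is hypothetical; the point is that the charset in the header, the charset in the <meta> tag, and the actual encoding of the file on disk all have to agree):

const http = require("http");
const fs = require("fs");

http.createServer((req, res) => {
  // Declare UTF-8 in the HTTP header; the header takes precedence over <meta>.
  res.setHeader("Content-Type", "application/xhtml+xml; charset=utf-8");
  res.end(fs.readFileSync("page.xhtml")); // raw bytes — the file must really be saved as UTF-8
}).listen(8080);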
By the way, you really don't want to use ISO-8859-1 when you want to go for World Domination. It doesn't cover all the characters the world is aware of. It only covers Latin, not Hebrew, Cyrillic, Arabic, Chinese/Korean/Japanese, etc.

What is the _snowman param in Ruby on Rails 3 forms for?

In Ruby on Rails 3 (currently using Beta 4), I see that when using the form_tag or form_for helpers there is a hidden field named _snowman with the value of ☃ (Unicode U+2603, decimal 9731) showing up.
So, what is this for?
This parameter was added to forms in order to force Internet Explorer (5, 6, 7 and 8) to encode its parameters as unicode.
Specifically, this bug can be triggered if the user switches the browser's encoding to Latin-1. To understand why a user would decide to do something seemingly so crazy, check out this google search. Once the user has put the web-site into Latin-1 mode, if they use characters that can be understood as both Latin-1 and Unicode (for instance, é or ç, common in names), Internet Explorer will encode them in Latin-1.
This means that if a user searches for "Ché Guevara", it will come through incorrectly on the server-side. In Ruby 1.9, this will result in an encoding error when the text inevitably makes its way into the regular expression engine. In Ruby 1.8, it will result in broken results for the user.
By creating a parameter that can only be understood by IE as a unicode character, we are forcing IE to look at the accept-charset attribute, which then tells it to encode all of the characters as UTF-8, even ones that can be encoded in Latin-1.
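Not part of the Rails patch itself, but as a quick sketch of why the snowman works: U+2603 has no Latin-1 mapping, so IE cannot fall back to the page-override encoding for it.

"\u2603";                     // => "☃" (code point 0x2603, decimal 9731)
encodeURIComponent("\u2603"); // => "%E2%98%83" — its UTF-8 bytes, %-encoded,
                              // which is how _snowman=☃ travels in the request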
Keep in mind that in Ruby 1.8, it is extremely trivial to get Latin-1 data into your UTF-8 database (since nothing in the entire stack checks that the bytes that the user sent at any point are valid UTF-8 characters). As a result, it's extremely common for Ruby applications (and PHP applications, etc. etc.) to exhibit this user-facing bug, and therefore extremely common for users to try to change the encoding as a palliative measure.
All that said, when I wrote this patch, I didn't realize that the name of the parameter would ever appear in a user-facing place (it does with forms that use the GET action, such as search forms). Since it does, we will rename this parameter to _e, and use a more innocuous-looking unicode character.
This is here to support Internet Explorer 5 and encourage it to use UTF-8 for its forms.
The commit message seen here details it as follows:
Fix several known web encoding issues:
* Specify accept-charset on all forms. All recent browsers, as well as IE5+, will use the encoding specified for form parameters.
* Unfortunately, IE5+ will not look at accept-charset unless at least one character in the form's values is not in the page's charset. Since the user can override the default charset (which Rails sets to UTF-8), we provide a hidden input containing a unicode character, forcing IE to look at the accept-charset.
* Now that the vast majority of web input is UTF-8, we set the inbound parameters to UTF-8. This will eliminate many cases of incompatible encodings between ASCII-8BIT and UTF-8.
* You can safely ignore params[:_snowman]
In short, you can safely ignore this parameter.
Still, I am not sure why we're supporting old technologies like Internet Explorer 5. It seems like a very non-Ruby on Rails decision if you ask me.
