I have an action like this:
<%=Html.ActionLink("My_link", "About", "Home", new RouteValueDictionary {
{ "id", "Österreich" } }, null)%>
This produces the following link: http://localhost:1855/Home/About/%C3%96sterreich
I want a link which looks like this - localhost:1855/Home/About/Österreich
I have tried:
Server.HtmlDecode("Österreich")
HttpUtility.UrlDecode("Österreich")
Neither seems to be helping. What else can I try to get my desired result?
I think this is an issue with your browser (IE).
Your code is correct as it is; no explicit UrlEncoding is needed.
<%=Html.ActionLink("My_link", "About", "Home", new RouteValueDictionary {
{ "id", "Österreich" } }, null)%>
There is nothing wrong with ASP.NET MVC. Compare Unicode URLs on the web, e.g. http://he.wikipedia.org/wiki/%D7%9B%D7%A9%D7%A8%D7%95%D7%AA, in IE and in a browser that handles Unicode in URLs correctly.
Chrome, for example, displays Unicode URLs without any problem; IE does not decode "special" Unicode characters in the address bar.
This is only a cosmetic issue.
According to RFC 1738 (Uniform Resource Locators (URL)), only US-ASCII is supported; all other characters must be encoded.
2.2. URL Character Encoding Issues

URLs are sequences of characters, i.e., letters, digits, and special characters. A URL may be represented in a variety of ways: e.g., ink on paper, or a sequence of octets in a coded character set. The interpretation of a URL depends only on the identity of the characters used.

In most URL schemes, the sequences of characters in different parts of a URL are used to represent sequences of octets used in Internet protocols. For example, in the ftp scheme, the host name, directory name and file names are such sequences of octets, represented by parts of the URL. Within those parts, an octet may be represented by the character which has that octet as its code within the US-ASCII [20] coded character set.

In addition, octets may be encoded by a character triplet consisting of the character "%" followed by the two hexadecimal digits (from "0123456789ABCDEF") which form the hexadecimal value of the octet. (The characters "abcdef" may also be used in hexadecimal encodings.)

Octets must be encoded if they have no corresponding graphic character within the US-ASCII coded character set, if the use of the corresponding character is unsafe, or if the corresponding character is reserved for some other interpretation within the particular URL scheme.

No corresponding graphic US-ASCII:

URLs are written only with the graphic printable characters of the US-ASCII coded character set. The octets 80-FF hexadecimal are not used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.
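The encoding rule above is easy to see in practice. Here is a small sketch using Python's standard urllib.parse, which percent-encodes the UTF-8 octets by default:

```python
from urllib.parse import quote, unquote

# "Ö" (U+00D6) has no US-ASCII representation, so its two UTF-8
# octets C3 96 are each written as a "%" triplet, per RFC 1738.
encoded = quote("Österreich")
print(encoded)           # %C3%96sterreich
print(unquote(encoded))  # Österreich
```

This is the same transformation your ActionLink applies to the route value, which is why the generated link is already correct.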
I think your desire for a non-urlencoded URL is valid, but I don't think the tools actually make it easy to do this.
Would putting the generated link inside an <a>, with the link text being the non-encoded string, not be good enough? It would still look bad in the browser's URL field, but your UI would be a little prettier.
Also, in Firefox anyway, the URL shown in my status bar when I mouse over your "ugly" link shows the unencoded version, so it would probably look fine there as well.
On Wikipedia you see URLs like these:
https://zh.wiktionary.org/wiki/附录:字母索引 (but copy-pasting the URL results in the equivalent https://zh.wiktionary.org/wiki/%E9%99%84%E5%BD%95:%E5%AD%97%E6%AF%8D%E7%B4%A2%E5%BC%95).
https://th.wiktionary.org/wiki/หน้าหลัก (which when copy-pasted becomes
https://th.wiktionary.org/wiki/%E0%B8%AB%E0%B8%99%E0%B9%89%E0%B8%B2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%81)
First, I'm wondering what is happening here, what the encoding transformation is called and what it's doing and why it's doing that. I don't see why you can't just have the original native characters in the URL.
Second, I'm wondering if what Wikipedia is doing is considered valid. If it is okay to include these non-ASCII glyphs in the URL, and if not, why not (other than perhaps because the standard says so). Also would be interested to know how many browsers support showing the link in the URL bar using the native glyphs vs. this encoded thing, and even would be interesting to know how native Chinese/Thai/etc. people enter in the URL in their language, if they use the encoding or what (but that probably makes this question too complicated; still would be an interesting bonus).
The reason I ask is because I would like to put let's say words/definitions of a few different languages onto a webpage, and I would like to make the url show the actual word used in the language. So in english it might be /hello, but the equivalent word/definition in Thai would be /สวัสดี. That makes way more sense to me than having to make it into the encoding thing.
From https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
Strings of data octets within a URI are represented as characters. Permitted characters within a URI are the ASCII characters for the lowercase and uppercase letters of the modern English alphabet, the Arabic numerals, hyphen, period, underscore, and tilde.[14] Octets represented by any other character must be percent-encoded.
Not all Unicode characters can be used in URIs. Characters that aren't supported can still be represented using percent-encoding. You see the non-ASCII characters in the URL field because your browser chooses to display them that way; the actual HTTP requests are made using the encoded strings.
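As a quick sketch of that last point (assuming Python 3's urllib.parse; the safe=":" argument just keeps the colon literal, as Wikipedia's URLs do):

```python
from urllib.parse import quote, unquote

display = "附录:字母索引"        # what the browser shows you
wire = quote(display, safe=":")  # what actually goes over HTTP
print(wire)           # %E9%99%84%E5%BD%95:%E5%AD%97%E6%AF%8D%E7%B4%A2%E5%BC%95
print(unquote(wire))  # 附录:字母索引
```

The two strings are equivalent; each CJK character becomes three %XX triplets (its three UTF-8 bytes), and the browser merely chooses which form to render.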
If we type into Firefox or Chrome
http://☃.net/
it takes us to
http://xn--n3h.net/
which is a mirror of unicodesnowmanforyou.com.
What I don't understand is by what rules the Unicode snowman decodes to xn--n3h; it doesn't look anything like UTF-8 or URL encoding.
I think I found a hint while mucking around in python3, because:
>>> '☃'.encode('punycode')
b'n3h'
But I still don't understand the xn-- part. How are domain names internationalised, what is the standard and where is this stuff documented?
It uses an encoding scheme called Punycode (as you've already discovered from the Python testing you've done), capable of representing Unicode characters in ASCII-only format.
Each label (delimited by dots, so get.me.a.coffee.com has five labels) that contains Unicode characters is encoded in Punycode and prefixed with the string xn--.
The label encoding first copies all the ASCII characters, then appends the encoded Unicode characters. The Unicode characters are always after the final - in the label, so one is added after the ASCII characters if needed.
More detail can be found in this page over at the w3 site, and in RFC 3987. For details on how Punycode actually encodes labels, see the Wikipedia page.
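Sticking with the Python experiment above, the missing piece can be sketched like this (the xn-- prefix is added by hand here, assuming a label with no ASCII part):

```python
# Encode one label of an internationalised domain name by hand.
label = "☃"
puny = label.encode("punycode").decode("ascii")  # 'n3h'
ascii_label = "xn--" + puny                      # what actually goes into DNS
print(ascii_label)                               # xn--n3h

# Stripping the prefix and decoding reverses the process.
print(ascii_label[4:].encode("ascii").decode("punycode"))  # ☃
```

Real IDNA processing also normalises the label first (nameprep/UTS 46), so treat this as an illustration rather than a complete implementation.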
When someone types an url in a browser to access a page, which charset is used for that URL? Is there a standard? Can I consider that UTF-8 is used everywhere? Which characters are accepted?
URLs may contain only a subset of ASCII characters; all valid URLs are ASCII.
International domain names must be Punycode encoded. Non-ASCII characters in the path or query parts must be encoded, with Percent-encoding being the generally agreed-upon standard.
Percent-encoding only takes the raw bytes and encodes each byte as %xx. There's no generally followed standard on what encoding should be used to determine a byte representation. As such, it's basically impossible to assume any particular character set being used in the percent-encoded representation. If you're creating those links, then you're in full control over the used charset before percent-encoding; if you're not, you're mostly out of luck. Though you will most likely encounter UTF-8, this is not guaranteed.
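That charset choice can be sketched with Python's urllib.parse.quote, which lets you pick the encoding applied before percent-encoding:

```python
from urllib.parse import quote

# The same character yields different percent-encodings depending on
# the charset applied first; the URL itself records nothing about it.
print(quote("ö", encoding="utf-8"))    # %C3%B6
print(quote("ö", encoding="latin-1"))  # %F6
```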
I was curious if I should encode urls with ASCII or UTF-8. I was under the belief that urls cannot have non-ASCII characters, but someone told me they can have UTF-8, and I searched around and couldn't quite find which one is true. Does anyone know?
There are two parts to this, but they both amount to "yes".
With IDNA, it is possible to register domain names using the full Unicode repertoire (with a few minor twists to prevent ambiguities and abuse).
The path part is not strictly regulated, but it's possible to encode arbitrary strings in the path. The browser could opt to display a human-readable rendering rather than an encoded path. However, this requires heuristics, as there is no way to specify the character set and encoding of the path.
So, http://xn--msic-0ra.example/mot%C3%B6rhead is a (fictional example, not entirely correct) computer-readable encoded URL which could be displayed to the user as http://müsic.example/motörhead. The domain name is encoded as xn--msic-0ra.example in something called Punycode, and the path contains the label "motörhead" encoded as UTF-8 and URL encoded (the Unicode code point U+00F6 is represented by the two bytes 0xC3 0xB6 in UTF-8).
The path could also be mot%F6rhead which is the same label in Latin-1. In this case, deducing a reasonable human-readable representation would be much harder, but perhaps the context of the surrounding characters could offer enough hints for a good guess.
In isolation, %F6 could be pretty much anything, and %C3%B6 could be e.g. UTF-16.
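The guessing problem can be sketched with urllib.parse.unquote, whose defaults are UTF-8 decoding with invalid bytes replaced:

```python
from urllib.parse import unquote

path = "mot%F6rhead"
print(unquote(path, encoding="latin-1"))  # motörhead
# Read as UTF-8, the lone F6 byte is invalid and becomes U+FFFD:
print(unquote(path))                      # mot�rhead
```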
Are Latin encoded characters considered URL safe?
Having read this post, I'm aware that web safe characters are outlined in this document. The specs do not make clear, however, if Latin encoded characters are part of the unreserved list. For example: ç and õ.
I don't see why those characters would not be included in the unreserved list. But, that said, I'm yet to see any URLs that contain such characters.
Relevant question: Assuming I can use such characters in my URL, should I?
My URLs will be generated by user input. Should I keep titles with such characters, or substitute them? For example, ç becomes c, and so on.
My reader's native language is Portuguese, but I'm not sure if they will care about these characters in the page's friendly-URL.
The RFC you linked specifically mentions ASCII as the character set for URIs:
The ABNF notation defines its terminal values to be non-negative integers (codepoints) based on the US-ASCII coded character set [ASCII].
That would make characters outside of ASCII not safe, as far as the RFC is concerned.
Of course, this is all before IDN existed. There is an RFC that specifies how conversions between ASCII and Unicode on the URL should occur.
You can use any characters you want: if a character falls outside the ASCII range, percent-encoded octets are used in order to make the URI transportable.