Are Latin encoded characters considered URL safe?

Having read this post, I'm aware that web safe characters are outlined in this document. The specs do not make clear, however, if Latin encoded characters are part of the unreserved list. For example: ç and õ.
I don't see why those characters would not be included in the unreserved list. That said, I have yet to see any URLs that contain such characters.
Relevant question: Assuming I can use such characters in my URL, should I?
My URLs will be generated from user input. Should I keep titles with such characters, or substitute them? For example, ç becomes c, and so on.
My readers' native language is Portuguese, but I'm not sure if they will care about these characters in the page's friendly URL.

The RFC you linked specifically mentions ASCII as the character set for URIs:
The ABNF notation defines its terminal values to be non-negative
integers (codepoints) based on the US-ASCII coded character set
[ASCII].
That would make characters outside of ASCII not safe, as far as the RFC is concerned.
Of course, this is all before IDN existed. There is an RFC that specifies how conversions between ASCII and Unicode on the URL should occur.

You can use any characters you want, because any character outside the ASCII range is converted to percent-encoded octets in order to make the URI transportable.
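To make the two options from the question concrete, here is a minimal Python sketch using only the standard library. The helper names `percent_encode_segment` and `ascii_fold` are made up for illustration, and the sketch assumes UTF-8 is the encoding you want behind the percent-escapes.

```python
import unicodedata
from urllib.parse import quote


def percent_encode_segment(title: str) -> str:
    """Keep the original characters, but percent-encode their UTF-8 bytes."""
    # safe="" also escapes "/", so the title stays a single path segment.
    return quote(title, safe="")


def ascii_fold(title: str) -> str:
    """Substitute accented letters with their base letters (ç becomes c)."""
    # NFKD splits "ç" into "c" plus a combining cedilla; encoding to ASCII
    # with errors="ignore" then drops the combining mark.
    decomposed = unicodedata.normalize("NFKD", title)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(percent_encode_segment("ação"))  # a%C3%A7%C3%A3o
print(ascii_fold("ação"))              # acao
```

Note that the folding approach silently drops characters that have no ASCII base letter, so it may not suit every language.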

Related

Is it valid that Wikipedia uses Chinese characters (and other Unicode characters) in URLs?

On Wikipedia you see URLs like these:
https://zh.wiktionary.org/wiki/附录:字母索引 (but copy-pasting the URL results in the equivalent https://zh.wiktionary.org/wiki/%E9%99%84%E5%BD%95:%E5%AD%97%E6%AF%8D%E7%B4%A2%E5%BC%95).
https://th.wiktionary.org/wiki/หน้าหลัก (which when copy-pasted becomes
https://th.wiktionary.org/wiki/%E0%B8%AB%E0%B8%99%E0%B9%89%E0%B8%B2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%81)
First, I'm wondering what is happening here, what the encoding transformation is called and what it's doing and why it's doing that. I don't see why you can't just have the original native characters in the URL.
Second, I'm wondering if what Wikipedia is doing is considered valid. If it is okay to include these non-ASCII glyphs in the URL, and if not, why not (other than perhaps because the standard says so). Also would be interested to know how many browsers support showing the link in the URL bar using the native glyphs vs. this encoded thing, and even would be interesting to know how native Chinese/Thai/etc. people enter in the URL in their language, if they use the encoding or what (but that probably makes this question too complicated; still would be an interesting bonus).
The reason I ask is that I would like to put, say, words and definitions from a few different languages onto a webpage, and I would like the URL to show the actual word used in the language. So in English it might be /hello, but the equivalent word/definition in Thai would be /สวัสดี. That makes way more sense to me than having to turn it into the encoded form.
From https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
Strings of data octets within a URI are represented as characters. Permitted characters within a URI are the ASCII characters for the lowercase and uppercase letters of the modern English alphabet, the Arabic numerals, hyphen, period, underscore, and tilde. Octets represented by any other character must be percent-encoded.
Not all Unicode characters can be used in URIs. Characters that aren't supported can still be encoded using percent-encoding. You can see the non-ASCII characters in the URL field because your browser chooses to display them that way; the actual HTTP requests are made using the encoded strings.
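The transformation between the displayed form and the transmitted form can be reproduced with Python's `urllib.parse`. This is only a sketch of the idea, assuming UTF-8 (which is what the Wikipedia URLs above use); the variable names are illustrative.

```python
from urllib.parse import quote, unquote

# Path as the browser displays it in the address bar.
native_path = "/wiki/附录:字母索引"

# Path as it is sent over the wire: each UTF-8 byte of a non-ASCII
# character becomes %XX. "/" and ":" are left alone via the safe argument.
wire_path = quote(native_path, safe="/:")
print(wire_path)

# Decoding the escapes gives back the native form.
print(unquote(wire_path) == native_path)  # True
```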

What is the charset of URLs?

When someone types a URL in a browser to access a page, which charset is used for that URL? Is there a standard? Can I assume that UTF-8 is used everywhere? Which characters are accepted?
URLs may contain only a subset of ASCII; all URLs are valid ASCII.
International domain names must be Punycode encoded. Non-ASCII characters in the path or query parts must be encoded, with Percent-encoding being the generally agreed-upon standard.
Percent-encoding only takes the raw bytes and encodes each byte as %xx. There's no generally followed standard on what encoding should be used to determine a byte representation. As such, it's basically impossible to assume any particular character set being used in the percent-encoded representation. If you're creating those links, then you're in full control over the used charset before percent-encoding; if you're not, you're mostly out of luck. Though you will most likely encounter UTF-8, this is not guaranteed.
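The ambiguity is easy to demonstrate. In the sketch below, the same character yields different escapes depending on the charset chosen before percent-encoding, and decoding with the wrong charset gives garbage. The `decode_guess` helper is a hypothetical heuristic (try UTF-8 first, fall back to Latin-1), not a standard algorithm.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

as_utf8 = quote("ç", encoding="utf-8")      # two bytes: %C3%A7
as_latin1 = quote("ç", encoding="latin-1")  # one byte:  %E7
print(as_utf8, as_latin1)

# Decoding UTF-8 escapes as Latin-1 silently produces mojibake.
print(unquote(as_utf8, encoding="latin-1"))  # Ã§


def decode_guess(escaped: str) -> str:
    """Hypothetical fallback: strict UTF-8 first, then Latin-1."""
    raw = unquote_to_bytes(escaped)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails,
        # but the result is only a guess.
        return raw.decode("latin-1")


print(decode_guess(as_utf8), decode_guess(as_latin1))
```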

Is there a character that is illegal in all parts of a URI?

I need a character to separate two or more URIs in one string. Later I will split the string to get each URI separately.
The problem is I'm not sure what character to pick here. Is there a good character to choose here that definitely can't be part of a URI itself? Or is ultimately pretty much all characters allowed in a URI?
I know certain characters are illegal in certain parts of the URI, but I'm talking about a URI as a whole, like this:
scheme://username:password@domain.tld/path/to/file.ext?key=value#blah
I'm thinking maybe space, although technically I suppose that could be part of the password, or would it be escaped as %20 in that case?
Any of the control characters should be good for this, such as TAB, FF and so on.
RFC3986 (a) controls the URI specification and Appendix A of that RFC states that the characters are limited to:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
0123456789-._~:/?#[]@!$&'()*+,;=
(and the % encoding character, of course, for all other characters not listed above).
So, basically, any other character should be okay as a delimiter.
(a) This has actually been augmented by RFC6874 which has to do with changes to the IPv6 part of the URI, adding a zone identifier. Since the zone ID consists of % and "unreserved" characters already included above, it doesn't change the set of characters allowed.
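As a rough illustration, the sketch below joins and splits URIs with a space as the delimiter and first checks that every URI uses only the RFC 3986 character set (including `%`). The function names and the choice of space as the default separator are assumptions made for this example; a control character such as TAB would work the same way.

```python
import string

# Characters RFC 3986 allows in a URI, plus "%" for percent-escapes.
ALLOWED = set(string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%")


def join_uris(uris, sep=" "):
    """Join URIs with a separator that cannot occur in a valid URI."""
    if sep in ALLOWED:
        raise ValueError(f"separator {sep!r} may appear inside a URI")
    for uri in uris:
        bad = set(uri) - ALLOWED
        if bad:
            raise ValueError(f"{uri!r} contains characters outside RFC 3986: {sorted(bad)}")
    return sep.join(uris)


def split_uris(blob, sep=" "):
    """Inverse of join_uris."""
    return blob.split(sep) if blob else []


uris = ["scheme://user:pw@domain.tld/path?key=value#blah", "http://example.com/a%20b"]
blob = join_uris(uris)
print(split_uris(blob) == uris)  # True
```

A space inside a password or path has to be written as %20 to be valid, which is why the validation step rejects raw spaces.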

URL escaping: Is it legal to write multi-byte character "%83%4a" as "%83J" in URL?

Suppose a URL is encoded in a multi-byte character set where one of the bytes in a multi-byte sequence could be between 0 and 127, i.e. an otherwise valid 7-bit ASCII character.
Example: The Japanese Shift_JIS character set, where the character カ would be escaped as %83%4a. Now %4a is also the ASCII character J, so I could instead write %83J.
Would that be OK by the whatever standard(s) apply?
I'm not asking because I want to send URLs like this (although the latter saves a couple bytes), but whether I should accept those on the server side, i.e. whether it is standards-compliant and also, whether I can expect other servers to handle this in the same way.
I'm basing my answer on RFC 2396, as that is what's being used by HTTP 1.1.
According to Section 2.1, there are 2 separate steps, the latter being optional:
URI character sequence->octet sequence
octet sequence->original character sequence
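The first step turns both %4a and a literal J into the same octet (0x4A), so %83%4a and %83J produce the same octet sequence before any character set is applied.
So the answer is: Yes, it's OK.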
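A quick way to check this is to run both spellings through a percent-decoder and compare the resulting octets. The sketch below uses Python's `urllib.parse`; it shows how one implementation behaves and is not a statement about every server.

```python
from urllib.parse import unquote, unquote_to_bytes

fully_escaped = "%83%4a"
partly_escaped = "%83J"

# Step 1: URI character sequence -> octet sequence.
# Both spellings yield the two octets 0x83 0x4A.
print(unquote_to_bytes(fully_escaped) == unquote_to_bytes(partly_escaped))  # True

# Step 2 (optional): octet sequence -> original character sequence,
# here interpreting the octets as Shift_JIS.
print(unquote(partly_escaped, encoding="shift_jis"))  # カ
```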

Can a path in a URI contain unicode?

Is it possible for a valid URL to contain non-escaped Unicode characters?
Yes, but only the subset of ASCII (and therefore of Unicode) that is allowed unescaped in URIs, such as letters and digits. The majority of the Unicode character set has to be percent-encoded.
URIs and URLs do not natively support unescaped non-ASCII Unicode characters. Many servers do accept percent-encoded UTF-8 or localized ANSI octets, but there is no way to specify which encoding is actually used. For standardized native Unicode handling, use IRIs (RFC 3987) instead, which extend URIs to allow Unicode characters. The IRI specification uses UTF-8 as the basis for percent-encoding and provides rules for converting between IRI and URI.
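The IRI-to-URI direction can be approximated in a few lines of Python. This is a simplified sketch, not a complete RFC 3987 implementation: `iri_to_uri` is a hypothetical helper, it ignores any userinfo part, and it relies on Python's built-in `idna` codec for the host name.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched; "%" is included so that escapes already
# present in the input are not encoded a second time.
_SAFE = "/:@!$&'()*+,;=-._~%"


def iri_to_uri(iri: str) -> str:
    """Convert an IRI to an ASCII-only URI (simplified)."""
    parts = urlsplit(iri)
    # Host: Punycode each non-ASCII label.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Path, query, fragment: percent-encode the UTF-8 bytes.
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=_SAFE),
        quote(parts.query, safe=_SAFE + "?"),
        quote(parts.fragment, safe=_SAFE + "?"),
    ))


print(iri_to_uri("https://bücher.example/título?q=ação"))
# https://xn--bcher-kva.example/t%C3%ADtulo?q=a%C3%A7%C3%A3o
```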

Resources