Why do we need to encode and decode URLs? - url

I have read the question "Why do you need to encode URLs?"
but I am still confused:
Why doesn't the W3C just allow more characters in URLs, so that encoding could be avoided?
Why does decoding exist?

The URL representation of characters may differ from the characters you have in your code. In other words, there is a specific grammar that defines how URLs are assembled. Special characters that are used in forming a URL need to be encoded so that they do not cause unexpected results.
Now to answer your questions more specifically:
They may already allow some of the characters you are thinking of, but these characters (&, ?, for example) are given special meaning to function in a certain way. Therefore, they cannot be used in a different context. From the question you linked, it also looks like the space character is not supported because of the problems its use would introduce.
Decoding recovers the string as it was before it was encoded, so that the application can manipulate it or pass it to other functions.
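As a rough illustration of the encode/decode round trip, here is a minimal sketch using Python's standard urllib.parse module. Python is used only for illustration, and the sample string is made up.

```python
from urllib.parse import quote, unquote

# A made-up value containing characters that have special meaning in URLs.
original = "50% off? yes & no"

# safe="" asks quote() to percent-encode every reserved character, including "/".
encoded = quote(original, safe="")
print(encoded)   # 50%25%20off%3F%20yes%20%26%20no

# Decoding recovers the string exactly as it was before encoding.
decoded = unquote(encoded)
print(decoded == original)   # True
```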

Related

Does URL encoding guarantee that all output characters are printable (visible)?

Does URL encoding guarantee that all encoded characters (after the encoding process) are printable (visible), within its specification and scope? "Printable" here is defined as "visible on paper". Unfortunately, I could not find any documents mentioning anything similar online.
URL encoding uses a very limited set of characters (a printable subset of 7-bit ASCII), hence its output is always printable.
All 8-bit codes, plus all of these: !"# $%&' ()*+ ,/:; <=>? #[\] ^``{| }~ are turned into something else.
Importantly, though confusingly: under the form-encoding convention (application/x-www-form-urlencoded), a single space is turned into +.
The goal of the encoding is to avoid parsing problems in URLs:
HTTP://example.com/blah.php?my_url=example.com?confusion reighn&x=(a+b)
The stuff after my_url= should have been encoded.
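A minimal sketch of how the value after my_url= could be encoded, assuming Python's urllib.parse (any language's URL-encoding routine would do the same job). It also shows the space-to-+ convention mentioned above.

```python
from urllib.parse import quote, quote_plus

# The value intended for my_url in the example above (typo and all).
value = "example.com?confusion reighn&x=(a+b)"

# quote() with safe="" percent-encodes everything except letters, digits and "-._~".
encoded = quote(value, safe="")
url = "http://example.com/blah.php?my_url=" + encoded
print(url)

# quote_plus() follows the HTML form convention: a space becomes "+",
# so a literal "+" has to become %2B to stay distinguishable.
print(quote_plus("a b"))   # a+b
print(quote_plus("a+b"))   # a%2Bb
```

After encoding, the only ? and = left in the URL are the ones that actually delimit the query string.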

Is it valid that Wikipedia uses Chinese characters (and other Unicode characters) in URLs?

On Wikipedia you see URLs like these:
https://zh.wiktionary.org/wiki/附录:字母索引 (but copy-pasting the URL results in the equivalent https://zh.wiktionary.org/wiki/%E9%99%84%E5%BD%95:%E5%AD%97%E6%AF%8D%E7%B4%A2%E5%BC%95).
https://th.wiktionary.org/wiki/หน้าหลัก (which when copy-pasted becomes
https://th.wiktionary.org/wiki/%E0%B8%AB%E0%B8%99%E0%B9%89%E0%B8%B2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%81)
First, I'm wondering what is happening here, what the encoding transformation is called and what it's doing and why it's doing that. I don't see why you can't just have the original native characters in the URL.
Second, I'm wondering if what Wikipedia is doing is considered valid. If it is okay to include these non-ASCII glyphs in the URL, and if not, why not (other than perhaps because the standard says so). Also would be interested to know how many browsers support showing the link in the URL bar using the native glyphs vs. this encoded thing, and even would be interesting to know how native Chinese/Thai/etc. people enter in the URL in their language, if they use the encoding or what (but that probably makes this question too complicated; still would be an interesting bonus).
The reason I ask is because I would like to put let's say words/definitions of a few different languages onto a webpage, and I would like to make the url show the actual word used in the language. So in english it might be /hello, but the equivalent word/definition in Thai would be /สวัสดี. That makes way more sense to me than having to make it into the encoding thing.
From https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
Strings of data octets within a URI are represented as characters. Permitted characters within a URI are the ASCII characters for the lowercase and uppercase letters of the modern English alphabet, the Arabic numerals, hyphen, period, underscore, and tilde. Octets represented by any other character must be percent-encoded.
Not all Unicode characters can be used in URIs. Characters that aren't supported can still be encoded using percent-encoding. You can see the non-ASCII characters in the URL field because your browser chooses to display them that way; the actual HTTP requests are made using the encoded strings.
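The transformation seen when copy-pasting the Wikipedia URL can be reproduced with a short sketch, here assuming Python's urllib.parse: the text is converted to UTF-8 bytes and each byte is written as %XX.

```python
from urllib.parse import quote, unquote

title = "附录:字母索引"

# Keep ":" literal, as in the copied Wikipedia URL; every non-ASCII character
# is converted to its UTF-8 bytes and each byte is written as %XX.
encoded = quote(title, safe=":")
print(encoded)
# %E9%99%84%E5%BD%95:%E5%AD%97%E6%AF%8D%E7%B4%A2%E5%BC%95

# Decoding gives back the readable form that the browser displays.
print(unquote(encoded))   # 附录:字母索引
```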

Can urls have UTF-8 characters?

I was curious if I should encode urls with ASCII or UTF-8. I was under the belief that urls cannot have non-ASCII characters, but someone told me they can have UTF-8, and I searched around and couldn't quite find which one is true. Does anyone know?
There are two parts to this, but they both amount to "yes".
With IDNA, it is possible to register domain names using the full Unicode repertoire (with a few minor twists to prevent ambiguities and abuse).
The path part is not strictly regulated, but it's possible to encode arbitrary strings in the path. The browser could opt to display a human-readable rendering rather than an encoded path. However, this requires heuristics, as there is no way to specify the character set and encoding of the path.
So, http://xn--msic-0ra.example/mot%C3%B6rhead is a (fictional example, not entirely correct) computer-readable encoded URL which could be displayed to the user as http://müsic.example/motörhead. The domain name is encoded as xn--msic-0ra.example in something called Punycode, and the path contains the label "motörhead" encoded as UTF-8 and URL encoded (the Unicode code point U+00F6 is represented with the two bytes 0xC3 0xB6 in UTF-8).
The path could also be mot%F6rhead which is the same label in Latin-1. In this case, deducing a reasonable human-readable representation would be much harder, but perhaps the context of the surrounding characters could offer enough hints for a good guess.
In isolation, %F6 could be pretty much anything, and %C3%B6 could be e.g. UTF-16.
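The ambiguity can be seen in a small sketch, assuming Python's urllib.parse.unquote, which takes the character encoding as a parameter precisely because the URL itself does not carry that information.

```python
from urllib.parse import unquote

utf8_path = "mot%C3%B6rhead"
latin1_path = "mot%F6rhead"

# Decoding with the encoding that was actually used gives the intended label.
print(unquote(utf8_path))                         # motörhead (UTF-8 is the default)
print(unquote(latin1_path, encoding="latin-1"))   # motörhead

# Guessing wrong gives garbage rather than an error.
print(unquote(utf8_path, encoding="latin-1"))     # motÃ¶rhead
print(unquote(latin1_path))                       # mot, U+FFFD replacement character, rhead
```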

Prevent encoding URI twice

I am trying to write a function to encode URIs in order to make them compliant with RFC 3986.
I.e. checking that every character other than alphanum; /?:#&=+$-_.!~*'()|\^[]``# gets replaced by %[hex octet]
I want to be sure that if the function gets called with an already encoded URI, the code won't ruin it.
So far all I am doing is looking for a '%' sign followed by two hex digits. Any other reserved character I find I replace.
Is there any other check I should be doing?
Don't mind security issues; they are being handled somewhere else.
I think that properly-encoded URIs should always pass through cleanly the second time.
The reason being that you have to correctly parse a URI no matter what, because it's entirely legal to have characters such as / # . : ? & = in a URI, provided they appear in the right places.
So you only encode a character if it is not legal in that part of the URI. With that assertion, you then create an encoded string that IS legal at every position, so when you parse it, there is nothing left to encode.
Bear in mind that if someone throws a URI at you to be encoded and it happens to be ambiguous (i.e. it contains special characters that alter the URI syntax), they cannot expect a correct result.
To answer your question more directly, I would say yes: in light of all the above, you only need to have special treatment for the % escape sequences.
Um, how do you know that an already encoded URI should not be encoded once again? Maybe the URI contains, I don't know, an example of how to encode URIs; if it does not get encoded a second time, decoding will break it.
That said, you can check whether only allowed characters plus % are present, and whether every % is followed by a hex number. If yes, there is a good chance (but no guarantee) that the encoding has already been done.
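One possible sketch of the approach discussed in these answers, assuming Python and its urllib.parse.quote. The helper names encode_once and looks_encoded are hypothetical, and the set of characters left untouched is an assumption (the RFC 3986 reserved characters plus %). As noted above, this is a heuristic: an input that already contains something like %20 as literal text cannot be told apart from an encoded space.

```python
import re
from urllib.parse import quote

# Assumption: leave RFC 3986 reserved characters and "%" alone.
# quote() never encodes the unreserved characters (letters, digits, "-._~").
SAFE = ":/?#[]@!$&'()*+,;=%"

# A "%" that is NOT followed by two hex digits is not a valid escape.
_BARE_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

# Only allowed characters and well-formed %XX escapes.
_ENCODED = re.compile(r"(?:[A-Za-z0-9._~:/?#\[\]@!$&'()*+,;=-]|%[0-9A-Fa-f]{2})*")


def encode_once(uri: str) -> str:
    """Percent-encode a URI while leaving existing %XX escapes untouched."""
    uri = _BARE_PERCENT.sub("%25", uri)   # escape stray "%" signs first
    return quote(uri, safe=SAFE)          # then encode everything not allowed


def looks_encoded(uri: str) -> bool:
    """Good chance (no guarantee) that the URI has already been encoded."""
    return _ENCODED.fullmatch(uri) is not None


once = encode_once("http://example.com/a b?x=1&y=ü")
print(once)                        # http://example.com/a%20b?x=1&y=%C3%BC
print(encode_once(once) == once)   # True: a second pass changes nothing
print(looks_encoded(once))         # True
```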

Why do you need to encode URLs?

Why do you need to encode urls? Is there a good reason why you have to change every space in the GET data to %20?
Because some characters have special meanings.
For instance, in a query string, the ampersand (&) is used as a separator between key-value pairs. If you were to put an ampersand into one of those values, it would look like the separator between the end of a value and the beginning of the next key. So for special characters like this, we use percent encoding so that we can be sure that the data is unambiguously encoded.
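The ampersand problem can be shown with a minimal sketch, assuming Python's urllib.parse; the parameter names and values are made up.

```python
from urllib.parse import urlencode, parse_qs

params = {"q": "rock & roll", "page": "2"}

# Naive concatenation: the "&" inside the value looks like a separator,
# so the parsed value of q is cut off at the ampersand.
naive = "q=rock & roll&page=2"
print(parse_qs(naive)["q"])

# urlencode() percent-encodes each value, so "&" only appears as a separator.
query = urlencode(params)
print(query)             # q=rock+%26+roll&page=2
print(parse_qs(query))   # {'q': ['rock & roll'], 'page': ['2']}
```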
From RFC 2396, section 2.4.3:
The space character is excluded because significant spaces may disappear and insignificant spaces may be introduced when URI are transcribed or typeset or subjected to the treatment of word-processing programs. Whitespace is also used to delimit URI in many contexts.
Originally, older browsers could get confused by the spaces (not really an issue anymore).
Now, if someone copies the URL to send as a link, the space can break the hyperlink, e.g.:
Hey! Check out this derping cat playing a piano!
http://www.mysite.com/?video=funny cat plays piano.
See how the link breaks?
Now look at this:
http://www.mysite.com/?video=funny%20cat%20plays%20piano.
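Why the first link breaks can be sketched in Python. The first_link function below is a hypothetical, deliberately naive link detector that treats whitespace as the end of a URL, which is roughly what many chat and mail programs do.

```python
from urllib.parse import quote


def first_link(message: str) -> str:
    """Hypothetical, naive link detector: a link ends at the first whitespace."""
    return next(word for word in message.split() if word.startswith("http"))


base = "http://www.mysite.com/?video="
title = "funny cat plays piano"

raw = "Check this out: " + base + title
encoded = "Check this out: " + base + quote(title)

print(first_link(raw))       # http://www.mysite.com/?video=funny
print(first_link(encoded))   # http://www.mysite.com/?video=funny%20cat%20plays%20piano
```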
Let's break down your question.
Why do you need to encode URL?
A URL is composed of only a limited number of characters and those are digits(0-9), letters(A-Z, a-z), and a few special characters("-", ".", "_", "~").
So does it mean that we cannot use any other character?
The answer to this question is "YES". But wait a minute, there is a hack, and the hack is URL encoding, also called percent-encoding. So if you want to transmit any character which is not a member of the above mentioned (digits, letters, and special chars), then you need to encode it. And that is why we need to encode "space" as "%20".
OK? Is this enough for URL encoding? No, this is not enough; there's a lot more to URL encoding, but I'm not gonna make this a big, boring technical answer. If you want to know more, you can read about it here: https://www.urlencoder.io/learn/ (Credit goes to this writer)
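The rule described above (letters, digits, and "-", ".", "_", "~" pass through; everything else is encoded) can be sketched by hand in a few lines. This is an illustrative sketch in Python, assuming UTF-8 as the byte encoding; percent_encode is a hypothetical helper, and in practice a library routine such as urllib.parse.quote does the same job.

```python
import string
from urllib.parse import quote

# The characters that never need encoding, as listed above.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def percent_encode(text: str) -> str:
    """Encode text as UTF-8 and write every non-unreserved byte as %XX."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        out.append(ch if ch in UNRESERVED else f"%{byte:02X}")
    return "".join(out)


print(percent_encode("hello world"))   # hello%20world
print(percent_encode("A-z_0.9~"))      # A-z_0.9~ (unchanged)
print(percent_encode("ö"))             # %C3%B6

# Same result as the standard library when nothing extra is marked safe.
print(percent_encode("สวัสดี") == quote("สวัสดี", safe=""))   # True
```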
Well, you do so in order that every browser knows how the string that makes up the URL is encoded. Converting the space to %20, etc. makes that URL/URI portable. The original text could be Latin-1, it could be Unicode. It needs to be normalized to something that is understood universally. Take a look at RFC 3986: https://www.rfc-editor.org/rfc/rfc3986#section-2.1

Resources