Why do you need to encode urls? Is there a good reason why you have to change every space in the GET data to %20?
Because some characters have special meanings.
For instance, in a query string, the ampersand (&) is used as a separator between key-value pairs. If you were to put an ampersand into one of those values, it would look like the separator between the end of a value and the beginning of the next key. So for special characters like this, we use percent encoding so that we can be sure that the data is unambiguously encoded.
From RFC 2936, section 2.4.3:
The space character is excluded
because significant spaces may
disappear and insignificant spaces may
be introduced when URI are transcribed
or typeset or subjected to the
treatment of word- processing
programs. Whitespace is also used to
delimit URI in many contexts.
originally older browsers could get confused by the spaces (not really an issue anymore).
now, if someone copies the url to send as a link - the space can break the hyperlink - ie
Hey! Check out this derping cat playing a piano!
http://www.mysite.com/?video=funny cat plays piano.
See how the link breaks?
Now look at this:
http://www.mysite.com/?video=funny%20cat%20plays%20piano.
Let's break down your question.
Why do you need to encode URL?
A URL is composed of only a limited number of characters and those are digits(0-9), letters(A-Z, a-z), and a few special characters("-", ".", "_", "~").
So does it mean that we cannot use any other character?
The answer to this question is "YES". But wait a minute, there is a hack and the hack is URL Encoding or Perchantage Encoding. So if you want to transmit any character which is not a member of the above mentioned (digits, letters, and special chars), then we need to encode them. And that is why we need to encode "space" as "%20".
OK? Is this enough for URL encoding? No this is not enough, there's a lot about URL encoding but here, I'm not gonna make it a pretty big, boring technical answer. But If you want to know more, then you can read it from here: https://www.urlencoder.io/learn/ (Credit goes to this writer)
Well, you do so because every different browsers knows how the string that makes up the URL is encoded. converting the space to %20, etc makes that URL/URI portable. It could be latin-1 it could be unicode. It needs normalized to something that is understood universally. Take a look at rfc3986 https://www.rfc-editor.org/rfc/rfc3986#section-2.1
Related
On Wikipedia you see URLs like these:
https://zh.wiktionary.org/wiki/附录:字母索引 (but copy-pasting the URL results in the equivalent https://zh.wiktionary.org/wiki/%E9%99%84%E5%BD%95:%E5%AD%97%E6%AF%8D%E7%B4%A2%E5%BC%95).
https://th.wiktionary.org/wiki/หน้าหลัก (which when copy-pasted becomes
https://th.wiktionary.org/wiki/%E0%B8%AB%E0%B8%99%E0%B9%89%E0%B8%B2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%81)
First, I'm wondering what is happening here, what the encoding transformation is called and what it's doing and why it's doing that. I don't see why you can't just have the original native characters in the URL.
Second, I'm wondering if what Wikipedia is doing is considered valid. If it is okay to include these non-ASCII glyphs in the URL, and if not, why not (other than perhaps because the standard says so). Also would be interested to know how many browsers support showing the link in the URL bar using the native glyphs vs. this encoded thing, and even would be interesting to know how native Chinese/Thai/etc. people enter in the URL in their language, if they use the encoding or what (but that probably makes this question too complicated; still would be an interesting bonus).
The reason I ask is because I would like to put let's say words/definitions of a few different languages onto a webpage, and I would like to make the url show the actual word used in the language. So in english it might be /hello, but the equivalent word/definition in Thai would be /สวัสดี. That makes way more sense to me than having to make it into the encoding thing.
From https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
Strings of data octets within a URI are represented as characters. *Permitted characters within a URI are the ASCII characters for the lowercase and uppercase letters of the modern English alphabet, the Arabic numerals, hyphen, period, underscore, and tilde.[14] Octets represented by any other character must be percent-encoded.
Not all Unicode characters can be used in URIs. Characters that aren't supported can still be encoded using Percent Encoding. You can see the non-ascii characters in the URL field because your browser chooses to display them that way, the actual HTTP requests are done using the encoded strings.
I was curious if I should encode urls with ASCII or UTF-8. I was under the belief that urls cannot have non-ASCII characters, but someone told me they can have UTF-8, and I searched around and couldn't quite find which one is true. Does anyone know?
There are two parts to this, but they both amount to "yes".
With IDNA, it is possible to register domain names using the full Unicode repertoire (with a few minor twists to prevent ambiguities and abuse).
The path part is not strictly regulated, but it's possible to encode arbitrary strings in the path. The browser could opt to display a human-readable rendering rather than an encoded path. However, this requires heuristics, as there is no way to specify the character set and encoding of the path.
So, http://xn--msic-0ra.example/mot%C3%B6rhead is a (fictional example, not entirely correct) computer-readable encoded URL which could be displayed to the user as http://müsic.example/motörhead. The domain name is encoded as xn--msic-0ra.example in something called Punycode, and the path contains the label "motörhead" encoded as UTF-8 and URL encoded (the Unicode code point U+00F6 is reprecented with the two bytes 0xC3 0xB6 in UTF-8).
The path could also be mot%F6rhead which is the same label in Latin-1. In this case, deducing a reasonable human-readable representation would be much harder, but perhaps the context of the surrounding characters could offer enough hints for a good guess.
In isolation, %F6 could be pretty much anything, and %C3%B6 could be e.g. UTF-16.
I need a character to separate two or more URIs in one string. Later I will the split the string to get each URI separately.
The problem is I'm not sure what character to pick here. Is there a good character to choose here that definitely can't be part of a URI itself? Or is ultimately pretty much all characters allowed in a URI?
I know certain characters are illegal in certain parts of the URI, but I'm talking about a URI as a whole, like this:
scheme://username:password#domain.tld/path/to/file.ext?key=value#blah
I'm thinking maybe space, although technically I suppose that could be part of the password, or would it be escaped as %20 in that case?
Any of the control characters should be good for this, such as TAB, FF and so on.
RFC3986 (a) controls the URI specification and Appendix A of that RFC states that the characters are limited to:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
0123456789-._~:/?#[]#!$&'()*+,;=
(and the % encoding character, of course, for all other characters not listed above).
So, basically, any other character should be okay as a delimiter.
(a) This has actually been augmented by RFC6874 which has to do with changes to the IPv6 part of the URI, adding a zone identifier. Since the zone ID consists of % and "unreserved" characters already included above, it doesn't change the set of characters allowed.
I am trying to write a function to encode URIs in order to make them compliant with rfc 3986.
I.e. checking that every character other than alphanum; /?:#&=+$-_.!~*'()|\^[]``# gets replaced by %[hex octet]
I want to be sure that if the function gets called with an already encoded URI, the code won't ruin it.
So far all I am doing is looking for a '%' sign followed by 2 octect characters. Any other reserved character I find I replace.
Is there any other check I should be doing?
Don't mind security issues; they are being handled somewhere else.
I think that properly-encoded URIs should always pass through cleanly the second time.
The reason being that you have to correctly parse a URI no matter what, because it's entirely legal to have characters such as / # . : ? & = in a URI, provided they appear in the right places.
So you only encode a character if it is not legal in that part of the URI. With that assertion, you then create an encoded string that IS legal at every position, so when you parse it, there is nothing left to encode.
Bear in mind that if someone throws a URI at you to be encoded and it happens to be ambiguous (ie it contains special characters that alter the URI syntax), they cannot expect a correct result.
To answer your question more directly, I would say yes: in light of all the above, you only need to have special treatment for the % escape sequences.
Um, how do you know that an already encoded URI should not be encoded once again? Maybe the URI contains, I don't know, example how to encode URIs, and if will not get encoded a second time, then the decoding will break it?
That said, you can check whether only allowed characters plus % are present, and whether every % is followed by a hex number. If yes, there is a good chance (but no guarantee) that the encoding has already been done.
I have a long URL with several values.
Example 1:
http://www.domain.com/list?seach_type[]=0&search_period[]=1&search_min=3000&search_max=21000&search_area=6855%3B7470%3B7700%3B7730%3B7741%3B7742%3B7752%3B7755%3B7760%3B7770%3B7800%3B7840%3B7850%3B7860%3B7870%3B7884%3B7900%3B7950%3B7960%3B7970%3B7980%3B7990%3B8620%3B8643%3B8800%3B8830%3B8831%3B8832%3B8840%3B8850%3B8860%3B8881%3B9620%3B9631%3B9632
My variable search area contains only 4 number digits (example 4000, 5000), but can contain a lot of them. Right now I seperate these in the URL by using ; as separator symbol. Though as seen in Example 1, the ; is converted into %3B. This makes me believe that this is a bad symbol to use.
What is the best URL separator?
Moontear, I think you have misread the linked document. That limitation only applies to the "scheme" portion of the URL. In the case of WWW URLs, that is the "http".
The next section of the document goes on to say:
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
I'd personally use comma (,). However, plus (+) and dash (-) are both reasonable options as well.
BTW, that document also mentions that semi-colon (;) is reserved in some schemes.
Well, according to RFC1738, valid URLs may only contain the letters a-z, the plus sign (+), period and hyphen (-).
Generally I would go with a plus to separate your search areas. So your URL would become http://www.domain.com/list?seach_type=0&search_period=1&search_min=3000&search_max=21000&search_area=6855+7470+7700+...
--EDIT--
As GinoA pointed out I misread the document. Hence "$-_.+!*'()," are valid characters too. I'd still go with the + sign though.
If there are only numbers to separate, you have a large choice of separators. You can choose any letter for example.
Probably a space can be a good choice. It will be transformed into + character in the URL, so will be more readable than a letter.
Example: search_area=4000+5000+6000
I'm very late to the party, but a valid query string can repeat variables so instead of...
http://x.y.z/list?type=0&period=1&min=3000&max=21000&area=6855+7470+7700
...you could also use...
http://x.y.z/list?type=0&period=1&min=3000&max=21000&area=6855&area=7470&area=7700
"+" is to be interpreted as a space " " when the content-type is application/x-www-form-urlencoded (standard for HTML forms). This may be handled by your server software.
I prefer "!". It doesn't get URL encoded (at least not in Chrome) and it reserves "+" for use as a real space character in the typical case.