When I submit a form using the GET method, the + character that I typed in the text field is changed to %2B in the URL. Why does the URL do this? Other characters like * and % are changed as well.
I also wonder whether this is done for security or for other reasons. What are they?
Check out what W3Schools says about URL encoding. I think it will help you out.
http://www.w3schools.com/tags/ref_urlencode.asp
Here is an excerpt:
URLs can only be sent over the Internet using the ASCII character-set.
Since URLs often contain characters outside the ASCII set, the URL has
to be converted into a valid ASCII format.
URL encoding replaces unsafe ASCII characters with a "%" followed by
two hexadecimal digits. URLs cannot contain spaces. URL encoding
normally replaces a space with a plus (+) sign or with %20.
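To see both conventions concretely, here is a minimal sketch in Python (the sample string is made up). Note how a literal + must itself be encoded as %2B, which is exactly what the form submission in the question was doing:

    from urllib.parse import quote, quote_plus

    text = "a b+c"

    print(quote(text))       # a%20b%2Bc  (spaces become %20)
    print(quote_plus(text))  # a+b%2Bc    (form convention: spaces become "+")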
On Wikipedia you see URLs like these:
https://zh.wiktionary.org/wiki/附录:字母索引 (but copy-pasting the URL results in the equivalent https://zh.wiktionary.org/wiki/%E9%99%84%E5%BD%95:%E5%AD%97%E6%AF%8D%E7%B4%A2%E5%BC%95).
https://th.wiktionary.org/wiki/หน้าหลัก (which when copy-pasted becomes
https://th.wiktionary.org/wiki/%E0%B8%AB%E0%B8%99%E0%B9%89%E0%B8%B2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%81)
First, I'm wondering what is happening here: what this encoding transformation is called, what it does, and why it's done. I don't see why you can't just have the original native characters in the URL.
Second, I'm wondering whether what Wikipedia is doing is considered valid: is it okay to include these non-ASCII glyphs in the URL, and if not, why not (other than perhaps because the standard says so)? I'd also be interested to know how many browsers support showing the link in the URL bar using the native glyphs vs. the encoded form, and even how native Chinese/Thai/etc. speakers enter a URL in their language, whether they use the encoding or something else (but that probably makes this question too complicated; it would still be an interesting bonus).
The reason I ask is that I would like to put, let's say, words/definitions from a few different languages onto a webpage, and I would like the URL to show the actual word used in the language. So in English it might be /hello, but the equivalent word/definition in Thai would be /สวัสดี. That makes far more sense to me than having to use the encoded form.
From https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
Strings of data octets within a URI are represented as characters. Permitted characters within a URI are the ASCII characters for the lowercase and uppercase letters of the modern English alphabet, the Arabic numerals, hyphen, period, underscore, and tilde.[14] Octets represented by any other character must be percent-encoded.
Not all Unicode characters can be used in URIs. Characters that aren't supported can still be represented using percent-encoding. You see the non-ASCII characters in the URL field because your browser chooses to display them that way; the actual HTTP requests are made using the encoded strings.
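As a rough sketch of what the browser does under the hood, here is the same transformation in Python, using the Thai URL from the question:

    from urllib.parse import quote, unquote

    path = "/wiki/หน้าหลัก"

    # Each character is converted to its UTF-8 bytes, and each byte
    # becomes %XX -- this is percent-encoding.
    encoded = quote(path)
    print(encoded)
    # /wiki/%E0%B8%AB%E0%B8%99%E0%B9%89%E0%B8%B2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%81

    # The browser reverses it for display in the URL bar.
    print(unquote(encoded))  # /wiki/หน้าหลัก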
I have noticed that Google does not encode all special characters in the query part of the URL. For example:
Placing this string in Google's search: !@#$%^&*()
Yields this URL: https://www.google.com/#q=!%40%23%24%25^%26*()
Notice that the !, ^, *, (, and ) are not encoded.
Some of the characters such as : or < are considered unsafe or reserved, yet Google doesn't encode them.
Can someone explain why Google does this, and if they have a reference document as to exactly what characters get encoded and which don't?
Thanks for any help!
As documented here:
Some characters are not safe to use in a URL without first being
encoded. Because a Google search request is made by using an HTTP URL,
the search request must follow URL conventions, including character
encoding, where necessary.
The HTTP URL syntax defines that only alphanumeric characters, the
special characters $-_.+!*'(), and the reserved characters ;/?:#=& can
be used as values within an HTTP URL request. Since reserved
characters are used by the search engine to decode the URL, and some
special characters are used to request search features, then all
non-alphanumeric characters used as a value to an input parameter must
be URL-encoded.
To URL-encode a string:
1. Replace space characters with a "+" character.
2. Replace each non-alphanumeric character with its hexadecimal ASCII value, in the format of a "%" character followed by two hexadecimal digits. (Such an ASCII value may be referred to as an escape code.)
Some input parameters require that the values passed to Google search are double-URL-encoded. This requirement means that you must apply the URL encoding to the string twice in succession to generate the final value.
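Here is a hedged sketch of that procedure in Python; quote_plus follows the two rules above, and applying it twice gives the double encoding (the sample value is made up):

    from urllib.parse import quote_plus

    value = "C++ & C#"  # made-up sample value

    # First pass: spaces -> "+", other unsafe characters -> %XX.
    once = quote_plus(value)
    print(once)   # C%2B%2B+%26+C%23

    # Second pass (double encoding): the "%" signs from the first
    # pass are themselves encoded as %25.
    twice = quote_plus(once)
    print(twice)  # C%252B%252B%2B%2526%2BC%2523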
While creating a course, I sent a character in the course name which, when I looked it up in Desire2Learn, had been turned into a "?". The character sent was a dash: not the one you can enter directly from the keyboard (-), but a slightly longer one (as in "Course – name") that I picked up while copying the name from Word. This leads to the question of which characters are supported in a D2L course name, so that we can send a similar character set. Also, what will happen if the characters are for a Chinese or Polish course name?
[Updated 15/12] When I send Japanese characters in place of the username, they are also shown as a series of "?".
thanks
Resolved: I was using ASCII encoding to convert the data into bytes before submitting the POST. Changing to UTF-8 encoding resolved the issue.
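The poster's code isn't shown (and may have been in another language), but the same symptom is easy to reproduce in Python as a sketch:

    # The course name with an en dash (U+2013), the "longer dash"
    # that Word substitutes for a typed hyphen.
    name = "Course – name"

    # ASCII cannot represent the dash, so it becomes "?" -- the
    # symptom seen in D2L.
    print(name.encode("ascii", errors="replace"))  # b'Course ? name'

    # UTF-8 represents it (and Chinese, Polish, or Japanese text) fine.
    print(name.encode("utf-8"))                    # b'Course \xe2\x80\x93 name'
    print(name.encode("utf-8").decode("utf-8"))    # Course – name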
I am trying to write a function to encode URIs in order to make them compliant with rfc 3986.
That is, checking that every character other than alphanumerics and /?:#&=+$-_.!~*'()|\^[]`# gets replaced by %[hex octet].
I want to be sure that if the function gets called with an already encoded URI, the code won't ruin it.
So far, all I am doing is looking for a '%' sign followed by two hex characters; any other reserved character I find, I replace.
Is there any other check I should be doing?
Don't mind security issues; they are being handled somewhere else.
I think that properly-encoded URIs should always pass through cleanly the second time.
The reason being that you have to correctly parse a URI no matter what, because it's entirely legal to have characters such as / # . : ? & = in a URI, provided they appear in the right places.
So you only encode a character if it is not legal in that part of the URI. With that assertion, you then create an encoded string that IS legal at every position, so when you parse it, there is nothing left to encode.
Bear in mind that if someone throws a URI at you to be encoded and it happens to be ambiguous (ie it contains special characters that alter the URI syntax), they cannot expect a correct result.
To answer your question more directly, I would say yes: in light of all the above, you only need to have special treatment for the % escape sequences.
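Here is a minimal Python sketch of that idea; the allowed character set is illustrative, not a complete RFC 3986 implementation:

    import re

    ALLOWED = re.compile(r"[A-Za-z0-9/?:@&=+$\-_.!~*'()#]")
    HEX_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")

    def encode_uri(uri: str) -> str:
        out = []
        i = 0
        while i < len(uri):
            # Pass through an existing %XX escape unchanged.
            m = HEX_ESCAPE.match(uri, i)
            if m:
                out.append(m.group())
                i = m.end()
            elif ALLOWED.match(uri[i]):
                out.append(uri[i])
                i += 1
            else:
                # Encode the character's UTF-8 bytes as %XX escapes.
                out.extend("%%%02X" % b for b in uri[i].encode("utf-8"))
                i += 1
        return "".join(out)

    u = encode_uri("/wiki/funny cat")
    print(u)                   # /wiki/funny%20cat
    print(encode_uri(u) == u)  # True: a second pass changes nothing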
Um, how do you know that an already-encoded URI should not be encoded once again? Maybe the URI contains, I don't know, an example of how to encode URIs, and if it does not get encoded a second time, then decoding will break it?
That said, you can check whether only allowed characters plus % are present, and whether every % is followed by two hex digits. If yes, there is a good chance (but no guarantee) that the encoding has already been done.
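A quick sketch of that check in Python (the character class is illustrative, not exhaustive):

    import re

    # Allowed characters plus %XX escapes.
    LOOKS_ENCODED = re.compile(
        r"^(?:[A-Za-z0-9/?:@&=+$\-_.!~*'()#]|%[0-9A-Fa-f]{2})*$")

    print(bool(LOOKS_ENCODED.match("/wiki/funny%20cat")))  # True
    print(bool(LOOKS_ENCODED.match("/wiki/funny cat")))    # False (raw space)
    print(bool(LOOKS_ENCODED.match("100%")))               # False (bare %)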
Why do you need to encode URLs? Is there a good reason why you have to change every space in the GET data to %20?
Because some characters have special meanings.
For instance, in a query string, the ampersand (&) is used as a separator between key-value pairs. If you were to put an ampersand into one of those values, it would look like the separator between the end of a value and the beginning of the next key. So for special characters like this, we use percent encoding so that we can be sure that the data is unambiguously encoded.
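A small Python sketch of the ambiguity (the parameter names and values are made up):

    from urllib.parse import urlencode, parse_qs

    params = {"band": "Simon & Garfunkel", "year": "1970"}

    # Naive joining: the "&" in the value corrupts the structure.
    naive = "&".join(f"{k}={v}" for k, v in params.items())
    print(naive)            # band=Simon & Garfunkel&year=1970
    print(parse_qs(naive))  # {'band': ['Simon '], 'year': ['1970']} -- data lost

    # Percent-encoding keeps the value unambiguous.
    good = urlencode(params)
    print(good)             # band=Simon+%26+Garfunkel&year=1970
    print(parse_qs(good))   # {'band': ['Simon & Garfunkel'], 'year': ['1970']}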
From RFC 2396, section 2.4.3:
The space character is excluded because significant spaces may disappear and insignificant spaces may be introduced when URI are transcribed or typeset or subjected to the treatment of word-processing programs. Whitespace is also used to delimit URI in many contexts.
Originally, older browsers could get confused by the spaces (not really an issue anymore).
Now, if someone copies the URL to send as a link, the space can break the hyperlink, i.e.:
Hey! Check out this derping cat playing a piano!
http://www.mysite.com/?video=funny cat plays piano.
See how the link breaks?
Now look at this:
http://www.mysite.com/?video=funny%20cat%20plays%20piano.
Let's break down your question.
Why do you need to encode a URL?
A URL is composed of only a limited set of characters: digits (0-9), letters (A-Z, a-z), and a few special characters ("-", ".", "_", "~").
So does it mean that we cannot use any other character?
The answer to this question is "YES". But wait a minute, there is a hack, and the hack is URL encoding, or percent-encoding. So if you want to transmit any character which is not a member of the above-mentioned set (digits, letters, and special chars), then we need to encode it. And that is why we need to encode "space" as "%20".
OK, so is this enough about URL encoding? No, there's a lot more to it, but I'm not going to turn this into a big, boring technical answer. If you want to know more, you can read about it here: https://www.urlencoder.io/learn/ (credit goes to this writer).
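A quick illustration in Python (a sketch; quote's always-safe set happens to match the unreserved characters listed above):

    from urllib.parse import quote

    print(quote("AZaz09-._~", safe=""))  # AZaz09-._~     (unchanged)
    print(quote("a b&c/d", safe=""))     # a%20b%26c%2Fd  (everything else -> %XX)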
Well, you do so because every browser knows how the string that makes up the URL is encoded. Converting the space to %20, etc., makes the URL/URI portable. It could be Latin-1, it could be Unicode; it needs to be normalized to something that is understood universally. Take a look at RFC 3986: https://www.rfc-editor.org/rfc/rfc3986#section-2.1