What’s the difference between an URL Encode and a HTML Encode?
HTML Encoding escapes special characters in strings used in HTML documents to prevent confusion with HTML elements like changing
"<hello>world</hello>"
to
"<hello>world</hello>"
URL Encoding does a similar thing for string values in a URL like changing
"hello+world = hello world"
to
"hello%2Bworld+%3D+hello+world"
urlEncode replaces special characters with characters that can be understood by web browsers/web servers for the purpose of addressing... hence URL. For instance, spaces are replaced with %20, ' = %27 etc...
See these references:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
http://www.degraeve.com/reference/urlencoding.php
HtmlEncode replaces special characters with character strings that are recognised by the HTML engine itself to render the content of the page - things like & become & or < = <, > = > this prevents the HTML engine from interpreting these characters as parts of the HTML markup and therefore render them as if they were strings.
See this reference:
http://msdn.microsoft.com/en-us/library/ms525347.aspx
Both HTML and URL's are essentially very constrained languages. As a language they add meaning to specific keywords or operators. For both of these languages though, keywords are almost always single characters. For example
HTML: > and <
URL: / and :
In the use of each language though it is possible to use these constructs in a manner that does not ensure the meaning of the language. For instance this post contains a > character. I do not want it to be interpreted as HTML, just text.
This is where Encode and Decode methods come into play. These methods will respectively take a string and convert any of the characters that would otherwise be treated as keywords into an escaped form which will not be interpreted as part of the language.
For instance: Passing > into HtmlEncode will return >
HTMLEncode and URLEncode deal with invalid characters in HTML and URLs, or more accurately, characters that need to be specially written to be interpreted correctly. For example, in HTML the < and > characters are used to indicate tags. Thus, if you wanted to write a math formula, something like 1+1 < 2+2, the '<' would normally be interpreted as the beginning of a tag. HTMLEncoding turns this character into "<" which is the encoded representation of the less-than sign. URLEncoding does the same, but for URLs, for which the special characters are different, although there is some overlap.
I don't know what language you are working in, but the PHP manual for example provides good explanations.
URLEncode
Returns a string in which all
non-alphanumeric characters except -_.
have been replaced with a percent (%)
sign followed by two hex digits and
spaces encoded as plus (+) signs. It
is encoded the same way that the
posted data from a WWW form is
encoded, that is the same way as in
application/x-www-form-urlencoded
media type. This differs from the »
RFC 1738 encoding (see rawurlencode())
in that for historical reasons, spaces
are encoded as plus (+) signs.
Read on
Related
(Context: I'm writing an HTML sanitiser, and want to normalize URLs as a defence-in-depth measure, making it impossible to use abnormally escaped URLs to bypass downstream blacklists (I'm not relying on blacklists myself) or mislead users.)
When given a URL, in what contexts can a character be changed to its percent-encoded version, or vice versa, without changing the meaning of the URL?
What I've been able to conclude so far:
In the path portion of a URL, / is not equivalent to its escaped form %2F
The separator ? between the path and query string is not equivalent to its escaped form %3F (presumably the same rule also applies to the fragment separator #)
For the special cases of . and .. within a hierarchical path, . is equivalent to %2E according to the specification
Some characters, such as ^, are illegal in URLs, and thus must only appear in encoded form – the decoded form is not equivalent because it can't be used at all
I don't have a second-hand source for this, but all the software I've tested agrees that percent-encoded domain names are equivalent to the corresponding decoded versions (e.g. ex%61mple.com is equivalent to example.com in the host part of a URL) – this makes sense because %, /, and illegal-in-URL characters are all illegal in domain names anyway, so escaping could not possibly be of use
% cannot be equivalent to its encoded form %25, otherwise there would be no way to escape the escape character
application/x-www-form-urlencoded is a commonly (although not universally) used format for URL query strings, and in that format, =, +, & are not equivalent to %3D, %2B, %26 respectively; thus these equivalences cannot hold in URL query strings
However, I'm finding it unclear what the correct action to take with real-world URLs is in other cases, especially as real-life URL parsing libraries tend not to match the specification exactly. In particular:
Should I be percent-decoding characters in the path portion of a URL that are URL-safe (other than %/?#) but have been unexpectedly encoded anyway? The most common software behaviour that I've seen for URLs like http://example.com/ind%65x.html is to treat them as distinct URLs from http://example.com/index.html (e.g. they appear differently in logs and don't compare as equal), but to actually handle the two "distinct" URLs the same way. I don't know whether this is an implementation detail, or whether it's some sort of compatibility workaround.
Should I be decoding any characters in query strings? If so, which?
Should I be decoding any characters in fragments? If so, which?
There seem to be competing standards on this subject, and real-world application behaviour might not match any of them, so I'm interested in knowing how far I can go with URL normalization without breaking real-world use cases. (It would also be helpful to know in which situations escaped characters might be technically different in meaning from the non-escaped versions, but in which escaping them would have no legitimate uses – a sanitiser could have an option to reject URLs that escaped these characters as being likely to be malicious.)
I hope this may provide some insight to your question:
We should only encode the individual components of the ur (example query parameters and fragments), excluding the domain name, that may contain unsafe symbols. Please note, the different components have different rules of what characters need to be encoded and which ones do not. Please read here [https://datatracker.ietf.org/doc/html/rfc3986].
In general, you may follow below:
These unreserved Characters Need not be encoded: ALPHA (uppercase and lowercase) / Decimal Digits / "-" / "." / "_" / "~"
The space character is converted into a plus sign "+" and should not trigger encoding.
All other characters (unsafe, reserved characters if not used for their reserved purposes) should be encoded. Below is a list of such characters (it may include a few more):
! * ' ( ) ; : # & = + $ , / ? # [ ] % { } | \ ^
On Wikipedia you see URLs like these:
https://zh.wiktionary.org/wiki/附录:字母索引 (but copy-pasting the URL results in the equivalent https://zh.wiktionary.org/wiki/%E9%99%84%E5%BD%95:%E5%AD%97%E6%AF%8D%E7%B4%A2%E5%BC%95).
https://th.wiktionary.org/wiki/หน้าหลัก (which when copy-pasted becomes
https://th.wiktionary.org/wiki/%E0%B8%AB%E0%B8%99%E0%B9%89%E0%B8%B2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%81)
First, I'm wondering what is happening here, what the encoding transformation is called and what it's doing and why it's doing that. I don't see why you can't just have the original native characters in the URL.
Second, I'm wondering if what Wikipedia is doing is considered valid. If it is okay to include these non-ASCII glyphs in the URL, and if not, why not (other than perhaps because the standard says so). Also would be interested to know how many browsers support showing the link in the URL bar using the native glyphs vs. this encoded thing, and even would be interesting to know how native Chinese/Thai/etc. people enter in the URL in their language, if they use the encoding or what (but that probably makes this question too complicated; still would be an interesting bonus).
The reason I ask is because I would like to put let's say words/definitions of a few different languages onto a webpage, and I would like to make the url show the actual word used in the language. So in english it might be /hello, but the equivalent word/definition in Thai would be /สวัสดี. That makes way more sense to me than having to make it into the encoding thing.
From https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
Strings of data octets within a URI are represented as characters. *Permitted characters within a URI are the ASCII characters for the lowercase and uppercase letters of the modern English alphabet, the Arabic numerals, hyphen, period, underscore, and tilde.[14] Octets represented by any other character must be percent-encoded.
Not all Unicode characters can be used in URIs. Characters that aren't supported can still be encoded using Percent Encoding. You can see the non-ascii characters in the URL field because your browser chooses to display them that way, the actual HTTP requests are done using the encoded strings.
I have noticed that Google does not encode all special characters in the query part of the URL . For example:
Placing this string in Google's search: !##$%^&*()
Yields this URL: https://www.google.com/#q=!%40%23%24%25^%26*()
Notice that the !, ^, *, ( , and ) are not encoded.
Some of the characters such as : or < are considered unsafe or reserved, yet Google doesn't encode them.
Can someone explain why Google does this, and if they have a reference document as to exactly what characters get encoded and which don't?
Thanks for any help!
As documented here:
Some characters are not safe to use in a URL without first being
encoded. Because a Google search request is made by using an HTTP URL,
the search request must follow URL conventions, including character
encoding, where necessary.
The HTTP URL syntax defines that only alphanumeric characters, the
special characters $-_.+!*'(), and the reserved characters ;/?:#=& can
be used as values within an HTTP URL request. Since reserved
characters are used by the search engine to decode the URL, and some
special characters are used to request search features, then all
non-alphanumeric characters used as a value to an input parameter must
be URL-encoded.
To URL-encode a string:
Replace space characters with a "+" character Replace each
non-alphanumeric character by its hexadecimal ASCII value, in the
format of a "%" character followed by two hexadecimal digits. (Such an
ASCII value may be referred to as an escape code.)
Some input parameters require that the values passed to Google search are double-URL-encoded. This requirement means that you must apply the URL encoding to the string twice in succession to generate the final value.
It seems that it is best to use the & escape, instead of simply typing the ampersand (&).
However, should we be using X/HTML character entity references for dashes and other common typographical characters when writing blog posts on CMSs like WordPress or hard-coding websites by hand?
For example:
– is an en dash (–)
— is an em dash (—)
What is the risk if we do not?
Why is the hyphen (-) never written as - but simply typed directly from the keyboard in HTML? (Assuming that it the hyphen, and not a minus sign.)
The W3C released an official response about when to use and when not to use character escapes which you can find here. As they are also the group that is in charge of the HTML specification, I think it's best to follow their advice.
From the section "When to Use Escapes"
Syntax characters. There are three characters that should always appear in content as escapes, so that they do not interact with the syntax of the markup. These are part of the language for all documents based on XML and for HTML.
< (<)
> (>)
& (&)
They also mention using characters that might not be supported in the current encoding.
From the section "When Not to Use Escapes"
It is almost always preferable to use an encoding that allows you to represent characters in their normal form, rather than using character entity references or NCRs.
Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size.
http://www.w3.org/International/questions/qa-escapes
Those entities are there to help you, the author, with characters not usually typable on your average keyboard. (The em dash is an example —, as well as © and ).
You only need to escape those characters that have meaning in (X)HTML < > and &.
Are Latin encoded characters considered URL safe?
Having read this post, I'm aware that web safe characters are outlined in this document. The specs do not make clear, however, if Latin encoded characters are part of the unreserved list. For example: ç and õ.
I don't see why those characters would not be included in the unreserved list. But, that said, I'm yet to see any URLs that contain such characters.
Relevant question: Assuming I can use such characters in my URL, should I?
My URLs will be generated by user input. Should I keep titles with such characters, or substitute them? For example, ç to becomes c, and so on.
My reader's native language is Portuguese, but I'm not sure if they will care about these characters in the page's friendly-URL.
The RFC you linked mentioned specifically mentions ASCII as the character set for URIs:
The ABNF notation defines its terminal values to be non-negative
integers (codepoints) based on the US-ASCII coded character set
[ASCII].
That would make characters outside of ASCII not safe, as far as the RFC is concerned.
Of course, this is all before IDN existed. There is an RFC that specifies how conversions between ASCII and Unicode on the URL should occur.
You can use any characters you want, because if any character is used outside the range of ASCII code list the percent-code octets is used in order to make the uri transportable