I have noticed that Google does not encode all special characters in the query part of the URL. For example:
Placing this string in Google's search: !@#$%^&*()
Yields this URL: https://www.google.com/#q=!%40%23%24%25^%26*()
Notice that the !, ^, *, ( , and ) are not encoded.
Some of the characters such as : or < are considered unsafe or reserved, yet Google doesn't encode them.
Can someone explain why Google does this, and if they have a reference document as to exactly what characters get encoded and which don't?
Thanks for any help!
As documented here:
Some characters are not safe to use in a URL without first being
encoded. Because a Google search request is made by using an HTTP URL,
the search request must follow URL conventions, including character
encoding, where necessary.
The HTTP URL syntax defines that only alphanumeric characters, the
special characters $-_.+!*'(), and the reserved characters ;/?:#=& can
be used as values within an HTTP URL request. Since reserved
characters are used by the search engine to decode the URL, and some
special characters are used to request search features, then all
non-alphanumeric characters used as a value to an input parameter must
be URL-encoded.
To URL-encode a string:
Replace space characters with a "+" character.
Replace each non-alphanumeric character by its hexadecimal ASCII value, in the
format of a "%" character followed by two hexadecimal digits. (Such an
ASCII value may be referred to as an escape code.)
Some input parameters require that the values passed to Google search are double-URL-encoded. This requirement means that you must apply the URL encoding to the string twice in succession to generate the final value.
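The two replacement rules quoted above, and the double encoding, can be sketched in Python. This is only an illustration, not Google's implementation: the function name `url_encode_strict` is hypothetical, and escaping non-ASCII text as UTF-8 octets is an assumption the quoted text does not state.

```python
def url_encode_strict(text: str) -> str:
    """Encode spaces as '+' and every other non-alphanumeric octet as %XX."""
    pieces = []
    # Assumption: non-ASCII characters are escaped octet by octet as UTF-8.
    for octet in text.encode("utf-8"):
        char = chr(octet)
        if char == " ":
            pieces.append("+")
        elif char.isascii() and char.isalnum():
            pieces.append(char)
        else:
            pieces.append(f"%{octet:02X}")
    return "".join(pieces)


once = url_encode_strict("c++ & java")
twice = url_encode_strict(once)  # the "double-URL-encoded" value
print(once)
print(twice)
```

Each character is handled in a single pass, so the "+" produced for a space is not escaped again within the same call; it is only escaped (to %2B) when the whole string is encoded a second time.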
I've encountered a situation whereby a GET request for a URL where the query string contains unencoded special characters returns 200 OK for the URL as-is but returns a 400 Bad Request when the special characters are url-encoded.
A simplified example of such a URL is: http://example.com/??foo/bar+foobar
Note the double question mark such that the entire query string is a single key-value pair with a null value.
URL-encoding the query string key gives us: http://example.com/?%3Ffoo%2Fbar+foobar
A GET request for this URL containing encoded characters will return 400 Bad Request.
The application handling these URLs (a third party I don't control) appears to not like the query string key to contain url-encoded equivalents of non-alphanumeric characters.
I was under the assumption that http://example.com/??foo/bar+foobar and http://example.com/?%3Ffoo%2Fbar+foobar should be equivalent and consequently interchangeable. This may be an invalid assumption.
Are these two URLs equivalent and consequently interchangeable?
Is it the application that handles these URLs at fault for not treating these URLs as equivalent or is it my application, which is applying url-encoding to query string keys and values, at fault for applying such encoding?
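To make the comparison concrete, here is a small Python sketch using the standard library's `urllib.parse`. It only shows how the two query strings differ as sent and what they decode to; it does not settle whether a given server is obliged to treat them the same.

```python
from urllib.parse import urlsplit, unquote_plus

raw_url = "http://example.com/??foo/bar+foobar"
encoded_url = "http://example.com/?%3Ffoo%2Fbar+foobar"

# urlsplit cuts at the first "?", so the second "?" belongs to the query.
raw_query = urlsplit(raw_url).query
encoded_query = urlsplit(encoded_url).query

same_on_the_wire = raw_query == encoded_query
same_after_decoding = unquote_plus(raw_query) == unquote_plus(encoded_query)
print(raw_query, encoded_query, same_on_the_wire, same_after_decoding)
```

The strings differ byte for byte but decode to the same text, so an application that compares or validates the raw query string before decoding it can accept one form and reject the other.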
Are Latin encoded characters considered URL safe?
Having read this post, I'm aware that web safe characters are outlined in this document. The specs do not make clear, however, if Latin encoded characters are part of the unreserved list. For example: ç and õ.
I don't see why those characters would not be included in the unreserved list. But, that said, I'm yet to see any URLs that contain such characters.
Relevant question: Assuming I can use such characters in my URL, should I?
My URLs will be generated by user input. Should I keep titles with such characters, or substitute them? For example, ç becomes c, and so on.
My reader's native language is Portuguese, but I'm not sure if they will care about these characters in the page's friendly-URL.
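For reference, the substitution described above (ç becomes c, and so on) could look roughly like this in Python. It is a sketch only: the helper name `ascii_slug` is made up for illustration, and joining words with hyphens is an assumption about what the friendly URL should look like.

```python
import re
import unicodedata


def ascii_slug(title: str) -> str:
    """Fold accented letters to ASCII and join words with hyphens."""
    # NFKD splits "ç" into "c" plus a combining cedilla; the ASCII
    # encode step then drops the combining mark.
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


print(ascii_slug("Programação em Português"))
```

Characters that have no ASCII decomposition are simply dropped by this approach, which may or may not be acceptable for a given set of titles.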
The RFC you linked specifically mentions ASCII as the character set for URIs:
The ABNF notation defines its terminal values to be non-negative
integers (codepoints) based on the US-ASCII coded character set
[ASCII].
That would make characters outside of ASCII not safe, as far as the RFC is concerned.
Of course, this is all before IDN existed. There is an RFC that specifies how conversions between ASCII and Unicode on the URL should occur.
You can use any characters you want: any character outside the ASCII range is percent-encoded, octet by octet, so that the URI remains transportable.
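As a rough illustration of that percent-encoding, Python's standard `urllib.parse.quote` escapes non-ASCII characters as UTF-8 octets. The path below is an invented example.

```python
from urllib.parse import quote, unquote

path = "/artigos/ação"    # hypothetical friendly-URL path
encoded = quote(path)     # "/" is kept by default; non-ASCII becomes %XX
decoded = unquote(encoded)
print(encoded)
print(decoded)
```

The encoded form is what travels over the wire; many browsers show the decoded form to the reader.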
What are the valid characters that can be used in a URL query variable?
I'm asking because I would like to create GUIDs of minimal string length by using the largest character set so long as they can be passed as a URL query variable (www.StackOverflow.com?query=guiddaf09834fasnv)
Edit
If you want to encode a UUID/GUID or any other information represented in a byte array into a URL-friendly string, you can use this method in the Apache Commons Codec library:
Base64.encodeBase64URLSafeString(byte[])
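The same idea can be sketched with Python's standard library instead of the Java method named above. The helper names `uuid_to_token` and `token_to_uuid` are hypothetical, and stripping the "=" padding is an assumption made here to keep the token short.

```python
import base64
import uuid


def uuid_to_token(value: uuid.UUID) -> str:
    """Encode the 16 UUID bytes as URL-safe Base64 without '=' padding."""
    return base64.urlsafe_b64encode(value.bytes).rstrip(b"=").decode("ascii")


def token_to_uuid(token: str) -> uuid.UUID:
    """Restore the padding and decode the token back to a UUID."""
    padded = token + "=" * (-len(token) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))


original = uuid.uuid4()
token = uuid_to_token(original)
print(len(str(original)), len(token))
```

The token uses only letters, digits, "-" and "_", so it needs no further escaping in a query variable.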
When in doubt, just go to the RFC.
Note: A query variable is not dealt with any differently than the rest of the URL.
From the section "2.2. URL Character Encoding Issues"
... only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.
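A quick way to check a candidate value against the quoted rule is sketched below. The function name is hypothetical, and reserved characters are treated as needing encoding on the assumption that, inside a query value, they are not being used for their reserved purpose.

```python
import string

# Alphanumerics plus the special characters quoted from RFC 1738, section 2.2.
UNENCODED_OK = set(string.ascii_letters + string.digits + "$-_.+!*'(),")


def chars_needing_encoding(value: str) -> list[str]:
    """Return the characters of a query value that fall outside the quoted set."""
    return [char for char in value if char not in UNENCODED_OK]


print(chars_needing_encoding("guiddaf09834fasnv"))
print(chars_needing_encoding("!@#$%^&*()"))
```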
I have an action like this:
<%=Html.ActionLink("My_link", "About", "Home", new RouteValueDictionary {
{ "id", "Österreich" } }, null)%>
This produces the following link: http://localhost:1855/Home/About/%C3%96sterreich
I want a link which looks like this - localhost:1855/Home/About/Österreich
I have tried:
Server.HtmlDecode("Österreich")
HttpUtility.UrlDecode("Österreich")
Neither seems to be helping. What else can I try to get my desired result?
I think this is an issue with your browser (IE).
Your code is correct as it is, no explicit UrlEncoding needed.
<%=Html.ActionLink("My_link", "About", "Home", new RouteValueDictionary {
{ "id", "Österreich" } }, null)%>
There is nothing wrong with ASP.NET MVC. See unicode urls on the web, e.g. http://he.wikipedia.org/wiki/%D7%9B%D7%A9%D7%A8%D7%95%D7%AA in IE and in a browser that handles unicode in URLs correctly.
E.g. Chrome displays unicode URLs without any problem. IE does not decode "special" unicode characters in the address bar.
This is only a cosmetic issue.
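To see that the two forms carry the same information, here is a short Python sketch (not ASP.NET code) that decodes the generated link for display and encodes it again. The `safe=":/"` argument is chosen here so that the scheme and path separators are left alone.

```python
from urllib.parse import quote, unquote

href = "http://localhost:1855/Home/About/%C3%96sterreich"
display_text = unquote(href)                 # human-readable form
round_trip = quote(display_text, safe=":/")  # back to the encoded form
print(display_text)
print(round_trip == href)
```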
According to RFC 1738, Uniform Resource Locators (URL), only US-ASCII is supported; all other characters must be encoded.
2.2. URL Character Encoding Issues
URLs are sequences of characters, i.e., letters, digits, and special
characters. A URLs may be represented
in a variety of ways: e.g., ink on
paper, or a sequence of octets in a
coded character set. The
interpretation of a URL depends only
on the identity of the characters
used.
In most URL schemes, the sequences of characters in different parts of a
URL are used to represent sequences of
octets used in Internet protocols. For
example, in the ftp scheme, the host
name, directory name and file names
are such sequences of octets,
represented by parts of the URL.
Within those parts, an octet may be
represented by the chararacter which
has that octet as its code within the
US-ASCII [20] coded character set.
In addition, octets may be encoded by a character triplet consisting of
the character "%" followed by the two
hexadecimal digits (from
"0123456789ABCDEF") which forming the
hexadecimal value of the octet. (The
characters "abcdef" may also be used
in hexadecimal encodings.)
Octets must be encoded if they have no corresponding graphic
character within the US-ASCII coded
character set, if the use of the
corresponding character is unsafe, or
if the corresponding character is
reserved for some other interpretation
within the particular URL scheme.
No corresponding graphic US-ASCII:
URLs are written only with the graphic printable characters of the
US-ASCII coded character set. The
octets 80-FF hexadecimal are not used
in US-ASCII, and the octets 00-1F and
7F hexadecimal represent control
characters; these must be encoded.
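The octet rules in that excerpt can be sketched as a small Python function. This is a simplification under stated assumptions: the name `encode_octets` is hypothetical, the unsafe list comes from the same RFC section rather than from the excerpt above, and reserved characters are left untouched because their handling depends on the URL scheme.

```python
# Characters RFC 1738 section 2.2 lists as unsafe (space included).
UNSAFE = set(b" <>\"#%{}|\\^~[]`")


def encode_octets(data: bytes) -> str:
    """Percent-encode control, non-ASCII, and unsafe octets; keep the rest."""
    pieces = []
    for octet in data:
        is_control = octet <= 0x1F or octet == 0x7F
        is_non_ascii = octet >= 0x80
        if is_control or is_non_ascii or octet in UNSAFE:
            pieces.append(f"%{octet:02X}")
        else:
            pieces.append(chr(octet))
    return "".join(pieces)


print(encode_octets("Österreich".encode("utf-8")))
print(encode_octets(b"a b\x7f"))
```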
I think your desire for a non-URL-encoded URL is valid, but I don't think the tools actually make it easy to do this.
Would putting the generated link inside an <a>, with the link text being the non-encoded string, not be good enough? It would still look bad in the browser's URL field, but your UI would be a little prettier.
Also, in Firefox anyway, the URL shown in my status bar when I mouse over your "ugly" link shows the unencoded version, so it would probably look fine there as well.
What’s the difference between a URL Encode and an HTML Encode?
HTML Encoding escapes special characters in strings used in HTML documents so that they are not confused with HTML elements, for example changing
"<hello>world</hello>"
to
"&lt;hello&gt;world&lt;/hello&gt;"
URL Encoding does a similar thing for string values in a URL like changing
"hello+world = hello world"
to
"hello%2Bworld+%3D+hello+world"
urlEncode replaces special characters with escape sequences that web browsers and web servers understand for the purpose of addressing, hence URL. For instance, a space is replaced with %20 and ' with %27.
See these references:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
http://www.degraeve.com/reference/urlencoding.php
HtmlEncode replaces special characters with character strings that are recognised by the HTML engine itself to render the content of the page: & becomes &amp;, < becomes &lt;, and > becomes &gt;. This prevents the HTML engine from interpreting these characters as parts of the HTML markup, so they are rendered as plain text.
See this reference:
http://msdn.microsoft.com/en-us/library/ms525347.aspx
Both HTML and URLs are essentially very constrained languages. As languages they give meaning to specific keywords or operators. In both, though, keywords are almost always single characters. For example
HTML: > and <
URL: / and :
When using either language, though, you sometimes need these characters without their special meaning. For instance, this post contains a > character. I do not want it to be interpreted as HTML, just text.
This is where Encode and Decode methods come into play. These methods will respectively take a string and convert any of the characters that would otherwise be treated as keywords into an escaped form which will not be interpreted as part of the language.
For instance: Passing > into HtmlEncode will return &gt;
HTMLEncode and URLEncode deal with invalid characters in HTML and URLs, or more accurately, characters that need to be specially written to be interpreted correctly. For example, in HTML the < and > characters are used to indicate tags. Thus, if you wanted to write a math formula, something like 1+1 < 2+2, the '<' would normally be interpreted as the beginning of a tag. HTMLEncoding turns this character into "&lt;", which is the encoded representation of the less-than sign. URLEncoding does the same, but for URLs, for which the special characters are different, although there is some overlap.
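The two kinds of encoding can be compared side by side with Python's standard library, used here as a stand-in for whatever HtmlEncode/UrlEncode functions your platform provides. The input strings are the ones from the earlier examples.

```python
import html
from urllib.parse import quote_plus

html_encoded = html.escape("<hello>world</hello>")
url_encoded = quote_plus("hello+world = hello world")
print(html_encoded)
print(url_encoded)
```

The first result is safe to place inside an HTML page as text; the second is safe to place inside a query string as a value. Neither is a substitute for the other.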
I don't know what language you are working in, but the PHP manual for example provides good explanations.
URLEncode
Returns a string in which all
non-alphanumeric characters except -_.
have been replaced with a percent (%)
sign followed by two hex digits and
spaces encoded as plus (+) signs. It
is encoded the same way that the
posted data from a WWW form is
encoded, that is the same way as in
application/x-www-form-urlencoded
media type. This differs from the
RFC 1738 encoding (see rawurlencode())
in that for historical reasons, spaces
are encoded as plus (+) signs.
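Python's standard library has a comparable pair of functions, which makes the "+" versus "%20" distinction easy to see. This is an analogy rather than PHP's implementation, and the exact set of characters each language leaves unencoded may differ.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "hello world-_."
form_style = quote_plus(text)  # form encoding: spaces become "+"
raw_style = quote(text)        # plain percent-encoding: spaces become "%20"
print(form_style, raw_style)

# A literal "+" only means "space" to the form-style decoder.
print(unquote_plus("a+b"), unquote("a+b"))
```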