Context: I am creating an app that stores its data in the location.hash. I want to encode as few characters as possible to maintain maximum legibility.
As explained in this answer, reserved characters are different for each segment of the URL. So what are the limitations for URL Fragment/location.hash specifically?
Related post:
Unicode characters in URLs
According to RFC 3986: Uniform Resource Identifier (URI):
fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
Unpacking all that, and ignoring percent-encoding, I find the following set of characters:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~!$&'()*+,;=:#/?
Although the RFC does not mandate a particular encoding and deals in characters only (not bytes), according to Section 2.3 ALPHA means ASCII only, i.e. the 26 letters of the Latin alphabet. Any non-ASCII letters must therefore be percent-encoded.
Related
For example I quite often see this URL come up.
https://ghbtns.com/github-btn.html?user=example&repo=card&type=watch&count=true
Is the & meant to be & or should/can it be left as &?
& is for encoding the ampersand in HTML.
For example, in a hyperlink:
…
(Note that this only changes the link, not the URL. The URL is still /github-btn.html?user=example&repo=card&type=watch&count=true.)
While you may encode every & (that is part of the content) with & in HTML, you are only required to encode ambiguous ampersands.
From rfc3986:
Reserved Characters
URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm.
...
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "#"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI. URIs
that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent.
Percent-encoding a reserved character, or decoding a percent-encoded
octet that corresponds to a reserved character, will change how the
URI is interpreted by most applications.
...
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
So & within a URL should be encoded if it's part of the value and has no delimiting role.Here's simple PHP code fragment using urlencode() function:
<?php
$query_string = 'foo=' . urlencode($foo) . '&bar=' . urlencode($bar);
echo '<a href="mycgi?' . htmlentities($query_string) . '">';
?>
The rfc 1738 is not precise about encoding of forward slashes in "search part":
If the character corresponding to an octet is reserved in a scheme, the octet must be encoded.
...
only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
...
Within the 'path' and 'searchpart' components, "/", ";", "?" are reserved.
Do you know what is the "reserved purpose" of "/" in search part of the urls?
Is there any real reason to follow the spec and encode the forward slashes providing that
my server handles unecoded slashes?
It drive me nuts when I need to constantly decode urls parameters that are just alphanumeric with slashes.
Here is an life example:
http://localhost/login?url=/a/path/to/protected/content
vs
http://localhost/login?url=%2Fa%2Fpath%2Fto%2Fprotected%2Fcontent"
Note that RFC 3986 updates RFC 1738 (though doesn't obsolete it, which I think indicates that it's intended to clarify rather than contradict).
RFC 3986 says, in section 3.4, that the syntax of the query part of the URI is:
query = *( pchar / "/" / "?" )
The ABNF for URIs is conveniently collected in Appendix A, which indicates
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
That pretty unequivocally indicates that slashes are legitimate in the query part, and so don't need to be encoded. In particular, your example http://localhost/login?url=/a/path/to/protected/content is fine as it is, and so is http://localhost/login?abc123-.+~!$&'()*+,;=%00/?:#
Section 2.4 indicates that characters need to be encoded only when one wants to include reserved characters in a part of the URI (that doesn't apply here).
I have an application that takes all the parameters in the url like this: /category/subcategory/sub-subcategory. I want to be able to give out extra parameters at the end of the URL, like page-2/order-desc. This would make the whole URL into cat/subcat/sub-subcat{delimiting-character}page-2/order-desc.
My question is: what characters could I use as {delimiting-character}. I tend to prefer ":" as I know for sure it will never appear anyplace else but I don't know if it would be standard compliant or at least if it will not give me problems in the future.
As I recall vimeo used something like this: vimeo.com/video:{code} but they seem to have changed this.
You can use alphanumeric, plus the special characters "$-_.+!*'(),"
More info here: http://www.ietf.org/rfc/rfc1738.txt
Also, take note not to exceed 2000 characters in url
The most recent URI spec is RFC 3986; see the ABNF for details on what characters are allowed in which parts for the URI.
The format for an absolute path part is:
path-absolute = "/" [ segment-nz *( "/" segment ) ]
segment = *pchar
segment-nz = 1*pchar
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
See http://www.ietf.org/rfc/rfc1738.txt
Basically, you are allowed all aphanumerics as well as $ - _ . + ! * ' ( ) ,
If you use dash or underscore, remember that a dash is read by Google as a hyphen, so does not alter how your URL is categorized. An underscore is counted as a character, and can mess up your SEO.
Ex: dash-use = dash use (2 words);
underscore_use = underscore_use (1 word)
You could use a dash or an underscore (these are used frequently). You could use any character you want to but for example, spaces turn into %20 in the url so they don't look too-nice.
I stumbled across a site that uses multiple fragment identifiers in their URLs, like http://www.ejeby.se/#newprodukt#produkt#1075#1 (no, it is not my site, but I am linking to it, which brings problems for me).
But is this really correct? It does seem to cause problems for Safari and possibly also Internet Explorer (hearsay, I have not tried IE myself).
Isn't the fragment identifier supposed to uniquely identify one location in the document?
Is this a bug in Safari or is it www.ejeby.se that uses fragment idenifiers in a wrong way?
Edit: Seems that the problem for Safari is that it escapes all # but the first in the URL. The other browsers do not do this. Correct behaviour or not?
From the specification point of view, a fragment can contain the following characters (I’ve already expanded the productions):
fragment = *( ALPHA / DIGIT / "-" / "." / "_" / "~" / "%" HEXDIG HEXDIG / "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" / ":" / "#" / "/" / "?" )
So, no, the fragment must not contain a plain #; it must be encoded with %23.
But it is possible that some browsers display it differently just as sequences of percent-encoded octets, that represent valid UTF-8 characters are replaced by the characters they represent.
Does anyone know the full list of characters that can be used within a GET without being encoded? At the moment I am using A-Z a-z and 0-9... but I am looking to find out the full list.
I am also interested into if there is a specification released for the up coming addition of Chinese, Arabic url's (as obviously that will have a big impact on my question)
EDIT: As #Jukka K. Korpela correctly points out, RFC 1738 was updated by RFC 3986.
This has expanded and clarified the characters valid for host, unfortunately it's not easily copied and pasted, but I'll do my best.
In first matched order:
host = IP-literal / IPv4address / reg-name
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
IPv6address = 6( h16 ":" ) ls32
/ "::" 5( h16 ":" ) ls32
/ [ h16 ] "::" 4( h16 ":" ) ls32
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
/ [ *4( h16 ":" ) h16 ] "::" ls32
/ [ *5( h16 ":" ) h16 ] "::" h16
/ [ *6( h16 ":" ) h16 ] "::"
ls32 = ( h16 ":" h16 ) / IPv4address
; least-significant 32 bits of address
h16 = 1*4HEXDIG
; 16 bits of address represented in hexadecimal
IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
dec-octet = DIGIT ; 0-9
/ %x31-39 DIGIT ; 10-99
/ "1" 2DIGIT ; 100-199
/ "2" %x30-34 DIGIT ; 200-249
/ "25" %x30-35 ; 250-255
reg-name = *( unreserved / pct-encoded / sub-delims )
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" <---This seems like a practical shortcut, most closely resembling original answer
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "#"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
pct-encoded = "%" HEXDIG HEXDIG
Original answer from RFC 1738 specification:
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
^ obsolete since 1998.
The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding)
http://en.wikipedia.org/wiki/Percent-encoding#Types_of_URI_characters
says these are RFC 3986 unreserved characters (sec. 2.3) as well as reserved characters (sec 2.2) if they need to retain their special meaning. And also a percent character as part of a percent-encoding.
The full list of the 66 unreserved characters is in RFC3986, here: https://www.rfc-editor.org/rfc/rfc3986#section-2.3
This is any character in the following regex set:
[A-Za-z0-9_.\-~]
I tested it by requesting my website (apache) with all available chars on my german keyboard as URL parameter:
http://example.com/?^1234567890ß´qwertzuiopü+asdfghjklöä#<yxcvbnm,.-°!"§$%&/()=? `QWERTZUIOPÜ*ASDFGHJKLÖÄ\'>YXCVBNM;:_²³{[]}\|µ#€~
These were not encoded:
^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ,.-!/()=?`*;:_{}[]\|~
Not encoded after urlencode():
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-_
Not encoded after rawurlencode():
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-_~
Note: Before PHP 5.3.0 rawurlencode() encoded ~ because of RFC 1738. But this was replaced by RFC 3986 so its safe to use, now. But I do not understand why for example {} are encoded through rawurlencode() because they are not mentioned in RFC 3986.
An additional test I made was regarding auto-linking in mail texts. I tested Mozilla Thunderbird, aol.com, outlook.com, gmail.com, gmx.de and yahoo.de and they fully linked URLs containing these chars:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-_~+#,%&=*;:#
Of course the ? was linked, too, but only if it was used once.
Some people would now suggest to use only the rawurlencode() chars, but did you ever hear that someone had problems to open these websites?
Asterisk
http://wayback.archive.org/web/*/http://google.com
Colon
https://en.wikipedia.org/wiki/Wikipedia:About
Plus
https://plus.google.com/+google
At sign, Colon, Comma and Exclamation mark
https://www.google.com/maps/place/USA/#36.2218457,...
Because of that these chars should be usable unencoded without problems. Of course you should not use &; because of encoding sequences like &. The same reason is valid for % as it used to encode chars in general. And = as it assigns a value to a parameter name.
Finally I would say its ok to use these unencoded:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-_~!+,*:#
But if you expect randomly generated URLs you should not use punctuation marks like .!, because some mail apps will not auto-link them:
http://example.com/?foo=bar! < last char not linked
From here
Thus, only alphanumerics, the special characters $-_.+!*'(),
and reserved characters used for their
reserved purposes may be used unencoded within a URL.
RFC3986 defines two sets of characters you can use in a URI:
Reserved Characters: :/?#[]#!$&'()*+,;=
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "#"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
The purpose of reserved characters is to provide a set of delimiting characters that are distinguishable from other data within a URI. URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent.
Unreserved Characters: A-Za-z0-9-_.~
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
Characters that are allowed in a URI but do not have a reserved purpose are called unreserved.
These are listed in RFC3986. See the Collected ABNF for URI to see what is allowed where and the regex for parsing/validation.
This answer discusses characters may be included inside a URL fragment part without being escaped. I'm posting a separate answer since this part is slightly different than (and can be used in conjunction with) other excellent answers here.
The fragment part is not sent to the server and it is the characters that go after # in this example:
https://example.com/#STUFF-HERE
Specification
The relevant specifications in RFC 3986 are:
fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
This also references rules in RFC 2234
ALPHA = %x41-5A / %x61-7A ; A-Z / a-z
DIGIT = %x30-39 ; 0-9
Result
So the full list, excluding escapes (pct-encoded) are:
A-Z a-z 0-9 - . _ ~ ! $ & ' ( ) * + , ; = : # / ?
For your convenience here is a PCRE expression that matches a valid, unescaped fragment:
/^[A-Za-z0-9\-._~!$&'()*+,;=:#\/?]*$/
Encoding
Counting this up, there are:
26 + 26 + 10 + 19 = 81 code points
You could use base 81 to efficiently encode data here.
The upcoming change is for chinese, arabic domain names not URIs. The internationalised URIs are called IRIs and are defined in RFC 3987. However, having said that I'd recommend not doing this yourself but relying on an existing, tested library since there are lots of choices of URI encoding/decoding and what are considered safe by specification, versus what are safe by actual use (browsers).
If you like to give a special kind of experience to the users you could use pushState to bring a wide range of characters to the browser's url:
var u="";var tt=168;
for(var i=0; i< 250;i++){
var x = i+250*tt;
console.log(x);
var c = String.fromCharCode(x);
u+=c;
}
history.pushState({},"",250*tt+u);