Should we encode slashes in search part of URLs? - url

The rfc 1738 is not precise about encoding of forward slashes in "search part":
If the character corresponding to an octet is reserved in a scheme, the octet must be encoded.
...
only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
...
Within the 'path' and 'searchpart' components, "/", ";", "?" are reserved.
Do you know what is the "reserved purpose" of "/" in search part of the urls?
Is there any real reason to follow the spec and encode the forward slashes providing that
my server handles unecoded slashes?
It drive me nuts when I need to constantly decode urls parameters that are just alphanumeric with slashes.
Here is an life example:
http://localhost/login?url=/a/path/to/protected/content
vs
http://localhost/login?url=%2Fa%2Fpath%2Fto%2Fprotected%2Fcontent"

Note that RFC 3986 updates RFC 1738 (though doesn't obsolete it, which I think indicates that it's intended to clarify rather than contradict).
RFC 3986 says, in section 3.4, that the syntax of the query part of the URI is:
query = *( pchar / "/" / "?" )
The ABNF for URIs is conveniently collected in Appendix A, which indicates
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
That pretty unequivocally indicates that slashes are legitimate in the query part, and so don't need to be encoded. In particular, your example http://localhost/login?url=/a/path/to/protected/content is fine as it is, and so is http://localhost/login?abc123-.+~!$&'()*+,;=%00/?:#
Section 2.4 indicates that characters need to be encoded only when one wants to include reserved characters in a part of the URI (that doesn't apply here).

Related

Should an ampersand be URL encoded in a query string?

For example I quite often see this URL come up.
https://ghbtns.com/github-btn.html?user=example&repo=card&type=watch&count=true
Is the & meant to be & or should/can it be left as &?
& is for encoding the ampersand in HTML.
For example, in a hyperlink:
…
(Note that this only changes the link, not the URL. The URL is still /github-btn.html?user=example&repo=card&type=watch&count=true.)
While you may encode every & (that is part of the content) with & in HTML, you are only required to encode ambiguous ampersands.
From rfc3986:
Reserved Characters
URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm.
...
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "#"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI. URIs
that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent.
Percent-encoding a reserved character, or decoding a percent-encoded
octet that corresponds to a reserved character, will change how the
URI is interpreted by most applications.
...
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
So & within a URL should be encoded if it's part of the value and has no delimiting role.Here's simple PHP code fragment using urlencode() function:
<?php
$query_string = 'foo=' . urlencode($foo) . '&bar=' . urlencode($bar);
echo '<a href="mycgi?' . htmlentities($query_string) . '">';
?>

What are the eligible characters in a URL's Fragment (location.hash)?

Context: I am creating an app that stores its data in the location.hash. I want to encode as few characters as possible to maintain maximum legibility.
As explained in this answer, reserved characters are different for each segment of the URL. So what are the limitations for URL Fragment/location.hash specifically?
Related post:
Unicode characters in URLs
According to RFC 3986: Uniform Resource Identifier (URI):
fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
Unpacking all that, and ignoring percent-encoding, I find the following set of characters:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~!$&'()*+,;=:#/?
Although the RFC does not mandate a particular encoding and deals in characters only (not bytes), according to Section 2.3 ALPHA means ASCII only, i.e. the 26 letters of the Latin alphabet. Any non-ASCII letters must therefore be percent-encoded.

Objective C: using non latin letters in NSURL objects corrupt custom URL schemes on iOS

I want to add custom URL schemes to my app. I made it, but i found that if I use a NSString that contain not a latin letters as a parameter in my URL, my app doesn't open.
My aim is to share string like: myapp://?text=blabla, but on "blabla" place might be any string or maybe emoji. According to RFC 1808, URL can contain only latin letters and this looks very strange to me because what if I want to share text in french language or russian, or asian characters?
So, is there a way to do this anyhow?
RFC 1808 is obsoleted by RFC 3986. You care about Section 2 here. The fragment allows:
fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
As you note, "ALPHA" here means "the basic Latin alphabet," but calling this "Latin" will often confuse people unless you're very explicit, since Latin-1 is something different. In particular, the encoding NSISOLatin1StringEncoding is not "the basic Latin alphabet."
OK, lots of words, let's get to how to implement this. It's actually pretty simple, and Duncan's answer is close, but you shouldn't mess with the encoding. Still use UTF8 as normal:
NSString *escapedURL = [string stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
You want the percent-encoding to be based on UTF-8, and you always want Cocoa strings in UTF-8 unless you have a specific interoperability issue. As the docs say:
encoding: The encoding to use for the returned string. If you are uncertain of the correct encoding you should use NSUTF8StringEncoding.
Note that NSURL URLWithString: requires that you already have percent-escaped the string passed to it. That sometimes surprises people (also note that "Any percent-escaped characters are interpreted using UTF-8 encoding" as noted above).
You need to percent escape the special characters. Use the NSString method stringByAddingPercentEscapesUsingEncoding. Try passing in the NSNonLossyASCIIStringEncoding or perhaps NSISOLatin1StringEncoding. You'll have to play with encodings.

Is array syntax using square brackets in URL query strings valid?

Is it actually safe/valid to use multidimensional array synthax in the URL query string?
http://example.com?abc[]=123&abc[]=456
It seems to work in every browser and I always thought it was OK to use, but accodring to a comment in this article it is not: http://www.456bereastreet.com/archive/201008/what_characters_are_allowed_unencoded_in_query_strings/#comment4
I would like to hear a second opinion.
The answer is not simple.
The following is extracted from section 3.2.2 of RFC 3986 :
A host identified by an Internet Protocol literal address, version 6
[RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax.
This seems to answer the question by flatly stating that square brackets are not allowed anywhere else in the URI. But there is a difference between a square bracket character and a percent encoded square bracket character.
The following is extracted from the beginning of section 3 of RFC 3986 :
Syntax Components
The generic URI syntax consists of a hierarchical sequence of
components referred to as the scheme, authority, path, query, and
fragment.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
So the "query" is a component of the "URI".
The following is extracted from section 2.2 of RFC 3986 :
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must
be percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "#"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
So square brackets may appear in a query string, but only if they are percent encoded. Unless they aren't, to be explained further down in section 2.2 :
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
So because square brackets are only allowed in the "host" subcomponent, they "should" be percent encoded in other components and subcomponents, and in this case in the "query" component, unless RFC 3986 explicitly allows unencoded square brackets to represent data in the query component, which is does not.
However, if a "URI producing application" fails to do what it "should" do, by leaving square brackets unencoded in the query, then readers of the URI are not to reject the URI outright. Instead, the square brackets are to be considered as belonging to the data of the query component, since they are not used as delimiters in that component.
This is why, for example, it is not a violation of RFC 3986 when PHP accepts both unencoded and percent encoded square brackets as valid characters in a query string, and even assigns to them a special purpose. However, it would appear that authors who try to take advantage of this loophole by not percent encoding square brackets are in violation of RFC 3986.
According to RFC 3986, the Query component of an URL has the following grammar:
*( pchar / "/" / "?" )
From appendix A of the same RFC:
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
[...]
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
[...]
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
My interpretation of this is that anything that isn't:
ALPHA / DIGIT / "-" / "." / "_" / "~" /
"!" / "$" / "&" / "'" / "(" / ")" /
"*" / "+" / "," / ";" / "=" / ":" / "#"
...should be pct-encoded, i.e percent-encoded. Thus [ and ] should be percent-encoded to follow RFC 3986.
David N. Jafferian's answer is fantastic. I just want to add a couple updates and practical notes:
For many years, every browser has left square brackets in query strings unencoded when submitting the request to the server. (Source: https://bugzilla.mozilla.org/show_bug.cgi?id=1152455#c6). As such, I imagine a huge portion of the web has come to rely on this behavior, which makes it extremely unlikely to change.
My reading of the WHATWG URL standard which, at least for web purposes, can be seen as superseding RFC 3986, is that it codifies this behavior of not encoding [ and ] in query strings.
Edit: Based on the comments and other answers, a more correct reading of the WHATWG URL standard is that unencoded [/] are invalid, but also should be tolerated when received/parsed and, once parsed that way, should even be re-serialized without encoding.
I'd ideally like to comment on Ethan's answer really, but don't have sufficient reputation to do it.
I'm not sure that the relevant part of the WHATWG URL standard is being referenced here. I think the correct part might be in the definition of a valid URL-query string, which it describes as being composed of URL units that themselves are formed from URL code points and percent-encoded bytes. Square brackets are listed within URL code points and thus fall into the percent-encoded bytes category.
Thus, in answer to the original question, multidimensional array syntax (i.e. using square brackets to represent array indexing) within the query part of the URL is valid, provided the square brackets are percent encoded (as %5B for [ and %5D for ]).
My understanding that square brackets are not first-class citizens anyway. Here is the quote:
https://www.rfc-editor.org/rfc/rfc1738
Other characters are unsafe because gateways and other transport
agents are known to sometimes modify such characters. These
characters are "{", "}", "|", "", "^", "~", "[", "]", and "`".
I always had a temptation to go for that sort of query when I had to pass an array, but I steered away from it. The reason being:
It is not cleared defined in RFC.
Different languages may interpret it differently.
You have a couple of options to pass an array:
Encode the string representation of the array(JSON may be?)
Have parameters like "val1=blah&val2=blah&.." or something like that.
And if you are sure about the language you are using, you can (safely) go for the kind of query string you have (Just that you need to %-encode [] also).

What characters can one use in a URL?

I have an application that takes all the parameters in the url like this: /category/subcategory/sub-subcategory. I want to be able to give out extra parameters at the end of the URL, like page-2/order-desc. This would make the whole URL into cat/subcat/sub-subcat{delimiting-character}page-2/order-desc.
My question is: what characters could I use as {delimiting-character}. I tend to prefer ":" as I know for sure it will never appear anyplace else but I don't know if it would be standard compliant or at least if it will not give me problems in the future.
As I recall vimeo used something like this: vimeo.com/video:{code} but they seem to have changed this.
You can use alphanumeric, plus the special characters "$-_.+!*'(),"
More info here: http://www.ietf.org/rfc/rfc1738.txt
Also, take note not to exceed 2000 characters in url
The most recent URI spec is RFC 3986; see the ABNF for details on what characters are allowed in which parts for the URI.
The format for an absolute path part is:
path-absolute = "/" [ segment-nz *( "/" segment ) ]
segment = *pchar
segment-nz = 1*pchar
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
See http://www.ietf.org/rfc/rfc1738.txt
Basically, you are allowed all aphanumerics as well as $ - _ . + ! * ' ( ) ,
If you use dash or underscore, remember that a dash is read by Google as a hyphen, so does not alter how your URL is categorized. An underscore is counted as a character, and can mess up your SEO.
Ex: dash-use = dash use (2 words);
underscore_use = underscore_use (1 word)
You could use a dash or an underscore (these are used frequently). You could use any character you want to but for example, spaces turn into %20 in the url so they don't look too-nice.

Resources