Should an ampersand be URL encoded in a query string?

For example I quite often see this URL come up.
https://ghbtns.com/github-btn.html?user=example&repo=card&type=watch&count=true
Is the & meant to be &amp; or should/can it be left as &?

&amp; is for encoding the ampersand in HTML.
For example, in a hyperlink:
…
(Note that this only changes the link, not the URL. The URL is still /github-btn.html?user=example&repo=card&type=watch&count=true.)
While you may encode every & (that is part of the content) with &amp; in HTML, you are only required to encode ambiguous ampersands.
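To illustrate the difference between the HTML-level escape and the URL itself, here is a minimal Python sketch. It assumes the standard library's html module stands in for whatever templating layer writes the link; the link text "Watch" is made up for the example.

```python
import html

# The URL as it is sent over the wire: a plain "&" separates the parameters.
url = "https://ghbtns.com/github-btn.html?user=example&repo=card&type=watch&count=true"

# When the URL is written into an HTML attribute, "&" becomes "&amp;".
href_attr = html.escape(url, quote=True)
link = f'<a href="{href_attr}">Watch</a>'  # "Watch" is a placeholder link text

# An HTML parser undoes the escape, so the URL that gets requested is unchanged.
recovered = html.unescape(href_attr)
print(link)
print(recovered == url)
```

The escape only affects how the link is spelled in the markup; the browser requests the same URL either way.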

From rfc3986:
Reserved Characters
URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm.
...
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI. URIs
that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent.
Percent-encoding a reserved character, or decoding a percent-encoded
octet that corresponds to a reserved character, will change how the
URI is interpreted by most applications.
...
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
So & within a URL should be encoded if it's part of the value and has no delimiting role.
Here's a simple PHP code fragment using the urlencode() function:
<?php
$query_string = 'foo=' . urlencode($foo) . '&bar=' . urlencode($bar);
echo '<a href="mycgi?' . htmlentities($query_string) . '">';
?>
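For comparison, a rough Python counterpart of the fragment above might look like this. It is only a sketch: the values of foo and bar are invented, and quote_plus is used because, like PHP's urlencode(), it writes spaces as "+".

```python
import html
from urllib.parse import parse_qs, quote_plus

# Hypothetical values; the "&" and "=" here are data, not delimiters.
foo = "Tom & Jerry"
bar = "a=b"

# Percent-encode each value, then join the pairs with a literal "&" delimiter.
query_string = "foo=" + quote_plus(foo) + "&bar=" + quote_plus(bar)

# Escape the delimiter "&" a second time, for HTML only, when writing the link.
anchor = '<a href="mycgi?' + html.escape(query_string) + '">'

print(query_string)  # foo=Tom+%26+Jerry&bar=a%3Db
print(anchor)
print(parse_qs(query_string))
```

The data ampersand ends up as %26 in the URL, while the delimiter ampersand stays a plain & in the URL and only becomes &amp; inside the HTML attribute.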

Related

What parts of a URL can be URL-encoded?

My Chrome version 101 allows me to open
https://%65%78%61%6D%70%6C%65%2E%63%6F%6D (https://example.com, encoded except for the https://.)
but not
https://%65%78%61%6D%70%6C%65%2E%63%6F%6D%2F%74%65%73%74 (https://example.com/test, with the path delimiter / also encoded.).
Exactly what parts and what characters of a URL can be URL-encoded, according to the latest specification?
By “parts,” I mean the scheme, username, password, host, port, path, query, fragment, ., :, //, #, ?, @, et cetera.
By “what characters,” I mean “characters of what value in what part.”
By the specification
From RFC 3986.
2.1. Percent-Encoding
….
pct-encoded = "%" HEXDIG HEXDIG
The uppercase hexadecimal digits “A” through “F” are equivalent to the lowercase digits “a” through “f,” respectively. If two URIs differ only in the case of hexadecimal digits used in percent-encoded octets, they are equivalent. For consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings.
Percent-encoding is case-insensitive.
2.2. Reserved Characters
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
The purpose of reserved characters is to provide a set of delimiting characters that are distinguishable from other data within a URI. URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent. Percent-encoding a reserved character, or decoding a percent-encoded octet that corresponds to a reserved character, will change how the URI is interpreted by most applications. Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI.
A subset of the reserved characters (gen-delims) is used as delimiters of the generic URI components described in Section 3. A component’s ABNF syntax rule will not use the reserved or gen-delims rule names directly; instead, each syntax rule lists the characters allowed within that component (i.e., not delimiting it), and any of those characters that are also in the reserved set are “reserved” for use as subcomponent delimiters within the component. Only the most common subcomponents are defined by this specification; other subcomponents may be defined by a URI scheme’s specification, or by the implementation-specific syntax of a URI’s dereferencing algorithm, provided that such subcomponents are delimited by characters in the reserved set allowed within that component.
URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component. If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character’s encoding in US-ASCII.
The characters “:/?#[]@!$&'()*+,;=” are reserved characters.
URL scheme specifications define syntactic URL delimiters to be some characters from the reserved characters.
Syntactic URL delimiters are not percent-encoded.
The reserved characters that are not syntactic URL delimiters can be either percent-encoded or not, but are recommended to be percent-encoded.
2.3. Unreserved Characters
Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource. However, URI comparison implementations do not always perform normalization prior to comparison (see Section 6). For consistency, percent-encoded octets in the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.
6. Normalization and Comparison
…URI comparison is performed for some particular purpose. Protocols or implementations that compare URIs for different purposes will often be subject to differing design trade-offs in regards to how much effort should be spent in reducing aliased identifiers. This section describes various methods that may be used to compare URIs, the trade-offs between them, and the types of applications that might use them.
Characters that are allowed in a URL and not the reserved, that is, “ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~”, are unreserved characters.
The unreserved characters can be either percent-encoded or not, but are recommended to be not.
Summary
Syntactic URL delimiters → cannot be percent-encoded.
Other than those → can be either percent-encoded or not.
Percent-encoding is case-insensitive.
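The points in the summary can be checked with a short Python sketch. This assumes urllib.parse as a stand-in decoder; real browsers and servers may normalize differently.

```python
from urllib.parse import unquote, urlsplit

# Unreserved characters: the encoded and plain spellings are equivalent.
encoded_host = "%65%78%61%6D%70%6C%65%2E%63%6F%6D"
decoded_host = unquote(encoded_host)

# Hex digits in a percent-encoded octet are case-insensitive.
same_octet = unquote("%7e") == unquote("%7E") == "~"

# A syntactic delimiter is different: encoding "/" changes the path structure.
plain_segments = urlsplit("https://example.com/a/b").path.split("/")
escaped_segments = urlsplit("https://example.com/a%2Fb").path.split("/")

print(decoded_host)       # example.com
print(same_octet)         # True
print(plain_segments)     # ['', 'a', 'b']
print(escaped_segments)   # ['', 'a%2Fb']
```

The last two lines show why a delimiter cannot be swapped for its percent-encoded form: the first URL has two path segments, the second has one.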
How the implementations would do
Some implementations don’t do complete, extensive URL normalization. For example, “%68%74%74%70%73://example.com” is a valid URL by the specification, but Chrome (version 101) does not normalize it into “https://example.com” when it’s put into the omnibar.

Discrepancies of Percent Encoding for URLs

After viewing this previous SO question regarding percent encoding, I'm curious as to which styles of encodings are correct - the Wikipedia article on percent encoding alludes to using + instead of %20 for spaces, while still having an application/x-www-form-urlencoded content type.
This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded. What differences are preferred for path segments vs. query strings? Details and references for this specification would be greatly appreciated.
Note: I assume that non-alphanumeric characters will be encoded via UTF-8, in that each octet for a character becomes a %XX string. Correct me if I am wrong here (for instance latin-1 instead of utf-8), but I am more interested in the differences between the encodings of different parts of a URL.
This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded.
Not only does it depend on the particular URL component, but it also depends on the circumstances in which that component is populated with data.
The use of '+' for encoding space characters is specific to the application/x-www-form-urlencoded format, which applies to webform data that is being submitted in an HTTP request. It does not apply to a URL itself.
The application/x-www-form-urlencoded format is formally defined by W3C in the HTML specifications. Here is the definition from HTML 4.01:
Section 17.13.3 Processing form data, Step four: Submit the encoded form data set
This specification does not specify all valid submission methods or content types that may be used with forms. However, HTML 4 user agents must support the established conventions in the following cases:
• If the method is "get" and the action is an HTTP URI, the user agent takes the value of action, appends a `?' to it, then appends the form data set, encoded using the "application/x-www-form-urlencoded" content type. The user agent then traverses the link to this URI. In this scenario, form data are restricted to ASCII codes.
• If the method is "post" and the action is an HTTP URI, the user agent conducts an HTTP "post" transaction using the value of the action attribute and a message created according to the content type specified by the enctype attribute.
Section 17.13.4 Form content types, application/x-www-form-urlencoded
This is the default content type. Forms submitted with this content type must be encoded as follows:
1. Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').
2. The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
The corresponding HTML5 definitions (Section 4.10.22.3 Form submission algorithm and Section 4.10.22.6 URL-encoded form data) are far more refined and detailed, but for purposes of this discussion, the gist is roughly the same.
So, in the situation where the webform data is submitted via an HTTP GET request instead of a POST request, the webform data is encoded using application/x-www-form-urlencoded and placed as-is in the URL query component.
Per RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:
URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.
'+' is a reserved character:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
The query component explicitly allows unencoded '+' characters, as it allows characters from sub-delims:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
query = *( pchar / "/" / "?" )
So, in the context of a webform submission, spaces are encoded using '+' prior to then being put as-is into the query component. This is allowed by the URL syntax, since the encoded form of application/x-www-form-urlencoded is compatible with the definition of the query component.
So, for example: http://server/script?field=hello+world
However, outside of a webform submission, putting a space character directly into the query component requires the use of pct-encoded, since ' ' is not included in either unreserved or sub-delims, and is not explicitly allowed by the query definition.
So, for example: http://server/script?hello%20world
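The two query conventions can be compared in a small Python sketch. It assumes urllib.parse, where urlencode and parse_qs follow the form-encoding convention and quote and unquote do plain percent-encoding; the field name "q" is invented.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode

# Form-style encoding (application/x-www-form-urlencoded): space becomes "+".
form_query = urlencode({"field": "hello world"})

# Plain percent-encoding: space becomes "%20".
generic_query = quote("hello world")

# A form-style decoder accepts both spellings of a space...
from_plus = parse_qs("field=hello+world")
from_pct = parse_qs("field=hello%20world")

# ...while a plain percent-decoder leaves "+" untouched.
literal_plus = unquote("hello+world")

# A literal "+" in form data therefore has to be sent as "%2B".
sum_query = urlencode({"q": "1+1"})

print(form_query, generic_query, sum_query)
```

Which decoder the receiving side applies decides whether a "+" in the query is read as a space or as a plus sign.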
Similar rules also apply to the path component, due to its use of pchar:
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
So, although path does allow for unencoded sub-delims characters, a '+' character gets treated as-is, not as an encoded space. application/x-www-form-urlencoded is not used with the path component, so a space character has to be encoded as %20 due to the definitions of pchar and segment-nz-nc.
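For the path case, a minimal Python sketch (the file name is made up). Note that urllib.parse.quote is stricter than the grammar requires: with safe="" it also escapes "+", which the path syntax would allow unencoded.

```python
from urllib.parse import quote, unquote, unquote_plus

# Hypothetical file name containing both a space and a literal plus sign.
segment = "my file+notes.txt"

# Encode one path segment; safe="" so that "/" inside a segment is escaped too.
encoded_segment = quote(segment, safe="")

# Path decoding: "+" stays a plus sign.
path_decoded = unquote("a+b")

# Form decoding, shown only for contrast: "+" turns into a space.
form_decoded = unquote_plus("a+b")

print(encoded_segment)  # my%20file%2Bnotes.txt
print(path_decoded)     # a+b
print(form_decoded)     # a b
```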
Now, regarding the charset used to encode characters -
For a webform submission, that charset is dictated by rules defined in the webform encoding algorithm (more so in HTML5 than HTML4) used to prepare the webform data prior to inserting it into the URL. In a nutshell, the HTML can specify an accept-charset attribute or hidden _charset_ field directly in the <form> itself, otherwise the charset is typically the charset used by the parent HTML.
However, outside of a webform submission, there is no formal standard for which charset is used to encode non-ASCII characters in a URL component (the IRI syntax, on the other hand, requires UTF-8, especially when converting an IRI into a URI/URL). Outside of IRI, it is up to particular URI schemes to dictate their charsets (the HTTP scheme does not), otherwise the server decides which charset it wants to use. Most schemes/servers use UTF-8 nowadays, but there are still some servers/schemes that use other charsets, typically based on the server's locale (Latin1, Shift-JIS, etc). There have been attempts to add charset reporting directly in the URL and/or in HTTP (such as Deterministic URI Encoding), but those are not commonly used.
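A short Python sketch of why the charset matters: the same character produces different percent-encoded octets depending on the charset, and decoding with the wrong one does not round-trip. The character "é" is just an example.

```python
from urllib.parse import quote, unquote

# The same character, percent-encoded under two different charsets.
utf8_form = quote("é", encoding="utf-8")
latin1_form = quote("é", encoding="latin-1")

# Decoding with the charset that was used for encoding round-trips cleanly.
latin1_roundtrip = unquote(latin1_form, encoding="latin-1")

# Decoding Latin-1 octets as UTF-8 (the default) yields a replacement character.
mismatched = unquote(latin1_form)

print(utf8_form)    # %C3%A9
print(latin1_form)  # %E9
print(latin1_roundtrip, repr(mismatched))
```

Nothing in the URL itself says which charset was used, so both sides have to agree on it out of band.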

Objective C: using non latin letters in NSURL objects corrupt custom URL schemes on iOS

I want to add custom URL schemes to my app. I got it working, but I found that if I use an NSString that contains non-Latin letters as a parameter in my URL, my app doesn't open.
My aim is to share a string like myapp://?text=blabla, where "blabla" might be any string, or even emoji. According to RFC 1808, a URL can contain only Latin letters, which looks very strange to me: what if I want to share text in French, Russian, or Asian characters?
So, is there a way to do this anyhow?
RFC 1808 is obsoleted by RFC 3986. You care about Section 2 here. The fragment allows:
fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
As you note, "ALPHA" here means "the basic Latin alphabet," but calling this "Latin" will often confuse people unless you're very explicit, since Latin-1 is something different. In particular, the encoding NSISOLatin1StringEncoding is not "the basic Latin alphabet."
OK, lots of words, let's get to how to implement this. It's actually pretty simple, and Duncan's answer is close, but you shouldn't mess with the encoding. Still use UTF8 as normal:
NSString *escapedURL = [string stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
You want the percent-encoding to be based on UTF-8, and you always want Cocoa strings in UTF-8 unless you have a specific interoperability issue. As the docs say:
encoding: The encoding to use for the returned string. If you are uncertain of the correct encoding you should use NSUTF8StringEncoding.
Note that NSURL URLWithString: requires that you already have percent-escaped the string passed to it. That sometimes surprises people (also note that "Any percent-escaped characters are interpreted using UTF-8 encoding" as noted above).
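To see what UTF-8 based percent-encoding produces for non-Latin text, here is a language-neutral illustration in Python rather than Objective-C. The myapp:// scheme comes from the question; the sample text is made up.

```python
from urllib.parse import parse_qs, quote

# Hypothetical text to share: Cyrillic letters, a space, and an emoji.
text = "привет 😀"

# Percent-encode the UTF-8 bytes of the text; safe="" escapes every reserved character.
escaped = quote(text, safe="", encoding="utf-8")
url = "myapp://?text=" + escaped

# The receiving side decodes the query component back to the original string.
query = url.split("?", 1)[1]
decoded = parse_qs(query)["text"][0]

print(url)
print(decoded == text)
```

The resulting URL contains only ASCII characters, which is what makes it acceptable to URL parsers that reject raw non-Latin input.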
You need to percent escape the special characters. Use the NSString method stringByAddingPercentEscapesUsingEncoding. Try passing in the NSNonLossyASCIIStringEncoding or perhaps NSISOLatin1StringEncoding. You'll have to play with encodings.

Is array syntax using square brackets in URL query strings valid?

Is it actually safe/valid to use multidimensional array syntax in the URL query string?
http://example.com?abc[]=123&abc[]=456
It seems to work in every browser and I always thought it was OK to use, but according to a comment in this article it is not: http://www.456bereastreet.com/archive/201008/what_characters_are_allowed_unencoded_in_query_strings/#comment4
I would like to hear a second opinion.
The answer is not simple.
The following is extracted from section 3.2.2 of RFC 3986 :
A host identified by an Internet Protocol literal address, version 6
[RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax.
This seems to answer the question by flatly stating that square brackets are not allowed anywhere else in the URI. But there is a difference between a square bracket character and a percent encoded square bracket character.
The following is extracted from the beginning of section 3 of RFC 3986 :
Syntax Components
The generic URI syntax consists of a hierarchical sequence of
components referred to as the scheme, authority, path, query, and
fragment.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
So the "query" is a component of the "URI".
The following is extracted from section 2.2 of RFC 3986 :
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must
be percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
So square brackets may appear in a query string, but only if they are percent encoded. Unless they aren't, to be explained further down in section 2.2 :
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
So because square brackets are only allowed in the "host" subcomponent, they "should" be percent encoded in other components and subcomponents, and in this case in the "query" component, unless RFC 3986 explicitly allows unencoded square brackets to represent data in the query component, which it does not.
However, if a "URI producing application" fails to do what it "should" do, by leaving square brackets unencoded in the query, then readers of the URI are not to reject the URI outright. Instead, the square brackets are to be considered as belonging to the data of the query component, since they are not used as delimiters in that component.
This is why, for example, it is not a violation of RFC 3986 when PHP accepts both unencoded and percent encoded square brackets as valid characters in a query string, and even assigns to them a special purpose. However, it would appear that authors who try to take advantage of this loophole by not percent encoding square brackets are in violation of RFC 3986.
According to RFC 3986, the Query component of an URL has the following grammar:
*( pchar / "/" / "?" )
From appendix A of the same RFC:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
[...]
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
[...]
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
My interpretation of this is that anything that isn't:
ALPHA / DIGIT / "-" / "." / "_" / "~" /
"!" / "$" / "&" / "'" / "(" / ")" /
"*" / "+" / "," / ";" / "=" / ":" / "@" / "/" / "?"
...should be pct-encoded, i.e. percent-encoded. Thus [ and ] should be percent-encoded to follow RFC 3986.
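This reading of the grammar can be turned into a small checker. The sketch below is an assumption-laden transcription of the RFC 3986 query rule (pchar plus "/" and "?") into a Python regular expression; the function name is made up, and it checks only the query component, not a whole URL.

```python
import re

# query = *( pchar / "/" / "?" )
# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
_QUERY_RE = re.compile(
    r"(?:[A-Za-z0-9\-._~"      # unreserved
    r"!$&'()*+,;="             # sub-delims
    r":@/?]"                   # remaining characters allowed in a query
    r"|%[0-9A-Fa-f]{2})*"      # pct-encoded
)

def is_valid_rfc3986_query(query: str) -> bool:
    """Return True if every character of the query fits the RFC 3986 grammar."""
    return _QUERY_RE.fullmatch(query) is not None

print(is_valid_rfc3986_query("abc[]=123&abc[]=456"))          # False
print(is_valid_rfc3986_query("abc%5B%5D=123&abc%5B%5D=456"))  # True
```

A query that fails this check may still be accepted by browsers and servers in practice; the check only reflects what the RFC grammar permits.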
David N. Jafferian's answer is fantastic. I just want to add a couple updates and practical notes:
For many years, every browser has left square brackets in query strings unencoded when submitting the request to the server. (Source: https://bugzilla.mozilla.org/show_bug.cgi?id=1152455#c6). As such, I imagine a huge portion of the web has come to rely on this behavior, which makes it extremely unlikely to change.
My reading of the WHATWG URL standard which, at least for web purposes, can be seen as superseding RFC 3986, is that it codifies this behavior of not encoding [ and ] in query strings.
Edit: Based on the comments and other answers, a more correct reading of the WHATWG URL standard is that unencoded [/] are invalid, but also should be tolerated when received/parsed and, once parsed that way, should even be re-serialized without encoding.
I'd ideally like to comment on Ethan's answer really, but don't have sufficient reputation to do it.
I'm not sure that the relevant part of the WHATWG URL standard is being referenced here. I think the correct part might be in the definition of a valid URL-query string, which it describes as being composed of URL units that themselves are formed from URL code points and percent-encoded bytes. Square brackets are not listed within URL code points and thus fall into the percent-encoded bytes category.
Thus, in answer to the original question, multidimensional array syntax (i.e. using square brackets to represent array indexing) within the query part of the URL is valid, provided the square brackets are percent encoded (as %5B for [ and %5D for ]).
My understanding is that square brackets are not first-class citizens anyway. Here is the quote:
https://www.rfc-editor.org/rfc/rfc1738
Other characters are unsafe because gateways and other transport
agents are known to sometimes modify such characters. These
characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".
I always had a temptation to go for that sort of query when I had to pass an array, but I steered away from it. The reason being:
It is not clearly defined in the RFC.
Different languages may interpret it differently.
You have a couple of options to pass an array:
Encode the string representation of the array (JSON, maybe?)
Have parameters like "val1=blah&val2=blah&.." or something like that.
And if you are sure about the language you are using, you can (safely) go for the kind of query string you have (Just that you need to %-encode [] also).
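The options above can be sketched in Python as follows. The parameter name abc is taken from the question; using a repeated key is an additional variant that is not in the list above, and how a given server framework interprets each form is an assumption you would need to check.

```python
import json
from urllib.parse import parse_qs, urlencode

values = ["123", "456"]

# Bracket syntax with the brackets percent-encoded (%5B and %5D).
bracketed = urlencode([("abc[]", v) for v in values])

# Repeated plain key, with no brackets at all.
repeated = urlencode({"abc": values}, doseq=True)

# A single parameter carrying a JSON string representation of the array.
as_json = urlencode({"abc": json.dumps(values)})

print(bracketed)  # abc%5B%5D=123&abc%5B%5D=456
print(repeated)   # abc=123&abc=456
print(json.loads(parse_qs(as_json)["abc"][0]))
```

Python's parse_qs treats "abc[]" as an ordinary key name; frameworks such as PHP give the brackets special meaning, which is the language dependence mentioned above.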

Should we encode slashes in search part of URLs?

RFC 1738 is not precise about the encoding of forward slashes in the "search part":
If the character corresponding to an octet is reserved in a scheme, the octet must be encoded.
...
only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
...
Within the 'path' and 'searchpart' components, "/", ";", "?" are reserved.
Do you know what the "reserved purpose" of "/" is in the search part of URLs?
Is there any real reason to follow the spec and encode the forward slashes, provided that
my server handles unencoded slashes?
It drives me nuts when I need to constantly decode URL parameters that are just alphanumerics with slashes.
Here is a real-life example:
http://localhost/login?url=/a/path/to/protected/content
vs
http://localhost/login?url=%2Fa%2Fpath%2Fto%2Fprotected%2Fcontent
Note that RFC 3986 updates RFC 1738 (though doesn't obsolete it, which I think indicates that it's intended to clarify rather than contradict).
RFC 3986 says, in section 3.4, that the syntax of the query part of the URI is:
query = *( pchar / "/" / "?" )
The ABNF for URIs is conveniently collected in Appendix A, which indicates
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
That pretty unequivocally indicates that slashes are legitimate in the query part, and so don't need to be encoded. In particular, your example http://localhost/login?url=/a/path/to/protected/content is fine as it is, and so is http://localhost/login?abc123-.+~!$&'()*+,;=%00/?:@
Section 2.4 indicates that characters need to be encoded only when one wants to include reserved characters in a part of the URI (that doesn't apply here).
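As a practical illustration, here is a Python sketch showing that both spellings decode to the same value. It assumes urllib.parse on both the producing and the consuming side; other libraries may have different defaults.

```python
from urllib.parse import parse_qs, urlencode

target = "/a/path/to/protected/content"

# Default behaviour: slashes in the value are percent-encoded.
strict_query = urlencode({"url": target})

# Passing safe="/" leaves the slashes readable.
readable_query = urlencode({"url": target}, safe="/")

print("http://localhost/login?" + strict_query)
print("http://localhost/login?" + readable_query)

# Both forms decode to the same parameter value.
print(parse_qs(strict_query) == parse_qs(readable_query))
```

Since the query grammar allows a bare "/", the readable form is a matter of preference as long as the value contains no other characters that need escaping.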

Resources