Is the "@" character valid in a URL after the hostname?

"@" is certainly allowed in this case:
http://user:pass@domain.com/foo
However, in these cases:
http://www.foo.com/@bar
http://www.foo.com/?email=a@b.com
Are they 'ok' or should the "@" be encoded?
Similarly, if they are 'ok', does it make the host portion "bar" and "b.com" respectively?
I took a look at the RFC (http://www.ietf.org/rfc/rfc3986.txt) and page 45 uses this example:
ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm
to indicate that the "@" means "10.0.0.1" is the host, but I'm not sure because the query portion didn't start correctly (no "?"). (Also it then mentions "attacks" and I got confused.)
The background: I am trying to determine if Steven Levithan's regex is correct in parsing "http://www.foo.com/@bar" as having a host of "bar":
http://stevenlevithan.com/demo/parseuri/js/

The example you are mentioning is used in the RFC to illustrate how a URI like that can be deceiving to humans. In this case cnn.example.com&story=breaking_news would be the user info portion of the URI, in the same way as user:pass is in your first example.
As for whether or not @ is allowed in the URI itself, as far as I can tell it is.
If you look at pages 48 and 49 you'll find (among other things) the following rules:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty / *snip*
authority = [ userinfo "@" ] host [ ":" port ]
path-abempty = *( "/" segment )
segment = *pchar
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
Applying this to http://www.foo.com/@bar, we find that the scheme is http. The authority contains only the mandatory host portion, www.foo.com (userinfo and port are both optional). Together with this authority component, hier-part has a path-abempty component consisting of a single repetition, /@bar. The segment consists of 4 repetitions of pchar: @, b, a and r. As such, bar is not the hostname.
How well any given browser or web server follows the RFC, on the other hand, is an entirely different question.
Disclaimer: I am no expert, and it's been a while since I looked at ABNF in general.
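As a rough cross-check of the reading above, here is a minimal sketch using Python's standard urllib.parse module. It is only one parser's interpretation and not a substitute for the ABNF, and the describe helper is a hypothetical name introduced for illustration.

```python
from urllib.parse import urlsplit

# URLs from the question, plus the deceptive example from RFC 3986.
urls = [
    "http://user:pass@domain.com/foo",
    "http://www.foo.com/@bar",
    "http://www.foo.com/?email=a@b.com",
    "ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm",
]


def describe(url):
    """Return the components relevant to the question as a dict."""
    parts = urlsplit(url)
    return {
        "username": parts.username,  # userinfo before any ":" (None if absent)
        "host": parts.hostname,
        "path": parts.path,
        "query": parts.query,
    }


for url in urls:
    print(url, "->", describe(url))
```

With this parser, an "@" only separates userinfo from the host while it is still inside the authority, that is, before the first "/", "?" or "#" that follows the "//".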

vcap failed to open[udp @ 0x56378e8a76a0] bind failed: Permission denied

I'm trying to use OpenCV's VideoCapture with an IPv6 address to stream from my Raspberry Pi to my Debian virtual machine, but I get the error in the title when I try.
I've confirmed that the ipv6 address is reachable with netcat and mplayer with the following:
Debian host machine:
netcat -l -6 -u 2222
raspberry pi:
/opt/vc/bin/raspivid -t 0 -w 300 -h 300 -hf -fps 20 -o - | nc -u (ipv6 address) 2222
Code:
VideoCapture vcap;
const string videoStreamAddress = string("udp://") + "(my Ipv6 address)" + ":2222";
vcap.open(videoStreamAddress);
Edit: I've confirmed vcap.open works with 127.0.0.1, but it still doesn't work with my IPv6 address.

IPv6 addresses used in the format you have specified, <protocol>://, are required to be enclosed in brackets ([ and ]). This was originally specified in RFC 2732, Format for Literal IPv6 Addresses in URL's, and continued in RFC 3986, Uniform Resource Identifier (URI): Generic Syntax:
3.2.2. Host
The host subcomponent of authority is identified by an IP literal
encapsulated within square brackets, an IPv4 address in dotted-
decimal form, or a registered name. The host subcomponent is case-
insensitive. The presence of a host subcomponent within a URI does not
imply that the scheme requires access to the given host on the
Internet. In many cases, the host syntax is used only for the sake of
reusing the existing registration process created and deployed for
DNS, thus obtaining a globally unique name without the cost of
deploying another registry. However, such use comes with its own
costs: domain name ownership may change over time for reasons not
anticipated by the URI producer. In other cases, the data within the
host component identifies a registered name that has nothing to do
with an Internet host. We use the name "host" for the ABNF rule
because that is its most common purpose, not its only purpose.
host = IP-literal / IPv4address / reg-name
The syntax rule for host is ambiguous because it does not completely
distinguish between an IPv4address and a reg-name. In order to
disambiguate the syntax, we apply the "first-match-wins" algorithm: If
host matches the rule for IPv4address, then it should be considered an
IPv4 address literal and not a reg-name. Although host is
case-insensitive, producers and normalizers should use lowercase for
registered names and hexadecimal addresses for the sake of uniformity,
while only using uppercase letters for percent-encodings.
A host identified by an Internet Protocol literal address, version 6
[RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax. In
anticipation of future, as-yet-undefined IP literal address formats,
an implementation may use an optional version flag to indicate such a
format explicitly rather than rely on heuristic determination.
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
The version flag does not indicate the IP version; rather, it
indicates future versions of the literal format. As such,
implementations must not provide the version flag for the existing
IPv4 and IPv6 literal address forms described below. If a URI
containing an IP-literal that starts with "v" (case-insensitive),
indicating that the version flag is present, is dereferenced by an
application that does not know the meaning of that version flag, then
the application should return an appropriate error for "address
mechanism not supported".
A host identified by an IPv6 literal address is represented inside the
square brackets without a preceding version flag. The ABNF provided
here is a translation of the text definition of an IPv6 literal
address provided in [RFC3513]. This syntax does not support IPv6
scoped addressing zone identifiers.
A 128-bit IPv6 address is divided into eight 16-bit pieces. Each piece
is represented numerically in case-insensitive hexadecimal, using one
to four hexadecimal digits (leading zeroes are permitted). The eight
encoded pieces are given most-significant first, separated by colon
characters. Optionally, the least-significant two pieces may instead
be represented in IPv4 address textual format. A sequence of one or
more consecutive zero-valued 16-bit pieces within the address may be
elided, omitting all their digits and leaving exactly two consecutive
colons in their place to mark the elision.
IPv6address = 6( h16 ":" ) ls32
/ "::" 5( h16 ":" ) ls32
/ [ h16 ] "::" 4( h16 ":" ) ls32
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
/ [ *4( h16 ":" ) h16 ] "::" ls32
/ [ *5( h16 ":" ) h16 ] "::" h16
/ [ *6( h16 ":" ) h16 ] "::"
ls32 = ( h16 ":" h16 ) / IPv4address
; least-significant 32 bits of address
h16 = 1*4HEXDIG
; 16 bits of address represented in hexadecimal
A host identified by an IPv4 literal address is represented in
dotted-decimal notation (a sequence of four decimal numbers in the
range 0 to 255, separated by "."), as described in [RFC1123] by
reference to [RFC0952]. Note that other forms of dotted notation may
be interpreted on some platforms, as described in Section 7.4, but
only the dotted-decimal form of four octets is allowed by this
grammar.
IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
dec-octet = DIGIT ; 0-9
/ %x31-39 DIGIT ; 10-99
/ "1" 2DIGIT ; 100-199
/ "2" %x30-34 DIGIT ; 200-249
/ "25" %x30-35 ; 250-255
A host identified by a registered name is a sequence of characters
usually intended for lookup within a locally defined host or service
name registry, though the URI's scheme-specific semantics may require
that a specific registry (or fixed name table) be used instead. The
most common name registry mechanism is the Domain Name System (DNS). A
registered name intended for lookup in the DNS uses the syntax defined
in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123]. Such a name
consists of a sequence of domain labels separated by ".", each domain
label starting and ending with an alphanumeric character and possibly
also containing "-" characters. The rightmost domain label of a fully
qualified domain name in DNS may be followed by a single "." and
should be if it is necessary to distinguish between the complete
domain name and some local domain.
reg-name = *( unreserved / pct-encoded / sub-delims )
If the URI scheme defines a default for host, then that default
applies when the host subcomponent is undefined or when the registered
name is empty (zero length). For example, the "file" URI scheme is
defined so that no authority, an empty host, and "localhost" all mean
the end-user's machine, whereas the "http" scheme considers a missing
authority or empty host invalid.
This specification does not mandate a particular registered name
lookup technology and therefore does not restrict the syntax of reg-
name beyond what is necessary for interoperability. Instead, it
delegates the issue of registered name syntax conformance to the
operating system of each application performing URI resolution, and
that operating system decides what it will allow for the purpose of
host identification. A URI resolution implementation might use DNS,
host tables, yellow pages, NetInfo, WINS, or any other system for
lookup of registered names. However, a globally scoped naming system,
such as DNS fully qualified domain names, is necessary for URIs
intended to have global scope. URI producers should use names that
conform to the DNS syntax, even when use of DNS is not immediately
apparent, and should limit these names to no more than 255 characters
in length.
The reg-name syntax allows percent-encoded octets in order to
represent non-ASCII registered names in a uniform way that is
independent of the underlying name resolution technology. Non-ASCII
characters must first be encoded according to UTF-8 [STD63], and then
each octet of the corresponding UTF-8 sequence must be percent-
encoded to be represented as URI characters. URI producing
applications must not use percent-encoding in host unless it is used
to represent a UTF-8 character sequence. When a non-ASCII registered
name represents an internationalized domain name intended for
resolution via the DNS, the name must be transformed to the IDNA
encoding [RFC3490] prior to name lookup. URI producers should provide
these registered names in the IDNA encoding, rather than a
percent-encoding, if they wish to maximize interoperability with
legacy URI resolvers.
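To make the bracket rule concrete, here is a minimal sketch in Python (not the OpenCV C++ code from the question). The build_url helper is a hypothetical name, and 2001:db8::1 is a documentation address standing in for the real one. Whether a given OpenCV/FFmpeg build then accepts the resulting URL is not something this sketch verifies.

```python
import ipaddress
from urllib.parse import urlsplit


def build_url(scheme, host, port):
    """Build scheme://host:port, bracketing the host if it is an IPv6 literal."""
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        is_v6 = False  # not an IP literal, e.g. a registered name
    if is_v6:
        host = "[" + host + "]"
    return f"{scheme}://{host}:{port}"


url = build_url("udp", "2001:db8::1", 2222)
print(url)

# A generic URI parser can now tell the address apart from the port.
parts = urlsplit(url)
print(parts.hostname, parts.port)
```

Without the brackets, the colons inside the address are indistinguishable from the colon that introduces the port.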

Discrepancies of Percent Encoding for URLs

After viewing this previous SO question regarding percent encoding, I'm curious as to which styles of encoding are correct. The Wikipedia article on percent encoding alludes to using + instead of %20 for spaces, while still having an application/x-www-form-urlencoded content type.
This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded. What differences are preferred for path segments vs. query strings? Details and references for this specification would be greatly appreciated.
Note: I assume that non-alphanumeric characters will be encoded via UTF-8, in that each octet for a character becomes a %XX string. Correct me if I am wrong here (for instance latin-1 instead of utf-8), but I am more interested in the differences between the encodings of different parts of a URL.

This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded.
Not only does it depend on the particular URL component, but it also depends on the circumstances in which that component is populated with data.
The use of '+' for encoding space characters is specific to the application/x-www-form-urlencoded format, which applies to webform data that is being submitted in an HTTP request. It does not apply to a URL itself.
The application/x-www-form-urlencoded format is formally defined by W3C in the HTML specifications. Here is the definition from HTML 4.01:
Section 17.13.3 Processing form data, Step four: Submit the encoded form data set
This specification does not specify all valid submission methods or content types that may be used with forms. However, HTML 4 user agents must support the established conventions in the following cases:
• If the method is "get" and the action is an HTTP URI, the user agent takes the value of action, appends a `?' to it, then appends the form data set, encoded using the "application/x-www-form-urlencoded" content type. The user agent then traverses the link to this URI. In this scenario, form data are restricted to ASCII codes.
• If the method is "post" and the action is an HTTP URI, the user agent conducts an HTTP "post" transaction using the value of the action attribute and a message created according to the content type specified by the enctype attribute.
Section 17.13.4 Form content types, application/x-www-form-urlencoded
This is the default content type. Forms submitted with this content type must be encoded as follows:
1. Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').
2. The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
The corresponding HTML5 definitions (Section 4.10.22.3 Form submission algorithm and Section 4.10.22.6 URL-encoded form data) are far more refined and detailed, but for purposes of this discussion, the gist is roughly the same.
So, in the situation where the webform data is submitted via an HTTP GET request instead of a POST request, the webform data is encoded using application/x-www-form-urlencoded and placed as-is in the URL query component.
Per RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:
URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.
'+' is a reserved character:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
The query component explicitly allows unencoded '+' characters, as it allows characters from sub-delims:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
query = *( pchar / "/" / "?" )
So, in the context of a webform submission, spaces are encoded using '+' prior to then being put as-is into the query component. This is allowed by the URL syntax, since the encoded form of application/x-www-form-urlencoded is compatible with the definition of the query component.
So, for example: http://server/script?field=hello+world
However, outside of a webform submission, putting a space character directly into the query component requires the use of pct-encoded, since ' ' is not included in either unreserved or sub-delims, and is not explicitly allowed by the query definition.
So, for example: http://server/script?hello%20world
Similar rules also apply to the path component, due to its use of pchar:
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
So, although path does allow for unencoded sub-delims characters, a '+' character gets treated as-is, not as an encoded space. application/x-www-form-urlencoded is not used with the path component, so a space character has to be encoded as %20 due to the definitions of pchar and segment-nz-nc.
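The distinction between the two conventions can be sketched with Python's standard urllib.parse module, which happens to expose both: quote/unquote follow plain percent-encoding, while quote_plus/unquote_plus/urlencode follow the webform convention. This is an illustration of the rules above, not part of either specification.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

text = "hello world"

# Plain percent-encoding, as needed for a path segment or a non-form query.
path_style = quote(text)

# application/x-www-form-urlencoded: the space becomes '+'.
form_style = quote_plus(text)
form_query = urlencode({"field": text})

# Decoding depends on which convention the receiver assumes.
literal_plus = unquote("a+b")    # '+' is kept as a literal plus sign
form_plus = unquote_plus("a+b")  # '+' is read as an encoded space

print(path_style, form_style, form_query)
print(repr(literal_plus), repr(form_plus))
```

The same string "a+b" therefore carries different data depending on whether the receiver treats it as form data or as a generic URL component.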
Now, regarding the charset used to encode characters -
For a webform submission, that charset is dictated by rules defined in the webform encoding algorithm (more so in HTML5 than HTML4) used to prepare the webform data prior to inserting it into the URL. In a nutshell, the HTML can specify an accept-charset attribute or hidden _charset_ field directly in the <form> itself, otherwise the charset is typically the charset used by the parent HTML.
However, outside of a webform submission, there is no formal standard for which charset is used to encode non-ASCII characters in a URL component (the IRI syntax, on the other hand, requires UTF-8, especially when converting an IRI into a URI/URL). Outside of IRI, it is up to particular URI schemes to dictate their charsets (the HTTP scheme does not), otherwise the server decides which charset it wants to use. Most schemes/servers use UTF-8 nowadays, but there are still some servers/schemes that use other charsets, typically based on the server's locale (Latin1, Shift-JIS, etc). There have been attempts to add charset reporting directly in the URL and/or in HTTP (such as Deterministic URI Encoding), but those are not commonly used.
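The effect of the charset choice can be seen in a small sketch, again assuming Python's urllib.parse (which defaults to UTF-8 but accepts an explicit encoding argument). The character "é" is just an arbitrary example.

```python
from urllib.parse import quote, unquote

char = "é"

# The same character yields different octets, and so different %XX
# sequences, depending on the charset used before percent-encoding.
utf8_form = quote(char)                         # two octets in UTF-8
latin1_form = quote(char, encoding="latin-1")   # one octet in Latin-1

# Decoding with the wrong charset does not recover the original character.
right = unquote(latin1_form, encoding="latin-1")
wrong = unquote(latin1_form)  # decoded as UTF-8 by default

print(utf8_form, latin1_form, repr(right), repr(wrong))
```

Since nothing in the URL itself says which charset was used, sender and receiver have to agree on it out of band.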

Do browsers ignore slashes in URLs? [duplicate]

I noticed that both Chrome and Firefox ignore slashes between words in a URL.
So, github.com/octocat/hello-world seems to be equivalent to github.com//////octocat////hello-world.
I am writing an application that parses a URL and retrieves a part of it, and thanks to this behavior, I am able to return the original URL without modifying the code, which in my case is rather convenient. I don't know if it would be a good idea to rely on this quirk though.

Path separators are defined to be a single slash according to this. (Search for Path Component)
Note that browsers don't usually modify the URL. Browsers could append a / at the end of a URL, but in your case, the URL with extra slashes is simply sent along in the request, so it is the server ignoring the slashes instead.
Also, have a look at:
Is a URL with // in the path-section valid?
URL with multiple forward slashes, does it break anything?
What does the double slash mean in URLs?
Even if this behavior is convenient for you, it is generally not recommended. In addition, caching may also be affected (source):
Since both your browser and the server cache individual pages (according to their caching settings), requesting same file multiple times via slightly different URIs might affect the caching (depending on server and client implementation).

An empty path segment is valid as per specification:
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
In the latter URI https://github.com//////octocat////hello-world, the path //////octocat////hello-world would be composed of:
//////octocat////hello-world: path-abempty
/: segment
/: segment
/: segment
/: segment
/: segment
/octocat: segment-nz
/: segment
/: segment
/: segment
/hello-world: segment-nz
Removing these empty path segments would make up a completely different URI. How the server would handle these empty path segments is a completely different question.
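The segment breakdown above can be reproduced with a short Python sketch. The helper names path_segments and collapse_slashes are hypothetical, and the collapsing step only imitates what a lenient server might do. It is not something the URI specification prescribes.

```python
import re
from urllib.parse import urlsplit


def path_segments(url):
    """Split the path into segments; every '/' starts a new (possibly empty) one."""
    path = urlsplit(url).path
    # Drop the empty string that precedes the first '/'.
    return path.split("/")[1:]


def collapse_slashes(url):
    """Assumed lenient-server behaviour: treat runs of '/' in the path as one."""
    return re.sub(r"/{2,}", "/", urlsplit(url).path)


plain = "https://github.com/octocat/hello-world"
noisy = "https://github.com//////octocat////hello-world"

print(path_segments(plain))
print(path_segments(noisy))
print(collapse_slashes(noisy) == urlsplit(plain).path)
```

The two URLs have different segment lists (two segments versus ten, eight of them empty), so they only lead to the same resource if the server chooses to collapse them.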

Actually browsers do not ignore them, they pass them to the web server in the HTTP request. It's the server that may decide to ignore them, but technically multiplying slashes results in a different URL.
W3.org specifies that the path part of a URL consists of "path segments", separated by /, and a path segment consists of zero or more "URL units" (characters) except / and ?, so empty path segments are allowed, which is what you get when you duplicate slashes.
See http://www.w3.org/TR/url-1/ for details

Actually browsers do not ignore repeated slashes in URLs.
If you use document.URL in (client side) JavaScript you get the URL with the repeating '///'s.
Similarly in (server side) PHP, when using $_SERVER['REQUEST_URI'] you get the URL with the repeating '///'s.
It is the server, e.g., Apache, that actually serves the proper page as if the extra slashes were not in the URL. In Apache you can write rules in the .htaccess file so that requests are not resolved with the ///s ignored.

Is array syntax using square brackets in URL query strings valid?

Is it actually safe/valid to use multidimensional array syntax in the URL query string?
http://example.com?abc[]=123&abc[]=456
It seems to work in every browser and I always thought it was OK to use, but according to a comment in this article it is not: http://www.456bereastreet.com/archive/201008/what_characters_are_allowed_unencoded_in_query_strings/#comment4
I would like to hear a second opinion.

The answer is not simple.
The following is extracted from section 3.2.2 of RFC 3986 :
A host identified by an Internet Protocol literal address, version 6
[RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax.
This seems to answer the question by flatly stating that square brackets are not allowed anywhere else in the URI. But there is a difference between a square bracket character and a percent encoded square bracket character.
The following is extracted from the beginning of section 3 of RFC 3986 :
Syntax Components
The generic URI syntax consists of a hierarchical sequence of
components referred to as the scheme, authority, path, query, and
fragment.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
So the "query" is a component of the "URI".
The following is extracted from section 2.2 of RFC 3986 :
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must
be percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
So square brackets may appear in a query string, but only if they are percent encoded. Unless they aren't, to be explained further down in section 2.2 :
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
So because square brackets are only allowed in the "host" subcomponent, they "should" be percent encoded in other components and subcomponents, and in this case in the "query" component, unless RFC 3986 explicitly allows unencoded square brackets to represent data in the query component, which it does not.
However, if a "URI producing application" fails to do what it "should" do, by leaving square brackets unencoded in the query, then readers of the URI are not to reject the URI outright. Instead, the square brackets are to be considered as belonging to the data of the query component, since they are not used as delimiters in that component.
This is why, for example, it is not a violation of RFC 3986 when PHP accepts both unencoded and percent encoded square brackets as valid characters in a query string, and even assigns to them a special purpose. However, it would appear that authors who try to take advantage of this loophole by not percent encoding square brackets are in violation of RFC 3986.

According to RFC 3986, the Query component of an URL has the following grammar:
*( pchar / "/" / "?" )
From appendix A of the same RFC:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
[...]
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
[...]
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
My interpretation of this is that anything that isn't:
ALPHA / DIGIT / "-" / "." / "_" / "~" /
"!" / "$" / "&" / "'" / "(" / ")" /
"*" / "+" / "," / ";" / "=" / ":" / "@"
...should be pct-encoded, i.e percent-encoded. Thus [ and ] should be percent-encoded to follow RFC 3986.
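That reading of the grammar can be turned into a small checker. The following Python sketch transcribes the query rule into a regular expression. The names QUERY_RE and is_rfc3986_query are hypothetical, and it checks syntax only, without judging what any server will accept in practice.

```python
import re

# query = *( pchar / "/" / "?" )
# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
QUERY_RE = re.compile(
    r"(?:[A-Za-z0-9\-._~"      # unreserved
    r"!$&'()*+,;="             # sub-delims
    r":@/?]"                   # ":" / "@" plus the extra "/" and "?"
    r"|%[0-9A-Fa-f]{2})*"      # pct-encoded
)


def is_rfc3986_query(query):
    """Return True if the whole string matches the RFC 3986 query grammar."""
    return QUERY_RE.fullmatch(query) is not None


print(is_rfc3986_query("abc[]=123&abc[]=456"))          # raw brackets
print(is_rfc3986_query("abc%5B%5D=123&abc%5B%5D=456"))  # encoded brackets
```

Under this strict check the raw "[" and "]" fail, while their percent-encoded forms %5B and %5D pass.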

David N. Jafferian's answer is fantastic. I just want to add a couple updates and practical notes:
For many years, every browser has left square brackets in query strings unencoded when submitting the request to the server. (Source: https://bugzilla.mozilla.org/show_bug.cgi?id=1152455#c6). As such, I imagine a huge portion of the web has come to rely on this behavior, which makes it extremely unlikely to change.
My reading of the WHATWG URL standard which, at least for web purposes, can be seen as superseding RFC 3986, is that it codifies this behavior of not encoding [ and ] in query strings.
Edit: Based on the comments and other answers, a more correct reading of the WHATWG URL standard is that unencoded [/] are invalid, but also should be tolerated when received/parsed and, once parsed that way, should even be re-serialized without encoding.

I'd ideally like to comment on Ethan's answer really, but don't have sufficient reputation to do it.
I'm not sure that the relevant part of the WHATWG URL standard is being referenced here. I think the correct part might be in the definition of a valid URL-query string, which it describes as being composed of URL units that themselves are formed from URL code points and percent-encoded bytes. Square brackets are not listed within URL code points and thus fall into the percent-encoded bytes category.
Thus, in answer to the original question, multidimensional array syntax (i.e. using square brackets to represent array indexing) within the query part of the URL is valid, provided the square brackets are percent encoded (as %5B for [ and %5D for ]).
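As an illustration of the percent-encoded form, here is a minimal round-trip sketch using Python's urllib.parse. The parameter name abc[] comes from the question. How a particular server framework interprets the decoded name as an array is outside what this shows.

```python
from urllib.parse import urlencode, parse_qs

# Array-style parameters from the question, as name/value pairs.
pairs = [("abc[]", "123"), ("abc[]", "456")]

# urlencode percent-encodes the brackets in the parameter names.
query = urlencode(pairs)
print(query)

# Decoding restores the bracketed name and collects the repeated values.
decoded = parse_qs(query)
print(decoded)
```

The brackets survive the round trip as data, so nothing is lost by encoding them.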

My understanding is that square brackets are not first-class citizens anyway. Here is the quote:
https://www.rfc-editor.org/rfc/rfc1738
Other characters are unsafe because gateways and other transport
agents are known to sometimes modify such characters. These
characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".

I was always tempted to go for that sort of query when I had to pass an array, but I steered away from it. The reasons being:
It is not clearly defined in the RFC.
Different languages may interpret it differently.
You have a couple of options to pass an array:
Encode the string representation of the array (JSON, maybe?)
Have parameters like "val1=blah&val2=blah&.." or something like that.
And if you are sure about the language you are using, you can (safely) go for the kind of query string you have (you just need to %-encode the [] as well).
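Both alternatives can be sketched in a few lines of Python. The parameter names abc and val1, val2 are placeholders in the spirit of the examples above, and the receiving side is assumed to know which convention was used.

```python
import json
from urllib.parse import urlencode, parse_qs

values = [123, 456]

# Option 1: encode a string representation of the array (here JSON).
json_query = urlencode({"abc": json.dumps(values)})
from_json = json.loads(parse_qs(json_query)["abc"][0])

# Option 2: one numbered parameter per element (val1, val2, ...).
numbered_query = urlencode({f"val{i}": v for i, v in enumerate(values, 1)})
parsed = parse_qs(numbered_query)
from_numbered = [int(parsed[f"val{i}"][0]) for i in range(1, len(parsed) + 1)]

print(json_query, from_json)
print(numbered_query, from_numbered)
```

Neither form relies on bracket syntax in parameter names, so the result does not depend on how a given language treats "[]".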

Should we encode slashes in search part of URLs?

RFC 1738 is not precise about the encoding of forward slashes in the "search part":
If the character corresponding to an octet is reserved in a scheme, the octet must be encoded.
...
only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
...
Within the 'path' and 'searchpart' components, "/", ";", "?" are reserved.
Do you know what the "reserved purpose" of "/" is in the search part of URLs?
Is there any real reason to follow the spec and encode the forward slashes, provided that
my server handles unencoded slashes?
It drives me nuts when I need to constantly decode URL parameters that are just alphanumeric with slashes.
Here is a real-life example:
http://localhost/login?url=/a/path/to/protected/content
vs
http://localhost/login?url=%2Fa%2Fpath%2Fto%2Fprotected%2Fcontent

Note that RFC 3986 updates RFC 1738 (though doesn't obsolete it, which I think indicates that it's intended to clarify rather than contradict).
RFC 3986 says, in section 3.4, that the syntax of the query part of the URI is:
query = *( pchar / "/" / "?" )
The ABNF for URIs is conveniently collected in Appendix A, which indicates
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
That pretty unequivocally indicates that slashes are legitimate in the query part, and so don't need to be encoded. In particular, your example http://localhost/login?url=/a/path/to/protected/content is fine as it is, and so is http://localhost/login?abc123-.+~!$&'()*+,;=%00/?:@
Section 2.4 indicates that characters need to be encoded only when one wants to include reserved characters in a part of the URI (that doesn't apply here).
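For what it is worth, here is a minimal Python sketch showing that both spellings of the example carry the same data once decoded. It uses urllib.parse, whose urlencode encodes "/" by default but accepts a safe argument to leave it alone. The variable names are illustrative only.

```python
from urllib.parse import urlencode, parse_qs

target = "/a/path/to/protected/content"

# Default behaviour: slashes in the value are percent-encoded.
encoded = urlencode({"url": target})

# Marking "/" as safe keeps the query readable, which the query grammar allows.
readable = urlencode({"url": target}, safe="/")

print(encoded)
print(readable)

# Both forms decode to the same parameter value.
print(parse_qs(encoded) == parse_qs(readable))
```

So the choice between the two is mostly a matter of readability, as long as the receiving side decodes the query normally.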
