vcap failed to open[udp @ 0x56378e8a76a0] bind failed: Permission denied - opencv

I'm trying to use the VideoCapture function from OpenCV with an IPv6 address to stream from my Raspberry Pi to my Debian virtual machine, but I get the error in the title when I try.
I've confirmed that the IPv6 address is reachable with netcat and mplayer, using the following:
Debian host machine:
netcat -l -6 -u 2222
raspberry pi:
/opt/vc/bin/raspivid -t 0 -w 300 -h 300 -hf -fps 20 -o - | nc -u (ipv6 address) 2222
Code:
VideoCapture vcap;
const string videoStreamAddress = string("udp://") + "(my IPv6 address)" + ":2222";
vcap.open(videoStreamAddress);
Edit: I've confirmed that vcap.open works with 127.0.0.1, but it still doesn't work with my IPv6 address.

IPv6 addresses used in a URL of the form you have specified, <protocol>://, must be enclosed in square brackets ([ and ]). This was originally specified in RFC 2732, Format for Literal IPv6 Addresses in URL's, and continued in RFC 3986, Uniform Resource Identifier (URI): Generic Syntax:
3.2.2. Host
The host subcomponent of authority is identified by an IP literal
encapsulated within square brackets, an IPv4 address in
dotted-decimal form, or a registered name. The host subcomponent is
case-insensitive. The presence of a host subcomponent within a URI does not
imply that the scheme requires access to the given host on the
Internet. In many cases, the host syntax is used only for the sake of
reusing the existing registration process created and deployed for
DNS, thus obtaining a globally unique name without the cost of
deploying another registry. However, such use comes with its own
costs: domain name ownership may change over time for reasons not
anticipated by the URI producer. In other cases, the data within the
host component identifies a registered name that has nothing to do
with an Internet host. We use the name "host" for the ABNF rule
because that is its most common purpose, not its only purpose.
host = IP-literal / IPv4address / reg-name
The syntax rule for host is ambiguous because it does not completely
distinguish between an IPv4address and a reg-name. In order to
disambiguate the syntax, we apply the "first-match-wins" algorithm: If
host matches the rule for IPv4address, then it should be considered an
IPv4 address literal and not a reg-name. Although host is
case-insensitive, producers and normalizers should use lowercase for
registered names and hexadecimal addresses for the sake of uniformity,
while only using uppercase letters for percent-encodings.
A host identified by an Internet Protocol literal address, version 6
[RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax. In
anticipation of future, as-yet-undefined IP literal address formats,
an implementation may use an optional version flag to indicate such a
format explicitly rather than rely on heuristic determination.
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
The version flag does not indicate the IP version; rather, it
indicates future versions of the literal format. As such,
implementations must not provide the version flag for the existing
IPv4 and IPv6 literal address forms described below. If a URI
containing an IP-literal that starts with "v" (case-insensitive),
indicating that the version flag is present, is dereferenced by an
application that does not know the meaning of that version flag, then
the application should return an appropriate error for "address
mechanism not supported".
A host identified by an IPv6 literal address is represented inside the
square brackets without a preceding version flag. The ABNF provided
here is a translation of the text definition of an IPv6 literal
address provided in [RFC3513]. This syntax does not support IPv6
scoped addressing zone identifiers.
A 128-bit IPv6 address is divided into eight 16-bit pieces. Each piece
is represented numerically in case-insensitive hexadecimal, using one
to four hexadecimal digits (leading zeroes are permitted). The eight
encoded pieces are given most-significant first, separated by colon
characters. Optionally, the least-significant two pieces may instead
be represented in IPv4 address textual format. A sequence of one or
more consecutive zero-valued 16-bit pieces within the address may be
elided, omitting all their digits and leaving exactly two consecutive
colons in their place to mark the elision.
IPv6address = 6( h16 ":" ) ls32
/ "::" 5( h16 ":" ) ls32
/ [ h16 ] "::" 4( h16 ":" ) ls32
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
/ [ *4( h16 ":" ) h16 ] "::" ls32
/ [ *5( h16 ":" ) h16 ] "::" h16
/ [ *6( h16 ":" ) h16 ] "::"
ls32 = ( h16 ":" h16 ) / IPv4address
; least-significant 32 bits of address
h16 = 1*4HEXDIG
; 16 bits of address represented in hexadecimal
A host identified by an IPv4 literal address is represented in
dotted-decimal notation (a sequence of four decimal numbers in the
range 0 to 255, separated by "."), as described in [RFC1123] by
reference to [RFC0952]. Note that other forms of dotted notation may
be interpreted on some platforms, as described in Section 7.4, but
only the dotted-decimal form of four octets is allowed by this
grammar.
IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
dec-octet = DIGIT ; 0-9
/ %x31-39 DIGIT ; 10-99
/ "1" 2DIGIT ; 100-199
/ "2" %x30-34 DIGIT ; 200-249
/ "25" %x30-35 ; 250-255
A host identified by a registered name is a sequence of characters
usually intended for lookup within a locally defined host or service
name registry, though the URI's scheme-specific semantics may require
that a specific registry (or fixed name table) be used instead. The
most common name registry mechanism is the Domain Name System (DNS). A
registered name intended for lookup in the DNS uses the syntax defined
in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123]. Such a name
consists of a sequence of domain labels separated by ".", each domain
label starting and ending with an alphanumeric character and possibly
also containing "-" characters. The rightmost domain label of a fully
qualified domain name in DNS may be followed by a single "." and
should be if it is necessary to distinguish between the complete
domain name and some local domain.
reg-name = *( unreserved / pct-encoded / sub-delims )
If the URI scheme defines a default for host, then that default
applies when the host subcomponent is undefined or when the registered
name is empty (zero length). For example, the "file" URI scheme is
defined so that no authority, an empty host, and "localhost" all mean
the end-user's machine, whereas the "http" scheme considers a missing
authority or empty host invalid.
This specification does not mandate a particular registered name
lookup technology and therefore does not restrict the syntax of reg-
name beyond what is necessary for interoperability. Instead, it
delegates the issue of registered name syntax conformance to the
operating system of each application performing URI resolution, and
that operating system decides what it will allow for the purpose of
host identification. A URI resolution implementation might use DNS,
host tables, yellow pages, NetInfo, WINS, or any other system for
lookup of registered names. However, a globally scoped naming system,
such as DNS fully qualified domain names, is necessary for URIs
intended to have global scope. URI producers should use names that
conform to the DNS syntax, even when use of DNS is not immediately
apparent, and should limit these names to no more than 255 characters
in length.
The reg-name syntax allows percent-encoded octets in order to
represent non-ASCII registered names in a uniform way that is
independent of the underlying name resolution technology. Non-ASCII
characters must first be encoded according to UTF-8 [STD63], and then
each octet of the corresponding UTF-8 sequence must be percent-
encoded to be represented as URI characters. URI producing
applications must not use percent-encoding in host unless it is used
to represent a UTF-8 character sequence. When a non-ASCII registered
name represents an internationalized domain name intended for
resolution via the DNS, the name must be transformed to the IDNA
encoding [RFC3490] prior to name lookup. URI producers should provide
these registered names in the IDNA encoding, rather than a
percent-encoding, if they wish to maximize interoperability with
legacy URI resolvers.
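As a rough illustration of the bracket rule (in Python rather than the question's C++, only because its standard library can parse the result back), the sketch below wraps an IPv6 literal in square brackets before building the URL. The address 2001:db8::1 is a documentation placeholder, and build_udp_url is a hypothetical helper, not part of OpenCV or FFmpeg.

```python
import ipaddress
from urllib.parse import urlsplit


def build_udp_url(host: str, port: int) -> str:
    """Return a udp:// URL, bracketing the host when it is an IPv6 literal."""
    try:
        is_ipv6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        is_ipv6 = False  # not an IP literal, so treat it as a registered name
    return f"udp://[{host}]:{port}" if is_ipv6 else f"udp://{host}:{port}"


url = build_udp_url("2001:db8::1", 2222)  # placeholder address
parts = urlsplit(url)
print(url)             # udp://[2001:db8::1]:2222
print(parts.hostname)  # 2001:db8::1
print(parts.port)      # 2222
```

Without the brackets, a parser cannot tell which colon separates the address from the port, which is why the same string built from an IPv4 address such as 127.0.0.1 parses fine.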

Related

What parts of a URL can be URL-encoded?

My Chrome version 101 allows me to open
https://%65%78%61%6D%70%6C%65%2E%63%6F%6D (https://example.com, encoded except for the https://.)
but not
https://%65%78%61%6D%70%6C%65%2E%63%6F%6D%2F%74%65%73%74 (https://example.com/test, with the path delimiter / also encoded.).
Exactly what parts and what characters of a URL can be URL-encoded, according to the latest specification?
By “parts,” I mean the scheme, username, password, host, port, path, query, fragment, ., :, //, @, ?, #, et cetera.
By “what characters,” I mean “characters of what value in what part.”
By the specification
From RFC 3986.
2.1. Percent-Encoding
….
pct-encoded = "%" HEXDIG HEXDIG
The uppercase hexadecimal digits “A” through “F” are equivalent to the lowercase digits “a” through “f,” respectively. If two URIs differ only in the case of hexadecimal digits used in percent-encoded octets, they are equivalent. For consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings.
Percent-encoding is case-insensitive.
2.2. Reserved Characters
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
The purpose of reserved characters is to provide a set of delimiting characters that are distinguishable from other data within a URI. URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent. Percent-encoding a reserved character, or decoding a percent-encoded octet that corresponds to a reserved character, will change how the URI is interpreted by most applications. Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI.
A subset of the reserved characters (gen-delims) is used as delimiters of the generic URI components described in Section 3. A component’s ABNF syntax rule will not use the reserved or gen-delims rule names directly; instead, each syntax rule lists the characters allowed within that component (i.e., not delimiting it), and any of those characters that are also in the reserved set are “reserved” for use as subcomponent delimiters within the component. Only the most common subcomponents are defined by this specification; other subcomponents may be defined by a URI scheme’s specification, or by the implementation-specific syntax of a URI’s dereferencing algorithm, provided that such subcomponents are delimited by characters in the reserved set allowed within that component.
URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component. If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character’s encoding in US-ASCII.
The characters “:/?#[]@!$&'()*+,;=” are reserved characters.
URL scheme specifications define syntactic URL delimiters to be some characters from the reserved characters.
Syntactic URL delimiters are not percent-encoded.
The reserved characters that are not syntactic URL delimiters can be either percent-encoded or not, but are recommended to be percent-encoded.
2.3. Unreserved Characters
Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource. However, URI comparison implementations do not always perform normalization prior to comparison (see Section 6). For consistency, percent-encoded octets in the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.
6. Normalization and Comparison
…URI comparison is performed for some particular purpose. Protocols or implementations that compare URIs for different purposes will often be subject to differing design trade-offs in regards to how much effort should be spent in reducing aliased identifiers. This section describes various methods that may be used to compare URIs, the trade-offs between them, and the types of applications that might use them.
Characters that are allowed in a URL and are not reserved, that is, “ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~”, are unreserved characters.
The unreserved characters can be either percent-encoded or not, but it is recommended that they not be.
Summary
Syntactic URL delimiters → cannot be percent-encoded.
Other than those → can be either percent-encoded or not.
Percent-encoding is case-insensitive.
How implementations behave
Some implementations don’t do complete, extensive URL normalization. For example, “%68%74%74%70%73://example.com” is a valid URL by the specification, but Chrome (version 101) does not normalize it into “https://example.com” when it’s put into the omnibar.
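The summary above can be checked with a small sketch using Python's urllib.parse (chosen only for illustration; the standard itself is language-neutral). It decodes the percent-encoded host from the question, shows that the hex digits are case-insensitive, and shows that reserved characters are encoded on request while unreserved ones are left alone.

```python
from urllib.parse import quote, unquote

# The percent-encoded host from the question decodes to plain unreserved characters.
encoded_host = "%65%78%61%6D%70%6C%65%2E%63%6F%6D"
print(unquote(encoded_host))  # example.com

# Hex digits in a percent-encoding are case-insensitive.
print(unquote("%2f") == unquote("%2F"))  # True

# With no characters marked safe, reserved characters are percent-encoded...
print(quote("a/b?c", safe=""))  # a%2Fb%3Fc

# ...while unreserved characters are never encoded.
print(quote("AZaz09-._~", safe=""))  # AZaz09-._~
```

Note that decoding %2F to "/" is exactly the step that changes meaning: the encoded form is data, the literal form is a path delimiter.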

Should an ampersand be URL encoded in a query string?

For example I quite often see this URL come up.
https://ghbtns.com/github-btn.html?user=example&repo=card&type=watch&count=true
Is the & meant to be &amp; or should/can it be left as &?
&amp; is for encoding the ampersand in HTML.
For example, in a hyperlink:
…
(Note that this only changes the link, not the URL. The URL is still /github-btn.html?user=example&repo=card&type=watch&count=true.)
While you may encode every & (that is part of the content) with &amp; in HTML, you are only required to encode ambiguous ampersands.
From rfc3986:
Reserved Characters
URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm.
...
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI. URIs
that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent.
Percent-encoding a reserved character, or decoding a percent-encoded
octet that corresponds to a reserved character, will change how the
URI is interpreted by most applications.
...
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
So & within a URL should be percent-encoded (as %26) if it's part of a value and has no delimiting role.
Here's a simple PHP code fragment using the urlencode() function:
<?php
$query_string = 'foo=' . urlencode($foo) . '&bar=' . urlencode($bar);
echo '<a href="mycgi?' . htmlentities($query_string) . '">';
?>
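The two layers of encoding in the PHP fragment can also be sketched in Python, assuming hypothetical values for foo and bar (they are not given in the original). urlencode percent-encodes an ampersand that is part of a value, and html.escape then turns the delimiting ampersand into &amp; for use inside an HTML attribute.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical parameter values; "a&b" contains an ampersand that is data.
query_string = urlencode({"foo": "a&b", "bar": "c d"})
print(query_string)  # foo=a%26b&bar=c+d

# HTML-escape the finished query string before placing it in an attribute.
anchor = f'<a href="mycgi?{escape(query_string)}">'
print(anchor)  # <a href="mycgi?foo=a%26b&amp;bar=c+d">
```

The URL the browser requests still contains a plain & between the parameters; only the HTML source carries &amp;.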

Is the "@" character valid in a URL after the hostname?

"@" is certainly allowed in this case:
http://user:pass@domain.com/foo
However, in these cases:
http://www.foo.com/@bar
http://www.foo.com/?email=a@b.com
Are they 'ok' or should the "@" be encoded?
Similarly, if they are 'ok', does it make the host portion "bar" and "b.com" respectively?
I took a look at the RFC (http://www.ietf.org/rfc/rfc3986.txt) and page 45 uses this example:
ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm
to indicate that the "@" means "10.0.0.1" is the host, but I'm not sure because the query portion didn't start correctly (no "?"). (Also it then mentions "attacks" and I got confused.)
The background: I am trying to determine if Steven Levithan's regex is correct in parsing "http://www.foo.com/@bar" as having a host of "bar":
http://stevenlevithan.com/demo/parseuri/js/
The example you are mentioning is used in the RFC to illustrate how a URI like that can be deceiving to humans. In this case cnn.example.com&story=breaking_news would be the user info portion of the URI, in the same way as user:pass is in your first example.
As for whether @ is allowed in the URI itself, as far as I can tell it is.
If you look at pages 48 and 49 you'll find (among other things) the following rules:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty / *snip*
authority = [ userinfo "@" ] host [ ":" port ]
path-abempty = *( "/" segment )
segment = *pchar
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
Applying this to http://www.foo.com/@bar, we find that scheme is http. The authority contains only the mandatory host portion, which is www.foo.com (userinfo and port are both optional). Together with this authority component, hier-part has a path-abempty component which consists of a single repetition, /@bar. The segment consists of 4 repetitions of pchar: @, b, a, and r. As such, bar is not the hostname.
How well any given browser and/or web server follows the RFC, on the other hand, is an entirely different question.
Disclaimer: I am no expert, and it's been a while since I looked at ABNF in general.
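One way to sanity-check this reading is to run the URLs through a parser. The sketch below uses Python's urllib.parse.urlsplit, which is not a strict RFC 3986 validator but splits the authority from the path and query in the same way: the authority ends at the first "/", "?", or "#", so an "@" after that point cannot introduce a host.

```python
from urllib.parse import urlsplit

urls = (
    "http://user:pass@domain.com/foo",
    "http://www.foo.com/@bar",
    "http://www.foo.com/?email=a@b.com",
    "ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm",
)

for url in urls:
    parts = urlsplit(url)
    # hostname is whatever follows the last "@" inside the authority only.
    print(parts.hostname, parts.path, parts.query)
```

Only the first and last URLs carry userinfo, so only there does the text before "@" get separated from the host (domain.com and 10.0.0.1 respectively); in the other two the host stays www.foo.com.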

Is array syntax using square brackets in URL query strings valid?

Is it actually safe/valid to use multidimensional array syntax in the URL query string?
http://example.com?abc[]=123&abc[]=456
It seems to work in every browser and I always thought it was OK to use, but according to a comment in this article it is not: http://www.456bereastreet.com/archive/201008/what_characters_are_allowed_unencoded_in_query_strings/#comment4
I would like to hear a second opinion.
The answer is not simple.
The following is extracted from section 3.2.2 of RFC 3986 :
A host identified by an Internet Protocol literal address, version 6
[RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax.
This seems to answer the question by flatly stating that square brackets are not allowed anywhere else in the URI. But there is a difference between a square bracket character and a percent encoded square bracket character.
The following is extracted from the beginning of section 3 of RFC 3986 :
Syntax Components
The generic URI syntax consists of a hierarchical sequence of
components referred to as the scheme, authority, path, query, and
fragment.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
So the "query" is a component of the "URI".
The following is extracted from section 2.2 of RFC 3986 :
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must
be percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
So square brackets may appear in a query string, but only if they are percent encoded. Unless they aren't, to be explained further down in section 2.2 :
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
So because square brackets are only allowed in the "host" subcomponent, they "should" be percent encoded in other components and subcomponents, and in this case in the "query" component, unless RFC 3986 explicitly allows unencoded square brackets to represent data in the query component, which it does not.
However, if a "URI producing application" fails to do what it "should" do, by leaving square brackets unencoded in the query, then readers of the URI are not to reject the URI outright. Instead, the square brackets are to be considered as belonging to the data of the query component, since they are not used as delimiters in that component.
This is why, for example, it is not a violation of RFC 3986 when PHP accepts both unencoded and percent encoded square brackets as valid characters in a query string, and even assigns to them a special purpose. However, it would appear that authors who try to take advantage of this loophole by not percent encoding square brackets are in violation of RFC 3986.
According to RFC 3986, the Query component of an URL has the following grammar:
*( pchar / "/" / "?" )
From appendix A of the same RFC:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
[...]
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
[...]
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
My interpretation of this is that anything that isn't:
ALPHA / DIGIT / "-" / "." / "_" / "~" /
"!" / "$" / "&" / "'" / "(" / ")" /
"*" / "+" / "," / ";" / "=" / ":" / "@" / "/" / "?"
...should be pct-encoded, i.e. percent-encoded. Thus [ and ] should be percent-encoded to follow RFC 3986.
David N. Jafferian's answer is fantastic. I just want to add a couple updates and practical notes:
For many years, every browser has left square brackets in query strings unencoded when submitting the request to the server. (Source: https://bugzilla.mozilla.org/show_bug.cgi?id=1152455#c6). As such, I imagine a huge portion of the web has come to rely on this behavior, which makes it extremely unlikely to change.
My reading of the WHATWG URL standard which, at least for web purposes, can be seen as superseding RFC 3986, is that it codifies this behavior of not encoding [ and ] in query strings.
Edit: Based on the comments and other answers, a more correct reading of the WHATWG URL standard is that unencoded [/] are invalid, but also should be tolerated when received/parsed and, once parsed that way, should even be re-serialized without encoding.
I'd ideally like to comment on Ethan's answer really, but don't have sufficient reputation to do it.
I'm not sure that the relevant part of the WHATWG URL standard is being referenced here. I think the correct part might be in the definition of a valid URL-query string, which it describes as being composed of URL units that themselves are formed from URL code points and percent-encoded bytes. Square brackets are not listed within URL code points and thus fall into the percent-encoded bytes category.
Thus, in answer to the original question, multidimensional array syntax (i.e. using square brackets to represent array indexing) within the query part of the URL is valid, provided the square brackets are percent encoded (as %5B for [ and %5D for ]).
My understanding is that square brackets are not first-class citizens anyway. Here is the quote:
https://www.rfc-editor.org/rfc/rfc1738
Other characters are unsafe because gateways and other transport
agents are known to sometimes modify such characters. These
characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".
I always had a temptation to go for that sort of query when I had to pass an array, but I steered away from it. The reason being:
It is not clearly defined in the RFC.
Different languages may interpret it differently.
You have a couple of options to pass an array:
Encode the string representation of the array (JSON, maybe?)
Have parameters like "val1=blah&val2=blah&.." or something like that.
And if you are sure about the language you are using, you can (safely) go for the kind of query string you have (Just that you need to %-encode [] also).
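A short sketch of the percent-encoded form, using Python's urllib.parse for illustration. The parameter name abc[] comes from the question; everything else is an assumption. urlencode emits %5B and %5D for the brackets, and parse_qs happens to accept both the encoded and the unencoded spelling, which matches the tolerant behavior described in the answers above.

```python
from urllib.parse import parse_qs, urlencode

# Producing side: brackets in the parameter name are percent-encoded.
query = urlencode([("abc[]", "123"), ("abc[]", "456")])
print(query)  # abc%5B%5D=123&abc%5B%5D=456

# Consuming side: the encoded and unencoded forms decode to the same data.
print(parse_qs(query))                  # {'abc[]': ['123', '456']}
print(parse_qs("abc[]=123&abc[]=456"))  # {'abc[]': ['123', '456']}
```

Whether "abc[]" is then treated as an array is up to the server-side framework; Python simply returns a list of values per repeated key.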

What is the semicolon reserved for in URLs?

The RFC 3986 URI: Generic Syntax specification lists a semicolon as a reserved (sub-delim) character:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
What is the reserved purpose of the semicolon (";") in URIs? For that matter, what is the purpose of the other sub-delims (I'm only aware of purposes for "&", "+", and "=")?
There is an explanation at the end of section 3.3.
Aside from dot-segments in
hierarchical paths, a path segment is
considered opaque by the generic
syntax. URI producing applications
often use the reserved characters
allowed in a segment to delimit
scheme-specific or
dereference-handler-specific
subcomponents. For example, the
semicolon (";") and equals ("=")
reserved characters are often used
to delimit parameters and parameter
values applicable to that segment.
The comma (",") reserved character is
often used for similar purposes.
For example, one URI producer might
use a segment such as "name;v=1.1"
to indicate a reference to version 1.1
of "name", whereas another might
use a segment such as "name,1.1" to
indicate the same. Parameter types
may be defined by scheme-specific
semantics, but in most cases the
syntax of a parameter is specific to
the implementation of the URI's
dereferencing algorithm.
In other words, it is reserved so that people who want a delimited list of something in the URL can safely use ; as a delimiter even if the parts contain ;, as long as the contents are percent-encoded. For example, you can do this:
foo;bar;baz%3bqux
and interpret it as three parts: foo, bar, baz;qux. If semicolon were not a reserved character, ; and %3b would be equivalent, so the URI would be incorrectly interpreted as four parts: foo, bar, baz, qux.
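The order of operations is what matters here, as a minimal Python sketch shows: split on the literal delimiter first, then percent-decode each piece. Decoding first destroys the distinction between ; as a delimiter and %3b as data.

```python
from urllib.parse import unquote

segment = "foo;bar;baz%3bqux"

# Correct: split on the reserved delimiter, then decode each part.
parts = [unquote(piece) for piece in segment.split(";")]
print(parts)  # ['foo', 'bar', 'baz;qux']

# Incorrect: decoding first turns the data semicolon into a delimiter.
wrong = unquote(segment).split(";")
print(wrong)  # ['foo', 'bar', 'baz', 'qux']
```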
The intent is clearer if you go back to older versions of the specification:
path_segments = segment *( "/" segment )
segment = *pchar *( ";" param )
Each path segment may include a
sequence of parameters, indicated by the semicolon ";" character.
I believe it has its origins in FTP URIs.
Section 3.3 covers this - it's an opaque delimiter a URI-producing application can use if convenient:
Aside from dot-segments in
hierarchical paths, a path segment is
considered opaque by the generic
syntax. URI producing applications
often use the reserved characters
allowed in a segment to delimit
scheme-specific or
dereference-handler-specific
subcomponents. For example, the
semicolon (";") and equals ("=")
reserved characters are often used to
delimit parameters and parameter
values applicable to that segment. The
comma (",") reserved character is
often used for similar purposes. For
example, one URI producer might use a
segment such as "name;v=1.1" to
indicate a reference to version 1.1 of
"name", whereas another might use a
segment such as "name,1.1" to indicate
the same. Parameter types may be
defined by scheme-specific semantics,
but in most cases the syntax of a
parameter is specific to the
implementation of the URI's
dereferencing algorithm.
There are some conventions around its current usage that are interesting. These speak to when to use a semicolon or comma. From the book "RESTful Web Services":
Use punctuation characters to separate multiple pieces of data at the same level of hierarchy. Use commas when the order of the items matters, ... Use semicolons when the order doesn't matter.
Since 2014, path segments are known to contribute to Reflected File Download attacks. Let's assume we have a vulnerable API that reflects whatever we send to it:
https://google.com/s?q=rfd%22||calc||
{"results":["q", "rfd\"||calc||","I love rfd"]}
Now, this is harmless in a browser as it's JSON, so it's not going to be rendered; the browser will instead offer to download the response as a file. This is where path segments come to help (for the attacker):
https://google.com/s;/setup.bat;?q=rfd%22||calc||
Everything between the semicolons (;/setup.bat;) will not be sent to the web service; instead, the browser will interpret it as the file name to use when saving the API response.
Now, a file called setup.bat will be downloaded and run without asking about dangers of running files downloaded from the Internet (because it contains the word "setup" in its name). The contents will be interpreted as a Windows batch file, and the calc.exe command will be run.
Prevention:
sanitize your API's input (in this case, they should just allow alphanumerics); escaping is not sufficient
add Content-Disposition: attachment; filename="whatever.txt" on APIs that are not going to be rendered; Google was missing the filename part which actually made the attack easier
add X-Content-Type-Options: nosniff header to API responses
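The prevention steps might look roughly like the following Python sketch. Both helpers are hypothetical and framework-agnostic (no particular web framework's API is implied): one accepts only alphanumeric input, and the other builds the two response headers named above with an explicit filename.

```python
import re


def is_safe_query(value: str) -> bool:
    """Accept only ASCII alphanumerics, as suggested for this example API."""
    return re.fullmatch(r"[A-Za-z0-9]+", value) is not None


def download_safe_headers(filename: str = "response.txt") -> dict[str, str]:
    """Headers for an API response that is not meant to be rendered."""
    return {
        # An explicit filename stops the browser from deriving one from the URL path.
        "Content-Disposition": f'attachment; filename="{filename}"',
        "X-Content-Type-Options": "nosniff",
    }


print(is_safe_query("rfd"))             # True
print(is_safe_query('rfd"||calc||'))    # False
print(download_safe_headers())
```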
I found the following use cases:
It's the final character of an HTML entity:
List of XML and HTML character entity references
To use one of these character entity references in an HTML or XML
document, enter an ampersand followed by the entity name and a
semicolon, e.g., &amp; for the ampersand ("&").
Apache Tomcat 7 (or newer versions?!) uses it as a path parameter:
Three Semicolon Vulnerabilities
Apache Tomcat is one example of a web server that supports "Path
Parameters". A path parameter is extra content after a file name,
separated by a semicolon. Any arbitrary content after a semicolon does
not affect the landing page of a web browser. This means that
http://example.com/index.jsp;derp will still return index.jsp, and not
some error page.
The data URI scheme uses it to separate the MIME type and its parameters from the data:
Data URI scheme
It can contain an optional character set parameter, separated from the
preceding part by a semicolon (;) .
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot" />
And there was a bug in IIS 5 and IIS 6 to bypass file upload restrictions:
Unrestricted File Upload
Blacklisting File Extensions This protection might be bypassed by: ...
by adding a semi-colon character after the forbidden extension and
before the permitted one (e.g. "file.asp;.jpg")
Conclusion:
Do not use semicolons in URLs or they could accidentally produce an HTML entity or URI scheme.
