Encoding percent symbol in URL creates invalid folder name on Sharepoint Group - microsoft-graph-api

i'm trying to upload file to specific non-existing path on Microsoft Sharepoint Group assuming folder hierarchy will be created based on that path. And that's true.
Problem appears when path segment have special characters. I found MS documentation stating that path segment should be encoded (using escape function in Javascript).
So let's say i'm uploading file File1.txt to a path Test 1/Whatever%Text!Here
Here's what url would look like:
PUT https://graph.microsoft.com/v1.0/groups/<group-id>/drive/items/root:/Test%201/Whatever%25Text%21Here:/children/File1.txt/content
You can see encoded path segment (/Test%201/Whatever%25Text%21Here) and how % is encoded to %25. Seems fine to me. But this URL will create subfolder called Whatever%25Text!Here, not Whatever%Text!Here
%25 stays %25, it's not decoded to %. Does anyone have a clue what's going on?
I was mainly testing through Microsoft Graph Api explorer, trying several different URLs, like % changing to %2525 but without luck.

The % symbol is one of OneDrive for Business' "Reserved Characters".
From the documentation:
OneDrive reserved characters
The following characters are OneDrive reserved characters, and can't be used in OneDrive folder and file names.
onedrive-reserved = "/" / "\" / "*" / "<" / ">" / "?" / ":" / "|"
onedrive-business-reserved = "/" / "\" / "*" / "<" / ">" / "?" / ":" / "|" / "#" / "%"

Related

Discrepancies of Percent Encoding for URLs

After viewing this previous SO question regarding percent encoding, I'm curious as to which styles of encodings are correct - the Wikipedia article on percent encoding alludes to using + instead of %20 for spaces, while still having an application/x-www-urlencoded content type.
This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded. What differences are preferred for path segments vs. query strings? Details and references for this specification would be greatly appreciated.
Note: I assume that non-alphanumeric characters will be encoded via UTF-8, in that each octet for a character becomes a %XX string. Correct me if I am wrong here (for instance latin-1 instead of utf-8), but I am more interested in the differences between the encodings of different parts of a URL.
This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded.
Not only does it depend on the particular URL component, but it also depends on the circumstances in which that component is populated with data.
The use of '+' for encoding space characters is specific to the application/x-www-form-urlencoded format, which applies to webform data that is being submitted in an HTTP request. It does not apply to a URL itself.
The application/x-www-form-urlencoded format is formally defined by W3C in the HTML specifications. Here is the definition from HTML 4.01:
Section 17.13.3 Processing form data, Step four: Submit the encoded form data set
This specification does not specify all valid submission methods or content types that may be used with forms. However, HTML 4 user agents must support the established conventions in the following cases:
• If the method is "get" and the action is an HTTP URI, the user agent takes the value of action, appends a `?' to it, then appends the form data set, encoded using the "application/x-www-form-urlencoded" content type. The user agent then traverses the link to this URI. In this scenario, form data are restricted to ASCII codes.
• If the method is "post" and the action is an HTTP URI, the user agent conducts an HTTP "post" transaction using the value of the action attribute and a message created according to the content type specified by the enctype attribute.
Section 17.13.4 Form content types, application/x-www-form-urlencoded
This is the default content type. Forms submitted with this content type must be encoded as follows:
1.Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').
2.The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
The corresponding HTML5 definitions (Section 4.10.22.3 Form submission algorithm and Section 4.10.22.6 URL-encoded form data) are way more refined and detailed, but for purposes of this discussion, the jist is roughly the same.
So, in the situation where the webform data is submitted via an HTTP GET request instead of a POST request, the webform data is encoded using application/x-www-form-urlencoded and placed as-is in the URL query component.
Per RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:
URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.
'+' is a reserved character:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "#"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
The query component explicitly allows unencoded '+' characters, as it allows characters from sub-delims:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
query = *( pchar / "/" / "?" )
So, in the context of a webform submission, spaces are encoded using '+' prior to then being put as-is into the query component. This is allowed by the URL syntax, since the encoded form of application/x-www-form-urlencoded is compatible with the definition of the query component.
So, for example: http://server/script?field=hello+world
However, outside of a webform submission, putting a space character directly into the query component requires the use of pct-encoded, since ' ' is not included in either unreserved or sub-delims, and is not explicitly allowed by the query definition.
So, for example: http://server/script?hello%20world
Similar rules also apply to the path component, due to its use of pchar:
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "#" )
; non-zero-length segment without any colon ":"
So, although path does allow for unencoded sub-delims characters, a '+' character gets treated as-is, not as an encoded space. application/x-www-form-urlencoded is not used with the path component, so a space character has to be encoded as %20 due to the definitions of pchar and segment-nz-nc.
Now, regarding the charset used to encode characters -
For a webform submission, that charset is dictated by rules defined in the webform encoding algorithm (more so in HTML5 than HTML4) used to prepare the webform data prior to inserting it into the URL. In a nutshell, the HTML can specify an accept-charset attribute or hidden _charset_ field directly in the <form> itself, otherwise the charset is typically the charset used by the parent HTML.
However, outside of a webform submission, there is no formal standard for which charset is used to encode non-ascii characters in a URL component (the IRI syntax, on the other hand, requires UTF-8 especially when converting an IRI into an URI/URL). Outside of IRI, it is up to particular URI schemes to dictate their charsets (the HTTP scheme does not), otherwise the server decides which charset it wants to use. Most schemes/servers use UTF-8 nowadays, but there are still some servers/schemes that use other charsets, typically based on the server's locale (Latin1, Shift-JIS, etc). There have been attempts to add charset reporting directly in the URL and/or in HTTP (such as Deterministic URI Encoding
), but those are not commonly used.

Do browsers ignore slashes in URLs? [duplicate]

This question already has answers here:
url with multiple forward slashes, does it break anything?
(8 answers)
Closed 8 years ago.
I noticed that both Chrome and Firefox ignore slashes between words in a URL.
So, github.com/octocat/hello-world seems to be equivalent to github.com//////octocat////hello-world.
I am writing an application that parses a URL and retrieves a part of it, and thanks to this behavior, I am able to return the original URL without modifying the code, which in my case is rather convenient. I don't know if it would be a good idea to rely on this quirk though.
Path separators are defined to be a single slash according to this. (Search for Path Component)
Note that browsers don't usually modify the URL. Browsers could append a / at the end of a URL, but in your case, the URL with extra slashes is simply sent along in the request, so it is the server ignoring the slashes instead.
Also, have a look at:
Is a URL with // in the path-section valid?
URL with multiple forward slashes, does it break anything?
What does the double slash mean in URLs?
Even if this behavior is convenient for you, it is generally not recommended. In addition, caching may also be affected (source):
Since both your browser and the server cache individual pages (according to their caching settings), requesting same file multiple times via slightly different URIs might affect the caching (depending on server and client implementation).
An empty path segment is valid as per specification:
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "#" )
; non-zero-length segment without any colon ":"
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
In the latter URI https://github.com//////octocat////hello-world, the path //////octocat////hello-world would be composed of:
//////octocat////hello-world: path-abempty
/: segment
/: segment
/: segment
/: segment
/: segment
/octocat: segment-nz
/: segment
/: segment
/: segment
/hello-world: segment-nz
Removing these empty path segments would make up a completely different URI. How the server would handle these empty path segments is a completely different question.
Actually browsers do not ignore them, they pass them to the web server in the HTTP request. It's the server that may decide to ignore them, but technically multiplying slashes results in a different URL.
W3.org specifies that the path part of a URL consists of "path segments", separated by /, and a path segment consists of zero or more "URL units" (characters) except / and ?, so empty path segments are allowed, which is what you get when you duplicate slashes.
See http://www.w3.org/TR/url-1/ for details
Actually browsers do not ignore slashes between URLs.
If you use document.URL in (client side) JavaScript you get the URL with the repeating '///'s.
Similarly in (server side) PHP, when using $_SERVER['REQUEST_URI'] you get the URL with the repeating '///'s.
It is the server, e.g., Apache, that actually redirects to the proper page without URL. In Apache you can write rules in the .htaccess file to not redirect to the page with ///s ignored.

Is array syntax using square brackets in URL query strings valid?

Is it actually safe/valid to use multidimensional array synthax in the URL query string?
http://example.com?abc[]=123&abc[]=456
It seems to work in every browser and I always thought it was OK to use, but accodring to a comment in this article it is not: http://www.456bereastreet.com/archive/201008/what_characters_are_allowed_unencoded_in_query_strings/#comment4
I would like to hear a second opinion.
The answer is not simple.
The following is extracted from section 3.2.2 of RFC 3986 :
A host identified by an Internet Protocol literal address, version 6
[RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax.
This seems to answer the question by flatly stating that square brackets are not allowed anywhere else in the URI. But there is a difference between a square bracket character and a percent encoded square bracket character.
The following is extracted from the beginning of section 3 of RFC 3986 :
Syntax Components
The generic URI syntax consists of a hierarchical sequence of
components referred to as the scheme, authority, path, query, and
fragment.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
So the "query" is a component of the "URI".
The following is extracted from section 2.2 of RFC 3986 :
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must
be percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "#"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
So square brackets may appear in a query string, but only if they are percent encoded. Unless they aren't, to be explained further down in section 2.2 :
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
So because square brackets are only allowed in the "host" subcomponent, they "should" be percent encoded in other components and subcomponents, and in this case in the "query" component, unless RFC 3986 explicitly allows unencoded square brackets to represent data in the query component, which is does not.
However, if a "URI producing application" fails to do what it "should" do, by leaving square brackets unencoded in the query, then readers of the URI are not to reject the URI outright. Instead, the square brackets are to be considered as belonging to the data of the query component, since they are not used as delimiters in that component.
This is why, for example, it is not a violation of RFC 3986 when PHP accepts both unencoded and percent encoded square brackets as valid characters in a query string, and even assigns to them a special purpose. However, it would appear that authors who try to take advantage of this loophole by not percent encoding square brackets are in violation of RFC 3986.
According to RFC 3986, the Query component of an URL has the following grammar:
*( pchar / "/" / "?" )
From appendix A of the same RFC:
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
[...]
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
[...]
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
My interpretation of this is that anything that isn't:
ALPHA / DIGIT / "-" / "." / "_" / "~" /
"!" / "$" / "&" / "'" / "(" / ")" /
"*" / "+" / "," / ";" / "=" / ":" / "#"
...should be pct-encoded, i.e percent-encoded. Thus [ and ] should be percent-encoded to follow RFC 3986.
David N. Jafferian's answer is fantastic. I just want to add a couple updates and practical notes:
For many years, every browser has left square brackets in query strings unencoded when submitting the request to the server. (Source: https://bugzilla.mozilla.org/show_bug.cgi?id=1152455#c6). As such, I imagine a huge portion of the web has come to rely on this behavior, which makes it extremely unlikely to change.
My reading of the WHATWG URL standard which, at least for web purposes, can be seen as superseding RFC 3986, is that it codifies this behavior of not encoding [ and ] in query strings.
Edit: Based on the comments and other answers, a more correct reading of the WHATWG URL standard is that unencoded [/] are invalid, but also should be tolerated when received/parsed and, once parsed that way, should even be re-serialized without encoding.
I'd ideally like to comment on Ethan's answer really, but don't have sufficient reputation to do it.
I'm not sure that the relevant part of the WHATWG URL standard is being referenced here. I think the correct part might be in the definition of a valid URL-query string, which it describes as being composed of URL units that themselves are formed from URL code points and percent-encoded bytes. Square brackets are listed within URL code points and thus fall into the percent-encoded bytes category.
Thus, in answer to the original question, multidimensional array syntax (i.e. using square brackets to represent array indexing) within the query part of the URL is valid, provided the square brackets are percent encoded (as %5B for [ and %5D for ]).
My understanding that square brackets are not first-class citizens anyway. Here is the quote:
https://www.rfc-editor.org/rfc/rfc1738
Other characters are unsafe because gateways and other transport
agents are known to sometimes modify such characters. These
characters are "{", "}", "|", "", "^", "~", "[", "]", and "`".
I always had a temptation to go for that sort of query when I had to pass an array, but I steered away from it. The reason being:
It is not cleared defined in RFC.
Different languages may interpret it differently.
You have a couple of options to pass an array:
Encode the string representation of the array(JSON may be?)
Have parameters like "val1=blah&val2=blah&.." or something like that.
And if you are sure about the language you are using, you can (safely) go for the kind of query string you have (Just that you need to %-encode [] also).

Multiple fragment identifiers correct in URL?

I stumbled across a site that uses multiple fragment identifiers in their URLs, like http://www.ejeby.se/#newprodukt#produkt#1075#1 (no, it is not my site, but I am linking to it, which brings problems for me).
But is this really correct? It does seem to cause problems for Safari and possibly also Internet Explorer (hearsay, I have not tried IE myself).
Isn't the fragment identifier supposed to uniquely identify one location in the document?
Is this a bug in Safari or is it www.ejeby.se that uses fragment idenifiers in a wrong way?
Edit: Seems that the problem for Safari is that it escapes all # but the first in the URL. The other browsers do not do this. Correct behaviour or not?
From the specification point of view, a fragment can contain the following characters (I’ve already expanded the productions):
fragment = *( ALPHA / DIGIT / "-" / "." / "_" / "~" / "%" HEXDIG HEXDIG / "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" / ":" / "#" / "/" / "?" )
So, no, the fragment must not contain a plain #; it must be encoded with %23.
But it is possible that some browsers display it differently just as sequences of percent-encoded octets, that represent valid UTF-8 characters are replaced by the characters they represent.

What is the semicolon reserved for in URLs?

The RFC 3986 URI: Generic Syntax specification lists a semicolon as a reserved (sub-delim) character:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "#"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
What is the reserved purpose of the ";" of the semicolon in URIs? For that matter, what is the purpose of the other sub-delims (I'm only aware of purposes for "&", "+", and "=")?
There is an explanation at the end of section 3.3.
Aside from dot-segments in
hierarchical paths, a path segment is
considered opaque by the generic
syntax. URI producing applications
often use the reserved characters
allowed in a segment to delimit
scheme-specific or
dereference-handler-specific
subcomponents. For example, the
semicolon (";") and equals ("=")
reserved characters are often used
to delimit parameters and parameter
values applicable to that segment.
The comma (",") reserved character is
often used forsimilar purposes.
For example, one URI producer might
use a segment uch as "name;v=1.1"
to indicate a reference to version 1.1
of "name", whereas another might
use a segment such as "name,1.1" to
indicate the same. Parameter types
may be defined by scheme-specific
semantics, but in most cases the
syntax of a parameter is specific to
the implementation of the URI's
dereferencing algorithm.
In other words, it is reserved so that people who want a delimited list of something in the URL can safely use ; as a delimiter even if the parts contain ;, as long as the contents are percent-encoded. In other words, you can do this:
foo;bar;baz%3bqux
and interpret it as three parts: foo, bar, baz;qux. If semicolon were not a reserved character, the ; and %3bwould be equivalent, so the URI would be incorrectly interpreted as four parts: foo, bar, baz, qux.
The intent is clearer if you go back to older versions of the specification:
path_segments = segment *( "/" segment )
segment = *pchar *( ";" param )
Each path segment may include a
sequence of parameters, indicated by the semicolon ";" character.
I believe it has its origins in FTP URIs.
Section 3.3 covers this - it's an opaque delimiter a URI-producing application can use if convenient:
Aside from dot-segments in
hierarchical paths, a path segment is
considered opaque by the generic
syntax. URI producing applications
often use the reserved characters
allowed in a segment to delimit
scheme-specific or
dereference-handler-specific
subcomponents. For example, the
semicolon (";") and equals ("=")
reserved characters are often used to
delimit parameters and parameter
values applicable to that segment. The
comma (",") reserved character is
often used for similar purposes. For
example, one URI producer might use a
segment such as "name;v=1.1" to
indicate a reference to version 1.1 of
"name", whereas another might use a
segment such as "name,1.1" to indicate
the same. Parameter types may be
defined by scheme-specific semantics,
but in most cases the syntax of a
parameter is specific to the
implementation of the URI's
dereferencing algorithm.
There are some conventions around its current usage that are interesting. These speak to when to use a semicolon or comma. From the book "RESTful Web Services":
Use punctuation characters to separate multiple pieces of data at the same level of hierarchy. Use commas when the order of the items matters, ... Use semicolons when the order doesn't matter.
Since 2014, path segments are known to contribute to Reflected File Download attacks. Let's assume we have a vulnerable API that reflects whatever we send to it:
https://google.com/s?q=rfd%22||calc||
{"results":["q", "rfd\"||calc||","I love rfd"]}
Now, this is harmless in a browser as it's JSON, so it's not going to be rendered, but the browser will rather offer to download the response as a file. Now here's the path segments come to help (for the attacker):
https://google.com/s;/setup.bat;?q=rfd%22||calc||
Everything between semicolons (;/setup.bat;) will be not sent to the web service, but instead the browser will interpret it as the file name... to save the API response.
Now, a file called setup.bat will be downloaded and run without asking about dangers of running files downloaded from the Internet (because it contains the word "setup" in its name). The contents will be interpreted as a Windows batch file, and the calc.exe command will be run.
Prevention:
sanitize your API's input (in this case, they should just allow alphanumerics); escaping is not sufficient
add Content-Disposition: attachment; filename="whatever.txt" on APIs that are not going to be rendered; Google was missing the filename part which actually made the attack easier
add X-Content-Type-Options: nosniff header to API responses
I found the following use cases:
It's the final character of an HTML entity:
List of XML and HTML character entity references
To use one of these character entity references in an HTML or XML
document, enter an ampersand followed by the entity name and a
semicolon, e.g., & for the ampersand ("&").
Apache Tomcat 7 (or newer versions?!) us it as path parameter:
Three Semicolon Vulnerabilities
Apache Tomcat is one example of a web server that supports "Path
Parameters". A path parameter is extra content after a file name,
separated by a semicolon. Any arbitrary content after a semicolon does
not affect the landing page of a web browser. This means that
http://example.com/index.jsp;derp will still return index.jsp, and not
some error page.
URI scheme splits by it the MIME and data:
Data URI scheme
It can contain an optional character set parameter, separated from the
preceding part by a semicolon (;) .
<img src="
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot" />
And there was a bug in IIS 5 and IIS 6 to bypass file upload restrictions:
Unrestricted File Upload
Blacklisting File Extensions This protection might be bypassed by: ...
by adding a semi-colon character after the forbidden extension and
before the permitted one (e.g. "file.asp;.jpg")
Conclusion:
Do not use semicolons in URLs or they could accidentally produce an HTML entity or URI scheme.

Resources