Can . (period) be part of the path part of an URL? - url

Is the following URL valid?
http://www.example.com/module.php/lib/lib.php
According to https://www.rfc-editor.org/rfc/rfc1738 section the hpath element of an URL can not contain a '.' (period). There is in the above case a '.' after "module" which is not allowed according to RFC1738.
Am I reading the RFC wrong or is this RFC succeed by another RFC? Some other RFC's allows '.' in URLs (https://www.rfc-editor.org/rfc/rfc1808).

I don't see where RFC1738 disallows periods (.) in URLs. Here are some excerpts from there:
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "#" | "&" | "=" ]
uchar = unreserved | escape
unreserved = alpha | digit | safe | extra
safe = "$" | "-" | "_" | "." | "+"
So the answer to your question is: Yes, http://www.example.com/module.php/lib/lib.php is a valid URL.

As others have noted, periods are allowed in URLs, but be careful. If a single or double period is used in part of a URL's path, the browser will treat it as a change in the path, and you may not get the behavior you want.
For example:
www.example.com/foo/./ redirects to www.example.com/foo/
www.example.com/foo/../ redirects to www.example.com/
Whereas the following will not redirect:
www.example.com/foo/bar.biz/
www.example.com/foo/..biz/
www.example.com/foo/biz../

Periods are allowed. See section "2.3 Unreserved Characters" in this document:
https://www.rfc-editor.org/rfc/rfc3986
"Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde".

Nothing wrong with a period in a url. If you look at the makeup in the grammar in the link you provided a period is mentioned via the 'safe' group, which is included via uchar a
Ignore my answer, Adams is better

Related

Not able to add {} in URI provided by axis-adb

I have the following piece of Code that I need to execute
URI uri = new URI("http://localhost:8080/rest/{data}");
The URI in the above example is from axis2-adb-1.5.1.jar - org.apache.axis2.databinding.types.URI
I tired using axis2-adb-1.6.1.jar as well. I get a MalformedURIException stating "Path Contains invalid character:{".
I can use a workaround and modify the URI to make it work
URI uri = new URI("http://localhost:8080/rest/%7Bdata%7D");
However, I am looking for options wherein I dont need to modify my input.
Moreover, can anyone answer me why does the axis jar have this limitation. I tried looking for explanations but could not find any.
Found few days ago that its not a valid scenario to add curly braces in URL. That can be added only after proper encoding
http://axis.apache.org/axis2/java/core/api/org/apache/axis2/databinding/types/URI.html
states
Parsing of a URI specification is done according to the URI syntax described in RFC 2396, and amended by RFC 2732.
Both RFC 2396 and RFC 2732 prescribes the following
Other characters are excluded because gateways and other transport
agents are known to sometimes modify such characters, or they are
used as delimiters.
unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
Data corresponding to excluded characters must be escaped in order to
be properly represented within a URI.

Is array syntax using square brackets in URL query strings valid?

Is it actually safe/valid to use multidimensional array synthax in the URL query string?
http://example.com?abc[]=123&abc[]=456
It seems to work in every browser and I always thought it was OK to use, but accodring to a comment in this article it is not: http://www.456bereastreet.com/archive/201008/what_characters_are_allowed_unencoded_in_query_strings/#comment4
I would like to hear a second opinion.
The answer is not simple.
The following is extracted from section 3.2.2 of RFC 3986 :
A host identified by an Internet Protocol literal address, version 6
[RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax.
This seems to answer the question by flatly stating that square brackets are not allowed anywhere else in the URI. But there is a difference between a square bracket character and a percent encoded square bracket character.
The following is extracted from the beginning of section 3 of RFC 3986 :
Syntax Components
The generic URI syntax consists of a hierarchical sequence of
components referred to as the scheme, authority, path, query, and
fragment.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
So the "query" is a component of the "URI".
The following is extracted from section 2.2 of RFC 3986 :
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must
be percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "#"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
So square brackets may appear in a query string, but only if they are percent encoded. Unless they aren't, to be explained further down in section 2.2 :
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
So because square brackets are only allowed in the "host" subcomponent, they "should" be percent encoded in other components and subcomponents, and in this case in the "query" component, unless RFC 3986 explicitly allows unencoded square brackets to represent data in the query component, which is does not.
However, if a "URI producing application" fails to do what it "should" do, by leaving square brackets unencoded in the query, then readers of the URI are not to reject the URI outright. Instead, the square brackets are to be considered as belonging to the data of the query component, since they are not used as delimiters in that component.
This is why, for example, it is not a violation of RFC 3986 when PHP accepts both unencoded and percent encoded square brackets as valid characters in a query string, and even assigns to them a special purpose. However, it would appear that authors who try to take advantage of this loophole by not percent encoding square brackets are in violation of RFC 3986.
According to RFC 3986, the Query component of an URL has the following grammar:
*( pchar / "/" / "?" )
From appendix A of the same RFC:
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
[...]
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
[...]
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
My interpretation of this is that anything that isn't:
ALPHA / DIGIT / "-" / "." / "_" / "~" /
"!" / "$" / "&" / "'" / "(" / ")" /
"*" / "+" / "," / ";" / "=" / ":" / "#"
...should be pct-encoded, i.e percent-encoded. Thus [ and ] should be percent-encoded to follow RFC 3986.
David N. Jafferian's answer is fantastic. I just want to add a couple updates and practical notes:
For many years, every browser has left square brackets in query strings unencoded when submitting the request to the server. (Source: https://bugzilla.mozilla.org/show_bug.cgi?id=1152455#c6). As such, I imagine a huge portion of the web has come to rely on this behavior, which makes it extremely unlikely to change.
My reading of the WHATWG URL standard which, at least for web purposes, can be seen as superseding RFC 3986, is that it codifies this behavior of not encoding [ and ] in query strings.
Edit: Based on the comments and other answers, a more correct reading of the WHATWG URL standard is that unencoded [/] are invalid, but also should be tolerated when received/parsed and, once parsed that way, should even be re-serialized without encoding.
I'd ideally like to comment on Ethan's answer really, but don't have sufficient reputation to do it.
I'm not sure that the relevant part of the WHATWG URL standard is being referenced here. I think the correct part might be in the definition of a valid URL-query string, which it describes as being composed of URL units that themselves are formed from URL code points and percent-encoded bytes. Square brackets are listed within URL code points and thus fall into the percent-encoded bytes category.
Thus, in answer to the original question, multidimensional array syntax (i.e. using square brackets to represent array indexing) within the query part of the URL is valid, provided the square brackets are percent encoded (as %5B for [ and %5D for ]).
My understanding that square brackets are not first-class citizens anyway. Here is the quote:
https://www.rfc-editor.org/rfc/rfc1738
Other characters are unsafe because gateways and other transport
agents are known to sometimes modify such characters. These
characters are "{", "}", "|", "", "^", "~", "[", "]", and "`".
I always had a temptation to go for that sort of query when I had to pass an array, but I steered away from it. The reason being:
It is not cleared defined in RFC.
Different languages may interpret it differently.
You have a couple of options to pass an array:
Encode the string representation of the array(JSON may be?)
Have parameters like "val1=blah&val2=blah&.." or something like that.
And if you are sure about the language you are using, you can (safely) go for the kind of query string you have (Just that you need to %-encode [] also).

Can any path segments of a URI have a query component?

According to the Section 3.3, Path Component of RFC2396 - Uniform Resource Identifiers,
The path may consist of a sequence of path segments separated by a single slash "/" character. Within a path segment, the characters "/", ";", "=", and "?" are reserved. Each path segment may include a sequence of parameters, indicated by the semicolon ";" character. The parameters are not significant to the parsing of relative references.
However, I have never seen a URL with a query parameters in any segment other than the final one. So, I am not sure if I am reading this correctly.
Is http://www.url.com/segment1?seg1param1=val1/page.html?pageparam1=val2 a valid URL?
What the RFC is referring to is something like this:
http://www.example.com/foo/bar;param=value/baz.html
That could be interpreted as the path /foo/bar/baz.html with the parameter param=value to the bar segment. No question marks are used.
Note that RFC 2396 has been obsoleted by RFC 3986, which omits specification of segment-specific parameters in favor of a general note that implementations can (and do) do different things to embed segment-specific parameters:
Aside from dot-segments in hierarchical paths, a path segment is
considered opaque by the generic syntax. URI producing applications
often use the reserved characters allowed in a segment to delimit
scheme-specific or dereference-handler-specific subcomponents. For
example, the semicolon (";") and equals ("=") reserved characters are
often used to delimit parameters and parameter values applicable to
that segment. The comma (",") reserved character is often used for
similar purposes. For example, one URI producer might use a segment
such as "name;v=1.1" to indicate a reference to version 1.1 of
"name", whereas another might use a segment such as "name,1.1" to
indicate the same. Parameter types may be defined by scheme-specific
semantics, but in most cases the syntax of a parameter is specific to
the implementation of the URI's dereferencing algorithm.
When you look at the grammar which is just below, it is written:
path = [ abs_path | opaque_part ]
path_segments = segment *( "/" segment )
segment = *pchar *( ";" param )
param = *pchar
pchar = unreserved | escaped |
":" | "#" | "&" | "=" | "+" | "$" | ","
A segment is composed of pchar and param, param being itself a pchar.
When we continue to read, there is absolutely no "?" character in the pchar character components. So the parameters cannot have any "?", and there cannot be a "?" in segments.
So I agree with the answer of Edward Thomson, who says that "?" only delimit the query segment, and cannot be used inside a path.
According to my reading of RFC 2396, no. The ? is a reserved character and serves only to delimit the query segment. The ? is not allowed in either the path or the query segment.
In your example, the first ? marks the beginning of the query segment. The second ? is inside the query segment, and is disallowed.
I believe you could do a get with that and most web servers would process it but I don't believe you would get the results you are expecting. That is the pageparam1=val2 would not evaluate.
If you want parameters like that you could always use the # symbol (as a lot of javascript based GUIs do now).

URL encoding the space character: + or %20?

When is a space in a URL encoded to +, and when is it encoded to %20?
From Wikipedia (emphasis and link added):
When data that has been entered into HTML forms is submitted, the form field names and values are encoded and sent to the server in an HTTP request message using method GET or POST, or, historically, via email. The encoding used by default is based on a very early version of the general URI percent-encoding rules, with a number of modifications such as newline normalization and replacing spaces with "+" instead of "%20". The MIME type of data encoded this way is application/x-www-form-urlencoded, and it is currently defined (still in a very outdated manner) in the HTML and XForms specifications.
So, the real percent encoding uses %20 while form data in URLs is in a modified form that uses +. So you're most likely to only see + in URLs in the query string after an ?.
This confusion is because URLs are still 'broken' to this day.
From a blog post:
Take "http://www.google.com" for instance. This is a URL. A URL is a Uniform Resource Locator and is really a pointer to a web page (in most cases). URLs actually have a very well-defined structure since the first specification in 1994.
We can extract detailed information about the "http://www.google.com" URL:
+---------------+-------------------+
| Part | Data |
+---------------+-------------------+
| Scheme | http |
| Host | www.google.com |
+---------------+-------------------+
If we look at a more complex URL such as:
"https://bob:bobby#www.lunatech.com:8080/file;p=1?q=2#third"
we can extract the following information:
+-------------------+---------------------+
| Part | Data |
+-------------------+---------------------+
| Scheme | https |
| User | bob |
| Password | bobby |
| Host | www.lunatech.com |
| Port | 8080 |
| Path | /file;p=1 |
| Path parameter | p=1 |
| Query | q=2 |
| Fragment | third |
+-------------------+---------------------+
https://bob:bobby#www.lunatech.com:8080/file;p=1?q=2#third
\___/ \_/ \___/ \______________/ \__/\_______/ \_/ \___/
| | | | | | \_/ | |
Scheme User Password Host Port Path | | Fragment
\_____________________________/ | Query
| Path parameter
Authority
The reserved characters are different for each part.
For HTTP URLs, a space in a path fragment part has to be encoded to "%20" (not, absolutely not "+"), while the "+" character in the path fragment part can be left unencoded.
Now in the query part, spaces may be encoded to either "+" (for backwards compatibility: do not try to search for it in the URI standard) or "%20" while the "+" character (as a result of this ambiguity) has to be escaped to "%2B".
This means that the "blue+light blue" string has to be encoded differently in the path and query parts:
"http://example.com/blue+light%20blue?blue%2Blight+blue".
From there you can deduce that encoding a fully constructed URL is impossible without a syntactical awareness of the URL structure.
This boils down to:
You should have %20 before the ? and + after.
Source
I would recommend %20.
Are you hard-coding them?
This is not very consistent across languages, though.
If I'm not mistaken, in PHP urlencode() treats spaces as + whereas Python's urlencode() treats them as %20.
EDIT:
It seems I'm mistaken. Python's urlencode() (at least in 2.7.2) uses quote_plus() instead of quote() and thus encodes spaces as "+".
It seems also that the W3C recommendation is the "+" as per here: http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1
And in fact, you can follow this interesting debate on Python's own issue tracker about what to use to encode spaces: http://bugs.python.org/issue13866.
EDIT #2:
I understand that the most common way of encoding " " is as "+", but just a note, it may be just me, but I find this a bit confusing:
import urllib
print(urllib.urlencode({' ' : '+ '})
>>> '+=%2B+'
A space may only be encoded to "+" in the "application/x-www-form-urlencoded" content-type key-value pairs query part of an URL. In my opinion, this is a may, not a must. In the rest of URLs, it is encoded as %20.
In my opinion, it's better to always encode spaces as %20, not as "+", even in the query part of an URL, because it is the HTML specification (RFC 1866) that specified that space characters should be encoded as "+" in "application/x-www-form-urlencoded" content-type key-value pairs (see paragraph 8.2.1. subparagraph 1.)
This way of encoding form data is also given in later HTML specifications. For example, look for relevant paragraphs about application/x-www-form-urlencoded in HTML 4.01 Specification, and so on.
Here is a sample string in a URL where the HTML specification allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So, only after "?", spaces can be replaced by pluses. In other cases, spaces should be encoded to %20. But since it's hard to determine the context correctly, it's the best practice to never encode spaces as "+".
I would recommend to percent-encode all character except "unreserved" defined in RFC 3986, p.2.3
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
The implementation depends on the programming language that you chose.
If your URL contains national characters, first encode them to UTF-8 and then percent-encode the result.
To summarize the (somewhat conflicting) answers here, I think it can be boiled down to:
| standard | + | %20 |
|---------------+-----+-----|
| URL | no | yes |
| query string | yes | yes |
| form params | yes | no |
| mailto query | no | yes |
So historically I think what happened is:
The RFC specified a pretty clear standard about the form of URLs and how they are encoded. In this context the query is just a "string", there is no specification how key/value pairs should be encoded
The HTTP guys put out a standard of how key/value pairs are to be encoded in form params, and borrowed from the URL encoding standard, except that spaces should be encoded as +.
The web guys said: cool we have a way to encode key/value pairs let's put that into the URL query string
Result: We end up with two different ways how to encode spaces in a URL depending on which part you're talking about. But it doesn't even violate the URL standard. From URL perspective the "query" is just a blackbox. If you want to use other encodings there besides percent encoding: knock yourself out.
But as the email example shows it can be problematic to borrow from the form-params implementation for an URL query string. So ultimately using %20 is safer but there might not be out of the box library support for it.

In a URL, should spaces be encoded using %20 or +? [duplicate]

This question already has answers here:
URL encoding the space character: + or %20?
(5 answers)
Closed 6 years ago.
In a URL, should I encode the spaces using %20 or +? For example, in the following example, which one is correct?
www.mydomain.com?type=xbox%20360
www.mydomain.com?type=xbox+360
Our company is leaning to the former, but using the Java method URLEncoder.encode(String, String) with "xbox 360" (and "UTF-8") returns the latter.
So, what's the difference?
Form data (for GET or POST) is usually encoded as application/x-www-form-urlencoded: this specifies + for spaces.
URLs are encoded as RFC 1738 which specifies %20.
In theory I think you should have %20 before the ? and + after:
example.com/foo%20bar?foo+bar
According to the W3C (and they are the official source on these things), a space character in the query string (and in the query string only) may be encoded as either "%20" or "+". From the section "Query strings" under "Recommendations":
Within the query string, the plus sign is reserved as shorthand notation for a space. Therefore, real plus signs must be encoded. This method was used to make query URIs easier to pass in systems which did not allow spaces.
According to section 3.4 of RFC2396 which is the official specification on URIs in general, the "query" component is URL-dependent:
3.4. Query Component
The query component is a string of information to be interpreted by
the resource.
query = *uric
Within a query component, the characters ";", "/", "?", ":", "#",
"&", "=", "+", ",", and "$" are reserved.
It is therefore a bug in the other software if it does not accept URLs with spaces in the query string encoded as "+" characters.
As for the third part of your question, one way (though slightly ugly) to fix the output from URLEncoder.encode() is to then call replaceAll("\\+","%20") on the return value.
This confusion is because URL is still 'broken' to this day
Take "http://www.google.com" for instance. This is a URL. A URL
is a Uniform Resource Locator and is really a pointer to a web page
(in most cases). URLs actually have a very well-defined structure
since the first specification in 1994.
We can extract detailed information about the "http://www.google.com"
URL:
+---------------+-------------------+
| Part | Data |
+---------------+-------------------+
| Scheme | http |
| Host address | www.google.com |
+---------------+-------------------+
If we look at a more
complex URL such as
"https://bob:bobby#www.lunatech.com:8080/file;p=1?q=2#third" we can
extract the following information:
+-------------------+---------------------+
| Part | Data |
+-------------------+---------------------+
| Scheme | https |
| User | bob |
| Password | bobby |
| Host address | www.lunatech.com |
| Port | 8080 |
| Path | /file |
| Path parameters | p=1 |
| Query parameters | q=2 |
| Fragment | third |
+-------------------+---------------------+
The reserved characters are different for each part
For HTTP URLs, a space in a path fragment part has to be encoded to
"%20" (not, absolutely not "+"), while the "+" character in the path
fragment part can be left unencoded.
Now in the query part, spaces may be encoded to either "+" (for
backwards compatibility: do not try to search for it in the URI
standard) or "%20" while the "+" character (as a result of this
ambiguity) has to be escaped to "%2B".
This means that the "blue+light blue" string has to be encoded
differently in the path and query parts:
"http://example.com/blue+light%20blue?blue%2Blight+blue". From there
you can deduce that encoding a fully constructed URL is impossible
without a syntactical awareness of the URL structure.
What this boils down to is
you should have %20 before the ? and + after
Source
It shouldn't matter, any more than if you encoded the letter A as %41.
However, if you're dealing with a system that doesn't recognize one form, it seems like you're just going to have to give it what it expects regardless of what the "spec" says.
You can use either - which means most people opt for "+" as it's more human readable.
When encoding query values, either form, plus or percent-20, is valid; however, since the bandwidth of the internet isn't infinite, you should use plus, since it's two fewer bytes.

Resources