I working on a routine that parses a URI. Among obvious cases there is an empty string case. Is the empty string a valid input? What would be an outcome URI of an empty string?
An empty string can’t possibly be a URI. The general URI syntax specifies that at least the scheme component followed by a : followed by the hier-part component (which may be empty) must be present.
But a relative URI reference can be empty. The syntax of relative references specifies that at least the relative-part component must be present, but it’s allowed to be empty (path-empty).
An empty relative URI reference is a same-document reference (bold emphasis mine):
The most frequent examples of same-document references are relative references that are empty or include only the number sign ("#") separator followed by a fragment identifier.
Since the root of a URI is a limited set we know it does not have a valid root.
It therefore does not identify anything and is invalid as an identifier. But this might have a meaning in your context still since tere is no specific standard for a URI.
You can parse an empty String to a Uri without an exception but its fields will be null or their default invalid values.
val myUri = Uri.parse("")
myUri.host is null and
myUri.port is -1
Related
I cannot find documentation anywhere regarding whether the following URL that has a query string is valid.
http://www.example.com/webapp&someKey=someValue
I know that ? starts a list of key-value pairs separated by &.
Is the ? required?
? appears to be required for the trailing part to be called query.
Query string is defined in RFC 3986. Section 3.3 Path says:
The path component contains data, usually organized in hierarchical
form, that, along with data in the non-hierarchical query component
(Section 3.4), serves to identify a resource within the scope of the
URI's scheme and naming authority (if any). The path is terminated
by the first question mark ("?") or number sign ("#") character, or
by the end of the URI.
Section 3.4 defines query:
The query component contains non-hierarchical data that, along with
data in the path component (Section 3.3), serves to identify a
resource within the scope of the URI's scheme and naming authority
(if any). The query component is indicated by the first question
mark ("?") character and terminated by a number sign ("#") character
or by the end of the URI.
RFC 1738 for URL has a section for HTTP URL scheme. It says in section 3.3 that:
An HTTP URL takes the form:
http://<host>:<port>/<path>?<searchpart>
where and are as described in Section 3.1. If :
is omitted, the port defaults to 80. No user name or password is
allowed. is an HTTP selector, and is a query
string. The is optional, as is the and its
preceding "?". If neither nor is present, the "/"
may also be omitted.
Within the and components, "/", ";", "?" are
reserved. The "/" character may be used within HTTP to designate a
hierarchical structure.
You can use tricks to take the URI as you mention and then split it as if it was a query string. Frameworks like Laravel, Django etc. allow you to handle routes in a query string like manner. There's more to it than what I say; I was just giving an example about Frameworks' handling of URIs.
Look at this example from Laravel documentation: https://laravel.com/docs/7.x/routing#required-parameters. It shows how Laravel takes a route like https://site/posts/1/comments/3 and handles the post id 1 and comment id 3 through a function.
Route::get('posts/{post}/comments/{comment}', function ($postId, $commentId) {
//
});
You can, perhaps, handle routes like http://site/webapp/somekey/somevalue.
A url without parameters but with a question mark appended at the end is passed to the client to be parsed and used.
I've been told the client should be robust enough to handle such url and proceed. But shouldn't this be fixed server-side?
thanks
An empty query part is not an error, so it definitely needs to be accepted by the client. (Reference: RFC 3986 section 3.4 which shows the syntax of the query as 0 or more allowable characters.)
An empty query is different from an undefined query (i.e., the URI does not contain a ?). If the base URI contains a query component, merging a relative URI with an empty query will override the base URI's query, whereas merging a relative URI without a query will copy the base URI's query into the merged result.
Just say I have the following url that has a query string parameter that's an url:
http://www.someSite.com?next=http://www.anotherSite.com?test=1&test=2
Should I url encode the next parameter? If I do, who's responsible for decoding it - the web browser, or my web app?
The reason I ask is I see lots of big sites that do things like the following
http://www.someSite.com?next=http://www.anotherSite.com/another/url
In the above, they don't bother encoding the next parameter because I'm guessing, they know it doesn't have any query string parameters itself. Is this ok to do if my next url doesn't include any query string parameters as well?
RFC 2396 sec. 2.2 says that you should URL-encode those symbols anywhere where they're not used for their explicit meanings; i.e. you should always form targetUrl + '?next=' + urlencode(nextURL).
The web browser does not 'decode' those parameters at all; the browser doesn't know anything about the parameters but just passes along the string. A query string of the form http://www.example.com/path/to/query?param1=value¶m2=value2 is GET-requested by the browser as:
GET /path/to/query?param1=value¶m2=value2 HTTP/1.1
Host: www.example.com
(other headers follow)
On the backend, you'll need to parse the results. I think PHP's $_REQUEST array will have already done this for you; in other languages you'll want to split over the first ? character, then split over the & characters, then split over the first = character, then urldecode both the name and the value.
According to RFC 3986:
The query component is indicated by the first question mark ("?")
character and terminated by a number sign ("#") character or by the
end of the URI.
So the following URI is valid:
http://www.example.com?next=http://www.example.com
The following excerpt from the RFC makes this clear:
... as query components are often used to carry identifying
information in the form of "key=value" pairs and one frequently used
value is a reference to another URI, it is sometimes better for
usability to avoid percent-encoding those characters.
It is worth noting that RFC 3986 makes RFC 2396 obsolete.
I'm implementing a login system using OpenID.
The doc says :
Subject Identifier
An identifier for a set of attributes. It MUST be a URI. The subject identifier corresponds to the end-user identifier in the authentication portion of the messages. In other words, the subject of the identity attributes in the attribute exchange part of the message is the same as the end-user in the authentication part. The subject identifier is not included in the attribute exchange.
URI are quite larges in definition, it can be http://, but also gopher://.
I'm sure gopher is not a valid URI protocol, but then, excluding http(s), what else is allowed as a subject identifier from the OpenID protocol ?
You're quoting the wrong spec. The openid specification, section 7.2 says:
7.2. Normalization
The end user's input MUST be normalized into an Identifier, as follows:
If the user's input starts with the "xri://" prefix, it MUST be stripped off, so that XRIs are used in the canonical form.
If the first character of the resulting string is an XRI Global Context Symbol ("=", "#", "+", "$", "!") or "(", as defined in Section 2.2.1 of [XRI_Syntax_2.0], then the input SHOULD be treated as an XRI.
Otherwise, the input SHOULD be treated as an http URL; if it does not include a "http" or "https" scheme, the Identifier MUST be prefixed with the string "http://". If the URL contains a fragment part, it MUST be stripped off together with the fragment delimiter character "#". See Section 11.5.2 for more information.
URL Identifiers MUST then be further normalized by both following redirects when retrieving their content and finally applying the rules in Section 6 of [RFC3986] to the final destination URL. This final URL MUST be noted by the Relying Party as the Claimed Identifier and be used when requesting authentication.
From the third point we can infer that the identifier must be either a http(s) URL or an XRI.
Is a url like http://example.com/foo?bar valid?
I'm looking for a link to something official that says one way or the other. A simple yes/no answer or anecdotal evidence won't cut it.
Valid to the URI RFC
Likely acceptable to your server-side framework/code
The URI RFC doesn't mandate a format for the query string. Although it is recognized that the query string will often carry name-value pairs, it is not required to (e.g. it will often contain another URI).
3.4. Query
The query component contains non-hierarchical data that, along with
data in the path component (Section 3.3), serves to identify a
resource within the scope of the URI's scheme and naming authority
(if any). ...
... However, as query components
are often used to carry identifying information in the form of
"key=value" pairs and one frequently used value is a reference to
another URI, ...
HTML establishes that a form submitted via HTTP GET should encode the form values as name-value pairs in the form "?key1=value1&key2=value2..." (properly encoded). Parsing of the query string is up to the server-side code (e.g. Java servlet engine).
You don't identify what server-side framework you use, if any, but it is possible that your server-side framework may assume the query string will always be in name-value pairs and it may choke on a query string that is not in that format (e.g. ?bar). If its your own custom code parsing the query string, you simply have to ensure you handle that query string format. If its a framework, you'll need to consult your documentation or simply test it to see how it is handled.
They're perfectly valid. You could consider them to be the equivalent of the big muscled guy standing silently behind the mob messenger. The guy doesn't have a name and doesn't speak, but his mere presence conveys information.
"The "http" scheme is used to locate network resources via the HTTP protocol. This section defines the scheme-specific syntax and semantics for http URLs." http://www.w3.org/Protocols/rfc2616/rfc2616.html
http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]
So yes, anything is valid after a question mark. Your server may interpret differently, but anecdotally, you can see some languages treat that as a boolean value which is true if listed.
Yes, it is valid.
If one simply want to check if the parameter exists or not, this is one way to do so.
URI Spec
The only relevant part of the URI spec is to know everything between the first ? and the first # fits the spec's definition of a query. It can include any characters such as [:/.?]. This means that a query string such as ?bar, or ?ten+green+apples is valid.
Find the RFC 3986 here
HTML Spec
isindex is not meaningfully HTML5.
It's provided deprecated for use as the first element in a form only, and submits without a name.
If the entry's name is "isindex", its type is "text", and this is the first entry in the form data set, then append the value to result and skip the rest of the substeps for this entry, moving on to the next entry, if any, or the next step in the overall algorithm otherwise.
The isindex flag is for legacy use only. Forms in conforming HTML documents will not generate payloads that need to be decoded with this flag set.
The last time isindex was supported was HTML3. It's use in HTML5 is to provide easier backwards compatibility.
Support in libraries
Support in libraries for this format of URI varies however some libraries do provide legacy support to ease use of isindex.
Perl URI.pm (special support)
Some libraries like Perl's URI provide methods of parsing these kind of structures
$uri->query_keywords
$uri->query_keywords( $keywords, ... )
$uri->query_keywords( \#keywords )
Sets and returns query components that use the keywords separated by "+" format.
Node.js url (no special support)
As another far more frequent example, node.js takes the normal route and eases parsing as either
A string
or, an object of keys and values (using parseQueryString)
Most other URI-parsing APIs following something similar to this.
PHP parse_url, follows as similar implementation but only returns the string for the query. Parsing into an object of k=>v requires parse_string()
It is valid: see Wikipedia, RFC 1738 (3.3. HTTP), RFC 3986 (3. Syntax Components).
isindex deprecated magic name from HTML5
This deprecated feature allows a form submission to generate such an URL, providing further evidence that it is valid for HTML. E.g.:
<form action="#isindex" class="border" id="isindex" method="get">
<input type="text" name="isindex" value="bar"/>
<button type="submit">Submit</button>
</form>
generates an URL of type:
?bar
Standard: https://www.w3.org/TR/html5/forms.html#naming-form-controls:-the-name-attribute
isindex is however deprecated as mentioned at: https://stackoverflow.com/a/41689431/895245
As all other answers described, it's perfectly valid for checking, specially for boolean kind stuff
Here is a simple function to get the query string by name:
function getParameterByName(name, url) {
if (!url) {
url = window.location.href;
}
name = name.replace(/[\[\]]/g, "\\$&");
var regex = new RegExp("[?&]" + name + "(=([^&#]*)|&|#|$)"),
results = regex.exec(url);
if (!results) return null;
if (!results[2]) return '';
return decodeURIComponent(results[2].replace(/\+/g, " "));
}
and now you want to check if the query string you are looking for exists or not, you may do a simple thing like:
var exampleQueryString = (getParameterByName('exampleQueryString') != null);
the exampleQueryString will be false if the function can't find the query string, otherwise will be true.
The correct resource to look for this is RFC6570. Please refer to section 3.2.9 where in examples empty parameter is presented as below.
Example Template Expansion
{&x,y,empty} &x=1024&y=768&empty=