Are protocol-relative URLs relative URLs? - url

Consider a protocol-relative URL like this:
//www.example.com/file.jpg
The idea I've had in my head for as long as I can remember is that protocol-relative URLs are in fact absolute URLs. They behave exactly like absolute URLs, and never do they work like relative URLs. I wouldn't expect this to make the browser go find something at
http://www.example.com///www.example.com/file.jpg
The URL defines the host and the path (like an absolute URL does), and the scheme is inherited from whatever the page used, and therefore it makes a complete unambiguous URL, i.e. an absolute URL.
Right?
Now, upon further research into this, I came upon this answer, which states:
A URL is called an absolute URL if it begins with the scheme and scheme specific part (here // after http:). Anything else is a relative URL.
Neither the question nor the answer specifically discuss protocol-relative URLs, so I'm mindful that it can just be an oversight in wording.
However, I'm now also running into an issue in my development, where a system that only accepts absolute URLs doesn't function with protocol-relative URLs, and I don't know if that's by design or due to a bug.
The RFC3986 section which is often linked to in relation to protocol-relative URLs also splashes the word "relative" around a lot. 4.3 then goes on to say that absolute URIs define a scheme.
All this evidence against my initial assumption led me to the question:
Are protocol-relative URLs relative or absolute?

Every relative URL is an unambiguous URL given the URL it is relative to. So if your page is http://mypage.com/some/folder/ then you know the relative URL this/that corresponds to http://mypage.com/some/folder/this/that and you know the relative URL //otherpage.com/ resolves to http://otherpage.com/. Importantly, it cannot be resolved without knowing the page URL it is relative to.
A relative URL is any URL that is relative to something and cannot be resolved by itself. An absolute URL does not require any context whatsoever to resolve.
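The resolution described above can be sketched with Python's standard urllib.parse.urljoin, which applies a relative reference to a base URL. This is only an illustration under assumptions: the HTTPS variant of the page URL is not in the answer and is added here to show the scheme being inherited.

```python
from urllib.parse import urljoin, urlsplit

# Page URL from the example above, plus an assumed HTTPS variant of it.
http_page = "http://mypage.com/some/folder/"
https_page = "https://mypage.com/some/folder/"

# A path-relative URL is resolved against the folder of the page.
path_relative = urljoin(http_page, "this/that")

# A scheme-relative URL keeps its own host and inherits only the scheme.
from_http = urljoin(http_page, "//otherpage.com/")
from_https = urljoin(https_page, "//otherpage.com/")

# Taken on its own, the scheme-relative URL carries no scheme at all,
# which is why it cannot be resolved without a base.
standalone_scheme = urlsplit("//otherpage.com/").scheme

print(path_relative)            # http://mypage.com/some/folder/this/that
print(from_http)                # http://otherpage.com/
print(from_https)               # https://otherpage.com/
print(repr(standalone_scheme))  # ''
```

The same string //otherpage.com/ yields two different results depending on the base, which is the sense in which it is relative.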

What you are calling a “protocol-relative URL” WHATWG calls a “scheme-relative URL” in the URL Standard document, and it is not an absolute URL, but a relative URL.
Granted, most sites available over HTTPS serve the same content at the corresponding HTTP URLs, but that is not guaranteed. It therefore makes sense that a URL that does not include the scheme cannot be considered absolute.
From the document:
An absolute URL must be a scheme, followed by ":", followed by either a scheme-relative URL, if scheme is a relative scheme, or scheme data otherwise, optionally followed by "?" and a query.
Specifically answering your question, we have:
A relative URL must be either a scheme-relative URL, an absolute-path-relative URL, or a path-relative URL that does not start with a scheme and ":", optionally followed by a "?" and a query.
At the point where a relative URL is parsed, a base URL must be in scope.
Examples (brackets indicate optional)
path-relative URL: [path segment][/[path segment]]…
about
about/staff.html
about/staff.html?
about/staff.html?parameters
absolute-path-relative URL: /[path-relative URL]
/
/about
/about/staff.html
/about/staff.html?
/about/staff.html?parameters
scheme-relative URL: //[userinfo@]host[:port][absolute-path-relative URL]
//username:password@example.com:8888
//username@example.com
//example.com
//example.com/
//example.com/about
//example.com/about/staff.html
//example.com/about/staff.html?
//example.com/about/staff.html?parameters
absolute URL: scheme:[scheme-relative URL][?parameters]
https://username:password@example.com:8888
https://username@example.com
https://example.com
https://example.com/
https://example.com/about
https://example.com/about/staff.html
https://example.com/about/staff.html?
https://example.com/about/staff.html?parameters
relative URL:
Anything from scheme-relative URL list
Anything from absolute-path-relative URL list
Anything from path-relative URL list
Note: This answer does not disagree with the first answer, but it only became clear to me that that post answered the question after I read it several times and did further research. Hopefully this answer spells it out better for others who stumble on this.
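As a rough illustration of the categories above, the following Python sketch sorts URL strings into the four groups. The helper name classify is hypothetical, and the check is a simplification based on urllib.parse.urlsplit and leading slashes, not an implementation of the URL Standard's parser.

```python
from urllib.parse import urlsplit


def classify(url: str) -> str:
    """Roughly sort a URL string into the categories listed above."""
    if urlsplit(url).scheme:
        return "absolute URL"
    if url.startswith("//"):
        return "scheme-relative URL"
    if url.startswith("/"):
        return "absolute-path-relative URL"
    return "path-relative URL"


examples = [
    "about/staff.html?parameters",
    "/about/staff.html",
    "//example.com/about",
    "//username:password@example.com:8888",
    "https://example.com/about/staff.html",
]

for example in examples:
    category = classify(example)
    # Everything except an absolute URL needs a base URL to be resolved.
    kind = "absolute" if category == "absolute URL" else "relative"
    print(f"{example} -> {category} ({kind})")
```

Only the last example is reported as absolute; the other four all fall under "relative URL", including the two scheme-relative ones.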

Related

Relative urls vs Protocol-relative URLs

I am just wondering if I use a relative URL as follows:
"/myfolder"
It will change to
mydomain/myfolder
But does it also preserve whether the root is HTTP or HTTPS, similar to the "//" approach?
i.e. if the page loading my relative URL /myfolder has HTTPS will this change to
"https://mydomain/myfolder"
tl;dr: Yes.
Relative references are always applied against a base URI (see how).
In HTML5, the document base URL is, in the common case (i.e., no base element, no iframe-srcdoc document, no about:blank), the document's address.
So if you have a document at http://example.com/foo, a link with the relative reference /bar will link to the URL http://example.com/bar. And if the document is at https://example.com/foo, it will link to https://example.com/bar.
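The behaviour in that last paragraph can be checked with a short Python sketch. It uses urllib.parse.urljoin as a stand-in for the browser's resolution step, which is an assumption about equivalence for this simple case, and it ignores the base element entirely.

```python
from urllib.parse import urljoin

# The two document addresses from the paragraph above.
http_document = "http://example.com/foo"
https_document = "https://example.com/foo"

# The relative reference /bar replaces the path but keeps scheme and host.
http_link = urljoin(http_document, "/bar")
https_link = urljoin(https_document, "/bar")

print(http_link)   # http://example.com/bar
print(https_link)  # https://example.com/bar
```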

What's the difference between beginAt and gotoPage in JWebUnit?

JWebUnit.beginAt:
Begin conversation at a URL absolute or relative to base URL. Use getTestContext().setBaseUrl(String) to define base URL. Absolute URL should start with "http://", "https://" or "www.".
JWebUnit.gotoPage:
Go to the given page like if user has typed the URL manually in the browser. Use getTestContext().setBaseUrl(String) to define base URL. Absolute URL should start with "http://", "https://" or "www.".
So, one says "Begin conversation at URL absolute or relative to base URL", while the other says "Go to the given page like if user has typed the URL manually in the browser". This doesn't help me in the slightest in understanding them (well, specifically the former; the latter makes sense). What's the actual difference between them? Which should I be using, and when?
I finally did manage to find the answer in the source code.
beginAt does two things: start the browser, then call gotoPage with its argument. Thus, you need to use beginAt the first time, and gotoPage subsequent times. (Perhaps if managing multiple windows it has more use; I haven't dug that deeply.)

Jsoup parse link <a href="www.abc.com">

I want to extract links from html, using jsoup
Expected output: absolute link.
I use "abs:href" for that.
This works:
Jsoup.parse("<a \n\r\t href=\"http://www.ibm.com/123/?id=abc\">\nhaha</a>", "http://www.ibm.com");
delivers: http://www.ibm.com/123/?id=abc
This doesn't work:
Jsoup.parse("<a \n\r\t href=\"www.ibm.com/123/?id=abc\">\nhaha</a>", "http://www.ibm.com");
delivers: http://www.ibm.com/www.ibm.com/123/?id=abc
I know it's kind of difficult to know whether "www.ibm.com" is an absolute or a relative link. It might be a top-level domain, but also a folder name. Any proven solutions? Only this hack comes to mind:
String domain = url.replace("http://", "");
url.replace(domain + domain, domain);
Your second example is unambiguously a relative URL. An absolute URL, by definition, starts with a protocol (e.g. http or https). All browsers will give the same output for your example.
Can you provide an example URL that you're working with? Why does it have these pseudo-absolute URLs?
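To illustrate that this is standard resolution rather than a jsoup quirk, here is a small sketch using Python's urllib.parse.urljoin instead of jsoup (a different library, used here only as a comparison). The scheme-relative variant of the href is an assumption added for contrast.

```python
from urllib.parse import urljoin

base = "http://www.ibm.com"

# Without a scheme or a leading "//", the href is just a relative path
# whose first segment happens to look like a host name.
as_written = urljoin(base, "www.ibm.com/123/?id=abc")

# Assumed variants for contrast: with "//" or a scheme, the same text
# is read as a host.
scheme_relative = urljoin(base, "//www.ibm.com/123/?id=abc")
absolute = urljoin(base, "http://www.ibm.com/123/?id=abc")

print(as_written)       # http://www.ibm.com/www.ibm.com/123/?id=abc
print(scheme_relative)  # http://www.ibm.com/123/?id=abc
print(absolute)         # http://www.ibm.com/123/?id=abc
```

The first result matches what "abs:href" delivered in the question, so any fix has to be applied to the input hrefs rather than to the resolver.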

Is the Scheme Optional in URIs?

I was recently asked to add some Woopra JavaScript to a website and noticed that the URL started with a double slash (i.e. omitted the scheme). I've never seen this before, so I went trying to find out more about it, but the only thing I could really find was an item on the Woopra FAQ:
The Woopra JavaScript in the Setup does not include http in the URL call for the script. This is correct. The JavaScript has been optimized to run very fast and efficiently on your site.
However, some validation and site testing/debugging services and tools do not recognize the code as correct. It is correct and valid. If the warnings annoy you, just add the http to the script’s URL. It will not impact the script.
(For clarification, the URL is "//static.woopra.com/js/woopra.v2.js"—the colon is omitted in addition to the "http".)
Is there any more information about this practice? If this is indeed valid, there must be a spec that talks about it, and I'd very much like to see it.
Thanks in advance for satisfying my curiosity!
This is a valid URL. It's called a "network-path reference" as defined in RFC 3986. When you don't specify a scheme/protocol, it will fall back to the current scheme. So if you are viewing a page via https:// all network path references will also use https.
For an example, here's a link to the RFC 3986 document again but with a network path reference. If you were viewing this page over https (although it looks like you can't use https with StackOverflow) the link will reflect your current URI scheme, unlike the first link.
See RFC 3986, section 3:
The generic URI syntax consists of a hierarchical sequence of components referred to as the scheme, authority, path, query, and fragment.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty
          / path-absolute
          / path-rootless
          / path-empty
The scheme and path components are required, though the path may be empty (no characters).
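A brief Python sketch of how the Woopra reference splits into those components and how it picks up the scheme of the page that uses it. The page URLs on example.com are assumptions made for the illustration; only the script URL comes from the question.

```python
from urllib.parse import urljoin, urlsplit

reference = "//static.woopra.com/js/woopra.v2.js"

# The reference has an authority and a path, but no scheme of its own.
parts = urlsplit(reference)
print(repr(parts.scheme))  # ''
print(parts.netloc)        # static.woopra.com
print(parts.path)          # /js/woopra.v2.js

# Resolved against assumed page URLs, it takes the scheme of each page.
on_http_page = urljoin("http://example.com/index.html", reference)
on_https_page = urljoin("https://example.com/index.html", reference)
print(on_http_page)   # http://static.woopra.com/js/woopra.v2.js
print(on_https_page)  # https://static.woopra.com/js/woopra.v2.js
```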

Absolute urls, relative urls, and...?

I am writing some documentation and I have a little vocabulary problem:
http://www.example.com/en/public/img/logo.gif is called an "absolute" url, right?
../../public/img/logo.gif is called a "relative" url, right?
So what do you call this: /en/public/img/logo.gif ?
Is it also considered an "absolute url", although without the protocol and domain parts?
Or is it considered a relative url, but relative to the root of the domain?
I googled a bit and some people categorize this as absolute, and others as relative.
What should I call it? A "semi-absolute url"? Or "semi-relative"? Is there another word?
Here are the URL components:
http://www.example.com/en/public/img/logo.gif
\__/   \_____________/\_____________________/
 #1          #2                #3
#1: scheme/protocol
#2: host
#3: path
A URL is called an absolute URL if it begins with the scheme and scheme specific part (here // after http:). Anything else is a relative URL.
A URL path is called an absolute URL path if it begins with a /. Any other URL path is called a relative URL path.
Thus:
http://www.example.com/en/public/img/logo.gif is an absolute URL,
../../public/img/logo.gif is a relative URL with a relative URL path and
/en/public/img/logo.gif is a relative URL with an absolute URL path.
Note: The current definition of URI (RFC 3986) is different from the old URL definition (RFC 1738 and RFC 1808).
The three examples with URI terms:
http://www.example.com/en/public/img/logo.gif is a URI,
../../public/img/logo.gif is a relative reference with just a relative path and
/en/public/img/logo.gif is a relative reference with just an absolute path.
I have seen it called a root relative URL.
From Microsoft's documentation about absolute and relative URLs:
A URL specifies the location of a target stored on a local or networked computer. The target can be a file, directory, HTML page, image, program, and so on.
An absolute URL contains all the information necessary to locate a resource.
A relative URL locates a resource using an absolute URL as a starting point. In effect, the "complete URL" of the target is specified by concatenating the absolute and relative URLs.
An absolute URL uses the following format: scheme://server/path/resource
A relative URL typically consists only of the path, and optionally, the resource, but no scheme or server. The following tables define the individual parts of the complete URL format.
scheme - Specifies how the resource is to be accessed.
server - Specifies the name of the computer where the resource is located.
path - Specifies the sequence of directories leading to the target. If resource is omitted, the target is the last directory in path.
resource - If included, resource is the target, and is typically the name of a file. It may be a simple file, containing a single binary stream of bytes, or a structured document, containing one or more storages and binary streams of bytes.
It is sometimes called a virtual url, for example in SSI:
<!--#include virtual = "/lib/functions.js" -->
Keep in mind just how many segments of a URL can be omitted, making it relative (note: it's just about all of them). These are all valid URLs:
http://example.com/bar?baz
?qoo=qalue
/bar2
dat/sly
//auth.example.com (most people are surprised by this one! Will use http or https, depending on the current resource)
#anchor
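To make the list concrete, the sketch below resolves each of those URLs against one base with Python's urllib.parse.urljoin. The base URL http://example.com/foo/page?x=1 is an assumption chosen for the illustration, and urljoin is used as an approximation of what a browser does.

```python
from urllib.parse import urljoin

# Assumed URL of the current resource.
base = "http://example.com/foo/page?x=1"

references = [
    "http://example.com/bar?baz",  # nothing omitted
    "?qoo=qalue",                  # only the query given
    "/bar2",                       # scheme and host omitted
    "dat/sly",                     # relative path
    "//auth.example.com",          # only the scheme omitted
    "#anchor",                     # only the fragment given
]

# Map each reference to the URL it resolves to under the assumed base.
resolved = {ref: urljoin(base, ref) for ref in references}

for ref, target in resolved.items():
    print(f"{ref} -> {target}")
```

Each omitted part is filled in from the base, so the fewer parts a reference carries, the more of the result comes from the current resource.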

Resources