Jsoup parse link <a href="www.abc.com"> - html-parsing

I want to extract links from html, using jsoup
Expected output: absolute link.
I use "abs:href" for that.
This works:
Jsoup.parse("<a \n\r\t href=\"http://www.ibm.com/123/?id=abc\">\nhaha</a>", "http://www.ibm.com");
delivers: http://www.ibm.com/123/?id=abc
This doesnt work:
Jsoup.parse("<a \n\r\t href=\"www.ibm.com/123/?id=abc\">\nhaha</a>", "http://www.ibm.com");
delivers: http://www.ibm.com/www.ibm.com/123/?id=abc
I know its kinda difficult to know whether "www.ibm.com" is an absolute or relative link. It might be a top level domain, but also a foldername. Any proven solutions? Just this hack comes into my mind:
String domain = url.replace("http://", "");
url.replace(domain + domain, domain);

Your second example is unambiguously a relative URL. An absolute URL, by definition, starts with a protocol (e.g. http or https). All browsers will give the same output for your example.
Can you provide an example URL that you're working with? Why does it have these pseudo-absolute URLs?

Related

What does "URL=" in a URL Mean?

May be a dumb question, but it's been bugging me recently. I see "URL=" inside alot of URL's, such as this one:
http://www.tierraantigua.com/search-2?url=http%3A%2F%2Flink.flexmls.com%2Fwws30ham
What exactly is this used to do? Is it part of the iFrame functionality? I know the last part of the URL (after the URL=) is the part being displayed in the iFrame, but I'm unsure of why it is included in the primary URL as well.
Thanks!
The url you see here is just a standard query parameter wit the name url and the encoded value http%3A%2F%2Flink.flexmls.com%2Fwws30ham which decodes to http://link.flexmls.com/Fwws30ham. Most of the times it is used for determining redirection or source information by the application you are using. It is entirely domain-specific and can have any meaning the website developer would like to use.
PHP GET
Description ¶
An associative array of variables passed to the current script via the URL parameters.
$url = $_GET['url'];
echo $url; // http%3A%2F%2Flink.flexmls.com%2Fwws30ham

What's the difference between beginAt and gotoPage in JWebUnit?

JWebUnit.beginAt:
Begin conversation at a URL absolute or relative to base URL. Use getTestContext().setBaseUrl(String) to define base URL. Absolute URL should start with "http://", "https://" or "www.".
JWebUnit.gotoPage:
Go to the given page like if user has typed the URL manually in the browser. Use getTestContext().setBaseUrl(String) to define base URL. Absolute URL should start with "http://", "https://" or "www.".
So, one says "Begin conversation at URL absolute or relative to base URL", while the other says "Go to the given page like if user has typed the URL manually in the browser". This doesn't help me in the slightest in understanding them (well, specifically the former; the latter makes sense). What's the actual difference between them? Which should I be using, and when?
I finally did manage to find the answer in the source code.
beginAt does two things: start the browser, then call gotoPage with its argument. Thus, you need to use beginAt the first time, and gotoPage subsequent times. (Perhaps if managing multiple windows it has more use; I haven't dug that deeply.)

Are protocol-relative URLs relative URLs?

So consider a protocol-relative URL like so;
//www.example.com/file.jpg
The idea I've had in my head for as long as I can remember is that protocol-relative URLs are in fact absolute URLs. They behave exactly like absolute URLs, and never do they work like relative URLs. I wouldn't expect this to make the browser go find something at
http://www.example.com///www.example.com/file.jpg
The URL defines the host and the path (like an absolute URL does), and the scheme is inherited from whatever the page used, and therefore it makes a complete unambiguous URL, i.e. an absolute URL.
Right?
Now, upon further research into this, I came upon this answer, which states;
A URL is called an absolute URL if it begins with the scheme and scheme specific part (here // after http:). Anything else is a relative URL.
Neither the question nor the answer specifically discuss protocol-relative URLs, so I'm mindful that it can just be an oversight in wording.
However, I'm now also now running into an issue in my development, where a system that only accepts absolute URLs doesn't function with protocol-relative URLs, and I don't know if that's by design or due to a bug.
The RFC3986 section which is often linked to in relation to protocol-relative URLs also splashes the word "relative" around a lot. 4.3 then goes on to say that absolute URIs define a scheme.
All this evidence against my initial assumption led me to the question;
Are protocol-relative URLs relative or absolute?
Every relative URL is an unambiguous URL given the URL it is relative to. So if your page is http://mypage.com/some/folder/ then you know the relative URL this/that corresponds to http://mypage.com/some/folder/this/that and you know the relative URL //otherpage.com/ resolves to http://otherpage.com/. Importantly, it cannot be resolved without knowing the page URL it is relative to.
A relative URL is any URL that is relative to something and cannot be resolved by itself. An aboslute URL does not require any context whatsoever to resolve.
What you are calling a “protocol-relative URL” WHATWG calls a “scheme-relative URL” in the URL Standard document, and it is not an absolute URL, but a relative URL.
Granted most sites available on HTTPS show the same content on the corresponding HTTP URLs, that is not necessarily the case, and it therefore makes sense a URL that does not include the scheme cannot be considered absolute.
From the document:
An absolute URL must be a scheme, followed by ":", followed by either a scheme-relative URL, if scheme is a relative scheme, or scheme data otherwise, optionally followed by "?" and a query.
Specifically answering your question, we have:
A relative URL must be either a scheme-relative URL, an absolute-path-relative URL, or a path-relative URL that does not start with a scheme and ":", optionally followed by a "?" and a query.
At the point where a relative URL is parsed, a base URL must be in scope.
Examples (brackets indicate optional)
path-relative URL [path segment][/[path segment]]…
about
about/staff.html
about/staff.html?
about/staff.html?parameters
absolute-path-relative URL: /[path-relative URL]
/
/about
/about/staff.html
/about/staff.html?
/about/staff.html?parameters
scheme-relative URL: //[userinfo#]host[:port][absolute-path-relative URL]
//username:password#example.com:8888
//username#example.com
//example.com
//example.com/
//example.com/about
//example.com/about/staff.html
//example.com/about/staff.html?
//example.com/about/staff.html?parameters
absolute URL: scheme:[scheme-relative URL][?parameters]
https://username:password#example.com:8888
https://username#example.com
https://example.com
https://example.com/
https://example.com/about
https://example.com/about/staff.html
https://example.com/about/staff.html?
https://example.com/about/staff.html?parameters
relative URL:
Anything from scheme-relative URL list
Anything from absolute-path-relative URL list
Anything from path-relative URL list
Note: This answer does not disagree with the first answer, but it was only somewhat clear to me that post answered the question after reading it several times and doing further research. Hopefully this answer spells it out better for others stumbling on this.

How do SO URLs self correct themselves if they are mistyped?

If an extra character (like a period, comma or a bracket or even alphabets) gets accidentally added to URL on the stackoverflow.com domain, a 404 error page is not thrown. Instead, URLs self correct themselves & the user is led to the relevant webpage.
For instance, the extra 4 letters I added to the end of a valid SO URL to demonstrate this would be automatically removed when you access the below URL -
https://stackoverflow.com/questions/194812/list-of-freely-available-programming-booksasdf
I guess this has something to do with ASP.NET MVC Routing. How is this feature implemented?
Well, this is quite simple to explain I guess, even without knowing the code behind it:
The text is just candy for search engines and people reading the URL:
This URL will work as well, with the complete text removed!
The only part really important is the question ID that's also embedded in the "path".
This is because EVERYTHING after http://stackoverflow.com/questions/194812 is ignored. It is just there to make the link, if posted somewhere, if more speaking.
Internally the URL is mapped to a handler, e.g., by a rewrite, that transforms into something like: http://stackoverflow.com/questions.php?id=194812 (just an example, don't know the correct internal URL)
This also makes the URL search engine friendly, besides being more readable to humans.

trouble constructiing url encoded link

if i do this:
<a target="_blank" href="<%=Url.Encode(sitelink)%>"> LINK TO SITE</a>
I get the link encoded but prepended with the current local domain "http://localhost/http://...."
whats the proper way to do this
The Url.Encode method is used to escape special characters for usage in the query part of a url - it's not meant to be applied to the entire url, because that will escape things like the :// at the beginning (which is why you get the local domain prepended, because it's no longer a full URL, instead getting interpreted as a relative url).

Resources