I'm writing a web crawler in Python that takes a URL and does a depth-first search, following links down to some limited depth. The problem I'm having is interpreting relative paths in URLs.
On the page http://learnyouahaskell.com/introduction/ have a look at the "Starting Out" link; it looks like Starting Out. How can I determine whether this link refers to "http://learnyouahaskell.com/introduction/starting-out" or "http://learnyouahaskell.com/starting-out"? The second one is correct according to my browser.
Yet on the page http://math.colgate.edu/~mionescu/math399s11/ there is a link here which resolves to "http://math.colgate.edu/~mionescu/math399s11/Finalprojects.pdf".
Can someone explain this inconsistency to me? How can I determine how these paths should be resolved in my crawler?
The reason for this apparent inconsistency is that the learnyouahaskell site uses a <base href=""> tag in its source. This makes every relative href on the page resolve against the base URL instead of the URL of the page itself.
Without the base tag it would have appeared as expected (the first link you post) and acted just like the math.colgate.edu link.
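For the crawler, one way to handle both cases is to look for a base tag first and fall back to the page URL when there is none. The following is only a minimal sketch using the standard library: the class name LinkExtractor, the function resolve_links, and the sample HTML (including the base value "/" and the href "starting-out") are assumptions made for illustration, not taken from the actual pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects <a href> values and the first non-empty <base href>, if any."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if not href:
            return
        if tag == "base" and self.base_href is None:
            self.base_href = href
        elif tag == "a":
            self.hrefs.append(href)


def resolve_links(page_url, html):
    """Return absolute URLs for every link in html, honoring <base href>."""
    parser = LinkExtractor()
    parser.feed(html)
    # The base href may itself be relative, so resolve it against the page URL.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    return [urljoin(base, href) for href in parser.hrefs]


# Hypothetical markup: a page with a base tag and a page without one.
with_base = '<head><base href="/"></head><a href="starting-out">Starting Out</a>'
without_base = '<a href="Finalprojects.pdf">here</a>'

print(resolve_links("http://learnyouahaskell.com/introduction/", with_base))
print(resolve_links("http://math.colgate.edu/~mionescu/math399s11/", without_base))
```

With the assumed markup, the first call resolves against the site root and the second against the directory of the page, which matches what the browser does in the two cases described in the question.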
I am using Orchard 1.9 and I am building a service in which I need to get the current URL.
I have OrchardServices and from that I can get the URL like so:
_orchardServices.WorkContext.HttpContext.Request.Url.AbsolutePath;
This works like a charm for pages/routes that I have created, but when I go to the login or register page (/Users/Account/LogOn) the absolute URL is / and I can't find any way to get the URL, or at least any indication that I am on the LogOn or Register page.
Does anyone know how I could get the full URL?
If I understand what you're asking, you could use the ItemAdminLink from the ContentItemExtension class.
You will need to add references to Orchard.ContentManagement, Orchard.Mvc.Html and Orchard.Utility.Extensions, but then you will have access to the @Html and @Url helpers.
From there you will have the ability to get the link to the item using:
@Html.ItemDisplayLink((ContentItem)Model.ContentItem)
The link to the item with the Url as the title using: @Url.ItemDisplayUrl((ContentItem)Model.ContentItem)
And you should get the same for the admin area by using these:
@Html.ItemAdminLink((ContentItem)Model.ContentItem)
@Url.ItemAdminUrl((ContentItem)Model.ContentItem)
They will give you relative paths, e.g. '/blog/blog-post-1', but it sounds like you've already got a partial solution for that sorted, so it would be a matter of combining the two.
Although I'm sure there are (much) better ways of doing it, you could get the absolute URL using:
String.Format("{0}{1}", WorkContext.CurrentSite.BaseUrl, yourRelativeURL);
...but if anyone has a more elegant way of doing it, please post a comment below.
Hope that helps someone.
I have two questions here that I thought had already been asked, but I could not find anything related.
Let's suppose I have the following URL:
http://www.domain.com/folder/page
And I have an anchor like this:
<a href="page2">Page2</a>
First:
Of course when it is clicked, it will navigate to
http://www.domain.com/folder/page2
But if the user has this URL:
http://www.domain.com/folder/page/ <-- Note the last slash
Then the anchor will navigate to:
http://www.domain.com/folder/page/page2
The first question is:
How can I avoid this?
And the second question would be:
How to always do this?
I mean that even if the url ends with a slash or not, navigate to:
http://www.domain.com/folder/page/page2
I know I can do this with JavaScript, but the idea is to keep using the href without resorting to JavaScript every time this happens. I also know I can use relative URLs starting with / to refer to the root, but I can't in this case because the URL has some IDs in the middle that may change.
Your basic problem is that you have two URLs that resolve to the same resource.
Pick one of them to be canonical and redirect from the other one to it using HTTP.
Failing that, use root relative URIs:
href="/folder/page2"
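The behavior described in the question follows from standard relative URL resolution: the last path segment of the base is dropped unless the base ends with a slash. As a rough illustration, Python's urllib.parse.urljoin applies the same rules; the variable names below are only for the example.

```python
from urllib.parse import urljoin

# Base without a trailing slash: "page" is treated as a file and replaced.
NO_SLASH = urljoin("http://www.domain.com/folder/page", "page2")

# Base with a trailing slash: "page/" is treated as a directory and kept.
WITH_SLASH = urljoin("http://www.domain.com/folder/page/", "page2")

# A root-relative href ignores the base path entirely.
ROOT_RELATIVE = urljoin("http://www.domain.com/folder/page/", "/folder/page2")

print(NO_SLASH)       # http://www.domain.com/folder/page2
print(WITH_SLASH)     # http://www.domain.com/folder/page/page2
print(ROOT_RELATIVE)  # http://www.domain.com/folder/page2
```

This is why redirecting to one canonical form of the URL makes a plain relative href behave the same way for every visitor.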
JWebUnit.beginAt:
Begin conversation at a URL absolute or relative to base URL. Use getTestContext().setBaseUrl(String) to define base URL. Absolute URL should start with "http://", "https://" or "www.".
JWebUnit.gotoPage:
Go to the given page like if user has typed the URL manually in the browser. Use getTestContext().setBaseUrl(String) to define base URL. Absolute URL should start with "http://", "https://" or "www.".
So, one says "Begin conversation at URL absolute or relative to base URL", while the other says "Go to the given page like if user has typed the URL manually in the browser". This doesn't help me in the slightest in understanding them (well, specifically the former; the latter makes sense). What's the actual difference between them? Which should I be using, and when?
I finally did manage to find the answer in the source code.
beginAt does two things: it starts the browser, then calls gotoPage with its argument. Thus, you need to use beginAt the first time and gotoPage on subsequent calls. (Perhaps if managing multiple windows it has more use; I haven't dug that deeply.)
If an extra character (like a period, comma, bracket, or even a letter) gets accidentally added to a URL on the stackoverflow.com domain, a 404 error page is not shown. Instead, the URL corrects itself and the user is led to the relevant webpage.
For instance, the extra 4 letters I added to the end of a valid SO URL to demonstrate this are automatically removed when you access the URL below:
https://stackoverflow.com/questions/194812/list-of-freely-available-programming-booksasdf
I guess this has something to do with ASP.NET MVC Routing. How is this feature implemented?
Well, this is quite simple to explain I guess, even without knowing the code behind it:
The text is just candy for search engines and people reading the URL:
This URL will work as well, with the complete text removed!
The only part really important is the question ID that's also embedded in the "path".
This is because EVERYTHING after http://stackoverflow.com/questions/194812 is ignored. It is just there to make the link more descriptive when it is posted somewhere.
Internally the URL is mapped to a handler, e.g., by a rewrite, that transforms into something like: http://stackoverflow.com/questions.php?id=194812 (just an example, don't know the correct internal URL)
This also makes the URL search engine friendly, besides being more readable to humans.
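To make the idea concrete, here is a small sketch of a route that looks only at the numeric ID and treats the rest of the path as decoration. It is not how Stack Overflow is actually implemented: the names QUESTION_ROUTE, CANONICAL_SLUGS and route are hypothetical, the dictionary stands in for a database lookup, and the 301 redirect to the canonical URL is an assumption based on the self-correcting behavior described in the question.

```python
import re

# Only the numeric ID is captured; anything after it is optional and ignored.
QUESTION_ROUTE = re.compile(r"^/questions/(?P<id>\d+)(?:/.*)?$")

# Hypothetical stand-in for a database: question ID -> canonical slug.
CANONICAL_SLUGS = {194812: "list-of-freely-available-programming-books"}


def route(path):
    """Return (status, canonical_path) for a request path."""
    match = QUESTION_ROUTE.match(path)
    if match is None:
        return (404, None)
    question_id = int(match.group("id"))
    slug = CANONICAL_SLUGS.get(question_id)
    if slug is None:
        return (404, None)
    canonical = f"/questions/{question_id}/{slug}"
    # A wrong or missing slug is corrected by redirecting (assumed behavior).
    if path != canonical:
        return (301, canonical)
    return (200, canonical)


print(route("/questions/194812/list-of-freely-available-programming-booksasdf"))
print(route("/questions/194812"))
```

Both calls end up at the same canonical path, because the lookup depends on the ID alone.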
I have recently been experimenting with the profile features of ASP.NET. I am having trouble getting a "website" property to display correctly. For example, if the website I enter is: facebook.com/contactalig and I render it using
<%: Profile.Website %>
it gets rendered on screen as http://localhost:51225/users/facebook.com/contactalig. Initially, I thought I might just prepend "http://" if it didn't contain one, but I feel there should be a cleaner solution.
Thanks in advance.
Without the protocol etc. it isn't an absolute URI, so the browser (correctly) treats it as relative to the current URL.
So yes: check for a protocol. Perhaps just StartsWith is enough here; otherwise use a regex, or maybe Uri.TryCreate with UriKind.Absolute to accept absolute URIs only.
Personally I would do this check at the point of data-entry, not at display.
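The prefix check itself is small. Here is a language-neutral sketch of it in Python rather than C#; the function name ensure_absolute and the default scheme of http are assumptions for the example, and it deliberately recognizes only http and https.

```python
def ensure_absolute(url, default_scheme="http"):
    """Prepend a scheme when the value does not already start with one."""
    url = url.strip()
    # Case-insensitive prefix check, limited to http and https.
    if url.lower().startswith(("http://", "https://")):
        return url
    return f"{default_scheme}://{url}"


print(ensure_absolute("facebook.com/contactalig"))   # http://facebook.com/contactalig
print(ensure_absolute("https://example.com/about"))  # https://example.com/about
```

Running this once when the profile is saved means the stored value is always absolute, so the display code can output it unchanged.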