W3C validator says 'feed does not validate' 'url must be a full URL'... what's wrong with it?

Validating my feed, it has an enclosure with a URL of
https://archive.org/download/NigelFarageAPersonalMessageToNorthernIrelandVoters./Nigel%20Farage,%20a%20personal%20message%20to%20Northern%20Ireland%20voters..mp3
I know it is a bit convoluted... but what is wrong with it? The stop in the directory name? The double dot in the file name? The comma? All of them?
I have looked at the RFC on URLs but can't make it out(!).
This feed does not validate.
line 441, column 2: url must be a full URL: https://archive.org/download/NigelFarageAPersonalMessageToNorthernIrelandVoters./Nigel%20Farage,%20a%20personal%20message%20to%20Northern%20Ireland%20voters..mp3 (4 occurrences) [help]
<enclosure type="audio/mpeg" url="https://archive.org/download/NigelFarage ...
^
** edit **
A useful (even if incorrect) answer was added (and removed...) showing the result from the w3c URL validator - https://validator.w3.org/checklink
This Link Checker looks for issues in links, anchors and referenced objects in a Web page, CSS style sheet, or recursively on a whole Web site. For best results, it is recommended to first ensure that the documents checked use Valid (X)HTML Markup and CSS. The Link Checker is part of the W3C's validators and Quality Web tools.
If you find this question, you may find the link checker a useful resource!

The problem seems to be that it's an HTTPS URL instead of an HTTP URL.
The linked error documentation, foo attribute of bar must be a full URL, says:
If this is a link to a web page, you must include the "http://" at the beginning and immediately follow it with a valid domain name.
The RSS 2.0 spec says about <enclosure>:
The url must be an http url.
If you change https://archive.org/download/… to http://archive.org/download/…, it validates.
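To make the point concrete, here is a minimal Ruby sketch. It shows that the URL from the question parses fine as a URI (so the stop, the double dot and the comma are not the problem) and that only the scheme trips the strict "must be an http url" wording. The helper name strict_rss_enclosure_url? is made up for illustration, and this only mirrors the rule as quoted above, not the validator's actual code.

```ruby
require "uri"

# Hypothetical helper: true only when the URL parses and uses plain "http",
# mirroring the strict wording quoted from the RSS 2.0 spec.
def strict_rss_enclosure_url?(url)
  uri = URI.parse(url)
  uri.scheme == "http" && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

url = "https://archive.org/download/NigelFarageAPersonalMessageToNorthernIrelandVoters./" \
      "Nigel%20Farage,%20a%20personal%20message%20to%20Northern%20Ireland%20voters..mp3"

puts URI.parse(url).scheme                                   # the URL itself parses
puts strict_rss_enclosure_url?(url)                          # https version
puts strict_rss_enclosure_url?(url.sub(/\Ahttps:/, "http:")) # http version
```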

And if you don't use HTTPS, browsers flag your page as not secure. The feed validator needs to step up here. There are a ton of complaints about this on the support forum: https://groups.google.com/forum/#!forum/feedvalidator-users
More specifically here: https://github.com/rubys/feedvalidator/issues/16

Related

Regex URL validation in Ruby

I have the following ruby code in a model:
if privacy_policy_link.match(/#{domain}($|\/$)/)
errors.add(:privacy_policy_link, 'Link to a specific privacy policy page on your site instead of your homepage.')
end
This worked until a user tried to save a privacy policy link that looked like this:
https://example.com/about-me/privacy-policy-for-example-com/
The goal is that I don't want them linking to their base homepage (example.com, www.example.com, etc.) for this privacy policy link (for some random, complicated reasons I won't go into). The link above should pass the check: because it points to a separate page on their site, it shouldn't be considered a match and they shouldn't see any errors when saving the form. But because the path repeats the base domain in its second half, it comes up as a match.
I cannot, for the life of me, figure out the correct regex on Rubular to get this URL to pass. And of course I cannot just ask the user to rename their privacy policy link to remove the "com" at the end, because this: https://example.com/about-me/privacy-policy-for-example would pass. :)
I would be incredibly grateful for any assistance that could help me understand how to solve this problem!
Rubular link: http://rubular.com/r/G5OmYfzi6t
Your issue is that the . character in a regex matches any character, so the . in example.com matched the - in example-com at the end of the URL.
If you anchor the pattern to the beginning of the line, it will match correctly even without escaping the . in the domain.
if privacy_policy_link.match(%r{^(http[s]?://|)(www.)?#{domain}($|/$)})
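A slightly stricter variant of the same idea, as a sketch only: escape the domain with Regexp.escape so the dots are literal, and anchor both ends with \A and \z. The method name homepage_link? is hypothetical, and it assumes domain is a bare host name such as "example.com".

```ruby
# Hypothetical helper: true when the link is just the site's homepage,
# with or without a scheme, "www." prefix, or trailing slash.
def homepage_link?(link, domain)
  pattern = %r{\A(?:https?://)?(?:www\.)?#{Regexp.escape(domain)}/?\z}i
  link.match?(pattern)
end

puts homepage_link?("https://example.com/", "example.com")
puts homepage_link?("https://example.com/about-me/privacy-policy-for-example-com/", "example.com")
```

In the model, the validation would then add the error only when homepage_link?(privacy_policy_link, domain) is true.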

Test for URL Format

I'd like to test whether the URL that the user inputs into my form is "proper", e.g. the following are proper:
http://www.google.com
www.google.com
www.google.com/
but the following probably shouldn't be:
google
http://www.go?ogle?#%
I don't have in mind what "proper" means, but is there some standard out there that I can use?
In HTML5 you can use the input element with the type value url: http://www.w3.org/TR/html5/states-of-the-type-attribute.html#url-state-type-url. You'd need to check which browsers already implemented a validation for it, though. If it's important, you'd also need server-side validation, of course.
Here you can see what URLs are considered valid by HTML5: http://www.w3.org/TR/html5/urls.html#valid-url. It references RFC 3986 for URIs and RFC 3987 for IRIs.
You should probably have a look at RegEx for URL validation (see for example this question: PHP validation/regex for URL) or check if your library/programming-language/CMS has special functions for it.
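As an example of the "special functions" route, here is a rough Ruby sketch built on the standard uri library. What counts as "proper" is an assumption here: an http(s) URL whose host contains at least one dot, with a missing scheme tolerated by prepending http://. The name plausible_web_url? is made up for illustration.

```ruby
require "uri"

# Hypothetical helper: accepts inputs like "www.google.com" by assuming http://
# when no scheme is given, then requires an http(s) URI with a dotted host.
def plausible_web_url?(input)
  has_scheme = input.match?(%r{\A[a-z][a-z0-9+.-]*://}i)
  candidate = has_scheme ? input : "http://#{input}"
  uri = URI.parse(candidate)
  uri.is_a?(URI::HTTP) && !uri.host.nil? && uri.host.include?(".")
rescue URI::InvalidURIError
  false
end

["http://www.google.com", "www.google.com", "www.google.com/", "google"].each do |input|
  puts "#{input}: #{plausible_web_url?(input)}"
end
```

This is only a syntactic check; it says nothing about whether the host actually exists.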

#! as opposed to just # in a permalink

I'm designing a permalink system and I just noticed that Twitter and Hipmunk both prefix their permalinks with #!. I was wondering why this is, and if the exclamation point in particular is there for a reason. Wouldn't #/ work just as well, since they're no doubt using a framework that lets them redirect queries to certain templates with a regex URL parser?
http://www.hipmunk.com/#!BOS.SEA,Dec15.Jan02
http://twitter.com/#!/dozba
My only guess is it's because browsers use # to link to an anchor element. Is this why the exclamation point is appended?
This is done to make an "AJAX" page crawlable (by Google) for indexing. It does not affect the other well-defined semantics of the fragment identifier at all!
See Making AJAX Applications Crawlable: Getting Started
Briefly, the solution works as follows: the crawler finds a pretty AJAX URL (that is, a URL containing a #! hash fragment). It then requests the content for this URL from your server in a slightly modified form. Your web server returns the content in the form of an HTML snapshot, which is then processed by the crawler. The search results will show the original URL.
I am sure other search-engines are also following this lead/protocol.
Happy coding.
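The "slightly modified form" in the quote is, in Google's (since deprecated) AJAX crawling scheme, the same URL with the #! fragment moved into an _escaped_fragment_ query parameter. A rough Ruby sketch of that mapping follows; the method name is hypothetical, and the escaping is simplified to ASCII control characters, space, #, %, & and +.

```ruby
# Hypothetical helper: maps a "pretty" #! URL to the form a crawler would request.
def escaped_fragment_url(pretty_url)
  base, fragment = pretty_url.split("#!", 2)
  return pretty_url if fragment.nil? # no hashbang, nothing to rewrite

  # Simplified escaping: percent-encode characters that would break the query string.
  escaped = fragment.gsub(/[\x00-\x20\#%&+\x7f]/) { |c| format("%%%02X", c.ord) }
  separator = base.include?("?") ? "&" : "?"
  "#{base}#{separator}_escaped_fragment_=#{escaped}"
end

puts escaped_fragment_url("http://twitter.com/#!/dozba")
# prints http://twitter.com/?_escaped_fragment_=/dozba
```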
Also, it is actually perfectly valid, at least per HTML5, to have an element with an ID of "!foo", so the reasoning in the post is invalid. See the article "The id attribute just got more classy":
HTML5 gets rid of the additional restrictions on the id attribute. The only requirements left — apart from being unique in the document — are that the value must contain at least one character (can’t be empty), and that it can’t contain any space characters.
My guess is that both pages use this in their JavaScript to distinguish between # (a link to an anchor) and their custom #!, which loads some additional content using Ajax.
In that case pretty much everything else would work after the # sign.

How do SO URLs self correct themselves if they are mistyped?

If an extra character (like a period, comma, bracket, or even a letter) gets accidentally added to a URL on the stackoverflow.com domain, a 404 error page is not thrown. Instead, URLs self-correct and the user is led to the relevant webpage.
For instance, the extra 4 letters I added to the end of a valid SO URL to demonstrate this would be automatically removed when you access the below URL -
https://stackoverflow.com/questions/194812/list-of-freely-available-programming-booksasdf
I guess this has something to do with ASP.NET MVC Routing. How is this feature implemented?
Well, this is quite simple to explain I guess, even without knowing the code behind it:
The text is just candy for search engines and people reading the URL:
This URL will work as well, with the complete text removed!
The only part really important is the question ID that's also embedded in the "path".
This is because EVERYTHING after http://stackoverflow.com/questions/194812 is ignored. It is just there to make the link more descriptive when it is posted somewhere.
Internally the URL is mapped to a handler, e.g., by a rewrite, that transforms into something like: http://stackoverflow.com/questions.php?id=194812 (just an example, don't know the correct internal URL)
This also makes the URL search engine friendly, besides being more readable to humans.
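A toy Ruby sketch of that idea, not Stack Overflow's actual code: take only the numeric ID from the path, look up the canonical title slug, and return the path to redirect to. The lookup table and the method names are invented for illustration.

```ruby
# Hypothetical stand-in for a database lookup of question ID => title slug.
def known_slugs
  { 194812 => "list-of-freely-available-programming-books" }
end

# Returns the canonical path for a request path, or nil if it is not a known question.
# Everything after the numeric ID is ignored when matching.
def canonical_question_path(request_path)
  match = request_path.match(%r{\A/questions/(\d+)(?:/.*)?\z})
  return nil unless match

  id = match[1].to_i
  slug = known_slugs[id]
  slug ? "/questions/#{id}/#{slug}" : nil
end

requested = "/questions/194812/list-of-freely-available-programming-booksasdf"
canonical = canonical_question_path(requested)
puts "redirect to #{canonical}" if canonical && canonical != requested
```

A real handler would answer with an HTTP redirect whenever the requested path differs from the canonical one, which is what makes the mistyped URL appear to correct itself.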

Is the Scheme Optional in URIs?

I was recently asked to add some Woopra JavaScript to a website and noticed that the URL started with a double slash (i.e. omitted the scheme). I've never seen this before, so I went trying to find out more about it, but the only thing I could really find was an item on the Woopra FAQ:
The Woopra JavaScript in the Setup does not include http in the URL call for the script. This is correct. The JavaScript has been optimized to run very fast and efficiently on your site.
However, some validation and site testing/debugging services and tools do not recognize the code as correct. It is correct and valid. If the warnings annoy you, just add the http to the script’s URL. It will not impact the script.
(For clarification, the URL is "//static.woopra.com/js/woopra.v2.js"—the colon is omitted in addition to the "http".)
Is there any more information about this practice? If this is indeed valid, there must be a spec that talks about it, and I'd very much like to see it.
Thanks in advance for satisfying my curiosity!
This is a valid URL. It's called a "network-path reference" as defined in RFC 3986. When you don't specify a scheme/protocol, it will fall back to the current scheme. So if you are viewing a page via https:// all network path references will also use https.
For an example, here's a link to the RFC 3986 document again but with a network path reference. If you were viewing this page over https (although it looks like you can't use https with StackOverflow) the link will reflect your current URI scheme, unlike the first link.
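The fallback to the current scheme can be seen with Ruby's standard URI.join, which resolves a reference against a base URI. In this sketch the base page URLs (example.com) are placeholders; the script URL is the one from the question.

```ruby
require "uri"

reference = "//static.woopra.com/js/woopra.v2.js"

# The same network-path reference resolved against an http and an https page.
puts URI.join("http://example.com/page.html", reference)
# prints http://static.woopra.com/js/woopra.v2.js
puts URI.join("https://example.com/page.html", reference)
# prints https://static.woopra.com/js/woopra.v2.js
```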
See RFC 3986, section 3:
The generic URI syntax consists of a hierarchical sequence of components referred to as the scheme, authority, path, query, and fragment.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty
          / path-absolute
          / path-rootless
          / path-empty
The scheme and path components are required, though the path may be empty (no characters).

Resources