XHTML namespace on HTTPS site? - url

I'm migrating a site from HTTP to redirecting all requests to HTTPS, and therefore I'm making sure external scripts, images etc. are referenced with just // at the beginning of the URL instead of http://
My question is this: do I also need to change things like the XHTML namespace on the html tag or the doctype declaration URL? And if I do need to change these, will they resolve URLs starting with //?

Namespaces are identifying strings that happen to use URL syntax. They should not be changed.
The DTD is a tricky one.
In theory, if it was altered with a man-in-the-middle attack, then it could be used to change named entities and insert new content into the document.
In practice, however, browsers don't generally parse the DTD, so this isn't really a worry. Additionally, W3C DTDs are not served over HTTPS, so you can't reference them without copying the files to your own server (and possibly updating internal references). If you want to be really safe, you should do this.
Personally, I'd scrap DTDs and just use (X)HTML 5.
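To illustrate why the namespace should stay as it is, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The tiny document is a made-up example. It only shows that a namespace is compared as a plain string and is never fetched, so an https spelling would simply be a different namespace.

```python
import xml.etree.ElementTree as ET

# The XHTML namespace is an identifier, not a resource that gets downloaded.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Made-up minimal document for illustration only.
doc = f'<html xmlns="{XHTML_NS}"><head><title>t</title></head><body/></html>'

# Parsing needs no network access: the namespace string is stored verbatim.
root = ET.fromstring(doc)

# ElementTree writes qualified names as "{namespace}localname".
is_xhtml = root.tag == f"{{{XHTML_NS}}}html"

# Swapping in https:// yields a different string, hence a different namespace.
https_variant = XHTML_NS.replace("http://", "https://")
is_same_namespace = root.tag == f"{{{https_variant}}}html"
```

Here is_xhtml is true and is_same_namespace is false, which is why rewriting the namespace during an HTTPS migration would break namespace-aware consumers rather than secure anything.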

Related

Is a protocol (e.g. http or https) required for a URL to be valid?

Recently I came across a lot of code from analytics plugins where they specify the URL as //fonts.googleapis.com or //www.google.com.
Basically, it starts with two forward slashes followed by the domain or subdomain. These links work fine in browsers. I have read the following documents, but I am still not sure whether the above can be called valid URLs (in other words, whether they should be reported as broken URLs or not).
https://developer.mozilla.org/en-US/docs/Web/API/URL and
https://url.spec.whatwg.org/
Is there a standard specification that I can refer to?
They're both valid scheme-relative-URL strings, although they need a base URL as context to be meaningful. When used within a web page, the web page provides the base URL.
Although there are other, earlier standards for URLs, the WHATWG document represents the most up-to-date, web-compatible definition.
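As a rough illustration, the sketch below uses Python's urllib.parse, which follows the older RFC-style parsing rather than the WHATWG algorithm, so treat it as an approximation of what browsers do. The base URL https://example.com/index.html is a made-up stand-in for the page that contains the link.

```python
from urllib.parse import urlsplit, urljoin

ref = "//fonts.googleapis.com"

# On its own, the reference has a host but no scheme.
parts = urlsplit(ref)
has_scheme = bool(parts.scheme)
host = parts.netloc

# Hypothetical base URL, standing in for the page the link appears on.
base = "https://example.com/index.html"

# With a base URL, the missing scheme is taken from the base.
absolute = urljoin(base, ref)
```

A link checker would therefore need to resolve such references against the page URL before deciding whether they are broken.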

Are friendly URLs based on directories?

I've been reading many articles about SEO and investigating how to improve my site. I found an article that said that having friendly URLs helps search engine indexers find and rank your site better than URLs with lots of GET parameters, so I decided to adapt my site to this kind of URL. I've also read that there's a way to do it by editing .htaccess, but that it's not the best way and doesn't look very good.
For example, this is what Google's About URL looks like:
https://www.google.com/search/about/es/
If you browsed their server over FTP, would you see the directories search/about/es/index.html? If so, you would have to create many files and directories for each language instead of using &l=es. Is it worth it?
You can never know (for sure) how resources are mapped to URLs.
For example, the URL https://www.google.com/search/about/es/ could
point to the HTML file /search/about/es/index.html
point to the HTML file /foo/bar/1.html
point to the PHP script /index.php
point to the PHP script /search.php?title=about&lang=es
point to the document available from the URL https://internal.google.com/1238
…
It’s always the server that, given the URL from the request, decides which resource to deliver. Unless you have access to the server, you can’t know how. (Even if a URL ends with .php, it’s not necessarily the case that PHP is involved at all.)
The server could look for a file that physically exists (if URL rewriting is involved: even in "other" places than what the URL path suggests), the server could run a script that generates a document on the fly (e.g., taking the content from your database), the server could output the file available from another URL, etc.
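A minimal sketch of this idea in Python, assuming a toy server: the ROUTES table, handle, render_about and serve_static are all hypothetical names invented for illustration, not how Google or any particular web server works.

```python
from typing import Callable, Dict


def serve_static(path: str) -> str:
    # Stand-in for reading a file that physically exists on disk.
    return f"contents of the file mapped to {path}"


def render_about(lang: str) -> str:
    # Stand-in for a script that generates the document on the fly.
    return f"<h1>About ({lang})</h1>"


# Hypothetical mapping: the URL path looks like a directory tree,
# but no such directories need to exist on the server.
ROUTES: Dict[str, Callable[[], str]] = {
    "/search/about/es/": lambda: render_about("es"),
    "/search/about/en/": lambda: render_about("en"),
}


def handle(path: str) -> str:
    # The server, not the URL, decides which resource to deliver.
    handler = ROUTES.get(path)
    if handler is not None:
        return handler()
    return serve_static(path)
```

With a table like this, adding a language means adding a route, not a directory with its own index.html.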
Related Wikipedia articles:
Rewrite engine
Web framework: URL mapping
Front controller

Are unnecessary slashes in a URL bad?

I noticed that https://stackoverflow.com//////////questions/4659504/ is a valid URL. However, https://www.google.com//////////analytics/settings is not. Are there differences inherent in web server technologies that explain this? Should a URL with unnecessary slashes be interpreted correctly, or should it return an error?
First of all, adding a slash changes the semantics of a URL path like any other character does. So by definition /foo/bar and /foo//bar are not equivalent just as /foo/bar and /foo/bar/ are not equivalent.
But since the URL path is often mapped directly onto the file system, web servers often remove empty path segments (Apache does that) so that /foo//bar and /foo/bar are handled equivalently. But this is not the expected behavior; it's rather done for error correction.
They are both valid URLs.
However, Google's server can't handle the second one.
There is no specific reason to either handle or reject URLs with duplicate slashes; you should spend more time on more important things.
What do you consider "interpreted correctly"? HTTP only really specifies how the stuff in front of the slash after the server name gets interpreted. The rest is entirely up to the web server. It parses what you give it after that point (in whatever manner it likes) and presents you with whatever HTML it feels like providing for that text.
There is a difference in how every application processes requests. If you set up your app to collapse consecutive slashes before routing the request, you shouldn't have any problems.
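As a sketch of that last suggestion, assuming the application gets to see the raw request path before routing, a hypothetical helper (collapse_slashes is an invented name) could collapse runs of slashes like this:

```python
import re


def collapse_slashes(path: str) -> str:
    # Replace every run of two or more slashes with a single slash.
    return re.sub(r"/{2,}", "/", path)


normalized = collapse_slashes("//////////questions/4659504/")
```

Whether to normalize silently, redirect to the cleaned path, or return an error is a design choice for the application; nothing in the answers above mandates one of them.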

Why can protocol be omitted from absolute paths on a webpage?

I recently ran across a website that had some interesting styling on a select element. I went to investigate and found this (names changed to protect the innocent):
<script type="text/javascript" src="//www.domain.tld/file.js"></script>
It works despite HTTP: being omitted. What is the purpose of leaving off the protocol?
It will use the protocol you're already using. Useful for sites with both https and http versions.
So if you're on https://www.domain.tld/ the script will be https://www.domain.tld/file.js.
If you're on http://www.domain.tld/ the script will be http://www.domain.tld/file.js.
I believe this is shorthand for a protocol-relative path, so it should use the same protocol as is being used for that session. E.g. if you grabbed that page with http, then this URL is resolved using the http protocol.
The purpose is that the scheme (i.e. http or https) can be determined relative to the containing page. This is useful if you have a common piece of code included in multiple pages that can be served via http or https.
The purpose is to "use the same protocol as in the current URL" -- presumably (?) useful if the page can be reached both as http: and https: (I have a hard time thinking of other protocols yet that it might be useful for, and even this one is not a clear-cut use case).
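The behaviour described in these answers can be reproduced with Python's urllib.parse.urljoin, shown here as a rough sketch. The resolve helper is a hypothetical name, and the page URLs reuse the placeholder domain from the question.

```python
from urllib.parse import urljoin

# The src attribute value from the question.
script_src = "//www.domain.tld/file.js"


def resolve(page_url: str) -> str:
    # The scheme is inherited from the page that contains the reference.
    return urljoin(page_url, script_src)


over_http = resolve("http://www.domain.tld/")
over_https = resolve("https://www.domain.tld/")
```

The same markup thus yields an http:// script URL on the HTTP version of the page and an https:// one on the HTTPS version.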

Should I put .htm at the end of my urls?

The tutorials I'm reading say to do that, but none of the websites I use do it. Why not?
none of the websites I use [put .htm into urls] Why not?
The simple answer would be:
Most sites offer dynamic content instead of static html pages.
Longer answer:
The file extension doesn't matter. It's all about the web server configuration.
The web server checks the extension of the file, and from that it knows how to handle it (send .html straight to the client, run .php through mod_php and generate an HTML page, etc.). This is configurable.
The web server then sends the content (static or generated) to the client, and as part of the HTTP protocol it tells the client the type of the content in the headers before the web page itself is sent.
By the way, .htm is no longer needed. We don't use DOS with 8.3 filenames anymore.
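To make the extension-to-content-type step concrete, here is a small sketch using Python's standard mimetypes module, which holds the kind of configurable table a server consults. The content_type_for helper and the fallback value are assumptions for illustration, not the configuration of any specific server.

```python
import mimetypes


def content_type_for(filename: str) -> str:
    # Look up the type from the extension, as a server's config table would.
    guessed, _encoding = mimetypes.guess_type(filename)
    # Assumed fallback for unknown extensions.
    return guessed or "application/octet-stream"


html_type = content_type_for("index.html")
htm_type = content_type_for("index.htm")
```

Both spellings map to the same type, and the browser relies on the Content-Type header carrying that value, not on the extension in the URL.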
To make it even more complicated: :-)
The web server can do URL rewriting. For example, it could internally map all URLs of the form www.foo.com/photos/[imagename] to the actual script located at www.foo.com/imgview.php?image=[imagename]
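A rewrite rule like this one can be sketched in a few lines of Python. PHOTO_RULE and rewrite are hypothetical names, and a real server would express the rule in its own configuration syntax instead.

```python
import re
from urllib.parse import quote

# Matches /photos/[imagename], where the image name has no further slashes.
PHOTO_RULE = re.compile(r"^/photos/(?P<imagename>[^/]+)$")


def rewrite(path: str) -> str:
    match = PHOTO_RULE.match(path)
    if match is None:
        # Paths that do not match the rule are left untouched.
        return path
    # Percent-encode the name so it is safe inside a query string.
    return "/imgview.php?image=" + quote(match.group("imagename"), safe="")


rewritten = rewrite("/photos/cat.jpg")
```

The visitor only ever sees the /photos/... form; the script name and its .php extension stay an internal detail.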
The .htm extension is an abomination left over from the days of 8.3 file name length limitations. If you're writing HTML, it's more properly stored in a .html file. Bear in mind that a URL that you see in your browser doesn't necessarily correspond directly to some file on the server, which is why you rarely see .html or .htm in anything other than static sites.
I presume you're reading tutorials on creating static HTML web pages. Most sites are dynamically generated by programs that use the URL to determine the content you see. The URL is not tied to a file. If no such dynamic programs are present, then files and URLs are synonymous.
If you can, leave off the .htm (or any file extension). It adds nothing to the use of the site, and exposes an irrelevant detail in the URL.
There's no need to put .htm in your URLs. Not only does it expose an unnecessary backend detail about your site, it also means that there is less room in your URLs for other characters.
It's true that URLs can be insanely long... but if you email a long link, it will often break. Not everyone uses TinyURL and the like, so it makes sense to keep your URLs short enough that they don't get truncated in emails. Those four characters (.htm) might make the difference between your emailed URL getting truncated or not!

Resources