Check a file of HTTP URLs for HTTP response codes - url

I need a way to check a file that contains links to see if any of them are broken. The file contains links to thousands of different URLs. I don't need to crawl or spider any further than the URLs that are in the file - we just need a HTTP request response for each URL.

Take a look at Xenu.
It does exactly what you need, assuming it's a web page, or a text file of links. You can control how deep it follows links.

Related

How do I make a dynamic URL for a 404 xhtml page?

I have defined a location for the page in the xml
<error-page>
<error-code>404</error-code>
<location>/faces/public/error-page-not-found.xhtml</location>
</error-page>
<error-page>
but I want the URL to be like below:
faces/{variable}/public/error-page-not-found.xhtml
where the value of the variable will change according to different situations
This question is a bit subjective though in general HTTP errors are handled by the server and most of the time by the scripting language on the server (and occasionally the HTTP server software directly).
In example the Apache HTTP web server software allows for rewrites. So you can request a page at example.com/123 though there is no "123" file there. In the code that would determine if you would have something that would be available for that request you would also determine if a resource exists for that request; if not then your server scripting code (PHP, ColdFusion, Perl, ASP.NET, etc) would need to return an HTTP 404. The server code would then have a small snippet that you would put in to the body of the code such as the code you have above.
You would not need to redirect to an error page, you would simply respond with the HTTP 404 response and any XML you'd use to notify the visitor that there is nothing there. HTTP server software such as Apache can't really produce code (it can only reference or rewrite some file to be used for certain requests).
Generally speaking if you have a website that uses a database you'd do the following...
Parse the URL requested so you can determine what the visitor requested.
Determine if a resource should be retrieved for that request (e.g. make a query to the database).
Once you know whether a resource is available or not you then either show the resource (e.g. a member's profile) or server the HTTP status (401: not signed in at all, 403:, signed in though not authorized where no increase in privileges will grant permission, 404: not found, etc) and display the corresponding content.
I would highly recommend that you read about Apache rewrites and PHP, especially it's $_SERVER array (e.g. <?php print_r($_SERVER);?>). You'd use Apache to rewrite all requests to a file so even if they request /1, /a, /about, /contact/, etc they all get processed by a single PHP file where you first determine what the requested URL is. There are tons of questions here and elsewhere on the web that will help you really get a good quick jump start on handling all that such as this: Redirect all traffic to index.php using mod_rewrite. If you do not know how to setup a local HTTP web server I highly recommend looking in to XAMPP, it's what I started out with years ago. Good luck!

Are friendly URLs based on directories?

I've been reading many articles about SEO and investigating how to improve my site. I found an article that said that having friendly URLs help online indexers to find and positionate your site better than using URLs with lots of GET parameters so I decided to adapt my site to this kind of URL. I've also read that there's a way (editing .htaccess) but it's not the best way and it doesn't look really good.
For example, that's how Google's About URL looks like:
https://www.google.com/search/about/es/
When surfing into FTP do they see the directories search/about/es/index.html? If so, you must create many files and directories for each language instead of using &l=es, is it that worth?
You can never know (for sure) how resources are mapped to URLs.
For example, the URL https://www.google.com/search/about/es/ could
point to the HTML file /search/about/es/index.html
point to the HTML file /foo/bar/1.html
point to the PHP script /index.php
point to the PHP script /search.php?title=about&lang=es
point to the document available from the URL https://internal.google.com/1238
…
It’s always the server that, given the URL from the request, decides which resource to deliver. Unless you have access to the server, you can’t know how. (Even if a URL ends with .php, it’s not necessarily the case that PHP is involved at all.)
The server could look for a file that physically exists (if URL rewriting is involved: even in "other" places than what the URL path suggests), the server could run a script that generates a document on the fly (e.g., taking the content from your database), the server could output the file available from another URL, etc.
Related Wikipedia articles:
Rewrite engine
Web framework: URL mapping
Front controller

Cloudfront/S3: Server different file depending on Request Header

I am hosting a static website generated with Middleman on CloudFront and S3. I want to add multiple language support and middleman allows me to localize the content and have the english version at /index.html and the translated content at /sp/index.html for example.
I would like to be able to detect the "Accept-Language" header in the request and based on that server either /index.html or /sp/index.html .
Based on my research I cannot see a way of doing this with S3 and Cloudfront, but maybe you guys have an idea?
If there is no "proper and good way" of doing this with CloudFront and S3, what would be the next best alternative? Currently I am thinking of detecting the language in JavaScript and then redirecting the user if the language is not english.
Greetings, Kim
As mentioned in the comments you will need some kind of arbitrator that can read request headers and either redirect or serve dynamic content. S3 is the problem there.
CloudFront can forward the Accept-Language header to your origin server, and ensure that content is only cached per-language. So that part isn't a problem.
If S3 is your origin, then you have a problem because your files are static and unable to process the incoming request with the language information. I don't recommend trying to detect language with JavaScript. It's problematic.
Although CloudFront can be configured with multiple origins (one per language, in your case) it cannot forward to these based on request header. Currently "behaviours" can only match the URL path. I suspect they could introduce header rules at some point, but until they do (or unless you can find another CDN that does) I'm afraid my answer is going to be a "you can't" answer.
As your site is all flat HTML, I suspect you're not interested in a convoluted solution that comprises various CloudFront behaviours and dynamic server scripts, etc..
I think your best option by far is a simple, low-tech one --
Offer the visitor a choice of language and allow them to switch language from any page. This also avoids surprises - If I google something in English, but I speak Spanish I should see the English page that I googled and then switch to Spanish if I feel like it.

Should I put .htm at the end of my urls?

The tutorials I'm reading say to do that, but none of the websites I use do it. Why not?
none of the websites I use [put .htm into urls] Why not?
The simple answer would be:
Most sites offer dynamic content instead of static html pages.
Longer answer:
The file extension doesn't matter. It's all about the web server configuration.
Web server checks the extension of the file, then it knows how to handle it (send .html straight to client, run .php through mod_php and generate a html page etc.) This is configurable.
Then web server sends the content (static or generated) to the client, and the http protocol includes telling the client the type of the content in the headers before the web page is sent.
By the way, .htm is no longer needed. We don't use DOS with 8.3 filenames anymore.
To make it even more complicated: :-)
Web server can do url rewriting. For example it could redirect all urls of form : www.foo.com/photos/[imagename] to actual script located in www.foo.com/imgview.php?image=[imagename]
The .htm extension is an abomination left over from the days of 8.3 file name length limitations. If you're writing HTML, its more properly stored in a .html file. Bear in mind that a URL that you see in your browser doesn't necessarily correspond directly to some file on the server, which is why you rarely see .html or .htm in anything other than static sites.
I presume you're reading tutorials on creating static html web pages. Most sites are dynamically generated from programs that use the url to determine the content you see. The url is not tied to a file. If no such dynamic programs are present, then files are urls are synonomous.
If you can, leave off the .htm (or any file extension). It adds nothing to the use of the site, and exposes an irrelevant detail in the URL.
There's no need to put .htm in your URL's. Not only does it expose an unnecessary backend detail about your site, it also means that there is less room in your URLs for other characters.
It's true that URL's can be insanely long... but if you email a long link, it will often break. Not everyone uses TinyURL and the like, so it makes sense to keep your URL's short enough so that they don't get truncated in emails. Those four characters (.htm) might make the difference between your emailed url getting truncated or not!

How do you see the client-side URL in ColdFusion?

Let's say, on a ColdFusion site, that the user has navigated to
http://www.example.com/sub1/
The server-side code typically used to tell you what URL the user is at, looks like:
http://#cgi.server_name##cgi.script_name#?#cgi.query_string#
however, "cgi.script_name" automatically includes the default cfm file for that folder- eg, that code, when parsed and expanded, is going to show us "http://www.example.com/sub1/index.cfm"
So, whether the user is visiting sub1/index.cfm or sub1/, the "cgi.script_name" var is going to include that "index.cfm".
The question is, how does one figure out which URL the user actually visited? This question is mostly for SEO-purposes- It's often preferable to 301 redirect "/index.cfm" to "/" to make sure there's only one URL for any piece of content- Since this is mostly for the benefit of spiders, javascript isn't an appropriate solution in this case. Also, assume one does not have access to isapi_rewrite or mod_rewrite- The question is how to achieve this within ColdFusion, specifically.
I suppose this won't be possible.
If the client requests "GET /", it will be translated by the web server to "GET /{whatever-default-file-exists-fist}" before ColdFusion even gets invoked. (This is necessary for the web server to know that ColdFusion has to be invoked in the first place!)
From ColdFusion's (or any application server's) perspective, the client requested "GET /index.cfm", and that's what you see in #CGI#.
As you've pointed out yourself, it would be possible to make a distinction by using a URL-rewriting tool. Since you specifically excluded that path, I can only say that you're out of luck here.
Not sure that it is possible using CF only, but you can make the trick using webserver's URL rewriting -- if you're using them, of course.
For Apache it can look this way. Say, we're using following mod_rewrite rule:
RewriteRule ^page/([0-9]+)/?$
index.cfm?page=$1&noindex=yes [L]
Now when we're trying to access URL http://website.com/page/10/ CGI shows:
QUERY_STRING page=10&noindex=yes
See the idea? Think same thing is possible when using IIS.
Hope this helps.
I do not think this is possible in CF. From my understanding, the webserver (Apache, IIS, etc) determines what default page to show, and requests it from CF. Therefore, CF does not know what the actual called page is.
Sergii is right that you could use URL rewrting to do this. If that is not available to you, you could use the fact that a specific page is given precedence in the list of default pages.
Let's assume that default.htm is the first page in the list of default pages. Write a generic default.htm that automatically forwards to index.cfm (or whatever). If you can adjust the list of defaults, you can have CF do a 301 redirect. If not, you can do a meta-refresh, or JS redirect, or somesuch in an HTML file.
I think this is possible.
Using GetHttpRequestData you will have access to all the HTTP headers.
Then the GET header in that should tell you what file the browser is requesting.
Try
<cfdump var="#GetHttpRequestData()#">
to see exactly what you have available to use.
Note - I don't have Coldfusion to hand to verify this.
Edit: Having done some more research it appears that GetHttpRequestData doesn't include the GET header. So this method probably won't work.
I am sure there is a way however - try dumping the CGI scope and see what you have.
If you are able to install ISAPI_rewrite (Assuming you're on IIS) - http://www.helicontech.com/isapi_rewrite/
It will insert a variable x-rewrite-url into the GetHttpRequestData() result structure which will either have / or /index.cfm depending on which URL was visited.
Martin

Resources