Can I prevent search engines from indexing an entire directory on my website? - staging

I have a staging site which I use to draft new features, changes and content to my actual website.
I don't want this to get indexed, but I'm hoping for a solution a little easier than having to add the below to every page on my site:
<meta name="robots" content="noindex, nofollow">
Can I do this in a way similar to how I added a password to the domain using a .htaccess file?

The robots.txt standard is meant for this. Example:
User-agent: *
Disallow: /protected-directory/
Well-behaved search engines will obey this, but the content is still publicly reachable (and arguably easier to discover, since robots.txt itself lists the URL), so password protection via .htaccess is an option, too.
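One way to sanity-check a rule like this before deploying it is Python's standard urllib.robotparser module. This is only a minimal sketch: the host example.com and the helper name is_allowed are placeholders, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example above, supplied as lines of text.
rules = """\
User-agent: *
Disallow: /protected-directory/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)


def is_allowed(url, agent="*"):
    """Return True if the parsed rules let `agent` crawl `url`."""
    return parser.can_fetch(agent, url)


# example.com is a placeholder host, not taken from the question.
print(is_allowed("http://example.com/protected-directory/page.html"))
print(is_allowed("http://example.com/index.html"))
```

The first URL falls under the disallowed directory and the second does not, so the two calls should report different answers.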

What you want is a robots.txt file.
The file should be in your server root, and its content should be something like:
User-agent: *
Disallow: /mybetasite/
This politely asks search indexing services not to index the pages under that directory, a request that all well-behaved search engines will respect.

Indeed, robots.txt at the site root is the way to go.
To add multiple entries (as the OP suggests), do as follows:
User-agent: *
Disallow: /test_directory_aaa/
Disallow: /test_directory_bbb/
Disallow: /test_directory_ccc/
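Or, to take the .htpasswd route:
In the .htaccess file of each directory you want to protect, add the following. Use a single AuthUserFile line, and point it at the full filesystem path of the password file (the path below is a placeholder), ideally somewhere outside the web root:
AuthType Basic
AuthName "Marty's test directory"
AuthUserFile /full/filesystem/path/to/.htpasswd
Require valid-user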
In .htpasswd, add:
username1:s0M3md5H4sh1
username2:s0M3md5H4sh2
username3:s0M3md5H4sh3

Put the following in robots.txt, which should be in the root directory, to keep your entire site from being indexed.
User-agent: *
Disallow: /

Create a file called robots.txt (all lowercase) in your public_html directory.
Put the following code in it:
User-agent: *
Disallow: /foldername/
Here, foldername is the name of the directory you wish to block.

Block Specific File for SEO:
To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
User-agent: *
Disallow: /*.xls$
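To see which paths a wildcard rule like this would cover, the pattern can be translated into a regular expression. The sketch below is a hand-rolled illustration, not any search engine's actual matcher; it assumes the usual semantics ('*' matches any run of characters, a trailing '$' anchors the end of the URL), and the function names and sample paths are made up.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any run of characters and a
    trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "match anything".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


rule = robots_pattern_to_regex("/*.xls$")


def is_blocked(path):
    """Return True if `path` matches the Disallow pattern."""
    return rule.match(path) is not None


for path in ["/reports/q1.xls", "/reports/q1.xlsx", "/reports/q1.xls?download=1"]:
    print(path, is_blocked(path))
```

Under these assumptions only the first path matches: the '$' anchor means that a longer extension or a query string after .xls takes the URL out of the rule's scope.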
Ref:
http://antezeta.com/news/avoid-search-engine-indexing
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449&topic=1724262&ctx=topic

Related

Will HTML5 app.manifest work with MVC style URLS?

Given the following example manifest:
CACHE MANIFEST
# v1 2011-08-14
# This is another comment
index.html
cache.html
http://somedomain.com/abc/xyz/
/style/css
controller/view/1
# Use from network if available
NETWORK:
/api
# Fallback content
FALLBACK:
/ fallback.html
Will "/style/css" and "/controller/view/1" work, or does it require actual file names? I keep reading about putting "files" there, but on other sites I read "URI". I'm assuming URI is correct. So, are full AND relative URIs allowed? Any cross-browser/device issues to be aware of?
BTW 1: Yes, I'm aware that "file names" are just part of a URI, and names don't dictate content (image.png could download a text file, for example, if one wanted to). I only want to confirm that URIs work well in the CACHE MANIFEST section, thanks.
BTW 2: I'm aware that Mozilla states URI for the cache manifest, so as mentioned, I just want to confirm.
https://developer.mozilla.org/en-US/docs/Web/HTML/Using_the_application_cache
You can use a dynamic manifest file, an approach designed to work with MVC. I've not had a chance to use it myself, but it looks really good!
http://deanhume.com/home/blogpost/mvc-and-the-html5-application-cache/59

Changing my URL from "../page.html" to "../page/" with URL rewrites

How can I rewrite my URLs so that the .html or .php suffix is replaced with a trailing slash?
For an example, look here: http://www.anderssonwise.com/studio/vision/
I found this tutorial, but nothing happens when I follow it:
http://www.spencerdrager.com/2010/02/07/hide-php-extension-in-url-using-htaccess/
You have to use URL Rewrites, which is why Krister asked you if you have mod_rewrite enabled. It's not the simplest thing in the world to do but here's a pretty good tutorial:
http://www.addedbytes.com/articles/for-beginners/url-rewriting-for-beginners/
Alternatively, create a folder (in this example, /page/) and put an index.html inside it; you'll then be able to access the page at http://domain.com/page/

How do I forbid search engines from indexing the subdirectory /CRM with robots.txt?

Or even forbid indexing the whole site?
Create a robots.txt file in the site root and add the appropriate Disallow directives, such as:
User-agent: *
Disallow: /CRM
To forbid indexing of the whole site, use Disallow: / instead.
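Note that Disallow values are matched as path prefixes, so a rule without a trailing slash covers more than the one directory. A small sketch using Python's standard urllib.robotparser shows how that parser reads the rule; the host and sample paths are hypothetical, and other crawlers may differ in the details.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /CRM"])

# Hypothetical paths, used only to illustrate prefix matching.
paths = [
    "/CRM",
    "/CRM/contacts.php",
    "/CRM-archive/old.html",
    "/crm/",
    "/index.html",
]

# example.com is a placeholder host.
results = {p: parser.can_fetch("*", "http://example.com" + p) for p in paths}

for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

With this parser, everything beginning with /CRM is blocked, including /CRM-archive/, while the lowercase /crm/ is not, because paths are compared case-sensitively. Writing the rule as Disallow: /CRM/ limits it to the directory itself.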

Need to have either example.com/username or username.example.com, but how?

I'm almost finished developing my large project. Currently, users' profile pages live at: http://example.com/profile/username/USERNAME
(I'm currently using .htaccess to rewrite the GET data into forward slashes, with profile.php being read as just 'profile'; profile.php also parses the URL correctly to retrieve the GET data.)
It would be so much better if the pages were at http://www.example.com/USERNAME (preferred) or http://www.USERNAME.example.com
Any ideas or resources?
Thanks,
Stefan
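In your .htaccess in the root, add the following. In a .htaccess file the leading slash is stripped from the path before matching, so the pattern must not start with one:
RewriteEngine On
RewriteRule ^([^/]+)/?$ /profile/username/$1 [L]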
This matches paths that don't include a slash (so no directories in the path) and appends the path to /profile/username/. The path can include an optional final slash.
(+1 for the comment about namespaces: it's a little dangerous having usernames in the root of your site. I've tried to limit the impact of this by only giving out the namespace comprising a single directory. Paths with multiple directories will be handled as normal.)
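The matching behaviour described above (a single path segment with an optional trailing slash) can be tried out in Python, whose re syntax agrees with Apache's regex engine for a pattern this simple. This is a sketch under assumptions: the pattern is written without a leading slash and anchored at the end, because rules in a .htaccess file see the path with its leading slash stripped, and the sample paths and the rewrite helper are made up. It mimics only the pattern match, not mod_rewrite as a whole.

```python
import re

# One path segment, an optional trailing slash, and nothing after it.
PATTERN = re.compile(r"^([^/]+)/?$")


def rewrite(path):
    """Return the rewritten target, or None if the rule does not apply.

    `path` is given without its leading slash, as mod_rewrite presents
    it to rules in a .htaccess file.
    """
    match = PATTERN.match(path)
    if match is None:
        return None
    return "/profile/username/" + match.group(1)


for sample in ["stefan", "stefan/", "images/logo.png"]:
    print(sample, "->", rewrite(sample))
```

Keep in mind that any single-segment path matches, including existing top-level names such as profile itself, which is the namespace concern mentioned above.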

Converting Relative links to Absolute?

I am writing a small script using PHP and regular expressions.
The aim of this script is to extract all links in a page and convert any relative links to absolute ones.
I have figured out how relative links work, but I still have some questions.
Let's say we have this page: http://www.example.com/xxx1/xxx2/xxx3.html
If this page has the following links:
index.html --- the absolute link will be http://www.example.com/xxx1/xxx2/index.html
./index.html --- the absolute link will be http://www.example.com/xxx1/xxx2/index.html
../index.html --- the absolute link will be http://www.example.com/xxx1/index.html
/index.html --- the absolute link will be http://www.example.com/index.html
so
index.html = will open in the current directory
./index.html = will also open in the current directory
../index.html = will open in the parent directory
/index.html = will open in the root directory
The problem is: what if the URL is search-engine friendly?
Say we have this URL:
((case1)): http://www.example.com/xxx1/xxx2/xxx3/index/
or
((case2)): http://www.example.com/xxx1/xxx2/xxx3/index
Is "index" in case 1 a directory or a page? Is it a directory or a page in case 2?
And what will the following links look like as absolute links in both cases 1 and 2?
index.html --- ?
./index.html --- ?
../index.html --- ?
/index.html --- ?
This may be an easy question for some of you, but I find it confusing.
Thanks :)
Direct answer to your example
In case 1, index is a "directory component" of the URL, while in case 2 index is a "file component" of the URL. This is independent of whether it actually is a regular file or directory on the web server -- see the explanation below. I'd call both a "page" if an HTML page is served by the server on those URLs.
Case 1: (Links from http://www.example.com/xxx1/xxx2/xxx3/index/)
index.html -> http://www.example.com/xxx1/xxx2/xxx3/index/index.html
./index.html -> http://www.example.com/xxx1/xxx2/xxx3/index/index.html
../index.html -> http://www.example.com/xxx1/xxx2/xxx3/index.html
/index.html -> http://www.example.com/index.html
Case 2: (Links from http://www.example.com/xxx1/xxx2/xxx3/index)
index.html -> http://www.example.com/xxx1/xxx2/xxx3/index.html
./index.html -> http://www.example.com/xxx1/xxx2/xxx3/index.html
../index.html -> http://www.example.com/xxx1/xxx2/index.html
/index.html -> http://www.example.com/index.html
So the only link that resolves to the same URL in both cases is the fourth one, which begins with a slash and is therefore resolved from the site root.
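The two tables above can be reproduced with a standard URL resolver rather than hand-written regular expressions. The question is about PHP; the sketch below uses Python's urllib.parse.urljoin purely as an illustration of the resolution rules, with the variable names chosen for this example.

```python
from urllib.parse import urljoin

links = ["index.html", "./index.html", "../index.html", "/index.html"]

# The two base URLs from the question: with and without a trailing slash.
bases = {
    "case 1": "http://www.example.com/xxx1/xxx2/xxx3/index/",
    "case 2": "http://www.example.com/xxx1/xxx2/xxx3/index",
}

# Resolve every relative link against each base URL.
resolved = {
    name: [urljoin(base, link) for link in links]
    for name, base in bases.items()
}

for name, urls in resolved.items():
    print(name)
    for link, url in zip(links, urls):
        print(f"  {link} -> {url}")
```

The only input that differs between the two runs is the trailing slash on the base URL, which is enough to shift every relative result up by one path level.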
Explanation
Links are relative to the URL the browser is at, which may not be the URL you originally entered (for example on an HTTP redirect). Most web browsers will update the URL bar with the current address once you follow a link or are redirected, so unless you just edited that, the address you see there is the one that counts.
URLs ending with a slash are considered to refer to directories (implied by RFC2396 for URI syntax, though it does not actually call them that way), else they are considered to refer to files within directories.
--Side note: This will not necessarily correspond to the type of filesystem path (if there is one) used by the web server to serve the file. Most web servers, when asked for a URL mapping to a directory on their filesystem, will either serve a file within the directory with some set name (often index.html, but the selection can usually be configured), or an HTML directory listing generated by the server (or an access error if that was disabled). The same will usually be served when a "file URL" for the similar path without a trailing slash is requested, in which case the "file URL" actually maps to a directory filesystem path.--
This can lead to inconsistencies such as the above example, where the "file URL" http://www.example.com/xxx1/xxx2/xxx3/index is probably equivalent to the "directory URL" http://www.example.com/xxx1/xxx2/xxx3/index/, but relative links may refer to different paths from those two URLs, and one may work and the other may be broken.
For that reason, when linking to a directory, it is recommended to always use the "directory URL" (with the terminating slash) and not the equivalent "file URL" - e.g. link to http://www.ietf.org/meetings/ and not http://www.ietf.org/meetings even if both would serve the same page. Many web servers are in fact configured to redirect clients requesting the latter to the former using an HTTP 301 redirect response. You can see this if you enter the latter in your browser's URL bar - the URL bar will change to the former once it gets that response.

Resources