Hide web pages from search engine robots

I need to hide all my site's pages from ALL spider robots, except for the home page (www.site.com), which should still be crawled.
Does anyone know how I can do that?

Add the tag <meta name="robots" content="noindex" /> to every page you do not want indexed,
or you can create a robots.txt in your document root and put something like this in it:
User-agent: *
Allow: /$
Disallow: /*
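A commented version of that robots.txt, as a sketch; note that the $ and * patterns are extensions honored by the major crawlers (Googlebot, Bingbot) rather than part of the original robots.txt standard, so very old or obscure bots may ignore them:
User-agent: *
# allow only the home page; "$" anchors the pattern at the end of the URL path
Allow: /$
# block every other URL on the site
Disallow: /*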

The canonical link points to the site root error

I have a website that uses country-specific pages. So for every page, there is a country-specific URL. For example: example.com/au/blog, example.com/us/blog, example.com/uk/blog. This is great as we can show content more relevant to each country.
This idea is the same for our home page: example.com/au, example.com/us, example.com/uk.
When a user goes to a non-country-specific URL (i.e. example.com, example.com/blog), the server falls back to serving more generalised content. On the client, we then show a banner asking the user to decide whether they want to go to a country-specific page.
With this in mind, I have added the following tags, and I am receiving the error below when testing with Lighthouse.
<link rel="canonical" href="https://www.example.com/">
<link rel="alternate" hreflang="x-default" href="https://www.example.com/">
<link rel="alternate" hreflang="en-GB" href="https://www.example.comt/uk/">
<link rel="alternate" hreflang="en-US" href="https://www.example.com/us/">
<link rel="alternate" hreflang="en-AU" href="https://www.example.com/au/">
The error:
The document does not have a valid rel=canonical. Points to the domain's root URL (the homepage), instead of an equivalent page of content.
Is this the correct way to inform crawlers that:
The site root is the original document
The site root doesn't target any language or locale
The alternatives to this page are en-GB, en-US and en-AU
If so, why does Lighthouse complain about this error on the home page? It doesn't complain about this on any other page.
I am new to canonicalisation and providing alternative lang pages so I might be missing something obvious.
Since your home page serves a generalized subset of the content, it is not the canonical document. (The canonical context IRI and target IRI shouldn't be the same, or the user would already be on the canonical IRI.) Technically, the per-language IRIs are canonical and alternate documents, depending on the language. Since you're in the UK, you would specify the en-GB IRI as the canonical IRI and the others as alternates (not "alternatives"), since they are simply different representations of the same content.
From the home page https://www.example.com/ (the generalized content):
<link rel="canonical" hreflang="en-GB" href="https://www.example.com/uk/">
From https://www.example.com/uk/ (the canonical IRI):
<link rel="alternate" hreflang="en-US" href="https://www.example.com/us/">
<link rel="alternate" hreflang="en-AU" href="https://www.example.com/au/">
https://www.rfc-editor.org/rfc/rfc6596
https://www.rfc-editor.org/rfc/rfc8288
https://html.spec.whatwg.org/#the-link-element

How to block a search engine from indexing my domain

I want to block a search engine from indexing my website. I've followed this reference and created a robot.txt in the root. Its content is this:
User-agent: http://search.pch.com
Disallow: /
But it doesn't work. Any help will be appreciated. I want to block the search engine http://search.pch.com, either through .htaccess or some other method.
UPDATE
I have also tried this:
<meta name="robots" content="noindex, nofollow">
<meta name="googlebot" content="noindex, nofollow">
no effect
You need to look in the log files on your web server to check whether http://search.pch.com really is the User-agent of the crawler.
Use a robots.txt (note the file name: robots.txt, not robot.txt) with
User-agent: *
Disallow: /
instead, if you want all bots (that respect robots.txt) to stop crawling your pages.
First: the file name should be robots.txt, not robot.txt.
Second: it is the web crawler's choice whether to honor this file; the documentation clearly says "most of" them do.
Third, and most important: the user-agent string for the PCH search crawler might not be the same as its URL. Double-check the user-agent string (see the sketch below).
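For example, once you have confirmed the crawler's real user-agent token from your access logs, the robots.txt rule would look like this (the token below is only a placeholder, not PCH's actual one):
# replace ExamplePCHBot with the token you actually see in your logs
User-agent: ExamplePCHBot
Disallow: /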
Or you can use this code in .htaccess:
# block visitors referred from indicated domains
RewriteEngine on
RewriteCond %{HTTP_REFERER} baddomain01\.com [NC,OR]
RewriteCond %{HTTP_REFERER} baddomain02\.com [NC]
RewriteRule .* - [F]
This worked for me:
SetEnvIfNoCase Referer "http://search\.pch\.com" bad_referer
Order Allow,Deny
Allow from all
Deny from env=bad_referer
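If you are on Apache 2.4 or later, a sketch of the equivalent using the newer Require syntax instead of Order/Allow/Deny (same assumption that the unwanted requests carry that Referer):
SetEnvIfNoCase Referer "http://search\.pch\.com" bad_referer
<RequireAll>
  # allow everyone except requests flagged by the SetEnvIfNoCase rule above
  Require all granted
  Require not env bad_referer
</RequireAll>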

Can I include canonical URLs in sitemap for SEO?

Can I include canonical URLs in sitemaps for SEO?
For example, www.example.com/url.html is a duplicate of www.example2.com/url.html.
So I used the following tag on the www.example.com/url.html page so that it is not penalized by search engines for duplicate content:
<link rel="canonical" href="www.example2.com/url.html">
Now my question is: can I list the www.example.com/url.html URL inside www.example.com/sitemap.xml?
I already list the www.example2.com/url.html URL inside www.example2.com/sitemap.xml.
Please suggest what I should do.
You can include these two pages in your sitemap.xml and there won't be a problem for SEO, because you're using the rel="canonical" tag. When web crawlers try to index the duplicate page, they will see the rel="canonical" tag and index the second page (the canonical one) instead.
For better indexing, you should list only one URL, the canonical URL, in the XML sitemap.
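As a sketch, a minimal www.example2.com/sitemap.xml that lists only the canonical URL from the question would look like this (the https scheme is just an assumption):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- only the canonical URL is listed; the duplicate on example.com is left out -->
    <loc>https://www.example2.com/url.html</loc>
  </url>
</urlset>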

Remove Page from being indexed in Google, Yahoo, Bing [duplicate]

I don't want the search engines to index my imprint page. How could I do that?
You can also add the following meta tag in the HEAD of that page:
<meta name="robots" content="noindex,nofollow" />
You need a simple robots.txt file. Basically, it's a text file that tells search engines not to index particular pages.
You don't need to include it in the header of your page; as long as it's in the root directory of your website it will be picked up by crawlers.
Create it in the root folder of your website and put the following text in:
User-Agent: *
Disallow: /imprint-page.htm
Note that you'd replace imprint-page.htm in the example with the actual name of the page (or the directory) that you wish to keep from being indexed.
That's it! There is plenty of further reading available if you want to get more advanced, and you can find free tools online that will generate a robots.txt file for you.
You can set up a robots.txt file to tell search engines to ignore certain directories.
Basically:
User-agent: *
Disallow: /[directory or file here]
<meta name="robots" content="noindex, nofollow">
Just include this line inside your <head> tag. The reason I suggest this is that if you use a robots.txt file to hide URLs, they might be login pages or other protected URLs that you don't want to show to anyone else or to search engines.
Anyone can simply request the robots.txt file directly from your website and see which URLs you are trying to keep secret, which defeats the purpose of the robots.txt file.
The better way is to include the meta tag above and keep those URLs from being advertised to anyone.
Nowadays, the best method is to use a robots meta tag and set it to noindex,follow:
<meta name="robots" content="noindex, follow">
Create a robots.txt file and set the controls there.
Here are the docs for google:
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
A robot that wants to visit a website URL, say http://www.example.com/welcome.html, first checks for http://www.example.com/robots.txt. There you can explicitly disallow pages:
User-agent: *
Disallow: /~joe/junk.html
Please see the robots.txt documentation for details.

404 static error pages in Rails, path context to public resources changes depending on request URI

I was working on overriding the boilerplate 404 Rails page in RAILS_ROOT/public. This is Rails 3.1.1 hosted on Passenger. I noticed that, in a production environment, paths in the HTML document lose context on routes inside a controller resource path. This is probably something basic, but I wanted to put it out there.
I have:
/public/404.html
/public/error_stylesheets/styles.css
/public/error_images/errorpageheader.jpg
404.html references these resources:
<link href="error_stylesheets/styles.css" rel="stylesheet" type="text/css" />
<img src="error_images/errorpageheader.jpg">
For example, if I request http://app/wrongurlname, my 404.html loads and the resources in the error_stylesheets and error_images folders are found and retrieved.
If I request http://app/controller/wrong or http://app/wrong/wrong, the 404 page loads but can't see the resources.
I would rather not override the behavior of ApplicationController or routing, which seems like it would be necessary to serve ERB error pages.
Maybe you should try these kinds of paths:
<link href="/error_stylesheets/styles.css" rel="stylesheet" type="text/css" />
<img src="/error_images/errorpageheader.jpg">
Without the leading slash you have relative paths, which resolve against the current request path (so for http://app/controller/wrong the browser asks for /controller/error_stylesheets/styles.css); with the slash you get the absolute path you need.
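As a minimal sketch under that assumption (file names as in the question), the corrected 404.html would reference both assets with root-relative paths:
<!DOCTYPE html>
<html>
<head>
  <!-- leading slash: always resolved from the site root, whatever the request path was -->
  <link href="/error_stylesheets/styles.css" rel="stylesheet" type="text/css" />
</head>
<body>
  <img src="/error_images/errorpageheader.jpg">
  <h1>Page not found</h1>
</body>
</html>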
