Drupal 8 - How to block search engines from reading or indexing a specific content type (Resource)

I don't want Resource content type nodes to show up in Google search results.
I know I can add a URL to robots.txt if I need to restrict one particular piece of content.
But how can I stop all nodes of one particular content type in Drupal 8 from being crawled by search engines like Google?

The only way that I know of to block search engines is through robots.txt.
With Drupal, in order to block an entire type of content, you must make sure there are no ad-hoc aliases. Check /admin/config/search/path. Any abnormal aliases to instances of your content will allow the search engines to bypass general rules that you set.
Then, add a rule to robots.txt to disallow the pattern for that content.
Example:
Disallow: /node/
If you have some specific subset of content that you would like to block, consider using Pathauto to create path patterns that let you easily target those subsets with robots.txt rules.
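For example, a sketch assuming you configure a Pathauto pattern of resource/[node:title] for the Resource content type (the pattern itself is an assumption; adjust it to whatever you set up):
# Block every Resource node alias generated by the Pathauto pattern above
Disallow: /resource/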

If you're able to install the Metatag module, add a Metatag field to the content type you're trying to exclude from Google. Then, while editing the pages, make sure you check the following in the Metatag options:
Prevents search engines from indexing this page.
If you have too many pages to edit by hand, you will need to write a hook_update_N() implementation to apply the change to all the existing pages.
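A rough sketch of such an update hook, assuming the content type's machine name is resource and the Metatag field is called field_metatag (both names are assumptions), and relying on the Metatag module storing its values as a serialized array:
/**
 * Set "noindex, nofollow" on all existing Resource nodes.
 */
function mymodule_update_8001() {
  $nids = \Drupal::entityQuery('node')
    ->condition('type', 'resource')
    ->accessCheck(FALSE)
    ->execute();
  foreach (\Drupal\node\Entity\Node::loadMultiple($nids) as $node) {
    // Merge the robots directive into whatever metatags are already stored.
    $existing = $node->get('field_metatag')->value;
    $tags = $existing ? unserialize($existing) : [];
    $tags['robots'] = 'noindex, nofollow';
    $node->set('field_metatag', serialize($tags));
    $node->save();
  }
}
On a site with many nodes you would want to batch this with the $sandbox parameter rather than saving everything in a single request.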

Related

Search engine's viewpoint of a URL

Below are two links that point to the same page with the same content. I'm just trying to give this page the right URL.
I can give it one of the following URLs.
http://www.mywebsite.com/help/topic/2001
or
http://www.mywebsite.com/help?topic=2001
Now, when a search engine sees this, what is the effect on how the engine caches and indexes the page?
Do both links have the same effect, or can one of them improve the caching?
That depends on the search engine. For example, Google will automatically try to guess whether a URL parameter should be treated as identifying a unique web page, so in your scenario Google will guess that the value of the URL parameter "topic" defines a page. Other search engines may fail to guess this.
I think it's better to use a URL with no query parameters, so it's absolutely clear that the URL points to a unique page, instead of relying on the guessing ability of a search engine.
Specifically, Google gives webmasters the ability to manually set how a parameter will be treated: https://support.google.com/webmasters/answer/1235687?hl=en
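If the page is currently generated by /help?topic=2001, one way to expose the parameter-free form is an internal rewrite. A minimal .htaccess sketch, assuming the site runs on Apache with mod_rewrite and that /help is the script that actually renders the page:
RewriteEngine On
# Serve the clean URL /help/topic/2001 from the existing /help?topic=2001 handler
RewriteRule ^help/topic/([0-9]+)$ /help?topic=$1 [L]
Visitors and search engines only ever see the clean URL; the query-string version quietly keeps doing the work behind the scenes.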

Remove multiple indexed URLs (duplicates) with redirect

I am managing a website that has only about 20-50 pages (articles, links, etc.). Somehow, Google indexed over 1000 links (duplicates: the same pages with different strings in the URL). I found that those links contain ?date= in the URL. I already blocked them by writing Disallow: *date* in robots.txt, made an XML sitemap (which I did not have before), placed it in the root folder and submitted it to Google Webmaster Tools. But the problem remains: the links are (and probably will stay) in the search results. I could remove URLs in GWT, but it only removes one link at a time, and removing more than 1000 one by one is not an option.
The question: is it possible to make dynamic 301 redirects from every page that contains ?date= in the URL to the original one, and how? I am hoping that Google will re-crawl those pages, follow the redirects to the original ones, and drop the numerous duplicates from the search results.
Example:
bad pages: www.website.com/article?date=1961-11-1 and n copies of the same page with different "date" values
good page: www.website.com/article
automatically redirect all bad pages to good ones.
I have spent a whole work day trying to solve this problem; it would be nice to get some support. Thank you!
P.S. As far as I can tell this is a coding question and the right one to ask on Stack Overflow, but if I am wrong (forgive me), please redirect me to the right place to ask it.
You're looking for the canonical link element; that's the way Google suggests solving this problem (here's the Webmasters help page about it), and it's used by most if not all search engines. When you place an element like
<link rel='canonical' href='http://www.website.com/article'>
in the head of the page, the URI in the href attribute will be considered the 'canonical' version of the page, the one to be indexed and so on.
For the record: if the duplicate content is not an HTML page (say, it's a dynamically generated image), and supposing you're using Apache, you can use .htaccess to redirect to the canonical version. Unfortunately the Redirect and RedirectMatch directives don't handle query strings (they match only the URL path), but you can use mod_rewrite to strip parts of the query string. See, for example, this answer for a way to do it.
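For the ?date= case from the question, that mod_rewrite approach could look roughly like this in .htaccess (a sketch, assuming Apache; note that it drops the whole query string, so adjust it if other parameters need to survive):
RewriteEngine On
# Permanently redirect any URL carrying a date parameter to the same path with no query string
RewriteCond %{QUERY_STRING} (^|&)date= [NC]
RewriteRule ^(.*)$ /$1? [R=301,L]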

URL with pseudo anchors and duplicate content / SEO

I have a product page with options in a select list (e.g. the color of the product).
You can reach my product with different URLs:
www.mysite.com/product_1.html
www.mysite.com/product_1.html#/color-green
If you arrive via the URL www.mysite.com/product_1.html#/color-green, the green option in the select list is automatically selected (with JavaScript).
If I link to my product page with both of those URLs, is there a risk of duplicate content? Is it good for my SEO?
thx
You need to use canonical URLs in order to let search engines know that you are aware the content appears duplicated.
Basically, putting a canonical URL on your page www.mysite.com/product_1.html#/color-green that points to www.mysite.com/product_1.html tells search engines that whenever they see www.mysite.com/product_1.html#/color-green they should not index that page, but rather index www.mysite.com/product_1.html.
This is the suggested method for overcoming duplicate content of this type.
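In practice that is a single line in the head of product_1.html (the URL is the one from the question). Since the #/color-green fragment is never sent to the server, both addresses serve the same HTML, so this one tag covers them both:
<link rel="canonical" href="http://www.mysite.com/product_1.html">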
See these pages:
SEO advice: url canonicalization
A rel=canonical corner case
At one time I saw Google indexing the odd #ed URL and showing it in results, but it didn't last long. I think it also required an on-page link to the anchor.
Google does support the concept of the hashbang (#!) as a specific way to do indexable anchors and support AJAX, which implies an anchor without the bang (!) will no longer be considered for indexing.
Either way, Google is not stupid. The basic use of the anchor is to move to a place on a page, i.e. it is the same page (duplicate content) but a different spot. So Google will expect a #ed URL to contain the same content. Why would they punish you for using the # for what it is for?
And what is "the risk of duplicate content"? Generally, the only on-site risk from duplicate content is that Google may waste its time crawling duplicate pages instead of focusing on other valuable pages. As Google will assume a # points to the same page, it is more likely not to even try the #ed URL.
If you're worried, implement the canonical tag, but do it right. I've seen more issues caused by implementing it badly than by the problems it is supposed to solve.
Both answers above are correct. Google has said they ignore hash fragments unless you use the hash-bang format (#!), and that really only addresses a certain use case, so don't add it just because you think it will help.
Using the canonical link tag is the right thing to do.
One additional point about dupe content: it's less about the risk than about a missed opportunity. Where there are dupes, Google chooses one. If 10 sites link to your site using www.example.com and 10 more link using just example.com, you'll get the "link goodness" benefit of only 10 links. The complete solution is to ensure that when users and Google arrive at the "wrong" one, the server responds with an HTTP 301 status and redirects them to the "right" one. This is known as domain canonicalization and is a good thing for many, many reasons. Use it in addition to the "canonical" link tag and other techniques.
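That domain canonicalization usually amounts to a couple of lines of server configuration. A sketch for Apache with .htaccess, using the example.com domain from the answer (swap in your own host, and https where applicable):
RewriteEngine On
# 301-redirect the bare domain to the canonical www host, preserving the path
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]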

Hidden file names in URLs

I usually like naming my pages so that I know exactly which page does what. However, on a number of sites I see that the filename is hidden from view, and I was just a little curious.
Is there any specific benefit of having URLs appear like this:
http://mydomain/my_directory/my_subdirectory/
As opposed to this:
http://mydomain/my_directory/my_subdirectory/index.php
Thanks.
Done correctly, it can be better for:
the end user: it is easier to say and remember.
SEO: the file name may just detract from the URL in terms of search parsing.
Note: in your example (at least with IIS), all that may have happened is that you've made index.php the default document of that subdirectory. You could then use both URLs to access the page, which could again affect SEO page rank: a search engine would see the two URLs as different, but the page content would be the same, resulting in duplicate content being flagged. The solution would be to either:
301 redirect from one of the URLs to the other, or
add a canonical tag to the page saying which URL you want page rank to be given to (see the example below).
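The canonical tag option is a single line in the head of the page, here using the URLs from the question (which of the two URLs you declare canonical is up to you):
<link rel="canonical" href="http://mydomain/my_directory/my_subdirectory/">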
Some technologies simply don't map URLs to files. Java Servlets, for example.
These are not file names; these are URLs. Their goal is to describe the resource. Nobody cares whether you did it in PHP or ASP or typed your HTML in Emacs. Nobody cares that you named your file index.php. We like to see clean URLs with a clear structure and semantics.

robots.txt to restrict search engines indexing specified keywords for privacy

I have a large directory of individual names along with generic, publicly available, category-specific information that I want indexed as much as possible in search engines. Listing these names on the site itself is not a concern to people, but some don't want to appear in search results when they "Google" themselves.
We want to continue listing these names within a page AND still have the page indexed, BUT not have the specified names or keywords indexed by search engines.
Can this be done page by page, or would setting up two pages be a better workaround?
Options available:
PHP can censor keywords if user-agent=robot/search engine
htaccess to restrict robots to non-censored content, but allowing access to a second, censored version
meta tags defining words not to index?
JavaScript could hide keywords from robots but keep them otherwise viewable
I will go through the options and point out some problems I can see:
PHP: If you don't mind trusting the user agent, this will work well. I am unsure how some search engines will react to different content being displayed to their bots.
htaccess: You would probably need to redirect the bot to a different page. You could use URL parameters, but that would be no different from a pure PHP solution: the bot would index the page it is redirected to, not the page your visitors actually see. You may be able to use the rewrite engine to overcome this.
meta tags: Even if you could use meta tags to get the bot to ignore certain words, there is no guarantee that search engines won't ignore them, since there is no firm "standard" for meta tags. But that doesn't matter, since I don't know of any way to get a bot to ignore certain words or phrases using meta tags anyway.
JavaScript: No bot I have ever heard of executes (or even reads) JavaScript when looking at a page, so I don't see this working. You could display the content you want hidden using JavaScript so that bots won't see it, but then neither will users who have JavaScript disabled.
I would go the PHP route.
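A minimal sketch of that PHP route. The bot list, the names, and the render_directory_page() helper are all assumptions (the helper stands in for however the page is normally built), and user agents can be spoofed, so treat it as best-effort:
<?php
// Very rough bot detection based on the User-Agent header (which can be spoofed).
function is_search_bot() {
  $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
  return (bool) preg_match('/googlebot|bingbot|slurp|duckduckbot|baiduspider/i', $ua);
}

// Names or keywords that should not appear in the copy served to crawlers.
$names = array('Jane Doe', 'John Smith');

$html = render_directory_page();  // hypothetical helper: however the page is normally built
if (is_search_bot()) {
  // Swap each protected keyword for a neutral placeholder, for crawlers only.
  $html = str_ireplace($names, '[name withheld]', $html);
}
echo $html;
The caveat above about showing bots different content still applies, but this keeps the page itself indexable while keeping the names out of the copy that crawlers see.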
You can tell robots to skip indexing a particular page by adding a ROBOTS meta tag:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
UPDATE: The ways I can think of to restrict indexing of particular words are:
Use JS to add those words to the page (see below).
Add a module to the server that strips those words from the rendered page.
The JavaScript could be something like this:
<p>
<span id="secretWord">
<script type="text/javascript">
// The protected word only appears once the script runs, so crawlers that do not
// execute JavaScript never see it; you can obscure it further by concatenating
// strings or using hex character codes.
document.write('Secret' + 'Word');
</script>
</span>
</p>
The server module is probably the best option. In ASP.NET it should be fairly easy to do; I'm not sure about PHP, though.
What's not clear from your post is whether you want to protect your names and keywords against Google or against all search engines. Google is generally well-behaved: you can use the ROBOTS meta tag to prevent a page from being indexed, but that won't stop search engines that ignore the ROBOTS tag from indexing your site.
Other approaches you did not suggest:
Having the content of the page fetched with client-side JavaScript.
Force the user to solve a CAPTCHA before displaying the text. I recommend the reCAPTCHA package, which is easy to use.
Of all of these, the reCAPTCHA approach is probably the best, as it will also protect against ill-behaved spiders. But it is the most onerous for your users.
