I would like to know if this code disallows every search engine from scanning my directory.
User-agent: *
Disallow: /
Also, is this code still up to date with the new HTML5 standard?
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
Really useful or not needed anymore?
No, it'll only stop those that obey the robots.txt protocol (there exist search engines that intentionally disobey robots.txt to find "hidden" stuff).
It will, however, stop the vast majority of search engines that your average consumer would use from seeing it.
I have a website with two languages, English and Russian. My domain URL for each language will be as follows:
www.mydomain.com/en/
and
www.mydomain.com/ru/
How do I redirect visitors to www.mydomain.com/en/ when they type www.mydomain.com?
How will my URL be shown in Google search? Will it be shown as:
www.mydomain.com
OR
www.mydomain.com/en/
I cannot tell you exactly how to redirect visitors to the right URL (from a technical point of view). That depends on the system you use (WordPress, Joomla, Magento, or none at all) and the programming language (PHP, ASP.NET).
Edit: how to determine what language to show:
There are two ways to determine which language version to redirect your visitors to. You can base it on the browser's language using the Accept-Language request header, or you can derive the location from the user's IP address and use the main language of that country. A combination of both would probably be the best solution.
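To make that concrete, here is a minimal PHP sketch of the Accept-Language approach (PHP is just one of the possible languages mentioned above; the folder names and English fallback are assumptions based on your question):
<?php
// Minimal sketch: redirect the bare domain to a language folder based on the
// Accept-Language request header. Assumes only /en/ and /ru/ exist and that
// English is the fallback; a real implementation should parse the header properly.
$acceptLanguage = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
$language = (stripos($acceptLanguage, 'ru') === 0) ? 'ru' : 'en';
header('Location: http://www.mydomain.com/' . $language . '/', true, 302);
exit;
?>
You could swap the header check for an IP-based geolocation lookup, or combine both, as described above.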
I can, however, tell you how your URLs will be shown in Google. First you have to decide how you want Google to show them.
Some sites use their domain root (www.domain.com) as a simple page where users can choose the language, and send them to the right language folder (www.domain.com/en/ or www.domain.com/ru/). Others use the domain without a language folder as the version for the main language of their country and use folders for the other languages.
Two other solutions are to use subdomains (ru.domain.com or de.domain.com) or to use different domains (www.domain.com and www.domain.ru) for all (other) languages.
Whichever way you choose, make sure you never have two versions of one language! For SEO reasons, you cannot have the English version on www.domain.com AND www.domain.com/en/.
Once you have chosen your way of serving (other) languages, you have to tell search engines what language your webpage is in. You can also link to the same content in other languages. Put the following tags in the <head></head> section of your webpages:
<head>
<meta http-equiv="content-language" content="nl-NL">
<link rel="alternate" hreflang="ru" href="http://www.domain.com/ru/">
</head>
Also adjust your <html> tag:
<html lang="nl"><head>...</head><body>...</body></html>
Edit: How to get your different versions in Google
When using different (sub)domains, you can notify Google about all of them in Google Search Console (the new name for Google Webmaster Tools). If you prefer to use folders, you can add your main domain to Google Search Console and let your <link hreflang="..." ...> tags do the work for you. You also might want to make separate sitemaps for each language and notify Google about them in Search Console.
The keywords meta tag seems like a staggering dinosaur from the early days of failing to trick Google. Search engines naturally prioritize actually readable words, as users don't want information they can't see, right?
So why do Tumblr and YouTube automatically insert keywords meta tags?
A YouTube watch page:
<meta name="keywords" content="sonic the hedgehog, final fantasy, mega man x, i swear i am not abusing tags, newgrounds, flash">
Tumblr's official staff blog:
<meta name="keywords" content="features,HQ Update,Tumblr Tuesday,Esther Day" />
In both cases, the keywords are taken from tags explicitly entered by users: YouTube takes them from whatever tags the uploader specified, and Tumblr takes the first 5 post tags on the page. (Tumblr even automatically inserts these tags on every blog page, with no way to opt out.)
There must be some reason they go through this trouble, right? Are the tags for older/smaller search engines? Internal analytics? I can't imagine it's an enormous strain on their servers, but their existence shows they value something highly enough to justify the small additional load.
Firstly, it's not much trouble. The tags are already defined. Secondly, just because Google won't rely on the metadata exclusively doesn't mean that Google or other sites can't use it. It's provided in an easy-to-read place for programs that need it. Parsing HTML can be hard, especially when your site is constantly changing, so providing a predictable place for tags with little to no effort is just something that they do.
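To illustrate the "easy-to-read place for programs" point: in PHP, for instance, get_meta_tags() pulls those values without parsing the whole page. A rough sketch (the URL is only an example):
<?php
// Read a page's meta tags without parsing the full HTML.
// get_meta_tags() returns an array keyed by lowercased meta names;
// fetching a remote URL requires allow_url_fopen to be enabled.
$tags = get_meta_tags('https://staff.tumblr.com/');
if (is_array($tags) && isset($tags['keywords'])) {
    $keywords = array_map('trim', explode(',', $tags['keywords']));
    print_r($keywords); // e.g. the user-entered post tags
}
?>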
I need to disallow access to some folders without revealing in robots.txt which folders they are.
In my case I have three folders /_private1, /_private2 and /_private3.
I need to know whether using something like the following will protect these folders against Google and others. How should I do that?
Disallow: /_*
Disallow: /_
To disallow any directory or file whose name begins with an underscore:
User-agent: *
Disallow: /_
You should be aware that this does not completely guarantee that these directories will never show up on any search engines. It prevents robots.txt compliant robots (which includes all major search engines) from crawling them, but they could still theoretically show up in a search if someone decides to link to them. If you really need to keep this info private, best practice is to put it behind a password.
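If you do go the password route, below is a minimal sketch of HTTP Basic authentication in PHP. Note that this only guards pages actually served through PHP (static files in those folders would normally be protected by the web server's own authentication instead), and the credentials are placeholders:
<?php
// Sketch: require a username/password before serving private content.
$user = 'admin';      // placeholder credentials
$pass = 'change-me';
if (!isset($_SERVER['PHP_AUTH_USER'])
    || $_SERVER['PHP_AUTH_USER'] !== $user
    || $_SERVER['PHP_AUTH_PW'] !== $pass) {
    header('WWW-Authenticate: Basic realm="Private area"');
    header('HTTP/1.0 401 Unauthorized');
    exit('Authentication required.');
}
// ...private content below...
?>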
The following URLs point to the same resource:
(1) http://www.domain.com/books/123
(2) http://www.domain.com/books/123-Harry-Potter
(3) http://www.domain.com/books/123-Harry-Potter-And-The-Deathly-Hallows
I would like to use (1) as the canonical URL for OpenGraph/Facebook, so if you "like" (2) or (3), it will count for (1).
But I would like to use (3) as the canonical URL for Google, because of SEO.
Is this recommended?
Hey Victor, that's a great question. As social media becomes more important, it is going to be interesting to see how its technical requirements might start to clash with other things like SEO.
From an SEO standpoint, it is pretty common for a website to have multiple URLs that point to a single page of unique content. Generally this is due to extra query parameters or some type of domain canonicalization issue (e.g. www vs. non-www). The best way to handle this would be to use the rel=canonical tag at the top of your pages, and Google/Bing will treat that much like a 301 redirect to the canonical URL. The search engines will index all three URLs, but through that canonical signal they will merge the rank of the other two into your canonical page, and you will achieve your search objective.
<html>
<head>
<link rel="canonical" href="http://www.domain.com/books/123-Harry-Potter-And-The-Deathly-Hallows" />
</head>
</html>
Here's Google's description of the rel=canonical feature: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
From a user's perspective on a search engine, this seems like a good experience, so I don't think you'll run into any issues on the search side (i.e. I don't completely agree with @Jan Hančič), but you might want to double-check the quality standards for the Facebook Like feature (http://developers.facebook.com/docs/opengraph). This seems to be a grey area in their best practices.
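For what it's worth, Open Graph has its own URL property (og:url). One possibility, sketched in PHP below with the URLs from your question, is to point rel=canonical at the long URL for search engines and og:url at the short URL for Facebook; whether Facebook is happy with the two differing is exactly the grey area mentioned above, so treat this as a sketch, not a guarantee:
<?php
// Sketch only: emit a search-engine canonical that differs from the
// Open Graph URL used by Facebook's Like button.
$seoCanonical      = 'http://www.domain.com/books/123-Harry-Potter-And-The-Deathly-Hallows';
$facebookCanonical = 'http://www.domain.com/books/123';
echo '<link rel="canonical" href="' . htmlspecialchars($seoCanonical) . '" />' . "\n";
echo '<meta property="og:url" content="' . htmlspecialchars($facebookCanonical) . '" />' . "\n";
?>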
Good luck!
Choose one main URL and create 301 redirects for the rest.
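A minimal PHP sketch of that idea, assuming the long URL is the one you keep:
<?php
// Send a permanent (301) redirect from a secondary URL to the chosen main one.
header('Location: http://www.domain.com/books/123-Harry-Potter-And-The-Deathly-Hallows', true, 301);
exit;
?>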
I have a large directory of individual names along with generic, publicly available, category-specific information that I want indexed as much as possible in search engines. Listing these names on the site itself is not a concern to people, but some don't want to appear in search results when they "Google" themselves.
We want to continue listing these names within a page AND still have the page indexed, BUT not have specified names or keywords indexed by search engines.
Can this be done page by page, or would setting up two pages be a better workaround?
Options available:
PHP can censor keywords if user-agent=robot/search engine
.htaccess to keep robots away from the uncensored content, redirecting them to a second, censored version
meta tags defining words not to index?
JavaScript to hide keywords from robots while keeping them viewable to everyone else
I will go through the options and tell you some problems I can see:
PHP: If you don't mind trusting the user agent, this will work well. I am unsure how some search engines will react to different content being displayed for their bots.
htaccess: You would probably need to redirect the bot to a different page. You could use URL parameters, but this would be no different than using a pure PHP solution. The bot would index the page it is redirected to and not the page you want visitors to land on. You may be able to use the rewrite engine to overcome this.
meta tags: Even if you could use meta tags to get the bot to ignore certain words, there's no guarantee that search engines would honor it, since there is no set "standard" for such meta tags. But that doesn't matter, since I don't know of any way to get a bot to ignore certain words or phrases using meta tags anyway.
JavaScript: No bot I have ever heard of executes (or even reads) JavaScript when looking at a page, so I don't see this working. You could output the content you want hidden using JavaScript so bots won't be able to see it, but neither will users who have JavaScript disabled.
I would go the PHP route.
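A rough sketch of what that could look like (the bot signatures and the names to hide are placeholder assumptions):
<?php
// Show real names to people and a placeholder to crawlers, based on a
// naive user-agent check. Signatures and names below are examples only.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isBot = (bool) preg_match('/googlebot|bingbot|slurp|duckduckbot/i', $userAgent);

$names = array('John Doe', 'Jane Roe'); // names that should not be indexed
foreach ($names as $name) {
    echo ($isBot ? '[name withheld]' : htmlspecialchars($name)) . '<br>';
}
?>
Keep in mind the caveat above: serving bots different content than users is something search engines may not react well to.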
You can tell robots to skip indexing a particular page by adding the ROBOTS meta tag:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
UPDATE: The ways to restrict indexing of particular words I can think of are:
Use JS to add those to the page (see below).
Add a module to the server that would strip those words from the rendered page.
JavaScript could be something like this:
<p>
<span id="secretWord">
<script type="text/javascript">
// Build the protected word at runtime (e.g. by concatenating strings or
// using hex character codes) so it never appears verbatim in the HTML.
document.write('se' + 'cret ' + 'wo' + 'rd');
</script>
</span>
</p>
The server module is probably the best option. In ASP.NET it should be fairly easy to do; I'm not sure about PHP, though.
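Just as a hedged sketch of how such a "module" might look in PHP, here is an output-buffer callback that strips protected words before the page is sent (the bot check and word list are assumptions):
<?php
// Filter the final rendered HTML through a callback that removes protected
// words when the request appears to come from a crawler.
function censorForBots($html)
{
    $userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (!preg_match('/googlebot|bingbot|slurp/i', $userAgent)) {
        return $html; // ordinary visitors get the page untouched
    }
    $protectedWords = array('John Doe', 'Jane Roe'); // example words to strip
    return str_ireplace($protectedWords, '[hidden]', $html);
}

ob_start('censorForBots');
echo '<p>This page mentions John Doe.</p>'; // ...normal page rendering...
ob_end_flush();
?>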
What's not clear from your posting is whether you want to protect your names and keywords against Google, or against all search engines. Google is generally well-behaved. You can use the ROBOTS meta tag to prevent a page from being indexed, but it won't prevent search engines that ignore the ROBOTS tags from indexing your site.
Other approaches you did not suggest:
Having the content of the page fetched with client-side JavaScript.
Forcing the user to solve a CAPTCHA before displaying the text. I recommend the reCAPTCHA package, which is easy to use.
Of all these, the reCAPTCHA approach is probably the best, as it will also protect against ill-behaved spiders. But it is the most onerous for your users.