How to block specific URLs using robots.txt?

How can I block specific URLs using a robots.txt file? We do not want Google to crawl these pages on our site. How can I define a Disallow rule for those URLs in the robots.txt file?

To exclude all robots from part of the server, give each folder you want to block its own Disallow line. For example, to exclude these three folders:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
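The same approach works for specific URLs rather than whole folders: give each page its own Disallow line. A minimal sketch with placeholder paths (replace them with the actual URLs you want blocked):
User-agent: *
Disallow: /private-page.html
Disallow: /old/press-release-2009.html
Keep in mind that robots.txt only asks well-behaved crawlers not to fetch those paths; it does not guarantee the URLs will never show up in search results.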

Related

Pulling a URL with =VLOOKUP from a table

I'm trying to get the URL from this link: https://www.atanet.org/onlinedirectories/tsd_view.php?id=3856
I use the following formula: =VLOOKUP("Website",ImportXML(A1, "(//table[#id='tableTSDContent']//tr)"),2,0)
But unfortunately, it does not pull out the URL. I would really appreciate it if you could help me extract the URL in question.
I tried using the APIPheny add-on to import the data. After the <h2>Online Directories Listing</h2>, I saw a cell that said "Google bot blocked" or something to that effect.
I then went to the site's robots.txt file (https://www.atanet.org/robots.txt), which says:
User-agent: *
Disallow: /onlinedirectories/tsd_view.php*
Disallow: /onlinedirectories/tsd_search.php*
Disallow: /onlinedirectories/tsd_listings/tsd_view.fpl*
Disallow: /onlinedirectories/tsd_listings/tsd_search.fpl*
Disallow: http://www.atanet.org/bin/mpg.pl/28644.html
Disallow: /onlinedirectories/tsd_corp_listings/*
Disallow: /bin
Disallow: /division_calendar
User-agent: Googlebot
Disallow: /onlinedirectories/tsd_view.php*
Disallow: /onlinedirectories/tsd_search.php*
Disallow: /onlinedirectories/tsd_listings/tsd_view.fpl*
Disallow: /onlinedirectories/tsd_listings/tsd_search.fpl*
Disallow: /*division_calendar*
Disallow: /*bin*
Disallow: http://www.atanet.org/bin/mpg.pl/28644.html
User-agent: ITABot
Disallow: /onlinedirectories
I also think this means that the Google Sheets user agent is treated the same as the search engine's (Googlebot). If this is the case, then with Google Sheets you're out of luck here, because the tsd_view.php page you want is disallowed. Likely, this rule was put there because they didn't want Google (or other search engines, for that matter) to index people's contact information. Of course, a malicious web crawler could ignore the robots.txt, but Googlebot is a nice bot.

IBM Watson NLU (Natural Language Understanding) is being blocked by robots.txt

I'm using the IBM Watson Natural Language Understanding API to scan specific webpages to determine keywords and categories.
But I've run into an issue with some sites which have their robots.txt set to block website scanners.
I'm working directly with these sites, and they added the Watson agent string "watson-url-fetcher" to their robots.txt file.
The result is that this works only some of the time.
This simplified robots.txt file works:
User-agent: *
Disallow: /
User-agent: watson-url-fetcher
Disallow: /manager/
But if the order is changed, Watson no longer works.
The reordered robots.txt fails:
User-agent: watson-url-fetcher
Disallow: /manager/
User-agent: *
Disallow: /
Watson then returns the error:
{
"error": "request to fetch blocked: fetch_failed",
"code": 400
}
Is this an error with Watson, or do I need to instruct the websites to always put the User-agent: * at the top of the robots.txt file?
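For comparison, a standards-based parser such as Python's urllib.robotparser applies the group whose User-agent name matches and only falls back to the * group when nothing else matches, regardless of the order of the groups in the file. A quick sketch with the reordered rules:
from urllib.robotparser import RobotFileParser

reordered = """\
User-agent: watson-url-fetcher
Disallow: /manager/

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(reordered.splitlines())

print(rp.can_fetch("watson-url-fetcher", "/index.html"))  # True: the specific group applies
print(rp.can_fetch("watson-url-fetcher", "/manager/x"))   # False: blocked by its own group
print(rp.can_fetch("SomeOtherBot", "/index.html"))        # False: falls back to the * group
This only shows the common interpretation of the format, not how Watson's own fetcher parses it.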

Should I include mobile site URLs in robots.txt?

My boss is having me look at various ways to improve our site's SEO and I've been doing some research on it. I'm aware that search engines like mobile-friendly sites and I used Google's Webmaster Tools, finding that it considers our site to be mobile-friendly. However, we lack an adequate robots.txt file.
What we want to do is avoid getting the same page indexed twice (as desktop and mobile versions), and he recommended that I include our site's mobile URLs in the robots.txt file. However, will doing this damage our site's ranking? I get that files listed under robots.txt shouldn't be indexed, which raises concerns about whether or not people will be able to see results for our site when they search for it on their phones.
First, I would not recommend having two different sets of files or URLs for the mobile and regular sites, as the official Google blog recommends:
Sites that use responsive web design, i.e. sites that serve all devices on the same set of URLs, with each URL serving the same HTML to all devices and using just CSS to change how the page is rendered on the device. This is Google’s recommended configuration.
http://googlewebmastercentral.blogspot.ca/2012/06/recommendations-for-building-smartphone.html
Having said that, since you already have mobile versions and would like to block Googlebot from indexing multiple versions of the same page:
Blocking Googlebot-Mobile from desktop site
Desktop site: http://www.domain.com/robots.txt
User-agent: Googlebot
User-agent: Slurp
User-agent: bingbot
Allow: /
User-agent: Googlebot-Mobile
User-Agent: YahooSeeker/M1A1-R2D2
User-Agent: MSNBOT_Mobile
Disallow: /
Mobile site: http://m.domain.com/robots.txt
User-agent: Googlebot
User-agent: Slurp
User-agent: bingbot
Disallow: /
User-agent: Googlebot-Mobile
User-Agent: YahooSeeker/M1A1-R2D2
User-Agent: MSNBOT_Mobile
Allow: /
http://searchengineland.com/5-tips-for-optimal-mobile-site-indexing-107088
Robots.txt disallows crawling, not indexing.
So if you blocked your mobile URLs, bots would never even be able to see that you have a mobile site, which is probably not what you want.
Alternative
Tell bots what the links are about. Based on this declaration, bots can decide what they want to do with these URLs.
You can do this by providing the link types alternate and canonical:
alternate (defined in the HTML5 spec), to denote that it’s an "alternate representation of the current document".
canonical (defined in RFC 6596), to denote that the pages are the same, or that they only have trivial differences (e.g., different HTML structure, table sorted differently etc.), or that one is the superset of the other.
So if you want to use the URLs from the desktop site as canonical, you would use "alternate canonical" to link from mobile to desktop, and "alternate" to link from desktop to mobile. You can see an example in my answer to the Webmasters question Linking desktop and mobile pages.
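To make that concrete, using the same hypothetical hosts as in the earlier answer (www.domain.com for the desktop site, m.domain.com for the mobile site, and page.html as an example path), a desktop page could carry:
<link rel="alternate" href="http://m.domain.com/page.html">
and the corresponding mobile page could carry:
<link rel="alternate canonical" href="http://www.domain.com/page.html">
This tells bots that the two URLs are representations of the same content and that the desktop URL is the canonical one.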

How do some websites not show the page extension in the address bar?

It's just a question out of curiosity.
I have seen a lot of websites that don't show the page types/extensions in the address bar. For example, Stack Overflow's Ask Question page has the address stackoverflow.com/questions/ask instead of something like stackoverflow.com/questions/ask.php.
Do they use something to hide the page extension? Or why do I not see it?
I think it's a nice thing for page security.
You can do that using a .htaccess file.
There is something similar here: Remove .php extension with .htaccess
All the .htaccess answers that you have seen apply to traditional PHP applications, because those are all uploaded as normal files to the document root of a webserver. This means that each PHP file is "browsable" directly, assuming you haven't prevented this in your webserver configuration.
Stack Overflow (which is a .NET application) and other modern applications use a URL mapping paradigm - not only does this help with "clean" URLs, but also because cool URIs don't change. It really doesn't have anything to do with security.
So it is most likely that each URL is mapped to a function, and this function returns a response that is sent to the browser.
PHP frameworks offer the same - Laravel routing, Symfony routing and Zend Framework routing are all examples of this mapping paradigm.
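As a language-agnostic illustration of that mapping idea (sketched here in Python rather than in any particular PHP framework), the core is just a table from URL paths to handler functions:
# Toy URL-to-function mapping; not any framework's real API
def ask_question():
    return "Render the Ask Question form"

routes = {
    "/questions/ask": ask_question,
}

def handle(path):
    # Look up the handler for the requested path and call it
    handler = routes.get(path)
    return handler() if handler else "404 Not Found"

print(handle("/questions/ask"))
There is no ask.php file on disk; the path is simply a key the application looks up.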
A .htaccess (hypertext access) file is a directory-level configuration file supported by several web servers, that allows for decentralized management of web server configuration. They are placed inside the web tree, and are able to override a subset of the server's global configuration for the directory that they are in, and all sub-directories.
htaccess file
Rewrite Guides
More htaccess tips and tricks
Rewrite Url
Servers often use .htaccess to rewrite long, overly comprehensive URLs to shorter and more memorable ones.
Authorization, authentication
A .htaccess file is often used to specify security restrictions for a directory, hence the filename "access". The .htaccess file is often accompanied by a .htpasswd file which stores valid usernames and their passwords.
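For the authorization case, a minimal sketch of such a .htaccess (the AuthUserFile path is a placeholder for wherever your .htpasswd actually lives, and the server must allow AuthConfig overrides):
AuthType Basic
AuthName "Restricted Area"
AuthUserFile /path/to/.htpasswd
Require valid-user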
The three links given above will explain this in a better way.
This is done by using the .htaccess file to configure URL rewriting for the website.
Example:
RewriteEngine on
RewriteBase /
RewriteRule ([a-z]+)/?$ index.php?menu=$1 [NC,L]
This example rewrites a URL that looks like www.mydomain.com/home into www.mydomain.com/index.php?menu=home.
For more details, please search Stack Overflow or Google.

How to block robots without robots.txt

As we know, robots.txt helps us avoid indexing of certain webpages/sections by web crawlers/robots. But there are certain disadvantages to using this method: 1. the web crawlers might not obey the robots.txt file; 2. you are exposing the folders you want to protect to everybody.
Is there another way of blocking the folders you want to protect from crawlers? Keep in mind that those folders might still need to be accessible from a browser (like /admin).
Check the User-Agent header on requests and issue a 403 if the header contains the name of a robot. This will block all of the honest robots but not the dishonest ones. But then again, if the robot was really honest, it would obey robots.txt.
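A minimal sketch of that check for Apache, placed in the .htaccess of the directory you want to protect (the bot names are only examples, and mod_rewrite must be enabled):
RewriteEngine On
# Return 403 Forbidden when the User-Agent contains a known bot name
RewriteCond %{HTTP_USER_AGENT} (Googlebot|bingbot|Slurp|Baiduspider) [NC]
RewriteRule ^ - [F]
The same test can of course be done in application code instead of in the web server configuration.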

Resources