I'm on the learning curve with 301 redirects and have done a lot of research, including looking at answers on this forum. I haven't found the answer to my specific query, which requires removing elements from the middle of the requested URL.
Namely, I am building a new site with dynamic links (WordPress, but the question applies to any CMS).
I need to redirect from links (also dynamic) structured as:
sitename.com/issue/february-2016/post/dynamic-post-name
(february-2016 is an example - could be 'march-2014' or any of a range of terms)
to:
sitename.com/post/dynamic-post-name
Another way to say this: any request URL containing /article/ needs to grab that last string (which I think would be the wildcard?) and redirect it as: sitename.com/post/$1
Is this possible?
Update: With more research, I found a possible answer that worked in a testing tool, although I've not tested it live on my site.
Does this look correct?
RewriteRule ^([^/]+)/([^/]+)/article/([^.]+)$ article/$3 [QSA,L]
RewriteRule ^article/.*/(.*)$ post/$1 [QSA,L,R=301]
Something like this should work.
The characters captured within the parentheses (.*) will be available as $1.
Feel free to change article and post to fit your needs.
In this case, it will redirect
http://example.com/article/february-2016/post/dynamic-post-name
to
http://example.com/post/dynamic-post-name
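For completeness, here is how it might look in the site root's .htaccess (a sketch based on the rule above; [^/]+ is a slightly tightened version of (.*) that captures just the final path segment, and WordPress installs usually already have a rewrite block this should sit above):

# .htaccess in the site root -- a minimal sketch
RewriteEngine On

# Permanently redirect /article/<anything>/<last-segment> to /post/<last-segment>;
# the leading slash on the target anchors it at the domain root.
RewriteRule ^article/.*/([^/]+)$ /post/$1 [QSA,L,R=301]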
I have a couple of pages and URLs that I do not want to be crawled by Google's crawler.
I know this can be done via robots.txt. I searched Google and found that the entries need to be arranged in robots.txt like this to disallow the crawler, but I am not sure whether this is right or not.
User-Agent: *
Disallow: /music?
Disallow: /widgets/radio?
Disallow: /affiliate/
Disallow: /affiliate_redirect.php
Disallow: /affiliate_sendto.php
Disallow: /affiliatelink.php
Disallow: /campaignlink.php
Disallow: /delivery.php
Disallow: /music/+noredirect/
Disallow: /user/*/library/music/
Disallow: /*/+news/*/visit
Disallow: /*/+wiki/diff
# AJAX content
Disallow: /search/autocomplete
Disallow: /template
Disallow: /ajax
Disallow: /user/*/tasteomatic
Can I give the URL this way? I mean, can I specify a full URL in a Disallow line?
Disallow: http://www.bba-reman.com/admin/feedback.htm
EDIT
My current robots.txt entries look like this:
User-Agent: *
Disallow: /CheckLogin
Disallow: /DTC.pdf
Disallow: /catalogue/bmw.htm
Disallow: /auto-mine/bmw/index.htm
Disallow: /forums/parent.Jmp('i100')
Disallow: /forums/parent.Jmp('i040')
Disallow: /forums/CodeDescriptions.html
Disallow: /forums/parent.Jmp('i050')
Disallow: /forums/parent.Scl('000','24601')
Disallow: /forums/parent.Jmp('i030')
Disallow: /catalogue/peugeot.htm
Is it OK? Just tell me. Thanks.
The value of the Disallow field is always the beginning of the URL path.
So if your robots.txt is accessible from http://example.com/robots.txt, and it contains this line
Disallow: http://example.com/admin/feedback.htm
then URLs like these would be disallowed:
http://example.com/http://example.com/admin/feedback.htm
http://example.com/http://example.com/admin/feedback.html
http://example.com/http://example.com/admin/feedback.htm_foo
http://example.com/http://example.com/admin/feedback.htm/bar
…
So if you want to disallow the URL http://example.com/admin/feedback.htm, you have to use
Disallow: /admin/feedback.htm
which would block URLs like these:
http://example.com/admin/feedback.htm
http://example.com/admin/feedback.html
http://example.com/admin/feedback.htm_foo
http://example.com/admin/feedback.htm/bar
…
My website allows search engines to index the same page in two formats, like:
www.example.com/page-1271.html
www.example.com/page-1271-page-title.html
All my site's pages are like that. So how can I block the first format in the robots.txt file? I mean, is there a rule like:
Disallow: /page-(numbers).html
The original robots.txt specification has not defined any wildcards. (However, some parsers, like Google, have added wildcard support anyhow.)
If your concern is that search engines only index one of your two variants, there are alternatives to robots.txt:
You could redirect (with 301) from example.com/page-1271.html to example.com/page-1271-page-title.html. This solution would be the best, as now everyone (users, bots) will work with the same URL.
Or you could use the canonical link relation. On example.com/page-1271.html (or on both variants) you could add a link element to the head:
<link href="http://example.com/page-1271-page-title.html" rel="canonical" />
This tells search engine bots etc. to use the canonical URL instead of the current URL.
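If you take the redirect route, note that a plain rewrite pattern cannot derive the title from the number alone, so the server needs a lookup. A hedged sketch using Apache's RewriteMap (server/vhost config only, not .htaccess; the map file path and its contents below are assumptions for illustration):

# httpd.conf / vhost config -- RewriteMap cannot be declared in .htaccess.
# page-titles.txt is a hypothetical lookup file with lines like:
#   1271 page-title
RewriteEngine On
RewriteMap pagetitles txt:/etc/apache2/page-titles.txt

# Redirect /page-1271.html to /page-1271-page-title.html, but only
# when the map knows a title for that number:
RewriteCond ${pagetitles:$1} !=""
RewriteRule ^/page-(\d+)\.html$ /page-$1-${pagetitles:$1}.html [R=301,L]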
There is no such regexp option in robots.txt. You have a couple of options:
1) Place the robots disallow information into the head element of the HTML files (as a meta robots tag).
2) Write a script that adds every blockable HTML file as a separate line in robots.txt.
3) Place content pages in a separate directory and disallow access to that directory.
Some search engines (such as Google), but not all of them, respect pattern matching:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449&from=35237&rd=1
User-agent: *
Disallow: /page-*.html
Allow: /page-*-page-title.html
Here the Allow overrides the Disallow; this, too, is not supported by all search engines. The easiest approach would be to restructure your files (or add URL rewrites), or else place the robots information in the HTML files themselves, as sketched below.
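A minimal sketch of that last option, placed in the head of each page you want kept out of the index:

<!-- in the <head> of the short page-(number).html variant only -->
<meta name="robots" content="noindex">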
I have a website:
www.mydomain.com/subfolder/subfolder/index.php
How can I always hide the directory names in the URL? I mean, always hide the two subfolders' names in the URL as the page changes.
Example:
These urls:
www.mydomain.com/subfolder/subfolder/index.php
www.mydomain.com/subfolder/subfolder/about.php
www.mydomain.com/subfolder/subfolder/contact.php
...
Becomes:
www.mydomain.com/index
www.mydomain.com/about
www.mydomain.com/contact
...
I also want to use the last-mentioned URLs for requesting these pages, without having to type a horribly long URL.
You can find some good hints if you look up mod_rewrite and how to use it in .htaccess.
If you have access to the rewrite engine, you can use a simple rewriting pattern similar to this (in a per-directory .htaccess file the pattern must not start with a slash; note that on its own this rule will loop, so see the fuller sketch below):
RewriteRule ^(.*)$ /subfolder/subfolder/$1.php
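A slightly fuller, loop-safe sketch for the site root's .htaccess (the directory names are placeholders from the question; adjust the character class to match your page names):

# .htaccess in the site root -- a sketch, not a drop-in rule set
RewriteEngine On

# Skip requests that already point into the real directory, and skip
# existing files, so the rule cannot loop:
RewriteCond %{REQUEST_URI} !^/subfolder/subfolder/
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^([a-z0-9-]+)$ subfolder/subfolder/$1.php [L]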
One of our websites has a URL like this: example.oursite.com. We decided to move the site to a URL like this: www.oursite.com/example. To do this, we wrote a rewrite rule on our Apache server that redirects to the new URL with a 301 code.
Many websites link to us with URLs of the form example.oursite.com/#id=23. The problem is that the redirection erases the hash part of the URL in IE. As far as I know, the hash part is never sent to the server.
I wanted to implement the redirection with JavaScript to keep the hash part, but then search engines would not be aware that our URL changed (no 301 code returned).
I want search engines to be notified of our new URL (301), because we need to transfer the page rank to it.
Is there a way to redirect with a 301 code and keep the hash part (#id=23) of the URL?
Search engines do in fact care about hash fragments; they frequently use them to highlight specific content on a page.
As to the question, however: anchor locations are unfortunately not sent to the server as part of the HTTP request. If you want to redirect a user, you will need to do it in JavaScript on the client side.
Good article: http://web.archive.org/web/20090508005814/http://www.mikeduncan.com/named-anchors-are-not-sent/
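A minimal client-side sketch of that idea (hypothetical: it assumes the fragment can simply be carried over to the new address unchanged):

// On a page served at exemple.oursite.com: forward the visitor to the
// new home, carrying the fragment along (target URL taken from the question).
if (window.location.hash) {
    window.location.replace('http://www.oursite.com/exemple/' + window.location.hash);
}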
Seeing as the server will never see the # (ruling out 301 Redirects) and Google has deprecated their AJAX Crawling scheme, it seems that a front-end solution is the only way!
How I did it:
(function() {
    // Map each hash route to its new, real URL.
    var redirects = [
        ['#!/about', '/about'],
        ['#!/contact', '/contact'],
        ['#!/page-x', '/pageX']
    ];
    // If the current fragment matches a known route, replace() the
    // location so the old hash URL doesn't stay in the history.
    for (var i = 0; i < redirects.length; i++) {
        if (window.location.hash == redirects[i][0]) {
            window.location.replace(redirects[i][1]);
        }
    }
})();
I'm assuming that because Google's crawlers do indeed execute JavaScript, the new pages will be indexed properly.
I've put it in a <script> tag directly underneath the <title> tag, so that it gets executed before any other JS/CSS. Note that this script should only be required for your index file.
I am fairly certain that the hash/page anchor/bookmark part of a URL is not indexed by search engines, and therefore has no effect on your page ranking. Doing a Google search for "inurl:#" returns zero documents, which backs up my assumption. Links from external sites will be indexed without the hash.
You are right that the hash part isn't sent to the server, so as far as I am aware, there isn't a good way to create a redirection URL with the hash in it.
Because of this, it's up to the browser to correctly manage the hash during a redirect. Firefox 3.5 appears to do this successfully. If you append a hash to a URL that has a known redirect, you will see the URL change in the address bar to the new location, but the hash stays on there successfully.
Edit: In response to the comment below, if there isn't a hash sign in the external URL for the part you need, then it is entirely possible to rewrite the URL. An Apache rewrite rule would take care of it (the condition must match the old host, and R=301 makes the redirect permanent):
RewriteCond %{HTTP_HOST} ^exemple\.oursite\.com$ [NC]
RewriteRule ^/?(.*) http://www.oursite.com/exemple/$1 [L,R=301]
If you're not using Apache, then you'll have to look into the server docs for something similar.
Google has a special syntax for AJAX applications that is based on hash URLs: http://code.google.com/web/ajaxcrawling/docs/getting-started.html
You could create a page at the old address that catches all requests and redirects to the new site with the correct address and code.
I did something like that, but it was in ASP.NET, which I guess is not the language you use. Anyway, there should be a way to do this in any language.
When returning status 301, your server is supposed to return a 'Location:' header which points to the new location. In practice, the way this is implemented varies; some servers provide the full URL (netloc and path), some just provide the new path and expect the browser to look for that path on the original netloc. It sounds like your rewrite rule is stripping the path.
An easy way to see what the returned Location header is, in the Python shell:
>>> import http.client
>>> conn = http.client.HTTPConnection('exemple.oursite.com')
>>> conn.request('HEAD', '/')
>>> res = conn.getresponse()
>>> print(res.getheader('location'))
I'm afraid I don't know enough about mod_rewrite to tell you how to do the rewrite rule correctly, but this should give you an idea of what your server is actually telling clients to do.
The search bots don't care about hash fragments. And if you are using them for some kind of Flash or AJAX calls, you have more serious problems than your 301 redirects not working: unless you have the content in an alternate form, the search engines are not indexing your site, and you are definitely suffering as far as SEO goes.
I just registered my account, so I can't edit.
zombat: I'm sorry, I made a mistake in my comment. The link to our video is exemple.oursite.com/#video_id=233. In this case, my rewrite rule in Apache doesn't work.
Nick Berardi: We changed the way our links work. We don't use # anymore, except for backward compatibility.