Blocking URLs that contain numbers in robots.txt

My website allows search engines to index the same page in two formats, like:
www.example.com/page-1271.html
www.example.com/page-1271-page-title.html
All my site's pages are like that. So how can I block the first format in the robots.txt file? I mean, is there a rule like:
Disallow: /page-(numbers).html

The original robots.txt specification does not define any wildcards. (However, some parsers, such as Google's, have added wildcard support anyway.)
If your goal is for search engines to index only one of your two variants, there are alternatives to robots.txt:
You could redirect (with 301) from example.com/page-1271.html to example.com/page-1271-page-title.html. This would be the best solution, as everyone (users and bots) then works with the same URL. A sketch of such a rule follows at the end of this answer.
Or you could use the canonical link relation. On example.com/page-1271.html (or on both variants) you could add a link element to the head:
<link href="http://example.com/page-1271-page-title.html" rel="canonical" />
This tells search engine bots etc. to use the canonical URL instead of the URL they are currently visiting.
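For the redirect option, a minimal .htaccess sketch, assuming Apache with mod_rewrite. The rule is hard-coded to the example page, since the title part cannot be derived from the number alone; a real site would need one rule per page or an application-level lookup.

RewriteEngine On
RewriteRule ^page-1271\.html$ /page-1271-page-title.html [R=301,L]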

There is no such regexp option in robots.txt. You have a few options:
1) Place the robots information into the head element of the HTML files (a robots meta tag; see the example at the end of this answer).
2) Write a script that adds every blockable HTML file as a separate Disallow line in robots.txt (see the sketch after this list).
3) Place content pages in a separate directory and disallow access to that directory.
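A minimal Python sketch for option 2, assuming the pages are plain files in the document root (the path and naming pattern are assumptions based on the question):

from pathlib import Path
import re

site_root = Path("/var/www/html")             # assumption: the site's document root
short_form = re.compile(r"^page-\d+\.html$")  # matches page-1271.html, not page-1271-page-title.html

lines = ["User-agent: *"]
for page in sorted(site_root.glob("page-*.html")):
    if short_form.match(page.name):
        lines.append("Disallow: /" + page.name)

(site_root / "robots.txt").write_text("\n".join(lines) + "\n")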
Some search engines (such as Google), but not all of them, respect pattern matching:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449&from=35237&rd=1
User-agent: *
Disallow: /page-*.html
Allow: /page-*-page-title.html
Here the Allow overrides the Disallow; this, too, is not supported by all search engines. The easiest approach would be to restructure your files (or add URL rewrites), or else place the robots information into the HTML files themselves.
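For option 1, the in-page equivalent is a robots meta tag in the head of each short-format file:

<meta name="robots" content="noindex">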

Are URLs with &amp; instead of & treated the same by search engines?

I'm validating one of my web pages and it's throwing up errors like the one below:
& did not start a character reference. (& probably should have been escaped as &amp;.)
This is because on my page I am linking to internal webpages which have &'s in the URL, as below:
www.example.com/test.php?param1=1&param2=2
My question is: if I change the URLs in the a hrefs to use &amp;, as below:
www.example.com/test.php?param1=1&amp;param2=2
will Google and other search engines treat the 2 URLs above as separate pages, or will they treat them both as the one below?
www.example.com/test.php?param1=1&param2=2
I don't want to lose my search engine rankings.
There is no reason to assume that search engines would knowingly ignore how HTML works.
Take, for example, this hyperlink:
<a href="http://example.com/test.php?param1=1&amp;param2=2">…</a>
The URL is not http://example.com/test.php?param1=1&amp;param2=2!
It's just the way the URL http://example.com/test.php?param1=1&param2=2 is stored in attributes in an HTML document.
So when a conforming consumer comes across this hyperlink, it never visits http://example.com/test.php?param1=1&amp;param2=2.
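You can see the same thing with a few lines of Python: unescaping the attribute value yields the URL a parser actually follows.

from html import unescape

# The href exactly as it is stored in the HTML attribute:
href_in_source = "http://example.com/test.php?param1=1&amp;param2=2"
print(unescape(href_in_source))
# prints: http://example.com/test.php?param1=1&param2=2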

Is the format of URLs imported to IP4.1 from IP3.9 correct?

I have now successfully imported the text, pictures and pages from IP3.9 to IP4.1.
IP4.1 on localhost has truncated URLs.
For example, this URL in IP3.9:
localhost/ip39/en/top/graphene/cvd-graphen/cvd-on-metals/multilayer-graphene-on-nickel-foil/
when imported to IP4.1 becomes
localhost/ip41/multilayer-graphene-on-nickel-foil
Is this normal? If IP4 changes the format of the URLs, then I think all the Google links will be lost.
Alan
Since v4.x we removed the requirement to have the language, zone, or parent page in a page's path. This means that each page (regardless of its location in the menu tree) can have any URL starting from the root. You can use URLs with slashes, too.
You have two options:
Manually change all paths back to the format you need (you can do that in the archive before the import, too).
Create the required redirects so search engines understand what happened (see the sketch below).
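A sketch of such a redirect in .htaccess, assuming Apache with mod_rewrite and using the paths from your example (one rule per moved page):

RewriteEngine On
RewriteRule ^en/top/graphene/cvd-graphen/cvd-on-metals/multilayer-graphene-on-nickel-foil/?$ /multilayer-graphene-on-nickel-foil [R=301,L]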

Does letter casing of directories and urls matter in .NET MVC?

Say I have a TitleCase directory name, but call an item within that directory using a lowercase url.
Does that have any effect or impact?
For example, does the server need to do a redirect from the incorrect lettercase to the correct lettercase?
Example
A file here: /PlugIns/CMSPages/Images/my-image.jpg
Called with: /plugins/cmspages/images/my-image.jpg
The routing engine isn't case-sensitive.
One thing to be wary of if you are referring to page URLs: Google treats lowercase and uppercase URLs as different pages, so you want to use rel="canonical" to ensure Google and other search engines know it is one page, whether the URL is upper- or lowercase.
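For example, a page reachable at both /Products/Detail and /products/detail could declare one casing as canonical in its head (hypothetical URL):

<link rel="canonical" href="http://www.example.com/products/detail" />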

Can I create a clean URL using WebBroker and Delphi?

Can I create a clean URL for WebBroker webpages/applications ?
A typical WebBroker URL normally looks like:
http://www.mywebsite.com/myapp.dll?name=fred
or
http://www.mywebsite.com/myapp.dll/names/fred
What I would prefer is:
http://www.mywebsite.com/names/fred
Any idea how I can achieve this with Delphi/WebBroker ? (ISAPI/Apache)
The typical way of doing this is to use Apache's mod_rewrite to rewrite requests to the URL with parameters. Many, many applications do this to create 'human-readable' and more search-engine-friendly URLs.
For example, you might add this rule to make action=sales&year=2009 look like sales-2009.htm:
RewriteRule ^sales-2009\.html?$ index.php?action=sales&year=2009 [L]
When the user goes to 'sales-2009.htm', the request is actually rewritten to the PHP page with the appropriate parameters. To the end user, though, the browser's URL bar still displays sales-2009.htm.
You can, of course, use regular expressions with mod_rewrite to make the rewrites much more flexible. You could, for example, replace the rule above with a single expression that maps any year to the correct parameter, and apply the same idea to the WebBroker URLs from the question.
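A sketch of both rules (the \d{4} capture carries any year through as $1; the myapp.dll path is the one from the question, and the exact layout depends on your ISAPI setup):

# Map any year, not just 2009:
RewriteRule ^sales-(\d{4})\.html?$ index.php?action=sales&year=$1 [L]
# The same idea for the WebBroker URLs:
RewriteRule ^names/(.+)$ /myapp.dll/names/$1 [L]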

Why don't we use such URL formats?

I am reworking the URL formats of my project. The basic format of our search URLs is this:
www.projectname/module/search/<search keyword>/<exam filter>/<subject filter>/... other params ...
When searching with no search keyword and no exam filter, the URL will be:
www.projectname/module/search///<subject filter>/... other params ...
My question is: why don't we see such URLs with back-to-back slashes (3 slashes after www.projectname/module/search)? Please note that I am no longer using .htaccess rewrite rules in my project. This URL works perfectly in practice. So, should I use this format?
For more details on why we chose this format, please check my other question:
Suggest best URL style
Web servers will typically remove multiple slashes before the application gets to see the request, for a mix of compatibility and security reasons. When serving plain files, it is usual for any number of slashes between path segments to behave as a single slash.
Blank path segments are not invalid in URLs, but they are typically avoided because relative URLs with blank segments may resolve unexpectedly. For example, on a page at /module/search, a link to //subject/param is not relative to the page: it is a protocol-relative link to the host subject with the path /param.
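A quick Python sketch of that resolution (host and paths are the ones from the example):

from urllib.parse import urljoin

print(urljoin("http://www.projectname/module/search", "//subject/param"))
# prints: http://subject/param -- "subject" became the host, not a path segment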
Whether you can see the multiple-slash sequences from the original URL depends on your server and application framework. In CGI, for example (and other gateway standards based on it), the PATH_INFO variable that is typically used to implement routing will usually omit multiple slashes. But on Apache there is a non-standard environment variable REQUEST_URI which gives the original form of the request without having elided slashes or done any %-unescaping like PATH_INFO does. So if you want to allow empty path segments, you can, but it'll cut down on your deployment options.
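A minimal CGI sketch in Python showing the difference (REQUEST_URI is only present on servers that set it, such as Apache):

import os

path_info = os.environ.get("PATH_INFO", "")      # gateway-processed: %-decoded, slashes usually collapsed
request_uri = os.environ.get("REQUEST_URI", "")  # Apache-specific: the raw request path, untouched
print("Content-Type: text/plain")
print()
print("PATH_INFO:   " + path_info)
print("REQUEST_URI: " + request_uri)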
Strings other than the empty string can also make poor path segments. An encoded / (%2F), \ (%5C) or null byte (%00) is blocked by default by many servers. So you can't put just any string in a segment; it has to be processed to remove some characters (often 'slug'-ified to remove all but letters and numbers). Whilst you are doing this, you may as well replace the empty string with _ (see the sketch below).
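A sketch of such a slug function in Python (the replacement rules are the ones suggested above):

import re

def to_segment(value):
    # Keep letters and digits; collapse every other run of characters into a hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", value).strip("-").lower()
    return slug or "_"   # an empty segment becomes _, as suggested above

print(to_segment("CVD graphene / nickel"))  # cvd-graphene-nickel
print(to_segment(""))                       # _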
Probably because it's not clearly defined whether the extra / should be ignored.
For instance, http://news.bbc.co.uk/sport and http://news.bbc.co.uk//////////sport both display the same page in Firefox and Chrome. That server treats the two URLs as the same thing, whereas your server obviously does not.
I'm not sure whether this behaviour is defined somewhere or not, but it does seem to make sense (at least for the BBC website - if I type an extra /, it does what I meant it to do.)
