Can I include canonical URLs in sitemap for SEO?

Can I include canonical URLs in sitemaps for SEO?
For example, www.example.com/url.html is a duplicate of www.example2.com/url.html.
So I added the following tag to the www.example.com/url.html page so that it is not penalized by search engines:
<link rel="canonical" href="https://www.example2.com/url.html">
Now my question is: can I list the www.example.com/url.html URL in www.example.com/sitemap.xml?
I already list the www.example2.com/url.html URL in www.example2.com/sitemap.xml.
Please suggest what I should do.

You can include both pages in your sitemap.xml, and it won't be a problem for SEO because you're using the rel="canonical" tag. When web crawlers try to index the duplicate page, they will see the rel="canonical" tag and index the second page (the good one) instead.

For better indexing, list only one URL, the canonical one, in the XML sitemap.
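As a rough illustration of the "canonical URLs only" advice, here is a minimal Python sketch that builds a sitemap from (page URL, canonical URL) pairs and skips any page whose canonical points elsewhere. The helper name build_sitemap, the https scheme, and the about.html page are assumptions made up for the example, not part of the question.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML listing only self-canonical pages.

    pages: iterable of (url, canonical_url) pairs (hypothetical input shape).
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    seen = set()
    for url, canonical in pages:
        # Skip duplicates whose canonical is another URL, and repeated entries.
        if url != canonical or url in seen:
            continue
        seen.add(url)
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = url
    return ET.tostring(urlset, encoding="unicode")


# Pages of www.example.com; about.html is an invented second page.
PAGES = [
    ("https://www.example.com/url.html", "https://www.example2.com/url.html"),
    ("https://www.example.com/about.html", "https://www.example.com/about.html"),
]
sitemap_xml = build_sitemap(PAGES)
```

With this input, url.html is left out of www.example.com/sitemap.xml because its canonical lives on www.example2.com, while about.html is kept.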

Related

The canonical link points to the site root error

I have a website that uses country-specific pages. So for every page, there is a country-specific URL. For example: example.com/au/blog, example.com/us/blog, example.com/uk/blog. This is great as we can show content more relevant to each country.
This idea is the same for our home page: example.com/au, example.com/us, example.com/uk.
When a user goes to a non-country specific URL (ie example.com, example.com/blog) the server falls back to serving more generalised content. On the client, we then show a banner for the user to decide if they want to go to a country-specific page.
With this in mind, I have added the following tags and receive the error below when testing with Lighthouse.
<link rel="canonical" href="https://www.example.com/">
<link rel="alternate" hreflang="x-default" href="https://www.example.com/">
<link rel="alternate" hreflang="en-GB" href="https://www.example.com/uk/">
<link rel="alternate" hreflang="en-US" href="https://www.example.com/us/">
<link rel="alternate" hreflang="en-AU" href="https://www.example.com/au/">
//error
The document does not have a valid rel=canonical. Points to the domain's root URL (the homepage), instead of an equivalent page of content.
Is this the correct way to inform crawlers that:
The site root is the original document
The site root doesn't target any language or locale
The alternatives to this page are en-GB, en-US and en-AU
If so, why does Lighthouse complain about this error on the home page? It doesn't complain about this on any other page.
I am new to canonicalisation and providing alternative lang pages so I might be missing something obvious.
Since your home page has a generalized subset of content, it is not canonical. (The canonical context IRI and target IRI shouldn't be the same or the user would already be on the canonical IRI.) Technically, the per language IRIs are canonical and alternate documents, depending on the language. Since you're in the UK, you should specify the en-GB IRI to be the canonical IRI and the others to be alternates (not "alternatives") since they are simply different representations of the same content therein.
From the home page https://www.example.com/ (the generalized content):
<link rel="canonical" hreflang="en-GB" href="https://www.example.com/uk/">
From https://www.example.com/uk/ (the canonical IRI):
<link rel="alternate" hreflang="en-US" href="https://www.example.com/us/">
<link rel="alternate" hreflang="en-AU" href="https://www.example.com/au/">
https://www.rfc-editor.org/rfc/rfc6596
https://www.rfc-editor.org/rfc/rfc8288
https://html.spec.whatwg.org/#the-link-element
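Writing one hreflang tag per country by hand is where typos in the host name tend to creep in, so generating the tags from a single mapping may help. The sketch below is only an illustration: the function name alternate_links and the LOCALES mapping are assumptions, and it does not decide which URL should be canonical.

```python
from html import escape

# Hypothetical mapping of hreflang values to the site's country path prefixes.
LOCALES = {"en-GB": "uk", "en-US": "us", "en-AU": "au"}


def alternate_links(origin, locale_prefixes, path=""):
    """Build one rel="alternate" link tag per locale for the given page path."""
    links = []
    for hreflang, prefix in locale_prefixes.items():
        href = f"{origin}/{prefix}/{path}"
        links.append(
            '<link rel="alternate" hreflang="{}" href="{}">'.format(
                escape(hreflang, quote=True), escape(href, quote=True)
            )
        )
    return links


home_links = alternate_links("https://www.example.com", LOCALES)
blog_links = alternate_links("https://www.example.com", LOCALES, "blog")
```

Because every href is derived from the same origin string, all generated tags share one host name.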

Canonical URLs and trailing slash on root URL

When working with canonical URLs, does a trailing slash make any difference on a root URL?
I put the following canonical tag into my Rails site head:
<link rel="canonical" href="<%= url_for(:only_path => false) %>" />
...to ensure any parameterized URLs resolve to the base URL.
However, when I navigate to http://www.example.com, the canonical link shows up with a slash at the end:
<link rel="canonical" href="http://www.example.com/" />
I know trailing slashes DO matter when a path element is present in the URL, but I thought they didn't matter on root URLs. However, I then ran into Matt Cutts' presentation on canonical tags, where he clearly states that they are considered different URLs:
From http://www.mattcutts.com/blog/canonical-link-tag/ (See slide 3):
These URLs are all different:
www.example.com
example.com
www.example.com/
example.com/
Can anyone shed some light on what he means?
URLs which point to directory names (often with the expectation that a web server handler will return some sort of 'index') without a trailing slash are actually invalid. Most web servers will automagically correct these requests with a redirect to the same URL with a trailing slash added.
So behind the scenes, your request for http://example.com results in a redirect from the web server to http://example.com/ which is why you're seeing the surprise trailing slash.
The short answer is that proper URI paths matter everywhere, root directory or not. For a deeper and more ranty answer, take a look at this page.
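For the root URL specifically, the two spellings can be compared with a small Python sketch. RFC 3986 (section 6.2.3) says a URI with an authority and an empty path should be normalized to a path of "/", and the code below applies only that one rule. The function name normalize_root is an assumption for illustration, and it says nothing about how a particular search engine treats the two forms.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_root(url: str) -> str:
    """Replace an empty path with "/" and leave everything else untouched."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


without_slash = normalize_root("http://www.example.com")
with_slash = normalize_root("http://www.example.com/")
blog = normalize_root("http://www.example.com/blog")
```

Here both root spellings end up as http://www.example.com/, while a non-root path such as /blog is not given a trailing slash, which matches the point that slashes on path elements are significant.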

MVC / .NET Root URLs

In my layout page, I have:
<link href="~/Content/bootstrap.css" rel="stylesheet">
My understanding is that this should not be altered when it is sent to the client. However, when I set up the website as a virtual application under a "myapp" folder in IIS, the HTML is:
<link href="/myapp/Content/bootstrap.css" rel="stylesheet">
I'm a bit confused as I had thought I would need to change these URLs to:
<link href="@Url.Content("~/Content/bootstrap.css")" rel="stylesheet">
in order for this to work correctly.
So do I need to use Url.Content to get the correct root URL of the app/website, or can I just put tildes into the actual HTML src + href attributes, and assume they will be output correctly by IIS?
As of ASP.NET MVC version 4 (or actually Razor version 2), the tilde links are essentially shortcuts to Url.Content(..).
You actually answered your own question. Yes, you should use Url.Content() for your relative paths. A plain tilde in front of a relative path is only parsed in the client's browser, which treats all URLs under http://www.foo.com/ as a single domain, so it will look for resources at http://www.foo.com/ and not at http://www.foo.com/myapp/.
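The rewriting observed in the question ("~/Content/bootstrap.css" becoming "/myapp/Content/bootstrap.css") can be pictured with a short Python sketch of app-relative path resolution. This is only an illustration of the idea, not ASP.NET's actual implementation; the function name resolve_app_relative and its parameters are made up for the example.

```python
def resolve_app_relative(app_root: str, path: str) -> str:
    """Expand a leading "~/" to the application's virtual root.

    app_root: virtual directory of the application, e.g. "/myapp" or "/".
    path: a path that may start with "~/"; other paths are returned unchanged.
    """
    if not path.startswith("~/"):
        return path
    root = "/" + app_root.strip("/")
    if root == "/":
        # Application hosted at the site root.
        return "/" + path[2:]
    return root + "/" + path[2:]


under_virtual_app = resolve_app_relative("/myapp", "~/Content/bootstrap.css")
at_site_root = resolve_app_relative("/", "~/Content/bootstrap.css")
```

The same "~/" path yields /myapp/Content/bootstrap.css under the virtual application and /Content/bootstrap.css when the site is hosted at the root.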

Remove Page from being indexed in Google, Yahoo, Bing [duplicate]

I don't want the search engines to index my imprint page. How could I do that?
You can also add the following meta tag to the HEAD of that page:
<meta name="robots" content="noindex,nofollow" />
You need a simple robots.txt file. Basically, it's a text file that tells search engines not to index particular pages.
You don't need to include it in the header of your page; as long as it's in the root directory of your website it will be picked up by crawlers.
Create it in the root folder of your website and put the following text in:
User-Agent: *
Disallow: /imprint-page.html
Note that you'd replace imprint-page.html in the example with the actual name of the page (or the directory) that you wish to keep from being indexed.
That's it! If you want to get more advanced, you can check out here, here, or here for a lot more info. Also, you can find free tools online that will generate a robots.txt file for you (for example, here).
You can set up a robots.txt file to try to tell search engines to ignore certain directories.
See here for more info.
Basically:
User-agent: *
Disallow: /[directory or file here]
<meta name="robots" content="noindex, nofollow">
Just include this line inside the <head> of your page. I recommend this because a robots.txt file is a poor way to hide URLs such as login pages or other protected URLs that you don't want to show to other people or to search engines.
Anyone can open the robots.txt file directly on your website and see which URLs you consider secret. So what is the point of hiding them in robots.txt?
The better way is to include the meta tag above, which does not expose those URLs to anyone.
Nowadays, the best method is to use a robots meta tag and set it to noindex,follow:
<meta name="robots" content="noindex, follow">
Create a robots.txt file and set the controls there.
Here are the docs for google:
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
A robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:
You can explicitly disallow a page:
User-agent: *
Disallow: /~joe/junk.html
Please visit the link below for details:
robots.txt
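To check what a given robots.txt actually blocks, the rules can be tested locally. The sketch below uses Python's standard urllib.robotparser with the imprint example from above. The helper name is_crawl_allowed and the www.example.com host are assumptions for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Rules taken from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /imprint-page.html
"""


def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


imprint_allowed = is_crawl_allowed(ROBOTS_TXT, "http://www.example.com/imprint-page.html")
welcome_allowed = is_crawl_allowed(ROBOTS_TXT, "http://www.example.com/welcome.html")
```

Note that robots.txt governs crawling, not indexing: a crawler that is blocked from a page never reads a robots meta tag on it, so the two approaches are usually not combined for the same URL.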

URL with trailing slash breaks layout - how can I fix this without using base tag?

I am developing a website, and I have set up the following htaccess rules
RewriteEngine On
#RewriteRule ^([0-9]+)?$ index.php?page=$1
#RewriteRule ^([0-9]+)/([0-9]+)?$ index.php?page=$1&post=$2
RewriteRule ^([A-Za-z0-9-]+)?$ index.php?pagename=$1
RewriteRule ^([A-Za-z0-9-]+)/([A-Za-z0-9-]+)?$ index.php?pagename=$1&post=$2
This ensures that instead of http://www.mysite.com/index.php?page=2 showing the page
I get to use the friendlier method of http://www.mysite.com/about-us
* note I have not included a trailing slash.
In the page my css files are included as:
<link href="css/style.css" rel="stylesheet" type="text/css" />
and located at www.mysite.com/css/style.css
And this works well, however if I want to include a trailing slash (i.e. http://www.mysite.com/about-us/)
Then my css files do not load and I get an error where the Firefox source browser says:
The requested URL http://www.mysite.com/about-us/css/style.css was not found on this server.
This is because the page is determining about-us to be a directory instead of a page.
I am not keen to use the base tag, like <base href="http://www.mysite.com/" />
Are there any other options?
Relative URLs are resolved from a base URL that is the URL of the document the relative URL is used in if not specified otherwise.
Now to fix this incorrect reference, you have two options:
change the base URL using the BASE element, or
change the reference
    by adjusting the relative URL path to the base URL path, or
    by using just an absolute URL path, or
    by using an absolute URL.
Since you don’t want to use the BASE element, you will probably need to adjust the URL you are using to reference the external resource.
The simplest would be to use the absolute URL path /css/style.css instead so that it is independent from the actual base URL path.
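The resolution behaviour described above can be reproduced with Python's standard urljoin, which follows the same base-URL rules a browser applies. This is just a sketch using the URLs from the question; the variable names are illustrative.

```python
from urllib.parse import urljoin

# Same relative reference, resolved against the page URL with and without a trailing slash.
no_slash = urljoin("http://www.mysite.com/about-us", "css/style.css")
with_slash = urljoin("http://www.mysite.com/about-us/", "css/style.css")

# An absolute URL path ignores the base path, so the trailing slash no longer matters.
absolute_path = urljoin("http://www.mysite.com/about-us/", "/css/style.css")
```

With the trailing slash, the relative reference resolves inside /about-us/, which is the 404 from the question. The absolute path /css/style.css resolves to http://www.mysite.com/css/style.css in both cases.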
