Should I include mobile site URLs in robots.txt? - url

My boss is having me look at various ways to improve our site's SEO and I've been doing some research on it. I'm aware that search engines like mobile-friendly sites and I used Google's Webmaster Tools, finding that it considers our site to be mobile-friendly. However, we lack an adequate robots.txt file.
What we want to do is avoid getting the same page indexed twice (as desktop and mobile versions), and he recommended that I include our site's mobile URLs in the robots.txt file. However, will doing this damage our site's ranking? I get that files listed under robots.txt shouldn't be indexed, which raises concerns about whether or not people will be able to see results for our site when they search for it on their phones.

First of all, I would not recommend having two separate sets of files or URLs for mobile and desktop sites, as the official Google blog recommends responsive design instead:
Sites that use responsive web design, i.e. sites that serve all devices on the same set of URLs, with each URL serving the same HTML to all devices and using just CSS to change how the page is rendered on the device. This is Google's recommended configuration.
http://googlewebmastercentral.blogspot.ca/2012/06/recommendations-for-building-smartphone.html
Having said that, since you already have mobile versions and would like to block Googlebot from indexing multiple versions of the same URL:
Blocking Googlebot-Mobile from desktop site
Desktop site: http://www.domain.com/robots.txt
User-agent: Googlebot
User-agent: Slurp
User-agent: bingbot
Allow: /
User-agent: Googlebot-Mobile
User-Agent: YahooSeeker/M1A1-R2D2
User-Agent: MSNBOT_Mobile
Disallow: /
Mobile site: http://m.domain.com/robots.txt
User-agent: Googlebot
User-agent: Slurp
User-agent: bingbot
Disallow: /
User-agent: Googlebot-Mobile
User-Agent: YahooSeeker/M1A1-R2D2
User-Agent: MSNBOT_Mobile
Allow: /
http://searchengineland.com/5-tips-for-optimal-mobile-site-indexing-107088

Robots.txt disallows crawling, not indexing.
So if you blocked your mobile URLs, bots would never even be able to see that you have a mobile site, which is probably not what you want.
Alternative
Tell bots what the links are about. Based on this declaration, bots can decide what they want to do with these URLs.
You can do this by providing the link types alternate and canonical:
alternate (defined in the HTML5 spec), to denote that it’s an "alternate representation of the current document".
canonical (defined in RFC 6596), to denote that the pages are the same, or that they only have trivial differences (e.g., different HTML structure, table sorted differently etc.), or that one is the superset of the other.
So if you want to use the URLs from the desktop site as canonical, you would use "alternate canonical" to link from mobile to desktop, and "alternate" to link from desktop to mobile. You can see an example in my answer to the Webmasters question Linking desktop and mobile pages.
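A minimal sketch of that markup, using www.domain.com and m.domain.com from the robots.txt examples above as stand-in hostnames and /page as a placeholder path:
<!-- In the head of the desktop page, http://www.domain.com/page -->
<link rel="alternate" href="http://m.domain.com/page">
<!-- In the head of the corresponding mobile page, http://m.domain.com/page -->
<link rel="alternate canonical" href="http://www.domain.com/page">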

Related

How can I allow Mixed contents (http with https) using content-security-policy meta tag?

I'm forcing https for access to my website, but some of the content has to be loaded over http (for example, video content that cannot be served over https), and the browsers block those requests because of the mixed-content policy.
After hours of searching I found that I can use Content-Security-Policy, but I have no idea how to allow mixed content with it.
<meta http-equiv="Content-Security-Policy" content="????">
You can't.
CSP is there to restrict content on your website, not to loosen browser restrictions.
Secure https sites give users certain guarantees, and it's not really fair to then allow http content to be loaded over them (hence the mixed-content warnings), and really not fair if you could hide those warnings without your users' consent.
You can use CSP for a couple of things to aid a migration to https, for example:
You can use it to automatically upgrade http requests to https (though browser support isn't universal). This helps in case you missed changing an http link to its https equivalent. However, it assumes the resources can be loaded over https, and it sounds like yours cannot, so that's not an option here.
You can also use CSP to help you identify any http resources on your site that you missed, by reporting back to a service you monitor whenever an attempt is made to load an http resource. This lets you find and fix the remaining http links so you don't have to rely on the automatic upgrade above.
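To illustrate that second option: a report-only policy cannot be delivered via a meta tag, so it has to be sent as an HTTP response header; the report endpoint below is only a placeholder:
Content-Security-Policy-Report-Only: default-src https:; report-uri /csp-reports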
But neither is what you are really looking for.
You shouldn't... but you can: the feature is demonstrated here with an HTTP PNG image upgraded on the fly to HTTPS.
<meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests">
There's also a new Permissions API, described here, that allows a web application to check the user's permission status for features like geolocation, push, notifications and Web MIDI.

PathLocationStrategy vs HashLocationStrategy in web apps

What are the pros and cons of using:
PathLocationStrategy - the default "HTML 5 pushState" style.
HashLocationStrategy - the "hash URL" style.
For instance, using HashLocationStrategy prevents scrolling to an element by its #id, but some third-party plugins require HashLocationStrategy, or the hashbang #!, in order to work on Ajax websites.
I would like to know which one offers more for a webapp.
For me, the main difference is that PathLocationStrategy requires configuration on the server side so that all the paths configured in @RouteConfig are redirected to the main HTML page of your Angular 2 application. Otherwise you will get 404 errors when trying to reload your application in the browser or access it via a particular URL.
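For example, with Apache that fallback is typically a rewrite rule along these lines (shown only as a sketch; other servers have equivalent mechanisms):
# .htaccess: send every request that isn't an existing file or directory
# to the Angular entry point so the client-side router can handle it
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^ /index.html [L]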
Here is a question that could give you some hints about this:
When I refresh my website I get a 404. This is with Angular2 and firebase.
Hope it helps you,
Thierry
The # fragment is only processed on the client; servers simply ignore it. This can cause problems with search engines (SEO), and redirects can cause redundant page reloads.
This page https://github.com/browserstate/history.js/wiki/Intelligent-State-Handling has a more detailed explanation, although some of the arguments don't apply to Angular applications (for example, "doesn't work with JS disabled").
The "disadvantage" of HTML5 pushState is that it requires server support, as explained by Thierry.
According to official docs:
When the router navigates to a new component view, it updates the browser's location and history with a URL for that view. This is a strictly local URL. The browser shouldn't send this URL to the server and should not reload the page.
PathLocationStrategy
Modern HTML5 browsers support history.pushState, a technique that changes a browser's location and history without triggering a server page request. The router can compose a "natural" URL that is indistinguishable from one that would otherwise require a page load.
Here's the HTML5 pushState style URL that routes to the xyz component: localhost:4200/xyz/
HashLocationStrategy
Older browsers send page requests to the server when the location URL changes unless the change occurs after a # (called the hash). Routers can take advantage of this exception by composing in-application route URLs with hashes.
Here's a hash style URL that routes to the xyz component: localhost:4200/src/#/xyz/
I would like to know which one offers more for a webapp.
Almost all Angular projects should use the default HTML5 style as:
It produces URLs that are easier for users to understand.
It preserves the option to do server-side rendering later.
Rendering critical pages on the server is a technique that can greatly improve perceived responsiveness when the app first loads. An app that would otherwise take ten or more seconds to start could be rendered on the server and delivered to the user's device in less than a second.
This option is only available if application URLs look like normal web URLs without hashes (#) in the middle.
Stick with the default unless you have a compelling reason to resort to hash routes.
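For reference, here is a minimal sketch of how the strategy is chosen when the router is configured (the empty routes array is just a placeholder):
// app-routing.module.ts (sketch)
import { NgModule } from '@angular/core';
import { RouterModule, Routes } from '@angular/router';

const routes: Routes = []; // route definitions go here

@NgModule({
  // PathLocationStrategy (HTML5 pushState) is the default;
  // set useHash: true to switch to HashLocationStrategy
  imports: [RouterModule.forRoot(routes, { useHash: false })],
  exports: [RouterModule]
})
export class AppRoutingModule {}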

How to implement Schema.org on HTTPS pages?

Is it correct to statically set up Microdata’s itemtype attribute with HTTP value (http://schema.org/WebPage) on HTTPS pages or do I need to use HTTPS value (https://schema.org/WebPage) on all pages?
Since both HTTP and HTTPS versions of the site are available, can I set it up to //schema.org/WebPage or not?
tl;dr: Use http URIs.
In this answer on Webmasters SE I explained why you should favor http over https Schema.org URIs: The http URIs seem to be canonical, as the actual definition of the Schema.org vocabulary only defines http, not https. In addition: all examples (even on HTTPS) use the HTTP variant, the authors mentioned that they prefer to see the use of the HTTP variant, and RDFa’s Initial Context defines the HTTP variant only (so most of the RDF world will use HTTP).
In this answer on Webmasters SE I explained why you should not use protocol-relative URIs for vocabularies: vocabulary URIs typically don't get dereferenced, and nothing from a vocabulary ever gets embedded in your page, so there is absolutely no need to use HTTPS for them just because your site uses HTTPS (it's similar to simply linking to an external page, which might not even be accessible via HTTPS). On top of that, your Schema.org markup would no longer work if the document were accessed via a protocol other than HTTP/HTTPS, and it's likely that some parsers wouldn't recognize that you are using the Schema.org vocabulary, because they might look for the full URIs without applying URI resolution to the itemtype attribute.
There's been an update to that answer on Webmasters SE (dated November 2015), with a link to the schema.org FAQ about https:
Q: Should we write https://schema.org or http://schema.org in our markup?
The short of it is that schema.org will be moving to https, and you can use https URLs now, but there's no rush to switch.
Regarding protocol-relative URLs… please don't use them as they're a hack. Favor use of absolute or root-relative URLs whenever hyperlinking documents on the Web.
Is it correct to statically set up Microdata’s itemtype attribute with HTTP value [...]?
Either HTTP or HTTPS is fine in your itemtype according to the Schema.org FAQ. Your examples containing HTTP and HTTPS schemes are both correct for pages served with and without TLS.
If you've got a mix of absolute URLs pointing to different schemes, it's more likely that someone will notice and wonder why things aren't consistent. So the next time you update, refactor your existing itemtypes to match.
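For example, both of the following itemtype values are accepted on a page served over HTTPS (the div and its contents are placeholders):
<!-- The http form matches the canonical vocabulary URIs discussed above -->
<div itemscope itemtype="http://schema.org/WebPage">
  ...
</div>
<!-- Per the FAQ quoted above, the https form is also accepted: -->
<!-- <div itemscope itemtype="https://schema.org/WebPage"> ... </div> -->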

Cloudfront/S3: Server different file depending on Request Header

I am hosting a static website generated with Middleman on CloudFront and S3. I want to add multi-language support, and Middleman allows me to localize the content and have, for example, the English version at /index.html and the translated content at /sp/index.html.
I would like to detect the "Accept-Language" header in the request and, based on that, serve either /index.html or /sp/index.html.
Based on my research I cannot see a way of doing this with S3 and Cloudfront, but maybe you guys have an idea?
If there is no "proper and good way" of doing this with CloudFront and S3, what would be the next best alternative? Currently I am thinking of detecting the language in JavaScript and then redirecting the user if the language is not english.
Greetings, Kim
As mentioned in the comments you will need some kind of arbitrator that can read request headers and either redirect or serve dynamic content. S3 is the problem there.
CloudFront can forward the Accept-Language header to your origin server, and ensure that content is only cached per-language. So that part isn't a problem.
If S3 is your origin, then you have a problem because your files are static and unable to process the incoming request with the language information. I don't recommend trying to detect language with JavaScript. It's problematic.
Although CloudFront can be configured with multiple origins (one per language, in your case), it cannot route to them based on a request header. Currently "behaviours" can only match the URL path. I suspect they may introduce header-based rules at some point, but until they do (or unless you can find another CDN that does this), I'm afraid my answer is going to be a "you can't" answer.
As your site is all flat HTML, I suspect you're not interested in a convoluted solution comprising various CloudFront behaviours, dynamic server scripts, etc.
I think your best option by far is a simple, low-tech one --
Offer the visitor a choice of language and allow them to switch language from any page. This also avoids surprises: if I google something in English but I speak Spanish, I should see the English page that I googled and then be able to switch to Spanish if I feel like it.
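A low-tech sketch of that choice, reusing the /index.html and /sp/index.html paths from the question (the link text is just an example):
<!-- Visible language switcher, shown on every page -->
<nav>
  <a href="/index.html">English</a>
  <a href="/sp/index.html">Español</a>
</nav>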

Google indexed my domain anyway?

I have a robots.txt like below but Google has still indexed my domain. Basically they've indexed mydomain.com but not mydomain.com/any_page
UserAgent: *
Disallow: /
How can I go back further than /, which I thought was the root of the domain?
Note that this domain is a work in progress, so I don't want Google or any other search engine seeing it for the time being.
If you don't have one already, get a Google Webmaster Tools account. It includes a URL removal tool that may work for you.
This doesn't address the problem of search engines possibly ignoring or misinterpreting your robots.txt file, of course.
If you REALLY want your site to be off the air until it's launched, your best bet is to actually take it off the air: make the site inaccessible except by password. If you put HTTP Basic authentication on your document root, no search engine will be able to index anything, but you'll have full access with a password.
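A minimal sketch of that with Apache, assuming .htaccess overrides are enabled (the AuthUserFile path is a placeholder; create the file with the htpasswd tool):
# .htaccess in the document root: password-protect the whole site
AuthType Basic
AuthName "Work in progress"
AuthUserFile /path/to/.htpasswd
Require valid-user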

Resources