Sitemap for a site with a large number of dynamic subdomains - search-engine

I'm running a site which allows users to create subdomains. I'd like to submit these user subdomains to search engines via sitemaps. However, according to the sitemaps protocol (and Google Webmaster Tools), a single sitemap can include URLs from a single host only.
What is the best approach?
At the moment I have the following structure:
Sitemap index located at example.com/sitemap-index.xml that lists sitemaps for each subdomain (but located at the same host).
Each subdomain has its own sitemap located at example.com/sitemap-subdomain.xml (this way the sitemap index includes URLs from a single host only).
A sitemap for a subdomain contains URLs from the subdomain only, i.e., subdomain.example.com/*
Each subdomain has subdomain.example.com/robots.txt file:
--
User-agent: *
Allow: /
Sitemap: http://example.com/sitemap-subdomain.xml
--
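For illustration, the index itself looks roughly like this (subdomain names are made up):
--
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/sitemap-subdomain1.xml</loc></sitemap>
  <sitemap><loc>http://example.com/sitemap-subdomain2.xml</loc></sitemap>
</sitemapindex>
--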
I think this approach complies with the sitemaps protocol; however, Google Webmaster Tools gives errors for subdomain sitemaps: "URL not allowed. This url is not allowed for a Sitemap at this location."
I've also checked how other sites do it. Eventbrite, for instance, produces sitemaps that contain URLs from multiple subdomains (e.g., see http://www.eventbrite.com/events01.xml.gz). This, however, does not comply with the sitemaps protocol.
What approach do you recommend for sitemaps?

I recently struggled through this and finally got it working. See this thread for more details:
http://www.google.com/support/forum/p/Webmasters/thread?tid=53c3e4b3ab8d9503&hl=en&fid=53c3e4b3ab8d9503000497bd04ba63cf
Summary:
Use DNS verification to verify your site and all its subdomains in one fell swoop
Make the robots.txt on all your subdomains point to the main sitemap on your www domain (example below)
You may need to wait several days for Google to update its cached copies of robots.txt on all your subdomains. It will still show errors until then.
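For example (URL hypothetical, following the index described in the question), each subdomain's robots.txt ends up as simple as:
--
User-agent: *
Allow: /
Sitemap: http://www.example.com/sitemap-index.xml
--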

Yes, the subdomain restriction is in the sitemaps.org spec, but Google has put some exceptions in place:
Verify all subdomains within your Google Webmaster Tools account
http://www.google.com/support/webmasters/bin/answer.py?answer=75712
Cross-submission of sitemap XML via Google Webmaster Tools, if submitted via the root of your domain, will not throw errors for Google
Within the robots.txt of a subdomain you can point to sitemap XML on other hosts; there will be no cross-submission errors for Google

If you have a website that allows users to create subdomains, it is simplest to create and submit the sitemaps for all subdomains through a single sitemap index. The index lists the sitemap URLs for all of your subdomain sites and lives in one location, but to do this, all of the sites must be verified in Webmaster Tools. You can define one sitemap as:
http://example.com/sitemap.xml
Define all of your subdomain sitemaps, covering all of your subdomain URLs, under this document tree.
Each sitemap file can hold up to 50,000 URLs and 10 megabytes per file, and sitemaps can be compressed with gzip to reduce bandwidth, so this layout scales without any problem.
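As a rough sketch of how URLs from every subdomain could be split to stay under those limits (Ruby; the input file name is hypothetical, and real URLs would also need XML escaping):
require 'zlib'

# Hypothetical input: one URL per line, gathered from every subdomain
urls = File.readlines('all_subdomain_urls.txt', chomp: true)

urls.each_slice(50_000).with_index(1) do |slice, i|
  Zlib::GzipWriter.open("sitemap-#{i}.xml.gz") do |gz|
    gz.puts '<?xml version="1.0" encoding="UTF-8"?>'
    gz.puts '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    slice.each { |url| gz.puts "  <url><loc>#{url}</loc></url>" }
    gz.puts '</urlset>'
  end
end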

Related

Is it possible to include my one page Eloqua microsite in a subdirectory rather than as a subdomain

I have an Eloqua microsite that I would like to integrate into my main website as smoothly as possible. The site is a single page, and I am hoping to include it as a subdirectory rather than a subdomain, e.g. www.mydomain/microsite.
Is it possible to do this with an Eloqua microsite?
Currently, Eloqua only works with subdomains, not subfolders.

How to do multi-tenant support with custom domains and SSL on Heroku

My app allows users to create custom product landing pages.
I wish to set up this scenario:
Pages exist at brand.myapp.com/offer-name; however, I want to enable users to create landing pages using their own domain, for example brand.customerdomain.com/offer-name, which serves a page from my app.
I am unsure about the best way to do this. I know I can have users point a CNAME record to 'myapp.com', and then I add 'brand.customerdomain.com' as a Heroku custom domain. But is there a limit to the number of custom domains I can add to Heroku? There will be thousands of these domains, so I don't know if this solution is feasible. I have had some success with this approach; however, I get SSL warnings in the browser when accessing the page from the user's domain.
In terms of SSL, I have a wildcard certificate installed on Heroku, for *.myapp.com.
Another way is to have a proxy server hosted elsewhere and have users point a CNAME to something like 'proxy.myapp.com', which routes to my Heroku URL; however, I haven't been able to get this to work with Nginx (on DigitalOcean), and I haven't found any suitable guides (I don't have much Nginx knowledge).
The proxy approach I found here - https://mrvautin.com/enabling-custom-domain-for-saas-application-on-heroku/.
Cloudflare has a solution for this problem, however it's available to enterprise customers only, so I'd prefer to have my own solution - https://www.cloudflare.com/saas/.
What would be the ideal way to have multi-tenancy, with custom domains and SSL on Heroku?
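For reference, whichever DNS approach is used, the app itself still has to work out which tenant an incoming request belongs to. A minimal Rack-style sketch of host-based lookup (the Tenant model and its columns are placeholders, not real code):
require 'rack'

# Hypothetical Rack middleware: resolve the tenant from the Host header, whether
# the request arrived via brand.myapp.com or a customer's own CNAME'd domain.
class TenantResolver
  def initialize(app)
    @app = app
  end

  def call(env)
    host = Rack::Request.new(env).host
    env['app.tenant'] =
      Tenant.find_by(custom_domain: host) ||             # customer's own domain
      Tenant.find_by(subdomain: host.split('.').first)   # *.myapp.com subdomain
    @app.call(env)
  end
end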

How do I create 301 redirect from WordPress (hosted on wsynth) to a Rails app hosted on Heroku?

I have a website (example.com) that is a WordPress site hosted on WSYNTH.
I am redesigning the site, same domain (example.com), in Ruby on Rails hosted on Heroku.
I have been told that once I point my domain to the Rails app on Heroku, all the old pages from the WordPress site will go dark. (Makes sense.) But this would be very bad for SEO, since example.com will now have many URLs associated with it (created from the WP site) that are no longer valid.
I've heard that a 301 redirect for those WordPress URLs will take care of this SEO issue. But how and where should I do this? Should I be installing a plugin in WordPress that will automate the redirects to the pages I want to send them to in the Rails/Heroku app?
Also, is it possible to keep some of those old WordPress URLs live?
DNS
The 301 redirect is not the issue. You can use WordPress itself to redirect to specific pages (using the Simple 301 Redirects plugin), or, better, redirect your domain (via DNS) to your Rails app and then use the routes to handle any stray pages.
The world of "SEO" is highly overrated - Google is just a system which
follows links. If it cannot find a page, it removes it from its
rankings; if it can find the page, it judges its on & off-site
optimization to determine its relevance.
This means the only thing you need to concern yourself with is ensuring you don't have any "holes" in your URLs. The redirections essentially tell Google to follow a link to the new page.
--
Redirections
The first thing you need to do is ensure you have the new pages you wish to show on your site. Preferably, you'll want to keep as many of them as identical to your previous URLs as you can.
Secondly, you can introduce redirects in your Rails routing system to give Google real pages when it visits the links for your WordPress site:
# config/routes.rb
# redirect() issues a 301 Moved Permanently by default, which is what Google needs
get '/your-old-post-name', to: redirect('/your-new-post-name')
This means you will have to create a redirection for every WordPress post in your new Rails app, but it should give Google the knowledge that those pages have changed, and it will update its index accordingly.
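If the old permalinks follow a predictable pattern (for example WordPress's default date-based /year/month/day/post-name structure, which is an assumption about your setup), a single dynamic rule can cover all of them:
# config/routes.rb
# Assumes the old WordPress permalinks looked like /2013/05/21/my-post and the
# new Rails URLs look like /blog/my-post (adjust both sides to your real paths)
get '/:year/:month/:day/:slug',
    to: redirect('/blog/%{slug}'),
    constraints: { year: /\d{4}/, month: /\d{2}/, day: /\d{2}/ }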

URL/path in address bar

I have seen websites that only show a path when you navigate to different webpages or items on the site. When I created my website, I had to create a webpage for everything, and the address bar shows the path with a file name like .php, while other websites only show a path even when navigating to a new page.
This is known as URL routing. The developer has configured the web server (or web application) to map specific URL paths to individual web server pages (.php, .aspx, .mvc, or whatever). There are different ways of achieving this, depending on the web server platform technology, but it is generally achieved by configuring a URL route map of some kind. There are several advantages to organising a website's URLs in this way, but mainly it makes URLs more consistent and easier to understand for users, and hides the details of the website's underlying implementation.
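As a small illustration (Rails syntax here, purely as an example; every platform has an equivalent route map), one rule maps a clean path onto the code that actually serves it:
# config/routes.rb (hypothetical Rails example)
# Visitors see /products/blue-widget; the handler (ProductsController#show here,
# or a .php script on another stack) never appears in the URL.
get '/products/:slug', to: 'products#show'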

Find all the web pages in a domain and its subdomains

I am looking for a way to find all the web pages and sub domains in a domain. For example, in the uoregon.edu domain, I would like to find all the web pages in this domain and in all the sub domains (e.g., cs.uoregon.edu).
I have been looking at Nutch, and I think it can do the job. But it seems that Nutch downloads entire web pages and indexes them for later search, whereas I want a crawler that only scans a web page for URLs that belong to the same domain. Furthermore, it seems that Nutch saves the linkdb in a serialized format. How can I read it? I tried Solr, and it can read Nutch's collected data, but I don't think I need Solr, since I am not performing any searches. All I need are the URLs that belong to a given domain.
Thanks
If you're familiar with Ruby, consider using Anemone. It's a wonderful crawling framework. Here is sample code that works out of the box:
require 'anemone'

site_url = 'http://uoregon.edu/'   # the domain you want to map
urls = []

# Anemone stays on the starting host by default, so this collects same-domain URLs
Anemone.crawl(site_url) do |anemone|
  anemone.on_every_page do |page|
    urls << page.url
  end
end

puts urls
https://github.com/chriskite/anemone
Disclaimer: You need to use a patch from the issues to crawl subdomains and you might want to consider adding a maximum page count.
The easiest way to find all subdomains of a given domain is to ask the DNS administrators of the site in question to provide you with a DNS Zone Transfer or their zone files; if there are any wildcard DNS entries in the zone, you'll have to also get the configurations (and potentially code) of the servers that respond to requests on the wildcard DNS entries. Don't forget that portions of the domain name space might be handled by other DNS servers -- you'll have to get data from them all.
This is especially complicated because HTTP servers might have different handling for requests to different names baked into their server configuration files, or the application code running the servers, or perhaps the application code running the servers will perform database lookups to determine what to do with the given name. FTP does not provide for name-based virtual hosting, and whatever other services you're interested in may or may not provide name-based virtual hosting protocols.
