make sitemap page for many pages in rails - ruby-on-rails

I just noticed that Google hasn't crawled quite a lot of my web pages. When I looked into the reason, I suspected a missing sitemap could account for part of it, because I had no sitemap or robots.txt at all.
I have a very simply structured website with an index, a QnA page, login (no user pages yet) and a search page. But the search page can lead to a lot of restaurant pages, up to almost 160,000 of them, from example.com/restaurants/1000001 to example.com/restaurants/1160000.
I have just learned the basic concepts of sitemaps, and there are plenty of dynamic sitemap examples on Google. But I saw on Google's sitemap help page that more than 50,000 URLs require additional sitemap files.
My website has a very, very simple structure, but would generating these sitemaps put a burden on my server (and is it even necessary)? I also have no clear standard for splitting those 160,000 pages, so what would be a good way to split them for Googlebot?
Any tips would be a huge help for me. Thanks!
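One common way to handle this in Rails is the sitemap_generator gem, which writes a sitemap index and rolls over to additional sitemap files once the 50,000-URL-per-file limit is reached. A minimal sketch, assuming a Restaurant model and example.com as the host (both taken from the question's examples):

    # config/sitemap.rb -- read by the sitemap_generator gem's rake tasks.
    # The gem starts a new sitemap file and maintains a sitemap index
    # automatically once a file reaches 50,000 URLs.
    SitemapGenerator::Sitemap.default_host = "https://example.com"

    SitemapGenerator::Sitemap.create do
      add "/qna",    changefreq: "weekly"
      add "/search", changefreq: "daily"

      # Restaurant is an assumed model name; find_each loads records in
      # batches, so iterating over ~160,000 rows keeps memory flat.
      Restaurant.find_each do |restaurant|
        add restaurant_path(restaurant), lastmod: restaurant.updated_at
      end
    end

Regenerating the files with rake sitemap:refresh is a single pass over the table, so it is usually run from a nightly cron job rather than per request, and the burden on the server stays small.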

Related

Integrating GCS on a staging Jekyll website

We currently have a staging website, which has an IP address like xx.xx.xxx.xxx, and we would like to integrate and test GCS on it before pushing it live. Is this possible?
Otherwise, is there any alternative to GCS for adding a search bar to a Jekyll blog without using a plugin?
Thanks!
PART I: Google
Google Custom Search cannot index an application that isn't available to the internet.
Well, that's not entirely true: you can arrange something with Google (in theory; I've never done it), but it doesn't look easy. Or cheap.
You could set up a custom search for an unrelated site and embed those results in your local page, if you want to test out CSS prior to launch.
Remember, Google Custom Search also comes with ads, unless you pay. And the results tend to look like they came from Google.
PART II: Alternatives
I've looked into this extensively and I haven't come up with a good answer. Here are some not-so-good answers:
1) Tapir Search. This actually worked pretty well, but appears to have died. They do have recent Twitter activity, however, so it may be worth checking back in a bit. It's basically a (free) front end for an Elasticsearch server, I think. Neat service, but obviously not super-dependable.
2) Go the JavaScript route, Lunr for example (a small index-building sketch follows below). There are many, many similar solutions available. Sadly, they are client-side, and doing a full-text search on even a smallish blog-type site can be very slow. It works okay if you limit the search to titles, but then...you're only searching titles.
3) Build a search engine server. Maybe some breed of Lucene. Upside: very robust search while keeping the snappy response of a flat HTML site. Downside: building and maintaining a search engine server is difficult, expensive and probably overkill.
4) Hosted search engine. Algolia for example. They're basically doing 3) for you. Relatively expensive (~$50/month) but well worth the cost, because, seriously, search engine servers are finicky and prone to explosions. I've never gone this route with Jekyll because I've never had a Jekyll project I was quite that serious about, but I did consider it.
If anyone has anything to add, I'd love to hear it. This question has been irritating me for a while.
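To make option 2 a little more concrete: the client-side libraries all need a prebuilt index file to search against. A minimal sketch in Ruby (a standalone script run before deploying, not a Jekyll plugin; the front-matter keys, filename pattern and output path are assumptions about a typical Jekyll layout):

    require "json"
    require "yaml"

    # Walk the Jekyll _posts directory and emit a small JSON index that a
    # client-side search library such as Lunr can load.
    index = Dir.glob("_posts/*.md").map do |path|
      front_matter = YAML.load(File.read(path).split(/^---\s*$/)[1])
      slug = File.basename(path, ".md").sub(/^\d{4}-\d{2}-\d{2}-/, "")
      { "title" => front_matter["title"], "url" => "/#{slug}/" }
    end

    File.write("search.json", JSON.pretty_generate(index))

Indexing only titles keeps the file small, which is exactly the trade-off described above.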

SEO Friendly URLs in Ruby on Rails

I am completely new to this framework, so any help is really appreciated. I am developing a website using the Ruby on Rails framework (currently in its beta phase); however, there are 2 major issues I am facing:
URLs - all the URLs contain #!, because of which search engines are not crawling and indexing them
Content on the website is not getting crawled or indexed
Please help.
Take a look at the free RailsCast on this, or a nicer one, though it's paid.
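For reference, the pattern those screencasts cover boils down to overriding to_param on the model so generated URLs carry a slug instead of a bare id. A rough sketch, assuming a Product model with a name column (both names are made up):

    class Product < ActiveRecord::Base
      # Rails calls to_param when building URLs, so product_path(product)
      # becomes /products/42-acme-anvil instead of /products/42.
      def to_param
        "#{id}-#{name.parameterize}"
      end
    end

    # The default Product.find(params[:id]) keeps working, because the
    # integer cast of "42-acme-anvil" stops at the first non-digit (42).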
Regarding your content not being indexed or crawled, did you set up a proper robots.txt on your server?

How to fetch and display data from some other ecommerce sites to my website?

I want to pull product-related data from other ecommerce sites into my website.
The process would be something like giving a specific product URL from another ecommerce site and displaying
that product's info on my website.
I am looking for this solution in Ruby on Rails.
Is there any way to do this with RoR? Please share your ideas if you know about it.
Thanks in advance.
There are two ways of achieving what you need:
1) Those sites might actually have an API which you can use to get your job done.
2) Scraping those sites. Now, some websites prohibit this for obvious reasons, so do read their terms. At any rate, there are a couple of things you can use for web scraping, like Nokogiri. A good screencast to get you started can be found on Railscasts.
There is a plethora of options for web scraping, depending on what you actually need, but get started with Nokogiri and you can then look into more, e.g. Mechanize, a library used for automating interactions with websites. Another screencast can be found on Railscasts.
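As a rough sketch of the Nokogiri route (the URL and CSS selectors below are placeholders; every site's markup is different, and its terms of service still apply):

    require "open-uri"
    require "nokogiri"

    # Fetch a single product page and pull out a couple of fields.
    url = "https://shop.example.com/products/123"        # placeholder URL
    doc = Nokogiri::HTML(URI.open(url))

    name  = doc.at_css(".product-title")&.text&.strip    # assumed selector
    price = doc.at_css(".product-price")&.text&.strip    # assumed selector

    puts "#{name}: #{price}"

Mechanize builds on the same parsing but adds a stateful, browser-like session (cookies, form submission), which helps when the product pages sit behind a search form.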

What is the best conceptual approach, in Rails, to managing content areas in what is otherwise a web application?

A while back I read this great post: http://aaronlongwell.com/2009/06/the-ruby-on-rails-cms-dilemma.html, discussing the "Rails CMS Dilemma". It describes conceptual approaches to managing content in websites vs web apps. I'm still a beginner with Rails, but I have a bit of a PHP background, and I still have trouble wrapping my brain around this.
A lot of what I run into is customers who want a website that is not 100% website, and not 100% web app... That is, perhaps there are several pages of business-to-public facing content, but then there are application elements, and the whole overall look is supposed to be cohesive. This was always fairly simple in PHP, as you just kind of dropped your app code into the PHP "script", etc (though I know there are plenty of cons to this platform and approach).
So I am wondering, what is the best approach in Rails for doing this?
Say you have an application with user authentication and some sort of CRUD stuff going on, where users collaborate on projects or something. Well, what is the optimal approach for managing the text/images of the "How This Site Works" and "Our Company" pages, which people may also want to view? Is it just simply having a pages controller and several text fields, with an admin panel on the back end that lets you edit those fields? Or is it perhaps a common approach to start off with something like Refinery, and then build on top of it for the non-content-driven areas of a site?
Sorry if this is a dumb question. It's just that I've read Hartl's book and others, and they never address this practical low-level stuff for a beginner... Sure, I can build a Twitter feed now, but what about Twitter's "About" page (http://twitter.com/about)? I can't just throw text into a view and hand that to a client... They want a super easy way to see the site tree, edit content areas, AND administrate/run their Twitter feed or whatever.
Thanks for your help.
I think you're looking for a CMS that runs as a plugin in your Rails application. If that's the case, I'd suggest that you try http://github.com/twg/comfortable-mexican-sofa
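For the "pages controller" idea raised in the question, a minimal sketch (the model, columns and route are all assumptions) looks something like this; a plugin such as ComfortableMexicanSofa essentially packages the same idea with an admin UI and a page tree:

    # app/models/page.rb -- assumed columns: slug:string, title:string, body:text
    class Page < ActiveRecord::Base
      validates :slug, presence: true, uniqueness: true
    end

    # app/controllers/pages_controller.rb
    class PagesController < ApplicationController
      def show
        @page = Page.find_by!(slug: params[:slug])
      end
    end

    # config/routes.rb
    #   get "pages/:slug" => "pages#show", as: :page

An admin namespace with standard CRUD actions over Page, protected by your existing authentication, is what gives the client the "edit the About page" screen.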

Where do search engines start crawling?

What do search engine bots use as a starting point? Is it DNS look-up or do they start with some fixed list of well-know sites? Any guesses or suggestions?
Your question can be interpreted in two ways:
Are you asking where search engines start their crawl from in general, or where they start to crawl a particular site?
I don't know how the big players work; but if you were to make your own search engine you'd probably seed it with popular portal sites. DMOZ.org seems to be a popular starting point. Since the big players have so much more data than we do they probably start their crawls from a variety of places.
If you're asking where a SE starts to crawl your particular site, it probably has a lot to do with which of your pages are the most popular. I imagine that if you have one super-popular page that lots of other sites link to, then that would be the page the SEs enter from, because there are so many more entry points from other sites.
Note that I am not in SEO or anything; I just studied bot and SE traffic for a while for a project I was working on.
You can submit your site to search engines using their site submission forms - this will get you into their system. When you will actually get crawled after that is impossible to say - from experience it's usually about a week or so for an initial crawl (the homepage, plus a couple of other pages one link deep from there). You can increase how many of your pages get crawled and indexed by using a clear semantic link structure and submitting a sitemap - these let you list all of your pages and weight them relative to one another, which helps the search engines understand how important you consider each part of the site relative to the others.
If your site is linked from other crawled websites, then your site will also be crawled, starting with the page linked, and eventually spreading to the rest of your site. This can take a long time, and depends on the crawl frequency of the linking sites, so the url submission is the quickest way to let google know about you!
One tool I can't recommend highly enough is the Google Webmaster Tool. It allows you to see how often you've been crawled, any errors the googlebot has stumbled across (broken links, etc) and has a host of other useful tools in there.
In principle they start with nothing. Only when somebody explicitly tells them to include their website can they start crawling that site and use the links on it to find more.
However, in practice the creator(s) of a search engine will put in some arbitrary sites they can think of. For example, their own blogs or the sites they have in their bookmarks.
In theory one could also just pick some random addresses and see if there is a website there. I doubt anyone does this though; the above method will work just fine and does not require extra coding just to bootstrap the search engine.
