How to write a crawler in Ruby? - ruby-on-rails

I am working on a RoR application where I need to implement a crawler that crawls other sites and stores data in my database. For example, suppose I want to crawl all deals from http://www.snapdeal.com and store them in my database. How can I implement this with a crawler?

There are a couple of options, depending upon your use case.
Nokogiri. Here is the RailsCast that will get you started.
Mechanize is built on top of Nokogiri. See the Mechanize RailsCast.
Screen Scraping with ScrAPI and the ScrAPI RailsCast.
Hpricot.
I have used a combination of Nokogiri and Mechanize for a few of my projects, and I think they are good options.
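For the deals example in the question, a minimal Mechanize sketch might look something like this (the listing URL, the CSS selectors, and the Deal model are all hypothetical placeholders; the real markup will differ):
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.snapdeal.com/deals')  # hypothetical listing URL

# page.parser exposes the underlying Nokogiri document
page.parser.css('.deal').each do |deal|            # '.deal' is a made-up selector
  title = deal.css('.title').text.strip
  price = deal.css('.price').text.strip
  Deal.create!(title: title, price: price)         # assumes a Deal ActiveRecord model
end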

You want to take a look at Mechanize. Also, from what you mention, you probably don't need Rails at all.

As Sergio commented, you retrieve pages, parse them, and follow their links. In your case, it sounds like you're more focused on "screen scraping" than on crawling deep link networks, so a library like Scrubyt will be helpful (although development on it has stalled). You can also use a lower-level, parsing-focused library like Nokogiri.
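In its simplest form, that fetch-parse-follow cycle looks something like this (a minimal sketch using Open-URI and Nokogiri; the start URL is a placeholder, and there is no robots.txt handling, politeness delay, or error recovery):
require 'nokogiri'
require 'open-uri'
require 'set'

queue   = ['http://www.example.com/']  # placeholder start URL
visited = Set.new

until queue.empty?
  url = queue.shift
  next if visited.include?(url)
  visited << url

  doc = Nokogiri::HTML(open(url))
  # ... extract whatever data you need from doc here ...

  # follow the links found on the page, staying on the same site
  doc.css('a[href]').each do |link|
    begin
      href = URI.join(url, link['href']).to_s
    rescue URI::Error
      next
    end
    queue << href if href.start_with?('http://www.example.com')
  end
end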

Related

Getting more Ruby On Rails Helpers

I am completely new to Ruby on Rails, and I have already watched a long tutorial to start developing a small web application. In that tutorial I saw several helpers for textboxes, textareas, dates, times, checkboxes, radio buttons, comboboxes, and so on.
Where can I find other helpers, such as accordions and WYSIWYG editors (like an HTML editor), that can be bound to data from the model and used in views? Maybe a toolbox, for example.
I would very much appreciate your feedback.
Best regards.
What you're mostly talking about are Form Helpers. There are a bunch of other Rails Guides as well, so I'd recommend reading through them to get a better idea of what Rails does and can provide.
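For example, the form builder binds standard inputs to model attributes. A minimal ERB sketch, assuming a hypothetical Post model with title, body, and published attributes:
<%= form_for @post do |f| %>
  <%= f.text_field :title %>     <%# textbox bound to post.title %>
  <%= f.text_area :body %>       <%# textarea bound to post.body %>
  <%= f.check_box :published %>  <%# checkbox bound to post.published %>
  <%= f.submit %>
<% end %>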
If you're not finding what you need in that documentation, you may need to add a 3rd party gem to your app's Gemfile, and follow the gem's documentation for getting it working. The Ruby Toolbox is a good place to start searching if you want to see which gems are most common.
And, of course, in the end you might not find an existing gem that solves your problem, in which case you will need to write it yourself. For front-end work you'll want to get up to speed on HTML, CSS, and JavaScript.

Setting up search page and filter (similar to ecommerce)

I'm building my first Rails app, and I'm trying to build the search page for an ecommerce-type site. The idea is that the model pulls all the data from the database according to the filter that is checked or selected in the view (category, sub-category, price, date, etc.).
I've watched RailsCasts on Elasticsearch, Solr, etc. They seem like they'd each work in this scenario, but are they overkill? I'm just not sure what the quickest and most scalable way to set up this search is. I've read a little about the has_scope gem, which seems like it would be one way, but I can't find a good tutorial or documentation on has_scope. Can someone point me in the right direction for creating this search page? Should I build it out with has_scope, Solr, or Elasticsearch?
Thanks
In my own experience, Solr is the best search technology I've come across. It provides a feature called faceting, which is what you are describing. You can read about it on their wiki here: http://wiki.apache.org/solr/SolrFacetingOverview
The best Solr gem I've come across is Sunspot. It has a very easy-to-use DSL for interfacing with Solr from a Rails app and hooks into Active Record very easily. Take a look at their GitHub project page. I think that will answer your question.
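For reference, Sunspot's DSL looks roughly like this (a sketch assuming a hypothetical Product model with name, description, category, and price columns):
class Product < ActiveRecord::Base
  searchable do
    text    :name, :description  # full-text fields
    string  :category            # facetable field
    integer :price
  end
end

# faceted search: a full-text query narrowed by a category filter
search = Product.search do
  fulltext 'laptop'
  with(:category, 'electronics')
  facet :category
end
search.results                   # the matching products
search.facet(:category).rows     # per-category counts for building filter links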

FAQ Plug-in or Gem for Rails 3?

FAQs seem to be a pretty commonly needed feature in a web application, but it seems like there are no gems or plugins available for Rails. Can you recommend a gem or plugin that provides FAQs for a Rails app? Obviously you could make a simple FAQ very quickly with Rails, but there is much more functionality that can be added: votes, search, categories, roles, comments, markup, embedded links, tags, ... just to name a few. It seems like people are re-inventing the wheel a lot for FAQs.
I just published a gem for it.
https://github.com/railscash/how_to
Hope that helps. It's in the development phase, but we are using it actively. Your comments/feedback will be highly appreciated.
I think either BrowserCMS (http://browsercms.org) or RefineryCMS (http://refinerycms.com/) fits the bill when you need more generic content pages. I prefer to use a generic CMS instead of creating a gem/plugin for FAQs, as you'll have other pages that could easily be thrown into a CMS engine as well. It saves developer time that would otherwise go to updating mostly static HTML pages.
Absolutely - check out https://oraguide.com - everything is streamlined and hosted in the cloud. It runs directly on the page as a floating div.

How do I get content from a website using Ruby / Rails?

I want to copy some specific content from a website using Ruby/Rails.
The content I need is inside a marquee HTML tag, divided by divs.
How can I get access to this content using Ruby?
To be more precise, I want to use some kind of Ruby GUI (preferably Shoes).
How do I do it?
This isn't really a Rails question. It's something you'd do using Ruby, then possibly display using Rails, or Sinatra or Padrino - pick your poison.
There are several different HTTP clients you can use:
Open-URI comes with Ruby and is the easiest. Net::HTTP comes with Ruby and is the standard toolbox, but it's lower-level so you'd have to do more work. HTTPClient and Typhoeus+Hydra are capable of threading and have both high-level and low-level interfaces.
I recommend using Nokogiri to parse the returned HTML. It's very full-featured and robust.
require 'nokogiri'
require 'open-uri'
# fetch the page and parse it into a Nokogiri document
doc = Nokogiri::HTML(open('http://www.example.com'))
puts doc.to_html
If you need to navigate through login screens or fill in forms before you get to the page you need to parse, then I'd recommend looking at Mechanize. It relies on Nokogiri internally so you can ask it for a Nokogiri document and parse away once Mechanize retrieves the desired URL.
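For instance, logging in and then handing the result to Nokogiri might look like this (a sketch; the URL and field names are hypothetical):
require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://www.example.com/login')  # hypothetical login URL

form = page.forms.first  # assumes the login form is the first on the page
form['username'] = 'me'  # 'username' and 'password' are made-up field names
form['password'] = 'secret'
page = agent.submit(form)

doc = page.parser        # the underlying Nokogiri document
puts doc.css('h1').text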
If you need to deal with dynamic HTML, then look into the various Watir tools. They drive real web browsers and let you access the content as seen by the browser.
Once you have the content or data you want, you can "repurpose" it into text inside a Rails page.
If I understand correctly, you want a GUI interface to a website scraper. If that's so, you might have to build one yourself.
The easiest way to scrape a website is with the Nokogiri or Mechanize gems. Basically, you give those libraries the address of the website and then use their XPath capabilities to select the text out of the DOM, as in the sketch after these links.
https://github.com/sparklemotion/nokogiri
https://github.com/sparklemotion/mechanize (for the documentation)
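Applied to the marquee markup described in the question, the XPath approach looks something like this (a sketch; the URL is a placeholder):
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://www.example.com'))  # placeholder URL

# select the divs nested inside the marquee tag
doc.xpath('//marquee/div').each do |div|
  puts div.text.strip
end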

Best way to add full web search to my site?

I need to add full web search to my site. I need something like Google Custom Search, but with no ads, and it has to be free. Any recommendation of a web service or open-source project that can index my site and allow me to search it would be helpful.
My site is made in Ruby on Rails, if that helps.
I'll make this question community-wiki so you can edit my bad English. I think many people can benefit from this question.
Check out Lucene. It's an open-source search engine that will certainly be a fun learning experience to implement on your own site. It was originally written by Doug Cutting, who had previously worked at Excite.
Ferret is the Ruby port of Lucene. Check out the acts_as_ferret plugin.
It depends what you mean by full web search, really. If you want to search the whole web, then the answers above won't help you much, as they are really for indexing and searching the content of your own site. I would suggest using Google AJAX Search (just a 'powered by Google' notice needed, no ads) or BOSS from Yahoo (might require ads, I'm not sure).
http://code.google.com/apis/ajaxsearch/
http://developer.yahoo.com/search/boss/
People are moving to acts_as_solr and Thinking Sphinx in the blogs I read (see the Thinking Sphinx sketch below):
http://acts-as-solr.rubyforge.org/
http://ts.freelancing-gods.com/
I've also been looking at tsearch in Postgres; it looks very capable:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
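Of the options above, a Thinking Sphinx index from that era looked roughly like this (a sketch assuming a hypothetical Article model with title and content columns):
class Article < ActiveRecord::Base
  define_index do
    indexes title, :sortable => true
    indexes content
    has created_at  # attribute available for filtering and sorting
  end
end

# after building the index with: rake thinking_sphinx:index
Article.search 'full text query'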
What do you mean by "full web search"?
There are good answers available for full-text search, where a search engine indexes and queries the model objects stored in your database.
If you mean something that indexes and queries your rendered HTML, Nutch is a popular option with a web-crawler, parser, indexer, and query interface.
I recommend acts_as_xapian. It's very easy to implement, it's fast enough, and it's got the features you'll normally need.
