A single Nokogiri rake task to scrape all Groupon deals? - ruby-on-rails

I want to scrape Groupon deals using Nokogiri. I want to scrape all these deals at the following link:
http://www.groupon.com/getaways?d=travel_countmein
On top of that, I want to access each individual link and scrape the title and price. Conceptually, is there a way to code a single rake task to do this?
I understand that there needs to be a loop of some sort, but I don't know how to parse the url for each deal from the main getaway page.
I've already written a scraper for the title and price:
task :fetch_travel => :environment do
  require 'nokogiri'
  require 'open-uri'
  url = "http://www.groupon.com/deals/ga-flamingo-conferences-resort-spa?c=all&p=0"
  doc = Nokogiri::HTML(open(url))
  title = doc.at_css("#content//h2/a").text
  price = doc.at_css("#amount").text[/[0-9\.]+/]
  link = doc.at_css("#content//h2/a")[:href]
  desc = doc.at_css(".descriptor").text
  Traveldeal.create(:title => title, :price => price, :url => link, :description => desc)
end

I figured out that this requires a nested loop: the outer loop parses the main getaways page for the URL of each deal, and the inner loop is the code above, run once per URL.

Related

How can I detect a new blog post using mechanize in ruby

I am trying to detect when a new post has been added to a blog. I am using Mechanize for the scraping. This is straightforward if you know the parent tags of a post title, e.g. <article><header><h1>Blog Title here</h1></header></article>: you can just diff the titles you have now against the ones from the last time you checked. But I want to do this programmatically. Is there a way to find out which section or tags of a page hold the blog post titles without explicitly giving the hierarchy of tags to the script?
Suppose there is a blog named blog.example.com with these posts:
<article><header><h1>Blog Title here1</h1></header></article>
<article><header><h1>Blog Title here2</h1></header></article>
<article><header><h1>Blog Title here3</h1></header></article>
Using SelectorGadget you can get an idea of which CSS selector matches each article. To scrape the articles you can use the nokogiri or mechanize gem.
Suppose a Mechanize bot visits blog.example.com, collects all the articles, and inserts them into your database.
require 'nokogiri'
require 'open-uri'
if 1==1
  url = "http://www.eslemployment.com/country/esl-jobs-Vietnam.html"
  doc = Nokogiri::HTML(open(url))
  data = []
  doc.css("#joblist td:nth-child(1) a").first(5).each do |titlecss|
    country = "8"
    jobtype = "1"
    urlnext = titlecss.attr('href')
    docnext = Nokogiri::HTML(open(urlnext))
    docnext.css('#jobdescription div').remove
    docnext.css('#detailjob , #job-summary').each do |detailscss|
      docnext.css('#pagemsg h1').each do |titlenextcss|
        data << JobPost.create(
          :title => titlenextcss.text,
          :jobslink => urlnext,
          :description => detailscss.inner_html,
          :country_id => country,
          :job_type_id => jobtype
        )
      end
    end
  end
end
Here is an example using the nokogiri gem. It collects the jobs from www.eslemployment.com. Now, your question is how to detect that a new article has been added.
This code collects all the jobs from a page and adds them to the database. I use a "distinct" check in the model, so only new jobs are added to the database and no duplicates are stored. When a new job is added, you can send a notification saying which job it was.
This is not an efficient way, but it will work. Otherwise you can use the blog's RSS feed, which is the proper way to detect new posts.
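The "only new items get stored" idea can also be sketched without a database, by comparing the titles scraped now with the titles seen on the previous run. This is an illustrative sketch: new_titles is a hypothetical helper, and the sample titles are invented.

```ruby
require 'set'

# Returns the titles in current_titles that are absent from seen_titles,
# preserving page order and ignoring blanks and repeats.
def new_titles(current_titles, seen_titles)
  seen = seen_titles.map(&:strip).to_set
  current_titles.map(&:strip)
                .reject { |t| t.empty? || seen.include?(t) }
                .uniq
end

previous = ['Post one', 'Post two']
current  = ['Post three', 'Post two', 'Post one']
fresh = new_titles(current, previous)
puts fresh.inspect
```

In a Rails model, a uniqueness validation on the link or title column is one way to get the same effect at save time; whether that matches the "distinct" check mentioned above is an assumption.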

Automated website interaction - Mechanize - Rails

I'm using the Mechanize gem to automate interaction with a website form.
The site I'm trying to interact with is http://www.tastekid.com/like/books
I'm trying to automatically submit a query string in the form and return the suggested books in an array.
Following the guide, I've pretty-printed the page layout to find the form name, but I am only finding a form with no name (nil):
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.tastekid.com/like/books')
pp page
How do I enter a string, submit the form and return the results in the form of an array?
These answers feel a little cluttered to me, so let me try to make it simpler:
page = agent.get 'http://www.tastekid.com/like/books'
there's only one form, so:
form = page.form
form['q'] = 'twilight'
submit the form
page = form.submit
print the text from the a's
puts page.search('.books a').map &:text
Following the guide, you can get the form:
form = page.form
I didn't see a name on the form, and I actually got two forms back: one on the page and one hidden.
I called
form.fields.first.methods.sort #not the hidden form
and saw that I could call value on the field, so I set it as such:
form.fields.first.value = "Blood Meridian"
then I submitted and pretty printed:
page = agent.submit(form)
This should work for you!
You could use the form_with method to locate the form you want. For example:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.tastekid.com/like/books')
the_form_you_want = page.form_with(:id => "searchFrm") # form_with
the_form_you_want.q = 'No Country for Old Men'
page = agent.submit(the_form_you_want)
pp page
It looks like the book titles all have the same class attribute. To extract the book titles, use the links_with method and pass in the class as a locator:
arr = []
page.links_with(:class => "rsrc").each do |link|
arr << link.text
end
But @aceofbassgreg is right. You'll need to read up on the mechanize and nokogiri documentation...

Using Nokogiri to loop through result pages

I have some code to extract offers on eBay, but there are several result pages and I get only the results of the first page. How can I loop through several result pages?
Here is my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.ebay.de/sch/i.html?_nkw=Suzuki+DR+BIG&_sacat=131090&_odkw=Suzuki+DR+BIG&_osacat=0&_from=R40"
doc = Nokogiri::HTML(open(url))
doc.css(".dtl").each do |dtl|
puts dtl.at_css(".vip").text
end
You have to aggregate the results from each page by pulling the link from the "next" button (which, inspecting the page, is at the css .botpg-next a) and loading it.
Something like this:
url = "http://www.ebay.de/sch/i.html?_nkw=Suzuki+DR+BIG&_sacat=131090&_odkw=Suzuki+DR+BIG&_osacat=0&_from=R40"
while url
  doc = Nokogiri::HTML(open(url))
  doc.css(".dtl").each do |dtl|
    puts dtl.at_css(".vip").text
  end
  # at_css returns nil when the page has no "next" link, which ends the loop
  link = doc.at_css('.botpg-next a')
  url = link && link['href']
end
I'm just looping until no "next" button is found, but you could change that to limit the loop to a given number of results.
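A cap on the loop can be sketched as follows. This version limits the number of pages rather than the number of results, and it also stops if a "next" link points back at a page already visited. crawl_pages is a hypothetical helper, and a hash stands in for the site's "next" links so the loop can be tried without network access.

```ruby
# Follows "next page" links until none is left, a page repeats,
# or max_pages have been visited. The block receives a URL, handles that
# page, and returns the next page's URL (or nil when there is none).
def crawl_pages(start_url, max_pages: 10)
  visited = []
  url = start_url
  while url && visited.size < max_pages && !visited.include?(url)
    visited << url
    url = yield(url)
  end
  visited
end

# Offline demonstration: a hash plays the role of the "next" links.
next_links = { 'page1' => 'page2', 'page2' => 'page3', 'page3' => nil }
all_pages = crawl_pages('page1') { |u| next_links[u] }
first_two = crawl_pages('page1', max_pages: 2) { |u| next_links[u] }
puts all_pages.inspect
puts first_two.inspect
```

With Nokogiri, the block would parse the page at the given URL, print the offers, and return the href of the .botpg-next a link, or nil when that link is missing.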

Parsing a document in a table

How do I parse a document in a table and send it across as a JSON file to another db.
Detailed Desc:
I have crawled websites with anemone and stored the data in a table. I now need to parse it and transfer it as a JSON file to another server. I think I will first have to convert the document in the table into a Nokogiri document, which can then be parsed and converted to JSON. Any idea how I can convert the doc into a Nokogiri document, or does anyone have another idea for parsing it and sending it as a JSON file?
Nokogiri is your best bet for the HTML parsing, but as for converting it to JSON you're on your own from what I can tell.
Once you have it parsed via Nokogiri it shouldn't be terribly hard to extract the elements you need and generate JSON that represents them. What you're doing isn't a very common task, so you'll have to bridge the gap between Nokogiri and whichever gem you're using to generate the JSON.
Okay, I found the answer a long time back. I basically made use of REST to send the message from one application to another, and I sent it across as a hash. And the obvious part: I used Nokogiri for parsing the table.
require 'net/http'
def post_me
  @page_hash = page_to_hash
  res = Net::HTTP.post_form(URI.parse('http://127.0.0.1:3007/element_data/save.json'), @page_hash)
end
For sending the hash from one application to another using net/http.
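Net::HTTP.post_form sends the hash as URL-encoded form fields. If the receiving application is expected to read a JSON body instead (an assumption, since the receiving side is not shown here), the request can be built by hand. A minimal sketch, with json_post_request as a hypothetical helper and a made-up payload; the request is built but not sent:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Builds (but does not send) a POST request whose body is the hash encoded as JSON.
def json_post_request(url, payload)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri.request_uri, 'Content-Type' => 'application/json')
  request.body = JSON.generate(payload)
  [uri, request]
end

uri, request = json_post_request(
  'http://127.0.0.1:3007/element_data/save.json',
  { title: 'Example page', file_type: 'a.pdf,b.doc' } # invented sample data
)
puts request.body

# To actually send it:
#   Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```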
def page_to_hash
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'domainatrix'
#page = self.page.sub(/^<!DOCTYPE html(.*)$/, '<!DOCTYPE html>')
hash={}
doc = Nokogiri::HTML(self.page)
doc.search('*').each do |n|
puts n.name
end
I use Nokogiri to parse the page table in my model. The page table holds the whole body of a webpage.
file_type = []
file_type_data = doc.xpath('//a/@href[contains(. , ".pdf") or contains(. , ".doc")
or contains(. , ".xls") or contains(. , ".cvs") or contains(. , ".txt")]')
file_type_data.each do |href|
if href[1] == "/"
href = "http://" + website_url + href
end
file_type << href
end
file_type_str = file_type.join(",")
hash ={:head => head,:title => title, :body => self.body,
:image => images_str, :file_type => file_type_str, :paragraph => para_str, :description => descr_str,:keyword => key_str,
:page_url=> self.url, :website_id=>self.parent_request_id, :website_url => website_url,
:depth => self.depth, :int_links => @int_links_arr, :ext_links => @ext_links_arr
}
A simple parsing example, showing how I formed my hash.
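The step that filters document links and makes relative ones absolute can also be written with the standard library alone, once the href strings have been extracted. This is a sketch under assumptions: document_links is a hypothetical helper, the sample URLs are invented, the extension list uses .csv where the code above lists .cvs, and it matches on the file extension of the path instead of a substring match.

```ruby
require 'uri'

DOCUMENT_EXTENSIONS = %w[.pdf .doc .xls .csv .txt].freeze

# Resolves each href against the page URL, then keeps those whose path
# ends in one of the listed extensions (case-insensitive).
def document_links(hrefs, page_url, extensions = DOCUMENT_EXTENSIONS)
  hrefs.map { |href| URI.join(page_url, href) }
       .select { |uri| extensions.include?(File.extname(uri.path.to_s).downcase) }
       .map(&:to_s)
end

hrefs = ['/files/report.pdf', 'notes.TXT', 'http://other.example.org/a.xls', '/about.html']
links = document_links(hrefs, 'http://example.com/docs/index.html')
file_type_str = links.join(',')
puts file_type_str
```

URI.join handles both root-relative and document-relative hrefs, which the string concatenation with "http://" + website_url only covers for paths starting with a slash.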

Puzzled by ror Mechanize

I'm trying to use mechanize to perform a simple search on my college's class schedule db. The following code returns nil, however it works logging into facebook and searching google (with diff url/params). What am I doing wrong?
I'm following the latest (great) railscast here. Mechanize documentation has been useful but I'm still puzzled. Thanks in advance for your suggestions!
ruby script/console
require 'mechanize'
agent = WWW::Mechanize.new
agent.get("https://www.owens.edu/cgi-bin/class.pl/")
agent.page.forms
form = agent.page.forms.last
form.occ_subject = "chm"
form.submit.search
=> []
Remove search from form.submit.search, i.e. call just form.submit. I'm guessing you appended search to submit thinking that it has something to do with the value of the submit button, i.e. search.
Your code is successfully submitting the form. However, you are then calling the search method of the resulting page object without a selector. The search method expects a selector, e.g. 'body div#nav_bar ul.links li', and returns an array of the elements that match it. No elements can match a missing selector, hence the empty array.
Edit per your response:
Your code:
ruby script/console
require 'mechanize'
agent = WWW::Mechanize.new
agent.get("https://www.owens.edu/cgi-bin/class.pl/")
agent.page.forms
form = agent.page.forms.last
form.occ_subject = "chm"
form.submit.search
=> []
What I tried and got to work:
ruby script/console
require 'mechanize'
agent = WWW::Mechanize.new
agent.get("https://www.owens.edu/cgi-bin/class.pl")
agent.page.forms
form = agent.page.forms.last
form.occ_subject = "chm"
form.submit # <- No search method.
=> Insanely long array of HTML elements
The same code will not work with Google either:
require 'mechanize'
require 'nokogiri'
agent = WWW::Mechanize.new
agent.get("http://www.google.com")
form = agent.page.forms.last
form.q = "stackoverflow"
a = form.submit.search
b = form.submit
puts a
=> [] # <--- EMPTY!
puts b
#<WWW::Mechanize::Page
{url
#<URI::HTTP:0x1020ea878 URL:http://www.google.co.uk/search?hl=en&source=hp&ie=ISO-8859-1&q=stackoverflow&meta=>}
{meta}
{title "stackoverflow - Google Search"}
{iframes}
{frames}
{links
#<WWW::Mechanize::Page::Link
"Images"
"http://images.google.co.uk/images?hl=en&source=hp&q=stackoverflow&um=1&ie=UTF-8&sa=N&tab=wi">
#<WWW::Mechanize::Page::Link
"Videos"
…
The search method of a page object behaves like the search method of Nokogiri, in that it accepts a sequence of CSS selectors and/or XPath queries and returns an enumerable object of matching elements. e.g.
page.search('h3.r a.l', '//h3/a[@class="l"]')
The page returns a null result when it is queried through WWW::Mechanize.
I'm not sure if WWW::Mechanize can handle POSTING to this secure page.
"can't convert nil into String" means it can't show you in a text form what nothing is. It can't convert something from nothing.
It also might be a problem with the form and the script delay.
Try using curl for debugging, e.g. POSTing with curl -d "occ_subject=chm" https://www.owens.edu/cgi-bin/class.pl. When I tried that, it returned a page.
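The equivalent of that curl call can be sketched in Ruby with Net::HTTP, which takes Mechanize out of the picture while debugging. The request below is only built, not sent; the field name and URL are the ones from the curl command.

```ruby
require 'net/http'
require 'uri'

uri = URI.parse('https://www.owens.edu/cgi-bin/class.pl')
request = Net::HTTP::Post.new(uri.request_uri)
# Encodes the field as a URL-encoded form body, like curl -d "occ_subject=chm"
request.set_form_data('occ_subject' => 'chm')

puts request.body

# To actually send it over HTTPS:
#   response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
#   puts response.code
```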
I think it's a problem with the secure page and the cgi script combined.
