Using Nokogiri to loop through result pages - nokogiri

I have some code to extract offers on eBay, but there are several result pages and I get only the results of the first page. How can I loop through several result pages?
Here is my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.ebay.de/sch/i.html?_nkw=Suzuki+DR+BIG&_sacat=131090&_odkw=Suzuki+DR+BIG&_osacat=0&_from=R40"
doc = Nokogiri::HTML(open(url))
doc.css(".dtl").each do |dtl|
  puts dtl.at_css(".vip").text
end

You have to aggregate the results from each page by pulling the link out of the "next" button (which, inspecting the page, matches the CSS selector .botpg-next a) and loading it.
Something like this:
url = "http://www.ebay.de/sch/i.html?_nkw=Suzuki+DR+BIG&_sacat=131090&_odkw=Suzuki+DR+BIG&_osacat=0&_from=R40"
while url
  doc = Nokogiri::HTML(open(url))
  doc.css(".dtl").each do |dtl|
    puts dtl.at_css(".vip").text
  end
  link = doc.at_css('.botpg-next a') # at_css returns nil when nothing matches
  url = link && link['href']         #=> url is nil if no "next" link is found
end
I'm just looping until no "next" button is found, but you could change that to limit the loop to a given number of results.
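If you do want to cap the crawl instead of following "next" links until they run out, the loop condition just needs a page counter. A minimal offline sketch, where the PAGES hash is a hypothetical stand-in for following the .botpg-next link on each fetched page:

```ruby
# Hypothetical link map standing in for the "next" button on each result page
PAGES = {
  'page1' => 'page2',
  'page2' => 'page3',
  'page3' => nil # last page: no "next" link
}

def crawl(start_url, max_pages: 5)
  url = start_url
  visited = []
  while url && visited.size < max_pages
    visited << url
    # Real code: doc = Nokogiri::HTML(open(url)), process doc.css('.dtl') here,
    # then: link = doc.at_css('.botpg-next a'); url = link && link['href']
    url = PAGES[url]
  end
  visited
end
```

crawl('page1') visits all three pages; crawl('page1', max_pages: 2) stops after two.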

Related

How to do SEO link tag for two/multiple pagination in single page?

I wrote an application in Rails 4. In that app I have two paginated collections on a single page, 'x'. The URL carries both group and page params.
The URL looks like:
https://example.com/x?page=2&group=4
Initial page:
https://example.com/x
When paginating by the page param:
https://example.com/x?page=2
When paginating by the group param:
https://example.com/x?group=2
When paginating by both:
https://example.com/x?page=2&group=2
and so on.
I am using the Kaminari gem for pagination, and its rel_next_prev_link_tags helper to emit the prev/next link tags.
How do I emit link tags for two paginations at once?
I created a custom helper that parses the URL and builds the link tags per pagination param. In the view:
pagination_link_tags(@pages, 'page') for the pages pagination
pagination_link_tags(@groups, 'group') for the groups pagination
def pagination_link_tags(collection, pagination_params)
  output = []
  link = '<link rel="%s" href="%s"/>'
  uri = Addressable::URI.parse(request.fullpath)
  parameters = uri.query_values
  # Update the params based on the param name and build the SEO link tags
  if parameters.nil?
    if collection.next_page
      uri.query_values = { pagination_params => collection.next_page.to_s }
      output << link % ["next", uri.to_s]
    end
  else
    if collection.previous_page
      parameters[pagination_params] = collection.previous_page.to_s
      uri.query_values = parameters
      output << link % ["prev", uri.to_s]
    end
    if collection.next_page
      parameters[pagination_params] = collection.next_page.to_s
      uri.query_values = parameters
      output << link % ["next", uri.to_s]
    end
  end
  output.join("\n").html_safe
end
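The Addressable-specific part is just the query merging, and that step can be sketched with the stdlib URI and CGI modules instead. A simplified sketch, not the full helper; page_url is a hypothetical name:

```ruby
require 'uri'
require 'cgi'

# Merge one pagination param into the current path's query string,
# preserving any other params already present (e.g. the other pagination).
def page_url(fullpath, param, value)
  uri = URI.parse(fullpath)
  params = uri.query ? CGI.parse(uri.query).transform_values(&:first) : {}
  params[param] = value.to_s
  uri.query = URI.encode_www_form(params)
  uri.to_s
end
```

For example, page_url('/x?page=2&group=4', 'group', 3) keeps page=2 and only advances group.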
You can't show search engines two-dimensional pagination. In your case it looks more like grouping/categorizing + pagination.
Like:
Group 1 pages:
https://example.com/x
https://example.com/x?page=2
https://example.com/x?page=3
Group 2 pages:
https://example.com/x?group=2
https://example.com/x?page=2&group=2
https://example.com/x?page=3&group=2
Etc.

Automated website interaction - Mechanzie - Rails

I'm using the Mechanize gem to automate interaction with a website form.
The site I'm trying to interact with is http://www.tastekid.com/like/books
I'm trying to automatically submit a query string through the form and get the suggested books back in an array.
Following the guide, I pretty-printed the page layout to find the form name, but I only found a form whose name is nil:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.tastekid.com/like/books')
pp page
How do I enter a string, submit the form and return the results in the form of an array?
These answers feel a little cluttered to me, so let me try to make it simpler:
page = agent.get 'http://www.tastekid.com/like/books'
There's only one form, so:
form = page.form
form['q'] = 'twilight'
Submit the form:
page = form.submit
Then print the text from the links:
puts page.search('.books a').map(&:text)
Following the guide, you can get the form:
form = page.form
I didn't see a name on the form, and I actually got two forms back: one on the page and one hidden.
I called
form.fields.first.methods.sort # not the hidden form
and saw that I could call value on the field, so I set it:
form.fields.first.value = "Blood Meridian"
Then I submitted and pretty-printed:
page = agent.submit(form)
This should work for you!
You could use the form_with method to locate the form you want. For example:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.tastekid.com/like/books')
the_form_you_want = page.form_with(:id => "searchFrm") # form_with
the_form_you_want.q = 'No Country for Old Men'
page = agent.submit(the_form_you_want)
pp page
It looks like the book titles all have the same class attribute. To extract the book titles, use the links_with method and pass in the class as a locator:
arr = []
page.links_with(:class => "rsrc").each do |link|
  arr << link.text
end
But @aceofbassgreg is right. You'll need to read up on the Mechanize and Nokogiri documentation...

A single Nokogiri rake task to scrape all Groupon deals?

I want to scrape Groupon deals using Nokogiri. I want to scrape all these deals at the following link:
http://www.groupon.com/getaways?d=travel_countmein
On top of that, I want to access each individual link and scrape the title and price. Conceptually, is there a way to code a single rake task to do this?
I understand that there needs to be a loop of some sort, but I don't know how to parse the url for each deal from the main getaway page.
I've already written a scraper for the title and price:
task :fetch_travel => :environment do
  require 'nokogiri'
  require 'open-uri'
  url = "http://www.groupon.com/deals/ga-flamingo-conferences-resort-spa?c=all&p=0"
  doc = Nokogiri::HTML(open(url))
  title = doc.at_css("#content//h2/a").text
  price = doc.at_css("#amount").text[/[0-9\.]+/]
  link  = doc.at_css("#content//h2/a")[:href]
  desc  = doc.at_css(".descriptor").text
  Traveldeal.create(:title => title, :price => price, :url => link, :description => desc)
end
Figured out that this requires a nested loop, where the inner loop is the code above and the outer loop parses each deal URL from the main getaways page.
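The nested structure can be sketched with the fetching and parsing stubbed out, so the control flow runs offline. fetch_deal_links and scrape_deal are hypothetical names: in the real task the first would wrap Nokogiri::HTML(open(url)) plus a CSS selector for the deal anchors, and the second is the existing title/price scraper from the rake task above.

```ruby
# Stubs simulating already-parsed pages so the loop shape is testable offline
def fetch_deal_links(index_page)
  index_page[:deal_urls]
end

def scrape_deal(url)
  { url: url, title: "title for #{url}" }
end

def scrape_all(index_pages)
  index_pages.flat_map do |page|          # outer loop: one pass per getaways page
    fetch_deal_links(page).map do |url|   # every deal URL found on that page
      scrape_deal(url)                    # inner step: scrape one deal
    end
  end
end
```

flat_map keeps the result as one flat array of deal hashes rather than an array per index page.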

Google custom search API with pagination

I have this method that puts the links of the 10 results from the Google custom search API into an array:
require 'json'
require 'open-uri'
def create
  search = params[:search][:search]
  base_url = "https://www.googleapis.com/customsearch/v1?"
  stream = open("#{base_url}key=XXXXXXXXXXXXX&cx=XXXXXXXXXX&q=#{search}&start=#{i}&alt=json")
  raise 'web service error' if stream.status.first != '200'
  result = JSON.parse(stream.read)
  @new = []
  result['items'].each do |r|
    @new << r['link']
  end
end
and my view:
<% @new.each do |link| %>
  <p><%= link %></p>
<% end %>
I'm having trouble figuring out how to add pagination with this so that on the second page would return the next 10 results. I'm using the Kaminari gem for pagination.
I want for when a user clicks a link to another page, I fetch the next 10 results from Google's API. You can do this with the API's start parameter that specifies the first result to start with, which I pass as i. I was thinking of doing something like this:
i = (params[:page] - 1) * 10 + 1
where params[:page] is the current page number, but for some reason it is undefined. Also I'm unsure about how to setup pagination for an array that is not an AR object, and what would go in my view. I'd appreciate any help, and feel free to use any pagination gem you know.
How are you setting params[:page]? It needs to be passed along with the other parameters in your request in some way.
Perhaps you need something like this in your controller:
@page = (params[:page] || 1).to_i
i = (@page - 1) * PER_PAGE + 1
stream = open("#{base_url}key=XXXXXXXXXXXXX&cx=XXXXXXXXXX&q=#{search}&start=#{i}&alt=json")
raise 'web service error' if stream.status.first != '200'
result = JSON.parse(stream.read)
@new = result['items'].map { |r| r['link'] }
In your view, make sure you pass the page via a query parameter in the link that fetches the next set of results; most likely you want that link to carry @page + 1.
Handling pagination for non-ActiveRecord objects depends on your pagination library: Kaminari has Kaminari.paginate_array for plain arrays, and will_paginate has a comparable array extension worth checking out.
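The start-index arithmetic is easy to get wrong when params[:page] is a string or missing, so it helps to isolate it where it can be tested on its own. PER_PAGE and start_index are names assumed here, not part of the original code:

```ruby
PER_PAGE = 10

# Convert a page param (string, nil, or junk) into the 1-based start index
# that the Google Custom Search API expects in its "start" parameter.
def start_index(page_param)
  page = page_param.to_i
  page = 1 if page < 1 # nil.to_i and "abc".to_i are 0, so clamp to page 1
  (page - 1) * PER_PAGE + 1
end
```

Page 1 starts at result 1, page 2 at result 11, and so on, which matches fetching 10 results per page.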

Puzzled by ror Mechanize

I'm trying to use mechanize to perform a simple search on my college's class schedule db. The following code returns nil, however it works logging into facebook and searching google (with diff url/params). What am I doing wrong?
I'm following the latest (great) railscast here. Mechanize documentation has been useful but I'm still puzzled. Thanks in advance for your suggestions!
ruby script/console
require 'mechanize'
agent = WWW::Mechanize.new
agent.get("https://www.owens.edu/cgi-bin/class.pl/")
agent.page.forms
form = agent.page.forms.last
form.occ_subject = "chm"
form.submit.search
=> []
Remove search from form.submit.search, i.e. use form.submit. I'm guessing you appended search thinking it has something to do with the value of the submit button (i.e. "search").
What your code is doing IS successfully submitting the form. However, you are then calling the search method of the resulting page object with no argument. search expects a selector, e.g. 'body div#nav_bar ul.links li', and returns an array of the elements matching that selector. No elements match a nil selector, hence the empty array.
Edit per your response:
Your code:
ruby script/console
require 'mechanize'
agent = WWW::Mechanize.new
agent.get("https://www.owens.edu/cgi-bin/class.pl/")
agent.page.forms
form = agent.page.forms.last
form.occ_subject = "chm"
form.submit.search
=> []
What I tried and got to work:
ruby script/console
require 'mechanize'
agent = WWW::Mechanize.new
agent.get("https://www.owens.edu/cgi-bin/class.pl")
agent.page.forms
form = agent.page.forms.last
form.occ_subject = "chm"
form.submit # <- No search method.
=> Insanely long array of HTML elements
The same code will not work with Google either:
require 'mechanize'
require 'nokogiri'
agent = WWW::Mechanize.new
agent.get("http://www.google.com")
form = agent.page.forms.last
form.q = "stackoverflow"
a = form.submit.search
b = form.submit
puts a
=> [] # <--- EMPTY!
puts b
#<WWW::Mechanize::Page
{url
#<URI::HTTP:0x1020ea878 URL:http://www.google.co.uk/search?hl=en&source=hp&ie=ISO-8859-1&q=stackoverflow&meta=>}
{meta}
{title "stackoverflow - Google Search"}
{iframes}
{frames}
{links
#<WWW::Mechanize::Page::Link
"Images"
"http://images.google.co.uk/images?hl=en&source=hp&q=stackoverflow&um=1&ie=UTF-8&sa=N&tab=wi">
#<WWW::Mechanize::Page::Link
"Videos"
…
The search method of a page object behaves like the search method of Nokogiri, in that it accepts a sequence of CSS selectors and/or XPath queries and returns an enumerable object of matching elements. e.g.
page.search('h3.r a.l', '//h3/a[@class="l"]')
The page returns a null result when it is queried through WWW::Mechanize.
I'm not sure if WWW::Mechanize can handle POSTING to this secure page.
"can't convert nil into String" means something that was expected to be a string was nil; Ruby can't render nothing as text.
It also might be a problem with the form and the script delay.
Try using curl to debug the POST, e.g. curl -d "occ_subject=chm" https://www.owens.edu/cgi-bin/class.pl; when I tried that it returned a page.
I think it's a problem with the secure page and the cgi script combined.