Nokogiri and Mechanize help (clicking links found by Nokogiri via Mechanize) - ruby-on-rails

I search for links via a CSS selector from page = agent.get('http://www.print-index.ru/default.aspx?p=81&gr=198'), and after that the page variable contains a lot of links, but I don't know how to use them or how to click them via Mechanize. I found this method on Stack Overflow:
page = agent.get "http://google.com"
node = page.search ".//p[@class='posted']"
Mechanize::Page::Link.new(node, agent, page).click
but it only works for one link, so how can I use this method for many links?
If I should post additional information, please say so.

If your goal is simply to make it to the next page and then scrape some info off of it, then all you really care about are:
Page content (For scraping your data)
The URL to the next page you need to visit
You can get at the page content using Mechanize OR something else, like OpenURI (which is part of the Ruby standard library). As a side note, Mechanize uses Nokogiri behind the scenes; when you start to dig into elements on the parsed page you will see they come back as Nokogiri-related objects.
Anyways, if this were my project I'd probably go the route of using OpenURI to get at the page's content and then Nokogiri to search it. I like the idea of using a Ruby standard library instead of requiring an additional dependency.
Here is an example using OpenURI:
require 'nokogiri'
require 'open-uri'
printing_page = Nokogiri::HTML(open("http://www.print-index.ru/default.aspx?p=81&gr=198"))
# ...
# Your code to scrape whatever you want from the Printing Page goes here
# ...
# Find the next page to visit. Example: You want to visit the "About the project" page next
about_project_link_in_navbar_menu = printing_page.css('a.graymenu')[4] # This is an overly simple finder; Nokogiri can do XPath searches too
about_project_link_in_navbar_menu_url = "http://www.print-index.ru#{about_project_link_in_navbar_menu.attributes["href"].value}" # Build the absolute URL for the link
about_project_page = Nokogiri::HTML(open(about_project_link_in_navbar_menu_url)) # Get the About page's content
# ....
# Do something...
# ....
Here's an example using Mechanize to get the page content (they are very similar):
require 'mechanize'
agent = Mechanize.new
printing_page = agent.get("http://www.print-index.ru/default.aspx?p=81&gr=198")
# ...
# Your code to scrape whatever you want from the Printing Page goes here
# ...
# Find the next page to visit. Example: You want to visit the "About the project" page next
about_project_link_in_navbar_menu = printing_page.search('a.graymenu')[4] # This is an overly simple finder; Nokogiri can do XPath searches too
about_project_link_in_navbar_menu_url = "http://www.print-index.ru#{about_project_link_in_navbar_menu.attributes["href"].value}" # Build the absolute URL for the link
about_project_page = agent.get(about_project_link_in_navbar_menu_url)
# ....
# Do something...
# ....
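Coming back to the original question of clicking many links: here is a minimal sketch that wraps the Mechanize::Page::Link trick from the question in a loop. The 'a.graymenu' selector is just an example; use whatever selector matches the links you care about:
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.print-index.ru/default.aspx?p=81&gr=198")
# Build a Mechanize link from each matching Nokogiri node, then click it
page.search('a.graymenu').each do |node|
  link = Mechanize::Page::Link.new(node, agent, page)
  next_page = link.click
  # ... scrape next_page here ...
end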
PS: I used Google to translate Russian to English, so if the variable names are incorrect, I'm sorry! :X

Related

Why can't I scrape this particular website with Ruby Mechanize?

I ideally want to access the API from this website, but since I am struggling to do that, I have decided to try and scrape the page instead. I am starting at this page:
https://fantasy.sixnationsrugby.com/#/welcome/login
Where I plan to log in and then scrape the data.
The code I have below seems to work for every other website I test with, apart from this one. I can't seem to pull anything: no text, no forms, literally nothing works. As an example, I just want to scrape the main header title 'Let's Go! Log in to your account'.
def scrape
  require 'rubygems'
  require 'mechanize'
  agent = Mechanize.new
  page = agent.get('https://fantasy.sixnationsrugby.com/#/welcome/login')
  header_title = page.search('div.fs-box-header-title').text.strip
  @output = header_title
end
Is it something to do with how the page is rendered? Thanks

Rails - gem for downloading files from another website

I am currently working on a Rails app.
I want to go to a website (http://alt19.com/) and select a set of options, then click a button which triggers the download of a CSV file. Then I want to take the file and parse it.
I have found a gem for parsing CSV files.
However, I don't know if there is a gem for navigating to another website, selecting a set of options, downloading several files and saving them somewhere where my app can process them.
Is there anything like this?
If not, are there any alternative solutions?
You can use the mechanize gem to scrape the page. Mechanize uses nokogiri as one of its dependencies, which is responsible for the parsing, and mechanize adds the ability to click elements on the page.
As you can see, the CSV generator form makes a POST with some params.
Just do the same with 'net/http' and 'uri':
Example :
require "uri"
require "net/http"
params = {
  'box1' => 'Nothing is less important than which fork you use. Etiquette is the science of living. It embraces everything. It is ethics. It is honor. -Emily Post',
  'button1' => 'Submit'
}
x = Net::HTTP.post_form(URI.parse('http://www.interlacken.com/webdbdev/ch05/formpost.asp'), params)
puts x.body
Example source: Submitting POST data from the controller in rails to another website
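If you would rather drive the site's form with Mechanize and parse the download in one go, here is a minimal sketch. The form lookup and field name are hypothetical; inspect the real page to find the actual ones:
require 'mechanize'
require 'csv'

agent = Mechanize.new
page = agent.get('http://alt19.com/')
form = page.forms.first       # hypothetical: pick the form that generates the CSV
form['report_type'] = 'daily' # hypothetical field name -- set your options here
result = form.submit          # the response body should be the CSV payload

CSV.parse(result.body, headers: true).each do |row|
  # process each row of the downloaded CSV
  puts row.inspect
end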

Detect redirect to specific IP with Mechanize Ruby

I am using the Ruby Mechanize gem to fetch and parse websites, and I need to detect redirects to a certain IP. Here is my basic setup:
agent = Mechanize.new
page = agent.get('http://www.example.com')
Now, it's obvious how to detect the redirect itself:
is_redirect = page.code[/30[12]/].present?
but I want to take it a step further and check which domain/IP it redirects to, so something along the lines of (pseudo-code):
if page.resolves_to(55.55.55.55)...
Any thoughts on how this can be achieved?
The redirected URL is in Page#uri:
require 'socket'
IPSocket::getaddress(page.uri.host)
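Putting it together, a minimal sketch (55.55.55.55 is the placeholder IP from the question):
require 'mechanize'
require 'socket'

agent = Mechanize.new
page = agent.get('http://www.example.com')

# Mechanize follows redirects automatically, so page.uri holds the final URL
resolved_ip = IPSocket::getaddress(page.uri.host)
puts "Redirects to the target host" if resolved_ip == '55.55.55.55'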

How to scrape data from another website using Rails 3

I have a Rails 3.2.13 site that needs to scrape another website to get a product description. What is the best way to do this in Rails 3?
I've heard that nokogiri is fast. Should I use nokogiri? And if I use nokogiri, is it possible to avoid saving the scraped data? I imagine it as being just like getting JSON data from an API; is it like that?
I'd recommend a combination of Nokogiri and open-uri. Require both gems, and then just do something along the lines of doc = Nokogiri::HTML(open(YOUR_URL)). Then find the element you want to capture, using the developer tools in Chrome (or the equivalent) or something like SelectorGadget. Then you can use doc.at_css(SELECTOR) for a single element, or doc.search(SELECTOR) for multiple matches. Calling the text method on the result should get you the product description you're looking for. No need to save anything to the database (unless you want to). Hope that helps!
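A minimal sketch of that flow; the URL and selector are hypothetical, so substitute the real product page and the selector you found with your developer tools:
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://example.com/products/1')) # hypothetical URL
node = doc.at_css('div.product-description')                # hypothetical selector
puts node.text.strip if node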
mechanize is a wonderful gem for scraping data from other websites as HTML. It is simple and robust, and it uses the nokogiri gem as its result wrapper.
The following snippet shows how you can fetch the data you need from url while identifying yourself as a Safari browser:
require 'htmlentities'
require 'mechanize'

a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

@result_hash = {}
a.get(url) do |page| # url is the address of the page you want to scrape
  parsed_page = page.parser # the underlying Nokogiri document
  @result_hash[:some_data_name] = parsed_page.at_xpath("//h1[@class='any-class']").text.split(/\s+/).join(" ")
end

How to open URLs in rails?

I'm trying to read in the html of a certain website.
Trying @something = open("http://www.google.com/") fails with the following error:
Errno::ENOENT in testController#show
No such file or directory - http://www.google.com/
Going to http://www.google.com/, I obviously see the site. What am I doing wrong?
Thanks!
You need to require 'open-uri' first to be able to open() remote paths.
See the docs for more info.
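For example, a minimal sketch of the fix:
require 'open-uri'
@something = open("http://www.google.com/") # returns an IO-like object
html = @something.read                      # the page's HTML as a string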
You should use a utility like Nokogiri to parse the returned content like so:
(From the Nokogiri site's front page at http://nokogiri.org/)
require 'nokogiri'
require 'open-uri'
# Get a Nokogiri::HTML:Document for the page we’re interested in...
doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove'))
# Do funky things with it using Nokogiri::XML::Node methods...
# Search for nodes by css
doc.css('h3.r a.l').each do |link|
  puts link.content
end
will print to the screen:
Some Link
