How to iterate pages of a site from Rails and Nokogiri - ruby-on-rails

I'm trying to build an informational site that shows the visitor all deals from a specific merchant on a single page. I managed to scrape the headlines from the first page and to collect the paginated URLs in an array.
My code should take each URL, pass it to the scraper, list the items of that page, move on to the next page, scrape its headlines, append them to the list built so far, and so on.
My controller looks like this:
class ApplicationController < ActionController::Base
  # Prevent CSRF attacks by raising an exception.
  # For APIs, you may want to use :null_session instead.
  protect_from_forgery with: :exception
  class Entry
    def initialize(title)
      @title = title
    end
    attr_reader :title
  end
  def scrape_mydealz
    require 'open-uri'
    urlarray = Array.new
    # --------------------------------------------------------------- Build URLs
    pagination = '&page=1'
    count = [1, 2]
    count.each do |i|
      base_url = "https://www.mydealz.de/search?q=media+markt"
      pagination = "&page=#{i}"
      combination = base_url + pagination
      urlarray << combination
    end
    # --------------------------------------------------------------- / Build URLs
    urlarray.each do |test|
      doc = Nokogiri::HTML(open("#{test}"))
      entries = doc.css('article.thread')
      @entriesArray = []
      entries.each do |entry|
        title = entry.css('a.vwo-thread-title').text
        @entriesArray << Entry.new(title)
      end
    end
    render template: 'scrape_mydealz'
  end
end
With this code, it iterates to page 2 but displays only the scrape results from page 2.
The result can be found here:
https://mm-scraper-neevoo.c9users.io/

You reinitialize @entriesArray on each iteration, so only the last page's entries survive. The easiest fix is to move the initialization outside the loop:
@entriesArray = []
urlarray.each do |test|
  doc = Nokogiri::HTML(open("#{test}"))
  entries = doc.css('article.thread')
  entries.each do |entry|
    title = entry.css('a.vwo-thread-title').text
    @entriesArray << Entry.new(title)
  end
end
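The effect of where the array is initialized can be reproduced without any network access. The following is only a minimal sketch: both method names and the sample data are made up, with hard-coded title lists standing in for scraped pages.

```ruby
# Buggy variant: the accumulator is reset for every page,
# so only the titles of the last page survive.
def titles_reset_each_page(pages)
  titles = []
  pages.each_value do |page_titles|
    titles = [] # reinitialized on each iteration
    page_titles.each { |title| titles << title }
  end
  titles
end

# Fixed variant: the accumulator is created once, before the loop.
def titles_accumulated(pages)
  titles = []
  pages.each_value do |page_titles|
    page_titles.each { |title| titles << title }
  end
  titles
end

# Hypothetical stand-in for scraped pages: page number => titles on that page.
pages = { 1 => ['Deal A', 'Deal B'], 2 => ['Deal C'] }
last_page_only = titles_reset_each_page(pages)
all_pages = titles_accumulated(pages)
```

With this sample data the first method returns only the page 2 title, which matches the behaviour described in the question, while the second returns the titles of both pages.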

This is untested but it's the general idea I'd use to scan a site with two pages and accumulate the titles:
require 'open-uri'
BASE_URL = 'https://www.mydealz.de/search?q=media+markt&page=1'
def scrape_mydealz
  urls = []
  2.times do |i|
    url = URI.parse(BASE_URL)
    base_query = URI::decode_www_form(url.query).to_h
    base_query['page'] = 1 + i
    url.query = URI.encode_www_form(base_query)
    urls << url
  end
  @entries_array = []
  urls.each do |url|
    # URI.open replaces Kernel#open for URLs on Ruby 3.0 and later
    doc = Nokogiri::HTML(URI.open(url))
    doc.css('article.thread').each do |entry|
      @entries_array << Entry.new(entry.at('a.vwo-thread-title').text)
    end
  end
  render template: 'scrape_mydealz'
end
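The URL-building part of that method can be tried on its own, since it only needs the standard library. This is a small sketch of the same idea; `page_urls` is a hypothetical helper name, not something from Rails or Nokogiri.

```ruby
require 'uri'

# Build one URL per page by rewriting only the `page` query parameter
# and leaving every other parameter untouched.
def page_urls(base_url, page_count)
  (1..page_count).map do |page|
    url = URI.parse(base_url)
    query = URI.decode_www_form(url.query.to_s).to_h
    query['page'] = page
    url.query = URI.encode_www_form(query)
    url.to_s
  end
end

urls = page_urls('https://www.mydealz.de/search?q=media+markt&page=1', 2)
```

Parsing and re-encoding the query string is a bit more code than string concatenation, but it keeps working if the base URL gains or loses parameters.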
Be careful using text with search, css or xpath:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.search('p').text # => "foobar"
doc.search('p').map(&:text) # => ["foo", "bar"]
Notice that the first result has concatenated the contents of the <p> tags. Taking those apart afterwards is usually not possible.

Related

Parse API and Show Output in Rails View

So, I wrote a program that sends a get request to HappyFox (a support ticket web app) and I get a JSON file, Tickets.json.
I also wrote methods that parse the JSON and return a hash with the information that I want, i.e., tickets with and without a response.
How do I integrate this with my Rails app? I want my HappyFox View (in rails) to show the output of those methods, and give the user the ability to refresh the info whenever they want.
Ruby Code:
require 'httparty'
def happy_fox_call()
  auth = { :username => 'REDACTED',
           :password => 'REDACTED' }
  @tickets = HTTParty.get("http://avatarfleet.happyfox.com/api/1.1/json/tickets/?size=50&page=1",
                          :basic_auth => auth)
  tickets = File.new("Tickets.json", "w")
  tickets.puts @tickets
  tickets.close
end
puts "Calling API, please wait..."
happy_fox_call()
puts "Complete!"
require 'json'
$data = File.read('/home/joe/API/Tickets.json')
$tickets = JSON.parse($data)
$users = $tickets["data"][3]["name"]
Count each status in ONE method
def count_each_status(*statuses)
  status_counters = Hash.new(0)
  $tickets["data"].each do |tix|
    if statuses.include?(tix["status"]["name"])
      #puts status_counters # this is cool! Run this
      status_counters[tix["status"]["name"]] += 1
    end
  end
  return status_counters
end
Count tickets with and without a response
def count_unresponded(tickets)
  true_counter = 0
  false_counter = 0
  $tickets["data"].each do |tix|
    if tix["unresponded"] == false
      false_counter += 1
    else
      true_counter += 1
    end
  end
  puts "There are #{true_counter} tickets without a response"
  puts "There are #{false_counter} ticket with a response"
end
Make a function that creates a count of tickets by user
def user_count(users)
  user_count = Hash.new(0)
  $tickets["data"].each do |users|
    user_count[users["user"]["name"]] += 1
  end
  return user_count
end
puts count_each_status("Closed", "On Hold", "Open", "Unanswered",
"New", "Customer Review")
puts count_unresponded($data)
puts user_count($tickets)
Thank you in advance!
You could create a new module in your lib directory that handles the API call/JSON parsing and include that file in whatever controller you want to interact with it. From there it should be pretty intuitive to assign variables and dynamically display them as you wish.
https://www.benfranklinlabs.com/where-to-put-rails-modules/
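As a rough sketch of what such a module could look like: the module name `HappyFoxStats`, the file location, and the sample data below are all assumptions for illustration. The methods take the parsed tickets as an argument and return values instead of printing, so a controller can hand the results to a view.

```ruby
require 'json'

# Hypothetical module, e.g. saved as lib/happy_fox_stats.rb.
module HappyFoxStats
  module_function

  # Count tickets per status name, restricted to the given statuses.
  def count_each_status(tickets, *statuses)
    tickets['data'].each_with_object(Hash.new(0)) do |tix, counts|
      name = tix['status']['name']
      counts[name] += 1 if statuses.include?(name)
    end
  end

  # Return both counts as a hash so a view can render them.
  def count_unresponded(tickets)
    without, with = tickets['data'].partition { |tix| tix['unresponded'] != false }
    { without_response: without.size, with_response: with.size }
  end
end

# Made-up sample in the same shape the question's code reads from.
sample = JSON.parse(<<~TICKETS)
  {"data": [
    {"status": {"name": "Open"},   "unresponded": true},
    {"status": {"name": "Closed"}, "unresponded": false},
    {"status": {"name": "Open"},   "unresponded": false}
  ]}
TICKETS

open_counts = HappyFoxStats.count_each_status(sample, 'Open')
response_counts = HappyFoxStats.count_unresponded(sample)
```

A controller action could assign these return values to instance variables and render them; reloading that action would then refresh the numbers. Depending on the Rails version, files under lib may need an explicit require or an autoload path entry.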

How to increment the loop if it does not match the id?

I'm working on code that displays images from an AWS server, but I'm having trouble with the loop.
It works fine for the first image but does not go any further (I have to display up to 6 images).
The code for this:
def get_image_urls(user)
  user_identifications = user.user_identifications.where(current_flag: true).order(:id_dl)
  urls = []
  keys = []
  if !user_identifications.empty? && !user_identifications.nil?
    user_identifications.each_with_index do |each_id, index|
      obj = S3_BUCKET.object(each_id.aws_key)
      urls << {each_id.id_dl=> obj.presigned_url(:get)}
      keys << {each_id.id_dl=> each_id.aws_key}
    end
  end
  return urls, keys
end
How to increment the loop based on checking the id and user.identifications value?
Reject all blank values first, and then iterate over user_identifications.
For example:
sanitized_identifications = user_identifications.reject(&:blank?)
sanitized_identifications.each_with_index do |each_id, _index|
  # Now if you want to skip an iteration based on some condition, try `next`, like:
  # next if some_condition
  # in your case
  obj = S3_BUCKET.object(each_id.aws_key)
  next if obj.blank?
  urls << {each_id.id_dl=> obj.presigned_url(:get)}
  keys << {each_id.id_dl=> each_id.aws_key}
end
UPDATE
# ...
user_identifications.each_with_index do |each_id, index|
  begin
    obj = S3_BUCKET.object(each_id.aws_key)
    urls << {each_id.id_dl=> obj.presigned_url(:get)}
    keys << {each_id.id_dl=> each_id.aws_key}
  rescue => e
    next
  end
end
#...
Cheers!
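Both ideas from this answer, skipping with `next` and rescuing a failed lookup, can be tried in isolation. The sketch below uses plain hashes and a lambda in place of the ActiveRecord rows and the S3 bucket; `collect_urls`, `signer`, and the sample data are hypothetical names chosen for illustration.

```ruby
# Collect one URL per record, skipping records that cannot be resolved.
# `signer` is a hypothetical callable standing in for the S3 lookup:
# it receives a key and either returns a URL string or raises KeyError.
def collect_urls(records, signer)
  urls = []
  records.each do |record|
    key = record[:aws_key]
    next if key.nil? || key.empty? # nothing to look up for this record

    begin
      urls << { record[:id_dl] => signer.call(key) }
    rescue KeyError
      next # lookup failed; continue with the next record
    end
  end
  urls
end

store = {
  'a.png' => 'https://example.com/a.png',
  'c.png' => 'https://example.com/c.png'
}
signer = ->(key) { store.fetch(key) } # raises KeyError for unknown keys

records = [
  { id_dl: 1, aws_key: 'a.png' },
  { id_dl: 2, aws_key: nil },
  { id_dl: 3, aws_key: 'missing.png' },
  { id_dl: 4, aws_key: 'c.png' }
]
result = collect_urls(records, signer)
```

Rescuing one specific error class, rather than everything, keeps unrelated bugs from being silently swallowed inside the loop.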

Iterating in nokogiri with rails doesn't work

So I wrote some nokogiri code that works in a test .rb file but when I put it inside a rails app model it won't iterate and just returns the first value. Here is the code that iterates correctly:
require "rubygems"
require "open-uri"
require "nokogiri"
url = "http://www.ebay.com/sch/Cars-Trucks-/6001/i.html?_from=R40&_sac=1&_vxp=mtr&_nkw=car+projects&_ipg=200&rt=nc"
data = Nokogiri::HTML(open(url))
data.css(".li").each do |item|
  item_link = item.at_css(".vip")[:href]
  item_doc = Nokogiri::HTML(open(item_link))
  puts item_doc.at_css("#itemTitle").text.sub! 'Details about', ''
end
Here is the same code in a rails app that only returns the first title it finds:
require "rubygems"
require "open-uri"
require "nokogiri"
class EbayScraper
  attr_accessor :url, :data
  def initialize(url)
    @url = url
  end
  def data
    @data ||= Nokogiri::HTML(open(@url))
  end
  def titles
    data.css(".li").each do |item|
      item_link = item.at_css(".vip")[:href]
      item_data = Nokogiri::HTML(open(item_link))
      return item_data.at_css("#itemTitle").text.sub! 'Details about', ''
    end
  end
end
ebay = EbayScraper.new("http://www.ebay.com/sch/Cars-Trucks-/6001/i.html?_from=R40&_sac=1&_vxp=mtr&_nkw=car+projects&_ipg=200&rt=nc")
titles = ebay.titles
puts titles
Why does the first code iterate through the whole thing and the second bunch of code just returns the first one?
Thanks for your time in advance!
Because you have a return statement in your loop that exits your titles function.
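The difference can be seen without any scraping. In this sketch the method names and sample strings are made up; the first method mirrors the early `return`, and the second collects every value with `map` instead.

```ruby
# Returns from the whole method during the first iteration,
# so only the first title is ever produced.
def first_title_only(raw_titles)
  raw_titles.each do |raw|
    return raw.sub('Details about', '').strip
  end
end

# Collects every title: `map` builds an array from the block's values.
def all_titles(raw_titles)
  raw_titles.map { |raw| raw.sub('Details about', '').strip }
end

raw = ['Details about  1969 Camaro project', 'Details about  Ford F100']
first = first_title_only(raw)
every = all_titles(raw)
```

The sketch uses `sub` rather than `sub!` because `sub!` returns nil when nothing was replaced, which would put nil into the results for any title lacking the prefix.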

How do I filter my results when scraping a website using Nokogiri gem?

I am trying to scrape list of restaurants for my zip code from Deliveroo.co.uk
I need to add a way to figure out whether a restaurant is open or closed. On the website it's very clear, but I need to update my code to reflect this.
How do I go about doing this? I need to create something like a 'status' variable and then set each restaurant to 'open' or 'closed'.
Here is the website I'm trying to scrape from: https://deliveroo.co.uk/restaurants/london/maida-vale?postcode=W92DE&time=1800&day=today
And my code is below.
thanks.
require 'open-uri'
require 'nokogiri'
require 'csv'
# Store URL to be scraped
url = "https://deliveroo.co.uk/restaurants/london/maida-vale?postcode=W92DE"
# Parse the page with Nokogiri
page = Nokogiri::HTML(open(url))
# Display output onto the screen
name = []
page.css('span.list-item-title.restaurant-name').each do |line|
  name << line.text
end
category = []
page.css('span.restaurant-detail.detail-cat').each do |line|
  category << line.text
end
delivery_time = []
page.css('span.restaurant-detail.detail-time').each do |line|
  delivery_time << line.text
end
distance = []
page.css('span.restaurant-detail.detail-distance').each do |line|
  distance << line.text
end
status = []
# Write data to CSV file
CSV.open("deliveroo.csv", "w") do |file|
  file << ["Name", "Category", "Delivery Time", "Distance", "Status"]
  name.length.times do |i|
    file << [name[i], category[i], delivery_time[i], distance[i]]
  end
end
Check whether each li.restaurant--details element has the class unavailable: if it does, the restaurant is closed; otherwise it is open.
status = []
page.css('li.restaurant--details').each do |line|
  if line.attr("class").include? "unavailable"
    sts = "closed"
  else
    sts = "open"
  end
  status << sts
end
By the way, you should strip whitespace when reading the restaurant name and the other fields:
page.css('span.list-item-title.restaurant-name').each do |line|
  name << line.text.strip
end
You can see my full code here: https://gist.github.com/vinhnglx/4eaeb2e8511dd1454f42
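To show how the status values line up with the other columns, here is a small sketch that works on plain strings instead of a parsed page. `status_for`, `build_rows`, and the sample data are hypothetical; the class attribute strings stand in for what line.attr("class") would return.

```ruby
# Decide open/closed from an element's class attribute string.
# Splitting on whitespace avoids matching "unavailable" inside another class name.
def status_for(class_attr)
  class_attr.to_s.split.include?('unavailable') ? 'closed' : 'open'
end

# Combine the parallel arrays into one stripped row per restaurant.
def build_rows(names, categories, statuses)
  names.zip(categories, statuses).map do |row|
    row.map { |value| value.to_s.strip }
  end
end

classes = ['restaurant--details', 'restaurant--details unavailable']
statuses = classes.map { |class_attr| status_for(class_attr) }
rows = build_rows([' Pizza Place ', 'Sushi Bar'], ['Italian', 'Japanese'], statuses)
```

Each row could then be appended to the CSV file. Note that the row written in the question's loop would also need the status value added, since the header already declares a Status column.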

How to get a full URL given a shortened one passed to Nokogiri?

I want to traverse some HTML documents with Nokogiri.
After getting the document object, I want the final URL that was used to fetch the document to be part of my JSON response.
url = "http://ow.ly/hh8ri"
doc = Nokogiri::HTML(open(url))
...
Nokogiri internally redirects it to http://www.mp.rs.gov.br/imprensa/noticias/id30979.html, but I want to have access to it.
I want to know if the "doc" object has access to some URL as attribute or something.
Does someone know a workaround?
By the way, I want the full URL because I'm traversing the HTML to find <img> tags and some have relative ones like: "/media/image/image.png", and then I adjust some using:
URI.join(url, relative_link_url).to_s
The image URL should be:
http://www.mp.rs.gov.br/media/imprensa/2013/01/30979_260_260__trytr.jpg
Instead of:
http://ow.ly/hh8ri/media/imprensa/2013/01/30979_260_260__trytr.jpg
EDIT: IDEA
class Scraper < Nokogiri::HTML::Document
  attr_accessor :url
  class << self
    def new(url)
      html = open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE)
      self.parse(html).tap do |d|
        url = URI.parse(url)
        response = Net::HTTP.new(url.host, url.port)
        head = response.start do |r|
          r.head url.path
        end
        d.url = head['location']
      end
    end
  end
end
Use Mechanize. The URLs will always be converted to absolute:
require 'mechanize'
agent = Mechanize.new
page = agent.get 'http://ow.ly/hh8ri'
page.images.map{|i| i.url.to_s}
#=> ["http://www.mp.rs.gov.br/images/imprensa/barra_area.gif", "http://www.mp.rs.gov.br/media/imprensa/2013/01/30979_260_260__trytr.jpg"]
Because your example is using OpenURI, that's the code to ask, not Nokogiri. Nokogiri has NO idea where the content came from.
OpenURI can tell you easily:
require 'open-uri'
starting_url = 'http://www.example.com'
final_uri = nil
puts "Starting URL: #{ starting_url }"
# Read the body inside the block; URI.open replaces Kernel#open for URLs on Ruby 3.0 and later
doc = URI.open(starting_url) do |io|
  final_uri = io.base_uri
  io.read
end
puts "Final URL: #{ final_uri.to_s }"
Which outputs:
Starting URL: http://www.example.com
Final URL: http://www.iana.org/domains/example
base_uri is documented in the OpenURI::Meta module.
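Once the final URL is known, relative image paths can be resolved against it with URI.join from the standard library. A minimal sketch, assuming the final URL has already been obtained (it is hard-coded here from the question); `absolutize` is a hypothetical helper name.

```ruby
require 'uri'

# Resolve links against the URL the document was finally served from.
# Links that are already absolute are returned unchanged by URI.join.
def absolutize(final_url, links)
  links.map { |link| URI.join(final_url, link).to_s }
end

final_url = 'http://www.mp.rs.gov.br/imprensa/noticias/id30979.html'
images = absolutize(final_url, [
  '/media/imprensa/2013/01/30979_260_260__trytr.jpg',
  'http://www.mp.rs.gov.br/images/imprensa/barra_area.gif'
])
```

Joining against the shortened URL instead would produce links on the shortener's host, which is why the post-redirect URL matters here.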
I had the exact same issue recently. What I did was create a class that inherits from Nokogiri::HTML::Document, override the new class method to parse the document, and then save the URL in an instance variable with an accessor:
require 'nokogiri'
require 'open-uri'
class Webpage < Nokogiri::HTML::Document
  attr_accessor :url
  class << self
    def new(url)
      html = open(url)
      self.parse(html).tap do |d|
        d.url = url
      end
    end
  end
end
Then you can just create a new Webpage, and it will have access to all the normal methods you would have with a Nokogiri::HTML::Document:
w = Webpage.new("http://www.google.com")
w.url
#=> "http://www.google.com"
w.at_css('title')
#=> #<Nokogiri::XML::Element:0x4952f78 name="title" children=[#<Nokogiri::XML::Text:0x4952cb2 "Google">]>
If you have some relative url that you got from an image tag, you can then make it absolute by passing the return value of the url accessor to URI.join:
relative_link_url = "/media/image/image.png"
=> "/media/image/image.png"
URI.join(w.url, relative_link_url).to_s
=> "http://www.google.com/media/image/image.png"
Hope that helps.
p.s. the title of this question is quite misleading. Something more along the lines of "Accessing URL of Nokogiri HTML document" would be clearer.
