How do I filter my results when scraping a website using Nokogiri gem? - ruby-on-rails

I am trying to scrape a list of restaurants for my zip code from Deliveroo.co.uk.
I need a way to figure out whether a restaurant is open or closed... from the website it's very clear, but I need to update my code to reflect this.
How do I go about doing this? I need to create something like a 'status' variable and then set each restaurant to 'open' or 'closed'.
Here is the website I'm trying to scrape from: https://deliveroo.co.uk/restaurants/london/maida-vale?postcode=W92DE&time=1800&day=today
And my code is below.
Thanks.
require 'open-uri'
require 'nokogiri'
require 'csv'

# Store URL to be scraped
url = "https://deliveroo.co.uk/restaurants/london/maida-vale?postcode=W92DE"

# Parse the page with Nokogiri
page = Nokogiri::HTML(open(url))

# Collect the details for each restaurant
name = []
page.css('span.list-item-title.restaurant-name').each do |line|
  name << line.text
end

category = []
page.css('span.restaurant-detail.detail-cat').each do |line|
  category << line.text
end

delivery_time = []
page.css('span.restaurant-detail.detail-time').each do |line|
  delivery_time << line.text
end

distance = []
page.css('span.restaurant-detail.detail-distance').each do |line|
  distance << line.text
end

status = []

# Write data to CSV file
CSV.open("deliveroo.csv", "w") do |file|
  file << ["Name", "Category", "Delivery Time", "Distance", "Status"]
  name.length.times do |i|
    file << [name[i], category[i], delivery_time[i], distance[i]]
  end
end

We need to check whether each li.restaurant--details element has the unavailable class: if it does, the restaurant is closed; otherwise it is open.
status = []
page.css('li.restaurant--details').each do |line|
  if line.attr("class").include? "unavailable"
    sts = "closed"
  else
    sts = "open"
  end
  status << sts
end
By the way, you should strip the whitespace when collecting the restaurant name and the other fields:
page.css('span.list-item-title.restaurant-name').each do |line|
  name << line.text.strip
end
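With status collected, the CSV write can then include it as a fifth column, matching the "Status" header you already have (a minimal sketch):
CSV.open("deliveroo.csv", "w") do |file|
  file << ["Name", "Category", "Delivery Time", "Distance", "Status"]
  name.length.times do |i|
    file << [name[i], category[i], delivery_time[i], distance[i], status[i]]
  end
end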
You can refer my code at here: https://gist.github.com/vinhnglx/4eaeb2e8511dd1454f42

Related

How to iterate pages of a site from Rails and Nokogiri

I'm trying to build an informational site that shows the visitor all deals from a specific merchant on that specific page. I managed to scrape the headlines from the first page and to build an array of page URLs to iterate over.
My code should feed each URL to the scraper, list that page's items, move on to the next page, scrape its headlines, append them to the list built so far, and so on.
My controller looks like this:
class ApplicationController < ActionController::Base
  # Prevent CSRF attacks by raising an exception.
  # For APIs, you may want to use :null_session instead.
  protect_from_forgery with: :exception

  class Entry
    def initialize(title)
      @title = title
    end

    attr_reader :title
  end

  def scrape_mydealz
    require 'open-uri'
    urlarray = Array.new

    # --------------------------------------------------------------- build URLs
    pagination = '&page=1'
    count = [1, 2]
    count.each do |i|
      base_url = "https://www.mydealz.de/search?q=media+markt"
      pagination = "&page=#{i}"
      combination = base_url + pagination
      urlarray << combination
    end
    # --------------------------------------------------------------- / build URLs

    urlarray.each do |test|
      doc = Nokogiri::HTML(open(test))
      entries = doc.css('article.thread')
      @entriesArray = []
      entries.each do |entry|
        title = entry.css('a.vwo-thread-title').text
        @entriesArray << Entry.new(title)
      end
    end

    render template: 'scrape_mydealz'
  end
end
With this code, the loop iterates through to page 2, but only the scrape results from page 2 are displayed.
The result can be seen here:
https://mm-scraper-neevoo.c9users.io/
You reinitialize @entriesArray in each iteration, discarding what the previous page collected. The easiest fix is to move the initialization outside the loop:
@entriesArray = []
urlarray.each do |test|
  doc = Nokogiri::HTML(open(test))
  entries = doc.css('article.thread')
  entries.each do |entry|
    title = entry.css('a.vwo-thread-title').text
    @entriesArray << Entry.new(title)
  end
end
This is untested but it's the general idea I'd use to scan a site with two pages and accumulate the titles:
require 'open-uri'

BASE_URL = 'https://www.mydealz.de/search?q=media+markt&page=1'

def scrape_mydealz
  urls = []
  2.times do |i|
    url = URI.parse(BASE_URL)
    base_query = URI.decode_www_form(url.query).to_h
    base_query['page'] = 1 + i
    url.query = URI.encode_www_form(base_query)
    urls << url
  end

  @entries_array = []
  urls.each do |url|
    doc = Nokogiri::HTML(open(url))
    doc.css('article.thread').each do |entry|
      @entries_array << Entry.new(entry.at('a.vwo-thread-title').text)
    end
  end

  render template: 'scrape_mydealz'
end
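Building the query string with URI.decode_www_form / URI.encode_www_form, rather than concatenating strings, keeps the URL valid even if the query ever contains characters that need escaping.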
Be careful using text with search, css or xpath:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.search('p').text # => "foobar"
doc.search('p').map(&:text) # => ["foo", "bar"]
Notice that the first result concatenated the contents of the <p> tags; once they've been joined like that, it's usually impossible to take them apart again.
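If you only want the first matching node, at (or at_css / at_xpath) returns a single node rather than a NodeSet, so text behaves as you'd expect:
doc.at('p').text     # => "foo"
doc.at_css('p').text # => "foo"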

Data is overwriting instead of appending to CSV

I am using a rake task and the csv module to loop through one csv, extract and alter the data I need, and then append each new row of data to a second csv. However, each row seems to be overwriting the previous row in the new csv instead of being appended after it. I've looked at the documentation and googled, but I can't find any examples of appending rows to a csv differently.
require 'csv'

namespace :replace do
  desc "replace variant id with variant sku"
  task :sku => :environment do
    file = "db/master-list-3-28.csv"
    CSV.foreach(file) do |row|
      msku, namespace, key, valueType, value = row
      valueArray = value.split('|')
      newValueString = ""
      valueArray.each_with_index do |v, index|
        recArray = v.split('*')
        handle = recArray[0]
        vid = recArray[1]
        newValueString << handle
        newValueString << "*"
        variant = ShopifyAPI::Variant.find(vid)
        newValueString << variant.sku
      end
      # end of value: save the newValueString to the new csv
      newFile = Rails.root.join('lib/assets', 'newFile.csv')
      CSV.open(newFile, "wb") do |csv|
        csv << [newValueString]
      end
    end
  end
end
Your mode when opening the file is wrong: "wb" truncates the file each time, so every row overwrites the last. It should be "a+" (append). See the details in the docs: http://ruby-doc.org/core-2.2.4/IO.html#method-c-new
Also, you should open that file just once, not once per input line.
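A minimal sketch of that restructuring, opening the output file once outside the loop (the same shape as the code in the next question below):
newFile = Rails.root.join('lib/assets', 'newFile.csv')
CSV.open(newFile, "a+") do |csv|
  CSV.foreach(file) do |row|
    # ... build newValueString from the row as before ...
    csv << [newValueString]
  end
end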

Rake task handle 404

I am using a rake task to take data from one csv, call the Shopify API using that data, and save the response to another CSV. The problem is I have no error handling in place, so if the Shopify API cannot find the resource I provided, the whole task aborts. What is the best way to handle the error so that if the resource is not found in Shopify, the task simply skips it and proceeds to the next row?
The line calling the shopify API in the code below is:
variant = ShopifyAPI::Variant.find(vid)
namespace :replace do
  desc "replace variant id with variant sku"
  task :sku => :environment do
    file = "db/master-list-3-28.csv"
    newFile = Rails.root.join('lib/assets', 'newFile.csv')
    CSV.open(newFile, "a+") do |csv|
      CSV.foreach(file) do |row|
        msku, namespace, key, valueType, value = row
        valueArray = value.split('|')
        newValueString = ""
        valueArray.each_with_index do |v, index|
          recArray = v.split('*')
          handle = recArray[0]
          vid = recArray[1]
          newValueString << handle
          newValueString << "*"
          # use api call to retrieve variant sku using handle and vid,
          # replace vid with sku and save to csv
          variant = ShopifyAPI::Variant.find(vid)
          sleep 1
          # puts variant.sku
          newValueString << variant.sku
          if index < 2
            newValueString << "|"
          end
        end
        # end of value: save the newValueString to the new csv
        csv << [newValueString]
      end
    end
  end
end
Here's a simple way to get it done:
begin
  variant = ShopifyAPI::Variant.find(vid)
rescue
  next
end
If an exception is raised, the code in the rescue clause runs; here next skips straight to the next element of the loop.
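A bare rescue catches any StandardError, which can mask unrelated bugs. Assuming the shopify_api gem is built on ActiveResource, where a missing resource (404) raises ActiveResource::ResourceNotFound, you can rescue just that case:
begin
  variant = ShopifyAPI::Variant.find(vid)
rescue ActiveResource::ResourceNotFound
  next # variant not found in Shopify; skip this row
end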

How do I automate running a Ruby script?

I have written a ruby script (code below) to scrape from Deliveroo.co.uk.
Right now I run it manually by going to the terminal and typing in 'ruby ....rb'.
How do I automate things so that this script runs automatically every hour?
Also, how do I save the output from each run without overwriting the previous output?
Code is below. Thank you.
require 'open-uri'
require 'nokogiri'
require 'csv'

# Store URL to be scraped
url = "https://deliveroo.co.uk/restaurants/london/maida-vale?postcode=W92DE"

# Parse the page with Nokogiri
page = Nokogiri::HTML(open(url))

# Collect the details for each restaurant
name = []
page.css('span.list-item-title.restaurant-name').each do |line|
  name << line.text.strip
end

category = []
page.css('span.restaurant-detail.detail-cat').each do |line|
  category << line.text.strip
end

delivery_time = []
page.css('span.restaurant-detail.detail-time').each do |line|
  delivery_time << line.text.strip
end

distance = []
page.css('span.restaurant-detail.detail-distance').each do |line|
  distance << line.text.strip
end

status = []
page.css('li.restaurant--details').each do |line|
  if line.attr("class").include? "unavailable"
    sts = "closed"
  else
    sts = "open"
  end
  status << sts
end

# Write data to CSV file
CSV.open("deliveroo.csv", "w") do |file|
  file << ["Name", "Category", "Delivery Time", "Distance", "Status"]
  name.length.times do |i|
    file << [name[i], category[i], delivery_time[i], distance[i], status[i]]
  end
end
There are two questions here, so I'll answer them in turn.
How to run the script periodically:
What you are looking for is a cron job; there are many resources out there for creating one.
Look into cron itself, or gems like whenever / clockwork.
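For example, with the whenever gem, a config/schedule.rb like the one below generates the cron entry for you (a sketch; the script path is a placeholder):
# config/schedule.rb (whenever gem)
every 1.hour do
  command "ruby /path/to/deliveroo_scraper.rb"
end
Running whenever --update-crontab then installs the schedule into your crontab.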
How to save output between runs: you can keep writing to a file directly in Ruby, very much like you are doing now.
The way you're saving it at the moment is:
CSV.open("deliveroo.csv", "w") do |file|
The "w" mode opens the file and overwrites any content already in it; try "a" (append) instead:
CSV.open("deliveroo.csv", "a") do |file|
Read more here about opening files in different modes: File opening mode in Ruby
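One thing to keep in mind: with "a" the header row is appended again on every run. If you'd rather keep each run's output separate, a timestamped filename per run is an alternative (a sketch, reusing the CSV-writing code from the script above):
# One file per run, e.g. deliveroo-20160401-1800.csv
filename = "deliveroo-#{Time.now.strftime('%Y%m%d-%H%M')}.csv"
CSV.open(filename, "w") do |file|
  file << ["Name", "Category", "Delivery Time", "Distance", "Status"]
  name.length.times do |i|
    file << [name[i], category[i], delivery_time[i], distance[i], status[i]]
  end
end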

web scraping/export to CSV with Ruby

ruby n00b here in hope of some guidance. I am looking to scrape a website (600-odd names and links on one page) and output to CSV. The scraping itself works fine (the output correctly fills the terminal as the script runs), but I can't get the CSV to populate. The code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'

url = "http://www.example.com/page/"
page = Nokogiri::HTML(open(url))

page.css('.item').each do |item|
  name = item.at_css('a').text
  link = item.at_css('a')[:href]
  foo = puts "#{name}"
  bar = "#{link}"
  CSV.open("file.csv", "wb") do |csv|
    csv << [foo, bar]
  end
end

puts "upload complete!"
...replacing the csv << [foo, bar] with csv << [name, link] just puts the final iteration into the CSV. I feel there's something basic I am missing here. Thanks for reading.
The problem is that you're calling CSV.open for every single item, and "wb" truncates the file each time, so each iteration overwrites the previous one and you're left with only the last item in the csv file. (Also note that foo = puts "#{name}" assigns nil to foo, because puts returns nil; use name and link directly.)
Move the CSV.open call outside the page.css('.item').each loop and it should work:
CSV.open("file.csv", "wb") do |csv|
page.css('.item').each do |item|
name = item.at_css('a').text
link = item.at_css('a')[:href]
csv << [name, link]
end
end