Web scraping/export to CSV with Ruby

ruby n00b here in hope of some guidance. I am looking to scrape a website (600-odd names and links on one page) and output to CSV. The scraping itself works fine (the output correctly fills the terminal as the script runs), but I can't get the CSV to populate. The code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'
url = "http://www.example.com/page/"
page = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer opens URLs as of Ruby 3.0
page.css('.item').each do |item|
name = item.at_css('a').text
link = item.at_css('a')[:href]
foo = puts "#{name}"
bar = "#{link}"
CSV.open("file.csv", "wb") do |csv|
csv << [foo, bar]
end
end
puts "upload complete!"
...replacing the csv << [foo, bar] with csv << [name, link] just puts the final iteration into the CSV. I feel there's something basic I am missing here. Thanks for reading.

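The problem is that you're calling CSV.open for every single item. The "wb" mode truncates the file each time it is opened, so every iteration overwrites the previous row, and at the end you're left with only the last item in the CSV file. (Separately, puts returns nil, so foo = puts "#{name}" leaves foo as nil; write name directly instead.)
Move the CSV.open call before page.css('.item').each and it should work: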
CSV.open("file.csv", "wb") do |csv|
page.css('.item').each do |item|
name = item.at_css('a').text
link = item.at_css('a')[:href]
csv << [name, link]
end
end
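To see the difference without touching the network, here is a minimal sketch that writes the same rows both ways into a temporary directory. The items array is made-up stand-in data for the scraped names and links, and the file names bad.csv and good.csv are only illustrative.

```ruby
require 'csv'
require 'tmpdir'

# Stand-in data for the scraped (name, link) pairs; purely hypothetical.
items = [["Alice", "/a"], ["Bob", "/b"], ["Carol", "/c"]]

overwritten, complete = Dir.mktmpdir do |dir|
  bad  = File.join(dir, "bad.csv")
  good = File.join(dir, "good.csv")

  # Opening with "wb" inside the loop truncates the file on every pass.
  items.each do |name, link|
    CSV.open(bad, "wb") { |csv| csv << [name, link] }
  end

  # Opening once and looping inside the block keeps every row.
  CSV.open(good, "wb") do |csv|
    items.each { |name, link| csv << [name, link] }
  end

  [CSV.read(bad), CSV.read(good)]
end

puts "rows when opened per item: #{overwritten.length}"
puts "rows when opened once:     #{complete.length}"
```

The first file ends up holding only the last pair, which matches the behaviour described in the question.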

Related

Create timestamped labels on csv files (ruby code)

I am running a transaction download script through Ruby. I was wondering if it is possible to label each .csv it creates with the current date/time the script was run. Below is the end of the script.
CSV.open("transaction_report.csv", "w") do |csv|
csv << header_row
search_results.each do |transaction|
transaction_details_row = header_row.map{ |attribute| transaction.send(attribute) }
csv << transaction_details_row
end
end
Like this?
CSV.open("transaction_report-#{Time.now}.csv", "w") do |csv|
csv << header_row
search_results.each do |transaction|
transaction_details_row = header_row.map{ |attribute| transaction.send(attribute) }
csv << transaction_details_row
end
end
This just appends the time of generation to the file name. For example:
"transaction_report-#{Time.now}.csv"
# => "transaction_report-2019-10-10 16:09:07 +0100.csv"
If you want to avoid spaces in the file name, you can sub these out like so:
"transaction_report-#{Time.now.to_s.gsub(/\s/, '-')}.csv"
# => "transaction_report-2019-10-10-16:09:40-+0100.csv"
Is that what you're after? It sounds right based on the question, though happy to update if you're able to correct me :)
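One thing to watch with the names above: they still contain colons, which are not allowed in file names on Windows. A possible alternative, sketched here under the assumption that a compact sortable stamp is acceptable, is to format the time with strftime. The helper name timestamped_name is hypothetical, not part of the original script.

```ruby
# Build a file name with a compact, filesystem-friendly timestamp.
# `timestamped_name` is a hypothetical helper introduced for illustration.
def timestamped_name(prefix, time = Time.now)
  "#{prefix}-#{time.strftime('%Y%m%d-%H%M%S')}.csv"
end

# A fixed time is used here so the result is predictable.
example = timestamped_name("transaction_report", Time.new(2019, 10, 10, 16, 9, 7))
puts example
# => transaction_report-20191010-160907.csv
```

In the script itself this would be used as CSV.open(timestamped_name("transaction_report"), "w"), so every run produces its own file.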

Data is overwriting instead of appending to CSV

I am using a rake task and the csv module to loop through one csv, extract and alter the data I need and then append each new row of data to a second csv. However each row seems to be overwriting/replacing the previous row in the new csv instead of appending it as a new row after it. I've looked at the documentation and googled but can't find any examples of appending rows to the csv differently.
require 'csv'
namespace :replace do
desc "replace variant id with variant sku"
task :sku => :environment do
file="db/master-list-3-28.csv"
CSV.foreach(file) do |row|
msku, namespace, key, valueType, value = row
valueArray = value.split('|')
newValueString = ""
valueArray.each_with_index do |v, index|
recArray = v.split('*')
handle = recArray[0]
vid = recArray[1]
newValueString << handle
newValueString << "*"
variant = ShopifyAPI::Variant.find(vid)
newValueString << variant.sku
end
#end of value save the newvaluestring to new csv
newFile = Rails.root.join('lib/assets', 'newFile.csv')
CSV.open(newFile, "wb") do |csv|
csv << [newValueString]
end
end
end
end
Your mode when opening the file is wrong: "wb" truncates the file on every open. Use "a+" (or simply "a") to append instead. See details in the docs: http://ruby-doc.org/core-2.2.4/IO.html#method-c-new
Also, you might want to open that file just once rather than once for every row.
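A rough sketch of both suggestions together: the output file is opened once in append mode and the input is read inside that block. The ShopifyAPI lookup is replaced by a hypothetical in-memory hash (SKU_BY_VARIANT_ID) so the example runs on its own, and joining the rewritten pairs with "|" is an assumption about the desired output format, since the original code concatenates them without a separator.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical stand-in for ShopifyAPI::Variant.find(vid).sku
SKU_BY_VARIANT_ID = { "101" => "SKU-A", "102" => "SKU-B" }.freeze

# Turn "handle*variant_id|handle*variant_id" into "handle*sku|handle*sku".
def replace_ids_with_skus(value)
  value.split('|').map do |pair|
    handle, vid = pair.split('*')
    "#{handle}*#{SKU_BY_VARIANT_ID.fetch(vid)}"
  end.join('|')
end

rows = Dir.mktmpdir do |dir|
  source = File.join(dir, "master.csv")
  target = File.join(dir, "newFile.csv")

  # Sample input in the same five-column shape as the question.
  CSV.open(source, "w") do |csv|
    csv << ["m1", "ns", "key", "string", "shirt*101|hat*102"]
    csv << ["m2", "ns", "key", "string", "hat*102"]
  end

  # Open the output once, in append mode, and read the input inside the block.
  CSV.open(target, "a") do |out|
    CSV.foreach(source) do |_msku, _namespace, _key, _value_type, value|
      out << [replace_ids_with_skus(value)]
    end
  end

  CSV.read(target)
end

rows.each { |row| puts row.first }
```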

How do I automate running a Ruby script?

I have written a ruby script (code below) to scrape from Deliveroo.co.uk.
Right now I run it manually by going to terminal and typing in 'ruby ....rb'.
How do I automate things so that this script runs automatically every hour?
Also, how do I save the output from each run without overwriting the previous output?
Code is below.. thank you.
require 'open-uri'
require 'nokogiri'
require 'csv'
# Store URL to be scraped
url = "https://deliveroo.co.uk/restaurants/london/maida-vale?postcode=W92DE"
# Parse the page with Nokogiri
page = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer opens URLs as of Ruby 3.0
# Collect each scraped field into its own array
name =[]
page.css('span.list-item-title.restaurant-name').each do |line|
name << line.text.strip
end
category = []
page.css('span.restaurant-detail.detail-cat').each do |line|
category << line.text.strip
end
delivery_time = []
page.css('span.restaurant-detail.detail-time').each do |line|
delivery_time << line.text.strip
end
distance = []
page.css('span.restaurant-detail.detail-distance').each do |line|
distance << line.text.strip
end
status = []
page.css('li.restaurant--details').each do |line|
if line.attr("class").include? "unavailable"
sts = "closed"
else
sts = "open"
end
status << sts
end
# Write data to CSV file
CSV.open("deliveroo.csv", "w") do |file|
file << ["Name", "Category", "Delivery Time", "Distance", "Status"]
name.length.times do |i|
file << [name[i], category[i], delivery_time[i], distance[i], status[i]]
end
end
There are two questions here; I'll try to answer them below.
How to run periodically:
What you are looking for is a cron job; there are many resources out there for creating one.
Look into cron or gems like whenever / clockwork.
Save output between multiple runs: In order to save the output you could just write to a file directly in ruby, very similar to what you are doing right now.
The way you're saving it right now is:
CSV.open("deliveroo.csv", "w") do |file|
The "w" mode opens the file and overwrites any content already in it. Try "a" (append) instead:
CSV.open("deliveroo.csv", "a") do |file|
Read more here about opening files in different modes: File opening mode in Ruby
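One side effect of switching to "a" is that the header row would be written again on every run. A minimal sketch of one way around that, assuming the header should appear only once, is to check whether the file already has content before writing it. The helper append_run and the two-column HEADER are hypothetical and trimmed down from the original five columns.

```ruby
require 'csv'
require 'tmpdir'

HEADER = ["Name", "Status"].freeze

# Append rows, writing the header only when the file is new or empty.
# `append_run` is a hypothetical helper introduced for illustration.
def append_run(path, rows)
  needs_header = !File.exist?(path) || File.zero?(path)
  CSV.open(path, "a") do |csv|
    csv << HEADER if needs_header
    rows.each { |row| csv << row }
  end
end

saved = Dir.mktmpdir do |dir|
  path = File.join(dir, "deliveroo.csv")
  append_run(path, [["Pizza Place", "open"]])   # first run
  append_run(path, [["Pizza Place", "closed"]]) # a later run
  CSV.read(path)
end

saved.each { |row| puts row.join(",") }
```

Adding a timestamp column to each row would make it possible to tell the hourly runs apart within the one file.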

How do I filter my results when scraping a website using Nokogiri gem?

I am trying to scrape list of restaurants for my zip code from Deliveroo.co.uk
I need to add a way to figure out whether a restaurant is open or closed. On the website it's very clear, but I need to update my code to reflect this.
How do I go about doing this? I need to create something like a 'status' variable and then set each restaurant to 'open' or 'closed'.
Here is the website I'm trying to scrape from: https://deliveroo.co.uk/restaurants/london/maida-vale?postcode=W92DE&time=1800&day=today
And my code is below.
thanks.
require 'open-uri'
require 'nokogiri'
require 'csv'
# Store URL to be scraped
url = "https://deliveroo.co.uk/restaurants/london/maida-vale?postcode=W92DE"
# Parse the page with Nokogiri
page = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer opens URLs as of Ruby 3.0
# Collect each scraped field into its own array
name =[]
page.css('span.list-item-title.restaurant-name').each do |line|
name << line.text
end
category = []
page.css('span.restaurant-detail.detail-cat').each do |line|
category << line.text
end
delivery_time = []
page.css('span.restaurant-detail.detail-time').each do |line|
delivery_time << line.text
end
distance = []
page.css('span.restaurant-detail.detail-distance').each do |line|
distance << line.text
end
status = []
# Write data to CSV file
CSV.open("deliveroo.csv", "w") do |file|
file << ["Name", "Category", "Delivery Time", "Distance", "Status"]
name.length.times do |i|
file << [name[i], category[i], delivery_time[i], distance[i]]
end
end
We need to check whether each li.restaurant--details element has the class unavailable: if it does, the restaurant is closed; otherwise it is open.
status = []
page.css('li.restaurant--details').each do |line|
if line.attr("class").include? "unavailable"
sts = "closed"
else
sts = "open"
end
status << sts
end
By the way, you should strip the surrounding whitespace when reading the restaurant name and the other fields:
page.css('span.list-item-title.restaurant-name').each do |line|
name << line.text.strip
end
You can find my full code here: https://gist.github.com/vinhnglx/4eaeb2e8511dd1454f42
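Once the status array exists, it still has to reach the CSV row. The following sketch shows one way to do that with Array#zip, using made-up sample values instead of a live page. It also checks for unavailable as a whole class token rather than a substring, which is an assumption about the markup and not something the page is known to require. The helper status_for is hypothetical.

```ruby
require 'csv'

# Map a class attribute string to "open"/"closed".
# Splitting on whitespace matches "unavailable" only as a whole class name.
def status_for(class_attr)
  class_attr.to_s.split.include?("unavailable") ? "closed" : "open"
end

# Made-up sample values standing in for the scraped arrays.
names      = ["Pizza Place", "Sushi Bar"]
categories = ["Italian", "Japanese"]
classes    = ["restaurant--details", "restaurant--details unavailable"]

statuses = classes.map { |c| status_for(c) }

# zip lines the parallel arrays up into one row per restaurant.
csv_text = CSV.generate do |csv|
  csv << ["Name", "Category", "Status"]
  names.zip(categories, statuses).each { |row| csv << row }
end

puts csv_text
```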

Ruby csv - delete row if column is empty

I'm trying to delete rows from a CSV file with Ruby, without success.
How can I delete every row where the "newprice" column is empty?
require 'csv'
guests = CSV.table('new.csv', headers:true)
guests.each do |guest_row|
p guests.to_s
end
price = CSV.foreach('new.csv', headers:true) do |row|
puts row['newprice']
end
guests.delete_if('newprice' = '')
File.open('new_output.csv', 'w') do |f|
f.write(guests.to_csv)
end
Thanks!
Almost there. The table method changes the headers to symbols, and delete_if takes a block, the same way as each and open.
require 'csv'
guests = CSV.table('test.csv', headers:true)
guests.each do |guest_row|
p guest_row.to_s
end
guests.delete_if do |row|
row[:newprice].nil?
end
File.open('test1.csv', 'w') do |f|
f.write(guests.to_csv)
end
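For a version that can be tried without any files on disk, here is a small sketch that parses an inline string with the same options CSV.table applies (symbol headers, numeric converters). The sample rows are invented, and treating a whitespace-only field as empty is an extra assumption beyond the nil check above.

```ruby
require 'csv'

# Invented sample data; "pear" has an empty price, "fig" a whitespace-only one.
data = <<~ROWS
  name,newprice
  apple,1.5
  pear,
  fig," "
  plum,2
ROWS

# These options mirror what CSV.table sets up.
table = CSV.parse(data, headers: true, header_converters: :symbol, converters: :numeric)

# An unquoted empty field is parsed as nil; to_s.strip also catches blank strings.
table.delete_if { |row| row[:newprice].to_s.strip.empty? }

output = table.to_csv
puts output
```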
