Web Scraping using Ruby - If statement - ruby-on-rails

I have built a web scraper. I need it to scrape the prices and bedrooms of a given neighborhood. Sometimes the span.first_detail_cell will return Furnished and the rest of the time it will return the price. I need to write something that can overlook the span.first_detail_cell if it is furnished and look in the next cell for the price. I think I need to write an if statement, but not sure of the parameters. Any help would be great!
require 'open-uri'
require 'nokogiri'
require 'csv'
url = "https://streeteasy.com/for-rent/bushwick"
page = Nokogiri::HTML(URI.open(url)) # URI.open: plain open no longer fetches URLs on Ruby 3+
page_numbers = []
page.css("nav.pagination span.page a").each do |line|
  page_numbers << line.text.to_i # compare page numbers as integers, not strings
end
max_page = page_numbers.max
beds = []
price = []
max_page.to_i.times do |i|
  url = "https://streeteasy.com/for-rent/bushwick?page=#{i+1}"
  page = Nokogiri::HTML(URI.open(url))
  page.css('span.first_detail_cell').each do |line|
    beds << line.text
  end
  page.css('span.price').each do |line|
    price << line.text
  end
end
CSV.open("bushwick_rentals.csv", "w") do |file|
  file << ["Beds", "Price"]
  beds.length.times do |i|
    file << [beds[i], price[i]]
  end
end

page.css('span.first_detail_cell').each do |line|
  if line.text.include?("Furnished")
    # do something here
  else
    beds << line.text
  end
end
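One way to fill in that branch, as a minimal sketch rather than a definitive answer: it assumes you can collect the detail cells of a single listing, in page order, as an array of strings. The helper name first_real_detail and the sample values are hypothetical. With Nokogiri, the neighbouring cell of a matched node can usually be reached with line.next_element, but whether that is the right cell depends on the page markup, which is not shown in the question.

```ruby
# Returns the first cell text that is not the "Furnished" flag, or nil if none.
# (Hypothetical helper; the name is not from the original code.)
def first_real_detail(cell_texts)
  cell_texts.map(&:strip).find { |text| !text.casecmp?("Furnished") }
end

# Sample cell texts for two listings (made-up values, not scraped data).
listings = [
  ["Furnished", "2 beds", "1 bath"],
  ["1 bed", "1 bath"]
]

beds = listings.map { |cells| first_real_detail(cells) }
# beds is now ["2 beds", "1 bed"]
```

Keeping the skip logic in a small method that only sees strings makes it easy to check without hitting the site.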

Related

CSV in RUBY custom string

I have one field, delivery_time, whose values come from an array that includes:
DELIVERY_TIME = [
  I18n.t("activerecord.attributes.order.none_delivery_time"),
  "09:00~12:00",
  "12:00~14:00",
  "14:00~16:00",
  "16:00~18:00",
  "18:00~20:00",
  "19:00~21:00",
  "20:00~21:00",
].freeze
When I download the CSV, the value appears in the form
"09:00~12:00"
but I want it to take this form instead:
"0912"
How can I customize it?
My code:
def perform
  CSV.generate(headers: true) do |csv|
    csv << attributes
    orders.each do |order|
      csv << create_row(order)
    end
  end
end
def create_row(order)
  row << order.delivery_time
end
AFAIU, you need to modify DELIVERY_TIME to fit your format. CSV is absolutely out of scope here. So to transform values, one should split by ~ and take the hour from the result.
DELIVERY_TIME = [
  "09:00~12:00",
  "12:00~14:00",
  "14:00~16:00",
  "16:00~18:00",
  "18:00~20:00",
  "19:00~21:00",
  "20:00~21:00",
].freeze
DELIVERY_TIME.map { |s| s.split('~').map { |s| s[0...2] }.join }
#⇒ ["0912", "1214", "1416", "1618", "1820", "1921", "2021"]
A safer method would be to use DateTime.parse for this:
require 'time'
DELIVERY_TIME.map do |s|
  s.split('~').map { |s| DateTime.parse(s).strftime("%H") }.join
end
#⇒ ["0912", "1214", "1416", "1618", "1820", "1921", "2021"]
It's not real clear what you're asking, but I'd probably start with something like this:
"09:00~12:00".scan(/\d{2}/).values_at(0, 2).join # => "0912"
Using that in some code:
"09:00~12:00".scan(/\d{2}/).values_at(0, 2).join # => "0912"
DELIVERY_TIME = [
  'blah',
  "09:00~12:00",
  "12:00~14:00",
  "14:00~16:00",
  "16:00~18:00",
  "18:00~20:00",
  "19:00~21:00",
  "20:00~21:00",
].freeze
ary = [] << DELIVERY_TIME.first
ary += DELIVERY_TIME[1..-1].map { |i|
  i.scan(/\d{2}/).values_at(0, 2).join
}
# => ["blah", "0912", "1214", "1416", "1618", "1820", "1921", "2021"]
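To tie the transformation back to the CSV export in the question, here is a small sketch. The helper compact_delivery_time and the Order struct are hypothetical stand-ins (the real model is not shown), and it assumes any value that does not look like "HH:MM~HH:MM", such as the translated "none" label, should pass through unchanged.

```ruby
# Hypothetical helper: "09:00~12:00" becomes "0912";
# anything that does not match the HH:MM~HH:MM shape is returned as-is.
def compact_delivery_time(value)
  match = value.to_s.match(/\A(\d{2}):\d{2}~(\d{2}):\d{2}\z/)
  match ? match[1] + match[2] : value
end

# Stand-in for the real order model (assumption for this sketch).
Order = Struct.new(:delivery_time)

# Builds one CSV row as an array of fields.
def create_row(order)
  [compact_delivery_time(order.delivery_time)]
end

create_row(Order.new("09:00~12:00")) # => ["0912"]
create_row(Order.new("none"))        # => ["none"]
```

Doing the conversion inside create_row keeps DELIVERY_TIME itself untouched, so forms and validations that rely on the original strings keep working.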

How to scrape multiple pages using nokogiri and how to scrape fast with rails

I am trying to scrape certain elements from 99 pages of a web page. I cannot for the life of me figure out how to do it.
Here is my code:
require 'open-uri'
require 'nokogiri'
@title = []
html_content = URI.open("https://www.imdb.com/list/ls057823854/?sort=list_order,asc&st_dt=&mode=detail&page=1").read
doc = Nokogiri::HTML(html_content)
doc.search(".lister-item-header/a").each do |title|
  @title << title.text.strip
end
If you want to collect all titles, here is the scraper code.
require 'open-uri'
require 'nokogiri'
require 'json'
@title = []
url = "https://www.imdb.com/list/ls057823854/?sort=list_order,asc&st_dt=&mode=detail&page="
html_content = URI.open(url + "1").read
doc = Nokogiri::HTML(html_content)
# Total number of items, parsed from the text after "of" in the pagination range
max = doc.search(".pagination-range").first.text.split("of")[1].gsub(",", "").strip.to_i
# 100 items per page; round up so a partial last page is still fetched
max = (max / 100.0).ceil
doc.search(".lister-item-header/a").each do |title|
  @title << title.text.strip
end
(2..max).each do |i|
  html_content = URI.open(url + i.to_s).read
  doc = Nokogiri::HTML(html_content)
  doc.search(".lister-item-header/a").each do |title|
    @title << title.text.strip
  end
  sleep(1) # pause between requests to be polite to the server
end
File.open("imdb-titles.json", "w") do |f|
  f.write(JSON.pretty_generate(@title))
end
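The page-count arithmetic is the part most likely to go wrong, so here it is isolated as a sketch. It assumes the pagination text ends in "of N" with an optional thousands separator, which is what the scraper code above parses; the sample strings and the helper name page_count are made up for illustration.

```ruby
# Hypothetical helper: derives the number of pages from pagination text
# such as "1 - 100 of 9,876", given a fixed number of items per page.
def page_count(range_text, per_page = 100)
  total = range_text.split("of").last.delete(",").strip.to_i
  (total.to_f / per_page).ceil # round up so a partial last page counts
end

page_count("1 - 100 of 9,876") # => 99
page_count("1 - 100 of 200")   # => 2
page_count("1 - 100 of 201")   # => 3
```

Rounding up with ceil avoids requesting an extra, empty page when the total is an exact multiple of the page size.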

ruby nokogiri export csv columns

I want to export a CSV with three columns, one for each kind of data, but the result is not what I want: all of my data ends up in a single column. Please help me; what should I do?
require 'nokogiri'
require 'csv'
page = Nokogiri::HTML(open("index.html"))
fullName = page.css('li._5i_q').css("a[data-gt]").children.map { |name| name.text }
shortURL = page.css('li._5i_q').css("._5j0e a[data-hovercard]")
myID = shortURL.map { |element|
  element["data-hovercard"][/id=([^&]*)/].gsub('id=', '')
}
messenger = shortURL.map { |element|
  element["data-hovercard"][/id=([^&]*)/].gsub('id=', '') + "@gmail.com"
}
attributes = %w{ID FullName Messenger}
CSV.open('chatId.csv', 'w') do |csv|
  csv << attributes
  myID.each do |x|
    csv << [x]
  end
  fullName.each do |y|
    csv << [y]
  end
  messenger.each do |z|
    csv << [z]
  end
end
That's all my code.
You have to write row by row when exporting data to CSV. Therefore, build an array of [x, y, z] rows (zip does this) and append each row to the CSV. For example:
data = myID.zip(fullName, messenger)
CSV.open('chatId.csv', 'w') do |csv|
  csv << attributes
  data.each do |row|
    csv << row # each row is already an array of three fields
  end
end
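To see what the row-by-row approach produces, here is a self-contained sketch that uses made-up sample arrays in place of the scraped ones and writes to a string with CSV.generate instead of a file. The variable names and values are illustrative assumptions only.

```ruby
require 'csv'

# Sample column data (made-up values standing in for the scraped arrays).
ids       = %w[101 102]
names     = ["Ann Lee", "Bob Ray"]
addresses = ids.map { |id| "#{id}@example.com" }

# zip pairs the i-th element of each array into one three-field row.
rows = ids.zip(names, addresses)

csv_text = CSV.generate do |csv|
  csv << %w[ID FullName Messenger]
  rows.each { |row| csv << row }
end

puts csv_text
# ID,FullName,Messenger
# 101,Ann Lee,101@example.com
# 102,Bob Ray,102@example.com
```

Note that zip follows the length of the array it is called on; if another array is shorter, the missing fields come out as nil (empty CSV cells).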

How to save Json Array into HTML format in Rails?

Hello Ruby users, I have a JSON array in this format:
[
"Can also work with any bluetooth-enabled smartphones and\ntablets",
"For calls and music, Hands-free",
"Very stylish design and lightweight",
"Function:Bluetooth,Noise Cancelling,Microphone",
"Compatible:For Any Device With Bluetooth Function",
"Chipset: CSR4.0", "Bluetooth Version:Bluetooth 4.0",
"Transmission Distance:10 Meters"
]
I want to save this array as HTML, in the format below:
<ul>
<li>Can also work with any bluetooth-enabled smartphones and\ntablets</li>
<li>For calls and music, Hands-free</li>
<li>Very stylish design and lightweight</li>
<li>Function:Bluetooth,Noise Cancelling,Microphone</li>
<li>Compatible:For Any Device With Bluetooth Function</li>
<li>Chipset: CSR4.0</li>
<li>Bluetooth Version:Bluetooth 4.0</li>
<li>Transmission Distance:10 Meters</li>
</ul>
This is my current code, which works correctly if I just save the raw array. However, I need it in HTML format so the user can read it easily:
result = JSON.parse(jsonparse)
result["mods"]["listItems"].each do |result|
  @item = Item.new
  @item.item_details = result["description"]
  @item.save
end
My previous attempt to solve it looked like this:
result = JSON.parse(jsonparse)
result["mods"]["listItems"].each do |result|
  @item = Item.new
  item_list = result["description"]
  item_list.each do |list|
    @item.item_details = "<li>"+list+"</li>"
  end
  @item.save
end
This one only saves one element of the array, and there is no <ul> wrapper.
Here's the original code:
namespace :scraper do
  desc "Scrape Website"
  task somesite: :environment do
    require 'open-uri'
    require 'nokogiri'
    require 'json'
    url = "url here!"
    page = Nokogiri::HTML(URI.open(url))
    script = page.search('head script')[2]
    jsonparse = script.content[/\{\"[a-zA-Z0-9\"\:\-\,\ \=\(\)\.\_\D\/\[\]\}]+/i]
    result = JSON.parse(jsonparse)
    result["mods"]["listItems"].each do |result|
      @item = Item.new
      item_details = result["description"].each {|list| "<li>#{list}</li>" }
      puts item_details
      @item.item_old_price = result["originalPrice"]
      @item.item_final_price = result["price"]
      @item.save
    end
  end
end
The idea is to save the array into the database in HTML format:
<ul>
<li>content 1</li>
<li>content 2</li>
<li>content and soon</li>
</ul>
Thanks
I'm not sure what you were trying to do.
Please check the code below and give me feedback.
Happy coding :)
require 'json'
class MyJsonParser
  def initialize
    @items = []
  end
  def parse(json)
    result = JSON.parse(json)
    generate_items(result)
    items
  end
  private
  attr_reader :items
  def generate_items(result)
    result['mods']['listItems'].each { |item_detail| items << Item.new(item_detail) }
  end
end
class Item
  attr_reader :details
  def initialize(detail = '')
    @details = +'' # unfrozen string used as a buffer
    before_initialize
    details << detail
    after_initialize
  end
  private
  def before_initialize
    details << '<li>'
  end
  def after_initialize
    details << '</li>'
  end
end
json_str = '{
"mods": {
"listItems":
[
"Can also work with any bluetooth-enabled smartphones and\ntablets",
"For calls and music, Hands-free",
"Very stylish design and lightweight",
"Function:Bluetooth,Noise Cancelling,Microphone",
"Compatible:For Any Device With Bluetooth Function",
"Chipset: CSR4.0",
"Bluetooth Version:Bluetooth 4.0",
"Transmission Distance:10 Meters"
]
}
}'
result = MyJsonParser.new.parse(json_str)
result.each do |i|
  p i.details
end
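The code above wraps each entry in <li> but never adds the surrounding <ul>. A minimal sketch of building the whole list as one string follows. The helper name to_html_list is hypothetical, and escaping each entry with ERB::Util.html_escape (Ruby standard library) is my own addition, on the assumption that the scraped text should not be trusted as markup.

```ruby
require 'erb'

# Hypothetical helper: turns an array of strings into a <ul> block,
# one <li> per entry, escaping HTML-special characters in each entry.
def to_html_list(items)
  list_items = items.map { |item| "<li>#{ERB::Util.html_escape(item)}</li>" }
  "<ul>\n#{list_items.join("\n")}\n</ul>"
end

html = to_html_list(["Chipset: CSR4.0", "Calls & music"])
puts html
# <ul>
# <li>Chipset: CSR4.0</li>
# <li>Calls &amp; music</li>
# </ul>
```

The resulting string could then be assigned once per record, for example to item_details before calling save, instead of overwriting the attribute inside a loop.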

axlsx - how can I check if an array element exists and if so alter its output?

I have an XPath query that takes array elements and writes the output with Axlsx. I need to tidy up my output for certain conditions, one of which is the 'Software included' entry.
My XPath scrapes the following URL: http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1
A sample of my code is below:
clues = Array.new
clues << 'Optical drive'
clues << 'Pointing device'
clues << 'Software included'
selector = "//td[text()='%s']/following-sibling::td"
data = clues.map do |clue|
  xpath = selector % clue
  [clue, doc.at(xpath).text.strip]
end
Axlsx::Package.new do |p|
  p.workbook.add_worksheet do |sheet|
    data.each { |datum| sheet.add_row datum }
  end
  p.serialize 'output.xlsx'
end
My Current output formatting
My Desired output formatting
If you can rely on the data always using ';' as the separator, have a go at this to generate the data before the Axlsx::Package.new block:
data = []
clues.each do |clue|
  xpath = selector % clue
  details = doc.at(xpath).text.strip.split(';')
  data << [clue, details.shift] # first detail goes on the clue's own row
  details.each { |detail| data << ['', detail] }
end
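The row-splitting idea can be tried without Nokogiri or Axlsx. The sketch below is only an illustration: rows_for is a hypothetical helper, the sample text is made up, and it assumes the cell text really is ';'-separated.

```ruby
# Hypothetical helper: splits one cell's text into spreadsheet rows.
# The first detail shares a row with the clue; the rest get an empty label.
def rows_for(clue, text, separator = ';')
  details = text.split(separator).map(&:strip).reject(&:empty?)
  return [[clue, '']] if details.empty?
  [[clue, details.first]] + details.drop(1).map { |detail| ['', detail] }
end

rows = rows_for('Software included', 'App One; App Two; App Three')
# rows == [["Software included", "App One"], ["", "App Two"], ["", "App Three"]]
```

Each inner array can be passed to sheet.add_row as in the question's code.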
In answer to your comment/question: you do it with something like this ;)
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'axlsx'
class Scraper
  def initialize(url, selector)
    @url = url
    @selector = selector
  end
  def hooks
    @hooks ||= {}
  end
  def add_hook(clue, p_roc)
    hooks[clue] = p_roc
  end
  def export(file_name)
    Scraper.clues.each do |clue|
      if detail = parse_clue(clue)
        output << [clue, detail.shift] # first detail goes on the clue's own row
        detail.each { |datum| output << ['', datum] }
      end
    end
    serialize(file_name)
  end
  private
  def self.clues
    @clues ||= ['Operating system', 'Processors', 'Chipset', 'Memory type', 'Hard drive', 'Graphics',
                'Ports', 'Webcam', 'Pointing device', 'Keyboard', 'Network interface', 'Chipset', 'Wireless',
                'Power supply type', 'Energy efficiency', 'Weight', 'Minimum dimensions (W x D x H)',
                'Warranty', 'Software included', 'Product color']
  end
  def doc
    @doc ||= begin
      Nokogiri::HTML(URI.open(@url))
    rescue
      raise ArgumentError, 'Invalid URL - Nothing to parse'
    end
  end
  def output
    @output ||= []
  end
  def selector_for_clue(clue)
    @selector % clue
  end
  def parse_clue(clue)
    if element = doc.at(selector_for_clue(clue))
      # map (not each) so the stripped strings are what gets returned
      call_hook(clue, element) || element.inner_html.split('<br>').map(&:strip)
    end
  end
  def call_hook(clue, element)
    if hooks[clue].is_a? Proc
      value = hooks[clue].call(element)
      value.is_a?(Array) ? value : [value]
    end
  end
  def package
    @package ||= Axlsx::Package.new
  end
  def serialize(file_name)
    package.workbook.add_worksheet do |sheet|
      output.each { |datum| sheet.add_row datum }
    end
    package.serialize(file_name)
  end
end
scraper = Scraper.new("http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1", "//td[text()='%s']/following-sibling::td")
# define a custom action to take against any elements found.
os_parse = Proc.new do |element|
  element.inner_html.split('<br>').each(&:strip!).each(&:upcase!)
end
scraper.add_hook('Operating system', os_parse)
scraper.export('foo.xlsx')
And the FINAL answer is... a gem.
http://rubydoc.info/gems/ninja2k/0.0.2/frames
