Web Scraping using Ruby - If statement

I have built a web scraper that needs to collect the prices and bedrooms for a given neighborhood. Sometimes span.first_detail_cell contains "Furnished"; the rest of the time it contains the price. I need the scraper to skip span.first_detail_cell when it says "Furnished" and read the price from the next cell instead. I think I need an if statement, but I'm not sure what the condition should be. Any help would be great!
require 'open-uri'
require 'nokogiri'
require 'csv'
url = "https://streeteasy.com/for-rent/bushwick"
page = Nokogiri::HTML(URI.open(url))
# Collect the page numbers shown in the pagination bar.
page_numbers = []
page.css("nav.pagination span.page a").each do |line|
  page_numbers << line.text
end
# Compare as integers; as strings, "9" would sort above "10".
max_page = page_numbers.map(&:to_i).max
beds = []
price = []
max_page.to_i.times do |i|
  url = "https://streeteasy.com/for-rent/bushwick?page=#{i+1}"
  page = Nokogiri::HTML(URI.open(url))
  page.css('span.first_detail_cell').each do |line|
    beds << line.text
  end
  page.css('span.price').each do |line|
    price << line.text
  end
end
CSV.open("bushwick_rentals.csv", "w") do |file|
  file << ["Beds", "Price"]
  beds.length.times do |i|
    file << [beds[i], price[i]]
  end
end

page.css('span.first_detail_cell').each do |line|
  if line.text.include?("Furnished")
    # do something here
  else
    beds << line.text
  end
end
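One way to fill in the branch above is to look at all of a listing's detail cells together and keep the first one that is not the "Furnished" flag. The sketch below is only an illustration under assumptions: first_real_detail is a hypothetical helper, the sample arrays are made up, and the page's real markup is not shown in the question. With Nokogiri, each inner array would come from the detail cells of a single listing element, whatever selector the site actually uses for them.

```ruby
# Hypothetical helper: given the texts of one listing's detail cells, in
# document order, return the first one that is not the "Furnished" flag.
# Returns nil when every cell is "Furnished" or the list is empty.
def first_real_detail(cell_texts)
  cell_texts.map(&:strip).find { |text| !text.include?("Furnished") }
end

# Made-up sample data: each inner array stands for one listing's cells.
listings = [
  ["2 beds", "1 bath"],
  ["Furnished", "1 bed", "1 bath"],
  ["Furnished"]
]

beds = listings.map { |cells| first_real_detail(cells) }
# => ["2 beds", "1 bed", nil]
```

Working listing by listing also keeps one entry per listing, so the beds and price arrays stay the same length when they are written to the CSV.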

Related

CSV in RUBY custom string

I have a delivery_time field whose values come from this array:
DELIVERY_TIME = [
  I18n.t("activerecord.attributes.order.none_delivery_time"),
  "09:00～12:00",
  "12:00～14:00",
  "14:00～16:00",
  "16:00～18:00",
  "18:00～20:00",
  "19:00～21:00",
  "20:00～21:00",
].freeze
When I download the CSV, the value is written in the form
"09:00～12:00"
but I now want it to be written in the form
"0912"
How can I customize it?
My code:
def perform
  CSV.generate(headers: true) do |csv|
    csv << attributes
    orders.each do |order|
      csv << create_row(order)
    end
  end
end
def create_row(order)
  # Return one array per CSV row.
  [order.delivery_time]
end

AFAIU, you need to transform the DELIVERY_TIME values to fit your format; CSV itself is out of scope here. To transform each value, split it on ～ and take the hour from each part.
DELIVERY_TIME = [
  "09:00～12:00",
  "12:00～14:00",
  "14:00～16:00",
  "16:00～18:00",
  "18:00～20:00",
  "19:00～21:00",
  "20:00～21:00",
].freeze
DELIVERY_TIME.map { |s| s.split('～').map { |t| t[0...2] }.join }
#⇒ ["0912", "1214", "1416", "1618", "1820", "1921", "2021"]
A safer method would be to use DateTime.parse for this:
require 'date'
DELIVERY_TIME.map do |s|
  s.split('～').map { |t| DateTime.parse(t).strftime("%H") }.join
end
#⇒ ["0912", "1214", "1416", "1618", "1820", "1921", "2021"]

It's not really clear what you're asking, but I'd probably start with something like this:
"09:00～12:00".scan(/\d{2}/).values_at(0, 2).join # => "0912"
Using that in some code:
DELIVERY_TIME = [
  'blah',
  "09:00～12:00",
  "12:00～14:00",
  "14:00～16:00",
  "16:00～18:00",
  "18:00～20:00",
  "19:00～21:00",
  "20:00～21:00",
].freeze
ary = [] << DELIVERY_TIME.first
ary += DELIVERY_TIME[1..-1].map { |i|
  i.scan(/\d{2}/).values_at(0, 2).join
}
# => ["blah", "0912", "1214", "1416", "1618", "1820", "1921", "2021"]
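To connect the transformation back to the CSV export in the question, the conversion can be applied at the point where each row is built. This is a minimal sketch under assumptions: compact_delivery_time is a hypothetical helper, "No preference" stands in for the I18n label, and the sample values replace the real orders. Values that do not look like a time range are passed through unchanged.

```ruby
require 'csv'

# Hypothetical helper: "09:00～12:00" -> "0912". Anything that does not
# look like a time range (such as the "none" label) is returned unchanged.
def compact_delivery_time(value)
  match = value.match(/\A(\d{2}):\d{2}～(\d{2}):\d{2}\z/)
  match ? match[1] + match[2] : value
end

# Made-up sample values; "No preference" stands in for the I18n label.
delivery_times = ["No preference", "09:00～12:00", "20:00～21:00"]

csv_text = CSV.generate do |csv|
  csv << ["delivery_time"]
  delivery_times.each { |value| csv << [compact_delivery_time(value)] }
end

puts csv_text
```

Anchoring the pattern with \A and \z means a malformed value is written as-is rather than silently truncated.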

How to scrape multiple pages using nokogiri and how to scrape fast with rails

I am trying to scrape certain elements from 99 pages of a website. I cannot for the life of me figure out how to do it.
Here is my code:
require 'open-uri'
require 'nokogiri'
@title = []
html_content = URI.open("https://www.imdb.com/list/ls057823854/?sort=list_order,asc&st_dt=&mode=detail&page=1").read
doc = Nokogiri::HTML(html_content)
doc.search(".lister-item-header/a").each do |title|
  @title << title.text.strip
end

If you want to collect all the titles, here is the scraper code.
require 'open-uri'
require 'nokogiri'
require 'json'
@title = []
url = "https://www.imdb.com/list/ls057823854/?sort=list_order,asc&st_dt=&mode=detail&page="
html_content = URI.open(url + "1").read
doc = Nokogiri::HTML(html_content)
# The pagination text ends with the total number of items, after "of".
max = doc.search(".pagination-range").first.text.split("of")[1].gsub(",", "").strip.to_i
# 100 titles per page; round up so a partial last page is still fetched.
max = (max / 100.0).ceil
doc.search(".lister-item-header/a").each do |title|
  @title << title.text.strip
end
for i in 2..max
  html_content = URI.open(url + i.to_s).read
  doc = Nokogiri::HTML(html_content)
  doc.search(".lister-item-header/a").each do |title|
    @title << title.text.strip
  end
  # Pause between requests to avoid hammering the server.
  sleep(1)
end
File.open("imdb-titles.json", "w") do |f|
  f.write(JSON.pretty_generate(@title))
end
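The page-count step is the part most easily gotten wrong, so here it is in isolation. This is a sketch under assumptions: total_items and page_count are hypothetical helpers, and the sample text "1 - 100 of 9,887" is made up to resemble the pagination label. Rounding up with ceil avoids requesting one page too many when the total is an exact multiple of the page size, which integer division plus one would do.

```ruby
# Hypothetical helper: pull the total out of text such as "1 - 100 of 9,887".
def total_items(range_text)
  range_text.split("of").last.delete(",").strip.to_i
end

# Number of pages needed to show `total` items at `per_page` items each.
def page_count(total, per_page = 100)
  (total.to_f / per_page).ceil
end

BASE_URL = "https://www.imdb.com/list/ls057823854/?sort=list_order,asc&st_dt=&mode=detail&page="

total = total_items("1 - 100 of 9,887") # made-up sample text
pages = page_count(total)
urls  = (1..pages).map { |n| "#{BASE_URL}#{n}" }

puts "#{total} items on #{pages} pages"
```

Building the list of URLs up front also makes it straightforward to fetch them in a single loop instead of treating page 1 as a special case.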

ruby nokogiri export csv columns

I want to export a CSV with three columns, one per type of data, but the result is not what I want: all of my data ends up in a single column. What should I do?
require 'nokogiri'
require 'csv'
page = Nokogiri::HTML(open("index.html"))
fullName = page.css('li._5i_q').css("a[data-gt]").children.map { |name| name.text }
shortURL = page.css('li._5i_q').css("._5j0e a[data-hovercard]")
myID = shortURL.map { |element|
  element["data-hovercard"][/id=([^&]*)/].gsub('id=', '')
}
messenger = shortURL.map { |element|
  element["data-hovercard"][/id=([^&]*)/].gsub('id=', '') + "@gmail.com"
}
attributes = %w{ID FullName Messenger}
CSV.open('chatId.csv', 'w') do |csv|
  csv << attributes
  myID.each do |x|
    csv << [x]
  end
  fullName.each do |y|
    csv << [y]
  end
  messenger.each do |z|
    csv << [z]
  end
end
That's all of my code.

You have to write the CSV row by row. Build an array of [x, y, z] rows, for example with zip, and append each row to the CSV as an array:
data = myID.zip(fullName, messenger)
CSV.open('chatId.csv', 'w') do |csv|
  csv << attributes
  data.each do |d|
    csv << d
  end
end
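The same idea can be tried without the scraped page. This is a small sketch with made-up sample data standing in for the three scraped arrays; it uses CSV.generate so the result can be inspected as a string before anything is written to disk.

```ruby
require 'csv'

# Made-up sample data standing in for the three scraped arrays.
ids        = ["1001", "1002"]
full_names = ["Ada Lovelace", "Alan Turing"]
messengers = ids.map { |id| "#{id}@example.com" }

# zip pairs up the elements by position: one inner array per CSV row.
rows = ids.zip(full_names, messengers)

csv_text = CSV.generate do |csv|
  csv << %w[ID FullName Messenger]
  rows.each { |row| csv << row }
end

puts csv_text
```

Note that zip pads with nil when the other arrays are shorter than the receiver, so a missing name shows up as an empty cell rather than shifting later rows.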

How to save a JSON array in HTML format in Rails?

Hello Ruby users, I have a JSON array in this format:
[
  "Can also work with any bluetooth-enabled smartphones and\ntablets",
  "For calls and music, Hands-free",
  "Very stylish design and lightweight",
  "Function:Bluetooth,Noise Cancelling,Microphone",
  "Compatible:For Any Device With Bluetooth Function",
  "Chipset: CSR4.0",
  "Bluetooth Version:Bluetooth 4.0",
  "Transmission Distance:10 Meters"
]
I want to save this array as HTML, using the format below.
<ul>
<li>Can also work with any bluetooth-enabled smartphones and\ntablets</li>
<li>For calls and music, Hands-free</li>
<li>Very stylish design and lightweight</li>
<li>Function:Bluetooth,Noise Cancelling,Microphone</li>
<li>Compatible:For Any Device With Bluetooth Function</li>
<li>Chipset: CSR4.0</li>
<li>Bluetooth Version:Bluetooth 4.0</li>
<li>Transmission Distance:10 Meters</li>
</ul>
This is my current code, which works correctly if I just save the array as it is. However, I need it in HTML format so that users can read it easily.
result = JSON.parse(jsonparse)
result["mods"]["listItems"].each do |result|
  @item = Item.new
  @item.item_details = result["description"]
  @item.save
end
My previous attempt to solve this looked like this:
result = JSON.parse(jsonparse)
result["mods"]["listItems"].each do |result|
  @item = Item.new
  item_list = result["description"]
  item_list.each do |list|
    @item.item_details = "<li>"+list+"</li>"
  end
  @item.save
end
This saves only one element of the array, and there is no enclosing <ul>.
Here is the original code:
namespace :scraper do
  desc "Scrape Website"
  task somesite: :environment do
    require 'open-uri'
    require 'nokogiri'
    require 'json'
    url = "url here!"
    page = Nokogiri::HTML(open(url))
    script = page.search('head script')[2]
    jsonparse = script.content[/\{\"[a-zA-Z0-9\"\:\-\,\ \=\(\)\.\_\D\/\[\]\}]+/i]
    result = JSON.parse(jsonparse)
    result["mods"]["listItems"].each do |result|
      @item = Item.new
      item_details = result["description"].each {|list| "<li>#{list}</li>" }
      puts item_details
      @item.item_old_price = result["originalPrice"]
      @item.item_final_price = result["price"]
      @item.save
    end
  end
end
The idea is to save the array into the database in this HTML format:
<ul>
<li>content 1</li>
<li>content 2</li>
<li>content and so on</li>
</ul>
Thanks

I'm not sure what you are trying to do.
Please check the code below and give me some feedback.
Happy coding :)
require 'json'
class MyJsonParser
  def initialize
    @items = []
  end
  def parse(json)
    result = JSON.parse(json)
    generate_items(result)
    items
  end
  private
  attr_reader :items
  def generate_items(result)
    result['mods']['listItems'].each {|item_detail| items << Item.new(item_detail)}
  end
end
class Item
  attr_reader :details
  def initialize(detail='')
    @details = ''
    before_initialize
    details << detail
    after_initialize
  end
  private
  def before_initialize
    details << '<li>'
  end
  def after_initialize
    details << '</li>'
  end
end
json_str = '{
  "mods": {
    "listItems":
    [
      "Can also work with any bluetooth-enabled smartphones and\ntablets",
      "For calls and music, Hands-free",
      "Very stylish design and lightweight",
      "Function:Bluetooth,Noise Cancelling,Microphone",
      "Compatible:For Any Device With Bluetooth Function",
      "Chipset: CSR4.0",
      "Bluetooth Version:Bluetooth 4.0",
      "Transmission Distance:10 Meters"
    ]
  }
}'
result = MyJsonParser.new.parse(json_str)
result.each do |i|
  p i.details
end
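The code above wraps each entry in <li> tags but does not produce the enclosing <ul> the question asks for. A shorter route, sketched here under assumptions, is to map the array to <li> strings and join them inside one <ul>. Both escape_html and to_html_list are hypothetical helpers, and the third sample entry is made up to show the escaping. Using map matters: each returns the original array and discards the block's result, and assigning item_details inside a loop keeps only the last entry.

```ruby
HTML_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;", "'" => "&#39;"
}.freeze

# Hypothetical helper: escape characters that are special in HTML.
def escape_html(text)
  text.gsub(/[&<>"']/, HTML_ESCAPES)
end

# Hypothetical helper: build one <ul> string from an array of strings.
def to_html_list(items)
  list_items = items.map { |item| "<li>#{escape_html(item)}</li>" }
  "<ul>\n#{list_items.join("\n")}\n</ul>"
end

# The third entry is made up to show the escaping.
descriptions = ["Chipset: CSR4.0", "Bluetooth Version:Bluetooth 4.0", "Cable < 1 m & case"]
html = to_html_list(descriptions)
puts html
```

In the rake task from the question, the resulting string could then be assigned once per item, for example as the value of item_details, before saving.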

axlsx - how can I check if an array element exists and if so alter its output?

I have an XPath query that looks up each element of an array and writes the results out using Axlsx. I need to tidy up my output for certain conditions, one of which is 'Software included'.
My XPath scrapes the following URL: http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1
A sample of my code is below:
clues = Array.new
clues << 'Optical drive'
clues << 'Pointing device'
clues << 'Software included'
selector = "//td[text()='%s']/following-sibling::td"
data = clues.map do |clue|
  xpath = selector % clue
  [clue, doc.at(xpath).text.strip]
end
Axlsx::Package.new do |p|
  p.workbook.add_worksheet do |sheet|
    data.each { |datum| sheet.add_row datum }
  end
  p.serialize 'output.xlsx'
end
My Current output formatting
My Desired output formatting

If you can rely on the data always using ';' for separators, have a go at this:
data = []
clues.each do |clue|
  xpath = selector % clue
  details = doc.at(xpath).text.strip.split(';')
  data << [clue, details.pop]
  details.each { |detail| data << ['', detail] }
end
to generate the data before the Axlsx::Package.new block
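The row-splitting step can be tried on its own, without the HP page or Axlsx. This is a sketch under assumptions: rows_for is a hypothetical helper and the sample text is made up. It is also a variation on the snippet above, not a copy of it: it takes the first detail for the clue row (where the snippet uses pop, which takes the last), on the assumption that the details should appear in their original order.

```ruby
# Hypothetical helper: turn one clue and its ';'-separated cell text into
# worksheet rows, with the clue label on the first row only.
def rows_for(clue, cell_text)
  details = cell_text.split(';').map(&:strip).reject(&:empty?)
  return [[clue, '']] if details.empty?

  first, *rest = details
  [[clue, first]] + rest.map { |detail| ['', detail] }
end

# Made-up sample text standing in for the scraped table cell.
rows = rows_for('Software included', 'App one; App two;App three')
# => [["Software included", "App one"], ["", "App two"], ["", "App three"]]
```

Each inner array can then be passed to sheet.add_row exactly as in the question's code.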
In answer to your comment/question: you can do it with something like this ;)
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'axlsx'
class Scraper
  def initialize(url, selector)
    @url = url
    @selector = selector
  end
  def hooks
    @hooks ||= {}
  end
  def add_hook(clue, p_roc)
    hooks[clue] = p_roc
  end
  def export(file_name)
    Scraper.clues.each do |clue|
      if detail = parse_clue(clue)
        output << [clue, detail.pop]
        detail.each { |datum| output << ['', datum] }
      end
    end
    serialize(file_name)
  end
  private
  def self.clues
    @clues ||= ['Operating system', 'Processors', 'Chipset', 'Memory type', 'Hard drive', 'Graphics',
      'Ports', 'Webcam', 'Pointing device', 'Keyboard', 'Network interface', 'Chipset', 'Wireless',
      'Power supply type', 'Energy efficiency', 'Weight', 'Minimum dimensions (W x D x H)',
      'Warranty', 'Software included', 'Product color']
  end
  def doc
    @doc ||= begin
      Nokogiri::HTML(URI.open(@url))
    rescue
      raise ArgumentError, 'Invalid URL - Nothing to parse'
    end
  end
  def output
    @output ||= []
  end
  def selector_for_clue(clue)
    @selector % clue
  end
  def parse_clue(clue)
    if element = doc.at(selector_for_clue(clue))
      # map (not each) so the stripped strings are what gets returned
      call_hook(clue, element) || element.inner_html.split('<br>').map(&:strip)
    end
  end
  def call_hook(clue, element)
    if hooks[clue].is_a? Proc
      value = hooks[clue].call(element)
      value.is_a?(Array) ? value : [value]
    end
  end
  def package
    @package ||= Axlsx::Package.new
  end
  def serialize(file_name)
    package.workbook.add_worksheet do |sheet|
      output.each { |datum| sheet.add_row datum }
    end
    package.serialize(file_name)
  end
end
scraper = Scraper.new("http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1", "//td[text()='%s']/following-sibling::td")
# define a custom action to take against any elements found.
os_parse = Proc.new do |element|
  element.inner_html.split('<br>').each(&:strip!).each(&:upcase!)
end
scraper.add_hook('Operating system', os_parse)
scraper.export('foo.xlsx')
And the FINAL answer is... a gem.
http://rubydoc.info/gems/ninja2k/0.0.2/frames
