How to save a JSON array in HTML format in Rails? - ruby-on-rails

Hello Ruby users, I have a JSON array in this format:
[
"Can also work with any bluetooth-enabled smartphones and\ntablets",
"For calls and music, Hands-free",
"Very stylish design and lightweight",
"Function:Bluetooth,Noise Cancelling,Microphone",
"Compatible:For Any Device With Bluetooth Function",
"Chipset: CSR4.0", "Bluetooth Version:Bluetooth 4.0",
"Transmission Distance:10 Meters"
]
I want to save this array as HTML, using the format below:
<ul>
<li>Can also work with any bluetooth-enabled smartphones and\ntablets</li>
<li>For calls and music, Hands-free</li>
<li>Very stylish design and lightweight</li>
<li>Function:Bluetooth,Noise Cancelling,Microphone</li>
<li>Compatible:For Any Device With Bluetooth Function</li>
<li>Chipset: CSR4.0</li>
<li>Bluetooth Version:Bluetooth 4.0</li>
<li>Transmission Distance:10 Meters</li>
</ul>
This is my current code. It works correctly if I just save the array as-is, but I need it in HTML format so users can easily read it:
result = JSON.parse(jsonparse)
result["mods"]["listItems"].each do |result|
@item = Item.new
@item.item_details = result["description"]
@item.save
end
My previous attempt to solve it looked like this:
result = JSON.parse(jsonparse)
result["mods"]["listItems"].each do |result|
@item = Item.new
item_list = result["description"]
item_list.each do |list|
@item.item_details = "<li>"+list+"</li>"
end
@item.save
end
This only saves one element of the array, and there is no enclosing <ul>.
Here's the original code:
namespace :scraper do
desc "Scrape Website"
task somesite: :environment do
require 'open-uri'
require 'nokogiri'
require 'json'
url = "url here!"
page = Nokogiri::HTML(open(url))
script = page.search('head script')[2]
jsonparse = script.content[/\{\"[a-zA-Z0-9\"\:\-\,\ \=\(\)\.\_\D\/\[\]\}]+/i]
result = JSON.parse(jsonparse)
result["mods"]["listItems"].each do |result|
@item = Item.new
item_details = result["description"].each {|list| "<li>#{list}</li>" }
puts item_details
@item.item_old_price = result["originalPrice"]
@item.item_final_price = result["price"]
@item.save
end
end
end
The idea is to save the array to the database in HTML format:
<ul>
<li>content 1</li>
<li>content 2</li>
<li>content and soon</li>
</ul>
Thanks

I'm not sure exactly what you were trying to do.
Please check the code below and give me feedback.
Happy coding :)
require 'json'
class MyJsonParser
def initialize
@items = []
end
def parse(json)
result = JSON.parse(json)
generate_items(result)
items
end
private
attr_reader :items
def generate_items(result)
result['mods']['listItems'].each {|item_detail| items << Item.new(item_detail)}
end
end
class Item
attr_reader :details
def initialize(detail='')
@details = ''
before_initialize
details << detail
after_initialize
end
private
def before_initialize
details << '<li>'
end
def after_initialize
details << '</li>'
end
end
json_str = '{
"mods": {
"listItems":
[
"Can also work with any bluetooth-enabled smartphones and\ntablets",
"For calls and music, Hands-free",
"Very stylish design and lightweight",
"Function:Bluetooth,Noise Cancelling,Microphone",
"Compatible:For Any Device With Bluetooth Function",
"Chipset: CSR4.0",
"Bluetooth Version:Bluetooth 4.0",
"Transmission Distance:10 Meters"
]
}
}'
result = MyJsonParser.new.parse(json_str)
result.each do |i|
p i.details
end
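The code above wraps each entry in <li> but does not add the enclosing <ul> that the question asks for. Here is a minimal sketch of building the whole list as one string, assuming the descriptions arrive as a JSON array of strings. The helper name list_to_html is hypothetical, and ERB::Util.html_escape is used on the assumption that scraped text should not be able to inject markup:

```ruby
require 'json'
require 'erb'

# Hypothetical helper: turn an array of strings into a single <ul> string.
def list_to_html(lines)
  # Escape each entry, then wrap it in <li> tags.
  items = lines.map { |line| "<li>#{ERB::Util.html_escape(line)}</li>" }
  "<ul>\n#{items.join("\n")}\n</ul>"
end

json = '["For calls and music, Hands-free", "Chipset: CSR4.0"]'
html = list_to_html(JSON.parse(json))
puts html
```

In the rake task, the returned string could be assigned to item_details before saving. Because the stored value is markup, a Rails view would have to output it with raw or html_safe for the tags to render.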

Related

Nokogiri displaying data in view

I'm trying to figure out how to display the text and images I have scraped in my application's HTML.
Here is my app/scrape2.rb file
require 'nokogiri'
require 'open-uri'
url = "https://marketplace.asos.com/boutiques/independent-label"
doc = Nokogiri::HTML(open(url))
label = doc.css('#boutiqueList')
@label = label.css('#boutiqueList img').map { |l| p l.attr('src') }
@title = label.css("#boutiqueList .notranslate").map { |o| p o.text }
Here is the controller:
class PagesController < ApplicationController
def about
#used to change the routing to /about
end
def index
@label = label.css('#boutiqueList img').map { |l| p l.attr('src') }
@title = label.css("#boutiqueList .notranslate").map { |o| p o.text }
end
end
and finally the label.html.erb page:
<% @label.each do |image| %>
<%= image_tag image %>
<% end %>
Do I need some other method, or am I not storing the arrays properly?
Your controller needs to load the data itself, or somehow pull the data from scrape2.rb. Controllers do not have access to other files unless specified (include, extend, etc).
require 'nokogiri'
require 'open-uri'
class PagesController < ApplicationController
def index
# Call these in your controller:
url = "https://marketplace.asos.com/boutiques/independent-label"
doc = Nokogiri::HTML(open(url))
label = doc.css('#boutiqueList')
@label = label.css('#boutiqueList img').map { |l| p l.attr('src') }
@title = label.css("#boutiqueList .notranslate").map { |o| p o.text }
end
end
You're not parsing the data correctly.
label = doc.css('#boutiqueList')
should be:
label = doc.at('#boutiqueList')
#boutiqueList is an ID, of which only one can exist in a document at a time. css returns a NodeSet, which is like an Array, but you really want to point to the Node itself, which is what at would do. at is equivalent to search('...').first.
Then you use:
label.css('#boutiqueList img')
which is also wrong. label is supposed to already point to the node containing #boutiqueList, but then you want Nokogiri to look inside that node and find additional nodes with id="boutiqueList" and that contain <img> tags. But, again, because #boutiqueList is an ID and it can't occur more than once in the document, Nokogiri can't find any nodes:
label.css('#boutiqueList img').size # => 0
whereas using label.css correctly finds <img> nodes:
label.css('img').size # => 48
Then you use map to print out values, but map is meant to transform the contents of an Array: it returns a new Array built from the block's return values. p does return the value it outputs, but it's bad form to rely on the return value of p inside a map. Instead, map to convert the values, then puts the result if you need to see it:
@label = label.css('#boutiqueList img').map { |l| l.attr('src') }
puts @label
Instead of using attr('src'), I'd write the first line as:
@label = label.css('img').map { |l| l['src'] }
The same is true of:
@title = label.css("#boutiqueList .notranslate").map { |o| p o.text }
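To illustrate the point about map without needing a live page, here is a small sketch that uses plain hashes as stand-ins for the <img> nodes. The sample data is made up; a real Nokogiri node supports the same node['src'] lookup:

```ruby
# Made-up stand-ins for Nokogiri <img> nodes.
nodes = [{ 'src' => '/a.png' }, { 'src' => '/b.png' }]

# map builds a new array from each block's return value.
sources = nodes.map { |node| node['src'] }

# Print separately, after the transformation, instead of calling p inside map.
puts sources
```

Keeping the transformation and the output apart means the array handed to the view holds only the values, whatever happens to the debugging output later.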

Web Scraping using Ruby - If statement

I have built a web scraper. I need it to scrape the prices and bedrooms of a given neighborhood. Sometimes the span.first_detail_cell will return Furnished and the rest of the time it will return the price. I need to write something that can overlook the span.first_detail_cell if it is furnished and look in the next cell for the price. I think I need to write an if statement, but not sure of the parameters. Any help would be great!
require 'open-uri'
require 'nokogiri'
require 'csv'
url = "https://streeteasy.com/for-rent/bushwick"
page = Nokogiri::HTML(open(url))
page_numbers = []
page.css("nav.pagination span.page a").each do |line|
page_numbers << line.text
end
max_page = page_numbers.max
beds = []
price = []
max_page.to_i.times do |i|
url = "https://streeteasy.com/for-rent/bushwick?page=#{i+1}"
page = Nokogiri::HTML(open(url))
page.css('span.first_detail_cell').each do |line|
beds << line.text
end
page.css('span.price').each do |line|
price << line.text
end
end
CSV.open("bushwick_rentals.csv", "w") do |file|
file << ["Beds", "Price"]
beds.length.times do |i|
file << [beds[i], price[i]]
end
end
page.css('span.first_detail_cell').each do |line|
if line.text.include?("Furnished")
# do something here
else
beds << line.text
end
end
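One way to fill in the "Furnished" branch is to look at all of a listing's detail cells and take the first one that is not the marker. The sketch below assumes the cell texts have already been extracted per listing (in the scraper they would come from line.text); the sample data is made up:

```ruby
# Made-up sample: the text of each listing's detail cells, in page order.
listings = [
  ['2 beds', '1 bath'],
  ['Furnished', '1 bed', '1 bath']
]

# For each listing, take the first cell that is not the "Furnished" marker.
beds = listings.map do |cells|
  cells.find { |text| !text.include?('Furnished') }
end

puts beds
```

Producing exactly one value per listing also keeps the beds array aligned by index with the price array when the CSV rows are written.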

Nokogiri Scraping In Rails

So I have this code in my index action, would love to move it to a model, just a little confused on how to do it.
Original Code
def index
urls = %w[http://cltampa.com/blogs/potlikker http://cltampa.com/blogs/artbreaker http://cltampa.com/blogs/politicalanimals http://cltampa.com/blogs/earbuds http://cltampa.com/blogs/dailyloaf http://cltampa.com/blogs/bedpost]
@final_images = []
@final_urls = []
urls.each do |url|
blog = Nokogiri::HTML(open(url))
images = blog.xpath('//*[@class="postBody"]/div[1]//img/@src')
images.each do |image|
@final_images << image
end
story_path = blog.xpath('//*[@class="postTitle"]/a/@href')
story_path.each do |path|
@final_urls << path
end
end
end
I tested this code in my model and it works perfectly for one url, just not sure how to integrate all of the urls like the original code.
New Code
Model
class Photocloud < ActiveRecord::Base
attr_reader :url, :data
def initialize(url)
@url = url
end
def data
@data ||= Nokogiri::HTML(open(url))
end
def get_elements(path)
data.xpath(path)
end
end
Controller
def index
@scraper = Photocloud.new('http://cltampa.com/blogs/artbreaker')
@photos = @scraper.get_elements('//*[@class="postBody"]/div[1]//img/@src')
@story_urls = @scraper.get_elements('//*[@class="postBody"]/div[1]//img/@src')
end
My main questions are how would I initialize multiple urls and loop through them like my original code. I have tried different things but feel like I have hit a wall. I need to save them to the database, but would like to get this working first. Any help is greatly appreciated.
Updated Controller - WIP
def index
start_urls = %w[http://cltampa.com/blogs/potlikker
http://cltampa.com/blogs/artbreaker
http://cltampa.com/blogs/politicalanimals
http://cltampa.com/blogs/earbuds
http://cltampa.com/blogs/dailyloaf
http://cltampa.com/blogs/bedpost]
@scraper = Photocloud.new(start_urls)
@images =
@paths =
end
Need some help with this part...
It seems that you don't persist the scraped images and paths to the database, so Photocloud doesn't need to inherit from ActiveRecord::Base. It can be just a plain old Ruby object (PORO):
require 'nokogiri'
require 'open-uri'

class Photocloud
  attr_reader :start_urls
  attr_accessor :images, :paths

  def initialize(start_urls)
    @start_urls = start_urls
    @images = []
    @paths = []
  end

  def scrape
    start_urls.each do |start_url|
      # Fetch the page for the URL currently being iterated over.
      # On Ruby 3.0+, call URI.open(start_url) instead of open(start_url).
      blog = Nokogiri::HTML(open(start_url))
      scrape_images(blog)
      scrape_paths(blog)
    end
  end

  private

  def scrape_images(blog)
    # A distinct local name avoids shadowing the `images` accessor.
    image_nodes = blog.xpath('//*[@class="postBody"]/div[1]//img/@src')
    image_nodes.each do |image|
      images << image
    end
  end

  def scrape_paths(blog)
    story_path = blog.xpath('//*[@class="postTitle"]/a/@href')
    story_path.each do |path|
      paths << path
    end
  end
end
In controller:
scraper = Photocloud.new(start_urls)
scraper.scrape
@images = scraper.images
@paths = scraper.paths
This is only one of the ways you could structure the code, of course.

axlsx - how can I check if an array element exists and if so alter its output?

I have an XPath query that takes array elements and writes the results out using Axlsx. I need to tidy up my output for certain conditions, one of which is 'Software included'.
My XPath scrapes the following URL: http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1
A sample of my code is below:
clues = Array.new
clues << 'Optical drive'
clues << 'Pointing device'
clues << 'Software included'
selector = "//td[text()='%s']/following-sibling::td"
data = clues.map do |clue|
xpath = selector % clue
[clue, doc.at(xpath).text.strip]
end
Axlsx::Package.new do |p|
p.workbook.add_worksheet do |sheet|
data.each { |datum| sheet.add_row datum }
end
p.serialize 'output.xlsx'
end
My Current output formatting
My Desired output formatting
If you can rely on the data always using ';' for separators, have a go at this:
data = []
clues.each do |clue|
xpath = selector % clue
details = doc.at(xpath).text.strip.split(';')
data << [clue, details.pop]
details.each { |detail| data << ['', detail] }
end
to generate the data before the Axlsx::Package.new block
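The row-splitting idea can be tried without the page or the spreadsheet. The sketch below uses a made-up hash in place of the scraped cell text, and it deliberately uses shift instead of pop, on the assumption that the first detail should sit next to its label and the rest should follow in page order:

```ruby
# Made-up sample standing in for the scraped cell text of each clue.
scraped = {
  'Optical drive' => 'DVD writer',
  'Software included' => 'Recovery tool; Support tool; PDF reader'
}

rows = []
scraped.each do |clue, text|
  details = text.split(';').map(&:strip)
  # The first detail shares a row with the label (shift keeps page order).
  rows << [clue, details.shift]
  # Remaining details go on their own rows with an empty label column.
  details.each { |detail| rows << ['', detail] }
end

rows.each { |row| p row }
```

Each element of rows is a two-column array, so it could be passed to sheet.add_row exactly as in the original Axlsx block.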
In answer to your comment/question: you can do it with something like this ;)
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'axlsx'
class Scraper
def initialize(url, selector)
@url = url
@selector = selector
end
def hooks
@hooks ||= {}
end
def add_hook(clue, p_roc)
hooks[clue] = p_roc
end
def export(file_name)
Scraper.clues.each do |clue|
if detail = parse_clue(clue)
output << [clue, detail.pop]
detail.each { |datum| output << ['', datum] }
end
end
serialize(file_name)
end
private
def self.clues
@clues ||= ['Operating system', 'Processors', 'Chipset', 'Memory type', 'Hard drive', 'Graphics',
'Ports', 'Webcam', 'Pointing device', 'Keyboard', 'Network interface', 'Chipset', 'Wireless',
'Power supply type', 'Energy efficiency', 'Weight', 'Minimum dimensions (W x D x H)',
'Warranty', 'Software included', 'Product color']
end
def doc
@doc ||= begin
Nokogiri::HTML(open(@url))
rescue
raise ArgumentError, 'Invalid URL - Nothing to parse'
end
end
def output
@output ||= []
end
def selector_for_clue(clue)
@selector % clue
end
def parse_clue(clue)
if element = doc.at(selector_for_clue(clue))
call_hook(clue, element) || element.inner_html.split('<br>').map(&:strip)
end
end
def call_hook(clue, element)
if hooks[clue].is_a? Proc
value = hooks[clue].call(element)
value.is_a?(Array) ? value : [value]
end
end
def package
@package ||= Axlsx::Package.new
end
def serialize(file_name)
package.workbook.add_worksheet do |sheet|
output.each { |datum| sheet.add_row datum }
end
package.serialize(file_name)
end
end
scraper = Scraper.new("http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1", "//td[text()='%s']/following-sibling::td")
# define a custom action to take against any elements found.
os_parse = Proc.new do |element|
element.inner_html.split('<br>').each(&:strip!).each(&:upcase!)
end
scraper.add_hook('Operating system', os_parse)
scraper.export('foo.xlsx')
And the FINAL answer is... a gem.
http://rubydoc.info/gems/ninja2k/0.0.2/frames

Gem Resque Error - Undefined "method perform" after Overriding it from the super class

First of all, thank you all for helping programmers like me with your valuable input in solving day-to-day issues.
This is my first question on Stack Overflow; I have been struggling with this problem for almost a week.
We are building a crawler that crawls specific websites and extracts content from them, using Mechanize. Because crawling was taking a lot of time, we decided to run it as a background task using Resque with the redis gem. But when I send the process to the background, I get the error in the title.
My code in lib/parsers/home.rb:
require 'resque'
require File.dirname(__FILE__)+"/../index"
class Home < Index
Resque.enqueue(Index , :page )
def self.perform(page)
super (page)
search_form = page.form_with :name=>"frmAgent"
resuts_page = search_form.submit
total_entries = resuts_page.parser.xpath('//*[@id="PagingTable"]/tr[2]/td[2]').text
if total_entries =~ /(\d+)\s*$/
total_entries = $1
else
total_entries = "unknown"
end
start_res_idx = 1
while true
puts "Found #{total_entries} entries"
detail_links = resuts_page.parser.xpath('//*[@id="MainTable"]/tr/td/a')
detail_links.each do |d_link|
if d_link.attribute("class")
next
else
data_page = @agent.get d_link.attribute("href")
fields = get_fields_from_page data_page
save_result_page page.uri.to_s, fields
#break
end
end
site_done
rescue Exception => e
puts "error: #{e}"
end
end
and the superclass in lib/index.rb is
require 'resque'
require 'mechanize'
require 'mechanize/form'
class Index
@queue = :Index_queue
def initialize(site)
@site = site
@agent = Mechanize.new
@agent.user_agent = Mechanize::AGENT_ALIASES['Windows Mozilla']
@agent.follow_meta_refresh = true
@rows_parsed = 0
@rows_total = 0
rescue Exception => e
log "Unable to login: #{e.message}"
end
def run
log "Parsing..."
url = "unknown"
if @site.url
url = @site.url
log "Opening #{url} as a data page"
@page = @agent.get(url)
#perform method should be override in subclasses
@data = self.perform(@page)
else
#some sites do not have "datapage" URL
#for example after login you're already on your very own datapage
#this is to be addressed in 'perform' method of subclass
@data = self.perform(nil)
end
rescue Exception=>e
puts "Failed to parse URL '#{url}', exception=>"+e.message
set_site_status("error "+e.message)
end
#overriding method
def self.perform(page)
end
def save_result_page(url, result_params)
result = Result.find_by_sql(["select * from results where site_id = ? AND ref_code = ?", @site.id, utf8(result_params[:ref_code])]).first
if result.nil?
result_params[:site_id] = @site.id
result_params[:time_crawled] = DateTime.now().strftime "%Y-%m-%d %H:%M:%S"
result_params[:link] = url
result = Result.create result_params
else
result.result_fields.each do |f|
f.delete
end
result.link = url
result.time_crawled = DateTime.now().strftime "%Y-%m-%d %H:%M:%S"
result.html = result_params[:html]
fields = []
result_params[:result_fields_attributes].each do |f|
fields.push ResultField.new(f)
end
result.result_fields = fields
result.save
end
@rows_parsed +=1
msg = "Saved #{@rows_parsed}"
msg +=" of #{@rows_total}" if @rows_total.to_i > 0
log msg
return result
end
end
What's Wrong with this code?
Thanks
