I'm trying to figure out how to display the text and images I have scraped in my application's HTML.
Here is my app/scrape2.rb file:
require 'nokogiri'
require 'open-uri'
url = "https://marketplace.asos.com/boutiques/independent-label"
doc = Nokogiri::HTML(open(url))
label = doc.css('#boutiqueList')
@label = label.css('#boutiqueList img').map { |l| p l.attr('src') }
@title = label.css("#boutiqueList .notranslate").map { |o| p o.text }
Here is the controller:
class PagesController < ApplicationController
  def about
    # used to change the routing to /about
  end
  def index
    @label = label.css('#boutiqueList img').map { |l| p l.attr('src') }
    @title = label.css("#boutiqueList .notranslate").map { |o| p o.text }
  end
end
And finally the label.html.erb page:
<% @label.each do |image| %>
  <%= image_tag image %>
<% end %>
Do I need some other method, or am I not storing the arrays properly?
Your controller needs to load the data itself, or somehow pull the data from scrape2.rb. Controllers do not have access to other files unless specified (include, extend, etc).
require 'nokogiri'
require 'open-uri'
class PagesController < ApplicationController
  def index
    # Call these in your controller:
    url = "https://marketplace.asos.com/boutiques/independent-label"
    # On Ruby 3.0+ open-uri must be called as URI.open; plain open(url) only works on older versions.
    doc = Nokogiri::HTML(URI.open(url))
    label = doc.css('#boutiqueList')
    @label = label.css('#boutiqueList img').map { |l| p l.attr('src') }
    @title = label.css("#boutiqueList .notranslate").map { |o| p o.text }
  end
end
You're not parsing the data correctly.
label = doc.css('#boutiqueList')
should be:
label = doc.at('#boutiqueList')
#boutiqueList is an ID, of which only one can exist in a document at a time. css returns a NodeSet, which is like an Array, but you really want to point to the Node itself, which is what at would do. at is equivalent to search('...').first.
Then you use:
label.css('#boutiqueList img')
which is also wrong. label is supposed to already point to the node containing #boutiqueList, but then you want Nokogiri to look inside that node and find additional nodes with id="boutiqueList" and that contain <img> tags. But, again, because #boutiqueList is an ID and it can't occur more than once in the document, Nokogiri can't find any nodes:
label.css('#boutiqueList img').size # => 0
whereas using label.css correctly finds <img> nodes:
label.css('img').size # => 48
Then you use map to print out values, but map is meant to transform: it returns a new Array built from the block's return values and leaves the original untouched. p does return the value it outputs, but it's bad form to rely on the returned value of p in a map. Instead you should map to convert the values, then puts the result if you need to see it:
@label = label.css('img').map { |l| l.attr('src') }
puts @label
Instead of using attr('src'), I'd write the first line as:
@label = label.css('img').map { |l| l['src'] }
The same is true of:
@title = label.css("#boutiqueList .notranslate").map { |o| p o.text }
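To make the map/p distinction concrete, here is a minimal sketch in plain Ruby. The hashes are hypothetical stand-ins for Nokogiri nodes (both respond to ['src']), so it runs without any network access:

```ruby
# Hypothetical stand-ins for the <img> nodes that label.css('img') would return.
nodes = [
  { 'src' => '/images/a.jpg' },
  { 'src' => '/images/b.jpg' }
]

# map builds a new array from the block's return values; nodes itself is untouched.
sources = nodes.map { |node| node['src'] }

# Printing happens separately, after the array has been built.
puts sources

# p prints its argument and also returns it, which is why the original
# map { |l| p l.attr('src') } happened to work despite being bad form.
echoed = p('hello')
```

Keeping the transformation and the output apart means the debugging output can be deleted later without changing what ends up in the array.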
I've got a class that looks like this:
class VariableStack
  def initialize(document)
    @document = document
  end
  def to_array
    @document.template.stacks.each { |stack| stack_hash stack }
  end
  private
  def stack_hash(stack)
    stack_hash = {}
    stack_hash['stack_name'] = stack.name
    stack_hash['boxes'] = [stack.boxes.each { |box| box_hash box }]
    stack_hash
  end
  def box_hash(box)
    box_hash = {}
    content = []
    box.template_variables.indexed.each { |var| content << content_array(var) }
    content.delete_if(&:blank?)
    box_hash.store('content', content.join("\n"))
    return if box_hash['content'].empty?
    box_hash
  end
  def content_array(var)
    v = @document.template_variables.where(master_id: var.id).first
    return unless v
    if v.text.present?
      v.format_text
    elsif v.photo_id.present?
      v.image.uploaded_image.url
    end
  end
end
The document I'm testing with has two template_variables so the desired result should be a nested hash like so:
Instead I'm getting this result:
=> [#<Stack id: 1, name: "User information">]
i.e., I'm not getting the boxes key nor its nested content. Why isn't my method looping through the box_hash and content fields?
That's because the to_array method uses each, which returns the object it was called on (in this case @document.template.stacks) and discards the block's results.
Change it to map and you should get the desired result. The stack.boxes.each call inside stack_hash has the same problem and needs map as well:
def to_array
  @document.template.stacks.map { |stack| stack_hash stack }
end
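The difference is easy to see outside Rails. This is a small sketch where Stack is a hypothetical Struct standing in for the real model:

```ruby
# Hypothetical stand-in for the Stack model; only name is needed here.
Stack = Struct.new(:name)
stacks = [Stack.new('User information'), Stack.new('Billing')]

# each runs the block for its side effects and returns the receiver.
with_each = stacks.each { |stack| { 'stack_name' => stack.name } }

# map collects whatever the block returns into a new array.
with_map = stacks.map { |stack| { 'stack_name' => stack.name } }

p with_each.equal?(stacks)
p with_map
```

So each hands back the very same collection it was called on, which is why the console showed the Stack records instead of hashes.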
So I have this code in my index action, would love to move it to a model, just a little confused on how to do it.
Original Code
def index
  urls = %w[http://cltampa.com/blogs/potlikker http://cltampa.com/blogs/artbreaker http://cltampa.com/blogs/politicalanimals http://cltampa.com/blogs/earbuds http://cltampa.com/blogs/dailyloaf http://cltampa.com/blogs/bedpost]
  @final_images = []
  @final_urls = []
  urls.each do |url|
    blog = Nokogiri::HTML(open(url))
    images = blog.xpath('//*[@class="postBody"]/div[1]//img/@src')
    images.each do |image|
      @final_images << image
    end
    story_path = blog.xpath('//*[@class="postTitle"]/a/@href')
    story_path.each do |path|
      @final_urls << path
    end
  end
end
I tested this code in my model and it works perfectly for one url, just not sure how to integrate all of the urls like the original code.
New Code
Model
class Photocloud < ActiveRecord::Base
  attr_reader :url, :data
  def initialize(url)
    @url = url
  end
  def data
    @data ||= Nokogiri::HTML(open(url))
  end
  def get_elements(path)
    data.xpath(path)
  end
end
Controller
def index
  @scraper = Photocloud.new('http://cltampa.com/blogs/artbreaker')
  @photos = @scraper.get_elements('//*[@class="postBody"]/div[1]//img/@src')
  @story_urls = @scraper.get_elements('//*[@class="postBody"]/div[1]//img/@src')
end
My main question is how I would initialize multiple URLs and loop through them like my original code. I have tried different things but feel like I have hit a wall. I need to save them to the database, but would like to get this working first. Any help is greatly appreciated.
Updated Controller - WIP
def index
  start_urls = %w[http://cltampa.com/blogs/potlikker
                  http://cltampa.com/blogs/artbreaker
                  http://cltampa.com/blogs/politicalanimals
                  http://cltampa.com/blogs/earbuds
                  http://cltampa.com/blogs/dailyloaf
                  http://cltampa.com/blogs/bedpost]
  @scraper = Photocloud.new(start_urls)
  @images =
  @paths =
end
Need some help with this part...
It seems that you don't persist the scraped images and paths to the database, so Photocloud doesn't need to inherit from ActiveRecord::Base. It can be just a plain old Ruby object (PORO):
require 'nokogiri'
require 'open-uri'
class Photocloud
  attr_reader :start_urls
  attr_accessor :images, :paths
  def initialize(start_urls)
    @start_urls = start_urls
    @images = []
    @paths = []
  end
  def scrape
    start_urls.each do |start_url|
      # Use the block variable here; URI.open is needed on Ruby 3.0+.
      blog = Nokogiri::HTML(URI.open(start_url))
      scrape_images(blog)
      scrape_paths(blog)
    end
  end
  private
  def scrape_images(blog)
    # Named found_images so the local does not shadow the images accessor.
    found_images = blog.xpath('//*[@class="postBody"]/div[1]//img/@src')
    found_images.each do |image|
      images << image
    end
  end
  def scrape_paths(blog)
    story_paths = blog.xpath('//*[@class="postTitle"]/a/@href')
    story_paths.each do |path|
      paths << path
    end
  end
end
In controller:
scraper = Photocloud.new(start_urls)
scraper.scrape
@images = scraper.images
@paths = scraper.paths
This is only one of the ways you could structure the code, of course.
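As one illustration of another structure, here is a rough sketch that separates the looping from the fetching so it can be tried without hitting the network. PageScraper and the fetcher lambda are hypothetical names, not part of the code above; in a real app the fetcher would wrap Nokogiri and open-uri:

```ruby
# Sketch only: PageScraper and its fetcher argument are hypothetical.
class PageScraper
  attr_reader :images, :paths

  # fetcher is anything that responds to call(url) and returns a hash
  # with :images and :paths arrays for that page.
  def initialize(start_urls, fetcher)
    @start_urls = start_urls
    @fetcher = fetcher
    @images = []
    @paths = []
  end

  # Visits every URL and accumulates the results; returns self for chaining.
  def scrape
    @start_urls.each do |url|
      page = @fetcher.call(url)
      @images.concat(page[:images])
      @paths.concat(page[:paths])
    end
    self
  end
end

# A stub fetcher that makes up one image and one path per URL.
stub = ->(url) { { images: ["#{url}/photo.jpg"], paths: ["#{url}/story"] } }

scraper = PageScraper.new(%w[http://example.com/a http://example.com/b], stub).scrape
puts scraper.images
puts scraper.paths
```

Passing the fetcher in makes it possible to exercise the loop in a test with canned data, which may or may not be worth the extra indirection for a small scraper.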
I am reviewing a piece of code from a Rails project and I came across the tap method. What does it do?
Also, it would be great if someone could help me understand what the rest of the code does:
def self.properties_container_to_object properties_container
  {}.tap do |obj|
    obj['vid'] = properties_container['vid'] if properties_container['vid']
    obj['canonical-vid'] = properties_container['canonical-vid'] if properties_container['canonical-vid']
    properties_container['properties'].each_pair do |name, property_hash|
      obj[name] = property_hash['value']
    end
  end
end
Thanks!
.tap is here to "perform operations on intermediate results within a chain of methods" (quoting ruby-doc).
In other words, object.tap allows you to manipulate object and to return it after the block:
{}.tap{ |hash| hash[:video] = 'Batmaaaaan' }
# => return the hash itself with the key/value video equal to 'Batmaaaaan'
So you can do stuff like this with .tap:
{}.tap{ |h| h[:video] = 'Batmaaan' }[:video]
# => returns "Batmaaan"
Which is equivalent to:
h = {}
h[:video] = 'Batmaaan'
return h[:video]
An even better example:
user = User.new.tap{ |u| u.generate_dependent_stuff }
# user is equal to the User's instance, not equal to the result of `u.generate_dependent_stuff`
Your code:
def self.properties_container_to_object(properties_container)
  {}.tap do |obj|
    obj['vid'] = properties_container['vid'] if properties_container['vid']
    obj['canonical-vid'] = properties_container['canonical-vid'] if properties_container['canonical-vid']
    properties_container['properties'].each_pair do |name, property_hash|
      obj[name] = property_hash['value']
    end
  end
end
It returns a Hash that is filled in by the .tap block.
The long version of your code would be:
def self.properties_container_to_object(properties_container)
  hash = {}
  hash['vid'] = properties_container['vid'] if properties_container['vid']
  hash['canonical-vid'] = properties_container['canonical-vid'] if properties_container['canonical-vid']
  properties_container['properties'].each_pair do |name, property_hash|
    hash[name] = property_hash['value']
  end
  hash
end
Tap is a Ruby method from the Object class.
This method yields x to the block and then returns x. This method is used to "tap into" a method chain, to perform operations on intermediate results within the chain.
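A short sketch of "tapping into" a chain, using only core Ruby. The values are arbitrary and just for illustration:

```ruby
# tap yields the receiver to the block and then returns the receiver,
# so the chain continues with the sorted array, not with what puts returns.
result = [3, 1, 2]
  .sort
  .tap { |sorted| puts "after sort: #{sorted.inspect}" }
  .map { |n| n * 10 }

# The block's own return value (:ignored here) is thrown away.
hash = {}.tap do |h|
  h[:video] = 'Batman'
  :ignored
end
```

This is why tap is handy for logging or for building up an object in place: whatever the block evaluates to never leaks into the rest of the expression.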
I have an XPath query which takes array elements and outputs the results using Axlsx. I need to tidy up my output for certain conditions, one of which is 'Software included'.
My xpath scrapes the following URL http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1
A sample of my code is below:
clues = Array.new
clues << 'Optical drive'
clues << 'Pointing device'
clues << 'Software included'
selector = "//td[text()='%s']/following-sibling::td"
data = clues.map do |clue|
  xpath = selector % clue
  [clue, doc.at(xpath).text.strip]
end
Axlsx::Package.new do |p|
  p.workbook.add_worksheet do |sheet|
    data.each { |datum| sheet.add_row datum }
  end
  p.serialize 'output.xlsx'
end
My Current output formatting
My Desired output formatting
If you can rely on the data always using ';' for separators, have a go at this:
data = []
clues.each do |clue|
  xpath = selector % clue
  details = doc.at(xpath).text.strip.split(';')
  data << [clue, details.pop]
  details.each { |detail| data << ['', detail] }
end
to generate the data before the Axlsx::Package.new block
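Here is the row-building idea on its own, as a sketch with made-up cell text (the real string would come from doc.at(xpath).text). It uses shift so that the first item sits beside the label; pop, as in the snippet above, would place the last item there instead:

```ruby
# Hypothetical cell text with ';' separators and uneven spacing.
raw = 'Item A; Item B ;Item C'
clue = 'Software included'

# Split on the separator and trim the whitespace around each piece.
details = raw.split(';').map(&:strip)

rows = []
# First detail goes on the same row as the label.
rows << [clue, details.shift]
# Remaining details get rows of their own with an empty label cell.
details.each { |detail| rows << ['', detail] }

rows.each { |row| p row }
```

Each inner array is one spreadsheet row, so rows can be passed to sheet.add_row one at a time exactly like data in the original loop.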
In answer to your comment/question: you do it with something like this ;)
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'axlsx'
class Scraper
  def initialize(url, selector)
    @url = url
    @selector = selector
  end
  def hooks
    @hooks ||= {}
  end
  def add_hook(clue, p_roc)
    hooks[clue] = p_roc
  end
  def export(file_name)
    Scraper.clues.each do |clue|
      if detail = parse_clue(clue)
        output << [clue, detail.pop]
        detail.each { |datum| output << ['', datum] }
      end
    end
    serialize(file_name)
  end
  # Class method; kept above private because private has no effect on def self. methods.
  def self.clues
    @clues ||= ['Operating system', 'Processors', 'Chipset', 'Memory type', 'Hard drive', 'Graphics',
                'Ports', 'Webcam', 'Pointing device', 'Keyboard', 'Network interface', 'Chipset', 'Wireless',
                'Power supply type', 'Energy efficiency', 'Weight', 'Minimum dimensions (W x D x H)',
                'Warranty', 'Software included', 'Product color']
  end
  private
  def doc
    @doc ||= begin
      # URI.open is required on Ruby 3.0+.
      Nokogiri::HTML(URI.open(@url))
    rescue
      raise ArgumentError, 'Invalid URL - Nothing to parse'
    end
  end
  def output
    @output ||= []
  end
  def selector_for_clue(clue)
    @selector % clue
  end
  def parse_clue(clue)
    if element = doc.at(selector_for_clue(clue))
      # map, not each: each would hand back the unstripped strings.
      call_hook(clue, element) || element.inner_html.split('<br>').map(&:strip)
    end
  end
  def call_hook(clue, element)
    if hooks[clue].is_a? Proc
      value = hooks[clue].call(element)
      value.is_a?(Array) ? value : [value]
    end
  end
  def package
    @package ||= Axlsx::Package.new
  end
  def serialize(file_name)
    package.workbook.add_worksheet do |sheet|
      output.each { |datum| sheet.add_row datum }
    end
    package.serialize(file_name)
  end
end
scraper = Scraper.new("http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1", "//td[text()='%s']/following-sibling::td")
# define a custom action to take against any elements found.
os_parse = Proc.new do |element|
  element.inner_html.split('<br>').each(&:strip!).each(&:upcase!)
end
scraper.add_hook('Operating system', os_parse)
scraper.export('foo.xlsx')
And the FINAL answer is... a gem.
http://rubydoc.info/gems/ninja2k/0.0.2/frames
Controller:
@events = Event.all
@events.each { |e| e.user_subscribed = "someuser" }
@events.each { |e| puts "error" + e.user_subscribed }
I have attr_accessor :user_subscribed, but I get the error "can't convert nil into String" because e.user_subscribed evaluates to nil.
I'm using mongoid on the backend.
edit: this works, but it just copies the whole array.
@events = @events.map do |e|
  e.user_subscribed = "faaa"
  e
end
If you're not saving the @events to the database, user_subscribed won't persist unless you keep the objects in memory:
# End the block with e; a return here would try to exit the enclosing method.
@events_with_subscription = @events.map { |e| e.user_subscribed = "someuser"; e }
edited based on OP comments.
Sounds like it might be better to just output Event.user_subscribed(current_user) directly in the view, but if you wanted to load up all that data beforehand you could do:
@array_of_subscription_results = @events.map { |e| e.user_subscribed(current_user, some, other, var, required) }
As long as user_subscribed returns the values you are interested in, that's what map will load into @array_of_subscription_results.
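One possible explanation for the nil, offered as an assumption rather than a confirmed diagnosis: if Event.all returns a lazy query object that builds fresh objects every time it is iterated, values set in one each are gone by the next. The sketch below imitates that with two hypothetical classes (this Event and LazyQuery are stand-ins, not Mongoid itself):

```ruby
# Hypothetical stand-in for a model with a non-persisted accessor.
class Event
  attr_accessor :user_subscribed
end

# Hypothetical stand-in for a lazy query: every iteration yields new objects.
class LazyQuery
  include Enumerable

  def each
    3.times { yield Event.new }
  end
end

lazy = LazyQuery.new

# The assignment lands on objects that are discarded after this pass.
lazy.each { |e| e.user_subscribed = 'someuser' }
lost = lazy.map(&:user_subscribed)

# Materializing once with to_a keeps the same objects around.
events = lazy.to_a
events.each { |e| e.user_subscribed = 'someuser' }
kept = events.map(&:user_subscribed)

p lost
p kept
```

If that is what is happening, it would also explain why the map version in the question works: map returns a plain Array holding the very objects that were just modified.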