Iterating in Nokogiri with Rails doesn't work

So I wrote some Nokogiri code that works in a test .rb file, but when I put it inside a Rails app model it won't iterate and just returns the first value. Here is the code that iterates correctly:
require "rubygems"
require "open-uri"
require "nokogiri"
url = "http://www.ebay.com/sch/Cars-Trucks-/6001/i.html?_from=R40&_sac=1&_vxp=mtr&_nkw=car+projects&_ipg=200&rt=nc"
data = Nokogiri::HTML(open(url))
data.css(".li").each do |item|
  item_link = item.at_css(".vip")[:href]
  item_doc = Nokogiri::HTML(open(item_link))
  puts item_doc.at_css("#itemTitle").text.sub! 'Details about', ''
end
Here is the same code in a rails app that only returns the first title it finds:
require "rubygems"
require "open-uri"
require "nokogiri"
class EbayScraper
  attr_accessor :url, :data
  def initialize(url)
    @url = url
  end
  def data
    @data ||= Nokogiri::HTML(open(@url))
  end
  def titles
    data.css(".li").each do |item|
      item_link = item.at_css(".vip")[:href]
      item_data = Nokogiri::HTML(open(item_link))
      return item_data.at_css("#itemTitle").text.sub! 'Details about', ''
    end
  end
end
ebay = EbayScraper.new("http://www.ebay.com/sch/Cars-Trucks-/6001/i.html?_from=R40&_sac=1&_vxp=mtr&_nkw=car+projects&_ipg=200&rt=nc")
titles = ebay.titles
puts titles
Why does the first snippet iterate through the whole list while the second one just returns the first title?
Thanks for your time in advance!

Because you have a return statement inside your loop. The first time the block runs, return exits the whole titles method, so the remaining items are never visited.
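A minimal sketch of the difference, using a hypothetical array of strings in place of the scraped pages so it runs without network access. The method names and sample titles are assumptions for illustration only. It swaps each for map, which collects one value per item, and uses sub rather than sub!, since sub! returns nil when nothing was replaced.

```ruby
# Hypothetical stand-in for the titles scraped from each item page.
RAW_TITLES = [
  "Details about 1967 Mustang project",
  "Details about 1972 Chevy C10",
  "Barn find Datsun"
].freeze

# Mirrors the question: `return` inside the block exits the method
# during the first iteration, so only one title ever comes back.
def first_title_only(raw_titles)
  raw_titles.each do |raw|
    return raw.sub("Details about", "").strip
  end
end

# `map` builds an array with one cleaned-up title per item.
def all_titles(raw_titles)
  raw_titles.map { |raw| raw.sub("Details about", "").strip }
end

puts first_title_only(RAW_TITLES)
puts all_titles(RAW_TITLES)
```

In the scraper itself, the same idea would mean replacing each with map in titles and dropping the return, so the method's value is the array of titles.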

Related

Rails + Sidekiq not recognizing class

I have a CsvImport service object in my app/services and I'm trying to call one of the class methods from within a Worker.
class InventoryUploadWorker
include Sidekiq::Worker
def perform(file_path, company_id)
CsvImport.csv_import(file_path, Company.find(company_id))
end
end
But the worker doesn't seem to know what the class is. I've tried require 'csv_import', to no avail.
Here's where it breaks:
WARN: ArgumentError: undefined class/module CsvImport
The method being called is in csv_import.rb:
class CsvImport
  require "benchmark"
  require 'csv'
  def self.csv_import(filename, company)
    time = Benchmark.measure do
      File.open(filename) do |file|
        headers = file.first
        file.lazy.each_slice(150) do |lines|
          Part.transaction do
            inventory = []
            insert_to_parts_db = []
            rows = CSV.parse(lines.join, write_headers: true, headers: headers)
            rows.map do |row|
              part_match = Part.find_by(part_num: row['part_num'])
              new_part = build_new_part(row['part_num'], row['description']) unless part_match
              quantity = row['quantity'].to_i
              row.delete('quantity')
              row["condition"] = match_condition(row)
              quantity.times do
                part = InventoryPart.new(
                  part_num: row["part_num"],
                  description: row["description"],
                  condition: row["condition"],
                  serial_num: row["serial_num"],
                  company_id: company.id,
                  part_id: part_match ? part_match.id : new_part.id
                )
                inventory << part
              end
            end
            InventoryPart.import inventory
          end
        end
      end
    end
    puts time
  end
Your requires are inside the class body. Move them to the top of the file, outside the class, so they run as soon as the file is loaded.
Instead of
class CsvImport
require "benchmark"
require 'csv'
...
Do this
require "benchmark"
require 'csv'
class CsvImport
...
Try adding this to application.rb:
config.autoload_paths += Dir["#{config.root}/app/services"]
See the Rails documentation on autoload paths for more details.

Mechanize in Module, NameError 'agent'

Looking for advice on how to fix this error and refactor this code to improve it.
require 'mechanize'
require 'pry'
require 'pp'
module Mymodule
  class WebBot
    agent = Mechanize.new { |agent|
      agent.user_agent_alias = 'Windows Chrome'
    }
    def form(response)
      require "addressable/uri"
      require "addressable/template"
      template = Addressable::Template.new("http://www.domain.com/{?query*}")
      url = template.expand({"query" => response}).to_s
      page = agent.get(url)
    end
    def get_products
      products = []
      page.search("datatable").search('tr').each do |row|
        begin
          product = row.search('td')[1].text
        rescue => e
          p e.message
        end
        products << product
      end
      products
    end
  end
end
Calling the module:
response = {size: "SM", color: "BLUE"}
t = Mymodule::WebBot.new
t.form(response)
t.get_products
Error:
NameError: undefined local variable or method `agent'
Ruby uses naming conventions to decide a variable's scope. agent is a local variable in the class body, and each def starts a fresh scope, so the methods cannot see it. To make it visible to other methods you could make it a class variable by naming it @@agent, and it would be shared among all the objects of WebBot. The preferred way, though, is to make it an instance variable by naming it @agent, so every object of WebBot has its own @agent. Set it in initialize, which is invoked when you create a new object with new:
class WebBot
  def initialize
    @agent = Mechanize.new do |a|
      a.user_agent_alias = 'Windows Chrome'
    end
  end
  .....
The same error will occur with page. You defined it in form as a local variable, so it is gone once form finishes executing. Make it an instance variable as well. You don't have to put this one in initialize; you can assign it in form, and the object will have its own @page after form has been called. Do this in form:
def form(response)
  require "addressable/uri"
  require "addressable/template"
  template = Addressable::Template.new("http://www.domain.com/{?query*}")
  url = template.expand({"query" => response}).to_s
  @page = @agent.get(url)
end
And remember to change every occurrence of page and agent to @page and @agent. In your get_products, for example:
def get_products
  products = []
  @page.search("datatable").search('tr').each do |row|
  .....
These changes will resolve the name errors. Refactoring is another story btw.
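The scoping rule can be seen without Mechanize at all. The sketch below is an illustration only: BrokenBot, WorkingBot, and agent_name are hypothetical names, and a plain string stands in for the Mechanize object.

```ruby
class BrokenBot
  # Local to the class body; method bodies below cannot see it.
  agent = "class-body local"

  def agent_name
    agent # `def` opens a new scope, so this raises NameError
  end
end

class WorkingBot
  def initialize
    # Instance variable: visible to every method on this object.
    @agent = "per-instance agent"
  end

  def agent_name
    @agent
  end
end

begin
  BrokenBot.new.agent_name
rescue NameError => e
  puts "BrokenBot failed: #{e.class}"
end

puts WorkingBot.new.agent_name
```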

Adding some Regex code to my Ruby webscraper

I'm building a webscraper and using Nokogiri. Here is the code that I currently have:
require 'nokogiri'
require 'open-uri'
require 'pry'
class Scraper
  def get_page
    doc = Nokogiri::HTML(open("http://www.theskimm.com/recent"))
    h = {}
    doc.xpath('//a[@href]').each do |link|
      h[link.text.strip] = link['href']
    end
    puts h
  end
  binding.pry
end
Scraper.new.get_page
This returns me a hash of all URLs on the page (I only pasted the first few lines):
{"Back to Sign Up"=>"/", "SHARE THIS"=>"https://www.facebook.com/sharer/sharer.php?u=http%3A%2F%2Fwww.theskimm.com%2F2015%2F12%2F07%2Fskimm-for-december-8th-2&display=popup", "theSkimm\nSkimm for December 8th"=>"/", "Trump campaign press release"=>"http://skimmth.is/1SKR0bP", "assault weapons ban"=>"http://skimmth.is/1QbnCO8"}
However, I'd like to grab only the URLs that contain "http://skimmth.is/" as part of the value. What code or regular expression would I need to add to my original Scraper class to select ONLY URLs with that address?
You can use XPath's contains() function:
doc.xpath('//a[contains(@href, "http://skimmth.is/")]').map{|e| e.attr(:href)}
=> ["http://skimmth.is/1SKR0bP",
"http://skimmth.is/1QbnCO8",
"http://skimmth.is/1SHBSff",
"http://skimmth.is/1N8dORo",
"http://skimmth.is/1HRwGoO",
"http://skimmth.is/1HRmEUG",
"http://skimmth.is/1NePsmI",
"http://skimmth.is/1IQoJLn",
"http://skimmth.is/1ToQ6T1",
"http://skimmth.is/1IAZ6mW",
"http://skimmth.is/1N7Foy1",
"http://skimmth.is/1m7B6Op",
"http://skimmth.is/1SKBhJW",
"http://skimmth.is/1ToQ6T1",
"http://skimmth.is/1XfpwkX%20",
"http://skimmth.is/1P9rq20"]
You can use if as a statement modifier to check that the value is appropriate before adding it to the hash. For example, update this line:
h[link.text.strip] = link['href']
to
h[link.text.strip] = link['href'] if link['href'] =~ /http:\/\/skimmth\.is\//
FWIW: =~ is Ruby's pattern-match operator. It returns the index of the first match, or nil if there is none.
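If the hash has already been built, the same regex test can be applied afterwards. This is a rough sketch on a literal hash copied from the output shown in the question, so it runs without fetching the page; the constant name SKIMM_LINK and the variable names are assumptions, and %r{...} is used only to avoid escaping the slashes.

```ruby
# Sample of the link hash printed in the question.
links = {
  "Back to Sign Up" => "/",
  "SHARE THIS" => "https://www.facebook.com/sharer/sharer.php?u=http%3A%2F%2Fwww.theskimm.com%2F2015%2F12%2F07%2Fskimm-for-december-8th-2&display=popup",
  "Trump campaign press release" => "http://skimmth.is/1SKR0bP",
  "assault weapons ban" => "http://skimmth.is/1QbnCO8"
}

# The dot is escaped so it matches a literal "." only.
SKIMM_LINK = %r{http://skimmth\.is/}

# Hash#select keeps the pairs for which the block is truthy;
# =~ gives an index (truthy) on a match and nil otherwise.
skimm_links = links.select { |_text, href| href =~ SKIMM_LINK }

puts skimm_links
```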

web scraping/export to CSV with Ruby

ruby n00b here in hope of some guidance. I am looking to scrape a website (600-odd names and links on one page) and output to CSV. The scraping itself works fine (the output correctly fills the terminal as the script runs), but I can't get the CSV to populate. The code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'
url = "http://www.example.com/page/"
page = Nokogiri::HTML(open(url))
page.css('.item').each do |item|
  name = item.at_css('a').text
  link = item.at_css('a')[:href]
  foo = puts "#{name}"
  bar = "#{link}"
  CSV.open("file.csv", "wb") do |csv|
    csv << [foo, bar]
  end
end
puts "upload complete!"
Replacing csv << [foo, bar] with csv << [name, link] just puts the final iteration into the CSV. I feel there's something basic I am missing here. Thanks for reading.
The problem is that you're calling CSV.open for every single item. Mode "wb" truncates the file each time it is opened, so every iteration overwrites the previous row, and at the end you're left with only the last item in the CSV file. (Separately, foo = puts "#{name}" stores nil, because puts returns nil, so writing name and link directly is the right call.)
Move the CSV.open call outside page.css('.item').each, so the file is opened once, and it should work:
CSV.open("file.csv", "wb") do |csv|
  page.css('.item').each do |item|
    name = item.at_css('a').text
    link = item.at_css('a')[:href]
    csv << [name, link]
  end
end
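The open-once pattern can be checked without scraping anything. The following is a sketch under stated assumptions: the rows array is made-up sample data, write_rows is a hypothetical helper, and the file goes to a temporary directory rather than file.csv.

```ruby
require "csv"
require "tmpdir"

# Made-up [name, link] pairs standing in for the scraped items.
rows = [
  ["Alice", "http://www.example.com/alice"],
  ["Bob", "http://www.example.com/bob"],
  ["Carol", "http://www.example.com/carol"]
]

# Opens the file a single time and appends every row inside the block,
# so earlier rows are not truncated away by a later open.
def write_rows(path, rows)
  CSV.open(path, "wb") do |csv|
    rows.each { |row| csv << row }
  end
end

path = File.join(Dir.mktmpdir, "file.csv")
write_rows(path, rows)

puts File.read(path)
```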

How to use a method in a lib?

I'm working to integrate UserVoice Single Sign-On with my Rails app. They provide the following class for Ruby:
require 'rubygems'
require 'ezcrypto'
require 'json'
require 'cgi'
require 'base64'
module Uservoice
  class Token
    attr_accessor :data
    USERVOICE_SUBDOMAIN = "FILL IN"
    USERVOICE_SSO_KEY = "FILL IN"
    def initialize(options = {})
      options.merge!({:expires => (Time.zone.now.utc + 5 * 60).to_s})
      key = EzCrypto::Key.with_password USERVOICE_SUBDOMAIN, USERVOICE_SSO_KEY
      encrypted = key.encrypt(options.to_json)
      @data = Base64.encode64(encrypted).gsub(/\n/,'') # Remove the line breaks that are annoyingly inserted every 60 characters
    end
    def to_s
      @data
    end
  end
end
What I can't figure out is how to use this. I added this file to my lib directory and am using Rails Console to run. I tried:
1.9.3-p125 :013 > Uservoice::Token
=> Uservoice::Token
But I can't get it to return anything useful when I pass in the options:
Uservoice::Token.new(:guid => 1, :display_name => "jeff goldmen", :email => "jeff@google.com")
Any ideas how to actually use this? Thanks
Looking at the code, it doesn't appear that the initializer (what gets run when you call new) will take just a hash. The method definition looks like this:
def initialize(key, api_key, data)
And it seems to treat the data variable as a hash. You might just need to add the key and api_key values when you instantiate a Token. So a call would look like this:
Uservoice::Token.new(KEY, API_KEY, {guid:1, display_name:'foo', email:'f@b.com'})
