Inputting scraped data into database - ruby-on-rails

Heyo,
So I built a working scraper and added the file to my app. I am now trying to take the information in the scraper and place it in my database. I am attempting to use the find_or_create method but I keep getting the following error.
breads_scraper.rb:49:in `block in summary': uninitialized constant Scraper::Bread (NameError)
from /Users/Cameron/.rvm/gems/ruby-1.9.3-p392/gems/nokogiri-1.5.9/lib/nokogiri/xml/node_set.rb:239:in `block in each'
from /Users/Cameron/.rvm/gems/ruby-1.9.3-p392/gems/nokogiri-1.5.9/lib/nokogiri/xml/node_set.rb:238:in `upto'
from /Users/Cameron/.rvm/gems/ruby-1.9.3-p392/gems/nokogiri-1.5.9/lib/nokogiri/xml/node_set.rb:238:in `each'
from breads_scraper.rb:24:in `map'
from breads_scraper.rb:24:in `summary'
from breads_scraper.rb:57:in `<class:Scraper>'
from breads_scraper.rb:9:in `<main>'
My code looks like the following. My theory is that I am using find_or_create incorrectly, or the file doesn't know how to reach the bread method and controller.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
require 'json'
url = Nokogiri::HTML(open("http://en.wikipedia.org/wiki/List_of_breads"))
class Scraper
def initialize
@url = "http://en.wikipedia.org/wiki/List_of_breads"
@nodes = Nokogiri::HTML(open(@url))
end
def summary
bread_data = @nodes
breads = bread_data.css('div.mw-content-ltr table.wikitable tr')
bread_data.search('sup').remove
bread_hashes = breads.map {|x|
if content = x.css('td')[0]
name = content.text
end
if content = x.css('td a.image').map {|link| link ['href']}
image =content[0]
end
if content = x.css('td')[2]
type = content.text
end
if content = x.css('td')[3]
country = content.text
end
if content = x.css('td')[4]
description =content.text
end
{
:name => name,
:image => image,
:type => type,
:country => country,
:description => description,
}
Bread.find_or_create(:title => name, :description => description, :image_url => image, :country_origin => country, :type => type)
}
end
bready = Scraper.new
bready.summary
puts "atta boy"
end
Thanks!

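Invoke the scraper from a rake task that depends on :environment, so that the Rails environment (and with it the Bread model) is loaded before the scraper runs.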
lib/tasks/scraper.rake
namespace :app do
desc "Scrape breads"
task :scrape_breads => :environment do
Scraper.new.summary
end
end
Now, you can run the rake task as follows:
rake app:scrape_breads

It looks like the Bread class is not loaded.
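To see why the error names Scraper::Bread rather than plain Bread, here is a minimal plain-Ruby sketch with no Rails involved. The empty Bread class is only a stand-in for the real model and missing_constant_for is a hypothetical helper. In the actual app it is loading the Rails environment that makes Bread available.

```ruby
class Scraper
  def summary
    # Ruby looks for Bread relative to Scraper first, then at the top level.
    Bread.name
  end
end

# Hypothetical helper: returns the name of the constant Ruby could not
# resolve while running the block, or nil if the block ran cleanly.
def missing_constant_for
  yield
  nil
rescue NameError => e
  e.name
end

before = missing_constant_for { Scraper.new.summary }

# Stand-in for the Rails model. In the app, the :environment prerequisite
# of the rake task is what loads the real class.
class Bread; end

after = missing_constant_for { Scraper.new.summary }

puts before.inspect # :Bread
puts after.inspect  # nil
```

Once the constant exists at the top level, the same method body resolves it without any change to Scraper.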

Related

How to speed up sitemap_generator with parallel gem

I am trying to speed up sitemap_generator by adding parallelization via the parallel gem. I have the following code but my groups aren't getting written to the public/sitemaps directory. I am thinking it's due to lambdas getting executed in a different space in parallel. Any feedback would be helpful. Thanks!
#!/usr/bin/env ruby
require 'rubygems'
require 'sitemap_generator'
require 'benchmark'
require 'parallel'
require 'random-word'
SitemapGenerator::Sitemap.default_host = "http://localhost"
a = lambda {
SitemapGenerator::Sitemap.group(:filename => :biz, :sitemaps_path => 'sitemaps/biz/') do
(1..1000).each do |index|
url = "/#{RandomWord.adjs.next}/#{RandomWord.nouns.next}"
add url, :priority => 0.8
end
end
}
b = lambda {
SitemapGenerator::Sitemap.group(:filename => :wedding_ugc, :sitemaps_path => 'sitemaps/ugc') do
(1..1000).each do |index|
url = "/#{RandomWord.adjs.next}/#{RandomWord.nouns.next}"
add url, :priority => 0.8
end
end
}
#working example
# SitemapGenerator::Sitemap.default_host = "http://localhost"
# SitemapGenerator::Sitemap.create(:compress => false) do
# group(:filename => :biz, :sitemaps_path => 'sitemaps/biz/') do
# (1..1000).each do |index|
# url = "/#{RandomWord.adjs.next}/#{RandomWord.nouns.next}"
# add url, :priority => 0.8
# end
# end
# end
puts Time.now
Parallel.each([a,b]){|job| job.call()}
puts Time.now
I got this working and posted the solution on GitHub here.
Here is the code in case the URL gets broken.
SitemapGenerator::Sitemap.create(:compress => false, :create_index => false) do
group1 = lambda {
group = sitemap.group(:filename => :group1, :sitemaps_path => 'sitemaps/group1') do
Record.find_each do |record|
add '/record/path'
end
end
group.sitemap.write unless group.sitemap.written? #write if not full
}
# group2 like above...
Parallel.each([group1, group2], :in_processes => 8) do |group|
group.call
end
end
#regenerate the index sitemap xml file because I couldn't figure out how to track it with multiple processes
SitemapGenerator::Sitemap.create(:compress => false) do
Dir.chdir(sitemap.public_path.to_s)
xml_files = File.join("**", "sitemaps", "**", "*.xml")
xml_file_paths = Dir.glob(xml_files)
xml_file_paths.each do |file|
add file
end
end
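The reason each group has to write its own file inside the lambda is that, with :in_processes, the work runs in forked child processes, and changes made to in-memory objects there never reach the parent. A minimal sketch of that behaviour using only fork and a pipe (Unix only). run_in_child and the urls array are hypothetical stand-ins, not part of sitemap_generator or the parallel gem.

```ruby
# Runs the given block in a forked child process and returns, as a String,
# whatever the block evaluated to in the child.
def run_in_child
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.write(yield.to_s)
    writer.close
  end
  writer.close
  result = reader.read
  reader.close
  Process.wait(pid)
  result
end

urls = [] # stand-in for sitemap state held in memory

child_count = run_in_child do
  urls << "/record/path" # modifies the child's copy only
  urls.size
end

puts child_count # 1
puts urls.size   # 0
```

Anything the parent needs afterwards has to be written to disk or sent back explicitly, which is what the group.sitemap.write call and the index regeneration step above take care of.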

Handle failure in rake task without aborting

I have this rake task:
desc "get product image urls from ItemMaster"
task :get_product_image_urls => :environment do
require 'item_master'
ItemMaster.get_image
end
It calls this API method that iterates over several thousand database items like so:
class ItemMaster
include HTTParty
format :xml
base_uri 'https://api.myapi.com/v2'
def self.get_image
@items = Item.all
@items.each do |item|
response = get("/item?upc=#{item.upc}&epl=100&ef=png", :headers => {"username" => "myname", "password" => "mypass"})
image_link = response["items"]["item"]["media"]["medium"]["url"]
item_image = ItemImage.where(:upc => item.upc).first_or_create
item_image.update_attributes(:url => "#{image_link}")
end
end
end
The rake task starts up when I call it, until it hits this error about 22 items in:
undefined method `[]' for nil:NilClass
/Users/name/Rails/SG/lib/item_master.rb:12:in `block in get_image'
/Users/name/Rails/SG/lib/item_master.rb:10:in `each'
/Users/name/Rails/SG/lib/item_master.rb:10:in `get_image'
/Users/name/Rails/SG/lib/tasks/get_product_image_urls.rake:4:in `block in <top (required)>'
/Users/name/.rvm/gems/ruby-1.9.3-p429/bin/ruby_noexec_wrapper:14:in `eval'
/Users/name/.rvm/gems/ruby-1.9.3-p429/bin/ruby_noexec_wrapper:14:in `<main>'
Line 12 is this guy: image_link = response["items"]["item"]["media"]["medium"]["url"] so I'm thinking that probably a url is missing in the api and it's causing the rake task to fail. Is there a way to move past an error like this and continue on with the rest of the rake task? Thanks in advance!
The ItemMaster.get_image method will need to be edited to either not create the exception condition in the first place, or to rescue on a proper exception and move on. For more on exception handling: http://www.tutorialspoint.com/ruby/ruby_exceptions.htm
An example would be:
def self.get_image
@items = Item.all
@items.each do |item|
response = get("/item?upc=#{item.upc}&epl=100&ef=png", :headers => {"username" => "myname", "password" => "mypass"})
begin
image_link = response["items"]["item"]["media"]["medium"]["url"]
item_image = ItemImage.where(:upc => item.upc).first_or_create
item_image.update_attributes(:url => "#{image_link}")
rescue NoMethodError => ex
Rails.logger.error "Failed to locate image link ..." # Customize this to your liking
end
end
end
end
For extra goodness, consider handling the code for each item within a separate method so you can isolate responsibility for handling item-related code within the Item model itself!
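As a rough sketch of that idea, without Rails or HTTParty: the lookup lives in its own small method that rescues the failure, and the loop simply skips items without a link. The method names and sample data are made up for illustration, and each response is assumed to behave like a plain nested Hash.

```ruby
# Returns the medium image URL from one parsed response, or nil when any
# level of the expected structure is missing.
def image_link_for(response)
  response["items"]["item"]["media"]["medium"]["url"]
rescue NoMethodError
  nil
end

# Builds a { upc => url } hash, skipping items that have no image link
# instead of aborting the whole run.
def image_links(responses)
  responses.each_with_object({}) do |(upc, response), links|
    link = image_link_for(response)
    if link.nil?
      warn "No image link for UPC #{upc}, skipping"
      next
    end
    links[upc] = link
  end
end

# Hypothetical sample data: the second item has no media at all.
responses = {
  "111" => { "items" => { "item" => { "media" => { "medium" => { "url" => "http://example.com/a.png" } } } } },
  "222" => { "items" => nil },
  "333" => { "items" => { "item" => { "media" => { "medium" => { "url" => "http://example.com/c.png" } } } } }
}

links = image_links(responses)
puts links.keys.inspect # ["111", "333"]
```

On Ruby 2.3 or later, Hash#dig("items", "item", "media", "medium", "url") returns nil for a missing level and would avoid the rescue altogether when the response is a plain Hash.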

Relationships created by a rake task are not persisted though the rails server

I'm working my first project using Neo4j. I'm parsing wikipedia's page and pagelinks dumps to create a graph where the nodes are pages and the edges are links.
I've defined some rake tasks that download the dumps, parse the data, and save it in a Neo4j database. At the end of the rake task I print the number of pages and links created, and some of the pages with the most links. Here is the output of the rake task for the zawiki.
$ rake wiki[zawiki]
[ omitted ]
...
:: Done parsing zawiki
:: 1984 pages
:: 2144 links
:: The pages with the most links are:
9625.0 - Emijrp/List_of_Wikipedians_by_number_of_edits_(bots_included): 40
1363.0 - Gvangjsih_Bouxcuengh_Swcigih: 30
9112.0 - Fuzsuih: 27
1367.0 - Cungzcoj: 26
9279.0 - Vangz_Yenfanh: 19
It looks like pages and links are being created, but when I start a rails console or the server, the links aren't found.
$ rails c
jruby-1.7.5 :013 > Pages.all.count
=> 1984
jruby-1.7.5 :003 > Pages.all.reduce(0) { |count, page| count + page.links.count}
=> 0
jruby-1.7.5 :012 > Pages.all.sort_by { |p| p.links.count }.reverse[0...5].map { |p| p.links.count }
=> [0, 0, 0, 0, 0]
Here is the rake task, and this is the project's GitHub page. Can anyone tell me why the links aren't saved?
DUMP_DIR = Rails.root.join('lib','assets')
desc "Download wiki dumps and parse them"
task :wiki, [:wiki] => 'wiki:all'
namespace :wiki do
task :all, [:wiki] => [:get, :parse] do |t, args|
# Print info about the newly created pages and links.
link_count = 0
Pages.all.each do |page|
link_count += page.links.count
end
indent "Done parsing #{args[:wiki]}"
indent "#{Pages.count} pages"
indent "#{link_count} links"
indent "The pages with the most links are:"
Pages.all.sort_by { |a| a.links.count }.reverse[0...5].each do |page|
puts "#{page.page_id} - #{page.title}: #{page.links.count}"
end
end
desc "Download wiki page and page links database dumps to /lib/assets"
task :get, :wiki do |t, args|
indent "Downloading dumps"
sh "#{Rails.root.join('lib', "get_wiki").to_s} #{args[:wiki]}"
indent "Done"
end
desc "Parse all dumps"
task :parse, [:wiki] => 'parse:all'
namespace :parse do
task :all, [:wiki] => [:pages, :pagelinks]
desc "Read wiki page dumps from lib/assets into the database"
task :pages, [:wiki] => :environment do |t, args|
parse_dumps('page', args[:wiki]) do |obj|
page = Pages.create_from_dump(obj)
end
indent "Created #{Pages.count} pages"
end
desc "Read wiki pagelink dumps from lib/assets into the database"
task :pagelinks, [:wiki] => :environment do |t, args|
errors = 0
parse_dumps('pagelinks', args[:wiki]) do |from_id, namespace, to_title|
from = Pages.find(:page_id => from_id)
to = Pages.find(:title => to_title)
if to.nil? || from.nil?
errors = errors.succ
else
from.links << to
from.save
end
end
end
end
end
def indent *args
print ":: "
puts args
end
def parse_dumps(dump, wiki_match, &block)
wiki_match ||= /\w+/
DUMP_DIR.entries.each do |file|
file, wiki = *(file.to_s.match(Regexp.new "(#{wiki_match})-#{dump}.sql"))
if file
indent "Parsing #{wiki} #{dump.pluralize} from #{file}"
each_value(DUMP_DIR.join(file), &block)
end
end
end
def each_value(filename)
f = File.open(filename)
num_read = 0
begin # read file until line starting with INSERT INTO
line = f.gets
end until line.match /^INSERT INTO/
begin
line = line.match(/\(.*\)[,;]/)[0] # ignore beginning of line until (...) object
begin
yield line[1..-3].split(',').map { |e| e.match(/^['"].*['"]$/) ? e[1..-2] : e.to_f }
num_read = num_read.succ
line = f.gets.chomp
end while(line[0] == '(') # until next insert block, or end of file
end while line.match /^INSERT INTO/ # Until line doesn't start with (...
f.close
end
app/models/pages.rb
class Pages < Neo4j::Rails::Model
include Neo4j::NodeMixin
has_n(:links).to(Pages)
property :page_id
property :namespace, :type => Fixnum
property :title, :type => String
property :restrictions, :type => String
property :counter, :type => Fixnum
property :is_redirect, :type => Fixnum
property :is_new, :type => Fixnum
property :random, :type => Float
property :touched, :type => String
property :latest, :type => Fixnum
property :length, :type => Fixnum
property :no_title_convert, :type => Fixnum
def self.create_from_dump(obj)
# TODO: I wonder if there is a way to combine these calls
page = {}
# order of this array is important, it corresponds to the data in obj
attrs = [:page_id, :namespace, :title, :restrictions, :counter, :is_redirect,
:is_new, :random, :touched, :latest, :length, :no_title_convert]
attrs.each_index { |i| page[attrs[i]] = obj[i] }
page = Pages.create(page)
return page
end
end
I must admit that I have no idea how Neo4j works.
Extrapolating from other databases, though, I assume that either some validation is failing or something is misconfigured in your use of the database. I can't give any advice on where to look for the latter, but if it's about validation, you can look at Pages#errors or try calling Pages#save! and see what it raises.
One crazy idea that came to mind while looking at this example is that, for the relation to be configured properly, you may need a back reference, too.
Maybe has_n(:links).to(Pages, :links) will help you. Or, if that doesn't work:
has_n(:links_left).to(Pages, :links_right)
has_n(:links_right).from(Pages, :links_left)
The more I look at this, the more I think the back reference to the same model is not configured properly and thus won't validate.

Storing image using open URI and paperclip having size less than 10kb

I want to import some icons from my old site. Each icon is smaller than 10kb, so when I try to import them, each one is saved as a stringio.txt file.
require "open-uri"
class Category < ActiveRecord::Base
has_attached_file :icon, :path => ":rails_root/public/:attachment/:id/:style/:basename.:extension"
def icon_from_url(url)
self.icon = open(url)
end
end
In the rake task:
category = Category.new
category.icon_from_url "https://xyz.com/images/dog.png"
category.save
Try:
def icon_from_url(url)
extname = File.extname(url)
basename = File.basename(url, extname)
file = Tempfile.new([basename, extname])
file.binmode
open(URI.parse(url)) do |data|
file.write data.read
end
file.rewind
self.icon = file
end
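The snippet above takes the base name and extension straight from the URL string, which goes wrong if the URL carries a query string. A small variation, as a sketch only: parse the URL first and name the tempfile after its path. tempfile_named_after is a hypothetical helper and does no downloading.

```ruby
require 'tempfile'
require 'uri'

# Creates an empty binary-mode Tempfile whose name keeps the base name and
# extension of the file the URL points at, ignoring any query string.
def tempfile_named_after(url)
  path = URI.parse(url).path
  extname = File.extname(path)
  basename = File.basename(path, extname)
  file = Tempfile.new([basename, extname])
  file.binmode
  file
end

file = tempfile_named_after("https://xyz.com/images/dog.png?size=small")
puts File.extname(file.path) # .png
file.close! # close and delete the temporary file
```

Because the tempfile keeps the .png extension, an attachment library that derives the file type from the name has something sensible to work with.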
To override the default filename of a "fake file upload" in Paperclip (stringio.txt on small files or an almost random temporary name on larger files) you have 2 main possibilities:
Define an original_filename on the IO:
def icon_from_url(url)
io = open(url)
io.original_filename = "foo.png"
self.icon = io
end
You can also get the filename from the URI:
io.original_filename = File.basename(URI.parse(url).path)
Or replace :basename in your :path:
has_attached_file :icon, :path => ":rails_root/public/:attachment/:id/:style/foo.png", :url => "/:attachment/:id/:style/foo.png"
Remember to always change the :url when you change the :path, otherwise the icon.url method will be wrong.
You can also define your own custom interpolations (e.g. :rails_root/public/:whatever).
You are almost there I think, try opening parsed uri, not the string.
require "open-uri"
class Category < ActiveRecord::Base
has_attached_file :icon, :path => ":rails_root/public/:attachment/:id/:style/:basename.:extension"
def icon_from_url(url)
self.icon = open(URI.parse(url))
end
end
Of course, this doesn't handle errors.
You can also disable OpenURI from ever creating a StringIO object, and force it to create a temp file instead. See this SO answer:
Why does Ruby open-uri's open return a StringIO in my unit test, but a FileIO in my controller?
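For background on where the 10kb threshold comes from, here is a small sketch that pokes at OpenURI::Buffer directly. Be aware that this is an internal class of the standard library's open-uri, so treat it purely as an illustration of the behaviour and not as something to rely on in application code. buffered_io_for is a hypothetical helper and no network access is involved.

```ruby
require 'open-uri'
require 'stringio'
require 'tempfile'

# Feeds the given number of bytes through open-uri's internal buffer and
# returns the IO object it ends up holding.
def buffered_io_for(bytes)
  buffer = OpenURI::Buffer.new
  buffer << ("x" * bytes)
  buffer.io
end

small = buffered_io_for(1_000)
large = buffered_io_for(20_000)

puts OpenURI::Buffer::StringMax # 10240
puts small.class                # StringIO
puts large.class                # Tempfile

large.close! # clean up the temporary file
```

A StringIO has no file name of its own, which is why a small download ends up with a generic name such as stringio.txt unless a name is supplied explicitly.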
In the past, I found the most reliable way to retrieve remote files was by using the command line tool "wget". The following code is mostly copied straight from an existing production (Rails 2.x) app with a few tweaks to fit with your code examples:
require 'open3'
require 'tempfile'

class CategoryIconImporter
  # Class-level state, initialised up front so the first reads below do not raise NameError.
  @@tempfile = nil
  @@wget = nil

  # Downloads the URL with wget and returns the path of the tempfile it was written to.
  def self.download_to_tempfile(url)
    system(wget_download_command_for(url))
    @@tempfile.path
  end

  def self.clear_tempfile
    @@tempfile.delete if @@tempfile && @@tempfile.path && File.exist?(@@tempfile.path)
    @@tempfile = nil
  end

  # Locates the wget binary once and caches its path.
  def self.set_wget
    # used for retrieval in NrlImage (and in future from other sites?)
    if !@@wget
      stdin, stdout, stderr = Open3.popen3('which wget')
      @@wget = stdout.gets
      @@wget ||= '/usr/local/bin/wget'
      @@wget.strip!
    end
  end

  # Builds the wget command line that writes the URL's content into a fresh tempfile.
  def self.wget_download_command_for(url)
    set_wget
    @@tempfile = Tempfile.new url.sub(/\?.+$/, '').split(/[\/\\]/).last
    command = [ @@wget ]
    command << '-q'
    if url =~ /^https/
      command << '--secure-protocol=auto'
      command << '--no-check-certificate'
    end
    command << '-O'
    command << @@tempfile.path
    command << url
    command.join(' ')
  end

  def self.import_from_url(category_params, url)
    clear_tempfile
    filename = url.sub(/\?.+$/, '').split(/[\/\\]/).last
    found = MIME::Types.type_for(filename)
    content_type = !found.empty? ? found.first.content_type : nil
    download_to_tempfile url
    nicer_path = RAILS_ROOT + '/tmp/' + filename
    # File.copy comes from ftools on Ruby 1.8; FileUtils.cp is the modern equivalent.
    File.copy @@tempfile.path, nicer_path
    Category.create(category_params.merge({:icon => ActionController::TestUploadedFile.new(nicer_path, content_type, true)}))
  end
end
The rake task logic might look like:
[
['Cat', 'cat'],
['Dog', 'dog'],
].each do |name, icon|
CategoryIconImporter.import_from_url({:name => name}, "https://xyz.com/images/#{icon}.png")
end
This uses the mime-types gem for content type discovery:
gem 'mime-types', :require => 'mime/types'

Feedzirra in Rails 3

I am trying to get feedzirra running on Rails 3, and I have tried several methods I found on the internet.
This is in my gemfile:
source 'http://gems.github.com'
gem 'loofah', '1.0.0.beta.1'
group :after_initialize do
gem 'pauldix-feedzirra'
end
And I've put this after bundle.setup in root.rb:
Bundler.require :after_initialize
And this is the code in my model (movie.rb)
class Movie < ActiveRecord::Base
def self.import_from_feed
feed = Feedzirra::Feed.fetch_and_parse("url-to.xml")
add_entries(feed.entries)
end
private
def self.add_entries(entries)
entries.each do |entry|
unless exists? :guid => entry.id
create!(
:title => entry.title,
:synopsis => entry.synopsis,
:cover => entry.cover,
:duration => entry.duration,
:channel => entry.channel,
:imdb_rating => entry.imdb_rating,
:imdb_votes => entry.imdb_votes,
:imdb_id => entry.imdb_votes
)
end
end
end
end
I try to run the import_from_feed function from the console and I keep getting this error:
>> Movie.import_from_feed
NameError: uninitialized constant Movie::Feedzirra
from /Users/myname/Ruby/appname/app/models/movie.rb:3:in `import_from_feed'
from (irb):1
Can someone help me out? Been trying for ages now!
Two things:
Add the gem at the top level of your Gemfile, not inside the :after_initialize group.
Use the feedzirra gem, not the old pauldix-feedzirra one.
