Find within the first 10? - ruby-on-rails

I'm using Nokogiri to screen-scrape contents of a website.
I set fetch_number to specify the number of <divs> that I want to retrieve. For example, I may want the first(10) tweets from the target page.
The code looks like this:
doc.css(".tweet").first(fetch_number).each do |item|
title = item.css("a")[0]['title']
end
However, when there is less than 10 matching div tags returned, it will report
NoMethodError: undefined method 'css' for nil:NilClass
This is because, when no matching HTML is found, it will return nil.
How can I make it return all the available data within 10? I don't need the nils.
UPDATE:
task :test_fetch => :environment do
require 'nokogiri'
require 'open-uri'
url = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(open(url) )
puts doc.css(".main-wrap .item").count
doc.css(".main-wrap .item").first(30).each do |item_info|
if item_info
href = item_info.at(".detail a")['href']
puts href
else
puts 'this is empty'
end
end
end
Return resultes(Near the end):
24
http://item.taobao.com/item.htm?id=41249522884
http://item.taobao.com/item.htm?id=40369253621
http://item.taobao.com/item.htm?id=40384876796
http://item.taobao.com/item.htm?id=40352486259
http://item.taobao.com/item.htm?id=40384968205
.....
http://item.taobao.com/item.htm?id=38843789106
http://item.taobao.com/item.htm?id=38843517455
http://item.taobao.com/item.htm?id=38854788276
http://item.taobao.com/item.htm?id=38825442050
http://item.taobao.com/item.htm?id=38630599372
http://item.taobao.com/item.htm?id=38346270714
http://item.taobao.com/item.htm?id=38357729988
http://item.taobao.com/item.htm?id=38345374874
this is empty
this is empty
this is empty
this is empty
this is empty
this is empty
count reports only 24 elements, but it retuns a 30 array.
And it actually is not an array, but Nokogiri::XML::NodeSet? I'm not sure.

title = item.css("a")[0]['title']
is a bad practice.
Instead, consider writing using at or at_css instead of search or css:
title = item.at('a')['title']
Next, if the <a> tag returned doesn't have a title parameter, Nokogiri and/or Ruby will be upset because the title variable will be nil. Instead, improve your CSS selector to only allow matches like <a title="foo">:
require 'nokogiri'
doc = Nokogiri::HTML('<body>foobar</body>')
doc.at('a').to_html # => "foo"
doc.at('a[title]').to_html # => "bar"
Notice how the first, which is not constrained to look for tags with a title parameter returns the first <a> tag. Using a[title] will only return ones with a title parameter.
That means your loop over the values will never return nil, and you won't have a problem needing to compact them out of the returned array.
As a general programming tip, if you're getting nils like that, look at the code generating the array, because odds are good it's not doing it right. You should ALWAYS know what sort of results your code will generate. Using compact to clean up the array is a knee-jerk reaction to not having written the code correctly most of the time.
Here's your updated code:
require 'nokogiri'
require 'open-uri'
url = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(open(url) )
puts doc.css(".main-wrap .item").count
doc.css(".main-wrap .item").first(30).each do |item_info|
if item_info
href = item_info.at(".detail a")['href']
puts href
else
puts 'this is empty'
end
end
And here's what's wrong:
doc.css(".main-wrap .item").first(30)
Here's a simple example demonstrating why that doesn't work:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
In Nokogiri, search',cssandxpath` are equivalent, except that the first is generic and can take either CSS or XPath, while the last two are specific to that language.
doc.search('p') # => [#<Nokogiri::XML::Element:0x3fcf360ef750 name="p" children=[#<Nokogiri::XML::Text:0x3fcf360ef4f8 "foo">]>]
doc.search('p').size # => 1
doc.search('p').map(&:to_html) # => ["<p>foo</p>"]
That shows that the NodeSet returned by doing a simple search returns only one node, and what the node looks like.
doc.search('p').first(2) # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>, nil]
doc.search('p').first(2).size # => 2
Searching using first(n) returns "n" elements. If that many aren't found Nokogiri fills them in using nil values.
This is counter what we'd assume first(n) to do, since Enumerable#first returns up-to-n and won't pad with nils. This isn't a bug, but it is unexpected behavior since Enumerable's first sets the expected behavior for methods with that name, but, this is NodeSet#first, not Enumerable#first, so it does what it does until the Nokogiri authors change it. (You can see why it happens if you look at the source for that particular method.)
Instead, slicing the NodeSet does show the expected behavior:
doc.search('p')[0..1] # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>]
doc.search('p')[0..1].size # => 1
doc.search('p')[0, 2] # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>]
doc.search('p')[0, 2].size # => 1
So, don't use NodeSet#first(n), use the slice form NodeSet#[].
Applying that, I'd write the code something like:
require 'nokogiri'
require 'open-uri'
URL = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(open(URL))
hrefs = doc.css(".main-wrap .item .detail a[href]")[0..29].map { |anchors|
anchors['href']
}
puts hrefs.size
puts hrefs
# >> 24
# >> http://item.taobao.com/item.htm?id=41249522884
# >> http://item.taobao.com/item.htm?id=40369253621
# >> http://item.taobao.com/item.htm?id=40384876796
# >> http://item.taobao.com/item.htm?id=40352486259
# >> http://item.taobao.com/item.htm?id=40384968205
# >> http://item.taobao.com/item.htm?id=40384816312
# >> http://item.taobao.com/item.htm?id=40384600507
# >> http://item.taobao.com/item.htm?id=39973451949
# >> http://item.taobao.com/item.htm?id=39861209551
# >> http://item.taobao.com/item.htm?id=39545678869
# >> http://item.taobao.com/item.htm?id=39535371171
# >> http://item.taobao.com/item.htm?id=39509186150
# >> http://item.taobao.com/item.htm?id=38973412667
# >> http://item.taobao.com/item.htm?id=38910499863
# >> http://item.taobao.com/item.htm?id=38942960787
# >> http://item.taobao.com/item.htm?id=38910403350
# >> http://item.taobao.com/item.htm?id=38843789106
# >> http://item.taobao.com/item.htm?id=38843517455
# >> http://item.taobao.com/item.htm?id=38854788276
# >> http://item.taobao.com/item.htm?id=38825442050
# >> http://item.taobao.com/item.htm?id=38630599372
# >> http://item.taobao.com/item.htm?id=38346270714
# >> http://item.taobao.com/item.htm?id=38357729988
# >> http://item.taobao.com/item.htm?id=38345374874

Try this
doc.css(".tweet").first(fetch_number).each do |item|
title = item.css("a")[0]['title'] rescue nil
end
And let me know it works or not? It will not show error

Try compact.
[1, nil, 2, nil, 3] # => [1, 2, 3]
http://www.ruby-doc.org/core-2.1.3/Array.html#method-i-compact
(ie: first(fetch_number).compact.each do |item|)

Related

Nokogiri: Clean up data output

I am trying to scrape player information from MLS sites to create a map of where the players come from, as well as other information. I am as new to this as it gets.
So far I have used this code:
require 'HTTParty'
require 'Nokogiri'
require 'JSON'
require 'Pry'
require 'csv'
page = HTTParty.get('https://www.atlutd.com/players')
parse_page = Nokogiri::HTML(page)
players_array = []
parse_page.css('.player_list.list-reset').css('.row').css('.player_info').map do |a|
player_info = a.text
players_array.push(player_info)
end
#CSV.open('atlantaplayers.csv', 'w') do |csv|
# csv << players_array
#end
pry.start(binding)
The output of the pry function is:
"Miguel Almirón10\nMidfielder\n-\nAsunción, ParaguayAge:\n23\nHT:\n5' 9\"\nWT:\n140\n"
Which when put into the csv creates this in a single cell:
"Miguel Almirón10
Midfielder
-
Asunción, ParaguayAge:
23
HT:
5' 9""
WT:
140
"
I've looked into things and have determined that it is possible nodes (\n)? that is throwing off the formatting.
My desired outcome here is to figure out how to get the pry output into the array as follows:
Miguel, Almiron, 10, Midfielder, Asuncion, Paraguay, 23, 5'9", 140
Bonus points if you can help with the accent marks on names. Also if there is going to be an issue with height, is there a way to convert it to metric?
Thank you in advance!
I've looked into things and have determined that it is possible nodes (\n)? that is throwing off the formatting.
Yes that's why it's showing in this odd format, you can strip the rendered text to remove extra spaces/lines then your text will show without the \ns
player_info = a.text.strip
[1] pry(main)> "Miguel Almirón10\n".strip
=> "Miguel Almirón10"
This will only remove the \n if you wish to store them in a CSV in this order
Miguel, Almiron, 10, Midfielder, Asuncion, Paraguay, 23, 5'9", 140
then you might want to split by spaces and then create an array for each row so when pushing the line to the CSV file it will look like this:
csv << ["Miguel", "Almiron", 10, "Midfielder", "Asuncion", "Paraguay", 23, "5'9\"", 140]
with the accent marks on names
you can use the transliterate method which will remove accents
[8] pry(main)> ActiveSupport::Inflector.transliterate("Miguel Almirón10")
=> "Miguel Almiron10"
See http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate and you might want to require 'rails' for this
Here's what I would use, i18n and people gems:
require 'people'
require "i18n"
I18n.available_locales = [:en]
#np = People::NameParser.new
players_array = []
parse_page.css('.player_info').each do |div|
name = #np.parse I18n.transliterate(div.at('.name a').text)
players_array << [
name[:first],
name[:last],
div.at('.jersey').text,
div.at('.position').text,
]
end
# => [["Miguel", "Almiron", "10", "Midfielder"],
# ["Mikey", "Ambrose", "22", "Defender"],
# ["Yamil", "Asad", "11", "Forward"],
# ...
That should get you started.

Ruby - checking if file is a CSV

I have just wrote a code where I get a csv file passed in argument and treat it line by line ; so far, everything is okay. Now, I would like to secure my code by making sure that what we receive in argument is a .csv file.
I saw in the Ruby doc that it exist a == "--file" option but using it generate an error : the way I understood it, it seems this option only work for the txt files.
Is there a method specific that allowed to check if my file is a csv ? Here some of my code :
if ARGV.empty?
puts "j'ai rien reçu"
# option to check, don't work
elsif ARGV[0].shift == "--file"
# my code so far, whithout checking
else CSV.foreach(ARGV.shift) do |row|
etc, etc...
I think it is unpossible to make a real safe test without additional information.
Just some notes what you can do:
You get a filename in a variable filename.
First, check if it is a file:
File.exist?
Then you could check, if the encoding is correct:
raise "Wrong encoding" unless content.valid_encoding?
Has your csv always the same number of columns? And do you have only one liner?
This can be a possibility to make the next check:
content.each_line{|line|
return false if line.count(sep) < columns - 1
}
This check can be modified for your case, e.g. if you have always an exact number of rows.
In total you can define something like:
require 'csv'
#columns defines the expected numer of columns per line
def csv?(filename, sep: ';', columns: 3)
return false unless File.exist?(filename) #"No file"
content = File.read(filename, :encoding => 'utf-8')
return false unless content.valid_encoding? #"Wrong encoding"
content.each_line{|line|
return false if line.count(sep) < columns - 1
}
CSV.parse(content, :col_sep => sep)
end
if csv = csv?('test.csv')
csv.each do |row|
p row
end
end
You can use ruby-filemagic gem
gem install ruby-filemagic
Usage:
$ irb
irb(main):001:0> require 'filemagic'
=> true
irb(main):002:0> fm = FileMagic.new
=> #<FileMagic:0x7fd4afb0>
irb(main):003:0> fm.file('foo.zip')
=> "Zip archive data, at least v2.0 to extract"
irb(main):004:0>
https://github.com/ricardochimal/ruby-filemagic
Use File.extname() to check the origin file
File.extname("test.rb") #=> ".rb"

ruby net_dav sample puts

I'm trying to put a file on a site with WEB_DAV. (a ruby gem)
When I follow the example, I get a nil exception
#### GEMS
require 'rubygems'
begin
gem "net_dav"
rescue LoadError
system("gem install net_dav")
Gem.clear_paths
end
require 'net/dav'
uri = URI('https://staging.web.mysite');
user = "dave"
pasw = "correcthorsebatterystaple"
dav = Net::DAV.new(uri, :curl => false)
dav.verify_server = false
dav.credentials(user, pasw)
cargo = ("testing.txt")
File.open(cargo, "rb") { |stream|
dav.put(urI.path +'/'+ cargo, stream, File.size(cargo))
}
when I run this I get
`digest_auth': can't convert nil into String (TypeError)
this relates to line 197 in my nav.rb file.
request_digest << ':' << params['nonce']
So what I'm wondering is what step did I not add?
Is there a reasonable example of the correct use of this gem? Something that does something that works would be sweet :)
SIDE QUESTION: Is this the correct gem to use to do web_DAV? It seems an old unmaintained gem, perhaps there's something used by more to accomplish the task?
Try referencing the hash with a symbol rather than a string, i.e.
request_digest << ':' << params[:nonce]
In a simple test
baz = "baz"
params = {:foo => "bar"}
baz << ':' << params['foo']
results in the same error as you're getting.

Rails 2.3.8 Render a html as pdf

I am using Prawn pdf library however i am doing a complex design, So I need a quick solution as to convert a html to pdf file.
Thanks in advance
I would use wkhtmltopdf shell tool
together with the wicked_pdf ruby gem, its free, and uses qtwebkit to render your html to pdf. Also executes javascript as well, for charts for example. You can find more info about installation: https://github.com/mileszs/wicked_pdf
I have one Rails app that's been using PrinceXML in production for a few years. It is pricey - around $4K for a server license - but does a very good job of rendering HTML+CSS in PDF files. I haven't looked at newer solutions since this one is paid for and working quite well.
For what it's worth, here's some code I adapted from Subimage Interactive to make the conversion simple:
lib/prince.rb
# Prince XML Ruby interface.
# http://www.princexml.com
#
# Library by Subimage Interactive - http://www.subimage.com
#
#
# USAGE
# -----------------------------------------------------------------------------
# prince = Prince.new()
# html_string = render_to_string(:template => 'some_document')
# send_data(
# prince.pdf_from_string(html_string),
# :filename => 'some_document.pdf'
# :type => 'application/pdf'
# )
#
class Prince
attr_accessor :exe_path, :style_sheets, :log_file
# Initialize method
#
def initialize()
# Finds where the application lives, so we can call it.
#exe_path = '/usr/local/bin/prince'
case Rails.env
when 'production', 'staging'
# use default hard-coded path
else
if File.exist?(#exe_path)
# use default hard-coded path
else
#exe_path = `which prince`.chomp
end
end
#style_sheets = ''
#log_file = "#{::Rails.root}/log/prince.log"
end
# Sets stylesheets...
# Can pass in multiple paths for css files.
#
def add_style_sheets(*sheets)
for sheet in sheets do
#style_sheets << " -s #{sheet} "
end
end
# Returns fully formed executable path with any command line switches
# we've set based on our variables.
#
def exe_path
# Add any standard cmd line arguments we need to pass
#exe_path << " --input=html --server --log=#{#log_file} "
#exe_path << #style_sheets
return #exe_path
end
# Makes a pdf from a passed in string.
#
# Returns PDF as a stream, so we can use send_data to shoot
# it down the pipe using Rails.
#
def pdf_from_string(string)
path = self.exe_path()
# Don't spew errors to the standard out...and set up to take IO
# as input and output
path << ' --silent - -o -'
# Show the command used...
#logger.info "\n\nPRINCE XML PDF COMMAND"
#logger.info path
#logger.info ''
# Actually call the prince command, and pass the entire data stream back.
pdf = IO.popen(path, "w+")
pdf.puts(string)
pdf.close_write
output = pdf.gets(nil)
pdf.close_read
return output
end
end
lib/pdf_helper.rb
module PdfHelper
require 'prince'
private
def make_pdf(template_path, pdf_name, stylesheets = [], skip_base_pdf_stylesheet = false)
# application notices should never be included in PDFs, pull them from session here
notices = nil
if !flash.now[:notice].nil?
notices = flash.now[:notice]
flash.now[:notice] = nil
end
if !flash[:notice].nil?
notices = '' if notices.nil?
notices << flash[:notice]
flash[:notice] = nil
end
prince = Prince.new()
# Sets style sheets on PDF renderer.
stylesheet_base = "#{::Rails.root}/public/stylesheets"
prince.add_style_sheets(
"#{stylesheet_base}/application.css",
"#{stylesheet_base}/print.css"
)
prince.add_style_sheets("#{stylesheet_base}/pdf.css") unless skip_base_pdf_stylesheet
if 0 < stylesheets.size
stylesheets.each { |s| prince.add_style_sheets("#{stylesheet_base}/#{s}.css") }
end
# Set RAILS_ASSET_ID to blank string or rails appends some time after
# to prevent file caching, messing up local - disk requests.
ENV['RAILS_ASSET_ID'] = ''
html_string = render_to_string(:template => template_path, :layout => 'application')
# Make all paths relative, on disk paths...
html_string.gsub!("src=\"", "src=\"#{::Rails.root}/public")
html_string.gsub!("src=\"#{::Rails.root}/public#{::Rails.root}", "src=\"#{::Rails.root}")
# re-insert any application notices into the session
if !notices.nil?
flash[:notice] = notices
end
return prince.pdf_from_string(html_string)
end
def make_and_send_pdf(template_path, pdf_name, stylesheets = [], skip_base_pdf_stylesheet = false)
send_data(
make_pdf(template_path, pdf_name, stylesheets, skip_base_pdf_stylesheet),
:filename => pdf_name,
:type => 'application/pdf'
)
end
end
sample controller action
include PdfHelper
def pdf
index
make_and_send_pdf '/ads/index', "#{filename}.pdf"
end
You can directly convert your existing HTML to PDF using acts_as_flying_saucer library.For header and footer you can refer
https://github.com/amardaxini/acts_as_flying_saucer/wiki/PDF-Header-Footer

Map string to another string in Ruby Rails

Hey guys,
i have 5 model attributes, for example, 'str' and 'dex'. A user has strength, dexterity attribute.
When i call user.increase_attr('dex') i want to do it through 'dex' and not having to pass 'dexterity' string all the way.
Of course, i can just check if ability == 'dex' and convert it to 'dexterity' when i will need to do user.dexterity += 1 and then save it.
But what is a good ruby way to do that ?
Look at Ruby's Abbrev module that's part of the standard library. This should give you some ideas.
require 'abbrev'
require 'pp'
class User
def increase_attr(s)
"increasing using '#{s}'"
end
end
abbreviations = Hash[*Abbrev::abbrev(%w[dexterity strength speed height weight]).flatten]
user = User.new
user.increase_attr(abbreviations['dex']) # => "increasing using 'dexterity'"
user.increase_attr(abbreviations['s']) # => "increasing using ''"
user.increase_attr(abbreviations['st']) # => "increasing using 'strength'"
user.increase_attr(abbreviations['sp']) # => "increasing using 'speed'"
If an ambiguous value is passed in, (the "s"), nothing will match. If a unique value is found in the hash, the returned value is the full string, making it easy to map short strings to the full string.
Because having varying lengths of the trigger strings would be confusing to the user you could strip all elements of the hash that have keys shorter than the shortest unambiguous key. In other words, remove anything shorter than two characters because of the collision of "speed" ("sp") and "strength" ("st"), meaning "h", "d" and "w" need to go. It's a "be kind to the poor human users" thing.
Here's what is created when Abbrev::abbrev does its magic and it's coerced into a Hash.
pp abbreviations
# >> {"dexterit"=>"dexterity",
# >> "dexteri"=>"dexterity",
# >> "dexter"=>"dexterity",
# >> "dexte"=>"dexterity",
# >> "dext"=>"dexterity",
# >> "dex"=>"dexterity",
# >> "de"=>"dexterity",
# >> "d"=>"dexterity",
# >> "strengt"=>"strength",
# >> "streng"=>"strength",
# >> "stren"=>"strength",
# >> "stre"=>"strength",
# >> "str"=>"strength",
# >> "st"=>"strength",
# >> "spee"=>"speed",
# >> "spe"=>"speed",
# >> "sp"=>"speed",
# >> "heigh"=>"height",
# >> "heig"=>"height",
# >> "hei"=>"height",
# >> "he"=>"height",
# >> "h"=>"height",
# >> "weigh"=>"weight",
# >> "weig"=>"weight",
# >> "wei"=>"weight",
# >> "we"=>"weight",
# >> "w"=>"weight",
# >> "dexterity"=>"dexterity",
# >> "strength"=>"strength",
# >> "speed"=>"speed",
# >> "height"=>"height",
# >> "weight"=>"weight"}
def increase_attr(attr)
attr_map = {'dex' => :dexterity, 'str' => :strength}
increment!(attr_map[attr]) if attr_map.include?(attr)
end
Basically create a Hash that has the key of 'dex', 'str' etc and points to the expanded version of that word(in symbol format).

Resources