Problem when scraping data from a webpage (Ruby on Rails 5)

I'm developing a web scraper. I wrote some code and don't understand why the loop doesn't run. Can anyone help me with that?
scraper_service.rb:
browser = Watir::Browser.new
browser.goto('some_link_here')
browser.is(class: /event--head-block/).each do |event|
  event.is(class: /event--more/).button.click
  puts "Hello world"
  binding.pry
end
When I executed the code, I didn't see "Hello world" in the console. In addition, to check whether the class 'event--head-block' is present on the page, I ran browser.element(class: /event--head-block/).exists? and it returned true.
Update
I forgot to mention that there are 8-10 elements with the same class name, 'event--head-block'. Could that be the reason?

Related

Ruby Rails Screen Scrape different results in Rails Console

I'm confused about a difference I'm seeing in Nokogiri commands run from Rails Console and what I get from the same commands run in a Rails Helper.
In Rails Console, I am able to capture the data I want with these commands:
endpoint = "https://basketball-reference.com/leagues/BAA_1947_totals.html"
browser = Watir::Browser.new(:chrome)
browser.goto(endpoint)
@doc_season = Nokogiri::HTML.parse(URI.open("https://basketball-reference.com/leagues/BAA_1947_totals.html"))
player_season_table = @doc_season.css("tbody")
rows = player_season_table.css("tr")
rows.search('.thead').each(&:remove) #THIS WORKED
rows[0].at_css("td").try(:text) # Gets single player name
rows[0].at_css("a").attributes["href"].try(:value) # Gets that player page URL
However, my Rails helper, which is meant to fold those commands into methods, behaves differently:
module ScraperHelper
  def target_scrape(url)
    browser = Watir::Browser.new(:chrome)
    browser.goto(url)
    Nokogiri::HTML.parse(browser.html)
  end
  def league_year_prefix(year, league = 'NBA')
    # aba_seasons = 1968..1976
    baa_seasons = 1947..1949
    baa_seasons.include?(year) ? "BAA_#{year}" : "#{league}_#{year}"
  end
  def players_total_of_season(year, league = 'NBA')
    # always the latter year of the season, first year is 1947 no quotes
    # ABA is 1968 to 1976
    league_year = league_year_prefix(year, league)
    @doc_season = target_scrape("http://basketball-reference.com/leagues/#{league_year}_totals.html")
  end
  def gather_players_from_season
    player_season_table = @doc_season.css("tbody")
    rows = player_season_table.css("tr")
    rows.search('.thead').each(&:remove)
    puts rows[0].at_css("td").try(:text)
    puts rows[0].at_css("a").attributes["href"].try(:value)
  end
end
In that module, I try to reproduce the Rails console commands and break them into methods. To test it (since I don't have any other functionality or views built yet), I run the Rails console, include this helper, and call the methods.
But I get wildly different results.
In the gather_players_from_season method, I can see that
player_season_table = @doc_season.css("tbody")
is no longer grabbing the same data it grabbed when run line by line in the console. It also doesn't like the attributes method here:
puts rows[0].at_css("a").attributes["href"].try(:value)
So my first thought is a difference in gems, maybe? Watir is launching the headless browser. Nokogiri isn't causing errors as far as I can tell.
Your first thought of comparing the Gem versions is a great idea, but I am noticing a difference between the two code solutions:
In the Rails Console
the code parses the HTML with URI.open: Nokogiri::HTML.parse(URI.open("some html"))
In the ScraperHelper code
the code does not call URI.open, Nokogiri::HTML.parse("some html")
Perhaps that difference will return different values and make the rest of the ScraperHelper return unexpected results.

How to "reload" a cloudflare 520 request with ruby?

I wrote a ruby script to download an image URL:
require 'open-uri'
imageAddress = ARGV[0]
targetPath = ARGV[1]
fullFileNamePath = "#{targetPath}test.jpg"
begin
  File.open(fullFileNamePath, 'wb') do |fo|
    # URI.open is required on Ruby 3.0+, where Kernel#open no longer opens URLs
    fo.write URI.open(imageAddress).read
  end
rescue OpenURI::HTTPError => ex
  puts ex
  File.delete(fullFileNamePath)
end
Example Usage:
ruby download_image.rb "https://images.genius.com/b015b15e476c92d10a834d523575d3c9.1000x1000x1.jpg" "/Users/Me/Downloads/"
The problem is, sometimes I run across this output error:
520 Origin Error
Then, when I try the same URL in my browser, I get something like this:
If I reload the page or click the 'Retry for a live version' button in the above image, the page loads.
Then if I run the script again it downloads the image just fine.
So how can I replicate this page reload / 'Retry for a live version' behavior using ruby and without switching to my browser? Running the script again doesn't do the job.
It sounds like you are looking for a delay. If the script fails with '520 Origin Error', wait and retry.
This is a quickly built recursive function; you may want to add a check on how many times it has retried and give up after a limit. (Not tested, may contain errors, meant as an example.)
require 'open-uri'
def getFile(imageAddress, fullFileNamePath)
  File.open(fullFileNamePath, 'wb') do |fo|
    fo.write URI.open(imageAddress).read
  end
rescue OpenURI::HTTPError => ex
  puts ex
  File.delete(fullFileNamePath)
  # ex is an exception object, so compare its message, not ex itself
  if ex.message.include?('520')
    sleep(30) # generally a good time to pause
    getFile(imageAddress, fullFileNamePath)
  end
end
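The retry limit mentioned above can be sketched with a loop instead of recursion. This is a minimal sketch, not a tested production solution: the helper name with_520_retries and its max_attempts and delay parameters are made up for illustration, and it assumes the error message begins with the status code, as in the "520 Origin Error" output shown in the question.

```ruby
require 'open-uri'
require 'stringio'

# Hypothetical helper: runs the block and retries it only when it raises an
# OpenURI::HTTPError whose message starts with "520". Any other error, or a
# 520 on the final attempt, is re-raised to the caller.
def with_520_retries(max_attempts: 3, delay: 30)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue OpenURI::HTTPError => ex
    raise unless ex.message.start_with?('520') && attempts < max_attempts
    sleep(delay) # give the origin server time to recover
    retry        # re-run the begin block
  end
end

# Usage sketch (variable names taken from the question's script):
#   data = with_520_retries { URI.open(imageAddress).read }
#   File.binwrite(fullFileNamePath, data)
```

Downloading into memory first and writing the file only on success avoids having to delete a half-written file after a failure.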

Can I display the log of system call in Ruby?

I need to call a command (in a Sinatra or Rails app) like this:
`command sub`
The command prints log output while it is executing.
I want to see that output displayed continuously while the process runs.
But I can only get the log string after it's done, with:
result = `command sub`
So, is there a way to implement this?
On Windows I have had the best experience with IO.popen.
Here is a sample:
require 'logger'
$log = Logger.new("#{__FILE__}.log", 'monthly')
# The full command line goes here; in this case it is a Java program
command = %Q{java -jar getscreen.jar #{$userid} #{$password}}
$log.debug command
STDOUT.sync = true
begin
  # Note the 2>&1 syntax: it redirects stderr to stdout so both arrive through the pipe.
  # By convention, file descriptor 0 is stdin, 1 is stdout, 2 is stderr.
  IO.popen(command + " 2>&1") do |pipe|
    pipe.sync = true
    while str = pipe.gets # for every line the external program returns
      puts str # do something with the captured line; here it is printed as it arrives
    end
  end
rescue => e
  $log.error "#{__LINE__}:#{e}"
  $log.error e.backtrace
end
There are six ways to do it, but the one you're using isn't the right one here, because backticks wait for the process to return.
Pick one from here:
http://tech.natemurray.com/2007/03/ruby-shell-commands.html
I would use Open3.popen3 if I were you.
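Both answers come down to reading the child process's output line by line instead of waiting for it to finish. A minimal sketch of that idea, assuming a hypothetical helper named stream_command that yields each line as it arrives; it passes the command as an array, so no shell is involved, and merges stderr into stdout with the err: [:child, :out] option:

```ruby
# Hypothetical helper: runs a command and yields every output line (stdout
# and stderr merged) as soon as the child process produces it.
# Returns true if the command exited successfully.
def stream_command(*cmd)
  IO.popen(cmd, err: [:child, :out]) do |pipe|
    pipe.each_line { |line| yield line.chomp }
  end
  $?.success? # IO.popen with a block sets $? once the child has exited
end

# Usage sketch with the command from the question:
#   stream_command('command', 'sub') { |line| puts line }
```

Note that the child program may buffer its own stdout when it is not attached to a terminal, in which case lines can still arrive in bursts.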

Ruby timeout does not work in Rails?

I'm having an issue trying to get a timeout when connecting via TCPSocket to a remote resource that isn't available. It just hangs indefinitely without timing out. Ideally I'd want it to try reconnect every 2 minutes or so, but the TCPSocket.new call seems to block. I've tried using timeout() but that doesn't do anything either. Trying the same call in an IRB instance works perfectly fine, but when it's in Rails, it fails. Anyone have a work around for this?
My code looks something like this:
def self.connect!
  @@connection = TCPSocket.new IP, 4449
end
def self.send(cmd)
  puts "send "
  unless @@connection
    self.connect!
  end
  loop do
    begin
      @@connection.puts(cmd)
      return
    rescue IOError
      sleep(self.get_reconnect_delay)
      self.connect!
    end
  end
end
Unfortunately, at the time of writing there is no way to set timeouts on TCPSocket directly.
See http://bugs.ruby-lang.org/issues/5101 for the feature request. You will have to use the basic Socket class and set socket options.
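One way to get a connect timeout through the Socket class is Socket.tcp, which accepts a connect_timeout option; from Ruby 3.0 on, TCPSocket.new accepts the same option. The sketch below is an illustration rather than a drop-in fix: the helper name connect_with_timeout is made up, and returning nil on failure is just one possible convention that lets a caller sleep and try again.

```ruby
require 'socket'

# Hypothetical helper: tries to open a TCP connection and gives up after
# timeout_seconds instead of blocking indefinitely.
# Returns the connected socket, or nil if the connection failed.
def connect_with_timeout(host, port, timeout_seconds)
  Socket.tcp(host, port, connect_timeout: timeout_seconds)
rescue SystemCallError, SocketError, IOError => e
  # Covers refused connections, timeouts, and name resolution failures
  warn "connect failed: #{e.class}: #{e.message}"
  nil
end

# Usage sketch (IP and port 4449 are from the question):
#   socket = connect_with_timeout(IP, 4449, 5)
#   socket.puts(cmd) if socket
```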

puts doesn't print stuff to console

I'm using Pow for local Rails development. I don't know why, but I can't print or puts information to my development.log. I want to output the contents of variables to the console or log from my controller. Any advice?
I read my logs with tail -f logs/development.log
Thanks!
Instead of puts, try logger.info(). Logging in Rails is very flexible, but it does mean that you might not be able to use the simplest tools sometimes.
If you're doing debugging and only want to see some messages in the logs you can do the following:
Rails.logger.debug("debug::" + person.name)
and
$ pow logs | grep debug::
now you'll only see logging messages that start with debug::
Another option is to use the Rails tagged logger, http://api.rubyonrails.org/classes/ActiveSupport/TaggedLogging.html.
logger = ActiveSupport::TaggedLogging.new(Logger.new(STDOUT))
logger.tagged('BCX') { logger.info 'Stuff' } # Logs "[BCX] Stuff"
$ pow logs | grep BCX
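The prefix-and-grep idea is not tied to Rails. As a rough sketch using only Ruby's standard Logger, the hypothetical helper build_prefixed_logger below stamps every line with a fixed prefix so it can be filtered with grep; the helper name and the output format are assumptions made for this example.

```ruby
require 'logger'
require 'stringio'

# Hypothetical helper: returns a Logger that writes to io and puts the given
# prefix at the start of every line, so the output can be filtered with grep.
def build_prefixed_logger(io, prefix)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, _time, _progname, msg|
    "#{prefix} #{severity}: #{msg}\n"
  end
  logger
end

# Usage sketch:
#   logger = build_prefixed_logger($stdout, 'debug::')
#   logger.debug('person loaded')
```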
For anyone who still can't get it to work, remember that Ruby doesn't need semicolons at the end of a statement. They are only used to put several statements on one line. I was adding them at the end out of muscle memory (coming from PHP), so the Ruby console thought I was still entering a statement:
irb(main):001:0> puts "hi";
irb(main):002:0* puts "hi"
hi
hi
=> nil
Hope this helps someone.
