How should my scraping "stack" handle 404 errors?

How should my scraping "stack" handle 404 errors? - ruby-on-rails

I have a rake task that is responsible for doing batch processing on millions of URLs. Because this process takes so long I sometimes find that URLs I'm trying to process are no longer valid -- 404s, site's down, whatever.
When I initially wrote this there was basically just one site that would continually go down while processing so my solution was to use open-uri, rescue any exceptions produced, wait a bit, and then retry.
This worked fine when the dataset was smaller but now so much time goes by that I'm finding URLs are no longer there anymore and produce a 404.
Using the case of a 404, when this happens my script just sits there and loops infinitely -- obviously bad.
How should I handle cases where a page doesn't load successfully, and more importantly how does this fit into the "stack" I've built?
I'm pretty new to this, and Rails, so any opinions on where I might have gone wrong in this design are welcome!
Here is some anonymized code that shows what I have:
The rake task that makes a call to MyHelperModule:
# lib/tasks/my_app_tasks.rake
namespace :my_app do
desc "Batch processes some stuff # a later time."
task :process_the_batch => :environment do
# The dataset being processed
# is millions of rows so this is a big job
# and should be done in batches!
MyModel.where(some_thing: nil).find_in_batches do |my_models|
MyHelperModule.do_the_process my_models: my_models
end
end
end
end
MyHelperModule accepts my_models and does further stuff with ActiveRecord. It calls SomeClass:
# lib/my_helper_module.rb
module MyHelperModule
def self.do_the_process(args = {})
my_models = args[:my_models]
# Parallel.each(my_models, :in_processes => 5) do |my_model|
my_models.each do |my_model|
# Reconnect to prevent errors with Postgres
ActiveRecord::Base.connection.reconnect!
# Do some active record stuff
some_var = SomeClass.new(my_model.id)
# Do something super interesting,
# fun,
# AND sexy with my_model
end
end
end
SomeClass will go out to the web via WebpageHelper and process a page:
# lib/some_class.rb
require_relative 'webpage_helper'
class SomeClass
attr_accessor :some_data
def initialize(arg)
doc = WebpageHelper.get_doc("http://somesite.com/#{arg}")
# do more stuff
end
end
WebpageHelper is where the exception is caught and an infinite loop is started in the case of 404:
# lib/webpage_helper.rb
require 'nokogiri'
require 'open-uri'
class WebpageHelper
def self.get_doc(url)
begin
page_content = open(url).read
# do more stuff
rescue Exception => ex
puts "Failed at #{Time.now}"
puts "Error: #{ex}"
puts "URL: " + url
puts "Retrying... Attempt #: #{attempts.to_s}"
attempts = attempts + 1
sleep(10)
retry
end
end
end

TL;DR
Use out-of-band error handling and a different conceptual scraping model to speed up operations.
Exceptions Are Not for Common Conditions
There are a number of other answers that address how to handle exceptions for your use case. I'm taking a different approach by saying that handling exceptions is fundamentally the wrong approach here for a number of reasons.
In his book Exceptional Ruby, Avdi Grimm provides some benchmarks showing the performance of exceptions as ~156% slower than using alternative coding techniques such as early returns.
In The Pragmatic Programmer: From Journeyman to Master, the authors state "[E]xceptions should be reserved for unexpected events." In your case, 404 errors are undesirable, but are not at all unexpected--in fact, handling 404 errors is a core consideration!
In short, you need a different approach. Preferably, the alternative approach should provide out-of-band error handling and prevent your process from blocking on retries.
One Alternative: A Faster, More Atomic Process
You have a lot of options here, but the one I'm going to recommend is to handle 404 status codes as a normal result. This allows you to "fail fast," but also allows you to retry pages or remove URLs from your queue at a later time.
Consider this example schema:
ActiveRecord::Schema.define(:version => 20120718124422) do
create_table "webcrawls", :force => true do |t|
t.text "raw_html"
t.integer "retries"
t.integer "status_code"
t.text "parsed_data"
t.datetime "created_at", :null => false
t.datetime "updated_at", :null => false
end
end
The idea here is that you would simply treat the entire scrape as an atomic process. For example:
Did you get the page?
Great, store the raw page and the successful status code. You can even parse the raw HTML later, in order to complete your scrapes as fast as possible.
Did you get a 404?
Fine, store the error page and the status code. Move on quickly!
When your process is done crawling URLs, you can then use an ActiveRecord lookup to find all the URLs that recently returned a 404 status so that you can take appropriate action. Perhaps you want to retry the page, log a message, or simply remove the URL from your list of URLs to scrape--"appropriate action" is up to you.
By keeping track of your retry counts, you could even differentiate between transient errors and more permanent errors. This allows you to set thresholds for different actions, depending on the frequency of scraping failures for a given URL.
This approach also has the added benefit of leveraging the database to manage concurrent writes and share results between processes. This would allow you to parcel out work (perhaps with a message queue or chunked data files) among multiple systems or processes.
Final Thoughts: Scaling Up and Out
Spending less time on retries or error handling during the initial scrape should speed up your process significantly. However, some tasks are just too big for a single-machine or single-process approach. If your process speedup is still insufficient for your needs, you may want to consider a less linear approach using one or more of the following:
Forking background processes.
Using dRuby to split work among multiple processes or machines.
Maximizing core usage by spawning multiple external processes using GNU parallel.
Something else that isn't a monolithic, sequential process.
Optimizing the application logic should suffice for the common case, but if not, scaling up to more processes or out to more servers. Scaling out will certainly be more work, but will also expand the processing options available to you.

Curb has an easier way of doing this and can be a better (and faster) option instead of open-uri.
Errors Curb reports (and that you can rescue from and do something:
http://curb.rubyforge.org/classes/Curl/Err.html
Curb gem:
https://github.com/taf2/curb
Sample code:
def browse(url)
c = Curl::Easy.new(url)
begin
c.connect_timeout = 3
c.perform
return c.body_str
rescue Curl::Err::NotFoundError
handle_not_found_error(url)
end
end
def handle_not_found_error(url)
puts "This is a 404!"
end

You could just raise the 404's:
rescue Exception => ex
raise ex if ex.message['404']
# retry for non-404s
end

It all just depends on what you want to do with 404's.
Lets assume that you just want to swallow them. Part of pguardiario's response is a good start: You can raise an error, and retry a few times...
# lib/webpage_helper.rb
require 'nokogiri'
require 'open-uri'
class WebpageHelper
def self.get_doc(url)
attempt_number = 0
begin
attempt_number = attempt_number + 1
page_content = open(url).read
# do more stuff
rescue Exception => ex
puts "Failed at #{Time.now}"
puts "Error: #{ex}"
puts "URL: " + url
puts "Retrying... Attempt #: #{attempts.to_s}"
sleep(10)
retry if attempt_number < 10 # Try ten times.
end
end
end
If you followed this pattern, it would just fail silently. Nothing would happen, and it would move on after ten attempts. I would generally consider that a Bad Plan(tm). Instead of just failing out silently, I would go for something like this in the rescue clause:
rescue Exception => ex
if attempt_number < 10 # Try ten times.
retry
else
raise "Unable to contact #{url} after ten tries."
end
end
and then throw something like this in MyHelperModule#do_the_process (you'd have to update your database to have an errors and error_message column):
my_models.each do |my_model|
# ... cut ...
begin
some_var = SomeClass.new(my_model.id)
rescue Exception => e
my_model.update_attributes(errors: true, error_message: e.message)
next
end
# ... cut ...
end
That's probably the easiest and most graceful way to do it with what you currently have. That said, if you're handling that many request in one massive rake tasks, that's not very elegant. You can't restart it if something goes wrong, it's tying up a single process on your system for a long time, etc. If you end up with any memory leaks (or infinite loops!), you find yourself in a place where you can't just say 'move on'. You probably should be using some kind of queueing system like Resque or Sidekiq, or Delayed Job (though it sounds like you have more items that you'd end up queueing than Delayed Job would happily handle). I'd recommend digging in to those if you're looking for a more eloquent approach.

I actually have a rake task that does something remarkably similar. Here is the gist of what I did to deal with 404's and you could apply it pretty easy.
Basically what you want to do is to use the following code as a filter and create a logfile to store your errors. So before you grab the website and process it you first do the following:
So create/instantiate a logfile in your file:
#logfile = File.open("404_log_#{Time.now.strftime("%m/%d/%Y")}.txt","w")
# #{Time.now.strftime("%m/%d/%Y")} Just includes the date into the log in case you want
# to run diffs on your log files.
Then change your WebpageHelper class to something like this:
class WebpageHelper
def self.get_doc(url)
response = Net::HTTP.get_response(URI.parse(url))
if (response.code.to_i == 404) notify_me(url)
else
page_content = open(url).read
# do more stuff
end
end
end
What this is doing is pinging the page for a response code. The if statement I included is checking if the response code is a 404 and if it is run the notify_me method otherwise run your commands as usual. I just arbitrarily created that notify_me method as an example. On my system I have it writing to txt file that it emails me upon completion. You could use a similar method to look at other response codes.
Generic logging method:
def notify_me(url)
puts "Failed at #{Time.now}"
puts "URL: " + url
#logfile.puts("There was a 404 error for the site #{url} at #{Time.now}.")
end

Regarding the problem you're experiencing, you can do the following:
class WebpageHelper
def self.get_doc(url)
retried = false
begin
page_content = open(url).read
# do more stuff
rescue OpenURI::HTTPError => ex
unless ex.io.status.first.to_i == 404
log_error ex.message
sleep(10)
unless retried
retried = true
retry
end
end
# FIXME: needs some refactoring
rescue Exception => ex
puts "Failed at #{Time.now}"
puts "Error: #{ex}"
puts "URL: " + url
puts "Retrying... Attempt #: #{attempts.to_s}"
attempts = attempts + 1
sleep(10)
retry
end
end
end
But I'd rewrite the whole thing in order to do parallel processing with Typhoeus:
https://github.com/typhoeus/typhoeus
where I'd assign a callback block which would do the handling of the returned data, thus decoupling the fetching of the page and the processing.
Something along the lines:
def on_complete(response)
end
def on_failure(response)
end
def run
hydra = Typhoeus::Hydra.new
reqs = urls.collect do |url|
Typhoeus::Request.new(url).tap { |req|
req.on_complete = method(:on_complete).to_proc }
hydra.queue(req)
}
end
hydra.run
# do something with all requests after all requests were performed, if needed
end

I think everyone's comments on this question are spot on and correct. There is alot of good info on this page. Here is my attempt at collecting this very hefty bounty. That being said +1 to all answers.
If you are only concerned with 404 using OpenURI you can handle just those types of exceptions
# lib/webpage_helper.rb
rescue OpenURI::HTTPError => ex
# handle OpenURI HTTP Error!
rescue Exception => e
# similar to the original
case e.message
when /404/ then puts '404!'
when /500/ then puts '500!'
# etc ...
end
end
If you want a bit more you can do different Execption handling per type of error.
# lib/webpage_helper.rb
rescue OpenURI::HTTPError => ex
# do OpenURI HTTP ERRORS
rescue Exception::SyntaxError => ex
# do Syntax Errors
rescue Exception => ex
# do what we were doing before
Also I like what is said in the other posts about number of attempts. Makes sure it isn't an infinite loop.
I think the rails thing to do after a number of attempts would be to log, queue, and or email.
To log you can use
webpage_logger = Log4r::Logger.new("webpage_helper_logger")
# somewhere later
# ie 404
case e.message
when /404/
then
webpage_logger.debug "debug level error #{attempts.to_s}"
webpage_logger.info "info level error #{attempts.to_s}"
webpage_logger.fatal "fatal level error #{attempts.to_s}"
There are many ways to queue.
I think some of the best are faye and resque. Here is a link to both:
http://faye.jcoglan.com/
https://github.com/defunkt/resque/
Queues work just like a line. Believe it or not the Brits call lines, "queues" (The more you know). So, using a queuing server then you can line up many requests and when the server you are trying to send the request comes back, you can hammer that server with your requests in the queue. Thus forcing their server to go down again, but hopefully over time they will upgrade their machines because they keep crashing.
And finally to email, rails also to the rescue (not resque)...
Here is the link to rails guide on ActionMailer: http://guides.rubyonrails.org/action_mailer_basics.html
You could have a mailer like this
class SomeClassMailer < ActionMailer::Base
default :from => "notifications#example.com"
def self.mail(*args)
...
# then later
rescue Exception => e
case e.message
when /404/ && attempts == 3
SomeClassMailer.mail(:to => "broken#example.com", :subject => "Failure ! #{attempts}")

Instead of using initialize, which always returns a new instance of an object, when creating a new SomeClass from a scraping, I'd use a class method to create the instance. I'm not using exceptions here beyond what nokogiri is throwing because it sounds like nothing else should bubble up further since you just want these to be logged, but otherwise be ignored. You mentioned logging the exceptions--are you just logging what goes to stdout? I'll answer as if you are...
# lib/my_helper_module.rb
module MyHelperModule
def self.do_the_process(args = {})
my_models = args[:my_models]
# Parallel.each(my_models, :in_processes => 5) do |my_model|
my_models.each do |my_model|
# Reconnect to prevent errors with Postgres
ActiveRecord::Base.connection.reconnect!
some_object = SomeClass.create_from_scrape(my_model.id)
if some_object
# Do something super interesting if you were able to get a scraping
# otherwise nothing happens (except it is noted in our logging elsewhere)
end
end
end
Your SomeClass:
# lib/some_class.rb
require_relative 'webpage_helper'
class SomeClass
attr_accessor :some_data
def initialize(doc)
#doc = doc
end
# could shorten this, but you get the idea...
def self.create_from_scrape(arg)
doc = WebpageHelper.get_doc("http://somesite.com/#{arg}")
if doc
return SomeClass.new(doc)
else
return nil
end
end
end
Your WebPageHelper:
# lib/webpage_helper.rb
require 'nokogiri'
require 'open-uri'
class WebpageHelper
def self.get_doc(url)
attempts = 0 # define attempts first in non-block local scope before using it
begin
page_content = open(url).read
# do more stuff
rescue Exception => ex
attempts += 1
puts "Failed at #{Time.now}"
puts "Error: #{ex}"
puts "URL: " + url
if attempts < 3
puts "Retrying... Attempt #: #{attempts.to_s}"
sleep(10)
retry
else
return nil
end
end
end
end

Related

How to rescue or prevent rails database connection exception and allow controller action to continue

I am trying to catch database connection issues on a specific request and take a different action when the database is down.
for example:
config/routes.rb
get 'my_route' => 'my_controller#my_action'
app/controllers/my_controller.rb
class MyController < Public::ApplicationController
def my_action
begin
url = database_lookup
rescue Mysql2::Error => e
url = fallback_lookup
end
redirect_to url
end
def database_lookup
# get info from db
end
def fallback_lookup
# lookup info in redis cache instead
end
end
This might work in certain situations, however if the database goes down and a new request comes in, active record middleware raises an exception long before ever reaching the controller.
I have been messing with Middleware to try and catch the error and do something else, but its not looking too promising.
What i'm trying is:
application.rb
config.middleware.insert_after 'ActionDispatch::RemoteIp', 'DatabaseExceptionHandler'
app/middleware/database_exception_handler.rb
class DatabaseExceptionHandler
def initialize app
#app = app
end
def call env
#status, #headers, #response = #app.call(env)
[#status, #headers, #response]
rescue Mysql2::Error => e
[#status, #headers, #response]
end
end
This is allowing me to catch connection exceptions that are raised when the request runs, but it doesn't help me much. I need to somehow get to a controller action still.
I think a simpler approach would be to skip all the active record connection nonsense for a specific controller action.
It seems silly to force a database connection for something that might not even need the database.
Anyone have any better ideas than what i've come up with so far?

I think a simpler approach would be to skip all the active record
connection nonsense for a specific controller action.
No doable or even a good idea. Rails does not process the configuration and middleware on a per request basis and that would not work with any kind of server that speeds the process up by booting up rails in the background.
This is allowing me to catch connection exceptions that are raised
when the request runs, but it doesn't help me much. I need to somehow
get to a controller action still.
This is probably a fools errand. If Rails bailed in the initialization process the integrity of the system is probably not great and you can't just continue on like its business as usual.
What you can do is set config.exceptions_app to customize the error pages. And get a less flaky database.

Continue a loop after rescuing from an external API error in Rails

How can I use rescue to continue a loop. I’ll make an example
def self.execute
Foo.some_scope.each do |foo|
# This calls to an external API, and sometimes can raise an error if the account is not active
App::Client::Sync.new(foo).start!
end
end
So normally rescue Bar::Web::Api::Error => e would go at the end of the method and the loop will stop. If I could update a attribute of the foo that was rescued and call the method again, that foo would not be included in the scope and I would be able to start the loop again. But the issue with that is, I only want this once for each foo. So that way would loop through all of the existing foo again.
What’s another way I could do this? I could make a private method that is called at the top of the execute method. This could Loop through the foo and update the attribute so they aren’t part of the scope. But this sounds like an endless loop.
Does anyone have a good solution to this?

You can put a begin and rescue block within the loop. You talk about "updating an attribute of the foo" but it seems you only want that to ensure this foo is not processed on a restart of the loop, but you don't need to restart the loop.
def self.execute
Foo.some_scope.each do |foo|
# This calls to an external API, and sometimes can raise an error if the account is not active
begin
App::Client::Sync.new(foo).start!
rescue Bar::Web::Api::Error
foo.update(attribute: :new_value) # if you still need this
end
end
end

You could use retry. It will re-execute the whole begin block when called from a rescue block. If you only want it to retry a limited number of times you could use a counter. Something like:
def self.execute
Foo.some_scope.each do |foo|
num_tries = 0
begin
App::Client::Sync.new(foo).start!
rescue
num_tries += 1
retry if num_tries > 1
end
end
end
Documentation here.

Can I put retry in rescue block without begin ... end?

I ran into a problem, when PG fails out of sync (well known problem)(example).
PG fails out of sync, sequence of id stops incrementing and raises ActiveRecord::RecordNotUnique error.
But all solutions proposed here (all I found) propose some manual solutions - either do some operations in console, either run custom rake task.
However, I find this unsatisfying for production: each times it happens, users get 500, while someone administrating server should operatively save the day. (And according to test data for some reason it possible will occur frequently in my case).
So I would like rather to patch ActiveRecord Base class to catch this specific error and rescue it.
I use this logic sometimes in controller:
class ApplicationController < ActionController::Base
rescue_from ActionController::ParameterMissing, ActiveRecord::RecordNotFound do |e|
# some logic here
end
end
However, here I don't need retry. Also, I would like to not to go deep in monkey patching, for example, without overriding Base create method.
So I was thinking of something like this:
module ActiveRecord
class Base
rescue ActiveRecord::RecordNotUnique => e
if e.message.include? '_pkey'
table =e.message.match(//) #regex to define table
ActiveRecord::Base.connection.reset_pk_sequence!(table)
retry
else
raise
end
end
end
But it most likely doesn't work, as I'm not sure if Rails/Ruby will understand what exactly it asked to retry.
Is there any solution?
P.S. Not related solution for overall problem of sequence which will work without manual command line commands and having unserved users are also appreciated.

To answer the question you're asking, no. rescue can only be used from within a begin..end block or method body.
begin
bad_method
rescue SomeException
retry
end
def some_method
bad_method
rescue SomeException
retry
end
rescue_from is just a framework helper method created because of how indirect the execution is in a controller.
To answer the question you're really asking, sure. You can override create_or_update with a rescue/retry.
module NonUniquePkeyRecovery
def create_or_update(*)
super
rescue ActiveRecord::RecordNotUnique => e
raise unless e.message.include? '_pkey'
self.class.connection.reset_pk_sequence!(self.class.table_name)
retry
end
end
ActiveSupport.on_load(:active_record) do
include NonUniquePkeyRecovery
end

Rails not catching exception in rescue block

User model has defined indexes to be searched using ThinkingSphinx. However when I stop my searchd deamon, I would like my method to fail gracefully and not throw an error. Normally I do this by using a rescue block for catching exceptions. But in this case, it still throws the error and the puts statement is never executed.
def search_users(key)
begin
search_results = User.search(key,options)
rescue Exception
puts "Hello World!!!"
search_results = []
end
return search_results
end
Following is the error i get:
Riddle::ConnectionError (Connection to 127.0.0.1 on 3201 failed. Connection refused - connect(2)):
Is there any way out?

Solved it.
Add the :populate => true option to your search calls.
Normally, Thinking Sphinx lazily loads search results (allowing for
sphinx scopes and such) - but if you want the rescue to take effect,
then you'll need to force the results to load immediately - hence the
:populate option.
Refer the link posted above for further reading.

Given ruby return semantics, you can compress your code:
def search_users(key)
begin
User.search(key,options)
rescue
puts "Hello World!!!"
[]
end
end
It is evil to rescue Exception. Just use rescue, which rescues StandardError, which captures most of the stuff you want it to. Otherwise you also capture SyntaxError, LoadError, SystemExit and other stuff you don't intend. In this case, rescue Riddle::ConnectionError is appropriate, but not necessary.

Rails: Returning errors from Module to Rake task?

How do I pass an error from my Module back to the rake task that called it?
My rake task looks like this:
require 'mymodule.rb'
task :queue => :environment do
OPERATOR = Mymodule::Operator.new
begin
OPERATOR.initiate_call (1234567189)
rescue StandardError => bang
puts "Shit happened: #{ bang} "
end
end
And here is my module..
module Mymodule
class Operator
def initiate_call (number)
begin
# make the call
rescue StandardError => bang
flash[:error] = "Error #{bang}"
return
end
end
end
end
I also call this module from a controller so it would be nice to have an error handling solution that is more or less agnostic.
Running Rails 3. Any unrelated comments (i.e. suggestions) on my code structure are more than welcomed :)

Your Operator#initiate_call method traps StandardError exceptions so your rake task will never see them. I'd drop the rescue from initiate_call and let the caller deal with all the exception handling. Then, you'd have flash[:error] = "Error #{bang}" in your controller's exception handler and the rake task would remain as-is.
The basic approach is to push the error handling up the call stack all the way to someone that can do something about it; initiate_call can't really do anything useful with the exception so it shouldn't try to handle it.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart