I have a method inside a model with the code below. It calls a gem and returns either the object I want or a 404 resource-not-found error. If I call a method on the 404 result, I need to rescue the error as shown below. If I just use a bare rescue, the linter fails. If I do it this way, Brakeman fails.
def find_object
return_object = Rails.cache.fetch(cache_key + '/variableInsideObject') do
GemClient.find(id).variableInsideObject
rescue HttpServices::ResourceNotFoundError
raise ApplicationController::ExternalServiceError,
"variable inside object not found for id: #{id}"
end
end
How can I rescue this error without failing the linter and Brakeman?
IMO this is a more Ruby-esque implementation of this code:
def find_object
return_object = begin
Rails.cache.fetch(cache_key + '/variableInsideObject') do
GemClient.find(id).variableInsideObject
end
rescue HttpServices::ResourceNotFoundError => e
Rails.logger.error(e)
raise ApplicationController::ExternalServiceError,
"variable inside object not found for id: #{id}"
end
end
Of course, it's hard to say without knowing exactly what the linter or Brakeman is complaining about, but this should be better. You don't need to use begin/end blocks, of course, but linters and the community sometimes find them neater.
I ran into a problem where PG falls out of sync (a well-known problem)(example).
When PG falls out of sync, the id sequence stops incrementing and inserts raise an ActiveRecord::RecordNotUnique error.
But all the solutions proposed here (all that I found) are manual: either run some operations in the console or run a custom rake task.
However, I find this unsatisfying for production: each time it happens, users get a 500 until whoever administers the server steps in to save the day. (And according to test data, for some reason it may occur frequently in my case.)
So I would rather patch the ActiveRecord Base class to catch this specific error and rescue it.
I use this logic sometimes in controller:
class ApplicationController < ActionController::Base
rescue_from ActionController::ParameterMissing, ActiveRecord::RecordNotFound do |e|
# some logic here
end
end
However, there I don't need a retry. Also, I would rather not go deep into monkey patching; for example, I'd like to avoid overriding the Base create method.
So I was thinking of something like this:
module ActiveRecord
class Base
rescue ActiveRecord::RecordNotUnique => e
if e.message.include? '_pkey'
table =e.message.match(//) #regex to define table
ActiveRecord::Base.connection.reset_pk_sequence!(table)
retry
else
raise
end
end
end
But it most likely doesn't work, as I'm not sure Rails/Ruby will understand what exactly it is being asked to retry.
Is there any solution?
P.S. Unrelated solutions to the overall sequence problem that work without manual command-line commands and without leaving users unserved are also appreciated.
To answer the question you're asking, no. rescue can only be used from within a begin..end block or method body.
begin
bad_method
rescue SomeException
retry
end
def some_method
bad_method
rescue SomeException
retry
end
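One caveat with both forms above: an unconditional retry loops forever if the failure is permanent. A minimal sketch of a bounded retry follows; TransientError and with_retries are hypothetical names made up for this illustration, not part of Ruby or Rails.

```ruby
# Hypothetical error class, used only for illustration.
class TransientError < StandardError; end

# Runs the block, retrying on TransientError until max_attempts is reached.
# The counter lives outside the begin block so that `retry`, which re-runs
# the begin block from its first line, does not reset it.
def with_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue TransientError
    retry if attempts < max_attempts
    raise # give up and let the caller see the original error
  end
end

# Usage: fails on the first two attempts, then succeeds.
message = with_retries do |n|
  raise TransientError, "attempt #{n} failed" if n < 3
  "succeeded on attempt #{n}"
end
puts message
```

Re-raising once the limit is hit keeps the failure visible instead of swallowing it.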
rescue_from is just a framework helper method created because of how indirect the execution is in a controller.
To answer the question you're really asking, sure. You can override create_or_update with a rescue/retry.
module NonUniquePkeyRecovery
def create_or_update(*)
super
rescue ActiveRecord::RecordNotUnique => e
raise unless e.message.include? '_pkey'
self.class.connection.reset_pk_sequence!(self.class.table_name)
retry
end
end
ActiveSupport.on_load(:active_record) do
include NonUniquePkeyRecovery
end
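Note that the retry above is unbounded: if resetting the sequence does not cure the error, the save would keep looping. The following plain-Ruby sketch shows one way to cap it. It does not touch ActiveRecord at all; RecordNotUnique, FakeModel, save_record and reset_sequence! are stand-ins invented here to mimic create_or_update and reset_pk_sequence!, and prepend is used only because the fake class defines the method itself.

```ruby
# Stand-in for ActiveRecord::RecordNotUnique (assumption for this sketch).
class RecordNotUnique < StandardError; end

module BoundedPkeyRecovery
  MAX_PKEY_RETRIES = 1

  def save_record
    # Locals survive `retry` within the same method call, so ||= keeps the count.
    retries ||= 0
    super
  rescue RecordNotUnique => e
    raise unless e.message.include?('_pkey') && retries < MAX_PKEY_RETRIES
    retries += 1
    reset_sequence!
    retry
  end
end

# Fake model that fails a configurable number of times before saving.
class FakeModel
  prepend BoundedPkeyRecovery
  attr_reader :sequence_resets

  def initialize(failures)
    @failures = failures
    @sequence_resets = 0
  end

  def save_record
    if @failures > 0
      @failures -= 1
      raise RecordNotUnique, 'duplicate key value violates unique constraint "things_pkey"'
    end
    :saved
  end

  def reset_sequence!
    @sequence_resets += 1
  end
end

puts FakeModel.new(1).save_record
```

With one failure the save recovers after a single reset; with two, the second error is re-raised rather than retried again.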
I'm trying to run a command that might fail sometimes. When it fails, it throws an exception.
What I'd like it to do is just log the error quietly and continue executing the next line below it, rather than aborting and going into the 'rescue' block. How should I approach this?
My current code is as follows:
rescue_from 'Gibbon::MailChimpError' do |exception|
logger.error("MAILCHIMP: #{exception}")
end
When I call the Mailchimp API, sometimes there is an error, and this disrupts the flow of my application. I just want it to carry on executing as if nothing has happened, and just note there was an error in the log.
How about something like this:
def rescuing(&block)
begin
yield
rescue NameError => e
puts "(Just rescued: #{e.inspect})"
end
end
rescuing do
puts "This is dangerous"
raise NameError
end
puts "... but I'm still alive"
Obviously, you'd have to replace NameError with the exception you want to be protected against.
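Applied to the logging requirement in the question, the same idea might look like the sketch below. FakeMailchimpError is a made-up stand-in for Gibbon::MailChimpError, and the logger writes to a StringIO only so the example is self-contained; in a Rails app you would presumably use Rails.logger instead.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in for Gibbon::MailChimpError.
class FakeMailchimpError < StandardError; end

LOG_OUTPUT = StringIO.new
LOGGER = Logger.new(LOG_OUTPUT)

# Runs the block; if it raises error_class, logs the error and returns nil
# so the caller simply carries on with its next line.
def log_and_continue(error_class)
  yield
rescue error_class => e
  LOGGER.error("MAILCHIMP: #{e.class}: #{e.message}")
  nil
end

def subscribe_then_continue
  steps = []
  log_and_continue(FakeMailchimpError) do
    steps << :call_api
    raise FakeMailchimpError, 'list not found'
  end
  steps << :next_line # still runs after the failure
  steps
end

p subscribe_then_continue
```

Only the named error class is swallowed; anything else still propagates, which keeps unrelated bugs from being hidden.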
I have a rake task that is responsible for doing batch processing on millions of URLs. Because this process takes so long I sometimes find that URLs I'm trying to process are no longer valid -- 404s, site's down, whatever.
When I initially wrote this there was basically just one site that would continually go down while processing so my solution was to use open-uri, rescue any exceptions produced, wait a bit, and then retry.
This worked fine when the dataset was smaller but now so much time goes by that I'm finding URLs are no longer there anymore and produce a 404.
Using the case of a 404, when this happens my script just sits there and loops infinitely -- obviously bad.
How should I handle cases where a page doesn't load successfully, and more importantly how does this fit into the "stack" I've built?
I'm pretty new to this, and Rails, so any opinions on where I might have gone wrong in this design are welcome!
Here is some anonymized code that shows what I have:
The rake task that makes a call to MyHelperModule:
# lib/tasks/my_app_tasks.rake
namespace :my_app do
desc "Batch processes some stuff @ a later time."
task :process_the_batch => :environment do
# The dataset being processed
# is millions of rows so this is a big job
# and should be done in batches!
MyModel.where(some_thing: nil).find_in_batches do |my_models|
MyHelperModule.do_the_process my_models: my_models
end
end
end
MyHelperModule accepts my_models and does further stuff with ActiveRecord. It calls SomeClass:
# lib/my_helper_module.rb
module MyHelperModule
def self.do_the_process(args = {})
my_models = args[:my_models]
# Parallel.each(my_models, :in_processes => 5) do |my_model|
my_models.each do |my_model|
# Reconnect to prevent errors with Postgres
ActiveRecord::Base.connection.reconnect!
# Do some active record stuff
some_var = SomeClass.new(my_model.id)
# Do something super interesting,
# fun,
# AND sexy with my_model
end
end
end
SomeClass will go out to the web via WebpageHelper and process a page:
# lib/some_class.rb
require_relative 'webpage_helper'
class SomeClass
attr_accessor :some_data
def initialize(arg)
doc = WebpageHelper.get_doc("http://somesite.com/#{arg}")
# do more stuff
end
end
WebpageHelper is where the exception is caught and an infinite loop is started in the case of 404:
# lib/webpage_helper.rb
require 'nokogiri'
require 'open-uri'
class WebpageHelper
def self.get_doc(url)
begin
page_content = open(url).read
# do more stuff
rescue Exception => ex
puts "Failed at #{Time.now}"
puts "Error: #{ex}"
puts "URL: " + url
puts "Retrying... Attempt #: #{attempts.to_s}"
attempts = attempts + 1
sleep(10)
retry
end
end
end
TL;DR
Use out-of-band error handling and a different conceptual scraping model to speed up operations.
Exceptions Are Not for Common Conditions
There are a number of other answers that address how to handle exceptions for your use case. I'm taking a different approach by saying that handling exceptions is fundamentally the wrong approach here for a number of reasons.
In his book Exceptional Ruby, Avdi Grimm provides some benchmarks showing the performance of exceptions as ~156% slower than using alternative coding techniques such as early returns.
In The Pragmatic Programmer: From Journeyman to Master, the authors state "[E]xceptions should be reserved for unexpected events." In your case, 404 errors are undesirable, but are not at all unexpected--in fact, handling 404 errors is a core consideration!
In short, you need a different approach. Preferably, the alternative approach should provide out-of-band error handling and prevent your process from blocking on retries.
One Alternative: A Faster, More Atomic Process
You have a lot of options here, but the one I'm going to recommend is to handle 404 status codes as a normal result. This allows you to "fail fast," but also allows you to retry pages or remove URLs from your queue at a later time.
Consider this example schema:
ActiveRecord::Schema.define(:version => 20120718124422) do
create_table "webcrawls", :force => true do |t|
t.text "raw_html"
t.integer "retries"
t.integer "status_code"
t.text "parsed_data"
t.datetime "created_at", :null => false
t.datetime "updated_at", :null => false
end
end
The idea here is that you would simply treat the entire scrape as an atomic process. For example:
Did you get the page?
Great, store the raw page and the successful status code. You can even parse the raw HTML later, in order to complete your scrapes as fast as possible.
Did you get a 404?
Fine, store the error page and the status code. Move on quickly!
When your process is done crawling URLs, you can then use an ActiveRecord lookup to find all the URLs that recently returned a 404 status so that you can take appropriate action. Perhaps you want to retry the page, log a message, or simply remove the URL from your list of URLs to scrape--"appropriate action" is up to you.
By keeping track of your retry counts, you could even differentiate between transient errors and more permanent errors. This allows you to set thresholds for different actions, depending on the frequency of scraping failures for a given URL.
This approach also has the added benefit of leveraging the database to manage concurrent writes and share results between processes. This would allow you to parcel out work (perhaps with a message queue or chunked data files) among multiple systems or processes.
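A rough sketch of the "404 is just another result" idea, without a database: CrawlResult is an in-memory stand-in for a webcrawls row, and crawl, needing_attention and the fetcher lambda are names invented for this illustration. In a real task the fetcher could wrap something like Net::HTTP.get_response and return its status code and body.

```ruby
# In-memory stand-in for a row of the webcrawls table (assumption).
CrawlResult = Struct.new(:url, :status_code, :raw_html, :retries)

# `fetcher` is any callable that returns [status_code, body] for a URL.
# No exception handling is needed for a 404: it is stored like any other
# result. `previous` is the earlier result for the same URL, if any.
def crawl(url, fetcher, previous: nil)
  status_code, body = fetcher.call(url)
  retries = previous ? previous.retries + 1 : 0
  CrawlResult.new(url, status_code, body, retries)
end

# After the crawl, pick out the 404s that have not hit the retry threshold.
def needing_attention(results, max_retries: 3)
  results.select { |r| r.status_code == 404 && r.retries < max_retries }
end

# Usage with a fake fetcher, so no network access is required.
fake_fetcher = lambda do |url|
  url.end_with?('/missing') ? [404, 'Not Found'] : [200, '<html>ok</html>']
end

results = %w[http://example.com/ok http://example.com/missing].map do |url|
  crawl(url, fake_fetcher)
end
needing_attention(results).each { |r| puts "follow up on #{r.url}" }
```

Because the crawl step never blocks on retries, the follow-up policy can be decided later from the stored status codes and retry counts.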
Final Thoughts: Scaling Up and Out
Spending less time on retries or error handling during the initial scrape should speed up your process significantly. However, some tasks are just too big for a single-machine or single-process approach. If your process speedup is still insufficient for your needs, you may want to consider a less linear approach using one or more of the following:
Forking background processes.
Using dRuby to split work among multiple processes or machines.
Maximizing core usage by spawning multiple external processes using GNU parallel.
Something else that isn't a monolithic, sequential process.
Optimizing the application logic should suffice for the common case, but if not, consider scaling up to more processes or out to more servers. Scaling out will certainly be more work, but will also expand the processing options available to you.
Curb has an easier way of doing this and can be a better (and faster) option than open-uri.
Errors Curb reports (and that you can rescue from and do something about):
http://curb.rubyforge.org/classes/Curl/Err.html
Curb gem:
https://github.com/taf2/curb
Sample code:
def browse(url)
c = Curl::Easy.new(url)
begin
c.connect_timeout = 3
c.perform
return c.body_str
rescue Curl::Err::NotFoundError
handle_not_found_error(url)
end
end
def handle_not_found_error(url)
puts "This is a 404!"
end
You could just raise the 404's:
rescue Exception => ex
raise ex if ex.message['404']
# retry for non-404s
end
It all just depends on what you want to do with 404's.
Let's assume that you just want to swallow them. Part of pguardiario's response is a good start: you can raise an error and retry a few times...
# lib/webpage_helper.rb
require 'nokogiri'
require 'open-uri'
class WebpageHelper
def self.get_doc(url)
attempt_number = 0
begin
attempt_number = attempt_number + 1
page_content = open(url).read
# do more stuff
rescue Exception => ex
puts "Failed at #{Time.now}"
puts "Error: #{ex}"
puts "URL: " + url
puts "Retrying... Attempt #: #{attempt_number}"
sleep(10)
retry if attempt_number < 10 # Try ten times.
end
end
end
If you followed this pattern, it would just fail silently. Nothing would happen, and it would move on after ten attempts. I would generally consider that a Bad Plan(tm). Instead of just failing out silently, I would go for something like this in the rescue clause:
rescue Exception => ex
if attempt_number < 10 # Try ten times.
retry
else
raise "Unable to contact #{url} after ten tries."
end
end
and then throw something like this in MyHelperModule.do_the_process (you'd have to update your database to have failed and error_message columns; avoid naming a column errors, since ActiveRecord already defines an errors method on every model):
my_models.each do |my_model|
# ... cut ...
begin
some_var = SomeClass.new(my_model.id)
rescue Exception => e
# Record the failure on the row and move on to the next model
my_model.update_attributes(failed: true, error_message: e.message)
next
end
# ... cut ...
end
That's probably the easiest and most graceful way to do it with what you currently have. That said, handling that many requests in one massive rake task is not very elegant. You can't restart it if something goes wrong, it ties up a single process on your system for a long time, and so on. If you end up with any memory leaks (or infinite loops!), you find yourself in a place where you can't just say 'move on'. You should probably be using some kind of queueing system like Resque, Sidekiq, or Delayed Job (though it sounds like you have more items to queue than Delayed Job would happily handle). I'd recommend digging into those if you're looking for a more elegant approach.
I actually have a rake task that does something remarkably similar. Here is the gist of what I did to deal with 404s; you could apply it pretty easily.
Basically, you use the following code as a filter and create a logfile to store your errors, before you grab the website and process it.
First, create/instantiate a logfile in your file:
@logfile = File.open("404_log_#{Time.now.strftime('%m-%d-%Y')}.txt", "w")
# #{Time.now.strftime('%m-%d-%Y')} just includes the date in the log file name in case you want
# to run diffs on your log files. Dashes are used because slashes are path separators.
Then change your WebpageHelper class to something like this:
require 'net/http'
require 'open-uri'
class WebpageHelper
def self.get_doc(url)
# Ask for the response first so the status code can be checked
response = Net::HTTP.get_response(URI.parse(url))
if response.code.to_i == 404
notify_me(url)
else
page_content = open(url).read
# do more stuff
end
end
end
What this does is ping the page for a response code. The if statement checks whether the response code is a 404; if it is, it runs the notify_me method, otherwise it runs your commands as usual. I just arbitrarily created that notify_me method as an example. On my system I have it write to a text file that gets emailed to me upon completion. You could use a similar method to look at other response codes.
Generic logging method:
def notify_me(url)
puts "Failed at #{Time.now}"
puts "URL: " + url
@logfile.puts("There was a 404 error for the site #{url} at #{Time.now}.")
end
Regarding the problem you're experiencing, you can do the following:
class WebpageHelper
def self.get_doc(url)
retried = false
begin
page_content = open(url).read
# do more stuff
rescue OpenURI::HTTPError => ex
unless ex.io.status.first.to_i == 404
log_error ex.message
sleep(10)
unless retried
retried = true
retry
end
end
# FIXME: needs some refactoring
rescue Exception => ex
puts "Failed at #{Time.now}"
puts "Error: #{ex}"
puts "URL: " + url
puts "Retrying... Attempt #: #{attempts.to_s}"
attempts = attempts + 1
sleep(10)
retry
end
end
end
But I'd rewrite the whole thing in order to do parallel processing with Typhoeus:
https://github.com/typhoeus/typhoeus
where I'd assign a callback block which would do the handling of the returned data, thus decoupling the fetching of the page and the processing.
Something along these lines:
def on_complete(response)
end
def on_failure(response)
end
def run
hydra = Typhoeus::Hydra.new
reqs = urls.collect do |url|
Typhoeus::Request.new(url).tap do |req|
# Register the handler as the request's completion callback, then queue it
req.on_complete(&method(:on_complete))
hydra.queue(req)
end
end
hydra.run
# do something with all requests after all requests were performed, if needed
end
I think everyone's comments on this question are spot on and correct. There is a lot of good info on this page. Here is my attempt at collecting this very hefty bounty. That being said, +1 to all answers.
If you are only concerned with 404s using OpenURI, you can handle just those types of exceptions:
# lib/webpage_helper.rb
rescue OpenURI::HTTPError => ex
# handle OpenURI HTTP Error!
rescue Exception => e
# similar to the original
case e.message
when /404/ then puts '404!'
when /500/ then puts '500!'
# etc ...
end
end
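Matching on the message text works, but OpenURI::HTTPError also carries the response in its io attribute, whose status is an array of the code and the reason phrase. A small sketch of classifying on that instead is below; classify_http_error and its return symbols are made up for this illustration, and FakeIO exists only so the example runs without network access.

```ruby
require 'open-uri'

# Decides what to do with an OpenURI::HTTPError based on its status code.
# The symbols returned are arbitrary labels chosen for this sketch.
def classify_http_error(error)
  code = error.io.status.first.to_i
  case code
  when 404 then :skip          # page is gone, retrying will not help
  when 500..599 then :retry_later # server-side trouble, may be transient
  else :report
  end
end

# Minimal fake of the io object, which only needs to respond to #status.
FakeIO = Struct.new(:status)

not_found = OpenURI::HTTPError.new('404 Not Found', FakeIO.new(['404', 'Not Found']))
puts classify_http_error(not_found)
```

In the helper itself this would sit inside the rescue OpenURI::HTTPError clause, so that only the :retry_later case reaches the sleep and retry.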
If you want a bit more, you can do different exception handling per type of error.
# lib/webpage_helper.rb
rescue OpenURI::HTTPError => ex
# do OpenURI HTTP ERRORS
rescue SyntaxError => ex
# do Syntax Errors
rescue Exception => ex
# do what we were doing before
Also, I like what is said in the other posts about the number of attempts. It makes sure there isn't an infinite loop.
I think the Rails thing to do after a number of attempts would be to log, queue, and/or email.
To log, you can use:
webpage_logger = Log4r::Logger.new("webpage_helper_logger")
# somewhere later
# ie 404
case e.message
when /404/
then
webpage_logger.debug "debug level error #{attempts.to_s}"
webpage_logger.info "info level error #{attempts.to_s}"
webpage_logger.fatal "fatal level error #{attempts.to_s}"
There are many ways to queue.
I think some of the best are faye and resque. Here is a link to both:
http://faye.jcoglan.com/
https://github.com/defunkt/resque/
Queues work just like a line. Believe it or not, the Brits call lines "queues" (the more you know). Using a queuing server, you can line up many requests, and when the server you are trying to reach comes back, you can hammer it with the requests in the queue, thus forcing their server to go down again, but hopefully over time they will upgrade their machines because they keep crashing.
And finally, to email: Rails also to the rescue (not resque)...
Here is the link to rails guide on ActionMailer: http://guides.rubyonrails.org/action_mailer_basics.html
You could have a mailer like this
class SomeClassMailer < ActionMailer::Base
default :from => "notifications@example.com"
def self.mail(*args)
...
# then later
rescue Exception => e
# A case/when cannot combine a regex with a second condition, so use if
if e.message =~ /404/ && attempts == 3
SomeClassMailer.mail(:to => "broken@example.com", :subject => "Failure ! #{attempts}")
end
Instead of using initialize, which always returns a new instance of an object, when creating a new SomeClass from a scraping, I'd use a class method to create the instance. I'm not using exceptions here beyond what nokogiri is throwing because it sounds like nothing else should bubble up further since you just want these to be logged, but otherwise be ignored. You mentioned logging the exceptions--are you just logging what goes to stdout? I'll answer as if you are...
# lib/my_helper_module.rb
module MyHelperModule
def self.do_the_process(args = {})
my_models = args[:my_models]
# Parallel.each(my_models, :in_processes => 5) do |my_model|
my_models.each do |my_model|
# Reconnect to prevent errors with Postgres
ActiveRecord::Base.connection.reconnect!
some_object = SomeClass.create_from_scrape(my_model.id)
if some_object
# Do something super interesting if you were able to get a scraping
# otherwise nothing happens (except it is noted in our logging elsewhere)
end
end
end
end
Your SomeClass:
# lib/some_class.rb
require_relative 'webpage_helper'
class SomeClass
attr_accessor :some_data
def initialize(doc)
@doc = doc
end
# could shorten this, but you get the idea...
def self.create_from_scrape(arg)
doc = WebpageHelper.get_doc("http://somesite.com/#{arg}")
if doc
return SomeClass.new(doc)
else
return nil
end
end
end
Your WebpageHelper:
# lib/webpage_helper.rb
require 'nokogiri'
require 'open-uri'
class WebpageHelper
def self.get_doc(url)
attempts = 0 # define attempts first in non-block local scope before using it
begin
page_content = open(url).read
# do more stuff
rescue Exception => ex
attempts += 1
puts "Failed at #{Time.now}"
puts "Error: #{ex}"
puts "URL: " + url
if attempts < 3
puts "Retrying... Attempt #: #{attempts.to_s}"
sleep(10)
retry
else
return nil
end
end
end
end
How do I pass an error from my Module back to the rake task that called it?
My rake task looks like this:
require 'mymodule.rb'
task :queue => :environment do
OPERATOR = Mymodule::Operator.new
begin
OPERATOR.initiate_call (1234567189)
rescue StandardError => bang
puts "Shit happened: #{ bang} "
end
end
And here is my module..
module Mymodule
class Operator
def initiate_call (number)
begin
# make the call
rescue StandardError => bang
flash[:error] = "Error #{bang}"
return
end
end
end
end
I also call this module from a controller so it would be nice to have an error handling solution that is more or less agnostic.
Running Rails 3. Any unrelated comments (i.e. suggestions) on my code structure are more than welcomed :)
Your Operator#initiate_call method traps StandardError exceptions so your rake task will never see them. I'd drop the rescue from initiate_call and let the caller deal with all the exception handling. Then, you'd have flash[:error] = "Error #{bang}" in your controller's exception handler and the rake task would remain as-is.
The basic approach is to push the error handling up the call stack all the way to someone that can do something about it; initiate_call can't really do anything useful with the exception so it shouldn't try to handle it.
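As a rough sketch of that structure: the Operator below raises and does no rescuing, and each caller decides what to do with the error. CallFailed, the ten-digit check, run_from_rake and run_from_controller are all hypothetical names and rules invented for this example, and flash is passed in as a plain Hash so the code runs outside Rails.

```ruby
module Mymodule
  # Hypothetical error class for this sketch.
  class CallFailed < StandardError; end

  class Operator
    # No rescue here: the method cannot do anything useful with a failure,
    # so it lets the exception travel up to the caller.
    def initiate_call(number)
      raise CallFailed, "could not reach #{number}" unless number.to_s.length == 10
      "calling #{number}"
    end
  end
end

# Rake-task-style caller: report the error as text.
def run_from_rake(number)
  Mymodule::Operator.new.initiate_call(number)
rescue StandardError => e
  "Something went wrong: #{e.message}"
end

# Controller-style caller: put the error into the flash.
def run_from_controller(number, flash)
  Mymodule::Operator.new.initiate_call(number)
rescue StandardError => e
  flash[:error] = "Error #{e.message}"
  nil
end

puts run_from_rake(123)
```

The module stays agnostic about where it is called from, which is what makes it reusable from both the rake task and the controller.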