Check if Nokogiri HTML document is usable - ruby-on-rails

I want to check if the URL that the user inputs is in fact a valid page.
I tried:
if Nokogiri::HTML(open("http://example.com"))
  # DO REQUIRED TASK
end
But that immediately raises an error when it fails to open the page. Instead, I want to get back a result telling me whether it is a document of any kind.
I either get the error:
no such file or directory
or:
getaddrinfo: Name or service not known
depending on how I try to make the check.

I'd start with something like:
require 'nokogiri'
require 'open-uri'

begin
  doc = Nokogiri.HTML(open(url))
rescue Exception => e
  puts "Couldn't read \"#{ url }\": #{ e }"
  exit
end

puts(doc.errors.empty? ? "No problems found" : doc.errors)
Nokogiri sets the document's errors array to the values of any errors that occurred during the parsing process.
This only addresses one part of the issue though. Malicious people like to break things, and this would be very easy to break. In general, be very careful about anything a user gives you, especially if your site is exposed to the wild internet.
Prior to telling OpenURI to load the file to give to Nokogiri, you should sniff that URL and do some sanity checks using an HTTP HEAD request to find out the size and MIME type of the content being retrieved. Once you know those, you can try loading the file.
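For example, here's a minimal sketch of that sniffing step using Net::HTTP. The helper name, the HTML-only check, and the 5 MB cap are my own choices, not something the answer above prescribes:

require 'net/http'
require 'uri'

# Hypothetical helper: true if the URL answers a HEAD request with a
# reasonable size and MIME type.
def looks_fetchable?(url, max_bytes = 5_000_000)
  uri = URI.parse(url)
  return false unless uri.is_a?(URI::HTTP) # URI::HTTPS is a subclass

  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end

  response.is_a?(Net::HTTPSuccess) &&
    response['Content-Type'].to_s.start_with?('text/html') &&
    response['Content-Length'].to_i <= max_bytes # note: some servers omit Content-Length
rescue SocketError, Errno::ECONNREFUSED, Errno::ETIMEDOUT, URI::InvalidURIError
  false
end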

Firstly, it's bad style to 'rescue Exception => e' in Ruby.
[Refer: http://daniel.fone.net.nz/blog/2013/05/28/why-you-should-never-rescue-exception-in-ruby/ ]
Secondly, for this case, "rescue OpenURI::HTTPError => e" would be more suitable.
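Applied to the first answer's snippet, that looks like the sketch below. Note that a DNS failure such as "getaddrinfo: Name or service not known" raises SocketError, not OpenURI::HTTPError, so it is rescued separately here:

require 'nokogiri'
require 'open-uri'

begin
  doc = Nokogiri::HTML(open(url))
rescue OpenURI::HTTPError => e
  puts "Server error for \"#{url}\": #{e}"
rescue SocketError => e
  puts "Could not resolve \"#{url}\": #{e}"
end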

I'm not familiar with handling exceptions, but something like:

begin
  page = Nokogiri::HTML(open("http://example.com"))
rescue StandardError
  puts "not a document of any kind"
end

do_something_with(page) if page

...should do the trick.
or (after reading your comment):

begin
  page = open("http://example.com")
rescue StandardError
  puts "not a document of any kind"
end

Nokogiri::HTML(page) if page

Related

How to "reload" a cloudflare 520 request with ruby?

I wrote a ruby script to download an image URL:
require 'open-uri'

imageAddress = ARGV[0]
targetPath = ARGV[1]
fullFileNamePath = "#{targetPath}test.jpg"

begin
  File.open(fullFileNamePath, 'wb') do |fo|
    fo.write open(imageAddress).read
  end
rescue OpenURI::HTTPError => ex
  puts ex
  File.delete(fullFileNamePath)
end
Example Usage:
ruby download_image.rb "https://images.genius.com/b015b15e476c92d10a834d523575d3c9.1000x1000x1.jpg" "/Users/Me/Downloads/"
The problem is, sometimes I run across this output error:
520 Origin Error
Then, when I try the same URL in my browser, I get a Cloudflare error page with a 'Retry for a live version' button.
If I reload the page or click that button, the page loads.
Then if I run the script again it downloads the image just fine.
So how can I replicate this page reload / 'Retry for a live version' behavior using ruby and without switching to my browser? Running the script again doesn't do the job.
It sounds like you are looking for a delay and retry: if the script fails (or encounters '520 Origin Error'), wait and re-try.
Here is a quickly built recursive function; you may want to add a check for how many times you have looped, breaking after so many attempts, as in the capped variant sketched after it. (Also not tested, may contain errors; meant as an example.)
def getFile(imageAddress, fullFileNamePath)
  File.open(fullFileNamePath, 'wb') do |fo|
    fo.write open(imageAddress).read
  end
rescue OpenURI::HTTPError => ex
  puts ex
  File.delete(fullFileNamePath) if File.exist?(fullFileNamePath)
  if ex.message.include?('520') # exception messages look like '520 Origin Error'
    sleep(30) # generally a good time to pause
    getFile(imageAddress, fullFileNamePath)
  end
end
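For instance, capping the retries only needs an attempt counter threaded through the call; the counter and the limit of 5 are my own additions:

def getFile(imageAddress, fullFileNamePath, attempts = 0)
  File.open(fullFileNamePath, 'wb') do |fo|
    fo.write open(imageAddress).read
  end
rescue OpenURI::HTTPError => ex
  puts ex
  File.delete(fullFileNamePath) if File.exist?(fullFileNamePath)
  if ex.message.include?('520') && attempts < 5
    sleep(30)
    getFile(imageAddress, fullFileNamePath, attempts + 1)
  end
end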

Rails formatting logs to use with aws-logs and CloudWatch

AWS has this very cool log collection tool, aws-logs.
However, I do not understand how I can format my logs / configure the tool to be smarter and regroup the same error message. Right now AWS shows one message per line (because every line is timestamped).
My current log configuration indeed captures one new log entry per line. How can I get around that?
[rails/production.log]
file = /var/www/xxx/shared/log/production.log
log_group_name = /rails/production.log
log_stream_name = {instance_id}
time_zone = LOCAL
datetime_format = %Y-%m-%dT%H:%M:%S
I actually partly solved the problem using lograge and JSON output, which is parsed correctly by Amazon and lets you regroup most requests correctly.
However, I still have some problems with errors, which are not output the same way and still generate one awslogs event per stack-trace line.
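One agent-side option worth noting: the awslogs agent config supports a multi_line_start_pattern setting, so lines that do not begin with a timestamp (such as stack-trace lines) get folded into the previous event. A sketch extending the config above (check that your agent version supports the option):

[rails/production.log]
file = /var/www/xxx/shared/log/production.log
log_group_name = /rails/production.log
log_stream_name = {instance_id}
time_zone = LOCAL
datetime_format = %Y-%m-%dT%H:%M:%S
# Any line not matching datetime_format is appended to the previous event
multi_line_start_pattern = {datetime_format}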
EDIT: We are now using a Rails API, and regular exceptions thrown during JSON requests are rescued with a JSON:API error renderer. Furthermore, we are using Rollbar to log actual errors, so having the full error log has become irrelevant.
In our API::ApplicationController:
# We don't want error reports for these errors
RESCUABLE_ERRORS = [
  ActionController::ParameterMissing,
  ActiveModel::ForbiddenAttributesError,
  StrongerParameters::InvalidParameter,
  Mongoid::Errors::Validations,
  Mongoid::Errors::DocumentNotFound
]

# Note that in tests we actually do not want to rescue non-RuntimeError
# exceptions straight away, because that most likely indicates a real bug you
# should fix; in production, though, we want to rescue any error so the
# frontend gets a JSON:API error instead of the default HTML response.
rescue_from(Rails.env.test? ? RuntimeError : Exception) do |e|
  handle_exception(e)
  notify_exception(e, 'Rescued from API controller - Rendering JSONAPI Error')
end

rescue_from(*RESCUABLE_ERRORS) do |e|
  handle_exception(e)
end
In our controllers that inherit from API::ApplicationController, we add as many rescue_from lines as needed, depending on whether we want to report the exception as an error (notify_exception) or just convert it to a JSON payload (handle_exception):
rescue_from(SPECIFIC_ERROR_CLASS) do |exception|
  handle_exception(exception) # will render a json:api error payload
  # notify_exception(exception) # Optional: ExceptionNotifier to broadcast the error to email/Rollbar, etc. if this error should not happen
end
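For reference, a minimal sketch of what those two helpers might look like; the original post does not show their implementations, so the payload shape and the Rollbar call are assumptions:

# Hypothetical implementations -- the names match the post, the bodies are guesses.
def handle_exception(exception, status: 500)
  render json: {
    errors: [{
      status: status.to_s,
      title: exception.class.name,
      detail: exception.message
    }]
  }, status: status
end

def notify_exception(exception, message = nil)
  Rollbar.error(exception, message) # the post mentions Rollbar for actual errors
end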

How to retry a rake task if you get a Bad Gateway error response from a web source

I am trying to run a rake task to get all the data with a specific tag from Instagram, and then input some of the data into my server.
The task runs just fine, except sometimes I'll get an error response. It seems to happen at random, and since it's a fairly long-running task, it'll happen eventually.
This is the error on my console:
Instagram::BadGateway: GET https://api.instagram.com/v1/tags/xxx/media/recent.json?access_token=xxxxx&max_id=996890856542960826: 502: The server returned an invalid or incomplete response.
When this happens, I don't know what else to do except run the task again starting from that max_id. However, it would be nice if I could get the whole thing to automate itself, and retry itself from that point when it gets that error.
My task looks something like this:
task :download => :environment do
  igs = Instagram.tag_recent_media("xxx")
  begin
    sleep 0.2
    igs.each do |ig|
      dl = Instadownload.new
      dl.instagram_url = ig.link
      dl.image_url = ig.images.standard_resolution.url
      dl.caption = ig.caption.text if ig.caption
      dl.taken_at = Time.at(ig.created_time.to_i)
      dl.save!
    end
    if igs.pagination.next_max_id?
      igs = Instagram.tag_recent_media("xxx", max_id: igs.pagination.next_max_id)
      moreigs = true
    else
      moreigs = false
    end
  end while moreigs
end
Chad Pytel and Tammer Saleh call this the "Fire and Forget" antipattern in their Rails AntiPatterns book:

Assuming that the request always succeeds or simply not caring if it fails may be valid in rare circumstances, but in most cases it's insufficient. On the other hand, rescuing all the exceptions would be a bad practice as well. The proper solution would be to understand the actual exceptions that will be raised by the external service and rescue those only.
So what you should do is wrap your code block in a begin/rescue block with the appropriate set of errors raised by Instagram (the list of those errors can be found here). I'm not sure which particular line of your code snippet ends with the 502 code, so just to give you an idea of what it could look like:
begin
  dl = Instadownload.new
  dl.instagram_url = ig.link
  dl.image_url = ig.images.standard_resolution.url
  dl.caption = ig.caption.text if ig.caption
  dl.taken_at = Time.at(ig.created_time.to_i)
  dl.save!
rescue Instagram::BadGateway => e # list of acceptable errors can be expanded
  retry # restart from the beginning
end
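One caveat: a bare retry loops forever if the endpoint keeps failing. A common variation is to count attempts and re-raise after a few; the counter and the limit of 3 below are my own additions:

attempts = 0
begin
  dl = Instadownload.new
  dl.instagram_url = ig.link
  dl.image_url = ig.images.standard_resolution.url
  dl.caption = ig.caption.text if ig.caption
  dl.taken_at = Time.at(ig.created_time.to_i)
  dl.save!
rescue Instagram::BadGateway
  attempts += 1
  retry if attempts < 3 # give up after three failed tries
  raise
end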

Does Ruby's 'open_uri' reliably close sockets after read or on fail?

I have been using open_uri to pull down an ftp path as a data source for some time, but suddenly found that I'm getting nearly continual "530 Sorry, the maximum number of allowed clients (95) are already connected."
I am not sure whether my code is faulty or whether someone else is hammering the server, and unfortunately there's no way for me to know for sure who's at fault.
Essentially I am reading FTP URIs with:
def self.read_uri(uri)
  begin
    uri = open(uri).read
    uri == "Error" ? nil : uri
  rescue OpenURI::HTTPError
    nil
  end
end
I'm guessing that I need to add some additional error handling code in here...
I want to be sure that I take every precaution to close down all connections so that my connections are not the problem in question, however I thought that open_uri + read would take this precaution vs using the Net::FTP methods.
The bottom line is I've got to be 100% sure that these connections are being closed and I don't somehow have a bunch open connections laying around.
Can someone please advise as to correctly using read_uri to pull in ftp with a guarantee that it's closing the connection? Or should I shift the logic over to Net::FTP which could yield more control over the situation if open_uri is not robust enough?
If I do need to use the Net::FTP methods instead, is there a read method that I should be familiar with vs pulling it down to a tmp location and then reading it (as I'd much prefer to keep it in a buffer vs the fs if possible)?
I suspect you are not closing the handles. OpenURI's docs start with this comment:
It is possible to open http/https/ftp URL as usual like opening a file:
open("http://www.ruby-lang.org/") {|f|
f.each_line {|line| p line}
}
I looked at the source and the open_uri method does close the stream if you pass a block, so, tweaking the above example to fit your code:
uri = ''
open("http://www.ruby-lang.org/") {|f|
  uri = f.read
}
Should get you close to what you want.
Here's one way to handle exceptions:
# The list of URLs to pass in, to check if one times out or is refused.
urls = %w[
  http://www.ruby-lang.org/
  http://www2.ruby-lang.org/
]

# the method
def self.read_uri(urls)
  content = ''
  open(urls.shift) { |f| content = f.read }
  content == "Error" ? nil : content
rescue OpenURI::HTTPError
  retry if urls.any?
  nil
end
Try using a block:
data = open(uri){|f| f.read}
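On the Net::FTP part of the question: you can read straight into memory without a temp file, since Net::FTP#getbinaryfile returns the data as a String when you pass nil as the local file. A minimal sketch (host, credentials, and path are placeholders):

require 'net/ftp'

def read_ftp(host, path, user = 'anonymous', password = nil)
  # The block form guarantees the connection is closed, like open-uri's block form.
  Net::FTP.open(host) do |ftp|
    ftp.login(user, password)
    ftp.getbinaryfile(path, nil) # nil local file => contents returned as a String
  end
end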

Check if a file exists on another server - ruby-on-rails

I have a situation where I need to check for a file on another server; if that file exists, I need to delete it from the current server. Can anybody help me?
You can place a script on the other server and ask it, in a RESTful way, to perform those tasks for you:

http://another.server/exists/:file_name
http://another.server/delete/:file_name

but you will have to think about the security aspects of this solution.
Also take a look at executing remote commands via ssh: http://bashcurescancer.com/run_remote_commands_with_ssh.html. Combined with passwordless ssh, this can be an acceptable way to run a command-line program that does what you need.
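If you go the ssh route, the net-ssh gem can do the check and the local delete in a few lines. A sketch: the host, user, and paths are placeholders, and key-based auth is assumed to be set up already:

require 'net/ssh'

Net::SSH.start('another.server', 'deploy') do |ssh|
  # `test -e` exits 0 if the remote file exists; echo a marker we can parse
  exists = ssh.exec!('test -e /remote/path/file.name && echo yes').to_s.strip == 'yes'
  File.delete('/local/path/file.name') if exists && File.exist?('/local/path/file.name')
end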
Just write a ruby script and do something along the lines of:

require "open-uri"

file_name = "file.name"

begin
  file = open("http://www.example.com/#{file_name}") # raises OpenURI::HTTPError if missing
  File.delete("path_to" + file_name)
  p "File #{file_name} deleted"
rescue OpenURI::HTTPError
  p "File not found"
end
