Ruby: Detecting broken links without actually crawling the URL? - ruby-on-rails

Is there a Ruby gem, or a Ruby-esque way, to check a webpage for broken links without crawling the actual links and checking for 404s, etc.? Basically, I want a solution that works offline, and I want to detect links that are obviously syntactically broken, not links that point to web pages that don't exist.
For instance, if a link points to "http//stackoverflow.com", that's a syntactically broken link, and I want to detect that. However, if a link points to "http://www.webpagedoesnotexistyet.com" and it returns a 404, I'm OK with not detecting that.

Use Nokogiri to parse the HTML and URI.parse to check for valid URLs. URI.parse will raise a URI::InvalidURIError if it encounters what it considers to be an invalid URL.
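A minimal sketch of that combination (the html variable and the decision to only inspect href attributes are assumptions here, not part of the original answer):

require 'nokogiri'
require 'uri'

# html is assumed to hold the page source as a string
doc = Nokogiri::HTML(html)

broken = doc.css('a[href]').map { |a| a['href'] }.select do |href|
  begin
    uri = URI.parse(href)
    # Links like "http//stackoverflow.com" parse without error but end up
    # with no scheme/host, so treat those as syntactically broken too.
    # (Relative links would also be flagged by this check.)
    uri.scheme.nil? || uri.host.nil?
  rescue URI::InvalidURIError
    true
  end
end

puts broken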

Use this (links below is an array of link strings):
require 'net/http'
require 'uri'

links.each do |link|
  begin
    url = URI.parse(link)
    req = Net::HTTP.new(url.host, url.port)
    res = req.request_head(url.path.empty? ? '/' : url.path)
    if res.code == "200"
      puts "#{res.code} ok - #{link}"
    else
      puts "#{res.code} error - #{link}"
    end
  rescue
    puts "breaking for #{link}"
  end
end

You can use URI.regexp. If a string matches it, it is a valid URI.
require 'uri'

def valid_uri?(s)
  !!(s =~ URI.regexp)
end

valid_uri?('http//stackoverflow.com')                # => false
valid_uri?('http://www.webpagedoesnotexistyet.com/') # => true
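One caveat worth hedging: with no arguments, URI.regexp will also match non-web schemes such as mailto: or ftp:. If only http/https links should count as valid, you can pass the schemes you accept and anchor the match (the helper name here is just illustrative):

require 'uri'

# Only treat absolute http(s) URLs as valid; everything else is "broken" for our purposes.
def valid_web_uri?(s)
  !!(s =~ /\A#{URI.regexp(%w(http https))}\z/)
end

valid_web_uri?('http//stackoverflow.com')   # => false
valid_web_uri?('ftp://example.com/file')    # => false
valid_web_uri?('http://stackoverflow.com')  # => true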

Related

Rails app to check the status of a server

I want to solve a problem where we manually go and check whether a webapp/server is up or down. I want to build a Rails app which can automate this task.
Consider my app URL to be: HostName:PORT/Route?Params (it may or may not have a port in the URL).
I checked 'net/http':
def check_status()
  @url = 'host'
  uri = URI(@url)
  http = Net::HTTP.new(@url, port)
  response = http.request_get('/<route>?<params>')
  if response == Net::HTTPSuccess
    @result = 'Running'
  else
    @result = 'Not Running'
  end
end
I am facing an error at:
response = http.request_get('/<route>?<params>')
When the app is down it throws 'Failed to open TCP connection to URL', which is correct.
Can you help me find a new solution, or suggest how I can improve the above implementation?
Since it's working as intended and you just need to handle the error that's raised when the app is down, wrap the request in a begin/rescue block:
def check_status()
  @url = 'host'
  uri = URI(@url)
  http = Net::HTTP.new(@url, port)
  begin
    response = http.request_get('/<route>?<params>')
  rescue TheClassNameOfThisErrorWhenSiteIsDown
    @result = 'Not Running'
  end
  # Note: compare with is_a? -- `response == Net::HTTPSuccess` is always false,
  # because response is an instance of that class, not the class itself.
  if response.is_a? Net::HTTPSuccess
    @result = 'Running'
  else
    @result = 'Not Running'
  end
end
Just came across this old question. Net::HTTP's get and head methods don't raise an exception on non-success HTTP responses, so you can use one of these instead:
def up?(site)
  Net::HTTP.new(site).head('/').kind_of? Net::HTTPOK
end

up? 'www.google.com' #=> true
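Putting the two answers together, a more defensive version might look roughly like the sketch below. The exact list of rescued exceptions is an assumption; adjust it to the errors you actually see in your environment.

require 'net/http'
require 'uri'
require 'timeout'

# Returns 'Running' or 'Not Running' for a full URL such as
# "http://hostname:port/route?params".
def check_status(url)
  uri = URI.parse(url)
  response = Net::HTTP.get_response(uri)
  response.is_a?(Net::HTTPSuccess) ? 'Running' : 'Not Running'
rescue Errno::ECONNREFUSED, SocketError, Timeout::Error
  # Connection refused, DNS failure, or timeout: treat the app as down.
  'Not Running'
end

check_status('http://localhost:3000/status?probe=1') # => "Running" or "Not Running"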

Is there a way to know if a page request came from the same application?

In Rails 3, is there a way to check if the page I'm rendering now was requested from the same application, without using a hardcoded domain name?
I currently have:
def back_link(car_id = '')
  # Check if search exists
  uri_obj = URI.parse(controller.request.env["HTTP_REFERER"]) if controller.request.env["HTTP_REFERER"].present?
  if uri_obj.present? && ["my_domain.com", "localhost"].include?(uri_obj.host) && uri_obj.query.present? && uri_obj.query.include?('search')
    link_to '◀ '.html_safe + t('back_to_search'), url_for(:back) + (car_id.present? ? '#' + car_id.to_s : ''), :class => 'button grey back'
  end
end
But this doesn't account for the "www." in front of the domain, or for other possible variations.
It would also be nice if I could find out the specific controller and action that were used in the previous page (the referrer).
I think you're looking at this the wrong way.
If you look around the web at sites with a search feature and follow a result link, you'll often see a param showing what was searched for.
That's a good way to do it.
Doing it by HTTP_REFERER seems a bit fragile, and won't work, for example, from a bookmark, or posted link.
e.g.
/cars/12?from_search=sports+cars
Then you can just look up params[:from_search].
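A minimal sketch of that idea (the from_search name, the :q search param, and the route helpers are just illustrative):

# In the search results view: tag each outgoing link with the current query.
link_to car.name, car_path(car, :from_search => params[:q])

# In the car view or helper: only show the back link when we arrived from a search.
if params[:from_search].present?
  link_to t('back_to_search'), cars_path(:q => params[:from_search])
end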
If you really need to do it by HTTP_REFERER then you probably don't have to worry about subdomains. Just:
def http_referer_uri
  request.env["HTTP_REFERER"] && URI.parse(request.env["HTTP_REFERER"])
end

def refered_from_our_site?
  if uri = http_referer_uri
    uri.host == request.host
  end
end

def refered_from_a_search?
  if refered_from_our_site?
    # to_s guards against a referer that has no query string at all
    http_referer_uri.try(:query).to_s['search']
  end
end
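Wired back into the original back_link helper, that might look something like the sketch below (method names and CSS classes are taken from the question):

def back_link(car_id = '')
  return unless refered_from_a_search?
  anchor = car_id.present? ? "##{car_id}" : ''
  link_to '◀ '.html_safe + t('back_to_search'),
          url_for(:back) + anchor,
          :class => 'button grey back'
end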
Try something like this:
ref = URI.parse(controller.request.env["HTTP_REFERER"])
if ref.host == ENV["HOSTNAME"]
  # do something
end
To try and get the controller/action from the referring page:
ActionController::Routing::Routes.recognize_path(ref.path)
#=> {:controller => "foo", :action => "bar"}
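Note that ActionController::Routing::Routes is the Rails 2 routing API; on Rails 3 (which the question targets) the equivalent call should be:

Rails.application.routes.recognize_path(ref.path)
#=> {:controller => "foo", :action => "bar"}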
Create an internal_request? method utilizing request.referrer.
Compare the host and port of the request.referrer with your application's host and port.
require 'uri' # might be necessary

def internal_request?
  return false if request.referrer.blank?

  referrer = URI.parse(request.referrer)
  application_host = Rails.application.config.action_mailer.default_url_options[:host]
  application_port = Rails.application.config.action_mailer.default_url_options[:port]

  return true if referrer.host == application_host && referrer.port == application_port
  false
end
And then call it like this where you need it, most likely in application_controller.rb:
if internal_request?
  do_something
end
Some caveats:
This might need to be modified if you're using subdomains. Easy, though.
This will require you to be setting your host and port for ActionMailer in your configuration, which is common.
You might want to make it the reverse, like external_request? since you're likely handling those situations uniquely. This would allow you to do something like this:
do_something_unique if external_request?
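A sketch of that inverted helper, built on the internal_request? method above:

def external_request?
  !internal_request?
end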

Using OpenUri, how can I get the contents of a redirecting page?

I want to get data from this page:
http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793
But that page forwards to:
http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?execution=eXs1
So, when I use open from OpenURI to try and fetch the data, it throws a RuntimeError saying HTTP redirection loop.
I'm not really sure how to get that data after it redirects and throws that error.
You need a tool like Mechanize. From its description:
The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, can follow links, and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.
which is exactly what you need. So,
sudo gem install mechanize
then
require 'mechanize'

agent = WWW::Mechanize.new
page  = agent.get "http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793"

page.content            # Get the resulting page as a string
page.body               # Get the body content of the resulting page as a string
page.search(".somecss") # Search for specific elements by XPath/CSS using nokogiri
and you're ready to rock 'n' roll.
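One caveat: the WWW::Mechanize constant comes from older releases of the gem; on current Mechanize versions the class is simply Mechanize, so the setup line becomes:

agent = Mechanize.new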
The site seems to be doing some of the redirection logic with sessions. If you don't send back the session cookies they are sending on the first request you will end up in a redirect loop. IMHO it's a crappy implementation on their part.
However, I tried to pass the cookies back to them, but I didn't get it to work, so I can't be completely sure that that is all that's going on here.
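For reference, a rough sketch of that manual cookie-replay idea with plain Net::HTTP (untested against this particular site, and as noted above it may well not be enough on its own):

require 'net/http'
require 'uri'

uri = URI.parse("http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793")

# First request: grab the session cookies and the redirect target.
first   = Net::HTTP.get_response(uri)
cookies = first.get_fields('set-cookie').to_a.map { |c| c.split(';').first }.join('; ')
target  = URI.parse(first['location'])

# Second request: follow the redirect manually, sending the cookies back.
request = Net::HTTP::Get.new(target.request_uri, 'Cookie' => cookies)
second  = Net::HTTP.start(target.host, target.port) { |http| http.request(request) }

puts second.body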
While Mechanize is a wonderful tool, I prefer to "cook" my own thing.
If you are serious about parsing, you can take a look at this code. It is used to crawl thousands of sites internationally every day and, as far as I have researched and tweaked, there isn't a more stable approach to this that also allows you to heavily customize it later for your needs.
require "open-uri"
require "zlib"
require "nokogiri"
require "sanitize"
require "htmlentities"
require "readability"
def crawl(url_address)
self.errors = Array.new
begin
begin
url_address = URI.parse(url_address)
rescue URI::InvalidURIError
url_address = URI.decode(url_address)
url_address = URI.encode(url_address)
url_address = URI.parse(url_address)
end
url_address.normalize!
stream = ""
timeout(8) { stream = url_address.open(SHINSO_HEADERS) }
if stream.size > 0
url_crawled = URI.parse(stream.base_uri.to_s)
else
self.errors << "Server said status 200 OK but document file is zero bytes."
return
end
rescue Exception => exception
self.errors << exception
return
end
# extract information before html parsing
self.url_posted = url_address.to_s
self.url_parsed = url_crawled.to_s
self.url_host = url_crawled.host
self.status = stream.status
self.content_type = stream.content_type
self.content_encoding = stream.content_encoding
self.charset = stream.charset
if stream.content_encoding.include?('gzip')
document = Zlib::GzipReader.new(stream).read
elsif stream.content_encoding.include?('deflate')
document = Zlib::Deflate.new().deflate(stream).read
#elsif stream.content_encoding.include?('x-gzip') or
#elsif stream.content_encoding.include?('compress')
else
document = stream.read
end
self.charset_guess = CharGuess.guess(document)
if not self.charset_guess.blank? and (not self.charset_guess.downcase == 'utf-8' or not self.charset_guess.downcase == 'utf8')
document = Iconv.iconv("UTF-8", self.charset_guess, document).to_s
end
document = Nokogiri::HTML.parse(document,nil,"utf8")
document.xpath('//script').remove
document.xpath('//SCRIPT').remove
for item in document.xpath('//*[translate(#src, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz")]')
item.set_attribute('src',make_absolute_address(item['src']))
end
document = document.to_s.gsub(/<!--(.|\s)*?-->/,'')
self.content = Nokogiri::HTML.parse(document,nil,"utf8")
end

Disabled/Custom params_parser per action

I have a create action that handles XML requests. Rather than using the built in params hash, I use Nokogiri to validate the XML against an XML schema. If this validation passes, the raw XML is stored for later processing.
As far as I understand, the XML is parsed twice: first Rails creates the params hash, then the Nokogiri parsing happens. I've been looking for ways to disable the params parsing to speed things up, but have found nothing.
ActionController::Base.param_parsers[Mime::XML] = lambda do |body|
  # something
end
I know it's possible to customize the XML params parsing in general using something like the above, but I depend on the default behaviour in other controllers.
Is it possible to bypass the params parsing on a per-action basis? What options do I have?
Thank you for your help!
I've managed to solve the problem using Rails Metal. The relevant part looks something like this:
class ReportMetal
  def self.call(env)
    if env["PATH_INFO"] =~ /^\/reports/
      request = Rack::Request.new(env)
      if request.post?
        report = Report.new(:raw_xml => request.body.string)
        if report.save # this triggers the nokogiri validation on raw_xml
          return [201, { 'Content-Type' => 'application/xml' }, report.to_xml]
        else
          return [422, { 'Content-Type' => 'application/xml' }, report.errors.to_xml]
        end
      end
    end
    [404, { "Content-Type" => "text/html" }, "Not Found."]
  ensure
    ActiveRecord::Base.clear_active_connections!
  end
end
Thanks!
PS: Naive benchmarking with Apache Bench in development shows 22.62 requests per second for standard Rails vs. 57.60 requests per second for the Metal version.

(rails) Weird problem with URL validation

I'm trying to see if a URL exists. Here is my code for doing so:
validate :registered_domain_name_exists

private

def registered_domain_name_exists
  if url and url.match(URI::regexp(%w(http https)))
    begin # check header response
      case Net::HTTP.get_response(URI.parse(url))
      when Net::HTTPSuccess then true
      else errors.add(:url, "URL does not exist") and false
      end
    rescue # DNS failures
      errors.add(:url, "URL does not exist") and false
    end
  end
end
However, this code is failing. It says http://www.biorad.com is not a valid website, which is absolutely incorrect. Also, knowing that http://www.biorad.com just redirects you to http://www.bio-rad.com/evportal/evolutionPortal.portal, I tried that URL too, and it also failed. Again, I know this can't be right. What's wrong with my code?
Each of the example URLs you gave is a redirect (HTTP status code 301 or 302). Your code only considers HTTP status code 2xx a success. Add another case:
when Net::HTTPRedirection then true
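Dropped into the case statement from the question, that reads:

case Net::HTTP.get_response(URI.parse(url))
when Net::HTTPSuccess     then true
when Net::HTTPRedirection then true
else errors.add(:url, "URL does not exist") and false
end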
UPDATE: Note that using HTTP HEAD instead of GET will transmit less data across the network.
uri = URI.parse(url)
response = Net::HTTP.start(uri.host, uri.port) { |http|
  http.head('/')
}
