Getting 500 Internal server error with open uri - ruby-on-rails

I have a bookmarking site, written in Ruby on Rails, that needs to open many URLs and extract each page's title and base_uri. The method used to open a URL is open(url). When I tried to open http://www.mysite.com/ with open-uri, I got a 500 Internal Server Error.
OpenURI::HTTPError in TestsController#test
500 Internal Server Error
I can access this URL through a browser.
My code is posted below:
require 'hpricot'
require 'open-uri'
require 'timeout'
require 'net/http'
url = 'http://www.mysite.com/'
@filep = open(url)
base_uri = @filep.base_uri
I tried the same with hpricot too, using the code below, but I get the same error.
@doc = Nokogiri::HTML(open(url).read)
Please help me on this.

I had the exact same problem: I could open the website in my browser but not through open-uri. Adding a user agent did not fix it, but using the 'restclient' library did:
require 'restclient'
url = 'http://www....'
user_info = RestClient.get(url, "User-Agent" => "Ruby")
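One way to narrow down a failure like this is to send the request with Net::HTTP and browser-like headers, then look at the status and body the server actually returns instead of letting open-uri raise. The following is a minimal sketch, not a confirmed fix: the header values and the method names build_browser_like_request and fetch_with_diagnostics are assumptions made for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical header set; adjust the values to mimic the browser that works.
BROWSER_LIKE_HEADERS = {
  'User-Agent'      => 'Mozilla/5.0 (compatible; BookmarkFetcher/1.0)',
  'Accept'          => 'text/html,application/xhtml+xml',
  'Accept-Language' => 'en-US,en;q=0.8'
}.freeze

# Builds a GET request for +url+ carrying the headers above.
def build_browser_like_request(url)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  BROWSER_LIKE_HEADERS.each { |name, value| request[name] = value }
  request
end

# Sends the request and returns the raw response without raising on 4xx/5xx,
# so the status line and the error page can be inspected.
def fetch_with_diagnostics(url)
  uri = URI(url)
  Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_browser_like_request(url))
  end
end
```

Calling fetch_with_diagnostics(url) and printing response.code together with the first few hundred characters of response.body may show an error page that explains what the server rejected.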

Related

trying to webscrape a website but getting 403 forbidden

I'm trying to access Ticketmaster's website to get the prices of events.
However, I'm running into an issue where HTTParty returns a 403 error when I try to scrape the site.
This is all I have:
page = HTTParty.get(ticketmaster_url)
doc = Nokogiri::HTML(page)
The 403 appears on the get call.
Thanks
Sam
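Not sure why you are using HTTParty here. You can simply open and parse the page using Nokogiri and open-uri:
require 'nokogiri'
require 'open-uri'
# URI.open replaces the bare open(url) call, which no longer accepts URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open(ticketmaster_url))
where ticketmaster_url is the URL of the page you want to extract information from.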
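Whichever client is used, it can help to check the status before handing the body to a parser, since a 403 page would otherwise be parsed as if it were the real content. Below is a minimal sketch using only the standard library. The names failure_reason and fetch_html are hypothetical helpers, and the sketch does not address why the site returns 403 in the first place.

```ruby
require 'net/http'
require 'uri'

# Returns nil for a 2xx response, otherwise a short "code message" string.
# Redirects (3xx) are reported too, because get_response does not follow them.
def failure_reason(response)
  return nil if response.is_a?(Net::HTTPSuccess)
  "#{response.code} #{response.message}"
end

# Fetches +url+ and returns the body, raising with the status on failure.
def fetch_html(url)
  response = Net::HTTP.get_response(URI(url))
  reason = failure_reason(response)
  raise "request to #{url} failed: #{reason}" if reason
  response.body
end
```

The string returned by fetch_html can then be passed to Nokogiri::HTML as in the answer above.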

Ruby on rails HTTP request issue

I am a newbie to Ruby on Rails. I have a URL that returns JSON. When I open the URL directly as http://user:pass@myurl.com/json, I get the response without any authentication prompt. However, http://myurl.com/json asks for a username and password through the standard Apache authentication pop-up. I tried to access this URL from my Rails controller like this:
result = JSON.parse(open("http://user:pass@myurl.com/json").read)
When I do this, I get an error that says ArgumentError, userinfo not supported. [RFC3986]
I also tried the following, but I get a 401 Unauthorized error:
open("http://...", :http_basic_authentication=>[user, password])
How can I make a request that works in this case? Any help would be appreciated.
You need to use Net::HTTP (or some other HTTP client).
require 'net/http'
require 'uri'
require 'json'
uri = URI('http://myurl.com/json')
req = Net::HTTP::Get.new( uri )
req.basic_auth 'user', 'pass'
res = Net::HTTP.start(uri.hostname, uri.port) { |http|
  http.request(req)
}
result = JSON.parse(res.body)
puts result
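If the credentials arrive embedded in the URL, as in the user:pass@ form from the question, they can be split out with URI and passed to basic_auth instead. This is a minimal sketch that assumes the user name and password contain no percent-encoded characters. The names authenticated_get and fetch_json are hypothetical helpers.

```ruby
require 'net/http'
require 'uri'
require 'json'

# Builds a GET request for +url+, moving any user:pass@ userinfo from the URL
# into an Authorization: Basic header. Returns the parsed URI and the request.
def authenticated_get(url)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request.basic_auth(uri.user, uri.password) if uri.user
  [uri, request]
end

# Sends the request and parses the JSON body.
def fetch_json(url)
  uri, request = authenticated_get(url)
  response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  JSON.parse(response.body)
end
```

With this, fetch_json("http://user:pass@myurl.com/json") sends the same request as the answer above without passing the userinfo to open-uri.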

Ruby Proxy Authentication GET/POST with OpenURI or net/http

I'm using Ruby 1.9.3 and trying to use open-uri to GET a URL and Net::HTTP to POST to one.
I'm trying to use proxy authentication for both.
Here is the POST request with net/http:
require 'net/http'
require 'open-uri'
http = Net::HTTP.new("google.com", 80)
headers = { 'User-Agent' => 'Ruby 193'}
resp, data = http.post("/", "name1=value1&name2=value2", headers)
puts data
And for open-uri, which I can't get to do a POST, I use:
data = open("http://google.com/","User-Agent"=> "Ruby 193").read
How would I modify these to use a proxy with HTTP authentication?
I've tried (for open-uri):
data = open("http://google.com/","User-Agent"=> "Ruby 193", :proxy_http_basic_authentication => ["http://proxy.com:8000/", "proxy-user", "proxy-password"]).read
However, all I get is an OpenURI::HTTPError: 407 Proxy Authentication Required. I've verified all the details, and the same proxy and credentials work in the browser, but I can't get Ruby to do it.
How would I modify the code above to add http authentication properly? Has anyone gone through this atrocity?
Try:
require "open-uri"
proxy_uri = URI.parse("http://proxy.com:8000")
data = open("http://www.whatismyipaddress.com/", :proxy_http_basic_authentication => [proxy_uri, "username", "password"]).read
puts data
As for Net::HTTP, I recently implemented support for proxies with http authentication into a Net::HTTP wrapper library called http. If you look at my last pull-request, you'll see the basic implementation.
EDIT: Hopefully this will get you moving in the right direction.
Net::HTTP::Proxy(proxy_uri.host, proxy_uri.port, "username", "password").start('whatismyipaddress.com') do |http|
  puts http.get('/').body
end
EDIT 11/24/2020: Net::HTTP::Proxy is now considered obsolete. You can now configure proxies when creating a new instance of Net::HTTP. See the documentation for Net::HTTP.new for more details.
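As a sketch of what that looks like for the POST from the question: Net::HTTP.new accepts the proxy address, port, user name, and password as its third to sixth arguments. The helper name build_proxied_http is hypothetical, and the proxy URL and credentials are the placeholders from the question.

```ruby
require 'net/http'
require 'uri'

# Returns an unconnected Net::HTTP object that will go through the given
# proxy using HTTP basic proxy authentication.
def build_proxied_http(host, port, proxy_url, proxy_user, proxy_pass)
  proxy_uri = URI.parse(proxy_url)
  Net::HTTP.new(host, port, proxy_uri.host, proxy_uri.port, proxy_user, proxy_pass)
end

http = build_proxied_http('google.com', 80, 'http://proxy.com:8000/', 'proxy-user', 'proxy-password')
headers = { 'User-Agent' => 'Ruby 193' }

# The request itself is left commented out because it needs a reachable proxy:
# response = http.post('/', 'name1=value1&name2=value2', headers)
# puts response.body
```

Nothing is sent until a request method such as post is called, so the proxy settings can be inspected first with http.proxy?, http.proxy_address, and http.proxy_port.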

Get redirect of a URL in Ruby

According to Facebook graph API we can request a user profile picture with this (example):
https://graph.facebook.com/1489686594/picture
But the real image URL of the previous link is:
http://profile.ak.fbcdn.net/hprofile-ak-snc4/hs356.snc4/41721_1489686594_527_q.jpg
If you type the first link on your browser, it will redirect you to the second link.
Is there any way to get the full URL (second link) with Ruby/Rails, by only knowing the first URL?
(This is a repeat of this question, but for Ruby)
This was already answered correctly, but there's a much simpler way:
require 'net/http'
res = Net::HTTP.get_response(URI('https://graph.facebook.com/1489686594/picture'))
res['location']
You can use Net::HTTP and read the Location: header from the response:
require 'net/http'
require 'uri'
url = URI.parse('http://www.example.com/index.html')
res = Net::HTTP.start(url.host, url.port) { |http|
  http.get('/index.html')
}
res['location']
You've got HTTPS URLs there, so you will need to handle that:
require 'net/http'
require 'net/https' if RUBY_VERSION < '1.9'
require 'uri'
u = URI.parse('https://graph.facebook.com/1489686594/picture')
h = Net::HTTP.new u.host, u.port
h.use_ssl = u.scheme == 'https'
head = h.start do |ua|
  ua.head u.path
end
puts head['location']
I know this is an old question, but I'll add this answer for posterity:
Most of the solutions I've seen only follow a single redirect. In my case, I had to follow multiple redirects to get the actual final destination URL. I used Curl (via the Curb gem) like so:
require 'curb'
result = Curl::Easy.perform(url) do |curl|
  curl.head = true             # send a HEAD request, no body needed
  curl.follow_location = true  # follow every redirect in the chain
end
result.last_effective_url
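You can check the response status code and follow redirects recursively until you reach the final URL, using something like this get_final_redirect_url method:
require 'net/http'
def get_final_redirect_url(url, limit = 10)
  # Stop instead of recursing forever on a redirect loop
  raise ArgumentError, 'too many redirects' if limit <= 0
  uri = URI.parse(url)
  response = ::Net::HTTP.get_response(uri)
  # Anything that is not a redirect with a Location header ends the chain
  return uri unless response.is_a?(Net::HTTPRedirection) && response['location']
  redirect_location = response['location']
  location_uri = URI.parse(redirect_location)
  if location_uri.host.nil?
    # Relative Location header: prefix the current scheme and host
    redirect_location = uri.scheme + '://' + uri.host + redirect_location
  end
  warn "redirected to #{redirect_location}"
  get_final_redirect_url(redirect_location, limit - 1)
end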
I was facing the same issue. I solved it and built a gem final_redirect_url around it, so that everyone can benefit from it.
You can find usage details here.
Yeah, the "Location" response header tells you the actual image URL.
However, if you use the picture as the user's profile image on your site, I recommend using the "https://graph.facebook.com/:user_id/picture" style URL instead of the actual image URL.
Otherwise, your users will see lots of "not found" images or outdated profile images in the future.
Just put "https://graph.facebook.com/:user_id/picture" as the "src" attribute of the "img" tag.
The browser then fetches the user's current image.
P.S.
I have the same trouble on my site with Twitter & Yahoo! OpenID now.
If you want a solution that:
does not use gems
follows all redirects
works also with url-shortening services
require 'net/http'
require 'uri'
def follow_redirections(url)
  response = Net::HTTP.get_response(URI(url))
  until response['location'].nil?
    response = Net::HTTP.get_response(URI(response['location']))
  end
  response.uri.to_s
end
# EXAMPLE USAGE
follow_redirections("https://graph.facebook.com/1489686594/picture")
# => https://static.xx.fbcdn.net/rsrc.php/v3/yo/r/UlIqmHJn-SK.gif
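The loop above keeps going for as long as the server sends a Location header, and it assumes every Location value is an absolute URL. A variant that caps the number of requests and resolves relative Location values with URI.join might look like the sketch below. The method name final_url and the fetch: parameter (an injectable callable, added here so the logic can be exercised without network access) are assumptions for illustration, not part of the original answer.

```ruby
require 'net/http'
require 'uri'

# Follows redirects starting at +url+ and returns the final URL as a string.
# Performs at most +limit+ requests, then raises instead of looping forever.
def final_url(url, limit: 10, fetch: ->(uri) { Net::HTTP.get_response(uri) })
  uri = URI(url)
  limit.times do
    response = fetch.call(uri)
    location = response['location']
    # A non-redirect status, or a redirect without Location, ends the chain.
    return uri.to_s unless response.is_a?(Net::HTTPRedirection) && location
    # URI.join handles both absolute and relative Location values.
    uri = URI.join(uri.to_s, location)
  end
  raise "gave up after #{limit} requests, last URL was #{uri}"
end
```

Called as final_url("https://graph.facebook.com/1489686594/picture"), it uses Net::HTTP.get_response by default.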

Opening a WIKI URL with a comma using `open-uri`

I am running into an OpenURI::HTTPError: 403 Forbidden error
when I try to open a URL containing a comma (or other special characters such as .).
I am able to open the same URL in a browser.
require 'open-uri'
url = "http://en.wikipedia.org/wiki/Thor_Industries,_Inc."
f = open(url)
# throws OpenURI::HTTPError: 403 Forbidden error
How do I escape such a URL?
I have tried to escape the url with CGI::escape and I get the same error.
f = open(CGI::escape(url))
Typically, one would simply require the module cgi, then use CGI::escape(str).
require 'cgi'
require 'open-uri'
escaped_page = CGI::escape("Thor_Industries,_Inc.")
url = "http://en.wikipedia.org/wiki/#{escaped_page}"
f = open(url)
However, this doesn't seem to work for your particular instance, and still returns a 403. I'll leave this here for reference, regardless.
Edit: Wikipedia is refusing your requests because it suspects that you are a bot. It would seem that certain pages that are clearly content are granted to you, but those that don't match its "safe" pattern (e.g. those that contain dots or commas) are subject to its screening. If you actually output the content (I did this with Net::HTTP), you get the following:
Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice.
Providing a user-agent string, however, solves the issue:
open("http://en.wikipedia.org/wiki/Thor_Industries,_Inc.",
"User-Agent" => "Ruby/#{RUBY_VERSION}")

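Two details from this thread can be made concrete. Passing the whole URL to CGI.escape also encodes the scheme and the slashes, so the result is no longer a URL; only the page title should be escaped. The User-Agent can also carry contact information, as the Wikipedia message asks. Below is a minimal sketch. The helper names, the application name, and the contact address are placeholders, not values from the original post.

```ruby
require 'cgi'

WIKI_BASE = 'http://en.wikipedia.org/wiki/'

# Escapes only the page title, leaving the rest of the URL intact.
def wiki_url(title)
  WIKI_BASE + CGI.escape(title)
end

# Builds an informative User-Agent with contact details (placeholder values).
def descriptive_user_agent(app_name, contact)
  "#{app_name} (#{contact}) Ruby/#{RUBY_VERSION}"
end

puts CGI.escape('http://en.wikipedia.org/wiki/Thor_Industries,_Inc.')
# prints http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FThor_Industries%2C_Inc.
puts wiki_url('Thor_Industries,_Inc.')
# prints http://en.wikipedia.org/wiki/Thor_Industries%2C_Inc.
```

With open-uri, the header would then be passed as "User-Agent" => descriptive_user_agent("MyCrawler/1.0", "admin@example.com") alongside wiki_url(title).
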