Opening a WIKI URL with a comma using `open-uri` - ruby-on-rails

I am running into an OpenURI::HTTPError: 403 Forbidden error
when I try to open a URL containing a comma (or other special characters such as a period).
I am able to open the same URL in a browser.
require 'open-uri'
url = "http://en.wikipedia.org/wiki/Thor_Industries,_Inc."
f = open(url)
# throws OpenURI::HTTPError: 403 Forbidden error
How do I escape such a URL?
I have tried escaping the URL with CGI::escape, and I get the same error.
f = open(CGI::escape(url))

Typically, one would simply require the module cgi, then use CGI::escape(str).
require 'cgi'
require 'open-uri'
escaped_page = CGI::escape("Thor_Industries,_Inc.")
url = "http://en.wikipedia.org/wiki/#{escaped_page}"
f = open(url)
However, this doesn't seem to work for your particular instance, and still returns a 403. I'll leave this here for reference, regardless.
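To see why escaping does not change much here, it helps to look at what CGI.escape actually produces. The snippet below is only an illustration (the variable names are mine, not from the question). Escaping the page title percent-encodes just the comma. Escaping the entire URL, as the question does, also encodes the colon and the slashes, so the result is no longer a usable URL.

```ruby
require 'cgi'

page = "Thor_Industries,_Inc."
url  = "http://en.wikipedia.org/wiki/#{page}"

# Escaping only the path segment: the comma becomes %2C, the rest is untouched.
escaped_page = CGI.escape(page)
puts escaped_page
# prints: Thor_Industries%2C_Inc.

# Escaping the whole URL: the scheme separator and every slash are encoded too.
escaped_url = CGI.escape(url)
puts escaped_url
# prints: http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FThor_Industries%2C_Inc.
```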
Edit: Wikipedia is refusing your requests because it suspects that you are a bot. It would seem that certain pages that are clearly content are granted to you, but those that don't match its "safe" pattern (e.g. those that contain dots or commas) are subject to its screening. If you actually output the content (I did this with Net::HTTP), you get the following:
Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice.
Providing a user-agent string, however, solves the issue (on Ruby 3.0 and later, call URI.open rather than open, since Kernel#open no longer handles URLs):
open("http://en.wikipedia.org/wiki/Thor_Industries,_Inc.",
  "User-Agent" => "Ruby/#{RUBY_VERSION}")

Related

What format of url is this with the colon almost in the end - https://speech.googleapis.com/v1p1beta1/speech:longrunningrecognize

I am trying to consume the Google Speech-to-Text API described here: https://cloud.google.com/speech-to-text/docs/async-recognize#speech-async-recognize-gcs-protocol
and it has this url format below
https://google-speech-api-base-urlspeech:longrunningrecognize
What is this URL format with a colon (:) near the end?
When I try to hit this URL (specifically while running a test case against it), it gives me an error, i.e. Invalid URI. Invalid Port?
But the official Google documentation says this is a valid URL. How do I use it?
This URL format is called gRPC Transcoding syntax. Your first URL is invalid because there is no slash between the base URL and speech: the colon ends up in the host part of the URL, so whatever follows it is parsed as a port number.
https://google-speech-api-base-urlspeech:longrunningrecognize
This URL cannot be used as written, whereas https://speech.googleapis.com/v1/speech:longrunningrecognize was running fine.
Try changing your URL to something like
https://google-speech-api-base-url/speech:longrunningrecognize. It will work.
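The difference between the two forms can be checked with any URL parser. Below is a minimal sketch using Ruby's standard URI library, chosen purely for illustration; the helper name parseable_uri? is mine, and other parsers word their errors differently (for example as "Invalid port"). With the slash missing, ":longrunningrecognize" sits where a port number belongs; with the slash in place, the colon is just a character in the path.

```ruby
require 'uri'

# Returns true when Ruby's URI parser accepts the string, false otherwise.
def parseable_uri?(str)
  URI.parse(str)
  true
rescue URI::InvalidURIError
  false
end

missing_slash = "https://google-speech-api-base-urlspeech:longrunningrecognize"
with_slash    = "https://google-speech-api-base-url/speech:longrunningrecognize"
real_endpoint = "https://speech.googleapis.com/v1/speech:longrunningrecognize"

puts parseable_uri?(missing_slash)  # false: the text after the colon is not a numeric port
puts parseable_uri?(with_slash)     # true: the colon is now inside the path
puts URI.parse(real_endpoint).path  # /v1/speech:longrunningrecognize
```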
I looked at the documentation page you referenced and was unable to see/find a URL that looked like:
https://google-speech-api-base-urlspeech:longrunningrecognize
However, what I did find was a URL of the form:
https://speech.googleapis.com/v1/speech:longrunningrecognize
which looks perfectly valid.
The documentation for this REST request can be found here:
https://cloud.google.com/speech-to-text/docs/reference/rest/v1/speech/longrunningrecognize
Could you have made an error in your reading and comprehension?
Apparently the colon (:) is legal in the path part of a URL:
Are colons allowed in URLs?

400 code error when URL contains % symbol? (NGINX)

How can I prevent an NGINX server from returning a 400 error when the URL contains a % symbol?
Nginx configuration for my website:
....
rewrite ^/download/(.+)$ /download.php?id=$1 last;
....
When I tried to get access to this URL:
http://mywebsite.net/download/some-string-100%-for-example
I got this error:
400 Bad Request
With this URL:
http://mywebsite.net/download/some-string-%25-for-example
it works fine!
It's because it needs to be URL encoded first.
This will explain:
http://www.w3schools.com/tags/ref_urlencode.asp
URLs can only be sent over the Internet using the ASCII character-set.
Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format.
URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits.
URLs cannot contain spaces. URL encoding normally replaces a space with a plus (+) sign or with %20.
The URL parser gets confused when it sees a % that is not followed by two hexadecimal digits.
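As a small client-side illustration, here is how the string from the question could be encoded before it is placed in the URL. This is a sketch in Ruby, assuming the client can run something equivalent; the variable names are mine. Note that URI.encode_www_form_component targets form data and would turn a space into "+", which does not matter for this particular string.

```ruby
require 'uri'

raw = "some-string-100%-for-example"

# Client side: percent-encode the value before building the URL.
encoded = URI.encode_www_form_component(raw)
puts encoded  # some-string-100%25-for-example

# Decoding the encoded form gives the original string back.
puts URI.decode_www_form_component(encoded)  # some-string-100%-for-example

# Decoding the raw string fails, because "%-f" is not a valid escape sequence.
begin
  URI.decode_www_form_component(raw)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```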
Why would you try to solve this by changing the Nginx configuration?
It's impossible to solve on the server side. It's a client-side problem.
https://headteacherofgreenfield.wordpress.com/2016/03/23/100-celebrations/
In that URL, the title is 100% Celebrations! but the permalink is autogenerated to 100-celebrations. It's because they know putting 100% will cause a URL encode problem.
If even Wordpress doesn't do it your way, then why should you do it?

Desire2Learn Valence API | JSON not loading

I'm using the Python Requests library with the Valence-provided Python SDK to attempt a GET request. Something odd is happening with the URL and I'm not sure what. The response I get is 200 (which leads me to believe that the authentication is working), but when I try to print the JSON from the Request object, it prints the HTML of the page instead of the JSON.
I'm using modified code that I read from http://docs.valence.desire2learn.com/clients/python/auth.html.
Here's the Python code:
import requests
import auth as d2lauth
from auth import *
app_creds = { 'app_id': '----', 'app_key': '----' }
ac = d2lauth.fashion_app_context(app_id=app_creds['app_id'], app_key=app_creds['app_key'])
auth_url = ac.create_url_for_authentication('ugatest2.view.usg.edu', 'http://localhost:8080')
redirect_url = "https://localhost:8080?x_a=3----&x_b=3dMRgCBAHXJDTA2E6DJIfdWq-gYl-pk77fF_3X5oDUuqc"
uc = ac.create_user_context(auth_url, 'ugatest2.view.usg.edu', True)
route = 'ugatest2.view.usg.edu/d2l/api/versions/'
url = uc.create_authenticated_url(route)
r = requests.get(url)
print(r.text)
The output is the HTML of a page instead of JSON. If I do print(r), I get a status of 200. I think my redirect URL may be the issue, but I'm not sure what exactly is wrong. Thanks for any help!
Two things look off to me:
1. Using auth_url to create a user context isn't going to work; that's the URL you need to send the user to so they can authenticate. You need to use the URL you were redirected to after authenticating to build the user context. Assuming redirect_url is that URL, you should be passing that to create_user_context and not auth_url.
2. ugatest2.view.usg.edu/d2l/api/versions/ is not a valid value to pass to create_authenticated_url; /d2l/api/versions is probably what you want. The SDK will prepend the scheme, domain, and port, so including those in the value passed is going to result in an incorrect URI.
Once your app is working properly, you'll be able to access a JSON response by using r.json() rather than r.text.

Getting 500 Internal server error with open uri

I have a bookmarking site, built in Ruby on Rails, in which many URLs need to be opened and crawled for their title and base_uri. The method used for opening a URL is open(url). When I tried to open http://www.mysite.com/ with the open-uri method, I got a 500 Internal Server Error.
OpenURI::HTTPError in TestsController#test
500 Internal Server Error
I can access this URL through browser.
My code posted below
require 'hpricot'
require 'open-uri'
require 'timeout'
require 'net/http'
url = 'http://www.mysite.com/'
@filep = open(url)
base_uri = @filep.base_uri
I tried the same with hpricot too, using the code
@doc = Nokogiri::HTML(open(url).read)
but I am getting the same error.
Please help me with this.
I had the exact same problem: I could open the website in my browser but not through open-uri. Adding a user agent did not fix it, but using the 'restclient' library did:
require 'restclient'
url = 'http://www....'
user_info = RestClient.get(url, "User-Agent" => "Ruby")

Get redirect of a URL in Ruby

According to Facebook graph API we can request a user profile picture with this (example):
https://graph.facebook.com/1489686594/picture
But the real image URL of the previous link is:
http://profile.ak.fbcdn.net/hprofile-ak-snc4/hs356.snc4/41721_1489686594_527_q.jpg
If you type the first link on your browser, it will redirect you to the second link.
Is there any way to get the full URL (second link) with Ruby/Rails, by only knowing the first URL?
(This is a repeat of this question, but for Ruby)
This was already answered correctly, but there's a much simpler way:
res = Net::HTTP.get_response(URI('https://graph.facebook.com/1489686594/picture'))
res['location']
You can use Net::HTTP and read the Location: header from the response:
require 'net/http'
require 'uri'
url = URI.parse('http://www.example.com/index.html')
res = Net::HTTP.start(url.host, url.port) {|http|
  http.get('/index.html')
}
res['location']
You've got HTTPS URLs there, so you will need to handle that:
require 'net/http'
require 'net/https' if RUBY_VERSION < '1.9'
require 'uri'
u = URI.parse('https://graph.facebook.com/1489686594/picture')
h = Net::HTTP.new u.host, u.port
h.use_ssl = u.scheme == 'https'
head = h.start do |ua|
  ua.head u.path
end
puts head['location']
I know this is an old question, but I'll add this answer for posterity:
Most of the solutions I've seen only follow a single redirect. In my case, I had to follow multiple redirects to get the actual final destination URL. I used Curl (via the Curb gem) like so:
result = Curl::Easy.perform(url) do |curl|
  curl.head = true
  curl.follow_location = true
end
result.last_effective_url
You can check the response status code and get the final URL recursively using something like this get_final_redirect_url method:
require 'net/http'
require 'uri'
def get_final_redirect_url(url, limit = 10)
  # Give up instead of recursing forever on a redirect loop.
  raise ArgumentError, 'too many HTTP redirects' if limit == 0
  uri = URI.parse(url)
  response = ::Net::HTTP.get_response(uri)
  if response.class == Net::HTTPOK
    return uri
  else
    redirect_location = response['location']
    location_uri = URI.parse(redirect_location)
    if location_uri.host.nil?
      # Relative Location header: rebuild an absolute URL from the current one.
      redirect_location = uri.scheme + '://' + uri.host + redirect_location
    end
    warn "redirected to #{redirect_location}"
    get_final_redirect_url(redirect_location, limit - 1)
  end
end
I was facing the same issue. I solved it and built a gem, final_redirect_url, around it so that everyone can benefit from it.
You can find details on its usage here.
Yeah, the "Location" response header tells you the actual image URL.
However, if you use the picture as the user's profile image on your site, I recommend using the "https://graph.facebook.com/:user_id/picture" style URL instead of the actual image URL.
Otherwise, your users will see lots of "not found" images or outdated profile images in the future.
Just put "https://graph.facebook.com/:user_id/picture" in the "src" attribute of the "img" tag.
The browser then gets the user's up-to-date image.
ps.
I have such troubles on my site with Twitter & Yahoo! OpenID now..
If you want a solution that:
does not use gems
follows all redirects
works also with url-shortening services
require 'net/http'
require 'uri'
def follow_redirections(url)
  response = Net::HTTP.get_response(URI(url))
  until response['location'].nil?
    response = Net::HTTP.get_response(URI(response['location']))
  end
  response.uri.to_s
end
# EXAMPLE USAGE
follow_redirections("https://graph.facebook.com/1489686594/picture")
# => https://static.xx.fbcdn.net/rsrc.php/v3/yo/r/UlIqmHJn-SK.gif
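The loop above keeps requesting until no Location header is left, which can spin forever on a redirect loop and assumes every Location value is absolute. A variant with a hop limit and support for relative Location headers might look like the sketch below. The names final_url, DEFAULT_FETCH and the fetch: parameter are hypothetical; the injectable fetcher is only there so the logic can be exercised without touching the network.

```ruby
require 'net/http'
require 'uri'

# Default fetcher: a real HTTP GET. A stub can be passed in instead for testing.
DEFAULT_FETCH = ->(uri) { Net::HTTP.get_response(uri) }

# Follows Location headers until a response has none, for at most `limit` hops.
# Returns the final URL as a String, or raises if the limit is exhausted.
def final_url(url, limit: 10, fetch: DEFAULT_FETCH)
  uri = URI.parse(url)
  limit.times do
    location = fetch.call(uri)['location']
    return uri.to_s if location.nil?
    # URI.join resolves relative targets against the current URL
    # and returns absolute targets unchanged.
    uri = URI.join(uri.to_s, location)
  end
  raise "too many redirects (limit #{limit})"
end

# Offline usage with a stubbed fetcher (hosts are made up for the example):
hops = {
  'http://short.example/a'        => { 'location' => '/b' },
  'http://short.example/b'        => { 'location' => 'https://final.example/pic.jpg' },
  'https://final.example/pic.jpg' => {}
}
stub = ->(uri) { hops.fetch(uri.to_s) }
puts final_url('http://short.example/a', fetch: stub)
# prints: https://final.example/pic.jpg
```

Calling final_url(url) without the fetch: argument performs real requests through Net::HTTP.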
