Rails + Nokogiri + Heroku - response 503 for URLs from StackOverflow

I'm writing a just-for-fun app for my own use. In this app I submit URLs through a classic POST form and extract some information from each page. For example, this is the line where I extract the title of the page:
self.name = Nokogiri::HTML(open(self.url)).css('title').to_s.sub('<title>','').to_s.sub('</title>','')
I'm using Nokogiri (v1.5.4) to parse the source page. I don't know if I'm missing something here, but the application's behavior is strange.
When I run the app locally in my development environment, everything works properly. But after pushing to Heroku, problems appear. For example, URLs from Stack Overflow always produce this type of error:
OpenURI::HTTPError (503 Service Unavailable):
app/models/url.rb:67:in `set_name'
app/controllers/urls_controller.rb:48:in `block in create'
app/controllers/urls_controller.rb:46:in `create'
I don't understand why this happens only on Heroku. On my local machine the same URL works perfectly. Maybe I'm missing something about Heroku, but other URLs return a normal 200 status and work fine. It's just URLs from Stack Overflow.

Don't use:
.to_s.sub('<title>','').to_s.sub('</title>','')
Instead use:
.text
For instance:
html = '<head><title>foo</title></head>'
Nokogiri::HTML(html).css('title').text
In IRB:
irb(main):055:0> html = '<head><title>foo</title></head>'
=> "<head><title>foo</title></head>"
irb(main):056:0> Nokogiri::HTML(html).css('title').text
=> "foo"
Why URLs for Stack Overflow fail on Heroku with a 503 might be a routing or hosting issue, since the 503 is coming from the remote server.
Rather than scraping pages, you might want to consider "Where is Stack Overflow's public data dump?" and "Stack Overflow Creative Commons Data Dump".
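If you do keep fetching pages directly, one more thing worth trying: sites sometimes reject requests that arrive with open-uri's default User-Agent or from cloud provider IP ranges, which could explain a 503 that only shows up on Heroku. A minimal sketch (the header value here is just an example, not a known fix):
require 'open-uri'
require 'nokogiri'

# open-uri treats string-keyed options as request headers, so you can send
# an explicit User-Agent to see whether the 503 is tied to the default one.
# `url` is the submitted URL from the form.
html = open(url, 'User-Agent' => 'Mozilla/5.0 (compatible; MyFunApp/1.0)').read
title = Nokogiri::HTML(html).css('title').text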

Related

How to pass rewritten url to fcgi in lighttpd

I have a redmine instance that runs in the /redmine sub-uri. This is fully working and I can retrieve /redmine/robots.txt without fault.
Adding
url.rewrite = ( "^/robots\.txt$" => "/redmine/robots.txt" )
still gives a 404 error when trying to retrieve /robots.txt.
With debug.log-request-handling = "enable" turned on, the request in error.log appears identical to GETting /redmine/robots.txt, after an initial block showing the URL being changed.
Using url.rewrite-once does not seem to make a difference.
The request never shows up in redmine production.log.
So my question is: what might I be missing?
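One thing worth verifying (an assumption on my part, not a confirmed fix): lighttpd only honors url.rewrite* directives when mod_rewrite is listed in server.modules; otherwise it logs an "unknown config-key" warning and ignores the rule. A hypothetical fragment:
# url.rewrite* directives take effect only when mod_rewrite is loaded.
server.modules += ( "mod_rewrite" )

url.rewrite-once = ( "^/robots\.txt$" => "/redmine/robots.txt" )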

JSON API cURL to HTTParty request. The request works on my local machine with cURL, but not in Heroku production with HTTParty

I've gotten to the point where I'm not sure what question to ask anymore when researching my issue, so by coming here I'm asking two things.
First, how do I go about investigating issues like this in the future?
Second, what am I doing wrong right now?
Basically, I created an app and pushed it to Heroku. The app has API endpoints that work exactly as I expect when I run it locally and send curl commands from my terminal. The code I'm running to try to sign in to my app uses the HTTParty gem and is returning {"status":500,"error":"Internal Server Error"}.
This is my first time trying to connect to an API endpoint, so I'm flying a bit blind here. What I read about a 500 error is: "When all else fails; generally, a 500 response is used when processing fails due to unanticipated circumstances on the server side, which causes the server to error out." That sounds like a catch-all for "something's broken, but we don't know what, other than it's on the backend." Please correct me if I'm wrong.
What do I do next?
This is the curl command I'm running to send an email and password and receive an auth_token:
curl -d "user_login[email]=fake#email.com&user_login[password]=password123" http://localhost:3000/api/v1/sign-in
As expected I get this back:
{"auth_token":"628f3ebc47193665e7f1d32ae41ff9a7"}%
This is the code I'm running using HTTParty to try and connect with my API in production using Heroku:
consume_api.rb
require "rubygems"
require "httparty"
query_hash = { :user_login => "fake#email.com",
               :password   => "password123" }
response = HTTParty.post("https://fake-heroku-94488.herokuapp.com/api/v1/sign-in", :query => query_hash)
puts response.body, response.code, response.message, response.headers.inspect
This is the response back:
{"status":500,"error":"Internal Server Error"}
500
Internal Server Error
{"server"=>["Cowboy"], "date"=>["Fri, 03 Feb 2017 14:49:40 GMT"], "connection"=>["close"], "content-type"=>["application/vnd.api+json; charset=utf-8"], "x-request-id"=>["5d094cee-11a2-4040-abfd-5180f5e46886"], "x-runtime"=>["0.003503"], "vary"=>["Origin"], "content-length"=>["46"], "via"=>["1.1 vegur"]}
To reiterate, my questions are: what can I take from this response to start investigating further, and what questions should I be asking myself when this happens?
Second, what did I do wrong?
Thanks in advance for your help.
Edit:
A quick update is that I ran the curl command that worked locally with the actual URL for my production site and it worked exactly as it does locally. I still can't manage to get further using HTTParty.
I checked out my heroku logs and found this:
2017-02-03T17:22:26.576272+00:00 heroku[router]: at=info method=POST path="/api/v1/sign-in" host=fake-heroku-94488.herokuapp.com request_id=9ab76ce0-95f3-4479-8672-874a624c070e fwd="108.11.195.58" dyno=web.1 connect=1ms service=8ms status=500 bytes=265
Started POST "/api/v1/sign-in" for 108.11.195.58 at 2017-02-03 17:22:26 +0000
Processing by Api::V1::SessionsController#create as JSON
Parameters: {"email"=>"fake#email.com", "password"=>"[FILTERED]"}
Completed 500 Internal Server Error in 1ms (ActiveRecord: 0.0ms)
NoMethodError (undefined method `[]' for nil:NilClass):
app/controllers/api/v1/sessions_controller.rb:7:in `create'
Which points to line 7 of my SessionsController:
...
6 def create
7 resource = User.find_for_database_authentication(:email => params[:user_login][:email])
8 return invalid_login_attempt unless resource
...
Line 7 has two hash lookups in it, but I'm not sure how they come back empty in the HTTParty request and work just fine in the curl request.
The answer to my first question, what I should do if I bump into this issue or one like it again, is to check the Heroku logs; that started the unwinding of my problem. Even after I figured out that I wasn't sending in the :user_login info, I still couldn't see how to fix it, so I started dropping byebug into any place related to the issue, asking what my variables' values were, and running code to see the output in the console. When the values kept coming back nil and I couldn't work out the right way to write the HTTParty request, I started calling .methods and .instance_methods on my variables to find more ways to ask what was going on. After that I spent time in the Rails console, using whatever instance methods were associated with my variables, until I put the pieces together.
The answer to my second question, what was I doing wrong: I wasn't sending in :user_login, mainly because I didn't quite understand what it was supposed to be doing (I'm new; I use a lot of tutorials that I understand about 95% of, and this was from the 5%), and even when I did understand it, I couldn't figure out how to fit it into my request. As it turns out, I also wasn't using :body appropriately. Once I figured out that :body needed to wrap all of the data I was sending, with :user_login as the next layer and :email and :password inside that, it all looked perfectly obvious in hindsight.
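A sketch of the byebug step described above (the controller and method names are the question's own; the comment shows the kind of probing I mean):
# app/controllers/api/v1/sessions_controller.rb
def create
  byebug # at the prompt, inspect: params, params[:user_login], params.keys
  resource = User.find_for_database_authentication(:email => params[:user_login][:email])
  return invalid_login_attempt unless resource
  # ... rest of the action unchanged
end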
This is what the code to sign in to my app looks like:
consume_api.rb
require "rubygems"
require "httparty"
response = HTTParty.post("https://fake-heroku-94488.herokuapp.com/api/v1/sign-in", :body => {:user_login => {:email => "fake#email.com", :password => "password123"}})
puts response.body, response.code, response.message, response.headers.inspect
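As a side note, if an API expects a JSON body rather than form encoding, HTTParty can send that explicitly via :body and :headers (a sketch, not something this endpoint necessarily requires):
require "httparty"
require "json"

# Same nested payload, serialized as JSON with an explicit Content-Type.
payload = { :user_login => { :email => "fake#email.com", :password => "password123" } }
response = HTTParty.post("https://fake-heroku-94488.herokuapp.com/api/v1/sign-in",
                         :body    => payload.to_json,
                         :headers => { "Content-Type" => "application/json" })
puts response.code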

HTTP::ConnectionError & Errno::EHOSTUNREACH Errors in Rails App

I'm working on a Rails app and here are two important pieces of the error message I get when I try to seed data in my database:
HTTP::ConnectionError: failed to connect: Operation timed out - SSL_connect
Errno::ETIMEDOUT: Operation timed out - SSL_connect
Here is my code, where I'm pulling data from a file, and creating Politician objects:
politician_data.each do |politician|
  photo_from_congress = "https://theunitedstates.io/images/congress/original/" + politician["id"]["bioguide"] + ".jpg"
  image = HTTP.get(photo_from_congress).code == 200 ? photo_from_congress : "noPoliticianImage.png"
  Politician.create(
    name: "#{politician["name"]["first"]} #{politician["name"]["last"]}",
    image: image
  )
end
I put in a pry, and the iteration works for the first loop, so the code is OK. After several seconds, the loop breaks, and I get that error, so I think it has something to do with the number of HTTP.get requests I'm making?
https://github.com/unitedstates/images is a Git repo. Perhaps that repo can't handle that many get requests?
I did some Googling and saw it may have something to do with a "Request timed out" error, and that I might have to set up a proxy server? I'm a junior programmer, so please be very specific when responding.
EDIT:
I found this blurb on the site I'm making GET requests to for photos (https://github.com/unitedstates/images), which may help:
Note: Our HTTPS permalinks are provided through CloudFlare's Universal SSL, which also uses "Flexible SSL" to talk to GitHub Pages' unencrypted endpoints. So, you should know that it's not an end-to-end encrypted channel, but is encrypted between your client use and CloudFlare's servers (which at least should dissociate your requests from client IP addresses).
By the way, using Net::HTTP instead of the HTTP Ruby gem worked. Instead of checking the status code, I just checked whether the body contained key text:
photo_from_congress = "https://theunitedstates.io/images/congress/original/" + politician["id"]["bioguide"] + ".jpg"
photo_as_URI = URI(photo_from_congress)
image = Net::HTTP.get_response(photo_as_URI).body.include?("File not found") ? "noPoliticianImage.png" : photo_from_congress
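For anyone wanting a sturdier version of this seed loop, here is a sketch (the helper name and the rescue list are my additions, and it assumes Ruby 2.0+ for the Net timeout classes): check the HTTP status instead of scanning the body, rescue the timeout errors from the question, and pause briefly between requests:
require "net/http"
require "uri"

# Returns the photo URL when the image exists, else a placeholder.
# Hypothetical helper; bioguide_id comes from the data file.
def congress_photo_url(bioguide_id)
  url = "https://theunitedstates.io/images/congress/original/#{bioguide_id}.jpg"
  response = Net::HTTP.get_response(URI(url))
  response.is_a?(Net::HTTPSuccess) ? url : "noPoliticianImage.png"
rescue Errno::ETIMEDOUT, Errno::EHOSTUNREACH, Net::OpenTimeout, Net::ReadTimeout
  "noPoliticianImage.png"
end

politician_data.each do |politician|
  Politician.create(
    name:  "#{politician["name"]["first"]} #{politician["name"]["last"]}",
    image: congress_photo_url(politician["id"]["bioguide"])
  )
  sleep 0.1 # be gentle with the image host during a long seed run
end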

Large number of likes but now realise it is to an invalid url

My site at www.kruaklaibaan.com (yes, I know it's hideous) currently has 3.7 million likes, but while working to build a proper site that doesn't use some flowery phpBB monstrosity, I noticed that all those likes are registered against an invalid URL that doesn't actually link back to my site at all. Instead, the likes have all been registered against a URL-encoded version:
www.kruaklaibaan.com%2Fviewtopic.php%3Ff%3D42%26t%3D370
This is obviously incorrect. Since I already have so many likes, I was hoping to either get those likes updated to the correct URL or have them point to the base URL of www.kruaklaibaan.com.
The correct URL they SHOULD have been registered against is (not URL-encoded):
www.kruaklaibaan.com/viewtopic.php?f=42&t=370
Is there someone at Facebook I can discuss this with? 3.7m likes is a little too many to start over with without a lot of heartache. It took 2 years to build those up.
Short of getting someone at Facebook to update the URL, the only option within your control that I can think of that would work is to create a custom 404 error page. I have tested such a page with your URL, and the following works.
First you need to set the Apache directive for ErrorDocument (or equivalent in another server).
ErrorDocument 404 /path/to/404.php
This will cause any 404 pages to hit the script, which in turn will do the necessary check and redirect if appropriate.
I tested the following script and it works perfectly.
<?php
if ( $_SERVER['REQUEST_URI'] == '/%2Fviewtopic.php%3Ff%3D42%26t%3D370' ) {
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: /viewtopic.php?f=42&t=370");
    exit();
} else {
    header('HTTP/1.0 404 Not Found');
}
?><html><body>
<h1>HTTP 404 Not Found</h1>
<?php echo $_SERVER['REQUEST_URI']; ?>
</body></html>
This is a semi-dirty way of achieving this; I tried several variations in Apache 2.2 using mod_alias's Redirect and mod_rewrite's RewriteRule, neither of which I was able to get working with a URL containing percent-encoded characters. I suspect that with nginx you may have better success finding a more graceful way to handle this in the server.

What's up with those requests having "iframe=true&width=80%&height=80%" query params?

I'm running a Rails 3.2 app. I checked Google Webmaster Tools and saw lots of HTTP 502 errors for random pages. The weird thing is that all of them were crawled with ?iframe=true&width=80%&height=80% as query params:
e.g. http://www.mypage.com/anypage?iframe=true&width=80%&height=80%
I certainly don't link to those pages like that internally, so it must be external. Checking Google confirms it; I see lots of other pages having the same issue.
It seems like an external service creates those links, but why?
I'm seeing these too. Over the past 24 hours I have 9 hits on one of my pages. They all come from the same IP address, which is Google's in Mountain View. None of them have a referrer. Also, a really interesting thing is that half of them have headers like this:
HTTP_ACCEPT : */*
HTTP_ACCEPT_ENCODING : gzip,deflate
HTTP_CONNECTION : Keep-alive
HTTP_FROM : googlebot(at)googlebot.com
HTTP_HOST : mydomain.com
HTTP_USER_AGENT : Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
But then interspersed are requests from the same IP that don't have any HTTP headers reported in the exception. I'm not sure if this means they aren't being sent, or if something in the Rails stack is preventing the headers from getting recorded due to some other variation in the requests. In any case the requests are interspersed.
The page in question has existed for only about a month, and it's only seen 5 requests during that time according to GA.
All this leads me to believe that someone inside Google is doing something experimental which is leading to these buggy query string encodings, and Rails apps are seeing it because it happens to crash the rack QS parser, whereas other platforms may be more forgiving.
In the meantime I may monkey patch Rack just to stop it shouting at me, but the ultimate answer about what's going on will have to come from Google (anyone there?).
You can add this to your initializers to get rid of the errors (with Ruby 1.8.x):
module URI
  # Only patch Ruby 1.8.x; 1.9+ handles this differently.
  major, minor, patch = RUBY_VERSION.split('.').map { |v| v.to_i }
  if major == 1 && minor < 9
    def self.decode_www_form_component(str, enc = nil)
      # Lazily build the decoding table mapping '%XX' (in every case
      # combination) and '+' to the corresponding characters.
      if TBLDECWWWCOMP_.empty?
        tbl = {}
        256.times do |i|
          h, l = i >> 4, i & 15
          tbl['%%%X%X' % [h, l]] = i.chr
          tbl['%%%x%X' % [h, l]] = i.chr
          tbl['%%%X%x' % [h, l]] = i.chr
          tbl['%%%x%x' % [h, l]] = i.chr
        end
        tbl['+'] = ' '
        begin
          TBLDECWWWCOMP_.replace(tbl)
          TBLDECWWWCOMP_.freeze
        rescue
        end
      end
      # Escape stray '%' signs not followed by two hex digits instead of
      # raising, then decode as usual.
      str = str.gsub(/%(?![0-9a-fA-F]{2})/, "%25")
      str.gsub(/\+|%[0-9a-fA-F]{2}/) { |m| TBLDECWWWCOMP_[m] }
    end
  end
end
All this does is percent-encode % symbols that aren't followed by two hex digits, instead of raising an exception. Not sure it's such a good idea to be monkeypatching, though. There must be a valid reason this wasn't done in the gem (maybe security-related?).
I just found out more about this issue. According to Google Webmaster Tools, all the links are coming from spidername.com. It looks like that service adds those parameters to the URL, and when a user clicks the link, JavaScript checks for the iframe= query param and shows the content in an iframe. However, the Google bot goes straight to the iframe URL, which is what causes the issue.
I decided to use a redirect rule in nginx to solve it.
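For reference, a minimal sketch of the kind of rule I mean (a reconstruction, not my exact config): 301-redirect any request carrying the iframe=true parameter back to the bare path:
# Inside the server block: if the query string contains iframe=true,
# redirect to the same path with the query string dropped (the trailing
# "?" stops nginx from re-appending $args).
if ($args ~* "iframe=true") {
    rewrite ^(.*)$ $1? permanent;
}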
I have the same issue. I am worried that it is a third-party spam link trying to lower my site's Google ranking.
