Mechanize error "too many bad responses" - ruby-on-rails

While scraping, I've found that some URLs failed. After checking that the URL looked OK in the browser and seeing in Wireshark that the remote server was answering with a 200, I finally found that the URL:
http://www.segundamano.es/electronica-barcelona-particulares/galaxy-note-3-mas.htm
was failing with
Net::HTTP::Persistent::Error: too many bad responses after 0 requests on 42319240, last used 1414078471.6468294 seconds ago
Even weirder: if you remove a character from the last part, it works. If you add the character back in another place, it fails again.
Update 1
The "code"
agent = Mechanize.new
page = agent.get("http://www.segundamano.es/electronica-barcelona-particulares/galaxy-note-3.htm")
Net::HTTP::Persistent::Error: too many bad responses after 0 requests on 41150840, last used 1414079640.353221 seconds ago

This is a network error which normally occurs when you make too many requests to a certain source from the same IP, so the page takes too long to load. You could try adding a custom timeout to your connection agent, keeping the connection alive and ignoring bad chunking (potentially risky):
agent = Mechanize.new
agent.keep_alive = true
agent.ignore_bad_chunking = true
agent.open_timeout = 25
agent.read_timeout = 25
page = agent.get("http://www.segundamano.es/electronica-barcelona-particulares/galaxy-note-3.htm")
But that doesn't guarantee the connection will be successful; it just increases the chances.
It's hard to say why you get the error on one URL and not on another. When you remove the 3 you request a different page, one that might be easier for the server to process? My point being: there is nothing wrong with your Mechanize setup; the problem is with the response you are getting back.

Agreed with Severin, the problem was on the other side. As I can't do anything about the server, I tried different libs to fetch the data. It was weird that some of them worked and others didn't. Trying different setups for Mechanize, in the end I found a good one:
agent = Mechanize.new { |agent|
  agent.gzip_enabled = false
}
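For completeness, a sketch that merges the settings from both answers and adds a simple retry. The retry count and exception list are arbitrary choices, not something either answer confirmed; this only improves the odds, it is not a guaranteed fix.
require 'mechanize'

# Illustrative only: keep-alive, no gzip, generous timeouts, and a couple of retries
# on the persistent-connection error from the question.
agent = Mechanize.new { |a|
  a.keep_alive   = true
  a.gzip_enabled = false
  a.open_timeout = 25
  a.read_timeout = 25
}

attempts = 0
begin
  page = agent.get("http://www.segundamano.es/electronica-barcelona-particulares/galaxy-note-3.htm")
rescue Net::HTTP::Persistent::Error, Mechanize::ResponseCodeError => e
  attempts += 1
  retry if attempts < 3
  raise
end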

Related

Why is Net::HTTP returning a 500 when the server is returning a 503?

(This question replaces this one, hopefully with better information.)
I've got three servers that I'm gonna call Alice, Charlie, and Harry (Harry being the server's actual nickname, so as not to confuse myself). They all talk to each other to perform quite a complicated authentication flow:
A client performs a sequence of three requests through Alice to Harry.
On the third one, Harry makes a call to Charlie.
Charlie is prone to timeouts during periods of heavy traffic. If Charlie times out, Harry returns a 503 with a Retry-After header to Alice.
Harry is returning a 503, I have confirmed that in its own logs.
Alice is not receiving a 503 but a 500, and without the header.
Alice's other clients (which I'm not in control of) treat a 500 the same as other errors, which is what I'm ultimately trying to fix.
Some extra information, such as I have been able to ascertain:
Alice proxies calls to Harry using RestClient, which uses Net::HTTP under the hood.
Using Net::HTTP directly gives the same result.
It's not environment specific; I have had this problem both in Production and Development.
I have been trying to simulate Alice using Postman, but haven't had any luck yet; Charlie's traffic is quieter at the moment, and I can't force or simulate a timeout, so so far I've only been getting successful 200 responses from Harry.
Fixing Charlie's timeouts would obviously be ideal, but I'm not in control of Charlie's hardware either.
Is there something I can change about Alice so it properly detects Harry's 503?
Or, is it possible that something about Harry is changing its 503 to a 500 after it's returned and logged?
Here's Alice's code for that third call, if it's likely to help, but nothing's jumping out at me; I have been wondering if RestClient or Net::HTTP has some configuration that I don't know about.
http_verb = :post
args = [ # actually constructed differently, but this is a reasonable mock up
  'https://api.harry/path?sso_token=token',
  '',
  {
    'content_type' => 'application/json',
    'accept' => '*/*',
    # some other app-specific headers
  }
]
RestClient.send(http_verb, *args) do |response, request, result, &block|
  # `result` is present and has a 500 code at the start of this block if Harry returns a 503.
  #status_code = result.present? ? result.code : :internal_server_error
  cors.merge!( response.headers.slice(:access_control_allow_origin, :access_control_request_method) )
  #body = response.body
end
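(For reference, since using Net::HTTP directly reportedly gives the same result, the equivalent bare call, using only the mock-up URL and headers above, would look roughly like this.)
require 'net/http'
require 'uri'

# Purely illustrative: same mock-up URL/headers as above, sent without RestClient,
# to inspect the status line that actually comes back over the wire.
uri = URI('https://api.harry/path?sso_token=token')
response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.post(uri.request_uri, '', 'Content-Type' => 'application/json', 'Accept' => '*/*')
end
puts response.code           # reportedly "500" here, even though Harry logs a 503
puts response['Retry-After'] # nil, since the header never arrives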
Turns out it's something way simpler and more obvious than anyone thought.
A separate error was raised in the middleware responsible for returning the 503. As with any other exception, this got rendered as a 500.
The thing that was causing it was a line that was supposed to tell the client to wait five seconds and try again:
response.headers['Retry-After'] = 5
... some middleware component was complaining of undefined method 'each' on 5:Fixnum, because it expected an Array wherever the value wasn't a String; it started working when we changed 5 to '5'.
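In other words, that layer wants the header value to be a String. A minimal sketch of the pattern, with a hypothetical middleware name (this is not the actual component involved):
# Hypothetical middleware, just to illustrate the fix: Rack header values
# must be Strings, so the Retry-After value is set as '5', not 5.
class RetryAfterOnTimeout
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    headers['Retry-After'] = '5' if status == 503  # a String, or the middleware chokes on it
    [status, headers, body]
  end
end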

HTTP::ConnectionError & Errno::EHOSTUNREACH Errors in Rails App

I'm working on a Rails app and here are two important pieces of the error message I get when I try to seed data in my database:
HTTP::ConnectionError: failed to connect: Operation timed out - SSL_connect
Errno::ETIMEDOUT: Operation timed out - SSL_connect
Here is my code, where I'm pulling data from a file, and creating Politician objects:
politician_data.each do |politician|
  photo_from_congress = "https://theunitedstates.io/images/congress/original/" + politician["id"]["bioguide"] + ".jpg"
  image = HTTP.get(photo_from_congress).code == 200 ? photo_from_congress : "noPoliticianImage.png"
  Politician.create(
    name: "#{politician["name"]["first"]} #{politician["name"]["last"]}",
    image: image
  )
end
I put in a pry, and the first iteration of the loop works, so the code is OK. After several seconds the loop breaks and I get that error, so I think it has something to do with the number of HTTP.get requests I'm making?
https://github.com/unitedstates/images is a Git repo. Perhaps that repo can't handle that many GET requests?
I did some Googling and saw it may have something to do with a "Request timed out" error, or with my having to set up a proxy server? I'm a junior programmer, so please be very specific when responding.
EDIT TO ADD THIS:
I found this blurb on the site where I'm making GET requests to cull photos (https://github.com/unitedstates/images) that may help:
Note: Our HTTPS permalinks are provided through CloudFlare's Universal SSL, which also uses "Flexible SSL" to talk to GitHub Pages' unencrypted endpoints. So, you should know that it's not an end-to-end encrypted channel, but is encrypted between your client and CloudFlare's servers (which at least should dissociate your requests from client IP addresses).
By the way, using Net::HTTP instead of the HTTP Ruby gem worked. Instead of checking the status code, I just checked whether the body contained key text:
photo_from_congress = "https://theunitedstates.io/images/congress/original/" + politician["id"]["bioguide"] + ".jpg"
photo_as_URI = URI(photo_from_congress)
image = Net::HTTP.get_response(photo_as_URI).body.include?("File not found") ? "noPoliticianImage.png" : photo_from_congress
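If you'd rather not depend on the body text, a variation (just a sketch, using the same URL-building as above) checks the response class instead:
require 'net/http'
require 'uri'

# Sketch: rely on the response class rather than scanning the body for "File not found".
photo_uri = URI("https://theunitedstates.io/images/congress/original/" + politician["id"]["bioguide"] + ".jpg")
response  = Net::HTTP.get_response(photo_uri)
image     = response.is_a?(Net::HTTPSuccess) ? photo_uri.to_s : "noPoliticianImage.png"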

In ruby/rails, can you differentiate between no network response vs long-running response?

We have a Rails app with an integration with box.com. It happens fairly frequently that a request for a box action to our app results in a Passenger process being tied up for right around 15 minutes, and then we get the following exception:
Errno::ETIMEDOUT: Connection timed out - SSL_connect
Often it's on something that should be fairly quick, such as listing the contents of a small folder, or deleting a single document.
I'm under the impression that these requests never actually got an open channel: either at the TCP or SSL level we got no initial response, or the full handshake/session setup never completed.
I'd like to get either such condition to timeout quickly, say 15 seconds, but allow for a large file that is successfully transferring to continue.
Is there any way to get TCP or SSL to raise a timeout much sooner when the connection at either of those levels fails to complete setup, but not raise an exception if the session is successfully established and it's just taking a long time to actually transfer the data?
Here is what our current code looks like - we are not tied to doing it this way (and I didn't write this code):
def box_delete(uri)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.verify_mode = OpenSSL::SSL::VERIFY_NONE
  request = Net::HTTP::Delete.new(uri.request_uri)
  http.request(request)
end
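One relevant detail about Net::HTTP's timeouts (a sketch with illustrative values, not a tested fix): open_timeout bounds connection setup, while read_timeout applies to each individual wait for data rather than to the total transfer, so a large file that keeps streaming isn't cut off.
def box_delete(uri)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.verify_mode = OpenSSL::SSL::VERIFY_NONE   # as in the original; VERIFY_PEER is safer
  http.open_timeout = 15   # fail fast if the TCP connection can't be established
                           # (whether the TLS handshake is covered depends on the Ruby version)
  http.read_timeout = 60   # applies per read, not to the total transfer time
  request = Net::HTTP::Delete.new(uri.request_uri)
  http.request(request)
end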

Optimizing HTTP Status Request on Heroku (Net:HTTP)

I'm running an app on heroku where users can get the HTTP status (200,301,404, etc) of several URLs that they can paste on a form.
Although it runs fine on my local Rails server, when I upload it to Heroku I cannot check more than 30 URLs (I want to check 200), as Heroku times out after 30 seconds, giving me an H12 error.
def gethttpresponse(url)
  httpstatusarray = Hash.new
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  request = Net::HTTP::Get.new(uri.request_uri)
  response = http.request(request)
  httpstatusarray['url'] = url
  httpstatusarray['status'] = response.code
  return httpstatusarray
end
At the moment I'm using Net::HTTP, and it seems very heavy. Is there anything I can change in my code, or any other gem I could use, to get the HTTP status/headers in a more efficient (fast) way?
I noticed that response.body holds the entire HTML source code of the page, which I do not need. Is this loaded on the response object by default?
If this is the most efficient way to check HTTP Status, would you agree that this needs to become a background job?
Any reference on gems that work faster, reading material and thoughts are more than welcome!
If a request takes longer than 30 seconds then the request is timed out, as you're seeing here. You're entirely dependent on how fast the server at the other end responds. For example, if it were itself a Heroku application on one dyno, it could take 10 seconds just to unidle, leaving only 20 seconds for the other URLs to be tested.
I suggest you move each check to its own background job and then poll the jobs to know when they complete, updating the UI accordingly.
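On the response.body question: yes, a plain GET through http.request reads the whole body into the response object. A sketch using a HEAD request instead (illustrative timeouts, not part of the answer above) avoids transferring the body at all, though it won't help with servers that are simply slow to respond:
require 'net/http'
require 'uri'

# Sketch: HEAD returns only status and headers, so no HTML body is downloaded.
def gethttpstatus(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https',
                  open_timeout: 5, read_timeout: 5) do |http|
    response = http.head(uri.request_uri)
    { 'url' => url, 'status' => response.code }
  end
end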

What's up with those requests having "iframe=true&width=80%&height=80%" query params?

I'm running a Rails 3.2 app. I checked Google Webmaster Tools and saw lots of HTTP 502 errors for random pages. The weird thing is that all of them were crawled with ?iframe=true&width=80%&height=80% as query params:
e.g. http://www.mypage.com/anypage?iframe=true&width=80%&height=80%
I certainly don't link to those pages like that internally, so it must be external. Checking Google proves it; I see lots of other pages having the same issue.
It seems like an external service creates those links, but why?
I'm seeing these too. Over the past 24 hours I have 9 hits on one of my pages. They all come from the same IP address, which is Google's in Mountain View. None of them have a referrer. Also, a really interesting thing is that half of them have headers like this:
HTTP_ACCEPT : */*
HTTP_ACCEPT_ENCODING : gzip,deflate
HTTP_CONNECTION : Keep-alive
HTTP_FROM : googlebot(at)googlebot.com
HTTP_HOST : mydomain.com
HTTP_USER_AGENT : Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
But then interspersed are requests from the same IP that don't have any HTTP headers reported in the exception. I'm not sure if this means they aren't being sent, or if something in the Rails stack is preventing the headers from getting recorded due to some other variation in the requests. In any case the requests are interspersed.
The page in question has existed for only about a month, and it's only seen 5 requests during that time according to GA.
All this leads me to believe that someone inside Google is doing something experimental which is leading to these buggy query string encodings, and Rails apps are seeing it because it happens to crash the rack QS parser, whereas other platforms may be more forgiving.
In the meantime I may monkey patch rack just to stop shouting at me, but the ultimate answer about what's going on will have to come from Google (anyone there?).
You can add this to your initializers to get rid of the errors (with Ruby 1.8.x):
module URI
  major, minor, patch = RUBY_VERSION.split('.').map { |v| v.to_i }
  if major == 1 && minor < 9
    def self.decode_www_form_component(str, enc=nil)
      if TBLDECWWWCOMP_.empty?
        tbl = {}
        256.times do |i|
          h, l = i>>4, i&15
          tbl['%%%X%X' % [h, l]] = i.chr
          tbl['%%%x%X' % [h, l]] = i.chr
          tbl['%%%X%x' % [h, l]] = i.chr
          tbl['%%%x%x' % [h, l]] = i.chr
        end
        tbl['+'] = ' '
        begin
          TBLDECWWWCOMP_.replace(tbl)
          TBLDECWWWCOMP_.freeze
        rescue
        end
      end
      str = str.gsub(/%(?![0-9a-fA-F]{2})/, "%25")
      str.gsub(/\+|%[0-9a-fA-F]{2}/) { |m| TBLDECWWWCOMP_[m] }
    end
  end
end
All this does is encode % symbols that aren't followed by two hex characters, instead of raising an exception. Not sure it's such a good idea to be monkey-patching URI like this, though. There must be a valid reason this wasn't done in the gem (maybe security related?).
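For illustration, the intended effect of the patched method (expected values, not captured output):
# With the patch loaded:
URI.decode_www_form_component("width=80%25")  # => "width=80%"  (valid encoding, unchanged behaviour)
URI.decode_www_form_component("width=80%")    # => "width=80%"  (stray % no longer raises)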
I just found out more about this issue. It looks like all the links are coming from spidername.com, according to Google Webmaster Tools. They seem to add those params to the URL, and when you click the link, JavaScript checks whether the URL contains the iframe= query param and shows the content in an iframe. However, Googlebot goes straight to the iframe URL, and that is what causes the issue.
I decided to use a redirect rule in nginx to solve the issue.
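(An in-app alternative to an nginx rule, sketched here with a hypothetical middleware name, is to redirect before Rack ever parses the broken query string:)
# Hypothetical Rack middleware: redirect requests carrying the bogus iframe params
# to the clean path before the query string is parsed.
class StripIframeParams
  def initialize(app)
    @app = app
  end

  def call(env)
    if env['QUERY_STRING'].to_s.include?('iframe=true')
      return [301, { 'Location' => env['PATH_INFO'], 'Content-Type' => 'text/html' }, []]
    end
    @app.call(env)
  end
end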
I have the same issue. I am worried that it is a third-party spam link that tries to lower my site's Google ranking.
