What can I do about Mechanize waiting on an unresponsive web site? - ruby-on-rails

I noticed that when I fetch a site that is not responding using Mechanize, it just keeps on waiting.
How can I overcome this problem?

There are a couple of ways to deal with it.
Open-URI and Net::HTTP both let you pass in timeout values, which tell the underlying networking stack how long you are willing to wait. For instance, Mechanize lets you get at its settings when you initialize an instance, something like:
mech = Mechanize.new { |agent|
  agent.open_timeout = 5
  agent.read_timeout = 5
}
It's all in the docs for new, but you'll have to view the source to see which instance variables you can get at.
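If you are using Open-URI directly rather than Mechanize, the same timeouts can be passed per call. A minimal sketch, assuming a reasonably recent Ruby (the :open_timeout option landed in 2.2, URI.open in 2.5; the URL is a placeholder):
require 'open-uri'

# Both options raise Net::OpenTimeout / Net::ReadTimeout instead of
# letting the fetch hang forever on an unresponsive host.
html = URI.open('http://example.com/slow', open_timeout: 5, read_timeout: 5).read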
Or you can use Ruby's timeout module:
require 'timeout'
status = Timeout::timeout(5) {
  # Something that should be interrupted if it takes too much time...
}
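For example, a fetch wrapped in a timeout might look like this (an illustrative sketch; the URL is a placeholder):
require 'timeout'
require 'mechanize'

mech = Mechanize.new

begin
  page = Timeout.timeout(5) do
    mech.get('http://example.com/slow-page') # placeholder URL
  end
rescue Timeout::Error
  # Give up instead of hanging on the unresponsive site.
  page = nil
end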

On this page there are two undocumented attributes, open_timeout and read_timeout; try using them: http://mechanize.rubyforge.org/mechanize/Mechanize.html
agent = Mechanize.new { |a| a.log = Logger.new("mech.log") }
agent.keep_alive = false
agent.open_timeout = 15
agent.read_timeout = 15
HTH

Related

Rails application taking more than 30 seconds to respond

I'm making a small Rails application that fetches data about some different languages from the GitHub API.
The problem is, when I click the button that fetches the information, it takes a long time to redirect to the correct page. What I see in the network tab is that the TTFB is actually 30s (!) and the response comes back with status 302.
The controller action that is doing the logic:
  Language.delete_all
  search_urls = Introduction.all.map { |introduction| "https://api.github.com/search/repositories?q=#{introduction.name}&per_page=1" }
  search_urls.each do |search_url|
    json_file = JSON.parse(open(search_url).read)
    pl = Language.new
    pl.hash_response = json_file['items'].first
    pl.name = pl.hash_response['language']
    pl.save
  end
  main_languages = %w[ruby javascript python elixir java]
  deletable_languages = Introduction.all.reject do |introduction|
    main_languages.include?(introduction.name)
  end
  deletable_languages.each do |language|
    language.delete
  end
  redirect_to languages_path
end
I believe the bottleneck is the HTTP requests, which you are making one by one. You could filter down to the languages you want before generating the URLs, and only fetch those.
However, if the number of URLs after filtering is still large, say 20-50, then assuming each request takes 200ms, this would take at least 4 to 10 seconds just for the HTTP requests. That's already too long for the user to wait for. In that case you should make it a background job; a minimal sketch follows.
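A minimal sketch of the background-job route, assuming ActiveJob (Rails 5+'s ApplicationJob base class); FetchLanguagesJob is a hypothetical name and the fetching logic is lifted from your action:
# app/jobs/fetch_languages_job.rb (hypothetical)
require 'open-uri'

class FetchLanguagesJob < ApplicationJob
  queue_as :default

  def perform
    Language.delete_all
    Introduction.all.each do |introduction|
      url = "https://api.github.com/search/repositories?q=#{introduction.name}&per_page=1"
      json = JSON.parse(URI.open(url).read)
      language = Language.new
      language.hash_response = json['items'].first
      language.name = language.hash_response['language']
      language.save
    end
  end
end

# In the controller: enqueue the job and redirect right away.
# FetchLanguagesJob.perform_later
# redirect_to languages_path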
If you insist on doing this synchronously, you could fire those HTTP requests from multiple threads and join all the results once the threads are done. That gives you some concurrency, because the GIL does not block threads that are waiting on IO. But it is error-prone, since you have to manage the threads on your own; a rough sketch of that approach is below.
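A rough sketch of the threaded approach, reusing the names from your action (no error handling, purely illustrative):
require 'open-uri'
require 'json'

threads = search_urls.map do |search_url|
  Thread.new do
    # MRI releases the GIL while a thread waits on network IO,
    # so these requests run concurrently.
    JSON.parse(URI.open(search_url).read)
  end
end

# Join the threads and collect each block's return value.
results = threads.map(&:value)

results.each do |json_file|
  pl = Language.new
  pl.hash_response = json_file['items'].first
  pl.name = pl.hash_response['language']
  pl.save
end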

How to split a long-lived Sidekiq job into many short-lived jobs in a Ruby on Rails app

So I'm building a website that calls a third-party API that can take from 20 seconds to 30 minutes to return a result. I can't know the duration in advance, so I need to poll frequently to check whether the work is done (it returns "COMPLETE" and the result) or not (it returns "IN_PROGRESS"). Also, this API might be called many times by many users at the same time.
So I created a Sidekiq worker that checks the API every 5 seconds until it receives "COMPLETE", and only then does it end. But I've read that Sidekiq should only be doing short-lived jobs, and I'm struggling to get my head around how I should do this. I've been trying to search for an answer, but I suspect I don't know the words to find what I'm looking for.
I'm sure there is a way to tell my workers to call the API once, end if the result is "IN_PROGRESS", but make sure another worker will do another API call to check, and so on until the result is "COMPLETE".
Also, I guess this would help distribute the load in case many users demand the use of said API, because fewer workers can do more of these short-lived jobs.
This is my worker, which I hope clarifies what I'm doing right now:
class ThingProgressWorker
  include Sidekiq::Worker

  def perform(id)
    @thing = Thing.find(id)
    @thing_api_call = ThingAPICall.new # This uses the Ruby library of the API
    completed = false
    while completed == false
      result = @thing_api_call.get_result({ thing_job_name: @thing.job_name })
      if !result.include? "COMPLETED"
        completed = false
        sleep 5
      else
        completed = true
        @thing.status = "completed"
        @thing.save
        break
      end
    end
  end
end
So if the API takes ten minutes to go from "IN_PROGRESS" to "COMPLETED", this worker will be busy for that long, which I reckon is not advised at all.
I've been thinking about this for some hours now and can't work out how to make each API call its own job without having a worker busy until the API is done.
The only solution I've thought of so far is having a master worker that calls another worker for each API call, but then I'd still have a worker busy for as long as the API takes to send the result.
I'd appreciate any help or directions!
Thanks in advance
Try calling the worker again with a delay, for example:
class ThingProgressWorker
  include Sidekiq::Worker

  def perform(id)
    @thing = Thing.find(id)
    @thing_api_call = ThingAPICall.new # This uses the Ruby library of the API
    result = @thing_api_call.get_result({ thing_job_name: @thing.job_name })
    if !result.include? "COMPLETED"
      ThingProgressWorker.perform_in(1.minute, id)
    else
      @thing.status = "completed"
      @thing.save
    end
  end
end
This will add the job to the queue but will not run it immediately; it runs after the delay you specify, so each poll becomes its own short-lived job.
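If you also want to stop polling after some maximum time, one option (an illustrative sketch, not part of the original answer; the attempt counter and the timed_out status are hypothetical) is to pass a counter along as a job argument:
class ThingProgressWorker
  include Sidekiq::Worker

  MAX_ATTEMPTS = 360 # e.g. roughly 6 hours of polling at 1-minute intervals

  def perform(id, attempt = 0)
    thing = Thing.find(id)
    result = ThingAPICall.new.get_result({ thing_job_name: thing.job_name })

    if result.include?("COMPLETED")
      thing.update(status: "completed")
    elsif attempt < MAX_ATTEMPTS
      # Re-enqueue a fresh short-lived job instead of sleeping in this one.
      ThingProgressWorker.perform_in(1.minute, id, attempt + 1)
    else
      thing.update(status: "timed_out") # hypothetical failure status
    end
  end
end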

How to prevent rollbar from reporting SEO crawlers activities?

I have set up Rollbar in my Rails application. It keeps reporting RecordNotFound errors that are the result of SEO crawlers (i.e. Googlebot, Baidu, findxbot, etc.) requesting deleted posts.
How can I prevent Rollbar from reporting SEO crawler activity?
TL;DR:
# ./initializers/rollbar.rb
#
# https://stackoverflow.com/questions/36588449/how-to-prevent-rollbar-from-reporting-seo-crawlers-activities
#
# frozen_string_literal: true
crawlers = %w[Facebot Twitterbot YandexBot bingbot AhrefsBot crawler MJ12bot Yahoo GoogleBot Mail.RU_Bot SemrushBot YandexMobileBot DotBot AppleMail SeznamBot Baiduspider]
regexp = Regexp.new(Regexp.union(*crawlers).source, Regexp::IGNORECASE)
Rollbar.configure do |config|
  ignore_bots = lambda do |options|
    agent = options.fetch(:scope).fetch(:request).call.fetch(:headers)['User-Agent']
    raise Rollbar::Ignore if agent.match?(regexp)
  end
  config.before_process << ignore_bots
  ...
end
======================
Be careful with the frozen_string_literal magic comment (it needs Ruby 2.3+), and use =~ instead of match? if your Ruby version is older than 2.4.
Here I use an array that gets transformed into a regexp. I did this to prevent future syntax and escaping mistakes by developers, and I added the case-insensitive flag for the same reason.
So in the regexp you will see Mail\.RU_Bot, rather than something wrongly escaped.
Also, in your case you could simply use the word bot instead of listing many crawlers, but be careful with unusual user agents. In my case I want to know about all the crawlers on my site, so I came up with this solution. Yet another example of it working: there are crawler and crawler4j hitting my production site, and I use just crawler in the array to avoid being notified about either of them, as the snippet below illustrates.
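For illustration (the user-agent strings here are made up), a single crawler entry in the array catches both:
regexp = Regexp.new(Regexp.union(*%w[crawler]).source, Regexp::IGNORECASE)

regexp.match?('my-crawler/1.0')                      # => true
regexp.match?('crawler4j (https://example.com/bot)') # => true
regexp.match?('Mozilla/5.0 (Windows NT 10.0)')       # => false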
The last thing I want to say: my solution is not very optimal, but it just works. I hope someone will share an optimized version of my code. That's also the main reason I recommend sending the data asynchronously, i.e. using Sidekiq, delayed_job or whatever you want; don't forget to check the related wikis. A configuration sketch follows.
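A minimal sketch of the asynchronous option, assuming rollbar-gem's built-in async support (double-check the exact options against the gem's wiki for your version):
# addition to ./initializers/rollbar.rb (sketch)
Rollbar.configure do |config|
  # Report in a background handler instead of inline with the request:
  config.use_async = true
  # Or hand reporting off to Sidekiq (the queue name is just an example):
  # config.use_sidekiq 'queue' => 'rollbar'
end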
My answer is based on @AndrewSouthpaw's solution (?), which wasn't working for me. Hoping that the approved, wiki-copy-pasted answer by @Jesse Gibbs will be moderated in some way.
=======
EDIT1: it's a nice idea to check out the https://github.com/ZLevine/rollbar-ignore-crawler-errors repo if you need to keep Rollbar from notifying on the JS side.

Looks like you are using rollbar-gem, so you'd want to use Rollbar::Ignore to tell Rollbar to ignore errors that were caused by a spider:
handler = proc do |options|
  raise Rollbar::Ignore if is_crawler_error(options)
end

Rollbar.configure do |config|
  config.before_process << handler
end
where is_crawler_error detects if the request that led to the error was from a crawler.
If you are using rollbar.js to detect errors in client-side Javascript, then you can use the checkIgnore option to filter out client-side errors caused by bots:
_rollbarConfig = {
  // current config...
  checkIgnore: function(isUncaught, args, payload) {
    if (window.navigator.userAgent && window.navigator.userAgent.indexOf('Baiduspider') !== -1) {
      // ignore baidu spider
      return true;
    }
    // no other ignores
    return false;
  }
}
Here's what I did:
is_crawler_error = Proc.new do |options|
  return true if options[:scope][:request]['From'] == 'bingbot(at)microsoft.com'
  return true if options[:scope][:request]['From'] == 'googlebot(at)googlebot.com'
  return true if options[:scope][:request]['User-Agent'] =~ /Facebot|Twitterbot/
end
handler = proc do |options|
  raise Rollbar::Ignore if is_crawler_error.call(options)
end

config.before_process << handler
Based on these docs.

Several questions about this Varnish VCL

I'm setting up varnish-devicedetect VCL in Varnish 4.0.2:
https://github.com/varnish/varnish-devicedetect/blob/master/INSTALL.rst
I'm following the directions for method #1: "Send HTTP header to backend"
I've read through this README and have Googled for quite some time now, and still quite a few concepts are escaping me.
Here's my code (excerpts):
default.vcl
include "devicedetect.vcl";
sub vcl_recv {
call devicedetect;
# ... snip ...
}
sub vcl_backend_response {
# device detect
if (bereq.http.X-UA-Device) {
if (!beresp.http.Vary) { # no Vary at all
set beresp.http.Vary = "X-UA-Device";
} elseif (beresp.http.Vary !~ "X-UA-Device") { # add to existing Vary
set beresp.http.Vary = beresp.http.Vary + ", X-UA-Device";
}
}
# ... snip ...
}
sub vcl_deliver {
# device detect
if ((req.http.X-UA-Device) && (resp.http.Vary)) {
set resp.http.Vary = regsub(resp.http.Vary, "X-UA-Device", "User-Agent");
}
# ... snip ...
}
Here are my questions.
When I inspect the response in Chrome Dev Tools, why is the Vary header set to User-Agent? Isn't the whole approach of method #1 NOT to use the User-Agent, and instead use X-UA-Device?
Based on other guides I've read, it seems this will hit the origin for EACH type of mobile device (if you look in devicedetect, it's split up into mobile-iphone, mobile-android, mobile-smartphone, etc.). Is this true in my code above? I definitely DON'T want to hit the origin server more than twice for any given URL (desktop and mobile; I don't want all the mobile-* variants cached separately).
Can someone describe what the 3 code blocks above actually do, in somewhat layman's terms? About the only one I truly understand is the first code block: call devicedetect just looks at the User-Agent and then sets the X-UA-Device header with the appropriate grouping on the request to the backend. I'm a bit confused about what the other 2 code blocks do, though.
Can I delete the bit with X-UA-Device-force if I don't intend to allow the user to 'use desktop site'?
The guide mentions that I should be setting something in the backend in my app code. Right now this is all I have (Rails). I'm not changing headers or anything else about the response; I'm only changing the way the HTML looks (for the mobile version of the site). Should I be changing a header or something? This is what I have so far:
Rails:
def detect_device
  if request.headers['X-UA-Device'] =~ /^mobile/
    @device = 'mobile'
    prepend_view_path Rails.root + 'app' + 'views_mobile'
  else
    @device = 'desktop'
  end
end
As to point 1: your X-UA-Device is a custom header for internal consumption, i.e. by default it is not exposed to the external world. To ensure that external caches/proxies understand you are varying the response on the device/user agent, you have to transform the Vary into a header that reflects this. That's where User-Agent comes in, since that's what X-UA-Device was derived from.
Note the comment within the link you indicated:
to keep any caches in the wild from serving wrong content to client #2 behind them, we need to transform the Vary on the way out.

Nokogiri Timeout::Error when scraping own site

Nokogiri works fine for me in the console, but if I put it anywhere... Model, View, or Controller, it times out.
I'd like to use it in one of two ways...
Controller
def show
  @design = Design.find(params[:id])
  doc = Nokogiri::HTML(open(design_url(@design)))
  images = doc.css('.well img') ? doc.css('.well img').map{ |i| i['src'] } : []
end
or...
Model
def first_image
  doc = Nokogiri::HTML(open("http://localhost:3000/blog/#{self.id}"))
  image = doc.css('.well img')[0] ? doc.css('.well img')[0]['src'] : nil
  self.update_attribute(:photo_url, image)
end
Both result in a timeout, though they work perfectly in the console.
When you run your Nokogiri code from the console, you're referencing your development server at localhost:3000. Thus, there are two instances running: one making the call (your console) and one answering the call (your server).
When you run it from within your app, you are referencing the app itself, which causes an infinite wait because there is no available resource to respond to your call (that resource is the one making the call!). So you would need to run multiple instances with something like Unicorn (or simply another localhost instance on a different port, as sketched below), and you would need at least one of those instances to be free to answer the Nokogiri request.
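For example (a hypothetical local setup): start a second dev server on another port, say with rails server -p 3001, and point the scraping code at it so the instance on port 3000 stays free to answer:
# Assuming a second server instance is listening on port 3001:
require 'open-uri'

doc = Nokogiri::HTML(open("http://localhost:3001/blog/#{self.id}"))
image = doc.css('.well img')[0] ? doc.css('.well img')[0]['src'] : nil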
If you plan to run this in production, just know that this setup will require an available resource to answer the Nokogiri request, so you're essentially tying up 2 instances with each call. So if you have 4 instances and all 4 happen to make the call at the same time, your whole application is screwed. You'll probably experience pretty severe degradation with only 1 or 2 calls at a time as well...
I'm not sure what the default timeout value is.
But you can specify a timeout value like below:
require 'net/http'
http = Net::HTTP.new('localhost')
http.open_timeout = 100
http.read_timeout = 100
Nokogiri.parse(http.get("/blog/#{self.id}").body)
Then you can find out what the problem is, since you control the timeout value.
So, with Tyler's advice I dug into what I was doing a bit more. Because of the disconnect that CKEditor has with the images, due to CarrierWave and S3, I can't get any info directly from the uploader (at least it seems that way to me).
Instead, I'm sticking with Nokogiri, and it's working wonderfully. I realized what I was actually doing with the open() command, and it was completely unnecessary. Nokogiri parses HTML, and I can give it HTML in the form of @design.content! Duh, on my part.
So, this is how I'm scraping my own site, to get the images associated with a blog entry:
designs_controller.rb
def create
  params[:design][:photo_url] = Nokogiri::HTML(params[:design][:content]).css('img').map{ |i| i['src']}[0]
  @design = Design.new(params[:design])
  if @design.save
    flash[:success] = "Design created"
    redirect_to designs_url
  else
    render 'designs/new'
  end
end

def show
  @design = Design.find(params[:id])
  @categories = @design.categories
  @tags = @categories.map {|c| c.name}
  @related = Design.joins(:categories).where('categories.name' => @tags).reject {|d| d.id == @design.id}.uniq
  set_meta_tags og: {
    title: @design.name,
    type: 'article',
    url: design_url(@design),
    image: Nokogiri::HTML(@design.content).css('img').map{ |i| i['src']},
    article: {
      published_time: @design.published_at.to_datetime,
      modified_time: @design.updated_at.to_datetime,
      author: 'Alphabetic Design',
      section: 'Designs',
      tag: @tags
    }
  }
end
The Update action has the same code for Nokogiri as the Create action.
Seems kind of obvious now that I'm looking at it, lol. I dwelled on this for longer than I'd like to admit...
