How to prevent Rollbar from reporting SEO crawler activity? - ruby-on-rails

I have set up Rollbar in my Rails application. It keeps reporting RecordNotFound errors caused by SEO crawlers (e.g. Googlebot, Baidu, findxbot, etc.) requesting deleted posts.
How can I prevent Rollbar from reporting SEO crawler activity?

TL;DR:
# config/initializers/rollbar.rb
#
# https://stackoverflow.com/questions/36588449/how-to-prevent-rollbar-from-reporting-seo-crawlers-activities
#
# frozen_string_literal: true
crawlers = %w[Facebot Twitterbot YandexBot bingbot AhrefsBot crawler MJ12bot Yahoo GoogleBot Mail.RU_Bot SemrushBot YandexMobileBot DotBot AppleMail SeznamBot Baiduspider]
regexp = Regexp.new(Regexp.union(*crawlers).source, Regexp::IGNORECASE)

Rollbar.configure do |config|
  ignore_bots = lambda do |options|
    agent = options.fetch(:scope).fetch(:request).call.fetch(:headers)['User-Agent']
    raise Rollbar::Ignore if agent.match?(regexp)
  end
  config.before_process << ignore_bots
  # ...
end
======================
Be careful with the frozen_string_literal magic comment (it needs Ruby 2.3+), and use =~ instead of match? if your Ruby version is older than 2.4, where String#match? is not available.
Here I use an array that is transformed into a regexp. I did this to prevent future syntax and escaping mistakes by other developers, and I added the case-insensitive flag for the same reason.
So in the resulting regexp you will see a properly escaped Mail\.RU_Bot instead of something broken.
Also, in your case you could simply match the word bot instead of listing many crawlers, but be careful with unusual user agents. In my case I want to know about every crawler hitting my site, so I came up with this solution. Another example of how it works in practice: both crawler and crawler4j visit my production site, and having just crawler in the array prevents notifications for both of them.
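To illustrate, a quick console check of what Regexp.union produces (the user-agent strings here are made-up examples):
crawlers = %w[Mail.RU_Bot crawler]
regexp   = Regexp.new(Regexp.union(*crawlers).source, Regexp::IGNORECASE)
regexp.source                             # => "Mail\\.RU_Bot|crawler"
regexp.match?('mail.ru_bot/2.0')          # => true  (dot escaped, case ignored)
regexp.match?('crawler4j/4.2 (example)')  # => true  ("crawler" also covers crawler4j)
regexp.match?('Mozilla/5.0 (Windows NT)') # => false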
The last thing I want to say: my solution is not very optimal, but it just works. I hope someone will share an optimized version of this code. That's also the main reason I recommend sending data asynchronously, i.e. using sidekiq, delayed_job or whatever you prefer; don't forget to check the related wikis.
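For example, rollbar-gem can hand reporting off to Sidekiq from the same initializer (option names as I recall them from the gem's docs, so double-check them against your gem version):
Rollbar.configure do |config|
  # Report asynchronously instead of inside the request/response cycle.
  config.use_async = true
  config.use_sidekiq = { 'queue' => 'rollbar' }
end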
My answer is based on @AndrewSouthpaw's solution (?), which wasn't working for me. Hoping that the approved, wiki-copy-pasted version by @Jesse Gibbs will be moderated in some way.
=======
EDIT1: it's a nice idea to check out the https://github.com/ZLevine/rollbar-ignore-crawler-errors repo if you need to prevent Rollbar from notifying on JS errors as well.

It looks like you are using rollbar-gem, so you'd want to raise Rollbar::Ignore to tell Rollbar to ignore errors that were caused by a spider:
handler = proc do |options|
  raise Rollbar::Ignore if is_crawler_error(options)
end

Rollbar.configure do |config|
  config.before_process << handler
end
where is_crawler_error detects if the request that led to the error was from a crawler.
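A minimal sketch of such a check (hypothetical; the exact shape of options[:scope][:request] depends on your rollbar-gem version, and in newer versions it is a lazily evaluated proc, as in the first answer above):
def is_crawler_error(options)
  request = options[:scope][:request]
  request = request.call if request.respond_to?(:call) # newer gem versions wrap the request data lazily
  agent = request[:headers] ? request[:headers]['User-Agent'].to_s : ''
  agent.match?(/bot|crawler|spider/i)
end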
If you are using rollbar.js to detect errors in client-side JavaScript, then you can use the checkIgnore option to filter out client-side errors caused by bots:
_rollbarConfig = {
  // current config...
  checkIgnore: function(isUncaught, args, payload) {
    if (window.navigator.userAgent && window.navigator.userAgent.indexOf('Baiduspider') !== -1) {
      // ignore baidu spider
      return true;
    }
    // no other ignores
    return false;
  }
};

Here's what I did:
is_crawler_error = lambda do |options| # lambda rather than Proc.new so `return` exits the block, not the surrounding context
  return true if options[:scope][:request]['From'] == 'bingbot(at)microsoft.com'
  return true if options[:scope][:request]['From'] == 'googlebot(at)googlebot.com'
  return true if options[:scope][:request]['User-Agent'] =~ /Facebot|Twitterbot/
end

handler = proc do |options|
  raise Rollbar::Ignore if is_crawler_error.call(options)
end

config.before_process << handler
Based on these docs.

Related

How can I make this method more concise?

I get a warning when running reek on a Rails project:
[36]:ArborReloaded::UserStoryService#destroy_stories has approx 8 statements (TooManyStatements)
Here's the method:
def destroy_stories(project_id, user_stories)
  errors = []
  @project = Project.find(project_id)
  user_stories.each do |current_user_story_id|
    unless @project.user_stories.find(current_user_story_id).destroy
      errors.push("Error destroying user_story: #{current_user_story_id}")
    end
  end
  if errors.compact.length == 0
    @common_response.success = true
  else
    @common_response.success = false
    @common_response.errors = errors
  end
  @common_response
end
How can this method be minimized?
First, I find that class and method size are useful for finding code that might need refactoring, but sometimes you really do need a long class or method. And there is always a way to make your code shorter to get around such limits, but that might make it less readable. So I disable that type of inspection when using static analysis tools.
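For example, if you only want reek to skip this one check for this one method, reek supports smell-suppressing comments (if I remember the directive syntax correctly; check your reek version's docs), along the lines of:
# :reek:TooManyStatements
def destroy_stories(project_id, user_stories)
  # ...
end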
Also, it's unclear to me why you'd expect to have an error when deleting a story, or who benefits from an error message that just includes the ID and nothing about what error occurred.
That said, I'd write that method like this, to reduce the explicit local state and to better separate concerns:
def destroy_stories(project_id, story_ids)
  project = Project.find(project_id) # I don't see a need for an instance variable
  errors = story_ids.
    select { |story_id| !project.user_stories.find(story_id).destroy }.
    map { |story_id| "Error destroying user_story: #{story_id}" }
  respond errors
end

# Lots of services probably need to do this, so it can go in a superclass.
# Even better, move it to @common_response's class.
def respond(errors)
  # It would be best to move this behavior to @common_response.
  @common_response.success = errors.none?
  # Hopefully this works even when errors == []. If not, fix your framework.
  @common_response.errors = errors
  @common_response
end
You can see how taking some care in your framework can save a lot of noise in your components.

Several questions about this Varnish VCL

I'm setting up varnish-devicedetect VCL in Varnish 4.0.2:
https://github.com/varnish/varnish-devicedetect/blob/master/INSTALL.rst
I'm following the directions for method #1: "Send HTTP header to backend"
I've read through this readme and have Googled for quite some time now and still quite a few concepts are escaping me.
Here's my code (excerpts):
default.vcl
include "devicedetect.vcl";
sub vcl_recv {
    call devicedetect;
    # ... snip ...
}

sub vcl_backend_response {
    # device detect
    if (bereq.http.X-UA-Device) {
        if (!beresp.http.Vary) { # no Vary at all
            set beresp.http.Vary = "X-UA-Device";
        } elseif (beresp.http.Vary !~ "X-UA-Device") { # add to existing Vary
            set beresp.http.Vary = beresp.http.Vary + ", X-UA-Device";
        }
    }
    # ... snip ...
}

sub vcl_deliver {
    # device detect
    if ((req.http.X-UA-Device) && (resp.http.Vary)) {
        set resp.http.Vary = regsub(resp.http.Vary, "X-UA-Device", "User-Agent");
    }
    # ... snip ...
}
Here are my questions:
1. When I inspect the response in Chrome Dev Tools, why is the Vary header set to User-Agent? Isn't the whole approach of method #1 NOT to use the user agent, and to use X-UA-Device instead?
2. Based on other guides I've read, it seems this will hit the origin for EACH type of mobile device (if you look in devicedetect.vcl, it's split up into mobile-iphone, mobile-android, mobile-smartphone, etc.). Is this true in my code above? I definitely DON'T want to hit the origin server more than twice for any given URL (desktop and mobile; I don't want all the mobile-* variants cached separately).
3. Can someone describe, in somewhat layman's terms, what the 3 code blocks above actually do? About the only one I truly understand is the first: call devicedetect just looks at the User-Agent and then sets the X-UA-Device header with the appropriate grouping on the request to the backend. I'm a bit confused about what the other 2 code blocks do, though.
4. Can I delete the bit with X-UA-Device-force if I don't intend to allow the user to "use desktop site"?
5. The guide mentions that I should be setting something in the backend in my app code. Right now this is all I have (Rails). I'm not changing headers or anything else about the response; I'm only changing the way the HTML looks for the mobile version of the site. Should I be changing a header or something? This is what I have so far:
Rails:
def detect_device
  if request.headers['X-UA-Device'] =~ /^mobile/
    @device = 'mobile'
    prepend_view_path Rails.root + 'app' + 'views_mobile'
  else
    @device = 'desktop'
  end
end
As to point 1: your X-UA-Device is a custom header for internal consumption, i.e. by default it is not exposed to the external world. To make sure external caches/proxies understand that you are varying the response on the device/user agent, you have to rewrite Vary to a header that reflects this. That is where User-Agent comes in, since that is what X-UA-Device was derived from.
Note the comment in the guide you linked:
to keep any caches in the wild from serving wrong content to client #2 behind them, we need to transform the Vary on the way out.
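As to point 5: the vcl_backend_response block above already adds X-UA-Device to Vary when the backend doesn't, so strictly nothing more is needed in the app. If you'd rather have Rails declare the variation itself, a minimal sketch (my own addition, not from the guide) would be:
# app/controllers/application_controller.rb
after_filter :add_device_vary_header # after_action in newer Rails

def add_device_vary_header
  # Declare the variation from the app side as well; Varnish folds/rewrites
  # this on the way out as shown in the VCL above.
  response.headers['Vary'] = [response.headers['Vary'], 'X-UA-Device'].compact.join(', ')
end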

How to read the Rails API

I'm having a difficult time understanding the Rails API. I am trying to figure out a way to understand what I can call from certain points inside Rails, such as when I'm in a controller, so I wrote something to print all the methods that are available, sorted by which module/class they fall under:
last_sig = ""
self.methods.each do |method|
  #i_am = self.method(method).owner
  #puts i_am.class
  #places.push(self.method(method).owner)
  m = self.method(method)
  sig = "#{m.owner.class}: #{m.owner}"
  if sig != last_sig
    last_sig = sig
    puts sig
  end
  puts " #{method}"
end
As an example, I find out (just using this as an easy example) that I can use the render() method and it is located at ActionController::Instrumentation, so then I look at the render() function there and it says:
render(*args)

# File actionpack/lib/action_controller/metal/instrumentation.rb, line 38
def render(*args)
  render_output = nil
  self.view_runtime = cleanup_view_runtime do
    Benchmark.ms { render_output = super }
  end
  render_output
end
That is all it says; I don't understand how I'm supposed to work out how render actually works from this. Then I do some more searching and, by "luck", I discover that it is documented in ActionView, and I wonder how I was supposed to know that. Anyway, any tips on how to read the API would be appreciated. It seems like many of the things in the API are not documented for users, and I don't know whether they are meant for users or for the developers of Rails themselves. I'm used to documentation like jQuery's, which seems much easier to discover functionality with.
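For what it's worth, Method#owner, Method#source_location and Module#ancestors can help with this kind of spelunking from a controller or the console, e.g.:
m = method(:render)
m.owner              # e.g. ActionController::Instrumentation, as above
m.source_location    # the file and line where that definition of render lives
self.class.ancestors # the full lookup chain that `super` walks through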

Nokogiri Timeout::Error when scraping own site

Nokogiri works fine for me in the console, but if I put it anywhere... Model, View, or Controller, it times out.
I'd like to use it in 1 of 2 ways...
Controller
def show
  @design = Design.find(params[:id])
  doc = Nokogiri::HTML(open(design_url(@design)))
  images = doc.css('.well img') ? doc.css('.well img').map{ |i| i['src'] } : []
end
or...
Model
def first_image
  doc = Nokogiri::HTML(open("http://localhost:3000/blog/#{self.id}"))
  image = doc.css('.well img')[0] ? doc.css('.well img')[0]['src'] : nil
  self.update_attribute(:photo_url, image)
end
Both result in a timeout, though they work perfectly in the console.
When you run your Nokogiri code from the console, you're referencing your development server at localhost:3000. Thus, there are two instances involved: one making the call (your console) and one answering the call (your server).
When you run it from within your app, you are referencing the app itself, which causes a deadlock: there is no available resource to respond to your call, because that resource is the one making the call. So you would need to be running multiple instances with something like Unicorn (or simply another localhost instance on a different port), and you would need at least one of those instances to be free to answer the Nokogiri request.
If you plan to run this in production, just know that this setup will require an available resource to answer the Nokogiri request, so you're essentially tying up 2 instances with each call. So if you have 4 instances and all 4 happen to make the call at the same time, your whole application is screwed. You'll probably experience pretty severe degradation with only 1 or 2 calls at a time as well...
I'm not sure what the default timeout value is, but you can specify one like below:
require 'net/http'
http = Net::HTTP.new('localhost')
http.open_timeout = 100
http.read_timeout = 100
Nokogiri.parse(http.get("/blog/#{self.id}").body)
Then you can work out what the problem is, since you control the timeout value.
So, with Tyler's advice I dug into what I was doing a bit more. Because of the disconnect that CKEditor has with the images, due to CarrierWave and S3, I can't get any info directly from the uploader (at least it seems that way to me).
Instead, I'm sticking with Nokogiri, and it's working wonderfully. I realized what I was actually doing with the open() command, and it was completely unnecessary. Nokogiri parses HTML, and I can give it HTML in the form of @design.content! Duh, on my part.
So, this is how I'm scraping my own site, to get the images associated with a blog entry:
designs_controller.rb
def create
  params[:design][:photo_url] = Nokogiri::HTML(params[:design][:content]).css('img').map{ |i| i['src'] }[0]
  @design = Design.new(params[:design])
  if @design.save
    flash[:success] = "Design created"
    redirect_to designs_url
  else
    render 'designs/new'
  end
end

def show
  @design = Design.find(params[:id])
  @categories = @design.categories
  @tags = @categories.map { |c| c.name }
  @related = Design.joins(:categories).where('categories.name' => @tags).reject { |d| d.id == @design.id }.uniq
  set_meta_tags og: {
    title: @design.name,
    type: 'article',
    url: design_url(@design),
    image: Nokogiri::HTML(@design.content).css('img').map{ |i| i['src'] },
    article: {
      published_time: @design.published_at.to_datetime,
      modified_time: @design.updated_at.to_datetime,
      author: 'Alphabetic Design',
      section: 'Designs',
      tag: @tags
    }
  }
end
The Update action has the same code for Nokogiri as the Create action.
Seems kind of obvious now that I'm looking at it, lol. I dwelled on this for longer than I'd like to admit...
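Since the create and update actions share that Nokogiri line, it could also live in a small private helper (just a sketch; the method name is mine):
# designs_controller.rb
private

# Pull the first <img> src out of the submitted content, if there is one.
def first_image_src(html)
  Nokogiri::HTML(html).css('img').map { |img| img['src'] }.first
end

# then, in both create and update:
# params[:design][:photo_url] = first_image_src(params[:design][:content])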

MongoDB - need to display status of db (running or not)

I am currently using MongoDB for tracking various things in a Rails 2 app. I am using the following code to see if MongoDB is up and running and, depending upon the status, displaying a link or an "Offline" message.
This is only for admins, so it's not mission-critical; the app will continue to run without MongoDB, but I do want to keep disabling the link in the menu when it's not running. However, I don't like the overhead of the code below (it doesn't take long to run, but I hope there is a cleaner, faster way):
def verify_mongodb_status
  begin
    track = Track.first
    @mongodb_running = true
  rescue
    @mongodb_running = false
    logger.debug("***MongoDB not running.***")
    notify_admin_about_errors("***MongoDB is not running***")
  end
end
EDIT: I forgot to mention that I'm already doing a before_filter for this; the method sits in application_controller.rb.
I decided to go with action_caching as there doesn't seem to be a great way to do this. The result was quite a large speed increase from ~120ms to ~16-25ms:
def verify_mongodb_status
  begin
    track = Track.first
    @mongodb_running = true
  rescue => e
    @mongodb_running = false
    logger.debug("***MONGODB OFFLINE***: #{e}")
    notify_admin_about_errors("MongoDB", "MongoDB error:\n#{e}", nil)
    expire_action :action => :verify_mongodb_status
    return
  end
end
I'm adding logic now to keep from getting bombarded by emails when MongoDB goes offline (1 is enough).
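One simple way to do that throttling is to remember when the last alert went out, for example with Rails.cache (a sketch, assuming a shared cache store; the helper and key names are mine):
def notify_admin_about_mongodb_outage(error)
  # Send at most one alert per hour while MongoDB stays down.
  return if Rails.cache.read('mongodb_outage_notified')

  notify_admin_about_errors("MongoDB", "MongoDB error:\n#{error}", nil)
  Rails.cache.write('mongodb_outage_notified', true, :expires_in => 1.hour)
end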
