Speed up rake task by using typhoeus - ruby-on-rails

So I stumbled across this: https://github.com/typhoeus/typhoeus
I'm wondering if this is what I need to speed up my rake task:
Event.all.each do |row|
begin
url = urlhere + row.first + row.second
doc = Nokogiri::HTML(open(url))
doc.css('.table__row--event').each do |tablerow|
table = tablerow.css('.table__cell__body--location').css('h4').text
next unless table == row.eventvenuename
tablerow.css('.table__cell__body--availability').each do |button|
buttonurl = button.css('a')[0]['href']
if buttonurl.include? '/checkout/external'
else
row.update(row: buttonurl)
end
end
end
rescue Faraday::ConnectionFailed
puts "connection failed"
next
end
end
I'm wondering if this would speed it up, or whether it wouldn't because I'm doing a .each?
If it would, could you provide an example?
Sam

If you set up Typhoeus::Hydra to run parallel requests, you might be able to speed up your code, assuming that the Kernel#open calls are what's slowing you down. Before you optimize, you might want to run benchmarks to validate this assumption.
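One way to check that assumption is to time the fetch and the parse separately with the standard library's Benchmark module. This is only a sketch: fetch_page and parse_page are hypothetical stand-ins (a sleep and a regex scan) for the open(url) call and the Nokogiri work in the question, so swap in the real calls before trusting the numbers.

```ruby
require 'benchmark'

# Hypothetical stand-ins for the two halves of the loop body. In the real
# task these would be the HTTP fetch (open(url)) and the Nokogiri parsing.
def fetch_page(url)
  sleep 0.02 # simulates network latency
  "<html><body>#{url}</body></html>"
end

def parse_page(html)
  html.scan(/<[^>]+>/).size # simulates cheap CPU-bound parsing
end

# Accumulate the time spent in each phase over a small sample of URLs.
def profile(urls)
  timings = { fetch: 0.0, parse: 0.0 }
  urls.each do |url|
    html = nil
    timings[:fetch] += Benchmark.realtime { html = fetch_page(url) }
    timings[:parse] += Benchmark.realtime { parse_page(html) }
  end
  timings
end

timings = profile(%w[a b c])
puts format('fetch: %.3fs, parse: %.3fs', timings[:fetch], timings[:parse])
```

If the fetch time dominates, parallel requests are likely to help; if parsing or the database updates dominate, they probably won't.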
If it is true, and parallel requests would speed it up, you would need to restructure your code to load events in batches, build a queue of parallel requests for each batch, and then handle them after they execute. Here's some sketch code.
class YourBatchProcessingClass
  def initialize(batch_size: 200)
    @batch_size = batch_size
    @hydra = Typhoeus::Hydra.new(max_concurrency: @batch_size)
  end

  def perform
    # Get an array of records
    Event.find_in_batches(batch_size: @batch_size) do |batch|
      # Store all the requests so we can access their responses later.
      requests = batch.map do |record|
        request = Typhoeus::Request.new(your_url_build_logic(record))
        @hydra.queue request
        request
      end

      @hydra.run # Run requests in parallel

      # Process responses from each request
      requests.each do |request|
        your_response_processing(request.response.body)
      end
    end
  rescue WhateverError => e
    puts e.message
  end

  private

  def your_url_build_logic(event)
    # TODO
  end

  def your_response_processing(response_body)
    # TODO
  end
end

# Run the service by calling this in your Rake task definition
YourBatchProcessingClass.new.perform
Ruby can be used for pure scripting, but it functions best as an object-oriented language. Decomposing your processing work into clear methods can help clarify your code and help you catch things like Tom Lord mentioned in the comments on your question. Also, instead of wrapping your whole script in a begin..rescue block, you can use method-level rescues as in #perform above, or just wrap @hydra.run.
As a note, .all.each is a memory hog, and is thus considered a bad solution to iterating over records: .all loads all of the records into memory before iterating over them with .each. To save memory, it's better to use .find_each or .find_in_batches, depending on your use case. See: http://api.rubyonrails.org/classes/ActiveRecord/Batches.html
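To make the batching idea concrete outside Rails, here is a plain-Ruby analogy (not ActiveRecord itself): an Enumerator stands in for the table, and each_slice plays the role that find_in_batches plays for a model. The record shape and batch size are made up for illustration.

```ruby
# Plain-Ruby analogy only. `records` produces rows one at a time, the way a
# database cursor would, instead of building the full array up front.
records = Enumerator.new do |rows|
  1.upto(10) { |id| rows << { id: id } }
end

batch_sizes = []
records.each_slice(4) do |batch|
  # Only this batch (at most 4 rows) is held in memory at a time.
  batch_sizes << batch.size
end

p batch_sizes # prints [4, 4, 2]
```

In the rake task from the question, the corresponding change would be replacing Event.all.each with Event.find_each (or find_in_batches when you want the whole batch at once, as in the sketch class above).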

Related

Rake task for creating database records for all existing ActiveStorage variants

In Rails 6.1, ActiveStorage creates database records for all variants when they're loaded for the first time: https://github.com/rails/rails/pull/37901
I'd like to enable this, but since I have tens of thousands of files in my production Rails app, it'd be problematic (and presumably slow) to have users creating so many database records as they browse the site. Is there a way to write a Rake task that'll iterate through every attachment in my database, and generate the variants and save them in the database?
I'd run that once, after enabling the new active_storage.track_variants config, and then any newly-uploaded files would be saved when they're loaded for the first time.
Thanks for the help!
This is the Rake task I ended up creating for this. The Parallel stuff can be removed if you have a smaller dataset, but I found that with 70k+ variants it was intolerably slow when doing it without any parallelization. You can also ignore the progress bar-related code :)
Essentially, I just take all the models that have an attachment (I do this manually, you could do it in a more dynamic way if you have a ton of attachments), and then filter the ones that are not variable. Then I go through each attachment and generate a variant for each size I've defined, and then call process on it to force it to be saved to the database.
Make sure to catch MiniMagick (or vips, if you prefer) errors in the task so that a bad image file doesn't break everything.
# Rails 6.1 changes the way ActiveStorage works so that variants are
# tracked in the database. The intent of this task is to create the
# necessary variants for all game covers and user avatars in our database.
# This way, the user isn't creating dozens of variant records as they
# browse the site. We want to create them ahead-of-time, when we deploy
# the change to track variants.
namespace 'active_storage:vglist:variants' do
require 'ruby-progressbar'
require 'parallel'
desc "Create all variants for covers and avatars in the database."
task create: :environment do
games = Game.joins(:cover_attachment)
# Only attempt to create variants if the cover is able to have variants.
games = games.filter { |game| game.cover.variable? }
puts 'Creating game cover variants...'
# Use the configured max number of threads, with 2 leftover for web requests.
# Clamp it to 1 if the configured max threads is 2 or less for whatever reason.
thread_count = [(ENV.fetch('RAILS_MAX_THREADS', 5).to_i - 2), 1].max
games_progress_bar = ProgressBar.create(
total: games.count,
format: "\e[0;32m%c/%C |%b>%i| %e\e[0m"
)
# Disable logging in production to prevent log spam.
Rails.logger.level = 2 if Rails.env.production?
Parallel.each(games, in_threads: thread_count) do |game|
ActiveRecord::Base.connection_pool.with_connection do
begin
[:small, :medium, :large].each do |size|
game.sized_cover(size).process
end
# Rescue MiniMagick errors if they occur so that they don't block the
# task from continuing.
rescue MiniMagick::Error => e
games_progress_bar.log "ERROR: #{e.message}"
games_progress_bar.log "Failed on game ID: #{game.id}"
end
games_progress_bar.increment
end
end
games_progress_bar.finish unless games_progress_bar.finished?
users = User.joins(:avatar_attachment)
# Only attempt to create variants if the avatar is able to have variants.
users = users.filter { |user| user.avatar.variable? }
puts 'Creating user avatar variants...'
users_progress_bar = ProgressBar.create(
total: users.count,
format: "\e[0;32m%c/%C |%b>%i| %e\e[0m"
)
Parallel.each(users, in_threads: thread_count) do |user|
ActiveRecord::Base.connection_pool.with_connection do
begin
[:small, :medium, :large].each do |size|
user.sized_avatar(size).process
end
# Rescue MiniMagick errors if they occur so that they don't block the
# task from continuing.
rescue MiniMagick::Error => e
users_progress_bar.log "ERROR: #{e.message}"
users_progress_bar.log "Failed on user ID: #{user.id}"
end
users_progress_bar.increment
end
end
users_progress_bar.finish unless users_progress_bar.finished?
end
end
This is what the sized_cover looks like in game.rb:
def sized_cover(size)
width, height = COVER_SIZES[size]
cover&.variant(
resize_to_limit: [width, height]
)
end
sized_avatar is pretty much the same thing.

How can I prevent many sidekiq jobs from exceeding the API calls limit

I am working on a Ruby on Rails application. We have many Sidekiq workers that can process multiple jobs at a time. Each job makes calls to the Shopify API, and the call limit set by Shopify is 2 calls per second. I want to synchronize that, so that only two jobs can call the API in a given second.
The way I'm doing that right now, is like this:
# frozen_string_literal: true
class Synchronizer
  attr_reader :shop_id, :queue_name, :limit, :wait_time

  def initialize(shop_id:, queue_name:, limit: nil, wait_time: 1)
    @shop_id = shop_id
    @queue_name = queue_name.to_s
    @limit = limit
    @wait_time = wait_time
  end

  # This method should be called for each api call
  def synchronize_api_call
    raise "a block is required." unless block_given?
    get_api_call
    time_to_wait = calculate_time_to_wait
    sleep(time_to_wait) unless Rails.env.test? || time_to_wait.zero?
    yield
  ensure
    return_api_call
  end

  def set_api_calls
    redis.del(api_calls_list)
    redis.rpush(api_calls_list, calls_list)
  end

  private

  def get_api_call
    logger.log_message(synchronizer: 'Waiting for api call', color: :yellow)
    @api_call_timestamp = redis.brpop(api_calls_list)[1].to_i
    logger.log_message(synchronizer: 'Got api call.', color: :yellow)
  end

  def return_api_call
    redis_timestamp = redis.time[0]
    redis.rpush(api_calls_list, redis_timestamp)
  ensure
    redis.ltrim(api_calls_list, 0, limit - 1)
  end

  def last_call_timestamp
    @api_call_timestamp
  end

  def calculate_time_to_wait
    current_time = redis.time[0]
    time_passed = current_time - last_call_timestamp.to_i
    time_to_wait = wait_time - time_passed
    time_to_wait > 0 ? time_to_wait : 0
  end

  def reset_api_calls
    redis.multi do |r|
      r.del(api_calls_list)
    end
  end

  def calls_list
    redis_timestamp = redis.time[0]
    limit.times.map do |i|
      redis_timestamp
    end
  end

  def api_calls_list
    @api_calls_list ||= "api-calls:shop:#{shop_id}:list"
  end

  def redis
    Thread.current[:redis] ||= Redis.new(db: $redis_db_number)
  end
end
The way I use it is like this:
synchronizer = Synchronizer.new(shop_id: shop_id, queue_name: 'shopify_queue', limit: 2, wait_time: 1)
# This is called once when the process starts, i.e. it's not called by the jobs themselves but by the app from where the process is kicked off.
synchronizer.set_api_calls # This populates the api_calls_list with 2 timestamps, which are used to know when the last API call was sent.
Then, when a job wants to make a call:
synchronizer.synchronize_api_call do
  # make the call
end
The problem
The problem with this is that if for some reason a job fails to return the api_call it took to the api_calls_list, that job and the other jobs will be stuck forever, or until we notice and call set_api_calls again. The problem doesn't affect only that particular shop but the other shops as well, because the Sidekiq workers are shared between all the shops using our app. It sometimes happens that we don't notice until a user calls us, and we find that a job was stuck for many hours when it should have finished in a few minutes.
The Question
I just realised lately that Redis is not the best tool for shared locking. So I am asking: is there any other good tool for this job? If not in the Ruby world, I'd like to learn from others as well. I'm interested in the techniques as well as the tools, so every bit helps.
You may want to restructure your code and create a micro-service to process the API calls, which will use a local locking mechanism and force your workers to wait on the socket. It comes with the added complexity of maintaining the micro-service. But if you're in a hurry then Ent-Rate-Limiting looks cool too.
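As a rough illustration of the "local locking mechanism" part, here is a minimal in-process sliding-window limiter built on Mutex. Everything in it (the LocalRateLimiter name, the limit/period parameters, the injectable clock and sleeper) is an assumption made for this sketch, not code from the question or from any gem. It only coordinates threads inside one process, which is why it would have to live inside the single micro-service rather than in each Sidekiq process.

```ruby
# A minimal sliding-window limiter: at most `limit` calls per `period` seconds.
# Hypothetical sketch; works only for threads inside a single process.
class LocalRateLimiter
  def initialize(limit:, period: 1.0, clock: nil, sleeper: nil)
    @limit = limit
    @period = period
    # Clock and sleeper are injectable so the class can be tested without waiting.
    @clock = clock || -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
    @sleeper = sleeper || ->(seconds) { sleep(seconds) }
    @mutex = Mutex.new
    @stamps = [] # start times of the most recent calls, oldest first
  end

  # Blocks until a slot is free, then runs the block and returns its value.
  def throttle
    @mutex.synchronize do
      loop do
        now = @clock.call
        # Forget calls that have fallen out of the window.
        @stamps.shift while @stamps.any? && now - @stamps.first >= @period
        break if @stamps.size < @limit

        # Wait until the oldest call leaves the window. Other threads queue
        # on the mutex meanwhile, which is the intended behaviour.
        @sleeper.call(@period - (now - @stamps.first))
      end
      @stamps << @clock.call
    end
    yield
  end
end

limiter = LocalRateLimiter.new(limit: 2, period: 1.0)
limiter.throttle { puts 'calling the API' }
```

Because slots expire by time instead of being handed back by the caller, a job that crashes mid-call cannot leave the others waiting forever, which is the failure mode described in the question.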

How to DRY a list of functions in ruby that are differ only by a single line of code?

I have a User model in a ROR application that has multiple methods like this
#getClient() returns an object that knows how to find certain info for a date
#processHeaders() is a function that processes output and updates some values in the database
#refreshToken() is function that is called when an error occurs when requesting data from the object returned by getClient()
def transactions_on_date(date)
if blocked?
# do something
else
begin
output = getClient().transactions(date)
processHeaders(output)
return output
rescue UnauthorizedError => ex
refresh_token()
output = getClient().transactions(date)
process_fitbit_rate_headers(output)
return output
end
end
end
def events_on_date(date)
if blocked?
# do something
else
begin
output = getClient().events(date)
processHeaders(output)
return output
rescue UnauthorizedError => ex
refresh_token()
output = getClient().events(date)
processHeaders(output)
return output
end
end
end
I have several functions in my User class that look exactly the same. The only difference among these functions is the line output = getClient().something(date). Is there a way to make this code cleaner, so that I do not have a repetitive list of functions?
The answer is usually passing in a block and doing it functional style:
def handle_blocking(date)
if blocked?
# do something
else
begin
output = yield(date)
processHeaders(output)
output
rescue UnauthorizedError => ex
refresh_token
output = yield(date)
process_fitbit_rate_headers(output)
output
end
end
end
Then you call it this way:
handle_blocking(date) do |date|
getClient.something(date)
end
That allows a lot of customization. The yield call executes the block of code you've supplied and passes in the date argument to it.
The process of DRYing up your code often involves looking for patterns and boiling them down to useful methods like this. Using a functional approach can keep things clean.
Yes, you can use Object#send: getClient().send(:method_name, date).
BTW, getClient is not a proper Ruby method name. It should be get_client.
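A small self-contained sketch of the send approach, under some assumptions: FakeClient is a made-up stand-in for whatever get_client returns, on_date and CLIENT_CALLS are hypothetical names, and public_send is used as the variant of send that respects method visibility. The whitelist is there because dispatching on an arbitrary symbol would otherwise let callers invoke any method on the client.

```ruby
# Hypothetical stand-in for the object returned by get_client.
class FakeClient
  def transactions(date)
    "transactions for #{date}"
  end

  def events(date)
    "events for #{date}"
  end
end

class User
  # Only these client calls may be dispatched dynamically.
  CLIENT_CALLS = %i[transactions events].freeze

  def get_client
    FakeClient.new
  end

  # One method replaces transactions_on_date, events_on_date, and so on.
  def on_date(kind, date)
    raise ArgumentError, "unknown call: #{kind}" unless CLIENT_CALLS.include?(kind)

    get_client.public_send(kind, date)
  end
end

puts User.new.on_date(:events, '2020-01-01') # prints "events for 2020-01-01"
```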
How about a combination of both answers:
class User
  def method_missing(sym, *args)
    m_name = sym.to_s
    if m_name.end_with?('_on_date')
      prop = m_name.split('_').first.to_sym
      handle_blocking(args.first) { getClient().send(prop, args.first) }
    else
      super(sym, *args)
    end
  end

  # m_name is local to method_missing, so derive the name from sym here.
  def respond_to?(sym, include_private = false)
    sym.to_s.end_with?('_on_date') || super(sym, include_private)
  end

  def handle_blocking(date)
    # see other answer
  end
end
Then you can call "transactions_on_date", "events_on_date", "foo_on_date" and it would work.
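A variation worth considering: Ruby provides respond_to_missing? as the hook to pair with method_missing, so that respond_to? and method(:name) both stay consistent without overriding respond_to? itself. The sketch below is standalone; DateQueryUser and the placeholder return string are made up for illustration, and the real version would call handle_blocking and the client as in the answer above.

```ruby
class DateQueryUser
  SUFFIX = '_on_date'

  def method_missing(name, *args)
    return super unless name.to_s.end_with?(SUFFIX)

    resource = name.to_s.delete_suffix(SUFFIX) # e.g. "transactions"
    # Placeholder for the real client call wrapped in handle_blocking.
    "would fetch #{resource} for #{args.first}"
  end

  # Keeps respond_to? and method(:...) in sync with method_missing.
  def respond_to_missing?(name, include_private = false)
    name.to_s.end_with?(SUFFIX) || super
  end
end

user = DateQueryUser.new
puts user.events_on_date('2020-01-01')   # prints "would fetch events for 2020-01-01"
puts user.respond_to?(:events_on_date)   # prints "true"
puts user.respond_to?(:unrelated)        # prints "false"
```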

How to test the number of database calls in Rails

I am creating a REST API in rails. I'm using RSpec. I'd like to minimize the number of database calls, so I would like to add an automatic test that verifies the number of database calls being executed as part of a certain action.
Is there a simple way to add that to my test?
What I'm looking for is some way to monitor/record the calls that are being made to the database as a result of a single API call.
If this can't be done with RSpec but can be done with some other testing tool, that's also great.
The easiest thing in Rails 3 is probably to hook into the notifications api.
This subscriber
class SqlCounter< ActiveSupport::LogSubscriber
def self.count= value
Thread.current['query_count'] = value
end
def self.count
Thread.current['query_count'] || 0
end
def self.reset_count
result, self.count = self.count, 0
result
end
def sql(event)
self.class.count += 1
puts "logged #{event.payload[:sql]}"
end
end
SqlCounter.attach_to :active_record
will print every executed sql statement to the console and count them. You could then write specs such as
expect do
# do stuff
end.to change(SqlCounter, :count).by(2)
You'll probably want to filter out some statements, such as ones starting/committing transactions or the ones active record emits to determine the structures of tables.
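A possible shape for that filter, written as a plain predicate over the event payload so it can be tried without Rails. QueryFilter and countable? are made-up names. The :sql and :name keys are the ones the subscriber already reads; the :cached key and the "SCHEMA" / "TRANSACTION" names are assumptions about what your Rails version puts in the payload, so inspect a few real payloads before relying on them.

```ruby
module QueryFilter
  # Assumed payload names for schema lookups and transaction statements.
  IGNORED_NAMES = %w[SCHEMA TRANSACTION].freeze
  # Fallback for versions that do not name transaction statements.
  IGNORED_SQL = /\A\s*(BEGIN|COMMIT|ROLLBACK|SAVEPOINT|RELEASE SAVEPOINT)\b/i.freeze

  # `payload` is the hash available as event.payload inside the subscriber.
  def self.countable?(payload)
    return false if payload[:cached] # served from the query cache (assumed key)
    return false if IGNORED_NAMES.include?(payload[:name])

    !IGNORED_SQL.match?(payload[:sql].to_s)
  end
end

p QueryFilter.countable?({ name: 'User Load', sql: 'SELECT * FROM users' }) # prints true
p QueryFilter.countable?({ name: nil, sql: 'BEGIN' })                       # prints false
```

Inside the subscriber's sql(event) method, the increment would then become conditional on QueryFilter.countable?(event.payload).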
You may be interested in using explain. But that won't be automatic. You will need to analyse each action manually. But maybe that is a good thing, since the important thing is not the number of db calls, but their nature. For example: Are they using indexes?
Check this:
http://weblog.rubyonrails.org/2011/12/6/what-s-new-in-edge-rails-explain/
Use the db-query-matchers gem.
expect { subject.make_one_query }.to make_database_queries(count: 1)
Fredrick's answer worked great for me, but in my case, I also wanted to know the number of calls for each ActiveRecord class individually. I made some modifications and ended up with this in case it's useful for others.
class SqlCounter< ActiveSupport::LogSubscriber
# Returns the number of database "Loads" for a given ActiveRecord class.
def self.count(clazz)
name = clazz.name + ' Load'
Thread.current['log'] ||= {}
Thread.current['log'][name] || 0
end
# Returns a list of ActiveRecord classes that were counted.
def self.counted_classes
log = Thread.current['log']
loads = log.keys.select {|key| key =~ /Load$/ }
loads.map { |key| Object.const_get(key.split.first) }
end
def self.reset_count
Thread.current['log'] = {}
end
def sql(event)
name = event.payload[:name]
Thread.current['log'] ||= {}
Thread.current['log'][name] ||= 0
Thread.current['log'][name] += 1
end
end
SqlCounter.attach_to :active_record
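expect do
  # do stuff
end.to change { SqlCounter.count(SomeModel) }.by(2) # SomeModel: whichever ActiveRecord class you are counting loads for, since count now takes a class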

Logging Search Results in a Rails Application

We're interested in logging and computing the number of times an item comes up in search or on a list page. With 50k unique visitors a day, we're expecting we could produce 3-4 million 'impressions' per day, which isn't a terribly high amount, but one we'd like to architect well.
We don't need to read this data in real time, but would like to be able to generate daily totals and analyze trends, etc. Similar to a business analytics tool.
We're planning to do this with an Ajax post after the page is rendered - this will allow us to count results even if those results are cached. We can do this in a single post per page, to send a comma delimited list of ids and their positions on the page.
I am hoping there is some sort of design pattern/gem/blog post about this that would help me avoid the common first-timer mistakes that may come up. I also don't really have much experience logging or reading logs.
My current strategy - make something to write events to a log file, and a background job to tally up the results at the end of the day and put the results back into mysql.
Ok, I have three approaches for you:
1) Queues
In your AJAX Handler, write the simplest method possible (use a Rack Middleware or Rails Metal) to push the query params to a queue. Then, poll the queue and gather the messages.
Queue pushes from a rack middleware are blindingly fast. We use this on a very high traffic site for logging of similar data.
An example rack middleware is below (extracted from our app; it can handle a request in <2ms or so):
class TrackingMiddleware
  CACHE_BUSTER = {"Cache-Control" => "no-cache, no-store, max-age=0, must-revalidate", "Pragma" => "no-cache", "Expires" => "Fri, 29 Aug 1997 02:14:00 EST"}
  IMAGE_RESPONSE_HEADERS = CACHE_BUSTER.merge("Content-Type" => "image/gif").freeze
  IMAGE_RESPONSE_BODY = [File.open(Rails.root + "public/images/tracker.gif").read].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    if env["PATH_INFO"] =~ %r{^/track\.gif}
      request = Rack::Request.new(env)
      YOUR_QUEUE.push([Time.now, request.GET.symbolize_keys])
      # Rack expects [status, headers, body], in that order.
      [200, IMAGE_RESPONSE_HEADERS, IMAGE_RESPONSE_BODY]
    else
      @app.call(env)
    end
  end
end
For the queue I'd recommend starling, I've had nothing but good times with it.
On the parsing end, I would use the super-poller toolkit, but I would say that, I wrote it.
2) Logs
Pass all the params along as query params to a static file (/1x1.gif?foo=1&bar=2&baz=3).
This will not hit the rails stack and will be blindingly fast.
When you need the data, just parse the log files!
This is the best scaling home brew approach.
3) Google Analytics
Why handle the load when google will do it for you? You would be surprised at how good google analytics is, before you home brew anything, check it out!
This will scale infinitely, because google buys servers faster than you do.
I could rant on this for ages, but I have to go now. Hope this helps!
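For the log-parsing route in approach 2, here is a rough sketch of pulling the tracking parameters back out of a web server access log. It assumes log lines in the common/combined format, where the request appears as "GET /path?query HTTP/1.1"; the pattern, the tracking_params name, and the sample line are all made up for illustration, so adjust them to whatever your server actually writes.

```ruby
require 'uri'

# Matches requests for the tracking pixel and captures the raw query string.
# Assumes the common/combined access log format.
REQUEST_PATTERN = %r{"GET /1x1\.gif\?([^ "]*) HTTP/[\d.]+"}.freeze

# Returns a hash of query params for a tracking hit, or nil for other lines.
def tracking_params(line)
  match = REQUEST_PATTERN.match(line)
  return nil unless match

  URI.decode_www_form(match[1]).to_h
end

line = '127.0.0.1 - - [10/Oct/2020:13:55:36 +0000] ' \
       '"GET /1x1.gif?foo=1&bar=2&baz=3 HTTP/1.1" 200 43'
p tracking_params(line)
```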
Depending on the action required to list items, you might be able to do it in the controller and save yourself a round trip. You can do it with an after_filter, to make the addition unobtrusive.
This only works if all actions that list items you want to log require parameters. This is because page caching ignores GET requests with parameters.
Assuming you only want to log search data on the search action:
class ItemsController < ApplicationController
  after_filter :log_searches, :only => :search

  def log_searches
    @items.each do |item|
      # write to log here
    end
  end
  ...
  # rest of controller remains unchanged
  ...
end
Otherwise you're right on track with the AJAX, and an onload remote function.
As for processing the logs, you could use a rake task run by a cron job to collect statistics, and possibly update items for a popularity rating.
Either way you will want to read up on the Ruby Logging class. Learning about cron jobs and rake tasks wouldn't hurt either.
This is what I ultimately did - it was enough for our use for now, and with some simple benchmarking, I feel OK about it. We'll be watching to see how it does in production before we expose the results to our customers.
The components:
class EventsController < ApplicationController
def create
logger = Logger.new("#{RAILS_ROOT}/log/impressions/#{Date.today}.log")
logger.info "#{DateTime.now.strftime} #{params[:ids]}" unless params[:ids].blank?
render :nothing => true
end
end
This is called from an ajax call in the site layout...
<% javascript_tag do %>
var list = '';
$$('div.item').each(function(item) { list += item.id + ','; });
<%= remote_function(:url => { :controller => :events, :action => :create}, :with => "'ids=' + list" ) %>
<% end %>
Then I made a rake task to import these rows of comma delimited ids into the db. This is run the following day:
desc "Calculate impressions"
task :count_impressions => :environment do
date = ENV['DATE'] || (Date.today - 1).to_s # defaults to yesterday (yyyy-mm-dd)
file = File.new("log/impressions/#{date}.log", "r")
item_impressions = {}
while (line = file.gets)
ids_string = line.split(' ')[1]
next unless ids_string
ids = ids_string.split(',')
ids.each {|i| item_impressions[i] ||= 0; item_impressions[i] += 1 }
end
item_impressions.keys.each do |id|
ActiveRecord::Base.connection.execute "insert into item_stats(item_id, impression_count, collected_on) values('#{id}',#{item_impressions[id]},'#{date}')", 'Insert Item Stats'
end
file.close
end
One thing to note: the logger variable is declared in the controller action, not in environment.rb as you would normally do with a logger. I benchmarked this: 10000 writes took about 20 seconds, averaging about 2 milliseconds a write. With the file name in environment.rb, it took about 14 seconds. We made this trade-off so we could dynamically determine the file name, which is an easy way to switch files at midnight.
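For anyone who wants to repeat that comparison, a benchmark along these lines should do it with only the standard library. This is a sketch, not the benchmark I actually ran: the time_logger_writes name, the file names, and the write count are made up, and it closes each per-call logger (which the controller above does not) so the script doesn't run out of file handles.

```ruby
require 'benchmark'
require 'logger'
require 'tmpdir'

# Times `writes` log lines written two ways: a new Logger per write (as in
# the controller action) versus one shared Logger. Returns seconds for each.
def time_logger_writes(dir, writes)
  per_call_path = File.join(dir, 'per_call.log')
  shared_path = File.join(dir, 'shared.log')

  per_call = Benchmark.realtime do
    writes.times do |i|
      logger = Logger.new(per_call_path) # reopens the file on every write
      logger.info("ids=#{i}")
      logger.close
    end
  end

  shared_logger = Logger.new(shared_path)
  shared = Benchmark.realtime do
    writes.times { |i| shared_logger.info("ids=#{i}") }
  end
  shared_logger.close

  { per_call: per_call, shared: shared }
end

Dir.mktmpdir do |dir|
  result = time_logger_writes(dir, 1_000)
  puts format('per-call: %.3fs, shared: %.3fs', result[:per_call], result[:shared])
end
```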
Our main concern at this point: we have no idea how many different items will be counted per day, i.e. we don't know how long the tail is. This will determine how many rows are added to the db each day. We expect we'll need to limit how far back we keep daily reports and will roll up results even further at that point.
