Rake task for creating database records for all existing ActiveStorage variants - ruby-on-rails

In Rails 6.1, ActiveStorage creates database records for all variants when they're loaded for the first time: https://github.com/rails/rails/pull/37901
I'd like to enable this, but since I have tens of thousands of files in my production Rails app, it'd be problematic (and presumably slow) to have users creating so many database records as they browse the site. Is there a way to write a Rake task that'll iterate through every attachment in my database, and generate the variants and save them in the database?
I'd run that once, after enabling the new active_storage.track_variants config, and then any newly-uploaded files would be saved when they're loaded for the first time.
Thanks for the help!

This is the Rake task I ended up creating for this. The Parallel stuff can be removed if you have a smaller dataset, but I found that with 70k+ variants it was intolerably slow when doing it without any parallelization. You can also ignore the progress bar-related code :)
Essentially, I just take all the models that have an attachment (I do this manually, you could do it in a more dynamic way if you have a ton of attachments), and then filter the ones that are not variable. Then I go through each attachment and generate a variant for each size I've defined, and then call process on it to force it to be saved to the database.
Make sure to catch MiniMagick (or vips, if you prefer) errors in the task so that a bad image file doesn't break everything.
# Rails 6.1 changes the way ActiveStorage works so that variants are
# tracked in the database. The intent of this task is to create the
# necessary variants for all game covers and user avatars in our database.
# This way, the user isn't creating dozens of variant records as they
# browse the site. We want to create them ahead-of-time, when we deploy
# the change to track variants.
namespace 'active_storage:vglist:variants' do
require 'ruby-progressbar'
require 'parallel'
desc "Create all variants for covers and avatars in the database."
task create: :environment do
games = Game.joins(:cover_attachment)
# Only attempt to create variants if the cover is able to have variants.
games = games.filter { |game| game.cover.variable? }
puts 'Creating game cover variants...'
# Use the configured max number of threads, with 2 leftover for web requests.
# Clamp it to 1 if the configured max threads is 2 or less for whatever reason.
thread_count = [(ENV.fetch('RAILS_MAX_THREADS', 5).to_i - 2), 1].max
games_progress_bar = ProgressBar.create(
total: games.count,
format: "\e[0;32m%c/%C |%b>%i| %e\e[0m"
)
# Disable logging in production to prevent log spam.
Rails.logger.level = 2 if Rails.env.production?
Parallel.each(games, in_threads: thread_count) do |game|
ActiveRecord::Base.connection_pool.with_connection do
begin
[:small, :medium, :large].each do |size|
game.sized_cover(size).process
end
# Rescue MiniMagick errors if they occur so that they don't block the
# task from continuing.
rescue MiniMagick::Error => e
games_progress_bar.log "ERROR: #{e.message}"
games_progress_bar.log "Failed on game ID: #{game.id}"
end
games_progress_bar.increment
end
end
games_progress_bar.finish unless games_progress_bar.finished?
users = User.joins(:avatar_attachment)
# Only attempt to create variants if the avatar is able to have variants.
users = users.filter { |user| user.avatar.variable? }
puts 'Creating user avatar variants...'
users_progress_bar = ProgressBar.create(
total: users.count,
format: "\e[0;32m%c/%C |%b>%i| %e\e[0m"
)
Parallel.each(users, in_threads: thread_count) do |user|
ActiveRecord::Base.connection_pool.with_connection do
begin
[:small, :medium, :large].each do |size|
user.sized_avatar(size).process
end
# Rescue MiniMagick errors if they occur so that they don't block the
# task from continuing.
rescue MiniMagick::Error => e
users_progress_bar.log "ERROR: #{e.message}"
users_progress_bar.log "Failed on user ID: #{user.id}"
end
users_progress_bar.increment
end
end
users_progress_bar.finish unless users_progress_bar.finished?
end
end
This is what the sized_cover looks like in game.rb:
def sized_cover(size)
width, height = COVER_SIZES[size]
cover&.variant(
resize_to_limit: [width, height]
)
end
sized_avatar is pretty much the same thing.

Related

How can I count the number of accesses/queries to database through Mongoid?

I'm using the Mongoid in a Rails project. To improve the performance of large queries, I'm using the includes method to eager load the relationships.
I would like to know if there is an easy way to count the real number of queries performed by a block of code so that I can check if my includes really reduced the number of DB accesses as expected. Something like:
# It will perform a large query to gather data from companies and their relationships
count = Mongoid.count_queries do
Company.to_csv
end
puts count # Number of DB access
I want to use this feature to add Rspec tests to prove that my query remains efficient after changes (e.g; when adding data from a new relationship). In python's Django framework, for instance, one may use the assertNumQueries method to this end.
Checking on rubygems.org didn't yield anything that seems to do what you want.
You might be better off looking into app performance tools like New Relic, Scout, or DataDog. You may be able to get some out of the gate benchmarking specs with
https://github.com/piotrmurach/rspec-benchmark
I just implemented this feature to count mongo queries in my rspec suite in a small module using mongo Command Monitoring.
It can be used like this:
expect { code }.to change { finds("users") }.by(3)
expect { code }.to change { updates("contents") }.by(1)
expect { code }.not_to change { inserts }
Or:
MongoSpy.flush
# ..code..
expect(MongoSpy.queries).to match(
"find" => { "users" => 1, "contents" => 1 },
"update" => { "users" => 1 }
)
Here is the Gist (ready to copy) for the last up-to-date version: https://gist.github.com/jarthod/ab712e8a31798799841c5677cea3d1a0
And here is the current version:
module MongoSpy
module Helpers
%w(find delete insert update).each do |op|
define_method(op.pluralize) { |ns = nil|
ns ? MongoSpy.queries[op][ns] : MongoSpy.queries[op].values.sum
}
end
end
class << self
def queries
#queries ||= Hash.new { |h, k| h[k] = Hash.new(0) }
end
def flush
#queries = nil
end
def started(event)
op = event.command.keys.first # find, update, delete, createIndexes, etc.
ns = event.command[op] # collection name
return unless ns.is_a?(String)
queries[op][ns] += 1
end
def succeeded(_); end
def failed(_); end
end
end
Mongo::Monitoring::Global.subscribe(Mongo::Monitoring::COMMAND, MongoSpy)
RSpec.configure do |config|
config.include MongoSpy::Helpers
end
What you're looking for is command monitoring. With Mongoid and the Ruby Driver, you can create a custom command monitoring class that you can use to subscribe to all commands made to the server.
I've adapted this from the Command Monitoring Guide for the Mongo Ruby Driver.
For this particular example, make sure that your Rails app has the log level set to debug. You can read more about the Rails logger here.
The first thing you want to do is define a subscriber class. This is the class that tells your application what to do when the Mongo::Client performs commands against the database. Here is the example class from the documentation:
class CommandLogSubscriber
include Mongo::Loggable
# called when a command is started
def started(event)
log_debug("#{prefix(event)} | STARTED | #{format_command(event.command)}")
end
# called when a command finishes successfully
def succeeded(event)
log_debug("#{prefix(event)} | SUCCEEDED | #{event.duration}s")
end
# called when a command terminates with a failure
def failed(event)
log_debug("#{prefix(event)} | FAILED | #{event.message} | #{event.duration}s")
end
private
def logger
Mongo::Logger.logger
end
def format_command(args)
begin
args.inspect
rescue Exception
'<Unable to inspect arguments>'
end
end
def format_message(message)
format("COMMAND | %s".freeze, message)
end
def prefix(event)
"#{event.address.to_s} | #{event.database_name}.#{event.command_name}"
end
end
(Make sure this class is auto-loaded in your Rails application.)
Next, you want to attach this subscriber to the client you use to perform commands.
subscriber = CommandLogSubscriber.new
Mongo::Monitoring::Global.subscribe(Mongo::Monitoring::COMMAND, subscriber)
# This is the name of the default client, but it's possible you've defined
# a client with a custom name in config/mongoid.yml
client = Mongoid::Clients.from_name('default')
client.subscribe( Mongo::Monitoring::COMMAND, subscriber)
Now, when Mongoid executes any commands against the database, those commands will be logged to your console.
# For example, if you have a model called Book
Book.create(title: "Narnia")
# => D, [2020-03-27T10:29:07.426209 #43656] DEBUG -- : COMMAND | localhost:27017 | mongoid_test_development.insert | STARTED | {"insert"=>"books", "ordered"=>true, "documents"=>[{"_id"=>BSON::ObjectId('5e7e0db3f8f498aa88b26e5d'), "title"=>"Narnia", "updated_at"=>2020-03-27 14:29:07.42239 UTC, "created_at"=>2020-03-27 14:29:07.42239 UTC}], "lsid"=>{"id"=><BSON::Binary:0x10600 type=uuid data=0xfff8a93b6c964acb...>}}
# => ...
You can modify the CommandLogSubscriber class to do something other than logging (such as incrementing a global counter).

Why doesn't lock! stop others from updating?

This is my class:
class Plan < ActiveRecord::Base
def testing
self.with_lock do
update_columns(lock: true)
byebug
end
end
def testing2
self.lock!
byebug
end
end
I opened two rails consoles.
In first console:
p = Plan.create
=> (basically success)
p.id
=> 12
p.testing2
(byebug) # simulation of halting the execution,
(BYEBUG) # I just leave the rails console open and wait at here. I expect others won't be able to update p because I still got the lock.
On second console:
p = Plan.find(12)
=> (basically said found)
p.name = 'should not be able to be stored in database'
=> "should not be able to be stored in database"
p.save!
=> true # what????? Why can it update my object? It's lock in the other console!
lock! in testing2 doesn't lock while with_lock in testing does work. Can anybody explain why lock! doesn't work?
#lock! uses SELECT … FOR UPDATE to acquire a lock.
According to PostgreSQL doc.
FOR UPDATE causes the rows retrieved by the SELECT statement to be locked as though for update. This prevents them from being locked, modified or deleted by other transactions until the current transaction ends.
You need a transaction to keep holding a lock of a certain row.
Try
console1:
Plan.transaction{Plan.find(12).lock!; sleep 100.days}
console2:
p = Plan.find(12)
p.name = 'should not be able to be stored in database'
p.save
#with_lock acquire a transaction for you, so you don't need explicit transaction.
(This is PostgreSQL document. But I think other databases implement similar logic. )

Logging raw SQL errors in Rake Tasks

I'm using raw sql bulk updates (for performance reasons) in the context of a rake task. Something like the following:
update_sql = Book.connection.execute("UPDATE books AS b SET
stock = vs.stock,
promotion = vs.promotion,
sales = vs.sales
FROM (values #{values_string}) AS vs
(stock, promotion, sales) WHERE b.id = vs.id;")
While everything is "transparent" in local development, if this SQL fails in production during the execution of the rails task (for example because the promotion column is nil and the statement becomes invalid), no error is logged.
I can manually log this with catching the exception, like below, however some option that would allow for automatic logging would be better.
begin
...
rescue ActiveRecord::StatementInvalid => e
Rails.logger.fatal "Books update: ActiveRecord::StatementInvalid: "+ e.to_s
end
You can make your own custom class in your model folder:
app/models/custom_sql_logger.rb :
class CustomSqlLogger
def self.debug(msg=nil)
#custom_log ||= Logger.new("#{Rails.root}/log/custom_sql.log")
#custom_log.debug(msg) unless msg.nil?
end
end
Then go to the rake task where you would like to debug updated fields for example lib/task/calculate_avarages.rake and call your custom debugger:
CustomSqlLogger.debug "The field was successfully updated into DB"
Example from my project:
require 'rake'
task :calculate_averages => :environment do
products = Product.all
products.each do |product|
puts "Calculating average rating for #{product.name}..."
product.update_attribute(:average_rating, product.reviews.average("rating"))
CustomSqlLogger.debug "#{product.name} was susscefully updated into DB"
end
end
Custom debugger will create the new file custom_sql.log into log folder: log/custom_sql.log and saved all information there. Beware of a log file size after a while.

Speed up rake task by using typhoeus

So i stumbled across this: https://github.com/typhoeus/typhoeus
I'm wondering if this is what i need to speed up my rake task
Event.all.each do |row|
begin
url = urlhere + row.first + row.second
doc = Nokogiri::HTML(open(url))
doc.css('.table__row--event').each do |tablerow|
table = tablerow.css('.table__cell__body--location').css('h4').text
next unless table == row.eventvenuename
tablerow.css('.table__cell__body--availability').each do |button|
buttonurl = button.css('a')[0]['href']
if buttonurl.include? '/checkout/external'
else
row.update(row: buttonurl)
end
end
end
rescue Faraday::ConnectionFailed
puts "connection failed"
next
end
end
I'm wondering if this would speed it up, Or because i'm doing a .each it wouldn't?
If it would could you provide an example?
Sam
If you set up Typhoeus::Hydra to run parallel requests, you might be able to speed up your code, assuming that the Kernel#open calls are what's slowing you down. Before you optimize, you might want to run benchmarks to validate this assumption.
If it is true, and parallel requests would speed it up, you would need to restructure your code to load events in batches, build a queue of parallel requests for each batch, and then handle them after they execute. Here's some sketch code.
class YourBatchProcessingClass
def initialize(batch_size: 200)
#batch_size = batch_size
#hydra = Typhoeus::Hydra.new(max_concurrency: #batch_size)
end
def perform
# Get an array of records
Event.find_in_batches(batch_size: #batch_size) do |batch|
# Store all the requests so we can access their responses later.
requests = batch.map do |record|
request = Typhoeus::Request.new(your_url_build_logic(record))
#hydra.queue request
request
end
#hydra.run # Run requests in parallel
# Process responses from each request
requests.each do |request|
your_response_processing(request.response.body)
end
end
rescue WhateverError => e
puts e.message
end
private
def your_url_build_logic(event)
# TODO
end
def your_response_processing(response_body)
# TODO
end
end
# Run the service by calling this in your Rake task definition
YourBatchProcessingClass.new.perform
Ruby can be used for pure scripting, but it functions best as an object-oriented language. Decomposing your processing work into clear methods can help clarify your code and help you catch things like Tom Lord mentioned in the comments on your question. Also, instead of wrapping your whole script in a begin..rescue block, you can use method-level rescues as in #perform above, or just wrap #hydra.run.
As a note, .all.each is a memory hog, and is thus considered a bad solution to iterating over records: .all loads all of the records into memory before iterating over them with .each. To save memory, it's better to use .find_each or .find_in_batches, depending on your use case. See: http://api.rubyonrails.org/classes/ActiveRecord/Batches.html

How to continue indexing documents in elasticsearch(rails)?

So I ran this command rake environment elasticsearch:import:model CLASS='AutoPartsMapper' FORCE=true to index documents in elasticsearch.In my database I have 10 000 000 records=)...it takes (I think) one day to index this...When indexing was running my computer turned off...(I indexed 2 000 000 documents)Is it possible to continue indexing documents?
If you use rails 4.2+ you can use ActiveJob to schedule and leave it running. So, first generate it with this
bin/rails generate job elastic_search_index
This will give you class and method perform:
class ElasticSearchIndexJob < ApplicationJob
def perform
# impleement here indexing
AutoPartMapper.__elasticsearch__.create_index! force:true
AutoPartMapper.__elasticsearch__.import
end
end
Set the sidekiq as your active job provider and from console initiate this with:
ElasticSearchIndexJob.perform_later
This will set the active job and execute it on next free job but it will free your console. You can leave it running and check the process in bash later:
ps aux | grep side
this will give you something like: sidekiq 4.1.2 app[1 of 12 busy]
Have a look at this post that explains them
http://ruby-journal.com/how-to-integrate-sidekiq-with-activejob/
Hope it helps
There is no such functionality in elasicsearch-rails afaik but you could write a simple task to do that.
namespace :es do
task :populate, [:start_id] => :environment do |_, args|
start_id = args[:start_id].to_i
AutoPartsMapper.where('id > ?', start_id).order(:id).find_each do |record|
puts "Processing record ##{record.id}"
record.__elasticsearch__.index_document
end
end
end
Start it with bundle exec rake es:populate[<start_id>] passing the id of the record from which to start the next batch.
Note that this is a simplistic solution which will be much slower than batch indexing.
UPDATE
Here is a batch indexing task. It is much faster and automatically detects the record from which to continue. It does make an assumption that previously imported records were processed in increasing id order and without gaps. I haven't tested it but most of the code is from a production system.
namespace :es do
task :populate_auto => :environment do |_, args|
start_id = get_max_indexed_id
AutoPartsMapper.find_in_batches(batch_size: 1000).where('id > ?', start_id).order(:id) do |records|
elasticsearch_bulk_index(records)
end
end
def get_max_indexed_id
AutoPartsMapper.search(aggs: {max_id: {max: {field: :id }}}, size: 0).response[:aggregations][:max_id][:value].to_i
end
def elasticsearch_bulk_index(records)
return if records.empty?
klass = records.first.class
klass.__elasticsearch__.client.bulk({
index: klass.__elasticsearch__.index_name,
type: klass.__elasticsearch__.document_type,
body: elasticsearch_records_to_index(records)
})
end
def self.elasticsearch_records_to_index(records)
records.map do |record|
payload = { _id: record.id, data: record.as_indexed_json }
{ index: payload }
end
end
end

Resources