How to continue indexing documents in Elasticsearch (Rails)? - ruby-on-rails

So I ran this command rake environment elasticsearch:import:model CLASS='AutoPartsMapper' FORCE=true to index documents in Elasticsearch. In my database I have 10,000,000 records =) ... it takes (I think) one day to index all of them. While indexing was running my computer turned off (I had indexed 2,000,000 documents). Is it possible to continue indexing from where it stopped?

If you use Rails 4.2+, you can use Active Job to schedule the indexing and leave it running. First, generate a job:
bin/rails generate job elastic_search_index
This will give you a class with a perform method:
class ElasticSearchIndexJob < ApplicationJob
  def perform
    # Implement the indexing here
    AutoPartsMapper.__elasticsearch__.create_index! force: true
    AutoPartsMapper.__elasticsearch__.import
  end
end
Set Sidekiq as your Active Job adapter. A minimal sketch of the config change (assuming a standard Rails setup with Sidekiq already in your Gemfile):
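# config/application.rb
config.active_job.queue_adapter = :sidekiq
Then, from the console, enqueue the job with: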
ElasticSearchIndexJob.perform_later
This will enqueue the job and execute it as soon as a worker is free, while giving you your console back. You can leave it running and check on the process from bash later:
ps aux | grep side
This will give you something like: sidekiq 4.1.2 app [1 of 12 busy]
Have a look at this post, which explains the integration in more detail: http://ruby-journal.com/how-to-integrate-sidekiq-with-activejob/
Hope it helps.

There is no such functionality in elasticsearch-rails AFAIK, but you could write a simple task to do it.
namespace :es do
  task :populate, [:start_id] => :environment do |_, args|
    start_id = args[:start_id].to_i

    # find_each already iterates in primary-key order, so no explicit order is needed
    AutoPartsMapper.where('id > ?', start_id).find_each do |record|
      puts "Processing record ##{record.id}"
      record.__elasticsearch__.index_document
    end
  end
end
Start it with bundle exec rake es:populate[<start_id>] passing the id of the record from which to start the next batch.
Note that this is a simplistic solution which will be much slower than batch indexing.
UPDATE
Here is a batch indexing task. It is much faster and automatically detects the record from which to continue. It does make an assumption that previously imported records were processed in increasing id order and without gaps. I haven't tested it but most of the code is from a production system.
namespace :es do
  task populate_auto: :environment do
    start_id = get_max_indexed_id

    # find_in_batches already iterates in primary-key order
    AutoPartsMapper.where('id > ?', start_id).find_in_batches(batch_size: 1000) do |records|
      elasticsearch_bulk_index(records)
    end
  end

  def get_max_indexed_id
    AutoPartsMapper.search(aggs: { max_id: { max: { field: :id } } }, size: 0)
                   .response[:aggregations][:max_id][:value].to_i
  end

  def elasticsearch_bulk_index(records)
    return if records.empty?
    klass = records.first.class
    klass.__elasticsearch__.client.bulk({
      index: klass.__elasticsearch__.index_name,
      type: klass.__elasticsearch__.document_type,
      body: elasticsearch_records_to_index(records)
    })
  end

  def elasticsearch_records_to_index(records)
    records.map do |record|
      payload = { _id: record.id, data: record.as_indexed_json }
      { index: payload }
    end
  end
end
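Run it with bundle exec rake es:populate_auto. Because the starting id comes from the max-id aggregation, re-running the task after an interruption resumes from wherever the previous run stopped.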

Related

How to get the limit of my sidekiq worker?

I have 200,000 users in my database, and I need to iterate through each record to process something.
So I have a rake task that iterates over each user; the main logic lives in a worker. Now I want to know the limit on how many workers can run simultaneously. If the limit is 50,000, then I can divide my users into 4 sets and call the worker separately for each set.
task:
namespace :users do
  task data: :environment do
    confirmed_users = User.where('confirmed_at IS NOT NULL').where('id <= 50000')
    confirmed_users.each do |user|
      MyWorker.perform_async(user.id)
    end

    confirmed_users = User.where('confirmed_at IS NOT NULL').where('id > 50000 and id <= 100000')
    confirmed_users.each do |user|
      MyWorker.perform_async(user.id)
    end

    confirmed_users = User.where('confirmed_at IS NOT NULL').where('id > 100000 and id <= 150000')
    confirmed_users.each do |user|
      MyWorker.perform_async(user.id)
    end

    confirmed_users = User.where('confirmed_at IS NOT NULL').where('id > 150000')
    confirmed_users.each do |user|
      MyWorker.perform_async(user.id)
    end
  end
end
If I knew Sidekiq's limit, I could build the user sets dynamically. Also, is this the correct way to complete the process in less time, or is there a way to process all my records faster?
Sidekiq only processes as many jobs concurrently as you have workers/threads. The rest will be placed in the queue and the queue is practically unlimited. No issues with 200k jobs.
Your issue probably comes from the slowness of querying for 200k users with a single SQL query and having to keep the whole result in memory while creating jobs from it.
Use find_each to tell Rails to fetch the records in batches and yield them one by one.
namespace :users do
  task data: :environment do
    User.where('confirmed_at IS NOT NULL').find_each do |user|
      MyWorker.perform_async(user.id)
    end
  end
end
However, since you only need the id and not the entire user object, you can also skip instantiating the objects to speed it up further.
User.where('confirmed_at IS NOT NULL').in_batches.each do |batch|
  batch.pluck(:id).each do |id|
    MyWorker.perform_async(id)
  end
end
And if that is still not fast enough, there is Sidekiq::Client.push_bulk, which makes only one request to Redis per batch. You might need to adjust the batch size here.
User.where('confirmed_at IS NOT NULL').in_batches.each do |batch|
  args = batch.pluck(:id).map { |id| [id] } # args is [[1], [2], [3], ...]
  Sidekiq::Client.push_bulk('class' => MyWorker, 'args' => args)
end
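If Redis round-trips are still the bottleneck, in_batches accepts an of: option to control the batch size; a sketch (the 5_000 figure is an assumption to tune for your workload):
User.where('confirmed_at IS NOT NULL').in_batches(of: 5_000).each do |batch|
  args = batch.pluck(:id).map { |id| [id] }
  Sidekiq::Client.push_bulk('class' => MyWorker, 'args' => args)
end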

Rake task for creating database records for all existing ActiveStorage variants

In Rails 6.1, ActiveStorage creates database records for all variants when they're loaded for the first time: https://github.com/rails/rails/pull/37901
I'd like to enable this, but since I have tens of thousands of files in my production Rails app, it'd be problematic (and presumably slow) to have users creating so many database records as they browse the site. Is there a way to write a Rake task that iterates through every attachment in my database, generates the variants, and saves them in the database?
I'd run that once, after enabling the new active_storage.track_variants config, and then any newly-uploaded files would be saved when they're loaded for the first time.
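(For reference, enabling that config is a one-liner; this sketch assumes it lives in config/application.rb:)
config.active_storage.track_variants = true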
Thanks for the help!
This is the Rake task I ended up creating for this. The Parallel stuff can be removed if you have a smaller dataset, but I found that with 70k+ variants it was intolerably slow when doing it without any parallelization. You can also ignore the progress bar-related code :)
Essentially, I just take all the models that have an attachment (I do this manually; you could do it in a more dynamic way if you have a ton of attachments), and then filter out the ones that are not variable. Then I go through each attachment, generate a variant for each size I've defined, and call process on it to force it to be saved to the database.
Make sure to catch MiniMagick (or vips, if you prefer) errors in the task so that a bad image file doesn't break everything.
# Rails 6.1 changes the way ActiveStorage works so that variants are
# tracked in the database. The intent of this task is to create the
# necessary variants for all game covers and user avatars in our database.
# This way, the user isn't creating dozens of variant records as they
# browse the site. We want to create them ahead-of-time, when we deploy
# the change to track variants.
namespace 'active_storage:vglist:variants' do
  require 'ruby-progressbar'
  require 'parallel'

  desc "Create all variants for covers and avatars in the database."
  task create: :environment do
    games = Game.joins(:cover_attachment)
    # Only attempt to create variants if the cover is able to have variants.
    games = games.filter { |game| game.cover.variable? }

    puts 'Creating game cover variants...'

    # Use the configured max number of threads, with 2 leftover for web requests.
    # Clamp it to 1 if the configured max threads is 2 or less for whatever reason.
    thread_count = [(ENV.fetch('RAILS_MAX_THREADS', 5).to_i - 2), 1].max

    games_progress_bar = ProgressBar.create(
      total: games.count,
      format: "\e[0;32m%c/%C |%b>%i| %e\e[0m"
    )

    # Disable logging in production to prevent log spam.
    Rails.logger.level = 2 if Rails.env.production?

    Parallel.each(games, in_threads: thread_count) do |game|
      ActiveRecord::Base.connection_pool.with_connection do
        begin
          [:small, :medium, :large].each do |size|
            game.sized_cover(size).process
          end
        # Rescue MiniMagick errors if they occur so that they don't block the
        # task from continuing.
        rescue MiniMagick::Error => e
          games_progress_bar.log "ERROR: #{e.message}"
          games_progress_bar.log "Failed on game ID: #{game.id}"
        end
        games_progress_bar.increment
      end
    end

    games_progress_bar.finish unless games_progress_bar.finished?

    users = User.joins(:avatar_attachment)
    # Only attempt to create variants if the avatar is able to have variants.
    users = users.filter { |user| user.avatar.variable? }

    puts 'Creating user avatar variants...'

    users_progress_bar = ProgressBar.create(
      total: users.count,
      format: "\e[0;32m%c/%C |%b>%i| %e\e[0m"
    )

    Parallel.each(users, in_threads: thread_count) do |user|
      ActiveRecord::Base.connection_pool.with_connection do
        begin
          [:small, :medium, :large].each do |size|
            user.sized_avatar(size).process
          end
        # Rescue MiniMagick errors if they occur so that they don't block the
        # task from continuing.
        rescue MiniMagick::Error => e
          users_progress_bar.log "ERROR: #{e.message}"
          users_progress_bar.log "Failed on user ID: #{user.id}"
        end
        users_progress_bar.increment
      end
    end

    users_progress_bar.finish unless users_progress_bar.finished?
  end
end
This is what the sized_cover looks like in game.rb:
def sized_cover(size)
  width, height = COVER_SIZES[size]
  cover&.variant(
    resize_to_limit: [width, height]
  )
end
sized_avatar is pretty much the same thing.
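For context, a sketch of what the pieces referenced above might look like; the COVER_SIZES/AVATAR_SIZES values here are assumptions, not from the original post:
# app/models/game.rb (illustrative dimensions only)
COVER_SIZES = {
  small: [200, 300],
  medium: [400, 600],
  large: [600, 900]
}.freeze

# app/models/user.rb -- mirrors sized_cover
def sized_avatar(size)
  width, height = AVATAR_SIZES[size]
  avatar&.variant(
    resize_to_limit: [width, height]
  )
end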

How can I count the number of accesses/queries to the database through Mongoid?

I'm using Mongoid in a Rails project. To improve the performance of large queries, I'm using the includes method to eager-load the relationships.
I would like to know if there is an easy way to count the real number of queries performed by a block of code so that I can check if my includes really reduced the number of DB accesses as expected. Something like:
# It will perform a large query to gather data from companies and their relationships
count = Mongoid.count_queries do
  Company.to_csv
end
puts count # Number of DB accesses
I want to use this feature to add RSpec tests that prove my query remains efficient after changes (e.g., when adding data from a new relationship). In Python's Django framework, for instance, one can use the assertNumQueries method to this end.
Checking on rubygems.org didn't yield anything that seems to do what you want.
You might be better off looking into app performance tools like New Relic, Scout, or DataDog. You may be able to get some basic benchmarking specs out of the gate with https://github.com/piotrmurach/rspec-benchmark
I just implemented this feature to count Mongo queries in my RSpec suite, as a small module using Mongo's Command Monitoring.
It can be used like this:
expect { code }.to change { finds("users") }.by(3)
expect { code }.to change { updates("contents") }.by(1)
expect { code }.not_to change { inserts }
Or:
MongoSpy.flush
# ..code..
expect(MongoSpy.queries).to match(
  "find" => { "users" => 1, "contents" => 1 },
  "update" => { "users" => 1 }
)
Here is the Gist (ready to copy) for the last up-to-date version: https://gist.github.com/jarthod/ab712e8a31798799841c5677cea3d1a0
And here is the current version:
module MongoSpy
  module Helpers
    %w(find delete insert update).each do |op|
      define_method(op.pluralize) { |ns = nil|
        ns ? MongoSpy.queries[op][ns] : MongoSpy.queries[op].values.sum
      }
    end
  end

  class << self
    def queries
      @queries ||= Hash.new { |h, k| h[k] = Hash.new(0) }
    end

    def flush
      @queries = nil
    end

    def started(event)
      op = event.command.keys.first # find, update, delete, createIndexes, etc.
      ns = event.command[op]        # collection name
      return unless ns.is_a?(String)
      queries[op][ns] += 1
    end

    def succeeded(_); end
    def failed(_); end
  end
end

Mongo::Monitoring::Global.subscribe(Mongo::Monitoring::COMMAND, MongoSpy)

RSpec.configure do |config|
  config.include MongoSpy::Helpers
end
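To keep the counts isolated between examples, you would likely also want to reset the spy before each spec; one way to wire that up (an assumption, not part of the gist):
RSpec.configure do |config|
  config.before(:each) { MongoSpy.flush }
end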
What you're looking for is command monitoring. With Mongoid and the Ruby Driver, you can create a custom command monitoring class that you can use to subscribe to all commands made to the server.
I've adapted this from the Command Monitoring Guide for the Mongo Ruby Driver.
For this particular example, make sure that your Rails app has the log level set to debug. You can read more about the Rails logger here.
The first thing you want to do is define a subscriber class. This is the class that tells your application what to do when the Mongo::Client performs commands against the database. Here is the example class from the documentation:
class CommandLogSubscriber
  include Mongo::Loggable

  # Called when a command is started
  def started(event)
    log_debug("#{prefix(event)} | STARTED | #{format_command(event.command)}")
  end

  # Called when a command finishes successfully
  def succeeded(event)
    log_debug("#{prefix(event)} | SUCCEEDED | #{event.duration}s")
  end

  # Called when a command terminates with a failure
  def failed(event)
    log_debug("#{prefix(event)} | FAILED | #{event.message} | #{event.duration}s")
  end

  private

  def logger
    Mongo::Logger.logger
  end

  def format_command(args)
    begin
      args.inspect
    rescue Exception
      '<Unable to inspect arguments>'
    end
  end

  def format_message(message)
    format("COMMAND | %s".freeze, message)
  end

  def prefix(event)
    "#{event.address.to_s} | #{event.database_name}.#{event.command_name}"
  end
end
(Make sure this class is auto-loaded in your Rails application.)
Next, you want to attach this subscriber to the client you use to perform commands.
subscriber = CommandLogSubscriber.new
Mongo::Monitoring::Global.subscribe(Mongo::Monitoring::COMMAND, subscriber)

# This is the name of the default client, but it's possible you've defined
# a client with a custom name in config/mongoid.yml
client = Mongoid::Clients.from_name('default')
client.subscribe(Mongo::Monitoring::COMMAND, subscriber)
Now, when Mongoid executes any commands against the database, those commands will be logged to your console.
# For example, if you have a model called Book
Book.create(title: "Narnia")
# => D, [2020-03-27T10:29:07.426209 #43656] DEBUG -- : COMMAND | localhost:27017 | mongoid_test_development.insert | STARTED | {"insert"=>"books", "ordered"=>true, "documents"=>[{"_id"=>BSON::ObjectId('5e7e0db3f8f498aa88b26e5d'), "title"=>"Narnia", "updated_at"=>2020-03-27 14:29:07.42239 UTC, "created_at"=>2020-03-27 14:29:07.42239 UTC}], "lsid"=>{"id"=><BSON::Binary:0x10600 type=uuid data=0xfff8a93b6c964acb...>}}
# => ...
You can modify the CommandLogSubscriber class to do something other than logging (such as incrementing a global counter).
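For example, a minimal counting subscriber along the same lines (the class name and counter are assumptions, not from the driver docs):
class CommandCounterSubscriber
  # Global counter incremented once per command sent to the server.
  # Note: a plain integer is not thread-safe; use a mutex or atomic if needed.
  class << self
    attr_accessor :count
  end
  self.count = 0

  def started(_event)
    self.class.count += 1
  end

  def succeeded(_event); end

  def failed(_event); end
end

subscriber = CommandCounterSubscriber.new
Mongo::Monitoring::Global.subscribe(Mongo::Monitoring::COMMAND, subscriber)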

Using Simple Scheduler Gem for Scheduling Tasks in a Rails App

I am trying to run a method that adds the response from an API call to the cache. I decided to use the simple_scheduler gem.
Below are snippets of the code that I am running.
# update_cache_job.rb
class UpdateCacheJob < ActiveJob::Base
  def perform
    QueuedJobs.new.update_cache
  end
end
And
# simple_scheduler.yml
# Global configuration options. The `queue_ahead` and `tz` options can also be set on each task.
queue_ahead: 120 # Number of minutes to queue jobs into the future
queue_name: "default" # The Sidekiq queue name used by SimpleScheduler::FutureJob
tz: "nil" # The application time zone will be used by default if not set

# Runs once every 2 minutes
simple_task:
  class: "UpdateCacheJob"
  every: "2.minutes"
And the method I have scheduled to run every 2 minutes:
class QueuedJobs
  include VariableHelper

  def initialize; end

  def update_cache
    @variables = obtain_software_development_table

    # First refresh the project reviews
    puts 'Updating reviews...'
    @records = Dashboard.new.obtain_projects_reviews.pluck(
      obtain_project_reviews_student_variable,
      obtain_project_reviews_id_variable,
      'Project'
    ).map { |student, id, project| { 'Student': student, 'ID': id, 'Project': project } }

    Rails.cache.write(
      :reviews,
      @records,
      expires_in: 15.minutes
    )

    @grouped_reviews = Rails.cache.read(
      :reviews
    ).group_by do |review|
      review[:Student]&.first
    end
    puts 'reviews refreshed.'

    # Then refresh the coding challenges submissions
    puts "Updating challenges submissions.."
    @all_required_submissions_columns = Dashboard.new.all_coding_challenges_submissions.all.map do |submission|
      {
        id: submission.id,
        'Student': submission[obtain_coding_chall_subm_student_var],
        'Challenge': submission[obtain_coding_chall_subm_challenge_var]
      }
    end

    @all_grouped_submissions = @all_required_submissions_columns.group_by { |challenge| challenge[:Student]&.first }

    Rails.cache.write(
      :challenges_submissions,
      @all_grouped_submissions,
      expires_in: 15.minutes
    )
    puts "challenges submissions refreshed."
  end
end
I have been able to reach these methods from the Rails console, but whenever I run rake simple_scheduler it just logs the first puts, and sometimes it does nothing at all.
What do I need to do here?

Ruby on Rails best way to update 100k records

I am in a situation where I have to update more than 100k records in the database in the most efficient way. Please see my code below:
namespace :order do
  desc "update confirmed at field for Payments::Order"
  task set_confirmed_at: :environment do
    puts "==> Updating confirmed_at for orders starts ...".blue
    Payments::Order.find_each(batch_size: 10000) do |order|
      order_action = order.actions.where("sender LIKE ?", "%ConfirmJob%").first if order.actions
      unless order_action.blank?
        order.update_attribute(:confirmed_at, order_action.created_at)
        puts "order id = #{order.id} has been updated.".green
      end
    end
    puts "== completed ==".blue
  end
end
Here I am breaking the records into batches of 10,000 each and then trying to update each record based on some conditions. Could anyone suggest a more efficient way to do the same task?
Thank you in advance!
You can try update_all:
Payments::Order.joins(:actions).where(Payments::OrderAction.arel_table[:sender].matches("%ConfirmJob%")).update_all("confirmed_at = actions.created_at")
So your code will look like this:
namespace :order do
  desc "update confirmed at field for Payments::Order"
  task set_confirmed_at: :environment do
    puts "==> Updating confirmed_at for orders starts ...".blue
    Payments::Order.joins(:actions).where(Payments::OrderAction.arel_table[:sender].matches("%ConfirmJob%")).update_all("confirmed_at = actions.created_at")
    puts "== completed ==".blue
  end
end
Update:
I've investigated the issue and found that bulk update with a joined table is a long-standing issue in Rails.
Since the SET part uses the string parameter as-is, I suggest adding a FROM clause there:
namespace :order do
  desc "update confirmed at field for Payments::Order"
  task set_confirmed_at: :environment do
    puts "==> Updating confirmed_at for orders starts ...".blue
    Payments::Order.joins(:actions).
      where(Payments::OrderAction.arel_table[:sender].matches("%ConfirmJob%")).
      update_all("confirmed_at = actions.created_at FROM actions")
    puts "== completed ==".blue
  end
end
You are doing Payments::Order.find_each, so your solution loops over every Payments::Order when you only want to loop over the ones whose actions.sender matches '%ConfirmJob%'. I would go with this solution:
Payments::Order
  .includes(:actions)
  .joins(:actions)
  .where("actions.sender LIKE ?", "%ConfirmJob%")
  .find_each do |order|
    order_action = order.actions.first
    order.update!(confirmed_at: order_action.created_at)
  end
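One design note: update! runs validations and callbacks for every row. If you don't need them for this backfill, update_columns(confirmed_at: order_action.created_at) writes directly to the database and avoids that overhead (whether that's safe depends on your model's callbacks).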
