Detect which Sidekiq jobs are responsible for high memory usage

I run an app on Heroku and use Sidekiq as the job queue system. Recently, however, memory usage has constantly been around 90%-110%.
I already tried reducing the concurrency a little and scaling the number of workers, but without much success. Is there a way to detect which Sidekiq jobs are consuming so much memory?
We use New Relic to track our transactions, but I couldn't find this information on the platform.

I faced a similar situation where I needed to track the memory used by workers, and I came up with the following solution. I'm not sure if it solves your exact problem, but I hope it points you in the right direction.
I wrote a cron job that collects the currently running workers (not in real time) together with the process's memory usage and appends them to a CSV file. Below is the code.
class DataWorker
  include Sidekiq::Worker

  def perform
    file = File.new("sidekiq_profiling.csv", "a")
    memory_usage = gc_start
    workers = Sidekiq::Workers.new
    # Count how many instances of each worker class are currently running.
    worker_running_counts = workers.map { |pid, thrd, wrkr| wrkr["payload"]["class"] }
                                   .group_by { |cls| cls }
                                   .map { |k, v| { k => v.count } }
    datetime = DateTime.now
    worker_running_counts.each do |wc|
      file << "#{datetime},#{wc.keys[0]},#{wc.values[0]},#{memory_usage}\n"
    end
    file.close
  end

  # Resident set size of the current process, in bytes.
  def rss_usage
    `ps -o rss= -p #{Process.pid}`.chomp.to_i * 1024
  end

  # def gc_stats
  #   GC.stat.slice(:heap_available_slots, :heap_live_slots, :heap_free_slots)
  # end

  # Force a GC run first so the RSS reading reflects live objects.
  def gc_start
    GC.start
    # gc_stats.each do |key, value|
    #   puts "GC.#{key}: #{value.to_s(:delimited)}"
    # end
    rss_usage.to_s(:human_size, precision: 3)
  end

  # Schedule this worker to run every minute via sidekiq-cron.
  Sidekiq::Cron::Job.create(name: 'DataWorker', cron: '* * * * *', class: 'DataWorker')
end
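A variation on the same idea, if you want per-job rather than per-minute numbers, is a Sidekiq server middleware that logs the process RSS around every job. This is my own sketch, not part of the answer above, and with concurrency > 1 the delta is only a rough signal since all threads share one heap, but it tends to point at the worst offenders.

class MemoryProfilingMiddleware
  def call(worker, job, queue)
    before = rss_kb
    yield
  ensure
    Sidekiq.logger.info(
      "job=#{job['class']} queue=#{queue} rss_before=#{before}KB rss_after=#{rss_kb}KB"
    )
  end

  private

  # ps reports RSS in kilobytes on Linux and macOS.
  def rss_kb
    `ps -o rss= -p #{Process.pid}`.to_i
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add MemoryProfilingMiddleware
  end
end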


Rake task for creating database records for all existing ActiveStorage variants

In Rails 6.1, ActiveStorage creates database records for all variants when they're loaded for the first time: https://github.com/rails/rails/pull/37901
I'd like to enable this, but since I have tens of thousands of files in my production Rails app, it'd be problematic (and presumably slow) to have users creating so many database records as they browse the site. Is there a way to write a Rake task that'll iterate through every attachment in my database, and generate the variants and save them in the database?
I'd run that once, after enabling the new active_storage.track_variants config, and then any newly-uploaded files would be saved when they're loaded for the first time.
Thanks for the help!
This is the Rake task I ended up creating for this. The Parallel stuff can be removed if you have a smaller dataset, but I found that with 70k+ variants it was intolerably slow when doing it without any parallelization. You can also ignore the progress bar-related code :)
Essentially, I just take all the models that have an attachment (I do this manually; you could do it in a more dynamic way if you have a ton of attachments), and then filter out the ones that aren't variable. Then I go through each attachment, generate a variant for each size I've defined, and call process on it to force it to be saved to the database.
Make sure to catch MiniMagick (or vips, if you prefer) errors in the task so that a bad image file doesn't break everything.
# Rails 6.1 changes the way ActiveStorage works so that variants are
# tracked in the database. The intent of this task is to create the
# necessary variants for all game covers and user avatars in our database.
# This way, the user isn't creating dozens of variant records as they
# browse the site. We want to create them ahead-of-time, when we deploy
# the change to track variants.
namespace 'active_storage:vglist:variants' do
  require 'ruby-progressbar'
  require 'parallel'

  desc "Create all variants for covers and avatars in the database."
  task create: :environment do
    games = Game.joins(:cover_attachment)
    # Only attempt to create variants if the cover is able to have variants.
    games = games.filter { |game| game.cover.variable? }
    puts 'Creating game cover variants...'
    # Use the configured max number of threads, with 2 leftover for web requests.
    # Clamp it to 1 if the configured max threads is 2 or less for whatever reason.
    thread_count = [(ENV.fetch('RAILS_MAX_THREADS', 5).to_i - 2), 1].max
    games_progress_bar = ProgressBar.create(
      total: games.count,
      format: "\e[0;32m%c/%C |%b>%i| %e\e[0m"
    )
    # Disable logging in production to prevent log spam.
    Rails.logger.level = 2 if Rails.env.production?
    Parallel.each(games, in_threads: thread_count) do |game|
      ActiveRecord::Base.connection_pool.with_connection do
        begin
          [:small, :medium, :large].each do |size|
            game.sized_cover(size).process
          end
        # Rescue MiniMagick errors if they occur so that they don't block the
        # task from continuing.
        rescue MiniMagick::Error => e
          games_progress_bar.log "ERROR: #{e.message}"
          games_progress_bar.log "Failed on game ID: #{game.id}"
        end
        games_progress_bar.increment
      end
    end
    games_progress_bar.finish unless games_progress_bar.finished?

    users = User.joins(:avatar_attachment)
    # Only attempt to create variants if the avatar is able to have variants.
    users = users.filter { |user| user.avatar.variable? }
    puts 'Creating user avatar variants...'
    users_progress_bar = ProgressBar.create(
      total: users.count,
      format: "\e[0;32m%c/%C |%b>%i| %e\e[0m"
    )
    Parallel.each(users, in_threads: thread_count) do |user|
      ActiveRecord::Base.connection_pool.with_connection do
        begin
          [:small, :medium, :large].each do |size|
            user.sized_avatar(size).process
          end
        # Rescue MiniMagick errors if they occur so that they don't block the
        # task from continuing.
        rescue MiniMagick::Error => e
          users_progress_bar.log "ERROR: #{e.message}"
          users_progress_bar.log "Failed on user ID: #{user.id}"
        end
        users_progress_bar.increment
      end
    end
    users_progress_bar.finish unless users_progress_bar.finished?
  end
end
This is what the sized_cover looks like in game.rb:
def sized_cover(size)
  width, height = COVER_SIZES[size]
  cover&.variant(
    resize_to_limit: [width, height]
  )
end
sized_avatar is pretty much the same thing.
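For completeness, a sketch of what sized_avatar in user.rb would look like under that description (AVATAR_SIZES is an assumed constant mirroring COVER_SIZES, not from the original post):

# Assumed to mirror sized_cover; AVATAR_SIZES is hypothetical, e.g.
# AVATAR_SIZES = { small: [150, 150], medium: [300, 300], large: [600, 600] }.
def sized_avatar(size)
  width, height = AVATAR_SIZES[size]
  avatar&.variant(
    resize_to_limit: [width, height]
  )
end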

How can I prevent many sidekiq jobs from exceeding the API calls limit

I am working on a Ruby on Rails application. We have many Sidekiq workers that can each process multiple jobs at a time. Each job makes calls to the Shopify API, and the limit set by Shopify is 2 calls per second. I want to synchronize this so that only two jobs can call the API in any given second.
The way I'm doing that right now is like this:
# frozen_string_literal: true

class Synchronizer
  attr_reader :shop_id, :queue_name, :limit, :wait_time

  def initialize(shop_id:, queue_name:, limit: nil, wait_time: 1)
    @shop_id = shop_id
    @queue_name = queue_name.to_s
    @limit = limit
    @wait_time = wait_time
  end

  # This method should be called for each api call
  def synchronize_api_call
    raise "a block is required." unless block_given?
    get_api_call
    time_to_wait = calculate_time_to_wait
    sleep(time_to_wait) unless Rails.env.test? || time_to_wait.zero?
    yield
  ensure
    return_api_call
  end

  def set_api_calls
    redis.del(api_calls_list)
    redis.rpush(api_calls_list, calls_list)
  end

  private

  def get_api_call
    logger.log_message(synchronizer: 'Waiting for api call', color: :yellow)
    @api_call_timestamp = redis.brpop(api_calls_list)[1].to_i
    logger.log_message(synchronizer: 'Got api call.', color: :yellow)
  end

  def return_api_call
    redis_timestamp = redis.time[0]
    redis.rpush(api_calls_list, redis_timestamp)
  ensure
    redis.ltrim(api_calls_list, 0, limit - 1)
  end

  def last_call_timestamp
    @api_call_timestamp
  end

  def calculate_time_to_wait
    current_time = redis.time[0]
    time_passed = current_time - last_call_timestamp.to_i
    time_to_wait = wait_time - time_passed
    time_to_wait > 0 ? time_to_wait : 0
  end

  def reset_api_calls
    redis.multi do |r|
      r.del(api_calls_list)
    end
  end

  def calls_list
    redis_timestamp = redis.time[0]
    limit.times.map do |i|
      redis_timestamp
    end
  end

  def api_calls_list
    @api_calls_list ||= "api-calls:shop:#{shop_id}:list"
  end

  def redis
    Thread.current[:redis] ||= Redis.new(db: $redis_db_number)
  end
end
The way I use it is like this:
synchronizer = Synchronizer.new(shop_id: shop_id, queue_name: 'shopify_queue', limit: 2, wait_time: 1)
# This is called once when the process starts, i.e. it's not called by the jobs
# themselves but by the app from where the process is kicked off. It populates
# api_calls_list with 2 timestamps; those timestamps are used to know when the
# last api call was sent.
synchronizer.set_api_calls
Then, when a job wants to make a call:
synchronizer.synchronize_api_call do
  # make the call
end
The problem
The problem with this is that if for some reason a job fails to return the api_call it took to the api_calls_list, that job and all the other jobs get stuck forever, or until we notice and call set_api_calls again. The problem doesn't affect only that particular shop; it affects the other shops as well, because the Sidekiq workers are shared between all the shops using our app. Sometimes we don't notice until a user calls us, and we find that something has been stuck for many hours when it should have finished in a few minutes.
The Question
I've lately realized that Redis is not the best tool for shared locking. So I'm asking: is there a better tool for this job? If not in the Ruby world, I'd like to learn from other ecosystems as well. I'm interested in the techniques as well as the tools, so every bit helps.
You may want to restructure your code and create a micro-service to process the API calls, which would use a local locking mechanism and force your workers to wait on the socket. It comes with the added complexity of maintaining the micro-service. But if you're in a hurry, Sidekiq Enterprise's rate limiting ("Ent Rate Limiting") looks good too.
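As an aside on the stuck-token failure mode described above: a counter-based limiter avoids it entirely, because nothing ever has to be returned to a list. Below is a minimal sketch (mine, not from the answer) of a fixed-window limiter built on the redis gem's INCR and EXPIRE; the class name and key layout are illustrative.

require 'redis'

# Each one-second window gets its own counter key, and Redis expires the
# key on its own. A job that crashes mid-call can therefore waste at most
# one slot for one second instead of wedging the whole queue.
class ShopifyRateLimiter
  def initialize(shop_id:, limit: 2, redis: Redis.new)
    @shop_id = shop_id
    @limit = limit
    @redis = redis
  end

  # Blocks (by sleeping) until the current window has capacity, then yields.
  def synchronize_api_call
    loop do
      key = "api-calls:shop:#{@shop_id}:#{Time.now.to_i}"
      count = @redis.incr(key)
      # Set a TTL so the counter disappears even if this process dies here.
      @redis.expire(key, 2)
      return yield if count <= @limit
      sleep 0.1
    end
  end
end

The trade-off versus the BRPOP approach is busy-waiting in 100 ms steps instead of blocking inside Redis, which is usually acceptable at 2 calls per second.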

XML generation very slow and using lots of memory in Rails 4

I'm generating an XML file to share data with another system. From my troubleshooting, I've found that this process is both slow and memory-hungry (I'm getting lots of R14 errors on Heroku).
My index method on my Jobs Controller looks like this:
def index
  respond_to do |format|
    format.xml { @jobs = @user.jobs.includes(job_types: [:job_lines, :job_photos]) }
    format.json do
      # More code here, this part is not the problem.
    end
  end
end
My view (index.xml.builder) looks like this (I've removed a bunch of fields to keep the example smaller):
xml.instruct!
xml.jobs do
  @jobs.each do |j|
    xml.job do
      xml.id j.id
      xml.job_number j.job_number
      xml.registration j.registration
      xml.name j.name
      xml.job_types do
        j.job_types.each do |t|
          xml.job_type do
            xml.id t.id
            xml.job_id t.job_id
            xml.type_number t.type_number
            xml.description t.description
            xml.job_lines do
              t.job_lines.each do |l|
                xml.job_line do
                  xml.id l.id
                  xml.line_number l.line_number
                  xml.job_type_id l.job_type_id
                  xml.line_type l.line_type
                  xml.type_number l.type_number
                  xml.description l.description
                  xml.part_number l.part_number
                end # job_line node
              end # job_lines.each
            end # job_lines node
            xml.job_photos do
              t.job_photos.each do |p|
                xml.job_photo do
                  xml.id p.id
                  xml.pcid p.pcid
                  xml.job_type_id p.job_type_id
                  xml.image_url p.image.url
                end # job_photo node
              end # job_photos.each
            end # job_photos node
          end # job_type node
        end # job_types.each
      end # job_types node
    end # job node
  end # @jobs.each
end # jobs node
The generated XML file is not small (about 100 kB). Running on Heroku, their Scout tool tells me that this process often takes 4-6 seconds. Also, despite running only 1 Puma worker with 4 threads, this part of my code is consuming all my memory: in Scout I can see that its "Max Allocations" reach as high as 10M, compared with my next-worst method at only 500k allocations.
Can anyone tell me what I'm doing wrong? Is there a more efficient (in terms of speed and memory usage) way for me to generate XML?
Any help would be appreciated.
EDIT 1
I've tried building the XML manually like this:
joblist.each do |j|
result << " <job>\n"
result << " <id>" << j.id.to_s << "</id>\n"
result << " <job_number>" << j.job_number.to_s << "</job_number>\n"
# Lots more lines removed
end
This has given me some improvement: my largest allocation count is now 1.8M. I'm still close to Heroku's limit, though (a max of 500MB out of the 512MB limit over 24 hours).
I am still running only 1 worker with 4 threads. If I can, I'd like to get the memory down further so I can run more Puma workers and threads.
EDIT 2
I ended up having to do this in batches (using offset and limit), sending 5 jobs at a time. Memory usage dropped substantially when I did this. Obviously there were more calls to the controller, but each one was smaller and faster.
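A sketch of what that batched controller action might look like (the offset parameter and page size of 5 are illustrative, not from the original post):

# Hypothetical batched version of the index action. The client walks the
# collection with ?offset=0, ?offset=5, ... until it receives an empty
# <jobs> document.
def index
  offset = params.fetch(:offset, 0).to_i
  limit = 5 # keep each response small to cap per-request allocations
  respond_to do |format|
    format.xml do
      @jobs = @user.jobs
                   .includes(job_types: [:job_lines, :job_photos])
                   .order(:id)
                   .offset(offset)
                   .limit(limit)
    end
  end
end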

Speed up rake task by using typhoeus

So I stumbled across this: https://github.com/typhoeus/typhoeus
I'm wondering if this is what I need to speed up my rake task.
Event.all.each do |row|
  begin
    url = urlhere + row.first + row.second
    doc = Nokogiri::HTML(open(url))
    doc.css('.table__row--event').each do |tablerow|
      table = tablerow.css('.table__cell__body--location').css('h4').text
      next unless table == row.eventvenuename
      tablerow.css('.table__cell__body--availability').each do |button|
        buttonurl = button.css('a')[0]['href']
        row.update(row: buttonurl) unless buttonurl.include?('/checkout/external')
      end
    end
  rescue Faraday::ConnectionFailed
    puts "connection failed"
    next
  end
end
I'm wondering whether this would speed it up, or whether it wouldn't because I'm doing a .each. If it would, could you provide an example?
Sam
If you set up Typhoeus::Hydra to run parallel requests, you might be able to speed up your code, assuming that the Kernel#open calls are what's slowing you down. Before you optimize, you might want to run benchmarks to validate this assumption.
If it is true, and parallel requests would speed it up, you would need to restructure your code to load events in batches, build a queue of parallel requests for each batch, and then handle them after they execute. Here's some sketch code.
class YourBatchProcessingClass
  def initialize(batch_size: 200)
    @batch_size = batch_size
    @hydra = Typhoeus::Hydra.new(max_concurrency: @batch_size)
  end

  def perform
    # Get an array of records
    Event.find_in_batches(batch_size: @batch_size) do |batch|
      # Store all the requests so we can access their responses later.
      requests = batch.map do |record|
        request = Typhoeus::Request.new(your_url_build_logic(record))
        @hydra.queue request
        request
      end
      @hydra.run # Run requests in parallel
      # Process responses from each request
      requests.each do |request|
        your_response_processing(request.response.body)
      end
    end
  rescue WhateverError => e
    puts e.message
  end

  private

  def your_url_build_logic(event)
    # TODO
  end

  def your_response_processing(response_body)
    # TODO
  end
end

# Run the service by calling this in your Rake task definition
YourBatchProcessingClass.new.perform
Ruby can be used for pure scripting, but it functions best as an object-oriented language. Decomposing your processing work into clear methods can help clarify your code and help you catch things like Tom Lord mentioned in the comments on your question. Also, instead of wrapping your whole script in a begin..rescue block, you can use method-level rescues as in #perform above, or just wrap @hydra.run.
As a note, .all.each is a memory hog and is thus considered a bad way to iterate over records: .all loads all of the records into memory before iterating over them with .each. To save memory, it's better to use .find_each or .find_in_batches, depending on your use case. See: http://api.rubyonrails.org/classes/ActiveRecord/Batches.html
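For example, the only change needed in the original loop is the enumerator; the body stays exactly as it was:

# Loads events in batches of 1000 (the default) under the hood instead of
# materializing every Event row at once.
Event.find_each do |row|
  # ... same body as the original loop ...
end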

How to continue indexing documents in Elasticsearch (Rails)?

So I ran the command rake environment elasticsearch:import:model CLASS='AutoPartsMapper' FORCE=true to index documents in Elasticsearch. My database has 10,000,000 records, so indexing takes (I think) about a day. While the indexing was running, my computer turned off, after about 2,000,000 documents had been indexed. Is it possible to continue indexing from where it left off?
If you use Rails 4.2+ you can use ActiveJob to schedule the indexing and leave it running. First, generate a job with:
bin/rails generate job elastic_search_index
This will give you a class with a perform method:
class ElasticSearchIndexJob < ApplicationJob
  def perform
    # implement the indexing here
    AutoPartsMapper.__elasticsearch__.create_index! force: true
    AutoPartsMapper.__elasticsearch__.import
  end
end
Set Sidekiq as your Active Job adapter and, from the console, kick the job off with:
ElasticSearchIndexJob.perform_later
This enqueues the job and executes it as soon as a worker is free, while freeing up your console. You can leave it running and check on the process from bash later:
ps aux | grep side
this will give you something like: sidekiq 4.1.2 app[1 of 12 busy]
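As an aside, "set Sidekiq as your Active Job adapter" is standard Rails configuration; a minimal sketch (the module name is a placeholder):

# config/application.rb
module YourApp
  class Application < Rails::Application
    # Route Active Job jobs (like ElasticSearchIndexJob) through Sidekiq.
    config.active_job.queue_adapter = :sidekiq
  end
end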
Have a look at this post, which explains the integration in more detail:
http://ruby-journal.com/how-to-integrate-sidekiq-with-activejob/
Hope it helps
There is no such functionality in elasticsearch-rails AFAIK, but you could write a simple task to do it yourself.
namespace :es do
  task :populate, [:start_id] => :environment do |_, args|
    start_id = args[:start_id].to_i
    AutoPartsMapper.where('id > ?', start_id).order(:id).find_each do |record|
      puts "Processing record ##{record.id}"
      record.__elasticsearch__.index_document
    end
  end
end
Start it with bundle exec rake es:populate[<start_id>] passing the id of the record from which to start the next batch.
Note that this is a simplistic solution which will be much slower than batch indexing.
UPDATE
Here is a batch indexing task. It is much faster and automatically detects the record from which to continue. It does make an assumption that previously imported records were processed in increasing id order and without gaps. I haven't tested it but most of the code is from a production system.
namespace :es do
  task :populate_auto => :environment do |_, args|
    start_id = get_max_indexed_id
    # find_in_batches already walks the records in primary-key order.
    AutoPartsMapper.where('id > ?', start_id).find_in_batches(batch_size: 1000) do |records|
      elasticsearch_bulk_index(records)
    end
  end

  def get_max_indexed_id
    AutoPartsMapper.search(aggs: { max_id: { max: { field: :id } } }, size: 0)
                   .response[:aggregations][:max_id][:value].to_i
  end

  def elasticsearch_bulk_index(records)
    return if records.empty?
    klass = records.first.class
    klass.__elasticsearch__.client.bulk({
      index: klass.__elasticsearch__.index_name,
      type: klass.__elasticsearch__.document_type,
      body: elasticsearch_records_to_index(records)
    })
  end

  def elasticsearch_records_to_index(records)
    records.map do |record|
      payload = { _id: record.id, data: record.as_indexed_json }
      { index: payload }
    end
  end
end
