XML generation very slow and using lots of memory in Rails 4

I'm generating an XML file to share data with another system. From my troubleshooting, I've found that this process is both slow and memory-hungry (I'm getting lots of R14 errors on Heroku).
My index method in my JobsController looks like this:
def index
  respond_to do |format|
    format.xml { @jobs = @user.jobs.includes(job_types: [:job_lines, :job_photos]) }
    format.json do
      # More code here, this part is not the problem.
    end
  end
end
My view (index.xml.builder) looks like this (I've removed a bunch of fields to keep the example smaller):
xml.instruct!
xml.jobs do
  @jobs.each do |j|
    xml.job do
      xml.id j.id
      xml.job_number j.job_number
      xml.registration j.registration
      xml.name j.name
      xml.job_types do
        j.job_types.each do |t|
          xml.job_type do
            xml.id t.id
            xml.job_id t.job_id
            xml.type_number t.type_number
            xml.description t.description
            xml.job_lines do
              t.job_lines.each do |l|
                xml.job_line do
                  xml.id l.id
                  xml.line_number l.line_number
                  xml.job_type_id l.job_type_id
                  xml.line_type l.line_type
                  xml.type_number l.type_number
                  xml.description l.description
                  xml.part_number l.part_number
                end # job_line node
              end # job_lines.each
            end # job_lines node
            xml.job_photos do
              t.job_photos.each do |p|
                xml.job_photo do
                  xml.id p.id
                  xml.pcid p.pcid
                  xml.job_type_id p.job_type_id
                  xml.image_url p.image.url
                end # job_photo node
              end # job_photos.each
            end # job_photos node
          end # job_type node
        end # job_types.each
      end # job_types node
    end # job node
  end # @jobs.each
end # jobs node
The generated XML file is not small (it's about 100kB). Running on Heroku, their Scout tool tells me that this action often takes 4-6 seconds to run. Also, despite running only 1 Puma worker with 4 threads, this part of my code is consuming all my memory: in Scout, I can see that its "Max Allocations" figure is as high as 10M, compared with my next-worst method at only 500k allocations.
Can anyone tell me what I'm doing wrong? Is there a more efficient (in terms of speed and memory usage) way for me to generate XML?
Any help would be appreciated.
EDIT 1
I've tried building the XML manually like this:
joblist.each do |j|
  result << " <job>\n"
  result << " <id>" << j.id.to_s << "</id>\n"
  result << " <job_number>" << j.job_number.to_s << "</job_number>\n"
  # Lots more lines removed
end
This has given me some improvements: my largest allocation count is now 1.8M. However, I'm still close to Heroku's memory limit (I reached a maximum of 500MB of the 512MB limit over 24 hours).
I am still only running 1 Puma worker with 4 threads. If I can, I'd like to get memory usage down further so I can run more Puma workers and threads.
EDIT 2
I ended up having to do this in batches (using offset and limit), sending 5 jobs at a time. Memory usage dropped substantially when I did this. Obviously there were more calls to the controller, but each one was smaller and faster.
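For reference, a minimal sketch of the batched action, assuming the client passes offset and limit parameters and keeps requesting until it receives an empty jobs element (the parameter names and the default batch size of 5 are illustrative, not the exact code I used):
def index
  respond_to do |format|
    format.xml do
      limit  = (params[:limit]  || 5).to_i
      offset = (params[:offset] || 0).to_i
      @jobs = @user.jobs
                   .includes(job_types: [:job_lines, :job_photos])
                   .order(:id)
                   .offset(offset)
                   .limit(limit)
    end
  end
end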

Related

Rake task for creating database records for all existing ActiveStorage variants

In Rails 6.1, ActiveStorage creates database records for all variants when they're loaded for the first time: https://github.com/rails/rails/pull/37901
I'd like to enable this, but since I have tens of thousands of files in my production Rails app, it'd be problematic (and presumably slow) to have users creating so many database records as they browse the site. Is there a way to write a Rake task that will iterate through every attachment in my database, generate the variants, and save them to the database?
I'd run that once, after enabling the new active_storage.track_variants config, and then any newly-uploaded files would be saved when they're loaded for the first time.
Thanks for the help!
This is the Rake task I ended up creating for this. The Parallel stuff can be removed if you have a smaller dataset; I found that with 70k+ variants it was intolerably slow without any parallelization. You can also ignore the progress bar-related code :)
Essentially, I take all the models that have an attachment (I do this manually; you could do it in a more dynamic way if you have a ton of attachments), and then keep only the ones that are variable. Then I go through each attachment, generate a variant for each size I've defined, and call process on it to force it to be saved to the database.
Make sure to catch MiniMagick (or vips, if you prefer) errors in the task so that a bad image file doesn't break everything.
# Rails 6.1 changes the way ActiveStorage works so that variants are
# tracked in the database. The intent of this task is to create the
# necessary variants for all game covers and user avatars in our database.
# This way, the user isn't creating dozens of variant records as they
# browse the site. We want to create them ahead-of-time, when we deploy
# the change to track variants.
namespace 'active_storage:vglist:variants' do
  require 'ruby-progressbar'
  require 'parallel'

  desc "Create all variants for covers and avatars in the database."
  task create: :environment do
    games = Game.joins(:cover_attachment)
    # Only attempt to create variants if the cover is able to have variants.
    games = games.filter { |game| game.cover.variable? }

    puts 'Creating game cover variants...'

    # Use the configured max number of threads, with 2 leftover for web requests.
    # Clamp it to 1 if the configured max threads is 2 or less for whatever reason.
    thread_count = [(ENV.fetch('RAILS_MAX_THREADS', 5).to_i - 2), 1].max

    games_progress_bar = ProgressBar.create(
      total: games.count,
      format: "\e[0;32m%c/%C |%b>%i| %e\e[0m"
    )

    # Disable logging in production to prevent log spam.
    Rails.logger.level = 2 if Rails.env.production?

    Parallel.each(games, in_threads: thread_count) do |game|
      ActiveRecord::Base.connection_pool.with_connection do
        begin
          [:small, :medium, :large].each do |size|
            game.sized_cover(size).process
          end
        # Rescue MiniMagick errors if they occur so that they don't block the
        # task from continuing.
        rescue MiniMagick::Error => e
          games_progress_bar.log "ERROR: #{e.message}"
          games_progress_bar.log "Failed on game ID: #{game.id}"
        end
        games_progress_bar.increment
      end
    end

    games_progress_bar.finish unless games_progress_bar.finished?

    users = User.joins(:avatar_attachment)
    # Only attempt to create variants if the avatar is able to have variants.
    users = users.filter { |user| user.avatar.variable? }

    puts 'Creating user avatar variants...'

    users_progress_bar = ProgressBar.create(
      total: users.count,
      format: "\e[0;32m%c/%C |%b>%i| %e\e[0m"
    )

    Parallel.each(users, in_threads: thread_count) do |user|
      ActiveRecord::Base.connection_pool.with_connection do
        begin
          [:small, :medium, :large].each do |size|
            user.sized_avatar(size).process
          end
        # Rescue MiniMagick errors if they occur so that they don't block the
        # task from continuing.
        rescue MiniMagick::Error => e
          users_progress_bar.log "ERROR: #{e.message}"
          users_progress_bar.log "Failed on user ID: #{user.id}"
        end
        users_progress_bar.increment
      end
    end

    users_progress_bar.finish unless users_progress_bar.finished?
  end
end
This is what the sized_cover looks like in game.rb:
def sized_cover(size)
  width, height = COVER_SIZES[size]
  cover&.variant(
    resize_to_limit: [width, height]
  )
end
sized_avatar is pretty much the same thing.
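For reference, a sketch of what sized_avatar could look like in user.rb, assuming an AVATAR_SIZES hash analogous to COVER_SIZES (this is inferred from the cover version above, not copied from the actual app):
def sized_avatar(size)
  width, height = AVATAR_SIZES[size]
  avatar&.variant(
    resize_to_limit: [width, height]
  )
end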

Detect which Sidekiq jobs are responsible for high memory usage

I run an app on Heroku and use Sidekiq as the job queue system. Recently, however, memory usage has consistently been around 90%-110%.
I already tried reducing the concurrency a little and scaling the number of workers, but without much success. Is there a way to detect which Sidekiq jobs are consuming so much memory?
We use New Relic to track our transactions but I couldn't find this information on the platform.
I faced a similar situation where I needed to track the memory used by workers, and I came up with the following solution. I'm not sure it will solve your problem, but I hope it points you in the right direction.
I wrote a cron job which collects the currently running workers (not in real time) and the process's memory usage, and appends them to a CSV file. Below is the code.
class DataWorker
  include Sidekiq::Worker

  def perform
    file = File.new("sidekiq_profiling.csv", "a")
    memory_usage = gc_start
    workers = Sidekiq::Workers.new
    worker_running_counts = workers.map { |pid, thrd, wrkr| wrkr["payload"]["class"] }
                                   .group_by { |cls| cls }
                                   .map { |k, v| { k => v.count } }
    datetime = DateTime.now
    worker_running_counts.each do |wc|
      file << "#{datetime},#{wc.keys[0]},#{wc.values[0]},#{memory_usage}\n"
    end
    file.close
  end

  def rss_usage
    `ps -o rss= -p #{Process.pid}`.chomp.to_i * 1024
  end

  # def gc_stats
  #   GC.stat.slice(:heap_available_slots, :heap_live_slots, :heap_free_slots)
  # end

  def gc_start
    GC.start
    # gc_stats.each do |key, value|
    #   puts "GC.#{key}: #{value.to_s(:delimited)}"
    # end
    "#{rss_usage.to_s(:human_size, precision: 3)}"
  end

  Sidekiq::Cron::Job.create(name: 'DataWorker', cron: '* * * * *', class: 'DataWorker')
end

Handle connection breakages in rails

I have a module written in Ruby which reads from a Postgres table and then applies some logic to each record.
Below is a sample code:
module SampleModuleHelper
  def self.traverse_database
    ProductTable.where(:column => value).find_each do |product|
      # some logic here that takes a long time
    end
  end
end
ProductTable has more than 3 million records. I have used the where clause to reduce the number of records retrieved.
However, I need to make the code resilient to connection failures. There are times when the connection breaks and I have to start traversing the table from the very beginning. I don't want this; rather, it should resume where it left off, since each record takes too long to process to start over every time.
What is the best way to make the code start where it left off?
One way is to make a table in the database that records the primary key (id) where it stopped, and start from there again. But I don't want to create tables in the database, as there are many such processes.
You could keep a counter of processed records and use the offset method to continue processing.
Something along the lines of:
MAX_RETRIES = 3

def self.traverse(query)
  counter = 0
  retries = 0
  begin
    query.offset(counter).find_each do |record|
      yield record
      counter += 1
    end
  rescue ActiveRecord::ConnectionNotEstablished => e # or whatever error you're expecting
    retries += 1
    retry unless retries > MAX_RETRIES
    raise
  end
end

def self.traverse_products
  traverse(ProductTable.where(column: value)) do |product|
    # do something with `product`
  end
end
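Alternatively, here is a minimal sketch of the id-cursor idea mentioned in the question, kept in memory rather than in a separate table. It assumes a numeric, monotonically increasing primary key and reuses the MAX_RETRIES constant defined above:
def self.traverse_by_id(query)
  last_id = 0
  retries = 0
  begin
    # find_each walks the table in primary-key order, so restarting
    # the query above last_id resumes roughly where we left off.
    query.where('id > ?', last_id).find_each do |record|
      yield record
      last_id = record.id
    end
  rescue ActiveRecord::ConnectionNotEstablished
    retries += 1
    retry unless retries > MAX_RETRIES
    raise
  end
end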

Speed up rake task by using typhoeus

So I stumbled across this: https://github.com/typhoeus/typhoeus
I'm wondering if this is what I need to speed up my rake task.
Event.all.each do |row|
  begin
    url = urlhere + row.first + row.second
    doc = Nokogiri::HTML(open(url))
    doc.css('.table__row--event').each do |tablerow|
      table = tablerow.css('.table__cell__body--location').css('h4').text
      next unless table == row.eventvenuename
      tablerow.css('.table__cell__body--availability').each do |button|
        buttonurl = button.css('a')[0]['href']
        if buttonurl.include? '/checkout/external'
        else
          row.update(row: buttonurl)
        end
      end
    end
  rescue Faraday::ConnectionFailed
    puts "connection failed"
    next
  end
end
I'm wondering if this would speed it up, or whether it wouldn't because I'm doing a .each?
If it would, could you provide an example?
Sam
If you set up Typhoeus::Hydra to run parallel requests, you might be able to speed up your code, assuming that the Kernel#open calls are what's slowing you down. Before you optimize, you might want to run benchmarks to validate this assumption.
If it is true, and parallel requests would speed it up, you would need to restructure your code to load events in batches, build a queue of parallel requests for each batch, and then handle them after they execute. Here's some sketch code.
class YourBatchProcessingClass
  def initialize(batch_size: 200)
    @batch_size = batch_size
    @hydra = Typhoeus::Hydra.new(max_concurrency: @batch_size)
  end

  def perform
    # Get an array of records
    Event.find_in_batches(batch_size: @batch_size) do |batch|
      # Store all the requests so we can access their responses later.
      requests = batch.map do |record|
        request = Typhoeus::Request.new(your_url_build_logic(record))
        @hydra.queue request
        request
      end
      @hydra.run # Run requests in parallel

      # Process responses from each request
      requests.each do |request|
        your_response_processing(request.response.body)
      end
    end
  rescue WhateverError => e
    puts e.message
  end

  private

  def your_url_build_logic(event)
    # TODO
  end

  def your_response_processing(response_body)
    # TODO
  end
end

# Run the service by calling this in your Rake task definition
YourBatchProcessingClass.new.perform
Ruby can be used for pure scripting, but it functions best as an object-oriented language. Decomposing your processing work into clear methods can help clarify your code and help you catch issues like the ones Tom Lord mentioned in the comments on your question. Also, instead of wrapping your whole script in a begin..rescue block, you can use method-level rescues as in #perform above, or just wrap @hydra.run.
As a note, .all.each is a memory hog and is thus considered a bad way to iterate over records: .all loads all of the records into memory before iterating over them with .each. To save memory, it's better to use .find_each or .find_in_batches, depending on your use case. See: http://api.rubyonrails.org/classes/ActiveRecord/Batches.html
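For example, the loop from your question could load events in batches of 1000 instead of all at once just by swapping .all.each for .find_each (a sketch; the body stays the same):
Event.find_each do |row|
  url = urlhere + row.first + row.second
  # ... same Nokogiri parsing and row.update logic as before ...
end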

optimizing reading database and writing to csv file

I'm trying to read a large number of cells from the database (over 100,000) and write them to a CSV file on a VPS Ubuntu server. It turns out the server doesn't have enough memory.
I was thinking about reading 5000 rows at once and writing them to the file, then reading another 5000, and so on.
How should I restructure my current code so that memory isn't consumed fully?
Here's my code:
def write_rows(emails)
  File.open(file_path, "w+") do |f|
    f << "email,name,ip,created\n"
    emails.each do |l|
      f << [l.email, l.name, l.ip, l.created_at].join(",") + "\n"
    end
  end
end
The function is called from a Sidekiq worker with:
write_rows(user.emails)
Thanks for help!
The problem here is that when you call emails.each, ActiveRecord loads all the records from the database and keeps them in memory. To avoid this, you can use the method find_each:
require 'csv'

BATCH_SIZE = 5000

def write_rows(emails)
  CSV.open(file_path, 'w') do |csv|
    csv << %w{email name ip created}
    emails.find_each do |email|
      csv << [email.email, email.name, email.ip, email.created_at]
    end
  end
end
By default, find_each loads records in batches of 1000 at a time. If you want to load batches of 5000 records, you have to pass the :batch_size option to find_each:
emails.find_each(:batch_size => 5000) do |email|
...
More information about the find_each method (and the related find_in_batches) can be found on the Ruby on Rails Guides.
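If you would rather handle a whole batch at a time, as described in the question, a sketch using find_in_batches with the same 5000-row batch size might look like this:
emails.find_in_batches(:batch_size => 5000) do |batch|
  batch.each do |email|
    csv << [email.email, email.name, email.ip, email.created_at]
  end
end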
I've used the CSV class to write the file instead of joining fields and lines by hand. This is not intended to be a performance optimization, since writing to the file shouldn't be the bottleneck here.
