Updating Lots of Records at Once in Rails - ruby-on-rails

I have a background job that runs about 5,000 times every 10 minutes. Each job makes a request to an external API and then either adds new records or updates existing ones in my database. Each API request returns around 100 items, so every 10 minutes I am making 50,000 CREATE or UPDATE SQL queries.
Here is how I handle this now: each API item returned has a unique ID. I search my database for a record with this ID; if it exists, I update it, and if it doesn't, I create a new one.
Imagine the api response looks like this:
[
{
external_id: '123',
text: 'blah blah',
count: 450
},
{
external_id: 'abc',
text: 'something else',
count: 393
}
]
This response is assigned to the variable collection.
Then I run this code in my parent model:
class ParentModel < ApplicationRecord
def update
collection.each do |attrs|
child = ChildModel.find_or_initialize_by(external_id: attrs[:external_id], parent_model_id: self.id)
child.assign_attributes attrs
child.save if child.changed?
end
end
end
Each of these individual calls is extremely quick, but when I am doing 50,000 in a short period of time it really adds up and can slow things down.
I'm wondering if there's a more efficient way to handle this. I was thinking of doing something like this instead:
class ParentModel < ApplicationRecord
def update
eager_loaded_children = ChildModel.where(parent_model_id: self.id).limit(100)
collection.each do |attrs|
cached_child = eager_loaded_children.select {|child| child.external_id == attrs[:external_id] }.first
if cached_child
cached_child.update_attributes attrs
else
ChildModel.create attrs
end
end
end
end
Essentially I would be saving the lookups and instead doing one bigger query up front (this is also quite fast), at the cost of some memory. But this doesn't seem like it would save that much time: it might slightly speed up the lookup part, but I'd still have to do 100 updates and creates.
Is there some kind of way I can do batch updates that I'm not thinking of? Anything else obvious that could make this go faster, or reduce the amount of queries I am doing?

You can do something like this:
def update
  # Index the API items by external_id so each lookup is a hash access.
  collection2 = collection.map { |c| [c[:external_id], c.except(:external_id)] }.to_h
  # One query fetches every existing record for this parent.
  ChildModel.where(parent_model_id: id, external_id: collection2.keys).each do |cm|
    cm.assign_attributes collection2.delete(cm.external_id)
    cm.save if cm.changed?
  end
  # Anything left in collection2 had no match, so create it.
  new_attrs = collection2.map do |ext_id, attrs|
    attrs.merge(external_id: ext_id, parent_model_id: id)
  end
  ChildModel.create(new_attrs) if new_attrs.present?
end
This is better because:
it fetches all the required records in one query;
it creates all the new records in one call (create with an array still runs one INSERT per record).
You can use update_columns if you don't need callbacks or validations.
The only drawback is more manipulation in Ruby code, which I think is a good tradeoff for fewer DB queries.
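The split between rows to update and rows to create can be sketched in plain Ruby, with no database involved. In this sketch, existing_ids stands in for the result of something like a pluck of external_id on the existing children, and partition_items is a hypothetical helper name, not part of Rails.

```ruby
require 'set'

# Hypothetical helper: splits API items into those that already exist
# (to update) and those that do not (to create).
def partition_items(collection, existing_ids)
  known = existing_ids.to_set # a Set gives constant-time membership checks
  collection.partition { |item| known.include?(item[:external_id]) }
end

collection = [
  { external_id: '123', text: 'blah blah', count: 450 },
  { external_id: 'abc', text: 'something else', count: 393 }
]

# Pretend only '123' is already stored.
to_update, to_create = partition_items(collection, ['123'])
```

On Rails 6 or later, the to_create rows could then be written with insert_all, which issues a single INSERT but skips validations and callbacks.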

Related

Performance of data validation

I have an endpoint that accepts incoming data, checks it for errors and imports into the database. Incoming data can be up to 300 000 rows. Stack is - Ruby on Rails, Postgres, Redis, Sidekiq, dry-validation. Current flow:
load data into Redis;
prepare/transform;
validate and mark every row as valid/invalid;
fetch valid rows and bulk import them.
I need advice on how to improve the performance of the validation step, because it sometimes takes more than a day to validate a large file.
Some details
It basically loops through every row in the background and applies validation rules like
rows.each do |row|
result = validate(row)
set_status(row, result) # mark as valid/invalid
end
Some validation rules are uniqueness checks - and they're heavy because they check uniqueness across the whole database. Example:
rule(:sku, :name) do
if Product.where(sku: values[:sku]).where.not(name: values[:name]).exists?
# add error
end
end
Needless to say, DB & logs are going mad during validation.
Another approach I tried was to pluck necessary fields from all database records, then loop through and compare every row with this array rather than make DB requests. But comparing with a huge array appeared to be even slower.
def existing_data
@existing_data ||= Product.pluck(:sku, :name, ...)
end
rule(:sku, :name) do
conflict = existing_data.find do |data|
data[0] == values[:sku] && data[1] != values[:name]
end
if conflict.present?
# add error
end
end
I think you could get a performance improvement by doing something along the lines of your second approach, but fetching as few of the existing products as possible, preferably only the products that are relevant to your validations. Looking only at the code provided, it seems you could cut down on the number of products you load by collecting the SKUs from the newly received rows and using them to filter the products table:
skus = skus_from_rows(rows)
@existing_products = existing_products(skus)
rows.each do |row|
result = validate(row)
set_status(row, result) # mark as valid/invalid
end
def skus_from_rows(rows)
rows.map { |row| row[:sku] }.uniq
end
def existing_products(skus)
Product.where(sku: skus).pluck(:sku, :name, ...)
end
rule(:sku, :name) do
conflict = @existing_products.find do |data|
data[0] == values[:sku] && data[1] != values[:name]
end
if conflict.present?
# add error
end
end
Additionally, I would add an index (if not already present) on the sku column to improve the performance of the query that filters by SKU.
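Scanning an array with find for every row is linear in the number of products, which is likely why the array approach felt slow. A minimal sketch of an alternative is to index the plucked pairs in a Hash keyed by SKU, so each check is a single lookup. The sample data and the helper names index_by_sku and conflict? are assumptions for illustration.

```ruby
require 'set'

# Build { sku => Set of names } from [sku, name] pairs,
# e.g. the pairs returned by plucking :sku and :name.
def index_by_sku(pairs)
  pairs.each_with_object(Hash.new { |h, k| h[k] = Set.new }) do |(sku, name), index|
    index[sku] << name
  end
end

# A row conflicts when its SKU already exists under a different name.
def conflict?(index, sku, name)
  names = index.fetch(sku, nil) # fetch avoids creating empty entries
  !names.nil? && names.any? { |existing| existing != name }
end

index = index_by_sku([['A1', 'Widget'], ['B2', 'Gadget']])
conflict?(index, 'A1', 'Widget')   # same SKU, same name
conflict?(index, 'A1', 'Sprocket') # same SKU, different name
conflict?(index, 'C3', 'Anything') # unknown SKU
```

The index is built once per import, so its cost is paid a single time rather than once per row.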

How to speed up a very frequently made query using raw SQL and without ORM?

I have an API endpoint that accounts for a little less than half of the average response time (taking about 514 ms on average, yikes). The endpoint simply returns some statistics about stored data scoped to particular time periods, such as this week, last week, this month, and so on.
There are a number of ways that we could reduce its impact, like getting the clients to hit it less often and with more specific queries, such as only querying for "this week" when only that data is used. Here we focus on what can be done at the database level first. In our current implementation we generate this data for all "time scopes" on the fly, and the number of queries is enormous and made multiple times per second. No caching is used, but maybe there is a way to use Rails's cache_key, or the low-level Rails.cache?
The current implementation looks something like this:
class FooSummaries
include SummaryStructs
def self.generate_for(user)
@user = user
summaries = Struct::Summaries.new
TimeScope::TIME_SCOPES.each do |scope|
foos = user.foos.by_scope(scope.to_sym)
summary = Struct::Summary.new
# e.g: summaries.last_week = build_summary(foos)
summaries.send("#{scope}=", build_summary(summary, foos))
end
summaries
end
private_class_method
def self.build_summary(summary, foos)
summary.all_quuz = @user.foos_count
summary.all_quux = all_quux(foos)
summary.quuw = quuw(foos).to_f
%w[foo bar baz qux].product(
%w[quux quuz corge]
).each do |a, b|
# e.g: summary.foo_quux = quux(foos, "foo")
summary.send("#{a.downcase}_#{b}=", send(b, foos, a) || 0)
end
summary
end
def self.all_quuz(foos)
foos.count
end
def self.all_quux(foos)
foos.sum(:quux)
end
def self.quuw(foos)
foos.quuwable.total_quuw
end
def self.corge(foos, foo_type)
return if foos.count.zero?
count = self.quuz(foos, foo_type) || 0
count.to_f / foos.count
end
def self.quux(foos, foo_type)
case foo_type
when "foo"
foos.where(foo: true).sum(:quux)
when "bar"
foos.bar.where(foo: false).sum(:quux)
when "baz"
foos.baz.where(foo: false).sum(:quux)
when "qux"
foos.qux.sum(:quux)
end
end
def self.quuz(foos, foo_type)
case foo_type
when "foo"
foos.where(foo: true).count
when "bar"
foos.bar.where(foo: false).count
when "baz"
foos.baz.where(foo: false).count
when "qux"
foos.qux.count
end
end
end
To avoid making changes to the model, or creating migrations for a table to store this data (both of which may be valid and better solutions), I decided it might be easier to construct one large SQL query that is executed at once. The hope is that building the query string and executing it directly will be faster than going through Active Record's setup and teardown for each query.
The new approach looks something like this. It is horrifying to me, and I know there must be a more elegant way:
class FooSummaries
include SummaryStructs
def self.generate_for(user)
results = ActiveRecord::Base.connection.execute(build_query_for(user))
results.each do |result|
# build up summary struct from query results
end
end
def self.build_query_for(user)
TimeScope::TIME_SCOPES.map do |scope|
time_scope = TimeScope.new(scope)
%w[foo bar baz qux].map do |foo_type|
%[
select
'#{scope}_#{foo_type}',
sum(quux) as quux,
count(*) as quuz,
round(100.0 * (count(*) / #{user.foos_count.to_f}), 3) as corge
from
"foos"
where
"foos"."user_id" = #{user.id}
and "foos"."foo_type" = '#{foo_type.humanize}'
and "foos"."end_time" between '#{time_scope.from}' AND '#{time_scope.to}'
and "foos"."foo" = '#{foo_type == 'foo' ? 't' : 'f'}'
union
]
end
end.join.reverse.sub("union".reverse, "").reverse
end
end
The funny way of replacing the last occurrence of union also horrifies me, but it seems to work. There must be a better way, as there are probably many things wrong with the above implementation(s). It may be helpful to note that I use PostgreSQL and have no problem with writing queries that are not portable to other databases. Any advice is truly appreciated!
Thanks for reading!
Update: I found a solution that works for me and sped up the endpoint that uses this service object by 500%! The idea is that, instead of building a query string and executing it for each set of parameters, we create a prepared statement with prepare and then call exec_prepared, passing in the parameters. Since this query runs many times over, this is a useful optimization because, as per the documentation:
A prepared statement is a server-side object that can be used to optimize performance. When the PREPARE statement is executed, the specified statement is parsed, analyzed, and rewritten. When an EXECUTE command is subsequently issued, the prepared statement is planned and executed. This division of labor avoids repetitive parse analysis work, while allowing the execution plan to depend on the specific parameter values supplied.
We prepare the query like so:
def prepare_query!
ActiveRecord::Base.transaction do
connection.prepare("foos_summary",
%[with scoped_foos as (
select
*
from
"foos"
where
"foos"."user_id" = $3
and ("foos"."end_time" between $4 and $5)
)
select
$1::text as scope,
$2::text as foo_type,
sum(quux)::float as quux,
sum(eggs + bacon + ham)::float as food,
count(*) as count,
round((sum(quux) / nullif(
(select
sum(quux)
from
scoped_foos), 0))::numeric,
5)::float as quuz
from
scoped_foos
where
(case $6
when 'Baz'
then (baz = 't')
else
(baz = 'f' and foo_type = $6)
end
)
])
end
end
You can see in this query we use a common table expression for more readability and to avoid writing the same select query twice over.
Then we execute the query, passing in the parameters we need:
def connection
@connection ||= ActiveRecord::Base.connection.raw_connection
end
def query_results
prepare_query! unless query_already_prepared?
@results ||= TimeScope::TIME_SCOPES.map do |scope|
time_scope = TimeScope.new(scope)
%w[bacon eggs ham spam].map do |foo_type|
connection.exec_prepared("foos_summary",
[scope,
foo_type,
@user.id,
time_scope.from,
time_scope.to,
foo_type.humanize])
end
end
end
Where query_already_prepared? is a simple check in the prepared statements table maintained by postgres:
def query_already_prepared?
connection.exec(%(select
name
from
pg_prepared_statements
where name = 'foos_summary')).count.positive?
end
A nice solution, I thought! Hopefully the technique illustrated here will help others with similar problems.
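For the string-built version shown earlier, the reverse-and-substitute trick for dropping the final union is not needed: collecting the fragments in an array and joining them with the separator never produces a trailing one. A small sketch, with a hypothetical build_union_query helper and placeholder SQL fragments:

```ruby
# Hypothetical helper: joins SELECT fragments with "union".
# Array#join only places the separator between elements, so there
# is no trailing keyword to strip afterwards.
def build_union_query(fragments)
  fragments.map(&:strip).join("\nunion\n")
end

fragments = %w[foo bar baz].map do |foo_type|
  "select '#{foo_type}' as label, count(*) from foos"
end

query = build_union_query(fragments)
```

Values interpolated into SQL this way should come only from trusted constants; anything user-supplied belongs in bind parameters, as the prepared statement does.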

Getting all the pages from an API

This is something I struggle with; whenever I do it, it seems to end up messy.
I'm going to ask the question in a very generic way as it's not a single problem I'm really trying to solve.
I have an API that I want to consume some data from, e.g. via:
def get_api_results(page)
results = HTTParty.get("api.api.com?page=#{page}")
end
When I call it I can retrieve a total.
results["total"] = 237
The API limits the number of records I can retrieve in one call, say 20. So I need to call it a few more times.
I want to do something like the following, ideally breaking it into pieces so I can use things like delayed_job, etc.
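def get_all_api_pages
  first_page = get_api_results(1)
  total = first_page["total"]
  results = [first_page]
  page = 1
  # Keep requesting pages of 20 until the total is covered.
  while page * 20 < total
    page += 1
    results << get_api_results(page)
  end
  results
end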
I always feel like I'm writing rubbish whenever I try and solve this (and I've tried to solve it in a number of ways).
The above, for example, leaves me at the mercy of an error with the API, which knocks out all my collected results if I hit an error at any point.
Wondering if there is just a generally good, clean way of dealing with this situation.
I don't think it can get much cleaner, because you only learn the total once you have called the API.
Have you tried building your own enumerator for this? It encapsulates the ugly part. Here is a bit of sample code with a "mocked" API:
class AllRecords
PER_PAGE = 50
def each
return enum_for(:each) unless block_given?
current_page = 0
total = nil
while total.nil? || current_page * PER_PAGE < total
current_page += 1
page = load_page(current_page)
total = page[:total]
page[:items].each do |item|
yield(item)
end
end
end
private
def load_page(page)
if page == 5
{items: Array.new(37) { rand(100) }, total: 237}
else
{items: Array.new(50) { rand(100) }, total: 237}
end
end
end
AllRecords.new.each.each_with_index do |item, index|
p index
end
You can surely clean that up a bit, but I think this is nice because it does not collect all the items first.
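To address the concern about a failed request discarding everything collected so far, one option is an Enumerator that fetches pages lazily and retries a failed page a few times before giving up. This is a sketch under assumptions: fetch_page is any callable returning a hash with :items and :total, and each_item and max_retries are hypothetical names.

```ruby
# Yields items page by page. Items already yielded stay with the
# caller even if a later page ultimately fails.
def each_item(fetch_page, per_page:, max_retries: 2)
  Enumerator.new do |yielder|
    page = 0
    total = nil
    while total.nil? || page * per_page < total
      page += 1
      attempts = 0
      begin
        response = fetch_page.call(page)
      rescue StandardError
        attempts += 1
        retry if attempts <= max_retries
        raise
      end
      total = response[:total]
      response[:items].each { |item| yielder << item }
    end
  end
end

# Fake API: 5 items in pages of 2; page 2 fails once, then succeeds.
failures = { 2 => 1 }
fake_api = lambda do |page|
  if failures.fetch(page, 0) > 0
    failures[page] -= 1
    raise IOError, 'temporary failure'
  end
  all = [10, 20, 30, 40, 50]
  { items: all.each_slice(2).to_a[page - 1] || [], total: all.size }
end

items = each_item(fake_api, per_page: 2).to_a
```

Because the enumerator is lazy, a caller can also process or persist each item as it arrives instead of calling to_a.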

How to test the number of database calls in Rails

I am creating a REST API in rails. I'm using RSpec. I'd like to minimize the number of database calls, so I would like to add an automatic test that verifies the number of database calls being executed as part of a certain action.
Is there a simple way to add that to my test?
What I'm looking for is some way to monitor/record the calls that are being made to the database as a result of a single API call.
If this can't be done with RSpec but can be done with some other testing tool, that's also great.
The easiest thing in Rails 3 is probably to hook into the notifications API.
This subscriber
class SqlCounter < ActiveSupport::LogSubscriber
def self.count= value
Thread.current['query_count'] = value
end
def self.count
Thread.current['query_count'] || 0
end
def self.reset_count
result, self.count = self.count, 0
result
end
def sql(event)
self.class.count += 1
puts "logged #{event.payload[:sql]}"
end
end
SqlCounter.attach_to :active_record
will print every executed sql statement to the console and count them. You could then write specs such as
expect do
# do stuff
end.to change(SqlCounter, :count).by(2)
You'll probably want to filter out some statements, such as ones starting/committing transactions or the ones active record emits to determine the structures of tables.
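The filtering mentioned above could look like the following sketch. It assumes the payload hash carries :name and :sql keys, as sql.active_record events do, and that schema lookups are reported with the name "SCHEMA". countable_query? is a hypothetical helper you would call from the subscriber's sql method before incrementing the count.

```ruby
# Statements that manage transactions rather than read or write data.
TRANSACTION_SQL = /\A\s*(BEGIN|COMMIT|ROLLBACK|SAVEPOINT|RELEASE SAVEPOINT)\b/i

# Hypothetical helper: true when a query should count toward the total.
def countable_query?(payload)
  return false if payload[:name] == 'SCHEMA'                  # table-structure lookups
  return false if TRANSACTION_SQL.match?(payload[:sql].to_s)  # transaction bookkeeping
  true
end

countable_query?(name: 'User Load', sql: 'SELECT "users".* FROM "users"')
countable_query?(name: 'SCHEMA', sql: 'PRAGMA table_info("users")')
countable_query?(name: nil, sql: 'BEGIN')
```

Keeping the rule in one predicate makes it easy to adjust which statements a spec should ignore.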
You may be interested in using explain. But that won't be automatic. You will need to analyse each action manually. But maybe that is a good thing, since the important thing is not the number of db calls, but their nature. For example: Are they using indexes?
Check this:
http://weblog.rubyonrails.org/2011/12/6/what-s-new-in-edge-rails-explain/
Use the db-query-matchers gem.
expect { subject.make_one_query }.to make_database_queries(count: 1)
Fredrick's answer worked great for me, but in my case, I also wanted to know the number of calls for each ActiveRecord class individually. I made some modifications and ended up with this in case it's useful for others.
class SqlCounter < ActiveSupport::LogSubscriber
# Returns the number of database "Loads" for a given ActiveRecord class.
def self.count(clazz)
name = clazz.name + ' Load'
Thread.current['log'] ||= {}
Thread.current['log'][name] || 0
end
# Returns a list of ActiveRecord classes that were counted.
def self.counted_classes
log = Thread.current['log']
loads = log.keys.select {|key| key =~ /Load$/ }
loads.map { |key| Object.const_get(key.split.first) }
end
def self.reset_count
Thread.current['log'] = {}
end
def sql(event)
name = event.payload[:name]
Thread.current['log'] ||= {}
Thread.current['log'][name] ||= 0
Thread.current['log'][name] += 1
end
end
SqlCounter.attach_to :active_record
expect do
# do stuff
end.to change { SqlCounter.count(SomeModel) }.by(2) # SomeModel: whichever ActiveRecord class you are counting

How do I populate a table in rails from a fixture?

Quick summary:
I have a Rails app that is a personal checklist / to-do list. Basically, you can log in and manage your to-do list.
My Question:
When a user creates a new account, I want to populate their checklist with 20-30 default to-do items. I know I could say:
wash_the_car = ChecklistItem.new
wash_the_car.name = 'Wash and wax the Ford F650.'
wash_the_car.user = @new_user
wash_the_car.save!
...repeat 20 times...
However, I have 20 ChecklistItem rows to populate, so that would be 60 lines of very damp (aka not DRY) code. There's gotta be a better way.
So I want to seed the ChecklistItems table from a YAML file when the account is created. The YAML file can have all of my ChecklistItem objects to be populated. When a new user is created -- bam! -- the preset to-do items are in their list.
How do I do this?
Thanks!
(PS: For those of you wondering WHY I am doing this: I am making a client login for my web design company. I have a set of 20 steps (first meeting, design, validate, test, etc.) that I go through with each web client. These 20 steps are the 20 checklist items that I want to populate for each new client. However, while everyone starts with the same 20 items, I normally customize the steps I'll take based on the project (hence my vanilla to-do list implementation and desire to populate the rows programmatically). If you have questions, I can explain further.)
Just write a function:
def add_data(data, user)
wash_the_car = ChecklistItem.new
wash_the_car.name = data
wash_the_car.user = user
wash_the_car.save!
end
add_data('Wash and wax the Ford F650.', @user)
I agree with the other answerers suggesting you just do it in code. But it doesn't have to be as verbose as suggested. It's already a one liner if you want it to be:
@new_user.checklist_items.create! :name => 'Wash and wax the Ford F650.'
Throw that in a loop of items that you read from a file, or store in your class, or wherever:
class ChecklistItem < ActiveRecord::Base
DEFAULTS = ['do one thing', 'do another']
...
end
class User < ActiveRecord::Base
after_create :create_default_checklist_items
protected
def create_default_checklist_items
ChecklistItem::DEFAULTS.each do |x|
checklist_items.create! :name => x # self is the newly created user inside the callback
end
end
end
or if your items increase in complexity, replace the array of strings with an array of hashes...
# ChecklistItem...
DEFAULTS = [
{ :name => 'do one thing', :other_thing => 'asdf' },
{ :name => 'do another', :other_thing => 'jkl' },
]
# User.rb in after_create hook:
ChecklistItem::DEFAULTS.each do |x|
checklist_items.create! x
end
But I'm not really suggesting you throw all the defaults in a constant inside ChecklistItem. I just described it that way so that you could see the structure of the Ruby object. Instead, throw them in a YAML file that you read in once and cache:
class ChecklistItem < ActiveRecord::Base
def self.defaults
@@defaults ||= YAML.load_file ...
end
end
Or if you want administrators to be able to manage the default options on the fly, put them in the database:
class ChecklistItem < ActiveRecord::Base
named_scope :defaults, :conditions => { :is_default => true }
end
# User.rb in after_create hook:
ChecklistItem.defaults.each do |x|
checklist_items.create! :name => x.name
end
Lots of options.
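As a sketch of the YAML option, the defaults can be parsed once and turned into attribute hashes. The YAML content is inline here so the example is self-contained; in an app it would live in a file whose path is up to you. default_item_attributes is a hypothetical helper, and the item names are sample data.

```ruby
require 'yaml'

# Stand-in for the contents of a defaults file.
DEFAULTS_YAML = <<~YAML
  - name: First meeting
  - name: Design
  - name: Validate
YAML

# Hypothetical helper: parses the YAML once and caches the result.
def default_item_attributes
  @default_item_attributes ||= YAML.safe_load(DEFAULTS_YAML).map do |item|
    { name: item.fetch('name') }
  end
end

# Each hash could then be passed to checklist_items.create!
names = default_item_attributes.map { |attrs| attrs[:name] }
```

Using fetch means a malformed entry without a name fails loudly when the defaults are loaded rather than when a user signs up.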
A Rails fixture is used to populate test data for unit tests; I don't think it's meant to be used in the scenario you mentioned.
I'd say just extract a new method add_checklist_item and be done with it.
def on_user_create
add_checklist_item 'Wash and wax the Ford F650.', @user
# 19 more invocations to go
end
If you want more flexibility
def on_user_create( new_user_template_filename )
#read each line from file and call add_checklist_item
end
The file can be a simple text file where each line corresponds to a task description like "Wash and wax the Ford F650.". It should be pretty easy to write in Ruby.

Resources