How to write the query with batches in Rails?

I have a users table with 800,000 records. I added a new token column to the users table. For all new users the token is populated automatically, but to backfill the token for existing users I wrote a rake task with the following code. I feel this will not work for this many records in a production environment. How can I rewrite these queries with batches, or in some other way?
users = User.all
users.each do |user|
  user.token = SecureRandom.urlsafe_base64(nil, false)
  user.save
end

How you want to proceed depends on a few factors: is validation important to you when executing this? Is time an issue?
If you don't care about validations, you may generate raw SQL queries for each user and then execute them all at once; otherwise, you have options like ActiveRecord transactions:
User.transaction do
  users = User.all
  users.each do |user|
    user.update(token: SecureRandom.urlsafe_base64(nil, false))
  end
end
This would be quicker than your rake task, but still would take some time, depending on the number of users you want to update at once.

lower_limit = User.first.id
upper_limit = lower_limit + 30000
while true
  users = User.where('id >= ? and id < ?', lower_limit, upper_limit)
  break if users.empty?
  users.each do |user|
    user.update(token: SecureRandom.urlsafe_base64(nil, false))
  end
  lower_limit += 30000
  upper_limit += 30000
end

I think that the best option for you is to use find_each or transactions.
Doc for find_each:
Looping through a collection of records from the database (using the ActiveRecord::Scoping::Named::ClassMethods#all method, for example) is very inefficient since it will try to instantiate all the objects at once.
In that case, batch processing methods allow you to work with the records in batches, thereby greatly reducing memory consumption.
The find_each method uses find_in_batches with a batch size of 1000 (or as specified by the :batch_size option).
Doc for transaction:
Transactions are protective blocks where SQL statements are only permanent if they can all succeed as one atomic action
In case you care about memory: User.all.each will instantiate all 800k user objects at once, consuming a lot of memory, so my approach would be:
User.find_each(batch_size: 500) do |user|
  user.token = SecureRandom.urlsafe_base64(nil, false)
  user.save
end
In this case, it only instantiates 500 users at a time instead of the default batch size of 1000.
If you still want to do it in only one transaction to the database, you can use @Francesco's answer.
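If you want batching and transactional writes together, you can also combine the two ideas and wrap each batch in its own transaction, so memory stays bounded and a failure only rolls back the current batch. A minimal sketch along those lines:
# Sketch: process users in batches of 500, one transaction per batch.
User.where(token: nil).find_in_batches(batch_size: 500) do |batch|
  User.transaction do
    batch.each do |user|
      user.update!(token: SecureRandom.urlsafe_base64(nil, false))
    end
  end
end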

A common mistake is instantiating model instances without need, and AR instantiation is not cheap.
You can try this naive code:
BATCH_SIZE = 1000
while true
  uids = User.where(token: nil).limit(BATCH_SIZE).pluck(:id)
  break if uids.empty?
  ApplicationRecord.transaction do
    uids.each do |uid|
      # def urlsafe_base64(n=nil, padding=false)
      User.where(id: uid).update_all(token: SecureRandom.urlsafe_base64)
    end
  end
end
The next option is to use your database's native analog of SecureRandom.urlsafe_base64 and run a single query like:
UPDATE users SET token=db_specific_urlsafe_base64 WHERE token IS NULL
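For example, on PostgreSQL the pgcrypto extension's gen_random_bytes can stand in for SecureRandom.urlsafe_base64. A sketch, assuming pgcrypto is installed (adjust the byte length to taste):
# Sketch: assumes PostgreSQL with the pgcrypto extension enabled.
# encode(..., 'base64') gives standard base64; translate/rtrim make it URL-safe.
ActiveRecord::Base.connection.execute(<<~SQL)
  UPDATE users
  SET token = rtrim(translate(encode(gen_random_bytes(16), 'base64'), '+/', '-_'), '=')
  WHERE token IS NULL
SQL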
If there is no suitable analog, you can prepopulate a temp table (for example with PostgreSQL's COPY command) from a precalculated CSV file (id, token=SecureRandom.urlsafe_base64)
and run one query like:
UPDATE users SET token=temp_table.token
FROM temp_table
WHERE (users.token IS NULL) AND (users.id=temp_table.id)
But in fact you may not need to fill the token for existing users at all, because:
i am using "token" for token based authentication in rails – John
You just have to check whether the user's token is NULL (or expired) and redirect them to the login form. That's the common way and it will save you time.
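A minimal sketch of that check in a controller; note that current_user, authenticate_token! and new_session_path are assumptions about your app, not names from the question:
class ApplicationController < ActionController::Base
  before_action :authenticate_token!

  private

  # Hypothetical helper: send users without a token back to the login form.
  def authenticate_token!
    redirect_to new_session_path if current_user&.token.blank?
  end
end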

Related

Performance of data validation

I have an endpoint that accepts incoming data, checks it for errors and imports it into the database. Incoming data can be up to 300,000 rows. The stack is Ruby on Rails, Postgres, Redis, Sidekiq, dry-validation. Current flow:
load data into Redis;
prepare/transform;
validate and mark every row as valid/invalid;
fetch valid rows and bulk import them.
I need advice on how to improve the performance of the validation step here, because sometimes it takes more than a day to validate a large file.
Some details
It basically loops through every row in the background and applies validation rules like
rows.each do |row|
  result = validate(row)
  set_status(row, result) # mark as valid/invalid
end
Some validation rules are uniqueness checks - and they're heavy because they check uniqueness across the whole database. Example:
rule(:sku, :name) do
  if Product.where(sku: values[:sku]).where.not(name: values[:name]).exists?
    # add error
  end
end
Needless to say, DB & logs are going mad during validation.
Another approach I tried was to pluck necessary fields from all database records, then loop through and compare every row with this array rather than make DB requests. But comparing with a huge array appeared to be even slower.
def existing_data
  @existing_data ||= Product.pluck(:sku, :name, ...)
end

rule(:sku, :name) do
  conflict = existing_data.find do |data|
    data[0] == values[:sku] && data[1] != values[:name]
  end
  if conflict.present?
    # add error
  end
end
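Worth noting about this second approach: Array#find rescans the whole array for every incoming row, so the comparison is quadratic overall. Indexing the plucked data into a Hash keyed by sku makes each lookup constant-time, which alone can make the in-memory comparison viable. A sketch using the question's names (existing_names_by_sku is a hypothetical helper):
# Sketch: the same data as existing_data above, but keyed by sku for O(1) lookups.
def existing_names_by_sku
  @existing_names_by_sku ||= Product.pluck(:sku, :name).to_h
end

rule(:sku, :name) do
  existing_name = existing_names_by_sku[values[:sku]]
  if existing_name && existing_name != values[:name]
    # add error
  end
end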
I think you could get a performance improvement with something along the lines of your second approach, but you should fetch as few of the existing products as possible, preferably only the products relevant to your validations. Looking only at the code provided, you could cut down the number of products you load by aggregating the SKUs from the newly received rows and using them to filter the products table:
skus = skus_from_rows(rows)
@existing_products = existing_products(skus)

rows.each do |row|
  result = validate(row)
  set_status(row, result) # mark as valid/invalid
end

def skus_from_rows(rows)
  rows.map { |row| row[:sku] }.uniq
end

def existing_products(skus)
  Product.where(sku: skus).pluck(:sku, :name, ...)
end

rule(:sku, :name) do
  conflict = @existing_products.find do |data|
    data[0] == values[:sku] && data[1] != values[:name]
  end
  if conflict.present?
    # add error
  end
end
Additionally, I would add an index (if not already present) on the sku column to improve the performance of the query that filters by sku.
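For instance, a migration along these lines (the table name and Rails migration version are assumptions):
# Sketch: assumes the table is called "products"; adjust the migration version to your Rails version.
class AddIndexToProductsSku < ActiveRecord::Migration[6.1]
  def change
    add_index :products, :sku
  end
end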

find_or_create_by: if found, does it update?

https://apidock.com/rails/v4.0.2/ActiveRecord/Relation/find_or_create_by
After reading the docs, it does say: "find the first user named "Penélope" or create a new one." and "We already have one so the existing record will be returned."
But I do want to be 100% clear about this.
If I do:
User.find_or_create_by(first_name: 'Scarlett') do |user|
  user.last_name = 'Johansson'
end
and a User already exists with both first_name: 'Scarlett' and last_name: 'Johansson', will it update it or completely ignore it?
In my case, I would like to completely ignore it if it exists at all, and I am wondering if find_or_create_by is the way to go, because I don't want to bother updating records with the same information. I am not trying to return anything either.
Should I be using find_or_create_by, or just exists?
Also, if find_or_create_by does act as a way to check whether the record exists and ignores it if it does, could I use it that way?
For example:
User.find_or_create_by(first_name: 'Scarlett') do |user|
  user.last_name = 'Johansson'
  puts "Hello" # if it doesn't exist
end
Would "Hello" puts if it doesn't exist and not puts if it does?
In the example, if you have one or more User records with the first name 'Scarlett', then find_or_create_by will return the first of those records (using a LIMIT 1 query). The block is not executed in that case, so the found record is returned unchanged and its last_name is not touched.
If you do not have a record with the first name 'Scarlett', then create is called: the block runs, last_name is set to 'Johansson', and the new record is saved with both attributes.
In this code:
User.find_or_create_by(first_name: 'Scarlett') do |user|
  user.last_name = 'Johansson'
  puts "Hello" # if it doesn't exist
end
...you will see "Hello" only when no matching record exists: the block is passed down to create, so it runs when a new record is being created and is skipped entirely when an existing record is found.
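If you need to branch on whether the call created a record after it returns, one option (assuming Rails 6.1+, which added previously_new_record?) is:
user = User.find_or_create_by(first_name: 'Scarlett') do |u|
  u.last_name = 'Johansson' # this block only runs when a new record is being created
end

# previously_new_record? (Rails 6.1+) is true only for an instance the call above just created.
puts "Hello" if user.previously_new_record?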

How to speed up a very frequently made query using raw SQL and without ORM?

I have an API endpoint that accounts for a little less than half of our average response time (on average taking about 514 ms, yikes). The endpoint simply returns some statistics about stored data, scoped to particular time periods such as this week, last week, this month, and so on.
There are a number of ways we could reduce its impact, like getting the clients to hit it less often and with more specific queries, such as only asking for "this week" when only that data is used. Here we focus on what can be done at the database level first. In our current implementation we generate this data for all "time scopes" on the fly, and the number of queries is enormous and made multiple times per second. No caching is used, but maybe there is a way to use Rails's cache_key, or the low-level Rails.cache?
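On the caching idea specifically, a low-level Rails.cache.fetch wrapper around the summary generation might look like the sketch below. The helper name, key components and 5-minute expiry are assumptions, and it only helps if slightly stale numbers are acceptable:
# Sketch: cache the generated summaries per user for a short window.
def cached_summaries_for(user)
  Rails.cache.fetch(["foo_summaries", user.id, user.foos_count], expires_in: 5.minutes) do
    FooSummaries.generate_for(user)
  end
end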
The current implementation looks something like this:
class FooSummaries
  include SummaryStructs

  def self.generate_for(user)
    @user = user
    summaries = Struct::Summaries.new
    TimeScope::TIME_SCOPES.each do |scope|
      foos = user.foos.by_scope(scope.to_sym)
      summary = Struct::Summary.new
      # e.g: summaries.last_week = build_summary(foos)
      summaries.send("#{scope}=", build_summary(summary, foos))
    end
    summaries
  end

  private_class_method

  def self.build_summary(summary, foos)
    summary.all_quuz = @user.foos_count
    summary.all_quux = all_quux(foos)
    summary.quuw = quuw(foos).to_f
    %w[foo bar baz qux].product(
      %w[quux quuz corge]
    ).each do |a, b|
      # e.g: summary.foo_quux = quux(foos, "foo")
      summary.send("#{a.downcase}_#{b}=", send(b, foos, a) || 0)
    end
    summary
  end

  def self.all_quuz(foos)
    foos.count
  end

  def self.all_quux(foos)
    foos.sum(:quux)
  end

  def self.quuw(foos)
    foos.quuwable.total_quuw
  end

  def self.corge(foos, foo_type)
    return if foos.count.zero?
    count = self.quuz(foos, foo_type) || 0
    count.to_f / foos.count
  end

  def self.quux(foos, foo_type)
    case foo_type
    when "foo"
      foos.where(foo: true).sum(:quux)
    when "bar"
      foos.bar.where(foo: false).sum(:quux)
    when "baz"
      foos.baz.where(foo: false).sum(:quux)
    when "qux"
      foos.qux.sum(:quux)
    end
  end

  def self.quuz(foos, foo_type)
    case foo_type
    when "foo"
      foos.where(foo: true).count
    when "bar"
      foos.bar.where(foo: false).count
    when "baz"
      foos.baz.where(foo: false).count
    when "qux"
      foos.qux.count
    end
  end
end
To avoid making changes to the model or creating a migration for a table to store this data (both of which may be valid and better solutions), I decided it might be easier to construct one large SQL query and execute it all at once, in the hope that building the query string and executing it would beat the overhead of Active Record setting up and tearing down many separate SQL queries.
The new approach looks something like this; it is horrifying to me and I know there must be a more elegant way:
class FooSummaries
  include SummaryStructs

  def self.generate_for(user)
    results = ActiveRecord::Base.connection.execute(build_query_for(user))
    results.each do |result|
      # build up summary struct from query results
    end
  end

  def self.build_query_for(user)
    TimeScope::TIME_SCOPES.map do |scope|
      time_scope = TimeScope.new(scope)
      %w[foo bar baz qux].map do |foo_type|
        %[
          select
            '#{scope}_#{foo_type}',
            sum(quux) as quux,
            count(*) as quuz,
            round(100.0 * (count(*) / #{user.foos_count.to_f}), 3) as corge
          from
            "foos"
          where
            "foos"."user_id" = #{user.id}
            and "foos"."foo_type" = '#{foo_type.humanize}'
            and "foos"."end_time" between '#{time_scope.from}' AND '#{time_scope.to}'
            and "foos"."foo" = '#{foo_type == 'foo' ? 't' : 'f'}'
          union
        ]
      end
    end.join.reverse.sub("union".reverse, "").reverse
  end
end
The funny way of replacing the last occurrence of union also horrifies me, but it seems to work. There must be a better way, and there are probably many things wrong with the above implementation(s). It may be helpful to note that I use PostgreSQL and have no problem writing queries that are not portable to other DBs. Any advice is truly appreciated!
Thanks for reading!
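As a side note on the union-stripping hack: if each fragment is built without the trailing keyword, a single join inserts the separators and nothing needs to be reversed or substituted. A self-contained sketch with stand-in scope names (the real code would iterate TimeScope::TIME_SCOPES and use the full select from build_query_for above):
# Sketch: stand-in scopes/columns, just to show the join("union") shape.
scopes = %w[this_week last_week this_month]
types  = %w[foo bar baz qux]

selects = scopes.product(types).map do |scope, type|
  "select '#{scope}_#{type}' as label, sum(quux) as quux from foos where foo_type = '#{type}'"
end

sql = selects.join("\nunion\n") # separators only between fragments, nothing to strip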
Update: I found a solution that works for me and sped up the endpoint that uses this service object by 500%! Essentially the idea is: instead of building a query string and executing it for each set of parameters, we create a prepared statement with prepare and then call exec_prepared, passing the parameters to the query. Since this query is made many times over, this is a useful optimization because, as per the documentation:
A prepared statement is a server-side object that can be used to optimize performance. When the PREPARE statement is executed, the specified statement is parsed, analyzed, and rewritten. When an EXECUTE command is subsequently issued, the prepared statement is planned and executed. This division of labor avoids repetitive parse analysis work, while allowing the execution plan to depend on the specific parameter values supplied.
We prepare the query like so:
def prepare_query!
  ActiveRecord::Base.transaction do
    connection.prepare("foos_summary",
      %[with scoped_foos as (
          select
            *
          from
            "foos"
          where
            "foos"."user_id" = $3
            and ("foos"."end_time" between $4 and $5)
        )
        select
          $1::text as scope,
          $2::text as foo_type,
          sum(quux)::float as quux,
          sum(eggs + bacon + ham)::float as food,
          count(*) as count,
          round((sum(quux) / nullif(
            (select
              sum(quux)
            from
              scoped_foos), 0))::numeric,
            5)::float as quuz
        from
          scoped_foos
        where
          (case $6
            when 'Baz'
              then (baz = 't')
            else
              (baz = 'f' and foo_type = $6)
          end)
      ])
  end
end
You can see in this query we use a common table expression for more readability and to avoid writing the same select query twice over.
Then we execute the query, passing in the parameters we need:
def connection
  @connection ||= ActiveRecord::Base.connection.raw_connection
end

def query_results
  prepare_query! unless query_already_prepared?
  @results ||= TimeScope::TIME_SCOPES.map do |scope|
    time_scope = TimeScope.new(scope)
    %w[bacon eggs ham spam].map do |foo_type|
      connection.exec_prepared("foos_summary",
                               [scope,
                                foo_type,
                                @user.id,
                                time_scope.from,
                                time_scope.to,
                                foo_type.humanize])
    end
  end
end
Where query_already_prepared? is a simple check against the pg_prepared_statements view maintained by Postgres:
def query_already_prepared?
  connection.exec(%(select name
                    from pg_prepared_statements
                    where name = 'foos_summary')).count.positive?
end
A nice solution, I thought! Hopefully the technique illustrated here will help others with similar problems.

Updating Lots of Records at Once in Rails

I've got a background job that I run about 5,000 times every 10 minutes. Each job makes a request to an external API and then either adds new records or updates existing records in my database. Each API request returns around 100 items, so every 10 minutes I am making 50,000 CREATE or UPDATE SQL queries.
The way I handle this now is: each API item returned has a unique ID. I search my database for a post with this ID; if it exists, I update that model, and if it doesn't, I create a new one.
Imagine the API response looks like this:
[
  {
    external_id: '123',
    text: 'blah blah',
    count: 450
  },
  {
    external_id: 'abc',
    text: 'something else',
    count: 393
  }
]
which is set to the variable collection
Then I run this code in my parent model:
class ParentModel < ApplicationRecord
  def update
    collection.each do |attrs|
      child = ChildModel.find_or_initialize_by(external_id: attrs[:external_id], parent_model_id: self.id)
      child.assign_attributes attrs
      child.save if child.changed?
    end
  end
end
Each of these individual calls is extremely quick, but when I am doing 50,000 in a short period of time it really adds up and can slow things down.
I'm wondering if there's a more efficient way I can handle this. I was thinking of doing something like this instead:
class ParentModel < ApplicationRecord
  def update
    eager_loaded_children = ChildModel.where(parent_model_id: self.id).limit(100)
    collection.each do |attrs|
      cached_child = eager_loaded_children.select { |child| child.external_id == attrs[:external_id] }.first
      if cached_child
        cached_child.update_attributes attrs
      else
        ChildModel.create attrs
      end
    end
  end
end
Essentially I would be saving the lookups by doing one bigger query up front (which is also quite fast), trading memory for queries. But this doesn't seem like it would save much time: it might slightly speed up the lookup part, but I'd still have to do 100 updates and creates.
Is there some kind of way I can do batch updates that I'm not thinking of? Anything else obvious that could make this go faster, or reduce the amount of queries I am doing?
You can do something like this:
def update
  collection2 = collection.map { |c| [c[:external_id], c.except(:external_id)] }.to_h

  ChildModel.where(external_id: collection2.keys).each do |cm|
    ext_id = cm.external_id
    cm.assign_attributes collection2[ext_id]
    cm.save if cm.changed?
    collection2.delete(ext_id)
  end

  if collection2.present?
    new_ids = collection2.keys
    new = collection.select { |c| new_ids.include? c[:external_id] }
    ChildModel.create(new)
  end
end
This is better because it:
fetches all the required records at once
creates all the new records at once
You can use update_columns if you don't need callbacks/validations.
The only drawback is more Ruby-side data manipulation, which I think is a good tradeoff for fewer DB queries.
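If you are on Rails 6 or newer and can add a unique index on [parent_model_id, external_id], another option worth a look is handing the whole collection to upsert_all, which turns the 100 finds and saves into a single INSERT ... ON CONFLICT statement. This is a sketch under those assumptions, not a drop-in for the code above; note it skips validations and callbacks, and unique_by needs an adapter that supports it (e.g. PostgreSQL):
class ParentModel < ApplicationRecord
  # Sketch: assumes Rails 6+, a unique index on [parent_model_id, external_id],
  # and a database that supports unique_by. Timestamps are filled in manually
  # because upsert_all does not run callbacks.
  def upsert_children(collection)
    now = Time.current
    rows = collection.map do |attrs|
      attrs.merge(parent_model_id: id, created_at: now, updated_at: now)
    end

    # One statement: new children are inserted, existing ones (matched by the
    # unique index) are updated in place.
    ChildModel.upsert_all(rows, unique_by: %i[parent_model_id external_id])
  end
end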

Faster way to find the last updated timestamp

I expect this call
Model.maximum(:updated_at)
to be faster than this one
Model.order(:updated_at).last.updated_at
Is anyone able to confirm this assertion and, if true, explain why?
You can use the Benchmark module to investigate this easily, e.g.:
require 'benchmark'

# Define the two variants (and the cache clearing) before the benchmark runs.
def v1
  clear_cache
  Model.maximum(:updated_at)
end

def v2
  clear_cache
  Model.order(:updated_at).pluck(:updated_at).last
end

def clear_cache
  ActiveRecord::Base.connection.query_cache.clear
end

n = 50000
Benchmark.bm do |x|
  x.report('maximum')     { n.times { v1 } }
  x.report('order-pluck') { n.times { v2 } }
end
To make it worth doing this with n > 1 you'll have to clear the various caches that might be involved. There may well be a cache in your DB server, separate from the ActiveRecord cache. For instance, to clear the MySQL query cache you could call:
`mysql mydb -e 'RESET QUERY CACHE'`
Your expectation is correct.
When you call Model.maximum(:updated_at), you ask your DB to return just a single value.
When you call Model.order(:updated_at).pluck(:updated_at).last, your database returns all the values of the updated_at column in the table, which consumes more memory (you have to build a big array) and takes more time.
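The difference shows up directly in the SQL each call generates (model/table names assumed):
Model.maximum(:updated_at)
# SELECT MAX("models"."updated_at") FROM "models"
# -- a single value comes back; with an index on updated_at this is very cheap

Model.order(:updated_at).pluck(:updated_at)
# SELECT "models"."updated_at" FROM "models" ORDER BY "models"."updated_at" ASC
# -- every row's updated_at is sent back and built into a Ruby array; .last then keeps one value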
