Performance of data validation - ruby-on-rails

I have an endpoint that accepts incoming data, checks it for errors, and imports it into the database. Incoming data can be up to 300 000 rows. The stack is Ruby on Rails, Postgres, Redis, Sidekiq, and dry-validation. Current flow:
load data into Redis;
prepare/transform;
validate and mark every row as valid/invalid;
fetch valid rows and bulk import them.
I need advice on how to improve the performance of the validation step here, because sometimes it takes more than a day to validate a large file.
Some details
It basically loops through every row in the background and applies validation rules like
rows.each do |row|
  result = validate(row)
  set_status(row, result) # mark as valid/invalid
end
Some validation rules are uniqueness checks - and they're heavy because they check uniqueness across the whole database. Example:
rule(:sku, :name) do
  if Product.where(sku: values[:sku]).where.not(name: values[:name]).exists?
    # add error
  end
end
Needless to say, DB & logs are going mad during validation.
Another approach I tried was to pluck necessary fields from all database records, then loop through and compare every row with this array rather than make DB requests. But comparing with a huge array appeared to be even slower.
def existing_data
  @existing_data ||= Product.pluck(:sku, :name, ...)
end

rule(:sku, :name) do
  conflict = existing_data.find do |data|
    data[0] == values[:sku] && data[1] != values[:name]
  end

  if conflict.present?
    # add error
  end
end

I think you could get a performance improvement by doing something along the lines of your second approach, only you should try to fetch as few of the existing products as possible, preferably only the products that are relevant to your validations. Looking only at the code provided, it seems to me that you could cut down on the number of products you load by aggregating the SKUs from the newly received rows and using them to filter the products table:
skus = skus_from_rows(rows)
@existing_products = existing_products(skus)

rows.each do |row|
  result = validate(row)
  set_status(row, result) # mark as valid/invalid
end

def skus_from_rows(rows)
  rows.map { |row| row[:sku] }.uniq
end

def existing_products(skus)
  Product.where(sku: skus).pluck(:sku, :name, ...)
end

rule(:sku, :name) do
  conflict = @existing_products.find do |data|
    data[0] == values[:sku] && data[1] != values[:name]
  end

  if conflict.present?
    # add error
  end
end
Additionally, I would add an index (if not already present) on the sku column to improve the performance of the query that filters by SKU.
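For reference, a minimal migration for that index could look like the sketch below (assuming the table is called products):
class AddIndexToProductsSku < ActiveRecord::Migration[6.0]
  def change
    # Supports lookups such as Product.where(sku: skus) used above
    add_index :products, :sku
  end
end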

Related

How to speed up a very frequently made query using raw SQL and without ORM?

I have an API endpoint that accounts for a little less than half of the average response time (on average taking about 514 ms, yikes). The endpoint simply returns some statistics about stored data scoped to particular time periods, such as this week, last week, this month, and so on...
There are a number of ways we could reduce its impact, like getting the clients to hit it less, and with more particular queries, such as only querying for "this week" when only that data is used. Here we focus on what can be done at the database level first. In our current implementation we generate this data for all "time scopes" on the fly; the number of queries is enormous, and they are made multiple times per second. No caching is used, but maybe there is a way to use Rails's cache_key, or the low-level Rails.cache?
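For context, the kind of low-level caching I have in mind is something like the sketch below; the key, expiry, and build_summary_for helper are purely illustrative and do not exist in the code yet.
def cached_summary_for(scope)
  # Illustrative only: the cache key, TTL and build_summary_for are assumptions
  Rails.cache.fetch(["foo_summary", @user.id, scope], expires_in: 10.minutes) do
    build_summary_for(scope)
  end
end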
The current implementation looks something like this:
class FooSummaries
  include SummaryStructs

  def self.generate_for(user)
    @user = user
    summaries = Struct::Summaries.new

    TimeScope::TIME_SCOPES.each do |scope|
      foos = user.foos.by_scope(scope.to_sym)
      summary = Struct::Summary.new
      # e.g: summaries.last_week = build_summary(foos)
      summaries.send("#{scope}=", build_summary(summary, foos))
    end
    summaries
  end

  private_class_method

  def self.build_summary(summary, foos)
    summary.all_quuz = @user.foos_count
    summary.all_quux = all_quux(foos)
    summary.quuw = quuw(foos).to_f

    %w[foo bar baz qux].product(
      %w[quux quuz corge]
    ).each do |a, b|
      # e.g: summary.foo_quux = quux(foos, "foo")
      summary.send("#{a.downcase}_#{b}=", send(b, foos, a) || 0)
    end
    summary
  end

  def self.all_quuz(foos)
    foos.count
  end

  def self.all_quux(foos)
    foos.sum(:quux)
  end

  def self.quuw(foos)
    foos.quuwable.total_quuw
  end

  def self.corge(foos, foo_type)
    return if foos.count.zero?

    count = self.quuz(foos, foo_type) || 0
    count.to_f / foos.count
  end

  def self.quux(foos, foo_type)
    case foo_type
    when "foo"
      foos.where(foo: true).sum(:quux)
    when "bar"
      foos.bar.where(foo: false).sum(:quux)
    when "baz"
      foos.baz.where(foo: false).sum(:quux)
    when "qux"
      foos.qux.sum(:quux)
    end
  end

  def self.quuz(foos, foo_type)
    case foo_type
    when "foo"
      foos.where(foo: true).count
    when "bar"
      foos.bar.where(foo: false).count
    when "baz"
      foos.baz.where(foo: false).count
    when "qux"
      foos.qux.count
    end
  end
end
To avoid making changes to the model, or creating a migration to add a table to store this data (both of which may be valid and better solutions), I decided it might be easier to construct one large SQL query that is executed all at once, in the hope that building the query string and executing it would be faster without the overhead of Active Record setting up and tearing down individual SQL queries.
The new approach looks something like this; it is horrifying to me, and I know there must be a more elegant way:
class FooSummaries
  include SummaryStructs

  def self.generate_for(user)
    results = ActiveRecord::Base.connection.execute(build_query_for(user))
    results.each do |result|
      # build up summary struct from query results
    end
  end

  def self.build_query_for(user)
    TimeScope::TIME_SCOPES.map do |scope|
      time_scope = TimeScope.new(scope)
      %w[foo bar baz qux].map do |foo_type|
        %[
          select
            '#{scope}_#{foo_type}',
            sum(quux) as quux,
            count(*) as quuz,
            round(100.0 * (count(*) / #{user.foos_count.to_f}), 3) as corge
          from
            "foos"
          where
            "foos"."user_id" = #{user.id}
            and "foos"."foo_type" = '#{foo_type.humanize}'
            and "foos"."end_time" between '#{time_scope.from}' AND '#{time_scope.to}'
            and "foos"."foo" = '#{foo_type == 'foo' ? 't' : 'f'}'
          union
        ]
      end
    end.join.reverse.sub("union".reverse, "").reverse
  end
end
The funny way of replacing the last occurrence of union also horrifies me, but it seems to work. There must be a better way, as there are probably many things wrong with the above implementation(s). It may be helpful to note that I use PostgreSQL and have no problem with writing queries that are not portable to other databases. Any advice is truly appreciated!
Thanks for reading!
Update: I found a solution that works for me and sped up the endpoint that uses this service object by 500%! Essentially, the idea is that instead of building a query string and then executing it for each set of parameters, we create a prepared statement using prepare, followed by exec_prepared, passing the parameters into the query. Since this query is made many times over, this is a useful optimization because, as per the documentation:
A prepared statement is a server-side object that can be used to optimize performance. When the PREPARE statement is executed, the specified statement is parsed, analyzed, and rewritten. When an EXECUTE command is subsequently issued, the prepared statement is planned and executed. This division of labor avoids repetitive parse analysis work, while allowing the execution plan to depend on the specific parameter values supplied.
We prepare the query like so:
def prepare_query!
  ActiveRecord::Base.transaction do
    connection.prepare("foos_summary",
      %[with scoped_foos as (
          select
            *
          from
            "foos"
          where
            "foos"."user_id" = $3
            and ("foos"."end_time" between $4 and $5)
        )
        select
          $1::text as scope,
          $2::text as foo_type,
          sum(quux)::float as quux,
          sum(eggs + bacon + ham)::float as food,
          count(*) as count,
          round((sum(quux) / nullif(
            (select
               sum(quux)
             from
               scoped_foos), 0))::numeric,
            5)::float as quuz
        from
          scoped_foos
        where
          (case $6
             when 'Baz'
               then (baz = 't')
             else
               (baz = 'f' and foo_type = $6)
           end)
      ])
  end
end
You can see in this query we use a common table expression for more readability and to avoid writing the same select query twice over.
Then we execute the query, passing in the parameters we need:
def connection
  @connection ||= ActiveRecord::Base.connection.raw_connection
end

def query_results
  prepare_query! unless query_already_prepared?

  @results ||= TimeScope::TIME_SCOPES.map do |scope|
    time_scope = TimeScope.new(scope)
    %w[bacon eggs ham spam].map do |foo_type|
      connection.exec_prepared("foos_summary",
                               [scope,
                                foo_type,
                                @user.id,
                                time_scope.from,
                                time_scope.to,
                                foo_type.humanize])
    end
  end
end
Where query_already_prepared? is a simple check against the prepared statements table maintained by Postgres:
def query_already_prepared?
  connection.exec(%(select
                      name
                    from
                      pg_prepared_statements
                    where name = 'foos_summary')).count.positive?
end
A nice solution, I thought! Hopefully the technique illustrated here will help others with similar problems.

How to write the query with batches in rails?

I have a users table with 800 000 records. I created a new field called token in the users table. For all new users the token gets populated. To populate the token for existing users, I wrote a rake task with the following code. I feel this will not work for this many records in a production environment. How do I rewrite these queries with batches, or some other way of writing the queries?
users = User.all
users.each do |user|
  user.token = SecureRandom.urlsafe_base64(nil, false)
  user.save
end
How you want to proceed depends on different factors: is validation important for you when executing this? Is time an issue?
If you don't care about validations, you may generate raw SQL queries for each user and then execute them all at once; otherwise you have options like ActiveRecord transactions:
User.transaction do
  users = User.all
  users.each do |user|
    user.update(token: SecureRandom.urlsafe_base64(nil, false))
  end
end
This would be quicker than your rake task, but would still take some time, depending on the number of users you want to update at once. Another option is to update the users in fixed-size id ranges:
lower_limit = User.first.id
upper_limit = 30000

while true
  users = User.where('id >= ? and id < ?', lower_limit, upper_limit)
  break if users.empty?

  users.each do |user|
    user.update(token: SecureRandom.urlsafe_base64(nil, false))
  end

  lower_limit += 30000
  upper_limit += 30000
end
I think that the best option for you is to use find_each or transactions.
Doc for find_each:
Looping through a collection of records from the database (using the ActiveRecord::Scoping::Named::ClassMethods#all method, for example) is very inefficient since it will try to instantiate all the objects at once.
In that case, batch processing methods allow you to work with the records in batches, thereby greatly reducing memory consumption.
The find_each method uses find_in_batches with a batch size of 1000 (or as specified by the :batch_size option).
Doc for transaction:
Transactions are protective blocks where SQL statements are only permanent if they can all succeed as one atomic action
In case you care about memory: User.all.each will bring all 800k users into memory and instantiate 800k objects, consuming a lot of memory, so my approach would be:
User.find_each(batch_size: 500) do |user|
  user.token = SecureRandom.urlsafe_base64(nil, false)
  user.save
end
In this case, it only instantiates 500 users at a time instead of the default batch_size of 1000.
If you still want to do it in only one transaction to the database, you can use the answer from @Francesco.
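For completeness, a minimal sketch combining the two suggestions (batched iteration wrapped in a single transaction) could look like this:
User.transaction do
  # One transaction around all batches; find_each still loads 500 users at a time
  User.find_each(batch_size: 500) do |user|
    user.update(token: SecureRandom.urlsafe_base64(nil, false))
  end
end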
A common mistake is instantiating model instances without need, and ActiveRecord instantiation is not cheap.
You can try this naive code:
BATCH_SIZE = 1000

while true
  uids = User.where(token: nil).limit(BATCH_SIZE).pluck(:id)
  break if uids.empty?

  ApplicationRecord.transaction do
    uids.each do |uid|
      # def urlsafe_base64(n=nil, padding=false)
      User
        .where(id: uid)
        .update_all(token: SecureRandom.urlsafe_base64)
    end
  end
end
The next option is to use the DB's native analog of SecureRandom.urlsafe_base64 and run one query like:
UPDATE users SET token=db_specific_urlsafe_base64 WHERE token IS NULL
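On PostgreSQL, for example, a rough sketch of that idea could look like this; it assumes the pgcrypto extension is enabled, and the expression is an illustration rather than an exact drop-in for urlsafe_base64:
# Illustration only: requires the pgcrypto extension for gen_random_bytes
ActiveRecord::Base.connection.execute(<<~SQL)
  UPDATE users
  SET token = translate(rtrim(encode(gen_random_bytes(16), 'base64'), '='), '+/', '-_')
  WHERE token IS NULL
SQL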
If you can't find an analog, you can prepopulate a temp table (for example with PostgreSQL's COPY command) from a precalculated CSV file (id, token=SecureRandom.urlsafe_base64)
and run one query like:
UPDATE users SET token=temp_table.token
FROM temp_table
WHERE (users.token IS NULL) AND (users.id=temp_table.id)
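A rough sketch of that approach using the pg gem's COPY support is shown below; it streams generated rows over STDIN instead of reading a CSV file from disk, and the temp table name and columns are assumptions:
raw = ActiveRecord::Base.connection.raw_connection

# The temp table only lives for the duration of this session
raw.exec("CREATE TEMP TABLE temp_table (id bigint, token text)")

raw.copy_data("COPY temp_table (id, token) FROM STDIN WITH (FORMAT csv)") do
  User.where(token: nil).pluck(:id).each do |id|
    raw.put_copy_data("#{id},#{SecureRandom.urlsafe_base64}\n")
  end
end

raw.exec(<<~SQL)
  UPDATE users SET token = temp_table.token
  FROM temp_table
  WHERE users.token IS NULL AND users.id = temp_table.id
SQL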
But in fact you don't need to fill in the token for existing users at all, because of this:
i am using "token" for token based authentication in rails – John
You just have to check whether the user's token is NULL (or expired) and redirect them to the login form. That's the common way, and it will save you time.
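A minimal sketch of such a check in a controller; current_user and login_path are hypothetical names, not from the question:
class ApplicationController < ActionController::Base
  before_action :require_token

  private

  # Hypothetical example: send users without a token back to the login form
  def require_token
    redirect_to login_path if current_user && current_user.token.blank?
  end
end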

Updating Lots of Records at Once in Rails

I've got a background job that I run about 5,000 times every 10 minutes. Each job makes a request to an external API and then either adds new records or updates existing records in my database. Each API request returns around 100 items, so every 10 minutes I am making 50,000 CREATE or UPDATE SQL queries.
The way I handle this now is, each API item returned has a unique ID. I search my database for a post that has this id, and if it exists, it updates the model. If it doesn't exist, it creates a new one.
Imagine the api response looks like this:
[
  {
    external_id: '123',
    text: 'blah blah',
    count: 450
  },
  {
    external_id: 'abc',
    text: 'something else',
    count: 393
  }
]
which is set to the variable collection
Then I run this code in my parent model:
class ParentModel < ApplicationRecord
  def update
    collection.each do |attrs|
      child = ChildModel.find_or_initialize_by(external_id: attrs[:external_id], parent_model_id: self.id)
      child.assign_attributes attrs
      child.save if child.changed?
    end
  end
end
Each of these individual calls is extremely quick, but when I am doing 50,000 in a short period of time it really adds up and can slow things down.
I'm wondering if there's a more efficient way I can handle this. I was thinking of doing something like this instead:
class ParentModel < ApplicationRecord
  def update
    eager_loaded_children = ChildModel.where(parent_model_id: self.id).limit(100)

    collection.each do |attrs|
      cached_child = eager_loaded_children.select { |child| child.external_id == attrs[:external_id] }.first
      if cached_child
        cached_child.update_attributes attrs
      else
        ChildModel.create attrs
      end
    end
  end
end
Essentially I would be saving the lookups and instead doing a bigger query up front (this is also quite fast), making a tradeoff in memory. But this doesn't seem like it would save that much time; it might slightly speed up the lookup part, but I'd still have to do 100 updates and creates.
Is there some kind of way I can do batch updates that I'm not thinking of? Anything else obvious that could make this go faster, or reduce the amount of queries I am doing?
You can do something like this:
def update
  collection2 = collection.map { |c| [c[:external_id], c.except(:external_id)] }.to_h

  ChildModel.where(external_id: collection2.keys).each do |cm|
    ext_id = cm.external_id
    cm.assign_attributes collection2[ext_id]
    cm.save if cm.changed?
    collection2.delete(ext_id)
  end

  if collection2.present?
    new_ids = collection2.keys
    new = collection.select { |c| new_ids.include? c[:external_id] }
    ChildModel.create(new)
  end
end
This is better because it:
fetches all required records at once;
creates all new records at once.
You can use update_columns if you don't need callbacks/validations (see the sketch below).
The only drawback is more Ruby-side data manipulation, which I think is a good tradeoff for fewer DB queries.
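For reference, the update_columns variant of the loop body might look like this; it writes straight to the database, skipping validations, callbacks and the updated_at timestamp:
ChildModel.where(external_id: collection2.keys).each do |cm|
  # update_columns bypasses validations/callbacks and does not touch updated_at
  cm.update_columns collection2[cm.external_id]
  collection2.delete(cm.external_id)
end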

Handle connection breakages in rails

I have a module written in Ruby which connects to a Postgres table and then applies some logic and code.
Below is a sample code:
module SampleModuleHelper
  def self.traverse_database
    ProductTable.where(:column => value).find_each do |product|
      # some logic here that takes a long time
    end
  end
end
ProductTable has more than 3 million records. I have used the where clause to reduce the number of records retrieved.
However, I need to make the code resilient to connection breakages. There are times when the connection breaks and I have to start traversing the table from the very beginning. I don't want this; rather, it should resume where it left off, since processing each record takes too much time.
What is the best way to make the code start where it left off?
One way is to make a table in the database that records the primary key (id) where it stopped, and start from there again. But I don't want to make tables in the database, as there are many such processes.
You could keep a counter of processed records and use the offset method to continue processing.
Something along the lines of:
MAX_RETRIES = 3

def self.traverse(query)
  counter = 0
  retries = 0

  begin
    query.offset(counter).find_each do |record|
      yield record
      counter += 1
    end
  rescue ActiveRecord::ConnectionNotEstablished => e # or whatever error you're expecting
    retries += 1
    retry unless retries > MAX_RETRIES
    raise
  end
end

def self.traverse_products
  traverse(ProductTable.where(column: value)) do |product|
    # do something with `product`
  end
end

removing objects from an array during a loop

I am trying to filter the results of a user search in my app to only show users who are NOT friends. My friends table has 3 columns: f1 (userid of the person who sent the request), f2 (userid of the friend who received the request), and confirmed (boolean, true or false). As you can see, @usersfiltered is the result of the search. Then the definition of the current user's friends is established. Then I am trying to remove the friends from the search results. This does not seem to be working, but it should be pretty straightforward. I've tried delete (not good) and destroy.
def index
  # THIS IS THE SEARCH RESULT
  @usersfiltered = User.where("first_name LIKE ?", "%#{params[:first_name]}%")

  # THIS IS DEFINING ROWS ON THE FRIEND TABLE THAT BELONG TO CURRENT USER
  @confirmedfriends = Friend.where(:confirmed => true)
  friendsapproved = @confirmedfriends.where(:f2 => current_user.id)
  friendsrequestedapproved = @confirmedfriends.where(:f1 => current_user.id)

  # GOING THROUGH SEARCH RESULTS
  @usersfiltered.each do |usersfiltered|
    if friendsapproved.present?
      friendsapproved.each do |fa|
        if usersfiltered.id == fa.f1
          # NEED TO REMOVE THIS FROM RESULTS HERE SOMEHOW
          usersfiltered.remove
        end
      end
    end

    # SAME LOGIC
    if friendsrequestedapproved.present?
      friendsrequestedapproved.each do |fra|
        if usersfiltered.id == fra.f2
          usersfiltered.remove
        end
      end
    end
  end
end
I would flip it around the other way. Take the logic that is loop-invariant out of the loop, which gives a good first-order simplification:
approved_ids = []
approved_ids = friendsapproved.map { |fa| fa.f1 } if friendsapproved.present?
approved_ids += friendsrequestedapproved.map { |fra| fra.f2 } if friendsrequestedapproved.present?
approved_ids.uniq! # (May not be needed)

@usersfiltered.delete_if { |user| approved_ids.include? user.id }
This could probably be simplified further if friendsapproved and friendsrequestedapproved were created solely for the purpose of these deletions: you could build a single approved-friends list covering both directions and avoid unioning id sets above.
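A sketch of that single-list variant, using the column names from the question (the OR condition is the only addition):
# Both directions in one query: rows where the current user sent or received the request
approved_ids = Friend.where(confirmed: true)
                     .where("f1 = :id OR f2 = :id", id: current_user.id)
                     .pluck(:f1, :f2)
                     .flatten.uniq - [current_user.id]

# @usersfiltered must be a plain Array here (e.g. via .to_a, as the next answer explains)
@usersfiltered.delete_if { |user| approved_ids.include?(user.id) }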
While I agree that there may be better ways to implement what you're doing, I think the specific problem you're facing is that in Rails 4, the where method returns an ActiveRecord::Relation not an Array. While you can use each on a Relation, you cannot in general perform array operations.
However, you can convert a Relation to an Array with the to_a method as in:
@usersfiltered = User.where("first_name LIKE ?", "%#{params[:first_name]}%").to_a
This would then allow you to do the following within your loop:
@usersfiltered.delete(usersfiltered)
