Rails ActiveRecord Performance with many Selects and Inserts

I have a Rails 3.2 application that tracks mailings for subscription orders.
The basic model structure is:
Order has_many Subscriptions has_many SubscriptionMailings
Each month a record for each subscription mailing is generated and a csv file is exported from these records.
The mailing address is stored at the order level.
Basically, I select all of the subscriptions that are valid to mail that month and loop through them, getting the mailing address from the order object. Then I create a new subscription mailing record for each one.
Right now this works ok because there aren't a lot of subscriptions, but it is pretty slow.
How can I speed up this process?

In order to optimize, you need to step down from the Ruby level to the SQL level here.
Instead of doing N+1 selects (one to fetch all subscriptions and N to fetch the order of each subscription), you may be able to do only one select with a join:
SubscriptionMailing.
  joins(subscription: :order).
  where(Order.table_name => { valid: true })

Sounds like you want to use includes to eager load the orders. Maybe something like this:
# Subscription.rb
scope :valid_for_month, lambda { |month| where(month: month) }
# Elsewhere
valid_subscriptions = Subscription.valid_for_month(Time.now.month).includes(:order)
valid_subscriptions.each do |subscription|
  subscription.generate_subscription_mailing
end
More on includes: http://api.rubyonrails.org/classes/ActiveRecord/Associations/ClassMethods.html

After some research I ended up wrapping my code in a transaction without making any other changes.
It sped up things quite a bit.
Before I added the transaction the code was taking over 1 minute to run, now it is down to roughly 10 seconds. This is plenty fast enough for my needs, so I didn't try and optimize any further.
ActiveRecord::Base.transaction do
  # my db stuff here
end
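For reference, a minimal sketch of the final shape, combining the eager loading suggested above with a single transaction; the address column on SubscriptionMailing and the mailing_address method on Order are assumptions for illustration, not from the question:
ActiveRecord::Base.transaction do
  Subscription.valid_for_month(Time.now.month).includes(:order).find_each do |subscription|
    # the order is already in memory thanks to includes(:order)
    SubscriptionMailing.create!(
      subscription: subscription,
      address: subscription.order.mailing_address # hypothetical column/method
    )
  end
end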

Related

Include vs Join

I have 3 models
User - has many debits and has many credits
Debit - belongs to User
Credit - belongs to User
Debit and credit are very similar. The fields are basically the same.
I'm trying to run a query on my models to return all fields from debit and credit where user is current_user
User.left_outer_joins(:debits, :credits).where("users.id = ?", @user.id)
As expected, this returned all fields from User repeated once for every matching record in credits and debits.
User.includes(:credits, :debits).order(created_at: :asc).where("users.id = ?", @user.id)
This ran three queries, when I thought it should be done in one.
The second part of this question: how could I add the record type into the query?
That is, records from credits would have an extra field showing "credit", and the same for debits.
I have looked into ActiveRecordUnion gem but I did not see how it would solve the problem here
includes can't magically retrieve everything you want it to in one query; it will run one query per model (typically) that you need to hit. Instead, it eliminates future unnecessary queries. Take the following examples:
Bad
users = User.first(5)
users.each do |user|
  p user.debits.first
end
There will be six queries in total here: one to fetch the five users, then one for each .debits call in the loop.
Good!
users = User.includes(:debits).first(5)
users.each do |user|
  p user.debits.first
end
You'll only make two queries here: one for the users and one for their associated debits. This is how includes speeds up your application, by eagerly loading things you know you'll need.
As for your comment, yes it seems to make sense to combine them into one table. Depending on your situation, I'd recommend looking into Single Table Inheritance (STI). If you don't go this route, be careful with adding a column called type, Rails won't like that!
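A minimal sketch of the STI idea, assuming both kinds of records move into a single transactions table with a type column (all names here are illustrative):
class Transaction < ActiveRecord::Base
  belongs_to :user
end

class Credit < Transaction; end
class Debit < Transaction; end

# One query returns both kinds, and each row carries its type:
Transaction.where(user_id: current_user.id).order(created_at: :asc)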
First of all, in the first query, by calling the query on the User class you are asking for records of type User; if you do not actually want user objects, you are performing an extra join, which could be costly (could be, not will be).
If you want credit and debit records, simply call queries on the Credit and Debit models. If you load the user object somewhere prior to this point, use includes, preload, or eager_load to load the linked credit and debit records all at once.
There are two ways of pre-loading records in Rails. In the first, Rails performs a separate query for each type of record; in the second, Rails performs only one query and loads objects of different types from the data returned.
includes is a smart pre-loader that uses whichever of the two ways it thinks will be faster.
If you want to force Rails to use a single query no matter what, eager_load is what you are looking for.
Please read all about includes, eager_load and preload in the article here.
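A quick sketch of the three strategies just described:
User.includes(:credits, :debits)   # lets Rails pick between the two strategies
User.preload(:credits, :debits)    # always separate queries, one per association
User.eager_load(:credits, :debits) # always a single query using LEFT OUTER JOINs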

Optimizing has many record association query

I have this query that I've built using Enumerable#select. The purpose is to find records that have no has_many associated records, or, if they do have them, to select only those records whose associated records all have their preview attribute set to true. The code below works perfectly for that use case. However, this query does not scale well. When I test against thousands of records it takes several hundred seconds to complete. How can this query be improved upon?
# User has many enrollments
# Enrollment belongs to user.
users_with_no_courses = User.includes(:enrollments).select { |user|
  user.enrollments.empty? || user.enrollments.where(preview: false).empty?
}
So first, make sure enrollments.user_id has an index.
Second, you can speed this up by not loading all the enrollments, and doing your filtering in SQL:
User.where(<<-EOQ)
  NOT EXISTS (SELECT 1
              FROM enrollments e
              WHERE e.user_id = users.id
              AND NOT e.preview)
EOQ
By the way here I'm simplifying your two conditions into one: "no enrollments or no real enrollments" is the same as "no real enrollments".
If you want you can put this condition into a scope so it is more reusable.
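For example, a sketch of such a scope (the scope name is illustrative):
class User < ActiveRecord::Base
  has_many :enrollments

  scope :without_real_enrollments, lambda {
    where(<<-EOQ)
      NOT EXISTS (SELECT 1
                  FROM enrollments e
                  WHERE e.user_id = users.id
                  AND NOT e.preview)
    EOQ
  }
end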
Third, this is still going to be slow if you're instantiating thousands of User objects. So I would look into paginating if that makes sense, or find_each if this is an offline script. Or use raw SQL to avoid all the object instances.
Oh by the way: even though you are saying includes(:enrollments), this will still go back to the database, giving you an n+1 problem:
user.enrollments.where(preview: false)
That is because the where means ActiveRecord can't use the already-loaded association. You can avoid that by using select instead of where. But not loading the enrollments in the first place is even better.
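In other words:
user.enrollments.select { |e| !e.preview } # filters the already-loaded records in Ruby, no new query
user.enrollments.where(preview: false)     # ignores the loaded records and hits the database again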

Duplicating logic in methods and scopes (and sql)

Named scopes really made this problem easier but it is far from being solved. The common situation is to have logic redefined in both named scopes and model methods.
I'll try to demonstrate the edge case with a somewhat complex example. Let's say we have a Message model that has many Recipients. Each recipient is able to mark the message as read for himself.
If you want to get the list of unread messages for given user, you would say something like this:
Message.unread_for(user)
That would use the named scope unread_for that would generate the sql which will return the unread messages for given user. This sql is probably going to join two tables together and filter messages by those recipients that haven't already read them.
On the other hand, when we are using the Message model in our code, we are using the following:
message.unread_by?(user)
This method is defined in the Message class, and even though it does basically the same thing, it now has a different implementation.
For simpler projects, this is really not a big thing. Implementing the same simple logic in both sql and ruby in this case is not a problem.
But when application starts to get really complex, it starts to be a problem. If we have permission system implemented that checks who is able to access what message based on dozens of criteria defined in dozens of tables, this starts to get very complex. Soon it comes to the point where you need to join 5 tables and write really complex sql by hand in order to define the scope.
The only "clean" solution to the problem is to make the scopes use the actual ruby code. They would fetch ALL messages, and then filter them with ruby. However, this causes two major problems:
Performance
Pagination
Performance: we are issuing a lot more queries to the database. I am not sure about the internals of a DBMS, but how much harder is it for the database to execute five queries, each on a single table, than one query that joins five tables at once?
Pagination: we want to keep fetching records until a specified number has been retrieved. We fetch them one by one and check whether each is accepted by the Ruby logic; once 10 of them are accepted, the process stops.
Curious to hear your thoughts on this. I have no experience with NoSQL DBMSs; can they tackle the issue in a different way?
UPDATE:
I was only speaking hypothetically, but here is one real-life example. Let's say we want to display all transactions on one page (both payments and expenses).
I created a SQL UNION query to get them both, then go through each record, check whether it can be :read by the current user, and finally paginate the result as an array.
def form_transaction_log
  sql1 = @project.payments
    .select("'Payment' AS record_type, id, created_at")
    .where('expense_id IS NULL')
    .to_sql
  sql2 = @project.expenses
    .select("'Expense' AS record_type, id, created_at")
    .to_sql
  result = ActiveRecord::Base.connection.execute %{
    (#{sql1} UNION #{sql2})
    ORDER BY created_at DESC
  }
  result = result.map do |record|
    klass = Object.const_get record["record_type"]
    klass.find record["id"]
  end.select do |record|
    can? :read, record
  end
  @transactions = Kaminari.paginate_array(result).page(params[:page]).per(7)
end
Both payments and expenses need to be displayed within same table, ordered by creation date and paginated.
Both payments and expenses have completely different :read permissions (defined in the ability class, CanCan gem). These permissions are quite complex and require querying several other tables.
The "ideal" thing would be to write one HUGE sql query that would do return what I need. It would made pagination and everything else a lot easier. But that is going to duplicate my logic defined in ability.rb class.
I'm aware that CanCan provides a way of defining the sql query for the ability, but the abilities are so complex, that they couldn't be defined in that way.
What I did is working, but I'm loading ALL transactions, and then checking which ones I could read. I consider it a big performance issue. Pagination here seems pointless because I'm already loading all records (it only saves bandwidth). An alternative is to write really complex SQL that is going to be hard to maintain.
Sounds like you should remove some duplication and perhaps use DB logic more. There's no reason you can't share code between named scopes and other methods.
Can you post some problematic code for review?
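To illustrate sharing the logic, here is a sketch for the earlier Message example that defines the condition once in the scope and reuses it from the instance method (the recipients join table and its read_at column are assumptions):
class Message < ActiveRecord::Base
  has_many :recipients

  scope :unread_for, lambda { |user|
    joins(:recipients).
      where(recipients: { user_id: user.id, read_at: nil })
  }

  def unread_by?(user)
    # reuse the scope so the SQL condition lives in one place
    Message.unread_for(user).exists?(id)
  end
end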

Minimizing calls to database in rails

I am familiar with memcached and eager loading, but neither seems to solve the problem I am facing.
My main performance lag comes from hundreds of data-retrieval calls to the database. The tricky thing is that I do not know which set of users I need to retrieve until several steps of computation have run.
I can refactor my code, but I was wondering how you experts handle this situation? I think it should be a fairly common one.
def newsfeed
  # - find out which users I need
  # - retrieve those users via DB
  # - find out which events happened for these users
  # - for each of those events
  #   - retrieve new set of users
  # - find out which groups are relevant
  # - for each of those groups
  #   - retrieve new set of users
  # - etc, etc
end
Denormalization is the magic password for your situation.
There are several ways to do this:
For example, store the ids of the last 10 users in the event and group.
Or create a new model NewsFeedItem (belongs_to :parent, :polymorphic => true). When a user attends an event, create a NewsFeedItem with denormalized information such as the user's name, his profile pic, etc. That saves you from second queries to user_events and users.
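A rough sketch of that second option; the Attendance model and the copied column names are assumptions for illustration:
class NewsFeedItem < ActiveRecord::Base
  belongs_to :parent, :polymorphic => true
end

class Attendance < ActiveRecord::Base
  belongs_to :user
  belongs_to :event

  after_create :create_news_feed_item

  private

  def create_news_feed_item
    # copy the user's name and picture so the feed can be rendered
    # without touching the users table again
    NewsFeedItem.create!(
      :parent => event,
      :user_name => user.name,
      :user_pic_url => user.profile_pic_url
    )
  end
end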
You should be able to do this with only one query per Event / Group loop. What you'll want to do is: inside your for loop add user ids to a Set, then after the for loop, retrieve all the User records with those ids. Rinse and Repeat. Here is an example:
def newsfeed
  user_ids = Set.new
  # find out which users I need
  # ... add ids to user_ids
  # retrieve those users via DB
  users = User.find(user_ids.to_a)
  # find out which events happened for these users;
  # you might want to add a condition that limits the
  # events returned to only recent ones
  events = Event.where(user_id: user_ids.to_a)
  user_ids = Set.new
  events.each do |event|
    # merge, not <<, since the helper returns a collection of ids
    user_ids.merge(discover_user_ids_for_event(event))
  end
  # retrieve the new set of users in a single query, after the loop
  users = User.find(user_ids.to_a)
  # ... and so on
end
I'm not sure what your method is supposed to return, but you can likely figure out how to use the idea of grouping finds together by working with collections of IDs to minimize DB queries.
Do you want to show all the details at once? I mean, when the page loads, do you really want to load all of that information? If not, you can load it on demand, as follows:
def newsfeed
  # find out which users I need
  # retrieve those users via DB
  # find out which events happened for these users
end
Once you show the events, give the user a button or something to drill down into the other details (on demand), then load them using AJAX (so the page will not refresh).
Use this technique repeatedly as users want to go deeper into the details.
By doing this, you will save lots of processing power and will load only the details the user needs.
I don't know if this is applicable to your situation.
If not, then you have to find a more optimized way of loading the details.
cheers,
sameera
I understand that you are trying to run some kind of algorithm over your data to produce a recommendation or something similar.
I have two suggestions:
1) Reevaluate your algorithm / design on the basis of what you actually want to achieve. For instance, in a case where an application has users who can potentially have lots of posts, and the app wants to run some algorithm based on the number of posts, it would be quite expensive to count those posts every time. To optimise this, a post_count column can be added to the user model and incremented whenever a user successfully posts. Similarly, if you can establish some relation like this between your users, events, groups, etc., think along those lines.
2) If the first solution is not feasible, then you must avoid running multiple queries and crunching the data in Ruby, which would obviously be very expensive and is never advisable with a large data set. What you need here is to make one SQL query using a join and get all the data in one go. Also, pick only the field names you actually need from the database; it really helps with large data sets. For instance, if you need the user id and the event id from the users and events tables and nothing else, do something like this:
User.find(:all,
  :select => 'users.id, events.id AS event_id',
  :joins => 'JOIN events ON users.id = events.user_id',
  :conditions => ['users.id IN (your user ids)'])
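In ActiveRecord 3+ query syntax, the same idea looks roughly like this (a sketch, assuming User has_many :events; user_ids stands in for "your user ids"):
User.joins(:events).
  where(:users => { :id => user_ids }).
  select('users.id, events.id AS event_id')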
I hope this will point you in the right direction.

Rails Caching DB Queries and Best Practices

The DB load on my site is getting really high so it is time for me to cache common queries that are being called 1000s of times an hour where the results are not changing.
So for instance on my city model I do the following:
def self.fetch(id)
  Rails.cache.fetch("city_#{id}") { City.find(id) }
end

def after_save
  Rails.cache.delete("city_#{self.id}")
end

def after_destroy
  Rails.cache.delete("city_#{self.id}")
end
So now when I call City.find(1) the first time I hit the DB, but the next 1000 times I get the result from memory. Great. But most of the calls involving a city are not City.find(1) but @user.city.name, where Rails does not use fetch but queries the DB again... which makes sense, but is not exactly what I want it to do.
I can do City.find(@user.city_id), but that is ugly.
So my question to you guys: what are the smart people doing? What is the right way to do this?
With respect to the caching, a couple of minor points:
It's worth using a slash to separate object type and id, which is the Rails convention. Even better, ActiveRecord models provide the cache_key instance method, which gives a unique identifier of table name and id, "cities/13" etc.
One minor correction to your after_save filter: since you have the data on hand, you might as well write it back to the cache instead of deleting it. That saves you a trip to the database ;)
def after_save
  Rails.cache.write(cache_key, self)
end
As to the root of the question, if you're continuously pulling @user.city.name, there are two real choices:
Denormalize the user's city name into the user row: @user.city_name (keep the city_id foreign key). Write this value at save time.
-or-
Implement your User.fetch method to eager load the city. Only do this if the contents of the city row never change (i.e. name etc.), otherwise you can potentially open up a can of worms with respect to cache invalidation.
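A sketch of that second option, mirroring the City.fetch method above and using the slash key convention:
class User < ActiveRecord::Base
  belongs_to :city

  def self.fetch(id)
    # cache the user together with its eager-loaded city
    Rails.cache.fetch("users/#{id}") { User.includes(:city).find(id) }
  end
end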
Personal opinion:
Implement basic id based fetch methods (or use a plugin) to integrate with memcached, and denormalize the city name to the user's row.
I'm personally not a huge fan of cached model style plugins, I've never seen one that's saved a significant amount of development time that I haven't grown out of in a hurry.
If you're getting way too many database queries it's definitely worth checking out eager loading (through :include) if you haven't already. That should be the first step for reducing the quantity of database queries.
If you need to speed up SQL queries on data that doesn't change much over time, you can use materialized views.
A matview stores the results of a query in a table-like structure of its own, from which the data can be queried. It is not possible to add or delete rows, but the rest of the time it behaves just like an actual table. Queries are faster, and the matview itself can be indexed.
At the time of this writing, matviews are natively available in Oracle DB, PostgreSQL, Sybase, IBM DB2, and Microsoft SQL Server. MySQL doesn't provide native support for matviews, unfortunately, but there are open source alternatives to it.
Here are some good articles on how to use matviews in Rails:
sitepoint.com/speed-up-with-materialized-views-on-postgresql-and-rails
hashrocket.com/materialized-view-strategies-using-postgresql
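For a flavor of what this looks like, a sketch of a PostgreSQL materialized view created in a migration (the view name and query are illustrative):
class CreateCitySummaries < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE MATERIALIZED VIEW city_summaries AS
      SELECT cities.id, cities.name, COUNT(users.id) AS user_count
      FROM cities
      LEFT JOIN users ON users.city_id = cities.id
      GROUP BY cities.id, cities.name;
    SQL
  end

  def down
    execute "DROP MATERIALIZED VIEW city_summaries;"
  end
end
Remember that a matview is a snapshot: in PostgreSQL you refresh it with REFRESH MATERIALIZED VIEW city_summaries when the underlying data changes.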
I would go ahead and take a look at Memoization, which is now in Rails 2.2.
"Memoization is a pattern of
initializing a method once and then
stashing its value away for repeat
use."
There was a great Railscast episode on it recently that should get you up and running nicely.
Quick code sample from the Railscast:
class Product < ActiveRecord::Base
  extend ActiveSupport::Memoizable
  belongs_to :category

  def filesize(num = 1)
    # some expensive operation
    sleep 2
    12345789 * num
  end

  memoize :filesize
end
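Usage then looks like this; note that ActiveSupport::Memoizable caches per argument list:
product = Product.first
product.filesize    # slow the first time: runs the expensive operation
product.filesize    # fast: returns the stashed value
product.filesize(2) # different arguments, so it runs once more and is then cached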
More on Memoization
Check out cached_model
