How do I get Rails to eager load counts? - ruby-on-rails

This is related to a question a year and change ago.
I put up an example of the question that should work out of the box, provided you have sqlite3 available: https://github.com/cairo140/rails-eager-loading-counts-demo
Installation instructions (for the main branch)
git clone git://github.com/cairo140/rails-eager-loading-counts-demo.git
cd rails-eager-loading-counts-demo
rails s
I have a fuller write-up in the repository, but my general question is this.
How can I make Rails eager load counts in a way that minimizes db queries across the board?
The n+1 problem emerges whenever you use #count on an association, despite having included that association via #includes(:associated) in the ActiveRelation. A workaround is to use #length, but this works well only when the object it's being called on has already been loaded up, not to mention that I suspect it duplicates something that the Rails internals have done already. Also, an issue with using #length is that it results in an unfortunate over-loading when the association was not loaded to begin with and the count is all you need.
From the readme:
We can dodge this issue by running #length on the posts array (see appendix), which is already loaded, but it would be nice to have count readily available as well. Not only is it more consistent; it provides a path of access that doesn't necessarily require posts to be loaded. For instance, if you have a partial that displays the count no matter what, but half the time, the partial is called with posts loaded and half the time without, you are faced with the following scenario:
Using #count:
- n COUNT style queries when posts are already loaded
- n COUNT style queries when posts are not already loaded
Using #length:
- Zero additional queries when posts are already loaded
- n * style queries when posts are not already loaded
Between these two choices, there is no dominant option. But it would be nice to revise #count to defer to #length, or to access a length that is stored some other way behind the scenes, so that we can have the following scenario:
Using revised #count:
- Zero additional queries when posts are already loaded
- n COUNT style queries when posts are not already loaded
So what's the correct approach here? Is there something I've overlooked (very, very likely)?
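For what it's worth, the "revised #count" behavior described above is, as far as I know, what ActiveRecord associations already offer as #size: it counts in memory when the association is loaded and falls back to a COUNT query otherwise. Below is a minimal sketch of that dispatch in plain Ruby. FakeAssociation and its query log are hypothetical stand-ins for illustration, not ActiveRecord code.

```ruby
# Hypothetical stand-in for an association; it only records which kind of
# query each method would have issued.
class FakeAssociation
  attr_reader :queries

  def initialize(rows, loaded: false)
    @rows = rows
    @loaded = loaded
    @queries = []
  end

  def loaded?
    @loaded
  end

  # Always issues a COUNT-style query, even when rows are in memory.
  def count
    @queries << :count
    @rows.size
  end

  # Loads every row first if needed, then counts in memory.
  def length
    load_target
    @rows.size
  end

  # Counts in memory when loaded, otherwise falls back to a COUNT query.
  def size
    loaded? ? @rows.size : count
  end

  private

  def load_target
    return if @loaded
    @queries << :select_all
    @loaded = true
  end
end

loaded = FakeAssociation.new(%w[a b c], loaded: true)
unloaded = FakeAssociation.new(%w[a b c])
loaded_size = loaded.size
unloaded_size = unloaded.size
```

In the partial scenario above, calling size instead of count or length would give zero extra queries when posts are loaded and one COUNT query per record when they are not.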

As @apneadiving suggested, counter_cache works well because the counter column gets automatically updated when records are added or removed. So when you load the parent object, the count is included in the object without needing to access the other table.
However, if for whatever reason you don't like that approach, you could do this:
Post.find(:all,
  :select => "posts.*, count(comments.id) `comments_count`",
  :joins => "left join comments on comments.post_id = posts.id",
  :group => "posts.id")

An alternative approach to the one of Zubin:
Post.select('posts.*, count(comments.id) `comments_count`').joins(:comments).group('posts.id')

It appears that the best way to implement this sort of facility might be to create SQL Views (ref: here and here) for the separate model-and-child-count objects that you want, along with their associated ActiveRecord models.
You might be able to be very clever and use subclassing on the original model combined with set_table_name :sql_view_name to retain all the original methods on the objects, and maybe even some of their associations.
For instance, say we were to add 'Post.has_many :comments' to your example, like in @Zubin's answer above; then one might be able to do:
class CreatePostsWithCommentsCountsView < ActiveRecord::Migration
  def self.up
    # Create SQL View called posts_with_comments_counts which maps over
    #   select posts.*, count(comments.id) as comments_count from posts
    #   left outer join comments on comments.post_id = posts.id
    #   group by posts.id
    # (As zubin pointed out above.)
    # *Except* this is in SQL so perhaps we'll be able to do further
    # reducing queries against it *as though it were any other table.*
  end
end

class PostWithCommentsCount < Post # Here there be cleverness.
  # The class definition sets up PWCC with all the regular methods of
  # Post (pointing to the posts table due to Rails' STI facility.)
  set_table_name :posts_with_comments_counts # But then we point it to the SQL view instead.
  # If you don't really care about the methods of Post being in PWCC
  # then you could just make it a normal subclass of AR::Base.
end

# Obviously, this sort of "upward looking" include is best used in big lists
# like "latest posts" rather than "These posts for this user." But hopefully
# it illustrates the improved activerecordiness of this style of solution.
PostWithCommentsCount.all(:include => :user)

# And I'm pretty sure you should be able to do this without issue as well.
# And it _should_ only be the two queries.
PostWithCommentsCount.all(:include => :comments)

I have set up a small gem that adds an includes_count method to ActiveRecord, that uses a SELECT COUNT to fetch the number of records in an association, without resorting to a JOIN which might be expensive (depending on the case).
See https://github.com/manastech/includes-count
Hope it helps!

Related

Using Rails includes with conditions on children

I have a model Parent that has many children Child. I want to get all Parent models and show every Child of the Parent as well. This is a classic use case for Rails' includes method, as far as I can tell.
However, I can't get Rails to add conditions to the child models without limiting the Parent models to those that have children.
For example, this only outputs parents that have children:
Parent.includes(:children).where(children: {age: 10}).each do |parent|
  # output parent info
  parent.children.where("age = 10").each do |child|
    # output child info
  end
end
I've looked at Rails includes with conditions, but it seems like I'm having the same trouble as the question's OP, and neither part of the accepted answer solves it (it either returns only some parents, or resorts to multiple queries).
You need to use LEFT JOIN.
Parent.joins("LEFT JOIN children ON parents.id = children.parent_id")
  .where("parents.age = 10 AND children.age = 10")
  .select("parents.*, children.*")
If you want to select rows from the parent table which may or may not have corresponding rows in the children table, you use the LEFT JOIN clause. In case there is no matching row in the children table, the values of the columns in the children table are substituted by the NULL values.
I ran into this issue, thus stumbling across this question. Sadly, none of the answers so far are solutions. Happily, I have found the solution! Thanks in-part to the docs :) http://apidock.com/rails/ActiveRecord/QueryMethods/includes
As the docs suggest, simply including the association and then adding a condition to it is not sufficient; you must also "reference" the association with references(:children).
Now, additionally you can see that I'm using some syntactic sugar that I recommend for merging in your conditions, versus re-writing them. Use this when possible.
Parent.includes(:children).merge(Child.at_school).references(:children).first
So what I did, and what I suggest doing is setting up a scope for this:
class Parent < ActiveRecord::Base
  has_many :children
  scope :with_children_at_school, -> { includes(:children).merge(Child.at_school).references(:children) }
  # ...
end
And then you can just call Parent.with_children_at_school.first (or whatever else you want to chain on to the end)!
I hope this helps!
This is a limitation of the includes method. What you need is an outer join, and unfortunately Rails doesn't have a good way to force an outer join without using raw SQL syntax (#joins defaults to an inner join and #includes eager loads).
Try something along the lines of:
Parent.joins('LEFT OUTER JOIN children ON children.parent_id = parents.id').where(...)
This should grab all parents, even those without children.
This is not a 100% answer, but one approach is to accept that you will get all child records returned by the eager loading, and then to choose the ones you want using a non-ActiveRecord method.
You will include more child records in the eager loading than you need, so that's less efficient than a perfect solution, but you still get the records you want:
Parent.includes(:children).each do |parent|
  parent.children.select{|child| child.age == 10}.each do |child|
    # blah blah...
  end
end
I'm assuming here that you need a lot of flexibility on your select criteria, and that an association based on a scope would not offer such flexibility.
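To illustrate why this keeps childless parents in the result, here is a small plain-Ruby sketch. The Parent and Child structs and the sample data are made up for the example and stand in for already eager-loaded records.

```ruby
# Hypothetical stand-ins for eager-loaded ActiveRecord objects.
Parent = Struct.new(:name, :children)
Child = Struct.new(:name, :age)

parents = [
  Parent.new("Ann", [Child.new("Bo", 10), Child.new("Cy", 7)]),
  Parent.new("Dee", []) # a parent with no children at all
]

# Filter children in memory; every parent is visited regardless of matches.
report = parents.map do |parent|
  matching = parent.children.select { |child| child.age == 10 }
  [parent.name, matching.map(&:name)]
end
```

Because the filter runs on the in-memory array (Array#select) rather than in the WHERE clause, it narrows the children without dropping any parent.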
The parents who don't have children will have a children.age of NULL, but you are only filtering for children.age = 10.
Try
where('children.age = 10 or children.age is null')

Rails subquery reduce amount of raw SQL

I have two ActiveRecord models: Post and Vote. I want to make a simple query:
SELECT *,
(SELECT COUNT(*)
FROM votes
WHERE votes.id = posts.id) AS vote_count
FROM posts
I am wondering what's the best way to do it in activerecord DSL. My goal is to minimize the amount of SQL I have to write.
I can do Post.select("COUNT(*) from votes where votes.id = posts.id as vote_count")
Two problems with this:
Raw SQL. Any way to write this in DSL?
This returns only attribute vote_count and not "*" + vote_count. I can append .select("*") but I will be repeating this every time. Is there a much better/DRY way to do this?
Thanks
Well, if you want to reduce the amount of SQL, you can split that query into two smaller ones and execute them separately. For instance, the vote-counting part could be extracted into this query:
SELECT votes.id, COUNT(*) FROM votes GROUP BY votes.id;
which you may write with ActiveRecord methods as:
Vote.group(:id).count
You can store the result for later use and access it directly from Post model, for example you may define #votes_count as a method:
class Post
  def votes_count
    @@votes_count_cache ||= Vote.group(:id).count
    @@votes_count_cache[id] || 0
  end
end
(Of course every use of cache raises a question about invalidating or updating it, but this is out of the scope of this topic.)
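As a rough illustration of the shape of that cached result, here is a plain-Ruby sketch of a grouped count built once and then used for lookups. The votes array and its post_id key are assumptions made for the example; they are not taken from the question's schema.

```ruby
# Hypothetical rows standing in for the votes table; post_id is an assumed column.
votes = [{ post_id: 1 }, { post_id: 1 }, { post_id: 3 }]

# In-memory equivalent of a single grouped COUNT query: { key => count }.
counts_by_post = votes.each_with_object(Hash.new(0)) do |vote, counts|
  counts[vote[:post_id]] += 1
end

# Per-post lookup that defaults to 0 for posts without any votes.
votes_count = ->(post_id) { counts_by_post.fetch(post_id, 0) }
```

The point is that one grouped query yields a hash, and each post then does a hash lookup instead of its own COUNT query.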
But I strongly encourage you to consider yet another approach.
I believe that writing complicated queries like yours with ActiveRecord methods (even if it were possible) and splitting queries into two as I proposed earlier are both bad ideas. They result in extremely cluttered code, far less readable than raw SQL. Instead, I suggest introducing query objects. IMO there is nothing wrong with using raw, complicated SQL when it's hidden behind a nice interface. See: M. Fowler's P of EAA and Brynary's post on Code Climate Blog.
How about doing this with no additional SQL at all - consider using the Rails counter_cache feature.
If you add an integer votes_count column to the posts table, you can get Rails to automatically increment and decrement that counter by changing the belongs_to declaration in Vote to:
belongs_to :post, counter_cache: true
Rails will then keep each Post updated with the number of votes it has. That way the count is already calculated and no sub-query is needed.
Maybe you can create a MySQL view and just map it to a new AR model. It works similarly to a table; you just need to specify it with set_table_name "your_view_name". It may work faster at the DB level, and it will be recalculated automatically.
Just stumbled upon postgres_ext gem which adds support for Common Table Expressions in Arel and ActiveRecord which is exactly what you asked. Gem is not for SQLite, but perhaps some portions could be extracted or serve as examples.

Duplicating logic in methods and scopes (and sql)

Named scopes really made this problem easier but it is far from being solved. The common situation is to have logic redefined in both named scopes and model methods.
I'll try to demonstrate the edge case of this by using a somewhat complex example. Let's say that we have a Message model that has many Recipients. Each recipient is able to mark the message as read for himself.
If you want to get the list of unread messages for given user, you would say something like this:
Message.unread_for(user)
That would use the named scope unread_for that would generate the sql which will return the unread messages for given user. This sql is probably going to join two tables together and filter messages by those recipients that haven't already read them.
On the other hand, when we are using the Message model in our code, we are using the following:
message.unread_by?(user)
This method is defined in the Message class, and even though it does basically the same thing, it has a different implementation.
For simpler projects, this is really not a big thing. Implementing the same simple logic in both sql and ruby in this case is not a problem.
But when application starts to get really complex, it starts to be a problem. If we have permission system implemented that checks who is able to access what message based on dozens of criteria defined in dozens of tables, this starts to get very complex. Soon it comes to the point where you need to join 5 tables and write really complex sql by hand in order to define the scope.
The only "clean" solution to the problem is to make the scopes use the actual ruby code. They would fetch ALL messages, and then filter them with ruby. However, this causes two major problems:
Performance
Pagination
Performance: we are creating a lot more queries to the database. I am not sure about the internals of a DBMS, but how much harder is it for the database to execute 5 queries, each on a single table, than 1 query that joins 5 tables at once?
Pagination: we want to keep fetching records until the specified number of records has been retrieved. We fetch them one by one and check whether each is accepted by the Ruby logic. Once 10 of them are accepted, the process stops.
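A minimal sketch of that "fetch until a page is full" idea, using Ruby's lazy enumerators. The readable lambda and the integer range are hypothetical stand-ins for the real permission logic and for fetching records one at a time from the database.

```ruby
# Hypothetical permission check standing in for complex Ruby logic.
readable = ->(record) { record[:id].even? }

# Stand-in for fetching records one at a time; `fetched` logs what was pulled.
fetched = []
records = (1..1_000).lazy.map do |id|
  fetched << id
  { id: id }
end

# Evaluation stops as soon as 10 records have passed the check.
page = records.select { |record| readable.call(record) }.first(10)
```

Only as many records are pulled as are needed to fill the page, though later pages still have to skip past everything that earlier pages examined.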
Curious to hear your thoughts on this. I have no experience with nosql dbms, can they tackle the issue in different way?
UPDATE:
I was only speaking hypothetically, but here is one real-life example. Let's say that we want to display all transactions on one page (both payments and expenses).
I have created a SQL UNION query to get them both, then I go through each record, check whether it can be :read by the current user, and finally paginate the result as an array.
def form_transaction_log
  sql1 = @project.payments
    .select("'Payment' AS record_type, id, created_at")
    .where('expense_id IS NULL')
    .to_sql
  sql2 = @project.expenses
    .select("'Expense' AS record_type, id, created_at")
    .to_sql
  result = ActiveRecord::Base.connection.execute %{
    (#{sql1} UNION #{sql2})
    ORDER BY created_at DESC
  }
  result = result.map do |record|
    klass = Object.const_get record["record_type"]
    klass.find record["id"]
  end.select do |record|
    can? :read, record
  end
  @transactions = Kaminari.paginate_array(result).page(params[:page]).per(7)
end
Both payments and expenses need to be displayed within same table, ordered by creation date and paginated.
Both payments and expenses have completely different :read permissions (defined in the ability class, CanCan gem). These permissions are quite complex and they require querying several other tables.
The "ideal" thing would be to write one HUGE SQL query that would return what I need. It would make pagination and everything else a lot easier. But that is going to duplicate the logic defined in my ability.rb class.
I'm aware that CanCan provides a way of defining the sql query for the ability, but the abilities are so complex, that they couldn't be defined in that way.
What I did is working, but I'm loading ALL transactions, and then checking which ones I could read. I consider it a big performance issue. Pagination here seems pointless because I'm already loading all records (it only saves bandwidth). An alternative is to write really complex SQL that is going to be hard to maintain.
Sounds like you should remove some duplication and perhaps use DB logic more. There's no reason that you can't share code between named scopes and other methods.
Can you post some problematic code for review?

rails + ActiveRecord: caching all registers of a model

I've got a tiny model (let's call it "Node") that represents a tree-like structure. Each node contains only a name and a reference to its father:
class Node < ActiveRecord::Base
validates_presence_of :name, :parent_id
end
The table isn't very big - less than 100 elements. It's updated rarely - in the last 4 months 20 new elements were added, on one occasion, by the site admin.
Yet it is used quite a lot on my application. Given its tree-like structure, on some occasions a request triggers more than 30 database hits (including ajax calls, which I use quite a lot).
I'd like to use some sort of caching in order to lower the database access - since the table is so small, I thought about caching all records in memory.
Is this possible in Rails 2.3? Is there a better way to deal with this?
Why don't you just load them all every time to avoid getting hit with multiple loads?
Here's a simple example:
before_filter :load_all_nodes

def load_all_nodes
  # Build a hash of id => node; the block must return the hash (h), not the node.
  @nodes = Node.all.inject({ }) { |h, n| h[n.id] = n; h }
end
This will give you a hash indexed by Node#id so you can use this cache in place of a find call:
# Previously
@node = Node.find(params[:id])
# Now
@node = @nodes[params[:id].to_i]
For small, simple records, loading them in quickly in one fetch is a fairly inexpensive operation.
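Here is a plain-Ruby sketch of the same idea, showing how the id-indexed hash also lets you walk up the tree without further lookups against the database. The Node struct, the sample rows, and the ancestors helper are hypothetical stand-ins for the example.

```ruby
# Hypothetical stand-in for the Node model: id, name and parent reference.
Node = Struct.new(:id, :name, :parent_id)

all_nodes = [
  Node.new(1, "root", nil),
  Node.new(2, "branch", 1),
  Node.new(3, "leaf", 2)
]

# Index every node by id in a single pass.
nodes_by_id = all_nodes.each_with_object({}) { |node, index| index[node.id] = node }

# Walk up the tree using only the in-memory index.
def ancestors(node, index)
  chain = []
  while (node = index[node.parent_id])
    chain << node
  end
  chain
end

leaf = nodes_by_id["3".to_i] # params-style string id converted to integer
leaf_ancestors = ancestors(leaf, nodes_by_id).map(&:name)
```

each_with_object avoids the need to return the hash from the block, which is an easy mistake to make with inject.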
Have you looked at any of the plugins that give tree-like behaviour?
Ryan Bates has a railscast on acts_as_tree; however, acts_as_nested_set or one of the other projects inspired by it, such as awesome_nested_set or acts_as_better_nested_set, may be a better fit for your needs.
These projects allow you to get a node and all of its children with one sql query. The acts_as_better_nested_set site has a good description of how this method works.
After looking in several places, I think tadman's solution is the simplest one.
For a more flexible solution, I've found this gist:
http://gist.github.com/72250/
Regards!

Rails Eager Loading Question Find(:all, :include => [:model])

I have a Topic and a Project model. I have a many-to-many association between them (a HABTM one).
In the Topic's index page, I want to display the number of projects that each topic has. So I have
@topics = Topic.all(:include => [:projects])
In my controller, and so far so good. The problem is that the Project Model is so big that the query is still really slow
Topic Load (1.5ms) SELECT * FROM "topics"
Project Load (109.2ms) SELECT "projects".*, t0.topic_id as the_parent_record_id FROM "projects" INNER JOIN "projects_topics" t0 ON "projects".id = t0.project_id WHERE (t0.topic_id IN (1,2,3,4,5,6,7,8,9,10,11))
Is there a way to make the second query select just the name or the ID instead of *? The counter_cache is not supported by the HABTM association, and I don't really want to implement it by myself... so is there a way to make this second query faster?
I just need to pull the count without loading the whole project object...
Thanks in advance,
Nicolás Hock Isaza
1. counter_cache is very easy to implement
2. you can convert habtm to double has_many, i.e. has_many :projects_topics in both project and topic model (and belongs_to in projects_topics) and then use counter_cache or do eager loading only on projects_topics
3. you can do :select => "count(projects_topics.id)", :group => "topics.id" but this won't work well with postgresql if you care about it...
The second option is the best IMO, I usually don't use habtm at all, only double has_many :)
To expand on Devenv's answer counter cache is what you would typically use for this kind of scenario.
From the api docs:
Caches the number of belonging objects on the associate class through the use of increment_counter and decrement_counter. The counter cache is incremented when an object of this class is created and decremented when it's destroyed. This requires that a column named #{table_name}_count (such as comments_count for a belonging Comment class) is used on the associate class (such as a Post class). You can also specify a custom counter cache column by providing a column name instead of a true/false value to this option (e.g., :counter_cache => :my_custom_counter.)
Note: Specifying a counter cache will add it to that model's list of readonly attributes using attr_readonly.
Here is a screen cast from ryan bates' railscasts on counter_cache.
Here is an answer to a question I asked half a year ago where the solution was an easily implemented home-brew counter cache.

Resources