Why does ActiveRecord's :include do two queries? - ruby-on-rails

I am just learning ActiveRecord and SQL and I was under the impression that :include does one SQL query. So if I do:
Show.first :include => :artist
It will execute one query and that query is going to return first show and artist. But looking at the SQL generated, I see two queries:
[2013-01-08T09:38:00.455705 #1179] DEBUG -- : Show Load (0.5ms) SELECT `shows`.* FROM `shows` LIMIT 1
[2013-01-08T09:38:00.467123 #1179] DEBUG -- : Artist Load (0.5ms) SELECT `artists`.* FROM `artists` WHERE `artists`.`id` IN (2)
I watched one of the Railscasts videos where the author compared :include and :joins, and the SQL shown on the console was one large query, but only one query. Is this how it is supposed to work, or am I missing something?

Active Record has two ways of loading associations up front. :includes will trigger either of them, based on some heuristics.
One way is to run one query per association: you first load all the shows (1 query), then you load all the artists (2nd query). If you were also including an association on artists, that would be a 3rd query. All of these are simple queries, although it does mean that no advantage is gained in your specific case. Because the queries are separate, you can't do things like order the top level (shows) by columns of the child associations.
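The one-query-per-association strategy can be sketched in plain Ruby. This is only an illustration of the idea, not Active Record's implementation: the SHOWS and ARTISTS arrays stand in for database tables, their contents are made up, and run_query and preload_artists are hypothetical helpers invented for this sketch.

```ruby
# Stand-in "tables" (hypothetical data, not taken from the question).
SHOWS = [
  { id: 1, title: 'Opening Night', artist_id: 2 },
  { id: 2, title: 'Matinee',       artist_id: 2 },
  { id: 3, title: 'Closing Night', artist_id: 5 }
].freeze

ARTISTS = [
  { id: 2, name: 'Artist A' },
  { id: 5, name: 'Artist B' },
  { id: 9, name: 'Artist C' }
].freeze

# Records each simulated query so the number of round trips can be counted.
QUERY_LOG = []

# Hypothetical helper: logs the SQL text and returns the given rows.
def run_query(sql, rows)
  QUERY_LOG << sql
  rows
end

# Hypothetical helper: load the shows, then load all of their artists at once.
def preload_artists
  # Query 1: every show.
  shows = run_query('SELECT shows.* FROM shows', SHOWS.map(&:dup))

  # Query 2: every artist referenced by those shows, in a single IN list.
  ids = shows.map { |show| show[:artist_id] }.uniq
  artists = run_query(
    "SELECT artists.* FROM artists WHERE artists.id IN (#{ids.join(', ')})",
    ARTISTS.select { |artist| ids.include?(artist[:id]) }
  )

  # Stitch the two result sets together in memory, keyed by primary key.
  artists_by_id = artists.map { |artist| [artist[:id], artist] }.to_h
  shows.each { |show| show[:artist] = artists_by_id[show[:artist_id]] }
end

loaded_shows = preload_artists
puts QUERY_LOG
puts loaded_shows.map { |show| "#{show[:title]}: #{show[:artist][:name]}" }
```

However many shows there are, the sketch stays at two queries, which is the same shape as the two log lines in the question.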
The second way is to load everything in one big join-based query. This always produces a single query, but it's more complicated: one join per included association, and the code that turns the result set back into Ruby objects is more complicated too. There are also some corner cases: a polymorphic belongs_to can't be handled this way, and including multiple has_many associations at the same level will produce a very large result set.
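The result-set growth with several has_many associations can be made concrete with a small plain-Ruby sketch. It is not Active Record code: the owner, pets and cars data and the left_outer_join helper are invented for illustration, and the helper only mimics what a LEFT OUTER JOIN returns.

```ruby
# Hypothetical data: one owner with 3 pets and 4 cars.
OWNER = { id: 1, name: 'Sam' }.freeze
PETS  = (1..3).map { |n| { id: n, owner_id: 1, name: "pet#{n}" } }.freeze
CARS  = (1..4).map { |n| { id: n, owner_id: 1, plate: "car#{n}" } }.freeze

# Mimics LEFT OUTER JOIN. Each left row is an array of already-joined records;
# it is paired with every matching right row, or with nil when nothing matches.
# The block decides whether a left row and a right row match.
def left_outer_join(left_rows, right_rows)
  left_rows.flat_map do |left|
    matches = right_rows.select { |right| yield(left, right) }
    matches = [nil] if matches.empty?
    matches.map { |right| [*left, right] }
  end
end

rows = [[OWNER]]
rows = left_outer_join(rows, PETS) { |row, pet| pet[:owner_id] == row[0][:id] }
rows = left_outer_join(rows, CARS) { |row, car| car[:owner_id] == row[0][:id] }

puts "rows in the single joined result: #{rows.size}"
puts "rows across three separate queries: #{1 + PETS.size + CARS.size}"
```

The joined result contains one row per pet and car combination, so its size is the product of the two association sizes, while separate queries return only the sum.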
Active Record will by default use the first strategy (preload), unless it thinks that your query conditions or order are referencing the associations, in which case it falls back to the second approach. You can force the strategy used by using preload or eager_load instead of :includes.

Using :includes is one way to get eager loading. In your example it runs at most two queries. If you changed your query to Show.all :include => :artist, it would still run just two queries.
Better explanation: Active Record Querying Eager Loading

Related

When is better to use preload or eager_load or includes?

            SQL contains join       Load associated record in memory   Performs two queries
preload     no                      yes                                yes
includes    yes (left outer join)   yes                                sometimes
eager_load  (left outer join)       yes                                no
I am aware of the concepts.
I want to know when to use which API. I searched but could not find an exact answer.
includes chooses whether to use preload or eager_load. If you're not happy with the decision includes makes, you have to call either eager_load or preload explicitly.
From https://engineering.gusto.com/a-visual-guide-to-using-includes-in-rails/:
When does :includes use :preload?
In most cases :includes will default to use the method :preload which will fire 2 queries:
Load all records tied to the leading model
Load records associated with the leading model based off the foreign key on the associated model or the leading model
When does :includes use :eager_load?
:includes will default to use :preload unless you reference the association being loaded in a subsequent clause, such as :where or :order. When constructing a query this way, you also need to explicitly reference the eager loaded model.
Employee.includes(:forms).where('forms.kind = "health"').references(:forms)

How to get first row of has_and_belongs_to_many relationship postgres sql

The relationship between the model 'battle' to the model 'category' is has_and_belongs_to_many. I'm trying to find all the battles that don't relate to a specific category.
What I tried to do so far is:
Battle.all.includes(:categories).where("(categories[0].name NOT IN ('Other', 'Fashion'))")
This error occurred: ActiveRecord::StatementInvalid: PG::DatatypeMismatch: ERROR: cannot subscript type categories because it is not an array
Thanks,
Rotem
The use of categories[0].name is not a valid SQL reference to the name column of categories.
Try this:
Battle.includes(:categories).reject do |battle|
  # Reject a battle if any of its category names is in the unwanted list.
  (battle.categories.map(&:name) & ['Other', 'Fashion']).any?
end
Note that this code performs two queries, one for Battle and one for Category, and eager loads the matching Category Active Record objects into the :categories association of each instantiated Battle record. This helps prevent N+1 queries, as described in detail in the guides.
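One thing to check when filtering in Ruby like this: Array#include? looks for a single element, so passing it an array of names does not detect overlap between two lists. A minimal sketch with plain arrays shows the difference between that and an intersection test. The category names per battle are made up, and both helper methods are hypothetical.

```ruby
EXCLUDED = ['Other', 'Fashion'].freeze

# Hypothetical category names for three battles.
BATTLE_CATEGORIES = {
  'battle 1' => ['Fashion', 'Music'],
  'battle 2' => ['Sports'],
  'battle 3' => []
}.freeze

# Asks whether EXCLUDED contains the whole names array as one element.
# It does not for any of these battles, so nothing is rejected.
def kept_with_include(battles)
  battles.reject { |_title, names| EXCLUDED.include?(names) }.keys
end

# Asks whether the two arrays share at least one name.
def kept_with_intersection(battles)
  battles.reject { |_title, names| (names & EXCLUDED).any? }.keys
end

p kept_with_include(BATTLE_CATEGORIES)
p kept_with_intersection(BATTLE_CATEGORIES)
```

Only the intersection version drops 'battle 1', which is the battle that carries an unwanted category.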
Also note that the code above first loads to memory ALL Battle records before eliminating the ones that have the undesired categories. Depending on your data, this could be prohibitive. If you prefer to restrict the SQL query so that only the relevant ActiveRecord objects get instantiated, you could get away with something like the following:
battle_ids_sql = Battle.select(:id).joins(:categories).where(categories: { name: ['Other', 'Fashion'] }).to_sql
Battle.where("id NOT IN (#{battle_ids_sql})")
battle_ids_sql is an SQL statement that returns the IDs of battles that HAVE one of your undesired categories. The SQL that is actually executed fetches all Battle records whose ID is not returned by that inner statement. It's effective, but use it sparingly: queries like this tend to become hard to maintain rather quickly.
You can learn about joins, includes and two other related methods (eager_load and preload) here.

Optimizing has many record association query

I have this query that I've built using Enumerable#select. The purpose is to find users that have no has_many associated records or, if they do have such records, to select only those users whose enrollments all have the preview attribute set to true. The code below works perfectly for that use case. However, this query does not scale well. When I test against thousands of records it takes several hundred seconds to complete. How can this query be improved?
# User has many enrollments
# Enrollment belongs to user.
users_with_no_courses = User.includes(:enrollments).select {|user| user.enrollments.empty? || user.enrollments.where(preview: false).empty?}
So first, make sure enrollments.user_id has an index.
Second, you can speed this up by not loading all the enrollments, and doing your filtering in SQL:
User.where(<<-EOQ)
  NOT EXISTS (SELECT 1
              FROM enrollments e
              WHERE e.user_id = users.id
              AND NOT e.preview)
EOQ
By the way here I'm simplifying your two conditions into one: "no enrollments or no real enrollments" is the same as "no real enrollments".
If you want you can put this condition into a scope so it is more reusable.
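The simplification of the two conditions into one can be checked on in-memory data. The following plain-Ruby sketch uses made-up enrollment lists (hashes with a preview flag) and two hypothetical predicates, one mirroring the original two-part condition and one mirroring the NOT EXISTS form.

```ruby
# Mirrors: enrollments.empty? || enrollments.where(preview: false).empty?
def no_courses_original?(enrollments)
  enrollments.empty? || enrollments.reject { |e| e[:preview] }.empty?
end

# Mirrors: NOT EXISTS (... AND NOT e.preview)
def no_courses_simplified?(enrollments)
  enrollments.none? { |e| !e[:preview] }
end

# Hypothetical sample users, described by their enrollments only.
SAMPLES = {
  'no enrollments'        => [],
  'previews only'         => [{ preview: true }, { preview: true }],
  'one real enrollment'   => [{ preview: true }, { preview: false }],
  'real enrollments only' => [{ preview: false }]
}.freeze

SAMPLES.each do |label, enrollments|
  original   = no_courses_original?(enrollments)
  simplified = no_courses_simplified?(enrollments)
  puts "#{label}: original=#{original} simplified=#{simplified}"
end
```

The two predicates agree on every sample because an empty list trivially has no real enrollments, so the separate empty? check adds nothing.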
Third, this is still going to be slow if you're instantiating thousands of User objects. So I would look into paginating if that makes sense, or find_each if this is an offline script. Or use raw SQL to avoid all the object instances.
Oh by the way: even though you are saying includes(:enrollments), this will still go back to the database, giving you an n+1 problem:
user.enrollments.where(preview: false)
That is because the where means ActiveRecord can't use the already-loaded association. You can avoid that by using select instead of where. But not loading the enrollments in the first place is even better.
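The difference between where and select on an already-loaded association can be pictured with a toy stand-in. Nothing here is Active Record code: FakeAssociation, its query counter and extra_queries_for are invented for this sketch, and they only imitate the behaviour described above, where where goes back to the database and the block form of select filters records already in memory.

```ruby
# Toy stand-in for a loaded association (hypothetical, not an Active Record class).
class FakeAssociation
  attr_reader :queries_run

  def initialize(loaded_records)
    @loaded_records = loaded_records
    @queries_run = 0
  end

  # Imitates a relation method: every call counts as a new database query.
  def where(conditions)
    @queries_run += 1
    @loaded_records.select do |record|
      conditions.all? { |key, value| record[key] == value }
    end
  end

  # Imitates select with a block: filters the loaded records, no query.
  def select(&block)
    @loaded_records.select(&block)
  end
end

# Builds one fake association per user, runs the block on each, and returns
# how many extra queries were triggered in total.
def extra_queries_for(user_count)
  associations = Array.new(user_count) do
    FakeAssociation.new([{ preview: true }, { preview: false }])
  end
  associations.each { |enrollments| yield enrollments }
  associations.sum(&:queries_run)
end

with_where  = extra_queries_for(3) { |enrollments| enrollments.where(preview: false) }
with_select = extra_queries_for(3) { |enrollments| enrollments.select { |e| !e[:preview] } }

puts "extra queries using where:  #{with_where}"
puts "extra queries using select: #{with_select}"
```

In this toy model the where version costs one extra query per user, which is the n+1 pattern, while the select version costs none.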

ActiveRecord use joins for performance but load all associated records into memory like with includes

Unless I am mistaken: joins has better performance than includes because at the database level:
joins causes an inner join
includes causes a subquery
And in general, an inner join is faster than a subquery.
Example:
# app/models/owner.rb
class Owner < ActiveRecord::Base
  has_many :pets
end
# app/models/pet.rb
class Pet < ActiveRecord::Base
  belongs_to :owner
end
Using rails console:
# showing how 'includes' in rails causes an IN statement which is a subquery
irb(main):001:0> @owners = Owner.all.includes(:pets)
Owner Load (2.7ms) SELECT "owners".* FROM "owners"
Pet Load (0.4ms) SELECT "pets".* FROM "pets" WHERE "pets"."owner_id" IN (1, 2, 3)
And now using joins which causes an inner join:
irb(main):001:0> @owners = Owner.all.joins(:pets)
Owner Load (0.3ms) SELECT "owners".* FROM "owners" INNER JOIN "pets" ON "pets"."owner_id" = "owners"."id"
So it would seem that it is almost always better to use joins over includes because:
includes causes a subquery (the IN statement)
joins causes an inner join which is usually faster than a subquery
However, there is one gotcha with using joins. This article does a great job describing it. Basically, includes loads all the associated objects into memory, so that if you query for any of the attributes for those associated objects, it doesn't hit the database. Meanwhile, joins DOES NOT load into memory the associated objects' attributes, so if you query for any of the attributes, it makes additional hits on the database.
So here is my question: Is it possible to do inner joins like with joins for performance but at the same time load all the associated objects into memory like includes does?
Put another way: is it possible to load all the associated objects into memory like includes does, but causes an inner join as opposed to a subquery?
I think your assumption that a JOIN is always faster than two queries is not correct. It depends heavily on the size of your database tables.
Imagine you have thousands of owners and pets in your database. The database would then have to join them all together first, even if you just want to load 10 records. By contrast, one query that loads 10 owners and a second query that loads all pets for those 10 owners would be faster than that JOIN.
I would argue that both methods exist to solve different problems:
joins is used when you need to combine two tables to run a query on the data of both tables.
includes is used to avoid N+1 queries.
Btw: The Rails documentation has a note that includes has performance benefits over joins:
This will often result in a performance improvement over a simple join.

[Rails query] difference between joins and includes

@teachers = User.joins(:students).where("student_id IS NOT NULL")
The above works, the below doesn't.
@teachers = User.includes(:students).where("student_id IS NOT NULL")
As far as I understand, joins and includes should both bring the same result with different performance. According to this, you use includes to load the associated records of the objects returned by the model, whereas joins simply joins two tables together. Using includes can also prevent N+1 queries.
First question: why does my second line of code not work?
Second question: should anyone always use includes in a case similar to above?
You use joins when you want to query against the joined model. This is doing an inner join between your tables.
Includes is when you want to eager load the associated model to the end result.
This allows you to call the association on any of the results without having to again do the db lookup.
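With includes alone, a string condition that mentions the associated table is not guaranteed to work, because includes may load the association in a separate query that the condition never sees. If you want to query against the associated table, use joins, or add references so that includes falls back to a join (you can also use joins and includes together!).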
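What an inner join does to the result set can be sketched with plain arrays. The teacher and student rows below are made up, the teacher_id foreign key is an assumption about the schema, and inner_join_teachers is a hypothetical helper that imitates an SQL INNER JOIN rather than calling Active Record.

```ruby
# Hypothetical rows; Grace has no students.
TEACHERS = [
  { id: 1, name: 'Ada' },
  { id: 2, name: 'Grace' },
  { id: 3, name: 'Linus' }
].freeze

STUDENTS = [
  { id: 10, teacher_id: 1 },
  { id: 11, teacher_id: 1 },
  { id: 12, teacher_id: 3 }
].freeze

# Returns one teacher row per matching (teacher, student) pair, which is the
# shape an INNER JOIN gives when only the teachers' columns are selected.
def inner_join_teachers(teachers, students)
  teachers.flat_map do |teacher|
    students.select { |student| student[:teacher_id] == teacher[:id] }
            .map { teacher }
  end
end

joined = inner_join_teachers(TEACHERS, STUDENTS)
p joined.map { |teacher| teacher[:name] }       # Ada appears twice, Grace not at all
p joined.uniq.map { |teacher| teacher[:name] }  # duplicates removed, as DISTINCT would do
```

So a join filters out parents without children and repeats parents with several children. In Active Record terms, that is why joins is often combined with distinct when only the parent records are wanted.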
