Comparing .references requirement on includes vs. eager_load - ruby-on-rails

I know that when you utilize includes and you specify a where clause on the joined table, you should use .references
example:
# will error out or throw deprecation warning in logs
users = User.includes(:orders).where("Orders.cost < ?", 20)
In rails 4 or later, you will get an error like the following:
Mysql2::Error: Unknown column 'Orders.cost' in 'where clause': SELECT
customers.* FROM customers WHERE (Orders.cost < 100)
Or you will get a deprecation warning:
DEPRECATION WARNING: It looks like you are eager loading table(s) (one
of: users, addresses) that are referenced in a string SQL snippet. For
example:
Post.includes(:comments).where("comments.title = 'foo'") Currently,
Active Record recognizes the table in the string, and knows to JOIN
the comments table to the query, rather than loading comments in a
separate query. However, doing this without writing a full-blown SQL
parser is inherently flawed. Since we don't want to write an SQL
parser, we are removing this functionality. From now on, you must
explicitly tell Active Record when you are referencing a table from a
string:
Post.includes(:comments).where("comments.title =
'foo'").references(:comments)
If you don't rely on implicit join references you can disable the
feature entirely by setting
config.active_record.disable_implicit_join_references = true.
SELECT "users"."id" AS t0_r0, "users"."name" AS t0_r1, "users"."email"
AS t0_r2, "users"."created_at" AS t0_r3, "users"."updated_at" AS
t0_r4, "addresses"."id" AS t1_r0, "addresses"."user_id" AS t1_r1,
"addresses"."country" AS t1_r2, "addresses"."street" AS t1_r3,
"addresses"."postal_code" AS t1_r4, "addresses"."city" AS t1_r5,
"addresses"."created_at" AS t1_r6, "addresses"."updated_at" AS t1_r7
FROM "users" LEFT OUTER JOIN "addresses" ON "addresses"."user_id" =
"users"."id" WHERE (addresses.country = 'Poland')
so we do this:
# added .references(:orders)
users = User.includes(:orders).where("Orders.cost < ?", 20).references(:orders)
And it executes just fine:
SELECT "users"."id" AS t0_r0,
"users"."name" AS t0_r1,
"users"."created_at" AS t0_r2,
"users"."updated_at" AS t0_r3,
"orders"."id" AS t1_r0,
"orders"."cost" AS t1_r1,
"orders"."user_id" AS t1_r2,
"orders"."created_at" AS t1_r3,
"orders"."updated_at" AS t1_r4
FROM "users"
LEFT OUTER JOIN "orders"
ON "orders"."user_id" = "users"."id"
WHERE ( orders.cost < 20 )
I know that .includes is just a wrapper for two methods: eager_load and preload. I know that since my query above is doing a filter on a joined table (orders in this example), includes is smart and knows to pick the eager_load implementation over preload because preload cannot handle doing this query since preload does not join tables.
Here is where I am confused. Ok: So on that query above: under the hood includes will utilize the eager_load implementation. But notice how when I explicitly use eager_load for this same query (which is what includes is essentially doing): I do not need to use .references! It runs the query and loads the data just fine. No error and no deprecation warning:
# did not specify .references(:orders), and yet no error and no deprecation warning
users = User.eager_load(:orders).where("Orders.cost < ?", 20)
And it executes the exact same process with no problem:
SELECT "users"."id" AS t0_r0,
"users"."name" AS t0_r1,
"users"."created_at" AS t0_r2,
"users"."updated_at" AS t0_r3,
"orders"."id" AS t1_r0,
"orders"."cost" AS t1_r1,
"orders"."user_id" AS t1_r2,
"orders"."created_at" AS t1_r3,
"orders"."updated_at" AS t1_r4
FROM "users"
LEFT OUTER JOIN "orders"
ON "orders"."user_id" = "users"."id"
WHERE ( orders.cost < 20 )
That seems odd. Why does .references need to be specified for the includes version of the query, whereas .references does not need to be specified for the eager_load version of the query? What am I missing here?

It comes down to the problem they mention in the deprecation warning:
Currently, Active Record recognizes the table in the string, and knows to JOIN the comments table to the query, rather than loading comments in a separate query. However, doing this without writing a full-blown SQL parser is inherently flawed. Since we don't want to write an SQL parser, we are removing this functionality.
In older versions, Rails tried to be helpful about selecting the query pattern to use, and includes would use the preload strategy if it could, but switch to the eager_load strategy when it looks like you're referencing something in a joined table. But without a full SQL parser figuring out what tables are actually referenced, it's like parsing XHTML with a Regex - you can get some things done, but Rails can't decide correctly in every case. Consider:
User.includes(:orders).where("Orders.cost < 20")
This is a nice, simple example, and Rails could tell that you need Orders joined. Now try this one:
User.includes(:orders).where("id IN (select user_id from Orders where Orders.cost < 20)")
This gives the same result, but the subquery rendered joining Orders unnecessary. It's a contrived example, and I don't know whether Rails would decide the second query needed to join or not, but the point is there are cases when the heuristic could make the wrong decision. In those cases, either Rails would perform an unnecessary join, burning memory and slowing the query down, or not perform a necessary join, causing an error.
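To make that failure mode concrete, here is a minimal sketch of a string-scanning heuristic. It is a hypothetical stand-in written for illustration, not the code Rails actually used; the method name `tables_mentioned` and its regex are assumptions.

```ruby
# Hypothetical heuristic: report which candidate tables appear as "<table>."
# anywhere in a raw SQL fragment. This is NOT Rails' implementation.
def tables_mentioned(sql_fragment, candidate_tables)
  candidate_tables.select do |table|
    # Case-insensitive match on the table name followed by a dot.
    sql_fragment.match?(/\b#{Regexp.escape(table)}\./i)
  end
end

simple   = "Orders.cost < 20"
subquery = "id IN (select user_id from Orders where Orders.cost < 20)"

p tables_mentioned(simple,   %w[orders addresses]) # => ["orders"]
p tables_mentioned(subquery, %w[orders addresses]) # => ["orders"]
```

Both fragments look the same to the scanner, even though only the first one needs orders joined. Telling them apart would take real SQL parsing.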
Rather than maintain a heuristic with a pretty bad failure case, the developers decided to just ask the programmer whether the join is needed. You're able to get it right more often than Rails can (hopefully), and when you get it wrong, it's clear what to change.
Instead of adding references you could switch to eager_load, but keeping includes and references separate allows the implementation flexibility in its query pattern. You could conceivably .includes(:orders, :addresses).references(:orders) and have addresses loaded in a second preload-style query because it's not needed during the join (though Rails actually just includes addresses in the join anyway). You don't need to specify references when you're using eager_load because eager_load always joins, where preload always does multiple queries. All references does is instruct includes to use the necessary eager_load strategy and specify which tables are needed.
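As a rough summary of the rules above, here is a toy model in plain Ruby. It only covers the string-condition case discussed here; `strategy_for` and the symbols it returns are made up for illustration and do not mirror ActiveRecord's internals.

```ruby
# Toy model of which query pattern each loader ends up using.
# Hypothetical names; ActiveRecord's real decision logic is more involved.
def strategy_for(loader, references: [])
  case loader
  when :eager_load then :single_join_query   # always joins
  when :preload    then :separate_queries    # never joins
  when :includes
    # includes defers the choice: references tips it over to the join.
    references.empty? ? :separate_queries : :single_join_query
  else
    raise ArgumentError, "unknown loader: #{loader.inspect}"
  end
end

puts strategy_for(:includes)                        # separate_queries
puts strategy_for(:includes, references: [:orders]) # single_join_query
puts strategy_for(:eager_load)                      # single_join_query
```

In this model, references is only meaningful for includes, which matches why eager_load never asks for it.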

Related

How to attach raw SQL to an existing Rails ActiveRecord chain?

I have a rule builder that ultimately builds up ActiveRecord queries by chaining multiple where calls, like so:
Track.where("tracks.popularity < ?", 1).where("(audio_features ->> 'valence')::numeric between ? and ?", 2, 5)
Then, if someone wants to sort the results randomly, it would append order("random()").
However, given the table size, random() is extremely inefficient for ordering, so I need to use Postgres TABLESAMPLE-ing.
In a raw SQL query, that looks like this:
SELECT * FROM "tracks" TABLESAMPLE SYSTEM(0.1) LIMIT 250;
Is there some way to add that TABLESAMPLE SYSTEM(0.1) to the existing chain of ActiveRecord calls? Putting it inside a where() or order() doesn't work since it's not a WHERE or ORDER BY function.
irb(main):004:0> Track.from('"tracks" TABLESAMPLE SYSTEM(0.1)')
Track Load (0.7ms) SELECT "tracks".* FROM "tracks" TABLESAMPLE SYSTEM(0.1) LIMIT $1 [["LIMIT", 11]]
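Since from takes a raw string, a rule builder would probably want to validate what goes into it. A minimal sketch, assuming a hypothetical helper named `tablesample_from` (not part of Rails) that builds the FROM fragment:

```ruby
# Hypothetical helper: builds the FROM fragment for a sampled table.
# The name and the validation rules are assumptions for this sketch.
def tablesample_from(table_name, percent)
  table = table_name.to_s
  pct   = Float(percent) # raises ArgumentError on non-numeric input
  raise ArgumentError, "unexpected table name" unless table.match?(/\A[a-z_][a-z0-9_]*\z/i)
  raise ArgumentError, "percent must be in (0, 100]" unless pct > 0 && pct <= 100

  format('"%s" TABLESAMPLE SYSTEM(%s)', table, pct)
end

puts tablesample_from("tracks", 0.1)
# "tracks" TABLESAMPLE SYSTEM(0.1)

# In the app, the fragment would go at the start of the existing chain, e.g.:
#   Track.from(tablesample_from("tracks", 0.1)).where("tracks.popularity < ?", 1).limit(250)
```

Because the fragment keeps the table's own name, later where calls that mention tracks.popularity should still resolve.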

How to prevent SELECTing extra fields in a JOINed .includes()

I’m trying to implement parametrized grouping to a report. A simplified example of what I’m trying to achieve:
observation_query = Observation.includes(:reporter).order("reporters.name")
if params[:group_results]
  observation_query = observation_query
    .select("DATE(observations.created_at) AS created_at, AVG(value) AS value")
    .group("DATE(observations.created_at)", :reporter_id)
end
observation_query.each do |observation|
  puts "#{observation.reporter.name} #{observation.created_at}: #{observation.value}"
end
When grouping is not used, or if I remove the ordering, the results are as expected. But when both ordering and grouping are used, the query generated due to having to achieve the eager loading with JOINs is:
SELECT DATE(observations.updated_at) AS updated_at, AVG(value) AS value,
`observations`.`id` AS t0_r0,
`observations`.`value` AS t0_r1,
`observations`.`reporter_id` AS t0_r2,
...
`observations`.`created_at` AS t0_r6,
`observations`.`updated_at` AS t0_r7,
`reporters`.`id` AS t1_r0,
...
FROM `observations` INNER JOIN `reporters` ON `reporters`.`id` = `observations`.`user_id`
GROUP BY DATE(observations.created_at), `observations`.`reporter_id`
ORDER BY reporters.name
..which gives the MySQL error 'observations.id' isn't in GROUP BY. How do I prevent selection of columns which are not used for grouping?
I got it working with preload, which seems to work similarly to includes, with the difference that JOINs and SELECTs of the primary query are controlled manually.
observation_query = Observation.joins(:pulse).preload(:reporter).order("reporters.name")
if params[:group_results]
  observation_query = observation_query
    .select(:reporter_id, "DATE(observations.created_at) AS created_at, AVG(value) AS value")
    .group("DATE(observations.created_at)", :reporter_id)
end
The thing about this solution is that table reporters is queried twice, first JOINed for ordering and then a second query that SELECTs the values for filling the associated records. Because the equivalent of reporters.name is indexed in my actual case, this is good enough, but the optimal solution would generate a single query, so I’m not marking this as the answer.

Rails 4: Joins vs Includes: Why different results with nested association?

In Rails 4 app, I have two models:
Merchant has_many :offering_specials
OfferingSpecial belongs_to :merchant
I want to retrieve all merchants and their special offerings that are open (with the status_code: "OP")
I tried this:
@merchants = Merchant.joins(:offering_specials).where(offering_specials: { status_code: "OP" })
This is the query:
Merchant Load (0.4ms) SELECT `merchants`.* FROM `merchants` INNER JOIN `offering_specials` ON `offering_specials`.`merchant_id` = `merchants`.`id` WHERE `offering_specials`.`status_code` = 'OP'
But it retrieved all offering specials, both the open ("OP") and the pending ("PN").
However, using includes worked:
@merchants = Merchant.includes(:offering_specials).where(offering_specials: { status_code: "OP" })
This retrieved only the open offering specials. But look at the much slower query:
SQL (19.9ms) SELECT `merchants`.`id` AS t0_r0, `merchants`.`name` AS t0_r1, `merchants`.`slug` AS t0_r2, `merchants`.`url` AS t0_r3, `merchants`.`summary` AS t0_r4, `merchants`.`description` AS t0_r5, `merchants`.`active_for_display` AS t0_r6, `merchants`.`active_for_offerings_by_merchant` AS t0_r7, `merchants`.`active_for_offerings_by_legatocard` AS t0_r8, `merchants`.`credit_limit` AS t0_r9, `merchants`.`search_location_code` AS t0_r10, `merchants`.`image_file_name` AS t0_r11, `merchants`.`image_file_size` AS t0_r12, `merchants`.`image_content_type` AS t0_r13, `merchants`.`image_updated_at` AS t0_r14, `merchants`.`logo_file_name` AS t0_r15, `merchants`.`logo_file_size` AS t0_r16, `merchants`.`logo_content_type` AS t0_r17, `merchants`.`logo_updated_at` AS t0_r18, `merchants`.`created_at` AS t0_r19, `merchants`.`updated_at` AS t0_r20, `offering_specials`.`id` AS t1_r0, `offering_specials`.`special_number` AS t1_r1, `offering_specials`.`merchant_id` AS t1_r2, `offering_specials`.`merchant_user_id` AS t1_r3, `offering_specials`.`nonprofit_percentage` AS t1_r4, `offering_specials`.`discount_percentage` AS t1_r5, `offering_specials`.`start_at` AS t1_r6, `offering_specials`.`end_at` AS t1_r7, `offering_specials`.`closed_at` AS t1_r8, `offering_specials`.`max_dollar_amount_for_offering` AS t1_r9, `offering_specials`.`max_dollar_amount_per_buyer` AS t1_r10, `offering_specials`.`status_code` AS t1_r11, `offering_specials`.`created_at` AS t1_r12, `offering_specials`.`updated_at` AS t1_r13 FROM `merchants` LEFT OUTER JOIN `offering_specials` ON `offering_specials`.`merchant_id` = `merchants`.`id` WHERE `offering_specials`.`status_code` = 'OP'
How can I get this query to work with joins instead of includes?
Queries of this sort do not normally return associated records. You're requesting a list of Merchants, and that's what you get. When you subsequently request the associated OfferingSpecials of one of those Merchants, a new query is executed (which you should see in the logs), and you get all of them, because you did not specify otherwise. The code in your question does not include the place where you do this, but you must be doing it somewhere, in order to get the OfferingSpecials.
Using includes asks to eager-load the association, which means that it will be subject to the restrictions of the query, which is why you're seeing it work when you do it that way. It's slower because it's fetching those extra records for you now, instead of doing it separately later.
If you really do want to refactor this using .joins, you simply need to add the conditional to the line where you fetch the .offering_specials of the Merchant:
@merchants.each do |m|
  m.offering_specials.where(:status_code => 'OP')
end
However, you should read up on why eager loading exists before doing so. It is likely that either you are already getting better performance by doing one slower query vs. many fast ones, or that you will do so once the number of merchant records involved passes some threshold (which may or may not happen, depending on the nature of your app).
I've wanted leaner queries with .includes(...) as well, and have now released this feature as a part of a data-related gem I maintain, The Brick.
It overrides ActiveRecord::Associations::JoinDependency.apply_column_aliases() so that, when you add a .select(...), the select can act as a filter that chooses which column aliases get built out.
With gem 'brick' loaded, in order to enable this selective behaviour, add the special column name :_brick_eager_load as the first entry in your .select(...), which turns on the filtering of columns while the aliases are being built out. Here's an example from your merchant offerings data set:
@merchants = Merchant.includes(:offering_specials)
                     .references(:offering_specials)
                     .where(offering_specials: { status_code: "OP" })
                     .select(:_brick_eager_load, # Turns on the filtering
                             :name, :slug,
                             'offering_specials.discount_percentage',
                             'offering_specials.start_at', 'offering_specials.end_at', 'offering_specials.closed_at')
Hope it can save you both query time and some RAM!

Filter parents by child attribute, but eager-load all children

That title is a bit obtuse, so here's an example. Suppose we have a Rails 3 app with models Ship, Pirate, and Parrot. A ship has_many pirates, and a pirate has_many parrots.
Ship.includes(pirates: :parrots).where('parrots.name LIKE ?', '%polly%')
This returns ships having at least one pirate with at least one parrot whose name is like "polly". I would also like it to eager-load all of the pirates and parrots for those ships... but in reality only the pirates with matching parrots are eager-loaded, and among those, only the matching parrots are eager-loaded. The generated SQL is something like this:
SELECT ships.id AS t0_r0, ships.name AS t0_r1, pirates.id AS t1_r0, pirates.name AS t1_r1, parrots.id AS t2_r0, parrots.name AS t2_r1 FROM ships LEFT OUTER JOIN pirates ON pirates.ship_id = ships.id LEFT OUTER JOIN parrots ON parrots.pirate_id = pirates.id WHERE (parrots.name LIKE '%polly%')
When doing Ship.includes(pirates: :parrots) without the condition, ActiveRecord generates a bundle of queries that is somewhat closer to what I want:
SELECT ships.* FROM ships
SELECT pirates.* FROM pirates WHERE pirates.ship_id IN (ship IDs from previous query)
SELECT parrots.* FROM parrots WHERE parrots.pirate_id IN (pirate IDs from previous query)
If I could somehow change that first query to use the SQL from the first example, it would do exactly what I want:
SELECT ships.* FROM ships LEFT OUTER JOIN pirates ON pirates.ship_id = ships.id LEFT OUTER JOIN parrots ON parrots.pirate_id = pirates.id WHERE (parrots.name LIKE '%polly%')
SELECT pirates.* FROM pirates WHERE pirates.ship_id IN (ship IDs from previous query)
SELECT parrots.* FROM parrots WHERE parrots.pirate_id IN (pirate IDs from previous query)
But I'm not aware of any way to get ActiveRecord to do this, or any way to do it myself and "manually" wire up the eager-loading (which is necessary in my situation to avoid an N+1 query explosion). Any ideas or advice would be appreciated.
Ship.joins(pirates: :parrots).where('parrots.name LIKE ?', '%polly%').preload(pirates: :parrots)
requires rails 3+
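To see why splitting the filter (joins) from the loading (preload) gives the result the question asks for, here is an in-memory sketch in plain Ruby. The Structs and sample rows are invented for illustration and stand in for the tables; no ActiveRecord is involved.

```ruby
# Plain-Ruby stand-ins for the three tables (hypothetical sample data).
Ship   = Struct.new(:id, :name)
Pirate = Struct.new(:id, :ship_id, :name)
Parrot = Struct.new(:id, :pirate_id, :name)

ships   = [Ship.new(1, "Revenge"), Ship.new(2, "Walrus")]
pirates = [Pirate.new(1, 1, "Anne"), Pirate.new(2, 1, "Jack"), Pirate.new(3, 2, "Flint")]
parrots = [Parrot.new(1, 1, "polly"), Parrot.new(2, 2, "crackers"), Parrot.new(3, 3, "kiwi")]

# Step 1 (the joins + where part): keep ships with at least one matching parrot.
matching_pirate_ids = parrots.select { |p| p.name.include?("polly") }.map(&:pirate_id)
matching_ship_ids   = pirates.select { |p| matching_pirate_ids.include?(p.id) }.map(&:ship_id).uniq
selected_ships      = ships.select { |s| matching_ship_ids.include?(s.id) }

# Step 2 (the preload part): load ALL pirates and parrots of those ships,
# ignoring the name filter, and index them by foreign key.
crew_by_ship      = pirates.select { |p| matching_ship_ids.include?(p.ship_id) }.group_by(&:ship_id)
crew_ids          = crew_by_ship.values.flatten.map(&:id)
parrots_by_pirate = parrots.select { |p| crew_ids.include?(p.pirate_id) }.group_by(&:pirate_id)

selected_ships.each do |ship|
  crew_by_ship.fetch(ship.id, []).each do |pirate|
    names = parrots_by_pirate.fetch(pirate.id, []).map(&:name)
    puts "#{ship.name}: #{pirate.name} -> #{names.join(', ')}"
  end
end
```

Jack and his non-matching parrot are still loaded for the selected ship, which is the behaviour the question wanted. The .uniq in step 1 matters too: an INNER JOIN returns one row per match, so the real query may need .distinct to avoid duplicate ships.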
If INNER JOIN is what you're looking for, I think
Ship.includes(pirates: :parrots).where('parrots.name LIKE ?', '%polly%').joins(pirates: :parrots)
gets it done.

Joining a rails table with a large number of records - causing my app to hang

I have an actions table with over 450,000 records. I want to join the actions table on the users table (it actually joins two other tables, one of which is joined on the other, and the other being joined on the users table, before joining the actions table.) The sql query looks like this:
SELECT "users".* FROM "users" INNER JOIN "campaigns" ON "campaigns"."user_id" = "users"."id" INNER JOIN "books" ON "books"."campaign_id" = "campaigns"."id" INNER JOIN "actions" ON "actions"."book_id" = "books"."id" AND "actions"."type" IN ('Impression')
However, this query in rails causes my app to hang because of the large number of records in the actions table.
How should I be handling this?
There are several problems with this approach:
1. You are fetching the same users several times (users multiplied by number of actions).
2. You are fetching all the users with corresponding actions at once. That means big memory consumption and thus frequent garbage collection.
3. You are fetching all the attributes for users. I guess you do not need all of them.
4. You have made a comment about ordering users by their order count. Do you do this in Ruby code? If yes, then it's the 4th problem. A big problem, indeed.
So I'd propose to use group() method for grouping records or just plain SQL like
SELECT "users".id, count(*) as actions_cnt
FROM "users"
INNER JOIN "campaigns" ON "campaigns"."user_id" = "users"."id"
INNER JOIN "books" ON "books"."campaign_id" = "campaigns"."id"
INNER JOIN "actions" ON "actions"."book_id" = "books"."id" AND "actions"."type" IN ('Impression')
GROUP BY
"users".id
If there are many users in your app then I'd propose to add "OFFSET #{offset} LIMIT #{limit}" to fetch records in batches.
Finally, you can directly specify the columns that you need so that the memory footprint is smaller.
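A minimal sketch of the batching idea in plain Ruby. `batch_windows` and `paged_sql` are hypothetical helper names, and the ORDER BY is my addition: without a stable order, paging with OFFSET can skip or repeat rows.

```ruby
# Returns [offset, limit] pairs that cover `total` rows in windows of `batch_size`.
def batch_windows(total, batch_size)
  total      = Integer(total)
  batch_size = Integer(batch_size)
  raise ArgumentError, "batch_size must be positive" unless batch_size.positive?

  (0...total).step(batch_size).map { |offset| [offset, [batch_size, total - offset].min] }
end

# Builds the grouped query for one window. Integer() ensures that only
# numbers are ever interpolated into the SQL string.
def paged_sql(offset, limit)
  <<~SQL
    SELECT "users".id, count(*) AS actions_cnt
    FROM "users"
    INNER JOIN "campaigns" ON "campaigns"."user_id" = "users"."id"
    INNER JOIN "books" ON "books"."campaign_id" = "campaigns"."id"
    INNER JOIN "actions" ON "actions"."book_id" = "books"."id" AND "actions"."type" IN ('Impression')
    GROUP BY "users".id
    ORDER BY "users".id
    LIMIT #{Integer(limit)} OFFSET #{Integer(offset)}
  SQL
end

p batch_windows(5, 2) # => [[0, 2], [2, 2], [4, 1]]
batch_windows(5, 2).each { |offset, limit| puts paged_sql(offset, limit) }
```

Each generated statement could then be run one at a time, so only one batch of rows is held in memory.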
