How to prevent SELECTing extra fields in a JOINed .includes() - ruby-on-rails

I’m trying to implement parametrized grouping to a report. A simplified example of what I’m trying to achieve:
observation_query = Observation.includes(:reporter).order("reporters.name")
if params[:group_results]
observation_query = observation_query
.select("DATE(observations.created_at) AS created_at, AVG(value) AS value")
.group("DATE(observations.created_at)", :reporter_id)
end
observation_query.each do |observation|
puts "#{observation.reporter.name} #{observation.created_at}: #{observation.value}"
end
When grouping is not used, or if I remove the ordering, the results are as expected. But when both ordering and grouping are used, the query generated due to having to achieve the eager loading with JOINs is:
SELECT DATE(observations.updated_at) AS updated_at, AVG(value) AS value,
`observations`.`id` AS t0_r0,
`observations`.`value` AS t0_r1,
`observations`.`reporter_id` AS t0_r2,
...
`observations`.`created_at` AS t0_r6,
`observations`.`updated_at` AS t0_r7,
`reporters`.`id` AS t1_r0,
...
FROM `observations` INNER JOIN `reporters` ON `reporters`.`id` = `observations`.`user_id`
GROUP BY DATE(observations.created_at), `observations`.`reporter_id`
ORDER BY reporters.name
..which gives the MySQL error 'observations.id' isn't in GROUP BY. How do I prevent selection of columns which are not used for grouping?

I got it working with preload, which seems to work similarly to includes, with the difference that JOINs and SELECTs of the primary query are controlled manually.
observation_query = Observation.joins(:pulse).preload(:reporter).order("reporters.name")
if params[:group_results]
observation_query = observation_query
.select(:reporter_id, "DATE(observations.created_at) AS created_at, AVG(value) AS value")
.group("DATE(observations.created_at)", :reporter_id)
end
The thing about this solution is that table reporters is queried twice, first JOINed for ordering and then a second query that SELECTs the values for filling the associated records. Because the equivalent of reporters.name is indexed in my actual case, this is good enough, but the optimal solution would generate a single query, so I’m not marking this as the answer.

Related

Properly format an ActiveRecord query with a subquery in Postgres

I have a working SQL query for Postgres v10.
SELECT *
FROM
(
SELECT DISTINCT ON (title) products.title, products.*
FROM "products"
) subquery
WHERE subquery.active = TRUE AND subquery.product_type_id = 1
ORDER BY created_at DESC
With the goal of the query to do a distinct based on the title column, then filter and order them. (I used the subquery in the first place, as it seemed there was no way to combine DISTINCT ON with ORDER BY without a subquery.
I am trying to express said query in ActiveRecord.
I have been doing
Product.select("*")
.from(Product.select("DISTINCT ON (product.title) product.title, meals.*"))
.where("subquery.active IS true")
.where("subquery.meal_type_id = ?", 1)
.order("created_at DESC")
and, that works! But, it's fairly messy with the string where clauses in there. Is there a better way to express this query with ActiveRecord/Arel, or am I just running into the limits of what ActiveRecord can express?
I think the resulting ActiveRecord call can be improved.
But I would start improving with original SQL query first.
Subquery
SELECT DISTINCT ON (title) products.title, products.* FROM products
(I think that instead of meals there should be products?) has duplicate products.title, which is not necessary there. Worse, it misses ORDER BY clause. As PostgreSQL documentation says:
Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first
I would rewrite sub-query as:
SELECT DISTINCT ON (title) * FROM products ORDER BY title ASC
which gives us a call:
Product.select('DISTINCT ON (title) *').order(title: :asc)
In main query where calls use Rails-generated alias for the subquery. I would not rely on Rails internal convention on aliasing subqueries, as it may change anytime. If you do not take this into account you could merge these conditions in one where call with hash-style argument syntax.
The final result:
Product.select('*')
.from(Product.select('DISTINCT ON (title) *').order(title: :asc))
.where(subquery: { active: true, meal_type_id: 1 })
.order('created_at DESC')

ActiveRecord query searching for duplicates on a column, but returning associated records

So here's the lay of the land:
I have a Applicant model which has_many Lead records.
I need to group leads by applicant email, i.e. for each specific applicant email (there may be 2+ applicant records with the email) i need to get a combined list of leads.
I already have this working using an in-memory / N+1 solution
I want to do this in a single query, if possible. Right now I'm running one for each lead which is maxing out the CPU.
Here's my attempt right now:
Lead.
all.
select("leads.*, applicants.*").
joins(:applicant).
group("applicants.email").
having("count(*) > 1").
limit(1).
to_a
And the error:
Lead Load (1.2ms) SELECT leads.*, applicants.* FROM "leads" INNER
JOIN "applicants" ON "applicants"."id" = "leads"."applicant_id"
GROUP BY applicants.email HAVING count(*) > 1 LIMIT 1
ActiveRecord::StatementInvalid: PG::GroupingError: ERROR: column
"leads.id" must appear in the GROUP BY clause or be used in an
aggregate function
LINE 1: SELECT leads.*, applicants.* FROM "leads" INNER JOIN
"appli...
This is a postgres specific issue. "the selected fields must appear in the GROUP BY clause".
must appear in the GROUP BY clause or be used in an aggregate function
You can try this
Lead.joins(:applicant)
.select('leads.*, applicants.email')
.group_by('applicants.email, leads.id, ...')
You will need to list all the fields in leads table in the group by clause (or all the fields that you are selecting).
I would just get all the records and do the grouping in memory. If you have a lot of records, I would paginate them or batch them.
group_by_email = Hash.new { |h, k| h[k] = [] }
Applicant.eager_load(:leads).each_batch(10_000) do |batch|
batch.each do |applicant|
group_by_email[:applicant.email] << applicant.leads
end
end
You need to use a .where rather than using Lead.all. The reason it is maxing out the CPU is you are trying to load every lead into memory at once. That said I guess I am still missing what you actually want back from the query so it would be tough for me to help you write the query. Can you give more info about your associations and the expected result of the query?

ActiveRecord using pluck with includes/left outer joins

When I do includes it left joins the table I want to filter on, but when I add pluck that join disappears. Is there any way to mix pluck and left join without manually typing the sql for 'left join'
Here's my case:
Select u.id
From users u
Left join profiles p on u.id=p.id
Left join admin_profiles a on u.id=a.uid
Where 2 in (p.prop, a.prop, u.prop)
Doing this is just loading all the values:
Users.includes(:AdminProfiles, :Profiles).where(...).map{ |a| a[:id] }
But when I do pluck instead of map, it doesn't left join the profile tables.
Your problem is that you're using includes which doesn't really do a join, instead it fires a second query after the first one to query for the associations, in your case you want them both to be actually joined, so for that replace includes(:something) with joins(:something) and every thing should work fine.
Replying to your comment, i'm gonna quote few parts from the rails guide about active record query interface
From the section Solution to N + 1 queries problem
clients = Client.includes(:address).limit(10)
clients.each do |client|
puts client.address.postcode
end
The above code will execute just 2 queries, as opposed to 11 queries in the previous case:
SELECT * FROM clients LIMIT 10
SELECT addresses.* FROM addresses WHERE (addresses.client_id IN (1,2,3,4,5,6,7,8,9,10))
as you can see, two queries, no joins at all.
From the section Specifying Conditions on Eager Loaded Associations link
Even though Active Record lets you specify conditions on the eager loaded associations just like joins, the recommended way is to use joins instead.
Then an example:
Article.includes(:comments).where(comments: { visible: true })
This would generate a query which contains a LEFT OUTER JOIN whereas the joins method would generate one using the INNER JOIN function instead.
SELECT "articles"."id" AS t0_r0, ... "comments"."updated_at" AS t1_r5 FROM "articles" LEFT OUTER JOIN "comments" ON "comments"."article_id" = "articles"."id" WHERE (comments.visible = 1)
If there was no where condition, this would generate the normal set of two queries.

Filter parents by child attribute, but eager-load all children

That title is a bit obtuse, so here's an example. Suppose we have a Rails 3 app with models Ship, Pirate, and Parrot. A ship has_many pirates, and a pirate has_many parrots.
Ship.includes(pirates: :parrots).where('parrots.name LIKE ?', '%polly%')
This returns ships having at least one pirate with at least one parrot whose name is like "polly". I would also like it to eager-load all of the pirates and parrots for those ships... but in reality only the pirates with matching parrots are eager-loaded, and among those, only the matching parrots are eager-loaded. The generated SQL is something like this:
SELECT ships.id AS t0_r0, ships.name AS t0_r1, pirates.id AS t1_r0, pirates.name AS t1_r1, parrots.id AS t2_r0, parrots.name AS t2_r1 FROM ships LEFT OUTER JOIN pirates ON pirates.ship_id = ships.id LEFT OUTER JOIN parrots ON parrots.pirate_id = pirates.id WHERE (parrots.name LIKE '%polly%')
When doing Ship.includes(pirates: :parrots) without the condition, ActiveRecord generates a bundle of queries that is somewhat closer to what I want:
SELECT ships.* FROM ships
SELECT pirates.* FROM pirates WHERE pirates.ship_id IN (ship IDs from previous query)
SELECT parrots.* FROM parrots WHERE parrots.pirate_id IN (pirate IDs from previous query)
If I could somehow change that first query to use the SQL from the first example, it would do exactly what I want:
SELECT ships.* FROM ships LEFT OUTER JOIN pirates ON pirates.ship_id = ships.id LEFT OUTER JOIN parrots ON parrots.pirate_id = pirates.id WHERE (parrots.name LIKE '%polly%')
SELECT pirates.* FROM pirates WHERE pirates.ship_id IN (ship IDs from previous query)
SELECT parrots.* FROM parrots WHERE parrots.pirate_id IN (pirate IDs from previous query)
But I'm not aware of any way to get ActiveRecord to do this, or any way to do it myself and "manually" wire up the eager-loading (which is necessary in my situation to avoid an N+1 query explosion). Any ideas or advice would be appreciated.
Ship.joins(pirates: :parrots).where('parrots.name LIKE ?', '%polly%').preload(pirates: :parrots)
requires rails 3+
If INNER JOIN is what you're looking for, I think
Ship.includes(pirates: :parrots).where('parrots.name LIKE ?', '%polly%').joins(pirates: :parrots)
gets it done.

Rails expanding fields with scope, PG does not like it

I have a model of Widgets. Widgets belong to a Store model, which belongs to an Area model, which belongs to a Company. At the Company model, I need to find all associated widgets. Easy:
class Widget < ActiveRecord::Base
def self.in_company(company)
includes(:store => {:area => :company}).where(:companies => {:id => company.id})
end
end
Which will generate this beautiful query:
> Widget.in_company(Company.first).count
SQL (50.5ms) SELECT COUNT(DISTINCT "widgets"."id") FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1
=> 15088
But, I later need to use this scope in more complex scope. The problem is that AR is expanding the query by selecting individual fields, which fails in PG because selected fields must in the GROUP BY clause or the aggregate function.
Here is the more complex scope.
def self.sum_amount_chart_series(company, start_time)
orders_by_day = Widget.in_company(company).archived.not_void.
where(:print_datetime => start_time.beginning_of_day..Time.zone.now.end_of_day).
group(pg_print_date_group).
select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
end
def self.pg_print_date_group
"CAST((print_datetime + interval '#{tz_offset_hours} hours') AS date)"
end
And this is the select it is throwing at PG:
> Widget.sum_amount_chart_series(Company.first, 1.day.ago)
SELECT "widgets"."id" AS t0_r0, "widgets"."user_id" AS t0_r1,<...BIG SNIP, YOU GET THE IDEA...> FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1 AND "widgets"."archived" = 't' AND "widgets"."voided" = 'f' AND ("widgets"."print_datetime" BETWEEN '2011-04-24 00:00:00.000000' AND '2011-04-25 23:59:59.999999') GROUP BY CAST((print_datetime + interval '-7 hours') AS date)
Which generates this error:
PGError: ERROR: column
"widgets.id" must appear in the
GROUP BY clause or be used in an
aggregate function LINE 1: SELECT
"widgets"."id" AS t0_r0,
"widgets"."user_id...
How do I rewrite the Widget.in_company scope so that AR does not expand the select query to include every Widget model field?
As Frank explained, PostgreSQL will reject any query which doesn't return a reproducible set of rows.
Suppose you've a query like:
select a, b, agg(c)
from tbl
group by a
PostgreSQL will reject it because b is left unspecified in the group by statement. Run that in MySQL, by contrast, and it will be accepted. In the latter case, however, fire up a few inserts, updates and deletes, and the order of the rows on disk pages ends up different.
If memory serves, implementation details are so that MySQL will actually sort by a, b and return the first b in the set. But as far as the SQL standard is concerned, the behavior is unspecified -- and sure enough, PostgreSQL does not always sort before running aggregate functions.
Potentially, this might result in different values of b in result set in PostgreSQL. And thus, PostgreSQL yields an error unless you're more specific:
select a, b, agg(c)
from tbl
group by a, b
What Frank highlighted is that, in PostgreSQL 9.1, if a is the primary key, than you can leave b unspecified -- the planner has been taught to ignore subsequent group by fields when applicable primary keys imply a unique row.
For your problem in particular, you need to specify your group by as you currently do, plus every field that you're basing your aggregate onto, i.e. "widgets"."id", "widgets"."user_id", [snip] but not stuff like sum(amount), which are the aggregate function calls.
As an off topic side note, I'm not sure how your ORM/model works but the SQL it's generating isn't optimal. Many of those left outer joins seem like they should be inner joins. This will result in allowing the planner to pick an appropriate join order where applicable.
PostgreSQL version 9.1 (beta at this moment) might fix your problem, but only if there is a functional dependency on the primary key.
From the release notes:
Allow non-GROUP BY columns in the
query target list when the primary key
is specified in the GROUP BY clause
(Peter Eisentraut)
Some other database system already
allowed this behavior, and because of
the primary key, the result is
unambiguous.
You could run a test and see if it fixes your problem. If you can wait for the production release, this can fix the problem without changing your code.
Firstly simplify your life by storing all dates in a standard time-zone. Changing dates with time-zones should really be done in the view as a user convenience. This alone should save you a lot of pain.
If you're already in production write a migration to create a normalised_date column wherever it would be helpful.
nrI propose that the other problem here is the use of raw SQL, which rails won't poke around for you. To avoid this try using the gem called Squeel (aka Metawhere 2) http://metautonomo.us/projects/squeel/
If you use this you should be able to remove hard coded SQL and let rails get back to doing its magic.
For example:
.select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
becomes (once your remove the need for normalising the date):
.select{sum(amount).as(total_amount)}
Sorry to answer my own question, but I figured it out.
First, let me apologize to those who thought I might be having an SQL or Postgres issue, it is not. The issue is with ActiveRecord and the SQL it is generating.
The answer is... use .joins instead of .includes. So I just changed the line in the top code and it works as expected.
class Widget < ActiveRecord::Base
def self.in_company(company)
joins(:store => {:area => :company}).where(:companies => {:id => company.id})
end
end
I'm guessing that when using .includes, ActiveRecord is trying to be smart and use JOINS in the SQL, but it's not smart enough for this particular case and was generating that ugly SQL to select all associated columns.
However, all the replies have taught me quite a bit about Postgres that I did not know, so thank you very much.
sort in mysql:
> ids = [11,31,29]
=> [11, 31, 29]
> Page.where(id: ids).order("field(id, #{ids.join(',')})")
in postgres:
def self.order_by_ids(ids)
order_by = ["case"]
ids.each_with_index.map do |id, index|
order_by << "WHEN id='#{id}' THEN #{index}"
end
order_by << "end"
order(order_by.join(" "))
end
User.where(:id => [3,2,1]).order_by_ids([3,2,1]).map(&:id)
#=> [3,2,1]

Resources