So here's the lay of the land:
I have an Applicant model which has_many Lead records.
I need to group leads by applicant email, i.e. for each specific applicant email (there may be 2+ applicant records with the same email) I need to get a combined list of leads.
I already have this working with an in-memory / N+1 solution, but I want to do this in a single query if possible. Right now I'm running one query for each lead, which is maxing out the CPU.
Here's my attempt right now:
Lead.
  all.
  select("leads.*, applicants.*").
  joins(:applicant).
  group("applicants.email").
  having("count(*) > 1").
  limit(1).
  to_a
And the error:
Lead Load (1.2ms) SELECT leads.*, applicants.* FROM "leads" INNER JOIN "applicants" ON "applicants"."id" = "leads"."applicant_id" GROUP BY applicants.email HAVING count(*) > 1 LIMIT 1

ActiveRecord::StatementInvalid: PG::GroupingError: ERROR: column "leads.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT leads.*, applicants.* FROM "leads" INNER JOIN "appli...
This is a Postgres-specific issue: the selected fields must appear in the GROUP BY clause or be used in an aggregate function.
You can try this:

Lead.joins(:applicant)
    .select('leads.*, applicants.email')
    .group('applicants.email, leads.id, ...')
You will need to list all the fields of the leads table in the GROUP BY clause (or at least all of the fields that you are selecting).
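A sketch of what that looks like, assuming leads.id is the table's primary key (on PostgreSQL 9.1+, grouping by the primary key is enough to license selecting leads.*; on older versions you really do have to spell out every column). Note each lead then forms its own group, so this clears the error rather than collapsing rows per email:

Lead.joins(:applicant)
    .select("leads.*, applicants.email")
    .group("leads.id, applicants.email")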
I would just get all the records and do the grouping in memory. If you have a lot of records, I would paginate them or batch them.
group_by_email = Hash.new { |h, k| h[k] = [] }

Applicant.includes(:leads).find_in_batches(batch_size: 10_000) do |batch|
  batch.each do |applicant|
    group_by_email[applicant.email].concat(applicant.leads)
  end
end
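Alternatively, one joined query plus Ruby's Enumerable#group_by avoids the N+1 entirely; a minimal sketch, where applicant_email is an alias I'm introducing:

leads_by_email = Lead.joins(:applicant)
                     .select("leads.*, applicants.email AS applicant_email")
                     .group_by(&:applicant_email)
# => { "a@example.com" => [#<Lead ...>, #<Lead ...>], ... }

This still loads every lead into memory, but in a single round trip rather than one query per record.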
You need to use a .where rather than Lead.all. The reason it is maxing out the CPU is that you are trying to load every lead into memory at once. That said, I'm still missing what you actually want back from the query, so it's tough for me to help you write it. Can you give more info about your associations and the expected result of the query?
Say you're creating an IMDb-type site for TV shows. You have a Show with many attached episodes, and a bunch of people.
Right now I link people to episodes through a contributions table, but if I want to make a list of all the shows a person is on, I have to go through episodes.
Since this query takes a long time, I was thinking about adding show_id to the contributions table. Is this common practice to increase performance, or is there another way I haven't thought of?
"Since this query takes a long time"

Have you run a SQL EXPLAIN plan to see why that is the case? What is the actual SQL query being run, and are you doing things like ordering or running subqueries within it?
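For reference, since Rails 3.2 you can get the plan straight from a relation without dropping into psql; a minimal sketch, assuming Contribution belongs_to :episode and a sample person id:

puts Contribution.joins(:episode).where(person_id: 1).explain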
If I understand your structure it is something like this:
|people| 1---n |contributions| n---1 |episodes| n---1 |shows|
A SQL select of the sort:

select distinct s.name
from shows s,
     episodes e,
     contributions c
where c.person_id = <id>
  and c.episode_id = e.id
  and e.show_id = s.id
should really not have performance issues unless there are no indexes on the tables or the tables are massive.
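If the indexes are missing, a migration along these lines would add them (a sketch; table and column names are assumptions based on the diagram above):

class AddContributionIndexes < ActiveRecord::Migration
  def change
    # speed up lookups by person and the join to episodes
    add_index :contributions, :person_id
    add_index :contributions, :episode_id
    # speed up the join from episodes to shows
    add_index :episodes, :show_id
  end
end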
Here's a way using where id in ( ... ) to select all the shows a specific person appeared in:

Show.where(id: Contribution.select("episodes.show_id")
                           .joins(:episode)
                           .where(person_id: person_id)
                           .group("episodes.show_id"))
You may also want to try EXISTS:

Show.where("EXISTS(SELECT 1 FROM contributions c
                   JOIN episodes e ON e.id = c.episode_id
                   WHERE c.person_id = ? AND e.show_id = shows.id)", person_id)
I'm trying to find the best way to count the number of Users who have one (or many) instances of a has_many relation.
For example, User has_many :bank_accounts and :credit_accounts (and a few other relations). I want to find the number of unique Users who have at least one bank_account and at least one credit_account, and ideally implement this inside of a scope so I can run where queries on it.
At the moment I'm implementing it (poorly) using the following code:
(BankAccount.select(:user_id).uniq + CreditAccount.select(:user_id) + ...).uniq.count
I've played around a lot with joins, for example different forms of User.joins(:bank_accounts, :credit_accounts).uniq('users.id').count, but I don't appear to be getting any results.
Any help would be greatly appreciated, thanks!
If you're fine with using plain SQL, you can use the query below (INTERSECT keeps only the user_ids that appear in both tables):

select distinct user_id
from (select user_id from bank_accounts
      intersect
      select user_id from credit_accounts) a;
I am not sure if a rails way exists for this.
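For what it's worth, a Rails-side equivalent does exist; a sketch, assuming both tables carry a user_id column, using two IN subqueries that are ANDed together:

User.where(id: BankAccount.select(:user_id))
    .where(id: CreditAccount.select(:user_id))
    .count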
In this case all we need is an INNER JOIN of users with credit_accounts and bank_accounts.
User.joins(:credit_accounts, :bank_accounts).uniq.count
The above query works for me. The SQL it generates is below:

"SELECT DISTINCT COUNT(DISTINCT `users`.`id`) FROM `users` INNER JOIN `credit_accounts` ON `credit_accounts`.`user_id` = `users`.`id` INNER JOIN `bank_accounts` ON `bank_accounts`.`user_id` = `users`.`id`"
I couldn't think of a better way to refactor the code below (see this question), though I know it's very ugly. However, it's throwing a Postgres error (though not with SQLite):
ActiveRecord::StatementInvalid:
PG::Error: ERROR: column "articles.id" must appear in the GROUP BY clause or be used in an aggregate function
The query itself is:
SELECT "articles".*
FROM "articles"
WHERE "articles"."user_id" = 1
GROUP BY publication
Which comes from the following view code:
= @user.articles.group(:publication).map do |p|
  = p.publication
  = @user.articles.where("publication = ?", p.publication).sum(:twitter_count)
  = @user.articles.where("publication = ?", p.publication).sum(:facebook_count)
  = @user.articles.where("publication = ?", p.publication).sum(:linkedin_count)
In SQLite, this gives the output (e.g.) NYT 12 18 14 BBC 45 46 47 CNN 75 54 78, which is pretty much what I need.
How can I improve the code to remove this error?
When using GROUP BY you cannot SELECT fields that are not either part of the GROUP BY or used in an aggregate function. This is specified by the SQL standard, though some databases choose to execute such queries anyway. Since there's no single correct way to execute such a query they tend to just pick the first row they find and return that, so results will vary unpredictably.
It looks like you're trying to say:
"For each publication get me the sum of the twitter, facebook and linkedin counts for that publication".
If so, you could write:
SELECT publication,
sum(twitter_count) AS twitter_sum,
sum(linkedin_count) AS linkedin_sum,
sum(facebook_count) AS facebook_sum
FROM "articles"
WHERE "articles"."user_id" = 1
GROUP BY publication;
Translating that into ActiveRecord/Rails is up to you; I don't use it. It looks like it's pretty much what you tried to write, but ActiveRecord seems to be mangling it, perhaps trying to execute the sums locally.
Craig's answer explains the issue well. Active Record will select * by default, but you can override it easily:
@user.articles.select("publication, sum(twitter_count) as twitter_count").group(:publication).each do |row|
  p row.publication   # "BBC"
  p row.twitter_count # 45
end
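The same trick extends to all three counts the view needs, replacing the three extra queries per publication with a single one (a sketch; the *_sum aliases are mine):

totals = @user.articles
              .select("publication,
                       sum(twitter_count)  AS twitter_sum,
                       sum(facebook_count) AS facebook_sum,
                       sum(linkedin_count) AS linkedin_sum")
              .group(:publication)

totals.each do |row|
  puts [row.publication, row.twitter_sum, row.facebook_sum, row.linkedin_sum].join(" ")
end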
I have a model of Widgets. Widgets belong to a Store model, which belongs to an Area model, which belongs to a Company. At the Company model, I need to find all associated widgets. Easy:
class Widget < ActiveRecord::Base
  def self.in_company(company)
    includes(:store => {:area => :company}).where(:companies => {:id => company.id})
  end
end
Which will generate this beautiful query:
> Widget.in_company(Company.first).count
SQL (50.5ms) SELECT COUNT(DISTINCT "widgets"."id") FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1
=> 15088
But I later need to use this scope in a more complex scope. The problem is that AR is expanding the query by selecting individual fields, which fails in PG because the selected fields must appear in the GROUP BY clause or in an aggregate function.
Here is the more complex scope.
def self.sum_amount_chart_series(company, start_time)
  Widget.in_company(company).archived.not_void.
    where(:print_datetime => start_time.beginning_of_day..Time.zone.now.end_of_day).
    group(pg_print_date_group).
    select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
end

def self.pg_print_date_group
  "CAST((print_datetime + interval '#{tz_offset_hours} hours') AS date)"
end
And this is the select it is throwing at PG:
> Widget.sum_amount_chart_series(Company.first, 1.day.ago)
SELECT "widgets"."id" AS t0_r0, "widgets"."user_id" AS t0_r1,<...BIG SNIP, YOU GET THE IDEA...> FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1 AND "widgets"."archived" = 't' AND "widgets"."voided" = 'f' AND ("widgets"."print_datetime" BETWEEN '2011-04-24 00:00:00.000000' AND '2011-04-25 23:59:59.999999') GROUP BY CAST((print_datetime + interval '-7 hours') AS date)
Which generates this error:
PGError: ERROR: column "widgets.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT "widgets"."id" AS t0_r0, "widgets"."user_id...
How do I rewrite the Widget.in_company scope so that AR does not expand the select query to include every Widget model field?
As Frank explained, PostgreSQL will reject any query which doesn't return a reproducible set of rows.
Suppose you've a query like:
select a, b, agg(c)
from tbl
group by a
PostgreSQL will reject it because b is left unspecified in the group by statement. Run that in MySQL, by contrast, and it will be accepted. In the latter case, however, fire up a few inserts, updates and deletes, and the order of the rows on disk pages ends up different.
If memory serves, MySQL's implementation will actually sort by a, b and return the first b in each group. But as far as the SQL standard is concerned, the behavior is unspecified, and sure enough, PostgreSQL does not always sort before running aggregate functions.
Potentially, this might result in different values of b in the result set, and thus PostgreSQL yields an error unless you're more specific:
select a, b, agg(c)
from tbl
group by a, b
What Frank highlighted is that, in PostgreSQL 9.1, if a is the primary key, then you can leave b unspecified: the planner has been taught to ignore additional GROUP BY fields when an applicable primary key implies a unique row.
For your problem in particular, you need to keep your GROUP BY expression as you currently have it, and add every selected field that isn't aggregated, i.e. "widgets"."id", "widgets"."user_id", [snip], but not things like sum(amount), which are the aggregate function calls.
As an off-topic side note, I'm not sure how your ORM/model works, but the SQL it's generating isn't optimal. Many of those LEFT OUTER JOINs look like they should be INNER JOINs; that would allow the planner to pick an appropriate join order where applicable.
PostgreSQL version 9.1 (beta at this moment) might fix your problem, but only if there is a functional dependency on the primary key.
From the release notes:
Allow non-GROUP BY columns in the query target list when the primary key is specified in the GROUP BY clause (Peter Eisentraut)

Some other database systems already allowed this behavior, and because of the primary key, the result is unambiguous.
You could run a test and see if it fixes your problem. If you can wait for the production release, this can fix the problem without changing your code.
Firstly, simplify your life by storing all dates in a standard time zone; converting to the user's time zone should really be done in the view as a user convenience. This alone should save you a lot of pain.
If you're already in production, write a migration to create a normalised_date column wherever it would be helpful.
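A sketch of such a migration (the column name and the -7 hour offset are assumptions lifted from the query above):

class AddNormalisedDateToWidgets < ActiveRecord::Migration
  def up
    add_column :widgets, :normalised_date, :date
    # backfill using the same expression the query currently computes on the fly
    execute "UPDATE widgets SET normalised_date = CAST((print_datetime + interval '-7 hours') AS date)"
    add_index :widgets, :normalised_date
  end

  def down
    remove_column :widgets, :normalised_date
  end
end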
I propose that the other problem here is the use of raw SQL, which Rails won't poke around in for you. To avoid this, try the gem called Squeel (aka MetaWhere 2): http://metautonomo.us/projects/squeel/
If you use it, you should be able to remove the hard-coded SQL and let Rails get back to doing its magic.
For example:
.select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
becomes (once you remove the need for normalising the date):
.select{sum(amount).as(total_amount)}
Sorry to answer my own question, but I figured it out.
First, let me apologize to those who thought I might be having an SQL or Postgres issue, it is not. The issue is with ActiveRecord and the SQL it is generating.
The answer is... use .joins instead of .includes. So I just changed the line in the top code and it works as expected.
class Widget < ActiveRecord::Base
  def self.in_company(company)
    joins(:store => {:area => :company}).where(:companies => {:id => company.id})
  end
end
I'm guessing that when using .includes, ActiveRecord tries to be smart and use JOINs in the SQL, but it isn't smart enough for this particular case and was generating that ugly SQL to select every associated column.
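For comparison, .joins leaves the SELECT clause alone, so the later .group/.select can override it cleanly; a sketch of what it produces (SQL abridged by hand, not console output):

Widget.in_company(Company.first).to_sql
# => SELECT "widgets".* FROM "widgets"
#    INNER JOIN "stores" ON "stores"."id" = "widgets"."store_id"
#    INNER JOIN "areas" ON "areas"."id" = "stores"."area_id"
#    INNER JOIN "companies" ON "companies"."id" = "areas"."company_id"
#    WHERE "companies"."id" = 1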
However, all the replies have taught me quite a bit about Postgres that I did not know, so thank you very much.
Sort in MySQL:
> ids = [11,31,29]
=> [11, 31, 29]
> Page.where(id: ids).order("field(id, #{ids.join(',')})")
In Postgres:

def self.order_by_ids(ids)
  order_by = ["CASE"]
  ids.each_with_index do |id, index|
    order_by << "WHEN id='#{id}' THEN #{index}"
  end
  order_by << "END"
  order(order_by.join(" "))
end
User.where(:id => [3,2,1]).order_by_ids([3,2,1]).map(&:id)
#=> [3,2,1]
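On PostgreSQL 9.5+ there's a shorter route than the CASE expression: the built-in array_position function. A sketch of that variant (on Rails 5.2+ raw order SQL needs the Arel.sql wrapper):

def self.order_by_ids(ids)
  # array_position returns each id's index inside the given array
  order(Arel.sql("array_position(ARRAY[#{ids.map(&:to_i).join(',')}], id)"))
end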