ActiveRecord query searching for duplicates on a column, but returning associated records - ruby-on-rails

So here's the lay of the land:
I have an Applicant model which has_many Lead records.
I need to group leads by applicant email, i.e. for each distinct applicant email (there may be 2+ applicant records with the same email) I need to get a combined list of leads.
I already have this working using an in-memory / N+1 solution.
I want to do this in a single query, if possible. Right now I'm running one query for each lead, which is maxing out the CPU.
Here's my attempt right now:
Lead.
all.
select("leads.*, applicants.*").
joins(:applicant).
group("applicants.email").
having("count(*) > 1").
limit(1).
to_a
And the error:
Lead Load (1.2ms) SELECT leads.*, applicants.* FROM "leads" INNER
JOIN "applicants" ON "applicants"."id" = "leads"."applicant_id"
GROUP BY applicants.email HAVING count(*) > 1 LIMIT 1
ActiveRecord::StatementInvalid: PG::GroupingError: ERROR: column
"leads.id" must appear in the GROUP BY clause or be used in an
aggregate function
LINE 1: SELECT leads.*, applicants.* FROM "leads" INNER JOIN
"appli...

This is a Postgres-specific issue: every selected column must appear in the GROUP BY clause or be used in an aggregate function.
You can try this
Lead.joins(:applicant)
  .select('leads.*, applicants.email')
  .group('applicants.email, leads.id, ...')
You will need to list every column of the leads table in the group clause (or every column that you are selecting).
I would just get all the records and do the grouping in memory. If you have a lot of records, I would paginate them or batch them.
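group_by_email = Hash.new { |h, k| h[k] = [] }
# find_in_batches loads 10,000 applicants at a time; includes preloads their leads
Applicant.includes(:leads).find_in_batches(batch_size: 10_000) do |batch|
  batch.each do |applicant|
    # concat keeps one flat list of leads per email
    group_by_email[applicant.email].concat(applicant.leads.to_a)
  end
end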
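For reference, the result the question asks for (one combined list of leads per applicant email, restricted to emails shared by two or more applicants) can be sketched in plain Ruby. This is only an illustration on in-memory data: the ApplicantRow and LeadRow structs and the leads_by_duplicate_email helper are hypothetical stand-ins, not the ActiveRecord models from the question.

```ruby
# Hypothetical stand-ins for the Applicant and Lead models.
ApplicantRow = Struct.new(:id, :email)
LeadRow      = Struct.new(:id, :applicant_id)

# Returns { email => [leads] } for emails shared by two or more applicants.
def leads_by_duplicate_email(applicants, leads)
  leads_by_applicant = leads.group_by(&:applicant_id)

  duplicates = applicants.group_by(&:email).select { |_email, group| group.size > 1 }

  duplicates.transform_values do |group|
    # Merge the leads of every applicant that shares this email.
    group.flat_map { |applicant| leads_by_applicant.fetch(applicant.id, []) }
  end
end

applicants = [
  ApplicantRow.new(1, "a@example.com"),
  ApplicantRow.new(2, "a@example.com"),
  ApplicantRow.new(3, "b@example.com")
]
leads = [LeadRow.new(10, 1), LeadRow.new(11, 2), LeadRow.new(12, 3)]

leads_by_duplicate_email(applicants, leads).each do |email, grouped|
  puts "#{email}: #{grouped.map(&:id).inspect}"
end
```

Only a@example.com is reported here, because b@example.com belongs to a single applicant.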

You need to use a .where rather than using Lead.all. The reason it is maxing out the CPU is you are trying to load every lead into memory at once. That said I guess I am still missing what you actually want back from the query so it would be tough for me to help you write the query. Can you give more info about your associations and the expected result of the query?

Related

Rails: How to Eager Load with Left Join Table?

Currently I have a controller query which fetches products & product updates as follows:
products = Product.left_outer_joins(:productupdates).select("products.*, count(productupdates.*) as update_count, max(productupdates.old_price) as highest_price").group(:id)
products = products.paginate(:page => params[:page], :per_page => 20)
This query causes an N+1 problem, but I cannot add .includes(:productupdates) since I already have a left outer join.
If possible, could you please help me reduce the N+1 queries?
EDIT------------------------------
As per Vishal's suggestion; I have changed the controller query as follows,
products = product.includes(:productupdates).select("products.*, count(productupdates.*) as productupdate_count, max(productupdates.old_price) as highest_price").group("productupdates.product_id")
products = products.paginate(:page => params[:page], :per_page => 20)
Unfortunately, I receive the following error:
ActiveRecord::StatementInvalid (PG::UndefinedTable: ERROR: missing FROM-clause entry for table "productupdates"
LINE 1: SELECT products.*, count(productupdates.*) as productupdate_count, m...
^
: SELECT products.*, count(productupdates.*) as productupdate_count, max(productupdates.old_price) as highest_price FROM "products" WHERE "products"."isopen" = $1 AND (products.year > 2009) AND ("products"."make" IS NOT NULL) GROUP BY productupdates.product_id LIMIT $2 OFFSET $3):
Please advise how this is causing N+1 and how you think this will solve the issue. The only way I can see an N+1 situation here is if you are then calling productupdates on each product later. If this is the case then this will not solve the issue. Please advise so others can formulate appropriate responses
For the time being I am going to assume that somewhere later in the code you are calling productupdates on the individual products. If this is the case then we can solve this without the aggregation as follows
@products = Product.eager_load(:productupdates)
Now when we loop, the productupdates are already loaded, so to get the count and the max we can do things like:
@products.each do |p|
  # COUNT
  # (don't use the count method or it will execute a query)
  p.productupdates.size
  # MAX old_price
  # older Ruby versions can use Rails' `try` instead
  # e.g. p.productupdates.max_by(&:old_price).try(:old_price) || 0
  p.productupdates.max_by(&:old_price)&.old_price || 0
end
Using these methods will not execute additional queries since the productupdates are already loaded
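As a small illustration of why the `&.` and `|| 0` fallback is there: max_by returns nil for an empty collection, so a product without updates would otherwise raise. The sketch below uses a hypothetical UpdateRow struct and update_stats helper in place of the loaded association; it is plain Ruby, not ActiveRecord.

```ruby
# Hypothetical stand-in for one loaded productupdates record.
UpdateRow = Struct.new(:old_price)

# Returns [count, highest old_price], falling back to 0 when there are no updates.
def update_stats(updates)
  highest = updates.max_by(&:old_price)&.old_price || 0
  [updates.size, highest]
end

p update_stats([UpdateRow.new(5), UpdateRow.new(12)])
p update_stats([])
```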
Side note: The reason includes did not work for you is that includes will use 2 queries to retrieve the data (a pseudo outer join) unless one of the following conditions is met:
The where clause uses a hash finder condition that references the association table (e.g. where(productupdates: {old_price: 12}))
You include the references method (e.g. Product.includes(:productupdates).references(:productupdates))
In both of these cases the table will be left joined. I chose to use eager_load here because includes delegates to eager_load in the above cases anyway.
You can directly use Product.includes(:productupdates); this will query the database with a left outer join and also overcome the N+1 query problem.
So instead of Product.left_outer_joins(:productupdates) in your query, use Product.includes(:productupdates).
After running this query in the console you can see that includes fires a left outer join query on the table.

RoR PostgresQL - Get latest, distinct values from database

I am trying to query my PostgreSQL database to get the latest (by created_at) and distinct (by user_id) Activity objects, where each user has multiple activities in the database. The activity object is structured as such:
Activity(id, user_id, created_at, ...)
I first tried to get the below query to work:
Activity.order('created_at DESC').select('DISTINCT ON (activities.user_id) activities.*')
however, kept getting the below error:
ActiveRecord::StatementInvalid: PG::InvalidColumnReference: ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
According to this post: PG::Error: SELECT DISTINCT, ORDER BY expressions must appear in select list, it looks like the ORDER BY clause can only be applied after the DISTINCT has been applied. This does not help me, as I want to get the distinct activities by user_id, but also want the activities to be the most recently created activities. Thus, I need the activities to be sorted before getting the distinct activities.
I have come up with a solution that works by first grouping the activities by user id, and then ordering the activities within the groups by created_at. However, this takes two queries to do.
I was wondering if what I want is possible in just one query?
This should work, try the following
Solution 1
Activity.select('DISTINCT ON (activities.user_id) activities.*').order('created_at DESC')
Solution 2
If Solution 1 does not work, it may help to create a scope for this in the Activity model:
scope :latest, -> {
  select("distinct on(user_id) activities.user_id, activities.*").
    order("user_id, created_at desc")
}
Now you can call this anywhere like below
Activity.latest
Hope it helps
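To make the intent of DISTINCT ON (user_id) combined with ORDER BY user_id, created_at DESC concrete, here is the same selection in plain Ruby: one row per user, namely the one with the greatest created_at. The ActivityRow struct and latest_per_user helper are illustrative names only; this is a sketch of the semantics, not a replacement for the query.

```ruby
# Hypothetical in-memory stand-in for the Activity model.
ActivityRow = Struct.new(:id, :user_id, :created_at)

# Keeps one activity per user: the most recently created one.
def latest_per_user(activities)
  activities
    .group_by(&:user_id)
    .map { |_user_id, group| group.max_by(&:created_at) }
    .sort_by(&:user_id)
end

activities = [
  ActivityRow.new(1, 7, Time.utc(2020, 1, 1)),
  ActivityRow.new(2, 7, Time.utc(2020, 1, 3)),
  ActivityRow.new(3, 8, Time.utc(2020, 1, 2))
]

p latest_per_user(activities).map(&:id)
```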

Ambiguous reference on column when grouping by association

I'm grouping a list of Bug reports on a known collection of users that are related to the report (that is, the user that is responsible for the report and the user that is currently assigned to it).
The Model Bug (AR, Rails 4.2.x) thus has, among others, two associations assigned_to and responsible, which are resolved to the foreign keys assigned_to_id, responsible_id.
Bugs can also be related to a project, which may also have a responsible user set, thus they also possess a responsible_id foreign key.
As we're grouping on both attributes from the report itself and the associated project, we want to include the associated project in the returned query.
I can then get a hash count of <User> => count through the following statement, grouping on the association name of the bug report:
Bug.group(:assigned_to)
.includes(:project)
.references(:projects)
.count
which correctly produces the desired result: A collection of Users (assignees) and the Bugs they are being assigned to.
For responsibles, the same query:
Bug.group(:responsible)
.includes(:project)
.references(:projects)
.count
yields an error, since the column responsible_id exists on both bugs and the joined projects table, which makes the unqualified reference in the SELECT list ambiguous:
SELECT COUNT(DISTINCT "bugs"."id") AS count_id,
responsible_id AS responsible_id
FROM "bugs"
LEFT OUTER JOIN "projects" ON "projects"."id" = "bugs"."project_id"
GROUP BY "bugs"."responsible_id"
If I instead group on the explicit attribute itself using Bug.group('bugs.responsible_id'), I get a valid response, however in the form of responsible_id => count.
SELECT COUNT(DISTINCT "bugs"."id") AS count_id,
bugs.responsible_id AS bugs_responsible_id
FROM "bugs"
LEFT OUTER JOIN "projects" ON "projects"."id" = "bugs"."project_id"
WHERE <condition>
GROUP BY bugs.responsible_id
Is there a way to force using the association, but namespace the query as in the second query?
Of course I could process the result and expand it to the responsible users, however since the grouping is part of a larger querying functionality, I only get to manipulate the grouping identifier without extensive changes to the query builder.
I don't think there is a fix for this now (in rails 4.2.4). This will however become easy in rails 5.
If you absolutely must solve the problem now, you could patch ActiveRecord::Calculations#execute_grouped_calculation with the fix available in rails 5 for your app. Simply add an initializer at config/initializers e.g. active_record_calculations_patch.rb with the following (abbreviated) content. You can copy the original code from your rails version and then add the fix:
module ActiveRecord
module Calculations
def execute_grouped_calculation(operation, column_name, distinct)
...
else
group_fields = group_attrs
end
# LINE OF CODE COPIED OVER FROM THE FIX
group_fields = arel_columns(group_fields)
# END OF COPIED OVER CODE
group_aliases = group_fields.map { |field|
column_alias_for(field)
...
end
end
end
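If patching is not an option, the post-processing mentioned in the question (turning responsible_id => count into user => count) is small. The sketch below is plain Ruby with a hypothetical UserRow struct and expand_counts helper; in an application the users would presumably be loaded with a single query for the ids in the hash, which is an assumption about the surrounding code.

```ruby
# Hypothetical stand-in for the User model.
UserRow = Struct.new(:id, :name)

# Replaces each id key in counts_by_id with the matching user object.
# fetch raises KeyError when an id has no user, so gaps are not silently hidden.
def expand_counts(counts_by_id, users)
  users_by_id = users.map { |user| [user.id, user] }.to_h
  counts_by_id.transform_keys { |id| users_by_id.fetch(id) }
end

users  = [UserRow.new(1, "ann"), UserRow.new(2, "bob")]
counts = { 1 => 3, 2 => 5 }

expand_counts(counts, users).each do |user, count|
  puts "#{user.name}: #{count}"
end
```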

Rails Postgres Error GROUP BY clause or be used in an aggregate function

In SQLite (development) I don't have any errors, but in production with Postgres I get the following error. I don't really understand the error.
PG::Error: ERROR: column "commits.updated_at" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...mmits"."user_id" = 1 GROUP BY mission_id ORDER BY updated_at...
^
: SELECT COUNT(*) AS count_all, mission_id AS mission_id FROM "commits" WHERE "commits"."user_id" = 1 GROUP BY mission_id ORDER BY updated_at DESC
My controller method:
def show
  @user = User.find(params[:id])
  @commits = @user.commits.order("updated_at DESC").page(params[:page]).per(25)
  @missions_commits = @commits.group("mission_id").count.length
end
UPDATE:
So I dug further into this PostgreSQL-specific annoyance and I am surprised that this exception is not mentioned in the Ruby on Rails Guide.
I am using psql (PostgreSQL) 9.1.11
So from what I understand, I need to specify which columns should be used whenever I use the GROUP BY clause. I thought using SELECT would help, but that can be annoying if you need to SELECT a lot of columns.
Interesting discussion here
Anyway, when I look at the error, the cursor always points to updated_at. In the SQL query, Rails will always ORDER BY updated_at. So I have tried this horrible query:
@commits.group("mission_id, date(updated_at)")
  .select("date(updated_at), count(mission_id)")
  .having("count(mission_id) > 0")
  .order("count(mission_id)").length
which gives me the following SQL
SELECT date(updated_at), count(mission_id)
FROM "commits"
WHERE "commits"."user_id" = 1
GROUP BY mission_id, date(updated_at)
HAVING count(mission_id) > 0
ORDER BY updated_at DESC, count(mission_id)
LIMIT 25 OFFSET 0
the error is the same.
Note that no matter what it will ORDER BY updated_at, even if I wanted to order by something else.
Also I don't want to group the records by updated_at just by mission_id.
This PostgreSQL error is just misleading and has little explanation to solving it. I have tried many formulas from the stackoverflow sidebar, nothing works and always the same error.
UPDATE 2:
So I got it to work, but it needs to group the updated_at because of the automatic ORDER BY updated_at. How do I count only by mission_id?
@missions_commits = @commits.group("mission_id, updated_at").count("mission_id").size
I guess you want to show the total number of distinct Missions related to the Commits; in any case, that won't be the number on the current page.
Try this:
@commits = @user.commits.order("updated_at DESC").page(params[:page]).per(25)
@missions_commits = @user.commits.distinct.count(:mission_id)
However, if you want the number of distinct Missions on the current page, I suppose it should be:
@missions_commits = @commits.collect(&:mission_id).uniq.count
Update
In Rails 3, distinct did not exist, so the SQL count should be written this way:
@missions_commits = @user.commits.count(:mission_id, distinct: true)
See the docs for PostgreSQL GROUP BY here:
http://www.postgresql.org/docs/9.3/interactive/sql-select.html#SQL-GROUPBY
Basically, unlike SQLite (and MySQL), Postgres requires that any column selected or ordered on must either appear in the GROUP BY clause or be used in an aggregate function.
If you think it through, you'll see that this actually makes sense. Sqlite/MySQL cheat under the hood and silently drop those fields (not sure that's technically what happens).
Or thinking about it another way if you are grouping by a field, what's the point of ordering it? How would that even make sense unless you also had an aggregate function on the ordered field?
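A plain-Ruby sketch may help show what "an aggregate function on the ordered field" means. Once commits are grouped by mission_id, each group holds several updated_at values, so the groups can only be ordered after picking one value per group, here the maximum (the analogue of ORDER BY MAX(updated_at) DESC). CommitRow and missions_by_latest_commit are hypothetical names, and the timestamps are plain integers for brevity.

```ruby
# Hypothetical stand-in for the Commit model; updated_at is an integer here.
CommitRow = Struct.new(:mission_id, :updated_at)

# Returns [[mission_id, commit_count], ...] ordered by each mission's newest commit.
def missions_by_latest_commit(commits)
  commits
    .group_by(&:mission_id)
    .map { |mission_id, group| [mission_id, group.size, group.map(&:updated_at).max] }
    .sort_by { |_mission_id, _count, latest| latest }
    .reverse
    .map { |mission_id, count, _latest| [mission_id, count] }
end

commits = [CommitRow.new(1, 10), CommitRow.new(2, 20), CommitRow.new(1, 5)]
p missions_by_latest_commit(commits)
```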

Rails expanding fields with scope, PG does not like it

I have a model of Widgets. Widgets belong to a Store model, which belongs to an Area model, which belongs to a Company. At the Company model, I need to find all associated widgets. Easy:
class Widget < ActiveRecord::Base
def self.in_company(company)
includes(:store => {:area => :company}).where(:companies => {:id => company.id})
end
end
Which will generate this beautiful query:
> Widget.in_company(Company.first).count
SQL (50.5ms) SELECT COUNT(DISTINCT "widgets"."id") FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1
=> 15088
But I later need to use this scope in a more complex scope. The problem is that AR expands the query by selecting individual fields, which fails in PG because selected fields must be in the GROUP BY clause or in an aggregate function.
Here is the more complex scope.
def self.sum_amount_chart_series(company, start_time)
orders_by_day = Widget.in_company(company).archived.not_void.
where(:print_datetime => start_time.beginning_of_day..Time.zone.now.end_of_day).
group(pg_print_date_group).
select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
end
def self.pg_print_date_group
"CAST((print_datetime + interval '#{tz_offset_hours} hours') AS date)"
end
And this is the select it is throwing at PG:
> Widget.sum_amount_chart_series(Company.first, 1.day.ago)
SELECT "widgets"."id" AS t0_r0, "widgets"."user_id" AS t0_r1,<...BIG SNIP, YOU GET THE IDEA...> FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1 AND "widgets"."archived" = 't' AND "widgets"."voided" = 'f' AND ("widgets"."print_datetime" BETWEEN '2011-04-24 00:00:00.000000' AND '2011-04-25 23:59:59.999999') GROUP BY CAST((print_datetime + interval '-7 hours') AS date)
Which generates this error:
PGError: ERROR: column "widgets.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT "widgets"."id" AS t0_r0, "widgets"."user_id...
How do I rewrite the Widget.in_company scope so that AR does not expand the select query to include every Widget model field?
As Frank explained, PostgreSQL will reject any query which doesn't return a reproducible set of rows.
Suppose you've a query like:
select a, b, agg(c)
from tbl
group by a
PostgreSQL will reject it because b is left unspecified in the group by statement. Run that in MySQL, by contrast, and it will be accepted. In the latter case, however, fire up a few inserts, updates and deletes, and the order of the rows on disk pages ends up different.
If memory serves, the implementation is such that MySQL will actually sort by a, b and return the first b in the set. But as far as the SQL standard is concerned, the behavior is unspecified -- and sure enough, PostgreSQL does not always sort before running aggregate functions.
Potentially, this might result in different values of b in result set in PostgreSQL. And thus, PostgreSQL yields an error unless you're more specific:
select a, b, agg(c)
from tbl
group by a, b
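The ambiguity can be seen on a few in-memory rows. In the sketch below (plain Ruby; TableRow and b_values_per_group are made-up names for illustration), grouping by a alone leaves more than one candidate value of b in a group, which is exactly what the database cannot resolve on its own.

```ruby
# Hypothetical rows of tbl with columns a, b, c.
TableRow = Struct.new(:a, :b, :c)

# For each value of a, lists the distinct b values found in that group.
def b_values_per_group(rows)
  rows.group_by(&:a).transform_values { |group| group.map(&:b).uniq }
end

rows = [TableRow.new(1, "x", 10), TableRow.new(1, "y", 20), TableRow.new(2, "z", 5)]
p b_values_per_group(rows)
```

The group for a = 1 contains both "x" and "y", so a query that selects b while grouping only by a has no single answer for that row.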
What Frank highlighted is that, in PostgreSQL 9.1, if a is the primary key, then you can leave b unspecified -- the planner has been taught to ignore subsequent group by fields when applicable primary keys imply a unique row.
For your problem in particular, you need to specify your group by as you currently do, plus every field that you're basing your aggregate onto, i.e. "widgets"."id", "widgets"."user_id", [snip] but not stuff like sum(amount), which are the aggregate function calls.
As an off topic side note, I'm not sure how your ORM/model works but the SQL it's generating isn't optimal. Many of those left outer joins seem like they should be inner joins. This will result in allowing the planner to pick an appropriate join order where applicable.
PostgreSQL version 9.1 (beta at this moment) might fix your problem, but only if there is a functional dependency on the primary key.
From the release notes:
Allow non-GROUP BY columns in the query target list when the primary key is specified in the GROUP BY clause (Peter Eisentraut)
Some other database system already allowed this behavior, and because of the primary key, the result is unambiguous.
You could run a test and see if it fixes your problem. If you can wait for the production release, this can fix the problem without changing your code.
Firstly simplify your life by storing all dates in a standard time-zone. Changing dates with time-zones should really be done in the view as a user convenience. This alone should save you a lot of pain.
If you're already in production write a migration to create a normalised_date column wherever it would be helpful.
I propose that the other problem here is the use of raw SQL, which rails won't poke around for you. To avoid this try using the gem called Squeel (aka Metawhere 2) http://metautonomo.us/projects/squeel/
If you use this you should be able to remove hard coded SQL and let rails get back to doing its magic.
For example:
.select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
becomes (once you remove the need for normalising the date):
.select{sum(amount).as(total_amount)}
Sorry to answer my own question, but I figured it out.
First, let me apologize to those who thought I might be having an SQL or Postgres issue, it is not. The issue is with ActiveRecord and the SQL it is generating.
The answer is... use .joins instead of .includes. So I just changed the line in the top code and it works as expected.
class Widget < ActiveRecord::Base
def self.in_company(company)
joins(:store => {:area => :company}).where(:companies => {:id => company.id})
end
end
I'm guessing that when using .includes, ActiveRecord is trying to be smart and use JOINS in the SQL, but it's not smart enough for this particular case and was generating that ugly SQL to select all associated columns.
However, all the replies have taught me quite a bit about Postgres that I did not know, so thank you very much.
sort in mysql:
> ids = [11,31,29]
=> [11, 31, 29]
> Page.where(id: ids).order("field(id, #{ids.join(',')})")
in postgres:
def self.order_by_ids(ids)
  order_by = ["case"]
  ids.each_with_index do |id, index|
    order_by << "WHEN id='#{id}' THEN #{index}"
  end
  order_by << "end"
  order(order_by.join(" "))
end
User.where(:id => [3,2,1]).order_by_ids([3,2,1]).map(&:id)
#=> [3,2,1]
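Since the CASE expression is assembled by string interpolation, it may be worth coercing each id to an integer before it reaches the SQL. A minimal standalone sketch of just the string-building step follows; order_by_ids_sql and its column parameter are hypothetical names, and the result would still have to be passed to order as in the method above.

```ruby
# Builds a CASE expression that sorts rows in the order given by ids.
# Integer() raises ArgumentError for non-numeric input, so only numbers are interpolated.
# Note: an empty ids list yields an invalid expression and should be handled by the caller.
def order_by_ids_sql(ids, column = "id")
  whens = ids.each_with_index.map do |id, index|
    "WHEN #{column} = #{Integer(id)} THEN #{index}"
  end
  "CASE #{whens.join(' ')} END"
end

puts order_by_ids_sql([3, 2, 1])
# prints: CASE WHEN id = 3 THEN 0 WHEN id = 2 THEN 1 WHEN id = 1 THEN 2 END
```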

Resources