Ordering records by frequency with Arel - ruby-on-rails

How do I retrieve a set of records, ordered by count, in Arel? I have a model which tracks how many views a product gets. I want to find the X most frequently viewed products over the last Y days.
This problem cropped up while migrating from MySQL to PostgreSQL, because MySQL is more forgiving about what it will accept. This code, from the View model, works with MySQL but not PostgreSQL, because non-aggregated columns are included in the output.
scope :popular, lambda { |time_ago, freq|
  where("created_on > ?", time_ago).group('product_id').
    order('count(*) desc').limit(freq).includes(:product)
}
Here's what I've got so far:
View.select("id, count(id) as freq").where('created_on > ?', 5.days.ago).
order('freq').group('id').limit(5)
However, this returns the single ID of the model, not the actual model.
Update
I went with:
select("product_id, count(id) as freq").
where('created_on > ?', time_ago).
order('freq desc').
group('product_id').
limit(freq)
On reflection, it's not really logical to expect a complete model when the results are produced by GROUP BY and aggregate functions; the returned rows will (most likely) not correspond to any actual model row.
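If you still need the product records afterwards, a minimal sketch (assuming View belongs_to :product) is a second lookup keyed on product_id:
popular = View.select("product_id, count(id) as freq").
  where('created_on > ?', 5.days.ago).
  group('product_id').
  order('freq desc').
  limit(5)

# Each row responds to #product_id and to the aliased aggregate #freq.
products = Product.where(id: popular.map(&:product_id)).index_by(&:id)
popular.each { |row| puts "#{products[row.product_id].inspect}: #{row.freq} views" }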

You have to extend your select clause with every column you wish to retrieve, or use:
select("views.*, count(id) as freq")

SQL would be:
SELECT product_id, product, count(*) as freq
FROM views
WHERE created_on > '$5_days_ago'::timestamp
GROUP BY product_id, product
ORDER BY count(*) DESC, product
LIMIT 5;
Extrapolating from your example, it should be:
View.select("product_id, product, count(*) as freq").where('created_on > ?', 5.days.ago).
order("count(*) DESC" ).group('product_id, product').limit(5)
Disclaimer: Ruby syntax is a foreign language to me.

Related

Properly format an ActiveRecord query with a subquery in Postgres

I have a working SQL query for Postgres v10.
SELECT *
FROM
(
SELECT DISTINCT ON (title) products.title, products.*
FROM "products"
) subquery
WHERE subquery.active = TRUE AND subquery.product_type_id = 1
ORDER BY created_at DESC
The goal of the query is to do a distinct based on the title column, then filter and order the results. (I used the subquery in the first place, as it seemed there was no way to combine DISTINCT ON with ORDER BY without a subquery.)
I am trying to express said query in ActiveRecord.
I have been doing
Product.select("*")
.from(Product.select("DISTINCT ON (product.title) product.title, meals.*"))
.where("subquery.active IS true")
.where("subquery.meal_type_id = ?", 1)
.order("created_at DESC")
and, that works! But, it's fairly messy with the string where clauses in there. Is there a better way to express this query with ActiveRecord/Arel, or am I just running into the limits of what ActiveRecord can express?
I think the resulting ActiveRecord call can be improved.
But I would start improving with original SQL query first.
Subquery
SELECT DISTINCT ON (title) products.title, products.* FROM products
(I think that instead of meals there should be products?) has a duplicate products.title, which is unnecessary there. Worse, it is missing an ORDER BY clause. As the PostgreSQL documentation says:
Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first
I would rewrite sub-query as:
SELECT DISTINCT ON (title) * FROM products ORDER BY title ASC
which gives us a call:
Product.select('DISTINCT ON (title) *').order(title: :asc)
In the main query, the where calls use the Rails-generated alias for the subquery. I would not rely on this internal Rails convention for aliasing subqueries, as it may change at any time. If you accept that risk, you can merge these conditions into one where call with hash-style argument syntax.
The final result:
Product.select('*')
  .from(Product.select('DISTINCT ON (title) *').order(title: :asc))
  .where(subquery: { active: true, meal_type_id: 1 })
  .order('created_at DESC')
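To verify what this compiles to before relying on it, a quick sketch using to_sql (nothing here beyond standard ActiveRecord):
relation = Product.select('*').
  from(Product.select('DISTINCT ON (title) *').order(title: :asc)).
  where(subquery: { active: true, meal_type_id: 1 }).
  order('created_at DESC')
puts relation.to_sql
# Confirm in the output that the generated subquery alias is the one the
# hash-style where clause targets ("subquery" at the time of writing).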

ActiveRecord query searching for duplicates on a column, but returning associated records

So here's the lay of the land:
I have a Applicant model which has_many Lead records.
I need to group leads by applicant email, i.e. for each specific applicant email (there may be 2+ applicant records with the same email) I need to get a combined list of leads.
I already have this working using an in-memory / N+1 solution
I want to do this in a single query, if possible. Right now I'm running one for each lead which is maxing out the CPU.
Here's my attempt right now:
Lead.
  all.
  select("leads.*, applicants.*").
  joins(:applicant).
  group("applicants.email").
  having("count(*) > 1").
  limit(1).
  to_a
And the error:
Lead Load (1.2ms)  SELECT leads.*, applicants.* FROM "leads" INNER JOIN "applicants" ON "applicants"."id" = "leads"."applicant_id" GROUP BY applicants.email HAVING count(*) > 1 LIMIT 1
ActiveRecord::StatementInvalid: PG::GroupingError: ERROR: column "leads.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT leads.*, applicants.* FROM "leads" INNER JOIN "appli...
This is a Postgres-specific issue: the selected fields must appear in the GROUP BY clause or be used in an aggregate function.
You can try this:
Lead.joins(:applicant)
    .select('leads.*, applicants.email')
    .group('applicants.email, leads.id, ...')
You will need to list all the fields in the leads table in the GROUP BY clause (or all the fields that you are selecting).
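As an aside, a two-step approach sidesteps the GROUP BY restriction entirely; this is a sketch using the models from the question (two queries in total):
# First find the emails shared by 2+ applicants, then load their leads.
dup_emails = Applicant.group(:email).having('count(*) > 1').pluck(:email)
leads_by_email = Lead.eager_load(:applicant).
  where(applicants: { email: dup_emails }).
  group_by { |lead| lead.applicant.email }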
I would just get all the records and do the grouping in memory. If you have a lot of records, I would paginate them or batch them.
group_by_email = Hash.new { |h, k| h[k] = [] }
Applicant.eager_load(:leads).find_in_batches(batch_size: 10_000) do |batch|
  batch.each do |applicant|
    group_by_email[applicant.email].concat(applicant.leads.to_a)
  end
end
You need to use a .where rather than Lead.all. The reason it is maxing out the CPU is that you are trying to load every lead into memory at once. That said, I guess I'm still missing what you actually want back from the query, so it would be tough for me to help you write it. Can you give more info about your associations and the expected result of the query?
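If a full scan really is unavoidable, a sketch with find_each keeps memory bounded by streaming rows in primary-key batches:
Lead.joins(:applicant).find_each(batch_size: 5_000) do |lead|
  # process one lead at a time instead of materializing Lead.all
end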

ActiveRecord SUM includes more objects in the calculation than the same query returns without sum

Today I ran into a problem that led me to a gotcha of ActiveRecord use.
For a given query (with includes), ActiveRecord returns a certain number of objects in an ActiveRecord::Relation. If you chain sum(:attribute) onto the same query, more objects are included in the calculated result. To describe what I mean, here is my example:
Environment:
ActiveRecord 4.2.3
Postgres 9.3.5
DB-structure:
Order has_many items
My query:
@orders = Order.includes(:items).where('orders.created_at >= ? AND orders.created_at <= ?', date_from, date_to)
The produced SQL-Query:
SELECT orders.* FROM orders WHERE orders.created_at >= '2015-08-11' AND orders.created_at <= '2015-08-17 23:59:59.999999';
The mentioned query returns e.g. 20 orders. As you can see, the includes doesn't play any role in this query. And if I sum the price for the result in Ruby:
@orders.to_a.sum(&:price)
it returns 20.00
The same ActiveRecord query with SUM:
Order.includes(:items).where('orders.created_at >= ? AND orders.created_at <= ?', date_from, date_to).sum(:price)
it returns 45.00
It produces a different SQL statement:
SELECT SUM(orders.price) FROM orders LEFT OUTER JOIN items ON items.order_id = orders.id WHERE orders.created_at >= '2015-08-11' AND orders.created_at <= '2015-08-17 23:59:59.999999'
The sum is much higher in this case because the produced SQL query includes the same order more than once (because of the join): every order has one or more items, which yields many more order rows (duplicates) than the query without the LEFT OUTER JOIN.
I hope this can help you avoid this gotcha.
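Two hedged ways around the inflated sum, assuming price is a column on orders: drop the includes when aggregating, since the join contributes nothing to the sum, or keep the eager-loaded relation and sum in Ruby as shown above.
# 1. Aggregate without the includes; no join, no duplicate rows:
Order.where('orders.created_at >= ? AND orders.created_at <= ?', date_from, date_to).sum(:price)

# 2. Or reuse the already-loaded relation and sum in memory:
@orders.to_a.sum(&:price)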
includes is generally used for eager loading. Why don't you replace it with joins?

Get most recent distinct records

Every resort has many snow reports.
I want to get the most recent snow report for every resort where the field snow_summit in the table snow_reports is > 0.
So I tried to select distinct on resort_id (the foreign key in snow_reports) and order by updated_at, but this is not possible since updated_at does not occur in the select.
So how do I get only the most recent records of an associated model in Rails 4 (on Postgres)?
SnowReport belongs_to Resort
Resort has_many snow_reports
Table snow_reports has id,resort_id,updated_at,snow_summit
Ideally the result is joined for performance reasons.
My approach fails
SnowReport.includes(:resort).select(:resort_id).group(:resort_id).having('max(snow_summit)> 0').order('max(snow_reports.updated_at) DESC')
since SnowReport.id is nil
#<ActiveRecord::Relation [#<SnowReport id: nil, resort_id: 1735>, ...
Edit:
I found a solution in plain SQL.
How can I transform this to Rails?
select * from resorts where id in (
  select distinct(resort_id) from snow_reports
  where snow_summit > 0
    and created_at > (now() - interval '3 days')
    and created_at in (select max(created_at) from snow_reports group by resort_id)
);
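A fairly literal, hedged translation of that SQL into ActiveRecord (keeping the correlated fragment as a raw string) might look like:
resort_ids = SnowReport.
  where('snow_summit > ?', 0).
  where('created_at > ?', 3.days.ago).
  where('created_at IN (SELECT max(created_at) FROM snow_reports GROUP BY resort_id)').
  distinct.
  pluck(:resort_id)
Resort.where(id: resort_ids)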
Try this:
SnowReport.includes(:resort).select(:id, :resort_id).where("snow_summit > ?", 0).order("updated_at DESC").group(:resort_id)
It may work.

Rails expanding fields with scope, PG does not like it

I have a model of Widgets. Widgets belong to a Store model, which belongs to an Area model, which belongs to a Company. At the Company model, I need to find all associated widgets. Easy:
class Widget < ActiveRecord::Base
  def self.in_company(company)
    includes(:store => {:area => :company}).where(:companies => {:id => company.id})
  end
end
Which will generate this beautiful query:
> Widget.in_company(Company.first).count
SQL (50.5ms) SELECT COUNT(DISTINCT "widgets"."id") FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1
=> 15088
But I later need to use this scope in a more complex scope. The problem is that AR expands the query by selecting individual fields, which fails in PG because the selected fields must appear in the GROUP BY clause or be used in an aggregate function.
Here is the more complex scope.
def self.sum_amount_chart_series(company, start_time)
  orders_by_day = Widget.in_company(company).archived.not_void.
    where(:print_datetime => start_time.beginning_of_day..Time.zone.now.end_of_day).
    group(pg_print_date_group).
    select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
end

def self.pg_print_date_group
  "CAST((print_datetime + interval '#{tz_offset_hours} hours') AS date)"
end
And this is the select it is throwing at PG:
> Widget.sum_amount_chart_series(Company.first, 1.day.ago)
SELECT "widgets"."id" AS t0_r0, "widgets"."user_id" AS t0_r1,<...BIG SNIP, YOU GET THE IDEA...> FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1 AND "widgets"."archived" = 't' AND "widgets"."voided" = 'f' AND ("widgets"."print_datetime" BETWEEN '2011-04-24 00:00:00.000000' AND '2011-04-25 23:59:59.999999') GROUP BY CAST((print_datetime + interval '-7 hours') AS date)
Which generates this error:
PGError: ERROR: column "widgets.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT "widgets"."id" AS t0_r0, "widgets"."user_id...
How do I rewrite the Widget.in_company scope so that AR does not expand the select query to include every Widget model field?
As Frank explained, PostgreSQL will reject any query which doesn't return a reproducible set of rows.
Suppose you've a query like:
select a, b, agg(c)
from tbl
group by a
PostgreSQL will reject it because b is left unspecified in the group by statement. Run that in MySQL, by contrast, and it will be accepted. In the latter case, however, fire up a few inserts, updates and deletes, and the order of the rows on disk pages ends up different.
If memory serves, the implementation is such that MySQL will actually sort by a, b and return the first b in the set. But as far as the SQL standard is concerned, the behavior is unspecified -- and sure enough, PostgreSQL does not always sort before running aggregate functions.
Potentially, this might result in different values of b in the result set. And thus, PostgreSQL yields an error unless you're more specific:
select a, b, agg(c)
from tbl
group by a, b
What Frank highlighted is that, in PostgreSQL 9.1, if a is the primary key, then you can leave b unspecified -- the planner has been taught to ignore subsequent group by fields when applicable primary keys imply a unique row.
For your problem in particular, you need to specify your group by as you currently do, plus every field that you're selecting, i.e. "widgets"."id", "widgets"."user_id", [snip], but not the aggregate function calls such as sum(amount).
As an off-topic side note, I'm not sure how your ORM/model works, but the SQL it's generating isn't optimal. Many of those left outer joins seem like they should be inner joins, which would also allow the planner to pick an appropriate join order where applicable.
PostgreSQL version 9.1 (beta at this moment) might fix your problem, but only if there is a functional dependency on the primary key.
From the release notes:
Allow non-GROUP BY columns in the
query target list when the primary key
is specified in the GROUP BY clause
(Peter Eisentraut)
Some other database system already
allowed this behavior, and because of
the primary key, the result is
unambiguous.
You could run a test and see if it fixes your problem. If you can wait for the production release, this can fix the problem without changing your code.
Firstly, simplify your life by storing all dates in a standard time zone. Converting dates between time zones should really be done in the view, as a user convenience. This alone should save you a lot of pain.
If you're already in production, write a migration to create a normalised_date column wherever it would be helpful.
I propose that the other problem here is the use of raw SQL, which Rails won't inspect for you. To avoid this, try using the gem called Squeel (aka MetaWhere 2): http://metautonomo.us/projects/squeel/
If you use this you should be able to remove the hard-coded SQL and let Rails get back to doing its magic.
For example:
.select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
becomes (once you remove the need for normalising the date):
.select{sum(amount).as(total_amount)}
Sorry to answer my own question, but I figured it out.
First, let me apologize to those who thought I might be having an SQL or Postgres issue, it is not. The issue is with ActiveRecord and the SQL it is generating.
The answer is... use .joins instead of .includes. So I just changed the line in the top code and it works as expected.
class Widget < ActiveRecord::Base
  def self.in_company(company)
    joins(:store => {:area => :company}).where(:companies => {:id => company.id})
  end
end
I'm guessing that when using .includes, ActiveRecord is trying to be smart and use JOINS in the SQL, but it's not smart enough for this particular case and was generating that ugly SQL to select all associated columns.
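For anyone curious, a quick way to see the difference is comparing the generated SQL; this sketch assumes the associations from the question:
# .joins builds plain INNER JOINs and selects only widgets.*:
puts Widget.joins(:store => {:area => :company}).
  where(:companies => {:id => 1}).to_sql
# .includes, when forced into a single query by the cross-table condition,
# eager-loads every column of every joined table, which is what broke GROUP BY.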
However, all the replies have taught me quite a bit about Postgres that I did not know, so thank you very much.
Sort in MySQL:
> ids = [11,31,29]
=> [11, 31, 29]
> Page.where(id: ids).order("field(id, #{ids.join(',')})")
In Postgres:
def self.order_by_ids(ids)
  order_by = ["case"]
  ids.each_with_index do |id, index|
    order_by << "WHEN id='#{id}' THEN #{index}"
  end
  order_by << "end"
  order(order_by.join(" "))
end
User.where(:id => [3,2,1]).order_by_ids([3,2,1]).map(&:id)
#=> [3,2,1]
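As a side note, on PostgreSQL 9.5+ the hand-built CASE can be replaced by array_position; this is a hedged sketch (wrap the raw fragment in Arel.sql on Rails 5+):
ids = [3, 2, 1]
User.where(id: ids).
  order(Arel.sql("array_position(ARRAY[#{ids.join(',')}], id)")).
  map(&:id)
#=> [3, 2, 1]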
