Can you do a group by with find_each in Rails?

I am trying to write a function that groups by some columns in a very large table (millions of rows). Is there any way to get find_each to work with this, or is it impossible given that I do not want to order by the id column?
The SQL of my query is:
SELECT derivable_type, derivable_id FROM "mytable" GROUP BY derivable_type, derivable_id ORDER BY "mytable"."id" ASC;
The rails find_each automatically adds the ORDER BY clause using a reorder statement. I have tried changing the SQL to:
SELECT MAX(id) AS "mytable"."id", derivable_type, derivable_id FROM "mytable" GROUP BY derivable_type, derivable_id ORDER BY "mytable"."id" ASC;
but that doesn't work either. Any ideas other than writing my own find_each function or overriding the private batch_order function in batches.rb?

There are at least two ways to solve this problem:
I. Use a subquery:
# pick one id per (derivable_type, derivable_id) group
my_table_ids = MyTable
  .group("derivable_type, derivable_id")
  .select("MAX(id)")
# wrap it in a subquery so find_each can keep its ORDER BY id
MyTable
  .where(id: my_table_ids)
  .find_each { |row| do_something(row) }
II. Write a custom find_each function:
rows = MyTable
.group("derivable_type, derivable_id")
.select("derivable_type, derivable_id")
find_each_grouped(rows, ['derivable_type', 'derivable_id']) do |row|
do_something(row)
end
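def find_each_grouped(rows, columns, &block)
  offset = 0
  batch_size = 1_000
  loop do
    # load one page of groups; ordering by the grouped columns keeps pages stable
    batch = rows
      .order(*columns)
      .offset(offset)
      .limit(batch_size)
      .to_a
    batch.each(&block)
    # a short page means there is nothing left to read
    break if batch.size < batch_size
    offset += batch_size
  end
end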
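The batching loop in approach II can be tried without a database. The following is only a sketch under assumptions: ROWS is a made-up in-memory stand-in for the table, and each_group_in_batches is a hypothetical helper, not part of Rails. It mimics GROUP BY plus ORDER BY on the grouped columns, then pages through the result with an offset and a limit.

```ruby
# Made-up rows standing in for "mytable"; each hash is one row.
ROWS = [
  { id: 1, derivable_type: "Post", derivable_id: 7 },
  { id: 2, derivable_type: "Post", derivable_id: 7 },
  { id: 3, derivable_type: "User", derivable_id: 2 },
  { id: 4, derivable_type: "Post", derivable_id: 9 },
  { id: 5, derivable_type: "User", derivable_id: 2 }
].freeze

# Hypothetical helper: yields each distinct group once, batch_size groups per page.
def each_group_in_batches(rows, columns, batch_size: 2)
  # GROUP BY + ORDER BY on the same columns: distinct key tuples in a stable order.
  groups = rows.map { |row| row.values_at(*columns) }.uniq.sort
  offset = 0
  loop do
    batch = groups[offset, batch_size] || [] # OFFSET offset LIMIT batch_size
    batch.each { |group| yield group }
    break if batch.size < batch_size         # short page: nothing left
    offset += batch_size
  end
end

seen = []
each_group_in_batches(ROWS, %i[derivable_type derivable_id]) { |group| seen << group }
```

Paging by offset only stays correct while the order is deterministic, which is why the loop sorts by the grouped columns themselves.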

I'm not sure I'm 100% clear on what you're trying to do, but your query looks the same as doing an aggregate distinct()
SELECT derivable_type, derivable_id FROM "mytable" GROUP BY derivable_type, derivable_id ORDER BY "mytable"."id" ASC;
---- vv
SELECT DISTINCT(derivable_type, derivable_id) FROM "mytable" ORDER BY "mytable"."id" ASC;
You should be able to use Active Record to accomplish this, combined with find_each (if Mytable is your model):
Mytable.all.group(:derivable_type, :derivable_id).distinct.find_each
# gives => #<Enumerator: #<ActiveRecord::Relation [...]>:find_each({:start=>nil, :finish=>nil, :batch_size=>1000, :error_on_ignore=>nil})>

Related

ActiveRecord distinct doesn't work

I made a Select using Active Record with a lot of Joins. This resulted in duplicate values. After the select function there's the distinct function with value :id. But that didn't work!
Here's the code:
def join_query
<<-SQL
LEFT JOIN orders on orders.purchase_id = purchases.id
LEFT JOIN products on products.id = orders.complete_product_id
SQL
end
def select_query
<<-SQL
purchases.*,
products.reference_code as products_reference_code
SQL
end
result = Purchase.joins(join_query)
.select(select_query)
.distinct(:id)
Of course, neither distinct! nor uniq worked. distinct! raised an ActiveRecord::ImmutableRelation error, which I don't understand.
To work around this I converted the ActiveRecord_Relation object to an Array and used Ruby's uniq.
What's going on here?
Try this:
def select_query
<<-SQL
DISTINCT ON (purchases.id) purchases.id,
products.reference_code as products_reference_code
SQL
end
Add any further columns you need to the select clause, separated by commas, then build the query:
Purchase.select(select_query).joins(join_query)
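As for what is going on: distinct in Active Record takes a true/false value rather than a column name, so distinct(:id) simply produces SELECT DISTINCT over every selected column. The plain-Ruby sketch below (the rows are made up for illustration) shows why that does not collapse the duplicates, and what DISTINCT ON (purchases.id) does instead.

```ruby
# Made-up joined rows: purchase 1 has two orders with different products.
joined = [
  { id: 1, products_reference_code: "A-100" },
  { id: 1, products_reference_code: "B-200" },
  { id: 2, products_reference_code: "A-100" }
]

# SELECT DISTINCT compares whole rows, so all three rows survive.
distinct_rows = joined.uniq

# DISTINCT ON (purchases.id) keeps one row per id (here: the first one seen).
distinct_on_id = joined.uniq { |row| row[:id] }
```

Which row survives per id depends on the ORDER BY in the real query, so the choice of products_reference_code is arbitrary unless you order explicitly.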

How to eager load child model's sum value for ruby on rails?

I have an Order model, it has many items, it looks like this
class Order < ActiveRecord::Base
has_many :items
def total
items.sum('price * quantity')
end
end
And I have an order index view, querying order table like this
def index
@orders = Order.includes(:items)
end
Then, in the view, I access the total of each order. As a result, you will see tons of SUM queries like this
SELECT SUM(price * quantity) FROM "items" WHERE "items"."order_id" = $1 [["order_id", 1]]
SELECT SUM(price * quantity) FROM "items" WHERE "items"."order_id" = $1 [["order_id", 2]]
SELECT SUM(price * quantity) FROM "items" WHERE "items"."order_id" = $1 [["order_id", 3]]
...
It's pretty slow to load order.total one by one. How can I load the sums eagerly in a single query, while still being able to call order.total just like before?
Try this:
subquery = Order.joins(:items).select('orders.id, sum(items.price * items.quantity) AS total').group('orders.id')
@orders = Order.includes(:items).joins("INNER JOIN (#{subquery.to_sql}) totals ON totals.id = orders.id")
This will create a subquery that sums the total of the orders, and then you join that subquery to your other query.
I wrote up two options for this in this blog post on using find_by_sql or joins to solve this.
For your example above, using find_by_sql you could write something like this:
Order.find_by_sql("select
orders.id,
SUM(items.price * items.quantity) as total
from orders
join items
on orders.id = items.order_id
group by
orders.id")
Using joins, you could rewrite as:
Order.all.select("orders.id, SUM(items.price * items.quantity) as total").joins(:items).group("orders.id")
Include all the fields you want in your select list in both the select clause and the group by clause. Hope that helps!
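What the grouped query buys you is one pass over all items instead of one SUM per order. A plain-Ruby sketch of the same idea, with a made-up Item struct standing in for the items table:

```ruby
# Made-up stand-in for rows of the items table.
Item = Struct.new(:order_id, :price, :quantity)

items = [
  Item.new(1, 10, 2),
  Item.new(1, 5, 1),
  Item.new(2, 3, 4)
]

# One pass over all items, like SUM(price * quantity) ... GROUP BY order_id.
totals = items.each_with_object(Hash.new(0)) do |item, sums|
  sums[item.order_id] += item.price * item.quantity
end

# Look a total up by order id; orders without items fall back to 0.
total_for_order_three = totals[3]
```

Note that with an inner join, as in the SQL versions above, orders that have no items do not appear in the result at all, whereas this hash falls back to 0 for them.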

Grouping a timestamp field by date in Ruby On Rails / PostgreSQL

I am trying to convert the following bit of code to work with PostgreSQL.
After doing some digging around I realized that PostgreSQL is much stricter (in a good way) with the GROUP BY than MySQL but for the life of me I cannot figure out how to rewrite this statement to satisfy Postgres.
def show
show! do
@recent_tasks = resource.jobs.group(:task).order(:created_at).limit(5)
end
end
PG::Error: ERROR: column "jobs.created_at" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...ND "jobs"."project_id" = 1 GROUP BY task ORDER BY created_at...
^
: SELECT COUNT(*) AS count_all, task AS task FROM "jobs" WHERE "jobs"."deleted_at" IS NULL AND "jobs"."project_id" = 1 GROUP BY task ORDER BY created_at DESC, created_at LIMIT 5
You cannot order by a column that is neither in the GROUP BY clause nor wrapped in an aggregate function.
You can do something like
@recent_tasks = resource.jobs.group(:task, :created_at).order(:created_at).limit(5)
but that changes the result, because rows are now grouped per task and timestamp.
You can also order by the grouped column
@recent_tasks = resource.jobs.group(:task).order(:task).limit(5)
or by an aggregate
@recent_tasks = resource.jobs.group(:task).order('count(*) desc').limit(5)
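If "recent" is what you are after, another aggregate that satisfies Postgres is the latest timestamp per group (MAX(created_at) in SQL). A plain-Ruby sketch of that ordering, using a made-up Job struct and made-up dates rather than the real model:

```ruby
# Made-up stand-in for rows of the jobs table.
Job = Struct.new(:task, :created_at)

jobs = [
  Job.new("build",  Time.utc(2024, 1, 1)),
  Job.new("deploy", Time.utc(2024, 1, 3)),
  Job.new("build",  Time.utc(2024, 1, 5)),
  Job.new("test",   Time.utc(2024, 1, 2))
]

# GROUP BY task, then COUNT(*) and MAX(created_at) per group,
# ORDER BY MAX(created_at) DESC LIMIT 5.
recent = jobs.group_by(&:task)
  .map { |task, group| [task, group.size, group.map(&:created_at).max] }
  .sort_by { |_task, _count, latest| latest }
  .reverse
  .first(5)
```

Each group collapses to a single timestamp, so there is exactly one value to sort by, which is the thing the original ORDER BY created_at could not provide.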

"Order by" result of "group by" count?

This query
Message.where("message_type = ?", "incoming").group("sender_number").count
returns a hash:
OrderedHash {"1234"=>21, "2345"=>11, "3456"=>63, "4568"=>100}
Now I want to order by the count of each group. How can I do that within the query?
The easiest way to do this is to just add an order clause to the original query. If you give the count method a specific field, it will generate an output column with the name count_{column}, which can be used in the sql generated by adding an order call:
Message.where('message_type = ?','incoming')
.group('sender_number')
.order('count_id asc').count('id')
When I tried this, rails gave me this error
SQLite3::SQLException: no such column: count_id: SELECT COUNT(*) AS count_all, state AS state FROM "ideas" GROUP BY state ORDER BY count_id desc LIMIT 3
Notice that it says SELECT ... AS count_all
So I updated the query from @Simon's answer to order by that alias instead, and it works for me
.order('count_all desc')
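If the grouped counts are already loaded, the same ordering can be done in plain Ruby on the resulting hash. A small sketch with made-up sender numbers (tally needs Ruby 2.7 or newer):

```ruby
# Made-up sender numbers, one entry per incoming message.
senders = %w[1234 2345 1234 3456 3456 3456]

# GROUP BY sender_number with COUNT(*).
counts = senders.tally

# ORDER BY count DESC; Ruby hashes keep insertion order, so to_h preserves the sort.
ordered = counts.sort_by { |_number, count| -count }.to_h
```

For large tables it is still cheaper to let the database sort, since the Ruby version needs every group in memory first.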

Rails 3.1 with PostgreSQL: GROUP BY must be used in an aggregate function

I am trying to load the latest 10 Arts grouped by user_id and ordered by created_at. This works fine with SQLite and MySQL, but gives an error on my new PostgreSQL database.
Art.all(:order => "created_at desc", :limit => 10, :group => "user_id")
ActiveRecord error:
Art Load (18.4ms) SELECT "arts".* FROM "arts" GROUP BY user_id ORDER BY created_at desc LIMIT 10
ActiveRecord::StatementInvalid: PGError: ERROR: column "arts.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT "arts".* FROM "arts" GROUP BY user_id ORDER BY crea...
Any ideas?
The SQL generated by the expression is not a valid query: you are grouping by user_id and selecting a lot of other fields alongside it, but not telling the DB how it should aggregate those other fields. For example, if your data looks like this:
a | b
---|---
1 | 1
1 | 2
2 | 3
Now when you ask db to group by a and also return b, it doesn't know how to aggregate values 1,2. You need to tell if it needs to select min, max, average, sum or something else. Just as I was writing the answer there have been two answers which might explain all this better.
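The same table in plain Ruby makes the ambiguity visible. This is just an illustration of the example data above, not database code:

```ruby
# The a | b table from the example.
rows = [{ a: 1, b: 1 }, { a: 1, b: 2 }, { a: 2, b: 3 }]

# Grouping by a leaves a list of b values per group.
groups = rows.group_by { |row| row[:a] }
             .transform_values { |group| group.map { |row| row[:b] } }

# To get one b per group, you have to pick an aggregate yourself.
max_b = groups.transform_values(&:max)
sum_b = groups.transform_values(&:sum)
```

The group for a = 1 holds two b values, and only an explicit aggregate such as max or sum turns it into a single result per group.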
In your use case though, I think you don't want a group by on db level. As there are only 10 arts, you can group them in your application. Don't use this method with thousands of arts though:
arts = Art.all(:order => "created_at desc", :limit => 10)
grouped_arts = arts.group_by {|art| art.user_id}
# now you have a hash with following structure in grouped_arts
# {
# user_id1 => [art1, art4],
# user_id2 => [art3],
# user_id3 => [art5],
# ....
# }
EDIT: Select latest_arts, but only one art per user
Just to give you the idea of sql(have not tested it as I don't have RDBMS installed on my system)
SELECT arts.* FROM arts
WHERE (arts.user_id, arts.created_at) IN
(SELECT user_id, MAX(created_at) FROM arts
GROUP BY user_id
ORDER BY MAX(created_at) DESC
LIMIT 10)
ORDER BY created_at DESC
LIMIT 10
This solution is based on the practical assumption that no two arts for the same user share the same highest created_at. That assumption may well be wrong if you are importing arts or creating them programmatically in bulk. If it doesn't hold, the SQL gets more contrived.
EDIT: Attempt to change the query to Arel:
Art.where("(arts.user_id, arts.created_at) IN
(SELECT user_id, MAX(created_at) FROM arts
GROUP BY user_id
ORDER BY MAX(created_at) DESC
LIMIT 10)").
order("created_at DESC").
page(params[:page]).
per(params[:per])
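To sanity-check what that query is meant to return, here is the same "latest art per user" selection in plain Ruby. The Art struct and the dates are made up for the illustration:

```ruby
# Made-up stand-in for rows of the arts table.
Art = Struct.new(:id, :user_id, :created_at)

arts = [
  Art.new(1, 10, Time.utc(2024, 1, 1)),
  Art.new(2, 10, Time.utc(2024, 1, 4)),
  Art.new(3, 20, Time.utc(2024, 1, 2)),
  Art.new(4, 30, Time.utc(2024, 1, 3))
]

# Per user keep only the newest art, then sort newest first and take 10.
latest = arts.group_by(&:user_id)
  .map { |_user_id, group| group.max_by(&:created_at) }
  .sort_by(&:created_at)
  .reverse
  .first(10)
```

Unlike the SQL, max_by always returns exactly one art per user even when two share the same created_at, so ties behave differently here than in the IN subquery.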
You need to select the specific columns you need
Art.select(:user_id).group(:user_id).limit(10)
It will raise an error when you try to select another column, for example title:
Art.select(:user_id, :title).group(:user_id).limit(10)
column "arts.title" must appear in the GROUP BY clause or be used in an aggregate function
That is because when you group by user_id, the query has no idea what to do with title: each group contains several titles.
As the exception says, the column must either appear in the GROUP BY clause
Art.select(:user_id, :title).group(:user_id, :title).limit(10)
or be used in an aggregate function
Art.select("user_id, array_agg(title) as titles").group(:user_id).limit(10)
Take a look at this post SQLite to Postgres (Heroku) GROUP BY
Postgres is actually following the SQL standard here, whilst SQLite and MySQL break from the standard.
Have a look at this question - Converting MySQL select to PostgreSQL. Postgres won't allow a column to be listed in the select statement unless it is in the group by clause or used in an aggregate function.
