I am using Rails 3 and PostgreSQL. I have the following genres: rock, ambience, alternative, house.
I also have two users registered; one has rock and the other has house as their genre. I need to return the rock and house genre objects.
I found two ways to do this. One is using group:
Genre.group('genres.id, genres.name, genres.cached_slug, genres.created_at, genres.updated_at').joins(:user).all
And the other using DISTINCT:
Genre.select('DISTINCT(genres.name), genres.cached_slug').joins(:user)
Both return the same desired results, but which one is better performance-wise? Using group() looks messy since I have to list all the fields of the Genre table, otherwise I get errors like this:
ActiveRecord::StatementInvalid: PGError: ERROR: column "genres.id" must appear in the GROUP BY clause or be used in an aggregate function
: SELECT genres.id FROM "genres" INNER JOIN "users" ON "users"."genre_id" = "genres"."id" GROUP BY genres.name
DISTINCT and GROUP BY usually generate the same query plan, so performance should be the same with either construct.
Since you're not using any aggregate functions, you should use the one that makes the most sense in your situation, which I believe is this one:
Genre.select('DISTINCT(genres.name), genres.cached_slug').joins(:user)
This will be clearer when you read your code later and try to remember what you did here, and, as you pointed out, it is much less messy.
Update
It depends on the version of PostgreSQL you are using. In versions earlier than 8.4, GROUP BY was faster; as of 8.4, they perform the same.
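If you'd rather check this on your own data than rely on version trivia, here is a quick sketch (assuming the genres/users schema from the question) that prints the plan Postgres picks for each form; EXPLAIN is standard PostgreSQL:
distinct_sql = Genre.select('DISTINCT(genres.name), genres.cached_slug').joins(:user).to_sql
group_sql = Genre.group('genres.id, genres.name, genres.cached_slug, genres.created_at, genres.updated_at').joins(:user).to_sql
[distinct_sql, group_sql].each do |sql|
  # Ask Postgres which plan it would use, then compare the two outputs.
  ActiveRecord::Base.connection.execute("EXPLAIN #{sql}").each { |row| puts row['QUERY PLAN'] }
  puts '---'
end
If the two plans come out identical, so will the performance.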
Related
I'm using a query with multiple user-set filters to show a list of invoices in a Rails app. One of the filters adds a where condition on a column of a separate table, which needs a double join to be accessible (estimates, through projects):
scope :by_seller, lambda { |user_id|
  joins(project: :estimates)
    .where(estimates: { user_id: user_id }) unless user_id.blank?
}
Additionally, I use Rails' aggregate method sum to find the total amount of the invoices, where total_cache is a cached column in the database designed specifically to make this kind of sum performant:
@invoices.sum(:total_cache)
My problem: given that I need a double join to access Estimates through Projects, and that each Invoice belongs to a Project but a Project can have many Estimates, the join produces duplicate records. My invoices table shows some invoices many times (as many times as the number of estimates its project has), and the sum is incorrect because it adds some invoice totals N times.
The filtering behaviour is just fine, as my intention is to filter by the user who made ANY of the estimates in the invoice's project. However, when I try to avoid the duplicates by adding a group('invoices.id') (the way I have always solved such situations), the final sum no longer returns the total over all invoices but a grouped sum for each one of them, which is useless here.
The only workaround I've found is to keep the group clause and perform the sum in pure Ruby code, treating the collection as an array, which IMHO is terribly inefficient, as there are tons of invoices:
@invoices.map(&:total_cache).inject(0, &:+)
Is there a way to obtain a unique ActiveRecord collection of Invoices without duplicates, so that I can then call the aggregate sum method and have the total calculated by Postgres?
Of course, if there is something wrong with my base idea I'm completely open to hearing it! It's quite a complex query (I simplified it for the sake of the question) and I'm sure there are many possible approaches!
I'm not sure how much slower or faster this is than doing the sum in Ruby code, but if you want to retain an ActiveRecord::Relation object, you can do something like the following. I reproduced your setup in a local Rails project.
user = User.first

Invoice.where(
  id: Invoice.by_seller(user.id).select(:id)
).sum(:total_cache)
# (1.2 ms) SELECT SUM("invoices"."total_cache") FROM "invoices" WHERE "invoices"."id" IN (SELECT "invoices"."id" FROM "invoices" INNER JOIN "projects" ON "projects"."id" = "invoices"."project_id" INNER JOIN "estimates" ON "estimates"."project_id" = "projects"."id" WHERE "estimates"."user_id" = $1) [["user_id", 1]]
# => 5
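If this comes up in more than one place, one way to package it (a sketch, not from the original answer; distinct_total_for_seller is a made-up name) is a class method next to the existing scope:
class Invoice < ActiveRecord::Base
  belongs_to :project

  scope :by_seller, lambda { |user_id|
    joins(project: :estimates)
      .where(estimates: { user_id: user_id }) unless user_id.blank?
  }

  # Sum each invoice exactly once: filter by id against the (possibly
  # duplicated) joined relation, then let Postgres compute the sum.
  def self.distinct_total_for_seller(user_id)
    where(id: by_seller(user_id).select(:id)).sum(:total_cache)
  end
end
Then Invoice.distinct_total_for_seller(user.id) returns the de-duplicated total.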
The relationship between the Battle model and the Category model is has_and_belongs_to_many. I'm trying to find all the battles that don't relate to a specific category.
What I tried to do so far is:
Battle.all.includes(:categories).where("(categories[0].name NOT IN ('Other', 'Fashion'))")
This error occurred:
ActiveRecord::StatementInvalid: PG::DatatypeMismatch: ERROR: cannot subscript type categories because it is not an array
The use of categories[0].name is not a valid SQL reference to the name column of categories.
Try this:
Battle.includes(:categories).reject do |battle|
  (battle.categories.map(&:name) & ['Other', 'Fashion']).any?
end
Note that this code performs two queries, one for Battle and one for Category, and eager loads the appropriate Category Active Record objects into the instantiated Battle records' categories association. This helps prevent N+1 queries, as described in detail in the guides.
Also note that the code above loads ALL Battle records into memory before eliminating the ones with undesired categories. Depending on your data, this could be prohibitive. If you prefer to restrict the SQL query so that only the relevant ActiveRecord objects get instantiated, you could get away with something like the following:
battle_ids_sql = Battle.select(:id).joins(:categories).where(categories: { name: ['Other', 'Fashion'] }).to_sql
Battle.where("id NOT IN (#{battle_ids_sql})")
battle_ids_sql is a SQL statement that returns the IDs of battles that HAVE one of your undesired categories. The SQL that actually executes fetches all Battle records that are not in that inner query. It's effective, but use it sparingly; queries like this tend to become hard to maintain rather quickly.
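On Rails 4 and later you can express the same shape without interpolating SQL strings, using where.not with a subquery relation (a sketch assuming the same models):
# IDs of battles that HAVE one of the unwanted categories.
tainted_ids = Battle.joins(:categories)
                    .where(categories: { name: ['Other', 'Fashion'] })
                    .select(:id)

# Battles outside that set; runs as a single query with a subselect.
Battle.where.not(id: tainted_ids)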
You can learn about joins, includes and two other related methods (eager_load and preload) here.
I'm writing a search for a project I'm working on. It is meant to search the body of articles and produce a list of their authors, ordered by the number of matching articles, and to include only the relevant articles, not all of each author's articles.
I currently have the following query:
Author.includes(:articles).where('articles.body ilike ?', '%foo%').references(:articles)
The use of includes in this case means that only the relevant articles (not all articles) are preloaded, which is exactly what I want. However, when it comes to ordering by the number of included articles, I'm not sure how to proceed.
I should note that I want to do this in ActiveRecord, because pagination will be applied after the query; a Ruby-side solution won't do.
I should note I'm using PostgreSQL 9.3.
Edit: using raw SQL
This seems to work on its own like so:
Author.includes(:articles)
      .where('articles.body ilike ?', '%foo%')
      .references(:articles)
      .select('authors.*, (SELECT COUNT(0) FROM articles WHERE articles.author_id = authors.id) AS article_count')
      .order('article_count DESC')
This works fine. However, if I add .limit(1) it breaks:
PG::UndefinedColumn: ERROR: column "article_count" does not exist
Any idea why adding limit breaks it? The query it generates is very different too:
SELECT DISTINCT "authors"."id", article_count AS alias_0 FROM "authors" LEFT OUTER JOIN "articles" ON "articles"."author_id" = "authors"."id" WHERE (articles.body ilike '%microsoft%') ORDER BY article_count DESC LIMIT 1
I don't think there's an out-of-the-box solution for this. You have to write raw SQL, but you can combine it with existing ActiveRecord queries.
Author
  .includes(:articles)
  .select('authors.*, (SELECT COUNT(0) FROM articles WHERE articles.author_id = authors.id) AS article_count')
  .order('article_count DESC')
The only part that needs explaining here is the select. The first part, authors.*, selects all fields of the authors table, which is the default. Since we also want the number of articles, we add a subquery and expose its result as a pseudo column of authors (here called article_count). The last step is simply to order by article_count.
This solution assumes a couple of things which you'll have to fine-tune depending on your setup:
Author by convention in Rails maps to an authors table. If it is an STI model (inherits from a User class and uses the users table), you'll need to change authors to users.
articles.author_id assumes the foreign key is author_id (essentially, that an article is written by a single author). Change it to whatever your foreign key is.
Given that, you'll get an array of authors ordered by the number of articles they've written.
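One caveat worth flagging: the subquery above counts all of an author's articles, not just the ones that matched the search. If the ranking should reflect only matched articles, a join-plus-group sketch (same assumed schema; note it gives up the includes preloading) could look like this:
Author.joins(:articles)
      .where('articles.body ILIKE ?', '%foo%')
      .group('authors.id')
      .select('authors.*, COUNT(articles.id) AS article_count')
      .order('article_count DESC')
Postgres (9.1 and later) allows selecting authors.* here because the grouping is on the primary key.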
I've got a User model and a Card model. User has many Cards, so Card has a user_id attribute.
I want to fetch the newest single Card for each user. I've been able to do this:
Card.all.order(:user_id, :created_at)
# => gives me all the Cards, sorted by user_id then by created_at
This gets me halfway there, and I could certainly iterate through these rows and grab the first one per user. But that smells really bad to me, as I'd be doing a lot of the work with Arrays in Ruby.
I can also do this:
Card.select('user_id, max(created_at)').group('user_id')
# => gives me user_id and created_at
...but I only get back user_ids and created_at timestamps. I can't select any other columns (including id), so what I get back is worthless. I also don't understand why PG won't let me select more columns unless they appear in the GROUP BY or an aggregate function.
I'd prefer to get what I want using only ActiveRecord, but I'm willing to write this query in raw SQL if I can't get it done with AR. BTW, I'm using a Postgres DB, which limits some of my options.
We join the cards table to itself, ON:
a) first.id != second.id
b) first.user_id = second.user_id
c) first.created_at < second.created_at
A card survives the where('c.id IS NULL') filter only when no other card for the same user has a later created_at, i.e. it is that user's newest card.
Card.joins("LEFT JOIN cards AS c ON cards.id != c.id AND c.user_id = cards.user_id AND cards.created_at < c.created_at").where('c.id IS NULL')
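Since the question mentions Postgres, another option worth knowing (separate from the join trick above) is DISTINCT ON, which keeps the first row per user_id according to the ORDER BY; note this is PostgreSQL-specific:
# One newest card per user.
Card.select('DISTINCT ON (user_id) *').order('user_id, created_at DESC')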
This is a bit late, but I am working on the same problem, and I found this works for me:
Card.order(:created_at).group_by(&:user_id).map { |_, cards| cards.last }
The order(:created_at) matters: group_by preserves order, so cards.last is the newest card in each group. What do you think?
I've found one solution that is suboptimal performance-wise but will work for very small datasets, when time is short or it's a hobby project:
Card.all.order(:user_id, created_at: :desc).to_a.uniq(&:user_id)
This takes the AR::Relation results, casts them into a Ruby Array, then performs Array#uniq on them with a block. After some brief testing it appears #uniq preserves order and keeps the first occurrence, so with created_at sorted in descending order the card kept for each user is the newest one.
The feature is time sensitive so I'm going to use this for now, but I will be looking at something in raw SQL following #Gene's response and link.
I am just learning ActiveRecord and SQL, and I was under the impression that :include does one SQL query. So if I do:
Show.first :include => :artist
It will execute one query, and that query will return the first show and its artist. But looking at the generated SQL, I see two queries:
[2013-01-08T09:38:00.455705 #1179] DEBUG -- : Show Load (0.5ms) SELECT `shows`.* FROM `shows` LIMIT 1
[2013-01-08T09:38:00.467123 #1179] DEBUG -- : Artist Load (0.5ms) SELECT `artists`.* FROM `artists` WHERE `artists`.`id` IN (2)
I saw one of the RailsCasts videos where the author was going over :include vs. :join, and the SQL output on the console was a large query, but it was only one query. I am just wondering if this is how it is supposed to be, or am I missing something?
Active Record has two ways of loading associations up front; :includes will trigger either of them, based on some heuristics.
One way is one query per association: you first load all the shows (one query), then you load all the artists (a second query). If you also included an association on artists, that would be a third query. All of these queries are simple, although it means no advantage is gained in your specific case. Because the queries are separate, you can't do things like order the top-level records (shows) by columns of the child associations.
The second way is to load everything in one big join-based query. This always produces a single query, but it's more complicated: one join per included association, and the code that turns the result set back into Ruby objects is more complicated too. There are also some corner cases: polymorphic belongs_to can't be handled, and including multiple has_many associations at the same level produces a very large result set.
Active Record uses the first strategy (preload) by default, unless it thinks your query conditions or order reference the associations, in which case it falls back to the second approach. You can force a strategy by using preload or eager_load instead of includes.
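For example, on a reasonably recent Rails (using the preload and eager_load relation methods; the artist name below is just a placeholder):
# Strategy 1, forced: two simple queries, one per table.
Show.preload(:artist).first

# Strategy 2, forced: one big LEFT OUTER JOIN query.
Show.eager_load(:artist).first

# includes decides on its own; a condition referencing the association
# pushes it to the join strategy.
Show.includes(:artist).where(artists: { name: 'Muse' }).references(:artists).first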
Using :includes is a way to get eager loading. It will run at most two queries in your example. If you changed your query to Show.all :include => :artist, it would also run just two queries.
Better explanation: Active Record Querying Eager Loading