I have been debugging my query for some time now. I narrowed the problem down to this:
I've got a local database with 44 movies. Running the query:
Movie.order(:name).joins(:movie_images).offset(40).limit(10).uniq
I retrieve the last 4 movies as expected.
Now I wanted to order the results by the movie_images.created_at field instead of movies.name. I had to add the select clause because DISTINCT requires the ORDER BY column to appear in the select list.
Movie.select("movies.*, movie_images.created_at").order("movie_images.created_at").joins(:movie_images).offset(40).limit(10).uniq
This returns the last 4 movies as expected, but it additionally returns the first 6 movies again (duplicates of movies already loaded on an earlier page), adding up to the set limit of 10.
My 44 movies each have 2 images, each with a different created_at time. So my assumption is that there are 88 distinct results according to the select clause:
SELECT DISTINCT movies.*, movie_images.created_at
Setting the offset to 80 returns the last 8 results, proving my point.
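To make the row counts concrete, here is a minimal plain-Ruby sketch of what DISTINCT sees in both queries. It does not touch ActiveRecord; the method names and the integer timestamps are made up for illustration only.

```ruby
# Hypothetical in-memory stand-in for the rows produced by the join.
# 44 movies, each with 2 images that have different created_at values.
def joined_movie_rows(movie_count = 44, images_per_movie = 2)
  (1..movie_count).flat_map do |movie_id|
    (1..images_per_movie).map do |n|
      { movie_id: movie_id, image_created_at: movie_id * 10 + n }
    end
  end
end

# DISTINCT over movie columns only: the join collapses back to one row per movie.
def distinct_movies(rows)
  rows.map { |row| row[:movie_id] }.uniq
end

# DISTINCT over movie columns plus the image created_at: every joined row is unique.
def distinct_movies_with_created_at(rows)
  rows.map { |row| [row[:movie_id], row[:image_created_at]] }.uniq
end

# OFFSET / LIMIT applied after DISTINCT.
def page_of(rows, offset:, limit:)
  rows.drop(offset).first(limit)
end
```

With offset 40 and limit 10, the first variant has only 4 rows left to return, while the second still has a full page of 10.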
But how would you order by movie_images.created_at while still only selecting distinct movies?
You could use a GROUP BY statement instead, like so:
SELECT movies.id, MAX(movie_images.created_at) mi_created_at
FROM movies JOIN movie_images ON movies.id = movie_images.movie_id
GROUP BY movies.id
ORDER BY mi_created_at
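As a rough illustration of what the GROUP BY with MAX does, here is the same aggregation in plain Ruby. The sample rows and integer timestamps are invented, and nothing here is ActiveRecord.

```ruby
# Hypothetical joined rows: one entry per (movie, image) pair.
SAMPLE_MOVIE_IMAGE_ROWS = [
  { movie_id: 1, created_at: 5 },
  { movie_id: 1, created_at: 20 },
  { movie_id: 2, created_at: 10 },
  { movie_id: 2, created_at: 12 }
].freeze

# Mirrors GROUP BY movies.id with MAX(movie_images.created_at),
# followed by ORDER BY on the aggregated value.
def newest_image_per_movie(rows)
  rows.group_by { |row| row[:movie_id] }
      .map { |movie_id, group| [movie_id, group.map { |row| row[:created_at] }.max] }
      .sort_by { |_movie_id, newest| newest }
end
```

Each movie appears exactly once, so offset and limit page over movies rather than over movie/image pairs.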
Also refer to this answer on Stackoverflow for the differences of GROUP BY and DISTINCT statements in PostgreSQL.
Another thing to consider would be a different design. Rails offers the touch option on a belongs_to association, which updates the updated_at column of the parent object whenever the child object changes (e.g. a new movie image is created).
class MovieImage < ActiveRecord::Base
  belongs_to :movie, touch: true
end
With that in place you can simply sort by the updated_at column of your Movie object.
Movie.order(updated_at: :asc).offset(40).limit(10)
Of course this would also contain updates like the Movie object itself being updated, or a MovieImage getting destroyed, etc. If you purely want to consider the creation of a new MovieImage object, you could also work with an after_create callback, and a specific column on your Movie model:
class MovieImage < ActiveRecord::Base
  belongs_to :movie
  after_create :touch_movie

  def touch_movie
    movie.update(last_movie_image_created_at: Time.now)
  end
end
And then use that column in your query:
Movie.order(:last_movie_image_created_at).offset(40).limit(10)
With those small design changes, I think it's also more readable and obvious what one's doing instead of having endless distinct or group by queries.
The question is tagged RoR, but I guess it applies more at the query level.
I have a model
ModelA has_many ModelB
and I'm using paranoid gem in both of them.
I am trying to find out all the ModelA which have either no ModelB associated or all its ModelB have been deleted (therefore deleted_at is not NULL).
Based on this question, I was able to achieve the desired result, but reading the query it does not make much sense to me how it is working.
ModelA.joins('LEFT OUTER JOIN model_bs ON model_bs.model_a_id = model_as.id AND model_bs.deleted_at IS NULL')
.where('model_bs.model_a_id IS NULL')
.group('model_as.id')
As mentioned, this query does not make sense to me when I read it, because the column used in the join condition is then also required to be NULL in the where clause.
Can someone please help me get the query set up properly and, if this is the right way to go, explain how the query breaks down into the right result?
Update:
As mentioned in a comment below, after trying some raw SQL I managed to understand what was going on. I also wrote my own AR solution to make it clearer (at least from my perspective).
Breaking it down:
This query is a regular left outer join, which returns all models on the left regardless of whether there is an associated record on the right.
ModelA.joins('LEFT OUTER JOIN model_bs ON model_bs.model_a_id = model_as.id')
Adding the extra condition in the join
AND model_bs.deleted_at IS NULL
will return rows with the ModelB attributes populated only for associated ModelBs that are not deleted; a ModelA without any such record still yields one row, with all ModelB columns NULL.
Applying
.where('model_bs.model_a_id IS NULL') to these rows keeps only the ModelAs that have no non-deleted ModelB.
Note:
In this case, using model_a_id or any other attribute of model_b will work, but if you go this way, I'd recommend using an attribute that is always present when the association exists (a NOT NULL column such as the primary key). Otherwise you might end up with wrong results.
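If it helps, the breakdown above can be replayed in plain Ruby. This is only a sketch of the join semantics with hypothetical Struct stand-ins for the two tables, not real models.

```ruby
# Hypothetical stand-ins for rows of model_as and model_bs.
ModelARow = Struct.new(:id)
ModelBRow = Struct.new(:id, :model_a_id, :deleted_at)

# LEFT OUTER JOIN model_bs ON model_bs.model_a_id = model_as.id
#                          AND model_bs.deleted_at IS NULL
# A ModelA without any matching ModelB still yields one row, paired with nil.
def left_join_active_bs(model_as, model_bs)
  model_as.flat_map do |a|
    matches = model_bs.select { |b| b.model_a_id == a.id && b.deleted_at.nil? }
    matches.empty? ? [[a, nil]] : matches.map { |b| [a, b] }
  end
end

# WHERE model_bs.model_a_id IS NULL keeps only the rows padded with NULLs.
def ids_without_active_bs(model_as, model_bs)
  left_join_active_bs(model_as, model_bs)
    .select { |_a, b| b.nil? }
    .map { |a, _b| a.id }
end
```

A ModelA with only deleted ModelBs ends up in the result because the deleted rows never match the ON condition, so it is treated exactly like a ModelA with no ModelB at all.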
Finally, this is the AR query I ended up using to express what I wanted:
ModelA.joins('LEFT OUTER JOIN model_bs ON model_bs.model_a_id = model_as.id AND model_bs.deleted_at IS NULL')
.group('model_as.id')
.having('count(model_bs.id) = 0')
You can also try it this way. I have tested it in MySQL; you should verify it with PG.
ModelA.joins('LEFT OUTER JOIN model_bs ON model_bs.model_a_id = model_as.id')
.group('model_as.id')
.having('SUM(CASE WHEN model_bs.id IS NOT NULL AND model_bs.deleted_at IS NULL THEN 1 ELSE 0 END) = 0')
I'm grouping a list of Bug reports on a known collection of users that are related to the report (that is, the user that is responsible for the report and the user that is currently assigned to it).
The Model Bug (AR, Rails 4.2.x) thus has, among others, two associations assigned_to and responsible, which are resolved to the foreign keys assigned_to_id, responsible_id.
Bugs can also be related to a project, which may also have a responsible user set, thus they also possess a responsible_id foreign key.
As we're grouping on both attributes from the report itself and the associated project, we want to include the associated project in the returned query.
I can then get a hash count of <User> => count through the following statement, grouping on the association name of the bug report:
Bug.group(:assigned_to)
.includes(:project)
.references(:projects)
.count
which correctly produces the desired result: a collection of Users (assignees) and the number of Bugs assigned to each of them.
For responsibles, the same query:
Bug.group(:responsible)
.includes(:project)
.references(:projects)
.count
yields an error, since the column responsible_id exists in both bugs and the associated projects, which makes the unqualified reference in the SELECT ambiguous:
SELECT COUNT(DISTINCT "bugs"."id") AS count_id,
responsible_id AS responsible_id
FROM "bugs"
LEFT OUTER JOIN "projects" ON "projects"."id" = "bugs"."project_id"
GROUP BY "bugs"."responsible_id"
If I instead group on the explicit attribute itself using Bug.group('bugs.responsible_id'), I get a valid response, but in the form of responsible_id => count.
SELECT COUNT(DISTINCT "bugs"."id") AS count_id,
bugs.responsible_id AS bugs_responsible_id
FROM "bugs"
LEFT OUTER JOIN "projects" ON "projects"."id" = "bugs"."project_id"
WHERE <condition>
GROUP BY bugs.responsible_id
Is there a way to force using the association, but namespace the query as in the second query?
Of course I could process the result and expand it to the responsible users, however since the grouping is part of a larger querying functionality, I only get to manipulate the grouping identifier without extensive changes to the query builder.
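For reference, the post-processing mentioned above would look roughly like this. It is a plain-Ruby sketch; BugUser is a hypothetical stand-in for the real User model, and in the app the users would presumably be loaded with something like User.where(id: counts.keys).

```ruby
# Hypothetical stand-in for the User model.
BugUser = Struct.new(:id, :name)

# Turns { responsible_id => count } into { user => count }.
# Ids without a matching user (e.g. nil) end up under a nil key.
def expand_grouped_counts(counts_by_id, users)
  users_by_id = users.each_with_object({}) { |user, index| index[user.id] = user }
  counts_by_id.each_with_object({}) do |(user_id, count), result|
    result[users_by_id[user_id]] = count
  end
end
```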
I don't think there is a fix for this now (in Rails 4.2.4). This will however become easy in Rails 5.
If you absolutely must solve the problem now, you could patch ActiveRecord::Calculations#execute_grouped_calculation with the fix available in rails 5 for your app. Simply add an initializer at config/initializers e.g. active_record_calculations_patch.rb with the following (abbreviated) content. You can copy the original code from your rails version and then add the fix:
module ActiveRecord
module Calculations
def execute_grouped_calculation(operation, column_name, distinct)
...
else
group_fields = group_attrs
end
# LINE OF CODE COPIED OVER FROM THE FIX
group_fields = arel_columns(group_fields)
# END OF COPIED OVER CODE
group_aliases = group_fields.map { |field|
column_alias_for(field)
...
end
end
end
My Rails 4 app has a User model, a Link model, and a Hit model. Each User has many Links, and each Link has many Hits. Occasionally, I want to display a list of the User's Links with the number of Hits it has.
The obvious way to do this would be to loop over the links and call link.hits.count on each one, but this produces N+1 queries. So instead, I wrote a scope which joins the hits table:
scope :with_hit_counts, -> {
  joins("LEFT OUTER JOIN hits ON hits.link_id = links.id")
    .select('links.*', 'count(hits.link_id) AS hit_count')
    .group("links.id")
}
This effectively adds a virtual hit_count attribute to each Link, which is computed in a single query. Curiously, it appears to be a separate query from loading the links, rather than actually being done in the same query:
SELECT COUNT(*) AS count_all, links.id AS links_id
FROM "links" LEFT OUTER JOIN hits ON hits.link_id = links.id
WHERE "links"."user_id" = $1
GROUP BY links.id
ORDER BY "links"."domain_id" ASC, "links"."custom_slug" ASC, "links"."id" ASC ;
Unfortunately, as the hits table grows, this has become a slow query. EXPLAIN indicates that the query is joining all hits with their matching links using an index, and then narrowing the links down to just the ones with the correct user_id by sequential scan; that seems to be the reason it's slow. However, if we're already loading the list of links separately—and we are—there's no actual need to join the links table at all. We can get the list of link IDs for the user and then do a query purely on the hits table with hits.link_id IN (list of IDs).
It's easy to write this as a separate query, and it runs lightning-fast:
Hit.where(link_id: @user.links.ids).group(:link_id).count
The problem is, I can't figure out how to get ActiveRecord to do this as a scope on the Link model, so that each Link has a hit_count attribute I can use, and so that I can use the resulting return value as a relation with the ability to chain other queries onto it. Any ideas?
(I do know about ActiveRecord's counter_cache feature, but I don't want to use it here—hits are inserted by a separate, non-Ruby system, and modifying that system to update the counter cache would be moderately painful.)
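To show the shape of what I'm after, here is a plain-Ruby sketch of merging the fast count query's result back onto already-loaded links. LinkRow is a hypothetical stand-in for the Link model, and the counts hash plays the role of the { link_id => count } result above. This is exactly the part that is not a chainable relation, which is the problem.

```ruby
# Hypothetical stand-in for a loaded Link record with a virtual hit_count.
LinkRow = Struct.new(:id, :slug, :hit_count)

# Assigns each link its count from { link_id => count }, defaulting to 0
# for links that have no hits and therefore no key in the hash.
def attach_hit_counts(links, counts_by_link_id)
  links.each { |link| link.hit_count = counts_by_link_id.fetch(link.id, 0) }
end
```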
As trh mentioned, the common way to go about this is to add a hits_count column to the links table. Then you can query and sort quickly. You just have to keep the counts in sync. If it truly is a basic count, then you can use the basic Rails counter cache. This will increment and decrement on create and destroy.
class AddCounterCacheToLink < ActiveRecord::Migration
  def up
    add_column :links, :hits_count, :integer, :default => 0
    Link.reset_column_information
    # Backfill in batches; counter cache columns are read-only for normal
    # attribute updates, so recount through reset_counters instead.
    Link.find_each do |l|
      Link.reset_counters(l.id, :hits)
    end
  end
end
And then in the model
class Hit < ActiveRecord::Base
  belongs_to :link, :counter_cache => true # this will trigger the rails magic
end
If you have something more complicated, you can write your own, which is trivial.
I have two models, Monkey and Session, where Monkey has_many Session. I have a scope for Monkey:
scope :with_session_counts, -> {
  joins("LEFT OUTER JOIN `sessions` ON `sessions`.`monkey_id` = `monkeys`.`id`")
    .group(:id)
    .select("`monkeys`.*, COUNT(DISTINCT `sessions`.`id`) as session_count")
}
in order to grab the number of associated Sessions (even when 0).
Querying @monkeys = Monkey.with_session_counts works as expected. However, when I test in my view:
<% unless @monkeys.empty? %>
I get this error:
Mysql2::Error: Column 'id' in field list is ambiguous:
SELECT COUNT(*) AS count_all, id AS id FROM `monkeys`
LEFT OUTER JOIN `sessions` ON `sessions`.`monkey_id` = `monkeys`.`id`
GROUP BY `monkeys`.`id`
How would I convince Rails to prefix id with the table name in presence of the JOIN?
Or is there a better alternative for the OUTER JOIN?
This applies equally to calling @monkeys.count(:all). I'm using RoR 4.2.1.
Update:
I have a partial fix for my issue (specifying group("monkeys.id") explicitly), but I wonder whether this is a bug in the code that generates the SELECT clause for count(:all). Note that in both cases (group("monkeys.id") and group(:id)) the GROUP BY part is generated correctly (i.e. with monkeys.id), but in the latter case the SELECT only contains id AS id. The reason I say 'partial' is that it no longer breaks a call to empty?, but a call to count(:all) returns a Hash {monkey_id => number_of_sessions} instead of the number of records.
Update 2:
I guess my real question is: How can I get the number of associated sessions for each monkey, so that for all intents and purposes I can work with the query result as with Monkey.all? I know about counter cache but would prefer not to use it.
I believe it is not a bug. As you added in your update, you have to specify the table that the id column belongs to. In this case group('monkeys.id') would do it.
How would the code responsible for generating the statement know which table to use? Without the count it worked fine because monkeys.* is added to the projection, and that is what GROUP BY uses. However, if you actually wanted to group by the sessions id, you would have to specify it anyway.
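As for the Hash mentioned in the update: a count on a grouped relation gives one entry per group, so the number of records is the number of keys. Here is a plain-Ruby sketch of that shape, with invented data and a hypothetical helper name, mirroring the per-monkey session_count of the scope rather than calling ActiveRecord.

```ruby
# Builds { monkey_id => number_of_sessions }, including monkeys with 0 sessions,
# which is the per-group shape a grouped count returns.
def grouped_session_counts(monkey_ids, sessions)
  counts = monkey_ids.each_with_object({}) { |id, acc| acc[id] = 0 }
  sessions.each do |session|
    monkey_id = session[:monkey_id]
    counts[monkey_id] += 1 if counts.key?(monkey_id)
  end
  counts
end

# The "number of records" is the number of groups, not a sum of the values.
def grouped_record_count(counts)
  counts.size
end
```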
When using a has_many association to manage a series of tags, what is the most efficient way to order/sort the collection by the number of tags selected?
For example:
Product can have many tags through ProductTags
When a user selects the tags, I would like to order the products by the number of the selected tags each product has.
Is it possible to use a counter_cache or something similar in this case? I'm not convinced using sort is the best option. Am I correct in thinking that using order on the actual database is generally faster than sort?
Clarification/update
Sorry if the above is confusing. Basically what I'm after is closer to ordering by relevancy. For example, a user might select tags 1, 2, and 4. If a product has all three tags associated with it, I want that product listed first. The second product might only have tags 1 & 4. And so on. I'm almost certain that this will have to use sort versus order, but was wondering if anyone has found a more efficient way of doing this.
Ordering by relevance within the database is both possible and far more efficient than using the sort method in Ruby. Assuming the following model structure and an appropriate underlying SQL table structure:
class Product < ActiveRecord::Base
  has_many :product_taggings
  has_many :product_tags, :through => :product_taggings
end

class ProductTag < ActiveRecord::Base
  has_many :product_taggings
  has_many :products, :through => :product_taggings
end

class ProductTagging < ActiveRecord::Base
  belongs_to :product
  belongs_to :product_tag
end
Querying for relevance in MySQL would look something like:
SELECT
`product_id`
,COUNT(*) AS relevance
FROM
`product_taggings` AS ptj
LEFT JOIN
`products` AS p
ON p.`id` = ptj.`product_id`
LEFT JOIN
`product_tags` AS pt
ON pt.`id` = ptj.`product_tag_id`
WHERE
pt.`name` IN ('Tag 1', 'Tag 2')
GROUP BY
`product_id`
If I have the following products and related tags:
Product 1 -> Tag 3
Product 2 -> Tag 1, Tag 2
Product 3 -> Tag 1, Tag 3
Then the WHERE clause from above should net me:
product_id | relevance
----------------------
2 | 2
3 | 1
* Product 1 is not included since there were no matches.
Given that the user is performing a filtered search,
this behavior is probably fine. There's a way to get
Product 1 into the results with 0 relevance if
necessary.
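The same relevance calculation, replayed in plain Ruby on the three example products, looks like this. It is only a sketch to check the expected table; the constant and method names are made up, and the real work should stay in the database.

```ruby
# The example products and their tag names from above.
PRODUCT_TAG_NAMES = {
  1 => ["Tag 3"],
  2 => ["Tag 1", "Tag 2"],
  3 => ["Tag 1", "Tag 3"]
}.freeze

# Counts how many of the selected tags each product has (COUNT(*) per product_id),
# drops products without a match (the WHERE ... IN filter), and sorts by
# relevance descending, with the product id as a tie-breaker.
def relevance_by_product(product_tags, selected_tags)
  product_tags
    .map { |product_id, tags| [product_id, (tags & selected_tags).size] }
    .reject { |_product_id, relevance| relevance.zero? }
    .sort_by { |product_id, relevance| [-relevance, product_id] }
end
```

Calling relevance_by_product(PRODUCT_TAG_NAMES, ["Tag 1", "Tag 2"]) gives product 2 with relevance 2 followed by product 3 with relevance 1, matching the table.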
What you've done is create a nice little result set that can act as a sort of inline join table. In order to stick a relevance score onto each row of a query from your products table, use this query as a subquery as follows:
SELECT *
FROM
`products` AS p
,(SELECT
`product_id`
,COUNT(*) AS relevance
FROM
`product_taggings` AS ptj
LEFT JOIN
`products` AS p
ON p.`id` = ptj.`product_id`
LEFT JOIN
`product_tags` AS pt
ON pt.`id` = ptj.`product_tag_id`
WHERE
pt.`name` IN ('Tag 1', 'Tag 2')
GROUP BY `product_id`
) AS r
WHERE
p.`id` = r.`product_id`
ORDER BY
r.`relevance` DESC
What you'll have is a result set containing the fields from your products table and an additional relevance column at the end that will then be used in the ORDER BY clause.
You'll need to write up a method that will in-fill this query with your desired pt.name IN list. Be certain to sanitize that list before plugging it into the query or you'll open yourself up to possible SQL injection.
Take the result of your query assembling method and run it through Product.find_by_sql(my_relevance_sql) to get your models pre-sorted by relevance directly from the DB.
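One possible shape for that query-assembling method is sketched below. It builds the statement with one ? placeholder per tag and returns it together with the tag names, which is the array form that find_by_sql accepts, so the names are quoted by ActiveRecord rather than pasted into the string. The method name is hypothetical, and the subquery is a slightly simplified variant of the one above that uses explicit JOINs.

```ruby
# Returns [sql, *tag_names]: the SQL contains only "?" placeholders,
# never the tag names themselves.
def relevance_sql_with_binds(tag_names)
  raise ArgumentError, "tag_names must not be empty" if tag_names.empty?

  placeholders = Array.new(tag_names.size, "?").join(", ")
  sql = <<~SQL
    SELECT p.*, r.relevance
    FROM products AS p
    JOIN (
      SELECT ptj.product_id, COUNT(*) AS relevance
      FROM product_taggings AS ptj
      JOIN product_tags AS pt ON pt.id = ptj.product_tag_id
      WHERE pt.name IN (#{placeholders})
      GROUP BY ptj.product_id
    ) AS r ON r.product_id = p.id
    ORDER BY r.relevance DESC
  SQL
  [sql, *tag_names]
end
```

The result would then be passed along the lines of Product.find_by_sql(relevance_sql_with_binds(["Tag 1", "Tag 2"])).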
The obvious down-side is that you introduce a DBMS-specific dependency into your Rails code (and risk SQL injection if you're not careful). If you're not using MySQL, the syntax might need to be adapted. However, it should perform much faster, especially on a huge result set, than using a Ruby sort on the results. Furthermore, adding a LIMIT clause will give you pagination support if needed.
Building on Ryan's excellent answer, I wanted a method that could be used with acts-as-taggable-on and similar plug-ins (tables called tags/taggings), and ended up with this:
def Product.find_by_tag_list(tag_list)
  # Bind the tag names instead of concatenating them into the SQL string,
  # so ActiveRecord quotes them and the query is not open to SQL injection.
  Product.find_by_sql(["SELECT * FROM products, (SELECT taggable_id, COUNT(*) AS relevance FROM taggings LEFT JOIN tags ON tags.id = taggings.tag_id WHERE tags.name IN (?) GROUP BY taggable_id) AS r WHERE products.id = r.taggable_id ORDER BY r.relevance DESC", tag_list])
end
To get a list of related products ordered by relevance, I then can do:
Product.find_by_tag_list(my_product.tag_list)