Fastest way to order by matching has_many :through association? - ruby-on-rails

When using a has_many :through association to manage a series of tags, what is the most efficient way to order/sort the collection by the number of selected tags each record matches?
For example:
Product can have many tags through ProductTags
When a user selects the tags, I would like to order the products by the number of the selected tags each product has.
Is it possible to use a counter_cache or something similar in this case? I'm not convinced using sort is the best option. Am I correct in thinking that ordering in the database (order) is generally faster than sorting in Ruby (sort)?
Clarification/update
Sorry if the above is confusing. Basically what I'm after is closer to ordering by relevance. For example, a user might select tags 1, 2, and 4. If a product has all three tags associated with it, I want that product listed first. The second product might only have tags 1 and 4. And so on. I'm almost certain this will have to use sort rather than order, but I was wondering if anyone has found a more efficient way of doing it.

Ordering by relevance within the database is both possible and far more efficient than using the sort method in Ruby. Assuming the following model structure and an appropriate underlying SQL table structure:
class Product < ActiveRecord::Base
  has_many :product_taggings
  has_many :product_tags, :through => :product_taggings
end

class ProductTag < ActiveRecord::Base
  has_many :product_taggings
  has_many :products, :through => :product_taggings
end

class ProductTagging < ActiveRecord::Base
  belongs_to :product
  belongs_to :product_tag
end
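For reference, a minimal sketch of the underlying tables (my addition, not from the original answer; column names match the SQL below):

class CreateProductTagging < ActiveRecord::Migration
  def change
    create_table :product_tags do |t|
      t.string :name
    end
    create_table :product_taggings do |t|
      t.references :product
      t.references :product_tag
    end
    add_index :product_taggings, [:product_id, :product_tag_id]
  end
end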
Querying for relevance in MySQL would look something like:
SELECT
`product_id`
,COUNT(*) AS relevance
FROM
`product_taggings` AS ptj
LEFT JOIN
`products` AS p
ON p.`id` = ptj.`product_id`
LEFT JOIN
`product_tags` AS pt
ON pt.`id` = ptj.`product_tag_id`
WHERE
pt.`name` IN ('Tag 1', 'Tag 2')
GROUP BY
`product_id`
If I have the following products and related tags:
Product 1 -> Tag 3
Product 2 -> Tag 1, Tag 2
Product 3 -> Tag 1, Tag 3
Then the WHERE clause from above should net me:
product_id | relevance
----------------------
2 | 2
3 | 1
* Product 1 is not included since there were no matches. Given that the user is performing a filtered search, this behavior is probably fine. There's a way to get Product 1 into the results with 0 relevance if necessary (a sketch follows).
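For instance (my sketch, not part of the original answer): query from products instead and move the tag filter into the join condition, so unmatched products survive the LEFT JOIN with a zero count:
SELECT
  p.`id` AS product_id
  ,COUNT(pt.`id`) AS relevance
FROM
  `products` AS p
LEFT JOIN
  `product_taggings` AS ptj
  ON ptj.`product_id` = p.`id`
LEFT JOIN
  `product_tags` AS pt
  ON pt.`id` = ptj.`product_tag_id`
  AND pt.`name` IN ('Tag 1', 'Tag 2')
GROUP BY
  p.`id`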
Back to the filtered query: what you've done is create a nice little result set that can act as a sort of inline join table. In order to stick a relevance score onto each row of a query from your products table, use this query as a subquery as follows:
SELECT *
FROM
`products` AS p
,(SELECT
`product_id`
,COUNT(*) AS relevance
FROM
`product_taggings` AS ptj
LEFT JOIN
`products` AS p
ON p.`id` = ptj.`product_id`
LEFT JOIN
`product_tags` AS pt
ON pt.`id` = ptj.`product_tag_id`
WHERE
pt.`name` IN ('Tag 1', 'Tag 2')
GROUP BY `product_id`
) AS r
WHERE
p.`id` = r.`product_id`
ORDER BY
r.`relevance` DESC
What you'll have is a result set containing the fields from your products table and an additional relevance column at the end that will then be used in the ORDER BY clause.
You'll need to write a method that fills in this query with your desired pt.name IN list. Be certain to sanitize that list before plugging it into the query, or you'll open yourself up to SQL injection.
Take the result of your query assembling method and run it through Product.find_by_sql(my_relevance_sql) to get your models pre-sorted by relevance directly from the DB.
The obvious down-side is that you introduce a DBMS-specific dependency into your Rails code (and risk SQL injection if you're not careful). If you're not using MySQL, the syntax might need to be adapted. However, it should perform much faster, especially on a huge result set, than using a Ruby sort on the results. Furthermore, adding a LIMIT clause will give you pagination support if needed.
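A minimal sketch of such a query-assembling method (the method name is mine, not from the answer; connection.quote handles the sanitizing mentioned above):

class Product < ActiveRecord::Base
  def self.find_by_relevance(tag_names)
    quoted = tag_names.map { |name| connection.quote(name) }.join(', ')
    find_by_sql(<<-SQL)
      SELECT p.*, r.relevance
      FROM products AS p
      ,(SELECT product_id, COUNT(*) AS relevance
        FROM product_taggings AS ptj
        LEFT JOIN product_tags AS pt ON pt.id = ptj.product_tag_id
        WHERE pt.name IN (#{quoted})
        GROUP BY product_id) AS r
      WHERE p.id = r.product_id
      ORDER BY r.relevance DESC
    SQL
  end
end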

Building on Ryan's excellent answer, I wanted a method that could be used with acts-as-taggable-on and similar plugins (tables called tags/taggings), and ended up with this:
def Product.find_by_tag_list(tag_list)
  tag_list_sql = tag_list.map { |tag| connection.quote(tag) }.join(',')
  Product.find_by_sql(
    "SELECT * FROM products,
       (SELECT taggable_id, COUNT(*) AS relevance
        FROM taggings
        LEFT JOIN tags ON tags.id = taggings.tag_id
        WHERE tags.name IN (#{tag_list_sql})
        GROUP BY taggable_id) AS r
     WHERE products.id = r.taggable_id
     ORDER BY r.relevance DESC;")
end
To get a list of related products ordered by relevance, I then can do:
Product.find_by_tag_list(my_product.tag_list)

Related

Rails: Collect records whose join tables appear in two queries

There are three models that matter here: Objective, Student, and Seminar. All are associated with has_and_belongs_to_many.
There is an ObjectiveStudent join model that includes columns "ready" and "points_all_time". There is an ObjectiveSeminar join model that includes column "priority".
I need to collect all of the objectives that are associated with a given student and also with a given seminar.
They need to also be marked with a "priority" above zero in the seminar. So I think I need this line:
obj_sems = ObjectiveSeminar.where(:seminar => given_seminar).where("priority > ?", 0)
Finally, they need to also be objectives where the student is ready, but has not scored above 7. So I think I need this line:
obj_studs = ObjectiveStudent.where(:user => given_student, :ready => true).where("points_all_time <= ?", 7)
Is there a way to gather all the objectives whose join-table records appear in both of the above queries? Note that neither query returns objectives; they return objective_seminars and objective_students, respectively. My end goal is to collect the objectives that meet all of the above criteria.
Or am I approaching this all wrong?
Bonus question: I would also love to sort the objectives by their priority in the given seminar. But I'm afraid that would add too much to the database load. What are your thoughts on this?
Thank you in advance for any insight.
To get Objectives, you'll need to start your query from that model.
To apply AND conditions across the associated tables, you'll need inner joins with them.
Finally, you'll need a distinct operator so each objective is fetched only once.
The extended version of what (I think) you need is:
Objective.joins(objective_seminars: :seminar, objective_students: :student).
  where(seminars: seminar_search_params, students: student_search_params).
  where('objective_seminars.priority > 0').
  where('objective_students.ready = 1 AND objective_students.points_all_time <= 7').
  order('objective_seminars.priority ASC').
  distinct
Now for the database load it all depends on your indexes and the size of your tables.
The above query will translate to the following SQL (or something similar).
SELECT DISTINCT objectives.* FROM objectives
INNER JOIN objective_students ON objective_students.objective_id = objectives.id
INNER JOIN students ON students.id = objective_students.student_id
INNER JOIN objective_seminars ON objective_seminars.objective_id = objectives.id
INNER JOIN seminars ON seminars.id = objective_seminars.seminar_id
WHERE seminars_query AND
      students_query AND
      objective_seminars.priority > 0 AND
      objective_students.ready = 1 AND objective_students.points_all_time <= 7
ORDER BY objective_seminars.priority ASC
So you'll need to add or extend your indexes so that the queries against all five tables have an index helping out. The actual index design is up to you and depends on your application's specifics (read/write load, table sizes, cardinality, etc.); a sketch of one possibility follows.
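A minimal migration sketch (my addition; the exact column choices are assumptions to tune against your data):

class AddObjectiveJoinIndexes < ActiveRecord::Migration
  def change
    # Cover the join keys plus the columns filtered in the WHERE clause.
    add_index :objective_seminars, [:seminar_id, :priority, :objective_id]
    add_index :objective_students, [:student_id, :ready, :points_all_time, :objective_id]
  end
end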

Adding related data to ActiveRecords without using a join

My Rails 4 app has a User model, a Link model, and a Hit model. Each User has many Links, and each Link has many Hits. Occasionally, I want to display a list of the User's Links with the number of Hits it has.
The obvious way to do this would be to loop over the links and call link.hits.count on each one, but this produces N+1 queries. So instead, I wrote a scope which joins the hits table:
scope :with_hit_counts, -> {
  joins("LEFT OUTER JOIN hits ON hits.link_id = links.id").
    select('links.*', 'count(hits.link_id) AS hit_count').
    group("links.id")
}
This effectively adds a virtual hit_count attribute to each Link, which is computed in a single query. Curiously, it appears to be a separate query from loading the links, rather than actually being done in the same query:
SELECT COUNT(*) AS count_all, links.id AS links_id
FROM "links" LEFT OUTER JOIN hits ON hits.link_id = links.id
WHERE "links"."user_id" = $1
GROUP BY links.id
ORDER BY "links"."domain_id" ASC, "links"."custom_slug" ASC, "links"."id" ASC ;
Unfortunately, as the hits table grows, this has become a slow query. EXPLAIN indicates that the query is joining all hits with their matching links using an index, and then narrowing the links down to just the ones with the correct user_id by sequential scan; that seems to be the reason it's slow. However, if we're already loading the list of links separately—and we are—there's no actual need to join the links table at all. We can get the list of link IDs for the user and then do a query purely on the hits table with hits.link_id IN (list of IDs).
It's easy to write this as a separate query, and it runs lightning-fast:
Hit.where(link_id: @user.links.ids).group(:link_id).count
The problem is, I can't figure out how to get ActiveRecord to do this as a scope on the Link model, so that each Link has a hit_count attribute I can use, and so that I can use the resulting return value as a relation with the ability to chain other queries onto it. Any ideas?
(I do know about ActiveRecord's counter_cache feature, but I don't want to use it here—hits are inserted by a separate, non-Ruby system, and modifying that system to update the counter cache would be moderately painful.)
As trh mentioned, the common way to go about this is to add a hits_count column to Link. Then you can query and sort quickly; you just have to keep it in sync. If it truly is a basic count, you can use the basic Rails counter cache, which increments and decrements on create and destroy.
class AddCounterCacheToLink < ActiveRecord::Migration
  def up
    add_column :links, :hits_count, :integer, :default => 0
    Link.reset_column_information
    Link.all.each do |l|
      l.update_attribute :hits_count, l.hits.size
    end
  end
end
And then in the model
class Hit < ActiveRecord::Base
  belongs_to :link, :counter_cache => true # this will trigger the Rails magic
end
If you have something more complicated, you can write your own, which is straightforward; a sketch follows.
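For instance, a minimal hand-rolled version (my sketch) using Rails' increment_counter/decrement_counter against the hits_count column added above:

class Hit < ActiveRecord::Base
  belongs_to :link
  # Keep links.hits_count in sync without the built-in :counter_cache option.
  after_create  { Link.increment_counter(:hits_count, link_id) }
  after_destroy { Link.decrement_counter(:hits_count, link_id) }
end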

How to build inner join in Rails with conditions?

I have a StockUpdate model which keeps track of stock for every product in a store. Its attributes are :product_id, :stock, and :store_id. I'm trying to find the last entry for every product in a given store. I built the query below in pgAdmin, and it works fine, but I'm new to Rails and don't know how to express it in the model. Please help.
SELECT a.*
FROM stock_updates a
INNER JOIN
(
SELECT product_id, MAX(id) max_id
FROM stock_updates where store_id = 9 and stock > 0
GROUP BY product_id
) b ON a.product_id = b.product_id AND
a.id = b.max_id
I don't fully understand what you want to do, but I think you can do something like this:
class StockUpdate < ActiveRecord::Base
  scope :a_good_name, -> { joins(:product).where('store_id = ? and stock > ?', 9, 0) }
end
You can then call StockUpdate.a_good_name.explain to check the generated SQL.
What you need is really simple and can be easily accomplished with 2 queries. Otherwise it becomes very complicated in a single query (it's still doable though):
store_ids = [0, 9]
# group(...).maximum(:id) returns a hash of product_id => highest id;
# .values extracts just those ids.
latest_stock_update_ids = StockUpdate.
  where(store_id: store_ids).
  group(:product_id).
  maximum(:id).
  values

StockUpdate.where(id: latest_stock_update_ids)
Two queries, without any joins necessary. The same could be possible with a single query too. But like your original code, it would include subqueries.
Something like this should work:
StockUpdate.
where(store_id: store_ids).
where("stock_updates.id = (
SELECT MAX(su.id) FROM stock_updates AS su WHERE (
su.product_id = stock_updates.product_id
)
)
")
Or perhaps:
StockUpdate.where("id IN (
SELECT MAX(su.id) FROM stock_updates AS su GROUP BY su.product_id
)")
And to answer your original question, you can manually specify a join like so (a sketch applying it to your query follows):
Model1.joins("INNER JOIN #{Model2.table_name} ON #{conditions}")
# That INNER JOIN can also be LEFT OUTER JOIN, etc.
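Applied to the pgAdmin query from the question (my sketch; the subquery is the one you already wrote):

StockUpdate.joins(
  "INNER JOIN (
     SELECT product_id, MAX(id) AS max_id
     FROM stock_updates
     WHERE store_id = 9 AND stock > 0
     GROUP BY product_id
   ) b ON stock_updates.product_id = b.product_id
      AND stock_updates.id = b.max_id"
)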

offset not selecting distinct values

I've been debugging my query for some time now, and I've narrowed the problem down to this:
I've got a local database with 44 movies. Running the query:
Movie.order(:name).joins(:movie_images).offset(40).limit(10).uniq
I retrieve the last 4 movies as expected.
Now I wanted to order the results by the movie_images.created_at field instead of the movies.name. I had to add the select clause as distinct requires that.
Movie.select("movies.*, movie_images.created_at").order("movie_images.created_at").joins(:movie_images).offset(40).limit(10).uniq
This returns the last 4 movies as expected, but it additionally returns the first 6 movies again (now being duplicates as loaded earlier), adding up to the set limit of 10.
My 44 movies each have 2 images, each with a different created_at time. So my assumption is that there are 88 distinct results according to the select clause:
SELECT DISTINCT movies.*, movie_images.created_at
Setting the offset to 80 returns the last 8 results, proving my point.
But how would you order by movie_images.created_at while still only selecting distinct movies?
You could use a GROUP BY statement instead, like so:
SELECT movies.id, MAX(movie_images.created_at) mi_created_at
FROM movies JOIN movie_images ON movies.id = movie_images.movie_id
GROUP BY movies.id
ORDER BY mi_created_at
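In ActiveRecord that might look something like this (my sketch; selecting movies.* alongside GROUP BY movies.id relies on PostgreSQL 9.1+ recognizing the primary-key functional dependency):

Movie.joins(:movie_images).
  select("movies.*, MAX(movie_images.created_at) AS mi_created_at").
  group("movies.id").
  order("mi_created_at").
  offset(40).limit(10)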
Also refer to this answer on Stack Overflow for the differences between GROUP BY and DISTINCT statements in PostgreSQL.
Another thing to consider would be a different design. Rails offers the touch option on a belongs_to association. Meaning it will update the updated_at column of the parent object if the child object changes (i.e. a new movie image is being created, etc).
class MovieImage < ActiveRecord::Base
belongs_to :movie, touch: true
end
With that in place you can simply sort by the updated_at column of your Movie object.
Movie.order(updated_at: :asc).offset(40).limit(10)
Of course this would also contain updates like the Movie object itself being updated, or a MovieImage getting destroyed, etc. If you purely want to consider the creation of a new MovieImage object, you could also work with an after_create callback, and a specific column on your Movie model:
class MovieImage < ActiveRecord::Base
  belongs_to :movie
  after_create :touch_movie

  def touch_movie
    movie.update(last_movie_image_created_at: Time.now)
  end
end
And then use that column in your query:
Movie.order(:last_movie_image_created_at).offset(40).limit(10)
With those small design changes, I think it's also more readable and obvious what one's doing instead of having endless distinct or group by queries.

Rails expanding fields with scope, PG does not like it

I have a model of Widgets. Widgets belong to a Store model, which belongs to an Area model, which belongs to a Company. At the Company model, I need to find all associated widgets. Easy:
class Widget < ActiveRecord::Base
  def self.in_company(company)
    includes(:store => {:area => :company}).where(:companies => {:id => company.id})
  end
end
Which will generate this beautiful query:
> Widget.in_company(Company.first).count
SQL (50.5ms) SELECT COUNT(DISTINCT "widgets"."id") FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1
=> 15088
But I later need to use this scope in a more complex scope. The problem is that AR expands the query by selecting individual fields, which fails in PG because every selected field must appear in the GROUP BY clause or be used in an aggregate function.
Here is the more complex scope.
def self.sum_amount_chart_series(company, start_time)
  orders_by_day = Widget.in_company(company).archived.not_void.
    where(:print_datetime => start_time.beginning_of_day..Time.zone.now.end_of_day).
    group(pg_print_date_group).
    select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
end

def self.pg_print_date_group
  "CAST((print_datetime + interval '#{tz_offset_hours} hours') AS date)"
end
And this is the select it is throwing at PG:
> Widget.sum_amount_chart_series(Company.first, 1.day.ago)
SELECT "widgets"."id" AS t0_r0, "widgets"."user_id" AS t0_r1,<...BIG SNIP, YOU GET THE IDEA...> FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1 AND "widgets"."archived" = 't' AND "widgets"."voided" = 'f' AND ("widgets"."print_datetime" BETWEEN '2011-04-24 00:00:00.000000' AND '2011-04-25 23:59:59.999999') GROUP BY CAST((print_datetime + interval '-7 hours') AS date)
Which generates this error:
PGError: ERROR: column "widgets.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT "widgets"."id" AS t0_r0, "widgets"."user_id...
How do I rewrite the Widget.in_company scope so that AR does not expand the select query to include every Widget model field?
As Frank explained, PostgreSQL will reject any query which doesn't return a reproducible set of rows.
Suppose you have a query like:
select a, b, agg(c)
from tbl
group by a
PostgreSQL will reject it because b is left unspecified in the group by statement. Run that in MySQL, by contrast, and it will be accepted. In the latter case, however, fire up a few inserts, updates and deletes, and the order of the rows on disk pages ends up different.
If memory serves, MySQL's implementation actually sorts by a, b and returns the first b in each set. But as far as the SQL standard is concerned, the behavior is unspecified -- and sure enough, PostgreSQL does not always sort before running aggregate functions.
Potentially, this might result in different values of b in result set in PostgreSQL. And thus, PostgreSQL yields an error unless you're more specific:
select a, b, agg(c)
from tbl
group by a, b
What Frank highlighted is that, in PostgreSQL 9.1, if a is the primary key, then you can leave b unspecified -- the planner has been taught to ignore subsequent GROUP BY fields when an applicable primary key implies a unique row.
For your problem in particular, you need to specify your GROUP BY as you currently do, plus every field that you select, i.e. "widgets"."id", "widgets"."user_id", [snip] -- but not the aggregate function calls such as sum(amount).
As an off-topic side note, I'm not sure how your ORM/model works, but the SQL it's generating isn't optimal. Many of those left outer joins look like they should be inner joins; that would let the planner pick an appropriate join order where applicable.
PostgreSQL version 9.1 (beta at this moment) might fix your problem, but only if there is a functional dependency on the primary key.
From the release notes:
Allow non-GROUP BY columns in the query target list when the primary key is specified in the GROUP BY clause (Peter Eisentraut)
Some other database systems already allowed this behavior, and because of the primary key, the result is unambiguous.
You could run a test and see if it fixes your problem. If you can wait for the production release, this can fix the problem without changing your code.
Firstly, simplify your life by storing all dates in a standard time zone. Converting dates between time zones should really be done in the view as a user convenience. This alone should save you a lot of pain.
If you're already in production write a migration to create a normalised_date column wherever it would be helpful.
I propose that the other problem here is the use of raw SQL, which Rails won't parse for you. To avoid this, try the gem called Squeel (aka MetaWhere 2): http://metautonomo.us/projects/squeel/
If you use this you should be able to remove hard coded SQL and let rails get back to doing its magic.
For example:
.select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
becomes (once you remove the need for normalising the date):
.select{sum(amount).as(total_amount)}
Sorry to answer my own question, but I figured it out.
First, let me apologize to those who thought I might be having an SQL or Postgres issue, it is not. The issue is with ActiveRecord and the SQL it is generating.
The answer is... use .joins instead of .includes. So I just changed the line in the top code and it works as expected.
class Widget < ActiveRecord::Base
  def self.in_company(company)
    joins(:store => {:area => :company}).where(:companies => {:id => company.id})
  end
end
I'm guessing that when using .includes, ActiveRecord tries to be smart about eager loading with JOINs, but it isn't smart enough for this particular case and generates that ugly SQL selecting every associated column.
However, all the replies have taught me quite a bit about Postgres that I did not know, so thank you very much.
Sorting by an explicit list of ids, in MySQL:
> ids = [11,31,29]
=> [11, 31, 29]
> Page.where(id: ids).order("field(id, #{ids.join(',')})")
In Postgres:
def self.order_by_ids(ids)
  order_by = ["case"]
  ids.each_with_index do |id, index|
    order_by << "WHEN id='#{id}' THEN #{index}"
  end
  order_by << "end"
  order(order_by.join(" "))
end

User.where(:id => [3,2,1]).order_by_ids([3,2,1]).map(&:id)
#=> [3,2,1]
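On PostgreSQL 9.5 and later, array_position can replace the CASE expression; a sketch (my addition, with to_i guarding the interpolation):

def self.order_by_ids(ids)
  # array_position returns each id's 1-based index within the given array
  order("array_position(ARRAY[#{ids.map(&:to_i).join(',')}], id)")
end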
