I have a query used for statistical purposes. It breaks down the number of users that have logged in a given number of times. User has_many installations, and installation has a login_count.
select total_login as 'logins', count(*) as `users`
from (select u.user_id, sum(login_count) as total_login
from user u
inner join installation i on u.user_id = i.user_id
group by u.user_id) g
group by total_login;
+--------+-------+
| logins | users |
+--------+-------+
| 2 | 3 |
| 6 | 7 |
| 10 | 2 |
| 19 | 1 |
+--------+-------+
Is there some elegant ActiveRecord-style find to obtain this same information? Ideally as a hash collection of logins and users: { 2=>3, 6=>7, ... }
I know I can use SQL directly, but I wanted to know how this could be solved in Rails 3.
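For reference, one way to get that hash in plain ActiveRecord before reaching for Arel (a sketch: the per-user sums happen in the database via Rails 3's group/sum calculations, and the final tally happens in Ruby):
# { user_id => total_login }, computed entirely by the database
per_user = Installation.group(:user_id).sum(:login_count)
# Tally into { total_login => number_of_users }, e.g. { 2 => 3, 6 => 7, ... }
histogram = per_user.values.each_with_object(Hash.new(0)) { |total, h| h[total] += 1 }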
# Our relation variables (RelVars)
U = Table(:user, :as => 'U')
I = Table(:installation, :as => 'I')
# perform operations on relations
G = U.join(I) # (implicit) will reference the final joined relationship
# (explicit) predicate = Arel::Predicates::Equality.new U[:user_id], I[:user_id]
G = U.join(I).on(U[:user_id].eq(I[:user_id]))
# Keep in mind you MUST PROJECT for this to make sense
G.project(U[:user_id], I[:login_count].sum.as('total_login'))
# Now you can group
G = G.group(U[:user_id])
# from this group you can project and group again (or group and project)
# for the final relation
TL = G.project(G[:total_login].as('logins'), G[:id].count.as('users')).group(G[:total_login])
Keep in mind this is VERY verbose because I wanted to show you the order of operations, not just "here is the code". It can actually be written in half as many lines.
The hairy part is Count()
As a rule, any attribute in the SELECT that is not used in an aggregate should appear in the GROUP BY so be careful with count()
Why would you group by the total_login count?
At the end of the day, I would simply ask why you don't just count the total logins across all installations, since the user information is made irrelevant by the outermost count grouping.
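That simpler aggregate is a one-liner (a sketch, assuming the question's models map to conventional table names):
# Single aggregate query: SELECT SUM(login_count) FROM installations
Installation.sum(:login_count)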
I don't think you'll find anything as efficient as having the db do the work. Remember that you don't want to have to retrieve the rows from the db, you want the db itself to compute the answer by grouping the data.
If you want to push the SQL further into the database, you can create the query as a view in the database and then use a Rails ActiveRecord class to retrieve the results.
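A sketch of that approach (the view name login_stats and the wrapper class are illustrative assumptions, not part of the original post):
# Assumed view, created once in the database:
#   CREATE VIEW login_stats AS
#     SELECT g.total_login AS logins, COUNT(*) AS users
#     FROM (SELECT u.user_id, SUM(i.login_count) AS total_login
#           FROM user u
#           JOIN installation i ON i.user_id = u.user_id
#           GROUP BY u.user_id) g
#     GROUP BY g.total_login;
class LoginStat < ActiveRecord::Base
  self.table_name = 'login_stats' # read-only model backed by the view above
end
# Collapse the rows into the desired hash:
LoginStat.all.each_with_object({}) { |r, h| h[r.logins] = r.users }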
In the end, IMO the SQL syntax is way more readable. This Arel stuff just slows me down whenever I need a tiny bit more complexity. It's just another syntax you have to learn; not worth it, in my opinion. I'd stick to SQL in these cases.
I'm confused about something in Rails (using Rails 5). I have this model
class MyEventActivity < ApplicationRecord
belongs_to :event_activity
end
and what I want to do is get a list of all the objects linked to it, in other words, all the "event_activity" objects. I thought this would do the trick
my_event_activities = MyEventActivity.all.pluck(:event_activity)
but it's giving me this SQL error:
(2.3ms) SELECT "event_activity" FROM "my_event_activities"
ActiveRecord::StatementInvalid: PG::UndefinedColumn: ERROR: column "event_activity" does not exist
LINE 1: SELECT "event_activity" FROM "my_event_activities"
How do I get the objects linked to the MyEventActivity objects? Note that I don't want just the IDs, I want the whole object.
Edit: this is the Postgres table, as requested:
eactivit=# \d event_activities;
Table "public.event_activities"
Column | Type | Modifiers
--------------------------+-----------------------------+----------------------------------------------------------------
id | integer | not null default nextval('event_activities_id_seq'::regclass)
name | character varying |
abbrev | character varying |
attendance | bigint |
created_at | timestamp without time zone | not null
updated_at | timestamp without time zone | not null
EventActivity.joins(:my_event_activities).distinct
Returns all EventActivity objects that have associated MyEventActivity records
Or more along the lines of what you've already tried:
EventActivity.where(id: MyEventActivity.all.pluck(:event_activity_id).uniq)
But the first one is preferable for its brevity and performance.
Update to explain why the first option should be preferred
TL;DR much faster and more readable
Assume we have 100 event_activities, and all but the last (id: 100) have 100 my_event_activities for a total of 9900 my_event_activities.
EventActivity.where(id: MyEventActivity.all.pluck(:event_activity_id).uniq) performs two SQL queries:
SELECT "my_event_activities"."event_activity_id" FROM "my_event_activities" which will return an Array of 9900 non-unique event_activity_ids. We want to reduce this to unique ids to optimize the second query, so we call Array#uniq which has its own performance cost on large arrays, reducing 9900 down to 99. Then we can call the second query: SELECT "event_activities".* FROM "event_activities" WHERE "event_activities"."id" IN (1, 2, 3, ... 97, 98, 99)
EventActivity.joins(:my_event_activities).distinct performs only one SQL query: SELECT DISTINCT "event_activities".* FROM "event_activities" INNER JOIN "my_event_activities" ON "my_event_activities"."event_activity_id" = "event_activities"."id". Once we drop into the database we never have to switch back to Ruby to perform some expensive process and then make a second trip back to the database. joins is designed for performing these types of chainable and composable queries in situations like this.
The performance difference can be checked with a simple benchmark. With an actual Postgres database loaded with 100 event_activities, 99 of which have 100 my_event_activities:
require 'benchmark/ips'
require_relative 'config/environment'
Benchmark.ips do |bm|
bm.report('joins.distinct') do
EventActivity.joins(:my_event_activities).distinct
end
bm.report('pluck.uniq') do
EventActivity.where(id: MyEventActivity.all.pluck(:event_activity_id).uniq)
end
bm.compare!
end
And the results:
Warming up --------------------------------------
joins.distinct 5.922k i/100ms
pluck.uniq 7.000 i/100ms
Calculating -------------------------------------
joins.distinct 71.504k (± 3.5%) i/s - 361.242k in 5.058311s
pluck.uniq 73.459 (±13.6%) i/s - 364.000 in 5.061892s
Comparison:
joins.distinct: 71503.9 i/s
pluck.uniq: 73.5 i/s - 973.38x slower
973x slower :-O ! The joins method is meant to be used for things just like this, and this is one of the happy cases in Ruby where more readable is also more performant.
For the analytics of my site, I'm required to extract the 4 states of my users.
@members = list.members.where(enterprise_registration_id: registration.id)
# This pulls roughly 10,000 records, which is evidently a huge data pull for Rails
# Member Load (155.5ms)
@invited = @members.where("user_id is null")
# Member Load (21.6ms)
@not_started = @members.where("enterprise_members.id not in (select enterprise_member_id from quizzes where quizzes.section_id IN (?)) AND enterprise_members.user_id in (select id from users)", @sections.map(&:id))
# Member Load (82.9ms)
@in_progress = @members.joins(:quizzes).where('quizzes.section_id IN (?) and (quizzes.completed is null or quizzes.completed = ?)', @sections.map(&:id), false).group("enterprise_members.id HAVING count(quizzes.id) > 0")
# Member Load (28.5ms)
@completes = Quiz.where(enterprise_member_id: registration.members, section_id: @sections.map(&:id)).completed
# Quiz Load (138.9ms)
The operation returns a 503, meaning my app gives up on the request. Any ideas how I can refactor this code to run faster? Maybe with better join syntax? I'm curious how sites with larger datasets accomplish what seem like such trivial DB calls.
The answer is your indexes. Check your Rails logs (or check the console in development mode) and copy the queries to your DB tool. Slap an EXPLAIN in front of the query and it will give you a breakdown. From there you can see what indexes you need to optimize the query.
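For example, taking the first query from the question (a sketch; paste the exact SQL from your own log, and the id value 42 here is made up):
EXPLAIN ANALYZE
SELECT "enterprise_members".*
FROM "enterprise_members"
WHERE "enterprise_members"."enterprise_registration_id" = 42;
-- A "Seq Scan" node in the plan suggests a missing index on enterprise_registration_id.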
For a quick pass, you should at least have these in your schema (a migration sketch follows below):
quizzes: needs an index on enterprise_member_id
enterprise_members: user_id
quizzes: section_id
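A sketch of the corresponding migration (table and column names are inferred from the queries in the question, so double-check them against your schema):
class AddAnalyticsIndexes < ActiveRecord::Migration
  def change
    add_index :quizzes, :enterprise_member_id
    add_index :quizzes, :section_id
    add_index :enterprise_members, :user_id
  end
end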
As someone else posted, definitely look into adding indexes if needed. How to refactor partly depends on what exactly you are trying to do with all these records. For the @members query, what are you using the records for? Do you really need to retrieve all attributes of every member record? If not, fetch only the attributes you actually use; .pluck usage could be warranted. The 3rd and 4th queries look fishy. I assume you've run the queries in a console? Again, not sure what the queries are being used for, but I'll toss in that it is often useful to write raw SQL and run it against the DB first; then you can apply your findings to rewriting the ActiveRecord queries.
What is the .completed tagged on the end? Is it supposed to be there? The only thing I found close in the Rails API is .completed?. If it is a custom method, definitely look into it. You potentially also have a use case for scopes (see the sketch below).
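A hypothetical illustration of the scopes idea (model and column names are assumptions based on the question):
class Quiz < ActiveRecord::Base
  # Name the recurring filters so call sites stay readable.
  scope :completed, -> { where(completed: true) }
  scope :for_sections, ->(section_ids) { where(section_id: section_ids) }
end
# Usage: Quiz.for_sections(@sections.map(&:id)).completed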
THIRD QUERY:
I unfortunately don't know Ruby on Rails, but from a PostgreSQL perspective, changing your "not in" to a left outer join should make it a little faster:
Your code:
enterprise_members.id not in (select enterprise_member_id from quizzes where quizzes.section_id IN (?)) AND enterprise_members.user_id in (select id from users)", @sections.map(&:id) )
Better version (in SQL):
select blah
from enterprise_members em
left outer join quizzes q
  on q.enterprise_member_id = em.id
  and q.section_id in (?)
join users u on u.id = em.user_id
where q.enterprise_member_id is null
Based on my understanding, this will allow Postgres to scan both the enterprise_members table and the quizzes table once and match them with a hash join. This is better than what it does now: finding everything in the quizzes subquery, bringing it into memory, and then trying to match it to enterprise_members. (Note that the section_id filter belongs in the join condition; in the WHERE clause it would throw away the unmatched rows the anti-join is looking for.)
FIRST QUERY:
You could also create a partial index on user_id for your first query. This will be especially good if there are a relatively small number of user_ids that are null in a large table. Partial index creation:
CREATE INDEX user_id_null_ix ON enterprise_members (user_id)
WHERE (user_id is null);
Anytime you query enterprise_members with something that matches the index's where clause, the partial index can be used and quickly limit the rows returned. See http://www.postgresql.org/docs/9.4/static/indexes-partial.html for more info.
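For instance, a filter like the question's "invited" query matches the index predicate exactly, so the planner can use the partial index:
SELECT * FROM enterprise_members WHERE user_id IS NULL;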
Thanks everyone for your ideas. I basically did what everyone said: I added indexes and reordered how I called everything, but the major difference was using the pluck method. Here are my new stats:
@alt_members = list.members.pluck :id # 23ms
if list.course.sections.tests.present? && @sections = list.course.sections.tests
  @quiz_member_ids = Quiz.where(section_id: @sections.map(&:id)).pluck(:enterprise_member_id) # 8.5ms
  @invited = list.members.count('user_id is null') # 12.5ms
  @not_started = (@alt_members - (@alt_members & @quiz_member_ids)).count # 0ms
  @in_progress = (@alt_members & @quiz_member_ids).count # 0ms
  @completes = (@alt_members & Quiz.where(section_id: @sections.map(&:id), completed: true).pluck(:enterprise_member_id)).count # 9.7ms
  @question_count = Quiz.where(section_id: @sections.map(&:id), completed: true).limit(5).map { |quiz| quiz.answers.count }.max # 3.5ms
end
I have a Rails 4 app using ActiveRecord and Postgresql with two tables: stores and open_hours. A store has many open_hours:
stores:
Column |
--------------------+
id |
name |
open_hours:
Column |
-----------------+
id |
open_time |
close_time |
store_id |
The open_time and close_time columns represent the number of seconds since midnight of Sunday (i.e. beginning of the week).
I would like to get list of store objects ordered by whether the store is open or not, so stores that are open will be ranked ahead of the stores that are closed. This is my query in Rails:
Store.joins(:open_hours).order("#{current_time} > open_time AND #{current_time} < close_time desc")
Note that current_time is the number of seconds since midnight on the previous Sunday.
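(For reference, a sketch of how that number could be computed with ActiveSupport; beginning_of_week(:sunday) is available in Rails 4:)
now = Time.current
# Seconds elapsed since midnight of this week's Sunday.
current_time = (now - now.beginning_of_week(:sunday)).to_i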
This gives me a list of stores with the currently open stores ranked ahead of the closed ones. However, I'm getting a lot of duplicates in the result.
I tried using the distinct, uniq and group methods, but none of them work:
Store.joins(:open_hours).group("stores.id").group("open_hours.open_time").group("open_hours.close_time").order("#{current_time} > open_time AND #{current_time} < close_time desc")
I've read a lot of the questions/answers already on Stack Overflow, but most of them don't address the order method. This question seems to be the most relevant one, but the MAX aggregate function does not work on booleans.
Would appreciate any help! Thanks.
Here is what I did to solve the issue:
In Rails:
is_open = "bool_or(#{current_time} > open_time AND #{current_time} < close_time)"
Store.select("stores.*, CASE WHEN #{is_open} THEN 1 WHEN #{is_open} IS NULL THEN 2 ELSE 3 END AS open").group("stores.id").joins("LEFT JOIN open_hours ON open_hours.store_id = stores.id").uniq.order("open asc")
Explanation:
The is_open variable is just there to shorten the select statement.
The bool_or aggregate function is needed here to group the open_hours records. Otherwise there will likely be two results for each store (one open and one closed), which is why using the uniq method alone doesn't eliminate the duplicate issue
LEFT JOIN is used instead of INNER JOIN so we can include the stores that don't have any open_hours objects
The store can be open (i.e. true), closed (i.e. false) or not determined (i.e. nil), so the CASE WHEN statement is needed here: if a store is open, then it's 1, 2 if not determined and 3 if closed
Ordering the results ASC will show open stores first, then the not determined ones, then the closed stores.
This solution works but doesn't feel very elegant. Please post your answer if you have a better solution. Thanks a lot!
Have you tried the uniq method? Just append it at the end:
Store.joins(:open_hours).order("#{current_time} > open_time AND #{current_time} < close_time desc").uniq
So, I'm doing something like:
user.students.includes(:exams).ungraded.
  paginate(:page => params[:page]).
  order("exams.created_at desc")
However, this causes a subtle problem. Somewhere in the guts of active record, the limit makes it do a distinct on the student id's, like this:
SELECT DISTINCT "students".id, exams.id AS alias_0 FROM "students"
LEFT OUTER JOIN "exams" ON "exams"."student_id" = "students"."id"
WHERE "students"."ready_for_grading" = 't' ORDER BY exams.id LIMIT 10 OFFSET 0;
However, this may cause results like:
id | alias_0
----+---------
42 | 256
42 | 257
42 | 260
See the problem? Eventually the limit kicks in and we don't fetch as many student ids as we were supposed to, because we've "used them up" by selecting both the student ids and the exam ids, even though we really only want the exam ids for ordering.
This is Rails 3.2.1, and PostgreSQL 9.1.
Edit
I think what is happening is that paginate is using the query to get a list of students, which it then feeds to a second query, but because of the left outer join, we're not getting distinct results for the students, so it 'underfills' the 10 slots we have and generally confuses things. I think this is a bug somewhere, but I'm not sure who to pin it on.
Ok, I finally figured it out:
user.students.joins("left outer join exams on exams.student_id = students.id").ungraded.
  paginate(:page => params[:page]).
  order("exams.created_at desc")
Seems to work. I'm not sure why this works out better than using 'includes', but it does.
When using a has_many association to manage a series of tags, what is the most efficient way to order/sort the collection by the number of tags selected?
For example:
Product can have many tags through ProductTags
When a user selects the tags, I would like to order the products by the number of the selected tags each product has.
Is it possible to use a counter_cache or something similar in this case? I'm not convinced using sort is the best option. Am I correct in thinking that using order on the actual database is generally faster than sort?
Clarification/update
Sorry if the above is confusing. Basically what I'm after is closer to ordering by relevancy. For example, a user might select tags 1, 2, and 4. If a product has all three tags associated with it, I want that product listed first. The second product might only have tags 1 & 4. And so on. I'm almost certain that this will have to use sort versus order, but was wondering if anyone has found a more efficient way of doing this.
Ordering by relevance within the database is both possible and far more efficient than using the sort method in Ruby. Assuming the following model structure and an appropriate underlying SQL table structure:
class Product < ActiveRecord::Base
  has_many :product_taggings
  has_many :product_tags, :through => :product_taggings
end

class ProductTag < ActiveRecord::Base
  has_many :product_taggings
  has_many :products, :through => :product_taggings
end

class ProductTagging < ActiveRecord::Base
  belongs_to :product
  belongs_to :product_tag
end
Querying for relevance in MySQL would look something like:
SELECT
`product_id`
,COUNT(*) AS relevance
FROM
`product_taggings` AS ptj
LEFT JOIN
`products` AS p
ON p.`id` = ptj.`product_id`
LEFT JOIN
`product_tags` AS pt
ON pt.`id` = ptj.`product_tag_id`
WHERE
pt.`name` IN ('Tag 1', 'Tag 2')
GROUP BY
`product_id`
If I have the following products and related tags:
Product 1 -> Tag 3
Product 2 -> Tag 1, Tag 2
Product 3 -> Tag 1, Tag 3
Then the query above should net me:
product_id | relevance
----------------------
2 | 2
3 | 1
* Product 1 is not included since there were no matches. Given that the user is performing a filtered search, this behavior is probably fine. There's a way to get Product 1 into the results with 0 relevance if necessary.
What you've done is create a nice little result set that can act as a sort of inline join table. In order to stick a relevance score onto each row of a query from your products table, use this query as a subquery as follows:
SELECT *
FROM
`products` AS p
,(SELECT
`product_id`
,COUNT(*) AS relevance
FROM
`product_taggings` AS ptj
LEFT JOIN
`products` AS p
ON p.`id` = ptj.`product_id`
LEFT JOIN
`product_tags` AS pt
ON pt.`id` = ptj.`product_tag_id`
WHERE
pt.`name` IN ('Tag 1', 'Tag 2')
GROUP BY `product_id`
) AS r
WHERE
p.`id` = r.`product_id`
ORDER BY
r.`relevance` DESC
What you'll have is a result set containing the fields from your products table and an additional relevance column at the end that will then be used in the ORDER BY clause.
You'll need to write up a method that will in-fill this query with your desired pt.name IN list. Be certain to sanitize that list before plugging it into the query or you'll open yourself up to possible SQL injection.
Take the result of your query assembling method and run it through Product.find_by_sql(my_relevance_sql) to get your models pre-sorted by relevance directly from the DB.
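As a hedged sketch of such a method (the name find_by_relevance is hypothetical; it leans on find_by_sql's bind-parameter expansion to quote each tag name, avoiding manual string interpolation):
class Product < ActiveRecord::Base
  # Products ordered by how many of the given tag names they match.
  # Table and column names follow the relevance query above.
  def self.find_by_relevance(tag_names)
    sql = <<-SQL
      SELECT p.*, r.relevance
      FROM products AS p,
           (SELECT ptj.product_id, COUNT(*) AS relevance
            FROM product_taggings AS ptj
            LEFT JOIN product_tags AS pt ON pt.id = ptj.product_tag_id
            WHERE pt.name IN (?)
            GROUP BY ptj.product_id) AS r
      WHERE p.id = r.product_id
      ORDER BY r.relevance DESC
    SQL
    find_by_sql([sql, tag_names])
  end
end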
The obvious down-side is that you introduce a DBMS-specific dependency into your Rails code (and risk SQL injection if you're not careful). If you're not using MySQL, the syntax might need to be adapted. However, it should perform much faster, especially on a huge result set, than using a Ruby sort on the results. Furthermore, adding a LIMIT clause will give you pagination support if needed.
Building on Ryan's excellent answer, I wanted a method that could be used with acts-as-taggable-on and similar plug-ins (tables called tags/taggings), and ended up with this:
def Product.find_by_tag_list(tag_list)
  tag_list_sql = "'" + tag_list.join("','") + "'"
  Product.find_by_sql(
    "SELECT * FROM products, " +
    "(SELECT taggable_id, COUNT(*) AS relevance FROM taggings " +
    "LEFT JOIN tags ON tags.id = taggings.tag_id " +
    "WHERE tags.name IN (" + tag_list_sql + ") " +
    "GROUP BY taggable_id) AS r " +
    "WHERE products.id = r.taggable_id " +
    "ORDER BY r.relevance DESC;"
  )
end
To get a list of related products ordered by relevance, I then can do:
Product.find_by_tag_list(my_product.tag_list)