I have a model of Widgets. Widgets belong to a Store model, which belongs to an Area model, which belongs to a Company. At the Company model, I need to find all associated widgets. Easy:
class Widget < ActiveRecord::Base
def self.in_company(company)
includes(:store => {:area => :company}).where(:companies => {:id => company.id})
end
end
Which will generate this beautiful query:
> Widget.in_company(Company.first).count
SQL (50.5ms) SELECT COUNT(DISTINCT "widgets"."id") FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1
=> 15088
Later, though, I need to use this scope inside a more complex scope. The problem is that AR expands the query to select every individual field, which fails in PG because every selected field must appear in the GROUP BY clause or be used in an aggregate function.
Here is the more complex scope.
def self.sum_amount_chart_series(company, start_time)
orders_by_day = Widget.in_company(company).archived.not_void.
where(:print_datetime => start_time.beginning_of_day..Time.zone.now.end_of_day).
group(pg_print_date_group).
select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
end
def self.pg_print_date_group
"CAST((print_datetime + interval '#{tz_offset_hours} hours') AS date)"
end
And this is the select it is throwing at PG:
> Widget.sum_amount_chart_series(Company.first, 1.day.ago)
SELECT "widgets"."id" AS t0_r0, "widgets"."user_id" AS t0_r1,<...BIG SNIP, YOU GET THE IDEA...> FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1 AND "widgets"."archived" = 't' AND "widgets"."voided" = 'f' AND ("widgets"."print_datetime" BETWEEN '2011-04-24 00:00:00.000000' AND '2011-04-25 23:59:59.999999') GROUP BY CAST((print_datetime + interval '-7 hours') AS date)
Which generates this error:
PGError: ERROR: column
"widgets.id" must appear in the
GROUP BY clause or be used in an
aggregate function LINE 1: SELECT
"widgets"."id" AS t0_r0,
"widgets"."user_id...
How do I rewrite the Widget.in_company scope so that AR does not expand the select query to include every Widget model field?
As Frank explained, PostgreSQL will reject any query which doesn't return a reproducible set of rows.
Suppose you've a query like:
select a, b, agg(c)
from tbl
group by a
PostgreSQL will reject it because b is left unspecified in the group by statement. Run that in MySQL, by contrast, and it will be accepted. In the latter case, however, fire up a few inserts, updates and deletes, and the order of the rows on disk pages ends up different.
If memory serves, MySQL's implementation actually sorts by a, b and returns the first b in each group. But as far as the SQL standard is concerned, the behavior is unspecified -- and sure enough, PostgreSQL does not always sort before running aggregate functions.
In PostgreSQL this could produce different values of b in the result set from one run to the next. So PostgreSQL raises an error unless you're more specific:
select a, b, agg(c)
from tbl
group by a, b
What Frank highlighted is that, in PostgreSQL 9.1, if a is the primary key, then you can leave b unspecified -- the planner has been taught to ignore subsequent group by fields when applicable primary keys imply a unique row.
For your problem in particular, you need to specify your group by as you currently do, plus every selected column that is not wrapped in an aggregate, i.e. "widgets"."id", "widgets"."user_id", [snip], but not expressions like sum(amount), which are the aggregate function calls.
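To see why b is ambiguous when grouping only by a, here is a minimal pure-Ruby sketch with no database involved. The Row struct, the sample rows, and the ambiguous_groups helper are all made up for illustration. The helper lists the group keys for which an ungrouped column holds more than one distinct value, which is the case PostgreSQL refuses to guess about.

```ruby
# Hypothetical in-memory stand-in for rows of "tbl"; not tied to any real schema.
Row = Struct.new(:a, :b, :c)

# Returns the group keys for which `column` has more than one distinct value,
# i.e. the groups where "select a, b ... group by a" has no single answer for b.
def ambiguous_groups(rows, group_key, column)
  rows.group_by(&group_key)
      .select { |_key, members| members.map(&column).uniq.size > 1 }
      .keys
end

rows = [
  Row.new(1, "x", 10),
  Row.new(1, "y", 20),
  Row.new(2, "z", 5)
]

p ambiguous_groups(rows, :a, :b) # the group a = 1 has two candidate values for b
p ambiguous_groups(rows, :a, :a) # grouping column itself is never ambiguous
```

Adding b to the GROUP BY, or wrapping it in an aggregate, removes the ambiguity because each output row then has exactly one value to report.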
As an off-topic side note, I'm not sure how your ORM/model works, but the SQL it's generating isn't optimal. Many of those left outer joins seem like they should be inner joins. Inner joins let the planner pick an appropriate join order where applicable.
PostgreSQL version 9.1 (beta at this moment) might fix your problem, but only if there is a functional dependency on the primary key.
From the release notes:
Allow non-GROUP BY columns in the
query target list when the primary key
is specified in the GROUP BY clause
(Peter Eisentraut)
Some other database system already
allowed this behavior, and because of
the primary key, the result is
unambiguous.
You could run a test and see if it fixes your problem. If you can wait for the production release, this can fix the problem without changing your code.
Firstly simplify your life by storing all dates in a standard time-zone. Changing dates with time-zones should really be done in the view as a user convenience. This alone should save you a lot of pain.
If you're already in production write a migration to create a normalised_date column wherever it would be helpful.
I propose that the other problem here is the use of raw SQL, which Rails can't inspect or adjust for you. To avoid this, try the gem called Squeel (aka MetaWhere 2): http://metautonomo.us/projects/squeel/
If you use this you should be able to remove hard coded SQL and let rails get back to doing its magic.
For example:
.select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
becomes (once you remove the need for normalising the date):
.select{sum(amount).as(total_amount)}
Sorry to answer my own question, but I figured it out.
First, let me apologize to those who thought I might be having an SQL or Postgres issue. I am not. The issue is with ActiveRecord and the SQL it is generating.
The answer is... use .joins instead of .includes. So I just changed the line in the top code and it works as expected.
class Widget < ActiveRecord::Base
def self.in_company(company)
joins(:store => {:area => :company}).where(:companies => {:id => company.id})
end
end
I'm guessing that when using .includes, ActiveRecord is trying to be smart and use JOINS in the SQL, but it's not smart enough for this particular case and was generating that ugly SQL to select all associated columns.
However, all the replies have taught me quite a bit about Postgres that I did not know, so thank you very much.
sort in mysql:
> ids = [11,31,29]
=> [11, 31, 29]
> Page.where(id: ids).order("field(id, #{ids.join(',')})")
in postgres:
def self.order_by_ids(ids)
  order_by = ["case"]
  # Assumes integer ids; Integer() rejects anything else before it reaches the SQL
  ids.each_with_index do |id, index|
    order_by << "WHEN id=#{Integer(id)} THEN #{index}"
  end
  order_by << "end"
  order(order_by.join(" "))
end
User.where(:id => [3,2,1]).order_by_ids([3,2,1]).map(&:id)
#=> [3,2,1]
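The CASE-building step can also be pulled out into a plain function that only produces the SQL fragment, which makes it easy to check without a database. This is a sketch, not part of the answer above: the name order_by_ids_sql and the column keyword are assumptions, and it assumes integer ids.

```ruby
# Builds a "CASE <column> WHEN ... END" fragment that sorts rows in the order
# the ids were given. Hypothetical helper; assumes integer primary keys.
def order_by_ids_sql(ids, column: "id")
  raise ArgumentError, "ids must not be empty" if ids.empty?

  # Integer() raises on non-numeric input, so nothing unsafe is interpolated.
  whens = ids.each_with_index.map { |id, index| "WHEN #{Integer(id)} THEN #{index}" }
  "CASE #{column} #{whens.join(' ')} END"
end

puts order_by_ids_sql([3, 2, 1])
# prints: CASE id WHEN 3 THEN 0 WHEN 2 THEN 1 WHEN 1 THEN 2 END
```

The resulting string could then be handed to order, for example order(order_by_ids_sql(ids)). Newer Rails versions may ask for raw SQL in order to be wrapped with Arel.sql.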
I'm running the following query in Rails 5, with the goal of finding the user with the most Pitches:
User
.select("users.*, COUNT(user_id) as pitch_count")
.unscoped
.joins("LEFT JOIN pitches AS pitches ON pitches.user_id = users.id")
.group("pitch.user_id")
.order("pitch_count DESC")
.limit(5)
But I'm getting the error:
Caused by PG::UndefinedColumn: ERROR: column "pitch_count" does not exist
Why isn't the query orderable by pitch_count?
The problem is the unscoped method. It removes all previously defined scopes, including the select statement. See the following example:
User.select(:full_name, :email).unscoped.to_sql
# => SELECT "users".* FROM "users"
User.unscoped.select(:full_name, :email).to_sql
# => SELECT "users"."full_name", "users"."email" FROM "users"
See the difference? Calling unscoped after select completely removes everything defined in the select.
For you this means that you should modify your code to call unscoped right after the model name:
User
.unscoped
.select("users.*, COUNT(user_id) as pitch_count")
.joins("LEFT JOIN pitches AS pitches ON pitches.user_id = users.id")
.group("pitch.user_id")
.order("pitch_count DESC")
.limit(5)
Note: the line breaks are there mostly for readability, and the code works like this in your Ruby files. If you want to execute it in the Rails console, you will have to remove the line breaks.
By the way, you might still get an error that column "users.id" must appear in the GROUP BY clause or be used in an aggregate function. It should be fixed by changing the group statement to use users.id instead of pitch.user_id:
.group("users.id")
I suggest you use counter_cache to make it easy to maintain and good for performance as well. By adding counter cache, you can get the user record with most pitches by User.reorder(pitches_count: :desc).first.
So here's the lay of the land:
I have an Applicant model which has_many Lead records.
I need to group leads by applicant email, i.e. for each specific applicant email (there may be 2+ applicant records with the same email) I need to get a combined list of leads.
I already have this working using an in-memory / N+1 solution.
I want to do this in a single query, if possible. Right now I'm running one for each lead which is maxing out the CPU.
Here's my attempt right now:
Lead.
all.
select("leads.*, applicants.*").
joins(:applicant).
group("applicants.email").
having("count(*) > 1").
limit(1).
to_a
And the error:
Lead Load (1.2ms) SELECT leads.*, applicants.* FROM "leads" INNER
JOIN "applicants" ON "applicants"."id" = "leads"."applicant_id"
GROUP BY applicants.email HAVING count(*) > 1 LIMIT 1
ActiveRecord::StatementInvalid: PG::GroupingError: ERROR: column
"leads.id" must appear in the GROUP BY clause or be used in an
aggregate function
LINE 1: SELECT leads.*, applicants.* FROM "leads" INNER JOIN
"appli...
This is a Postgres-specific issue: the selected fields must appear in the GROUP BY clause.
must appear in the GROUP BY clause or be used in an aggregate function
You can try this
Lead.joins(:applicant)
.select('leads.*, applicants.email')
.group('applicants.email, leads.id, ...')
You will need to list all the fields in leads table in the group by clause (or all the fields that you are selecting).
I would just get all the records and do the grouping in memory. If you have a lot of records, I would paginate them or batch them.
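group_by_email = Hash.new { |h, k| h[k] = [] }
# find_in_batches yields arrays of up to 10,000 records at a time
Applicant.eager_load(:leads).find_in_batches(batch_size: 10_000) do |batch|
  batch.each do |applicant|
    # concat keeps one flat list of leads per email
    group_by_email[applicant.email].concat(applicant.leads)
  end
end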
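The grouping logic itself can be tried out without a database. The sketch below uses plain structs in place of models and each_slice in place of database batching. ApplicantRow, the sample data, and leads_by_email are hypothetical names introduced only for this illustration.

```ruby
# Hypothetical stand-in for an Applicant record with its leads already loaded.
ApplicantRow = Struct.new(:email, :leads)

# Merges the leads of all applicants that share an email into one list per email.
# each_slice mimics processing the records in batches.
def leads_by_email(applicants, batch_size: 2)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  applicants.each_slice(batch_size) do |batch|
    batch.each { |applicant| grouped[applicant.email].concat(applicant.leads) }
  end
  grouped
end

applicants = [
  ApplicantRow.new("a@example.com", [:lead1, :lead2]),
  ApplicantRow.new("b@example.com", [:lead3]),
  ApplicantRow.new("a@example.com", [:lead4])
]

p leads_by_email(applicants)
```

To match the having("count(*) > 1") condition from the question, the result could be filtered afterwards, for example with select { |_email, leads| leads.size > 1 }.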
You need to use a .where rather than using Lead.all. The reason it is maxing out the CPU is you are trying to load every lead into memory at once. That said I guess I am still missing what you actually want back from the query so it would be tough for me to help you write the query. Can you give more info about your associations and the expected result of the query?
For the analytics of my site, I'm required to extract the 4 states of my users.
@members = list.members.where(enterprise_registration_id: registration.id)
# This pulls roughly 10,0000 records.. Which is evidently a huge data pull for Rails
# Member Load (155.5ms)
@invited = @members.where("user_id is null")
# Member Load (21.6ms)
@not_started = @members.where("enterprise_members.id not in (select enterprise_member_id from quizzes where quizzes.section_id IN (?)) AND enterprise_members.user_id in (select id from users)", @sections.map(&:id) )
# Member Load (82.9ms)
@in_progress = @members.joins(:quizzes).where('quizzes.section_id IN (?) and (quizzes.completed is null or quizzes.completed = ?)', @sections.map(&:id), false).group("enterprise_members.id HAVING count(quizzes.id) > 0")
# Member Load (28.5ms)
@completes = Quiz.where(enterprise_member_id: registration.members, section_id: @sections.map(&:id)).completed
# Quiz Load (138.9ms)
The operation returns a 503 meaning my app gives up on the request. Any ideas how I can refactor this code to run faster? Maybe by better joins syntax? I'm curious how sites with larger datasets accomplish what seems like such trivial DB calls.
The answer is your indexes. Check your rails logs (or check the console in development mode) and copy the queries to your db tool. Slap an "Explain" in front of the query and it will give you a breakdown. From here you can see what indexes you need to optimize the query.
For a quick pass, you should at least have these in your schema,
quizzes: needs an index on enterprise_member_id
enterprise_members: user_id
quizzes: section_id
As someone else posted, definitely look into adding indexes if needed. How to refactor depends partly on what exactly you are trying to do with all these records. For the @members query, what are you using the @members records for? Do you really need to retrieve all attributes for every member record? If you are not using every attribute, I suggest fetching only the attributes you actually use; .pluck could be warranted. The 3rd and 4th queries look fishy. I assume you've run the queries in a console? Again, I'm not sure what the queries are being used for, but it is often useful to write raw SQL and run it against the database first. Then you can apply your findings when rewriting the ActiveRecord queries.
What is the .completed tacked on the end? Is it supposed to be there? The only close thing I found in the Rails API is .completed?. If it is a custom method, definitely look into it. You potentially also have a use case for scopes.
THIRD QUERY:
I unfortunately don't know ruby on rails, but from a postgresql perspective, changing your "not in" to a left outer join should make it a little faster:
Your code:
enterprise_members.id not in (select enterprise_member_id from quizzes where quizzes.section_id IN (?)) AND enterprise_members.user_id in (select id from users)", @sections.map(&:id) )
Better version (in SQL):
select blah
from enterprise_members em
join users u on u.id = em.user_id
left outer join quizzes q
  on q.enterprise_member_id = em.id
  and q.section_id in (?)
where q.enterprise_member_id is null
Based on my understanding, this will allow Postgres to sort both the enterprise_members table and the quizzes table and do a hash join. This is better than what it does now. Right now it finds everything in the quizzes subquery, brings it into memory, and then tries to match it to enterprise_members.
FIRST QUERY:
You could also create a partial index on user_id for your first query. This will be especially good if there are a relatively small number of user_ids that are null in a large table. Partial index creation:
CREATE INDEX user_id_null_ix ON enterprise_members (user_id)
WHERE (user_id is null);
Anytime you query enterprise_members with something that matches the index's where clause, the partial index can be used and quickly limit the rows returned. See http://www.postgresql.org/docs/9.4/static/indexes-partial.html for more info.
Thanks everyone for your ideas. I basically did what everyone said: I added indexes and rearranged how I called everything, but the major difference was using the pluck method. Here are my new stats:
@alt_members = list.members.pluck :id # 23ms
if list.course.sections.tests.present? && @sections = list.course.sections.tests
  @quiz_member_ids = Quiz.where(section_id: @sections.map(&:id)).pluck(:enterprise_member_id) # 8.5ms
  @invited = list.members.where(user_id: nil).count # 12.5ms
  @not_started = ( @alt_members - ( @alt_members & @quiz_member_ids ) ).count # 0ms
  @in_progress = ( @alt_members & @quiz_member_ids ).count # 0ms
  @completes = ( @alt_members & Quiz.where(section_id: @sections.map(&:id), completed: true).pluck(:enterprise_member_id) ).count # 9.7ms
  @question_count = Quiz.where(section_id: @sections.map(&:id), completed: true).limit(5).map{|quiz|quiz.answers.count}.max # 3.5ms
end
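Once the ids have been plucked, the rest is plain array set arithmetic, which can be sketched without Rails. The function name member_status_counts and its keyword arguments are assumptions for this example. Note one deliberate difference from the snippet above: here in_progress excludes members who already completed, whereas the original counts every member that has a quiz.

```ruby
# Computes status counts from three arrays of member ids.
# Array#& (intersection) and Array#- (difference) both return de-duplicated
# or filtered arrays, so repeated quiz rows for one member are counted once.
def member_status_counts(member_ids:, quiz_member_ids:, completed_member_ids:)
  started   = member_ids & quiz_member_ids
  completed = member_ids & completed_member_ids
  {
    not_started: (member_ids - started).size,
    in_progress: (started - completed).size, # assumption: started but not completed
    completed:   completed.size
  }
end

counts = member_status_counts(
  member_ids:           [1, 2, 3, 4, 5],
  quiz_member_ids:      [2, 3, 3, 9], # 9 belongs to another list, 3 has two quizzes
  completed_member_ids: [3]
)
p counts
```

This trades a few heavier SQL queries for cheap in-memory work, which is reasonable as long as the id arrays stay small enough to hold in memory.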
I'm trying to query my db for records that are similar to the currently viewed record (based on taggings). I have that working, but I would like to randomize the order.
My development environment is MySQL, so I would do something like:
@tattoos = Tattoo.tagged_with(tags, :any => true).order("RAND()").limit(6)
which works, but my production environment is Heroku, which uses PostgreSQL, so I tried this:
@tattoos = Tattoo.tagged_with(tags, :any => true).order("RANDOM()").limit(6)
but I get the following error:
ActionView::Template::Error (PGError: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
SELECT DISTINCT tattoos.* FROM "tattoos" JOIN taggings
tattoos_taggings_color_fantasy_newschool_nerdy_tv_477 ON
tattoos_taggings_color_fantasy_newschool_nerdy_tv_477.taggable_id = tattoos.id AND
tattoos_taggings_color_fantasy_newschool_nerdy_tv_477.taggable_type = 'Tattoo' WHERE
(tattoos_taggings_color_fantasy_newschool_nerdy_tv_477.tag_id = 3 OR
tattoos_taggings_color_fantasy_newschool_nerdy_tv_477.tag_id = 4 OR
tattoos_taggings_color_fantasy_newschool_nerdy_tv_477.tag_id = 5 OR
tattoos_taggings_color_fantasy_newschool_nerdy_tv_477.tag_id = 24 OR
tattoos_taggings_color_fantasy_newschool_nerdy_tv_477.tag_id = 205) ORDER BY RANDOM() LIMIT 6):
After analyzing the query more closely, I have to correct my first draft. The query would require a DISTINCT or GROUP BY the way it is.
The (possibly) duplicate tattoos.* come from first joining to (possibly) multiple rows in the table taggings. Your query engine then tries to get rid of such duplicates again by using DISTINCT - in a syntactically illegal way.
DISTINCT basically sorts the resulting rows by the resulting columns from left to right and picks the first for each set of duplicates. That's why the leftmost ORDER BY columns have to match the SELECT list.
MySQL is more permissive and allows the non-standard use of DISTINCT, but PostgreSQL throws an error.
ORMs often produce inefficient SQL statements (they are just crutches after all). However, if you use appropriate PostgreSQL libraries, such an illegal statement shouldn't be produced to begin with. I am no Ruby expert, but something's fishy here.
The query is also very ugly and inefficient.
There are several ways to fix it. For instance:
SELECT *
FROM (<query without ORDER BY and LIMIT>) x
ORDER BY RANDOM()
LIMIT 6
Or, better yet, rewrite the query with this faster, cleaner alternative doing the same:
SELECT ta.*
FROM tattoos ta
WHERE EXISTS (
SELECT 1
FROM taggings t
WHERE t.taggable_id = ta.id
AND t.taggable_type = 'Tattoo'
AND t.tag_id IN (3, 4, 5, 24, 205)
)
ORDER BY RANDOM()
LIMIT 6;
You'll have to implement it in Ruby yourself.
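As a starting point for the Ruby side, the EXISTS query can be generated by a small function that takes the tag ids. This is only a sketch: similar_tattoos_sql is a made-up name, the table and column names are copied from the query above, and it assumes integer tag ids.

```ruby
# Builds the EXISTS-based query as a string. Hypothetical helper; table and
# column names are taken from the SQL shown above.
def similar_tattoos_sql(tag_ids, limit: 6)
  # Integer() raises on non-numeric input, so only numbers are interpolated.
  ids = tag_ids.map { |id| Integer(id) }
  raise ArgumentError, "tag_ids must not be empty" if ids.empty?

  <<~SQL
    SELECT ta.*
    FROM tattoos ta
    WHERE EXISTS (
      SELECT 1
      FROM taggings t
      WHERE t.taggable_id = ta.id
        AND t.taggable_type = 'Tattoo'
        AND t.tag_id IN (#{ids.join(', ')})
    )
    ORDER BY RANDOM()
    LIMIT #{Integer(limit)}
  SQL
end

puts similar_tattoos_sql([3, 4, 5, 24, 205])
```

In a Rails app the string could be passed to Tattoo.find_by_sql, at the cost of bypassing whatever query the tagging gem would otherwise build.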
Not sure about the random, as it should work.
But take note of http://railsforum.com/viewtopic.php?id=36581
which has code that might suit you:
/lib/agnostic_random.rb
module AgnosticRandom
def random
case DB_ADAPTER
when "mysql" then "RAND()"
when "postgresql" then "RANDOM()"
end
end
end
/initializers/extend_ar.rb (name doesn't matter)
ActiveRecord::Base.extend AgnosticRandom
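One weakness of the case statement above is that it silently returns nil for any adapter it does not list. A variation using a lookup table that fails loudly might look like the sketch below. RANDOM_FUNCTIONS and random_function_for are hypothetical names, and the adapter strings are assumed to match the adapter entry in database.yml.

```ruby
# Maps adapter names to the SQL function that returns a random value.
# The keys are assumptions based on common database.yml adapter values.
RANDOM_FUNCTIONS = {
  "mysql"      => "RAND()",
  "mysql2"     => "RAND()",
  "postgresql" => "RANDOM()",
  "sqlite3"    => "RANDOM()"
}.freeze

# Returns the random function for the given adapter, raising for unknown ones
# instead of handing nil to order().
def random_function_for(adapter)
  RANDOM_FUNCTIONS.fetch(adapter.to_s.downcase) do |name|
    raise ArgumentError, "no random function known for adapter #{name.inspect}"
  end
end

puts random_function_for("postgresql") # prints: RANDOM()
puts random_function_for(:mysql2)      # prints: RAND()
```

This only picks the function name. It does not address the SELECT DISTINCT error from the question, which comes from the shape of the generated query rather than from the choice of RAND() or RANDOM().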
Can anyone explain this?
Project.includes([:user, :company])
This executes 3 queries, one to fetch projects, one to fetch users for those projects and one to fetch companies.
Project.select("name").includes([:user, :company])
This executes 3 queries, and completely ignores the select bit.
Project.select("user.name").includes([:user, :company])
This executes 1 query with proper left joins. And still completely ignores the select.
It would seem to me that rails ignores select with includes. Ok fine, but why when I put a related model in select does it switch from issuing 3 queries to issuing 1 query?
Note that the 1 query is what I want, I just can't imagine this is the right way to get it, nor do I understand why it works, but I'm not sure how else to get the results in one query (.joins seems to only use INNER JOIN, which I do not in fact want, and when I manually specify the join conditions to .joins, the search gem we're using freaks out as it tries to re-add joins with the same name).
I had the same problem with select and includes.
For eager loading of associated models I used native Rails scope 'preload' http://apidock.com/rails/ActiveRecord/QueryMethods/preload
It provides eager load without skipping of 'select' at scopes chain.
I found it here https://github.com/rails/rails/pull/2303#issuecomment-3889821
Hope this tip will be helpful for someone as it was helpful for me.
All right, so here's what I came up with...
.joins("LEFT JOIN companies companies2 ON companies2.id = projects.company_id LEFT JOIN project_types project_types2 ON project_types2.id = projects.project_type_id LEFT JOIN users users2 ON users2.id = projects.user_id") \
.select("six, fields, I, want")
Works, pain in the butt but it gets me just the data I need in one query. The only lousy part is I have to give everything a model2 alias since we're using meta_search, which seems to not be able to figure out that a table is already joined when you specify your own join conditions.
Rails has always ignored the select argument(s) when using include or includes. If you want to use your select argument then use joins instead.
You might be having a problem with the query gem you're talking about but you can also include sql fragments using the joins method.
Project.select("name").joins(['some sql fragment for users', 'left join companies c on c.id = projects.company_id'])
I don't know your schema so I'd have to guess at the exact relationships, but this should get you started.
I might be totally missing something here but select and include are not a part of ActiveRecord. The usual way to do what you're trying to do is like this:
Project.find(:all, :select => "users.name", :include => [:user, :company], :joins => "LEFT JOIN users on projects.user_id = users.id")
Take a look at the api documentation for more examples. Occasionally I've had to go manual and use find_by_sql:
Project.find_by_sql("select users.name from projects left join users on projects.user_id = users.id")
Hopefully this will point you in the right direction.
I wanted that functionality myself, so please use it.
Include this method in your class
#ACCEPTS args in string format "ASSOCIATION_NAME:COLUMN_NAME-COLUMN_NAME"
def self.includes_with_select(*m)
association_arr = []
m.each do |part|
parts = part.split(':')
association = parts[0].to_sym
select_columns = parts[1].split('-')
association_macro = (self.reflect_on_association(association).macro)
association_arr << association.to_sym
class_name = self.reflect_on_association(association).class_name
self.send(association_macro, association, -> {select *select_columns}, class_name: "#{class_name.to_sym}")
end
self.includes(*association_arr)
end
And you will be able to call like: Contract.includes_with_select('user:id-name-status', 'confirmation:confirmed-id'), and it will select those specified columns.
The preload solution doesn't seem to do the same JOINs as eager_load and includes, so to get the best of all worlds I also wrote my own, and released it as a part of a data-related gem I maintain, The Brick.
By overriding ActiveRecord::Associations::JoinDependency.apply_column_aliases() like this, a .select(...) that you add can act as a filter to choose which column aliases get built out.
With gem 'brick' loaded, in order to enable this selective behaviour, add the special column name :_brick_eager_load as the first entry in your .select(...), which turns on the filtering of columns while the aliases are being built out. Here's an example:
Employee.includes(orders: :order_details)
.references(orders: :order_details)
.select(:_brick_eager_load,
'employees.first_name', 'orders.order_date', 'order_details.product_id')
Because foreign keys are essential to have everything be properly associated, they are automatically added, so you do not need to include them in your select list.
Hope it can save you both query time and some RAM!