Avoiding duplicate joins in Rails ActiveRecord query - ruby-on-rails

I have a scenario where I have SQL joins somewhere in the query chain, and then at some further point I need to append a condition which needs the same join, but I don't know at this point whether that join exists already in the scope. For example:
@foo = Foo.joins("INNER JOIN foos_bars ON foos_bars.foo_id = foos.id")
....
@foo.joins(:bars).where(bars: { id: 1 })
This will produce an SQL error about duplicate table/alias names.
The reason I write the SQL join manually in the first instance is efficiency: the classic Rails Arel join will produce two INNER JOINs, whereas in my case I only need one.
Is there a recommended way around this? Some way to inspect the joins currently in a scope for example.
Response to comment:
With a has_and_belongs_to_many relationship Rails produces two INNER JOINS like this:
SELECT "journals".* FROM "journals"
INNER JOIN "categories_journals"
ON "categories_journals"."journal_id" = "journals"."id"
INNER JOIN "categories"
ON "categories"."id" = "categories_journals"."category_id"
WHERE "categories"."id" = 1
Whereas I believe I can do this instead:
SELECT "journals".* FROM "journals"
INNER JOIN "categories_journals"
ON "categories_journals"."journal_id" = "journals"."id"
WHERE "categories_journals"."category_id" = 1
Correct me if I'm wrong.

The solution was to universally use string joins. Unbeknownst to me, Rails actually de-duplicates (uniqs) string joins, so as long as the strings are identical this problem doesn't occur.
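To illustrate why the strings have to be identical, here is a minimal plain-Ruby sketch of the idea. It is not Rails' internals: `ToyScope` and its methods are hypothetical names invented for this illustration, and the sketch only models join fragments being compared as exact strings.

```ruby
# Hypothetical toy model (not ActiveRecord): collects JOIN fragments and
# drops exact duplicates when rendering the SQL.
class ToyScope
  attr_reader :join_fragments

  def initialize(table, join_fragments = [])
    @table = table
    @join_fragments = join_fragments
  end

  # Returns a new scope, the way a chainable query method would.
  def joins(fragment)
    ToyScope.new(@table, @join_fragments + [fragment])
  end

  def to_sql
    (["SELECT #{@table}.* FROM #{@table}"] + @join_fragments.uniq).join(" ")
  end
end

join_sql = "INNER JOIN foos_bars ON foos_bars.foo_id = foos.id"

# The same string twice collapses into a single join.
same_sql = ToyScope.new("foos").joins(join_sql).joins(join_sql).to_sql

# A fragment that differs only in letter case is a different string,
# so it is kept and the table ends up joined twice.
different_sql = ToyScope.new("foos").joins(join_sql).joins(join_sql.downcase).to_sql

puts same_sql
puts different_sql
```

In this toy model, any difference in whitespace or case between two fragments is enough to defeat the de-duplication, which is why keeping the join string in one shared constant or scope is a reasonable precaution.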
This article put me on the scent, the author exhibits the exact same problem as me and patches Rails, and it looks like the patch was implemented a long time ago. I don't think it's perfect though. There should be a way for rails to handle hash parameter joins and string joins and not bomb out when they overlap. Might see if I can patch that ..
EDIT:
I did a couple of benchmarks to see if I was really worrying about nothing or not (between the two ways of joining):
1.9.3p194 :008 > time = Benchmark.realtime { 1000.times { a = Incident.joins("INNER JOIN categories_incidents ON categories_incidents.incident_id = incidents.id").where("categories_incidents.category_id = 1") } }
=> 0.042458
1.9.3p194 :009 > time = Benchmark.realtime { 1000.times { a = Incident.joins(:categories).where(categories: { id: 1 }) } }
=> 0.152703
I'm not a regular benchmarker, so my benchmarks may not be perfect. Note in particular that neither block loads the relation, so these timings cover building the query rather than running it against the database. Even so, it looks to me as though my more efficient way could make a real-world difference over large queries or lots of queries.
The downside of joining the way I have done is that if a Category no longer exists but is still recorded in the join table, the query will still match it, which might cause problems that the more thorough join would avoid.

Related

Rails: How to remove n+1 query when we need to query association inside loop?

My code builds its output from queries inside a loop (only a basic version is shown here).
Basically, I need the sum of the custom line items as well as the sum of all line items:
results = Order.includes(:customer, :line_items).where('completed_at IS NOT NULL')
results.each do |result|
custom_items_sum = result.line_items.where(line_item_type: 'custom').sum(:amount)
total_sum = result.line_items.sum(:amount)
end
This code has an n+1 query issue. I have tried adding includes, but that doesn't help because there is another query inside the loop. Any help will be appreciated.
If you don't want to trigger other queries in the loop, you need to avoid methods that work on relations and use ones that work on collections. Try:
custom_items_sum = result.line_items.
select { |line_item| line_item.line_item_type == 'custom' }.
sum(&:amount)
This should work without n+1 queries.
Note that it's possible to write just one query and avoid this computation anyway but that's beyond the scope of your question :)
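As a plain-Ruby illustration of the collection-based approach, the sketch below computes both sums from an array that stands in for an already loaded association. The `LineItem` struct and the sample amounts are hypothetical; with real models the array would be `result.line_items` after `includes(:line_items)`.

```ruby
# Hypothetical stand-in for a loaded ActiveRecord model.
LineItem = Struct.new(:line_item_type, :amount)

# Stand-in for an association that has already been loaded into memory.
line_items = [
  LineItem.new("custom", 10),
  LineItem.new("standard", 5),
  LineItem.new("custom", 3)
]

# Enumerable methods work on the in-memory objects and issue no queries.
custom_items_sum = line_items
  .select { |item| item.line_item_type == "custom" }
  .sum(&:amount)
total_sum = line_items.sum(&:amount)

puts "custom: #{custom_items_sum}, total: #{total_sum}"
```

The trade-off is that every line item is instantiated in Ruby, so this suits moderate result sets better than very large ones.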
Rails was never known to be robust enough as an ORM. Use plain SQL instead:
results =
Order.connection.execute <<-SQL
SELECT orders.id,
SUM(line_items.amount) AS total_sum,
-- conditional aggregate: only custom items contribute to this sum
SUM(CASE WHEN line_items.line_item_type = 'custom' THEN line_items.amount ELSE 0 END) AS custom_items_sum
FROM orders
JOIN line_items
ON (line_items.order_id = orders.id)
WHERE orders.completed_at IS NOT NULL
GROUP BY orders.id
SQL
That way you’ll get all the intermediate sums in a single query, which is way faster than performing all the calculations in ruby.
Just because @AlekseiMatiushkin says to write it in raw SQL, let's do the same with Rails:
order_table = Order.arel_table
line_items_table = LineItem.arel_table
custom_items = Arel::Table.new(:custom_items)
Order.select(
order_table[Arel.star],
line_items_table[:amount].sum.as('total_sum'),
custom_items[:amount].sum.as('custom_items_sum')
).joins(
order_table.join(line_items_table).on(
line_items_table[:order_id].eq(order_table[:id])
).join(
Arel::Nodes::As.new(line_items_table,:custom_items),
Arel::Nodes::OuterJoin
).on(
custom_items[:order_id].eq(order_table[:id]).and(
custom_items[:line_item_type].eq('custom')
)
).join_sources
).where(
order_table[:completed_at].not_eq(nil)
).group(:id)
This will produce an ActiveRecord::Relation of Order objects with the virtual attributes total_sum and custom_items_sum, using the following query:
SELECT
orders.*,
SUM(line_items.amount) AS total_sum,
SUM(custom_items.amount) AS custom_items_sum
FROM
orders
INNER JOIN line_items ON line_items.order_id = orders.id
LEFT OUTER JOIN line_items AS custom_items ON custom_items.order_id = orders.id
AND custom_items.line_item_type = 'custom'
WHERE
orders.completed_at IS NOT NULL
GROUP BY
orders.id
This should handle the request in a single query by using 2 joins to aggregate the needed data.
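One thing worth checking with a double join like this is row multiplication: each line item row is paired with every matching custom item row of the same order, so both sums can be inflated when an order has more than one custom item. The plain-Ruby sketch below simulates the two joins on made-up data (all values hypothetical) and compares the result with a single-pass conditional sum, which corresponds to a `SUM(CASE WHEN ... END)` over one join.

```ruby
# Hypothetical sample data: one order with three line items, two of them custom.
line_items = [
  { order_id: 1, type: "custom",   amount: 10 },
  { order_id: 1, type: "custom",   amount: 20 },
  { order_id: 1, type: "standard", amount: 5 }
]
custom_items = line_items.select { |item| item[:type] == "custom" }

# Simulate: line_items LEFT OUTER JOIN custom_items ON the same order_id.
joined = line_items.flat_map do |item|
  matches = custom_items.select { |custom| custom[:order_id] == item[:order_id] }
  (matches.empty? ? [nil] : matches).map { |custom| [item, custom] }
end

# Sums computed over the joined rows, as the double-join query would do.
double_join_total  = joined.sum { |item, _custom| item[:amount] }
double_join_custom = joined.sum { |_item, custom| custom ? custom[:amount] : 0 }

# Single pass with a conditional, the equivalent of one join plus CASE WHEN.
total_sum        = line_items.sum { |item| item[:amount] }
custom_items_sum = line_items.sum { |item| item[:type] == "custom" ? item[:amount] : 0 }

puts "double join: total=#{double_join_total}, custom=#{double_join_custom}"
puts "single pass: total=#{total_sum}, custom=#{custom_items_sum}"
```

If the two approaches disagree on real data, the conditional aggregate is the safer choice.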
Try to use the scoping block. The following code generates very clean SQL queries.
Order.includes(:line_items).where.not(completed_at: nil).scoping do
@custom_items_sum = Order.where(line_items: { line_item_type: 'custom' })
.sum(:amount)
@total_sum = Order.sum(:amount)
end
There's not much documentation about the scoping block, but it scopes your model to the ActiveRecord relation built before it (here: where.not(completed_at: nil), with :line_items included).
Hope this helps! :)

How can I combine COUNT(*) for two different ActiveRecord relations into a single SQL query?

With two different ActiveRecord relation objects, is there a way to issue one SQL query to compare the counts of the relations?
eg. say I have two ActiveRecord::Relation objects like this:
posts = Post.where().where().some_scope
users = User.where().some_other_scope.where().joins(:something)
To compare the counts of each relation, I'd have to issue two SQL queries.
posts.count == users.count
# => SELECT COUNT(*) FROM posts WHERE... ;
# => SELECT COUNT(*) FROM users INNER JOIN ... WHERE... ;
I want to be able to issue just one query. Something like:
Post.select("COUNT(first) == COUNT(second) as are_equal", posts, users).are_equal
It is not possible to combine two counts over two different tables into one query unless you use a UNION, which will run the two separate queries and merge the results. This will take about the same time as running the two queries separately, except that you only go to the db server once (1 query), but you lose readability. So IMHO I really wonder if that is worth it.
E.g. in the one case you can write
if posts.count == users.count
In the other case one would write:
count_sql = <<-SQL
select 'Posts' as count_type, count(*) from posts where ...
union all
select 'Users' as count_type, count(*) from users where ...
SQL
result = Post.connection.execute(count_sql)
if result[0]["count"] == result[1]["count"]
You will have to decide whether the performance improvement outweighs the loss of readability.
This isn't possible with ActiveRecord query methods, but the underlying Arel query builder (which ActiveRecord uses internally) can achieve this. It just looks a bit less elegant:
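Because a UNION without an ORDER BY does not guarantee row order, one way to make the comparison less fragile is to look the counts up by their label instead of by position. A minimal plain-Ruby sketch, assuming the result rows arrive as hashes keyed by column name; the sample rows and the `counts_by_type` helper are hypothetical:

```ruby
# Hypothetical helper: turns rows like {"count_type" => "Posts", "count" => "42"}
# into a hash of label => integer count.
def counts_by_type(rows)
  rows.each_with_object({}) do |row, counts|
    counts[row["count_type"]] = row["count"].to_i
  end
end

# Made-up rows standing in for the database result; note the order is
# deliberately not the order in which the SELECTs were written.
rows = [
  { "count_type" => "Users", "count" => "42" },
  { "count_type" => "Posts", "count" => "42" }
]

counts = counts_by_type(rows)
counts_equal = counts.fetch("Posts") == counts.fetch("Users")

puts counts_equal
```

`fetch` raises a KeyError if a label is missing, which surfaces a malformed result instead of silently comparing nil values.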
posts = Post.where().where().some_scope
users = User.where().some_other_scope.where().joins(:something)
posts_table = Post.arel_table
users_table = User.arel_table
posts_count = Arel::Nodes::Count.new([posts_table[:id]]).as('count')
users_count = Arel::Nodes::Count.new([users_table[:id]]).as('count')
union = posts.select(posts_count).arel.union(users.select(users_count).arel)
post_count, user_count = Post.from(posts_table.create_table_alias(union, :posts)).map(&:count)
Although it may not actually be beneficial in this case (as discussed in other answers), it's worth being aware of Arel because there are times where it is useful - I always try to avoid raw SQL in my Rails applications and Arel makes that possible.
An excellent introduction can be found here: https://danshultz.github.io/talks/mastering_activerecord_arel/#/
You can always write your own SQL query.
Let's say you have two models, AdminUser and Company. One way of doing what you want would be the following:
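ActiveRecord::Base.connection.execute("SELECT COUNT(*) as nb from admin_users UNION ALL SELECT COUNT(*) as nb from companies;").to_a
You'll end up with an array of two hashes, each containing the number of records of one database table. UNION ALL is used because a plain UNION removes duplicate rows, so two equal counts would collapse into a single row.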

Is there any way to lessen the impact of this request on my database?

For the analytics of my site, I'm required to extract the 4 states of my users.
@members = list.members.where(enterprise_registration_id: registration.id)
# This pulls roughly 10,0000 records.. Which is evidently a huge data pull for Rails
# Member Load (155.5ms)
@invited = @members.where("user_id is null")
# Member Load (21.6ms)
@not_started = @members.where("enterprise_members.id not in (select enterprise_member_id from quizzes where quizzes.section_id IN (?)) AND enterprise_members.user_id in (select id from users)", @sections.map(&:id) )
# Member Load (82.9ms)
@in_progress = @members.joins(:quizzes).where('quizzes.section_id IN (?) and (quizzes.completed is null or quizzes.completed = ?)', @sections.map(&:id), false).group("enterprise_members.id HAVING count(quizzes.id) > 0")
# Member Load (28.5ms)
@completes = Quiz.where(enterprise_member_id: registration.members, section_id: @sections.map(&:id)).completed
# Quiz Load (138.9ms)
The operation returns a 503 meaning my app gives up on the request. Any ideas how I can refactor this code to run faster? Maybe by better joins syntax? I'm curious how sites with larger datasets accomplish what seems like such trivial DB calls.
The answer is your indexes. Check your rails logs (or check the console in development mode) and copy the queries to your db tool. Slap an "Explain" in front of the query and it will give you a breakdown. From here you can see what indexes you need to optimize the query.
For a quick pass, you should at least have these in your schema,
enterprise_members: needs an index on enterprise_member_id
members: user_id
quizzes: section_id
As someone else posted, definitely look into adding indexes if needed. How to refactor depends partly on what exactly you are trying to do with all these records. For the @members query, what are you using the @members records for? Do you really need to retrieve all attributes for every member record? If you are not using every attribute, I suggest fetching only the attributes you actually use; .pluck could be warranted. The 3rd and 4th queries look fishy. I assume you've run the queries in a console? Again, I'm not sure what the queries are being used for, but it is often useful to write raw SQL first and run it directly against the db. Then you can apply your findings when rewriting the ActiveRecord queries.
What is the .completed tagged on the end? Is it supposed to be there? The only thing I found close to it in the Rails API is .completed?. If it is a custom method, definitely look into it. You potentially also have a use case for scopes.
THIRD QUERY:
I unfortunately don't know ruby on rails, but from a postgresql perspective, changing your "not in" to a left outer join should make it a little faster:
Your code:
enterprise_members.id not in (select enterprise_member_id from quizzes where quizzes.section_id IN (?)) AND enterprise_members.user_id in (select id from users)", @sections.map(&:id) )
Better version (in SQL):
select blah
from enterprise_members em
join users u on u.id = em.user_id
left outer join quizzes q on q.enterprise_member_id = em.id
and q.section_id in (?)
where q.enterprise_member_id is null
Based on my understanding, this will allow Postgres to sort both the enterprise_members table and the quizzes table and do a hash join. This is better than what it does now: right now it finds everything in the quizzes subquery, brings it into memory, and then tries to match it to enterprise_members. Note that the section filter belongs in the join condition; in the where clause it would discard the unmatched rows the left join is there to keep.
FIRST QUERY:
You could also create a partial index on user_id for your first query. This will be especially good if there are a relatively small number of user_ids that are null in a large table. Partial index creation:
CREATE INDEX user_id_null_ix ON enterprise_members (user_id)
WHERE (user_id is null);
Anytime you query enterprise_members with something that matches the index's where clause, the partial index can be used and quickly limit the rows returned. See http://www.postgresql.org/docs/9.4/static/indexes-partial.html for more info.
Thanks everyone for your ideas. I basically did what everyone said: I added indexes and reordered how I called everything, but the major difference was using the pluck method. Here are my new stats:
@alt_members = list.members.pluck :id # 23ms
if list.course.sections.tests.present? && @sections = list.course.sections.tests
@quiz_member_ids = Quiz.where(section_id: @sections.map(&:id)).pluck(:enterprise_member_id) # 8.5ms
# where(user_id: nil) filters the rows; count('user_id is null') would not
@invited = list.members.where(user_id: nil).count # 12.5ms
@not_started = (@alt_members - (@alt_members & @quiz_member_ids)).count # 0ms
@in_progress = (@alt_members & @quiz_member_ids).count # 0ms
@completes = (@alt_members & Quiz.where(section_id: @sections.map(&:id), completed: true).pluck(:enterprise_member_id)).count # 9.7ms
@question_count = Quiz.where(section_id: @sections.map(&:id), completed: true).limit(5).map { |quiz| quiz.answers.count }.max # 3.5ms
end
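The array arithmetic is what replaces the heavier SQL here. A small plain-Ruby sketch with made-up ids (all values hypothetical) shows what `&` (intersection) and `-` (difference) compute on plucked id arrays:

```ruby
# Hypothetical ids standing in for the plucked columns.
member_ids      = [1, 2, 3, 4, 5]
quiz_member_ids = [2, 3, 3, 9] # may contain duplicates and ids outside the list
completed_ids   = [3]

# Intersection: members that have at least one quiz (duplicates are dropped).
started = member_ids & quiz_member_ids

# Difference: members with no quiz at all.
not_started = member_ids - started

# Intersection with the completed quiz owners.
completed = member_ids & completed_ids

puts "started=#{started.count} not_started=#{not_started.count} completed=#{completed.count}"
```

This keeps the database work to a few narrow `pluck` queries, at the cost of holding the id arrays in memory.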

Rails + ActiveRecord + optimization: Is there a better way to update on 300,000 records?

So I have a rake task that does this:
wine_club_memberships = WineClubMembership.pluck(:billing_info_id)
total_updated = BillingInfo.joins(:order).where(["orders.ordered_date < (CURRENT_DATE - 90) AND billing_infos.card_number IS NOT NULL AND billing_infos.card_number != '' AND billing_infos.id NOT IN (?)", wine_club_memberships]).update_all("card_number = ''")
log.error("Total records updated #{total_updated}")
The thing is that BillingInfo has 300,000+ records, and I'm wondering if all this joins, where, update_all is just the same as using pure SQL. Currently it's not too efficient, since I have a huge array of WineClubMembership records that I stuff in the statement.
Is there a more efficient way of doing this? Even though this is a long ugly statement, I was thinking that it would be efficient for the most part because it does everything pretty much in one or two hits to the database. However, people around me are thinking there must be other "Rails methods" that could do this in a better way that won't affect the performance of the production website.
I did see doing searches in "batches" but I am not sure if that will help.
UPDATE
I'm using Postgres 9.1+. In the old (just a little simpler) version of my activerecord search, This is what came out:
Ruby code:
wine_club_memberships = WineClubMembership.pluck(:billing_info_id)
total_updated = BillingInfo.joins(:order).where(["orders.ordered_date < (CURRENT_DATE - 90) AND billing_infos.id NOT IN (?)", wine_club_memberships]).update_all("card_number = ''")
SQL generated:
SQL (127848.6ms) UPDATE "billing_infos" SET card_number = '' WHERE "billing_infos"."id" IN (SELECT "billing_infos"."id" FROM "billing_infos" INNER JOIN "orders" ON "orders"."id" = "billing_infos"."order_id" WHERE (orders.ordered_date < (CURRENT_DATE - 90) AND billing_infos.id NOT IN (423908,390663,387323,402393,383446,416114,391009,456371,384305,386681,384382,384418, ...)))
It's possible that if you let your db manage the source of the final NOT IN comparison, it can apply its own optimizations for dealing with it. In other words, let SQL manage the list of ids instead of passing it a 300,000-item array. If your db allows it, try something like:
... NOT IN (SELECT billing_info_id FROM wine_club_memberships)").update_all("card_number = ''")
As far as a Rails specific method for speeding this up, you're usually not going to be able to do better (performance-wise, if not maintainability-wise) than just passing a pure sql string to the dbs.

Find all records that don't have any of an associated model

I'm using Rails 3.2.
I have a product model, and a variant model. A product can have many variants. A variant can belong to many products.
I want to make a lookup on the Products model, to find only products that have a specific variant count, like such (pseudocode):
Product.where("Product.variants.count == 0")
How do you do this with activerecord?
You can use a LEFT OUTER JOIN to return the records that you need. Rails issues a LEFT OUTER JOIN when you use includes.
For example:
Product.includes(:variants).where('variants.id' => nil)
That will return all products where there are no variants. You can also use an explicit joins.
Product.joins('LEFT OUTER JOIN variants ON variants.product_id = products.id').where('variants.id' => nil)
The LEFT OUTER JOIN will return records on the left side of the join, even if the right side is not present. It will place null values into the associated columns, which you can then use to check negative presence, as I did above. You can read more about left joins here: http://www.w3schools.com/sql/sql_join_left.asp.
The good thing about this solution is that you're not doing subqueries as a conditional, which will most likely be more performant.
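To make the LEFT OUTER JOIN reasoning concrete, here is a minimal plain-Ruby simulation on made-up rows (the hashes below are hypothetical sample data, not ActiveRecord objects). Every product is kept, a missing match becomes nil, and filtering on that nil leaves the products without variants.

```ruby
# Hypothetical sample rows.
products = [{ id: 1, name: "A" }, { id: 2, name: "B" }, { id: 3, name: "C" }]
variants = [
  { id: 10, product_id: 1 },
  { id: 11, product_id: 1 },
  { id: 12, product_id: 3 }
]

variants_by_product = variants.group_by { |variant| variant[:product_id] }

# LEFT OUTER JOIN: each product appears at least once; when there is no
# matching variant, the right side of the pair is nil.
joined = products.flat_map do |product|
  matches = variants_by_product.fetch(product[:id], [nil])
  matches.map { |variant| [product, variant] }
end

# WHERE variants.id IS NULL: keep only the rows whose right side is missing.
without_variants = joined
  .select { |_product, variant| variant.nil? }
  .map { |product, _variant| product[:name] }

puts without_variants.inspect
```

Note that a product with several variants appears once per variant in the joined rows, which is harmless for the IS NULL filter but matters if the same join is reused for counting.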
products = Product.find(:all, :select => 'variant').select { |product| product.variants.count > 10 }
This is Rails 2.3 syntax, but only the ActiveRecord part is version-specific; the part to look at is the select block.
I don't know of any ActiveRecord way to do this but the following should help with your problem. The good thing about this solution is that everything's done on the db side.
Product.where('(SELECT COUNT(*) FROM variants WHERE variants.product_id = products.id) > 0')
If you want to pull products which have a specific non-0 number of variants, you could do that with something like this (admittedly untested):
Product.select('products.id, products.attr1_of_interest, ... products.attrN_of_interest, COUNT(*)')
.joins('INNER JOIN variants ON products.id = variants.product_id')
.group('products.id, products.attr1_of_interest, ... products.attrN_of_interest')
.having('COUNT(*) = 5') # or whatever comparison you want to make here
If you want to allow for products with 0 variants, you would have to use Sean's solution above.
