ActiveRecord querying a tree structure efficiently - ruby-on-rails

I have inherited a Rails 3 app that stores much of its data as a fairly sophisticated tree structure. The application works pretty well in general, but we are noticing some problems with performance, mostly around database interactions.
I am seeing a lot of queries along the lines of these showing up in my logs:
SELECT `messages`.* FROM `messages` WHERE `messages`.`context_type` = 'Node' AND `messages`.`context_id` IN (153740, 153741, /* about a thousand records... */ 154837, 154838, 154839, 154840, 154841, 154842, 154843)
Followed by many single queries where it looks as though the same record is being queried time and again:
CACHE (0.0ms) SELECT `plans`.* FROM `plans` WHERE `plans`.`type` IN ('Scenario') AND `plans`.`id` = 1435 LIMIT 1
My log has that exact query roughly eighty times. I'm guessing the CACHE prefix means it is being served from ActiveRecord's query cache rather than going back to the database every time, but it still looks like a lot, and this sort of thing happens repeatedly.
I am guessing that the above queries are an association being pulled out backwards: message belongs_to plan, and the app is loading all the messages and then pulling out the plan for each one rather than, as one might do in a sane world, starting with the plan and then loading all of its messages.
Working in this vein, a single request contains 1641 SELECT statements, and it seems very likely to me that the sheer volume of database traffic (not to mention the number of sequential LIMIT 1 queries for neighbouring data in the same table) is a significant bottleneck. I am reluctant to post too much code, but this is a typical example of one of the larger queries:
def nodes
  include_objects = [:sector, :market, :messages, :node_user_prefs, :reference_node, :project,
                     { :available_events => :available_event_nodes }, :customer_factors,
                     :active_components, :tags, { :event_histories => :node },
                     { :event_histories => :user }]
  project = self
  @cached_nodes ||= begin
    all_nodes = orig_nodes.includes(include_objects)
    all_nodes = all_nodes.map { |n| n.tap { |i| i.cached_project = project } }
    all_node_ids = all_nodes.map(&:id)
    all_nodes.select { |n| n.type == 'InitialNode' || all_node_ids.include?(n.parent_node_id) }
  end
end
Obviously, the queries are pretty diverse and the application is large, but this is fairly representative of the standard approach taken.
What are the easy wins with ActiveRecord that I can use to try and speed things up? I can easily enough put together a query that would pull all the required data out in a single round trip, but how easy would it be to form that - redundancies and all - into my model hierarchy? What other techniques can I use to help out here?

Ancestry Gem
Not a direct fix by any means, but you may wish to consider the ancestry gem.
This gives you a way to build a tree structure whereby you can load single records & then have their descendants loaded as you wish, which will cut back on your SQL queries.
If you set up your nodes / objects in this fashion, you can fetch the records you require once & ancestry will populate the rest. If you want me to divulge more information on this, please let me know in the comments & I'll detail more specifics.
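For illustration only, here is a rough sketch of what an ancestry-backed Node might look like (the table and column names are assumptions based on the question):

# Gemfile
gem 'ancestry'

# migration (table name assumed)
class AddAncestryToNodes < ActiveRecord::Migration
  def self.up
    add_column :nodes, :ancestry, :string
    add_index  :nodes, :ancestry
  end

  def self.down
    remove_index  :nodes, :ancestry
    remove_column :nodes, :ancestry
  end
end

# app/models/node.rb
class Node < ActiveRecord::Base
  has_ancestry
end

# a single query then fetches a node together with every descendant:
root = Node.find(some_id)
root.subtree   # root plus all of its descendants, no per-child queries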

Related

Improving an Active Record / PostgreSQL Query Further

Following up on my question here, I'm trying to improve a search further. We first search a replays table (searching 2k records), then get the unique players associated with that table (10 per replay, so 20k records) and render JSON. This is done in the controller; the search reads as:
def index
  @replays = Replay.includes(:players).where(map_id: params['map_id'].to_i).order(id: :desc).limit(2000)
  render json: @replays[0..2000].to_json(include: [:players])
end
The performance:
Completed 200 OK in 254032ms (Views: 34.1ms | ActiveRecord: 20682.4ms)
The actual Active Record search reads as:
Replay Load (80.4ms) SELECT "replays".* FROM "replays" WHERE "replays"."map_id" = $1 ORDER BY "replays"."id" DESC LIMIT $2 [["map_id", 1], ["LIMIT", 2000]]
Player Load (20602.0ms) SELECT "players".* FROM "players" WHERE "players"."replay_id" IN (117217...
This mostly works, but still takes an exceptional amount of time. Is there a way to improve performance?
You're getting bitten by this issue https://postgres.cz/wiki/PostgreSQL_SQL_Tricks_I#Predicate_IN_optimalization
I found a note on pg_performance about an optimisation possibility for the IN predicate when the list of values is longer than eighty numbers. For longer lists it is better to create constant subqueries using multi-row VALUES:
SELECT *
FROM tab
WHERE x IN (1,2,3,..n); -- n > 70
-- faster case
SELECT *
FROM tab
WHERE x IN (VALUES(10),(20));
Using VALUES is faster for a larger number of items, so don't use it for small sets of values.
Basically, SELECT * FROM tab WHERE x IN (1, 2, ...) with a long list of values is very slow. It's ridiculously faster if you can convert it to a VALUES list, like SELECT * FROM tab WHERE x IN (VALUES (1), (2), ...).
Unfortunately, since this is happening inside Active Record, it's a little tricky to exert control over the query. One option is to avoid the includes call entirely: manually construct the SQL to load all your child records, then build up the associations yourself.
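As a rough sketch of that manual route, assuming the Replay / Player models from the question (the association wiring here is just a hash lookup rather than ActiveRecord's own preloader):

replays = Replay.where(map_id: params['map_id'].to_i).order(id: :desc).limit(2000).to_a

# build the VALUES list by hand instead of letting includes emit IN (id, id, ...)
# (guard against an empty replays list in real code)
values_sql = replays.map { |r| "(#{r.id.to_i})" }.join(',')
players    = Player.where("replay_id IN (VALUES #{values_sql})")

# attach the children manually rather than through the association
players_by_replay = players.group_by(&:replay_id)
payload = replays.map do |replay|
  replay.as_json.merge('players' => (players_by_replay[replay.id] || []).map(&:as_json))
end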
Alternatively, you can monkey patch active record. Here's what I've done on rails 4.2, in an initializer.
module PreloaderPerformance
  private

  def query_scope(ids)
    if ids.count > 100
      type = klass.columns_hash[association_key_name.to_s].sql_type
      values_list = ids.map do |id|
        if id.kind_of?(Integer)
          " (#{id})"
        elsif type == "uuid"
          " ('#{id.to_s}'::uuid)"
        else
          " ('#{id.to_s}')"
        end
      end.join(",")
      scope.where("#{association_key_name} in (VALUES #{values_list})")
    else
      super
    end
  end
end

module ActiveRecord
  module Associations
    class Preloader
      class Association #:nodoc:
        prepend PreloaderPerformance
      end
    end
  end
end
Doing this I've seen a 50x speed-up of some of my queries, with no issues as of yet. Note that it's not fully battle-tested, and I bet it will have some issues if your association uses an unusual data type for the foreign key relationship. In my database, I only use uuids or integers for associations. The usual caveats about monkey patching core Rails behaviour apply.
I know find_each can be used to batch queries, which might lighten the memory load here. Could you try out the following and see how it impacts upon the time?
Replay.where(map_id: params['map_id'].to_i).includes(:players).find_each(batch_size: 100).map do |replay|
  replay.to_json(include: :players)
end
I'm not sure this will work. It might be that the mapping negates the benefits of batching: there are certainly more queries, but it will use less memory as it doesn't need to hold all 20k+ records at a time.
Have a play and see how it looks - fiddle with the batch size too, see how that affects things.
There's a caveat in that you can't apply a limit, so bear that in mind.
I'm sure someone else'll come up with a far slicker solution, but hope this might help in the meantime. If it's awful when you check the speed, let me know and I'll delete this answer :)

What is one way that I can reduce .includes association query?

I have an extremely slow query that looks like this:
people = includes({ project: [{ best_analysis: :main_language }, :logo] }, :name, name_fact: :primary_language)
.where(name_id: limit(limit).unclaimed_people(opts))
Look at the includes method call and notice that it is loading a huge number of associations. In the RailsSpeed book, there is the following quote:
“For example, consider this:
Car.all.includes(:drivers, { parts: :vendors }, :log_messages)
How many ActiveRecord objects might get instantiated here?
The answer is:
# Cars * ( avg # drivers/car + avg log messages/car + average parts/car * ( average parts/vendor) )
Each eager load increases the number of instantiated objects, and in turn slows down the query. If these objects aren't used, you're potentially slowing down the query unnecessarily. Note how nested eager loads (parts and vendors in the example above) can really increase the number of objects instantiated.
Be careful with nesting in your eager loads, and always test with production-like data to see if includes is really speeding up your overall performance.”
The book fails to mention what could be a good substitute for this though. So my question is what sort of technique could I substitute for includes?
Before I jump to the answer: I don't see you using any pagination or limit on the query, which alone may help quite a lot.
Unfortunately, there isn't really a direct substitute, and if you use all of the objects in a view, that's okay. There is one possible alternative to includes, though. It is quite complex, but it can still be helpful sometimes: you join all the needed tables, select only the fields you actually use, alias them, and access them as a flat structure.
Something like
(NOTE: it uses arel helpers. You need to include ArelHelpers::ArelTable in models where you use syntax like NameFact[:id])
relation.joins(name_fact: :primary_language).select(
  NameFact[:id].as('name_fact_id'),
  PrimaryLanguage[:language].as('primary_language')
)
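If that works, the aliases come back as plain read-only attributes on each record, so the flat rows can be used directly (a small usage sketch):

rows = relation.joins(name_fact: :primary_language).select(
  NameFact[:id].as('name_fact_id'),
  PrimaryLanguage[:language].as('primary_language')
)
rows.first.name_fact_id      # aliased column, read straight off the joined row
rows.first.primary_language  # likewise; no NameFact or PrimaryLanguage objects are built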
I'm not sure it will work for your case, but that's the only alternative I know.
I have an extremely slow query that looks like this
There are a couple of potential causes:
Too many unnecessary objects fetched and created. From your comment, it looks like that is not the case and you need all the data that is being fetched.
DB indexes not optimised. Check the time taken by the query, EXPLAIN the generated query (check the logs to get the query, or use .to_sql), and make sure it is not doing a table scan or other costly operations, as sketched below.
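A small sketch of that second point (Person here is a stand-in for whatever model the scope above is defined on):

relation = Person.includes(project: [:logo]).where(name_id: some_ids)

puts relation.to_sql    # the SQL ActiveRecord will run for the primary table
puts relation.explain   # Rails 3.2+: runs EXPLAIN against the database and prints the plan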

How do you order an array by a connected integer in Ruby on Rails?

I'm creating a most popular activity section for user profiles. I have no difficulty pulling questions through the user_id, but I'm having trouble pulling and then ordering by the associated integer, question.votes.size. This is probably a simple question, but how do I sort and then limit the output to 3? How do I do this without lagging the database? There will eventually be a lot of votes to be counted. Should this be a named_scope?
@user_id = User.find_by_username(params[:username]).id
questions = Question.find(:all, :conditions => { :user_id => @user_id })
I wanted to pop in and suggest another way that is, perhaps, a bit more native RoR. :-)
@user = User.find_by_username(params[:username], :include => [{ :questions => :votes }])
@sorted_questions = @user.questions.sort { |q1, q2| q2.votes.length <=> q1.votes.length }
This has a number of advantages:
1) No SQL written, maintains DB portability, easier to read(?)
2) Relieves DB of sort compute, should scale better
and a couple of disadvantages:
1) Works Ruby harder, higher latency at low loads, less efficient on single box
2) Moves more data, potentially mitigates advantage #2
Ideally, you'd want to look at ActiveRecord's counter cache functionality. It automatically caches relationship counts by denormalizing the child row count into the parent table. For it to work, all child row manipulation must occur via the parent object, but that's Rails best practice in any case.
Having a votes counter cache in questions would eliminate the need to reference the votes table in the query. Doing this, and sorting in Ruby, might be the ideal situation from both performance and code esthetic points of view.
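A rough sketch of what that could look like (the votes_count column follows the Rails naming convention; everything else is taken from the question):

# migration: add_column :questions, :votes_count, :integer, :default => 0

class Vote < ActiveRecord::Base
  belongs_to :question, :counter_cache => true   # keeps questions.votes_count in sync
end

class Question < ActiveRecord::Base
  has_many :votes
end

# existing rows can be backfilled with Question.reset_counters(question.id, :votes)

# sorting in Ruby then touches the votes table not at all:
@sorted_questions = @user.questions.sort_by { |q| -q.votes_count }.first(3)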
Finally, I have to admit punting on Rails 3's very cool relational algebra stuff. With it, it's likely that this could be a super readable one-liner that generates optimal SQL. How cool is that going to be? :-)
I'm assuming that each question has many votes.
You can do this by using
Question.find_by_sql("
  SELECT questions.*, COUNT(votes.id) AS vote_count
  FROM questions
  LEFT JOIN votes ON questions.id = votes.question_id
  GROUP BY questions.id
  ORDER BY vote_count DESC
")
or something roughly equivalent to that (I didn't test it)
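Scoped to the user and capped at three, as the original question asks, it might look something like this (the WHERE and LIMIT are my additions, equally untested):

@top_questions = Question.find_by_sql(["
  SELECT questions.*, COUNT(votes.id) AS vote_count
  FROM questions
  LEFT JOIN votes ON questions.id = votes.question_id
  WHERE questions.user_id = ?
  GROUP BY questions.id
  ORDER BY vote_count DESC
  LIMIT 3", @user_id])

# the aggregate comes back as a read-only attribute on each record
@top_questions.first.vote_count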

Does Ruby on Rails "has_many" array provide data on a "need to know" basis?

On Ruby on Rails, say, if the Actor model object is Tom Hanks, and the "has_many" fans is 20,000 Fan objects, then
actor.fans
gives an Array with 20,000 elements. Presumably the elements are not pre-populated with values? Otherwise, fetching them all from the DB would be extremely time consuming.
So it is on a "need to know" basis?
So does it pull data when I access actor.fans[500], and again when I access actor.fans[0]? If it jumps from record to record, it won't be able to optimise performance by doing a sequential read, which can be faster on the hard disk because those records could be in nearby sectors. For example, if the program touches 2 random elements, it will be faster just to read those 2 records; but if it touches all the elements in random order, it may be faster to read all the records sequentially and then process them in that random order. How will RoR know whether I am touching only a few random elements or all of them?
Why would you want to fetch 50,000 records if you only use 2 of them? Then fetch only those two from the DB. If you want to list the fans, then you will probably use pagination, i.e. use limit and offset in your query, or some pagination gem like will_paginate.
I see no logical explanation for why you should go the way you're trying to. Explain a real situation so we can help you.
However, there is one thing you need to know while loading many associated objects from the DB: use :include, like
Actor.all(:include => :fans)
this will eager-load all the fans so there will be only 2 queries instead of N+1, where N is the number of actors
Look at the SQL which is spewed out by the server in development mode, and that will tell you how many fan records are being loaded. In this case actor.fans will indeed cause them all to be loaded, which is probably not what you want.
You have several options:
Use a paginator as suggested by Tadas;
Set up another association with the fans table that pulls in just the ones you're interested in. This can be done either with a :conditions option on the has_many declaration, e.g.
has_many :fans, :conditions => "country of residence = 'UK'"
Specifying the full SQL to narrow down the rows returned with the :finder_sql option
Specifying the :limit option, which will, well, limit the number of rows returned (see the sketch after this list).
All depends on what you want to do.
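Combining the conditions and :limit approaches, a hedged sketch (the country column is purely an assumption):

class Actor < ActiveRecord::Base
  has_many :fans
  # a narrower, capped association alongside the full one:
  has_many :uk_fans, :class_name  => 'Fan',
                     :conditions  => { :country_of_residence => 'UK' },
                     :limit       => 100
end

actor.uk_fans   # one query, at most 100 rows, instead of loading all 20,000 fans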

Question about Ruby on Rails, Constants, belongs_to & Database Optimization/Performance

I've developed a web-based point of sale system for one of my clients in Ruby on Rails with a MySQL backend. These guys are growing so fast that they are ringing up close to 10,000 transactions per day corporate-wide. For this question, I will use the transactions table as an example. Currently, I store transactions.status as a string (i.e. 'pending', 'completed', 'incomplete') in a varchar(255) field that has an index. In the beginning it was fine, when looking up records by different statuses, as I didn't have to worry about so many records. Over time, using the query analyzer, I have noticed that performance has worsened and that varchar fields can really slow down your query speed over thousands of lookups. I've been thinking about converting these varchar fields to integer-based status fields using a STATUS constant within the Transaction model, like so:
class Transaction < ActiveRecord::Base
  STATUS = { :incomplete => 0, :pending => 1, :completed => 2 }

  def self.expensive_query_by_status(status)
    find(:all,
         :select     => "id, cashier, total, status",
         :conditions => { :status => STATUS[status.to_sym] })
  end
end
Is this the best route for me to take? What do you guys suggest? I am already using proper indexes on various lookup fields and memcached for query caching wherever possible. They're currently set up in a distributed environment of 3 servers, where the 1st is for the application, the 2nd for the DB and the 3rd for caching (all in 1 datacenter & on the same VLAN).
Have you tried the alternative on a representative database? From the example given, I'm a little sceptical that it's going to make much difference, you see. If there are only three statuses then a query by status may be better-off not using an index at all.
Say "completed" comprises 80% of your table - with no other indexed column involved, you're going to be requiring more reads if the index is used than not. So a query of that type is almost certainly going to get slower as the table grows. "incomplete" and "pending" queries would probably still benefit from an index, however; they'd only be affected as the total number of rows with those statuses grew.
How often do you look at everything, complete and otherwise, without some more selective criterion? Could you partition the table in some (internal or external) way? For example, store completed transactions in a separate table, moving new ones there as they reach their final (?) state. I think internal database partitioning was introduced in MySQL 5.1 - looking at the documentation it seems that a RANGE partition might be appropriate.
All that said, I do think there's probably some benefit to moving away from storing statuses as strings. Storage and bandwidth considerations aside, it's a lot harder to inadvertently mis-spell an integer (or, better yet, a constant or symbol) than a free-form string.
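If you do go that way, a hedged sketch of the conversion (table and column names are from the question; the mapping mirrors the proposed STATUS constant):

class ConvertTransactionStatusToInteger < ActiveRecord::Migration
  STATUS = { 'incomplete' => 0, 'pending' => 1, 'completed' => 2 }

  def self.up
    add_column :transactions, :status_code, :integer
    STATUS.each do |name, code|
      execute "UPDATE transactions SET status_code = #{code} WHERE status = '#{name}'"
    end
    add_index :transactions, :status_code
    # once the application reads status_code, the old varchar column and its index can go
  end

  def self.down
    remove_index  :transactions, :status_code
    remove_column :transactions, :status_code
  end
end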
You might want to start limiting your searches (if you're not doing that already); #find(:all) is pretty taxing at that scale. Also, you might want to think about what your Transaction model is reaching out for as it gets rendered in your views, and perhaps eager load those associations to minimise requests to the DB for extra information, as sketched below.
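Something along these lines, perhaps (the eager-loaded association names are guesses):

# scope the lookup and eager load only what the views actually touch
recent_pending = Transaction.find(:all,
  :conditions => { :status => 'pending' },
  :include    => [:cashier, :line_items],   # guessed association names
  :order      => 'created_at DESC',
  :limit      => 100)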
