Improving an Active Record / PostgreSQL Query Further - ruby-on-rails

Following up on my question here, I'm trying to improve a search further. We first search a replays table (about 2k records), then fetch the unique players associated with those replays (10 per replay, so about 20k records), and render JSON. This is done in the controller; the search reads as:
def index
  @replays = Replay.includes(:players).where(map_id: params['map_id'].to_i).order(id: :desc).limit(2000)
  render json: @replays[0..2000].to_json(include: [:players])
end
The performance:
Completed 200 OK in 254032ms (Views: 34.1ms | ActiveRecord: 20682.4ms)
The actual Active Record search reads as:
Replay Load (80.4ms) SELECT "replays".* FROM "replays" WHERE "replays"."map_id" = $1 ORDER BY "replays"."id" DESC LIMIT $2 [["map_id", 1], ["LIMIT", 2000]]
Player Load (20602.0ms) SELECT "players".* FROM "players" WHERE "players"."replay_id" IN (117217...
This mostly works, but still takes an exceptional amount of time. Is there a way to improve performance?

You're getting bitten by this issue: https://postgres.cz/wiki/PostgreSQL_SQL_Tricks_I#Predicate_IN_optimalization
I found a note on pg_performance about an optimization possibility for the IN predicate when the list of values is longer than about eighty numbers. For longer lists it is better to create constant subqueries using multiple VALUES:
SELECT *
FROM tab
WHERE x IN (1,2,3,..n); -- n > 70
-- faster case
SELECT *
FROM tab
WHERE x IN (VALUES(10),(20));
Using VALUES is faster for a bigger number of items, but don't use it for a small set of values.
Basically, SELECT * FROM tab WHERE x IN (1, 2, ...) with a long list of values is very slow. It's dramatically faster if you can convert it to a VALUES list, like SELECT * FROM tab WHERE x IN (VALUES (1), (2), ...).
Unfortunately, since this is happening in Active Record, it's a little tricky to exercise control over the query. One option is to avoid the includes call, manually construct the SQL to load all your child records, and then build up the associations yourself.
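For illustration, here's a rough sketch of that manual approach, assuming the Replay and Player models from the question and a replay_id foreign key on players:
# Load the parent records as before.
replays = Replay.where(map_id: params['map_id'].to_i).order(id: :desc).limit(2000)

# Build a VALUES list from the ids and load all the children in one query.
values_list = replays.map { |r| "(#{r.id.to_i})" }.join(",")
players = replays.empty? ? [] : Player.where("replay_id IN (VALUES #{values_list})")

# Group the children ourselves instead of relying on the preloader.
players_by_replay = players.group_by(&:replay_id)
render json: replays.map { |r|
  r.as_json.merge("players" => players_by_replay.fetch(r.id, []).map(&:as_json))
}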
Alternatively, you can monkey patch Active Record. Here's what I've done on Rails 4.2, in an initializer:
module PreloaderPerformance
  private

  def query_scope(ids)
    if ids.count > 100
      type = klass.columns_hash[association_key_name.to_s].sql_type

      values_list = ids.map do |id|
        if id.kind_of?(Integer)
          " (#{id})"
        elsif type == "uuid"
          " ('#{id.to_s}'::uuid)"
        else
          " ('#{id.to_s}')"
        end
      end.join(",")

      scope.where("#{association_key_name} in (VALUES #{values_list})")
    else
      super
    end
  end
end

module ActiveRecord
  module Associations
    class Preloader
      class Association #:nodoc:
        prepend PreloaderPerformance
      end
    end
  end
end
Doing this I've seen a 50x speedup of some of my queries, with no issues as of yet. Note that it's not fully battle tested, and I bet it will have some issues if your association uses an unusual data type for the foreign key relationship. In my database, I only use uuids or integers for associations. The usual caveats about monkey patching core Rails behavior apply.

I know find_each can be used to batch queries, which might lighten the memory load here. Could you try out the following and see how it impacts upon the time?
Replay.where(map_id: params['map_id'].to_i).includes(:players).find_each(batch_size: 100).map do |replay|
  replay.to_json(include: :players)
end
I'm not sure this will work. It might be that the mapping negates the benefits of batching; there are certainly more queries, but it will use less memory as it doesn't need to hold more than 20k records at a time.
Have a play and see how it looks - fiddle with the batch size too, see how that affects things.
There's a caveat in that you can't apply a limit, so bear that in mind.
I'm sure someone else'll come up with a far slicker solution, but hope this might help in the meantime. If it's awful when you check the speed, let me know and I'll delete this answer :)

Related

ActiveRecord querying a tree structure efficiently

I have inherited a Rails 3 app that stores much of its data as a fairly sophisticated tree structure. The application works pretty well in general, but we are noticing some problems with performance, mostly around database interactions.
I am seeing a lot of queries along the lines of these showing up in my logs:
SELECT `messages`.* FROM `messages` WHERE `messages`.`context_type` = 'Node' AND `messages`.`context_id` IN (153740, 153741, /* about a thousand records... */ 154837, 154838, 154839, 154840, 154841, 154842, 154843)
Followed by many single queries where it looks as though the same record is being queried time and again:
CACHE (0.0ms) SELECT `plans`.* FROM `plans` WHERE `plans`.`type` IN ('Scenario') AND `plans`.`id` = 1435 LIMIT 1
My log has that exact query roughly eighty times. Now, I'm guessing the initial CACHE message means it is probably being pulled from a cache rather than going back to the database every time, but it still looks like a lot, and this type of thing is happening repeatedly.
I am guessing that the above queries are an association being pulled out backwards: a message belongs_to a plan, and the app is loading all the messages and then pulling out the plan for each one rather than, as one might do in a sane world, starting with the plan and then loading all its messages.
Working in this vein, a single request contains 1641 SELECT statements, and it seems very likely to me that the sheer amount of database traffic (not to mention the number of sequential LIMIT 1 queries for neighbouring data in the same table) is a significant bottleneck. I am reluctant to post too much code, but this is a typical example of one of the larger queries:
def nodes
  include_objects = [:sector, :market, :messages, :node_user_prefs, :reference_node, :project, {:available_events => :available_event_nodes}, :customer_factors, :active_components, :tags, {:event_histories => :node}, {:event_histories => :user}]
  project = self
  @cached_nodes ||= begin
    all_nodes = orig_nodes.includes(include_objects)
    all_nodes = all_nodes.map { |n| n.tap { |i| i.cached_project = project } }
    all_node_ids = all_nodes.map(&:id)
    all_nodes.select { |n| n.type == 'InitialNode' || all_node_ids.include?(n.parent_node_id) }
  end
end
Obviously, the queries are pretty diverse and the application is large, but this is fairly representative of the standard approach taken.
What are the easy wins with ActiveRecord that I can use to try and speed things up? I can easily enough put together a query that would pull all the required data out in a single round trip, but how easy would it be to form that - redundancies and all - into my model hierarchy? What other techniques can I use to help out here?
Ancestry Gem
Not a direct fix by any means, but you may wish to consider the ancestry gem.
This will give you a way to create a tree structure whereby you'll be able to load single records and then have their descendants fetched as you wish. This will cut back on your SQL queries.
If you set up your nodes / objects in this fashion, it will allow you to call the records you require once, and ancestry will populate the rest. If you want me to divulge more information on this, please let me know in the comments and I'll detail more specifics.
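For illustration, a rough sketch of what that setup might look like; the Node model and column names here are assumptions, not taken from the question:
# In a migration: ancestry stores each record's tree path in one string column.
add_column :nodes, :ancestry, :string
add_index :nodes, :ancestry

# In the model:
class Node < ActiveRecord::Base
  has_ancestry
end

# Usage: fetch one record, then walk its relatives without hand-written SQL.
node = Node.find(153740)
node.children     # direct children
node.descendants  # the whole subtree below this node
node.root         # the top of the tree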

What is the 'Rails Way' to implement a dynamic reporting system on data

Intro
I'm doing a system where I have a very simple layout only consisting of transactions (with basic CRUD). Each transaction has a date, a type, a debit amount (minus) and a credit amount (plus). Think of an online banking statement and that's pretty much it.
The issue I'm having is keeping my controller skinny and worrying about possibly over-querying the database.
A Simple Report Example
The total debit over the chosen period e.g. SUM(debit) as total_debit
The total credit over the chosen period e.g. SUM(credit) as total_credit
The overall total e.g. total_credit - total_debit
The report must allow a dynamic date range e.g. where(date BETWEEN 'x' and 'y')
The date range would never be more than a year and will only be a max of say 1000 transactions/rows at a time
So in the controller I create:
def report
  @d = Transaction.select("SUM(debit) as total_debit").where("date BETWEEN 'x' AND 'y'")
  @c = Transaction.select("SUM(credit) as total_credit").where("date BETWEEN 'x' AND 'y'")
  @t = @c.total_credit - @d.total_debit
end
Additional Question Info
My actual report has closer to 6 or 7 database queries (e.g. pulling out the total credit/debit per type == 1 or type == 2, etc.) and has many more calculations, e.g. totalling up certain credit/debit types and then adding and removing these totals from other totals.
I'm trying my best to adhere to 'fat model, skinny controller' but am having issues with the number of variables my controller needs to pass to the view. Rails has seemed very straightforward up until the point where you create variables to pass to the view. I don't see how else you do it apart from putting the variable-creating line into the controller and making it 'skinnier' by putting some of the query bits and pieces into the model.
Is there something I'm missing where you create variables in the model and then have the controller pass those to the view?
A more idiomatic way of writing your query in ActiveRecord would probably be something like:
class Transaction < ActiveRecord::Base
  def self.within(start_date, end_date)
    where(:date => start_date..end_date)
  end

  def self.total_credit
    sum(:credit)
  end

  def self.total_debit
    sum(:debit)
  end
end
This would mean issuing 3 queries in your controller, which should not be a big deal if you create database indices, and limit the number of transactions as well as the time range to a sensible amount:
@transactions = Transaction.within(start_date, end_date)
@total = @transactions.total_credit - @transactions.total_debit
Finally, you could also use Ruby's Enumerable#reduce method to compute your total by directly traversing the list of transactions retrieved from the database.
@total = @transactions.reduce(0) { |memo, t| memo + (t.credit - t.debit) }
For very small datasets this might result in faster performance, as you would hit the database only once. However, I reckon the first approach is preferable, and it will certainly deliver better performance when the number of records in your DB starts to increase.
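If you did want a single round trip without loading the rows into Ruby at all, one option (a sketch, not part of the answer above) is to let the database compute the net directly; newer Rails versions may require wrapping the raw expression in Arel.sql:
@total = Transaction.within(start_date, end_date).sum("credit - debit")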
I'm putting params[:year_start]/params[:year_end] in for x and y; is that safe to do?
You should never embed params[:anything] directly in a query string. Instead use this form:
where("date BETWEEN ? AND ?", params[:year_start], params[:year_end])
My actual report probably has closer to 5 database calls and then 6 or 7 calculations on those variables. Should I just be querying the date range once and then doing all the work on the resulting array/hash?
This is a little subjective but I'll give you my opinion. Typically it's easier to scale the application layer than the database layer. Are you currently having performance issues with the database? If so, consider moving the logic to Ruby and adding more resources to your application server. If not, maybe it's too soon to worry about this.
I'm really not seeing how I would get the majority of the work/calculations into the model. I understand scopes, but how would you put the date range into a scope and still utilise GET params?
Have you seen has_scope? This is a great gem that lets you define scopes in your models and have them automatically get applied to controller actions. I generally use this for filtering/searching, but it seems like you might have a good use case for it.
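A rough has_scope sketch, with scope and parameter names assumed rather than taken from the question:
class Transaction < ActiveRecord::Base
  scope :within, lambda { |start_date, end_date| where(:date => start_date..end_date) }
end

class TransactionsController < ApplicationController
  # Applies the :within scope whenever the request carries a within[...] hash.
  has_scope :within, :using => [:year_start, :year_end], :type => :hash

  def report
    @transactions = apply_scopes(Transaction).all
  end
end

# GET /transactions/report?within[year_start]=2013-01-01&within[year_end]=2013-12-31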
If you could give an example on creating an array via a broad database call and then doing various calculations on that array and then passing those variables to the template that would be awesome.
This is not a great fit for Stack Overflow and it's really not far from what you would be doing in a standard Rails application. I would read the Rails guide and a Ruby book and it won't be too hard to figure out.

Iterating through a table in Ruby using a hash runs slow

I have the following code:
h2.each { |k, v|
  @count += 1
  puts @count
  sq.each do |word|
    if Wordsdoc.find_by_docid(k).tf.include?(word)
      sum += Wordsdoc.find_by_docid(k).tf[word] * @s[word]
    end
  end
  rec_hash[k] = sum
  sum = 0
}
h2 -> is a hash that contains ids of documents; it holds more than 1000 of these
Wordsdoc -> is a model/table in my database...
sq -> is a hash that contains around 10 words
What I'm doing is going through each of the document ids, and then for each word in sq I look up in the Wordsdoc table whether the word exists (Wordsdoc.find_by_docid(k).tf.include?(word)); here tf is a hash of {word => value}.
If it does, I get the value of that word in Wordsdoc and multiply it by the value of the word in @s, which is also a hash of {word => value}.
This seems to be running very slowly; it processes one document per second. Is there a way to process this faster?
Thanks, I really appreciate your help on this!
You do a lot of duplicate querying. While ActiveRecord can do some caching in the background to speed things up, there is a limit to what it can do, and there is no reason to make things harder for it.
The most obvious cause of the slowdown is the Wordsdoc.find_by_docid(k). For each value of k, you call it 10 times, and each time you call it there is a possibility of calling it again. That means you call that method with the same argument 10-20 times for each entry in h2. Queries to the database are expensive, since the database is on the hard disk, and accessing the hard disk is expensive in any system. You can just as easily call Wordsdoc.find_by_docid(k) once, before you enter the sq.each loop, and store it in a variable; that would save a lot of querying and make your loop go much faster.
Another optimization, though not nearly as important as the first one, is to get all the Wordsdoc records in a single query. Almost all mid to high level (and some of the low level, too!) programming languages and libraries work better and faster when they work in bulk, and ActiveRecord is no exception. If you can query for all entries of Wordsdoc, and filter them by the docids in h2's keys, you can turn 1000 queries (after the first optimization; before it, it was 10000-20000 queries) into a single, huge query. That will enable ActiveRecord and the underlying database to retrieve your data in bigger chunks and save you a lot of disk access.
There are some more minor optimizations you can do, but the two I've specified should be more than enough; a rough sketch of both combined follows.
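As a sketch only, assuming Rails 3-style where (on older Rails you'd use :conditions instead) and the hashes from the question:
# Fetch every Wordsdoc for the document ids in h2 with one query and index
# them by docid; the nested loop below then runs purely in memory.
docs_by_id = Wordsdoc.where(:docid => h2.keys).index_by(&:docid)

h2.each do |k, _v|
  doc = docs_by_id[k]
  next unless doc

  sum = 0
  sq.each do |word|
    sum += doc.tf[word] * @s[word] if doc.tf.include?(word)
  end
  rec_hash[k] = sum
end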
You're calling Wordsdoc.find_by_docid(k) twice.
You could refactor the code to:
wordsdoc = Wordsdoc.find_by_docid(k)
if wordsdoc.tf.include?(word)
  sum += wordsdoc.tf[word] * @s[word]
end
...but still it will be ugly and inefficient.
You should prefetch all records in batches, see: https://makandracards.com/makandra/1181-use-find_in_batches-to-process-many-records-without-tearing-down-the-server
For example, something like this should be much more efficient:
Wordsdoc.find_in_batches(:conditions => { :docid => array_of_doc_ids }) do |batch|
  batch.each do |wordsdoc|
    if wordsdoc.tf.include?(word)
      sum += wordsdoc.tf[word] * @s[word]
    end
  end
end
Also, you can retrieve only certain columns from the Wordsdoc table by passing, for example, :select => :tf to find_in_batches.
As you have a lot going on, I'm just going to offer you a few things to check out.
A book called Eloquent Ruby deals with documents and iterating through documents to count the number of times a word was used. All of the author's examples are about a document system he was maintaining, so it could even tackle other problems for you.
inject is a method that could speed up what you're looking to do for the sum part, maybe.
Delayed Job the whole thing if you are doing this asynchronously. Meaning, if this is a web app, you must be timing out if you're waiting 1000 seconds for this job to complete before it shows its answers on the screen.
Go get em.

What is the difference between these two statements, and why would you choose them?

I'm a beginner at Rails. And I've come to understand two different ways to return the same result.
What is the difference between these two? And in what situation would you choose one over the other?
Example 1:
Object.find(:all).select {|c| c.name == "Foobar" }.size
Example 2:
Object.count(:conditions => ['name = ?', 'Foobar'])
FURTHER NOTE:
I seriously wish I could mark everyone's answer as correct for this one. Thank you so much. I just had a serious Rails affirmation.
Object.count always hits the DB, and the find(...).size call can be optimized. There's a good discussion here:
http://rhnh.net/2007/09/26/counting-activerecord-associations-count-size-or-length
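The gist of that discussion, sketched against a hypothetical association:
user.posts.count   # always issues SELECT COUNT(*) against the database
user.posts.length  # loads the whole association into memory, then counts in Ruby
user.posts.size    # uses length if the records are already loaded (or a counter cache, if one exists), count otherwise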
Example 1:
This constructs the query:
SELECT * FROM objects
then turns all the records into a collection of objects in memory, iterates through every object to see if it meets the condition, and then counts the number of elements that meet the condition.
Example 2:
This constructs a query:
SELECT count(id) FROM objects WHERE name = 'Foobar'
lets SQL do all the hard work and returns just an integer: the number of objects meeting the condition.
Usually you want option 2: it's faster and uses less memory.
Example 1 will load all of your records from the DB (assuming Object is an ActiveRecord model), then uses Ruby to reduce the set, and then return the size of that array. So this is potentially memory and CPU heavy - not good.
Example 2 performs the count in SQL, so all the heavy lifting is performed in the database, not in Ruby. Much better :)
In example 1, you are getting all objects from the datastore and then iterating over all of them, selecting the objects that have the name Foobar. Then you get the size of that array. Example 1 is the clear loser here.
Example 1 sql:
select * from whatever
# then iterate over entire array
Example 2 executes a WHERE clause in SQL against the datastore.
select count(id) from whatever where name = 'foobar'
# The SQL above is SQL Server syntax, but not necessarily MySQL or SQLite3

How do you order an array by a connected integer in Ruby on Rails?

I'm creating a "most popular activity" section for user profiles. I have no difficulty pulling questions through the user_id, but I'm having trouble pulling and then ordering by the associated integer: question.votes.size. This is probably a simple question, but how do I sort and then limit the output to 3? How do I do this without lagging the database? There will eventually be a lot of votes to be counted. Should this be a named_scope?
@user_id = User.find_by_username(params[:username]).id
questions = Question.find(:all, :conditions => { :user_id => @user_id })
I wanted to pop in and suggest another way that is, perhaps, a bit more native RoR. :-)
@user = User.find_by_username(params[:username], :include => [{ :questions => :votes }])
@sorted_questions = @user.questions.sort { |q1, q2| q2.votes.length <=> q1.votes.length }
This has a number of advantages:
1) No SQL written, maintains DB portability, easier to read(?)
2) Relieves DB of sort compute, should scale better
and a couple of disadvantages:
1) Works Ruby harder, higher latency at low loads, less efficient on single box
2) Moves more data, potentially mitigates advantage #2
Ideally, you'd want to look at ActiveRecord's counter cache functionality. It automatically caches relationship counts by denormalizing the child row count into the parent table. For it to work, all child row manipulation must occur via the parent object, but that's Rails best practice in any case.
Having a votes counter cache in questions would eliminate the need to reference the votes table in the query. Doing this, and sorting in Ruby, might be the ideal situation from both performance and code esthetic points of view.
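A minimal counter cache sketch; the votes_count column name is the Rails convention, not something from the original post:
# In a migration: add_column :questions, :votes_count, :integer, :default => 0

class Vote < ActiveRecord::Base
  belongs_to :question, :counter_cache => true
end

class Question < ActiveRecord::Base
  has_many :votes
end

# Sorting in Ruby no longer needs to touch the votes table at all:
@sorted_questions = @user.questions.sort_by { |q| -q.votes_count }.first(3)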
Finally, I have to admit punting on Rails 3's very cool relational algebra stuff. With it, it's likely that this could be a super readable one-liner that generates optimal SQL. How cool is that going to be? :-)
I'm assuming that each question has many votes.
You can do this by using
Question.find_by_sql("
  SELECT questions.*, COUNT(votes.id) AS vote_count
  FROM questions
  LEFT JOIN votes ON questions.id = votes.question_id
  GROUP BY questions.id
  ORDER BY vote_count DESC
")
or something roughly equivalent to that (I didn't test it)
