Rails fastest way to get table records count

I have a users table in a PostgreSQL database. If I run User.count, it takes 150ms to get the result, which is far too slow for us. Ideally, it should take less than 10ms to return the result. Is there any way to cache the SQL result at the model level? Something like
def self.total_count
  User.count.cached # that's my imagination
end
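(For what it's worth, the closest stock-Rails equivalent of that imagined method is the low-level cache. A minimal sketch, assuming a cache store is configured; the key and TTL are arbitrary:)

class User < ActiveRecord::Base
  def self.total_count
    # Serve the cached value when fresh; recompute at most once a minute.
    Rails.cache.fetch("users/total_count", expires_in: 1.minute) do
      count
    end
  end
end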

In my opinion, there are several ways you could go about this:
You could have another table that stores the count of the total number of users by incrementing the count there when a user is added/deleted or at frequent time intervals.
If your table is extremely big and accuracy is not the most important thing, you could also look into Postgres' count estimate technique:
SELECT reltuples AS approximate_row_count FROM pg_class WHERE relname = 'users';
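If you'd rather call that from Ruby, here is a minimal sketch (the approximate_count method name is made up):

class User < ActiveRecord::Base
  def self.approximate_count
    # reltuples is the planner's row estimate, refreshed by VACUUM/ANALYZE.
    connection.select_value(
      "SELECT reltuples::bigint FROM pg_class WHERE relname = 'users'"
    ).to_i
  end
end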

You should look into counter_cache. It will work great if your User belongs_to another model: http://guides.rubyonrails.org/association_basics.html
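A minimal sketch of that setup, assuming a hypothetical Team model that users belong to, with an integer users_count column on the teams table:

class User < ActiveRecord::Base
  belongs_to :team, counter_cache: true # keeps teams.users_count up to date
end

class Team < ActiveRecord::Base
  has_many :users
end

Team.first.users.size # reads teams.users_count, no COUNT query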

Related

Optimizing has many record association query

I have this query that I've built using Enumerable#select. The purpose is to find users that have no has_many associated records, or, if they do have them, only those users whose records all have the preview attribute set to true. The code below works for that use case. However, this query does not scale well: when I test it against thousands of records it takes several hundred seconds to complete. How can this query be improved?
# User has many enrollments
# Enrollment belongs to user.
users_with_no_courses = User.includes(:enrollments).select { |user|
  user.enrollments.empty? || user.enrollments.where(preview: false).empty?
}
So first, make sure enrollments.user_id has an index.
Second, you can speed this up by not loading all the enrollments, and doing your filtering in SQL:
User.where(<<-EOQ)
  NOT EXISTS (SELECT 1
              FROM enrollments e
              WHERE e.user_id = users.id
              AND NOT e.preview)
EOQ
By the way here I'm simplifying your two conditions into one: "no enrollments or no real enrollments" is the same as "no real enrollments".
If you want, you can put this condition into a scope so it is more reusable.
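For instance, a sketch of such a scope (the scope name is made up):

class User < ActiveRecord::Base
  has_many :enrollments

  # Users with no non-preview ("real") enrollments.
  scope :without_real_enrollments, -> {
    where(<<-EOQ)
      NOT EXISTS (SELECT 1
                  FROM enrollments e
                  WHERE e.user_id = users.id
                  AND NOT e.preview)
    EOQ
  }
end

User.without_real_enrollments # same result, reusable and chainable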
Third, this is still going to be slow if you're instantiating thousands of User objects. So I would look into paginating if that makes sense, or find_each if this is an offline script. Or use raw SQL to avoid all the object instances.
Oh by the way: even though you are saying includes(:enrollments), this will still go back to the database, giving you an n+1 problem:
user.enrollments.where(preview: false)
That is because the where means ActiveRecord can't use the already-loaded association. You can avoid that by using select instead of where. But not loading the enrollments in the first place is even better.
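For illustration, a sketch of the two variants using the question's models:

# Enumerable#select filters the records preloaded by includes(:enrollments),
# so no extra query is issued:
user.enrollments.select { |e| !e.preview }.empty?

# where(...) builds a new relation and hits the DB again, once per user:
user.enrollments.where(preview: false).empty?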

ActiveRecord return the newest record per user (unique)

I've got a User model and a Card model. User has many Cards, so Card has a user_id attribute.
I want to fetch the newest single Card for each user. I've been able to do this:
Card.all.order(:user_id, :created_at)
# => gives me all the Cards, sorted by user_id then by created_at
This gets me halfway there, and I could certainly iterate through these rows and grab the first one per user. But that smells really bad to me, as I'd be doing a lot of the work with Arrays in Ruby.
I can also do this:
Card.select('user_id, max(created_at)').group('user_id')
# => gives me user_id and created_at
...but I only get back user_ids and created_at timestamps. I can't select any other columns (including id), so what I'm getting back is worthless. I also don't understand why PG won't let me select more columns than the above without putting them in the GROUP BY or in an aggregate function.
I'd prefer to find a way to get what I want using only ActiveRecord. I'm also willing to write this query in raw SQL, but only if I can't get it done with AR. BTW, I'm using a Postgres DB, which limits some of my options.
Thanks guys.
We join the cards table on itself, ON:
a) first.id != second.id
b) first.user_id = second.user_id
c) first.created_at < second.created_at

Card.joins("LEFT JOIN cards AS c
            ON cards.id != c.id
            AND c.user_id = cards.user_id
            AND cards.created_at < c.created_at")
    .where('c.id IS NULL')

The where('c.id IS NULL') keeps exactly the cards for which no newer card by the same user exists, i.e. the newest card per user.
This is a bit late, but I am working on the same matter, and I found this one works for me:
Card.all.group_by(&:user_id).map { |s| s.last.last }
What do you think?
I've found one solution that is suboptimal performance-wise but will work for very small datasets, when time is short or it's a hobby project:
Card.all.order(:user_id, :created_at).to_a.uniq(&:user_id)
This takes the AR::Relation results, casts them into a Ruby Array, then performs an Array#uniq on the results with a Proc. After some brief testing it appears #uniq preserves order, so as long as everything is in order before using uniq you should be good.
The feature is time sensitive so I'm going to use this for now, but I will be looking at something in raw SQL following #Gene's response and link.

Rails subquery reduce amount of raw SQL

I have two ActiveRecord models: Post and Vote. I want to make a simple query:
SELECT *,
       (SELECT COUNT(*)
        FROM votes
        WHERE votes.post_id = posts.id) AS vote_count
FROM posts
I am wondering what's the best way to do it in the ActiveRecord DSL. My goal is to minimize the amount of SQL I have to write.
I can do Post.select("(SELECT COUNT(*) FROM votes WHERE votes.post_id = posts.id) AS vote_count")
Two problems with this:
Raw SQL. Anyway to write this in DSL?
This returns only the attribute vote_count and not "*" plus vote_count. I can append .select("*") but I will be repeating this every time. Is there a better/DRY way to do this?
Thanks
Well, if you want to reduce the amount of SQL, you can split that query into two smaller ones and execute them separately. For instance, the vote-counting part could be extracted into this query:
SELECT votes.post_id, COUNT(*) FROM votes GROUP BY votes.post_id;
which you may write with ActiveRecord methods as:
Vote.group(:post_id).count
You can store the result for later use and access it directly from the Post model; for example, you may define #votes_count as a method:
class Post < ActiveRecord::Base
  def votes_count
    @@votes_count_cache ||= Vote.group(:post_id).count
    @@votes_count_cache[id] || 0
  end
end
(Of course every use of cache raises a question about invalidating or updating it, but this is out of the scope of this topic.)
But I strongly encourage you to consider yet another approach.
I believe writing complicated queries like yours with ActiveRecord methods, even if it were possible, or splitting queries into two as I proposed earlier, are both bad ideas. They result in extremely cluttered code, far less readable than raw SQL. Instead, I suggest introducing query objects. IMO there is nothing wrong with using raw, complicated SQL when it's hidden behind a nice interface. See: M. Fowler's P of EAA and Brynary's post on the Code Climate Blog.
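As an illustration, a minimal query-object sketch (the class name is made up; the SQL is the subquery from the question):

class PostsWithVoteCounts
  def self.call
    Post.select(<<-SQL)
      posts.*,
      (SELECT COUNT(*)
       FROM votes
       WHERE votes.post_id = posts.id) AS vote_count
    SQL
  end
end

posts = PostsWithVoteCounts.call
posts.first.vote_count # the SQL alias is exposed as a reader on each Post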
How about doing this with no additional SQL at all - consider using the Rails counter_cache feature.
If you add an integer votes_count column to the posts table, you can get Rails to automatically increment and decrement that counter by changing the belongs_to declaration in Vote to:
belongs_to :post, counter_cache: true
Rails will then keep each Post updated with the number of votes it has. That way the count is already calculated and no sub-query is needed.
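A sketch of the supporting migration, assuming the conventional column name; existing rows can then be backfilled once with Post.reset_counters:

class AddVotesCountToPosts < ActiveRecord::Migration
  def change
    add_column :posts, :votes_count, :integer, default: 0, null: false
  end
end

# Backfill once for data that existed before the column was added:
# Post.find_each { |post| Post.reset_counters(post.id, :votes) }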
Maybe you can create a MySQL view and just map it to a new AR model. It works in a similar way to a table; you just need to point the model at the view with set_table_name "your_view_name". At the DB level it may work faster, and it will be recalculated automatically.
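A hedged sketch of that idea (view and class names are made up; self.table_name = is the modern spelling of set_table_name):

class CreatePostVoteCountsView < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE VIEW posts_with_vote_counts AS
      SELECT posts.*,
             (SELECT COUNT(*) FROM votes
              WHERE votes.post_id = posts.id) AS vote_count
      FROM posts
    SQL
  end

  def down
    execute "DROP VIEW posts_with_vote_counts"
  end
end

class PostWithVoteCount < ActiveRecord::Base
  self.table_name = "posts_with_vote_counts"
end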
Just stumbled upon postgres_ext gem which adds support for Common Table Expressions in Arel and ActiveRecord which is exactly what you asked. Gem is not for SQLite, but perhaps some portions could be extracted or serve as examples.

Doing analytics on a large table in Rails / PostgreSQL

I have a "Vote" table in my database which is growing in size everyday, currently at around 100 million rows. For internal analytics / insights I used to have a rake task which would compute a few basic metrics, like the number of votes made daily in the past few days. It's just a COUNT with a where clause on the date "created_at".
This rake task was doing fine until I deleted the index on "created_at", because it seemed to have a negative impact on app performance for all the other user-facing queries that didn't need this index, especially when inserting a new row.
Currently I don't have a lot of insights as to what is going on in my app and in this table. However I don't really want to add indexes on such a large table if it's only for my own use.
What else can I try?
Alternately, you could sidestep the Vote table altogether and keep an external tally.
Every time a vote is cast, a separate tally class that keeps a running count of votes cast will be invoked. There will be one tally record per day. A tally record will have an integer representing the number of votes cast on that day.
Each increment call to the tally class will find a tally record for the current date (today), increment the vote count, and save the record. If no record exists, one will be created and incremented accordingly.
For example, let's have a class called VoteTally with two attributes: a date (date), and a vote count (integer), no timestamps, no associations. Here's what the model will look like:
class VoteTally < ActiveRecord::Base
  def self.tally_up!
    find_or_create_by_date(Date.today).increment!(:votes)
  end

  def self.tally_down!
    find_or_create_by_date(Date.today).decrement!(:votes)
  end

  def self.votes_on(date)
    find_by_date(date).votes
  end
end
Then, in the Vote model:
class Vote < ActiveRecord::Base
  after_create  :tally_up
  after_destroy :tally_down

  # ...

  private

  def tally_up   ; VoteTally.tally_up!   ; end
  def tally_down ; VoteTally.tally_down! ; end
end
These methods will get vote counts:
VoteTally.votes_on Date.today
VoteTally.votes_on Date.yesterday
VoteTally.votes_on 3.days.ago
VoteTally.votes_on Date.parse("5/28/13")
Of course, this is a simple example and you will have to adapt it to suit. This will result in an extra query during vote casting, but it's a hell of a lot faster than a where clause on 100M records with no index. Minor inaccuracies are possible with this solution, but I assume that's acceptable given the anecdotal nature of daily vote counts.
It's just a COUNT with a where clause on the date "created_at".
In that case the only credible index you can use is the one on created_at...
If write performance is an issue (methinks it's unlikely...) and you're using a composite primary key, clustering the table using that index might help too.
If the index really has an impact on write performance, and it's only a few people who run statistics now and then, you might consider another general approach:
You could separate your "transaction processing database" from your "reporting database".
You could update your reporting database on a regular basis and create reporting-only indexes only there. What's more, report queries will not conflict with transaction-oriented traffic, and it doesn't matter how long they run.
Of course, this introduces a certain delay and increases system complexity. On the other hand, if you roll forward your reporting database on a regular basis, you can ensure that your backup scheme actually works.
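A minimal sketch of that setup, assuming a second "reporting" entry in config/database.yml (all names are illustrative):

class ReportingVote < ActiveRecord::Base
  establish_connection :reporting # points at the reporting database
  self.table_name = "votes"
end

# Heavy analytics run against the replica, with its own indexes:
day = Date.yesterday
ReportingVote.where(created_at: day.beginning_of_day..day.end_of_day).count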

ActiveRecord: size vs count

In Rails, you can find the number of records using both Model.size and Model.count. If you're dealing with more complex queries is there any advantage to using one method over the other? How are they different?
For instance, I have users with photos. If I want to show a table of users and how many photos they have, will running many instances of user.photos.size be faster or slower than user.photos.count?
Thanks!
You should read that, it's still valid.
You'll adapt the function you use depending on your needs.
Basically:
if you already have all the entries loaded, say User.all, then you should use length to avoid another DB query
if you haven't loaded anything, use count to issue a count query against your DB
if you don't want to bother with these considerations, use size, which will adapt
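For instance, the three cases above look like this:

users = User.all.to_a # everything is loaded into Ruby here
users.length          # counts the Array in memory, no extra query

User.count            # issues a single SELECT COUNT(*)

User.all.size         # nothing loaded yet, so size falls back to a count query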
As the other answers state:
count will perform an SQL COUNT query
length will calculate the length of the resulting array
size will try to pick the most appropriate of the two to avoid excessive queries
But there is one more thing. We noticed a case where size acts differently to count/length altogether, and I thought I'd share it since it is rare enough to be overlooked.
If you use a :counter_cache on a has_many association, size will use the cached count directly, and not make an extra query at all.
class Image < ActiveRecord::Base
  belongs_to :product, counter_cache: true
end

class Product < ActiveRecord::Base
  has_many :images
end
> product = Product.first # query, load product into memory
> product.images.size # no query, reads the :images_count column
> product.images.count # query, SQL COUNT
> product.images.length # query, loads images into memory
This behaviour is documented in the Rails Guides, but I either missed it the first time or forgot about it.
tl;dr
If you know you won't be needing the data, use count.
If you know you will use or have used the data, use length.
If you don't know where it will be used or the speed difference is negligible, use size...
count
Resolves to sending a SELECT COUNT(*) query to the DB. The way to go if you don't need the data, just the count.
Example: count of new messages, total elements when only a page is going to be displayed, etc.
length
Loads the required data, i.e. runs the query if required, and then just counts it. The way to go if you are using the data.
Example: Summary of a fully loaded table, titles of displayed data, etc.
size
It checks whether the data is already loaded (i.e. already in Rails); if so, it just counts it, otherwise it calls count. (Plus the pitfalls already mentioned in other entries.)
def size
  loaded? ? @records.length : count(:all)
end
What's the problem?
That you might be hitting the DB twice if you don't do it in the right order (e.g. if you render the number of elements in a table above the rendered table, there will effectively be 2 calls sent to the DB).
Sometimes size "picks the wrong one" and returns a hash (which is what count would do for a grouped query).
In that case, use length to get an integer instead of a hash.
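For instance, with the question's photos association (values are illustrative):

Photo.group(:user_id).count # => { 1 => 10, 2 => 3 } -- a Hash keyed by user_id
Photo.group(:user_id).size  # unloaded relation, so this delegates to count: same Hash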
The following strategies all make a call to the database to perform a COUNT(*) query.
Model.count
Model.all.size
records = Model.all
records.count
The following is not as efficient, because it loads all the records from the database into Ruby and then counts the size of the collection in memory.
records = Model.all.to_a
records.size
If your models have associations and you want to find the number of belonging objects (e.g. #customer.orders.size), you can avoid database queries (disk reads). Use a counter cache and Rails will keep the cache value up to date, and return that value in response to the size method.
I recommend using the size method.
class Customer < ActiveRecord::Base
  has_many :customer_activities
end

class CustomerActivity < ActiveRecord::Base
  belongs_to :customer, counter_cache: true
end
Consider these two models. The customer has many customer activities.
If you use a :counter_cache on a has_many association, size will use the cached count directly, and not make an extra query at all.
Consider one example: in my database, one customer has 20,000 customer activities, and I counted that customer's activities with each of the count, length, and size methods. Here is the benchmark report for all three:
            user     system      total        real
Count:  0.000000   0.000000   0.000000 (  0.006105)
Size:   0.010000   0.000000   0.010000 (  0.003797)
Length: 0.030000   0.000000   0.030000 (  0.026481)
So I found that, with a :counter_cache, size is the best option for getting the number of records.
Here's a flowchart to simplify your decision-making process. Hope it helps.
Source: Difference Between the Length, Size, and Count Methods in Rails
