Kaminari without COUNT

Can Kaminari work without hitting the DB with a COUNT(*) query?
My app's database is huge and counting the items takes much much longer than getting the items itself, leading to performance issues.
Suggestions for other pagination solutions with large datasets are also welcome.

Paginating Without Issuing SELECT COUNT Query
Generally the paginator needs to know the total number of records to display the links, but sometimes we don't need the total and just need "previous page" and "next page" links. For such a use case, Kaminari provides a without_count mode that creates a paginatable collection without counting all the records. This can help when you're dealing with a very large dataset, because counting a big table tends to be slow on an RDBMS.
Just add .without_count to your paginated object:
User.page(3).without_count
In your view file, you can only use simple helpers like the following instead of the full-featured paginate helper:
<%= link_to_prev_page @users, 'Previous Page' %>
<%= link_to_next_page @users, 'Next Page' %>
Source: github.com/kaminari

Well, Kaminari and will_paginate both need to count the total somehow in order to determine the total_pages to render. This is inevitable. My solution was to look at the database query and try to optimize it. That's the way to go.
(this answer is outdated, see above answers)

We have a case where we do want a total count, but don't want to hit the database for it — our COUNT query takes a couple of seconds in some cases, even with good indexes.
So we've added a counter cache to the parent table, keep it up to date with triggers, and override the total_count singleton on the Relation object:
my_house = House.find(1)
paginated = my_house.cats.page(1)
# define_singleton_method keeps access to my_house via its closure;
# a plain `def paginated.total_count` could not see the local variable
paginated.define_singleton_method(:total_count) { my_house.cats_count }
... and all the things that require counts work without making that query.
This is an unusual thing to do. Maintaining a counter cache has some costs. There may be weird side effects if you do further relational stuff with your paginated data. Overriding singleton methods can sometimes make debugging into a nightmare. But used sparingly and documented well, you can get the behavior you want with good performance.
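The singleton-override idea above can be sketched in plain Ruby, with no database involved (the class and variable names here are illustrative, not from the original code):

```ruby
# Plain-Ruby sketch of a singleton override: the per-object method
# shadows the class's expensive implementation.
class Collection
  def total_count
    raise "expensive COUNT(*) query hit the database"
  end
end

paginated = Collection.new
cached_count = 42
# define_singleton_method keeps access to cached_count via its closure
paginated.define_singleton_method(:total_count) { cached_count }
paginated.total_count  # => 42, the expensive method is never called
```

Other instances of the class are unaffected; only this one object answers with the cached value.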

Randomize Selections in a List of 100

This is a follow-up to this last question I asked: Sort Users by Number of Followers. That code is:
@ordered_users = User.all.sort { |a, b| b.followers.count <=> a.followers.count }
What I hope to accomplish is take the ordered users and get the top 100 of those and then randomly choose 5 out of that 100. Is there a way to accomplish this?
Thanks.
users_in_descending_order_of_followers = User.all.sort_by { |u| -u.followers.count }
sample_of_top = users_in_descending_order_of_followers.take(100).sample(5)
You can use sort_by which can be easier to use than sort, and combine take and sample to get the top 100 users and sample 5 of those users.
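The same pipeline can be seen with plain in-memory data, which makes the behavior of take and sample easy to check (fake follower counts, no database):

```ruby
# In-memory illustration of sort_by / take / sample (no DB involved)
users = (1..500).map { |i| { name: "user#{i}", followers: rand(1000) } }

top_100 = users.sort_by { |u| -u[:followers] }.take(100)
picks = top_100.sample(5)   # 5 random users drawn from the top 100
```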
User.all.sort can potentially pose problems in the long run, depending on the total number of users and the available memory. It is also a lot slower: the sort block calls .followers.count twice per comparison, issuing roughly 2 count queries per comparison, and the number of comparisons grows with the number of users. On top of that, User.all.sort immediately executes the User.all query, fetching every User record into memory, whereas a bare User.all relation is lazily loaded until you actually iterate it (with .each, or better yet .find_each, somewhere down the line).
I suggest something like below (I extended Deekshith's answer referring to your link to the other question):
User.joins(:followers).group('users.id').order('count(followers.user_id) DESC').limit(100).sample(5)
.joins, .order, and .limit above all build up a single SQL query, which is executed once; .sample(5) is no longer SQL at that point but a plain Ruby method running on the loaded records, finally yielding the result you need.
I would strongly consider using a counter cache on the User model, to hold the count of followers.
This would give a very small performance impact on adding or removing followers, and greatly increase performance when performing sorts:
User.order(followers_count: :desc)
This would be particularly noticeable if you wanted the top-n users by follower count, or finding users with no followers.
User.order(followers_count: :desc).limit(100).sample(5)
This method will out-perform others using count(*). Add an index on followers_count for best effect.
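The mechanics of a counter cache can be sketched in plain Ruby. (In Rails you would instead add a followers_count column and set counter_cache: true on the belongs_to side of the association; Rails then maintains the column on create and destroy. The class below is only an in-memory illustration.)

```ruby
# Plain-Ruby sketch of a counter cache: maintain the count on every
# write so reads never have to count (class names are illustrative).
class User
  attr_reader :followers_count

  def initialize
    @followers = []
    @followers_count = 0
  end

  def add_follower(follower)
    @followers << follower
    @followers_count += 1   # Rails' counter_cache does this in SQL
  end

  def remove_follower(follower)
    @followers_count -= 1 if @followers.delete(follower)
  end
end

user = User.new
3.times { |i| user.add_follower("fan#{i}") }
user.followers_count  # => 3, no counting needed at read time
```

The trade-off is exactly as described above: each write does a little extra work so that every read is O(1).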

How can I optimise this method in Ruby using preload, includes, or eager_load?

I want to reduce allocations and speed up a Ruby worker. I've been reading about eager loading, but I don't fully understand it yet. Here's the method:
def perform(study_id, timestamp)
  study = Study.includes(:questions, :participants).find(study_id)
  questions = study.questions.not_random.not_paused
  participants = study.participants
  return unless questions && participants

  end_timestamp = timestamp_window(timestamp)
  participants.each do |participant|
    process_participant(participant, questions, timestamp, end_timestamp, study)
  end
end
I was hoping that Study.includes() would reduce the number of database queries, but looking at Skylight, it doesn't seem to have changed anything.
Am I using includes incorrectly, or should I be using something else?
The example you've given doesn't seem to benefit much from eager loading. Its purpose is to avoid N+1 queries; something like this:
User.first(100).each do |user|
  comments = user.comments
end
This will make 1 query for the 100 users, and 100 queries for the comments. That's why it's called N+1 (N being 100 here).
To prevent this from happening, you'd use eager loading:
User.includes(:comments).first(100).each do |user|
  comments = user.comments
end
Now it makes two queries: one for the users and one for all their comments. Making 2 queries instead of 1 isn't a problem. Part of optimization (big O) is finding bottlenecks at the right scale. I'm not going to explain all that, but this is a good tutorial: https://samurails.com/interview/big-o-notation-complexity-ruby/
Without eager loading, the number of queries is O(N): it grows linearly with N. With eager loading, you can increase N without adding queries; the query count stays at two, which is O(1): constant.
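That difference in query counts can be simulated in plain Ruby, with arrays of hashes standing in for tables and a log standing in for the queries actually issued (all names here are illustrative):

```ruby
# Toy simulation of N+1 vs eager loading: count the "queries" each
# strategy would issue against an in-memory stand-in database.
USERS = (1..100).map { |id| { id: id } }
COMMENTS = USERS.map { |u| { user_id: u[:id], body: "hi" } }

# N+1: one query for the users, then one per user for their comments
def lazy_comments(log)
  log << "SELECT * FROM users"
  USERS.map do |u|
    log << "SELECT * FROM comments WHERE user_id = #{u[:id]}"
    COMMENTS.select { |c| c[:user_id] == u[:id] }
  end
end

# Eager: one query for users, one for all comments, grouped in Ruby
def eager_comments(log)
  log << "SELECT * FROM users"
  log << "SELECT * FROM comments WHERE user_id IN (...)"
  by_user = COMMENTS.group_by { |c| c[:user_id] }
  USERS.map { |u| by_user[u[:id]] || [] }
end

lazy_log = []
lazy_comments(lazy_log)
lazy_log.size   # => 101 queries for 100 users

eager_log = []
eager_comments(eager_log)
eager_log.size  # => 2 queries, no matter how many users
```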
In your case, you have a method that makes three queries:
Study (find one)
associated questions
associated participants
An easy way to determine if you should use eager loading is to check your code for any SQL fetching that happens inside a loop. That's not happening here, so the eager loading won't do much. For example, it'd be good to use includes if you were instead fetching associated data for a list of studies.
It might technically be possible to write a SQL query that fetches all three tables' data in a single request, but I don't think ActiveRecord has anything to do it for you. It's probably unnecessary, though. If you're not convinced, you can try writing that SQL yourself and measure the performance gain.

Pre-sorting associations in controller: How and is there a performance gain?

I have a view which loops through #regions. For each region its countries are displayed.
<% region.countries.each do |country| %>
A new requirement is to sort the countries by some column, which I have a scope for.
<% region.countries.order_alphabetically.each do |country| %>
However, I've heard that putting logic in views can severely impact performance. Is that true for this case? Is it possible to pre-sort this in the controller?
P.S. I don't want to use default_scope because I need to sort it differently in other views.
EDIT: changed title to better reflect my question
Whether it's faster likely depends on how many records are in your table and whether that column is indexed. You can experiment with passing this load off onto the database by doing:
region.countries.order(:column_name)
This should be faster in most cases than loading all the records into Ruby and sorting.
What would be undesirable would be if, in 3 different places in your view you put
region.countries.order(:column_name)
This would hit the database 3 times. Some would also argue that you're making the view do too much. You could address both concerns by doing
#sorted_countries = region.countries.order(:column_name)
This keeps the specifics of how to order out of the view, and by reusing the same relation, Active Record will cache the sorted records between uses.
If you're only using the sorted countries in one place then there shouldn't be any difference, although splitting it out like this makes it a little easier to write a spec that tests that the countries are sorted, and makes it less likely you'll accidentally end up in the performance pitfall detailed above.
Sorry for the late answer. I wanted to see some evidence, so I finally made time to sit down and write a benchmarking comparison of the two: sorting in view and controller demo.
In the page there are many regions, and each region has many countries. The page displays all of them, sorting countries by name for each region. Run rake test:benchmark and the results are saved in the tmp/performance folder. The results of the two are the same, at around 0.0035s per page render.
So in summary, calling a sorting scope in view VS in controller makes no difference in performance.

Endless scroll pagination in ruby on rails with one query per page?

The problem with your typical rails pagination gem is that it does 2 queries: one for the page you're on and one for the total count. When you don't care about how many pages there are (e.g. in an endless scroll), that 2nd query is unnecessary (just add 1 to your LIMIT clause in the 1st query and you know if there are more or not).
Is there a gem that'll do pagination without the 2nd query? The 2nd query is expensive when applying non-indexed filters in my WHERE clause on large datasets and indexing all my various filters is unacceptable because I need my inserts to be fast.
Thanks!
Figured it out. When using the will_paginate gem, you can supply your own total_entries option to AR::Base.paginate. This makes it so the 2nd query doesn't run.
This works for sufficiently large datasets where you only care about recent entries.
This isn't necessarily acceptable if you actually expect to hit the end of your list: if the list size is divisible by per_page, you'll query an empty set on your last request. With endless scroll, this is fine. With a manual "load more" button, you'll display "load more" at the very end even when there are no more items to load.
The standard approach, as you've identified, is to fetch N+1 records when you need N and if you get more than N records in the response, there is at least one additional page of results you can display.
The only reason you'd want to do an explicit COUNT(*) call is if you need to know specifically how many more records you will need to fetch. On some engines this can take a good chunk of time to compute so it is best avoided especially if the value is never directly used.
Since this is so simple, you really don't need a plugin for it. Plugins like will_paginate are more concerned with the number of pages available, which is why they do the count operation.
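The "fetch one extra row" trick described above can be sketched in plain Ruby, with an array standing in for an ordered, filtered table (the function name is illustrative):

```ruby
# Ask for per_page + 1 rows; the extra row is used only to detect
# whether another page exists, so no COUNT query is ever needed.
def fetch_page(records, page, per_page)
  offset = (page - 1) * per_page
  rows = records[offset, per_page + 1] || []    # LIMIT per_page + 1 OFFSET offset
  [rows.first(per_page), rows.size > per_page]  # [items, more_pages?]
end

items = (1..23).to_a
page1, more = fetch_page(items, 1, 10)  # 10 items, more == true
last,  more = fetch_page(items, 3, 10)  # 3 items, more == false
```

With a real database, the same idea is just `limit(per_page + 1).offset(offset)` on the relation.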

Does Ruby on Rails "has_many" array provide data on a "need to know" basis?

On Ruby on Rails, say, if the Actor model object is Tom Hanks, and the "has_many" fans is 20,000 Fan objects, then
actor.fans
gives an Array with 20,000 elements. Presumably the elements are not pre-populated with values? Otherwise, fetching all 20,000 Fan objects from the DB could be extremely time consuming.
So it is on a "need to know" basis?
So does it pull data when I access actor.fans[500], and again when I access actor.fans[0]? If it jumps from record to record, it can't optimize performance with sequential reads, which can be faster on a hard disk because those records could be in nearby sectors or platter layers. For example, if the program touches 2 random elements, it is faster to read just those 2 records; but if it touches all elements in random order, it may be faster to read all the records sequentially and then process the random elements. But how will RoR know whether I am touching only a few random elements or all of them?
Why would you want to fetch 20,000 records if you only use 2 of them? Then fetch only those two from the DB. If you want to list the fans, you will probably use pagination - i.e. use limit and offset in your query, or a pagination gem like will_paginate.
I see no logical explanation for why you should go the way you're trying to. Explain a real situation so we can help you.
However, there is one thing you need to know while loading many associated objects from the DB - use :include like
Actor.all(:include => :fans)
This will eager-load all the fans, so there will only be 2 queries instead of N+1, where N is the number of actors.
Look at the SQL which is spewed out by the server in development mode, and that will tell you how many fan records are being loaded. In this case actor.fans will indeed cause them all to be loaded, which is probably not what you want.
You have several options:
Use a paginator as suggested by Tadas;
Set up another association with the fans table that pulls in just the ones you're interested in. This can be done with a :conditions option on the has_many statement, e.g.
has_many :fans, :conditions => "country_of_residence = 'UK'"
Specifying the full SQL to narrow down the rows returned with the :finder_sql option;
Specifying the :limit option which will, well, limit the number of rows returned.
All depends on what you want to do.
