Most Efficient Way to Get Counts of Users with Certain Attributes in Ruby - ruby-on-rails

I have a collection of users with various statuses: active, disabled, or deleted (as an enum). I want a count of users with each status as well as a count of the total number of users. What is the most efficient way for me to do that?
I've read the questions on size vs. length vs. count in Ruby and that makes me think I should load all of the user records and then iterate over the collection multiple times to get the length of each status array.
This is what my code looks like currently:
# pagination code omitted...
all_users = User.all
total_count = all_users.length
active_count = all_users.select {|u| u.status == User.statuses['active']}.length
disabled_count = all_users.select {|u| u.status == User.statuses['disabled']}.length
deleted_count = all_users.select {|u| u.status == User.statuses['deleted']}.length
The requests from the client take about 1.25-1.5 seconds as written for 1,000 users.
I've also tried making multiple DB queries with code like this:
# pagination code omitted...
total_count = User.count
active_count = User.where(status: User.statuses['active']).count
disabled_count = User.where(status: User.statuses['disabled']).count
deleted_count = User.where(status: User.statuses['deleted']).count
That might be marginally faster by ~100ms. Is there a faster way to do this?
I'm not sure if it is relevant, but for background info: I am using Rails as an API in this context to an AngularJS frontend. I am using Kaminari to paginate the collection, but I still need counts of each status. I am in a B2B environment so it is unlikely that any instance will have more than 1,000 users. I don't need to scale higher than that.
Thanks in advance!

Do it all at once, in the database by grouping your count query.
User.group(:status).count
Then to get the total number of users just sum the result. Here's an example from one of my tables. Here I'm grouping on a boolean field, but you can group on whatever you want.
> Course.group(:is_enabled).count
=> {false=>46, true=>26524}
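One thing to watch for: a grouped count only lists statuses that actually have rows, so a status with zero users is simply missing from the hash. Here is a minimal sketch, in plain Ruby, of deriving every per-status count plus the total from that single result. The `grouped` hash is made-up sample data standing in for the return value of User.group(:status).count, and its keys may be strings or integers depending on your Rails version and enum setup.

```ruby
# Made-up stand-in for the result of User.group(:status).count.
grouped = { 'active' => 812, 'disabled' => 150 }

# The three statuses named in the question.
statuses = %w[active disabled deleted]

# Fill in 0 for any status the database returned no row for.
counts_by_status = statuses.map { |status| [status, grouped.fetch(status, 0)] }.to_h

# The total is the sum of the grouped counts, so no second query is needed.
total_count = grouped.values.sum
```

With this shape the client gets all four numbers from one query.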

Regarding "That might be marginally faster by ~100ms": create an index on the status column in your database:
# in your terminal
rails g migration AddIndexOnStatusOfUsers
# in db/migrate/xxxxx_add_index_on_status_of_users.rb
def change
add_index :users, :status
end

You should benchmark them all and let us know. Would be interesting. Pure SQL answers are always more scalable of course...
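# Load only the status column, then count in a single pass.
users = User.select(:status)
active_count = 0
disabled_count = 0
deleted_count = 0
users.each do |user|
  # Compare with ==; a single = would assign and always be truthy.
  if user.status == 'active'
    active_count += 1
  elsif user.status == 'deleted'
    deleted_count += 1
  else
    disabled_count += 1
  end
end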
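If you do end up counting in Ruby, a counting hash does the same single pass without the if/elsif chain. This is a sketch on a plain array; `statuses` is sample data standing in for something like User.pluck(:status), which is an assumption about how you would fetch the values.

```ruby
# Sample data standing in for the status column of every user.
statuses = %w[active deleted active disabled active]

# Hash.new(0) returns 0 for unseen keys, so += 1 works on first sight.
counts = statuses.each_with_object(Hash.new(0)) { |status, tally| tally[status] += 1 }

active_count   = counts['active']
disabled_count = counts['disabled']
deleted_count  = counts['deleted']
total_count    = statuses.length
```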

Related

Rails ActiveRecord find_each with DISTINCT select

I want to get a list of all unique emails I have in my database and process them in batch using find_each.
The code below works fine until it has more than 1000 records (the batch size) to process. Then it breaks after the 1000th record with the error message: Primary key not included in the custom select clause
Tourist.select('DISTINCT email').where("DATE(created_at) = ?", Date.today- 1).find_in_batches do |group|
something
end
So, how can I chain all this:
I only need a specific field (email)
I need them to be unique
I need a where clause
I need a find_each
You have to do it manually, with a loop that pages through the results using limit and offset:
batch_size = 1000
offset = 0
loop do
emails = Tourist.where("DATE(created_at) = ?", Date.today-1).select('DISTINCT email').limit(batch_size).offset(offset)
emails.each do |email|
# your stuff
end
break if emails.size < batch_size
offset += batch_size
end
Of course this is needed only if the request will retrieve a large number of emails. Otherwise simply use Tourist.where(condition).pluck('DISTINCT email').each { |email| your stuff }
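Offset paging like this only visits each email exactly once if the rows come back in the same order on every query, so you would probably want an explicit ordering (for example on email) in the real query; that is my assumption, not something the error message requires. The sketch below simulates the loop in plain Ruby. `fetch_distinct_emails` is a hypothetical stand-in for the DISTINCT query with ORDER BY, LIMIT and OFFSET.

```ruby
# Hypothetical stand-in for the SQL query: DISTINCT + ORDER BY + LIMIT/OFFSET.
def fetch_distinct_emails(rows, limit, offset)
  rows.uniq.sort.slice(offset, limit) || []
end

# Same loop shape as the answer: fetch a page, process it, stop on a short page.
def each_distinct_email(rows, batch_size)
  offset = 0
  loop do
    batch = fetch_distinct_emails(rows, batch_size, offset)
    batch.each { |email| yield email }
    break if batch.size < batch_size
    offset += batch_size
  end
end

# Sample data: 2,500 rows containing 2,300 distinct emails.
rows = (1..2500).map { |i| "user#{i % 2300}@example.com" }
seen = []
each_distinct_email(rows, 1000) { |email| seen << email }
```

Here the loop runs three times (1000, 1000, then a short page of 300) and every distinct email is yielded once.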

ActiveRecord query performance, performing a where after initial query has been executed

I have this query:
absences = Absence.joins(:user).where('users.company_id = ?', @company.id).where('"from" <= ? and "to" >= ?', self.date, self.date).group('user_id').select('user_id, sum(hours) as hours')
This will return user_id's with a total of hours.
Now I need to loop through all users of the company and do some calculations.
company.users.each do |user|
tc = TimeCheck.find_or_initialize_by(:user_id => user.id, :date => self.date)
tc.expected_hours = user.working_hours - absences.where('user_id = ?', user.id).first.hours
end
For performance reasons I want to have only one query to the absences table (the first one) and afterwards to look in memory for the correct user. How do I best accomplish this? I believe by default absences will be an ActiveRecord::Relation and not a result set. Is there a command I can use to instruct ActiveRecord to execute the query, and afterwards search in memory?
Or do I need to store absences as array or hash first?
One optimization you could make is to search the already-loaded results in memory instead of issuing a new query for every user.
Change:
absences.where('user_id = ?', user.id).first.hours
to:
absences.detect { |absence| absence.user_id == user.id }.hours
Also, you might not need to loop through company.users. Depending on the business requirements, you may be able to loop through absences instead.
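Going one step further, you could build a hash keyed by user_id once and look each user up in it, which also gives you a natural place to handle users with no absences (detect returns nil for them, and calling hours on nil raises). A sketch with plain hashes standing in for the ActiveRecord rows; the sample values and the 0-hour default are assumptions for illustration.

```ruby
# Plain-Ruby stand-ins for the grouped absences rows and the company's users.
absences = [{ user_id: 1, hours: 8 }, { user_id: 3, hours: 4 }]
users    = [{ id: 1, working_hours: 40 }, { id: 2, working_hours: 40 }, { id: 3, working_hours: 32 }]

# Build the lookup once; users without absences fall back to 0 hours.
hours_by_user = Hash.new(0)
absences.each { |absence| hours_by_user[absence[:user_id]] = absence[:hours] }

# One hash lookup per user instead of one query (or one array scan) per user.
expected_hours = users.map do |user|
  [user[:id], user[:working_hours] - hours_by_user[user[:id]]]
end.to_h
```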

find_each with order and limit

I need to limit and order batches of records and am using find_each. I've seen a lot of people asking for this and no really good solution. If I've missed it, please post a link!
I have 30M records and want to deal with 10M with the highest value in the weight column.
I tried using this method someone wrote: find_each_with_order but can't get it to work.
The code from that site doesn't take order as an option. Seems strange given that the name is find_each_with_order. I added it as follows:
class ActiveRecord::Base
# normal find_each does not use given order but uses id asc
def self.find_each_with_order(options={})
raise "offset is not yet supported" if options[:offset]
page = 1
limit = options[:limit] || 1000
order = options[:order] || 'id asc'
loop do
offset = (page-1) * limit
batch = find(:all, options.merge(:limit=>limit, :offset=>offset, :order=>order))
page += 1
batch.each{|x| yield x }
break if batch.size < limit
end
end
end
and I'm trying to use it as follows:
class GetStuff
def self.grab_em
file = File.open("1000 things.txt", "w")
rels = Thing.find_each_with_order({:limit=>100, :order=>"weight desc"})
binding.pry
things.each do |t|
binding.pry
file.write("#{t.name} #{t.id} #{t.weight}\n" )
if t.id % 20 == 0
puts t.id.to_s
end
end
file.close
end
end
BTW I have the data in postgres and am going to grab a subset and move it to neo4j, so I'm tagging with neo4j in case any of you neo4j people know how to do this. thanks.
Not exactly sure if this is what you're looking for, but you can do something like this:
weight = Thing.order(:weight).select(:weight).last(10_000_000).first.weight
Thing.where("weight > ?", weight).find_each do |t|
...your code...
end
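The idea is to find the weight of the Nth heaviest record and then filter on it, which works with find_each because it is just a where condition. Two things to keep in mind: a strict > drops the cutoff record itself, and >= can return more than N records when several share the cutoff weight. A small in-memory sketch of both cases (sample weights are made up):

```ruby
# Made-up weights; we want the top n = 2.
weights = [5, 9, 1, 7, 7, 3]
n = 2

# Weight of the n-th heaviest item (max(n) returns the n largest values).
cutoff = weights.max(n).min

# Strict comparison excludes everything at the cutoff weight.
strictly_above = weights.select { |w| w > cutoff }

# Inclusive comparison keeps ties, so it may return more than n items.
at_or_above = weights.select { |w| w >= cutoff }
```

In the database, fetching only the cutoff value (for example with an ordered query using offset and limit) would presumably be lighter than loading 10M rows just to read one weight, but that is an assumption to check against your data.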

Rails: how and where to add this method

I have an app where I retrieve a list of users from a specific country.
I did this in the UsersController:
@fromcanada = User.find(:all, :conditions => { :country => 'canada' })
and then turned it into a scope on the User model
scope :canada, where(:country => 'Canada').order('created_at DESC')
but I also want to be able to retrieve a random person or multiple persons from the country. I found this method that's supposed to be an efficient way to retrieve a random user from the database.
module ActiveRecord
class Base
def self.random
if (c = count) != 0
find(:first, :offset =>rand(c))
end
end
end
end
However, I have a few questions about how to add it, and how the syntax works.
1. Where would I put that code? Directly in the User model?
2. Syntax: so that I don't use code that I don't understand, can you explain how the syntax is working? I don't get (c = count). What is count counting? What is rand(c) doing? Is it finding the first one starting at the offset? If rand is an expensive method (hence the need to create a different more efficient random method), why use the expensive 'rand' in this new more efficient random method?
3. How could I add the call to random on my find method in the UsersController? How to add it to the scope in the model?
4. Building on question 3, is there a way to get two or three random users?
I wouldn't monkey patch that (or anything else!) into ActiveRecord; putting it into your User model would make more sense.
The count is counting how many elements there are in your table and storing that number in c. Then rand(c) gives you a random integer in the interval [0,c) (i.e. 0 <= rand(c) < c). The :offset works the way you think it does.
rand isn't terribly expensive but doing order by random() inside the database can be very expensive. The random method that you're looking at is just a convenient way to get a random record/object from the database.
Adding it to your own User would look something like this:
def self.random
n = scoped.count
scoped.offset(rand(n)).first
end
That would allow you to chain random after a bunch of scopes:
u = User.canadians_eh.some_other_scope.random
but the result of random would be a single user so your chaining would stop there.
If you wanted multiple users you'd want to call random multiple times until you got the number of users you wanted. You could try this:
def self.random
n = scoped.count
scoped.offset(rand(n))
end
us = User.canadians_eh.random.limit(3)
to get three random users but the users would be clustered together in whatever order the database ended up with after your other scopes and that's probably not what you're after. If you want three you'd be better off with something like this:
# In User...
def self.random
n = scoped.count
scoped.offset(rand(n)).first
end
# Somewhere else...
scopes = User.canadians_eh.some_other_scope
users = 3.times.each_with_object([]) do |_, users|
users << scopes.random
scopes = scopes.where('id != :latest', :latest => users.last.id)
end
You'd just grab a random user, update your scope chain to exclude them, and repeat until you're done. You would, of course, want to make sure you had three users first.
You might want to move the ordering out of your canada scope: one scope, one task.
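The "pick one, exclude it, repeat" idea can be tried out in memory. This is a sketch on a plain array, not ActiveRecord code; `pick_random` is a hypothetical helper name, and for arrays Ruby's built-in Array#sample(n) already does the same job.

```ruby
# Pick up to how_many distinct elements: choose a random offset, take that
# element, remove it from the pool, and repeat.
def pick_random(pool, how_many, rng = Random.new)
  remaining = pool.dup
  picks = []
  # Never ask for more elements than the pool holds.
  [how_many, remaining.size].min.times do
    picks << remaining.delete_at(rng.rand(remaining.size))
  end
  picks
end

user_ids = (1..10).to_a
picked = pick_random(user_ids, 3)
```

Capping the loop at the pool size covers the "make sure you had three users first" case.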
1. That code injects a new method into ActiveRecord::Base. I would put it in lib/ext/activerecord/base.rb, but you can put it anywhere you want.
2. count is a method called on self. self will be some class inheriting from ActiveRecord::Base, e.g. User. User.count returns the number of user records (SQL: SELECT count(*) FROM users;). rand is Ruby's built-in Kernel#rand. rand(c) returns a random integer in the range 0...c, where c was previously computed by calling count. rand is not expensive.
3. You don't call random with find; User.random is itself a find. It returns one random record from all User records. In your controller you say User.random and it returns a single random record (or nil if there are no user records at all).
4. Modify the AR::Base::random method like so:
module ActiveRecord
class Base
def self.random( how_many = 1 )
if (c = count) != 0
res = (0...how_many).inject([]) do |m,i|
m << find(:first, :offset =>rand(c))
end
how_many == 1 ? res.first : res
end
end
end
end
User.random(3) # => [<User Rand1>,<User Rand2>,<User Rand3>]
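Two details matter when collecting a fixed number of records this way: an inclusive range 0..n has n + 1 elements while the exclusive 0...n has n, and rand(c) with a positive integer always lands in 0...c, so it is always a valid offset. Independent draws can also repeat, so the same record may come back twice. A quick plain-Ruby check (the values 3 and 5 are arbitrary):

```ruby
how_many = 3

# Inclusive range: one element more than how_many.
inclusive = (0..how_many).to_a

# Exclusive range: exactly how_many elements.
exclusive = (0...how_many).to_a

# rand(c) with a positive integer returns an integer from 0 up to c - 1.
c = 5
offsets = Array.new(1000) { rand(c) }
```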

Is it possible to override the way count works, or is there a better way altogether to do this?

I have this scope in my artist model that gives me the artists, in the order of their popularity within a certain time period. popularity in the popularity_caches table is computed every day.
scope :by_popularity, lambda { |*args|
options = (default_popularity_options).merge(args[0] || {})
select("SUM(popularity) AS popularity, artists.*").
from("popularity_caches FORCE INDEX (popularity_cache_group), artists FORCE INDEX (index_artists_on_id_and_genre_id)").
where("popularity_caches.target_type = 'Artist'").
where("popularity_caches.target_id = artists.id").
where("popularity_caches.time_frame = ?", options[:time_frame]).
where("popularity_caches.started_on > ?", options[:started_on]).
where("popularity_caches.started_on < ?", options[:ended_on]).
group("artists.id").
order("popularity DESC")
}
This seems to work except when I want to get the count: Artist.by_popularity.count. I get a funky hash in return (probably the count of artists that have popularity_caches within that period):
#<OrderedHash {295954=>1, 20143=>1, 157532=>1, 181291=>1, 300086=>1, 50100=>1, 262898=>1, 293888=>1, 130158=>2, 279943=>1, 336758=>1, 100201=>1, 134290=>2, 22726=>3, 144620=>2, 62497=>2 # snip
This is the SQL I probably want in return:
SELECT COUNT(DISTINCT(artists.id)) AS count_all
FROM popularity_caches FORCE INDEX (popularity_cache_group), artists FORCE INDEX (index_artists_on_id_and_genre_id)
WHERE (popularity_caches.target_type = 'Artist')
AND (popularity_caches.target_id = artists.id)
AND (popularity_caches.time_frame = 'week')
AND (popularity_caches.started_on > '2011-02-28 16:00:00')
AND (popularity_caches.started_on < '2011-10-05')
ORDER BY popularity DESC
To get the count, I had to make a separate method that does pretty much the same thing, except the SQL is formed differently. It kind of sucks though, because when I want to paginate, I have to pass two things:
@artists = Artist.by_popularity(some args).paginate(
:total_entries => Artist.count_by_popularity(pass in the same args here as in Artist.by_popularity),
:per_page => 5,
:page => ...
)
That smells to me because it's very brittle.
Is there a way to do this in ARel? Maybe override how it counts things (distinct artists.id) and removing the group by so it doesn't return a hash for the count?
Thanks!
Solved with the amazing scuttle.io:
PopularityCache.select(
  Arel::Nodes::Group.new(Artist.arel_table[:id]).count.as('count_all')
).where(
  PopularityCache.arel_table[:target_type].eq('Artist').and(
    PopularityCache.arel_table[:target_id].eq(Artist.arel_table[:id]).and(
      PopularityCache.arel_table[:time_frame].eq('week').and(
        PopularityCache.arel_table[:started_on].gt('2011-02-28 16:00:00').and(
          PopularityCache.arel_table[:started_on].lt('2011-10-05')
        )
      )
    )
  )
).order(:popularity).reverse_order
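For what it's worth, the hash that the grouped count returns already contains the number a paginator needs: each key is one artist, so the size of the hash is the number of distinct artists. A plain-Ruby sketch using a few entries copied from the hash in the question. In Rails this would correspond to something like Artist.by_popularity.count.size, which still runs the grouped query and builds the whole hash, so treat it as an option to benchmark rather than a recommendation.

```ruby
# A few entries from the grouped count shown in the question
# (artist id => number of matching popularity_caches rows).
grouped_count = { 295954 => 1, 20143 => 1, 130158 => 2, 22726 => 3 }

# One key per artist, so the hash size is the distinct artist count.
distinct_artists = grouped_count.size

# Summing the values gives the number of matching cache rows instead.
cache_rows = grouped_count.values.sum
```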

Resources