I need to limit and order batches of records and am using find_each. I've seen a lot of people asking for this and no really good solution. If I've missed it, please post a link!
I have 30M records and want to process the 10M with the highest values in the weight column.
I tried using this method someone wrote: find_each_with_order but can't get it to work.
The code from that site doesn't take order as an option. Seems strange given that the name is find_each_with_order. I added it as follows:
class ActiveRecord::Base
  # normal find_each does not use given order but uses id asc
  def self.find_each_with_order(options={})
    raise "offset is not yet supported" if options[:offset]
    page = 1
    limit = options[:limit] || 1000
    order = options[:order] || 'id asc'
    loop do
      offset = (page-1) * limit
      batch = find(:all, options.merge(:limit=>limit, :offset=>offset, :order=>order))
      page += 1
      batch.each{|x| yield x }
      break if batch.size < limit
    end
  end
end
and I'm trying to use it as follows:
class GetStuff
  def self.grab_em
    file = File.open("1000 things.txt", "w")
    rels = Thing.find_each_with_order({:limit=>100, :order=>"weight desc"})
    binding.pry
    things.each do |t|
      binding.pry
      file.write("#{t.name} #{t.id} #{t.weight}\n" )
      if t.id % 20 == 0
        puts t.id.to_s
      end
    end
    file.close
  end
end
BTW I have the data in postgres and am going to grab a subset and move it to neo4j, so I'm tagging with neo4j in case any of you neo4j people know how to do this. thanks.
Not exactly sure if this is what you're looking for, but you can do something like this:
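# weight of the 10,000,000th heaviest record, fetched without loading 10M rows into memory
cutoff = Thing.order("weight DESC").offset(9_999_999).limit(1).pluck(:weight).first
# >= keeps the cutoff record itself; ties at the cutoff may return slightly more than 10M rows
Thing.where("weight >= ?", cutoff).find_each do |t|
  ...your code...
end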
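Two details in the question's find_each_with_order are worth noting: :limit is reused as the page size, so the loop walks the whole table instead of stopping at a total, and the method yields records rather than returning them, so its result cannot be assigned and iterated afterwards. A minimal sketch of keeping the total limit and the batch size separate, using a plain in-memory array as a stand-in for the table (Row, each_heaviest and both keyword arguments are hypothetical names, not ActiveRecord API):

```ruby
# Row is a plain Struct standing in for a database record (an assumption for this sketch).
Row = Struct.new(:id, :weight)

# Yields the `limit` heaviest rows, walking them in batches of `batch_size`.
# `limit` caps the total number of rows; `batch_size` only controls chunking.
def each_heaviest(rows, limit:, batch_size:)
  # Stands in for ORDER BY weight DESC, id ASC (id breaks ties deterministically).
  sorted = rows.sort_by { |row| [-row.weight, row.id] }
  # Stands in for fetching one page at a time until `limit` rows were seen.
  sorted.first(limit).each_slice(batch_size) do |batch|
    batch.each { |row| yield row }
  end
end

rows = (1..10).map { |i| Row.new(i, i * 10) }
each_heaviest(rows, limit: 4, batch_size: 3) do |row|
  puts "#{row.id} #{row.weight}"
end
```

With a block-based method like this, the per-record work (writing to the file, for example) belongs inside the block, not in a later loop over the return value.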
I have a collection of users with various statuses: active, disabled, or deleted (as an enum). I want a count of users with each status as well as a count of the total number of users. What is the most efficient way for me to do that?
I've read the questions on size vs. length vs. count in Ruby and that makes me think I should load all of the user records and then iterate over the collection multiple times to get the length of each status array.
This is what my code looks like currently:
# pagination code omitted...
all_users = User.all
total_count = all_users.length
active_count = all_users.select {|u| u.status == User.statuses['active']}.length
disabled_count = all_users.select {|u| u.status == User.statuses['disabled']}.length
deleted_count = all_users.select {|u| u.status == User.statuses['deleted']}.length
The requests from the client take about 1.25-1.5 seconds as written for 1,000 users.
I've also tried making multiple DB queries with code like this:
# pagination code omitted...
total_count = User.count
active_count = User.where(status: User.statuses['active']).count
disabled_count = User.where(status: User.statuses['disabled']).count
deleted_count = User.where(status: User.statuses['deleted']).count
That might be marginally faster by ~100ms. Is there a faster way to do this?
I'm not sure if it is relevant, but for background info: I am using Rails as an API in this context to an AngularJS frontend. I am using Kaminari to paginate the collection, but I still need counts of each status. I am in a B2B environment so it is unlikely that any instance will have more than 1,000 users. I don't need to scale higher than that.
Thanks in advance!
Do it all at once, in the database by grouping your count query.
User.group(:status).count
Then to get the total number of users just sum the result. Here's an example from one of my tables. Here I'm grouping on a boolean field, but you can group on whatever you want.
> Course.group(:is_enabled).count
=> {false=>46, true=>26524}
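As a sketch of the summing step: the grouped count comes back as a Hash, so the per-status numbers and the total can both be read from that single result. The literal hash below is made-up sample data standing in for User.group(:status).count, and status_totals is a hypothetical helper; depending on the Rails version, the keys for an enum column may be the stored integers rather than the status names.

```ruby
# Builds a hash with one entry per expected status (0 when the status has no rows)
# plus a "total" entry that sums every group.
def status_totals(grouped_counts, statuses)
  totals = statuses.each_with_object({}) do |status, result|
    result[status] = grouped_counts.fetch(status, 0)
  end
  totals["total"] = grouped_counts.values.sum
  totals
end

# Sample data only; a real app would pass User.group(:status).count here.
grouped = { "active" => 812, "deleted" => 45 }
puts status_totals(grouped, %w[active disabled deleted]).inspect
```

Statuses with no rows are absent from a grouped count, which is why the sketch uses fetch with a default of 0.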
Regarding "That might be marginally faster by ~100ms": create an index on the status column in your database:
# in your terminal
rails g migration AddIndexOnStatusOfUsers
# in db/migrate/xxxxx_add_index_on_status_of_users.rb
def change
  add_index :users, :status
end
You should benchmark them all and let us know. Would be interesting. Pure SQL answers are always more scalable of course...
# load only the status column
users = User.select(:status)
active_count = 0
disabled_count = 0
deleted_count = 0
users.each do |u|
  # compare with ==; a single = would assign and always be truthy
  if u.status == 'active'
    active_count += 1
  elsif u.status == 'deleted'
    deleted_count += 1
  else
    disabled_count += 1
  end
end
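If the statuses are already loaded in Ruby, the counting loop above can be written as a tally. A small sketch, assuming the statuses come back as plain strings (count_statuses is a hypothetical helper, and the sample data is made up):

```ruby
# Counts how often each status occurs; missing statuses read as 0
# because the hash is created with a default value of 0.
def count_statuses(statuses)
  statuses.each_with_object(Hash.new(0)) do |status, counts|
    counts[status] += 1
  end
end

counts = count_statuses(%w[active deleted active disabled active])
puts counts["active"]
puts counts["deleted"]
```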
In my Rails 4 app I need to find all plans that have either an interval of month OR an amount of 0.
This doesn't work:
class Plan < ActiveRecord::Base
  def self.by_interval(interval)
    where("interval = ? OR amount = ?", interval, 0)
  end
end
I am getting this error:
Mysql2::Error: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '= 'month' OR amount = 0) ORDER BY amount DESC' at line 1: SELECT `plans`.* FROM `plans` WHERE (interval = 'month' OR amount = 0) ORDER BY amount DESC
What else might work?
Thanks for any help.
'interval' in mysql is a reserved word (http://dev.mysql.com/doc/refman/5.6/en/reserved-words.html).
Try it like this:
def self.by_interval(interval)
  where("`interval` = ? OR amount = ?", interval, 0)
end
Note the backticks around "interval" (not quotes).
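To illustrate what the backticks do, here is a minimal sketch of MySQL-style identifier quoting in plain Ruby. quote_mysql_identifier and interval_or_free_condition are hypothetical helper names; in a Rails app the connection's quote_column_name method serves this purpose.

```ruby
# Wraps an identifier in backticks so MySQL treats it as a column name,
# even when it collides with a reserved word. Embedded backticks are doubled.
def quote_mysql_identifier(name)
  "`#{name.to_s.gsub('`', '``')}`"
end

# Builds the condition string; the values stay as ? placeholders so the
# database adapter can quote them.
def interval_or_free_condition
  "#{quote_mysql_identifier(:interval)} = ? OR amount = ?"
end

puts interval_or_free_condition
```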
Since you want an inclusive or rather than an exclusive one, I would do it in two requests:
class Plan < ActiveRecord::Base
  def self.by_interval(interval)
    # Array#| joins the two result sets and drops duplicates
    where(interval: interval).to_a | where(amount: 0).to_a
  end
end
Both queries are loaded as arrays of results, and the second array is merged into the first without duplicating plans that match both conditions. I do realize this is two separate requests, so it might not be as optimized as you'd like.
I believe using Rails ActiveRecord caching may be a way to save on a performance hit. I don't know if it's done automatically for you in this case, or if you should load the full table request before the queries are performed.
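Just interpolate the arguments directly into the string, as long as the value is quoted and the reserved column name is wrapped in backticks:
def self.by_interval(interval)
  where("`interval` = #{connection.quote(interval)} OR amount = 0")
end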
I am trying to build an array that looks like this via a model method:
[['3/25/13', 2], ['3/26/13', 1], ['3/27/13', 2]]
where the dates are strings and the numbers after them are the counts of a table/object.
I have the following model method right now:
def self.weekly_count_array
  counts = count(group: "date(#{table_name}.created_at)", conditions: { created_at: 1.month.ago.to_date..Date.today }, order: "date(#{table_name}.created_at) DESC")
  (1.week.ago.to_date).upto(Date.today) do |x|
    counts[x.to_s] ||= 0
  end
  counts.sort
end
However, it doesn't return the count accurately (all values are zero). There seem to be some similar questions on SO that I've checked out, but can't seem to get them to work either.
Can someone help (1) let me know if this is the best way to do it, and (2) provide some guidance in terms of what the problem might be with the above code, if so? Thanks!
Use this as a template if you wish
def self.period_count_array(from = (Date.today-1.month).beginning_of_day,to = Date.today.end_of_day)
  where(created_at: from..to).group('date(created_at)').count
end
This will return you a hash with dates as key and the count as value. (Rails 3.2.x)
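To get from that hash to the array of [date string, count] pairs asked for in the question, the missing days still have to be filled with zeros. A minimal sketch in plain Ruby, assuming the hash keys are Date objects (some database adapters may return strings instead, in which case the keys would need converting first); daily_count_pairs is a hypothetical helper and the sample counts are made up:

```ruby
require "date"

# Returns one [formatted date, count] pair per day in the range,
# using 0 for days that are absent from the grouped counts.
def daily_count_pairs(counts_by_date, from, to)
  (from..to).map do |day|
    # %-m and %-d drop the leading zero, giving "3/25/13" style strings.
    [day.strftime("%-m/%-d/%y"), counts_by_date.fetch(day, 0)]
  end
end

counts = { Date.new(2013, 3, 25) => 2, Date.new(2013, 3, 27) => 2 }
pairs = daily_count_pairs(counts, Date.new(2013, 3, 25), Date.new(2013, 3, 27))
pairs.each { |date, count| puts "#{date}: #{count}" }
```

Looking up by the same key type that the query returns matters here: a Date key and its string form are different hash keys.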
Maybe this is what you are trying to do?
class YourActiveRecordModel < ActiveRecord::Base
  def self.weekly_count_array
    records = self.select("COUNT(id) AS record_count, DATE(created_at) AS created")
      .group("DATE(created_at)")
      .where("created_at >= ?", 1.month.ago.to_date)
      .where("created_at <= ?", Date.current)
    records.each do |x|
      puts x.record_count
      puts x.created # 2013-03-14
      # use I18n.localize(x.created, format: :your_format)
      # where :your_format is defined in config/locales/en.yml (or other .yml)
    end
  end
end
Fantastic answer by @Aditya Sanghi.
If that is your exact requirement, you can opt for:
def self.weekly_count_array
  records = select('DATE(created_at) created_at, count(id) as id').group('created_at')
  1.week.ago.to_date.upto(Date.today).map do |d|
    [d, records.where('DATE(created_at) = ?', d.to_date).first.try(:id) || 0]
  end
end
You do not need a process to perform the count. Simply perform a query for this.
def self.weekly_count_array
  # chain the calls so they build one query; group by the date, not the full timestamp
  select("DATE(created_at) AS created_at, COUNT(created_at) AS count").
    where(created_at: 1.month.ago.to_date..Date.today).
    group("DATE(created_at)").
    order("DATE(created_at) DESC")
end
Building on @kiddorails's answer: to avoid making a lot of requests to the database, I create a Hash from the ActiveRecord result, and I changed the group from .group('created_at') to .group('DATE(created_at)') so that it groups by date.
def self.weekly_count_array
  # records = select('DATE(created_at) created_at, count(id) as id').group('created_at')
  records_hash = Hash[Download.select('DATE(created_at) created_at, count(id) as id').group('DATE(created_at)').map{|d|[d.created_at, d.id] }]
  1.month.ago.to_date.upto(Date.today).map do |d|
    [ d, records_hash[d.to_date] || 0 ]
  end
end
I have this scope in my artist model that gives me the artists, in the order of their popularity within a certain time period. popularity in the popularity_caches table is computed every day.
scope :by_popularity, lambda { |*args|
  options = (default_popularity_options).merge(args[0] || {})
  select("SUM(popularity) AS popularity, artists.*").
  from("popularity_caches FORCE INDEX (popularity_cache_group), artists FORCE INDEX (index_artists_on_id_and_genre_id)").
  where("popularity_caches.target_type = 'Artist'").
  where("popularity_caches.target_id = artists.id").
  where("popularity_caches.time_frame = ?", options[:time_frame]).
  where("popularity_caches.started_on > ?", options[:started_on]).
  where("popularity_caches.started_on < ?", options[:ended_on]).
  group("artists.id").
  order("popularity DESC")
}
This seems to work except when I want to get the count: Artist.by_popularity.count. I get a funky hash in return (probably the count of artists that have popularity_caches within that period):
#<OrderedHash {295954=>1, 20143=>1, 157532=>1, 181291=>1, 300086=>1, 50100=>1, 262898=>1, 293888=>1, 130158=>2, 279943=>1, 336758=>1, 100201=>1, 134290=>2, 22726=>3, 144620=>2, 62497=>2 # snip
This is the SQL I probably want in return:
SELECT COUNT(DISTINCT(artists.id)) AS count_all
FROM popularity_caches FORCE INDEX (popularity_cache_group), artists FORCE INDEX (index_artists_on_id_and_genre_id)
WHERE (popularity_caches.target_type = 'Artist')
AND (popularity_caches.target_id = artists.id)
AND (popularity_caches.time_frame = 'week')
AND (popularity_caches.started_on > '2011-02-28 16:00:00')
AND (popularity_caches.started_on < '2011-10-05')
ORDER BY popularity DESC
To get the count, I had to make a separate method that pretty much does the same thing, except the SQL is formed differently. It kind of sucks though, because when I want to paginate, I have to pass two things:
@artists = Artist.by_popularity(some args).paginate(
  :total_entries => Artist.count_by_popularity(pass in the same args here as in Artist.by_popularity),
  :per_page => 5,
  :page => ...
)
That smells to me because it's very brittle.
Is there a way to do this in ARel? Maybe override how it counts things (distinct artists.id) and removing the group by so it doesn't return a hash for the count?
Thanks!
Solved with the amazing scuttle.io:
PopularityCache.select(
  Arel::Nodes::Group.new(Artist.arel_table[:id]).count.as('count_all')
).where(
  PopularityCache.arel_table[:target_type].eq('Artist').and(
    PopularityCache.arel_table[:target_id].eq(Artist.arel_table[:id]).and(
      PopularityCache.arel_table[:time_frame].eq('week').and(
        PopularityCache.arel_table[:started_on].gt('2011-02-28 16:00:00').and(
          PopularityCache.arel_table[:started_on].lt('2011-10-05')
        )
      )
    )
  )
).order(:popularity).reverse_order
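Independently of the Arel rewrite, the hash shown in the question already contains the number needed for pagination: calling count on a grouped relation returns one entry per group, so the number of keys is the number of distinct artists. A minimal sketch in plain Ruby (total_entries_from is a hypothetical helper; the sample hash reuses a few values from the output in the question):

```ruby
# Accepts either a plain count or the hash that a grouped count returns,
# and reduces both to a single number of entries.
def total_entries_from(count_result)
  # Check for Hash explicitly: Integer also responds to #size (its byte size).
  count_result.is_a?(Hash) ? count_result.size : count_result
end

grouped = { 295954 => 1, 130158 => 2, 22726 => 3 }
puts total_entries_from(grouped)
```

This still loads one row per artist to build the hash, so for large result sets a COUNT(DISTINCT ...) query like the one in the question is likely cheaper.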
I need to return exactly ten records for use in a view. I have a highly restrictive query I'd like to use, but I want a less restrictive query in place to fill in the results in case the first query doesn't yield ten results.
Just playing around for a few minutes, and this is what I came up with, but it doesn't work. I think it doesn't work because merge is meant for combining queries on different models, but I could be wrong.
class Article < ActiveRecord::Base
  ...
  def self.listed_articles
    Article.published.order('created_at DESC').limit(25).where('listed = ?', true)
  end
  def self.rescue_articles
    Article.published.order('created_at DESC').where('listed != ?', true).limit(10)
  end
  def self.current
    Article.rescue_articles.merge(Article.listed_articles).limit(10)
  end
  ...
end
Looking in console, this forces the restrictions in listed_articles on the query in rescue_articles, showing something like:
Article Load (0.2ms) SELECT `articles`.* FROM `articles` WHERE (published = 1) AND (listed = 1) AND (listed != 1) ORDER BY created_at DESC LIMIT 4
Article Load (0.2ms) SELECT `articles`.* FROM `articles` WHERE (published = 1) AND (listed = 1) AND (listed != 1) ORDER BY created_at DESC LIMIT 6 OFFSET 4
I'm sure there's some ridiculously easy method I'm missing in the documentation, but I haven't found it yet.
EDIT:
What I want to do is return all the articles where listed is true out of the twenty-five most recent articles. If that doesn't get me ten articles, I'd like to add enough articles from the most recent articles where listed is not true to get my full ten articles.
EDIT #2:
In other words, the merge method seems to string the queries together to make one long query instead of merging the results. I need the top ten results of the two queries (prioritizing listed articles), not one long query.
With your initial code, you can join the two arrays using + and then take the first 10 results:
def self.current
  (Article.listed_articles + Article.rescue_articles)[0..9]
end
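The same idea in plain Ruby, as a sketch: take the preferred results first, append the fallback results, drop duplicates and cut the list at the required size. fill_to is a hypothetical helper, and symbols stand in for Article records.

```ruby
# Returns at most `size` items, taking everything from `primary` first
# and topping up from `fallback` only when `primary` is too short.
def fill_to(primary, fallback, size)
  (primary + fallback).uniq.first(size)
end

listed  = [:a1, :a2, :a3]
unlisted = [:b1, :b2, :b3, :b4]
puts fill_to(listed, unlisted, 5).inspect
```

Because the preferred list comes first in the concatenation, its items always win a place before any fallback item does.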
I suppose a really dirty way of doing it would be:
def self.current
  oldest_accepted = Article.published.order('created_at DESC').limit(25).last
  # >= keeps the 25th most recent article itself in the candidate set
  Article.published.where(['created_at >= ?', oldest_accepted.created_at]).order('listed DESC').limit(10)
end
If you want an ActiveRecord::Relation object instead of an Array, you can use either the ActiveRecordUnion gem or a UnionScope module.
ActiveRecordUnion gem: install it with gem install active_record_union and use:
def self.current
Article.rescue_articles.union(Article.listed_articles).limit(10)
end
UnionScope module: create the module UnionScope (lib/active_record/union_scope.rb):
module ActiveRecord::UnionScope
  def self.included(base)
    base.send :extend, ClassMethods
  end
  module ClassMethods
    def union_scope(*scopes)
      id_column = "#{table_name}.id"
      if (sub_query = scopes.reject { |sc| sc.count == 0 }.map { |s| "(#{s.select(id_column).to_sql})" }.join(" UNION ")).present?
        where "#{id_column} IN (#{sub_query})"
      else
        none
      end
    end
  end
end
Then call it in your Article model.
class Article < ActiveRecord::Base
  include ActiveRecord::UnionScope
  ...
  def self.current
    union_scope(Article.rescue_articles, Article.listed_articles).limit(10)
  end
  ...
end
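Stripped of ActiveRecord, union_scope builds a single WHERE ... IN (...) condition from the SQL of each scope. A sketch of just that string-building step (union_id_condition is a hypothetical name, and the sub-queries are hand-written strings here rather than the output of to_sql):

```ruby
# Joins the given id sub-queries with UNION and wraps them in an IN condition.
# Returns nil when there is nothing to union.
def union_id_condition(id_column, sub_queries)
  return nil if sub_queries.empty?

  unioned = sub_queries.map { |sql| "(#{sql})" }.join(" UNION ")
  "#{id_column} IN (#{unioned})"
end

puts union_id_condition(
  "articles.id",
  [
    "SELECT articles.id FROM articles WHERE listed = 1",
    "SELECT articles.id FROM articles WHERE listed != 1"
  ]
)
```

Since the result is one WHERE condition, any order and limit are applied to the combined set, so this approach does not by itself put listed articles ahead of the others.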
All you need to do is sum the queries:
result1 = Model.where(condition)
result2 = Model.where(another_condition)
# your final result
result = result1 + result2
I think you can do all of this in one query:
Article.published.order('listed DESC, created_at DESC').limit(10)
In MySQL a true value is stored as 1, so listed DESC puts the listed items first, sorted by created_at DESC, followed by the non-listed items.