Rails Optimize query and loop through large entity - ruby-on-rails

I have a method that outputs the following hash format for charting.
# Monthly (Jan - Dec)
{
"john": [1,2,3,4,5,6,7,8,9,10,11,12],
"mike": [1,2,3,4,5,6,7,8,9,10,11,12],
"rick": [1,2,3,4,5,6,7,8,9,10,11,12]
}
# the indices represents the month
# e.g [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
# Index
# 0 = Jan
# 1 = Feb
# 2 = Mar
...
The following method loops through all the store invoices within the given year for each sales rep name and generates the above outcome:
def chart_data
hash = Hash.new {|h,k| h[k] = [] }
(1..12).each do |month|
date_range = "1/#{month}/#{date.year}".to_date.all_month
all_reps.each do |name|
hash[name] << store.bw_invoices.where(sales_rep_name: name,
purchase_date: date_range).sum(:subtotal).to_f
end
end
return hash
end
When I run this method it takes 4 to 5 seconds to execute. I really need to optimize this query. I came up with two solutions that I think would help, but I would love to get some of your expertise:
move it to background job
perform a SQL query to optimize(I need help with this if this is optimal)
Thank you so much for your time

Yes, you've found a problem that is very hard to solve efficiently without letting the database do the hard work.
Assuming your dataset is potentially too large to load a whole year of raw rows into Ruby objects, a single PostgreSQL query is probably the best approach:
More SQL approach
def chart_data
result = Hash.new {|h,k| h[k] = [] }
total_lines = store.bw_invoices.select("sales_rep_name, to_char(purchase_date, 'mm') as month, sum(subtotal) as total")
.where(purchase_date: Date.today.all_year)
.group("sales_rep_name, to_char(purchase_date, 'mm')")
total_lines.each do |total_line|
result[total_line.sales_rep_name][total_line.month.to_i - 1] = total_line.total.to_f
end
result
end
Note that this solution will leave nil rather than 0 for months where a rep had no sales. And if their last month with sales was June then there will only be 6 items in the array.
We can avoid this either with more complex SQL left joining from a virtual table or by filling in the array gaps afterwards. However, depending on how you setup your charting this might make no practical difference anyway.
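Filling the gaps afterwards can be done in a few lines of plain Ruby. The following is only a sketch, assuming the hash shape produced by the method above; `fill_month_gaps` is a hypothetical helper name, not something Rails provides.

```ruby
# Hypothetical helper: pads each rep's array to `months` entries and
# replaces missing (nil) months with 0.0.
def fill_month_gaps(result, months = 12)
  result.transform_values do |totals|
    Array.new(months) { |i| totals[i] || 0.0 }
  end
end

# Illustrative data: a rep with sales in January and March only.
sparse = { "john" => [10.0, nil, 30.0] }
filled = fill_month_gaps(sparse)
```

Every array in `filled` then has exactly 12 numeric entries, which is usually what a charting library expects.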
More ruby approach
def chart_data
result = Hash.new {|h,k| h[k] = [] }
(1..12).each do |month|
date_range = "1/#{month}/#{Date.today.year}".to_date.all_month
rows = store.bw_invoices.select("sales_rep_name, SUM(subtotal) as total")
.where(purchase_date: date_range)
.group(:sales_rep_name)
all_reps.each do |rep_name|
row = rows.detect { |x| x.sales_rep_name == rep_name }
result[rep_name] << (row ? row.total : 0).to_f
end
end
result
end
This is more similar to your approach but takes the querying outside of the inner loop so we do 12 queries instead of 12 * number of reps. The detect used may become a little slow but only if there are thousands of reps. In which case you could sort both all_reps and the query output and implement your own kind of merge join but at that point you're getting into complexity you might as well let the database handle again.

Related

Generate array of daily avg values from db table (Rails)

Context:
Trying to generate an array with 1 element for each created_at day in a db table. Each element is the average of the points (integer) column from records with that created_at day.
This will later be graphed to display the avg number of points on each day.
Result:
I've been successful in doing this, but it feels like an unnecessary amount of code to generate the desired result.
Code:
def daily_avg
# get all data for current user
records = current_user.rounds
# make array of long dates
long_date_array = records.pluck(:created_at)
# create array to store short dates
short_date_array = []
# remove time of day
long_date_array.each do |date|
short_date_array << date.strftime('%Y%m%d')
end
# remove duplicate dates
short_date_array.uniq!
# array of avg by date
array_of_avg_values = []
# iterate through each day
short_date_array.each do |date|
temp_array = []
# make array of records with this day
records.each do |record|
if date === record.created_at.strftime('%Y%m%d')
temp_array << record.audio_points
end
end
# calc avg by day and append to array_of_avg_values
array_of_avg_values << temp_array.inject(0.0) { |sum, el| sum + el } / temp_array.size
end
render json: array_of_avg_values
end
Question:
I think this is a common extraction problem needing to be solved by lots of applications, so I'm wondering if there's a known repeatable pattern for solving something like this?
Or a more optimal way to solve this?
(I'm barely a junior developer so any advice you can share would be appreciated!)
Yes, that's a lot of unnecessary stuff when you can just go down to SQL to do it (I'm assuming you have a class called Round in your app):
class Round
DAILY_AVERAGE_SELECT = "SELECT
DATE(rounds.created_at) AS day_date,
AVG(rounds.audio_points) AS audio_points
FROM rounds
WHERE rounds.user_id = ?
GROUP BY DATE(rounds.created_at)
"
def self.daily_average(user_id)
connection.select_all(sanitize_sql_array([DAILY_AVERAGE_SELECT, user_id]), "daily-average")
end
end
Doing this straight into the database will be faster (and also include less code) than doing it in ruby as you're doing now.
I advise you to do something like this:
grouped =
records.order(:created_at).group_by do |r|
r.created_at.strftime('%Y%m%d')
end
First, this issues a query ordered by created_at, which is close to what you want as a first approximation; the resulting records are then grouped in Ruby by created_at reduced to just the date.
points =
  grouped.map do |(date, values)|
    [ date, values.sum(&:audio_points).to_f / values.size ]
  end.to_h
# => { "19700101" => 155.0, ... }
Then you remap the grouped hash through an array of pairs, computing the average of audio_points for each date.
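The grouping and averaging steps described here can be tried outside Rails. A minimal sketch, assuming records that respond to `created_at` and `audio_points`; `RoundRow` is a stand-in Struct and the sample values are invented for illustration.

```ruby
# RoundRow is a hypothetical stand-in for the ActiveRecord model.
RoundRow = Struct.new(:created_at, :audio_points)

# Illustrative sample data: two rounds on one day, one on the next.
records = [
  RoundRow.new(Time.utc(2015, 7, 29, 9), 10),
  RoundRow.new(Time.utc(2015, 7, 29, 18), 20),
  RoundRow.new(Time.utc(2015, 7, 30, 12), 7)
]

# Group by calendar day, then average the points within each group.
daily_avg = records
  .group_by { |r| r.created_at.strftime('%Y%m%d') }
  .map { |day, rows| [day, rows.sum(&:audio_points).to_f / rows.size] }
  .to_h
```

This still loads every record into Ruby, so it suits small result sets; for large tables the SQL version does the same work inside the database.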
You can use group and calculations methods built in AR: http://guides.rubyonrails.org/active_record_querying.html#group
http://guides.rubyonrails.org/active_record_querying.html#calculations

Ruby/Rails how to iterate months over a DateTime range?

I am trying to build a graph from data in a Rails table: The amount of sold products per time-fragment.
Because the graph should be able to show the last hour(in 1-minute steps), the last day (in 1-hour steps), the last week (in 1-day steps), the last month (in 1-day steps), etc, I am trying to reduce the code duplication by iterating over a range of DateTime objects:
# To prevent code-duplication, iterate over different time ranges.
times = {
:hour=>{newer_than: 1.hour.ago, timestep: :minute},
:day=>{newer_than: 1.day.ago, timestep: :hour},
:week=>{newer_than: 1.week.ago, timestep: :day},
:month=>{newer_than: 1.month.ago, timestep: :day}
}
products = Product.all
# Create symbols `:beginning_of_minute`, `:beginning_of_hour`, etc. These are used to group products and timestamps by.
times.each do|name, t|
t[:beginning_of] = ("beginning_of_" << t[:timestep].to_s).to_sym
end
graphs = times.map do |name, t|
graphpoints = {}
seconds_in_a_day = 1.day.to_f
step_ratio = 1.send(t[:timestep]).ago.to_f / seconds_in_a_day
time_enum = 1.send(t[:timestep]).ago.to_datetime.step(DateTime.now, step_ratio)
time_enum.each do |timestep|
graphpoints[timestep.send(t[:beginning_of]).to_datetime] = []
end
# Load all products that are visible in this graph size
visible_products = products.select {|p| p.created_at >= t[:newer_than]}
# Group them per graph point
grouped_products = visible_products.group_by {|item| item.created_at.send(t[:beginning_of]).to_datetime}
graphpoints.merge!(grouped_products)
{
points: graphpoints,
labels: graphpoints.keys
}
end
This code works great for all time intervals that have a constant size (hour, day, week). For months, however, it uses a step_ratio of 30 days: 1.month / 1.day == 30. Obviously, the number of days in a month is not constant. In my script, this means a month might be 'skipped' and therefore go missing from the graph.
How can this problem be solved? How to iterate over months while keeping the different amount of days in the months in mind?
If you need to pick out the months across a large span of dates, build a range between two Date objects and keep only the first day of each month:
(1.year.ago.to_date..DateTime.now.to_date).select{|date| date.day==1}.each do |date|
p date
end
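Walking every day of the range just to find the first of each month works, but the range can also be stepped a month at a time. A sketch using only the standard `date` library; `each_month` is a hypothetical helper name.

```ruby
require 'date'

# Hypothetical helper: yields the first day of every month touched by
# the range from..to, regardless of how many days each month has.
def each_month(from, to)
  return enum_for(:each_month, from, to) unless block_given?

  month = Date.new(from.year, from.month, 1)
  while month <= to
    yield month
    month = month.next_month
  end
end

months = each_month(Date.new(2015, 1, 15), Date.new(2015, 4, 2)).to_a
```

Because `Date#next_month` moves by calendar month rather than by a fixed 30 days, no month is skipped or visited twice.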
Use the groupdate gem. For example (modified from the docs):
visible_products = Product.where("created_at > ?", 1.week.ago).group_by_day(:created_at).count
# {
# 2015-07-29 00:00:00 UTC => 50,
# 2013-07-30 00:00:00 UTC => 100,
# 2013-08-02 00:00:00 UTC => 34
# }
Also, this will be much faster, because the grouping and counting are done by the database itself, without passing every record through the Product.all call to your Rails code and without creating an ActiveRecord object for each one (even the irrelevant ones).

How can I speed up my Ruby/Rake task, which counts occurrences of dates among 300K date strings?

I have an array of 300K strings which represent dates:
date_array = [
"2007-03-25 14:24:29",
"2007-03-25 14:27:00",
...
]
I need to count occurrences of each date in this array (e.g., all date strings for "2011-03-25"). The exact time doesn't matter -- just the date. I know the range of dates within the file. So I have:
Date.parse('2007-03-23').upto Date.parse('2011-10-06') do |date_to_count|
count = 0
date_array.each do |date_string|
if Date.parse(date_string) >= date_to_count &&
Date.parse(date_string) <= date_to_count
count += 1
end
end
puts "#{date_to_count} occurred #{count} times."
end
Counting occurrences of just one date takes longer than 60 seconds on my machine. In what ways can I optimize the performance of this task?
Possibly useful notes: I'm using Ruby 1.9.2. This script is running in a Rake task with rake 0.9.2. The date_array is loaded from a CSV file. On each iteration, the count is saved as a record in my Rails project database.
Yes, you don't need to parse the dates at all if they are formatted the same. Knowing your data is one of the most powerful tools you can have.
If the datetime strings are all in the same format (yyyy-mm-dd HH:MM:SS) then you could do something like
date_array.group_by{|datetime| datetime[0..9]}
This will give you a hash with the date strings as the keys and arrays of the matching datetime strings as values:
{
"2007-05-06" => [...],
"2007-05-07" => [...],
...
}
So you'd have to get the length of each array:
date_array.group_by{|datetime| datetime[0..9]}.each do |date_string, dates|
puts "#{date_string} occurred #{dates.length} times."
end
Of course, that method wastes memory by building arrays of dates that you don't need. So how about:
A more memory-efficient method
date_counts = {}
date_array.each do |date_string|
date = date_string[0..9]
date_counts[date] ||= 0 # initialize count if necessary
date_counts[date] += 1
end
You'll end up with a hash with the date strings as the keys and the counts as values
{
"2007-05-06" => 123,
"2007-05-07" => 456,
...
}
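The same count can be written a little more compactly with a hash whose default value is 0, which removes the explicit initialization step. A sketch, with sample strings invented for illustration:

```ruby
# Illustrative sample data in the same "yyyy-mm-dd HH:MM:SS" format.
date_array = [
  "2007-03-25 14:24:29",
  "2007-03-25 14:27:00",
  "2007-03-26 08:00:00"
]

# Hash.new(0) returns 0 for unseen keys, so += 1 works on first sight.
date_counts = date_array.each_with_object(Hash.new(0)) do |date_string, counts|
  counts[date_string[0..9]] += 1
end
```

Looking up a date that never occurred also returns 0 here, so no nil handling is needed when printing the full date range.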
Putting everything together
date_counts = {}
date_array.each do |date_string|
date = date_string[0..9]
date_counts[date] ||= 0 # initialize count if necessary
date_counts[date] += 1
end
Date.parse('2007-03-23').upto Date.parse('2011-10-06') do |date_to_count|
puts "#{date_to_count} occurred #{date_counts[date_to_count.to_s].to_i} times."
end
This is a really awful algorithm to use. You're scanning through the entire list for each date, and further, you're parsing the same date twice for no apparent reason. That means for N dates in the range and M dates in the list you're doing N*M*2 date parses.
What you really need is to use group_by and do it in one pass:
dates = date_array.group_by do |date_string|
Date.parse(date_string)
end
Then you can use this as a reference for your counts:
Date.parse('2007-03-23').upto Date.parse('2011-10-06') do |date_to_count|
puts "#{date_to_count} occurred #{dates[date_to_count] ? dates[date_to_count].length : 0} times."
end

Query sum speedup - Date series for charts

The following query runs fairly quickly, but the series processing that needs to take place afterwards is really slowing this method down. I could use some help in refactoring.
def self.sum_amount_chart_series(start_time)
orders_by_day = Widget.archived.not_void.
where(:print_datetime => start_time.beginning_of_day..Time.zone.now.end_of_day).
group(pg_print_date_group).
select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
# THIS IS WHAT IS SLOWING THE METHOD DOWN!
(start_time.to_date..Date.today).map do |date|
order = orders_by_day.detect { |order| order.print_date.to_date == date }
order && order.total_amount.to_f.round(2) || 0.0
end
end
def self.pg_print_date_group
"CAST((print_datetime + interval '#{tz_offset_hours} hours') AS date)"
end
I have benchmarked this method and the offending code is the series loop where it generates a series of dates and then maps out a new array with an amount for each date. This way I get a series back with amounts for every date, regardless if it has an amount or not.
When the query only returns a few dates, it runs fairly quickly. But set the start date back a year or two and it becomes impossibly slow. The real offender is the .detect method. It's very slow at scanning the array of activerecord objects.
Is there a faster method to generates this series?
orders_by_day is grouped by "pg_print_date_group" so it should be a hash of "date" to objects. so why don't you just do
(start_time.to_date..Date.today).map do |date|
order = orders_by_day[date.to_s(:db)]
order && order.total_amount.to_f.round(2) || 0.0
end
That should seriously reduce the Big O of your run. If I'm misunderstanding and orders_by_day isn't a hash, preprocess it into a hash first and then run the map; you definitely don't want to call detect for every date.
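Preprocessing into a hash could look like the sketch below. It assumes each row responds to `print_date` and `total_amount`, as in the question; `OrderRow` and `amount_series` are hypothetical names, and the sample rows are invented.

```ruby
require 'date'

# OrderRow is a hypothetical stand-in for the rows of the grouped query.
OrderRow = Struct.new(:print_date, :total_amount)

# Build the date => amount hash once, then do one O(1) lookup per day.
def amount_series(rows, start_date, end_date)
  totals = rows.each_with_object({}) do |row, by_date|
    by_date[row.print_date.to_date] = row.total_amount.to_f.round(2)
  end
  (start_date..end_date).map { |date| totals.fetch(date, 0.0) }
end

rows = [
  OrderRow.new(Date.new(2015, 1, 2), "12.5"),
  OrderRow.new(Date.new(2015, 1, 4), "5")
]
series = amount_series(rows, Date.new(2015, 1, 1), Date.new(2015, 1, 4))
```

The cost becomes one pass over the rows plus one pass over the dates, instead of a full scan of the rows for every date.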
Since the primary offender in your code is the detect method that has to scan the array again and again, I suggest that you invert the order in which you create the series, so that you only scan the array once, and your code runs in O(n) time.
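Try something along the lines of the following, which assumes orders_by_day is sorted by print_date (add an order clause to the query if it is not):
series = []
next_date = start_time.to_date
orders_by_day.each do |order|
  # Emit zeros for every day before this order's date
  while next_date < order.print_date.to_date
    series << 0.0
    next_date = next_date.next
  end
  series << order.total_amount.to_f.round(2)
  next_date = next_date.next
end
# Emit zeros for the remaining days up to and including today
while next_date <= Date.today
  series << 0.0
  next_date = next_date.next
end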
Please note that my code is untested ;)

Clean way to find ActiveRecord objects by id in the order specified

I want to obtain an array of ActiveRecord objects given an array of ids.
I assumed that
Object.find([5,2,3])
Would return an array with object 5, object 2, then object 3 in that order, but instead I get an array ordered as object 2, object 3 and then object 5.
The ActiveRecord Base find method API mentions that you shouldn't expect it in the order provided (other documentation doesn't give this warning).
One potential solution was given in Find by array of ids in the same order?, but the order option doesn't seem to be valid for SQLite.
I can write some ruby code to sort the objects myself (either somewhat simple and poorly scaling or better scaling and more complex), but is there A Better Way?
It's not that MySQL and other DBs sort things on their own, it's that they don't sort them. When you call Model.find([5, 2, 3]), the SQL generated is something like:
SELECT * FROM models WHERE models.id IN (5, 2, 3)
This doesn't specify an order, just the set of records you want returned. It turns out that generally MySQL will return the database rows in 'id' order, but there's no guarantee of this.
The only way to get the database to return records in a guaranteed order is to add an order clause. If your records will always be returned in a particular order, then you can add a sort column to the db and do Model.find([5, 2, 3], :order => 'sort_column'). If this isn't the case, you'll have to do the sorting in code:
ids = [5, 2, 3]
records = Model.find(ids)
sorted_records = ids.collect {|id| records.detect {|x| x.id == id}}
Based on my previous comment to Jeroen van Dijk you can do this more efficiently and in two lines using each_with_object
result_hash = Model.find(ids).each_with_object({}) {|result,result_hash| result_hash[result.id] = result }
ids.map {|id| result_hash[id]}
For reference here is the benchmark i used
ids = [5,3,1,4,11,13,10]
results = Model.find(ids)
Benchmark.measure do
100000.times do
result_hash = results.each_with_object({}) {|result,result_hash| result_hash[result.id] = result }
ids.map {|id| result_hash[id]}
end
end.real
#=> 4.45757484436035 seconds
Now the other one
ids = [5,3,1,4,11,13,10]
results = Model.find(ids)
Benchmark.measure do
100000.times do
ids.collect {|id| results.detect {|result| result.id == id}}
end
end.real
# => 6.10875988006592
Update
You can do this in most databases using ORDER BY with a CASE expression; here is a class method you could use.
def self.order_by_ids(ids)
order_by = ["case"]
ids.each_with_index.map do |id, index|
order_by << "WHEN id='#{id}' THEN #{index}"
end
order_by << "end"
order(order_by.join(" "))
end
# User.where(:id => [3,2,1]).order_by_ids([3,2,1]).map(&:id)
# #=> [3,2,1]
Apparently MySQL and some other database systems return rows in their own order. I think you can work around that (in MySQL, using its FIELD function) by doing:
ids = [5,2,3]
@things = Object.find( ids, :order => "field(id,#{ids.join(',')})" )
A portable solution would be to use an SQL CASE statement in your ORDER BY. You can use pretty much any expression in an ORDER BY and a CASE can be used as an inlined lookup table. For example, the SQL you're after would look like this:
select ...
order by
case id
when 5 then 0
when 2 then 1
when 3 then 2
end
That's pretty easy to generate with a bit of Ruby:
ids = [5, 2, 3]
order = 'case id ' + (0 ... ids.length).map { |i| "when #{ids[i]} then #{i}" }.join(' ') + ' end'
The above assumes that you're working with numbers or some other safe values in ids; if that's not the case then you'd want to use connection.quote or one of the ActiveRecord SQL sanitizer methods to properly quote your ids.
Then use the order string as your ordering condition:
Object.find(ids, :order => order)
or in the modern world:
Object.where(:id => ids).order(order)
This is a bit verbose but it should work the same with any SQL database and it isn't that difficult to hide the ugliness.
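One way to hide the ugliness is a small helper that builds the CASE expression and refuses anything that is not an integer. This is only a sketch: `order_clause_for` is a hypothetical name, and it assumes integer primary keys and a non-empty id list.

```ruby
# Hypothetical helper: builds "CASE id WHEN 5 THEN 0 ... END" for ORDER BY.
# Integer() raises ArgumentError or TypeError on anything that is not a
# clean integer, which keeps untrusted values out of the SQL string.
def order_clause_for(ids, column = "id")
  whens = ids.each_with_index.map do |id, index|
    "WHEN #{Integer(id)} THEN #{index}"
  end
  "CASE #{column} #{whens.join(' ')} END"
end

order = order_clause_for([5, 2, 3])
```

The resulting string can be passed to `order` in the same way as the hand-built one. For non-integer keys, quoting through the connection (as mentioned above) would be needed instead of `Integer()`.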
As I answered here, I just released a gem (order_as_specified) that allows you to do native SQL ordering like this:
Object.where(id: [5, 2, 3]).order_as_specified(id: [5, 2, 3])
Just tested and it works in SQLite.
Justin Weiss wrote a blog article about this problem just two days ago.
It seems to be a good approach to tell the database about the preferred order and load all records sorted in that order directly from the database. Example from his blog article:
# in config/initializers/find_by_ordered_ids.rb
module FindByOrderedIdsActiveRecordExtension
extend ActiveSupport::Concern
module ClassMethods
def find_ordered(ids)
order_clause = "CASE id "
ids.each_with_index do |id, index|
order_clause << "WHEN #{id} THEN #{index} "
end
order_clause << "ELSE #{ids.length} END"
where(id: ids).order(order_clause)
end
end
end
ActiveRecord::Base.include(FindByOrderedIdsActiveRecordExtension)
That allows you to write:
Object.find_ordered([2, 1, 3]) # => [2, 1, 3]
Here's a performant (hash-lookup, not O(n) array search as in detect!) one-liner, as a method:
def find_ordered(model, ids)
model.find(ids).map{|o| [o.id, o]}.to_h.values_at(*ids)
end
# We get:
ids = [3, 3, 2, 1, 3]
Model.find(ids).map(&:id) == [1, 2, 3]
find_ordered(Model, ids).map(&:id) == ids
Another (probably more efficient) way to do it in Ruby:
ids = [5, 2, 3]
records_by_id = Model.find(ids).inject({}) do |result, record|
result[record.id] = record
result
end
sorted_records = ids.map {|id| records_by_id[id] }
Here's the simplest thing I could come up with:
ids = [200, 107, 247, 189]
results = ModelObject.find(ids).group_by(&:id)
sorted_results = ids.map {|id| results[id].first }
@things = [5,2,3].map{|id| Object.find(id)}
This is probably the easiest way, assuming you don't have too many objects to find, since it requires a trip to the database for each id.
