The following query runs fairly quickly, but the series processing that needs to take place afterwards is really slowing this method down. I could use some help in refactoring.
def self.sum_amount_chart_series(start_time)
  orders_by_day = Widget.archived.not_void.
    where(:print_datetime => start_time.beginning_of_day..Time.zone.now.end_of_day).
    group(pg_print_date_group).
    select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")

  # THIS IS WHAT IS SLOWING THE METHOD DOWN!
  (start_time.to_date..Date.today).map do |date|
    order = orders_by_day.detect { |order| order.print_date.to_date == date }
    order && order.total_amount.to_f.round(2) || 0.0
  end
end

def self.pg_print_date_group
  "CAST((print_datetime + interval '#{tz_offset_hours} hours') AS date)"
end
I have benchmarked this method and the offending code is the series loop, where it generates a series of dates and then maps out a new array with an amount for each date. This way I get a series back with a value for every date, regardless of whether that date has an amount or not.
When the query only returns a few dates, it runs fairly quickly. But set the start date back a year or two and it becomes impossibly slow. The real offender is the .detect method; it's very slow at scanning the array of ActiveRecord objects.
Is there a faster way to generate this series?
orders_by_day is grouped by pg_print_date_group, so it should be a hash of date to objects. So why don't you just do:
(start_time.to_date..Date.today).map do |date|
  order = orders_by_day[date.to_s(:db)]
  order && order.total_amount.to_f.round(2) || 0.0
end
That should seriously reduce the complexity of your run. And if I'm misunderstanding and your orders_by_day isn't a hash, preprocess it into a hash and then run the map; you definitely don't want to call detect for every date.
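For completeness, a minimal sketch of that preprocessing step (untested; it relies only on the print_date and total_amount aliases already in the select, plus ActiveSupport's index_by):
# Build the lookup hash once (one pass), then every lookup inside the map is O(1).
orders_by_date = orders_by_day.index_by { |order| order.print_date.to_date }

(start_time.to_date..Date.today).map do |date|
  order = orders_by_date[date]
  order ? order.total_amount.to_f.round(2) : 0.0
end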
Since the primary offender in your code is the detect method that has to scan the array again and again, I suggest that you invert the order in which you create the series, so that you only scan the array once, and your code runs in O(n) time.
Try something along the lines of:
series = []
next_date = start_time.to_date

# assumes orders_by_day is sorted by print_date
orders_by_day.each do |order|
  # pad with zeroes for dates that have no orders
  while next_date < order.print_date.to_date
    series << 0.0
    next_date = next_date.next
  end
  series << order.total_amount.to_f.round(2)
  next_date = next_date.next
end

# pad the tail of the series up to and including today
while next_date <= Date.today
  series << 0.0
  next_date = next_date.next
end

series
Please note that my code is untested ;)
I have a method that outputs the following hash format for charting.
# Monthly (Jan - Dec)
{
  "john": [1,2,3,4,5,6,7,8,9,10,11,12],
  "mike": [1,2,3,4,5,6,7,8,9,10,11,12],
  "rick": [1,2,3,4,5,6,7,8,9,10,11,12]
}
# the indices represent the month
# e.g [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
# Index
# 0 = Jan
# 1 = Feb
# 2 = Mar
...
The following method loops through all the store invoices within a given year for each sales rep name and generates the outcome above:
def chart_data
  hash = Hash.new { |h, k| h[k] = [] }
  (1..12).each do |month|
    date_range = "1/#{month}/#{date.year}".to_date.all_month
    all_reps.each do |name|
      hash[name] << store.bw_invoices.where(sales_rep_name: name,
                                            purchase_date: date_range).sum(:subtotal).to_f
    end
  end
  return hash
end
When I run this method it takes 4-5 seconds to execute. I really need to optimize this query. I came up with two solutions that I think would help, but I would love to get some of your expertise:
Move it to a background job
Perform a SQL query instead (I need help with this, if that is the optimal route)
Thank you so much for your time
Yes, you've found a problem that is very hard to solve efficiently without letting the database do the hard work.
Assuming your dataset is potentially too large to load a whole year raw into Ruby objects, this approach using just one PostgreSQL query would probably be the best kind of idea:
More SQL approach
def chart_data
  result = Hash.new { |h, k| h[k] = [] }
  total_lines = store.bw_invoices.select("sales_rep_name, to_char(purchase_date, 'mm') as month, sum(subtotal) as total")
                                 .where(purchase_date: Date.today.all_year)
                                 .group("sales_rep_name, to_char(purchase_date, 'mm')")
  total_lines.each do |total_line|
    result[total_line.sales_rep_name][total_line.month.to_i - 1] = total_line.total.to_f
  end
  result
end
Note that this solution will leave nil rather than 0 for months where a rep had no sales. And if their last month with sales was June then there will only be 6 items in the array.
We can avoid this either with more complex SQL left joining from a virtual table or by filling in the array gaps afterwards. However, depending on how you set up your charting this might make no practical difference anyway.
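If you do want the gaps filled, a small post-processing sketch (assuming the result hash built by the method above) is probably enough:
# Pad every rep's series out to 12 entries and turn the nils into zeroes.
result.each_value do |monthly_totals|
  12.times { |i| monthly_totals[i] ||= 0.0 }
end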
More ruby approach
def chart_data
  result = Hash.new { |h, k| h[k] = [] }
  (1..12).each do |month|
    date_range = "1/#{month}/#{Date.today.year}".to_date.all_month
    rows = store.bw_invoices.select("sales_rep_name, SUM(subtotal) as total")
                            .where(purchase_date: date_range)
                            .group(:sales_rep_name)
    all_reps.each do |rep_name|
      row = rows.detect { |x| x.sales_rep_name == rep_name }
      result[rep_name] << (row ? row.total : 0).to_f
    end
  end
  result
end
This is more similar to your approach, but takes the querying outside of the inner loop, so we do 12 queries instead of 12 * the number of reps. The detect may become a little slow, but only if there are thousands of reps. In that case you could sort both all_reps and the query output and implement your own kind of merge join, but at that point you're getting into complexity you might as well let the database handle again.
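If the detect ever does become a bottleneck, here is a sketch of the hash-lookup variant (my addition, reusing the rows relation from the method above):
# Index each month's rows by rep name once, so the inner loop costs O(1) per rep.
totals_by_rep = rows.index_by(&:sales_rep_name)

all_reps.each do |rep_name|
  row = totals_by_rep[rep_name]
  result[rep_name] << (row ? row.total : 0).to_f
end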
Context:
Trying to generate an array with one element for each created_at day in a db table. Each element is the average of the points (integer) column from records with that created_at day.
This will later be graphed to display the avg number of points on each day.
Result:
I've been successful in doing this, but it feels like an unnecessary amount of code to generate the desired result.
Code:
def daily_avg
  # get all data for current user
  records = current_user.rounds
  # make array of long dates
  long_date_array = records.pluck(:created_at)
  # create array to store short dates
  short_date_array = []
  # remove time of day
  long_date_array.each do |date|
    short_date_array << date.strftime('%Y%m%d')
  end
  # remove duplicate dates
  short_date_array.uniq!
  # array of avg by date
  array_of_avg_values = []
  # iterate through each day
  short_date_array.each do |date|
    temp_array = []
    # make array of records with this day
    records.each do |record|
      if date === record.created_at.strftime('%Y%m%d')
        temp_array << record.audio_points
      end
    end
    # calc avg by day and append to array_of_avg_values
    array_of_avg_values << temp_array.inject(0.0) { |sum, el| sum + el } / temp_array.size
  end
  render json: array_of_avg_values
end
Question:
I think this is a common extraction problem needing to be solved by lots of applications, so I'm wondering if there's a known repeatable pattern for solving something like this?
Or a more optimal way to solve this?
(I'm barely a junior developer so any advice you can share would be appreciated!)
Yes, that's a lot of unnecessary stuff when you can just go down to SQL to do it (I'm assuming you have a class called Round in your app):
class Round
  DAILY_AVERAGE_SELECT = "SELECT
      DATE(rounds.created_at) AS day_date,
      AVG(rounds.audio_points) AS audio_points
    FROM rounds
    WHERE rounds.user_id = ?
    GROUP BY DATE(rounds.created_at)"

  def self.daily_average(user_id)
    connection.select_all(sanitize_sql_array([DAILY_AVERAGE_SELECT, user_id]), "daily-average")
  end
end
Doing this straight into the database will be faster (and also include less code) than doing it in ruby as you're doing now.
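A hedged usage sketch (the column name follows the alias in the SQL above, and current_user is assumed from the question's controller):
# select_all returns an ActiveRecord::Result; each row is a hash keyed by column alias.
rows = Round.daily_average(current_user.id)
averages = rows.map { |row| row["audio_points"].to_f }
render json: averages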
I advise you to do something like this:
grouped =
  records.order(:created_at).group_by do |r|
    r.created_at.strftime('%Y%m%d')
  end
First you generate a query that is a reasonable first approximation of what you want, then you group the resulting records by the created_at field converted to just a date string.
points =
  grouped.map do |(date, values)|
    [date, values.reduce(0.0) { |sum, r| sum + r.audio_points } / values.size]
  end.to_h
# => { "19700101" => 155.0, ... }
Then you remap the grouped hash via map, calculating the average audio_points value for each date.
You can use the group and calculation methods built into ActiveRecord:
http://guides.rubyonrails.org/active_record_querying.html#group
http://guides.rubyonrails.org/active_record_querying.html#calculations
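For instance, something along these lines pushes the whole calculation into a single query (a sketch, untested; the exact shape of the keys depends on your database adapter):
# One query: group rounds by calendar day and average audio_points in SQL.
current_user.rounds.group("DATE(created_at)").average(:audio_points)
# => a hash of day => average, e.g. { "2016-01-01" => 155.0, ... }
# values come back as BigDecimal, so call .to_f on them if you need floats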
There are some stackoverflow posts related to my question but not all that similar.
I would like an efficient and somewhat elegant (if possible) solution to get an array of missing dates after comparing a user-specified date range to the summary table in PostgreSQL. One method I know of is to lay the range out into a list of dates and then compare each one against the table by querying EXISTS or checking if the result is nil?/empty?, etc. But if the user were to pick a large range, this could be resource-consuming and slow.
Is there any methods beside the ones that are currently listed?
Thank you
First, we need to sort the dates. In ruby this is as simple as
sorted_dates = dates.sort
If you know the dates are sorted, then just start with the first date and increment by one as you iterate through your date range. If the next date in your array is not the date you expected, add the missing date to your missing_dates array, and continue incrementing until you reach a date that is actually present.
This code might look like the following:
require 'set'

def find_missing_dates(sorted_dates)
  current_date = sorted_dates[0]
  missing_dates = Set.new
  sorted_dates.each do |date|
    # record every date we expected but did not see
    while current_date != date
      missing_dates << current_date
      current_date += 1.day
    end
    current_date += 1.day
  end
  missing_dates
end
This is O(N) in the average case, so to make it more efficient when the gaps are sparse, we could split in half and recurse.
def dates_between(lower, upper)
  (lower..upper).to_a - [lower, upper]
end

def find_missing_dates(sorted_dates, missing_dates = Set.new)
  min_date = sorted_dates[0]
  max_date = sorted_dates[-1]
  if (max_date - min_date).to_i == (sorted_dates.count - 1)
    missing_dates
  else
    middle_date_lower = sorted_dates[sorted_dates.count / 2 - 1]
    middle_date_upper = sorted_dates[sorted_dates.count / 2]
    unless (middle_date_upper - middle_date_lower) == 1
      missing_dates.merge(dates_between(middle_date_lower, middle_date_upper))
    end
    find_missing_dates(sorted_dates[0..(sorted_dates.count / 2 - 1)], missing_dates).merge(
      find_missing_dates(sorted_dates[(sorted_dates.count / 2)..-1])
    )
  end
end
find_missing_dates(sorted_dates)
This still touches every element in the worst case (and the repeated array slicing adds overhead), but when the gaps are sparse the early exits bring the average case down to roughly O(log N).
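A tiny illustrative call (dates made up for the example; either version of find_missing_dates should return the same set):
sorted_dates = [Date.new(2020, 1, 1), Date.new(2020, 1, 2), Date.new(2020, 1, 5)]
find_missing_dates(sorted_dates).to_a
# => [2020-01-03, 2020-01-04]  (as Date objects)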
For example, I have:
range = start.to_date..(end.to_date + 1.day)
end and start are dates.
How do I create a month array based on this range?
Example:
I have the dates 23/1/2012 and 15/3/2012
The months are January, February and March.
I want to get an array like ["1/1/2012", "1/2/2012", "1/3/2012"]
and if the range was between 25/6/2012 and 10/10/2012
the array would be: ["1/6/2012", "1/7/2012", "1/8/2012", "1/9/2012", "1/10/2012"]
require 'date'
date_from = Date.parse('2011-10-14')
date_to = Date.parse('2012-04-30')
date_range = date_from..date_to
date_months = date_range.map {|d| Date.new(d.year, d.month, 1) }.uniq
date_months.map {|d| d.strftime "%d/%m/%Y" }
# => ["01/10/2011", "01/11/2011", "01/12/2011", "01/01/2012",
# "01/02/2012", "01/03/2012", "01/04/2012"]
Rails ActiveSupport core extensions include a method for Date: beginning_of_month. Your function could be written as follows:
def beginning_of_month_date_list(start, finish)
  (start.to_date..finish.to_date).map(&:beginning_of_month).uniq.map(&:to_s)
end
Caveats: this could be written more efficiently, assumes start and finish are in the expected order, but otherwise should give you the months you're looking for. You could also rewrite to pass a format symbol to the #to_s method to get the expected month format.
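For example, to match the "1/1/2012" style from the question you could swap the final map for a strftime call (a sketch; the %-d and %-m no-padding flags work on most platforms but are not universal):
def beginning_of_month_date_list(start, finish)
  (start.to_date..finish.to_date).map(&:beginning_of_month).uniq.map { |d| d.strftime('%-d/%-m/%Y') }
end

beginning_of_month_date_list(Date.new(2012, 1, 23), Date.new(2012, 3, 15))
# => ["1/1/2012", "1/2/2012", "1/3/2012"]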
I was curious about performance here so I tested some variations. Here's a solution better optimized for performance (about 8x faster in my benchmark than the accepted solution). By incrementing by a month at a time we can remove the call to uniq which cuts quite a bit of time.
start_date = 1.year.ago.to_date
end_date = Date.today

dates = []
date = start_date.beginning_of_month
while date <= end_date.beginning_of_month
  dates << date.to_date.to_s
  date += 1.month
end
dates
#=> ["2019-02-01", "2019-03-01", "2019-04-01", "2019-05-01", "2019-06-01", "2019-07-01", "2019-08-01", "2019-09-01", "2019-10-01", "2019-11-01", "2019-12-01", "2020-01-01", "2020-02-01"]
Benchmark Results:
Comparison:
month increment loop: 17788.3 i/s
accepted solution: 2140.1 i/s - 8.31x slower
gist of the benchmark code
Similar to one of the solutions above using beginning_of_month, but taking less space (by using a Set) and neater thanks to inject.
(start_month..end_month).inject(Set.new) { |s, i| s << i.beginning_of_month; s }.to_a
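For example, with the question's first range (this assumes ActiveSupport for beginning_of_month and require 'set'):
require 'set'

start_month = Date.new(2012, 1, 23)
end_month   = Date.new(2012, 3, 15)
(start_month..end_month).inject(Set.new) { |s, i| s << i.beginning_of_month; s }.to_a
# => [2012-01-01, 2012-02-01, 2012-03-01]  (as Date objects)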
I have an array of 300K strings which represent dates:
date_array = [
  "2007-03-25 14:24:29",
  "2007-03-25 14:27:00",
  ...
]
I need to count occurrences of each date in this array (e.g., all date strings for "2011-03-25"). The exact time doesn't matter -- just the date. I know the range of dates within the file. So I have:
Date.parse('2007-03-23').upto Date.parse('2011-10-06') do |date_to_count|
  count = 0
  date_array.each do |date_string|
    if Date.parse(date_string) >= date_to_count &&
       Date.parse(date_string) <= date_to_count
      count += 1
    end
  end
  puts "#{date_to_count} occurred #{count} times."
end
Counting occurrences of just one date takes longer than 60 seconds on my machine. In what ways can I optimize the performance of this task?
Possibly useful notes: I'm using Ruby 1.9.2. This script is running in a Rake task with rake 0.9.2. The date_array is loaded from a CSV file. On each iteration, the count is saved as a record in my Rails project database.
Yes, you don't need to parse the dates at all if they are formatted the same. Knowing your data is one of the most powerful tools you can have.
If the datetime strings are all in the same format (yyyy-mm-dd HH:MM:SS) then you could do something like
date_array.group_by { |datetime| datetime[0..9] }
This will give you a hash like the following, with the date strings as keys and arrays of the matching datetime strings as values:
{
  "2007-05-06" => [...],
  "2007-05-07" => [...],
  ...
}
So you'd have to get the length of each array
date_array.group_by { |datetime| datetime[0..9] }.each do |date_string, dates_for_day|
  puts "#{date_string} occurred #{dates_for_day.length} times."
end
Of course that method wastes memory by building arrays of datetimes you don't actually need.
So how about
A more memory-efficient method
date_counts = {}
date_array.each do |date_string|
  date = date_string[0..9]
  date_counts[date] ||= 0 # initialize count if necessary
  date_counts[date] += 1
end
You'll end up with a hash with the date strings as the keys and the counts as values
{
  "2007-05-06" => 123,
  "2007-05-07" => 456,
  ...
}
Putting everything together
date_counts = {}
date_array.each do |date_string|
  date = date_string[0..9]
  date_counts[date] ||= 0 # initialize count if necessary
  date_counts[date] += 1
end
Date.parse('2007-03-23').upto Date.parse('2011-10-06') do |date_to_count|
  puts "#{date_to_count} occurred #{date_counts[date_to_count.to_s].to_i} times."
end
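An equivalent counting sketch using a default-value hash, if you prefer less bookkeeping (same result as the ||= 0 version above):
date_counts = Hash.new(0) # unknown keys start at 0
date_array.each { |date_string| date_counts[date_string[0..9]] += 1 }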
This is a really awful algorithm to use. You're scanning through the entire list for each date, and further, you're parsing the same date twice for no apparent reason. That means for N dates in the range and M dates in the list you're doing N*M*2 date parses.
What you really need is to use group_by and do it in one pass:
dates = date_array.group_by do |date_string|
  Date.parse(date_string)
end
Then you can use this as a reference for your counts:
Date.parse('2007-03-23').upto Date.parse('2011-10-06') do |date_to_count|
  puts "#{date_to_count} occurred #{dates[date_to_count] ? dates[date_to_count].length : 0} times."
end
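If you only want the counts rather than the grouped records, a short follow-up sketch (transform_values needs Ruby 2.4+ or ActiveSupport, so treat it as optional on 1.9.2):
counts = dates.transform_values(&:length)
counts[Date.parse('2007-03-25')] # => number of entries on that day, or nil if there were none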