Rails: how to calculate the average of a small set of elements

I've looked at the built-in ActiveRecord::Calculations average method to learn how to find an average in Rails. I've also looked online for ideas on how to calculate averages, e.g. Rails calculate and display average.
But I cannot find any reference on how to calculate the average of a subset of elements from a database column.
In the controller:
@jobpostings = Jobposting.all
@medical = @jobpostings.where("title like ? OR title like ?", "%MEDICAL SPECIALIST%", "%MEDICAL EXAMINER%").limit(4).order('max_salary DESC')
@medical_salary = "%.2f" % @medical.average(:max_salary).truncate(2)
@medical returns:
CITY MEDICAL SPECIALIST
220480.0
CITY MEDICAL EXAMINER (OCME)
180000.0
CITY MEDICAL SPECIALIST
158080.0
CITY MEDICAL SPECIALIST
130000.0
I want to find the average of :max_salary (each one is listed correctly under its job title). I use @medical_salary = "%.2f" % @medical.average(:max_salary).truncate(2) to convert the BigDecimal and compute the average of :max_salary from @medical, which I thought would be limited to the top 4 displayed above.
But the result returned is 72322.33, which is the average of the entire column (I checked), instead of the top 4.
Do I need to add another condition? Why does averaging @medical return the average of the entire column?
Any insight would help. Thanks.

@medical.average(:max_salary) is expanded as @jobpostings.where(...).average(:max_salary).limit(4), even though the limit(4) appears earlier in the method chain.
You can confirm this by checking the query that is run, which will be as follows:
SELECT AVG(`jobpostings`.`max_salary`) FROM `jobpostings` ... LIMIT 4
Effectively, LIMIT 4 doesn't do anything, because the above query returns only a single row containing the average.
One way to accomplish what you are trying to do will be:
@top_salaries = @jobpostings.where("title like ? OR title like ?", "%MEDICAL SPECIALIST%", "%MEDICAL EXAMINER%").limit(4).map(&:max_salary)
=> [220480.0, 180000.0, 158080.0, 130000.0]
average = @top_salaries.reduce(:+) / @top_salaries.size
=> 172140.0
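If you'd rather keep the averaging in SQL, a variation (a sketch assuming the question's models, not part of the original answer) is to pluck the top four IDs first and let the database compute AVG over just those rows:
top_ids = @jobpostings.where("title like ? OR title like ?", "%MEDICAL SPECIALIST%", "%MEDICAL EXAMINER%").order('max_salary DESC').limit(4).pluck(:id)
avg = Jobposting.where(:id => top_ids).average(:max_salary) || 0 # average returns nil when no rows match
@medical_salary = "%.2f" % avg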

@medical.average(:max_salary)
This line corresponds to the following MySQL query:
SELECT AVG(`jobpostings`.`max_salary`) AS avg_max_salary FROM `jobpostings` WHERE (`jobpostings`.`title` like "%MEDICAL SPECIALIST%" OR title like "%MEDICAL EXAMINER%") LIMIT 4
Since MySQL has already calculated AVG over the max_salary column, it returns only 1 row, and the LIMIT 4 clause doesn't actually come into play.
You can try following:
@limit = 4
@medical = @jobpostings.where("title like ? OR title like ?", "%MEDICAL SPECIALIST%", "%MEDICAL EXAMINER%").limit(@limit).order('max_salary DESC')
@medical_salary = "%.2f" % (@medical.pluck(:max_salary).reduce(0, :+) / @limit.to_f).truncate(2)
It will even work when the number of results is 0 (reduce(0, :+) starts the sum at 0, so the expression yields 0.00 instead of raising).
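If fewer than @limit rows can match, a small variant (my sketch, not from the original answer) divides by the actual number of rows instead of @limit:
salaries = @medical.pluck(:max_salary)
average = salaries.empty? ? 0.0 : salaries.reduce(0, :+) / salaries.size.to_f
@medical_salary = "%.2f" % average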

Related

Ruby Rails Average two attributes in a query returning multiple objects

I've got two attributes I'm trying to average, but it's only averaging the second field here. Is there a way to do this?
e = TiEntry.where('ext_trlid = ? AND mat_pidtc = ?', a.trlid, a.pidtc).average(:mat_mppss_rprcp && :mat_fppss_rprcp)
e = TiEntry.where('ext_trlid = ? AND mat_pidtc = ?', a.trlid, a.pidtc).select("AVG(mat_mppss_rprcp) AS avg1, AVG(mat_fppss_rprcp) AS avg2").map { |i| [i.avg1, i.avg2] }
Is this working for you? It works like the average method does, but you can support as many values as you want. (Note that :mat_mppss_rprcp && :mat_fppss_rprcp in the question simply evaluates to the second symbol, which is why only the second field was being averaged.)
The advantage of this over the other queries here is that it uses only one simple SQL query. The others fetch everything in your table with SQL (which can take some time if the table is big) and then compute the average in Ruby.
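For example (a usage sketch assuming the same column names), both averages can be read off the single returned record:
row = TiEntry.where('ext_trlid = ? AND mat_pidtc = ?', a.trlid, a.pidtc).select("AVG(mat_mppss_rprcp) AS avg1, AVG(mat_fppss_rprcp) AS avg2").first
avg_mppss = row.avg1 # aliased aggregates become reader methods;
avg_fppss = row.avg2 # depending on the adapter they may come back as strings, so call .to_f if needed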
I am sure that you have already looked at http://api.rubyonrails.org/classes/ActiveRecord/Calculations.html#method-i-average
But you can't get the average of two columns in a single average call.
What you can do, to avoid repeating your query, is:
entries = TiEntry.where('ext_trlid = ? AND mat_pidtc = ?', a.trlid, a.pidtc)
average_mppss = entries.average(:mat_mppss_rprcp)
average_fppss = entries.average(:mat_fppss_rprcp)
this way the query conditions are defined only once and reused for both averages (each average call still issues its own AVG query)
I hope that this works for you

ActiveRecord: Alternative to find_in_batches?

I have a query that loads thousands of objects and I want to tame it by using find_in_batches:
Car.includes(:member).where(:engine => "123").find_in_batches(batch_size: 500) ...
According to the docs, I can't have a custom sorting order: http://www.rubydoc.info/docs/rails/4.0.0/ActiveRecord/Batches:find_in_batches
However, I need a custom sort order of created_at DESC. Is there another method to run this query in chunks like it does in find_in_batches so that not so many objects live on the heap at once?
Hm, I've been thinking about a solution for this (I'm the person who asked the question). It makes sense that find_in_batches doesn't allow a custom order. Let's say you sort by created_at DESC and specify a batch_size of 500. The first loop goes from 1-500, the second loop goes from 501-1000, etc. What if, before the 2nd loop runs, someone inserts a new record into the table? It would be put at the top of the query results, everything would shift by one, and your 2nd loop would re-process a record it had already seen.
You could argue that created_at ASC would be safe then, but even that's not guaranteed if your app can set a created_at value explicitly.
UPDATE:
I wrote a gem for this problem: https://github.com/EdmundMai/batched_query
Since using it, the average memory of my application has HALVED. I highly suggest anyone having similar issues to check it out! And contribute if you want!
The slower, manual way to do this is something like the following:
count = Car.includes(:member).where(:engine => "123").count
batches = count / 500
batches += 1 if count % 500 > 0
last_id = 0
while batches > 0
  # keyset pagination: "id > last_id" requires an id ordering to stay consistent
  ids = Car.where("engine = ? AND id > ?", "123", last_id).order(:id).limit(500).ids # plucks just the ids
  cars = Car.includes(:member).find(ids)
  # cars.each or cars.update_all
  # do your updating
  last_id = ids.last
  batches -= 1
end
Can you imagine how find_in_batches with sorting would work on 1M rows or more? It would sort all the rows for every batch.
So I think it's better to decrease the number of sort calls. For example, with a batch size of 500, you can load just the IDs (with sorting) for N * 500 rows at once, and afterwards simply load each batch of objects by those IDs. That way the number of sorted queries hitting the DB drops by a factor of N; see the sketch below.
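A minimal sketch of that idea (assuming the Car model from the question; for very large tables you could also page the ID query itself, N * 500 IDs at a time, as described above):
sorted_ids = Car.where(:engine => "123").order('created_at DESC').pluck(:id)
sorted_ids.each_slice(500) do |batch_ids|
  # where(:id => ...) does not preserve order, so restore it in Ruby
  by_id = Car.includes(:member).where(:id => batch_ids).index_by(&:id)
  by_id.values_at(*batch_ids).compact.each do |car|
    # process car here
  end
end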

Find record with LIKE on partial attribute

In my app I have invoice numbers like this:
2014.DEV.0001
2014.DEV.0002
2014.TSZ.0003
The three character code is a company code. When a new invoice number needs to be assigned it should look for the last used invoice number for that specific company code and add one to it.
I know the company code, I use a LIKE to search on a partial invoice number like this:
last = Invoice.where("invoice_nr LIKE ?", "#{DateTime.now.year}.#{company_short}.").last
This results in this SQL query:
SELECT "invoices".* FROM "invoices" WHERE "invoices"."account_id" = 1 AND (invoice_nr LIKE '2014.TSZ.') ORDER BY "invoices"."id" DESC LIMIT 1
But unfortunately it doesn't return any results. Any idea to improve this, as searching with LIKE doesn't seem to be correct?
Try wrapping the string with % and using lower() to convert both the query string and the column to lowercase, which avoids missing results because of case differences. Try this:
last = Invoice.where("lower(invoice_nr) LIKE lower(?)", "%#{DateTime.now.year}.#{company_short}.%").last
You want % for partial match
last = Invoice.where("invoice_nr LIKE ?", "%#{DateTime.now.year}.#{company_short}.%").last
Since you want to match only the left part, you just need to add a % to the right end of your string:
Invoice.where("invoice_nr LIKE ?", "#{DateTime.now.year}.#{company_short}.%").last
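As a hypothetical follow-up (my sketch, not from the answers), the next invoice number can then be derived from the last match, assuming the zero-padded .NNNN suffix shown in the question:
prefix = "#{DateTime.now.year}.#{company_short}."
last = Invoice.where("invoice_nr LIKE ?", "#{prefix}%").order(:invoice_nr).last
next_seq = last ? last.invoice_nr.split('.').last.to_i + 1 : 1
next_nr = "#{prefix}#{format('%04d', next_seq)}" # e.g. "2014.TSZ.0004"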

RoR sum column of datetime type

I'm calculating total "walk time" for a dog walking app. The Walks table has two columns, start_time and end_time. Since I want to display the total time out for ALL walks for a particular dog, I figured I could just sum each column and subtract the start_times total from the end_times total to get the total time out. However, I'm getting strange results. When I sum the columns like this,
start_times = dog.walks.sum('start_time')
end_times = dog.walks.sum('end_time')
BOTH start_times and end_times return the same value. Doing a sanity check, I see that the start and end times in the db are indeed set as I would expect (start times in the morning, end times in the afternoon), so the two sums should definitely differ. Additionally, the value is different for each dog and in line with the relative values I would expect: dogs with more walks return larger values than dogs with fewer walks. So the sum is probably working, yet somehow it returns the same value for each column.
Btw, I'm running this in dev with Rails 3.2.3, Ruby 2.0, and SQLite.
I don't think summing datetimes is a good idea. What you need is to calculate the duration of each single walk and sum those. You can do it in two ways:
1. DB-dependent, but more efficient:
# sqlite in dev and test modes
sql = "strftime('%s',end_time) - strftime('%s',start_time)" if !Rails.env.production?
# production with postgres
sql = "extract(epoch from end_time - start_time)" if Rails.env.production?
total = dog.walks.sum(sql)
2. DB-agnostic, but less efficient when there are hundreds of records for each dog:
total = dog.walks.all.inject(0) { |tot, w| tot + (w.end_time - w.start_time) }
I don't know how SQLite handles datetimes and operations on that data type, but while playing in the sqlite console I noticed that I could get reliable results by converting the datetimes to seconds.
I would write it like:
dog.walks.sum("strftime('%s', end_time) - strftime('%s', start_time)")
Query should look like:
select sum(strftime('%s', end_time) - strftime('%s', start_time)) from walks;
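For display, a usage sketch (SQLite strftime syntax as above; the sum comes back in seconds):
total_seconds = dog.walks.sum("strftime('%s', end_time) - strftime('%s', start_time)").to_i
hours, rest = total_seconds.divmod(3600)
minutes = rest / 60
"#{hours}h #{minutes}m" # e.g. "7h 25m"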

How to efficiently select a random row using a biased prob. distribution with Rails 2.3 and PostgreSQL 8?

I have written a few simple Rails applications, accessing the database via the ActiveRecord abstraction, so I am afraid that I don't know much about the inner workings of the PostgreSQL engine.
However, I am writing a Rails app that needs to support 100000+ rows with dynamically updated content, and I am wondering whether I am using the random functions efficiently:
Database migration schema setting:
t.float :attribute1
t.integer :ticks
add_index :mytable, :attribute1
add_index :mytable, :ticks
Basically, I want to have the following random function distribution:
a) row that has the top 10% value in attribute1 = 30% chance of being selected
b) the middle 60% (in attribute1) row = 50% chance of being selected
c) the lowest 30% (in attribute1) with less than 100 ticks = 15% chance of being selected,
d) and for those with the lowest 30% attribute1 that have more than X (use X = 100 in this question) ticks = 5% chance of being selected.
At the moment I have the following code:
@rand = rand()
if @rand > 0.7
  @randRowNum = Integer(rand(0.1) * Mytable.count)
  @row = Mytable.find(:first, :offset => @randRowNum, :order => '"attribute1" DESC')
else
  if @rand > 0.2
    @randRowNum = Integer((rand(0.6) + 0.1) * Mytable.count)
    @row = Mytable.find(:first, :offset => @randRowNum, :order => '"attribute1" DESC')
  else
    @row = Mytable.find(:first, :offset => Integer(0.7 * Mytable.count), :order => '"attribute1" DESC')
    if !@row.nil?
      if @rand > 0.05
        @row = Mytable.find(:first, :order => 'RANDOM()', :conditions => ['"attribute1" <= ? AND "ticks" < ?', @row.attribute1, 100])
      else
        @row = Mytable.find(:first, :order => 'RANDOM()', :conditions => ['"attribute1" <= ? AND "ticks" >= ?', @row.attribute1, 100])
      end
    end
  end
end
1) One thing I would like to do is avoid the use of :order => 'RANDOM()', since according to my research each time it is called the SQL engine first scans through all the rows and assigns each a random value. Hence the use of @randRowNum = Integer(rand(0.1) * Mytable.count) and the offset by @randRowNum for a) and b). Am I actually improving the efficiency? Is there any better way?
2) Should I be doing the same as in 1) for c) and d)? And by using :conditions => ['"attribute1" <= ? AND "ticks" >= ?', @row.attribute1, 100], am I forcing the SQL engine to scan through all the rows anyway? Is there anything apart from indexing that can improve the efficiency of this call (with as little space/storage overhead as possible)?
3) There is a chance that the total number of entries in Mytable changes between the Mytable.count and Mytable.find calls. I could put the two calls within a transaction, but it seems excessive to lock the entire table just for a read operation (at the moment, I have extra code that falls back to a simple random row selection if I get a nil @row from the above code). Is it possible to move that .count call into a single atomic SQL query in Rails? Or would that have the same efficiency as locking via a transaction in Rails?
4) I have also been reading up on stored procedures in PostgreSQL... but for this particular case, is there any efficiency gain worth moving the code from the ActiveRecord abstraction into a stored procedure?
Many thanks!
P.S. development/deployment environment:
Ruby 1.8.7
Rails 2.3.14
PostgreSQL 8 (on Heroku)
Your question seems a bit vague, so correct me if my interpretation is wrong.
If you didn't split (c) and (d), I would just convert the uniform random variable over [0,1) to a biased random variable over [0,1) and use that to select your row offset.
Unfortunately, (c) and (d) are split based on the value of "ticks", a different column. That's the hard part, and it also makes the query much less efficient.
After you fetch the value of attribute1 at 70% from the bottom, also fetch the number of (c) rows; something like SELECT COUNT(*) FROM foo WHERE attribute1 <= partition_30 AND ticks < 100. Then use that to find the offset into either the ticks < 100 or the ticks >= 100 case. (You probably want an index on (attribute1, ticks) or similar; which order is best depends on your data.) A sketch of this two-step lookup follows.
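Roughly (a sketch in the question's Rails 2.3 finder syntax, with table/column names assumed from the question):
# boundary value of attribute1 at the top of the bottom-30% slice
boundary = Mytable.find(:first, :offset => Integer(0.7 * Mytable.count), :order => 'attribute1 DESC').attribute1
# how many (c) rows sit below the boundary with few ticks
c_count = Mytable.count(:conditions => ['attribute1 <= ? AND ticks < ?', boundary, 100])
# c_count can then be used to pick a random offset within the (c) or (d) slice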
If the "ticks" threshold is known in advance and doesn't need to change often, you can cache it in a column (BOOL ticks_above_threshold or whatever) which makes the query much more efficient if you have an index on (ticks_above_threshold, attribute1) (note the reversal). Of course, every time you change the threshold you need to write to every row.
(I think you can also use a "materialized view" to avoid cluttering the main table with an extra column, but I'm not sure what the difference in efficiency is.)
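If you go the cached-column route, a hypothetical migration plus model callback (names illustrative, not from the answer) could look like:
add_column :mytable, :ticks_above_threshold, :boolean, :default => false
add_index :mytable, [:ticks_above_threshold, :attribute1]
# in the model, keep the flag in sync whenever ticks changes:
# before_save { |rec| rec.ticks_above_threshold = rec.ticks >= 100 }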
There are obviously some efficiency gains possible by using stored procedures. I wouldn't worry about it too much, unless latency to the server is particularly high.
EDIT:
To answer your additional questions:
Indexing (BOOL ticks_above_threshold, attribute1) should work better than (ticks, attribute1) (I may have the order wrong, though) because it lets you compare the first index column for equality. This is important.
Indexes generally use some sort of balanced tree to do a lookup on what is effectively a list. For example, take A4 B2 C3 D1 (ordered letter,number) and look up "number of rows with letter greater than B and number greater than 2". The best you can do is start after the Bs and iterate over the whole table. If you order by number,letter, you get 1D 2B 3C 4A. Again, start after the 2s.
If instead your index is on (is_greater_than_2, letter), it looks like false,B false,D true,A true,C. You can ask the index for the position of (true, B) (between true,A and true,C) and count the number of entries until the end. Counting the number of items between two index positions is fast.
Google App Engine's Restrictions on Queries goes one step further:
Inequality Filters Are Allowed on One Property Only
A query may only use inequality filters (<, <=, >=, >, !=) on one property across all of its filters.
...
The query mechanism relies on all results for a query to be adjacent to one another in the index table, to avoid having to scan the entire table for results. A single index table cannot represent multiple inequality filters on multiple properties while maintaining that all results are consecutive in the table.
If none of your other queries benefit from an index on ticks, then yes.
In some cases, it might be faster to index a instead of (a,b) (or (b,a)) if the clause involving b is almost always true and you're fetching row data (not just getting a COUNT()). For example, if 1000 <= a AND a <= 1010 matches a million rows and b > 100 fails for only two rows, it might be faster to do two extra row lookups than to work with the bigger index.
As long as rows aren't being removed between the call to count() and the call to find(), I wouldn't worry about transactions. Definitely get rid of all the calls ordering by RANDOM(), as there is no way to optimize them. Make sure that attribute1 has an index on it. I haven't tested it, but something like this should be pretty quick:
total_rows = Mytable.count
r = rand()
if r > 0.7 # 90-100%
  lower = total_rows * 0.9
  upper = total_rows
elsif r > 0.2 # 30-89%
  lower = total_rows * 0.3
  upper = total_rows * 0.89
else # 0-29%
  lower = 0
  upper = total_rows * 0.29
end
offset = [lower + (upper - lower) * rand(), total_rows - 1].min.to_i
@row = Mytable.find(:first, :offset => offset, :order => 'attribute1 ASC')
