Weighted random pick from array in ruby/rails

I have a model in Rails from which I want to pick a random entry.
So far I've done it with a named scope like this:
named_scope :random, lambda { { :order=>'RAND()', :limit => 1 } }
But now I've added an integer field 'weight' to the model representing the probability with which each row should be picked.
How can I now do a weighted random pick?
I've found and tried out two methods on snippets.dzone.com that extend the Array class and add a weighted random function, but neither of them worked or picked weighted random items for me.
I'm using REE 1.8.7 and Rails 2.3.

Maybe I understand this totally wrong, but couldn't you just use the column "weight" as a factor for the random number? (Depending on the DB, some precautions would be necessary to prevent the product from overflowing.)
named_scope :random, lambda { { :order=>'RAND()*weight', :limit => 1 } }

In one query you should:
calculate the total weight
multiply by a random factor, giving a weight threshold
scan through the table again, summing weights, until the weight threshold is reached.
In SQL it would be something like this (not tried for real):
SELECT SUM(weight) INTO @totalwt FROM table;
SET @lim := FLOOR(RAND() * @totalwt);
SELECT id, weight, @total := @total + weight AS cumulativeWeight
FROM table, (SELECT @total := 0) AS t
HAVING cumulativeWeight >= @lim
LIMIT 1;
Inspired by Optimal query to fetch a cumulative sum in MySQL
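If the table is small enough to load the candidate rows into memory, the same cumulative-weight idea can be done in plain Ruby; a minimal sketch (the helper name and the usage line are mine, not from the answers):
# Walk the rows, subtracting each weight from a random threshold drawn in
# [0, total_weight); the row that drives the threshold below zero is the pick.
def weighted_random_entry(rows)
  total = rows.inject(0) { |sum, row| sum + row.weight }
  threshold = rand * total
  rows.each do |row|
    threshold -= row.weight
    return row if threshold < 0
  end
  rows.last # fallback for floating-point edge cases
end

# e.g. entry = weighted_random_entry(MyModel.all)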

Related

update_all with a method

Lets say I have a model:
class Result < ActiveRecord::Base
  attr_accessible :x, :y, :sum
end
Instead of doing
Result.all.find_each do |s|
  s.sum = compute_sum(s.x, s.y)
  s.save
end
assuming compute_sum is an available method that does some computation which cannot be translated into SQL:
def compute_sum(x, y)
  sum_table[x][y]
end
Is there a way to use update_all, probably something like:
Result.all.update_all(sum: compute_sum(:x, :y))
I have more than 80,000 records to update. Each record in find_each creates its own BEGIN and COMMIT queries, and each record is updated individually.
Or is there any other faster way to do this?
If the compute_sum function can't be translated into SQL, then you cannot do update_all on all records at once; you will need to iterate over the individual instances. However, you could speed it up if there are a lot of repeated sets of values in the columns, by only doing the calculation once per set of inputs and then doing one mass update per calculation, e.g.
Result.all.group_by { |result| [result.x, result.y] }.each do |inputs, results|
  sum = compute_sum(*inputs)
  Result.update_all("sum = #{sum}", "id in (#{results.map(&:id).join(',')})")
end
You can replace result.x, result.y with the actual inputs to the compute_sum function.
EDIT - forgot to put the square brackets around result.x, result.y in the group_by block.
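Not part of this answer's approach, but as a rough sketch addressing the per-record BEGIN/COMMIT overhead mentioned in the question: the original loop can be wrapped in a single transaction so that all 80,000 updates commit together.
# One transaction for the whole batch instead of one per record; each record
# is still computed and saved individually in Ruby.
Result.transaction do
  Result.find_each do |result|
    result.sum = compute_sum(result.x, result.y)
    result.save!
  end
end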
update_all makes an SQL query, so any processing you do on the values needs to be in SQL. So, you'll need to find the SQL function, in whichever DBMS you're using, to add two numbers together. In Postgres, for example, I believe you would do
Sum.update_all(sum: "x + y")
which will generate this sql:
update sums set sum = x + y;
which will calculate the x + y value for each row, and set the sum field to the result.
EDIT - for MariaDB. I've never used this, but a quick google suggests that the sql would be
update sums set sum = sum(x + y);
Try this first, in your sql console, for a single record. If it works, then you can do
Sum.update_all(sum: "sum(x + y)")
in Rails.
EDIT2: there are a lot of things called sum here, which is making the example quite confusing. Here's a more generic example.
set col_c to the result of adding col_a and col_b together, in class Foo:
Foo.update_all(col_c: "sum(col_a + col_b)")
I just noticed that I'd copied the (incorrect) Sum.all.update_all from your question. It should just be Sum.update_all - I've updated my answer.
I'm a complete beginner, just wondering: why not add a method like the one below, without adding a separate column in the db? You could still access Sum.sum from outside.
def self.sum
  x + y
end

Ruby on Rails query returns extra column

I am working with Ruby 2.0.0 and Rails 4.0.9, on Oracle 11g database.
I query the database to get pairs of values [date, score] to draw a chart.
Unfortunately, my query returns triplets such as [date, score, something], and the chart fails.
Here is the query:
@business_process_history = DmMeasure.where("period_id between ? and ? and ODQ_object_id = ?",
                                             first_period_id, current_period_id, "BP-#{@business_process.id}").
                                      select("period_day, score").order("period_id")
Here is the result in the console:
DmMeasure Load (1.2ms) SELECT period_day, score FROM "DM_MEASURES" WHERE (period_id between 1684 and 1694 and ODQ_object_id = 'BP-147') ORDER BY period_id
=> #<ActiveRecord::Relation [#<DmMeasure period_day: "20140811", score: #<BigDecimal:54fabf0,'0.997E2',18(45)>>,
#<DmMeasure period_day: "20140812", score: #<BigDecimal:54fa7e0,'0.997E2',18(45)>>, ...]
Trying to format the result also returns triplets:
@business_process_history.map { |bp| [bp.period_day, bp.score] }
=> [["20140811", #<BigDecimal:54fabf0,'0.997E2',18(45)>],
["20140812", #<BigDecimal:54fa7e0,'0.997E2',18(45)>], ...]
Where does this come from?
How can I avoid this behaviour?
Thanks for your help,
Best regards,
Fred
What triplets? From what I can see, you have two attributes per item: 'period_day' (a string representing a date) and 'score' (a BigDecimal representation of a single number).
The ruby BigDecimal is just one way of representing a number.
eg if you play around with them in the rails console:
> BigDecimal.new(1234)
=> #<BigDecimal:f40c6d0,'0.1234E4',9(36)>
The quoted part, as you can see, looks a bit like scientific notation: it contains the significant digits and the precision.
To figure out what the 18(45) is, I had to dig into the original C code for the BigDecimal#inspect method.
Here's the comment for that method:
/* Returns debugging information about the value as a string of comma-separated
* values in angle brackets with a leading #:
*
* BigDecimal.new("1234.5678").inspect ->
* "#<BigDecimal:b7ea1130,'0.12345678E4',8(12)>"
*
* The first part is the address, the second is the value as a string, and
* the final part ss(mm) is the current number of significant digits and the
* maximum number of significant digits, respectively.
*/
The answer being: the current number of significant digits and the maximum number of significant digits, respectively
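As a practical follow-up (my own suggestion, not part of the original answer): if the charting code needs plain numbers rather than BigDecimal objects, converting the score while building the pairs is usually enough:
@business_process_history.map { |bp| [bp.period_day, bp.score.to_f] }
# => [["20140811", 99.7], ["20140812", 99.7], ...]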

Sum on multiple columns with Activerecord

I am new to ActiveRecord. I want to sum multiple columns of a model Student. My Student model looks like the following:
class Student < ActiveRecord::Base
  attr_accessible :class, :roll_num, :total_mark, :marks_obtained, :section
end
I want something like that:
total_marks, total_marks_obtained = Student.where(:id=>student_id).sum(:total_mark, :marks_obtained)
But it gives the following error:
NoMethodError: undefined method `except' for :marks_obtained:Symbol
So I am asking whether I have to query the model twice for the above, i.e. once to find the total marks and again to find the marks obtained.
You can use pluck to directly obtain the sum:
Student.where(id: student_id).pluck('SUM(total_mark)', 'SUM(marks_obtained)')
# SELECT SUM(total_mark), SUM(marks_obtained) FROM students WHERE id = ?
You can pass the desired columns or calculated fields to the pluck method, and it will return an array with the values.
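For instance, since pluck returns a single row of sums here, the two totals can be destructured directly (column names as in the question; a small illustrative sketch):
total_mark_sum, marks_obtained_sum = Student.where(id: student_id)
                                            .pluck('SUM(total_mark)', 'SUM(marks_obtained)')
                                            .first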
If you just want the sum of the columns total_mark and marks_obtained, try this:
Student.where(:id=>student_id).sum('total_mark + marks_obtained')
You can use raw SQL if you need to. Something like this returns an object from which you'll have to extract the values... I know you specified ActiveRecord!
Student.select("SUM(students.total_mark) AS total_mark, SUM(students.marks_obtained) AS marks_obtained").where(:id=>student_id)
For rails 4.2 (earlier unchecked)
Student.select("SUM(students.total_mark) AS total_mark, SUM(students.marks_obtained) AS marks obtained").where(:id=>student_id)[0]
NB the square brackets following the statement. Without them the statement returns a Class::ActiveRecord_Relation, not the AR instance. What's significant about this is that you CANNOT use first on the relation.
....where(:id=>student_id).first #=> PG::GroupingError: ERROR: column "students.id" must appear in the GROUP BY clause or be used in an aggregate function
Another method is to use ActiveRecord::Calculations#pluck, then Enumerable#sum on each inner array pair and again on the outer array:
Student
  .where(id: student_id)
  .pluck(:total_mark, :marks_obtained)
  .map(&:sum)
  .sum
The resulting SQL query is simple:
SELECT "students"."total_mark",
"students"."marks_obtained"
FROM "students"
WHERE "students"."id" = $1
The initial result of pluck will be an array of array pairs, e.g.:
[[10, 5], [9, 2]]
.map(&:sum) will run sum on each pair, totalling the pair and flattening the array:
[15, 11]
Finally .sum on the flattened array will result in a single value.
Edit:
Note that while there is only a single query, your database will return a result row for each record matched in the where. This method uses Ruby to do the totalling, so if there are many records (e.g. thousands), this may be slower than having SQL do the calculations itself, as noted in the accepted answer.
Similar to the accepted answer, however, I'd suggest using arel as follows to avoid string literals (apart from renaming columns, if needed).
Student
  .where(id: student_id)
  .select(Student.arel_table[:total_mark].sum, Student.arel_table[:marks_obtained].sum)
which will give you an ActiveRecord::Relation result over which you can iterate, or, as you'll only get one row, you can use .first (at least for mysql).
Recently, I also had the requirement to sum up multiple columns of an ActiveRecord relation. I ended up with the following (reusable) scope:
scope :values_sum, ->(*keys) {
  summands = keys.collect { |k| arel_table[k].sum.as(k.to_s) }
  select(*summands)
}
So, given a model, e.g. Order, with columns net_amount and gross_amount, you could use it as follows:
o = Order.today.values_sum(:net_amount, :gross_amount)
o.net_amount # -> sum of net amount
o.gross_amount # -> sum of gross amount

How to efficiently select a random row using a biased prob. distribution with Rails 2.3 and PostgreSQL 8?

I have written a few simple Rails applications, accessing the database via the ActiveRecord abstraction, so I am afraid that I don't know much about the inner workings of the PostgreSQL engine.
However, I am writing a Rails app that needs to support 100,000+ rows with dynamically updated content and am wondering whether I am using the random functions efficiently:
Database migration schema setting:
t.float :attribute1
t.integer :ticks
add_index :mytable, :attribute1
add_index :mytable, :ticks
Basically, I want to have the following random function distribution:
a) row that has the top 10% value in attribute1 = 30% chance of being selected
b) the middle 60% (in attribute1) row = 50% chance of being selected
c) the lowest 30% (in attribute1) with less than 100 ticks = 15% chance of being selected,
d) and for those with the lowest 30% attribute1 that have more than X (use X = 100 in this question) ticks = 5% chance of being selected.
At the moment I have the following code:
@rand = rand()
if @rand > 0.7
  @randRowNum = Integer(rand(0.1) * Mytable.count)
  @row = Mytable.find(:first, :offset => @randRowNum, :order => '"attribute1" DESC')
else
  if @rand > 0.2
    @randRowNum = Integer((rand(0.6) + 0.1) * Mytable.count)
    @row = Mytable.find(:first, :offset => @randRowNum, :order => '"attribute1" DESC')
  else
    @row = Mytable.find(:first, :offset => Integer(0.7 * Mytable.count), :order => '"attribute1" DESC')
    if !@row.nil?
      if @rand > 0.05
        @row = Mytable.find(:first, :order => 'RANDOM()',
                            :conditions => ['"attribute1" <= ? AND "ticks" < 100', @row.attribute1])
      else
        @row = Mytable.find(:first, :order => 'RANDOM()',
                            :conditions => ['"attribute1" <= ? AND "ticks" >= 100', @row.attribute1])
      end
    end
  end
end
1) One thing I would like to do is avoid the use of :order => 'RANDOM()', as according to my research it seems that each call involves the SQL engine first scanning through all the rows and assigning each of them a random value. Hence the use of @randRowNum = Integer(rand(0.1) * Mytable.count) and :offset => @randRowNum for a) and b). Am I actually improving the efficiency? Is there any better way?
2) Should I be doing the same as in 1) for c) and d)? And by using :conditions => ['"attribute1" <= ? AND "ticks" >= 100', @row.attribute1], am I forcing the SQL engine to scan through all the rows anyway? Is there anything apart from indexing with which I can improve the efficiency of this call (with as little space/storage overhead as possible too)?
3) There is a chance that the total number of entries in Mytable may have changed between the Mytable.count and Mytable.find calls. I could put the two calls within a transaction, but it seems excessive to lock the entire table just for a read operation (at the moment, I have extra code to fall back to a simple random row selection if I get a nil @row from the above code). Is it possible to move that .count call into a single atomic SQL query in Rails? Or would it have the same efficiency as locking via a transaction in Rails?
4) I have also been reading up on stored procedures in PostgreSQL... but for this particular case, is there any efficiency gain worth moving the code from the ActiveRecord abstraction to a stored procedure?
Many thanks!
P.S. development/deployment environment:
Ruby 1.8.7
Rails 2.3.14
PostgreSQL 8 (on Heroku)
Your question seems a bit vague, so correct me if my interpretation is wrong.
If you didn't split (c) and (d), I would just convert the uniform random variable over [0,1) to a biased random variable over [0,1) and use that to select your row offset.
Unfortunately, (c) and (d) are split based on the value of "ticks", a different column. That's the hard part, and it also makes the query much less efficient.
After you fetch the value of attribute1 at 70% from the bottom, also fetch the number of (c) rows; something like SELECT COUNT(*) FROM foo WHERE attribute1 <= partition_30 AND ticks < 100. Then use that to find the offset into either the ticks < 100 or the ticks >= 100 case. (You probably want an index on (attribute1, ticks) or something; which order is best depends on your data.)
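A rough Rails 2.3-style sketch of that two-step suggestion (untested; it reuses the names and thresholds from the question):
partition_row = Mytable.find(:first, :offset => Integer(0.7 * Mytable.count),
                             :order => '"attribute1" DESC')
# Number of (c) rows below the 70% cut-off that have fewer than 100 ticks
c_count = Mytable.count(:conditions => ['"attribute1" <= ? AND "ticks" < 100',
                                        partition_row.attribute1])
# Pick a random offset inside that subset instead of ordering by RANDOM()
@row = Mytable.find(:first, :offset => Integer(rand * c_count),
                    :order => '"attribute1" DESC',
                    :conditions => ['"attribute1" <= ? AND "ticks" < 100',
                                    partition_row.attribute1])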
If the "ticks" threshold is known in advance and doesn't need to change often, you can cache it in a column (BOOL ticks_above_threshold or whatever) which makes the query much more efficient if you have an index on (ticks_above_threshold, attribute1) (note the reversal). Of course, every time you change the threshold you need to write to every row.
(I think you can also use a "materialized view" to avoid cluttering the main table with an extra column, but I'm not sure what the difference in efficiency is.)
There are obviously some efficiency gains possible by using stored procedures. I wouldn't worry about it too much, unless latency to the server is particularly high.
EDIT:
To answer your additional questions:
Indexing (BOOL ticks_above_threshold, attribute1) should work better than (ticks, attribute1) (I may have the order wrong, though) because it lets you compare the first index column for equality. This is important.
Indexes generally use some sort of balanced tree to do a lookup on what is effectively a list. For example, take A4 B2 C3 D1 (ordered letter,number) and look up "number of rows with letter greater than B and number greater than 2". The best you can do is start after the Bs and iterate over the whole table. If you order by number,letter, you get 1D 2B 3C 4A. Again, start after the 2s.
If instead your index is on is_greater_than_2,letter, it looks like false,B false,D true,A true,C. You can ask the index for the position of (true,B), which lies between true,A and true,C, and count the number of entries until the end. Counting the number of items between two index positions is fast.
Google App Engine's Restrictions on Queries goes one step further:
Inequality Filters Are Allowed on One Property Only
A query may only use inequality filters (<, <=, >=, >, !=) on one property across all of its filters.
...
The query mechanism relies on all results for a query to be adjacent to one another in the index table, to avoid having to scan the entire table for results. A single index table cannot represent multiple inequality filters on multiple properties while maintaining that all results are consecutive in the table.
If none of your other queries benefit from an index on ticks, then yes.
In some cases, it might be faster to index a instead of a,b (or b,a) if the clause including b is almost always true and you're fetching row data (not just getting a COUNT()) (i.e. if 1000 <= a AND a <= 1010 matches a million rows and b > 100 only fails for two rows, then it might end up being faster to do two extra row lookups than to work with the bigger index).
As long as rows aren't being removed between the call to count() and the call to find(), I wouldn't worry about transactions. Definitely get rid of all calls ordering by RANDOM(), as there is no way to optimize them. Make sure that attribute1 has an index on it. I haven't tested it, but something like this should be pretty quick:
total_rows = Mytable.count
r = rand()
if r > 0.7 # 90-100%
  lower = total_rows * 0.9
  upper = total_rows
elsif r > 0.2 # 30-89%
  lower = total_rows * 0.3
  upper = total_rows * 0.89
else # 0-29%
  lower = 0
  upper = total_rows * 0.29
end
offset = [lower + (upper - lower) * rand(), total_rows - 1].min.to_i
@row = Mytable.find(:first, :offset => offset, :order => 'attribute1 ASC')

How to calculate mean based on number of votes/scores/samples/etc?

For simplicity say we have a sample set of possible scores {0, 1, 2}. Is there a way to calculate a mean based on the number of scores without getting into hairy lookup tables etc for a 95% confidence interval calculation?
dreeves posted a solution to this here: How can I calculate a fair overall game score based on a variable number of matches?
Now say we have 2 scenarios ...
Scenario A) 2 votes of value 2 give SE = 0, resulting in a mean of 2.
Scenario B) 10000 votes of value 2 give SE = 0, also resulting in a mean of 2.
I wanted Scenario A to come out at some value less than 2 because of the low number of votes, but it doesn't seem like this solution handles that (dreeves's equations only hold when the values in your set are not all equal to each other). Am I missing something, or is there another algorithm I can use to calculate a better score?
The data available to me is:
n (number of votes)
sum (sum of votes)
{set of votes} (all vote values)
Thanks!
You could just give it a weighted score when ranking results, as opposed to just displaying the average vote so far, by multiplying with some function of the number of votes.
An example in C# (because that's what I happen to know best...) that could easily be translated into your language of choice:
double avgScore = Math.Round(sum / n);
double rank = avgScore * Math.Log(n);
Here I've used the logarithm of n as the weighting function - but it will only work well if the number of votes is neither too small nor too large. Exactly how large is "optimal" depends on how much you want the number of votes to matter.
If you like the logarithmic approach, but base 10 doesn't really work with your vote counts, you could easily use another base. For example, to do it in base 3 instead:
double rank = avgScore * Math.Log(n, 3);
Which function you should use for weighing is probably best decided by the order of magnitude of the number of votes you expect to reach.
You could also use a custom weighting function by defining
double rank = avgScore * w(n);
where w(n) returns the weight value depending on the number of votes. You then define w(n) as you wish, for example like this:
double w(int n) {
    // caution! ugly example code ahead...
    // if you even want this approach, at least use a switch... :P
    if (n > 100) {
        return 10;
    } else if (n > 50) {
        return 8;
    } else if (n > 40) {
        return 6;
    } else if (n > 20) {
        return 3;
    } else if (n > 10) {
        return 2;
    } else {
        return 1;
    }
}
If you want to use the idea in my other referenced answer (thanks!) of using a pessimistic lower bound on the average then I think some additional assumptions/parameters are going to need to be injected.
To make sure I understand: With 10000 votes, every single one of which is "2", you're very sure the true average is 2. With 2 votes, each a "2", you're very unsure -- maybe some 0's and 1's will come in and bring down the average. But how to quantify that, I think is your question.
Here's an idea: Everyone starts with some "baggage": a single phantom vote of "1". The person with 2 true "2" votes will then have an average of (1+2+2)/3 = 1.67, while the person with 10000 true "2" votes will have an average of 20001/10001 ≈ 1.9999. That alone may satisfy your criteria. Or to add the pessimistic lower bound idea, the person with 2 votes would have a pessimistic average score of 1.333 and the person with 10k votes would be 1.99948.
(To be absolutely sure you'll never have the problem of zero standard error, use two different phantom votes. Or perhaps use as many phantom votes as there are possible vote values, one vote with each value.)
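A minimal Ruby sketch of the phantom-vote idea (the method name and the single phantom vote of "1" are illustrative assumptions based on the example above):
# Smoothed average: add one phantom vote of 1, so small samples are pulled
# noticeably toward the low end while large samples are barely affected.
def smoothed_average(votes, phantom = 1)
  padded = votes + [phantom]
  padded.inject(0.0) { |sum, v| sum + v } / padded.size
end

smoothed_average([2, 2])        # => 1.666... (2 real votes)
smoothed_average([2] * 10_000)  # => 1.9999... (10000 real votes)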
