How to calculate a mean based on the number of votes/scores/samples, etc.?

For simplicity, say we have a sample set of possible scores {0, 1, 2}. Is there a way to calculate a mean that accounts for the number of scores, without getting into hairy lookup tables, etc., for a 95% confidence interval calculation?
dreeves posted a solution to this here: How can I calculate a fair overall game score based on a variable number of matches?
Now say we have 2 scenarios ...
Scenario A) 2 votes of value 2 give SE=0, so the mean is 2
Scenario B) 10000 votes of value 2 give SE=0, so the mean is 2
I wanted Scenario A to come out to some value less than 2 because of the low number of votes, but it doesn't seem like this solution handles that (dreeves's equations only hold when the values in your set are not all equal to each other). Am I missing something, or is there another algorithm I can use to calculate a better score?
The data available to me is:
n (number of votes)
sum (sum of votes)
{set of votes} (all vote values)
Thanks!

You could just give each item a weighted score when ranking results, as opposed to just displaying the average vote so far, by multiplying the average by some function of the number of votes.
An example in C# (because that's what I happen to know best...) that could easily be translated into your language of choice:
double avgScore = Math.Round(sum / n);
double rank = avgScore * Math.Log(n);
Here I've used the logarithm of n as the weighting function - but it will only work well if the number of votes is neither too small nor too large. Exactly how large is "optimal" depends on how much you want the number of votes to matter.
Note that Math.Log(n) with a single argument is the natural (base-e) logarithm. If that base doesn't really work with your vote counts, you can easily use another one. For example, to do it in base 3 instead:
double rank = avgScore * Math.Log(n, 3);
Which function you should use for weighting is probably best decided by the order of magnitude of the number of votes you expect to reach.
You could also use a custom weighting function by defining
double rank = avgScore * w(n);
where w(n) returns the weight value depending on the number of votes. You then define w(n) as you wish, for example like this:
double w(int n) {
    // caution! ugly example code ahead...
    // if you even want this approach, at least use a switch... :P
    if (n > 100) {
        return 10;
    } else if (n > 50) {
        return 8;
    } else if (n > 40) {
        return 6;
    } else if (n > 20) {
        return 3;
    } else if (n > 10) {
        return 2;
    } else {
        return 1;
    }
}

If you want to use the idea from my other referenced answer (thanks!) of using a pessimistic lower bound on the average, then I think some additional assumptions/parameters are going to need to be injected.
To make sure I understand: with 10000 votes, every single one of which is "2", you're very sure the true average is 2. With 2 votes, each a "2", you're very unsure -- maybe some 0's and 1's will come in and bring down the average. How to quantify that is, I think, your question.
Here's an idea: everyone starts with some "baggage": a single phantom vote of "1". The person with 2 true "2" votes will then have an average of (1+2+2)/3 = 1.67, whereas the person with 10000 true "2" votes will have an average of about 1.9999. That alone may satisfy your criteria. Or, to add the pessimistic lower bound idea, the person with 2 votes would have a pessimistic average score of 1.333 and the person with 10k votes would be 1.99948.
(To be absolutely sure you'll never have the problem of zero standard error, use two different phantom votes. Or perhaps use as many phantom votes as there are possible vote values, one vote with each value.)
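Here is a minimal Ruby sketch of the phantom-vote idea (the method names, the {0, 1, 2} vote scale, and the choice of "one standard error below the mean" as the pessimistic bound are assumptions for illustration, not something prescribed above):
POSSIBLE_VALUES = [0, 1, 2]  # assumed vote scale from the question

# Pad the real votes with one phantom vote per possible value, so the
# standard error is never zero and small samples get pulled toward the middle.
def adjusted_mean(votes)
  padded = votes + POSSIBLE_VALUES
  padded.inject(0.0) { |s, v| s + v } / padded.size
end

# A pessimistic score: one standard error below the padded mean
# (just one possible choice of lower bound).
def pessimistic_mean(votes)
  padded = votes + POSSIBLE_VALUES
  mean = padded.inject(0.0) { |s, v| s + v } / padded.size
  variance = padded.inject(0.0) { |s, v| s + (v - mean)**2 } / (padded.size - 1)
  mean - Math.sqrt(variance / padded.size)
end

adjusted_mean([2, 2])                # => 1.4    (well below 2, as desired)
adjusted_mean(Array.new(10_000, 2))  # => 1.9997 (essentially 2)
With three phantom votes instead of one, the numbers differ from the 1.67/1.9999 example above, but the behaviour is the same: a handful of real votes keeps the score well away from the extremes.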

Related

What is a better way than a for loop to implement an algorithm that involves sets?

I'm trying to create an algorithm along these lines:
-Create 8 participants
-Each participant has a set of interests
-Pair each participant with the participant they share the fewest interests with
So what I've done so far is create 2 classes, the Participant and Interest, where the Interest is Hashable so that I can create a Set with it. I manually created 8 participants with different names and interests.
I've made an array of selected participants and used a basic for-in loop to somewhat pair them together using the intersection() function of sets. Somehow my index always goes out of range, and I'm positive there's a better way of doing this, but it's just so messy that I don't know where to start.
for i in 0..<participantsSelected.count {
    if participantsSelected[i].interest.intersection(participantsSelected[i+1].interest) == [] {
        participantsSelected.remove(at: i)
        participantsSelected.remove(at: i+1)
        print(participantsSelected.count)
    }
}
My other issue is that a for loop seems a bit off for this specific algorithm anyway: what if all participants share at least one interest, so the intersection never equals [] / nil?
Basically, the output I'm after is to remove participants from the selected array once they're paired up, and for them to be paired up they have to be matched with the participant with whom they share the fewest interests.
EDIT: Updated code, here's my attempt to improve my algorithm logic
for participant in 0..<participantsSelected.count {
    var maxInterestIndex = 10
    var currentIndex = 1
    for _ in 0..<participantsSelected.count {
        print(participant)
        print(currentIndex)
        let score = participantsAvailable[participant].interest.intersection(participantsAvailable[currentIndex].interest)
        print("testing score, \(score.count)")
        if score.count < maxInterestIndex {
            maxInterestIndex = score.count
            print("test, \(maxInterestIndex)")
        } else {
            pairsCreated.append(participantsAvailable[participant])
            pairsCreated.append(participantsAvailable[currentIndex])
            break
            // participantsAvailable.remove(at: participant)
            // participantsAvailable.remove(at: pairing)
        }
        currentIndex = currentIndex + 1
    }
}

for i in 0..<pairsCreated.count {
    print(pairsCreated[i].name)
}
Here is a solution for the case where what you are looking for is to pair all of your participants optimally with respect to your criterion:
The way to go is to find a perfect matching in a graph of participants.
Create a graph with n vertices, n being the number of participants. We can denote by u_p the vertex corresponding to participant p.
Then, create weighted edges as follows:
For each pair of participants p, q (p != q), create the edge (u_p, u_q), and weight it with the number of interests these two participants have in common.
Then, run a minimum weight perfect matching algorithm on your graph, and the job is done. You will obtain an optimal result (meaning the best possible, or one among the best possible matchings) in polynomial time.
Minimum weight perfect matching algorithm: the problem is strictly equivalent to the maximum weight perfect matching problem. Find the edge of maximum weight (let's denote its weight by C). Then replace the weight w of each edge by C - w, and run a maximum weight matching algorithm on the resulting graph.
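For concreteness, that weight flip is a one-liner. Here is a rough Ruby sketch (the weights hash, keyed by vertex pairs, is an assumed representation, and the matching step itself is left to whatever library you use):
c = weights.values.max                            # C = maximum edge weight
flipped = weights.transform_values { |w| c - w }  # minimise original = maximise flipped
# now run a maximum weight perfect matching on `flipped`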
I would suggest that you use Edmonds' blossom algorithm to find a perfect matching in your graph. First because it is efficient and well documented, second because I believe you can find implementations in most existing languages, but also because it truly is a very, very beautiful algorithm (it ain't called blossom for nothing).
Another possibility, if you are sure that your number of participants will be small (you mention 8), is to go for a brute-force algorithm, meaning that you test all possible ways to pair the participants.
Then the algorithm would look like:
find_best_matching(participants, interests, pairs):
    if all participants have been paired:
        return sum(cost(p1, p2) for p1, p2 in pairs), pairs // cost(p1, p2) = number of interests in common
    else:
        p = some participant which is not paired yet
        best_sum = + infinite
        best_pairs = empty_set
        for all participants q which have not been paired, q != p:
            add (p, q) to pairs
            current_sum, current_pairs = find_best_matching(participants, interests, pairs)
            if current_sum < best_sum:
                best_sum = current_sum
                best_pairs = current_pairs
            remove (p, q) from pairs
        return best_sum, best_pairs
Call it initially with find_best_matching(participants, interests, empty_set_of_pairs) to get the best way to match your participants.
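If it helps to see the brute force concretely, here is a small Ruby sketch of that pseudocode (Ruby rather than Swift purely for brevity; the participant names and interest sets are made up, and it assumes an even number of participants):
require 'set'

# cost(p1, p2) = number of interests the two participants have in common
def cost(interests, p1, p2)
  (interests[p1] & interests[p2]).size
end

# Returns [total_cost, pairs] for the cheapest way to pair everyone up.
def best_matching(unpaired, interests)
  return [0, []] if unpaired.empty?
  p = unpaired.first
  best_sum, best_pairs = Float::INFINITY, nil
  (unpaired - [p]).each do |q|
    sum, pairs = best_matching(unpaired - [p, q], interests)
    sum += cost(interests, p, q)
    best_sum, best_pairs = sum, pairs + [[p, q]] if sum < best_sum
  end
  [best_sum, best_pairs]
end

interests = {
  "Alice" => Set["music", "chess"], "Bob"  => Set["music", "tennis"],
  "Carol" => Set["chess", "films"], "Dave" => Set["tennis", "films"],
}
p best_matching(interests.keys, interests)
# => [0, [["Bob", "Carol"], ["Alice", "Dave"]]]
This tries every possible pairing, which is fine for 8 participants (7 x 5 x 3 x 1 = 105 pairings) but grows very quickly beyond that; for larger groups the matching algorithm above is the way to go.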

Finding consecutive wins in the Cypher query language

From the figure we can see that Arsenal have won three matches consecutively, but I could not write the query.
Here is a query that should return the maximum number of consecutive wins for Arsenal:
MATCH (a:Club {name:'Arsenal FC'})-[r:played_with]-(:Club)
WITH ((CASE a.name WHEN r.home THEN 1 ELSE -1 END) * (TOINT(r.score[0]) - TOINT(r.score[1]))) > 0 AS win, r
ORDER BY TOINT(r.time)
RETURN REDUCE(s = {max: 0, curr: 0}, w IN COLLECT(win) |
  CASE WHEN w
    THEN {
      max: CASE WHEN s.max < s.curr + 1 THEN s.curr + 1 ELSE s.max END,
      curr: s.curr + 1}
    ELSE {max: s.max, curr: 0}
  END
).max AS result;
The WITH clause sets the win variable to true iff Arsenal won a particular game. Notice that the ORDER BY clause converts the time property to an integer, because the ordering of numeric strings does not work properly if the strings could be of different lengths (I am being a bit picky here, admittedly). The REDUCE function is used to calculate the maximum number of consecutive wins.
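The accumulator in that REDUCE is just a (best streak, current streak) pair. If the Cypher is hard to read, here is the same logic as a small Ruby sketch (the win/loss booleans are made-up sample data):
wins = [true, true, false, true, true, true, false]  # match results in time order

best = wins.inject({ max: 0, curr: 0 }) do |s, w|
  if w
    curr = s[:curr] + 1
    { max: [s[:max], curr].max, curr: curr }  # extend the streak, track the best
  else
    { max: s[:max], curr: 0 }                 # anything but a win resets the streak
  end
end

puts best[:max]  # => 3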
======
Finally, here are some suggestions for some improvements to your data model. For example:
It looks like your played_with relationship always points from the home team to the away team. If so, you can get rid of the redundant home and away properties, and you can also rename the relationship type to HOSTED to make the direction of the relationship more clear.
The scores and time should be stored as integers, not strings. That would make your queries more efficient, and easier to write and understand.
You could also consider splitting the scores property into two scalar properties, say homeScore and awayScore, which would make your code more clear. There seems to be no advantage to storing the scores in an array.
If you made all the above suggested changes, then you would just need to change the beginning of the above query to this:
MATCH (a:Club {name:'Arsenal FC'})-[r:HOSTED]-(:Club)
WITH ((CASE a WHEN STARTNODE(r) THEN 1 ELSE -1 END) * (r.homeScore - r.awayScore)) > 0 AS win, r
ORDER BY r.time
...

Can I add where clauses after putting limit on a scoped query?

I have a model called Game in which I build up a scoped query.
Something like:
games = Game.scoped
games = games.team(team_name) if team_name
games = games.opponent(opponent_name) if opponent_name
total_games = games
I then calculate several subsets like:
wins = games.where("team_score > opponent_score").count
losses = games.where("opponent_score > team_score").count
Everything is great. Then I decided that I want to limit the original scope to show the last X number of games.
total_games = games.limit(10)
If there are 100 games that match what I want for total_games, and then I add .limit(10) - it gets the last 10. Great. But now calling
total_games.where("team_score > opponent_score").count
will reach back beyond the last 10, and into results that aren't part of total_games. Since adding .limit(10), I'll always get 10 total games, but also 10 wins, and 10 losses.
After typing this all out, I've realized that the cases where I want to use limit are for showing a smaller set of results - so I'll probably end up just looping through the results to calculate things like wins and losses (instead of doing separate queries as in my subsets above).
I tried this out when total_games had hundreds or thousands of results, and it's significantly slower to loop through than it is to just do separate queries for the subsets.
So, now I must know - what is the best way to limit a scoped query, and then run further queries on those results that restrict themselves to the rows returned by the original .limit(x)?
I don't think you can do what you want to do without separating your query into two steps, first getting 10 games from total_games and making the DB query with all:
last_10_games = total_games.limit(10).all
then selecting from the resulting array and getting the size of the result:
wins = last_10_games.select { |g| g.team_score > g.opponent_score }.count
losses = last_10_games.select { |g| g.opponent_score > g.team_score }.count
I know this is not exactly what you asked for, but I think it's probably the most straightforward solution to the problem.

How to efficiently select a random row using a biased prob. distribution with Rails 2.3 and PostgreSQL 8?

I have written a few simple Rails applications, accessing the database via the ActiveRecord abstraction, so I am afraid that I don't know much about the inner workings of the PostgreSQL engine.
However, I am writing a Rails app that needs to support 100000+ rows with dynamically updated content, and am wondering whether I am using the random functions efficiently:
Database migration schema setting:
t.float :attribute1
t.integer :ticks
add_index :mytable, :attribute1
add_index :mytable, :ticks
Basically, I want to have the following random function distribution:
a) row that has the top 10% value in attribute1 = 30% chance of being selected
b) the middle 60% (in attribute1) row = 50% chance of being selected
c) the lowest 30% (in attribute1) with less than 100 ticks = 15% chance of being selected,
d) and for those with the lowest 30% attribute1 that have more than X (use X = 100 in this question) ticks = 5% chance of being selected.
At the moment I have the following code:
@rand = rand()
if @rand > 0.7
  @randRowNum = Integer(rand(0.1) * Mytable.count)
  @row = Mytable.find(:first, :offset => @randRowNum, :order => '"attribute1" DESC')
else
  if @rand > 0.2
    @randRowNum = Integer((rand(0.6) + 0.1) * Mytable.count)
    @row = Mytable.find(:first, :offset => @randRowNum, :order => '"attribute1" DESC')
  else
    @row = Mytable.find(:first, :offset => Integer(0.7 * Mytable.count), :order => '"attribute1" DESC')
    if !@row.nil?
      if @rand > 0.05
        @row = Mytable.find(:first, :order => 'RANDOM()', :conditions => ['"attribute1" <= ? AND "ticks" < 100', @row.attribute1])
      else
        @row = Mytable.find(:first, :order => 'RANDOM()', :conditions => ['"attribute1" <= ? AND "ticks" >= 100', @row.attribute1])
      end
    end
  end
end
1) One thing I would like to do is avoid the use of :order => 'RANDOM()', as according to my research it seems that each call makes the SQL engine scan through all the rows and assign each a random value. Hence the use of @randRowNum = Integer(rand(0.1) * Mytable.count) and an offset of @randRowNum for a) and b). Am I actually improving the efficiency? Is there any better way?
2) Should I be doing the same as in 1) for c) and d)? And by using :conditions => ['"attribute1" <= ? AND "ticks" >= 100', @row.attribute1], am I forcing the SQL engine to scan through all the rows anyway? Is there anything apart from indexing that can improve the efficiency of this call (with as little space/storage overhead as possible too)?
3) There is a chance that the total number of entries in Mytable may have changed between the Mytable.count and Mytable.find calls. I could put the two calls within a transaction, but it seems excessive to lock the entire table just for a read operation (at the moment, I have extra code to fall back to a simple random row selection if I get a nil @row from the above code). Is it possible to move that .count call into a single atomic SQL query in Rails? Or would it have the same efficiency as locking via a transaction in Rails?
4) I have also been reading up on stored procedures in PostgreSQL... but for this particular case, is there any efficiency gain to be achieved that is worth moving the code from the ActiveRecord abstraction into a stored procedure?
Many thanks!
P.S. development/deployment environment:
Ruby 1.8.7
Rails 2.3.14
PostgreSQL 8 (on Heroku)
Your question seems a bit vague, so correct me if my interpretation is wrong.
If you didn't split (c) and (d), I would just convert the uniform random variable over [0,1) to a biased random variable over [0,1) and use that to select your row offset.
Unfortunately, (c) and (d) are split based on the value of "ticks" — a different column. That's the hard part, and it also makes the query much less efficient.
After you fetch the value of attribute1 at 70% from the bottom, also fetch the number of (c) rows; something like SELECT COUNT(*) FROM foo WHERE attribute1 <= partition_30 AND ticks < 100. Then use that to find the offset into either the ticks < 100 or the ticks >= 100 case. (You probably want an index on (attribute1, ticks) or something; which order is best depends on your data.)
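A rough Rails 2.3-style sketch of that count-then-offset idea (partition_30, the 100-tick threshold, and the 75/25 split inside the bottom-30% bucket come from the discussion above; guards for empty buckets are omitted):
partition_row = Mytable.find(:first, :offset => Integer(0.7 * Mytable.count),
                             :order => '"attribute1" DESC')
partition_30 = partition_row.attribute1

if rand < 0.75  # case (c): 15% out of the overall 20% that lands in the bottom 30%
  low_tick_count = Mytable.count(:conditions => ['"attribute1" <= ? AND "ticks" < 100', partition_30])
  @row = Mytable.find(:first,
                      :conditions => ['"attribute1" <= ? AND "ticks" < 100', partition_30],
                      :offset => rand(low_tick_count), :order => '"attribute1" DESC')
else            # case (d): the remaining 5%
  high_tick_count = Mytable.count(:conditions => ['"attribute1" <= ? AND "ticks" >= 100', partition_30])
  @row = Mytable.find(:first,
                      :conditions => ['"attribute1" <= ? AND "ticks" >= 100', partition_30],
                      :offset => rand(high_tick_count), :order => '"attribute1" DESC')
end
This replaces ORDER BY RANDOM() inside the bucket with a uniform random offset into the already-filtered, ordered set.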
If the "ticks" threshold is known in advance and doesn't need to change often, you can cache it in a column (BOOL ticks_above_threshold or whatever) which makes the query much more efficient if you have an index on (ticks_above_threshold, attribute1) (note the reversal). Of course, every time you change the threshold you need to write to every row.
(I think you can also use a "materialized view" to avoid cluttering the main table with an extra column, but I'm not sure what the difference in efficiency is.)
There are obviously some efficiency gains possible by using stored procedures. I wouldn't worry about it too much, unless latency to the server is particularly high.
EDIT:
To answer your additional questions:
Indexing (BOOL ticks_above_threshold, attribute1) should work better than (ticks, attribute1) (I may have the order wrong, though) because it lets you compare the first index column for equality. This is important.
Indexes generally use some sort of balanced tree to do a lookup on what is effectively a list. For example, take A4 B2 C3 D1 (ordered letter,number) and look up "number of rows with letter greater than B and number greater than 2". The best you can do is start after the Bs and iterate over the whole table. If you order by number,letter, you get 1D 2B 3C 4A. Again, start after the 2s.
If your index is instead on is_greater_than_2,letter, it looks like false,B false,D true,A true,C. You can ask the index for the position of (true,B) — between true,A and true,C — and count the number of entries until the end. Counting the number of items between two index positions is fast.
Google App Engine's Restrictions on Queries goes one step further:
Inequality Filters Are Allowed on One Property Only
A query may only use inequality filters (<, <=, >=, >, !=) on one property across all of its filters.
...
The query mechanism relies on all results for a query to be adjacent to one another in the index table, to avoid having to scan the entire table for results. A single index table cannot represent multiple inequality filters on multiple properties while maintaining that all results are consecutive in the table.
If none of your other queries benefit from an index on ticks, then yes.
In some cases, it might be faster to index a instead of a,b (or b,a) if the clause including b is almost always true and you're fetching row data (not just getting a COUNT()) (i.e. if 1000 <= a AND a <= 1010 matches a million rows and b > 100 only fails for two rows, then it might end up being faster to do two extra row lookups than to work with the bigger index).
As long as rows aren't being removed between the call to count() and the call to find(), I wouldn't worry about transactions. Definitely get rid of all the calls ordering by RANDOM(), as there is no way to optimize that. Make sure that attribute1 has an index on it. I haven't tested it, but something like this should be pretty quick:
total_rows = Mytable.count
r = rand()
if r > 0.7    # 90-100%
  lower = total_rows * 0.9
  upper = total_rows
elsif r > 0.2 # 30-89%
  lower = total_rows * 0.3
  upper = total_rows * 0.89
else          # 0-29%
  lower = 0
  upper = total_rows * 0.29
end
offset = [lower + (upper - lower) * rand(), total_rows - 1].min.to_i
@row = Mytable.find(:first, :offset => offset, :order => 'attribute1 ASC')

Ruby on Rails method to calculate percentiles - can it be refactored?

I have written a method to calculate a given percentile for a set of numbers, for use in an application I am building. Typically the user needs to know the 25th and 75th percentiles of a given set of numbers.
My method is as follows:
def calculate_percentile(array, percentile)
  #get number of items in array
  return nil if array.empty?
  #sort the array
  array.sort!
  #get the array length
  arr_length = array.length
  #multiply items in the array by the required percentile (e.g. 0.75 for 75th percentile)
  #round the result up to the next whole number
  #then subtract one to get the array item we need to return
  arr_item = ((array.length * percentile).ceil) - 1
  #return the matching number from the array
  return array[arr_item]
end
This looks to provide the results I was expecting but can anybody refactor this or offer an improved method to return specific percentiles for a set of numbers?
Some remarks:
If a particular index of an Array does not exist, [] will return nil, so your initial check for an empty Array is unnecessary.
You should not sort! the Array argument, because you are affecting the order of the items in the Array in the code that called your method. Use sort (without !) instead.
You don't actually use arr_length after assignment.
A return statement on the last line is unnecessary in Ruby.
There is no standard definition for the percentile function (there can be a lot of subtleties with rounding), so I'll just assume that how you implemented it is how you want it to behave. Therefore I can't really comment on the logic.
That said, the function that you wrote can be written much more tersely while still being readable.
def calculate_percentile(array, percentile)
  array.sort[(percentile * array.length).ceil - 1]
end
Here's the same refactored into a one-liner. You don't need an explicit return as the last line in Ruby; the return value of the last statement of the method is what's returned.
def calculate_percentile(array = [], percentile = 0.0)
  # multiply items in the array by the required percentile
  # (e.g. 0.75 for 75th percentile)
  # round the result up to the next whole number
  # then subtract one to get the array item we need to return
  array ? array.sort[((array.length * percentile).ceil) - 1] : nil
end
Not sure if it's worth it, but here is how I did it for the quartiles:
def median(list)
  # divide by 2.0 so integer input isn't truncated by integer division
  (list[(list.size - 1) / 2] + list[list.size / 2]) / 2.0
end

numbers = [1, 2, 3, 4, 5, 6]

if numbers.size % 2 == 0
  puts median(numbers[0...(numbers.size / 2)])
  puts median(numbers)
  puts median(numbers[(numbers.size / 2)..-1])
else
  median_index = numbers.index(median(numbers))
  puts median(numbers[0..(median_index - 1)])
  puts median(numbers)
  puts median(numbers[(median_index + 1)..-1])
end
If you're calculating both quartiles, you might want to move the "sort" outside the function, so that it only needs to be done once. This also means you aren't modifying your caller's data (sort!), nor making a copy every time the function is called (sort).
I know, premature optimisation and all that. And it's a bit awkward for the function to say, "the array must be sorted before calling this function". So it's reasonable to leave it as it is.
But sorting already-sorted data is going to take considerably longer than the whole rest of the function put together(*). It also has higher algorithmic complexity: O(N) at best, whereas the function could be O(1) when computing the second quartile (although the first call is still O(N log N) if the data is not already sorted, of course). So it's worth avoiding if performance might ever be an issue for this function.
There are slightly faster ways of finding the two quartiles than a full sort (look up "selection algorithms"). For instance if you're familiar with the way qsort uses pivots, observe that if you need to know the 25th and 75th items out of 100, and your pivot at some stage ends up in position 80, then there's absolutely no point recursing into the block above the pivot. You really don't care what order those elements are in, just that they're in the top quartile. But this will considerably increase the complexity of the code compared with just calling a library to sort for you. Unless you really need a minor performance boost, I think you're good as you are.
(*) Unless Ruby arrays have a flag to remember that they're already sorted and haven't been modified since. I don't know whether they do, but if so then using sort! a second time is of course free.
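For the curious, here is a minimal Ruby sketch of the selection idea (quickselect) mentioned above: it picks the k-th smallest element without a full sort. Illustrative only, not production code:
# Quickselect: average O(n), worst case O(n^2) with unlucky pivots.
def quickselect(array, k)
  return nil if array.empty? || k >= array.length
  pivot   = array.sample
  smaller = array.select { |x| x < pivot }
  equal   = array.select { |x| x == pivot }
  if k < smaller.length
    quickselect(smaller, k)
  elsif k < smaller.length + equal.length
    pivot
  else
    quickselect(array.select { |x| x > pivot }, k - smaller.length - equal.length)
  end
end

data = [7, 1, 5, 3, 9, 2, 8, 6, 4]
quickselect(data, (data.length * 0.25).ceil - 1)  # => 3 (25th percentile, same indexing as above)
quickselect(data, (data.length * 0.75).ceil - 1)  # => 7 (75th percentile)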
