Relevance algorithm - mean

Let's say I want to order some products based on two variables: rating and the number of ratings.
For example, say I have these two products:

Product A: rated 4.9 across 10,000 ratings
Product B: rated 5.0 across 1 rating

It's fairly obvious that Product A should come first, probably via a weighted mean, but what weight should each variable get?

Product A has a rating of 4.9 across 10,000 ratings, so (sum of 10,000 votes)/10000 = 4.9,
and therefore the sum of the 10,000 votes = 4.9 * 10000.
To decide which product to rank first, compute:

4.9 * 10000 / 5.0 = X

and

5.0 * 1 / 4.9 = Y

then compare X and Y. It's basically comparing how the reviews stack up against one another.
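The comparison above can be sketched as follows. The `cross_score` function name and the numbers are just taken from the example; this is not an established ranking formula:

```python
def cross_score(avg, count, rival_avg):
    """Total of all votes (avg * count), normalized by the rival's average."""
    return (avg * count) / rival_avg

x = cross_score(4.9, 10000, 5.0)  # Product A measured against B
y = cross_score(5.0, 1, 4.9)      # Product B measured against A

print(x > y)  # True -> Product A wins by a wide margin
```

In practice, a Bayesian average (blending each product's mean with a global prior) is the more common fix for small sample sizes.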

Related

Rails sum all prices inside a record

I have a model Product that belongs_to a Kit, and Kit has_many Products.
How can I sum the prices of every product that belongs to a kit?
I tried this with no luck:
@kitprice = Kit.products.price.sum
Try this:
@kitprice = kit.products.sum(:price)
In my case, I have a Wallet with many operations:
wallet = Wallet.first
amount = wallet.operations.sum(:amount)
The question is slightly ambiguous. Assuming you have an instance of Kit named kit, and a kit has many product objects, the following would get you the desired result:
kit_id = <enter kit id here>
kit = Kit.find_by_id(kit_id)

# iterate through every product object and accumulate its price
sum = 0
kit.products.each { |product| sum = sum + product.price }
puts sum
Basically you need to iterate through every product object and compute the sum, since it's a has_many relationship.
This will give you every kit with the sum of its products (I assume there is a column named name in your Kit model):
@kit_products_price = Kit.includes(:products).group(:name).sum('products.price')
If you want the sum across all kit products:
@kit_price = Kit.includes(:products).sum('products.price')
The following should work, I believe:
Active Record collection:
@total = kit.products.sum("price")
Ruby array:
@total = kit.products.sum(&:price)

What's difference between item-based and content-based collaborative filtering?

I am puzzled about what the item-based recommendation is, as described in the book "Mahout in Action". There is the algorithm in the book:
for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average
How can I calculate the similarity between items? If using the content, isn't it a content-based recommendation?
Item-Based Collaborative Filtering
The original Item-based recommendation is totally based on user-item ranking (e.g., a user rated a movie with 3 stars, or a user "likes" a video). When you compute the similarity between items, you are not supposed to know anything other than all users' history of ratings. So the similarity between items is computed based on the ratings instead of the meta data of item content.
Let me give you an example. Suppose you have only access to some rating data like below:
user 1 likes: movie, cooking
user 2 likes: movie, biking, hiking
user 3 likes: biking, cooking
user 4 likes: hiking
Suppose now you want to make recommendations for user 4.
First you create an inverted index for items, you will get:
movie: user 1, user 2
cooking: user 1, user 3
biking: user 2, user 3
hiking: user 2, user 4
Since this is a binary rating (like or not), we can use a similarity measure like Jaccard Similarity to compute item similarity.
                                  |{user1}|
similarity(movie, cooking) = ----------------------- = 1/3
                             |{user1, user2, user3}|
In the numerator, user1 is the only user that movie and cooking have in common. In the denominator, the union for movie and cooking has 3 distinct users (user1, user2, user3). |.| here denotes the size of the set. So the similarity between movie and cooking is 1/3 in our case. You just do the same thing for all possible item pairs (i, j).
After you are done with the similarity computation for all pairs, say, you need to make a recommendation for user 4.
Look at the similarity scores similarity(hiking, x), where x is any other item you have.
If you need to make a recommendation for user 3, you can aggregate the similarity score from each item in their list. For example,
score(movie) = Similarity(biking, movie) + Similarity(cooking, movie)
score(hiking) = Similarity(biking, hiking) + Similarity(cooking, hiking)
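A minimal sketch of the Jaccard computation and the score aggregation above, using the four users' like-lists from the example:

```python
# inverted index from the example: item -> set of users who like it
likes = {
    "movie":   {"user1", "user2"},
    "cooking": {"user1", "user3"},
    "biking":  {"user2", "user3"},
    "hiking":  {"user2", "user4"},
}

def jaccard(a, b):
    """|intersection| / |union| of the two items' user sets."""
    return len(likes[a] & likes[b]) / len(likes[a] | likes[b])

print(jaccard("movie", "cooking"))  # = 1/3

# aggregate similarity scores for user 3, who likes biking and cooking
user3_items = ["biking", "cooking"]
for candidate in ["movie", "hiking"]:
    score = sum(jaccard(candidate, j) for j in user3_items)
    print(candidate, score)
```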
Content-Based Recommendation
The point of content-based is that we have to know the content of both user and item. Usually you construct user-profile and item-profile using the content of shared attribute space. For example, for a movie, you represent it with the movie stars in it and the genres (using a binary coding for example). For user profile, you can do the same thing based on the users likes some movie stars/genres etc. Then the similarity of user and item can be computed using e.g., cosine similarity.
Here is a concrete example:
Suppose this is our user-profile (using binary encoding, 0 means not-like, 1 means like), which contains user's preference over 5 movie stars and 5 movie genres:
         Movie stars 0-4   Movie genres 0-4
user 1:  0 0 0 1 1         1 1 1 0 0
user 2:  1 1 0 0 0         0 0 0 1 1
user 3:  0 0 0 1 1         1 1 1 1 0
Suppose this is our movie-profile:
         Movie stars 0-4   Movie genres 0-4
movie1:  0 0 0 0 1         1 1 0 0 0
movie2:  1 1 1 0 0         0 0 1 0 1
movie3:  0 0 1 0 1         1 0 1 0 1
To calculate how good a movie is to a user, we use cosine similarity:
                                dot-product(user1, movie1)
similarity(user 1, movie1)  =  ----------------------------
                                  ||user1|| x ||movie1||

                                0x0 + 0x0 + 0x0 + 1x0 + 1x1 + 1x1 + 1x1 + 1x0 + 0x0 + 0x0
                            =  -----------------------------------------------------------
                                                  sqrt(5) x sqrt(3)

                            =  3 / (sqrt(5) x sqrt(3)) = 0.77460
Similarly:
similarity(user 2, movie2) = 3 / (sqrt(4) x sqrt(5)) = 0.67082
similarity(user 3, movie3) = 3 / (sqrt(6) x sqrt(5)) = 0.54772
If you want to give one recommendation for user i, just pick movie j that has the highest similarity(i, j).
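The three cosine similarities can be checked with a short script over the binary profiles above:

```python
import math

# binary profiles from the tables above: 5 movie-star slots + 5 genre slots
users = {
    1: [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
    2: [1, 1, 0, 0, 0, 0, 0, 0, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 1, 1, 1, 0],
}
movies = {
    1: [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    2: [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
    3: [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

for k in (1, 2, 3):
    print(f"similarity(user {k}, movie{k}) = {cosine(users[k], movies[k]):.5f}")
# prints 0.77460, 0.67082, 0.54772
```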
"Item-based" really means "item-similarity-based". You can put whatever similarity metric you like in here. Yes, if it's based on content, like a cosine similarity over term vectors, you could also call this "content-based".

How does Mahout user-based recommendation work?

I am using the generic user-based recommender of the Mahout Taste API to generate recommendations.
I know it recommends based on ratings given by past users, but I don't follow the mathematics behind its selection of the recommended item. For example:

For user id 58:
itemid  rating
231     5
235     5.5
245     5.88

The 3 neighbors, with their item ids and ratings, are:
{231: 4, 254: 5, 262: 2, 226: 5}
{235: 3, 245: 4, 262: 3}
{226: 4, 262: 3}

How does it recommend me 226?
Thanks in advance.
It depends on the UserSimilarity and the UserNeighborhood you have chosen for your recommender. But in general the algorithm works as follows for user u:
for every other user w
  compute a similarity s between u and w
retain the top users, ranked by similarity, as a neighborhood n

for every item i that some user in n has a preference for,
    but that u has no preference for yet
  for every other user v in n that has a preference for i
    compute a similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average
Source: Mahout in Action http://manning.com/owen/
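The loop above can be sketched as follows. This is not Mahout's actual implementation: the cosine-over-co-rated-items similarity is only one of the UserSimilarity choices, and the rating data is an assumption loosely following the question's example.

```python
import math

# hypothetical ratings: user id -> {item id: rating}, loosely from the question
ratings = {
    58: {231: 5.0, 235: 5.5, 245: 5.88},
    1:  {231: 4.0, 254: 5.0, 262: 2.0, 226: 5.0},
    2:  {235: 3.0, 245: 4.0, 262: 3.0},
    3:  {226: 4.0, 262: 3.0},
}

def similarity(u, v):
    """Cosine over co-rated items -- one of many possible similarity choices."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    nv = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(u, neighborhood):
    """Weighted running average of neighbors' preferences for unseen items."""
    seen = set(ratings[u])
    num, den = {}, {}
    for v in neighborhood:
        s = similarity(u, v)
        for i, r in ratings[v].items():
            if i not in seen:
                num[i] = num.get(i, 0.0) + s * r
                den[i] = den.get(i, 0.0) + s
    scored = [(num[i] / den[i], i) for i in num if den[i] > 0]
    return [i for _, i in sorted(scored, reverse=True)]

print(recommend(58, [1, 2, 3]))  # 226 lands at or near the top
```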

Weighted random pick from array in ruby/rails

I have a model in Rails from which I want to pick a random entry.
So far I've done it with a named scope like this:
named_scope :random, lambda { { :order=>'RAND()', :limit => 1 } }
But now I've added an integer field 'weight' to the model representing the probability with which each row should be picked.
How can I now do a weighted random pick?
I've found and tried two methods on snippets.dzone.com that extend the Array class with a weighted random function, but neither worked nor picked weighted random items for me.
I'm using REE 1.8.7 and Rails 2.3.
Maybe I understand this totally wrong, but couldn't you just use the column "weight" as a factor on the random number? (Depending on the DB, some precautions would be necessary to prevent the product from overflowing.)
named_scope :random, lambda { { :order=>'RAND()*weight', :limit => 1 } }
In one query you should:
calculate the total weight
multiply by a random factor, giving a weight threshold
scan again through the table summing, until the weight threshold is reached.
In SQL it would be something like this (not tried for real):
SELECT SUM(weight) INTO @totalwt FROM table;
SET @lim := FLOOR(RAND() * @totalwt);
SELECT id, weight, @total := @total + weight AS cumulativeWeight
FROM table, (SELECT @total := 0) AS t
WHERE @total < @lim;
Inspired by Optimal query to fetch a cumulative sum in MySQL
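The same cumulative-sum scan can be done in application code; a minimal sketch, where the rows list stands in for your table:

```python
import random

def weighted_pick(rows):
    """rows: (id, weight) pairs; picks an id with probability proportional to weight."""
    total = sum(weight for _, weight in rows)
    threshold = random.uniform(0, total)
    cumulative = 0.0
    for row_id, weight in rows:
        cumulative += weight
        if cumulative >= threshold:
            return row_id
    return rows[-1][0]  # guard against float rounding

rows = [("a", 1), ("b", 3), ("c", 6)]
print(weighted_pick(rows))  # "c" about 60% of the time
```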

How to calculate mean based on number of votes/scores/samples/etc?

For simplicity say we have a sample set of possible scores {0, 1, 2}. Is there a way to calculate a mean based on the number of scores without getting into hairy lookup tables etc for a 95% confidence interval calculation?
dreeves posted a solution to this here: How can I calculate a fair overall game score based on a variable number of matches?
Now say we have 2 scenarios ...
Scenario A) 2 votes of value 2 result in SE = 0, making the mean 2
Scenario B) 10000 votes of value 2 result in SE = 0, making the mean 2
I wanted Scenario A to come out as some value less than 2 because of the low number of votes, but it doesn't seem like this solution handles that (dreeves's equations only hold when the values in your set aren't all equal to each other). Am I missing something, or is there another algorithm I can use to calculate a better score?
The data available to me is:
n (number of votes)
sum (sum of votes)
{set of votes} (all vote values)
Thanks!
You could just give it a weighted score when ranking results, as opposed to just displaying the average vote so far, by multiplying with some function of the number of votes.
An example in C# (because that's what I happen to know best...) that could easily be translated into your language of choice:
double avgScore = Math.Round(sum / n);
double rank = avgScore * Math.Log(n);
Here I've used the logarithm of n as the weighting function, but it will only work well if the number of votes is neither too small nor too large. Exactly how large is "optimal" depends on how much you want the number of votes to matter.
If you like the logarithmic approach but the default base (Math.Log(n) uses the natural logarithm, base e) doesn't work well with your vote counts, you can easily use another base. For example, to do it in base 3 instead:
double rank = avgScore * Math.Log(n, 3);
Which function you should use for weighing is probably best decided by the order of magnitude of the number of votes you expect to reach.
You could also use a custom weighting function by defining
double rank = avgScore * w(n);
where w(n) returns the weight value depending on the number of votes. You then define w(n) as you wish, for example like this:
double w(int n) {
    // caution! ugly example code ahead...
    // if you even want this approach, at least use a switch... :P
    if (n > 100) {
        return 10;
    } else if (n > 50) {
        return 8;
    } else if (n > 40) {
        return 6;
    } else if (n > 20) {
        return 3;
    } else if (n > 10) {
        return 2;
    } else {
        return 1;
    }
}
If you want to use the idea in my other referenced answer (thanks!) of using a pessimistic lower bound on the average then I think some additional assumptions/parameters are going to need to be injected.
To make sure I understand: With 10000 votes, every single one of which is "2", you're very sure the true average is 2. With 2 votes, each a "2", you're very unsure -- maybe some 0's and 1's will come in and bring down the average. But how to quantify that, I think is your question.
Here's an idea: everyone starts with some "baggage": a single phantom vote of "1". The person with 2 true "2" votes will then have an average of (1+2+2)/3 = 1.67, where the person with 10000 true "2" votes will have an average of 20001/10001 = 1.9999. That alone may satisfy your criteria. Or, adding the pessimistic lower bound idea, the person with 2 votes would have a pessimistic average score of 1.333 and the person with 10k votes would be 1.99948.
(To be absolutely sure you'll never have the problem of zero standard error, use two different phantom votes. Or perhaps use as many phantom votes as there are possible vote values, one vote with each value.)
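The phantom-vote idea is easy to sketch. The phantom values here are the ones from the answer: a single "1", or one vote per possible value {0, 1, 2} to guarantee nonzero spread:

```python
def phantom_average(votes, phantoms=(1,)):
    """Mean with phantom vote(s) mixed in, penalizing small samples."""
    all_votes = list(votes) + list(phantoms)
    return sum(all_votes) / len(all_votes)

print(phantom_average([2, 2]))       # (1+2+2)/3 ≈ 1.667
print(phantom_average([2] * 10000))  # 20001/10001 ≈ 1.9999

# one phantom vote per possible value {0, 1, 2} guarantees nonzero spread
print(phantom_average([2, 2], phantoms=(0, 1, 2)))  # (0+1+2+2+2)/5 = 1.4
```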
