I am using the generic user-based recommender of the Mahout Taste API to generate recommendations. I know it recommends based on the ratings given by similar past users, but I don't understand the mathematics behind its selection of the recommended item. For example:
for user id 58:

itemid  rating
231     5
235     5.5
245     5.88

Its 3 neighbors, with item ids and ratings, are:

{231 4, 254 5, 262 2, 226 5}
{235 3, 245 4, 262 3}
{226 4, 262 3}
How does it arrive at recommending item 226?
Thanks in advance.
It depends on the UserSimilarity and the UserNeighborhood you have chosen for your recommender. But in general the algorithm works as follows for user u:
for every other user w
  compute a similarity s between u and w
retain the top users, ranked by similarity, as a neighborhood n
for every item i that some user in n has a preference for, but that u has no preference for yet
  for every other user v in n that has a preference for i
    compute a similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average
Source: Mahout in Action http://manning.com/owen/
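A minimal plain-Python sketch of the steps above, using the data from the question (the cosine similarity over co-rated items and the neighborhood size are illustrative choices only; Mahout itself is Java and lets you plug in any UserSimilarity and UserNeighborhood):

```python
from math import sqrt

ratings = {  # user -> {item: rating}; user 58 and its 3 neighbors from the question
    58: {231: 5.0, 235: 5.5, 245: 5.88},
    1:  {231: 4, 254: 5, 262: 2, 226: 5},
    2:  {235: 3, 245: 4, 262: 3},
    3:  {226: 4, 262: 3},
}

def similarity(u, w):
    """Cosine similarity over the items both users rated (one possible UserSimilarity)."""
    common = set(ratings[u]) & set(ratings[w])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[w][i] for i in common)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_w = sqrt(sum(ratings[w][i] ** 2 for i in common))
    return dot / (norm_u * norm_w)

def recommend(u, n_neighbors=3):
    # neighborhood: the top users ranked by similarity to u
    hood = sorted((w for w in ratings if w != u),
                  key=lambda w: similarity(u, w), reverse=True)[:n_neighbors]
    scores = {}  # item -> (weighted sum of ratings, sum of weights)
    for w in hood:
        s = similarity(u, w)
        for item, r in ratings[w].items():
            if item not in ratings[u]:  # only items u has no preference for yet
                num, den = scores.get(item, (0.0, 0.0))
                scores[item] = (num + s * r, den + s)
    # estimated preference = similarity-weighted average of the neighbors' ratings
    return sorted(((num / den, item) for item, (num, den) in scores.items() if den > 0),
                  reverse=True)

print(recommend(58))  # items 254 and 226 get the top estimate of 5.0
```

Under this toy similarity, 226 comes out on top because the neighbor most similar to user 58 rated it 5, while 262 is dragged down by the low ratings it received.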
Related
I am working on a Q-learning algorithm for rummy. I have to generate a Q-table indexed as Q[state, action]. Since in rummy the actions are either pick or drop, the action dimension is 2, but what is the number of states? (Question 1)
For now the cards in the deck/pile/stash are A, 1, 2, 3, 4, 5, 6 and 7 of each suit, that's about 28 cards + 4 aces. Does that mean I have 32 states? If so, what does it actually mean when I update a value in the Q-table? (Question 2)
In the above case, how do I design a reward table? (Question 3)
Help appreciated.
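For reference, this is what a Q-table update looks like in plain Python. The 32-state encoding (say, one state per visible card) is only an assumption about the design being asked about, and the reward value is a placeholder, not a claim about the right reward table for rummy:

```python
N_STATES, N_ACTIONS = 32, 2   # assumed: one state per card; PICK = 0, DROP = 1
ALPHA, GAMMA = 0.1, 0.9       # learning rate and discount factor

# the Q-table: expected long-term value of taking each action in each state
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def update(state, action, reward, next_state):
    """Standard Q-learning update: move Q[s][a] toward r + gamma * max_a' Q[s'][a']."""
    best_next = max(Q[next_state])
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])

# e.g. picking a card in state 3 led to state 7 with a placeholder reward of +1
update(3, 0, 1.0, 7)
print(Q[3][0])  # 0.1
```

"Updating a Q-table value" means exactly this: nudging the stored estimate for a (state, action) pair toward the observed reward plus the discounted value of the best action from the next state.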
Assume that a tourist has no idea which city to visit. I want to recommend the top 10 cities based on his preferences for city features (budgetToTravel, isCoastel, isHistorical, withFamily, etc.).
My dataset contains features for every city, for example:
Venice, Italy (budgetToTravel='5000', isCoastel=1, isHistorical=1, withFamily=1, ...)
Berlin, Germany (budgetToTravel='6000', isHistorical=1, isCoastel=0, withFamily=1, ...).
I want to know the best machine learning algorithm to recommend the top 10 cities to visit based on the tourist's features.
As Pierre S. stated, you can start with KNearestNeighbours.
This algorithm will let you do exactly what you want:

from sklearn.neighbors import NearestNeighbors

n_cities_to_recommend = 10
neigh = NearestNeighbors(n_neighbors=2, radius=1.0)  # tune radius, or scale your data to [0, 1] with a scaler first
neigh.fit(cities)
user_input = [budgetToTravel, isCoastel, isHistorical, withFamily, ...]  # same feature order as in cities
neigh.kneighbors([user_input], n_cities_to_recommend, return_distance=False)  # returns the ids of the closest cities
You can use an (unsupervised) clustering algorithm like Hierarchical Clustering or K-Means Clustering to cluster the cities, and then match the person's (tourist's) features against the clusters to pick the top 10 cities.
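A minimal plain-Python sketch of the K-Means idea (in practice you would use scikit-learn's KMeans, and you should scale a large feature like budgetToTravel so it doesn't dominate the distance; the toy points below are made up):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a list of vectors."""
    return [sum(xs) / len(points) for xs in zip(*points)]

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centroids = [list(p) for p in random.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        # recompute centroids (keep the old one if a cluster went empty)
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

def nearest_cluster(tourist, centroids):
    """Match the tourist's feature vector to the closest cluster."""
    return min(range(len(centroids)), key=lambda c: dist2(tourist, centroids[c]))
```

Once the tourist is matched to a cluster, you recommend the cities assigned to that cluster (ranked by distance if you need exactly 10).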
Currently I'm using PROC DISCRIM in SAS to run a kNN analysis on a data set, but the problem requires me to get the top-k neighbor list for each row in my table. How can I get this list from SAS?
Thanks for the answer, but I'm looking for the neighbor list for each data point. For example, given the data set:

name   age  zipcode  alcohol
John   26   08439    yes
Cathy  49   47789    no
Smith  37   90897    no
Tom    34   88642    yes

I need the list:

name   neighbor1  neighbor2
John   Tom        Cathy
Cathy  Tom        Smith
Smith  Cathy      Tom
Tom    John       Cathy

I could not find this output in SAS. Is there any way I can program it to get this list? Thank you!
I am not a SAS user, but a quick web lookup seems to give good answers to your problem:
As far as I know you do not have to implement it yourself; DISCRIM is enough.
Code for iris data from http://www.sas-programming.com/2010/05/k-nearest-neighbor-in-sas.html
ods select none;
proc surveyselect data=iris out=iris2
  samprate=0.5 method=srs outall;
run;
ods select all;

%let k=5;
proc discrim data=iris2(where=(selected=1))
  test=iris2(where=(selected=0))
  testout=iris2testout
  method=NPAR k=&k
  listerr crosslisterr;
  class Species;
  var SepalLength SepalWidth PetalLength PetalWidth;
  title2 'Using KNN on Iris Data';
run;
A long and detailed description is also available here:
http://analytics.ncsu.edu/sesug/2012/SD-09.pdf
And from the SAS community:
Simply ask PROC DISCRIM to use the nonparametric method with the options "METHOD=NPAR K=". Note: do not use the "R=" option at the same time; it selects the radius-based nearest-neighbor method instead. Also pay attention to how PROC DISCRIM treats categorical data automatically; sometimes you may want to convert categorical data into metric coordinates in advance. Since PROC DISCRIM doesn't output the tree it builds internally, use the "data= test= testout=" options to score a new data set.
I am puzzled about what item-based recommendation is, as described in the book "Mahout in Action". Here is the algorithm from the book:
for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average
How can I calculate the similarity between items? If using the content, isn't it a content-based recommendation?
Item-Based Collaborative Filtering
The original item-based recommendation is based entirely on user-item ratings (e.g., a user rated a movie with 3 stars, or a user "likes" a video). When you compute the similarity between items, you are not supposed to know anything other than all users' rating history. So the similarity between items is computed from the ratings rather than from the metadata of the item content.
Let me give you an example. Suppose you have only access to some rating data like below:
user 1 likes: movie, cooking
user 2 likes: movie, biking, hiking
user 3 likes: biking, cooking
user 4 likes: hiking
Suppose now you want to make recommendations for user 4.
First you create an inverted index for items, you will get:
movie: user 1, user 2
cooking: user 1, user 3
biking: user 2, user 3
hiking: user 2, user 4
Since this is a binary rating (like or not), we can use a similarity measure like Jaccard Similarity to compute item similarity.
                                  |user1|
    similarity(movie, cooking) = ------------ = 1/3
                                  |user1,2,3|

In the numerator, user1 is the only user that movie and cooking have in common. In the denominator, the union of movie and cooking has 3 distinct users (user1, 2, 3). |.| here denotes the size of a set. So the similarity between movie and cooking is 1/3 in our case. You just do the same thing for all possible item pairs (i, j).
After you are done with the similarity computation for all pairs, say you need to make a recommendation for user 4.
You look at the similarity scores similarity(hiking, x), where x is any other item.
If you need to make a recommendation for user 3, you aggregate the similarity scores over each item in user 3's list. For example,
score(movie) = Similarity(biking, movie) + Similarity(cooking, movie)
score(hiking) = Similarity(biking, hiking) + Similarity(cooking, hiking)
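The worked example above can be sketched in a few lines of plain Python (the jaccard helper and the dictionary layout are just for illustration):

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of users: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)

# the inverted index from the example: item -> users who like it
index = {
    "movie":   {"user1", "user2"},
    "cooking": {"user1", "user3"},
    "biking":  {"user2", "user3"},
    "hiking":  {"user2", "user4"},
}

# recommendation scores for user 3: aggregate similarity to each item in user 3's list
user3_likes = {"biking", "cooking"}
candidates = set(index) - user3_likes
scores = {c: sum(jaccard(index[c], index[j]) for j in user3_likes)
          for c in candidates}

print(scores)  # movie scores 1/3 + 1/3 = 2/3, hiking scores 1/3 + 0 = 1/3
```

So movie would be recommended to user 3 ahead of hiking, matching the score(movie) and score(hiking) sums above.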
Content-Based Recommendation
The point of content-based recommendation is that we have to know the content of both user and item. Usually you construct a user profile and an item profile over a shared attribute space. For example, for a movie, you represent it by the movie stars in it and its genres (using a binary encoding, for example). For the user profile, you can do the same thing based on the movie stars/genres the user likes. Then the similarity of user and item can be computed with, e.g., cosine similarity.
Here is a concrete example:
Suppose this is our user-profile (using binary encoding, 0 means not-like, 1 means like), which contains user's preference over 5 movie stars and 5 movie genres:
         Movie stars 0-4   Movie genres 0-4
user 1:  0 0 0 1 1         1 1 1 0 0
user 2:  1 1 0 0 0         0 0 0 1 1
user 3:  0 0 0 1 1         1 1 1 1 0

Suppose this is our movie-profile:

         Movie stars 0-4   Movie genres 0-4
movie1:  0 0 0 0 1         1 1 0 0 0
movie2:  1 1 1 0 0         0 0 1 0 1
movie3:  0 0 1 0 1         1 0 1 0 1
To calculate how good a movie is to a user, we use cosine similarity:
                                 dot-product(user1, movie1)
    similarity(user 1, movie1) = --------------------------
                                   ||user1|| x ||movie1||

                                 0x0 + 0x0 + 0x0 + 1x0 + 1x1 + 1x1 + 1x1 + 1x0 + 0x0 + 0x0
                               = ---------------------------------------------------------
                                                  sqrt(5) x sqrt(3)

                               = 3 / (sqrt(5) x sqrt(3)) = 0.77460
Similarly:
similarity(user 2, movie2) = 3 / (sqrt(4) x sqrt(5)) = 0.67082
similarity(user 3, movie3) = 3 / (sqrt(6) x sqrt(5)) = 0.54772
If you want to give one recommendation for user i, just pick movie j that has the highest similarity(i, j).
"Item-based" really means "item-similarity-based". You can put whatever similarity metric you like in here. Yes, if it's based on content, like a cosine similarity over term vectors, you could also call this "content-based".
Let's say I want to order some products based on two variables: rating and the number of ratings.
For example, let's say I have these 2 products:
Product A
4.9 of 10000
Product B
5.0 of 1
It's kind of obvious that product A should come first, probably using a weighted mean, but what weight should each variable get?
Product A has a rating of 4.9 from 10000 ratings, so (sum of 10000 votes)/10000 = 4.9.
Therefore, the sum of the 10000 votes = 4.9 * 10000.
If you need to decide which product to choose, you compute:

4.9 * 10000 / 5 = X

and

5 * 1 / 4.9 = Y

Then compare X and Y. It's basically comparing how the reviews stack up against one another.
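That comparison in Python (note this is just the heuristic described above: each product's vote total divided by the other's rating; a damped or Bayesian average is the more common fix for the few-ratings problem):

```python
def cross_score(rating, count, other_rating):
    """Total votes (rating * count) measured against the other product's rating."""
    return rating * count / other_rating

x = cross_score(4.9, 10000, 5.0)  # Product A against Product B's rating
y = cross_score(5.0, 1, 4.9)      # Product B against Product A's rating

print(x, y, x > y)  # 9800.0 vs ~1.02, so Product A comes first
```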