Is there a way to limit the rating output from a matrix factorization algorithm? I had a matrix with ratings ranging from 1 to 5, but after training the model, some movies are rated above 5. Is that normal? Is there a way to normalize the ratings so that predictions stay strictly within the 1-5 range?
In the link below, https://www.researchgate.net/publication/282663370_Film_Recommendation_Systems_using_Matrix_Factorization_and_Collaborative_Filtering, they mention normalization before running the algorithm. I haven't been able to try it yet, but I don't have the feeling it will solve the problem.
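One common and simple workaround is to clamp the model's predictions back into the valid range after training, rather than (or in addition to) normalizing beforehand. A minimal NumPy sketch, with made-up prediction values for illustration:

```python
import numpy as np

# Hypothetical predictions from a matrix factorization model (e.g. U @ V.T);
# note some values fall outside the 1-5 rating scale
predicted = np.array([[4.7, 5.6, 0.4],
                      [3.2, 6.1, 2.9]])

# Clip out-of-range predictions back into the valid 1-5 range
clipped = np.clip(predicted, 1.0, 5.0)
print(clipped)
```

Clipping does not change in-range predictions, so ranking order among valid predictions is preserved.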
I'm following a tutorial for creating a recommendation system in BigQuery ML. The tutorial first uses matrix factorization to calculate user and item factors. In the end, I have a model that can be queried with user IDs or item IDs to get recommendations.
The next step is feeding the factors and additional item + user features into a linear regression model to incorporate more context.
"Essentially, we have a couple of attributes about the movie, the
product factors array corresponding to the movie, a couple of
attributes about the user, and the user factors array corresponding to
the user. These form the inputs to our “hybrid” recommendations model
that builds off the matrix factorization model and adds in metadata
about users and movies."
I just don't understand why the dataset for linear regression excludes the user and item ids:
SELECT
  p.* EXCEPT(movieId),
  u.* EXCEPT(userId),
  rating
FROM productFeatures p, userFeatures u
JOIN movielens.ratings r
  ON r.movieId = p.movieId AND r.userId = u.userId
My question is:
How will I be able to get recommendations for a user from the linear model, when I don't have the user or item ids in the model?
Here you can find the full code:
https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive2/recommendation_systems/solutions/als_bqml_hybrid.ipynb
In the example you have shared, the goal is to fit a linear regression to the discovered factor values so that a novel set of factor values can be used to predict the rating. In this kind of setup, you don't want information about which samples are being used; the only crucial information is the training features (the factor scores) and the rating (the training/test label). For more on this topic, take a look at "Dimensionality reduction using non-negative matrix factorization for information retrieval."
If you included the movie IDs and user IDs as features, your regression would try to learn from them, which would either add noise to the model or learn spurious patterns such as "low IDs = lower score". This is possible especially if the IDs are in some kind of order you're not aware of, such as chronological or by genre.
Note: You could use movie-specific or user-specific information to build a model, but you would have many, many dimensions of data, and that tends to create poorly performing models. The idea here is to avoid the problem of dimensionality by first reducing the dimensionality of the problem space. Matrix factorization is just one method among many to do this. See, for example, PCA, LDA, and word2vec.
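To make the setup concrete, here is a hypothetical sketch (random stand-in data, scikit-learn assumed) of fitting a regression on factor values only, with no ID columns, so any novel pair of factor vectors can be scored:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical latent factors: 100 (user, movie) training pairs,
# 8 factors per side, plus the observed rating for each pair
user_factors = rng.normal(size=(100, 8))
movie_factors = rng.normal(size=(100, 8))
ratings = rng.uniform(1, 5, size=100)

# The regression sees only factor values -- no userId/movieId columns
X = np.hstack([user_factors, movie_factors])
model = LinearRegression().fit(X, ratings)

# A novel (user factors, movie factors) pair can now be scored directly
new_pair = np.hstack([rng.normal(size=8), rng.normal(size=8)]).reshape(1, -1)
print(model.predict(new_pair))
```

At prediction time you look up the stored factor vectors for a given user and movie (from the factorization model) and feed them in; the IDs are only used for that lookup, never as model inputs.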
Let's say that we have a database where each row includes:
a person vector such as (1,0, ..., 1)
a movie vector (1,1, ..., 0)
the rating that the person gave to the content
What machine learning algorithm would be able to establish a movie ranking based on the user vector?
Thank you in advance for your answers!
Your question is a bit vague, but I will try to answer. It sounds like you are asking which algorithms can be used if you already have the latent factors. How did you obtain these latent factors? Are they part of the dataset? If they are given rather than derived, it is going to be hard to predict the movie rating from them. Do you have ratings as part of this dataset? If so, combining everything, you can use matrix factorization or clustering to obtain movie rankings.
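Assuming the person vectors, movie vectors, and ratings are all given, one simple baseline is to train a regressor on the concatenated vectors and then sort movies by predicted rating for a given user. A sketch with made-up random data (the sizes and model choice are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Hypothetical data: binary person/movie vectors and the rating given
person = rng.integers(0, 2, size=(200, 10))
movie = rng.integers(0, 2, size=(200, 10))
rating = rng.integers(1, 6, size=200)

# Train on (person vector, movie vector) -> rating
X = np.hstack([person, movie])
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, rating)

# Rank all candidate movies for one user by predicted rating
user_vec = person[0]
candidates = np.hstack([np.tile(user_vec, (len(movie), 1)), movie])
ranking = np.argsort(-model.predict(candidates))  # best-scored movie first
print(ranking[:5])
```
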
I have this use case and want to build an ML model around it.
Based on the purchase history, I have to predict whether a user will buy a product or not.
A product has these attributes:
ItemCategory: e.g. Shoes, Accessories, Jewellery
Color: e.g. Black, Red
PriceBucket: e.g. 500-1000, 1000-1500
The user has some % liking for each color, price bucket, and item category;
e.g., user u1 likes black 30%, red 20%, shoes 10%.
These % likings are calculated from the user's purchase history.
Now suppose we match user u1's profile against all products; we have to predict whether the user will buy each product or not.
        ItemCategory   PriceBucket   Color   Buy
item1   30%            20%           10%     1
item2   20%            15%           30%     0
item3   10%            50%           40%     1
Buy 1/0 denotes whether user has actually bought this item or not.
I have tried TensorFlow's LinearClassifier but am getting very low accuracy.
Please suggest what model could be used here.
There can be many reasons for low accuracy.
I would suggest doing some pre-processing before feeding the data into a linear model.
Since you have only 3 dimensions/features, the model cannot extract much information from the data. It is very likely that your model over-fits or under-fits to one of these three features. Try adding more features if you have them, or increase the number of training samples; even so, the chance of achieving a decent result remains low because of the low dimensionality.
Also experiment with other models such as Decision Tree Classifier, Gaussian Naive Bayes, Gradient Boosting Classifier, SVM, Random Forest, and k-nearest neighbors, and perform cross-validation to evaluate the performance of each classifier.
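A hypothetical sketch of such an experiment with scikit-learn, using synthetic stand-in data for the three percentage features (models untuned, purely to show the comparison loop):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Synthetic stand-in: three percentage features and a buy/no-buy label
X = rng.uniform(0, 1, size=(300, 3))
y = (X.sum(axis=1) + rng.normal(scale=0.3, size=300) > 1.5).astype(int)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
    "svm": SVC(),
}

# 5-fold cross-validated accuracy for each candidate model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```
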
One of the reasons for low accuracy could be an imbalanced dataset, i.e. the ratio of buy values (0 vs. 1) is greater than 2. If this is the case, use simple techniques such as under-sampling, and then try different classification models on the balanced data. In your case, a random forest will probably do a good job; play around with its parameters to avoid under/over-fitting.
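A minimal sketch of random under-sampling on synthetic labels (in practice, libraries such as imbalanced-learn provide ready-made samplers):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical imbalanced data: far more 0s (no buy) than 1s (buy)
y = np.array([0] * 800 + [1] * 200)
X = rng.uniform(0, 1, size=(1000, 3))

# Random under-sampling: keep all minority rows,
# draw an equal number of majority rows without replacement
minority = np.flatnonzero(y == 1)
majority = rng.choice(np.flatnonzero(y == 0), size=len(minority), replace=False)
keep = np.concatenate([minority, majority])
X_bal, y_bal = X[keep], y[keep]
print(y_bal.mean())  # -> 0.5
```
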
I collected some reviews on Books, DVD, Mobile, and Camera from www.amazon.com. I labelled reviews with 1 star as negative and 5 stars as positive. The ratio of negative to positive reviews is 1:5. The collected reviews were converted into a document-term matrix, and a few features were selected using the chi-square feature subset selection method along with some feature selection methods we proposed. We employed classification algorithms such as MLP, SVM, DT, etc. to classify the samples, and reported the results under a 10-fold cross-validation framework.
To compare our results with a baseline, the reviewers asked me to perform a human evaluation. How should the annotation be performed here? Should we employ annotators on randomly selected samples, or should we annotate all samples?
My professor is asking me to annotate all samples, divide the dataset into 10 folds, and then average the accuracy of the annotators' responses over the 10 folds to compare with our results.
In the literature I found, annotation is performed on randomly selected samples. Any reference suggested in this regard would be quite helpful to me.
Thanks in advance.
I have data on various sellers on an e-commerce platform. I am trying to compute a seller ranking score based on various features, such as
1] Order fulfillment rate [numeric]
2] Order cancel rate [numeric]
3] User rating [1-5] {1-2: Worst, 3: Average, 5: Good} [categorical]
4] Time taken to confirm the order (the shorter, the better) [numeric]
My first instinct was to normalize all the features, then multiply each feature by some weight and add them together to obtain each seller's score. Finally, find the relative ranking of sellers based on this score.
My Seller score equation looks like
Seller score = w1* Order fulfillment rates - w2*Order cancel rate + w3 * User rating + w4 * Time taken to confirm order
where, w1,w2,w3,w4 are weights.
My question is threefold:
Are there better algorithms/approaches to solve this problem? I.e., I combined the various features linearly; is there a better approach to building the ranking system?
How do I come up with values for the weights?
Apart from the features above, a few more I can think of are the ratio of positive to negative reviews, the rate of damaged goods, etc. How would these fit into my score equation?
How do I incorporate numeric and categorical variables into the seller ranking score? (I have a few categorical variables.)
Is there an accepted way to weight multivariate systems like this?
I would suggest the following approach:
First of all, keep all the features you have available in a matrix, whether you consider them useful or not.
(Hint: categorical variables can be converted to numerical by simple encoding, so you can easily incorporate them, exactly as you encoded the user rating.)
Then, apply a dimensionality reduction algorithm, such as Singular Value Decomposition (SVD), in order to keep only the most significant variables. Applying SVD may surprise you as to which features are significant and which aren't.
After applying SVD, choosing the right weights for the n most important features you decided to keep is really up to you, because the choice is purely qualitative and domain-dependent (as far as which features are more important).
The only way you could possibly calculate weights in a formalistic way is if the features were directly connected to something, e.g., revenue. Since this very rarely occurs, I suggest manually defining the weights; but for the sake of normalization, set:
w1 + w2 + ... + wn = 1
That is, split the "total importance" among the features you selected in a manner that sums to 1.
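A hypothetical sketch of this recipe on synthetic data (the weights, signs, and feature count are illustrative, not recommendations): SVD to inspect how much variance each component carries, then a normalized weighted sum for the final score:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical seller feature matrix: fulfillment rate, cancel rate,
# encoded user rating, confirmation time (all min-max scaled to [0, 1])
features = rng.uniform(0, 1, size=(50, 4))

# SVD: the singular values indicate how much variance each component explains,
# which helps decide how many features/components are worth keeping
_, singular_values, _ = np.linalg.svd(features, full_matrices=False)
print(singular_values)

# Manually chosen raw weights, normalized so they sum to 1
raw = np.array([0.4, 0.3, 0.2, 0.1])
weights = raw / raw.sum()

# Cancel rate and confirmation time hurt the score, so negate those terms
signs = np.array([1, -1, 1, -1])
scores = features @ (signs * weights)
ranking = np.argsort(-scores)  # best seller first
print(ranking[:5])
```
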