I'm following a tutorial for creating a recommendation system in BigQuery ML. The tutorial first uses matrix factorization to calculate user and item factors. In the end I have a model that can be queried with user IDs or item IDs to get recommendations.
The next step is feeding the factors and additional item + user features into a linear regression model to incorporate more context.
"Essentially, we have a couple of attributes about the movie, the
product factors array corresponding to the movie, a couple of
attributes about the user, and the user factors array corresponding to
the user. These form the inputs to our “hybrid” recommendations model
that builds off the matrix factorization model and adds in metadata
about users and movies."
I just don't understand why the dataset for linear regression excludes the user and item ids:
SELECT
  p.* EXCEPT(movieId),
  u.* EXCEPT(userId),
  rating
FROM productFeatures p, userFeatures u
JOIN movielens.ratings r
  ON r.movieId = p.movieId AND r.userId = u.userId
My question is:
How will I be able to get recommendations for a user from the linear model, when I don't have the user or item ids in the model?
Here you can find the full code:
https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive2/recommendation_systems/solutions/als_bqml_hybrid.ipynb
In the example you have shared, the goal is to fit a linear regression to the discovered factor values so that a novel set of factor values can be used to predict the rating. In this kind of setup, you don't want information about which samples are being used; the only crucial information is the training features (the factor scores) and the rating (the training/test label). For more on this topic, take a look at "Dimensionality reduction using non-negative matrix factorization for information retrieval."
If you included the movie IDs and user IDs as features, your regression would try to learn from them, which would either add noise to the model or learn spurious patterns such as "low IDs = lower score". This is possible especially if the IDs follow some order you're not aware of, such as chronological or by genre.
Note: You could use movie-specific or user-specific information to build a model, but you would have many, many dimensions of data, and that tends to create poorly performing models. The idea here is to avoid the curse of dimensionality by first reducing the dimensionality of the problem space. Matrix factorization is just one method among many to do this. See, for example, PCA, LDA, and word2vec.
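At prediction time the IDs don't disappear from your pipeline; they are simply used outside the regression, as join keys to look up the factor arrays and metadata that the model does see. Here is a rough Python sketch of the idea (column names and values are made up for illustration; the notebook does the equivalent with SQL joins and ML.PREDICT):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical per-movie table: factor columns plus metadata (names are made up).
product_features = pd.DataFrame({
    "movieId": [1, 2, 3],
    "p_factor_1": [0.1, 0.5, -0.3],
    "p_factor_2": [0.7, -0.2, 0.4],
    "movie_year": [1995, 2001, 2010],
}).set_index("movieId")

# Hypothetical per-user table: factor columns plus metadata.
user_features = pd.DataFrame({
    "userId": [10, 20],
    "u_factor_1": [0.3, -0.1],
    "u_factor_2": [-0.4, 0.6],
    "avg_rating": [3.5, 4.1],
}).set_index("userId")

# Training: ratings joined to both tables; the ids are only join keys.
ratings = pd.DataFrame({"userId": [10, 20, 10], "movieId": [1, 2, 3], "rating": [4.0, 3.0, 5.0]})
train = ratings.join(product_features, on="movieId").join(user_features, on="userId")
X = train.drop(columns=["userId", "movieId", "rating"])   # ids excluded, as in the SQL
model = LinearRegression().fit(X, train["rating"])

# Serving: to score user 20 on movie 1, look the factor/metadata rows up by id first,
# then hand only the feature columns to the model.
row = pd.concat([product_features.loc[1], user_features.loc[20]]).to_frame().T
print(model.predict(row[X.columns]))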
Background
I am trying to create a model that can predict Type 2 diabetes in a patient based on MRI scans of their thigh muscle. Previous literature has shown that fat deposition in the muscle of the femur is linked to Type 2 diabetes, so there is some valid relationship here.
I have a dataset comprising several hundred patients. I am analyzing radiomics features of their MRI scans, which are basically quantitative imaging features (think things like texture, intensity, variance of texture in a specific direction, etc.). The kicker here is that an MRI scan is a three-dimensional object, but I have radiomics features for each of the 2D slices, not radiomics of the entire 3D thigh muscle. So this dataset has repeated rows for each patient, "multiple records for one observation." My objective is to output a binary Yes/No classification of T2DM for a single patient.
Problem Description
Based on some initial exploratory data analysis, I think the key here is that some slices are more informative than others. For example, one thing I tried was to group the slices by patient, and then analyze each slice in feature hyperspace. I selected the slice with the furthest distance from the center of all the other slices in feature hyperspace and used only that slice for the patient.
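Concretely, that selection step looked roughly like this (a simplified sketch with made-up feature names, not my actual pipeline):

import numpy as np
import pandas as pd

# Toy frame: several slices per patient, numeric radiomics features (names made up).
df = pd.DataFrame({
    "patient": ["a", "a", "a", "b", "b"],
    "texture": [0.2, 0.3, 0.9, 0.5, 0.4],
    "median_intensity": [1.0, 1.1, 2.5, 0.7, 0.8],
})

def furthest_slice(group):
    feats = group.drop(columns=["patient"]).to_numpy()
    center = feats.mean(axis=0)                      # centroid of this patient's slices
    dists = np.linalg.norm(feats - center, axis=1)   # distance of each slice to the centroid
    return group.iloc[dists.argmax()]                # keep the most "unusual" slice

one_row_per_patient = df.groupby("patient").apply(furthest_slice)
print(one_row_per_patient)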
I have also tried just aggregating all the features, so that each patient is reduced to a single row but has many more features. For example, there might be a feature called "median intensity"; the patient will then have 5 features, called "median intensity__mean", "median intensity__median", "median intensity__max", and so forth. These aggregations are across all the slices that belong to that patient. This did not work well and yielded an AUC of 0.5.
I'm trying to find a way to select the most informative slices for each patient that will then be used for the classification; or an informative way of reducing all the records for a single observation down to a single record.
Solution Thoughts
One thing I'm thinking is that it would probably be best to train some sort of neural net to learn which slices to pick before feeding those slices to another classifier. Effectively, the neural net would learn a linear transformation applied to the matrix of (slices, features) for each patient, so that some slices are upweighted while others are downweighted. Then I could compute the mean along the slice axis and use that as input to the final classifier. If you have examples of code for how this would work, please share; I'm not sure how you would hook up the loss function of the final classifier (in my case, an LGBMClassifier) to the neural net so that backpropagation flows from the final classification all the way through the ensemble model.
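To make the idea concrete, here is a rough sketch of what I have in mind, assuming PyTorch (shapes, layer sizes, and names are placeholders). Since LightGBM is not differentiable, the sketch trains the slice weighting end-to-end with a small linear head; only afterwards would the pooled per-patient vectors be handed to the LGBMClassifier:

import torch
import torch.nn as nn

class SlicePooler(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.score = nn.Linear(n_features, 1)   # per-slice importance score
        self.head = nn.Linear(n_features, 1)    # small differentiable classifier head

    def forward(self, x):                       # x: (batch, slices, features)
        weights = torch.softmax(self.score(x), dim=1)   # (batch, slices, 1)
        pooled = (weights * x).sum(dim=1)       # learned weighted mean over slices
        return self.head(pooled).squeeze(-1), pooled

model = SlicePooler(n_features=8)
x = torch.randn(4, 20, 8)                       # 4 patients, 20 slices, 8 features
y = torch.tensor([0., 1., 0., 1.])              # T2DM yes/no per patient
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(10):
    logits, pooled = model(x)
    loss = loss_fn(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
# pooled.detach() now holds one learned vector per patient for a downstream classifier.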
Overall, I'm open to any ideas on how to approach this issue of reducing multiple records for one observation down to the most informative / subset of the most informative records for one observation.
Assume that I have a candidate selection system to generate product/user pairs for recommendation. Currently, in order to maintain a quality bar for the recommended products, we trained a model to optimize for clicks on the link, denoted as the pClick(product, user) model; the output of the model is a score in (0, 1) representing how likely the user is to click on the recommended product.
For our initial product launch, we set a manually selected threshold, say T, for all users: we only send a user the recommendation when the pClick score passes T.
Now we realize this is not optimal: some users care less about recommendation quality, while others have a high bar for it. A personalized threshold, instead of the global T, could help us improve overall relevance.
The goal is to output a threshold for each user, assuming we have training data on each user's activity and user/product attributes.
The question is: how should we model this problem with machine learning? Any references or papers are highly appreciated.
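For concreteness, the current serving rule is essentially the first function below, and the goal is something like the second, where the per-user threshold would come from a model rather than being hand-picked (Python, names made up):

# Current logic: one global threshold T for everyone.
def should_recommend_global(p_click: float, T: float = 0.7) -> bool:
    return p_click > T

# What we want: a learned, per-user threshold, falling back to T for unseen users.
def should_recommend_personalized(p_click: float, user_id: str,
                                  user_thresholds: dict, T: float = 0.7) -> bool:
    return p_click > user_thresholds.get(user_id, T)

print(should_recommend_global(0.75))
print(should_recommend_personalized(0.65, "u1", {"u1": 0.6}))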
Most of the recommendation algorithms in Mahout require user-item preferences. But I want to find similar items for a given item. My system doesn't have user input; i.e., for any movie, these are the attributes that can be used to find a similarity coefficient:
Genre
Director
Actor
The attribute list can be modified in the future to build a more efficient system. But to find item similarity with the Mahout DataModel, a user preference for each item is required, whereas these movies could instead be clustered together so that the closest items in a cluster can be returned for a given item.
Later on, after introducing user-based recommendation, the above result can be used to boost the results.
If a product attribute has some fixed values, like Genre, do I have to convert those values to numerical values? If yes, how will the system calculate the distance between two items when genre-1 and genre-2 don't have any numeric relation?
Edit:
I have found a few examples using the command line, but I want to do it in Java and save the pre-computed values for later use.
I think in the case of feature vectors, the best similarity measures are the ones based on exact matches, like Jaccard similarity for example.
In Jaccard, the similarity between two item vectors is calculated as:
(number of features in the intersection) / (number of features in the union).
So converting the genre to a numerical value will not make a difference, since the exact match (which is used to find the intersection) works just as well on non-numerical values.
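As a small illustration (Python, made-up movies), treat each movie as a set of attribute values and compare the sets directly:

def jaccard(a: set, b: set) -> float:
    # |intersection| / |union|; 0.0 for two empty sets.
    return len(a & b) / len(a | b) if (a | b) else 0.0

movie_1 = {"genre:thriller", "director:nolan", "actor:bale"}
movie_2 = {"genre:thriller", "director:villeneuve", "actor:bale"}
print(jaccard(movie_1, movie_2))   # 2 shared values out of 4 total -> 0.5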
Take a look at this question for how to do it in mahout:
Does Mahout provide a way to determine similarity between content (for content-based recommendations)?
It sounds like Mahout's spark-rowsimilarity algorithm, available since version 0.10.0, would be the perfect solution to your problem. It compares the rows of a given matrix (i.e., row vectors representing movies and their properties), looking for co-occurrences of values across those rows - or in your case: co-occurrences of Genres, Directors, and Actors. No user history or item interaction is needed. The end result is another matrix mapping each of your movies to the top n most similar other movies in your collection, based on co-occurrence of genre, director, or actor.
The Apache Mahout site has a great write-up regarding how to do this from the command line, but if you want a deeper understanding of what's going on under the covers, read Pat Ferrel's machine learning blog Occam's Machete. He calls this type of similarity content or metadata similarity.
I'm about to start writing a recommender system for videos, mostly based on collaborative filtering as video metadata is pretty sparse, with a bit of content-based filtering as well. However, I'm not sure what to do about training data. Is training data something of importance in recommender systems, specifically in collaborative methods? If so, how can I generate that kind of data, or what type of data should I look for?
Any ML algorithm needs data. Take the matrix factorization approach, for example.
It receives an (incomplete) matrix of ratings: rows represent users, columns represent items, and a cell contains the rating that a particular user gave to a particular item. Then, by factorizing this matrix, you obtain a latent vector representation for each user and each item, which allows you to predict future ratings. Obviously, the unseen items with the highest predicted ratings are the most interesting to the user, according to the model.
Essentially, matrix factorization learns to predict new ratings for known users and items.
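As a toy illustration (numpy, made-up numbers): once factorization gives you a latent vector per user and per item, a predicted rating is just their dot product, so you can score items a user has never rated and rank them:

import numpy as np

# Learned latent factors (normally produced by factorizing the rating matrix).
user_vecs = {"alice": np.array([0.9, 0.1]), "bob": np.array([0.2, 0.8])}
item_vecs = {"matrix": np.array([1.0, 0.0]), "titanic": np.array([0.1, 1.0])}

def predict_rating(user, item):
    return float(user_vecs[user] @ item_vecs[item])

# Rank items for a user by predicted rating, highest first.
print(sorted(item_vecs, key=lambda m: predict_rating("alice", m), reverse=True))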
I have a decision tree that is trained on the columns (Age, Sex, Time, Day, Views, Clicks) and classifies into two classes - Yes or No - which represent the buying decision for an item X.
Using these values, I'm trying to predict the buying probability for 1000 samples (customers) that look like ('12','Male','9:30','Monday','10','3'),
('50','Female','10:40','Sunday','50','6')
........
I want to get an individual probability or score that will help me recognize which customers are most likely to buy item X. So I want to be able to sort the predictions and show a particular item to only the 5 customers most likely to buy item X.
How can I achieve this ?
Will a decision tree serve the purpose?
Is there any other method?
I'm new to ML so please forgive me for any vocabulary errors.
Using a decision tree with a small sample set, you will definitely run into overfitting problems, especially at the lower levels of the decision tree, where you will have exponentially less data to train your decision boundaries. Your data set should have many more samples than the number of categories, and enough samples for each category.
Speaking of decision boundaries, make sure you understand how you are handling the data type of each dimension. For example, 'sex' is categorical data, whereas 'age', 'time of day', etc. are real-valued inputs (discrete/continuous). So different parts of your tree will need different formulations. Otherwise, your model might end up treating 9:30, 9:31, 9:32, ... as separate classes.
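For example (pandas, illustrative only), you could encode the categorical columns explicitly and turn the time of day into a number, so that 9:30 and 9:31 end up as nearby values rather than unrelated categories:

import pandas as pd

df = pd.DataFrame({
    "age": [12, 50],
    "sex": ["Male", "Female"],
    "time": ["9:30", "10:40"],
    "day": ["Monday", "Sunday"],
    "views": [10, 50],
    "clicks": [3, 6],
})

# Time of day as minutes since midnight: a real-valued input.
hm = df["time"].str.split(":", expand=True).astype(int)
df["minutes"] = hm[0] * 60 + hm[1]

# Sex and day as one-hot columns: categorical inputs.
X = pd.get_dummies(df.drop(columns=["time"]), columns=["sex", "day"])
print(X)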
Try some other algorithms, starting with simple ones like k-nearest neighbours (KNN). Have a validation set to test each algorithm. Use MATLAB (or similar software) where you can use libraries to quickly try different methods and see which one works best. There is not enough information here to recommend something very specific.
In particular, KNN is able to capture affinity in the data. Say product X is bought by people around age 20, during evenings, after about 5 clicks on the product page. KNN will be able to tell you how close each new customer is to the customers who bought the item, and based on this you can just pick the top 5. It is very easy to implement and works great as a benchmark for more complex methods.
(Assuming views and clicks mean the number of views and clicks by each customer for product X.)
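A minimal sketch of that benchmark with scikit-learn (toy numbers, assuming the features have already been encoded numerically and scaled):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Historical customers: [age, minutes_since_midnight, views, clicks], bought X or not.
X_train = np.array([[20, 1140, 12, 5], [22, 1200, 9, 4], [55, 600, 2, 0], [40, 540, 3, 1]])
y_train = np.array([1, 1, 0, 0])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# New customers: the score is the share of "buyer" neighbours among the 3 closest.
X_new = np.array([[21, 1150, 10, 5], [60, 560, 1, 0]])
print(knn.predict_proba(X_new)[:, 1])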
A decision tree is a classifier, and in general it is not suitable as a basis for a recommender system. But, given that you are only predicting the likelihood of buying one item, not tens of thousands, it kind of makes sense to use a classifier.
You simply score all of your customers and retain the 5 whose probability of buying X is highest, yes. Is there any more to the question?
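In scikit-learn terms, that scoring-and-ranking step could look like the following (illustrative only, with already-encoded numeric features and synthetic data):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Training data: already-encoded features and the Yes/No (1/0) buying label.
X_train = np.random.RandomState(0).rand(200, 6)
y_train = (X_train[:, 0] + X_train[:, 4] > 1.0).astype(int)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Score 1000 candidate customers and keep the 5 most likely buyers.
X_candidates = np.random.RandomState(1).rand(1000, 6)
p_buy = tree.predict_proba(X_candidates)[:, 1]
top5_idx = np.argsort(p_buy)[::-1][:5]
print(top5_idx, p_buy[top5_idx])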