I have found what must be dozens of articles on Towards Data Science/Medium/etc. of people making recommendation engines with IMDb data (based on the ratings that users gave to movies, which movies should we recommend to those users).
These articles begin with 'memory-based approaches': user-based collaborative filtering and item-based collaborative filtering.
I have been tasked with making a recommendation engine, and since none of the suits really care or know anything about this, I want to do the bare minimum (which seems to be user-based collaborative filtering).
Problem is, all of my data is binary (no ratings; just based on the items that other users bought, should we recommend items to similar users?). This is actually similar to the cartoons that all of the Medium articles have stolen from each other, but none of them give an example of how to do it.
All of the articles use Pearson correlation or cosine similarity to determine user similarity. Can I use these approaches with binary dimensions (bought or not)? If so, how, and if not, is there a different way to measure user similarity?
I am working with Python, btw. And I was thinking of maybe using Hamming distance (is there a reason that wouldn't work well?).
Similarity-score-based approaches do work even with binary dimensions. When you have scores, two similar users may look like [5,3,2,0,1] and [4,3,3,0,0], whereas in your case it would be something like [1,1,1,0,1] and [1,1,1,0,0].
from scipy.spatial.distance import cosine
1 - cosine([5,3,2,0,1],[4,3,3,0,0])
0.961161313666907
1 - cosine([1,1,1,0,1],[1,1,1,0,0])
0.8660254037844386
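To go from similarity scores to actual recommendations on binary data, here is a minimal sketch; the purchase matrix and user index below are made up for illustration:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# rows = users, columns = items; 1 = bought, 0 = not bought (hypothetical data)
purchases = np.array([
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
])

user_sim = cosine_similarity(purchases)          # pairwise user-user similarities

# score items for user 0: neighbours' purchases weighted by their similarity,
# then mask the items user 0 has already bought
target = 0
scores = user_sim[target] @ purchases.astype(float)
scores[purchases[target] == 1] = -np.inf
print("recommend item index:", int(np.argmax(scores)))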
Another approach: if you can get the number of times a user bought a product, that count can be used as a rating, and then similarities can be calculated in the same way.
The data you have is implicit data, which means an interaction does not necessarily indicate the user's interest; it is just an interaction. An interaction value of 1 and an interaction value of 1000 make no difference in this case: they both show an interaction and nothing else, so memory-based algorithms are useless here. If you are not familiar with neural networks, then you should at least use matrix factorization techniques to make meaningful recommendations from this data. You can start with the surprise library here, which has a bunch of matrix factorization models.
It would be better to use ALS as the optimization technique, but SGD will also do the job. If you are OK with deep learning, I can refer you to the sources of the best work so far.
I once used the non-negative matrix factorization (NMF for short) algorithm in surprise for data like yours and the results were good enough.
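As a rough illustration, here is a minimal sketch using surprise's NMF on a made-up table of purchases (the column names and n_factors value are arbitrary choices; with purely positive implicit data you would usually also sample some unobserved pairs, or use an implicit-feedback model such as ALS):

import pandas as pd
from surprise import NMF, Dataset, Reader

# hypothetical (user, item, bought) triples; every observed purchase gets a 1
df = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3"],
    "item": ["i1", "i2", "i1", "i3", "i2"],
    "bought": [1, 1, 1, 1, 1],
})

reader = Reader(rating_scale=(0, 1))             # treat the binary flag as the rating
data = Dataset.load_from_df(df[["user", "item", "bought"]], reader)

algo = NMF(n_factors=10)                         # non-negative matrix factorization
algo.fit(data.build_full_trainset())

# estimated affinity of user u3 for item i1, which u3 never bought
print(algo.predict("u3", "i1").est)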
It seems that in your situation the best approach would be collaborative filtering. You don't need scores; all you need is a user-item interaction matrix. The simplest algorithm in this case is Alternating Least Squares (ALS).
There are already a few implementations in Python, for instance this one. Also, there's an implementation in PySpark's recommendation module.
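For example, a minimal sketch with PySpark's ALS, assuming integer user/item ids and a 0/1 "bought" column (all names and values below are invented):

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-demo").getOrCreate()

# hypothetical implicit interactions: (user id, item id, bought flag)
interactions = spark.createDataFrame(
    [(0, 0, 1.0), (0, 1, 1.0), (1, 0, 1.0), (1, 2, 1.0), (2, 1, 1.0)],
    ["user", "item", "bought"],
)

als = ALS(
    userCol="user",
    itemCol="item",
    ratingCol="bought",
    implicitPrefs=True,        # treat the values as implicit-feedback confidences
    rank=10,
    coldStartStrategy="drop",
)
model = als.fit(interactions)

# top 3 item recommendations per user
model.recommendForAllUsers(3).show(truncate=False)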
Let's say that we have a database where each row includes:
a person vector such as (1,0, ..., 1)
a movie vector (1,1, ..., 0)
the rating that the person gave to the content
What machine learning algorithm would be able to establish a movie ranking based on the user vector?
Thank you in advance for your answers!
Your question is a bit vague, but I will try to answer. It sounds like you are asking which algorithms can be used if you already have the latent factors. How did you obtain these latent factors? Are they part of the dataset? If they are given rather than derived, it is going to be hard to predict the movie rating from this data alone. Do you have ratings as part of this dataset? If so, combining them with the vectors, you can use matrix factorization or clustering to obtain movie rankings.
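If the vectors and ratings are indeed all given, one simple baseline (a supervised-regression view, separate from the MF/clustering route above) is to concatenate the person and movie vectors, fit a regressor on the ratings, and rank movies by predicted rating. The data below is synthetic, purely for illustration:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples, user_dim, movie_dim = 200, 8, 6

users = rng.integers(0, 2, size=(n_samples, user_dim))     # binary person vectors
movies = rng.integers(0, 2, size=(n_samples, movie_dim))   # binary movie vectors
ratings = rng.integers(1, 6, size=n_samples)               # ratings 1..5

X = np.hstack([users, movies])
model = RandomForestRegressor(n_estimators=100).fit(X, ratings)

# rank a few candidate movies for one user by predicted rating
user = users[0]
candidates = movies[:5]
scores = model.predict(np.hstack([np.tile(user, (5, 1)), candidates]))
print("candidates ranked best to worst:", np.argsort(scores)[::-1])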
Which classifier in machine learning should I use to predict the expected subsequent purchased category based on the month in which the user is purchasing?
Given a dataset consisting of columns
uuid date price product_id category
I am guessing the Naive Bayes algorithm, any better suggestions?
In ML, you can't rely on a single model; try every model you can and pick the one that gives the best result. I generally start with simple models such as Ridge, logistic regression, Elastic Net, SVM, etc., which still work very well. Then I look into tree-based and gradient boosting models.
If you have product/user features, then you can try a KNN model on those features, which finds the k most similar neighbours.
Also, if you would like to look into recommendation systems, this may be helpful. For product recommendation, these models work pretty well.
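As a sketch of the "try several models and compare" advice, assuming you have already engineered a feature matrix X (month, previous category, price statistics, etc.) and a target y holding the next purchased category; the data below is synthetic:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.random((300, 5))            # stand-in for your engineered features
y = rng.integers(0, 4, size=300)    # stand-in for the next-category labels

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "gradient_boosting": GradientBoostingClassifier(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")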
Suppose that for a given ML problem, we have a feature describing which car the person possesses. We can encode this information in one of the following ways:
Assign an id to each car. Make a column 'CAR_POSSESSED' and put the feature id as the value.
Make a column for each car and put 0 or 1 according to whether that car is possessed by the considered sample or not. Columns would be like "BMW_POSSESSED", "AUDI_POSSESSED".
In my experiments the 2nd way performed much better than the 1st one when tried with an SVM.
How does the choice of encoding affect model learning, and are there resources in which the effect of encoding has been studied? Or do we need to resort to trial and error to check which performs best?
The problem with the first way is that you use arbitrary numbers to represent the features (e.g. BMW=2, etc.) and an SVM takes those numbers seriously, as if they had an order: e.g. it may try to use cases with CAR_POSSESSED > 3 for the prediction.
So the second way is better.
Chapter 2.1 Categorical Features:
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
You'll find many more if you search for "svm Categorical Features"
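For what it's worth, a minimal sketch of the two encodings in pandas (the car names are hypothetical):

import pandas as pd

df = pd.DataFrame({"car": ["BMW", "AUDI", "BMW", "TOYOTA"]})

# way 1: a single CAR_POSSESSED column with arbitrary integer ids
# (an SVM would treat these ids as ordered magnitudes)
df["CAR_POSSESSED"] = df["car"].astype("category").cat.codes

# way 2: one 0/1 indicator column per car (one-hot encoding)
one_hot = pd.get_dummies(df["car"], dtype=int).add_suffix("_POSSESSED")

print(df)
print(one_hot)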
I'm trying to figure out a way I could represent a Facebook user as a vector. I decided to go with stacking the different attributes/parameters of the user into one big vector (i.e. age is a vector of size 100, where 100 is the maximum age you can have; if you are, let's say, 50, the first 50 values of the vector would be 1, just like a thermometer). I just can't figure out a way to represent the Facebook interests as a vector too: they are a collection of words, and the space of all possible words is huge, so I can't go for a model like bag of words or something similar. Does anyone know how I should proceed? I'm still new to this; any reference would be highly appreciated.
If you feel inclined to downvote this question, just let me know what is wrong with it so that I can improve the wording and context.
Thanks
The "right" approach depends on what your learning algorithm is and what the decision problem is.
It would often be better, though, to represent age as a single numeric feature rather than 100 indicator features. That way learning algorithms don't have to learn the relationship between those hundred features (it's baked-in), and the problem has 99 fewer dimensions, which'll make everything better.
To model the interests, you might want to start with an extremely high-dimensional bag of words model and then use one of various options to reduce the dimensionality:
a general dimensionality-reduction technique like PCA, or smarter nonlinear ones such as kernel PCA and various other nonlinear approaches: see Wikipedia's overview of dimensionality reduction and of nonlinear techniques specifically
pass it through a topic model and use the learned topic weights as your features; examples include LSA, LDA, HDP and many more
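A minimal sketch of the suggestions above: age kept as a single numeric feature, interests turned into a bag of words, then reduced with truncated SVD (a simple LSA-style topic reduction). The users below are made up:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

ages = np.array([[25], [50], [37]])                        # one numeric age feature
interests = [
    "football cooking travel",
    "chess programming travel",
    "cooking gardening chess",
]

bow = CountVectorizer().fit_transform(interests)           # high-dimensional sparse matrix
topics = TruncatedSVD(n_components=2).fit_transform(bow)   # reduced "topic" features

user_vectors = np.hstack([ages, topics])                   # final user representation
print(user_vectors.shape)                                  # (3, 1 + 2)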