Is there a machine learning algorithm to deal with this problem? - machine-learning

Let's say that we have a database where each row includes:
a person vector such as (1,0, ..., 1)
a movie vector (1,1, ..., 0)
the rating that the person gave to the content
What machine learning algorithm would be able to establish a movie ranking based on the user vector?
Thank you in advance for your answers!

Your question is a bit vague, but I will try to answer. It sounds like you are asking what algorithms can be used if you already have the latent factors. How did you obtain these latent factors? Are they part of the dataset? If they are not given or derived, it is going to be hard to predict movie ratings from this data. Do you have ratings as part of the dataset? If so, combining everything, you can use matrix factorization (MF) or clustering to obtain movie rankings.
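If the person and movie vectors really are given as ready-made features, one simple framing (a sketch, not necessarily what the answerer had in mind) is plain supervised regression: concatenate the two vectors, predict the rating, and rank movies by predicted rating. All data and parameters below are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy stand-ins for the database rows: binary person/movie vectors + rating.
persons = rng.integers(0, 2, size=(200, 10))   # one row per rating event
movies  = rng.integers(0, 2, size=(200, 8))
ratings = rng.integers(1, 6, size=200)         # ratings on a 1-5 scale

X = np.hstack([persons, movies])               # concatenate the two vectors
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, ratings)

# Rank a catalogue of candidate movies for one user vector.
user = rng.integers(0, 2, size=10)
catalogue = rng.integers(0, 2, size=(50, 8))
scores = model.predict(np.hstack([np.tile(user, (50, 1)), catalogue]))
ranking = np.argsort(scores)[::-1]             # best-predicted movies first
```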

Related

In content-based recommender systems, how to judge per-user rather than per-rating?

I'm studying the recommender systems from the Andrew Ng course on Coursera, and this question popped into my mind.
In the course, Andrew does recommendations for movies, like Netflix does.
We have an output matrix Y of ratings of various movies, where each cell Y(i,j) is the rating given by user j to movie i. If the user has not rated it, Y(i,j) is unknown (marked as "?").
Assuming we are doing linear regression, we had the following minimization objective (the squared error summed over every observed rating Y(i,j), plus regularization of the per-user parameters):

min over θ^(1),...,θ^(n_u) of (1/2) Σ_{(i,j): r(i,j)=1} ((θ^(j))^T x^(i) − Y(i,j))² + (λ/2) Σ_j Σ_k (θ_k^(j))²
My question is, doesn't this calculate on a per-rating basis? As in, all ratings are equal. So if someone rates 100 movies, he has more effect on the algorithm than someone who rates only 10 movies.
I was wondering if it is possible to judge on a per-user basis, i.e. all users are equal.
It is definitely possible to apply a weight to the loss function, with either weight = 1/ratings_per_user[u] or weight = 1/sqrt(ratings_per_user[u]), where ratings_per_user[u] is the number of ratings from the user who gave the rating in your particular sample. Whether it's a good idea or not is another question.
To answer that question, I would first ask: is this more meaningful to the problem you are really trying to solve? If it is, then ask the second question: does the model you built work well? Does it have a good cross-validation score?
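As a concrete illustration of the weighting idea, here is a minimal sketch using scikit-learn's sample_weight (the feature matrix and ratings are made up; only the weighting scheme follows the answer above):

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import Ridge

user_ids = np.array([0, 0, 0, 0, 1, 2, 2])        # user 0 rated 4 movies, etc.
X = np.random.default_rng(1).normal(size=(7, 5))  # per-rating feature vectors
y = np.array([4.0, 3.0, 5.0, 2.0, 4.0, 1.0, 3.0])

counts = Counter(user_ids)
weights = np.array([1.0 / counts[u] for u in user_ids])
# or the softer variant: 1.0 / np.sqrt(counts[u])

# Each user now contributes roughly equal total weight to the loss.
model = Ridge(alpha=1.0).fit(X, y, sample_weight=weights)
```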

Best machine learning solution for recommendations based on parameters

I'm looking for a solution that will help me recommend the best-matching records from my existing database. I am considering using machine learning for this task.
I have a set of data which describes user choices for movies:
movie, age, gender, movie_rating (0-10) (in future there will be more parameters)
What I would like is a solution that helps me find the best movie recommendations by parameters. So the input will be a user with:
20 years old, male, movie rating 8+
and as a result I would like to receive the best-matching movies for these parameters.
I'm considering decision forest regression, but maybe there is some other way to do this.
There is no straightforward algorithm for your problem, because you add multiple features like age, sex, and rating into it. To accomplish your goal, you may use a low-rank matrix factorization algorithm like SVD or ALS to find the missing values of the ratings matrix (collaborative filtering). Then you can apply a classification algorithm to find 8+ rated movies for users of age 20+ and male sex, and take the intersection of the two result sets.
One way of doing this is to use matrix factorization to find the missing values of a matrix. For your problem, a lot of users will not have movie ratings for many of the movies in the database. So you can use matrix factorization to fill in (approximate) that matrix and then, based on the scores given to different movies, recommend movies to the users.
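A minimal sketch of the matrix-factorization idea from the two answers above, using a toy alternating-least-squares loop in NumPy (the matrix, rank, and regularization are illustrative, not tuned):

```python
import numpy as np

# Toy user x movie ratings matrix; np.nan marks unrated entries.
R = np.array([[5, 4, np.nan, 1],
              [4, np.nan, np.nan, 1],
              [1, 1, np.nan, 5],
              [np.nan, 1, 5, 4]], dtype=float)
mask = ~np.isnan(R)

k, lam = 2, 0.1                                   # latent dims, regularization
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], k))   # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))   # movie factors

for _ in range(50):                               # alternate least-squares solves
    for i in range(R.shape[0]):
        m = mask[i]                               # movies user i has rated
        U[i] = np.linalg.solve(V[m].T @ V[m] + lam * np.eye(k), V[m].T @ R[i, m])
    for j in range(R.shape[1]):
        m = mask[:, j]                            # users who rated movie j
        V[j] = np.linalg.solve(U[m].T @ U[m] + lam * np.eye(k), U[m].T @ R[m, j])

R_hat = U @ V.T                                   # filled-in (approximated) matrix
```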
For convenience, just use Naive Bayes. It will often get you to 80%+ accuracy in tests, and for something as subjective as movie recommendations, you cannot really measure your way to 100% accuracy anyway.
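For what it's worth, here is a minimal sketch of that Naive Bayes suggestion, assuming the features are [age, gender, movie id] and the target is "rated 8+" (all data below is synthetic, and the accuracy figure above should be treated as anecdotal):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
# Toy features: [age, gender (0/1), movie_id]; label: 1 if rating >= 8.
X = np.column_stack([rng.integers(15, 60, 300),
                     rng.integers(0, 2, 300),
                     rng.integers(0, 20, 300)])
y = rng.integers(0, 2, 300)

clf = GaussianNB().fit(X, y)
# Probability of (low, high) rating for a 20-year-old male and movie 7.
print(clf.predict_proba([[20, 1, 7]]))
```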

Use Cosine Similarity with Binary Data - Mahout

I have a boolean/binary dataset where a customer and product id pair is present when the customer actually bought the product, and absent when the customer did not buy it. The dataset is represented like this:
[dataset sample omitted]
I have tried different approaches, like GenericBooleanPrefUserBasedRecommender with TanimotoCoefficient or LogLikelihood similarities, but I have also tried GenericUserBasedRecommender with the uncentered cosine similarity, and it gave me the highest precision and recall: 100% and 60% respectively.
I am not sure whether it makes sense to use the uncentered cosine similarity in this situation, or whether this is wrong logic. What does the uncentered cosine similarity do with such a dataset?
Any ideas would be really appreciated.
Thank you.
100% precision is impossible, so something is wrong. All the similarity metrics work fine with boolean data. Remember that the space is of very high dimensionality.
Your sample data only has two items (BTW, ids should be 0-based for the old Hadoop version of Mahout), so the dataset as shown is not going to give valid precision scores.
I've done this with large e-commerce datasets, and log-likelihood considerably outperforms the other metrics on boolean data.
BTW, Mahout has moved on to Spark from Hadoop, and our only metric is LLR. A full Universal Recommender with an event store and a prediction server, based on Mahout-Samsara, is implemented here:
http://templates.prediction.io/PredictionIO/template-scala-parallel-universal-recommendation
Slides describing it here: http://www.slideshare.net/pferrel/unified-recommender-39986309
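For intuition about what uncentered cosine computes on this kind of data, here is a small sketch in plain NumPy rather than Mahout's Java API (the purchase vectors are made up):

```python
import numpy as np

a = np.array([1, 0, 1, 1, 0])            # customer A's purchase vector
b = np.array([1, 1, 1, 0, 0])            # customer B's purchase vector

# On 0/1 data, uncentered cosine is co-purchases normalized by the
# geometric mean of the two basket sizes.
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
# a @ b -> number of products both bought (2 here); norms -> sqrt of basket sizes
print(cosine)                             # ~0.667

# Compare with the Tanimoto coefficient (intersection over union):
tanimoto = (a @ b) / (a @ a + b @ b - a @ b)
print(tanimoto)                           # 0.5
```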

Binary recommendation algorithms

I'm currently doing some research for a school assignment. I have two data streams: one is user ratings, and the other is the search, click, and order history (binary data) of a webshop.
I found that collaborative filtering is the best family of algorithms if you are using rating data. I found and researched these algorithms:
- Memory-based
  - User-based
    - Pearson correlation
    - Constrained Pearson
    - Vector similarity (cosine)
    - Mean squared difference
    - Weighted Pearson
    - Correlation threshold
    - Maximum number of neighbours
    - Weighted by correlation
    - Z-score normalization
  - Item-based
    - Adjusted cosine
    - Maximum number of neighbours
    - Similarity fusion
- Model-based
  - Regression-based
  - Slope One
  - LSI/SVD
  - Regularized SVD (RSVD/RSVD2/NSVD2/SVD++)
  - Integrated neighbour-based
  - Cluster-based smoothing
Now I'm looking for a way to use the binary data, but I'm having a hard time figuring out whether it is possible to use binary data instead of rating data with these algorithms, or whether there is a different family of algorithms I should be looking at.
I apologize in advance for spelling errors, since I have dyslexia and am not a native writer. Thanks marc_s for helping.
Take a look at data mining algorithms such as association rule mining (aka market basket analysis). You've come upon a tough problem in recommendation systems: unary and binary data are common, but the best algorithms for personalization don't work well with them.

Rating data can represent preference for a single user-item pair; e.g., I rate this movie 4 stars out of 5. But with binary data, we have the least granular type of rating data: I either like or don't like something, or have or have not consumed it.

Be careful not to confuse binary and unary data: unary data means that you have information that a user consumed something (which is coded as 1, much like binary data), but you have no information about whether a user didn't like or consume something (which is coded as NULL instead of binary data's 0). For instance, you may know that a person viewed 10 web pages, but you don't have any idea what she would have thought of other pages had she known they were available. That's unary data. You can't assume any preference information from NULL.
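A minimal sketch of the market-basket idea on unary data: count item co-occurrences across baskets and derive support/confidence for candidate rules (the baskets and thresholds below are illustrative):

```python
from itertools import combinations
from collections import Counter

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]

item_count = Counter(i for b in baskets for i in b)
pair_count = Counter(frozenset(p) for b in baskets
                     for p in combinations(sorted(b), 2))

n = len(baskets)
for pair, c in pair_count.items():
    # Check the rule in both directions: x -> y and y -> x.
    for x, y in [tuple(pair), tuple(pair)[::-1]]:
        support = c / n                    # P(x and y bought together)
        confidence = c / item_count[x]     # P(y | x)
        if support >= 0.25 and confidence >= 0.5:
            print(f"{x} -> {y}: support={support:.2f}, "
                  f"confidence={confidence:.2f}")
```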

Machine Learning Algorithm selection

I am new to machine learning. My problem is to make a machine select a university for a student according to his location and area of interest, i.e. it should select a university in the same city as the student's address. I am confused about the selection of the algorithm: can I use the perceptron algorithm for this task?
There are no hard rules as to which machine learning algorithm is the best for which task. Your best bet is to try several and see which one achieves the best results. You can use the Weka toolkit, which implements a lot of different machine learning algorithms. And yes, you can use the perceptron algorithm for your problem -- but that is not to say that you would achieve good results with it.
From your description it sounds like the problem you're trying to solve doesn't really require machine learning. If all you want to do is match a student with the closest university that offers a course in the student's area of interest, you can do this without any learning.
I second the first remark that you probably don't need machine learning if the student has to live in the same area as the university. If you want to use an ML algorithm, maybe it would be best to think about what data you would have to start with. The thing that comes to mind is a vector for each university that has certain subjects/areas as features. Then compute the distance from a vector which is like an ideal feature vector for the student, and minimize this distance.
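A minimal sketch of that distance idea, with made-up subject-interest vectors for a few hypothetical universities:

```python
import numpy as np

universities = {                           # features: [CS, biology, arts]
    "Uni A": np.array([0.9, 0.1, 0.0]),
    "Uni B": np.array([0.2, 0.8, 0.3]),
    "Uni C": np.array([0.5, 0.5, 0.9]),
}
student = np.array([1.0, 0.2, 0.1])        # the student's ideal profile

# Pick the university whose feature vector is closest to the student's.
best = min(universities,
           key=lambda u: np.linalg.norm(universities[u] - student))
print(best)                                # -> "Uni A"
```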
The first and foremost thing you need is a labeled dataset.
It sounds like the problem could be decomposed into an ML problem; however, you first need a set of positive and negative examples to train from.
How big is your dataset? What features do you have available? Once you answer these questions, you can select an algorithm that best fits the features of your data.
I would suggest using decision trees for this problem, since they resemble a set of if-else rules. You can just take the location and area of interest of the student as conditions of if and else-if statements and then suggest a university for him. Since it's a direct mapping of inputs to outputs, a rule-based solution would work, and there is no learning required here.
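A minimal sketch of that rule-based view, with a hypothetical lookup table standing in for a real database:

```python
def suggest_university(city: str, interest: str) -> str:
    # Hypothetical rules; in practice this would come from a database.
    rules = {
        ("city A", "computer science"): "University 1",
        ("city A", "medicine"): "University 2",
        ("city B", "computer science"): "University 3",
    }
    return rules.get((city, interest), "no match found")

print(suggest_university("city A", "computer science"))  # -> University 1
```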
Maybe you can use a recommender system or a clustering approach. You can investigate techniques like collaborative filtering (recommender systems) or k-means (clustering) more deeply, but again, as some people said, first you need data to learn from, and maybe your problem can be solved without ML.
Well, there is no straightforward, sure-shot answer to this question. The answer depends on many factors: the problem statement and the kind of output you want, the type and size of the data, the available computational time, and the number of features and observations in the data, to name a few.
Size of the training data
Accuracy and/or Interpretability of the output
Accuracy of a model means that the function predicts, for a given observation, a response value that is close to the true response value for that observation. A highly interpretable algorithm (a restrictive model like linear regression) means that one can easily understand how any individual predictor is associated with the response, while flexible models give higher accuracy at the cost of low interpretability.
Speed or Training time
Higher accuracy typically means higher training time, and algorithms also require more time to train on large training sets. In real-world applications, the choice of algorithm is driven predominantly by these two factors.
Algorithms like Naïve Bayes and linear and logistic regression are easy to implement and quick to run. Algorithms like SVMs (which involve tuning of parameters), neural networks (with high convergence time), and random forests need a lot of time to train on the data.
Linearity
Many algorithms work on the assumption that classes can be separated by a straight line (or its higher-dimensional analog); examples include logistic regression and support vector machines. Linear regression algorithms assume that data trends follow a straight line. If the data is linear, then these algorithms perform quite well.
Number of features
The dataset may have a large number of features, not all of which are relevant and significant. For certain types of data, such as genetic or textual data, the number of features can be very large compared to the number of data points.
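As a rough illustration of the accuracy/speed trade-offs described above, here is a sketch that fits an interpretable linear model and a flexible ensemble on the same synthetic data and compares fit time and test accuracy (the numbers will vary with the data):

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)                 # time only the training step
    print(type(model).__name__,
          f"fit={time.perf_counter() - t0:.3f}s",
          f"accuracy={model.score(X_te, y_te):.3f}")
```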