Logistic Regression Use Case - machine-learning

Problem: Given the list of movies a specific user has watched, calculate the probability that they will watch any given movie.
Approach: This seems like a typical logistic regression use case (please correct me if I'm wrong).
Initial Logistic Regression code (please correct if something is wrong):
import numpy as np

def sigmoid(x):
    # np.exp works elementwise, so this also accepts NumPy arrays
    return 1 / (1 + np.exp(-x))

def gradientDescentLogistic(x, y, theta, alpha, m, numIterations):
    xTrans = x.transpose()
    for i in range(0, numIterations):
        # The ONLY difference between linear and logistic is the definition of hypothesis
        hypothesis = sigmoid(np.dot(x, theta))
        loss = hypothesis - y
        gradient = np.dot(xTrans, loss) / m
        theta = theta - alpha * gradient
    return theta
Now the features here can be different actors, different genres, etc.
I'm unable to figure out how to fit these kinds of parameters into the above code.
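For concreteness, here is one way such parameters are commonly turned into a numeric feature matrix (a minimal, hypothetical sketch with made-up genre and actor lists; as the answer below explains, this does not address the positive-only-data issue):

import numpy as np

# Hypothetical vocabularies; in practice these would come from your movie database.
genres = ["action", "comedy", "drama"]
actors = ["actor_a", "actor_b"]

def encode_movie(movie_genres, movie_actors):
    # One binary column per genre/actor, plus a leading bias column of ones.
    row = [1.0]
    row += [1.0 if g in movie_genres else 0.0 for g in genres]
    row += [1.0 if a in movie_actors else 0.0 for a in actors]
    return row

x = np.array([
    encode_movie({"action"}, {"actor_a"}),
    encode_movie({"comedy", "drama"}, {"actor_b"}),
])
y = np.array([1.0, 0.0])  # 1 = watched, 0 = (assumed) not watched
theta = np.zeros(x.shape[1])
theta = gradientDescentLogistic(x, y, theta, alpha=0.1, m=x.shape[0], numIterations=1000)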

Why is this not a use case of LR?
I would say that this is not a typical use case of Logistic Regression. Why? Because you only know what someone watched: you only have positive samples, and you do not know what someone decided not to watch. Obviously, if I watched movies {m1, m2, m3}, then I did not watch M \ {m1, m2, m3}, where M is the set of all movies in the history of mankind. But this is not a good assumption. I did not watch most of them because I do not own them, do not know about them, or simply have not yet had time for them. In such a case you can only model this as a one-class problem or as a kind of density estimation (I assume you do not have access to any knowledge other than the list of movies seen, so we cannot, for example, do collaborative filtering or other crowd-based analysis).
Why not generate negative samples by hand?
Obviously, you could, for example, randomly select movies from some database that the user has not seen and assume that (s)he does not want to see them. But this is just an arbitrary, abstract assumption, and your model will be heavily biased towards this procedure. For example, if you took all unseen movies as negative samples, then a "correct" model would simply learn to say "Yes" only for the training set and "No" for everything else. If you randomly sample m movies, the model will only learn to distinguish your taste from these m movies. But they can represent anything! In particular, movies that one would love to see. To sum up: you can do this, and to be honest it could even work in some particular applications, but from a probabilistic perspective it is not a valid approach, because you build unjustifiable assumptions into the model.
How could I approach this?
So what can you do to approach this in a probabilistic manner? You can, for example, represent your movies as numerical features (some characteristics), so that you have a cloud of points in some space R^d (where d is the number of features extracted). Then you can fit any distribution to it, such as a Gaussian distribution (a radial one if d is big), a GMM, or any other. This will give you a clear (easy to understand and "defend") model for P(user will watch | x).
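As a minimal sketch of the density-estimation idea (assuming the movies have already been converted to numeric feature vectors; the use of scikit-learn's GaussianMixture here is just one convenient choice):

import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder data: each row is one watched movie described by d = 4 numeric features.
watched = np.random.rand(50, 4)

# Fit a density model to the movies this user has watched.
gmm = GaussianMixture(n_components=3).fit(watched)

# Score a candidate movie: a higher log-density means "more like what this user watches",
# which can serve as a relative proxy for P(user will watch | x).
candidate = np.random.rand(1, 4)
log_density = gmm.score_samples(candidate)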

Is my method to detect overfitting in matrix factorization correct?

I am using matrix factorization as a recommender system algorithm based on user click behavior records. I tried two matrix factorization methods:
The first one is the basic SVD whose prediction is just the product of user factor vector u and item factor i: r = u * i
The second one I used is the SVD with bias component.
r = u * i + b_u + b_i
where b_u and b_i represent the biases of the users' and items' preferences.
One of the models has very low performance and the other is reasonable; I really do not understand why the latter one (the model with biases) performs worse, and I suspect it is overfitting.
I googled methods to detect overfitting and found that the learning curve is a good way. However, its x-axis is the size of the training set and its y-axis is the accuracy. This makes me quite confused: how can I change the size of the training set? By picking some of the records out of the data set?
Another problem is that I tried to plot the iteration-loss curve (the loss is the ), and the curve seems normal:
But I am not sure whether this method is correct, because the metrics I use are precision and recall. Should I plot an iteration-precision curve instead? Or does this curve already tell me that my model is correct?
Can anybody please tell me whether I am going in the right direction? Thank you so much. :)
I will answer in reverse:
So you are trying two different models: one that uses straight matrix factorization, r = u * i, and the other which adds the biases, r = u * i + b_u + b_i.
You mentioned you are doing matrix factorization for a recommender system that looks at users' clicks. So my question here is: is this an implicit ratings case or an explicit one? I believe it is an implicit ratings problem if it is about clicks.
This is the first important thing you need to be very aware of: whether your problem is about explicit or implicit ratings, because there are differences in the way they are used and implemented.
If you check here:
http://yifanhu.net/PUB/cf.pdf
Implicit ratings are treated in such a way that the number of times someone clicked on or bought a given item, for example, is used to infer a confidence level. If you check the error function you can see that the confidence levels are used almost as a weight factor. So the whole idea is that in this scenario the biases have no meaning.
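For reference, the objective in the linked paper uses the raw counts only to build per-entry confidence weights; a rough sketch of that cost, without any bias terms, might look like this (variable names are mine):

import numpy as np

def implicit_cost(R, X, Y, alpha=40.0, lam=0.1):
    # R: raw implicit feedback (e.g., click counts), shape (n_users, n_items)
    # X: user factors (n_users, k); Y: item factors (n_items, k)
    P = (R > 0).astype(float)       # binary preference p_ui
    C = 1.0 + alpha * R             # confidence c_ui, acting as a per-entry weight
    weighted_error = np.sum(C * (P - X @ Y.T) ** 2)
    regularization = lam * (np.sum(X ** 2) + np.sum(Y ** 2))
    return weighted_error + regularization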
In the case of explicit ratings, where one has ratings as a score, for example from 1 to 5, one can calculate those biases for users and products (averages of these bounded scores) and introduce them in the ratings formula. They make sense in this scenario.
The whole point is that, depending on whether you are in one scenario or the other, you should use the biases or not.
On the other hand, your question is about overfitting. To detect it, you can compare training errors with test errors: depending on the size of your data you can hold out a test set, and if the two errors differ a lot then you are overfitting.
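A minimal sketch of that check (assuming the ratings are in a NumPy array of (user, item, rating) rows, and that fit and predict are your own training and prediction routines):

import numpy as np

def train_test_gap(ratings, fit, predict, test_fraction=0.2, seed=0):
    # ratings: NumPy array of (user, item, rating) rows
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(ratings))
    n_test = int(test_fraction * len(ratings))
    test, train = ratings[idx[:n_test]], ratings[idx[n_test:]]

    fit(train)  # train the factorization on the holdout-complement only

    def rmse(rows):
        errors = [r - predict(int(u), int(i)) for u, i, r in rows]
        return float(np.sqrt(np.mean(np.square(errors))))

    # A training error much lower than the test error points to overfitting.
    return rmse(train), rmse(test)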
Another thing is that matrix factorization models usually include regularization terms, see the article posted above, to avoid overfitting.
So I think that in your case you are facing a different problem, namely the one I mentioned before (using biases in an implicit-ratings setting), rather than overfitting.

How to build multivariate ranking system?

I have data on various sellers on an ecommerce platform. I am trying to compute a seller ranking score based on various features, such as
1] Order fulfillment rates [numeric]
2] Order cancel rate [numeric]
3] User rating [1-5] {1-2: Worst, 3: Average, 5: Good} [categorical]
4] Time taken to confirm the order. (shorter the time taken better is the seller) [numeric]
My first instinct was to normalize all the features, multiply each parameter/feature by some weight, and add them together to get each seller's score. Finally, I would find the relative ranking of sellers based on this score.
My Seller score equation looks like
Seller score = w1* Order fulfillment rates - w2*Order cancel rate + w3 * User rating + w4 * Time taken to confirm order
where, w1,w2,w3,w4 are weights.
My question is threefold:
Are there better algorithms/approaches to solve this problem? I.e., I linearly added the various features; I want to know a better approach to building the ranking system.
How do I come up with the values for the weights?
Apart from the above features, a few more that I can think of are the ratio of positive to negative reviews, the rate of damaged goods, etc. How will these fit into my score equation?
How do I incorporate numeric and categorical variables in finding the seller ranking score? (I have a few categorical variables.)
Is there an accepted way to weight multivariate systems like this?
I would suggest the following approach:
First of all, keep in a matrix all features that you have available, whether you consider them useful or not.
(Hint: categorical variables are converted to numerical ones by simple encoding, so you can easily incorporate them, in the same way you encoded the user rating.)
Then, you have to apply a dimensionality reduction algorithm, such as Singular Value Decomposition (SVD), in order to keep the most significant variables. Applying SVD may surprise you as to which features may be significant and which aren't.
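A minimal sketch of that step (the matrix and the 95% variance cutoff are placeholders; in practice you would use your real seller-feature matrix and your own cutoff):

import numpy as np

# Placeholder seller-feature matrix: one row per seller, one column per encoded feature.
features = np.random.rand(100, 6)

# Center the columns, then look at the singular values to see how much each component matters.
centered = features - features.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

explained = S ** 2 / np.sum(S ** 2)                            # variance fraction per component
n_keep = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1  # keep roughly 95% of the variance
reduced = centered @ Vt[:n_keep].T                             # sellers in the reduced space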
After applying SVD, choosing the right weights for the n most important features you decided to keep is really up to you, because it is purely qualitative and domain-dependent (as far as which features are more important).
The only way you could possibly calculate weights in a formalistic way is if the features were directly connected to something, e.g., revenue. Since this very rarely occurs, I suggest manually defining the weights; but for the sake of normalization, set:
w1 + w2 + ... + wn = 1
That is, split the "total importance" among the features you selected in a manner that sums to 1.
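Putting the pieces together, a small illustrative sketch (the feature values and weights below are invented; "lower is better" features such as cancel rate and confirmation time are flipped after min-max normalization):

import numpy as np

def seller_scores(features, weights, lower_is_better):
    # features: (n_sellers, n_features); weights sum to 1; lower_is_better: one flag per feature
    f = np.asarray(features, dtype=float)
    # min-max normalize each column (assumes every column has some spread)
    normalized = (f - f.min(axis=0)) / (f.max(axis=0) - f.min(axis=0))
    for j, flip in enumerate(lower_is_better):
        if flip:
            normalized[:, j] = 1.0 - normalized[:, j]   # so that higher always means better
    return normalized @ np.asarray(weights, dtype=float)

scores = seller_scores(
    [[0.98, 0.02, 5, 30], [0.90, 0.10, 3, 120]],  # fulfillment rate, cancel rate, rating, minutes to confirm
    weights=[0.4, 0.2, 0.3, 0.1],                 # chosen by hand, summing to 1
    lower_is_better=[False, True, False, True],
)
ranking = np.argsort(-scores)  # index of the best seller first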

Distance measure for categorical attributes for k-Nearest Neighbor

For my class project, I am working on the Kaggle competition - Don't get kicked
The project is to classify test data as good/bad buy for cars. There are 34 features and the data is highly skewed. I made the following choices:
The data is highly skewed: out of 73,000 instances, 64,000 are bad buys and only 9,000 are good buys. Since building a decision tree would overfit the data, I chose to use kNN (k nearest neighbors).
After trying out kNN, I plan to try out Perceptron and SVM techniques, if kNN doesn't yield good results. Is my understanding about overfitting correct?
Since some features are numeric, I can directly use the Euclidean distance as a measure, but other attributes are categorical. To use these features properly, I need to come up with my own distance measure. I read about the Hamming distance, but I am still unclear on how to merge two distance measures so that each feature gets equal weight.
Is there a way to find a good approximate value of k? I understand that this depends a lot on the use case and varies per problem. But if I am taking a simple vote from each neighbor, what should I set the value of k to? I'm currently trying out various values, such as 2, 3, 10, etc.
I researched around and found these links, but these are not specifically helpful -
a) Metric for nearest neighbor, which says that finding your own distance measure is equivalent to "kernelizing", but I couldn't make much sense of it.
b) Distance independent approximation of kNN talks about R-trees, M-trees etc. which I believe don't apply to my case.
c) Finding nearest neighbors using Jaccard coeff
Please let me know if you need more information.
Since the data is unbalanced, you should either sample an equal number of good/bad (losing lots of "bad" records), or use an algorithm that can account for this. I think there's an SVM implementation in RapidMiner that does this.
You should use Cross-Validation to avoid overfitting. You might be using the term overfitting incorrectly here though.
You should normalize the distances so that they have the same weight. By normalize I mean force them to be between 0 and 1: subtract the minimum and divide by the range.
The way to find the optimal value of K is to try all possible values of K (while cross-validating) and choose the value of K with the highest accuracy. If a merely "good" value of K is fine, you can use a genetic algorithm or similar to find it. Or you could try K in steps of, say, 5 or 10, see which K leads to good accuracy (say it's 55), and then try steps of 1 near that "good" value (i.e., 50, 51, 52, ...), but this may not be optimal.
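As an illustration of the normalization point and of combining the two kinds of features, a small sketch (the split into numeric and categorical parts, and the equal weighting of the two, are my own assumptions):

import numpy as np

def mixed_distance(a_num, b_num, a_cat, b_cat):
    # a_num/b_num: numeric features already min-max scaled to [0, 1]
    # a_cat/b_cat: categorical features compared with a Hamming-style match/mismatch
    numeric_part = np.mean(np.abs(np.asarray(a_num) - np.asarray(b_num)))
    categorical_part = np.mean([x != y for x, y in zip(a_cat, b_cat)])
    return (numeric_part + categorical_part) / 2.0  # both parts lie in [0, 1]

Each candidate value of K would then be evaluated with this distance inside the cross-validation loop described above.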
I'm looking at the exact same problem.
Regarding the choice of k, it is recommended to use an odd value to avoid "tie votes".
I hope to expand this answer in the future.

What machine learning algorithms can be used in this scenario?

My data consists of objects as follows.
Obj1 - Color - shape - size - price - ranking
So I want to be able to predict what combination of color/shape/size/price is a good combination to get a high ranking. Even a partial combination could work, e.g., in order to get a good ranking, the algorithm predicts the best performance for this color and this shape. Something like that.
What are the advisable algorithms for such a prediction?
Also, if you could briefly explain how I should approach the model building, I would really appreciate it. Say, for example, my data looks like:
Blue pentagon small $50.00 #5
Red Square large $30.00 #3
So what is a useful prediction model that I should look at? What algorithm should I try if, say, the highest weight is for price, followed by color and then size? And what if I wanted to predict with combinations, e.g., that a red small shape is less likely to rank high than a pink small shape? (In essence, I am trying to combine more than one nominal-value column to make the prediction.)
Sounds like you want to learn models that you can interpret as a human. Depending on what type your ranking variable is, a number of different learners are possible.
If ranking is categorical (e.g. stars), a classifier is probably best. There are many in Weka. Some that produce models that are understandable by humans are the J48 decision tree learner and the OneR rule learner.
If the ranking is continuous (e.g. a score), regression might be more appropriate. Suitable algorithms are for example SimpleLogistic and LinearRegression.
Alternatively, you could try clustering your examples with any of the algorithms in Weka and then analyzing the clusters. That is, ideally examples in a cluster would all be of the same (or very similar) ranking and you can have a look at the range of values of the other attributes and draw your own conclusions.
Treat the combination as a linear equation and apply a stochastic search algorithm (such as a genetic algorithm) to tune the parameters of the equation:
Encode the color/shape/size/price/ranking values as numbers.
Treat the combination as a linear equation, say a*color + b*shape + c*size + d*price = ranking.
Apply a genetic algorithm to tune a/b/c/d, in order to make the calculated rankings as close to the ground truth as possible.
Once you have the equation, you can use it to:
1) find the maximal ranking by simple linear programming;
2) predict rankings by simply plugging in the other parameters.
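As a simplified sketch of the first two steps, using the two example rows from the question: the categorical values are coded as integers and the linear equation is fitted with ordinary least squares here, which you could later swap out for the genetic-algorithm tuning described above. The integer encodings are arbitrary, and one-hot encoding would usually be safer.

import numpy as np

# Arbitrary integer codes for the categorical values (one-hot encoding is usually preferable).
color_codes = {"Blue": 0, "Red": 1, "Pink": 2}
shape_codes = {"pentagon": 0, "Square": 1}
size_codes = {"small": 0, "large": 1}

rows = [
    ("Blue", "pentagon", "small", 50.00, 5),
    ("Red", "Square", "large", 30.00, 3),
]
X = np.array([[color_codes[c], shape_codes[sh], size_codes[sz], price]
              for c, sh, sz, price, _ in rows], dtype=float)
X = np.column_stack([X, np.ones(len(X))])   # extra column for an intercept term
y = np.array([rank for *_, rank in rows], dtype=float)

# Fit a*color + b*shape + c*size + d*price + e ~ ranking; with only two rows this is
# underdetermined, so in practice you would use many more examples.
coeffs, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)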

importance of PCA or SVD in machine learning

All this time (especially during the Netflix contest), I keep coming across blogs (or leaderboard forums) where they mention how applying a simple SVD step to the data helped reduce sparsity or in general improved the performance of the algorithm at hand.
I have been trying to work out why for a long time, but I am not able to guess the reason.
In general, the data I get is very noisy (which is also the fun part of big data), and I do know some basic feature scaling techniques such as log transformation and mean normalization.
But how does something like SVD help?
So let's say I have a huge matrix of users rating movies, and on this matrix I implement some version of a recommendation system (say, collaborative filtering):
1) Without SVD
2) With SVD
How does it help?
SVD is not used to normalize the data, but to get rid of redundant data, that is, for dimensionality reduction. For example, if you have two variables, one a humidity index and the other a probability of rain, their correlation is so high that the second one does not contribute any additional information useful for a classification or regression task. The singular values in the SVD help you determine which variables are most informative and which ones you can do without.
The way it works is simple. You perform SVD over your training data (call it matrix A) to obtain U, S and V*. Then set to zero all values of S less than a certain arbitrary threshold (e.g., 0.1); call this new matrix S'. Then obtain A' = U S' V* and use A' as your new training data. Some of your features are now set to zero and can be removed, sometimes without any performance penalty (depending on your data and the threshold chosen). This is called k-truncated SVD.
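A rough NumPy sketch of that procedure (the matrix and the 0.1 threshold are placeholders):

import numpy as np

A = np.random.rand(20, 8)                        # stand-in for the training data matrix
U, S, Vt = np.linalg.svd(A, full_matrices=False)

threshold = 0.1
S_prime = np.where(S < threshold, 0.0, S)        # zero out the small singular values
A_prime = U @ np.diag(S_prime) @ Vt              # low-rank matrix used as the new training data

k = int(np.count_nonzero(S_prime))               # the effective rank that was kept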
SVD doesn't help you with sparsity, though; it only helps when features are redundant. Two features can be both sparse and informative (relevant) for a prediction task, so you can't remove either one.
Using SVD, you go from n features to k features, where each one will be a linear combination of the original n. It's a dimensionality reduction step, just like feature selection is. When redundant features are present, though, a feature selection algorithm may lead to better classification performance than SVD depending on your data set (for example, maximum entropy feature selection). Weka comes with a bunch of them.
See: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Singular_Value_Decomposition
https://stats.stackexchange.com/questions/33142/what-happens-when-you-apply-svd-to-a-collaborative-filtering-problem-what-is-th
The Singular Value Decomposition is often used to approximate a matrix X by a low rank matrix X_lr:
Compute the SVD X = U D V^T.
Form the matrix D' by keeping the k largest singular values and setting the others to zero.
Form the matrix X_lr by X_lr = U D' V^T.
The matrix X_lr is then the best approximation of rank k of the matrix X, for the Frobenius norm (the equivalent of the l2-norm for matrices). It is computationally efficient to use this representation, because if your matrix X is n by n and k << n, you can store its low rank approximation with only (2n + 1)k coefficients (by storing U, D' and V).
This was often used in matrix completion problems (such as collaborative filtering) because the true matrix of user ratings is assumed to be low rank (or well approximated by a low rank matrix). So, you wish to recover the true matrix by computing the best low rank approximation of your data matrix. However, there are now better ways to recover low rank matrices from noisy and missing observations, namely nuclear norm minimization. See for example the paper The power of convex relaxation: Near-optimal matrix completion by E. Candes and T. Tao.
(Note: the algorithms derived from this technique also store the SVD of the estimated matrix, but it is computed differently).
PCA or SVD, when used for dimensionality reduction, reduce the number of inputs. Besides saving the computational cost of learning and/or prediction, this can sometimes produce more robust models that are not optimal in a statistical sense but have better performance in noisy conditions.
Mathematically, simpler models have less variance, i.e., they are less prone to overfitting. Underfitting, of course, can be a problem too. This is known as the bias-variance dilemma. Or, as Einstein put it in plain words: things should be made as simple as possible, but not simpler.

Resources