Distance measure for categorical attributes for k-Nearest Neighbor - machine-learning

For my class project, I am working on the Kaggle competition - Don't get kicked
The project is to classify test data as good/bad buy for cars. There are 34 features and the data is highly skewed. I made the following choices:
Since the data is highly skewed, out of 73,000 instances, 64,000 instances are bad buy and only 9,000 instances are good buy. Since building a decision tree would overfit the data, I chose to use kNN - K nearest neighbors.
After trying out kNN, I plan to try out Perceptron and SVM techniques, if kNN doesn't yield good results. Is my understanding about overfitting correct?
Since some features are numeric, I can directly use the Euclidean distance as a measure, but there are other attributes which are categorical. To make proper use of these features, I need to come up with my own distance measure. I read about Hamming distance, but I am still unclear on how to combine the two distance measures so that each feature gets equal weight.
Is there a way to find a good approximate value for k? I understand that this depends a lot on the use-case and varies per problem. But, if I am taking a simple vote from each neighbor, what should I set the value of k to? I'm currently trying out various values, such as 2, 3, 10 etc.
I researched around and found these links, but these are not specifically helpful -
a) Metric for nearest neighbor, which says that finding out your own distance measure is equivalent to 'kernelizing', but couldn't make much sense from it.
b) Distance independent approximation of kNN talks about R-trees, M-trees etc. which I believe don't apply to my case.
c) Finding nearest neighbors using Jaccard coeff
Please let me know if you need more information.

Since the data is unbalanced, you should either sample an equal number of good/bad (losing lots of "bad" records), or use an algorithm that can account for this. I think there's an SVM implementation in RapidMiner that does this.
You should use Cross-Validation to avoid overfitting. You might be using the term overfitting incorrectly here though.
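You should normalize the features so that each one carries the same weight in the distance. By normalize I mean force each value to lie between 0 and 1: subtract the feature's minimum and divide by its range. For a categorical attribute, a mismatch indicator (0 if the two values are equal, 1 otherwise, as in Hamming distance) already lies in that range.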
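One possible way to combine the two kinds of features, sketched under assumptions: numeric columns are min-max scaled and compared by absolute difference, categorical columns contribute 0 or 1, and the per-feature terms are averaged (similar in spirit to Gower's distance, so it is not a Euclidean distance). The column layout, the sample rows and the helper names `min_max_ranges` and `mixed_distance` are made up for illustration.

```python
def min_max_ranges(rows, numeric_idx):
    """Return {column index: (min, range)} for each numeric column."""
    ranges = {}
    for j in numeric_idx:
        col = [row[j] for row in rows]
        lo, hi = min(col), max(col)
        ranges[j] = (lo, (hi - lo) or 1.0)  # avoid dividing by zero for constant columns
    return ranges


def mixed_distance(a, b, numeric_idx, categorical_idx, ranges):
    """Average per-feature distance; each term is in [0, 1] for values inside the fitted range."""
    total = 0.0
    for j in numeric_idx:
        lo, span = ranges[j]
        # Min-max scale both values, then take the absolute difference.
        total += abs((a[j] - lo) / span - (b[j] - lo) / span)
    for j in categorical_idx:
        # Hamming-style term: 0 if the categories match, 1 otherwise.
        total += 0.0 if a[j] == b[j] else 1.0
    return total / (len(numeric_idx) + len(categorical_idx))


# Hypothetical rows: (odometer, vehicle age, make, transmission)
cars = [
    (50000, 3, "FORD", "AUTO"),
    (90000, 7, "FORD", "MANUAL"),
    (70000, 5, "DODGE", "AUTO"),
]
numeric_idx, categorical_idx = [0, 1], [2, 3]
ranges = min_max_ranges(cars, numeric_idx)

d01 = mixed_distance(cars[0], cars[1], numeric_idx, categorical_idx, ranges)
d02 = mixed_distance(cars[0], cars[2], numeric_idx, categorical_idx, ranges)
print(d01, d02)
```

The ranges should be computed on the training data only and reused for test rows; a test value outside the training range would give a term larger than 1.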
The way to find the optimal value of K is to try all possible values of K (while cross-validating) and choose the value of K with the highest accuracy. If a "good" value of K is fine, then you can use a genetic algorithm or similar to find it. Or you could try K in steps of say 5 or 10, see which K leads to good accuracy (say it's 55), then try steps of 1 near that "good value" (i.e. 50, 51, 52...), but this may not be optimal.
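The stepped search could look roughly like this. It is only a sketch: `coarse_to_fine` is a hypothetical helper, and `toy_score` stands in for "cross-validated accuracy at K", which in practice you would compute (and probably cache, since each call is expensive).

```python
def coarse_to_fine(score, k_max, step=10):
    """Search K in steps of `step`, then every K around the best coarse value."""
    coarse = range(1, k_max + 1, step)
    best_coarse = max(coarse, key=score)
    lo = max(1, best_coarse - step + 1)
    hi = min(k_max, best_coarse + step - 1)
    return max(range(lo, hi + 1), key=score)


# Hypothetical stand-in for cross-validated accuracy; it peaks at K = 55.
def toy_score(k):
    return 0.9 - 0.0001 * (k - 55) ** 2


best_k = coarse_to_fine(toy_score, k_max=200, step=10)
print(best_k)
```

On a score curve with several local peaks the fine pass only explores the neighbourhood of the best coarse value, which is why this approach may miss the true optimum.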

I'm looking at the exact same problem.
Regarding the choice of k, it's recommended to use an odd value to avoid getting "tie votes" (this helps when there are two classes).
I hope to expand this answer in the future.

Related

Appropriate choice of k for knn [closed]

I have seen many threads asking for "best choice of knn for my problem X" and I would like a more general answer, so it applies to any K-NN classification problem.
Should one only care about the accuracy of one's model, and therefore tune to obtain best possible answer with one's data set?
Are there any general problems with choosing the best possible K for our problem?
Does such skill come naturally after building many models, and one can instinctively choose the right value, or at least come up with a sensible range to test through?
In general:
Too small K (say 1) is sensitive to noisy data i.e. an outlier can heavily influence your model
Too large K can lead to misclassification, i.e. distant neighbours (often from the majority class) outvote the nearby ones and the model gives inaccurate predictions
The way you calculate distance matters. For example, in sparse data sets cosine distance will yield much better results than euclidean distance. You could choose a right value for K, but if your distance calculation is irrelevant then the performance of the model is going to be bad anyway.
K equal to number of classes is a very bad choice, because final classification will be random.
Imagine a binary k-nn classification model, where output is either dog or a cat.
Now imagine you choose k to be equal to 2 (or any other even number).
Also, assume that a data point lies so that its k nearest neighbours are split equally between the two classes (the two nearest neighbours are one dog and one cat, or there are 2 in each class, or 3 in each class etc.).
Now, how do you determine which class the point belongs to?
You can't. You would need to randomise the process, or choose the first one, both giving equally bad results.
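A tiny illustration of the tie problem, with made-up neighbour labels. The `vote` helper is hypothetical; it returns `None` when the top two counts are equal so that the tie is visible instead of being silently broken.

```python
from collections import Counter


def vote(labels):
    """Return the winning label, or None when the top two counts are tied."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no majority
    return counts[0][0]


# Labels of the nearest neighbours, ordered from nearest to farthest (made-up data).
neighbours = ["dog", "cat", "dog", "cat", "cat"]

result_k2 = vote(neighbours[:2])
result_k3 = vote(neighbours[:3])
result_k4 = vote(neighbours[:4])
print(result_k2, result_k3, result_k4)
```

With two classes an odd k always produces a majority. Another common way around ties, not covered above, is to weight each vote by the inverse of the neighbour's distance.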
The K-NN algorithm is a non-parametric machine learning algorithm that is relatively fast and easy to implement. It's fast during training but slow during testing/inference.
Determining the number of K really depends on the data set at hand, as it's heavily dependent on the spread (distribution) of your sample points in the decision (feature) space. If the given data set forms a "dense" feature space relative to the number of dimensions (features), then K-NN will work best. However, if the data set results in a sparse feature space, then the K-NN will likely have low accuracy; and opting for another machine learning algorithm will probably be a better option.
As for finding the "best" K for a given data set, it's usually best practice to run a k-fold cross-validation procedure for different values of K, then plot the accuracy of your model against the value of K used. Cross-validation generates k accuracy values (one per fold) for each chosen value of K. The K value that results in the highest average accuracy is taken to be the best value of K for your model on your specified data set. Such a plot (done once) typically looks something like this:
[figure: accuracy plotted against K]
(A 10-fold CV is typically used in practice as it gives a good balance of using more samples to generate a more accurate confidence interval and to decrease bias towards estimating "true" error of model)
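Here is a minimal, dependency-free sketch of that procedure. The data is synthetic (two Gaussian blobs), the candidate K values are arbitrary, and `knn_predict` and `cv_accuracy` are hypothetical helpers; a real project would more likely use a library implementation.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_accuracy(data, k, n_folds=10):
    """Mean accuracy over n_folds folds for a given K."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / n_folds


# Synthetic two-class data (an assumption for illustration only).
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(50)]
data += [((rng.gauss(2, 1), rng.gauss(2, 1)), "b") for _ in range(50)]
rng.shuffle(data)

results = {k: cv_accuracy(data, k) for k in (1, 3, 5, 7, 9, 11)}
best_k = max(results, key=results.get)
for k, acc in results.items():
    print(k, round(acc, 3))
print("best K:", best_k)
```

Plotting `results` (K on the x-axis, mean accuracy on the y-axis) gives the kind of curve described above.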

What is distance-sensitive data and how does it differ from other data? Any examples will be helpful

I was reading about the kNN classification algorithm and came across the term "distance-sensitive data". I was not able to find out what exactly distance-sensitive data is, what its classifications are, or how to tell whether our data is distance-sensitive or not.
Suppose that xi and xj are vectors of observed features in cases i and j. Then, as you probably know, kNN is based on distances ||xi-xj||, such as the Euclidean one.
Now if xi and xj contain just a single feature, the individual's height in meters, we are fine, as there are no other "competing" features. Suppose that next we add annual salary, recorded in single currency units. Consequently, we look at distances between vectors like (1.7, 50000) and (1.8, 100000).
Then, in the case of the Euclidean distance, clearly the salary feature dominates height and it's almost like we are using the salary feature alone. That is,
||xi - xj||_2 ≈ |50000 - 100000| = 50000.
However, if the two features actually have similar importance, then we are doing a poor job. It is even worse if salary is actually irrelevant and we should be using height alone. Interestingly, under weak conditions, our classifier still has nice properties such as universal consistency even in such bad situations. The problem is that in finite samples the performance of our classifier can be very bad because the convergence is very slow.
So, to deal with that, one may want to consider different distances that do something about the scale. Commonly people standardize each feature (set the mean to zero and the variance to 1), but that's not a complete solution either. There are various proposals for what could be done (see, e.g., here).
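To make the scale problem concrete, here is a small sketch with made-up (height, salary) rows. It compares the raw Euclidean distance with the distance after standardizing each column; `standardize` is a hypothetical helper using the population standard deviation.

```python
import math
from statistics import mean, pstdev

# Hypothetical (height in metres, annual salary) observations.
people = [(1.7, 50000), (1.8, 100000), (1.6, 52000), (1.9, 98000)]

# Raw distance between the first two rows: salary swamps height.
raw = math.dist(people[0], people[1])


def standardize(rows):
    """Rescale every column to zero mean and unit variance."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in cols]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]


scaled = standardize(people)
height_gap = abs(scaled[0][0] - scaled[1][0])
salary_gap = abs(scaled[0][1] - scaled[1][1])
print(raw)
print(height_gap, salary_gap, math.dist(scaled[0], scaled[1]))
```

After standardization both features contribute a gap of comparable size, whereas the raw distance is essentially the salary difference.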
On the other hand, algorithms based on decision trees do not suffer from this. In those cases we just look for a point at which to split the variable. For instance, if salary takes values in [0,100000] and the split is at 40000, then Salary/10 would be split at 4000, so the results would not change.

Is my method to detect overfitting in matrix factorization correct?

I am using matrix factorization as a recommender system algorithm based on user click behavior records. I tried two matrix factorization methods:
The first one is the basic SVD whose prediction is just the product of user factor vector u and item factor i: r = u * i
The second one I used is the SVD with bias component.
r = u * i + b_u + b_i
where b_u and b_i represent the preference biases of the user and the item.
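For reference, the two prediction rules can be written as one small function. This is only a sketch with made-up factor values; `predict` is a hypothetical helper, and leaving the biases at zero gives the basic model.

```python
def predict(u, i, b_u=0.0, b_i=0.0):
    """Dot product of the factor vectors plus optional user and item biases."""
    return sum(p * q for p, q in zip(u, i)) + b_u + b_i


# Made-up factors and biases for one user and one item.
user_factors = [0.5, -1.0, 2.0]
item_factors = [1.0, 0.5, 0.25]

basic = predict(user_factors, item_factors)                       # r = u * i
biased = predict(user_factors, item_factors, b_u=0.25, b_i=-0.5)  # r = u * i + b_u + b_i
print(basic, biased)
```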
One of the models I use has very low performance, and the other one is reasonable. I really do not understand why the latter one performs worse, and I suspect that it is overfitting.
I googled methods to detect overfitting, and found that the learning curve is a good way. However, the x-axis is the size of the training set and the y-axis is the accuracy. This makes me quite confused. How can I change the size of the training set? Pick some of the records out of the data set?
Another problem is, I tried to plot the iteration-loss curve (The loss is the ). And it seems the curve is normal:
But I am not sure whether this method is correct because the metrics I use are precision and recall. Should I plot the iteration-precision curve instead? Or does this one already tell me that my model is correct?
Can anybody please tell me whether I am going in the right direction? Thank you so much. :)
I will answer in reverse:
So you are trying two different models, one that uses straight matrix factorization r = u * i and the other which enters the biases, r = u * i + b_u + b_i.
You mentioned you are doing matrix factorization for a recommender system which looks at users' clicks. So my question here is: is this an implicit ratings case or an explicit one? I believe it is an implicit ratings problem if it is about clicks.
This is the first important thing you need to be very aware of, whether your problem is about Explicit or Implicit ratings. Because there are some differences about the way they are used and implemented.
If you check here:
http://yifanhu.net/PUB/cf.pdf
Implicit ratings are treated in a way that the number of times someone clicked or bought a given item for example is used to infer a confidence level. If you check the error function you can see that the confidence levels are used almost as a weight factor. So the whole idea is that in this scenario the biases have no meaning.
In the case of explicit ratings, where one has ratings as a score, for example from 1 to 5, one can calculate those biases for users and products (averages of these bounded scores) and introduce them in the ratings formula. They make sense in this scenario.
The whole point is, depending whether you are in one scenario or the other you can use the biases or not.
On the other hand, your question is about overfitting. For that you can plot the training error together with the test error; depending on the size of your data, you can set aside a holdout test set. If the two errors differ a lot, then you are overfitting.
Another thing is that matrix factorization models usually include regularization terms, see the article posted here, to avoid over fitting.
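As a rough sketch of both points (comparing training and holdout error, and adding a regularization term), here is a tiny SGD matrix factorization on made-up explicit-style ratings. Everything in it is an assumption for illustration: the data, the rank, the learning rate `lr`, the L2 weight `reg` and the helper names. It is not the implicit-feedback method from the paper linked above.

```python
import math
import random


def rmse(ratings, P, Q):
    """Root-mean-square error of the dot-product predictions."""
    errs = [(r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(errs) / len(errs))


def train(train_set, holdout, n_users, n_items, rank=2, lr=0.02, reg=0.05, epochs=300):
    """SGD on squared error with L2 regularization; returns (train, holdout) RMSE per epoch."""
    rng = random.Random(0)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    history = []
    for _ in range(epochs):
        for u, i, r in train_set:
            err = r - sum(a * b for a, b in zip(P[u], Q[i]))
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step; the `reg` terms shrink the factors towards zero.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
        history.append((rmse(train_set, P, Q), rmse(holdout, P, Q)))
    return history


# Made-up (user, item, rating) triples; the holdout triples are never trained on.
train_set = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
             (2, 1, 2), (2, 2, 5), (3, 0, 3), (3, 2, 4)]
holdout = [(0, 2, 2), (1, 1, 3), (2, 0, 4), (3, 1, 3)]

history = train(train_set, holdout, n_users=4, n_items=3)
print("first epoch (train, holdout):", history[0])
print("last epoch  (train, holdout):", history[-1])
```

If the holdout error ends up far above the training error, or starts rising while the training error keeps falling, that is the usual sign of overfitting; increasing `reg` is one of the things to try.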
So I think in your case you are having a different problem: the one I mentioned before.

training set with only one label, missing the other

Hi I've been doing a machine learning project about predicting if a given (query, answer) pair is a good match (label the pair with 1 if it is a good match, 0 otherwise). But the problem is, in the training set, all the items are labelled with 1. So I got confused because I don't think the training set has strong discriminative power. To be more specific, now I could extract some features like:
1. textual similarity between query and answer
2. some attributes like the posting date, who created it, which aspect is it about etc.
Maybe I should try semi supervised learning (never studied it so have no idea if it will work)? But with such a training set I even cannot do validation....
Actually, you can train a model on only positive examples; 1-class SVM does this. However, this presumes that anything "sufficiently outside" the original data set is negative data, with "sufficiently outside" affected mainly by nu (the allowed error rate, an upper bound on the fraction of training points treated as outliers) and by the kernel parameters (such as gamma for an RBF kernel or the degree for a polynomial kernel).
A solution for your problem depends on the data you have. You are quite correct that a model trains better when given representative negative examples. The description you give strongly suggests that you do know there are insufficient matches.
Do you need a strict +/- scoring for the matches? Most applications simply rank them: the match strength is the score. This changes your problem from a classification to a prediction case. If you do need a strict +/- partition (classification), then I suggest that you slightly alter your training set: include only obvious examples: throw out anything scored near your comfort threshold for declaring a match.
With these inputs only, train your model. You'll have a clear "alley" between good and bad matches, and the model will "decide" which way to judge the in-between cases in testing and production.

what does Maximum Likelihood Estimation exactly mean?

When we are training our model we usually use MLE to estimate its parameters. I know it means that the most probable data for such a learned model is our training set. But I'm wondering whether that probability equals exactly 1 or not?
You almost have it right. The Likelihood of a model (theta) for the observed data (X) is the probability of observing X, given theta:
L(theta|X) = P(X|theta)
For Maximum Likelihood Estimation (MLE), you choose the value of theta that provides the greatest value of P(X|theta). This does not necessarily mean that the observed value of X is the most probable for the MLE estimate of theta. It just means that there is no other value of theta that would provide a higher probability for the observed value of X.
In other words, if T1 is the MLE estimate of theta, and if T2 is any other possible value of theta, then P(X|T1) ≥ P(X|T2). However, there still could be another possible value of the data (Y) different from the observed data (X) such that P(Y|T1) > P(X|T1).
The probability of X for the MLE estimate of theta is not necessarily 1 (and probably never is except for trivial cases). This is expected since X can take multiple values that have non-zero probabilities.
To build on what bogatron said with an example, the parameters learned from MLE are the ones that explain the data you see (and nothing else) the best. And no, the probability is not 1 (except in trivial cases).
A standard example (one that has been used billions of times) of what MLE does:
If you have a simple coin-toss problem, and you observe 5 results of coin tosses (H, H, H, T, H) and you do MLE, you will end up giving p(coin_toss == H) a high probability (0.80) because you see Heads way too many times. There are good and bad things about MLE obviously...
Pros: It is an optimization problem, so it is generally quite fast to solve (even if there isn't an analytical solution).
Cons: It can overfit when there isn't a lot of data (like our coin-toss example).
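The coin-toss example can be checked numerically. This sketch assumes independent tosses and simply evaluates the likelihood on a grid of candidate values for p; the grid resolution is arbitrary.

```python
tosses = ["H", "H", "H", "T", "H"]
heads = tosses.count("H")
tails = len(tosses) - heads


def likelihood(p):
    """P(observed tosses | p) for independent tosses with P(H) = p."""
    return p ** heads * (1 - p) ** tails


# Evaluate the likelihood on a grid and keep the maximizer.
grid = [k / 100 for k in range(101)]
p_hat = max(grid, key=likelihood)
print(p_hat, likelihood(p_hat))
```

The maximizer is p = 0.8 (the same as heads / total), and the probability of the observed sequence under that estimate is about 0.082, nowhere near 1.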
The example I got in my stat classes was as follows:
A suspect is on the run! Nothing is known about them, except that they're approximately 1m80 tall. Should the police look for a man or a woman?
The idea here is that you have a parameter for your model (M/F), and probabilities given that parameter. There are tall men, tall women, short men and short women. However, in the absence of any other information, the probability of a man being 1m80 is larger than the probability of a woman being 1m80. Likelihood (as bogatron very well explained) is a formalisation of that, and maximum likelihood is the estimation method based on favouring parameters which are more likely to result in the actual observations.
But that's just a toy example, with a single binary variable... Let's expand it a bit: I threw two identical dice, and the sum of their values is 7. How many sides did my dice have? Well, we all know that the probability of two D6 summing to 7 is quite high. But they might as well be D4, D20, D100, ... However, P(7 | 2D6) > P(7 | 2D20), and P(7 | 2D6) > P(7 | 2D100) ..., so you might estimate that my dice are 6-faced. That doesn't mean it's true, but it's a reasonable estimate, in the absence of any additional information.
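The dice likelihoods are easy to compute exactly. A small sketch, assuming fair dice and restricting the candidates to the four die types mentioned above:

```python
from fractions import Fraction


def prob_sum(total, sides):
    """P(two fair dice with `sides` faces each sum to `total`)."""
    ways = sum(1 for a in range(1, sides + 1) if 1 <= total - a <= sides)
    return Fraction(ways, sides * sides)


candidates = [4, 6, 20, 100]
likelihoods = {n: prob_sum(7, n) for n in candidates}
best = max(likelihoods, key=likelihoods.get)
for n, p in likelihoods.items():
    print(f"2D{n}: {p} = {float(p):.4f}")
print("MLE:", best)
```

Among these candidates the six-sided dice give the largest probability of a 7 (1/6), so D6 is the maximum likelihood estimate, even though its likelihood is far below 1.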
That's better, but we're not in machine-learning territory yet... Let's get there: if you want to fit your umptillion-layer neural network on some empirical data, you can consider all possible parameterisations, and how likely each of them is to return the empirical data. That's exploring an umptillion-dimensional space, each dimension having infinitely many possibilities, but you can map every single one of these points to a likelihood. It is then reasonable to fit your network using the parameters with the highest likelihood: given that the empirical data did occur, it is reasonable to assume that they should be likely under your model.
That doesn't mean that your parameters are likely ! Just that under these parameters, the observed value is likely. Statistical estimation is usually not a closed problem with a single solution (like solving an equation might be, and where you would have a probability of 1), but we need to find a best solution, according to some metric. Likelihood is such a metric, and is used widely because it has some interesting properties:
It makes intuitive sense
It's reasonably simple to compute, fit and optimise, for a large family of models
For normal variables (which tend to crop up everywhere) MLE gives the same results as other methods, such as least-squares estimations
Its formulation in terms of conditional probabilities makes it easy to use/manipulate it in Bayesian frameworks
