In the machine learning lecture slide, it is said that there is no specific model for KNN where the data is the model of KNN.
The previous assignment was NCC(Nearest Centroid Classifier) where there were two methods, one was fit_ncc, and the other was predict_ncc. So fit_ncc creates a model and using this model, predict_ncc makes a prediction.
However, for KNN, it is written that the data is the model. This statement is not clear to me and my question is why data is the model for KNN?
Please see the attached screenshot:
I think that means, that in knn model fit method just remembers data, and predict method calculates distance matrix for each test data sample using some kind of distance metrics(Manhattan or Euclidean distances),and predicts the label of the most frequent one among k nearest samples from train dataset.
Here's links about distance metrics:
Importance of Distance Metrics in Machine Learning Modelling
Types of metrics
Related
I have a set of labeled training data, and I am training a ML algorithm to predict the label. However, some of my data points are more important than others. Or, analogously, these points have less uncertainty than the others.
Is there a general method to include an importance-representing weight to each training point in the model? Are there instead some specific models which are capable of this while others are not?
I can imagine duplicating these points (and perhaps smearing their features slightly to avoid exact duplicates), or downsampling the less important points. Is there a more elegant way to approach this problem?
Scikit-learn allows you to pass an array of sample weights while fitting the model. Vowpal Wabbit (an online ML library) also has this option.
==> to see learning curves
I am trying a random forest regressor for a machine learning problem (price estimation of spatial points). I have a sample of spatial points in a city. The sample is not randomly drawn since there are very few observations downtown. And I want to estimate prices for all addresses in the city.
I have a good cross validation score (absolute mean squared error) an also a good test score after splitting the training set. But predictions are very bad.
What could explain this results ?
I plotted the learning curve (link above) : cross validation score increases with number of instances (that sounds logical), training score remains high (should it decrease ?) ... What do these learning curves show ? And in general how do we "read" learning curves ?
Moreover, I suppose that the sample is not representative. I tried to make the dataset for which I want predictions spatially similar to the training set by drawing whitout replacement according to proportions of observations in each district for the training set. But this didn't change the result. How can I handle this non representativity ?
Thanks in advance for any help
There are a few common cases that pop up when looking at training and cross-validation scores:
Overfitting: When your model has a very high training score but a poor cross-validation score. Generally this occurs when your model is too complex, allowing it to fit the training data exceedingly well but giving it poor generalization to the validation dataset.
Underfitting: When neither the training nor the cross-validation scores are high. This occurs when your model is not complex enough.
Ideal fit: When both the training and cross-validation scores are fairly high. You model not only learns to represent the training data, but it generalizes well to new data.
Here's a nice graphic from this Quora post showing how model complexity and error relate to the type a fit a model exhibits.
In the plot above, the errors for a given complexity are the errors found at equilibrium. In contrast, learning curves show how the score progresses throughout the entire training process. Generally you never want to see the score decreasing during training, as this usually means your model is diverging. But the difference between the training and validation scores as they move forward in time (towards equilibrium) indicates how well your model is fitting.
Notice that even when you have an ideal fit (middle of complexity axis) it is common to see a training score that's higher than the cross-validation score, since the model's parameters are updated using the training data. But since you're getting poor predictions, and since validation score is ~10% lower than training score (assuming the score is out of 1), I would guess that your model is overfitting and could benefit from less complexity.
To answer your second point, models will generalize better if the training data is a better representation of validation data. So when splitting the data into training and validation sets, I recommend finding a way to randomly segregate the data. For example, you could generate a list of all the points in the city, iterate of the list, and for each point draw from a uniform distribution to decide which dataset that point belongs to.
Imagine we have a classification problem on a dataset where the examples are only positive (equivalently negative). For instance, on a problem where the the winning class is specified by position (e.g. think of a tennis dataset problem where the first player is always the winner). How can we create negative examples in order to train a supervised learning algorithm on this dataset? One idea could be to generate negative examples, by exchanging the positions of the features that are tied to each of the classes. Do you think this will give an unbiased dataset? Could we create negative duplicates of our original dataset and train a supervised learning algorithm on this double dataset?
I have a question about some basic concepts of machine learning. The examples, I observed, were giving a brief overview .For training the system, feature vector is given as input. In case of supervised learning, the dataset is labelled. I have confusion about labelling. For example if I have to distinguish between two types of pictures, I will provide a feature vector and on output side for testing, I'll provide 1 for type A and 2 for type B. But if I want to extract a region of interest from a dataset of images. How will I label my data to extract ROI using SVM. I hope I am able to convey my confusion. Thanks in anticipation.
In supervised learning, such as SVMs, the dataset should be composed as follows:
<i-th feature vector><i-th label>
where i goes from 1 to the number of patterns (also examples or observations) in your training set so this represents a single record in your training set which can be used to train the SVM classifier.
So you basically have a set composed by such tuples and if you do have just 2 labels (binary classification problem) you can easily use a SVM. Indeed the SVM model will be trained thanks to the training set and the training labels and once the training phase has finished you can use another set (called Validation Set or Test Set), which is structured in the same way as the training set, to test the accuracy of your SVMs.
In other words the SVM workflow should be structured as follows:
train the SVM using the training set and the training labels
predict the labels for the validation set using the model trained in the previous step
if you know what the actual validation labels are, you can match the predicted labels with the actual labels and check how many labels have been correctly predicted. The ratio between the number of correctly predicted labels and the total number of labels in the validation set returns a scalar between [0;1] and it's called the accuracy of your SVM model.
if you're interested in the ROI, you might want to check the trained SVM parameters (mainly the weights and bias) to reconstruct the separation hyperplane
It is also important to know that the training set records should be correctly, a priori labelled: if the training labels are not correct, the SVM will never be able to correctly predict the output for previously unseen patterns. You do not have to label your data according to the ROI you want to extract, the data must be correctly labelled a priori: the SVM will have the entire set of type A pictures and the set of type B pictures and will learn the decision boundary to separate pictures of type A and pictures of type B. You do not have to trick the labels: if you do, you're not doing classification and/or machine learning and/or pattern recognition. You're basically tricking the results.
I have very small data that belongs to positive class and a large set of data from negative class. According to prof. Andrew Ng (anomaly detection vs supervised learning), I should use Anomaly detection instead of Supervised learning because of highly skewed data.
Please correct me if I am wrong but both techniques look same to me i.e. in both (supervised) Anomaly detection, and standard Supervised learning, we train data with both normal and anomalous samples and test on unknown data. Is there any difference?
Should I just perform under-sampling of negative class or over-sampling of positive class to get both type data of same size? Does it affect the overall accuracy?
Actually in supervised learning, you have the data set labelled (e.g good, bad) and you pass the labelled values as you train the model so that it learns parameters that will separate the 'good' from 'bad' results.
In anomaly detection, it is unsupervised as you do not pass any labelled values.. What you do is you train using only the 'non-anomalous' data. You then select epsilon values and evaluate with a numerical value (such as F1 score) so that your model will get a good balance of true positives.
Regarding trying to over/under sample so your data is not skewed, there are 2 things.
Prof Ng mentioned something like if your positive class is only 10 out of 10k or 100k then you need to use anomaly detection since your data is highly skewed.
Supervised learning makes sense if you know typically what 'bad' values are. If you only know what is 'normal'/'good' but your 'bad' value can really be very different every time then this is a good case for anomaly detection.
In anomaly detection you would determine model parameters from the portion of the data which is well supported (As Andrew explains). Since your negative class has many instances you would use these data for 'learning'. Kernel density estimation or GMMs are examples of approaches that are typically used. A model of 'normalcy' may thus be learnt and thresholding may be used to detect instances which are considered anomalous with respect to your derived model. The difference between this approach and conventional supervised learning lies in the fact that you are using only a portion of the data (the negative class in your case) for training. You would expect your positive instances to be identified as anomalous after training.
As for your second question, under-sampling the negative class will result in a loss of information whilst over-sampling the positive class doesn't add information. I don't think that following that route is desirable.