Why is a random permutation used in the KNN classifier?

I am new to ML and I don't understand why a random permutation is used for KNN. I am referring to http://www.scipy-lectures.org/advanced/scikit-learn/ in the k-Nearest neighbors classifier section. The following code was provided:
>>> perm = np.random.permutation(iris.target.size)
>>> iris.data = iris.data[perm]
>>> iris.target = iris.target[perm]
>>> knn.fit(iris.data[:100], iris.target[:100])
KNeighborsClassifier(...)
>>> knn.score(iris.data[100:], iris.target[100:])
0.95999...
And this question was asked: Bonus question: why did we use a random permutation?
Can someone help explain why a permutation would affect the results?

Iris is sorted by default: the first 50 instances belong to class 0, the next 50 to class 1, and the last 50 to class 2. Without a permutation, the model would be trained solely on classes 0 and 1 and then asked to predict labels of class 2. In general it is good practice to start by permuting the data, as there can always be some kind of structure in it due to how the dataset was assembled by its creator.
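As a minimal sketch of this (using the current sklearn loader rather than the older scipy-lectures snippet), you can check the class labels in each half of the unshuffled data:

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
# Without shuffling, the targets are grouped by class: 50 of class 0, then 1, then 2.
print(np.unique(iris.target[:100]))   # -> [0 1]  (class 2 never seen in training)
print(np.unique(iris.target[100:]))   # -> [2]    (only class 2 in the hold-out)

# After a random permutation, every split contains a mix of all three classes.
rng = np.random.RandomState(0)
perm = rng.permutation(iris.target.size)
print(np.unique(iris.target[perm][:100]))  # -> [0 1 2]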

It is very likely that your dataset has sortings or groupings that you are not aware of. Usually you separate your data into training, test, and validation sets. At first glance, in KNN that is not explicitly required, because the algorithm is purely online. Let's see how it works:
A1. A data set is given.
A2. A candidate point is given.
A3. The candidate point is classified by a majority vote of the classes of its k nearest neighbors.
However, that is only the case when the data set encompasses all the required knowledge, i.e. it is the ground truth.
If the dataset is not like that, we randomize and separate it into training and validation sets, classify against the training set, and check against the validation set to see whether training was successful. This is an iterative process of randomization and testing until we get a training set that evaluates nicely on the validation set. Once this process is finished, the test set is used to evaluate the generalization ability of the whole procedure.
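A minimal sketch of that randomize-and-split workflow with today's sklearn API (train_test_split with shuffle=True, the default, performs the permutation for you; the split ratio here is an arbitrary choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# shuffle=True permutes the samples before splitting;
# stratify keeps the class proportions similar in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    iris.data, iris.target, test_size=1 / 3, random_state=0, stratify=iris.target
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_val, y_val))  # accuracy on the held-out split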

Related

K Nearest Neighbour Classifier - random state for train test split leads to different accuracy scores

I'm fairly new to data analysis and machine learning. I've been carrying out some KNN classification analysis on a breast cancer dataset in Python's sklearn module. I have the following code, which attempts to find the optimal k for classification of a target variable.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
breast_cancer_data = load_breast_cancer()
training_data, validation_data, training_labels, validation_labels = train_test_split(breast_cancer_data.data, breast_cancer_data.target, test_size = 0.2, random_state = 40)
results = []
for k in range(1, 101):
    classifier = KNeighborsClassifier(n_neighbors = k)
    classifier.fit(training_data, training_labels)
    results.append(classifier.score(validation_data, validation_labels))
k_list = range(1,101)
plt.plot(k_list, results)
plt.ylim(0.85,0.99)
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.title("Breast Cancer Classifier Accuracy")
plt.show()
The code loops through 1 to 100 and generates 100 KNN models with 'k' set to incremental values in the range 1 to 100. The performance of each of those models is saved to a list and a plot is generated showing 'k' on the x-axis and model performance on the y-axis.
The problem I have is that when I change the random_state parameter used to split the data into training and testing partitions, I get completely different plots, indicating varying model performance for different 'k' values under different dataset partitions.
For me this makes it difficult to decide which 'k' is optimal, as the algorithm performs differently for different 'k's under different random states. Surely this doesn't mean that, for this particular dataset, 'k' is arbitrary? Can anyone help shed some light on this?
Thanks in anticipation
This is completely expected. When you do the train-test split, you are effectively sampling from your original population. This means that when you fit a model, any statistic (such as a model parameter estimate, or a model score) will itself be a sample estimate drawn from some distribution. What you really want is a confidence interval around this score, and the easiest way to get that is to repeat the sampling and remeasure the score.
But you have to be very careful how you do this. Here are some robust options:
1. Cross Validation
The most common solution to this problem is to use k-fold cross-validation. In order not to confuse this k with the k from KNN, I'm going to use a capital K for cross-validation (but bear in mind this is not standard nomenclature). This is a scheme that achieves the suggestion above but without a target leak. Instead of creating many splits at random, you split the data into K parts (called folds). You then train K models, each time on K-1 folds of the data, leaving aside a different fold as your test set each time. Now each model is independent and without a target leak. It turns out that the mean of whatever success score you use from these K models on their K separate test sets is a good estimate of the performance of training a model with those hyperparameters on the whole set. So now you should get a more stable score for each of your different values of k (small k, for KNN) and you can choose a final k this way.
Some extra notes:
Accuracy is a bad measure of classification performance. Look at scores like precision and recall, AUROC, or F1.
Don't try to program CV yourself; use sklearn's GridSearchCV (see the sketch after these notes).
If you are doing any preprocessing on your data that computes some sort of state from the data, that state needs to be computed on only the training data in each fold. For example, if you are scaling your data, you can't include the test data when you fit the scaler. You need to fit (and transform) the scaler on the training data and then use that same scaler to transform your test data (don't fit again). To get this to work inside CV you need to use sklearn Pipelines. This is very important; make sure you understand it.
You might get more stability if you stratify your train-test-split based on the output class. See the stratify argument on train_test_split.
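Here is a minimal sketch combining those notes (the parameter range and scoring choice are arbitrary; it assumes the standard sklearn API):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40, stratify=y
)

# The scaler is refit on the training folds only, inside each CV split.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": list(range(1, 101))},
    cv=5,             # K = 5 folds
    scoring="f1",     # something better than plain accuracy
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
print(search.score(X_test, y_test))  # final check on the untouched test set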
Note that CV is the industry standard and that's what you should do, but there are other options:
2. Bootstrapping
You can read about this in detail in An Introduction to Statistical Learning, section 5.2 (p. 187), with examples in section 5.3.4.
The idea is to take your training set and draw a random sample from it with replacement. This means you end up with some repeated records. You take this new training set, train a model, and then score it on the records that didn't make it into the bootstrapped sample (often called the out-of-bag samples). You repeat this process multiple times. You can now get a distribution of your score (e.g. accuracy) which you can use to choose your hyperparameter, rather than just the point estimate you were using before.
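A rough sketch of that idea, under the assumption that you score a fixed-k KNN on the out-of-bag rows of each resample:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.RandomState(0)
n = len(y)
scores = []

for _ in range(200):  # number of bootstrap resamples
    boot = rng.randint(0, n, size=n)           # indices drawn with replacement
    oob = np.setdiff1d(np.arange(n), boot)     # rows never drawn: out-of-bag
    model = KNeighborsClassifier(n_neighbors=5).fit(X[boot], y[boot])
    scores.append(model.score(X[oob], y[oob]))

# A distribution of the score rather than a single point estimate.
print(np.mean(scores), np.percentile(scores, [2.5, 97.5]))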
3. Making sure your validation set is representative of your test set
Jeremy Howard has a very interesting suggestion on how to calibrate your validation set to be a good representation of your test set. You only need to watch about 5 minutes from where that link starts. The idea is to split into three sets (which you should be doing anyway to choose a hyperparameter like k), train a bunch of very different but simple, quick models on your train set, and then score them on both your validation and test sets. It is OK to use the test set here because these aren't real models that will influence your final model. Then plot the validation scores against the test scores. They should fall roughly on a straight line (the y=x line). If they do, the validation set and test set are either both good or both bad, i.e. performance on the validation set is representative of performance on the test set. If they don't fall on this straight line, the model scores you get from your validation set are not indicative of the scores you'll get on unseen data, and you can't use that split to choose a sensible model.
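A hedged sketch of that check (the particular baseline models below are arbitrary choices, not part of the original suggestion):

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# A handful of cheap, deliberately different baseline models.
models = [
    LogisticRegression(max_iter=5000),
    GaussianNB(),
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

val_scores, test_scores = [], []
for m in models:
    m.fit(X_train, y_train)
    val_scores.append(m.score(X_val, y_val))
    test_scores.append(m.score(X_test, y_test))

# Points near the y = x line mean the validation split is trustworthy.
plt.scatter(val_scores, test_scores)
plt.plot([0.8, 1.0], [0.8, 1.0])
plt.xlabel("validation accuracy")
plt.ylabel("test accuracy")
plt.show()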
4. Get a larger data set
This is obviously not very practical for your situation but I thought I'd mention it for completeness. As your sample size increases, your standard error drops (i.e. you can get tighter bounds on your confidence intervals). But you'll need more training and more test data. While you might not have access to that here, it's worth keeping in mind for real world situations where you can assess the trade-off of the cost of gathering new data vs the desired accuracy in assessing your model performance (and probably the performance itself too).
This "behavior" is to be expected. Of course you get different results, when training and test is split differently.
You can approach the problem statistically, by repeating each 'k' several times with new train-validation-splits. Then take the median performance for each k. Or even better: look at the performance distribution and the median. A narrow performance distribution for a given 'k' is also a good sign that the 'k' is chosen well.
Afterwards you can use the test set to test your model
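A minimal sketch of that repeated-split approach, assuming the same breast cancer data and plain accuracy as the score:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
median_scores = {}

for k in range(1, 101):
    scores = []
    for state in range(20):  # 20 different random splits per k
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=0.2, random_state=state
        )
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        scores.append(model.score(X_val, y_val))
    median_scores[k] = np.median(scores)

best_k = max(median_scores, key=median_scores.get)
print(best_k, median_scores[best_k])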

Cleveland heart disease dataset - can’t describe the class

I'm using the Cleveland Heart Disease dataset from UCI for classification, but I don't understand the target attribute.
The dataset description says that the values go from 0 to 4 but the attribute description says:
0: < 50% coronary disease
1: > 50% coronary disease
I'd like to know how to interpret this: is this dataset meant to be a multiclass or a binary classification problem? And must I group values 1-4 into a single class (presence of disease)?
If you are working on an imbalanced dataset, you should use a re-sampling technique to get better results. With imbalanced datasets the classifier tends to always "predict" the most common class without performing any real analysis of the features.
You should try SMOTE, which synthesizes new elements for the minority class based on those that already exist. It works by randomly picking a point from the minority class and computing the k-nearest neighbors of that point.
I also used K-fold cross-validation along with SMOTE; cross-validation helps ensure that the model learns the correct patterns from the data. A sketch of this combination follows the references below.
While measuring the performance of the model, the accuracy metric can mislead: it shows high accuracy even when there are many false positives. Use metrics such as F1-score and MCC.
References :
https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
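A minimal sketch of SMOTE inside cross-validation. It relies on the third-party imbalanced-learn package, whose Pipeline applies SMOTE only to the training folds; the loading step and column layout are assumptions about the standard UCI file, so adjust them to your copy of the data.

import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical loading step: the UCI file has 13 features plus the target in the last column.
df = pd.read_csv("processed.cleveland.data", header=None, na_values="?").dropna()
X = df.iloc[:, :-1].values
y = (df.iloc[:, -1] > 0).astype(int)  # group 1-4 into "disease present"

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),        # oversample the minority class (training folds only)
    ("clf", RandomForestClassifier(random_state=42)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print(scores.mean())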
It basically means that the presence of different degrees of heart disease is denoted by 1, 2, 3, 4, while its absence is simply denoted by 0. Most of the experiments that have been conducted on this dataset have been based on binary classification, i.e. presence (1, 2, 3, 4) vs absence (0). One reason for this might be the class imbalance problem (0 has about 160 samples, and 1, 2, 3 and 4 together make up the other half) combined with the small number of samples (only around 300 in total). So it makes sense to treat this as a binary classification problem instead of a multi-class one, given the constraints we have.
is this dataset meant to be a multiclass or a binary classification problem?
Without changes, the dataset is ready to be used for a multi-class classification problem.
And must I group values 1-4 into a single class (presence of disease)?
Yes, you must, as long as you are interested in using the dataset for a binary classification problem.
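That grouping is a one-liner, assuming the targets are in a NumPy array y with values 0-4:

import numpy as np

y = np.array([0, 2, 1, 0, 4, 3, 0])   # hypothetical slice of the target column
y_binary = (y > 0).astype(int)        # 0 stays 0, values 1-4 collapse to 1 (disease present)
print(y_binary)                       # [0 1 1 0 1 1 0]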

Regression like quantification of the importance of variables in random forest

Is it possible to quantify the importance of variables in figuring out the probability of an observation falling into one class? Something similar to logistic regression.
For example:
If I have the following independent variables
1) Number of cats the person has
2) Number of dogs a person has
3) Number of chickens a person has
With my dependent variable being: Whether a person is a part of PETA or not
Is it possible to say something like "if the person adopts one more cat than his existing range of animals, his probability of being a part of PETA increases by 0.12"
I am currently using the following methodology to reach this particular scenario:
1) Build a random forest model using the training data
2) Predict the customer's probability of falling into one particular class (PETA vs non-PETA)
3) Artificially increase the number of cats owned by each observation by 1
4) Predict the customer's new probability to fall in one of the two classes
5) The average change between (4)'s probability and (2)'s probability is the average increase in a person's probability if he has adopted a cat.
Does this make sense? Is there any flaw in the methodology that I haven't thought of? Is there a better way of doing the same ?
If you're using scikit-learn, you can easily do this by accessing the feature_importances_ attribute of the fitted RandomForestClassifier. According to scikit-learn:
The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features.
By averaging those expected activity rates over several randomized trees one can reduce the variance of such an estimate and use it for feature selection.
The feature_importances_ attribute stores the importance score of each feature, averaged over the trees in the forest (an impurity-based measure rather than a literal depth).
Here's an example. Let's start by importing the necessary libraries.
# using this for some array manipulations
import numpy as np
# of course we're going to plot stuff!
import matplotlib.pyplot as plt
# dummy iris dataset
from sklearn.datasets import load_iris
#random forest classifier
from sklearn.ensemble import RandomForestClassifier
Once we have these, we're going to load the dummy dataset, define a classification model and fit the data to the model.
data = load_iris()

# we're gonna use 100 trees
forest = RandomForestClassifier(n_estimators = 100)

# fit data to model by passing features and labels
forest.fit(data.data, data.target)
Now we can use the feature_importances_ attribute to get a score for each feature, based on how much it contributes to classifying the data into the different targets.
# find importances of each feature
importances = forest.feature_importances_
# find the standard dev of each feature to assess the spread
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)

# find sorting indices of importances (descending)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(data.data.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
Feature ranking:
1. feature 2 (0.441183)
2. feature 3 (0.416197)
3. feature 0 (0.112287)
4. feature 1 (0.030334)
Now we can plot the importance of each feature as a bar graph and decide if it's worth keeping them all. We also plot the error bars to assess the significance.
plt.figure()
plt.title("Feature importances")
plt.bar(range(data.data.shape[1]), importances[indices],
        color="b", yerr=std[indices], align="center")
plt.xticks(range(data.data.shape[1]), indices)
plt.xlim([-1, data.data.shape[1]])
plt.show()
[Bar graph of feature importances]
I apologize; I didn't catch the part where you mention what kind of statement you're trying to make. I'm assuming your response variable is either one or zero. You could try something like this:
Fit a linear regression model over the data. This won't really give you the most accurate fit, but it will be robust enough to get the information you're looking for.
Find the response of the model with the original inputs. (It most likely won't be exactly ones or zeros.)
Artificially change the inputs and find the difference in the outputs between the original data and the modified data, like you suggested in your question (see the sketch below).
Try it out with a logistic regression as well. It really depends on your data and how it is distributed which kind of regression will work best. You definitely need some kind of regression to find the change in probability with a change in input.
You can even try a single-layer neural network with a regression/linear output layer to do the same thing. Add layers or non-linear activation functions if the data is less likely to be linear.
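A hedged sketch of the "add one cat and compare" step, using the random forest from the question rather than a regression; the feature layout and data here are entirely hypothetical, just to illustrate the comparison of predicted probabilities:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: columns are [n_cats, n_dogs, n_chickens], target is PETA membership.
rng = np.random.RandomState(0)
X = rng.poisson(2, size=(500, 3)).astype(float)
y = (X[:, 0] + 0.5 * rng.randn(500) > 2).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Probability of membership with the original inputs...
p_original = model.predict_proba(X)[:, 1]

# ...and with one extra cat for every person.
X_plus_cat = X.copy()
X_plus_cat[:, 0] += 1
p_plus_cat = model.predict_proba(X_plus_cat)[:, 1]

# Average change in predicted probability when everyone adopts one more cat.
print((p_plus_cat - p_original).mean())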
Cheers!

Why does K-fold cross-validation build K+1 models?

I have read the general steps for K-fold cross-validation at
https://machinelearningmastery.com/k-fold-cross-validation/
It describes the general procedure as follows:
Shuffle the dataset randomly.
Split the dataset into k groups (folds).
For each unique group:
  Take the group as a hold-out or test data set.
  Take the remaining groups as a training data set.
  Fit a model on the training set and evaluate it on the test set.
  Retain the evaluation score and discard the model.
Summarize the skill of the model using the sample of model evaluation scores.
So if it is K-fold then K models will be built, right? But why does the following link from H2O say that it builds K+1 models?
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/gbm/gbmTuning.ipynb
Arguably, "I read somewhere else" is too vague a statement (where?), because context does matter.
Most probably, such statements refer to some libraries which, by default, after finishing the CV procedure proper, go on to build a model on the whole training data using the hyperparameters found by CV to give the best performance; see for example the relevant train function of the caret R package, which, apart from performing CV (if requested), also returns the finalModel:
finalModel
A fit object using the best parameters
Similarly, scikit-learn GridSearchCV has also a relevant parameter refit:
refit : boolean, or string, default=True
Refit an estimator using the best found parameters on the whole dataset.
[...]
The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.
But even then, the models fitted are almost never just K+1: when you use CV in practice for hyperparameter tuning (and keep in mind that there are other uses for CV, too), you will end up fitting m*K models, where m is the number of hyperparameter combinations in your search grid (all K folds in a single round are run with one single set of hyperparameters).
In other words, if your hyperparameter search grid consists of, say, 3 values for the number of trees and 2 values for the tree depth, you will fit 2*3*K = 6*K models during the CV procedure, and possibly +1 for fitting your model at the end on the whole data with the best hyperparameters found (see the sketch after the summary below).
So, to summarize:
By definition, each K-fold CV procedure consists of fitting just K models, one for each fold, with fixed hyperparameters across all folds
In case of CV for hyperparameter search, this procedure will be repeated for each hyperparameter combination of the search grid, leading to m*K fits
Having found the best hyperparameters, you may want to use them for fitting the final model, i.e. 1 more fit
leading to a total of m*K + 1 model fits.
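A minimal sketch of that accounting with scikit-learn's GridSearchCV (the grid here is a made-up example; with verbose output GridSearchCV reports the m*K fits, and refit=True adds the final +1 fit on the whole training data):

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100, 200],  # 3 values
    "max_depth": [2, 3],             # 2 values -> m = 3 * 2 = 6 combinations
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,          # K = 5
    refit=True,    # the extra "+1" fit on the whole training data
    verbose=1,     # reports the 5 folds x 6 candidates = 30 CV fits
)
search.fit(X, y)   # 6 * 5 = 30 CV fits, plus 1 refit = 31 model fits in total
print(search.best_params_)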
Hope this helps...

Choosing random_state for sklearn algorithms

I understand that random_state is used in various sklearn algorithms to break ties between different predictors (trees) with the same metric value (say, for example, in GradientBoosting). But the documentation does not clarify or give details on this. For example:
1) Where else are these seeds used for random number generation? For RandomForestClassifier, for example, random numbers can be used to find a set of random features to build a predictor, and algorithms which use subsampling can use random numbers to get different subsamples. Can the same seed (random_state) play a role in multiple random number generations?
What I am mainly concerned about is
2) How far-reaching is the effect of this random_state variable? Can the value make a big difference in prediction (classification or regression)? If yes, what kind of data sets should I care about more? Or is it more about stability than quality of results?
3) If it can make a big difference, how best to choose that random_state? It's a difficult one to do GridSearch on without an intuition, especially if the data set is such that one CV run can take an hour.
4) If the motive is only to have steady results/evaluation of my models and cross-validation scores across repeated runs, does it have the same effect if I set random.seed(X) before I use any of the algorithms (and use random_state as None)?
5) Say I am using a random_state value on a GradientBoosted classifier, and I am cross-validating to find the goodness of my model (scoring on the validation set every time). Once satisfied, I will train my model on the whole training set before I apply it to the test set. Now, the full training set has more instances than the smaller training sets in the cross-validation. So the random_state value can now result in completely different behavior (choice of features and individual predictors) compared to what was happening within the CV loop. Similarly, settings like min samples leaf can also result in an inferior model, now that they were tuned with respect to the number of instances in CV while the actual number of instances is larger. Is this a correct understanding? What is the approach to safeguard against this?
Yes, the choice of the random seeds will impact your prediction results, and as you pointed out in your fourth question, the impact is not really predictable.
The common way to guard against predictions that happen to be good or bad just by chance is to train several models (based on different random states) and to average their predictions in a meaningful way. Similarly, you can see cross validation as a way to estimate the "true" performance of a model by averaging the performance over multiple training/test data splits.
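A minimal sketch of that averaging idea (an ensemble over seeds), assuming a soft-voting average of predicted probabilities:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the same model several times with different seeds...
probas = []
for seed in range(5):
    model = GradientBoostingClassifier(random_state=seed).fit(X_train, y_train)
    probas.append(model.predict_proba(X_test)[:, 1])

# ...and average the predicted probabilities before thresholding.
avg_proba = np.mean(probas, axis=0)
y_pred = (avg_proba >= 0.5).astype(int)
print((y_pred == y_test).mean())  # accuracy of the seed-averaged ensemble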
1) Where else are these seeds used for random number generation? For RandomForestClassifier, for example, random numbers can be used to find a set of random features to build a predictor, and algorithms which use subsampling can use random numbers to get different subsamples. Can the same seed (random_state) play a role in multiple random number generations?
random_state is used wherever randomness is needed:
If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in unit tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.
2) How far-reaching is the effect of this random_state variable? Can the value make a big difference in prediction (classification or regression)? If yes, what kind of data sets should I care about more? Or is it more about stability than quality of results?
Good problems should not depend too much on the random_state.
3) If it can make a big difference, how best to choose that random_state? It's a difficult one to do GridSearch on without an intuition, especially if the data set is such that one CV run can take an hour.
Do not choose it. Instead try to optimize the other aspects of classification to achieve good results, regardless of random_state.
4) If the motive is only to have steady results/evaluation of my models and cross-validation scores across repeated runs, does it have the same effect if I set random.seed(X) before I use any of the algorithms (and use random_state as None)?
As discussed in Should I use `random.seed` or `numpy.random.seed` to control random number generation in `scikit-learn`?, random.seed(X) is not used by sklearn. If you need to control this, you could set np.random.seed() instead.
5) Say I am using a random_state value on a GradientBoosted classifier, and I am cross-validating to find the goodness of my model (scoring on the validation set every time). Once satisfied, I will train my model on the whole training set before I apply it to the test set. Now, the full training set has more instances than the smaller training sets in the cross-validation. So the random_state value can now result in completely different behavior (choice of features and individual predictors) compared to what was happening within the CV loop. Similarly, settings like min samples leaf can also result in an inferior model, now that they were tuned with respect to the number of instances in CV while the actual number of instances is larger. Is this a correct understanding? What is the approach to safeguard against this?
The answers to How can I know training data is enough for machine learning mostly state that the more data, the better.
If you do a lot of model selection, maybe Sacred can help too. Among other things, it sets and can log the random seed for each evaluation, e.g.:
>> ./experiment.py with seed=123
During experimentation, for tuning and reproducibility, you temporarily fix the random state, but you also repeat the experiment with different random states and take the mean of the results.
import os
# Set a Random State value
RANDOM_STATE = 42
# Set Python a random state
os.environ['PYTHONHASHSEED'] = str(RANDOM_STATE)
# Set Python random a fixed value
import random
random.seed(RANDOM_STATE)
# Set numpy random a fixed value
import numpy as np
np.random.seed(RANDOM_STATE)
# Set other libraries, like TensorFlow, to a fixed random value
import tensorflow as tf
tf.random.set_seed(RANDOM_STATE)
os.environ['TF_DETERMINISTIC_OPS'] = '1'
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
# Finally, don't forget to set the random_state parameter in functions like
RandomizedSearchCV(random_state = RANDOM_STATE, ...)
For a production system, you remove the fixed random state by setting it to None:
# Set a Random State value
RANDOM_STATE = None
