Is it valid to put predicted data into training data set? - machine-learning

Suppose for K nearest neighbor algorithm, we have a original training data set x1,x2,...,xn and we test p1. After classification of p1, we put p1 into training data set.
The newest training data set now is {x1,x2,....,xn,p1} and we test p2... and so on.
I think the above is quite counter intuitive that we used "fake" data to train our program. But i cannot think any proof/reason to say why we cannot use the "fake" data.

It will only make the model more biased toward the original training set by updating the boundary between classes using its own prediction. In addition, adding more observations to your training set without offering any ground truth knowledge only makes the feature space more dense and reduces the impact of K which can lead to higher chance of over-fitting.

Related

K Nearest Neighbour Classifier - random state for train test split leads to different accuracy scores

I'm fairly new to data analysis and machine learning. I've been carrying out some KNN classification analysis on a breast cancer dataset in python's sklearn module. I have the following code which attemps to find the optimal k for classification of a target variable.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
breast_cancer_data = load_breast_cancer()
training_data, validation_data, training_labels, validation_labels = train_test_split(breast_cancer_data.data, breast_cancer_data.target, test_size = 0.2, random_state = 40)
results = []
for k in range(1,101):
classifier = KNeighborsClassifier(n_neighbors = k)
classifier.fit(training_data, training_labels)
results.append(classifier.score(validation_data, validation_labels))
k_list = range(1,101)
plt.plot(k_list, results)
plt.ylim(0.85,0.99)
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.title("Breast Cancer Classifier Accuracy")
plt.show()
The code loops through 1 to 100 and generates 100 KNN models with 'k' set to incremental values in the range 1 to 100. The performance of each of those models is saved to a list and a plot is generated showing 'k' on the x-axis and model performance on the y-axis.
The problem I have is that when I change the random_state parameter when spliting the data into training and testing partitions this results in completely different plots indicating varying model performance for different 'k'values for different dataset partitions.
For me this makes it difficult to decide which 'k' is optimal as the algorithm performs differently for different 'k's using different random states. Surely this doesn't mean that, for this particular dataset, 'k' is arbitrary? Can anyone help shed some light on this?
Thanks in anticipation
This is completely expected. When you do the train-test-split, you are effectively sampling from your original population. This means that when you fit a model, any statistic (such as a model parameter estimate, or a model score) will it self be a sample estimate taken from some distribution. What you really want is a confidence interval around this score and the easiest way to get that is to repeat the sampling and remeasure the score.
But you have to be very careful how you do this. Here are some robust options:
1. Cross Validation
The most common solution to this problem is to use k-fold cross-validation. In order not to confuse this k with the k from knn I'm going to use a capital for cross-validation (but bear in mind this is not normal nomenclature) This is a scheme to do the suggestion above but without a target leak. Instead of creating many splits at random, you split the data into K parts (called folds). You then train K models each time on K-1 folds of the data leaving aside a different fold as your test set each time. Now each model is independent and without a target leak. It turns out that the mean of whatever success score you use from these K models on their K separate test sets is a good estimate for the performance of training a model with those hyperparameters on the whole set. So now you should get a more stable score for each of your different values of k (small k for knn) and you can choose a final k this way.
Some extra notes:
Accuracy is a bad measure for classification performance. Look at scores like precision vs recall or AUROC or f1.
Don't try program CV yourself, use sklearns GridSearchCV
If you are doing any preprocessing on your data that calculates some sort of state using the data, that needs to be done on only the training data in each fold. For example if you are scaling your data you can't include the test data when you do the scaling. You need to fit (and transform) the scaler on the training data and then use that same scaler to transform on your test data (don't fit again). To get this to work in CV you need to use sklearn Pipelines. This is very important, make sure you understand it.
You might get more stability if you stratify your train-test-split based on the output class. See the stratify argument on train_test_split.
Note the CV is the industry standard and that's what you should do, but there are other options:
2. Bootstrapping
You can read about this in detail in introduction to statistical learning section 5.2 (pg 187) with examples in section 5.3.4.
The idea is to take you training set and draw a random sample from it with replacement. This means you end up with some repeated records. You take this new training set, train and model and then score it on the records that didn't make it into the bootstrapped sample (often called out-of-bag samples). You repeat this process multiple times. You can now get a distribution of your score (e.g. accuracy) which you can use to choose your hyper-parameter rather than just the point estimate you were using before.
3. Making sure you test set is representative of your validation set
Jeremy Howard has a very interesting suggestion on how to calibrate your validation set to be a good representation of your test set. You only need to watch about 5 minutes from where that link starts. The idea is to split into three sets (which you should be doing anyway to choose a hyper parameter like k), train a bunch of very different but simple quick models on your train set and then score them on both your validation and test set. It is OK to use the test set here because these aren't real models that will influence your final model. Then plot the validation scores vs the test scores. They should fall roughly on a straight line (the y=x line). If they do, this means the validation set and test set are both either good or bad, i.e. performance in the validation set is representative of performance in the test set. If they don't fall on this straight line, it means the model scores you get from you validation set are not indicative of the score you'll get on unseen data and thus you can't use that split to train a sensible model.
4. Get a larger data set
This is obviously not very practical for your situation but I thought I'd mention it for completeness. As your sample size increases, your standard error drops (i.e. you can get tighter bounds on your confidence intervals). But you'll need more training and more test data. While you might not have access to that here, it's worth keeping in mind for real world situations where you can assess the trade-off of the cost of gathering new data vs the desired accuracy in assessing your model performance (and probably the performance itself too).
This "behavior" is to be expected. Of course you get different results, when training and test is split differently.
You can approach the problem statistically, by repeating each 'k' several times with new train-validation-splits. Then take the median performance for each k. Or even better: look at the performance distribution and the median. A narrow performance distribution for a given 'k' is also a good sign that the 'k' is chosen well.
Afterwards you can use the test set to test your model

Semi supervised learning for Regression task with tabular data?

Semi-supervised learning for Regression task with tabular data? I have like 20K labelled data and 200K unlabeled data.
Model gets better and better with semi-supervised learning for eg. self-learning and on top of it if you have just 20K and making predictions on 200K you assume variance and distribution of features remain same which is not always true so you make a prediction with a sizeable error even your model was 90 Mean R2 on K Fold CV. Hope that explains why I am searching for a semi-supervised approach for regression!
A key assumption for most semi-supervised learning (SSL) is that nearby points (e.g. between an unlabelled and labelled point) are likely to share the same label. It seems you're expecting the variance and distribution of your unlabelled set to be different to your labelled set which may violate the above assumption.
If you still wish to go ahead with SSL then there are no conventional techniques for regression methods. But...
The literature on less conventional methods has been reviewed: https://www.researchgate.net/publication/325798856_Semi-supervised_regression_A_recent_review
And...
You could always convert to a classification approach (helps if the labels are relatively uniformly distributed). For instance, just choose the median prediction value in your labelled set, then everything less than that is 0 and everything greater is 1. You can then apply a conventional classification SSL technique, such as label propagation. Then, you get the predicted probabilities on a test set (e.g. use predict_proba()) and scale these back to your regression range. For example, if the minimum value in your labelled set was 0 and the max was 10, then just multiply the predicted probabilities by 10.
Hope the above at least provides food for thought.

Machine Learning Predictions and Normalization

I am using z-score to normalize my data before training my model. When I do predictions on a daily basis, I tend to have very few observations each day, perhaps just a dozen or so. My question is, can I normalize the test data just by itself, or should I attach it to the entire training set to normalize it?
The reason I am asking is, the normalization is based on mean and std_dev, which obviously might look very different if my dataset consists only of a few observations.
You need to have all of your data in the same units. Among other things, this means that you need to use the same normalization transformation for all of your input. You don't need to include the new data in the training per se -- however, keep the parameters of the normalization (the m and b of y = mx + b) and apply those to the test data as you receive them.
It's certainly not a good idea to predict on a test set using a model trained with a very different data distribution. I would use the same mean and std of your training data to normalize you test set.

How do neural networks learn functions instead of memorize them?

For a class project, I designed a neural network to approximate sin(x), but ended up with a NN that just memorized my function over the data points I gave it. My NN took in x-values with a batch size of 200. Each x-value was multiplied by 200 different weights, mapping to 200 different neurons in my first layer. My first hidden layer contained 200 neurons, each one a linear combination of the x-values in the batch. My second hidden layer also contained 200 neurons, and my loss function was computed between the 200 neurons in my second layer and the 200 values of sin(x) that the input mapped to.
The problem is, my NN perfectly "approximated" sin(x) with 0 loss, but I know it wouldn't generalize to other data points.
What did I do wrong in designing this neural network, and how can I avoid memorization and instead design my NN's to "learn" about the patterns in my data?
It is same with any machine learning algorithm. You have a dataset based on which you try to learn "the" function f(x), which actually generated the data. In real life datasets, it is impossible to get the original function from the data, and therefore we approximate it using something g(x).
The main goal of any machine learning algorithm is to predict unseen data as best as possible using the function g(x).
Given a dataset D you can always train a model, which will perfectly classify all the datapoints (you can use a hashmap to get 0 error on the train set), but which is overfitting or memorization.
To avoid such things, you yourself have to make sure that the model does not memorise and learns the function. There are a few things which can be done. I am trying to write them down in an informal way (with links).
Train, Validation, Test
If you have large enough dataset, use Train, Validation, Test splits. Split the dataset in three parts. Typically 60%, 20% and 20% for Training, Validation and Test, respectively. (These numbers can vary based on need, also in case of imbalanced data, check how to get stratified partitions which preserve the class ratios in every split). Next, forget about the Test partition, keep it somewhere safe, don't touch it. Your model, will be trained using the Training partition. Once you have trained the model, evaluate the performance of the model using the Validation set. Then select another set of hyper-parameter configuration for your model (eg. number of hidden layer, learaning algorithm, other parameters etc.) and then train the model again, and evaluate based on Validation set. Keep on doing this for several such models. Then select the model, which got you the best validation score.
The role of validation set here is to check what the model has learned. If the model has overfit, then the validation scores will be very bad, and therefore in the above process you will discard those overfit models. But keep in mind, although you did not use the Validation set to train the model, directly, but the Validation set was used indirectly to select the model.
Once you have selected a final model based on Validation set. Now take out your Test set, as if you just got new dataset from real life, which no one has ever seen. The prediction of the model on this Test set will be an indication how well your model has "learned" as it is now trying to predict datapoints which it has never seen (directly or indirectly).
It is key to not go back and tune your model based on the Test score. This is because once you do this, the Test set will start contributing to your mode.
Crossvalidation and bootstrap sampling
On the other hand, if your dataset is small. You can use bootstrap sampling, or k-fold cross-validation. These ideas are similar. For example, for k-fold cross-validation, if k=5, then you split the dataset in 5 parts (also be carefull about stratified sampling). Let's name the parts a,b,c,d,e. Use the partitions [a,b,c,d] to train and get the prediction scores on [e] only. Next, use the partitions [a,b,c,e] and use the prediction scores on [d] only, and continue 5 times, where each time, you keep one partition alone and train the model with the other 4. After this, take an average of these scores. This is indicative of that your model might perform if it sees new data. It is also a good practice to do this multiple times and perform an average. For example, for smaller datasets, perform a 10 time 10-folds cross-validation, which will give a pretty stable score (depending on the dataset) which will be indicative of the prediction performance.
Bootstrap sampling is similar, but you need to sample the same number of datapoints (depends) with replacement from the dataset and use this sample to train. This set will have some datapoints repeated (as it was a sample with replacement). Then use the missing datapoins from the training dataset to evaluate the model. Perform this multiple times and average the performance.
Others
Other ways are to incorporate regularisation techniques in the classifier cost function itself. For example in Support Vector Machines, the cost function enforces conditions such that the decision boundary maintains a "margin" or a gap between two class regions. In neural networks one can also do similar things (although it is not same as in SVM).
In neural network you can use early stopping to stop the training. What this does, is train on the Train dataset, but at each epoch, it evaluates the performance on the Validation dataset. If the model starts to overfit from a specific epoch, then the error for Training dataset will keep on decreasing, but the error of the Validation dataset will start increasing, indicating that your model is overfitting. Based on this one can stop training.
A large dataset from real world tends not to overfit too much (citation needed). Also, if you have too many parameters in your model (to many hidden units and layers), and if the model is unnecessarily complex, it will tend to overfit. A model with lesser pameter will never overfit (though can underfit, if parameters are too low).
In the case of you sin function task, the neural net has to overfit, as it is ... the sin function. These tests can really help debug and experiment with your code.
Another important note, if you try to do a Train, Validation, Test, or k-fold crossvalidation on the data generated by the sin function dataset, then splitting it in the "usual" way will not work as in this case we are dealing with a time-series, and for those cases, one can use techniques mentioned here
First of all, I think it's a great project to approximate sin(x). It would be great if you could share the snippet or some additional details so that we could pin point the exact problem.
However, I think that the problem is that you are overfitting the data hence you are not able to generalize well to other data points.
Few tricks that might work,
Get more training points
Go for regularization
Add a test set so that you know whether you are overfitting or not.
Keep in mind that 0 loss or 100% accuracy is mostly not good on training set.

Pre-randomization before random forest training in Scikit-learn

I am getting a surprisingly significant performance boost (+10% cross-validation accuracy gain) with sklearn.ensemble.RandomForestClassifier just by virtue of pre-randomizing the training set.
This is very puzzling to me, since
(a) RandomForestClassifier supposedly randomized the training data anyway; and
(b) Why would the order of example matter so much anyway?
Any words of wisdom?
I have got the same issue and posted a question, which luckily got resolved.
In my case it's because the data are put in order, and I'm using K-fold cross-validation without shuffling when doing the test-train split. This means that the model is only trained on a chunk of adjacent samples with certain pattern.
An extreme example would be, if you have 50 rows of sample all of class A, followed by 50 rows of sample all of class B, and you manually do a train-test split right in the middle. The model is now trained with all samples of class A, but tested with all samples of class B, hence the test accuracy will be 0.
In scikit, the train_test_split do the shuffling by default, while the KFold class doesn't. So you should do one of the following according to your context:
Shuffle the data first
Use train_test_split with shuffle=True (again, this is the default)
Use KFold and remember to set shuffle=True
Ordering of the examples should not affect RF performance at all. Note Rf performance can vary by 1-2% across runs anyway. Are you keeping cross-validation set separately before training?(Just ensuring this is not because cross-validation set is different every time). Also by randomizing I assume you mean changing the order of the examples.
Also you can check the Out of Bag accuracy of the classifier in both cases for the training set itself, you don't need a separate cross-validation set for RF.
During the training of Random Forest, the data for training each individual tree is obtained by sampling by replacement from the training data, thus each training sample is not used for roughly 1/3 of the trees. We can use the votes of these 1/3 trees to predict the out of box probability of the Random forest classification. Thus with OOB accuracy you just need a training set, and not validation or test data to predict performance on unseen data. Check Out of Bag error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm for further study.

Resources