cross validating a train set where the class variable has a different distribution than the actual population - machine-learning

(noob in ML, be patient)
I want to test the performance of my scikit-learn linear SVM classifier. My training set has a different class distribution than the actual population, but my test set is representative and is distributed like the actual population.
I noticed that there's a class_weight parameter, and I want to try giving my classifier the actual population distribution to see if it helps it perform better.
However, since my training-set distribution is different, so will be that of my validation set, right? So should I expect an improvement on the validation set, or must I use my test set to see the improvement? And if so, isn't it against the rules to calibrate using the test set, which would burn the test set or lead to overfitting?
I've thought about bootstrap resampling my training set so that it is distributed like the general population, and only then training and validating my model. Is this a good solution?
Thanks!

It seems that you have some good ideas, most of which are worth trying. The answers mostly depend on the application and the size of your train/test sets.
It is against the rules to calibrate on the test set and then use the whole test set again for evaluation. However, if your test set is large enough, you can always divide it into two sets: a validation set and the actual test set. Your final evaluation will then be based on a smaller test set, which might still be acceptable depending on the application.
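As a rough sketch of that split, assuming scikit-learn is available and using made-up data in place of the real test set (the array names and the 50/50 ratio are illustrative assumptions, not part of the original question):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a representative test set: 1000 rows, 10% positives.
rng = np.random.default_rng(0)
X_test_full = rng.normal(size=(1000, 5))
y_test_full = np.array([1] * 100 + [0] * 900)

# Carve the representative set into a validation half (used for tuning,
# e.g. class_weight) and a final test half that is evaluated only once.
X_val, X_final, y_val, y_final = train_test_split(
    X_test_full,
    y_test_full,
    test_size=0.5,
    stratify=y_test_full,  # keep the population class ratio in both halves
    random_state=0,
)

print(len(y_val), len(y_final))
print(y_val.mean(), y_final.mean())
```

Stratifying the split keeps both halves distributed like the population, so tuning on one half does not change what the other half measures.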
For a training set that you believe has a different class distribution than the actual population, there are several things worth trying. Usually the most accepted approach is to use a classifier that can handle these differences (usually one with fewer parameters, to avoid overfitting). There is a whole topic of classification and regression on skewed datasets that you can look through. Apart from the choice of classifier, and provided that you did not derive the actual population distribution from your test set, the methods below might help too:
1- One of them can be (as you said) bootstrap resampling, provided that your training set is large enough for that.
2- Another approach is to generate more training samples by adding some noise to the current samples of the training set. For example, if you are classifying images of birds, you can randomly make images darker or brighter, or randomly shift them a few pixels sideways or up and down (selecting values randomly in a small enough range). This way, you can add to the training set so as to get the desired distribution.
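The bootstrap idea in point 1 could look roughly like the sketch below. The helper `resample_to_priors`, the toy arrays, and the 90/10 population priors are all assumptions made for illustration; the last lines show one common way to express the same correction as class weights (population prior divided by training prior), which is the kind of dictionary scikit-learn's `class_weight` parameter accepts.

```python
import numpy as np

def resample_to_priors(X, y, priors, n_samples, seed=0):
    """Bootstrap rows of (X, y) so that class frequencies match `priors`.

    `priors` maps class label -> desired fraction (assumed to sum to 1).
    """
    rng = np.random.default_rng(seed)
    chosen = []
    for label, frac in priors.items():
        pool = np.flatnonzero(y == label)
        n_label = int(round(frac * n_samples))
        # Sample with replacement, so a small class can be drawn repeatedly.
        chosen.append(rng.choice(pool, size=n_label, replace=True))
    idx = rng.permutation(np.concatenate(chosen))
    return X[idx], y[idx]

# Hypothetical data: a balanced training set, while the population is 90/10.
X_train = np.arange(200, dtype=float).reshape(100, 2)
y_train = np.array([0] * 50 + [1] * 50)
population_priors = {0: 0.9, 1: 0.1}  # assumed known from outside the test set

X_res, y_res = resample_to_priors(X_train, y_train, population_priors, n_samples=1000)
print(np.bincount(y_res))

# Alternative without resampling: weight each class by prior ratio.
class_weight = {
    label: prior / float(np.mean(y_train == label))
    for label, prior in population_priors.items()
}
print(class_weight)
```

Whether resampling or weighting works better is an empirical question, so both are worth checking on a validation set that follows the population distribution.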

Related

Oversampled train set and test set - machine learning classification

Let's say that I have oversampled my training set after splitting, and then selected the features of interest to be extracted based on an analysis of the training set.
After this, do I use the oversampled training set together with the testing set to determine the classification performance (accuracy, precision, F1 measure, etc.), or do I just use the testing set for it?
(Not really a programming question but it's important enough to be clarified imho)
To measure performance reliably you must use the original test set, without any resampling.
This is one of the reasons why the train/test split should always be done first: the test set should be kept "fresh". Resampling the test set would be like cheating, because it makes the problem easier to solve.
Note: in general resampling rarely works, especially with text.
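The order of operations described above (split first, resample only the training part, score on the untouched test part) might be sketched as follows. The synthetic dataset, the logistic regression model, and the choice of F1 are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (roughly 90/10) standing in for a real dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# 1) Split first, so the test set keeps the original distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2) Oversample the minority class in the training set only.
majority = X_train[y_train == 0]
minority = X_train[y_train == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
X_bal = np.vstack([majority, minority_up])
y_bal = np.array([0] * len(majority) + [1] * len(minority_up))

# 3) Fit on the resampled training data.
clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# 4) Measure performance on the original, non-resampled test set.
test_f1 = f1_score(y_test, clf.predict(X_test))
print(f"F1 on untouched test set: {test_f1:.3f}")
```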

Model selection for classification with random train/test sets

I'm working with an extremely unbalanced and heterogeneous multiclass (K = 16) database for research, with a small N ~= 250. For some labels the database has a sufficient number of examples for supervised machine learning, but for others I have almost none. I'm also not in a position to expand my database for a number of reasons.
As a first approach I divided my database into training (80%) and test (20%) sets in a stratified way. On top of that, I applied several classification algorithms that provide some results. I applied this procedure over 500 stratified train/test sets (as each stratified sampling takes individuals randomly within each stratum), hoping to select an algorithm (model) that performed acceptably.
Because of my database, depending on the specific examples that are part of the train set, the performance on the test set varies greatly. I'm dealing with runs that have as high (for my application) as 82% accuracy and runs that have as low as 40%. The median over all runs is around 67% accuracy.
When facing this situation, I'm unsure what the standard procedure is (if there is any) for selecting the best-performing model. My rationale is that the highest-accuracy model may generalize better because the specific examples selected for its training set are richer, so that the test set is better classified. However, I'm fully aware of the possibility that the test set is composed of "simpler" cases that are easier to classify, or that the train set comprises all the hard-to-classify cases.
Is there any standard procedure for selecting the best-performing model, considering that the distribution of examples in my train/test sets causes the results to vary greatly? Am I making a conceptual mistake somewhere? Do practitioners usually select the best-performing model without any further exploration?
I don't like the idea of using the mean/median accuracy, as obviously some models generalize better than others, but I'm by no means an expert in the field.
Confusion matrix of the predicted label on the test set of one of the best cases:
Confusion matrix of the predicted label on the test set of one of the worst cases:
They both use the same algorithm and parameters.
Good Accuracy =/= Good Model
First, I want to point out that good accuracy on your test set need not mean a good model in general! In your case this mainly has to do with your extremely skewed distribution of samples.
Especially when doing a stratified split with one class dominating, you will likely get good results by simply predicting this one class over and over again.
A good way to see if this is happening is to look at a confusion matrix of your predictions.
If there is one class that the other classes are frequently mistaken for, that is an indicator of a bad model. I would argue that in your case it will generally be very hard to find a good model unless you actively try to balance your classes more during training.
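A small sketch of the effect described above, using made-up labels (90/5/5 across three classes) and a "model" that always predicts the dominant class. The numbers are illustrative assumptions, not the asker's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Toy labels: class 0 dominates; the "model" always predicts class 0.
y_true = np.array([0] * 90 + [1] * 5 + [2] * 5)
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

print(f"accuracy:          {acc:.2f}")
print(f"balanced accuracy: {bal_acc:.2f}")
print(cm)  # rows are true classes, columns are predicted classes
```

All predictions land in the first column of the matrix, which is the pattern to look for: high plain accuracy, but every minority class is absorbed by the dominant one.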
Use the power of Ensembles
Another idea is indeed to use ensembling over multiple models (in your case resulting from different splits), since it is assumed to generalize better.
Even if you might sacrifice a lot of accuracy on paper, I would bet that the confusion matrix of an ensemble is likely to look much better than that of a single "high accuracy" model. Especially if you disregard the models that perform extremely poorly (make sure, again, that the "poor" performance comes from an actually bad model and not just an unlucky split), I would expect very good generalization.
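One possible reading of "ensembling over models from different splits" is soft voting: train one model per stratified resplit and average their predicted probabilities. The sketch below assumes that reading; the iris data, the decision trees, and the number of splits are stand-ins chosen only to keep the example runnable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): iris replaces the 16-class research data.
X, y = load_iris(return_X_y=True)

# Keep a final test set aside before building the ensemble.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Train one model per stratified resplit of the development data.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
models = []
for train_idx, _ in splitter.split(X_dev, y_dev):
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    models.append(model.fit(X_dev[train_idx], y_dev[train_idx]))

# Soft voting: average the class probabilities of all ensemble members.
avg_proba = np.mean([m.predict_proba(X_test) for m in models], axis=0)
ensemble_pred = models[0].classes_[avg_proba.argmax(axis=1)]
ensemble_acc = float((ensemble_pred == y_test).mean())
print(f"ensemble accuracy on held-out test set: {ensemble_acc:.2f}")
```

Averaging probabilities only works this simply because every stratified split contains all classes, so each member outputs the same columns; with very rare classes that would need checking.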
Try k-fold Cross-Validation
Another common technique is k-fold cross-validation. Instead of performing your evaluation on a single 80/20 split, you divide your data into k equally sized sets, and then always train on k-1 sets while evaluating on the remaining one. You then not only get a feeling for whether your split was reasonable (k-fold CV implementations, like the one from sklearn, usually return the results for each split), but you also get an overall score that is the average over all folds.
Note that 5-fold CV would equal a split into 5 20% sets, so essentially what you are doing now, plus the "shuffling part".
CV is also a good way to deal with little training data, in settings where you have imbalanced classes, or where you generally want to make sure your model actually performs well.
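A minimal sketch of repeated stratified k-fold CV with scikit-learn, reporting the spread of fold scores rather than a single number. The dataset and the logistic regression model are placeholders chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Stand-in dataset and model (assumptions for the sake of a runnable example).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5 folds, repeated 3 times with different shuffles: 15 scores in total.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"mean accuracy: {scores.mean():.3f}")
print(f"std over folds: {scores.std():.3f}")
print(f"min / max: {scores.min():.3f} / {scores.max():.3f}")
```

Comparing algorithms on the mean and spread over the same folds is usually more informative than picking the single run with the highest score.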

Data augmentation in test/validation set?

Is it common practice to augment data (add samples programmatically, such as random crops, etc. in the case of a dataset consisting of images) on both the training and test sets, or just on the training set?
Only on training. Data augmentation is used to increase the size of the training set and to get more different images.
Technically, you could use data augmentation on the test set to see how the model behaves on such images, but usually, people don't do it.
Data augmentation is done only on the training set, as it helps the model generalize better and become more robust. So there's no point in augmenting the test set.
This answer on stats.SE makes the case for applying crops to the validation / test sets so as to make that input similar to the input that the network was trained on.
Do it only on the training set. And, of course, make sure that the augmentation does not make the label wrong (e.g. when rotating 6 and 9 by about 180°).
The reason why we use a training and a test set in the first place is that we want to estimate the error our system will have in reality. So the data for the test set should be as close to real data as possible.
If you do it on the test set, you might have the problem that you introduce errors. For example, say you want to recognize digits and you augment by rotating. Then a 6 might look like a 9. But not all examples are that easy. Better safe than sorry.
I would argue that, in some cases, using data augmentation for the validation set can be helpful.
For example, I train a lot of CNNs for medical image segmentation. Many of the augmentation transforms that I use are meant to reduce the image quality so that the network is trained to be robust against such data. If the training set looks bad and the validation set looks nice, it will be hard to compare the losses during training and therefore assessing overfit will be complicated.
I would never use augmentation for the test set unless I'm using test-time augmentation to improve results or estimate aleatoric uncertainty.
In computer vision, you can use data augmentation during test time to obtain different views on the test image. You then have to aggregate the results obtained from each image for example by averaging them.
For example, the same symbol can be interpreted differently depending on the point of view.
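Test-time augmentation as described above can be sketched with plain NumPy. Everything here is hypothetical: `toy_predict_proba` stands in for a trained network, and the three views (original, horizontal flip, vertical flip) are assumed to preserve the label, which has to be checked for the task at hand.

```python
import numpy as np

def toy_predict_proba(batch):
    """Hypothetical stand-in for a trained model.

    Takes a batch of shape (n, height, width) and returns (n, 2) probabilities.
    """
    h, w = batch.shape[1] // 2, batch.shape[2] // 2
    # The toy rule only looks at the top-left quadrant of each image.
    score = batch[:, :h, :w].mean(axis=(1, 2)) - 0.5
    p1 = 1.0 / (1.0 + np.exp(-10.0 * score))
    return np.stack([1.0 - p1, p1], axis=1)

def tta_predict(batch, predict_fn, transforms):
    """Average predictions over several views of each test image."""
    per_view = [predict_fn(t(batch)) for t in transforms]
    return np.mean(per_view, axis=0)

views = [
    lambda b: b,              # original image
    lambda b: b[:, :, ::-1],  # horizontal flip
    lambda b: b[:, ::-1, :],  # vertical flip
]

rng = np.random.default_rng(0)
images = rng.uniform(size=(4, 8, 8))  # four random 8x8 "images"
tta_proba = tta_predict(images, toy_predict_proba, views)
print(tta_proba)
```

Averaging is only one aggregation choice; majority voting over the per-view predictions is another.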
Some image preprocessing software tools like Roboflow (https://roboflow.com/) apply data augmentation to test data as well. I'd say that if one is dealing with small and rare objects, say, cerebral microbleeds (which are tiny and difficult to spot on magnetic resonance images), augmenting one's test set could be useful. Then you can verify that your model has learned to detect these objects given different orientation and brightness conditions (given that your training data has been augmented in the same way).
The goal of data augmentation is to make the model generalize and learn more orientations of the images, so that during testing the model is able to handle the test data well. So it is common practice to use augmentation only for training sets.
The point of adding validation data is to build a model that generalizes, which means nothing other than predicting real-world data. In order to predict real-world data, the validation set should contain real data. There is no problem with augmenting validation data, but it won't increase the accuracy of the model.
Here are my two cents:
You train your model on the training data and the validation data: the former to optimize your parameters, and the latter to give you an appropriate stopping condition. The test data is to give you a real-world estimate of how well you can expect your model to perform.
For training, you can augment your training data to increase robustness to various factors including, but not limited to, sampling error, bias between data sources, shifts in global data distribution, positioning, and any other sort of variation you would like to account for.
The validation data should indicate to the training method when the model is most generalizable. By this logic, if you expect to see some variation in real-world data that can be simulated using data augmentation, then by all means, the validation dataset should be augmented.
The test data, on the other hand, should not be augmented, except potentially in special scenarios where data is very limited, and an estimate of real-world performance on test data has too much variance.
You can use augmentation data in training, validation and test sets.
The only thing to avoid is using the same data from the training set in validation or test sets.
For example, if you generate 3 augmented instances from a record of the training data, make sure that none of these 3 augmented instances accidentally ends up in the validation or test sets.
It turns out that using data from the training set, even augmented data, to validate or test a model is a methodological mistake.
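One way to guard against that kind of leakage, assuming each augmented row remembers which original record it came from, is a group-aware split. The `source_id` array and the sizes below are made up for the sketch.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical setup: 20 original records, each with 3 augmented copies.
n_originals, n_aug = 20, 3
source_id = np.repeat(np.arange(n_originals), 1 + n_aug)  # 80 rows in total

rng = np.random.default_rng(0)
X = rng.normal(size=(len(source_id), 4))  # placeholder features

# Split by source record, so an original and its copies stay on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=source_id))

train_sources = set(source_id[train_idx].tolist())
test_sources = set(source_id[test_idx].tolist())
print(len(train_sources), len(test_sources))
print(train_sources & test_sources)  # shared source records between the sides
```

A simpler alternative is to split the original records first and only then augment the training portion, which avoids the problem by construction.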

Is a decision tree with a perfect attribute considered overfit?

I have a 6-dimensional training dataset where there is a perfect numeric attribute which separates all the training examples this way: if TIME<200 then the example belongs to class1, if TIME>=200 then example belongs to class2. J48 creates a tree with only 1 level and this attribute as the only node.
However, the test dataset does not follow this hypothesis and all the examples are misclassified. I'm having trouble figuring out whether this case is considered overfitting or not. I would say it is not, as the dataset is that simple, but as far as I understood the definition of overfitting, it implies a high degree of fitting to the training data, and this is what I have. Any help?
Usually a great training score and a bad test score mean overfitting. But this assumes that the data are i.i.d., and you are clearly violating this assumption: your training data is completely different from the test data (there is a clear rule in the training data that has no meaning in the test data). In other words, either your train/test split is incorrect, or your whole problem does not satisfy the basic assumptions under which statistical ML applies. Of course, we often fit models without valid assumptions about the data. In your case, the most natural approach is to drop the feature that violates the assumption the most: the one used to construct the node. This kind of "expert decision" should be made before building any classifier. You have to think about what is different in the test scenario compared to the training one and remove the things that show this difference; otherwise you have a heavy skew in your data collection, and statistical methods will fail.
Yes, it is an overfit. The first rule in creating a training set is to make it look as much like any other set as possible. Your training set is clearly different than any other. It has the answer embedded within it while your test set doesn't. Any learning algorithm will likely find the correlation to the answer and use it and, just like the J48 algorithm, will regard the other variables as noise. The software equivalent of Clever Hans.
You can overcome this by either removing the variable or by training on a set drawn randomly from the entire available set. However, since you know that there is a subset with an embedded major hint, you should remove the hint.
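The situation can be reproduced with a small synthetic sketch. It uses scikit-learn's `DecisionTreeClassifier` in place of J48, and invented data in which a TIME column separates the classes perfectly in training but is unrelated to the class at test time (so the full tree lands near chance rather than at 0% as in the question). The second column, `signal`, is a hypothetical weaker feature that behaves the same in both sets.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_set(n_per_class, time_is_perfect):
    """Build (X, y) with columns [TIME, signal]; data is entirely synthetic."""
    y = np.array([0] * n_per_class + [1] * n_per_class)
    if time_is_perfect:
        # TIME < 200 for class 0 and TIME >= 200 for class 1.
        time = np.where(
            y == 0, rng.uniform(0, 200, y.size), rng.uniform(200, 400, y.size)
        )
    else:
        # TIME carries no class information.
        time = rng.uniform(0, 400, y.size)
    signal = y + rng.normal(0.0, 0.5, y.size)  # noisy but honest feature
    return np.column_stack([time, signal]), y

X_train, y_train = make_set(100, time_is_perfect=True)
X_test, y_test = make_set(500, time_is_perfect=False)

# With TIME available, the tree needs a single split on it.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc_with_time = full_tree.score(X_test, y_test)

# Dropping TIME forces the tree to use the feature that transfers.
reduced_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
reduced_tree.fit(X_train[:, 1:], y_train)
acc_without_time = reduced_tree.score(X_test[:, 1:], y_test)

print(f"tree depth with TIME: {full_tree.get_depth()}")
print(f"test accuracy with TIME:    {acc_with_time:.2f}")
print(f"test accuracy without TIME: {acc_without_time:.2f}")
```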
You're lucky. At times these hints can be quite subtle, and you won't discover them until you start applying the model to future data.

Why should my training set also be skewed in terms of number of class distribution just because my test set is skewed

My question is why should my training set also be skewed (number of instances of positive class much fewer compared to negative class) when my test set is also skewed. I read that it is important to maintain the distribution between the classes the same in both training and test set to get the most realistic performance. For example, if my test set has 90%-10% distribution of class instances, should my training set also have the same proportions?
I am finding it difficult to understand why it is important to maintain the proportions of class instances in the training set as present in the test set.
The reason I find it difficult to understand is this: don't we want a classifier to just learn the patterns in both classes? So, should it matter to maintain skewness in the training set just because the test set is skewed?
Any thoughts will be helpful
IIUC, you're asking about the rationale for using stratified sampling (e.g., as used in scikit-learn's StratifiedKFold).
Once you've divided your data into train and test sets, you have three datasets to consider:
1. the "real world" set, on which your classifier will really run
2. the train set, on which you'll learn patterns
3. the test set, which you'll use to evaluate the performance of the classifier
(So the uses of 2. + 3. are really just for estimating how things will run on 1, including possibly tuning parameters.)
Suppose some class in your data is represented far from uniformly - say it appears only 5% of the times it would appear if classes were generated uniformly. Moreover, you believe that this is not a GIGO (garbage in, garbage out) case: in the real world, the probability of this class would be about 5%.
When you divide into 2. + 3., you run the risk that things will be skewed relative to 1.:
It's very possible that the class won't appear 5% of the time (in the train or test set), but rather more or less often.
It's very possible that some of the feature instances of the class will be skewed in the train or test set, relative to 1.
In these cases, when you make decisions based on the 2. + 3. combination, they probably won't reflect well the effect on 1., which is what you're really after.
Incidentally, I don't think the emphasis is on skewing the train to fit the test, but rather on making the train and test each fit the entire sampled data.
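To make the point concrete, the sketch below counts how many examples of a 5% class land in each test fold, with and without stratification. The labels are synthetic and the fold count is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels: 50 positives out of 1000, i.e. a 5% class.
y = np.array([1] * 50 + [0] * 950)
X = np.zeros((len(y), 1))  # features are irrelevant for this illustration

def positives_per_fold(splitter):
    """Count positive examples in each test fold produced by `splitter`."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
stratified = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("plain KFold:     ", plain)
print("StratifiedKFold: ", stratified)
```

With stratification every fold keeps the 5% rate of the sampled data, whereas plain shuffling lets the rate drift from fold to fold, and the drift gets worse as the class gets rarer or the dataset smaller.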
