Why should my training set also be skewed in terms of class distribution just because my test set is skewed - machine-learning

My question is: why should my training set also be skewed (far fewer instances of the positive class than of the negative class) when my test set is skewed? I read that it is important to keep the class distribution the same in both the training and test sets to get the most realistic performance estimate. For example, if my test set has a 90%-10% distribution of class instances, should my training set have the same proportions?
I am finding it difficult to understand why it is important to maintain in the training set the same class proportions that are present in the test set.
The reason I find it difficult to understand is this: don't we want a classifier to simply learn the patterns of both classes? So should it matter whether the training set is skewed just because the test set is skewed?
Any thoughts will be helpful.

IIUC, you're asking about the rationale for using stratified sampling (e.g., as implemented in scikit-learn's StratifiedKFold).
Once you've divided your data into train and test sets, you have three datasets to consider:
the "real world" set, on which your classifier will really run
the train set, on which you'll learn patterns
the test set, which you'll use to evaluate the performance of the classifier
(So the uses of 2. + 3. are really just for estimating how things will run on 1, including possibly tuning parameters.)
Suppose some class in your data is represented far from uniformly - say it appears only 5% as often as it would if classes were generated uniformly. Moreover, you believe that this is not a GIGO case - in the real world, the probability of this class really is about 5%.
When you divide into 2. + 3., you run the chance that things will be skewed relative to 1.:
It's very possible that the class won't appear 5% of the time (in the train or test set), but rather more or less often.
It's very possible that some of the feature instances of the class will be skewed in the train or test set, relative to 1.
In these cases, when you make decisions based on the 2. + 3. combination, they probably won't reflect well the effect on 1., which is what you're really after.
Incidentally, I don't think the emphasis is on skewing the train set to fit the test set, but rather on making the train and test sets each fit the entire sampled data.
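To make the stratification concrete, here is a minimal sketch (with made-up toy data) using scikit-learn's train_test_split with the stratify argument, which keeps the class proportions of the full sample in both the train and test splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 samples, roughly 5% positive class (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.05).astype(int)

# stratify=y keeps the ~5% positive rate in both splits,
# so train and test each "fit" the entire sampled data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("full set positive rate:", y.mean())
print("train positive rate:   ", y_train.mean())
print("test positive rate:    ", y_test.mean())
```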

Related

Oversampled train set and test set - machine learning classification

Let's say that I have oversampled my training set after splitting, then I selected the features of interest to be extracted based on the training set analysis.
After this, do I use the oversampled training set together with the testing set to determine classification performance (accuracy, precision, F1 measure, etc.), or do I just use the testing set for it?
(Not really a programming question but it's important enough to be clarified imho)
To measure performance reliably you must use the original test set, without any resampling.
This is one of the reasons why the train/test split should always be done first: the test set should be kept "fresh". Resampling the test set would be like cheating, because it makes the problem easier to solve.
Note: in general resampling rarely works, especially with text.
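A minimal sketch of that workflow, assuming a binary problem and using sklearn.utils.resample for the oversampling (a dedicated tool such as imbalanced-learn's RandomOverSampler works the same way): only the training split is resampled, and the untouched test split is used for evaluation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Toy imbalanced data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (rng.random(2000) < 0.1).astype(int)

# 1) Split first, so the test set stays "fresh"
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                           stratify=y, random_state=0)

# 2) Oversample the minority class in the TRAINING split only
X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]
X_maj, y_maj = X_tr[y_tr == 0], y_tr[y_tr == 0]
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])

# 3) Train on the balanced set, evaluate on the ORIGINAL test set
clf = LogisticRegression().fit(X_bal, y_bal)
print(classification_report(y_te, clf.predict(X_te)))
```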

Model selection for classification with random train/test sets

I'm working with an extremely unbalanced and heterogeneous multiclass (K = 16) database for research, with a small N ~= 250. For some labels the database has a sufficient amount of examples for supervised machine learning, but for others I have almost none. I'm also not in a position to expand my database, for a number of reasons.
As a first approach I divided my database into training (80%) and test (20%) sets in a stratified way. On top of that, I applied several classification algorithms that provide some results. I applied this procedure over 500 stratified train/test sets (as each stratified sampling takes individuals randomly within each stratum), hoping to select an algorithm (model) that performed acceptably.
Because of my database, the performance on the test set varies greatly depending on which specific examples end up in the train set. I'm dealing with runs with accuracy as high as 82% (high for my application) and runs as low as 40%. The median over all runs is around 67% accuracy.
When facing this situation, I'm unsure what the standard procedure is (if there is any) for selecting the best performing model. My rationale is that the best-scoring model may generalize better because the specific examples selected for its training set are richer, so that the test set is better classified. However, I'm fully aware of the possibility that the test set is composed of "simpler" cases that are easier to classify, or that the train set comprises all the hard-to-classify cases.
Is there any standard procedure to select the best performing model considering that the distribution of examples in my train/test sets cause the results to vary greatly? Am I making a conceptual mistake somewhere? Do practitioners usually select the best performing model without any further exploration?
I don't like the idea of using the mean/median accuracy, as obviously some models generalize better than others, but I'm by no means an expert in the field.
Confusion matrix of the predicted labels on the test set for one of the best cases: [image omitted]
Confusion matrix of the predicted labels on the test set for one of the worst cases: [image omitted]
Both use the same algorithm and parameters.
Good Accuracy =/= Good Model
I want to point out first that good accuracy on your test set need not mean a good model in general! In your case this has mainly to do with your extremely skewed distribution of samples.
Especially when doing a stratified split with one class overwhelmingly represented, you will likely get good results by simply predicting that one class over and over again.
A good way to see whether this is happening is to look at a confusion matrix of your predictions.
If there is one class that the model frequently predicts for samples of the other classes as well, that is an indicator of a bad model. I would argue that in your case it will generally be very hard to find a good model unless you actively try to balance your classes more during training.
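As a rough illustration of that check (with toy data standing in for the real dataset), comparing the confusion matrix of a model against a majority-class DummyClassifier baseline makes the "predict one class over and over" failure mode easy to spot:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

# Toy imbalanced multiclass data (assumed for illustration)
X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           weights=[0.8, 0.15, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, clf in [("majority baseline", DummyClassifier(strategy="most_frequent")),
                  ("SVC", SVC())]:
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(name, "accuracy:", accuracy_score(y_te, pred))
    print(confusion_matrix(y_te, pred))  # rows: true class, columns: predicted class
```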
Use the power of Ensembles
Another idea is indeed to use ensembling over multiple models (in your case, the models resulting from different splits), since an ensemble is generally assumed to generalize better.
Even if you sacrifice a lot of accuracy on paper, I would bet that the confusion matrix of an ensemble is likely to look much better than that of a single "high accuracy" model. Especially if you disregard the models that perform extremely poorly (making sure, again, that the "poor" performance comes from an actually bad model and not just an unlucky split), I can see very good generalization.
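A minimal sketch of such an ensemble, assuming classifiers that expose predict_proba and using toy data in place of the real dataset: one model is trained per stratified split, and their predicted probabilities are averaged (soft voting) at prediction time.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the small, imbalanced real dataset
X, y = make_classification(n_samples=250, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=0)

# Train one model per stratified 80/20 split (10 splits here instead of 500)
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
models = []
for train_idx, _ in splitter.split(X, y):
    models.append(LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx]))

def ensemble_predict(X_new):
    # Soft voting: average the class probabilities of all member models
    probs = np.mean([m.predict_proba(X_new) for m in models], axis=0)
    return probs.argmax(axis=1)

print(ensemble_predict(X[:5]), y[:5])
```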
Try k-fold Cross-Validation
Another common technique is k-fold cross-validation. Instead of performing your evaluation on a single 80/20 split, you divide your data into k equally sized sets, then repeatedly train on k-1 of them while evaluating on the remaining set. You then not only get a feeling for whether your split was reasonable (k-fold CV implementations, like the one from sklearn, usually report the results for all splits), but you also get an overall score that is the average over all folds.
Note that 5-fold CV amounts to five 20% splits, so essentially what you are doing now, plus the shuffling part.
CV is also a good way to deal with little training data, in settings where you have imbalanced classes, or where you generally want to make sure your model actually performs well.
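A minimal sketch of stratified k-fold CV with scikit-learn (toy data, an arbitrary classifier and scoring choice): cross_val_score returns one score per fold, so you see both the spread across splits and the overall average.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced data (assumed for illustration)
X, y = make_classification(n_samples=250, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="balanced_accuracy")

print("per-fold scores:", scores)        # the spread shows how split-dependent results are
print("mean +/- std:   ", scores.mean(), scores.std())
```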

Different composition for training and test sets

Training and test sets in machine learning are normally discussed as though they will have the same composition, e.g. take X% of your examples as the training set, and the rest are the test set.
However, suppose you are trying to solve a classification problem - for simplicity, say binary classification, like distinguishing between photographs of horses and zebras. The classes are not equally common. Say 95% of photos are horses and the other 5% are zebras. If you feed that mix into a neural network, or any other machine learning algorithm, it will quickly settle on classifying everything as a horse and thereby achieving 95% accuracy.
There are such things as cost-sensitive neural networks, which can penalize a false negative more heavily than a false positive. But the added complexity increases development time and creates more opportunities for bugs to creep in.
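As a hedged sketch of the cost-sensitive idea without a custom network: many scikit-learn classifiers accept a class_weight argument that penalizes mistakes on the rare class more heavily (the data and the 19:1 weight below are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Toy 95%/5% "horse vs zebra" style data (assumed for illustration)
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight makes an error on the rare class (1) cost 19x an error on class 0,
# roughly compensating for the 95:5 imbalance ("balanced" would compute this automatically)
clf = LogisticRegression(class_weight={0: 1.0, 1: 19.0}).fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te)))
```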
A simpler, more general method is resampling, where you train the network on equal quantities of each class. If you have 10,000 pictures, take 250 zebra pictures and 250 horse pictures, and use those as your training set. The other 250 zebras can go with another 4,750 horses to form your test set. That way, you can calculate a confusion matrix on the test set that will reflect the performance that can be expected of the trained network in the wild.
This means the training set and test set have deliberately different composition.
So my question: is it indeed normal for training set and test set to have different composition, and this just isn't often mentioned? Or am I missing something?
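For concreteness, here is a sketch of the split described above, with toy arrays standing in for the photos (1 = zebra, 0 = horse): 250 of each class form the balanced training set, while the test set keeps the realistic 5% zebra rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for 10,000 photos: 9,500 horses (0) and 500 zebras (1)
X = rng.normal(size=(10_000, 16))
y = np.array([0] * 9_500 + [1] * 500)

horse_idx = rng.permutation(np.where(y == 0)[0])
zebra_idx = rng.permutation(np.where(y == 1)[0])

# Balanced training set: 250 zebras + 250 horses
train_idx = np.concatenate([zebra_idx[:250], horse_idx[:250]])
# Realistic test set: the other 250 zebras + 4,750 horses (5% zebras, as in the wild)
test_idx = np.concatenate([zebra_idx[250:], horse_idx[250:5_000]])

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(y_train.mean(), y_test.mean())   # 0.5 on train, 0.05 on test
```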

Balance classes in cross validation

I would like to build a GBM model with H2O. My data set is imbalanced, so I am using the balance_classes parameter. For grid search (parameter tuning) I would like to use 5-fold cross validation. I am wondering how H2O deals with class balancing in that case. Will only the training folds be rebalanced? I want to be sure the test-fold is not rebalanced.
In class imbalance settings, artificially balancing the test/validation set does not make any sense: these sets must remain realistic, i.e. you want to test your classifier's performance in the real-world setting, where, say, the negative class includes 99% of the samples, in order to see how well your model does at predicting the 1% positive class of interest without too many false positives. Artificially inflating the minority class or reducing the majority one will lead to performance metrics that are unrealistic, bearing no real relation to the real-world problem you are trying to solve.
For corroboration, here is Max Kuhn, creator of the caret R package and co-author of the (highly recommended) Applied Predictive Modeling textbook, in Chapter 11: Subsampling For Class Imbalances of the caret ebook:
You would never want to artificially balance the test set; its class frequencies should be in-line with what one would see “in the wild”.
Re-balancing makes sense only in the training set, so as to prevent the classifier from simply and naively classifying all instances as negative for a perceived accuracy of 99%.
Hence, you can rest assured that in the setting you describe, the rebalancing applies only to the training set/folds.
A way to force balancing is to use a weights column that assigns different weights to different classes; in H2O this is the weights_column parameter.
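A rough sketch of both options, assuming the standard H2O Python API; the CSV path and column names are purely illustrative:

```python
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("train.csv")            # hypothetical file
train["label"] = train["label"].asfactor()      # hypothetical response column

gbm = H2OGradientBoostingEstimator(
    balance_classes=True,          # rebalances the training data/folds only
    nfolds=5,                      # 5-fold cross-validation
    fold_assignment="Stratified",  # keep class proportions similar across folds
    seed=42,
)
gbm.train(x=[c for c in train.columns if c != "label"],
          y="label", training_frame=train)

# Alternative to balance_classes: per-row weights via a precomputed weights column
# gbm_w = H2OGradientBoostingEstimator(weights_column="weight", nfolds=5, seed=42)
# gbm_w.train(x=[c for c in train.columns if c not in ("label", "weight")],
#             y="label", training_frame=train)
```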

cross validating a train set where the class variable has a different distribution than the actual population

(noob in ML, be patient)
I want to test the performance of my scikit-learn SVMLinear classifier. My train set has a different class distribution than the actual population, but my test set is representative and distributed like the actual population.
I noticed that there's a class-weight parameter, and I want to try giving my classifier the actual population distribution, and see if it helps it perform better.
However, as my train-set distribution is different, so will be my validation set, right? So should I expect an improvement on the validation set, or must I use my test set to see the improvement? And if so, isn't it against the rules to calibrate using the test set, which would lead to burning the test set or overfitting?
I've thought about bootstrap re-sampling of my train set: making it follow the same distribution as the general population, and only then training and validating my model. Is this a good solution?
Thanks!
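One way to encode "give the classifier the actual population distribution" is importance weighting: weight each class by its population prior divided by its frequency in the training set, and pass that as class_weight. A sketch with made-up numbers and toy data, using scikit-learn's LinearSVC:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical numbers: the training set is 50/50 but the real population is 90/10,
# so each class is weighted by population_prior / training_frequency.
train_freq = {0: 0.5, 1: 0.5}   # class frequencies in the (non-representative) train set
population = {0: 0.9, 1: 0.1}   # believed real-world class priors
class_weight = {c: population[c] / train_freq[c] for c in train_freq}

# Toy training data matching the 50/50 train distribution (assumed for illustration)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 8))
y_train = np.array([0, 1] * 200)

clf = LinearSVC(class_weight=class_weight).fit(X_train, y_train)
print(class_weight)   # {0: 1.8, 1: 0.2}
```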
It seems that you have some good ideas which are mostly worth trying. The answers mostly depend on the application and the size of your train/test set.
It is against the rules to calibrate based on the test set and then use the whole test set again for evaluation. However, if your test set is large enough, you can always divide it into two sets: a validation set and an actual test set. Then your final evaluation will be based on a smaller test set, which might still be acceptable depending on the application.
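A minimal sketch of that carve-out with scikit-learn (toy data for illustration): the original test set is split once into a validation part for calibration and a final part that is touched only for the last evaluation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration)
X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4,
                                           stratify=y, random_state=0)

# Carve a validation set out of the representative test set; keep the rest untouched
X_val, X_final, y_val, y_final = train_test_split(
    X_te, y_te, test_size=0.5, stratify=y_te, random_state=0
)
# Tune/calibrate (e.g., class weights) using (X_val, y_val);
# report the final numbers once, on (X_final, y_final).
```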
For your training set, which you believe has a different class distribution than the actual population, there are several things worth trying. Usually the most acceptable approach is to use a classifier that can handle these differences (usually one with fewer parameters, to avoid over-fitting). There is a whole literature on classification and regression for skewed datasets that you can look through. Other than the choice of classifier, and provided that you did not derive the actual population distribution from your test set, the methods below might help too:
1- One of them can be (as you said) bootstrap re-sampling, in case your training set is large enough for that.
2- Another approach is generating more training samples by adding some noise to the current samples of the training set. For example, if you are classifying images of birds, you can randomly make images darker or brighter, or randomly shift them a few pixels to the sides or up and down (selecting values randomly within a small enough range). This way, you can add to the training set in a way that yields the desired distribution.
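A small sketch of that kind of augmentation with NumPy (the image, shift range, and brightness range are illustrative): each call returns a randomly brightened/darkened and slightly shifted copy of an input image.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, max_shift=3, brightness_range=0.2):
    """Randomly brighten/darken and shift an image by a few pixels.

    `image` is assumed to be a float array in [0, 1] with shape (H, W) or (H, W, C).
    """
    # Random brightness change in [1 - brightness_range, 1 + brightness_range]
    factor = 1.0 + rng.uniform(-brightness_range, brightness_range)
    out = np.clip(image * factor, 0.0, 1.0)
    # Random shift of up to `max_shift` pixels up/down and left/right
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(out, shift=(dy, dx), axis=(0, 1))
    return out

# Example: create a few noisy copies of one (toy) training image
image = rng.random((32, 32, 3))
extra_samples = [augment(image) for _ in range(5)]
```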
