Different composition for training and test sets - machine-learning

Training and test sets in machine learning are normally discussed as though they will have the same composition, e.g. take X% of your examples as the training set, and the rest as the test set.
However, suppose you are trying to solve a classification problem - for simplicity, say binary classification, like distinguishing between photographs of horses and zebras. The classes are not equally common. Say 95% of photos are horses and the other 5% are zebras. If you feed that mix into a neural network, or any other machine learning algorithm, it will quickly settle on classifying everything as a horse and thereby achieving 95% accuracy.
There are such things as cost-sensitive neural networks that can penalize a false negative more heavily than a false positive. But the added complexity increases development time and creates more opportunities for bugs to creep in.
A simpler, more general method is resampling, where you train the network on equal quantities of each class. If you have 10,000 pictures (500 zebras and 9,500 horses), take 250 zebra pictures, combine them with 250 horse pictures, and use that as your training set. The other 250 zebras can go with another 4,750 horses to form your test set, which keeps the real-world 5% share of zebras. That way, you can calculate a confusion matrix on the test set that will reflect the performance that can be expected of the trained network in the wild.
This means the training set and test set have deliberately different composition.
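The split described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `resampled_split`, the label strings, and the fixed seed are made up for the example, and the labels stand in for real photographs.

```python
import random

def resampled_split(labels, minority, train_per_class, seed=0):
    """Return (train_idx, test_idx): a balanced training set and a test set
    that keeps the original class ratio."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    rng.shuffle(minority_idx)
    rng.shuffle(majority_idx)

    # Balanced training set: the same number of examples from each class.
    train = minority_idx[:train_per_class] + majority_idx[:train_per_class]

    # Test set: the leftover minority examples plus enough unused majority
    # examples to restore the original majority/minority ratio.
    test_minority = minority_idx[train_per_class:]
    ratio = len(majority_idx) / len(minority_idx)
    n_test_majority = round(len(test_minority) * ratio)
    test_majority = majority_idx[train_per_class:train_per_class + n_test_majority]
    return train, test_minority + test_majority

# 10,000 pictures: 5% zebras, 95% horses.
labels = ["zebra"] * 500 + ["horse"] * 9500
train_idx, test_idx = resampled_split(labels, "zebra", 250)
train_zebras = sum(labels[i] == "zebra" for i in train_idx)
test_zebras = sum(labels[i] == "zebra" for i in test_idx)
print(len(train_idx), train_zebras)  # 500 250
print(len(test_idx), test_zebras)    # 5000 250
```

The training set is 50% zebras while the test set is 5% zebras, so the two sets are disjoint but deliberately differ in composition.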
So my question: is it indeed normal for training set and test set to have different composition, and this just isn't often mentioned? Or am I missing something?

Related

Train Accuracy is very high, Validation accuracy is very high but the test set accuracy is very low

I have split the dataset (around 28K images) into a 75% trainset and a 25% testset. Then I have randomly taken 15% of the trainset and 15% of the testset to create a validation set. The goal is to classify the images into two categories. The exact image sample can't be shared, but it's similar to the one attached. I'm using this model: VGG19 with imagenet weights, with the last two layers trainable and 4 dense layers appended. I am also using ImageDataGenerator to augment the images. I trained the model for 30 epochs and found that the training accuracy is 95% and the validation accuracy is 96%, but when evaluated on the test dataset it fell sharply to only 75%.
I have tried regularization and dropout to tackle overfitting, in case that is the problem. I also tried using the testset as the validation set and then testing the model on that same testset. The results were: trainset accuracy = 96%, validation accuracy = 96.3%, and test accuracy = 68%. I don't understand what I should do.
First off, you need to make sure that when you split the data, the relative size of every class is the same in each of the new datasets. The data can be imbalanced if that is the distribution of your initial data, but it must have the same imbalance in all datasets after the split.
Now, regarding the split: if you need train, validation, and test sets, they must all be independent of each other (no shared samples). This is important if you don't want to cheat yourself with the results that you are getting.
In general, in machine-learning we start from a training set and a test set. For choosing the best model architecture/hyper-parameters, we further divide the training set to get the validation set (the test set should not be touched).
After determining the best architecture/hyper-parameters for our model, we combine the training and validation set and train the best-case model from scratch with the combined full training set. Only now we get to test the results on the test set.
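A minimal sketch of such a split, assuming a 60/20/20 ratio and toy integer labels (both chosen only for illustration; `stratified_split` is a hypothetical helper, not a library function). It splits class by class, so every part keeps the same imbalance and no sample is shared.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions, seed=0):
    """Split sample indices into len(fractions) disjoint parts, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    parts = [[] for _ in fractions]
    for idx in by_class.values():
        rng.shuffle(idx)
        start = 0
        for k, frac in enumerate(fractions):
            # The last part takes whatever is left so no sample is dropped.
            if k == len(fractions) - 1:
                end = len(idx)
            else:
                end = start + round(frac * len(idx))
            parts[k].extend(idx[start:end])
            start = end
    return parts

# Toy data: 90% class 0, 10% class 1.
labels = [0] * 900 + [1] * 100
train, val, test = stratified_split(labels, [0.6, 0.2, 0.2])

for name, part in [("train", train), ("val", val), ("test", test)]:
    minority = sum(labels[i] for i in part)
    print(name, len(part), minority / len(part))
```

Following the workflow above, hyper-parameters would be chosen using `train` and `val`, the final model retrained on `train + val`, and `test` used only once at the end.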
I had faced a similar issue in one of my practice projects.
My InceptionV3 model gave a high training accuracy (99%), a high validation accuracy (95%+) but a very low testing accuracy (55%).
The dataset was a subset of the popular Dogs vs. Cats dataset (https://www.kaggle.com/c/dogs-vs-cats/data), made by me, having 15k images split into 3 folders: train, valid, and test in the ratio of 60:20:20 (9000, 3000, 3000 each halved for cats folder and dogs folder).
The error in my case was actually in my code. It had nothing to do with the model or the data. The model had been defined inside a function, and calling that function created an untrained instance during evaluation. Hence, an untrained model was being evaluated on the test dataset. After correcting the errors in my notebook I got a 96%+ testing accuracy.
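This kind of bug is easy to reproduce in miniature. The sketch below is not the notebook's code: `TinyModel` and `build_model` are hypothetical stand-ins, and the "model" simply memorises the majority label, which is enough to show the difference between evaluating a fresh instance and evaluating the trained one.

```python
class TinyModel:
    """Hypothetical stand-in for a real network: it predicts the majority label."""

    def __init__(self):
        self.prediction = None  # None means "untrained"

    def fit(self, labels):
        self.prediction = max(set(labels), key=labels.count)
        return self

    def accuracy(self, labels):
        return sum(y == self.prediction for y in labels) / len(labels)

def build_model():
    # Every call returns a brand-new, untrained instance.
    return TinyModel()

train_labels = ["cat", "cat", "dog", "cat"]
test_labels = ["cat", "cat", "dog", "cat"]

# Buggy: the trained instance is discarded and a fresh one is evaluated.
build_model().fit(train_labels)
buggy_accuracy = build_model().accuracy(test_labels)

# Fixed: keep a reference to the trained instance and evaluate that one.
model = build_model().fit(train_labels)
fixed_accuracy = model.accuracy(test_labels)

print(buggy_accuracy, fixed_accuracy)  # 0.0 0.75
```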
Links:
https://colab.research.google.com/drive/1-PO1KJYvXdNC8LbvrdL70oG6QbHg_N-e?usp=sharing&fbclid=IwAR2k9ZCXvX_y_UNWpl4ljs1y0P3budKmlOggVrw6xI7ht0cgm03_VeoKVTI
https://drive.google.com/drive/u/3/folders/1h6jVHasLpbGLtu6Vsnpe1tyGCtR7bw_G?fbclid=IwAR3Xtsbm_EZA3TOebm5EfSvJjUmndHrWXm4Iet2fT3BjE6pPJmnqIwW8KWY
Other probable causes:
The test set may have a different distribution than the validation set. (This can be ruled out by joining all the data, shuffling, and splitting again into train, validation, and test sets.)
Swap the validation and test sets with each other and see whether it has an effect. (Sometimes one set simply has relatively harder examples.)
The training may somehow have overfitted to the validation set. (Is it possible that, at one or more steps during training, the model giving the best score on the validation set is the one being chosen?)
Images overlapping between the sets, or a lack of shuffling.
In the deep learning world, if something seems way too odd to be true, or even way too good to be true, a good guess is that it's probably a bug unless proven otherwise!
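The "images overlapping" cause listed above can be checked mechanically. Here is a rough sketch that assumes each split is available as a mapping from file name to raw bytes (the dictionaries below are made-up stand-ins for real image files). Hashing the bytes only catches exact duplicates, not resized or augmented copies.

```python
import hashlib

def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def find_overlap(split_a, split_b):
    """Return the names in split_b whose bytes also occur somewhere in split_a."""
    seen = {content_hash(data) for data in split_a.values()}
    return sorted(name for name, data in split_b.items()
                  if content_hash(data) in seen)

# Made-up stand-ins for image files: name -> file contents.
train = {"a.jpg": b"\x01\x02", "b.jpg": b"\x03\x04"}
test = {"c.jpg": b"\x03\x04", "d.jpg": b"\x05\x06"}

leaked = find_overlap(train, test)
print(leaked)  # ['c.jpg']
```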

why too many epochs will cause overfitting?

I am reading the Deep Learning with Python book.
After reading chapter 4, Fighting Overfitting, I have two questions.
Why might increasing the number of epochs cause overfitting?
I know that increasing the number of epochs involves more steps of gradient descent; will this cause overfitting?
During the process of fighting overfitting, will the accuracy be reduced?
I'm not sure which book you are reading, so some background information may help before I answer the questions specifically.
Firstly, increasing the number of epochs won't necessarily cause overfitting, but it certainly can do. If the learning rate and model parameters are small, it may take many epochs to cause measurable overfitting. That said, it is common for more training to do so.
To keep the question in perspective, it's important to remember that we most commonly use neural networks to build models we can use for prediction (e.g. predicting whether an image contains a particular object or what the value of a variable will be in the next time step).
We build the model by iteratively adjusting weights and biases so that the network can act as a function to translate between input data and predicted outputs. We turn to such models for a number of reasons, often because we just don't know what the function is/should be or the function is too complex to develop analytically. In order for the network to be able to model such complex functions, it must be capable of being highly-complex itself. Whilst this complexity is powerful, it is dangerous! The model can become so complex that it can effectively remember the training data very precisely but then fail to act as an effective, general function that works for data outside of the training set. I.e. it can overfit.
You can think of it as being a bit like someone (the model) who learns to bake by only baking fruit cake (training data) over and over again – soon they'll be able to bake an excellent fruit cake without using a recipe (training), but they probably won't be able to bake a sponge cake (unseen data) very well.
Back to neural networks! Because the risk of overfitting is high with a neural network there are many tools and tricks available to the deep learning engineer to prevent overfitting, such as the use of dropout. These tools and tricks are collectively known as 'regularisation'.
This is why we use development and training strategies involving test datasets – we pretend that the test data is unseen and monitor it during training. You can see an example of this in the plot below (image credit). After about 50 epochs the test error begins to increase as the model has started to 'memorise the training set', despite the training error remaining at its minimum value (often training error will continue to improve).
So, to answer your questions:
Allowing the model to continue training (i.e. more epochs) increases the risk of the weights and biases being tuned to such an extent that the model performs poorly on unseen (or test/validation) data. The model is now just 'memorising the training set'.
Continued epochs may well increase training accuracy, but this doesn't necessarily mean the model's predictions from new data will be accurate – often it actually gets worse. To prevent this, we use a test data set and monitor the test accuracy during training. This allows us to make a more informed decision on whether the model is becoming more accurate for unseen data.
We can use a technique called early stopping, whereby we stop training the model once test accuracy has stopped improving after a small number of epochs. Early stopping can be thought of as another regularisation technique.
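As a rough sketch of the early-stopping rule, assuming the held-out accuracy for each epoch is already recorded in a list (the numbers below are invented, and `patience` is a name borrowed from common usage rather than from the answer):

```python
def early_stopping_epoch(accuracies, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch held-out accuracies.

    Training stops once `patience` epochs have passed without a new best value.
    """
    best_acc = float("-inf")
    best_epoch = -1
    for epoch, acc in enumerate(accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Never ran out of patience: training ran to the last epoch.
    return len(accuracies) - 1, best_epoch

# Invented held-out accuracies: they peak at epoch 3, then decline.
held_out_acc = [0.60, 0.71, 0.78, 0.81, 0.80, 0.79, 0.77, 0.74]
stopped_at, best_epoch = early_stopping_epoch(held_out_acc, patience=3)
print(stopped_at, best_epoch)  # 6 3
```

In practice one would usually also restore the weights saved at `best_epoch` instead of keeping the weights from the epoch where training stopped.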
More steps of gradient descent (a large number of epochs) can, ideally, take you very close to the global minimum of the loss function. Since we don't know anything about the test data, fitting the model so precisely to the class labels of the training data may cause it to lose its generalization capability (that is, its error on unseen data goes up). We certainly want to learn the input-output relationship from the training data, but we must not forget that the end goal is for the model to perform well on unseen data. So it is a good idea to stay close to the global minimum, but not too close.
Still, we can ask: what if I do reach the global minimum? What is the problem with that, and why would it cause the model to perform badly on unseen data?
One answer is that to reach the global minimum we would be trying to fit as much of the training data as possible, and this results in a very complex model (since the limited training data that happens to be available to us is unlikely to have a simple spatial distribution). What we can assume, though, is that a large amount of unseen data (say, for facial recognition) will have a simpler spatial distribution and will need a simpler model for better classification. (I mean that the entire world of unseen data surely has a pattern that we cannot observe, simply because we only have access to a small fraction of it in the form of training data.)
If you incrementally observe points from a distribution (say 50, 100, 500, 1000, ...), the structure of the data will look complex until you have observed a sufficiently large number of points (at most, the entire distribution). Once you have observed enough points, you can expect to see the simpler pattern present in the data, which can be classified easily.
In short, a small sample such as the training data tends to look more complex in structure than the entire dataset, and overfitting to the training data may cause our model to perform worse on the test data.
An analogous example from day-to-day life is as follows:
Say we have met N people so far in our lifetime. While meeting them we naturally learn from them (we become what we are surrounded by). Now, if we are heavily influenced by each individual and try to tune ourselves very closely to the behaviour of all of them, we develop a personality that closely resembles the people we have met, but we also start judging every individual who is unlike us, that is, unlike the people we have already met. Becoming judgemental takes a toll on our ability to tune in with new groups, since we trained so hard to minimize our differences with the people we have already met (the training data). In my view this is an excellent example of overfitting and loss of generalization capability.

Model selection for classification with random train/test sets

I'm working with an extremely unbalanced and heterogeneous multiclass {K = 16} database for research, with a small N ~= 250. For some labels the database has a sufficient amount of examples for supervised machine learning, but for others I have almost none. I'm also not in a position to expand my database for a number of reasons.
As a first approach I divided my database into training (80%) and test (20%) sets in a stratified way. On top of that, I applied several classification algorithms that provide some results. I applied this procedure over 500 stratified train/test sets (as each stratified sampling takes individuals randomly within each stratum), hoping to select an algorithm (model) that performed acceptably.
Because of my database, depending on the specific examples that are part of the train set, the performance on the test set varies greatly. I'm dealing with runs that have as high (for my application) as 82% accuracy and runs that have as low as 40%. The median over all runs is around 67% accuracy.
When facing this situation, I'm unsure what the standard procedure is (if there is any) for selecting the best performing model. My rationale is that the best-scoring model may generalize better because the specific examples selected in the training set are richer, so that the test set is better classified. However, I'm fully aware of the possibility of the test set being composed of "simpler" cases that are easier to classify, or the train set comprising all hard-to-classify cases.
Is there any standard procedure to select the best performing model considering that the distribution of examples in my train/test sets causes the results to vary greatly? Am I making a conceptual mistake somewhere? Do practitioners usually select the best performing model without any further exploration?
I don't like the idea of using the mean/median accuracy, as obviously some models generalize better than others, but I'm by no means an expert in the field.
Confusion matrix of the predicted label on the test set of one of the best cases:
Confusion matrix of the predicted label on the test set of one of the worst cases:
They both use the same algorithm and parameters.
Good Accuracy =/= Good Model
I want to firstly point out that a good accuracy on your test set need not equal a good model in general! This has (in your case) mainly to do with your extremely skewed distribution of samples.
Especially when doing a stratified split, and having one class dominatingly represented, you will likely get good results by simply predicting this one class over and over again.
A good way to see if this is happening is to look at a confusion matrix (better picture here) of your predictions.
If there is one class that seems to confuse other classes as well, that is an indicator for a bad model. I would argue that in your case it would be generally very hard to find a good model unless you do actively try to balance your classes more during training.
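To make the point concrete, here is a small sketch with invented labels in which a degenerate model always predicts the dominant class. The accuracy looks acceptable, but the confusion matrix and the per-class recall expose the problem. The helper functions are written out by hand for illustration.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

classes = ["a", "b", "c"]
y_true = ["a"] * 8 + ["b"] * 1 + ["c"] * 1
y_pred = ["a"] * 10  # a degenerate model that always predicts the majority class

matrix = confusion_matrix(y_true, y_pred, classes)
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(y_true)
recall = per_class_recall(matrix)

print(matrix)    # [[8, 0, 0], [1, 0, 0], [1, 0, 0]]
print(accuracy)  # 0.8
print(recall)    # [1.0, 0.0, 0.0]
```

The first column absorbs every prediction, which is the "one class that confuses the others" pattern described above.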
Use the power of Ensembles
Another idea is indeed to use ensembling over multiple models (in your case resulting from different splits), since it is assumed to generalize better.
Even if you might sacrifice a lot of accuracy on paper, I would bet that the confusion matrix of an ensemble is likely to look much better than that of a single "high accuracy" model. Especially if you disregard the models that perform extremely poorly (make sure, again, that the "poor" performance comes from an actually bad model and not just an unlucky split), I would expect very good generalization.
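One simple way to ensemble classifiers is a majority vote over their predicted labels. The sketch below assumes three already-trained models whose predictions on the same four samples are given as plain lists (the labels are invented). Other combination rules, such as averaging predicted probabilities, are equally possible.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote per sample.

    Ties go to the label that appears first among the votes for that sample.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Invented predictions of three models on the same four samples.
model_a = ["x", "y", "y", "z"]
model_b = ["x", "x", "y", "z"]
model_c = ["y", "x", "y", "x"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)  # ['x', 'x', 'y', 'z']
```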
Try k-fold Cross-Validation
Another common technique is k-fold cross-validation. Instead of performing your evaluation on a single 80/20 split, you essentially divide your data into k equally large sets, and then always train on k-1 sets while evaluating on the remaining set. You then not only get a feeling for whether your split was reasonable (you usually get all the results for the different splits in k-fold CV implementations, like the one from sklearn), but you also get an overall score that tells you the average over all folds.
Note that 5-fold CV would equal a split into 5 20% sets, so essentially what you are doing now, plus the "shuffling part".
CV is also a good way to deal with little training data, in settings where you have imbalanced classes, or where you generally want to make sure your model actually performs well.
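A hand-written sketch of stratified k-fold cross-validation follows. In practice one would more likely use an existing implementation such as the one in sklearn. The `majority_baseline_accuracy` function is a hypothetical placeholder for "train a model on the training folds and score it on the held-out fold", and the labels are invented.

```python
import random
from collections import defaultdict
from statistics import mean, stdev

def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each class is dealt round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)

    for f in range(k):
        test_idx = folds[f]
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield train_idx, test_idx

def majority_baseline_accuracy(labels, train_idx, test_idx):
    """Placeholder for real training: always predict the most common training label."""
    train_labels = [labels[i] for i in train_idx]
    guess = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == guess for i in test_idx) / len(test_idx)

labels = ["common"] * 40 + ["rare"] * 10
scores = [majority_baseline_accuracy(labels, tr, te)
          for tr, te in stratified_k_fold(labels, k=5)]

# Report the spread across folds, not just a single number.
print(scores, mean(scores), stdev(scores))
```

Reporting the mean together with the spread across folds gives a picture of how much the result depends on the particular split.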

Neural Network gets stuck

I am experimenting with classification using neural networks (I am using tensorflow).
And unfortunately the training of my neural network gets stuck at 42% accuracy.
I have 4 classes, into which I try to classify the data.
And unfortunately, my data set is not well balanced, meaning that:
43% of the data belongs to class 1 (and yes, my network gets stuck predicting only this)
37% to class 2
13% to class 3
7% to class 4
The optimizer I am using is AdamOptimizer and the cost function is tf.nn.softmax_cross_entropy_with_logits.
I was wondering if the reason for my training getting stuck at 42% is really the fact that my data set is not well balanced, or because the nature of the data is really random, and there are really no patterns to be found.
Currently my NN consists of:
input layer
2 convolution layers
7 fully connected layers
output layer
I tried changing this structure of the network, but the result is always the same.
I also tried Support Vector Classification, and the result is pretty much the same, with small variations.
Did somebody else encounter similar problems?
Could anybody please provide me some hints how to get out of this issue?
Thanks,
Gerald
I will assume that you have already double, triple and quadruple checked that the data going in is matching what you expect.
The question is quite open-ended, and even a topic for research. But there are some things that can help.
In terms of better training, there are two common ways in which people train neural networks with an unbalanced dataset.
Oversample the examples with lower frequency, such that the proportion of examples for each class that the network sees is equal. e.g. in every batch, enforce that 1/4 of the examples are from class 1, 1/4 from class 2, etc.
Weight the error for misclassifying each class by the inverse of its proportion, e.g. incorrectly classifying an example of class 1 is worth 100/43, while incorrectly classifying an example of class 4 is worth 100/7.
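Both options can be sketched in plain Python using the class percentages from the question. This is framework-agnostic illustration code, not TensorFlow: the function names, the batch size, and the index ranges standing in for a real dataset are all assumptions.

```python
import math
import random

class_percent = {1: 43, 2: 37, 3: 13, 4: 7}

# Weighting option: each class is weighted by the inverse of its proportion,
# so class 1 gets 100/43 and class 4 gets 100/7.
class_weight = {c: 100 / pct for c, pct in class_percent.items()}

def weighted_cross_entropy(probs, true_class):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    return -class_weight[true_class] * math.log(probs[true_class])

def balanced_batch(indices_by_class, per_class, rng):
    """Oversampling option: draw the same number of examples from every class,
    with replacement, so rare classes are repeated more often."""
    batch = []
    for idx in indices_by_class.values():
        batch.extend(rng.choices(idx, k=per_class))
    rng.shuffle(batch)
    return batch

# The same (uninformative) prediction costs more on a rare class.
probs = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
loss_common = weighted_cross_entropy(probs, 1)
loss_rare = weighted_cross_entropy(probs, 4)

# Index ranges standing in for a dataset of 100 examples with the same skew.
indices_by_class = {
    1: list(range(0, 43)),
    2: list(range(43, 80)),
    3: list(range(80, 93)),
    4: list(range(93, 100)),
}
batch = balanced_batch(indices_by_class, per_class=8, rng=random.Random(0))
print(loss_rare / loss_common, len(batch))
```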
That being said, if your learning rate is good, neural networks will often eventually (after many hours of just sitting there) jump out of only predicting for one class, but they still rarely end well with a badly skewed dataset.
If you want to know whether or not there are patterns in your data which can be determined, there is a simple way to do that.
Create a new dataset by randomly selecting elements from all of your classes such that you have an equal number from each (i.e. if there are 700 examples of class 4, then construct a dataset by randomly selecting 700 examples from every class).
Then you can use all of your techniques on this new dataset.
Although, this paper suggests that even with random labels, it should be able to find some pattern that it understands.
Firstly, you should check whether your model is overfitting or underfitting, either of which could cause low accuracy. Check the accuracy on both the training set and the dev set: if accuracy on the training set is much higher than on the dev/test set, the model may be overfitting; if accuracy on the training set is as low as it is on the dev/test set, it could be underfitting.
As for overfitting, more data or a simpler model structure may help, while a more complex structure and longer training may solve an underfitting problem.

Why should my training set also be skewed in terms of number of class distribution just because my test set is skewed

My question is why should my training set also be skewed (number of instances of positive class much fewer compared to negative class) when my test set is also skewed. I read that it is important to maintain the distribution between the classes the same in both training and test set to get the most realistic performance. For example, if my test set has 90%-10% distribution of class instances, should my training set also have the same proportions?
I am finding it difficult to understand why is it important to maintain the proportions of class instances in the training set as present in the test set.
The reason why I find it difficult to understand is don't we want a classifier to just learn the patterns in both the classes? So, should it matter to maintain skewness in the training set just because the test set is skewed?
Any thoughts will be helpful
IIUC, you're asking about the rationale for using stratified sampling (e.g., as used in scikit-learn's StratifiedKFold).
Once you've divided your data into train and test sets, you have three datasets to consider:
1. the "real world" set, on which your classifier will really run
2. the train set, on which you'll learn patterns
3. the test set, which you'll use to evaluate the performance of the classifier
(So the uses of 2. + 3. are really just for estimating how things will run on 1, including possibly tuning parameters.)
Suppose some class in your data is represented far from uniformly - say it appears only 5% as often as it would if the classes were generated uniformly. Moreover, you believe that this is not a GIGO case - in the real world, the probability of this class would be about 5%.
When you divide into 2. + 3., you run the chance that things will be skewed relative to 1.:
It's very possible that the class won't appear 5% of the times (in the train or test set), but rather more or less.
It's very possible that some of the feature instances of the class will be skewed in the train or test set, relative to 1.
In these cases, when you make decisions based on the 2. + 3. combination, it's probable that it won't indicate well the effect on 1., which is what you're really after.
Incidentally, I don't think the emphasis is on skewing the train to fit the test, but rather on making the train and test each fit the entire sampled data.
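The drift that stratification guards against is easy to simulate. The sketch below assumes a dataset of 1,000 examples with a 5% minority class and a 20% test set (all numbers chosen for illustration) and counts how many minority examples land in the test set over repeated purely random splits.

```python
import random

def random_split_minority_counts(n_total, n_minority, n_test, n_trials, seed=0):
    """Count minority examples landing in the test set over repeated random splits."""
    rng = random.Random(seed)
    labels = [1] * n_minority + [0] * (n_total - n_minority)
    counts = []
    for _ in range(n_trials):
        test = rng.sample(labels, n_test)  # a purely random, unstratified draw
        counts.append(sum(test))
    return counts

counts = random_split_minority_counts(
    n_total=1000, n_minority=50, n_test=200, n_trials=200
)
expected = 50 * 200 // 1000  # a stratified split would give exactly this many

print(min(counts), expected, max(counts))
```

A stratified split would put exactly `expected` minority examples in every test set, whereas the random splits scatter around that value, and the scatter grows in relative terms as the dataset or the minority class gets smaller.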

Resources