Determining the splitting ratio when augmenting image data - machine-learning

I have an image dataset that is quite imbalanced, with one class having 2873 images and another having only 115. The rest of the classes have ~250 images each. For reducing the imbalance, I decided to split the dataset into Train-Valid-Test components, with the major class having less proportion of images in the training set compared to the minor classes.
Then I'll be augmenting the data in the training set. I intend to perform an 80-10-10 split on the dataset.
Which outcome shall be considered as an 80-10-10 split?
Splitting the dataset in the proportion 80-10-10, and THEN augmenting the training images (which would eventually result in >80% proportion for the training set after augmentation).
Splitting the dataset in a proportion such that it eventually results in an 80-10-10 split AFTER augmentation.
Also, is it acceptable to have an 85-7.5-7.5 split, provided it reduces imbalance in the dataset?

First, the splitting of the dataset should be stratified which means that each class should exist in the training, test and validation sets with the defined proportions.
The data augmentation should only be applied on the training dataset, no need to apply it on the test data because its main goal is to improve the performance and outcomes of the models by making different examples in the train dataset. So, there is no need to apply data augmentation on the test and validation sets.
For the last question, it is possible whatever proportions you want for test and validation as long as they give you enough data to evaluate the model

Related

How do neural networks learn functions instead of memorize them?

For a class project, I designed a neural network to approximate sin(x), but ended up with a NN that just memorized my function over the data points I gave it. My NN took in x-values with a batch size of 200. Each x-value was multiplied by 200 different weights, mapping to 200 different neurons in my first layer. My first hidden layer contained 200 neurons, each one a linear combination of the x-values in the batch. My second hidden layer also contained 200 neurons, and my loss function was computed between the 200 neurons in my second layer and the 200 values of sin(x) that the input mapped to.
The problem is, my NN perfectly "approximated" sin(x) with 0 loss, but I know it wouldn't generalize to other data points.
What did I do wrong in designing this neural network, and how can I avoid memorization and instead design my NN's to "learn" about the patterns in my data?
It is same with any machine learning algorithm. You have a dataset based on which you try to learn "the" function f(x), which actually generated the data. In real life datasets, it is impossible to get the original function from the data, and therefore we approximate it using something g(x).
The main goal of any machine learning algorithm is to predict unseen data as best as possible using the function g(x).
Given a dataset D you can always train a model, which will perfectly classify all the datapoints (you can use a hashmap to get 0 error on the train set), but which is overfitting or memorization.
To avoid such things, you yourself have to make sure that the model does not memorise and learns the function. There are a few things which can be done. I am trying to write them down in an informal way (with links).
Train, Validation, Test
If you have large enough dataset, use Train, Validation, Test splits. Split the dataset in three parts. Typically 60%, 20% and 20% for Training, Validation and Test, respectively. (These numbers can vary based on need, also in case of imbalanced data, check how to get stratified partitions which preserve the class ratios in every split). Next, forget about the Test partition, keep it somewhere safe, don't touch it. Your model, will be trained using the Training partition. Once you have trained the model, evaluate the performance of the model using the Validation set. Then select another set of hyper-parameter configuration for your model (eg. number of hidden layer, learaning algorithm, other parameters etc.) and then train the model again, and evaluate based on Validation set. Keep on doing this for several such models. Then select the model, which got you the best validation score.
The role of validation set here is to check what the model has learned. If the model has overfit, then the validation scores will be very bad, and therefore in the above process you will discard those overfit models. But keep in mind, although you did not use the Validation set to train the model, directly, but the Validation set was used indirectly to select the model.
Once you have selected a final model based on Validation set. Now take out your Test set, as if you just got new dataset from real life, which no one has ever seen. The prediction of the model on this Test set will be an indication how well your model has "learned" as it is now trying to predict datapoints which it has never seen (directly or indirectly).
It is key to not go back and tune your model based on the Test score. This is because once you do this, the Test set will start contributing to your mode.
Crossvalidation and bootstrap sampling
On the other hand, if your dataset is small. You can use bootstrap sampling, or k-fold cross-validation. These ideas are similar. For example, for k-fold cross-validation, if k=5, then you split the dataset in 5 parts (also be carefull about stratified sampling). Let's name the parts a,b,c,d,e. Use the partitions [a,b,c,d] to train and get the prediction scores on [e] only. Next, use the partitions [a,b,c,e] and use the prediction scores on [d] only, and continue 5 times, where each time, you keep one partition alone and train the model with the other 4. After this, take an average of these scores. This is indicative of that your model might perform if it sees new data. It is also a good practice to do this multiple times and perform an average. For example, for smaller datasets, perform a 10 time 10-folds cross-validation, which will give a pretty stable score (depending on the dataset) which will be indicative of the prediction performance.
Bootstrap sampling is similar, but you need to sample the same number of datapoints (depends) with replacement from the dataset and use this sample to train. This set will have some datapoints repeated (as it was a sample with replacement). Then use the missing datapoins from the training dataset to evaluate the model. Perform this multiple times and average the performance.
Others
Other ways are to incorporate regularisation techniques in the classifier cost function itself. For example in Support Vector Machines, the cost function enforces conditions such that the decision boundary maintains a "margin" or a gap between two class regions. In neural networks one can also do similar things (although it is not same as in SVM).
In neural network you can use early stopping to stop the training. What this does, is train on the Train dataset, but at each epoch, it evaluates the performance on the Validation dataset. If the model starts to overfit from a specific epoch, then the error for Training dataset will keep on decreasing, but the error of the Validation dataset will start increasing, indicating that your model is overfitting. Based on this one can stop training.
A large dataset from real world tends not to overfit too much (citation needed). Also, if you have too many parameters in your model (to many hidden units and layers), and if the model is unnecessarily complex, it will tend to overfit. A model with lesser pameter will never overfit (though can underfit, if parameters are too low).
In the case of you sin function task, the neural net has to overfit, as it is ... the sin function. These tests can really help debug and experiment with your code.
Another important note, if you try to do a Train, Validation, Test, or k-fold crossvalidation on the data generated by the sin function dataset, then splitting it in the "usual" way will not work as in this case we are dealing with a time-series, and for those cases, one can use techniques mentioned here
First of all, I think it's a great project to approximate sin(x). It would be great if you could share the snippet or some additional details so that we could pin point the exact problem.
However, I think that the problem is that you are overfitting the data hence you are not able to generalize well to other data points.
Few tricks that might work,
Get more training points
Go for regularization
Add a test set so that you know whether you are overfitting or not.
Keep in mind that 0 loss or 100% accuracy is mostly not good on training set.

Combine training data and validation data, how to select hyper-parameters?

Suppose I split my data into training set and validation set. I perform a 5-fold cross-validation on my training set to obtain the optimal hyper-parameters for my model, then I use the optimal hyper-parameters to train my model and apply the resulting model on my validation set. My question is, is it reasonable to combine the training and validation set, and use the hyper-parameters obtained from the training set to build a final model?
It is resonable if training data was relatively small and adding validation set makes your model significantly stronger. However, at the same time, adding new data makes your previously selected hyperparameters possibly suboptimal (it is really hard to show what kind of transformation of hyperparameters you should apply when you add new data to your training set). Thus you balance two things - gain in model quality from more data and possible loss due to hard to predict change in hyperparameters meaning. To some extent you can simulate this process to make sure it makes sense, if you have N points in training data and M in validation, you can try to split training further to chunks with the same proportion (thus one is now N * (N/(N+M) and other N * (M/(N+M))), train on first one and check whether optimal hyperparameters transfer (more or less) to the optimal one on the whole training set - if so, you can safely add validation as they should transfer as well. If they do not - the risk might be not worth the gain.

Applying PCA before sending data to SVM

Before applying SVM on my data I want to reduce its dimension by PCA. Should I separate the Train data and Test data then apply PCA on each of them separately or apply PCA on both sets combined then separate them?
Actually both provided answers are only partially right. The crucial part here is what is the exact problem you are trying to solve. There are two basic possible settings which can be considered, and both are valid under some assumptions.
Case 1
You have some data (which you splitted to train and test) and in the future you will get more data coming from the same distribution.
If this is the case, you should fit PCA on train data, then SVM on its projection, and for testing you just apply already fitted PCA followed by already fitted SVM, and you do exactly the same for new data that will come. This way your test error (under some "size assumptions" should approximate your expected error).
Case 2
You have some data (which you splitted train and test) and in the future you will obtain a big chunk of unlabeled data and you will be able to fit your model then.
In such a case, you fit PCA on whole data provided, learn SVM on labeled part (train set) and evaluate on test set. This way, once new data arrives you can fit PCA using both your data and new ones, and then - train SVM on your old data (as this is the only one having labels). Under the assumption that again - data comes from the same distributions, everything is correct here. You use more data to fit PCA only to have a better estimator (maybe your data is really high dimensional and PCA fails with small sample?).
You should do them separately. If you run pca on both sets combined then you are going to introduce a bias in your svn. The goal of the test set is to see how your algorithm will perform without prior knowledge of the data.
Learn the Projection Matrix of PCA on the train set and use this to reduce the dimensions of the test data.
One benifit is this way you don't have to rely on collecting sufficient data in the test set if you are applying your classifier for actual run time where test data comes one sample at a time.
Also I think separate train and test PCA will fail.Why?
Think of PCA as giving you features, and then you learn a classifier over these features. If over time your data shifts, then the test features you get using PCA would be different, and you don't have a classifier trained on these features. Even if the set of directions/features of the PCA remain same but their order varies your classifier still fails.

Picking a training set from the larger application set

I'm trying to perform sentiment analysis on a dataset.But there is no existing corpus that my classifier can be trained on that is similar to the dataset that I want to analyze. My question is as follows: Can I use a randomly sampled subset of this data for training/validation phases and then use the trained classifier for performing analysis on the larger dataset? I plan to introduce some variability by adding data points to the training set that are similar to the application dataset but not from that set. Is this is a valid approach?
What you are looking for is the standard procedure of cross-validation. During cross-validation you split your data on (let's assume) 80%-20% training testing data and make 5-10 (depending on the size of data you have) different splits. So I would suggest that you keep a subset of the data and then perform cross-validation on this subset. This is the optimal way to train your model.

Overfitting and Data splitting

Let's say that I have a data file like:
Index,product_buying_date,col1,col2
0,2013-01-16,34,Jack
1,2013-01-12,43,Molly
2,2013-01-21,21,Adam
3,2014-01-09,54,Peirce
4,2014-01-17,38,Goldberg
5,2015-01-05,72,Chandler
..
..
2000000,2015-01-27,32,Mike
with some more data and I have a target variable y. Assume something as per your convenience.
Now I am aware that we divide the data into 2 parts i.e. Train and Test. And then we divide Train into 70:30, build the model with 70% and validate it with 30%. We tune the parameters so that model does not get overfit. And then predict with the Test data. For example: I divide 2000000 into two equal parts. 1000000 is train and I divide it in validate i.e. 30% of 1000000 which is 300000 and 70% is where I build the model i.e. 700000.
QUESTION: Is the above logic depending upon how the original data splits?
Generally we shuffle the data and then break it into train, validate and test. (train + validate = Train). (Please don't confuse here)
But what if the split is alternate. Like When I divide it in Train and Test first, I give even rows to Test and odd rows to Train. (Here data is initially sort on the basis of 'product_buying_date' column so when i split it in odd and even rows it gets uniformly split.
And when I build the model with Train I overfit it so that I get maximum AUC with Test data.
QUESTION: Isn't overfitting helping in this case?
QUESTION: Is the above logic depending upon how the original data
splits?
If dataset is large(hundred of thousand), you can randomly split the data and you should not have any problem but if dataset is small then you can adopt the different approaches like cross-validation to generate the data set. Cross-validation states that you split you make n number of training-validation set out of your Training set.
suppose you have 2000 data points, you split like
1000 - Training dataset
1000 - testing dataset.
5-cross validation would mean that you would make five 800/200 training/validation dataset.
QUESTION: Isn't overfitting helping in this case?
Number one rule of the machine learning is that, you don't touch the test data set. It's a holly data set that should not be touched.
If you overfit the test data to get maximum AUC score then there won't be any meaning of validation dataset. Foremost aim of any ml algorithm is to reduce the generalization error i.e. algorithm should be able to perform good on unseen data. If you would tune your algorithm with testing data. you won't be able to meet this criteria. In cross-validation also you do not touch your testing set. you select your algorithm. tune its parameter with validation dataset and after you have done with that apply your algorithm to test dataset which is your final score.

Resources