Relationship Between Training Set Size and Training Epochs - image-processing

I'm currently training a convolutional network on the CIFAR-10 dataset. Let's say I have 50,000 training images and 5,000 validation images.
To start testing my model, let's say I start with just 10,000 images to get an idea of how successful the model will be.
I train for 40 epochs with a batch size of 128, i.e., every epoch I run my optimizer to minimize the loss 10,000 / 128 ≈ 78 times, each time on a batch of 128 images (SGD).
Now, let's say I found a model that achieves 70% accuracy on the validation set. Satisfied, I move on to train on the full training set.
This time, for every epoch I run the optimizer to minimize the loss 50,000 / 128 ≈ 391 times.
This makes me think that my accuracy at each epoch should be higher than with the limited set of 10,000. Much to my dismay, the accuracy on the limited training set increases much more quickly; at the end of 40 epochs with the full training set, my accuracy is only 30%.
Thinking the data may be corrupt, I perform limited runs on training images 10-20k, 20-30k, 30-40k, and 40-50k. Surprisingly, each of these runs resulted in an accuracy ~70%, close to the accuracy for images 0-10k.
Thus arise two questions:
Why would validation accuracy go down when the data set is larger, given that I've confirmed that each segment of data provides decent results on its own?
For a larger training set, would I need to train for more epochs even though each epoch represents a larger number of training steps (391 vs. 78)?
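To make the arithmetic concrete, here is a quick sanity check of the steps-per-epoch numbers quoted above (a small, self-contained snippet; the batch size and training-set sizes are the ones from the question):

    # Quick check of the steps-per-epoch arithmetic from the question.
    batch_size = 128
    for train_size in (10_000, 50_000):
        steps_per_epoch = train_size // batch_size
        print(train_size, "images ->", steps_per_epoch, "optimizer steps per epoch")
    # 10,000 images -> 78 steps; 50,000 images -> 390 full batches (~391 with a final partial batch)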

It turns out my intuition was right, but my code was not.
Basically, I had been computing my validation accuracy on training data (data used to train the model), not on validation data (data the model had not yet seen).
After correcting this error, the validation accuracy did indeed improve with the bigger training set, as expected.
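For readers hitting the same issue, here is a minimal sketch of what a correct evaluation looks like, assuming a PyTorch-style setup; model, train_loader, and val_loader are placeholders for your own objects:

    # Minimal sketch: measure accuracy on data the model has NOT trained on.
    # Assumes a PyTorch-style setup; `model`, `train_loader`, `val_loader` are placeholders.
    import torch

    def accuracy(model, loader, device="cpu"):
        model.eval()                          # disable dropout / batch-norm updates
        correct, total = 0, 0
        with torch.no_grad():                 # no gradients needed for evaluation
            for images, labels in loader:
                images, labels = images.to(device), labels.to(device)
                preds = model(images).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        return correct / total

    # The bug described above amounts to reporting accuracy(model, train_loader)
    # as "validation accuracy"; the fix is to evaluate on the held-out data:
    # val_acc = accuracy(model, val_loader)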

Related

Train Accuracy increases, Train loss is stable, Validation loss Increases, Validation Accuracy is low and increases

My neural network training in PyTorch is getting very weird.
I am training on a known dataset that came already split into train and validation sets.
I'm shuffling the data during training and doing data augmentation on the fly.
I have these results:
Train accuracy starts at 80% and increases
Train loss decreases and stays stable
Validation accuracy starts at 30% but increases slowly
Validation loss increases
I have graphs showing this behavior.
How can you explain that the validation loss increases while the validation accuracy also increases?
How can there be such a big difference in accuracy between the training and validation sets? 90% vs. 40%?
Update:
I balanced the data set.
It is binary classification: there are now 1,700 examples from class 1 and 1,200 examples from class 2, with 2,300 for training and 600 for validation in total.
I still see similar behavior:
Can it be because I froze the weights in part of the network?
Can it be because of hyperparameters like the learning rate?
I found the solution:
I had different data augmentation for the training set and the validation set. Matching them also increased the validation accuracy!
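For anyone else with the same symptom, a minimal torchvision-style sketch of keeping the two pipelines consistent is shown below; the specific transforms and normalization constants are illustrative assumptions, not the original poster's code:

    from torchvision import transforms

    # Random augmentation is usually applied only to the training set...
    train_tf = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomCrop(32, padding=4),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])

    # ...while the validation set gets the same deterministic preprocessing
    # (tensor conversion and normalization) so the two pipelines stay comparable.
    val_tf = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])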
If the training set is very large in comparison to the validation set, you are more likely to overfit and learn the training data, which would make generalizing the model very difficult. I see your training accuracy is at 0.98 and your validation accuracy increases at a very slow rate, which would imply that you have overfit your training data.
Try reducing the number of samples in your training set to improve how well your model generalizes to unseen data.
Let me answer your 2nd question first. High accuracy on training data and low accuracy on validation/test data indicates the model may not generalize well to real cases. That is what the validation process is all about. You need to fine-tune or even rebuild your model.
With regard to the first question, validation loss does not necessarily track validation accuracy. Accuracy only checks whether the arg-max prediction matches the target, while a loss such as cross-entropy measures the difference between the predicted probability distribution and the target, so the loss can rise (the model becomes more confidently wrong on some examples) even while accuracy improves.
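A small toy calculation (illustrative numbers only, not the asker's data) shows how cross-entropy loss can rise while accuracy improves, because the loss punishes confident mistakes much harder than marginal ones:

    import torch
    import torch.nn.functional as F

    targets = torch.tensor([0, 1, 1])

    # Epoch A: only 1/3 correct, but the wrong answers are mildly wrong.
    logits_a = torch.tensor([[0.4, 0.6],    # wrong (target 0), low confidence
                             [0.6, 0.4],    # wrong (target 1), low confidence
                             [0.1, 0.9]])   # correct
    # Epoch B: 2/3 correct, but one prediction is confidently wrong.
    logits_b = torch.tensor([[5.0, -5.0],   # correct, very confident
                             [6.0, -6.0],   # wrong (target 1), very confident
                             [-2.0, 2.0]])  # correct

    for name, logits in [("A", logits_a), ("B", logits_b)]:
        acc = (logits.argmax(dim=1) == targets).float().mean().item()
        loss = F.cross_entropy(logits, targets).item()
        print(f"epoch {name}: accuracy={acc:.2f}, loss={loss:.2f}")
    # accuracy goes up (0.33 -> 0.67) while the loss also goes up (~0.7 -> ~4.0)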

The dilemma of overfitting in NN training

My question is a continuation of one asked by another user: what is the difference between train, validation, and test sets in neural networks?
Training is stopped when the minimum MSE is reached, judged by the validation and training set performance (easy to do with the nntool box in MATLAB). If the trained network then performs somewhat worse on the unseen test set than on the training set, we have an overfitting problem. I always run into this case, even though I select the model whose validation and training performance during learning are nearly the same. How can the test set performance still be worse than the training set?
Training data = the data we use to train our model.
Validation data = the data we use to test our model every epoch, or at run time, so that we can manually stop training early because of over-fitting or some other problem. Suppose I am running 1,000 epochs and at epoch 500 I see that my model gives 90% accuracy on training data but only 70% on validation data. I can see that my model is over-fitting, so I can stop training before the 1,000 epochs complete, tune my model some more, and then look at the behavior again.
Testing data = after completing training (the full 1,000 epochs), I predict on the test data and check the accuracy there; say it gives 86%.
My training accuracy is 90%, validation accuracy is 87%, and testing accuracy is 86%. These may differ because the data in the training, validation, and testing sets are completely different. Say we have 70% of the samples in the training set, 10% in validation, and 20% in testing. Then, out of 100 total samples, my model might predict 8 of the 10 validation images correctly and 18 of the 20 testing images correctly. This is normal in real-life projects, because the pixels in every image vary from one image to the next, so small differences can happen.
Another reason may be that the testing set contains more images than the validation set: the more images there are, the more chances there are for wrong predictions. For example, at 90% accuracy my model predicts 90 out of 100 images correctly, but if I increase the sample to 1,000 images it may predict only 850, 800, or 900 correctly out of 1,000.
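To tie the three sets together, here is a small, runnable scikit-learn sketch of a 70/10/20 split with early stopping on validation accuracy; the dataset, model, and patience value are stand-ins, not part of the original answer:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)

    # 70% train, 10% validation, 20% test
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=2/3, random_state=0)

    # warm_start=True lets us train one epoch at a time and watch validation accuracy
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1, warm_start=True, random_state=0)

    best_val, patience, bad_epochs = 0.0, 10, 0
    for epoch in range(1000):
        model.fit(X_train, y_train)             # one more pass over the training data
        val_acc = model.score(X_val, y_val)     # monitor the held-out validation data
        if val_acc > best_val:
            best_val, bad_epochs = val_acc, 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:              # validation stopped improving: stop early
            break

    print("test accuracy:", model.score(X_test, y_test))  # test set is used only once, at the end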

Accuracy on 1st epoch - MNIST Deep Learning example

I'm new to the world of deep learning and I would like to clarify something about my first deep-learning code, the MNIST example. Maybe I'm completely wrong, BTW, so please take it easy :)
I have split the training data into batches of size 50, with max epochs set to 15 (or until the validation loss starts increasing).
I am getting 93% accuracy on just the 1st epoch. How is that possible if (as far as I know) in the 1st epoch the complete training set is forward- and back-propagated just one time, so the weights and biases have only been adjusted once?
I thought I would only get good accuracy after many epochs, not just after the first adjustment of the weights.
Yes, you can get good accuracy in the first epoch as well. It depends on the complexity of the data and of the model you build; sometimes, if the learning rate is high, you can also end up with a higher training accuracy early on.
Also, coming to the adjusting-weights-and-biases part: this could be mini-batch training, where the model updates the weights for every mini-batch. So within one epoch the weights may already have been updated many times, equal to the number of training images divided by the batch size.
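As a back-of-the-envelope check (assuming the standard MNIST training set of 60,000 images and the batch size of 50 from the question):

    num_train_images = 60_000   # standard MNIST training set size
    batch_size = 50             # batch size from the question

    updates_per_epoch = num_train_images // batch_size
    print(updates_per_epoch)    # 1200 gradient steps happen within the very first epoch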

Machine Learning Training & Test data split method

I was running a random forest classification model and initially divided the data into train (80%) and test (20%). However, the predictions had too many false positives, which I think was because there was too much noise in the training data, so I decided to split the data in a different way, and here's how I did it.
Since I thought the high false-positive rate was due to the noise in the training data, I made the training data have an equal number of each target class. For example, if I have 10,000 rows and the target variable is 8,000 (0) and 2,000 (1), I made the training data a total of 4,000 rows consisting of 2,000 (0) and 2,000 (1), so that the training data now carries more signal.
When I tried this new splitting method, it predicted much better, increasing the positive recall from 14% to 70%.
I would love to hear your feedback on whether I am doing anything wrong here. I am concerned that I am making my training data biased.
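For reference, a minimal sketch of the undersampling scheme described in the question is shown below; the DataFrame df and the column name "target" are illustrative placeholders, not the asker's actual data:

    import pandas as pd

    def undersample_majority(df: pd.DataFrame, target: str = "target", seed: int = 0) -> pd.DataFrame:
        counts = df[target].value_counts()
        minority_size = counts.min()                       # e.g. the 2,000 positives
        balanced = (
            df.groupby(target, group_keys=False)
              .apply(lambda g: g.sample(n=minority_size, random_state=seed))
        )
        return balanced.sample(frac=1, random_state=seed)  # shuffle the balanced rows

    # balanced_train = undersample_majority(train_df)      # 2,000 rows of each class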
When you have an unequal number of data points in each class in the training set, the baseline (random prediction) changes.
By noisy data, I think you mean that the number of training points for one class is larger than for the other. This is not really called noise; it is actually bias.
For example: you have 10,000 data points in the training set, 8,000 of class 0 and 2,000 of class 1. I can predict class 0 all the time and already get 80% accuracy. This induces a bias, and the baseline for 0-1 classification will not be 50%.
To remove this bias, you can either intentionally balance the training set as you did, or change the error function by weighting each class inversely proportionally to its number of points in the training set.
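The class-weighting alternative can look like the following scikit-learn sketch; the feature matrix X and labels y are synthetic placeholders mirroring the 8,000/2,000 split above:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))                        # placeholder features
    y = np.concatenate([np.zeros(8_000), np.ones(2_000)])   # 8,000 of class 0, 2,000 of class 1

    # 'balanced' weights each class inversely proportionally to its frequency,
    # so no majority-class rows need to be thrown away.
    clf = RandomForestClassifier(class_weight="balanced", random_state=0)
    clf.fit(X, y)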
Actually, what you did is right, and this process is similar to "stratified sampling".
In your first model, where accuracy was very low, the model did not capture enough correlation between the features and the target for the positive class (1). It might also have somewhat over-fitted the negative class. This is called a "high bias, high variance" situation.
"Stratified sampling" simply means that when you extract a sample from a big population, you make sure all classes have approximately equal proportions, to make the model's training assumptions more accurate and reliable.
In the second case, the model was able to learn the relationships between features and target, and the characteristics of the positive and negative classes were well distinguishable.
Eliminating noise is a part of data preparation that should obviously be done before feeding data into a model.

Meaning of an Epoch in Neural Networks Training

While I'm reading about how to build an ANN in pybrain, they say:
Train the network for some epochs. Usually you would set something
like 5 here,
trainer.trainEpochs( 1 )
I looked up what that means, and concluded that we use an epoch of data to update the weights; if I choose to train the data with 5 epochs as pybrain advises, the dataset will be divided into 5 subsets, and the weights will be updated 5 times at most.
I'm familiar with online training, where the weights are updated after each data sample or feature vector. My question is: how can I be sure that 5 epochs will be enough to build the model and set the weights properly? What is the advantage of this way over online training? Also, the term "epoch" is used in online training; does it mean one feature vector?
One epoch consists of one full training cycle on the training set. Once every sample in the set is seen, you start again - marking the beginning of the 2nd epoch.
This has nothing to do with batch or online training per se. Batch means that you update once at the end of the epoch (after every sample has been seen, i.e. #epochs updates in total), and online means that you update after each sample (#samples * #epochs updates in total).
You can't be sure if 5 epochs or 500 is enough for convergence since it will vary from data to data. You can stop training when the error converges or gets lower than a certain threshold. This also goes into the territory of preventing overfitting. You can read up on early stopping and cross-validation regarding that.
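As a tiny illustration of those update counts (the numbers here are made up, not from the answer):

    num_samples = 1_000
    num_epochs = 5
    batch_size = 50

    batch_updates     = num_epochs                                   # full-batch: 1 update per epoch
    online_updates    = num_samples * num_epochs                     # online: 1 update per sample
    minibatch_updates = (num_samples // batch_size) * num_epochs     # mini-batch: in between

    print(batch_updates, online_updates, minibatch_updates)          # 5, 5000, 100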
Sorry for reactivating this thread.
I'm new to neural nets and I'm investigating the impact of 'mini-batch' training.
So far, as I understand it, an epoch (as runDOSrun says) is one full pass through the TrainingSet (not the DataSet, because DataSet = TrainingSet + ValidationSet). In mini-batch training, you subdivide the TrainingSet into small sets and update the weights within an epoch; 'hopefully' this makes the network 'converge' faster.
Some neural-network definitions are outdated and, I guess, need to be restated.
The number of epochs is a hyperparameter that defines the number of times the learning algorithm will work through the entire training dataset. One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters.
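A minimal skeleton of that epoch / mini-batch structure, with the actual weight update left as a placeholder, might look like this:

    def batches(data, batch_size):
        for i in range(0, len(data), batch_size):
            yield data[i:i + batch_size]

    def train(data, num_epochs, batch_size):
        for epoch in range(num_epochs):               # one epoch = one full pass over the training set
            for batch in batches(data, batch_size):
                pass                                  # placeholder: one weight update per mini-batch
            print(f"epoch {epoch + 1}: every sample has now been seen {epoch + 1} times")

    train(list(range(1_000)), num_epochs=5, batch_size=50)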
