Why do my earlier epochs take longer than subsequent epochs? - machine-learning

I am training a model in Keras and experimenting with how the amount of data I feed in affects my resulting accuracy. I noticed something interesting, though.
training samples: 5076
epoch 1: 142s
epoch 2: 60s
epoch 3: 61s
epoch 4: 60s
epoch 5: 61s
training samples: 10242
epoch 1: 277s
epoch 2: 131s
epoch 3: 131s
epoch 4: 132s
epoch 5: 131s
training samples: 15678
epoch 1: 385s
epoch 2: 323s
epoch 3: 167s
epoch 4: 168s
epoch 5: 168s
training samples: 20691
epoch 1: 577s
epoch 2: 440s
epoch 3: 273s
epoch 4: 274s
epoch 5: 274s
My intuition is that each epoch should take roughly the same amount of time.
I notice with smaller training sets, the first epoch takes longer than subsequent ones. I assumed that this was because I have written my own data loader and that there was some amount of 'spinning up' happening during the first epoch. But with larger training sets, I notice that the second epoch is taking longer than subsequent epochs too.
Why do the earlier epochs take longer? Are more weights being updated in those earlier runs?

The simplest and most intuitive reason I could think of for early epochs taking longer than later ones is that in the early epochs your classification/regression system's error is very high (which is natural given random initial weights), so there are large gradients to back-propagate and many weights to update.
It could be that your model is fitting the training data very quickly (in roughly 2 epochs), so that the later epochs only update a small fraction of the weights, since most of the gradients are now 0. This could lead to a shorter training time per epoch.
Try outputting the average accuracy, or better yet the gradient magnitudes, for each epoch, and check the assumption above.
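To make that check concrete, here is a minimal sketch (not from the original poster, with hypothetical variable names) of a Keras callback that measures gradient norms on a fixed probe batch at the end of each epoch:

```python
import tensorflow as tf

class GradientNormLogger(tf.keras.callbacks.Callback):
    """Logs the mean gradient norm on a fixed probe batch after each epoch."""

    def __init__(self, probe_x, probe_y, loss_fn):
        super().__init__()
        self.probe_x = probe_x    # a small, fixed batch of inputs
        self.probe_y = probe_y    # the matching labels
        self.loss_fn = loss_fn    # e.g. tf.keras.losses.SparseCategoricalCrossentropy()

    def on_epoch_end(self, epoch, logs=None):
        with tf.GradientTape() as tape:
            preds = self.model(self.probe_x, training=True)
            loss = self.loss_fn(self.probe_y, preds)
        grads = tape.gradient(loss, self.model.trainable_variables)
        mean_norm = tf.reduce_mean([tf.norm(g) for g in grads if g is not None])
        print(f"epoch {epoch}: mean gradient norm = {float(mean_norm):.6f}")

# usage (x_train / y_train are placeholders for your own data):
# model.fit(x_train, y_train, epochs=5,
#           callbacks=[GradientNormLogger(x_train[:64], y_train[:64],
#                                         tf.keras.losses.SparseCategoricalCrossentropy())])
```

If the mean gradient norm collapses after the first couple of epochs, that would support the "model fits very quickly" explanation above.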

The extra time in the first epoch can be due to compilation overhead from building the computational graph for training.
As for the second epoch, it's a bit trickier. I assume it could have something to do with how your optimizer applies gradient updates. For example, I have seen people mention that increasing the beta_1 value from 0.9 to 0.99 for the Adam optimizer sometimes reduces epoch duration.
Also, if your model were fitting the data quickly, that would mean fewer updates and hence faster execution. But that seems unlikely in your case, as you encounter the problem only when increasing the training sample size.
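Purely as a hedged illustration of that beta_1 tweak (the timing effect is anecdotal, not something Adam guarantees), this is where the parameter lives in Keras:

```python
from tensorflow import keras

# beta_1 is the exponential decay rate for Adam's first-moment estimates;
# the Keras default is 0.9.
optimizer = keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.99)
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
```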

Although both answers have merit, I believe they miss an important part: the dependence on the amount of data.
Keras caches data so it can be reused later, hence the significant speed-up after the first epoch. I am not exactly sure, but I assume that with larger amounts of data it cannot cache everything up front.
Based on a similar question.
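I can't confirm that Keras transparently caches a custom loader's output, but as a hedged sketch of the mechanism this answer alludes to, this is what explicit caching looks like with tf.data: the first epoch pays the cost of filling the cache, and later epochs read from it (the file pattern and sizes below are made up):

```python
import tensorflow as tf

def load_image(path):
    # Decode one image file; a real pipeline would also resize,
    # normalise and attach a label here.
    return tf.io.decode_png(tf.io.read_file(path))

dataset = (
    tf.data.Dataset.list_files("data/*.png")              # hypothetical file pattern
    .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()            # epoch 1 fills the cache, later epochs reuse it
    .shuffle(1024)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
# If the decoded data does not fit in RAM, .cache("/tmp/my_cache") spills to disk.
# model.fit(dataset, epochs=5)
```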

Related

Keras loss changes on epoch end

I am training a model in Keras (TensorFlow 2 backend) using the ImageDataGenerator class to train on batches.
I have noticed that when the second epoch starts, the loss value is much smaller than the value at the end of the first epoch.
Here is what I mean:
[screenshot: Keras training log showing the loss drop between epochs]
Note that the starting value in the second epoch is around the value that you are seeing in the screenshot.
Does anyone know why that happens?
Does Keras update the weights again when all batches are processed?
The loss is indeed expected to be smaller, but your surprise at the extent of the drop is understandable.
The reason the second epoch has such a lower loss is that during the first epoch your model makes many mistakes and yields large losses, and then gets better and better.
Keras displays the mean loss over all instances in an epoch.
So if the model made mistakes on the first 90% of the training set during the epoch, and then was perfect on the last 10% of the data, the reported loss would still be large because it is the mean loss.
Then, at the start of the 2nd epoch, the model is already better at predicting, so the mean loss is lower.
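A toy calculation (numbers invented, not from the screenshot) shows how the epoch-average can sit far above where the model actually ends the epoch:

```python
import numpy as np

# Pretend the per-batch loss falls steadily from 2.3 to 0.4 during epoch 1.
epoch1_batch_losses = np.linspace(2.3, 0.4, num=100)

print("epoch 1 reported loss (mean over batches):", epoch1_batch_losses.mean())  # ~1.35
print("loss on the last batch of epoch 1:        ", epoch1_batch_losses[-1])     # 0.4

# Epoch 2 starts from the already-improved model, so even its first batches
# are near 0.4 and its mean looks dramatically lower than epoch 1's mean.
```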

Does training for 10 epochs have the same effect as training for 5 epochs twice?

I've not checked the training accuracy and losses after training with both approaches.
If I understood your question correctly, the answer is yes.
For example, I programmed my model to pick the best epoch during training and use that state for the test set.
Let's say I train for 10 epochs. On the very first run (achieved by restarting the kernel) it will always choose epoch 9 or 10 as the best, but if I reuse the model and train for another 10 epochs, it usually chooses an epoch between 0 and 4 as the best, and the results are slightly better. This tells me the model is taking the first 10 epochs into account. Furthermore, these results are consistent with training for 20 epochs in one go, where it chooses an epoch between 10 and 14 as the best.
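A small self-contained sketch (toy data, not the original poster's model) of why this holds in Keras: the weights persist across calls to fit, so a second call continues training rather than restarting it.

```python
import numpy as np
from tensorflow import keras

x = np.random.rand(256, 8).astype("float32")
y = np.random.randint(0, 2, size=(256, 1))

def make_model():
    model = keras.Sequential([
        keras.Input(shape=(8,)),
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

model_a = make_model()
model_a.fit(x, y, epochs=10, verbose=0)   # one run of 10 epochs

model_b = make_model()
model_b.fit(x, y, epochs=5, verbose=0)    # first 5 epochs
model_b.fit(x, y, epochs=5, verbose=0)    # continues from where it left off

# The two runs are roughly, not exactly, equivalent: initial weights, shuffling
# order and anything keyed to the epoch counter reset between fit() calls.
```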

Is some overfitting in a convolutional network alright?

I am using ResNet50 to classify pictures of flowers from a Kaggle dataset. I would like to clarify some things about my results.
epoch train_loss valid_loss error_rate time
0 0.205352 0.226580 0.077546 02:01
1 0.148942 0.205224 0.074074 02:01
These are the last two epochs of training. As you can see, the second epoch shows some overfitting because the train_loss is a good margin lower than the validation loss. Despite the overfitting, the error_rate and the validation loss decreased. I am wondering whether the model had actually improved in spite of the overfitting. Is it better to use the model from epoch 0 or epoch 1 for unseen data? Thank you!
Sadly, "overfitting" is a much abused term nowadays, used to mean almost everything linked to suboptimal performance; nevertheless, and practically speaking, overfitting means something very specific: its telltale signature is when your validation loss starts increasing, while your training loss continues decreasing, i.e.:
(Image adapted from Wikipedia entry on overfitting)
It's clear that nothing of the sort happens in your case; the "margin" between your training and validation loss is another story altogether (it is called the generalization gap), and does not signify overfitting.
Thus, in principle, you have absolutely no reason at all to choose a model with higher validation loss (i.e. your first one) instead of one with a lower validation loss (your second one).
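The question's log looks like fastai output, but keeping with Keras as in the rest of this page, here is a hedged sketch of the practical takeaway: instead of picking between epochs by hand, let a callback keep whichever weights had the lowest validation loss (names, paths and the variables in the commented fit call are assumptions):

```python
from tensorflow import keras

callbacks = [
    # Save the weights whenever val_loss improves on the best seen so far.
    keras.callbacks.ModelCheckpoint("best_model.keras",
                                    monitor="val_loss", save_best_only=True),
    # Stop if val_loss has not improved for 3 epochs, and roll back to the best.
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                  restore_best_weights=True),
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=20, callbacks=callbacks)
```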

Accuracy on 1st epoch - MNIST Deep Learning example

I'm new to the world of deep learning and I would like to clarify something about my first deep learning code, the MNIST example. Maybe I'm completely wrong, BTW, so please take it easy :)
I have split the training data into batches, each with a size of 50, and set the max epochs to 15 (or until the validation loss starts increasing).
I am getting 93% accuracy on just the 1st epoch. How is that possible if (as far as I know) during the 1st epoch the network forward- and back-propagates the complete training set just once, so the weights and biases have only been adjusted once?
I thought I would only get good accuracy after many epochs, not just after the first adjustment of the weights.
Yes, you can get good accuracy in the first epoch as well. It depends on the complexity of the data and the model you build. Sometimes, if the learning rate is high, it can also happen that you get higher training accuracy early on.
Also, regarding the adjusting of weights and biases: this is mini-batch training, and the model updates the weights after every mini-batch. So within a single epoch the weights are updated many times, roughly equal to the number of training images divided by the batch size.
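The arithmetic behind that, using the standard MNIST training-set size and the batch size from the question:

```python
train_samples = 60_000   # standard MNIST training set
batch_size = 50          # from the question
updates_per_epoch = train_samples // batch_size
print(updates_per_epoch)  # 1200 weight updates during the very first epoch
```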

Meaning of an Epoch in Neural Networks Training

While reading about how to build an ANN in PyBrain, the docs say:
Train the network for some epochs. Usually you would set something
like 5 here,
trainer.trainEpochs( 1 )
I looked up what that means, and concluded that we use one epoch of data to update the weights; if I choose to train with 5 epochs as PyBrain advises, the dataset will be divided into 5 subsets, and the weights will be updated 5 times at most.
I'm familiar with online training, where the weights are updated after each sample or feature vector. My question is: how can I be sure that 5 epochs will be enough to build a model and set the weights properly? What is the advantage of this approach over online training? Also, is the term "epoch" used in online training, and does it mean one feature vector there?
One epoch consists of one full training cycle on the training set. Once every sample in the set is seen, you start again - marking the beginning of the 2nd epoch.
This has nothing to do with batch or online training per se. Batch means that you update once at the end of the epoch (after every sample is seen), i.e. #epochs updates in total, while online means that you update after each sample, i.e. #samples * #epochs updates in total.
You can't be sure if 5 epochs or 500 is enough for convergence since it will vary from data to data. You can stop training when the error converges or gets lower than a certain threshold. This also goes into the territory of preventing overfitting. You can read up on early stopping and cross-validation regarding that.
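A framework-free sketch of the bookkeeping in that answer, with `gradient_step` standing in for a real weight update:

```python
def gradient_step(batch):
    pass  # placeholder for computing gradients on `batch` and updating weights

def train(training_set, n_epochs, mode, batch_size=32):
    for epoch in range(n_epochs):                       # one epoch = one full pass
        if mode == "batch":
            gradient_step(training_set)                 # 1 update per epoch
        elif mode == "online":
            for sample in training_set:
                gradient_step([sample])                 # len(training_set) updates per epoch
        elif mode == "mini-batch":
            for i in range(0, len(training_set), batch_size):
                gradient_step(training_set[i:i + batch_size])
```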
Sorry for reactivating this thread.
I'm new to neural nets and I'm investigating the impact of "mini-batch" training.
So far, as I understand it, an epoch (as runDOSrun says) is one full pass through the TrainingSet (not the DataSet, because DataSet = TrainingSet + ValidationSet). In mini-batch training, you can subdivide the TrainingSet into small sets and update the weights inside an epoch. "Hopefully" this makes the network converge faster.
Some definitions of neural-network terms are outdated and, I guess, should be redefined.
The number of epochs is a hyperparameter that defines the number of times the learning algorithm will work through the entire training dataset. One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters.