Perplexity slowly rises between each significant drop - machine-learning

I am training a conversational agent using an LSTM and TensorFlow's translation model. I use batchwise training, which produces a significant drop in the training-data perplexity at the start of each epoch. The drop can be explained by the way I read data into batches: I guarantee that every training pair in my training data is processed exactly once per epoch. When a new epoch starts, the improvements the model made in the previous epoch pay off as it encounters the training data again, which shows up as a drop in the graph. Other batchwise approaches, such as the one used in TensorFlow's translation model, do not show the same behavior, because they load the entire training set into memory and randomly pick samples from it.
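To make the two feeding strategies concrete, here is a minimal sketch in plain Python (the "pairs" list is a hypothetical stand-in for my training pairs, and the second function only approximates the random-sampling style described above):

import random

def epoch_batches(pairs, batch_size):
    # My approach: every training pair is seen exactly once per epoch.
    random.shuffle(pairs)
    for i in range(0, len(pairs), batch_size):
        yield pairs[i:i + batch_size]

def sampled_batch(pairs, batch_size):
    # The other approach: draw random samples from the full training set,
    # so a given pair may appear several times per "epoch" or not at all.
    return [random.choice(pairs) for _ in range(batch_size)]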
Step, Perplexity
330000, 19.36
340000, 19.20
350000, 17.79
360000, 17.79
370000, 17.93
380000, 17.98
390000, 18.05
400000, 18.10
410000, 18.14
420000, 18.07
430000, 16.48
440000, 16.75
(A small snippet of the perplexity values showing drops at steps 350000 and 430000. Between the drops, the perplexity rises slightly.)
However, my question concerns the trend after each drop. From the graph, it is clear that the perplexity rises slightly within every epoch (after step ~350000) until the next drop. Can someone offer an explanation or theory for why this happens?

That would be typical of overfitting.

Related

Why does pre-trained ResNet18 have a higher validation accuracy than training?

For PyTorch's tutorial on performing transfer learning for computer vision (https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html), we can see that there is a higher validation accuracy than training accuracy. Applying the same steps to my own dataset, I see similar results. Why is this the case? Does it have something to do with ResNet 18's architecture?
Assuming there aren't bugs in your code and the train and validation data are in the same domain, then there are a couple reasons why this may occur.
Training loss/acc is computed as the average across an entire training epoch. The network begins the epoch with one set of weights and ends the epoch with a different (hopefully better!) set of weights. During validation you're evaluating everything using only the most recent weights. This means that the comparison between validation and train accuracy is misleading since training accuracy/loss was computed with samples from potentially much worse states of your model. This is usually most noticeable at the start of training or right after the learning rate is adjusted since the network often starts the epoch in a much worse state than it ends. It's also often noticeable when the training data is relatively small (as is the case in your example).
Another difference is the data augmentations used during training that aren't used during validation. During training you randomly crop and flip the training images. While these random augmentations are useful for increasing the ability of your network to generalize they aren't performed during validation because they would diminish performance.
If you were really motivated and didn't mind spending the extra computational power you could get a more meaningful comparison by running the training data back through your network at the end of each epoch using the same data transforms used for validation.
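If you want to try that, here is a minimal sketch (PyTorch; the data/train path and the 224-pixel transforms are placeholders for whatever your training folder and validation pipeline actually use):

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Validation-style transforms: deterministic, no random crops or flips.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Hypothetical path: point this at your own training images.
train_eval_set = datasets.ImageFolder("data/train", transform=eval_transform)
train_eval_loader = DataLoader(train_eval_set, batch_size=64, shuffle=False)

def accuracy(model, loader, device="cpu"):
    # Accuracy of the end-of-epoch weights on the training images,
    # evaluated the same way the validation set is.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total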
The short answer is that train and validation data are from different distributions, and it's "easier" for the model to predict the target in the validation data than in the training data.
The likely reason for this particular case, as indicated by this answer, is data augmentation during training. This is a way to regularize your model by increasing variability in the training data.
Other architectures can use Dropout (or its modifications), which deliberately "hurts" training performance to reduce the potential for overfitting.
Notice that you're using a pretrained model, which already contains some information about how to solve the classification problem. If your domain is not that different from the data it was trained on, you can expect good performance off the shelf.
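For reference, the off-the-shelf setup from the linked tutorial boils down to something like this (torchvision; the 2-class head matches the ants/bees data used there):

import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)        # start from ImageNet weights
model.fc = nn.Linear(model.fc.in_features, 2)   # replace only the final layer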

Why do too many epochs cause overfitting?

I am reading the Deep Learning with Python book.
After reading chapter 4, Fighting Overfitting, I have two questions.
Why might increasing the number of epochs cause overfitting?
I know that increasing the number of epochs involves more attempts at gradient descent; will this cause overfitting?
During the process of fighting overfitting, will the accuracy be reduced?
I'm not sure which book you are reading, so some background information may help before I answer the questions specifically.
Firstly, increasing the number of epochs won't necessarily cause overfitting, but it certainly can do. If the learning rate and model parameters are small, it may take many epochs to cause measurable overfitting. That said, it is common for more training to do so.
To keep the question in perspective, it's important to remember that we most commonly use neural networks to build models we can use for prediction (e.g. predicting whether an image contains a particular object or what the value of a variable will be in the next time step).
We build the model by iteratively adjusting weights and biases so that the network can act as a function to translate between input data and predicted outputs. We turn to such models for a number of reasons, often because we just don't know what the function is/should be or the function is too complex to develop analytically. In order for the network to be able to model such complex functions, it must be capable of being highly-complex itself. Whilst this complexity is powerful, it is dangerous! The model can become so complex that it can effectively remember the training data very precisely but then fail to act as an effective, general function that works for data outside of the training set. I.e. it can overfit.
You can think of it as being a bit like someone (the model) who learns to bake by only baking fruit cake (training data) over and over again – soon they'll be able to bake an excellent fruit cake without using a recipe (training), but they probably won't be able to bake a sponge cake (unseen data) very well.
Back to neural networks! Because the risk of overfitting is high with a neural network there are many tools and tricks available to the deep learning engineer to prevent overfitting, such as the use of dropout. These tools and tricks are collectively known as 'regularisation'.
This is why we use development and training strategies involving test datasets – we pretend that the test data is unseen and monitor it during training. A typical training/test error plot shows this: after about 50 epochs the test error begins to increase as the model starts to 'memorise the training set', even though the training error remains at its minimum value (often the training error will continue to improve).
So, to answer your questions:
Allowing the model to continue training (i.e. more epochs) increases the risk of the weights and biases being tuned to such an extent that the model performs poorly on unseen (or test/validation) data. The model is now just 'memorising the training set'.
Continued epochs may well increase training accuracy, but this doesn't necessarily mean the model's predictions from new data will be accurate – often it actually gets worse. To prevent this, we use a test data set and monitor the test accuracy during training. This allows us to make a more informed decision on whether the model is becoming more accurate for unseen data.
We can use a technique called early stopping, whereby we stop training the model once test accuracy has stopped improving after a small number of epochs. Early stopping can be thought of as another regularisation technique.
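Since the book in question uses Keras, a minimal sketch of early stopping there might look like this (the model and data are assumed; the epoch count is just a generous upper bound):

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)

# Hypothetical model and data; training halts once val_loss stops improving.
# model.fit(x_train, y_train,
#           validation_data=(x_val, y_val),
#           epochs=500,
#           callbacks=[early_stop])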
More attempts at descent (a large number of epochs) can, ideally, take you very close to the global minimum of the loss function. But since we don't know anything about the test data, fitting the model so precisely to the class labels of the training data may cause it to lose its generalization capability (its error over unseen data grows). We certainly want to learn the input-output relationship from the training data, but we must not forget that the end goal is for the model to perform well on unseen data. So it is a good idea to stay close, but not too close, to the global minimum.
Still, we can ask: what if I do reach the global minimum? What is the problem with that, and why would it cause the model to perform badly on unseen data?
One answer is that in order to reach the global minimum we would be trying to fit as much of the training data as possible, which results in a very complex model (the particular sample of training data we happen to have is unlikely to follow a simple spatial distribution). But we can assume that a large amount of unseen data (say, for facial recognition) has a simpler spatial distribution and needs a simpler model for good classification; the entire world of unseen data may well follow a pattern that we cannot observe, simply because we only have access to a small fraction of it in the form of training data.
If you incrementally observe points from a distribution (say 50, 100, 500, 1000, ...), the structure of the data will look complex until you have observed a sufficiently large number of points (at most, the entire distribution); once enough points have been observed, you can expect the simpler pattern present in the data to emerge, and it can be classified more easily.
In short, a small sample of training data tends to have a more complex structure than the full distribution, and overfitting to it may cause our model to perform worse on the test data.
An analogous example of this phenomenon from day-to-day life is as follows:
Say we have met N people in our lifetime so far. While meeting them we naturally learn from them (we become what we are surrounded with). If we are heavily influenced by each individual and try to tune ourselves very closely to the behaviour of everyone we have met, we develop a personality that closely resembles those people, but we also start judging every individual who is unlike them. Becoming judgemental takes a toll on our ability to fit in with new groups, because we trained so hard to minimize the differences with the people we have already met (the training data). To me this is a good example of overfitting and the loss of generalization capability.

Time Series Prediction using LSTM

I am using Jason Brownlee's tutorial (mirror) to apply an LSTM network to some syslog/network log data. He's a master!
I have syslog data (for a specific event) for each day of the last year, so I am using an LSTM network for time series analysis. I am using the LSTM from the Keras deep learning library.
As I understand -
About Batch_size
A batch of data is a fixed-sized number of rows from the training dataset that defines how many patterns to process before updating the weights of the network. Based on the batch_size the model takes random samples from the data for the analysis. For time series this is not desirable, hence the batch_size should always be 1.
About setting the shuffle value
By default, the samples within an epoch are shuffled prior to being exposed to the network. This is undesirable for the LSTM because we want the network to build up state as it learns across the sequence of observations. We can disable the shuffling of samples by setting "shuffle" to "False".
Scenario 1 -
Using the above two rules/guidelines, I ran several trials with different numbers of neurons, epoch sizes, and numbers of layers, and got better results than the baseline model (a persistence model).
Scenario 2 -
Without using the above guidelines/rules, I ran several trials with different numbers of neurons, epoch sizes, and numbers of layers, and got even better results than in Scenario 1.
Query - Setting shuffle to False and batch_size to 1 for time series: is this a rule or a guideline?
Reading the tutorial, it seems logical that time series data should not be shuffled, as we do not want to change the sequence of the data; but for my data the results are better if I let the data be shuffled.
In the end, I think what matters is how I get better predictions from my runs.
I think I should set aside "theory" in favour of concrete evidence, such as metrics, elbows, RMSEs, etc.
Kindly enlighten.
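For concreteness, the batch_size and shuffle choices in the two scenarios boil down to how fit is called (Keras; model, X and y are placeholders for my network and data, and the batch size of 32 is just an example):

# Scenario 1: follow the tutorial's guidelines.
model.fit(X, y, epochs=50, batch_size=1, shuffle=False)

# Scenario 2: drop the guidelines, shuffle and use a larger batch.
model.fit(X, y, epochs=50, batch_size=32, shuffle=True)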
It depends a lot on the size of your data, and also on the number of variables. In my experience, decreasing the batch size gives better results since the updates are more frequent, but on huge datasets it is very expensive, so you have to play with this trade-off (training time vs. result).
About shuffling: it may be that your data is not that correlated with the past; if so, shuffling the data helps the network learn and generalize (for example, when the data is ordered by label). Check reason 7 of the following: 37 Reasons why your Neural Network is not working.
As for batch size: the larger it is, the harder it is to generalize (reason 11). When the data clearly depends on the past, you can declare your LSTM in Keras as stateful. This means "that the states computed for the samples in one batch will be reused as initial states for the samples in the next batch", according to the Keras API. Hope this helps.
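A minimal sketch of that stateful declaration (Keras; the shapes are assumptions for a univariate daily series, and argument names can vary slightly between Keras versions):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

batch_size, timesteps, features = 1, 10, 1      # assumed shapes
model = Sequential([
    LSTM(32, stateful=True,
         batch_input_shape=(batch_size, timesteps, features)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# With stateful=True you should not shuffle, and you reset states yourself:
# for _ in range(n_epochs):
#     model.fit(X, y, epochs=1, batch_size=batch_size, shuffle=False)
#     model.reset_states()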

ResNet: How to achieve the accuracy reported in the paper?

I implemented ResNet for CIFAR-10 in accordance with this paper: https://arxiv.org/pdf/1512.03385.pdf
But my accuracy is significantly different from the accuracy reported in the paper.
Mine - 86%
Paper's - 94%
What's my mistake?
https://github.com/slavaglaps/ResNet_cifar10
Your question is a little too generic. My opinion is that the network is overfitting to the training data set: as you can see, the training loss is quite low, but after epoch 50 the validation loss is no longer improving.
I didn't read the paper in depth, so I don't know how they solved the problem, but increasing regularization might help. The following link will point you in the right direction: http://cs231n.github.io/neural-networks-3/
Below I copied the summary from that page:
Summary
To train a Neural Network:
Gradient check your implementation with a small batch of data and be aware of the pitfalls.
As a sanity check, make sure your initial loss is reasonable and that you can achieve 100% training accuracy on a very small portion of the data.
During training, monitor the loss, the training/validation accuracy, and if you're feeling fancier, the magnitude of updates in relation to parameter values (it should be ~1e-3) and, when dealing with ConvNets, the first-layer weights.
The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy tops off.
Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges, training only for 1-5 epochs) to fine (narrower ranges, training for many more epochs).
Form model ensembles for extra performance.
I would argue that the difference in data preprocessing makes the difference in performance. The paper uses padding and random crops, which in essence increases the number of training samples and decreases the generalization error. Also, as the previous poster said, you are missing regularization features, such as weight decay.
You should take another look at the paper and make sure you implement everything like they did.
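As a reference point, here is a minimal sketch of those two ingredients, shown in PyTorch (the hyperparameter values are the ones reported in the paper for CIFAR-10, so double-check them against your own setup):

import torch
from torchvision import transforms

# Paper-style CIFAR-10 augmentation: pad by 4 pixels, random 32x32 crop, flip.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Weight decay (L2 regularization) goes in the optimizer; the model is assumed.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
#                             momentum=0.9, weight_decay=1e-4)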

Meaning of an Epoch in Neural Networks Training

While reading about how to build an ANN in pybrain, they say:
Train the network for some epochs. Usually you would set something like 5 here,
trainer.trainEpochs( 1 )
I looked up what that means, and I concluded that we use an epoch of data to update the weights; if I choose to train the data with 5 epochs as pybrain advises, the dataset will be divided into 5 subsets, and the weights will be updated 5 times at most.
I'm familiar with online training, where the weights are updated after each data sample or feature vector. My question is: how can I be sure that 5 epochs will be enough to build a model and set the weights properly? What is the advantage of this approach over online training? Also, the term "epoch" is used in online training; does it mean one feature vector?
One epoch consists of one full training cycle on the training set. Once every sample in the set is seen, you start again - marking the beginning of the 2nd epoch.
This has nothing to do with batch or online training per se. Batch means that you update once at the end of the epoch (after every sample is seen, i.e. #epoch updates) and online that you update after each sample (#samples * #epoch updates).
You can't be sure if 5 epochs or 500 is enough for convergence since it will vary from data to data. You can stop training when the error converges or gets lower than a certain threshold. This also goes into the territory of preventing overfitting. You can read up on early stopping and cross-validation regarding that.
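A small bookkeeping sketch (plain Python, made-up numbers) of how many weight updates each mode performs:

import math

n_samples, n_epochs, batch_size = 1000, 5, 32

online_updates    = n_samples * n_epochs                          # one update per sample
batch_updates     = n_epochs                                       # one update per epoch
minibatch_updates = math.ceil(n_samples / batch_size) * n_epochs   # one update per mini-batch

print(online_updates, batch_updates, minibatch_updates)            # 5000 5 160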
Sorry for reactivating this thread.
I'm new to neural nets and I'm investigating the impact of 'mini-batch' training.
So far, as I understand it, an epoch (as runDOSrun is saying) is one full use of all of the TrainingSet (not the DataSet, because DataSet = TrainingSet + ValidationSet). In mini-batch training, you can subdivide the TrainingSet into small sets and update the weights inside an epoch. 'Hopefully' this makes the network 'converge' faster.
Some definitions of neural networks are outdated and, I guess, must be redefined.
The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset. One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters.
