Is it possible that the accuracy of perceptron decreases as I go through the training more times? In this case, I use the same training set several times.
Neither the accuracy on training data set nor on test data set is stable as the epoch increases. Actually the experimental data indicated that the trend of either in-sample-error or out-sample-error is not even monotonic. And a "pocket" strategy is often applied. Unlike early stop, the pocket algorithm keeps the best solution seen so far "in its pocket" instead of the last solution.
Yes.
This is a commonly studied phenomenon, the accuracy on never-before-seen data (testing data) start to decrease after certain point (after a certain number of passes through the training data-- what you call epochs). This phenomenon is called overfitting and is well understood. You want to stop early, as early as possible or use regularization.
Related
For PyTorch's tutorial on performing transfer learning for computer vision (https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html), we can see that there is a higher validation accuracy than training accuracy. Applying the same steps to my own dataset, I see similar results. Why is this the case? Does it have something to do with ResNet 18's architecture?
Assuming there aren't bugs in your code and the train and validation data are in the same domain, then there are a couple reasons why this may occur.
Training loss/acc is computed as the average across an entire training epoch. The network begins the epoch with one set of weights and ends the epoch with a different (hopefully better!) set of weights. During validation you're evaluating everything using only the most recent weights. This means that the comparison between validation and train accuracy is misleading since training accuracy/loss was computed with samples from potentially much worse states of your model. This is usually most noticeable at the start of training or right after the learning rate is adjusted since the network often starts the epoch in a much worse state than it ends. It's also often noticeable when the training data is relatively small (as is the case in your example).
Another difference is the data augmentations used during training that aren't used during validation. During training you randomly crop and flip the training images. While these random augmentations are useful for increasing the ability of your network to generalize they aren't performed during validation because they would diminish performance.
If you were really motivated and didn't mind spending the extra computational power you could get a more meaningful comparison by running the training data back through your network at the end of each epoch using the same data transforms used for validation.
The short answer is that train and validation data are from different distributions, and it's "easier" for model to predict target in validation data then it is for training.
The likely reason for this particular case, as indicated by this answer, is data augmentation during training. This is a way to regularize your model by increasing variability in the training data.
Other architectures can use Dropout (or its modifications), which are deliberately "hurting" training performance, reducing the potential of overfitting.
Notice, that you're using pretrained model, which already contains some information about how to solve classification problem. If your domain is not that different from the data it was trained on, you can expect good performance off-the-shelf.
I am reading the a deep learning with python book.
After reading chapter 4, Fighting Overfitting, I have two questions.
Why might increasing the number of epochs cause overfitting?
I know increasing increasing the number of epochs will involve more attempts at gradient descent, will this cause overfitting?
During the process of fighting overfitting, will the accuracy be reduced ?
I'm not sure which book you are reading, so some background information may help before I answer the questions specifically.
Firstly, increasing the number of epochs won't necessarily cause overfitting, but it certainly can do. If the learning rate and model parameters are small, it may take many epochs to cause measurable overfitting. That said, it is common for more training to do so.
To keep the question in perspective, it's important to remember that we most commonly use neural networks to build models we can use for prediction (e.g. predicting whether an image contains a particular object or what the value of a variable will be in the next time step).
We build the model by iteratively adjusting weights and biases so that the network can act as a function to translate between input data and predicted outputs. We turn to such models for a number of reasons, often because we just don't know what the function is/should be or the function is too complex to develop analytically. In order for the network to be able to model such complex functions, it must be capable of being highly-complex itself. Whilst this complexity is powerful, it is dangerous! The model can become so complex that it can effectively remember the training data very precisely but then fail to act as an effective, general function that works for data outside of the training set. I.e. it can overfit.
You can think of it as being a bit like someone (the model) who learns to bake by only baking fruit cake (training data) over and over again – soon they'll be able to bake an excellent fruit cake without using a recipe (training), but they probably won't be able to bake a sponge cake (unseen data) very well.
Back to neural networks! Because the risk of overfitting is high with a neural network there are many tools and tricks available to the deep learning engineer to prevent overfitting, such as the use of dropout. These tools and tricks are collectively known as 'regularisation'.
This is why we use development and training strategies involving test datasets – we pretend that the test data is unseen and monitor it during training. You can see an example of this in the plot below (image credit). After about 50 epochs the test error begins to increase as the model has started to 'memorise the training set', despite the training error remaining at its minimum value (often training error will continue to improve).
So, to answer your questions:
Allowing the model to continue training (i.e. more epochs) increases the risk of the weights and biases being tuned to such an extent that the model performs poorly on unseen (or test/validation) data. The model is now just 'memorising the training set'.
Continued epochs may well increase training accuracy, but this doesn't necessarily mean the model's predictions from new data will be accurate – often it actually gets worse. To prevent this, we use a test data set and monitor the test accuracy during training. This allows us to make a more informed decision on whether the model is becoming more accurate for unseen data.
We can use a technique called early stopping, whereby we stop training the model once test accuracy has stopped improving after a small number of epochs. Early stopping can be thought of as another regularisation technique.
More attempts of decent(large number of epochs) can take you very close to the global minima of the loss function ideally, Now since we don't know anything about the test data, fitting the model so precisely to predict the class labels of the train data may cause the model to lose it generalization capabilities(error over unseen data). In a way, no doubt we want to learn the input-output relationship from the train data, but we must not forget that the end goal is for the model to perform well over the unseen data. So, it is a good idea to stay close but not very close to the global minima.
But still, we can ask what if I reach the global minima, what can be the problem with that, why would it cause the model to perform badly on unseen data?
The answer to this can be that in order to reach the global minima we would be trying to fit the maximum amount of train data, this will result in a very complex model(since it is less probable to have a simpler spatial distribution of the selected number of train data that is fortunately available with us). But what we can assume is that a large amount of unseen data(say for facial recognition) will have a simpler spatial distribution and will need a simpler Model for better classification(I mean the entire world of unseen data, will definitely have a pattern that we can't observe just because we have an access small fraction of it in the form of training data)
If you incrementally observe points from a distribution(say 50,100,500, 1000 ...), we will definitely find the structure of the data complex until we have observed a sufficiently large number of points (max: the entire distribution), but once we have observed enough points we can expect to observe the simpler pattern present in the data that can be easily classified.
In short, a small fraction of train data should have a complex structure as compared to the entire dataset. And overfitting to the train data may cause our model to perform worse on the test data.
One analogous example to emphasize the above phenomenon from day to day life is as follows:-
Say we meet N number of people till date in our lifetime, while meeting them we naturally learn from them(we become what we are surrounded with). Now if we are heavily influenced by each individual and try to tune to the behaviour of all the people very closely, we develop a personality that closely resembles the people we have met but on the other hand we start judging every individual who is unlike me -> unlike the people we have already met. Becoming judgemental takes a toll on our capability to tune in with new groups since we trained very hard to minimize the differences with the people we have already met(the training data). This according to me is an excellent example of overfitting and loss in genralazition capabilities.
I am doing text classification with scikit learn's logistic regression function (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). I am using grid search in order to choose a value for the C parameter. Do I need to do the same for max_iter parameter? why?
Both C and max_iter parameters have default values in Sklearn, which means they need to be tuned. But, from what I understand, early stopping and l1/l2 regularization are two desperate methods for avoiding overfitting and performing one of them is enough. Am I incorrect in assuming that tunning the value of max_iter is equivalent to early stopping?
To summarize, here are my main questions:
1- Does max_iter need tuning? why? (the documentation says it is only useful for certain solvers)
2- Is tuning the max_iter equivalent to early stopping?
3- Should we perform early stopping and L1/L2 regularization at the same time?
Here's some simple responses to your numbered questions and grossly simplified:
Yes, sometimes you need to tune max_iter. Why? See next.
No. max_iter is the number of iterations that the logistic regression classifier's solver is allowed to step through before being stopped. The aim is to reach a "stable" solution for the parameters of the logistic regression model, i.e., it is an optimisation problem. If your max_iter is too low, you may not reach an optimal solution and your model is underfit. If your value is too high, you can essentially wait forever to have a solution for little gain in accuracy. You may also get stuck at local optima if max_iter is too low.
Yes or No.
a. L1/L2 regularisation is essentially "smoothing" of your complex model so that it does not overfit to the training data. If parameters become too large, they are penalised in the cost.
b. Early stopping is when you stop optimising your model (e.g., via gradient descent) at some stage in which you deem acceptable (before max_iter). For example, a metric such as RMSE can be used to define when to stop, or a comparison of the metrics from your test/training data.
c. When to use them? This is dependent on your problem. If you have a simple linear problem, with limited features, you will not need regularisation or early stopping. If you have thousands of features and experience overfitting then apply regularisation as one solution. If you do not want to wait for the optimisation to run to the end when you are playing with parameters as you only care about a certain level of accuracy, you could apply early stopping.
Finally, how do I tune max_iter correctly? This depends on your problem at hand. If you find your classification metric shows your model is performing poorly, it could be that your solver has not taken enough steps to reach a minimum. I'd suggest you do this by hand and look at the cost vs. max_iter to see if it is reaching a minimum properly rather than automate it.
I'm trying to automatically determine when a Keras autoencoder converges. For example, look at this link under "Let's build the simplest autoencoder possible." The number of epochs is hardcoded at 50 (when the loss value converges). However, how would you code this using Keras if you didn't know the number was 50? Would you just keep calling fit()?
This question is actually ridiculously wide and hard. There are many techniques on how to set the number of epochs:
Early stopping- in this case you set the number of epochs to a really high number and you turn off the training when the improvement over next epochs is not satisfying. In Keras you have a special object called EarlyStopping which does the job for you.
Model Checkpoint - here you once again set up a really high number of epochs and you simply save only the best model w.r.t. to a metric chosen. Once again you have a special callback for this scenario.
Of course, there are other scenarios like e.g. using Reinforcement learning to find the stopping time or more complexed scenarios when you choose this in a Bayesian hyperparameter set up but those are much harder methods which are often not introducing any improvement.
One sure thing is that restarting a fit method might end up in unexpected behaviour as many inner states of a model are reset which could cause instability. For this scenario I strongly advise you to use train_on_batch which is not resetting model states and makes a lot of fancy training scenarios possible.
The title says it all: Should a neural network be able to have a perfect train accuracy? Mine saturates at ~0.9 accuracy and I am wondering if that indicates a problem with my network or the training data.
Training instances: ~4500 sequences with an average length of 10 elements.
Network: Bi-directional vanilla RNN with a softmax layer on top.
Perfect accuracy on training data is usually a sign of a phenomenon called overfitting (https://en.wikipedia.org/wiki/Overfitting) and the model may generalize poorly to unseen data. So, no, probably this alone is not an indication that there is something wrong (you could still be overfitting but it is not possible to tell from the information in your question).
You should check the accuracy of the NN on the validation set (data your network has not seen during training) and judge its generalizability. usually it's an iterative process where you train many networks with different configurations in parallel and see which one performs best on the validation set. Also see cross validation (https://en.wikipedia.org/wiki/Cross-validation_(statistics))
If you have low measurement noise, a model may still not get zero training error. This could be for many reasons including that the model is not flexible enough to capture the true underlying function (which can be a complicated, high-dimensional, non-linear function). You can try increasing the number of hidden layers and nodes but you have to be careful about the same things like overfitting and only judge based on evaluation through cross validation.
You can definitely get a 100% accuracy on training datasets by increasing model complexity but I would be wary of that.
You cannot expect your model to be better on your test set than on your training set. This means if your training accuracy is lower than the desired accuracy, you have to change something. Most likely you have to increase the number of parameters of your model.
The reason why you might be ok with not having a perfect training accuracy is (1) the problem of overfitting (2) training time. The more complex your model is, the more likely is overfitting.
You might want to have a look at Structural Risc Minimization:
(source: svms.org)