Validation Set in Backpropagation in a Neural Network - machine-learning

I have a neural network model, and so far I am running the training set forward, calculating the errors, and adjusting the weights.
As I understand it, after I do this for each training set example, I need to run an example from the validation set forward and calculate the errors. When the validation set error stops decreasing but the training set error is still decreasing, it is time to stop, because over-fitting is starting to occur. After we stop, we use the test set to calculate how much error is in our network.
Please correct me if there are any mistakes so far.
My question is: what error are we comparing? Are we just comparing the error of the output layer? Or are we comparing the errors from every node? If so, how exactly do we define the overall error of the network? Do we just sum up all the errors?

My question is what error are we comparing?
We are comparing the error only on the output layer. So, if you plot an error-vs-epoch graph, you will have two curves there. The line for the training error goes down as you run more epochs. But the line for the validation error goes down only up to a certain point and then starts to go up. This indicates overfitting, and you want to find the last point where the validation error was at its lowest.
Note that you are talking about individual samples while I am talking about epochs. For batch methods these errors are usually plotted after one full pass over the data set (training or validation). So each point on the plot is the mean error or mean squared error from that epoch.
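As a rough, self-contained illustration of this bookkeeping (a bare linear model trained online stands in for the network; all data and names here are made up for the sketch):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: a noisy linear target, split into training and validation sets.
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
    X_train, y_train = X[:60], y[:60]
    X_val, y_val = X[60:], y[60:]

    w = np.zeros(3)                    # weights of the stand-in "network"
    lr = 0.01
    train_err, val_err = [], []

    for epoch in range(200):
        for x_i, y_i in zip(X_train, y_train):   # updates use training data only
            w += lr * (y_i - x_i @ w) * x_i
        train_err.append(np.mean((X_train @ w - y_train) ** 2))  # one point per epoch
        val_err.append(np.mean((X_val @ w - y_val) ** 2))

    best_epoch = int(np.argmin(val_err))  # the epoch where validation error was lowest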
Also, if we have more than 1 output, are we just taking the sum of the errors in the output layer, or should it be some kind of weighted sum?
It's an interesting question for the multiple-output case. Basically we are trying to find the early stopping point at which to stop training the weights. In the very last layer of a multiple-output network, the weights are trained using different error derivatives and so can have different optimal early stopping points. You may want to plot them separately if you think that is the case. Otherwise, a simple sum of the errors is sufficient. A weighted sum would mean that you care to optimize one output over another, even when that causes the other one(s) to over- or under-train.
If you are thinking about implementing separate early stopping points, you can use the sum of the MSEs to get the stopping point for all internal weights, which depend on all of the error derivatives. For the weights in the last layer, use their corresponding MSEs to get their separate stopping points.
Let's say I have 60% training, 20% validation, and 20% test set. For each epoch, I run through the 60 training set samples while adjusting the weights on each sample and also calculating the error on each validation sample.
Another way to do the weight update is to calculate the updates for each sample and then apply the average of all the updates at the end of the epoch. If your training data has noise, outliers, or misclassified samples, this is a good choice. For example, a couple of outliers will not be able to massively distort the weights, since their 'bad' updates get averaged out with the other 'good' updates.
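A minimal sketch of that averaged update, assuming a plain linear model with squared-error loss (the function and variable names are made up for illustration):

    import numpy as np

    def batch_epoch(w, X, y, lr=0.1):
        """One epoch of averaged updates: accumulate the per-sample gradients,
        then apply their mean once at the end of the epoch."""
        grad_sum = np.zeros_like(w)
        for x_i, y_i in zip(X, y):
            grad_sum += -(y_i - x_i @ w) * x_i   # gradient of the squared error for one sample
        return w - lr * grad_sum / len(X)        # a few outliers get averaged out here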
Since there are only 1/3 as many validation samples as training samples, do I run through the validation 3 times for each epoch?
Why would we iterate over the validation set? Do we calculate the error on validation samples to get weight updates? No. We do all our updating using only the training set. Validation is only there to see how our trained model generalizes outside the training data. Think of it as a test before the test you run with the test set. Now, does it make sense to run over the validation set 3 times in each epoch? No, it doesn't.
I use the last calculated weights for online learning, correct?
Yes. Error calculation and weight updates happen as new samples come in.
When we use the test set to calculate the error of our final model, are we using MSE for this, or does it even really matter too much which we use?
If your model produces real-valued output, use MSE. If your system is trying to solve a classification problem, use the classification error, e.g. a 10% classification error means that 10% of the test set was misclassified by your model during testing.
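As a small sketch of the two metrics (names are illustrative):

    import numpy as np

    def mse(y_true, y_pred):
        """For real-valued (regression) output."""
        return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

    def classification_error(y_true, y_pred):
        """Fraction of misclassified samples, e.g. 0.10 means 10%."""
        return np.mean(np.asarray(y_true) != np.asarray(y_pred))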

Related

How do neural networks learn functions instead of memorizing them?

For a class project, I designed a neural network to approximate sin(x), but ended up with a NN that just memorized my function over the data points I gave it. My NN took in x-values with a batch size of 200. Each x-value was multiplied by 200 different weights, mapping to 200 different neurons in my first layer. My first hidden layer contained 200 neurons, each one a linear combination of the x-values in the batch. My second hidden layer also contained 200 neurons, and my loss function was computed between the 200 neurons in my second layer and the 200 values of sin(x) that the input mapped to.
The problem is, my NN perfectly "approximated" sin(x) with 0 loss, but I know it wouldn't generalize to other data points.
What did I do wrong in designing this neural network, and how can I avoid memorization and instead design my NN's to "learn" about the patterns in my data?
It is the same with any machine learning algorithm. You have a dataset from which you try to learn "the" function f(x) that actually generated the data. With real-life datasets it is impossible to recover the original function from the data, and therefore we approximate it using some function g(x).
The main goal of any machine learning algorithm is to predict unseen data as best as possible using the function g(x).
Given a dataset D you can always train a model which perfectly classifies all the datapoints (you could use a hashmap to get 0 error on the training set), but such a model is overfitting, i.e. memorizing.
To avoid this, you yourself have to make sure that the model does not memorise but instead learns the function. There are a few things which can be done; I will try to write them down in an informal way.
Train, Validation, Test
If you have a large enough dataset, use Train, Validation, and Test splits. Split the dataset into three parts, typically 60%, 20% and 20% for Training, Validation and Test, respectively. (These numbers can vary based on need; in the case of imbalanced data, check how to get stratified partitions which preserve the class ratios in every split.) Next, forget about the Test partition: keep it somewhere safe and don't touch it. Your model will be trained using the Training partition. Once you have trained the model, evaluate its performance using the Validation set. Then select another hyper-parameter configuration for your model (e.g. number of hidden layers, learning algorithm, other parameters etc.), train the model again, and evaluate it on the Validation set. Keep doing this for several such models. Then select the model which got you the best validation score.
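A minimal sketch of such a split with scikit-learn (the random data here is only a stand-in for your features X and labels y):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(100, 5)               # stand-in features
    y = np.random.randint(0, 2, size=100)    # stand-in labels

    # First carve off 40%, then split it half-and-half into Validation and Test.
    # stratify keeps the class ratios the same in every partition.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)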
The role of the Validation set here is to check what the model has learned. If the model has overfit, the validation scores will be very bad, and therefore in the above process you will discard those overfit models. But keep in mind that although you did not use the Validation set to train the model directly, the Validation set was used indirectly to select the model.
Once you have selected a final model based on the Validation set, take out your Test set, as if you had just received a new dataset from real life which no one has ever seen. The model's predictions on this Test set indicate how well your model has "learned", as it is now trying to predict datapoints which it has never seen (directly or indirectly).
It is key not to go back and tune your model based on the Test score, because once you do, the Test set starts contributing to your model selection.
Cross-validation and bootstrap sampling
On the other hand, if your dataset is small, you can use bootstrap sampling or k-fold cross-validation. These ideas are similar. For example, for k-fold cross-validation with k=5, you split the dataset into 5 parts (again, be careful about stratified sampling). Let's name the parts a, b, c, d, e. Use the partitions [a,b,c,d] to train and get the prediction score on [e] only. Next, use the partitions [a,b,c,e] to train and get the prediction score on [d] only, and continue 5 times, where each time you hold one partition out and train the model with the other 4. After this, take the average of these scores; it is indicative of how your model might perform if it sees new data. It is also good practice to repeat this multiple times and average the results. For example, for smaller datasets, perform 10 repetitions of 10-fold cross-validation, which will give a fairly stable score (depending on the dataset) indicative of the prediction performance.
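A minimal 5-fold sketch with scikit-learn, with ridge regression and random data standing in for whatever model and dataset you actually use:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold, cross_val_score

    X = np.random.rand(50, 4)        # stand-in dataset
    y = np.random.rand(50)

    # Each of the 5 partitions is held out once while the other 4 train the model.
    scores = cross_val_score(Ridge(), X, y,
                             cv=KFold(n_splits=5, shuffle=True, random_state=0))
    print(scores.mean())             # average score, indicative of performance on new data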
Bootstrap sampling is similar, but you sample the same number of datapoints with replacement from the dataset and use this sample to train. The sample will have some datapoints repeated (as it was drawn with replacement). Then use the datapoints missing from the training sample to evaluate the model. Perform this multiple times and average the performance.
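And a corresponding bootstrap sketch (again with a stand-in model and data), evaluating on the datapoints the bootstrap sample missed:

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = rng.random((50, 4))          # stand-in dataset
    y = rng.random(50)

    scores = []
    for _ in range(100):                                  # repeat and average
        boot = rng.integers(0, len(X), size=len(X))       # sample with replacement
        oob = np.setdiff1d(np.arange(len(X)), boot)       # datapoints missing from the sample
        model = Ridge().fit(X[boot], y[boot])
        scores.append(model.score(X[oob], y[oob]))
    print(np.mean(scores))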
Others
Other ways are to incorporate regularisation techniques into the classifier's cost function itself. For example, in Support Vector Machines the cost function enforces conditions such that the decision boundary maintains a "margin", i.e. a gap, between the two class regions. In neural networks one can do similar things (although it is not the same as in SVMs).
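For a neural network, one common form of this idea is to add a penalty on the weights to the cost function; a tiny sketch of an L2 (weight-decay) penalty added to a squared-error loss (names are illustrative):

    import numpy as np

    def regularized_loss(w, X, y, lam=0.01):
        """Squared-error loss plus an L2 penalty that discourages large weights."""
        data_term = np.mean((X @ w - y) ** 2)
        return data_term + lam * np.sum(w ** 2)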
In neural networks you can use early stopping to end the training. This trains on the Train dataset but, at each epoch, evaluates the performance on the Validation dataset. If the model starts to overfit from a specific epoch on, the error on the Training dataset will keep decreasing while the error on the Validation dataset starts increasing, indicating that your model is overfitting. Based on this, one can stop training.
A large dataset from the real world tends not to overfit too much (citation needed). Also, if you have too many parameters in your model (too many hidden units and layers) and the model is unnecessarily complex, it will tend to overfit. A model with fewer parameters is less likely to overfit (though it can underfit if it has too few).
In the case of your sin function task, the neural net is bound to overfit, as the data is ... the sin function itself. Such tests can really help you debug and experiment with your code.
Another important note: if you try to do a Train/Validation/Test split or k-fold cross-validation on data generated by the sin function, then splitting it in the "usual" way will not work, because in this case we are dealing with a time series, and for those cases one needs splitting techniques designed for ordered data.
First of all, I think it's a great project to approximate sin(x). It would be great if you could share a snippet or some additional details so that we could pinpoint the exact problem.
However, I think the problem is that you are overfitting the data, and hence you are not able to generalize well to other data points.
A few tricks that might work:
Get more training points
Go for regularization
Add a test set so that you know whether you are overfitting or not.
Keep in mind that 0 loss or 100% accuracy on the training set is usually not a good sign.

What do inconsistent test results mean?

I'm doing some research on CNNs for text classification using TensorFlow. When I run my model I get very high training accuracy (around 100%). However, on the test split I get inconsistent accuracy results (sometimes 11% and sometimes 90%).
Moreover, I also noticed that the training loss keeps decreasing until it reaches small numbers like 0.000499564048368, while the test loss does not, and sometimes reaches high values like 70. What does this mean? Any ideas?
If you get very high training accuracy and bad testing accuracy, you are almost certainly overfitting. To get a better picture of your model's real accuracy, use cross-validation.
Cross-validation splits the dataset into a training and a validation set, and does this multiple times, slightly changing the training and validation data each time. This is beneficial because it can prevent scenarios where you train your model on one label and it cannot accurately identify another. For example, picture a training set like this:
    Feature1  Feature2  Label
    x         y         0
    a         y         0
    b         c         1
If we train the model only on the first two datapoints, it will not be able to identify the third datapoint, because it was not built to generalize.

Overfitting my model over my training data of a single sample

I am trying to overfit my model on training data that consists of only a single sample. The training accuracy comes out to be 1.00. But when I predict the output for my test data, which consists of the same single training sample, the results are not accurate. The model has been trained for 100 epochs and the loss is ~1e-4.
What could be the possible sources of error?
As mentioned in the comments of your post, it isn't possible to give specific advice without you first providing more details.
Generally speaking, your approach of overfitting a tiny batch (in your case one image) in essence provides three sanity checks, i.e. that:
backprop is functioning
the weight updates are doing their job
the learning rate is in the correct order of magnitude
As Andrej Karpathy points out in Lecture 5 of the CS231n course at Stanford: "if you can't overfit on a tiny batch size, things are definitely broken".
This means, given your description, that your implementation is incorrect. I would start by checking each of the three points listed above. For example, alter your test somehow by picking several different images, or a batch size of 5 images instead of one. You could also revise your predict function, as that is where there is definitely some discrepancy, given that you are getting almost zero error during training (and so during validation?).
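As a hedged, self-contained sketch of this sanity check (PyTorch here, with a random stand-in sample; the asker's actual framework and data may differ): a model that truly overfits one sample must reproduce its label at predict time.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(1, 10)        # one stand-in input sample
    y = torch.tensor([1])         # its label

    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for _ in range(100):          # overfit the single sample
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

    model.eval()                  # predict the very same sample, same preprocessing
    with torch.no_grad():
        assert model(x).argmax(dim=1).item() == y.item()  # a mismatch means predict is broken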

How to use a dataset in Neural Network training

I am trying to implement a neural network and am currently working on the backpropagation part. I don't need help with the coding; I have written the feed-forward part so far with great success. But my question is more related to the dataset I am using.
This is my data set:
http://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.data
I have to use 5-fold cross-validation and backpropagate, stopping training when the error reaches a threshold of .1, .01, or .001, meaning I have to do 3 trials. I have to ignore the first 2 fields of each data point. I have already normalized the set. The architecture of the network is 7 neurons in the input layer, 3 neurons in 1 hidden layer, and 1 output. A very basic implementation.
So my questions are:
I have to break the set up into 5 smaller subsets, right? Keep one for testing and the rest for validation and training, right?
How long do I train? Say I use 1 fold (about 42 data points per fold) and I reach my desired error threshold. Do I stop and use the test data? Or do I load up the next fold and keep training? What if I run out of data before I reach my desired error threshold?
Also, should I follow something like this?
a. use input values
b. feed forward, then compare the error
c. backpropagate and adjust weights
repeat steps a-c until I reach the error threshold and have finished all the data in the current fold? (See the sketch below.)
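A rough, self-contained sketch of that a-c loop for a 7-3-1 network (stand-in data replaces machine.data here; a real run would use one training fold and re-check the chosen threshold after each epoch):

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Stand-in normalized data: 42 samples, 7 inputs, 1 target each.
    X = rng.random((42, 7))
    y = sigmoid(X @ rng.normal(size=(7, 1)))      # a learnable stand-in target

    W1 = rng.normal(scale=0.5, size=(7, 3))       # 7 inputs -> 3 hidden
    W2 = rng.normal(scale=0.5, size=(3, 1))       # 3 hidden -> 1 output
    lr, threshold = 0.5, 0.01

    for epoch in range(10_000):
        for x_i, y_i in zip(X, y):
            h = sigmoid(x_i @ W1)                 # (a)-(b) feed the input forward
            out = sigmoid(h @ W2)
            d_out = (out - y_i) * out * (1 - out) # (b) error at the output
            d_h = (d_out @ W2.T) * h * (1 - h)    # (c) backpropagate the error
            W2 -= lr * np.outer(h, d_out)         # (c) adjust the weights
            W1 -= lr * np.outer(x_i, d_h)
        mse = np.mean((sigmoid(sigmoid(X @ W1) @ W2) - y) ** 2)
        if mse < threshold:                       # stop at the error threshold
            break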
thank you for your time and response. I am really trying to understand how to use the dataset. I will update you guys once I have written the code.

Incremental (on-line) Backpropagation stopping criteria

In an on-line implementation of a backpropagation ANN, how would you determine the stopping criterion?
The way that I have been doing it (which I am sure is incorrect) is to average the error of each output node and then average this error over each epoch.
Is this an incorrect method? Is there a standard way of stopping an on-line implementation?
You should always consider the error (e.g. root mean squared error) on a validation set which is disjoint from your training set. If you train too long, your neural network will begin to overfit. This means that the error on your training set will become minimal or even 0, but the error on general data will become worse.
To end up with the model parameters which yielded the best generalization performance, you should copy and save your model parameters whenever the error on your validation set reaches a new minimum. If performance is a problem, you can do this check only every N steps.
In an on-line learning setup, you train with single training samples or mini-batches of a small number of training samples. You can consider the successive training on all samples/mini-batches covering your total data as one training epoch.
There are several possibilities for defining a so-called early stopping criterion. E.g. you could consider the best-so-far RMS error on your validation set after each full epoch, and stop as soon as there has not been a new optimum for M epochs. Depending on the complexity of your problem, you must choose M high enough. You can also start with a rather small M and, whenever you get a new optimum, set M to the number of epochs you needed to reach it. It depends on whether it is more important to converge quickly or to be as thorough as possible.
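A hedged sketch of that criterion (the train_step and val_error callbacks, and the way parameters are copied, are assumptions for illustration, not a fixed API):

    import copy

    def train_with_early_stopping(params, train_step, val_error, M=10, max_epochs=1000):
        """Stop once the validation error has had no new optimum for M epochs;
        return the parameters copied and saved at the best epoch."""
        best_err, best_params, since_best = float("inf"), copy.deepcopy(params), 0
        for epoch in range(max_epochs):
            params = train_step(params)      # one full epoch over the training data
            err = val_error(params)          # e.g. RMS error on the validation set
            if err < best_err:               # new optimum: save a copy of the parameters
                best_err, best_params, since_best = err, copy.deepcopy(params), 0
            else:
                since_best += 1
                if since_best >= M:          # no new optimum for M epochs: stop
                    break
        return best_params, best_err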
You will always have situations where your validation and/or training error temporarily gets bigger, because the learning algorithm is hill-climbing. This means it traverses regions of the error surface which render bad performance, but which must be passed to reach a new, better optimum. If you simply stop as soon as your validation or training error gets worse between two subsequent steps, you will end up in suboptimal solutions prematurely.
