How are model weights modified in ML? - machine-learning

I have been reading this interesting link Linear Regression - SGD
and I have a question about the statement below.
" The way this optimization algorithm works is that each training instance is shown to the model one at a time. The model makes a prediction for a training instance, the error is calculated and the model is updated in order to reduce the error for the next prediction. This process is repeated for a fixed number of iterations."
Question:
Is my pseudo code below correct?
for each training input:
1) Input to Model
2) Find the prediction
3) Find the error
4) Update Model.
What I don't understand is "This process is repeated for a fixed number of iterations". Does it mean steps 3) and 4) are repeated until the error is minimized?
Correct me if I am wrong.

"This process is repeated for a fixed number of iterations." means that you choose the number of epochs or the number of batches send to you network to train it.
When you train your network you have a training dataset. You give your network (with placeholders) iages and labels associated with these inputs (generally you give samples (input + label) by batches).
It makes a prediction for each input and computes the error (the loss function you uses). And then it tunes weights (and biases) to minimize the loss function (it does what is called a gradient descent).
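As a rough sketch, here is what that per-sample loop looks like for a simple linear regression trained with SGD (the toy data, learning rate and epoch count below are made up for illustration):

    # Online SGD for y = w*x + b with squared error; every value here is illustrative.
    data = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]  # (x, y) samples
    w, b = 0.0, 0.0
    learning_rate = 0.01
    n_epochs = 20                          # the "fixed number of iterations": passes over the data

    for epoch in range(n_epochs):
        for x, y in data:                  # each training instance is shown one at a time
            prediction = w * x + b         # steps 1) and 2): feed the input, get the prediction
            error = prediction - y         # step 3): compute the error for this sample
            w -= learning_rate * error * x # step 4): update the model to reduce the error
            b -= learning_rate * error

So steps 1) to 4) run once per training instance, and the whole inner loop is repeated for the fixed number of epochs you chose.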
You should take a look at Gradient Descent here: http://sebastianruder.com/optimizing-gradient-descent/
You are the one deciding how long you want your network to train, by fixing the number of times your whole training set is sent to your network (what is called an epoch), or the number of batches.
Hope it helps

Related

What's a "good" value for the loss function of a DL model like yolo?

I collected ~1500 labelled data and trained with yolo v3, got a training loss of ~10, validation loss ~ 16. Obviously we can use real test data to evaluate the model performance, but I am wondering if there is a way to tell if this training loss = 10 is a "good" one? Or does it indicate I need to use more training data to see if I can push it down to 5 or even less?
Ultimately my question is, for a well-known model with a pre-defined loss function, is there a "good" standard value for the training loss?
thanks.
You need to train your weights until the average loss becomes 0.0XXXXX. That is the minimal requirement to detect objects with a matching anchor IoU.
Update: 28th Nov, 2018
While training an object detection model, the loss may vary somewhat with a large data set, but what you really need to calculate is the mean Average Precision (mAP), which gives the exact accuracy criterion for the trained model.
./darknet detector map .data .cfg .weights
If your mAP is near 1.0, i.e. 100%, the model is performing well.
Follow this link to learn more about mAP:
https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173
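If it helps, here is a rough, simplified sketch of how AP is computed for one class once detections have already been matched to ground-truth boxes by IoU (the numbers are made up; real mAP tools such as darknet's do the box matching and interpolation for you):

    import numpy as np

    def average_precision(scores, is_true_positive, num_ground_truth):
        # Rank detections by confidence, then integrate precision over recall.
        order = np.argsort(scores)[::-1]
        hits = np.asarray(is_true_positive, dtype=float)[order]
        tp = np.cumsum(hits)                 # cumulative true positives
        fp = np.cumsum(1.0 - hits)           # cumulative false positives
        recall = tp / num_ground_truth
        precision = tp / (tp + fp)
        # step-wise area under the precision-recall curve (no interpolation)
        return float(recall[0] * precision[0] + np.sum((recall[1:] - recall[:-1]) * precision[1:]))

    # Five detections for one class, four ground-truth boxes in the images
    ap = average_precision([0.9, 0.8, 0.7, 0.6, 0.5], [1, 1, 0, 1, 0], num_ground_truth=4)
    # mAP is this value averaged over all classes; 1.0 corresponds to 100%.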
Your validation loss is a good indicator of whether the training loss can come down further. I don't have any one-shot solution; you will have to tweak hyper-parameters, check on the validation set, and iterate. You can also get a good idea by looking at the loss curve: was it still decreasing when you stopped training, or was it flat? That tells you how the training has progressed and lets you make changes accordingly. Good luck.
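As an illustration of reading the loss curve: assuming you have logged the per-epoch losses yourself, a quick plot like this (with made-up numbers) shows whether training had flattened out or was still improving when you stopped:

    import matplotlib.pyplot as plt

    # Hypothetical per-epoch losses; replace with the values you logged during training.
    train_loss = [45, 30, 22, 17, 14, 12, 11, 10.5, 10.2, 10.0]
    val_loss = [50, 36, 28, 23, 21, 19, 18, 17.0, 16.4, 16.0]

    plt.plot(train_loss, label="training loss")
    plt.plot(val_loss, label="validation loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()
    # Both curves still sloping down -> longer training (or more data) may help;
    # training loss flat while validation loss rises -> overfitting, more epochs won't help.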

Overfitting and Data splitting

Let's say that I have a data file like:
Index,product_buying_date,col1,col2
0,2013-01-16,34,Jack
1,2013-01-12,43,Molly
2,2013-01-21,21,Adam
3,2014-01-09,54,Peirce
4,2014-01-17,38,Goldberg
5,2015-01-05,72,Chandler
..
..
2000000,2015-01-27,32,Mike
with some more data and I have a target variable y. Assume something as per your convenience.
Now I am aware that we divide the data into 2 parts, i.e. Train and Test. Then we divide Train 70:30, build the model with the 70% and validate it with the 30%. We tune the parameters so that the model does not overfit, and then predict with the Test data. For example: I divide 2000000 into two equal parts. 1000000 is Train; from it, the validation part is 30% of 1000000, which is 300000, and the 70% on which I build the model is 700000.
QUESTION: Does the above logic depend on how the original data is split?
Generally we shuffle the data and then break it into train, validate and test (train + validate = Train). (Please don't get confused here.)
But what if the split is alternating? Say, when I divide it into Train and Test first, I give the even rows to Test and the odd rows to Train. (Here the data is initially sorted by the 'product_buying_date' column, so when I split it into odd and even rows it gets split uniformly.)
And when I build the model with Train I overfit it so that I get maximum AUC with Test data.
QUESTION: Isn't overfitting helping in this case?
QUESTION: Does the above logic depend on how the original data is split?
If the dataset is large (hundreds of thousands of samples), you can randomly split the data and you should not have any problem, but if the dataset is small you can adopt different approaches such as cross-validation. Cross-validation means that you make n training-validation splits out of your Training set.
Suppose you have 2000 data points; you split them like:
1000 - training dataset
1000 - testing dataset
5-fold cross-validation would then mean that you make five 800/200 training/validation splits out of the training dataset.
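A small sketch of that scheme with scikit-learn (the arrays are placeholders for the 2000 data points above):

    import numpy as np
    from sklearn.model_selection import KFold, train_test_split

    X = np.arange(2000).reshape(-1, 1)             # placeholder features
    y = np.random.randint(0, 2, size=2000)         # placeholder target

    # 1000 training / 1000 testing, shuffled before splitting
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, shuffle=True, random_state=0)

    # Five 800/200 training/validation splits out of the 1000 training points
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
        X_tr, y_tr = X_train[train_idx], y_train[train_idx]
        X_val, y_val = X_train[val_idx], y_train[val_idx]
        # fit the model on (X_tr, y_tr) and tune/score it on (X_val, y_val)

    # Only the final, chosen model is scored once on (X_test, y_test)

The shuffling before each split is what "randomly split the data" means above.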
QUESTION: Isn't overfitting helping in this case?
The number one rule of machine learning is that you don't touch the test data set. It is a holy data set that should not be touched.
If you overfit to the test data to get the maximum AUC score, then there is no point in having a validation dataset. The foremost aim of any ML algorithm is to reduce the generalization error, i.e. the algorithm should perform well on unseen data. If you tune your algorithm with the testing data, you won't meet this criterion. In cross-validation you also do not touch your testing set: you select your algorithm, tune its parameters with the validation dataset, and only after you are done with that do you apply your algorithm to the test dataset, which gives your final score.

Meaning of an Epoch in Neural Networks Training

While I was reading about how to build an ANN in pybrain, they say:
Train the network for some epochs. Usually you would set something
like 5 here,
trainer.trainEpochs( 1 )
I looked up what that means, and I concluded that we use an epoch of data to update the weights; if I choose to train the data with 5 epochs as pybrain advises, the dataset will be divided into 5 subsets, and the weights will be updated 5 times at most.
I'm familiar with online training, where the weights are updated after each data sample or feature vector. My question is: how can I be sure that 5 epochs will be enough to build a model and set the weights properly? What is the advantage of this way over online training? Also, the term "epoch" is used in online training; does it mean one feature vector?
One epoch consists of one full training cycle on the training set. Once every sample in the set is seen, you start again - marking the beginning of the 2nd epoch.
This has nothing to do with batch or online training per se. Batch means that you update once at the end of the epoch (after every sample is seen, i.e. #epoch updates) and online that you update after each sample (#samples * #epoch updates).
You can't be sure if 5 epochs or 500 is enough for convergence since it will vary from data to data. You can stop training when the error converges or gets lower than a certain threshold. This also goes into the territory of preventing overfitting. You can read up on early stopping and cross-validation regarding that.
Sorry for reactivating this thread.
I'm new to neural nets and I'm investigating the impact of 'mini-batch' training.
So far, as I understand it, an epoch (as runDOSrun is saying) is one full pass through the TrainingSet (not the DataSet, because DataSet = TrainingSet + ValidationSet). In mini-batch training, you can subdivide the TrainingSet into small sets and update the weights inside an epoch. 'Hopefully' this makes the network 'converge' faster.
Some definitions of neural networks are outdated and, I guess, must be redefined.
The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset. One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters.

Validation Set in Backpropagation in a Neural Network

I have a neural network model, and so far I am running the training set forward, calculating the errors, and adjusting the weights.
As I understand it, after I do this for each training set example I need to run an example from the validation set forward and calculate the errors. When the validation set error stops decreasing, but the training set error is still decreasing, it is time to stop, because over-fitting is starting to occur. After we stop, we use the testing set to calculate how much error is in our network.
Please correct me if there are any mistakes so far.
My question is what error are we comparing? Are we just comparing the error of the output layer? Or are we comparing the errors from every node? If so, how exactly do we define the overall error of the network, just sum up all the errors?
My question is what error are we comparing?
We are comparing the error only on the output layer. So, if you plot an error vs. epoch graph, you will have two curves there. The line for the training error goes down as you have more epochs, but the line for the validation error goes down only up to a certain point before starting to go up. This indicates overfitting, and you want to find the last point where the validation error was lowest.
Note that you are talking about individual samples while I am talking about epochs. For batch methods these errors are usually plotted after one iteration over the data set (training or validation), so each point on the plot is the mean error or mean squared error from that epoch.
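As a concrete, if simplified, sketch of that bookkeeping, here is a toy linear model trained online on made-up data; it records the validation MSE after every epoch and remembers the epoch where it was lowest:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                        # toy inputs
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)
    X_train, y_train = X[:120], y[:120]
    X_val, y_val = X[120:160], y[120:160]
    X_test, y_test = X[160:], y[160:]

    w = np.zeros(3)
    lr, max_epochs = 0.01, 100
    best_val_mse, best_w, best_epoch = np.inf, w.copy(), 0

    for epoch in range(max_epochs):
        for xi, yi in zip(X_train, y_train):             # weight updates use training data only
            err = xi @ w - yi                            # error at the output
            w -= lr * err * xi
        val_mse = np.mean((X_val @ w - y_val) ** 2)      # one point on the validation curve
        if val_mse < best_val_mse:                       # lowest validation error so far
            best_val_mse, best_w, best_epoch = val_mse, w.copy(), epoch

    test_mse = np.mean((X_test @ best_w - y_test) ** 2)  # final score, using the best weights

The same bookkeeping applies to a neural network; only the prediction and the update step change.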
Also, if we have more than 1 output, are we just taking the sum of the errors in the output layer, or should it be some kind of weighted sum?
It's interesting for the multiple-output case. Basically we are trying to find the early stopping point at which to stop training the weights. On the very last layer of a multiple-output network, the weights are trained using different error derivatives and can possibly have different optimal early stopping points. You may want to plot them separately if you think that is the case. Otherwise, a simple sum of the errors is sufficient. A weighted sum would mean that you care to optimize one output over another, even when that causes the other one(s) to over/under-train.
If you are thinking about implementing separate early stopping points, you can use the sum of MSEs to get the stopping point for all internal weights that depend on all error derivatives. For the weights on the last layer, use their corresponding MSEs to get their separate stopping points.
Let's say I have a 60% training, 20% validation, and 20% test set. For each epoch, I run through the 60% training set samples while adjusting the weights on each sample, and I also calculate the error on each validation sample.
Another way to do the weight update is to calculate the updates for each sample and then apply an average of all updates at the end of the epoch. If your training data has noise/outliers/misclassified samples, this is good. For example, a couple of outliers will not be able to massively distort the weights, since their 'bad' updates will get averaged out with other 'good' updates.
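A tiny sketch of that averaging scheme (a toy one-weight model and made-up data with one deliberate outlier):

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 10.0])     # the last sample acts as an outlier
    y = np.array([1.0, 2.0, 3.0, -5.0])
    w, lr = 0.0, 0.01

    for epoch in range(50):
        updates = [lr * (w * x - t) * x for x, t in zip(X, y)]  # per-sample updates, not applied yet
        w -= np.mean(updates)                                   # one averaged update per epoch
    # The outlier's 'bad' update is diluted by the three 'good' ones instead of
    # yanking the weight around every time that sample is seen.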
Since there are only 1/3 as many validation samples as training samples, do I run through the validation 3 times for each epoch?
Why do we iterate over the validation set? Do we calculate the error on validation to get weight updates? No. We do all our updating using only the training set. Validation is only there to see how our trained model generalizes outside of the training data. Think of it as a test before the test you run with the test set. Now, does it make sense to run over the validation set 3 times in each epoch? No, it doesn't.
I use the last calculated weights for online learning, correct?
Yes. Error calculation and weight updates happen as new samples come in.
When we use the test set to calculate the error of our final model, are we using mse for this or does it even really matter too much which we use?
If your model is producing real-valued output, then use MSE. If your system is trying to solve a classification problem, use the classification error, e.g. 10% classification error, meaning 10% of the test set was misclassified by your model during testing.
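For example (made-up predictions, just to show the two metrics):

    import numpy as np

    # Regression-style output: mean squared error over the test set
    y_true = np.array([3.0, -0.5, 2.0, 7.0])
    y_pred = np.array([2.5, 0.0, 2.0, 8.0])
    mse = np.mean((y_pred - y_true) ** 2)                        # 0.375

    # Classification output: fraction of the test set that was misclassified
    labels_true = np.array([0, 1, 1, 0, 1])
    labels_pred = np.array([0, 1, 0, 0, 1])
    classification_error = np.mean(labels_pred != labels_true)   # 0.2, i.e. 20% misclassified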

Incremental (on-line) Backpropagation stopping criteria

In an on-line implementation of a Backpropagation ANN, how would you determine the stopping criteria?
The way that I have been doing it (which I am sure is incorrect) is to average the error of each output node and then average this error over each epoch.
Is this an incorrect method? Is there a standard way of stopping an on-line implementation?
You should always consider the error (e.g. the Root Mean Squared Error) on a validation set which is disjoint from your training set. If you train too long, your neural network will begin to overfit. This means that the error on your training set will become minimal or even 0, but the error on general data will become worse.
To end up with the model parameters which yielded the best generalization performance, you should copy and save your model parameters whenever the error on your validation set is a new minimum. If performance is a problem, you can do this check only every N steps.
In an on-line learning setup, you will train with single training samples or mini-batches of a small number of training samples. You can consider the successive training on all samples/mini-batches that cover your total data as one training epoch.
There are several possibilities for defining a so-called early stopping criterion. For example, you could consider the best-so-far RMS error on your validation set after each full epoch and stop as soon as there has not been a new optimum for M epochs. Depending on the complexity of your problem you must choose M high enough. You can also start with a rather small M and, whenever you get a new optimum, set M to the number of epochs you needed to reach it. It depends on whether it is more important to converge quickly or to be as thorough as possible.
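Sketched out, that "no new optimum for M epochs" rule looks like this (the validation RMSE values are made up; in practice you compute one after every epoch):

    # Made-up validation RMSE after each epoch; in practice you compute these values.
    val_rmse = [0.90, 0.71, 0.60, 0.55, 0.56, 0.54, 0.57, 0.58, 0.59, 0.60]

    M = 3                                   # patience: stop after M epochs with no new best
    best_rmse, best_epoch = float("inf"), -1
    for epoch, rmse in enumerate(val_rmse):
        if rmse < best_rmse:                # new optimum: this is where you copy & save the weights
            best_rmse, best_epoch = rmse, epoch
        elif epoch - best_epoch >= M:       # M epochs without improvement
            print(f"stop at epoch {epoch}; best was epoch {best_epoch} (RMSE {best_rmse})")
            break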
You will always have situations where your validation and/or training error temporarily gets bigger, because the learning algorithm is hill-climbing. This means it traverses regions of the error surface which give bad performance, but which must be passed to reach a new, better optimum. If you simply stop as soon as your validation or training error gets worse between two subsequent steps, you will end up in suboptimal solutions prematurely.
