As far as I understand, the loss_weights parameter of the compile function is used to define how much each of the model's output losses contributes to the final loss value.
My question is: is there a way to learn these loss_weights at the same time as the other parameters of the model? For example, say I have a model with 4 losses, with weights initially set to 0.25 for each loss, and these weights are automatically updated (via backpropagation) as if they were parameters of the model.
Is there a possible solution to do this in Keras?
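For context, a minimal sketch of the static behaviour being asked about; the two-output architecture, layer shapes, and output names are hypothetical, but the compile call with fixed loss_weights is the standard Keras usage referred to above:

    from tensorflow import keras

    # A hypothetical two-output model, just to show the compile call.
    inputs = keras.Input(shape=(16,))
    x = keras.layers.Dense(32, activation="relu")(inputs)
    out_a = keras.layers.Dense(1, name="out_a")(x)
    out_b = keras.layers.Dense(10, activation="softmax", name="out_b")(x)
    model = keras.Model(inputs, [out_a, out_b])

    # The loss_weights are plain constants: total_loss = 0.25 * mse + 0.25 * crossentropy.
    # They are fixed at compile time and are not updated by backpropagation.
    model.compile(optimizer="adam",
                  loss={"out_a": "mse", "out_b": "sparse_categorical_crossentropy"},
                  loss_weights={"out_a": 0.25, "out_b": 0.25})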
I was doing some tests using a VGG3 CNN model, and I zeroed out all the weights and their gradients after every backprop call, so I basically allowed only the biases of the model to be trained/updated. What I got were interesting results that I can't, per se, explain: the accuracy of the model kept increasing. It started off at about 27.12% and after only 10 epochs it reached around 71% accuracy.
This kept me wondering how the accuracy can increase so significantly without using any weights at all during training. I would love to get some theories and input on this.
Can you actually train a model (not necessarily the best model) without using any weights at all?
If you zero the weights and only use the biases, then the output of the model is independent of the input. In fact, the model only depends on the final layer's bias values. If you implemented it correctly, you should have something akin to a maximum a priori estimator, i.e. your model predicts the most common class in your training data. That means one of the classes in your training/testing data is overrepresented and your model is always predicting that class.
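To make that argument concrete, here is a small NumPy sketch (a toy two-layer network, not the VGG3 model from the question): with all kernel weights forced to zero, the logits reduce to the last layer's bias, so every input receives the same prediction and the achievable accuracy is just the frequency of whichever class the bias favours.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "network": one hidden layer and an output layer, kernel weights forced to zero.
    W1 = np.zeros((32, 64)); b1 = rng.normal(size=64)
    W2 = np.zeros((64, 3));  b2 = rng.normal(size=3)

    def forward(x):
        h = np.maximum(0.0, x @ W1 + b1)   # relu(0 + b1): identical for every input
        return h @ W2 + b2                 # 0 + b2: the logits do not depend on x at all

    X = rng.normal(size=(1000, 32))
    preds = forward(X).argmax(axis=1)
    print(np.unique(preds))                # a single class is predicted for all 1000 inputs

    # With imbalanced labels, accuracy equals the frequency of whichever class the
    # bias ends up favouring; training the bias pushes it towards the majority class.
    y = rng.choice(3, size=1000, p=[0.7, 0.2, 0.1])
    print((preds == y).mean())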
From most of the articles about cross-validation on the web, I learned that there is a variety of cross-validation methods. Here I want to be clear about the k-fold cross-validation technique.
In the k-fold cross-validation algorithm, we split the training set into k non-overlapping folds.
As we split the training data into k folds, we have to train the model in k iterations.
So, in each iteration, we train the model with (k-1) folds and validate it with the remaining fold.
In each split we can calculate the desired metric(s) of our model.
At the end we can report the training error by taking the average of the scores over all iterations.
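To make those steps concrete, here is a minimal sketch of that loop; the scikit-learn KFold utility, the logistic-regression model, and the synthetic data are my own illustrative choices, not part of the original question:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import KFold

    X = np.random.rand(200, 5)
    y = (X.sum(axis=1) > 2.5).astype(int)

    scores = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = LogisticRegression()            # a fresh model in every iteration
        model.fit(X[train_idx], y[train_idx])   # train on the (k-1) folds
        pred = model.predict(X[val_idx])        # validate on the remaining fold
        scores.append(accuracy_score(y[val_idx], pred))

    print(np.mean(scores))                      # the averaged score reported at the end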
But what is the final trained model?
Some points in those articles are not clear to me:
Should I re-initialize the model's parameters in each iteration?
I ask this because if I don't re-initialize the parameters, the model could retain patterns from data that I want to remain unseen in the next iteration, and so on.
Should I save the initial parameters of the split in which I obtained the best score as the best initial values of the parameters?
Should I retrain the model, initializing it with the initial parameter values from my second question, and then feed it the whole training dataset to obtain the final trained model?
Alright, so before answering your questions I will go back a bit to explain the purpose of cross-validation and model evaluation. You can read these slides or research statistical learning theory if you want to go deeper.
Train/test split
Suppose you have a model with defined hyperparameters (or none) and you train it on the training split. If you calculate the metrics over the test split, this gives you the risk of the model on new data. Then you know that this particular model will perform like that on unseen data.
So we have a learning process B that takes a dataset S (here the training dataset) as well as hyperparameters h, and gives a fitted model m: B(S, h) -> m (training with B on S and hyperparameters h gives a model m, with its parameters). We then test this model to estimate its risk R on the test dataset.
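As a rough illustration of B(S, h) -> m and the risk estimate R; the scikit-learn estimator, hyperparameters, and synthetic data are placeholders of my choosing:

    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    h = {"C": 1.0, "kernel": "rbf"}                       # hyperparameters h
    m = SVC(**h).fit(X_train, y_train)                    # B(S, h) -> m
    R = 1 - accuracy_score(y_test, m.predict(X_test))     # estimated risk R on the test split
    print(R)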
k-fold Cross validation
When doing k-fold cross-validation, you fit k models using the learning process B. Each model is fitted on a different training set, and the risk is computed on non-overlapping samples.
Then you calculate the mean risk across the folds. A common mistake is to think that this gives you the performance of the model; that's not true. It gives you the mean (or expected) performance of the learning process B (with hyperparameters h). That means that if you train a new model using B (and hyperparameters h), its expected performance will be around the calculated metrics (of course this is not always true).
For your questions
Yes, you should train the model from scratch, if possible with the same initial parameters (if initialization is not random), to avoid any difference between folds. Using a warm start with the previous fold's parameters would modify the learning process and the fitting.
No; if initialization is random, let it be random, and if it is fixed, use the same initial parameters for all folds.
For the two previous questions: if by initial parameters you meant hyperparameters, then you should keep the same ones for all folds, otherwise the calculated risk will be useless. If you want to try multiple sets of hyperparameters, you have to repeat the cross-validation multiple times and then select the best ones based on the calculated risk.
Once you have tuned your hyperparameters, you can train the model on your whole training set. This gives you a model m. Before your cross-validation you can keep a small test split aside to evaluate this final model on unseen data.
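Putting the last two points together, a sketch of the overall procedure might look like the following; the candidate hyperparameters, the random-forest learner, and the synthetic data are illustrative assumptions, not part of the answer above:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=600, random_state=0)
    # Keep a small test split aside before any cross-validation.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Repeat the cross-validation once per candidate set of hyperparameters.
    candidates = [{"max_depth": 3}, {"max_depth": 6}, {"max_depth": None}]
    risks = [1 - cross_val_score(RandomForestClassifier(**h, random_state=0),
                                 X_train, y_train, cv=5).mean()
             for h in candidates]
    best_h = candidates[int(np.argmin(risks))]

    # Train the final model m on the whole training set with the chosen hyperparameters,
    # then evaluate it once on the held-out test split.
    m = RandomForestClassifier(**best_h, random_state=0).fit(X_train, y_train)
    print(best_h, m.score(X_test, y_test))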
I used a Keras biomedical image segmentation model to segment brain neurons. model.evaluate() gave me a Dice coefficient of 0.916. However, when I used model.predict() and then looped through the predicted images, calculating the Dice coefficient myself, the Dice coefficient was 0.82. Why are these two values different?
The model.evaluate function predicts the output for the given input, then computes the metric functions specified in model.compile based on y_true and y_pred, and returns the computed metric values as its output.
The model.predict function just returns y_pred.
So if you use model.predict and then compute the metrics yourself, the computed metric value should turn out to be the same as what model.evaluate returns.
For example, one would use model.predict instead of model.evaluate when evaluating RNN/LSTM-based models, where the output needs to be fed back as input at the next time step.
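As a small illustration of that relationship (a hypothetical toy classifier and random data, chosen only to contrast the two calls):

    import numpy as np
    from tensorflow import keras

    # A tiny hypothetical classifier, only to contrast the two calls.
    inputs = keras.Input(shape=(4,))
    outputs = keras.layers.Dense(2, activation="softmax")(inputs)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    X = np.random.rand(32, 4)
    y = np.random.randint(0, 2, size=32)

    loss, acc = model.evaluate(X, y, verbose=0)   # predicts internally, then applies the compiled metrics
    y_pred = model.predict(X, verbose=0)          # just the raw class probabilities
    manual_acc = (y_pred.argmax(axis=1) == y).mean()
    print(acc, manual_acc)                        # the same here, since accuracy averages cleanly over batches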
The problem lies in the fact that every metric in Keras is evaluated in the following manner:
For each batch, a metric value is computed.
The current value of the loss (after k batches) is equal to the mean of your metric across those k batches.
The final result is obtained as the mean of the values computed for all batches.
Most of the popular metrics (like mse, categorical_crossentropy, mae), being means of per-example loss values, have the property that such an evaluation still ends up with the proper result. But in the case of the Dice coefficient, the mean of its value across all of the batches is not equal to the actual value computed on the whole dataset, and since model.evaluate() computes it this way, that is the direct cause of your problem.
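A small NumPy illustration of that point, using synthetic binary masks rather than the asker's data: the mean of per-batch Dice scores is generally not the Dice score computed over the whole dataset.

    import numpy as np

    def dice(y_true, y_pred, eps=1e-7):
        inter = (y_true * y_pred).sum()
        return (2 * inter + eps) / (y_true.sum() + y_pred.sum() + eps)

    rng = np.random.default_rng(0)
    # Two "batches" of binary masks with very different amounts of foreground.
    yt1, yp1 = rng.integers(0, 2, (8, 64)), rng.integers(0, 2, (8, 64))
    yt2, yp2 = np.zeros((8, 64)), rng.integers(0, 2, (8, 64))   # a batch with empty ground truth

    per_batch_mean = np.mean([dice(yt1, yp1), dice(yt2, yp2)])  # what a batch-wise average reports
    whole_dataset = dice(np.concatenate([yt1, yt2]),
                         np.concatenate([yp1, yp2]))            # Dice over all masks at once
    print(per_batch_mean, whole_dataset)                        # the two numbers differ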
The model.evaluate() function gives you the loss value for every batch. The model.predict() function gives you the actual predictions for all samples in a batch, for all batches. So even if you use the same data, there will be differences, because the value of a loss function is almost always different from the predicted values. These are two different things.
It is about regularization. model.predict() returns the final output of the model, i.e. the answer, while model.evaluate() returns the loss. The loss is used to train the model (via backpropagation) and it is not the answer.
This video by ML Tokyo should help you understand the difference between model.evaluate() and model.predict().
Can anyone please explain, in simple words and possibly with some examples, what a loss function is in the field of machine learning/neural networks?
This came up while I was following a TensorFlow tutorial:
https://www.tensorflow.org/get_started/get_started
It describes how far off the result your network produced is from the expected result; it indicates the magnitude of error your model made on its prediction.
You can then take that error and 'backpropagate' it through your model, adjusting its weights and making it get closer to the truth the next time around.
The loss function is how you're penalizing your output.
The following example is for a supervised setting, i.e. when you know what the correct result should be, although loss functions can be applied even in unsupervised settings.
Suppose you have a model that always predicts 1. Just the scalar value 1.
You can apply many different loss functions to this model. One of them is the L2 loss, the squared Euclidean distance between the target and the prediction.
If I pass in some value, say 2, and I want my model to learn the x**2 function, then the result should be 4 (because 2*2 = 4). If we apply the L2 loss, it is computed as ||4 - 1||^2 = 9.
We can also make up our own loss function. We can say the loss function is always 10, so no matter what our model outputs the loss will be constant.
Why do we care about loss functions? Well, they determine how poorly the model did, and, in the context of backpropagation and neural networks, they also determine the gradients propagated back from the final layer so the model can learn.
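Here is the same toy example written out in NumPy (my own sketch, not from the tutorial): the L2 loss of the always-predicts-1 model on the target 4, its gradient with respect to the prediction, and the made-up constant loss.

    import numpy as np

    def l2_loss(y_true, y_pred):
        return np.sum((y_true - y_pred) ** 2)

    y_true = np.array([4.0])            # target for input 2 under the x**2 function
    y_pred = np.array([1.0])            # the model that always predicts 1

    print(l2_loss(y_true, y_pred))      # ||4 - 1||^2 = 9
    print(2 * (y_pred - y_true))        # gradient w.r.t. the prediction: -6, i.e. "move the output up"

    def constant_loss(y_true, y_pred):
        return 10.0                     # the made-up loss: its gradient is 0, so nothing would be learned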
As other comments have suggested I think you should start with basic material. Here's a good link to start off with http://neuralnetworksanddeeplearning.com/
It is worth noting that we can speak of different kinds of loss functions:
Regression loss functions and classification loss functions.
A regression loss function describes the difference between the values that a model is predicting and the actual values of the labels.
So the loss function has a meaning on labeled data, when we compare the prediction to the label at a single point in time.
This loss function is often called the error function or the error formula.
Typical error functions we use for regression models are the L1 and L2 losses, the Huber loss, the quantile loss, and the log-cosh loss.
Note: L1 loss is also known as Mean Absolute Error. L2 loss is also known as Mean Squared Error or quadratic loss.
Loss functions for classification represent the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a particular observation belongs to).
To name a few: log loss, focal loss, exponential loss, hinge loss, relative entropy loss, and others.
Note: While more commonly used in regression, the square loss function can be re-written and utilized for classification.
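For reference, a short NumPy sketch of a few of the losses named above, computed on made-up numbers:

    import numpy as np

    y_true = np.array([3.0, -0.5, 2.0, 7.0])
    y_pred = np.array([2.5,  0.0, 2.0, 8.0])
    err = y_true - y_pred

    mae = np.mean(np.abs(err))                   # L1 / Mean Absolute Error
    mse = np.mean(err ** 2)                      # L2 / Mean Squared Error
    delta = 1.0                                  # Huber: quadratic near zero, linear in the tails
    huber = np.mean(np.where(np.abs(err) <= delta,
                             0.5 * err ** 2,
                             delta * (np.abs(err) - 0.5 * delta)))

    # A classification loss: binary log loss (cross-entropy) on predicted probabilities.
    t = np.array([1, 0, 1, 1])
    p = np.array([0.9, 0.2, 0.6, 0.8])
    log_loss = -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

    print(mae, mse, huber, log_loss)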
I have this 5-5-2 backpropagation neural network I'm training, and after reading this awesome article by LeCun I started to put in practice some of the ideas he suggests.
Currently I'm evaluating it with a 10-fold cross-validation algorithm I made myself, which goes basically like this:
for each epoch
    for each possible split (training, validation)
        train and validate
    end
    compute mean MSE between all k splits
end
My inputs and outputs are standardized (zero mean, unit variance) and I'm using a tanh activation function. All the network algorithms seem to work properly: I used the same implementation to approximate the sin function and it does it pretty well.
Now, the question is as the title implies: should I standardize each train/validation set separately or do I simply need to standardize the whole dataset once?
Note that if I do the latter, the network doesn't produce meaningful predictions, but I would prefer a more "theoretical" answer rather than just looking at the outputs.
By the way, I implemented it in C, but I'm also comfortable with C++.
You will most likely be better off standardizing each training set individually. The purpose of cross-validation is to get a sense for how well your algorithm generalizes. When you apply your network to new inputs, the inputs will not be ones that were used to compute your standardization parameters. If you standardize the entire data set at once, you are ignoring the possibility that a new input will fall outside the range of values over which you standardized.
So unless you plan to re-standardize every time you process a new input (which I'm guessing is unlikely), you should only compute the standardization parameters for the training set of the partition being evaluated. Furthermore, you should compute those parameters only on the training set of the partition, not the validation set (i.e., each of the 10-fold partitions will use 90% of the data to calculate standardization parameters).
So you assume the inputs are normally distributed, and you are subtracting the mean and dividing by the standard deviation to get N(0,1)-distributed inputs?
Yes, I agree with @bogatron that you standardize each training set separately, but I would say even more strongly that it's a "must" not to use the validation-set data too. The problem is not values falling outside the range seen in the training set; that is fine, since the transformation to a standard normal is still defined for any value. The point is that you can't compute the mean / standard deviation over all the data, because you can't in any way use the validation data during training, even if only via this statistic.
It should further be emphasized that you apply the training set's mean to the validation set, not the validation set's own mean. It has to be the same transformation of the features that was used during training; it would not be valid to transform the validation set differently.
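A minimal sketch of that procedure, assuming scikit-learn's KFold and StandardScaler rather than the asker's C implementation: the standardization parameters come from the training fold only, and the same transformation is then applied to the validation fold.

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(100, 5)
    y = np.random.rand(100)

    for train_idx, val_idx in KFold(n_splits=10).split(X):
        scaler = StandardScaler().fit(X[train_idx])   # mean/std computed from the training fold only
        X_train = scaler.transform(X[train_idx])
        X_val = scaler.transform(X[val_idx])          # same transformation applied to the validation fold
        # ... train on (X_train, y[train_idx]) and validate on (X_val, y[val_idx]) ...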