What is the difference between loss function and metric in Keras? [duplicate] - machine-learning

This question already has answers here:
What is "metrics" in Keras?
(5 answers)
Closed 4 years ago.
It is not clear for me the difference between loss function and metrics in Keras. The documentation was not helpful for me.

The loss function is used to optimize your model. This is the function that will get minimized by the optimizer.
A metric is used to judge the performance of your model. This is only for you to look at and has nothing to do with the optimization process.

The loss function is that parameter one passes to Keras model.compile which is actually optimized while training the model . This loss function is generally minimized by the model.
Unlike the loss function , the metric is another list of parameters passed to Keras model.compile which is actually used for judging the performance of the model.
For example : In classification problems, we want to minimize the cross-entropy loss, while also want to assess the model performance with the AUC. In this case, cross-entropy is the loss function and AUC is the metric. Metric is the model performance parameter that one can see while the model is judging itself on the validation set after each epoch of training. It is important to note that the metric is important for few Keras callbacks like EarlyStopping when one wants to stop training the model in case the metric isn't improving for a certaining no. of epochs.

I have a contrived example in mind: Let's think about linear regression on a 2D-plane. In this case, loss function would be the mean squared error, the fitted line would minimize this error.
However, for some reason we are very very interested in the area under the curve from 0 to 1 of our fitted line, and thus this can be one of the metrics. And we monitor this metric while the model minimizes the mean squared error loss function.

Related

Is the loss function='Multiclass' in catboost same as log loss if I am doing a multiclassification problem?

I am making a multiclass prediction model using catboost, The final solution should have minimum Logloss error but Logloss is not present in catboost, they have something called 'Multiclass' as the loss function. Are they both same? if not then how can I measure the accuracy of the catboost model in terms of Logloss?
Are they both same? Effectively, Yes...
The catboost documentation describe the calculation of 'MultiClass' loss as what is generally considered as Multinomial/Multiclass Cross Entropy Loss. That is effectively, a Log Softmax applied to the classifier output 'a' to produce values that can be interpreted as probabilities, and subsequently then apply Negative Log Likelihood Loss (NLLLoss), wiki1 & wiki2.
Their documentation describe the calculation of 'LogLoss' also, which again is NLLLoss, however applied to 'p'. Which they describe here to be result of applying the sigmoid fn to the classifier output. Since the NLLLoss is reworked for the binary problem, only a single class probability is calculated, using 'p' and '1-p' for each class. And in this special (binary) case, use of sigmoid and softmax are equivalent.
How can I measure the the catboost model in terms of Logloss?
They describe a method to produce desired metrics on given data.
Be careful not to confuse loss/objective function 'loss_function' with evaluation metric 'eval_metric', however in this instance, the same function can be used for both, as listed in their supported metrics.
Hope this helps!
Log loss is not a loss function but a metric to measure the performance of a classification model where the prediction is a probability value between 0 and 1.
Learn more here.

Activation functions in Neural Networks

I have a few set of questions related to the usage of various activation functions used in neural networks? I would highly appreciate if someone could give good explanatory answers.
Why ReLU is used only on hidden layers specifically?
Why Sigmoid is a not used in Multi-class classification?
Why we do not use any activation function in regression problems having all negative values?
Why we use "average='micro','macro','average'" while calculating performance metric in multi_class classification?
I'll answer to the best of my ability the 2 first questions:
Relu (=max(0,x)) is used to extract feature maps from data. This is why it is used in the hidden layers where we're learning what important characteristics or features the data holds that could make the model learn how to classify for example. In the FC layers, it's time to make a decision about the output, so we usually use sigmoid or softmax, which tend to give us numbers between 0 and 1 (probability) that can give an interpretable result.
Sigmoid gives a probability for each class. So, if you have 10 classes, you'll have 10 probabilities. And depending on the threshold used, your model would predict for example that the image corresponds to two classes when in multi-classification you want just one predicted class per image. That's why softmax is used in this context: It chooses the class with the maximum probability. So it'll predict just one class.

Need help choosing loss function

I have used resnet50 to solve a multi-class classification problem. The model outputs probabilities for each class. Which loss function should I choose for my model?
After choosing binary cross entropy :
After choosing categorical cross entropy:
The above results are for the same model with just different loss functions.This model is supposed to classify images into 26 classes so categorical cross entropy should work.
Also, in the first case accuracy is about 96% but losses are so high. Why?
edit 2:
Model architecture:
You definitely need to use categorical_crossentropy for a multi-classification problem. binary_crossentropy will reduce your problem down to a binary classification problem in a way that's unclear without further looking into it.
I would say that the reason you are seeing high accuracy in the first (and to some extent the second) case is because you are overfitting. The first dense layer you are adding contains 8 million parameters (!!! to see that do model.summary()), and you only have 70k images to train it with 8 epochs. This architectural choice is very demanding both in computing power and in data requirement. You are also using a very basic optimizer (SGD). Try to use a more powerful Adam.
Finally, I am a bit surprised at your choice to take a 'sigmoid' activation function in the output layer. Why not a more classic 'softmax'?
For a multi-class classification problem you use the categorical_crossentropy loss, as what it does is match the ground truth probability distribution with the one predicted by the model.
This is exactly what is used for multi-class classification, you have a misconception of you think you can't use this loss.

What's a "good" value for the loss function of a DL model like yolo?

I collected ~1500 labelled data and trained with yolo v3, got a training loss of ~10, validation loss ~ 16. Obviously we can use real test data to evaluate the model performance, but I am wondering if there is a way to tell if this training loss = 10 is a "good" one? Or does it indicate I need to use more training data to see if I can push it down to 5 or even less?
Ultimately my question is, for a well-known model with a pre-defined loss function, is there a "good" standard value for the training loss?
thanks.
you need to train your weights until avg loss become 0.0XXXXX. It is minimal requirement to detect object with matching anchor IOU.
Update:28th Nov, 2018
while training object detection model, Loss might be vary sometimes with large data set. but all you need to calculate is Mean Average Precision(MAP) which exactly gave the accuracy criteria of trained model.
./darknet detector map .data .cfg .weights
If your MAP is near to 0.1 i.e. 100%, model performing well.
Follow link to know more about MAP:
https://medium.com/#jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173
Your validation loss is a good indicator of if the training loss can further alleviate, I mean i don't have any one-shot solutions ,you will have to tweak Hyper-parameters and check on the val test and iterate.You can also get a nice idea by looking at the loss curve, was it decreasing when you stopped training or was it flat, you can get a sense of how the training has progressed and make changes accordingly.GoodLuck

What is a loss function in simple words?

Can anyone please explain in simple words and possibly with some examples what is a loss function in the field of machine learning/neural networks?
This came out while I was following a Tensorflow tutorial:
https://www.tensorflow.org/get_started/get_started
It describes how far off the result your network produced is from the expected result - it indicates the magnitude of error your model made on its prediciton.
You can then take that error and 'backpropagate' it through your model, adjusting its weights and making it get closer to the truth the next time around.
The loss function is how you're penalizing your output.
The following example is for a supervised setting i.e. when you know the correct result should be. Although loss functions can be applied even in unsupervised settings.
Suppose you have a model that always predicts 1. Just the scalar value 1.
You can have many loss functions applied to this model. L2 is the euclidean distance.
If I pass in some value say 2 and I want my model to learn the x**2 function then the result should be 4 (because 2*2 = 4). If we apply the L2 loss then its computed as ||4 - 1||^2 = 9.
We can also make up our own loss function. We can say the loss function is always 10. So no matter what our model outputs the loss will be constant.
Why do we care about loss functions? Well they determine how poorly the model did and in the context of backpropagation and neural networks. They also determine the gradients from the final layer to be propagated so the model can learn.
As other comments have suggested I think you should start with basic material. Here's a good link to start off with http://neuralnetworksanddeeplearning.com/
Worth to note we can speak of different kind of loss functions:
Regression loss functions and classification loss functions.
Regression loss function describes the difference between the values that a model is predicting and the actual values of the labels.
So the loss function has a meaning on a labeled data when we compare the prediction to the label at a single point of time.
This loss function is often called the error function or the error formula.
Typical error functions we use for regression models are L1 and L2, Huber loss, Quantile loss, log cosh loss.
Note: L1 loss is also know as Mean Absolute Error. L2 Loss is also know as Mean Square Error or Quadratic loss.
Loss functions for classification represent the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a particular observation belongs to).
Name a few: log loss, focal loss, exponential loss, hinge loss, relative entropy loss and other.
Note: While more commonly used in regression, the square loss function can be re-written and utilized for classification.

Resources