What is the difference between cross-entropy and log loss error? - machine-learning

What is the difference between cross-entropy and log loss error? The formulae for both seem to be very similar.

They are essentially the same; usually, we use the term log loss for binary classification problems, and the more general cross-entropy (loss) for the general case of multi-class classification, but even this distinction is not consistent, and you'll often find the terms used interchangeably as synonyms.
From the Wikipedia entry for cross-entropy:
The logistic loss is sometimes called cross-entropy loss. It is also known as log loss
From the fast.ai wiki entry on log loss [link is now dead]:
Log loss and cross-entropy are slightly different depending on the context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing.
From the ML Cheatsheet:
Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1.

Related

Is the loss function='Multiclass' in catboost same as log loss if I am doing a multiclassification problem?

I am making a multiclass prediction model using catboost, The final solution should have minimum Logloss error but Logloss is not present in catboost, they have something called 'Multiclass' as the loss function. Are they both same? if not then how can I measure the accuracy of the catboost model in terms of Logloss?
Are they both same? Effectively, Yes...
The catboost documentation describe the calculation of 'MultiClass' loss as what is generally considered as Multinomial/Multiclass Cross Entropy Loss. That is effectively, a Log Softmax applied to the classifier output 'a' to produce values that can be interpreted as probabilities, and subsequently then apply Negative Log Likelihood Loss (NLLLoss), wiki1 & wiki2.
Their documentation describe the calculation of 'LogLoss' also, which again is NLLLoss, however applied to 'p'. Which they describe here to be result of applying the sigmoid fn to the classifier output. Since the NLLLoss is reworked for the binary problem, only a single class probability is calculated, using 'p' and '1-p' for each class. And in this special (binary) case, use of sigmoid and softmax are equivalent.
How can I measure the the catboost model in terms of Logloss?
They describe a method to produce desired metrics on given data.
Be careful not to confuse loss/objective function 'loss_function' with evaluation metric 'eval_metric', however in this instance, the same function can be used for both, as listed in their supported metrics.
Hope this helps!
Log loss is not a loss function but a metric to measure the performance of a classification model where the prediction is a probability value between 0 and 1.
Learn more here.

Need help choosing loss function

I have used resnet50 to solve a multi-class classification problem. The model outputs probabilities for each class. Which loss function should I choose for my model?
After choosing binary cross entropy :
After choosing categorical cross entropy:
The above results are for the same model with just different loss functions.This model is supposed to classify images into 26 classes so categorical cross entropy should work.
Also, in the first case accuracy is about 96% but losses are so high. Why?
edit 2:
Model architecture:
You definitely need to use categorical_crossentropy for a multi-classification problem. binary_crossentropy will reduce your problem down to a binary classification problem in a way that's unclear without further looking into it.
I would say that the reason you are seeing high accuracy in the first (and to some extent the second) case is because you are overfitting. The first dense layer you are adding contains 8 million parameters (!!! to see that do model.summary()), and you only have 70k images to train it with 8 epochs. This architectural choice is very demanding both in computing power and in data requirement. You are also using a very basic optimizer (SGD). Try to use a more powerful Adam.
Finally, I am a bit surprised at your choice to take a 'sigmoid' activation function in the output layer. Why not a more classic 'softmax'?
For a multi-class classification problem you use the categorical_crossentropy loss, as what it does is match the ground truth probability distribution with the one predicted by the model.
This is exactly what is used for multi-class classification, you have a misconception of you think you can't use this loss.

Loss curve in multilabel classification with high class imbalance

I am working on multilabel classification problem. The classes are highly imbalance. However, I balanced the imbalance problem with class weights. I am using "Binary cross entropy" as cost funtion and sigmoid activation function at output layer. But, I am confused with loss curve (since the validation loss and testing loss are parallel ). Is this the case of overfitting?
The telltale signature of overfitting is when your validation loss starts increasing, while your training loss continues decreasing, i.e.:
(Image adapted from Wikipedia entry on overfitting)
Here are some other plots indicating overfitting (source):
See also the SO thread How to know if underfitting or overfitting is occuring?.
Clearly, your plot does not exhibit such behavior, hence you are not overfitting.

Comparing MSE loss and cross-entropy loss in terms of convergence

For a very simple classification problem where I have a target vector [0,0,0,....0] and a prediction vector [0,0.1,0.2,....1] would cross-entropy loss converge better/faster or would MSE loss?
When I plot them it seems to me that MSE loss has a lower error margin. Why would that be?
Or for example when I have the target as [1,1,1,1....1] I get the following:
As complement to the accepted answer, I will answer the following questions
What is the interpretation of MSE loss and cross entropy loss from probability perspective?
Why cross entropy is used for classification and MSE is used for linear regression?
TL;DR Use MSE loss if (random) target variable is from Gaussian distribution and categorical cross entropy loss if (random) target variable is from Multinomial distribution.
MSE (Mean squared error)
One of the assumptions of the linear regression is multi-variant normality. From this it follows that the target variable is normally distributed(more on the assumptions of linear regression can be found here and here).
Gaussian distribution(Normal distribution) with mean and variance is given by
Often in machine learning we deal with distribution with mean 0 and variance 1(Or we transform our data to have mean 0 and variance 1). In this case the normal distribution will be,
This is called standard normal distribution.
For normal distribution model with weight parameter and precision(inverse variance) parameter , the probability of observing a single target t given input x is expressed by the following equation
, where is mean of the distribution and is calculated by model as
Now the probability of target vector given input can be expressed by
Taking natural logarithm of left and right terms yields
Where is log likelihood of normal function. Often training a model involves optimizing the likelihood function with respect to . Now maximum likelihood function for parameter is given by (constant terms with respect to can be omitted),
For training the model omitting the constant doesn't affect the convergence.
This is called squared error and taking the mean yields mean squared error.
,
Cross entropy
Before going into more general cross entropy function, I will explain specific type of cross entropy - binary cross entropy.
Binary Cross entropy
The assumption of binary cross entropy is probability distribution of target variable is drawn from Bernoulli distribution. According to Wikipedia
Bernoulli distribution is the discrete probability distribution of a random variable which
takes the value 1 with probability p and the value 0
with probability q=1-p
Probability of Bernoulli distribution random variable is given by
, where and p is probability of success.
This can be simply written as
Taking negative natural logarithm of both sides yields
, this is called binary cross entropy.
Categorical cross entropy
Generalization of the cross entropy follows the general case
when the random variable is multi-variant(is from Multinomial distribution
) with the following probability distribution
Taking negative natural logarithm of both sides yields categorical cross entropy loss.
,
You sound a little confused...
Comparing the values of MSE & cross-entropy loss and saying that one is lower than the other is like comparing apples to oranges
MSE is for regression problems, while cross-entropy loss is for classification ones; these contexts are mutually exclusive, hence comparing the numerical values of their corresponding loss measures makes no sense
When your prediction vector is like [0,0.1,0.2,....1] (i.e. with non-integer components), as you say, the problem is a regression (and not a classification) one; in classification settings, we usually use one-hot encoded target vectors, where only one component is 1 and the rest are 0
A target vector of [1,1,1,1....1] could be the case either in a regression setting, or in a multi-label multi-class classification, i.e. where the output may belong to more than one class simultaneously
On top of these, your plot choice, with the percentage (?) of predictions in the horizontal axis, is puzzling - I have never seen such plots in ML diagnostics, and I am not quite sure what exactly they represent or why they can be useful...
If you like a detailed discussion of the cross-entropy loss & accuracy in classification settings, you may have a look at this answer of mine.
I tend to disagree with the previously given answers. The point is that the cross-entropy and MSE loss are the same.
The modern NN learn their parameters using maximum likelihood estimation (MLE) of the parameter space. The maximum likelihood estimator is given by argmax of the product of probability distribution over the parameter space. If we apply a log transformation and scale the MLE by the number of free parameters, we will get an expectation of the empirical distribution defined by the training data.
Furthermore, we can assume different priors, e.g. Gaussian or Bernoulli, which yield either the MSE loss or negative log-likelihood of the sigmoid function.
For further reading:
Ian Goodfellow "Deep Learning"
A simple answer to your first question:
For a very simple classification problem ... would cross-entropy loss converge better/faster or would MSE loss?
is that MSE loss, when combined with sigmoid activation, will result in non-convex cost function with multiple local minima. This is explained by Prof Andrew Ng in his lecture:
Lecture 6.4 — Logistic Regression | Cost Function — [ Machine Learning | Andrew Ng]
I imagine the same applies to multiclass classification with softmax activation.

What is a loss function in simple words?

Can anyone please explain in simple words and possibly with some examples what is a loss function in the field of machine learning/neural networks?
This came out while I was following a Tensorflow tutorial:
https://www.tensorflow.org/get_started/get_started
It describes how far off the result your network produced is from the expected result - it indicates the magnitude of error your model made on its prediciton.
You can then take that error and 'backpropagate' it through your model, adjusting its weights and making it get closer to the truth the next time around.
The loss function is how you're penalizing your output.
The following example is for a supervised setting i.e. when you know the correct result should be. Although loss functions can be applied even in unsupervised settings.
Suppose you have a model that always predicts 1. Just the scalar value 1.
You can have many loss functions applied to this model. L2 is the euclidean distance.
If I pass in some value say 2 and I want my model to learn the x**2 function then the result should be 4 (because 2*2 = 4). If we apply the L2 loss then its computed as ||4 - 1||^2 = 9.
We can also make up our own loss function. We can say the loss function is always 10. So no matter what our model outputs the loss will be constant.
Why do we care about loss functions? Well they determine how poorly the model did and in the context of backpropagation and neural networks. They also determine the gradients from the final layer to be propagated so the model can learn.
As other comments have suggested I think you should start with basic material. Here's a good link to start off with http://neuralnetworksanddeeplearning.com/
Worth to note we can speak of different kind of loss functions:
Regression loss functions and classification loss functions.
Regression loss function describes the difference between the values that a model is predicting and the actual values of the labels.
So the loss function has a meaning on a labeled data when we compare the prediction to the label at a single point of time.
This loss function is often called the error function or the error formula.
Typical error functions we use for regression models are L1 and L2, Huber loss, Quantile loss, log cosh loss.
Note: L1 loss is also know as Mean Absolute Error. L2 Loss is also know as Mean Square Error or Quadratic loss.
Loss functions for classification represent the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a particular observation belongs to).
Name a few: log loss, focal loss, exponential loss, hinge loss, relative entropy loss and other.
Note: While more commonly used in regression, the square loss function can be re-written and utilized for classification.

Resources