What is weight decay loss? - machine-learning

I have started recently with ML and TensorFlow. While going through the CIFAR10-tutorial on the website I came across a paragraph which is a bit confusing to me:
The usual method for training a network to perform N-way classification is multinomial logistic regression, aka. softmax regression. Softmax regression applies a softmax nonlinearity to the output of the network and calculates the cross-entropy between the normalized predictions and a 1-hot encoding of the label. For regularization, we also apply the usual weight decay losses to all learned variables. The objective function for the model is the sum of the cross entropy loss and all these weight decay terms, as returned by the loss() function.
I have read a few answers on what is weight decay on the forum and I can say that it is used for the purpose of regularization so that values of weights can be calculated to get the minimum losses and higher accuracy.
Now in the text above I understand that the loss() is made of cross-entropy loss(which is the difference in prediction and correct label values) and weight decay loss.
I am clear on cross entropy loss but what is this weight decay loss and why not just weight decay? How is this loss being calculated?

Weight decay is nothing but L2 regularisation of the weights, which can be achieved using tf.nn.l2_loss.
The loss function with regularisation is given by:
J_reg(theta) = J(theta) + (lambda / 2) * sum_i(theta_i ** 2)
The second term of the above equation defines the L2 regularisation of the weights (theta). It is generally added to avoid overfitting. It penalises peaky weights and makes sure that all the inputs are considered. (A few peaky weights mean that only the inputs connected to them are considered for decision making.)
During the gradient descent parameter update, the above L2 regularisation ultimately means that every weight is shrunk by a constant factor at each step: W_new = (1 - alpha*lambda) * W_old - alpha * dJ/dW, where alpha is the learning rate. That's why it is generally called weight decay.
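The link between an L2 penalty and a multiplicative decay of the weights can be checked numerically. The following is a minimal sketch in plain Python, not TensorFlow code; the toy data loss `(w - 3)**2`, the values of `alpha` and `lam`, and the function names are all assumptions made up for the illustration.

```python
# Toy check: one gradient step on J(w) + (lam / 2) * w**2 equals
# shrinking w by (1 - alpha * lam) and then taking the plain gradient step.

def data_grad(w):
    # Gradient of a hypothetical data loss J(w) = (w - 3)**2
    return 2.0 * (w - 3.0)

def step_on_regularized_loss(w, alpha, lam):
    # The gradient of J(w) + (lam / 2) * w**2 is data_grad(w) + lam * w
    return w - alpha * (data_grad(w) + lam * w)

def step_with_weight_decay(w, alpha, lam):
    # Same update, rearranged to expose the decay factor on the old weight
    return (1.0 - alpha * lam) * w - alpha * data_grad(w)

w_old, alpha, lam = 5.0, 0.1, 0.01
a = step_on_regularized_loss(w_old, alpha, lam)
b = step_with_weight_decay(w_old, alpha, lam)
print(a, b)  # the two values agree up to floating-point rounding
```

This holds for plain gradient descent; optimizers that rescale the gradient (momentum or adaptive methods) may not give the same equivalence.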

It is called weight decay loss because it is added to the cost function (the loss, to be specific). Parameters are optimized from the loss, so putting the weight decay term into the loss makes its effect visible to the entire network.
TF L2 loss
Cost = Model_Loss(W) + decay_factor*L2_loss(W)
# In TensorFlow, tf.nn.l2_loss basically computes half of the squared L2 norm
L2_loss = sum(W ** 2) / 2
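The cost formula above can be sketched in plain Python to make the arithmetic concrete. This is only an illustration under stated assumptions: `l2_loss` mimics the half-sum-of-squares definition given above rather than calling TensorFlow, and `total_cost`, the layer weights and the `decay_factor` value are hypothetical.

```python
def l2_loss(weights):
    # Half the sum of squares, matching sum(W ** 2) / 2
    return sum(w * w for w in weights) / 2.0

def total_cost(model_loss, weight_groups, decay_factor):
    # Add one decay term per group of weights (for example, per layer)
    penalty = sum(l2_loss(ws) for ws in weight_groups)
    return model_loss + decay_factor * penalty

layers = [[1.0, -2.0], [0.5, 0.5]]  # hypothetical weights of two layers
cost = total_cost(model_loss=0.8, weight_groups=layers, decay_factor=0.1)
print(cost)  # 0.8 + 0.1 * (2.5 + 0.25), up to floating-point rounding
```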

What your tutorial is trying to say by "weight decay loss" is that, compared to the cross-entropy cost you know from your unregularized models (i.e. how far off target your model's predictions were on the training data), your new cost function penalizes not only prediction error but also the magnitude of the weights in your network. Whereas before you were optimizing only for correct prediction of the labels in your training set, now you are optimizing for correct label prediction as well as for small weights. The reason for this modification is that when a machine learning model trained by gradient descent yields large weights, it is likely they were arrived at in response to peculiarities (or noise) in the training data. The model will not perform as well when exposed to held-out test data because it is overfit to the training set. The result of applying weight decay loss, more commonly called L2 regularization, is that accuracy on training data will drop a bit but accuracy on test data can jump dramatically. And that's what you're after in the end: a model that generalizes well to data it did not see during training.
So that you can get a firmer grasp on the mechanics of weight decay, let's look at the learning rule for weights in an L2-regularized network:
w_new = w - eta * dC/dw - (eta * lambda / n) * w
where eta and lambda are the user-defined learning rate and regularization parameter, respectively, and n is the number of training examples (you'll have to look up those Greek letters if you're not familiar). Since the values eta and (eta*lambda)/n are both constants for a given iteration of training, it's enough to interpret the learning rule for weight decay as "for a given weight, subtract a small multiple of the derivative of the cost function with respect to that weight, and subtract a small multiple of the weight itself."
Let's look at four weights in an imaginary network and how the above learning rule affects them. As you can see, the regularization term shown in red pushes weights toward zero no matter what. It is designed to minimize the magnitude of the weight matrix, which it does by minimizing the absolute values of individual weights. Some key things to notice in these plots:
When the sign of the cost derivative and the sign of the weight are the same, the regularization term accelerates the weight's path to its optimum!
The amount that the regularization term affects the weight update is proportional to the current value of that weight. I've shown this in the plots with tiny red arrows showing contributions of weights with current values close to zero, and larger red arrows for weights with larger current magnitudes.
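The two observations above can be reproduced with a few numbers. The sketch below splits one update into the part coming from the cost derivative and the part coming from the regularization term, following the verbal learning rule (a multiple `eta` of the derivative and a multiple `eta * lam / n` of the weight). The four weights, their derivatives and the hyperparameter values are hypothetical.

```python
def update_terms(w, grad, eta, lam, n):
    # Split one update into its two parts
    gradient_term = -eta * grad        # pull from the cost derivative
    decay_term = -(eta * lam / n) * w  # pull toward zero, proportional to w
    return gradient_term, decay_term

eta, lam, n = 0.5, 2.0, 10  # hypothetical hyperparameters
# (weight, cost derivative) pairs for four imaginary weights
cases = [(4.0, 1.0), (0.2, 1.0), (-4.0, 1.0), (-0.2, -1.0)]
for w, grad in cases:
    g, d = update_terms(w, grad, eta, lam, n)
    print(f"w={w:+.1f}  gradient term={g:+.2f}  decay term={d:+.2f}  new w={w + g + d:+.2f}")
```

In every case the decay term has the opposite sign of the weight, and it is twenty times larger for the weights of magnitude 4.0 than for those of magnitude 0.2.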

Related

Different cost functions pros cons

I've seen the book and Andrew Ng's neural network cost functions, and I've noticed that Andrew Ng's cost function is different from the book's for neural networks.
Andrew Ng uses
J(Θ)=−(1/m)∑∑[y * log((hΘ(x)))+(1−y) * log(1−(hΘ(x)))] while the book uses mean squared error.
What are the pros and cons of each error formula?
The first cost function, the so-called cross-entropy loss or log loss, is used to measure the performance of a classification model whose output is a probability between 0 and 1. The larger the deviation from the actual label, the higher the cross-entropy loss. For example, predicting a probability of 0.6 when the actual value is 1 is a poor result and yields a comparatively high loss value. When the cross-entropy loss is 0, the model is said to be perfect.
MSE measures the average squared difference between the model's predictions and the actual labels. One can think of it as the model's performance on the training set: when the model performs poorly on the training set, the cost is higher. It is also called L2 loss. While training the model, the task is to minimize the squared difference between the estimated and actual target values.
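To compare how the two costs react to the same mistake, here is a small sketch that evaluates both for a single example with true label 1. The function names and the probabilities tried are assumptions chosen for the illustration.

```python
import math

def binary_cross_entropy(y, p):
    # Per-example log loss for a label y in {0, 1} and a predicted probability 0 < p < 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error(y, p):
    # Per-example squared difference between label and prediction
    return (y - p) ** 2

for p in (0.9, 0.6, 0.1, 0.01):
    ce = binary_cross_entropy(1, p)
    se = squared_error(1, p)
    print(f"p={p:<5} cross-entropy={ce:.3f} squared error={se:.3f}")
```

The squared error can never exceed 1 for probabilities in [0, 1], while the cross-entropy keeps growing as a confident prediction becomes more wrong.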

Comparing MSE loss and cross-entropy loss in terms of convergence

For a very simple classification problem where I have a target vector [0,0,0,....0] and a prediction vector [0,0.1,0.2,....1] would cross-entropy loss converge better/faster or would MSE loss?
When I plot them it seems to me that MSE loss has a lower error margin. Why would that be?
Or for example when I have the target as [1,1,1,1....1] I get the following:
As a complement to the accepted answer, I will answer the following questions:
What is the interpretation of MSE loss and cross entropy loss from probability perspective?
Why cross entropy is used for classification and MSE is used for linear regression?
TL;DR Use MSE loss if the (random) target variable is from a Gaussian distribution, and categorical cross entropy loss if the (random) target variable is from a Multinomial distribution.
MSE (Mean squared error)
One of the assumptions of linear regression is multivariate normality. From this it follows that the target variable is normally distributed.
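The Gaussian distribution (normal distribution) with mean $\mu$ and variance $\sigma^2$ is given by
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Often in machine learning we deal with distributions with mean 0 and variance 1 (or we transform our data to have mean 0 and variance 1). In this case the normal distribution will be
$$\mathcal{N}(x \mid 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$
This is called the standard normal distribution.
For a normal distribution model with weight parameter $w$ and precision (inverse variance) parameter $\beta$, the probability of observing a single target $t$ given input $x$ is expressed by the following equation
$$p(t \mid x, w, \beta) = \mathcal{N}\left(t \mid y(x, w), \beta^{-1}\right)$$
where $y(x, w)$ is the mean of the distribution, as calculated by the model from $x$ and $w$.
Now the probability of the target vector $\mathbf{t} = (t_1, \dots, t_N)$ given the inputs $\mathbf{x} = (x_1, \dots, x_N)$ can be expressed by
$$p(\mathbf{t} \mid \mathbf{x}, w, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid y(x_n, w), \beta^{-1}\right)$$
Taking the natural logarithm of the left and right terms yields
$$\ln p(\mathbf{t} \mid \mathbf{x}, w, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)$$
where $\ln p(\mathbf{t} \mid \mathbf{x}, w, \beta)$ is the log-likelihood of the normal distribution. Often training a model involves optimizing the likelihood function with respect to $w$. The part of the log-likelihood that depends on $w$ is given by (terms that are constant with respect to $w$ can be omitted)
$$-\frac{\beta}{2} \sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2$$
For training the model, omitting the constant factor $\beta$ doesn't affect the convergence, so maximizing the likelihood amounts to minimizing
$$\frac{1}{2} \sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2$$
This is called squared error, and taking the mean yields mean squared error:
$$\mathrm{MSE} = \frac{1}{N} \sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2$$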
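The claim that maximizing a Gaussian likelihood is the same as minimizing the squared error can be checked numerically. The sketch below assumes each target is drawn from a normal distribution whose mean is the model prediction and whose precision is `beta`; the function names, targets and predictions are hypothetical.

```python
import math

def gaussian_nll(targets, preds, beta):
    # Negative log-likelihood of the targets under N(t | y, 1 / beta)
    n = len(targets)
    sq = sum((y - t) ** 2 for t, y in zip(targets, preds))
    return 0.5 * beta * sq - 0.5 * n * math.log(beta) + 0.5 * n * math.log(2 * math.pi)

def mse(targets, preds):
    # Mean squared error between predictions and targets
    return sum((y - t) ** 2 for t, y in zip(targets, preds)) / len(targets)

targets = [1.0, 2.0, 3.0]
preds_a = [1.1, 1.9, 3.2]  # hypothetical model A
preds_b = [0.5, 2.5, 3.5]  # hypothetical model B

beta = 1.0
nll_gap = gaussian_nll(targets, preds_b, beta) - gaussian_nll(targets, preds_a, beta)
mse_gap = mse(targets, preds_b) - mse(targets, preds_a)
# The constant terms cancel, leaving nll_gap == (beta * N / 2) * mse_gap
print(nll_gap, 0.5 * beta * len(targets) * mse_gap)
```

Because the two quantities differ only by a positive scale and an additive constant, they rank any two models in the same order.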
Cross entropy
Before going into more general cross entropy function, I will explain specific type of cross entropy - binary cross entropy.
Binary Cross entropy
Binary cross entropy assumes that the target variable is drawn from a Bernoulli distribution. According to Wikipedia:
Bernoulli distribution is the discrete probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q=1-p
The probability of a Bernoulli random variable is given by
$$P(X = x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases}$$
where $x \in \{0, 1\}$ and $p$ is the probability of success.
This can be simply written as
$$P(X = x) = p^{x} (1 - p)^{1 - x}$$
Taking the negative natural logarithm of both sides yields
$$-\ln P(X = x) = -\left[x \ln p + (1 - x) \ln(1 - p)\right]$$
which is called binary cross entropy.
Categorical cross entropy
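Cross entropy generalizes to the case where the random variable is multivariate (drawn from a Multinomial distribution), with the following probability distribution
$$P(\mathbf{x}) = \prod_{k=1}^{K} p_k^{x_k}$$
where $\mathbf{x}$ is a one-hot vector over $K$ classes and $p_k$ is the probability of class $k$.
Taking the negative natural logarithm of both sides yields the categorical cross entropy loss:
$$-\ln P(\mathbf{x}) = -\sum_{k=1}^{K} x_k \ln p_k$$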
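The step from a probability distribution to a cross-entropy loss can be verified with a short sketch. It assumes the Bernoulli probability p**x * (1 - p)**(1 - x) for binary targets and a one-hot encoded target for the categorical case; the function names and example probabilities are hypothetical.

```python
import math

def bernoulli_prob(x, p):
    # P(X = x) = p**x * (1 - p)**(1 - x) for x in {0, 1}
    return p ** x * (1 - p) ** (1 - x)

def binary_cross_entropy(x, p):
    # Negative log of the Bernoulli probability, written out term by term
    return -(x * math.log(p) + (1 - x) * math.log(1 - p))

def categorical_cross_entropy(one_hot, probs):
    # -sum_k x_k * log(p_k); only the true class contributes
    return -sum(x * math.log(p) for x, p in zip(one_hot, probs))

print(-math.log(bernoulli_prob(1, 0.8)), binary_cross_entropy(1, 0.8))
print(-math.log(bernoulli_prob(0, 0.8)), binary_cross_entropy(0, 0.8))
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]), -math.log(0.7))
```

Each printed pair should agree up to floating-point rounding, which is the sense in which cross entropy is a negative log-likelihood.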
You sound a little confused...
Comparing the values of MSE & cross-entropy loss and saying that one is lower than the other is like comparing apples to oranges
MSE is for regression problems, while cross-entropy loss is for classification ones; these contexts are mutually exclusive, hence comparing the numerical values of their corresponding loss measures makes no sense
When your prediction vector is like [0,0.1,0.2,....1] (i.e. with non-integer components), as you say, the problem is a regression (and not a classification) one; in classification settings, we usually use one-hot encoded target vectors, where only one component is 1 and the rest are 0
A target vector of [1,1,1,1....1] could be the case either in a regression setting, or in a multi-label multi-class classification, i.e. where the output may belong to more than one class simultaneously
On top of these, your plot choice, with the percentage (?) of predictions in the horizontal axis, is puzzling - I have never seen such plots in ML diagnostics, and I am not quite sure what exactly they represent or why they can be useful...
If you like a detailed discussion of the cross-entropy loss & accuracy in classification settings, you may have a look at this answer of mine.
I tend to disagree with the previously given answers. The point is that the cross-entropy and MSE loss are the same.
Modern NNs learn their parameters using maximum likelihood estimation (MLE) over the parameter space. The maximum likelihood estimator is given by the argmax, over the parameter space, of the product of the probabilities the model assigns to the training examples. If we apply a log transformation and divide by the number of training examples, we get an expectation with respect to the empirical distribution defined by the training data.
Furthermore, we can assume different priors, e.g. Gaussian or Bernoulli, which yield either the MSE loss or negative log-likelihood of the sigmoid function.
For further reading:
Ian Goodfellow "Deep Learning"
A simple answer to your first question:
For a very simple classification problem ... would cross-entropy loss converge better/faster or would MSE loss?
is that MSE loss, when combined with sigmoid activation, will result in non-convex cost function with multiple local minima. This is explained by Prof Andrew Ng in his lecture:
Lecture 6.4 — Logistic Regression | Cost Function — [ Machine Learning | Andrew Ng]
I imagine the same applies to multiclass classification with softmax activation.

Reason of having high AUC and low accuracy in a balanced dataset

Given a balanced dataset (size of both classes are the same), fitting it into an SVM model I yield a high AUC value (~0.9) but a low accuracy (~0.5).
I have totally no idea why would this happen, can anyone explain this case for me?
The ROC curve is biased towards the positive class. The described situation with high AUC and low accuracy can occur when your classifier achieves the good performance on the positive class (high AUC), at the cost of a high false negatives rate (or a low number of true negatives).
The question of why the training process resulted in a classifier with poor predictive performance is very specific to your problem/data and the classification methods used.
The ROC analysis tells you how well the samples of the positive class can be separated from the other class, while the prediction accuracy hints on the actual performance of your classifier.
About ROC analysis
The general context for ROC analysis is binary classification, where a classifier assigns elements of a set into two groups. The two classes are usually referred to as "positive" and "negative". Here, we assume that the classifier can be reduced to the following functional behavior:
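def classifier(observation, t):
    # score_function is the classifier's scoring function (described below)
    if score_function(observation) <= t:
        return "negative"  # observation belongs to the "negative" class
    else:
        return "positive"  # observation belongs to the "positive" class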
The core of a classifier is the scoring function that converts observations into a numeric value measuring the affinity of the observation to the positive class. Here, the scoring function incorporates the set of rules, the mathematical functions, the weights and parameters, and all the ingenuity that makes a good classifier. For example, in logistic regression classification, one possible choice for the scoring function is the logistic function that estimates the probability p(x) of an observation x belonging to the positive class.
In a final step, the classifier converts the computed score into a binary class assignment by comparing the score against a decision threshold (or prediction cutoff) t.
Given the classifier and a fixed decision threshold t, we can compute actual class predictions y_p for given observations x. To assess the capability of a classifier, the class predictions y_p are compared with the true class labels y_t of a validation dataset. If y_p and y_t match, we refer to the prediction as a true positive (TP) or a true negative (TN), depending on the value of y_p and y_t; if they do not match, we refer to it as a false positive (FP) or a false negative (FN).
We can apply this to the entire validation dataset and count the total number of TPs, TNs, FPs and FNs, and compute the true positive rate (TPR) and false positive rate (FPR), which are defined as follows:
TPR = TP / P = TP / (TP+FN) = number of true positives / number of positives
FPR = FP / N = FP / (FP+TN) = number of false positives / number of negatives
Note that the TPR is often referred to as the sensitivity, and FPR is equivalent to 1 - specificity.
In comparison, the accuracy is defined as the ratio of all correctly labeled cases and the total number of cases:
accuracy = (TP+TN)/(Total number of cases) = (TP+TN)/(TP+FP+TN+FN)
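The definitions of TPR, FPR and accuracy can be turned into a few lines of Python. This is a minimal sketch that follows the thresholding convention of the classifier above (scores above t are "positive"); the labels, the scores and the helper names are hypothetical.

```python
def confusion_counts(y_true, scores, t):
    # Predict "positive" (1) when the score exceeds the threshold t
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s > t else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def rates(y_true, scores, t):
    # Assumes both classes occur in y_true, so neither denominator is zero
    tp, fp, tn, fn = confusion_counts(y_true, scores, t)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return tpr, fpr, accuracy

y_true = [0, 0, 1, 1, 1, 0]               # hypothetical labels
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]  # hypothetical classifier scores
print(confusion_counts(y_true, scores, t=0.5))  # (TP, FP, TN, FN)
print(rates(y_true, scores, t=0.5))             # (TPR, FPR, accuracy)
```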
Given a classifier and a validation dataset, we can evaluate the true positive rate TPR(t) and false positive rate FPR(t) for varying decision thresholds t. And here we are: plotting TPR(t) against FPR(t) yields the receiver operating characteristic (ROC) curve. Below are some sample ROC curves, plotted in Python using roc-utils*.
Think of the decision threshold t as a final free parameter that can be tuned at the end of the training process. The ROC analysis offers means to find an optimal cutoff t* (e.g., Youden index, concordance, distance from optimal point).
Furthermore, we can examine with the ROC curve how well the classifier can discriminate between samples from the "positive" and the "negative" class:
Try to understand how the FPR and TPR change for increasing values of t. In the first extreme case (with some very small value for t), all samples are classified as "positive". Hence, there are no true negatives (TN=0), and thus FPR=TPR=1. By increasing t, both FPR and TPR gradually decrease, until we reach the second extreme case, where all samples are classified as negative, and none as positive: TP=FP=0, and thus FPR=TPR=0. In this process, we start in the top right corner of the ROC curve and gradually move to the bottom left.
In the case where the scoring function is able to separate the samples perfectly, leading to a perfect classifier, the ROC curve passes through the optimal point FPR(t)=0 and TPR(t)=1 (see the left figure below). In the other extreme case where the distributions of scores coincide for both classes, resulting in a random coin-flipping classifier, the ROC curve travels along the diagonal (see the right figure below).
Unfortunately, it is very unlikely that we can find a perfect classifier that reaches the optimal point (0,1) in the ROC curve. But we can try to get as close to it as possible.
The AUC, or the area under the ROC curve, tries to capture this characteristic. It is a measure of how well a classifier can discriminate between the two classes. It varies between 0 and 1. In the case of a perfect classifier, the AUC is 1. A classifier that assigns a random class label to input data would yield an AUC of 0.5.
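One way to get a feel for these AUC values is to compute the AUC directly from scores. The sketch below uses the rank interpretation of the AUC (the fraction of positive/negative pairs in which the positive sample gets the higher score, with ties counted as half) instead of integrating the ROC curve. It is plain Python rather than roc-utils, and the function name and sample scores are hypothetical.

```python
def auc_pairwise(y_true, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count half
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(auc_pairwise([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # perfectly separated scores
print(auc_pairwise([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))  # uninformative scores
print(auc_pairwise([0, 0, 1, 1], [0.8, 0.9, 0.1, 0.2]))  # perfectly inverted scores
```

Note that no decision threshold appears anywhere in this computation: the AUC depends only on how the scores are ordered.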
* Disclaimer: I'm the author of roc-utils
I guess you are misreading the correct class when calculating the ROC curve...
That will explain the low accuracy and the high (wrongly calculated) AUC.
It is easy to see that AUC can be misleading when used to compare two classifiers if their ROC curves cross. Classifier A may produce a higher AUC than B, while B performs better for a majority of the thresholds with which you may actually use the classifier. And in fact empirical studies have shown that it is indeed very common for ROC curves of common classifiers to cross. There are also deeper reasons why AUC is incoherent and therefore an inappropriate measure (see references below).
http://sandeeptata.blogspot.com/2015/04/on-dangers-of-auc.html
Another simple explanation for this behaviour is that your model is actually very good - just its final threshold to make predictions binary is bad.
I came across this problem with a convolutional neural network on a binary image classification task. Consider, e.g., that you have 4 samples with labels 0, 0, 1, 1. Let's say your model creates continuous predictions for these four samples like so: 0.7, 0.75, 0.9 and 0.95.
We would consider this to be a good model, since high values (> 0.8) predict class 1 and low values (< 0.8) predict class 0. Hence, the ROC-AUC would be 1. Note how I used a threshold of 0.8. However, if you use a fixed and badly-chosen threshold for these predictions, say 0.5, which is what we sometimes force upon our model output, then all 4 sample predictions would be class 1, which leads to an accuracy of 50%.
Note that most models optimize not for accuracy, but for some sort of loss function. In my CNN, training for just a few epochs longer solved the problem.
Make sure that you know what you are doing when you transform a continuous model output into a binary prediction. If you do not know what threshold to use for a given ROC curve, have a look at Youden's index or find the threshold value that represents the "most top-left" point in your ROC curve.
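The four-sample example above is small enough to check by hand, and the sketch below does the same in Python. It computes the accuracy at the thresholds 0.5 and 0.8 and picks the better of the two with Youden's index (J = TPR - FPR). The helper names are hypothetical, and only the two thresholds mentioned in the text are tried.

```python
y_true = [0, 0, 1, 1]
scores = [0.7, 0.75, 0.9, 0.95]

def accuracy_at(t):
    # Turn scores into class predictions with threshold t, then score them
    preds = [1 if s > t else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

def youden_j(t):
    # Youden's index: J = TPR - FPR at threshold t
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    tpr = sum(s > t for s in pos) / len(pos)
    fpr = sum(s > t for s in neg) / len(neg)
    return tpr - fpr

candidates = [0.5, 0.8]  # the two thresholds discussed above
best_t = max(candidates, key=youden_j)
print(accuracy_at(0.5), accuracy_at(0.8), best_t)
```

With the threshold at 0.5 every sample is predicted as class 1, so half of them are wrong even though the scores rank the samples perfectly.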
If this is happening every single time, maybe your model is not correct.
Start by changing the kernel and try the model again with the new sets.
Look at the confusion matrix every time and check the TN and TP areas. The model is probably inadequate at detecting one of them.

What is `weight_decay` meta parameter in Caffe?

Looking at an example 'solver.prototxt', posted on BVLC/caffe git, there is a training meta parameter
weight_decay: 0.04
What does this meta parameter mean? And what value should I assign to it?
The weight_decay meta parameter governs the regularization term of the neural net.
During training a regularization term is added to the network's loss to compute the backprop gradient. The weight_decay value determines how dominant this regularization term will be in the gradient computation.
As a rule of thumb, the more training examples you have, the weaker this term should be. The more parameters you have (i.e., deeper net, larger filters, larger InnerProduct layers etc.) the higher this term should be.
Caffe also allows you to choose between L2 regularization (default) and L1 regularization, by setting
regularization_type: "L1"
However, since in most cases weights are small numbers (i.e., -1<w<1), the L2 norm of the weights is significantly smaller than their L1 norm. Thus, if you choose to use regularization_type: "L1" you might need to tune weight_decay to a significantly smaller value.
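The difference in scale between the two penalties is easy to see on a handful of small weights. The sketch below is plain Python, not Caffe code; the weight values are hypothetical, and the half-sum-of-squares form of the L2 penalty is an assumption about the convention in use.

```python
weights = [0.3, -0.1, 0.05, -0.2]  # hypothetical small weights, all in (-1, 1)

# L1 penalty: sum of absolute values
l1_penalty = sum(abs(w) for w in weights)

# L2 penalty: assuming the common half-sum-of-squares form
l2_penalty = 0.5 * sum(w * w for w in weights)

print(l1_penalty, l2_penalty, l1_penalty / l2_penalty)
```

For any weight with |w| < 1, w**2 is smaller than |w|, so with the same weight_decay value the L1 term contributes much more to the loss than the L2 term.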
While learning rate may (and usually does) change during training, the regularization weight is fixed throughout.
Weight decay is a regularization term that penalizes big weights.
When the weight decay coefficient is big, the penalty for big weights is also big; when it is small, weights can freely grow.
Look at this answer (not specific to caffe) for a better explanation:
Difference between neural net "weight decay" and "learning rate".

Why the average weight of rnn keeps climbing?

I'm using Pybrain to train a recurrent neural network. However, the average of the weights keeps climbing and after several iterations the train and test accuracy become lower. Now the highest performance on train data is about 55% and on test data is about 50%.
I think maybe the rnn have some training problems because of its high weights. How can I solve it? Thank you in advance.
The usual way to restrict the network parameters is to use a constrained error functional which penalizes the absolute magnitude of the parameters. This is done in "weight decay", where you add the norm of the weights ||w|| to your sum-of-squares error. Usually this is the Euclidean norm, but sometimes also the 1-norm, in which case it is called "Lasso". Note that weight decay is also called ridge regression or Tikhonov regularization.
In PyBrain, according to this page in the documentation, there is available a Lasso-version of weight decay, which can be parametrized by the parameter wDecay.
