What is the `weight_decay` meta parameter in Caffe?

Looking at an example 'solver.prototxt', posted on BVLC/caffe git, there is a training meta parameter
weight_decay: 0.04
What does this meta parameter mean? And what value should I assign to it?

The weight_decay meta parameter governs the regularization term of the neural net.
During training a regularization term is added to the network's loss to compute the backprop gradient. The weight_decay value determines how dominant this regularization term will be in the gradient computation.
As a rule of thumb, the more training examples you have, the weaker this term should be. The more parameters you have (i.e., deeper net, larger filters, larger InnerProduct layers etc.) the higher this term should be.
Caffe also allows you to choose between L2 regularization (default) and L1 regularization, by setting
regularization_type: "L1"
However, since in most cases weights are small numbers (i.e., -1<w<1), the L2 norm of the weights is significantly smaller than their L1 norm. Thus, if you choose to use regularization_type: "L1" you might need to tune weight_decay to a significantly smaller value.
While learning rate may (and usually does) change during training, the regularization weight is fixed throughout.
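To make that concrete, here is a rough NumPy sketch of the idea (not Caffe's exact implementation; the function name and the 0.04 value just mirror the example above):

import numpy as np

def regularized_loss(data_loss, weights, weight_decay=0.04, reg_type="L2"):
    # Sum the regularization term over all learnable weight arrays.
    if reg_type == "L2":
        reg = sum(np.sum(w ** 2) for w in weights)
    else:  # reg_type == "L1"
        reg = sum(np.sum(np.abs(w)) for w in weights)
    # weight_decay determines how dominant this term is in the total loss
    # (and hence in the backprop gradient).
    return data_loss + weight_decay * reg

Note how, for the same weights with |w| < 1, the L1 sum is larger than the L2 sum, which is why a smaller weight_decay is usually needed when switching to L1.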

Weight decay is a regularization term that penalizes big weights.
When the weight decay coefficient is big the penalty for big weights is also big, when it is small weights can freely grow.
Look at this answer (not specific to caffe) for a better explanation:
Difference between neural net "weight decay" and "learning rate".

Related

Understanding logic behind Sparse Autoencoders

I am currently going through a sparse autoencoder. What I understood is that we don't need all hidden units to fire for every input; rather, some specific hidden units fire depending on the type of input. For this we add a sparsity regularization term to the loss function. But I am unable to see how adding this regularization term to the loss function helps us stop certain hidden units from firing.
But I am unable to see how adding this regularization term to the loss function helps us stop certain hidden units from firing?
Because this regularization term precisely penalizes excessive activations. As opposed to conventional regularization, in which we penalize the weights through the L1 or L2 norms, in this case we penalize the output of the activation functions by a scale factor. This method ensures that only a subset of neurons in a hidden layer are activated for specific inputs, which overall yields better results, since you end up with more "specialized" neurons that only fire for specific inputs, rather than all of them.
So just think of it the way conventional regularization works. By adding a regularization term to the loss function in a Lasso regression, we penalize high coefficients through the L1 norm, enforcing that when the loss is minimized it yields smaller weights. In sparse regularization we are instead shrinking the activation vectors, reducing the subset of neurons that will fire for each input.
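As a minimal sketch of that idea (using a simple L1 penalty on the hidden activations; the KL-divergence penalty used in many sparse-autoencoder write-ups works on the same principle, and all names here are made up for illustration):

import numpy as np

def sparse_autoencoder_loss(x, x_hat, hidden, sparsity_weight=1e-3):
    # Reconstruction error: how far the decoder output x_hat is from the input x.
    reconstruction = np.mean((x - x_hat) ** 2)
    # Sparsity penalty: the mean absolute value of the hidden activations.
    # Minimizing it pushes most activations toward zero, so only a few
    # "specialized" hidden units stay active for any given input.
    sparsity = np.mean(np.abs(hidden))
    return reconstruction + sparsity_weight * sparsity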

Does L1 or L2 regularization give the most sparse weights for the same loss function and optimizer?

If I consider a dataset, which regularization technique (L1 regularization or L2 regularization) will output the highest sparse weights for the same loss function and same optimizer?
By definition, L1 regularization (lasso) forces some weights to zero, thus leading to sparser solutions; according to the Wikipedia entry on regularization:
It can be shown that the L1 norm induces sparsity
See also the L1 and L2 Regularization Methods post at Towards Data Science:
The key difference between these techniques is that Lasso shrinks the less important feature’s coefficient to zero thus, removing some feature altogether. So, this works well for feature selection in case we have a huge number of features.
For more details, see the following threads at Cross Validated:
Sparsity in Lasso and advantage over ridge
Why does the Lasso provide Variable Selection?
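As a quick illustration (a toy experiment with scikit-learn, which the question does not mention; the alpha values and data are arbitrary), fitting Lasso (L1) and Ridge (L2) on the same data and counting exact zeros in the learned weights typically shows the difference:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_w = np.zeros(50)
true_w[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]   # only 5 informative features
y = X @ true_w + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)          # L1-regularized
ridge = Ridge(alpha=0.1).fit(X, y)          # L2-regularized

print("zero weights (L1):", np.sum(lasso.coef_ == 0))   # typically many
print("zero weights (L2):", np.sum(ridge.coef_ == 0))   # typically none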

What is weight decay loss?

I have started recently with ML and TensorFlow. While going through the CIFAR10-tutorial on the website I came across a paragraph which is a bit confusing to me:
The usual method for training a network to perform N-way classification is multinomial logistic regression, aka. softmax regression. Softmax regression applies a softmax nonlinearity to the output of the network and calculates the cross-entropy between the normalized predictions and a 1-hot encoding of the label. For regularization, we also apply the usual weight decay losses to all learned variables. The objective function for the model is the sum of the cross entropy loss and all these weight decay terms, as returned by the loss() function.
I have read a few answers on the forum about what weight decay is, and I gather that it is used for regularization, so that the weights are kept small enough to give lower loss and higher accuracy.
Now, in the text above, I understand that loss() is made up of the cross-entropy loss (which is the difference between the predictions and the correct label values) and the weight decay loss.
I am clear on cross entropy loss but what is this weight decay loss and why not just weight decay? How is this loss being calculated?
Weight decay is nothing but L2 regularisation of the weights, which can be achieved using tf.nn.l2_loss.
The loss function with regularisation is given by:
Loss(theta) = Model_Loss(theta) + (lambda/2) * sum(theta ** 2)
The second term of the above equation defines the L2-regularization of the weights (theta). It is generally added to avoid overfitting. This penalises peaky weights and makes sure that all the inputs are considered. (Few peaky weights means only those inputs connected to it are considered for decision making.)
During the gradient descent parameter update, the above L2 regularization ultimately means that every weight is decayed linearly: W_new = (1 - alpha*lambda) * W_old - alpha * dJ/dW. That's why it's generally called weight decay.
It is called the weight decay loss because it is added to the cost function (the loss, to be specific). Parameters are optimized from the loss, so by putting weight decay into the loss function you make its effect visible to the entire network.
TF L2 loss
Cost = Model_Loss(W) + decay_factor*L2_loss(W)
# In TensorFlow, tf.nn.l2_loss basically computes half of the squared L2 norm:
L2_loss = sum(W ** 2) / 2
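For example, here is a minimal TensorFlow sketch of that cost (the shapes, the decay_factor value, and the toy data are placeholders, not taken from the CIFAR-10 tutorial):

import tensorflow as tf

# Toy batch, just to make the snippet runnable; shapes are placeholders.
x = tf.random.normal([32, 784])
labels = tf.one_hot(tf.random.uniform([32], maxval=10, dtype=tf.int32), 10)

W = tf.Variable(tf.random.normal([784, 10]))
b = tf.Variable(tf.zeros([10]))
logits = x @ W + b

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
decay_factor = 5e-4                                   # placeholder value
weight_decay_loss = decay_factor * tf.nn.l2_loss(W)   # tf.nn.l2_loss(W) = sum(W ** 2) / 2
total_loss = cross_entropy + weight_decay_loss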
What your tutorial is trying to say by "weight decay loss" is that compared to the cross-entropy cost you know from your unregularized models (i.e. how far off target your model's predictions were on training data), your new cost function penalizes not only prediction error but also the magnitude of the weights in your network. Whereas before you were optimizing only for correct prediction of the labels in your training set, now you are optimizing for correct label prediction as well as having small weights. The reason for this modification is that when a machine learning model trained by gradient descent yields large weights, it is likely they were arrived at in response to peculiarities (or noise) in the training data. The model will not perform as well when exposed to held-out test data because it is overfit to the training set. The result of applying weight decay loss, more commonly called L2-regularization, is that accuracy on training data will drop a bit but accuracy on test data can jump dramatically. And that's what you're after in the end: a model that generalizes well to data it did not see during training.
So you can get a firmer grasp on the mechanics of weight decay, let's look at the learning rule for weights in an L2-regularized network:
w → (1 − ηλ/n)·w − η·∂C/∂w
where η and λ are the user-defined learning rate and regularization parameter, respectively, and n is the number of training examples (you'll have to look up those Greek letters if you're not familiar). Since the values η and (η·λ)/n are both constants for a given iteration of training, it's enough to interpret the learning rule for weight decay as "for a given weight, subtract a small multiple of the derivative of the cost function with respect to that weight, and subtract a small multiple of the weight itself."
Let's look at four weights in an imaginary network and how the above learning rule affects them. As you can see, the regularization term shown in red pushes weights toward zero no matter what. It is designed to minimize the magnitude of the weight matrix, which it does by minimizing the absolute values of individual weights. Some key things to notice in these plots:
When the sign of the cost derivative and the sign of the weight are the same, the regularization term accelerates the weight's path to its optimum!
The amount that the regularization term affects the weight update is proportional to the current value of that weight. I've shown this in the plots with tiny red arrows showing contributions of weights with current values close to zero, and larger red arrows for weights with larger current magnitudes.
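To see the decay numerically, here is a tiny NumPy sketch of one such update step (the weights, gradients, and hyperparameter values are invented for illustration):

import numpy as np

def sgd_step_with_weight_decay(w, grad_cost, eta=0.1, lam=1.0, n=1000):
    # (1 - eta*lam/n) shrinks the weight toward zero ("decay"),
    # then the usual gradient step is applied on top of it.
    return (1 - eta * lam / n) * w - eta * grad_cost

w = np.array([0.8, -0.5, 0.05, -0.01])      # four imaginary weights
grad = np.array([0.2, 0.2, -0.1, 0.1])      # made-up cost derivatives
print(sgd_step_with_weight_decay(w, grad))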

How does regularization parameter work in regularization?

In a machine learning cost function, if we want to minimize the influence of two parameters, let's say theta3 and theta4, it seems like we have to give them a large regularization parameter, as in the cost function below:
J(theta) = (1/2m) * sum_i (h(x_i) - y_i)^2 + lambda * theta3^2 + lambda * theta4^2, with a large lambda
I am not quite sure why the bigger regularization parameter reduces the influence instead of increasing it. How does this function work?
It is because the optimum values of the thetas are found by minimizing the cost function.
As you increase the regularization parameter, the optimization will have to choose smaller thetas in order to minimize the total cost.
Quoting from a similar question's answer:
At a high level you can think of regularization parameters as applying a kind of Occam's razor that favours simple solutions. The complexity of models is often measured by the size of the model w viewed as a vector. The overall loss function, as in your example above, consists of an error term and a regularization term that is weighted by λ, the regularization parameter. So the regularization term penalizes complexity (regularization is sometimes also called penalty). It is useful to think about what happens if you are fitting a model by gradient descent. Initially your model is very bad and most of the loss comes from the error terms, so the model is adjusted primarily to reduce the error term. Usually the magnitude of the model vector increases as the optimization progresses. As the model is improving and the model vector is growing, the regularization term becomes a more significant part of the loss. Regularization prevents the model vector growing arbitrarily for negligible reductions in the error. λ just determines the relative importance of keeping the model simple relative to reducing training error.
There are different types of regularization terms in common use. The one you have, and the one most commonly used in SVMs, is L2 regularization. It has the side effect of spreading weight more evenly between the components of the model vector. The main alternative is L1 or lasso regularization, which has the form λ∑i|wi|, i.e. it penalizes the sum of the absolute values of the model parameters. It favors concentrating the size of the model in only a few components, the opposite of L2 regularization. Generally L2 tends to be preferable for low dimensional models, while lasso tends to work better for high dimensional models like text classification, where it leads to sparse models, i.e. models with few non-zero parameters.
There is also elastic net regularization, which is just a weighted combination of L1 and L2 regularization. So you have 3 terms in your loss function: error term and the 2 regularization terms each with its own regularization parameter.
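Written out, the elastic net objective is just those three terms; here is a tiny NumPy sketch (all names are illustrative):

import numpy as np

def elastic_net_loss(error_term, w, lam_l1=0.01, lam_l2=0.01):
    # error term + L1 penalty + L2 penalty, each with its own regularization parameter
    return error_term + lam_l1 * np.sum(np.abs(w)) + lam_l2 * np.sum(w ** 2)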
You said that you want to minimize the influence of two parameters, theta3 and theta4, meaning those two are both NOT important, so we are going to tell the model we want to fit by:
minimize the weights of theta3 and theta4 cause they don't really matter
And here is the learning process of the model:
Given theta3 and theta4 a really big multiplier lambda, when theta3 or theta4 grows, your loss function grows heavily because they (theta3 and theta4) both have a big multiplier (lambda). To minimize your objective function (the loss function), both theta3 and theta4 can only be given very small values, which says that they are not important.
As the regularization parameter increases from 0 to infinity, the residual sum of squares on the training data increases, the variance of the model decreases, and the bias increases.
I will try to explain it in the most simple language. I think what you are asking is how adding a regularization term at the end decreases the values of parameters like theta3 and theta4 here.
So, let's first assume you added this to the end of your loss function, which should massively increase the loss, making the function more biased compared to before. Now we will use an optimization method, let's say gradient descent, whose job is to find all the values of theta. Remember that up to this point we don't have any values of theta, and if you solve it you will realize that the values of theta are going to be different from what you would get if you hadn't used the regularization term at the end. To be exact, they are going to be smaller for theta3 and theta4.
So this will make sure your hypothesis has more bias and less variance. In simple terms, it will make the equation a bit worse, or not as exact as before, but it will generalize the equation better.
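Here is a toy gradient-descent sketch of that argument (the data, lambda values, and learning rate are all made up): a heavy penalty is placed only on theta3 and theta4, and those two end up shrunk toward zero while the other parameters remain free to fit the data.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -3.0, 1.0, 0.5]) + 0.1 * rng.normal(size=100)

theta = np.zeros(4)
lam = np.array([0.0, 0.0, 100.0, 100.0])   # heavy penalty on the last two parameters only
lr = 0.005
for _ in range(5000):
    # gradient of the squared-error term plus the per-parameter penalty term
    grad = X.T @ (X @ theta - y) / len(y) + lam * theta
    theta -= lr * grad

print(theta)   # the two penalized parameters end up much closer to zero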

Why does keeping model weights low (with addition of regularization parameter) allow the model to better fit unseen / test data?

Consider the linear regression model with cost function:
J(w) = (1/2m) * sum_i (w^T x_i - y_i)^2 + λ * ||w||^2
Here w denotes the weights of the model.
We add the regularization term to avoid overfitting the data. The regularization term discourages the use of large weights in favor of smaller weights by penalizing the model according to the weights of the model.
The question is:
Why does keeping model weights low (with addition of regularization parameter) reduce the variance i.e. allow the model to better fit unseen / test data?
Also, how does reducing the variance increase the bias?
If you look at chapter 7 of Elements of Statistical Learning (available online for free at https://web.stanford.edu/~hastie/Papers/ESLII.pdf), you'll see on page 223 that the expected loss E[(w^T x - y)^2] can be broken down into three parts: an irreducible error term, a squared bias term, and a variance term. As described in chapter 7 there, increasing the number of effective parameters p increases variance and decreases bias. The chapter also describes how increasing regularization strength decreases the effective number of parameters, which is defined to be the trace of the hat matrix.
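To illustrate that last point, here is a small NumPy sketch (toy data, not from the book) computing the effective number of parameters for ridge regression, trace(X (X^T X + λI)^{-1} X^T), as λ grows:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

def effective_dof(X, lam):
    # trace of the ridge "hat" matrix  H = X (X^T X + lam*I)^{-1} X^T
    p = X.shape[1]
    H = X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T
    return np.trace(H)

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, round(effective_dof(X, lam), 2))
# As lam increases, the effective number of parameters drops from 10 toward 0.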
