I am currently going through sparse autoencoders. What I understood is that we don't need all hidden units to fire for every input; rather, only some specific hidden units should fire depending on the type of input. For this we add a sparsity regularization term to the loss function. But I am unable to understand how adding this regularization term to the loss function stops certain hidden units from firing.
But I am unable to understand how adding this regularization term to the loss function stops certain hidden units from firing.
Because this regularization term precisely penalizes excessive activations. As opposed to conventional regularization, in which we penalize the weights through the L1 or L2 norms, here we penalize the output of the activation functions by a scale factor. This ensures that only a subset of neurons in a hidden layer is activated for a given input, which overall yields better results, since you end up with more "specialized" neurons that only fire for specific inputs, rather than all of them firing for everything.
So just think of it the way conventional regularization works. By adding a regularization term to the loss function in a Lasso regression, we penalize high coefficients through the L1 norm, so that minimizing the loss function yields smaller weights. In sparsity regularization we instead shrink the activation vectors, reducing the subset of neurons that fire for each input.
I have started recently with ML and TensorFlow. While going through the CIFAR10-tutorial on the website I came across a paragraph which is a bit confusing to me:
The usual method for training a network to perform N-way classification is multinomial logistic regression, aka. softmax regression. Softmax regression applies a softmax nonlinearity to the output of the network and calculates the cross-entropy between the normalized predictions and a 1-hot encoding of the label. For regularization, we also apply the usual weight decay losses to all learned variables. The objective function for the model is the sum of the cross entropy loss and all these weight decay terms, as returned by the loss() function.
I have read a few answers on the forum about what weight decay is, and my understanding is that it is used for regularization, so that the learned weights give lower loss and higher accuracy.
Now, from the text above I understand that loss() is made up of the cross-entropy loss (which measures the difference between the predictions and the correct labels) and the weight decay loss.
I am clear on the cross-entropy loss, but what is this weight decay loss, and why is it not just called weight decay? How is this loss calculated?
Weight decay is nothing but L2 regularisation of the weights, which can be achieved using tf.nn.l2_loss.
The loss function with regularisation is given by:

J(theta) = Loss(theta) + (lambda/2) * sum(theta^2)
The second term of the above equation defines the L2-regularization of the weights (theta). It is generally added to avoid overfitting. This penalises peaky weights and makes sure that all the inputs are considered. (A few peaky weights would mean that only the inputs connected to them are considered for decision making.)
During the gradient descent parameter update, the above L2 regularization ultimately means that every weight is decayed linearly: W_new = (1 - alpha*lambda) * W_old - alpha * dJ/dW. That's why it's generally called weight decay.
It is called weight decay loss because it is added to the cost function (the loss, to be specific). Parameters are optimized from the loss, and by expressing weight decay as a loss term its effect is visible to the entire network through the loss function.
TF L2 loss
Cost = Model_Loss(W) + decay_factor*L2_loss(W)
# In TensorFlow it basically computes half the squared L2 norm
L2_loss = sum(W ** 2) / 2
What your tutorial is trying to say by "weight decay loss" is that compared to the cross-entropy cost you know from your unregularized models (i.e. how far off target your model's predictions were on training data), your new cost function penalizes not only prediction error but also the magnitude of the weights in your network. Whereas before you were optimizing only for correct prediction of the labels in your training set, now you are optimizing for correct label prediction as well as having small weights. The reason for this modification is that when a machine learning model trained by gradient descent yields large weights, it is likely they were arrived at in response to peculiarities (or noise) in the training data. The model will not perform as well when exposed to held-out test data because it is overfit to the training set. The result of applying weight decay loss, more commonly called L2 regularization, is that accuracy on training data will drop a bit but accuracy on test data can jump dramatically. And that's what you're after in the end: a model that generalizes well to data it did not see during training.
So you can get a firmer grasp on the mechanics of weight decay, let's look at the learning rule for weights in an L2-regularized network:

w_new = (1 - (eta*lambda)/n) * w_old - eta * dC/dw

where eta and lambda are the user-defined learning rate and regularization parameter, respectively, and n is the number of training examples (you'll have to look up those Greek letters if you're not familiar). Since the values eta and (eta*lambda)/n are both constants for a given iteration of training, it's enough to interpret the learning rule for weight decay as "for a given weight, subtract a small multiple of the derivative of the cost function with respect to that weight, and subtract a small multiple of the weight itself."
Let's look at four weights in an imaginary network and how the above learning rule affects them. As you can see, the regularization term shown in red pushes weights toward zero no matter what. It is designed to minimize the magnitude of the weight matrix, which it does by minimizing the absolute values of individual weights. Some key things to notice in these plots:
When the sign of the cost derivative and the sign of the weight are the same, the regularization term accelerates the weight's path to its optimum!
The amount that the regularization term affects the weight update is proportional to the current value of that weight. I've shown this in the plots with tiny red arrows showing contributions of weights with current values close to zero, and larger red arrows for weights with larger current magnitudes.
In a machine learning cost function, if we want to minimize the influence of two parameters, let's say theta3 and theta4, it seems like we have to give a large value to the regularization parameter, as in the equation below:

Cost(theta) = Error(theta) + lambda * (theta3^2 + theta4^2)
I am not quite sure why the bigger regularization parameter reduces the influence instead of increasing it. How does this function work?
It is because the optimum values of the thetas are found by minimizing the cost function.
As you increase the regularization parameter, optimization function will have to choose a smaller theta in order to minimize the total cost.
Quoting from similar question's answer:
At a high level you can think of regularization parameters as applying a kind of Occam's razor that favours simple solutions. The complexity of a model is often measured by the size of the model w viewed as a vector. The overall loss function, as in your example above, consists of an error term and a regularization term that is weighted by λ, the regularization parameter. So the regularization term penalizes complexity (regularization is sometimes also called penalty). It is useful to think about what happens if you are fitting a model by gradient descent. Initially your model is very bad and most of the loss comes from the error terms, so the model is adjusted primarily to reduce the error term. Usually the magnitude of the model vector increases as the optimization progresses. As the model improves and the model vector grows, the regularization term becomes a more significant part of the loss. Regularization prevents the model vector growing arbitrarily for negligible reductions in the error. λ just determines the relative importance of keeping the model simple relative to reducing training error.
There are different types of regularization terms in common use. The one you have, and the most commonly used in SVMs, is L2 regularization. It has the side effect of spreading weight more evenly between the components of the model vector. The main alternative is L1 or lasso regularization, which has the form λ∑i|wi|, i.e. it penalizes the sum of the absolute values of the model parameters. It favors concentrating the size of the model in only a few components, the opposite of L2 regularization. Generally L2 tends to be preferable for low-dimensional models, while lasso tends to work better for high-dimensional models like text classification, where it leads to sparse models, i.e. models with few non-zero parameters.
There is also elastic net regularization, which is just a weighted combination of L1 and L2 regularization. So you have 3 terms in your loss function: error term and the 2 regularization terms each with its own regularization parameter.
You said that you want to minimize the influence of two parameters, theta3 and theta4, meaning those two are both NOT important, so we are going to tell the model we want to fit:
minimize the weights theta3 and theta4, because they don't really matter
And here is the learning process of the model:
Give theta3 and theta4 a really big multiplier lambda. Whenever theta3 or theta4 grows, your loss function grows heavily, because both of them carry that big multiplier (lambda). So to minimize your objective function (the loss function), theta3 and theta4 can only take very small values, which is exactly what "they are not important" means.
As the regularization parameter increases from 0 to infinity in linear regression, the residual sum of squares on the training data increases, the variance of the model decreases, and the bias increases.
I will try to put it in the most simple language. I think what you are asking is: how does adding a regularization term at the end decrease the value of parameters like theta3 and theta4 here?
So, let's first assume you added this term to the end of your loss function, which should massively increase the loss, making the function more biased compared to before. Now we use some optimization method, say gradient descent, whose job is to find all the values of theta. Remember that until this point we don't have any values of theta, and if you solve the optimization you will realize that the values of theta are going to be different from what you would have obtained without the regularization term at the end. To be exact, they are going to be smaller for theta3 and theta4.
So this will make sure your hypothesis has more bias and less variance. In simple terms, it will make the fit a bit worse, or not as exact as before, but it will generalize better.
Most examples of neural networks for classification tasks I've seen use a softmax layer as the output activation function. Normally, the other hidden units use a sigmoid, tanh, or ReLU function as the activation function. Using the softmax function here would - as far as I know - work out mathematically too.
What are the theoretical justifications for not using the softmax function as hidden layer activation functions?
Are there any publications about this, something to quote?
I haven't found any publications about why using softmax as an activation in a hidden layer is not the best idea (except a Quora question, which you have probably already read), but I will try to explain why:
1. Variable independence: a lot of regularization and effort is put into keeping your variables independent, uncorrelated and quite sparse. If you use a softmax layer as a hidden layer, then you will keep all your nodes (hidden variables) linearly dependent, which may result in many problems and poor generalization (see the small sketch after this list).
2. Training issues: try to imagine that to make your network work better you have to make part of the activations from your hidden layer a little bit lower. Then, automatically, you are forcing the rest of them to have a higher mean activation, which might in fact increase the error and harm your training phase.
3. Mathematical issues: by creating constraints on the activations of your model you decrease its expressive power without any logical explanation. Striving to keep all the activations tied together this way is not worth it in my opinion.
4. Batch normalization does it better: one may consider the fact that a constant mean output from a layer may be useful for training. But on the other hand, a technique called batch normalization has already been proven to work better, whereas it has been reported that setting softmax as the activation function in a hidden layer may decrease the accuracy and the speed of learning.
Actually, Softmax functions are already used deep within neural networks, in certain cases, when dealing with differentiable memory and with attention mechanisms!
Softmax layers can be used within neural networks such as in Neural Turing Machines (NTM) and an improvement of those which are Differentiable Neural Computer (DNC).
To summarize, those architectures are RNNs/LSTMs which have been modified to contain a differentiable (neural) memory matrix which can be written to and read from across time steps.
Quickly explained, the softmax function here normalizes a read (fetch) of the memory, and similar tricks are used for content-based addressing of the memory. About that, I really liked this article which illustrates the operations in an NTM and other recent RNN architectures with interactive figures.
Moreover, Softmax is used in attention mechanisms for, say, machine translation, such as in this paper. There, the Softmax enables a normalization of the places to which attention is distributed in order to "softly" retain the maximal place to pay attention to: that is, to also pay a little bit of attention to elsewhere in a soft manner. However, this could be considered to be a mini neural network that deals with attention, within the big one, as explained in the paper. Therefore, it could be debated whether or not Softmax is used only at the end of neural networks.
Hope it helps!
Edit - More recently, it's even possible to see Neural Machine Translation (NMT) models where only attention (with softmax) is used, without any RNN nor CNN: http://nlp.seas.harvard.edu/2018/04/03/attention.html
Use a softmax activation wherever you want to model a multinomial distribution. This may be (usually) an output layer y, but can also be an intermediate layer, say a multinomial latent variable z. As mentioned in this thread for outputs {o_i}, sum({o_i}) = 1 is a linear dependency, which is intentional at this layer. Additional layers may provide desired sparsity and/or feature independence downstream.
Page 198 of Deep Learning (Goodfellow, Bengio, Courville)
Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.
Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable.
The softmax function is used for the output layer only (at least in most cases) to ensure that the sum of the components of the output vector is equal to 1 (for clarity, see the formula of the softmax cost function). This also gives the probability of occurrence of each component (class) of the output, and hence the sum of the probabilities (or output components) is equal to 1.
The softmax function is one of the most important output functions used in deep learning within neural networks (see Understanding Softmax in Minutes by Uniqtech). The softmax function is applied where there are three or more classes of outcomes. The softmax formula takes e raised to the score of each value and divides it by the sum of e raised to all the scores. For example, if I know the logit scores of these four classes to be [3.00, 2.0, 1.00, 0.10], then in order to obtain the probability outputs, the softmax function can be applied as follows:
import numpy as np

def softmax(x):
    # subtract the max for numerical stability before exponentiating
    z = np.exp(x - np.max(x))
    return z / z.sum()

scores = [3.00, 2.0, 1.00, 0.10]
print(softmax(scores))
Output: probabilities (p) = 0.642 0.236 0.087 0.035
The sum of all probabilities (p) = 0.642 + 0.236 + 0.087 + 0.035 = 1.00. You can substitute any values you like into the scores above and you will get different probabilities, but the sum of all the values will still be equal to one. That makes sense, because the sum of all probabilities is equal to one; the softmax turns logit scores into probability scores, so that we can predict better. Finally, the softmax output can help us to understand and interpret a multinomial logit model.
Looking at an example 'solver.prototxt', posted on BVLC/caffe git, there is a training meta parameter
weight_decay: 0.04
What does this meta parameter mean? And what value should I assign to it?
The weight_decay meta parameter governs the regularization term of the neural net.
During training a regularization term is added to the network's loss to compute the backprop gradient. The weight_decay value determines how dominant this regularization term will be in the gradient computation.
As a rule of thumb, the more training examples you have, the weaker this term should be. The more parameters you have (e.g., a deeper net, larger filters, larger InnerProduct layers, etc.), the higher this term should be.
Caffe also allows you to choose between L2 regularization (default) and L1 regularization, by setting
regularization_type: "L1"
However, since in most cases the weights are small numbers (i.e., -1<w<1), the L2 penalty on the weights (sum of squares) is significantly smaller than their L1 penalty (sum of absolute values). Thus, if you choose to use regularization_type: "L1" you might need to tune weight_decay to a significantly smaller value.
While learning rate may (and usually does) change during training, the regularization weight is fixed throughout.
Weight decay is a regularization term that penalizes big weights.
When the weight decay coefficient is big the penalty for big weights is also big, when it is small weights can freely grow.
Look at this answer (not specific to caffe) for a better explanation:
Difference between neural net "weight decay" and "learning rate".
I'm personally studying theories of neural networks and have some questions.
In many books and references, the hyperbolic tangent (tanh) function was used as the activation function for the hidden layers.
The books gave a really simple reason: linear combinations of tanh functions can describe nearly any shape of function within a given error.
But a question came up.
Is this the real reason why the tanh function is used?
If so, is it the only reason why the tanh function is used?
If so, is the tanh function the only function that can do that?
If not, what is the real reason?
I'm stuck here going in circles... please help me out of this mental(?) trap!
Most of the time tanh converges more quickly than the sigmoid and logistic functions, and gives better accuracy [1]. However, more recently the rectified linear unit (ReLU) was proposed by Hinton [2], and it was shown that ReLU trains six times faster than tanh [3] to reach the same training error. You can refer to [4] to see what benefits ReLU provides.
Based on about two years of machine learning experience, I want to share some strategies that most papers use, and my own experience in computer vision.
Normalizing input is very important
Normalizing well can give better performance and faster convergence. Most of the time we subtract the mean value to make the input mean zero, to prevent the weights from all changing in the same direction and hence converging slowly [5]. Recently Google also pointed out this phenomenon, as internal covariate shift, when training deep networks, and they proposed batch normalization [6] to normalize each vector to have zero mean and unit variance.
More data more accuracy
More training data can cover the feature space better and prevent overfitting. In computer vision, if the training data is not enough, the most commonly used techniques to increase the training dataset are data augmentation and synthesizing training data.
Choosing a good activation function allows training better and more efficiently.
The ReLU nonlinear activation has worked better and achieved state-of-the-art results in deep learning and MLPs. Moreover, it has some benefits, e.g., it is simple to implement and its back-propagation computation is cheaper, so deeper neural nets can be trained efficiently. However, ReLU gets zero gradient and does not train when the unit's output is zero (inactive). Hence some modified ReLUs have been proposed, e.g., Leaky ReLU and Noisy ReLU, and the most popular variant is PReLU [7], proposed by Microsoft, which generalizes the traditional rectified unit.
Others
choose a large initial learning rate, as long as it does not oscillate or diverge, so as to find a better global minimum.
shuffling data
In truth both tanh and logistic functions can be used. The idea is that you can map any real number ( [-Inf, Inf] ) to a number between [-1 1] or [0 1] for the tanh and logistic respectively. In this way, it can be shown that a combination of such functions can approximate any non-linear function.
Now, the reason for preferring tanh over the logistic function is that the former is symmetric around 0 while the latter is not. This makes the logistic function more prone to saturating the later layers, making training more difficult.
To add to the already existing answer, the preference for symmetry around 0 isn't just a matter of aesthetics. An excellent text by LeCun et al., "Efficient BackProp", shows in great detail why it is a good idea that the input, output and hidden layers have mean values of 0 and standard deviation of 1.
Update in an attempt to appease commenters: based purely on observation, rather than the theory covered above, tanh and ReLU activation functions are more performant than sigmoid. Sigmoid also seems to be more prone to local optima, or at least extended 'flat line' issues. For example, try limiting the number of features to force logic into network nodes in XOR, and sigmoid rarely succeeds whereas tanh and ReLU have more success.
Tanh seems maybe slower than ReLU for many of the given examples, but produces more natural-looking fits for the data using only linear inputs, as you describe. For example, a circle versus a square/hexagon shape.
http://playground.tensorflow.org/ <- this site is a fantastic visualisation of activation functions and other parameters to neural network. Not a direct answer to your question but the tool 'provides intuition' as Andrew Ng would say.
Many of the answers here describe why tanh (i.e. (e^2x - 1) / (e^2x + 1)) is preferable to the sigmoid/logistic function (1 / (1 + e^-x)), but it should be noted that there is a good reason why these are the two most common alternatives, and it should be understood: during training of an MLP using the back-propagation algorithm, the algorithm requires the value of the derivative of the activation function at the point of activation of each node in the network. While this can generally be calculated for most plausible activation functions (except those with discontinuities, which are a bit of a problem), doing so often requires expensive computations and/or storing additional data (e.g. the value of the input to the activation function, which is not otherwise required after the output of each node is calculated). Tanh and the logistic function, however, both have very simple and efficient calculations for their derivatives that can be obtained from the output of the function itself; i.e. if the node's weighted sum of inputs is v and its output is u, we need to know du/dv, which can be calculated from u rather than the more traditional v: for tanh it is 1 - u^2 and for the logistic function it is u * (1 - u). This fact makes these two functions more efficient to use in a back-propagation network than most alternatives, so a compelling reason would usually be required to deviate from them.
In theory I am in accord with the above responses. In my experience, some problems have a preference for sigmoid rather than tanh, probably due to the nature of these problems (since there are non-linear effects, it is difficult to understand why).
Given a problem, I generally optimize networks using a genetic algorithm. The activation function of each element of the population is chosen at random from a set of possibilities (sigmoid, tanh, linear, ...). For about 30% of the classification problems, the best element found by the genetic algorithm has sigmoid as its activation function.
In deep learning the ReLU has become the activation function of choice because the math is much simpler than for sigmoid-shaped activation functions such as tanh or the logistic function, especially if you have many layers. To assign weights using backpropagation, you normally calculate the gradient of the loss function and apply the chain rule for hidden layers, meaning you need the derivative of the activation functions. ReLU is a ramp function: it has a flat part where the derivative is 0 and a linear part where the derivative is 1. This makes the math really easy. If you use the hyperbolic tangent you might run into the vanishing gradient problem, meaning that if x is smaller than -2 or bigger than 2, the derivative gets really small and your network might not converge, or you might end up having a dead neuron that does not fire anymore.