Noise injection to the weights in MLP - machine-learning

I'm reading the book "Deep Learning" by I. Goodfellow, Y. Bengio, and A. Courville.
On page 242, they discuss noise added to the weights of an MLP, which acts as a form of regularization.
J is the least-squares cost function for the problem without noise, and we train an MLP (with weights W to optimize) that maps each input x from the training set (x,y) to a prediction yhat.
The conclusion is that, for small eta, minimizing the cost function with added weight noise (with covariance eta Id) is equivalent to minimizing J with an additional regularization term:
eta E[||grad_W yhat(x)||^2]
However, I cannot see how to obtain this conclusion from (7.32) [i.e. from the definition of the cost function for this new problem]. I tried a first-order Taylor expansion with respect to the weights W, but I cannot reach the conclusion.
Does someone know the detailed calculation that leads to this result (using the same notation)?
[I don't have enough reputation to post the picture; I hope the post is straightforward enough to understand.]
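One possible route to the stated term is sketched below. It rests on assumptions that are not spelled out in the post: the weight noise \epsilon_W has zero mean, covariance \eta I, and is independent of (x, y); the perturbed prediction is written \hat{y}_{\epsilon_W}(x) and the noisy cost \tilde{J}_W (both names are chosen here for illustration); and the expansion is truncated at first order in \epsilon_W.

```latex
\begin{align*}
\hat{y}_{\epsilon_W}(x) &\approx \hat{y}(x) + \epsilon_W^\top \nabla_W \hat{y}(x) \\
\tilde{J}_W &= \mathbb{E}_{x,y,\epsilon_W}\big[(\hat{y}_{\epsilon_W}(x) - y)^2\big] \\
&\approx \mathbb{E}\big[(\hat{y}(x) - y)^2\big]
  + 2\,\mathbb{E}\big[(\hat{y}(x) - y)\,\epsilon_W^\top \nabla_W \hat{y}(x)\big]
  + \mathbb{E}\big[(\epsilon_W^\top \nabla_W \hat{y}(x))^2\big] \\
&= J + 0
  + \mathbb{E}_{x}\big[\nabla_W \hat{y}(x)^\top\,
    \mathbb{E}[\epsilon_W \epsilon_W^\top]\, \nabla_W \hat{y}(x)\big] \\
&= J + \eta\,\mathbb{E}_{x}\big[\|\nabla_W \hat{y}(x)\|^2\big]
\end{align*}
```

The cross term vanishes because \epsilon_W has zero mean and is independent of the data. Note that a second-order term of the Taylor expansion would contribute \eta E[(\hat{y}(x) - y) tr(\nabla_W^2 \hat{y}(x))], which is of the same order in \eta; the first-order expansion drops it, so the result should be read as holding when the residuals \hat{y}(x) - y are small.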

Related

What is weight decay loss?

I have started recently with ML and TensorFlow. While going through the CIFAR10-tutorial on the website I came across a paragraph which is a bit confusing to me:
The usual method for training a network to perform N-way classification is multinomial logistic regression, aka. softmax regression. Softmax regression applies a softmax nonlinearity to the output of the network and calculates the cross-entropy between the normalized predictions and a 1-hot encoding of the label. For regularization, we also apply the usual weight decay losses to all learned variables. The objective function for the model is the sum of the cross entropy loss and all these weight decay terms, as returned by the loss() function.
I have read a few answers on the forum about what weight decay is, and I gather that it is used for regularization, so that the weights are learned in a way that gives lower loss and higher accuracy.
Now, in the text above, I understand that loss() is made up of the cross-entropy loss (which measures the difference between the predictions and the correct label values) and the weight decay loss.
I am clear on cross-entropy loss, but what is this weight decay loss, and why not just weight decay? How is this loss calculated?
Weight decay is nothing but L2 regularisation of the weights, which can be achieved using tf.nn.l2_loss.
The loss function with regularisation is given by:
J(theta) = J_0(theta) + (lambda/2) * sum_j theta_j^2
where J_0 is the unregularised loss and lambda is the regularisation strength.
The second term of the above equation defines the L2-regularization of the weights (theta). It is generally added to avoid overfitting. This penalises peaky weights and makes sure that all the inputs are considered. (A few peaky weights mean that only the inputs connected to them are considered for decision making.)
During the gradient descent parameter update, the above L2 regularization ultimately means that every weight is decayed linearly: W_new = (1 - alpha*lambda) * W_old - alpha * dJ_0/dW, where alpha is the learning rate. That's why it is generally called weight decay.
It is called weight decay "loss" because it is added to the cost function (the loss, to be specific). Parameters are optimized from the loss, so adding weight decay to the loss makes its effect visible to the entire network.
TF L2 loss
Cost = Model_Loss(W) + decay_factor*L2_loss(W)
# In TensorFlow, tf.nn.l2_loss basically computes half the squared L2 norm
L2_loss = sum(W ** 2) / 2
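To make the cost above concrete, here is a minimal NumPy sketch. It is not the tutorial's code: the helper names `l2_loss` and `total_cost`, the weight values, the model loss of 0.7 and the decay factor of 0.01 are all made up for illustration, and `l2_loss` only mirrors the half-sum-of-squares definition written above.

```python
import numpy as np

def l2_loss(w):
    """Half the sum of squared entries, matching L2_loss = sum(W ** 2) / 2."""
    return np.sum(w ** 2) / 2.0

def total_cost(model_loss, weights, decay_factor):
    """Model loss plus one weight decay term per weight array."""
    return model_loss + decay_factor * sum(l2_loss(w) for w in weights)

# Hypothetical weights of a tiny two-layer model.
weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([3.0, -1.0])]
cost = total_cost(model_loss=0.7, weights=weights, decay_factor=0.01)
print(cost)
```

The weight decay part of the cost depends only on the weights, not on the data, which is why it shows up as a separate "loss" added to the cross entropy.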
What your tutorial is trying to say by "weight decay loss" is that, compared to the cross-entropy cost you know from your unregularized models (i.e. how far off target your model's predictions were on training data), your new cost function penalizes not only prediction error but also the magnitude of the weights in your network. Whereas before you were optimizing only for correct prediction of the labels in your training set, now you are optimizing for correct label prediction as well as for small weights. The reason for this modification is that when a machine learning model trained by gradient descent yields large weights, it is likely they were arrived at in response to peculiarities (or noise) in the training data. The model will not perform as well when exposed to held-out test data because it is overfit to the training set. The result of applying weight decay loss, more commonly called L2 regularization, is that accuracy on training data will drop a bit but accuracy on test data can jump dramatically. And that's what you're after in the end: a model that generalizes well to data it did not see during training.
So that you can get a firmer grasp on the mechanics of weight decay, let's look at the learning rule for weights in an L2-regularized network:
w -> w - eta * dC/dw - (eta*lambda/n) * w
where C is the cost function, eta and lambda are the user-defined learning rate and regularization parameter, respectively, and n is the number of training examples (you'll have to look up those Greek letters if you're not familiar with them). Since the values eta and (eta*lambda)/n are both constants for a given iteration of training, it's enough to interpret the learning rule for weight decay as "for a given weight, subtract a small multiple of the derivative of the cost function with respect to that weight, and subtract a small multiple of the weight itself."
Let's look at four weights in an imaginary network and how the above learning rule affects them. As you can see, the regularization term shown in red pushes weights toward zero no matter what. It is designed to minimize the magnitude of the weight matrix, which it does by minimizing the absolute values of individual weights. Some key things to notice in these plots:
When the sign of the cost derivative and the sign of the weight are the same, the regularization term accelerates the weight's path to its optimum!
The amount that the regularization term affects the weight update is proportional to the current value of that weight. I've shown this in the plots with tiny red arrows showing contributions of weights with current values close to zero, and larger red arrows for weights with larger current magnitudes.
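Since the plots are not reproduced here, the two observations can be checked numerically. The sketch below is only an illustration: the function name `l2_update`, the four weight values and the hyperparameters are invented, and the cost derivative is set to zero so that only the regularization part of the step is visible.

```python
import numpy as np

def l2_update(w, grad, eta, lam, n):
    """One step of the rule: w - eta * grad - (eta * lam / n) * w."""
    return w - eta * grad - (eta * lam / n) * w

# Four hypothetical weights: two large in magnitude, two close to zero.
w = np.array([2.0, -2.0, 0.1, -0.1])
zero_grad = np.zeros_like(w)          # isolate the regularization term

w_next = l2_update(w, zero_grad, eta=0.5, lam=1.0, n=10)
decay_step = w_next - w               # contribution of the regularization term

print(w_next)
print(decay_step)
```

Every entry of `decay_step` has the opposite sign to its weight (a push toward zero), and its size scales with the weight's current magnitude.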

Minibatch SGD gradient computation- average or sum

I am trying to understand how the gradients are computed when using minibatch SGD. I have implemented it in the CS231 online course, but only now came to realize that in the intermediate layers the gradient is basically the sum of the gradients computed for each sample (the same holds for the implementations in Caffe or TensorFlow). It is only in the last layer (the loss) that they are averaged over the number of samples.
Is this correct? If so, does it mean that, since they are averaged in the last layer, all the gradients are also averaged automatically when doing backprop?
Thanks!
It is best to understand why SGD works first.
A neural network is essentially a very complex composite function of an input vector x, a label y (or target variable, depending on whether the problem is classification or regression) and some parameter vector w. Assume that we are working on classification. We are actually trying to do maximum likelihood estimation (strictly, MAP estimation, since we are almost certainly going to use L2 or L1 regularization, but that is too much technicality for now) for the parameter vector w. Assuming that the samples are independent, we have the following likelihood to maximize:
p(y1|w,x1)p(y2|w,x2) ... p(yN|w,xN)
Optimizing this with respect to w is a mess because all of these probabilities are multiplied together (this will produce an insanely complicated derivative with respect to w). We use log probabilities instead (taking the log does not change the extreme points), negate the sum so that it becomes a cost to minimize, and divide by N so that we can treat our training set as an empirical probability distribution, p(x):
J(X,Y,w)=-(1/N)(log p(y1|w,x1) + log p(y2|w,x2) + ... + log p(yN|w,xN))
This is the actual cost function we have. What the neural network actually does is to model the probability function p(yi|w,xi). This can be a very complex 1000+ layered ResNet or just a simple perceptron.
Now the derivative for w is simple to state, since we have an addition now:
dJ(X,Y,w)/dw = -(1/N)(dlog p(y1|w,x1)/dw + dlog p(y2|w,x2)/dw + ... + dlog p(yN|w,xN)/dw)
Ideally, the above is the actual gradient. But this batch calculation is not easy to compute. What if we are working on a dataset with 1M training samples? Worse, the training set may be a stream of samples x, which has an infinite size.
The Stochastic part of the SGD comes into play here. Pick m samples with m << N randomly and uniformly from the training set and calculate the derivative by using them:
dJ(X,Y,w)/dw =(approx) dJ'/dw = -(1/m)(dlog p(y1|w,x1)/dw + dlog p(y2|w,x2)/dw + ... + dlog p(ym|w,xm)/dw)
Remember that we had an empirical (or actual, in the case of an infinite training set) data distribution p(x). The above operation of drawing m samples from p(x) and averaging their gradients produces an unbiased estimator, dJ'/dw, of the actual derivative dJ(X,Y,w)/dw. What does that mean? Take many such m-sample batches, calculate a dJ'/dw estimate from each, average those estimates as well, and you get very close to dJ(X,Y,w)/dw (exactly, in the limit of infinite sampling). It can be shown that these noisy but unbiased gradient estimates will behave like the original gradient in the long run. On average, SGD will follow the actual gradient's path (but it can get stuck at a different local minimum, depending on the choice of learning rate). The minibatch size m is directly related to the inherent error in the noisy estimate dJ'/dw. If m is large, you get gradient estimates with low variance and can use larger learning rates. If m is small or m=1 (online learning), the variance of the estimator dJ'/dw is very high and you should use smaller learning rates, or the algorithm may easily diverge.
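The unbiasedness claim can be checked on a toy problem. The sketch below assumes a logistic model for p(y|w,x) on random synthetic data; the sizes, the seed and the helper name `nll_grad` are arbitrary choices for illustration, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 1000, 3, 10
X = rng.normal(size=(N, d))
y = (rng.random(N) < 0.5).astype(float)   # arbitrary binary labels
w = rng.normal(size=d)

def nll_grad(Xb, yb, w):
    """Gradient of the mean negative log-likelihood of a logistic model."""
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    return Xb.T @ (p - yb) / len(yb)

full_grad = nll_grad(X, y, w)              # uses all N samples

# Many minibatch estimates, each from m uniformly drawn samples.
estimates = []
for _ in range(2000):
    idx = rng.choice(N, size=m, replace=False)
    estimates.append(nll_grad(X[idx], y[idx], w))
estimates = np.array(estimates)
mean_estimate = estimates.mean(axis=0)

print(full_grad)
print(mean_estimate)
```

Individual rows of `estimates` scatter around `full_grad`, while their average lands close to it; increasing m shrinks the scatter of the individual estimates.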
Now, enough theory. Your actual question was:
It is only in the last layer (the loss) that they are averaged by the number of samples. Is this correct? if so, does it mean that since in the last layer they are averaged, when doing backprop, all the gradients are also averaged automatically? Thanks!
Yes, it is enough to divide by m in the last layer, since the chain rule will propagate the factor (1/m) to all parameters once the gradient at the loss layer is multiplied by it. You do not need to divide again separately for each parameter; doing so would apply the factor twice and give an incorrect gradient.
In the last layer they are averaged, and in the previous are summed. The summed gradients in previous layers are summed across different nodes from the next layer, not by the examples. This averaging is done only to make the learning process behave similarly when you change the batch size -- everything should work the same if you sum all the layers, but decrease the learning rate appropriately.
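A small hand-written backprop makes this visible. The network below (one tanh hidden layer, squared error) and all its sizes are invented for illustration; the only point is that the factor 1/m is applied once, at the loss, and the sums over samples in the earlier layers then already carry it.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d_in, d_hid = 4, 3, 5
X = rng.normal(size=(m, d_in))
y = rng.normal(size=m)
W1 = rng.normal(size=(d_in, d_hid))
w2 = rng.normal(size=d_hid)

def backprop(Xb, yb, scale):
    """Gradients of scale * sum_i 0.5 * (yhat_i - y_i)^2 w.r.t. W1 and w2."""
    h = np.tanh(Xb @ W1)                 # hidden activations, (batch, d_hid)
    yhat = h @ w2                        # predictions, (batch,)
    d_yhat = scale * (yhat - yb)         # the only place the factor enters
    g_w2 = h.T @ d_yhat                  # matrix product sums over samples
    d_h = np.outer(d_yhat, w2) * (1.0 - h ** 2)
    g_W1 = Xb.T @ d_h                    # matrix product sums over samples
    return g_W1, g_w2

# Divide once, at the loss.
g_W1_mean, g_w2_mean = backprop(X, y, scale=1.0 / m)

# Reference: compute each sample's gradient alone, then average explicitly.
per_sample = [backprop(X[i:i + 1], y[i:i + 1], scale=1.0) for i in range(m)]
ref_W1 = sum(g for g, _ in per_sample) / m
ref_w2 = sum(g for _, g in per_sample) / m

print(np.allclose(g_W1_mean, ref_W1), np.allclose(g_w2_mean, ref_w2))
```

Calling `backprop` with `scale=1.0` on the whole batch instead gives the summed gradient, which differs from the averaged one only by the constant m and could be absorbed into the learning rate.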

What is Maximum Entropy?

Can someone give me a clear and simple definition of Maximum entropy classification? It would be very helpful if someone can provide a clear analogy, as I am struggling to understand.
"Maximum Entropy" is synonymous with "Least Informative". You wouldn't want a classifier that was least informative. It is in reference to how the priors are established. Frankly, "Maximum Entropy Classification" is an example of using buzz words.
For an example of an uninformative prior, consider given a six-sided object. The probability that any given face will appear if the object is tossed is 1/6. This would be your starting prior. It's the least informative. You really wouldn't want to start with anything else or you will bias later calculations. Of course, if you have knowledge that one side will appear more often you should incorporate that into your priors.
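As a quick numerical illustration of why the uniform prior is the "least informative" one, the sketch below compares the Shannon entropy of a fair six-sided object with that of a loaded one. The loaded probabilities are made up for the example.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

uniform_die = np.full(6, 1.0 / 6.0)
loaded_die = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])  # hypothetical bias

h_uniform = entropy(uniform_die)
h_loaded = entropy(loaded_die)
print(h_uniform, h_loaded)
```

The uniform distribution reaches log2(6) bits, the largest value possible for six outcomes; any knowledge that one face is more likely lowers the entropy.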
The Bayes formula is P(H|E) = P(E|H)P(H)/P(E)
where P(H) is the prior for the hypothesis and P(E) is the sum of all possible numerators.
For text classification where a missing word is to be inserted, E is some given document and H is the given word. In other words, the hypothesis is that H is the word which should be selected and P(H) is the weight given to the word.
Maximum Entropy text classification means: start with the least informative weights (priors) and optimize to find weights that maximize the likelihood of the data, P(E). Essentially, it's the EM algorithm.
A simple Naive Bayes classifier would assume the prior weights to be proportional to the number of times the word appears in the document. However, this ignores correlations between words.
The so-called MaxEnt classifier takes the correlations into account.
I can't think of a simple example to illustrate this, but I can think of some correlations. For example, "the missing" in English should give higher weights to nouns, but a Naive Bayes classifier might give equal weight to a verb if its relative frequency were the same as that of a given noun. A MaxEnt classifier considering "missing" would give more weight to nouns because they would be more likely in context.
I would also recommend "Hidden Markov and Maximum Entropy Models" from the Department of Computer Science, Johns Hopkins. Specifically, take a look at chapter 6.6. This book explains Maximum Entropy using the example of PoS tagging and compares the MaxEnt application in MEMMs with the Hidden Markov Model. It also explains what exactly MaxEnt is, with the math behind it.
Taken from "Understanding Deep Learning Generalization by Maximum Entropy" (Zheng et al., 2017):
(Original Maximum Entropy Model) Supposing the dataset has input X and label Y, the task is to find a good prediction of Y using X. The prediction \hat{Y} needs to maximize the conditional entropy H(\hat{Y} | X) while preserving the same distribution with data (X, Y). This is formulated as:
\min -H(\hat{Y} | X)    (1)
s.t. P(X, Y) = P(X, \hat{Y}),
     \sum_{\hat{Y}} P(\hat{Y} | X) = 1
Berger et al., 1996 solves this with Lagrange multipliers \omega_i as an exponential form:
P_\omega(\hat{Y} = y | X = x) = \frac{1}{Z_\omega(x)} \exp( \sum_i \omega_i f_i(x, y) )
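The exponential form can be evaluated directly once the feature functions are given. The sketch below is only an illustration of that formula: the function name `maxent_probs`, the three labels, the multipliers and the indicator features are invented, and the array `features[y, i]` is assumed to hold f_i(x, y) for one fixed input x.

```python
import numpy as np

def maxent_probs(omega, features):
    """P_omega(y | x) for every label y, given features[y, i] = f_i(x, y)."""
    scores = features @ omega          # sum_i omega_i * f_i(x, y), per label
    scores = scores - scores.max()     # stabilise exp; cancels in the ratio
    unnorm = np.exp(scores)
    return unnorm / unnorm.sum()       # division by Z_omega(x)

# Hypothetical multipliers and one indicator feature per label.
omega = np.array([1.0, -0.5, 2.0])
features = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
probs = maxent_probs(omega, features)
print(probs)
```

With one indicator feature per label this reduces to a softmax over the multipliers, which is the link between MaxEnt classifiers and multinomial logistic regression.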

Difference between Probabilistic kNN and Naive Bayes

I'm trying to modify a standard kNN algorithm to obtain the probability of belonging to a class instead of just the usual classification. I haven't found much information about Probabilistic kNN, but as far as I understand, it works similarly to kNN, with the difference that it calculates the percentage of examples of every class inside the given radius.
So I wonder, what is the difference between Naive Bayes and Probabilistic kNN? I can only spot that Naive Bayes takes the prior probability into account, while PkNN does not. Am I getting it wrong?
Thanks in advance!
To be honest there is nearly no similarity.
Naive Bayes assumes that each class is distributed according to a simple distribution in which the features are independent of each other. For the continuous case, it will fit an axis-aligned Normal distribution (diagonal covariance) to each whole class and then make a decision through argmax_y P(y) N(x | m_y, Sigma_y).
KNN, on the other hand, is not a probabilistic model. The modification that you are referring to is simply a "smooth" version of the original idea, where you return the ratio of each class in the nearest-neighbours set (and this is not really any "probabilistic kNN"; it is just regular kNN with a rough estimate of probability). This assumes nothing about the data distribution (besides being locally smooth). In particular, it is a nonparametric model which, given enough training samples, will fit perfectly to any dataset. Naive Bayes will fit perfectly only to K Gaussians (where K is the number of classes).
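The "smooth" kNN described here can be sketched in a few lines. This is a brute-force illustration with Euclidean distance, not a reference implementation; the function name `knn_class_ratios` and the five training points are made up.

```python
import numpy as np

def knn_class_ratios(X_train, y_train, x, k, n_classes):
    """Fraction of each class among the k nearest training points to x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    counts = np.bincount(y_train[nearest], minlength=n_classes)
    return counts / k

# Hypothetical 2-D training set with two classes.
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [1.1, 1.0]])
y_train = np.array([0, 0, 1, 1, 1])

ratios = knn_class_ratios(X_train, y_train, np.array([0.05, 0.05]), k=3, n_classes=2)
print(ratios)
```

No distribution is fitted anywhere: the output depends only on which training points happen to be closest to the query.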
(I don't know how to format math formulas. For more details and clear representations, please see this.)
I would like to propose an opposite view: KNN is a kind of simplified Naive Bayes (NB), if we view KNN as a means of density estimation.
To perform density estimation, we attempt to estimate p(x) = k/(NV), where k is the number of samples lying in a region R, N is the total number of samples, and V is the volume of the region R. Usually, there are two ways to estimate it: (1) fix V and calculate k, which is known as kernel density estimation or the Parzen window; (2) fix k and calculate V, which is KNN-based density estimation. The latter is much less well known than the former because of its many drawbacks.
Yet we can use KNN-based density estimation to connect KNN and NB. Given N samples in total, with Ni samples of class ci, we can write NB in the form of KNN-based density estimation by considering a region containing x:
P(ci|x) = P(x|ci)P(ci)/P(x) = [ki/(Ni V)] (Ni/N) / [k/(N V)] = ki/k,
where ki is the sample number of class ci lying in the region. The final form ki/k is actually the KNN classifier.
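The cancellation can be verified with exact arithmetic. The counts and the volume below are arbitrary illustrative numbers, and `posterior_from_densities` is a name made up for this sketch.

```python
from fractions import Fraction

def posterior_from_densities(k_i, N_i, k, N, V):
    """P(c_i | x) assembled from the three k/(NV)-style estimates."""
    likelihood = Fraction(k_i, N_i) / V      # P(x | c_i) = k_i / (N_i V)
    prior = Fraction(N_i, N)                 # P(c_i)     = N_i / N
    evidence = Fraction(k, N) / V            # P(x)       = k / (N V)
    return likelihood * prior / evidence

# Hypothetical region: 5 samples fall inside it, 3 of them from class c_i.
post = posterior_from_densities(k_i=3, N_i=40, k=5, N=100, V=Fraction(1, 8))
print(post)
```

Both V and the class sizes drop out, so changing `N_i`, `N` or `V` leaves the result at k_i/k.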

importance of PCA or SVD in machine learning

For a long time (especially during the Netflix contest), I kept coming across blog posts (or leaderboard forum threads) mentioning how applying a simple SVD step to the data helped reduce sparsity in the data or, in general, improved the performance of the algorithm at hand.
I have been thinking about this for a long time, but I cannot work out why it is so.
In general, the data I get is very noisy (which is also the fun part of big data), and I do know some basic feature scaling techniques such as log transformation and mean normalization.
But how does something like SVD help?
So let's say I have a huge matrix of users rating movies, and on this matrix I implement some version of a recommendation system (say, collaborative filtering):
1) Without SVD
2) With SVD
How does it help?
SVD is not used to normalize the data, but to get rid of redundant data, that is, for dimensionality reduction. For example, if you have two variables, one being a humidity index and the other the probability of rain, then their correlation is so high that the second one does not contribute any additional information useful for a classification or regression task. The singular values in SVD help you determine which variables (more precisely, which linear combinations of them) are most informative, and which ones you can do without.
The way it works is simple. You perform SVD over your training data (call it matrix A) to obtain U, S and V*. Then set to zero all values of S less than a certain arbitrary threshold (e.g. 0.1), and call this new matrix S'. Then obtain A' = US'V* and use A' as your new training data. In the rotated coordinates given by the columns of US', the directions whose singular values were zeroed are now identically zero and can be removed, sometimes without any performance penalty (depending on your data and the threshold chosen). Keeping only the k largest singular values in this way is called k-truncated SVD.
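A minimal NumPy sketch of the thresholding step follows. The data are synthetic and deliberately redundant (the third column is a copy of the first), and the threshold of 0.1 is the arbitrary value mentioned in the text; none of this comes from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: the third column is an exact copy of the first.
base = rng.normal(size=(50, 2))
A = np.column_stack([base[:, 0], base[:, 1], base[:, 0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
threshold = 0.1                        # arbitrary, as in the text
s_trunc = np.where(s < threshold, 0.0, s)

A_prime = (U * s_trunc) @ Vt           # A' = U S' V*
kept = int(np.count_nonzero(s_trunc))  # number of directions retained
reduced = (U * s_trunc)[:, :kept]      # same rows as A, in `kept` coordinates

print(s)
print(kept, reduced.shape)
```

Because one column is fully redundant, one singular value is numerically zero, A' reproduces A, and the 3-column data can be represented with 2 coordinates per row.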
SVD doesn't help you with sparsity though, only helps you when features are redundant. Two features can be both sparse and informative (relevant) for a prediction task, so you can't remove either one.
Using SVD, you go from n features to k features, where each one will be a linear combination of the original n. It's a dimensionality reduction step, just like feature selection is. When redundant features are present, though, a feature selection algorithm may lead to better classification performance than SVD depending on your data set (for example, maximum entropy feature selection). Weka comes with a bunch of them.
See: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Singular_Value_Decomposition
https://stats.stackexchange.com/questions/33142/what-happens-when-you-apply-svd-to-a-collaborative-filtering-problem-what-is-th
The Singular Value Decomposition is often used to approximate a matrix X by a low rank matrix X_lr:
Compute the SVD X = U D V^T.
Form the matrix D' by keeping the k largest singular values and setting the others to zero.
Form the matrix X_lr by X_lr = U D' V^T.
The matrix X_lr is then the best approximation of rank k of the matrix X, for the Frobenius norm (the equivalent of the l2-norm for matrices). It is computationally efficient to use this representation, because if your matrix X is n by n and k << n, you can store its low rank approximation with only (2n + 1)k coefficients (by storing U, D' and V).
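The three steps and the storage count can be checked on a small random matrix. The sizes n = 20 and k = 3 and the helper name `low_rank` are arbitrary choices for illustration. The last line relies on the standard fact that the Frobenius error of the best rank-k approximation equals the root of the sum of the squared discarded singular values.

```python
import numpy as np

def low_rank(X, k):
    """Rank-k approximation X_lr = U D' V^T and its compact factors."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, d_k, Vt_k = U[:, :k], d[:k], Vt[:k, :]
    return (U_k * d_k) @ Vt_k, (U_k, d_k, Vt_k)

rng = np.random.default_rng(0)
n, k = 20, 3
X = rng.normal(size=(n, n))
X_lr, (U_k, d_k, Vt_k) = low_rank(X, k)

stored = U_k.size + d_k.size + Vt_k.size        # coefficients actually kept
frob_err = np.linalg.norm(X - X_lr, "fro")
d_all = np.linalg.svd(X, compute_uv=False)      # all singular values of X

print(stored, (2 * n + 1) * k)
print(frob_err, np.sqrt(np.sum(d_all[k:] ** 2)))
```

For a random matrix like this one the error is large, since nothing about it is low rank; the approach pays off when the discarded singular values are small.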
This was often used in matrix completion problems (such as collaborative filtering) because the true matrix of user ratings is assumed to be low rank (or well approximated by a low rank matrix). So, you wish to recover the true matrix by computing the best low rank approximation of your data matrix. However, there are now better ways to recover low rank matrices from noisy and missing observations, namely nuclear norm minimization. See for example the paper The power of convex relaxation: Near-optimal matrix completion by E. Candes and T. Tao.
(Note: the algorithms derived from this technique also store the SVD of the estimated matrix, but it is computed differently).
PCA or SVD, when used for dimensionality reduction, reduces the number of inputs. Besides saving the computational cost of learning and/or predicting, this can sometimes produce more robust models that are not optimal in a statistical sense but perform better in noisy conditions.
Mathematically, simpler models have less variance, i.e. they are less prone to overfitting. Underfitting, of course, can be a problem too. This is known as the bias-variance dilemma. Or, as put in plain words by Einstein: things should be made as simple as possible, but not simpler.

Resources