Minibatch SGD gradient computation - average or sum

I am trying to understand how the gradients are computed when using minibatch SGD. I implemented it in the CS231n online course, but only later came to realize that in the intermediate layers the gradient is basically the sum over the gradients computed for each sample (the same holds for the implementations in Caffe or TensorFlow). It is only in the last layer (the loss) that they are averaged over the number of samples.
Is this correct? If so, does it mean that since they are averaged in the last layer, when doing backprop all the gradients are also averaged automatically?
Thanks!

It is best to understand why SGD works first.
A neural network is essentially a very complex composite function of an input vector x, a label y (or target variable; which one depends on whether the problem is classification or regression), and a parameter vector w. Assume that we are working on classification. We are then actually trying to do maximum likelihood estimation (actually MAP estimation, since we will certainly use L2 or L1 regularization, but this is too much technicality for now) for the parameter vector w. Assuming that samples are independent, we have the following likelihood to maximize:
p(y1|w,x1)p(y2|w,x2) ... p(yN|w,xN)
Optimizing this with respect to w is a mess, because all of these probabilities are multiplied together (this would produce an insanely complicated derivative with respect to w). We use log probabilities instead (taking the log does not change the extreme points) and divide by N, so we can treat our training set as an empirical probability distribution, p(x):
J(X,Y,w)=-(1/N)(log p(y1|w,x1) + log p(y2|w,x2) + ... + log p(yN|w,xN))
This is the actual cost function we have. What the neural network actually does is to model the probability function p(yi|w,xi). This can be a very complex 1000+ layered ResNet or just a simple perceptron.
Now the derivative with respect to w is simple to state, since we are dealing with a sum:
dJ(X,Y,w)/dw = -(1/N)(dlog p(y1|w,x1)/dw + dlog p(y2|w,x2)/dw + ... + dlog p(yN|w,xN)/dw)
Ideally, the above is the actual gradient. But this full-batch calculation is expensive to compute. What if we are working on a dataset with 1M training samples? Worse, the training set may be a stream of samples x with effectively infinite size.
The stochastic part of SGD comes into play here. Pick m samples with m << N uniformly at random from the training set and calculate the derivative using only them:
dJ(X,Y,w)/dw ≈ dJ'/dw = -(1/m)(dlog p(y1|w,x1)/dw + dlog p(y2|w,x2)/dw + ... + dlog p(ym|w,xm)/dw)
Remember that we had an empirical (or, in the case of an infinite training set, actual) data distribution p(x). The above operation of drawing m samples from p(x) and averaging their gradients produces an unbiased estimator, dJ'/dw, of the actual derivative dJ(X,Y,w)/dw. What does that mean? Take many such m-sample draws, calculate the different dJ'/dw estimates, and average those as well: you recover dJ(X,Y,w)/dw very closely, even exactly in the limit of infinite sampling. It can be shown that these noisy but unbiased gradient estimates behave like the original gradient in the long run. On average, SGD will follow the actual gradient's path (though it can get stuck at a different local minimum; it all depends on the choice of the learning rate). The minibatch size m is directly related to the inherent error in the noisy estimate dJ'/dw. If m is large, you get gradient estimates with low variance and you can use larger learning rates. If m is small or m=1 (online learning), the variance of the estimator dJ'/dw is very high and you should use smaller learning rates, or the algorithm may easily diverge out of control.
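To make this concrete, here is a minimal numpy sketch of the estimator (my own toy example, not from the original post): a least-squares model with made-up data, where the minibatch gradient is just the average of per-sample gradients over m randomly drawn indices.
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 100_000, 5, 64                       # dataset size, feature count, minibatch size
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

def grad(w, Xb, yb):
    # Gradient of the averaged squared error (1/|b|) * sum_i (x_i.w - y_i)^2 over a batch b
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(d)
full_grad = grad(w, X, y)                      # the "ideal" full-batch gradient
idx = rng.choice(N, size=m, replace=False)     # draw m samples uniformly at random
mini_grad = grad(w, X[idx], y[idx])            # noisy but unbiased estimate of full_grad
print(np.linalg.norm(full_grad - mini_grad))   # small, and shrinks as m grows

for step in range(500):                        # SGD: follow the noisy estimates
    idx = rng.choice(N, size=m, replace=False)
    w -= 0.05 * grad(w, X[idx], y[idx])
Increasing m makes each estimate less noisy; decreasing it makes the updates cheaper but forces a smaller learning rate, exactly as described above.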
Now, enough theory; your actual question was:
It is only in the last layer (the loss) that they are averaged by the number of samples. Is this correct? If so, does it mean that since in the last layer they are averaged, when doing backprop, all the gradients are also averaged automatically? Thanks!
Yes, it is enough to divide by m in the last layer, since the chain rule will propagate the factor (1/m) to all parameters once the lowermost layer is multiplied by it. You don't need to divide separately for each parameter; doing so would actually be incorrect.

In the last layer the gradients are averaged, and in the previous layers they are summed. The sums in the previous layers run across the different nodes of the next layer, not across the examples. This averaging is done only to make the learning process behave similarly when you change the batch size: everything would work the same if you summed in all the layers but decreased the learning rate appropriately.
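A quick way to convince yourself of the chain-rule argument is a toy check (my own sketch, not from either answer): a two-layer linear network where only the loss layer divides by the batch size m, and the resulting parameter gradients come out identical to the average of the per-sample gradients.
import numpy as np

rng = np.random.default_rng(1)
m = 8
X = rng.normal(size=(m, 3))
Y = rng.normal(size=(m, 1))
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 1))

def grads(Xb, Yb):
    # Backprop for a 2-layer linear net with squared-error loss averaged over the batch
    H = Xb @ W1                      # hidden layer (no nonlinearity, for brevity)
    P = H @ W2                       # predictions
    dP = 2 * (P - Yb) / len(Xb)      # the ONLY place we divide by the batch size
    dW2 = H.T @ dP                   # a plain sum over samples, no extra 1/m
    dH = dP @ W2.T
    dW1 = Xb.T @ dH                  # also a plain sum over samples
    return dW1, dW2

dW1_batch, dW2_batch = grads(X, Y)

# Average of per-sample gradients (each computed with batch size 1):
per_sample = [grads(X[i:i+1], Y[i:i+1]) for i in range(m)]
dW1_avg = sum(g[0] for g in per_sample) / m
dW2_avg = sum(g[1] for g in per_sample) / m

print(np.allclose(dW1_batch, dW1_avg), np.allclose(dW2_batch, dW2_avg))  # True True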

Related

What is weight decay loss?

I have started recently with ML and TensorFlow. While going through the CIFAR-10 tutorial on the website I came across a paragraph that is a bit confusing to me:
The usual method for training a network to perform N-way classification is multinomial logistic regression, aka. softmax regression. Softmax regression applies a softmax nonlinearity to the output of the network and calculates the cross-entropy between the normalized predictions and a 1-hot encoding of the label. For regularization, we also apply the usual weight decay losses to all learned variables. The objective function for the model is the sum of the cross entropy loss and all these weight decay terms, as returned by the loss() function.
I have read a few answers on the forum about what weight decay is, and I gather that it is used for regularization, so that weight values can be found that give lower loss and higher accuracy.
Now, in the text above I understand that loss() is made up of the cross-entropy loss (which is the difference between the prediction and the correct label values) and the weight decay loss.
I am clear on cross-entropy loss, but what is this weight decay loss, and why not just weight decay? How is this loss calculated?
Weight decay is nothing but L2 regularisation of the weights, which can be achieved using tf.nn.l2_loss.
The loss function with regularisation is given by:
J_reg(theta) = J(theta) + (lambda/2) * sum(theta^2)
The second term of the above equation defines the L2-regularization of the weights (theta). It is generally added to avoid overfitting. It penalises peaky weights and makes sure that all of the inputs are considered (a few peaky weights would mean that only the inputs connected to them are considered for decision making).
During the gradient descent parameter update, the above L2 regularization ultimately means that every weight is decayed linearly: W_new = (1 - alpha*lambda) * W_old - alpha * dJ/dW. That's why it's generally called weight decay.
It is a weight decay loss because it is added to the cost function (the loss, to be specific). Parameters are optimized against the loss, so by expressing weight decay through the loss function its effect is visible to the entire network.
TF L2 loss
Cost = Model_Loss(W) + decay_factor*L2_loss(W)
# In TensorFlow, tf.nn.l2_loss basically computes half the squared L2 norm
L2_loss = sum(W ** 2) / 2
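Putting the pieces together, a rough TensorFlow 2 sketch of the regularized objective might look like the following. Only tf.nn.l2_loss and tf.nn.softmax_cross_entropy_with_logits are the real APIs under discussion; the shapes, the decay_factor value, and the random data are made up purely for illustration.
import tensorflow as tf

# Toy batch and a single linear layer, for illustration only
x = tf.random.normal([32, 784])
labels = tf.one_hot(tf.random.uniform([32], maxval=10, dtype=tf.int32), 10)
W = tf.Variable(tf.random.normal([784, 10]) * 0.01)
b = tf.Variable(tf.zeros([10]))
decay_factor = 5e-4                                   # assumed regularization strength

logits = x @ W + b
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
weight_decay_loss = decay_factor * tf.nn.l2_loss(W)   # decay_factor * 0.5 * sum(W**2)
total_loss = cross_entropy + weight_decay_loss        # what the tutorial's loss() returns conceptually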
What your tutorial is trying to say by "weight decay loss" is that, compared to the cross-entropy cost you know from your unregularized models (i.e. how far off target your model's predictions were on the training data), your new cost function penalizes not only the prediction error but also the magnitude of the weights in your network. Whereas before you were optimizing only for correct prediction of the labels in your training set, now you are optimizing for correct label prediction as well as having small weights. The reason for this modification is that when a machine learning model trained by gradient descent yields large weights, it is likely they were arrived at in response to peculiarities (or noise) in the training data. The model will not perform as well when exposed to held-out test data because it is overfit to the training set. The result of applying weight decay loss, more commonly called L2 regularization, is that accuracy on the training data will drop a bit but accuracy on test data can jump dramatically. And that's what you're after in the end: a model that generalizes well to data it did not see during training.
So you can get a firmer grasp on the mechanics of weight decay, let's look at the learning rule for weights in an L2-regularized network:
w_new = (1 - eta*lambda/n) * w_old - eta * dC/dw
where eta and lambda are the user-defined learning rate and regularization parameter, respectively, and n is the number of training examples (you'll have to look up those Greek letters if you're not familiar with them). Since the values eta and (eta*lambda)/n are both constants for a given iteration of training, it's enough to interpret the learning rule for weight decay as "for a given weight, subtract a small multiple of the derivative of the cost function with respect to that weight, and subtract a small multiple of the weight itself."
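As a small illustration (with made-up numbers of my own choosing for eta, lambda, n, the weights, and the cost derivatives), the rule can be written out directly in numpy:
import numpy as np

eta, lam, n = 0.1, 0.01, 1000                # learning rate, regularization strength, training set size
w = np.array([0.8, -1.5, 0.02, 2.4])         # four example weights
dC_dw = np.array([0.3, 0.1, -0.4, -0.2])     # pretend cost derivatives for those weights

w_new = (1 - eta * lam / n) * w - eta * dC_dw   # shrink toward zero, then follow the gradient
print(w_new)
Notice that the shrinkage term is proportional to the current weight: large weights are pulled toward zero harder than small ones.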
Let's look at four weights in an imaginary network and how the above learning rule affects them. As you can see, the regularization term shown in red pushes weights toward zero no matter what. It is designed to minimize the magnitude of the weight matrix, which it does by minimizing the absolute values of individual weights. Some key things to notice in these plots:
When the sign of the cost derivative and the sign of the weight are the same, the regularization term accelerates the weight's path to its optimum!
The amount that the regularization term affects the weight update is proportional to the current value of that weight. I've shown this in the plots with tiny red arrows showing contributions of weights with current values close to zero, and larger red arrows for weights with larger current magnitudes.

Updating the weights in a 2-layer neural network

I am trying to simulate an XOR gate using a neural network similar to this:
Now I understand that each neuron has a certain number of weights and a bias. I am using a sigmoid function to determine whether a neuron should fire or not in each state (since this uses a sigmoid rather than a step function, I use "firing" in a loose sense, as it actually spits out real values).
I successfully ran the simulation for the feed-forward part, and now I want to use the backpropagation algorithm to update the weights and train the model. The question is: for each value of x1 and x2 there is a separate result (4 different combinations in total), and under different input pairs, separate error distances (the difference between the desired output and the actual result) can be computed, and subsequently a different set of weight updates will be obtained. This means we would get 4 different sets of weight updates, one for each input pair, by using backpropagation.
How should we decide on the right weight updates?
Say we repeat the backpropagation for a single input pair until we converge; but what if we converge to a different set of weights when we choose another pair of inputs?
Now I understand that each neuron has certain weights. I am using a sigmoid function to determine whether a neuron should fire or not in each state.
You do not really "decide" this; typical MLPs do not "fire", they output real values. There are neural networks that actually fire (like RBMs), but that is a completely different model.
This means we would get 4 different sets of weight updates, one for each input pair, by using backpropagation.
This is actually a feature. Let's start from the beginning. You try to minimize some loss function over your whole training set (in your case, 4 samples), which is of the form:
L(theta) = SUM_i l(f(x_i), y_i)
where l is some loss function, f(x_i) is your current prediction and y_i is the true value. You do this by gradient descent, so you try to compute the gradient of L and step against it:
grad L(theta) = grad SUM_i l(f(x_i), y_i) = SUM_i grad l(f(x_i), y_i)
What you now call "a single update" is grad l(f(x_i), y_i) for a single training pair (x_i, y_i). Usually you would not use this; instead you would sum (or take the average of) the updates across the whole dataset, as this is your true gradient. However, in practice this might not be computationally feasible (the training set is usually quite large); furthermore, it has been shown empirically that more "noise" in training is usually better. Thus another learning technique emerged, called stochastic gradient descent, which, in short, shows that under some light assumptions (such as an additive loss function) you can actually do your "small updates" independently and still converge to a local minimum! In other words, you can apply your updates "point-wise" in random order and you will still learn. Will it always be the same solution? No. But this is also true for computing the whole gradient: optimization of non-convex functions is nearly always non-deterministic (you find some local solution, not the global one).
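For completeness, here is a small numpy sketch of the batch version for the XOR problem (the hidden size, learning rate, epoch count, and random seed are arbitrary choices of mine; with an unlucky seed it may need more epochs or a restart). The point to notice is that the gradient applied in each step is just the sum (here, the average) of the four per-sample gradients; the SGD variant would instead apply each per-sample update immediately, in random order.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # the 4 XOR input pairs
Y = np.array([[0.], [1.], [1.], [0.]])                   # desired outputs

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)            # 4 hidden units (arbitrary choice)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def sample_grads(x, y):
    # Backprop for ONE (x_i, y_i) pair, loss l = 0.5 * (yhat - y)^2
    h = sigmoid(x @ W1 + b1)
    yhat = sigmoid(h @ W2 + b2)
    d2 = (yhat - y) * yhat * (1 - yhat)                  # output-layer delta
    d1 = (d2 @ W2.T) * h * (1 - h)                       # hidden-layer delta
    return np.outer(x, d1), d1, np.outer(h, d2), d2

lr = 2.0
for epoch in range(10000):
    gW1, gb1, gW2, gb2 = np.zeros_like(W1), np.zeros_like(b1), np.zeros_like(W2), np.zeros_like(b2)
    for x, y in zip(X, Y):                               # SUM the 4 per-sample gradients...
        g = sample_grads(x, y)
        gW1 += g[0]; gb1 += g[1]; gW2 += g[2]; gb2 += g[3]
    W1 -= lr * gW1 / 4; b1 -= lr * gb1 / 4               # ...then take one (averaged) step
    W2 -= lr * gW2 / 4; b2 -= lr * gb2 / 4

print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))  # should approach [0, 1, 1, 0]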

word2vec: negative sampling (in layman terms)?

I'm reading the paper below and I have some trouble understanding the concept of negative sampling.
http://arxiv.org/pdf/1402.3722v1.pdf
Can anyone help, please?
The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have
v_c . v_w
-------------------
sum_i(v_ci . v_w)
The numerator is basically the similarity between the word c (the context) and the word w (the target). The denominator computes the similarity of all other contexts ci and the target word w. Maximising this ratio ensures that words which appear closer together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts ci. Negative sampling is one of the ways of addressing this problem: just select a couple of contexts ci at random. The end result is that if cat appears in the context of food, then the vector of food is more similar to the vector of cat (as measured by their dot product) than the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy), instead of all the other words in the language. This makes word2vec much, much faster to train.
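Here is a toy numpy sketch of that difference (random vectors and made-up indices, purely to show the shape of the computation): the full softmax denominator touches every word in the vocabulary, whereas the negative-sampling objective only touches the true context plus a handful of random ones.
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 10_000, 50
v_w = rng.normal(size=(vocab, dim)) * 0.1        # target-word vectors
v_c = rng.normal(size=(vocab, dim)) * 0.1        # context-word vectors
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w, c = 42, 1337                                   # indices of one (target, context) pair

# Full softmax: one dot product per vocabulary word (slow)
scores = v_c @ v_w[w]
p_full = np.exp(scores[c]) / np.exp(scores).sum()

# Negative sampling: the true context vs. only k random words (fast)
k = 5
neg = rng.choice(vocab, size=k)
objective = np.log(sigmoid(v_c[c] @ v_w[w])) + np.log(sigmoid(-v_c[neg] @ v_w[w])).sum()
print(p_full, objective)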
Computing the softmax (the function that determines which words are similar to the current target word) is expensive, since it requires summing over all words in V (the denominator), and V is generally very large.
What can be done?
Different strategies have been proposed to approximate the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches are methods that keep the softmax layer intact but modify its architecture to improve its efficiency (e.g. hierarchical softmax). Sampling-based approaches, on the other hand, completely do away with the softmax layer and instead optimise some other loss function that approximates the softmax (they do this by approximating the normalization in the denominator of the softmax with some other loss that is cheap to compute, like negative sampling).
The loss function in word2vec is something like:
J = -(1/T) * sum over (target, context) pairs of log p(c | w)
Which logarithm can decompose into:
log p(c | w) = (v_c . v_w) - log sum_i exp(v_ci . v_w)
With some mathematics and the gradient formula (see more details at 6) it can be converted to:
log sigma(v_c . v_w) + sum_{k=1..K} log sigma(-v_ck . v_w)
As you see, it is converted to a binary classification task (y=1 positive class, y=0 negative class). As we need labels to perform our binary classification task, we designate all context words c as true labels (y=1, positive samples), and k words randomly selected from the corpus as false labels (y=0, negative samples).
Look at the following paragraph. Assume our target word is "Word2vec". With a window of 3, our context words are: The, widely, popular, algorithm, was, developed. These context words are treated as positive labels. We also need some negative labels. We randomly pick some words from the corpus (produce, software, Collobert, margin-based, probabilistic) and consider them negative samples. This technique of picking random examples from the corpus is called negative sampling.
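A tiny numpy sketch of that sampling step (toy corpus of my own, illustrative only; real word2vec draws negatives from the unigram distribution raised to the 3/4 power, which is what is shown below):
import numpy as np

corpus = ("the widely popular Word2vec algorithm was developed by a team of researchers "
          "and it learns vector representations of words from large amounts of text").split()
rng = np.random.default_rng(0)

words, counts = np.unique(corpus, return_counts=True)
probs = counts ** 0.75                           # unigram counts raised to 3/4
probs = probs / probs.sum()

target = "Word2vec"
positives = ["widely", "popular", "algorithm", "was", "developed"]   # window around the target
negatives = rng.choice(words, size=5, p=probs)   # random "negative" words from the corpus
print(target, positives, list(negatives))        # a fuller version would skip the target and its context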
References:
(1) C. Dyer, "Notes on Noise Contrastive Estimation and Negative Sampling", 2014
(2) http://sebastianruder.com/word-embeddings-softmax/
I wrote a tutorial article about negative sampling here.
Why do we use negative sampling? -> To reduce the computational cost.
The cost function for both vanilla Skip-Gram (SG) and Skip-Gram negative sampling (SGNS) has the form:
J(theta) = -(1/T) * sum_{t=1..T} sum_{-c<=j<=c, j!=0} log p(w_t+j | w_t)
Note that T is the number of positions (words) in the training text, while V is the size of the vocabulary; the two models differ only in how the probability p(w_t+j | w_t) is computed.
The probability distribution p(w_t+j | w_t) in SG is computed over all V words in the vocabulary with the softmax:
p(w_t+j | w_t) = exp(c_t+j . w_t) / sum_{i=1..V} exp(c_i . w_t)
where w_t is the (input) vector of the center word and the c_i are the (output) vectors of the vocabulary words.
V can easily exceed tens of thousands when training a Skip-Gram model. The probability needs to be computed V times, making it computationally expensive. Furthermore, the normalization factor in the denominator requires V extra computations.
On the other hand, the probability distribution in SGNS is computed with the sigmoid of a dot product:
p(D=1 | w_t, c) = sigma(c . w_t)
so the per-pair objective becomes log sigma(c_pos . w_t) + sum over c_neg in W_neg of log sigma(-c_neg . w_t).
c_pos is the word vector of the positive (true context) word, and W_neg holds the word vectors of all K negative samples in the output weight matrix. With SGNS, the probability needs to be computed only K + 1 times, where K is typically between 5 and 20. Furthermore, no extra iterations are necessary to compute the normalization factor in the denominator.
With SGNS, only a fraction of weights are updated for each training sample, whereas SG updates all millions of weights for each training sample.
How does SGNS achieve this? -> By transforming the multi-class classification task into a binary classification task.
With SGNS, word vectors are no longer learned by predicting the context words of a center word. Instead, the model learns to differentiate the actual context words (positive) from randomly drawn words (negative) taken from the noise distribution.
In real life, you don't usually observe a word like "regression" together with random words like "Gangnam-Style" or "pimples". The idea is that if the model can distinguish between the likely (positive) pairs and the unlikely (negative) pairs, good word vectors will be learned.
In the above figure, the current positive word-context pair is (drilling, engineer). K=5 negative samples are randomly drawn from the noise distribution: minimized, primary, concerns, led, page. As the model iterates through the training samples, the weights are optimized so that the probability for the positive pair approaches p(D=1|w,c_pos) ≈ 1, and the probability for the negative pairs approaches p(D=1|w,c_neg) ≈ 0.
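In numpy terms, a single SGNS training pair only requires K + 1 sigmoid'ed dot products (the vectors below are random and the word names are just the ones from the example above; sigma is the logistic function):
import numpy as np

rng = np.random.default_rng(0)
dim = 100
w = rng.normal(size=dim) * 0.1            # vector for the target word, e.g. "drilling"
c_pos = rng.normal(size=dim) * 0.1        # vector for the true context word, e.g. "engineer"
C_neg = rng.normal(size=(5, dim)) * 0.1   # vectors for the K=5 negative samples

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

p_pos = sigmoid(c_pos @ w)   # p(D=1 | w, c_pos): pushed toward 1 during training
p_neg = sigmoid(C_neg @ w)   # p(D=1 | w, c_neg): pushed toward 0 during training

# Only K + 1 = 6 dot products per training pair, instead of V (vocabulary-sized) dot products
print(p_pos, p_neg)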

Why does a linear transformation improve the accuracy and efficiency of classification for high-dimensional data?

Let X be an m×n (m: number of records, n: number of attributes) dataset. When the number of attributes n is large and the dataset X is noisy, classification gets more complicated and the classification accuracy decreases. One way to overcome this problem is to use a linear transformation, i.e., perform classification on Y = XR, where R is an n×p matrix and p <= n. I was wondering how a linear transformation simplifies classification, and why classification accuracy increases if we do classification on the transformed data Y when X is noisy.
Not every sort of linear transformation would work, but some linear transformations are sometimes useful. Specifically, principal component analysis (PCA) and Factor Analysis are linear transformations often used for dimensionality reduction.
The basic idea is that most of the information is probably contained in some linear combination of the features of the dataset, and that by throwing the rest of them away, we are forcing ourselves to use simpler models / overfit less.
This isn't always so great. For example, even if one of the features is actually the thing we're trying to classify, it could still be discarded by PCA if it has low variability, thus losing important information.
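As a rough illustration of both points (a synthetic example of my own using scikit-learn; the exact numbers will vary with the data and the chosen number of components), projecting noisy high-dimensional data onto a few principal components gives the classifier a smaller, simpler input, which can help or hurt depending on what the discarded directions contained:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic noisy, high-dimensional data: only a few informative directions
X, y = make_classification(n_samples=500, n_features=200, n_informative=10,
                           n_redundant=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf_raw = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)

pca = PCA(n_components=20).fit(X_tr)            # the columns of R are the top-p principal directions; Y = XR
clf_pca = LogisticRegression(max_iter=2000).fit(pca.transform(X_tr), y_tr)

print("raw :", clf_raw.score(X_te, y_te))
print("pca :", clf_pca.score(X_te, y_te))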

Can someone explain to me the difference between a cost function and the gradient descent equation in logistic regression?

I'm going through the ML Class on Coursera on Logistic Regression and also the Manning Book Machine Learning in Action. I'm trying to learn by implementing everything in Python.
I'm not able to understand the difference between the cost function and the gradient. There are examples on the net where people compute the cost function, and then there are places where they don't and just go with the gradient descent update w := w - alpha * grad_w f(w).
What is the difference between the two, if any?
Whenever you train a model on your data, you are actually producing new (predicted) values for a specific feature. However, that specific feature already has real values in the dataset. We know that the closer the predicted values are to the corresponding real values, the better the model.
Now, we use a cost function to measure how close the predicted values are to their corresponding real values.
We also should consider that the weights of the trained model are responsible for accurately predicting the new values. Imagine that our model is y = 0.9*X + 0.1, the predicted value is nothing but (0.9*X+0.1) for different Xs.
[0.9 and 0.1 in the equation are just random values to understand.]
So, considering Y as the real value corresponding to this X, the cost function measures how close (0.9*X+0.1) is to Y.
We are responsible for finding better weights (0.9 and 0.1) for our model so that it reaches the lowest cost (i.e., predicted values closer to the real ones).
Gradient descent is an optimization algorithm (there are other optimization algorithms as well) whose responsibility is to find the minimum cost in the process of trying the model with different weights, or indeed, of updating the weights.
We first run our model with some initial weights; gradient descent then updates the weights and evaluates the cost of the model with those weights over thousands of iterations to find the minimum cost.
One point is that gradient descent is not minimizing the weights; it is just updating them. What this algorithm is looking for is the minimum cost.
A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function. If your cost is a function of K variables, then the gradient is the length-K vector that defines the direction in which the cost is increasing most rapidly. So in gradient descent, you follow the negative of the gradient to the point where the cost is a minimum. If someone is talking about gradient descent in a machine learning context, the cost function is probably implied (it is the function to which you are applying the gradient descent algorithm).
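A minimal sketch of that relationship (toy data and a step size of my own choosing): the cost is the sum of squared errors of a line, and gradient descent repeatedly steps against the cost's gradient to drive it down.
import numpy as np

# Toy 1-D linear regression: cost(w, b) = sum of squared errors over the training set
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])

def cost(w, b):
    return np.sum((w * X + b - Y) ** 2)

w, b, alpha = 0.0, 0.0, 0.01
for step in range(2000):
    err = w * X + b - Y
    dw, db = 2 * np.sum(err * X), 2 * np.sum(err)   # gradient of the cost w.r.t. w and b
    w, b = w - alpha * dw, b - alpha * db           # gradient descent step

print(w, b, cost(w, b))
The cost function defines what "good" means; gradient descent is just the procedure that searches for the parameters that make it small.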
It's strange to think about, but there is more than one measure of how "accurately" a line fits a set of data points.
To assess how accurately a line fits the data, we have a "cost" function which can compare predicted vs. actual values and provide a "penalty" for how wrong it is.
penalty = cost_function(predicted, actual)
A naive cost function might just take the difference between the predicted and actual values.
More sophisticated functions will square the value, since we'd rather have many small errors than one large error.
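For example (made-up numbers), compare the two penalties on the same predictions and notice how the single large miss dominates the squared version:
import numpy as np

predicted = np.array([1.0, 2.0, 3.0])
actual    = np.array([1.5, 2.0, 0.0])

def naive_cost(pred, act):
    return np.sum(np.abs(pred - act))     # penalty grows linearly with each error

def squared_cost(pred, act):
    return np.sum((pred - act) ** 2)      # one large error is penalized more than many small ones

print(naive_cost(predicted, actual), squared_cost(predicted, actual))   # 3.5 vs 9.25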
Additionally, each point has a different "sensitivity" to moving the line. Some points react very strongly to movement. Others react less strongly.
Often, you can make a tradeoff and move TOWARD a point that is sensitive and AWAY from a point that is NOT sensitive. In that scenario, you get more than you give up.
The "gradient" is a way of measuring how sensitive each point is to moving the line.
This article does a good job of describing WHY there is more than one measure, and WHY some points are more sensitive than others:
https://towardsdatascience.com/wrapping-your-head-around-gradient-descent-with-pictures-3fbd810235f5?source=friends_link&sk=7117e5de8c66bd4a4c2bb2a87a928773
Let's take the example of a logistic regression model for binary classification. The output (predicted value) of the model for any given input will be offset (deviate) from the actual output (expected value) during training. So, the model needs to be trained to minimal error (loss) so that it can perform well with high accuracy.
The function used to find the parameter values (m and c in the case of the linear equation y = mx + c) at which the minimal error (loss) occurs is called the cost function / loss function. "Loss function" is the term used for the loss on a single row/record of the training sample, and "cost function" is the term used for the loss over the entire training dataset.
Now, how do we find the parameter values (m and c in our case) at which the minimum loss occurs? By using the gradient descent algorithm and its update equation, which helps us find the point at which the minimum loss occurs; the parameter values at that point are used for the model (say y = 0.5x + 2, where m = 0.5 and c = 2 are the values at which the loss is minimal).
The cost function expresses at what cost you are building your model; for a good model that cost should be minimal. To find the minimum of the cost function we use the gradient descent method, which gives the coefficient values that minimize the cost function.
