Short Definition of Backpropagation and Gradient Descent - machine-learning

I need to write a very short definition of backpropagation and gradient descent and I'm a bit confused what the difference is.
Is the following definition correct?:
To calculate the weights of a neural network, the backpropagation algorithm is used. It is an optimization process for reducing the model error. The technique is based on a gradient descent method. The contribution of each weight to the total error is calculated backward, from the output layer across all hidden layers to the input layer. For this, the partial derivative of the error function E with respect to w is calculated. The resulting gradient is used to adjust the weights in the direction of steepest descent:
w_new = w_old - learning_rate * (∂E / ∂w_old)
Any suggestions or corrections?
Thanks!

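First, backpropagation and gradient descent are two separate things: backpropagation is the method that computes the gradient of the error with respect to each weight, and gradient descent is one of the methods that can use that gradient to update the weights. Other than this, your definition is correct. We compare the generated result with the desired value and change the weights assigned to each edge so as to make the error as low as possible. Plain gradient descent does not revert a step if the error increases afterwards; it simply continues from the new weights. The learning rate you choose should be neither very low nor very high: too low and training converges very slowly, too high and the updates overshoot and may diverge, so in either case you may not reach the minimum error. (Vanishing and exploding gradients are a separate problem, caused by how the gradient shrinks or grows as it is propagated back through many layers.)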
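To make the two roles concrete, here is a minimal sketch, assuming a tiny network with one sigmoid hidden layer, a linear output and the squared error E = 0.5 * ||y_hat - y||^2 (none of these choices come from the question; the function names are hypothetical). `backprop` only computes the partial derivatives, and `gradient_descent_step` applies the update rule w_new = w_old - learning_rate * dE/dw_old.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    h = sigmoid(W1 @ x)   # hidden activations
    y_hat = W2 @ h        # linear output layer
    return h, y_hat

def error(x, y, W1, W2):
    _, y_hat = forward(x, W1, W2)
    return 0.5 * float(np.sum((y_hat - y) ** 2))

def backprop(x, y, W1, W2):
    """Backpropagation: compute dE/dW1 and dE/dW2, output layer first."""
    h, y_hat = forward(x, W1, W2)
    delta_out = y_hat - y                               # dE/dy_hat
    grad_W2 = np.outer(delta_out, h)
    delta_hidden = (W2.T @ delta_out) * h * (1.0 - h)   # pushed back through the sigmoid
    grad_W1 = np.outer(delta_hidden, x)
    return grad_W1, grad_W2

def gradient_descent_step(W, grad, learning_rate):
    """Gradient descent: move the weights against the gradient."""
    return W - learning_rate * grad

# Made-up data and initial weights, purely for illustration.
rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])
y = np.array([1.0])
W1 = rng.normal(scale=0.5, size=(4, 3))
W2 = rng.normal(scale=0.5, size=(1, 4))

initial_error = error(x, y, W1, W2)
for _ in range(200):
    g1, g2 = backprop(x, y, W1, W2)
    W1 = gradient_descent_step(W1, g1, 0.1)
    W2 = gradient_descent_step(W2, g2, 0.1)
final_error = error(x, y, W1, W2)
```

The same gradients from `backprop` could be handed to a different optimizer; only `gradient_descent_step` would change.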

Related

Will the shape of the Loss function change during training?

I have some problem understanding the theory of loss function and hope some one can help me.
Usually when people try to explain gradient descent to you, they will show you a loss function that looks like the very first image in the post "gradient descent: all you need to know". I understand that the whole idea of gradient descent is to adjust the weights so as to minimize the loss function.
My question is: will the shape of the loss function change during training, or will it stay as shown in the image in the above post? I know that the weights are what we are always tuning, so the parameters that determine the shape of the loss function should be the inputs X={x1,x2,...xn}. Let's take an easy example: suppose our inputs are [[1,2,3,4,5],[5,4,3,2,1]] and the labels are [1,0] (only two training samples for ease, and we set the batch size to 1). Then the loss function should be something like this for the first training sample:
L = (1-nonlinear(1*w1+2*w2+3*w3+4*w4+5*w5+b))^2
and for the second training sample the loss function should be:
L = (0-nonlinear(5*w1+4*w2+3*w3+2*w4+1*w5+b))^2
Apparently, these two loss functions don't look the same if we plot them, so does that mean the shape of the loss function is changing during training? Then why do people still use that one image (a point that slides down the loss function and finds the global minimum) to explain gradient descent?
Note: I'm not changing the loss function; the loss function is still mean squared error. I'm trying to say that the shape of the loss function seems to be changing.
I know where my problem comes from! I thought that we could not plot a function such as f(x,y) = xy without any constant in it, but we actually can! I searched Google for the graph of f(x,y)=xy and indeed it can be plotted! So now I understand: as long as we have the loss function, we can get the plot! Thanks guys
The function stays the same. The point of gradient descent is to find the lowest point on a given loss function that you define.
Generally, the loss function you are training to minimize does not change throughout the course of a training session. The flaw in reasoning is that you are assuming that the loss function is characterized by weights of the network, when in fact the weights of that network are a sort-of input to the loss function.
To clarify, let us assume we are predicting some N-dimensional piece of information and we have a ground truth vector, call it p, and a loss function L taking in a prediction vector p_hat which we define as
L(p_hat) := norm(p - p_hat).
This is a very primitive (and quite ineffective) loss function, but it is one nonetheless. Once we begin training, this is the function that we will try to minimize in order to get our network to perform as well as possible. Notice that this loss function will attain different values for different inputs p_hat; this does not mean the loss function is changing! In the end, the loss function will be an N-dimensional hypersurface in an (N+1)-dimensional space that stays the same no matter what (similar to the image, which shows a 2-dimensional surface in a 3-dimensional space).
Gradient descent tries to find a minimum on this surface that is constructed by the loss function, but we do not really know what the surface looks like as a whole. Instead, we find out small things about the surface by evaluating the loss function at the values of p_hat we give it.
Note, this is all a huge oversimplification, but can be a useful way to think about it getting started.
A Loss Function is a metric that measures the distance from your predictions to your targets.
The idea is to choose the weights so that your predictions are close to your targets, that is, so that your model has learned/memorized the input.
The loss function should usually not be changed during training, because the minimum in the original function might not coincide with the new one, so the gradient descent's work is lost.
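The two-sample example from the question can be evaluated directly. This is a rough sketch, assuming `nonlinear` is a sigmoid (the question does not say which nonlinearity is meant) and using made-up weight vectors. Each per-sample loss is its own function of the weights, while the mean over the fixed training set is a single function of (w, b) that gradient descent merely evaluates at different points.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
              [5.0, 4.0, 3.0, 2.0, 1.0]])
labels = np.array([1.0, 0.0])

def nonlinear(z):
    # Assumption: sigmoid stands in for the unspecified "nonlinear".
    return 1.0 / (1.0 + np.exp(-z))

def sample_loss(w, b, x, label):
    """Squared error for a single training sample."""
    return float((label - nonlinear(x @ w + b)) ** 2)

def dataset_loss(w, b):
    """Mean squared error over the whole (fixed) training set."""
    return float(np.mean([sample_loss(w, b, x, t) for x, t in zip(X, labels)]))

# Two hypothetical weight settings: the same function, evaluated at two points.
w_a = np.zeros(5)
w_b = np.array([-0.5, 0.0, 0.0, 0.0, 0.5])
loss_at_a = dataset_loss(w_a, 0.0)
loss_at_b = dataset_loss(w_b, 0.0)

# The per-sample losses are different functions of the same weights.
w_c = np.array([0.1, 0.0, 0.0, 0.0, 0.0])
first_sample = sample_loss(w_c, 0.0, X[0], labels[0])
second_sample = sample_loss(w_c, 0.0, X[1], labels[1])
```

With a batch size of 1, each step looks at only one of the per-sample functions, which is why the surface can appear to change from step to step even though the loss over the training set does not.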

Should The Gradients For The Output Layer of an RNN Be Clipped?

I am currently training an LSTM RNN for time-series forecasting. I understand that it is common practice to clip the gradients of the RNN when they cross a certain threshold. However, I am not completely clear on whether or not this includes the output layer.
If we call the hidden layer of an RNN h, then the output is sigmoid(connected_weights*h + bias). I know that the gradients for the weights for determining the hidden layer are clipped, but does the same go for the output layer?
In other words, are the gradients for the connected_weights also clipped in gradient clipping?
While nothing prevents you from clipping them as well, there is no reason to do so. A nice paper with reasons is here, I'll try to give you an overview.
The problem we're trying to solve by gradient clipping is that of exploding gradients: Let's assume that your RNN layer is computed like this:
h_t = sigmoid(U * x + W * h_tm1 + b)
So forgetting about the nonlinearity for a while, you could say that the current state h_t depends on some earlier state h_tmT (the state T steps back) as h_t = W^T * h_tmT + input, where W^T means W raised to the power T, not a transpose. So if the matrix W inflates the hidden state, the influence of that old hidden state grows exponentially with time. And the same happens as you backpropagate the gradient, resulting in gradients that will most likely take you to some useless point in the parameter space.
On the other hand, the output layer is applied just once during both forward and backward pass, so while it may complicate the learning, it will only be by a 'constant' factor, independent of the unrolling in time.
To get a bit more technical: The crucial quantity which determines whether you get exploding gradient is the largest eigenvalue of W. If it is larger than one (or smaller than -1, then it's real fun :-)), then you get exploding gradients. Conversely, if it's smaller than one, you'll suffer from vanishing gradients, making it difficult to learn long-term dependencies. You can find a nice discussion of these phenomena here, with pointers to classical literature.
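The effect of the largest eigenvalue can be seen in a small experiment. This is only a sketch under strong assumptions: the nonlinearity and the input term are dropped as in the simplified recurrence above, and W is built as a scaled orthogonal matrix so that every eigenvalue has the same magnitude (a made-up construction chosen to keep the arithmetic exact, not something taken from the papers mentioned).

```python
import numpy as np

def spectral_radius(W):
    """Largest eigenvalue magnitude of W."""
    return float(np.max(np.abs(np.linalg.eigvals(W))))

def norm_after_steps(W, h0, steps):
    """Norm of W applied `steps` times to h0 (no nonlinearity, no input)."""
    h = h0.copy()
    for _ in range(steps):
        h = W @ h
    return float(np.linalg.norm(h))

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # orthogonal: all |eigenvalues| are 1
h0 = rng.normal(size=8)

W_expanding = 1.2 * Q     # spectral radius 1.2: the state grows like 1.2**steps
W_contracting = 0.8 * Q   # spectral radius 0.8: the state shrinks like 0.8**steps

growth = norm_after_steps(W_expanding, h0, 50) / np.linalg.norm(h0)
decay = norm_after_steps(W_contracting, h0, 50) / np.linalg.norm(h0)
```

For a general W the growth is not this clean, but the long-run rate is still governed by the eigenvalue of largest magnitude.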
If we take the sigmoid back into the picture, it becomes more difficult to get exploding gradients, as the gradients get dampened by at least a factor of 4 when being backpropagated through it (the derivative of the sigmoid is at most 1/4). But still, have an eigenvalue larger than 4 and you'll have adventures :-) It's rather important to initialize carefully; the second paper gives some hints. With tanh, there is little dampening around zero, and ReLU just propagates the gradient through, so these are rather prone to gradient explosions and thus sensitive to initialization and gradient clipping.
Overall, LSTMs have better learning properties than vanilla RNNs, esp. with regard to the vanishing gradients. Though from my experience, gradient clipping is usually necessary with them as well.
EDIT: When to clip?
Right before the update of the weights, i.e. you do the backprop unaltered. The thing is that gradient clipping is kind of a dirty hack. You still want your gradient to be as precise as possible, so you had better not distort it in the middle of the backprop. It's just that if you see the gradient become very large, you say "Nah, this smells. I'd better make a tiny step." and clipping is an easy way to do that (it may be that only some elements of the gradient have exploded while the others are still well behaved and informative). With most toolkits, you don't have the choice anyway, because the backpropagation happens atomically.
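As a rough illustration of "clip right before the update", here is a sketch that assumes clipping by the global norm of all gradients (one common variant; element-wise clipping is another) and plain SGD as the update. The function names and the list-of-arrays layout are hypothetical, not any toolkit's API.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

def sgd_update(params, grads, learning_rate, max_norm):
    """Backprop has already produced `grads` unaltered; clip only here."""
    clipped, _ = clip_by_global_norm(grads, max_norm)
    return [p - learning_rate * g for p, g in zip(params, clipped)]

# Made-up numbers: the joint norm is 5, so everything is scaled by 1/5.
grads = [np.array([3.0, 4.0]), np.array([0.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
params = sgd_update([np.zeros(2), np.zeros(1)], grads, learning_rate=0.1, max_norm=1.0)
```

Rescaling keeps the direction of the step and only shortens it. Which parameter groups are passed in (for example, whether the output layer's weights are included) is left to the caller.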

Backpropagation in Gradient Descent for Neural Networks vs. Linear Regression

I'm trying to understand "Back Propagation" as it is used in Neural Nets that are optimized using Gradient Descent. Reading through the literature it seems to do a few things.
1. Use random weights to start with and get error values.
2. Perform Gradient Descent on the loss function using these weights to arrive at new weights.
3. Update the weights with these new weights until the loss function is minimized.
The steps above seem to be the EXACT process to solve for Linear Models (Regression for e.g.)? Andrew Ng's excellent course on Coursera for Machine Learning does exactly that for Linear Regression.
So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function.. and if not, why is it only referenced in the case of Neural Nets and why not for GLMs (Generalized Linear Models). They all seem to be doing the same thing- what might I be missing?
The main division happens to be hiding in plain sight: linearity. In fact, extend the question to continuity of the first derivative, and you'll encapsulate most of the difference.
First of all, take note of one basic principle of neural nets (NN): a NN with linear weights and linear dependencies is a GLM. Also, in that linear case, having multiple hidden layers is equivalent to having a single hidden layer: it's still linear combinations from input to output.
A "modern" NN has non-linear layers: ReLUs (change negative values to 0), pooling (max, min, or mean of several values), dropouts (randomly remove some values), and other methods destroy our ability to smoothly apply Gradient Descent (GD) to the model. Instead, we take many of the principles and work backward, applying limited corrections layer by layer, all the way back to the weights at layer 1.
Lather, rinse, repeat until convergence.
Does that clear up the problem for you?
You got it!
A typical ReLU is
f(x) = x  if x > 0,
       0  otherwise
A typical pooling layer reduces the input length and width by a factor of 2; in each 2x2 square, only the maximum value is passed through. Dropout simply kills off random values to make the model retrain those weights from "primary sources". Each of these is a headache for GD, so we have to do it layer by layer.
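The three operations described above are short to write down. A minimal NumPy sketch follows, assuming a 2-D input with even height and width for the pooling and "inverted" dropout that rescales the surviving values (the answer does not specify a rescaling; that part is an assumption). `relu_grad` shows the derivative used when working backward through a ReLU layer.

```python
import numpy as np

def relu(x):
    """f(x) = x if x > 0, 0 otherwise."""
    return np.maximum(x, 0.0)

def relu_grad(x, upstream):
    """Gradient through ReLU: pass it where x > 0, block it elsewhere."""
    return upstream * (x > 0)

def max_pool_2x2(x):
    """Keep the maximum of each non-overlapping 2x2 square (even H and W assumed)."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

def dropout(x, rate, rng):
    """Zero out each value with probability `rate` (rate < 1).

    Assumption: inverted dropout, so survivors are scaled by 1 / (1 - rate).
    """
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

pooled = max_pool_2x2(np.arange(16.0).reshape(4, 4))
```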
"So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function.. and if not, why is it only referenced in the case of Neural Nets"
I think (at least originally) back propagation of errors meant less than what you describe: the term "backpropagation of errors" only referred to the method of calculating derivatives of the loss function, as opposed to e.g. automatic differentiation, symbolic differentiation, or numerical differentiation, no matter what the gradient was then used for (e.g. Gradient Descent, or maybe Levenberg-Marquardt).
"They all seem to be doing the same thing- what might I be missing?"
They're using different models. If your neural network used linear neurons, it would be equivalent to linear regression.
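The equivalence for linear neurons can be checked numerically. This is a small sketch with made-up layer sizes and random weights: two stacked linear layers compute exactly the same function as one linear layer whose weight matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # hidden layer, no nonlinearity
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # output layer

def two_layer_linear(x):
    return W2 @ (W1 @ x + b1) + b2

# Collapse both layers into a single linear map.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2

def single_layer_linear(x):
    return W_combined @ x + b_combined

x = rng.normal(size=3)
outputs_match = np.allclose(two_layer_linear(x), single_layer_linear(x))
```

Once a nonlinearity sits between the layers, this collapse is no longer possible, which is where the models start to differ.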

Gradient Descent: Do we iterate on ALL of the training set with each step in GD? or Do we change GD for each training set?

I've taught myself machine learning with some online resources but I have a question about gradient descent that I couldn't figure out.
The formula for gradient descent for logistic regression is given by the following:
Repeat {
θj = θj − (α/m) ∑ (hθ(x) − y) xj
}
where θj is the coefficient on variable j, α is the learning rate, hθ(x) is the hypothesis, y is the real value, and xj is the value of variable j. m is the number of training examples. hθ(x) and y are for each training example (i.e. that's what the summation sign is for).
This is where I get confused.
It's not clear to me if the summation represents my entire training set or the number of iterations I have done up to that point.
For example, imagine I have 10 training examples. If I perform gradient descent after each training example, my coefficients will be very different than if I performed gradient descent after all 10 training examples.
See below how the First Way is different than the Second Way:
First way
Step 1: Since coefficients initialized to 0, hθ(x)=0
Step 2: Perform gradient descent on the first training example.
Summation term only includes 1 training example
Step 3: Now use new coefficients for training examples 1 & 2... summation term includes first 2 training examples
Step 4: Perform gradient descent again.
Step 5: Now use new coefficients for training examples 1,2 &3... summation term includes first 3 training examples
Continue until convergence or all training examples used.
Second way
Step 1: Since coefficients initialized to 0, hθ(x)=0 for all 10 training examples
Step 2: Perform 1 step of gradient descent using all 10 training examples. Coefficients will be different from the First Way because the summation term includes all 10 training examples
Step 3: Use new coefficients on all 10 training examples again. summation term includes all 10 training examples
Step 4: Perform gradient descent and continue using coefficients on all examples until convergence
I hope that explains my confusion. Does anyone know which way is correct?
Edit: Adding cost function and hypothesis function
cost function = −(1/m) ∑ [y log(hθ(x)) + (1−y) log(1−hθ(x))]
hθ(x) = 1/(1 + e^(−z))
and z = θ0 + θ1X1 + θ2X2 + θ3X3 + ... + θnXn
The second way you are describing it is the correct way to perform Gradient Descent. The true gradient is dependent on the whole data set, so one iteration of gradient descent requires using all of the data set. (This is true for any learning algorithm where you can take the gradient)
The "first way" is close to something that is called Stochastic Gradient Descent. The idea here is that using the whole data set for one update might be overkill, especially if some of the data points are redundant. In this case, we pick a random point from the data set - essentially setting m=1. We then update based on successive selections of single points in the data set. This way we can do m updates at about the same cost as one update of Gradient Descent. But each update is a bit noisy, which can make convergence to the final solution difficult.
The compromise between these approaches is called "MiniBatch". Taking the gradient of the whole data set is one full round of "batch" processing, as we need the whole data set on hand. Instead we will do a mini batch, selecting only a small subset of the whole data set. In this case we set k, 1 < k < m, where k is the number of points in the mini batch. We select k random data points to create the gradient from at every iteration, and then perform the update. Repeat until convergence. Obviously, increasing / decreasing k is a tradeoff between speed and accuracy.
Note: For both stochastic & mini batch gradient descent, it is important to shuffle / select randomly the next data point. If you use the same iteration order for each data point, you can get really weird / bad results - often diverging away from the solution.
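The three variants differ only in how many examples go into each update. Here is a minimal sketch for the logistic regression from the question, assuming synthetic data with 10 examples and a hypothetical `fit` helper with a made-up learning rate. Setting k = m gives batch gradient descent, k = 1 gives stochastic gradient descent, and anything in between is a mini batch. The order is reshuffled every epoch, as the note above recommends.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost from the question, averaged over the rows of X."""
    h = sigmoid(X @ theta)
    tiny = 1e-12  # guards against log(0)
    return float(-np.mean(y * np.log(h + tiny) + (1 - y) * np.log(1 - h + tiny)))

def gradient(theta, X, y):
    """(1/k) * sum over the k given rows of (h_theta(x) - y) * x_j."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def fit(X, y, alpha, k, epochs, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)                   # coefficients initialized to 0
    for _ in range(epochs):
        order = rng.permutation(m)        # new random order every epoch
        for start in range(0, m, k):
            idx = order[start:start + k]  # the k examples used for this update
            theta = theta - alpha * gradient(theta, X[idx], y[idx])
    return theta

# Synthetic data, purely for illustration: 10 examples, 2 features plus intercept.
data_rng = np.random.default_rng(1)
features = data_rng.normal(size=(10, 2))
X = np.hstack([np.ones((10, 1)), features])
y = (features[:, 0] + features[:, 1] > 0).astype(float)

theta_batch = fit(X, y, alpha=0.5, k=10, epochs=200)  # 1 update per epoch
theta_mini = fit(X, y, alpha=0.5, k=5, epochs=200)    # 2 updates per epoch
theta_sgd = fit(X, y, alpha=0.5, k=1, epochs=200)     # 10 updates per epoch
```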
In the case of batch gradient descent (take all samples), each step uses the true gradient, so your solution will typically converge in fewer iterations, although each iteration is more expensive. In the case of stochastic gradient descent (take one sample at a time), convergence will typically need more iterations, because each step is noisy.
When the training set is not huge, use batch gradient descent. But there are situations where the training set is not fixed, e.g. when the training happens on the fly: you keep getting more and more samples and update your vector accordingly. In this case you have to update per sample.

What should be a generic enough convergence criteria of Stochastic Gradient Descent

I am implementing a generic module for Stochastic Gradient Descent that takes as arguments a training dataset, loss(x,y), and dw(x,y): the per-sample loss and the per-sample gradient.
Now, for the convergence criteria, I have thought of :-
a) Checking loss function after every 10% of the dataset.size, averaged over some window
b) Checking the norm of the differences between weight vector, after every 10-20% of dataset size
c) Stabilization of error on the training set.
d) Change in the sign of the gradient (again, checked at fixed intervals)
I have noticed that these checks (precision of the check etc.) also depend on other things, like step size and learning rate, and the effect can vary from one training problem to another.
I can't seem to make up my mind about what the generic stopping criterion should be, regardless of the training set, fx, df/dw thrown at the SGD module. What do you guys do?
Also, for (d), what would be the meaning of "change in sign" for an n-dimensional vector? As in: given dw_i and dw_i+1, how do I detect the change of sign? Does it even have a meaning in more than 2 dimensions?
P.S. Apologies for non-math/latex symbols..still getting used to the stuff.
First, stochastic gradient descent is the on-line version of the gradient descent method. The update rule uses a single example at a time.
Suppose f(x) is your cost function for a single example; the stopping criterion of SGD for an N-dimensional vector is usually:
See this1, or this2 for details.
Second, there is a further twist on stochastic gradient descent using so-called “minibatches”. It works identically to SGD, except that it uses more than one training example to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient, and often makes better use of the hierarchical memory organization in modern computers. See this3.
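As one possible shape for option (a) from the question, here is a sketch of a stopping check based on a moving average of the per-sample loss. The class name, the relative tolerance and the patience counter are all hypothetical choices, not a standard criterion, and the sketch assumes a non-negative loss.

```python
from collections import deque

class LossPlateauStopper:
    """Stop when the windowed average loss has stopped improving.

    Hypothetical helper: call record() after every sample, and should_stop()
    at fixed intervals (e.g. after every 10% of the dataset).
    """

    def __init__(self, window, rel_tol, patience):
        self.recent = deque(maxlen=window)   # last `window` per-sample losses
        self.rel_tol = rel_tol               # required relative improvement
        self.patience = patience             # failed checks tolerated before stopping
        self.best = float("inf")
        self.checks_without_improvement = 0

    def record(self, sample_loss):
        self.recent.append(sample_loss)

    def should_stop(self):
        if len(self.recent) < self.recent.maxlen:
            return False                     # not enough data to judge yet
        average = sum(self.recent) / len(self.recent)
        if average < self.best * (1.0 - self.rel_tol):
            self.best = average
            self.checks_without_improvement = 0
        else:
            self.checks_without_improvement += 1
        return self.checks_without_improvement >= self.patience

# Made-up loss values: improving at first, then flat.
stopper = LossPlateauStopper(window=3, rel_tol=0.01, patience=2)
decisions = []
for chunk in ([1.0, 0.9, 0.8], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]):
    for value in chunk:
        stopper.record(value)
    decisions.append(stopper.should_stop())
```

Using a relative tolerance rather than an absolute one makes the check less sensitive to the scale of the loss, although the window size and patience still need tuning per problem.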

Resources