How to monitor the gradient at each time step of backpropagation through time with Torch?

I am trying to solve a vanishing gradient problem in an RNN with Torch, and I would therefore like to plot the evolution of the gradient of each learned parameter during backpropagation through time (BPTT). I have seen solutions that check the gradient after BPTT is done, which only gives me the final, accumulated gradient. I understand that Torch records the computation graph during the forward pass and traverses it when the backward function is called, and that BPTT is effectively a loop over each time step of the training sequence. Is there any way to save the gradient at each time step of BPTT with Torch?
Thank you for any help!
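For reference, here is a minimal sketch (not part of the original question) of one possible way to capture per-step gradients, assuming the RNN is unrolled by hand with nn.RNNCell and using tensor.register_hook; the sizes and the names step_grads and make_hook are made up for illustration:

import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 10, 4, 3, 8
cell = nn.RNNCell(input_size, hidden_size)
x = torch.randn(seq_len, batch, input_size)

step_grads = {}  # time step -> gradient of the loss w.r.t. that step's hidden state

def make_hook(t):
    def hook(grad):
        # Called during backward() with d(loss)/d(h_t) for this time step.
        step_grads[t] = grad.detach().clone()
    return hook

h = torch.zeros(batch, hidden_size)
for t in range(seq_len):
    h = cell(x[t], h)              # one step of the unrolled RNN
    h.register_hook(make_hook(t))

loss = h.sum()                     # dummy loss on the final hidden state
loss.backward()                    # BPTT: the hooks fire once per time step

for t in sorted(step_grads):
    print(f"step {t}: grad norm = {step_grads[t].norm():.4f}")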

Related

Back propagation vs Levenberg Marquardt

Does anyone know the difference between backpropagation and Levenberg–Marquardt in neural network training? Sometimes I see LM described as a BP algorithm, and sometimes I see the opposite.
Your help will be highly appreciated.
Thank you.
Those are two completely unrelated concepts.
Levenberg-Marquardt (LM) is an optimization method, while backprop is just the recursive application of the chain rule for derivatives.
What LM intuitively does is this: when it is far from a local minimum, it ignores the curvature of the loss and acts as gradient descent. However, as it gets closer to a local minimum it pays more and more attention to the curvature by switching from gradient descent to a Gauss-Newton like approach.
The LM method needs both the gradient and the Hessian, as it solves variants of (H + coeff*Identity) dx = -g, with H and g respectively the Hessian and the gradient. You can obtain the gradient via backpropagation. The Hessian is most often not as simple to obtain, although in least squares it can be approximated by the Gauss-Newton term (the sum of outer products of the residuals' gradients), which means that in that case it can be assembled from first-derivative information alone.
For neural networks, LM usually isn't practical: you can't construct such a huge Hessian explicitly, and even if you could, it lacks the sparse structure needed to solve the linear system efficiently.
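For illustration only (not part of the answer above): a sketch of the damped system (H + coeff*Identity) dx = -g on a toy linear least-squares fit; the data, damping schedule, and variable names are invented.

import numpy as np

# Fit y = a*x + b by minimizing the sum of squared residuals.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.2])

def residuals(theta):
    a, b = theta
    return a * x + b - y

def jacobian(theta):
    # d(residual_i)/d(theta_j): columns are [x_i, 1] for this linear model.
    return np.stack([x, np.ones_like(x)], axis=1)

theta = np.zeros(2)
coeff = 1.0                                  # the damping coefficient from the answer

for _ in range(50):
    r = residuals(theta)
    J = jacobian(theta)
    g = 2 * J.T @ r                          # gradient of sum(r**2)
    H = 2 * J.T @ J                          # Gauss-Newton approximation of the Hessian
    dx = np.linalg.solve(H + coeff * np.eye(2), -g)
    if np.sum(residuals(theta + dx) ** 2) < np.sum(r ** 2):
        theta = theta + dx                   # accept: trust the curvature model more
        coeff *= 0.5
    else:
        coeff *= 2.0                         # reject: fall back toward gradient descent

print(theta)                                 # close to the least-squares solution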

Short Definition of Backpropagation and Gradient Descent

I need to write a very short definition of backpropagation and gradient descent, and I'm a bit confused about what the difference is.
Is the following definition correct?
To calculate the weights of a neural network, the backpropagation algorithm is used. It is an optimization process for reducing the model error. The technique is based on a gradient descent method: the contribution of each weight to the total error is calculated backwards, from the output layer across all hidden layers to the input layer. For this, the partial derivative of the error function E with respect to each weight w is computed. The resulting gradient is used to adjust the weights in the direction of steepest descent:
w_new = w_old - learning_rate * (∂E / ∂w_old)
Any suggestions or corrections?
Thanks!
First, gradient descent is just one of the optimization methods that uses the gradients computed by backpropagation; other than that, your definition is correct. We compare the output the network produces with the desired value and adjust the weights assigned to each edge so as to make the error as low as possible (some training schemes even revert to the previous weights if an update increases the error). The learning rate you choose should be neither very low nor very high: too low and training converges extremely slowly, too high and the updates overshoot or diverge, and you won't reach the minimum error.
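A bare-bones numerical illustration of the update rule above, on a single made-up training example (one linear neuron, squared error):

import numpy as np

x = np.array([1.0, 2.0, 3.0])    # input
t = 0.5                          # target
w = np.zeros(3)                  # weights
b = 0.0                          # bias
learning_rate = 0.01

for step in range(200):
    y = w @ x + b                # forward pass
    E = (y - t) ** 2             # squared error
    dE_dy = 2 * (y - t)          # chain rule through the error ...
    dE_dw = dE_dy * x            # ... and through the linear neuron
    dE_db = dE_dy
    w = w - learning_rate * dE_dw   # w_new = w_old - learning_rate * (∂E / ∂w_old)
    b = b - learning_rate * dE_db

print(E)                         # close to zero after training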

Will the shape of the Loss function change during training?

I have some problem understanding the theory of loss function and hope some one can help me.
Usually when people try to explain gradient descent, they show you a loss function that looks like the very first image in this post: gradient descent: all you need to know. I understand that the whole point of gradient descent is to adjust the weights so as to minimize the loss function.
My question is: will the shape of the loss function change during training, or will it stay the same as in the image from that post? I know the weights are what we keep tuning, so the parameters that determine the shape of the loss function should be the inputs X = {x1, x2, ..., xn}. Let's take an easy example: suppose our inputs are [[1,2,3,4,5],[5,4,3,2,1]] and the labels are [1,0] (only two training samples for simplicity, and the batch size is 1). Then the loss function for the first training sample is something like
L = (1 - nonlinear(1*w1 + 2*w2 + 3*w3 + 4*w4 + 5*w5 + b))^2
and for the second training sample the loss function should be:
L = (0 - nonlinear(5*w1 + 4*w2 + 3*w3 + 2*w4 + 1*w5 + b))^2
Apparently these two loss functions don't look the same if we plot them, so does that mean the shape of the loss function changes during training? Then why do people still use that single image (a point sliding down the loss surface to the global minimum) to explain gradient descent?
Note: I'm not changing the loss function itself; it is still the mean squared error. I'm just saying that the shape of the loss function seems to be changing.
I figured out where my problem came from! I thought we could not plot a function such as f(x,y) = xy without any constant in it, but we actually can! I searched Google for the graph of f(x,y) = xy and indeed it can be plotted. So now I understand: as long as we have the loss function, we can get the plot. Thanks guys!
The function stays the same. The point of gradient descent is to find the lowest point on a given loss function that you define.
Generally, the loss function you are training to minimize does not change throughout the course of a training session. The flaw in the reasoning is assuming that the loss function is characterized by the weights of the network, when in fact the weights of the network are a sort of input to the loss function.
To clarify, let us assume we are predicting some N-dimensional piece of information and we have a ground truth vector, call it p, and a loss function L taking in a prediction vector p_hat which we define as
L(p_hat) := norm(p - p_hat).
This is a very primitive (and quite ineffective) loss function, but it is one nonetheless. Once we begin training, this is the function we will try to minimize so that our network performs as well as possible. Notice that this loss function attains different values for different inputs p_hat; that does not mean the loss function is changing! In the end, the loss function is an N-dimensional hypersurface in an (N+1)-dimensional space that stays the same no matter what (similar to what you see in the image, where it is a 2-dimensional surface in a 3-dimensional space).
Gradient descent tries to find a minimum on the surface constructed by the loss function, but we do not really know what the surface looks like as a whole; instead, we learn small things about it by evaluating the loss function at the values of p_hat we give it.
Note, this is all a huge oversimplification, but it can be a useful way to think about it when getting started.
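To make the point concrete, here is a tiny sketch (the ground-truth vector and candidate predictions are invented): the function L(p_hat) = norm(p - p_hat) is fixed, it just returns different values at different points.

import numpy as np

p = np.array([1.0, 0.0, 2.0])            # ground truth, fixed throughout training

def L(p_hat):
    # The loss *function* never changes; only the point where we evaluate it does.
    return np.linalg.norm(p - p_hat)

print(L(np.array([0.0, 0.0, 0.0])))      # one point on the (fixed) loss surface
print(L(np.array([2.0, 1.0, 1.0])))      # another point, different value, same surface
print(L(p))                              # the minimum: 0.0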
A Loss Function is a metric that measures the distance from your predictions to your targets.
The idea is to choose the weights so that your predictions are close to your targets, that is, so your model has learned/memorized the input.
The loss function should usually not be changed during training, because the minimum of the original function might not coincide with that of the new one, so the work done by gradient descent is lost.

Back Propagation in Convolutional Neural Networks and how to update filters

I'm learning about Convolutional Neural Networks and right now I'm confused about how to implement them.
I know about regular neural networks and concepts like gradient descent and backpropagation, and I understand intuitively how CNNs work.
My question is about backpropagation in CNNs. How does it happen? The last fully connected layers are just a regular neural network, and there is no problem there. But how can I update the filters in the convolution layers? How can I backpropagate the error from the fully connected layers to these filters? My problem is updating the filters!
Are filters just simple matrices, or do they have structures like regular NNs, with connections between layers providing that capability? I read about sparse connectivity and shared weights, but I can't relate them to CNNs. I'm really confused about implementing CNNs, and I can't find any tutorials that talk about these concepts. I can't read papers because I'm new to these things and my math is not good.
I don't want to use TensorFlow or similar tools; I'm learning the core concepts using pure Python.
First off, I can recommend this introduction to CNNs. Maybe you can grasp the idea of it better with this.
To answer some of your questions in short:
Let's say you want to use a CNN for image classification. The picture consists of NxM pixels and has 3 channels (RGB). To apply a convolutional layer to it, you use a filter. Filters are matrices of (usually, but not necessarily) square shape (e.g. PxP), with a number of channels that equals the number of channels of the representation they are applied to. Therefore, the first conv layer's filter also has 3 channels. Channels are the depth of the filter, so to speak.
When applying a filter to a picture, you do something called discrete convolution. You take your filter (which is usually smaller than your image), slide it over the picture step by step, and compute the convolution at each position; this is essentially an element-wise multiplication of the filter with the underlying patch, followed by a sum. Then you apply an activation function and maybe a pooling layer. Important to note is that the filter stays the same for all convolutions performed in this layer, so you only have PxP parameters (per channel) per filter; that's why they are called shared weights. You tweak the filter so that it fits the training data as well as possible, and when applying GD, you simply apply it to these filter weights.
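As a rough sketch of this idea (not the answer author's code; sizes and helper names are illustrative), here is a single-channel 2-D convolution in plain Python/NumPy, showing how the same PxP filter is reused at every position and how its gradient is accumulated during backprop:

import numpy as np

def conv2d_forward(image, filt):
    # "Valid" cross-correlation of a single-channel image with one PxP filter.
    H, W = image.shape
    P = filt.shape[0]
    out = np.zeros((H - P + 1, W - P + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The *same* filter (shared weights) is applied at every position.
            out[i, j] = np.sum(image[i:i + P, j:j + P] * filt)
    return out

def conv2d_filter_grad(image, grad_out, P):
    # Gradient of the loss w.r.t. the filter, given the gradient coming back
    # from the layer above. Contributions from every position are summed,
    # because the weights are shared.
    grad_filt = np.zeros((P, P))
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            grad_filt += grad_out[i, j] * image[i:i + P, j:j + P]
    return grad_filt

image = np.random.randn(6, 6)
filt = np.random.randn(3, 3)
out = conv2d_forward(image, filt)
grad_out = np.ones_like(out)             # pretend dLoss/dOut is 1 everywhere
grad_filt = conv2d_filter_grad(image, grad_out, 3)
filt -= 0.01 * grad_filt                 # gradient-descent update of the shared filter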
Also, you can find a nice demo for the convolutions here.
Implementing these things is certainly possible, but for starting out you could try TensorFlow for experimenting. At least that's the way I learn new concepts :)

Denoise EEG signal by using Daubechies function

I have an EEG signal that contains eye-blink artifacts. I have read some references and know that eye blinks can be detected and removed using the wavelet transform, but I don't know how to do it. How do I detect eye blinks? Are there any tutorials? After transforming the EEG signal into wavelet coefficients, what should I do, and which level of the Daubechies decomposition should be used? Thank you!
I don't know whether this will work but you can give it a try.
Wavelet transform works like a filter bank.
Set the wavelet decomposition level so that the last level of the decomposition gives you a band of roughly 0-5 Hz.
Get the detail coefficients at this level, apply soft or hard thresholding to them, and then reconstruct the signal from the new coefficients.
Blinks have a relatively high amplitude and thresholding on them might give you what you want.
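A possible sketch of that decompose-threshold-reconstruct idea using PyWavelets; the sampling rate, wavelet ('db4'), level, and threshold rule are assumptions for illustration only.

import numpy as np
import pywt

fs = 256                                 # assumed sampling rate in Hz
eeg = np.random.randn(10 * fs)           # placeholder for a real EEG channel

# Each decomposition level halves the frequency band. With fs = 256 Hz and
# level = 6, the coarsest detail band (cD6) covers roughly 2-4 Hz and the
# approximation (cA6) covers 0-2 Hz, i.e. around the blink range.
level = 6
coeffs = pywt.wavedec(eeg, 'db4', level=level)   # [cA6, cD6, cD5, ..., cD1]

# Threshold the coarsest detail coefficients, where blink energy dominates.
thr = 2.0 * np.std(coeffs[1])            # crude threshold choice
coeffs[1] = pywt.threshold(coeffs[1], thr, mode='soft')

denoised = pywt.waverec(coeffs, 'db4')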
If you want to remove eye blinks, a commonly used approach is running Independent Component Analysis (ICA) on the data, identifying the blink artifact and backtransforming to the original data without that independent component. There are other approaches, but ICA works quite well even in very noisy EEG data (e.g. from simultaneous EEG-fMRI sessions).
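Not part of the original answer, but a sketch of the ICA route with MNE-Python, assuming the recording is stored in a FIF file and that an EOG channel is available so the blink component can be found automatically:

import mne
from mne.preprocessing import ICA

# Hypothetical file name, for illustration only.
raw = mne.io.read_raw_fif('eeg_recording_raw.fif', preload=True)
raw.filter(l_freq=1.0, h_freq=None)      # ICA tends to work better on high-pass filtered data

ica = ICA(n_components=20, random_state=42)
ica.fit(raw)

# Find components that correlate with the EOG channel, i.e. blink components.
eog_indices, eog_scores = ica.find_bads_eog(raw)
ica.exclude = eog_indices

# Back-project to the sensor data without the blink component(s).
cleaned = ica.apply(raw.copy())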
Eye blinks generally have a frequency between 2 and 5 Hz.
You can first train a system to capture eye blinks,
then use it to detect the blinks in an EEG signal.
