While trying to understand the deep learning flow, I can't figure out one detail: when I have computed an error for every neuron (during backpropagation), what should I do next with all those errors? Calibrating the model means adjusting the weights and biases, but none of the courses I've taken explain how the computed errors actually influence the weights.
And the other issue: if I perform backpropagation for EACH training example ... again, how does each neuron's error influence the weights of the whole net (which are shared across all training examples)?
I will be grateful for your help.
> While trying to understand the deep learning flow, I can't figure out one detail: when I have computed an error for every neuron (during backpropagation), what should I do next with all those errors?
This isn't exactly right: you have a loss function for your output, which is usually compared against some label. A classic example is mean squared error (MSE).
You have your error function, and you are essentially adjusting the weights of the different neurons according to the output error. If you take the derivative/gradient of the error function and break it down with the chain rule, you can see how to adjust each neuron. Look at this link for a detailed explanation: https://brilliant.org/wiki/backpropagation/.
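To make the chain rule concrete, here is a minimal sketch (my own illustration, not taken from the linked article) of one gradient-descent update for a single sigmoid neuron under an MSE loss; all names and values are made up for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sigmoid neuron: prediction = sigmoid(w * x + b), loss = (prediction - y)^2
x, y = 2.0, 1.0          # one training example (input, label)
w, b = 0.5, 0.0          # current weight and bias
lr = 0.1                 # learning rate

z = w * x + b
pred = sigmoid(z)

# Chain rule: dL/dw = dL/dpred * dpred/dz * dz/dw
dL_dpred = 2.0 * (pred - y)          # derivative of the MSE loss
dpred_dz = pred * (1.0 - pred)       # derivative of the sigmoid
dz_dw, dz_db = x, 1.0

w -= lr * dL_dpred * dpred_dz * dz_dw   # gradient descent step on the weight
b -= lr * dL_dpred * dpred_dz * dz_db   # and on the bias
```

For a full network, the same chain-rule product is extended backwards through each layer, which is exactly what the linked article derives.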
> And the other issue: if I perform backpropagation for EACH training example ... again, how does each neuron's error influence the weights of the whole net (which are shared across all training examples)?
Usually backpropagation is performed per batch of examples. I recommend reading up on stochastic gradient descent to understand why adjusting the neuron weights per batch is still valid for the whole training set.
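As a hedged illustration of what "per batch" means in practice, here is a small NumPy sketch of mini-batch SGD on a toy linear model: the gradients of all examples in a batch are averaged, and one update is applied per batch rather than per example. The dataset and hyperparameters are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # toy dataset
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32

for epoch in range(20):
    idx = rng.permutation(len(X))         # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        err = X[batch] @ w - y[batch]
        grad = X[batch].T @ err / len(batch)   # gradient averaged over the batch
        w -= lr * grad                          # one update per batch, not per example
```

Because each batch is a random sample of the data, the averaged gradient is a noisy but unbiased estimate of the full-dataset gradient, which is why the per-batch updates still drive the whole training set's error down.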
I am trying to implement a neural network for classification with 5 hidden layers and softmax cross-entropy in the output layer. The implementation is in Java.
For optimization, I have used mini-batch gradient descent (batch size = 100, learning rate = 0.01).
However, after a couple of iterations, the weights become NaN and the predicted values turn out to be the same for every test case.
I am unable to debug the source of this error.
Here is the GitHub link to the code (with the test/training files):
https://github.com/ahana204/NeuralNetworks
In my case, I forgot to normalize the training data (by subtracting the mean). This was causing the denominator of my softmax equation to be 0. Hope this helps.
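On the softmax point, a standard remedy worth knowing (independent of normalizing the inputs) is the numerically stable softmax, which subtracts the maximum logit before exponentiating; the result is mathematically unchanged, but exp() stays in a safe range so the denominator can't underflow to 0. The linked code is Java, so this Python sketch is only illustrative:

```python
import numpy as np

def stable_softmax(logits):
    # Subtracting the max leaves softmax unchanged (it cancels in the ratio)
    # but keeps exp() from overflowing or underflowing.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

# A naive exp() would overflow on these logits; the stable version is fine.
print(stable_softmax(np.array([1000.0, 1001.0, 1002.0])))
```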
Assuming the code you implemented is correct, one reason would be a large learning rate. If the learning rate is too large, the weights may not converge and can blow up to values that show up as NaN. Try lowering the learning rate to see if anything changes.
I want to implement a simple feed-forward neural network to approximate the function y = f(x) = ax^2, where a is some constant and x is the input value.
The NN has one input node, one hidden layer with 1-n nodes, and one output node. For example, I input the value 2.0 -> the NN produces 4.0, and again I input 3.0 -> the NN produces 9.0 or close to it and so on.
If I understand "online training" correctly, the training data is fed in one example at a time - meaning I input the value 2.0 -> I iterate with gradient descent 100 times, then I pass the value 3.0 and iterate another 100 times.
However, when I try to do this with my experimental/learning NN - I input the value 2.0 -> the error gets very small -> the output is very close to 4.0.
Now if I want to predict for the input 3.0 -> the NN produces 4.36 or something instead of 9.0. So the NN just learns the last training value.
How can I use online-training to get a Neural Network that approximates the desired function for a range [-d, d]? What am I missing?
The reason I like online training is that eventually I want to input a time series and map that series to the desired function. This is beside the point, but in case someone was wondering.
Any advice would be greatly appreciated.
More info - I am activating the hidden layer with the sigmoid function and the output layer with a linear one.
> The reason I like online training is that eventually I want to input a time series and map that series to the desired function.
Recurrent Neural Networks (RNNs) are the state of the art for modeling time series. This is because they can take inputs of arbitrary length, and they can also use internal state to model the changing behavior of the series over time.
Training feedforward neural networks on time series is an older method which will generally not perform as well. They require a fixed-size input, so you must choose a fixed-size sliding time window, and they also don't preserve state, so it is hard to learn a time-varying function.
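To make the fixed-size sliding window concrete, here is a small sketch (my own illustration) of how a series would be cut into fixed-size input/target pairs for a feedforward net:

```python
import numpy as np

def make_windows(series, window):
    # Each input is `window` consecutive values; the target is the next value.
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array([series[i + window] for i in range(len(series) - window)])
    return X, y

series = np.sin(np.linspace(0, 10, 200))
X, y = make_windows(series, window=5)   # X.shape == (195, 5), y.shape == (195,)
```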
I can find very little about "online training" of feedforward neural nets with stochastic gradient descent to model non-stationary behavior except for a couple of very vague references. I don't think this provides any benefit besides allowing you to train in real time when you are getting a stream of data one at a time. I don't think it will actually help you model time-dependent behavior.
Most of the older methods I can find in the literature about online learning for neural networks use a hybrid approach with a neural network and some other method that can help capture time dependencies. Again, these should all be inferior to RNNs, not to mention harder to implement in practice.
Furthermore, I don't think you are implementing online training correctly. It should be stochastic gradient descent with a mini-batch size of 1. Therefore, you only run one iteration of gradient descent on each training example per training epoch. Since you are running 100 iterations before moving on to the next training example, you are going too far down the error gradient with respect to that single example, resulting in serious overfitting to a single data point. This is why you get poor results on the next input. I don't think this is a justifiable method of training, nor do I think it will work for time series.
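To make the contrast concrete, here is a sketch of online training done the usual way for the questioner's setup as I understand it (one input, a sigmoid hidden layer, a linear output, MSE loss): one gradient step per example, with examples drawn from the whole range [-d, d] rather than 100 steps on a single point. The hyperparameters are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
a, d, n_hidden, lr = 1.0, 3.0, 10, 0.01

W1 = rng.normal(scale=0.5, size=(n_hidden, 1)); b1 = np.zeros((n_hidden, 1))
W2 = rng.normal(scale=0.5, size=(1, n_hidden)); b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(50000):
    x = rng.uniform(-d, d)          # draw from the whole range, not one point
    y = a * x * x
    # forward pass: sigmoid hidden layer, linear output
    h = sigmoid(W1 * x + b1)        # shape (n_hidden, 1)
    pred = W2 @ h + b2              # shape (1, 1)
    # backward pass, then exactly ONE gradient step (mini-batch of size 1)
    d_out = pred - y                # gradient of 0.5 * (pred - y)^2
    dW2 = d_out * h.T
    db2 = d_out
    d_h = (W2.T * d_out) * h * (1.0 - h)
    dW1 = d_h * x
    db1 = d_h
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```

The key difference from the questioner's procedure is that no single example ever receives 100 consecutive updates, so the network is never pulled all the way onto one data point.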
You haven't mentioned what your activations are or your loss function is, so I can't comment on whether those are appropriate for the task.
Also, I don't think learning y=ax^2 is a good analogy for time series prediction. It is a static function that always gives the same output for a given input, regardless of the index of the input or the value of previous inputs.
I have previously worked with shallow (one- or two-layer) neural networks, so I understand how they work, and it is quite easy to visualize the derivations for the forward and backward passes during training. Currently I am studying deep neural networks (more precisely, CNNs). I have read lots of articles about how they are trained, but I am still unable to see the big picture of training a CNN: in some cases people use pre-trained layers where the convolution weights are extracted using autoencoders, while in other cases random weights are used for the convolutions and then trained with backpropagation. Can anyone give me the full picture of the training process, from the input to the fully connected layer (forward pass) and from the fully connected layer back to the input layer (backward pass)?
Thank you.
I'd like to recommend a very good explanation of how to train a multilayer neural network using backpropagation. The tutorial is the fifth post of a very detailed series on how backpropagation works, and it also has Python examples of different types of neural nets, so you can fully understand what's going on.
As a summary of Peter Roelants' tutorial, I'll try to explain a little bit what backpropagation is.
As you have already said, there are two ways to initialize a deep NN: with random weights or with pre-trained weights. In the case of random weights in a supervised learning scenario, backpropagation works as follows:
Initialize your network parameters randomly.
Feed forward a batch of labeled examples.
Compute the error (given by your loss function) between the desired output and the actual one.
Compute the partial derivative of the output error w.r.t each parameter.
These derivatives are the gradients of the error w.r.t. the network's parameters. In other words, they tell you how to change the value of the weights in order to get the desired output instead of the produced one.
Update the weights according to those gradients and the desired learning rate.
Perform another forward pass with different training examples, and repeat the previous steps until the error stops decreasing (see the sketch below).
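The steps above map almost line-for-line onto code. Here is a minimal sketch (mine, not taken from the linked tutorial) of a one-hidden-layer network trained with batch gradient descent on a toy XOR problem:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR inputs
Y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR labels

# Initialize the network parameters randomly.
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
lr = 0.5

for epoch in range(5000):
    # Feed forward a batch of labeled examples.
    H = np.tanh(X @ W1 + b1)                 # hidden activations, shape (4, 4)
    pred = H @ W2 + b2                       # linear output, shape (4, 1)
    # Compute the error between the desired output and the actual one.
    err = pred - Y                           # gradient of 0.5 * (pred - Y)^2
    # Compute the partial derivatives of the error w.r.t. each parameter.
    dW2 = H.T @ err / len(X); db2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1.0 - H ** 2)       # chain rule back through tanh
    dW1 = X.T @ dH / len(X); db1 = dH.mean(axis=0)
    # Update the weights according to the gradients and the learning rate.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
    # Repeat the forward/backward passes until the error stops decreasing.
```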
Starting with random weights is not a problem for the backpropagation algorithm; given enough training data and iterations, it will tune the weights until they work for the given task.
I really encourage you to follow the full tutorial I linked, because you'll get a very detailed view of how and why backpropagation works for multi-layered neural networks.
I'm currently wondering when to stop training of Deep Autoencoders, especially when it seems to be stuck in a local minimum.
Is it essential to get the training criterion (e.g. MSE) down to, say, 0.000001 and force the network to reconstruct the input perfectly, or is it okay to tolerate some error (e.g. stop when the MSE is at about 0.5), depending on the dataset used?
I know that a better reconstruction might lead to better classification results afterwards, but is there a "rule of thumb" for when to stop? I'm especially interested in rules that have no heuristic character, unlike "stop if the MSE doesn't get smaller in x iterations".
I don't think it's possible to derive a general rule of thumb for this, as building NNs / doing machine learning is a very problem-specific procedure and, generally, there is no free lunch. How to decide what is a "good" training error to terminate at depends on various problem-specific factors, e.g. the noise in the data. Evaluating your NN only on the training set, with the sole objective of minimising the MSE, will often lead to overfitting: with only the training error as feedback, you may end up tuning your NN to the noise in the training data. One method to avoid this is holdout validation. Instead of training your NN on all of your data, you divide your data set into a training set, a validation set (and a test set).
Training set: used for training and as feedback to the NN; its error will naturally keep decreasing with longer training (at least down to "OK" MSE values for the specific problem).
Validation set: evaluate your NN with respect to it, but don't feed the result back to your NN / genetic algorithm.
Along with the evaluation feedback from your training set, you should hence also evaluate the validation set, but without giving feedback to your neural network (NN).
Track the decrease in MSE for the training as well as the validation set; generally the training error will steadily decrease, whereas, at some point, the validation error will reach a minimum and start to increase with further training. Of course, you cannot know during runtime where this minimum occurs, so one generally stores the NN with the lowest validation error so far, and once this has not been updated for some time (i.e., in retrospect: we've passed a minimum of the validation error), the algorithm is terminated.
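In skeleton form, the store-the-best-network scheme described above might look like the following; `train_one_epoch` and `validation_error` are hypothetical stand-ins for whatever your framework provides:

```python
import copy
import random

def train_one_epoch(params, training_set):
    return params                 # hypothetical stand-in for your real training code

def validation_error(params, validation_set):
    return random.random()        # hypothetical stand-in; random only makes the skeleton run

params, training_set, validation_set = {}, None, None
max_epochs, patience = 1000, 20   # patience: stop after this many epochs without a new optimum
best_val_err, best_params, epochs_since_best = float("inf"), None, 0

for epoch in range(max_epochs):
    params = train_one_epoch(params, training_set)      # feedback from the training set only
    val_err = validation_error(params, validation_set)  # evaluated, but gives no feedback
    if val_err < best_val_err:                          # new validation minimum: keep a copy
        best_val_err, best_params = val_err, copy.deepcopy(params)
        epochs_since_best = 0
    else:
        epochs_since_best += 1
    if epochs_since_best >= patience:   # in retrospect: we passed a validation minimum
        break
# best_params now holds the network with the lowest validation error seen
```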
See e.g. the following article Neural Network: Train-validate-Test Stopping for details, as well as this SE-statistics thread discussing two different validation methods.
For the training/validation of Deep Autoencoders/Deep Learning, specifically w.r.t. overfitting, I find the article Dropout: A Simple Way to Prevent Neural Networks from Overfitting (*) to be valuable.
(*) By N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, University of Toronto.
In an on-line implementation of a Backpropagation ANN, how would you determine the stopping criteria?
The way I have been doing it (which I am sure is incorrect) is to average the error of each output node and then average this error over each epoch.
Is this an incorrect method? Is there a standard way of stopping an on-line implementation?
You should always consider the error (e.g. root mean squared error) on a validation set which is disjoint from your training set. If you train too long, your neural network will begin to overfit. This means that the error on your training set will become minimal or even 0, but the error on general data will become worse.
To end up with the model parameters that yielded the best generalization performance, you should copy and save your model parameters whenever the error on your validation set reaches a new minimum. If performance is a concern, you can do this check only every N steps.
In an on-line learning setup, you train with single training samples or mini-batches of a small number of training samples. You can consider the successive training on all samples/mini-batches that cover your total data as one training epoch.
There are several possibilities for defining a so-called early stopping criterion. E.g., you could consider the best-so-far RMS error on your validation set after each full epoch and stop as soon as there has been no new optimum for M epochs. Depending on the complexity of your problem, you must choose M high enough. You can also start with a rather small M and, whenever you reach a new optimum, set M to the number of epochs you needed to reach it. It depends on whether it is more important to converge quickly or to be as thorough as possible.
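The adaptive-M variant from the last sentences could be sketched as follows; `train_epoch_and_get_val_rmse` is a hypothetical stand-in for your training/evaluation code, and the `max()` is a small safeguard of mine so that the patience never shrinks:

```python
import random

def train_epoch_and_get_val_rmse():
    # Hypothetical stand-in: train one epoch, return the validation RMSE.
    # random only makes the skeleton executable.
    return random.random()

M = 5                        # initial patience, in epochs
best_err = float("inf")
epochs_since_best = 0

while epochs_since_best < M:
    err = train_epoch_and_get_val_rmse()
    epochs_since_best += 1
    if err < best_err:                     # new best-so-far validation error
        best_err = err
        M = max(M, epochs_since_best)      # grow M to how long this optimum took to find
        epochs_since_best = 0
```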
You will always encounter situations where your validation and/or training error gets bigger temporarily, because the learning algorithm is hill-climbing. This means it traverses regions of the error surface which render bad performance but must be passed to reach a new, better optimum. If you simply stop as soon as your validation or training error gets worse between two subsequent steps, you will end up in suboptimal solutions prematurely.