How does pre-training improve classification in neural networks?

How does pre-training improve classification in neural networks? - machine-learning

Many of the papers I have read so far have this mentioned "pre-training network could improve computational efficiency in terms of back-propagating errors", and could be achieved using RBMs or Autoencoders.
If I have understood correctly, AutoEncoders work by learning the
identity function, and if it has hidden units less than the size of
input data, then it also does compression, BUT what does this even have
anything to do with improving computational efficiency in propagating
error signal backwards? Is it because the weights of the pre
trained hidden units does not diverge much from its initial values?
Assuming data scientists who are reading this would by theirselves
know already that AutoEncoders take inputs as target values since
they are learning identity function, which is regarded as
unsupervised learning, but can such method be applied to
Convolutional Neural Networks for which the first hidden layer is
feature map? Each feature map is created by convolving a learned
kernel with a receptive field in the image. This learned kernel, how
could this be obtained by pre-training (unsupervised fashion)?

One thing to note is that autoencoders try to learn the non-trivial identify function, not the identify function itself. Otherwise they wouldn't have been useful at all. Well the pre-training helps moving the weight vectors towards a good starting point on the error surface. Then the backpropagation algorithm, which is basically doing gradient descent, is used improve upon those weights. Note that gradient descent gets stuck in the closes local minima.
[Ignore the term Global Minima in the image posted and think of it as another, better, local minima]
Intuitively speaking, suppose you are looking for an optimal path to get from origin A to destination B. Having a map with no routes shown on it (the errors you obtain at the last layer of the neural network model) kind of tells you where to to go. But you may put yourself in a route which has a lot of obstacles, up hills and down hills. Then suppose someone tells you about a route a a direction he has gone through before (the pre-training) and hands you a new map (the pre=training phase's starting point).
This could be an intuitive reason on why starting with random weights and immediately start to optimize the model with backpropagation may not necessarily help you achieve the performance you obtain with a pre-trained model. However, note that many models achieving state-of-the-art results do not use pre-training necessarily and they may use the backpropagation in combination with other optimization methods (e.g. adagrad, RMSProp, Momentum and ...) to hopefully avoid getting stuck in a bad local minima.
Here's the source for the second image.

I don't know a lot about autoencoder theory, but I've done a bit of work with RBMs. What RBMs do is they predict what the probability is of seeing the specific type of data in order to get the weights initialized to the right ball park- it is considered an (unsupervised) probabilistic model, so you don't correct using the known labels. Basically, the idea here is that having a learning rate that is too big will never lead to convergence but having one that is too small will take forever to train. Thus, by "pretraining" in this way you find out the ball park of the weights and then can set the learning rate to be small in order to get them down to the optimal values.
As for the second question, no, you don't generally prelearn kernels, at least not in an unsupervised fashion. I suspect that what is meant by pretraining here is a bit different than in your first question- this is to say, that what is happening is that they are taking a pretrained model (say from model zoo) and fine tuning it with a new set of data.
Which model you use generally depends on the type of data you have and the task at hand. Convnets I've found to train faster and efficiently, but not all data has meaning when convolved, in which case dbns may be the way to go. Unless say, you have a small amount of data then I'd use something other than neural networks entirely.
Anyways, I hope this helps clear some of your questions.

Related

In what circumstances might using biases in a neural network not be beneficial?

I am currently looking through Michael Nielsen's ebook Neural Networks and Deep Learning and have run the code found at the end of chapter 1 which trains a neural network to recognize hand-written digits (with a slight modification to make the backpropagation algorithm over a mini-batch matrix-based).
However, having run this code and achieving a classification accuracy of just under 94%, I decided to remove the use of biases from the network. After re-training the modified network, I found no difference in classification accuracy!
NB: The output layer of this network contains ten neurons; if the ith of these neurons has the highest activation then the input is classified as being the digit i.
This got me wondering why it is necessary to use biases in a neural network, rather than just weights, and what differentiates between a task where biases will improve the performance of a network and a task where they will not?
My code can be found here: https://github.com/pipthagoras/neural-network-1

Biases are used to account for the fact that your underlying data might not be centered. It is clearer to see in the case of a linear regression.
If you do a regression without an intercept (or bias), you are forcing the underlying model to pass through the origin, which will result in a poor model if the underlying data is not centered (for example if the true generating process is Y=3000). If, on the other hand, your data is centered or close to centered, then eliminating bias is good, since you won't introduce a term that is, in fact, independent to your predictive variable (it's like selecting a simpler model, which will tend to generalize better PROVIDED that it actually reflects the underlying data).

In Lay man's terms what's the difference between a LossFunction and an OptimizationAlgorithm?

I get the part that training a network is all about finding the right weights with Optimization Algorithms deciding how weights are updated until the one needed to get the right prediction is come about.
So the million dollar que$tion$ to the main one are:
(1.) If optimization algorithms updates the weights what do loss functions do to the weights of the network?
(2.) Are loss functions only specific to the output layer of a neural network? (most examples I see with the deeplearning4j framework implement it at the output layer).
P.S: I really want to understand the basic difference between this two in the simplest way possible. I am not looking for anything complex or with some mathematical explosions.

The optimization algorithm tries to find the minimum of the loss function. At which points the weights are ideal.

Things to try when Neural Network not Converging

One of the most popular questions regarding Neural Networks seem to be:
Help!! My Neural Network is not converging!!
See here, here, here, here and here.
So after eliminating any error in implementation of the network, What are the most common things one should try??
I know that the things to try would vary widely depending on network architecture.
But tweaking which parameters (learning rate, momentum, initial weights, etc) and implementing what new features (windowed momentum?) were you able to overcome some similar problems while building your own neural net?
Please give answers which are language agnostic if possible. This question is intended to give some pointers to people stuck with neural nets which are not converging..

If you are using ReLU activations, you may have a "dying ReLU" problem. In short, under certain conditions, any neuron with a ReLU activation can be subject to a (bias) adjustment that leads to it never being activated ever again. It can be fixed with a "Leaky ReLU" activation, well explained in that article.
For example, I produced a simple MLP (3-layer) network with ReLU output which failed. I provided data it could not possibly fail on, and it still failed. I turned the learning rate way down, and it failed more slowly. It always converged to predicting each class with equal probability. It was all fixed by using a Leaky ReLU instead of standard ReLU.

If we are talking about classification tasks, then you should shuffle examples before training your net. I mean, don't feed your net with thousands examples of class #1, after thousands examples of class #2, etc... If you do that, your net most probably wouldn't converge, but would tend to predict last trained class.

I had faced this problem while implementing my own back prop neural network. I tried the following:
Implemented momentum (and kept the value at 0.5)
Kept the learning rate at 0.1
Charted the error, weights, input as well as output of each and every neuron, Seeing the data as a graph is more helpful in figuring out what is going wrong
Tried out different activation function (all sigmoid). But this did not help me much.
Initialized all weights to random values between -0.5 and 0.5 (My network's output was in the range -1 and 1)
I did not try this but Gradient Checking can be helpful as well

If the problem is only convergence (not the actual "well trained network", which is way to broad problem for SO) then the only thing that can be the problem once the code is ok is the training method parameters. If one use naive backpropagation, then these parameters are learning rate and momentum. Nothing else matters, as for any initialization, and any architecture, correctly implemented neural network should converge for a good choice of these two parameters (in fact, for momentum=0 it should converge to some solution too, for a small enough learning rate).
In particular - there is a good heuristic approach called "resillient backprop" which is in fact parameterless appraoch, which should (almost) always converge (assuming correct implementation).

after you've tried different meta parameters (optimization / architecture), the most probable place to look at is - THE DATA
as for myself - to minimize fiddling with meta parameters, i keep my optimizer automated - Adam is by opt-of-choice.
there are some rules of thumb regarding application vs architecture... but its really best to crunch those on your own.
to the point:
in my experience, after you've debugged the net (the easy debugging), and still don't converge or get to an undesired local minima, the usual suspect is the data.
weather you have contradictory samples or just incorrect ones (outliers), a small amount can make the difference from say 0.6-acc to (after cleaning) 0.9-acc..
a smaller but golden (clean) dataset is much better than a big slightly dirty one...
with augmentation you can tweak results even further.

scaling inputs data to neural network

Do we have to scale input data for neural network? How does it affect the final solution of neural network?
I've tried to find some reliable sources on that. The book "elements of statistical learning" (page 400) says it will help choosing reasonable initial random weights to start with.
Aren't the final weights deterministic regardless of the initial random weights we use?
Thank you.

Firstly, there are many types of ANNs, I will assume you are talking about the simplest one - multilayer perceptron with backpropagation.
Secondly, in your question you are mixing up data scaling (normalization) and weight initialization.
You need to randomly initialize weights to avoid symmetry while learning (if all weights are initially the same, their update will also be the same). In general, concrete values don't matter, but too large values can cause slower convergence.
You are not required to normalize your data, but normalization can make learning process faster. See this question for more details.

Random Perturbation of Data to get Training Data for Neural Networks

I am working on Soil Spectral Classification using neural networks and I have data from my Professor obtained from his lab which consists of spectral reflectance from wavelength 1200 nm to 2400 nm. He only has 270 samples.
I have been unable to train the network for accuracy more than 74% since the training data is very less (only 270 samples). I was concerned that my Matlab code is not correct, but when I used the Neural Net Toolbox in Matlab, I got the same results...nothing more than 75% accuracy.
When I talked to my Professor about it, he said that he does not have any more data, but asked me to do random perturbation on this data to obtain more data. I have research online about random perturbation of data, but have come up short.
Can someone point me in the right direction for performing random perturbation on 270 samples of data so that I can get more data?
Also, since by doing this, I will be constructing 'fake' data, I don't see how the neural network would be any better cos isn't the point of neural nets using actual real valid data to train the network?
Thanks,
Faisal.

I think trying to fabricate more data is a bad idea: you can't create anything with higher information content than you already have, unless you know the true distribution of the data to sample from. If you did, however, you'd be able to classify with the Bayes optimal error rate, which would be impossible to beat.
What I'd be looking at instead is whether you can alter the parameters of your neural net to improve performance. The thing that immediately springs to mind with small amounts of training data is your weight regulariser (are you even using regularised weights), which can be seen as a prior on the weights if you're that way inclined. I'd also look at altering the activation functions if you're using simple linear activations, and the number of hidden nodes in addition (with so few examples, I'd use very few, or even bypass the hidden layer entirely since it's hard to learn nonlinear interactions with limited data).
While I'd not normally recommend it, you should probably use cross-validation to set these hyper-parameters given the limited size, as you're going to get unhelpful insight from a 10-20% test set size. You might hold out 10-20% for final testing, however, so as to not bias the results in your favour.

First, some general advice:
Normalize each input and output variable to [0.0, 1.0]
When using a feedforward MLP, try to use 2 or more hidden layers
Make sure your number of neurons per hidden layer is big enough, so the network is able to tackle the complexity of your data
It should always be possible to get to 100% accuracy on a training set if the complexity of your model is sufficient. But be careful, 100% training set accuracy does not necessarily mean that your model does perform well on unseen data (generalization performance).
Random perturbation of your data can improve generalization performance, if the perturbation you are adding occurs in practice (or at least similar perturbation). This works because this means teaching your network on how the data could look different but still belong to the given labels.
In the case of image classification, you could rotate, scale, noise, etc. the input image (the output stays the same, naturally). You will need to figure out what kind of perturbation could apply to your data. For some problems this is difficult or does not yield any improvement, so you need to try it out. If this does not work, it does not necessarily mean your implementation or data are broken.

The easiest way to add random noise to your data would be to apply gaussian noise.
I suppose your measures have errors associated with them (a measure without errors has almost no meaning). For each measured value M+-DeltaM you can generate a new number with N(M,DeltaM), where n is the normal distribution.
This will add new points as experimental noise from previous ones, and will add help take into account exprimental errors in the measures for the classification. I'm not sure however if it's possible to know in advance how helpful this will be !

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart