Why does random initialization of weights in neural network work? - machine-learning

The question of why the weights of a neural network cannot be initialized as 0's has been asked plenty of times. The answer is straightforward: zero initial weights would result in all nodes in a layer learning the same thing, hence the symmetry has to be broken.
However, what I failed to comprehend is that, why would initializing the weights to some random numbers close to zero work. Even more advanced initialization techniques such as Xavier modify only the variance, which remains close to zero. Some answers in the linked question point to the existence of multiple local optima, but I seriously doubt the validity of this argument because of the followings:
The (usual) cost function of an individual logistic regression has a unique minimum. Nonetheless this insight may not generalizable to more than one node, so let's forget it for now.
Assume for the sake of argument that multiple local optima exist. Then shouldn't the proper randomization technique be Monte-Carlo-ish-ly over the entire domain of possible weights, rather than some random epsilons about zero? What's stopping the weights from converging again after a couple of iterations? The only rationale I can think of is that there exists a global maximum at the origin and all local optima are nicely spread 'radially' so that a tiny perturbation in any direction is sufficient to move you down the gradient towards different local optima, which is highly improbable.
PS1: I am asking the question here in the main Stack Overflow site because my reference is here.
PS2: The answer to why the variance of the initial weights are scaled this way can be found here. However, it did not address my question of why random initialization would work at all due to the possibility of converging weights, or rather, the weights would 'diverge' to 'learn' different features.

You've hit the main reason: we need the kernels to differ so that the kernels (nodes) differentiate their learning.
First of all, random initialization doesn't always work; depending on how closely you've tuned your model structure and hyper-parameters, sometimes the model fails to converge; this is obvious from the loss function in the early iterations.
For some applications, there are local minima. However, in practical use, the happy outgrowth of problem complexity is that those minima have very similar accuracy. In short, it doesn't matter which solution we find, so long as we find one. For instance, in image classification (e.g. the ImageNet contest), there are many features useful in identifying photos. As with (simpler) PCA, when we have a sets of features that correlate highly with the desired output and with each other, it doesn't matter which set we use. Those features are cognate to the kernels of a CNN.

Related

Regarding the consistency of convolutional neural networks

I am currently building a 2-channel (also called double-channel) convolutional neural network in order to classify 2 binary images (containing binary objects) as 'similar' or 'different'.
The problem I am having is that it seems as though the network doesn't always converge to the same solution. For example, I can use exactly the same ordering of training pairs and all the same parameters and so forth, and when I run the network multiple times, each time produces a different solution; sometimes converging to below 2% error rates, and other times I get 50% error rates.
I have a feeling that it has something to do with the random initialization of the weights of the network, which results in different optimization paths each time the network is executed. This issue even occurs when I use SGD with momentum, so I don't really know how to 'force' the network to converge to the same solution (global optima) every time?
Can this have something to do with the fact that I am using binary images instead of grey-scale or color images, or is there something intrinsic to neural networks that is causing this issue?
There are several sources of randomness in training.
Initialization is one. SGD itself is of course stochastic since the content of the minibatches is often random. Sometimes, layers like dropout are inherently random too. The only way to ensure getting identical results is to fix the random seed for all of them.
Given all these sources of randomness and a model with many millions of parameters, your quote
"I don't really know how to 'force' the network to converge to the same solution (global optima) every time?"
is something pretty much something anyone should say - no one knows how to find the same solution every time, or even a local optima, let alone the global optima.
Nevertheless, ideally, it is desirable to have the network perform similarly across training attempts (with fixed hyper-parameters and dataset). Anything else is going to cause problems in reproducibility, of course.
Unfortunately, I suspect the problem is inherent to CNNs.
You may be aware of the bias-variance tradeoff. For a powerful model like a CNN, the bias is likely to be low, but the variance very high. In other words, CNNs are sensitive to data noise, initialization, and hyper-parameters. Hence, it's not so surprising that training the same model multiple times yields very different results. (I also get this phenomenon, with performances changing between training runs by as much as 30% in one project I did.) My main suggestion to reduce this is stronger regularization.
Can this have something to do with the fact that I am using binary images instead of grey-scale or color images, or is there something intrinsic to neural networks that is causing this issue?
As I mentioned, this problem is present inherently for deep models to an extent. However, your use of binary images may also be a factor, since the space of the data itself is rather discontinuous. Perhaps consider "softening" the input (e.g. filtering the inputs) and using data augmentation. A similar approach is known to help in label smoothing, for example.

Different random weight initialization leading to different performances

I'm training a 3D U-Net on an EM dataset of a brain. The objective is to segment neurons in it. During the experiments, I've noticed, different random initialization of the network leads to different performances. I evaluate the performance based on mean Intersection over Union, and I observe differences as large as 5%.
I use xavier initialization with a uniform distribution and use a constant learning rate of 1e-4.
Edit: I'm asking how I can avoid a behavior like this?
As Roberto said, different random initializations always lead to different resulting weights. This is quite obvious, as the initial state constrains the possible system trajectories.
If you read the paper introducing Xavier init, you will see that it is well known that the random init has a large init influence on the resulting performance (there is a paper which showed that hundreds of training runs with random init all end up in different local minima, but I can't find it right now). This is the very reason why we use heuristics like Xavier init: they tend to lead to better local minima than other forms of random initialization.
Typically, one performs multiple training runs (e.g. 10) and takes the best architecture, in articles, the mean is also reported sometimes.

How does pre-training improve classification in neural networks?

Many of the papers I have read so far have this mentioned "pre-training network could improve computational efficiency in terms of back-propagating errors", and could be achieved using RBMs or Autoencoders.
If I have understood correctly, AutoEncoders work by learning the
identity function, and if it has hidden units less than the size of
input data, then it also does compression, BUT what does this even have
anything to do with improving computational efficiency in propagating
error signal backwards? Is it because the weights of the pre
trained hidden units does not diverge much from its initial values?
Assuming data scientists who are reading this would by theirselves
know already that AutoEncoders take inputs as target values since
they are learning identity function, which is regarded as
unsupervised learning, but can such method be applied to
Convolutional Neural Networks for which the first hidden layer is
feature map? Each feature map is created by convolving a learned
kernel with a receptive field in the image. This learned kernel, how
could this be obtained by pre-training (unsupervised fashion)?
One thing to note is that autoencoders try to learn the non-trivial identify function, not the identify function itself. Otherwise they wouldn't have been useful at all. Well the pre-training helps moving the weight vectors towards a good starting point on the error surface. Then the backpropagation algorithm, which is basically doing gradient descent, is used improve upon those weights. Note that gradient descent gets stuck in the closes local minima.
[Ignore the term Global Minima in the image posted and think of it as another, better, local minima]
Intuitively speaking, suppose you are looking for an optimal path to get from origin A to destination B. Having a map with no routes shown on it (the errors you obtain at the last layer of the neural network model) kind of tells you where to to go. But you may put yourself in a route which has a lot of obstacles, up hills and down hills. Then suppose someone tells you about a route a a direction he has gone through before (the pre-training) and hands you a new map (the pre=training phase's starting point).
This could be an intuitive reason on why starting with random weights and immediately start to optimize the model with backpropagation may not necessarily help you achieve the performance you obtain with a pre-trained model. However, note that many models achieving state-of-the-art results do not use pre-training necessarily and they may use the backpropagation in combination with other optimization methods (e.g. adagrad, RMSProp, Momentum and ...) to hopefully avoid getting stuck in a bad local minima.
Here's the source for the second image.
I don't know a lot about autoencoder theory, but I've done a bit of work with RBMs. What RBMs do is they predict what the probability is of seeing the specific type of data in order to get the weights initialized to the right ball park- it is considered an (unsupervised) probabilistic model, so you don't correct using the known labels. Basically, the idea here is that having a learning rate that is too big will never lead to convergence but having one that is too small will take forever to train. Thus, by "pretraining" in this way you find out the ball park of the weights and then can set the learning rate to be small in order to get them down to the optimal values.
As for the second question, no, you don't generally prelearn kernels, at least not in an unsupervised fashion. I suspect that what is meant by pretraining here is a bit different than in your first question- this is to say, that what is happening is that they are taking a pretrained model (say from model zoo) and fine tuning it with a new set of data.
Which model you use generally depends on the type of data you have and the task at hand. Convnets I've found to train faster and efficiently, but not all data has meaning when convolved, in which case dbns may be the way to go. Unless say, you have a small amount of data then I'd use something other than neural networks entirely.
Anyways, I hope this helps clear some of your questions.

Things to try when Neural Network not Converging

One of the most popular questions regarding Neural Networks seem to be:
Help!! My Neural Network is not converging!!
See here, here, here, here and here.
So after eliminating any error in implementation of the network, What are the most common things one should try??
I know that the things to try would vary widely depending on network architecture.
But tweaking which parameters (learning rate, momentum, initial weights, etc) and implementing what new features (windowed momentum?) were you able to overcome some similar problems while building your own neural net?
Please give answers which are language agnostic if possible. This question is intended to give some pointers to people stuck with neural nets which are not converging..
If you are using ReLU activations, you may have a "dying ReLU" problem. In short, under certain conditions, any neuron with a ReLU activation can be subject to a (bias) adjustment that leads to it never being activated ever again. It can be fixed with a "Leaky ReLU" activation, well explained in that article.
For example, I produced a simple MLP (3-layer) network with ReLU output which failed. I provided data it could not possibly fail on, and it still failed. I turned the learning rate way down, and it failed more slowly. It always converged to predicting each class with equal probability. It was all fixed by using a Leaky ReLU instead of standard ReLU.
If we are talking about classification tasks, then you should shuffle examples before training your net. I mean, don't feed your net with thousands examples of class #1, after thousands examples of class #2, etc... If you do that, your net most probably wouldn't converge, but would tend to predict last trained class.
I had faced this problem while implementing my own back prop neural network. I tried the following:
Implemented momentum (and kept the value at 0.5)
Kept the learning rate at 0.1
Charted the error, weights, input as well as output of each and every neuron, Seeing the data as a graph is more helpful in figuring out what is going wrong
Tried out different activation function (all sigmoid). But this did not help me much.
Initialized all weights to random values between -0.5 and 0.5 (My network's output was in the range -1 and 1)
I did not try this but Gradient Checking can be helpful as well
If the problem is only convergence (not the actual "well trained network", which is way to broad problem for SO) then the only thing that can be the problem once the code is ok is the training method parameters. If one use naive backpropagation, then these parameters are learning rate and momentum. Nothing else matters, as for any initialization, and any architecture, correctly implemented neural network should converge for a good choice of these two parameters (in fact, for momentum=0 it should converge to some solution too, for a small enough learning rate).
In particular - there is a good heuristic approach called "resillient backprop" which is in fact parameterless appraoch, which should (almost) always converge (assuming correct implementation).
after you've tried different meta parameters (optimization / architecture), the most probable place to look at is - THE DATA
as for myself - to minimize fiddling with meta parameters, i keep my optimizer automated - Adam is by opt-of-choice.
there are some rules of thumb regarding application vs architecture... but its really best to crunch those on your own.
to the point:
in my experience, after you've debugged the net (the easy debugging), and still don't converge or get to an undesired local minima, the usual suspect is the data.
weather you have contradictory samples or just incorrect ones (outliers), a small amount can make the difference from say 0.6-acc to (after cleaning) 0.9-acc..
a smaller but golden (clean) dataset is much better than a big slightly dirty one...
with augmentation you can tweak results even further.

Why do we have to normalize the input for an artificial neural network? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
Why do we have to normalize the input for a neural network?
I understand that sometimes, when for example the input values are non-numerical a certain transformation must be performed, but when we have a numerical input? Why the numbers must be in a certain interval?
What will happen if the data is not normalized?
It's explained well here.
If the input variables are combined linearly, as in an MLP [multilayer perceptron], then it is
rarely strictly necessary to standardize the inputs, at least in theory. The
reason is that any rescaling of an input vector can be effectively undone by
changing the corresponding weights and biases, leaving you with the exact
same outputs as you had before. However, there are a variety of practical
reasons why standardizing the inputs can make training faster and reduce the
chances of getting stuck in local optima. Also, weight decay and Bayesian
estimation can be done more conveniently with standardized inputs.
In neural networks, it is good idea not just to normalize data but also to scale them. This is intended for faster approaching to global minima at error surface. See the following pictures:
Pictures are taken from the coursera course about neural networks. Author of the course is Geoffrey Hinton.
Some inputs to NN might not have a 'naturally defined' range of values. For example, the average value might be slowly, but continuously increasing over time (for example a number of records in the database).
In such case feeding this raw value into your network will not work very well. You will teach your network on values from lower part of range, while the actual inputs will be from the higher part of this range (and quite possibly above range, that the network has learned to work with).
You should normalize this value. You could for example tell the network by how much the value has changed since the previous input. This increment usually can be defined with high probability in a specific range, which makes it a good input for network.
There are 2 Reasons why we have to Normalize Input Features before Feeding them to Neural Network:
Reason 1: If a Feature in the Dataset is big in scale compared to others then this big scaled feature becomes dominating and as a result of that, Predictions of the Neural Network will not be Accurate.
Example: In case of Employee Data, if we consider Age and Salary, Age will be a Two Digit Number while Salary can be 7 or 8 Digit (1 Million, etc..). In that Case, Salary will Dominate the Prediction of the Neural Network. But if we Normalize those Features, Values of both the Features will lie in the Range from (0 to 1).
Reason 2: Front Propagation of Neural Networks involves the Dot Product of Weights with Input Features. So, if the Values are very high (for Image and Non-Image Data), Calculation of Output takes a lot of Computation Time as well as Memory. Same is the case during Back Propagation. Consequently, Model Converges slowly, if the Inputs are not Normalized.
Example: If we perform Image Classification, Size of Image will be very huge, as the Value of each Pixel ranges from 0 to 255. Normalization in this case is very important.
Mentioned below are the instances where Normalization is very important:
K-Means
K-Nearest-Neighbours
Principal Component Analysis (PCA)
Gradient Descent
When you use unnormalized input features, the loss function is likely to have very elongated valleys. When optimizing with gradient descent, this becomes an issue because the gradient will be steep with respect some of the parameters. That leads to large oscillations in the search space, as you are bouncing between steep slopes. To compensate, you have to stabilize optimization with small learning rates.
Consider features x1 and x2, where range from 0 to 1 and 0 to 1 million, respectively. It turns out the ratios for the corresponding parameters (say, w1 and w2) will also be large.
Normalizing tends to make the loss function more symmetrical/spherical. These are easier to optimize because the gradients tend to point towards the global minimum and you can take larger steps.
Looking at the neural network from the outside, it is just a function that takes some arguments and produces a result. As with all functions, it has a domain (i.e. a set of legal arguments). You have to normalize the values that you want to pass to the neural net in order to make sure it is in the domain. As with all functions, if the arguments are not in the domain, the result is not guaranteed to be appropriate.
The exact behavior of the neural net on arguments outside of the domain depends on the implementation of the neural net. But overall, the result is useless if the arguments are not within the domain.
I believe the answer is dependent on the scenario.
Consider NN (neural network) as an operator F, so that F(input) = output. In the case where this relation is linear so that F(A * input) = A * output, then you might choose to either leave the input/output unnormalised in their raw forms, or normalise both to eliminate A. Obviously this linearity assumption is violated in classification tasks, or nearly any task that outputs a probability, where F(A * input) = 1 * output
In practice, normalisation allows non-fittable networks to be fittable, which is crucial to experimenters/programmers. Nevertheless, the precise impact of normalisation will depend not only on the network architecture/algorithm, but also on the statistical prior for the input and output.
What's more, NN is often implemented to solve very difficult problems in a black-box fashion, which means the underlying problem may have a very poor statistical formulation, making it hard to evaluate the impact of normalisation, causing the technical advantage (becoming fittable) to dominate over its impact on the statistics.
In statistical sense, normalisation removes variation that is believed to be non-causal in predicting the output, so as to prevent NN from learning this variation as a predictor (NN does not see this variation, hence cannot use it).
The reason normalization is needed is because if you look at how an adaptive step proceeds in one place in the domain of the function, and you just simply transport the problem to the equivalent of the same step translated by some large value in some direction in the domain, then you get different results. It boils down to the question of adapting a linear piece to a data point. How much should the piece move without turning and how much should it turn in response to that one training point? It makes no sense to have a changed adaptation procedure in different parts of the domain! So normalization is required to reduce the difference in the training result. I haven't got this written up, but you can just look at the math for a simple linear function and how it is trained by one training point in two different places. This problem may have been corrected in some places, but I am not familiar with them. In ALNs, the problem has been corrected and I can send you a paper if you write to wwarmstrong AT shaw.ca
On a high level, if you observe as to where normalization/standardization is mostly used, you will notice that, anytime there is a use of magnitude difference in model building process, it becomes necessary to standardize the inputs so as to ensure that important inputs with small magnitude don't loose their significance midway the model building process.
example:
√(3-1)^2+(1000-900)^2 ≈ √(1000-900)^2
Here, (3-1) contributes hardly a thing to the result and hence the input corresponding to these values is considered futile by the model.
Consider the following:
Clustering uses euclidean or, other distance measures.
NNs use optimization algorithm to minimise cost function(ex. - MSE).
Both distance measure(Clustering) and cost function(NNs) use magnitude difference in some way and hence standardization ensures that magnitude difference doesn't command over important input parameters and the algorithm works as expected.
Hidden layers are used in accordance with the complexity of our data. If we have input data which is linearly separable then we need not to use hidden layer e.g. OR gate but if we have a non linearly seperable data then we need to use hidden layer for example ExOR logical gate.
Number of nodes taken at any layer depends upon the degree of cross validation of our output.

Resources