Things to try when a Neural Network is not converging - machine-learning

One of the most popular questions regarding neural networks seems to be:
Help!! My neural network is not converging!!
See here, here, here, here and here.
So, after eliminating any error in the implementation of the network, what are the most common things one should try?
I know that the things to try vary widely depending on the network architecture.
But by tweaking which parameters (learning rate, momentum, initial weights, etc.) and implementing which new features (windowed momentum?) were you able to overcome similar problems while building your own neural net?
Please give answers which are language-agnostic if possible. This question is intended to give some pointers to people stuck with neural nets that are not converging.

If you are using ReLU activations, you may have a "dying ReLU" problem. In short, under certain conditions, any neuron with a ReLU activation can be subject to a (bias) adjustment that leads to it never being activated ever again. It can be fixed with a "Leaky ReLU" activation, well explained in that article.
For example, I produced a simple MLP (3-layer) network with ReLU output which failed. I provided data it could not possibly fail on, and it still failed. I turned the learning rate way down, and it failed more slowly. It always converged to predicting each class with equal probability. It was all fixed by using a Leaky ReLU instead of standard ReLU.
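As an illustration, here is a minimal PyTorch sketch of a similar small MLP with Leaky ReLU hidden activations (the layer sizes and slope are made up for the example, not taken from the original network):

```python
import torch.nn as nn

# A small 3-layer MLP classifier; the hidden activations use LeakyReLU so that
# units with negative pre-activations still receive a small gradient instead of "dying".
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.LeakyReLU(negative_slope=0.01),  # small slope on the negative side keeps units trainable
    nn.Linear(64, 64),
    nn.LeakyReLU(negative_slope=0.01),
    nn.Linear(64, 3),                   # raw logits; pair with CrossEntropyLoss
)
```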

If we are talking about classification tasks, then you should shuffle the examples before training your net. That is, don't feed your net thousands of examples of class #1 followed by thousands of examples of class #2, etc. If you do that, your net most probably won't converge, but will tend to predict the last trained class.
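For example, a minimal sketch of shuffling features and labels together with NumPy (the arrays here are dummies; ideally reshuffle every epoch):

```python
import numpy as np

# Dummy data standing in for real features X and labels y (assumption for the sketch).
X = np.arange(10).reshape(10, 1)
y = np.array([0] * 5 + [1] * 5)   # all class-0 examples followed by all class-1 examples

# Shuffle features and labels with the same permutation so the classes are interleaved.
rng = np.random.default_rng(seed=0)
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]
```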

I had faced this problem while implementing my own back prop neural network. I tried the following:
Implemented momentum (and kept the value at 0.5)
Kept the learning rate at 0.1
Charted the error, weights, input as well as output of each and every neuron. Seeing the data as a graph is more helpful in figuring out what is going wrong.
Tried out different activation functions (all sigmoid-like). This did not help me much.
Initialized all weights to random values between -0.5 and 0.5 (my network's output was in the range -1 to 1)
I did not try this myself, but gradient checking can be helpful as well
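For completeness, here is a minimal sketch of gradient checking by central finite differences, using a stand-in mean-squared-error loss (replace it with your own network's loss and backprop gradient):

```python
import numpy as np

def loss(w, X, y):
    # Stand-in loss: mean squared error of a linear model (substitute your network's loss).
    return np.mean((X @ w - y) ** 2)

def analytic_grad(w, X, y):
    # The gradient your backprop code would produce for the stand-in loss.
    return 2.0 * X.T @ (X @ w - y) / len(y)

def numeric_grad(w, X, y, eps=1e-6):
    # Central finite differences, one weight at a time.
    g = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (loss(w_plus, X, y) - loss(w_minus, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(50, 5)), rng.normal(size=50), rng.normal(size=5)
diff = np.max(np.abs(analytic_grad(w, X, y) - numeric_grad(w, X, y)))
print("max abs difference:", diff)   # should be tiny (e.g. < 1e-7) if the gradient code is right
```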

If the problem is only convergence (not the actual "well trained network", which is way too broad a problem for SO), then the only thing that can be wrong once the code is OK is the training method's parameters. If one uses naive backpropagation, these parameters are the learning rate and momentum. Nothing else matters: for any initialization and any architecture, a correctly implemented neural network should converge for a good choice of these two parameters (in fact, for momentum = 0 it should converge to some solution too, for a small enough learning rate).
In particular, there is a good heuristic approach called "resilient backpropagation" (Rprop), which is in fact a parameterless approach and should (almost) always converge (assuming a correct implementation).
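For illustration, PyTorch ships an Rprop optimizer; a minimal sketch follows (the toy model and data are placeholders, and Rprop is meant for full-batch updates):

```python
import torch
import torch.nn as nn

# Toy setup standing in for a real network and dataset (assumption for the sketch).
torch.manual_seed(0)
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,)).float().unsqueeze(1)

model = nn.Sequential(nn.Linear(10, 16), nn.Tanh(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()

# Rprop adapts a per-weight step size from the sign of the gradient,
# so there is no learning rate / momentum pair to hand-tune.
optimizer = torch.optim.Rprop(model.parameters())

for epoch in range(200):           # full-batch updates, as Rprop expects
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
```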

After you've tried different meta-parameters (optimization / architecture), the most probable place to look is THE DATA.
As for myself, to minimize fiddling with meta-parameters I keep my optimizer automated - Adam is my optimizer of choice.
There are some rules of thumb regarding application vs. architecture... but it's really best to work those out on your own.
To the point: in my experience, after you've debugged the net (the easy debugging) and it still doesn't converge, or gets stuck in an undesired local minimum, the usual suspect is the data.
Whether you have contradictory samples or just incorrect ones (outliers), a small amount can make the difference between, say, 0.6 accuracy and (after cleaning) 0.9 accuracy.
A smaller but golden (clean) dataset is much better than a big, slightly dirty one; with augmentation you can tweak the results even further.
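As one concrete sanity check on the data, here is a small sketch (with made-up column names) that flags contradictory samples, i.e. identical inputs carrying different labels:

```python
import pandas as pd

# Toy table standing in for a real dataset (the columns are purely illustrative).
df = pd.DataFrame({
    "feat_a": [1, 1, 2, 3],
    "feat_b": [0, 0, 5, 7],
    "label":  ["cat", "dog", "cat", "dog"],   # rows 0 and 1 contradict each other
})

feature_cols = ["feat_a", "feat_b"]
# Group identical feature rows and flag groups that carry more than one distinct label.
label_counts = df.groupby(feature_cols)["label"].nunique()
contradictory = label_counts[label_counts > 1]
print(contradictory)
```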

Related

Regarding the consistency of convolutional neural networks

I am currently building a 2-channel (also called double-channel) convolutional neural network in order to classify 2 binary images (containing binary objects) as 'similar' or 'different'.
The problem I am having is that it seems as though the network doesn't always converge to the same solution. For example, I can use exactly the same ordering of training pairs and all the same parameters and so forth, and when I run the network multiple times, each time produces a different solution; sometimes converging to below 2% error rates, and other times I get 50% error rates.
I have a feeling that it has something to do with the random initialization of the weights of the network, which results in different optimization paths each time the network is run. This issue even occurs when I use SGD with momentum, so I don't really know how to 'force' the network to converge to the same solution (global optimum) every time.
Can this have something to do with the fact that I am using binary images instead of grey-scale or color images, or is there something intrinsic to neural networks that is causing this issue?
There are several sources of randomness in training.
Initialization is one. SGD itself is of course stochastic since the content of the minibatches is often random. Sometimes, layers like dropout are inherently random too. The only way to ensure getting identical results is to fix the random seed for all of them.
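For example, a sketch of what fixing the seeds might look like in a PyTorch setup (analogous calls exist in other frameworks; full GPU determinism can require additional flags):

```python
import random
import numpy as np
import torch

SEED = 42
random.seed(SEED)             # Python's own RNG (used by some shuffling utilities)
np.random.seed(SEED)          # NumPy-based preprocessing / augmentation
torch.manual_seed(SEED)       # weight initialization, dropout masks, CPU ops
torch.cuda.manual_seed_all(SEED)           # GPU RNGs
torch.backends.cudnn.deterministic = True  # prefer deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False     # disable autotuning that can vary run to run
```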
Given all these sources of randomness and a model with many millions of parameters, your quote
"I don't really know how to 'force' the network to converge to the same solution (global optimum) every time"
is something pretty much anyone could say - no one knows how to find the same solution every time, or even a local optimum, let alone the global optimum.
Nevertheless, ideally, it is desirable to have the network perform similarly across training attempts (with fixed hyper-parameters and dataset). Anything else causes problems with reproducibility, of course.
Unfortunately, I suspect the problem is inherent to CNNs.
You may be aware of the bias-variance tradeoff. For a powerful model like a CNN, the bias is likely to be low, but the variance very high. In other words, CNNs are sensitive to data noise, initialization, and hyper-parameters. Hence, it's not so surprising that training the same model multiple times yields very different results. (I also get this phenomenon, with performances changing between training runs by as much as 30% in one project I did.) My main suggestion to reduce this is stronger regularization.
Can this have something to do with the fact that I am using binary images instead of grey-scale or color images, or is there something intrinsic to neural networks that is causing this issue?
As I mentioned, this problem is present to an extent in deep models in general. However, your use of binary images may also be a factor, since the space of the data itself is rather discontinuous. Perhaps consider "softening" the input (e.g. filtering the inputs) and using data augmentation. A similar idea is behind label smoothing, for example.
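For instance, a minimal sketch of "softening" a binary input with a Gaussian blur (the image and sigma are made up for illustration):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Toy binary image standing in for a real input (assumption for the sketch).
binary_img = np.zeros((32, 32), dtype=float)
binary_img[10:22, 10:22] = 1.0

# A mild Gaussian blur turns the hard 0/1 edges into smooth gradients,
# making the input space less discontinuous for the network.
soft_img = gaussian_filter(binary_img, sigma=1.0)
```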

In layman's terms, what's the difference between a LossFunction and an OptimizationAlgorithm?

I get the part that training a network is all about finding the right weights, with optimization algorithms deciding how the weights are updated until the ones needed to get the right prediction are arrived at.
So the million-dollar que$tion$ related to the main one are:
(1.) If optimization algorithms update the weights, what do loss functions do to the weights of the network?
(2.) Are loss functions only specific to the output layer of a neural network? (Most examples I see with the deeplearning4j framework implement it at the output layer.)
P.S.: I really want to understand the basic difference between these two in the simplest way possible. I am not looking for anything complex or with some mathematical explosions.
The optimization algorithm tries to find the minimum of the loss function; at that minimum the weights are as good as they can get. To your two questions: the loss function doesn't touch the weights at all - it just measures how wrong the current predictions are, producing a single number. The optimizer then uses the gradient of that number with respect to every weight to decide how to change them. And although the loss is computed at the output layer, its gradient is propagated back through all the layers, so it influences every weight in the network.
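In code the two roles are easy to see; a minimal PyTorch sketch (the model and data are placeholders):

```python
import torch
import torch.nn as nn

X = torch.randn(64, 10)            # placeholder inputs
y = torch.randn(64, 1)             # placeholder targets

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()                                    # loss function: measures how wrong we are
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # optimizer: decides how to change the weights

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)    # a single number scoring the current weights
    loss.backward()                # gradients of the loss w.r.t. every weight, not just the output layer
    optimizer.step()               # the optimizer uses those gradients to update the weights
```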

How to overcome overfitting in CNN - standard methods don't work

I've recently been playing around with the cars dataset from Stanford (http://ai.stanford.edu/~jkrause/cars/car_dataset.html).
From the very beginning I had an overfitting problem, so I decided to:
Add regularization (L2, dropout, batch norm, ...)
Try different architectures (VGG16, VGG19, InceptionV3, DenseNet121, ...)
Try transfer learning using models trained on ImageNet
Use data augmentation
Every step moved me a little bit forward. However, I finished with 50% validation accuracy (started below 20%) compared to 99% training accuracy.
Do you have an idea what more I can do to get to around 80-90% accuracy?
Hope this can help some people!:)
Things you should try include:
Early stopping, i.e. use a portion of your data to monitor validation loss and stop training if performance does not improve for some epochs (a short sketch follows at the end of this answer).
Check whether you have unbalanced classes; use class weighting to equally represent each class in the data.
Regularization parameter tuning: different l2 coefficients, different dropout values, different regularization constraints (e.g. l1).
Another general suggestion is to try to replicate the state-of-the-art models on this particular dataset and see if they perform as they should.
Also make sure to have all implementation details ironed out (e.g. convolution is being performed along width and height, and not along the channels dimension - this is a classic rookie mistake when starting out with Keras, for instance).
It would also help to have some more details on the code that you are using, but for now these suggestions will do.
50% accuracy on a 200-class problem doesn't sound so bad anyway.
Cheers
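As referenced above, here is a minimal tf.keras sketch of early stopping and class weighting (the model, data shapes and weight values are all placeholders, not the setup from the question):

```python
import numpy as np
from tensorflow import keras

# Placeholder arrays standing in for the real training data (assumption for the sketch).
x_train = np.random.rand(100, 64, 64, 3)
y_train = keras.utils.to_categorical(np.random.randint(0, 5, 100), num_classes=5)

# Hypothetical tiny model; in practice this would be the VGG/Inception-style network.
model = keras.Sequential([
    keras.layers.Input(shape=(64, 64, 3)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Early stopping: halt when validation loss stops improving and keep the best weights.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                            restore_best_weights=True)

# Class weighting: give rarer classes more influence on the loss (values are illustrative).
class_weight = {0: 1.0, 1: 1.0, 2: 2.5, 3: 1.0, 4: 1.5}

model.fit(x_train, y_train, validation_split=0.2, epochs=50,
          callbacks=[early_stop], class_weight=class_weight)
```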
For those who encounter the same problem: I managed to get 66.11% accuracy by playing with dropout, learning rate and learning rate decay, mainly.
The best results were achieved with the VGG16 architecture.
The model is at https://github.com/michalgdak/car-recognition

How does pre-training improve classification in neural networks?

Many of the papers I have read so far mention that "pre-training the network could improve computational efficiency in terms of back-propagating errors", and that this can be achieved using RBMs or autoencoders.
If I have understood correctly, autoencoders work by learning the identity function, and if they have fewer hidden units than the size of the input data, they also perform compression. But what does this have to do with improving the computational efficiency of propagating the error signal backwards? Is it because the weights of the pre-trained hidden units do not diverge much from their initial values?
Assuming data scientists reading this already know that autoencoders take the inputs as target values, since they are learning the identity function, which is regarded as unsupervised learning: can such a method be applied to convolutional neural networks, for which the first hidden layer is a feature map? Each feature map is created by convolving a learned kernel with a receptive field in the image. How could this learned kernel be obtained by pre-training (in an unsupervised fashion)?
One thing to note is that autoencoders try to learn a non-trivial approximation of the identity function, not the identity function itself; otherwise they wouldn't be useful at all. The pre-training helps move the weight vectors towards a good starting point on the error surface. Then the backpropagation algorithm, which is basically doing gradient descent, is used to improve upon those weights. Note that gradient descent gets stuck in the closest local minimum.
[Ignore the term Global Minimum in the image posted and think of it as another, better, local minimum.]
Intuitively speaking, suppose you are looking for an optimal path to get from origin A to destination B. Having a map with no routes shown on it (the errors you obtain at the last layer of the neural network model) kind of tells you where to go, but you may put yourself on a route that has a lot of obstacles, uphills and downhills. Then suppose someone tells you about a route and a direction they have taken before (the pre-training) and hands you a new map (the pre-training phase's starting point).
This could be an intuitive reason why starting with random weights and immediately optimizing the model with backpropagation may not necessarily give you the performance you obtain with a pre-trained model. However, note that many models achieving state-of-the-art results do not use pre-training at all, and they may use backpropagation in combination with other optimization methods (e.g. Adagrad, RMSProp, momentum, ...) to hopefully avoid getting stuck in a bad local minimum.
Here's the source for the second image.
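To make the "good starting point" idea concrete, here is a hedged sketch of pre-training one hidden layer as an autoencoder and then reusing its encoder in a classifier (the sizes, data and training lengths are all made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(500, 50)          # stand-in for (possibly unlabeled) data
y = torch.randint(0, 3, (500,))   # labels, used only in the fine-tuning phase

# --- Unsupervised pre-training: learn a compressed reconstruction of the inputs ---
encoder = nn.Sequential(nn.Linear(50, 16), nn.Sigmoid())
decoder = nn.Linear(16, 50)
ae_opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for epoch in range(100):
    ae_opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(X)), X)   # target is the input itself
    loss.backward()
    ae_opt.step()

# --- Supervised fine-tuning: reuse the pre-trained encoder as the first hidden layer ---
classifier = nn.Sequential(encoder, nn.Linear(16, 3))   # weights start near a sensible region
clf_opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
for epoch in range(100):
    clf_opt.zero_grad()
    loss = ce(classifier(X), y)
    loss.backward()
    clf_opt.step()
```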
I don't know a lot about autoencoder theory, but I've done a bit of work with RBMs. What RBMs do is predict the probability of seeing a specific type of data, in order to get the weights initialized into the right ballpark; they are considered an (unsupervised) probabilistic model, so you don't correct using the known labels. Basically, the idea is that a learning rate that is too big will never lead to convergence, while one that is too small will take forever to train. Thus, by "pre-training" in this way you find the ballpark of the weights and can then use a small learning rate to bring them down to the optimal values.
As for the second question: no, you don't generally pre-learn kernels, at least not in an unsupervised fashion. I suspect that what is meant by pre-training here is a bit different from your first question - that is, they are taking a pre-trained model (say from a model zoo) and fine-tuning it on a new set of data.
Which model you use generally depends on the type of data you have and the task at hand. I've found convnets to train faster and more efficiently, but not all data has meaning when convolved, in which case DBNs may be the way to go. And if, say, you only have a small amount of data, I'd use something other than neural networks entirely.
Anyway, I hope this helps clear up some of your questions.

Advantages of RNN over DNN in prediction

I am going to work on a problem that needs to be addressed with either an RNN or a deep neural net. In general, the problem is predicting financial values. Because I am given a sequence of financial data as input, I thought an RNN would be better. On the other hand, I think that if I can fit the data into some fixed structure, I can train a DNN much more easily, because the training phase is easier for a DNN than for an RNN. For example, I could take the last month's info, keep 30 inputs, and predict the 31st day using a DNN.
I don't understand the advantage of an RNN over a DNN from this perspective. My first question is about the proper usage of an RNN or DNN for this problem.
My second question is somewhat basic. While training an RNN, isn't it possible for the network to get "confused"? I mean, consider the input 10101111, where the inputs are single digits 0 or 1 and we have the 2-sequences (1-0, 1-0, 1-1, 1-1). Here, after a 1 comes a 0 several times, and then at the end, after a 1 comes a 1. While training, wouldn't this become a major problem? That is, why doesn't the system get confused while training on this sequence?
I think your question is phrased a bit problematically. First, DNNs are a class of architectures: a convolutional neural network differs greatly from a deep belief network or a simple deep MLP. There are feed-forward architectures (e.g. TDNNs) fit for time-series prediction, but it depends on whether you're more interested in research or just solving your problem.
Second, RNNs are as "deep" as it gets. Consider the most basic RNN, the Elman network: during training with backpropagation through time (BPTT), it is unfolded in time, backpropagating over T timesteps. Since this backpropagation is done not only vertically, as in a standard DNN, but also horizontally over T-1 context layers, the past activations of the hidden layers from the T-1 timesteps before the present are actually considered for the activation at the current timestep. This illustration of an unfolded net might help in understanding what I just wrote (source):
This is what makes RNNs so powerful for time-series prediction (and should answer both your questions). If you have more questions, read about Elman networks. LSTMs etc. will only confuse you; understanding Elman networks and BPTT is the foundation needed to understand any other RNN.
And one last thing you'll need to look out for: the vanishing gradient problem. While it's tempting to say let's make T = infinity and give our RNN as much memory as possible, it doesn't work. There are many ways of working around this problem; LSTMs are quite popular at the moment, and there are even some proper LSTM implementations around nowadays. But it's important to know that a basic Elman network can really struggle with T = 30.
As you answered yourself, RNNs are for sequences. If the data has a sequential nature (time series), then it is preferable to use such a model over a DNN or other "static" models. The main reason is that an RNN can model the process that generates each sequence, so, for example, given the sequences
0011100
0111000
0001110
an RNN will be able to build a model that says "after seeing a 1, I will see two more" and correctly build a prediction when seeing
0000001**** -> 0000001110
while for a DNN (and other non-sequential models) there is no relation between these three sequences; in fact, the only thing they have in common is that "there is a 1 in the fourth position, so I guess it is always like that".
Regarding the second question: why won't it get confused? Because it models sequences, because it has memory. It makes its decisions based on everything that was observed before, and assuming that your signal has any type of regularity, there is always some event in the past that differentiates between two possible paths of the signal. Once again, such phenomena are much better addressed by RNNs than by non-recurrent models. See, for example, natural language and the enormous progress made by LSTM-based models in recent years.
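To make this concrete, here is a hedged sketch of a tiny recurrent network trained for next-bit prediction on sequences like the ones above (sizes and training length are illustrative, not tuned):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Training sequences of the kind described above; the target at each step is the next bit.
seqs = torch.tensor([[0, 0, 1, 1, 1, 0, 0],
                     [0, 1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1, 0]], dtype=torch.float32)
inputs  = seqs[:, :-1].unsqueeze(-1)   # shape (batch, time, 1)
targets = seqs[:, 1:].unsqueeze(-1)    # next-step prediction targets

rnn = nn.RNN(input_size=1, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)                 # maps each hidden state to a next-bit logit
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(500):
    opt.zero_grad()
    hidden_states, _ = rnn(inputs)     # the hidden state carries "how many 1s have I just seen"
    loss = loss_fn(head(hidden_states), targets)
    loss.backward()                    # BPTT unrolls this through all timesteps
    opt.step()
```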

Resources