CrossPost: https://stats.stackexchange.com/questions/103960/how-sensitive-are-neural-networks
I am aware of pruning, though I am not sure whether it removes the neuron itself or just sets its weights to zero; either way, assume for this question that no pruning is being used.
For feedforward neural networks of various sizes, trained on large datasets with lots of noise:
Can one (or some trivially small number of) extra or missing hidden neurons or hidden layers make or break a network? Or will an unnecessary neuron's weights simply decay toward zero, and the remaining neurons compensate if one or two are missing?
When experimenting, should input neurons be added one at a time or in groups of X? What is X? Increments of 5?
Lastly, should each hidden layer contain the same number of neurons? That is what I usually see in examples. If not, how and why would you adjust their sizes, short of relying on pure experimentation?
I would rather overdo it and wait longer for convergence, if a larger network will simply adapt itself to the solution. I have tried numerous configurations, but it is still difficult to gauge the optimal one.
1) Yes, absolutely. For example, if you have too few neurons in your hidden layer, your model will be too simple and have high bias. Conversely, if you have too many neurons, your model will overfit and have high variance. Adding more hidden layers lets you model very complex problems such as object recognition, but there are many tricks needed to make additional hidden layers work; this is the field of deep learning.
2) In a single-hidden-layer network, a common rule of thumb is to start with about twice as many hidden neurons as inputs. You can determine the increment through something like binary search, i.e. run through a few different architectures and see how the accuracy changes.
3) No, definitely not - each hidden layer can contain as many neurons as you want. There is no way other than experimentation to determine their sizes; everything you mention is a hyperparameter that you must tune. A sketch of such a sweep is below.
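As a rough illustration of points (2) and (3), here is a minimal sketch of comparing a few hidden-layer sizes on held-out data, using scikit-learn on synthetic data; the specific widths tried are arbitrary examples, not recommendations.

```python
# Compare a handful of hidden-layer sizes by validation accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data; replace with your own feature matrix and labels.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for hidden in [(10,), (20,), (40,), (40, 40)]:      # candidate architectures
    clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=1000, random_state=0)
    clf.fit(X_train, y_train)
    print(hidden, round(clf.score(X_val, y_val), 3))  # validation accuracy per architecture
```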
I'm not sure if you are looking for a simple answer, but maybe you will be interested in a new neural network regularization technique called dropout. Dropout randomly "removes" some of the neurons during training, forcing each neuron to be a good feature detector on its own. It greatly reduces overfitting, so you can set the number of neurons high without worrying too much. Check this paper out for more info: http://www.cs.toronto.edu/~nitish/msc_thesis.pdf
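To make the idea concrete, here is a bare-bones numpy sketch of a dropout layer (the "inverted dropout" variant, which scales at training time); it is not tied to any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    """Randomly zero a fraction p_drop of the activations during training,
    scaling the survivors so the expected value stays the same."""
    if not training or p_drop == 0.0:
        return activations                      # at test time, use all neurons
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.array([0.2, 1.3, 0.7, 2.1])
print(dropout(h))   # roughly half the entries are zeroed on each call
```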
Related
Would anyone here know if there is any kind of normalisation or scaling between layers in existing neural network architectures?
Scaling the inputs is common, and I am familiar with ReLU blow-up. Most models I see indicate a small range of values like -2 to +2, but I don't see how this can be maintained from layer to layer. Irrespective of the activation function, my second layer's output is in the tens, the third layer's is in the hundreds, and the final output is in the tens of thousands. In the worst case a layer returns NaN. A workaround could be scaling between layers or alternating ReLU/sigmoid, but I would like to know whether this is common?
Pretty much every modern network uses batch normalization, which is exactly that. The paper can be found here: https://arxiv.org/abs/1502.03167. In essence, it normalizes the values to zero mean and unit variance before they are fed into the next layer. Another line of work is self-normalizing networks built on the SELU activation (scaled exponential linear units), which in some sense do this automatically, without any explicit scaling. That paper can be found here: https://arxiv.org/abs/1706.02515.
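A bare-bones numpy illustration of the batch-normalization idea: normalize each feature over the mini-batch to zero mean and unit variance, then apply a learned scale (gamma) and shift (beta). Real implementations also track running statistics for test time; that part is omitted here.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                 # per-feature mean over the batch
    var = x.var(axis=0)                   # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta           # learned scale and shift

batch = np.random.randn(32, 4) * 100 + 50   # badly scaled layer outputs
out = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```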
I have inputs (r, c) in the range (0, 1], the coordinates of a pixel of an image, and its color, which is 1 or 2 only.
I have about 6,400 pixels.
My attempt at fitting X=(r,c) to y=color was a failure; the accuracy won't go higher than 70%.
Here's the image:
The first is the actual image, the second is the image I train on (it has only 2 colors), and the last is the image the neural network generated with about 500 weights after training for 50 iterations. The input layer has 2 units, there is one hidden layer of size 100, and the output layer has 2 units. (For binary classification like this I may need only one output unit, but I am just preparing for multi-class classification.)
The classifier failed to fit even the training set; why is that? I tried generating high-order polynomial terms of those 2 features, but it didn't help. I tried using a Gaussian kernel with 20-100 random landmarks on the picture to add more features and got similar output. I also tried logistic regression, which didn't help either.
Please help me increase the accuracy.
Here's the input: input.txt (you can load it into Octave; the variables are coordinate (the r,c features) and idx (the color)).
You can try plotting it first to make sure you understand the input, then try training on it and tell me if you get a better result.
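A quick sanity check along the lines of "plot it first": scatter the (r, c) coordinates coloured by the label. Random stand-in data is used below because the original input.txt is not reproduced here; swap in the real coordinate/idx arrays.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
coordinate = rng.random((6400, 2))                 # stand-in for the (r, c) pairs
idx = (coordinate[:, 0] > 0.5).astype(int) + 1     # stand-in for the 1/2 color labels

plt.scatter(coordinate[:, 1], coordinate[:, 0], c=idx, s=2)
plt.gca().invert_yaxis()                           # match image orientation (row grows downward)
plt.show()
```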
Your problem is hard to model. You are trying to fit a function from R^2 to R that has lots of complexity - lots of "spikes" and lots of disconnected regions (pixels that are completely separated from the rest). This is not an easy problem, nor a particularly useful one. To overfit your network to such a target you will need plenty of hidden units. So what are the options?
General things that are missing from the question but are important:
Your output variable should be in {0, 1} if you are fitting your network with the cross-entropy cost (log-likelihood), which is what you should use for classification.
50 iterations (if you are talking about mini-batch iterations) is orders of magnitude too small, unless you mean 50 epochs (passes over the whole training set).
Things that will probably actually need to be done (at least one of the below):
I assume you are using ReLU activations (or tanh; it is hard to say from the output) - you could instead use RBF activations and increase the number of hidden neurons to ~5000,
If you do not want to go with RBFs, then you will need 1-2 additional hidden layers to fit a function of this complexity. Try an architecture of the form 100-100-100 instead (a sketch follows below).
If the above fails, increase the number of hidden units; enough capacity is all you need.
In general, neural networks are not designed for low-dimensional datasets. Learning a pixel-position-to-color mapping is a nice example from the web, but it is completely artificial and seems to actually harm people's intuitions.
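Here is a hedged sketch of the 100-100-100 suggestion using scikit-learn. The data here is a random stand-in (the original input.txt is not reproduced), so the labels are artificial; the point is only the shape of the model and the goal of driving training accuracy high.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
coords = rng.random((6400, 2))                             # stand-in for the (r, c) pairs
colors = (coords[:, 0] + coords[:, 1] > 1.0).astype(int)   # stand-in for the 1/2 labels

clf = MLPClassifier(hidden_layer_sizes=(100, 100, 100),    # three hidden layers of 100 units
                    activation="relu",
                    max_iter=2000,
                    random_state=0)
clf.fit(coords, colors)
print("training accuracy:", clf.score(coords, colors))     # for this task we *want* to overfit
```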
I have an indefinitely large training set to train a neural network.
Does it make any sense in this scenario to use regularization techniques like dropout?
Yes, it probably still does. Dropout is regularization in a sense, but a much subtler one than something like an L1 penalty. It prevents excessive co-adaptation of feature detectors, as described in the original paper.
You probably don't want the network to learn to depend on just one feature or just a small combo of features, even if that is the best feature in your training set, because it may not be the case in new data. Intuitively, a network with dropout trained to recognize people in images will likely still recognize them if the face is obscured, even if there was no example image like that in the training set (because the face high level feature would have been dropped out some fraction of the time); a network trained without dropout may not (because the face feature is probably one of the best single features for detecting people). You can think of dropout as a certain degree of forced concept generalization.
Empirically, the feature detectors produced with dropout are much more structured (e.g., for images, closer to Gabor filters in the first few layers); without dropout they look closer to random (probably because the network approximates the Gabor filter it is converging towards with a specific linear combination of random filters; if it can rely on the elements of that combination not being dropped out, there is no gradient towards separating the filters). This is also probably a good thing, since it forces features that are independent to be implemented as independent early on, which may result in less cross-talk later on.
NEW DEVELOPMENT
I recently used OpenCV's MLP implementation to test whether it could solve the same tasks. OpenCV was able to classify the same data sets that my implementation could, but was unable to solve the ones that mine could not. Maybe this is due to the termination parameters (which determine when to end training). I stopped before 100,000 iterations, and the MLP did not generalize. This time the network architecture was 400 input neurons, 10 hidden neurons, and 2 output neurons.
I have implemented the multilayer perceptron algorithm and verified that it works with the XOR logic gate. For OCR, I taught the network to correctly classify "A"s and "B"s drawn with a thick drawing utensil (a marker). However, when I try to teach the network to classify letters drawn with a thin utensil (a pencil), the network seems to get stuck in a valley and cannot classify the letters in a reasonable amount of time. The same goes for letters I drew with GIMP.
I know people say to use momentum to get out of the valley, but the sources I read were vague. I tried increasing the momentum value when the change in error was insignificant and decreasing it otherwise, but it did not seem to help.
My network architecture is 400 input neurons (one per pixel), 2 hidden layers with 25 neurons each, and 2 neurons in the output layer. The images are grayscale, and the inputs are -0.5 for a black pixel and 0.5 for a white pixel.
EDIT:
Currently the network trains until the calculated error for each training example falls below an accepted error constant. I have also tried stopping training at 10,000 epochs, but this yields bad predictions. The activation function is the logistic sigmoid, and the error function is the sum of squared errors.
I suppose I may have reached a local minimum rather than a valley, but this should not happen repeatedly.
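Just to make the described setup concrete, here is a small numpy sketch of one forward pass through a 400-25-25-2 sigmoid network and the sum-of-squared-error cost. The weights are random placeholders, not a trained network, and the target coding is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Randomly initialised placeholder weights for a 400-25-25-2 network.
W1, b1 = rng.normal(0, 0.1, (25, 400)), np.zeros(25)
W2, b2 = rng.normal(0, 0.1, (25, 25)),  np.zeros(25)
W3, b3 = rng.normal(0, 0.1, (2, 25)),   np.zeros(2)

x = rng.choice([-0.5, 0.5], size=400)   # one 20x20 image, pixels coded as -0.5 / 0.5
target = np.array([1.0, 0.0])           # hypothetical coding for class "A"

h1 = sigmoid(W1 @ x + b1)
h2 = sigmoid(W2 @ h1 + b2)
out = sigmoid(W3 @ h2 + b3)
sse = np.sum((out - target) ** 2)       # the sum-of-squared-error cost described above
print(out, sse)
```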
Momentum is not always good: it can help the model jump out of a bad valley, but it may also make the model jump out of a good one, especially when the previous weight-update directions are not good.
There are several possible reasons why your model does not work well.
The parameters may not be well set; it is always a non-trivial task to set the parameters of an MLP.
An easy approach is to start with a large learning rate, momentum weight, and regularization weight, while setting the number of iterations (or epochs) very high. Once the model diverges, halve the learning rate, momentum weight, and regularization weight.
This approach lets the model slowly converge to a local optimum, while still giving it a chance to jump out of a bad valley.
Moreover, in my opinion, one output neuron is enough for a two-class problem; there is no need to increase the complexity of the model unnecessarily. Similarly, if possible, use a three-layer MLP instead of a four-layer MLP.
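For reference, here is a small sketch of the momentum update being discussed, combined with the "halve the rates once training diverges" heuristic above. The quadratic loss is just a stand-in for the real network cost, and omitting the regularization weight is a simplification.

```python
import numpy as np

def grad(w):                        # gradient of a toy loss 0.5 * ||w||^2
    return w

w = np.array([2.0, -3.0])
velocity = np.zeros_like(w)
lr, momentum = 0.5, 0.9

prev_loss = np.inf
for step in range(100):
    velocity = momentum * velocity - lr * grad(w)   # accumulate past update directions
    w = w + velocity
    loss = 0.5 * np.sum(w ** 2)
    if loss > prev_loss:            # loss went up: halve the rates, per the heuristic above
        lr *= 0.5
        momentum *= 0.5
    prev_loss = loss

print(w, prev_loss)                 # w should have moved close to the optimum at 0
```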
I would like to implement picture classification using a neural network. I want to know how to select the features from the picture and the number of hidden units or layers to go with.
For now my idea is to resize the image to 50x50 or smaller so that the number of features is small and all inputs have a constant size. The features would be the RGB values of each pixel. Will that be fine, or is there a better way?
I also decided to go with 1 hidden layer with half as many units as inputs. I can change this number to get better results. Or would I need more layers?
There are numerous image data sets that are successfully learned by neural networks, like
MNIST (here you will find many links to papers)
NORB
and CIFAR-10/100.
Note that you need many training examples. Usually one hidden layer is sufficient, but it can be hard to determine the "right" number of neurons. Sometimes the number of hidden neurons should even be greater than the number of inputs. When you use 2 or more hidden layers, you will usually need fewer hidden nodes and training will be faster; but when you have too many hidden layers, it can be difficult to train the weights in the first layer.
A kind of neural network designed especially for images is the convolutional neural network. They usually work much better than multilayer perceptrons and are much faster.
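A hedged sketch of the convolutional approach, written with the Keras API (assumes TensorFlow is installed); the layer sizes, the 50x50 RGB input shape, and the 10-class output are illustrative assumptions only, not a tuned model.

```python
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(50, 50, 3)),
    keras.layers.Conv2D(16, 3, activation="relu"),   # learn local image filters
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),    # e.g. 10 classes as in CIFAR-10
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data just to show the expected shapes; replace with real images and labels.
x = np.random.rand(8, 50, 50, 3).astype("float32")
y = np.random.randint(0, 10, size=8)
model.fit(x, y, epochs=1, verbose=0)
```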
A 50x50 image gives a 2500-pixel feature matrix (7500 values once you include the RGB channels). Your neural network may memorize this, but it will most probably perform poorly on other images.
Therefore this type of problem is more about image processing and feature extraction. Your features will change according to your requirements. See this similar question about image processing and neural networks.
A single-layer network is only suitable for linear problems; are you sure your problem is linear? Otherwise you will need a multi-layer neural network.
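For completeness, here is a sketch of the preprocessing the question describes: resize every image to 50x50 and flatten the RGB values into one feature vector (2500 pixels, hence 7500 numbers). It assumes Pillow and numpy are available, and the file path is a hypothetical placeholder.

```python
import numpy as np
from PIL import Image

def image_to_features(path, size=(50, 50)):
    """Resize to a fixed size and flatten RGB pixel values into one feature vector."""
    img = Image.open(path).convert("RGB").resize(size)
    pixels = np.asarray(img, dtype=np.float32) / 255.0   # scale values to [0, 1]
    return pixels.reshape(-1)                            # shape: (50 * 50 * 3,) = (7500,)

# features = image_to_features("some_picture.jpg")       # hypothetical example usage
```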