When training a stacked auto-encoder, what are the best practices to choose the number of hidden layers, and their sizes?
For example, is it best to set the hidden layer size to be one less than the input layer size, in each layer of the stack?
Also, does one need to perform hyper-parameter optimization (e.g. L2 Weight Regularization) at each layer in the stack, or do the hyper-parameters from one location in the stack tend to generalize well for others?
I recently found the "global_pooling" flag in the Pooling layer in Caffe, but was unable to find anything about it in the documentation here (Layer Catalogue)
nor here (Pooling doxygen doc).
Is there a straightforward explanation, with an example, of how this compares to the normal pooling layer behaviour?
Global pooling reduces the dimensionality from 3D to 1D: it outputs one response for every feature map. This can be the maximum, the average, or whatever other pooling operation you use.
It is often used at the end of the convolutional part of a neural network to get a shape that works with dense layers, so no flatten has to be applied.
Convolutions can work on any image input size (which is big enough). However, if you have a fully connected layer at the end, this layer needs a fixed input size. Hence the complete network needs a fixed image input size.
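To make the fixed-input-size point concrete, here is a minimal sketch that tracks the spatial size of the feature map through a small stack of layers. The layer stack and the input sizes are made up for illustration, and the formula uses floor rounding without padding (some pooling implementations round up instead).

```python
def conv_output_size(input_size, kernel, stride=1, padding=0):
    """Output size along one axis of a convolution or pooling layer (floor rounding)."""
    if input_size + 2 * padding < kernel:
        raise ValueError("input is smaller than the kernel")
    return (input_size + 2 * padding - kernel) // stride + 1


def feature_map_side(input_size, layers):
    """Apply a list of (kernel, stride) layers and return the final side length."""
    size = input_size
    for kernel, stride in layers:
        size = conv_output_size(size, kernel, stride)
    return size


# Hypothetical stack: two 3x3 convolutions, each followed by 2x2 pooling with stride 2.
layers = [(3, 1), (2, 2), (3, 1), (2, 2)]
side_small = feature_map_side(32, layers)  # square 32x32 input
side_large = feature_map_side(64, layers)  # square 64x64 input
print(side_small, side_large)
```

The two inputs end in feature maps of different sizes, so a fully connected layer placed after this stack would see a different number of inputs for each image size.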
However, you can remove the fully connected layer and just work with convolutional layers. You can make a convolutional layer at the end which has the same number of filters as you have classes. But you want one value for each class which indicates the probability of that class. Hence you apply a pooling filter over the complete remaining feature map. This pooling is hence "global" as it always is as big as necessary. In contrast, usual pooling layers have a fixed size (e.g. of 2x2 or 3x3).
This is a general concept. You can also find global pooling in other libraries, e.g. Lasagne. If you want a good reference in literature, I recommend reading Network In Network.
We get only one value from each feature map when we apply a global pooling (GP) layer, whose kernel size is the h×w of the feature map. Like ordinary pooling layers, GP layers reduce the spatial dimensions of a three-dimensional feature map, but they perform a more extreme type of dimensionality reduction: a feature map with dimensions h×w×d is reduced to dimensions 1×1×d. A global average pooling layer reduces each h×w feature map to a single number by simply taking the average of all hw values.
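As a rough illustration of the reduction described above, the following sketch pools d feature maps down to d numbers in plain Python. The function name `global_pool` and the channel-first nested-list layout are assumptions made for this example; it is not Caffe's implementation.

```python
def global_pool(feature_maps, op="ave"):
    """Reduce d feature maps (each a list of h rows with w values) to d numbers."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]  # all h*w values of one map
        if op == "ave":
            pooled.append(sum(values) / len(values))
        elif op == "max":
            pooled.append(max(values))
        else:
            raise ValueError("op must be 'ave' or 'max'")
    return pooled


# Two feature maps of size 2x2, and two of size 1x3: the output length is d = 2 in both cases.
maps_2x2 = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
maps_1x3 = [[[1.0, 2.0, 3.0]], [[5.0, 5.0, 5.0]]]
print(global_pool(maps_2x2), global_pool(maps_2x2, "max"), global_pool(maps_1x3))
```

Because the pooling window always spans the whole map, the output length depends only on the number of feature maps, not on h or w.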
If you are looking for information regarding flags/parameters of Caffe, it is best to look them up in the comments of '$CAFFE_ROOT/src/caffe/proto/caffe.proto'.
For the 'global_pooling' parameter the comment says:
// If global_pooling then it will pool over the size of the bottom by doing
// kernel_h = bottom->height and kernel_w = bottom->width
For more information about Caffe layers, see these help pages.
What is the general consensus on rescaling images that have different sizes? I have read that one approach is to rescale the largest size of an image to a fixed size. It's not clear to me how only rescaling one of the dimensions would lead to uniform image shapes across the dataset.
Are there other approaches, e.g. would it work to take the average size of the two dimensions and then rescale the dimensions of each image to the mean of each dimension across the dataset?
Is it important which interpolation method is used in the rescaling?
Would it make sense to simply take an nxm part of each image and cut off the rest of each image?
Is there a list of approaches people have used and how they perform in different scenarios?
It depends on the target application of the CNN. For object detection/classification, usually a sliding window approach or cropping is used. In the first option, a sliding window is moved around the image and a prediction is made for every patch (with different overlap criteria). These predictions are then filtered with other pooling or filter strategies.
For image segmentation (aka semantic segmentation), similar approaches are used: 1) image scaling + segmenting + scaling back to the original size, 2) different image patches + segmentation of each, or 3) sliding window segmentation + max-pooling. With option (3) each pixel gets N = HxW votes (where N is the size of the sliding window). These N predictions are then aggregated by a maximum-voting classifier (similar to ensemble models such as Random Forests and other classifiers).
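A minimal sketch of the sliding-window voting idea in option (3), under a simplifying assumption: the classifier returns a single label per patch, and that label counts as one vote for every pixel in the patch. The callback `classify_patch` and the toy classifier below are hypothetical stand-ins for a real model.

```python
from collections import Counter


def window_starts(size, window, stride):
    """Top-left coordinates of all window placements along one axis (no padding)."""
    return list(range(0, size - window + 1, stride))


def vote_segmentation(height, width, window, stride, classify_patch):
    """Return (labels, totals): the majority label and the number of votes per pixel."""
    votes = [[Counter() for _ in range(width)] for _ in range(height)]
    for top in window_starts(height, window, stride):
        for left in window_starts(width, window, stride):
            label = classify_patch(top, left, window)  # one label for the whole patch
            for y in range(top, top + window):
                for x in range(left, left + window):
                    votes[y][x][label] += 1
    labels = [[c.most_common(1)[0][0] if c else None for c in row] for row in votes]
    totals = [[sum(c.values()) for c in row] for row in votes]
    return labels, totals


def toy_classifier(top, left, window):
    """Hypothetical classifier: labels a patch by where its left edge falls."""
    return "left" if left <= 2 else "right"


labels, totals = vote_segmentation(4, 6, window=2, stride=1, classify_patch=toy_classifier)
print(labels[0], totals[1])
```

With stride 1, an interior pixel is covered by window × window placements, which matches the N = HxW votes mentioned above; pixels on the border receive fewer votes.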
So, in short, I believe there is neither a short nor a unique answer to this question. The decision you take will depend on the goal you are trying to achieve with the CNN and, of course, the quality of your approach will have an impact on the performance of the CNN. I don't know of any study of this kind, though.
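On the specific point raised in the question, rescaling only the longer side to a fixed size preserves the aspect ratio, so the shorter side still varies from image to image; a uniform shape needs a further step such as padding or cropping. The sketch below computes the resized dimensions and the padding needed to reach a square. The function names and the target size of 224 are illustrative choices, not a recommendation from this answer.

```python
def fit_longer_side(width, height, target):
    """Scale so the longer side equals `target`, keeping the aspect ratio."""
    longer = max(width, height)
    # Integer arithmetic with round-half-up avoids floating point surprises.
    new_w = max(1, (width * target + longer // 2) // longer)
    new_h = max(1, (height * target + longer // 2) // longer)
    return new_w, new_h


def padding_to_square(new_w, new_h, target):
    """Padding (left, top, right, bottom) that centers the image in a target x target square."""
    pad_w = target - new_w
    pad_h = target - new_h
    return pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2


landscape = fit_longer_side(640, 480, 224)
portrait = fit_longer_side(300, 900, 224)
print(landscape, padding_to_square(*landscape, 224))
print(portrait, padding_to_square(*portrait, 224))
```

Cropping instead of padding would be the other way to finish the job; which one works better depends on the application, as noted above.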
[This question is now also posed at Cross Validated]
The question in short
I'm studying convolutional neural networks, and I believe that these networks do not treat every input neuron (pixel/parameter) equivalently. Imagine we have a deep network (many layers) that applies convolution on some input image. The neurons in the "middle" of the image have many unique pathways to many deeper layer neurons, which means that a small variation in the middle neurons has a strong effect on the output. However, the neurons at the edge of the image have only 1 pathway (or, depending on the exact implementation, of the order of 1) along which their information flows through the graph. It seems that these are "under-represented".
I am concerned about this, as this discrimination of edge neurons scales exponentially with the depth (number of layers) of the network. Even adding a max-pooling layer won't halt the exponential increase, only a full connection brings all neurons on equal footing. I'm not convinced that my reasoning is correct, though, so my questions are:
Am I right that this effect takes place in deep convolutional networks?
Is there any theory about this, has it ever been mentioned in literature?
Are there ways to overcome this effect?
Because I'm not sure if this gives sufficient information, I'll elaborate a bit more about the problem statement, and why I believe this is a concern.
More detailed explanation
Imagine we have a deep neural network that takes an image as input. Assume we apply a convolutional filter of 64x64 pixels over the image, where we shift the convolution window by 4 pixels each time. This means that every neuron in the input sends its activation to 16x16 = 256 neurons in layer 2. Each of these neurons might send its activation to another 256, such that our topmost neuron is represented in 256^2 output neurons, and so on. This is, however, not true for neurons on the edges: these might be represented in only a small number of convolution windows, thus causing them to activate (of the order of) only 1 neuron in the next layer. Using tricks such as mirroring along the edges won't help this: the second-layer neurons that will be projected to are still at the edges, which means that the second-layer neurons will be underrepresented (thus limiting the importance of our edge neurons as well). As can be seen, this discrepancy scales exponentially with the number of layers.
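The counting in this example can be checked with a short sketch. The image side of 200 pixels is an assumption made only for the illustration (the example does not fix one), and no padding is applied.

```python
def windows_covering(position, size, window=64, stride=4):
    """Number of 1D window placements (no padding) that include `position`."""
    count = 0
    for start in range(0, size - window + 1, stride):
        if start <= position < start + window:
            count += 1
    return count


size = 200  # assumed image side length
center = windows_covering(100, size)  # a pixel in the middle of a row
edge = windows_covering(0, size)      # the first pixel of a row
print(center, center ** 2, edge, edge ** 2)
```

Along one axis a central pixel falls into 64 / 4 = 16 window placements, so in 2D it feeds 16 × 16 = 256 second-layer neurons, while a corner pixel feeds exactly one.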
I have created an image to visualize the problem, which can be found here (I'm not allowed to include images in the post itself). This network has a convolution window of size 3. The numbers next to neurons indicate the number of pathways down to the deepest neuron. The image is reminiscent of Pascal's Triangle.
https://www.dropbox.com/s/7rbwv7z14j4h0jr/deep_conv_problem_stackxchange.png?dl=0
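The pathway counts in such a diagram can be reproduced with a small sketch. It assumes a 1D network with stride 1 and no padding that narrows down to a single deepest neuron, and it counts connections only, ignoring weights.

```python
def pathway_counts(num_layers, kernel=3):
    """Number of distinct paths from each input neuron to the single deepest neuron."""
    counts = [1]  # the deepest neuron reaches itself in exactly one way
    for _ in range(num_layers):
        below = [0] * (len(counts) + kernel - 1)
        for j, paths in enumerate(counts):
            # Neuron j reads from positions j .. j + kernel - 1 in the layer below.
            for offset in range(kernel):
                below[j + offset] += paths
        counts = below
    return counts


for depth in (1, 2, 3):
    print(depth, pathway_counts(depth))
```

With a window of size 3 the counts are the coefficients of (1 + x + x^2)^n rather than binomial coefficients, which is why the picture resembles Pascal's Triangle without being identical to it; the edge neurons always keep a count of 1 while the central count grows with depth.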
Why is this a problem?
This effect doesn't seem to be a problem at first sight: in principle, the weights should automatically adjust in such a way that the network does its job. Moreover, the edges of an image are not that important anyway in image recognition. This effect might not be noticeable in everyday image recognition tests, but it still concerns me for two reasons: 1) generalization to other applications, and 2) problems arising in the case of very deep networks.
1) There might be other applications, like speech or sound recognition, where it is not true that the middle-most neurons are the most important. Applying convolution is often done in this field, but I haven't been able to find any papers that mention the effect that I'm concerned with.
2) Very deep networks will suffer exponentially from the discrimination against boundary neurons, which means that central neurons can be overrepresented by multiple orders of magnitude (imagine we have 10 layers such that the above example would give 256^10 ways the central neurons can project their information). As one increases the number of layers, one is bound to hit a limit where weights cannot feasibly compensate for this effect. Now imagine we perturb all neurons by a small amount. The central neurons will cause the output to change more strongly by several orders of magnitude, compared to the edge neurons. I believe that for general applications, and for very deep networks, ways around my problem should be found?
I will quote your sentences and below I will write my answers.
Am I right that this effect takes place in deep convolution networks
I think you are wrong in general but right for your example of a 64 by 64 convolution filter. When you choose the filter sizes of your convolution layers, they should never be bigger than what you are looking for in your images. In other words, if your images are 200 by 200 and you convolve with 64 by 64 patches, you are saying that these 64 by 64 patches will learn some part of, or exactly, the image patch that identifies your category. The idea in the first layer is to learn edge-like, partial but important pieces of images, not the entire cat or car itself.
Is there any theory about this, has it ever been mentioned in literature? and Are there ways to overcome this effect?
I never saw it in any paper I have looked through so far. And I do not think that this would be an issue even for very deep networks.
There is no such effect. Suppose your first layer, which learned 64 by 64 patches, is in action. If a patch in the top-left corner fires (becomes active), it will show up as a 1 in the top-left corner of the next layer, hence the information will be propagated through the network.
(not quoted) You should not think of it as 'a pixel is useful in more neurons the closer it gets to the center'. Think about a 64x64 filter with a stride of 4:
if the pattern that your 64x64 filter looks for is in the top-left corner of the image, then it will be propagated to the top-left corner of the next layer; otherwise there will be nothing in the next layer.
the idea is to keep meaningful parts of the image alive while suppressing the non-meaningful, dull parts, and to combine these meaningful parts in the following layers. For the case of learning an uppercase letter "A", please look at just the figures in the very old paper of Fukushima 1980 (http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf), figures 7 and 5. Hence there is no importance of a pixel; there is importance of an image patch, which is the size of your convolution filter.
The central neurons will cause the output to change more strongly by several orders of magnitude, compared to the edge neurons. I believe that for general applications, and for very deep networks, ways around my problem should be found?
Suppose you are looking for a car in an image,
And suppose that in your 1st example the car is definitely in the 64 by 64 top-left part of your 200 by 200 image, and in your 2nd example the car is definitely in the 64 by 64 bottom-right part of your 200 by 200 image.
In the second layer all your values will be almost 0, except, for the 1st image, the one in the very top-left corner and, for the 2nd image, the one in the very bottom-right corner.
Now, the center part of the image will mean nothing to my forward and backward propagation because the values will already be 0. But the corner values will never be discarded and will affect my learned weights.
CrossPost: https://stats.stackexchange.com/questions/103960/how-sensitive-are-neural-networks
I am aware of pruning, and am not sure if it removes the actual neuron or makes its weight zero, but I am asking this question as if a pruning process were not being used.
On variously sized feedforward neural networks on large datasets with lots of noise:
Is it possible one (or some trivial amount) extra OR missing hidden neurons OR hidden layers make or break a network? Or will its synapse weights simply degrade to zero if it is not necessary and compensate with the other neurons if it is missing one or two?
When experimenting, should input neurons be added one at a time or in groups of X? What is X? Increments of 5?
Lastly, should each hidden layer contain the same number of neurons? This is usually what I see in examples. If not, how and why would you adjust their sizes, other than by relying on pure experimentation?
I would prefer to overdo it and wait longer for convergence, if larger networks will simply adapt themselves to the solution. I have tried numerous configurations, but it is still difficult to gauge an optimum one.
1) Yes, absolutely. For example, if you have too few neurons in your hidden layer your model will be too simple and have high bias. Similarly, if you have too many neurons your model will overfit and have high variance. Adding more hidden layers allows you to model very complex problems like object recognition, but there are a lot of tricks to make adding more hidden layers work; this is known as the field of deep learning.
2) In a single-layered neural network it's generally a rule of thumb to start with 2 times as many neurons as the number of inputs. You can determine the increment through binary search, i.e. run through a few different architectures and see how the accuracy changes.
3) No, definitely not: each hidden layer can contain as many neurons as you want it to contain. There is no way other than experimentation to determine their sizes; all of what you mention are hyperparameters which you must tune.
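One way to organise such an experiment is sketched below: start from the 2x-inputs rule of thumb and try sizes that halve and double around it. The `evaluate` callback is a placeholder for training a network with the given hidden size and returning its validation accuracy; the toy scoring function here is invented purely so the sketch runs.

```python
def candidate_hidden_sizes(num_inputs, steps=2):
    """Hidden-layer sizes centred on 2 * num_inputs, halving and doubling `steps` times."""
    base = 2 * num_inputs
    smaller = {max(1, base // 2 ** k) for k in range(steps + 1)}
    larger = {base * 2 ** k for k in range(1, steps + 1)}
    return sorted(smaller | larger)


def pick_hidden_size(num_inputs, evaluate, steps=2):
    """Score every candidate with `evaluate` and return (best_size, all_scores)."""
    scores = {size: evaluate(size) for size in candidate_hidden_sizes(num_inputs, steps)}
    best = max(scores, key=scores.get)
    return best, scores


def toy_validation_score(hidden_size):
    """Stand-in for 'train and validate'; pretends 40 hidden units is ideal."""
    return -abs(hidden_size - 40)


sizes = candidate_hidden_sizes(10)
best, scores = pick_hidden_size(10, toy_validation_score)
print(sizes, best)
```

In practice the scores should come from a held-out validation set, and the grid can be refined around the best candidate once the coarse sweep has narrowed the range.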
I'm not sure if you are looking for a simple answer, but maybe you will be interested in a new neural network regularization technique called dropout. Dropout basically randomly "removes" some of the neurons during training, forcing each of the neurons to be a good feature detector. It greatly reduces overfitting, and you can go ahead and set the number of neurons to be high without worrying too much. Check this paper out for more info: http://www.cs.toronto.edu/~nitish/msc_thesis.pdf
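A minimal sketch of the idea, applied to a plain list of activations. It uses the "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at test time; this is an assumption of the example, and the linked thesis may handle the scaling differently.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # no units are dropped at test time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # seeded generator so the run is repeatable
hidden = [1.0] * 1000
dropped = dropout(hidden, rate=0.5, rng=rng)
unchanged = dropout(hidden, rate=0.5, training=False)
print(sum(1 for v in dropped if v == 0.0), "of", len(dropped), "units dropped")
```

A fresh random mask is drawn on every call, so each training step effectively uses a different thinned network.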
I would like to implement a Picture Classification using Neural Network. I want to know the way to select the Features from the Picture and the number of Hidden units or Layers to go with.
For now I have an idea of changing the size of the image to 50x50 or smaller, so that the number of features is lower and all inputs have a constant size. The features would be the RGB values of each of the pixels. Will that be fine, or is there some better way?
Also, I decided to go with 1 hidden layer with half as many units as there are inputs. I can change the number to get better results. Or would I require more layers?
There are numerous image data sets that are successfully learned by neural networks, like
MNIST (here you will find many links to papers)
NORB
and CIFAR-10/100.
Note that you need many training examples. Usually one hidden layer is sufficient, but it can be hard to determine the "right" number of neurons. Sometimes the number of hidden neurons should even be greater than the number of inputs. When you use 2 or more hidden layers you will usually need fewer hidden nodes and the training will be faster. But when you have too many hidden layers it can be difficult to train the weights in the first layer.
A kind of neural network that is designed especially for images is the convolutional neural network. They usually work much better than multilayer perceptrons and are much faster.
A 50x50 image has 2500 pixels, which with RGB values gives 7500 input features. Your neural network may memorize the training images, but it will most probably perform poorly on other images.
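To get a feel for the scale, here is a rough count of the trainable parameters in the network proposed in the question: raw RGB values of a 50x50 image as inputs and one hidden layer with half as many units. The number of output classes (10) is an assumption made only for this illustration.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


num_inputs = 50 * 50 * 3   # one input per colour channel per pixel
hidden = num_inputs // 2   # "half the number of units as in inputs"
num_classes = 10           # hypothetical number of categories
total = mlp_parameter_count([num_inputs, hidden, num_classes])
print(num_inputs, hidden, total)
```

With roughly 28 million parameters, the network has far more free parameters than a typical small image dataset has examples, which is one reason memorization becomes a concern.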
Therefore this type of problem is more about image processing and feature extraction. Your features will change according to your requirements. See this similar question about image processing and neural networks.
A 1-layer network will only be suitable for linear problems; are you sure your problem is linear? Otherwise you will need a multi-layer neural network.