Feedforward Neural Network Backpropagation Algorithm Assumption - machine-learning

During backpropagation it appears to be assumed that any error created in a hidden layer only affects one layer higher (for example, see the derivation here, specifically equation 16).
That is, when calculating dE/dy_j the derivation invokes the chain rule, but it only differentiates over the nodes with indices in I_j (i.e. only over the nodes one layer above y_j). Why are higher layers ignored in this calculation? We could take the (i+1)-th layer into account as well by considering that x_{i+1} = \sum_i w_{i,i+1} f(\sum_j w_{j,i} y_j) (which clearly depends on y_j).

Higher layers aren't being ignored. In equation 16, the E in dE/dy_i is the error of the final output, so that gradient already includes the effects of all subsequent layers. That's the whole point of backpropagation: you start with the error at the output and compute the gradient with respect to the previous layer, then use that gradient to compute the gradient for the layer before that, and so on.
You could do what you are describing, but it would make for a much more complicated formulation. A convenient and efficient aspect of the backpropagation formulation is that, since you only need the error term of the subsequent layer, it doesn't matter whether you have a total of 3 layers or 4 or 50: you apply the same simple formula to each hidden layer, accumulating chain-rule terms as you work your way backward through the network.
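To make the recursion concrete, here is a minimal NumPy sketch of a backward pass through a stack of dense sigmoid layers (the layer structure and variable names are my own, not those of the linked derivation); each layer's gradients are computed purely from the error term handed down by the layer directly above it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward(weights, inputs, zs, dE_dy_out):
    """Backward pass for a stack of dense sigmoid layers.

    weights[k] : weight matrix of layer k, shape (n_out, n_in)
    inputs[k]  : input vector fed into layer k (the y of the layer below)
    zs[k]      : pre-activation vector of layer k
    dE_dy_out  : gradient of the loss w.r.t. the network's final output
    """
    grads = []
    delta = dE_dy_out                                    # dE/dy of the layer above
    for W, y_in, z in reversed(list(zip(weights, inputs, zs))):
        delta = delta * sigmoid(z) * (1.0 - sigmoid(z))  # dE/dz for this layer
        grads.append(np.outer(delta, y_in))              # dE/dW for this layer
        delta = W.T @ delta                              # dE/dy handed down to the layer below
    return list(reversed(grads))
```

The only quantity passed to each earlier layer is `delta`, which already accumulates the chain-rule terms of every layer above it; that is why the per-layer formula never has to look more than one layer up.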

Related

Interlayer scaling or normalisation between hidden layers in ANNs, CNNs and MLPs

Would anyone here know if there is any kind of normalisation or scaling between layers in existing neural network architectures?
Scaling inputs is common and I am familiar with ReLU blow-up. Most models I see indicate a small range of values, like -2 to +2, but I don't see how this can be maintained from layer to layer. Irrespective of the activation function, the second layer's output is in the tens, the third layer's is in the hundreds, and the final output is in the tens of thousands. In the worst case the layer returns NaN. A workaround can be scaling or alternating ReLU/sigmoid, but I would like to know whether this is common?
Pretty much every network uses batch normalization, which is exactly that. The paper can be found here: https://arxiv.org/abs/1502.03167. In essence, it normalizes the values to zero mean and unit variance before they are fed into the next layer. Another line of work is on self-normalizing networks built from scaled exponential linear units (SELU), which in some sense do this automatically without needing any explicit scaling. The paper can be found here: https://arxiv.org/abs/1706.02515.
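As a hedged illustration (layer sizes and names are my own placeholders, not anything from the question), inserting batch normalization between hidden layers in PyTorch looks roughly like this; the SELU variant relies on the activation itself to keep values roughly normalized:

```python
import torch.nn as nn

# MLP with batch normalization after each hidden layer, keeping activations
# roughly zero-mean / unit-variance before the next layer sees them.
bn_model = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Linear(32, 10),
)

# SELU alternative: no explicit normalization layers; the activation (with the
# initialization recommended in the SELU paper) is designed to be self-normalizing.
selu_model = nn.Sequential(
    nn.Linear(128, 64),
    nn.SELU(),
    nn.Linear(64, 32),
    nn.SELU(),
    nn.Linear(32, 10),
)
```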

Is location-dependent convolution filter possible in PyTorch or TensorFlow?

Let's say that in addition to an image, I also have a gradient running from left to right along the X axis of the image, and another gradient running from top to bottom along the Y axis. Those two gradients are the same size as the image, and both could range from -0.5 to 0.5.
Now, I'd like to make the convolution kernel (a.k.a. convolution filter, or convolution weights) depend on the (x, y) location in the gradient. So the kernel would be a function of the gradient, as if the kernel were the output of a nested mini neural net. This would make the weights of the filter different at every position, but slightly similar to those of their neighbors. How do I do that within PyTorch or TensorFlow?
Sure, I could compute a Toeplitz matrix (a.k.a. diagonal-constant matrix) myself, but the matrix multiplication would take O(n^3) operations if we assume x == y == n, whereas convolutions can normally be implemented in O(n^2). Or I could iterate over every element myself and do the multiplications in an unvectorized fashion.
Any better ideas? I'd like to see some creativity here in thinking about how this could be implemented neatly. I believe coding this would be an interesting way to build a network layer capable of doing something similar to a simplified version of a Spatial Transformer Network, but whose spatial transformation would be independent of the image.
Here is a solution I thought of for a simplified version of this problem, where a linear combination of weights is used rather than a true nested mini neural network:
It may be possible to do 4 different convolution passes so as to obtain 4 feature maps, then multiply those 4 maps with the gradients (2 vertical and 2 horizontal gradients), and add them together so that only 1 map remains. However, that is only a linear combination of the different maps, which is simpler than truly using a nested neural network that alters the kernel in the first place.
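A rough PyTorch sketch of this linear-combination idea (the module name, the 3x3 kernel, and the choice of positive/negative grad_x and grad_y as the four weighting maps are all assumptions of mine, not something from the question):

```python
import torch
import torch.nn as nn

class GradientWeightedConv(nn.Module):
    """Four convolution feature maps mixed per-pixel by four spatial ramps."""

    def __init__(self, in_channels):
        super().__init__()
        # one conv producing 4 feature maps (equivalent to 4 separate passes)
        self.conv = nn.Conv2d(in_channels, 4, kernel_size=3, padding=1)

    def forward(self, image, grad_x, grad_y):
        # grad_x, grad_y: (N, 1, H, W) ramps in [-0.5, 0.5]
        feats = self.conv(image)                        # (N, 4, H, W)
        mix = torch.cat([grad_x, -grad_x, grad_y, -grad_y], dim=1)
        # per-location linear combination of the 4 maps -> a single map
        return (feats * mix).sum(dim=1, keepdim=True)   # (N, 1, H, W)
```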
Thinking more about it, here is a solution to an equivalent question. The thing with this solution is that it flips the problem around by placing the "mini neural net" after rather than before, and in a quite different way. So it solves the problem, but offers a quite different optimization space and convergence behavior, which is less natural for me to think about than the way I formulated the problem.
In a sense, a solution to the problem could be as simple as concatenating the two gradients to a regular feature map (from a regular convolution), giving a depth of d_2 = d_1 + 2 after the concatenation, and then performing more convolutions on top of this. I won't prove why this is a valid solution to an equivalent problem, but I have thought it through and it seems provable.
The optimization space (for the weights) would be very different here, and I think it wouldn't converge with the same behavior. I'd like to know what you think about this solution in terms of optimization convergence.
The reason convolutions are more efficient than fully connected layers is that their weights are shared across locations (translation equivariance). If you wish to have convolutions that depend on location, you would need to add two extra inputs to the convolution, i.e. N+2 input channels where the x and y coordinates are the values of the two additional channels (as in e.g. CoordConv).
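For reference, a minimal CoordConv-style sketch in PyTorch (the ramp construction and layer sizes are my own, intended only to show the two extra coordinate channels):

```python
import torch
import torch.nn as nn

def add_coord_channels(x):
    """Append an x-ramp and a y-ramp (each in [-0.5, 0.5]) as extra channels."""
    n, _, h, w = x.shape
    ys = torch.linspace(-0.5, 0.5, h, device=x.device).view(1, 1, h, 1).expand(n, 1, h, w)
    xs = torch.linspace(-0.5, 0.5, w, device=x.device).view(1, 1, 1, w).expand(n, 1, h, w)
    return torch.cat([x, xs, ys], dim=1)

# A convolution that sees N + 2 input channels (image + coordinates).
conv = nn.Conv2d(in_channels=3 + 2, out_channels=16, kernel_size=3, padding=1)
out = conv(add_coord_channels(torch.randn(8, 3, 64, 64)))   # shape (8, 16, 64, 64)
```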
As for alternative solutions: is the gradient meaningful? If it is uniform across all images, it might be better to just remove it manually in the pre-processing stage (similar to orientation correction, cropping, etc.). If it is not uniform (e.g. it reflects differences in lighting or shadows), then adding more layers, under the assumption that they will learn invariance to different lighting conditions, is a common hands-off approach.

connecting hidden convolution layers

I have studied ordinary fully connected ANNs and am starting to study convnets, but I am struggling to understand how the hidden layers connect. I do understand how the input matrix forward-feeds a smaller field of values to the feature maps in the first hidden layer, by moving the local receptive field along one step each time and feeding forward through the same shared weights (one set per feature map), so there is only one group of weights per feature map, with the same structure as the local receptive field. Please correct me if I am wrong. Then, the feature maps use pooling to simplify the maps. The next part is where I get confused; here is a link to a 3D CNN visualisation to help explain my confusion:
http://scs.ryerson.ca/~aharley/vis/conv/
Draw a digit between 0-9 into the top left pad and you'll see how it works. It's really cool. On the layer after the first pooling layer (the 4th row up, containing 16 filters), if you hover your mouse over the filters you can see how the weights connect to the previous pooling layer. Try different filters on this row; what I do not understand is the rule that connects the second convolution layer to the previous pooling layer. E.g. the filters on the very left are fully connected to the pooling layer, but the ones nearer to the right only connect to about 3 of the previous pooled maps. It looks random.
I hope my explanation makes sense. I am essentially confused about what the pattern is that connects hidden pooled layers to the following hidden convolution layer. Even if my example is a bit odd, I would still appreciate some sort of explanation or link to a good explanation.
Thanks a lot.
Welcome to the magic of self-trained CNNs. It's confusing because the network makes up these rules as it trains. This is an image-processing example; most of these happen to train in a fashion that loosely parallels the learning in a simplified model of the visual cortex in vertebrates.
In general, the first layer's kernels "learn" to "recognize" very simple features of the input: lines and edges in various orientations. The next layer combines those for more complex features, perhaps a left-facing half-circle, or a particular angle orientation. The deeper you go in the model, the more complex the "decisions" get, and the kernels get more complex, and/or less recognizable.
The difference in connectivity from left to right may be an intentional sorting by the developer, or mere circumstance in the model. Some features need to "consult" only a handful of the previous layer's kernels; others need a committee of the whole. Note how simple features connect to relatively few kernels, while the final decision has each of the ten categories checking in with a large proportion of the "pixel"-level units in the last FC layer.
You might look around for some kernel visualizations for larger CNN implementations, such as those in the ILSVRC: GoogLeNet, ResNet, VGG, etc. Those have some striking kernels through the layers, including fuzzy matches to a wheel and fender, the front part of a standing mammal, various types of faces, etc.
Does that help at all?
All of this is the result of organic growth over the training period.

What does global pooling do?

I recently found the "global_pooling" flag in the Pooling layer in Caffe, but was unable to find anything about it in the documentation here (Layer Catalogue)
or here (Pooling doxygen doc).
Is there an easy, straightforward example explanation of it, in comparison to the normal pooling layer behaviour?
Global pooling reduces the dimensionality from 3D to 1D: it outputs one response for every feature map. This can be the maximum, the average, or whatever other pooling operation you use.
It is often used at the end of the convolutional backbone of a network to get a shape that works with dense layers, so no flatten operation has to be applied.
Convolutions can work on any input image size (as long as it is big enough). However, if you have a fully connected layer at the end, that layer needs a fixed input size, and hence the complete network needs a fixed image input size.
However, you can remove the fully connected layer and just work with convolutional layers. You can make a convolutional layer at the end which has the same number of filters as you have classes. But you want one value for each class, indicating the probability of that class, so you apply a pooling filter over the complete remaining feature map. This pooling is "global" in that it is always as big as necessary. In contrast, usual pooling layers have a fixed size (e.g. 2x2 or 3x3).
This is a general concept. You can also find global pooling in other libraries, e.g. Lasagne. If you want a good reference in the literature, I recommend reading Network In Network.
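A small PyTorch sketch of this pattern (layer sizes and the class count are placeholders of my own choosing), showing that the same network accepts different input sizes once the fully connected head is replaced by global average pooling:

```python
import torch
import torch.nn as nn

num_classes = 10
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, num_classes, kernel_size=3, padding=1),  # one filter per class
    nn.AdaptiveAvgPool2d(1),  # global average pooling: one value per feature map
    nn.Flatten(),             # only squeezes the 1x1 spatial dims -> (N, num_classes)
)

print(model(torch.randn(2, 3, 64, 64)).shape)    # torch.Size([2, 10])
print(model(torch.randn(2, 3, 100, 100)).shape)  # torch.Size([2, 10])
```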
We get only one value from the entire feature map when we apply a global pooling (GP) layer, whose kernel size is the h×w of the feature map. Like ordinary pooling layers, GP layers are used to reduce the spatial dimensions of a three-dimensional feature map, but they perform a more extreme type of dimensionality reduction: a feature map with dimensions h×w×d is reduced to dimensions 1×1×d. A global average pooling layer reduces each h×w feature map to a single number by simply taking the average of all h·w values.
If you are looking for information regarding Caffe flags/parameters, it is best to look them up in the comments of '$CAFFE_ROOT/src/caffe/proto/caffe.proto'.
For 'global_pooling' parameter the comment says:
// If global_pooling then it will pool over the size of the bottom by doing
// kernel_h = bottom->height and kernel_w = bottom->width
For more information about Caffe layers, see these help pages.

Convolutional neural networks: Aren't the central neurons over-represented in the output?

[This question is now also posed at Cross Validated]
The question in short
I'm studying convolutional neural networks, and I believe that these networks do not treat every input neuron (pixel/parameter) equivalently. Imagine we have a deep network (many layers) that applies convolution to some input image. The neurons in the "middle" of the image have many unique pathways to many deeper-layer neurons, which means that a small variation in the middle neurons has a strong effect on the output. However, the neurons at the edge of the image have only one pathway (or, depending on the exact implementation, on the order of one) through which their information flows through the graph. It seems that these are "under-represented".
I am concerned about this, as this discrimination against edge neurons scales exponentially with the depth (number of layers) of the network. Even adding a max-pooling layer won't halt the exponential increase; only a fully connected layer puts all neurons on an equal footing. I'm not convinced that my reasoning is correct, though, so my questions are:
Am I right that this effect takes place in deep convolutional networks?
Is there any theory about this, has it ever been mentioned in literature?
Are there ways to overcome this effect?
Because I'm not sure if this gives sufficient information, I'll elaborate a bit more about the problem statement, and why I believe this is a concern.
More detailed explanation
Imagine we have a deep neural network that takes an image as input. Assume we apply a convolutional filter of 64x64 pixels over the image, where we shift the convolution window by 4 pixels each time. This means that every neuron in the input sends its activation to 16x16 = 256 neurons in layer 2. Each of these neurons might send their activation to another 256, so that our original neuron is represented in 256^2 output neurons, and so on. This is, however, not true for the neurons on the edges: these might be represented in only a small number of convolution windows, thus causing them to activate (on the order of) only 1 neuron in the next layer. Using tricks such as mirroring along the edges won't help: the second-layer neurons that are projected to are still at the edges, which means that the second-layer neurons will be under-represented (thus limiting the importance of our edge neurons as well). As can be seen, this discrepancy scales exponentially with the number of layers.
I have created an image to visualize the problem, which can be found here (I'm not allowed to include images in the post itself). This network has a convolution window of size 3. The numbers next to neurons indicate the number of pathways down to the deepest neuron. The image is reminiscent of Pascal's Triangle.
https://www.dropbox.com/s/7rbwv7z14j4h0jr/deep_conv_problem_stackxchange.png?dl=0
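To make the counting concrete, here is a small Python sketch (my own code, using the same window size 3 and stride 1 as in the image above, with enough layers to shrink down to a single deepest neuron) that counts how many pathways each input position has to that deepest neuron:

```python
import numpy as np

def pathway_counts(input_width, kernel=3, stride=1, n_layers=5):
    """Pathways from each input position to the last layer of a 1-D stack of
    'valid' convolutions with the given window size and stride."""
    widths = [input_width]
    for _ in range(n_layers):
        widths.append((widths[-1] - kernel) // stride + 1)

    counts = np.ones(widths[-1], dtype=np.int64)        # start at the deepest layer
    # propagate path counts back down toward the input, layer by layer
    for w_below, w_above in zip(reversed(widths[:-1]), reversed(widths[1:])):
        below = np.zeros(w_below, dtype=np.int64)
        for k in range(w_above):                        # k-th window in the layer above
            below[k * stride : k * stride + kernel] += counts[k]
        counts = below
    return counts

print(pathway_counts(11))   # [ 1  5 15 30 45 51 45 30 15  5  1]
```

The edge positions end up with far fewer pathways than the central ones, and the gap widens with every extra layer, which is exactly the Pascal's-triangle-like pattern described above.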
Why is this a problem?
This effect doesn't seem to be a problem at first sight: in principle, the weights should automatically adjust in such a way that the network does its job. Moreover, the edges of an image are not that important anyway in image recognition. This effect might not be noticeable in everyday image recognition tests, but it still concerns me for two reasons: 1) generalization to other applications, and 2) problems arising in the case of very deep networks.
1) There might be other applications, like speech or sound recognition, where it is not true that the middle-most neurons are the most important. Applying convolution is often done in this field, but I haven't been able to find any papers that mention the effect that I'm concerned with.
2) Very deep networks will suffer an exponentially worse effect from the discrimination against boundary neurons, which means that central neurons can be over-represented by multiple orders of magnitude (imagine we have 10 layers, so that the above example would give 256^10 ways for the central neurons to project their information). As one increases the number of layers, one is bound to hit a limit where the weights cannot feasibly compensate for this effect. Now imagine we perturb all neurons by a small amount. The central neurons will cause the output to change more strongly, by several orders of magnitude, compared to the edge neurons. I believe that for general applications, and for very deep networks, ways around my problem should be found?
I will quote your sentences and below I will write my answers.
Am I right that this effect takes place in deep convolution networks
I think you are wrong in general, but right for your 64x64 convolution filter example. When you are choosing your convolution layer filter sizes, they would never be bigger than what you are looking for in your images. In other words, if your images are 200x200 and you convolve over 64x64 patches, you are saying that these 64x64 patches will learn some parts, or exactly that image patch, that identifies your category. The idea in the first layer is to learn edge-like, partially important features, not the entire cat or car itself.
Is there any theory about this, has it ever been mentioned in literature? and Are there ways to overcome this effect?
I have never seen it in any paper I have looked through so far, and I do not think this would be an issue even for very deep networks.
There is no such effect. Suppose your first layer, which learned 64x64 patches, is in action. If there is a patch in the top-left corner that gets fired (becomes active), then it will show up as a 1 in the top-left corner of the next layer, and hence the information will be propagated through the network.
(not quoted) You should not think of it as 'a pixel is useful to more neurons the closer it gets to the center'. Think about a 64x64 filter with a stride of 4:
If the pattern that your 64x64 filter looks for is in the top-left corner of the image, then it will be propagated to the top-left corner of the next layer; otherwise there will be nothing in the next layer.
The idea is to keep the meaningful parts of the image alive while suppressing the non-meaningful, dull parts, and to combine these meaningful parts in the following layers. For the case of learning "an uppercase letter A", please look at the images in the very old paper by Fukushima (1980) (http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf), figures 5 and 7. Hence it is not a single pixel that matters; what matters is an image patch the size of your convolution filter.
The central neurons will cause the output to change more strongly by several orders of magnitude, compared to the edge neurons. I believe that for general applications, and for very deep networks, ways around my problem should be found?
Suppose you are looking for a car in an image,
and suppose that in your first example the car is definitely in the 64x64 top-left part of your 200x200 image, while in the second example the car is definitely in the 64x64 bottom-right part of your 200x200 image.
In the second layer, almost all your pixel values will be 0: for the first image, all except the one in the very top-left corner, and for the second image, all except the one in the very bottom-right corner.
Now, the center part of the image will mean nothing to the forward and backward propagation because those values will already be 0, but the corner values will never be discarded and will affect my learned weights.

Resources