I have studied ordinary fully connected ANNs and am starting to study convnets, but I am struggling to understand how the hidden layers connect. I do understand how the input matrix feeds a smaller field of values forward to the feature maps in the first hidden layer: the local receptive field slides along one step at a time and feeds forward through the same shared weights (one set per feature map), so there is only one group of weights per feature map, with the same shape as the local receptive field. Please correct me if I am wrong. Then the feature maps use pooling to simplify themselves. The next part is where I get confused; here is a link to a 3D CNN visualisation to help explain my confusion:
http://scs.ryerson.ca/~aharley/vis/conv/
Draw a digit between 0-9 into the top-left pad and you'll see how it works. It's really cool. On the layer after the first pooling layer (the 4th row up, containing 16 filters), if you hover your mouse over the filters you can see how the weights connect to the previous pooling layer. Try different filters on this row; what I do not understand is the rule that connects the second convolution layer to the previous pooling layer. E.g. the filters at the very left are fully connected to the pooling layer, but the ones nearer the right only connect to about 3 of the previous pooled maps. It looks random.
I hope my explanation makes sense. I am essentially confused about the pattern that connects a hidden pooling layer to the following hidden convolution layer. Even if my example is a bit odd, I would still appreciate some sort of explanation or a link to a good one.
Thanks a lot.
Welcome to the magic of self-trained CNNs. It's confusing because the network makes up these rules as it trains. This is an image-processing example; most of these happen to train in a fashion that loosely parallels the learning in a simplified model of the visual cortex in vertebrates.
In general, the first layer's kernels "learn" to "recognize" very simple features of the input: lines and edges in various orientations. The next layer combines those for more complex features, perhaps a left-facing half-circle, or a particular angle orientation. The deeper you go in the model, the more complex the "decisions" get, and the kernels get more complex, and/or less recognizable.
The difference in connectivity from left to right may be an intentional sorting by the developer, or mere circumstance in the model. Some features need to "consult" only a handful of the previous layer's kernels; others need a committee of the whole. Note how simple features connect to relatively few kernels, while the final decision has each of the ten categories checking in with a large proportion of the "pixel"-level units in the last FC layer.
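To make the "consulting" idea concrete, here is a minimal NumPy sketch (my own illustration, not the code behind the linked visualisation): in a standard convolution layer, each second-layer filter carries one weight slice for every feature map of the previous layer, and a sparser connection pattern like the one you hover over can be imposed simply by zeroing out some of those slices (classic LeNet-style networks used such hand-specified connection tables). All sizes below are made up for the example.

```python
import numpy as np

# Hypothetical sizes, just for illustration.
n_prev_maps = 6      # feature maps coming out of the first pooling layer
n_new_maps = 16      # filters in the second convolution layer
k = 5                # kernel height/width

# In a standard conv layer every new filter has one k x k slice
# per previous feature map, i.e. it "consults" all of them.
weights = np.random.randn(n_new_maps, n_prev_maps, k, k)

# A sparse connection table can be emulated by masking whole
# input-map slices to zero.
connections = np.zeros((n_new_maps, n_prev_maps), dtype=bool)
connections[0, :] = True           # filter 0 consults every previous map
connections[15, [0, 2, 4]] = True  # filter 15 consults only three of them

masked_weights = weights * connections[:, :, None, None]
```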
You might look around for some kernel visualizations for larger CNN implementations, such as those in the ILSVRC: GoogleNet, ResNet, VGG, etc. Those have some striking kernels through the layers, including fuzzy matches to a wheel & fender, the front part of a standing mammal, various types of faces, etc.
Does that help at all?
All of this is the result of organic growth over the training period.
Would anyone here know if there is any kind of normalisation or scaling between layers in existing Neural Network architectures?
Scaling inputs is common and I am familiar with ReLU blow-up. Most models I see indicate a small range of values like -2 to +2, but I don't see how this can be maintained from layer to layer. Irrespective of the activation function, the second layer's output is in the tens, the third layer's is in the hundreds, and the final output is in the tens of thousands. In the worst case the layer returns NaN. A workaround can be scaling, or alternating ReLU/sigmoid, but I would like to know whether this is common?
Pretty much every network uses batch normalization, which is exactly that. The paper can be found here: https://arxiv.org/abs/1502.03167. In essence it normalizes the values to zero mean and unit variance before they are fed into the next layer. Another line of work is the scaled exponential linear unit (SELU), which makes the network self-normalizing, in some sense doing this automatically without any explicit scaling. That paper can be found here: https://arxiv.org/abs/1706.02515.
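For a feel of what batch normalization does numerically, here is a minimal NumPy sketch of the normalization step from the paper (training-mode batch statistics only; the running averages used at inference are omitted, and gamma/beta are plain scalars here):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations to zero mean, unit variance per feature,
    then apply the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)                 # per-feature mean over the batch
    var = x.var(axis=0)                   # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Activations that have drifted into the hundreds are pulled back
# to roughly the small range mentioned in the question.
layer_out = np.random.randn(32, 10) * 100 + 300   # batch of 32, 10 features
print(batch_norm(layer_out).std(axis=0))          # ~1.0 per feature
```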
I'm learning about Convolutional Neural Networks and right now I'm confused about how to implement them.
I know about regular neural networks and concepts like Gradient Descent and Back Propagation, and I can understand intuitively how CNNs work.
My question is about Back Propagation in CNNs. How does it happen? The last fully connected layers are a regular neural network, and there is no problem with that part. But how can I update the filters in the convolution layers? How can I back-propagate the error from the fully connected layers to these filters? My problem is updating the filters!
Are filters just simple matrices? Or do they have structures like regular NNs, with connections between layers providing that capability? I have read about Sparse Connectivity and Shared Weights but I can't relate them to CNNs. I'm really confused about implementing CNNs and I can't find any tutorial that talks about these concepts. I can't read papers because I'm new to these things and my math is not good.
I don't want to use TensorFlow or tools like that; I'm learning the main concepts and using pure Python.
First off, I can recommend this introduction to CNNs. Maybe you can grasp the idea of it better with this.
To answer some of your questions in short:
Let's say you want to use a CNN for image classification. The picture consists of NxM pixels and has 3 channels (RGB). To apply a convolutional layer to it, you use a filter. Filters are matrices of (usually, but not necessarily) square shape (e.g. PxP) with a number of channels equal to the number of channels of the representation they are applied to. Therefore, the first conv layer's filter also has 3 channels. Channels are, so to speak, the number of layers of the filter.
When applying a filter to a picture, you do something called discrete convolution. You take your filter (which is usually smaller than your image), slide it over the picture step by step, and calculate the convolution at each position. This is basically an element-wise multiplication of the filter with the patch underneath it, followed by a sum. Then you apply an activation function, and maybe even a pooling layer. It is important to note that the filter stays the same for all convolutions performed on this layer, so you only have PxP weights per filter channel. You tweak the filter so that it fits the training data as well as possible; that's why its parameters are called shared weights. When applying GD, you simply apply it to those filter weights.
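Since you want to stay in pure Python, here is a minimal NumPy sketch (my own illustration: single channel, no padding, stride 1) of the forward convolution and of the gradient of the loss with respect to the shared filter. Notice that the filter gradient is itself a convolution-like sum over all the positions the filter visited, which is exactly how the shared weights get updated:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a 2D image with a 2D kernel (single channel)."""
    H, W = image.shape
    P, _ = kernel.shape
    out = np.zeros((H - P + 1, W - P + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + P, j:j + P] * kernel)
    return out

def kernel_grad(image, d_out):
    """Gradient of the loss w.r.t. the shared kernel, given d_out = dL/d(conv output).
    Each kernel weight accumulates contributions from every position it was used at."""
    P = image.shape[0] - d_out.shape[0] + 1
    grad = np.zeros((P, P))
    for i in range(d_out.shape[0]):
        for j in range(d_out.shape[1]):
            grad += d_out[i, j] * image[i:i + P, j:j + P]
    return grad

# Tiny example: one gradient-descent step on the filter.
image = np.random.randn(8, 8)
kernel = np.random.randn(3, 3)
out = conv2d(image, kernel)
d_out = np.ones_like(out)            # stand-in for the gradient coming from the FC layers
kernel -= 0.01 * kernel_grad(image, d_out)
```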
Also, you can find a nice demo for the convolutions here.
Implementing these things is certainly possible, but for starting out you could try TensorFlow for experimenting. At least that's the way I learn new concepts :)
Last week, Geoffrey Hinton and his team published papers that introduced a completely new type of neural network based on capsules. But I still don't understand the architecture and mechanism of work. Can someone explain in simple words how it works?
One of the major advantages of convolutional neural networks is their invariance to translation. However, this invariance comes at a price: it does not consider how different features are related to each other. For example, if we have a picture of a face, a CNN will have difficulty distinguishing the relationship between the mouth feature and the nose feature. Max pooling layers are the main reason for this effect, because when we use max pooling we lose the precise locations of the mouth and nose, and we cannot say how they are related to each other.
Capsules try to keep the advantages of CNNs and fix this drawback in two ways:
Invariance: quoting from this paper
When the capsule is working properly, the probability of the visual entity being present is locally invariant – it does not change as the entity moves over the manifold of possible appearances within the limited domain covered by the capsule.
In other words, the capsule takes into account the existence of the specific feature that we are looking for, like a mouth or a nose. This property makes sure that capsules are translation invariant, the same way CNNs are.
Equivariance: instead of making the feature translation invariant, a capsule makes it translation-equivariant or viewpoint-equivariant. In other words, as the feature moves and changes its position in the image, the feature's vector representation also changes in the same way, which makes it equivariant. This property of capsules tries to solve the drawback of max pooling layers that I mentioned at the beginning.
Typically, the feature space of a CNN is represented as scalars, which do not capture the precise positional information of the features. Furthermore, the pooling strategy in CNNs discards important scalar features within a region while highlighting a single significant feature of that region. Capsule networks are designed to mitigate those issues by using a vector representation of the features, which can capture the positional context of the feature representation while mapping child-parent relationships effectively through the dynamic routing strategy.
In simple words, typical networks use scalar weights as connections between neurons. In a capsule net, the weights are matrices, since there are higher-dimensional capsules instead of neurons.
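As a small illustration of the "vector instead of scalar" idea, here is a NumPy sketch of the squash nonlinearity from the dynamic-routing paper: a capsule's output vector keeps its orientation (the pose information), while its length is squashed into [0, 1) so it can be read as the probability that the entity is present. The dynamic-routing loop itself is omitted here.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash a capsule's raw output vector s: keep its direction,
    map its length into [0, 1) so it acts like a presence probability."""
    sq_norm = np.sum(s ** 2)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / (np.sqrt(sq_norm) + eps)

# A long vector keeps its direction but its length approaches 1;
# a short vector is shrunk towards 0.
print(np.linalg.norm(squash(np.array([4.0, 3.0]))))   # close to 1
print(np.linalg.norm(squash(np.array([0.1, 0.0]))))   # close to 0
```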
Recently I learned about convolutional neural networks and went through some implementations of CNNs using TensorFlow. All the implementations only specify the size, the number of filters, and the strides for the filters. But when I learned about filters, it was said that the filter in each layer extracts a different feature, like edges, corners, etc.
My question is: can we explicitly specify the filters, i.e. which features they should extract, or which portion of the image is more important, etc.?
All the explanations say that we take a small part of the input image and slide it across the image, convolving as we go. If so, do we take all parts of the image and convolve across the whole image?
can we explicitly specify the filters, i.e. which features they should extract, or which portion of the image is more important, etc.?
Sure, this could be done. But the advantage of CNNs is that they learn the best features themselves (or at least very good ones; better ones than we can come up with in most cases).
One famous example is the ImageNet dataset:
In 2012 the first end-to-end learned CNN (AlexNet) was used for it. End-to-end means that the network gets the raw data as input at one end and the optimization objective at the other end.
Before CNNs, the computer vision community used manually designed features for many years. After AlexNet in 2012, nobody did so (for "typical" computer vision - there are special applications where it is still worth a shot).
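As an illustration of "this could be done", here is a small NumPy sketch (my own example, assuming SciPy is available) where the filter is not learned at all but explicitly specified as a Sobel kernel, so the layer extracts vertical edges by construction. A learned CNN simply starts from random values instead and lets gradient descent find better kernels:

```python
import numpy as np
from scipy.signal import convolve2d  # assuming SciPy is available

# An explicitly specified filter: the Sobel kernel for vertical edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# A toy "image": dark on the left half, bright on the right half.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

edges = convolve2d(image, sobel_x, mode='valid')
print(edges)   # strong response along the vertical boundary, zero elsewhere
```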
All the explanations say that we take a small part of the input image and slide it across the image, convolving as we go. If so, do we take all parts of the image and convolve across the whole image?
It is always the complete image that is convolved with a small filter. The convolution operation is local, meaning you can compute much of it in parallel, since the result of the convolution in the upper-left corner does not depend on the convolution in the lower-left corner.
I think you may be confusing filters and channels. A filter is the window of weights that slides over the input in a convolution; each filter you apply produces one channel of the convolution output. It is typically these channels that represent different features:
In this car-identification example you can see some of the earlier channels picking up things like the hood, the doors, and other borders of the car. It is hard to specify exactly which features the network extracts. If you already know of features that are important, you can feed them in as an additional mask layer or apply some type of weighting matrix to them.
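A minimal sketch of that last idea (my own illustration, not a standard API): if you believe, say, that the centre of the image matters most, you can multiply the input by a hand-made importance mask, or stack the mask onto the input as an extra channel, before the learned convolutions see it.

```python
import numpy as np

H, W = 64, 64
image = np.random.rand(H, W, 3)          # toy RGB input

# Hand-crafted spatial prior: weight pixels by closeness to the centre.
yy, xx = np.mgrid[0:H, 0:W]
dist = np.sqrt((yy - H / 2) ** 2 + (xx - W / 2) ** 2)
mask = 1.0 - dist / dist.max()           # 1 at the centre, ~0 at the corners

weighted = image * mask[:, :, None]      # option 1: reweight the input
with_mask = np.dstack([image, mask])     # option 2: append the mask as a 4th channel
```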
[This question is now also posed at Cross Validated]
The question in short
I'm studying convolutional neural networks, and I believe that these networks do not treat every input neuron (pixel/parameter) equivalently. Imagine we have a deep network (many layers) that applies convolution to some input image. The neurons in the "middle" of the image have many unique pathways to many deeper-layer neurons, which means that a small variation in the middle neurons has a strong effect on the output. However, the neurons at the edge of the image have only one pathway (or, depending on the exact implementation, on the order of one) through which their information flows through the graph. It seems that these are "under-represented".
I am concerned about this, as this discrimination of edge neurons scales exponentially with the depth (number of layers) of the network. Even adding a max-pooling layer won't halt the exponential increase, only a full connection brings all neurons on equal footing. I'm not convinced that my reasoning is correct, though, so my questions are:
Am I right that this effect takes place in deep convolutional networks?
Is there any theory about this, has it ever been mentioned in literature?
Are there ways to overcome this effect?
Because I'm not sure if this gives sufficient information, I'll elaborate a bit more about the problem statement, and why I believe this is a concern.
More detailed explanation
Imagine we have a deep neural network that takes an image as input. Assume we apply a convolutional filter of 64x64 pixels over the image, where we shift the convolution window by 4 pixels each time. This means that every neuron in the input sends its activation to up to 16x16 = 256 neurons in layer 2. Each of these neurons might send their activation to another 256, such that our original neuron is represented in 256^2 neurons two layers up, and so on. This is, however, not true for neurons on the edges: these might be represented in only a small number of convolution windows, thus causing them to activate (on the order of) only 1 neuron in the next layer. Using tricks such as mirroring along the edges won't help this: the second-layer neurons that are projected to are still at the edges, which means that the second-layer neurons will be under-represented (thus limiting the importance of our edge neurons as well). As can be seen, this discrepancy scales exponentially with the number of layers.
I have created an image to visualize the problem, which can be found here (I'm not allowed to include images in the post itself). This network has a convolution window of size 3. The numbers next to neurons indicate the number of pathways down to the deepest neuron. The image is reminiscent of Pascal's Triangle.
https://www.dropbox.com/s/7rbwv7z14j4h0jr/deep_conv_problem_stackxchange.png?dl=0
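The pathway counting in that picture is easy to reproduce. Here is a small sketch (assuming, for concreteness, a 200-pixel-wide input with the 64-wide filter and stride 4 from above) that counts, for each input position, how many convolution windows of a single layer contain it; centre positions sit in 16 windows per axis (16x16 = 256 in 2D) while corner positions sit in just 1:

```python
# Count, per input position, how many 1-D convolution windows cover it.
# Assumed numbers: input width 200, filter width 64, stride 4.
N, K, S = 200, 64, 4

coverage = [0] * N
for start in range(0, N - K + 1, S):        # every window position in this layer
    for p in range(start, start + K):
        coverage[p] += 1

print(coverage[0], coverage[N // 2], coverage[N - 1])   # prints: 1 16 1
# In 2D the counts multiply: a centre pixel is seen by 16 * 16 = 256 windows,
# a corner pixel by only 1, and the gap compounds with every extra layer.
```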
Why is this a problem?
This effect doesn't seem to be a problem at first sight: in principle, the weights should automatically adjust in such a way that the network does its job. Moreover, the edges of an image are not that important anyway in image recognition. This effect might not be noticeable in everyday image-recognition tests, but it still concerns me for two reasons: 1) generalization to other applications, and 2) problems arising in the case of very deep networks.
1) There might be other applications, like speech or sound recognition, where it is not true that the middle-most neurons are the most important. Applying convolution is often done in this field, but I haven't been able to find any papers that mention the effect that I'm concerned with.
2) Very deep networks will notice an exponentially bad effect from the discrimination of boundary neurons, which means that central neurons can be over-represented by multiple orders of magnitude (imagine we have 10 layers, such that the above example would give 256^10 ways for the central neurons to project their information). As one increases the number of layers, one is bound to hit a limit where the weights cannot feasibly compensate for this effect. Now imagine we perturb all neurons by a small amount. The central neurons will cause the output to change more strongly, by several orders of magnitude, compared to the edge neurons. I believe that for general applications, and for very deep networks, ways around my problem should be found?
I will quote your sentences and below I will write my answers.
Am I right that this effect takes place in deep convolutional networks?
I think you are wrong in general but right for your 64x64 convolution filter example. While you are structuring your convolution layers, the filter sizes would never be bigger than what you are looking for in your images. In other words, if your images are 200x200 and you convolve with 64x64 patches, you are saying that these 64x64 patches will learn some parts of, or exactly, the image patch that identifies your category. The idea in the first layer is to learn edge-like, partially important image patches, not the entire cat or car itself.
Is there any theory about this, has it ever been mentioned in literature? and Are there ways to overcome this effect?
I have never seen it in any paper I have looked through so far, and I do not think this would be an issue even for very deep networks.
There is no such effect. Suppose your first layer, which learned 64x64 patches, is in action. If a patch in the top-left corner fires (becomes active), it will show up as a 1 in the top-left corner of the next layer, hence the information will be propagated through the network.
(not quoted) You should not think of it as "a pixel becomes useful in more neurons the closer it gets to the centre". Think about a 64x64 filter with a stride of 4:
if the pattern that your 64x64 filter looks for is in the top-left corner of the image, then it will be propagated to the top-left corner of the next layer; otherwise there will be nothing in the next layer.
the idea is to keep the meaningful parts of the image alive while suppressing the non-meaningful, dull parts, and to combine these meaningful parts in the following layers. For the case of learning an uppercase letter "A", please look at figures 5 and 7 in the very old paper by Fukushima (1980) (http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf). Hence there is no importance of a single pixel; there is importance of an image patch the size of your convolution filter.
The central neurons will cause the output to change more strongly by several orders of magnitude, compared to the edge neurons. I believe that for general applications, and for very deep networks, ways around my problem should be found?
Suppose you are looking for a car in an image,
and suppose that in your first example the car is definitely in the 64x64 top-left corner of your 200x200 image, while in your second example the car is definitely in the 64x64 bottom-right corner of your 200x200 image.
In the second layer, almost all your values will be 0: for the first image, all except the one in the very top-left corner, and for the second image, all except the one in the very bottom-right corner.
Now the centre part of the image will mean nothing to my forward and backward propagation, because those values will already be 0, but the corner values will never be discarded and will affect my learned weights.