How do Geoffrey Hinton's Capsule Networks work? - machine-learning

Last week, Geoffrey Hinton and his team published papers that introduced a completely new type of neural network based on capsules. But I still don't understand the architecture or how it actually works. Can someone explain it in simple words?

One of the major advantages of convolutional neural networks is their invariance to translation. However, this invariance comes at a price: it does not consider how different features are related to each other. For example, if we have a picture of a face, a CNN will have difficulty capturing the relationship between the mouth feature and the nose feature. Max pooling layers are the main reason for this effect: when we use max pooling, we lose the precise locations of the mouth and nose, so we cannot say how they are related to each other.
Capsules try to keep the advantages of CNNs while fixing this drawback in two ways:
Invariance: quoting from this paper
When the capsule is working properly, the probability of the visual
entity being present is locally invariant – it does not change as the
entity moves over the manifold of possible appearances within the
limited domain covered by the capsule.
In other words, a capsule takes into account the existence of the specific feature that we are looking for, like a mouth or a nose. This property makes sure that capsules are translation invariant, just as CNNs are.
Equivariance: instead of making the feature representation translation invariant, a capsule makes it translation-equivariant or viewpoint-equivariant. In other words, as the feature moves and changes its position in the image, the feature's vector representation changes in the same way, which makes it equivariant. This property of capsules tries to solve the drawback of max pooling layers that I mentioned at the beginning.
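To make the invariance vs. equivariance distinction concrete, here is a minimal numpy sketch (the 1-D feature map and the "pose" readout are illustrative toys, not anything from the capsule papers):

```python
import numpy as np

def feature_map_with_blob(position, size=16):
    """A 1-D 'feature map' with a single activation blob at the given position."""
    fm = np.zeros(size)
    fm[position] = 1.0
    return fm

a = feature_map_with_blob(3)   # feature near the left
b = feature_map_with_blob(11)  # same feature, translated to the right

# Global max pooling: the response is identical for both inputs, so the
# detector is translation invariant -- but the position information is gone.
print(a.max(), b.max())            # 1.0 1.0

# A capsule-style readout keeps a vector (here: [presence, position]),
# which changes *with* the translation -- that is equivariance.
def toy_capsule(fm):
    return np.array([fm.max(), fm.argmax() / len(fm)])

print(toy_capsule(a))  # [1.0, 0.1875]
print(toy_capsule(b))  # [1.0, 0.6875]
```

Max pooling only reports that the feature exists, while the capsule-style vector also reports where it is, and that second component moves together with the input translation.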

Typically, the feature space of a CNN is represented by scalars, which do not capture the precise positional information of a feature. Furthermore, the pooling strategy in CNNs discards important scalar features within a region, keeping only a single dominant activation. Capsule networks are designed to mitigate these issues by using a vector representation of features, which can capture the positional context of a feature while mapping child-parent relationships effectively through a dynamic routing strategy.

In simple words, typical networks use scalar weights as connections between neurons. In a capsule net, the weights are matrices, because the units are higher-dimensional capsules instead of scalar neurons.
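As a rough illustration of "weights are matrices", here is a hedged numpy sketch of the prediction step from the dynamic-routing formulation, where each lower-level capsule output u_i is multiplied by a learned matrix W_ij to produce a prediction for higher-level capsule j (shapes and values are arbitrary placeholders, and routing is reduced to a single uniform averaging step):

```python
import numpy as np

rng = np.random.default_rng(0)

num_in, dim_in = 6, 8     # 6 lower-level capsules, each an 8-D vector
num_out, dim_out = 3, 16  # 3 higher-level capsules, each a 16-D vector

u = rng.normal(size=(num_in, dim_in))                    # lower-level capsule outputs
W = rng.normal(size=(num_in, num_out, dim_out, dim_in))  # one matrix per (i, j) pair

# Prediction vectors u_hat[i, j] = W[i, j] @ u[i]
u_hat = np.einsum('ijkd,id->ijk', W, u)

def squash(s, axis=-1, eps=1e-8):
    """Capsule non-linearity: keeps the direction, maps the length into (0, 1)."""
    norm2 = np.sum(s**2, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

# With uniform routing coefficients (one routing pass, no coefficient updates),
# each output capsule is just the squashed mean of its incoming predictions.
v = squash(u_hat.mean(axis=0))
print(v.shape)  # (3, 16)
```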

Related

connecting hidden convolution layers

I have studied ordinary fully connected ANNs and am starting to study convnets. I am struggling to understand how the hidden layers connect. I do understand how the input matrix forward-feeds a smaller field of values to the feature maps in the first hidden layer, by moving the local receptive field along one step at a time and forward-feeding through the same/shared weights (for each feature map), so there is only one group of weights per feature map, with the same structure as the local receptive field. Please correct me if I am wrong. Then, the feature maps use pooling to simplify the maps. The next part is where I get confused; here is a link to a 3D CNN visualisation to help explain my confusion:
http://scs.ryerson.ca/~aharley/vis/conv/
Draw a digit between 0-9 into the top left pad and you'll see how it works. It's really cool. So, on the layer after the first pooling layer (the 4th row up, containing 16 filters), if you hover your mouse over the filters you can see how the weights connect to the previous pooling layer. Try different filters on this row; what I do not understand is the rule that connects the second convolution layer to the previous pool layer. E.g. the filters to the very left are fully connected to the pooling layer, but the ones nearer to the right only connect to about 3 of the previous pooled maps. It looks random.
I hope my explanation makes sense. I am essentially confused about what the pattern is that connects hidden pooled layers to the following hidden convolution layer. Even if my example is a bit odd, I would still appreciate some sort of explanation or link to a good explanation.
Thanks a lot.
Welcome to the magic of self-trained CNNs. It's confusing because the network makes up these rules as it trains. This is an image-processing example; most of these happen to train in a fashion that loosely parallels the learning in a simplified model of the visual cortex in vertebrates.
In general, the first layer's kernels "learn" to "recognize" very simple features of the input: lines and edges in various orientations. The next layer combines those for more complex features, perhaps a left-facing half-circle, or a particular angle orientation. The deeper you go in the model, the more complex the "decisions" get, and the kernels get more complex, and/or less recognizable.
The difference in connectivity from left to right may be an intentional sorting by the developer, or mere circumstance in the model. Some features need to "consult" only a handful of the previous layer's kernels; others need a committee of the whole. Note how simple features connect to relatively few kernels, while the final decision has each of the ten categories checking in with a large proportion of the "pixel"-level units in the last FC layer.
You might look around for some kernel visualizations for larger CNN implementations, such as those in the ILSVRC: GoogleNet, ResNet, VGG, etc. Those have some striking kernels through the layers, including fuzzy matches to a wheel & fender, the front part of a standing mammal, various types of faces, etc.
Does that help at all?
All of this is the result of organic growth over the training period.
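If you want to see these learned kernels for yourself, here is a hedged sketch (assuming torchvision and matplotlib are installed) that plots the first-layer kernels of an ImageNet-pretrained ResNet-18; as described above, they come out mostly as oriented edges and color blobs, while deeper kernels are much harder to interpret directly:

```python
import matplotlib.pyplot as plt
import torchvision

# Load an ImageNet-pretrained ResNet-18 and grab its first conv layer's weights.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
kernels = model.conv1.weight.detach()          # shape: (64, 3, 7, 7)

# Normalize all kernels to [0, 1] so each one can be shown as an RGB patch.
k_min, k_max = kernels.min(), kernels.max()
kernels = (kernels - k_min) / (k_max - k_min)

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, k in zip(axes.flat, kernels):
    ax.imshow(k.permute(1, 2, 0).numpy())      # (H, W, C) layout for imshow
    ax.axis("off")
plt.suptitle("ResNet-18 first-layer kernels: mostly oriented edges and color blobs")
plt.show()
```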

Why rotation-invariant neural networks are not used in winners of the popular competitions?

As is known, the most popular modern CNNs (convolutional neural networks) - VGG/ResNet (Faster R-CNN), SSD, YOLO, YOLO v2, DenseBox, DetectNet - are not rotation invariant: Are modern CNN (convolutional neural network) as DetectNet rotate invariant?
It is also known that there are several neural networks with rotation-invariant object detection:
Rotation-Invariant Neoperceptron 2006 (PDF): https://www.researchgate.net/publication/224649475_Rotation-Invariant_Neoperceptron
Learning rotation invariant convolutional filters for texture classification 2016 (PDF): https://arxiv.org/abs/1604.06720
RIFD-CNN: Rotation-Invariant and Fisher Discriminative Convolutional Neural Networks for Object Detection 2016 (PDF): http://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Cheng_RIFD-CNN_Rotation-Invariant_and_CVPR_2016_paper.html
Encoded Invariance in Convolutional Neural Networks 2014 (PDF)
Rotation-invariant convolutional neural networks for galaxy morphology prediction (PDF): https://arxiv.org/abs/1503.07077
Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images 2016: http://ieeexplore.ieee.org/document/7560644/
We know that image-detection competitions such as ImageNet, MS COCO, and PASCAL VOC use network ensembles (several neural networks simultaneously), or ensembles within a single net such as ResNet (Residual Networks Behave Like Ensembles of Relatively Shallow Networks).
But do the winners, such as MSRA, use rotation-invariant networks in their ensembles, and if not, why not? Why would an additional rotation-invariant network in an ensemble not add accuracy for detecting certain objects, such as aircraft, whose images are taken at different angles of rotation?
It can be:
aircraft objects which are photographed from the ground
or ground objects which are photographed from the air
Why rotation-invariant neural networks are not used in winners of the popular object-detection competitions?
The recent progress in image recognition, which was mainly made by switching from the classic feature-selection / shallow-learning approach to the no-feature-selection / deep-learning approach, was not caused only by the mathematical properties of convolutional neural networks. Yes, their ability to capture the same information using a smaller number of parameters is partially due to their shift-invariance property, but recent research has shown that this is not the key to understanding their success.
In my opinion, the main reason behind this success was the development of faster learning algorithms rather than more mathematically accurate ones, which is why less attention is paid to developing networks with other invariance properties.
Of course, rotation invariance is not skipped entirely. It is partially handled by data augmentation, where you add slightly changed (e.g. rotated or rescaled) images to your dataset with the same label. As we can read in this fantastic book, these two approaches (more structure vs. less structure + data augmentation) are more or less equivalent (Chapter 5.5.3, titled: Invariances).
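For example, a minimal torchvision-style augmentation pipeline that adds small random rotations could look like the sketch below; the angle range and the dataset path are placeholder choices:

```python
import torchvision.transforms as T
from torchvision.datasets import ImageFolder

# Hypothetical training pipeline: every epoch each image is seen in a slightly
# different rotation/scale, while its label stays the same.
train_transform = T.Compose([
    T.RandomRotation(degrees=30),                 # rotate uniformly in [-30, +30] degrees
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),   # mild rescaling / cropping
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# "path/to/train" is a placeholder for an ImageFolder-style labeled dataset.
train_set = ImageFolder("path/to/train", transform=train_transform)
```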
I'm also wondering why the community and researchers haven't paid much attention to rotation-invariant CNNs, as @Alex asks.
One possible cause, in my opinion, is that many scenarios don't need this property, especially in those popular competitions. As Rob mentioned, some natural pictures are already taken in a consistent horizontal (or vertical) orientation. For example, in face detection, many works align the picture to ensure the people are upright before feeding it to any CNN model. To be honest, this is the cheapest and most efficient approach for that particular task.
However, there do exist scenarios in real life that need the rotation-invariance property. So I come to another guess: this problem is not difficult from those experts' (or researchers') point of view. At the very least, we can use data augmentation to obtain some rotation invariance.
Lastly, thanks so much for your summary of the papers. I added one more paper, Group Equivariant Convolutional Networks (ICML 2016, G-CNN), and its implementation on GitHub by other people.
Object detection is mostly driven by the successes of detection algorithms on world-famous benchmarks like PASCAL VOC and MS COCO, which are object-centric datasets where most objects are upright (potted plants, humans, horses, etc.), so data augmentation with left-right flips is often sufficient (for all we know, data augmentation with rotated images, like upside-down flips, could even hurt detection performance).
Every year the entire community adopts the base algorithmic structure of the winning solution and builds on it (I am exaggerating a bit to make a point, but not by much).
Interestingly, other less widely known topics like oriented text detection and oriented vehicle detection in aerial imagery both need rotation-invariant features and rotation-equivariant detection pipelines (as in both articles from Cheng that you mentioned).
If you want to find literature and code in this area, you need to dive into these two domains. I can already give you a few pointers, like the DOTA challenge for aerial imagery or the ICDAR challenges for oriented text detection.
As @Marcin Mozejko said, CNNs are by nature translation invariant, not rotation invariant. It is an open problem how to incorporate perfect rotation invariance; the few articles that deal with it have yet to become standards, even though some of them seem promising.
My personal favorite for detection is the modification of Faster R-CNN recently proposed by Ma.
I hope that this direction of research will be investigated more and more once people get fed up with MS COCO and VOC.
What you could try is to take a state-of-the-art detector trained on MS COCO, like Faster R-CNN with NASNet from the TF detection API, and see how it performs when you rotate the test image; in my opinion, it would be far from rotation invariant.
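Here is a hedged sketch of such a rotation-sensitivity check, using a pretrained ImageNet classifier as a simpler stand-in for the detector (the image path is a placeholder); you can watch the predicted class change as the angle grows:

```python
import torch
import torchvision
import torchvision.transforms.functional as TF
from PIL import Image

model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()

img = Image.open("test.jpg").convert("RGB")   # placeholder path to any test image

for angle in [0, 30, 90, 180]:
    rotated = TF.rotate(img, angle)
    x = TF.to_tensor(TF.resize(rotated, [224, 224]))
    x = TF.normalize(x, mean=[0.485, 0.456, 0.406],
                     std=[0.229, 0.224, 0.225]).unsqueeze(0)
    with torch.no_grad():
        pred = model(x).argmax(dim=1).item()
    print(f"rotation {angle:3d} deg -> predicted ImageNet class index {pred}")
```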
Rotation invariance is mostly a good thing, but not always. Objects can have different interpretations depending on their rotation; e.g., a rotated "1" might be difficult to distinguish from a "7".
First, let's acknowledge that introducing rotational invariance requires a static assumption about the distribution of angles. For example, another commenter on this page suggested rotating the kernel with 30-degree steps. That's equivalent to assuming that useful rotations in each layer are uniformly distributed over the rotation angles.
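For illustration, here is a hand-rolled sketch of that fixed-step idea (not taken from any of the cited papers): rotate one kernel in 30-degree steps and take the maximum filter response over orientations, which hard-codes a uniform distribution over angles.

```python
import numpy as np
from scipy.ndimage import rotate
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64))     # placeholder input image

# A simple vertical-edge kernel as the "canonical" orientation.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])

# Rotating the kernel in 30-degree steps assumes useful orientations are
# spread uniformly over the circle.
angles = range(0, 360, 30)
bank = [rotate(kernel, angle, reshape=False, order=1) for angle in angles]

responses = np.stack([convolve2d(image, k, mode="same") for k in bank])
rotation_invariant_response = responses.max(axis=0)   # max over orientations
print(rotation_invariant_response.shape)              # (64, 64)
```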
In contrast to that, when the network learns rotated kernels, the network picks a different distribution of angles for each layer. An interesting research question is to find what distribution of rotation angles is implied by learned kernels. In any case, why would such learning flexibility be useful?
I suspect that the assumption of a uniform distribution might not be equally useful across all layers of a network. In the first few convolutional layers (edges and other basic shapes), it's likely true that the rotation angles are uniformly distributed. However, in the deep layers, this assumption might be less valid. If cars are almost always rotated within a small range of angles, then why waste compute and space on unlikely rotations?
However, the network won't learn the right distribution of angles if the training dataset is not sufficiently representative. Note that simply rotating an image (called data augmentation) is not the same as rotating an object relative to other objects in the same image. I suppose it comes down to your expectation of the difference between the training dataset and the unobserved dataset to which the network has to generalize.
Interestingly, the human visual cortex is not fully rotation-invariant at the scale of major face features. See https://en.wikipedia.org/wiki/Thatcher_effect.

Keypoint recognition as classification?

At the end of the introduction to this instructive kaggle competition, they state that the methods used in "Viola and Jones' seminal paper works quite well". However, that paper describes a system for binary facial recognition, and the problem being addressed is the classification of keypoints, not entire images. I am having a hard time figuring out how, exactly, I would go about adjusting the Viola/Jones system for keypoint recognition.
I assume I should train a separate classifier for each keypoint, and some ideas I have are:
iterate over sub-images of a fixed size and classify each one, where an image with a keypoint as center pixel is a positive example. In this case I'm not sure what I would do with pixels close to the edge of the image.
instead of training binary classifiers, train classifiers with l*w possible classes (one for each pixel). The big problem with this is that I suspect it will be prohibitively slow, as every weak classifier suddenly has to do l*w*original operations
the third idea I have isn't totally hashed out in my mind, but since the keypoints are each parts of a greater part of a face (left, right center of an eye, for example), maybe I could try to classify sub-images as just an eye, and then use the left, right, and center pixels (centered in the y coordinate) of the best-fit subimage for each face-part
Is there any merit to these ideas, and are there methods I haven't thought of?
however, that paper describes a system for binary facial recognition
No, read the paper carefully. What they describe is not face specific, face detection was the motivating problem. The Viola Jones paper introduced a new strategy for binary object recognition.
You could train a Viola Jones style Cascade for eyes, another for a nose, and one for each keypoint you are interested in.
Then, when you run the code - you should (hopefully) get 2 eyes, 1 nose, etc, for each face.
Provided you get the number of items you expected, you can then say "here are the key points!" What takes more work is getting enough data to build a good detector for each thing you want to detect, and gracefully handling false positives / negatives.
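As a concrete starting point for that pipeline, OpenCV ships pretrained Viola-Jones cascades, so you can prototype the "detect each part, then read off a keypoint" idea before training your own detectors; a hedged sketch (with a placeholder image path):

```python
import cv2

# Pretrained Viola-Jones cascades bundled with OpenCV.
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

img = cv2.imread("face.jpg")                       # placeholder path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for (x, y, w, h) in face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
    roi = gray[y:y + h, x:x + w]
    for (ex, ey, ew, eh) in eye_cascade.detectMultiScale(roi):
        # Crude "keypoint": the centre of each detected eye box.
        print("eye keypoint at", (x + ex + ew // 2, y + ey + eh // 2))
```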
I ended up working on this problem extensively. I used "deep learning", i.e. several layers of neural networks; specifically, convolutional networks. You can learn more about them by checking out these demos:
http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html
http://deeplearning.net/tutorial/lenet.html#lenet
I made the following changes to a typical convolutional network:
I did not do any down-sampling, as any loss of precision directly translates to a decrease in the model's score
I did n-way binary classification, with each pixel being classified as a keypoint or non-keypoint (idea #2 in my original post). As I suspected, computational complexity was the primary barrier here. I tried to use my GPU to overcome it, but the number of parameters in the neural network was too large to fit in GPU memory, so I ended up using an XL Amazon instance for training. (A sketch of this per-pixel setup follows below.)
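Here is a minimal PyTorch sketch of that per-pixel setup (an illustrative toy, not the code in the repo linked below): a fully convolutional network with no downsampling whose output is one keypoint-vs-background logit per pixel, trained with a per-pixel binary cross-entropy loss.

```python
import torch
import torch.nn as nn

# Fully convolutional net with no pooling/striding, so the output keeps the
# input resolution and every pixel gets its own keypoint/non-keypoint logit.
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, kernel_size=1),            # 1 logit per pixel
)

images = torch.randn(8, 1, 96, 96)              # placeholder grayscale batch
targets = torch.zeros(8, 1, 96, 96)             # binary mask: 1 at keypoint pixels
targets[:, :, 48, 48] = 1.0                     # e.g. one keypoint per image

criterion = nn.BCEWithLogitsLoss()              # per-pixel binary classification
logits = model(images)
loss = criterion(logits, targets)
loss.backward()
print(logits.shape, loss.item())                # torch.Size([8, 1, 96, 96]) ...
```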
Here's a github repo with some of the work I did:
https://github.com/cowpig/deep_keypoints
Anyway, given that deep learning has blown up in popularity, there are surely people who have done this much better than I did, and published papers about it. Here's a write-up that looks pretty good:
http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/

Interpreting a Self Organizing Map

I have been reading about Self Organizing Maps, and I understand the algorithm (I think), but something still eludes me.
How do you interpret the trained network?
How would you then actually use it for, say, a classification task (once you have done the clustering with your training data)?
All of the material I seem to find(printed and digital) focuses on the training of the Algorithm. I believe I may be missing something crucial.
Regards
SOMs are mainly a dimensionality-reduction algorithm, not a classification tool. They are used for dimensionality reduction just like PCA and similar methods (once trained, you can check which neuron is activated by your input and use that neuron's position as the reduced value); the only real difference is their ability to preserve a given topology in the output representation.
So what a SOM actually produces is a mapping from your input space X to a reduced space Y (most commonly a 2-d lattice, making Y a 2-dimensional space). To perform actual classification you should transform your data through this mapping and then run some other classification model (SVM, neural network, decision tree, etc.).
In other words, SOMs are used for finding another representation of the data: a representation which is easy for humans to analyze further (as it is mostly 2-dimensional and can be plotted), and very easy for any further classification model. This is a great method for visualizing high-dimensional data, analyzing "what is going on", how some classes are grouped geometrically, etc. But they should not be confused with other neural models like feed-forward artificial neural networks or even growing neural gas (which is a very similar concept, yet gives a direct data clustering), as they serve a different purpose.
Of course, one can use SOMs directly for classification, but this is a modification of the original idea which requires a different data representation, and in general it does not work as well as using some other classifier on top of it.
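A short sketch of that "SOM for reduction, another model for classification" pipeline, assuming the third-party minisom package is available (grid size and hyperparameters are arbitrary choices):

```python
import numpy as np
from minisom import MiniSom                      # assumed third-party package
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X = (X - X.mean(axis=0)) / X.std(axis=0)         # simple standardization
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) Train the SOM purely as a dimensionality-reduction step (4-D -> 2-D grid).
som = MiniSom(7, 7, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X_tr, 1000)

# 2) Map every sample to the grid coordinates of its best-matching unit.
to_grid = lambda data: np.array([som.winner(x) for x in data])

# 3) Run an ordinary classifier on the reduced 2-D representation.
clf = KNeighborsClassifier(n_neighbors=5).fit(to_grid(X_tr), y_tr)
print("accuracy on the SOM coordinates:", clf.score(to_grid(X_te), y_te))
```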
EDIT
There are at least a few ways of visualizing a trained SOM:
one can render the SOM's neurons as points in the input space, with edges connecting the topologically close ones (this is possible only if the input space has a small number of dimensions, like 2-3)
one can display data classes on the SOM's topology - if your data is labeled with some numbers {1,...,k}, we can bind k colors to them; for the binary case let us consider blue and red. Next, for each data point we calculate its corresponding neuron in the SOM and add this label's color to that neuron. Once all data have been processed, we plot the SOM's neurons, each at its original position in the topology, with the color being some aggregate (e.g. mean) of the colors assigned to it. This approach, if we use some simple topology like a 2-d grid, gives us a nice low-dimensional representation of the data. In the following image, the subimages from the third one onward are the results of such a visualization, where red means label 1 (the "yes" answer) and blue means label 2 (the "no" answer)
one can also visualize the inter-neuron distances by calculating how far apart each pair of connected neurons is and plotting it on the SOM's map (the second subimage in the above visualization)
one can cluster the neurons' positions with some clustering algorithm (like k-means) and visualize the cluster ids as colors (first subimage)

How do I make a U-matrix?

How exactly is a U-matrix constructed in order to visualise a self-organizing map? More specifically, suppose that I have an output grid of 3x3 nodes (that have already been trained); how do I construct a U-matrix from this? You can e.g. assume that the neurons (and inputs) have dimension 4.
I have found several resources on the web, but they are not clear or they are contradictory. For example, the original paper is full of typos.
A U-matrix is a visual representation of the distances between neurons in the input-data space. Namely, you calculate the distance between adjacent neurons using their trained vectors. If your input dimension was 4, then each neuron in the trained map also corresponds to a 4-dimensional vector. Let's say you have a 3x3 hexagonal map.
The U-matrix will then be a 5x5 matrix, with an interpolated element for each connection between two adjacent neurons.
The {x,y} elements are the distance between neuron x and neuron y, and the value in an {x} element is the mean of the surrounding values. For example, {4,5} = distance(4,5) and {4} = mean({1,4}, {2,4}, {4,5}, {4,7}). For the distance calculation you use the trained 4-dimensional vector of each neuron and the distance formula that you used for training the map (usually Euclidean distance). So, the values of the U-matrix are plain numbers (not vectors). Then you can assign a light gray colour to the largest of these values, a dark gray to the smallest, and corresponding shades of gray to the values in between. You can use these colours to paint the cells of the U-matrix and get a visual representation of the distances between neurons.
Have a look at this web article as well.
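For concreteness, here is a small numpy sketch of the same construction for a rectangular (rather than hexagonal) 3x3 grid with 4-dimensional codebook vectors; the diagonal interstitial cells are simply left at zero here, whereas some implementations interpolate them as well.

```python
import numpy as np

def u_matrix(weights):
    """weights: (rows, cols, dim) codebook of a rectangular SOM."""
    rows, cols, _ = weights.shape
    U = np.zeros((2 * rows - 1, 2 * cols - 1))
    # Connection cells: distances between horizontally adjacent neurons.
    for i in range(rows):
        for j in range(cols - 1):
            U[2 * i, 2 * j + 1] = np.linalg.norm(weights[i, j] - weights[i, j + 1])
    # Connection cells: distances between vertically adjacent neurons.
    for i in range(rows - 1):
        for j in range(cols):
            U[2 * i + 1, 2 * j] = np.linalg.norm(weights[i, j] - weights[i + 1, j])
    # Neuron cells: mean of the surrounding connection cells.
    for i in range(0, 2 * rows - 1, 2):
        for j in range(0, 2 * cols - 1, 2):
            neigh = []
            if j > 0: neigh.append(U[i, j - 1])
            if j < 2 * cols - 2: neigh.append(U[i, j + 1])
            if i > 0: neigh.append(U[i - 1, j])
            if i < 2 * rows - 2: neigh.append(U[i + 1, j])
            U[i, j] = np.mean(neigh)
    return U

# Toy "trained" 3x3 map with 4-dimensional codebook vectors (random stand-in).
rng = np.random.default_rng(0)
U = u_matrix(rng.normal(size=(3, 3, 4)))
print(U.shape)          # (5, 5)
print(np.round(U, 2))   # grayscale these values to visualize the map
```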
The original paper cited in the question states:
A naive application of Kohonen's algorithm, although preserving the topology of the input data is not able to show clusters inherent in the input data.
Firstly, that is true; secondly, it reflects a deep misunderstanding of the SOM; and thirdly, it is also a misunderstanding of the purpose of calculating a SOM.
Just take the RGB color space as an example: are there 3 colors (RGB), or 6 (RGBCMY), or 8 (+BW), or more? How would you define that independently of the purpose, i.e. inherent in the data itself?
My recommendation would be not to use maximum-likelihood estimators of cluster boundaries at all - not even such primitive ones as the U-matrix - because the underlying argument is already flawed. No matter which method you then use to determine the clusters, you would inherit that flaw. More precisely, the determination of cluster boundaries is not interesting at all, and it loses information regarding the true intention of building a SOM. So, why do we build SOMs from data?
Let us start with some basics:
Any SOM is a representative model of a data space, since it reduces the dimensionality of the latter. Because it is a model, it can be used as a diagnostic as well as a predictive tool. Yet neither use is justified by some universal objectivity. Instead, models are deeply dependent on the purpose and on the accepted risk of error.
Let us assume for a moment that the U-matrix (or something similar) were reasonable, so we determine some clusters on the map. It is not only an issue of how to justify the criterion for it (outside of the purpose itself); it is also problematic because any further calculation destroys some information (it is a model of a model).
The only interesting thing about a SOM is its accuracy, viz. the classification error, not some estimation of it. Thus, the evaluation of the model in terms of validation and robustness is the only thing that is interesting.
Any prediction has a purpose and the acceptance of the prediction is a function of the accuracy, which in turn can be expressed by the classification error. Note that the classification error can be determined for 2-class models as well as for multi-class models. If you don't have a purpose, you should not do anything with your data.
Inversely, the concept of "number of clusters" is completely dependent on the criterion of "allowed divergence within clusters", so it masks the most important aspect of the structure of the data. It also depends on the risk and the risk structure (in terms of type I/II errors) you are willing to take.
So, how could we determine the number of classes on a SOM? If no exterior a-priori reasoning is available, the only feasible way is an a-posteriori check of goodness-of-fit: on a given SOM, impose different numbers of classes, measure the deviations in terms of misclassification cost, and then choose (subjectively) the most pleasing one (using some heuristic, like Occam's razor).
Taken together, the U-matrix pretends to an objectivity that cannot exist. It is a serious misunderstanding of modeling altogether.
IMHO it is one of the greatest advantages of the SOM that all of the parameters it implies are accessible and open to being parameterized. Approaches like the U-matrix destroy just that, by disregarding this transparency and closing it again with opaque statistical reasoning.
