I have developed a GCN architecture with 3 different inputs, each feeding its own block of layers. The outputs of those blocks are concatenated, and after that the architecture has a fully connected layer. In detail, the architecture consists of:
Input block (there are three of these, independent from each other): a graph convolutional layer, a tanh activation, and a global_max_pool layer
Concatenation, which concatenates the results of these three input blocks
A fully connected layer
How can I provide a mathematical description of this architecture (in functional style), like the description given on page 23 for the network shown in figure 2.1 of this text? https://uwspace.uwaterloo.ca/bitstream/handle/10012/18264/Nguyen_Duy.pdf?sequence=3&isAllowed=y
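Not having read that thesis in depth, here is one way such a functional description could be written; the symbols are my own notation rather than the thesis': X_i and A_i are the node features and adjacency of the i-th input graph, W and b are the weights and bias of the fully connected layer, and ‖ denotes concatenation.

```latex
h_i = \operatorname{maxpool}\big(\tanh\big(\operatorname{GCN}_i(X_i, A_i)\big)\big), \quad i = 1, 2, 3
\hat{y} = W \, [\, h_1 \,\|\, h_2 \,\|\, h_3 \,] + b
```

For cross-checking the equations against code, here is a minimal sketch of the same architecture in PyTorch Geometric (the framework choice, layer sizes, and class name are assumptions on my part, since the question does not specify them):

```python
import torch
from torch import nn
from torch_geometric.nn import GCNConv, global_max_pool

class ThreeBranchGCN(nn.Module):
    def __init__(self, in_dims, hidden_dim, out_dim):
        super().__init__()
        # One independent graph convolutional layer per input branch.
        self.convs = nn.ModuleList(GCNConv(d, hidden_dim) for d in in_dims)
        # Fully connected layer applied to the concatenated branch outputs.
        self.fc = nn.Linear(3 * hidden_dim, out_dim)

    def forward(self, graphs):
        # graphs: list of three (x, edge_index, batch) tuples, one per branch.
        pooled = []
        for conv, (x, edge_index, batch) in zip(self.convs, graphs):
            h = torch.tanh(conv(x, edge_index))       # graph convolution + tanh
            pooled.append(global_max_pool(h, batch))  # one vector per graph
        return self.fc(torch.cat(pooled, dim=1))      # concatenate, then dense layer
```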
I have searched a lot for the reason but still don't have a clear picture. Could someone explain it in a bit more detail, please?
In theory you do not have to attach a fully connected layer; you could have a full stack of convolutions up to the very end, as long as (due to custom sizes/paddings) you end up with the correct number of output neurons (usually the number of classes).
So why do people usually not do that? If one goes through the math, it becomes visible that each output neuron (and thus the prediction for some class) depends only on a subset of the input dimensions (pixels). This would be something along the lines of a model which decides whether an image belongs to class 1 depending only on the first few "columns" of the image (or, depending on the architecture, rows, or some patch), then whether it is class 2 based on the next few columns (maybe overlapping), ..., and finally whether it is some class K based on the last few columns. Data usually does not have this characteristic: you cannot classify an image of a cat based on its first few columns while ignoring the rest.
However, if you introduce a fully connected layer, you give your model the ability to mix signals. Since every single neuron has a connection to every single neuron in the next layer, there is now a flow of information between each input dimension (pixel location) and each output class, so the decision is truly based on the whole image.
So intuitively you can think about these operations in terms of information flow. Convolutions are local operations, and so is pooling. Fully connected layers are global (they can introduce any kind of dependence). This is also why convolutions work so well in domains like image analysis: due to their local nature they are much easier to train, even though mathematically they are just a subset of what fully connected layers can represent.
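To make the local-versus-global point concrete, here is a small sketch (my own illustration, not part of the answer above) that computes how far the receptive field of a stack of small convolutions reaches; unless the stack is deep or strided enough, an output unit simply never sees distant pixels:

```python
# Receptive-field growth of stacked convolutions, using the standard recurrence
# rf += (kernel - 1) * jump; jump *= stride. Layer sizes are illustrative only.
def receptive_field(layers):
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Three 3x3 convolutions with stride 1 see only a 7x7 patch of the input,
# so any unit they feed directly depends on at most 49 pixels.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # -> 7
```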
Note
I am considering here the typical use of CNNs, where kernels are small. In general one can even think of an MLP as a CNN where the kernel is the size of the whole input, with specific spacing/padding. However, these are just corner cases which are not really encountered in practice and do not really affect the reasoning, since they end up being MLPs anyway. The whole point here is simple: to introduce global relations. If one can do that by using CNNs in a specific manner, then MLPs are not needed; MLPs are just one way of introducing this dependence.
Every fully connected (FC) layer has an equivalent convolutional layer (but not vice versa). Hence it is not necessary to add FC layers: they can always be replaced by convolutional layers (plus reshaping).
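As a quick illustration of that equivalence (a sketch in PyTorch, which none of the posts above use, with made-up sizes): a dense layer over a flattened C x H x W tensor is reproduced exactly by a convolution whose kernel covers the whole spatial extent.

```python
import torch
from torch import nn

C, H, W, K = 8, 5, 5, 10          # made-up channel, spatial, and class sizes
x = torch.randn(1, C, H, W)

fc = nn.Linear(C * H * W, K)
conv = nn.Conv2d(C, K, kernel_size=(H, W))

# Copy the dense weights into the convolution kernel (same parameters, reshaped).
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(K, C, H, W))
    conv.bias.copy_(fc.bias)

out_fc = fc(x.flatten(1))         # shape (1, K)
out_conv = conv(x).flatten(1)     # shape (1, K) after dropping the 1x1 spatial dims
print(torch.allclose(out_fc, out_conv, atol=1e-6))  # True
```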
Why do we use FC layers then?
Because (1) we are used to it and (2) it is simpler. (1) is probably the reason for (2). For example, you would need to adjust the loss function / the shape of the labels / add a reshape at the end if you used a convolutional layer instead of an FC layer.
I found this answer by Anil-Sharma on Quora helpful.
We can divide the whole network (for classification) into two parts:
Feature extraction:
In conventional classification algorithms, like SVMs, we used to extract features from the data to make the classification work. The convolutional layers serve the same purpose of feature extraction. CNNs capture a better representation of the data, and hence we don't need to do feature engineering.
Classification:
After feature extraction we need to classify the data into various classes; this can be done using a fully connected (FC) neural network. In place of fully connected layers, we can also use a conventional classifier like an SVM. But we generally end up adding FC layers to make the model end-to-end trainable.
The CNN gives you a representation of the input image. To learn the sample classes, you should use a classifier (such as logistic regression, an SVM, etc.) that learns the relationship between the learned features and the sample classes. A fully-connected layer is also a linear classifier, like logistic regression, and it is used for this reason.
Convolution and pooling layers extract features from the image, so these layers do some "preprocessing" of the data. Fully connected layers then perform classification based on the extracted features.
I just started to learn about neural networks, and so far my knowledge of machine learning is limited to linear and logistic regression. My understanding of those algorithms is that, given multiple inputs, the job of the learning algorithm is to come up with appropriate weights for each input so that eventually I have a polynomial that either describes the data (in the case of linear regression) or separates it (in the case of logistic regression).
If I were to represent the same mechanism as a neural network, according to my understanding, it would look something like this:
multiple nodes in the input layer and a single node in the output layer, where I can back-propagate the error proportionally to each input, so that eventually I arrive at a polynomial X1W1 + X2W2 + ... + XnWn that describes the data. To me, having multiple nodes per layer (aside from the input layer) seems to make the learning process parallel, so that I can arrive at the result faster; it is almost like running multiple learning algorithms, each with a different starting point, to see which one converges faster. As for the multiple layers, I am at a loss as to what mechanism and advantage they bring to the learning outcome.
Why do we have multiple layers and multiple nodes per layer in a neural network?
We need at least one hidden layer with a non-linear activation to be able to learn non-linear functions. Usually, one thinks of each layer as an abstraction level. For computer vision, the input layer contains the image and the output layer contains one node for each class. The first hidden layer detects edges, the second hidden layer might detect circles / rectangles, then there come more complex patterns.
There is a theoretical result which says that an MLP with only one hidden layer can fit every function of interest up to an arbitrarily low error margin if this hidden layer has enough neurons. However, the number of parameters might be MUCH larger than if you add more layers.
Basically, by adding more hidden layers / more neurons per layer you add more parameters to the model. Hence you allow the model to fit more complex functions. However, to my knowledge there is no quantitative understanding of what exactly adding one further layer / node does.
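A standard illustration of why at least one non-linear hidden layer matters is XOR; the sketch below (my own minimal example in PyTorch, with arbitrary hyperparameters) learns it with a small hidden layer, whereas a single linear layer never can:

```python
import torch
from torch import nn

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])  # XOR targets

# One hidden layer with a non-linear activation; without it, XOR is not separable.
model = nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Linear(4, 1), nn.Sigmoid())
opt = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.BCELoss()

for _ in range(2000):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

print(model(X).round().squeeze())  # expected: 0, 1, 1, 0
```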
It seems to me that you might want a general introduction to neural networks. I recommend chapters 4.3 and 4.4 of [Tho14a] (my bachelor's thesis) as well as [LBH15].
[Tho14a] M. Thoma, "On-line recognition of handwritten mathematical symbols," Karlsruhe, Germany, Nov. 2014. [Online]. Available: https://arxiv.org/abs/1511.09030
[LBH15] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, May 2015. [Online]. Available: http://www.nature.com/nature/journal/v521/n7553/abs/nature14539.html
In this case I want to do letter recognition. The letter is scanned from paper, and the result of that process is a 5 x 5 binary matrix, so the network would use 25 input nodes. But I don't understand how to determine the total number of hidden-layer nodes and output nodes for this case. I want to build a multilayer perceptron architecture for it. Thanks for your help!
Every NN has three types of layers: input, hidden, and output.
Creating the NN architecture therefore means coming up with values for the number of layers of each type and the number of nodes in each of these layers.
The Input Layer
Simple--every NN has exactly one of them--no exceptions that I'm aware of.
With respect to the number of neurons comprising this layer, this parameter is completely and uniquely determined once you know the shape of your training data. Specifically, the number of neurons comprising that layer is equal to the number of features (columns) in your data. Some NN configurations add one additional node for a bias term.
The Output Layer
Like the Input layer, every NN has exactly one output layer. Determining its size (number of neurons) is simple; it is completely determined by the chosen model configuration.
Is your NN going to run in Machine Mode or Regression Mode (the ML convention of using a term that is also used in statistics but assigning a different meaning to it is very confusing)? Machine mode returns a class label (e.g., "Premium Account"/"Basic Account"); regression mode returns a value (e.g., price).
If the NN is a regressor, then the output layer has a single node.
If the NN is a classifier, then it also has a single node unless softmax is used, in which case the output layer has one node per class label in your model.
The Hidden Layers
So those few rules set the number of layers and size (neurons/layer) for both the input and output layers. That leaves the hidden layers.
How many hidden layers? Well if your data is linearly separable (which you often know by the time you begin coding a NN) then you don't need any hidden layers at all. Of course, you don't need an NN to resolve your data either, but it will still do the job.
Beyond that, as you probably know, there's a mountain of commentary on the question of hidden layer configuration in NNs (see the insanely thorough and insightful NN FAQ for an excellent summary of that commentary). One issue within this subject on which there is a consensus is the performance difference from adding additional hidden layers: the situations in which performance improves with a second (or third, etc.) hidden layer are very small. One hidden layer is sufficient for the large majority of problems.
So what about the size of the hidden layer(s)--how many neurons? There are some empirically derived rules of thumb; of these, the most commonly relied on is 'the optimal size of the hidden layer is usually between the size of the input and the size of the output layers'. Jeff Heaton, author of Introduction to Neural Networks in Java, offers a few more.
In sum, for most problems, one could probably get decent performance (even without a second optimization step) by setting the hidden layer configuration using just two rules: (i) number of hidden layers equals one; and (ii) the number of neurons in that layer is the mean of the neurons in the input and output layers.
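Applied to the 5 x 5 letter question above, and assuming the goal is to distinguish the 26 Latin letters (an assumption; the question does not say how many classes there are), those two rules give one hidden layer of roughly (25 + 26) / 2 ≈ 25 neurons. A sketch in PyTorch:

```python
from torch import nn

n_inputs = 25          # 5x5 binary pixel matrix, flattened
n_outputs = 26         # assumption: one class per Latin letter
n_hidden = (n_inputs + n_outputs) // 2   # rule of thumb: mean of input and output sizes

model = nn.Sequential(
    nn.Linear(n_inputs, n_hidden),
    nn.Sigmoid(),
    nn.Linear(n_hidden, n_outputs),   # softmax is applied inside the cross-entropy loss
)
loss_fn = nn.CrossEntropyLoss()
print(model)
```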
Optimization of the Network Configuration
Pruning describes a set of techniques to trim network size (by nodes, not layers) to improve computational performance and sometimes resolution performance. The gist of these techniques is removing nodes from the network during training by identifying those nodes which, if removed from the network, would not noticeably affect network performance (i.e., resolution of the data). (Even without using a formal pruning technique, you can get a rough idea of which nodes are not important by looking at your weight matrix after training; look for weights very close to zero--it's the nodes on either end of those weights that are often removed during pruning.) Obviously, if you use a pruning algorithm during training, then begin with a network configuration that is more likely to have excess (i.e., 'prunable') nodes--in other words, when deciding on a network architecture, err on the side of more neurons if you plan to add a pruning step.
Put another way, by applying a pruning algorithm to your network during training, you can approach an optimal network configuration; whether you can do that in a single "up-front" step (such as with a genetic-algorithm-based procedure) I don't know, though I do know that for now this two-step optimization is more common.
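As a rough sketch of that informal "look at the weight matrix" check (not a formal pruning algorithm), one can rank hidden units by the magnitude of their incoming weights after training; the layer, threshold, and sizes below are illustrative only:

```python
import torch
from torch import nn

layer = nn.Linear(25, 25)   # stand-in; a trained hidden layer would go here
with torch.no_grad():
    # L1 norm of each hidden unit's incoming weights; units with tiny norms
    # contribute little to the output and are the usual pruning candidates.
    importance = layer.weight.abs().sum(dim=1)
    candidates = (importance < 0.1 * importance.mean()).nonzero().flatten()
print(candidates)
```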
Formula
One additional rule of thumb for supervised learning networks: an upper bound on the number of hidden neurons that won't result in over-fitting is
N_h = N_s / (alpha * (N_i + N_o))
where N_i is the number of input neurons, N_o the number of output neurons, N_s the number of samples in the training data, and alpha an arbitrary scaling factor.
Others recommend setting alpha to a value between 5 and 10, but I find a value of 2 will often work without overfitting. As explained in this excellent NN Design text, you want to limit the number of free parameters in your model (its degrees of freedom, i.e. the number of nonzero weights) to a small portion of the degrees of freedom in your data. The degrees of freedom in your data is the number of samples times the degrees of freedom (dimensions) of each sample, or N_s * (N_i + N_o) (assuming they're all independent). So alpha is a way to indicate how general you want your model to be, or how much you want to prevent overfitting.
For an automated procedure you'd start with an alpha of 2 (twice as many degrees of freedom in your training data as your model) and work your way up to 10 if the error for training data is significantly smaller than for the cross-validation data set.
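A small helper that just evaluates that bound for a few alpha values (purely the arithmetic of the rule above; the sample and feature counts are made up):

```python
def max_hidden_neurons(n_samples, n_in, n_out, alpha):
    # Upper bound from the rule of thumb: N_h = N_s / (alpha * (N_i + N_o)).
    return n_samples / (alpha * (n_in + n_out))

# e.g. 1000 training samples, 25 inputs, 26 outputs (the 5x5 letter case above)
for alpha in (2, 5, 10):
    print(alpha, round(max_hidden_neurons(1000, 25, 26, alpha)))
```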
References
Advameg (2016) Comp.Ai.Neural-nets FAQ, part 1 of 7: Introduction. Available at: http://www.faqs.org/faqs/ai-faq/neural-nets/part1/preamble.html
How to choose the number of hidden layers and nodes in a feedforward neural network? (2016a) Available at: https://stats.stackexchange.com/a/136542
How to choose the number of hidden layers and nodes in a feedforward neural network? (2016b) Available at: https://stats.stackexchange.com/a/1097
Heaton, J. (2016) Introduction to Neural Networks for Java, 2nd edition. Available at: http://www.heatonresearch.com/book/programming-neural-networks-java-2.html
As I have noticed, many popular convolutional neural network architectures (e.g. AlexNet) use more than one fully connected layer, with almost the same dimensions, to gather the responses to the features detected in the earlier layers.
Why don't we use just one FC layer for that? Why might this hierarchical arrangement of fully connected layers be more useful?
Because there are some functions, such as XOR, that can't be modeled by a single layer. In this type of architecture the convolutional layers compute local features, and the fully-connected output layer(s) then combine these local features to derive the final outputs. So you can consider the fully-connected layers as a semi-independent mapping from features to outputs, and if this is a complex mapping then you may need the expressive power of multiple layers.
Actually it's no longer popular/normal. Post-2015 networks (such as ResNet and Inception v4) use global average pooling (GAP) as the last layer plus softmax, which gives the same performance with a much smaller model. The last 2 layers in VGG16 account for about 80% of all parameters in the network. But to answer your question, it is common to use a 2-layer MLP for classification and to consider the rest of the network to be feature generation. 1 layer would amount to ordinary logistic regression, with a global minimum and simple properties; 2 layers add some useful non-linearity while still training well with SGD.
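To see why the GAP head is so much smaller, compare parameter counts of a VGG-style classifier head and a GAP head (a sketch; the 512 x 7 x 7 feature-map shape and 1000 classes match VGG16 on ImageNet, the rest is illustrative):

```python
from torch import nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# VGG16-style head: flatten the 512x7x7 feature maps into two 4096-wide FC layers.
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),
)

# GAP head: average each feature map down to a single value, then one linear layer.
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(512, 1000),
)

print(n_params(fc_head), n_params(gap_head))  # roughly 124M vs 0.5M parameters
```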
I've been doing a lot of reading on Conv Nets and even some playing using Julia's Mocha.jl package (which looks a lot like Caffe, but you can play with it in the Julia REPL).
In a conv net, convolution layers are followed by "feature map" layers. What I'm wondering is how one determines how many feature maps a network needs to have to solve some particular problem. Is there any science to this, or is it more art? I can see that if you're trying to do classification, at least the last layer should have number of feature maps == number of classes (unless you've got a fully connected MLP at the top of the network, I suppose).
In my case, I'm not doing classification so much as trying to come up with a value for every pixel in an image (I suppose this could be seen as a classification where the classes are from 0 to 255).
Edit: as pointed out in the comments, I'm trying to solve a regression problem where the outputs are in a range from 0 to 255 (grayscale in this case). Still, the question remains: How does one determine how many feature maps to use at any given convolution layer? Does this differ for a regression problem vs. a classification problem?
Basically, like any other hyperparameter: by evaluating results on a separate development set and finding what number works best. It is also worth checking publications that deal with a similar problem and seeing what number of feature maps they used.
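In practice that usually means treating the per-layer channel counts as a small hyperparameter grid. A minimal sketch (the widths, the network, and the per-pixel output layer are placeholders you would replace with your own model and data):

```python
from torch import nn

def make_convnet(width):
    # Two conv blocks whose channel counts scale with `width`, then a 1x1 conv
    # producing one value per pixel (the per-pixel regression set-up above).
    return nn.Sequential(
        nn.Conv2d(1, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, 2 * width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(2 * width, 1, 1),
    )

for width in (8, 16, 32, 64):      # candidate numbers of feature maps
    model = make_convnet(width)
    n = sum(p.numel() for p in model.parameters())
    # Train `model` on the training split here and record its dev-set error;
    # keep the smallest width beyond which the dev error stops improving.
    print(width, n)
```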
More art. The only difference between ImageNet winners that use conv nets has been changing the structure of the layers and maybe some novel ways of training.
VGG is a neat example. It begins with convolutional blocks whose filter counts double from 2^6 up to 2^9 (64, 128, 256, 512), followed by fully connected layers and then an output layer which gives you your classes. Your feature maps and layer depths can be completely unrelated to the number of output classes.
You would not want a fully connected layer at the top. That kind of defeats the purpose that convolutional nets were designed to solve (overfitting and optimizing hundreds of thousands of weights per neuron)
Training on big sets will require some heavy computational resources. If you're working with ImageNet, there's a set of pre-trained models for Caffe that you could build on top of: http://caffe.berkeleyvision.org/model_zoo.html
I'm not sure if you can port these to Mocha. There's a port to TensorFlow, though, if you're interested in that: https://github.com/ethereon/caffe-tensorflow