Usage of Dense (Fully Connected) Layer in Neural Network - machine-learning

I know that a dense layer means a classic fully connected layer, where every input goes to every neuron for multiplication. But recently some questions came to my mind, and when I searched YouTube, blogs, StackOverflow, and articles, nobody gave me a satisfying answer.
1 - Why do we need fully connected (dense) layers in neural networks? What is their use? Can't we use sparse layers (meaning some inputs go only to some neurons, so not every neuron receives every input)?
2 - What will happen if we use sparse layers? I know there will be fewer computations, but what will the effect on the output be? Will the neurons be able to perform just like dense layers or not?
3 - Which is better to use in a neural network, sparse or dense layers? (Pros and cons)
4 - If we can use sparse layers and they perform well, why have I not heard this term nearly as often as FCN (fully connected layer)?
A sparse layer is not the same as a dropout layer in a neural network. With dropout you prune/drop some neurons, but the remaining neurons still receive all the outputs from the previous layer. So they are not the same.
Thank you in advance for your help.

Using sparse layers would simply introduce more design choices you would have to tweak: what would your sparse layer look like, and what connects to what? Using dense layers at least guarantees that every potentially useful connection exists and has a chance of being used.
You also answer your own question: evidently sparse layers are not better, or else you would have heard of them. Dropout, on the other hand, is useful and widely used.
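If you did want to experiment with the idea anyway, a minimal sketch of a "sparse" layer in Keras could apply a fixed binary mask to an otherwise dense kernel. The class name MaskedDense and the random mask pattern below are purely illustrative assumptions, not a standard Keras layer:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Layer

class MaskedDense(Layer):
    """Hypothetical 'sparse' layer: a dense kernel multiplied by a fixed 0/1 mask."""
    def __init__(self, units, mask, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.mask = tf.constant(mask, dtype=tf.float32)   # shape (input_dim, units)

    def build(self, input_shape):
        self.kernel = self.add_weight(shape=(input_shape[-1], self.units),
                                      initializer='glorot_uniform', trainable=True)
        self.bias = self.add_weight(shape=(self.units,),
                                    initializer='zeros', trainable=True)

    def call(self, x):
        # Connections where mask == 0 are permanently severed
        return tf.matmul(x, self.kernel * self.mask) + self.bias

# Example: 10 inputs, 4 units, each unit connected to a random half of the inputs
mask = np.random.binomial(1, 0.5, size=(10, 4)).astype('float32')
layer = MaskedDense(4, mask)
print(layer(tf.ones((1, 10))).shape)   # (1, 4)

Note how even this toy version forces you to decide on the connectivity pattern up front, which is exactly the extra design burden mentioned above.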

Related

How do I use a single-neuron layer within the layers of a multilayer network?

Is it conceptually wrong to put a layer of multiple neurons after a single-neuron layer? If so, how do I use this single-neuron layer within the layers of a multilayer network?
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense

model = Sequential()
model.add(Input(shape=(10,)))
model.add(Dense(1, activation='relu'))
model.add(Dense(5, activation='relu'))
Do I have to use a special layer? How?
In my application, the single neuron layer is a sum layer, which is as follows:
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer

class sumLayer(Layer):
    def __init__(self, **kwargs):
        super(sumLayer, self).__init__(**kwargs)
    def call(self, x_inputs):
        # Sum over the feature (last) axis, then reshape to (batch_size, 1)
        xc = K.sum(x_inputs, axis=-1, keepdims=False)
        return tf.reshape(xc, (tf.shape(x_inputs)[0], 1))
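For reference, a minimal and purely illustrative way to wire such a layer into the Sequential model above might look like this (the hidden width of 8 is just a placeholder choice):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense

model = Sequential()
model.add(Input(shape=(10,)))
model.add(Dense(8, activation='relu'))
model.add(sumLayer())                 # collapses the representation to one number
model.add(Dense(5, activation='relu'))
model.summary()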
It is definitely nonstandard to have any point in a deep neural network where the representation is collapsed to a single number. This creates an extreme information bottleneck, and thus, while in theory every complex decision can still be encoded, it becomes extremely hard for the following layers to learn to reason about this space if they need to react in a way that is not a very simple "thresholding" of your signal on the line. So, is it conceptually "wrong"? No, it does not really affect representational power. Is it risky? Yes, it is something that experienced practitioners would never do unless there is a well-understood reason to go this way.
To gain some intuition: one of the core reasons for the effectiveness of neural networks is that they operate in very high-dimensional spaces, where their simple, affine transformations (neurons) can achieve a surprising degree of data separation; more importantly, they can be trained using extremely naive optimisation methods (gradient descent). These properties are largely gone once dimensionality is reduced. In particular, with an extreme representational bottleneck you are much more likely to be affected by the local minima that the early neural network community struggled with (an issue that somewhat "magically" went away as scale and dimensionality went up; there is a whole field of research, e.g. Neural Tangent Kernels, that provides the mathematical foundations/understanding of why this is the case in high dimensions).

Why do we use a fully-connected layer at the end of a CNN?

I have searched for the reason a lot but haven't found a clear answer. Could someone explain it in some more detail, please?
In theory you do not have to attach a fully connected layer; you could have a full stack of convolutions until the very end, as long as (due to custom sizes/paddings) you end up with the correct number of output neurons (usually the number of classes).
So why do people usually not do that? If one goes through the math, it becomes visible that each output neuron (thus, the prediction with respect to some class) depends only on a subset of the input dimensions (pixels). This would be something along the lines of a model which decides whether an image belongs to class 1 based only on the first few "columns" (or, depending on the architecture, rows, or some patch of the image), then whether it belongs to class 2 based on the next few columns (maybe overlapping), ..., and finally some class K based on the last few columns. Data usually does not have this characteristic: you cannot classify an image of a cat based on the first few columns while ignoring the rest.
However, if you introduce a fully connected layer, you give your model the ability to mix signals: since every single neuron has a connection to every single neuron in the next layer, there is now a flow of information between each input dimension (pixel location) and each output class, so the decision is truly based on the whole image.
So intuitively you can think about these operations in terms of information flow. Convolutions are local operations, and pooling is a local operation. Fully connected layers are global (they can introduce any kind of dependence). This is also why convolutions work so well in domains like image analysis: due to their local nature they are much easier to train, even though mathematically they are just a subset of what fully connected layers can represent.
Note:
I am considering here the typical use of CNNs, where kernels are small. In general one can even think of an MLP as a CNN where the kernel is the size of the whole input, with specific spacing/padding. However, these are just corner cases which are not really encountered in practice and do not really affect the reasoning, since then they end up being MLPs. The whole point here is simple: to introduce global relations. If one can do it by using CNNs in a specific manner, then MLPs are not needed; MLPs are just one way of introducing this dependence.
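To make this concrete, here is a hedged sketch of a classifier that ends up with class scores using only convolutions plus one global pooling step, with no Dense layer; the layer widths, input shape, and class count are arbitrary placeholders:

import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 10   # placeholder

model = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(NUM_CLASSES, 1),        # one score map per class
    layers.GlobalAveragePooling2D(),      # the global step supplies the "mixing"
    layers.Activation('softmax'),
])
model.summary()

Here the global averaging, rather than a fully connected layer, is what lets every pixel influence every class score.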
Every fully connected (FC) layer has an equivalent convolutional layer (but not vice versa). Hence it is not necessary to add FC layers; they can always be replaced by convolutional layers (plus reshaping).
Why do we use FC layers then?
Because (1) we are used to it and (2) it is simpler. (1) is probably the reason for (2). For example, if you used a convolutional layer instead of an FC layer, you would need to adjust the loss function / the shape of the labels / add a reshape at the end.
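As a quick, hedged numerical check of the "every FC layer has an equivalent convolutional layer" claim (the 4x4x8 feature-map size and 10 classes are arbitrary choices for illustration):

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

H, W, C, N_CLASSES = 4, 4, 8, 10   # placeholder feature-map size and class count

# FC head: flatten the feature map, then a Dense layer
fc_head = tf.keras.Sequential([
    layers.Flatten(input_shape=(H, W, C)),
    layers.Dense(N_CLASSES),
])

# Convolutional equivalent: one kernel per class, spanning the whole feature map
conv_head = tf.keras.Sequential([
    layers.Conv2D(N_CLASSES, kernel_size=(H, W), input_shape=(H, W, C)),
    layers.Reshape((N_CLASSES,)),
])

# Same parameters, just reshaped: copy the Dense weights into the Conv2D kernel
w, b = fc_head.layers[-1].get_weights()                    # w: (H*W*C, N_CLASSES)
conv_head.layers[0].set_weights([w.reshape(H, W, C, N_CLASSES), b])

x = np.random.rand(2, H, W, C).astype('float32')
print(np.allclose(fc_head.predict(x), conv_head.predict(x), atol=1e-5))  # True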
I found this answer by Anil-Sharma on Quora helpful.
We can divide the whole network (for classification) into two parts:
Feature extraction:
In conventional classification algorithms, like SVMs, we used to extract features from the data to make the classification work. Convolutional layers serve the same purpose of feature extraction. CNNs capture a better representation of the data, and hence we don't need to do feature engineering.
Classification:
After feature extraction we need to classify the data into various classes; this can be done using a fully connected (FC) neural network. In place of fully connected layers, we can also use a conventional classifier like an SVM, but we generally end up adding FC layers to make the model end-to-end trainable.
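A rough sketch of that two-part structure in Keras (the input shape, layer widths, and class count are placeholders):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    # Part 1: feature extraction (convolution + pooling layers)
    layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    # Part 2: classification (fully connected head on the extracted features)
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax'),
])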
The CNN gives you a representation of the input image. To learn the sample classes, you should use a classifier (such as logistic regression, an SVM, etc.) that learns the relationship between the learned features and the sample classes. A fully-connected layer is also a linear classifier, like logistic regression, and it is used for this reason.
Convolution and pooling layers extract features from the image, so these layers do some "preprocessing" of the data. Fully connected layers then perform classification based on these extracted features.

Convolutional neural networks vs downsampling?

After reading up on the subject I don't fully understand: Is the 'convolution' in neural networks comparable to a simple downsampling or 'sharpening' function?
Can you break this term down into a simple, understandable image/analogy?
Edit (rephrasing after the 1st answer): can pooling be understood as downsampling of weight matrices?
A convolutional neural network is a family of models that have been empirically shown to work great when it comes to image recognition. From this point of view, a CNN is something completely different from downsampling.
But within the framework used in CNN design there is something that is comparable to a downsampling technique. To fully understand that, you have to understand how a CNN usually works. It is built from a hierarchy of layers, and at every layer you have a set of trainable kernels whose output has dimensions very similar to the spatial size of your input images.
This might be a serious problem: the output from such a layer can be extremely large (~ nr_of_kernels * size_of_kernel_output), which could make your computations intractable. This is the reason why certain techniques are used to decrease the size of the output:
Stride, padding and kernel size manipulation: by setting these values appropriately you can decrease the size of the output (on the other hand, you may lose some important information).
Pooling operation: pooling is an operation in which, instead of passing on all outputs from all kernels, you pass only specific aggregated statistics about them. It is considered extremely useful and is widely used in CNN design (see the shape check below).
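A small shape check (the sizes here are arbitrary) of how stride and pooling shrink a layer's output:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 64, 64, 3))                           # one 64x64 RGB image

conv    = layers.Conv2D(16, 3, padding='same')(x)              # (1, 64, 64, 16)
strided = layers.Conv2D(16, 3, strides=2, padding='same')(x)   # (1, 32, 32, 16)
pooled  = layers.MaxPooling2D(pool_size=2)(conv)               # (1, 32, 32, 16)

print(conv.shape, strided.shape, pooled.shape)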
For a detailed description you might visit this tutorial.
Edit: Yes, pooling is a kind of downsampling 😊

Why is there only one hidden layer in a neural network?

I recently made my first neural network simulation which also uses a genetic evolution algorithm. It's simple software that just simulates simple organisms collecting food, and they evolve, as one would expect, from organisms with random and sporadic movements into organisms with controlled, food-seeking movements. Since this kind of organism is so simple, I only used a few hidden layer neurons and a few input and output neurons. I understand that more complex neural networks could be made by simply adding more neurons, but can't you add more layers? Or would this create some kind of redundancy? All of the pictures of diagrams of neural networks, such as this one http://mechanicalforex.com/wp-content/uploads/2011/06/NN.png, always have one input layer, one hidden layer, and one output layer. Couldn't a more complex neural network be made if you just added a bunch of hidden layers? Of course this would make processing the neural network harder, but would it create any sort of advantage, or would it be just the same as adding more neurons to a single layer?
You can include as many hidden layers as you want, starting from zero (that case is called a perceptron).
In principle, however, the ability to represent unknown functions does not increase. Single-hidden-layer neural networks already possess a universal representation property: by increasing the number of hidden neurons, they can fit (almost) arbitrary functions. You can't get more than this, and particularly not by adding more layers.
However, that doesn't mean that multi-hidden-layer ANNs can't be useful in practice. Yet, since every extra layer adds another dimension to your parameter set, people usually stick with the single-hidden-layer version.
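In Keras terms, going from one hidden layer to several is just a matter of stacking more Dense layers; the widths and input size below are arbitrary placeholders:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# One hidden layer
shallow = Sequential([
    Dense(16, activation='relu', input_shape=(4,)),
    Dense(2, activation='softmax'),
])

# Several hidden layers: the same idea, with more Dense layers stacked
deep = Sequential([
    Dense(16, activation='relu', input_shape=(4,)),
    Dense(16, activation='relu'),
    Dense(16, activation='relu'),
    Dense(2, activation='softmax'),
])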

How to determine the number of feature maps to use in a convolutional neural network layer?

I've been doing a lot of reading on conv nets and even some playing around with Julia's Mocha.jl package (which looks a lot like Caffe, but you can play with it in the Julia REPL).
In a conv net, convolution layers are followed by "feature map" layers. What I'm wondering is: how does one determine how many feature maps a network needs to solve a particular problem? Is there any science to this, or is it more art? I can see that if you're doing classification, at least the last layer should have a number of feature maps equal to the number of classes (unless you've got a fully connected MLP at the top of the network, I suppose).
In my case, I'm not doing classification so much as trying to come up with a value for every pixel in an image (I suppose this could be seen as a classification where the classes are from 0 to 255).
Edit: as pointed out in the comments, I'm trying to solve a regression problem where the outputs are in a range from 0 to 255 (grayscale in this case). Still, the question remains: How does one determine how many feature maps to use at any given convolution layer? Does this differ for a regression problem vs. a classification problem?
Basically, like any other hyperparameter: by evaluating results on a separate development set and finding what number works best. It is also worth checking publications that deal with a similar problem and seeing what number of feature maps they used.
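Concretely, a hedged sketch of treating the number of feature maps as a hyperparameter; the candidate values, the small model, and the random dummy data standing in for a real train/dev split are all placeholders:

import numpy as np
from tensorflow.keras import layers, Sequential

# Dummy data standing in for a real train/dev split (placeholder)
x_train = np.random.rand(256, 32, 32, 1)
y_train = np.random.rand(256, 1)
x_dev = np.random.rand(64, 32, 32, 1)
y_dev = np.random.rand(64, 1)

def build_model(n_maps):
    # n_maps controls how many feature maps the first convolution produces
    return Sequential([
        layers.Conv2D(n_maps, 3, activation='relu', input_shape=(32, 32, 1)),
        layers.MaxPooling2D(),
        layers.Conv2D(2 * n_maps, 3, activation='relu'),
        layers.Flatten(),
        layers.Dense(1),    # simplified regression output
    ])

best = None
for n_maps in [8, 16, 32, 64]:          # candidate numbers of feature maps
    model = build_model(n_maps)
    model.compile(optimizer='adam', loss='mse')
    history = model.fit(x_train, y_train, validation_data=(x_dev, y_dev),
                        epochs=3, verbose=0)
    score = min(history.history['val_loss'])
    if best is None or score < best[1]:
        best = (n_maps, score)
print('best number of feature maps on the dev set:', best[0])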
More art. The only difference between ImageNet winners that use conv nets has been changes in the structure of the layers and maybe some novel ways of training.
VGG is a neat example. It begins with blocks whose numbers of filters grow through 2^7, then 2^8, then 2^9, followed by fully connected layers and then an output layer which gives you your classes. Your feature maps and layer depths can be completely unrelated to the number of output classes.
You would not want a fully connected layer at the top. That somewhat defeats the purpose, since convolutional nets were designed to avoid exactly those problems (overfitting and optimizing hundreds of thousands of weights per neuron).
Training on big sets will require some heavy computational resources. If you're working with ImageNet, there's a set of pre-trained Caffe models that you could build on top of: http://caffe.berkeleyvision.org/model_zoo.html
I'm not sure if you can port these to Mocha. There's a port to TensorFlow, though, if you're interested in that: https://github.com/ethereon/caffe-tensorflow
