Usage of Fully Connected layer - machine-learning

Regarding fully connected layers, I read that each neuron in the previous layer is connected to every neuron in the next layer. But how does that help?
In general, what's the goal of a fully connected layer?
Thank you!

If every neuron in a layer is connected to every neuron in the previous and next layers (where they exist), then during learning (changing the weights) the neural network is free to work with any of the weights. If a connection doesn't exist, the learning process cannot use it. Much the same happens when a human is learning: new connections are built between neurons. It's better to have more connections. It's about the possibility to learn, because the weights between these neurons can then be modified, even down to zero.
But there are also neural nets whose layers are not fully connected, and these have important uses in solving different kinds of problems.
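To make the point about connections and weights concrete, here is a minimal sketch in plain Python (no libraries). The function name `dense_forward` and the `mask` argument are made up for this illustration and are not from any particular framework. It shows that a fully connected layer whose weight has been trained to zero behaves exactly like a layer where that connection was never there, while the reverse is not true: a missing connection can never be given a non-zero weight.

```python
def dense_forward(inputs, weights, biases, mask=None):
    """Compute the pre-activation outputs of one layer.

    weights[j][i] is the weight from input neuron i to output neuron j.
    mask[j][i] == 0 marks a connection that does not exist (hypothetical
    helper argument, used only for this illustration).
    """
    outputs = []
    for j, row in enumerate(weights):
        total = biases[j]
        for i, w in enumerate(row):
            if mask is not None and mask[j][i] == 0:
                continue  # missing connection: this weight can never contribute
            total += w * inputs[i]
        outputs.append(total)
    return outputs


inputs = [1.0, 2.0, 3.0]
weights = [[0.5, -1.0, 0.25],
           [1.0, 0.0, 2.0]]
biases = [0.0, 1.0]

# Fully connected: every input feeds every output.
full = dense_forward(inputs, weights, biases)

# Partially connected: drop input 2 -> output 0 and input 0 -> output 1.
mask = [[1, 1, 0],
        [0, 1, 1]]
partial = dense_forward(inputs, weights, biases, mask)

# A fully connected layer can imitate the partial one by setting
# the corresponding weights to zero.
zeroed = [[w * m for w, m in zip(w_row, m_row)]
          for w_row, m_row in zip(weights, mask)]
same_as_partial = dense_forward(inputs, zeroed, biases)

print(full)             # [-0.75, 8.0]
print(partial)          # [-1.5, 7.0]
print(same_as_partial)  # [-1.5, 7.0]
```

So the fully connected layer is the more general case; whether that extra freedom helps depends on the problem and on how much training data is available.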

Related

Is it possible to train a new network based on an old one without data?

Is it possible to train a new, smaller network based on an already trained network without data? I.e. the new network should just try to mimic the behaviour of the first one.
If it's not possible without data, are there any benefits to having an already trained network? As I understand it, we can at least use it for pseudo-labeling.
Update:
The most relevant paper I have found:
https://arxiv.org/pdf/1609.02943.pdf
I don't think you can say that you are training a network if you are not using any data. But you can always try to get a smaller one, for example by pruning the large network (in the simplest case, this means removing weights whose L2 norm is close to zero); there is a rich literature on the subject. Also, I think you might find some work on knowledge distillation useful, e.g. Data-Free Knowledge Distillation for Deep Neural Networks.
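As a rough illustration of the simplest pruning case mentioned above, here is a sketch in plain Python. The names `prune_by_magnitude` and `sparsity`, the example weights, and the threshold of 0.05 are all made up for the example. Note that this step itself needs no data, although in practice a pruned network is usually fine-tuned afterwards, which does need data.

```python
def prune_by_magnitude(weights, threshold):
    """Return a copy of `weights` with small-magnitude entries set to zero."""
    return [[w if abs(w) >= threshold else 0.0 for w in row]
            for row in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


# Hypothetical weights of one trained layer.
trained = [[0.80, -0.02, 0.001],
           [-0.50, 0.03, 1.20]]

pruned = prune_by_magnitude(trained, threshold=0.05)

print(pruned)            # [[0.8, 0.0, 0.0], [-0.5, 0.0, 1.2]]
print(sparsity(pruned))  # 0.5
```

Zeroed weights only save memory or compute if the storage format or hardware can exploit the sparsity; removing whole neurons or filters is the structured variant of the same idea.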

Partially connected convolution layers in CNN

I think convolution layers should be fully connected (see this and this). That is, each feature map should be connected to all feature maps in the previous layer. However, when I looked at this CNN visualization, the second convolution layer is not fully connected to the first. Specifically, each feature map in the second layer is connected to 3~6 (all) feature maps in the first layer, and I don't see any pattern in it. The questions are
Is it canonical/standard to fully connect convolution layers?
What's the rationale for the partial connections in the visualization?
Am I missing something here?
Neural networks have the remarkable property that knowledge is not stored anywhere specifically, but in a distributed sense. If you take a working network, you can often cut out large parts and still get a network that works approximately the same.
A related effect is that the exact layout is not very critical. ReLU and sigmoid (tanh) activation functions are mathematically very different, but both work quite well. Similarly, the exact number of nodes in a layer doesn't really matter.
Fundamentally, this relates to the fact that in training you optimize all weights to minimize your error function, or at least find a local minimum. As long as there are sufficient weights and those are sufficiently independent, you can optimize the error function.
There is another effect to take into account, though. With too many weights and not enough training data, you cannot optimize the network well. Regularization only helps so much. A key insight in CNNs is that they have fewer weights than a fully connected network, because nodes in a CNN are connected only to a small local neighborhood of nodes in the prior layer.
So, this particular CNN has even fewer connections than a CNN in which all feature maps are connected, and therefore fewer weights. That allows you to have more and/or bigger maps for a given amount of data. Is that the best solution? Perhaps; choosing the best layout is still a bit of a black art. But it's not a priori unreasonable.
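To put rough numbers on the weight savings, here is a small counting sketch in plain Python. Everything specific in it is an assumption for illustration: 6 input maps of 14x14, 16 output maps of 10x10, 5x5 kernels, and a LeNet-5-style partial scheme in which the output maps see 3, 4, or all 6 input maps. The question does not say which network the visualization shows, so treat the layout as hypothetical.

```python
def conv_params(map_connections, kernel_h, kernel_w, out_maps):
    """Trainable parameters of a convolution layer.

    One kernel per (input map, output map) connection, plus one bias
    per output map.
    """
    return map_connections * kernel_h * kernel_w + out_maps


in_maps, out_maps, k = 6, 16, 5

# Every output map connected to every input map.
full_connections = in_maps * out_maps
full_conv = conv_params(full_connections, k, k, out_maps)

# Assumed partial scheme: how many input maps feed each output map.
inputs_per_output = [3] * 6 + [4] * 9 + [6] * 1
partial_connections = sum(inputs_per_output)
partial_conv = conv_params(partial_connections, k, k, out_maps)

# For comparison: a dense (fully connected, untied) layer between the
# same two sets of activations, with assumed map sizes.
in_units = in_maps * 14 * 14
out_units = out_maps * 10 * 10
dense = in_units * out_units + out_units

print(full_conv)     # 2416
print(partial_conv)  # 1516
print(dense)         # 1883200
```

Under these assumptions the partial scheme drops about a third of the convolution weights, and both convolution variants are orders of magnitude smaller than the dense layer.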

Adjusting hyperparameters of neural network used for (offline) handwriting recognition

I'm currently using a Java library to do some naive experimentation with offline handwriting recognition. I give my program an image of a pre-written English sentence and segment it into individual characters, which I then feed to a very naively constructed neural network.
I'm new to the idea of neural nets, so my question is where to start with regard to optimising this network's hyperparameters. Currently it's a simple feed-forward network which I train using resilient propagation, so the only parameters I can optimise are the number of hidden layers and the number of neurons in each hidden layer. I could of course do an exhaustive search through a large but finite number of combinations, but this would be very time-consuming, and I'm sure someone out there who is more informed in this art must be able to point me in the right direction.
I found a post somewhere on here that stated a good place to start for any network in general was to use only one hidden layer with number of neurons equal to the mean number of neurons in your input and output layer, so that's what I'm doing at the moment.
I'm getting performance of about 40-60% (depending on character) accuracy with this model.
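For reference, the starting rule described above (one hidden layer sized at the mean of the input and output layers) can be written down as a short sketch in plain Python. The function names, the 16x16 input size, the 26 output classes, and the scaling factors are assumptions made for the example, not values from the question. The second function is one possible way to try a handful of sizes around the starting point instead of an exhaustive search.

```python
def mean_rule_hidden_size(n_inputs, n_outputs):
    """Hidden layer size as the mean of the input and output sizes."""
    return (n_inputs + n_outputs) // 2


def candidate_hidden_sizes(n_inputs, n_outputs, factors=(0.5, 1.0, 2.0)):
    """A few hidden layer sizes around the mean-rule starting point."""
    base = mean_rule_hidden_size(n_inputs, n_outputs)
    return sorted({max(1, int(base * f)) for f in factors})


# Assumed setup: 16x16 pixel character images, 26 letter classes.
n_inputs = 16 * 16
n_outputs = 26

print(mean_rule_hidden_size(n_inputs, n_outputs))   # 141
print(candidate_hidden_sizes(n_inputs, n_outputs))  # [70, 141, 282]
```

Each candidate would then be trained and compared on held-out characters, not on the training set.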

How to implement convolutional connections without tied weights?

Suppose two layers of a neural network each have a 2D representation, i.e. fields of activation. I'd like to connect each neuron of the lower layer to the nearby neurons of the upper layer, say within a certain radius. Is this possible with TensorFlow?
This is similar to a convolution, but the weight kernels should not be tied. I'm trying to avoid connecting both layers fully first and masking out most of the parameters, in order to keep the number of parameters low.
I don't see a simple way to do this with existing TensorFlow ops efficiently, but there might be some tricks with sparse things. However, ops for efficient locally connected, non-convolutional neural net layers would be very useful, so you might want to file a feature request as a GitHub issue.
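To pin down what such a layer computes, here is a minimal sketch in plain Python, not TensorFlow, and not an efficient implementation. The function names are made up, the neighborhood is assumed to be a square window of side 2 * radius + 1 (so "radius" is measured per axis), and both layers are assumed to have the same height and width. Every output position has its own private window of weights, so nothing is tied, and only local weights are ever stored.

```python
import random


def init_local_weights(height, width, radius, seed=0):
    """One independent (2r+1) x (2r+1) weight window per output position."""
    rng = random.Random(seed)
    k = 2 * radius + 1
    return [[[[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(k)]
             for _ in range(width)] for _ in range(height)]


def locally_connected_forward(x, weights, radius):
    """Each output neuron sums its own weighted window of the input field."""
    height, width = len(x), len(x[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            total = 0.0
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ii, jj = i + di, j + dj
                    # Neighbors outside the field are skipped (no padding).
                    if 0 <= ii < height and 0 <= jj < width:
                        total += weights[i][j][di + radius][dj + radius] * x[ii][jj]
            out[i][j] = total
    return out


def count_parameters(height, width, radius):
    """Stored weights, including unused border entries of edge windows."""
    return height * width * (2 * radius + 1) ** 2


field = [[1.0] * 8 for _ in range(8)]
w = init_local_weights(8, 8, radius=1)
activations = locally_connected_forward(field, w, radius=1)

print(count_parameters(8, 8, 1))  # 576
print((8 * 8) ** 2)               # 4096 weights for a full dense connection
```

A convolution would use a single shared window (9 weights here), and a dense layer would need 4096; the locally connected layer sits in between, which is the trade-off the question is after.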

How do multiple hidden layers in a neural network improve its ability to learn?

Most neural networks bring high accuracy with only one hidden layer, so what is the purpose of multiple hidden layers?
To answer your question, you first need to look at why the term 'deep learning' was coined almost a decade ago. Deep learning is nothing but a neural network with several hidden layers. The term "deep" roughly refers to the way our brain passes sensory inputs (especially in the eyes and visual cortex) through different layers of neurons to do inference. However, until about a decade ago researchers were not able to train neural networks with more than one or two hidden layers, due to issues such as vanishing and exploding gradients, getting stuck in local minima, less effective optimization techniques (compared to what is used nowadays), and some other problems. In 2006 and 2007 several researchers (1 and 2) showed new techniques that enabled better training of neural networks with more hidden layers, and since then the era of deep learning has begun.
In deep neural networks the goal is to mimic what the brain does (hopefully). Before describing more, I should point out that, from an abstract point of view, the problem in any learning algorithm is to approximate a function given some inputs X and outputs Y. This is also the case for neural networks, and it has been theoretically proven that a neural network with only one hidden layer, using a bounded, continuous activation function in its units, can approximate any continuous function on a bounded input domain to arbitrary accuracy, given enough hidden units. This result is known as the universal approximation theorem. However, this raises the question of why current neural networks with one hidden layer cannot approximate any function with very high accuracy (say >99%). This could potentially be due to many reasons:
The current learning algorithms are not as effective as they should be
For a specific problem, how should one choose the exact number of hidden units so that the desired function is learned and the underlying manifold is approximated well?
The number of training examples needed could be exponential in the number of hidden units. So how many training examples should one train a model with? This could turn into a chicken-and-egg problem!
What is the right bounded, continuous activation function, and does the universal approximation theorem generalize to activation functions other than the sigmoid?
There are also other questions that need to be answered as well but I think the most important ones are the ones I mentioned.
Before provable answers to the above questions could be found (either theoretically or empirically), researchers started using more than one hidden layer, each with a limited number of hidden units. Empirically this has shown a great advantage. Although adding more hidden layers increases the computational cost, it has been shown empirically that more hidden layers learn hierarchical representations of the input data and generalize better to unseen data as well. By looking at the pictures below you can see how a deep neural network can learn hierarchies of features and combine them successively as we go from the first hidden layer to the last one:
Image taken from here
As you can see, the first hidden layer (shown at the bottom) learns some edges; combining those seemingly useless representations yields parts of objects, and combining those parts yields things like faces, cars, elephants, chairs and so on. Note that these results would not have been achievable without new optimization techniques and new activation functions.
