Partially connect convolution layers in CNN - machine-learning

I think convolution layers should be fully connected (see this and this). That is, each feature map should be connected to all feature maps in the previous layer. However, when I looked at this CNN visualization, the second convolution layer is not fully connected to the first. Specifically, each feature map in the second layer is connected to 3~6 (all) feature maps in the first layer, and I don't see any pattern in it. The questions are
Is it canonical/standard to fully connect convolution layers?
What's the rational for the partial connections in the visualization?
Am I missing something here?

Neural networks have the remarkable property that knowledge is not stored anywhere specifically, but in a distributed sense. If you take a working network, you can often cut out large parts and still get a network that works approximately the same.
A related effect is that the exact layout is not very critical. ReLu and Sigmoid (tanh) activation functions are mathematically very different, but both work quite well. Similarly, the exact number of nodes in a layer doesn't really matter.
Fundamentally, this relates to the fact that in training you optimize all weights to minimize your error function, or at least find a local minimum. As long as there are sufficient weights and those are sufficiently independent, you can optimize the error function.
There is another effect to take into account, though. With too many weights and not enough training data, you cannot optimize the network well. Regularization only helps so much. A key insight in CNN's is that they have less weights than a fully connected network, because nodes in a CNN are connected only to a small local neighborhood of nodes in the prior layer.
So, this particular CNN has even less connections than a CNN in which all feature maps are connected, and therefore less weights. That allows you to have more and/or bigger maps for a given amount of data. Is that the best solution? Perhaps - choosing the best layout is still a bit of a black art. But it's not a priori unreasonable.

Related

Should I adapt loss weights for misclassified samples during epochs?

I am using FCN (Fully Convolutional Networks) and trying to do image segmentation. When training, there are some areas which are mislabeled, however further training doesn't help much to make them go away. I believe this is because network learns about some features which might not be completely correct ones, but because there are enough correctly classified examples, it is stuck in local minimum and can't get out.
One solution I can think of is to train for an epoch, then validate the network on training images, and then adjust weights for mismatched parts to penalize mismatch more there in next epoch.
Intuitively, this makes sense to me - but I haven't found any writing on this. Is this a known technique? If yes, how is it called? If no, what am I missing (what are the downsides)?
It highly depends on your network structure. If you are using the original FCN, due to the pooling operations, the segmentation performance on the boundary of your objects is degraded. There have been quite some variants over the original FCN for image segmentation, although they didn't go the route you're proposing.
Just name a couple of examples here. One approach is to use Conditional Random Field (CRF) on top of the FCN output to refine the segmentation. You may search for the relevant papers to get more idea on that. In some sense, it is close to your idea but the difference is that CRF is separated from the network as a post-processing approach.
Another very interesting work is U-net. It employs some idea from the residual network (RES-net), which enables high resolution features from lower levels can be integrated into high levels to achieve more accurate segmentation.
This is still a very active research area. So you may bring the next break-through with your own idea. Who knows! Have fun!
First, if I understand well you want your network to overfit your training set ? Because that's generally something you don't want to see happening, because this would mean that while training your network have found some "rules" that enables it to have great results on your training set, but it also means that it hasn't been able to generalize so when you'll give it new samples it will probably perform poorly. Moreover, you never talk about any testing set .. have you divided your dataset in training/testing set ?
Secondly, to give you something to look into, the idea of penalizing more where you don't perform well made me think of something that is called "AdaBoost" (It might be unrelated). This short video might help you understand what it is :
https://www.youtube.com/watch?v=sjtSo-YWCjc
Hope it helps

How to implement convolutional connections without tied weights?

Given two layers of a neural network that have a 2D representation, i.e. fields of activation. I'd like to connect each neuron of the lower layer to the near neurons of the upper layer, say within a certain radius. Is this possible with TensorFlow?
This is similar to a convolution, but the weight kernels should not be tied. I'm trying to avoid connecting both layers fully first and masking out most of the parameters, in order to keep the number of parameters low.
I don't see a simple way to do this with existing TensorFlow ops efficiently, but there might be some tricks with sparse things. However, ops for efficient locally connected, non-convolutional neural net layers would be very useful, so you might want to file a feature request as a GitHub issue.

How do multiple hidden layers in a neural network improve its ability to learn?

Most neural networks bring high accuracy with only one hidden layer, so what is the purpose of multiple hidden layers?
To answer you question you first need to find the reason behind why the term 'deep learning' was coined almost a decade ago. Deep learning is nothing but a neural network with several hidden layers. The term deep roughly refers to the way our brain passes the sensory inputs (specially eyes and vision cortex) through different layers of neurons to do inference. However, until about a decade ago researchers were not able to train neural networks with more than 1 or two hidden layers due to different issues arising such as vanishing, exploding gradients, getting stuck in local minima, and less effective optimization techniques (compared to what is being used nowadays) and some other issues. In 2006 and 2007 several researchers 1 and 2 showed some new techniques enabling a better training of neural networks with more hidden layers and then since then the era of deep learning has started.
In deep neural networks the goal is to mimic what the brain does (hopefully). Before describing more, I may point out that from an abstract point of view the problem in any learning algorithm is to approximate a function given some inputs X and outputs Y. This is also the case in neural network and it has been theoretically proven that a neural network with only one hidden layer using a bounded, continuous activation function as its units can approximate any function. The theorem is coined as universal approximation theorem. However, this raises the question of why current neural networks with one hidden layer cannot approximate any function with a very very high accuracy (say >99%)? This could potentially be due to many reasons:
The current learning algorithms are not as effective as they should be
For a specific problem, how one should choose the exact number of hidden units so that the desired function is learned and the underlying manifold is approximated well?
The number of training examples could be exponential in the number of hidden units. So, how many training examples one should train a model with? This could turn into a chicken-egg problem!
What is the right bounded, continuous activation function and does the universal approximation theorem is generalizable to any other activation function rather than sigmoid?
There are also other questions that need to be answered as well but I think the most important ones are the ones I mentioned.
Before one can come up with provable answers to the above questions (either theoretically or empirically), researchers started using more than one hidden layers with limited number of hidden units. Empirically this has shown a great advantage. Although adding more hidden layers increases the computational costs, but it has been empirically proven that more hidden layers learn hierarchical representations of the input data and can better generalize to unseen data as well. By looking at the pictures below you can see how a deep neural network can learn hierarchies of features and combine them successively as we go from the first hidden layer to the one in the end:
Image taken from here
As you can see, the first hidden layer (shown in the bottom) learns some edges, then combining those seemingly, useless representations turn into some parts of the objects and then combining those parts will yield things like faces, cars, elephants, chairs and ... . Note that these results were not achievable if new optimization techniques and new activation functions were not used.

How to find dynamically the depth of a network in Convolutional Neural Network

I was looking for an automatic way to decide how many layers should I apply to my network depends on data and computer configuration. I searched in web, but I could not find anything. Maybe my keywords or looking ways are wrong.
Do you have any idea?
The number of layers, or depth, of a neural network is one of its hyperparameters.
This means that it is a quantity that can not be learned from the data, but you should choose it before trying to fit your dataset. According to Bengio,
We define a hyper-
parameter for a learning algorithm A as a variable to
be set prior to the actual application of A to the data,
one that is not directly selected by the learning algo-
rithm itself.
There are three main approaches to find out the optimal value for an hyperparameter. The first two are well explained in the paper I linked.
Manual search. Using well-known black magic, the researcher choose the optimal value through try-and-error.
Automatic search. The researcher relies on an automated routine in order to speed up the search.
Bayesian optimization.
More specifically, adding more layers to a deep neural network is likely to improve the performance (reduce generalization error), up to a certain number when it overfits the training data.
So, in practice, you should train your ConvNet with, say, 4 layers, try adding one hidden layer and train again, until you see some overfitting. Of course, some strong regularization techniques (such as dropout) is required.

Neural Network / Machine Learning memory storage

I am currently trying to set up an Neural Network for information extraction and I am pretty fluent with the (basic) concepts of Neural Networks, except for one which seem to puzzle me. It is probably pretty obvious but I can't seem to found information about it.
Where/How do Neural Networks store their memory? ( / Machine Learning)
There is quite a bit of information available online about Neural Networks and Machine Learning but they all seem to skip over memory storage. For example after restarting the program, where does it find its memory to continue learning/predicting? Many examples online don't seem to 'retain' memory but I can't imagine this being 'safe' for real/big-scale deployment.
I have a difficult time wording my question, so please let me know if I need to elaborate a bit more.
Thanks,
EDIT: - To follow up on the answers below
Every Neural Network will have edge weights associated with them.
These edge weights are adjusted during the training session of a
Neural Network.
This is exactly where I am struggling, how do/should I vision this secondary memory?
Is this like RAM? that doesn't seem logical.. The reason I ask because I haven't encountered an example online that defines or specifies this secondary memory (for example in something more concrete such as an XML file, or maybe even a huge array).
Memory storage is implementation-specific and not part of the algorithm per se. It is probably more useful to think about what you need to store rather than how to store it.
Consider a 3-layer multi-layer perceptron (fully connected) that has 3, 8, and 5 nodes in the input, hidden, and output layers, respectively (for this discussion, we can ignore bias inputs). Then a reasonable (and efficient) way to represent the needed weights is by two matrices: a 3x8 matrix for weights between the input and hidden layers and an 8x5 matrix for the weights between the hidden and output layers.
For this example, you need to store the weights and the network shape (number of nodes per layer). There are many ways you could store this information. It could be in an XML file or a user-defined binary file. If you were using python, you could save both matrices to a binary .npy file and encode the network shape in the file name. If you implemented the algorithm, it is up to you how to store the persistent data. If, on the other hand, you are using an existing machine learning software package, it probably has its own I/O functions for storing and loading a trained network.
Every Neural Network will have edge weights associated with them. These edge weights are adjusted during the training session of a Neural Network. I suppose your doubt is about storing these edge weights. Well, these values are stored separately in a secondary memory so that they can be retained for future use in the Neural Network.
I would expect discussion of the design of the model (neural network) would be kept separate from the discussion of the implementation, where data requirements like durability are addressed.
A particular library or framework might have a specific answer about durable storage, but if you're rolling your own from scratch, then it's up to you.
For example, why not just write the trained weights and topology in a file? Something like YAML or XML could serve as a format.
Also, while we're talking about state/storage and neural networks, you might be interested in investigating associative memory.
This may be answered in two steps:
What is "memory" in a Neural Network (referred to as NN)?
As a neural network (NN) is trained, it builds a mathematical model
that tells the NN what to give as output for a particular input. Think
of what happens when you train someone to speak a new language. The
human brain creates a model of the language. Similarly, a NN creates
mathematical model of what you are trying to teach it. It represents the mapping from input to output as a series of functions. This math model
is the memory. This math model is the weights of different edges in the network. Often, a NN is trained and these weights/connections are written to the hard disk (XML, Yaml, CSV etc). Whenever a NN needs to be used, these values are read back and the network is recreated.
How can you make a network forget its memory?
Think of someone who has been taught two languages. Let us say the individual never speaks one of these languages for 15-20 years, but uses the other one every day. It is very likely that several new words will be learnt each day and many words of the less frequent language forgotten. The critical part here is that a human being is "learning" every day. In a NN, a similar phenomena can be observed by training the network using new data. If the old data were not included in the new training samples, then the underlying math model will change so much that the old training data will no longer be represented in the model. It is possible to prevent a NN from "forgetting" the old model by changing the training process. However, this has the side effect that such a NN cannot learn completely new data samples.
I would say your approach is wrong. Neural Networks are not dumps of memory as we see on the computer. There are no addresses where a particular chunk of memory resides. All the neurons together make sure that a given input leads to a particular output.
Lets compare it with your brain. When you taste sugar, your tongue's taste buds are the input nodes which read chemical signals and transmit electric signals to brain. The brain then determines the taste using the various combinations of electric signals.
There are no lookup tables. There is no primary and secondary memories, only short and long term memory.

Resources