I have an indefinitely large training set to train a neural network.
Does it make any sense in this scenario to use regularization techniques like dropout?
Yes, it probably still makes sense. Dropout is a form of regularization, but a much subtler one than something like an L1 penalty: it prevents excessive co-adaptation of feature detectors, as described in the original paper.
You probably don't want the network to learn to depend on just one feature (or just a small combination of features), even if that is the best feature in your training set, because it may not be in new data. Intuitively, a network trained with dropout to recognize people in images will likely still recognize them if the face is obscured, even if there was no such example image in the training set (because the face high-level feature would have been dropped out some fraction of the time); a network trained without dropout may not (because the face feature is probably one of the best single features for detecting people). You can think of dropout as a certain degree of forced concept generalization.
Empirically, the feature detectors produced with dropout are much more structured (e.g., for images, closer to Gabor filters in the first few layers); without dropout they are closer to random (probably because the network approximates the Gabor filter it is converging towards with a specific linear combination of random filters, and if it can rely on the elements of that combination not being dropped out, there is no gradient towards separating the filters). This is probably also a good thing, since it forces features that are independent to be implemented independently early on, which may result in less cross-talk later.
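For a concrete feel for the mechanism, here is a minimal numpy sketch of inverted dropout (the keep probability and names are just illustrative, not from the paper):

```python
import numpy as np

def dropout_forward(activations, keep_prob=0.5, training=True):
    """Inverted dropout: randomly zero units during training and rescale the
    survivors so the expected activation stays the same at test time."""
    if not training:
        return activations  # no dropout (and no rescaling) at inference
    mask = (np.random.rand(*activations.shape) < keep_prob) / keep_prob
    return activations * mask

# Toy usage: a hidden layer of 8 units for a batch of 4 examples.
hidden = np.random.randn(4, 8)
print(dropout_forward(hidden, keep_prob=0.5, training=True))
```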
Related
I'm working on a (high energy physics related) problem using CNNs.
For understanding the problem, let's consider these examples here.
The left-hand side is the input to the CNN, the right-hand side the desired output. So the network is supposed to cluster the input. The actual algorithm behind this clustering (i.e. how we got the desired output for training) is really complex and we want the CNN to learn this.
I've tried different CNN architectures, for example one similar to the U-net architecture (https://arxiv.org/abs/1505.04597) but also various concatenations of convolutional layers, etc.
The outputs are always really similar (for all architectures).
Here you can see some CNN predictions.
In principle the network is performing quite well, but as you can see, in most cases the CNN output consists of several filled pixels that are directly next to each other, which will never (!) happen in the true cases.
I've been using mean squared error as the loss function in all of the networks.
Do you have any suggestions for how to avoid this problem and improve the network's performance?
Or is this a general limitation of CNNs, such that in practice it is not possible to solve this kind of problem with them?
Thank you very much!
My suggestion would be to split the work in two. First use a U-shaped network to find the activations as a binary segmentation task (as in the paper you linked), then regress on the found activations to get their final values. In my experience this works much better than doing regression on large images, because MSE produces blurry outputs, as you have observed.
The CNN does not know that you wanted a sharp result. As mentioned by @Thomas, MSE tends to give blurry results; that is in the nature of the loss function, since a blurry output does not incur a large MSE penalty.
An easy modification would be to use L1 loss (absolute difference instead of squared error). It has a constant gradient, unlike MSE, whose gradient shrinks as the error shrinks.
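To make the gradient point concrete, here is a small framework-agnostic numpy check comparing the per-pixel gradients of the two losses (the numbers are just a toy example):

```python
import numpy as np

y_true = np.array([0.0, 0.0, 1.0])
y_pred = np.array([0.5, 0.1, 0.9])   # predictions with large and small errors

# Per-element gradients of the losses w.r.t. the prediction:
grad_mse = 2.0 * (y_pred - y_true)   # shrinks with the error -> "good enough" blur
grad_l1  = np.sign(y_pred - y_true)  # constant magnitude, keeps pushing residuals to zero

print("MSE gradient:", grad_mse)     # [ 1.   0.2 -0.2]
print("L1 gradient: ", grad_l1)      # [ 1.   1.  -1. ]
```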
If you really want a sharp result, it would be easier to add a manual post-processing step: non-maximum suppression (NMS). In practice, a 3x3 box-max filter might do.
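As a rough sketch of that NMS step, assuming scipy is available (the threshold is an arbitrary illustrative value):

```python
import numpy as np
from scipy.ndimage import maximum_filter

def nms_3x3(prediction, threshold=0.1):
    """Keep only pixels that are the maximum of their 3x3 neighbourhood
    (and above a small threshold); zero out everything else."""
    local_max = maximum_filter(prediction, size=3)
    keep = (prediction == local_max) & (prediction > threshold)
    return np.where(keep, prediction, 0.0)

# Toy example: two adjacent filled pixels collapse to the stronger one.
pred = np.zeros((5, 5))
pred[2, 2], pred[2, 3] = 0.9, 0.6
print(nms_3x3(pred))
```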
Last week, Geoffrey Hinton and his team published papers that introduced a completely new type of neural network based on capsules. But I still don't understand the architecture and mechanism of work. Can someone explain in simple words how it works?
One of the major advantages of convolutional neural networks is their invariance to translation. However, this invariance comes at a price: the network does not consider how different features are related to each other. For example, given a picture of a face, a CNN will have difficulty capturing the relationship between the mouth feature and the nose feature. Max-pooling layers are the main reason for this effect: when we use max pooling, we lose the precise locations of the mouth and nose and can no longer say how they are related to each other.
Capsules try to keep the advantages of CNNs and fix this drawback in two ways:
Invariance: quoting from this paper: "When the capsule is working properly, the probability of the visual entity being present is locally invariant – it does not change as the entity moves over the manifold of possible appearances within the limited domain covered by the capsule."
In other words, a capsule takes into account the existence of the specific feature that we are looking for, like a mouth or a nose. This property makes capsules translation-invariant, the same way CNNs are.
Equivariance: instead of making the feature translation-invariant, a capsule makes it translation-equivariant or viewpoint-equivariant. In other words, as the feature moves and changes its position in the image, the feature's vector representation changes in the same way, which makes it equivariant. This property of capsules tries to address the drawback of max-pooling layers that I mentioned at the beginning.
Typically, the features of CNNs are represented as scalars, which do not capture precise positional information. Furthermore, the pooling strategy in CNNs discards much of the scalar information within a region, keeping only the single most significant feature of that region. A capsule network is designed to mitigate these issues by using a vector representation of the features, which can capture the positional context of a feature while mapping child-parent relationships effectively through a dynamic routing strategy.
In simple words, typical networks use scalar weights as connections between neurons. In a capsule net, the weights are matrices, since there are higher-dimensional capsules instead of neurons.
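As a toy illustration of that last point, roughly following the transformation-and-squash scheme from the Sabour et al. capsule paper (the dimensions and variable names here are just illustrative):

```python
import numpy as np

def squash(v, eps=1e-8):
    """Shrink a vector so its length lies in [0, 1) while keeping its direction."""
    norm_sq = np.sum(v ** 2)
    return (norm_sq / (1.0 + norm_sq)) * v / (np.sqrt(norm_sq) + eps)

# One lower-level capsule (8-D pose vector) predicting one higher-level capsule (16-D).
u_i = np.random.randn(8)        # output vector of lower capsule i
W_ij = np.random.randn(16, 8)   # learned transformation: a matrix, not a scalar weight
u_hat_ji = W_ij @ u_i           # prediction vector for higher capsule j

print(squash(u_hat_ji))         # routing-by-agreement would then weight such predictions
```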
I understand this decision depends on the task, but let me explain.
I'm designing a model that predicts steering angles from a given dashboard video frame using a convolutional neural network with dense layers at the end. In my final dense layer, I have only a single unit that predicts a steering angle.
My question here is, for my task would either option below show a boost in performance?
a. Get ground truth steering angles, convert to radians, and squash them using tanh so they are between -1 and 1. In the final dense layer of my network, use a tanh activation function.
b. Get ground truth steering angles. These raw angles are between -420 and 420 degrees. In the final layer, use a linear activation.
I'm trying to think about it logically: with option (a) the loss will likely be much smaller, since the network is dealing with much smaller numbers, which would lead to smaller changes in the weights.
Let me know your thoughts!
There are two types of variables in neural networks: weights and biases (mostly; there are additional variables, e.g. the moving mean and moving variance required for batch norm). They behave a bit differently; for instance, biases are not penalized by a regularizer, so they don't tend to get small. So the assumption that the network deals only with small numbers is not accurate.
Still, biases need to be learned, and as can be seen from ResNet performance, it's easier to learn smaller values. In this sense, I'd pick the [-1, 1] target range over [-420, 420]. But tanh is probably not an optimal activation function:
With tanh (just like with sigmoid), a saturated neuron kills the gradient during backprop: the derivative is 1 - tanh²(x), which is already down to roughly 0.01 at x = 3. Choosing tanh for no specific reason is likely to hurt your training.
Forward and backward passes with tanh need to compute exp, which is also relatively expensive.
My choice would be (at least initially, until some other variant proves to work better) to squeeze the ground-truth values into [-1, 1] and use no activation at all at the output (I think that's what you mean by a linear activation): let the network learn the [-1, 1] range by itself.
In general, if you have any activation functions in the hidden layers, ReLU has proven to work better than sigmoid, though other modern functions have been proposed more recently, e.g. leaky ReLU, PReLU, ELU, etc. You might try any of those.
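Putting the recommendation together, here is a hedged Keras-style sketch, assuming TensorFlow/Keras and a conv feature extractor that already exists elsewhere (the layer sizes are arbitrary):

```python
from tensorflow import keras

MAX_ANGLE = 420.0                          # degrees, from the question

def scale_targets(angles_deg):
    return angles_deg / MAX_ANGLE          # squeeze ground truth into [-1, 1]

def unscale_predictions(preds):
    return preds * MAX_ANGLE               # back to degrees at inference time

# Regression head to sit on top of whatever conv feature extractor you already have.
head = keras.Sequential([
    keras.layers.Dense(64, activation="relu"),   # ReLU in the hidden layer
    keras.layers.Dense(1, activation="linear"),  # no squashing at the output
])
head.compile(optimizer="adam", loss="mse")
```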
Probably lots of people already saw this article by Google research:
http://googleresearch.blogspot.ru/2015/06/inceptionism-going-deeper-into-neural.html
It describes how the Google team made neural networks actually draw pictures, like an artificial artist :)
I wanted to do something similar, just to see how it works and maybe use it in the future to better understand what makes my network fail. The question is: how do I achieve this with nolearn/lasagne (or maybe pybrain; that would also work, but I prefer nolearn)?
To be more specific, the Google folks trained an ANN with some architecture to classify images (for example, to classify which fish is in a photo). Fine; suppose I have an ANN constructed in nolearn with some architecture and have trained it to some degree. But what do I do next? I don't get it from their article. It doesn't seem that they just visualize the weights of some specific layers. It seems to me (maybe I am wrong) like they do one of two things:
1) Feed some existing image, or pure random noise, to the trained network and visualize the activations of one of the neuron layers. But that doesn't seem to be the full story, since if they used a convolutional neural network, the dimensionality of the layers might be lower than the dimensionality of the original image.
2) Or they feed random noise to the trained ANN, get its intermediate output from one of the middle layers, and feed it back into the network, to get some kind of loop and inspect what the network's layers think might be out there in the random noise. But again, I might be wrong, due to the same dimensionality issue as in #1.
So... any thoughts on that? How could we do the same kind of thing Google did in the original article using nolearn or pybrain?
From their ipython notebook on github:
Making the "dream" images is very simple. Essentially it is just a
gradient ascent process that tries to maximize the L2 norm of
activations of a particular DNN layer. Here are a few simple tricks
that we found useful for getting good images:
offset image by a random jitter
normalize the magnitude of gradient
ascent steps apply ascent across multiple scales (octaves)
It is done using a convolutional neural network, and you are correct that the dimensions of the activations will be smaller than the original image, but this isn't a problem.
You change the image with iterations of forward/backward propagation, just as you would normally train a network. On the forward pass, you only need to go until you reach the particular layer you want to work with. Then, on the backward pass, you propagate back to the inputs of the network instead of to the weights.
So instead of finding the gradients of a loss function with respect to the weights, you are finding the gradients of the L2 norm of a certain set of activations with respect to the inputs.
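I can't vouch for the exact nolearn/lasagne API, but here is a hedged PyTorch sketch of that mechanism, with a random placeholder model standing in for your trained network (it maximizes the L2 norm of one layer's activations by gradient ascent on the input; the jitter and octave tricks from the quote are omitted):

```python
import torch
import torch.nn as nn

# Placeholder network; in practice you would load your trained model instead.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)
target_layer = model[:3]            # forward only up to the layer of interest

img = torch.rand(1, 3, 128, 128, requires_grad=True)  # start from noise

for step in range(20):
    activations = target_layer(img)
    loss = activations.norm()        # L2 norm of the chosen layer's activations
    loss.backward()                  # gradients flow back to the input image
    with torch.no_grad():
        img += 0.1 * img.grad / (img.grad.abs().mean() + 1e-8)  # normalized ascent step
        img.grad.zero_()
```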
CrossPost: https://stats.stackexchange.com/questions/103960/how-sensitive-are-neural-networks
I am aware of pruning, and am not sure whether it removes the actual neuron or just sets its weights to zero, but I am asking this question as if no pruning process were being used.
For variously sized feedforward neural networks trained on large, noisy datasets:
Is it possible that one (or some trivial number of) extra or missing hidden neurons or hidden layers can make or break a network? Or will the synapse weights of an unnecessary neuron simply decay to zero, and the other neurons compensate if one or two are missing?
When experimenting, should input neurons be added one at a time or in groups of X? What is X? Increments of 5?
Lastly, should each hidden layer contain the same number of neurons? This is usually what I see in examples. If not, how and why would you adjust their sizes, other than by pure experimentation?
I would prefer to overdo it and wait longer for convergence, if a larger network will simply adapt itself to the solution. I have tried numerous configurations, but it is still difficult to gauge an optimal one.
1) Yes, absolutely. For example, if you have too few neurons in your hidden layer, your model will be too simple and have high bias. Similarly, if you have too many neurons, your model will overfit and have high variance. Adding more hidden layers allows you to model very complex problems like object recognition, but there are a lot of tricks needed to make additional hidden layers work; this is the field known as deep learning.
2) For a single-hidden-layer network, a common rule of thumb is to start with about twice as many neurons as inputs. You can determine the increment through binary search, i.e. run through a few different architectures and see how the accuracy changes (see the minimal sketch after point 3).
3) No, definitely not: each hidden layer can contain as many neurons as you want. There is no way other than experimentation to determine their sizes; all of the things you mention are hyperparameters which you must tune.
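As referenced in point 2, here is a minimal scikit-learn sketch of such a sweep over hidden layer sizes (the synthetic data is only for illustration; substitute your own split):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Coarse sweep over hidden layer sizes; refine around the best one afterwards.
for hidden in (10, 20, 40, 80, 160):
    clf = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=1000, random_state=0)
    clf.fit(X_train, y_train)
    print(hidden, clf.score(X_val, y_val))
```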
I'm not sure if you are looking for a simple answer, but maybe you will be interested in a new neural network regularization technique called dropout. Dropout randomly "removes" some of the neurons during training, forcing each neuron to be a good feature detector on its own. It greatly reduces overfitting, so you can set the number of neurons fairly high without worrying too much. Check this paper out for more info: http://www.cs.toronto.edu/~nitish/msc_thesis.pdf