I am confused between how to correctly use dropout with RNN in keras, specifically with GRU units. The keras documentation refers to this paper (https://arxiv.org/abs/1512.05287) and I understand that same dropout mask should be used for all time-steps. This is achieved by dropout argument while specifying the GRU layer itself. What I don't understand is:
Why there are several examples over the internet including keras own example (https://github.com/keras-team/keras/blob/master/examples/imdb_bidirectional_lstm.py) and "Trigger word detection" assignment in Andrew Ng's Coursera Seq. Models course, where they add a dropout layer explicitly "model.add(Dropout(0.5))" which, in my understanding, will add a different mask to every time-step.
The paper mentioned above suggests that doing this is inappropriate and we might lose the signal as well as long-term memory due to the accumulation of this dropout noise over all the time-steps.
But then, how are these models (using different dropout masks at every time-step) are able to learn and perform well.
I myself have trained a model which uses different dropout masks at every time-step, and although I haven't gotten results as I wanted, the model is able to overfit the training data. This, in my understanding, invalidates the "accumulation of noise" and "signal getting lost" over all the time-steps (I have 1000 time-step series being input to the GRU layers).
Any insights, explanations or experience with the situation will be helpful. Thanks.
UPDATE:
To make it more clear I'll mention an extract from keras documentation of Dropout Layer ("noise_shape: 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input. For instance, if your inputs have shape (batch_size, timesteps, features) and you want the dropout mask to be the same for all timesteps, you can use noise_shape=(batch_size, 1, features").
So, I believe, it can be seen that when using Dropout layer explicitly and needing the same mask at every time-step (as mentioned in the paper), we need to edit this noise_shape argument which is not done in the examples I linked earlier.
As Asterisk explained in his comment, there is a fundamental difference between dropout within a recurrent unit and dropout after the unit's output. This is the architecture from the keras tutorial you linked in your question:
model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
You're adding a dropout layer after the LSTM finished its computation, meaning that there won't be any more recurrent passes in that unit. Imagine this dropout layer as teaching the network not to rely on the output for a specific feature of a specific time step, but to generalize over information in different features and time steps. Dropout here is no different to feed-forward architectures.
What Gal & Ghahramani propose in their paper (which you linked in the question) is dropout within the recurrent unit. There, you're dropping input information between the time steps of a sequence. I found this blogpost to be very helpful to understand the paper and how it relates to the keras implementation.
Related
I'm asking about the last Layer of U-Net model for Semantic Segmentation
what it should be and why?
As I've found a lot of different architectures part of them are using Sigmoid and others are using Softmax in last layer
There's a good foundational article that goes in depth about sigmoid and softmax functions. Here is their summary:
If your model’s output classes are NOT mutually exclusive and you can choose many of them at the same time, use a sigmoid function on the network’s raw outputs.
If your model’s output classes are mutually exclusive and you can only choose one, then use a softmax function on the network’s raw outputs.
The article however specifically gives examples of classification tasks. In segmentation tasks, a pixel can only be one class at a time. (For example, in segmenting items on a beach, a pixel can't be both sand AND water.) This results in the often use of softmax in segmentation models, as the classes are mutually exclusive. In other words, a multi-class classification problem.
Sigmoid deals with multi-label classification problems, allowing for a pixel to share a label (a pixel can be both sand and water, both sky and water, even sky+water+sand+sun+etc.), which doesn't make sense. The exception, however, is if there's only one class, in other words, binary classification (water vs no water). Then you may use sigmoid in segmentation.
Softmax is actually a generalization of a sigmoid function. See this question over on Cross Validated for more info, but this is extra credit.
To finish answering your question, I should briefly speak about loss functions. Depending on your loss function, you may be preferring sigmoid or softmax. (E.g. if your loss function requires logits, softmax is inadequate.)
In summary, using softmax or sigmoid in the last layer depends on the problem you're working on, along with the associated loss function and other intricacies in your pipeline/software. In practice, if you have a multi-class problem, chances are you'll be using softmax. If you have one-class/binary problem, sigmoid or softmax are possibilities.
In a dense layer, one should initialize the weights according to some rule of thumb. For example, with RELU, the weights should come from a normal distribution and should be rescaled by 2/n where n is the number of inputs to the layer (according to Andrew Ng).
Does the same hold for convolutional layers? What is the right way to initialize weights (and biases) in a convolutional layer?
A common initializer for the sigmoid-based networks is Xavier initializer (a.k.a. Glorot initializer), named after Xavier Glorot, one of the authors of "Understanding the difficulty of training deep feedforward neural networks" paper. The formula takes into account not only the number of incoming connections, but also outcoming as well. The authors prove that with this initialization, activations distribution is approximately normal, which helps gradient flow in the backward pass.
For relu-based networks, a better initializer is He initializer from "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" by Kaiming He at al., which prove the same properties for relu activation.
Dense and convolutional layer aren't that different in this case, but it's important to remember that kernel weights are shared across the input image and batch, so the number of incoming connections depends on several parameters, incluing kernel size and striding, and might not be easy to calculate by hand.
In tensorflow, He initialization is implemented in variance_scaling_initializer() function (which is, in fact, a more general initializer, but by default performs He initialization), while Xavier initializer is logically xavier_initializer().
See also this discussion on CrossValidated.
I have a question. I watched a really detailed tutorial on implementing an artificial neural network in C++. And now I have more than a basic understanding of how a neural network works and how to actually program and train one.
So in the tutorial a hyperbolic tangent was used for calculating outputs, and obviously its derivative for calculating gradients. However I wanted to move on to a different function. Specifically Leaky RELU (to avoid dying neurons).
My question is, it specifies that this activation function should be used for the hidden layers only. For the output layers a different function should be used (either a softmax or a linear regression function). In the tutorial the guy taught the neural network to be an XOR processor. So is this a classification problem or a regression problem?
I tried to google the difference between the two, but I can't quite grasp the category for the XOR processor. Is it a classification or a regression problem?
So I implemented the Leaky RELU function and its derivative but I don't know whether I should use a softmax or a regression function for the output layer.
Also for recalculating the output gradients I use the Leaky RELU's derivative(for now) but in this case should I use the softmax's/regression derivative as well?
Thanks in advance.
I tried to google the difference between the two, but I can't quite grasp the category for the XOR processor. Is it a classification or a regression problem?
In short, classification is for discrete target, regression is for continuous target. If it were a floating point operation, you had a regression problem. But here the result of XOR is 0 or 1, so it's a binary classification (already suggested by Sid). You should use a softmax layer (or a sigmoid function, which works particularly for 2 classes). Note that the output will be a vector of probabilities, i.e. real valued, which is used to choose the discrete target class.
Also for recalculating the output gradients I use the Leaky RELU's derivative(for now) but in this case should I use the softmax's/regression derivative as well?
Correct. For the output layer you'll need a cross-entropy loss function, which corresponds to the softmax layer, and it's derivative for the backward pass.
If there will be hidden layers that still use Leaky ReLu, you'll also need Leaky ReLu's derivative accordingly, for these particular layers.
Highly recommend this post on backpropagation details.
The Keras implementation of dropout references this paper.
The following excerpt is from that paper:
The idea is to use a single neural net at test time without dropout.
The weights of this network are scaled-down versions of the trained
weights. If a unit is retained with probability p during training, the
outgoing weights of that unit are multiplied by p at test time as
shown in Figure 2.
The Keras documentation mentions that dropout is only used at train time, and the following line from the Dropout implementation
x = K.in_train_phase(K.dropout(x, level=self.p), x)
seems to indicate that indeed outputs from layers are simply passed along during test time.
Further, I cannot find code which scales down the weights after training is complete as the paper suggests. My understanding is this scaling step is fundamentally necessary to make dropout work, since it is equivalent to taking the expected output of intermediate layers in an ensemble of "subnetworks." Without it, the computation can no longer be considered sampling from this ensemble of "subnetworks."
My question, then, is where is this scaling effect of dropout implemented in Keras, if at all?
Update 1: Ok, so Keras uses inverted dropout, though it is called dropout in the Keras documentation and code. The link http://cs231n.github.io/neural-networks-2/#reg doesn't seem to indicate that the two are equivalent. Nor does the answer at https://stats.stackexchange.com/questions/205932/dropout-scaling-the-activation-versus-inverting-the-dropout. I can see that they do similar things, but I have yet to see anyone say they are exactly the same. I think they are not.
So a new question: Are dropout and inverted dropout equivalent? To be clear, I'm looking for mathematical justification for saying they are or aren't.
Yes. It is implemented properly. From the time when Dropout was invented - folks improved it also from the implementation point of view. Keras is using one of this techniques. It's called inverted dropout and you may read about it here.
UPDATE:
To be honest - in the strict mathematical sense this two approaches are not equivalent. In inverted case you are multiplying every hidden activation by a reciprocal of dropout parameter. But due to that derivative is linear it is equivalent to multiplying all gradient by the same factor. To overcome this difference you must set different learning weight then. From this point of view this approaches differ. But from a practical point view - this approaches are equivalent because:
If you use a method which automatically sets the learning rate (like RMSProp or Adagrad) - it will make almost no change in algorithm.
If you use a method where you set your learning rate automatically - you must take into account the stochastic nature of dropout and that due to the fact that some neurons will be turned off during training phase (what will not happen during test / evaluation phase) - you must to rescale your learning rate in order to overcome this difference. Probability theory gives us the best rescalling factor - and it is a reciprocal of dropout parameter which makes the expected value of a loss function gradient length the same in both train and test / eval phases.
Of course - both points above are about inverted dropout technique.
Excerpted from the original Dropout paper (Section 10):
In this paper, we described dropout as a method where we retain units with probability p at training time and scale down the weights by multiplying them by a factor of p at test time. Another way to achieve the same effect is to scale up the retained activations by multiplying by 1/p at training time and not modifying the weights at test time. These methods are equivalent with appropriate scaling of the learning rate and weight initializations at each layer.
Note though, that while keras's dropout layer is implemented using inverted dropout. The rate parameter the opposite of keep_rate.
keras.layers.Dropout(rate, noise_shape=None, seed=None)
Dropout consists in randomly setting a fraction rate of input units to
0 at each update during training time, which helps prevent
overfitting.
That is, rate sets the rate of dropout and not the rate to keep which you would expect with inverted dropout
Keras Dropout
Caffe tutorial states:
The net is a set of layers connected in a computation graph – a directed acyclic graph (DAG) to be exact. Caffe does all the bookkeeping for any DAG of layers to ensure correctness of the forward and backward passes.
What is the meaning by "all the bookkeeping"? I don't understand it.
How to do all the bookkeeping?
Caffe, like many other deep-learning frameworks, trains its models using stochastic gradient decent (SGD), implemented as gradient back propagation. That is, for a mini-batch of training examples, caffe feed the batch through the net ("forward pass") to compute the loss w.r.t the net's parameters. Then, it propagates the loss gradient back ("backward pass") to update all the parameters according to the estimated gradient.
By "bookkeeping" the tutorial means, you do not need to worry about estimating the gradients and updating the parameters. Once you are using existing layers (e.g., "Convolution", "ReLU", "Sigmoid" etc.) you only need to define the graph structure (the net's architecture) and supply the training data, and caffe will take care of the rest of the training process: It will forward/backward each mini batch, compute the loss, estimate the gradients and update the parameters - all for you.
Pretty awesome, don't you think? ;)