Standard in ResNets is to skip 2 linearities.
Would skipping only one work as well?
I would refer you to the original paper by Kaiming He et al.
In sections 3.1-3.2, they define "identity" shortcuts as y = F(x, W) + x, where W are the trainable parameters, for any residual mapping F to be learned. It is important that the residual mapping contains a non-linearity, otherwise the whole construction is one sophisticated linear layer. But the number of linearities is not limited.
For example, the ResNeXt network creates identity shortcuts around a stack of convolutional layers only (see the figure below), so there are no dense layers in the residual block.
The general answer is thus: yes, it would work. However, in a particular neural network, reducing two dense layers to one may be a bad idea, because the residual block must still be flexible enough to learn the residual function. So remember to validate any design you come up with.
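As a rough illustration of the point about non-linearity, here is a minimal NumPy sketch. The weights, sizes and function names are made up for illustration and are not taken from the paper. It checks that a residual block whose F is purely linear collapses to the single linear map (W + I)x, while a block that skips only one linear layer followed by a ReLU does not behave linearly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(4, 4))   # hypothetical weights, for illustration only
W2 = rng.normal(size=(4, 4))

def relu(v):
    return np.maximum(v, 0.0)

def linear_block(v):
    # F(x) = W1 x, no non-linearity inside the block
    return W1 @ v + v

def one_layer_block(v):
    # Shortcut around a single linear layer followed by ReLU
    return relu(W1 @ v) + v

def two_layer_block(v):
    # Shortcut around two linear layers with a ReLU in between
    return W2 @ relu(W1 @ v) + v

# Without a non-linearity the block is just one linear layer (W1 + I).
linear_collapses = np.allclose(linear_block(x), (W1 + np.eye(4)) @ x)

# Every linear map f satisfies f(-x) = -f(x); the ReLU block breaks this.
one_layer_is_odd = np.allclose(one_layer_block(-x), -one_layer_block(x))

print(linear_collapses, one_layer_is_odd)
```

Whether the one-layer or the two-layer variant trains better on a given task is an empirical question that this sketch does not answer.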
When training a neural network, if the same module is used multiple times in one iteration, does the gradient of the module need special processing during backpropagation?
for example:
One Deformable Compensation is used three times in this model, which means they share the same weights.
What will happen when I use loss.backward()?
Will loss.backward() work correctly?
The nice thing about autograd and backward passes is that the underlying framework is not "algorithmic" but mathematical: it implements the chain rule of derivatives. Therefore, there are no "algorithmic" considerations of "shared weights" or "weighting different layers"; it's pure math. When the same weight is used in several places, the chain rule simply sums the contributions from each use. The backward pass provides the derivative of your loss function w.r.t. the weights in a purely mathematical way.
Sharing weights can be done globally (e.g., when training Siamese networks), on a "layer level" (as in your example), but also within a layer. When you think about it, convolution layers and recurrent layers are a fancy way of locally sharing weights.
Naturally, pytorch (as well as all other DL frameworks) can trivially handle these cases.
As long as your "deformable compensation" layer is correctly implemented -- pytorch will take care of the gradients for you, in a mathematically correct manner, thanks to the chain rule.
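To make the chain-rule argument concrete without depending on any framework, here is a small NumPy sketch. It uses a toy stand-in for the module (a single scalar weight `w` applied three times with tanh), which is an assumption for illustration and not the actual deformable compensation layer. The hand-derived gradient is the sum of one term per use of the shared weight, and it is compared against a finite-difference estimate.

```python
import numpy as np

def forward(w, x):
    # The same weight w is reused in three consecutive steps.
    h1 = np.tanh(w * x)
    h2 = np.tanh(w * h1)
    h3 = np.tanh(w * h2)
    return h1, h2, h3

def grad_shared(w, x):
    # d(h3)/dw by the chain rule: one contribution per use of w, summed.
    h1, h2, h3 = forward(w, x)
    d1, d2, d3 = 1 - h1**2, 1 - h2**2, 1 - h3**2   # tanh derivatives
    g_third_use = d3 * h2
    g_second_use = d3 * w * d2 * h1
    g_first_use = d3 * w * d2 * w * d1 * x
    return g_first_use + g_second_use + g_third_use

w, x, eps = 0.7, 1.3, 1e-5
analytic = grad_shared(w, x)
# Central finite difference as an independent check
numeric = (forward(w + eps, x)[2] - forward(w - eps, x)[2]) / (2 * eps)
print(analytic, numeric)
```

An autograd framework arrives at the same sum automatically, which is why no special handling of the shared module is needed.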
I have a question regarding appropriate activation functions with environments that have both positive and negative rewards.
In reinforcement learning, our output, I believe, should be the expected reward for all possible actions. Since some options have a negative reward, we would want an output range that includes negative numbers.
This would lead me to believe that the only appropriate activation functions would be either linear or tanh. However, I see the use of Relu in many RL papers.
So two questions:
If you do want to have both negative and positive outputs, are you limited to just tanh and linear?
Is it a better strategy (if possible) to scale rewards up so that they are all in the positive domain (i.e. instead of [-1,0,1], [0, 1, 2]) in order for the model to leverage alternative activation functions?
Many RL papers indeed use Relus for most layers, but typically not for the final output layer. You mentioned the Human Level Control through Deep Reinforcement Learning paper and the Hindsight Experience Replay paper in one of the comments, but neither of those papers describes an architecture that uses Relus for the output layer.
In the Human Level Control through Deep RL paper, page 6 (after references), Section "Methods", last paragraph for the part on "Model architecture" mentions that the output layer is a fully-connected linear layer (not a Relu). So, indeed, all hidden layers can only have nonnegative activation levels (since they all use Relus), but the output layer can have negative activation levels if there are negative weights between the output layer and last hidden layer. This is indeed necessary because the outputs it should create can be interpreted as Q-values (which may be negative).
In the Hindsight Experience Replay paper, they do not use DQN (like the paper above), but DDPG. This is an "Actor-Critic" algorithm. The "critic" part of this architecture is also intended to output values which can be negative, similar to the DQN architecture, so this also cannot use a Relu for the output layer (but it can still use Relus everywhere else in the network). In Appendix A of this paper, under "Network architecture", it is also described that the actor output layer uses tanh as activation function.
To answer your specific questions:
If you do want to have both negative and positive outputs, are you limited to just tanh and linear?
Well, there are also other activations that can output negative values (leaky relu, and probably lots of others). But a Relu indeed cannot result in negative outputs, and neither can a sigmoid.
Is it a better strategy (if possible) to scale rewards up so that they are all in the positive domain (i.e. instead of [-1,0,1], [0, 1, 2]) in order for the model to leverage alternative activation functions?
Not 100% sure, possibly. It would often be difficult though, if you have no domain knowledge about how big or small rewards (and/or returns) can possibly get. I have a feeling it would typically be easier to simply end with one fully connected linear layer.
If you do want to have both negative and positive outputs, are you limited to just tanh and linear?
No. The restriction applies only to the activation function of the output layer. For all other layers it does not matter, because you can have negative weights, which means neurons with only positive values can still contribute negative values to the next layer.
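A tiny NumPy sketch of this point, with made-up weights and a two-action toy problem (all values are illustrative assumptions): the hidden layer uses ReLU, so its activations are non-negative, yet a linear output layer with a negative weight still produces a negative Q-value. Putting a ReLU on the output layer would clip it to zero.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

state = np.array([1.0, 2.0])
W_hidden = np.eye(2)                 # hypothetical hidden-layer weights
W_out = np.array([[-1.0, 0.25],      # hypothetical output weights, one row per action
                  [0.5, 0.5]])

hidden = relu(W_hidden @ state)      # non-negative activations: [1., 2.]
q_linear = W_out @ hidden            # linear output layer: [-0.5, 1.5]
q_relu = relu(W_out @ hidden)        # ReLU output layer: [0., 1.5]

print(hidden, q_linear, q_relu)
```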
I searched a lot for the reason but it is still not clear to me. Could someone explain it in more detail, please?
In theory you do not have to attach a fully connected layer, you could have a full stack of convolutions till the very end, as long as (due to custom sizes/paddings) you end up with the correct number of output neurons (usually number of classes).
So why do people usually not do that? If one goes through the math, it becomes visible that each output neuron (and thus the prediction with respect to some class) depends only on a subset of the input dimensions (pixels). This would be something along the lines of a model which decides whether an image is an element of class 1 based only on the first few "columns" (or, depending on the architecture, rows, or some patch of the image), then whether it is class 2 based on the next few columns (maybe overlapping), ..., and finally class K based on the last few columns. Data usually does not have this characteristic: you cannot classify an image of a cat based on the first few columns while ignoring the rest.
However, if you introduce a fully connected layer, you give your model the ability to mix signals. Since every neuron has a connection to every neuron in the next layer, information now flows between each input dimension (pixel location) and each output class, so the decision is truly based on the whole image.
So intuitively you can think about these operations in terms of information flow. Convolutions are local operations, and so is pooling. Fully connected layers are global (they can introduce any kind of dependence). This is also why convolutions work so well in domains like image analysis: due to their local nature they are much easier to train, even though mathematically they are just a subset of what fully connected layers can represent.
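The locality argument can be checked numerically. The sketch below is a 1-D toy example with made-up weights (an assumption for illustration, not a real CNN): it perturbs a single input "pixel" and records which outputs of a small-kernel convolution and of a dense layer change.

```python
import numpy as np

x = np.arange(8, dtype=float)                 # toy 1-D "image" with 8 pixels
kernel = np.array([1.0, 2.0, 3.0])            # small convolution kernel
W_dense = np.arange(1, 49, dtype=float).reshape(6, 8)  # dense layer, all weights non-zero

def conv_layer(v):
    # No padding, stride 1: each output sees only 3 neighbouring inputs.
    return np.convolve(v, kernel, mode="valid")

def dense_layer(v):
    # Each output sees every input.
    return W_dense @ v

x_perturbed = x.copy()
x_perturbed[0] += 1.0                         # change only the first pixel

conv_changed = np.flatnonzero(conv_layer(x_perturbed) != conv_layer(x))
dense_changed = np.flatnonzero(dense_layer(x_perturbed) != dense_layer(x))

print("conv outputs affected: ", conv_changed)
print("dense outputs affected:", dense_changed)
```

Only the first convolution output depends on the first pixel, while every dense output does. Stacking convolutions and pooling widens this receptive field, but a single small-kernel layer stays local.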
note
I am considering here the typical use of CNNs, where kernels are small. In general one can even think of an MLP as a CNN whose kernel has the size of the whole input, with specific spacing/padding. However, these are just corner cases which are not really encountered in practice and do not really affect the reasoning, since they end up being MLPs. The whole point here is simple: to introduce global relations. If one can do that by using CNNs in a specific manner, then MLPs are not needed. MLPs are just one way of introducing this dependence.
Every fully connected (FC) layer has an equivalent convolutional layer (but not vice versa). Hence it is not necessary to add FC layers. They can always be replaced by convolutional layers (+ reshaping). See details.
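A minimal NumPy sketch of this equivalence, under assumed toy shapes (2 channels, a 3x3 feature map, 4 output units) and a hand-written `conv2d_valid` helper that is hypothetical and not from any library: reshaping the FC weight matrix into kernels that cover the whole feature map gives a convolution with the same outputs.

```python
import numpy as np

def conv2d_valid(fmap, kernels):
    """Cross-correlation with no padding and stride 1.

    fmap: (C, H, W); kernels: (K, C, kh, kw); returns (K, H-kh+1, W-kw+1).
    """
    C, H, W = fmap.shape
    K, _, kh, kw = kernels.shape
    out = np.zeros((K, H - kh + 1, W - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = fmap[:, i:i + kh, j:j + kw]
            # Sum over channel and spatial axes for all K kernels at once.
            out[:, i, j] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
C, H, W, K = 2, 3, 3, 4
fmap = rng.normal(size=(C, H, W))
fc_weights = rng.normal(size=(K, C * H * W))

fc_out = fc_weights @ fmap.reshape(-1)           # ordinary FC layer on the flattened map
kernels = fc_weights.reshape(K, C, H, W)         # same numbers, viewed as K full-size kernels
conv_out = conv2d_valid(fmap, kernels)           # shape (K, 1, 1)

print(np.allclose(fc_out, conv_out.reshape(K)))
```

The final reshape from (K, 1, 1) to (K,) is the kind of extra bookkeeping a convolution-only head needs before the loss.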
Why do we use FC layers then?
Because (1) we are used to it and (2) it is simpler. (1) is probably the reason for (2). For example, you would need to adjust the loss functions or the shape of the labels, or add a reshape at the end, if you used a convolutional layer instead of an FC layer.
I found this answer by Anil-Sharma on Quora helpful.
We can divide the whole network (for classification) into two parts:
Feature extraction:
In the conventional classification algorithms, like SVMs, we used to extract features from the data to make the classification work. The convolutional layers are serving the same purpose of feature extraction. CNNs capture better representation of data and hence we don’t need to do feature engineering.
Classification:
After feature extraction we need to classify the data into various classes, this can be done using a fully connected (FC) neural network. In place of fully connected layers, we can also use a conventional classifier like SVM. But we generally end up adding FC layers to make the model end-to-end trainable.
The CNN gives you a representation of the input image. To learn the sample classes, you should use a classifier (such as logistic regression, SVM, etc.) that learns the relationship between the learned features and the sample classes. A fully connected layer is also a linear classifier, like logistic regression, and is used for this reason.
Convolution and pooling layers extract features from the image, so these layers do some "preprocessing" of the data. Fully connected layers perform classification based on these extracted features.
Most examples of neural networks for classification tasks I've seen use a softmax layer as the output activation function. Normally, the other hidden units use a sigmoid, tanh, or ReLU function as activation function. Using the softmax function here would, as far as I know, work out mathematically too.
What are the theoretical justifications for not using the softmax function as hidden layer activation functions?
Are there any publications about this, something to quote?
I haven't found any publications about why using softmax as an activation in a hidden layer is not the best idea (except a Quora question which you have probably already read), but I will try to explain why it is not the best idea to use it in this case:
1. Variable independence: a lot of regularization and effort is put into keeping your variables independent, uncorrelated and quite sparse. If you use a softmax layer as a hidden layer, you keep all your nodes (hidden variables) linearly dependent, which may result in many problems and poor generalization.
2. Training issues: imagine that to make your network work better you have to make some of the activations from your hidden layer a little bit lower. You then automatically push the rest of them to a higher mean activation level, which might in fact increase the error and harm your training phase.
3. Mathematical issues: by creating constraints on the activations of your model you decrease its expressive power without any logical explanation. Striving to keep the mean activation constant is not worth it in my opinion.
4. Batch normalization does it better: one may consider the fact that a constant mean output from a layer may be useful for training. On the other hand, a technique called Batch Normalization has already been proven to work better, whereas it was reported that setting softmax as the activation function in a hidden layer may decrease the accuracy and the speed of learning.
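Points 1 and 2 can be seen directly in a small NumPy sketch (the pre-activation values are arbitrary, chosen only for illustration): the softmax activations always sum to one, so lowering one unit's pre-activation necessarily raises all the other activations.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return z / z.sum()

pre = np.array([2.0, 1.0, 0.5, -1.0])   # arbitrary hidden pre-activations
h_before = softmax(pre)

pre_lowered = pre.copy()
pre_lowered[0] -= 1.0                    # try to lower only the first unit
h_after = softmax(pre_lowered)

print("sum before/after:", h_before.sum(), h_after.sum())
print("change per unit: ", h_after - h_before)
```

The first unit goes down and every other unit goes up, even though their own pre-activations were left untouched.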
Actually, Softmax functions are already used deep within neural networks, in certain cases, when dealing with differentiable memory and with attention mechanisms!
Softmax layers can be used within neural networks such as in Neural Turing Machines (NTM) and an improvement of those which are Differentiable Neural Computer (DNC).
To summarize, those architectures are RNNs/LSTMs which have been modified to contain a differentiable (neural) memory matrix which is possible to write and access through time steps.
Quickly explained, the softmax function here enables a normalization of a fetch of the memory and other similar quirks for content-based addressing of the memory. About that, I really liked this article which illustrates the operations in an NTM and other recent RNN architectures with interactive figures.
Moreover, Softmax is used in attention mechanisms for, say, machine translation, such as in this paper. There, the Softmax normalizes the places over which attention is distributed, in order to "softly" retain the maximal place to pay attention to: that is, to also pay a little bit of attention elsewhere in a soft manner. However, this could be considered a mini neural network that deals with attention within the big one, as explained in the paper. Therefore, it could be debated whether or not Softmax is used only at the end of neural networks.
Hope it helps!
Edit - More recently, it's even possible to see Neural Machine Translation (NMT) models where only attention (with softmax) is used, without any RNN nor CNN: http://nlp.seas.harvard.edu/2018/04/03/attention.html
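For readers who want to see where the softmax sits inside such a model, here is a minimal NumPy sketch of dot-product attention for a single query. It is a simplified illustration under assumed toy shapes and random data, not the exact formulation of any of the papers mentioned above.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return z / z.sum()

def attend(query, keys, values):
    # Similarity of the query to each key, scaled by sqrt(dimension).
    scores = keys @ query / np.sqrt(query.shape[0])
    # Softmax in the middle of the network: soft selection over positions.
    weights = softmax(scores)
    # Weighted average of the values.
    return weights @ values, weights

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 4))     # 5 positions, key dimension 4
values = rng.normal(size=(5, 3))   # 5 positions, value dimension 3
query = rng.normal(size=4)

context, weights = attend(query, keys, values)
print(weights, weights.sum())
```

Every position receives a strictly positive weight, which is the "soft" part: the largest score dominates, but nothing is ignored completely.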
Use a softmax activation wherever you want to model a multinomial distribution. This may be (usually) an output layer y, but can also be an intermediate layer, say a multinomial latent variable z. As mentioned in this thread for outputs {o_i}, sum({o_i}) = 1 is a linear dependency, which is intentional at this layer. Additional layers may provide desired sparsity and/or feature independence downstream.
Page 198 of Deep Learning (Goodfellow, Bengio, Courville)
Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.
Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable.
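The "generalization of the sigmoid" remark in the quote can be verified numerically. In this small NumPy sketch (the helper names are mine), a two-class softmax over the scores [z, 0] gives the same probability for the first class as sigmoid(z).

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return z / z.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z_values = np.linspace(-5.0, 5.0, 11)
# Probability of class 1 when the two class scores are [z, 0]
two_class = np.array([softmax(np.array([z, 0.0]))[0] for z in z_values])

print(np.allclose(two_class, sigmoid(z_values)))
```

This follows from e^z / (e^z + e^0) = 1 / (1 + e^(-z)).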
The softmax function is used for the output layer only (at least in most cases) to ensure that the components of the output vector sum to 1 (for clarity, see the formula of the softmax function). Each component can then be read as the probability of occurrence of the corresponding class, and hence the probabilities (output components) sum to 1.
The softmax function is one of the most important output functions used in deep learning (see Understanding Softmax in minute by Uniqtech). It is applied where there are three or more classes of outcomes. The softmax formula raises e to the power of each score and divides it by the sum of e raised to each of the scores. For example, if I know the logit scores of four classes to be [3.00, 2.0, 1.00, 0.10], the softmax function can be applied as follows to obtain the output probabilities:
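import numpy as np

def softmax(x):
    # Subtract the maximum before exponentiating for numerical stability.
    x = np.asarray(x, dtype=float)
    z = np.exp(x - np.max(x))
    return z / z.sum()

scores = [3.00, 2.0, 1.00, 0.10]
print(softmax(scores))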
Output: probabilities (p) = 0.642 0.236 0.087 0.035
The sum of all probabilities (p) = 0.642 + 0.236 + 0.087 + 0.035 = 1.00. You can substitute any values you like for the scores above and you will get different probabilities, but their sum will always be equal to one. That makes sense, because probabilities must sum to one; softmax thereby turns logit scores into probability scores that are easier to interpret. Finally, the softmax output can help us understand and interpret the multinomial logit model.
I need help in figuring out a suitable activation function. I'm training my neural network to detect a piano note. So in this case I can have only one output: either the note is there (1) or the note is not present (0).
Say I introduce a threshold value of 0.5 and say that if the output is greater than 0.5 the desired note is present, and if it's less than 0.5 the note isn't present. What type of activation function can I use? I assume it should be hard limit, but I'm wondering if sigmoid can also be used.
To exploit their full power, neural networks require continuous, differentiable activation functions. Thresholding is not a good choice for multilayer neural networks. Sigmoid is quite a generic function, which can be applied in most cases. When you are doing binary classification (0/1 values), the most common approach is to define one output neuron and simply choose class 1 iff its output is bigger than a threshold (typically 0.5).
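A minimal sketch of that setup in NumPy, with hypothetical weights and inputs (nothing here is trained; `predict_note` and its parameters are illustrative names): a single sigmoid output neuron produces a value in (0, 1), and the 0.5 threshold is applied only when making the final decision, not inside the network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_note(features, weights, bias, threshold=0.5):
    # Smooth, differentiable output used during training ...
    p = sigmoid(features @ weights + bias)
    # ... and a hard decision applied only at prediction time.
    return p, p > threshold

# Hypothetical parameters and three 2-dimensional input examples
weights = np.array([1.0, -1.0])
bias = 0.0
features = np.array([[2.0, 0.0],
                     [0.0, 2.0],
                     [1.0, 0.5]])

probs, present = predict_note(features, weights, bias)
print(probs, present)
```

Since sigmoid(z) > 0.5 exactly when z > 0, thresholding at 0.5 is the same as checking the sign of the pre-activation.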
EDIT
As you are working with quite simple data (two input dimensions and two output classes), the best option seems to be to actually abandon neural networks and start with data visualization. 2D data can simply be plotted on the plane (with different colors for different classes). Once you do that, you can investigate how hard it is to separate one class from another. If the data is located in such a way that you can simply put a line separating the classes, a linear support vector machine would be a much better choice (as it will guarantee one global optimum). If the data seems really complex, and the decision boundary has to be some curve (or even a set of curves), I would suggest going for an RBF SVM, or at least a regularized form of neural network (so its training is at least quite repeatable). If you decide on a neural network, the situation is quite similar: if the data is simple to separate on the plane, you can use simple (linear/threshold) activation functions. If it is not linearly separable, use sigmoid or hyperbolic tangent, which will ensure non-linearity in the decision boundary.
UPDATE
Many things have changed over the last two years. In particular (as suggested in the comment, #Ulysee) there is growing interest in functions that are differentiable "almost everywhere", such as ReLU. These functions have a valid derivative on most of their domain, so the probability that we will ever need the derivative at a non-differentiable point is zero. Consequently, we can still use classical methods and, for the sake of completeness, use a zero derivative if we need to compute ReLU'(0). There are also fully differentiable approximations of ReLU, such as the softplus function.
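The convention described here is easy to write down. The NumPy sketch below (helper names are mine) defines ReLU with its derivative set to zero at the origin, and softplus, log(1 + e^x), whose derivative is the sigmoid and which is smooth everywhere.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Convention: use 0 as the derivative at the kink x = 0.
    return (x > 0).astype(float)

def softplus(x):
    # log(1 + exp(x)), computed in a numerically stable way
    return np.logaddexp(0.0, x)

def softplus_grad(x):
    # The derivative of softplus is the sigmoid function.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-3.0, 0.0, 3.0])
print(relu(x), relu_grad(x))
print(softplus(x), softplus_grad(x))
```

Away from zero, softplus quickly approaches ReLU, so the two mostly differ in how they treat inputs near the kink.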
The Wikipedia article on the sigmoid function (en.wikipedia.org/wiki/Sigmoid_function) has some useful "soft" continuous threshold functions - see Figure Gjl-t(x).svg.
Following Occam's Razor, the simpler model using one output node is a good starting point for binary classification, where one class label is mapped to the output node when activated, and the other class label for when the output node is not activated.