Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 10 months ago.
Improve this question
I am reading through Residual learning, and I have a question.
What is "linear projection" mentioned in 3.2? Looks pretty simple once got this but could not get the idea...
Can someone provide simple example?
First up, it's important to understand what x, y and F are and why they need any projection at all. I'll try explain in simple terms, but basic understanding of ConvNets is required.
x is an input data (called tensor) of the layer, in case of ConvNets it's rank is 4. You can think of it as a 4-dimensional array. F is usually a conv layer (conv+relu+batchnorm in this paper), and y combines the two together (forming the output channel). The result of F is also of rank 4, and most of dimensions will be the same as in x, except for one. That's exactly what the transformation should patch.
For example, x shape might be (64, 32, 32, 3), where 64 is the batch size, 32x32 is image size and 3 stands for (R, G, B) color channels. F(x) might be (64, 32, 32, 16): batch size never changes, for simplicity, ResNet conv-layer doesn't change the image size too, but will likely use a different number of filters - 16.
So, in order for y=F(x)+x to be a valid operation, x must be "reshaped" from (64, 32, 32, 3) to (64, 32, 32, 16).
I'd like to stress here that "reshaping" here is not what numpy.reshape does.
Instead, x[3] is padded with 13 zeros, like this:
pad(x=[1, 2, 3],padding=[7, 6]) = [0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 0]
If you think about it, this is a projection of a 3-dimensional vector onto 16 dimensions. In other words, we start to think that our vector is the same, but there are 13 more dimensions out there. None of the other x dimensions are changed.
Here's the link to the code in Tensorflow that does this.
A linear projection is one where each new feature is simple a weighted sum of the original features. As in the paper, this can be represented by matrix multiplication. if x is the vector of N input features and W is an M-byN matrix, then the matrix product Wx yields M new features where each one is a linear projection of x. Each row of W is a set of weights that defines one of the M linear projections (i.e., each row of W contains the coefficients for one of the weighted sums of x).
In Pytorch (in particular torchvision\models\resnet.py), at the end of a Bottleneck you will either have two scenarios
The input vector x's channels, say x_c (not spatial resolution, but channels), are less than equal to the output after layer conv3 of the Bottleneck, say d dimensions. This can then be alleviated by a 1 by 1 convolution with in planes = x_c and out_planes = d, with stride 1, followed by batch normalization, and then the addition F(x) + x occurs assuming x and F(x) have the same spatial resolution.
Both the spatial resolution of x and its number of channels don't match the output of the BottleNeck layer, in which case the 1 by 1 convolution mentioned above needs to have stride 2 in order for both the spatial resolution and the number of channels to match for the element-wise addition (again with batch normalization of x before the addition).
Related
I've been reading about convolutional nets and I've programmed a few models myself. When I see visual diagrams of other models it shows each layer being smaller and deeper than the last ones. Layers have three dimensions like 256x256x32. What is this third number? I assume the first two numbers are the number of nodes but I don't know what the depth is.
TLDR; 256x256x32 refers to the layer's output shape rather than the layer itself.
There are many articles and posts out there explaining how convolution layers work. I'll try to answer your question without going into too many details, just focusing on shapes.
Assuming you are working with 2D convolution layers, your input and output will both be three-dimensional. That is, without considering the batch which would correspond to a 4th axis... Therefore, the shape of a convolution layer input will be (c, h, w) (or (h, w, c) depending on the framework) where c is the number of channels, h is the width of the input and w the width. You can see it as a c-channel hxw image.
The most intuitive example of such input is the input of the first convolution layer of your convolutional neural network: most likely an image of size hxw with c channels for example c=1 for greyscale or c=3 for RGB...
What's important is that for all pixels of that input, the values on each channel gives additional information on that pixel. Having three channels will give each pixel ('pixel' as in position in the 2D input space) a richer content than having a single. Since each pixel will be encoded with three values (three channels) vs. a single one (one channel). This kind of intuition about what channels represent can be extrapolated to a higher number of channels. As we said an input can have c channels.
Now going back to convolution layers, here is a good visualization. Imagine having a 5x5 1-channel input. And a convolution layer consisting of a single 3x3 filter (i.e. kernel_size=3)
input
filter
convolution
output
shape
(1, 5, 5)
(3, 3)
(3,3)
representation
Now keep in mind the dimension of the output will depend on the stride and padding of the convolution layer. Here the shape of the output is the same as the shape of the filter, it does not necessarily have to be! Take an input shape of (1, 5, 5), with the same convolution settings, you would end up with a shape of (4, 4) (which is different from the filter shape (3, 3).
Also, something to note is that if the input had more than one channel: shape (c, h, w), the filter would have to have the same number of channels. Each channel of the input would convolve with each channel of the filter and the results would be averaged into a single 2D feature map. So you would have an intermediate output of (c, 3, 3), which after averaging over the channels, would leave us with (1, 3, 3)=(3, 3). As a result, considering a convolution with a single filter, however many input channels there are, the output will always have a single channel.
From there what you can do is assemble multiple filters on the same layer. This means you define your layer as having k 3x3 filters. So a layer consists k filters. For the computation of the output, the idea is simple: one filter gives a (3, 3) feature map, so k filters will give k (3, 3) feature maps. These maps are then stacked into what will be the channel dimension. Ultimately, you're left with an output shape of... (k, 3, 3).
Let k_h and k_w, be the kernel height and kernel width respectively. And h', w' the height and width of one outputted feature map:
input
layer
output
shape
(c, h, w)
(k, c, k_h, k_w)
(k, h', w')
description
c-channel hxw feature map
k filters of shape (c, k_h, k_w)
k-channel h'xw' feature map
Back to your question:
Layers have 3 dimensions like 256x256x32. What is this third number? I assume the first two numbers are the number of nodes but I don't know what the depth is.
Convolution layers have four dimensions, but one of them is imposed by your input channel count. You can choose the size of your convolution kernel, and the number of filters. This number will determine is the number of channels of the output.
256x256 seems extremely high and you most likely correspond to the output shape of the feature map. On the other hand, 32 would be the number of channels of the output, which... as I tried to explain is the number of filters in that layer. Usually speaking the dimensions represented in visual diagrams for convolution networks correspond to the intermediate output shapes, not the layer shapes.
As an example, take the VGG neural network:
Very Deep Convolutional Networks for Large-Scale Image Recognition
Input shape for VGG is (3, 224, 224), knowing that the result of the first convolution has shape (64, 224, 224) you can determine there is a total of 64 filters in that layer.
As it turns out the kernel size in VGG is 3x3. So, here is a question for you: knowing there is a single bias parameter per filter, how many total parameters are in VGG's first convolution layer?
Sorry for the short answer, but when you have a digital image, you have 2 dimensions and then you often have 3 for the colors. The convolutional filter looks into parts of the picture with lower height/width dimensions and much more depth channels (in your case 32) to get more information. This is then fed into the neural network to learn.
I created the example in PyTorch to demonstrate the output you had:
import torch
import torch.nn as nn
bs=16
x = torch.randn(bs, 3, 256, 256)
c = nn.Conv2d(3,32,kernel_size=5,stride=1,padding=2)
out = c(x)
print(out.shape, out.shape[1])
Out:
torch.Size([16, 32, 256, 256]) 32
It's a real tensor inside. It may help.
You can play with a lot of convolution parameters.
I'm trying to understand how the adversarial generative network works: I found an example in the book by François Chollet (Deep learning with Python) in which there is an example of a GAN he uses CIFAR10 dataset, using the 'frog' class which contains 32x32 RGB images.
I can't understand why:
In (1) the input is transformed into 16 × 16 128-channel (why 128-channel?) feature map
In (2) when a convolution is performed, with which filter? It is not specified
Next, run another Conv2DTranspose and then another 3 Conv2d. Why?!
At the end, I have a 32 × 32 1-channel feature map.
from keras import layers
import numpy as np
latent_dim = 32
height = 32
width = 32
channels = 3
generator_input = keras.Input(shape=(latent_dim,))
(1)
x = layers.Dense(128 * 16 * 16)(generator_input)
x = layers.LeakyReLU()(x)
x = layers.Reshape((16, 16, 128))(x)
(2)
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2DTranspose(256, 4, strides=2, padding='same')(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(channels, 7, activation='tanh', padding='same')(x)
generator = keras.models.Model(generator_input, x)
generator.summary()
1)
It's an arbitrary choice, you could have chosen any number of channels for the Dense layer.
16x16 is picked since a stride of 2 is set to the Conv2DTranspose and you want to upsample your width and height to get an output of 32x32.
Strides are used to influence output size of convolution layers. In normal convolutions, outputs are downsampled by the same factor as strides, where in transposed convolutions they are upsampled by the same factor as strides.
For instance, you could change your first layer output to 8x8x128 and then use a stride of 4 in your Conv2DTranspose, this way you would get the same result in terms of dimensionality.
Also keep in mind that, as stated by François Chollet in his book, when using strided transposed convolutions, in order to avoid checkerboard artifacts caused by unequal coverage of the pixel space, kernel size should be divisible by its number of strides.
2) The first argument you set in Conv2D or Conv2DTranspose is the number of filters generated by a convolution layer.
As said before, the strided Conv2DTranspose is used exactly to upsample width and height by a factor equal to the number of strides.
The other 3 Conv2D are also arbitrary, you should determine them by experimentation and fine tuning your model.
for 1) i do not think there is a reason for the number of dense nodes used (128x16x16), however the 16x16 is set because you only have 1 layer to up sample 16x16 to 32x32.
for 2) the first argument 256 used to instantiate Conv2D defines the number of filters.
In regards to your last question Next, run another Conv2DTranspose and then another 3 Conv2d. Why?! I would recommend try increasing/decreasing the number of layers to get a feel on how the model behaves with those changes (performing better or not), this is part of the "hyper-parameter tuning" process when building a neural net.
Hope the above helps.
Let y = Relu(Wx) where W is a 2d matrix representing a linear transformation on x, a vector. Likewise, let m = Zy, where Z is a 2d matrix representing a linear transformation on y. How do I programmatically calculate the gradient of Loss = sum(m^2) with respect to W, where the power means take the element wise power of the resulting vector, and sum means adding all the elements together?
I can work this out slowly mathematically by taking a hypothetical, multiplying it all out, then element-by-element taking the derivative to construct the gradient, but I can't figure out an efficient approach to write a program once the neural network layer becomes >1.
Say, for just one layer (m = Zy, take gradient wrt Z) I could just say
Loss = sum(m^2)
dLoss/dZ = 2m * y
where * is the outer product of the vectors, and I guess this is kind of like normal calculus and it works. Now for 2 layers + activation (gradient wrt W), if I try to do it like "normal" calculus and apply the chain rule I get:
dLoss/dW = 2m * Z * dRelu * x
where dRelu is the derivative of Relu(Wx) except here I have no idea what * means in this case to make it work.
Is there an easy way to calculate this gradient mathematically without basically multiplying it all out and deriving each separate element in the gradient? I'm really unfamiliar with matrix calculus, so if anyone could also give some mathematical intuition, if my attempt is completely wrong, that would be appreciated.
For the sake of convenience, let's ignore the ReLU for a moment. You have an input space X (of some size [dimX]) mapped to an intermediate space Y (of some size [dimY]) mapped to an output space m (of some size [dimM]) You have, then, W: X → Y a matrix of shape [dimY, dimX] and Z: Y → m a matrix of shape [dimM, dimY]. Finally your loss is simply a function that maps your M space to a scalar value.
Let us walk the way backwards. As you correctly said, you want to compute the derivative of the loss w.r.t W and to do so you need to apply the chain rule all the way back. You then have:
dL/dW = dL/dm * dm/dY * dY/dW
dL/dm is of shape [dimm] (a scalar function with derivatives across dimm dimensions)
dm/dY is of shape [dimm, dimY] (an m-dimensional function with derivatives across dimY dimensions)
dY/dW is of shape [dimY, dimW] = [dimY, dimY, dimX] (a y-dimensional function with derivatives across [dimY, dimX] dimensions)
Edit:
To make the last bit more clear, Y consists of dimY different values, so Y can be treated as dimY constituent functions. We need to apply the gradient operator on each of those mini-functions, all with respect to the basis vectors defined by W. More concretely, if W = [[w11, w12], [w21, w22], [w31, w32]] and x = [x1, x2], then Y = [y1, y2, y3] = [w11x1 + w12x2, w21x1 + w22x2, w31x1 + w32x2]. Then W defines a 6d space (3x2) across which we need to differentiate. We have dY/dW = [dy1/dW, dy2/dW, dy3/dW], and also dy1/dW = [[dy1/dw11, dy1/dw12], [dy1/dw21, dy1/dw22], [dy1/dw31, dy1/dw32]] = [[x1,x2],[0,0],[0,0]], a 3x2 matrix. So dY/dW is a [3,3,2] tensor.
As for the multiplication part; the operation here is tensor contraction (essentially matrix multiplication in high dimension spaces). Practically, if you have a high-order tensor A[[a1, a2, a3... ], β] (i.e. a+1 dimensions, the last of which is of size β) and a tensor B[β, [b1, b2...]] (i.e. b+1 dimensions, the first of which is β), their tensor contraction is a matrix C[[a1,a2...], [b1,b2...]] (i.e. a+b dimensions, the β dimension contracted), where C is obtained by summing over element-wise across the shared dimension β (refer to https://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html#numpy.tensordot).
The resulting tensor contraction then is a matrix of shape [dimY, dimX] which can be used to update your W weights. The ReLU which we ignored earlier can easily be thrown in the mix, since ReLU: 1 → 1 is a scalar function applied element-wise on Y.
To summarize, your code would be:
W_gradient = 2m * np.dot(Z, x) * np.e**x/(1+np.e**x))
I just implemented several multiplier neural networks(MLP) from scratch in C++[1], and I think I know what's your pain. And believe me, you don't even need any third party matrix/tensor/automatic differentiation(AD) libraries to do the matrix multiplication or gradient calculation. There are three things you should pay attention to:
There are two kinds of multiplication in the equations: matrix multiplication, and elementwise multiplication, you'll mess up if you denoted them all as a single *.
Use concrete examples, especially concrete numbers as dimensions of your data/matrix/vector to build intuition.
The most powerful tool for programming correctly is dimension compatibility, always don't forget to check dimensions.
Suppose your want to do binary classification and the neural network is input -> h1 -> sigmoid -> h2 -> sigmoid -> loss in which input layer has 1 sample each has 2 features, h1 has 7 neurons, and h2 has 2 neurons. Then:
forward pass:
Z1(1, 7) = X(1, 2) * W1(2, 7)
A1(1, 7) = sigmoid(Z1(1, 7))
Z2(1, 2) = A1(1, 7) * W2(7, 2)
A2(1, 2) = sigmoid(Z2(1, 2))
Loss = 1/2(A2 - label)^2
backward pass:
dA2(1, 2) = dL/dA2 = A2 - label
dZ2(1, 2) = dL/dZ2 = dA2 * dsigmoid(A2_i) -- element wise
dW2(7, 2) = A1(1, 7).T * dZ2(1, 2) -- matrix multiplication
Notice the last equation, the dimension of the gradient of W2 should match W2, which is (7, 2). And the only way to get a (7, 2) matrix is to transpose input A1 and multiply A1 with dZ2, that's dimension compatibility[2].
backward pass continued:
dA1(1, 7) = dZ2(1, 2) * A1(2, 7) -- matrix multiplication
dZ1(1, 7) = dA1(1, 7) * dsigmoid(A1_i) -- element wise
dW1(2, 7) = X.T(2, 1) * dZ1(1, 7) -- matrix multiplication
[1] The code is here, you can see the hidden layer implementation, naive matrix implementation and the references listed there.
[2] I omit the matrix derivation part, it's simple actually but hard to type the equations out. I strongly suggest you read this paper, every tiny detail you should know on matrix derivation in DL is listed in this paper.
[3] One sample as input is used in the above example(as a vector), you can substitute 1 with any batch numbers(become matrix), and the equations still hold.
I'm following the TensorFlow tutorial
Initially x is defined as
x = tf.placeholder(tf.float32, shape=[None, 784])
Later on it reshapes x, I'm trying to understand why.
To apply the layer, we first reshape x to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels.
x_image = tf.reshape(x, [-1,28,28,1])
What does -1 mean in the reshaping vector and why is x being reshaped?
1) What does -1 mean in the reshaping vector
From the documentation of reshape:
If one component of shape is the special value -1, the size of that
dimension is computed so that the total size remains constant. In
particular, a shape of [-1] flattens into 1-D. At most one component
of shape can be -1.
this is a standard feature and is available in numpy as well. Basically it means - I do not have time to calculate all the dimensions, so infer the one for me. In your case because x * 28 * 28 * 1 = 784 so your -1 = 1
2) Why is x being reshaped
They are planning to use convolution for image classification. So they need to use some spatial information. Current data is 1 dimensional. So they transform it to 4 dimensions. I do not know the point of the forth dimension because in my opinion they might have used only (x, y, color). Or even (x, y). Try to modify their reshape and convolution and most probably you will get similar accuracy.
why 4 dimensions
TensorFlow’s convolutional conv2d operation expects a 4-dimensional tensor with dimensions corresponding to batch, width, height and channel.
[batch, in_height, in_width, in_channels]
The Keras tutorial gives the following code example (with comments):
# apply a convolution 1d of length 3 to a sequence with 10 timesteps,
# with 64 output filters
model = Sequential()
model.add(Convolution1D(64, 3, border_mode='same', input_shape=(10, 32)))
# now model.output_shape == (None, 10, 64)
I am confused about the output size. Shouldn't it create 10 timesteps with a depth of 64 and a width of 32 (stride defaults to 1, no padding)? So (10,32,64) instead of (None,10,64)
In k-Dimensional convolution you will have a filters which will somehow preserve a structure of first k-dimensions and will squash the information from all other dimension by convoluting them with a filter weights. So basically every filter in your network will have a dimension (3x32) and all information from the last dimension (this one with size 32) will be squashed to a one real number with the first dimension preserved. This is the reason why you have a shape like this.
You could imagine a similar situation in 2-D case when you have a colour image. Your input will have then 3-dimensional structure (picture_length, picture_width, colour). When you apply the 2-D convolution with respect to your first two dimensions - all information about colours will be squashed by your filter and will no be preserved in your output structure. The same as here.