Mistake again? Verifying ZFNet layers' input-output dimensions - machine-learning

As mentioned in one of the lectures of cs231n, there were some calculation errors in the AlexNet architecture: the initial size of the image has to be 227x227 instead of the 224x224 mentioned in the paper. I wanted to know whether there is a similar problem in the ZFNet paper as well.
In the given figure (from the ZFNet paper) the initial size of the image is again 224x224, so if we use a 2D convolution layer with 96 filters of size (7x7) and stride (2,2), the size of the result should be (224-7)/2 + 1 = 109.5, but if we take the initial image size to be 225x225 then we get exactly 110. Moreover, I feel there is a similar problem at the first max-pool layer: its input is 110x110x96 and the pooling size is (3x3) with stride 2, so the size of the output should be (110-3)/2 + 1 = 54.5, which is again not an integer. I want to know whether I am doing the calculations right or whether there is a problem with the values given in the paper.
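To make the calculation explicit, here it is with the usual output-size formula (n - k)/s + 1 for no padding (conv_out is just a helper name I made up):

def conv_out(n, k, s):
    # spatial output size for input size n, kernel size k, stride s, no padding
    return (n - k) / s + 1

print(conv_out(224, 7, 2))   # 109.5 -- not an integer
print(conv_out(225, 7, 2))   # 110.0 -- an integer only with a 225x225 input
print(conv_out(110, 3, 2))   # 54.5  -- the max-pool case, again not an integer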

A PyTorch implementation suggests that you need to use padding:
self.conv1 = nn.Conv2d(3, 96, 7, stride=2, padding=2)
self.conv2 = nn.Conv2d(96, 256, 5, padding=2)
self.conv3 = nn.Conv2d(256, 384, 3, padding=1)
self.conv4 = nn.Conv2d(384, 384, 3, padding=1)
self.conv5 = nn.Conv2d(384, 256, 3, padding=1)
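With that padding the first layer produces an integer-sized output; here is a quick sanity check with a dummy 224x224 batch (just a sketch, not the full network):

import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 96, 7, stride=2, padding=2)
x = torch.randn(1, 3, 224, 224)   # dummy RGB image batch
print(conv1(x).shape)             # torch.Size([1, 96, 111, 111]) -- an integer spatial size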

Hi, you are using the ZFNet architecture diagram. ZFNet is similar to AlexNet but has a smaller filter size of 7x7 with a stride of 2. There is no calculation error; the paper just rounds up the values.

Related

Working of embedding layer in Tensorflow

Can someone please explain the inputs and outputs, along with the working of the layer mentioned below?
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
total_words = 263
max_sequence_len=11
Is 64 the number of dimensions?
And why is the output of this layer (None, 10, 64)?
Shouldn't it be a 64-dimensional vector for each word, i.e. (None, 263, 64)?
You can find all the information about the Embedding Layer of Tensorflow Here.
The first two parameters are input_dimension and output_dimension.
The input dimension basically represents the vocabulary size of your model. You can find this out by using the word_index attribute of the Tokenizer().
The output dimension is going to be the dimension of the input to the next Dense layer.
The output of the Embedding layer is of the form (batch_size, input_length, output_dim). But since you specified the input_length parameter, your layer's input will be of the form (batch, input_length). That's why the output is of the form (None, 10, 64).
Hope that clears up your doubt ☺️
In the Embedding layer the first argument represents the input dimension (which is typically of considerable dimensionality). The second argument represents the output dimension, a.k.a. the dimensionality of the reduced vector. The third argument is for the sequence length. In essence, an Embedding layer is simply learning a lookup table of shape (input dim, output dim); the weights of this layer reflect that shape. The output of the layer, however, will of course be of shape (seq length, output dim) for each batch element: one dimensionality-reduced embedding vector for each element in the input sequence. The shape you were expecting is actually the shape of the weights of an embedding layer.
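A small sketch of both shapes, using the sizes from the question (the random tokens are only placeholders):

import numpy as np
import tensorflow as tf

emb = tf.keras.layers.Embedding(input_dim=263, output_dim=64)
tokens = np.random.randint(0, 263, size=(2, 10))   # a batch of 2 sequences, 10 tokens each
print(emb(tokens).shape)        # (2, 10, 64) -- one 64-dimensional vector per token
print(emb.weights[0].shape)     # (263, 64)   -- the lookup table, the shape you were expecting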

How to decide how many convolutions and deconvolutions to apply in a GAN?

I'm trying to understand how a generative adversarial network works. I found an example in the book by François Chollet (Deep Learning with Python) of a GAN that uses the CIFAR10 dataset, using the 'frog' class, which contains 32x32 RGB images.
I can't understand why:
In (1) the input is transformed into a 16 × 16, 128-channel feature map (why 128 channels?)
In (2) a convolution is performed, but with which filter? It is not specified.
Next, another Conv2DTranspose is run and then another 3 Conv2D. Why?!
At the end, I have a 32 × 32, 3-channel feature map (channels = 3 in the code).
import keras
from keras import layers
import numpy as np
latent_dim = 32
height = 32
width = 32
channels = 3
generator_input = keras.Input(shape=(latent_dim,))
# (1) project the latent vector and reshape it into a 16x16, 128-channel feature map
x = layers.Dense(128 * 16 * 16)(generator_input)
x = layers.LeakyReLU()(x)
x = layers.Reshape((16, 16, 128))(x)
# (2) a 5x5 convolution with 256 filters (padding='same' keeps the 16x16 size)
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2DTranspose(256, 4, strides=2, padding='same')(x)  # upsamples 16x16 -> 32x32
x = layers.LeakyReLU()(x)
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(channels, 7, activation='tanh', padding='same')(x)
generator = keras.models.Model(generator_input, x)
generator.summary()
1)
It's an arbitrary choice; you could have chosen any number of channels for the Dense layer.
16x16 is picked since a stride of 2 is set on the Conv2DTranspose and you want to upsample your width and height to get an output of 32x32.
Strides are used to influence the output size of convolution layers. In normal convolutions, outputs are downsampled by the same factor as the strides, whereas in transposed convolutions they are upsampled by the same factor as the strides.
For instance, you could change your first layer output to 8x8x128 and then use a stride of 4 in your Conv2DTranspose; this way you would get the same result in terms of dimensionality.
Also keep in mind that, as stated by François Chollet in his book, when using strided transposed convolutions, in order to avoid checkerboard artifacts caused by unequal coverage of the pixel space, the kernel size should be divisible by the stride.
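A minimal sketch of that alternative, mirroring the layer sizes in the question's code (kernel size 4 so it stays divisible by the stride of 4):

import keras
from keras import layers

latent_dim = 32
inp = keras.Input(shape=(latent_dim,))
x = layers.Dense(128 * 8 * 8)(inp)
x = layers.LeakyReLU()(x)
x = layers.Reshape((8, 8, 128))(x)
x = layers.Conv2DTranspose(256, 4, strides=4, padding='same')(x)  # upsamples 8x8 -> 32x32 in one step
print(keras.Model(inp, x).output_shape)   # (None, 32, 32, 256)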
2) The first argument you set in Conv2D or Conv2DTranspose is the number of filters generated by a convolution layer.
As said before, the strided Conv2DTranspose is used exactly to upsample width and height by a factor equal to the number of strides.
The other 3 Conv2D layers are also arbitrary; you should determine them by experimentation and by fine-tuning your model.
For 1), I do not think there is a reason for the number of dense nodes used (128x16x16); however, 16x16 is set because you only have 1 layer to upsample 16x16 to 32x32.
For 2), the first argument 256 used to instantiate Conv2D defines the number of filters.
Regarding your last question ("Next, run another Conv2DTranspose and then another 3 Conv2D. Why?!"), I would recommend trying to increase/decrease the number of layers to get a feel for how the model behaves with those changes (performing better or not); this is part of the "hyper-parameter tuning" process when building a neural net.
Hope the above helps.

What is "linear projection" in convolutional neural network [closed]

I am reading through the Residual Learning paper, and I have a question.
What is the "linear projection" mentioned in section 3.2? It looks pretty simple once you get it, but I could not grasp the idea...
Can someone provide a simple example?
First up, it's important to understand what x, y and F are and why they need any projection at all. I'll try to explain in simple terms, but a basic understanding of ConvNets is required.
x is the input data (a tensor) of the layer; in the case of ConvNets its rank is 4. You can think of it as a 4-dimensional array. F is usually a conv layer (conv+relu+batchnorm in this paper), and y combines the two together (forming the output channel). The result of F is also of rank 4, and most of its dimensions will be the same as in x, except for one. That's exactly what the transformation should patch.
For example, x shape might be (64, 32, 32, 3), where 64 is the batch size, 32x32 is the image size and 3 stands for the (R, G, B) color channels. F(x) might be (64, 32, 32, 16): the batch size never changes and, for simplicity, the ResNet conv layer doesn't change the image size either, but it will likely use a different number of filters - 16.
So, in order for y=F(x)+x to be a valid operation, x must be "reshaped" from (64, 32, 32, 3) to (64, 32, 32, 16).
I'd like to stress here that "reshaping" here is not what numpy.reshape does.
Instead, x[3] (the channel dimension) is padded with 13 zeros, like this:
pad(x=[1, 2, 3], padding=[7, 6]) = [0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 0]
If you think about it, this is a projection of a 3-dimensional vector onto 16 dimensions. In other words, we start to think that our vector is the same, but there are 13 more dimensions out there. None of the other x dimensions are changed.
Here's the link to the code in Tensorflow that does this.
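Here is a tiny sketch of that zero-padding shortcut, written with NumPy rather than TensorFlow for brevity (the shapes are the ones from the example above):

import numpy as np

x = np.ones((64, 32, 32, 3))                              # (batch, height, width, channels)
x_padded = np.pad(x, ((0, 0), (0, 0), (0, 0), (7, 6)))    # pad only the channel axis: 7 zeros before, 6 after
print(x_padded.shape)                                     # (64, 32, 32, 16), so F(x) + x_padded is now well defined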
A linear projection is one where each new feature is simply a weighted sum of the original features. As in the paper, this can be represented by matrix multiplication. If x is the vector of N input features and W is an M-by-N matrix, then the matrix product Wx yields M new features where each one is a linear projection of x. Each row of W is a set of weights that defines one of the M linear projections (i.e., each row of W contains the coefficients for one of the weighted sums of x).
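A two-line illustration of that definition (the sizes N and M here are arbitrary); in a ConvNet, applying such a W independently at every spatial position is exactly what a 1x1 convolution does:

import numpy as np

N, M = 3, 16
x = np.random.randn(N)        # N original features
W = np.random.randn(M, N)     # each row holds the weights of one linear projection
y = W @ x                     # M new features, each a weighted sum of x
print(y.shape)                # (16,)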
In PyTorch (in particular torchvision/models/resnet.py), at the end of a Bottleneck you will have one of two scenarios:
1) The number of channels of the input x, say x_c (not the spatial resolution, but the channels), does not match the number of channels output after layer conv3 of the Bottleneck, say d. This is handled by a 1-by-1 convolution with in_planes = x_c and out_planes = d, with stride 1, followed by batch normalization, and then the addition F(x) + x occurs, assuming x and F(x) have the same spatial resolution.
2) Both the spatial resolution of x and its number of channels don't match the output of the Bottleneck layer, in which case the 1-by-1 convolution mentioned above needs to have stride 2 in order for both the spatial resolution and the number of channels to match for the element-wise addition (again with batch normalization of x before the addition).
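A rough sketch of that projection shortcut; the tensor sizes below are made up but typical of the first Bottleneck of a stage that also halves the resolution:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)             # input to the block
Fx = torch.randn(1, 256, 28, 28)           # pretend output of conv1-conv3: more channels, half the resolution
downsample = nn.Sequential(
    nn.Conv2d(64, 256, kernel_size=1, stride=2, bias=False),   # the 1x1 convolution projection
    nn.BatchNorm2d(256),
)
y = Fx + downsample(x)                     # element-wise addition now matches shapes
print(y.shape)                             # torch.Size([1, 256, 28, 28])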

Why is the x variable tensor reshaped with -1 in the MNIST tutorial for tensorflow?

I'm following the TensorFlow tutorial
Initially x is defined as
x = tf.placeholder(tf.float32, shape=[None, 784])
Later on it reshapes x, and I'm trying to understand why.
To apply the layer, we first reshape x to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels.
x_image = tf.reshape(x, [-1,28,28,1])
What does -1 mean in the reshaping vector and why is x being reshaped?
1) What does -1 mean in the reshaping vector
From the documentation of reshape:
If one component of shape is the special value -1, the size of that dimension is computed so that the total size remains constant. In particular, a shape of [-1] flattens into 1-D. At most one component of shape can be -1.
This is a standard feature and is available in numpy as well. Basically it means: I do not have time to calculate all the dimensions, so infer that one for me. In your case, because each row of x has 28 * 28 * 1 = 784 elements, the -1 is inferred to be the number of rows in x, i.e. the batch size.
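A quick illustration of the inference; the same thing works in NumPy (the batch of 5 flat "images" is made up):

import numpy as np

x = np.zeros((5, 784))               # 5 flattened 28x28 images
x_image = x.reshape(-1, 28, 28, 1)   # -1 is inferred: 5 * 784 / (28 * 28 * 1) = 5
print(x_image.shape)                 # (5, 28, 28, 1)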
2) Why is x being reshaped
They are planning to use convolution for image classification, so they need some spatial information; the current data is 1-dimensional, so they transform it to 4 dimensions. I do not see the point of the fourth dimension, because in my opinion they might have used only (x, y, color), or even (x, y). Try modifying their reshape and convolution and most probably you will get similar accuracy.
why 4 dimensions
TensorFlow's convolutional conv2d operation expects a 4-dimensional tensor with dimensions corresponding to batch, height, width and channel:
[batch, in_height, in_width, in_channels]
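For example, a quick shape check in TensorFlow 2 eager mode (the zero-valued image and filters are only there to show the shapes):

import tensorflow as tf

x_image = tf.zeros([1, 28, 28, 1])      # [batch, in_height, in_width, in_channels]
W = tf.zeros([5, 5, 1, 32])             # [filter_height, filter_width, in_channels, out_channels]
y = tf.nn.conv2d(x_image, W, strides=[1, 1, 1, 1], padding='SAME')
print(y.shape)                          # (1, 28, 28, 32)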

Output dimensions of convolutional layer with Keras

The Keras tutorial gives the following code example (with comments):
# apply a convolution 1d of length 3 to a sequence with 10 timesteps,
# with 64 output filters
model = Sequential()
model.add(Convolution1D(64, 3, border_mode='same', input_shape=(10, 32)))
# now model.output_shape == (None, 10, 64)
I am confused about the output size. Shouldn't it create 10 timesteps with a depth of 64 and a width of 32 (stride defaults to 1, no padding)? So (10, 32, 64) instead of (None, 10, 64)?
In a k-dimensional convolution you have filters which preserve the structure of the first k dimensions and squash the information from all other dimensions by convolving them with the filter weights. So basically every filter in your network has shape (3 x 32), and all information from the last dimension (the one with size 32) is squashed into a single real number, with the first dimension preserved. This is the reason why you get a shape like this.
You could imagine a similar situation in the 2-D case when you have a colour image. Your input will then have a 3-dimensional structure (picture_length, picture_width, colour). When you apply the 2-D convolution with respect to your first two dimensions, all information about colours will be squashed by your filter and will not be preserved in your output structure. The same happens here.
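The same example in current Keras syntax, where padding='same' replaces the old border_mode='same', just to confirm the shapes (a sketch only):

import keras
from keras import layers

conv = layers.Conv1D(64, 3, padding='same')
model = keras.Sequential([keras.Input(shape=(10, 32)), conv])
print(model.output_shape)      # (None, 10, 64)
print(conv.kernel.shape)       # (3, 32, 64) -- 64 filters, each of shape (3, 32)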
