Output dimensions of convolutional layer with Keras - machine-learning

The Keras tutorial gives the following code example (with comments):
# apply a convolution 1d of length 3 to a sequence with 10 timesteps,
# with 64 output filters
model = Sequential()
model.add(Convolution1D(64, 3, border_mode='same', input_shape=(10, 32)))
# now model.output_shape == (None, 10, 64)
I am confused about the output size. Shouldn't it create 10 timesteps with a depth of 64 and a width of 32 (stride defaults to 1, no padding)? So (10, 32, 64) instead of (None, 10, 64)?

In k-dimensional convolution, each filter preserves the structure of the first k dimensions and squashes the information from the remaining dimension by convolving it with the filter weights. So every filter in your network has shape (3, 32), and all information from the last dimension (the one of size 32) is collapsed into a single real number per position, while the first dimension (the 10 timesteps) is preserved. This is why you get this output shape.
You can imagine a similar situation in the 2-D case with a colour image. Your input then has a three-dimensional structure (picture_height, picture_width, colour). When you apply a 2-D convolution over the first two dimensions, all information about colours is squashed by the filter and is not preserved as a separate axis in the output structure. The same thing happens here.
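For reference, here is a minimal sketch with the current Keras API (Conv1D and padding='same' replace the older Convolution1D and border_mode) that reproduces the shape in question; the tensorflow.keras import path is an assumption about your setup:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(10, 32)),           # 10 timesteps, 32 features per timestep
    layers.Conv1D(64, 3, padding='same'),  # 64 filters, each with weights of shape (3, 32)
])
print(model.output_shape)  # (None, 10, 64): the 32 input channels are squashed, 64 new channels appear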

Related

Understanding convolutional layers shapes

I've been reading about convolutional nets and I've programmed a few models myself. In visual diagrams of other models, each layer is shown as smaller and deeper than the previous one. Layers have three dimensions, like 256x256x32. What is this third number? I assume the first two numbers are the number of nodes, but I don't know what the depth is.
TLDR; 256x256x32 refers to the layer's output shape rather than the layer itself.
There are many articles and posts out there explaining how convolution layers work. I'll try to answer your question without going into too many details, just focusing on shapes.
Assuming you are working with 2D convolution layers, your input and output will both be three-dimensional (not counting the batch, which would add a 4th axis). The shape of a convolution layer's input is therefore (c, h, w) (or (h, w, c), depending on the framework), where c is the number of channels, h the height of the input and w its width. You can see it as a c-channel hxw image.
The most intuitive example of such an input is the input of the first convolution layer of your convolutional neural network: most likely an image of size hxw with c channels, for example c=1 for greyscale or c=3 for RGB.
What's important is that for every pixel of that input, the values on each channel give additional information about that pixel. Having three channels gives each pixel ('pixel' as in position in the 2D input space) richer content than having a single one, since each pixel is encoded with three values (three channels) rather than one (one channel). This intuition about what channels represent extends to any number of channels; as we said, an input can have c channels.
Now going back to convolution layers, here is a good visualization. Imagine having a 5x5 1-channel input and a convolution layer consisting of a single 3x3 filter (i.e. kernel_size=3):
shape: input (1, 5, 5), filter (3, 3), convolution output (3, 3)
Now keep in mind that the dimensions of the output depend on the stride and padding of the convolution layer. Here the shape of the output happens to equal the shape of the filter, but it does not have to: take an input of shape (1, 6, 6) with the same convolution settings and you would end up with an output of shape (4, 4) (which is different from the filter shape (3, 3)).
Also, note that if the input has more than one channel, say shape (c, h, w), the filter has to have the same number of channels. Each channel of the input convolves with the corresponding channel of the filter, and the results are summed into a single 2D feature map. So you get an intermediate result of shape (c, 3, 3) which, after summing over the channels, leaves us with (1, 3, 3), i.e. (3, 3). As a result, for a convolution with a single filter, however many input channels there are, the output will always have a single channel.
From there, what you can do is assemble multiple filters in the same layer. This means you define your layer as having k 3x3 filters, so the layer consists of k filters. The computation of the output is simple: one filter gives a (3, 3) feature map, so k filters give k (3, 3) feature maps. These maps are then stacked into what becomes the channel dimension. Ultimately, you're left with an output shape of... (k, 3, 3).
Let k_h and k_w be the kernel height and kernel width respectively, and h', w' the height and width of one output feature map:
shape: input (c, h, w), layer weights (k, c, k_h, k_w), output (k, h', w')
description: the input is a c-channel hxw feature map, the layer holds k filters of shape (c, k_h, k_w), and the output is a k-channel h'xw' feature map
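As a quick check of those shapes, here is a small sketch using PyTorch (which follows the channels-first (c, h, w) convention used above); the specific numbers are just an illustration:
import torch
import torch.nn as nn

c, h, w = 3, 5, 5        # input: 3-channel 5x5 feature map
k, k_h, k_w = 16, 3, 3   # layer: k=16 filters, each of shape (c, k_h, k_w)

conv = nn.Conv2d(in_channels=c, out_channels=k, kernel_size=(k_h, k_w))
x = torch.randn(1, c, h, w)  # batch of one

print(conv.weight.shape)  # torch.Size([16, 3, 3, 3]) == (k, c, k_h, k_w)
print(conv(x).shape)      # torch.Size([1, 16, 3, 3]) == (batch, k, h', w')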
Back to your question:
Layers have 3 dimensions like 256x256x32. What is this third number? I assume the first two numbers are the number of nodes but I don't know what the depth is.
Convolution layer weights have four dimensions, but one of them is imposed by your input channel count. You can choose the size of your convolution kernel and the number of filters; that last number determines the number of channels of the output.
256x256 seems extremely high and most likely corresponds to the spatial size of the output feature map. On the other hand, 32 would be the number of channels of the output, which, as I tried to explain, is the number of filters in that layer. Generally speaking, the dimensions shown in visual diagrams of convolutional networks correspond to the intermediate output shapes, not the layer (weight) shapes.
As an example, take the VGG neural network:
Very Deep Convolutional Networks for Large-Scale Image Recognition
The input shape for VGG is (3, 224, 224); knowing that the result of the first convolution has shape (64, 224, 224), you can determine that there are 64 filters in that layer.
As it turns out the kernel size in VGG is 3x3. So, here is a question for you: knowing there is a single bias parameter per filter, how many total parameters are in VGG's first convolution layer?
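If you want to check your answer, one quick way (assuming torchvision is installed and using its standard VGG-16 definition, where the first convolution sits at features[0]) is to count that layer's parameters directly:
import torch
from torchvision import models

vgg = models.vgg16()          # untrained VGG-16
first_conv = vgg.features[0]  # Conv2d(3, 64, kernel_size=3, padding=1)

# weight tensor is (out_channels, in_channels, k_h, k_w), plus one bias per filter
print(first_conv.weight.shape, first_conv.bias.shape)
print(sum(p.numel() for p in first_conv.parameters()))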
Sorry for the short answer, but a digital image has two spatial dimensions and then often a third for the colours. A convolutional filter looks at patches of the picture with smaller height/width but produces many more depth channels (in your case 32) to extract more information. This is then fed into the rest of the neural network to learn from.
I created the example in PyTorch to demonstrate the output you had:
import torch
import torch.nn as nn

bs = 16
x = torch.randn(bs, 3, 256, 256)  # batch of 16 RGB 256x256 images
c = nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=2)  # 32 filters; padding keeps 256x256
out = c(x)
print(out.shape, out.shape[1])
Out:
torch.Size([16, 32, 256, 256]) 32
The output is a real tensor you can inspect; it may help.
You can play with a lot of convolution parameters.

Working of embedding layer in Tensorflow

Can someone please explain the inputs and outputs, along with the working, of the layer mentioned below?
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
total_words = 263
max_sequence_len=11
Is 64 the number of dimensions?
And why is the output of this layer (None, 10, 64)?
Shouldn't it be a 64-dimensional vector for each word, i.e. (None, 263, 64)?
You can find all the information about the Embedding Layer of Tensorflow Here.
The first two parameters are input_dim and output_dim.
The input dimension basically represents the vocabulary size of your model. You can find it with the word_index attribute of the Tokenizer() object.
The output dimension is the size of each embedding vector, and it becomes the dimension of the input to the next (Dense) layer.
The output of the Embedding layer is of the form (batch_size, input_length, output_dim). Since you specified input_length = max_sequence_len - 1 = 10, your layer's input is of the form (batch, 10), and that is why the output is of the form (None, 10, 64).
Hope that clears up your doubt ☺️
In the Embedding layer the first argument represents the input dimension (which is typically of considerable size). The second argument represents the output dimension, a.k.a. the dimensionality of the reduced vector. The third argument is the sequence length. In essence, an Embedding layer simply learns a lookup table of shape (input dim, output dim), and the weights of this layer reflect that shape. The output of the layer, however, will be of shape (seq length, output dim) per sample: one dimensionality-reduced embedding vector for each element in the input sequence. The shape you were expecting corresponds to the shape of the weights of the embedding layer, (263, 64), not to its output.
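A minimal sketch (reusing the values from the question, total_words = 263 and max_sequence_len = 11, and assuming tf.keras) that shows both the weight shape and the output shape:
from tensorflow import keras
from tensorflow.keras import layers

total_words = 263
max_sequence_len = 11

emb = layers.Embedding(total_words, 64)
model = keras.Sequential([
    keras.Input(shape=(max_sequence_len - 1,)),  # 10 integer token ids per sample
    emb,
])

print(emb.get_weights()[0].shape)  # (263, 64): the learned lookup table
print(model.output_shape)          # (None, 10, 64): one 64-d vector per token in the sequence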

How to decide how many convolutions e deconvolutions apply to a GAN?

I'm trying to understand how a generative adversarial network works. I found an example in the book by François Chollet (Deep Learning with Python) of a GAN that uses the CIFAR10 dataset, specifically the 'frog' class, which contains 32x32 RGB images.
I can't understand why:
In (1) the input is transformed into 16 × 16 128-channel (why 128-channel?) feature map
In (2) when a convolution is performed, with which filter? It is not specified
Next, run another Conv2DTranspose and then another 3 Conv2d. Why?!
At the end, I have a 32 × 32 1-channel feature map.
import keras
from keras import layers
import numpy as np
latent_dim = 32
height = 32
width = 32
channels = 3
generator_input = keras.Input(shape=(latent_dim,))
# (1)
x = layers.Dense(128 * 16 * 16)(generator_input)
x = layers.LeakyReLU()(x)
x = layers.Reshape((16, 16, 128))(x)
# (2)
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2DTranspose(256, 4, strides=2, padding='same')(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(channels, 7, activation='tanh', padding='same')(x)
generator = keras.models.Model(generator_input, x)
generator.summary()
1)
It's an arbitrary choice, you could have chosen any number of channels for the Dense layer.
16x16 is picked since a stride of 2 is set on the Conv2DTranspose and you want to upsample your width and height to get an output of 32x32.
Strides are used to influence output size of convolution layers. In normal convolutions, outputs are downsampled by the same factor as strides, where in transposed convolutions they are upsampled by the same factor as strides.
For instance, you could change your first layer output to 8x8x128 and then use a stride of 4 in your Conv2DTranspose, this way you would get the same result in terms of dimensionality.
Also keep in mind that, as stated by François Chollet in his book, when using strided transposed convolutions, in order to avoid checkerboard artifacts caused by unequal coverage of the pixel space, kernel size should be divisible by its number of strides.
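For instance, here is a small sketch (using tf.keras; the kernel sizes are chosen only to respect that divisibility rule) comparing the two options above, upsampling 16x16 with stride 2 versus 8x8 with stride 4:
from tensorflow import keras
from tensorflow.keras import layers

x16 = keras.Input(shape=(16, 16, 128))
y16 = layers.Conv2DTranspose(256, 4, strides=2, padding='same')(x16)  # kernel 4, stride 2

x8 = keras.Input(shape=(8, 8, 128))
y8 = layers.Conv2DTranspose(256, 8, strides=4, padding='same')(x8)    # kernel 8, stride 4

print(y16.shape, y8.shape)  # both (None, 32, 32, 256)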
2) The first argument you set in Conv2D or Conv2DTranspose is the number of filters generated by a convolution layer.
As said before, the strided Conv2DTranspose is used exactly to upsample width and height by a factor equal to the number of strides.
The other 3 Conv2D are also arbitrary, you should determine them by experimentation and fine tuning your model.
For 1), I do not think there is a particular reason for the number of dense units used (128x16x16); however, the 16x16 is set because you only have one strided layer to upsample from 16x16 to 32x32.
For 2), the first argument (256) used to instantiate Conv2D defines the number of filters.
Regarding your last question ("Next, run another Conv2DTranspose and then another 3 Conv2d. Why?!"): I would recommend trying to increase/decrease the number of layers to get a feel for how the model behaves with those changes (performing better or not). This is part of the "hyper-parameter tuning" process when building a neural net.
Hope the above helps.

what's the difference between tf.nn.conv2d with strides = 2 and tf.nn.max_pool with 2x2 pooling?

As mentioned above, both
tf.nn.conv2d with strides = 2
and
tf.nn.max_pool with 2x2 pooling
can reduce the size of the input to half, and I know the outputs may be different, but what I don't know is whether that affects the final training result or not. Any clue about this? Thanks.
In both your examples, assume we have a [height, width] kernel applied with strides [2, 2]. That means we apply the kernel to a 2-D window of size [height, width] on the 2-D input to get an output value, and then slide the window over by 2, horizontally or vertically, to get the next output value.
In both cases you end up with 4x fewer outputs than inputs (2x fewer in each dimension) assuming padding='SAME'
The difference is how the output values are computed for each window:
conv2d:
- the output is a linear combination of the input values, with a weight for each cell of the [height, width] kernel
- these weights become trainable parameters in your model
max_pool:
- the output is simply the maximum input value within the [height, width] window of input values
- there are no weights and no trainable parameters introduced by this operation
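A quick shape check of both operations (a minimal tf.nn sketch; the 8x8 input and random values are only for illustration):
import tensorflow as tf

x = tf.random.normal([1, 8, 8, 1])  # NHWC input
w = tf.random.normal([2, 2, 1, 1])  # [height, width, in_channels, out_channels] kernel

conv = tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding='SAME')                      # learned weights
pool = tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')   # no weights

print(conv.shape, pool.shape)  # both (1, 4, 4, 1): 2x fewer outputs in each spatial dimension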
The final training results could actually be different: the strided convolution multiplies the tensor by a filter, which you might not want, since it takes extra computation time and the additional weights can also make your model more prone to overfitting.

LSTM Autoencoder for music - Keras [Sequence to sequence]

So, I'm trying to learn fixed vector representations for segments of about 200 songs (~ 3-5 minutes per song) and wanted to use an LSTM-based Sequence-to-sequence Autoencoder for it.
I'm preprocessing the audio (using librosa) as follows:
I'm first just getting a raw audio signal time series of shape around (1500000,) - (2500000,) per song.
I'm then slicing each raw time series into segments and getting a lower-level mel spectrogram matrix of shape (512, 3000) - (512, 6000) per song. Each of these (512,) vectors can be referred to as 'mini-songs' as they represent parts of the song.
I vertically stack all these mini-songs of all the songs together to create the training data (let's call this X). X turns out to be (512, 600000) in size, where the first dimension (512) is the window size and the second dimension (600000) is the total number of 'mini-songs' in the dataset.
Which is to say, there are about 600000 mini-songs in X - each column in X represents a mini-song of length (512,).
Each of these (512,) mini-song vectors should be encoded into a (50,) vector per mini-song i.e. we will have 600000 (50,) vectors at the end of the process.
In more standard terminology, I have 600000 training samples, each of length 512. [Think of this as being similar to an image dataset: 600000 images, each of length 784, where the images are of resolution 28x28. Except in my case I want to treat the 512-length samples as sequences that have temporal properties.]
I read the example here and was looking to extend that for my use case. I was wondering what the timesteps and input_dim parameters to the Input layer should be set to.
I'm setting timesteps = X.shape[0] (i.e. 512 in this case) and input_dim = X.shape[1] (i.e 600000). Is this the correct way to go about it?
Edit: Added clarifications above.
Your input is actually a 1D sequence not a 2D image.
The input tensor will be (600000, 512, 1) and you need to set the input_dim to 1 and the timesteps to 512.
The shape passed to the Input layer does not include the first (batch) dimension of the tensor (i.e. 600000 in your case).
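A minimal sketch of how those shapes fit together, loosely following the seq2seq autoencoder pattern from the example referenced above (the latent size of 50 comes from the question; everything else is an untuned illustration):
from tensorflow import keras
from tensorflow.keras import layers

timesteps, input_dim, latent_dim = 512, 1, 50

inputs = keras.Input(shape=(timesteps, input_dim))                 # one 512-step mini-song per sample
encoded = layers.LSTM(latent_dim)(inputs)                          # (None, 50) fixed-size code
decoded = layers.RepeatVector(timesteps)(encoded)                  # (None, 512, 50)
decoded = layers.LSTM(input_dim, return_sequences=True)(decoded)   # (None, 512, 1) reconstruction

autoencoder = keras.Model(inputs, decoded)
autoencoder.summary()
# X would then be reshaped/transposed to (600000, 512, 1) before training.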
