Creating 3D Convolutional Layer - machine-learning

ValueError: One of the dimensions in the output is <= 0 due to downsampling in conv3d_15. Consider increasing the input size. Received input shape [None, 1, 1, 1, 1904211] which would produce output shape with a zero or negative value in a dimension.
Can anyone explain to me what this error means? I got it while trying to build my 3D convolutional neural network.

A 3D convolution layer expects its input to have a shape similar to (4, 28, 28, 28, 1), i.e. a batch of 28x28x28 volumes with a single channel. More info here - https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv3D
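As a minimal sketch of that layout (batch, depth, height, width, channels), assuming TensorFlow/Keras and made-up data, the following runs without the downsampling error:
import numpy as np
import tensorflow as tf

# Four single-channel 28x28x28 volumes: (batch, depth, height, width, channels)
volumes = np.random.rand(4, 28, 28, 28, 1).astype("float32")
layer = tf.keras.layers.Conv3D(filters=8, kernel_size=3, activation="relu")
print(layer(volumes).shape)  # (4, 26, 26, 26, 8)
The shape in the error, [None, 1, 1, 1, 1904211], has all spatial dimensions collapsed to 1, so any kernel larger than 1 produces a non-positive output size; the data most likely needs to be reshaped into real volumes before reaching the Conv3D layer.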

Related

Understanding convolutional layers shapes

I've been reading about convolutional nets and I've programmed a few models myself. When I see visual diagrams of other models, each layer is shown as smaller and deeper than the last. Layers have three dimensions, like 256x256x32. What is this third number? I assume the first two numbers are the number of nodes, but I don't know what the depth is.
TLDR; 256x256x32 refers to the layer's output shape rather than the layer itself.
There are many articles and posts out there explaining how convolution layers work. I'll try to answer your question without going into too many details, just focusing on shapes.
Assuming you are working with 2D convolution layers, your input and output will both be three-dimensional (ignoring the batch, which would add a 4th axis). The shape of a convolution layer's input is therefore (c, h, w) (or (h, w, c) depending on the framework), where c is the number of channels, h is the height of the input, and w its width. You can see it as a c-channel hxw image.
The most intuitive example of such input is the input of the first convolution layer of your convolutional neural network: most likely an image of size hxw with c channels for example c=1 for greyscale or c=3 for RGB...
What's important is that, for every pixel of that input, the values on each channel give additional information about that pixel. Having three channels gives each pixel ('pixel' as in position in the 2D input space) richer content than having a single one, since each pixel is encoded with three values (three channels) rather than one (one channel). This kind of intuition about what channels represent extends to any number of channels: as we said, an input can have c channels.
Now, going back to convolution layers, here is a good visualization. Imagine a 5x5 1-channel input and a convolution layer consisting of a single 3x3 filter (i.e. kernel_size=3):
          input       filter    convolution output
shape     (1, 5, 5)   (3, 3)    (3, 3)
(The "representation" row of the original table illustrates each of these with an image.)
Now keep in mind that the dimensions of the output depend on the stride and padding of the convolution layer. Here the shape of the output happens to match the shape of the filter, but it does not have to! Had you taken an input of shape (1, 6, 6) with the same convolution settings, you would end up with an output of shape (4, 4) (which is different from the filter shape (3, 3)).
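A quick sanity check of those shapes in PyTorch (the same framework used further down; a single 3x3 filter, stride 1, no padding):
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)  # one 3x3 filter
print(conv(torch.randn(1, 1, 5, 5)).shape)  # torch.Size([1, 1, 3, 3])
print(conv(torch.randn(1, 1, 6, 6)).shape)  # torch.Size([1, 1, 4, 4])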
Also, note that if the input has more than one channel, i.e. shape (c, h, w), the filter must have the same number of channels. Each channel of the input convolves with the corresponding channel of the filter, and the results are summed into a single 2D feature map. So you have an intermediate output of (c, 3, 3), which after summing over the channels leaves you with (1, 3, 3), i.e. effectively (3, 3). As a result, for a convolution with a single filter, however many input channels there are, the output always has a single channel.
From there, what you can do is assemble multiple filters in the same layer. This means you define your layer as having k 3x3 filters, so the layer consists of k filters. For the computation of the output, the idea is simple: one filter gives a (3, 3) feature map, so k filters give k (3, 3) feature maps. These maps are then stacked along what becomes the channel dimension. Ultimately, you're left with an output shape of... (k, 3, 3).
Let k_h and k_w be the kernel height and kernel width respectively, and h', w' the height and width of one output feature map:
              input                        layer                               output
shape         (c, h, w)                    (k, c, k_h, k_w)                    (k, h', w')
description   c-channel hxw feature map    k filters of shape (c, k_h, k_w)    k-channel h'xw' feature map
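A quick way to see both the weight shape and the output shape from that table, again in PyTorch (c=3, k=16 and a 32x32 input are just illustrative values):
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)  # k=16 filters over c=3 channels
x = torch.randn(1, 3, 32, 32)          # (batch, c, h, w)
print(conv.weight.shape)               # torch.Size([16, 3, 3, 3])   -> (k, c, k_h, k_w)
print(conv(x).shape)                   # torch.Size([1, 16, 30, 30]) -> (batch, k, h', w')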
Back to your question:
Layers have 3 dimensions like 256x256x32. What is this third number? I assume the first two numbers are the number of nodes but I don't know what the depth is.
A convolution layer's weights have four dimensions, but one of them is imposed by the number of input channels. What you choose is the size of your convolution kernel and the number of filters; this last number determines the number of channels of the output.
256x256 seems extremely large for a layer and most likely corresponds to the spatial shape of the output feature map. On the other hand, 32 would be the number of channels of the output, which, as I tried to explain, is the number of filters in that layer. Generally speaking, the dimensions shown in visual diagrams of convolutional networks correspond to the intermediate output shapes, not the layer shapes.
As an example, take the VGG neural network:
Very Deep Convolutional Networks for Large-Scale Image Recognition
Input shape for VGG is (3, 224, 224), knowing that the result of the first convolution has shape (64, 224, 224) you can determine there is a total of 64 filters in that layer.
As it turns out the kernel size in VGG is 3x3. So, here is a question for you: knowing there is a single bias parameter per filter, how many total parameters are in VGG's first convolution layer?
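If you want to check your answer, here is one way to count those parameters (my own check, assuming the 64-filter, 3x3, 3-input-channel layer described above):
import torch.nn as nn

# VGG's first conv layer: 64 filters of shape (3, 3, 3), plus one bias per filter
first = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
print(sum(p.numel() for p in first.parameters()))  # 64*3*3*3 + 64 = 1792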
Sorry for the short answer, but a digital image has two spatial dimensions and then often 3 channels for the colours. A convolutional filter looks at parts of the picture with smaller height/width but many more depth channels (in your case 32) to extract more information. This is then fed into the rest of the network to learn from.
I created the example in PyTorch to demonstrate the output you had:
import torch
import torch.nn as nn

bs = 16                                    # batch size
x = torch.randn(bs, 3, 256, 256)           # batch of 3-channel 256x256 images
c = nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=2)  # 32 filters, padding keeps 256x256
out = c(x)
print(out.shape, out.shape[1])
Out:
torch.Size([16, 32, 256, 256]) 32
The output is a real tensor you can inspect, which may help. You can play with the many convolution parameters.

Working of embedding layer in Tensorflow

Can someone please explain the inputs and outputs, along with the working, of the layer mentioned below?
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
total_words = 263
max_sequence_len=11
Is 64 the number of dimensions?
And why is the output of this layer (None, 10, 64)?
Shouldn't it be a 64-dimensional vector for each word, i.e. (None, 263, 64)?
You can find all the information about TensorFlow's Embedding layer in the official documentation.
The first two parameters are input_dim and output_dim.
The input dimension basically represents the vocabulary size of your model. You can find this out from the word_index attribute of the Tokenizer().
The output dimension is the size of each embedding vector, i.e. the dimensionality of the input to the next (Dense) layer.
The output of the Embedding layer has the form (batch_size, input_length, output_dim). Since you specified the input_length parameter, your layer's input has the form (batch, input_length). That's why the output has the form (None, 10, 64).
Hope that clears up your doubt ☺️
In the Embedding layer, the first argument is the input dimension (which is typically large: the vocabulary size). The second argument is the output dimension, a.k.a. the dimensionality of the reduced vectors. The third argument is the sequence length. In essence, an Embedding layer simply learns a lookup table of shape (input dim, output dim), and the weights of the layer have that shape. The output of the layer, however, has shape (seq length, output dim) per example: one dimensionality-reduced embedding vector for each element in the input sequence. The shape you were expecting is actually the shape of the weights of the embedding layer, not of its output.
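A small sketch of those shapes with the numbers from the question (random integer tokens stand in for real data):
import numpy as np
import tensorflow as tf

total_words = 263   # vocabulary size (input_dim)
embedding_dim = 64  # size of each embedding vector (output_dim)
seq_length = 10     # max_sequence_len - 1

layer = tf.keras.layers.Embedding(total_words, embedding_dim)
tokens = np.random.randint(0, total_words, size=(2, seq_length))  # batch of 2 sequences

print(layer(tokens).shape)           # (2, 10, 64): one 64-d vector per token
print(layer.get_weights()[0].shape)  # (263, 64): the learned lookup table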

How exactly do you compute the gradients for the filters in a convolutional neural network?

I learned from several articles that to compute the gradients for the filters, you just do a convolution with the input volume as input and the error matrix as the kernel. After that, you subtract the gradients (multiplied by the learning rate) from the filter weights. I implemented this process but it's not working.
I even tried doing the backpropagation process myself with pen and paper, but the gradients I calculated don't make the filters perform any better. So am I understanding the whole process wrong?
Edit:
I will provide an example of my understanding of the backpropagation in CNNs and the problem with it.
Consider a randomised input matrix for a convolutional layer:
1, 0, 1
0, 0, 1
1, 0, 0
And a randomised weight matrix:
1, 0
0, 1
The output would be (applied ReLU activator):
1, 1
0, 0
The target for this layer is a 2x2 matrix filled with zeros. This way, we know the weight matrix should be filled with zeros also.
Error:
-1, -1
0, 0
By applying the process as stated above, the gradients are:
-1, -1
1, 0
So the new weight matrix is:
2, 1
-1, 1
This is not getting anywhere. If I repeat the process, the filter weights just go to extremely high values. So I must have made a mistake somewhere. So what is it that I'm doing wrong?
I'll give you a full example; it's not going to be short, but hopefully you will get it. I'm omitting both the bias and the activation functions for simplicity, but once you get it, it's simple enough to add those too. Remember, backpropagation is essentially the SAME in a CNN as in a simple MLP, but instead of multiplications you have convolutions. So, here's my sample:
Input:
.7 -.3 -.7 .5
.9 -.5 -.2 .9
-.1 .8 -.3 -.5
0 .2 -.1 .6
Kernel:
.1 -.3
-.5 .7
Doing the convolution yields (Result of 1st convolutional layer, and input for the 2nd convolutional layer):
.32 .27 -.59
.99 -.52 -.55
-.45 .64 .13
L2 Kernel:
-.5 .1
.3 .9
L2 activation:
.73 .29
.37 -.63
Here you would have a flatten layer and a standard MLP or SVM to do the actual classification. During backpropagation you'll receive a delta, which for fun let's assume is the following:
-.07 .15
-.09 .02
This will always be the same size as your activation before the flatten layer. Now, to calculate the kernel's delta for the current layer (L2), you convolve L1's activation with the above delta. I'm not writing this down again, but the result will be:
.17 .02
-.05 .13
Updating the kernel is done as L2.Kernel -= LR * ROT180(dL2.K), meaning you first rotate the above 2x2 matrix and then update the kernel. This for our toy example turns out to be:
-.51 .11
.3 .9
Now, to calculate the delta for the first convolutional layer, recall that in an MLP you had: current_delta * current_weight_matrix. Well, in a conv layer you have pretty much the same: you convolve the original kernel (before the update) of the L2 layer with the delta of the current layer, except this time it is a full convolution. The result turns out to be:
.04 -.08 .02
.02 -.13 .14
-.03 -.08 .01
With this you move to the 1st convolutional layer and convolve the original input with this 3x3 delta to get L1's kernel delta:
.16 .03
-.09 .16
And update your L1 kernel the same way as above:
.08 -.29
-.5 .68
Then you can start over from feeding forward. The above calculations were rounded to 2 decimal places and a learning rate of .1 was used for calculating the new kernel values.
TLDR:
You get a delta
You calculate the delta that will be passed to the next (earlier) layer as: FullConvolution(Li.W, delta)
You calculate the kernel delta that is used to update the kernel as: Convolution(Li.Input, delta)
Go to next layer and repeat.
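For reference, here is a small sketch (mine, not the answer author's) that checks these two rules against autograd. It uses the cross-correlation convention that frameworks such as PyTorch use for Conv2d, which is why no explicit ROT180 appears in the kernel-gradient step; the answer above uses true convolution, so its kernels show up 180-rotated.
import numpy as np
import torch
import torch.nn.functional as F
from scipy.signal import correlate2d, convolve2d

np.random.seed(0)
x = np.random.randn(4, 4)        # single-channel input
k = np.random.randn(2, 2)        # single 2x2 filter, no bias
delta = np.random.randn(3, 3)    # gradient arriving at this layer's output

# Kernel gradient: 'valid' correlation of the layer's input with the delta.
dk = correlate2d(x, delta, mode='valid')
# Delta passed to the layer below: FULL convolution of the delta with the kernel.
dx = convolve2d(delta, k, mode='full')

# Check both against autograd.
xt = torch.tensor(x.reshape(1, 1, 4, 4), requires_grad=True)
kt = torch.tensor(k.reshape(1, 1, 2, 2), requires_grad=True)
F.conv2d(xt, kt).backward(torch.tensor(delta.reshape(1, 1, 3, 3)))
print(np.allclose(dk, kt.grad.reshape(2, 2).numpy()))  # True
print(np.allclose(dx, xt.grad.reshape(4, 4).numpy()))  # True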

keras vgg 16 shape error

I'm trying to fit data with the following shapes to the pretrained Keras VGG16 model.
The image input shape is (32383, 96, 96, 3).
The label shape is (32383, 17).
and I got this error
expected block5_pool to have 4 dimensions, but got array with shape (32383, 17)
at this line
model.fit(x = X_train, y= Y_train, validation_data=(X_valid, Y_valid),
batch_size=64,verbose=2, epochs=epochs,callbacks=callbacks,shuffle=True)
Here's how I define my model
model = VGG16(include_top=False, weights='imagenet', input_tensor=None, input_shape=(96,96,3),classes=17)
How did the max-pool layer give me a 2D tensor rather than a 4D one? I'm using the original model from keras.applications.vgg16. How can I fix this error?
Your problem comes from VGG16(include_top=False, ...), as this makes your model load only the convolutional part of VGG. This is why Keras complains that it got a 2-dimensional output instead of a 4-dimensional one (the 4 dimensions come from the fact that the convolutional output has shape (nb_of_examples, width, height, channels)). To overcome this, you need to either set include_top=True or add additional layers that squash the convolutional part down to 2D (e.g. Flatten, GlobalMaxPooling2D or GlobalAveragePooling2D, followed by a set of Dense layers, the final one being a Dense of size 17 with a softmax activation).
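As a rough sketch of the second option (using tf.keras; layer sizes other than the final 17 are arbitrary choices, not from the question):
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(include_top=False, weights='imagenet', input_shape=(96, 96, 3))

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),         # (3, 3, 512) -> (512,)
    layers.Dense(256, activation='relu'),
    layers.Dense(17, activation='softmax'),  # matches the (32383, 17) labels
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])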

Output dimensions of convolutional layer with Keras

The Keras tutorial gives the following code example (with comments):
# apply a convolution 1d of length 3 to a sequence with 10 timesteps,
# with 64 output filters
model = Sequential()
model.add(Convolution1D(64, 3, border_mode='same', input_shape=(10, 32)))
# now model.output_shape == (None, 10, 64)
I am confused about the output size. Shouldn't it create 10 timesteps with a depth of 64 and a width of 32 (stride defaults to 1, no padding)? So (10, 32, 64) instead of (None, 10, 64)?
In k-dimensional convolution you have filters that preserve the structure of the first k dimensions and squash the information from all remaining dimensions by convolving them with the filter weights. So basically every filter in your network has shape (3 x 32), and all information from the last dimension (the one of size 32) is squashed into a single real number, while the first dimension is preserved. This is the reason why you get a shape like this.
You could imagine a similar situation in the 2-D case when you have a colour image. Your input then has a 3-dimensional structure (picture_length, picture_width, colour). When you apply a 2-D convolution with respect to the first two dimensions, all information about colours is squashed by your filter and is not preserved as a separate axis in the output structure. The same happens here.
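A modern tf.keras check of the tutorial snippet above (border_mode='same' is now padding='same'), which also shows that each filter spans all 32 input features:
import tensorflow as tf

conv = tf.keras.layers.Conv1D(64, 3, padding='same')
model = tf.keras.Sequential([tf.keras.Input(shape=(10, 32)), conv])

print(model.output_shape)            # (None, 10, 64)
print(conv.get_weights()[0].shape)   # (3, 32, 64): each filter covers all 32 input features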

Resources