I've read another post about converting FC layers into convolutional layers:
https://stats.stackexchange.com/questions/263349/how-to-convert-fully-connected-layer-into-convolutional-layer ,
but I don't understand how you get the 4096x1x1 in the last calculations. I know that after going through the convolution layers and the pooling we end up with a volume of 7x7x512.
I got this from the CS231n notes: https://cs231n.github.io/convolutional-networks/#convert
Conversely, any FC layer can be converted to a CONV layer. For example, an FC layer with K=4096 that is looking at some input volume of size 7×7×512 can be equivalently expressed as a CONV layer with F=7,P=0,S=1,K=4096. In other words, we are setting the filter size to be exactly the size of the input volume, and hence the output will simply be 1×1×4096 since only a single depth column “fits” across the input volume, giving identical result as the initial FC layer.
It's the math I don't completely understand. In the other post, the author wrote:
In this example, as far as I understood, the converted CONV layer should have the shape (7,7,512), meaning (width, height, feature dimension). And we have 4096 filters. And the output of each filter's spatial size can be calculated as (7-7+0)/1 + 1 = 1. Therefore we have a 1x1x4096 vector as output.
But where does the other 7 come from? Do you convolve the volume with a filter of its own size, and is that how you end up with 1x1x4096?
Say I have a CNN that consists of Input(234×234)-Conv(7,32,1)-Pool(2,2)-Conv(7,32,1)-Pool(2,2)-Conv(7,32,1)-Pool(2,2)-FC(1024)-FC(1024)-FC(1000), and zero-padding is not used.
Running this through the conv and pooling calculations should leave us at 24x24x32 after the last pooling, if I'm not wrong. Stride is 1 for the conv layers.
234x234x1 > conv7x7x32 > (234-7)/1+1 = 228
228x228x32 > pool2x2 > (228 - 2 )/2 + 1 = 114
114x114x32 > conv7x7x32 > (114 - 7 ) / 1 + 1 = 108
108x108x32 > pool2x2 > (108-2)/2 + 1 = 54
54x54x32 > conv7x7x32 > (54-7)/1 + 1 = 48
48x48x32 > pool2x2 > (48-2)/2 + 1 = 24
24x24x32
(24-24)/1 + 1 = 1 > 1024x1x1, 1024x1x1, 1000x1x1
Is this the right way to convert the FC layers into convolutional layers?
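To double-check this kind of arithmetic, the output-size formula (W - F + 2P)/S + 1 can be applied programmatically. A minimal sketch in plain Python (the layer sizes are the ones from my network above):

def conv_out(w, f, s=1, p=0):
    # spatial output size of a convolution or pooling: (W - F + 2P) / S + 1
    return (w - f + 2 * p) // s + 1

w = 234
for _ in range(3):
    w = conv_out(w, 7)         # conv 7x7, stride 1
    w = conv_out(w, 2, s=2)    # pool 2x2, stride 2
print(w)                       # 24

# FC(1024) expressed as a convolution whose filter covers the whole 24x24 map:
print(conv_out(w, 24))         # (24 - 24)/1 + 1 = 1  -> output volume 1x1x1024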
I'm new to all the stuff I'm going to talk about, so my questions may be too simple.
Thanks in advance for your answers!
My questions come from the following image:
To be more clear:
For the first convolution, from 1 x 28 x 28 to 25 x 26 x 26, the input (1 channel) goes through the filter bank (25 filters). So the single input channel was filtered 25 times, right?
But for the second convolution, from 25 x 13 x 13 to 50 x 11 x 11, what is the operation of the 50 x 3 x 3 filter applied to the 25 x 13 x 13 input? I'm confused about the operation, because the output should be 1250 x 11 x 11 if each of the 25 input channels went through the 50 x 3 x 3 filters separately. Why does the output still have 50 channels?
For the second max pooling, how does MaxPooling2D() deal with a layer of odd size? The remainder of 11 mod 2 is 1. In the image above, going from 11 to 5, what happens to that leftover 1?
In addition, what is the common way to max-pool an odd-sized input layer?
Each convolution is applied to all the channels of its input (the output of the previous layer). In this case, each filter of the 50 x 3 x 3 Conv2D spans all 25 input feature maps (the output of the 25 x 3 x 3 Conv2D); the per-channel results are summed up to give one output feature map, and this is done 50 times, once per filter. Here is a link about how filters are applied to feature maps. The rule of thumb is: if the next convolution has N filters, its output will also have N feature maps.
For max pooling, the default padding in MaxPooling2D (which applies in your case) is "valid". It means the pooling operation will not include values that cannot be covered by a full kernel window. In your example the kernel size is 2, so the 11th row and column are simply not included in the operation. Here is a good link about the padding="valid" flag; the second answer has a good visual of how some elements of the input are left out during this operation.
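As a quick check of both points (each filter spans all 25 input channels, and "valid" pooling drops the 11th row/column), a minimal sketch, using tf.keras as an assumption and the layer sizes from your image:

import tensorflow as tf

x = tf.keras.Input(shape=(13, 13, 25))      # 25 input feature maps of size 13x13
y = tf.keras.layers.Conv2D(50, 3)(x)        # each of the 50 filters is 3x3x25
print(y.shape)                              # (None, 11, 11, 50) -> 50 maps, not 1250

z = tf.keras.layers.MaxPooling2D(2)(y)      # padding='valid' by default
print(z.shape)                              # (None, 5, 5, 50) -> the 11th row/column is dropped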
Additionally, it may be a good idea to use strided convolutions instead of max pooling.
You can easily find people discussing and comparing the two online. Since you're struggling with dimensionality, strides are less mind-boggling.
https://stats.stackexchange.com/questions/387482/pooling-vs-stride-for-downsampling
https://www.pyimagesearch.com/2018/12/31/keras-conv2d-and-convolutional-layers/
https://machinelearningmastery.com/padding-and-stride-for-convolutional-neural-networks/
Hope it helps.
I have a SimpleRNN like:
model.add(SimpleRNN(10, input_shape=(3, 1)))
model.add(Dense(1, activation="linear"))
The model summary says:
simple_rnn_1 (SimpleRNN) (None, 10) 120
I am curious about the parameter number 120 for simple_rnn_1.
Could someone answer my question?
When you look at the header of the table, you see the column title Param:
Layer (type) Output Shape Param
===============================================
simple_rnn_1 (SimpleRNN) (None, 10) 120
This number represents the number of trainable parameters (weights and biases) in the respective layer, in this case your SimpleRNN.
Edit:
The formula for calculating the weights is as follows:
recurrent_weights + input_weights + biases
resp. (num_features + num_units) * num_units + num_units
Explanation:
num_units: the number of units in the RNN
num_features: the number of features of your input
Now you have two things happening in your RNN.
First you have the recurrent loop, where the state is fed recurrently into the model to generate the next step. Weights for the recurrent step are:
recurrent_weights = num_units*num_units
Secondly, you have the new input of your sequence at each step:
input_weights = num_features*num_units
(Usually the last RNN state and the new input are concatenated and then multiplied by one single weight matrix; nevertheless, the input and the last RNN state use different weights.)
So now we have the weights; what's missing are the biases - one bias for every unit:
biases = num_units*1
So finally we have the formula:
recurrent_weights + input_weights + biases
or
num_units * num_units + num_features * num_units + biases
=
(num_features + num_units) * num_units + biases
In your case this means the trainable parameters are:
10*10 + 1*10 + 10 = 120
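You can confirm the count by rebuilding the model from the question and reading the parameter numbers back from Keras. A minimal sketch (using tf.keras here is just a convenience; standalone Keras gives the same counts):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(10, input_shape=(3, 1)),   # num_units=10, num_features=1
    tf.keras.layers.Dense(1, activation="linear"),
])

# (num_features + num_units) * num_units + num_units = (1 + 10) * 10 + 10
print(model.layers[0].count_params())   # 120
print(model.layers[1].count_params())   # 10 weights + 1 bias = 11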
I hope this is understandable, if not just tell me - so I can edit it to make it more clear.
It might be easier to understand visually with a simple network like this:
The number of weights is 16 (4 * 4) + 12 (3 * 4) = 28 and the number of biases is 4.
where 4 is the number of units and 3 is the number of input dimensions, so the formula is just like in the first answer: num_units ^ 2 + num_units * input_dim + num_units or simply num_units * (num_units + input_dim + 1), which yields 10 * (10 + 1 + 1) = 120 for the parameters given in the question.
I visualized the SimpleRNN you added; I think the figure can explain a lot.
SimpleRNN layer (I'm a newbie here and can't post images directly, so you need to click the link).
From the unrolled version of the SimpleRNN layer, it can be seen as a dense layer whose input is a concatenation of the input and the current layer (from the previous step) itself.
So the number of parameters of SimpleRNN can be computed as a dense layer:
num_para = units_pre * units + num_bias
where:
units_pre is the sum of input neurons (1 in your settings) and units (see below),
units is the number of neurons (10 in your settings) in the current layer,
num_bias is the number of bias term in the current layer, which is the same as the units.
Plugging in your settings, we get num_para = (1 + 10) * 10 + 10 = 120.
I would like to understand how an RNN, specifically an LSTM, works with multiple input dimensions using Keras and TensorFlow. I mean the input shape is (batch_size, timesteps, input_dim) where input_dim > 1.
I think the below images illustrate quite well the concept of LSTM if the input_dim = 1.
Does this mean that if input_dim > 1, then x is not a single value anymore but an array? But if that's the case, do the weights also become arrays, with the same shape as x plus the context?
Keras creates a computational graph that executes the sequence in your bottom picture per feature (but for all units). That means the state value C is always a scalar, one per unit. It does not process features at once, it processes units at once, and features separately.
import keras.models as kem
import keras.layers as kel
model = kem.Sequential()
lstm = kel.LSTM(units, input_shape=(timesteps, features))
model.add(lstm)
model.summary()
free_params = (4 * features * units) + (4 * units * units) + (4 * units)
print('free_params ', free_params)
print('kernel_c', lstm.kernel_c.shape)
print('bias_c', lstm.bias_c.shape)
where 4 represents one for each of the f, i, c, and o internal paths in your bottom picture. The first term is the number of weights for the kernel, the second term for the recurrent kernel, and the last one for the bias, if applied. For
units = 1
timesteps = 1
features = 1
we see
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 1) 12
=================================================================
Total params: 12.0
Trainable params: 12
Non-trainable params: 0.0
_________________________________________________________________
free_params 12
kernel_c (1, 1)
bias_c (1,)
and for
units = 1
timesteps = 1
features = 2
we see
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 1) 16
=================================================================
Total params: 16.0
Trainable params: 16
Non-trainable params: 0.0
_________________________________________________________________
free_params 16
kernel_c (2, 1)
bias_c (1,)
where bias_c is a proxy for the output shape of the state C. Note that there are different implementations regarding the internal making of the unit. Details are here (http://deeplearning.net/tutorial/lstm.html) and the default implementation uses Eq.7. Hope this helps.
Let's update the above answer to TensorFlow 2.
import tensorflow as tf

units, timesteps, features = 1, 1, 2   # same settings as the second example above

lstm = tf.keras.layers.LSTM(units, input_shape=(timesteps, features))
model = tf.keras.Sequential([lstm])
model.summary()

free_params = (4 * features * units) + (4 * units * units) + (4 * units)
print('free_params', free_params)
# tf.keras does not expose kernel_c / bias_c directly; slice the packed cell
# weights instead (the gate order of the kernel and bias is i, f, c, o)
print('kernel_c', lstm.cell.kernel[:, 2 * units:3 * units].shape)
print('bias_c', lstm.cell.bias[2 * units:3 * units].shape)
Using this code, you could achieve the same result in TensorFlow 2.x as well.
Is there a way to calculate the total number of parameters in an LSTM network?
I have found an example, but I'm unsure how correct it is, or whether I have understood it correctly.
For example, consider the following:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import LSTM
model = Sequential()
model.add(LSTM(256, input_dim=4096, input_length=16))
model.summary()
Output
____________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
lstm_1 (LSTM) (None, 256) 4457472 lstm_input_1[0][0]
====================================================================================================
Total params: 4457472
____________________________________________________________________________________________________
As per my understanding, n is the input vector length, and m is the number of time steps; in this example they consider the number of hidden layers to be 1.
Hence, according to the formula in the post, 4(nm + n^2), with m=16, n=4096, num_of_units=256 in my example:
4*((4096*16)+(4096*4096))*256 = 17246978048
Why is there such a difference?
Did I misunderstand the example or was the formula wrong ?
No - the number of parameters of an LSTM layer in Keras equals:
params = 4 * ((size_of_input + 1) * size_of_output + size_of_output^2)
The additional 1 comes from the bias terms. So n is the size of the input (increased by the bias term) and m is the size of the output of the LSTM layer.
So finally :
4 * (4097 * 256 + 256^2) = 4457472
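The same number can be read back from Keras itself. A minimal sketch (using the tf.keras 2.x API, where input_dim/input_length are expressed as input_shape=(timesteps, features)):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.LSTM(256, input_shape=(16, 4096))])

# 4 * ((size_of_input + 1) * size_of_output + size_of_output^2)
print(4 * ((4096 + 1) * 256 + 256 ** 2))   # 4457472
print(model.count_params())                # 4457472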
image via this post
num_params = [(num_units + input_dim + 1) * num_units] * 4
num_units + input_dim: concat [h(t-1), x(t)]
+ 1: bias
* 4: there are 4 neural network layers (yellow box) {W_forget, W_input, W_output, W_cell}
model.add(LSTM(units=256, input_dim=4096, input_length=16))
[(256 + 4096 + 1) * 256] * 4 = 4457472
PS: num_units = num_hidden_units = output_dims
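Written as a small helper function (plain Python; the argument names simply mirror the formula above):

def lstm_params(num_units, input_dim):
    # concat [h(t-1), x(t)] plus one bias term, times the 4 gate matrices
    return (num_units + input_dim + 1) * num_units * 4

print(lstm_params(256, 4096))   # 4457472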
I think it would be easier to understand if we start with a simple RNN.
Let's assume that we have 4 units (please ignore the ... in the network and concentrate only on visible units), and the input size (number of dimensions) is 3:
The number of weights is 28 = 16 (num_units * num_units) for the recurrent connections + 12 (input_dim * num_units) for input. The number of biases is simply num_units.
Recurrency means that each neuron output is fed back into the whole network, so if we unroll it in time sequence, it looks like two dense layers:
and that makes it clear why we have num_units * num_units weights for the recurrent part.
The number of parameters for this simple RNN is 32 = 4 * 4 + 3 * 4 + 4, which can be expressed as num_units * num_units + input_dim * num_units + num_units or num_units * (num_units + input_dim + 1)
Now, for an LSTM, we must multiply the number of these parameters by 4, as this is the number of sub-parameters inside each unit, and it was nicely illustrated in the answer by @FelixHo.
Formula expansion for @JohnStrong:
4 means we have separate weight and bias variables for the 3 gates (read / write / forget) and a 4th one for the cell state (within the same hidden state).
(These are shared across timesteps along a particular hidden state vector.)
4 * lstm_hidden_state_size * (lstm_inputs_size + bias_variable + lstm_outputs_size)
Since the LSTM output (y) is h (the hidden state) by design, without an extra projection we have, for the LSTM outputs:
lstm_hidden_state_size = lstm_outputs_size
let's say it's d :
d = lstm_hidden_state_size = lstm_outputs_size
Then
params = 4 * d * ((lstm_inputs_size + 1) + d) = 4 * ((lstm_inputs_size + 1) * d + d^2)
LSTM Equations (via deeplearning.ai Coursera)
It is evident from the equations that the final dimensions of all 6 equations are the same, and that this dimension must necessarily equal the dimension of a(t).
Out of these 6 equations, only 4 contribute to the number of parameters, and by looking at them it can be deduced that all 4 are symmetric. So, if we find the number of parameters for 1 equation, we can just multiply it by 4 to get the total number of parameters.
One important point to note is that the total number of parameters doesn't depend on the time-steps (or input_length), as the same "W" and "b" are shared across all time-steps.
Assume that, inside the LSTM cell, each gate has just one layer (as in Keras).
Take equation 1. Let the number of neurons in the layer be n and the number of dimensions of x be m (not counting the number of examples and time-steps). The dimension of the forget gate is then n too. Now, just as in an ANN, the dimension of "Wf" is n*(n+m) and the dimension of "bf" is n. Therefore, the total number of parameters for one equation is [{n*(n+m)} + n], and the total number of parameters is 4*[{n*(n+m)} + n]. Opening the brackets, we get 4*(n*m + n^2 + n).
So, as per your values (n=256, m=4096), feeding them into the formula gives: total number of parameters = 4*((256*256) + (256*4096) + 256) = 4*1114368 = 4457472.
The others have pretty much answered it, but just for further clarification: when creating an LSTM layer, the number of params is as follows:
No. of params = 4 * ((num_features + 1) * num_units + num_units^2)
The +1 is because of the additional bias we take.
where num_features is the number of features in your input shape to the LSTM:
input_shape=(window_size, num_features)
I can't arrive at the correct number of parameters for AlexNet or VGG Net.
For example, to calculate the number of parameters of a conv3-256 layer of VGG Net, I get 0.59M = (3*3)*(256*256), that is, (kernel size) * (product of the numbers of channels in the adjoining layers). However, that way I can't get to the 138M parameters.
So could you please show me where my calculation goes wrong, or show me the right calculation procedure?
If you are referring to the 16-layer VGG Net (table 1, column D), then 138M is the total number of parameters of this network, i.e. including all convolutional layers but also the fully connected ones.
Looking at the 3rd convolutional stage composed of 3 x conv3-256 layers:
the first one has N=128 input planes and F=256 output planes,
the two other ones have N=256 input planes and F=256 output planes.
The convolution kernel is 3x3 for each of these layers. In terms of parameters this gives:
128x3x3x256 (weights) + 256 (biases) = 295,168 parameters for the 1st one,
256x3x3x256 (weights) + 256 (biases) = 590,080 parameters for the two other ones.
As explained above, you have to do that for all layers, including the fully-connected ones, and sum these values to obtain the final 138M number.
UPDATE: the breakdown among layers gives:
conv3-64 x 2 : 38,720
conv3-128 x 2 : 221,440
conv3-256 x 3 : 1,475,328
conv3-512 x 3 : 5,899,776
conv3-512 x 3 : 7,079,424
fc1 : 102,764,544
fc2 : 16,781,312
fc3 : 4,097,000
TOTAL : 138,357,544
In particular for the fully-connected layers (fc):
fc1 (x): (512x7x7)x4,096 (weights) + 4,096 (biases)
fc2 : 4,096x4,096 (weights) + 4,096 (biases)
fc3 : 4,096x1,000 (weights) + 1,000 (biases)
(x) see section 3.2 of the article: the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers).
Details about fc1
As noted above, the spatial resolution right before feeding the fully-connected layers is 7x7 pixels.
[...] the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3×3 conv. layers.
With such padding, and working with a 224x224 pixel input image, the resolution decreases as follows along the layers: 112x112, 56x56, 28x28, 14x14 and 7x7 after the last convolution/pooling stage, which has 512 feature maps.
This gives a feature vector passed to fc1 with dimension: 512x7x7.
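Putting the convolutional and fully-connected counts together, a short sketch reproduces the 138,357,544 total (the layer configuration below is transcribed from table 1, column D of the paper):

# (out_channels, in_channels) for each 3x3 conv layer of VGG-16, configuration D
convs = [(64, 3), (64, 64),
         (128, 64), (128, 128),
         (256, 128), (256, 256), (256, 256),
         (512, 256), (512, 512), (512, 512),
         (512, 512), (512, 512), (512, 512)]

# (out_features, in_features) for the three fully-connected layers
fcs = [(4096, 512 * 7 * 7), (4096, 4096), (1000, 4096)]

conv_params = sum(c_out * c_in * 3 * 3 + c_out for c_out, c_in in convs)
fc_params = sum(f_out * f_in + f_out for f_out, f_in in fcs)
print(conv_params + fc_params)   # 138357544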
A great breakdown of the calculation for VGG-16 network is also given in CS231n lecture notes.
INPUT: [224x224x3] memory: 224*224*3=150K weights: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M weights: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M weights: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K weights: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M weights: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M weights: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K weights: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K weights: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K weights: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K weights: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K weights: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K weights: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K weights: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K weights: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K weights: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K weights: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K weights: 0
FC: [1x1x4096] memory: 4096 weights: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 weights: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 weights: 4096*1000 = 4,096,000
TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
The VGG-16 architecture below is the one from the original paper, as highlighted by @deltheil (table 1, column D), and I quote from there:
2.1 ARCHITECTURE
During training, the input to our ConvNets is a fixed-size 224 × 224
RGB images. The only preprocessing we do is subtracting the mean RGB
value, computed on the training set, from each pixel.
The image is passed through a stack of convolutional (conv.) layers,
where we use filters with a very small receptive field: 3 × 3 (which
is the smallest size to capture the notion of left/right, up/down,
center). The convolution stride is fixed to 1 pixel; the spatial
padding of conv. layer input is such that the spatial resolution is
preserved after convolution, i.e. the padding is 1 pixel for 3 × 3
conv. layers. Spatial pooling is carried out by five max-pooling
layers, which follow some of the conv. layers (not all the conv.
layers are followed by max-pooling). Max-pooling is performed over a 2
× 2 pixel window, with stride 2.
A stack of convolutional layers (which has a different depth in
different architectures) is followed by three Fully-Connected (FC)
layers: the first two have 4096 channels each, the third performs
1000-way ILSVRC classification and thus contains 1000 channels (one
for each class).
The final layer is the soft-max layer.
Using the above, plus
a formula to find the activation shape of a layer, and
a formula to calculate the weights corresponding to every layer:
Note:
you can simply multiply the entries of the respective activation shape column to get the activation size
CONV3: means a 3*3 filter convolves over the input
MAXPOOL3-2: means the 3rd pooling layer, with a 2*2 filter, stride=2, padding=0 (pretty standard in pooling layers)
Stage-3: means multiple CONV layers stacked, with the same padding=1, stride=1, and 3*3 filter
Cin: the depth, a.k.a. the number of channels, coming from the input layer
Cout: the depth, a.k.a. the number of channels, going out (you configure it to learn more complex features),
Cin and Cout are the numbers of filters that you stack together to learn multiple features at different scales; for example, in the first layer you might want to learn vertical edges, horizontal edges and edges at, say, 45 degrees: 64 different filters, each for a different kind of edge
n: input dimension without depth, e.g. n=224 for the input image
p: padding for each layer
s: stride used for each layer
f: filter size, i.e. 3*3 for CONV and 2*2 for MAXPOOL layers
After MAXPOOL5-2, you simply flatten the volume and feed it into the first FC layer.
We get the table:
Finally, if you add up all the weights calculated in the last column, you end up with 138,357,544 (138 million) parameters to train for VGG-16.
Here is how to compute the number of parameters in each cnn layer:
some definitions
n--width of filter
m--height of filter
k--number of input feature maps
L--number of output feature maps
Then the number of parameters is (n*m*k + 1)*L, in which the first contribution is from the weights and the second is from the bias.
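For example, for the first conv3-256 layer of VGG-16 (n = m = 3, k = 128, L = 256):

n, m, k, L = 3, 3, 128, 256
print((n * m * k + 1) * L)   # 295168, matching the value in the answer above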
I know this is an old post; nevertheless, I think the accepted answer by @deltheil contains a mistake. If not, I would be happy to be corrected. The convolution layer should not have a bias.
i.e.
128x3x3x256 (weights) + 256 (biases) = 295,168
should be
128x3x3x256 (weights) = 294,912
Thanks
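Whether a convolution layer has a bias depends on the implementation; in Keras, Conv2D does include one by default. A minimal check (using tf.keras, with the conv3-256 layer sizes from the accepted answer):

import tensorflow as tf

with_bias = tf.keras.layers.Conv2D(256, 3, padding='same')                    # use_bias=True by default
without_bias = tf.keras.layers.Conv2D(256, 3, padding='same', use_bias=False)

with_bias.build((None, 56, 56, 128))      # 128 input channels, as in the 1st conv3-256 layer
without_bias.build((None, 56, 56, 128))

print(with_bias.count_params())      # 295168 = 128*3*3*256 + 256
print(without_bias.count_params())   # 294912 = 128*3*3*256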