Is there a way to calculate the total number of parameters in a LSTM network.
I have found a example but I'm unsure of how correct this is or If I have understood it correctly.
For eg consider the following example:-
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import LSTM
model = Sequential()
model.add(LSTM(256, input_dim=4096, input_length=16))
model.summary()
Output
____________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
lstm_1 (LSTM) (None, 256) 4457472 lstm_input_1[0][0]
====================================================================================================
Total params: 4457472
____________________________________________________________________________________________________
As per My understanding n is the input vector lenght.
And m is the number of time steps. and in this example they consider the number of hidden layers to be 1.
Hence according to the formula in the post. 4(nm+n^2) in my example m=16;n=4096;num_of_units=256
4*((4096*16)+(4096*4096))*256 = 17246978048
Why is there such a difference?
Did I misunderstand the example or was the formula wrong ?
No - the number of parameters of a LSTM layer in Keras equals to:
params = 4 * ((size_of_input + 1) * size_of_output + size_of_output^2)
Additional 1 comes from bias terms. So n is size of input (increased by the bias term) and m is size of output of a LSTM layer.
So finally :
4 * (4097 * 256 + 256^2) = 4457472
image via this post
num_params = [(num_units + input_dim + 1) * num_units] * 4
num_units + input_dim: concat [h(t-1), x(t)]
+ 1: bias
* 4: there are 4 neural network layers (yellow box) {W_forget, W_input, W_output, W_cell}
model.add(LSTM(units=256, input_dim=4096, input_length=16))
[(256 + 4096 + 1) * 256] * 4 = 4457472
PS: num_units = num_hidden_units = output_dims
I think it would be easier to understand if we start with a simple RNN.
Let's assume that we have 4 units (please ignore the ... in the network and concentrate only on visible units), and the input size (number of dimensions) is 3:
The number of weights is 28 = 16 (num_units * num_units) for the recurrent connections + 12 (input_dim * num_units) for input. The number of biases is simply num_units.
Recurrency means that each neuron output is fed back into the whole network, so if we unroll it in time sequence, it looks like two dense layers:
and that makes it clear why we have num_units * num_units weights for the recurrent part.
The number of parameters for this simple RNN is 32 = 4 * 4 + 3 * 4 + 4, which can be expressed as num_units * num_units + input_dim * num_units + num_units or num_units * (num_units + input_dim + 1)
Now, for LSTM, we must multiply the number of of these parameters by 4, as this is the number of sub-parameters inside each unit, and it was nicely illustrated in the answer by #FelixHo
Formula expanding for #JohnStrong :
4 means we have different weight and bias variables for 3 gates (read / write / froget) and - 4-th - for the cell state (within same hidden state).
(These mentioned are shared among timesteps along particular hidden state vector)
4 * lstm_hidden_state_size * (lstm_inputs_size + bias_variable + lstm_outputs_size)
as LSTM output (y) is h (hidden state) by approach, so, without an extra projection, for LSTM outputs we have :
lstm_hidden_state_size = lstm_outputs_size
let's say it's d :
d = lstm_hidden_state_size = lstm_outputs_size
Then
params = 4 * d * ((lstm_inputs_size + 1) + d) = 4 * ((lstm_inputs_size + 1) * d + d^2)
LSTM Equations (via deeplearning.ai Coursera)
It is evident from the equations that the final dimensions of all the 6 equations will be same and final dimension must necessarily be equal to the dimension of a(t).
Out of these 6 equations, only 4 equations contribute to the number of parameters and by looking at the equations, it can be deduced that all the 4 equations are symmetric. So,if we find out the number of parameters for 1 equation, we can just multiply it by 4 and tell the total number of parameters.
One important point is to note that the total number of parameters doesn't depend on the time-steps(or input_length) as same "W" and "b" is shared throughout the time-step.
Assuming, insider of LSTM cell having just one layer for a gate(as that in Keras).
Take equation 1 and lets relate. Let number of neurons in the layer be n and number of dimension of x be m (not including number of example and time-steps). Therefore, dimension of forget gate will be n too. Now,same as that in ANN, dimension of "Wf" will be n*(n+m) and dimension of "bf" will be n. Therefore, total number of parameters for one equation will be [{n*(n+m)} + n]. Therefore, total number of parameters will be 4*[{n*(n+m)} + n].Lets open the brackets and we will get -> 4*(nm + n2 + n).
So,as per your values. Feeding it into the formula gives:->(n=256,m=4096),total number of parameters is 4*((256*256) + (256*4096) + (256) ) = 4*(1114368) = 4457472.
The others have pretty much answered it. But just for further clarification, on creating an LSTM layer. The number of params is as follows:
No of params= 4*((num_features used+1)*num_units+
num_units^2)
The +1 is because of the additional bias we take.
Where the num_features is the num_features in your input shape to the LSTM:
Input_shape=(window_size,num_features)
Related
I'm in a situation where I need to train a model to predict a scalar value, and it's important to have the predicted value be in the same direction as the true value, while the squared error being minimum.
What would be a good choice of loss function for that?
For example:
Let's say the predicted value is -1 and the true value is 1. The loss between the two should be a lot greater than the loss between 3 and 1, even though the squared error of (3, 1) and (-1, 1) is equal.
Thanks a lot!
This turned out to be a really interesting question - thanks for asking it! First, remember that you want your loss functions to be defined entirely of differential operations, so that you can back-propagation though it. This means that any old arbitrary logic won't necessarily do. To restate your problem: you want to find a differentiable function of two variables that increases sharply when the two variables take on values of different signs, and more slowly when they share the same sign. Additionally, you want some control over how sharply these values increase, relative to one another. Thus, we want something with two configurable constants. I started constructing a function that met these needs, but then remembered one you can find in any high school geometry text book: the elliptic paraboloid!
The standard formulation doesn't meet the requirement of sign agreement symmetry, so I had to introduce a rotation. The plot above is the result. Note that it increases more sharply when the signs don't agree, and less sharply when they do, and that the input constants controlling this behaviour are configurable. The code below is all that was needed to define and plot the loss function. I don't think I've ever used a geometric form as a loss function before - really neat.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
def elliptic_paraboloid_loss(x, y, c_diff_sign, c_same_sign):
# Compute a rotated elliptic parabaloid.
t = np.pi / 4
x_rot = (x * np.cos(t)) + (y * np.sin(t))
y_rot = (x * -np.sin(t)) + (y * np.cos(t))
z = ((x_rot**2) / c_diff_sign) + ((y_rot**2) / c_same_sign)
return(z)
c_diff_sign = 4
c_same_sign = 2
a = np.arange(-5, 5, 0.1)
b = np.arange(-5, 5, 0.1)
loss_map = np.zeros((len(a), len(b)))
for i, a_i in enumerate(a):
for j, b_j in enumerate(b):
loss_map[i, j] = elliptic_paraboloid_loss(a_i, b_j, c_diff_sign, c_same_sign)
fig = plt.figure()
ax = fig.gca(projection='3d')
X, Y = np.meshgrid(a, b)
surf = ax.plot_surface(X, Y, loss_map, cmap=cm.coolwarm,
linewidth=0, antialiased=False)
plt.show()
From what I understand, your current loss function is something like:
loss = mean_square_error(y, y_pred)
What you could do, is to add one other component to your loss, being this a component that penalizes negative numbers and does nothing with positive numbers. And you can choose a coefficient for how much you want to penalize it. For that, we can use like a negative shaped ReLU. Something like this:
Let's call "Neg_ReLU" to this component. Then, your loss function will be:
loss = mean_squared_error(y, y_pred) + Neg_ReLU(y_pred)
So for example, if your result is -1, then the total error would be:
mean_squared_error(1, -1) + 1
And if your result is 3, then the total error would be:
mean_squared_error(1, -1) + 0
(See in the above function how Neg_ReLU(3) = 0, and Neg_ReLU(-1) = 1.
If you want to penalize more the negative values, then you can add a coefficient:
coeff_negative_value = 2
loss = mean_squared_error(y, y_pred) + coeff_negative_value * Neg_ReLU
Now the negative values are more penalized.
The ReLU negative function we can build it like this:
tf.nn.relu(tf.math.negative(value))
So summarizing, in the end your total loss will be:
coeff = 1
Neg_ReLU = tf.nn.relu(tf.math.negative(y))
total_loss = mean_squared_error(y, y_pred) + coeff * Neg_ReLU
I have a SimpleRNN like:
model.add(SimpleRNN(10, input_shape=(3, 1)))
model.add(Dense(1, activation="linear"))
The model summary says:
simple_rnn_1 (SimpleRNN) (None, 10) 120
I am curious about the parameter number 120 for simple_rnn_1.
Could you someone answer my question?
When you look at the headline of the table you see the title Param:
Layer (type) Output Shape Param
===============================================
simple_rnn_1 (SimpleRNN) (None, 10) 120
This number represents the number of trainable parameters (weights and biases) in the respective layer, in this case your SimpleRNN.
Edit:
The formula for calculating the weights is as follows:
recurrent_weights + input_weights + biases
*resp: (num_features + num_units)* num_units + num_units
Explanation:
num_units = equals the number of units in the RNN
num_features = equals the number features of your input
Now you have two things happening in your RNN.
First you have the recurrent loop, where the state is fed recurrently into the model to generate the next step. Weights for the recurrent step are:
recurrent_weights = num_units*num_units
The secondly you have new input of your sequence at each step.
input_weights = num_features*num_units
(Usually both last RNN state and new input are concatenated and then multiplied with one single weight matrix, nevertheless inputs and last RNN state use different weights)
So now we have the weights, whats missing are the biases - for every unit one bias:
biases = num_units*1
So finally we have the formula:
recurrent_weights + input_weights + biases
or
num_units* num_units + num_features* num_units + biases
=
(num_features + num_units)* num_units + biases
In your cases this means the trainable parameters are:
10*10 + 1*10 + 10 = 120
I hope this is understandable, if not just tell me - so I can edit it to make it more clear.
It might be easier to understand visually with a simple network like this:
The number of weights is 16 (4 * 4) + 12 (3 * 4) = 28 and the number of biases is 4.
where 4 is the number of units and 3 is the number of input dimensions, so the formula is just like in the first answer: num_units ^ 2 + num_units * input_dim + num_units or simply num_units * (num_units + input_dim + 1), which yields 10 * (10 + 1 + 1) = 120 for the parameters given in the question.
I visualize the SimpleRNN you add, I think the figure can explain a lot.
SimpleRNN layer, I'm a newbie here, can't post images directly, so you need to click the link.
From the unrolled version of SimpleRNN layer,it can be seen as a dense layer. And the previous layer is a concatenation of input and the current layer(previous step) itself.
So the number of parameters of SimpleRNN can be computed as a dense layer:
num_para = units_pre * units + num_bias
where:
units_pre is the sum of input neurons(1 in your settings) and units(see below),
units is the number of neurons(10 in your settings) in the current layer,
num_bias is the number of bias term in the current layer, which is the same as the units.
Plugging in your settings, we achieve the num_para = (1 + 10) * 10 + 10 = 120.
This question already has answers here:
How to calculate the number of parameters for convolutional neural network?
(3 answers)
Closed 4 years ago.
How to compute the number of weight of CNN in a greyscale image.
here is the code:
Define input image size
input_shape = (32, 32, 1)
flat_input_size = input_shape[0]*input_shape[1]*input_shape[2]
num_classes = 4
Simple deep network
dnn_model = Sequential()
dnn_model.add(Dense(input_dim=flat_input_size, units=1000))
dnn_model.add(Activation("relu"))
dnn_model.add(Dense(units=512))
dnn_model.add(Activation("relu"))
dnn_model.add(Dense(units=256))
dnn_model.add(Activation("relu"))
dnn_model.add(Dense(units=num_classes))
dnn_model.add(Activation("softmax"))
The picture below is the network plot
here is the result
count anyone help me to compute the number of params.
how to get 1025000, 512512, 131328, 1028, show some details
For a dense layer with bias (the bias is the +1) the calculation is as follows:
(input_neurons + 1) * output_neurons
In your case for the first layer this is:
(32 * 32 + 1) * 1000 = 1025000
and for the second one:
(1000 + 1) * 512 = 512512
and so on and so forth.
Edited answer to reflect additional question in comments:
For convolutional layers, as asked in the comments, you try to learn a filter kernel for each input channel for each output channel with an additional bias. Therefore the amount of parameters in there are:
kernel_width * kernel_height * input_channels * output_channels + output_channels = num_parameters
For your example, where we go from a feature map of size (None, 16, 16, 32) to (None, 14, 14, 64) with a (3, 3) kernel we get the following calculation:
3 * 3 * 32 * 64 + 64 = 18496
That is actually the important thing in CNNs, that the number of parameters is independent of the image size.
I having some trouble understanding the tensorflow BasicLSTMCell num_units input parameter.
I have seen other posts but I am not following so hoping a simple example will help.
So say we have the following LTSM RNN model below. How do I determine the number units the cells require? Is it possible to have such a structure for a LTSM RNN?
Input Vec 1st Hidden Layer 2nd Hidden Layer Output
20 x 1 20 x 1 5 x 1 3 x 1
Follow, I have given a sample code for your model by using a dynamic rnn (https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn)
N_INPUT = 20
N_TIME_STEPS = #Define here
N_HIDDEN_UNITS1 = 20
N_HIDDEN_UNITS2 = 5
N_OUTPUT =3
input = tf.placeholder(tf.float32, [None, N_TIME_STEPS, N_INPUT], name="input")
lstm_layers = [tf.contrib.rnn.BasicLSTMCell(N_HIDDEN_UNITS1, forget_bias=1.0),tf.contrib.rnn.BasicLSTMCell(N_HIDDEN_UNITS2, forget_bias=1.0),tf.contrib.rnn.BasicLSTMCell(N_OUTPUT, forget_bias=1.0)]
lstm_layers = tf.contrib.rnn.MultiRNNCell(lstm_layers)
outputs, _ = tf.nn.dynamic_rnn(lstm_layers, input, dtype=tf.float32)
The input (input in the code) for the model should be in the shape of [BATCH_SIZE, N_TIME_STEPS, N_INPUT] and the output (outputs in the code) of the RNN is in the shape of [BATCH_SIZE, N_TIME_STEPS, N_OUTPUT]
Hope this helps.
I would like to understand how an RNN, specifically an LSTM is working with multiple input dimensions using Keras and Tensorflow. I mean the input shape is (batch_size, timesteps, input_dim) where input_dim > 1.
I think the below images illustrate quite well the concept of LSTM if the input_dim = 1.
Does this mean if input_dim > 1 then x is not a single value anymore but an array? But if it's like this then the weights are also become arrays, same shape as x + the context?
Keras creates a computational graph that executes the sequence in your bottom picture per feature (but for all units). That means the state value C is always a scalar, one per unit. It does not process features at once, it processes units at once, and features separately.
import keras.models as kem
import keras.layers as kel
model = kem.Sequential()
lstm = kel.LSTM(units, input_shape=(timesteps, features))
model.add(lstm)
model.summary()
free_params = (4 * features * units) + (4 * units * units) + (4 * num_units)
print('free_params ', free_params)
print('kernel_c', lstm.kernel_c.shape)
print('bias_c', lstm.bias_c .shape)
where 4 represents one for each of the f, i, c, and o internal paths in your bottom picture. The first term is the number of weights for the kernel, the second term for the recurrent kernel, and the last one for the bias, if applied. For
units = 1
timesteps = 1
features = 1
we see
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 1) 12
=================================================================
Total params: 12.0
Trainable params: 12
Non-trainable params: 0.0
_________________________________________________________________
num_params 12
kernel_c (1, 1)
bias_c (1,)
and for
units = 1
timesteps = 1
features = 2
we see
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 1) 16
=================================================================
Total params: 16.0
Trainable params: 16
Non-trainable params: 0.0
_________________________________________________________________
num_params 16
kernel_c (2, 1)
bias_c (1,)
where bias_c is a proxy for the output shape of the state C. Note that there are different implementations regarding the internal making of the unit. Details are here (http://deeplearning.net/tutorial/lstm.html) and the default implementation uses Eq.7. Hope this helps.
Let's update the above answer to TensorFlow 2.
import tensorflow as tf
model = tf.keras.Sequential([tf.keras.layers.LSTM(units, input_shape=(timesteps, features))])
model.summary()
free_params = (4 * features * units) + (4 * units * units) + (4 * num_units)
print('free_params ', free_params)
print('kernel_c', lstm.kernel_c.shape)
print('bias_c', lstm.bias_c .shape)
Using this code, you could achieve the same result in TensorFlow 2.x as well.