I would like to understand how an RNN, specifically an LSTM, works with multiple input dimensions using Keras and TensorFlow. That is, the input shape is (batch_size, timesteps, input_dim) where input_dim > 1.
I think the images below illustrate the concept of an LSTM quite well when input_dim = 1.
Does this mean that if input_dim > 1, x is no longer a single value but an array? And if so, do the weights also become arrays, with the same shape as x plus the context?
Keras creates a computational graph that executes the sequence in your bottom picture per feature (but for all units). That means the state value C is always a scalar, one per unit. It does not process features at once, it processes units at once, and features separately.
import keras.models as kem
import keras.layers as kel
model = kem.Sequential()
lstm = kel.LSTM(units, input_shape=(timesteps, features))
model.add(lstm)
model.summary()
free_params = (4 * features * units) + (4 * units * units) + (4 * units)
print('num_params ', free_params)
print('kernel_c', lstm.kernel_c.shape)
print('bias_c', lstm.bias_c.shape)
where 4 represents one for each of the f, i, c, and o internal paths in your bottom picture. The first term is the number of weights for the kernel, the second term for the recurrent kernel, and the last one for the bias, if applied. For
units = 1
timesteps = 1
features = 1
we see
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 1) 12
=================================================================
Total params: 12.0
Trainable params: 12
Non-trainable params: 0.0
_________________________________________________________________
num_params 12
kernel_c (1, 1)
bias_c (1,)
and for
units = 1
timesteps = 1
features = 2
we see
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 1) 16
=================================================================
Total params: 16.0
Trainable params: 16
Non-trainable params: 0.0
_________________________________________________________________
num_params 16
kernel_c (2, 1)
bias_c (1,)
where bias_c is a proxy for the output shape of the state C. Note that there are different implementations of the unit's internals. Details are here (http://deeplearning.net/tutorial/lstm.html); the default implementation uses Eq. 7. Hope this helps.
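The three terms of the formula can also be checked without building a model. This is a minimal sketch in plain Python; the function name lstm_param_breakdown is made up for illustration and is not part of Keras.

```python
def lstm_param_breakdown(features, units, use_bias=True):
    """Split an LSTM layer's parameter count into its three terms.

    The factor 4 covers the f, i, c and o paths.
    """
    kernel = 4 * features * units          # input-to-hidden weights
    recurrent_kernel = 4 * units * units   # hidden-to-hidden weights
    bias = 4 * units if use_bias else 0    # one bias per unit and path
    return {
        "kernel": kernel,
        "recurrent_kernel": recurrent_kernel,
        "bias": bias,
        "total": kernel + recurrent_kernel + bias,
    }


one_feature = lstm_param_breakdown(features=1, units=1)
two_features = lstm_param_breakdown(features=2, units=1)
print(one_feature["total"])   # 12, as in the first summary
print(two_features["total"])  # 16, as in the second summary
```

Only the kernel term grows when features goes from 1 to 2, which matches the change of kernel_c from (1, 1) to (2, 1).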
Let's update the above answer to TensorFlow 2.
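import tensorflow as tf
units, timesteps, features = 1, 1, 2  # example values from above
# Keep a reference to the layer so its weights can be inspected.
lstm = tf.keras.layers.LSTM(units, input_shape=(timesteps, features))
model = tf.keras.Sequential([lstm])
model.summary()
free_params = (4 * features * units) + (4 * units * units) + (4 * units)
print('num_params ', free_params)
# TF 2 stores the four paths fused in one kernel, in the order i, f, c, o,
# so the c block is the third slice of width `units`.
print('kernel_c', lstm.cell.kernel[:, 2 * units:3 * units].shape)
print('bias_c', lstm.cell.bias[2 * units:3 * units].shape)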
Using this code, you could achieve the same result in TensorFlow 2.x as well.
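To make the shapes in the original question concrete, here is a rough sketch of a single LSTM time step in plain Python, assuming the standard LSTM equations. It is not how Keras implements the layer; the dictionary layout of W, U and b and the function names are assumptions for illustration. With input_dim > 1, x is a vector, each unit holds one row of input weights per path, and each unit keeps one scalar cell state.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step for a single sample.

    x      : list of `features` input values for this time step
    h_prev : list of `units` previous outputs
    c_prev : list of `units` previous cell states (one scalar per unit)
    W[g]   : `units` x `features` input weights for path g in "ifco"
    U[g]   : `units` x `units` recurrent weights for path g
    b[g]   : `units` biases for path g
    """
    def pre_activation(g, j):
        from_input = sum(w * xk for w, xk in zip(W[g][j], x))
        from_state = sum(u * hk for u, hk in zip(U[g][j], h_prev))
        return from_input + from_state + b[g][j]

    h, c = [], []
    for j in range(len(h_prev)):
        i_gate = sigmoid(pre_activation("i", j))
        f_gate = sigmoid(pre_activation("f", j))
        o_gate = sigmoid(pre_activation("o", j))
        candidate = math.tanh(pre_activation("c", j))
        c_j = f_gate * c_prev[j] + i_gate * candidate
        c.append(c_j)
        h.append(o_gate * math.tanh(c_j))
    return h, c


features, units = 2, 3
# All-zero weights keep the arithmetic easy to follow: every gate is 0.5.
W = {g: [[0.0] * features for _ in range(units)] for g in "ifco"}
U = {g: [[0.0] * units for _ in range(units)] for g in "ifco"}
b = {g: [0.0] * units for g in "ifco"}

h, c = lstm_step([0.7, -1.2], [0.0] * units, [2.0] * units, W, U, b)
print(len(h), len(c))  # 3 3
print(c)               # [1.0, 1.0, 1.0]
```

Counting the entries of W, U and b gives 4 * features * units + 4 * units * units + 4 * units, the same three terms as in the formula above.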
I am constructing a tf keras model using the functional API. This model will train fine on large memory mapped arrays. However, for numerous reasons it can be advantageous to work with tensorflow Dataset objects. Therefore, I use from_tensor_slices() to convert my arrays to Dataset objects.
The problem is that the model will no longer train.
The keras docs: Model training APIs indicate that dataset objects are acceptable.
The guide I'm following on how to train is found here: Using tf.data with tf keras
Guides on how to use the keras functional API are here. However, training a functional API model with a tf Dataset object is not outlined.
A MWE is provided here:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras import layers
print('numpy version: {}'.format(np.__version__))
print('keras version: {}'.format(keras.__version__))
print('tensorflow version: {}'.format(tf.__version__))
numpy version: 1.21.4
keras version: 2.6.0
tensorflow version: 2.6.0
X = np.random.uniform(size=(1000,75))
Y = np.random.uniform(size=(1000))
data = tf.data.Dataset.from_tensor_slices((X, Y))
print(data.cardinality().numpy())
1000
data.batch(batch_size=100, drop_remainder=True)
<BatchDataset shapes: ((100, 75), (100,)), types: (tf.float64, tf.float64)>
def API_Model(input_shape, name="test_model"):
    inputs = layers.Input(shape=input_shape)
    x = layers.Dense(1)(inputs)
    outputs = layers.Activation('relu')(x)
    return keras.Model(inputs=inputs, outputs=outputs, name=name)
api_model = API_Model(input_shape=(X.shape[1],))
api_model.compile()
api_model.summary()
Model: "test_model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 75)] 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 76
_________________________________________________________________
activation_1 (Activation) (None, 1) 0
=================================================================
Total params: 76
Trainable params: 76
Non-trainable params: 0
_________________________________________________________________
api_model.fit(data, epochs=10)
Epoch 1/10
WARNING:tensorflow:Model was constructed with shape (None, 75) for input
KerasTensor(type_spec=TensorSpec(shape=(None, 75), dtype=tf.float32, name='input_2'),
name='input_2', description="created by layer 'input_2'"), but it was called on an input with
incompatible shape (75, 1).
The error I receive is: ValueError: Input 0 of layer dense_1 is incompatible with the layer: expected axis -1 of input shape to have value 75 but received input with shape (75, 1)
In addition, the error from my actual model I'm trying to train is slightly different but seems to be malfunctioning under the same principle. It is the following:
ValueError: Input 0 is incompatible with layer pfn_base: expected shape=(None, 1086, 5), found shape=(1086, 5)
What is the proper way to train a keras functional API model on a BatchDataset object?
You need to assign the batched dataset to a variable, because data.batch(...) returns a new dataset and leaves data itself unbatched. You should also pass a loss function to model.compile: the default value is None, and the model can't learn anything without one. Here is a working example:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras import layers
print('numpy version: {}'.format(np.__version__))
print('keras version: {}'.format(keras.__version__))
print('tensorflow version: {}'.format(tf.__version__))
X = np.random.uniform(size=(1000,75))
Y = np.random.uniform(size=(1000))
data = tf.data.Dataset.from_tensor_slices((X, Y))
print(data.cardinality().numpy())
data = data.batch(batch_size=100, drop_remainder=True)
def API_Model(input_shape, name="test_model"):
    inputs = layers.Input(shape=input_shape)
    x = layers.Dense(1)(inputs)
    outputs = layers.Activation('relu')(x)
    return keras.Model(inputs=inputs, outputs=outputs, name=name)
api_model = API_Model(input_shape=(X.shape[1],))
api_model.compile(loss='mse')
api_model.summary()
api_model.fit(data, epochs=10)
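The key point is that batch() hands back a new dataset instead of changing the one it is called on. The plain-Python stand-in below mimics that behaviour with lists; the batch helper is hypothetical and only imitates the semantics, it is not the tf.data implementation.

```python
def batch(samples, batch_size, drop_remainder=False):
    """Return a NEW list of batches; `samples` itself is left untouched."""
    batches = [samples[i:i + batch_size]
               for i in range(0, len(samples), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


samples = [[0.0] * 75 for _ in range(1000)]

batch(samples, 100, drop_remainder=True)   # result discarded, as in the question
print(len(samples), len(samples[0]))       # 1000 75 -> still one sample per element

batched = batch(samples, 100, drop_remainder=True)  # result kept, as in the fix
print(len(batched), len(batched[0]), len(batched[0][0]))  # 10 100 75
```

In the unbatched case each element is a single sample with 75 values and no batch axis, which is consistent with the shape complaint in the warning.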
My model starts to train, but after running for some time it raises an error:
IndexError: index 37 is out of bounds for axis 0 with size 37
It runs properly without GridSearchCV, using fixed parameters.
Here is my code:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
def build_classifier(optimizer, nb_layers,unit):
    classifier = Sequential()
    classifier.add(Dense(units = unit, kernel_initializer = 'uniform', activation = 'relu', input_dim = 14))
    i = 1
    while i <= nb_layers:
        classifier.add(Dense(activation="relu", units=unit, kernel_initializer="uniform"))
        i += 1
    classifier.add(Dense(units = 38, kernel_initializer = 'uniform', activation = 'softmax'))
    classifier.compile(optimizer = optimizer, loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])
    return classifier
classifier = KerasClassifier(build_fn = build_classifier)
parameters = {'batch_size': [10,25],
              'epochs': [100,200],
              'optimizer': ['adam'],
              'nb_layers': [5,6,7],
              'unit':[48,57,76]
              }
grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv=5,n_jobs=-1)
grid_search = grid_search.fit(X_train, y_train)
best_parameters = grid_search.best_params_
best_accuracy = grid_search.best_score_
The error IndexError: index 37 is out of bounds for axis 0 with size 37 means that there is no element with index 37 in your object.
In Python, if an object such as an array or list has n numerically indexed elements, the indexes go from 0 to n-1 (this is the general case, with the exception of reindexing in dataframes).
So, if you have 37 elements, you can only retrieve elements 0 through 36.
This is a multi-class classifier with a large number of classes (38). It seems that GridSearchCV isn't splitting your dataset by stratified sampling, maybe because you don't have enough data and/or your dataset isn't class-balanced.
According to the documentation:
For integer/None inputs, if the estimator is a classifier and y is
either binary or multiclass, StratifiedKFold is used. In all other
cases, KFold is used.
By using categorical_crossentropy, KerasClassifier will convert targets (a class vector (integers)) to binary class matrix using keras.utils.to_categorical. Since there are 38 classes, each target will be converted to a binary vector of dimension 38 (index from 0 to 37).
I guess that in some splits, the validation set doesn't have samples from all the 38 classes, so targets are converted to vectors of dimension < 38, but since GridSearchCV is fitted with samples from all the 38 classes, it expects vectors of dimension = 38, which causes this error.
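A small plain-Python sketch of that guess: when the width of the one-hot encoding is inferred from the labels that are present, a split that lacks the highest class comes out one column short. The to_one_hot helper is a stand-in written for this illustration; like keras.utils.to_categorical it infers the width as max(label) + 1 when num_classes is not given.

```python
def to_one_hot(labels, num_classes=None):
    """Stand-in one-hot encoder; infers the width from the data if not given."""
    if num_classes is None:
        num_classes = max(labels) + 1
    return [[1 if k == label else 0 for k in range(num_classes)]
            for label in labels]


all_labels = list(range(38))                       # classes 0..37
split_labels = [y for y in all_labels if y != 37]  # a split that misses class 37

full = to_one_hot(all_labels)
split = to_one_hot(split_labels)
print(len(full[0]), len(split[0]))  # 38 37

try:
    split[0][37]  # position 37 does not exist in a row of width 37
except IndexError as err:
    print("IndexError:", err)

# Pinning the width removes the mismatch.
fixed = to_one_hot(split_labels, num_classes=38)
print(len(fixed[0]))  # 38
```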
Take a look at the shape of your y_train. It needs to be some sort of one-hot encoding with shape (, 37).
I have a SimpleRNN like:
model.add(SimpleRNN(10, input_shape=(3, 1)))
model.add(Dense(1, activation="linear"))
The model summary says:
simple_rnn_1 (SimpleRNN) (None, 10) 120
I am curious about the parameter number 120 for simple_rnn_1.
Could someone answer my question?
When you look at the headline of the table you see the title Param:
Layer (type) Output Shape Param
===============================================
simple_rnn_1 (SimpleRNN) (None, 10) 120
This number represents the number of trainable parameters (weights and biases) in the respective layer, in this case your SimpleRNN.
Edit:
The formula for calculating the weights is as follows:
recurrent_weights + input_weights + biases
resp.: (num_features + num_units) * num_units + num_units
Explanation:
num_units = the number of units in the RNN
num_features = the number of features of your input
Now you have two things happening in your RNN.
First you have the recurrent loop, where the state is fed recurrently into the model to generate the next step. Weights for the recurrent step are:
recurrent_weights = num_units*num_units
Secondly, you have the new input of your sequence at each step:
input_weights = num_features*num_units
(Usually both last RNN state and new input are concatenated and then multiplied with one single weight matrix, nevertheless inputs and last RNN state use different weights)
So now we have the weights; what's missing are the biases, one bias for every unit:
biases = num_units*1
So finally we have the formula:
recurrent_weights + input_weights + biases
or
num_units * num_units + num_features * num_units + biases
=
(num_features + num_units) * num_units + biases
In your case this means the trainable parameters are:
10*10 + 1*10 + 10 = 120
I hope this is understandable; if not, just tell me so I can edit it to make it clearer.
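The same arithmetic as a short Python sketch; simple_rnn_params is just an illustrative helper name, not a Keras function.

```python
def simple_rnn_params(num_features, num_units):
    """Trainable parameters of a SimpleRNN layer with bias."""
    recurrent_weights = num_units * num_units
    input_weights = num_features * num_units
    biases = num_units
    return recurrent_weights + input_weights + biases


# SimpleRNN(10, input_shape=(3, 1)): 1 feature, 10 units
print(simple_rnn_params(num_features=1, num_units=10))  # 120
```

Note that the 3 time steps from input_shape do not enter the calculation.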
It might be easier to understand visually with a simple network like this:
The number of weights is 16 (4 * 4) + 12 (3 * 4) = 28 and the number of biases is 4.
where 4 is the number of units and 3 is the number of input dimensions, so the formula is just like in the first answer: num_units ^ 2 + num_units * input_dim + num_units or simply num_units * (num_units + input_dim + 1), which yields 10 * (10 + 1 + 1) = 120 for the parameters given in the question.
I visualized the SimpleRNN layer you added; I think the figure explains a lot.
SimpleRNN layer (I'm new here and can't post images directly, so you need to click the link).
From the unrolled version of the SimpleRNN layer, it can be seen as a dense layer whose previous layer is the concatenation of the input and the current layer's own output from the previous step.
So the number of parameters of SimpleRNN can be computed as a dense layer:
num_para = units_pre * units + num_bias
where:
units_pre is the sum of input neurons(1 in your settings) and units(see below),
units is the number of neurons(10 in your settings) in the current layer,
num_bias is the number of bias term in the current layer, which is the same as the units.
Plugging in your settings, we get num_para = (1 + 10) * 10 + 10 = 120.
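The dense-layer view can be written out as a rough plain-Python sketch. It assumes a tanh activation (the Keras default for SimpleRNN) and one weight row per unit over the concatenated vector; simple_rnn_step is a made-up name for illustration, not Keras code.

```python
import math


def simple_rnn_step(x, h_prev, weights, biases):
    """Treat [x, h_prev] as the input of a dense layer with tanh activation."""
    concat = list(x) + list(h_prev)
    return [math.tanh(sum(w * v for w, v in zip(row, concat)) + bias)
            for row, bias in zip(weights, biases)]


input_dim, units = 1, 10
units_pre = input_dim + units                         # size of the concatenation
weights = [[0.0] * units_pre for _ in range(units)]   # units x units_pre
biases = [0.0] * units

h = simple_rnn_step([0.5], [0.0] * units, weights, biases)
num_para = sum(len(row) for row in weights) + len(biases)
print(len(h), num_para)  # 10 120
```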
How do I compute the number of weights of a CNN for a greyscale image?
here is the code:
from keras.models import Sequential
from keras.layers import Dense, Activation

# Define input image size
input_shape = (32, 32, 1)
flat_input_size = input_shape[0]*input_shape[1]*input_shape[2]
num_classes = 4

# Simple deep network
dnn_model = Sequential()
dnn_model.add(Dense(input_dim=flat_input_size, units=1000))
dnn_model.add(Activation("relu"))
dnn_model.add(Dense(units=512))
dnn_model.add(Activation("relu"))
dnn_model.add(Dense(units=256))
dnn_model.add(Activation("relu"))
dnn_model.add(Dense(units=num_classes))
dnn_model.add(Activation("softmax"))
The picture below is the network plot
here is the result
Could anyone help me compute the number of params?
How do I get 1025000, 512512, 131328, and 1028? Please show some details.
For a dense layer with bias (the bias is the +1) the calculation is as follows:
(input_neurons + 1) * output_neurons
In your case for the first layer this is:
(32 * 32 + 1) * 1000 = 1025000
and for the second one:
(1000 + 1) * 512 = 512512
and so on and so forth.
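For completeness, a short Python sketch that applies the formula to every layer of the network in the question; dense_params is an illustrative helper, not a Keras function.

```python
def dense_params(input_neurons, output_neurons):
    """Weights plus one bias per output neuron."""
    return (input_neurons + 1) * output_neurons


# Flattened 32x32x1 input, then the four Dense layers from the question.
layer_sizes = [32 * 32 * 1, 1000, 512, 256, 4]
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(counts)       # [1025000, 512512, 131328, 1028]
print(sum(counts))  # 1669868
```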
Edited answer to reflect additional question in comments:
For convolutional layers, as asked in the comments, you learn one filter kernel for each input channel for each output channel, plus one bias per output channel. Therefore the number of parameters is:
kernel_width * kernel_height * input_channels * output_channels + output_channels = num_parameters
For your example, where we go from a feature map of size (None, 16, 16, 32) to (None, 14, 14, 64) with a (3, 3) kernel we get the following calculation:
3 * 3 * 32 * 64 + 64 = 18496
That is actually the important thing in CNNs, that the number of parameters is independent of the image size.
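A small Python sketch of the convolution formula. The helper names are made up for illustration, and the output-size helper assumes 'valid' padding with stride 1, which is what a step from 16x16 to 14x14 with a 3x3 kernel suggests.

```python
def conv2d_params(kernel_width, kernel_height, input_channels, output_channels):
    """Weights for every (input channel, output channel) pair plus biases."""
    weights = kernel_width * kernel_height * input_channels * output_channels
    biases = output_channels
    return weights + biases


def valid_output_size(input_size, kernel_size):
    """Spatial output size, assuming 'valid' padding and stride 1."""
    return input_size - kernel_size + 1


print(conv2d_params(3, 3, 32, 64))  # 18496
print(valid_output_size(16, 3))     # 14
```

The image size only appears in valid_output_size, never in conv2d_params.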
Is there a way to calculate the total number of parameters in an LSTM network?
I have found an example, but I'm unsure how correct it is or whether I have understood it correctly.
For example, consider the following:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import LSTM
model = Sequential()
model.add(LSTM(256, input_dim=4096, input_length=16))
model.summary()
Output
____________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
lstm_1 (LSTM) (None, 256) 4457472 lstm_input_1[0][0]
====================================================================================================
Total params: 4457472
____________________________________________________________________________________________________
As per my understanding, n is the input vector length.
And m is the number of time steps, and in this example they consider the number of hidden layers to be 1.
Hence, according to the formula in the post, 4(nm+n^2), in my example m=16; n=4096; num_of_units=256:
4*((4096*16)+(4096*4096))*256 = 17246978048
Why is there such a difference?
Did I misunderstand the example or was the formula wrong ?
No, the number of parameters of an LSTM layer in Keras equals:
params = 4 * ((size_of_input + 1) * size_of_output + size_of_output^2)
The additional 1 comes from the bias terms. So n is the size of the input (increased by the bias term) and m is the size of the output of the LSTM layer.
So finally:
4 * (4097 * 256 + 256^2) = 4457472
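The formula as a Python sketch, next to the estimate from the question for comparison; lstm_params is an illustrative helper name, not part of Keras.

```python
def lstm_params(size_of_input, size_of_output):
    """Parameter count of a Keras LSTM layer with bias."""
    return 4 * ((size_of_input + 1) * size_of_output + size_of_output ** 2)


print(lstm_params(size_of_input=4096, size_of_output=256))  # 4457472

# The estimate from the question used the number of time steps (16) as m.
question_estimate = 4 * ((4096 * 16) + (4096 * 4096)) * 256
print(question_estimate)  # 17246978048
```

The input_length of 16 does not appear in lstm_params at all.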
image via this post
num_params = [(num_units + input_dim + 1) * num_units] * 4
num_units + input_dim: concat [h(t-1), x(t)]
+ 1: bias
* 4: there are 4 neural network layers (yellow box) {W_forget, W_input, W_output, W_cell}
model.add(LSTM(units=256, input_dim=4096, input_length=16))
[(256 + 4096 + 1) * 256] * 4 = 4457472
PS: num_units = num_hidden_units = output_dims
I think it would be easier to understand if we start with a simple RNN.
Let's assume that we have 4 units (please ignore the ... in the network and concentrate only on visible units), and the input size (number of dimensions) is 3:
The number of weights is 28 = 16 (num_units * num_units) for the recurrent connections + 12 (input_dim * num_units) for input. The number of biases is simply num_units.
Recurrency means that each neuron output is fed back into the whole network, so if we unroll it in time sequence, it looks like two dense layers:
and that makes it clear why we have num_units * num_units weights for the recurrent part.
The number of parameters for this simple RNN is 32 = 4 * 4 + 3 * 4 + 4, which can be expressed as num_units * num_units + input_dim * num_units + num_units or num_units * (num_units + input_dim + 1)
Now, for LSTM, we must multiply the number of these parameters by 4, as this is the number of sub-parameters inside each unit, as nicely illustrated in the answer by @FelixHo.
Expanding the formula for @JohnStrong:
4 means we have different weight and bias variables for the 3 gates (read / write / forget) and, fourth, for the cell state (within the same hidden state).
(These mentioned are shared among timesteps along particular hidden state vector)
4 * lstm_hidden_state_size * (lstm_inputs_size + bias_variable + lstm_outputs_size)
Since the LSTM output (y) is by design h (the hidden state), without an extra projection we have for the LSTM outputs:
lstm_hidden_state_size = lstm_outputs_size
let's say it's d :
d = lstm_hidden_state_size = lstm_outputs_size
Then
params = 4 * d * ((lstm_inputs_size + 1) + d) = 4 * ((lstm_inputs_size + 1) * d + d^2)
LSTM Equations (via deeplearning.ai Coursera)
It is evident from the equations that the final dimensions of all the 6 equations will be same and final dimension must necessarily be equal to the dimension of a(t).
Out of these 6 equations, only 4 contribute to the number of parameters, and by looking at the equations it can be deduced that all 4 are symmetric. So, if we find the number of parameters for 1 equation, we can just multiply it by 4 to get the total number of parameters.
One important point to note is that the total number of parameters doesn't depend on the time steps (or input_length), as the same "W" and "b" are shared across all time steps.
This assumes that the inside of the LSTM cell has just one layer per gate (as in Keras).
Take equation 1 and let's relate it. Let the number of neurons in the layer be n and the dimension of x be m (not counting the number of examples and time steps). Then the dimension of the forget gate is n too. As in an ordinary ANN, the dimension of "Wf" is n*(n+m) and the dimension of "bf" is n. So the number of parameters for one equation is [{n*(n+m)} + n], and the total number of parameters is 4*[{n*(n+m)} + n]. Expanding the brackets gives 4*(nm + n^2 + n).
So, with your values (n=256, m=4096), the total number of parameters is 4*((256*256) + (256*4096) + (256)) = 4*(1114368) = 4457472.
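The per-equation shapes can be listed in a short Python sketch. The gate names and helper functions are only labels chosen for this illustration; the shapes follow the n*(n+m) and n dimensions given above.

```python
def lstm_gate_shapes(n, m):
    """Shapes of W and b for the 4 parameterised equations.

    n: number of units, m: dimension of x.
    """
    names = ("forget", "update", "output", "candidate")
    return {name: {"W": (n, n + m), "b": (n,)} for name in names}


def count_params(shapes):
    total = 0
    for gate in shapes.values():
        rows, cols = gate["W"]
        total += rows * cols + gate["b"][0]
    return total


shapes = lstm_gate_shapes(n=256, m=4096)
print(shapes["forget"]["W"])  # (256, 4352)
print(count_params(shapes))   # 4457472
```

No shape depends on the number of time steps, in line with the weight sharing noted above.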
The others have pretty much answered it, but just for further clarification: when you create an LSTM layer, the number of params is as follows:
No. of params = 4 * ((num_features + 1) * num_units + num_units^2)
The +1 is because of the additional bias we take.
Where the num_features is the num_features in your input shape to the LSTM:
Input_shape=(window_size,num_features)