I'm having some trouble understanding the TensorFlow BasicLSTMCell num_units input parameter.
I have seen other posts, but I am not following them, so I'm hoping a simple example will help.
Say we have the LSTM RNN model below. How do I determine the number of units the cells require? Is it possible to have such a structure for an LSTM RNN?
Input Vec | 1st Hidden Layer | 2nd Hidden Layer | Output
20 x 1    | 20 x 1           | 5 x 1            | 3 x 1
Following up, here is sample code for your model using a dynamic RNN (https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn):
import tensorflow as tf

N_INPUT = 20
N_TIME_STEPS = # Define here
N_HIDDEN_UNITS1 = 20
N_HIDDEN_UNITS2 = 5
N_OUTPUT = 3

input = tf.placeholder(tf.float32, [None, N_TIME_STEPS, N_INPUT], name="input")

lstm_layers = [tf.contrib.rnn.BasicLSTMCell(N_HIDDEN_UNITS1, forget_bias=1.0),
               tf.contrib.rnn.BasicLSTMCell(N_HIDDEN_UNITS2, forget_bias=1.0),
               tf.contrib.rnn.BasicLSTMCell(N_OUTPUT, forget_bias=1.0)]
lstm_layers = tf.contrib.rnn.MultiRNNCell(lstm_layers)

outputs, _ = tf.nn.dynamic_rnn(lstm_layers, input, dtype=tf.float32)
The input (input in the code) to the model should be of shape [BATCH_SIZE, N_TIME_STEPS, N_INPUT], and the output of the RNN (outputs in the code) is of shape [BATCH_SIZE, N_TIME_STEPS, N_OUTPUT].
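To make those shapes concrete, here is a minimal sketch (assuming TF 1.x and, purely for illustration, N_TIME_STEPS = 10 and a batch size of 32) that runs the graph above on random data and prints the output shape:

import numpy as np

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.random.rand(32, 10, N_INPUT).astype(np.float32)  # [BATCH_SIZE, N_TIME_STEPS, N_INPUT]
    out = sess.run(outputs, feed_dict={input: batch})
    print(out.shape)  # (32, 10, 3), i.e. [BATCH_SIZE, N_TIME_STEPS, N_OUTPUT]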
Hope this helps.
Hello guys, I am new to machine learning. I am implementing federated learning with an LSTM to predict the next label in a sequence. My sequence looks like this: [2,3,5,1,4,2,5,7]. For example, the intention is to predict the 7 in this sequence. So I tried simple federated learning with Keras. I used this approach for another model (not an LSTM) and it worked for me, but here it always overfits on 2: it always predicts 2 for any input. I made the input data balanced, meaning there are almost equal numbers of each label in the last index (here it is 7). I tested this data on a simple deep learning model and it works great, so it seems to me that either this data is not suitable for an LSTM or there is some other issue. Please help me. This is the code for my federated learning. Please let me know if more information is needed; I really need it. Thanks.
def get_lstm(units):
    """LSTM (Long Short-Term Memory)

    Build the LSTM model.

    # Arguments
        units: List(int), number of input, output and hidden units.
    # Returns
        model: Model, nn model.
    """
    inp = layers.Input((units[0], 1))
    x = layers.LSTM(units[1], return_sequences=True)(inp)
    x = layers.LSTM(units[2])(x)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(units[3], activation='softmax')(x)
    model = Model(inp, out)
    return model

optimizer = keras.optimizers.Adam(lr=0.01)

seqLen = 8 - 1
global_model = Mymodel.get_lstm([seqLen, 64, 64, 15])  # we have 14 categories; labels start from 0 but class 0 is never predicted
global_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
                     metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1)])
def main(argv):
    for comm_round in range(comms_round):
        print("round_%d" % (comm_round))
        scaled_local_weight_list = list()
        global_weights = global_model.get_weights()

        np.random.shuffle(train)
        temp_data = train[:]

        # data divided among ten users and shuffled
        for user in range(10):
            user_data = temp_data[user * userDataSize: (user + 1) * userDataSize]
            X_train = user_data[:, 0:seqLen]
            X_train = np.asarray(X_train).astype(np.float32)
            Y_train = user_data[:, seqLen]
            Y_train = np.asarray(Y_train).astype(np.float32)

            local_model = Mymodel.get_lstm([seqLen, 64, 64, 15])
            X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
            local_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
                                metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1)])
            local_model.set_weights(global_weights)
            local_model.fit(X_train, Y_train)

            scaling_factor = 1 / 10  # 10 is the number of users
            scaled_weights = scale_model_weights(local_model.get_weights(), scaling_factor)
            scaled_local_weight_list.append(scaled_weights)
            K.clear_session()

        average_weights = sum_scaled_weights(scaled_local_weight_list)
        global_model.set_weights(average_weights)

    predictions = global_model.predict(X_test)
    for i in range(len(X_test)):
        print('%d,%d' % (np.argmax(predictions[i]), Y_test[i]), file=f2)
I found some reasons for my problem, so I thought I could share them with you:
1- The proportions of the different items in the sequences are not balanced. For example, I have 1000 occurrences of "2" but only 100 of the other numbers, so after a few rounds the model fits on 2 because there is much more data for specific numbers (a quick way to check this is sketched after point 2).
2- I changed my sequences so that no two items in a sequence have the same value. This way I could remove some repetitive data from the sequences and make them more balanced. Maybe it is not a complete representation of the activities, but in my case it makes sense.
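As a quick sanity check on point 1, here is a small sketch (assuming, as in the code above, that train is an array whose last column holds the label) that prints how often each label occurs:

import numpy as np
from collections import Counter

# count how often each label appears in the last column of `train`
label_counts = Counter(np.asarray(train)[:, seqLen].astype(int))
for label, count in sorted(label_counts.items()):
    print(label, count)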
In Keras, when using an LSTM or GRU layer, if I set return_sequences=False I get the last output, and if I set return_sequences=True I get the full sequence; but how do I get both at the same time?
Actually, the last timestep returned when return_sequences=True is equivalent to the output of the LSTM layer when return_sequences=False:
lstm_out_rs = LSTM(..., return_sequences=True)(x)
lstm_out_rs[:,-1] # this is the last timestep of returned sequence
lstm_out = LSTM(..., return_sequences=False)(x)
lstm_out_rs[:,-1] and lstm_out are equivalent. Therefore, to get both you can use a Lambda layer:
lstm_out_rs = LSTM(..., return_sequences=True)(x)
out = Lambda(lambda t: [t, t[:,-1]])(lstm_out_rs)
# out[0] is all the outputs, out[1] is the last output
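A minimal sketch of that idea wrapped in a model (the layer sizes here are purely illustrative: a 10-step sequence with 8 features and 32 LSTM units):

from keras.layers import Input, LSTM, Lambda
from keras.models import Model
import numpy as np

inp = Input(shape=(10, 8))
lstm_out_rs = LSTM(32, return_sequences=True)(inp)
out = Lambda(lambda t: [t, t[:, -1]])(lstm_out_rs)  # [full sequence, last timestep]
model = Model(inp, out)

seq, last = model.predict(np.zeros((4, 10, 8)))
print(seq.shape, last.shape)  # (4, 10, 32) (4, 32)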
I'm very new to TensorFlow. I've been trying to use TensorFlow to create a function where I give it a vector with 6 features and get back a label.
I have a training data set in the form of 6 features and 1 label. The label is in the first column:
309,3,0,2,4,0,6
309,12,0,2,4,0,6
309,0,4,17,2,0,6
318,0,660,414,58,3,12
311,0,0,414,58,0,2
298,0,53,355,5,0,2
60,16,14,381,30,4,2
312,0,8,8,13,0,3
...
I have the index for the labels, which is a list of thousands and thousands of names:
309,Joe
318,Joey
311,Bruce
...
How do I create a model and train it using TensorFlow to be able to predict the label, given a vector without the first column?
--
This is what I tried:
from __future__ import print_function
import tflearn
name_count = sum(1 for line in open('../../names.csv')) # this comes out to 24260
# Load CSV file, indicate that the first column represents labels
from tflearn.data_utils import load_csv
data, labels = load_csv('../../data.csv', target_column=0,
categorical_labels=True, n_classes=name_count)
# Build neural network
net = tflearn.input_data(shape=[None, 6])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)
# Define model
model = tflearn.DNN(net)
# Start training (apply gradient descent algorithm)
model.fit(data, labels, n_epoch=10, batch_size=16, show_metric=True)
# Predict
pred = model.predict([[218,5,124,26,0,3]]) # 326
print("Name:", pred[0][1])
It's based on https://github.com/tflearn/tflearn/blob/master/tutorials/intro/quickstart.md
I get the error:
ValueError: Cannot feed value of shape (16, 24260) for Tensor u'TargetsData/Y:0', which has shape '(?, 2)'
24260 is the number of lines in names.csv
Thank you!
net = tflearn.fully_connected(net, 2, activation='softmax')
looks to be saying you have 2 output classes, but in reality you have 24260. 16 is the size of your minibatch, so you have 16 rows of 24260 columns (one of these 24260 entries will be a 1, the rest all 0s).
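For reference, a sketch of the corresponding fix (keeping the rest of the question's script unchanged) is to make the final layer as wide as the number of classes read from names.csv:

# the output layer needs one unit per possible name/class, not 2
net = tflearn.input_data(shape=[None, 6])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, name_count, activation='softmax')  # name_count == 24260
net = tflearn.regression(net)
model = tflearn.DNN(net)

model.predict then returns a vector of 24260 probabilities per input, so the predicted label index is np.argmax(pred[0]) rather than pred[0][1].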
I'm building a neural network with the architecture:
input layer --> fully connected layer --> ReLU --> fully connected layer --> softmax
I'm using the equations outlined here: DeepLearningBook, to implement backprop. I think my mistake is in eq. 1. When differentiating, do I consider each example independently, yielding an N x C (no. of examples x no. of classes) matrix, or all together, yielding an N x 1 matrix?
# derivative of softmax
da2 = -a2 # a2 comprises activation values of output layer
da2[np.arange(N),y] += 1
da2 *= (a2[np.arange(N),y])[:,None]
# derivative of ReLU
da1 = a1 # a1 comprises activation values of hidden layer
da1[a1>0] = 1
# eq. 1
mask = np.zeros(a2.shape)
mask[np.arange(N),y] = 1
delta_2 = ((1/a2) * mask) * da2 / N
# delta_L = - (1 / a2[np.arange(N),y])[:,None] * da2 / N
# eq.2
delta_1 = np.dot(delta_2,W2.T) * da1
# eq. 3
grad_b1 = np.sum(delta_1,axis=0)
grad_b2 = np.sum(delta_2,axis=0)
# eq. 4
grad_w1 = np.dot(X.T,delta_1)
grad_w2 = np.dot(a1.T,delta_2)
Oddly, the commented line in eq. 1 returns the correct value for biases but I can't seem to justify using that equation since it returns an N x 1 matrix which is multiplied with the corresponding rows of da2.
Edit: I'm working on the assignment problems of the CS231n course which can be found here: CS231n
I also couldn't find any explanation about this elsewhere, so I wrote a post :) Please read it here.
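For what it's worth, a common simplification when softmax and cross-entropy are differentiated together is that the error at the output layer is the N x C matrix (softmax - one-hot labels) / N. A small numpy sketch (assuming a2 holds the softmax outputs and y the integer labels, as above):

delta_2 = a2.copy()
delta_2[np.arange(N), y] -= 1
delta_2 /= N   # N x C matrix: (softmax probabilities - one-hot labels) / N

This is exactly what the commented line reduces to once the definition of da2 is substituted in, which is why it returns the correct values.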
Is there a way to calculate the total number of parameters in an LSTM network?
I have found an example but I'm unsure how correct it is, or whether I have understood it correctly.
For example, consider the following:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import LSTM
model = Sequential()
model.add(LSTM(256, input_dim=4096, input_length=16))
model.summary()
Output
____________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
lstm_1 (LSTM) (None, 256) 4457472 lstm_input_1[0][0]
====================================================================================================
Total params: 4457472
____________________________________________________________________________________________________
As per my understanding, n is the input vector length and m is the number of time steps, and in this example they consider the number of hidden layers to be 1.
Hence, according to the formula in the post, 4(nm + n^2), with m = 16, n = 4096, num_of_units = 256 in my example:
4*((4096*16)+(4096*4096))*256 = 17246978048
Why is there such a difference?
Did I misunderstand the example or was the formula wrong ?
No, the number of parameters of an LSTM layer in Keras equals:
params = 4 * ((size_of_input + 1) * size_of_output + size_of_output^2)
The additional 1 comes from the bias terms. So n is the size of the input (increased by the bias term) and m is the size of the output of the LSTM layer.
So finally:
4 * (4097 * 256 + 256^2) = 4457472
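As a quick sanity check, the formula can be evaluated directly in Python with the sizes from the question:

size_of_input = 4096
size_of_output = 256
params = 4 * ((size_of_input + 1) * size_of_output + size_of_output ** 2)
print(params)  # 4457472, matching the Param # reported by model.summary()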
image via this post
num_params = [(num_units + input_dim + 1) * num_units] * 4
num_units + input_dim: concat [h(t-1), x(t)]
+ 1: bias
* 4: there are 4 neural network layers (yellow box) {W_forget, W_input, W_output, W_cell}
model.add(LSTM(units=256, input_dim=4096, input_length=16))
[(256 + 4096 + 1) * 256] * 4 = 4457472
PS: num_units = num_hidden_units = output_dims
I think it would be easier to understand if we start with a simple RNN.
Let's assume that we have 4 units (please ignore the ... in the network and concentrate only on visible units), and the input size (number of dimensions) is 3:
The number of weights is 28 = 16 (num_units * num_units) for the recurrent connections + 12 (input_dim * num_units) for input. The number of biases is simply num_units.
Recurrence means that each neuron output is fed back into the whole network, so if we unroll it over the time sequence, it looks like two dense layers:
and that makes it clear why we have num_units * num_units weights for the recurrent part.
The number of parameters for this simple RNN is 32 = 4 * 4 + 3 * 4 + 4, which can be expressed as num_units * num_units + input_dim * num_units + num_units or num_units * (num_units + input_dim + 1)
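This count can be verified directly in Keras (a small sketch, assuming a SimpleRNN with 4 units and a 3-dimensional input):

from keras.models import Sequential
from keras.layers import SimpleRNN

model = Sequential()
model.add(SimpleRNN(4, input_shape=(None, 3)))  # 3 input features, any number of timesteps
model.summary()  # Param #: 32 = 4*4 (recurrent) + 3*4 (input) + 4 (bias)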
Now, for LSTM, we must multiply the number of these parameters by 4, as this is the number of sub-parameters inside each unit, as nicely illustrated in the answer by @FelixHo.
Formula expansion for @JohnStrong:
4 means we have separate weight and bias variables for the 3 gates (read / write / forget) and a 4th for the cell state (within the same hidden state).
(These are shared across timesteps along a particular hidden state vector.)
4 * lstm_hidden_state_size * (lstm_inputs_size + bias_variable + lstm_outputs_size)
As the LSTM output (y) is h (the hidden state) by design, without an extra projection we have, for the LSTM outputs:
lstm_hidden_state_size = lstm_outputs_size
let's say it's d :
d = lstm_hidden_state_size = lstm_outputs_size
Then
params = 4 * d * ((lstm_inputs_size + 1) + d) = 4 * ((lstm_inputs_size + 1) * d + d^2)
LSTM Equations (via deeplearning.ai Coursera)
It is evident from the equations that the final dimensions of all 6 equations will be the same, and the final dimension must necessarily be equal to the dimension of a(t).
Out of these 6 equations, only 4 contribute to the number of parameters, and by looking at them it can be deduced that all 4 are symmetric. So, if we find the number of parameters for one equation, we can just multiply it by 4 to get the total number of parameters.
One important point to note is that the total number of parameters doesn't depend on the time-steps (or input_length), as the same "W" and "b" are shared throughout the time-steps.
Assume the inside of the LSTM cell has just one layer per gate (as in Keras).
Take equation 1 and let's relate it. Let the number of neurons in the layer be n and the number of dimensions of x be m (not including the number of examples and time-steps). Therefore, the dimension of the forget gate will be n too. Now, as in an ANN, the dimension of "Wf" will be n*(n+m) and the dimension of "bf" will be n. Therefore, the total number of parameters for one equation will be [{n*(n+m)} + n], and the total number of parameters will be 4*[{n*(n+m)} + n]. Let's open the brackets and we get -> 4*(nm + n^2 + n).
So, as per your values (n = 256, m = 4096), feeding them into the formula gives: total number of parameters = 4*((256*256) + (256*4096) + 256) = 4*(1114368) = 4457472.
The others have pretty much answered it, but just for further clarification: on creating an LSTM layer, the number of params is as follows:
No. of params = 4 * ((num_features + 1) * num_units + num_units^2)
The +1 is because of the additional bias we take.
where num_features is the number of features in your input shape to the LSTM:
input_shape=(window_size, num_features)
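Putting it together, a small sketch with illustrative values (window_size=16, num_features=4096, num_units=256, matching the example above):

from keras.models import Sequential
from keras.layers import LSTM

window_size, num_features, num_units = 16, 4096, 256
model = Sequential()
model.add(LSTM(num_units, input_shape=(window_size, num_features)))
model.summary()  # 4 * ((4096 + 1) * 256 + 256**2) = 4457472 params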