TimeDistributed(Dense) vs Dense in Keras - Same number of parameters - machine-learning

I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially as they seem to have the same number of parameters.
My simplified model is the following:
InputSize = 15
MaxLen = 64
HiddenSize = 16
inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)
The summary of the network is:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 64, 15) 0
_________________________________________________________________
gru_1 (GRU) (None, 64, 16) 1536
_________________________________________________________________
time_distributed_1 (TimeDist (None, 64, 15) 255
_________________________________________________________________
activation_1 (Activation) (None, 64, 15) 0
=================================================================
This makes sense to me as my understanding of TimeDistributed is that it applies the same layer at all timepoints, and so the Dense layer has 16*15+15=255 parameters (weights+biases).
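The counts in the summary can be reproduced by hand. The following is a minimal sketch in plain Python; the helper names are made up for illustration, and the GRU formula assumes three gates with a single bias vector each, which is what matches the 1536 reported above.

```python
def dense_params(input_dim, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return input_dim * units + units


def gru_params(input_dim, units):
    # Assumption: three gates, each with an input kernel, a recurrent kernel
    # and a single bias vector.
    return 3 * (input_dim * units + units * units + units)


print(dense_params(16, 15))  # 255
print(gru_params(15, 16))    # 1536
```

Note that neither formula involves MaxLen: the number of time steps does not affect the parameter count.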
However, if I switch to a simple Dense layer:
inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.Dense(InputSize)(x)
predictions = keras.layers.Activation('softmax')(x)
I still only have 255 parameters:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 64, 15) 0
_________________________________________________________________
gru_1 (GRU) (None, 64, 16) 1536
_________________________________________________________________
dense_1 (Dense) (None, 64, 15) 255
_________________________________________________________________
activation_1 (Activation) (None, 64, 15) 0
=================================================================
I wonder if this is because Dense() will only use the last dimension in the shape, and effectively treat everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).
Update: Looking at https://github.com/fchollet/keras/blob/master/keras/layers/core.py, it does seem that Dense uses only the last dimension to size itself:
def build(self, input_shape):
    assert len(input_shape) >= 2
    input_dim = input_shape[-1]
    self.kernel = self.add_weight(shape=(input_dim, self.units),
It also uses keras.dot to apply the weights:
def call(self, inputs):
    output = K.dot(inputs, self.kernel)
The docs of keras.dot imply that it works fine on n-dimensional tensors. I wonder if its exact behavior means that Dense() will in effect be called at every time step. If so, the question still remains what TimeDistributed() achieves in this case.
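The suspicion above can be checked outside Keras. This is a rough sketch using NumPy as a stand-in for the backend dot (an assumption; it is not the Keras code itself), with random arrays in place of the GRU output and the Dense weights. It compares one dot over the last axis with an explicit loop over time steps.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, hidden, units = 2, 64, 16, 15

x = rng.normal(size=(batch, timesteps, hidden))  # stand-in for the GRU output
kernel = rng.normal(size=(hidden, units))        # stand-in for Dense.kernel
bias = rng.normal(size=(units,))                 # stand-in for Dense.bias

# A dot over the last axis: one matrix product, broadcast over batch and time.
all_at_once = np.dot(x, kernel) + bias

# Applying the same weights separately at every time step, then re-stacking.
per_step = np.stack(
    [x[:, t, :] @ kernel + bias for t in range(timesteps)], axis=1
)

print(all_at_once.shape)                   # (2, 64, 15)
print(np.allclose(all_at_once, per_step))  # True
```

Under these assumptions the two computations agree, which is consistent with both layers reporting 255 parameters.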

TimeDistributedDense applies the same Dense layer to every time step during GRU/LSTM cell unrolling. The loss is therefore computed between the predicted label sequence and the actual label sequence, which is normally what sequence-to-sequence labeling problems require.
However, with return_sequences=False, the Dense layer is applied only once, at the last cell. This is normally the case when RNNs are used for classification problems. If return_sequences=True, the Dense layer is applied to every time step, just like TimeDistributedDense.
So as far as your models are concerned, both are the same. But if you change your second model to return_sequences=False, Dense will be applied only at the last cell. Try changing it and the model will throw an error, because Y will then have to be of size [Batch_size, InputSize]: it is no longer a sequence-to-sequence problem but a full-sequence-to-label problem.
from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed
from keras.layers.recurrent import GRU
import numpy as np
InputSize = 15
MaxLen = 64
HiddenSize = 16
OutputSize = 8
n_samples = 1000
model1 = Sequential()
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model1.add(TimeDistributed(Dense(OutputSize)))
model1.add(Activation('softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model2 = Sequential()
model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model2.add(Dense(OutputSize))
model2.add(Activation('softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model3 = Sequential()
model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
model3.add(Dense(OutputSize))
model3.add(Activation('softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')
X = np.random.random([n_samples,MaxLen,InputSize])
Y1 = np.random.random([n_samples,MaxLen,OutputSize])
Y2 = np.random.random([n_samples, OutputSize])
model1.fit(X, Y1, batch_size=128, epochs=1)
model2.fit(X, Y1, batch_size=128, epochs=1)
model3.fit(X, Y2, batch_size=128, epochs=1)
print(model1.summary())
print(model2.summary())
print(model3.summary())
In the above example, the architectures of model1 and model2 are the same (sequence-to-sequence models), and model3 is a full-sequence-to-label model.

Here is a piece of code that verifies that TimeDistributed(Dense(X)) is identical to Dense(X):
import numpy as np
from keras.layers import Dense, TimeDistributed
import tensorflow as tf
X = np.array([ [[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]
],
[[3, 1, 7],
[8, 2, 5],
[11, 10, 4],
[9, 6, 12]
]
]).astype(np.float32)
print(X.shape)
(2, 4, 3)
dense_weights = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
[0.2, 0.7, 0.9, 0.1, 0.2],
[0.1, 0.8, 0.6, 0.2, 0.4]])
bias = np.array([0.1, 0.3, 0.7, 0.8, 0.4])
print(dense_weights.shape)
(3, 5)
dense = Dense(input_dim=3, units=5, weights=[dense_weights, bias])
input_tensor = tf.Variable(X, name='inputX')
output_tensor1 = dense(input_tensor)
output_tensor2 = TimeDistributed(dense)(input_tensor)
print(output_tensor1.shape)
print(output_tensor2.shape)
(2, 4, 5)
(2, ?, 5)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    output1 = sess.run(output_tensor1)
    output2 = sess.run(output_tensor2)
print(output1 - output2)
And the difference is:
[[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]]

Related

how to do custom keras layer matrix multiplication

Layers:
Input shape (None,75)
Hidden layer 1 - shape is (75,3)
Hidden layer 2 - shape is (3,1)
For the last layer, the output must be calculated as (H21*w1)*(H22*w2)*(H23*w3), where H21, H22, H23 are the outputs of Hidden layer 2 and w1, w2, w3 are constant weights that are not trainable. How do I write a Lambda function for this?
def product(X):
    return X[0]*X[1]
keras_model = Sequential()
keras_model.add(Dense(75,
input_dim=75,activation='tanh',name="layer1" ))
keras_model.add(Dense(3 ,activation='tanh',name="layer2" ))
keras_model.add(Dense(1,name="layer3"))
cross1=keras_model.add(Lambda(lambda x:product,output_shape=(1,1)))([layer2,layer3])
print(cross1)
NameError: name 'layer2' is not defined
Use the functional API model
inputs = Input((75,)) #shape (batch, 75)
output1 = Dense(75, activation='tanh',name="layer1" )(inputs) #shape (batch, 75)
output2 = Dense(3 ,activation='tanh',name="layer2" )(output1) #shape (batch, 3)
output3 = Dense(1,name="layer3")(output2) #shape (batch, 1)
cross1 = Lambda(lambda x: x[0] * x[1])([output2, output3]) #shape (batch, 3)
model = Model(inputs, cross1)
Please notice that the shapes are totally different from what you expect.
I would suggest doing it with a custom layer instead of the Lambda layer. Why? A custom layer gives you more freedom, and it is also more transparent when you want to inspect the weights. More precisely, if you do it through a Lambda layer, the constant weights will not be saved as part of the model, but they will be if you use a custom layer.
Here is an example
from keras import backend as K
from keras.layers import *
from keras.models import *
import numpy as np
class MyLayer(Layer):
    # see https://keras.io/layers/writing-your-own-keras-layers/
    def __init__(self,
                 w_vec=None,
                 allow_training=False,
                 **kwargs):
        self._w_vec = w_vec
        assert allow_training or (w_vec is not None), \
            "ERROR: non-trainable w_vec must be initialized"
        self.allow_training = allow_training
        super().__init__(**kwargs)
        return

    def build(self, input_shape):
        batch_size, num_feats = input_shape
        self.w_vec = self.add_weight(shape=(1, num_feats),
                                     name='w_vec',
                                     initializer='uniform',  # <- use your own preferred initializer
                                     trainable=self.allow_training,)
        if self._w_vec is not None:
            # predefined w_vec
            assert self._w_vec.shape[1] == num_feats, \
                "ERROR: initial w_vec shape mismatches the input shape"
            # set it to the weight
            self.set_weights([self._w_vec])  # <- set weights to the supplied one
        super().build(input_shape)
        return

    def call(self, x):
        # Given:
        #   x = [H21, H22, H23]
        #   w_vec = [w1, w2, w3]
        # Step 1: output elem_prod
        #   elem_prod = [H21*w1, H22*w2, H23*w3]
        elem_prod = x * self.w_vec
        # Step 2: output ret
        #   ret = (H21*w1) * (H22*w2) * (H23*w3)
        ret = K.prod(elem_prod, axis=-1, keepdims=True)
        return ret

    def compute_output_shape(self, input_shape):
        return (input_shape[0], 1)


def make_test_cases(w_vec=None, allow_training=False):
    x = Input(shape=(75,))
    y = Dense(75, activation='tanh', name='fc1')(x)
    y = Dense(3, activation='tanh', name='fc2')(y)
    y = MyLayer(w_vec, allow_training, name='core')(y)
    y = Dense(1, name='fc3')(y)
    net = Model(inputs=x, outputs=y,
                name='{}-{}'.format('randomInit' if w_vec is None else 'assignInit',
                                    'trainable' if allow_training else 'nontrainable'))
    print(net.name)
    print(net.layers[-2].get_weights()[0])
    print(net.summary())
    return net
You can run the following test cases to see the differences. Pay attention to the first and last lines of each printout, which give you the initial values and the number of constant (non-trainable) parameters, respectively.
a. Constant weights, non-trainable
m1 = make_test_cases(w_vec=np.arange(3).reshape([1,3]), allow_training=False)
will give you
assignInit-nontrainable [[0. 1. 2.]]
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_4 (InputLayer) (None, 75) 0
_________________________________________________________________
fc1 (Dense) (None, 75) 5700
_________________________________________________________________
fc2 (Dense) (None, 3) 228
_________________________________________________________________
core (MyLayer) (None, 1) 3
_________________________________________________________________
fc3 (Dense) (None, 1) 2
=================================================================
Total params: 5,933
Trainable params: 5,930
Non-trainable params: 3
_________________________________________________________________
b. Constant weights, trainable
m2 = make_test_cases(w_vec=np.arange(3).reshape([1,3]), allow_training=True)
will give you
assignInit-trainable [[0. 1. 2.]]
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_5 (InputLayer) (None, 75) 0
_________________________________________________________________
fc1 (Dense) (None, 75) 5700
_________________________________________________________________
fc2 (Dense) (None, 3) 228
_________________________________________________________________
core (MyLayer) (None, 1) 3
_________________________________________________________________
fc3 (Dense) (None, 1) 2
=================================================================
Total params: 5,933
Trainable params: 5,933
Non-trainable params: 0
_________________________________________________________________
c. Random weights, trainable
m3 = make_test_cases(w_vec=None, allow_training=True)
will give you
randomInit-trainable [[ 0.02650297 -0.02010062 -0.03771694]]
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_6 (InputLayer) (None, 75) 0
_________________________________________________________________
fc1 (Dense) (None, 75) 5700
_________________________________________________________________
fc2 (Dense) (None, 3) 228
_________________________________________________________________
core (MyLayer) (None, 1) 3
_________________________________________________________________
fc3 (Dense) (None, 1) 2
=================================================================
Total params: 5,933
Trainable params: 5,933
Non-trainable params: 0
_________________________________________________________________
Final remark
I will say it is unclear which case may work better in advance for your problem, but trying all three sounds like a good plan.
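For reference, the arithmetic performed in MyLayer.call can be checked outside Keras. The following is a small NumPy sketch with made-up values for the fc2 outputs and for the constant weights; it mirrors only the forward computation and is not a replacement for the layer.

```python
import numpy as np

# Two made-up samples of [H21, H22, H23] (the outputs of the 3-unit layer).
h = np.array([[0.5, -0.2, 0.1],
              [1.0, 2.0, 3.0]])
# Made-up constant weights [w1, w2, w3], shaped (1, 3) like w_vec in the layer.
w = np.array([[2.0, 3.0, 4.0]])

elem_prod = h * w                                  # broadcast over the batch axis
ret = np.prod(elem_prod, axis=-1, keepdims=True)   # (H21*w1)*(H22*w2)*(H23*w3)

print(ret.shape)  # (2, 1)
print(ret)
```

The keepdims=True argument is what keeps the output two-dimensional, (batch, 1), matching compute_output_shape in the layer.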

Keras target dimensions mismatch

Attempting a single-label classification problem with num_classes = 73
Here's my simplified Keras model:
num_classes = 73
batch_size = 4
train_data_list = [training_file_names list here..]
validation_data_list = [ validation_file_names list here..]
training_generator = DataGenerator(train_data_list, batch_size, num_classes)
validation_generator = DataGenerator(validation_data_list, batch_size, num_classes)
model = Sequential()
model.add(Conv1D(32, 3, strides=1, input_shape=(15,120), activation="relu"))
model.add(Conv1D(16, 3, strides=1, activation="relu"))
model.add(Flatten())
model.add(Dense(num_classes, activation='softmax'))
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss="categorical_crossentropy",optimizer=sgd,metrics=['accuracy'])
model.fit_generator(generator=training_generator, epochs=100,
validation_data=validation_generator)
Here's my DataGenerator's __getitem__ method:
def __getitem__(self, index):
    X = np.zeros((self.batch_size,15,120))
    y = np.zeros((self.batch_size, 1 ,self.n_classes))
    for i in range(self.batch_size):
        X_row = some_method_that_gives_X_of_15x20_dim()
        target = some_method_that_gives_target()
        one_hot = keras.utils.to_categorical(target, num_classes=self.n_classes)
        X[i] = X_row
        y[i] = one_hot
    return X, y
Since my X values are correctly returned with dimension (batch_size, 15, 120), I am not showing it here. My issue is with the y value returned.
y returned from this generator method has a shape of (batch_size, 1, 73) as one hot encoded label for the 73 classes, which I think is the correct shape to return.
However Keras gives the following error for the last layer:
ValueError: Error when checking target: expected dense_1 to have 2
dimensions, but got array with shape (4, 1, 73)
Since the batch size is 4, I think the target batch should also be 3-dimensional, (4, 1, 73). Why then is Keras expecting the last layer to have 2 dimensions?
Your model's summary shows that the output layer should have only 2 dimensions, (None, 73):
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_7 (Conv1D) (None, 13, 32) 11552
_________________________________________________________________
conv1d_8 (Conv1D) (None, 11, 16) 1552
_________________________________________________________________
flatten_5 (Flatten) (None, 176) 0
_________________________________________________________________
dense_4 (Dense) (None, 73) 12921
=================================================================
Total params: 26,025
Trainable params: 26,025
Non-trainable params: 0
_________________________________________________________________
Since the shape of your target is (batch_size, 1, 73), you can just change it to (batch_size, 73) for your model to run.
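A minimal NumPy sketch of that change follows, using made-up integer labels. It shows two ways to end up with a (batch_size, 73) target: allocating y without the extra axis, or squeezing the singleton axis out of an existing (batch_size, 1, 73) array.

```python
import numpy as np

batch_size, n_classes = 4, 73
targets = [3, 17, 0, 72]  # made-up integer class labels, one per sample

# Option 1: allocate the batch without the extra axis and fill it row by row.
y = np.zeros((batch_size, n_classes))
for i, target in enumerate(targets):
    y[i, target] = 1.0  # one-hot row for sample i

# Option 2: keep the 3-D array and drop the singleton axis afterwards.
y_3d = y[:, np.newaxis, :]        # shape (4, 1, 73), as in the question
y_2d = np.squeeze(y_3d, axis=1)   # shape (4, 73)

print(y.shape, y_3d.shape, y_2d.shape)
```

Either result matches the (None, 73) output of the final Dense layer.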

Why is my CNN overfitting and how can I fix it?

I am finetuning a 3D-CNN called C3D which was originally trained to classify sports from video clips.
I am freezing the convolution (feature extraction) layers and training the fully connected layers using gifs from GIPHY to classify the gifs for sentiment analysis (positive or negative).
Weights are pre loaded for all layers except the final fully connected layer.
I am using 5000 images (2500 positive, 2500 negative) for training with a 70/30 training/testing split using Keras. I am using the Adam optimizer with a learning rate of 0.0001.
The training accuracy increases and the training loss decreases during training, but very early on the validation accuracy and loss stop improving as the model starts to overfit.
I believe I have enough training data, and I am using a dropout of 0.5 on both of the fully connected layers. How can I combat this overfitting?
The model architecture, training code and visualisations of training performance from Keras can be found below.
train_c3d.py
from training.c3d_model import create_c3d_sentiment_model
from ImageSentiment import load_gif_data
import numpy as np
import pathlib
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
def image_generator(files, batch_size):
    """
    Generate batches of images for training instead of loading all images into memory
    :param files:
    :param batch_size:
    :return:
    """
    while True:
        # Select files (paths/indices) for the batch
        batch_paths = np.random.choice(a=files,
                                       size=batch_size)
        batch_input = []
        batch_output = []
        # Read in each input, perform preprocessing and get labels
        for input_path in batch_paths:
            input = load_gif_data(input_path)
            if "pos" in input_path:  # if file name contains pos
                output = np.array([1, 0])  # label
            elif "neg" in input_path:  # if file name contains neg
                output = np.array([0, 1])  # label
            batch_input += [input]
            batch_output += [output]
        # Return a tuple of (input,output) to feed the network
        batch_x = np.array(batch_input)
        batch_y = np.array(batch_output)
        yield (batch_x, batch_y)
model = create_c3d_sentiment_model()
print(model.summary())
model.load_weights('models/C3D_Sport1M_weights_keras_2.2.4.h5', by_name=True)
for layer in model.layers[:14]:  # freeze top layers as feature extractor
    layer.trainable = False
for layer in model.layers[14:]:  # fine tune final layers
    layer.trainable = True
train_files = [str(filepath.absolute()) for filepath in pathlib.Path('data/sample_train').glob('**/*')]
val_files = [str(filepath.absolute()) for filepath in pathlib.Path('data/sample_validation').glob('**/*')]
batch_size = 8
train_generator = image_generator(train_files, batch_size)
validation_generator = image_generator(val_files, batch_size)
model.compile(optimizer=Adam(lr=0.0001),
loss='binary_crossentropy',
metrics=['accuracy'])
mc = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min', verbose=1)
history = model.fit_generator(train_generator, validation_data=validation_generator,
steps_per_epoch=int(np.ceil(len(train_files) / batch_size)),
validation_steps=int(np.ceil(len(val_files) / batch_size)), epochs=5, shuffle=True,
callbacks=[mc])
load_gif_data()
def load_gif_data(file_path):
    """
    Load and process gif for input into Keras model
    :param file_path:
    :return: Mean normalised image in BGR format as numpy array
    for more info see -> http://cs231n.github.io/neural-networks-2/
    """
    im = Img(fp=file_path)
    try:
        im.load(limit=16,  # Keras image model only requires 16 frames
                first=True)
    except:
        print("Error loading image: " + file_path)
        return
    im.resize(size=(112, 112))
    im.convert('RGB')
    im.close()
    np_frames = []
    frame_index = 0
    for i in range(16):  # if image is less than 16 frames, repeat the frames until there are 16
        frame = im.frames[frame_index]
        rgb = np.array(frame)
        bgr = rgb[..., ::-1]
        mean = np.mean(bgr, axis=0)
        np_frames.append(bgr - mean)  # C3D model was originally trained on BGR, mean normalised images
        # it is important that unseen images are in the same format
        if frame_index == (len(im.frames) - 1):
            frame_index = 0
        else:
            frame_index = frame_index + 1
    return np.array(np_frames)
model architecture
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1 (Conv3D) (None, 16, 112, 112, 64) 5248
_________________________________________________________________
pool1 (MaxPooling3D) (None, 16, 56, 56, 64) 0
_________________________________________________________________
conv2 (Conv3D) (None, 16, 56, 56, 128) 221312
_________________________________________________________________
pool2 (MaxPooling3D) (None, 8, 28, 28, 128) 0
_________________________________________________________________
conv3a (Conv3D) (None, 8, 28, 28, 256) 884992
_________________________________________________________________
conv3b (Conv3D) (None, 8, 28, 28, 256) 1769728
_________________________________________________________________
pool3 (MaxPooling3D) (None, 4, 14, 14, 256) 0
_________________________________________________________________
conv4a (Conv3D) (None, 4, 14, 14, 512) 3539456
_________________________________________________________________
conv4b (Conv3D) (None, 4, 14, 14, 512) 7078400
_________________________________________________________________
pool4 (MaxPooling3D) (None, 2, 7, 7, 512) 0
_________________________________________________________________
conv5a (Conv3D) (None, 2, 7, 7, 512) 7078400
_________________________________________________________________
conv5b (Conv3D) (None, 2, 7, 7, 512) 7078400
_________________________________________________________________
zeropad5 (ZeroPadding3D) (None, 2, 8, 8, 512) 0
_________________________________________________________________
pool5 (MaxPooling3D) (None, 1, 4, 4, 512) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 8192) 0
_________________________________________________________________
fc6 (Dense) (None, 4096) 33558528
_________________________________________________________________
dropout_1 (Dropout) (None, 4096) 0
_________________________________________________________________
fc7 (Dense) (None, 4096) 16781312
_________________________________________________________________
dropout_2 (Dropout) (None, 4096) 0
_________________________________________________________________
nfc8 (Dense) (None, 2) 8194
=================================================================
Total params: 78,003,970
Trainable params: 78,003,970
Non-trainable params: 0
_________________________________________________________________
None
training visualisations
I think that the error is in the loss function and in the last Dense layer. As provided in the model summary, the last Dense layer is,
nfc8 (Dense) (None, 2)
The output shape is ( None , 2 ) meaning that the layer has 2 units. As you said earlier, you need to classify GIFs as positive or negative.
Classifying GIFs could be a binary classification problem or a multiclass classification problem ( with two classes ).
Binary classification has only 1 unit in the last Dense layer with a sigmoid activation function. But, here the model has 2 units in the last Dense layer.
Hence, the model is a multiclass classifier, but you have given a loss function of binary_crossentropy which is meant for binary classifiers ( with a single unit in the last layer ).
So, replacing the loss with categorical_crossentropy should work. Or edit the last Dense layer and change the number of units and activation function.
Hope this helps.
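To make the two setups mentioned above concrete, here is a small sketch in plain Python. The helper functions and the logits are made up for illustration and are not the Keras implementations. It compares a two-unit softmax output scored with categorical cross-entropy against a one-unit sigmoid output scored with binary cross-entropy.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_crossentropy(y_true, p):
    # y_true is 0 or 1; p is the predicted probability of class 1.
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def categorical_crossentropy(one_hot, probs):
    # one_hot and probs have the same length; probs sums to 1.
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


logits = [1.2, -0.3]  # made-up two-unit output, ordered [positive, negative]

# Two units + softmax + categorical cross-entropy, true class "positive".
loss_two_units = categorical_crossentropy([1, 0], softmax(logits))

# One unit + sigmoid + binary cross-entropy, fed the difference of the logits.
loss_one_unit = binary_crossentropy(1, sigmoid(logits[0] - logits[1]))

print(abs(loss_two_units - loss_one_unit) < 1e-9)  # True
```

For two classes the two formulations describe the same probability model, so the practical point is to keep the number of output units, the activation and the loss consistent with one another.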

ValueError: Error when checking target: expected dense_8 to have 4 dimensions, but got array with shape (37800, 10, 10)

I'm a beginner at machine learning. I am working on the MNIST dataset, which I downloaded from Kaggle. I am doing this first project with the help of a tutorial, but I'm facing an issue that I am unable to resolve. Please help. Here's the code below.
import keras
import keras.preprocessing
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
import pandas as pd
from keras.layers import Dense
from keras.optimizers import SGD
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score, confusion_matrix
X = pd.read_csv(r'C:\Users\faizan\Desktop\ML\Kaggle\MNIST\train.csv')
Y = pd.read_csv(r'C:\Users\faizan\Desktop\ML\Kaggle\MNIST\test.csv')
y = X["label"]
X = X.drop(["label"], axis=1)
#x = Y.drop(['label'], 1)
print(y.shape)
print(X.shape)
print(Y.shape)
y = keras.utils.to_categorical(y, num_classes = 10)
X = X / 255.0
X = X.values.reshape(-1,28,28,1)
# Shuffle Split Train and Test from original dataset
seed=2
train_index, valid_index = ShuffleSplit(n_splits=1,
train_size=0.9,
test_size=None,
random_state=seed).split(X).__next__()
x_train = X[train_index]
Y_train = y[train_index]
x_test = X[valid_index]
Y_test = y[valid_index]
model = Sequential()
model.add(Dense(units=128,activation="relu", input_shape=(28, 28, 1)))
model.add(Dense(units=128,activation="relu"))
model.add(Dense(units=128,activation="relu"))
model.add(Dense(units=10,activation="softmax"))
## Compiling Model
model.compile(optimizer=SGD(0.001),loss="categorical_crossentropy",metrics=["accuracy"])
## Training
model.fit(x_train,Y_train,batch_size=32, epochs=10,verbose=1)
accuracy = model.evaluate(x=x_test, y=Y_test, batch_size=32)
## Checking Accuracy
print("Accuracy: ", accuracy[1])
You are making a mistake that causes your network to fail.
First, I will assume you are working with the MNIST dataset and that you are trying to assign each image to a class. Your network is the following:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 28, 28, 128) 256
_________________________________________________________________
dense_2 (Dense) (None, 28, 28, 128) 16512
_________________________________________________________________
dense_3 (Dense) (None, 28, 28, 128) 16512
_________________________________________________________________
dense_4 (Dense) (None, 28, 28, 10) 1290
=================================================================
Total params: 34,570
Trainable params: 34,570
Non-trainable params: 0
_________________________________________________________________
So: You have four dimensions at the output of the network. And that is not right for a classification task. If you add a Flatten Layer just before the last layer:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_5 (Dense) (None, 28, 28, 128) 256
_________________________________________________________________
dense_6 (Dense) (None, 28, 28, 128) 16512
_________________________________________________________________
dense_7 (Dense) (None, 28, 28, 128) 16512
_________________________________________________________________
flatten_1 (Flatten) (None, 100352) 0
_________________________________________________________________
dense_8 (Dense) (None, 10) 1003530
=================================================================
Total params: 1,036,810
Trainable params: 1,036,810
Non-trainable params: 0
_________________________________________________________________
Here you see that we have the ten classes you wanted, and that there are only two dimensions: one for the batch size (None) and the other for the classes (10). For one sample, the output will be a vector of probabilities, one per class, summing to one because of the softmax activation (mutually exclusive classes).
Could you please try running it with the Flatten layer to see whether this was your issue?
I also strongly advise you to look into how images are handled in Keras, because using Dense layers here (and only Dense layers) is not optimal (for example, you can see this Kaggle tutorial).
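The parameter counts in the two summaries above follow from the fact that Dense sizes its kernel from the last input dimension only. A quick check in plain Python (the helper name is made up for illustration):

```python
def dense_params(input_dim, units):
    # Kernel of shape (input_dim, units) plus one bias per unit.
    return input_dim * units + units


# Without Flatten: every Dense layer only sees the last axis (1, then 128).
without_flatten = [
    dense_params(1, 128),
    dense_params(128, 128),
    dense_params(128, 128),
    dense_params(128, 10),
]
print(without_flatten, sum(without_flatten))

# With Flatten before the last layer: its input has 28 * 28 * 128 features.
flat_dim = 28 * 28 * 128
last_layer = dense_params(flat_dim, 10)
print(flat_dim, last_layer, sum(without_flatten[:3]) + last_layer)
```

The first line of output reproduces 256, 16512, 16512, 1290 and the total of 34,570. The second reproduces 100352, 1003530 and the total of 1,036,810.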

How to decrease a 3D matrix to a 2D matrix using Keras?

I have built a Keras ConvLSTM neural network, and I want to predict one frame ahead based on a sequence of 10-time steps:
from keras.models import Sequential
from keras.layers.convolutional import Conv3D
from keras.layers.convolutional_recurrent import ConvLSTM2D
from keras.layers.normalization import BatchNormalization
import numpy as np
import pylab as plt
from keras import layers
# We create a layer which take as input movies of shape
# (n_frames, width, height, channels) and returns a movie
# of identical shape.
model = Sequential()
model.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
input_shape=(None, 64, 64, 1),
padding='same', return_sequences=True))
model.add(BatchNormalization())
model.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
padding='same', return_sequences=True))
model.add(BatchNormalization())
model.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
padding='same', return_sequences=True))
model.add(BatchNormalization())
model.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
padding='same', return_sequences=True))
model.add(BatchNormalization())
model.add(Conv3D(filters=1, kernel_size=(3, 3, 3),
activation='sigmoid',
padding='same', data_format='channels_last'))
model.compile(loss='binary_crossentropy', optimizer='adadelta')
training:
data_train_x = data_4[0:20, 0:10, :, :, :]
data_train_y = data_4[0:20, 10:11, :, :, :]
model.fit(data_train_x, data_train_y, batch_size=10, epochs=1,
validation_split=0.05)
and I test the model:
test_x = np.reshape(data_test_x[2,:,:,:,:], [1,10,64,64,1])
next_frame = model.predict(test_x,batch_size=1, verbose=1, steps=None)
but the problem is that 'next_frame' shape is: (1, 10, 64, 64, 1) but I wanted it to be of shape (1, 1, 64, 64, 1)
And this is the results of 'model.summary()':
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv_lst_m2d_1 (ConvLSTM2D) (None, None, 64, 64, 40) 59200
_________________________________________________________________
batch_normalization_1 (Batch (None, None, 64, 64, 40) 160
_________________________________________________________________
conv_lst_m2d_2 (ConvLSTM2D) (None, None, 64, 64, 40) 115360
_________________________________________________________________
batch_normalization_2 (Batch (None, None, 64, 64, 40) 160
_________________________________________________________________
conv_lst_m2d_3 (ConvLSTM2D) (None, None, 64, 64, 40) 115360
_________________________________________________________________
batch_normalization_3 (Batch (None, None, 64, 64, 40) 160
_________________________________________________________________
conv_lst_m2d_4 (ConvLSTM2D) (None, None, 64, 64, 40) 115360
_________________________________________________________________
batch_normalization_4 (Batch (None, None, 64, 64, 40) 160
_________________________________________________________________
conv3d_1 (Conv3D) (None, None, 64, 64, 1) 1081
=================================================================
Total params: 407,001
Trainable params: 406,681
Non-trainable params: 320
So I don't know what layer to add so I decrease the output to 1 frame instead of 10 frames?
This is expected, given the 3D convolution in the final layer. For example, if you have 1 filter in a Conv2D across a 3-dimensional tensor, with padding = 'same', it will produce a 2D output of the same height and width (i.e. the filter implicitly also captures along the depth axis).
The same is true for a 3D convolution across a 4-dimensional tensor, where it implicitly captures along the channel (depth) axis, resulting in a 3-D tensor with the same (sequence index, height, width) as the input.
It sounds like what you want to do is add a pooling step of some kind after your Conv3D layer, such that it flattens across the sequence dimension, such as with AveragePooling3D with a pooling tuple of (10, 1, 1) to average across the first non-batch dimension (or modified according to your specific network needs).
Alternatively, suppose you want to specifically "pool" along the sequence dimension by taking only the final sequence element (e.g. instead of averaging or max-pooling across the sequence). You could then make the final ConvLSTM2D layer to have return_sequences=False, followed by a 2D convolution in the final step, but this means your final convolution won't benefit from aggregating across a sequence of predicted frames. Probably application-specific whether this is a good idea or not.
Just to confirm the first approach, I added:
model.add(layers.AveragePooling3D(pool_size=(10, 1, 1), padding='same'))
just after the Conv3D layer, and then made toy data:
x = np.random.rand(1, 10, 64, 64, 1)
and then:
In [22]: z = model.predict(x)
In [23]: z.shape
Out[23]: (1, 1, 64, 64, 1)
You would need to ensure the pooling size in the first non-batch dimension is set to the maximum possible sequence length to ensure you always get (1, 1, ...) in the final output shape.
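What the pooling step does to the shape can be sketched with NumPy alone. This is an illustration only, assuming the sequence length is exactly 10, so that a pool of (10, 1, 1) reduces to a plain mean over the time axis.

```python
import numpy as np

# (batch, time, height, width, channels), same layout as the toy data above.
frames = np.random.rand(1, 10, 64, 64, 1)

# Averaging over all 10 time steps while keeping the axis in place.
pooled = frames.mean(axis=1, keepdims=True)

print(pooled.shape)  # (1, 1, 64, 64, 1)
```

If the sequence were longer than the pool size, the pooled output would keep more than one step along the time axis, which is why the pool size has to cover the whole sequence.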
As an alternative to ely's Conv2D and AveragePooling3D solutions, you can set the last ConvLSTM2D layer's return_sequences parameter to True, change the padding of the Conv3D layer to 'valid', and set its kernel_size parameter to (n_observations - k_steps_to_predict + 1, 1, 1). With this, you can alter the time dimension (number of frames) of the output. You can apply this to any direct k-step-ahead prediction, assuming that the number of observations is fixed.
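The kernel size above follows from the usual output-length formula for a convolution with 'valid' padding. A small sketch in plain Python (the function name is made up for illustration, and a stride of 1 is assumed):

```python
def valid_conv_output_length(input_length, kernel_size, stride=1):
    # Output length along one axis for a convolution with padding='valid'.
    return (input_length - kernel_size) // stride + 1


n_observations = 10
for k_steps_to_predict in (1, 3):
    kernel_time = n_observations - k_steps_to_predict + 1
    out_frames = valid_conv_output_length(n_observations, kernel_time)
    print(k_steps_to_predict, kernel_time, out_frames)
```

With 10 observed frames, a time kernel of 10 leaves 1 output frame and a time kernel of 8 leaves 3, so the output length equals k_steps_to_predict in both cases.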

Resources