Keras CNN - loss continuously decreases but accuracy converges quickly - machine-learning

No matter what optimizer, accuracy or loss metrics I use, my accuracy converges quickly (within 10-20 epochs) while my loss continues to decrease (>100 epochs). I've tried every optimizer available in Keras and the same trend occurs (although some converge less quickly and with slightly higher accuracy than others, with Nadam, Adadelta and Adamax performing the best).
My input is a 64x1 data vector and my output is a 3x1 vector representing 3D coordinates in real space. I have about 2000 training samples and 500 test samples. I've normalized both the inputs and outputs using MinMaxScaler from the scikit-learn preprocessing toolbox, and I also shuffle my data using the scikit-learn shuffle function. I use train_test_split to split my data (with a specified random state). Here's my CNN:
def cnn(pretrained_weights = None,input_size = (64,1)):
    inputs = keras.engine.input_layer.Input(input_size)
    conv1 = Conv1D(64,2,strides=1,activation='relu')(inputs)
    conv2 = Conv1D(64,2,strides=1,activation='relu')(conv1)
    pool1 = MaxPooling1D(pool_size=2)(conv2)
    #pool1 = Dropout(0.25)(pool1)
    conv3 = Conv1D(128,2,strides=1,activation='relu')(pool1)
    conv4 = Conv1D(128,2,strides=1,activation='relu')(conv3)
    pool2 = MaxPooling1D(pool_size=2)(conv4)
    #pool2 = Dropout(0.25)(pool2)
    conv5 = Conv1D(256,2,strides=1,activation='relu')(pool2)
    conv6 = Conv1D(256,2,strides=1,activation='relu')(conv5)
    pool3 = MaxPooling1D(pool_size=2)(conv6)
    #pool3 = Dropout(0.25)(pool3)
    pool4 = MaxPooling1D(pool_size=2)(pool3)
    dense1 = Dense(256,activation='relu')(pool4)
    #drop1 = Dropout(0.5)(dense1)
    drop1 = dense1
    dense2 = Dense(64,activation='relu')(drop1)
    #drop2 = Dropout(0.5)(dense2)
    drop2 = dense2
    dense3 = Dense(32,activation='relu')(drop2)
    dense4 = Dense(1,activation='sigmoid')(dense3)
    model = Model(inputs = inputs, outputs = dense4)
    #opt = Adam(lr=1e-6,clipvalue=0.01)
    model.compile(optimizer = Nadam(lr=1e-4), loss = 'mse', metrics = ['accuracy','mse','mae'])
    # return the compiled model so that `model = cnn()` below works
    return model
I tried additional pooling (as can be seen in my code) to regularize my data and reduce overfitting (in case that's the problem) but to no avail. Here's a training example using the parameters above:
model = cnn()
model.fit(x=x_train, y=y_train, batch_size=7, epochs=10, verbose=1, validation_split=0.2, shuffle=True)
Train on 1946 samples, validate on 487 samples
Epoch 1/10
1946/1946 [==============================] - 5s 3ms/step - loss: 0.0932 - acc: 0.0766 - mean_squared_error: 0.0932 - mean_absolute_error: 0.2616 - val_loss: 0.0930 - val_acc: 0.0815 - val_mean_squared_error: 0.0930 - val_mean_absolute_error: 0.2605
Epoch 2/10
1946/1946 [==============================] - 2s 1ms/step - loss: 0.0903 - acc: 0.0783 - mean_squared_error: 0.0903 - mean_absolute_error: 0.2553 - val_loss: 0.0899 - val_acc: 0.0842 - val_mean_squared_error: 0.0899 - val_mean_absolute_error: 0.2544
Epoch 3/10
1946/1946 [==============================] - 2s 1ms/step - loss: 0.0886 - acc: 0.0807 - mean_squared_error: 0.0886 - mean_absolute_error: 0.2524 - val_loss: 0.0880 - val_acc: 0.0862 - val_mean_squared_error: 0.0880 - val_mean_absolute_error: 0.2529
Epoch 4/10
1946/1946 [==============================] - 2s 1ms/step - loss: 0.0865 - acc: 0.0886 - mean_squared_error: 0.0865 - mean_absolute_error: 0.2488 - val_loss: 0.0875 - val_acc: 0.1081 - val_mean_squared_error: 0.0875 - val_mean_absolute_error: 0.2534
Epoch 5/10
1946/1946 [==============================] - 2s 1ms/step - loss: 0.0849 - acc: 0.0925 - mean_squared_error: 0.0849 - mean_absolute_error: 0.2461 - val_loss: 0.0851 - val_acc: 0.0972 - val_mean_squared_error: 0.0851 - val_mean_absolute_error: 0.2427
Epoch 6/10
1946/1946 [==============================] - 2s 1ms/step - loss: 0.0832 - acc: 0.1002 - mean_squared_error: 0.0832 - mean_absolute_error: 0.2435 - val_loss: 0.0817 - val_acc: 0.1075 - val_mean_squared_error: 0.0817 - val_mean_absolute_error: 0.2400
Epoch 7/10
1946/1946 [==============================] - 2s 1ms/step - loss: 0.0819 - acc: 0.1041 - mean_squared_error: 0.0819 - mean_absolute_error: 0.2408 - val_loss: 0.0796 - val_acc: 0.1129 - val_mean_squared_error: 0.0796 - val_mean_absolute_error: 0.2374
Epoch 8/10
1946/1946 [==============================] - 2s 1ms/step - loss: 0.0810 - acc: 0.1060 - mean_squared_error: 0.0810 - mean_absolute_error: 0.2391 - val_loss: 0.0787 - val_acc: 0.1129 - val_mean_squared_error: 0.0787 - val_mean_absolute_error: 0.2348
Epoch 9/10
1946/1946 [==============================] - 2s 1ms/step - loss: 0.0794 - acc: 0.1089 - mean_squared_error: 0.0794 - mean_absolute_error: 0.2358 - val_loss: 0.0789 - val_acc: 0.1102 - val_mean_squared_error: 0.0789 - val_mean_absolute_error: 0.2337
Epoch 10/10
1946/1946 [==============================] - 2s 1ms/step - loss: 0.0785 - acc: 0.1086 - mean_squared_error: 0.0785 - mean_absolute_error: 0.2343 - val_loss: 0.0767 - val_acc: 0.1143 - val_mean_squared_error: 0.0767 - val_mean_absolute_error: 0.2328
I'm having a hard time diagnosing what the problem is. Do I need additional regularization? Here's an example of an input vector and corresponding ground truth:
input = array([[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[5.05487319e-04],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[2.11865474e-03],
[6.57073860e-04],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[8.02714614e-04],
[1.09597877e-03],
[5.37978732e-03],
[9.74035809e-03],
[0.00000000e+00],
[0.00000000e+00],
[2.04473307e-03],
[5.60562907e-04],
[1.76158615e-03],
[3.48869003e-03],
[6.45111735e-02],
[7.75741303e-01],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[1.33064182e-02],
[5.04751340e-02],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[5.90069050e-04],
[3.27240480e-03],
[1.92582590e-03],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[0.00000000e+00],
[4.50609885e-04],
[1.12957157e-03],
[1.24890352e-03]])
output = array([[0. ],
[0.41666667],
[0.58823529]])
Could it have to do with how the data is normalized or the nature of my data? Do I just not have enough data? Any insight is appreciated, I've tried advice from many other posts but nothing has worked yet. Thanks!

There are several issues with your question...
To start with, both your training & validation accuracies certainly do not "converge quickly", as you claim (both go from 0.07 to ~ 0.1); but even if this were the case, I fail to see how this would be a problem (usually people complain about the opposite, i.e. accuracy not converging, or not converging quickly enough).
But all this discussion is irrelevant, simply because you are in a regression setting, where accuracy is meaningless; truth is, in such a case, Keras will not "protect" you with a warning or something. You may find the discussion in What function defines accuracy in Keras when the loss is mean squared error (MSE)? useful (disclaimer: the answer is mine).
So, you should change the model.compile statement as follows:
model.compile(optimizer = Nadam(lr=1e-4), loss = 'mse')
i.e. there is no need for metrics here (measuring both mse and mae sounds like overkill - I suggest using only one of them).
Is the "mode" I'm in (in this case, regression) only dictated by the type of activation I use in the output layer?
No. The "mode" (regression or classification) is determined by your loss function: losses like mse and mae imply regression settings.
Which brings us to the last issue: unless you know that your outputs take values only in [0, 1], you should not use sigmoid as the activation function of your last layer; a linear activation is normally used for regression settings, i.e.:
dense4 = Dense(1,activation='linear')(dense3)
which, as the linear activation is the default one in Keras (docs), is not even needed explicitly, i.e.:
dense4 = Dense(1)(dense3)
will do the job as well.
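To make the point about accuracy in a regression setting concrete, here is a minimal sketch in plain Python. The target is the example output from the question, the predictions are made up, and `rounded_match_accuracy` is an illustrative helper (not a Keras function) that mimics a classification-style metric by comparing each target with the rounded prediction.

```python
# Target taken from the question's example output; predictions are made up.
y_true = [0.0, 0.41666667, 0.58823529]
y_pred = [0.02, 0.40, 0.60]


def mse(targets, preds):
    """Mean squared error, the quantity the model is actually minimising."""
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)


def mae(targets, preds):
    """Mean absolute error, an alternative regression metric."""
    return sum(abs(t - p) for t, p in zip(targets, preds)) / len(targets)


def rounded_match_accuracy(targets, preds):
    """Illustrative classification-style metric: exact match after rounding."""
    hits = sum(1 for t, p in zip(targets, preds) if t == round(p))
    return hits / len(targets)


print("mse:", mse(y_true, y_pred))
print("mae:", mae(y_true, y_pred))
print("accuracy:", rounded_match_accuracy(y_true, y_pred))
```

The errors are small, yet only the first coordinate counts as a "hit", because a continuous target almost never equals a rounded prediction exactly. The loss can therefore keep improving while such an accuracy figure barely moves.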

Related

Using the same unseen data for validation and testing has large performance metric difference

Overview:
During transfer model (ResNet) training, when setting the test data as the validation data during training, the VALIDATION performance metrics for the last epoch (15) are: binary accuracy of 0.85, f1 of 0.84, precision of 0.84, and recall of 0.85. However, after training and fine-tuning, the model prediction on the test set yields a poor confusion matrix:
[[223, 277]
[233, 267]]
Training Data:
Perfectly balanced with 2000 positive and 2000 negative samples of retinal fundus images.
Validation/Testing Data:
Perfectly balanced with 500 positive and 500 negative samples of retinal fundus images.
Data Splitting:
train: 2 column dataframe with 4000 retinal fundus images paths and 4000 label types
test: 2 column dataframe with 1000 retinal fundus images paths and 1000 label types
Generator Code:
from tensorflow.keras.applications import ResNet50V2
from tensorflow.keras.applications.resnet_v2 import preprocess_input
from keras.preprocessing.image import ImageDataGenerator
target = 512
trainDataGen = ImageDataGenerator(preprocessing_function=preprocess_input, rotation_range=30, horizontal_flip=True, vertical_flip=False,shear_range = 0.2,zoom_range = 0.2,brightness_range=(0.8, 1.2))
trainGen = trainDataGen.flow_from_dataframe(dataframe=train, batch_size = 16, shuffle=True, x_col="fundus", y_col="types", class_mode="binary", validate_filenames='True', target_size=(target, target), directory=None, color_mode='rgb')
testDataGen = ImageDataGenerator(preprocessing_function=preprocess_input)
testGen = testDataGen.flow_from_dataframe(dataframe=test, x_col="fundus", y_col="types", class_mode="binary", validate_filenames='True', target_size=(target, target), directory=None, color_mode='rgb')
Example Model:
from keras.layers import Dropout, BatchNormalization, GlobalAveragePooling2D, Input, Flatten, Dropout, Dense
from keras.models import Model
base_model = ResNet50V2(weights='imagenet', include_top=False, input_shape=(target,target,3))
for layer in base_model.layers[:-4]:
    layer.trainable = False
for layer in base_model.layers[-4:]:
    layer.trainable = True
flatten = Flatten() (base_model.output)
flatten = Dropout(0.75) (flatten)
flatten = Dense(512, activation='relu') (flatten)
predictions = Dense(1, activation='sigmoid') (flatten)
model = Model(inputs=base_model.input, outputs=predictions)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['binary_accuracy',f1_m,precision_m, recall_m])
history = model.fit_generator(trainGen, class_weight = {0:1 , 1:1}, epochs=15, validation_freq=1, validation_data=testGen)
Training Log Sample (prior to fine tuning):
Epoch 1/5
250/250 [==============================] - 643s 3s/step - loss: 1.6509 - binary_accuracy: 0.7312 - f1_m: 0.7095 - precision_m: 0.7429 - recall_m: 0.7319 - val_loss: 0.3843 - val_binary_accuracy: 0.8030 - val_f1_m: 0.7933 - val_precision_m: 0.8162 - val_recall_m: 0.7771
Epoch 2/5
250/250 [==============================] - 607s 2s/step - loss: 0.4188 - binary_accuracy: 0.8090 - f1_m: 0.7974 - precision_m: 0.8190 - recall_m: 0.8027 - val_loss: 0.3680 - val_binary_accuracy: 0.8160 - val_f1_m: 0.8118 - val_precision_m: 0.7705 - val_recall_m: 0.8718
Epoch 3/5
250/250 [==============================] - 606s 2s/step - loss: 0.3929 - binary_accuracy: 0.8278 - f1_m: 0.8113 - precision_m: 0.8542 - recall_m: 0.7999 - val_loss: 0.5049 - val_binary_accuracy: 0.7720 - val_f1_m: 0.8026 - val_precision_m: 0.7053 - val_recall_m: 0.9474
Epoch 4/5
250/250 [==============================] - 602s 2s/step - loss: 0.3491 - binary_accuracy: 0.8465 - f1_m: 0.8342 - precision_m: 0.8836 - recall_m: 0.8129 - val_loss: 0.3410 - val_binary_accuracy: 0.8350 - val_f1_m: 0.8425 - val_precision_m: 0.8038 - val_recall_m: 0.8948
Epoch 5/5
250/250 [==============================] - 617s 2s/step - loss: 0.3321 - binary_accuracy: 0.8480 - f1_m: 0.8335 - precision_m: 0.8705 - recall_m: 0.8187 - val_loss: 0.3538 - val_binary_accuracy: 0.8530 - val_f1_m: 0.8440 - val_precision_m: 0.9173 - val_recall_m: 0.7881
Model Evaluation:
from sklearn.metrics import confusion_matrix
import numpy as np
y_true = np.asarray(testGen.classes)
prediction = model.predict(testGen, verbose=1)
confusion = confusion_matrix(y_true, np.rint(prediction))
Summary:
Since the validation and test data were the same, I expected similar results. However, the large performance difference and poor confusion matrix are confusing :). Assuming the code is error-free, should this be expected when using the same data for validation and testing (despite both being unseen)?
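The default behavior of flow_from_dataframe is shuffle=True, which should not be used for validation or testing data generators: testGen.classes lists the labels in dataframe order, while the predictions come back in the shuffled order, so the two no longer line up. When specifying shuffle=False for the testGen variable, the confusion matrix will accurately show the performance results.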
Please see related thread: Should I shuffle the test image dataset in ImageDataGenerator? I have different results with False and True
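The effect can be reproduced without Keras. The following is a minimal sketch in plain Python, assuming a hypothetical perfect classifier and a label list that stands in for `testGen.classes`; all names are illustrative. It compares labels in file order against predictions that arrive in shuffled order.

```python
import random

# Hypothetical stand-in for testGen.classes: labels in dataframe (file) order.
labels_in_file_order = [0] * 500 + [1] * 500

# Order in which a shuffling generator would yield the samples.
rng = random.Random(0)
yield_order = list(range(len(labels_in_file_order)))
rng.shuffle(yield_order)

# Pretend the model is perfect: each prediction equals the true label of
# the sample that was actually fed to it.
preds_shuffled = [labels_in_file_order[i] for i in yield_order]
preds_unshuffled = list(labels_in_file_order)


def confusion(y_true, y_pred):
    """2x2 confusion matrix: rows are true classes, columns are predictions."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    return (matrix[0][0] + matrix[1][1]) / total


cm_shuffled = confusion(labels_in_file_order, preds_shuffled)
cm_unshuffled = confusion(labels_in_file_order, preds_unshuffled)
print("shuffle=True :", cm_shuffled, accuracy(cm_shuffled))
print("shuffle=False:", cm_unshuffled, accuracy(cm_unshuffled))
```

Even with a perfect model, the shuffled comparison lands near chance level, which resembles the roughly even confusion matrix reported in the question.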

training loss and validation loss both become 0.00000e+00 always

Can someone help me figure out why my loss is always 0.0000e+00?
I've seen that a few people had the same problem, but I wasn't able to fix it by following the same advice.
The rows are shuffled and the labels are already converted to float32; these are the suggestions I found on similar questions. Can you tell me what I'm doing wrong?
This is an image classification problem with more than one class.
This is how I create my model:
def createmodel():
    pretrained = InceptionV3(input_shape=(150,150,3),
                             include_top=False,
                             weights='imagenet')
    for layer in pretrained.layers:
        layer.trainable = False
    x = layers.Flatten()(pretrained.output)
    x = layers.Dense(1024,activation='relu')(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(1,activation="softmax")(x)
    model = Model(pretrained.input,x)
    model.compile(optimizer = Adam(0.001),
                  loss = 'categorical_crossentropy',
                  )
    return model
Epoch 1/2
10/10 [==============================] - 3s 322ms/step - loss: 0.0000e+00 - val_loss: 0.0000e+00
Epoch 2/2
10/10 [==============================] - 5s 464ms/step - loss: 0.0000e+00 - val_loss: 0.0000e+00
There is an issue with the final layer. The size should be equal to the number of classes as opposed to 1, i.e.:
x = layers.Dense(num_classes, activation="softmax")(x)
assuming num_classes is the number of the distinct classes in your data.
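Why a single softmax unit produces a loss of exactly zero can be checked by hand. Below is a minimal sketch of the underlying math in plain Python; the helper functions are illustrative and are not Keras's own implementation (which, among other things, clips probabilities).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shifted = [z - max(logits) for z in logits]
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true, y_prob):
    """Cross-entropy for one sample: sum of -y * log(p) over the output units."""
    return sum(-y * math.log(p) for y, p in zip(y_true, y_prob))


# A single output unit: softmax has nothing to normalise against,
# so the "probability" is 1 for any logit.
for logit in (-5.0, 0.0, 3.7):
    print(logit, softmax([logit]))

# Whatever the label is, log(1.0) == 0 makes the loss vanish.
for label in (0.0, 1.0, 2.0):
    print(label, categorical_crossentropy([label], softmax([0.3])))

# With one unit per class, probabilities differ and the loss is informative.
probs = softmax([2.0, 0.5, -1.0])
print(probs, categorical_crossentropy([0.0, 1.0, 0.0], probs))
```

With `num_classes` output units the labels also need to be one-hot encoded for categorical_crossentropy, as in the last call above.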

Is it normal for my test loss to reach millions

I'm training a model over several iterations (training, saving, and training again). On the second iteration my val_loss reached millions for some reason. Is something wrong with how I'm importing the model?
This is how I saved my initial model after my first run:
model.save('/content/drive/My Drive/Colab Notebooks/path/to/save/locaiton',save_format='tf')
and this is how I'm importing and overwriting it:
def retrainmodel(model_path,tr_path,v_path):
    image_size = 224
    BATCH_SIZE_TRAINING = 10
    BATCH_SIZE_VALIDATION = 10
    BATCH_SIZE_TESTING = 1
    EARLY_STOP_PATIENCE = 6
    STEPS_PER_EPOCH_TRAINING = 10
    STEPS_PER_EPOCH_VALIDATION = 10
    NUM_EPOCHS = 20
    model = tf.keras.models.load_model(model_path)
    data_generator = ImageDataGenerator(preprocessing_function=preprocess_input)
    train_generator = data_generator.flow_from_directory(tr_path,
        target_size=(image_size, image_size),
        batch_size=BATCH_SIZE_TRAINING,
        class_mode='categorical')
    validation_generator = data_generator.flow_from_directory(v_path,
        target_size=(image_size, image_size),
        batch_size=BATCH_SIZE_VALIDATION,
        class_mode='categorical')
    cb_early_stopper = EarlyStopping(monitor = 'val_loss', patience = EARLY_STOP_PATIENCE)
    cb_checkpointer = ModelCheckpoint(filepath = 'path/to/checkpoint/folder', monitor = 'val_loss', save_best_only = True, mode = 'auto')
    fit_history = model.fit(
        train_generator,
        steps_per_epoch=STEPS_PER_EPOCH_TRAINING,
        epochs = NUM_EPOCHS,
        validation_data=validation_generator,
        validation_steps=STEPS_PER_EPOCH_VALIDATION,
        callbacks=[cb_checkpointer, cb_early_stopper]
    )
    model.save('/content/drive/My Drive/Colab Notebooks/path/to/save/locaiton',save_format='tf')
This is my output after passing my directories to this function:
Found 1421 images belonging to 5 classes.
Found 305 images belonging to 5 classes.
Epoch 1/20
10/10 [==============================] - 233s 23s/step - loss: 2.3330 - acc: 0.7200 - val_loss: 4.6237 - val_acc: 0.4400
Epoch 2/20
10/10 [==============================] - 171s 17s/step - loss: 2.7988 - acc: 0.5900 - val_loss: 56996.6289 - val_acc: 0.6800
Epoch 3/20
10/10 [==============================] - 159s 16s/step - loss: 1.2776 - acc: 0.6800 - val_loss: 8396707.0000 - val_acc: 0.6500
Epoch 4/20
10/10 [==============================] - 144s 14s/step - loss: 1.4562 - acc: 0.6600 - val_loss: 2099639.7500 - val_acc: 0.7200
Epoch 5/20
10/10 [==============================] - 126s 13s/step - loss: 1.0970 - acc: 0.7033 - val_loss: 50811.5781 - val_acc: 0.7300
Epoch 6/20
10/10 [==============================] - 127s 13s/step - loss: 0.7326 - acc: 0.8000 - val_loss: 84781.5703 - val_acc: 0.7000
Epoch 7/20
10/10 [==============================] - 110s 11s/step - loss: 1.2356 - acc: 0.7100 - val_loss: 1000.2982 - val_acc: 0.7300
here is my optimizer:
sgd = optimizers.SGD(lr = 0.01, decay = 1e-6, momentum = 0.9, nesterov = True)
model.compile(optimizer = sgd, loss = 'categorical_crossentropy', metrics = 'acc')
Where do you think I'm going wrong?
I am training my model in batches because I'm working on Google Colab with 22K images in total, so these results are after feeding the network 2800 training images. Do you think it'll sort itself out if I feed it more images, or is something seriously wrong?
I think it's not good to have this loss. It's logical to have a higher loss during the initial few epochs as we load a model and retrain it. However, this loss value should not shoot to the stars, as in your case. If at the time of saving, the loss value was roughly 0.5, then when you load the same model for retraining, it should not be higher than 10x previous value, so, a value of 5 +- 1 would be expected. [NOTE: this is purely based on experience. There is no general method to know the loss beforehand.]
If your loss is this high, the following are plausible causes:
Varying dataset - Changing the dynamics of the training dataset could push the model into this behavior.
Saving the model might have altered the weights.
Solutions Suggested:
Try using save_weights instead of save method on model
model.save_weights('path/to/filename.h5')
also, use load_weights instead of load_model
model = call_cnn_function_to_build_model()
model.compile(... your args ...)
model.load_weights('path/to/filename.h5')  # loads in place; do not assign the return value back to model
Since you have checkpoints, try using checkpoint saved models. So rather than the final model, try to load a model from checkpoint which is near to your last epochs.
PS: Corrections gratefully accepted.

Keras Embedding Layer: keep zero-padded values as zeros

I've been thinking about 0-padding of word sequence and how that 0-padding is then converted to the Embedding layer. At first glance, one would think that you want to keep the embeddings = 0.0 as well. However, Embedding layer in keras generates random values for any input token, and there is no way to force it to generate 0.0's. Note, mask_zero does something different, I've already checked.
One might ask, why worry about this, the code seems to be working even when the embeddings are not 0.0's, as long as they are the same. So I came up with an example, albeit somewhat contrived, where setting the embeddings to 0.0's for the 0 padded token makes a difference.
I used the 20 News Groups data set from sklearn.datasets import fetch_20newsgroups. I do some minimal preprocessing: removal of punctuation, stopwords and numbers. I use from keras.preprocessing.sequence import pad_sequences for 0-padding. I split the ~18K posts into the training and validation set with the proportion of training/validation = 4/1.
I create a simple 1 dense hidden layer network with the input being the flattened sequence of embeddings:
EMBEDDING_DIM = 300
MAX_SEQUENCE_LENGTH = 1100
layer_size = 25
dropout = 0.3
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH, name = 'embedding_dnn')
embedded_sequences = embedding_layer(sequence_input)
x = Flatten(name='flatten_dnn')(embedded_sequences)
x = Dense(layer_size, activation='relu', name ='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name = 'output_dnn')(x)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
The model has about 14M trainable parameters (this example is a bit contrived, as I've already mentioned).
When I train it
earlystop = EarlyStopping(monitor='val_loss', patience=5)
history = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=30, batch_size=BATCH_SIZE, callbacks=[earlystop])
it looks like for 4 epochs the algorithm is struggling to find its way out of the 'randomness':
Train on 15048 samples, validate on 3798 samples
Epoch 1/30
15048/15048 [==============================] - 58s 4ms/step - loss: 3.1118 - acc: 0.0519 - val_loss: 2.9894 - val_acc: 0.0534
Epoch 2/30
15048/15048 [==============================] - 56s 4ms/step - loss: 2.9820 - acc: 0.0556 - val_loss: 2.9827 - val_acc: 0.0527
Epoch 3/30
15048/15048 [==============================] - 55s 4ms/step - loss: 2.9712 - acc: 0.0626 - val_loss: 2.9718 - val_acc: 0.0579
Epoch 4/30
15048/15048 [==============================] - 55s 4ms/step - loss: 2.9259 - acc: 0.0756 - val_loss: 2.8363 - val_acc: 0.0874
Epoch 5/30
15048/15048 [==============================] - 56s 4ms/step - loss: 2.7092 - acc: 0.1390 - val_loss: 2.3251 - val_acc: 0.2796
...
Epoch 13/30
15048/15048 [==============================] - 56s 4ms/step - loss: 0.0698 - acc: 0.9807 - val_loss: 0.5010 - val_acc: 0.8736
It ends up with the accuracy of ~0.87
print ('Best validation accuracy is ', max(history.history['val_acc']))
Best validation accuracy is 0.874934175379845
However, when I explicitly set the embeddings for the padded 0's to 0.0
def myMask(x):
    mask= K.greater(x,0) #will return boolean values
    mask= K.cast(mask, dtype=K.floatx())
    return mask
layer_size = 25
dropout = 0.3
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH, name = 'embedding_dnn')
embedded_sequences = embedding_layer(sequence_input)
y = Lambda(myMask, output_shape=(MAX_SEQUENCE_LENGTH,))(sequence_input)
y = Reshape(target_shape=(MAX_SEQUENCE_LENGTH,1))(y)
merge_layer = Multiply(name = 'masked_embedding_dnn')([embedded_sequences,y])
x = Flatten(name='flatten_dnn')(merge_layer)
x = Dense(layer_size, activation='relu', name ='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name = 'output_dnn')(x)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
the model with the same number of parameters immediately finds its way out of the 'randomness':
Train on 15048 samples, validate on 3798 samples
Epoch 1/30
15048/15048 [==============================] - 64s 4ms/step - loss: 2.4356 - acc: 0.3060 - val_loss: 1.2424 - val_acc: 0.7754
Epoch 2/30
15048/15048 [==============================] - 61s 4ms/step - loss: 0.6973 - acc: 0.8267 - val_loss: 0.5240 - val_acc: 0.8797
...
Epoch 10/30
15048/15048 [==============================] - 61s 4ms/step - loss: 0.0496 - acc: 0.9881 - val_loss: 0.4176 - val_acc: 0.8944
and ends up with a better accuracy of ~0.9.
Again, this is a somewhat contrived example, but still it shows that keeping those 'padded' embeddings at 0.0 can be beneficial.
Am I missing something here? And if I'm not missing anything, then, what is the reason Keras doesn't provide this functionality out-of-the-box?
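For readers who want to see what the Lambda + Reshape + Multiply combination does to the data, here is a minimal sketch in plain Python. The 4-token vocabulary, the embedding values and the token sequence are all made up for illustration.

```python
# Hypothetical 4-token vocabulary with 3-dimensional embeddings.
# Row 0 belongs to the padding token and is deliberately non-zero, as it
# would be after random initialisation.
embedding_table = [
    [0.11, 0.42, 0.30],    # index 0: padding
    [0.50, 0.10, -0.20],   # index 1
    [-0.70, 0.25, 0.05],   # index 2
    [0.33, -0.15, 0.90],   # index 3
]

# One sequence, pre-padded with zeros as pad_sequences does by default.
tokens = [0, 0, 3, 1, 2]

# Embedding lookup: one vector per position.
embedded = [embedding_table[t] for t in tokens]

# Same idea as myMask: 1.0 where the token id is greater than 0, else 0.0.
mask = [1.0 if t > 0 else 0.0 for t in tokens]

# Broadcast the mask over the embedding dimension (the Reshape + Multiply step).
masked = [[m * v for v in vec] for m, vec in zip(mask, embedded)]

for token, vec in zip(tokens, masked):
    print(token, vec)
```

Padded positions end up as all-zero vectors, while real tokens keep their embeddings unchanged.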
UPDATE
@DanielMöller I tried your suggestion:
layer_size = 25
dropout = 0.3
init = RandomUniform(minval=0.0001, maxval=0.05, seed=None)
constr = NonNeg()
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
embedding_layer = Embedding(len(word_index) + 1,
EMBEDDING_DIM,
input_length=MAX_SEQUENCE_LENGTH,
name = 'embedding_dnn',
embeddings_initializer=init,
embeddings_constraint=constr)
embedded_sequences = embedding_layer(sequence_input)
y = Lambda(myMask, output_shape=(MAX_SEQUENCE_LENGTH,))(sequence_input)
y = Reshape(target_shape=(MAX_SEQUENCE_LENGTH,1))(y)
merge_layer = Multiply(name = 'masked_embedding_dnn')([embedded_sequences,y])
x = Flatten(name='flatten_dnn')(merge_layer)
x = Dense(layer_size, activation='relu', name ='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name = 'output_dnn')(x)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
Unfortunately, the network was stuck in the 'randomness':
Train on 15197 samples, validate on 3649 samples
Epoch 1/30
15197/15197 [==============================] - 60s 4ms/step - loss: 3.1354 - acc: 0.0505 - val_loss: 2.9943 - val_acc: 0.0496
....
Epoch 24/30
15197/15197 [==============================] - 60s 4ms/step - loss: 2.9905 - acc: 0.0538 - val_loss: 2.9907 - val_acc: 0.0496
I also tried without the NonNeg() constraint, the same result.
Well, you're eliminating the computation of the gradients of the weights related to the padded steps.
If you have too many padded steps, then the embedding weights for the padding value will participate in a lot of calculations and will significantly compete with the other weights. But training these weights is a waste of computation and will certainly interfere with the other words.
Consider also that, for instance, some of the weights for padding might have values between the values for meaningful words. So, increasing the weight might make it similar to another word when it's not. And decreasing too....
These extra calculations, extra contributions to loss and gradient calculations, etc. will create more computational need and more obstacles. It's like having a lot of garbage in the middle of the data.
Notice also that these zeros are going directly to the dense layer, which will also eliminate the gradients for a lot of the dense weights. This might overfit longer sequences though if they are few compared to shorter sequences.
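The claim about dense-layer gradients follows from the chain rule. A minimal sketch in plain Python, assuming a bias-free dense layer with two units and made-up values for the input and the upstream gradient:

```python
# Flattened input to the dense layer: the first two entries come from padded
# positions (zeroed by the mask), the last two from a real word.
x = [0.0, 0.0, 0.8, -0.3]

# Hypothetical upstream gradient dL/d(out) for a dense layer with 2 units.
delta = [0.5, 1.2]

# For out[j] = sum_i x[i] * W[i][j], the chain rule gives
# dL/dW[i][j] = x[i] * delta[j].
grad_W = [[xi * dj for dj in delta] for xi in x]

for i, row in enumerate(grad_W):
    print("input", i, "gradient row", row)
```

The rows of the weight gradient that correspond to zeroed inputs are zero, so those dense weights are not updated by this sample.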
Out of curiosity, what will happen if you do this?
from keras.initializers import RandomUniform
from keras.constraints import NonNeg
init = RandomUniform(minval=0.0001, maxval=0.05, seed=None)
constr = NonNeg()
......
embedding_layer = Embedding(len(word_index) + 1,
EMBEDDING_DIM,
input_length=MAX_SEQUENCE_LENGTH,
name = 'embedding_dnn',
embeddings_initializer=init,
embeddings_constraint=constr)
..........

Bad performance While training Lstm for Text Classification on Amazon Fine Food Reviews Dataset?

I am trying to train an LSTM model for text classification on the Amazon Fine Food Reviews problem, using the same dataset as provided by Kaggle. I use a tokenizer to convert the text data into tokens, but during training I get the same accuracy for every epoch, like this:
Epoch 1/5
55440/55440 [==============================] - 161s 3ms/step - loss: 2.3666 - acc: 0.8516 - val_loss: 2.3741 - val_acc: 0.8511
Epoch 2/5
55440/55440 [==============================] - 159s 3ms/step - loss: 2.3666 - acc: 0.8516 - val_loss: 2.3741 - val_acc: 0.8511
Epoch 3/5
55440/55440 [==============================] - 160s 3ms/step - loss: 2.3666 - acc: 0.8516 - val_loss: 2.3741 - val_acc: 0.8511
Epoch 4/5
55440/55440 [==============================] - 160s 3ms/step - loss: 2.3666 - acc: 0.8516 - val_loss: 2.3741 - val_acc: 0.8511
Moreover, when I plot my confusion matrix, none of the negative classes are predicted; only positive classes are predicted.
I think my mistake is most likely in converting the labels, i.e. 'Positive' and 'Negative', to a numerical representation for classification.
Please see my code for more details.
I have tried increasing the number of LSTM units and epochs, and also increasing the sequence length, but none of it worked. Please note that I have done all the required pre-processing on the reviews.
# train and test split x is dataframe consisting of amazon fine food review
#dataset
x = df
x1 = pd.DataFrame(x)
y = df['Score']
x1.head()
import math
train_pct_index = int(0.70 * len(df)) #train data size = 70%
X_train, X_test = x1[:train_pct_index], x1[train_pct_index:]
y_train, y_test = y[:train_pct_index], y[train_pct_index:]
#y_test.value_counts()
x1_df = pd.DataFrame(X_train)
x2_df = pd.DataFrame(X_test)
from sklearn import preprocessing
encoder = preprocessing.LabelEncoder()
y_train=encoder.fit_transform(y_train)
y_test=encoder.fit_transform(y_test)
# tokenizing reviews
tokenizer = Tokenizer(num_words = 5000 )
tokenizer.fit_on_texts(x1_df['CleanedText'])
sequences = tokenizer.texts_to_sequences(x1_df['CleanedText'])
test_sequences = tokenizer.texts_to_sequences(x2_df['CleanedText'])
train_data = pad_sequences(sequences, maxlen=500)
test_data = pad_sequences(test_sequences, maxlen=500)
nb_words = (np.max(train_data) + 1)
# building lstm model and compiling it
from keras.layers.recurrent import LSTM, GRU
model = Sequential()
model.add(Embedding(nb_words,50,input_length=500))
model.add(LSTM(20))
model.add(Dropout(0.5))
model.add(Dense(1, activation='softmax'))
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics = ['accuracy'])
I would like my LSTM model to generalise well and also predict negative reviews, which are the minority class in this case.
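One detail in the code above that may be worth checking is `Dense(1, activation='softmax')`: a softmax over a single unit outputs 1 for any input, so every review would be predicted positive, which would be consistent with the constant accuracy and the confusion matrix described. The following is a minimal sketch in plain Python with made-up logits and labels, comparing that output with a sigmoid; it is an illustration, not a diagnosis of the full training setup.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shifted = [z - max(logits) for z in logits]
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Logistic function, the usual activation for a single binary output."""
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(labels, probs, threshold=0.5):
    """Share of samples whose thresholded probability matches the label."""
    preds = [1 if p > threshold else 0 for p in probs]
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)


# Made-up raw outputs of the last layer and labels (1 = positive, 0 = negative).
logits = [2.1, 0.7, -1.3, 3.0, 0.2, 1.8]
labels = [1, 1, 0, 1, 1, 1]

softmax_probs = [softmax([z])[0] for z in logits]   # one unit per sample
sigmoid_probs = [sigmoid(z) for z in logits]

print("softmax:", softmax_probs, accuracy(labels, softmax_probs))
print("sigmoid:", sigmoid_probs, accuracy(labels, sigmoid_probs))
```

With the single-unit softmax the accuracy equals the share of positive labels regardless of the logits, whereas the sigmoid output actually depends on them.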

Resources