What's wrong with my Backpropagation algorithm implementation? - machine-learning

I tried to implement a neural network to experiment with the MNIST dataset. I used sigmoid as the activation function and cross-entropy as the loss function.
For simplicity, my network has no hidden layer, just input and output.
import numpy as np

X = trainImage
label = trainLabel
w1 = np.random.normal(np.zeros([28 * 28, 10]))
b1 = np.random.normal(np.zeros([10]))

def sigm(x):
    return 1 / (1 + np.exp(-x))

def acc():
    y = sigm(np.matmul(X, w1))
    return sum(np.argmax(label, 1) == np.argmax(y, 1)) / y.shape[0]

def loss():
    y = sigm(np.matmul(X, w1))
    return sum((-label * np.log(y)).flatten())

a = np.matmul(X[0:1], w1)
y = sigm(a)
dy = - label[0:1] / y
ds = dy * y * (1 - y)
dw = np.matmul(X[0:1].transpose(), ds)
db = ds

def bp(lr, i):
    global w1, b1, a, y, dy, ds, dw, db
    a = np.matmul(X[i:i+1], w1)
    y = sigm(a)
    dy = - label[0:1] / y
    ds = dy * y * (1 - y)
    dw = np.matmul(X[i:i+1].transpose(), ds)
    db = ds
    w1 = w1 - lr * dw
    b1 = b1 - lr * db

for i in range(100 * 60000):
    bp(1, i % 60000)
    if i % 60000 == 0:
        print("#", int(i / 60000), "loss:", loss(), "acc:", acc())
This is the relevant part of my implementation of the backpropagation algorithm, but it doesn't work as expected. The loss decreases extremely slowly (I tried learning rates varying from 0.001 to 1), and the accuracy never rises above 0.1.
The output is like this:
# 0 loss: 279788.368245 acc: 0.0903333333333
# 1 loss: 279788.355211 acc: 0.09035
# 2 loss: 279788.350629 acc: 0.09035
# 3 loss: 279788.348228 acc: 0.09035
# 4 loss: 279788.346736 acc: 0.09035
# 5 loss: 279788.345715 acc: 0.09035
# 6 loss: 279788.344969 acc: 0.09035
# 7 loss: 279788.3444 acc: 0.09035
# 8 loss: 279788.343951 acc: 0.09035
# 9 loss: 279788.343587 acc: 0.09035
# 10 loss: 279788.343286 acc: 0.09035
# 11 loss: 279788.343033 acc: 0.09035

From what I see here, there are a few possible factors preventing this from working.
Firstly, you have to randomly initialize your weights and biases. From what I am seeing, you are trying to update values that don't yet have meaningful values (e.g. w1 = 0).
Secondly, your optimizer is not suited to the MNIST dataset. The optimizer is what updates the values based on the backpropagated gradients, so it is very important that you choose the right one. Gradient descent is better suited for this dataset, and from what I can see, you are not using gradient descent. If you are attempting to implement a gradient descent optimizer with this code, you are most likely doing it incorrectly. Gradient descent involves partial derivatives of the SSE (sum of squared errors), which I do not see in this code.
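For reference (a standard textbook form, not something taken from the question's code): for a single sigmoid output layer trained with an SSE loss, the partial derivatives in question and the corresponding gradient descent update are

$$E = \frac{1}{2}\sum_{n}\sum_{k}\bigl(t_{nk} - y_{nk}\bigr)^{2}, \qquad \frac{\partial E}{\partial w_{jk}} = -\sum_{n}\bigl(t_{nk} - y_{nk}\bigr)\,y_{nk}\bigl(1 - y_{nk}\bigr)\,x_{nj}, \qquad w_{jk} \leftarrow w_{jk} - \eta\,\frac{\partial E}{\partial w_{jk}}$$

where $t$ are the targets, $y$ the sigmoid outputs, $x$ the inputs, and $\eta$ the learning rate.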
If you want to go with the gradient descent optimizer, you will have to make a few changes to your code besides implementing the mathematics behind gradient descent. You will have to use the ReLU activation function (which I suggest you do anyway) instead of the sigmoid function; this helps make sure the vanishing gradient problem doesn't occur. Also, you should make your loss function the reduced mean of the cross-entropy. The optimizer will be much more effective this way.
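To make that concrete, here is a minimal sketch (not the code from the question) of plain full-batch gradient descent on the same single-layer setup, using a softmax output and the mean of the cross-entropy; it assumes X is the (N, 784) image matrix and label the (N, 10) one-hot label matrix from the question:

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)       # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

w = np.random.normal(0, 0.01, size=(28 * 28, 10))
b = np.zeros(10)

def gd_step(X, label, lr=0.5):
    """One full-batch gradient descent step; returns the mean cross-entropy."""
    global w, b
    y = softmax(np.matmul(X, w) + b)           # forward pass
    grad = (y - label) / X.shape[0]            # gradient of the mean cross-entropy w.r.t. the logits
    w -= lr * np.matmul(X.T, grad)             # weight update
    b -= lr * grad.sum(axis=0)                 # bias update
    return -np.mean(np.sum(label * np.log(y + 1e-12), axis=1))

The key difference from the loop in the question is that each update uses the gradient of the loss averaged over the whole batch rather than a single example.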
Hope this helps.

Related

training loss and validation loss both become 0.00000e+00 always

Can someone help me figure out why I'm getting a loss of 0.0000e+00?
I've seen that a few people have had the same problem, but I haven't been able to fix it by following the same advice.
The rows are shuffled and the labels are already transformed into float32. These are the suggestions I've found on similar questions. Can you tell me what I'm doing wrong?
This problem is a classification of images with more than one class.
This is how I create my model:
def createmodel():
    pretrained = InceptionV3(input_shape=(150, 150, 3),
                             include_top=False,
                             weights='imagenet')
    for layer in pretrained.layers:
        layer.trainable = False
    x = layers.Flatten()(pretrained.output)
    x = layers.Dense(1024, activation='relu')(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(1, activation="softmax")(x)
    model = Model(pretrained.input, x)
    model.compile(optimizer=Adam(0.001),
                  loss='categorical_crossentropy',
                  )
    return model
Epoch 1/2
10/10 [==============================] - 3s 322ms/step - loss: 0.0000e+00 - val_loss: 0.0000e+00
Epoch 2/2
10/10 [==============================] - 5s 464ms/step - loss: 0.0000e+00 - val_loss: 0.0000e+00
There is an issue with the final layer. The size should be equal to the number of classes as opposed to 1, i.e.:
x = layers.Dense(num_classes, activation="softmax")(x)
assuming num_classes is the number of distinct classes in your data.
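Putting that together, a minimal sketch of the corrected head (assuming num_classes holds that count and the labels are one-hot encoded to match categorical_crossentropy; the rest of createmodel stays as in the question):

x = layers.Flatten()(pretrained.output)
x = layers.Dense(1024, activation='relu')(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(num_classes, activation='softmax')(x)   # one output unit per class
model = Model(pretrained.input, x)
model.compile(optimizer=Adam(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

If the labels are integer-encoded instead of one-hot, sparse_categorical_crossentropy would be the matching loss.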

BERT HuggingFace gives NaN Loss

I'm trying to fine-tune BERT for a text classification task, but I'm getting NaN losses and can't figure out why.
First I define a BERT-tokenizer and then tokenize my text:
from transformers import DistilBertTokenizer, RobertaTokenizer
distil_bert = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(distil_bert, do_lower_case=True, add_special_tokens=True,
                                                max_length=128, pad_to_max_length=True)

def tokenize(sentences, tokenizer):
    input_ids, input_masks, input_segments = [], [], []
    for sentence in tqdm(sentences):
        inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=25, pad_to_max_length=True,
                                       return_attention_mask=True, return_token_type_ids=True)
        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])
    return np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')
train = pd.read_csv('train_dataset.csv')
d = train['text']
input_ids, input_masks, input_segments = tokenize(d, tokenizer)
Next, I load my integer labels which are: 0, 1, 2, 3.
d_y = train['label']
0 0
1 1
2 0
3 2
4 0
5 0
6 0
7 0
8 3
9 1
Name: label, dtype: int64
Then I load the pretrained Transformer model and put layers on top of it. I use SparseCategoricalCrossEntropy Loss when compiling the model:
from transformers import TFDistilBertForSequenceClassification, DistilBertConfig, AutoTokenizer, TFDistilBertModel
distil_bert = 'distilbert-base-uncased'
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.0000001)
config = DistilBertConfig(num_labels=4, dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
transformer_model = TFDistilBertModel.from_pretrained(distil_bert, config = config)
input_ids_in = tf.keras.layers.Input(shape=(25,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(25,), name='masked_token', dtype='int32')
embedding_layer = transformer_model(input_ids_in, attention_mask=input_masks_in)[0]
X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
X = tf.keras.layers.GlobalMaxPool1D()(X)
X = tf.keras.layers.Dense(50, activation='relu')(X)
X = tf.keras.layers.Dropout(0.2)(X)
X = tf.keras.layers.Dense(4, activation='softmax')(X)
model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs = X)
for layer in model.layers[:3]:
    layer.trainable = False

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])
Finally, I run the model using the previously tokenized input_ids and input_masks as inputs and get a NaN loss after the first epoch:
model.fit(x=[input_ids, input_masks], y = d_y, epochs=3)
Epoch 1/3
20/20 [==============================] - 4s 182ms/step - loss: 0.9714 - sparse_categorical_accuracy: 0.6153
Epoch 2/3
20/20 [==============================] - 0s 19ms/step - loss: nan - sparse_categorical_accuracy: 0.5714
Epoch 3/3
20/20 [==============================] - 0s 20ms/step - loss: nan - sparse_categorical_accuracy: 0.5714
<tensorflow.python.keras.callbacks.History at 0x7fee0e220f60>
EDIT: The model computes losses on the first epoch, but it starts returning NaNs
at the second epoch. What could be causing that problem?
Does anyone have any ideas about what I am doing wrong?
All suggestions are welcome!
The problem is here:
X = tf.keras.layers.Dense(1, activation='softmax')(X)
At the end of the network, you only have a single neuron, corresponding to a single class. The output probability is always 100% for class 0. If you have classes 0, 1, 2, 3, you need to have 4 outputs at the end.
The problem would have occurred because num_labels was not specified. At the final output layer, K = 1 (the number of labels) by default, and the softmax is

$$\sigma(\vec{z})_{i}=\frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}$$

so when fine-tuning for multi-class classification we need to provide num_labels.
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5)
I'd also suggest removing NA values from the pandas data frame before using the dataset for training and evaluation.
train = pd.read_csv('train_dataset.csv')
d = train['text']
d = d.dropna()
I had a similar problem where my model produced NaN losses only during the last batch of an epoch. All the other batches resulted in typical loss values. In my case, the problem was that the size of the batches was not always equal. Thus, the model produced NaN losses. After I made all batches equally sized, the NaN's were gone. It might be also worth investigating if this is also true in your case.
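If the uneven last batch turns out to be the culprit here as well, one way to rule it out (a sketch, assuming input_ids, input_masks and d_y are the arrays and labels built above) is to feed the model through a tf.data pipeline that drops the remainder batch:

dataset = tf.data.Dataset.from_tensor_slices(((input_ids, input_masks), d_y.values))
dataset = dataset.shuffle(1024).batch(32, drop_remainder=True)   # discard the smaller final batch
model.fit(dataset, epochs=3)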

Constant Training Loss and Validation Loss

I am running an RNN model with the PyTorch library to do sentiment analysis on movie reviews, but somehow the training loss and validation loss remain constant throughout training. I have looked at different online sources but am still stuck.
Can someone please take a look at my code and help?
Some parameters are specified by the assignment:
embedding_dim = 64
n_layers = 1
n_hidden = 128
dropout = 0.5
batch_size = 32
My main code
txt_field = data.Field(tokenize=word_tokenize, lower=True, include_lengths=True, batch_first=True)
label_field = data.Field(sequential=False, use_vocab=False, batch_first=True)
train = data.TabularDataset(path=part2_filepath+"train_Copy.csv", format='csv',
                            fields=[('label', label_field), ('text', txt_field)], skip_header=True)
validation = data.TabularDataset(path=part2_filepath+"validation_Copy.csv", format='csv',
                                 fields=[('label', label_field), ('text', txt_field)], skip_header=True)
txt_field.build_vocab(train, min_freq=5)
label_field.build_vocab(train, min_freq=2)
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, validation, test),
    batch_size=32,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
    device=device)
n_vocab = len(txt_field.vocab)
embedding_dim = 64
n_hidden = 128
n_layers = 1
dropout = 0.5
model = Text_RNN(n_vocab, embedding_dim, n_hidden, n_layers, dropout)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = torch.nn.BCELoss().to(device)
N_EPOCHS = 15
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    train_loss, train_acc = RNN_train(model, train_iter, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iter, criterion)
My Model
class Text_RNN(nn.Module):
    def __init__(self, n_vocab, embedding_dim, n_hidden, n_layers, dropout):
        super(Text_RNN, self).__init__()
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.emb = nn.Embedding(n_vocab, embedding_dim)
        self.rnn = nn.RNN(
            input_size=embedding_dim,
            hidden_size=n_hidden,
            num_layers=n_layers,
            dropout=dropout,
            batch_first=True
        )
        self.sigmoid = nn.Sigmoid()
        self.linear = nn.Linear(n_hidden, 2)

    def forward(self, sent, sent_len):
        sent_emb = self.emb(sent)
        outputs, hidden = self.rnn(sent_emb)
        prob = self.sigmoid(self.linear(hidden.squeeze(0)))
        return prob
The training function
def RNN_train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        text, text_lengths = batch.text
        predictions = model(text, text_lengths)
        batch.label = batch.label.type(torch.FloatTensor).squeeze()
        predictions = torch.max(predictions.data, 1).indices.type(torch.FloatTensor)
        loss = criterion(predictions, batch.label)
        loss.requires_grad = True
        acc = binary_accuracy(predictions, batch.label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
The output when I run on 10 test reviews + 5 validation reviews:
Epoch [1/15]: Train Loss: 15.351 | Train Acc: 44.44% Val. Loss: 11.052 | Val. Acc: 60.00%
Epoch [2/15]: Train Loss: 15.351 | Train Acc: 44.44% Val. Loss: 11.052 | Val. Acc: 60.00%
Epoch [3/15]: Train Loss: 15.351 | Train Acc: 44.44% Val. Loss: 11.052 | Val. Acc: 60.00%
Epoch [4/15]: Train Loss: 15.351 | Train Acc: 44.44% Val. Loss: 11.052 | Val. Acc: 60.00%
...
I'd appreciate it if someone could point me in the right direction. I believe it is something in the training code, since for the most part I followed this article:
https://www.analyticsvidhya.com/blog/2020/01/first-text-classification-in-pytorch/
In your training loop you are using the indices from the max operation, which is not differentiable, so you cannot track gradients through it. Because it is not differentiable, everything afterwards does not track the gradients either. Calling
loss.backward() would fail.
# The indices of the max operation are not differentiable
predictions = torch.max(predictions.data, 1).indices.type(torch.FloatTensor)
loss = criterion(predictions, batch.label)
# Setting requires_grad to True to make .backward() work, although incorrectly.
loss.requires_grad = True
Presumably you wanted to fix that by setting requires_grad, but that does not do what you expect, because no gradients are propagated to your model, since the only thing in your computational graph would be the loss itself, and there is nowhere to go from there.
You used the indices to get either 0 or 1, since the output of your model is essentially two classes, and you wanted the one with the higher probability. For the Binary Cross Entropy loss, you only need one class that has a value between 0 and 1 (continuous), which you get by applying the sigmoid function.
So you need to change the output channels of the final linear layer to 1:
self.linear = nn.Linear(n_hidden, 1)
and in your training loop you can remove the torch.max call and also the requires_grad.
# Squeeze the model's output to get rid of the single class dimension
predictions = model(text, text_lengths).squeeze()
batch.label = batch.label.type(torch.FloatTensor).squeeze()
loss = criterion(predictions, batch.label)
acc = binary_accuracy(predictions, batch.label)
optimizer.zero_grad()
loss.backward()
Since you have only 1 class at the end, an actual prediction would be either 0 or 1 (nothing in between). To achieve that you can simply use 0.5 as the threshold: everything below is considered a 0 and everything above is considered a 1. If you are using the binary_accuracy function from the article you were following, that is done automatically for you; they do it by rounding with torch.round.
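For reference, a minimal sketch of such an accuracy helper (the binary_accuracy in the linked article works along these lines), assuming preds are sigmoid outputs in [0, 1]:

import torch

def binary_accuracy(preds, y):
    rounded = torch.round(preds)            # round sigmoid outputs to the nearest of 0 and 1
    correct = (rounded == y).float()
    return correct.sum() / len(correct)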

Keras Embedding Layer: keep zero-padded values as zeros

I've been thinking about 0-padding of word sequences and how that 0-padding is then converted by the Embedding layer. At first glance, one would think that you want to keep the embeddings = 0.0 as well. However, the Embedding layer in Keras generates random values for any input token, and there is no way to force it to generate 0.0's. Note that mask_zero does something different; I've already checked.
One might ask, why worry about this? The code seems to work even when the embeddings are not 0.0's, as long as they are the same. So I came up with an example, albeit somewhat contrived, where setting the embeddings to 0.0 for the 0-padded token makes a difference.
I used the 20 Newsgroups data set (from sklearn.datasets import fetch_20newsgroups). I do some minimal preprocessing: removal of punctuation, stopwords, and numbers. I use keras.preprocessing.sequence.pad_sequences for 0-padding. I split the ~18K posts into training and validation sets with a training/validation ratio of 4/1.
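(As a quick illustration of what that 0-padding looks like, with made-up token ids:)

from keras.preprocessing.sequence import pad_sequences

pad_sequences([[5, 3], [7]], maxlen=4)
# array([[0, 0, 5, 3],
#        [0, 0, 0, 7]], dtype=int32)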
I create a simple network with one dense hidden layer, the input being the flattened sequence of embeddings:
EMBEDDING_DIM = 300
MAX_SEQUENCE_LENGTH = 1100
layer_size = 25
dropout = 0.3
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH, name = 'embedding_dnn')
embedded_sequences = embedding_layer(sequence_input)
x = Flatten(name='flatten_dnn')(embedded_sequences)
x = Dense(layer_size, activation='relu', name ='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name = 'output_dnn')(x)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
The model has about 14M trainable parameters (this example is a bit contrived, as I've already mentioned).
When I train it
earlystop = EarlyStopping(monitor='val_loss', patience=5)
history = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=30, batch_size=BATCH_SIZE, callbacks=[earlystop])
it looks like for 4 epochs the algorithm is struggling to find its way out of the 'randomness':
Train on 15048 samples, validate on 3798 samples
Epoch 1/30
15048/15048 [==============================] - 58s 4ms/step - loss: 3.1118 - acc: 0.0519 - val_loss: 2.9894 - val_acc: 0.0534
Epoch 2/30
15048/15048 [==============================] - 56s 4ms/step - loss: 2.9820 - acc: 0.0556 - val_loss: 2.9827 - val_acc: 0.0527
Epoch 3/30
15048/15048 [==============================] - 55s 4ms/step - loss: 2.9712 - acc: 0.0626 - val_loss: 2.9718 - val_acc: 0.0579
Epoch 4/30
15048/15048 [==============================] - 55s 4ms/step - loss: 2.9259 - acc: 0.0756 - val_loss: 2.8363 - val_acc: 0.0874
Epoch 5/30
15048/15048 [==============================] - 56s 4ms/step - loss: 2.7092 - acc: 0.1390 - val_loss: 2.3251 - val_acc: 0.2796
...
Epoch 13/30
15048/15048 [==============================] - 56s 4ms/step - loss: 0.0698 - acc: 0.9807 - val_loss: 0.5010 - val_acc: 0.8736
It ends up with the accuracy of ~0.87
print ('Best validation accuracy is ', max(history.history['val_acc']))
Best validation accuracy is 0.874934175379845
However, when I explicitly set the embeddings for the padded 0's to 0.0
def myMask(x):
    mask = K.greater(x, 0)  # will return boolean values
    mask = K.cast(mask, dtype=K.floatx())
    return mask
layer_size = 25
dropout = 0.3
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH, name = 'embedding_dnn')
embedded_sequences = embedding_layer(sequence_input)
y = Lambda(myMask, output_shape=(MAX_SEQUENCE_LENGTH,))(sequence_input)
y = Reshape(target_shape=(MAX_SEQUENCE_LENGTH,1))(y)
merge_layer = Multiply(name = 'masked_embedding_dnn')([embedded_sequences,y])
x = Flatten(name='flatten_dnn')(merge_layer)
x = Dense(layer_size, activation='relu', name ='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name = 'output_dnn')(x)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
the model with the same number of parameters immediately finds its way out of the 'randomness':
Train on 15048 samples, validate on 3798 samples
Epoch 1/30
15048/15048 [==============================] - 64s 4ms/step - loss: 2.4356 - acc: 0.3060 - val_loss: 1.2424 - val_acc: 0.7754
Epoch 2/30
15048/15048 [==============================] - 61s 4ms/step - loss: 0.6973 - acc: 0.8267 - val_loss: 0.5240 - val_acc: 0.8797
...
Epoch 10/30
15048/15048 [==============================] - 61s 4ms/step - loss: 0.0496 - acc: 0.9881 - val_loss: 0.4176 - val_acc: 0.8944
and ends up with a better accuracy of ~0.9.
Again, this is a somewhat contrived example, but still it shows that keeping those 'padded' embeddings at 0.0 can be beneficial.
Am I missing something here? And if I'm not missing anything, then, what is the reason Keras doesn't provide this functionality out-of-the-box?
UPDATE
@Daniel Möller, I tried your suggestion:
layer_size = 25
dropout = 0.3
init = RandomUniform(minval=0.0001, maxval=0.05, seed=None)
constr = NonNeg()
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            name='embedding_dnn',
                            embeddings_initializer=init,
                            embeddings_constraint=constr)
embedded_sequences = embedding_layer(sequence_input)
y = Lambda(myMask, output_shape=(MAX_SEQUENCE_LENGTH,))(sequence_input)
y = Reshape(target_shape=(MAX_SEQUENCE_LENGTH,1))(y)
merge_layer = Multiply(name = 'masked_embedding_dnn')([embedded_sequences,y])
x = Flatten(name='flatten_dnn')(merge_layer)
x = Dense(layer_size, activation='relu', name ='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name = 'output_dnn')(x)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
Unfortunately, the network was stuck in the 'randomness':
Train on 15197 samples, validate on 3649 samples
Epoch 1/30
15197/15197 [==============================] - 60s 4ms/step - loss: 3.1354 - acc: 0.0505 - val_loss: 2.9943 - val_acc: 0.0496
....
Epoch 24/30
15197/15197 [==============================] - 60s 4ms/step - loss: 2.9905 - acc: 0.0538 - val_loss: 2.9907 - val_acc: 0.0496
I also tried without the NonNeg() constraint, the same result.
Well, you're eliminating the computation of the gradients of the weights related to the padded steps.
If you have too many padded steps, then the embedding weights regarding the padding value will participate in a lot of calculations and will significantly compete with the other weights. But training these weights is a waste of computation and will certainly interfere in other words.
Consider also that, for instance, some of the weights for the padding token might end up with values between those of meaningful words. So increasing such a weight might make the padding look similar to another word when it's not, and likewise for decreasing it.
These extra calculations, extra contributions to loss and gradient calculations, etc. will create more computational need and more obstacles. It's like having a lot of garbage in the middle of the data.
Notice also that these zeros are going directly to the dense layer, which will also eliminate the gradients for a lot of the dense weights. This might overfit longer sequences though if they are few compared to shorter sequences.
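As a tiny standalone illustration of what the Multiply mask in the question does to the padded positions (made-up shapes, not the actual model):

import numpy as np

ids = np.array([[5, 3, 0, 0]])             # one sequence, 0-padded to length 4
emb = np.random.rand(1, 4, 3)              # (batch, seq_len, embedding_dim) embeddings
mask = (ids > 0).astype('float32')         # 1.0 for real tokens, 0.0 for padding
masked = emb * mask[..., None]             # embedding rows at padded positions become exactly 0.0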
Out of curiosity, what will happen if you do this?
from keras.initializers import RandomUniform
from keras.constraints import NonNeg
init = RandomUniform(minval=0.0001, maxval=0.05, seed=None)
constr = NonNeg()
......
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            name='embedding_dnn',
                            embeddings_initializer=init,
                            embeddings_constraint=constr)
..........

Bad performance while training an LSTM for text classification on the Amazon Fine Food Reviews dataset?

I am trying to train an LSTM model for text classification on the Amazon Fine Food Reviews problem. I am using the same dataset as provided by Kaggle and a tokenizer to convert the text data into tokens, but while training I get the same accuracy for all epochs, like this:
Epoch 1/5
55440/55440 [==============================] - 161s 3ms/step - loss: 2.3666 - acc: 0.8516 - val_loss: 2.3741 - val_acc: 0.8511
Epoch 2/5
55440/55440 [==============================] - 159s 3ms/step - loss: 2.3666 - acc: 0.8516 - val_loss: 2.3741 - val_acc: 0.8511
Epoch 3/5
55440/55440 [==============================] - 160s 3ms/step - loss: 2.3666 - acc: 0.8516 - val_loss: 2.3741 - val_acc: 0.8511
Epoch 4/5
55440/55440 [==============================] - 160s 3ms/step - loss: 2.3666 - acc: 0.8516 - val_loss: 2.3741 - val_acc: 0.8511
Moreover, when I plot my confusion matrix, none of the negative classes are predicted; only positive classes are predicted.
I think I am mostly going wrong when converting the labels, i.e. 'Positive' and 'Negative', to a numerical representation for classification.
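(For reference, sklearn's LabelEncoder assigns integers in alphabetical order of the class names, so on these two labels that conversion step on its own behaves like this small sketch, separate from the full code below:)

from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()
encoder.fit_transform(['Positive', 'Negative', 'Positive'])
# array([1, 0, 1])   # 'Negative' -> 0, 'Positive' -> 1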
Please see my code for more details.
I have tried increasing the number of LSTM units and epochs, and also tried increasing the length of the sequences, but none of it worked. Please note that I have done all the required pre-processing on the reviews.
# train/test split; x is a dataframe consisting of the Amazon Fine Food Reviews dataset
x = df
x1 = pd.DataFrame(x)
y = df['Score']
x1.head()
import math
train_pct_index = int(0.70 * len(df)) #train data size = 70%
X_train, X_test = x1[:train_pct_index], x1[train_pct_index:]
y_train, y_test = y[:train_pct_index], y[train_pct_index:]
#y_test.value_counts()
x1_df = pd.DataFrame(X_train)
x2_df = pd.DataFrame(X_test)
from sklearn import preprocessing
encoder = preprocessing.LabelEncoder()
y_train=encoder.fit_transform(y_train)
y_test=encoder.fit_transform(y_test)
# tokenizing reviews
tokenizer = Tokenizer(num_words = 5000 )
tokenizer.fit_on_texts(x1_df['CleanedText'])
sequences = tokenizer.texts_to_sequences(x1_df['CleanedText'])
test_sequences = tokenizer.texts_to_sequences(x2_df['CleanedText'])
train_data = pad_sequences(sequences, maxlen=500)
test_data = pad_sequences(test_sequences, maxlen=500)
nb_words = (np.max(train_data) + 1)
# building lstm model and compiling it
from keras.layers.recurrent import LSTM, GRU
model = Sequential()
model.add(Embedding(nb_words,50,input_length=500))
model.add(LSTM(20))
model.add(Dropout(0.5))
model.add(Dense(1, activation='softmax'))
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics = ['accuracy'])
I would like my LSTM model to generalise well and also predict negative reviews, which are the minority class in this case.
