RNN classification constantly outputs the same value - machine-learning

I was using an RNN for a classification job and succeeded on one task. But when I used a similar model on another task, something strange happened.
Here is some of the output. In each step below, the first vector is the prediction and the second is the target.
Step 147, learning rate is 0.050000000000000, cost is 0.333333
[[ 1.00000000e+00 1.94520349e-16 5.00660735e-10 8.93992450e-11
6.57709234e-11 2.75211902e-11]]
[[ 0. 0. 0. 0. 0. 1.]]
Step 148, learning rate is 0.050000000000000, cost is 0.333333
[[ 1.00000000e+00 2.51522596e-16 6.98772706e-10 1.32924283e-10
2.06628145e-10 1.63214553e-10]]
[[ 0. 0. 0. 1. 0. 0.]]
Step 149, learning rate is 0.050000000000000, cost is 1.07511e-18
[[ 1.00000000e+00 6.98618693e-16 2.44663956e-09 2.75078210e-10
4.09978718e-10 4.69938033e-10]]
[[ 1. 0. 0. 0. 0. 0.]]
It seems all the outputs converge to the same value. In other words, for every input the model produces the same prediction, regardless of the cost.
To provide more information, this is my model structure:
import tensorflow as tf

class SequenceClassification:

    def __init__(self, data, target, dropout, learning_rate, num_hidden=2500, num_layers=2):
        self.data = data
        self.target = target
        self.dropout = dropout
        self.learning_rate = learning_rate
        self._num_hidden = num_hidden
        self._num_layers = num_layers
        self.prediction
        self.precision
        self.optimize

    @lazy_property  # memoizing property decorator, defined elsewhere
    def prediction(self):
        # Recurrent network.
        network = tf.nn.rnn_cell.BasicLSTMCell(self._num_hidden)
        network = tf.nn.rnn_cell.DropoutWrapper(network, output_keep_prob=self.dropout)
        network = tf.nn.rnn_cell.MultiRNNCell([network] * self._num_layers)
        output, _ = tf.nn.dynamic_rnn(network, self.data, dtype=tf.float32)
        # Select last output.
        output = tf.transpose(output, [1, 0, 2])
        print(output.get_shape())
        last = tf.gather(output, int(output.get_shape()[0]) - 1)
        # Softmax layer.
        weight, bias = self._weight_and_bias(
            self._num_hidden, int(self.target.get_shape()[1]))
        prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
        return prediction

    @lazy_property
    def cost(self):
        # cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction + 1e-10))
        # loss = cross_entropy
        loss = tf.reduce_mean(tf.square(self.target - self.prediction))
        return loss

    @lazy_property
    def optimize(self):
        optimizer = tf.train.RMSPropOptimizer(self.learning_rate)
        return optimizer.minimize(self.cost), self.cost, self.prediction

    @lazy_property
    def precision(self):
        correct = tf.equal(
            tf.argmax(self.target, 1), tf.argmax(self.prediction, 1))
        return tf.reduce_mean(tf.cast(correct, tf.float32))

    @staticmethod
    def _weight_and_bias(in_size, out_size):
        weight = tf.get_variable("W", shape=[in_size, out_size],
                                 initializer=tf.contrib.layers.xavier_initializer())
        bias = tf.get_variable("B", shape=[out_size],
                               initializer=tf.contrib.layers.xavier_initializer())
        return weight, bias
The input has shape [datanum, maxstep, vectorsize], and I use zeros to pad the sequences to the same length.
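For concreteness, the padding is done roughly like this (a minimal NumPy sketch with hypothetical names, not the exact code):
import numpy as np

def pad_sequences(sequences, max_step, vector_size):
    # Zero-pad a list of [steps, vector_size] arrays to a common [max_step, vector_size].
    batch = np.zeros((len(sequences), max_step, vector_size), dtype=np.float32)
    for i, seq in enumerate(sequences):
        steps = min(len(seq), max_step)
        batch[i, :steps, :] = seq[:steps]
    return batch  # shape: [datanum, maxstep, vectorsize]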
I can't understand what is happening, because the same model works well on the former task.
Additionally, this classification task works well when I use DL4J. This is the model:
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT).iterations(1)
.updater(Updater.RMSPROP)
.regularization(true).l2(1e-5)
.weightInit(WeightInit.XAVIER)
.gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue).gradientNormalizationThreshold(1.0)
.learningRate(0.08)
.dropOut(0.5)
.list(2)
.layer(0, new GravesBidirectionalLSTM.Builder().nIn(vectorSize).nOut(1800)
.activation("tanh").build())
.layer(1, new RnnOutputLayer.Builder().activation("softmax")
.lossFunction(LossFunctions.LossFunction.MCXENT).nIn(1800).nOut(6).build())
.pretrain(false).backprop(true).build();
MultiLayerNetwork net = new MultiLayerNetwork(conf);
net.init();
Any advice is appreciated.

This problem may be due to the "batch normalization". When you are evaluating your model, you should disable batch normalization.
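For reference, here is a minimal TF1-style sketch of what that means in practice. It assumes a hypothetical boolean is_training placeholder and a model that actually contains a batch-normalization layer (the question's model does not, so this only applies if you add one):
import tensorflow as tf

# Hypothetical graph: a boolean placeholder switches between train and eval behaviour.
is_training = tf.placeholder(tf.bool, name="is_training")
x = tf.placeholder(tf.float32, [None, 128], name="x")
labels = tf.placeholder(tf.float32, [None, 6], name="labels")

h = tf.layers.dense(x, 256, activation=tf.nn.relu)
# training=True -> normalize with batch statistics; training=False -> use the moving averages.
h = tf.layers.batch_normalization(h, training=is_training)
logits = tf.layers.dense(h, 6)
loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)

# The moving averages are refreshed by ops in UPDATE_OPS; without this
# control dependency they never get updated during training.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.RMSPropOptimizer(0.05).minimize(loss)

# Training step:   sess.run(train_op, {x: xb, labels: yb, is_training: True})
# Evaluation step: sess.run(logits,   {x: xb, is_training: False})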

Related

Why is a simple embedding+linear layer outperforming this LSTM classifier?

I'm running into a roadblock in my learning about NLP. I'm working on a beginner's Kaggle competition classifying tweets as "disaster" or "not disaster". I started out by repurposing a simple network from a PyTorch tutorial, composed of nn.EmbeddingBag and nn.Linear layers, and saw decent results during both training and inference:
self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
self.fc = nn.Linear(embed_dim, num_class)
The loss function is BCEWithLogits, by the way.
I decided to up my game and throw an LSTM into the mix. I took a deep dive into padded/packed sequences and think I understand them pretty well. After reading around and thinking about it, I came to the conclusion that I should grab the final non-padded hidden state of each sequence's output from the LSTM. That's what I tried below:
My attempt at upping my game:
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, hidden_size, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, num_class)

    def forward(self, padded_seq, lengths):
        # embedding layer
        embedded_padded = self.embedding(padded_seq)
        packed_output = pack_padded_sequence(embedded_padded, lengths, batch_first=True)
        # lstm layer
        output, _ = self.lstm(packed_output)
        padded_output, lengths = pad_packed_sequence(output, batch_first=True)
        # get hidden state of final non-padded sequence element:
        h_n = []
        for seq, length in zip(padded_output, lengths):
            h_n.append(seq[length - 1, :])
        lstm_out = torch.stack(h_n)
        # linear layer
        out = self.fc1(lstm_out)
        return out
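(As an aside, not the code I actually ran: for a packed sequence, the second return value of nn.LSTM should already hold each sequence's hidden state at its last non-padded step, so an equivalent, simpler forward for a single-layer, unidirectional LSTM would look like this.)
def forward(self, padded_seq, lengths):
    embedded_padded = self.embedding(padded_seq)
    packed = pack_padded_sequence(embedded_padded, lengths, batch_first=True)
    # h_n has shape [num_layers * num_directions, batch, hidden_size]; take the last layer.
    _, (h_n, _) = self.lstm(packed)
    return self.fc1(h_n[-1])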
This morning, I ported my notebook over to an IDE and ran the debugger and confirmed that h_n is indeed the final hidden state of each sequence, not including padding.
So everything runs/trains without error but my loss never decreases when I use batch size > 1.
With batch_size = 8: [training-loss plot omitted]
With batch_size = 1: [training-loss plot omitted]
My Question
I would have expected this LSTM setup to perform much better on this simple task. So I'm wondering "Where have I gone wrong?"
Additional Information: Training Code
def train_one_epoch(model, opt, criterion, lr, trainloader):
    model.to(device)
    model.train()
    running_tl = 0
    for (label, data, lengths) in trainloader:
        opt.zero_grad()
        label = label.reshape(label.size()[0], 1)
        output = model(data, lengths)
        loss = criterion(output, label)
        running_tl += loss.item()
        loss.backward()
        opt.step()
    return running_tl

def validate_one_epoch(model, opt, criterion, lr, validloader):
    running_vl = 0
    model.eval()
    with torch.no_grad():
        for (label, data, lengths) in validloader:
            label = label.reshape(label.shape[0], 1)
            output = model(data, lengths)
            loss = criterion(output, label)
            running_vl += loss.item()
    return running_vl

def train_model(model, opt, criterion, epochs, trainload, testload=None, lr=1e-3):
    avg_tl_per_epoch = []
    avg_vl_per_epoch = []
    for e in trange(epochs):
        running_tl = train_one_epoch(model, opt, criterion, lr, trainload)
        avg_tl_per_epoch.append(running_tl / len(trainload))
        if testload:
            running_vl = validate_one_epoch(model, opt, criterion, lr, validloader)
            avg_vl_per_epoch.append(running_vl / len(testload))
    return avg_tl_per_epoch, avg_vl_per_epoch
I think your model should look like this:
class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, hidden_size, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, num_class)

    def forward(self, padded_seq, lengths):
        # embedding layer
        embedded_padded = self.embedding(padded_seq)
        packed_output = pack_padded_sequence(embedded_padded, lengths, batch_first=True)
        # lstm layer
        output, _ = self.lstm(packed_output)
        out = self.fc1(output)
        return out
This is because, by default, the LSTM will just output the last hidden state as its output when provided with a sequence.
Also, depending on the number of examples, the simple embedding + linear model might work better, since it needs less data to converge. Your data being tweets (very short texts), the sequential aspect of the text might not be so important.
You have not provided the code for preprocessing your data. With text, good preprocessing is crucial, and I recommend you take a look at the PyTorch tutorial called NLP FROM SCRATCH: TRANSLATION WITH A SEQUENCE TO SEQUENCE NETWORK AND ATTENTION.

PyTorch: CrossEntropyLoss, changing class weight does not change the computed loss

According to the docs for cross entropy loss, the weighted loss is calculated by multiplying the weight for each class by the original loss.
However, in the PyTorch implementation, the class weight seems to have no effect on the final loss value unless it is set to zero. Following is the code:
from torch import nn
import torch
logits = torch.FloatTensor([
[0.1, 0.9],
])
label = torch.LongTensor([0])
criterion = nn.CrossEntropyLoss(weight=torch.FloatTensor([1, 1]))
loss = criterion(logits, label)
print(loss.item()) # result: 1.1711
# Change class weight for the first class to 0.1
criterion = nn.CrossEntropyLoss(weight=torch.FloatTensor([0.1, 1]))
loss = criterion(logits, label)
print(loss.item()) # result: 1.1711, should be 0.11711
# Change weight for first class to 0
criterion = nn.CrossEntropyLoss(weight=torch.FloatTensor([0, 1]))
loss = criterion(logits, label)
print(loss.item()) # result: 0
As illustrated in the code, the class weight seems to have no effect unless it is set to 0, this behavior contradicts to the documentation.
Updates
I implemented a version of weighted cross entropy which is, in my eyes, the "correct" way to do it.
import torch
from torch import nn
def weighted_cross_entropy(logits, label, weight=None):
    assert len(logits.size()) == 2
    batch_size, label_num = logits.size()
    assert (batch_size == label.size(0))
    if weight is None:
        weight = torch.ones(label_num).float()
    assert (label_num == weight.size(0))
    x_terms = -torch.gather(logits, 1, label.unsqueeze(1)).squeeze()
    log_terms = torch.log(torch.sum(torch.exp(logits), dim=1))
    weights = torch.gather(weight, 0, label).float()
    return torch.mean((x_terms + log_terms) * weights)
logits = torch.FloatTensor([
[0.1, 0.9],
[0.0, 0.1],
])
label = torch.LongTensor([0, 1])
neg_weight = 0.1
weight = torch.FloatTensor([neg_weight, 1])
criterion = nn.CrossEntropyLoss(weight=weight)
loss = criterion(logits, label)
print(loss.item()) # results: 0.69227
print(weighted_cross_entropy(logits, label, weight).item()) # results: 0.38075
What I did was multiply each instance in the batch by its associated class weight. The result is still different from the original PyTorch implementation, which makes me wonder how PyTorch actually implements this.
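One hypothesis I could check (based only on my reading of the docs, not the PyTorch source): with reduction='mean' and a weight tensor, the weighted per-sample losses may be divided by the sum of the per-sample weights rather than by the batch size. A quick sketch to test that against nn.CrossEntropyLoss:
import torch
from torch import nn
import torch.nn.functional as F

logits = torch.FloatTensor([[0.1, 0.9],
                            [0.0, 0.1]])
label = torch.LongTensor([0, 1])
weight = torch.FloatTensor([0.1, 1.0])

per_sample = F.cross_entropy(logits, label, reduction='none')    # unweighted l_n
w = weight[label]                                                # w_{y_n} per instance
print(((per_sample * w).sum() / w.sum()).item())                 # hypothesis
print(nn.CrossEntropyLoss(weight=weight)(logits, label).item())  # reference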

TensorFlow: evaluating the test set multiple times gives different accuracy

I have trained a CNN model on MNIST, but when I check the model's accuracy with the test data after training, I find that the accuracy keeps improving. Here is the code.
BATCH_SIZE = 50
LR = 0.001 # learning rate
mnist = input_data.read_data_sets('./mnist', one_hot=True) # they has been normalized to range (0,1)
test_x = mnist.test.images[:2000]
test_y = mnist.test.labels[:2000]
def new_cnn(imageinput, inputshape):
    weights = tf.Variable(tf.truncated_normal(inputshape, stddev=0.1), name='weights')
    biases = tf.Variable(tf.constant(0.05, shape=[inputshape[3]]), name='biases')
    # Note: biases are created here but never added to the conv output.
    layer = tf.nn.conv2d(imageinput, weights, strides=[1, 1, 1, 1], padding='SAME')
    layer = tf.nn.relu(layer)
    return weights, layer
tf_x = tf.placeholder(tf.float32, [None, 28 * 28])
image = tf.reshape(tf_x, [-1, 28, 28, 1]) # (batch, height, width, channel)
tf_y = tf.placeholder(tf.int32, [None, 10]) # input y
# CNN
weights1, layer1 = new_cnn(image, [5, 5, 1, 32])
pool1 = tf.layers.max_pooling2d(
layer1,
pool_size=2,
strides=2,
) # -> (14, 14, 32)
weight2, layer2 = new_cnn(pool1, [5, 5, 32, 64]) # -> (14, 14, 64)
pool2 = tf.layers.max_pooling2d(layer2, 2, 2) # -> (7, 7, 64)
flat = tf.reshape(pool2, [-1, 7 * 7 * 64]) # -> (7*7*64, )
hide = tf.layers.dense(flat, 1024, name = 'hide') # hidden layer
output = tf.layers.dense(hide, 10, name = 'output')
loss = tf.losses.softmax_cross_entropy(onehot_labels=tf_y, logits=output) # compute cost
accuracy = tf.metrics.accuracy( labels=tf.argmax(tf_y, axis=1), predictions=tf.argmax(output, axis=1),)[1]
train_op = tf.train.AdamOptimizer(LR).minimize(loss)
sess = tf.Session()
init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer()) # the local var is for accuracy
sess.run(init_op) # initialize var in graph
saver = tf.train.Saver()
for step in range(101):
    b_x, b_y = mnist.train.next_batch(BATCH_SIZE)
    _, loss_ = sess.run([train_op, loss], {tf_x: b_x, tf_y: b_y})
    if step % 50 == 0:
        print(loss_)
        accuracy_, loss2 = sess.run([accuracy, loss], {tf_x: test_x, tf_y: test_y})
        print('Step:', step, '| test accuracy: %f' % accuracy_)
To simplify the problem, I only use 100 training iterations, and the final accuracy on the test set is approximately 0.655000.
But when I run the following code:
for i in range(5):
    accuracy2 = sess.run(accuracy, {tf_x: test_x, tf_y: test_y})
    print(sess.run(weight2[1, :, 0, 0]))  # to show that the model parameters won't update
    print(accuracy2)
The output is
[-0.06928255 -0.13498515 0.01266837 0.05656774 0.09438231]
0.725875
[-0.06928255 -0.13498515 0.01266837 0.05656774 0.09438231]
0.7684
[-0.06928255 -0.13498515 0.01266837 0.05656774 0.09438231]
0.79675
[-0.06928255 -0.13498515 0.01266837 0.05656774 0.09438231]
0.817
[-0.06928255 -0.13498515 0.01266837 0.05656774 0.09438231]
0.832187
This confuses me; can somebody tell me what's wrong?
Thanks for your patience!
tf.metrics.accuracy is not as trivial as you might think. Take a look at its documentation:
The accuracy function creates two local variables, total and
count that are used to compute the frequency with which
predictions matches labels. This frequency is ultimately
returned as accuracy: an idempotent operation that simply divides
total by count.
Internally, an is_correct operation computes a Tensor with
elements 1.0 where the corresponding elements of predictions and
labels match and 0.0 otherwise. Then update_op increments
total with the reduced sum of the product of weights and
is_correct, and it increments count with the reduced sum of
weights.
For estimation of the metric over a stream of data, the function
creates an update_op operation that updates these variables and
returns the accuracy.
...
Returns:
accuracy: A Tensor representing the accuracy, the value of total divided
by count.
update_op: An operation that increments the total and count variables
appropriately and whose value matches accuracy.
Note that it returns a tuple and you take the second item, namely update_op. Consecutive invocations of update_op are treated as a stream of data, which is not what you intend to do (because each evaluation during training will affect future evaluations). In fact, this running metric is pretty counter-intuitive.
The solution for you is to use a plain, simple accuracy calculation. Change this line to:
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(tf_y, axis=1), tf.argmax(output, axis=1)), tf.float32))
and you'll have a stable accuracy calculation.
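Alternatively, if you do want the streaming behaviour, you can keep tf.metrics.accuracy but reset its internal counters between evaluations. A minimal sketch, reusing the tensors and session from the question (tf_x, tf_y, output, sess, test_x, test_y are assumed to exist):
# Keep both return values: the idempotent read and the update op.
accuracy_value, accuracy_update = tf.metrics.accuracy(
    labels=tf.argmax(tf_y, axis=1), predictions=tf.argmax(output, axis=1))

# Before each fresh evaluation, reset the metric's local variables (total, count),
# accumulate over the evaluation data, then read the final value.
sess.run(tf.local_variables_initializer())
sess.run(accuracy_update, {tf_x: test_x, tf_y: test_y})
print('test accuracy: %f' % sess.run(accuracy_value))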

Tensorflow cross-entropy NaN, and changing learning rate doesn't seem to have an impact

TL;DR
Trying to build a bidirectional RNN for sequence tagging using tensorflow.
The goal is to take inputs "I like New York" and produce outputs "O O LOC_START LOC"
The graph compiles and runs, but the loss becomes NaN after 1 or 2 batches. I understand this could be a problem with the learning rate, but changing the learning rate seems to have no impact. Using AdamOptimizer at the moment.
Any help would be appreciated.
Here is my code:
Code:
# The input and output: a sequence of words, embedded, and a sequence of word classifications, one-hot
self.input_x = tf.placeholder(tf.float32, [None, n_sequence_length, n_embedding_dim], name="input_x")
self.input_y = tf.placeholder(tf.float32, [None, n_sequence_length, n_output_classes], name="input_y")
# New shape: [sequence_length, batch_size (None), embedding_dim]
inputs = tf.transpose(self.input_x, [1, 0, 2])
# New shape: [sequence_length * batch_size (None), embedding_dim]
inputs = tf.reshape(inputs, [-1, n_embedding_dim])
# Define weights
w_hidden = tf.Variable(tf.random_normal([n_embedding_dim, 2 * n_hidden_states]))
b_hidden = tf.Variable(tf.random_normal([2 * n_hidden_states]))
w_out = tf.Variable(tf.random_normal([2 * n_hidden_states, n_output_classes]))
b_out = tf.Variable(tf.random_normal([n_output_classes]))
# Linear activation for the input; this will make it fit to the hidden size
inputs = tf.nn.xw_plus_b(inputs, w_hidden, b_hidden)
# Split up the batches into a Python list
inputs = tf.split(0, n_sequence_length, inputs)
# Now we define our cell. It takes one word as input, a vector of embedding_size length
cell_forward = rnn_cell.BasicLSTMCell(n_hidden_states, forget_bias=0.0)
cell_backward = rnn_cell.BasicLSTMCell(n_hidden_states, forget_bias=0.0)
# And we add a Dropout Wrapper as appropriate
if is_training and prob_keep < 1:
    cell_forward = rnn_cell.DropoutWrapper(cell_forward, output_keep_prob=prob_keep)
    cell_backward = rnn_cell.DropoutWrapper(cell_backward, output_keep_prob=prob_keep)
# And we make it a few layers deep
cell_forward_multi = rnn_cell.MultiRNNCell([cell_forward] * n_layers)
cell_backward_multi = rnn_cell.MultiRNNCell([cell_backward] * n_layers)
# returns outputs = a list T of tensors [batch, 2*hidden]
outputs = rnn.bidirectional_rnn(cell_forward_multi, cell_backward_multi, inputs, dtype=dtypes.float32)
# [sequence, batch, 2*hidden]
outputs = tf.pack(outputs)
# [batch, sequence, 2*hidden]
outputs = tf.transpose(outputs, [1, 0, 2])
# [batch * sequence, 2 * hidden]
outputs = tf.reshape(outputs, [-1, 2 * n_hidden_states])
# [batch * sequence, output_classes]
self.scores = tf.nn.xw_plus_b(outputs, w_out, b_out)
# [batch * sequence, output_classes]
inputs_y = tf.reshape(self.input_y, [-1, n_output_classes])
# [batch * sequence]
self.predictions = tf.argmax(self.scores, 1, name="predictions")
# Now calculate the cross-entropy
losses = tf.nn.softmax_cross_entropy_with_logits(self.scores, inputs_y)
self.loss = tf.reduce_mean(losses, name="loss")
if not is_training:
    return
# Training
self.train_op = tf.train.AdamOptimizer(1e-4).minimize(self.loss)
# Evaluate model
correct_pred = tf.equal(self.predictions, tf.argmax(inputs_y, 1))
self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name="accuracy")
Could there be an example in the training data where something is wrong with the labels? Then, when it hits that example, the cost becomes NaN. I'm suggesting this because it seems like it still happens when the learning rate is zero and after just a few batches.
Here is how I would debug:
Set the batch size to 1.
Set the learning rate to 0.0.
When you run a batch, have TensorFlow output the intermediate values, not just the cost.
Run until you get a NaN, then check what the input was and, by examining the intermediate outputs, determine at which point the NaN appears.
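A hedged sketch of the last two steps in TF1 terms, using check-numerics ops (the tensor names are taken from the question's code; adapt as needed):
# Wrap suspicious tensors so the run fails loudly at the first NaN/Inf,
# with a message identifying which tensor produced it.
scores_checked = tf.check_numerics(self.scores, "NaN/Inf in scores")
loss_checked = tf.check_numerics(self.loss, "NaN/Inf in loss")

# Or add a check for every floating-point tensor in the graph at once:
check_op = tf.add_check_numerics_ops()

# While debugging (batch size 1, learning rate 0.0), fetch the checks together
# with the loss and a few intermediate values:
# _, current_loss, current_scores = sess.run(
#     [check_op, self.loss, self.scores], feed_dict=feed)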

Stacked RNN model setup in TensorFlow

I'm kind of lost building up a stacked LSTM model for text classification in TensorFlow.
My input data looks something like:
x_train = [[1.,1.,1.],[2.,2.,2.],[3.,3.,3.],...,[0.,0.,0.],[0.,0.,0.],
...... #I trained the network in batch with batch size set to 32.
]
y_train = [[1.,0.],[1.,0.],[0.,1.],...,[1.,0.],[0.,1.]]
# binary classification
The skeleton of my code looks like:
self._input = tf.placeholder(tf.float32, [self.batch_size, self.max_seq_length, self.vocab_dim], name='input')
self._target = tf.placeholder(tf.float32, [self.batch_size, 2], name='target')
lstm_cell = rnn_cell.BasicLSTMCell(self.vocab_dim, forget_bias=1.)
lstm_cell = rnn_cell.DropoutWrapper(lstm_cell, output_keep_prob=self.dropout_ratio)
self.cells = rnn_cell.MultiRNNCell([lstm_cell] * self.num_layers)
self._initial_state = self.cells.zero_state(self.batch_size, tf.float32)
inputs = tf.nn.dropout(self._input, self.dropout_ratio)
inputs = [tf.reshape(input_, (self.batch_size, self.vocab_dim)) for input_ in
tf.split(1, self.max_seq_length, inputs)]
outputs, states = rnn.rnn(self.cells, inputs, initial_state=self._initial_state)
# We only care about the output of the last RNN cell...
y_pred = tf.nn.xw_plus_b(outputs[-1], tf.get_variable("softmax_w", [self.vocab_dim, 2]), tf.get_variable("softmax_b", [2]))
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_pred, self._target))
correct_pred = tf.equal(tf.argmax(y_pred, 1), tf.argmax(self._target, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)
init = tf.initialize_all_variables()
with tf.Session() as sess:
    initializer = tf.random_uniform_initializer(-0.04, 0.04)
    with tf.variable_scope("model", reuse=True, initializer=initializer):
        sess.run(init)
        # generate batches here (omitted for clarity)
        print sess.run([train_op, loss, accuracy], feed_dict={self._input: batch_x, self._target: batch_y})
The problem is that no matter how large the dataset is, the loss and accuracy show no sign of improvement (they look completely stochastic). Am I doing anything wrong?
Update:
# First, load Word2Vec model in Gensim.
model = Doc2Vec.load(word2vec_path)
# Second, build the dictionary.
gensim_dict = Dictionary()
gensim_dict.doc2bow(model.vocab.keys(), allow_update=True)
w2indx = {v: k + 1 for k, v in gensim_dict.items()}
w2vec = {word: model[word] for word in w2indx.keys()}
# Third, read data from a text file.
for fname in fnames:
    i = 0
    with codecs.open(fname, 'r', encoding='utf8') as fr:
        for line in fr:
            tmp = []
            for t in line.split():
                tmp.append(t)
            X_train.append(tmp)
            i += 1
            if i is samples_count:
                break
# Fourth, convert words into vectors, and pad each sentence with ZERO arrays to a fixed length.
result = np.zeros((len(data), self.max_seq_length, self.vocab_dim), dtype=np.float32)
for rowNo in xrange(len(data)):
    rowLen = len(data[rowNo])
    for colNo in xrange(rowLen):
        word = data[rowNo][colNo]
        if word in w2vec:
            result[rowNo][colNo] = w2vec[word]
        else:
            result[rowNo][colNo] = [0] * self.vocab_dim
    for colPadding in xrange(rowLen, self.max_seq_length):
        result[rowNo][colPadding] = [0] * self.vocab_dim
return result
# Fifth, generate batches and feed them to the model.
... Trivias ...
Here are a few reasons it may not be training, and suggestions to try:
You are not allowing the word vectors to update, so the space of pre-learned vectors may not be working properly.
RNNs really need gradient clipping when trained. You can try adding something like the sketch after this list.
Unit-scale initialization seems to work better, as it accounts for the size of the layer and allows the gradient to be scaled properly as it goes deeper.
You should try removing dropout and the second layer, just to check whether your data passing is correct and your loss is going down at all.
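For the gradient clipping point, here is a minimal sketch in the same old-style TF API as the question, clipping by global norm around the existing AdamOptimizer (the threshold 5.0 is just a common starting value, not something tuned for this model):
# Instead of optimizer.minimize(loss), compute, clip, then apply the gradients.
optimizer = tf.train.AdamOptimizer(self.lr)
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)
train_op = optimizer.apply_gradients(zip(clipped_grads, variables))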
I can also recommend trying this example with your data: https://github.com/tensorflow/skflow/blob/master/examples/text_classification.py
It trains word vectors from scratch, already has gradient clipping, and uses GRUCells, which are usually easier to train. You can also see nice visualizations of the loss and other things by running tensorboard --logdir=/tmp/tf_examples/word_rnn.
