Stacked RNN model setup in TensorFlow - machine-learning

I'm kind of lost in building up a stacked LSTM model for text classification in TensorFlow.
My input data was something like:
x_train = [[1.,1.,1.],[2.,2.,2.],[3.,3.,3.],...,[0.,0.,0.],[0.,0.,0.],
...... #I trained the network in batch with batch size set to 32.
]
y_train = [[1.,0.],[1.,0.],[0.,1.],...,[1.,0.],[0.,1.]]
# binary classification
The skeleton of my code looks like:
self._input = tf.placeholder(tf.float32, [self.batch_size, self.max_seq_length, self.vocab_dim], name='input')
self._target = tf.placeholder(tf.float32, [self.batch_size, 2], name='target')
lstm_cell = rnn_cell.BasicLSTMCell(self.vocab_dim, forget_bias=1.)
lstm_cell = rnn_cell.DropoutWrapper(lstm_cell, output_keep_prob=self.dropout_ratio)
self.cells = rnn_cell.MultiRNNCell([lstm_cell] * self.num_layers)
self._initial_state = self.cells.zero_state(self.batch_size, tf.float32)
inputs = tf.nn.dropout(self._input, self.dropout_ratio)
inputs = [tf.reshape(input_, (self.batch_size, self.vocab_dim)) for input_ in
tf.split(1, self.max_seq_length, inputs)]
outputs, states = rnn.rnn(self.cells, inputs, initial_state=self._initial_state)
# We only care about the output of the last RNN cell...
y_pred = tf.nn.xw_plus_b(outputs[-1], tf.get_variable("softmax_w", [self.vocab_dim, 2]), tf.get_variable("softmax_b", [2]))
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_pred, self._target))
correct_pred = tf.equal(tf.argmax(y_pred, 1), tf.argmax(self._target, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)
init = tf.initialize_all_variables()
with tf.Session() as sess:
    initializer = tf.random_uniform_initializer(-0.04, 0.04)
    with tf.variable_scope("model", reuse=True, initializer=initializer):
        sess.run(init)
        # generate batches here (omitted for clarity)
        print sess.run([train_op, loss, accuracy], feed_dict={self._input: batch_x, self._target: batch_y})
The problem is that no matter how large the dataset is, the loss and accuracy show no sign of improvement (they look completely random). Am I doing anything wrong?
Update:
# First, load Word2Vec model in Gensim.
model = Doc2Vec.load(word2vec_path)
# Second, build the dictionary.
gensim_dict = Dictionary()
gensim_dict.doc2bow(model.vocab.keys(), allow_update=True)
w2indx = {v: k + 1 for k, v in gensim_dict.items()}
w2vec = {word: model[word] for word in w2indx.keys()}
# Third, read data from a text file.
for fname in fnames:
    i = 0
    with codecs.open(fname, 'r', encoding='utf8') as fr:
        for line in fr:
            tmp = []
            for t in line.split():
                tmp.append(t)
            X_train.append(tmp)
            i += 1
            # compare by value; `is` tests object identity and is unreliable for ints
            if i == samples_count:
                break
# Fourth, convert words into vectors, and pad each sentence with ZERO arrays to a fixed length.
result = np.zeros((len(data), self.max_seq_length, self.vocab_dim), dtype=np.float32)
for rowNo in xrange(len(data)):
    rowLen = len(data[rowNo])
    for colNo in xrange(rowLen):
        word = data[rowNo][colNo]
        if word in w2vec:
            result[rowNo][colNo] = w2vec[word]
        else:
            result[rowNo][colNo] = [0] * self.vocab_dim
    for colPadding in xrange(rowLen, self.max_seq_length):
        result[rowNo][colPadding] = [0] * self.vocab_dim
return result
# Fifth, generate batches and feed them to the model.
... Trivias ...

Here are a few reasons it may not be training, with suggestions to try:
You are not allowing the word vectors to be updated; the space of pre-learned vectors may not work well for this task.
RNNs really need gradient clipping during training. You can try adding something like this.
Unit scale initialization seems to work better, as it accounts for the size of the layer and allows the gradient to be scaled properly as it goes deeper.
You should try removing dropout and the second layer, just to check that your data is being passed correctly and that your loss goes down at all.
I can also recommend trying this example with your data: https://github.com/tensorflow/skflow/blob/master/examples/text_classification.py
It trains word vectors from scratch, already has gradient clipping, and uses GRUCells, which are usually easier to train. You can also see nice visualizations of the loss and other things by running tensorboard --logdir=/tmp/tf_examples/word_rnn.
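To make the gradient clipping suggestion concrete, here is a minimal plain-Python sketch of clipping by global norm. It is only an illustration of the idea, not the code from the linked example; the function name and the sample gradients are made up. In TensorFlow itself, tf.clip_by_global_norm performs the same rescaling on a list of gradient tensors.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= clip_norm:
        # Small enough already: return unchanged copies.
        return [list(grad) for grad in grads], global_norm
    scale = clip_norm / global_norm
    return [[g * scale for g in grad] for grad in grads], global_norm


# Hypothetical gradients for two variables; joint norm is sqrt(9 + 16 + 144).
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, clip_norm=5.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, norm_after)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length is capped.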

Related

How to feed previous time-stamp prediction as additional input to the next time-stamp?

This question might have been asked, but I got confused.
I am trying to apply an RNN variant, e.g. an LSTM, to time-series forecasting. I have inputs and targets y (stock returns). For each timestamp, I'd like to get a prediction. Q1: Am I correct in choosing a seq2seq approach?
I also want to use the prediction from the previous timestamp (initialized with some constant) as an additional input, alongside my existing inputs, in the form of a squared residual, i.e. using
eps_{t-1} = (y_{t-1} - \hat{y}_{t-1})^2 as an additional input at t (as well as the previous inputs).
So, how can I do this in TensorFlow or in PyTorch?
I tried to depict what I want in the attached graph.
P.S. Sorry if the question is poorly formulated.
Let's say your input has dimension (32, 10, 1): a batch size of 32, 10 time steps, and a feature dimension of 1. The same goes for your target (stock returns). This code makes use of the tf.scan function, which is useful when implementing custom recurrent networks (it iterates over the time steps). What remains is to use the residual from t-1 somewhere at step t, as you would like to.
P.S. This is a very basic implementation of an LSTM from scratch, without any bias or output activation.
import tensorflow as tf
import numpy as np
tf.reset_default_graph()
BS = 32
TS = 10
inputs_dim = 1
target_dim = 1
inputs = tf.placeholder(shape=[BS, TS, inputs_dim], dtype=tf.float32)
stock_returns = tf.placeholder(shape=[BS, TS, target_dim], dtype=tf.float32)
state_size = 16
# initial hidden state
init_state = tf.placeholder(shape=[2, BS, state_size],
dtype=tf.float32, name='initial_state')
# initializer
xav_init = tf.contrib.layers.xavier_initializer
# params
W = tf.get_variable('W', shape=[4, state_size, state_size],
initializer=xav_init())
U = tf.get_variable('U', shape=[4, inputs_dim, state_size],
initializer=xav_init())
W_out = tf.get_variable('W_out', shape=[state_size, target_dim],
initializer=xav_init())
#the function to feed tf.scan with
def step(prev, inputs_):
    # unpack all inputs and previous outputs
    st_1, ct_1 = prev[0][0], prev[0][1]
    x = inputs_[0]
    target = inputs_[1]
    # get previous squared residual
    eps = prev[1]
    """
    here do whatever you want with eps_t-1
    like x += eps if x is of the same dimension
    or include it somewhere in your graph
    """
    # lstm gates (add bias if needed)
    #
    # input gate
    i = tf.sigmoid(tf.matmul(x,U[0]) + tf.matmul(st_1,W[0]))
    # forget gate
    f = tf.sigmoid(tf.matmul(x,U[1]) + tf.matmul(st_1,W[1]))
    # output gate
    o = tf.sigmoid(tf.matmul(x,U[2]) + tf.matmul(st_1,W[2]))
    # gate weights
    g = tf.tanh(tf.matmul(x,U[3]) + tf.matmul(st_1,W[3]))
    ct = ct_1*f + g*i
    st = tf.tanh(ct)*o
    """
    make prediction, compute residual in t
    and pass it to t+1
    Normally, we would compute the prediction outside the scan function,
    but as we need it here, we can just keep it and return it
    as an output of the scan function
    """
    prediction_t = tf.matmul(st, W_out)  # + bias
    eps = (target - prediction_t)**2
    return [tf.stack((st, ct), axis=0), eps, prediction_t]
states, eps, preds = tf.scan(step,
                             [tf.transpose(inputs, [1,0,2]),
                              tf.transpose(stock_returns, [1,0,2])],
                             initializer=[init_state,
                                          tf.zeros((32,1), dtype=tf.float32),
                                          tf.zeros((32,1), dtype=tf.float32)])
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(preds, feed_dict={
        inputs: np.random.rand(BS,TS,inputs_dim),
        stock_returns: np.random.rand(BS,TS,target_dim),
        init_state: np.zeros((2,BS,state_size))})
    out = tf.transpose(out,[1,0,2])
    print(out)
And the output :
Tensor("transpose_2:0", shape=(32, 10, 1), dtype=float32)
Base code from here
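The answer leaves open where eps_{t-1} enters the step. One possible reading, shown below as a toy scalar recurrence in plain Python, is to treat the previous squared residual as a second input with its own weight. This is a sketch under that assumption, not the LSTM above; the function name, the weights w_x, w_eps, w_h, w_out and the sample data are all hypothetical.

```python
import math


def run_with_residual_feedback(xs, ys, w_x, w_eps, w_h, w_out, eps0=0.0):
    """Toy scalar RNN whose input at step t is the pair (x_t, eps_{t-1})."""
    h, eps = 0.0, eps0
    preds, residuals = [], []
    for x_t, y_t in zip(xs, ys):
        # The previous squared residual enters as an extra input term.
        h = math.tanh(w_x * x_t + w_eps * eps + w_h * h)
        y_hat = w_out * h
        # Squared residual at t, fed back at t + 1.
        eps = (y_t - y_hat) ** 2
        preds.append(y_hat)
        residuals.append(eps)
    return preds, residuals


xs = [0.5, -0.2, 0.1]
ys = [0.1, 0.0, -0.1]
with_feedback, _ = run_with_residual_feedback(xs, ys, w_x=1.0, w_eps=1.0, w_h=0.5, w_out=1.0)
without_feedback, _ = run_with_residual_feedback(xs, ys, w_x=1.0, w_eps=0.0, w_h=0.5, w_out=1.0)
print(with_feedback)
print(without_feedback)
```

Because the residual at t needs the true y_t, this only works for one-step-ahead prediction where each target becomes known before the next step is computed.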

How to store a dictionary and map words to ints when using Tensorflow Serving?

I have trained an LSTM RNN classification model on Tensorflow. I was saving and restoring checkpoints to retrain and use the model for testing. Now I want to use Tensorflow serving so that I can use the model in production.
Initially, I would parse through a corpus to create my dictionary which is then used to map words in a string to integers. I would then store this dictionary in a pickle file which could be reloaded when restoring a checkpoint and retraining on a data set or just for using the model so that the mapping is consistent. How do I store this dictionary when saving the model using SavedModelBuilder?
My code for the neural network is as follows. The code for saving the model is towards the end (I am including an overview of the whole structure for context):
...
# Read files and store them in variables
with open('./someReview.txt', 'r') as f:
    reviews = f.read()
with open('./someLabels.txt', 'r') as f:
    labels = f.read()
...
#Pre-processing functions
#Parse through dataset and create a vocabulary
vocab_to_int, reviews = RnnPreprocessing.map_vocab_to_int(reviews)
with open(pickle_path, 'wb') as handle:
    pickle.dump(vocab_to_int, handle, protocol=pickle.HIGHEST_PROTOCOL)
#More preprocessing functions
...
# Building the graph
lstm_size = 256
lstm_layers = 2
batch_size = 1000
learning_rate = 0.01
n_words = len(vocab_to_int) + 1
# Create the graph object
tf.reset_default_graph()
with tf.name_scope('inputs'):
    inputs_ = tf.placeholder(tf.int32, [None, None], name="inputs")
    labels_ = tf.placeholder(tf.int32, [None, None], name="labels")
    keep_prob = tf.placeholder(tf.float32, name="keep_prob")
#Create embedding layer LSTM cell, LSTM Layers
...
# Forward pass
with tf.name_scope("RNN_forward"):
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
# Output. We are only interested in the latest output of the lstm cell
with tf.name_scope('predictions'):
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    tf.summary.histogram('predictions', predictions)
#More functions for cost, accuracy, optimizer initialization
...
# Training
epochs = 1
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
iteration = 1
for e in range(epochs):
state = sess.run(initial_state)
for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
feed = {inputs_: x,
labels_: y[:, None],
keep_prob: 0.5,
initial_state: state}
summary, loss, state, _ = sess.run([merged, cost, final_state, optimizer], feed_dict=feed)
train_writer.add_summary(summary, iteration)
if iteration%1==0:
print("Epoch: {}/{}".format(e, epochs),
"Iteration: {}".format(iteration),
"Train loss: {:.3f}".format(loss))
if iteration%2==0:
val_acc = []
val_state = sess.run(cell.zero_state(batch_size, tf.float32))
for x, y in get_batches(val_x, val_y, batch_size):
feed = {inputs_: x,
labels_: y[:, None],
keep_prob: 1,
initial_state: val_state}
summary, batch_acc, val_state = sess.run([merged, accuracy, final_state], feed_dict=feed)
val_acc.append(batch_acc)
print("Val acc: {:.3f}".format(np.mean(val_acc)))
iteration +=1
test_writer.add_summary(summary, iteration)
#Saving the model
export_path = './SavedModel'
print ('Exporting trained model to %s'%(export_path))
builder = saved_model_builder.SavedModelBuilder(export_path)
# Build the signature_def_map.
classification_inputs = utils.build_tensor_info(inputs_)
classification_outputs_classes = utils.build_tensor_info(labels_)
classification_signature = signature_def_utils.build_signature_def(
inputs={signature_constants.CLASSIFY_INPUTS: classification_inputs},
outputs={
signature_constants.CLASSIFY_OUTPUT_CLASSES:
classification_outputs_classes,
},
method_name=signature_constants.CLASSIFY_METHOD_NAME)
legacy_init_op = tf.group(
tf.tables_initializer(), name='legacy_init_op')
#add the sigs to the servable
builder.add_meta_graph_and_variables(
sess, [tag_constants.SERVING],
signature_def_map={
signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
classification_signature
},
legacy_init_op=legacy_init_op)
print ("added meta graph and variables")
#save it!
builder.save()
print("model saved")
I am not entirely sure if this is the correct way to save a model such as this but this is the only implementation I have found in the documentation and online tutorials.
I haven't found any example or any explicit guide to saving the dictionary or how to use it when restoring a savedModel in the documentation.
When using checkpoints, I would just load the pickle file before running the session. How do I restore this savedModel so that I can use the same word to int mapping using the dictionary? Is there any specific way I should be saving the model or loading it?
I have also added inputs_ as the input for the input signature. This is a sequence of integers 'after' the words have been mapped. I can't specify a string as input because I get an AttributeError: 'str' object has no attribute 'dtype'. In such cases, how exactly are words mapped to integers in models that are in production?
Implement your preprocessing using the utilities in tf.feature_column and it'll be straightforward to use the same mapping to integers in serving.
One approach to this is storing the vocabulary in the model's graph. This will then be shipped with the model.
...
vocab_table = lookup.index_table_from_file(vocabulary_file='data/vocab.csv', num_oov_buckets=1, default_value=-1)
text = features[commons.FEATURE_COL]
words = tf.string_split(text)
dense_words = tf.sparse_tensor_to_dense(words, default_value=commons.PAD_WORD)
word_ids = vocab_table.lookup(dense_words)
padding = tf.constant([[0, 0], [0, commons.MAX_DOCUMENT_LENGTH]])
# Pad all the word_ids entries to the maximum document length
word_ids_padded = tf.pad(word_ids, padding)
word_id_vector = tf.slice(word_ids_padded, [0, 0], [-1, commons.MAX_DOCUMENT_LENGTH])
Source: https://github.com/KishoreKarunakaran/CloudML-Serving/blob/master/text/imdb_cnn/model/cnn_model.py#L83
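For readers who want to see what that in-graph pipeline computes, here is a plain-Python sketch of the same steps: split on whitespace, look up ids, send unknown words to one out-of-vocabulary bucket, then pad or truncate to a fixed length. It is only an illustration; the helper names, the "<pad>" token and the tiny vocabulary are assumptions, not part of the linked model.

```python
def build_vocab(words):
    """Map each vocabulary word to its position, like line numbers in a vocabulary file."""
    return {word: idx for idx, word in enumerate(words)}


def encode(text, vocab, max_len, pad_token="<pad>"):
    """Split on whitespace, map words to ids, then truncate or pad to max_len."""
    oov_id = len(vocab)  # one out-of-vocabulary bucket placed after the known ids
    ids = [vocab.get(token, oov_id) for token in text.split()]
    ids = ids[:max_len]
    return ids + [vocab[pad_token]] * (max_len - len(ids))


# Hypothetical vocabulary; the padding token is assumed to be the first entry.
vocab = build_vocab(["<pad>", "i", "like", "new", "york"])
short_ids = encode("i like old york", vocab, max_len=6)
long_ids = encode("i like new york i like new", vocab, max_len=6)
print(short_ids)
print(long_ids)
```

The point of doing this inside the graph rather than in client code is that the vocabulary travels with the exported model, so training and serving cannot drift apart.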

TensorFlow average gradients over several batches

This is a possible duplicate of Tensorflow: How to get gradients per instance in a batch?. I ask it anyway, because there has not been a satisfying answer and the goal here is a bit different.
I have a very big network that I can fit on my GPU but the max batch size I can feed is 32. Anything bigger than that causes the GPU to run out of memory. I want to use a bigger batch in order to get a more accurate approximation of the gradient.
For concreteness, let's say I want to compute the gradient on a big batch of size 96 by feeding 3 batches of 32 in turn. The best way that I know of is to use Optimizer.compute_gradients() and Optimizer.apply_gradients(). Here is a small example how it can work
import tensorflow as tf
import numpy as np
learn_rate = 0.1
W_init = np.array([ [1,2,3], [4,5,6], [7,8,9] ], dtype=np.float32)
x_init = np.array([ [11,12,13], [14,15,16], [17,18,19] ], dtype=np.float32)
X = tf.placeholder(dtype=np.float32, name="x")
W = tf.Variable(W_init, dtype=np.float32, name="w")
y = tf.matmul(X, W, name="y")
loss = tf.reduce_mean(y, name="loss")
opt = tf.train.GradientDescentOptimizer(learn_rate)
grad_vars_op = opt.compute_gradients(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# Compute the gradients for each batch
grads_vars1 = sess.run(grad_vars_op, feed_dict = {X: x_init[None,0]})
grads_vars2 = sess.run(grad_vars_op, feed_dict = {X: x_init[None,1]})
grads_vars3 = sess.run(grad_vars_op, feed_dict = {X: x_init[None,2]})
# Separate the gradients from the variables
grads1 = [ grad for grad, var in grads_vars1 ]
grads2 = [ grad for grad, var in grads_vars2 ]
grads3 = [ grad for grad, var in grads_vars3 ]
varl = [ var for grad, var in grads_vars1 ]
# Average the gradients
grads = [ (g1 + g2 + g3)/3 for g1, g2, g3 in zip(grads1, grads2, grads3)]
sess.run(opt.apply_gradients(zip(grads,varl)))
print("Weights after 1 gradient")
print(sess.run(W))
Now this is all very ugly and inefficient since the forward pass is being run on the GPU while averaging the gradients happens on the CPU and then applying them happens on the GPU again.
Moreover, this code throws an exception because grads is a list of np.arrays and to make it work, one would have to create a tf.placeholder for every gradient.
I am sure there should be a better and more efficient way to do this? Any suggestions?
You can create a copy of the trainable variables and accumulate the batch gradients in it. Here are a few simple steps to follow:
...
opt = tf.train.GradientDescentOptimizer(learn_rate)
# number of batches for gradient accumulation
n_batches = 3
# constant to scale the sum of gradients (float division, so it is not 0 in Python 2)
const = tf.constant(1.0 / n_batches)
# get all trainable variables
t_vars = tf.trainable_variables()
# create a copy of all trainable variables with `0` as initial values
accum_tvars = [tf.Variable(tf.zeros_like(t_var.initialized_value()), trainable=False) for t_var in t_vars]
# create an op to reset all accumulator variables to zero
zero_ops = [tv.assign(tf.zeros_like(tv)) for tv in accum_tvars]
# compute gradients for a batch
batch_grads_vars = opt.compute_gradients(loss, t_vars)
# collect the (scaled by const) batch gradient into accumulated vars
accum_ops = [accum_tvars[i].assign_add(tf.scalar_mul(const, batch_grad_var[0])) for i, batch_grad_var in enumerate(batch_grads_vars)]
# apply the accumulated gradients
train_step = opt.apply_gradients([(accum_tvars[i], batch_grad_var[1]) for i, batch_grad_var in enumerate(batch_grads_vars)])
# train_step = opt.apply_gradients(zip(accum_tvars, zip(*batch_grads_vars)[1]))
while True:
    # reset the accumulated gradients
    sess.run(zero_ops)
    for i in range(n_batches):
        # feed one row at a time, as in the question
        sess.run(accum_ops, feed_dict={X: x_init[None, i]})
    sess.run(train_step)
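The reason accumulation works can be checked without TensorFlow: for a loss that is a mean over examples, averaging the gradients of equally sized mini-batches gives exactly the gradient of the combined batch. The sketch below shows this with a one-parameter least-squares model; the data, the model and the names are made up for illustration.

```python
def grad_mse(w, batch):
    """Gradient with respect to w of mean((w * x - y) ** 2) over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


# Hypothetical data: six examples, processed as three mini-batches of two.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0), (5.0, 9.0), (6.0, 13.0)]
w, learn_rate, n_batches = 0.5, 0.01, 3
size = len(data) // n_batches

accum = 0.0
for i in range(n_batches):
    mini_batch = data[i * size:(i + 1) * size]
    # Scale each mini-batch gradient by 1 / n_batches while accumulating.
    accum += grad_mse(w, mini_batch) / n_batches

full = grad_mse(w, data)         # gradient of the big batch computed in one go
w_new = w - learn_rate * accum   # a single update from the accumulated gradient
print(accum, full, w_new)
```

The equality only holds when the mini-batches have the same size; with uneven sizes each gradient would need to be weighted by its share of the examples.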

Tensorflow - Weight Noise Regularization

I'm trying to implement weight noise regularization as Alex Graves did in his PhD thesis, but I have several issues with how to implement it. The algorithm should look like:
while stopping criteria not met do
    Randomize training set order
    for each example in the training set do
        Add zero mean Gaussian Noise to weights
        Run forward and backward pass to calculate the gradient
        Restore original weights
        Update weights with gradient descent algorithm
Could anyone shed some light?
Edit 09/16/16
Here is my code:
# e.g: log filter bank or MFCC features
# Has size [batch_size, max_stepsize, num_features], but the
# batch_size and max_stepsize can vary along each step
inputs = tf.placeholder(tf.float32, [None, None, num_features])
# Here we use sparse_placeholder that will generate a
# SparseTensor required by ctc_loss op.
targets = tf.sparse_placeholder(tf.int32)
# 1d array of size [batch_size]
seq_len = tf.placeholder(tf.int32, [None])
# Defining the cell
# Can be:
# tf.nn.rnn_cell.RNNCell
# tf.nn.rnn_cell.GRUCell
cell = tf.nn.rnn_cell.LSTMCell(num_hidden, state_is_tuple=True)
# Stacking rnn cells
stack = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers,
state_is_tuple=True)
# The second output is the last state and we will not use that
outputs, _ = tf.nn.dynamic_rnn(cell, inputs, seq_len, dtype=tf.float32)
shape = tf.shape(inputs)
batch_s, max_timesteps = shape[0], shape[1]
# Reshaping to apply the same weights over the timesteps
outputs = tf.reshape(outputs, [-1, num_hidden])
# Truncated normal with mean 0 and stdev=0.1
# Tip: Try another initialization
# see https://www.tensorflow.org/versions/r0.9/api_docs/python/contrib.layers.html#initializers
W = tf.Variable(tf.truncated_normal([num_hidden,
num_classes],
stddev=0.1))
# Zero initialization
# Tip: Is tf.zeros_initializer the same?
b = tf.Variable(tf.constant(0., shape=[num_classes]))
# Doing the affine projection
logits = tf.matmul(outputs, W) + b
# Reshaping back to the original shape
logits = tf.reshape(logits, [batch_s, -1, num_classes])
# Time major
logits = tf.transpose(logits, (1, 0, 2))
loss = tf.contrib.ctc.ctc_loss(logits, targets, seq_len)
cost = tf.reduce_mean(loss)
optimizer = tf.train.MomentumOptimizer(initial_learning_rate,
0.9).minimize(cost)
# Option 2: tf.contrib.ctc.ctc_beam_search_decoder
# (it's slower but you'll get better results)
decoded, log_prob = tf.contrib.ctc.ctc_greedy_decoder(logits, seq_len)
# Inaccuracy: label error rate
ler = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32),
targets))
Edit 09/27/16
I realized that I must change my optimizer in order to add the weight noise regularizer, but I have no idea how to insert this into my code.
variables = tf.trainable_variables()
with tf.variable_scope(self.name or "OptimizeLoss", [loss, global_step]):
update_ops = set(ops.get_collection(ops.GraphKeys.UPDATE_OPS))
# Make sure update ops are ran before computing loss.
if update_ops:
loss = control_flow_ops.with_dependencies(list(update_ops), loss)
add_noise_ops = [tf.no_op()]
if self.weights_noise_scale is not None:
add_noise_ops, remove_noise_ops = self._noise_ops(variables, self.weights_noise_scale)
# Make sure add noise to weights before computing loss.
loss = control_flow_ops.with_dependencies(add_noise_ops, loss)
# Compute gradients.
gradients = self._opt.compute_gradients(loss, variables, colocate_gradients_with_ops=self.colocate_gradients_with_ops)
# Optionally add gradient noise.
if self.gradient_noise_scale is not None:
gradients = self._add_scaled_noise_to_gradients(gradients, self.gradient_noise_scale)
# Optionally clip gradients by global norm.
if self.clip_gradients_by_global_norm is not None:
gradients = self._clip_gradients_by_global_norm(gradients, self.clip_gradients_by_global_norm)
# Optionally clip gradients by value.
if self.clip_gradients_by_value is not None:
gradients = self._clip_gradients_by_value(gradients, self.clip_gradients_by_value)
# Optionally clip gradients by norm.
if self.clip_gradients_by_norm is not None:
gradients = self._clip_gradients_by_norm(gradients, self.clip_gradients_by_norm)
self._grads = [g[0] for g in gradients]
self._vars = [g[1] for g in gradients]
# Create gradient updates.
# Make sure that the noise of weights will be removed before the gradient update rule
grad_updates = self._opt.apply_gradients(gradients,
global_step=global_step,
name="train")
# Ensure the train_tensor computes grad_updates.
train_tensor = control_flow_ops.with_dependencies([grad_updates], loss)
Could anyone shed some light on me?
Thanks :)
To solve that I would build two graphs: one for training and another for evaluation. The latter won't add noise to the weights. To add random noise to the weights, you can just do:
W = tf.Variable(tf.truncated_normal([num_hidden,
                                     num_classes],
                                    stddev=0.1))
noise = tf.truncated_normal([num_hidden, num_classes],
                            stddev=0.001)
W = W + noise
The tf.truncated_normal tensor is resampled every time the graph is run, so it adds a small amount of fresh random noise to your weights at each step.
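The pseudocode from the question can also be written out directly. The following plain-Python sketch applies it to a one-weight least-squares problem: the gradient is evaluated at a noisy copy of the weight, and the update is applied to the clean weight. It is a toy illustration with made-up data and hyperparameters, not the CTC model from the question.

```python
import random


def noisy_sgd_step(w, x, y, learn_rate, noise_std, rng):
    """One SGD step where the gradient is taken at a noise-perturbed weight."""
    # Add zero-mean Gaussian noise to a copy, so the original weight is kept.
    w_noisy = w + rng.gauss(0.0, noise_std)
    # Forward and backward pass at the noisy weight for loss (w_noisy * x - y) ** 2.
    grad = 2.0 * (w_noisy * x - y) * x
    # Update the original (restored) weight with that gradient.
    return w - learn_rate * grad


rng = random.Random(0)
# Hypothetical data generated by y = 3 * x.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w = 0.0
for epoch in range(200):
    rng.shuffle(data)  # randomize training set order
    for x, y in data:
        w = noisy_sgd_step(w, x, y, learn_rate=0.05, noise_std=0.01, rng=rng)
print(w)
```

Adding the noise as a separate tensor, as in the answer, achieves the same effect in a graph: the stored variable never holds the noisy value, so there is nothing to restore before the update.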

Tensorflow cross-entropy NaN, and changing learning rate doesn't seem to have an impact

TL;DR
Trying to build a bidirectional RNN for sequence tagging using tensorflow.
The goal is to take inputs "I like New York" and produce outputs "O O LOC_START LOC"
The graph compiles and runs, but the loss becomes NaN after 1 or 2 batches. I understand this could be a problem with the learning rate, but changing the learning rate seems to have no impact. Using AdamOptimizer at the moment.
Any help would be appreciated.
Here is my code:
Code:
# The input and output: a sequence of words, embedded, and a sequence of word classifications, one-hot
self.input_x = tf.placeholder(tf.float32, [None, n_sequence_length, n_embedding_dim], name="input_x")
self.input_y = tf.placeholder(tf.float32, [None, n_sequence_length, n_output_classes], name="input_y")
# New shape: [sequence_length, batch_size (None), embedding_dim]
inputs = tf.transpose(self.input_x, [1, 0, 2])
# New shape: [sequence_length * batch_size (None), embedding_dim]
inputs = tf.reshape(inputs, [-1, n_embedding_dim])
# Define weights
w_hidden = tf.Variable(tf.random_normal([n_embedding_dim, 2 * n_hidden_states]))
b_hidden = tf.Variable(tf.random_normal([2 * n_hidden_states]))
w_out = tf.Variable(tf.random_normal([2 * n_hidden_states, n_output_classes]))
b_out = tf.Variable(tf.random_normal([n_output_classes]))
# Linear activation for the input; this will make it fit to the hidden size
inputs = tf.nn.xw_plus_b(inputs, w_hidden, b_hidden)
# Split up the batches into a Python list
inputs = tf.split(0, n_sequence_length, inputs)
# Now we define our cell. It takes one word as input, a vector of embedding_size length
cell_forward = rnn_cell.BasicLSTMCell(n_hidden_states, forget_bias=0.0)
cell_backward = rnn_cell.BasicLSTMCell(n_hidden_states, forget_bias=0.0)
# And we add a Dropout Wrapper as appropriate
if is_training and prob_keep < 1:
    cell_forward = rnn_cell.DropoutWrapper(cell_forward, output_keep_prob=prob_keep)
    cell_backward = rnn_cell.DropoutWrapper(cell_backward, output_keep_prob=prob_keep)
# And we make it a few layers deep
cell_forward_multi = rnn_cell.MultiRNNCell([cell_forward] * n_layers)
cell_backward_multi = rnn_cell.MultiRNNCell([cell_backward] * n_layers)
# returns outputs = a list T of tensors [batch, 2*hidden]
outputs = rnn.bidirectional_rnn(cell_forward_multi, cell_backward_multi, inputs, dtype=dtypes.float32)
# [sequence, batch, 2*hidden]
outputs = tf.pack(outputs)
# [batch, sequence, 2*hidden]
outputs = tf.transpose(outputs, [1, 0, 2])
# [batch * sequence, 2 * hidden]
outputs = tf.reshape(outputs, [-1, 2 * n_hidden_states])
# [batch * sequence, output_classes]
self.scores = tf.nn.xw_plus_b(outputs, w_out, b_out)
# [batch * sequence, output_classes]
inputs_y = tf.reshape(self.input_y, [-1, n_output_classes])
# [batch * sequence]
self.predictions = tf.argmax(self.scores, 1, name="predictions")
# Now calculate the cross-entropy
losses = tf.nn.softmax_cross_entropy_with_logits(self.scores, inputs_y)
self.loss = tf.reduce_mean(losses, name="loss")
if not is_training:
    return
# Training
self.train_op = tf.train.AdamOptimizer(1e-4).minimize(self.loss)
# Evaluate model
correct_pred = tf.equal(self.predictions, tf.argmax(inputs_y, 1))
self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name="accuracy")
Could there be an example in the training data where something is wrong with the labels? Then, when training hits that example, the cost becomes NaN. I'm suggesting this because it seems to still happen when the learning rate is zero and after just a few batches.
Here is how I would debug:
Set the batch size to 1.
Set the learning rate to 0.0.
When you run a batch, have TensorFlow output the intermediate values, not just the cost.
Run until you get a NaN, then check what the input was and, by examining the intermediate outputs, determine at which point the NaN first appears.
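The last step above, finding where the NaN first appears, can be automated once the intermediate values have been fetched. Below is a minimal plain-Python sketch that scans named intermediates in graph order and reports the first one containing a NaN or infinity. The helper name and the sample values are hypothetical; inside a TensorFlow graph, tf.check_numerics can serve a similar purpose.

```python
import math


def first_non_finite(named_values):
    """Return the name of the first intermediate containing NaN or inf, else None."""
    for name, values in named_values:
        if any(not math.isfinite(v) for v in values):
            return name
    return None


# Hypothetical flattened values fetched for one batch, listed in graph order.
intermediates = [
    ("inputs", [0.1, -0.3, 0.7]),
    ("scores", [2.0, float("inf"), -1.0]),
    ("loss", [float("nan")]),
]
culprit = first_non_finite(intermediates)
print(culprit)
```

Listing the tensors in the order they are computed matters: the loss is almost always non-finite too, but the earliest offender is the one worth inspecting.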

Resources