I am trying to model a translation between two numerical (floating-point) datasets and thought of using sequence-to-sequence learning with teacher forcing. I am able to run the training model with a decently low MSE, but when it comes to the inference model, my outputs are really far off from the target data, or maybe I am running inference incorrectly. My question is: how can we run inference on floating-point values? On the internet I can find several tutorials that one-hot encode integer data, draw the inference as a one-hot encoded vector, and then decode it back to the predicted integer. But how can I do the same with my data?
Both of my datasets are numeric with floating-point values.
encoder input data =
array([[0. ],
[0.00075804],
[0.00024911],
...,
[0. ],
[0. ],
[0. ]])
I am using a masking layer with 0 as the start/stop character, because my encoder dataset consists of 4096 time steps per sample.
My decoder output data =
array([[0.04930792],
[0.0509621 ],
[0.05045872],
...,
[0.02535375],
[0.02148524],
[0.02867743]], dtype=float32)
Decoder data consists of 8192 time steps per sample.
My decoder input data =
array([[0. ],
[0.04930792],
[0.0509621 ],
...,
[0.01980789],
[0.02535375],
[0.02148524]], dtype=float32)
Decoder also consists of 8192 time steps per sample.
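(For clarity, a shifted decoder input like the one above can be built from the decoder target by prepending the 0 start value — a minimal sketch, assuming decoder_output_data has shape (num_samples, 8192, 1).)

import numpy as np

# decoder_input_data[t] is the target at t-1, with 0 as the start "character".
decoder_input_data = np.zeros_like(decoder_output_data)
decoder_input_data[:, 1:, :] = decoder_output_data[:, :-1, :]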
My training model architecture:
encoder_inputs = Input(shape=(max_input_sequence, input_dimension), name='encoder_inputs')
masking = Masking(mask_value=0)
encoder_inputs_masked = masking(encoder_inputs)

encoder_lstm = LSTM(LSTMoutputDimension, activation='elu', return_state=True, name='encoder_lstm')
LSTM_outputs, state_h, state_c = encoder_lstm(encoder_inputs_masked)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None, input_dimension), name='decoder_inputs')
decoder_lstm = LSTM(LSTMoutputDimension, activation='elu', return_sequences=True, return_state=True, name='decoder_lstm')

# Set up the decoder, using the `context vector` as its initial state.
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(input_dimension, name='decoder_dense')
decoder_outputs = decoder_dense(decoder_outputs)

# Put together
model_encoder_training = Model([encoder_inputs, decoder_inputs], decoder_outputs, name='model_encoder_training')

opt = Adam(lr=0.007, clipnorm=1)
model_encoder_training.compile(optimizer=opt, loss='mean_squared_error', metrics=['mse'])
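For reference, the teacher-forcing training call then looks roughly like this (batch size, epochs and validation split here are placeholders, not my exact values):

# encoder_input_data: (num_samples, 4096, 1); decoder arrays: (num_samples, 8192, 1)
model_encoder_training.fit([encoder_input_data, decoder_input_data],
                           decoder_output_data,
                           batch_size=32,
                           epochs=50,
                           validation_split=0.2)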
My inference model architecture:
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(LSTMoutputDimension,))
decoder_state_input_c = Input(shape=(LSTMoutputDimension,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, input_dimension))
    # Populate the first character of the target sequence with the start character.
    target_seq[0, 0, 0] = 0
    # target_seq = 0

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_seq = list()
    while not stop_condition:
        # In a loop:
        # decode the input to a token/output prediction + the required states for the context vector.
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Convert the token/output prediction to a token/output.
        # sampled_token_index = np.argmax(output_tokens[0, -1, :])
        # sampled_digit = sampled_token_index
        # Add the predicted token/output to the output sequence.
        decoded_seq.append(output_tokens)

        # Exit condition: either hit the max length
        # or find the stop character.
        if len(decoded_seq) == max_input_sequence:
            stop_condition = True

        # Update the input target sequence (of length 1)
        # with the predicted token/output.
        # target_seq = np.zeros((1, 1, input_dimension))
        # target_seq[0, 0, sampled_token_index] = 1.
        target_seq = output_tokens

        # Update the input states (context vector)
        # with the outputted states.
        states_value = [h, c]
        # Loop back...

    # When the loop exits, return the output sequence.
    return decoded_seq
sampleNo = 1
# for sample in range(0,sampleNo):
for sample in range(0, sampleNo):
    predicted = decode_sequence(encoder_input_data[sample].reshape(1, max_input_sequence, input_dimension))
    # store.append(predicted)
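Since the outputs are continuous, there is no argmax/one-hot decoding step: the Dense output is already the predicted float, so I just stack the per-step predictions into an array — a minimal sketch, assuming each output_tokens has shape (1, 1, 1):

# 'predicted' is a list of arrays of shape (1, 1, input_dimension)
predicted_array = np.concatenate(predicted, axis=1)  # (1, steps, input_dimension)
predicted_sequence = predicted_array.squeeze()       # (steps,) when input_dimension == 1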
So far, I have tried different activation functions for the Dense output layer, but with no luck; nothing works the way I expect it to. Any suggestions or help will be greatly appreciated!
What I am trying to do
I want to create a model that smoothes my predictions. My predictions have a shape [num samples, 4, 7], where 4 is the sequence length and 7 is the number of classes. The class values sum to 100.
However, my predictions often fluctuate, predicting for example a value of 50 for class 5 at time step 1 and 89 at time step 2. In reality, a class rarely makes such extreme fluctuations. So, I want to smooth my predictions.
I have training data with a similar shape [num samples, 4, 7]. I want to create a model that learns the behavior of the classes from this data and then applies that to my predictions, hopefully smoothing my results.
I understand that I can just average out the results and smooth them like that, but I am curious whether I can use a deep learning model that understands the underlying probabilities and indirectly corrects as well as smooths the predictions (a minimal sketch of the averaging baseline follows below).
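For reference, the plain averaging baseline mentioned above could look something like this — a minimal sketch, assuming preds is a float tensor of shape [num_samples, 4, 7] whose class values sum to 100:

import torch
import torch.nn.functional as F

def smooth_baseline(preds, kernel_size=3):
    # Moving average along the sequence dimension (dim=1).
    x = preds.transpose(1, 2)                          # [N, 7, 4]
    x = F.avg_pool1d(x, kernel_size, stride=1,
                     padding=kernel_size // 2,
                     count_include_pad=False)          # [N, 7, 4]
    x = x.transpose(1, 2)                              # [N, 4, 7]
    # Rescale so each time step still sums to 100.
    return x * (100.0 / x.sum(dim=-1, keepdim=True))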
What I have tried
However, I am struggling to understand how one creates such an architecture. I have tried working with plain matrices as well as with an LSTM:
class SmoothModel(nn.Module):
    def __init__(self, input_size, output_size):
        super(SmoothModel, self).__init__()
        self.input_size = input_size
        self.output_size = output_size
        # Initialize the cooccurrence matrix as a learnable parameter
        self.cooccurrence = nn.Parameter(torch.randn(input_size, output_size))
        # Initialize the transition matrix as a learnable parameter
        self.transition = nn.Parameter(torch.randn(input_size, output_size))
        # Softmax layer
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        sequences_updated = []
        # Update sequence based on transition and cooccurrence matrix
        for i in range(x.shape[0]):
            # Cooccurrence multiplication
            seq_list = []
            for j in range(x.shape[1]):
                predicted_cooc = x[i, j, :].unsqueeze(0)  # shape [1, 7]
                updated_cooc = torch.matmul(predicted_cooc, self.cooccurrence)
                seq_list.append(updated_cooc)
            # Collect into a sequence of 4 again, where the cooccurrence is updated
            seq = torch.cat(seq_list, dim=0)  # create shape [4, 7]
            # Transition multiplication
            updated_seq = torch.matmul(seq, self.transition)  # shape [4, 7]
            # Append the updated sequence
            sequences_updated.append(updated_seq.unsqueeze(0))  # append shape [1, 4, 7]
        # Create tensor with all updated sequences
        updated_tensor = torch.cat(sequences_updated, dim=0)  # dim = 0 is the number of samples
        # Output should sum to 100
        updated_tensor = self.softmax(updated_tensor) * 100
        return updated_tensor
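As a side note, the two Python loops in forward can be collapsed into batched matrix multiplications that compute the same thing — a minimal equivalent sketch, assuming input_size == output_size == 7 as in the shapes above:

def forward(self, x):
    # x: [N, 4, 7]; both matmuls broadcast over the batch and sequence dimensions.
    x = torch.matmul(x, self.cooccurrence)  # [N, 4, 7]
    x = torch.matmul(x, self.transition)    # [N, 4, 7]
    return self.softmax(x) * 100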
My idea behind this model was that it would update my predictions based on learned cooccurrence and transition probabilities.
Another model I tried, this time with an LSTM:
class SmoothModel(nn.Module):
    def __init__(self, input_size, output_size, hidden_size=64):
        super(SmoothModel, self).__init__()
        self.input_size = input_size
        self.output_size = output_size
        # Initialize the cooccurrence as a learnable parameter
        self.cooccurrence = nn.Linear(input_size, output_size)
        # Initialize the transition probability as a learnable parameter
        self.transition = nn.LSTM(input_size, hidden_size)
        self.transition_probability = nn.Linear(hidden_size, output_size)
        # Softmax layer
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        sequences_updated = []
        # Update sequence based on transition and cooccurrence matrix
        for i in range(x.shape[0]):
            # Cooccurrence multiplication
            seq_list = []
            for j in range(x.shape[1]):
                predicted_cooc = x[i, j, :].unsqueeze(0)  # shape [1, 7]
                updated_cooc = self.cooccurrence(predicted_cooc)
                seq_list.append(updated_cooc)
            # Collect into a sequence of 4 again, where the cooccurrence is updated
            seq = torch.cat(seq_list, dim=0)  # create shape [4, 7]
            # Transition probability
            _, (hidden, _) = self.transition(seq.unsqueeze(0))  # shape [1, 4, hidden size]
            updated_seq = self.transition_probability(hidden[-1, :, :].unsqueeze(0))  # shape [1, 4, 7]
            # Append the updated sequence
            sequences_updated.append(updated_seq)  # append
        # Create tensor with all updated sequences
        updated_tensor = torch.cat(sequences_updated, dim=0)  # dim = 0 is the number of samples
        # Output should sum to 100
        updated_tensor = self.softmax(updated_tensor) * 100
        return updated_tensor
I have furthermore tried some variants of this, for example updating only one time step at a time, and a sort of Markov chain approach. But the current models don't improve the results.
Question
Does anyone have experience with this, or know what theory/architecture I could be using? Or should I look at it in a totally different way?
I am happy to provide further (data) information if necessary!
I am using PyTorch.
I have been trying to train a UNet model on my images; however, my model accuracy is always exactly 0.5, even though the loss does decrease.
I have also checked for class imbalance, and I have tried playing with the learning rate. The learning rate affects the loss but not the accuracy.
My architecture is below (from here):
""" `UNet` class is based on https://arxiv.org/abs/1505.04597
The U-Net is a convolutional encoder-decoder neural network.
Contextual spatial information (from the decoding,
expansive pathway) about an input tensor is merged with
information representing the localization of details
(from the encoding, compressive pathway).
Modifications to the original paper:
(1) padding is used in 3x3 convolutions to prevent loss
of border pixels
(2) merging outputs does not require cropping due to (1)
(3) residual connections can be used by specifying
UNet(merge_mode='add')
(4) if non-parametric upsampling is used in the decoder
pathway (specified by upmode='upsample'), then an
additional 1x1 2d convolution occurs after upsampling
to reduce channel dimensionality by a factor of 2.
This channel halving happens with the convolution in
the transpose convolution (specified by upmode='transpose')
Arguments:
in_channels: int, number of channels in the input tensor.
Default is 3 for RGB images. Our SPARCS dataset is 13 channel.
depth: int, number of MaxPools in the U-Net. During training, input size needs to be
(depth-1) times divisible by 2
start_filts: int, number of convolutional filters for the first conv.
up_mode: string, type of upconvolution. Choices: 'transpose' for transpose convolution
"""
class UNet(nn.Module):
    def __init__(self, num_classes, depth, in_channels, start_filts=16, up_mode='transpose', merge_mode='concat'):
        super(UNet, self).__init__()

        if up_mode in ('transpose', 'upsample'):
            self.up_mode = up_mode
        else:
            raise ValueError("\"{}\" is not a valid mode for upsampling. Only \"transpose\" and \"upsample\" are allowed.".format(up_mode))

        if merge_mode in ('concat', 'add'):
            self.merge_mode = merge_mode
        else:
            raise ValueError("\"{}\" is not a valid mode for merging up and down paths. Only \"concat\" and \"add\" are allowed.".format(merge_mode))

        # NOTE: up_mode 'upsample' is incompatible with merge_mode 'add'
        if self.up_mode == 'upsample' and self.merge_mode == 'add':
            raise ValueError("up_mode \"upsample\" is incompatible with merge_mode \"add\" at the moment "
                             "because it doesn't make sense to use nearest neighbour to reduce depth channels (by half).")

        self.num_classes = num_classes
        self.in_channels = in_channels
        self.start_filts = start_filts
        self.depth = depth

        self.down_convs = []
        self.up_convs = []

        # create the encoder pathway and add to a list
        for i in range(depth):
            ins = self.in_channels if i == 0 else outs
            outs = self.start_filts * (2 ** i)
            pooling = True if i < depth - 1 else False
            down_conv = DownConv(ins, outs, pooling=pooling)
            self.down_convs.append(down_conv)

        # create the decoder pathway and add to a list
        # - careful! decoding only requires depth-1 blocks
        for i in range(depth - 1):
            ins = outs
            outs = ins // 2
            up_conv = UpConv(ins, outs, up_mode=up_mode, merge_mode=merge_mode)
            self.up_convs.append(up_conv)

        self.conv_final = conv1x1(outs, self.num_classes)

        # add the lists of modules to the current module
        self.down_convs = nn.ModuleList(self.down_convs)
        self.up_convs = nn.ModuleList(self.up_convs)

        self.reset_params()

    @staticmethod
    def weight_init(m):
        if isinstance(m, nn.Conv2d):
            # https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/
            # Doc: https://pytorch.org/docs/stable/nn.init.html?highlight=xavier#torch.nn.init.xavier_normal_
            init.xavier_normal_(m.weight)
            init.constant_(m.bias, 0)

    def reset_params(self):
        for i, m in enumerate(self.modules()):
            self.weight_init(m)

    def forward(self, x):
        encoder_outs = []
        # encoder pathway, save outputs for merging
        for i, module in enumerate(self.down_convs):
            x, before_pool = module(x)
            encoder_outs.append(before_pool)
        for i, module in enumerate(self.up_convs):
            before_pool = encoder_outs[-(i + 2)]
            x = module(before_pool, x)
        # No softmax is used here. This means we need to use
        # nn.CrossEntropyLoss in the training script,
        # as that module includes a softmax already.
        x = self.conv_final(x)
        return x
Parameters are :
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x,y = train_sequence[0] ; batch_size = x.shape[0]
model = UNet(num_classes = 2, depth=5, in_channels=5, merge_mode='concat').to(device)
optim = torch.optim.Adam(model.parameters(),lr=0.01, weight_decay=1e-3)
criterion = nn.BCEWithLogitsLoss() #has sigmoid internally
epochs = 1000
The function for training is:
import torch.nn.functional as f

def train_model(epoch, train_sequence):
    """Train the model and report validation error with training error
    Args:
        model: the model to be trained
        criterion: loss function
        data_train (DataLoader): training dataset
    """
    model.train()
    for idx in range(len(train_sequence)):
        X, y = train_sequence[idx]
        images = Variable(torch.from_numpy(X)).to(device)  # [batch, channel, H, W]
        masks = Variable(torch.from_numpy(y)).to(device)

        outputs = model(images)
        print(masks.shape, outputs.shape)
        loss = criterion(outputs, masks)
        optim.zero_grad()
        loss.backward()
        # Update weights
        optim.step()
    # total_loss = get_loss_train(model, data_train, criterion)
My function for calculating loss and accuracy is below:
def get_loss_train(model, train_sequence):
    """
    Calculate loss over train set
    """
    model.eval()
    total_acc = 0
    total_loss = 0
    for idx in range(len(train_sequence)):
        with torch.no_grad():
            X, y = train_sequence[idx]
            images = Variable(torch.from_numpy(X)).to(device)  # [batch, channel, H, W]
            masks = Variable(torch.from_numpy(y)).to(device)

            outputs = model(images)
            loss = criterion(outputs, masks)
            preds = torch.argmax(outputs, dim=1).float()
            acc = accuracy_check_for_batch(masks.cpu(), preds.cpu(), images.size()[0])
            total_acc = total_acc + acc
            total_loss = total_loss + loss.cpu().item()
    return total_acc / (len(train_sequence)), total_loss / (len(train_sequence))
Edit: the code which runs (calls) the functions:
for epoch in range(epochs):
    train_model(epoch, train_sequence)
    train_acc, train_loss = get_loss_train(model, train_sequence)
    print("Train Acc:", train_acc)
    print("Train loss:", train_loss)
Can someone help me identify why the accuracy is always exactly 0.5?
Edit-2:
As asked, the accuracy_check_for_batch function is here:
def accuracy_check_for_batch(masks, predictions, batch_size):
    total_acc = 0
    for index in range(batch_size):
        total_acc += accuracy_check(masks[index], predictions[index])
    return total_acc / batch_size
and
def accuracy_check(mask, prediction):
    ims = [mask, prediction]
    np_ims = []
    for item in ims:
        if 'str' in str(type(item)):
            item = np.array(Image.open(item))
        elif 'PIL' in str(type(item)):
            item = np.array(item)
        elif 'torch' in str(type(item)):
            item = item.numpy()
        np_ims.append(item)

    compare = np.equal(np_ims[0], np_ims[1])
    accuracy = np.sum(compare)
    return accuracy / len(np_ims[0].flatten())
I found the mistake.
model = UNet(num_classes = 2, depth=5, in_channels=5, merge_mode='concat').to(device)
should be
model = UNet(num_classes = 1, depth=5, in_channels=5, merge_mode='concat').to(device)
because I am using BCEWithLogitsLoss.
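In other words, the output head and the loss have to agree. A minimal sketch of the two consistent combinations (variable names follow the code above; the threshold/argmax lines would replace the preds line in get_loss_train):

# Option 1: single-channel output + BCEWithLogitsLoss (what I ended up doing)
model = UNet(num_classes=1, depth=5, in_channels=5, merge_mode='concat').to(device)
criterion = nn.BCEWithLogitsLoss()               # masks must be float, shape [batch, 1, H, W]
preds = (torch.sigmoid(outputs) > 0.5).float()   # accuracy: threshold instead of argmax

# Option 2: two-channel output + CrossEntropyLoss
model = UNet(num_classes=2, depth=5, in_channels=5, merge_mode='concat').to(device)
criterion = nn.CrossEntropyLoss()                # masks must be long, shape [batch, H, W]
preds = torch.argmax(outputs, dim=1).float()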
This question might have been asked, but I got confused.
I am trying to apply one of the RNN types, e.g. an LSTM, to time-series forecasting. I have inputs and y (stock returns). For each timestamp, I'd like to get a prediction. Q1: am I correct in choosing a seq2seq approach?
I also want to use the prediction from the previous timestamp (initializing the first value with some constant) as an additional input, alongside my existing inputs, in the form of the squared residual, i.e. using
eps_{t-1} = (y_{t-1} - ŷ_{t-1})^2 as an additional input at time t (as well as the previous inputs).
So, how can I do this in TensorFlow or in PyTorch?
I tried to depict what I want in the attached graph: The graph
P.S. Sorry if the question is poorly formulated.
Let's say your input is of dimension (32, 10, 1), with batch_size 32, 10 time steps and a feature dimension of 1, and the same for your target (stock returns). This code makes use of the tf.scan function, which is useful when implementing custom recurrent networks (it iterates over the time steps). It remains to use the residual of t-1 at t somewhere, as you would like to (see the note after the code).
P.S.: this is a very basic implementation of an LSTM from scratch, without any biases or output activation.
import tensorflow as tf
import numpy as np
tf.reset_default_graph()
BS = 32
TS = 10
inputs_dim = 1
target_dim = 1
inputs = tf.placeholder(shape=[BS, TS, inputs_dim], dtype=tf.float32)
stock_returns = tf.placeholder(shape=[BS, TS, target_dim], dtype=tf.float32)
state_size = 16
# initial hidden state
init_state = tf.placeholder(shape=[2, BS, state_size],
dtype=tf.float32, name='initial_state')
# initializer
xav_init = tf.contrib.layers.xavier_initializer
# params
W = tf.get_variable('W', shape=[4, state_size, state_size],
initializer=xav_init())
U = tf.get_variable('U', shape=[4, inputs_dim, state_size],
initializer=xav_init())
W_out = tf.get_variable('W_out', shape=[state_size, target_dim],
initializer=xav_init())
# the function to feed tf.scan with
def step(prev, inputs_):
    # unpack all inputs and previous outputs
    st_1, ct_1 = prev[0][0], prev[0][1]
    x = inputs_[0]
    target = inputs_[1]
    # get the previous squared residual
    eps = prev[1]
    """
    here do whatever you want with eps_t-1,
    like x += eps if x is of the same dimension,
    or include it somewhere in your graph
    """
    # lstm gates (add bias if needed)
    #
    # input gate
    i = tf.sigmoid(tf.matmul(x, U[0]) + tf.matmul(st_1, W[0]))
    # forget gate
    f = tf.sigmoid(tf.matmul(x, U[1]) + tf.matmul(st_1, W[1]))
    # output gate
    o = tf.sigmoid(tf.matmul(x, U[2]) + tf.matmul(st_1, W[2]))
    # gate weights
    g = tf.tanh(tf.matmul(x, U[3]) + tf.matmul(st_1, W[3]))

    ct = ct_1 * f + g * i
    st = tf.tanh(ct) * o
    """
    make the prediction, compute the residual at t
    and pass it to t+1.
    Normally, we would compute the prediction outside the scan function,
    but as we need it here, we can just keep it and return it back
    as an output of the scan function
    """
    prediction_t = tf.matmul(st, W_out)  # + bias
    eps = (target - prediction_t) ** 2

    return [tf.stack((st, ct), axis=0), eps, prediction_t]

states, eps, preds = tf.scan(step, [tf.transpose(inputs, [1, 0, 2]),
                                    tf.transpose(stock_returns, [1, 0, 2])],
                             initializer=[init_state,
                                          tf.zeros((32, 1), dtype=tf.float32),
                                          tf.zeros((32, 1), dtype=tf.float32)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(preds, feed_dict={inputs: np.random.rand(BS, TS, inputs_dim),
                                     stock_returns: np.random.rand(BS, TS, target_dim),
                                     init_state: np.zeros((2, BS, state_size))})
    out = tf.transpose(out, [1, 0, 2])
    print(out)
And the output:
Tensor("transpose_2:0", shape=(32, 10, 1), dtype=float32)
Base code from here
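As for the "do whatever you want with eps_t-1" block inside step, one way to actually use it is to concatenate the previous residual to the input features, which only requires widening U — a hedged sketch, not tested against the rest of the graph:

# Inside step(), after eps = prev[1]:
x = tf.concat([x, eps], axis=-1)  # shape [BS, inputs_dim + 1]

# ...and U then needs one extra input row per gate:
U = tf.get_variable('U', shape=[4, inputs_dim + 1, state_size],
                    initializer=xav_init())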
I trained a model with the purpose of generating sentences as follows:
I feed as a training example 2 sequences: x, which is a sequence of characters, and y, which is the same sequence shifted by one. The model is based on an LSTM and is created with TensorFlow.
My question is: since the model takes input sequences of a certain size (50 in my case), how can I make predictions by giving it only a single character as a seed? I've seen examples where, after training, sentences are generated by simply feeding in a single character.
Here is my code:
with tf.name_scope('input'):
    x = tf.placeholder(tf.float32, [batch_size, truncated_backprop], name='x')
    y = tf.placeholder(tf.int32, [batch_size, truncated_backprop], name='y')

with tf.name_scope('weights'):
    W = tf.Variable(np.random.rand(n_hidden, num_classes), dtype=tf.float32)
    b = tf.Variable(np.random.rand(1, num_classes), dtype=tf.float32)

inputs_series = tf.split(x, truncated_backprop, 1)
labels_series = tf.unstack(y, axis=1)

with tf.name_scope('LSTM'):
    cell = tf.contrib.rnn.BasicLSTMCell(n_hidden, state_is_tuple=True)
    cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=dropout)
    cell = tf.contrib.rnn.MultiRNNCell([cell] * n_layers)

    states_series, current_state = tf.contrib.rnn.static_rnn(cell, inputs_series,
                                                             dtype=tf.float32)

logits_series = [tf.matmul(state, W) + b for state in states_series]
prediction_series = [tf.nn.softmax(logits) for logits in logits_series]

losses = [tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels)
          for logits, labels in zip(logits_series, labels_series)]
total_loss = tf.reduce_mean(losses)

train_step = tf.train.AdamOptimizer(learning_rate).minimize(total_loss)
I suggest you use dynamic_rnn instead of static_rnn, which creates the graph during execution time and allows you to have inputs of any length. Your input placeholder would be
x = tf.placeholder(tf.float32, [batch_size, None, features], name='x')
Next, you'll need a way to input your own initial state into the network. You can do that by passing the initial_state parameter to dynamic_rnn, like:
initialstate = cell.zero_state(batch_size, tf.float32)
outputs, current_state = tf.nn.dynamic_rnn(cell,
inputs,
initial_state=initialstate)
With that, in order to generate text from a single character you can feed the graph 1 character at a time, passing in the previous character and state each time, like:
prompt = 's'  # beginning character, whatever
inp = one_hot(prompt)  # preprocessing, as you probably want to feed one-hot vectors
state = None
while True:
    if state is None:
        feed = {x: [[inp]]}
    else:
        feed = {x: [[inp]], initialstate: state}
    out, state = sess.run([outputs, current_state], feed_dict=feed)
    inp = process(out)  # extract the predicted character from out and one-hot it
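The one_hot and process helpers are left abstract above. A minimal sketch of what they could look like, assuming a vocabulary of size features with char_to_idx / idx_to_char mappings (these names are placeholders), and assuming out holds per-character probabilities (i.e. the softmax over the logits is fetched rather than the raw LSTM outputs):

import numpy as np

def one_hot(char):
    # char -> one-hot vector over the vocabulary
    vec = np.zeros(features, dtype=np.float32)
    vec[char_to_idx[char]] = 1.0
    return vec

def process(out):
    # out: [1, time, features]; take the last step's distribution,
    # pick the most likely character and re-encode it for the next step.
    idx = int(np.argmax(out[0, -1]))
    return one_hot(idx_to_char[idx])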
TL;DR
I am trying to build a bidirectional RNN for sequence tagging using TensorFlow.
The goal is to take the input "I like New York" and produce the output "O O LOC_START LOC".
The graph compiles and runs, but the loss becomes NaN after 1 or 2 batches. I understand this could be a problem with the learning rate, but changing the learning rate seems to have no impact. Using AdamOptimizer at the moment.
Any help would be appreciated.
Here is my code:
Code:
# The input and output: a sequence of words, embedded, and a sequence of word classifications, one-hot
self.input_x = tf.placeholder(tf.float32, [None, n_sequence_length, n_embedding_dim], name="input_x")
self.input_y = tf.placeholder(tf.float32, [None, n_sequence_length, n_output_classes], name="input_y")
# New shape: [sequence_length, batch_size (None), embedding_dim]
inputs = tf.transpose(self.input_x, [1, 0, 2])
# New shape: [sequence_length * batch_size (None), embedding_dim]
inputs = tf.reshape(inputs, [-1, n_embedding_dim])
# Define weights
w_hidden = tf.Variable(tf.random_normal([n_embedding_dim, 2 * n_hidden_states]))
b_hidden = tf.Variable(tf.random_normal([2 * n_hidden_states]))
w_out = tf.Variable(tf.random_normal([2 * n_hidden_states, n_output_classes]))
b_out = tf.Variable(tf.random_normal([n_output_classes]))
# Linear activation for the input; this will make it fit to the hidden size
inputs = tf.nn.xw_plus_b(inputs, w_hidden, b_hidden)
# Split up the batches into a Python list
inputs = tf.split(0, n_sequence_length, inputs)
# Now we define our cell. It takes one word as input, a vector of embedding_size length
cell_forward = rnn_cell.BasicLSTMCell(n_hidden_states, forget_bias=0.0)
cell_backward = rnn_cell.BasicLSTMCell(n_hidden_states, forget_bias=0.0)
# And we add a Dropout Wrapper as appropriate
if is_training and prob_keep < 1:
    cell_forward = rnn_cell.DropoutWrapper(cell_forward, output_keep_prob=prob_keep)
    cell_backward = rnn_cell.DropoutWrapper(cell_backward, output_keep_prob=prob_keep)
# And we make it a few layers deep
cell_forward_multi = rnn_cell.MultiRNNCell([cell_forward] * n_layers)
cell_backward_multi = rnn_cell.MultiRNNCell([cell_backward] * n_layers)
# returns outputs = a list T of tensors [batch, 2*hidden]
outputs = rnn.bidirectional_rnn(cell_forward_multi, cell_backward_multi, inputs, dtype=dtypes.float32)
# [sequence, batch, 2*hidden]
outputs = tf.pack(outputs)
# [batch, sequence, 2*hidden]
outputs = tf.transpose(outputs, [1, 0, 2])
# [batch * sequence, 2 * hidden]
outputs = tf.reshape(outputs, [-1, 2 * n_hidden_states])
# [batch * sequence, output_classes]
self.scores = tf.nn.xw_plus_b(outputs, w_out, b_out)
# [batch * sequence, output_classes]
inputs_y = tf.reshape(self.input_y, [-1, n_output_classes])
# [batch * sequence]
self.predictions = tf.argmax(self.scores, 1, name="predictions")
# Now calculate the cross-entropy
losses = tf.nn.softmax_cross_entropy_with_logits(self.scores, inputs_y)
self.loss = tf.reduce_mean(losses, name="loss")
if not is_training:
    return
# Training
self.train_op = tf.train.AdamOptimizer(1e-4).minimize(self.loss)
# Evaluate model
correct_pred = tf.equal(self.predictions, tf.argmax(inputs_y, 1))
self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name="accuracy")
Could there be an example in the training data where something is wrong with the labels? Then, when it hits that example, the cost becomes NaN. I'm suggesting this because it seems like it still happens even when the learning rate is zero, and after just a few batches.
Here is how I would debug (a sketch of the last step follows the list):
Set the batch size to 1.
Set the learning rate to 0.0.
When you run a batch, have TensorFlow output the intermediate values, not just the cost.
Run until you get a NaN, then check what the input was, and by examining the intermediate outputs determine at which point the NaN appears.
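A minimal sketch of that last step in TF 1.x: tf.add_check_numerics_ops makes the run fail as soon as any tensor in the graph becomes NaN/Inf, and fetching a few intermediate tensors lets you inspect the offending batch. The sess, feed_dict and model attribute names here are assumptions based on the question's code:

# Add assertions for every tensor in the graph; the run raises an error
# at the first NaN/Inf instead of silently propagating it.
check_op = tf.add_check_numerics_ops()

# Fetch the intermediates alongside the loss, not just the cost.
loss_val, scores_val, _ = sess.run([model.loss, model.scores, check_op],
                                   feed_dict=feed_dict)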