Generate text with a trained character level LSTM model - machine-learning

I trained a model with the purpose of generating sentences as follow:
I feed as training example 2 sequences: x which is a sequence of characters and y which is the same shift by one. The model is based on LSTM and is created with tensorflow.
My question is: since the model take in input sequences of a certain size (50 in my case), how can I make prediction giving him only a single character as seed ? I've seen it in some examples that after training they generate sentences by simply feeding a single characters.
Here is my code:
with tf.name_scope('input'):
x = tf.placeholder(tf.float32, [batch_size, truncated_backprop], name='x')
y = tf.placeholder(tf.int32, [batch_size, truncated_backprop], name='y')
with tf.name_scope('weights'):
W = tf.Variable(np.random.rand(n_hidden, num_classes), dtype=tf.float32)
b = tf.Variable(np.random.rand(1, num_classes), dtype=tf.float32)
inputs_series = tf.split(x, truncated_backprop, 1)
labels_series = tf.unstack(y, axis=1)
with tf.name_scope('LSTM'):
cell = tf.contrib.rnn.BasicLSTMCell(n_hidden, state_is_tuple=True)
cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=dropout)
cell = tf.contrib.rnn.MultiRNNCell([cell] * n_layers)
states_series, current_state = tf.contrib.rnn.static_rnn(cell, inputs_series, \
dtype=tf.float32)
logits_series = [tf.matmul(state, W) + b for state in states_series]
prediction_series = [tf.nn.softmax(logits) for logits in logits_series]
losses = [tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels) \
for logits, labels, in zip(logits_series, labels_series)]
total_loss = tf.reduce_mean(losses)
train_step = tf.train.AdamOptimizer(learning_rate).minimize(total_loss)

I suggest you use dynamic_rnn instead of static_rnn, which creates the graph during execution time and allows you to have inputs of any length. Your input placeholder would be
x = tf.placeholder(tf.float32, [batch_size, None, features], name='x')
Next, you'll need a way to input your own initial state into the network. You can do that by passing the initial_state parameter to dynamic_rnn, like:
initialstate = cell.zero_state(batch_sie, tf.float32)
outputs, current_state = tf.nn.dynamic_rnn(cell,
inputs,
initial_state=initialstate)
With that, in order to generate text from a single character you can feed the graph 1 character at a time, passing in the previous character and state each time, like:
prompt = 's' # beginning character, whatever
inp = one_hot(prompt) # preprocessing, as you probably want to feed one-hot vectors
state = None
while True:
if state is None:
feed = {x: [[inp]]}
else:
feed = {x: [[inp]], initialstate: state}
out, state = sess.run([outputs, current_state], feed_dict=feed)
inp = process(out) # extract the predicted character from out and one-hot it

Related

Large, exploding loss in Pytorch transformer model

I am trying to solve a sequence to sequence problem with a transformer model. The data is derived from a set of crossword puzzles.
The positional encoding and transformer classes are as follows:
class PositionalEncoding(nn.Module):
def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 3000):
super().__init__()
self.dropout = nn.Dropout(p=dropout)
position = torch.arange(max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(1, max_len, d_model)
pe[0, :, 0::2] = torch.sin(position * div_term)
pe[0, :, 1::2] = torch.cos(position * div_term)
self.register_buffer('pe', pe)
def debug(self, x):
return x.shape, x.size()
def forward(self, x: Tensor) -> Tensor:
x = x + self.pe[:, :x.size(1), :]
return self.dropout(x)
class Transformer(nn.Module):
def __init__(
self,
num_tokens,
dim_model,
num_heads,
num_encoder_layers,
num_decoder_layers,
batch_first,
dropout_p,
):
super().__init__()
self.model_type = "Transformer"
self.dim_model = dim_model
self.positional_encoder = PositionalEncoding(
d_model=dim_model, dropout=dropout_p, max_len=3000
)
self.embedding = nn.Embedding.from_pretrained(vec_weights, freeze=False)#nn.Embedding(num_tokens, dim_model)
self.transformer = nn.Transformer(
d_model=dim_model,
nhead=num_heads,
num_encoder_layers=num_encoder_layers,
num_decoder_layers=num_decoder_layers,
dropout=dropout_p,
batch_first = batch_first
)
self.out = nn.Linear(dim_model, num_tokens)
def forward(self, src, tgt, tgt_mask=None, src_pad_mask=None, tgt_pad_mask=None):
src = self.embedding(src)*math.sqrt(self.dim_model)
tgt = self.embedding(tgt)*math.sqrt(self.dim_model)
src = self.positional_encoder(src)
tgt = self.positional_encoder(tgt)
transformer_out = self.transformer(src, tgt, tgt_mask=tgt_mask, src_key_padding_mask=src_pad_mask, tgt_key_padding_mask=tgt_pad_mask)
out = self.out(transformer_out)
return out
def get_tgt_mask(self, size) -> torch.tensor:
mask = torch.tril(torch.ones(size, size) == 1)
mask = mask.float()
mask = mask.masked_fill(mask == 0, float('-inf'))
mask = mask.masked_fill(mask == 1, float(0.0))
return mask
def create_pad_mask(self, matrix: torch.tensor, pad_token: int) -> torch.tensor:
return (matrix == pad_token)
The input tensors are a source tensor of size N by S, where N is the batch size and S is the source sequence length, and a target tensor of size N by T, where T is the target sequence length. S is about 10 and T is about 5, while the total number of items is about 160,000-200,000, divided into batch sizes of 512. They are torch.IntTensors, with elements in the range from 0 to V, where V is the vocabulary length.
The first layer is an embedding layer that takes the input from N by S to N by S by E, where E is the embedding dimension (300), or to N by T by E in the case of the target. The second layer adds position encoding without changing the shape. Then both tensors are passed through the transformer layer, which outputs an N by T by E tensor. Finally, we pass this output through a linear layer, which produces an N by T by V output, where V is the size of the vocabulary used in the problem. Here V is about 56,697. The most frequent tokens (words) appear about 50-60 times in the target tensor.
The transformer class also contains the functions for implementing the masking matrices.
Then we create the model and run it (this process is wrapped in a function).
device = "cuda"
src_train, src_test = torch.utils.data.random_split(src_t, [int(0.9*len(src_t)), len(src_t)-int(0.9*len(src_t))])
src_train, src_test = src_train[:512], src_test[:512]
tgt_train, tgt_test = torch.utils.data.random_split(tgt_t, [int(0.9*len(tgt_t)), len(tgt_t)-int(0.9*len(tgt_t))])
tgt_train, tgt_test = tgt_train[:512], tgt_test[:512]
train_data, test_data = list(zip(src_train, tgt_train)), list(zip(src_test, tgt_test))
train, test = torch.utils.data.DataLoader(dataset=train_data), torch.utils.data.DataLoader(dataset=test_data)
model = Transformer(num_tokens=ntokens, dim_model=300, num_heads=2, num_encoder_layers=3, num_decoder_layers=3, batch_first = True, dropout_p=0.1).to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0000001)
n_epochs = 50
def train_model(model, optimizer, loss_function, n_epochs):
loss_value=0
for epoch in range(n_epochs):
print(f"Starting epoch {epoch}")
for batch, data in enumerate(train):
x, y = data
if batch%100 == 0:
print(f"Batch is {batch}")
batch += 1
optimizer.zero_grad()
x, y = torch.tensor(x).to(device), torch.tensor(y).to(device)
y_input, y_base = y[:, :-1], y[:, 1:]
y_input, y_base = y_input.to(device), y_base.to(device)
tgt_mask = model.get_tgt_mask(y_input.shape[1]).to(device)
pad_token = vocabulary_table[embeddings.key_to_index["/"]]
src_pad_mask = model.create_pad_mask(x, pad_token).to(device)
tgt_pad_mask = model.create_pad_mask(y_input, pad_token).to(device)
z = model(x, y_input, tgt_mask, src_pad_mask, tgt_pad_mask)
z = z.permute(0, 2, 1).to(device)
y_base = y_base.long().to(device)
loss = loss_function(z, y_base).to(device)
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2)
optimizer.step()
loss_value += float(loss)
if batch%100 == 0:
print(f"For epoch {epoch}, batch {batch} the cross-entropy loss is {loss_value}")
#Free GPU memory.
del z
del x
del y
del y_input
del y_base
del loss
torch.cuda.empty_cache()
return model.parameters(), loss_value
Basically, we split the data into test and training sets and use an SGD optimizer and cross-entropy loss. We create a masking matrix for the padding for both the target and source tensors, and a masking matrix for unseen elements for the target tensor. We then do the usual gradient update steps. Right now, there is no validation loop, because I cannot even get the training loss to decrease.
The loss is very high, reaching more than 1000 after 100 batches. More concerningly, the loss also increases rapidly during training, rather than decreasing. In the code that I included, I tried clipping the gradients, lowering the learning rate, and using a much smaller sample to debug the code.
What could be causing this behavior?
You are only adding things to your loss, so naturally it can only increase.
loss_value += float(loss)
You're supposed to set it to zero after every epoch. Now you set it to zero once, in the beginning of the training process. There is a training loop template here if you're interested (https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html). This explains the increasing loss. To further troubleshoot (if needed) I'd throw in a validation loop.

Pytorch model stuck at 0.5 though loss decreases consistently

This is using PyTorch
I have been trying to implement UNet model on my images, however, my model accuracy is always exact 0.5. Loss does decrease.
I have also checked for class imbalance. I have also tried playing with learning rate. Learning rate affects loss but not the accuracy.
My architecture below ( from here )
""" `UNet` class is based on https://arxiv.org/abs/1505.04597
The U-Net is a convolutional encoder-decoder neural network.
Contextual spatial information (from the decoding,
expansive pathway) about an input tensor is merged with
information representing the localization of details
(from the encoding, compressive pathway).
Modifications to the original paper:
(1) padding is used in 3x3 convolutions to prevent loss
of border pixels
(2) merging outputs does not require cropping due to (1)
(3) residual connections can be used by specifying
UNet(merge_mode='add')
(4) if non-parametric upsampling is used in the decoder
pathway (specified by upmode='upsample'), then an
additional 1x1 2d convolution occurs after upsampling
to reduce channel dimensionality by a factor of 2.
This channel halving happens with the convolution in
the tranpose convolution (specified by upmode='transpose')
Arguments:
in_channels: int, number of channels in the input tensor.
Default is 3 for RGB images. Our SPARCS dataset is 13 channel.
depth: int, number of MaxPools in the U-Net. During training, input size needs to be
(depth-1) times divisible by 2
start_filts: int, number of convolutional filters for the first conv.
up_mode: string, type of upconvolution. Choices: 'transpose' for transpose convolution
"""
class UNet(nn.Module):
def __init__(self, num_classes, depth, in_channels, start_filts=16, up_mode='transpose', merge_mode='concat'):
super(UNet, self).__init__()
if up_mode in ('transpose', 'upsample'):
self.up_mode = up_mode
else:
raise ValueError("\"{}\" is not a valid mode for upsampling. Only \"transpose\" and \"upsample\" are allowed.".format(up_mode))
if merge_mode in ('concat', 'add'):
self.merge_mode = merge_mode
else:
raise ValueError("\"{}\" is not a valid mode for merging up and down paths.Only \"concat\" and \"add\" are allowed.".format(up_mode))
# NOTE: up_mode 'upsample' is incompatible with merge_mode 'add'
if self.up_mode == 'upsample' and self.merge_mode == 'add':
raise ValueError("up_mode \"upsample\" is incompatible with merge_mode \"add\" at the moment "
"because it doesn't make sense to use nearest neighbour to reduce depth channels (by half).")
self.num_classes = num_classes
self.in_channels = in_channels
self.start_filts = start_filts
self.depth = depth
self.down_convs = []
self.up_convs = []
# create the encoder pathway and add to a list
for i in range(depth):
ins = self.in_channels if i == 0 else outs
outs = self.start_filts*(2**i)
pooling = True if i < depth-1 else False
down_conv = DownConv(ins, outs, pooling=pooling)
self.down_convs.append(down_conv)
# create the decoder pathway and add to a list
# - careful! decoding only requires depth-1 blocks
for i in range(depth-1):
ins = outs
outs = ins // 2
up_conv = UpConv(ins, outs, up_mode=up_mode, merge_mode=merge_mode)
self.up_convs.append(up_conv)
self.conv_final = conv1x1(outs, self.num_classes)
# add the list of modules to current module
self.down_convs = nn.ModuleList(self.down_convs)
self.up_convs = nn.ModuleList(self.up_convs)
self.reset_params()
#staticmethod
def weight_init(m):
if isinstance(m, nn.Conv2d):
#https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/
##Doc: https://pytorch.org/docs/stable/nn.init.html?highlight=xavier#torch.nn.init.xavier_normal_
init.xavier_normal_(m.weight)
init.constant_(m.bias, 0)
def reset_params(self):
for i, m in enumerate(self.modules()):
self.weight_init(m)
def forward(self, x):
encoder_outs = []
# encoder pathway, save outputs for merging
for i, module in enumerate(self.down_convs):
x, before_pool = module(x)
encoder_outs.append(before_pool)
for i, module in enumerate(self.up_convs):
before_pool = encoder_outs[-(i+2)]
x = module(before_pool, x)
# No softmax is used. This means we need to use
# nn.CrossEntropyLoss is your training script,
# as this module includes a softmax already.
x = self.conv_final(x)
return x
Parameters are :
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x,y = train_sequence[0] ; batch_size = x.shape[0]
model = UNet(num_classes = 2, depth=5, in_channels=5, merge_mode='concat').to(device)
optim = torch.optim.Adam(model.parameters(),lr=0.01, weight_decay=1e-3)
criterion = nn.BCEWithLogitsLoss() #has sigmoid internally
epochs = 1000
The function for training is :
import torch.nn.functional as f
def train_model(epoch,train_sequence):
"""Train the model and report validation error with training error
Args:
model: the model to be trained
criterion: loss function
data_train (DataLoader): training dataset
"""
model.train()
for idx in range(len(train_sequence)):
X, y = train_sequence[idx]
images = Variable(torch.from_numpy(X)).to(device) # [batch, channel, H, W]
masks = Variable(torch.from_numpy(y)).to(device)
outputs = model(images)
print(masks.shape, outputs.shape)
loss = criterion(outputs, masks)
optim.zero_grad()
loss.backward()
# Update weights
optim.step()
# total_loss = get_loss_train(model, data_train, criterion)
My function for calculating loss and accuracy is below:
def get_loss_train(model, train_sequence):
"""
Calculate loss over train set
"""
model.eval()
total_acc = 0
total_loss = 0
for idx in range(len(train_sequence)):
with torch.no_grad():
X, y = train_sequence[idx]
images = Variable(torch.from_numpy(X)).to(device) # [batch, channel, H, W]
masks = Variable(torch.from_numpy(y)).to(device)
outputs = model(images)
loss = criterion(outputs, masks)
preds = torch.argmax(outputs, dim=1).float()
acc = accuracy_check_for_batch(masks.cpu(), preds.cpu(), images.size()[0])
total_acc = total_acc + acc
total_loss = total_loss + loss.cpu().item()
return total_acc/(len(train_sequence)), total_loss/(len(train_sequence))
Edit : Code which runs (calls) the functions:
for epoch in range(epochs):
train_model(epoch, train_sequence)
train_acc, train_loss = get_loss_train(model,train_sequence)
print("Train Acc:", train_acc)
print("Train loss:", train_loss)
Can someone help me identify as why is accuracy always exact 0.5?
Edit-2:
As asked accuracy_check_for_batch function is here:
def accuracy_check_for_batch(masks, predictions, batch_size):
total_acc = 0
for index in range(batch_size):
total_acc += accuracy_check(masks[index], predictions[index])
return total_acc/batch_size
and
def accuracy_check(mask, prediction):
ims = [mask, prediction]
np_ims = []
for item in ims:
if 'str' in str(type(item)):
item = np.array(Image.open(item))
elif 'PIL' in str(type(item)):
item = np.array(item)
elif 'torch' in str(type(item)):
item = item.numpy()
np_ims.append(item)
compare = np.equal(np_ims[0], np_ims[1])
accuracy = np.sum(compare)
return accuracy/len(np_ims[0].flatten())
I found the mistake.
model = UNet(num_classes = 2, depth=5, in_channels=5, merge_mode='concat').to(device)
should be
model = UNet(num_classes = 1, depth=5, in_channels=5, merge_mode='concat').to(device)
because I am using BCELosswithLogits.

How to feed previous time-stamp prediction as additional input to the next time-stamp?

This question might have been asked, but I got confused.
I am trying to apply one of RNN types, e.g. LSTM for time-series forecasting. I have inputs, y (stock returns). For each timestamp, I'd like to get the predictions. Q1 - Am I correct choosing seq2seq approach?
I also want to use predictions from previous timestamp (initializing initial values with some constant) as additional (still using my existing inputs) input in the form of squared residuals, i.e. using
eps_{t-1} = (y_{t-1} - y^_{t-1})^2 as additional input at t (as well as previous inputs).
So, how can I do this in tensorflow or in pytorch?
I tried to depict what I want on the attached graph. The graph
p.s. Sorry, it the question is poorly formulated
Let say your input if of dimension (32,10,1) with batch_size 32, time steps of length 10 and dimension of 1. Same for your target (stock return). This code make use of the tf.scan function, which is usefull when implementing custom recurrent networks (it will iterate over the timesteps). It remains to use the residual of t-1 in t somewhere, as you would like to.
ps: it is a very basic implementation of lstm from scratch, without any bias or output activation.
import tensorflow as tf
import numpy as np
tf.reset_default_graph()
BS = 32
TS = 10
inputs_dim = 1
target_dim = 1
inputs = tf.placeholder(shape=[BS, TS, inputs_dim], dtype=tf.float32)
stock_returns = tf.placeholder(shape=[BS, TS, target_dim], dtype=tf.float32)
state_size = 16
# initial hidden state
init_state = tf.placeholder(shape=[2, BS, state_size],
dtype=tf.float32, name='initial_state')
# initializer
xav_init = tf.contrib.layers.xavier_initializer
# params
W = tf.get_variable('W', shape=[4, state_size, state_size],
initializer=xav_init())
U = tf.get_variable('U', shape=[4, inputs_dim, state_size],
initializer=xav_init())
W_out = tf.get_variable('W_out', shape=[state_size, target_dim],
initializer=xav_init())
#the function to feed tf.scan with
def step(prev, inputs_):
#unpack all inputs and previous outputs
st_1, ct_1 = prev[0][0], prev[0][1]
x = inputs_[0]
target = inputs_[1]
#get previous squared residual
eps = prev[1]
"""
here do whatever you want with eps_t-1
like x += eps if x if of the same dimension
or include it somewhere in your graph
"""
# lstm gates (add bias if needed)
#
# input gate
i = tf.sigmoid(tf.matmul(x,U[0]) + tf.matmul(st_1,W[0]))
# forget gate
f = tf.sigmoid(tf.matmul(x,U[1]) + tf.matmul(st_1,W[1]))
# output gate
o = tf.sigmoid(tf.matmul(x,U[2]) + tf.matmul(st_1,W[2]))
# gate weights
g = tf.tanh(tf.matmul(x,U[3]) + tf.matmul(st_1,W[3]))
ct = ct_1*f + g*i
st = tf.tanh(ct)*o
"""
make prediction, compute residual in t
and pass it to t+1
Normaly, we would compute prediction outside the scan function,
but as we need it here, we could just keep it and return it back
as an output of the scan function
"""
prediction_t = tf.matmul(st, W_out) # + bias
eps = (target - prediction_t)**2
return [tf.stack((st, ct), axis=0), eps, prediction_t]
states, eps, preds = tf.scan(step, [tf.transpose(inputs, [1,0,2]),
tf.transpose(stock_returns, [1,0,2])], initializer=[init_state,
tf.zeros((32,1), dtype=tf.float32),
tf.zeros((32,1),dtype=tf.float32)])
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
out = sess.run(preds, feed_dict=
{inputs:np.random.rand(BS,TS,inputs_dim),
stock_returns:np.random.rand(BS,TS,target_dim),
init_state:np.zeros((2,BS,state_size))})
out = tf.transpose(out,[1,0,2])
print(out)
And the output :
Tensor("transpose_2:0", shape=(32, 10, 1), dtype=float32)
Base code from here

Tensorflow dynamic_rnn TypeError: 'Tensor' object is not iterable

I'm trying to get a basic LSTM working in TensorFlow. I'm receiving the following error:
TypeError: 'Tensor' object is not iterable.
The offending line is:
rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, x, sequence_length=seqlen,
initial_state=init_state,)`
I'm using version 1.0.1 on windows 7. My inputs and label have the following shapes
x_shape = (50, 40, 18), y_shape = (50, 40)
Where:
batch size = 50
sequence length = 40
input vector length at each step = 18
I'm building my graph as follows
def build_graph(learn_rate, seq_len, state_size=32, batch_size=5):
# use a fixed sequence length
seqlen = tf.constant(seq_len, shape=[batch_size],dtype=tf.int32)
# Placeholders
x = tf.placeholder(tf.float32, [batch_size, None, 18])
y = tf.placeholder(tf.float32, [batch_size, None])
keep_prob = tf.constant(1.0)
# RNN
cell = tf.contrib.rnn.LSTMCell(state_size)
init_state = tf.get_variable('init_state', [1, state_size],
initializer=tf.constant_initializer(0.0))
init_state = tf.tile(init_state, [batch_size, 1])
rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, x, sequence_length=seqlen,
initial_state=init_state,)
# Add dropout, as the model otherwise quickly overfits
rnn_outputs = tf.nn.dropout(rnn_outputs, keep_prob)
# Prediction layer
with tf.variable_scope('prediction'):
W = tf.get_variable('W', [state_size, num_classes])
b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))
preds = tf.tanh(tf.matmul(rnn_outputs, W) + b)
# MSE
loss = tf.square(tf.subtract(y, preds))
# loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y))
train_step = tf.train.AdamOptimizer(learn_rate).minimize(loss)
Can anyone tell me what I am missing?
Sequence length should be iterable e.g. a list or tensor, not a scalar. In your case specifically, you need to replace sequence length = 40 with a list of the lengths of each input. For instance, if your first sequence has 10 steps, the second 13 and the third 18, you would pass in [10, 13, 18]. This lets TensorFlow's dynamic RNN know how many steps to unroll for (I believe it uses a while loop internally).

Stacked RNN model setup in TensorFlow

I'm kind of lost in building up a stacked LSTM model for text classification in TensorFlow.
My input data was something like:
x_train = [[1.,1.,1.],[2.,2.,2.],[3.,3.,3.],...,[0.,0.,0.],[0.,0.,0.],
...... #I trained the network in batch with batch size set to 32.
]
y_train = [[1.,0.],[1.,0.],[0.,1.],...,[1.,0.],[0.,1.]]
# binary classification
The skeleton of my code looks like:
self._input = tf.placeholder(tf.float32, [self.batch_size, self.max_seq_length, self.vocab_dim], name='input')
self._target = tf.placeholder(tf.float32, [self.batch_size, 2], name='target')
lstm_cell = rnn_cell.BasicLSTMCell(self.vocab_dim, forget_bias=1.)
lstm_cell = rnn_cell.DropoutWrapper(lstm_cell, output_keep_prob=self.dropout_ratio)
self.cells = rnn_cell.MultiRNNCell([lstm_cell] * self.num_layers)
self._initial_state = self.cells.zero_state(self.batch_size, tf.float32)
inputs = tf.nn.dropout(self._input, self.dropout_ratio)
inputs = [tf.reshape(input_, (self.batch_size, self.vocab_dim)) for input_ in
tf.split(1, self.max_seq_length, inputs)]
outputs, states = rnn.rnn(self.cells, inputs, initial_state=self._initial_state)
# We only care about the output of the last RNN cell...
y_pred = tf.nn.xw_plus_b(outputs[-1], tf.get_variable("softmax_w", [self.vocab_dim, 2]), tf.get_variable("softmax_b", [2]))
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_pred, self._target))
correct_pred = tf.equal(tf.argmax(y_pred, 1), tf.argmax(self._target, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)
init = tf.initialize_all_variables()
with tf.Session() as sess:
initializer = tf.random_uniform_initializer(-0.04, 0.04)
with tf.variable_scope("model", reuse=True, initializer=initializer):
sess.run(init)
# generate batches here (omitted for clarity)
print sess.run([train_op, loss, accuracy], feed_dict={self._input: batch_x, self._target: batch_y})
The problem is that no matter how large the dataset is, the loss and accuracy has no sign of improvement (looks completely stochastic). Am I doing anything wrong?
Update:
# First, load Word2Vec model in Gensim.
model = Doc2Vec.load(word2vec_path)
# Second, build the dictionary.
gensim_dict = Dictionary()
gensim_dict.doc2bow(model.vocab.keys(), allow_update=True)
w2indx = {v: k + 1 for k, v in gensim_dict.items()}
w2vec = {word: model[word] for word in w2indx.keys()}
# Third, read data from a text file.
for fname in fnames:
i = 0
with codecs.open(fname, 'r', encoding='utf8') as fr:
for line in fr:
tmp = []
for t in line.split():
tmp.append(t)
X_train.append(tmp)
i += 1
if i is samples_count:
break
# Fourth, convert words into vectors, and pad each sentence with ZERO arrays to a fixed length.
result = np.zeros((len(data), self.max_seq_length, self.vocab_dim), dtype=np.float32)
for rowNo in xrange(len(data)):
rowLen = len(data[rowNo])
for colNo in xrange(rowLen):
word = data[rowNo][colNo]
if word in w2vec:
result[rowNo][colNo] = w2vec[word]
else:
result[rowNo][colNo] = [0] * self.vocab_dim
for colPadding in xrange(rowLen, self.max_seq_length):
result[rowNo][colPadding] = [0] * self.vocab_dim
return result
# Fifth, generate batches and feed them to the model.
... Trivias ...
Here are few reasons it may not be training and suggestions to try:
You are not allowing to update word vectors, space of pre-learned vectors may be not working properly.
RNNs really need gradient clipping when trained. You can try adding something like this.
Unit scale initialization seems to work better, as it accounts for the size of the layer and allows gradient to be scaled properly as it goes deeper.
You should try removing dropout and second layer - just to check if your data passing is correct and your loss is going down at all.
I also can recommend trying this example with your data: https://github.com/tensorflow/skflow/blob/master/examples/text_classification.py
It trains word vectors from scratch, already has gradient clipping and uses GRUCells which usually are easier to train. You can also see nice visualizations for loss and other things by running tensorboard logdir=/tmp/tf_examples/word_rnn.

Resources