How to feed the previous time step's prediction as additional input at the next time step? - time-series

This question might have been asked before, but I got confused.
I am trying to apply one of the RNN types, e.g. an LSTM, to time-series forecasting. My inputs are y (stock returns), and for each time step I'd like to get a prediction. Q1: Am I correct in choosing a seq2seq approach?
I also want to use the prediction from the previous time step (with the initial value set to some constant) as an additional input at time t, alongside my existing inputs, in the form of the squared residual, i.e. using
eps_{t-1} = (y_{t-1} - y^_{t-1})^2 as an additional input at t (as well as the previous inputs).
So how can I do this in TensorFlow or in PyTorch?
I tried to depict what I want in the attached graph.
P.S. Sorry if the question is poorly formulated.

Let's say your input is of dimension (32, 10, 1): batch_size 32, 10 time steps, and dimension 1. Same for your target (stock returns). This code makes use of the tf.scan function, which is useful when implementing custom recurrent networks (it iterates over the time steps). It remains for you to use the residual from t-1 at t somewhere, as you intended.
ps: it is a very basic implementation of an LSTM from scratch, without any biases or output activation.
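(If tf.scan is new to you, here is a tiny, self-contained sketch of its calling convention - a running sum over the leading "time" axis; the variable names are only illustrative.)
import tensorflow as tf
elems = tf.constant([[1.], [2.], [3.]])            # 3 "timesteps", 1 feature
running_sum = tf.scan(lambda acc, x_t: acc + x_t,  # acc is whatever the step returned at t-1
                      elems,
                      initializer=tf.zeros([1]))   # value of acc before the first step
with tf.Session() as sess:
    print(sess.run(running_sum))                   # [[1.] [3.] [6.]]
Now the full model: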
import tensorflow as tf
import numpy as np

tf.reset_default_graph()

BS = 32
TS = 10
inputs_dim = 1
target_dim = 1

inputs = tf.placeholder(shape=[BS, TS, inputs_dim], dtype=tf.float32)
stock_returns = tf.placeholder(shape=[BS, TS, target_dim], dtype=tf.float32)

state_size = 16

# initial hidden state
init_state = tf.placeholder(shape=[2, BS, state_size],
                            dtype=tf.float32, name='initial_state')
# initializer
xav_init = tf.contrib.layers.xavier_initializer
# params
W = tf.get_variable('W', shape=[4, state_size, state_size],
                    initializer=xav_init())
U = tf.get_variable('U', shape=[4, inputs_dim, state_size],
                    initializer=xav_init())
W_out = tf.get_variable('W_out', shape=[state_size, target_dim],
                        initializer=xav_init())

# the function to feed tf.scan with
def step(prev, inputs_):
    # unpack all inputs and previous outputs
    st_1, ct_1 = prev[0][0], prev[0][1]
    x = inputs_[0]
    target = inputs_[1]
    # get previous squared residual
    eps = prev[1]
    """
    here do whatever you want with eps_{t-1},
    e.g. x += eps if x has the same dimension,
    or include it somewhere else in your graph
    """
    # lstm gates (add bias if needed)
    #
    # input gate
    i = tf.sigmoid(tf.matmul(x, U[0]) + tf.matmul(st_1, W[0]))
    # forget gate
    f = tf.sigmoid(tf.matmul(x, U[1]) + tf.matmul(st_1, W[1]))
    # output gate
    o = tf.sigmoid(tf.matmul(x, U[2]) + tf.matmul(st_1, W[2]))
    # gate weights
    g = tf.tanh(tf.matmul(x, U[3]) + tf.matmul(st_1, W[3]))

    ct = ct_1 * f + g * i
    st = tf.tanh(ct) * o
    """
    make the prediction, compute the residual at t
    and pass it to t+1.
    Normally we would compute the prediction outside the scan function,
    but as we need it here, we can just keep it and return it
    as an output of the scan function
    """
    prediction_t = tf.matmul(st, W_out)  # + bias
    eps = (target - prediction_t) ** 2
    return [tf.stack((st, ct), axis=0), eps, prediction_t]

states, eps, preds = tf.scan(step, [tf.transpose(inputs, [1, 0, 2]),
                                    tf.transpose(stock_returns, [1, 0, 2])],
                             initializer=[init_state,
                                          tf.zeros((BS, 1), dtype=tf.float32),
                                          tf.zeros((BS, 1), dtype=tf.float32)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(preds, feed_dict={inputs: np.random.rand(BS, TS, inputs_dim),
                                     stock_returns: np.random.rand(BS, TS, target_dim),
                                     init_state: np.zeros((2, BS, state_size))})
    out = tf.transpose(out, [1, 0, 2])
    print(out)
And the output:
Tensor("transpose_2:0", shape=(32, 10, 1), dtype=float32)
Base code from here
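Since the question also asks about PyTorch, here is a rough, untested sketch of the same idea with torch.nn.LSTMCell, concatenating the squared residual from t-1 onto the input at t (the class and variable names below are mine, not from the question):
import torch
import torch.nn as nn

class ResidualFeedbackLSTM(nn.Module):
    def __init__(self, inputs_dim=1, state_size=16, target_dim=1):
        super().__init__()
        # the cell sees the original input plus the previous squared residual
        self.cell = nn.LSTMCell(inputs_dim + 1, state_size)
        self.out = nn.Linear(state_size, target_dim)

    def forward(self, x, y):
        # x, y: (batch, time, 1)
        bs, ts, _ = x.shape
        h = x.new_zeros(bs, self.cell.hidden_size)
        c = x.new_zeros(bs, self.cell.hidden_size)
        eps = x.new_zeros(bs, 1)            # eps_0 initialised to a constant (0 here)
        preds = []
        for t in range(ts):
            inp = torch.cat([x[:, t], eps], dim=1)
            h, c = self.cell(inp, (h, c))
            pred = self.out(h)
            eps = (y[:, t] - pred) ** 2     # squared residual passed on to t+1
            preds.append(pred)
        return torch.stack(preds, dim=1)

model = ResidualFeedbackLSTM()
x = torch.randn(32, 10, 1)
y = torch.randn(32, 10, 1)
print(model(x, y).shape)  # torch.Size([32, 10, 1])
As with the TensorFlow version above, the residual uses the true y_{t-1}, so at pure generation time you would have to decide what to feed there instead.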

Related

MXNet - application of GANs to MNIST

So this question is about GANs.
I am trying to do a trivial example for my own proof of concept; namely, generate images of handwritten digits (MNIST). While most would approach this via deep convolutional GANs (DCGANs), I am just trying to achieve this via a 1D array (i.e. instead of a 28x28 grid of grayscale pixel values, a 28*28-element 1D array).
This git repo features a "vanilla" GAN which treats the MNIST dataset as a 1D array of 784 values. Its outputs look pretty acceptable, so I wanted to do something similar.
Import statements
from __future__ import print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
import mxnet as mx
from mxnet import nd, gluon, autograd
from mxnet.gluon import nn, utils
import numpy as np
import os
from math import floor
from random import random
import time
from datetime import datetime
import logging
ctx = mx.gpu()
np.random.seed(3)
Hyper parameters
batch_size = 100
epochs = 100
generator_learning_rate = 0.001
discriminator_learning_rate = 0.001
beta1 = 0.5
latent_z_size = 100
Load data
mnist = mx.test_utils.get_mnist()
# convert imgs to arrays
flattened_training_data = mnist["test_data"].reshape(10000, 28*28)
define models
G = nn.Sequential()
with G.name_scope():
    G.add(nn.Dense(300, activation="relu"))
    G.add(nn.Dense(28 * 28, activation="tanh"))

D = nn.Sequential()
with D.name_scope():
    D.add(nn.Dense(128, activation="relu"))
    D.add(nn.Dense(64, activation="relu"))
    D.add(nn.Dense(32, activation="relu"))
    D.add(nn.Dense(2, activation="tanh"))

loss = gluon.loss.SoftmaxCrossEntropyLoss()
init stuff
G.initialize(mx.init.Normal(0.02), ctx=ctx)
D.initialize(mx.init.Normal(0.02), ctx=ctx)
trainer_G = gluon.Trainer(G.collect_params(), 'adam', {"learning_rate": generator_learning_rate, "beta1": beta1})
trainer_D = gluon.Trainer(D.collect_params(), 'adam', {"learning_rate": discriminator_learning_rate, "beta1": beta1})
metric = mx.metric.Accuracy()
dynamic plot (for Jupyter notebook)
import matplotlib.pyplot as plt
import time
def dynamic_line_plt(ax, y_data, colors=['r', 'b', 'g'], labels=['Line1', 'Line2', 'Line3']):
    x_data = []
    y_max = 0
    y_min = 0
    x_min = 0
    x_max = 0
    for y in y_data:
        x_data.append(list(range(len(y))))
        if max(y) > y_max:
            y_max = max(y)
        if min(y) < y_min:
            y_min = min(y)
        if len(y) > x_max:
            x_max = len(y)

    ax.set_ylim(y_min, y_max)
    ax.set_xlim(x_min, x_max)

    if ax.lines:
        for i, line in enumerate(ax.lines):
            line.set_xdata(x_data[i])
            line.set_ydata(y_data[i])
    else:
        for i in range(len(y_data)):
            l = ax.plot(x_data[i], y_data[i], colors[i], label=labels[i])
        ax.legend()

    fig.canvas.draw()
train
stamp = datetime.now().strftime('%Y_%m_%d-%H_%M')
logging.basicConfig(level=logging.DEBUG)
# arrays to store data for plotting
loss_D = nd.array([0], ctx=ctx)
loss_G = nd.array([0], ctx=ctx)
acc_d = nd.array([0], ctx=ctx)
labels = ['Discriminator Loss', 'Generator Loss', 'Discriminator Acc.']
%matplotlib notebook
fig, ax = plt.subplots(1, 1)
ax.set_xlabel('Time')
ax.set_ylabel('Loss')
dynamic_line_plt(ax, [loss_D.asnumpy(), loss_G.asnumpy(), acc_d.asnumpy()], labels=labels)
for epoch in range(epochs):
    tic = time.time()
    data_iter.reset()
    for i, batch in enumerate(data_iter):
        ####################################
        # Update Discriminator: maximize log(D(x)) + log(1 - D(G(z)))
        ####################################
        # extract batch of real data
        data = batch.data[0].as_in_context(ctx)
        # add noise
        # Produce our noisy input to the generator
        latent_z = mx.nd.random_normal(0, 1, shape=(batch_size, latent_z_size), ctx=ctx)

        # soft and noisy labels
        # real_label = mx.nd.ones((batch_size, ), ctx=ctx) * nd.random_uniform(.7, 1.2, shape=(1)).asscalar()
        # fake_label = mx.nd.ones((batch_size, ), ctx=ctx) * nd.random_uniform(0, .3, shape=(1)).asscalar()
        # real_label = nd.random_uniform(.7, 1.2, shape=(batch_size), ctx=ctx)
        # fake_label = nd.random_uniform(0, .3, shape=(batch_size), ctx=ctx)

        real_label = mx.nd.ones((batch_size, ), ctx=ctx)
        fake_label = mx.nd.zeros((batch_size, ), ctx=ctx)

        with autograd.record():
            # train with real data
            real_output = D(data)
            errD_real = loss(real_output, real_label)

            # train with fake data
            fake = G(latent_z)
            fake_output = D(fake.detach())
            errD_fake = loss(fake_output, fake_label)

            errD = errD_real + errD_fake
            errD.backward()

        trainer_D.step(batch_size)
        metric.update([real_label, ], [real_output, ])
        metric.update([fake_label, ], [fake_output, ])

        ####################################
        # Update Generator: maximize log(D(G(z)))
        ####################################
        with autograd.record():
            output = D(fake)
            errG = loss(output, real_label)
            errG.backward()

        trainer_G.step(batch_size)

        ####
        # Plot Loss
        ####
        # append new data to arrays
        loss_D = nd.concat(loss_D, nd.mean(errD), dim=0)
        loss_G = nd.concat(loss_G, nd.mean(errG), dim=0)

        name, acc = metric.get()
        acc_d = nd.concat(acc_d, nd.array([acc], ctx=ctx), dim=0)

        # plot arrays
        dynamic_line_plt(ax, [loss_D.asnumpy(), loss_G.asnumpy(), acc_d.asnumpy()], labels=labels)

    name, acc = metric.get()
    metric.reset()
    logging.info('Binary training acc at epoch %d: %s=%f' % (epoch, name, acc))
    logging.info('time: %f' % (time.time() - tic))
output
img = G(mx.nd.random_normal(0,1,shape=(100, latent_z_size), ctx=ctx))[0].reshape((28, 28))
plt.imshow(img.asnumpy(),cmap='gray')
plt.show()
Now, this doesn't get nearly as good as the repo's example above, although it is fairly similar.
Thus I was wondering if you could take a look and figure out why:
the colors are inverted
the results are subpar
I have been fiddling around with this, trying a lot of different things to improve the results (listed below), but for the MNIST dataset this really shouldn't be needed.
Things I have tried (and I have also tried a host of combinations):
increasing the generator network
increasing the discriminator network
using soft labeling
using noisy labeling
batch norm after every layer in the generator
batch norm of the data
normalizing all values between -1 and 1
leaky relus in the generator
drop out layers in the generator
increased learning rate of discriminator compared to generator
decreased learning rate of the discriminator compared to the generator
Please let me know if you have any ideas.
1) If you look into the original dataset:
training_set = mnist["train_data"].reshape(60000, 28, 28)
plt.imshow(training_set[10,:,:], cmap='gray')
you will notice that the digits are white on a black background. So, technically speaking, your results are not inverted - they match the pattern of the original images you used as real data.
If you want to invert the colors for visualization purposes, you can easily do that by switching to the reversed palette, which you get by adding '_r' (this works for all color palettes):
plt.imshow(img.asnumpy(), cmap='gray_r')
You can also play with the range of colors by changing the vmin and vmax parameters. They control how big the difference between colors should be. By default it is calculated automatically based on the provided data.
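For example (assuming the generator output lies roughly in [0, 1]):
plt.imshow(img.asnumpy(), cmap='gray_r', vmin=0, vmax=1)
plt.show()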
2) "Why the results are sub par" - I think this is exactly the reason why the community started to use dcGANs. To me the results in the git repo you provided are quite noisy. Surely, they are different from what you receive, and you can achieve the same quality just by changing your activation functions from tanh to sigmoid as in the example on github:
G = nn.Sequential()
with G.name_scope():
    G.add(nn.Dense(300, activation="relu"))
    G.add(nn.Dense(28 * 28, activation="sigmoid"))

D = nn.Sequential()
with D.name_scope():
    D.add(nn.Dense(128, activation="relu"))
    D.add(nn.Dense(64, activation="relu"))
    D.add(nn.Dense(32, activation="relu"))
    D.add(nn.Dense(2, activation="sigmoid"))
Sigmoid never goes below zero, and it works better in this scenario. Here is a sample picture I get if I train the updated model for 30 epochs (the rest of the hyperparameters are the same).
If you decide to explore DCGANs to get even better results, take a look here - https://mxnet.incubator.apache.org/tutorials/unsupervised_learning/gan.html It is a well-explained tutorial on how to build a DCGAN with MXNet and Gluon. By using a DCGAN you will get much better results than that.

TensorFlow KNN: How can we assign the K parameter for defining the number of neighbors in KNN?

I have started working on a machine learning project using the K-Nearest-Neighbors method with Python's TensorFlow library. I have no experience working with TensorFlow tools, so I found some code on GitHub and modified it for my data.
My dataset is like this:
2,2,2,2,0,0,3
2,2,2,2,0,1,0
2,2,2,4,2,2,1
...
2,2,2,4,2,0,0
And this is the code which actually works fine:
import tensorflow as tf
import numpy as np

# Whole dataset => 1428 samples
dataset = 'car-eval-data-1.csv'
# samples for train, remaining for test
samples = 1300
reader = np.loadtxt(open(dataset, "rb"), delimiter=",", skiprows=1, dtype=np.int32)
train_x, train_y = reader[:samples, :5], reader[:samples, 6]
test_x, test_y = reader[samples:, :5], reader[samples:, 6]

# Placeholder: you can assign values to it later; it's a kind of variable
# v = ("variable type", [None, 4]) -- you can have multidimensional values here
training_values = tf.placeholder("float", [None, len(train_x[0])])
test_values = tf.placeholder("float", [len(train_x[0])])

# distance (note: because of tf.square this is actually the squared Euclidean
# distance, not the Manhattan distance as the original comment said)
distance = tf.abs(tf.reduce_sum(tf.square(tf.subtract(training_values, test_values)), reduction_indices=1))
prediction = tf.arg_min(distance, 0)

init = tf.global_variables_initializer()

accuracy = 0.0
with tf.Session() as sess:
    sess.run(init)
    # Looping through the test set to compare against the training set
    for i in range(len(test_x)):
        # TensorFlow op that returns the index of the closest training sample to the test sample.
        index_in_trainingset = sess.run(prediction, feed_dict={training_values: train_x, test_values: test_x[i]})
        print("Test %d, and the prediction is %s, the real value is %s" % (i, train_y[index_in_trainingset], test_y[i]))
        if train_y[index_in_trainingset] == test_y[i]:
            # if the prediction is right, the accuracy increases.
            accuracy += 1. / len(test_x)
print('Accuracy -> ', accuracy * 100, ' %')
The only thing I do not understand is that, if this is the KNN method, there has to be some K parameter which defines the number of neighbors used for predicting the label of each test sample.
How can we assign the K parameter to tune the number of nearest neighbors for the code?
Is there any way to modify this code to make use of the K parameter?
You're right that the example above does not have the provision to select the K nearest neighbours. In the code below, I have added the ability to set such a parameter (knn_size), along with other corrections.
import tensorflow as tf
import numpy as np

# Whole dataset => 1428 samples
dataset = 'PATH_TO_DATASET_CSV'
knn_size = 1
# samples for train, remaining for test
samples = 1300
reader = np.loadtxt(open(dataset, "rb"), delimiter=",", skiprows=1, dtype=np.int32)
train_x, train_y = reader[:samples, :6], reader[:samples, 6]
test_x, test_y = reader[samples:, :6], reader[samples:, 6]

# Placeholder: you can assign values to it later; it's a kind of variable
# v = ("variable type", [None, 4]) -- you can have multidimensional values here
training_values = tf.placeholder("float", [None, len(train_x[0])])
test_values = tf.placeholder("float", [len(train_x[0])])

# distance (as above, effectively the squared Euclidean distance)
distance = tf.abs(tf.reduce_sum(tf.square(tf.subtract(training_values, test_values)), reduction_indices=1))

# Here, we multiply the distance by -1 to reverse the magnitude of distances, i.e. the largest distance becomes the smallest distance
# tf.nn.top_k returns the top k values and their indices; here k is controlled by the parameter knn_size
k_nearest_neighbour_values, k_nearest_neighbour_indices = tf.nn.top_k(tf.scalar_mul(-1, distance), k=knn_size)

# Based on the indices we obtain from the previous step, we locate the class labels of the k closest matches in the training data
best_training_labels = tf.gather(train_y, k_nearest_neighbour_indices)

if knn_size == 1:
    prediction = tf.squeeze(best_training_labels)
else:
    # Now we make our prediction based on the class label that appears most frequently
    # tf.unique_with_counts() gives us all unique values that appear in a 1-D tensor along with their indices and counts
    values, indices, counts = tf.unique_with_counts(best_training_labels)
    # This gives us the index of the class label that has been repeated the most
    max_count_index = tf.argmax(counts, 0)
    # Retrieve the required class label
    prediction = tf.gather(values, max_count_index)

init = tf.global_variables_initializer()

accuracy = 0.0
with tf.Session() as sess:
    sess.run(init)
    # Looping through the test set to compare against the training set
    for i in range(len(test_x)):
        # TensorFlow op that returns the majority label among the k closest training samples.
        prediction_value = sess.run([prediction], feed_dict={training_values: train_x, test_values: test_x[i]})
        print("Test %d, and the prediction is %s, the real value is %s" % (i, prediction_value[0], test_y[i]))
        if prediction_value[0] == test_y[i]:
            # if the prediction is right, the accuracy increases.
            accuracy += 1. / len(test_x)
print('Accuracy -> ', accuracy * 100, ' %')
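If it helps to see the same majority-vote logic without the TensorFlow graph, here is a small NumPy sketch of what the code above computes per test point (the function name is made up for illustration):
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, test_point, k=3):
    # squared Euclidean distance to every training sample
    dists = np.sum((train_x - test_point) ** 2, axis=1)
    # indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # majority vote over their class labels
    return Counter(train_y[nearest]).most_common(1)[0][0]

# e.g. knn_predict(train_x, train_y, test_x[i], k=knn_size)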

How can I change the max sequence length in a Tensorflow RNN Model?

I am currently trying to adapt my TensorFlow classifier, which is able to tag a sequence of words as positive or negative, to handle much longer sequences without retraining. My model is an RNN with a max sequence length of 210. One input is one word (300-dim); I vectorized the words with Google's word2vec, so I am able to feed a sequence of at most 210 words. Now my question is: how can I change the max sequence length, for example to 3000, for classifying movie reviews?
My working model with a fixed max sequence length of 210 (tf version: 1.1.0):
n_chunks = 210
chunk_size = 300

x = tf.placeholder("float", [None, n_chunks, chunk_size])
y = tf.placeholder("float", None)
seq_length = tf.placeholder("int64", None)

with tf.variable_scope("rnn1"):
    lstm_cell = tf.contrib.rnn.LSTMCell(rnn_size,
                                        state_is_tuple=True)
    lstm_cell = tf.contrib.rnn.DropoutWrapper(lstm_cell,
                                              input_keep_prob=0.8)
    outputs, _ = tf.nn.dynamic_rnn(lstm_cell, x, dtype=tf.float32,
                                   sequence_length=self.seq_length)

fc = tf.contrib.layers.fully_connected(outputs, 1000,
                                       activation_fn=tf.nn.relu)
output = tf.contrib.layers.flatten(fc)
#*1
logits = tf.contrib.layers.fully_connected(output, self.n_classes,
                                           activation_fn=None)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=logits, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)
...
# train
# train_x padded to fit (batch_size * n_chunks * chunk_size)
sess.run([optimizer, cost], feed_dict={x: train_x, y: train_y,
                                       seq_length: seq_length})
# predict:
...
pred = tf.nn.softmax(logits)
pred = sess.run(pred, feed_dict={x: word_vecs, seq_length: sq_l})
Modifications I have already tried:
1 Replacing n_chunks with None and simply feeding the data in:
x = tf.placeholder(tf.float32, [None,None,300])
#model fails to build
#ValueError: The last dimension of the inputs to `Dense` should be defined.
#Found `None`.
# at *1
...
# all entries in word_vecs still have the same length, for example
# 3000 (batch_size * 3000 (!= n_chunks) * 300)
pred = tf.nn.softmax(logits)
pred = sess.run(pred,feed_dict={x:word_vecs, seq_length:sq_l})
2 Changing x and then restoring the old model:
x = tf.placeholder(tf.float32, [None, n_chunks*10, chunk_size])
...
saver = tf.train.Saver(tf.all_variables(), reshape=True)
saver.restore(sess,"...")
#fails as well:
#InvalidArgumentError (see above for traceback): Input to reshape is a
#tensor with 420000 values, but the requested shape has 840000
#[[Node: save/Reshape_5 = Reshape[T=DT_FLOAT, Tshape=DT_INT32,
#_device="/job:localhost/replica:0/task:0/cpu:0"](save/RestoreV2_5,
#save/Reshape_5/shape)]]
# run prediction
If it is possible, could you please provide me with a working example, or explain why it isn't?
I am just wondering: why don't you simply assign n_chunks a value of 3000?
In your first attempt, you cannot use two Nones, since TF cannot infer how many dimensions to expect for each of them. The first dimension is set to None because it is contingent on the batch size. In your second attempt, you just change one place, and the other places where n_chunks is used may conflict with the x placeholder.
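For completeness, a hedged sketch of how a fully dynamic time dimension can work if you change the classifier head so that nothing depends on the (unknown) sequence length. Note this is not a drop-in fix for the "without retraining" requirement, since the saved fully-connected weights would not be restorable as-is; rnn_size and n_classes below stand in for your values:
import tensorflow as tf

chunk_size = 300
rnn_size = 128        # stand-in for your rnn_size
n_classes = 2         # stand-in for your n_classes

x = tf.placeholder(tf.float32, [None, None, chunk_size])   # batch and time both dynamic
seq_length = tf.placeholder(tf.int64, [None])

with tf.variable_scope("rnn1"):
    lstm_cell = tf.contrib.rnn.LSTMCell(rnn_size, state_is_tuple=True)
    outputs, state = tf.nn.dynamic_rnn(lstm_cell, x, dtype=tf.float32,
                                       sequence_length=seq_length)

# classify from the final hidden state instead of flattening all timesteps,
# so no layer's shape depends on the sequence length
logits = tf.contrib.layers.fully_connected(state.h, n_classes, activation_fn=None)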

Generate text with a trained character level LSTM model

I trained a model with the purpose of generating sentences, as follows:
I feed as a training example two sequences: x, which is a sequence of characters, and y, which is the same sequence shifted by one. The model is based on an LSTM and is created with TensorFlow.
My question is: since the model takes input sequences of a certain size (50 in my case), how can I make predictions giving it only a single character as a seed? I've seen in some examples that after training they generate sentences by simply feeding a single character.
Here is my code:
with tf.name_scope('input'):
    x = tf.placeholder(tf.float32, [batch_size, truncated_backprop], name='x')
    y = tf.placeholder(tf.int32, [batch_size, truncated_backprop], name='y')

with tf.name_scope('weights'):
    W = tf.Variable(np.random.rand(n_hidden, num_classes), dtype=tf.float32)
    b = tf.Variable(np.random.rand(1, num_classes), dtype=tf.float32)

inputs_series = tf.split(x, truncated_backprop, 1)
labels_series = tf.unstack(y, axis=1)

with tf.name_scope('LSTM'):
    cell = tf.contrib.rnn.BasicLSTMCell(n_hidden, state_is_tuple=True)
    cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=dropout)
    cell = tf.contrib.rnn.MultiRNNCell([cell] * n_layers)

states_series, current_state = tf.contrib.rnn.static_rnn(cell, inputs_series,
                                                         dtype=tf.float32)

logits_series = [tf.matmul(state, W) + b for state in states_series]
prediction_series = [tf.nn.softmax(logits) for logits in logits_series]

losses = [tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels)
          for logits, labels in zip(logits_series, labels_series)]
total_loss = tf.reduce_mean(losses)

train_step = tf.train.AdamOptimizer(learning_rate).minimize(total_loss)
I suggest you use dynamic_rnn instead of static_rnn, which creates the graph during execution time and allows you to have inputs of any length. Your input placeholder would be
x = tf.placeholder(tf.float32, [batch_size, None, features], name='x')
Next, you'll need a way to input your own initial state into the network. You can do that by passing the initial_state parameter to dynamic_rnn, like:
initialstate = cell.zero_state(batch_size, tf.float32)
outputs, current_state = tf.nn.dynamic_rnn(cell,
                                           inputs,
                                           initial_state=initialstate)
With that, in order to generate text from a single character you can feed the graph 1 character at a time, passing in the previous character and state each time, like:
prompt = 's'        # beginning character, whatever
inp = one_hot(prompt)  # preprocessing, as you probably want to feed one-hot vectors
state = None
while True:
    if state is None:
        feed = {x: [[inp]]}
    else:
        feed = {x: [[inp]], initialstate: state}

    out, state = sess.run([outputs, current_state], feed_dict=feed)
    inp = process(out)  # extract the predicted character from out and one-hot it
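As a hedged illustration of that last line: if you apply a softmax projection to the last output (as in the question's logits_series / prediction_series) and fetch those probabilities instead of the raw RNN outputs, process only has to pick a character index and one-hot it. The function below is an assumption about shapes, not the poster's code:
import numpy as np

def process(probs, num_classes):
    # probs: softmax output for the last timestep, shape (1, num_classes)
    idx = np.random.choice(num_classes, p=probs[0])   # sampling gives more varied text than argmax
    one_hot_vec = np.zeros(num_classes, dtype=np.float32)
    one_hot_vec[idx] = 1.0
    return one_hot_vec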

Stacked RNN model setup in TensorFlow

I'm kind of lost in building up a stacked LSTM model for text classification in TensorFlow.
My input data is something like:
x_train = [[1.,1.,1.],[2.,2.,2.],[3.,3.,3.],...,[0.,0.,0.],[0.,0.,0.],
...... #I trained the network in batch with batch size set to 32.
]
y_train = [[1.,0.],[1.,0.],[0.,1.],...,[1.,0.],[0.,1.]]
# binary classification
The skeleton of my code looks like:
self._input = tf.placeholder(tf.float32, [self.batch_size, self.max_seq_length, self.vocab_dim], name='input')
self._target = tf.placeholder(tf.float32, [self.batch_size, 2], name='target')

lstm_cell = rnn_cell.BasicLSTMCell(self.vocab_dim, forget_bias=1.)
lstm_cell = rnn_cell.DropoutWrapper(lstm_cell, output_keep_prob=self.dropout_ratio)
self.cells = rnn_cell.MultiRNNCell([lstm_cell] * self.num_layers)
self._initial_state = self.cells.zero_state(self.batch_size, tf.float32)

inputs = tf.nn.dropout(self._input, self.dropout_ratio)
inputs = [tf.reshape(input_, (self.batch_size, self.vocab_dim)) for input_ in
          tf.split(1, self.max_seq_length, inputs)]

outputs, states = rnn.rnn(self.cells, inputs, initial_state=self._initial_state)

# We only care about the output of the last RNN cell...
y_pred = tf.nn.xw_plus_b(outputs[-1],
                         tf.get_variable("softmax_w", [self.vocab_dim, 2]),
                         tf.get_variable("softmax_b", [2]))

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_pred, self._target))
correct_pred = tf.equal(tf.argmax(y_pred, 1), tf.argmax(self._target, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)

init = tf.initialize_all_variables()

with tf.Session() as sess:
    initializer = tf.random_uniform_initializer(-0.04, 0.04)
    with tf.variable_scope("model", reuse=True, initializer=initializer):
        sess.run(init)
        # generate batches here (omitted for clarity)
        print sess.run([train_op, loss, accuracy], feed_dict={self._input: batch_x, self._target: batch_y})
The problem is that no matter how large the dataset is, the loss and accuracy show no sign of improvement (the behavior looks completely stochastic). Am I doing anything wrong?
Update:
# First, load the Word2Vec model in Gensim.
model = Doc2Vec.load(word2vec_path)

# Second, build the dictionary.
gensim_dict = Dictionary()
gensim_dict.doc2bow(model.vocab.keys(), allow_update=True)
w2indx = {v: k + 1 for k, v in gensim_dict.items()}
w2vec = {word: model[word] for word in w2indx.keys()}

# Third, read data from a text file.
for fname in fnames:
    i = 0
    with codecs.open(fname, 'r', encoding='utf8') as fr:
        for line in fr:
            tmp = []
            for t in line.split():
                tmp.append(t)
            X_train.append(tmp)
            i += 1
            if i == samples_count:
                break

# Fourth, convert words into vectors, and pad each sentence with ZERO arrays to a fixed length.
result = np.zeros((len(data), self.max_seq_length, self.vocab_dim), dtype=np.float32)
for rowNo in xrange(len(data)):
    rowLen = len(data[rowNo])
    for colNo in xrange(rowLen):
        word = data[rowNo][colNo]
        if word in w2vec:
            result[rowNo][colNo] = w2vec[word]
        else:
            result[rowNo][colNo] = [0] * self.vocab_dim
    for colPadding in xrange(rowLen, self.max_seq_length):
        result[rowNo][colPadding] = [0] * self.vocab_dim
return result
# Fifth, generate batches and feed them to the model.
... Trivias ...
Here are a few reasons why it may not be training, and suggestions to try:
You are not allowing the word vectors to be updated, so the space of pre-learned vectors may not be working properly.
RNNs really need gradient clipping when trained. You can try adding something like the sketch after this list.
Unit-scale initialization seems to work better, as it accounts for the size of the layer and allows the gradient to be scaled properly as it goes deeper.
You should try removing dropout and the second layer - just to check whether your data passing is correct and your loss is going down at all.
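A minimal sketch of gradient clipping with the TF 1.x optimizer API (the clip norm of 5.0 is just a common starting value, not something from your setup):
optimizer = tf.train.AdamOptimizer(self.lr)
grads, variables = zip(*optimizer.compute_gradients(loss))
grads, _ = tf.clip_by_global_norm(grads, 5.0)   # rescale gradients if their global norm exceeds 5.0
train_op = optimizer.apply_gradients(list(zip(grads, variables)))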
I can also recommend trying this example with your data: https://github.com/tensorflow/skflow/blob/master/examples/text_classification.py
It trains word vectors from scratch, already has gradient clipping, and uses GRUCells, which are usually easier to train. You can also see nice visualizations of the loss and other things by running tensorboard --logdir=/tmp/tf_examples/word_rnn.
