Neural Conditional Random Field - Forward Algorithm Recurrence - machine-learning

I'm currently working on implementing a neural CRF as a project for school and am looking around for repos to reference.
I encountered this one the other day and have been completely stumped by the implementation of the forward algorithm.
T = feats.shape[1]
batch_size = feats.shape[0]
# alpha_recursion,forward, alpha(zt)=p(zt,bar_x_1:t)
log_alpha = torch.Tensor(batch_size, 1, self.num_labels).fill_(-10000.).to(self.device)
# normal_alpha_0 : alpha[0]=Ot[0]*self.PIs
# self.start_label has all of the score. it is log,0 is p=1
log_alpha[:, 0, self.start_label_id] = 0
# feats: sentances -> word embedding -> lstm -> MLP -> feats
# feats is the probability of emission, feat.shape=(1,tag_size)
for t in range(1, T):
log_alpha = (log_sum_exp_batch(self.transitions + log_alpha, axis=-1) + feats[:, t]).unsqueeze(1)
# log_prob of all barX
log_prob_all_barX = log_sum_exp_batch(log_alpha)
return log_prob_all_barX
From my understanding, the forward algorithm and its log alpha should be the log of the following where s_m is the scoring function - transition score from previous to current tag + emission score of neural hidden state/feature:
it seems to me that the code should be something more akin to log_alpha = log_sum_exp(transition_score + feat_score + log_alpha) if the log is applied.

Related

How to feed previous time-stamp prediction as additional input to the next time-stamp?

This question might have been asked, but I got confused.
I am trying to apply one of RNN types, e.g. LSTM for time-series forecasting. I have inputs, y (stock returns). For each timestamp, I'd like to get the predictions. Q1 - Am I correct choosing seq2seq approach?
I also want to use predictions from previous timestamp (initializing initial values with some constant) as additional (still using my existing inputs) input in the form of squared residuals, i.e. using
eps_{t-1} = (y_{t-1} - y^_{t-1})^2 as additional input at t (as well as previous inputs).
So, how can I do this in tensorflow or in pytorch?
I tried to depict what I want on the attached graph. The graph
p.s. Sorry, it the question is poorly formulated
Let say your input if of dimension (32,10,1) with batch_size 32, time steps of length 10 and dimension of 1. Same for your target (stock return). This code make use of the tf.scan function, which is usefull when implementing custom recurrent networks (it will iterate over the timesteps). It remains to use the residual of t-1 in t somewhere, as you would like to.
ps: it is a very basic implementation of lstm from scratch, without any bias or output activation.
import tensorflow as tf
import numpy as np
tf.reset_default_graph()
BS = 32
TS = 10
inputs_dim = 1
target_dim = 1
inputs = tf.placeholder(shape=[BS, TS, inputs_dim], dtype=tf.float32)
stock_returns = tf.placeholder(shape=[BS, TS, target_dim], dtype=tf.float32)
state_size = 16
# initial hidden state
init_state = tf.placeholder(shape=[2, BS, state_size],
dtype=tf.float32, name='initial_state')
# initializer
xav_init = tf.contrib.layers.xavier_initializer
# params
W = tf.get_variable('W', shape=[4, state_size, state_size],
initializer=xav_init())
U = tf.get_variable('U', shape=[4, inputs_dim, state_size],
initializer=xav_init())
W_out = tf.get_variable('W_out', shape=[state_size, target_dim],
initializer=xav_init())
#the function to feed tf.scan with
def step(prev, inputs_):
#unpack all inputs and previous outputs
st_1, ct_1 = prev[0][0], prev[0][1]
x = inputs_[0]
target = inputs_[1]
#get previous squared residual
eps = prev[1]
"""
here do whatever you want with eps_t-1
like x += eps if x if of the same dimension
or include it somewhere in your graph
"""
# lstm gates (add bias if needed)
#
# input gate
i = tf.sigmoid(tf.matmul(x,U[0]) + tf.matmul(st_1,W[0]))
# forget gate
f = tf.sigmoid(tf.matmul(x,U[1]) + tf.matmul(st_1,W[1]))
# output gate
o = tf.sigmoid(tf.matmul(x,U[2]) + tf.matmul(st_1,W[2]))
# gate weights
g = tf.tanh(tf.matmul(x,U[3]) + tf.matmul(st_1,W[3]))
ct = ct_1*f + g*i
st = tf.tanh(ct)*o
"""
make prediction, compute residual in t
and pass it to t+1
Normaly, we would compute prediction outside the scan function,
but as we need it here, we could just keep it and return it back
as an output of the scan function
"""
prediction_t = tf.matmul(st, W_out) # + bias
eps = (target - prediction_t)**2
return [tf.stack((st, ct), axis=0), eps, prediction_t]
states, eps, preds = tf.scan(step, [tf.transpose(inputs, [1,0,2]),
tf.transpose(stock_returns, [1,0,2])], initializer=[init_state,
tf.zeros((32,1), dtype=tf.float32),
tf.zeros((32,1),dtype=tf.float32)])
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
out = sess.run(preds, feed_dict=
{inputs:np.random.rand(BS,TS,inputs_dim),
stock_returns:np.random.rand(BS,TS,target_dim),
init_state:np.zeros((2,BS,state_size))})
out = tf.transpose(out,[1,0,2])
print(out)
And the output :
Tensor("transpose_2:0", shape=(32, 10, 1), dtype=float32)
Base code from here

Neural Network MNIST: Backpropagation is correct, but training/test accuracy very low

I am building a neural network to learn to recognize handwritten digits from MNIST. I have confirmed that backpropagation calculates the gradients perfectly (gradient checking gives error < 10 ^ -10).
It appears that no matter how I train the weights, the cost function always tends towards around 3.24-3.25 (never below that, just approaching from above) and the training/test set accuracy is very low (around 11% for the test set). It appears that the h values in the end are all very close to 0.1 and to each other.
I cannot find why my program cannot produce better results. I was wondering if anyone could maybe take a look at my code and please tell me any reasons for this occurring. Thank you so much for all your help, I really appreciate it!
Here is my Python code:
import numpy as np
import math
from tensorflow.examples.tutorials.mnist import input_data
# Neural network has four layers
# The input layer has 784 nodes
# The two hidden layers each have 5 nodes
# The output layer has 10 nodes
num_layer = 4
num_node = [784,5,5,10]
num_output_node = 10
# 30000 training sets are used
# 10000 test sets are used
# Can be adjusted
Ntrain = 30000
Ntest = 10000
# Sigmoid Function
def g(X):
return 1/(1 + np.exp(-X))
# Forwardpropagation
def h(W,X):
a = X
for l in range(num_layer - 1):
a = np.insert(a,0,1)
z = np.dot(a,W[l])
a = g(z)
return a
# Cost Function
def J(y, W, X, Lambda):
cost = 0
for i in range(Ntrain):
H = h(W,X[i])
for k in range(num_output_node):
cost = cost + y[i][k] * math.log(H[k]) + (1-y[i][k]) * math.log(1-H[k])
regularization = 0
for l in range(num_layer - 1):
for i in range(num_node[l]):
for j in range(num_node[l+1]):
regularization = regularization + W[l][i+1][j] ** 2
return (-1/Ntrain * cost + Lambda / (2*Ntrain) * regularization)
# Backpropagation - confirmed to be correct
# Algorithm based on https://www.coursera.org/learn/machine-learning/lecture/1z9WW/backpropagation-algorithm
# Returns D, the value of the gradient
def BackPropagation(y, W, X, Lambda):
delta = np.empty(num_layer-1, dtype = object)
for l in range(num_layer - 1):
delta[l] = np.zeros((num_node[l]+1,num_node[l+1]))
for i in range(Ntrain):
A = np.empty(num_layer-1, dtype = object)
a = X[i]
for l in range(num_layer - 1):
A[l] = a
a = np.insert(a,0,1)
z = np.dot(a,W[l])
a = g(z)
diff = a - y[i]
delta[num_layer-2] = delta[num_layer-2] + np.outer(np.insert(A[num_layer-2],0,1),diff)
for l in range(num_layer-2):
index = num_layer-2-l
diff = np.multiply(np.dot(np.array([W[index][k+1] for k in range(num_node[index])]), diff), np.multiply(A[index], 1-A[index]))
delta[index-1] = delta[index-1] + np.outer(np.insert(A[index-1],0,1),diff)
D = np.empty(num_layer-1, dtype = object)
for l in range(num_layer - 1):
D[l] = np.zeros((num_node[l]+1,num_node[l+1]))
for l in range(num_layer-1):
for i in range(num_node[l]+1):
if i == 0:
for j in range(num_node[l+1]):
D[l][i][j] = 1/Ntrain * delta[l][i][j]
else:
for j in range(num_node[l+1]):
D[l][i][j] = 1/Ntrain * (delta[l][i][j] + Lambda * W[l][i][j])
return D
# Neural network - this is where the learning/adjusting of weights occur
# W is the weights
# learn is the learning rate
# iterations is the number of iterations we pass over the training set
# Lambda is the regularization parameter
def NeuralNetwork(y, X, learn, iterations, Lambda):
W = np.empty(num_layer-1, dtype = object)
for l in range(num_layer - 1):
W[l] = np.random.rand(num_node[l]+1,num_node[l+1])/100
for k in range(iterations):
print(J(y, W, X, Lambda))
D = BackPropagation(y, W, X, Lambda)
for l in range(num_layer-1):
W[l] = W[l] - learn * D[l]
print(J(y, W, X, Lambda))
return W
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
# Training data, read from MNIST
inputpix = []
output = []
for i in range(Ntrain):
inputpix.append(2 * np.array(mnist.train.images[i]) - 1)
output.append(np.array(mnist.train.labels[i]))
np.savetxt('input.txt', inputpix, delimiter=' ')
np.savetxt('output.txt', output, delimiter=' ')
# Train the weights
finalweights = NeuralNetwork(output, inputpix, 2, 5, 1)
# Test data
inputtestpix = []
outputtest = []
for i in range(Ntest):
inputtestpix.append(2 * np.array(mnist.test.images[i]) - 1)
outputtest.append(np.array(mnist.test.labels[i]))
np.savetxt('inputtest.txt', inputtestpix, delimiter=' ')
np.savetxt('outputtest.txt', outputtest, delimiter=' ')
# Determine the accuracy of the training data
count = 0
for i in range(Ntrain):
H = h(finalweights,inputpix[i])
print(H)
for j in range(num_output_node):
if H[j] == np.amax(H) and output[i][j] == 1:
count = count + 1
print(count/Ntrain)
# Determine the accuracy of the test data
count = 0
for i in range(Ntest):
H = h(finalweights,inputtestpix[i])
print(H)
for j in range(num_output_node):
if H[j] == np.amax(H) and outputtest[i][j] == 1:
count = count + 1
print(count/Ntest)
Your network is tiny, 5 neurons make it basically a linear model. Increase it to 256 per layer.
Notice, that trivial linear model has 768 * 10 + 10 (biases) parameters, adding up to 7690 floats. Your neural network on the other hand has 768 * 5 + 5 + 5 * 5 + 5 + 5 * 10 + 10 = 3845 + 30 + 60 = 3935. In other words despite being nonlinear neural network, it is actualy a simpler model than a trivial logistic regression applied to this problem. And logistic regression obtains around 11% error on its own, thus you cannot really expect to beat it. Of course this is not a strict argument, but should give you some intuition for why it should not work.
Second issue is related to other hyperparameters, you seem to be using:
huge learning rate (is it 2?) it should be more of order 0.0001
very little training iterations (are you just executing 5 epochs?)
your regularization parameter is huge (it is set to 1), so your network is heavily penalised for learning anything, again - change it to something order of magnitude smaller
The NN architecture is most likely under-fitting. Maybe, the learning rate is high/low. Or there are most issues with the regularization parameter.

tensorflow resize nearest neighbor approach don't optmize weights

I'm beginner in tensorflow and i'm working on a Model which Colorize Greyscale images and in the last part of the model the paper say :
Once the features are fused, they are processed by a set of
convolutions and upsampling layers, the latter which consist of simply
upsampling the input by using the nearest neighbour technique so that
the output is twice as wide and twice as tall.
when i tried to implement it in tensorflow i used tf.image.resize_nearest_neighbor for upsampling but when i used it i found the cost didn't change in all the epochs except of the 2nd epoch, and without it the cost is optmized and changed
This part of code
def Model(Input_images):
#some code till the following last part
Color_weights = {'W_conv1':tf.Variable(tf.random_normal([3,3,256,128])),'W_conv2':tf.Variable(tf.random_normal([3,3,128,64])),
'W_conv3':tf.Variable(tf.random_normal([3,3,64,64])),
'W_conv4':tf.Variable(tf.random_normal([3,3,64,32])),'W_conv5':tf.Variable(tf.random_normal([3,3,32,2]))}
Color_biases = {'b_conv1':tf.Variable(tf.random_normal([128])),'b_conv2':tf.Variable(tf.random_normal([64])),'b_conv3':tf.Variable(tf.random_normal([64])),
'b_conv4':tf.Variable(tf.random_normal([32])),'b_conv5':tf.Variable(tf.random_normal([2]))}
Color_layer1 = tf.nn.relu(Conv2d(Fuse, Color_weights['W_conv1'], 1) + Color_biases['b_conv1'])
Color_layer1_up = tf.image.resize_nearest_neighbor(Color_layer1,[56,56])
Color_layer2 = tf.nn.relu(Conv2d(Color_layer1_up, Color_weights['W_conv2'], 1) + Color_biases['b_conv2'])
Color_layer3 = tf.nn.relu(Conv2d(Color_layer2, Color_weights['W_conv3'], 1) + Color_biases['b_conv3'])
Color_layer3_up = tf.image.resize_nearest_neighbor(Color_layer3,[112,112])
Color_layer4 = tf.nn.relu(Conv2d(Color_layer3, Color_weights['W_conv4'], 1) + Color_biases['b_conv4'])
return Color_layer4
The Training Code
Prediction = Model(Input_images)
Colorization_MSE = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(Prediction,tf.Variable(tf.random_normal([2,112,112,32]))))
Optmizer = tf.train.AdadeltaOptimizer(learning_rate= 0.05).minimize(Colorization_MSE)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
for epoch in range(EpochsNum):
epoch_loss = 0
Batch_indx = 1
for i in range(int(ExamplesNum / Batch_size)):#Over batches
print("Batch Num ",i + 1)
ReadNextBatch()
a, c = sess.run([Optmizer,Colorization_MSE],feed_dict={Input_images:Batch_GreyImages})
epoch_loss += c
print("epoch: ",epoch + 1, ",Los: ",epoch_loss)
So what is wrong with my logic or if the problem is in
tf.image.resize_nearest_neighbor what should i do or what is it's replacement ?
Ok, i solved it, i noticed that tf.random normal was the problem and when i replaced it with tf.truncated normal it is works well

Making training example of multilayer perceptron

I'm trying to make several training examples to get a set of weights and bias for the particular network which correctly implements a hard threshold activation function.
Four inputs x_1, ... x_4 , where x_i is Real number, and the network must output(y) 1 if x_1 < x_2 < x_3 < x_4 (sorted order), and 0
otherwise.
A hard threshold activation function ;
f(z) = 1 (if z>= 0) or 0 (if z <0)
h1 = x1w11 + x2w12 + x3w13 + x4w14 + b11
h2 = x1w21 + x2w22 + x3w23 + x4w24 + b21
h3 = x1w31 + x2w32 + x3w33 + x4w34 + b31
y = w1h1 + h2w2 + h3w3 + b (*Actually h1, h2, h3 are f(h1),f(h2),f(h3) because of activation function)
And, f(y).
I guess training example should be
(-2,-1,0,1) -> output 1, (0,0,0,0) -> output 0, (0,0,0,1) -> output 0,
(1,2,3,4) -> output 1.
.. and so on.
But the domain of input is too broad to build specific examples to use multilayer perception algorithm.
Can you help me to get proper example for applying the algorithm?
No, its not to broad, you can just concentrate in the [0, 1] range for each x_i, as in any case you need normalized data to train a neural network.
So basically you can just generate uniformly distributed random numbers in the [0, 1] range, check if they are sorted or not, and produce the label accordingly. Then you repeat say for 10K or 100K and then you have a dataset to train an MLP. You could also discretize the [0, 1] range with a chosen step to generate the numbers.

Implementing a custom objective function in Keras

I am trying to implement a custom Keras objective function:
in 'Direct Intrinsics: Learning Albedo-Shading Decomposition by Convolutional Regression', Narihira et al.
This is the sum of equations (4) and (6) from the previous picture. Y* is the ground truth, Y a prediction map and y = Y* - Y.
This is my code:
def custom_objective(y_true, y_pred):
#Eq. (4) Scale invariant L2 loss
y = y_true - y_pred
h = 0.5 # lambda
term1 = K.mean(K.sum(K.square(y)))
term2 = K.square(K.mean(K.sum(y)))
sca = term1-h*term2
#Eq. (6) Gradient L2 loss
gra = K.mean(K.sum((K.square(K.gradients(K.sum(y[:,1]), y)) + K.square(K.gradients(K.sum(y[1,:]), y)))))
return (sca + gra)
However, I suspect that the equation (6) is not correctly implemented because the results are not good. Am I computing this right?
Thank you!
Edit:
I am trying to approximate (6) convolving with Prewitt filters. It works when my input is a chunk of images i.e. y[batch_size, channels, row, cols], but not with y_true and y_pred (which are of type TensorType(float32, 4D)).
My code:
def cconv(image, g_kernel, batch_size):
g_kernel = theano.shared(g_kernel)
M = T.dtensor3()
conv = theano.function(
inputs=[M],
outputs=conv2d(M, g_kernel, border_mode='full'),
)
accum = 0
for curr_batch in range (batch_size):
accum = accum + conv(image[curr_batch])
return accum/batch_size
def gradient_loss(y_true, y_pred):
y = y_true - y_pred
batch_size = 40
# Direction i
pw_x = np.array([[-1,0,1],[-1,0,1],[-1,0,1]]).astype(np.float64)
g_x = cconv(y, pw_x, batch_size)
# Direction j
pw_y = np.array([[-1,-1,-1],[0,0,0],[1,1,1]]).astype(np.float64)
g_y = cconv(y, pw_y, batch_size)
gra_l2_loss = K.mean(K.square(g_x) + K.square(g_y))
return (gra_l2_loss)
The crash is produced in:
accum = accum + conv(image[curr_batch])
...and error description is the following one:
*** TypeError: ('Bad input argument to theano function with name "custom_models.py:836" at index 0 (0-based)', 'Expected an array-like
object, but found a Variable: maybe you are trying to call a function
on a (possibly shared) variable instead of a numeric array?')
How can I use y (y_true - y_pred) as a numpy array, or how can I solve this issue?
SIL2
term1 = K.mean(K.square(y))
term2 = K.square(K.mean(y))
[...]
One mistake spread across the code was that when you see (1/n * sum()) in the equations, it is a mean. Not the mean of a sum.
Gradient
After reading your comment and giving it more thought, I think there is a confusion about the gradient. At least I got confused.
There are two ways of interpreting the gradient symbol:
The gradient of a vector where y should be differentiated with respect to the parameters of your model (usually the weights of the neural net). In previous edits I started to write in this direction because that's the sort of approach used to trained the model (eg. gradient descent). But I think I was wrong.
The pixel intensity gradient in a picture, as you mentioned in your comment. The diff of each pixel with its neighbor in each direction. In which case I guess you have to translate the example you gave into Keras.
To sum up, K.gradients() and numpy.gradient() are not used in the same way. Because numpy implicitly considers (i, j) (the row and column indices) as the two input variables, while when you feed a 2D image to a neural net, every single pixel is an input variable. Hope I'm clear.

Resources