Are multiple layers in LSTM gates useful? - machine-learning

I'm implementing an LSTM cell from scratch and was thinking of computing the gates with multiple neural layers instead of the usual single layer: sigmoid(dot(W, concat(a_prev, xt)) + b). I can't seem to find any literature on it. Does it work? Can it converge?
This is the standard LSTM cell forward propagation code that I learned in Andrew Ng's Deep Learning course:
concat = np.zeros((n_a + n_x, m))
concat[: n_a, :] = a_prev
concat[n_a :, :] = xt
ft = sigmoid(np.dot(Wf, concat) + bf)
it = sigmoid(np.dot(Wi, concat) + bi)
cct = np.tanh(np.dot(Wc, concat) + bc)
c_next = ft * c_prev + it * cct
ot = sigmoid(np.dot(Wo, concat) + bo)
a_next = ot * np.tanh(c_next)
# Compute prediction of the LSTM cell
yt_pred = softmax(np.dot(Wy, a_next) + by)
This is the LSTM cell I want to use:
concat = np.zeros((n_a + n_x, m))
concat[: n_a, :] = a_prev
concat[n_a:, :] = xt
ft1 = sigmoid(np.dot(Wf1, concat) + bf1)
ft2 = sigmoid(np.dot(Wf2, ft1) + bf2)
it1 = sigmoid(np.dot(Wi11, concat) + bi1)
it2 = sigmoid(np.dot(Wi12, it1) + bi2)
cct1 = np.tanh(np.dot(Wc1, concat) + bc1)
cct2 = np.tanh(np.dot(Wc2, cct1) + bc2)
c_next = ft2 * c_prev + it2 * cct2
ot1 = sigmoid(np.dot(Wo1, concat) + bo1)
ot2 = sigmoid(np.dot(Wo2, ot1) + bo2)
a_next = ot2 * np.tanh(c_next)
# Compute prediction of the LSTM cell
yt_pred1 = softmax(np.dot(Wy1, a_next) + by1)
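The shapes in the proposed cell only line up if each second-layer matrix maps back to n_a rows, so that the gate can still be multiplied element-wise with c_prev. Below is a minimal shape check for one such gate. It assumes a hypothetical hidden width n_h and random weights, and it says nothing about whether training converges.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_gate(concat, W1, b1, W2, b2):
    """Gate computed by two stacked sigmoid layers instead of one."""
    hidden = sigmoid(W1 @ concat + b1)   # shape (n_h, m)
    return sigmoid(W2 @ hidden + b2)     # shape (n_a, m), values in (0, 1)

rng = np.random.default_rng(0)
n_a, n_x, n_h, m = 5, 3, 4, 2            # n_h is a hypothetical hidden width

a_prev = rng.standard_normal((n_a, m))
xt = rng.standard_normal((n_x, m))
c_prev = rng.standard_normal((n_a, m))
concat = np.concatenate([a_prev, xt], axis=0)   # shape (n_a + n_x, m)

Wf1 = rng.standard_normal((n_h, n_a + n_x))
bf1 = np.zeros((n_h, 1))
Wf2 = rng.standard_normal((n_a, n_h))            # must map back to n_a rows
bf2 = np.zeros((n_a, 1))

ft2 = two_layer_gate(concat, Wf1, bf1, Wf2, bf2)
gated = ft2 * c_prev                             # element-wise, as in c_next
print(ft2.shape, gated.shape)
```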

Related

How to calculate F1 Score for Multi-label Classification

I am trying to calculate the F1 score (and accuracy) for my multi-label classification problem. Could you please give feedback on whether my method calculates it correctly? Note that I'm calculating IOU (intersection over union) when the model predicts an object as 1, and I mark it as TP only if the IOU is greater than or equal to 0.5.
GT labels: 14 x 10 x 128
Output: 14 x 10 x 128
where 14 is the batch_size, 10 is the sequence_length, and 128 is the object vector (i.e., 1 if the object at an index belongs to the sequence and 0 otherwise).
def calculate_performance_metrics(total_padded_elements, gt_labels, predicted_labels):
    # check if TP pred objects overlap with TP gt objects
    TP_INDICES = (torch.logical_and(predicted_labels == 1, gt_labels == 1)).nonzero()  # we only want the batch and object indices, i.e. the 0 and 2 indices
    TP = calculate_tp_with_iou()  # details of this don't matter for now
    FP = torch.sum(torch.logical_and(predicted_labels, 1 - gt_labels)).item()
    TN = torch.sum(torch.logical_and(1 - predicted_labels, 1 - gt_labels)).item()
    FN = torch.sum(torch.logical_and(1 - predicted_labels, gt_labels)).item()
    return float(TP), float(FP), float(TN - total_padded_elements), float(FN)
for epoch in range(10):
    EPOCH_TP = EPOCH_FP = EPOCH_TN = EPOCH_FN = EPOCH_PRECISION = EPOCH_RECALL = EPOCH_F1 = 0.
    for inputs, gt_labels, masks in tr_dl:
        outputs = model(inputs)  # out shape: (14, 10, 128)
        # mask shape: (14, 10). So need to expand it to the shape of output
        masks = masks[:, :, None].expand_as(outputs)
        pred_labels = (torch.sigmoid(outputs) >= 0.5).float().type(torch.int64)  # consider all predictions above 0.5 as 1, rest 0
        pred_labels = pred_labels * masks
        gt_labels = (gt_labels * masks).type(torch.int64)
        total_padded_elements = masks.numel() - masks.sum()  # need this to get accurate true negatives
        # arguments are passed in the order the function defines them
        batch_tp, batch_fp, batch_tn, batch_fn = calculate_performance_metrics(total_padded_elements, gt_labels, pred_labels)
        EPOCH_TP += batch_tp
        EPOCH_FP += batch_fp
        EPOCH_TN += batch_tn
        EPOCH_FN += batch_fn
    EPOCH_ACCURACY = (EPOCH_TP + EPOCH_TN) / (EPOCH_TP + EPOCH_TN + EPOCH_FP + EPOCH_FN)
    if EPOCH_TP + EPOCH_FP > 0:
        EPOCH_PRECISION = EPOCH_TP / (EPOCH_TP + EPOCH_FP)
    if EPOCH_TP + EPOCH_FN > 0:
        EPOCH_RECALL = EPOCH_TP / (EPOCH_TP + EPOCH_FN)
    if EPOCH_PRECISION + EPOCH_RECALL > 0:  # avoid 0/0 when there are no true positives
        EPOCH_F1 = (2 * EPOCH_PRECISION * EPOCH_RECALL) / (EPOCH_PRECISION + EPOCH_RECALL)
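For reference, the precision, recall and F1 formulas used in the loop above can be wrapped in a small helper. This is a minimal sketch with hypothetical counts. It pools TP/FP/FN over all labels (micro-averaging), which is what the loop above does, and the name precision_recall_f1 is an assumption of this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Micro-averaged precision, recall and F1 from pooled counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# Hypothetical counts accumulated over one epoch
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(p, r, f1)
```

The same F1 value can be cross-checked with the equivalent form 2*TP / (2*TP + FP + FN), which does not involve true negatives at all.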

How to debug weights that keep increasing in a PyTorch program

I have a question from practicing with a PyTorch program.
I have a function like y = m1x1 + m2x2 + c (just 2 weights to learn here). The expected values of the weights are 16 and -14, and the bias should be 36. But in every epoch the learned weights become very large. Can anyone help me debug these 20 lines of code and understand what is going wrong here?
import torch
x = torch.randint(size = (1,2), high = 10)
w = torch.Tensor([16,-14])
b = 36
#Compute Ground Truth
y = w * x + b
#Find weights by program
epoch = 20
learning_rate = 30
#initialize random
w1 = torch.rand(size= (1,2), requires_grad= True)
b1 = torch.ones(size = [1], requires_grad= True)
for i in range(epoch):
    y1 = w1 * x + b1
    #loss function RMSQ
    loss = torch.sum((y1-y)**2)
    #Find gradient
    loss.backward()
    with torch.no_grad():
        #update parameters
        w1 -= (learning_rate * w1.grad)
        b1 -= (learning_rate * b1.grad)
        w1.grad.zero_()
        b1.grad.zero_()
    print("B ", b1)
    print("W ", w1)
Thanks,
Ganesh
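You have a very large learning rate (30). With a step that large, each update overshoots the minimum of the loss by more than the previous one, so the weights grow instead of settling.
There is an illustration on Jeremy Jordan's blog that explains exactly what is going on in your case.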
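The effect can be reproduced on a toy problem. The sketch below is not the model from the question: it runs plain gradient descent on the hypothetical 1-D loss (w - 3)^2, once with a small step and once with a step that is too large.

```python
def gradient_descent(lr, steps=20, w=0.0, target=3.0):
    """Minimise (w - target)**2 with a fixed learning rate."""
    for _ in range(steps):
        grad = 2.0 * (w - target)   # derivative of (w - target)**2
        w -= lr * grad
    return w

w_small = gradient_descent(lr=0.1)   # each step shrinks the error by a factor 0.8
w_large = gradient_descent(lr=1.5)   # each step flips and doubles the error

print("lr=0.1:", w_small)
print("lr=1.5:", w_large)
```

With lr=0.1 the result ends up close to 3, while with lr=1.5 the distance from 3 doubles at every step. In the code from the question, trying a much smaller learning_rate (for example 0.001) would be the first thing to check.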

Calculating Gradient Update

Let's say I want to manually calculate the gradient update with respect to the Kullback-Leibler divergence loss, say on a VAE (see an actual example from the PyTorch sample documentation here):
KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
where logvar is (for simplicity's sake, ignoring activation functions, multiple layers, etc.) basically a single-layer transformation from a 400-dim feature vector into a 20-dim one:
self.fc21 = nn.Linear(400, 20)
logvar = fc21(x)
I'm just not mathematically understanding how you take the gradient of this, with respect to the weight vector for fc21. Mathematically I thought this would look like:
KL = -.5sum(1 + Wx + b - m^2 - e^{Wx + b})
dKL/dW = -.5 (x - e^{Wx + b}x)
where W is the weight matrix of the fc21 layer. But here this result isn't in the same shape as W (20x400). Like, x is just a 400 feature vector. So how would I perform SGD on this? Does x just broadcast to the second term, and if so why? I feel like I'm just missing some mathematical understanding here...
Let's simplify the example a bit and assume a fully connected layer of input shape 3 and output shape 2, then:
W = [[w1, w2, w3], [w4, w5, w6]]
x = [x1, x2, x3]
y = [y1, y2] = [w1*x1 + w2*x2 + w3*x3 + b1, w4*x1 + w5*x2 + w6*x3 + b2]
D_KL = -0.5 * [(1 + y1 - m1^2 - e^(y1)) + (1 + y2 - m2^2 - e^(y2))]
grad(D_KL, w1) = -0.5 * [x1 - x1 * e^(y1)]
grad(D_KL, w2) = -0.5 * [x2 - x2 * e^(y1)]
...
grad(D_KL, W) = [[grad(D_KL, w1), grad(D_KL, w2), grad(D_KL,w3)],
[grad(D_KL, w4), grad(D_KL, w5), grad(D_KL,w6)]
]
This generalizes for higher order tensors of any dimensionality. Your differentiation is wrong in treating x and W as scalars rather than taking element-wise partial derivatives.
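Written out for the 20x400 case from the question, the element-wise partial derivatives form an outer product between a 20-vector and the 400-vector x, which is why the gradient has the same shape as W. Below is a minimal NumPy sketch with a finite-difference check on one entry. It assumes random data and treats mu as independent of W, and the function names kld and kld_grad_W are made up for this example.

```python
import numpy as np

def kld(W, b, x, mu):
    logvar = W @ x + b
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))

def kld_grad_W(W, b, x):
    logvar = W @ x + b
    # dKLD/dlogvar_i = -0.5 * (1 - exp(logvar_i)) and dlogvar_i/dW_ij = x_j,
    # so the full gradient is an outer product of shape (20, 400).
    return np.outer(-0.5 * (1 - np.exp(logvar)), x)

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((20, 400))
b = np.zeros(20)
x = rng.standard_normal(400)
mu = rng.standard_normal(20)

grad = kld_grad_W(W, b, x)

# Central finite difference on a single (arbitrarily chosen) entry of W
i, j, eps = 3, 7, 1e-6
W_plus = W.copy()
W_plus[i, j] += eps
W_minus = W.copy()
W_minus[i, j] -= eps
numeric = (kld(W_plus, b, x, mu) - kld(W_minus, b, x, mu)) / (2 * eps)

print(grad.shape, grad[i, j], numeric)
```

An SGD step is then simply W -= learning_rate * grad, with no broadcasting needed.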

Cost value doesn't converge

I'm trying to code a logistic regression, but I'm having trouble getting the cost to converge. Can anyone help me? My code is below. Thank you!
#input:
m, n = 3, 4
# we have 3 training examples and each of them has 4 features (sorry, I know it looks weird here). Y is a label matrix.
X = np.array([[1,2,1],[1,1,0],[1,2,1],[1,0,2]])
Y = np.array([[0,1,0]])
h = 100000 #iterations
alpha = 0.05 #learning rate
b = 0 #scalar bias
W = np.zeros(n).reshape(1,n) #weights
J = np.zeros(h).reshape(1,h) #a vector for holding cost values
Yhat = np.zeros(m).reshape(1,m) #predicted value
def activation(yhat):
    return 1/(1+np.exp(-yhat))
W=W.T
for g in range(h):
    m = X.T.shape[0]
    Y_hat = activation(X.dot(W)+b)
    cost = -1/m * np.sum(Y*np.log(Y_hat)+(1-Y)*np.log(1-Y_hat))
    current_error = Y.T - Y_hat
    dW = 1/m * np.dot(X.T, current_error)
    db = 1/m * np.sum(current_error)
    W = W + alpha * dW
    b = b + alpha * db
    J[0][g] = cost
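One thing that may be worth checking in a loop like this is the array shapes: if Y has shape (1, m) and Y_hat has shape (m, 1), an expression such as Y*np.log(Y_hat) broadcasts to an (m, m) array, and the cost is then summed over the wrong terms. Below is a shape-consistent sketch of the same loop. It assumes X is stored as (n features, m examples) as described in the comment above, and it uses fewer iterations to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same toy data as above: X is (n features, m examples), Y is (1, m)
X = np.array([[1, 2, 1], [1, 1, 0], [1, 2, 1], [1, 0, 2]], dtype=float)
Y = np.array([[0, 1, 0]], dtype=float)
n, m = X.shape

W = np.zeros((n, 1))
b = 0.0
alpha = 0.05
iterations = 2000          # shortened for the example
J = np.zeros(iterations)   # cost history

for g in range(iterations):
    Y_hat = sigmoid(W.T @ X + b)     # (1, m), same shape as Y
    J[g] = -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    error = Y_hat - Y                # (1, m)
    dW = X @ error.T / m             # (n, 1), gradient of the cost w.r.t. W
    db = np.sum(error) / m
    W -= alpha * dW                  # gradient descent step
    b -= alpha * db

print(J[0], J[-1])
```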

Gradient in continuous regression using a neural network

I'm trying to implement a regression NN that has 3 layers (1 input, 1 hidden and 1 output layer with a continuous result). As a basis I took a classification NN from a coursera.org class, but changed the cost function and gradient calculation to fit a regression problem (rather than a classification one):
My nnCostFunction now is:
function [J grad] = nnCostFunctionLinear(nn_params, ...
                                         input_layer_size, ...
                                         hidden_layer_size, ...
                                         num_labels, ...
                                         X, y, lambda)
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));
m = size(X, 1);
a1 = X;
a1 = [ones(m, 1) a1];
a2 = a1 * Theta1';
a2 = [ones(m, 1) a2];
a3 = a2 * Theta2';
Y = y;
J = 1/(2*m)*sum(sum((a3 - Y).^2))
th1 = Theta1;
th1(:,1) = 0; %set bias = 0 in reg. formula
th2 = Theta2;
th2(:,1) = 0;
t1 = th1.^2;
t2 = th2.^2;
th = sum(sum(t1)) + sum(sum(t2));
th = lambda * th / (2*m);
J = J + th; %regularization
del_3 = a3 - Y;
t1 = del_3'*a2;
Theta2_grad = 2*(t1)/m + lambda*th2/m;
t1 = del_3 * Theta2;
del_2 = t1 .* a2;
del_2 = del_2(:,2:end);
t1 = del_2'*a1;
Theta1_grad = 2*(t1)/m + lambda*th1/m;
grad = [Theta1_grad(:) ; Theta2_grad(:)];
end
Then I use this function with the fmincg algorithm, but fmincg stops after the first few iterations. I think my gradient is wrong, but I can't find the error.
Can anybody help?
If I understand correctly, your first block of code (shown below) -
m = size(X, 1);
a1 = X;
a1 = [ones(m, 1) a1];
a2 = a1 * Theta1';
a2 = [ones(m, 1) a2];
a3 = a2 * Theta2';
Y = y;
is meant to compute the output a(3) at the output layer.
Ng's slides on neural networks use a different configuration to calculate a(3) from what your code does: in the hidden and output layers, you are not applying the activation function g, e.g., a sigmoid function.
As for the cost function J without regularization terms, I don't understand why, compared with the formula in Ng's slides, you can compute it using:
J = 1/(2*m)*sum(sum((a3 - Y).^2))
because you are not including the log function at all.
Mikhaill, I've been playing with a NN for continuous regression as well, and had similar issues at some point. The best thing to do here would be to test the gradient computation against a numerical calculation before running the model. If it's not correct, fmincg won't be able to train the model. (By the way, I would discourage using the numerical gradient for training itself, as it takes much longer.)
Taking into account that you took this idea from Ng's Coursera class, I'll implement a possible solution for you to try, using the same notation for Octave.
% Cost function without regularization.
J = 1/(2*m)*sum((a3-Y).^2);
% In case it's needed, the regularization term is added (i.e. for training).
if (reg==true)
  J=J+lambda/2/m*(sum(sum(Theta1(:,2:end).^2))+sum(sum(Theta2(:,2:end).^2)));
endif
% Derivatives are computed for layers 2 and 3.
d3=(a3-Y);
d2=d3*Theta2(:,2:end);
% Theta grad is computed without regularization.
Theta1_grad=(d2'*a1)./m;
Theta2_grad=(d3'*a2)./m;
% Regularization is added to grad computation.
Theta1_grad(:,2:end)=Theta1_grad(:,2:end)+(lambda/m).*Theta1(:,2:end);
Theta2_grad(:,2:end)=Theta2_grad(:,2:end)+(lambda/m).*Theta2(:,2:end);
% Unroll gradients.
grad = [Theta1_grad(:) ; Theta2_grad(:)];
Note that, since you have taken out all the sigmoid activation, the derivative calculation is quite simple and results in a simplification of the original code.
Next steps:
1. Check this code to see whether it makes sense for your problem.
2. Use gradient checking to test gradient calculation.
3. Finally, use fmincg and check you get different results.
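The gradient check in step 2 can be sketched as follows. This is an illustrative NumPy translation and not the course's own checkNNGradients: it assumes a single output unit, no activation functions and no regularization, and it unrolls parameters in row-major order (Octave uses column-major). The names cost_and_grad and numerical_grad are made up for this example.

```python
import numpy as np

def cost_and_grad(params, X, y, n_in, n_hidden):
    """Squared-error cost and gradient for a 2-layer net without activations."""
    m = X.shape[0]
    split = n_hidden * (n_in + 1)
    Theta1 = params[:split].reshape(n_hidden, n_in + 1)
    Theta2 = params[split:].reshape(1, n_hidden + 1)

    a1 = np.hstack([np.ones((m, 1)), X])
    a2 = np.hstack([np.ones((m, 1)), a1 @ Theta1.T])
    a3 = a2 @ Theta2.T                       # (m, 1)

    J = np.sum((a3 - y) ** 2) / (2 * m)
    d3 = a3 - y                              # (m, 1)
    d2 = d3 @ Theta2[:, 1:]                  # (m, n_hidden), bias column dropped
    Theta1_grad = d2.T @ a1 / m
    Theta2_grad = d3.T @ a2 / m
    return J, np.concatenate([Theta1_grad.ravel(), Theta2_grad.ravel()])

def numerical_grad(f, params, eps=1e-5):
    """Central finite differences, one parameter at a time."""
    grad = np.zeros_like(params)
    for k in range(params.size):
        step = np.zeros_like(params)
        step[k] = eps
        grad[k] = (f(params + step) - f(params - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n_in, n_hidden, m = 3, 4, 5                  # small hypothetical sizes
X = rng.standard_normal((m, n_in))
y = rng.standard_normal((m, 1))
params = 0.1 * rng.standard_normal(n_hidden * (n_in + 1) + (n_hidden + 1))

J, analytic = cost_and_grad(params, X, y, n_in, n_hidden)
numeric = numerical_grad(lambda p: cost_and_grad(p, X, y, n_in, n_hidden)[0], params)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

If max_diff is not tiny, the analytic gradient and the cost function disagree, and an optimizer such as fmincg is likely to stop early.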
Try applying the sigmoid function when computing the second (hidden) layer values, and avoid the sigmoid when calculating the target (output) value.
function [J grad] = nnCostFunction1(nnParams, ...
                                    inputLayerSize, ...
                                    hiddenLayerSize, ...
                                    numLabels, ...
                                    X, y, lambda)
Theta1 = reshape(nnParams(1:hiddenLayerSize * (inputLayerSize + 1)), ...
                 hiddenLayerSize, (inputLayerSize + 1));
Theta2 = reshape(nnParams((1 + (hiddenLayerSize * (inputLayerSize + 1))):end), ...
                 numLabels, (hiddenLayerSize + 1));
Theta1Grad = zeros(size(Theta1));
Theta2Grad = zeros(size(Theta2));
m = size(X,1);
a1 = [ones(m, 1) X]';
z2 = Theta1 * a1;
a2 = sigmoid(z2);
a2 = [ones(1, m); a2];
z3 = Theta2 * a2;
a3 = z3;
Y = y';
r1 = lambda / (2 * m) * sum(sum(Theta1(:, 2:end) .* Theta1(:, 2:end)));
r2 = lambda / (2 * m) * sum(sum(Theta2(:, 2:end) .* Theta2(:, 2:end)));
J = 1 / ( 2 * m ) * (a3 - Y) * (a3 - Y)' + r1 + r2;
delta3 = a3 - Y;
delta2 = (Theta2' * delta3) .* sigmoidGradient([ones(1, m); z2]);
delta2 = delta2(2:end, :);
Theta2Grad = 1 / m * (delta3 * a2');
Theta2Grad(:, 2:end) = Theta2Grad(:, 2:end) + lambda / m * Theta2(:, 2:end);
Theta1Grad = 1 / m * (delta2 * a1');
Theta1Grad(:, 2:end) = Theta1Grad(:, 2:end) + lambda / m * Theta1(:, 2:end);
grad = [Theta1Grad(:) ; Theta2Grad(:)];
end
Normalize the inputs before passing them to nnCostFunction.
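One common way to normalize is to standardize each feature column to zero mean and unit variance, using statistics computed on the training set only. Below is a minimal NumPy sketch; the function names and the toy matrix are assumptions for illustration.

```python
import numpy as np

def fit_normalizer(X):
    """Per-feature mean and standard deviation of the training data."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0        # avoid dividing by zero for constant columns
    return mu, sigma

def normalize(X, mu, sigma):
    return (X - mu) / sigma

# Hypothetical training data: 3 examples, 3 features (the last one constant)
X_train = np.array([[1.0, 10.0, 5.0],
                    [2.0, 20.0, 5.0],
                    [3.0, 30.0, 5.0]])
mu, sigma = fit_normalizer(X_train)
X_norm = normalize(X_train, mu, sigma)
print(X_norm)
```

The same mu and sigma should be reused when normalizing validation or test inputs.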
In accordance with the Week 5 Lecture Notes guideline for a linear system NN, you should make the following changes to the initial code:
1. Remove num_labels or set it to 1 (in reshape() as well).
2. There is no need to convert y into a logical matrix.
3. For a2, replace the sigmoid() function with tanh().
4. In the d2 calculation, replace sigmoidGradient(z2) with (1-tanh(z2).^2).
5. Remove the sigmoid from the output layer (a3 = z3).
6. Replace the unregularized part of the cost function with the linear one: J = (1/(2*m))*sum((a3-y).^2)
7. Create predictLinear(): use the predict() function as a basis, replace sigmoid with tanh() for the first layer hypothesis, remove the second sigmoid for the second layer hypothesis, remove the line with the max() function, and use the output of the hidden layer hypothesis as the prediction result.
8. Verify your nnCostFunctionLinear() on the test case from the lecture notes.
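Put together, the changes listed above could look roughly like this in NumPy. This is only a sketch under stated assumptions: a single output unit, examples stored as rows of X, and Python names (nn_cost_linear, predict_linear) that loosely mirror the Octave ones rather than any code from the course.

```python
import numpy as np

def nn_cost_linear(Theta1, Theta2, X, y, lam=0.0):
    """Cost and gradients for a tanh hidden layer with a linear output."""
    m = X.shape[0]
    a1 = np.hstack([np.ones((m, 1)), X])
    z2 = a1 @ Theta1.T
    a2 = np.hstack([np.ones((m, 1)), np.tanh(z2)])
    a3 = a2 @ Theta2.T                               # no sigmoid on the output

    reg = lam / (2 * m) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))
    J = np.sum((a3 - y) ** 2) / (2 * m) + reg

    d3 = a3 - y
    d2 = (d3 @ Theta2[:, 1:]) * (1 - np.tanh(z2) ** 2)   # tanh derivative
    Theta1_grad = d2.T @ a1 / m
    Theta2_grad = d3.T @ a2 / m
    Theta1_grad[:, 1:] += lam / m * Theta1[:, 1:]         # bias column not regularized
    Theta2_grad[:, 1:] += lam / m * Theta2[:, 1:]
    return J, Theta1_grad, Theta2_grad

def predict_linear(Theta1, Theta2, X):
    """Continuous prediction: tanh hidden layer, linear output, no max()."""
    m = X.shape[0]
    h1 = np.tanh(np.hstack([np.ones((m, 1)), X]) @ Theta1.T)
    return np.hstack([np.ones((m, 1)), h1]) @ Theta2.T

# Hypothetical sizes: 3 inputs, 4 hidden units, 6 examples
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))
y = rng.standard_normal((6, 1))
Theta1 = 0.1 * rng.standard_normal((4, 4))
Theta2 = 0.1 * rng.standard_normal((1, 5))

J, Theta1_grad, Theta2_grad = nn_cost_linear(Theta1, Theta2, X, y, lam=0.5)
print(J, predict_linear(Theta1, Theta2, X).shape)
```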
