Back-propagation algorithm converging too quickly to poor results

I'm trying to implement the back-propagation algorithm for a multi-layer feedforward neural network, but I'm having trouble getting it to converge to good results. The reason is that gradient descent gets stuck on a plateau of the root mean squared error.
As you can see in the graph, there is very little change in the RMS value for the first 70 epochs or so, so gradient descent thinks it has found a minimum and stops. To work around this I added a requirement that the RMS error must also be below 0.3, in addition to its rate of change being below a given value. However, I don't think this is a good fix, as I believe there is something wrong with my implementation.
Below is the Ruby code:
def train eta, criteria
  rms = 1
  old_rms = 0
  # 20-epoch window used to smooth the RMS error for the stopping criterion
  rms_window = Array.new 20, 0
  new_avg = 10
  old_avg = 0
  diff = 100
  epoch = 0
  @data[:training].shuffle!
  while (diff > criteria || rms > 0.3) do
    #while (diff > criteria) do
    rms = 0
    old_avg = new_avg
    new_avg = 0
    classification_error = 0
    sample_num = 0
    @data[:training].each_with_index do |s, s_i|
      # Forward Propagation
      inputs = [1, s[1], s[2]]
      @hidden_layers.each_with_index do |hl, hl_i|
        outputs = Array.new
        # Bias Term
        outputs << 1
        # Compute the output for each neuron
        hl.each do |p|
          outputs << p.compute_output(inputs)
        end
        inputs = outputs
      end
      # Compute System Outputs
      outputs = Array.new
      @outputs.each do |p|
        outputs << p.compute_output(inputs)
      end
      # Compute Errors
      errors = Array.new
      desired = @desired_values[s[0]-1]
      @outputs.length.times do |x|
        errors[x] = desired[x] - outputs[x]
        rms += errors[x]**2
      end
      # Classify by the largest output
      decision = outputs.each_with_index.max[1]
      if decision+1 != s[0]
        classification_error += 1
      end
      # Back Propagation
      gradients = Array.new
      local_gradient = Array.new
      next_layer = Array.new
      @outputs.each_with_index do |o, i|
        local_gradient << errors[i] * o.activation_prime(o.output)
        o.weights.length.times do |x|
          o.weights[x] += eta * local_gradient[i] * o.inputs[x]
        end
      end
      gradients << local_gradient
      next_layer = @outputs
      @hidden_layers.reverse_each do |hl|
        local_gradient = Array.new
        hl.each do |p|
          gradient = 0
          gradients.last.each_with_index do |g, i|
            gradient += g * next_layer[i].weights[p.index+1]
          end
          gradient *= p.activation_prime(p.output)
          local_gradient << gradient
          p.weights.each_index do |x|
            p.weights[x] += eta * gradient * p.inputs[x]
          end
        end
        gradients << local_gradient
        next_layer = hl
      end
      if s_i == 0
        #puts "Epoch: #{epoch}\nOutputs: #{outputs}\nGradients:\n#{gradients[0]}\n#{gradients[1]}\n#{gradients[2]}\n\n"
        #puts "Epoch #{epoch}\nError: #{errors}\nSE: #{rms}"
      end
    end
    rms = Math::sqrt(rms / (@data[:training].length * 4))
    # Update the moving average of the RMS error
    rms_window[0] = rms
    rms_window.rotate!
    rms_window.each do |x|
      new_avg += x
    end
    new_avg /= 20
    diff = (new_avg - old_avg).abs
    @rms << rms
    epoch += 1
    if classification_error == 0
      break
    end
    #puts "RMS: #{rms}\tDiff: \t#{diff}\tClassification: #{classification_error}\n\n"
  end
  self.rms_plot "Plot"
  self.grid_eval "Test", 250
end
The graph shown is for a 2-hidden layer network with 5 neurons in each hidden layer. There are 2 inputs and 4 outputs. Perhaps this is normal behavior, but something just seems off to me. Any help would be greatly appreciated.

There are many parameters that need to be tuned to get a multi-layer neural net to work. Based on my experience, my first suggestions are:
1- Give it a small set of synthesized data and run a baby project on it to see whether the framework works.
2- Use a more convex cost function. There is no function that guarantees convexity, but there are many functions that are more convex than RMS.
3- Try scaling your input data into (-1, 1) and your output data into (0, 1); a small sketch of this follows the list.
4- Try different values for the learning rate.
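For point 3, a minimal min-max scaling sketch. It is written in Python/NumPy purely for illustration (the question's project is in Ruby), and the helper names are my own:
import numpy as np

def scale_inputs(X):
    # scale each input column into (-1, 1)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2 * (X - lo) / (hi - lo) - 1

def scale_outputs(Y):
    # scale each output column into (0, 1)
    lo, hi = Y.min(axis=0), Y.max(axis=0)
    return (Y - lo) / (hi - lo)

X = np.array([[10.0, 200.0], [12.0, 180.0], [8.0, 240.0]])
print(scale_inputs(X))  # every column now lies in [-1, 1]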

In addition to what's already been said:
vary the range of the initial weights a little more (0 to 1, for example)
make sure your input data are properly normalised - I feel this can't be said often enough
vary the learning rate: start with something like 0.05 and keep increasing/decreasing it in small steps (if you find that changing your learning rate has too extreme an effect on the network's performance, then you may not have normalised your input data appropriately)
shuffle the input data before every epoch
try using momentum (this essentially means taking bigger steps while the gradient keeps pointing the same way, and smaller ones once it starts to flip), which often helps to jump over local optima; see the sketch after this list
try using regularisation
experiment with the structure (add another hidden layer, increase the number of units in the hidden layers)
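A minimal sketch of a classical momentum update, again in Python/NumPy purely for illustration; the names and constants are my own, and delta follows the same sign convention as the question's Ruby code (local gradient times input), so it is added to the weights:
import numpy as np

eta = 0.05    # learning rate
alpha = 0.9   # momentum coefficient

def momentum_step(weights, velocity, delta):
    # accumulate a velocity term so consistent update directions build up speed
    velocity = alpha * velocity + eta * delta
    return weights + velocity, velocity

# toy usage with a single 3-element weight vector
w = np.random.uniform(0.0, 1.0, size=3)
v = np.zeros_like(w)
w, v = momentum_step(w, v, np.array([0.1, -0.2, 0.05]))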

Related

Training seq2seq LM over multiple iterations in PyTorch, seems like lack of connection between encoder and decoder

My seq2seq model seems to only learn to produce sequences of popular words like:
"i don't . i don't . i don't . i don't . i don't"
I think that might be due to a lack of actual data flow between encoder and decoder.
That happens whether I use encoder.init_hidden() or encoder_hidden.detach().
If I use neither, I get an error:
"RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward."
If I try to use retain_graph=True, I get another error:
"RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [256, 768]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True)."
This seems to be a very common use case, but from all similar questions and all the documentation and experiments, I cannot solve it.
Am I missing something obvious?
encoder = Encoder(embedding_dim, hidden_size, max_seq_len, num_layers, vocab.len(), word_embeddings).to(device)
decoder = Decoder(embedding_dim, hidden_size, num_layers, vocab.len()).to(device)
loss_function = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.SGD(params=encoder.parameters() + decoder.parameters(), lr=learn_rate)

encoder.train()
decoder.train()

encoder_hidden = encoder.init_hidden()

for epoch in range(num_epochs):
    epoch_loss = 0
    num_samples = 0
    j = 0
    for prompts, responses in train_data_loader:
        #encoder_hidden = encoder.init_hidden() # new tensor of zeroes
        encoder_hidden = encoder_hidden.detach()

        optimizer.zero_grad()

        encoder_output, encoder_hidden = encoder(prompts, encoder_hidden)
        decoder_hidden = encoder.transform_hidden(encoder_hidden)

        batch_size = responses.size(0)
        decoder_input = torch.tensor([[SOS_TOKEN]] * batch_size, device=device)
        decoder_outputs = []

        sequence_length = responses.shape[1]
        for i in range(sequence_length):
            word_index = responses[:, i:i+1]
            decoder_output, _ = decoder(decoder_input, decoder_hidden)
            decoder_outputs.append(decoder_output)
            decoder_input = word_index

        decoder_outputs_t = torch.cat(decoder_outputs, dim=1)
        decoder_outputs_t = decoder_outputs_t.permute(0, 2, 1)

        loss = loss_function(decoder_outputs_t, responses)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        num_samples += 1
        j += 1

    mean_loss = epoch_loss / num_samples

Perceptron: a weight for each data sample, or one common weight?

With perceptron learning, I am really confused about initializing and updating the weights. I have sample data that contains 2 inputs, x0 and x1, and 80 rows of these 2 inputs, hence an 80x2 matrix.
Do I need to initialize the weights as an 80x2 matrix, or as just 2 values, w0 and w1? Is the final goal of perceptron learning to find 2 weights w0 and w1 which fit all 80 input sample rows?
I have the following code, and my errors never get to 0, despite going up to 10,000 iterations.
# x = input matrix of 80x2
# y = output matrix of 80x1
# n = number of iterations
w = [0.1, 0.1]
learningRate = 0.1
for i in range(n):
    expectedT = y.transpose()
    xT = x.transpose()
    prediction = np.dot(w, xT)
    for i in range(len(x)):
        if prediction[i] >= 0:
            ypred[i] = 1
        else:
            ypred[i] = 0
    error = expectedT - ypred
    # updating the weights
    w = np.add(w, learningRate*(np.dot(error, x)))
    globalError = globalError + np.square(error)
For each feature you will have one weight. Thus you have two features and two weights. It also helps to introduce a bias, which adds another weight. For more information about the bias, see Role of Bias in Neural Networks. The weights should indeed learn how to fit the sample data best. Depending on the data, this can mean that you will never reach an error of 0; for example, a single-layer perceptron cannot learn an XOR gate when using a monotonic activation function (solving XOR with single layer perceptron).
For your example I would recommend two things: introducing a bias, and stopping the training when the error is below a certain threshold (or is exactly 0, for example).
I completed your example to learn a logical AND gate:
import numpy as np

# AND input and output
x = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([0,1,1,1])
n = 1000
w = [0.1, 0.1, 0.1]
learningRate = 0.01
globalError = 0

def predict(X):
    prediction = np.dot(w[0:2], X) + w[2]
    ypred = np.zeros(len(y))
    for i in range(len(y)):
        if prediction[i] >= 0:
            ypred[i] = 1
        else:
            ypred[i] = 0
    return ypred

for i in range(n):
    expectedT = y.transpose()
    xT = x.transpose()
    ypred = predict(xT)
    error = expectedT - ypred
    if sum(error) == 0:
        break
    # updating the weights
    w[0:2] = np.add(w[0:2], learningRate*(np.dot(error, x)))
    w[2] += learningRate*sum(error)
    globalError = globalError + np.square(error)
After the training the error is 0
print(error)
# [0. 0. 0. 0.]
And the weights are as follows
print(w)
#[0.1, 0.1, -0.00999999999999999]
The perceptron can now be used as an AND gate:
predict(x.transpose())
#array([0., 1., 1., 1.])
Hope that helps

Neural Network MNIST: Backpropagation is correct, but training/test accuracy very low

I am building a neural network to learn to recognize handwritten digits from MNIST. I have confirmed that backpropagation calculates the gradients perfectly (gradient checking gives error < 10 ^ -10).
It appears that no matter how I train the weights, the cost function always tends towards around 3.24-3.25 (never below that, just approaching from above) and the training/test set accuracy is very low (around 11% for the test set). It appears that the h values in the end are all very close to 0.1 and to each other.
I cannot find why my program cannot produce better results. I was wondering if anyone could maybe take a look at my code and please tell me any reasons for this occurring. Thank you so much for all your help, I really appreciate it!
Here is my Python code:
import numpy as np
import math
from tensorflow.examples.tutorials.mnist import input_data

# Neural network has four layers
# The input layer has 784 nodes
# The two hidden layers each have 5 nodes
# The output layer has 10 nodes
num_layer = 4
num_node = [784,5,5,10]
num_output_node = 10

# 30000 training sets are used
# 10000 test sets are used
# Can be adjusted
Ntrain = 30000
Ntest = 10000

# Sigmoid Function
def g(X):
    return 1/(1 + np.exp(-X))

# Forwardpropagation
def h(W,X):
    a = X
    for l in range(num_layer - 1):
        a = np.insert(a,0,1)
        z = np.dot(a,W[l])
        a = g(z)
    return a

# Cost Function
def J(y, W, X, Lambda):
    cost = 0
    for i in range(Ntrain):
        H = h(W,X[i])
        for k in range(num_output_node):
            cost = cost + y[i][k] * math.log(H[k]) + (1-y[i][k]) * math.log(1-H[k])
    regularization = 0
    for l in range(num_layer - 1):
        for i in range(num_node[l]):
            for j in range(num_node[l+1]):
                regularization = regularization + W[l][i+1][j] ** 2
    return (-1/Ntrain * cost + Lambda / (2*Ntrain) * regularization)

# Backpropagation - confirmed to be correct
# Algorithm based on https://www.coursera.org/learn/machine-learning/lecture/1z9WW/backpropagation-algorithm
# Returns D, the value of the gradient
def BackPropagation(y, W, X, Lambda):
    delta = np.empty(num_layer-1, dtype = object)
    for l in range(num_layer - 1):
        delta[l] = np.zeros((num_node[l]+1,num_node[l+1]))
    for i in range(Ntrain):
        A = np.empty(num_layer-1, dtype = object)
        a = X[i]
        for l in range(num_layer - 1):
            A[l] = a
            a = np.insert(a,0,1)
            z = np.dot(a,W[l])
            a = g(z)
        diff = a - y[i]
        delta[num_layer-2] = delta[num_layer-2] + np.outer(np.insert(A[num_layer-2],0,1),diff)
        for l in range(num_layer-2):
            index = num_layer-2-l
            diff = np.multiply(np.dot(np.array([W[index][k+1] for k in range(num_node[index])]), diff), np.multiply(A[index], 1-A[index]))
            delta[index-1] = delta[index-1] + np.outer(np.insert(A[index-1],0,1),diff)
    D = np.empty(num_layer-1, dtype = object)
    for l in range(num_layer - 1):
        D[l] = np.zeros((num_node[l]+1,num_node[l+1]))
    for l in range(num_layer-1):
        for i in range(num_node[l]+1):
            if i == 0:
                for j in range(num_node[l+1]):
                    D[l][i][j] = 1/Ntrain * delta[l][i][j]
            else:
                for j in range(num_node[l+1]):
                    D[l][i][j] = 1/Ntrain * (delta[l][i][j] + Lambda * W[l][i][j])
    return D

# Neural network - this is where the learning/adjusting of weights occurs
# W is the weights
# learn is the learning rate
# iterations is the number of iterations we pass over the training set
# Lambda is the regularization parameter
def NeuralNetwork(y, X, learn, iterations, Lambda):
    W = np.empty(num_layer-1, dtype = object)
    for l in range(num_layer - 1):
        W[l] = np.random.rand(num_node[l]+1,num_node[l+1])/100
    for k in range(iterations):
        print(J(y, W, X, Lambda))
        D = BackPropagation(y, W, X, Lambda)
        for l in range(num_layer-1):
            W[l] = W[l] - learn * D[l]
    print(J(y, W, X, Lambda))
    return W

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# Training data, read from MNIST
inputpix = []
output = []
for i in range(Ntrain):
    inputpix.append(2 * np.array(mnist.train.images[i]) - 1)
    output.append(np.array(mnist.train.labels[i]))
np.savetxt('input.txt', inputpix, delimiter=' ')
np.savetxt('output.txt', output, delimiter=' ')

# Train the weights
finalweights = NeuralNetwork(output, inputpix, 2, 5, 1)

# Test data
inputtestpix = []
outputtest = []
for i in range(Ntest):
    inputtestpix.append(2 * np.array(mnist.test.images[i]) - 1)
    outputtest.append(np.array(mnist.test.labels[i]))
np.savetxt('inputtest.txt', inputtestpix, delimiter=' ')
np.savetxt('outputtest.txt', outputtest, delimiter=' ')

# Determine the accuracy of the training data
count = 0
for i in range(Ntrain):
    H = h(finalweights,inputpix[i])
    print(H)
    for j in range(num_output_node):
        if H[j] == np.amax(H) and output[i][j] == 1:
            count = count + 1
print(count/Ntrain)

# Determine the accuracy of the test data
count = 0
for i in range(Ntest):
    H = h(finalweights,inputtestpix[i])
    print(H)
    for j in range(num_output_node):
        if H[j] == np.amax(H) and outputtest[i][j] == 1:
            count = count + 1
print(count/Ntest)
Your network is tiny; 5 neurons per hidden layer make it basically a linear model. Increase it to 256 per layer.
Notice that the trivial linear model has 784 * 10 + 10 (biases) parameters, adding up to 7850 floats. Your neural network, on the other hand, has 784 * 5 + 5 + 5 * 5 + 5 + 5 * 10 + 10 = 3925 + 30 + 60 = 4015. In other words, despite being a nonlinear neural network, it is actually a simpler model than a trivial logistic regression applied to this problem. And logistic regression obtains around 11% error on its own, so you cannot really expect to beat it. Of course this is not a strict argument, but it should give you some intuition for why it should not work.
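A quick way to reproduce those counts (a throwaway sketch; the helper function is my own):
def dense_params(layer_sizes):
    # weights plus biases for a stack of fully connected layers
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(dense_params([784, 10]))            # plain logistic regression: 7850
print(dense_params([784, 5, 5, 10]))      # the question's 784-5-5-10 network: 4015
print(dense_params([784, 256, 256, 10]))  # with 256-unit hidden layers: 269322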
The second issue is related to the other hyperparameters; you seem to be using:
a huge learning rate (is it 2?); it should be more on the order of 0.0001
very few training iterations (are you really executing just 5 epochs?)
a huge regularization parameter (it is set to 1), so your network is heavily penalised for learning anything; again, change it to something an order of magnitude smaller (see the sketch after this list for how these map onto your NeuralNetwork call)
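A sketch of how those suggestions map onto the question's own NeuralNetwork(y, X, learn, iterations, Lambda) call; the exact values are illustrative and would still need tuning:
num_node = [784, 256, 256, 10]   # wider hidden layers, as suggested above
finalweights = NeuralNetwork(output, inputpix,
                             0.0001,  # learning rate, instead of 2
                             100,     # many more passes over the data, instead of 5
                             0.1)     # much weaker regularization, instead of 1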
The NN architecture is most likely under-fitting. Maybe the learning rate is too high or too low, or there are issues with the regularization parameter.

TensorFlow resize nearest neighbor approach doesn't optimize weights

I'm a beginner in TensorFlow and I'm working on a model which colorizes greyscale images. For the last part of the model, the paper says:
Once the features are fused, they are processed by a set of
convolutions and upsampling layers, the latter which consist of simply
upsampling the input by using the nearest neighbour technique so that
the output is twice as wide and twice as tall.
When I tried to implement it in TensorFlow I used tf.image.resize_nearest_neighbor for the upsampling, but I found that the cost didn't change in any of the epochs except the 2nd one, whereas without it the cost is optimized and does change.
This is the relevant part of the code:
def Model(Input_images):
    #some code till the following last part
    Color_weights = {'W_conv1':tf.Variable(tf.random_normal([3,3,256,128])),
                     'W_conv2':tf.Variable(tf.random_normal([3,3,128,64])),
                     'W_conv3':tf.Variable(tf.random_normal([3,3,64,64])),
                     'W_conv4':tf.Variable(tf.random_normal([3,3,64,32])),
                     'W_conv5':tf.Variable(tf.random_normal([3,3,32,2]))}
    Color_biases = {'b_conv1':tf.Variable(tf.random_normal([128])),
                    'b_conv2':tf.Variable(tf.random_normal([64])),
                    'b_conv3':tf.Variable(tf.random_normal([64])),
                    'b_conv4':tf.Variable(tf.random_normal([32])),
                    'b_conv5':tf.Variable(tf.random_normal([2]))}
    Color_layer1 = tf.nn.relu(Conv2d(Fuse, Color_weights['W_conv1'], 1) + Color_biases['b_conv1'])
    Color_layer1_up = tf.image.resize_nearest_neighbor(Color_layer1,[56,56])
    Color_layer2 = tf.nn.relu(Conv2d(Color_layer1_up, Color_weights['W_conv2'], 1) + Color_biases['b_conv2'])
    Color_layer3 = tf.nn.relu(Conv2d(Color_layer2, Color_weights['W_conv3'], 1) + Color_biases['b_conv3'])
    Color_layer3_up = tf.image.resize_nearest_neighbor(Color_layer3,[112,112])
    Color_layer4 = tf.nn.relu(Conv2d(Color_layer3, Color_weights['W_conv4'], 1) + Color_biases['b_conv4'])
    return Color_layer4
The Training Code
Prediction = Model(Input_images)
Colorization_MSE = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(Prediction,tf.Variable(tf.random_normal([2,112,112,32]))))
Optmizer = tf.train.AdadeltaOptimizer(learning_rate= 0.05).minimize(Colorization_MSE)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
for epoch in range(EpochsNum):
    epoch_loss = 0
    Batch_indx = 1
    for i in range(int(ExamplesNum / Batch_size)):  #Over batches
        print("Batch Num ",i + 1)
        ReadNextBatch()
        a, c = sess.run([Optmizer,Colorization_MSE],feed_dict={Input_images:Batch_GreyImages})
        epoch_loss += c
    print("epoch: ",epoch + 1, ",Los: ",epoch_loss)
So what is wrong with my logic? Or, if the problem is in tf.image.resize_nearest_neighbor, what should I do, or what is its replacement?
OK, I solved it: I noticed that tf.random_normal was the problem, and when I replaced it with tf.truncated_normal it works well.
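For reference, this is roughly what that change looks like for one of the weight tensors (TF1-style API; the explicit stddev is my own addition, not part of the original fix):
# truncated normal avoids the occasional large initial weights that random_normal produces
W_conv1 = tf.Variable(tf.truncated_normal([3, 3, 256, 128], stddev=0.1))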

Backpropagation, all outputs tend to 1

I have this backpropagation implementation in MATLAB, and I have an issue with training it. Early on in the training phase, all of the outputs go to 1. I have normalized the input data (except the desired class, which is used to generate a binary target vector) to the interval [0, 1]. I have been referring to the implementation in Artificial Intelligence: A Modern Approach, Norvig et al.
Having checked the pseudocode against my code (and studied the algorithm for some time), I cannot spot the error. I have not been using MATLAB for that long, so I have been trying to use the documentation where needed.
I have also tried different numbers of nodes in the hidden layer and different learning rates (ALPHA).
The target data encodings are as follows: when the target is to classify as, say 2, the target vector would be [0,1,0], say it were 1, [1, 0, 0] so on and so forth. I have also tried using different values for the target, such as (for class 1 for example) [0.5, 0, 0].
I noticed that some of my weights go above 1, resulting in large net values.
%Topological constants
NUM_HIDDEN = 8+1;%written as n+1 so is clear bias is used
NUM_OUT = 3;

%Training constants
ALPHA = 0.01;
TARG_ERR = 0.01;
MAX_EPOCH = 50000;

%Read and normalize data file.
X = normdata(dlmread('iris.data'));
X = shuffle(X);
%X_test = normdata(dlmread('iris2.data'));
%epocherrors = fopen('epocherrors.txt', 'w');

%Weight matrices.
%Features constitute size(X, 2)-1, however size is (X, 2) to allow for
%appending bias.
w_IH = rand(size(X, 2), NUM_HIDDEN)-(0.5*rand(size(X, 2), NUM_HIDDEN));
w_HO = rand(NUM_HIDDEN+1, NUM_OUT)-(0.5*rand(NUM_HIDDEN+1, NUM_OUT));%+1 for bias

%Layer nets
net_H = zeros(NUM_HIDDEN, 1);
net_O = zeros(NUM_OUT, 1);

%Layer outputs
out_H = zeros(NUM_HIDDEN, 1);
out_O = zeros(NUM_OUT, 1);

%Layer deltas
d_H = zeros(NUM_HIDDEN, 1);
d_O = zeros(NUM_OUT, 1);

%Control variables
error = inf;
epoch = 0;

%Run the algorithm.
while error > TARG_ERR && epoch < MAX_EPOCH
    for n=1:size(X, 1)
        x = [X(n, 1:size(X, 2)-1) 1]';%Add bias for hiddens & transpose to column vector.
        o = X(n, size(X, 2));
        %Forward propagate.
        net_H = w_IH'*x;%Transposed w.
        out_H = [sigmoid(net_H); 1]; %Append 1 for bias to outputs
        net_O = w_HO'*out_H;
        out_O = sigmoid(net_O); %Again, transposed w.
        %Calculate output deltas.
        d_O = ((targetVec(o, NUM_OUT)-out_O) .* (out_O .* (1-out_O)));
        %Calculate hidden deltas.
        for i=1:size(w_HO, 1);
            delta_weight = 0;
            for j=1:size(w_HO, 2)
                delta_weight = delta_weight + d_O(j)*w_HO(i, j);
            end
            d_H(i) = (out_H(i)*(1-out_H(i)))*delta_weight;
        end
        %Update hidden-output weights
        for i=1:size(w_HO, 1)
            for j=1:size(w_HO, 2)
                w_HO(i, j) = w_HO(i, j) + (ALPHA*out_H(i)*d_O(j));
            end
        end
        %Update input-hidden weights.
        for i=1:size(w_IH, 1)
            for j=1:size(w_IH, 2)
                w_IH(i, j) = w_IH(i, j) + (ALPHA*x(i)*d_H(j));
            end
        end
        out_O
        o
        %out_H
        %w_IH
        %w_HO
        %d_O
        %d_H
    end
end

function outs = sigmoid(nets)
    outs = zeros(size(nets, 1), 1);
    for i=1:size(nets, 1)
        if nets(i) < -45
            outs(i) = 0;
        elseif nets(i) > 45
            outs(i) = 1;
        else
            outs(i) = 1/1+exp(-nets(i));
        end
    end
end
From what we've established in the comments, the only thing that comes to my mind is the set of recipes written down together in this great NN archive:
ftp://ftp.sas.com/pub/neural/FAQ2.html#questions
The first things you could try are:
1) How to avoid overflow in the logistic function? That is probably the problem - many times when I've implemented NNs, the problem was such an overflow (see the sketch after this list).
2) How should categories be encoded?
And more generally:
3) How does ill-conditioning affect NN training?
4) Help! My NN won't learn! What should I do?
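For point 1, a minimal sketch of an overflow-safe logistic function. It is in Python/NumPy (rather than MATLAB) simply to match the other sketches in this thread, and the cut-off mirrors the one already used in the question's code:
import numpy as np

def stable_sigmoid(net):
    # clip the net input so exp() cannot overflow
    net = np.clip(net, -45.0, 45.0)
    return 1.0 / (1.0 + np.exp(-net))

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # approximately [0, 0.5, 1]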
After the discussion it turns out the problem lies within the sigmoid function:
function outs = sigmoid(nets)
%...
outs(i) = 1/1+exp(-nets(i)); % parenthesis missing!!!!!!
%...
end
It should be:
function outs = sigmoid(nets)
%...
outs(i) = 1/(1+exp(-nets(i)));
%...
end
The missing parentheses meant that the sigmoid output was sometimes larger than 1. That made the gradient calculation incorrect (because it was no longer the gradient of this function), which could make the gradient negative, so the delta for the output layer was most of the time pointing in the wrong direction. After the fix (and after correctly maintaining the error variable - this seems to be missing in your code) everything seems to work fine.
Besides that, there are two other main problems with this code:
1) No bias. Without the bias each neuron can only represent a line which crosses the origin. If data is normalized (i.e. values are between 0 and 1), some configurations are inseparable.
2) Lack of guarding against high gradient values (point 1 in my previous answer); a small sketch of one possible guard follows.
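For point 2, one simple guard is to clip the gradient (or the weight update) elementwise before applying it. A minimal Python/NumPy illustration, with names and constants of my own choosing; the same idea carries over to the MATLAB code above:
import numpy as np

def clipped_update(w, grad, alpha=0.01, clip=1.0):
    # cap each gradient component at +/- clip before taking the step
    return w + alpha * np.clip(grad, -clip, clip)

w = np.array([0.2, -0.4, 1.3])
print(clipped_update(w, np.array([10.0, -0.5, 0.2])))  # the first component's step is capped at alpha * 1.0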
