Backpropagation, all outputs tend to 1 - machine-learning

I have this Backpropagation implementation in MATLAB, and have an issue with training it. Early on in the training phase, all of the outputs go to 1. I have normalized the input data(except the desired class, which is used to generate a binary target vector) to the interval [0, 1]. I have been referring to the implementation in Artificial Intelligence: A Modern Approach, Norvig et al.
Having checked the pseudocode against my code(and studying the algorithm for some time), I cannot spot the error. I have not been using MATLAB for that long, so have been trying to use the documentation where needed.
I have also tried different amounts of nodes in the hidden layer and different learning rates (ALPHA).
The target data encodings are as follows: when the target is to classify as, say 2, the target vector would be [0,1,0], say it were 1, [1, 0, 0] so on and so forth. I have also tried using different values for the target, such as (for class 1 for example) [0.5, 0, 0].
I noticed that some of my weights go above 1, resulting in large net values.
%Topological constants
NUM_HIDDEN = 8+1;%written as n+1 so is clear bias is used
NUM_OUT = 3;
%Training constants
ALPHA = 0.01;
TARG_ERR = 0.01;
MAX_EPOCH = 50000;
%Read and normalize data file.
X = normdata(dlmread('iris.data'));
X = shuffle(X);
%X_test = normdata(dlmread('iris2.data'));
%epocherrors = fopen('epocherrors.txt', 'w');
%Weight matrices.
%Features constitute size(X, 2)-1, however size is (X, 2) to allow for
%appending bias.
w_IH = rand(size(X, 2), NUM_HIDDEN)-(0.5*rand(size(X, 2), NUM_HIDDEN));
w_HO = rand(NUM_HIDDEN+1, NUM_OUT)-(0.5*rand(NUM_HIDDEN+1, NUM_OUT));%+1 for bias
%Layer nets
net_H = zeros(NUM_HIDDEN, 1);
net_O = zeros(NUM_OUT, 1);
%Layer outputs
out_H = zeros(NUM_HIDDEN, 1);
out_O = zeros(NUM_OUT, 1);
%Layer deltas
d_H = zeros(NUM_HIDDEN, 1);
d_O = zeros(NUM_OUT, 1);
%Control variables
error = inf;
epoch = 0;
%Run the algorithm.
while error > TARG_ERR && epoch < MAX_EPOCH
for n=1:size(X, 1)
x = [X(n, 1:size(X, 2)-1) 1]';%Add bias for hiddens & transpose to column vector.
o = X(n, size(X, 2));
%Forward propagate.
net_H = w_IH'*x;%Transposed w.
out_H = [sigmoid(net_H); 1]; %Append 1 for bias to outputs
net_O = w_HO'*out_H;
out_O = sigmoid(net_O); %Again, transposed w.
%Calculate output deltas.
d_O = ((targetVec(o, NUM_OUT)-out_O) .* (out_O .* (1-out_O)));
%Calculate hidden deltas.
for i=1:size(w_HO, 1);
delta_weight = 0;
for j=1:size(w_HO, 2)
delta_weight = delta_weight + d_O(j)*w_HO(i, j);
end
d_H(i) = (out_H(i)*(1-out_H(i)))*delta_weight;
end
%Update hidden-output weights
for i=1:size(w_HO, 1)
for j=1:size(w_HO, 2)
w_HO(i, j) = w_HO(i, j) + (ALPHA*out_H(i)*d_O(j));
end
end
%Update input-hidden weights.
for i=1:size(w_IH, 1)
for j=1:size(w_IH, 2)
w_IH(i, j) = w_IH(i, j) + (ALPHA*x(i)*d_H(j));
end
end
out_O
o
%out_H
%w_IH
%w_HO
%d_O
%d_H
end
end
function outs = sigmoid(nets)
outs = zeros(size(nets, 1), 1);
for i=1:size(nets, 1)
if nets(i) < -45
outs(i) = 0;
elseif nets(i) > 45
outs(i) = 1;
else
outs(i) = 1/1+exp(-nets(i));
end
end
end

From what we've established in the comments the only thing that comes in my mind are all recipes written down together in this great NN archive:
ftp://ftp.sas.com/pub/neural/FAQ2.html#questions
First things you could try are:
1) How to avoid overflow in the logistic function? Probably that's the problem - many times I've implemented NNs the problem was with such an overflow.
2) How should categories be encoded?
And more general:
3) How does ill-conditioning affect NN training?
4) Help! My NN won't learn! What should I do?

After the discussion it turns out the problem lies within the sigmoid function:
function outs = sigmoid(nets)
%...
outs(i) = 1/1+exp(-nets(i)); % parenthesis missing!!!!!!
%...
end
It should be:
function outs = sigmoid(nets)
%...
outs(i) = 1/(1+exp(-nets(i)));
%...
end
The lack of parenthesis caused that the sigmoid output was larger than 1 sometimes. That made the gradient calculation incorrect (because it wasn't a gradient of this function). This caused the gradient to be negative. And this caused that the delta for the output layer was most of the time in the wrong direction. After the fix (the after correctly maintaining the error variable - this seems to be missing in your code) all seems to work fine.
Beside that, there are two other main problems with this code:
1) No bias. Without the bias each neuron can only represent a line which crosses the origin. If data is normalized (i.e. values are between 0 and 1), some configurations are inseparable.
2) Lack of guarding against high gradient values (point 1 in my previous answer).

Related

Modifying the loss in ppo in stable-baselines3

I'm trying to implement an addition to the loss function of the ppo algorithm in stable-baselines3. For this I collected additional observations for the states s(t-10) and s(t+1) which I can access in the train-function of the PPO class in ppo.py as part of the rollout_buffer.
I'm using a 3-layer-mlp as my network architecture and need the outputs of the second layer for the triplet (s(t-α), s(t), s(t+1)) to use them to calculate L = max(d(s(t+1) , s(t)) − d(s(t+1) , s(t−α)) + γ, 0), where d is the L2-distance.
Finally I want to add this term to the old loss, so loss = loss + 0.3 * L
This is my implementation starting with the original loss in line 242:
loss = policy_loss + self.ent_coef * entropy_loss + self.vf_coef * value_loss
###############################
net1 = nn.Sequential(*list(self.policy.mlp_extractor.policy_net.children())[:-1])
L_losses = []
a = 0
obs = rollout_data.observations
obs_alpha = rollout_data.observations_alpha
obs_plusone = rollout_data.observations_plusone
inds = rollout_data.inds
for i in inds:
if i > alpha: # only use observations for which L can be calculated
fs_t = net1(obs[a])
fs_talpha = net1(obs_alpha[a])
fs_tone = net1(obs_plusone[a])
L = max(
th.norm(th.subtract(fs_tone, fs_t)) - th.norm(th.subtract(fs_tone, fs_talpha)) + 1.0, 0.0)
L_losses.append(L)
else:
L_losses.append(0)
a += 1
L_loss = th.mean(th.FloatTensor(L_losses))
loss += 0.3 * L_loss
So with net1 I tried to get a clone of the original network with the outputs from the second layer. I am unsure if this is the right way to do this.
I do have some questions about my approach as the resulting performance is slightly worse compared to without the added term although it should be slightly better:
Is my way of getting the outputs of the second layer of the mlp network working?
When loss.backward() is called can the gradient be calculated correctly (with the new term included)?

How to do 2-layer nested FOR loop in PYTORCH?

I am learning to implement the Factorization Machine in Pytorch.
And there should be some feature crossing operations.
For example, I've got three features [A,B,C], after embedding, they are [vA,vB,vC], so the feature crossing is "[vA·vB], [vA·vC], [vB·vc]".
I know this operation can be simplified by the following:
It can be implemented by MATRIX OPERATIONS.
But this only gives a final result, say, a single value.
The question is, how to get all cross_vec in the following without doing FOR loop:
note: size of "feature_emb" is [batch_size x feature_len x embedding_size]
g_feature = 0
for i in range(self.featurn_len):
for j in range(self.featurn_len):
if j <= i: continue
cross_vec = feature_emb[:,i,:] * feature_emb[:,j,:]
g_feature += torch.sum(cross_vec, dim=1)
You can
cross_vec = (feature_emb[:, None, ...] * feature_emb[..., None, :]).sum(dim=-1)
This should give you corss_vec of shape (batch_size, feature_len, feature_len).
Alternatively, you can use torch.bmm
cross_vec = torch.bmm(feature_emb, feature_emb.transpose(1, 2))

Understanding code wrt Logistic Regression using gradient descent

I was following Siraj Raval's videos on logistic regression using gradient descent :
1) Link to longer video :
https://www.youtube.com/watch?v=XdM6ER7zTLk&t=2686s
2) Link to shorter video :
https://www.youtube.com/watch?v=xRJCOz3AfYY&list=PL2-dafEMk2A7mu0bSksCGMJEmeddU_H4D
In the videos he talks about using gradient descent to reduce the error for a set number of iterations so that the function converges(slope becomes zero).
He also illustrates the process via code. The following are the two main functions from the code :
def step_gradient(b_current, m_current, points, learningRate):
b_gradient = 0
m_gradient = 0
N = float(len(points))
for i in range(0, len(points)):
x = points[i, 0]
y = points[i, 1]
b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
new_b = b_current - (learningRate * b_gradient)
new_m = m_current - (learningRate * m_gradient)
return [new_b, new_m]
def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
b = starting_b
m = starting_m
for i in range(num_iterations):
b, m = step_gradient(b, m, array(points), learning_rate)
return [b, m]
#The above functions are called below:
learning_rate = 0.0001
initial_b = 0 # initial y-intercept guess
initial_m = 0 # initial slope guess
num_iterations = 1000
[b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
# code taken from Siraj Raval's github page
Why does the value of b & m continue to update for all the iterations? After a certain number of iterations, the function will converge, when we find the values of b & m that give slope = 0.
So why do we continue iteration after that point and continue updating b & m ?
This way, aren't we losing the 'correct' b & m values? How is learning rate helping the convergence process if we continue to update values after converging? Thus, why is there no check for convergence, and so how is this actually working?
In practice, most likely you will not reach to slope 0 exactly. Thinking of your loss function as a bowl. If your learning rate is too high, it is possible to overshoot over the lowest point of the bowl. On the contrary, if the learning rate is too low, your learning will become too slow and won't reach the lowest point of the bowl before all iterations are done.
That's why in machine learning, the learning rate is an important hyperparameter to tune.
Actually, once we reach a slope 0; b_gradient and m_gradient will become 0;
thus, for :
new_b = b_current - (learningRate * b_gradient)
new_m = m_current - (learningRate * m_gradient)
new_b and new_m will remain the old correct values; as nothing will be subtracted from them.

tensorflow resize nearest neighbor approach don't optmize weights

I'm beginner in tensorflow and i'm working on a Model which Colorize Greyscale images and in the last part of the model the paper say :
Once the features are fused, they are processed by a set of
convolutions and upsampling layers, the latter which consist of simply
upsampling the input by using the nearest neighbour technique so that
the output is twice as wide and twice as tall.
when i tried to implement it in tensorflow i used tf.image.resize_nearest_neighbor for upsampling but when i used it i found the cost didn't change in all the epochs except of the 2nd epoch, and without it the cost is optmized and changed
This part of code
def Model(Input_images):
#some code till the following last part
Color_weights = {'W_conv1':tf.Variable(tf.random_normal([3,3,256,128])),'W_conv2':tf.Variable(tf.random_normal([3,3,128,64])),
'W_conv3':tf.Variable(tf.random_normal([3,3,64,64])),
'W_conv4':tf.Variable(tf.random_normal([3,3,64,32])),'W_conv5':tf.Variable(tf.random_normal([3,3,32,2]))}
Color_biases = {'b_conv1':tf.Variable(tf.random_normal([128])),'b_conv2':tf.Variable(tf.random_normal([64])),'b_conv3':tf.Variable(tf.random_normal([64])),
'b_conv4':tf.Variable(tf.random_normal([32])),'b_conv5':tf.Variable(tf.random_normal([2]))}
Color_layer1 = tf.nn.relu(Conv2d(Fuse, Color_weights['W_conv1'], 1) + Color_biases['b_conv1'])
Color_layer1_up = tf.image.resize_nearest_neighbor(Color_layer1,[56,56])
Color_layer2 = tf.nn.relu(Conv2d(Color_layer1_up, Color_weights['W_conv2'], 1) + Color_biases['b_conv2'])
Color_layer3 = tf.nn.relu(Conv2d(Color_layer2, Color_weights['W_conv3'], 1) + Color_biases['b_conv3'])
Color_layer3_up = tf.image.resize_nearest_neighbor(Color_layer3,[112,112])
Color_layer4 = tf.nn.relu(Conv2d(Color_layer3, Color_weights['W_conv4'], 1) + Color_biases['b_conv4'])
return Color_layer4
The Training Code
Prediction = Model(Input_images)
Colorization_MSE = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(Prediction,tf.Variable(tf.random_normal([2,112,112,32]))))
Optmizer = tf.train.AdadeltaOptimizer(learning_rate= 0.05).minimize(Colorization_MSE)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
for epoch in range(EpochsNum):
epoch_loss = 0
Batch_indx = 1
for i in range(int(ExamplesNum / Batch_size)):#Over batches
print("Batch Num ",i + 1)
ReadNextBatch()
a, c = sess.run([Optmizer,Colorization_MSE],feed_dict={Input_images:Batch_GreyImages})
epoch_loss += c
print("epoch: ",epoch + 1, ",Los: ",epoch_loss)
So what is wrong with my logic or if the problem is in
tf.image.resize_nearest_neighbor what should i do or what is it's replacement ?
Ok, i solved it, i noticed that tf.random normal was the problem and when i replaced it with tf.truncated normal it is works well

Theano gradient doesn't work with .sum(), only .mean()?

I'm trying to learn theano and decided to implement linear regression (using their Logistic Regression from the tutorial as a template). I'm getting a wierd thing where T.grad doesn't work if my cost function uses .sum(), but does work if my cost function uses .mean(). Code snippet:
(THIS DOESN'T WORK, RESULTS IN A W VECTOR FULL OF NANs):
x = T.matrix('x')
y = T.vector('y')
w = theano.shared(rng.randn(feats), name='w')
b = theano.shared(0., name="b")
# now we do the actual expressions
h = T.dot(x,w) + b # prediction is dot product plus bias
single_error = .5 * ((h - y)**2)
cost = single_error.sum()
gw, gb = T.grad(cost, [w,b])
train = theano.function(inputs=[x,y], outputs=[h, single_error], updates = ((w, w - .1*gw), (b, b - .1*gb)))
predict = theano.function(inputs=[x], outputs=h)
for i in range(training_steps):
pred, err = train(D[0], D[1])
(THIS DOES WORK, PERFECTLY):
x = T.matrix('x')
y = T.vector('y')
w = theano.shared(rng.randn(feats), name='w')
b = theano.shared(0., name="b")
# now we do the actual expressions
h = T.dot(x,w) + b # prediction is dot product plus bias
single_error = .5 * ((h - y)**2)
cost = single_error.mean()
gw, gb = T.grad(cost, [w,b])
train = theano.function(inputs=[x,y], outputs=[h, single_error], updates = ((w, w - .1*gw), (b, b - .1*gb)))
predict = theano.function(inputs=[x], outputs=h)
for i in range(training_steps):
pred, err = train(D[0], D[1])
The only difference is in the cost = single_error.sum() vs single_error.mean(). What I don't understand is that the gradient should be the exact same in both cases (one is just a scaled version of the other). So what gives?
The learning rate (0.1) is way to big. Using mean make it divided by the batch size, so this help. But I'm pretty sure you should make it much smaller. Not just dividing by the batch size (which is equivalent to using mean).
Try a learning rate of 0.001.
Try dividing your gradient descent step size by the number of training examples.

Resources