Understanding code wrt Logistic Regression using gradient descent - machine-learning

I was following Siraj Raval's videos on logistic regression using gradient descent :
1) Link to longer video :
https://www.youtube.com/watch?v=XdM6ER7zTLk&t=2686s
2) Link to shorter video :
https://www.youtube.com/watch?v=xRJCOz3AfYY&list=PL2-dafEMk2A7mu0bSksCGMJEmeddU_H4D
In the videos he talks about using gradient descent to reduce the error for a set number of iterations so that the function converges(slope becomes zero).
He also illustrates the process via code. The following are the two main functions from the code :
def step_gradient(b_current, m_current, points, learningRate):
b_gradient = 0
m_gradient = 0
N = float(len(points))
for i in range(0, len(points)):
x = points[i, 0]
y = points[i, 1]
b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
new_b = b_current - (learningRate * b_gradient)
new_m = m_current - (learningRate * m_gradient)
return [new_b, new_m]
def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
b = starting_b
m = starting_m
for i in range(num_iterations):
b, m = step_gradient(b, m, array(points), learning_rate)
return [b, m]
#The above functions are called below:
learning_rate = 0.0001
initial_b = 0 # initial y-intercept guess
initial_m = 0 # initial slope guess
num_iterations = 1000
[b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
# code taken from Siraj Raval's github page
Why does the value of b & m continue to update for all the iterations? After a certain number of iterations, the function will converge, when we find the values of b & m that give slope = 0.
So why do we continue iteration after that point and continue updating b & m ?
This way, aren't we losing the 'correct' b & m values? How is learning rate helping the convergence process if we continue to update values after converging? Thus, why is there no check for convergence, and so how is this actually working?

In practice, most likely you will not reach to slope 0 exactly. Thinking of your loss function as a bowl. If your learning rate is too high, it is possible to overshoot over the lowest point of the bowl. On the contrary, if the learning rate is too low, your learning will become too slow and won't reach the lowest point of the bowl before all iterations are done.
That's why in machine learning, the learning rate is an important hyperparameter to tune.

Actually, once we reach a slope 0; b_gradient and m_gradient will become 0;
thus, for :
new_b = b_current - (learningRate * b_gradient)
new_m = m_current - (learningRate * m_gradient)
new_b and new_m will remain the old correct values; as nothing will be subtracted from them.

Related

Modifying the loss in ppo in stable-baselines3

I'm trying to implement an addition to the loss function of the ppo algorithm in stable-baselines3. For this I collected additional observations for the states s(t-10) and s(t+1) which I can access in the train-function of the PPO class in ppo.py as part of the rollout_buffer.
I'm using a 3-layer-mlp as my network architecture and need the outputs of the second layer for the triplet (s(t-α), s(t), s(t+1)) to use them to calculate L = max(d(s(t+1) , s(t)) − d(s(t+1) , s(t−α)) + γ, 0), where d is the L2-distance.
Finally I want to add this term to the old loss, so loss = loss + 0.3 * L
This is my implementation starting with the original loss in line 242:
loss = policy_loss + self.ent_coef * entropy_loss + self.vf_coef * value_loss
###############################
net1 = nn.Sequential(*list(self.policy.mlp_extractor.policy_net.children())[:-1])
L_losses = []
a = 0
obs = rollout_data.observations
obs_alpha = rollout_data.observations_alpha
obs_plusone = rollout_data.observations_plusone
inds = rollout_data.inds
for i in inds:
if i > alpha: # only use observations for which L can be calculated
fs_t = net1(obs[a])
fs_talpha = net1(obs_alpha[a])
fs_tone = net1(obs_plusone[a])
L = max(
th.norm(th.subtract(fs_tone, fs_t)) - th.norm(th.subtract(fs_tone, fs_talpha)) + 1.0, 0.0)
L_losses.append(L)
else:
L_losses.append(0)
a += 1
L_loss = th.mean(th.FloatTensor(L_losses))
loss += 0.3 * L_loss
So with net1 I tried to get a clone of the original network with the outputs from the second layer. I am unsure if this is the right way to do this.
I do have some questions about my approach as the resulting performance is slightly worse compared to without the added term although it should be slightly better:
Is my way of getting the outputs of the second layer of the mlp network working?
When loss.backward() is called can the gradient be calculated correctly (with the new term included)?

How can i adapt multivariate normal for batch operations?

I am implementing from scratch the multivariate normal probability function in python. The formula for it is as follows:
I was able to code this version, where $\mathbf{x}$ is an input vector (single sample). However, i could make good use of numpy's matrix operations and extend it to the case of using $\mathbf{X}$ (set of samples) to return all the samples probabilities at once. This is equal to the scipy's implementation.
This is the code i have made:
def multivariate_normal(X, center, cov):
k = X.shape[0]
det_cov = np.linalg.det(cov)
inv_cov = np.linalg.inv(cov)
o = 1 / np.sqrt( (2 * np.pi) ** k * det_cov)
p = np.exp( -.5 * ( np.dot(np.dot((X - center).T, inv_cov), (X - center))))
return o * p
Thanks in advance.

Neural Network MNIST: Backpropagation is correct, but training/test accuracy very low

I am building a neural network to learn to recognize handwritten digits from MNIST. I have confirmed that backpropagation calculates the gradients perfectly (gradient checking gives error < 10 ^ -10).
It appears that no matter how I train the weights, the cost function always tends towards around 3.24-3.25 (never below that, just approaching from above) and the training/test set accuracy is very low (around 11% for the test set). It appears that the h values in the end are all very close to 0.1 and to each other.
I cannot find why my program cannot produce better results. I was wondering if anyone could maybe take a look at my code and please tell me any reasons for this occurring. Thank you so much for all your help, I really appreciate it!
Here is my Python code:
import numpy as np
import math
from tensorflow.examples.tutorials.mnist import input_data
# Neural network has four layers
# The input layer has 784 nodes
# The two hidden layers each have 5 nodes
# The output layer has 10 nodes
num_layer = 4
num_node = [784,5,5,10]
num_output_node = 10
# 30000 training sets are used
# 10000 test sets are used
# Can be adjusted
Ntrain = 30000
Ntest = 10000
# Sigmoid Function
def g(X):
return 1/(1 + np.exp(-X))
# Forwardpropagation
def h(W,X):
a = X
for l in range(num_layer - 1):
a = np.insert(a,0,1)
z = np.dot(a,W[l])
a = g(z)
return a
# Cost Function
def J(y, W, X, Lambda):
cost = 0
for i in range(Ntrain):
H = h(W,X[i])
for k in range(num_output_node):
cost = cost + y[i][k] * math.log(H[k]) + (1-y[i][k]) * math.log(1-H[k])
regularization = 0
for l in range(num_layer - 1):
for i in range(num_node[l]):
for j in range(num_node[l+1]):
regularization = regularization + W[l][i+1][j] ** 2
return (-1/Ntrain * cost + Lambda / (2*Ntrain) * regularization)
# Backpropagation - confirmed to be correct
# Algorithm based on https://www.coursera.org/learn/machine-learning/lecture/1z9WW/backpropagation-algorithm
# Returns D, the value of the gradient
def BackPropagation(y, W, X, Lambda):
delta = np.empty(num_layer-1, dtype = object)
for l in range(num_layer - 1):
delta[l] = np.zeros((num_node[l]+1,num_node[l+1]))
for i in range(Ntrain):
A = np.empty(num_layer-1, dtype = object)
a = X[i]
for l in range(num_layer - 1):
A[l] = a
a = np.insert(a,0,1)
z = np.dot(a,W[l])
a = g(z)
diff = a - y[i]
delta[num_layer-2] = delta[num_layer-2] + np.outer(np.insert(A[num_layer-2],0,1),diff)
for l in range(num_layer-2):
index = num_layer-2-l
diff = np.multiply(np.dot(np.array([W[index][k+1] for k in range(num_node[index])]), diff), np.multiply(A[index], 1-A[index]))
delta[index-1] = delta[index-1] + np.outer(np.insert(A[index-1],0,1),diff)
D = np.empty(num_layer-1, dtype = object)
for l in range(num_layer - 1):
D[l] = np.zeros((num_node[l]+1,num_node[l+1]))
for l in range(num_layer-1):
for i in range(num_node[l]+1):
if i == 0:
for j in range(num_node[l+1]):
D[l][i][j] = 1/Ntrain * delta[l][i][j]
else:
for j in range(num_node[l+1]):
D[l][i][j] = 1/Ntrain * (delta[l][i][j] + Lambda * W[l][i][j])
return D
# Neural network - this is where the learning/adjusting of weights occur
# W is the weights
# learn is the learning rate
# iterations is the number of iterations we pass over the training set
# Lambda is the regularization parameter
def NeuralNetwork(y, X, learn, iterations, Lambda):
W = np.empty(num_layer-1, dtype = object)
for l in range(num_layer - 1):
W[l] = np.random.rand(num_node[l]+1,num_node[l+1])/100
for k in range(iterations):
print(J(y, W, X, Lambda))
D = BackPropagation(y, W, X, Lambda)
for l in range(num_layer-1):
W[l] = W[l] - learn * D[l]
print(J(y, W, X, Lambda))
return W
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
# Training data, read from MNIST
inputpix = []
output = []
for i in range(Ntrain):
inputpix.append(2 * np.array(mnist.train.images[i]) - 1)
output.append(np.array(mnist.train.labels[i]))
np.savetxt('input.txt', inputpix, delimiter=' ')
np.savetxt('output.txt', output, delimiter=' ')
# Train the weights
finalweights = NeuralNetwork(output, inputpix, 2, 5, 1)
# Test data
inputtestpix = []
outputtest = []
for i in range(Ntest):
inputtestpix.append(2 * np.array(mnist.test.images[i]) - 1)
outputtest.append(np.array(mnist.test.labels[i]))
np.savetxt('inputtest.txt', inputtestpix, delimiter=' ')
np.savetxt('outputtest.txt', outputtest, delimiter=' ')
# Determine the accuracy of the training data
count = 0
for i in range(Ntrain):
H = h(finalweights,inputpix[i])
print(H)
for j in range(num_output_node):
if H[j] == np.amax(H) and output[i][j] == 1:
count = count + 1
print(count/Ntrain)
# Determine the accuracy of the test data
count = 0
for i in range(Ntest):
H = h(finalweights,inputtestpix[i])
print(H)
for j in range(num_output_node):
if H[j] == np.amax(H) and outputtest[i][j] == 1:
count = count + 1
print(count/Ntest)
Your network is tiny, 5 neurons make it basically a linear model. Increase it to 256 per layer.
Notice, that trivial linear model has 768 * 10 + 10 (biases) parameters, adding up to 7690 floats. Your neural network on the other hand has 768 * 5 + 5 + 5 * 5 + 5 + 5 * 10 + 10 = 3845 + 30 + 60 = 3935. In other words despite being nonlinear neural network, it is actualy a simpler model than a trivial logistic regression applied to this problem. And logistic regression obtains around 11% error on its own, thus you cannot really expect to beat it. Of course this is not a strict argument, but should give you some intuition for why it should not work.
Second issue is related to other hyperparameters, you seem to be using:
huge learning rate (is it 2?) it should be more of order 0.0001
very little training iterations (are you just executing 5 epochs?)
your regularization parameter is huge (it is set to 1), so your network is heavily penalised for learning anything, again - change it to something order of magnitude smaller
The NN architecture is most likely under-fitting. Maybe, the learning rate is high/low. Or there are most issues with the regularization parameter.

How do I implement the optimization function in tensorflow?

minΣ(||xi-Xci||^2+ λ||ci||),
s.t cii = 0,
where X is a matrix of shape d * n and C is of the shape n * n, xi and ci means a column of X and C separately.
X is known here and based on X we want to find C.
Usually with a loss like that you need to vectorize it, instead of working with columns:
loss = X - tf.matmul(X, C)
loss = tf.reduce_sum(tf.square(loss))
reg_loss = tf.reduce_sum(tf.square(C), 0) # L2 loss for each column
reg_loss = tf.reduce_sum(tf.sqrt(reg_loss))
total_loss = loss + lambd * reg_loss
To implement the zero constraint on the diagonal of C, the best way is to add it to the loss with another constant lambd2:
reg_loss2 = tf.trace(tf.square(C))
total_loss = total_loss + lambd2 * reg_loss2

Theano gradient doesn't work with .sum(), only .mean()?

I'm trying to learn theano and decided to implement linear regression (using their Logistic Regression from the tutorial as a template). I'm getting a wierd thing where T.grad doesn't work if my cost function uses .sum(), but does work if my cost function uses .mean(). Code snippet:
(THIS DOESN'T WORK, RESULTS IN A W VECTOR FULL OF NANs):
x = T.matrix('x')
y = T.vector('y')
w = theano.shared(rng.randn(feats), name='w')
b = theano.shared(0., name="b")
# now we do the actual expressions
h = T.dot(x,w) + b # prediction is dot product plus bias
single_error = .5 * ((h - y)**2)
cost = single_error.sum()
gw, gb = T.grad(cost, [w,b])
train = theano.function(inputs=[x,y], outputs=[h, single_error], updates = ((w, w - .1*gw), (b, b - .1*gb)))
predict = theano.function(inputs=[x], outputs=h)
for i in range(training_steps):
pred, err = train(D[0], D[1])
(THIS DOES WORK, PERFECTLY):
x = T.matrix('x')
y = T.vector('y')
w = theano.shared(rng.randn(feats), name='w')
b = theano.shared(0., name="b")
# now we do the actual expressions
h = T.dot(x,w) + b # prediction is dot product plus bias
single_error = .5 * ((h - y)**2)
cost = single_error.mean()
gw, gb = T.grad(cost, [w,b])
train = theano.function(inputs=[x,y], outputs=[h, single_error], updates = ((w, w - .1*gw), (b, b - .1*gb)))
predict = theano.function(inputs=[x], outputs=h)
for i in range(training_steps):
pred, err = train(D[0], D[1])
The only difference is in the cost = single_error.sum() vs single_error.mean(). What I don't understand is that the gradient should be the exact same in both cases (one is just a scaled version of the other). So what gives?
The learning rate (0.1) is way to big. Using mean make it divided by the batch size, so this help. But I'm pretty sure you should make it much smaller. Not just dividing by the batch size (which is equivalent to using mean).
Try a learning rate of 0.001.
Try dividing your gradient descent step size by the number of training examples.

Resources