How does keras basic optimizer works? - machine-learning

Here is part of get_updates code from SGD from keras(source)
moments = [K.zeros(shape) for shape in shapes]
self.weights = [self.iterations] + moments
for p, g, m in zip(params, grads, moments):
v = self.momentum * m - lr * g # velocity
self.updates.append(K.update(m, v))
Observation:
Since moments variable is a list of zeros tensors. Each m in the for loop is a zero tensor with the shape of p. Then the self.momentum * m, at the first line of the loop, is just a scalar multiply by zero tensor which result a zero tensor.
Question
What am I missing here?

Yes - during a first iteration of this loop m is equal to 0. But then it's updated by a current v value in this line:
self.updates.append(K.update(m, v))
So in next iteration you'll have:
v = self.momentum * old_velocity - lr * g # velocity
where old_velocity is a previous value of v.

Related

Using quadratic function with unknown constant term, How can i find these unknown constant using gradient descent?

everyone.
I am beginner of machine learning and start learning about gradient descent right now. However, I got a one big problem. Following question is like this :
given numbers [0,0],[1,1],[1,2],[2,1] and
equation will be [ f=(a2)*x^2 + (a1)*x + a0 ]
With hand-solving, i got a answer [-1,5/2,0]
but it is hard to find out the solution from making a python code with gradient descent with these given data.
In my case, I try to make a code with gradient descent method with easiest and fastest way like :
learningRate = 0.1
make **a series of number of x
initialize given 1,1,1 for a2,a1,a0
partial derivative for a2,a1,a0 (a2_p:2x, a1_p:x, a0_p:1)
gradient descent method : (ex) a2 = a2 - (learningRate)( y - [(a2)*x^2 + (a1)*x + a0] )(a2_p)
ps. Honestly, I do not know what should i put 'x' and 'y' or a2, a1, a0.
However, i got a wrong answer with different result each time.
So, I want to get a hint for correct equation or code sequence.
Thank you for reading my lowest level of question.
There are a few errors in your equations
For the function f(x) = a2*x^2+a1*x+a0, partial derivatives for a2, a1 and a0 are x^2, x and 1, respectively.
Suppose cost function is (1/2)*(y-f(x))^2
Partial derivatives of cost function with respect to ai is -(y-f(x))* partial derivative of f(x) for ai, where i belongs to [0,2]
So, the gradient descent equation is:
ai = ai + learning_rate*(y-f(x)) * partial derivative of f(x) for ai, where i belongs to [0,2]
I hope this code helps
#Training sample
sample = [(0,0),(1,1),(1,2),(2,1)]
#Our function => a2*x^2+a1*x+a0
class Function():
def __init__(self, a2, a1, a0):
self.a2 = a2
self.a1 = a1
self.a0 = a0
def eval(self, x):
return self.a2*x**2+self.a1*x+self.a0
def partial_a2(self, x):
return x**2
def partial_a1(self, x):
return x
def partial_a0(self, x):
return 1
#Initialise function
f = Function(1,1,1)
#To Calculate loss from the sample
def loss(sample, f):
return sum([(y-f.eval(x))**2 for x,y in sample])/len(sample)
epochs = 100000
lr = 0.0005
#To record the best values
best_values = (0,0,0)
for epoch in range(epochs):
min_loss = 100
for x, y in sample:
#Gradient descent
f.a2 = f.a2+lr*(y-f.eval(x))*f.partial_a2(x)
f.a1 = f.a1+lr*(y-f.eval(x))*f.partial_a1(x)
f.a0 = f.a0+lr*(y-f.eval(x))*f.partial_a0(x)
#Storing the best values
epoch_loss = loss(sample, f)
if min_loss > epoch_loss:
min_loss = epoch_loss
best_values = (f.a2, f.a1, f.a0)
print("Loss:", min_loss)
print("Best values (a2,a1,a0):", best_values)
Output:
Loss: 0.12500004789165717
Best values (a2,a1,a0): (-1.0001922562970325, 2.5003368582261487, 0.00014521557599919338)

CVXPY DCPError for Convex Function

I have a convex optimization problem I am trying to solve with cvxpy. Given a 1 x n row vector y and an m x n matrix C, I want to find a scalar b and a 1 x m row vector a such that the sum of squares of y - (aC + b(aC # aC)) is as small as possible (the # denotes element wise multiplication). In addition, all entires in a must be nonnegative and sum to 1 and -100 <= b <= 100. Below is my attempt to solve this using cvxpy.
import numpy as np
import cvxpy as cvx
def find_a(y, C, b_min=-100, b_max=100):
b = cvx.Variable()
a = cvx.Variable( (1,C.shape[0]) )
aC = a * C # this should be matrix multiplication
x = (aC + cvx.multiply(b, cvx.square(aC)))
objective = cvx.Minimize ( cvx.sum_squares(y - x) )
constraints = [0. <= a,
a <= 1.,
b_min <= b,
b <= b_max,
cvx.sum(a) == 1.]
prob = cvx.Problem(objective, constraints)
result = prob.solve()
print a.value
print result
y = np.asarray([[0.10394265, 0.25867508, 0.31258457, 0.36452763, 0.36608997]])
C = np.asarray([
[0., 0.00169811, 0.01679245, 0.04075472, 0.03773585],
[0., 0.00892802, 0.03154158, 0.06091544, 0.07315024],
[0., 0.00962264, 0.03245283, 0.06245283, 0.07283019],
[0.04396226, 0.05245283, 0.12245283, 0.18358491, 0.23886792]])
find_a(y, C)
I keep getting a DCPError: Problem does not follow DCP rules. error when I try to solve for a. I am thinking that either my function is not really convex, or I do not understand how to construct the proper cvxpy Problem. Any help would be greatly appreciated.

Understanding code wrt Logistic Regression using gradient descent

I was following Siraj Raval's videos on logistic regression using gradient descent :
1) Link to longer video :
https://www.youtube.com/watch?v=XdM6ER7zTLk&t=2686s
2) Link to shorter video :
https://www.youtube.com/watch?v=xRJCOz3AfYY&list=PL2-dafEMk2A7mu0bSksCGMJEmeddU_H4D
In the videos he talks about using gradient descent to reduce the error for a set number of iterations so that the function converges(slope becomes zero).
He also illustrates the process via code. The following are the two main functions from the code :
def step_gradient(b_current, m_current, points, learningRate):
b_gradient = 0
m_gradient = 0
N = float(len(points))
for i in range(0, len(points)):
x = points[i, 0]
y = points[i, 1]
b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
new_b = b_current - (learningRate * b_gradient)
new_m = m_current - (learningRate * m_gradient)
return [new_b, new_m]
def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
b = starting_b
m = starting_m
for i in range(num_iterations):
b, m = step_gradient(b, m, array(points), learning_rate)
return [b, m]
#The above functions are called below:
learning_rate = 0.0001
initial_b = 0 # initial y-intercept guess
initial_m = 0 # initial slope guess
num_iterations = 1000
[b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
# code taken from Siraj Raval's github page
Why does the value of b & m continue to update for all the iterations? After a certain number of iterations, the function will converge, when we find the values of b & m that give slope = 0.
So why do we continue iteration after that point and continue updating b & m ?
This way, aren't we losing the 'correct' b & m values? How is learning rate helping the convergence process if we continue to update values after converging? Thus, why is there no check for convergence, and so how is this actually working?
In practice, most likely you will not reach to slope 0 exactly. Thinking of your loss function as a bowl. If your learning rate is too high, it is possible to overshoot over the lowest point of the bowl. On the contrary, if the learning rate is too low, your learning will become too slow and won't reach the lowest point of the bowl before all iterations are done.
That's why in machine learning, the learning rate is an important hyperparameter to tune.
Actually, once we reach a slope 0; b_gradient and m_gradient will become 0;
thus, for :
new_b = b_current - (learningRate * b_gradient)
new_m = m_current - (learningRate * m_gradient)
new_b and new_m will remain the old correct values; as nothing will be subtracted from them.

Implementing a custom objective function in Keras

I am trying to implement a custom Keras objective function:
in 'Direct Intrinsics: Learning Albedo-Shading Decomposition by Convolutional Regression', Narihira et al.
This is the sum of equations (4) and (6) from the previous picture. Y* is the ground truth, Y a prediction map and y = Y* - Y.
This is my code:
def custom_objective(y_true, y_pred):
#Eq. (4) Scale invariant L2 loss
y = y_true - y_pred
h = 0.5 # lambda
term1 = K.mean(K.sum(K.square(y)))
term2 = K.square(K.mean(K.sum(y)))
sca = term1-h*term2
#Eq. (6) Gradient L2 loss
gra = K.mean(K.sum((K.square(K.gradients(K.sum(y[:,1]), y)) + K.square(K.gradients(K.sum(y[1,:]), y)))))
return (sca + gra)
However, I suspect that the equation (6) is not correctly implemented because the results are not good. Am I computing this right?
Thank you!
Edit:
I am trying to approximate (6) convolving with Prewitt filters. It works when my input is a chunk of images i.e. y[batch_size, channels, row, cols], but not with y_true and y_pred (which are of type TensorType(float32, 4D)).
My code:
def cconv(image, g_kernel, batch_size):
g_kernel = theano.shared(g_kernel)
M = T.dtensor3()
conv = theano.function(
inputs=[M],
outputs=conv2d(M, g_kernel, border_mode='full'),
)
accum = 0
for curr_batch in range (batch_size):
accum = accum + conv(image[curr_batch])
return accum/batch_size
def gradient_loss(y_true, y_pred):
y = y_true - y_pred
batch_size = 40
# Direction i
pw_x = np.array([[-1,0,1],[-1,0,1],[-1,0,1]]).astype(np.float64)
g_x = cconv(y, pw_x, batch_size)
# Direction j
pw_y = np.array([[-1,-1,-1],[0,0,0],[1,1,1]]).astype(np.float64)
g_y = cconv(y, pw_y, batch_size)
gra_l2_loss = K.mean(K.square(g_x) + K.square(g_y))
return (gra_l2_loss)
The crash is produced in:
accum = accum + conv(image[curr_batch])
...and error description is the following one:
*** TypeError: ('Bad input argument to theano function with name "custom_models.py:836" at index 0 (0-based)', 'Expected an array-like
object, but found a Variable: maybe you are trying to call a function
on a (possibly shared) variable instead of a numeric array?')
How can I use y (y_true - y_pred) as a numpy array, or how can I solve this issue?
SIL2
term1 = K.mean(K.square(y))
term2 = K.square(K.mean(y))
[...]
One mistake spread across the code was that when you see (1/n * sum()) in the equations, it is a mean. Not the mean of a sum.
Gradient
After reading your comment and giving it more thought, I think there is a confusion about the gradient. At least I got confused.
There are two ways of interpreting the gradient symbol:
The gradient of a vector where y should be differentiated with respect to the parameters of your model (usually the weights of the neural net). In previous edits I started to write in this direction because that's the sort of approach used to trained the model (eg. gradient descent). But I think I was wrong.
The pixel intensity gradient in a picture, as you mentioned in your comment. The diff of each pixel with its neighbor in each direction. In which case I guess you have to translate the example you gave into Keras.
To sum up, K.gradients() and numpy.gradient() are not used in the same way. Because numpy implicitly considers (i, j) (the row and column indices) as the two input variables, while when you feed a 2D image to a neural net, every single pixel is an input variable. Hope I'm clear.

How do I implement the optimization function in tensorflow?

minΣ(||xi-Xci||^2+ λ||ci||),
s.t cii = 0,
where X is a matrix of shape d * n and C is of the shape n * n, xi and ci means a column of X and C separately.
X is known here and based on X we want to find C.
Usually with a loss like that you need to vectorize it, instead of working with columns:
loss = X - tf.matmul(X, C)
loss = tf.reduce_sum(tf.square(loss))
reg_loss = tf.reduce_sum(tf.square(C), 0) # L2 loss for each column
reg_loss = tf.reduce_sum(tf.sqrt(reg_loss))
total_loss = loss + lambd * reg_loss
To implement the zero constraint on the diagonal of C, the best way is to add it to the loss with another constant lambd2:
reg_loss2 = tf.trace(tf.square(C))
total_loss = total_loss + lambd2 * reg_loss2

Resources