Does RMSProp optimizer in tensorflow use Nesterov momentum? - machine-learning

When you create the RMSProp optimizer, it asks for a momentum value. What is this momentum? Is it Nesterov momentum or the regular kind? How do I use Nesterov momentum with RMSProp in tf?
There is a formula in the docstring here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/rmsprop.py#L25
mean_square = decay * mean_square{t-1} + (1-decay) * gradient ** 2
mom = momentum * mom{t-1} + learning_rate * g_t / sqrt(mean_square + epsilon)
delta = - mom
Could someone explain what the g_t term means and where this formula is actually computed?
As far as I understand, in Nesterov momentum + RMSProp you first move the weights by the current momentum, compute the new gradients there, divide them by sqrt(mean_square + epsilon), and add the result to the momentum. Is this what's happening here? I wasn't able to find the implementation of training_ops.apply_rms_prop since I'm not very familiar with the tf source.
I'm coming from Geoffrey Hinton's Coursera course on neural networks, where this Nesterov momentum + RMSProp algorithm is explained. How do I use it in tf?
Please correct me if I'm wrong in my understanding of Nesterov momentum or anything else.

The documentation you refer to explicitly mentions:
This implementation of RMSProp uses plain momentum, not Nesterov momentum.
AFAIK there is no built-in implementation for Nesterov momentum in RMSProp. You can of course adjust the function according to your own needs.
As @xolodec said, g_t is the gradient.
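If you do want to experiment, here is a rough numpy sketch of the docstring's plain-momentum update next to a Nesterov-style variant (variable names are mine, not TensorFlow's, and this is only an illustration of the idea):
import numpy as np

def rmsprop_step(w, g, ms, mom, lr=0.001, decay=0.9, momentum=0.9, eps=1e-10):
    # plain-momentum RMSProp, as in the docstring formula
    ms = decay * ms + (1 - decay) * g ** 2
    mom = momentum * mom + lr * g / np.sqrt(ms + eps)
    return w - mom, ms, mom

def rmsprop_nesterov_step(w, grad_fn, ms, mom, lr=0.001, decay=0.9, momentum=0.9, eps=1e-10):
    # Nesterov-style variant: evaluate the gradient at the look-ahead point
    # w - momentum * mom, then update the moving average and the momentum
    g = grad_fn(w - momentum * mom)
    ms = decay * ms + (1 - decay) * g ** 2
    mom = momentum * mom + lr * g / np.sqrt(ms + eps)
    return w - mom, ms, mom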

Related

How does the Keras Adam optimizer learning rate hyper-parameter relate to individually computed learning rates for network parameters?

So through my limited understanding of Adam (mainly through this post: https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c) I gather that the Adam optimizer computes individual learning rates for each parameter in a network.
But in the Keras docs (https://keras.io/optimizers/) the Adam optimizer takes a learning rate parameter.
My question is: how does the learning rate parameter taken by the Adam object relate to these computed learning rates? As far as I can tell, this isn't covered in the linked post (or it is, but it went over my head).
As this is a very specific question, I won't go into the mathematical details of Adam. I guess the line in the article, "it computes individual learning rates for different parameters", is what threw you off.
The actual Adam algorithm is given in the paper https://arxiv.org/pdf/1412.6980.pdf
Adam keeps an exponentially decaying average of past gradients, so it behaves like a heavy ball with friction, which helps with faster convergence and stability.
But if you look at the algorithm, there is an alpha (step size); this is the Keras equivalent of the learning_rate = 0.001 we provide. So the algorithm needs a step size to update the parameters (simply put, it's a scaling factor for the weight update). As for the varying learning rate (or update), look at the last equation: it uses m_t and v_t, which are updated in the loop, but alpha stays fixed throughout the whole algorithm. This is the Keras learning rate that we have to provide.
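For reference, the per-step update from the paper is roughly (m_hat and v_hat are the bias-corrected first and second moment estimates):
m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)
theta_t = theta_{t-1} - alpha * m_hat / (sqrt(v_hat) + epsilon)
The fixed alpha is multiplied by a per-parameter factor m_hat / (sqrt(v_hat) + epsilon), and that factor is what the article calls the individual learning rate of each parameter.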
Since alpha stays the same, we sometimes have to use learning rate scheduling, where we decrease the learning rate after a few epochs. There are other variations where we first increase the learning rate and then decrease it.
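A minimal sketch of such a schedule in Keras (the halving factor and the 10-epoch interval are arbitrary example values, and model, X, y stand for your own objects):
from tensorflow.keras.callbacks import LearningRateScheduler

def step_decay(epoch, lr):
    # halve the learning rate every 10 epochs (arbitrary example values)
    return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

model.fit(X, y, epochs=50, callbacks=[LearningRateScheduler(step_decay)])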
Just wanted to add this, in case an implementation/example in 1-D clarifies anything:
import numpy as np
from math import sqrt

eps = 1e-6
delta = 1e-6
MAX_ITER = 100000

def f(x):
    # 1-D objective to minimize
    return (np.square(x) / 10) - 2 * np.sin(x)

def df(x):
    # numerical derivative (backward difference)
    return (f(x) - f(x - delta)) / delta

def main():
    x_0 = -13    # initial position
    a = 0.1      # step size / learning rate (the "alpha" of Adam)
    x_k = x_0
    B_1 = 0.99   # first moment decay rate
    B_2 = 0.999  # second moment decay rate
    i = 0
    # moments seeded from the first gradient instead of the paper's
    # zero initialization with bias correction
    m_k = df(x_k)
    d_k = df(x_k) ** 2
    while True:
        # update moment estimates and parameter; note that a stays fixed
        # while the effective step adapts through m_k and d_k
        m_k = B_1 * m_k + (1 - B_1) * df(x_k)
        d_k = B_2 * d_k + (1 - B_2) * (df(x_k) ** 2)
        x_k = x_k - a * m_k / sqrt(d_k + eps)
        # termination criteria
        if abs(df(x_k) / df(x_0)) <= eps:
            break
        if i > MAX_ITER:
            break
        i = i + 1
    print("minimum found at x =", x_k, "after", i, "iterations")

if __name__ == "__main__":
    main()

Mini batch stochastic gradient descent

My question is: what changes should be made to the SGD algorithm to implement the mini-batch SGD algorithm?
In the book Machine Learning by Tom Mitchell, the GD and SGD algorithms are explained very well. Here is a snippet of the book's SGD backpropagation algorithm:
I know that the difference between SGD and mini-batch SGD is that the former uses one training example to update the weights in each iteration (of the outer while-loop), while the latter uses a batch of training examples in each iteration. But I still can't figure out how the algorithm below should be changed to account for this.
Here is what I think it should look like, but I can't confirm it from the several tutorials I followed on the web.
Until the termination condition is met, Do
    batch <- get next batch
    For each <x, t> in batch, Do
        1- Propagate the input forward through the network.
        2- d_k += o_k(1 - o_k)(t_k - o_k)
        3- d_h += o_h(1 - o_h) * sum(w_kh * d_k)
    For each network weight w_ji, Do
        w_ji += etha * d_j * x_ij
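In numpy-ish Python, this is roughly what I mean (just a sketch of my own understanding; compute_gradients is a made-up helper standing for steps 1-3 applied to one example):
import numpy as np

def minibatch_sgd_epoch(batches, weights, eta):
    for batch in batches:
        # accumulate the per-example gradients over the whole batch
        grad_sum = {name: np.zeros_like(w) for name, w in weights.items()}
        for x, t in batch:
            grads = compute_gradients(weights, x, t)   # made-up helper: steps 1-3 above
            for name in weights:
                grad_sum[name] += grads[name]
        # one weight update per batch, using the averaged accumulated gradient
        for name in weights:
            weights[name] += eta * grad_sum[name] / len(batch)
    return weights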
Any help is much appreciated!

Is there an optimizer in keras based on precision or recall instead of loss?

I am developing a segmentation neural network with only two classes, 0 and 1 (0 is the background and 1 is the object I want to find in the image). In each image, about 80% of the pixels are 1 and 20% are 0. As you can see, the dataset is unbalanced and it skews the results: my accuracy is 85% and my loss is low, but that is only because my model is good at finding the background!
I would like to base the optimizer on another metric, such as precision or recall, which is more useful in this case.
Does anyone know how to implement this?
You don't optimize on precision or recall. You just track them as validation scores to pick the best weights. Don't mix up loss, optimizer, and metrics; they are not meant for the same thing.
import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

THRESHOLD = 0.5

def precision(y_true, y_pred, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)
    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)
    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))
    precision = tp / (tp + fp)
    return precision

def recall(y_true, y_pred, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)
    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)
    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))
    recall = tp / (tp + fn)
    return recall

def fbeta(y_true, y_pred, beta=2, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)
    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)
    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))
    fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))  # thresholded predictions, as in recall()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta_squared = beta ** 2
    return (beta_squared + 1) * (precision * recall) / (beta_squared * precision + recall)

def model_fit(X, y, X_test, y_test):
    class_weight = {
        1: 1 / (np.sum(y) / len(y)),
        0: 1}
    np.random.seed(47)
    model = Sequential()
    model.add(Dense(1000, input_shape=(X.shape[1],)))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(500))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(250))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adamax', metrics=[fbeta, precision, recall])
    model.fit(X, y, validation_data=(X_test, y_test), epochs=200, batch_size=50, verbose=2, class_weight=class_weight)
    return model
No. To do gradient descent you need to compute a gradient, and for that the function needs to be reasonably smooth. Precision/recall and accuracy are not smooth functions; they have only sharp edges, on which the gradient is infinite, and flat regions, on which the gradient is zero. Hence you cannot use any kind of numerical method to find the minimum of such a function; you would have to use some kind of combinatorial optimization, and that would be NP-hard.
As others have stated, precision/recall is not directly usable as a loss function. However, better proxy loss functions have been found that help with a whole family of precision/recall related functions (e.g. ROC AUC, precision at fixed recall, etc.)
The research paper Scalable Learning of Non-Decomposable Objectives covers this with a method to sidestep the combinatorial optimization by the use of certain calculated bounds, and some Tensorflow code by the authors is available at the tensorflow/models repository. Additionally, there is a followup question on StackOverflow that has an answer that adapts this into a usable Keras loss function.
Special thanks to Francois Chollet and other participants on the Keras issue thread here that turned up that research paper. You may also find that thread provides other useful insights into the problem at hand.
Having the same problem with an unbalanced dataset, I'd suggest you track the F1 score as the metric for your model.
Andrew Ng teaches that having ONE metric for the model is the simplest (best?) way to train a model. If you have two metrics, like precision and recall, it's not clear which one is more important, and trying to set limits on one metric obviously impacts the other.
The F1 score combines recall and precision: it is their harmonic mean.
The Keras version I'm using unfortunately has no built-in implementation of the F1 score as a metric, unlike accuracy and the many other Keras metrics at https://keras.io/api/metrics/.
I found an implementation of the F1 score as a Keras metric, computed at each epoch, at:
https://medium.com/@aakashgoel12/how-to-add-user-defined-function-get-f1-score-in-keras-metrics-3013f979ce0d
I've implemented the simple function from that article, and the model now tracks the F1 score as its Keras metric during training. Results on the test set: accuracy went down a bit and the F1 score went up a lot.
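Note that the fbeta metric defined in the earlier answer already reduces to F1 when called with beta=1, so an alternative sketch (assuming that fbeta function and your own model are in scope) is simply:
def f1(y_true, y_pred):
    # harmonic mean of precision and recall, i.e. fbeta with beta = 1
    return fbeta(y_true, y_pred, beta=1)

model.compile(loss='binary_crossentropy', optimizer='adamax', metrics=[f1])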
I have the same problem with an unbalanced dataset for binary classification, and I want to increase the recall sensitivity too. I found out that there is a built-in Recall metric in tf.keras, which can be used in the compile statement as follows:
from tensorflow.keras.metrics import Recall, Accuracy
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=[Accuracy(), Recall()])
The recommended approach to dealing with an unbalanced dataset like yours is to use class_weight or sample_weight. See the model fit API for details.
Quote:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
With weights that are inversely proportional to the class frequency, the loss will avoid just predicting the background class.
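A minimal sketch of such inverse-frequency weights (y_train here stands for your 0/1 label array and model, X_train for your own objects; these names are assumptions on my part):
import numpy as np

counts = np.bincount(y_train.astype(int).ravel())   # [number of 0s, number of 1s]
class_weight = {c: y_train.size / (2.0 * n) for c, n in enumerate(counts)}

model.fit(X_train, y_train, epochs=50, class_weight=class_weight)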
I understand that this is not how you formulated the question, but IMHO it is the most practical approach to the issue you are facing.
I think the callbacks and early stopping mechanisms provide techniques that can get you as close as possible to what you want to achieve. Please read the following article by Jason Brownlee about early stopping (read to the end!):
https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/

Machine learning using ReLu return NaN

I am trying to figure out this whole machine learning thing, so I was doing some testing. I wanted to make it learn the sine function (with an angle in radians). The neural network is:
1 input (angle in radians) / 2 hidden layers / 1 output (prediction of the sine)
For the activation I am using ReLU, and it's important to note that the script was working when I used the logistic function instead of ReLU.
To build the training set, I made a loop that starts at 0 and finishes at 180, translates the number into radians (radian = loop_index*Math.PI/180), then takes the sine of this angle and stores both the angle and the sine.
So my table looks like this for an entry: {input: [RADIAN ANGLE], output: [sin(radian)]}
for (var i = 0; i <= 180; i++) {
    radian = (i * (Math.PI / 180));
    train_table.push({input: [radian], output: [Math.sin(radian)]});
}
I use this table to train my neural network using cross-entropy and a learning rate of 0.3 with 20000 iterations.
The problem is that it fails: when I try to predict anything, it returns NaN.
I am using the framework Synaptic (https://github.com/cazala/synaptic) and here is a JSfiddle of my code: https://jsfiddle.net/my7xe9ks/2/
The learning rate must be carefully tuned; this parameter matters a lot, especially when the gradients explode and you get a NaN. When this happens, you have to reduce the learning rate, usually by a factor of 10.
In your specific case the learning rate is too high; if you use 0.05 or 0.01, the network trains and works properly.
Another important detail is that you are using cross-entropy as the loss; that loss is meant for classification, and you have a regression problem. You should use a mean squared error loss instead.
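For reference, here is a rough Keras equivalent of the suggested setup (not Synaptic, just a sketch of the same idea: ReLU hidden layers, MSE loss, and a small learning rate):
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

x = np.linspace(0, np.pi, 181).reshape(-1, 1)   # angles 0..180 degrees, in radians
y = np.sin(x)

model = Sequential([
    Dense(16, activation='relu', input_shape=(1,)),
    Dense(16, activation='relu'),
    Dense(1)                                    # linear output for regression
])
model.compile(optimizer=SGD(learning_rate=0.01), loss='mse')
model.fit(x, y, epochs=2000, verbose=0)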

Scikit-learn - Stochastic Gradient Descent with custom cost and gradient functions

I am implementing matrix factorization to predict a movie rating by a reviewer. The dataset is taken from MovieLens (http://grouplens.org/datasets/movielens/). This is a well-studied recommendation problem, so I am implementing this matrix factorization method just for my own learning.
I model the cost function as the root-mean-square error between the predicted rating and the actual rating in the training dataset. I use the scipy.optimize.minimize function (with conjugate gradient descent) to factor the movie rating matrix, but this optimization tool is too slow even for a dataset with only 100K items. I plan to scale my algorithm to the dataset with 20 million items.
I have been searching for a Python-based solution for stochastic gradient descent, but the stochastic gradient descent implementation I found in scikit-learn does not allow me to use my custom cost and gradient functions.
I can implement my own stochastic gradient descent, but I am checking with you guys whether a tool for this already exists.
Basically, I am wondering if there is an API similar to this:
optimize.minimize(my_cost_function,
                  my_input_param,
                  jac=my_gradient_function,
                  ...)
Thanks!
This is so simple to implement (at least the vanilla method) that I don't think there is a "framework" around it.
It is just
my_input_param -= alpha * my_gradient_function(my_input_param)
Maybe you want to have a look at theano, which will do the differentiation for you. Depending on what you want to do, it might be a bit of overkill, though.
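If you do roll your own, a minimal sketch of an SGD loop for the matrix factorization case might look like this (plain numpy, not a library API; ratings is assumed to be a list of (user, item, value) triples, and the step size and regularization are arbitrary example values):
import numpy as np

def sgd_matrix_factorization(ratings, n_users, n_items, n_factors=10,
                             alpha=0.01, reg=0.02, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, n_factors))   # user factors
    Q = 0.1 * rng.standard_normal((n_items, n_factors))   # item factors
    for _ in range(n_epochs):
        for idx in rng.permutation(len(ratings)):          # visit ratings in random order
            u, i, r = ratings[idx]
            pu, qi = P[u].copy(), Q[i].copy()
            err = r - pu @ qi                              # error on this single rating
            P[u] += alpha * (err * qi - reg * pu)          # one SGD step per rating
            Q[i] += alpha * (err * pu - reg * qi)
    return P, Q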
I've been trying to do something similar in R but with a different custom cost function.
As I understand it, the key is to find the gradient and see which way takes you towards the local minimum.
With linear regression (y = mx + c) and a least squares cost, our cost function is
(mx + c - y)^2
The partial derivative of this with respect to m is
2x(mx + c - y)
Which, with the more traditional machine learning notation where m = theta, gives us theta <- theta - learning_rate * t(X) %*% (X %*% theta - y) / length(y)
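The same least-squares step in Python/numpy, as a sketch (X is assumed to include a column of ones so that theta holds both m and c):
import numpy as np

def gradient_step(theta, X, y, learning_rate):
    # mean gradient of the squared error, matching the R expression above
    grad = X.T @ (X @ theta - y) / len(y)
    return theta - learning_rate * grad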
I don't know this for sure, but I would assume that for linear regression and a cost function of sqrt(mx + c - y), the gradient step is the partial derivative with respect to m, which I believe is
x/(2*sqrt(mX + c - y))
If any/all of this is incorrect please (anybody) correct me. This is something I am trying to learn myself and would appreciate knowing if I'm heading in completely the wrong direction.
