What is `lr_policy` in Caffe?

I am just trying to find out how I can use Caffe. To do so, I took a look at the different .prototxt files in the examples folder. There is one option I don't understand:
# The learning rate policy
lr_policy: "inv"
Possible values seem to be:
"fixed"
"inv"
"step"
"multistep"
"stepearly"
"poly"
Could somebody please explain those options?

It is a common practice to decrease the learning rate (lr) as the optimization/learning process progresses. However, it is not clear how exactly the learning rate should be decreased as a function of the iteration number.
If you use DIGITS as an interface to Caffe, you will be able to visually see how the different choices affect the learning rate.
fixed: the learning rate is kept fixed throughout the learning process.
inv: the learning rate decays as ~1/T
step: the learning rate is piecewise constant, dropping every X iterations
multistep: piecewise constant at arbitrary intervals
You can see exactly how the learning rate is computed in the function SGDSolver<Dtype>::GetLearningRate (solvers/sgd_solver.cpp line ~30).
Recently, I came across an interesting and unconventional approach to learning-rate tuning: Leslie N. Smith's work "No More Pesky Learning Rate Guessing Games". In his report, Leslie suggests using an lr_policy that alternates between decreasing and increasing the learning rate. His work also suggests how to implement this policy in Caffe.
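For illustration, here is a small Python sketch (my own, not code from Smith's paper or from Caffe) of the basic triangular schedule that this cyclical policy is built on: the rate ramps linearly between a lower and an upper bound over a half-cycle of stepsize iterations.

import math

def triangular_lr(iteration, base_lr, max_lr, stepsize):
    # Cyclical (triangular) learning rate: oscillates between base_lr and max_lr.
    # stepsize is the half-cycle length in iterations.
    cycle = math.floor(1 + iteration / (2 * stepsize))
    x = abs(iteration / stepsize - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# rises from 0.001 to 0.006 over 2000 iterations, falls back, then repeats
for it in range(0, 8001, 1000):
    print(it, triangular_lr(it, base_lr=0.001, max_lr=0.006, stepsize=2000))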

If you look inside the /caffe-master/src/caffe/proto/caffe.proto file (you can find it online here) you will see the following descriptions:
// The learning rate decay policy. The currently implemented learning rate
// policies are as follows:
// - fixed: always return base_lr.
// - step: return base_lr * gamma ^ (floor(iter / step))
// - exp: return base_lr * gamma ^ iter
// - inv: return base_lr * (1 + gamma * iter) ^ (- power)
// - multistep: similar to step but it allows non uniform steps defined by
// stepvalue
// - poly: the effective learning rate follows a polynomial decay, to be
// zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
// - sigmoid: the effective learning rate follows a sigmod decay
// return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
//
// where base_lr, max_iter, gamma, step, stepvalue and power are defined
// in the solver parameter protocol buffer, and iter is the current iteration.
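To see how these shapes differ, here is a small Python sketch (my own, mirroring the formulas quoted above rather than Caffe's actual C++ code) that evaluates each policy at a given iteration; the "inv" example values are similar to those in Caffe's LeNet example solver.

import math

def caffe_lr(policy, iteration, base_lr, gamma=None, power=None,
             stepsize=None, stepvalues=None, max_iter=None):
    # Reproduces the learning-rate formulas from the caffe.proto comment above.
    if policy == "fixed":
        return base_lr
    if policy == "step":
        return base_lr * gamma ** (iteration // stepsize)
    if policy == "exp":
        return base_lr * gamma ** iteration
    if policy == "inv":
        return base_lr * (1 + gamma * iteration) ** (-power)
    if policy == "multistep":
        # count how many of the (possibly non-uniform) stepvalues have been passed
        current_step = sum(1 for s in stepvalues if iteration >= s)
        return base_lr * gamma ** current_step
    if policy == "poly":
        return base_lr * (1 - iteration / max_iter) ** power
    if policy == "sigmoid":
        return base_lr * (1.0 / (1.0 + math.exp(-gamma * (iteration - stepsize))))
    raise ValueError("unknown lr_policy: " + policy)

# example: the "inv" policy
for it in (0, 1000, 5000, 10000):
    print(it, caffe_lr("inv", it, base_lr=0.01, gamma=0.0001, power=0.75))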

Related

How does the Keras Adam optimizer learning rate hyper-parameter relate to individually computed learning rates for network parameters?

So through my limited understanding of Adam (mainly through this post: https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c) I gather that the Adam optimizer computes individual learning rates for each parameter in a network.
But in the Keras docs (https://keras.io/optimizers/) the Adam optimizer takes a learning rate parameter.
My question is: how does the learning rate parameter taken by the Adam object relate to these computed learning rates? As far as I can tell, this isn't covered in the linked post (or it is, but it went over my head).
As this is a very specific question, I won't go into the mathematical details of Adam. I guess the line in the article, "it computes individual learning rates for different parameters", is what threw you off.
See the actual Adam algorithm as proposed in the paper: https://arxiv.org/pdf/1412.6980.pdf
Adam keeps an exponentially decaying average of past gradients, so it behaves like a heavy ball with friction, which helps it converge faster and more stably.
But if you look at the algorithm, there is an alpha (step size); this is the Keras equivalent of the learning rate = 0.001 we provide. So the algorithm needs a step size to update the parameters (simply put, it's a scaling factor for the weight update). As for the varying learning rate (or update), look at the last equation: it uses m_t and v_t, which are updated in the loop, but alpha stays fixed throughout the whole algorithm. This is the Keras learning rate that we have to provide.
Since alpha stays the same, we sometimes have to use learning rate scheduling, where we actually decrease the learning rate after a few epochs. There are other variations where we increase the learning rate first and then decrease it.
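As a hedged illustration (the model and data names are hypothetical, and the numbers are just examples), in Keras you pass that single alpha when constructing the optimizer, and if you want it to shrink every few epochs you attach a schedule through a callback:

import tensorflow as tf

# alpha (the Keras "learning rate") is the fixed step size in Adam's update rule
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# optional: shrink alpha by a factor of 10 every 10 epochs (a simple step schedule)
def schedule(epoch):
    return 0.001 * (0.1 ** (epoch // 10))

lr_callback = tf.keras.callbacks.LearningRateScheduler(schedule)

# model.compile(optimizer=optimizer, loss="mse")                   # hypothetical model
# model.fit(x_train, y_train, epochs=30, callbacks=[lr_callback])  # hypothetical data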
Just wanted to add this, in case an implementation/example in 1-D clarifies anything:
import numpy as np
from math import sqrt

eps = 1e-6        # numerical-stability constant (Adam's epsilon)
delta = 1e-6      # finite-difference step for the derivative
MAX_ITER = 100000

def f(x):
    # objective: a 1-D non-convex function
    return (np.square(x) / 10) - 2 * np.sin(x)

def df(x):
    # backward finite-difference approximation of f'(x)
    return (f(x) - f(x - delta)) / delta

def main():
    x_0 = -13       # initial position
    a = 0.1         # step size / learning rate (Adam's alpha, kept fixed)
    x_k = x_0
    B_1 = 0.99      # first-moment decay rate
    B_2 = 0.999     # second-moment decay rate
    i = 0
    m_k = df(x_k)         # first-moment estimate (average of gradients)
    d_k = df(x_k) ** 2    # second-moment estimate (average of squared gradients)
    while True:
        # update moment estimates and the parameter
        m_k = B_1 * m_k + (1 - B_1) * df(x_k)
        d_k = B_2 * d_k + (1 - B_2) * (df(x_k) ** 2)
        x_k = x_k - a * m_k / sqrt(d_k + eps)
        # termination criteria: gradient nearly vanished, or too many iterations
        if abs(df(x_k) / df(x_0)) <= eps:
            break
        if i > MAX_ITER:
            break
        i = i + 1
    print("stopped at x =", x_k, "after", i, "iterations")

if __name__ == "__main__":
    main()
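Note that in the sketch above the step size a (the Keras learning rate) never changes; what varies from step to step is the effective update a * m_k / sqrt(d_k + eps), which is the per-parameter adaptation the article refers to.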

Machine learning using ReLU returns NaN

I am trying to figure out this whole machine learning thing, so I have been doing some testing. I wanted to make it learn the sine function (with an angle in radians). The neural network is:
1 input (angle in radians) / 2 hidden layers / 1 output (prediction of the sine)
For the squashing activation I am using ReLU, and it's important to note that when I used the logistic function instead of ReLU the script worked.
To do that, I made a loop that starts at 0 and finishes at 180, converts the index to radians (radian = loop_index*Math.PI/180), takes the sine of that angle, and stores both the radian and the sine result.
So an entry in my table looks like this: {input:[RADIAN ANGLE], output:[sin(radian)]}
for (var i = 0; i <= 180; i++) {
    radian = (i * (Math.PI / 180));
    train_table.push({input: [radian], output: [Math.sin(radian)]});
}
I use this table to train my neural network using cross-entropy and a learning rate of 0.3 with 20000 iterations.
The problem is that it fails: when I try to predict anything it returns "NaN".
I am using the framework Synaptic (https://github.com/cazala/synaptic) and here is a JSfiddle of my code: https://jsfiddle.net/my7xe9ks/2/
The learning rate must be carefully tuned; this parameter matters a lot, especially when the gradients explode and you get a NaN. When this happens, you have to reduce the learning rate, usually by a factor of 10.
In your specific case the learning rate is too high; if you use 0.05 or 0.01 the network trains and works properly.
Another important detail is that you are using cross-entropy as the loss; that loss is meant for classification, and you have a regression problem. You should prefer a mean squared error loss instead.
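To illustrate the learning-rate point outside of Synaptic, here is a minimal numpy sketch (my own, not the asker's network) of the same sine-fitting setup with one ReLU hidden layer and an MSE loss; with a large step size the weights can blow up and the loss turns into inf/NaN, while a smaller one trains:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, np.pi, 181).reshape(-1, 1)   # angles in radians
t = np.sin(x)                                     # regression targets

def train(lr, steps=5000):
    # tiny 1-16-1 ReLU network trained with full-batch gradient descent and MSE
    W1 = rng.normal(0, 0.5, (1, 16)); b1 = np.zeros(16)
    W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)
    for _ in range(steps):
        h = np.maximum(0.0, x @ W1 + b1)          # ReLU hidden layer
        y = h @ W2 + b2
        loss = np.mean((y - t) ** 2)              # MSE, appropriate for regression
        if not np.isfinite(loss):                 # gradients exploded
            return float("nan")
        dy = 2.0 * (y - t) / len(x)
        dW2 = h.T @ dy; db2 = dy.sum(0)
        dh = dy @ W2.T
        dh[h <= 0.0] = 0.0                        # ReLU gradient
        dW1 = x.T @ dh; db1 = dh.sum(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return loss

for lr in (0.3, 0.01):
    print("lr =", lr, "final MSE =", train(lr))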

How to write a solver.prototxt that satisfies a given condition in Caffe?

I am writing the solver.prototxt that follows the rule of the paper https://arxiv.org/pdf/1604.02677.pdf
In the training phase, the learning rate was set as 0.001 initially and decreased by a factor of 10 when the loss stopped decreasing till 10^-7. The discount weight was set as 1 initially and decreased by a factor of 10 every ten thousand iterations until a marginal value 10^-3.
Note that the discount weight is loss_weight in Caffe. Based on the information above, I wrote my solver as
train_net: "train.prototxt"
lr_policy: "step"
gamma: 0.1
stepsize: 10000
base_lr: 0.001 #0.002
In train.prototxt, I also set
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "deconv"
bottom: "label"
top: "loss"
loss_weight: 1
}
However, I still don't know how to set the solver to satisfy the rules "decreased by a factor of 10 when the loss stopped decreasing till 10^-7" and "decreased by a factor of 10 every ten thousand iterations until a marginal value 10^-3". I did not find any Caffe rule that can do this in the reference:
// The learning rate decay policy. The currently implemented learning rate
// policies are as follows:
// - fixed: always return base_lr.
// - step: return base_lr * gamma ^ (floor(iter / step))
// - exp: return base_lr * gamma ^ iter
// - inv: return base_lr * (1 + gamma * iter) ^ (- power)
// - multistep: similar to step but it allows non uniform steps defined by
// stepvalue
// - poly: the effective learning rate follows a polynomial decay, to be
// zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
// - sigmoid: the effective learning rate follows a sigmod decay
// return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
//
// where base_lr, max_iter, gamma, step, stepvalue and power are defined
// in the solver parameter protocol buffer, and iter is the current iteration.
If anyone knows it, please give me some guidance on writing the solver.prototxt to satisfy the above conditions.
Learning rate reduction
Part of the problem is that the phrase decreased by a factor of 10 when the loss stopped decreasing till 10^-7 does not quite make sense. I think that, perhaps, the authors are trying to say that every time the loss quit decreasing, they reduced the learning rate by a factor of 10, until the learning rate got to 10^-7.
If so, then this is a manual process, not something you can choose with Caffe parameters. Most of all, "when the loss stopped decreasing" is a non-trivial judgement, although a long-base moving average will give you a good indication. I expect that the authors did this manually, stopping and restarting the training from a checkpoint.
You can get a similar effect with a learning rate decay policy of step: set gamma to 0.1, and set the step parameter high enough to ensure that training has levelled off before each rate reduction. This will waste some computer time, but might save you overall trouble.
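If you prefer to place the drops yourself after watching where the loss levels off, a multistep solver fragment along these lines (the iteration numbers are placeholders to tune, not values from the paper) gives the same effect with non-uniform intervals:

lr_policy: "multistep"
base_lr: 0.001
gamma: 0.1
stepvalue: 40000
stepvalue: 80000
stepvalue: 120000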
Discount weight
In Caffe, the loss weight is merely the relative weighting among the various losses in the model, linear factors used to achieve the final loss statistic. Caffe does not provide run-time alteration of the weight. Perhaps this was something else that the authors tuned by hand.
I tried reading the areas of the paper around the two references to "discount weight", but found it hard reading. I'll wait until someone proofreads and edits that paper for grammar and clarity. In the meantime, I hope this answer helps you some.
You can find a little more information here.

Gradient descent stochastic update - Stopping criterion and update rule - Machine Learning

My dataset has m features and n data points. Let w be the weight vector (to be estimated). I'm trying to implement gradient descent with the stochastic update method. The function I'm minimizing is the least mean squares error.
The update algorithm is shown below:
for i = 1 ... n data:
    for t = 1 ... m features:
        w_t = w_t - alpha * (<w>.<x_i> - <y_i>) * x_t
where <x> is a row vector of m features, <y> is a column vector of true labels, and alpha is a constant.
My questions:
Now according to the wiki, I don't need to go through all data points and I can stop when the error is small enough. Is this true?
I don't understand what the stopping criterion should be here. If anyone can help with this, that would be great.
Is this formula - which I used in the for loop - correct? I believe (<w>.<x_i> - <y_i>) * x_t is my ∇Q(w).
Now according to the wiki, I don't need to go through all data points and I can stop when the error is small enough. Is this true?
This is especially true when you have a really huge training set and going through all the data points is very expensive. Then, you would check the convergence criterion after K stochastic updates (i.e. after processing K training examples). While it's possible, it doesn't make much sense to do this with a small training set. Another thing people do is randomize the order in which training examples are processed, to avoid having too many correlated examples in a row, which may result in "fake" convergence.
I don't understand what the stopping criterion should be here. If anyone can help with this, that would be great.
There are a few options. I recommend trying as many of them as possible and deciding based on empirical results.
difference in the objective function for the training data is smaller than a threshold.
difference in the objective function for held-out data (aka. development data, validation data) is smaller than a threshold. The held-out examples should NOT include any of the examples used for training (i.e. for stochastic updates) nor include any of the examples in the test set used for evaluation.
the total absolute difference in parameters w is smaller than a threshold.
in 1, 2, and 3 above, instead of specifying a threshold, you could specify a percentage. For example, a reasonable stopping criterion is to stop training when |squared_error(w) - squared_error(previous_w)| < 0.01 * squared_error(previous_w).
sometimes, we don't care if we have the optimal parameters. We just want to improve the parameters we originally had. In such a case, it's reasonable to preset a number of iterations over the training data and stop after that, regardless of whether the objective function actually converged.
Is this formula - which I used in the for loop - correct? I believe (w.x_i - y_i) * x_t is my ∇Q(w).
It should be 2 * (w.x_i - y_i) * x_t but it's not a big deal given that you're multiplying by the learning rate alpha anyway.
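As a rough sketch (my own illustration, not from the Wikipedia article), here is how the 2 * (w.x_i - y_i) * x_i update and the relative-change stopping criterion from point 4 above could be combined:

import numpy as np

def squared_error(w, X, y):
    return np.sum((X @ w - y) ** 2)

def sgd_least_squares(X, y, alpha=0.01, tol=0.01, max_epochs=1000, seed=0):
    # Stochastic gradient descent for least squares. Stops when the relative
    # change in the training objective between epochs falls below tol, or
    # after max_epochs passes over the data.
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros(m)
    prev = squared_error(w, X, y)
    for epoch in range(max_epochs):
        for i in rng.permutation(n):                  # randomize example order
            grad = 2 * (X[i] @ w - y[i]) * X[i]       # gradient of (w.x_i - y_i)^2
            w -= alpha * grad
        cur = squared_error(w, X, y)
        if abs(cur - prev) < tol * prev:
            break
        prev = cur
    return w

# toy usage: recover a known weight vector from noisy data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.01 * rng.normal(size=200)
print(sgd_least_squares(X, y))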

The best way to calculate the best threshold with P. Viola, M. Jones Framework

I'm trying to implement the P. Viola and M. Jones detection framework in C++ (at the beginning, simply a sequential classifier - not the cascaded version). I think I have designed all the required classes and modules (e.g. integral images, Haar features), except one - the most important: the AdaBoost core algorithm.
I have read the P. Viola and M. Jones original paper and many other publications. Unfortunately I still don't understand how I should find the best threshold for one weak classifier. I have found only small references to "weighted median" and "gaussian distribution" algorithms and many pieces of mathematical formulas...
I have tried to use the OpenCV Train Cascade module sources as a template, but they are so comprehensive that reverse engineering the code is very time-consuming. I also wrote my own simple code to understand the idea of Adaptive Boosting.
The question is: could you explain to me the best way to calculate the best threshold for one weak classifier?
Below I'm presenting the AdaBoost pseudo code, rewritten from a sample found on Google, but I'm not convinced it's the correct approach. Calculating one weak classifier is very slow (a few hours) and I especially have doubts about the method of calculating the best threshold.
(1) AdaBoost::FindNewWeakClassifier
(2) AdaBoost::CalculateFeatures
(3) AdaBoost::FindBestThreshold
(4) AdaBoost::FindFeatureError
(5) AdaBoost::NormalizeWeights
(6) AdaBoost::FindLowestError
(7) AdaBoost::ClassifyExamples
(8) AdaBoost::UpdateWeights
DESCRIPTION (1)
-Generates all possible arrangements of features in the detection window and puts them into a vector
DO IN LOOP
-Runs main calculating function (2)
END
DESCRIPTION(2)
-Normalizes weights (5)
DO FOR EACH HAAR FEATURE
-Puts sequentially next feature from list on all integral images
-Finds the best threshold for each feature (3)
-Finds the error for each feature's best threshold in the current iteration (4)
-Saves the error for each feature in the current iteration in an array
-Saves the threshold for each feature in the current iteration in an array
-Saves the threshold sign for each feature in the current iteration in an array
END LOOP
-Finds the classifier index with the lowest error selected by the above loop (6)
-Gets the error value of the best feature
-Calculates the value of the best feature on all integral images (7)
-Updates weights (8)
-Adds new, weak classifier to vector
DESCRIPTION (3)
-Calculates an error for each feature threshold on the positive integral images - separately for the "+" and "-" sign (4)
-Returns threshold and sign of the feature with the lowest error
DESCRIPTION(4)
- Returns feature error for all samples, by calculating inequality f(x) * sign < sign * threshold
DESCRIPTION (5)
-Ensures that samples weights are probability distribution
DESCRIPTION (6)
-Finds the classifier with the lowest error
DESCRIPTION (7)
-Calculates the value of the best feature on all integral images
-Counts false positives number and false negatives number
DESCRIPTION (8)
-Corrects weights, depending on classification results
Thank you for any help
In the original Viola-Jones paper, section 3.1 Learning Discussion (para 4, to be precise), you will find the procedure for finding the optimal threshold.
I'll sum up the method quickly below.
The optimal threshold for each feature depends on the sample weights and is therefore recalculated in every iteration of AdaBoost. The best weak classifier's threshold is saved as mentioned in the pseudo code.
In every round, for each weak classifier, you have to sort the N training samples according to the feature value. Placing a threshold will separate this sequence into 2 parts. Each part will have a majority of either positive or negative samples, along with a few samples of the other type.
T+ : total sum of positive sample weights
T- : total sum of negative sample weights
S+ : sum of positive sample weights below the threshold
S- : sum of negative sample weights below the threshold
The error for this particular threshold is:
e = min( S+ + (T- - S-), S- + (T+ - S+) )
Why the minimum? Here's an example:
If the samples and threshold are like this:
+ + + + + - - | + + - - - - -
In the first round, if all weights are equal (= w), taking the minimum will give you an error of 4*w instead of 10*w.
You calculate this error for all N possible ways of separating the samples.
The minimum error will give you the range of threshold values. The actual threshold is probably the average of the adjacent feature values (I'm not sure though, do some research on this).
This was the second step in your DO FOR EACH HAAR FEATURE loop.
The cascades given along with OpenCV were created by Rainer Lienhart and I don't know what method he used.
You could closely follow the OpenCV source codes to get any further improvements on this procedure.
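To make the threshold search concrete, here is a short sketch (my own, in Python rather than C++) of the weighted-error scan described above for a single feature; inputs are numpy arrays with labels +1/-1 and AdaBoost sample weights that sum to 1, and it returns the threshold, polarity and minimum weighted error e.

import numpy as np

def best_threshold(feature_values, labels, weights):
    order = np.argsort(feature_values)
    f, y, w = feature_values[order], labels[order], weights[order]

    T_pos = w[y > 0].sum()      # total positive weight
    T_neg = w[y < 0].sum()      # total negative weight
    S_pos = 0.0                 # positive weight below the current threshold
    S_neg = 0.0                 # negative weight below the current threshold

    best_err, best_thr, best_polarity = 1.0, f[0] - 1.0, 1
    for i in range(len(f)):
        # error if samples below the threshold are called positive, or negative
        err_pos_below = S_neg + (T_pos - S_pos)
        err_neg_below = S_pos + (T_neg - S_neg)
        err = min(err_pos_below, err_neg_below)
        if err < best_err:
            best_err = err
            # threshold just below the smallest value, or midway between neighbours
            best_thr = f[0] - 1.0 if i == 0 else 0.5 * (f[i] + f[i - 1])
            best_polarity = 1 if err_pos_below <= err_neg_below else -1
        if y[i] > 0:
            S_pos += w[i]
        else:
            S_neg += w[i]
    return best_thr, best_polarity, best_err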
