I am writing a solver.prototxt that follows the rules from the paper https://arxiv.org/pdf/1604.02677.pdf:
In the training phase, the learning rate was set as 0.001 initially and decreased by a factor of 10 when the loss stopped decreasing, till 1e-7. The discount weight was set as 1 initially and decreased by a factor of 10 every ten thousand iterations until a marginal value of 1e-3.
Note that the discount weight is loss_weight in Caffe. Based on the information above, I wrote my solver as
train_net: "train.prototxt"
lr_policy: "step"
gamma: 0.1
stepsize: 10000
base_lr: 0.001 #0.002
In train.prototxt, I also set
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "deconv"
  bottom: "label"
  top: "loss"
  loss_weight: 1
}
However, I still don't know how to set up the solver to satisfy the rules "decreased by a factor of 10 when the loss stopped decreasing till 1e-7" and "decreased by a factor of 10 every ten thousand iterations until a marginal value 1e-3". I did not find any Caffe policy that can do this in the reference:
// The learning rate decay policy. The currently implemented learning rate
// policies are as follows:
// - fixed: always return base_lr.
// - step: return base_lr * gamma ^ (floor(iter / step))
// - exp: return base_lr * gamma ^ iter
// - inv: return base_lr * (1 + gamma * iter) ^ (- power)
// - multistep: similar to step but it allows non uniform steps defined by
// stepvalue
// - poly: the effective learning rate follows a polynomial decay, to be
// zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
// - sigmoid: the effective learning rate follows a sigmod decay
// return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
//
// where base_lr, max_iter, gamma, step, stepvalue and power are defined
// in the solver parameter protocol buffer, and iter is the current iteration.
If anyone knows how, please give me some guidance on writing the solver.prototxt to satisfy the conditions above.
Learning rate reduction
Part of the problem is that the phrase "decreased by a factor of 10 when the loss stopped decreasing till 1e-7" does not quite make sense. I think that, perhaps, the authors are trying to say that every time the loss stopped decreasing, they reduced the learning rate by a factor of 10, until the learning rate reached 1e-7.
If so, then this is a manual process, not something you can select with Caffe parameters. Above all, "when the loss stopped decreasing" is a non-trivial judgement, although a long-window moving average of the loss will give you a good indication. I expect that the authors did this manually, stopping and restarting the training from a checkpoint.
You can get a similar effect with the step learning rate decay policy: set gamma to 0.1, and set the stepsize parameter high enough to ensure that training has levelled off before each rate reduction. This will waste some compute time, but might save you trouble overall. If you would rather place the drop points by hand, multistep lets you put them at non-uniform iterations, as in the sketch below.
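For illustration, here is a sketch of such a solver.prototxt using the multistep policy. The stepvalue iterations below are hypothetical placeholders, not values from the paper; you would pick them by watching where your own loss curve levels off. Four drops take the rate from 0.001 down to the paper's 1e-7 floor:
train_net: "train.prototxt"
base_lr: 0.001
lr_policy: "multistep"
gamma: 0.1
# Hypothetical plateau points -- choose these from your own loss curves.
stepvalue: 40000
stepvalue: 80000
stepvalue: 120000
stepvalue: 160000   # 0.001 * 0.1^4 = 1e-7
max_iter: 200000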
Discount weight
In Caffe, loss_weight is merely the relative weighting among the various losses in the model: a linear factor applied to each loss before they are summed into the final loss value. Caffe does not provide run-time alteration of the weight. Perhaps this was something else that the authors tuned by hand.
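For illustration, loss weights appear in the net prototxt roughly like this (a hypothetical two-loss net; the layer names, blobs, and the 0.1 weight are invented for the example). The total optimized loss here would be 1.0 * loss_a + 0.1 * loss_b:
layer {
  name: "loss_a"
  type: "SoftmaxWithLoss"
  bottom: "fc_a"
  bottom: "label"
  top: "loss_a"
  loss_weight: 1.0
}
layer {
  name: "loss_b"
  type: "EuclideanLoss"
  bottom: "fc_b"
  bottom: "target"
  top: "loss_b"
  loss_weight: 0.1   # this loss contributes ten times less to the total
}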
I tried reading the parts of the paper around the two references to "discount weight", but found them hard going. I'll wait until someone proofreads and edits that paper for grammar and clarity. In the meantime, I hope this answer helps you somewhat.
Related
I am trying to figure out this whole machine learning thing, so I have been doing some testing. I wanted to make a network learn the sine function (of an angle in radians). The neural network is:
1 input (angle in radians) / 2 hidden layers / 1 output (prediction of the sine)
For the squashing activation I am using ReLU; it's important to note that when I was using the logistic function instead of ReLU, the script worked.
To build the training data, I made a loop that starts at 0 and finishes at 180, converts the index to radians (radian = loop_index*Math.PI/180), then computes the sine of that angle and stores both the radians and the sine result.
So my table looks like this for an entry: {input:[RADIAN ANGLE], output:[sin(radian)]}
for (var i = 0; i <= 180; i++) {
  var radian = i * (Math.PI / 180);
  train_table.push({input: [radian], output: [Math.sin(radian)]});
}
I use this table to train my neural network using cross-entropy and a learning rate of 0.3 with 20000 iterations.
The problem is that it fails: when I try to predict anything, it returns "NaN".
I am using the framework Synaptic (https://github.com/cazala/synaptic) and here is a JSFiddle of my code: https://jsfiddle.net/my7xe9ks/2/
A learning rate must be carefully tuned; this parameter matters a lot, especially when the gradients explode and you get a NaN. When that happens, you have to reduce the learning rate, usually by a factor of 10.
In your specific case, the learning rate is too high; if you use 0.05 or 0.01, the network trains and works properly.
Another important detail is that you are using cross-entropy as the loss. Cross-entropy is meant for classification, and you have a regression problem, so you should prefer a mean squared error loss instead.
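To illustrate the point, here is a minimal NumPy sketch (not Synaptic code; the 16-unit hidden layer, the tanh squashing, and the 0.01 rate are arbitrary choices for the demonstration) of a tiny regression net that fits sin(x) with a mean squared error loss and a small learning rate, without blowing up to NaN:
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, np.pi, 181).reshape(-1, 1)   # 0..180 degrees, in radians
y = np.sin(x)

# One hidden layer of 16 tanh units; sizes and init scale are arbitrary.
W1 = rng.normal(0.0, 0.5, (1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.5, (16, 1)); b2 = np.zeros(1)
lr = 0.01   # far below the 0.3 that produced NaN

for epoch in range(20000):
    h = np.tanh(x @ W1 + b1)            # hidden activations
    pred = h @ W2 + b2                  # linear output: no squashing for regression
    err = pred - y                      # gradient of MSE, up to a constant factor
    dW2 = h.T @ err / len(x); db2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h**2)    # backprop through tanh
    dW1 = x.T @ dh / len(x); db1 = dh.mean(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

pred = np.tanh(x @ W1 + b1) @ W2 + b2
print("final MSE:", float(np.mean((pred - y) ** 2)))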
If I have a trained binary classifier, what is the probability of making a correct prediction by chance?
For example, let's say that I want to make 5 predictions. What is the probability of getting all 5 predictions correct by chance?
Is it: 0.5 * 0.5 * 0.5 * 0.5 * 0.5 = 0.0313 ?
You are correct, but only under the assumption that the two classes are equally probable and each prediction is an independent 50/50 guess.
As a similar thought experiment, even a model with 99% accuracy (meaning that, for any randomly chosen sample, it provides the correct label 99% of the time) does not have a high probability of classifying every sample in a set correctly. For 100 samples it is just about 36%, for 300 it is less than 5%, and for 1000 it is about 0.004%.
In general, the probability of many independent events all happening falls off very quickly (exponentially) as the number of events grows, if the probability of each individual success is constant.
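These numbers are easy to verify, e.g. with a few lines of Python (assuming independent predictions):
# Probability of getting every prediction correct, by chance or by a 99%-accurate model.
print(0.5 ** 5)      # ~0.03125: five 50/50 guesses, all correct
print(0.99 ** 100)   # ~0.366
print(0.99 ** 300)   # ~0.049
print(0.99 ** 1000)  # ~4.3e-05, i.e. about 0.004%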
I am just trying to figure out how to use Caffe. To do so, I took a look at the different .prototxt files in the examples folder. There is one option I don't understand:
# The learning rate policy
lr_policy: "inv"
Possible values seem to be:
"fixed"
"inv"
"step"
"multistep"
"stepearly"
"poly"
Could somebody please explain those options?
It is common practice to decrease the learning rate (lr) as the optimization/learning process progresses. However, it is not clear exactly how the learning rate should be decreased as a function of the iteration number.
If you use DIGITS as an interface to Caffe, you will be able to visually see how the different choices affect the learning rate.
fixed: the learning rate is kept fixed throughout the learning process.
inv: the learning rate decays as ~1/T.
step: the learning rate is piecewise constant, dropping every X iterations.
multistep: piecewise constant at arbitrary intervals.
You can see exactly how the learning rate is computed in the function SGDSolver<Dtype>::GetLearningRate (solvers/sgd_solver.cpp line ~30).
Recently, I came across an interesting and unconventional approach to learning-rate tuning: Leslie N. Smith's work "No More Pesky Learning Rate Guessing Games". In his report, Leslie suggests using an lr_policy that alternates between decreasing and increasing the learning rate. His work also suggests how to implement this policy in Caffe.
If you look inside the /caffe-master/src/caffe/proto/caffe.proto file, you will see the following description:
// The learning rate decay policy. The currently implemented learning rate
// policies are as follows:
// - fixed: always return base_lr.
// - step: return base_lr * gamma ^ (floor(iter / step))
// - exp: return base_lr * gamma ^ iter
// - inv: return base_lr * (1 + gamma * iter) ^ (- power)
// - multistep: similar to step but it allows non uniform steps defined by
// stepvalue
// - poly: the effective learning rate follows a polynomial decay, to be
// zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
// - sigmoid: the effective learning rate follows a sigmod decay
// return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
//
// where base_lr, max_iter, gamma, step, stepvalue and power are defined
// in the solver parameter protocol buffer, and iter is the current iteration.
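If it helps to see those formulas as executable code, here is a small Python transcription (a sketch mirroring what SGDSolver<Dtype>::GetLearningRate computes; parameter names follow caffe.proto):
import math

def get_learning_rate(policy, it, base_lr, gamma=0.1, power=1.0,
                      stepsize=1, stepvalues=(), max_iter=1):
    if policy == "fixed":
        return base_lr
    if policy == "step":
        return base_lr * gamma ** (it // stepsize)
    if policy == "exp":
        return base_lr * gamma ** it
    if policy == "inv":
        return base_lr * (1 + gamma * it) ** (-power)
    if policy == "multistep":
        drops = sum(1 for sv in stepvalues if it >= sv)  # one drop per stepvalue passed
        return base_lr * gamma ** drops
    if policy == "poly":
        return base_lr * (1 - it / max_iter) ** power
    if policy == "sigmoid":
        return base_lr / (1 + math.exp(-gamma * (it - stepsize)))
    raise ValueError("unknown policy: " + policy)

# e.g. the step policy with gamma=0.1 and stepsize=10000:
print(get_learning_rate("step", 25000, 0.001, gamma=0.1, stepsize=10000))  # 1e-05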
My dataset has m features and n data points. Let w be a vector of weights (to be estimated). I'm trying to implement gradient descent with stochastic updates. The function I'm minimizing is the least mean squares error.
The update algorithm is shown below:
for i = 1 ... n:        # data points
    for t = 1 ... m:    # features
        w_t = w_t - alpha * (w.x_i - y_i) * x_{i,t}
where x_i is a row vector of the m features of example i, x_{i,t} is feature t of that example, y is a column vector of true labels, and alpha is a constant learning rate.
My questions:
According to the wiki, I don't need to go through all the data points, and I can stop when the error is small enough. Is that true?
I don't understand what the stopping criterion should be here. If anyone can help with this, that would be great.
Is the formula I used in the for loop correct? I believe (w.x_i - y_i) * x_{i,t} is my ∇Q(w).
According to the wiki, I don't need to go through all the data points, and I can stop when the error is small enough. Is that true?
This is especially true when you have a really huge training set and going through all the data points is expensive. In that case, you would check the convergence criterion after every K stochastic updates (i.e. after processing K training examples). While it's possible, it doesn't make much sense to do this with a small training set. Another thing people do is randomize the order in which training examples are processed, to avoid having too many correlated examples in a row, which may result in "fake" convergence.
I don't understand what the stopping criterion should be here. If anyone can help with this, that would be great.
There are a few options. I recommend trying several of them and deciding based on empirical results.
difference in the objective function on the training data is smaller than a threshold.
difference in the objective function on held-out data (aka development data, validation data) is smaller than a threshold. The held-out examples should NOT include any of the examples used for training (i.e. for stochastic updates), nor any of the examples in the test set used for evaluation.
the total absolute difference in the parameters w is smaller than a threshold.
in 1, 2, and 3 above, instead of specifying a threshold, you could specify a percentage. For example, a reasonable stopping criterion is to stop training when |squared_error(w) - squared_error(previous_w)| < 0.01 * squared_error(previous_w), as in the sketch after this list.
sometimes, we don't care whether we have the optimal parameters; we just want to improve the parameters we started with. In such a case, it's reasonable to preset a number of iterations over the training data and stop after that, regardless of whether the objective function actually converged.
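Here is a Python sketch of option 4 above (the synthetic data, the alpha of 0.01, and the 1% tolerance are illustrative choices, not recommendations):
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 5
X = rng.normal(size=(n, m))                             # synthetic design matrix
y = X @ rng.normal(size=m) + 0.1 * rng.normal(size=n)   # noisy linear targets

w = np.zeros(m)
alpha = 0.01
prev_err = np.inf

for epoch in range(100):
    for i in rng.permutation(n):                 # randomized order, as noted above
        w -= alpha * (X[i] @ w - y[i]) * X[i]    # stochastic update for example i
    err = np.mean((X @ w - y) ** 2)              # training objective after the pass
    if abs(prev_err - err) < 0.01 * prev_err:    # relative-change stopping rule
        break
    prev_err = err

print("stopped after epoch", epoch, "with MSE", err)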
Is the formula I used in the for loop correct? I believe (w.x_i - y_i) * x_{i,t} is my ∇Q(w).
Strictly, it should be 2 * (w.x_i - y_i) * x_{i,t}: the per-example objective is Q_i(w) = (w.x_i - y_i)^2, and differentiating brings down a factor of 2. It's not a big deal, though, since that constant factor can be absorbed into the learning rate alpha anyway.
I am taking my first steps with neural networks, and to do so I am experimenting with a very simple single-layer, single-output perceptron which uses a sigmoidal activation function. I am updating my weights online each time a training example is presented, using:
weights += learningRate * (correct - result) * {input,1}
Here weights is a vector of length n which also contains the weight from the bias neuron (the negative threshold), result is the result computed by the perceptron (and passed through the sigmoid) when given the input, correct is the correct result, and {input,1} is the input augmented with 1 (the fixed input from the bias neuron). Now, when I try to train the perceptron to perform logical AND, the weights don't converge for a long time; instead, they keep growing while maintaining a ratio of about -1.5 with the threshold. For instance, the three weights are, in sequence:
5.067160008240718 5.105631826680446 -7.945513136885797
...
8.40390853077094 8.43890306970281 -12.889540730182592
I would expect the perceptron to stop at 1, 1, -1.5.
Apart from this problem, which looks like it is connected to some missing stopping condition in the learning, if I try to use the identity function as the activation function, I get weight values oscillating around:
0.43601272528257057 0.49092558197172703 -0.23106430854347537
and I obtain similar results with tanh. I can't explain this.
This is because the sigmoid activation function never actually reaches one (or zero), even for very large positive (or negative) inputs. So (correct - result) will always be non-zero, and your weights will always get updated. Try the step function as the activation function instead (i.e. f(x) = 1 for x > 0, f(x) = 0 otherwise).
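For example, here is a minimal Python sketch of that suggestion (the 0.1 learning rate and the epoch count are arbitrary): a perceptron with a step activation learns AND, and its weights stop changing once every example is classified correctly.
import numpy as np

# Truth table for logical AND.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights = np.zeros(3)                          # two input weights plus the bias weight
rate = 0.1

for epoch in range(100):
    for inputs, correct in data:
        x = np.array(inputs + [1])             # augment with the fixed bias input
        result = 1 if weights @ x > 0 else 0   # step activation
        weights += rate * (correct - result) * x

print(weights)  # once AND is learned, (correct - result) is 0 for every example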
Your average weight values don't seem right for the identity activation function. It may be that your learning rate is a little high; try reducing it and see whether that shrinks the oscillations.
Also, when doing online learning (aka stochastic gradient descent), it is common practice to decay the learning rate over time so that you converge to a solution; otherwise, your weights will continue to oscillate.
When trying to analyze the behavior of the perceptron, it also helps to look at correct and result.