Some TensorFlow examples calculate the cost function like this:
cost = tf.reduce_sum((pred-y)**2 / (2*n_samples))
So the divisor is twice the number of samples.
Is the reason for the extra factor of 2 that, when the cost function is differentiated for backpropagation, the 1/2 cancels the factor of 2 coming from the power rule and saves an operation?
If so, is it still recommended to do this, and does it actually provide a significant performance improvement?
It's convenient in math, because one doesn't need to carry the 0.5 all along. But in code, it doesn't make a big difference, because this change makes the gradients (and, correspondingly, the updates of trainable variables) two times bigger or smaller. Since the updates are multiplied by the learning rate, this factor of 2 can be undone by a minor change of the hyperparameter. I say minor, because it's common to try the learning rates in log-scale during model selection anyway: 0.1, 0.01, 0.001, ....
As a result, whatever constant factor appears in the loss function, its effect is negligible and doesn't lead to any training speed-up. The choice of the right learning rate is more important.
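To make this concrete, here is a minimal sketch (assuming TensorFlow 2.x eager mode, with made-up numbers) showing that the 1/2 only rescales the gradient, and the learning rate can absorb that scaling:

import tensorflow as tf

pred = tf.Variable([1.0, 2.0, 3.0])
y = tf.constant([1.5, 2.5, 2.0])
n_samples = 3

with tf.GradientTape(persistent=True) as tape:
    cost_half = tf.reduce_sum((pred - y) ** 2) / (2 * n_samples)  # with the extra factor of 2
    cost_full = tf.reduce_sum((pred - y) ** 2) / n_samples        # plain mean squared error

g_half = tape.gradient(cost_half, pred)
g_full = tape.gradient(cost_full, pred)

# g_full is exactly 2 * g_half, so SGD with learning rate lr on cost_half takes
# the same steps as SGD with learning rate lr / 2 on cost_full.
print(g_half.numpy(), g_full.numpy())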
I am trying to implement a differentially private federated learning (FL) binary classification model using the Gaussian adaptive clipping (geometric) method.
aggregation_factory = tff.aggregators.DifferentiallyPrivateFactory.gaussian_adaptive(
    noise_multiplier=0.6,
    clients_per_round=10,
    initial_l2_norm_clip=0.1,
    target_unclipped_quantile=0.8,
    learning_rate=0.2)
I know that initial_l2_norm_clip is the initial value of the clipping norm, which is updated based on the target_unclipped_quantile value.
How can we determine the appropriate value of initial_l2_norm_clip for a particular model?
When I set initial_l2_norm_clip to 0.1 I get a really low AUC (around 0.4), but when I set it to a higher value of 1.0 I get a better AUC (around 0.8). In both cases the 'clip' metric recorded by the iterative process always increases (i.e. it goes from 0.1 to 0.3 and from 1.0 to 1.2).
My model runs for 13 rounds with 10 clients per round. Does this make a difference?
One thing I would flag is that 13 training rounds is relatively few in general. If you run the training for longer, I would expect the clip norm will eventually stabilize around the same value, regardless of the initial value.
The point of the adaptive selection of the clipping norm is that the hyperparameter configuration of the initial norm should not matter that much. If you see the clipping norm reported in the metrics increase during training, it means initial_l2_norm_clip is small relative to the target_unclipped_quantile of the values actually seen at runtime. So you can increase the initial norm and it should match the target quantile faster. If you want to spend the time tuning this parameter, you can also use the gaussian_fixed constructor and keep the clipping norm constant throughout training.
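For reference, the fixed-norm variant would be constructed roughly like this; the call below is a sketch modeled on the adaptive call in the question, so check the tff.aggregators documentation of your TFF version for the exact signature, and treat the values as placeholders:

import tensorflow_federated as tff

# Fixed L2 clipping norm: no adaptation during training (values are placeholders).
aggregation_factory = tff.aggregators.DifferentiallyPrivateFactory.gaussian_fixed(
    noise_multiplier=0.6,
    clients_per_round=10,
    clip=1.0)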
However, note that if you are interested in differential privacy, a larger clipping norm will likely degrade the guarantee you can get. So there is a tradeoff to be explored, together with the total number of rounds to train a model.
Some weeks ago I started coding the Levenberg-Marquardt algorithm from scratch in Matlab. I'm interested in polynomial fitting of the data, but I haven't been able to achieve the level of accuracy I would like. I'm using a fifth-order polynomial; after trying other polynomials, it seemed to be the best option. The algorithm always converges to the same minimum no matter what improvements I try to implement. So far, I have unsuccessfully added the following features:
Geodesic acceleration term as a second order correction
Delayed gratification for updating the damping parameter
Gain factor to get closer to the Gauss-Newton direction or the steepest descent direction depending on the iteration
Central differences and forward differences for the finite difference method
I don't have experience in nonlinear least squares, so I don't know if there is a way to reduce the residual even further or if there is no more room for improvement with this method. I attach below an image of the behavior of the polynomial for the last iterations. If I run the code for more iterations, the curve stops changing from iteration to iteration. As can be seen, there is a good fit from time = 0 to time = 12, but I'm not able to fix the behavior of the function from time = 12 to time = 20. Any help will be much appreciated.
Fitting a polynomial does not seem to be the best idea. Your data set looks like an exponential transient with a horizontal asymptote, and forcing a polynomial onto that will work very poorly.
I'd rather try a simpler model, such as
A (1 - e^(-at)).
By eye, A ≈ 15. You should have a look at the values of log(15 - y).
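As an illustration, here is a sketch of fitting that model with nonlinear least squares. The question uses Matlab, so this Python/SciPy version, with synthetic placeholder data standing in for your time/response samples, is only meant to show the idea:

import numpy as np
from scipy.optimize import curve_fit

def model(t, A, a):
    return A * (1.0 - np.exp(-a * t))

# Placeholder data, roughly shaped like the transient described in the question.
t = np.linspace(0.0, 20.0, 50)
y = 15.0 * (1.0 - np.exp(-0.4 * t)) + np.random.normal(0.0, 0.2, t.size)

p0 = [15.0, 0.5]                         # rough initial guess: plateau ~15 by eye
popt, pcov = curve_fit(model, t, y, p0=p0)
A_hat, a_hat = popt

# Sanity check suggested above: if the model is right, log(A_hat - y) should be
# roughly linear in t (away from the asymptote, where A_hat - y stays positive).
print(A_hat, a_hat)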
I just finished implementing a convolutional neural network from scratch. This is the first time I've done this. When testing my backpropagation algorithm, the output delta values for the weights are extremely large compared to the original values. For example, all my weights are initialized to a random number between -0.1 and 0.1, but the delta values that come out are around 75000. This is obviously much too big a change, and it requires a very small learning rate to even be near functional. A learning rate like 0.01 seems to be the convention, but mine needs to be at least 0.0000001, leading me to believe I'm doing something wrong.
The thing is, I don't see how the deltas could not be large. To get the derivative of the weights with respect to the cost function, I convolve the activations of the previous layer (mostly positive due to leaky ReLU) with the previous errors (all either 0.1 or 1 due to the derivative of leaky ReLU). Obviously the sum of all these positive numbers gets very large as it propagates through the layers. Did I skip a step somewhere? Is this an exploding gradient problem? Should I use gradient clipping or batch normalization?
Depending on the size of the convolutions, -0.1 to 0.1 seems extremely large. Try something like 0.01 or even less.
If you want to do a more insightful initialization you can take a look at glorot (http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?hc_location=ufi) or he (https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf) initializations.
The crux is to initialize with either uniform or Gaussian values with mean 0 and a standard deviation that scales like the inverse square root of the fan-in (the number of inputs to one output unit, i.e. kernel height × kernel width × input channels for a convolution); He initialization uses sqrt(2 / fan_in).
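As a sketch of what that looks like in practice (NumPy assumed, using the He variant since the question uses leaky ReLU; the function name and shapes are only illustrative):

import numpy as np

def he_conv_init(kernel_h, kernel_w, in_channels, out_channels, rng=None):
    # He initialization: zero-mean Gaussian with std = sqrt(2 / fan_in),
    # where fan_in counts every input feeding one output unit.
    rng = rng or np.random.default_rng()
    fan_in = kernel_h * kernel_w * in_channels
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(out_channels, in_channels, kernel_h, kernel_w))

W = he_conv_init(3, 3, 16, 32)   # e.g. a 3x3 convolution going from 16 to 32 channels
print(W.std())                   # should be close to sqrt(2 / (3*3*16)) ≈ 0.118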
I understand how to compute the forward pass in deep learning. Now I want to understand the backward pass. Let's take X(2,2) as an example. The backward pass at position X(2,2) can be computed as in the figure below.
My question is: where does dE/dY (such as dE/dY(1,1), dE/dY(1,2), ...) come from in the formula? How is it computed at the first iteration?
SHORT ANSWER
Those terms are in the final expansion at the bottom of the slide; they contribute to the summation for dE/dX(2,2). In your first back-propagation, you start at the end and work backwards (hence the name) -- and the Y values are the ground-truth labels. So much for computing them. :-)
LONG ANSWER
I'll keep this in more abstract, natural-language terms. I'm hopeful that the alternative explanation will help you see the big picture as well as sort out the math.
You start the training with assigned weights that may or may not be at all related to the ground truth (labels). You move blindly forward, making predictions at each layer based on naive faith in those weights. The Y(i,j) values are the resulting meta-pixels from that faith.
Then you hit the labels at the end. You work backward, adjusting each weight. Note that, at the last layer, the Y values are the ground-truth labels.
At each layer, you mathematically deal with two factors:
How far off was this prediction?
How heavily did this parameter contribute to that prediction?
You adjust the X-to-Y weight by "off * weight * learning_rate".
When you complete that for layer N, you back up to layer N-1 and repeat.
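To make those two factors concrete, here is a minimal NumPy sketch for a single fully connected layer; a convolutional layer follows the same structure, just with convolutions in place of the matrix products (all names and shapes here are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4,))        # activations coming into this layer
W = rng.normal(size=(3, 4))      # this layer's weights
Y = W @ X                        # forward pass: this layer's output

dE_dY = rng.normal(size=(3,))    # upstream gradient handed back from layer N+1
dE_dW = np.outer(dE_dY, X)       # how much each weight contributed to the error
dE_dX = W.T @ dE_dY              # what gets handed back to layer N-1

learning_rate = 0.01
W -= learning_rate * dE_dW       # the "how far off * contribution * learning_rate" update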
PROGRESSION
Whether you initialize your weights with fixed or random values (I generally recommend the latter), you'll notice that there's really not much progress in the early iterations. Since this is slow adjustment from guess-work weights, it takes several iterations to get a glimmer of useful learning into the last layers. The first layers are still cluelessly thrashing at this point. The loss function will bounce around close to its initial values for a while. For instance, with GoogLeNet's image recognition, this flailing lasts for about 30 epochs.
Then, finally, you get some valid learning in the latter layers, and the patterns stabilize enough that some consistency percolates back to the early layers. At this point, you'll see the loss function drop to a "directed experimentation" level. From there, the progression depends a lot on the paradigm and texture of the problem: some have a sharp drop, then a gradual convergence; others have a more gradual drop, almost an exponential decay to convergence; more complex topologies have additional sharp drops as middle or early phases "get their footing".
I learnt gradient descent through online resources (namely, the Machine Learning course on Coursera). However, the material provided only said to repeat gradient descent until it converges.
Their definition of convergence was to use a graph of the cost function relative to the number of iterations and watch when the graph flattens out. Therefore I assume that I would do the following:
if (change_in_costfunction > precisionvalue) {
    repeat gradient_descent
}
Alternatively, I was wondering if another way to determine convergence is to watch each coefficient approach its true value:
if (change_in_coefficient_j > precisionvalue) {
    repeat gradient_descent_for_j
}
...repeat for all coefficients
So is convergence based on the cost function or the coefficients? And how do we determine the precision value? Should it be a % of the coefficient or total cost function?
You can imagine how Gradient Descent (GD) works by picturing a marble thrown into a bowl while you take photos of it. The marble oscillates until friction stops it at the bottom. Now imagine an environment where friction is so small that the marble takes a long time to stop completely; we can then assume that when the oscillations are small enough the marble has reached the bottom (although it could keep oscillating). In the following image you can see the first eight steps (photos of the marble) of GD.
If we keep taking photos, the marble makes no appreciable movement; you would have to zoom in on the image:
We could keep taking photos and the movements would become even more negligible.
So reaching a point at which GD makes very small changes in your objective function is called convergence. That doesn't mean it has reached the optimal result, but it is very close to it, if not exactly on it.
The precision value can be chosen as the threshold at which consecutive iterations of GD are almost the same:
grad(i) = 0.0001
grad(i+1) = 0.000099989 <-- grad has changed by only ~0.01% => STOP
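As a sketch of that stopping rule in code (a toy one-dimensional quadratic is used here just so the loop is runnable; precision is the relative-change threshold):

def cost(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w, lr, precision = 0.0, 0.1, 1e-4
prev_cost = cost(w)
while True:
    w -= lr * grad(w)                                        # one gradient descent step
    new_cost = cost(w)
    if abs(prev_cost - new_cost) <= precision * prev_cost:   # relative change small enough
        break
    prev_cost = new_cost

print(w)   # close to the minimizer 3.0, though not exactly on it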
I think I understand your question. In my understanding, GD is driven by the cost function: it iterates until the cost function converges.
Imagine plotting a graph of the cost function (y-axis) against the number of iterations of GD (x-axis).
Now, if GD works properly, the curve is decreasing and convex (similar to 1/x). Since the curve is decreasing, the drop in the cost function becomes smaller and smaller, and at some point the curve is almost flat. Around that point we say GD has more or less converged (again, where the cost function decreases by less than the precision_value per iteration).
So, I would say your first approach is what you need:
if (change_in_costFunction > precision_value)
    repeat GD;