I started learning machine learning recently and ran into a problem while reading PRML. The book introduces the LMS algorithm and uses it to solve a regression problem.
w_{i+1} = w_i - alpha * gradient
I don't know how to determine 'alpha'.
How should I choose it?
alpha is the learning rate (learning_rate), or step size, of gradient descent. It is a critical hyperparameter that you need to tune, so you could start with a list of candidates, like:
0.1, 0.01, 0.001, ...
and see which one works better with respect to training time and prediction accuracy. If the learning_rate is too high, you might see the cost is not decreasing (or even increasing). On the other hand, if it is too low, you might notice the learning is taking too long to converge to a good state.
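As a rough illustration of the trial-and-error procedure above, here is a minimal sketch in plain Python. The toy data (generated from y = 1 + 2x), the helper name lms_fit, and the blow-up threshold are assumptions made for illustration; none of them come from PRML. The sketch applies the LMS update with several candidate values of alpha and records the final mean squared error, stopping early if the error explodes.

```python
def lms_fit(xs, ys, alpha, epochs=50, blowup=1e6):
    """Fit y ~ w0 + w1*x with the LMS (sequential gradient) rule.

    Returns the mean squared error after the last completed epoch.
    Training stops early once the error exceeds `blowup` (divergence).
    """
    w0, w1 = 0.0, 0.0
    mse = float("inf")
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = y - (w0 + w1 * x)   # target minus current prediction
            w0 += alpha * error         # step against the gradient of the squared error
            w1 += alpha * error * x
        residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
        mse = sum(r * r for r in residuals) / len(residuals)
        if mse > blowup:                # learning rate too high: cost explodes
            break
    return mse


# Toy data, assumed for illustration: noiseless points on y = 1 + 2x.
xs = [i / 10 for i in range(-10, 11)]
ys = [1.0 + 2.0 * x for x in xs]

results = {}
for alpha in (3.0, 0.1, 0.01, 0.001):
    results[alpha] = lms_fit(xs, ys, alpha)
    print(f"alpha={alpha:<6} final MSE={results[alpha]:.3e}")
```

On this toy problem the largest alpha makes the cost grow instead of shrink, while the smallest one is still far from the solution after 50 epochs, which is the trade-off described above.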
If you use TensorFlow (or another library or tool) to implement the algorithm, you need to choose an optimizer, for example a GradientDescentOptimizer. You will notice that its first argument is learning_rate:
learning_rate: A Tensor or a floating point value. The learning rate to use.
NOTE: in the functions below, θ (theta) denotes the parameter vector; wherever it appears as 0, read it as theta, not zero.
I've been studying Andrew Ng's Machine Learning course, and I have the following question:
(Short version: looking at all the mathematical expressions used for both forward and backward propagation, it appears to me that we never use the cost function directly, only its derivative. So what is the importance of the cost function and its choice? Is it purely to evaluate our system whenever we feel like it?)
Andrew mentioned that for logistic regression, using the MSE (mean squared error) cost function
wouldn't be good, because applying it to our sigmoid function would yield a non-convex cost function with many local optima, so it is best to use the following logistic cost function:
which has two graphs (one for y=0 and one for y=1), both of which are convex.
My question is the following. Our objective is to minimize the cost function (that is, to drive its derivative to 0), which we achieve with gradient descent, updating our weights using the derivative of the cost function, which in both cases (both cost functions) is the same derivative:
dJ = (h_θ(x^(i)) - y^(i)) · x^(i)
So how does the choice of cost function affect our algorithm at all? In forward propagation, all we need is
h_θ(x^(i)) = Sigmoid(θ^T x^(i))
which can be calculated without ever evaluating the cost function, and in backward propagation and in updating the weights we always use the derivative of the cost function. So when does the cost function itself come into play? Is it only needed when we want an indication of how well our network is doing? (And if so, why not just rely on the derivative for that?)
Forward propagation does not need the cost function in any way, because you are just applying your learned weights to the corresponding input.
The cost function is generally used to measure how good your algorithm is by comparing your model's output (obtained by applying your current weights to your input) with the true label of the input (in supervised algorithms). The main objective is therefore to minimize the cost function, since (in most cases) you want the difference between the prediction and the true label to be as small as possible. In optimization it is very helpful if the function you want to optimize is convex, because convexity guarantees that any local minimum you find is also the global minimum.
To minimize the cost function, gradient descent iteratively updates your weights to move closer to the minimum. The gradient is taken with respect to the model's weights, so that you can update them to achieve the lowest possible cost. The backpropagation algorithm computes these gradients of the cost function in the backward pass, and they are then used to adjust the weights.
Technically, you are correct: we do not explicitly use the cost function in any of the calculations for forward propagation and back propagation.
You asked 'what is the importance of the cost function and its choice anyway?'. I have two answers:
The cost function is incredibly important because its gradient is what allows us to update our weights. Although we are only actually computing the gradient of the cost function and not the cost function itself, choosing a different cost function would mean we would have a different gradient, thus changing how we update our weights.
The cost function allows us to evaluate our model performance. It is common practice to plot cost vs epoch to understand how the cost decreases over time.
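To make the first point concrete, here is a short worked derivation. It is a sketch under stated assumptions: a single training example (x, y), the sigmoid hypothesis h = σ(θ^T x), and a factor of 1/2 in the squared error. With the sigmoid in place, the two cost functions do not share a gradient; the form (h - y) x quoted in the question is the gradient of the logistic cost, while the squared error picks up an extra factor h(1 - h).

```latex
% Single example (x, y), hypothesis h = \sigma(\theta^T x),
% using \sigma'(z) = \sigma(z)\,(1 - \sigma(z)).
\begin{align*}
J_{\text{MSE}}(\theta) &= \tfrac{1}{2}\,(h - y)^2
  & \frac{\partial J_{\text{MSE}}}{\partial \theta} &= (h - y)\,h\,(1 - h)\,x \\
J_{\text{log}}(\theta) &= -\,y \log h - (1 - y)\log(1 - h)
  & \frac{\partial J_{\text{log}}}{\partial \theta} &= (h - y)\,x
\end{align*}
```

The extra factor h(1 - h) shrinks toward zero when the sigmoid saturates, so the squared-error update can become very small even when the prediction is badly wrong.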
Your question indicated that you essentially understood all of this already, but I hoped to clarify it a bit. Thanks!
Is the loss dependent on learning rate and batch size? For example, if I keep a batch size of 4 and a learning rate of, say, 0.002, the loss does not converge, but if I change the batch size to 32 while keeping the learning rate the same, I get a converging loss curve. Is this OK?
I would say that the loss is highly dependent on the parameters you use for training. On the other hand, I would not call it a dependency in the sense of a mathematical function, but rather a relation.
If your network does not learn, you need to tweak the parameters (architecture, learning rate, batch size, etc.).
It is hard to give a more specific answer to your question. Which parameters are OK depends on the problem. However, if training converges and you can validate your solution, I would say that you are fine.
I am using dice loss for my implementation of a fully convolutional network (FCN) that involves hypernetworks. The model has two inputs and one output, which is a binary segmentation map. The model is updating its weights, but the loss stays constant.
It does not even overfit on just three training examples.
I have tried other loss functions as well, such as dice + binary cross-entropy loss, Jaccard loss, and MSE loss, but the loss is still almost constant.
I have also tried almost every activation function, such as ReLU, LeakyReLU, and Tanh. Moreover, I have to use a sigmoid at the output because I need my outputs to be in the range [0,1].
The learning rate is 0.01. I have also tried different learning rates, such as 0.0001, 0.001, and 0.1. No matter what loss the training starts at, it always settles at this value.
The output below shows the gradients for three training examples and the overall loss:
tensor(0.0010, device='cuda:0')
tensor(0.1377, device='cuda:0')
tensor(0.1582, device='cuda:0')
Epoch 9, Overall loss = 0.9604763123724196, mIOU=0.019766070265581623
tensor(0.0014, device='cuda:0')
tensor(0.0898, device='cuda:0')
tensor(0.0455, device='cuda:0')
Epoch 10, Overall loss = 0.9616242945194244, mIOU=0.01919178702228237
tensor(0.0886, device='cuda:0')
tensor(0.2561, device='cuda:0')
tensor(0.0108, device='cuda:0')
Epoch 11, Overall loss = 0.960331304506822, mIOU=0.01983801422510155
I expect the loss to converge in a few epochs.
What should I do?
This is not really a question for Stack Overflow. There are a million things that could be wrong, and it is usually not possible to post enough code to allow us to pinpoint the issue. Even if it were, nobody would bother reading that much.
That being said, there are some general guidelines which often work for me.
Try reducing the problem. If you replace your network with a single convolutional layer, does it converge? If yes, something is wrong with your network.
Look at the data as you feed it as well as the labels (matplotlib plots, etc). Perhaps you're misaligning input with output (cropping issues, etc) or your data augmentation is way too strong.
Look for, well..., bugs. Perhaps you're returning torch.sigmoid(x) from your network and then feeding it into torch.nn.functional.binary_cross_entropy_with_logits (effectively applying sigmoid twice). Maybe your last layer is ReLU and your network just cannot (by construction) output negative values where you would expect them.
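The double-sigmoid bug mentioned above is easy to miss, so here is a minimal sketch of its effect. It uses plain Python stand-ins (sigmoid and bce_with_logits are hypothetical helpers written for this example, not the PyTorch functions) to show that applying the sigmoid twice squeezes every prediction into roughly (0.5, 0.73), so the loss can barely move no matter what the network outputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(logit, target):
    """Binary cross-entropy that applies the sigmoid itself (expects raw logits)."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


logits = [-10.0, -2.0, 0.0, 2.0, 10.0]   # raw network outputs
once = [sigmoid(z) for z in logits]      # what the network returns if it ends in a sigmoid
twice = [sigmoid(p) for p in once]       # what the loss then computes internally

# A confident, correct prediction for a pixel whose true label is 0:
correct_loss = bce_with_logits(-10.0, 0.0)          # loss fed the raw logit
buggy_loss = bce_with_logits(sigmoid(-10.0), 0.0)   # loss fed an already-squashed value

print("sigmoid applied once: ", [round(p, 4) for p in once])
print("sigmoid applied twice:", [round(p, 4) for p in twice])
print("loss with raw logit:  ", correct_loss)
print("loss with the bug:    ", buggy_loss)
```

With the bug, even a perfect prediction is scored close to log(2), which would look like a loss that hardly changes during training.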
Finally, I've personally never had much success training with dice as the primary loss function, so I would definitely try to get it working with cross entropy first, and then move on to dice.
@Muhammad Hamza Mughal
You need to add the code of at least your forward and train functions for us to pinpoint the issue. @Jatentaki is right: there are many things that could go wrong in ML/DL code. I recently moved to PyTorch from Keras myself, and it took some time to get used to it. Here are the things I'd do:
1) As you're dealing with images, try to pre-process them a bit (rotation, normalization, Gaussian noise, etc.).
2) Zero the gradients of your optimizer at the beginning of each batch you fetch, and step the optimizer after you have calculated the loss and called loss.backward().
3) Add a weight decay term to your optimizer call, typically L2. As you're dealing with convolutional networks, use a decay term of 5e-4 or 5e-5.
4) Add a learning rate scheduler to your optimizer, to change the learning rate if there is no improvement over time.
We really can't include code in our answers. It's up to the practitioner to find out how to implement all of this. Hope this helps.
@MuhammadHamzaMughal since you are using a sigmoid to generate predictions, have you made sure that the target attributes in the ground truth, training data, and validation data are all in the range [0,1]?
Normalize the data with min-max normalization so that it is in the [0,1] range.
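For reference, min-max normalization can be sketched as follows. This is a plain-Python illustration with a hypothetical helper name and made-up pixel values; in practice you would apply the same idea to your tensors. Mapping a constant input to zeros is an assumption chosen here to avoid dividing by zero.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: no spread to rescale (assumed convention: all zeros).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


mask = [0, 128, 255]                 # made-up 8-bit mask values
normalized = min_max_normalize(mask)
print(normalized)
```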
Recall that when exponentially decaying the learning rate in TensorFlow, one computes:
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
The docs describe the staircase option as:
If the argument staircase is True, then global_step /decay_steps is an
integer division and the decayed learning rate follows a staircase
function.
When is it better to decay every X steps and follow a staircase function, rather than use a smoother version that decays a little more with every step?
The existing answers didn't seem to describe this. There are two different behaviors being described as 'staircase' behavior.
From the feature request for staircase, the behavior is described as being a hand-tuned piecewise constant decay rate, so that a user could provide a set of iteration boundaries and a set of decay rates, to have the decay rate jump to the specified value after the iterations pass a given boundary.
If you look into the actual code for this feature pull request, you'll see that the PR isn't related much to the staircase option in the function arguments. Instead, it defines a wholly separate piecewise_constant operation, and the associated unit test shows how to define your own custom learning rate as a piecewise constant with learning_rate_decay.piecewise_constant.
From the documentation on decaying the learning rate, the behavior is described as treating global_step / decay_steps as integer division, so for the first set of decay_steps steps, the division results in 0, and the learning rate is constant. Once you cross the decay_steps-th iteration, you get the decay rate raised to a power of 1, then a power of 2, etc. So you only observe decay rates at the particular powers, rather than smoothly varying across all the powers if you treated the global step as a float.
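The difference between the two modes can be sketched in a few lines. The function below simply mirrors the formula quoted from the docs; it is an illustrative stand-in, not the TensorFlow implementation, and the example values (initial rate 0.1, decay_rate 0.5, decay_steps 100) are assumptions.

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False):
    """decayed = learning_rate * decay_rate ** (global_step / decay_steps).

    With staircase=True the division is an integer division, so the exponent
    only changes once every `decay_steps` steps.
    """
    if staircase:
        exponent = global_step // decay_steps
    else:
        exponent = global_step / decay_steps
    return learning_rate * decay_rate ** exponent


for step in (0, 50, 100, 150, 200):
    smooth = exponential_decay(0.1, step, 100, 0.5)
    stairs = exponential_decay(0.1, step, 100, 0.5, staircase=True)
    print(f"step={step:<4} smooth={smooth:.5f} staircase={stairs:.5f}")
```

The two schedules agree at exact multiples of decay_steps; in between, the staircase version holds the rate of the last boundary while the smooth version keeps shrinking.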
As to advantages, this is just a hyperparameter decision you should make based on your problem. Using the staircase option allows you to hold a decay rate constant, essentially like maintaining a higher temperature in simulated annealing for a longer time. This can allow you to explore more of the solution space by taking bigger strides in the gradient direction, at the cost of possibly noisy or unproductive updates. Meanwhile, smoothly increasing the decay rate power will steadily "cool" the exploration, which can limit you by leaving you stuck near a local optimum, but it can also prevent you from wasting time with noisily large gradient steps.
Whether one approach or the other is better (a) often doesn't matter very much and (b) usually needs to be specially tuned in the cases when it might matter.
Separately, as the feature request link mentions, the piecewise constant operation seems to be for very specifically tuned use cases, when you have separate evidence in favor of a hand-tuned decay rate based on collecting training metrics as a function of iteration. I would generally not recommend that for general use.
Good question.
As far as I know, it is a matter of preference of the research group.
In the old days, it was computationally more efficient to reduce the learning rate only once per epoch. That's why some people still prefer it nowadays.
Another, hand-wavy, story that people may tell is that it helps escape local optima: by "suddenly" changing the learning rate, the weights might jump to a better basin. (I don't agree with this, but I add it for completeness.)
When a high-degree polynomial is used to fit a set of points in a linear regression setup, we use regularization to prevent overfitting, and we include a lambda parameter in the cost function. This lambda is then used when updating the theta parameters in the gradient descent algorithm.
My question is: how do we calculate this lambda regularization parameter?
The regularization parameter (lambda) is an input to your model, so what you probably want to know is how to select the value of lambda. Regularization reduces overfitting, which reduces the variance of your estimated regression parameters; however, it does so at the expense of adding bias to your estimate. Increasing lambda results in less overfitting but also greater bias. So the real question is "How much bias are you willing to tolerate in your estimate?"
One approach you can take is to randomly subsample your data a number of times and look at the variation in your estimate. Then repeat the process for a slightly larger value of lambda to see how it affects the variability of your estimate. Keep in mind that whatever value of lambda you decide is appropriate for your subsampled data, you can likely use a smaller value to achieve comparable regularization on the full data set.
CLOSED FORM (TIKHONOV) VERSUS GRADIENT DESCENT
Hi! Nice explanations of the intuitive and the top-notch mathematical approaches there. I just wanted to add some specifics that, while not "problem-solving", may definitely help to speed up and give some consistency to the process of finding a good regularization hyperparameter.
I assume that you are talking about L2 (a.k.a. "weight decay") regularization, linearly weighted by the lambda term, and that you are optimizing the weights of your model either with the closed-form Tikhonov equation (highly recommended for low-dimensional linear regression models) or with some variant of gradient descent with backpropagation. I also assume that, in this context, you want to choose the value of lambda that provides the best generalization ability.
CLOSED FORM (TIKHONOV)
If you are able to go the Tikhonov way with your model (Andrew Ng says under 10k dimensions, but this suggestion is at least 5 years old), Wikipedia - determination of the Tikhonov factor offers an interesting closed-form solution, which has been proven to provide the optimal value. However, this solution probably raises implementation issues (time complexity or numerical stability) that I'm not aware of, because there is no mainstream algorithm to perform it. This 2016 paper looks very promising, though, and may be worth a try if you really have to optimize your linear model to its best.
For a quicker prototype implementation, this 2015 Python package seems to deal with it iteratively; you could let it optimize and then extract the final value for lambda:
In this new innovative method, we have derived an iterative approach to solving the general Tikhonov regularization problem, which converges to the noiseless solution, does not depend strongly on the choice of lambda, and yet still avoids the inversion problem.
And from the GitHub README of the project:
InverseProblem.invert(A, be, k, l) #this will invert your A matrix, where be is noisy be, k is the no. of iterations, and lambda is your dampening effect (best set to 1)
GRADIENT DESCENT
All links of this part are from Michael Nielsen's amazing online book "Neural Networks and Deep Learning", recommended reading!
For this approach there seems to be even less to say: the cost function is usually non-convex, the optimization is performed numerically, and the performance of the model is measured by some form of cross-validation (see Overfitting and Regularization and why does regularization help reduce overfitting if you haven't had enough of that). But even for cross-validation, Nielsen suggests something: you may want to take a look at this detailed explanation of how L2 regularization provides a weight-decaying effect, but the summary is that the effect is inversely proportional to the number of samples n, so when calculating the gradient descent equation with the L2 term,
just use backpropagation, as usual, and then add (λ/n)*w to the partial derivative of all the weight terms.
And his conclusion is that, when wanting a similar regularization effect with a different number of samples, lambda has to be changed proportionally:
we need to modify the regularization parameter. The reason is because the size n of the training set has changed from n=1000 to n=50000, and this changes the weight decay factor 1−learning_rate*(λ/n). If we continued to use λ=0.1 that would mean much less weight decay, and thus much less of a regularization effect. We compensate by changing to λ=5.0.
This is only useful when applying the same model to different amounts of the same data, but I think it opens the door to some intuition about how it should work and, more importantly, speeds up the hyperparameter search by allowing you to fine-tune lambda on smaller subsets and then scale up.
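The update rule and the rescaling argument quoted above can be sketched as follows. The helper names and the learning rate of 0.5 are assumptions for illustration; only the (λ/n)*w term, the decay factor 1 − learning_rate*(λ/n), and the numbers n=1000, n=50000, λ=0.1, λ=5.0 come from the quoted text.

```python
def l2_regularized_step(weights, grads, learning_rate, lam, n):
    """One gradient step with the L2 term (lam / n) * w added to each partial derivative."""
    return [w - learning_rate * (g + (lam / n) * w) for w, g in zip(weights, grads)]


def weight_decay_factor(learning_rate, lam, n):
    """Factor by which a weight shrinks per step when its data gradient is zero."""
    return 1.0 - learning_rate * lam / n


def rescale_lambda(lam, n_old, n_new):
    """Scale lambda with the training-set size to keep the decay factor unchanged."""
    return lam * n_new / n_old


learning_rate = 0.5                              # assumed value, for illustration only
lam_small = 0.1                                  # lambda used with n = 1000 samples
lam_large = rescale_lambda(lam_small, 1000, 50000)

print("rescaled lambda:", lam_large)
print("decay factor, n=1000: ", weight_decay_factor(learning_rate, lam_small, 1000))
print("decay factor, n=50000:", weight_decay_factor(learning_rate, lam_large, 50000))
print("one step, zero data gradient:",
      l2_regularized_step([1.0], [0.0], learning_rate, lam_small, 1000))
```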
For choosing the exact values, he suggests, in his conclusions on how to choose a neural network's hyperparameters, a purely empirical approach: start with 1 and then progressively multiply or divide by 10 until you find the proper order of magnitude, and then do a local search within that region. In the comments of this related SE question, the user Brian Borchers also suggests a very well-known method that may be useful for that local search:
Take small subsets of the training and validation sets (to be able to make many of them in a reasonable amount of time)
Starting with λ=0 and increasing by small amounts within some region, perform a quick training and validation of the model and plot both loss functions
You will observe three things:
The CV loss function will be consistently higher than the training one, since your model is optimized for the training data exclusively (EDIT: After some time I've seen a MNIST case where adding L2 helped the CV loss decrease faster than the training one until convergence. Probably due to the ridiculous consistency of the data and a suboptimal hyperparametrization though).
The training loss function will have its minimum for λ=0, and then increase with the regularization, since preventing the model from optimally fitting the training data is exactly what regularization does.
The CV loss function will start high at λ=0, then decrease, and then start increasing again at some point (EDIT: this assuming that the setup is able to overfit for λ=0, i.e. the model has enough power and no other regularization means are heavily applied).
The optimal value for λ will probably be somewhere around the minimum of the CV loss function; it may also depend a little on what the training loss function looks like. See the picture for a possible (but not the only) representation of this: instead of "model complexity", you should interpret the x axis as λ, being zero at the right and increasing towards the left.
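The sweep described above can be sketched with a small ridge regression in plain Python. Everything specific here is an assumption made for illustration: the synthetic data (a noisy sine curve), the degree-6 polynomial model, the grid of λ values, and the helper names. It uses the closed-form (Tikhonov) solution rather than gradient descent to keep the example short, and it penalizes the intercept as well, for simplicity.

```python
import math
import random


def features(x, degree):
    return [x ** d for d in range(degree + 1)]


def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


def ridge_fit(xs, ys, lam, degree):
    """Closed-form ridge solution: (X^T X + lam * I) w = X^T y."""
    rows = [features(x, degree) for x in xs]
    k = degree + 1
    a = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(a, b)


def mse(xs, ys, w):
    preds = [sum(wi * x ** i for i, wi in enumerate(w)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)


def make_data(n, rng):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [math.sin(3.0 * x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(20, rng)    # small training subset
val_x, val_y = make_data(200, rng)       # held-out validation subset

lambdas = [0.0, 1e-3, 1e-2, 1e-1, 1.0, 10.0]
train_losses, val_losses = [], []
for lam in lambdas:
    w = ridge_fit(train_x, train_y, lam, degree=6)
    train_losses.append(mse(train_x, train_y, w))
    val_losses.append(mse(val_x, val_y, w))
    print(f"lambda={lam:<6} train={train_losses[-1]:.4f} val={val_losses[-1]:.4f}")

best_lambda = lambdas[val_losses.index(min(val_losses))]
print("lambda with the lowest validation loss:", best_lambda)
```

The training loss can only stay flat or grow as λ increases, as stated in the second observation; where the validation loss bottoms out depends on the data and the model, so the grid would normally be refined around best_lambda.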
Hope this helps! Cheers,
Andres
The cross validation described above is a method used often in Machine Learning. However, choosing a reliable and safe regularization parameter is still a very hot topic of research in mathematics.
If you need some ideas (and have access to a decent university library) you can have a look at this paper:
http://www.sciencedirect.com/science/article/pii/S0378475411000607