Learning rate & gradient descent difference? - machine-learning

What is the difference between the two? Both are used to reach the minimum point (lower loss) of a function, for example.
I understand (I think) that the learning rate is multiplied by the gradient (slope) to perform the gradient descent step, but is that so? Am I missing something?
What is the difference between the learning rate and the gradient?
Thanks

Deep learning neural networks are trained using the stochastic gradient descent algorithm.
Stochastic gradient descent is an optimization algorithm that estimates the error gradient for the current state of the model using examples from the training dataset, then updates the weights of the model using the back-propagation of errors algorithm, referred to as simply backpropagation.
The amount that the weights are updated during training is referred to as the step size or the “learning rate.”
Specifically, the learning rate is a configurable hyperparameter used in the training of neural networks that has a small positive value, often in the range between 0.0 and 1.0.
The learning rate controls how quickly the model is adapted to the problem. Smaller learning rates require more training epochs given the smaller changes made to the weights each update, whereas larger learning rates result in rapid changes and require fewer training epochs.
A learning rate that is too large can cause the model to converge too quickly to a suboptimal solution, whereas a learning rate that is too small can cause the process to get stuck.
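To make the relationship concrete: the gradient tells you the direction (and steepness) in which the loss increases, and the learning rate only scales how far you step against that direction on each update, i.e. new_weight = weight - learning_rate * gradient. Below is a minimal sketch of gradient descent on a made-up one-dimensional loss, purely for illustration:

```python
# Minimal gradient descent sketch on an illustrative loss L(w) = (w - 3)^2,
# whose gradient is dL/dw = 2 * (w - 3).

def gradient(w):
    return 2.0 * (w - 3.0)

learning_rate = 0.1   # hyperparameter: how far to step on each update
w = 0.0               # initial weight

for step in range(50):
    grad = gradient(w)            # slope of the loss at the current weight
    w = w - learning_rate * grad  # the gradient descent update

print(w)  # approaches 3.0, the minimum of the loss
```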
The challenge of training deep learning neural networks involves carefully selecting the learning rate. It may be the most important hyperparameter for the model.
The learning rate is perhaps the most important hyperparameter. If you have time to tune only one hyperparameter, tune the learning rate.
— Page 429, Deep Learning, 2016.
For more on what the learning rate is and how it works, see the post:
How to Configure the Learning Rate Hyperparameter When Training Deep Learning Neural Networks
You can also refer to: Understand the Impact of Learning Rate on Neural Network Performance

Related

Why do different batch-sizes give different accuracy in Keras?

I was using a CNN in Keras to classify the MNIST dataset. I found that using different batch sizes gave different accuracies. Why is that?
Using Batch-size 1000 (Acc = 0.97600)
Using Batch-size 10 (Acc = 0.97599)
Although the difference is very small, why is there even a difference?
EDIT - I have found that the difference is only because of precision issues and they are in fact equal.
That is because of the effect of mini-batch gradient descent during the training process. You can find a good explanation Here; I quote some notes from that link below:
Batch size is a slider on the learning process.
Small values give a learning process that converges quickly at the cost of noise in the training process.
Large values give a learning process that converges slowly with accurate estimates of the error gradient.
Another important note from that link is:
The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller.
This is the result of that paper.
EDIT
I should mention two more points here:
Because of the inherent randomness in machine learning algorithms, you should generally not expect machine learning algorithms (like deep learning algorithms) to produce the same results on different runs. You can find more details Here.
On the other hand, your two results are very close and effectively equal, so in your case we can say that, based on the reported results, the batch size has no effect on your network's accuracy.
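As a side note on the randomness point above, here is a hedged sketch of how one might compare the two batch sizes with the seeds fixed, so that run-to-run randomness does not confound the comparison (the small CNN is an arbitrary choice, purely for illustration):

```python
# Hedged sketch: compare batch sizes on MNIST with the random seed reset
# before each run, so both models start from the same initial weights.
import numpy as np
import tensorflow as tf

np.random.seed(0)

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

for batch_size in (10, 1000):
    tf.random.set_seed(0)  # reset so both runs initialize identically
    model = build_model()
    model.fit(x_train, y_train, epochs=1, batch_size=batch_size, verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    print(batch_size, acc)
```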
This is not connected to Keras. The batch size, together with the learning rate, are critical hyper-parameters for training neural networks with mini-batch stochastic gradient descent (SGD), which entirely affect the learning dynamics and thus the accuracy, the learning speed, etc.
In a nutshell, SGD optimizes the weights of a neural network by iteratively updating them towards the (negative) direction of the gradient of the loss. In mini-batch SGD, the gradient is estimated at each iteration on a subset of the training data. It is a noisy estimation, which helps regularize the model and therefore the size of the batch matters a lot. Besides, the learning rate determines how much the weights are updated at each iteration. Finally, although this may not be obvious, the learning rate and the batch size are related to each other. [paper]
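To illustrate the point about noisy gradient estimation, here is a small, purely illustrative numpy sketch of mini-batch SGD on linear regression; the data, step size, and batch size are made up:

```python
# Mini-batch SGD for linear regression: each step estimates the gradient
# from a random subset of the data, so the estimate is noisy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
learning_rate = 0.05
batch_size = 32

for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size    # noisy gradient estimate
    w -= learning_rate * grad                       # SGD update

print(w)  # close to true_w; a larger batch_size gives a less noisy path
```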
I want to add two points:
1) With special treatments, it is possible to achieve similar performance for a very large batch size while speeding up the training process tremendously (a sketch of the key idea follows point 2 below). For example,
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
2) Regarding your MNIST example, I really don't suggest over-reading these numbers, because the difference is so subtle that it could be caused by noise. I bet that if you try models saved at a different epoch, you will see a different result.
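Regarding point 1, the main idea in that paper (as I understand it) is a linear scaling rule for the learning rate plus a warmup phase. Here is a hedged sketch with illustrative numbers, not the paper's exact recipe:

```python
# Hedged sketch of the "linear scaling rule": when the mini-batch size is
# multiplied by k, multiply the learning rate by k, and ramp it up gradually
# over a few warmup epochs. Numbers are illustrative.
base_lr = 0.1            # learning rate tuned for the base batch size
base_batch_size = 256
large_batch_size = 8192

k = large_batch_size / base_batch_size
target_lr = base_lr * k  # linear scaling rule

warmup_epochs = 5

def learning_rate_at(epoch):
    if epoch < warmup_epochs:  # gradual warmup from base_lr to target_lr
        return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
    return target_lr

for epoch in range(8):
    print(epoch, learning_rate_at(epoch))
```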

Is tuning batch size or epochs necessary for linear regression with TensorFlow?

I am working on an article where I focus on a simple problem: linear regression over a large data set in the presence of standard normal or uniform noise. I chose the Estimator API from TensorFlow as the modeling framework.
I am finding that hyperparameter tuning is, in fact, of little importance for such a machine learning problem when the number of training steps can be made sufficiently large. By hyperparameter I mean the batch size or the number of epochs in the training data stream.
Is there any paper/article with formal proof of this?
I don't think there is a paper specifically focused on this question, because it's a more or less fundamental fact. The introductory chapter of this book discusses the probabilistic interpretation of machine learning in general and loss function optimization in particular.
In short, the idea is this: mini-batch optimization with respect to (x1, ..., xn) is equivalent to consecutive optimization steps with respect to the inputs x1, ..., xn, because the gradient is a linear operator. This means that the mini-batch update equals the sum of its individual updates. An important note here: I assume that the NN doesn't apply batch norm or any other layer that adds an explicit variation to the inference model (in that case the math is a bit hairier).
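A quick numerical check of that linearity claim, sketched on a least-squares loss with made-up data: the gradient computed over the whole mini-batch equals the sum of the per-example gradients.

```python
# The gradient of a sum of per-example losses is the sum of the per-example
# gradients (linearity), illustrated for least-squares linear regression.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # a mini-batch of 8 examples
y = rng.normal(size=8)
w = rng.normal(size=3)

# Gradient of the summed squared error over the whole mini-batch.
batch_grad = 2 * X.T @ (X @ w - y)

# Sum of the gradients computed one example at a time.
per_example = sum(2 * x * (x @ w - yi) for x, yi in zip(X, y))

print(np.allclose(batch_grad, per_example))  # True
```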
So the batch size can be seen as a purely computational idea that speeds up the optimization through vectorization and parallel computing. Assuming that one can afford arbitrarily long training and that the data are properly shuffled, the batch size can be set to any value. But this isn't automatically true for all hyperparameters; for example, a very high learning rate can easily force the optimization to diverge, so don't make the mistake of thinking hyperparameter tuning isn't important in general.

Machine Learning: Is it better to retrain a model if the loss stagnates at a high value?

That is to say, if during training you set your learning rate too high and unfortunately ended up in a local minimum where the loss is still high, is it better to retrain with a lower learning rate, or should you continue from the poor-performing model with a higher learning rate, in the hope that the loss will escape the local minimum?
In the strict sense, you don't have to retrain, as you can continue training with a lower learning rate (this is called a learning rate schedule). A very common approach is to lower the learning rate (usually by dividing it by 10) each time the loss stagnates or becomes constant.
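For example, in Keras that "divide by 10 when the loss stagnates" policy can be expressed with the ReduceLROnPlateau callback. The names model, x_train, etc. below are assumed to already exist, and the patience value is an illustrative choice:

```python
# Hedged sketch: continue training an existing model while automatically
# dividing the learning rate by 10 whenever the validation loss stagnates.
import tensorflow as tf

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",  # watch the validation loss
    factor=0.1,          # multiply the learning rate by 0.1 (divide by 10)
    patience=5,          # after 5 epochs without improvement
    min_lr=1e-6,
)

# model is assumed to be an already-compiled (and possibly partly trained)
# tf.keras model; x_train, y_train, x_val, y_val are assumed to exist.
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=50,
          callbacks=[reduce_lr])
```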
Another approach is to use an optimizer that scales the learning rate with the gradient magnitude, so the effective step size naturally decays as you get closer to the minima. Examples of this are Adam, Adagrad, and RMSProp.
In any case, make sure to find the optimal learning rate on a validation set; this will considerably improve performance and make learning faster. This applies both to plain SGD and to any other optimizer.

What's the relationship between an SVM and hinge loss?

My colleague and I are trying to wrap our heads around the difference between logistic regression and an SVM. Clearly they are optimizing different objective functions. Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss? Or is it more complex than that? How do the support vectors come into play? What about the slack variables? Why can't you have deep SVMs the way you can have a deep neural network with sigmoid activation functions?
I will answer one thing at a time.
Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss?
SVM is simply a linear classifier, optimizing hinge loss with L2 regularization.
Or is it more complex than that?
No, it is "just" that; however, there are different ways of looking at this model that lead to complex, interesting conclusions. In particular, this specific choice of loss function leads to extremely efficient kernelization, which is not true for log loss (logistic regression) or MSE (linear regression). Furthermore, you can show very important theoretical properties, such as those related to the Vapnik-Chervonenkis dimension, which lead to a smaller chance of overfitting.
Intuitively, look at these three common losses (written in margin form, with label y in {-1, +1} and raw prediction p):
hinge: max(0, 1 - py)
log: log(1 + exp(-py))
mse: (p - y)^2
Only the first one has the property that once something is classified correctly (with enough margin), it incurs 0 penalty. All the remaining ones still penalize your linear model even if it classifies samples correctly. Why? Because they are more related to regression than classification: they want a perfect prediction, not just a correct one.
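A tiny numerical illustration of that property, using the margin form of the losses above with an arbitrary, made-up prediction:

```python
# For a correctly classified point with a comfortable margin (p*y >= 1),
# the hinge loss is exactly 0, while log loss and MSE still penalize the model.
import numpy as np

y = 1.0          # true label in {-1, +1}
p = 2.5          # raw model output (correct side, comfortable margin)

hinge = max(0.0, 1.0 - p * y)
log_loss = np.log1p(np.exp(-p * y))
mse = (p - y) ** 2

print(hinge)     # 0.0    -> no penalty once the margin is satisfied
print(log_loss)  # ~0.08  -> still a small penalty
print(mse)       # 2.25   -> penalizes not predicting exactly y
```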
How do the support vectors come into play?
Support vectors are simply samples placed near the decision boundary (loosely speaking). In the linear case this does not change much, but since most of the power of the SVM lies in its kernelization, support vectors are extremely important there. Once you introduce a kernel, the hinge loss allows the SVM solution to be obtained efficiently, and the support vectors are the only samples remembered from the training set, so a non-linear decision boundary is built from a subset of the training data.
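As a hedged illustration on synthetic data, scikit-learn's SVC exposes exactly this subset, so you can see how few training points the fitted model actually keeps:

```python
# A kernelized SVM keeps only the support vectors from the training set.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X, y)

print(len(X))                      # 500 training samples
print(clf.support_vectors_.shape)  # only a subset of them define the boundary
```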
What about the slack variables?
This is just another way of writing the hinge loss, which is more useful when you want to kernelize the solution and show convexity.
Why can't you have deep SVMs the way you can have a deep neural network with sigmoid activation functions?
You can; however, as the SVM is not a probabilistic model, its training might be a bit tricky. Furthermore, the whole strength of the SVM comes from its efficiency and global solution, both of which would be lost once you create a deep network. However, there are such models; in particular, an SVM (with squared hinge loss) is nowadays often the choice for the topmost layer of deep networks, so the whole optimization is actually a deep SVM. Adding more layers in between has nothing to do with the SVM or any other cost; those layers are defined completely by their activations, and you could, for example, use an RBF activation function, but it has been shown numerous times that this leads to weak models (the detected features are too local).
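As a hedged sketch of that "SVM on top" idea in Keras (the architecture and data shapes are made up): a linear output layer with L2 regularization, trained with the squared hinge loss and labels encoded as -1/+1.

```python
# Illustrative "deep SVM": a deep network whose final layer is linear and
# L2-regularized, trained with the squared hinge loss (an SVM-style top layer).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1,  # linear output, as in a (binary) SVM
                          kernel_regularizer=tf.keras.regularizers.l2(1e-3)),
])

model.compile(optimizer="adam", loss="squared_hinge")

# x_train is assumed to be (n, 20) features; labels are expected in {-1, +1}:
# model.fit(x_train, y_train_pm1, epochs=10, batch_size=32)
```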
To sum up:
there are deep SVMs; this is simply a typical deep neural network with an SVM layer on top.
there is no such thing as putting an SVM layer "in the middle", as the training criterion is only applied to the output of the network.
using "typical" SVM kernels as activation functions is not popular in deep networks because of their locality (as opposed to the very global ReLU or sigmoid).

Ada-Delta method doesn't converge when used in Denoising AutoEncoder with MSE loss & ReLU activation?

I just implemented AdaDelta (http://arxiv.org/abs/1212.5701) for my own Deep Neural Network Library.
The paper more or less says that SGD with AdaDelta is not sensitive to hyperparameters and that it always converges to somewhere good (at least, the output reconstruction loss of AdaDelta-SGD is comparable to that of a well-tuned Momentum method).
When I used AdaDelta-SGD as the learning method in a denoising autoencoder, it did converge in some specific settings, but not always.
When I used MSE as the loss function and sigmoid as the activation function, it converged very quickly, and after 100 epochs the final reconstruction loss was better than with plain SGD, SGD with Momentum, and AdaGrad.
But when I used ReLU as the activation function, it didn't converge; it stayed stuck (oscillating) at a high (bad) reconstruction loss, just as when you use plain SGD with a very high learning rate.
The reconstruction loss at which it got stuck was about 10 to 20 times higher than the final reconstruction loss obtained with the Momentum method.
I really don't understand why this happened, since the paper says AdaDelta just works well.
Please let me know the reason behind this phenomenon and how I could avoid it.
The activation of a ReLU is unbounded, which makes its use in autoencoders difficult, since your training vectors likely do not have arbitrarily large, unbounded responses! ReLU simply isn't a good fit for that type of network.
You can force a ReLU into an autoencoder by applying some transformation to the output layer, as is done here. However, they don't discuss the quality of the results as an autoencoder, only as a pre-training method for classification. So it's not clear that it's a worthwhile endeavor for building an autoencoder either.
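As one hedged illustration of "a transformation on the output layer" (not necessarily the one used in the linked work): keep ReLU in the hidden layers but use a bounded sigmoid output so the reconstruction stays in the same [0, 1] range as the (assumed normalized) inputs.

```python
# Illustrative denoising autoencoder: ReLU hidden layers, but a sigmoid output
# layer so the reconstruction is bounded like the normalized input data.
import tensorflow as tf

input_dim = 784  # assumed flattened, [0, 1]-normalized inputs (e.g. MNIST)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(input_dim,)),
    tf.keras.layers.Dense(64, activation="relu"),            # bottleneck
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(input_dim, activation="sigmoid"),  # bounded output
])
model.compile(optimizer="adam", loss="mse")

# Training would pair noise-corrupted inputs with clean targets, e.g.:
# x_noisy = x_clean + 0.3 * tf.random.normal(tf.shape(x_clean))
# model.fit(x_noisy, x_clean, epochs=20, batch_size=128)
```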
