How can I implement batch gradient descent for classification using sklearn?
We have SGDClassifier for stochastic GD, which takes a single instance at a time, and Linear/Logistic Regression, which use the normal equation.
A possible answer, also pointed out in another similar question and in the sklearn docs:
SGD allows minibatch (online/out-of-core) learning, see the
partial_fit method.
But is partial_fit really batch gradient descent?
SGD: the gradient of the cost function is calculated and the weights are updated with a gradient descent step for each sample.
Batch/mini-batch GD: the gradient of the cost function is calculated and the weights are updated with a gradient descent step once per batch.
So Batch GD with batch size of 1 == SGD.
Now that we are clear about the definitions, let's investigate the code of sklearn's SGDClassifier.
The docstring of partial_fit says
Perform one epoch of stochastic gradient descent on given samples.
But this is not batch GD; it looks more like a helper function to run the fit method with max_iter=1 (in fact it is described as such in the docstrings).
partial_fit calls _partial_fit with max_iter==1. Reference link
The fit method calls _fit, which calls _partial_fit with max_iter set to the assigned/default maximum number of iterations. Reference link
Conclusion:
partial_fit does not really do batch GD, i.e. it is not calculating the gradients and updating the weights once per batch, but rather doing so for each sample.
There seems to be no mechanism in sklearn to do batch gradient descent.
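For completeness, here is a minimal NumPy sketch of what (full) batch gradient descent for logistic regression could look like outside sklearn; the function name, data handling and hyperparameters are illustrative, not anything sklearn provides:

import numpy as np

def batch_gd_logreg(X, y, lr=0.1, n_epochs=1000):
    # y is expected to be 0/1; returns weights w and bias b
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad_w = X.T @ (p - y) / n              # gradient over the whole batch
        grad_b = np.mean(p - y)
        w -= lr * grad_w                        # exactly one update per pass over the data
        b -= lr * grad_b
    return w, b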
Related
I understand that both the LinearRegression and SGDRegressor classes from scikit-learn perform linear regression. However, only SGDRegressor uses gradient descent as the optimization algorithm.
Then what is the optimization algorithm used by LinearRegression, and what are the other significant differences between these two classes?
LinearRegression always uses least squares as its loss function.
For SGDRegressor you can specify a loss function, and it uses stochastic gradient descent (SGD) to fit. With SGD you run through the training set one data point at a time and update the parameters according to the error gradient.
In simple words: you can train SGDRegressor on a training dataset that does not fit into RAM. Also, you can update the SGDRegressor model with a new batch of data without retraining on the whole dataset.
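As a rough illustration of that last point (the chunking here is made up; in practice the chunks might come from disk or a generator):

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=10_000)

reg = SGDRegressor(learning_rate="constant", eta0=0.01)
for chunk in np.array_split(np.arange(len(X)), 10):   # feed the data in 10 chunks
    reg.partial_fit(X[chunk], y[chunk])               # updates are still made per sample
print(reg.coef_)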
To understand the algorithm used by LinearRegression, we must have in mind that there is (in favorable cases) an analytical solution (with a formula) to find the coefficients which minimize the least squares:
theta = (X'X)^(-1)X'Y (1)
where X' is the transpose matrix of X.
In the case of non-invertibility, the inverse can be replaced by the Moore-Penrose pseudo-inverse calculated using "singular value decomposition" (SVD). And even in the case of invertibility, the SVD method is faster and more stable than applying the formula (1).
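To illustrate, a quick sketch with made-up data (not taken from the sklearn source):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0

Xb = np.hstack([np.ones((len(X), 1)), X])             # add an intercept column

theta_normal = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y    # formula (1)
theta_pinv = np.linalg.pinv(Xb) @ y                   # Moore-Penrose pseudo-inverse (SVD)
theta_lstsq, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # SVD-based least squares

lr = LinearRegression().fit(X, y)
print(theta_normal, theta_pinv, theta_lstsq, (lr.intercept_, lr.coef_))

All four give essentially the same coefficients on well-conditioned data.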
How do we set parameters for sklearn.linear_model.SGDRegressor to make it perform Batch gradient descent?
I want to solve a linear-regression problem using Batch gradient descent. I need to make SGD act like batch gradient descent, and this should be done (I think) by making it modify the model at the end of an epoch. Can it be somehow parameterized to behave like that?
I need to make SGD act like batch gradient descent, and this should be done (I think) by making it modify the model at the end of an epoch.
You cannot do that; it is clear from the documentation that:
the gradient of the loss is estimated each sample at a time and the model is updated along the way
And although in the SGDClassifier docs it is mentioned that
SGD allows minibatch (online/out-of-core) learning
which presumably holds also for SGDRegressor, what is actually meant is that you can use the partial_fit method for providing the data in different batches; the computations (and updates), however, are always performed per sample.
If you really need to perform linear regression with batch GD, you could do it easily in Keras or TensorFlow by assembling a linear-regression model and using a batch size equal to the number of training samples.
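For example, a minimal sketch of that idea (assuming a recent TensorFlow/Keras; the data and hyperparameters are made up):

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)).astype("float32")
y = (X @ np.array([1.5, -2.0, 0.5], dtype="float32") + 0.3).reshape(-1, 1)

model = tf.keras.Sequential([tf.keras.Input(shape=(3,)), tf.keras.layers.Dense(1)])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss="mse")

# batch_size = len(X) -> exactly one gradient step per epoch over the whole dataset,
# i.e. (full) batch gradient descent
model.fit(X, y, epochs=500, batch_size=len(X), verbose=0)
print(model.get_weights())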
I am new to AI. I just learnt about GD and about batches for gradient descent. I am confused about the exact difference between them. Any explanation would be appreciated.
Thanks in advance
All of those methods are first-order optimization methods that only require knowledge of gradients and are used to minimize finite-sum functions. This means that we minimize a function F written as the sum of N functions f_{i}, and we can compute the gradient of each of those functions at any given point.
The GD method consists in using the gradient of F, which is equal to the sum of the gradients of all f_{i}, to do one update, i.e.
x <- x - alpha * grad(F)
Stochastic GD consists in selecting one function f_{i} at random and doing an update using its gradient, i.e.
x <- x - alpha * grad(f_{i})
So each update is faster, but we need more updates to find the optimum.
Mini-batch GD is in between those two strategies and selects m functions f_{i} at random to do one update.
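In code, the three update rules could be sketched like this (schematic; grad_fs is assumed to be a list of functions returning grad(f_i) at a point):

import numpy as np

def gd_step(x, grad_fs, alpha):
    # batch GD: one update uses the gradient of F = sum of the gradients of all f_i
    return x - alpha * sum(g(x) for g in grad_fs)

def sgd_step(x, grad_fs, alpha, rng):
    # SGD: one update uses a single randomly chosen f_i
    g = grad_fs[rng.integers(len(grad_fs))]
    return x - alpha * g(x)

def minibatch_step(x, grad_fs, alpha, m, rng):
    # mini-batch GD: one update uses m randomly chosen f_i
    batch = rng.choice(len(grad_fs), size=m, replace=False)
    return x - alpha * sum(grad_fs[i](x) for i in batch)

# example: minimize F(x) = sum_i (x - a_i)^2 for a scalar x
rng = np.random.default_rng(0)
a = rng.normal(size=50)
grad_fs = [lambda x, ai=ai: 2.0 * (x - ai) for ai in a]
x = 0.0
for _ in range(200):
    x = minibatch_step(x, grad_fs, alpha=0.002, m=10, rng=rng)
print(x, a.mean())   # x ends up near the mean of the a_i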
For more information look at this link
Check this.
In both gradient descent (GD) and stochastic gradient descent (SGD), you iteratively update a set of parameters to minimize an error function.
In GD, you have to run through all the samples in your training set to do a single update of a parameter in a particular iteration. In SGD, on the other hand, you use only one training sample (or a small subset of samples) from your training set to do the update of a parameter in a particular iteration. If you use a subset, it is called mini-batch stochastic gradient descent.
Thus, if the number of training samples is large, in fact very large, then using gradient descent may take too long because in every iteration when you are updating the values of the parameters, you are running through the complete training set. On the other hand, using SGD will be faster because you use only one training sample and it starts improving itself right away from the first sample.
SGD often converges much faster than GD, but the error function is not as well minimized as in the case of GD. In most cases, the close approximation that you get with SGD for the parameter values is enough, because they reach near-optimal values and keep oscillating there.
Hope this will help you.
The Keras implementation of dropout references this paper.
The following excerpt is from that paper:
The idea is to use a single neural net at test time without dropout.
The weights of this network are scaled-down versions of the trained
weights. If a unit is retained with probability p during training, the
outgoing weights of that unit are multiplied by p at test time as
shown in Figure 2.
The Keras documentation mentions that dropout is only used at train time, and the following line from the Dropout implementation
x = K.in_train_phase(K.dropout(x, level=self.p), x)
seems to indicate that indeed outputs from layers are simply passed along during test time.
Further, I cannot find code which scales down the weights after training is complete as the paper suggests. My understanding is this scaling step is fundamentally necessary to make dropout work, since it is equivalent to taking the expected output of intermediate layers in an ensemble of "subnetworks." Without it, the computation can no longer be considered sampling from this ensemble of "subnetworks."
My question, then, is where is this scaling effect of dropout implemented in Keras, if at all?
Update 1: Ok, so Keras uses inverted dropout, though it is called dropout in the Keras documentation and code. The link http://cs231n.github.io/neural-networks-2/#reg doesn't seem to indicate that the two are equivalent. Nor does the answer at https://stats.stackexchange.com/questions/205932/dropout-scaling-the-activation-versus-inverting-the-dropout. I can see that they do similar things, but I have yet to see anyone say they are exactly the same. I think they are not.
So a new question: Are dropout and inverted dropout equivalent? To be clear, I'm looking for mathematical justification for saying they are or aren't.
Yes, it is implemented properly. Since dropout was invented, folks have also improved it from an implementation point of view. Keras uses one of these techniques. It's called inverted dropout, and you may read about it here.
UPDATE:
To be honest, in the strict mathematical sense these two approaches are not equivalent. In the inverted case you multiply every hidden activation by the reciprocal of the keep probability. But because the derivative is linear, this is equivalent to multiplying all gradients by the same factor, so to compensate you would have to use a different learning rate. From this point of view the approaches differ. But from a practical point of view, they are equivalent because:
If you use a method which automatically adapts the learning rate (like RMSProp or Adagrad), it will make almost no change to the algorithm.
If you use a method where you set your learning rate manually, you must take into account the stochastic nature of dropout: because some neurons are turned off during the training phase (which does not happen during the test/evaluation phase), you must rescale in order to overcome this difference. Probability theory gives us the best rescaling factor, and it is the reciprocal of the keep probability, which makes the expected value of the activations (and hence of the loss gradients) the same in both the train and test/eval phases.
Of course, both points above are about the inverted dropout technique.
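A quick numerical illustration of the relationship (a made-up example, with p the keep probability):

import numpy as np

rng = np.random.default_rng(0)
p = 0.8                                   # keep probability
x = rng.normal(size=100_000)

mask = rng.random(x.shape) < p
standard_train = mask * x                 # standard dropout: test time then uses p * x
inverted_train = (mask / p) * x           # inverted dropout: test time uses x unchanged

print(standard_train.mean(), (p * x).mean())   # close: E[mask * x] = p * x
print(inverted_train.mean(), x.mean())         # close: E[(mask / p) * x] = x

In both schemes the expected train-time activation equals the corresponding test-time activation; the two differ only by a global factor of 1/p on the activations (and hence on the gradients), which is what the learning-rate remark above refers to.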
Excerpted from the original Dropout paper (Section 10):
In this paper, we described dropout as a method where we retain units with probability p at training time and scale down the weights by multiplying them by a factor of p at test time. Another way to achieve the same effect is to scale up the retained activations by multiplying by 1/p at training time and not modifying the weights at test time. These methods are equivalent with appropriate scaling of the learning rate and weight initializations at each layer.
Note, though, that while Keras's Dropout layer is implemented using inverted dropout, the rate parameter is the opposite of keep_rate.
keras.layers.Dropout(rate, noise_shape=None, seed=None)
Dropout consists in randomly setting a fraction rate of input units to
0 at each update during training time, which helps prevent
overfitting.
That is, rate sets the fraction of units to drop, not the fraction to keep, which is what you might expect with inverted dropout.
Keras Dropout
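A quick check of that (assuming a TF 2.x-style Keras; the numbers are just a sanity check):

import numpy as np
import tensorflow as tf

layer = tf.keras.layers.Dropout(rate=0.5)
x = np.ones((1, 1000), dtype="float32")
out = layer(x, training=True).numpy()

print((out == 0).mean())   # roughly 0.5: rate is the fraction dropped
print(out.max())           # 2.0: kept units are scaled by 1/(1 - rate), i.e. inverted dropout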
I see that in scikit-learn I can build an SVM classifier with a linear kernel in at least 3 different ways:
LinearSVC
SVC with kernel='linear' parameter
Stochastic Gradient Descent with loss='hinge' parameter
Now, I see that the difference between the first two classifiers is that the former is implemented in terms of liblinear and the latter in terms of libsvm.
How do the first two classifiers differ from the third one?
The first two always use the full data and solve a convex optimization problem with respect to these data points.
The latter can treat the data in batches and performs a gradient descent aiming to minimize expected loss with respect to the sample distribution, assuming that the examples are iid samples of that distribution.
The latter is typically used when the number of samples is very big or never-ending. Observe that you can call the partial_fit function and feed it chunks of data.
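A small side-by-side sketch (synthetic data, illustrative settings):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf_liblinear = LinearSVC().fit(X, y)                  # liblinear, uses the full data
clf_libsvm = SVC(kernel='linear').fit(X, y)            # libsvm, uses the full data
clf_sgd = SGDClassifier(loss='hinge').fit(X, y)        # SGD on the hinge loss

# the SGD version can also be fed the data in chunks via partial_fit
clf_stream = SGDClassifier(loss='hinge')
for idx in np.array_split(np.arange(len(X)), 5):
    clf_stream.partial_fit(X[idx], y[idx], classes=np.unique(y))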
Hope this helps?