Why doesn't stochastic gradient descent fluctuate?

In batch gradient descent the parameters are updated based on the total/average loss over all the points.
In stochastic gradient descent, or SGD, we update the parameters after every point instead of once per epoch.
So, say the final point is an outlier: wouldn't that cause the whole fitted line to fluctuate drastically?
How is it reliable,
and how does it converge on a contour like this? [Figure: SGD contour]

While it is true that in its most pristine form SGD operates on just one sample point, in reality this is not the dominant practice. In practice, we use a mini-batch of, say, 256, 128 or 64 samples rather than operating on the full batch containing all the samples in the database, which might be well over 1 million. Operating on a mini-batch of 256 is clearly much faster than operating on 1 million points, and at the same time it helps curb the variability caused by using just one sample point.
A second point is that there is no final point. One simply keeps iterating over the dataset. The learning rate for SGD is generally quite small, say 1e-3. So even if a sample point happens to be an outlier, the wrong gradients will be scaled by 1e-3, and hence SGD will not stray far from the correct trajectory. When it iterates over the upcoming sample points, which are not outliers, it will again head in the correct direction.
So altogether, using a medium-sized mini-batch and a small learning rate keeps SGD from straying far from the correct trajectory.
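To make the first two points concrete, here is a minimal sketch (not from the original answer) of a mini-batch SGD loop on a toy linear model in plain NumPy; the data, the outlier, the batch size of 64 and the learning rate of 1e-3 are all illustrative assumptions.

```python
import numpy as np

# Toy data: y = 3*x + noise, with one deliberate outlier at the end.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 1))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=1000)
y[-1] = 50.0  # the outlier the question worries about

w, b = 0.0, 0.0
lr, batch_size = 1e-3, 64

for epoch in range(500):
    idx = rng.permutation(len(X))               # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        err = (w * xb + b) - yb
        # Gradients of the mean squared error over the mini-batch.
        grad_w = 2 * np.mean(err * xb)
        grad_b = 2 * np.mean(err)
        # The small learning rate keeps even an outlier-containing batch
        # from moving the parameters very far in a single step.
        w -= lr * grad_w
        b -= lr * grad_b

print(w, b)  # w should end up close to 3 despite the outlier
```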
Now, the word stochastic in SGD can also cover various other measures. For example, some practitioners also use gradient clipping, i.e. they clamp the calculated gradient to a maximum value whenever it exceeds a chosen threshold. You can find more on gradient clipping in this post. This is just one trick amongst dozens of other techniques, and if you are interested you can read the source code of popular SGD implementations in PyTorch or TensorFlow.
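As a rough illustration of where clipping fits into a training step, here is a sketch using PyTorch's torch.nn.utils.clip_grad_norm_; the tiny linear model, random mini-batch, and max_norm of 1.0 are placeholder choices, not anything prescribed in the post.

```python
import torch
import torch.nn as nn

# Hypothetical tiny model and data, just to show where clipping sits in the loop.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 10)   # one mini-batch of 64 samples
y = torch.randn(64, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Clip the global gradient norm to at most 1.0 before the update,
# so a single extreme batch cannot produce an extreme step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
```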

Related

What causes fluctuations in cross-entropy loss?

I am using the softmax algorithm on the CIFAR-10 dataset and have some questions regarding my cross-entropy loss graph. I managed to get an accuracy of 40% with the algorithm, so the accuracy is improving. The confusing part is interpreting the results from the cross-entropy graph, as it is not similar to any of the other graphs I've seen online for similar problems. I was wondering if anyone could give some insight into how to interpret the following graphs. The y-axis is loss and the x-axis is batch number. The two graphs are for batch sizes 1 and 100.
Batch size 1: [plot]
Batch size 100: [plot]
What causes these fluctuations:
A (mini-)batch is just a small part of CIFAR-10. Sometimes you pick easy examples, sometimes you pick hard ones. Or perhaps what seems easy is only difficult after the model has adjusted to the previous batch. After all, it is called Stochastic Gradient Descent. See e.g. the discussion here.
Interpreting those plots:
Batch size 100: it's clearly improving :-) I would recommend taking the mean of the cross-entropy across the batch rather than the sum (see the sketch after this list).
Batch size 1: there seems to be some improvement for the first ~40k steps. Then it's probably just oscillation. You need to schedule the learning rate.
Other related points:
Softmax is not an algorithm, but a function which turns a vector of arbitrary values into one that is non-negative and sums up to 1, thus is interpretable as probabilities.
Those plots are very clumsy. Try a scatter plot with small dotsize.
Plot accuracy together with the cross-entropy (on a different scale, with a coarser time resolution) to get a feeling for their relation.
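To illustrate the mean-versus-sum recommendation and the remark about softmax, here is a small sketch in PyTorch; the random logits, ten classes, and batch size of 100 are made up for the example.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(100, 10)            # a batch of 100 ten-class outputs
targets = torch.randint(0, 10, (100,))

# Softmax is just a function: it maps arbitrary scores to probabilities.
probs = F.softmax(logits, dim=1)
assert torch.allclose(probs.sum(dim=1), torch.ones(100))

# Mean vs. summed cross-entropy: the sum grows with the batch size,
# while the mean stays on a comparable scale regardless of batch size.
ce_mean = F.cross_entropy(logits, targets, reduction="mean")
ce_sum = F.cross_entropy(logits, targets, reduction="sum")
print(ce_mean.item(), ce_sum.item())     # ce_sum is roughly 100 * ce_mean
```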

Why do we need to normalize the input to zero mean and unit variance before feeding it to the network?

In deep learning, I have seen many papers apply a normalization pre-processing step: the input is normalized to zero mean and unit variance before being fed to the convolutional network (which has BatchNorm). Why not use the original intensities? What is the benefit of the normalization step? If I use histogram matching among images, should I still use the normalization step? Thanks
Normalization is important to bring the features onto the same scale so that the network behaves much better. Let's assume there are two features, where one is measured on a scale of 1 to 10 and the second on a scale from 1 to 10,000. In terms of the squared-error function, the network will be busy optimizing the weights according to the larger errors contributed by the second feature.
Therefore it is better to normalize.
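A minimal sketch of what such normalization looks like in practice, assuming two made-up features on the scales described above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical features on very different scales.
feature1 = rng.uniform(1, 10, size=1000)
feature2 = rng.uniform(1, 10_000, size=1000)
X = np.column_stack([feature1, feature2])

# Standardize each column to zero mean and unit variance.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print(X.std(axis=0))       # wildly different spreads before normalization
print(X_norm.std(axis=0))  # both roughly 1.0 afterwards
```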
The answer to this can be found in Andrew Ng's tutorial: https://youtu.be/UIp2CMI0748?t=133.
TLDR: If you do not normalize the input features, some of them can have very different scales, and this will slow down gradient descent.
Long explanation: Let us consider a model that uses two features Feature1 and Feature2 with the following ranges:
Feature1: [10,10000]
Feature2: [0.00001, 0.001]
The contour plot of these will look something like this (scaled for easier visibility).
[Figure: contour plot of Feature1 and Feature2]
When you perform gradient descent, you will calculate d(Feature1) and d(Feature2), where "d" denotes the differential, in order to move the model weights closer to minimizing the loss. As evident from the contour plot above, d(Feature1) is going to be significantly smaller than d(Feature2), so even if you choose a reasonably moderate learning rate you will zig-zag around because of the relatively large values of d(Feature2), and you may even miss the global minimum.
[Figure: gradient descent with a medium learning rate]
To avoid this, if you choose a very small learning rate, gradient descent will take a very long time to converge, and you may stop training before it even reaches the global minimum.
[Figure: gradient descent with a very small learning rate]
So, as you can see from the above examples, not scaling your features leads to inefficient gradient descent, which can result in not finding the optimal model.
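This effect can be reproduced with a small, self-contained experiment; the feature ranges follow the example above, while the true weights, noise level, learning rates, and step count are assumptions chosen only to make the contrast visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical features with the ranges used in the answer above.
f1 = rng.uniform(10, 10_000, size=n)
f2 = rng.uniform(1e-5, 1e-3, size=n)
X_raw = np.column_stack([f1, f2])
y = X_raw @ np.array([0.001, 2000.0]) + 0.01 * rng.normal(size=n)

def gd_final_mse(X, y, lr, steps=5000):
    # Center X and y so we can ignore the intercept and focus on scaling.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * Xc.T @ (Xc @ w - yc) / len(yc)   # gradient of the MSE
        w -= lr * grad
    return np.mean((Xc @ w - yc) ** 2)

X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
# Unscaled: the learning rate has to be tiny to keep the large-scale feature
# stable, so the small-scale feature barely gets learned in 5000 steps.
print("unscaled:", gd_final_mse(X_raw, y, lr=5e-8))
# Scaled: the same number of steps drives the loss down to the noise level.
print("scaled  :", gd_final_mse(X_std, y, lr=1e-1))
```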

Are there any fixed relationships between mini-batch gradient descent and gradient descent?

For convex optimization, such as logistic regression:
For example, I have 100 training samples. In mini-batch gradient descent I set the batch size equal to 10.
So after 10 mini-batch gradient descent updates, can I get the same result as one batch gradient descent update?
For non-convex optimization, such as a neural network:
I know mini-batch gradient descent can sometimes avoid some local optima, but are there any fixed relationships between them?
When we say batch gradient descent, it updates the parameters using all the data. Below is an illustration of batch gradient descent. Note that each iteration of batch gradient descent involves computing the average of the gradients of the loss function over the entire training data set. In the figure, gamma is the learning rate, so each update moves by -gamma times the averaged gradient.
When the batch size is 1, it is called stochastic gradient descent (SGD).
When you set the batch size to 10 (I assume the total training data size >> 10), the method is called mini-batch stochastic GD, which is a compromise between true stochastic GD and batch GD (which uses all the training data in one update). Mini-batches perform better than true stochastic gradient descent because, when the gradient computed at each step uses more training examples, we usually see smoother convergence. Below is an illustration of SGD. In this online-learning setting, each iteration of the update consists of choosing a random training instance (z_t) from the outside world and updating the parameter w_t.
The two figures I included here are from this paper.
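As a quick numerical check of the premise in the question (100 samples, batch size 10), ten sequential mini-batch updates generally do not reproduce one full-batch update, because each mini-batch step starts from the parameters produced by the previous one. The toy regression data and learning rate below are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
lr = 0.05

def mse_grad(w, Xb, yb):
    # Gradient of the mean squared error on a (mini-)batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# One batch gradient descent update using all 100 samples.
w_batch = np.zeros(3) - lr * mse_grad(np.zeros(3), X, y)

# Ten sequential mini-batch updates of size 10 over the same 100 samples.
w_mini = np.zeros(3)
for start in range(0, 100, 10):
    w_mini -= lr * mse_grad(w_mini, X[start:start + 10], y[start:start + 10])

print(w_batch)
print(w_mini)  # close, but not identical: each mini-batch step starts from
               # the parameters left by the previous step
```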
From wiki:
The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates \alpha decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins-Siegmund theorem.
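(The "appropriate rate" in the quote is usually formalized by the Robbins-Monro conditions on the step sizes, which the quote does not spell out: \sum_t \alpha_t = \infty and \sum_t \alpha_t^2 < \infty. A schedule such as \alpha_t = a / (b + t), for constants a, b > 0, satisfies both.)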
Regarding your question:
[convex case] Can I get the same result as one batch gradient descent update?
If by "same result" you mean "converging" to the global minimum, then YES. This is proved by Léon Bottou in his paper: both SGD and mini-batch SGD converge to a global minimum almost surely. Note what we mean by almost surely:
It is obvious however that any online learning algorithm can be mislead by a consistent choice of very improbable examples. There is therefore no hope to prove that this algorithm always converges. The best possible result then is the almost sure convergence, that is to say that the algorithm converges towards the solution with probability 1.
For the non-convex case, it is also proved in the same paper (Section 5) that stochastic or mini-batch gradient descent converges to a local minimum almost surely.

Multilayer perceptron for OCR works with only some data sets

NEW DEVELOPMENT
I recently used OpenCV's MLP implementation to test whether it could solve the same tasks. OpenCV was able to classify the same data sets that my implementation could, but was unable to solve the ones that mine could not. Maybe this is due to the termination parameters (which determine when to end training). I stopped before 100,000 iterations, and the MLP did not generalize. This time the network architecture was 400 input neurons, 10 hidden neurons, and 2 output neurons.
I have implemented the multilayer perceptron algorithm and verified that it works with the XOR logic gate. For OCR I taught the network to correctly classify letters "A" and "B" that had been drawn with a thick drawing utensil (a marker). However, when I try to teach the network to classify letters drawn with a thin utensil (a pencil), the network seems to get stuck in a valley and is unable to classify the letters in a reasonable amount of time. The same goes for letters I drew with GIMP.
I know people say we have to use momentum to get out of the valley, but the sources I read were vague. I tried increasing the momentum value when the change in error was insignificant and decreasing it when the change was larger, but it did not seem to help.
My network architecture is 400 input neurons (one for each pixel), 2 hidden layers with 25 neurons each, and 2 neurons in the output layer. The images are grayscale, and the inputs are -0.5 for a black pixel and 0.5 for a white pixel.
EDIT:
Currently the network trains until the calculated error for each training example falls below an accepted error constant. I have also tried stopping training at 10,000 epochs, but this yields bad predictions. The activation function used is the logistic sigmoid. The error function I am using is the sum of squared errors.
I suppose I may have reached a local minimum rather than a valley, but this should not happen repeatedly.
Momentum is not always good: it can help the model jump out of a bad valley, but it may also make the model jump out of a good one, especially when the previous weight-update directions are not good.
There are several possible reasons why your model does not work well.
The parameters are not well set; it is always a non-trivial task to set the parameters of an MLP.
An easy approach is to first set the learning rate, momentum weight, and regularization weight to large values, but set the number of iterations (or epochs) to a very large number. Once the model diverges, halve the learning rate, momentum weight, and regularization weight.
This approach lets the model slowly converge to a local optimum and also gives it a chance to jump out of a bad valley.
Moreover, in my opinion, one output neuron is enough for a two-class problem. There is no need to increase the complexity of the model if it is not necessary. Similarly, if possible, use a three-layer MLP instead of a four-layer MLP.
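For reference, here is a sketch of the classical momentum update that the discussion above refers to; the weight dimension, learning rate of 0.01, and momentum coefficient of 0.9 are illustrative defaults rather than values from the post.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, mu=0.9):
    # Classical momentum: the velocity v accumulates past gradients, which can
    # carry the weights through a shallow (bad) valley but, as noted above,
    # can also carry them out of a good one.
    v = mu * v - lr * grad
    w = w + v
    return w, v

# Hypothetical usage with a stand-in gradient vector.
w = np.zeros(5)
v = np.zeros(5)
grad = np.array([0.2, -0.1, 0.0, 0.3, -0.4])
w, v = sgd_momentum_step(w, v, grad)
print(w, v)
```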

What should be a generic enough convergence criterion for stochastic gradient descent?

I am implementing a generic module for stochastic gradient descent. It takes as arguments: the training dataset, loss(x,y), and dw(x,y) - the per-sample loss and per-sample gradient.
Now, for the convergence criterion, I have thought of:
a) Checking loss function after every 10% of the dataset.size, averaged over some window
b) Checking the norm of the differences between weight vector, after every 10-20% of dataset size
c) Stabilization of error on the training set.
d) A change in the sign of the gradient (again, checked at fixed intervals).
I have noticed that these checks (the precision of the check, etc.) also depend on other things, like the step size and learning rate, and the effect can vary from one training problem to another.
I can't seem to make up my mind on what the generic stopping criterion should be, regardless of the training set, f(x), and df/dw thrown at the SGD module. What do you do?
Also, for (d), what would "change in sign" mean for an n-dimensional vector? That is, given dw_i and dw_{i+1}, how do I detect a change of sign, and does it even have a meaning in more than 2 dimensions?
P.S. Apologies for non-math/latex symbols..still getting used to the stuff.
First, stochastic gradient descent is the online version of the gradient descent method: the update rule uses a single example at a time.
Suppose f(x) is your cost function for a single example. The stopping criterion for an N-dimensional parameter vector is usually based on the gradient becoming small, e.g. \|\nabla f(x)\| \le \epsilon.
See this or this for details.
Second, there is a further twist on stochastic gradient descent using so-called "mini-batches". It works identically to SGD, except that it uses more than one training example to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient and often makes better use of the hierarchical memory organization in modern computers. See this for details.
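As one concrete (and admittedly not universal) way to implement option (a) from the question, here is a sketch of a windowed-loss stopping check; the window size and tolerance are assumptions that would need tuning for each problem.

```python
import numpy as np

def has_converged(loss_history, window=100, tol=1e-4):
    # Compare the average loss over the most recent window with the average
    # over the window before it; stop when the relative improvement is tiny.
    if len(loss_history) < 2 * window:
        return False
    recent = np.mean(loss_history[-window:])
    previous = np.mean(loss_history[-2 * window:-window])
    return abs(previous - recent) <= tol * max(abs(previous), 1e-12)

# Hypothetical usage inside an SGD loop:
# loss_history.append(loss(x_i, y_i))
# if step % check_every == 0 and has_converged(loss_history):
#     break
```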
