What causes fluctuations in cross-entropy loss? - machine-learning

I am using the Softmax algorithm on the CIFAR-10 dataset and have some questions about my cross-entropy loss graph. I reached an accuracy of 40% with the algorithm, so the accuracy is improving. The confusing part is interpreting the cross-entropy graph, as it does not look like any of the graphs I've seen online for similar problems. Could anyone give some insight into how to interpret the following graphs? The y-axis is the loss and the x-axis is the batch number. The two graphs are for batch sizes 1 and 100.
Batch size 1:
Batch size 100:

What causes these fluctuations:
A (mini)batch is just a small part of CIFAR-10. Sometimes you pick easy examples, sometimes you pick hard ones. Or perhaps what would have been easy has become difficult after the model adjusted to the previous batch. After all, it is called Stochastic Gradient Descent. See e.g. the discussion here.
Interpreting those plots:
Batch size 100: It's clearly improving :-) I would recommend you take the mean of the cross entropy across the batch, rather than summing them.
Batch size 1: There seems to be some improvement for first ~40k steps. Then it's probably just oscillation. You need to schedule the learning rate.
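As a rough illustration of what "schedule the learning rate" could mean, here is a minimal step-decay sketch. The function name `step_decay` and the values for `decay_rate` and `decay_every` are assumptions chosen for illustration, not settings from the question.

```python
def step_decay(initial_lr, step, decay_rate=0.5, decay_every=10_000):
    """Multiply the learning rate by `decay_rate` every `decay_every` steps."""
    return initial_lr * decay_rate ** (step // decay_every)


# Print the learning rate at a few training steps.
for step in (0, 10_000, 40_000):
    print(step, step_decay(0.1, step))
```

A smaller learning rate late in training shrinks the size of each update, which tends to damp the oscillation seen with batch size 1.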
Other related points:
Softmax is not an algorithm, but a function which turns a vector of arbitrary values into one that is non-negative and sums up to 1, thus is interpretable as probabilities.
Those plots are very clumsy. Try a scatter plot with a small dot size.
Plot accuracy together with the cross-entropy (on a different scale, with a coarser time resolution) to get a feeling for their relation.
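To make two of the points above concrete (softmax as a function, and mean versus summed cross-entropy), here is a small NumPy sketch. The random logits and labels are made-up stand-ins for a batch of 100 CIFAR-10 predictions, and the helper names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Turn each row of arbitrary values into non-negative values that sum to 1."""
    # Subtracting the row-wise max avoids overflow and does not change the result.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(logits, labels, reduction="mean"):
    """Cross-entropy of integer labels, either averaged or summed over the batch."""
    probs = softmax(logits)
    per_example = -np.log(probs[np.arange(len(labels)), labels])
    return per_example.mean() if reduction == "mean" else per_example.sum()


rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 10))      # fake scores: 100 examples, 10 classes
labels = rng.integers(0, 10, size=100)   # fake integer class labels

probs = softmax(logits)
loss_sum = cross_entropy(logits, labels, reduction="sum")
loss_mean = cross_entropy(logits, labels, reduction="mean")
print(loss_sum, loss_mean)
```

The summed loss is the mean loss times the batch size, so only the mean is directly comparable between the batch-size-1 and batch-size-100 plots.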

Related

Why we need to normalize input as zero mean and unit variance before feed to network?

In deep learning, I have seen many papers apply normalization as a pre-processing step. It normalizes the input to zero mean and unit variance before feeding it to the convolutional network (which has BatchNorm). Why not use the original intensity? What is the benefit of the normalization step? If I use histogram matching among images, should I still use the normalization step? Thanks
Normalization is important to bring features onto the same scale so that the network behaves much better. Let's assume there are two features, where one is measured on a scale of 1 to 10 and the second on a scale of 1 to 10,000. In terms of the squared error function, the network will be busy optimizing the weights according to the larger error on the second feature.
Therefore it is better to normalize.
The answer to this can be found in Andrew Ng's tutorial: https://youtu.be/UIp2CMI0748?t=133.
TLDR: If you do not normalize input features, some features can have a very different scale and will slow down Gradient Descent.
Long explanation: Let us consider a model that uses two features Feature1 and Feature2 with the following ranges:
Feature1: [10,10000]
Feature2: [0.00001, 0.001]
The Contour plot of these will look something like this (scaled for easier visibility).
Contour plot of Feature1 and Feature2
When you perform Gradient Descent, you calculate d(Feature1) and d(Feature2), where "d" denotes the differential, in order to move the model weights closer to minimizing the loss. As is evident from the contour plot above, d(Feature1) is going to be significantly smaller than d(Feature2), so even if you choose a reasonably medium learning rate, you will be zig-zagging around because of the relatively large values of d(Feature2), and you may even miss the global minimum.
Medium value of learning rate
To avoid this, you could choose a very small learning rate, but then Gradient Descent will take a very long time to converge and you may stop training before even reaching the global minimum.
Very small Gradient Descent
So as you can see from the above examples, not scaling your features leads to an inefficient Gradient Descent, which can result in not finding the optimal model.
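A minimal sketch of the normalization step, using synthetic data drawn from the two feature ranges given above. The condition number of X^T X is used here only as a rough proxy for how elongated the contours of a squared-error loss are for a linear model; that choice is an assumption of this sketch, not something from the answer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(10, 10_000, size=500),        # Feature1: [10, 10000]
    rng.uniform(0.00001, 0.001, size=500),    # Feature2: [0.00001, 0.001]
])


def standardize(X):
    """Scale each column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std


X_scaled, mean, std = standardize(X)

# A large condition number means long, narrow contours (zig-zagging updates).
cond_raw = np.linalg.cond(X.T @ X)
cond_scaled = np.linalg.cond(X_scaled.T @ X_scaled)
print(cond_raw, cond_scaled)
```

In practice the `mean` and `std` computed on the training set would be reused to scale validation and test inputs.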

Should batch size matter at inference

I am training a model
5-layer very narrow CNN,
followed by a 5-layer highway,
then fully connected and
softmax over 7 classes.
As there are 7 equally distributed classes, the random-guess classification accuracy would be 14 % (1/7 is roughly 14 %).
Actual accuracy is 40 %. So the net learns somewhat.
Now the weird thing is it learns only with a batch size of 2. Batch sizes of 16, 32 or 64 don't learn at all.
Now the even weirder thing: if I take the checkpoint of the trained net (accuracy 40 %, trained at batch size 2) and start it with a batch size of 32, I should keep getting my 40 % at least for the first couple of steps, right? I do when I restart at batch size 2. But with batch size 32 the initial accuracy is, guess what, 14 %.
Any idea why the batch size would ruin inference? I fear I might be having a shape error somewhere but I cannot find anything.
Thx for your thoughts
There are two possible modes of work for a batch normalization layer at inference time:
Compute the activation mean and variance over a given inference batch
Use average mean and variance from the training batches
In PyTorch, for example, the track_running_stats parameter of the BatchNorm2d layer is set to True by default. In other words, Option 2 is the default once the model is in evaluation mode (after calling model.eval()).
If you choose Option 1, then of course, the size of the inference batch and the characteristics of each sample in it will affect the outputs of the other samples.
So γ and β are learned in training and used in inference as is, and if you do not change the default behavior, the "same" is true for E[x] and Var[x]. I wrote "same" in quotes on purpose, as these are just batch statistics from the training.
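The difference between the two modes can be sketched in plain NumPy. This is not PyTorch's implementation; `batch_norm` is a hypothetical helper, and the running statistics below are made-up values standing in for averages collected during training.

```python
import numpy as np


def batch_norm(x, gamma, beta, running_mean, running_var, use_batch_stats, eps=1e-5):
    """Normalize x of shape (batch, features) with batch or stored statistics."""
    if use_batch_stats:
        mean, var = x.mean(axis=0), x.var(axis=0)   # Option 1: current batch
    else:
        mean, var = running_mean, running_var       # Option 2: training averages
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


rng = np.random.default_rng(0)
features = 4
gamma, beta = np.ones(features), np.zeros(features)
running_mean, running_var = np.zeros(features), np.ones(features)  # assumed values

# The same sample placed in a batch of 2 and in a batch of 32.
sample = rng.normal(size=(1, features))
batch_2 = np.vstack([sample, rng.normal(size=(1, features))])
batch_32 = np.vstack([sample, rng.normal(loc=3.0, size=(31, features))])

args = (gamma, beta, running_mean, running_var)
opt1_small = batch_norm(batch_2, *args, use_batch_stats=True)[0]
opt1_large = batch_norm(batch_32, *args, use_batch_stats=True)[0]
opt2_small = batch_norm(batch_2, *args, use_batch_stats=False)[0]
opt2_large = batch_norm(batch_32, *args, use_batch_stats=False)[0]

print(np.allclose(opt1_small, opt1_large))  # output depends on the batch
print(np.allclose(opt2_small, opt2_large))  # output independent of the batch
```

With Option 1 the normalized output of the same sample changes with the batch it sits in, which is one way a batch size of 32 at inference could behave differently from a batch size of 2.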
If we're already talking about batch size, I'd mention that it may sound tempting to use very large batch sizes in training, to have more accurate statistics and to have a better approximation of the loss function for an SGD step. Yet, approximating the loss function too well has drawbacks, such as overfitting.
You should look at the accuracy when your model converges, not when it is still training. It's hard to compare effects of different batch size during training steps because they can get "lucky" and follow a good gradient path. In general smaller batch size tends to be more noisy and could give you good peaks and bad drops in accuracy.
It's hard to tell without looking at the code, but I think that large batch sizes cause the gradient to be too large and the training cannot converge. One way to fight this effect would be to increase the batch size but decrease the learning rate. You can also try to clip gradient magnitude.
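Clipping the gradient magnitude, as suggested above, can be sketched as rescaling all gradients so that their combined norm does not exceed a threshold. The helper `clip_by_global_norm` and the example gradients are hypothetical; deep learning frameworks ship their own utilities for this.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Only shrink gradients; never scale them up.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm


# Two made-up gradient arrays with joint norm sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```

Rescaling all arrays by the same factor keeps the update direction unchanged and only limits the step length.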

How to fit a classifier with high accuracy on the training set with low features?

My input is the coordinate (r,c) of a pixel of an image, with each value in the range (0, 1], together with its color, which is either 1 or 2.
I have about 6,400 pixels.
My attempt at fitting X=(r,c) and y=color was a failure: the accuracy won't go higher than 70%.
Here's the image:
The first is the actual image; the second is the image I train on, which has only 2 colors. The last is the image that the neural network generated with about 500 weights after training for 50 iterations. The input layer has size 2, there is one hidden layer of size 100, and the output layer has size 2. (For binary classification like this I may need only one output unit, but I am preparing for multi-class classification.)
The classifier failed to fit the training set. Why is that? I tried generating high polynomial terms of those 2 features, but it doesn't help. I tried using a Gaussian kernel with 20-100 random landmarks on the picture to add more features, and got similar output. I tried logistic regression, which doesn't help either.
Please help me increase the accuracy.
Here's the input: input.txt (you can load it into Octave; the variables are coordinate (r,c features) and idx (color)).
You can try plotting it first to make sure that you understand the input, then try training on it and tell me if you get a better result.
Your problem is hard to model. You are trying to fit a function from R^2 to R which has lots of complexity: lots of "spikes" and lots of discontinuous regions (pixels that are completely separated from the rest). This is not an easy problem, and not a useful one. In order to overfit your network to such a setting you will need plenty of hidden units. So what are the options for doing so?
General things that are missing in the question, and are important
Your output variable should be {0, 1} if you are fitting your network through cross entropy cost (log likelihood), which you should use for classification.
50 iterations (if you are talking about mini-batch iterations) is orders of magnitude too small, unless you mean 50 epochs (iterations over the whole training set).
Actual things, that will probably need to be done (at least one of the below):
I assume that you are using ReLU activations (or tanh, hard to say looking at the output). You can instead use RBF activations and increase the number of hidden neurons to ~5000.
If you do not want to go with RBFs, then you will need 1-2 additional hidden layers to fit a function of this complexity. Try an architecture of type 100-100-100 instead.
If the above fails, increase the number of hidden units. That's all you need: enough capacity.
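For readers unfamiliar with RBF activations, here is a minimal sketch of a Gaussian RBF hidden layer on 2-D pixel coordinates. The random inputs, the number of centers, and the width `gamma` are illustrative assumptions; the answer does not specify how centers or widths should be chosen, and a real setup would use the ~6,400 pixels and far more units.

```python
import numpy as np


def rbf_layer(X, centers, gamma):
    """Gaussian RBF activations: exp(-gamma * squared distance to each center)."""
    # Shape (n_samples, n_centers): squared Euclidean distance of each pair.
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)


rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 2))        # stand-in for (r, c) coordinates
centers = rng.uniform(0, 1, size=(200, 2))   # hypothetical hidden-unit centers
H = rbf_layer(X, centers, gamma=200.0)       # hidden activations
print(H.shape)
```

Each hidden unit responds only near its center, so many narrow units can carve out small isolated regions, at the cost of needing a lot of them.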
In general, neural networks are not designed for working with low-dimensional datasets. It is a nice example from the web that you can learn a pixel-position-to-color mapping, but it is completely artificial and seems to actually harm people's intuitions.

Will larger batch size make computation time less in machine learning?

I am trying to tune a hyperparameter, namely the batch size, in a CNN. I have a computer with a Core i7 and 12 GB of RAM, and I am training a CNN on the CIFAR-10 dataset, which can be found in this blog. First, here is what I have read and learned about batch size in machine learning:
let's first suppose that we're doing online learning, i.e. that we're
using a mini-batch size of 1. The obvious worry about online learning
is that using mini-batches which contain just a single training
example will cause significant errors in our estimate of the gradient.
In fact, though, the errors turn out to not be such a problem. The
reason is that the individual gradient estimates don't need to be
super-accurate. All we need is an estimate accurate enough that our
cost function tends to keep decreasing. It's as though you are trying
to get to the North Magnetic Pole, but have a wonky compass that's
10-20 degrees off each time you look at it. Provided you stop to
check the compass frequently, and the compass gets the direction right
on average, you'll end up at the North Magnetic Pole just
fine.
Based on this argument, it sounds as though we should use online
learning. In fact, the situation turns out to be more complicated than
that. As we know we can use matrix techniques to compute the gradient
update for all examples in a mini-batch simultaneously, rather than
looping over them. Depending on the details of our hardware and linear
algebra library this can make it quite a bit faster to compute the
gradient estimate for a mini-batch of (for example) size 100, rather
than computing the mini-batch gradient estimate by looping over the
100 training examples separately. It might take (say) only 50 times as
long, rather than 100 times as long. Now, at first it seems as though
this doesn't help us that much.
With our mini-batch of size 100 the learning rule for the weights
looks like:
$$ w \rightarrow w' = w - \eta \frac{1}{100} \sum_x \nabla C_x, $$
where the sum is over training examples in the mini-batch. This is
versus
$$ w \rightarrow w' = w - \eta \nabla C_x $$
for online learning.
Even if it only takes 50 times as long to do the mini-batch update, it
still seems likely to be better to do online learning, because we'd be
updating so much more frequently. Suppose, however, that in the
mini-batch case we increase the learning rate by a factor 100, so the
update rule becomes
$$ w \rightarrow w' = w - \eta \sum_x \nabla C_x. $$
That's a lot like doing 100 separate instances of online learning with a
learning rate of η. But it only takes 50 times as long as doing a
single instance of online learning. Still, it seems distinctly
possible that using the larger mini-batch would speed things up.
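The argument in the quote can be checked numerically on a toy problem. The sketch below assumes a linear model with per-example cost C_x = 0.5 (w·x - y)^2, which is not part of the quote; it compares looping over 100 examples with a single matrix computation, and shows that a mini-batch step with learning rate 100η equals one step of size η along the summed per-example gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # one mini-batch of 100 made-up examples
y = rng.normal(size=100)
w = rng.normal(size=5)
eta = 0.01


def grad_single(w, x, target):
    """Gradient of 0.5 * (w.x - target)^2 with respect to w for one example."""
    return (w @ x - target) * x


# Looping over the examples versus one matrix expression.
looped_sum = sum(grad_single(w, x, t) for x, t in zip(X, y))
vectorized_sum = X.T @ (X @ w - y)

# Averaged mini-batch update, and the same update with the rate scaled by 100.
w_minibatch = w - eta * vectorized_sum / 100
w_scaled = w - (100 * eta) * vectorized_sum / 100
print(np.allclose(looped_sum, vectorized_sum))
```

The scaled update differs from 100 true online steps only in that every gradient is evaluated at the same weights `w` instead of at successively updated ones.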
Then I tried the MNIST digit dataset: I ran a sample program with the batch size set to 1 at first and noted the training time needed for the full dataset. Then I increased the batch size and noticed that training became faster.
But when training with this code and github link, changing the batch size doesn't decrease the training time. It remained the same whether I used 30, 64 or 128. They say that they got 92% accuracy, and that after two or three epochs they were above 40% accuracy. But when I ran the code on my computer without changing anything other than the batch size, I got a worse result: only 28% after 10 epochs, with the test accuracy stuck there in the following epochs. Then I thought that since they used a batch size of 128, I needed to use that too. I used the same, but it became even worse, giving only 11% after 10 epochs and staying stuck there. Why is that?
Neural networks learn by gradient descent on an error function in the weight space, which is parametrized by the training examples. This means the variables are the weights of the neural network. The function is "generic" and becomes specific when you use training examples. The "correct" way would be to use all training examples to make the specific function. This is called "batch gradient descent" and is usually not done for two reasons:
It might not fit in your RAM (usually GPU, as for neural networks you get a huge boost when you use the GPU).
It is actually not necessary to use all examples.
In machine learning problems, you usually have several thousands of training examples. But the error surface might look similar when you only look at a few (e.g. 64, 128 or 256) examples.
Think of it as a photo: To get an idea of what the photo is about, you usually don't need a 2500x1800px resolution. A 256x256px image will give you a good idea what the photo is about. However, you miss details.
So imagine gradient descent to be a walk on the error surface: You start on one point and you want to find the lowest point. To do so, you walk down. Then you check your height again, check in which direction it goes down and make a "step" (of which the size is determined by the learning rate and a couple of other factors) in that direction. When you have mini-batch training instead of batch-training, you walk down on a different error surface. In the low-resolution error surface. It might actually go up in the "real" error surface. But overall, you will go in the right direction. And you can make single steps much faster!
Now, what happens when you make the resolution lower (the batch size smaller)?
Right, your image of what the error surface looks like gets less accurate. How much this affects you depends on factors like:
Your hardware/implementation
Dataset: How complex is the error surface, and how well is it approximated by only a small portion?
Learning: How exactly are you learning (momentum? newbob? rprop?)
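The "lower resolution" picture can be made concrete by measuring how far mini-batch gradients fall from the full-batch gradient. This is a sketch on synthetic linear-regression data with made-up sizes; real error surfaces of neural networks are more complicated, but the trend with batch size is the same idea.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=0.5, size=n)
w = np.zeros(d)   # point on the error surface where gradients are compared


def gradient(Xb, yb, w):
    """Mean squared-error gradient over the given examples."""
    return Xb.T @ (Xb @ w - yb) / len(yb)


full_grad = gradient(X, y, w)   # "high-resolution" gradient using all examples


def mean_gradient_error(batch_size, trials=200):
    """Average distance between a random mini-batch gradient and the full gradient."""
    errors = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        errors.append(np.linalg.norm(gradient(X[idx], y[idx], w) - full_grad))
    return float(np.mean(errors))


errors = {b: mean_gradient_error(b) for b in (1, 16, 256)}
print(errors)
```

The deviation shrinks as the batch grows, but each larger batch also costs more computation per step, which is the trade-off described above.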
I'd like to add to what's already been said here that a larger batch size is not always good for generalization. I've seen such cases myself, where an increase in batch size hurt validation accuracy, particularly for CNNs working with the CIFAR-10 dataset.
From "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima":
The stochastic gradient descent (SGD) method and its variants are
algorithms of choice for many Deep Learning tasks. These methods
operate in a small-batch regime wherein a fraction of the training
data, say 32–512 data points, is sampled to compute an approximation
to the gradient. It has been observed in practice that when using a
larger batch there is a degradation in the quality of the model, as
measured by its ability to generalize. We investigate the cause for
this generalization drop in the large-batch regime and present
numerical evidence that supports the view that large-batch methods
tend to converge to sharp minimizers of the training and testing
functions—and as is well known, sharp minima lead to poorer
generalization. In contrast, small-batch methods consistently converge
to flat minimizers, and our experiments support a commonly held view
that this is due to the inherent noise in the gradient estimation. We
discuss several strategies to attempt to help large-batch methods
eliminate this generalization gap.
Bottom-line: you should tune the batch size, just like any other hyperparameter, to find an optimal value.
The 2018 opinion retweeted by Yann LeCun is the paper Revisiting Small Batch Training For Deep Neural Networks by Dominic Masters and Carlo Luschi, which suggests that a good generic maximum batch size is 32, with some interplay with the choice of learning rate.
The earlier 2016 paper On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima gives some reasons for not using big batches, which I paraphrase badly: big batches are likely to get stuck in local ("sharp") minima, small batches are not.

Neural network and batch learning

I am new to neural networks and would like to find out when I am supposed to reduce the learning rate as opposed to the batch size.
I would understand that if the learning diverges, the learning rate would have to be reduced.
However, when do I reduce or increase the batch size? My guess is that if the loss fluctuates too much, it would be ideal to reduce the batch size?
If you increase the batch size, the gradient is more likely to point in the right direction, so that the (overall) error decreases. This is especially true compared to updating the weights after considering only a single example, which results in a very random and noisy gradient.
Therefore, if the loss function fluctuates, you can do both: increase the batch size and decrease the learning rate. The drawback of a larger batch size is the higher computational cost per update. So if the training takes too long, see if it still converges with a smaller batch size.
You can read more here or here. (Btw, https://stats.stackexchange.com/ is more suitable for theoretical questions which do not contain specific code implementations)
The "correct" way to learn is to use all of your training data for every single step in your gradient descent. However, this takes a while to compute, as it is a really heavy function which is, most of the time, parameterized by thousands of training examples.
The idea is that the error function / weight update looks similar enough when you leave out some of your training examples. This speeds up the calculation of the error function. However, the drawback is that you might not go in the correct direction with some gradient descent steps. But it should be "almost" correct in most cases.
So the rationale is that even if you don't go completely in the correct direction, you can do a lot more steps in the same time so that it doesn't matter.
A very common choice for the mini-batch size is 128 or 256. The most extreme choice is usually called "stochastic gradient descent" and uses only 1 training example.
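Here is a minimal sketch of mini-batch gradient descent in which the batch size is just a parameter, so that 1 (the "stochastic" extreme) and 128 run through the same loop. The linear-regression task, the learning rate, and the number of epochs are assumptions made for illustration only.

```python
import numpy as np


def minibatches(n_examples, batch_size, rng):
    """Yield index arrays that cover a shuffled dataset once (one epoch)."""
    order = rng.permutation(n_examples)
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]


def sgd_linear_regression(X, y, batch_size, lr=0.1, epochs=20, seed=0):
    """Fit weights w of a linear model with mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for idx in minibatches(len(y), batch_size, rng):
            # Mean squared-error gradient over the current mini-batch only.
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w


rng = np.random.default_rng(1)
X = rng.normal(size=(1024, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                     # noiseless targets for a clean comparison

w_128 = sgd_linear_regression(X, y, batch_size=128)
w_1 = sgd_linear_regression(X, y, batch_size=1)
print(w_128, w_1)
```

With batch size 1 the loop performs 128 times as many updates per epoch as with batch size 128, each based on a much noisier gradient.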
As so often in ML, it is a good idea to just try different values.
