How to interpret the results of a training/validation learning curve? - machine-learning

I am using the Random Forest classifier in the scikit-learn package and have plotted F1 scores versus training set size. The red curve shows the training-set F1 scores and the green curve shows the validation-set scores. This is about what I expected, but I would like some advice on interpretation.
I see that there is some significant variance, yet the validation curve appears to be converging. Should I assume that adding data would do little to affect the variance given the convergence, or am I jumping to conclusions about the rate of convergence?
Is the amount of variance here significant enough to warrant taking further actions that may increase the bias slightly? I realize this is a fairly domain-specific question, but I wonder if there are any general guidelines for how much variance is worth a bit of bias tradeoff.

I see that there is some significant variance, yet the validation curve appears to be converging. Should I assume that adding data would do little to affect the variance given the convergence, or am I jumping to conclusions about the rate of convergence?
This seems true conditional on your learning procedure, and in particular on your selection of hyperparameters. It does not mean that the same effect would occur with a different set of hyperparameters. It only seems that, given the current setting, the rate of convergence is relatively small, so getting to 95% would probably require a significant amount of data.
Is the amount of variance here significant enough to warrant taking further actions that may increase the bias slightly? I realize this is a fairly domain-specific question, but I wonder if there are any general guidelines for how much variance is worth a bit of bias tradeoff.
Yes, in general these kinds of curves at least do not rule out going for higher bias. You clearly overfit to the training set. On the other hand, trees usually do that, so increasing bias might be hard without changing the model. One option I would suggest is Extremely Randomized Trees, which is nearly the same as Random Forest but uses randomly chosen thresholds instead of fully optimized ones. They have significantly bigger bias and should bring these curves a bit closer to each other.
Obviously there is no guarantee. As you said, this is data specific, but the general characteristic looks promising (though it might require changing the model).
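To check the suggestion above on your own data, one option is to compute the gap between training and validation F1 for both model families. This is only a minimal sketch: the synthetic dataset from make_classification, the 50 trees, and the helper name f1_gap are assumptions for illustration, not the asker's setup. Whether Extra Trees actually narrows the gap depends on the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the real data (an assumption, not the asker's dataset).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)


def f1_gap(model):
    """Return training sizes and the mean train-minus-validation F1 gap per size."""
    sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.2, 1.0, 5),
        cv=5,
        scoring="f1",
    )
    return sizes, train_scores.mean(axis=1) - val_scores.mean(axis=1)


models = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "extra trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

gaps = {}
for name, model in models.items():
    sizes, gap = f1_gap(model)
    gaps[name] = gap
    # A gap that shrinks as the training size grows suggests more data still helps.
    print(name, dict(zip(sizes.tolist(), gap.round(3).tolist())))
```

If the gap stays large for both models, limiting tree growth (for example with max_depth or min_samples_leaf) is another way to trade variance for bias.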


optimal batch size for image classification using deep learning

I have a broad question, but it should still be relevant.
Let's say I am doing 2-class image classification using a CNN. A batch size of 32-64 should be sufficient for training purposes. However, if I had data with about 13 classes, surely a batch size of 32 would not be sufficient for a good model, as each batch might get only 2-3 images of each class. Is there a generic or approximate formula to determine the batch size for training? Or should it be determined as a hyperparameter using techniques like grid search or Bayesian methods?
Batch size is a hyperparameter, like the learning rate. It is really hard to say what the perfect size is for your problem.
The problem you mention might exist, but it is only really relevant in specific problems where you can't just do random sampling, such as face/person re-identification.
For "normal" problems random sampling is sufficient. The reason behind mini-batch training is to get more stable training. You want your weight updates to go in the right direction with regard to the global minimum of the loss function for the whole dataset. A mini-batch is an approximation of this.
Increasing the batch size gives you fewer updates but "better" updates. With a small batch size you get more updates, but they will more often go in the wrong direction. If the batch size is too small (e.g. 1), the network might take a long time to converge, which increases the training time. Too large a batch size can hurt the generalization of the network. A good paper on the topic is On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.
Another interesting paper on the topic is Don't Decay the Learning Rate, Increase the Batch Size, which analyzes the effect of batch size on training. In general, learning rate and batch size affect each other.
In general, batch size is more a factor for reducing training time, because you can make use of parallelism, and a larger batch size means fewer weight updates and more stability. As with everything, look at what others did for a task comparable to your problem, take it as a baseline, and experiment with it a little. Also, with huge networks the available memory often limits the maximum batch size anyway.
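The "fewer but better updates" trade-off can be made concrete with a toy experiment. The sketch below assumes an invented 1-D regression problem and a squared-error loss. It counts the updates per epoch and measures how much the mini-batch gradient scatters around its mean for several batch sizes.

```python
import math
import random
import statistics

rng = random.Random(0)

# Toy 1-D regression data y = 2x + noise (invented for illustration).
data = []
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))


def grad(w, batch):
    """Gradient of 0.5 * (w*x - y)^2 with respect to w, averaged over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def gradient_noise(batch_size, w=0.0, trials=300):
    """Standard deviation of the mini-batch gradient over random batches."""
    estimates = [grad(w, rng.sample(data, batch_size)) for _ in range(trials)]
    return statistics.stdev(estimates)


summary = {}
for batch_size in (1, 8, 64, 512):
    updates_per_epoch = math.ceil(len(data) / batch_size)
    summary[batch_size] = (updates_per_epoch, gradient_noise(batch_size))
    print(batch_size, summary[batch_size])
```

Larger batches give fewer updates per epoch, and each update is based on a less noisy gradient estimate.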

Convergence and regularization in linear regression classifier

I am trying to implement a binary classifier using logistic regression for data drawn from 2 point sets (classes y ∈ {-1, 1}). As seen below, we can use the parameter a to prevent overfitting.
Now I am not sure how to choose a "good" value for a.
Another thing I am not sure about is how to choose a "good" convergence criterion for this sort of problem.
Value of 'a'
Choosing "good" things is a sort of meta-regression: pick any value for a that seems reasonable. Run the regression. Try again with a values larger and smaller by a factor of 3. If either works better than the original, try another factor of 3 in that direction -- but round it from 9x to 10x for readability.
You get the idea ... play with it until you get in the right range. Unless you're really trying to optimize the result, you probably won't need to narrow it down much closer than that factor of 3.
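The factor-of-3 search can be sketched in a few lines. The function name tune_by_factor and the toy score (which peaks at a = 0.3) are assumptions for illustration. In practice the score would come from running the regression with the candidate value of a and measuring validation performance. The rounding from 9x to 10x is left out to keep the sketch short.

```python
import math


def tune_by_factor(score, start, factor=3.0, max_steps=20):
    """Move `a` up or down by `factor` for as long as the score keeps improving.

    `score` is any callable where higher is better.
    """
    best_a, best_score = start, score(start)
    for direction in (factor, 1.0 / factor):
        for _ in range(max_steps):
            candidate = best_a * direction
            candidate_score = score(candidate)
            if candidate_score <= best_score:
                break  # no improvement in this direction, stop here
            best_a, best_score = candidate, candidate_score
    return best_a, best_score


def toy_score(a):
    """Hypothetical stand-in for a validation score; it peaks at a = 0.3."""
    return -(math.log10(a) - math.log10(0.3)) ** 2


best_a, best_score = tune_by_factor(toy_score, start=10.0)
print(best_a, best_score)
```

Starting from a = 10, the search walks down through 3.33 and 1.11 and stops near 0.37, which is within a factor of 3 of the toy optimum.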
Data Set Partition
ML folks have spent a lot of words analysing the best split. The optimal split depends very much on your data space. As a global heuristic, use half or a bit more for training; of the rest, no more than half should be used for testing, the rest for validation. For instance, 50:20:30 is a viable approximation for train:test:validate.
Again, you get to play with this somewhat ... except that any true test of the error rate would be entirely new data.
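As a rough sketch of the 50:20:30 heuristic, the helper below (split_dataset is a hypothetical name) shuffles the data once and cuts it into train, test, and validation parts.

```python
import random


def split_dataset(items, fractions=(0.5, 0.2, 0.3), seed=0):
    """Shuffle and split into (train, test, validation) using the given fractions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_test = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, validation


train, test, validation = split_dataset(range(100))
print(len(train), len(test), len(validation))
```

Shuffling before splitting matters when the data is ordered, for example by class.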
Convergence
This depends very much on the characteristics of your empirical error space near the best solution, as well as near local regions of low gradient.
The first consideration is to choose an error function that is likely to be convex and have no flattish regions. The second is to get some feeling for the magnitude of the gradient in the region of a desired solution (normalizing your data will help with this); use this to help choose the convergence radius; you might want to play with that 3x scaling here, too. The final one is to play with the learning rate, so that it's scaled to the normalized data.
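A minimal sketch of one possible convergence criterion: stop when the norm of the gradient falls below a tolerance. The loss used here, the mean log-loss plus (a/2)*||w||^2 for labels in {-1, 1}, is an assumption, because the asker's formula is not shown. The data, learning rate, and tolerance are also invented for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(points, labels, a=0.1, learning_rate=0.5, tol=1e-4, max_iter=10000):
    """Gradient descent on mean log-loss + (a/2)*||w||^2.

    Stops as soon as the gradient norm drops below `tol`.
    Note that the bias weight is regularized too, to keep the sketch short.
    """
    dim = len(points[0])
    w = [0.0] * dim
    grad_norm = float("inf")
    for iteration in range(1, max_iter + 1):
        grad = [a * wj for wj in w]  # gradient of the regularization term
        for x, y in zip(points, labels):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            coeff = -y * sigmoid(-margin) / len(points)
            for j in range(dim):
                grad[j] += coeff * x[j]
        grad_norm = math.sqrt(sum(g * g for g in grad))
        if grad_norm < tol:
            return w, iteration, grad_norm
        w = [wj - learning_rate * g for wj, g in zip(w, grad)]
    return w, max_iter, grad_norm


rng = random.Random(0)
points, labels = [], []
for _ in range(200):
    y = rng.choice((-1, 1))
    # Two Gaussian clouds centred at (-1, -1) and (1, 1); the trailing 1.0 is a bias feature.
    points.append((y + rng.gauss(0.0, 0.7), y + rng.gauss(0.0, 0.7), 1.0))
    labels.append(y)

w, iterations, grad_norm = fit_logistic(points, labels)
correct = sum(
    (sum(wj * xj for wj, xj in zip(w, x)) > 0) == (y > 0)
    for x, y in zip(points, labels)
)
accuracy = correct / len(points)
print(iterations, grad_norm, accuracy)
```

With a > 0 the loss is strongly convex, so the gradient norm is a reasonable stopping signal. Scaling tol by the same factor of 3 is one way to see how sensitive the result is.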
Does any of this help?

Will larger batch size make computation time less in machine learning?

I am trying to tune a hyperparameter, the batch size, in a CNN. I have a computer with a Core i7 and 12 GB of RAM, and I am training a CNN on the CIFAR-10 dataset, which can be found in this blog. First, here is what I have read and learnt about batch size in machine learning:
Let's first suppose that we're doing online learning, i.e. that we're
using a mini-batch size of 1. The obvious worry about online learning
is that using mini-batches which contain just a single training
example will cause significant errors in our estimate of the gradient.
In fact, though, the errors turn out to not be such a problem. The
reason is that the individual gradient estimates don't need to be
super-accurate. All we need is an estimate accurate enough that our
cost function tends to keep decreasing. It's as though you are trying
to get to the North Magnetic Pole, but have a wonky compass that's
10-20 degrees off each time you look at it. Provided you stop to
check the compass frequently, and the compass gets the direction right
on average, you'll end up at the North Magnetic Pole just
fine.
Based on this argument, it sounds as though we should use online
learning. In fact, the situation turns out to be more complicated than
that. As we know we can use matrix techniques to compute the gradient
update for all examples in a mini-batch simultaneously, rather than
looping over them. Depending on the details of our hardware and linear
algebra library this can make it quite a bit faster to compute the
gradient estimate for a mini-batch of (for example) size 100, rather
than computing the mini-batch gradient estimate by looping over the
100 training examples separately. It might take (say) only 50 times as
long, rather than 100 times as long. Now, at first it seems as though
this doesn't help us that much.
With our mini-batch of size 100 the learning rule for the weights
looks like:
w → w′ = w − η (1/100) Σ_x ∇C_x
where the sum is over training examples in the mini-batch. This is
versus
w → w′ = w − η ∇C_x
for online learning.
Even if it only takes 50 times as long to do the mini-batch update, it
still seems likely to be better to do online learning, because we'd be
updating so much more frequently. Suppose, however, that in the
mini-batch case we increase the learning rate by a factor 100, so the
update rule becomes
w → w′ = w − η Σ_x ∇C_x
That's a lot like doing 100 separate instances of online learning with a
learning rate of η. But it only takes 50 times as long as doing a
single instance of online learning. Still, it seems distinctly
possible that using the larger mini-batch would speed things up.
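The claim in the quoted passage, that one mini-batch step with the summed gradient is a lot like 100 separate online steps with learning rate η, can be checked numerically. This is only a sketch: the per-example cost C_x = 0.5*(w*x - y)^2, the toy data, and η = 0.001 are assumptions for illustration.

```python
import random

rng = random.Random(0)

# Toy data y = 2x + small noise (invented for illustration).
batch = []
for _ in range(100):
    x = rng.uniform(-1.0, 1.0)
    batch.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))


def grad(w, x, y):
    """Gradient of the per-example cost C_x = 0.5 * (w*x - y)^2 with respect to w."""
    return (w * x - y) * x


eta = 0.001
w0 = 0.0

# One mini-batch step using the summed gradient (learning rate eta, no 1/100 factor).
w_batch = w0 - eta * sum(grad(w0, x, y) for x, y in batch)

# 100 separate online steps with learning rate eta, one example at a time.
w_online = w0
for x, y in batch:
    w_online -= eta * grad(w_online, x, y)

print(w_batch, w_online)
```

The two results are close but not identical, because the online version re-evaluates the gradient at the updated weight after every example.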
I tried this with the MNIST digit dataset: I ran a sample program with the batch size set to 1 and noted the training time needed for the full dataset. Then I increased the batch size and noticed that training became faster.
But when training with this code and GitHub link, changing the batch size does not decrease the training time. It stays the same whether I use 30, 64, or 128. The authors say they got 92% accuracy, and above 40% after two or three epochs. But when I ran the code on my computer without changing anything other than the batch size, I got a much worse result: only 28% after 10 epochs, and the test accuracy stayed stuck there in the following epochs. Then I thought that since they used a batch size of 128, I needed to use that too. I did, but it became even worse: only 11% after 10 epochs, and it stayed stuck there. Why is that?
Neural networks learn by gradient descent on an error function in weight space, which is parametrized by the training examples. This means the variables are the weights of the neural network. The function is "generic" and becomes specific when you use training examples. The "correct" way would be to use all training examples to make the specific function. This is called "batch gradient descent" and is usually not done for two reasons:
It might not fit in your RAM (usually GPU, as for neural networks you get a huge boost when you use the GPU).
It is actually not necessary to use all examples.
In machine learning problems, you usually have several thousands of training examples. But the error surface might look similar when you only look at a few (e.g. 64, 128 or 256) examples.
Think of it as a photo: To get an idea of what the photo is about, you usually don't need a 2500x1800px resolution. A 256x256px image will give you a good idea what the photo is about. However, you miss details.
So imagine gradient descent to be a walk on the error surface: You start on one point and you want to find the lowest point. To do so, you walk down. Then you check your height again, check in which direction it goes down and make a "step" (of which the size is determined by the learning rate and a couple of other factors) in that direction. When you have mini-batch training instead of batch-training, you walk down on a different error surface. In the low-resolution error surface. It might actually go up in the "real" error surface. But overall, you will go in the right direction. And you can make single steps much faster!
Now, what happens when you make the resolution lower (the batch size smaller)?
Right, your image of what the error surface looks like gets less accurate. How much this affects you depends on factors like:
Your hardware/implementation
Dataset: How complex is the error surface, and how well is it approximated by only a small portion?
Learning: How exactly are you learning (momentum? newbob? rprop?)
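The "low-resolution error surface" picture can be probed with a toy experiment: how often does a mini-batch step actually point uphill on the full error surface? The sketch below assumes an invented noisy 2-D regression problem and a squared-error loss. A step is counted as uphill when the mini-batch gradient has a negative dot product with the full gradient.

```python
import random

rng = random.Random(0)

# Noisy toy data y = 2*x1 - x2 + noise (invented for illustration).
data = []
for _ in range(2000):
    x1, x2 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    data.append((x1, x2, 2.0 * x1 - x2 + rng.gauss(0.0, 3.0)))


def gradient(w, rows):
    """Gradient of the mean of 0.5 * (w1*x1 + w2*x2 - y)^2 over the given rows."""
    g1 = g2 = 0.0
    for x1, x2, y in rows:
        err = w[0] * x1 + w[1] * x2 - y
        g1 += err * x1
        g2 += err * x2
    return g1 / len(rows), g2 / len(rows)


w = (0.0, 0.0)
full = gradient(w, data)  # gradient on the "high-resolution" surface


def uphill_fraction(batch_size, trials=1000):
    """Fraction of random mini-batches whose descent step goes up on the full surface."""
    uphill = 0
    for _ in range(trials):
        mini = gradient(w, rng.sample(data, batch_size))
        if mini[0] * full[0] + mini[1] * full[1] < 0.0:
            uphill += 1
    return uphill / trials


fractions = {b: uphill_fraction(b) for b in (1, 8, 64)}
print(fractions)
```

On this toy problem the uphill fraction drops quickly as the batch grows. How fast it drops on a real problem depends on the factors listed above.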
I'd like to add to what's been already said here that larger batch size is not always good for generalization. I've seen these cases myself, when an increase in batch size hurt validation accuracy, particularly for CNN working with CIFAR-10 dataset.
From "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima":
The stochastic gradient descent (SGD) method and its variants are
algorithms of choice for many Deep Learning tasks. These methods
operate in a small-batch regime wherein a fraction of the training
data, say 32–512 data points, is sampled to compute an approximation
to the gradient. It has been observed in practice that when using a
larger batch there is a degradation in the quality of the model, as
measured by its ability to generalize. We investigate the cause for
this generalization drop in the large-batch regime and present
numerical evidence that supports the view that large-batch methods
tend to converge to sharp minimizers of the training and testing
functions—and as is well known, sharp minima lead to poorer
generalization. In contrast, small-batch methods consistently converge
to flat minimizers, and our experiments support a commonly held view
that this is due to the inherent noise in the gradient estimation. We
discuss several strategies to attempt to help large-batch methods
eliminate this generalization gap.
Bottom-line: you should tune the batch size, just like any other hyperparameter, to find an optimal value.
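Treating batch size as a hyperparameter can be as simple as a loop over candidate values scored on held-out data. The sketch below uses an invented 1-D regression task and plain SGD with a fixed learning rate and epoch budget. All names and settings are assumptions for illustration.

```python
import random

rng = random.Random(0)


def make_data(n):
    """Toy data y = 3x + noise (invented for illustration)."""
    rows = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        rows.append((x, 3.0 * x + rng.gauss(0.0, 0.3)))
    return rows


train, val = make_data(512), make_data(256)


def mse(w, rows):
    """Mean squared error of the model y = w*x on the given rows."""
    return sum((w * x - y) ** 2 for x, y in rows) / len(rows)


def train_sgd(batch_size, learning_rate=0.1, epochs=5):
    """Mini-batch SGD on the squared error; returns the learned weight."""
    w = 0.0
    order = list(train)
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w


results = {b: mse(train_sgd(b), val) for b in (1, 8, 32, 128, 512)}
best = min(results, key=results.get)
print(results, best)
```

With the learning rate and epoch count held fixed, the largest batch gets only a handful of updates and ends up with the worst validation error here. A different learning rate would shift the picture, which is why the two are usually tuned together.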
The 2018 paper Revisiting Small Batch Training For Deep Neural Networks by Dominic Masters and Carlo Luschi, an opinion retweeted by Yann LeCun, suggests that a good generic maximum batch size is 32, with some interplay with the choice of learning rate.
The earlier 2016 paper On Large-Batch Training For Deep Learning: Generalization Gap And Sharp Minima gives some reasons for not using big batches, which I paraphrase badly as: big batches are likely to get stuck in local (“sharp”) minima, small batches are not.

Neural network and batch learning

I am new to neural networks and would like to find out when I am supposed to reduce the learning rate as opposed to the batch size.
I would understand that if the learning diverges, the learning rate would have to be reduced.
However, when do I reduce or increase the batch size? My guess is that if the loss fluctuates too much, it would be ideal to reduce the batch size?
If you increase the batch size, the gradient is more likely to point towards the right direction so that the (overall) error decreases. Especially compared to updating the weights after considering only a single example which results in a very random and noisy gradient.
Therefore, if the loss function fluctuates, you can do both: increase the batch size and decrease the learning rate. The drawback of a larger batch size is the higher computational cost per update. So if the training takes too long, see if it still converges with a smaller batch size.
You can read more here or here. (Btw, https://stats.stackexchange.com/ is more suitable for theoretical questions which do not contain specific code implementations)
The "correct" way to learn is to use all of your training data for every single step of gradient descent. However, this takes a while to compute, because the error function is really heavy: most of the time it is parameterized by thousands of training examples.
The idea is that the error function / weight update looks similar enough when you leave out some of your training examples. This speeds up the calculation of the error function. The drawback is that some gradient descent steps might not go in the correct direction, but they should be "almost" correct in most cases.
So the rationale is that even if you don't go completely in the correct direction, you can do a lot more steps in the same time so that it doesn't matter.
A very common choice for the mini-batch size is 128 or 256. The most extreme choice is usually called "stochastic gradient descent" and uses only 1 training example.
As so often in ML, it is a good idea to just try different values.
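For reference, drawing mini-batches is usually just a shuffle followed by slicing. A minimal sketch follows; the helper name iterate_minibatches is hypothetical.

```python
import random


def iterate_minibatches(examples, batch_size, seed=None):
    """Yield shuffled mini-batches; the last batch may be smaller than batch_size."""
    order = list(examples)
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


batches = list(iterate_minibatches(range(10), batch_size=4, seed=0))
print([len(b) for b in batches])
```

Setting batch_size to 1 gives the single-example extreme, and setting it to the dataset size gives full-batch gradient descent.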

Number of instances or the content of the instances more important (machine learning)?

Say, in the document classification domain, I have a dataset of 1000 instances where the instances (documents) have rather little content, and another dataset of, say, 200 instances where each individual instance has richer content. If IDF is not a concern, will the number of instances really matter in training? Do classification algorithms take that into account in some way?
You could pose this as a general machine learning problem. The simplest problem that can help you understand how the size of training data matters is curve fitting.
The uncertainty and bias of a classifier or a fitted model are functions of the sample size. Small sample size is a well-known problem, which we often try to avoid by collecting more training samples. This is because the uncertainty of non-linear classifiers is estimated by a linear approximation of the model, and this estimate is accurate only if a large number of samples is available, which is the main condition of the central limit theorem.
The proportion of outliers is also an important factor to consider when deciding on the training sample size. If a larger sample size means a greater proportion of outliers, then you should limit the sample size.
The document size is actually an indirect indicator of the size of the feature space. If, for example, you have only 10 features from each document, then you are trying to separate/classify the documents in a 10-dimensional space. If you have 100 features in each document, then the same is happening in a 100-dimensional space. I guess it's easy for you to see that drawing lines that separate the documents is easier in a higher dimension.
For both document size and sample size the rule of thumb is to go as high as possible, but in practice this is not possible. If, for example, you estimate the uncertainty function of the classifier, you will find a threshold beyond which larger sample sizes lead to virtually no reduction of uncertainty and bias. Empirically, you can also find this threshold for some problems by Monte Carlo simulation.
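A minimal sketch of such a Monte Carlo simulation, using the curve-fitting example mentioned earlier: fit a slope on n noisy samples many times and watch how the spread of the estimate shrinks with n. The data model, the noise level, and the sample sizes are assumptions for illustration.

```python
import random
import statistics

rng = random.Random(0)


def fit_slope(n, noise=1.0, true_slope=2.0):
    """Least-squares slope through the origin, fitted on n noisy samples."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_slope * x + rng.gauss(0.0, noise) for x in xs]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def slope_uncertainty(n, trials=300):
    """Standard deviation of the fitted slope over repeated simulated datasets."""
    return statistics.stdev([fit_slope(n) for _ in range(trials)])


uncertainty = {n: slope_uncertainty(n) for n in (10, 40, 160, 640)}
for n, u in uncertainty.items():
    print(n, round(u, 3))
```

Each fourfold increase in sample size buys a smaller absolute reduction in uncertainty. The point where further samples stop being worth their cost is the kind of threshold described above.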
Most engineers don't bother to estimate uncertainty and that often leads to sub-optimal behavior of the methods they implement. This is fine for toy problems but in real-world problems considering uncertainty of estimations and computation is vital for most systems. I hope that answers your questions to some degree.
