Achieving better test results with less training data - why?

I'm currently dealing with a strange data set. I've split it into a 50% training and 50% test set. I get better test (not training!) accuracy when I omit around 30% of the training set, which I find weirdly unintuitive. I've tried different training and test splits, and I can always find a subset of around 30% of the training set whose removal improves the test accuracy.
What does this tell me about the data?
Are the labels sometimes wrong? Is the test set too small, so that it is biased? Or…?
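For concreteness, the experiment can be reproduced roughly as below. This is a minimal sketch: scikit-learn, NumPy arrays X/y, and a stand-in logistic regression are my assumptions, since the question names no library or model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X, y: the data set from the question (placeholders; it is not specified).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Baseline: train on the full training set.
base_acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)

# Repeatedly drop a random ~30% of the training rows and compare test accuracy.
rng = np.random.default_rng(0)
for trial in range(10):
    keep = rng.random(len(X_train)) > 0.3  # keep ~70% of the rows
    acc = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep]).score(X_test, y_test)
    print(f"trial {trial}: kept {keep.sum()} rows, test acc {acc:.3f} (baseline {base_acc:.3f})")
```

If a fixed subset consistently hurts the baseline across several random splits, inspecting those rows and their labels directly is usually more informative than the accuracy number alone.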

Related

Overfitting and data augmentation in random forest prediction

I want to build a prediction model using Random Forest, but it overfits. I have adjusted various parameters to counter the overfitting, but nothing changes much.
When I looked up the reason, I found posts online saying it could be caused by having too little data (1,000 samples). As you know, in image classification, data augmentation increases the amount of data by gradually transforming the shape and angle of the images.
Can the amount of data be increased similarly for a prediction task like this? We duplicated the entire dataset until we had about three times as much data, around three thousand samples. This prevented the overfitting and increased the accuracy.
But I'm not sure whether this is a sound approach from a data-science standpoint, which is why I'm asking here.
Beyond this, I would like to ask how to avoid overfitting in a prediction problem, or how to increase the amount of data.
Thank you!
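For reference, the capacity-limiting parameters the asker mentions adjusting might look like the following in scikit-learn. This is a hedged sketch: the implementation, the parameter values, and the variable names are all assumptions, as the post shows no code.

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=500,      # more trees stabilise the ensemble (extra trees do not overfit by themselves)
    max_depth=8,           # cap tree depth to limit model capacity
    min_samples_leaf=10,   # require larger leaves, smoothing the predictions
    max_features="sqrt",   # subsample features per split to decorrelate the trees
    random_state=0,
)
clf.fit(X_train, y_train)  # X_train, y_train: the ~1,000 samples (placeholders)
print("train:", clf.score(X_train, y_train), "test:", clf.score(X_test, y_test))
```

Note that exact duplication of rows, as described in the question, adds no information: each tree simply sees the same points with higher weight, so it is not a substitute for genuinely new or perturbed data.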

How to judge whether a model is overfitting or not

I am doing video classification with a model combining a CNN and an LSTM.
The accuracy on the training data is 100%, but the accuracy on the test data is not nearly as good.
The training set is small, about 50 samples per class.
In such a case, can I conclude that overfitting is occurring?
Or is there another cause?
Most likely you are indeed overfitting if the performance of your model is perfect on the training data yet poor on the test/validation data.
A good way to observe this effect is to evaluate your model on both the training and validation data after each epoch of training. You might observe that while you train, the performance on your validation set increases initially and then starts to decrease. That is the moment your model starts to overfit, and where you can interrupt your training.
Here's a plot demonstrating this phenomenon, with the blue and red lines corresponding to the errors on the training and validation sets respectively. [Plot not reproduced: the training error keeps decreasing while the validation error reaches a minimum and then rises.]
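A minimal sketch of this per-epoch monitoring with early stopping, assuming Keras; the model and data names are placeholders, not the asker's code.

```python
from tensorflow import keras

stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the validation error...
    patience=5,                 # ...and stop once it has not improved for 5 epochs
    restore_best_weights=True,  # roll back to the weights of the best epoch
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),  # evaluated after every epoch
    epochs=100,
    callbacks=[stop],
)
# history.history["loss"] vs. history.history["val_loss"] gives exactly the two
# curves from the plot: training error keeps falling, validation error turns up.
```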

Accuracy below 50% for binary classification

I am training a Naive Bayes classifier on a balanced dataset with an equal number of positive and negative examples. At test time I compute the accuracy separately for the examples in the positive class, the negative class, and the subsets which make up the negative class. However, for some subsets of the negative class I get accuracy values lower than 50%, i.e. worse than random guessing. Should I worry about these results being so much lower than 50%? Thank you!
It's impossible to fully answer this question without specific details, so here instead are guidelines:
If you have a dataset with an equal number of examples in each class, then random guessing would give you 50% accuracy on average.
To be clear, are you certain your model has learned something on your training dataset? Is the training dataset accuracy higher than 50%? If yes, continue reading.
Assuming that your validation set is large enough to rule out statistical fluctuations, then lower than 50% accuracy suggests that something is indeed wrong with your model.
For example, are your classes accidentally switched somehow in the validation dataset? Notice that if your accuracy is consistently below 50%, then using the inverted predictions, 1 - model.predict(x), would give you accuracy above 50%.
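A quick way to test this switched-labels hypothesis, assuming a fitted scikit-learn classifier and 0/1 integer labels (both assumptions; the question shows no code):

```python
from sklearn.metrics import accuracy_score

# clf: a fitted classifier; X_val, y_val: the suspicious validation subset (placeholders).
y_pred = clf.predict(X_val)
acc = accuracy_score(y_val, y_pred)
flipped = accuracy_score(y_val, 1 - y_pred)  # accuracy of the inverted predictions
print(f"accuracy: {acc:.3f}, flipped: {flipped:.3f}")
# If `flipped` is well above 0.5 while `acc` is below it, the labels of this
# subset are probably inverted somewhere in the pipeline.
```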

Overfitting in convolutional neural network

I was applying a CNN for classification of hand gestures. I have 10 gestures and 100 images for each gesture. The model I constructed gives around 97% accuracy on the training data and 89% accuracy on the test data. Can I say that my model is overfitted, or is it acceptable to have such an accuracy graph (shown below; not reproduced here)?
Add more data to the training set
When your training set is large and covers all kinds of instances, even an overfitted model will serve you well.
Example: let's say you want to detect just one gesture, say 'thumbs-up' (a binary classification problem), and you have created your positive training set with around 1,000 images in which the images are rotated, translated, and scaled, with different colors, different angles, varied viewpoints, cluttered backgrounds, etc. If your training accuracy is 99%, your test accuracy will be somewhere close to it.
Because the training set is big enough to cover all instances of the positive class, even if the model is overfitted it will perform well on the test set, as the instances in the test set will only be slight variations of the instances in the training set.
In your case the model is already good, but if you can add some more data, you will get even better accuracy.
What kind of data to add?
Manually go through the test samples the model got wrong and check for patterns. If you can figure out what kind of samples are failing, add more of that kind to your training set and retrain.
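A minimal sketch of generating the kind of variation described above (rotation, translation, scale), assuming Keras and gesture images already loaded as NumPy arrays; the names are placeholders.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augment = ImageDataGenerator(
    rotation_range=15,            # small random rotations
    width_shift_range=0.1,        # horizontal translation
    height_shift_range=0.1,       # vertical translation
    zoom_range=0.1,               # scale changes
    brightness_range=(0.8, 1.2),  # lighting variation
)

# Each epoch now sees a fresh random variation of every training image,
# which is a cheap way to "add more data" without taking new photos.
model.fit(augment.flow(X_train, y_train, batch_size=32), epochs=50)
```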

Will a larger batch size reduce computation time in machine learning?

I am trying to tune a hyperparameter, the batch size, in a CNN. I have a Core i7 computer with 12 GB of RAM, and I am training a CNN with the CIFAR-10 dataset, which can be found in this blog. First, here is what I have read and learnt about batch size in machine learning:
Let's first suppose that we're doing online learning, i.e. that we're using a mini-batch size of 1. The obvious worry about online learning is that using mini-batches which contain just a single training example will cause significant errors in our estimate of the gradient. In fact, though, the errors turn out to not be such a problem. The reason is that the individual gradient estimates don't need to be super-accurate. All we need is an estimate accurate enough that our cost function tends to keep decreasing. It's as though you are trying to get to the North Magnetic Pole, but have a wonky compass that's 10-20 degrees off each time you look at it. Provided you stop to check the compass frequently, and the compass gets the direction right on average, you'll end up at the North Magnetic Pole just fine.
Based on this argument, it sounds as though we should use online learning. In fact, the situation turns out to be more complicated than that. As we know, we can use matrix techniques to compute the gradient update for all examples in a mini-batch simultaneously, rather than looping over them. Depending on the details of our hardware and linear algebra library, this can make it quite a bit faster to compute the gradient estimate for a mini-batch of (for example) size 100, rather than computing the mini-batch gradient estimate by looping over the 100 training examples separately. It might take (say) only 50 times as long, rather than 100 times as long. Now, at first it seems as though this doesn't help us that much.
With our mini-batch of size 100 the learning rule for the weights looks like:

$$w \rightarrow w' = w - \frac{\eta}{100} \sum_x \nabla C_x,$$

where the sum is over training examples in the mini-batch. This is versus

$$w \rightarrow w' = w - \eta \nabla C_x$$

for online learning.
Even if it only takes 50 times as long to do the mini-batch update, it still seems likely to be better to do online learning, because we'd be updating so much more frequently. Suppose, however, that in the mini-batch case we increase the learning rate by a factor of 100, so the update rule becomes

$$w \rightarrow w' = w - \eta \sum_x \nabla C_x.$$

That's a lot like doing 100 separate instances of online learning with a learning rate of $\eta$. But it only takes 50 times as long as doing a single instance of online learning. Still, it seems distinctly possible that using the larger mini-batch would speed things up.
Now I tried the MNIST digit dataset: I ran a sample program and set the batch size to 1 at first. I noted down the training time needed for the full dataset. Then I increased the batch size and noticed that training became faster.
But when training with this code and github link, changing the batch size does not decrease the training time: it remained the same whether I used 30, 64, or 128. They say they got 92% accuracy, reaching above 40% accuracy after two or three epochs. But when I ran the code on my computer, without changing anything other than the batch size, I got worse results: only 28% after 10 epochs, and the test accuracy was stuck there in the following epochs. Then I thought that since they had used a batch size of 128, I should use that too. But it became even worse, giving only 11% after 10 epochs and getting stuck there. Why is that?
Neural networks learn by running gradient descent on an error function in the weight space, a function which is parametrized by the training examples. This means the variables are the weights of the neural network; the function is "generic" and becomes specific when you plug in training examples. The "correct" way would be to use all training examples to build the specific function. This is called "batch gradient descent" and is usually not done, for two reasons:
It might not fit in your RAM (usually GPU memory, as you get a huge boost for neural networks when you use the GPU).
It is actually not necessary to use all examples.
In machine learning problems, you usually have several thousand training examples. But the error surface might look similar when you only look at a few (e.g. 64, 128 or 256) of them.
Think of it as a photo: To get an idea of what the photo is about, you usually don't need a 2500x1800px resolution. A 256x256px image will give you a good idea what the photo is about. However, you miss details.
So imagine gradient descent as a walk on the error surface: you start at one point and want to find the lowest point. To do so, you walk downhill: you check in which direction the surface descends and make a "step" in that direction (the size of which is determined by the learning rate and a couple of other factors). With mini-batch training instead of batch training, you walk down a different error surface: the low-resolution one. A step might actually go up on the "real" error surface, but overall you will move in the right direction, and you can take single steps much faster!
Now, what happens when you make the resolution lower (the batch size smaller)?
Right, your image of what the error surface looks like gets less accurate. How much this affects you depends on factors like the following (a minimal sketch of the mini-batch walk follows the list):
Your hardware/implementation
Dataset: How complex is the error surface and how good it is approximated by only a small portion?
Learning: How exactly are you learning (momentum? newbob? rprop?)
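Here is a minimal NumPy sketch of that mini-batch walk, with linear regression standing in for the network (my simplification; the answer itself names no model):

```python
import numpy as np

# Synthetic data: 10,000 examples, 20 features, known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=10_000)

w = np.zeros(20)
lr, batch_size = 0.1, 128
for step in range(1_000):
    idx = rng.integers(0, len(X), size=batch_size)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size    # gradient of the MSE on the batch only
    w -= lr * grad                                  # one (noisy) downhill step
print("distance to true weights:", np.linalg.norm(w - w_true))
```

Each step looks only at the low-resolution surface defined by 128 examples; individual steps can point slightly uphill on the full-data surface, yet the walk still converges.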
I'd like to add to what has already been said here that a larger batch size is not always good for generalization. I have seen such cases myself, where an increase in batch size hurt validation accuracy, particularly for a CNN working with the CIFAR-10 dataset.
From "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima":
The stochastic gradient descent (SGD) method and its variants are
algorithms of choice for many Deep Learning tasks. These methods
operate in a small-batch regime wherein a fraction of the training
data, say 32–512 data points, is sampled to compute an approximation
to the gradient. It has been observed in practice that when using a
larger batch there is a degradation in the quality of the model, as
measured by its ability to generalize. We investigate the cause for
this generalization drop in the large-batch regime and present
numerical evidence that supports the view that large-batch methods
tend to converge to sharp minimizers of the training and testing
functions—and as is well known, sharp minima lead to poorer
generalization. In contrast, small-batch methods consistently converge
to flat minimizers, and our experiments support a commonly held view
that this is due to the inherent noise in the gradient estimation. We
discuss several strategies to attempt to help large-batch methods
eliminate this generalization gap.
Bottom-line: you should tune the batch size, just like any other hyperparameter, to find an optimal value.
The 2018 opinion retweeted by Yann LeCun is the paper Revisiting Small Batch Training for Deep Neural Networks by Dominic Masters and Carlo Luschi, suggesting that a good generic maximum batch size is:
32
with some interplay with the choice of learning rate.
The earlier 2016 paper On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima gives some reasons for not using big batches, which I paraphrase badly: big batches are likely to get stuck in local ("sharp") minima, while small batches are not.
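Putting that bottom line into practice, here is a hedged sketch of sweeping the batch size like any other hyperparameter; build_model and the data are placeholders, and it assumes a Keras model compiled with an accuracy metric.

```python
import time

for batch_size in (16, 32, 64, 128, 256):
    model = build_model()  # placeholder: returns a fresh compiled Keras model
    start = time.time()
    model.fit(X_train, y_train, batch_size=batch_size,
              validation_data=(X_val, y_val), epochs=10, verbose=0)
    val_acc = model.evaluate(X_val, y_val, verbose=0)[1]  # [loss, accuracy]
    print(f"batch={batch_size:>3}  time={time.time() - start:6.1f}s  val_acc={val_acc:.3f}")
```

Pick the setting with the best validation accuracy per unit of wall-clock time, not simply the largest batch the hardware allows.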
