Overfitting and data augmentation in Random Forest prediction

I want to build a prediction model using Random Forest, but it overfits. We have adjusted various parameters to deal with the overfitting, but nothing changes much.
When I looked into the cause, I found posts online saying that it could be due to the small amount of data (about 1,000 samples). As you know, in image classification, data augmentation increases the amount of data by gradually transforming the shape and angle of the images.
Can the amount of data be increased in a similar way for a prediction problem like this? We simply copied the entire dataset, ending up with about three times as much data, roughly 3,000 samples. This prevented the overfitting and increased the accuracy.
However, I'm not sure whether this is a sound approach from a data-science point of view, which is why I'm writing this.
Besides these methods, I would like to ask how to avoid overfitting in a prediction problem, or how to increase the amount of data.
Thank you!
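
For reference, the duplication we did amounts to something like the following sketch (the DataFrame name, file name, and target column are placeholders, and whether it is a classifier or regressor is an assumption):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Placeholder names: the real data has ~1,000 rows with a label column ("target").
df = pd.read_csv("data.csv")

# Exact copies of every row, giving roughly 3,000 rows in total.
df_tripled = pd.concat([df] * 3, ignore_index=True)

X = df_tripled.drop(columns="target")
y = df_tripled["target"]

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)
```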

Related

Why do we need to zero-center image data in preprocessing?

Why, when preprocessing image data for a neural network, do we need to zero-center the data? What is the reason for this?
Mean subtraction, or zero-centering, is a common preprocessing technique that subtracts the mean from each data point to make the data zero-centered. Consider a case where the inputs to a neuron are all positive or all negative. In that case the gradients calculated during backpropagation all have the same sign (the same sign as the inputs), so parameter updates are restricted to specific directions and the optimizer has to zig-zag towards the optimum, which makes convergence harder and slower. Many algorithms perform better when the dataset is symmetric (zero-mean).
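
As a concrete illustration (a minimal numpy sketch, not part of the original answer; the array names are placeholders), zero-centering typically means computing the mean on the training set and subtracting it from every split:

```python
import numpy as np

# Stand-in image data: (n_samples, height, width, channels).
X_train = np.random.rand(100, 32, 32, 3).astype(np.float32)
X_test = np.random.rand(20, 32, 32, 3).astype(np.float32)

# Compute the mean on the training set only, then subtract it from both splits.
mean = X_train.mean(axis=0)
X_train_centered = X_train - mean
X_test_centered = X_test - mean

# The training data is now (approximately) zero-mean.
print(np.abs(X_train_centered.mean(axis=0)).max())
```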

Reducing pixels in large data set (sklearn)

I'm currently working on a classification project, but I'm in doubt about how I should start.
Goal
Accurately classifying pictures of size 80*80 (so 6400 pixels) into the correct class (binary).
Setting
5260 training samples, 600 test samples
Question
As there are more pixels than samples, it seems logical to me to 'drop' most of the pixels and only look at the important ones before I even start working out a classification method (like SVM, KNN, etc.).
Say the training data consists of X_train (predictors) and Y_train (outcomes). So far, I've tried looking at the SelectKBest() method from sklearn for feature selection, but what would be the best way to use this method, and how do I know how many features k I actually have to select?
It could also be the case that I'm completely on the wrong track here, so correct me if I'm wrong, or suggest another approach if possible.
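
For reference, this is roughly how I imagined wiring SelectKBest into a pipeline and letting cross-validation choose k (the classifier and the grid of k values are just placeholders):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Stand-in data with the same feature count as the real set (5260 x 6400 in practice).
X_train = np.random.rand(200, 6400)
Y_train = np.random.randint(0, 2, size=200)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),  # univariate feature scoring
    ("clf", LinearSVC(dual=False)),                 # any classifier would work here
])

# Cross-validation picks k instead of guessing it.
grid = GridSearchCV(pipe, param_grid={"select__k": [50, 100, 500, 1000]}, cv=5)
grid.fit(X_train, Y_train)
print(grid.best_params_)
```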
You are suggesting reducing the dimension of your feature space. That is a method of regularization to reduce overfitting. You haven't mentioned that overfitting is an issue, so I would test that first. Here are some things I would try:
Use transfer learning. Take a pretrained network for image recognition tasks and fine tune it to your dataset. Search for transfer learning and you'll find many resources.
Train a convolutional neural network on your dataset. CNNs are the go-to method for machine learning on images. Check for overfitting.
If you want to reduce the dimensionality of your dataset, resize the images. Going from 80x80 to 40x40 reduces the number of pixels by 4x; assuming your task doesn't depend on fine details of the images, you should maintain classification performance.
There are other things you may want to consider but I would need to know more about your problem and its requirements.
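
As a rough sketch of the resizing idea (array names and shapes are assumptions), downsampling 80x80 images to 40x40 can be done by averaging 2x2 pixel blocks with plain numpy:

```python
import numpy as np

# Stand-in for the image data: (n_samples, 80, 80).
X = np.random.rand(1000, 80, 80)

# Average each 2x2 block of pixels: 80x80 -> 40x40, i.e. 6400 -> 1600 features.
X_small = X.reshape(-1, 40, 2, 40, 2).mean(axis=(2, 4))

print(X_small.shape)                          # (1000, 40, 40)
X_flat = X_small.reshape(len(X_small), -1)    # flatten for sklearn estimators
```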

How to judge whether model is overfitting or not

I am doing video classification with a model combining CNN and LSTM.
On the training data the accuracy is 100%, but the accuracy on the test data is not nearly as good.
The amount of training data is small, about 50 samples per class.
In such a case, can I conclude that overfitting is occurring?
Or is there another cause?
Most likely you are indeed overfitting if the performance of your model is perfect on the training data yet poor on the test/validation data.
A good way of observing that effect is to evaluate your model on both training and validation data after each epoch of training. You might observe that while you train, the performance on your validation set is increasing initially, and then starts to decrease. That is the moment when your model starts to overfit and where you can interrupt your training.
Here's a plot demonstrating this phenomenon with the blue and red lines corresponding to errors on training and validation sets respectively.
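
A minimal Keras sketch of that kind of per-epoch monitoring (the model and data here are placeholders rather than the CNN+LSTM from the question): passing a validation split to fit() records both curves each epoch, and EarlyStopping interrupts training once the validation loss stops improving.

```python
import numpy as np
from tensorflow import keras

# Placeholder data: 250 samples, 20 features, 5 classes (stand-ins for the real videos).
X = np.random.rand(250, 20).astype("float32")
y = np.random.randint(0, 5, size=250)

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Evaluate on a held-out validation split after every epoch and stop when it degrades.
history = model.fit(
    X, y,
    validation_split=0.2,
    epochs=100,
    callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                             restore_best_weights=True)],
    verbose=0,
)
print(min(history.history["val_loss"]))
```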

Overfitting in convolutional neural network

I was applying a CNN to the classification of hand gestures. I have 10 gestures and 100 images for each gesture. The model I constructed gives around 97% accuracy on the training data, and I got 89% accuracy on the test data. Can I say that my model is overfitted, or is it acceptable to have such an accuracy graph (shown below)?
Add more data to training set
When your training set is large and covers all kinds of instances, even an overfitted model can perform well.
Example: let's say you want to detect just one gesture, 'thumbs-up' (a binary classification problem), and you have built your positive training set with around 1,000 images in which the images are rotated, translated, scaled, recoloured, taken from different angles and viewpoints, with cluttered backgrounds, etc. If your training accuracy is 99%, your test accuracy will also be somewhere close.
This is because the training set is big enough to cover all instances of the positive class, so even if the model is overfitted, it will perform well on the test set, since the instances in the test set are only slight variations of the instances in the training set.
In your case, your model is good but if you can add some more data, you will get even better accuracy.
What kind of data to add?
Manually go through the test samples that the model got wrong and look for patterns. If you can figure out what kind of samples are going wrong, add that kind of sample to your training set and re-train.
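
One cheap way to get that extra, varied data is to augment the images you already have. A minimal Keras sketch (the array shapes and augmentation parameters are placeholders):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stand-in for the gesture images and labels: 10 gestures x 100 images each.
X = np.random.rand(1000, 64, 64, 3).astype("float32")
y = np.random.randint(0, 10, size=1000)

# Random rotations, shifts, zooms and flips produce new, slightly varied samples.
augmenter = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
)

# Each call yields a freshly augmented batch; the generator can be fed to model.fit().
batches = augmenter.flow(X, y, batch_size=32)
X_batch, y_batch = next(batches)
print(X_batch.shape)   # (32, 64, 64, 3)
```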

Is this image too complex for a shallow NN classifier?

I am trying to classify a series of images like this one, where each class comprises images taken from a similar cellular structure:
I've built a simple network in Keras to do this, structured as:
1000 - 10
The network unaltered achieves very high (>90%) accuracy on MNIST classification, but almost never higher than 5% on these types of images. Is this because they are too complex? My next approach would be to try stacked deep autoencoders.
Seriously, I don't expect any non-convolutional model to work well on this type of data.
A non-convolutional net works well for MNIST because the data is well preprocessed (each digit is centered and resized to a fixed size). Your images are not.
You may notice in your pictures that certain motifs recur, like the darker dots, at different positions and sizes. If you don't use a convolutional model you will not capture that efficiently (e.g. the model has to treat a dark dot moved a little bit within the image as a completely different object).
Because of this, I think you should try a convolutional MNIST model instead of the classic one, or simply design your own.
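
Something along these lines would be a starting point; a minimal Keras sketch with placeholder input size and class count:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder input shape and number of classes; adjust to the real images.
inputs = keras.Input(shape=(64, 64, 1))
x = layers.Conv2D(32, 3, activation="relu")(inputs)   # learns local motifs (e.g. dark dots)
x = layers.MaxPooling2D()(x)                          # adds some translation invariance
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(10, activation="softmax")(x)   # 10 classes, as in the original 1000 - 10 net

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```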
First question: if you run the training longer, do you get better accuracy? You may not have trained long enough.
Also, what is the accuracy on the training data and what is the accuracy on the test data? If both are high, you can train longer or use a more complex model. If the training accuracy is better than the test accuracy, you are essentially at the limits of your data: brute-force scaling of the model size won't help, but clever improvements might (e.g. try convolutional nets).
Finally, with complex and noisy data you may need a lot of data to make a reasonable classification, so you may need many, many images.
Deep stacked autoencoders are, as I understand them, an unsupervised method, which isn't directly suitable for classification.
