Handling imbalanced data in Machine Learning?

In my data the target feature is imbalanced, say 2% good to 98% bad, and that 2% is about 500 records. What if I take those 500 good (minority) records plus only 500 bad records sampled from the 98%, and train the model on that?
My question is: will the model generalize well with that 500 + 500 data, given it is 50:50 good vs bad? I would also repeat the selection of the 500 records sampled from the majority class over multiple iterations to pick the set that gives the highest accuracy, and with only 1000 records the training would run faster on my machine.

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Hi,
I hope the reference link above clarifies the concepts.
When working with unbalanced data it is a bad approach to check only one possibility; you have to try different methods, such as collecting more data, generating synthetic data, changing the evaluation measure (an ROC curve or a different type of metric instead of plain accuracy), or resampling the input data.
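To make this concrete, here is a minimal sketch comparing the 500 + 500 undersampling idea from the question with simply reweighting the classes, evaluated with ROC AUC instead of accuracy. The synthetic data, feature count, and logistic regression model are my own illustrative assumptions, not part of the original question.

```python
# Sketch: two common ways to handle a ~2% / 98% class imbalance.
# The synthetic data and model choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_minority, n_majority = 500, 24500            # roughly 2% "good" vs 98% "bad"
X = rng.normal(size=(n_minority + n_majority, 10))
y = np.r_[np.ones(n_minority), np.zeros(n_majority)]
X[y == 1] += 0.5                               # give the minority class some signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Option 1: random undersampling -- keep every minority row, sample an
# equal number of majority rows (what the question proposes).
min_idx = np.where(y_tr == 1)[0]
maj_idx = rng.choice(np.where(y_tr == 0)[0], size=len(min_idx), replace=False)
keep = np.r_[min_idx, maj_idx]
model_us = LogisticRegression(max_iter=1000).fit(X_tr[keep], y_tr[keep])

# Option 2: keep all the data but weight classes inversely to their frequency.
model_cw = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Evaluate with ROC AUC rather than plain accuracy, as the answer suggests.
for name, m in [("undersampled", model_us), ("class-weighted", model_cw)]:
    print(name, round(roc_auc_score(y_te, m.predict_proba(X_te)[:, 1]), 3))
```

Comparing both options on a held-out test set (rather than on the balanced subset itself) is what tells you whether the 500 + 500 model actually generalizes.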

Related

Why do too many epochs cause overfitting?

I am reading the Deep Learning with Python book.
After reading Chapter 4, "Fighting Overfitting", I have two questions.
Why might increasing the number of epochs cause overfitting?
I know that increasing the number of epochs involves more attempts at gradient descent; will this cause overfitting?
During the process of fighting overfitting, will the accuracy be reduced?
I'm not sure which book you are reading, so some background information may help before I answer the questions specifically.
Firstly, increasing the number of epochs won't necessarily cause overfitting, but it certainly can do. If the learning rate and model parameters are small, it may take many epochs to cause measurable overfitting. That said, it is common for more training to do so.
To keep the question in perspective, it's important to remember that we most commonly use neural networks to build models we can use for prediction (e.g. predicting whether an image contains a particular object or what the value of a variable will be in the next time step).
We build the model by iteratively adjusting weights and biases so that the network can act as a function to translate between input data and predicted outputs. We turn to such models for a number of reasons, often because we just don't know what the function is/should be or the function is too complex to develop analytically. In order for the network to be able to model such complex functions, it must be capable of being highly-complex itself. Whilst this complexity is powerful, it is dangerous! The model can become so complex that it can effectively remember the training data very precisely but then fail to act as an effective, general function that works for data outside of the training set. I.e. it can overfit.
You can think of it as being a bit like someone (the model) who learns to bake by only baking fruit cake (training data) over and over again – soon they'll be able to bake an excellent fruit cake without using a recipe (training), but they probably won't be able to bake a sponge cake (unseen data) very well.
Back to neural networks! Because the risk of overfitting is high with a neural network there are many tools and tricks available to the deep learning engineer to prevent overfitting, such as the use of dropout. These tools and tricks are collectively known as 'regularisation'.
This is why we use development and training strategies involving test datasets – we pretend that the test data is unseen and monitor it during training. In a typical plot of training and test error against epochs, after roughly 50 epochs the test error begins to increase as the model has started to 'memorise the training set', despite the training error remaining at its minimum value (often the training error will continue to improve).
So, to answer your questions:
Allowing the model to continue training (i.e. more epochs) increases the risk of the weights and biases being tuned to such an extent that the model performs poorly on unseen (or test/validation) data. The model is now just 'memorising the training set'.
Continued epochs may well increase training accuracy, but this doesn't necessarily mean the model's predictions from new data will be accurate – often it actually gets worse. To prevent this, we use a test data set and monitor the test accuracy during training. This allows us to make a more informed decision on whether the model is becoming more accurate for unseen data.
We can use a technique called early stopping, whereby we stop training the model once test accuracy has stopped improving after a small number of epochs. Early stopping can be thought of as another regularisation technique.
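A minimal Keras sketch of early stopping follows; the random stand-in data, layer sizes, and patience value are illustrative assumptions, not taken from the book. Training stops once the validation loss has not improved for a few epochs, and the best weights seen are restored.

```python
# Sketch: early stopping in Keras -- stop once validation loss has not
# improved for `patience` epochs, then roll back to the best weights seen.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
x_train = rng.normal(size=(1000, 20)).astype("float32")
y_train = (x_train.sum(axis=1) > 0).astype("float32")
x_val = rng.normal(size=(200, 20)).astype("float32")
y_val = (x_val.sum(axis=1) > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)

# The epoch cap is deliberately large; early stopping decides when to quit.
model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=500,
    callbacks=[early_stop],
    verbose=0,
)
```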
More attempts at descent (a large number of epochs) can, ideally, take you very close to the global minimum of the loss function. But since we don't know anything about the test data, fitting the model so precisely to the class labels of the training data may cause the model to lose its generalization capability (its error over unseen data grows). We certainly want to learn the input-output relationship from the training data, but we must not forget that the end goal is for the model to perform well on unseen data. So it is a good idea to stay close, but not very close, to the global minimum.
But still, we can ask: what if I do reach the global minimum, what would be the problem with that, and why would it cause the model to perform badly on unseen data?
The answer is that in order to reach the global minimum we would be trying to fit the maximum amount of training data, and this results in a very complex model (it is unlikely that the particular sample of training data we happen to have follows a simple spatial distribution). What we can assume, however, is that the much larger body of unseen data (say, for facial recognition) has a simpler spatial distribution and needs a simpler model for good classification; the entire world of unseen data will have a pattern that we cannot observe, simply because we only have access to a small fraction of it in the form of training data.
If you incrementally observe points from a distribution (say 50, 100, 500, 1000, ...), the structure of the data will look complex until a sufficiently large number of points has been observed (at the extreme, the entire distribution); once enough points have been observed, the simpler pattern present in the data emerges and can be classified easily.
In short, a small fraction of the training data tends to have a more complex structure than the entire dataset, and overfitting to the training data may cause our model to perform worse on the test data.
An analogous example of this phenomenon from day-to-day life is as follows:
Say we have met N people so far in our lifetime. While meeting them we naturally learn from them (we become what we are surrounded by). If we are heavily influenced by each individual and try to tune ourselves very closely to the behaviour of everyone we have met, we develop a personality that closely resembles those people, but we also start judging every individual who is unlike them. Becoming judgemental takes a toll on our ability to fit in with new groups, because we trained so hard to minimise the differences with the people we have already met (the training data). To me this is a good example of overfitting and the loss of generalization capability.

Test accuracy vs Training time on Weka

From what I know, test accuracy should increase as training time increases (up to some point), but experimenting with Weka yielded the opposite. I am wondering if I misunderstood something.
I used diabetes.arff for classification, with 70% for training and 30% for testing. I used the MultilayerPerceptron classifier and tried training times of 100, 500, 1000, 3000, 5000, and 10000.
Here are my results,
| Training time | Accuracy  |
|---------------|-----------|
| 100           | 75.2174 % |
| 500           | 75.2174 % |
| 1000          | 74.7826 % |
| 3000          | 72.6087 % |
| 5000          | 70.4348 % |
| 10000         | 68.6957 % |
What can be the reason for this? Thank you!
You got a very nice example of overfitting.
Here is the short explanation of what happened:
Your model (it doesn't matter whether it is a multilayer perceptron, a decision tree, or literally anything else) can fit the training data in two ways.
The first is generalization: the model tries to find patterns and trends and uses them to make predictions. The second is remembering the exact data points from the training dataset.
Imagine a computer vision task: classify images into two categories, humans vs trucks. A good model will find common features that are present in the human pictures but not in the truck pictures (smooth curves, skin-coloured surfaces). This is generalization. Such a model will be able to handle new pictures pretty well. A bad, overfitted model will just remember the exact images, the exact pixels, of the training dataset and will have no idea what to do with new images in the test set.
What can you do to prevent overfitting?
There are a few common approaches to dealing with overfitting:
Use simpler models. With fewer parameters, it is difficult for the model to memorise the dataset.
Use regularization. Constrain the weights of the model and/or use dropout in your perceptron (see the sketch at the end of this answer).
Stop the training process early. Split your training data once more, so you have three parts: training, dev, and test. Then train your model using the training data only, and stop training when the error on the dev set stops decreasing.
A good starting point to read about overfitting is Wikipedia: https://en.wikipedia.org/wiki/Overfitting
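As a concrete illustration of the second approach, here is a minimal Keras sketch of a small perceptron whose weights are constrained with an L2 penalty and which uses dropout. The layer sizes, penalty strength, and dropout rate are illustrative assumptions, not settings from the Weka experiment above.

```python
# Sketch: constraining weights (L2 penalty) and using dropout in a small
# multilayer perceptron. All sizes and rates are illustrative only.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        32, activation="relu", input_shape=(8,),
        kernel_regularizer=tf.keras.regularizers.l2(1e-3),  # penalise large weights
    ),
    tf.keras.layers.Dropout(0.5),   # randomly drop half the units during training
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```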

Is it normal to get a big error from a backpropagation neural network when using the same data for training and testing?

I'm doing some programming with neural network backpropagation.
I have about 90 samples and I train on all of them (90 samples) and test on the same data (90 samples). Using an iteration threshold of about 2 iterations to test it, I get quite a big error (about 60% MAPE, Mean Absolute Percentage Error).
I'm afraid I've got the algorithm wrong, since the only way to get the training error below the 10% threshold is to use an iteration threshold of around 3000k iterations, and that training takes quite a long time (I'm not using momentum, just a plain backpropagation neural network). However, the test accuracy is around 95-99% after training under those conditions.
Is this normal? Or is my program not working as it should?
Of course, it will depend on the data set used, but I wouldn't be surprised if you get an error below 1% even for highly nonlinear data (I've seen this for example in sales data). As long as you separate training and test data sets, the error is expected to rise, but with the same set, it should drop to zero if there are enough hidden units. The capacity of an ANN to fit nonlinear data is huge (and, of course, the more fitted, the less general).
So, I would look for some program bug instead.
You say 3000k iterations, but I assume you mean 3k, i.e. 3000. The other answer says there might be a bug in your code, but 3000 iterations for a problem with 90 samples is definitely normal.
You cannot expect a neural network to fit a training set in just 2 iterations, especially with a low learning rate.
TL;DR - you have nothing to worry about. 3000 iterations is fine.
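As a rough, self-contained sanity check (this is not the asker's actual data or code; scikit-learn's MLPRegressor with plain SGD and no momentum stands in for the hand-written backpropagation network), the sketch below shows that 2 iterations leave the fit to the training data poor, while a few thousand iterations fit it closely.

```python
# Sketch: a tiny MLP on ~90 samples needs far more than 2 iterations to fit
# its own training data. The dataset and network sizes are made up.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(90, 4))
y = np.sin(X).sum(axis=1)          # a nonlinear target

for max_iter in (2, 100, 3000):
    mlp = MLPRegressor(hidden_layer_sizes=(20,), solver="sgd",
                       learning_rate_init=0.01, momentum=0.0,
                       max_iter=max_iter, random_state=0)
    mlp.fit(X, y)                   # train and "test" on the same 90 samples
    print(max_iter, "iterations -> training R^2:", round(mlp.score(X, y), 3))
```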

Interpretation of Neural Network (CNN) Result / Accuracy

I'm kind of new to the subject and built a convolutional neural network based on Google's TensorFlow. I wanted to classify a test data set of pictures belonging to 10 categories. My CNN setup follows the TensorFlow tutorial, with some amendments to match my images' size.
I ran the training step 20 times over a random sample of 500 images, and then repeated that for 50 different samples of size 500. I used a sample of 200 as the validation data set (kept fixed for all runs). As a result I got an accuracy of about 35%, which isn't too bad in my eyes, since I didn't do any optimization and the images are hard to assign to a single category even for humans.
So here are my questions:
Does it really make sense to run a step 20 times over the same batch? (I did this because that is about what fits in RAM, and loading a new batch took quite a while - so I could get more runs in less time.)
In the training accuracy diagram there is a jump at some point around step 120-130. From there on the accuracy goes up close to 100% for each 20-step run on the same random batch. What does that jump mean in terms of network structure / learning?
Your spikes are likely due to the network overfitting on the batch that you are repeatedly showing it, while not really learning something that is useful in general. This also answers your first question - in this case, it doesn't make sense.
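A hedged sketch of the alternative follows (the tiny model and random stand-in data are illustrative, not the asker's setup): rather than taking 20 gradient steps on the same fixed 500-image batch, shuffle those 500 images and take one update per smaller mini-batch, so the network has less opportunity to memorise a single batch.

```python
# Sketch: fresh mini-batches per step instead of repeating gradient updates
# on one fixed 500-image batch. Shapes, model, and data are stand-ins.
import numpy as np
import tensorflow as tf

images = np.random.rand(500, 32, 32, 3).astype("float32")   # stand-in sample
labels = np.random.randint(0, 10, size=500)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# What the question describes: 20 updates on the *same* 500 images.
# for _ in range(20):
#     model.train_on_batch(images, labels)

# Less memorisation-prone: shuffle and step through smaller mini-batches,
# one gradient update per mini-batch.
batch_size = 50
idx = np.random.permutation(len(images))
for start in range(0, len(idx), batch_size):
    sl = idx[start:start + batch_size]
    model.train_on_batch(images[sl], labels[sl])
```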

Inconsistency in cross-validation results

I have a dataset recorded from subjects as they perform a particular cognitive task. The data consists of 16 channels with a number of sample points per channel, and I want to classify the data according to the cognitive task being performed (everything is labelled).
The issue is that I do not have a large amount of data (approximately 60 trials per session, 30 for each cognitive task) and I have 2 sessions. I am trying to train a Linear Discriminant Analysis (LDA) classifier on this data. The classifier will later be used in real time to give some form of output every fixed number of samples.
I used 5-fold cross-validation to measure the generalization error of my classifier. The problem is that when I run this 5-fold cross-validation a number of times, the results are not constant at all; there is significant variation in the overall accuracy (for example, the first 5-fold cross-validation may yield an average accuracy of 80%, the 2nd 65%, the 3rd 72%, etc.). Is this normal? If not, what could be the causes?
It sounds like you may have some bad data, or your classifier is overfitting. You can perform leave-one-out cross-validation and note your results; it can help to find data points that may be biasing the results.
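To see how much of that variation can come purely from the random fold assignment with only ~60 trials, here is a hedged sketch (random features stand in for the real 16-channel recordings) that repeats stratified 5-fold CV many times and also runs leave-one-out with scikit-learn's LDA.

```python
# Sketch: repeated stratified 5-fold CV vs. leave-one-out for an LDA
# classifier on a small dataset. Random features stand in for the real
# 16-channel recordings (60 trials, 2 classes).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import (
    LeaveOneOut, RepeatedStratifiedKFold, cross_val_score)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 16))            # 60 trials x 16 features
y = np.repeat([0, 1], 30)                # 30 trials per cognitive task

lda = LinearDiscriminantAnalysis()

# Repeat 5-fold CV 20 times to see how much the mean accuracy moves
# purely because of the random fold split.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(lda, X, y, cv=cv)
per_repeat = scores.reshape(20, 5).mean(axis=1)
print("5-fold means across repeats:", per_repeat.round(2))

# Leave-one-out removes the fold-assignment randomness entirely.
loo_scores = cross_val_score(lda, X, y, cv=LeaveOneOut())
print("LOO accuracy:", loo_scores.mean().round(2))
```

If the 5-fold means move around by several percentage points on data this small, much of the inconsistency you observed may simply be sampling variance rather than a problem with the classifier itself.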
