SMOTE oversampling and cross-validation - machine-learning

I am working on a binary classification problem in Weka with a highly imbalanced data set (90% in one category and 10% in the other). I first applied SMOTE (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html) to the entire data set to even out the categories and then performed 10-fold cross-validation over the newly obtained data. I found (overly?) optimistic results with F1 around 90%.
Is this due to oversampling?
Is it bad practice to perform cross-validation on data on which SMOTE is applied?
Are there any ways to solve this problem?

I think you should split the data on test and training first, then perform SMOTE just on the training part, and then test the algorithm on the part of the dataset that doesn't have synthetic examples, that'll give you a better picture of the performance of the algorithm.

According to my experience, dividing the data set by hand is not good way to deal with this problem. When you have 1 data set, you should have cross validation on each classifier you use in a way that 1 fold of your cross validation is your test set_which you should not implement SMOTE on it_ and you have 9 other folds as your training set in which you must have a balanced data set. Repeat this action in a loop for 10 times. Then you will have better result than dividing whole data set by hand.
It is obvious that if you apply SMOTE on both test and training set, you are having synthesized test set which gives you a high accuracy that is not actually correct.

Related

How is Machine Learning Overfitting works

How noise in the data, target complexity and size of the training set are related to over-fitting?
I am guessing that you are a beginner, suppose you have dataset with lots of features(as in columns). you create a model and test it on your training and test dataset, you notice that it gives you an accuracy of 100 percent on your training set and 60-70 on your test set, this is an example of Overfitting. it is because you have chosen a lot of features which were not related to predicting the outcome.
you can remove it by dropping those irrelevant columns(which are called as noise), apply K-fold cross validation on your data.
this video might help you get a better understanding
https://www.youtube.com/watch?v=Anq4PgdASsc

Why does my Random Forest Classifier perform better on test and validation data than on training data?

I'm currently training a random forest on some data I have and I'm finding that the model performs better on the validation set, and even better on the test set, than on the train set. Here are some details of what I'm doing - please let me know if I've missed any important information and I will add it in.
My question
Am I doing anything obviously wrong and do you have any advice for how I should improve my approach because I just can't believe that I'm doing it right when my model predicts significantly better on unseen data than training data!
Data
My underlying data consists of tables of features describing customer behaviour and a binary target (so this is a binary classification problem). Technically I have one such table per month and I tend to use several months of data to train and then a different month to predict (e.g. Train on Apr, May and Predict on Jun)
Generally this means I end up with a training dataset of about 100k rows and 20 features (I've previously looked into feature selection and found a set of 7 features which seem to perform best, so have been using these lately). My prediction set generally has around 50k rows.
My dataset is heavily unbalanced (approximately 2% incidence of target feature), so I'm using oversampling techniques - more on that below.
Method
I've searched around online quite a lot and this has led me to the following approach:
Take scaleable (continuous) features in the training data and standardise them (currently using sklearn StandardScaler)
Take categorical features and encode them into separate binary columns (one-hot) using Pandas get_dummies function
Remove 10% of the training data to form a validation set (I'm currently using a random seed in this process for comparability whilst I vary different things such as hyperparameters in the model)
Take the remaining 90% of training data and perform a grid search across a few parameters of the RandomForestClassifier() (currently min_samples_split, max_depth, n_estimators and max_features)
Within each hyperparameter combination from the grid I perform kfold validation with 5 folds and using a random state
Within each fold I oversample my minority class for training data only (sometimes using imbalanced-learn's RandomOverSampler() and sometimes using SMOTE() from the same package), train the model on the training data and then apply the model to the kth fold and record performance metrics (precision, recall, F1 and AUC)
Once I've been through 5 folds on each hyperparameter combination I find the best F1 score (and best precision if two combinations are tied on F1 score) and retrain a random forest on the entire 90% training data using those hyperparameters. During this step I use the same oversampling technique as I did in the kfold process
I then use this model to make predictions on the 10% of training data that I put aside earlier as a validation set, evaluating the same metrics as above
Finally I have a test set, which is actually based on data from another month, which I apply the already trained model to and evaluate the same metrics
Outcome
At the moment I'm finding that my training set achieves an F1 score of around 30%, the validation set is consistently slightly higher than this at around 36% (mostly driven by a much better precision than the training data e.g. 60% vs. 30%) and then the testing set is getting an F1 score of between 45% and 50% which is again driven by a better precision (around 65%)
Notes
Please do ask about any details I haven't mentioned; I've had my stuck in this for weeks and so have doubtless omitted some details
I've had a brief look (not a systematic analysis) of the stability of metrics between folds in the kfold validation and it seems that they aren't varying very much, so I'm fairly happy with the stability of the model here
I'm actually performing the grid search manually rather than using a Python pipeline because try as I might I couldn't get imbalanced-learn's Pipeline function to work with the oversampling functions and so I run a loop with combinations of hyperparameters, but I'm confident that this isn't impacting the results I've talked about above in an adverse way
When I apply the final model to the prediction data (and get an F1 score around 45%) I also apply it back to the training data itself out of interest and get F1 scores around 90% - 100%. I suppose this is to be expected as the model is trained and predicts on almost exactly the same data (except the 10% holdout validation set)

dealing with imbalanced classification data?

I am building a predictive model, on which I predict if a client will subscribe again or not. I already have the dataset and the problem is that it is imbalanced ( the NOs are more then the YESs). I believe that my model is biased, but when I check the accuracy on the training set and the testing set with the predictions made the accuracy is really close (0.8879 on training set and 0.8868 on the test set). The reason why I am confused, is if my model is biased why do I have the accuracy of training and test set close? Or is my model not biased?
Quick response: Yes, your model is very likely to predict everything as the Majority Class.
Let's think of it in a simpler way. You have an optimizer in the training process, who tries to maximize the accuracy (minimize the misclassification). Suppose you have a training set of 1000 images, and you have only 10 tigers in that dataset, and you intend to learn a classifier to distinguish tigers vs non-tigers.
What the optimizer is very likely to do is to predict always non-tiger for every single image. Why? cause it is a much simpler model and easier(likelier in a simpler space) to achieve, and also it gets to 99% accuracy!
I suggest you read more about imbalanced data problems( This one seems to be a good one to start https://machinelearningmastery.com/what-is-imbalanced-classification/) Depending on the problem you are to solve, you might one try to down-sampling, or over-sampling or more advanced solutions, like changing the loss functions and metrics, using F1 or AUC and/or doing ranking instead of classification.

Deep Learning with Small Datasets and SMOTE

I have a data with 6000 records. I am having a train, validate and test set of 60-20-20. I am getting an accuracy of around 76% with XGboost. I converted my data into Time series and I apply LSTM/1-D Convnets and the accuracy is around 60%. Is my dataset too small for deep learning?
Secondly, can apply SMOTE on each of the train, test and validate set (After splitting the data) I know SMOTE should not be applied before splitting the data into train/test/validate. Is it okay to upsample, train/test/validate sets after splitting them?
If upsample the train/test/validate sets afters splitting them, I get better results with LSTM around (80%) But is this approach, right? I just want to show that with more data, we can improve the accuracy of the deep learning algorithm.
In general, SMOTE should only be applied on the train, and you tune hyperparameters with valid and leave test alone.
In your case, I am not sure how you apply SMOTE on the time series data. There should be some assumptions on that which may influence your result.

Pre-randomization before random forest training in Scikit-learn

I am getting a surprisingly significant performance boost (+10% cross-validation accuracy gain) with sklearn.ensemble.RandomForestClassifier just by virtue of pre-randomizing the training set.
This is very puzzling to me, since
(a) RandomForestClassifier supposedly randomized the training data anyway; and
(b) Why would the order of example matter so much anyway?
Any words of wisdom?
I have got the same issue and posted a question, which luckily got resolved.
In my case it's because the data are put in order, and I'm using K-fold cross-validation without shuffling when doing the test-train split. This means that the model is only trained on a chunk of adjacent samples with certain pattern.
An extreme example would be, if you have 50 rows of sample all of class A, followed by 50 rows of sample all of class B, and you manually do a train-test split right in the middle. The model is now trained with all samples of class A, but tested with all samples of class B, hence the test accuracy will be 0.
In scikit, the train_test_split do the shuffling by default, while the KFold class doesn't. So you should do one of the following according to your context:
Shuffle the data first
Use train_test_split with shuffle=True (again, this is the default)
Use KFold and remember to set shuffle=True
Ordering of the examples should not affect RF performance at all. Note Rf performance can vary by 1-2% across runs anyway. Are you keeping cross-validation set separately before training?(Just ensuring this is not because cross-validation set is different every time). Also by randomizing I assume you mean changing the order of the examples.
Also you can check the Out of Bag accuracy of the classifier in both cases for the training set itself, you don't need a separate cross-validation set for RF.
During the training of Random Forest, the data for training each individual tree is obtained by sampling by replacement from the training data, thus each training sample is not used for roughly 1/3 of the trees. We can use the votes of these 1/3 trees to predict the out of box probability of the Random forest classification. Thus with OOB accuracy you just need a training set, and not validation or test data to predict performance on unseen data. Check Out of Bag error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm for further study.

Resources