I have a dataset with 6000 records, split into train, validation and test sets 60-20-20. I get an accuracy of around 76% with XGBoost. I converted my data into a time series and applied LSTM/1-D convnets, and the accuracy is around 60%. Is my dataset too small for deep learning?
Secondly, can I apply SMOTE to each of the train, validation and test sets (after splitting the data)? I know SMOTE should not be applied before splitting the data into train/validation/test. Is it okay to upsample the train/validation/test sets after splitting them?
If I upsample the train/validation/test sets after splitting them, I get better results with the LSTM (around 80%). But is this approach right? I just want to show that with more data, we can improve the accuracy of the deep learning algorithm.
In general, SMOTE should only be applied to the training set; you tune hyperparameters with the validation set and leave the test set alone.
In your case, I am not sure how you apply SMOTE to time series data. There must be some assumptions involved there, and they may influence your results.
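As a minimal sketch of the usual setup (assuming plain tabular features X and labels y, with scikit-learn and imbalanced-learn available; the split ratios mirror your 60-20-20):

```python
# Sketch: oversample the training split only; validation and test stay untouched.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# 60-20-20 split (X, y are placeholders for your features and labels)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

# SMOTE is fit on, and applied to, the training data only
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Train on (X_train_res, y_train_res); tune on (X_val, y_val); report on (X_test, y_test)
```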
How are noise in the data, target complexity and the size of the training set related to over-fitting?
I am guessing that you are a beginner. Suppose you have a dataset with lots of features (as in columns). You create a model and evaluate it on your training and test datasets, and you notice that it gives you an accuracy of 100 percent on the training set but only 60-70 percent on the test set. This is an example of overfitting. It happens because you have chosen a lot of features which are not related to predicting the outcome.
You can reduce it by dropping those irrelevant columns (which are called noise) and by applying K-fold cross validation on your data.
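A minimal sketch of K-fold cross validation (X and y are placeholders for your features and labels; the classifier is just an example):

```python
# Sketch: 5-fold cross validation with scikit-learn.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(random_state=42)        # any classifier works here
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())                     # mean accuracy and spread across folds
```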
This video might help you get a better understanding:
https://www.youtube.com/watch?v=Anq4PgdASsc
I'm currently training a random forest on some data I have and I'm finding that the model performs better on the validation set, and even better on the test set, than on the train set. Here are some details of what I'm doing - please let me know if I've missed any important information and I will add it in.
My question
Am I doing anything obviously wrong, and do you have any advice on how I should improve my approach? I just can't believe that I'm doing it right when my model predicts significantly better on unseen data than on training data!
Data
My underlying data consists of tables of features describing customer behaviour and a binary target (so this is a binary classification problem). Technically I have one such table per month and I tend to use several months of data to train and then a different month to predict (e.g. Train on Apr, May and Predict on Jun)
Generally this means I end up with a training dataset of about 100k rows and 20 features (I've previously looked into feature selection and found a set of 7 features which seem to perform best, so have been using these lately). My prediction set generally has around 50k rows.
My dataset is heavily unbalanced (approximately 2% incidence of target feature), so I'm using oversampling techniques - more on that below.
Method
I've searched around online quite a lot and this has led me to the following approach:
Take scaleable (continuous) features in the training data and standardise them (currently using sklearn's StandardScaler; a rough sketch of these preprocessing steps follows this list)
Take categorical features and encode them into separate binary columns (one-hot) using Pandas get_dummies function
Remove 10% of the training data to form a validation set (I'm currently using a random seed in this process for comparability whilst I vary different things such as hyperparameters in the model)
Take the remaining 90% of training data and perform a grid search across a few parameters of the RandomForestClassifier() (currently min_samples_split, max_depth, n_estimators and max_features)
Within each hyperparameter combination from the grid I perform k-fold validation with 5 folds, using a random state
Within each fold I oversample my minority class for training data only (sometimes using imbalanced-learn's RandomOverSampler() and sometimes using SMOTE() from the same package), train the model on the training data and then apply the model to the kth fold and record performance metrics (precision, recall, F1 and AUC)
Once I've been through 5 folds on each hyperparameter combination I find the best F1 score (and best precision if two combinations are tied on F1 score) and retrain a random forest on the entire 90% training data using those hyperparameters. During this step I use the same oversampling technique as I did in the kfold process
I then use this model to make predictions on the 10% of training data that I put aside earlier as a validation set, evaluating the same metrics as above
Finally I have a test set, which is actually based on data from another month, which I apply the already trained model to and evaluate the same metrics
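For reference, a rough sketch of the preprocessing and 10% hold-out steps above (I'm assuming a pandas DataFrame, and the column names are made up; the key detail is that the scaler is fit on the training rows and reused on the held-out rows):

```python
# Sketch of the scaling, one-hot encoding and 10% hold-out split described above.
# df is a pandas DataFrame; the column names below are illustrative only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

continuous_cols = ["monthly_spend", "tenure_months"]   # placeholder names
categorical_cols = ["region", "plan_type"]             # placeholder names

X = pd.get_dummies(df[continuous_cols + categorical_cols], columns=categorical_cols)
y = df["target"]

# 10% hold-out validation set, fixed seed for comparability across experiments
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)
X_train, X_val = X_train.copy(), X_val.copy()

# Fit the scaler on the training rows only, then apply it to the hold-out rows
scaler = StandardScaler()
X_train[continuous_cols] = scaler.fit_transform(X_train[continuous_cols])
X_val[continuous_cols] = scaler.transform(X_val[continuous_cols])
```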
Outcome
At the moment I'm finding that my training set achieves an F1 score of around 30%, the validation set is consistently slightly higher at around 36% (mostly driven by a much better precision than on the training data, e.g. 60% vs. 30%), and the test set gets an F1 score of between 45% and 50%, which is again driven by better precision (around 65%)
Notes
Please do ask about any details I haven't mentioned; I've been stuck in this for weeks and so have doubtless omitted some details
I've had a brief look (not a systematic analysis) at the stability of metrics between folds in the kfold validation and it seems that they aren't varying very much, so I'm fairly happy with the stability of the model here
I'm actually performing the grid search manually rather than using a Python pipeline because, try as I might, I couldn't get imbalanced-learn's Pipeline function to work with the oversampling functions, so I run a loop over combinations of hyperparameters. I'm confident that this isn't adversely impacting the results I've talked about above (a sketch of how the pipeline version might look follows these notes)
When I apply the final model to the prediction data (and get an F1 score around 45%) I also apply it back to the training data itself out of interest and get F1 scores around 90% - 100%. I suppose this is to be expected as the model is trained and predicts on almost exactly the same data (except the 10% holdout validation set)
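For what it's worth, a minimal sketch of how imbalanced-learn's Pipeline can be combined with a grid search, so that SMOTE only ever sees the training folds (parameter values are placeholders, and X_train/y_train stand for the 90% training portion described above):

```python
# Sketch: SMOTE inside an imbalanced-learn Pipeline, tuned with GridSearchCV.
# Resampling happens only on the training folds of each CV split.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
])

param_grid = {                        # placeholder values, not recommendations
    "rf__n_estimators": [100, 300],
    "rf__max_depth": [5, 10, None],
    "rf__min_samples_split": [2, 10],
    "rf__max_features": ["sqrt", 0.5],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1", n_jobs=-1)
search.fit(X_train, y_train)          # SMOTE is refit inside each training fold only
print(search.best_params_, search.best_score_)
```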
I have a training dataset and a test dataset from two different sources. I mean they come from two different experiments, but both consist of the same kind of biological images. I want to do binary classification using a deep CNN, and I have the following results for test accuracy and train accuracy. The blue line shows train accuracy and the red line shows test accuracy over almost 250 epochs. Why is the test accuracy almost constant and not rising? Is that because the test and train datasets come from different distributions?
Edited:
After adding a dropout layer, regularization terms and mean subtraction, I still get the following strange results, which suggest the model is overfitting from the beginning!
There could be two reasons. First, you overfit on the training data. This can be checked by using the validation score as a comparison metric against the test data. If so, you can use standard techniques to combat overfitting, like weight decay and dropout.
The second is that your data is too different to be learned like this. This is harder to solve. You should first look at the value spread of both sets of images. Are they both normalized? Matplotlib normalizes plotted images automatically. If this still does not work, you might want to look into augmentation to make your training data more similar to the test data. Here I cannot tell you what to use without seeing both the training set and the test set.
Edit:
For normalization, the test set and the training set should have a similar value spread. If you do dataset normalization, you calculate the mean and std on the training set, but you also need to apply those values to the test set rather than computing them from the test set itself. This only makes sense if the value spread is similar for both the training and test set. If this is not the case, you might want to do per-sample normalization first.
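A minimal sketch of that idea, assuming the images are NumPy arrays x_train and x_test shaped (N, H, W, C):

```python
# Sketch: dataset normalization with statistics computed on the training set only.
import numpy as np

mean = x_train.mean()
std = x_train.std()

x_train_norm = (x_train - mean) / std
x_test_norm = (x_test - mean) / std        # reuse the training mean/std on the test set

# Per-sample alternative, if the two sets have very different value spreads:
def per_sample_normalize(batch):
    """Normalize each image by its own mean and std (batch shaped (N, H, W, C))."""
    m = batch.mean(axis=(1, 2, 3), keepdims=True)
    s = batch.std(axis=(1, 2, 3), keepdims=True)
    return (batch - m) / (s + 1e-7)
```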
Other augmentations that are commonly used for almost every dataset are oversampling, random channel shifts, random rotations, random translations and random zoom. These make the model invariant to those operations.
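In Keras, for example, those augmentations could look roughly like this (the parameter values are arbitrary placeholders, not tuned recommendations):

```python
# Sketch: common image augmentations via Keras' ImageDataGenerator.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,          # random rotations
    width_shift_range=0.1,      # random horizontal translation
    height_shift_range=0.1,     # random vertical translation
    zoom_range=0.1,             # random zoom
    channel_shift_range=10.0,   # random channel shifts
)

# model.fit(datagen.flow(x_train_norm, y_train, batch_size=32), epochs=50)
```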
I am currently struggling with a very imbalanced data set with 9 classes and a ratio of 12:1 between the most- and least-represented class. Applying Weka's SMOTE filter until all classes were equally represented has drastically improved the classification results, from an overall classification accuracy of 86% to 95%. Individual class accuracies (true positive rates) have also generally improved: before applying the SMOTE filter they ranged between 40% and 99%, after applying it between 94% and 99%. Moreover, accuracies kept increasing with the number of times the SMOTE filter was applied.
How reliable are those "new" results? Could this be more an effect of over-fitting?
I just want to give a heads up on my results in case somebody else stumbles on the same issue. Unfortunately, it seems like the accuracy improvements most likely came from over-fitting.
I came to this conclusion by using a training-test setup instead of cross-validation: I randomized my data and split it into 85% training data and 15% test data. Then I applied the SMOTE filter to the training data until all classes were equally represented. This up-sampled data was used to train a classification model (END implementation), and the test data was used for classification. The classification results using this setup with SMOTE were very close to the results without SMOTE, in total about 86%. It therefore seems the accuracy improvement came from the fact that the test data in the cross-validation setup was also up-sampled, thereby fostering over-fitting.
Does anybody know more about this? Or somebody who wants to challenge these findings?
I am working on a binary classification problem in Weka with a highly imbalanced data set (90% in one category and 10% in the other). I first applied SMOTE (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html) to the entire data set to even out the categories and then performed 10-fold cross-validation over the newly obtained data. I found (overly?) optimistic results with F1 around 90%.
Is this due to oversampling?
Is it bad practice to perform cross-validation on data on which SMOTE is applied?
Are there any ways to solve this problem?
I think you should split the data into test and training sets first, then perform SMOTE on the training part only, and then test the algorithm on the part of the dataset that doesn't contain synthetic examples; that will give you a better picture of the algorithm's real performance.
In my experience, dividing the data set by hand is not a good way to deal with this problem. When you have one data set, you should run cross validation with each classifier you use, such that in each iteration one fold is your test set, on which you do not apply SMOTE, and the other 9 folds form your training set, which you balance with SMOTE. Repeat this in a loop 10 times. You will then get a better result than from dividing the whole data set by hand.
It is obvious that if you apply SMOTE to both the test and training sets, you end up with a synthesized test set, which gives you a high accuracy that is not actually meaningful.
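A minimal sketch of that fold-wise setup, using Python with imbalanced-learn rather than Weka (X, y and the classifier are placeholders): the sampler only ever sees the training folds, never the held-out fold.

```python
# Sketch: apply SMOTE only inside each training fold of a 10-fold cross-validation.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),                  # fit on training folds only
    ("clf", RandomForestClassifier(random_state=42)),   # placeholder classifier
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")
print(scores.mean())   # measured on untouched, non-synthetic test folds
```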