Training a random forest with different datasets gives totally different results! Why? - machine-learning

I am working with a dataset that contains 12 attributes, including a timestamp, with one attribute as the output. It has about 4000 rows and there is no duplication in the records. I am trying to train a random forest to predict the output. For this purpose I created two different datasets:
ONE: Randomly choose 80% of the data for training and the other 20% for testing.
TWO: Sort the dataset by timestamp, then use the first 80% for training and the last 20% for testing.
Then I removed the timestamp attribute from both datasets and used the other 11 attributes for training and testing (I am sure the timestamp should not be part of the training).
RESULT: I am getting totally different results for these two datasets. For the first one the AUC (area under the curve) is 85%-90% (I repeated the experiment several times) and for the second one it is 45%-50%.
I would appreciate it if someone could help me understand
why I have this huge difference.
Also, I need the test dataset to have the latest timestamps (the same as in the second experiment). Is there any way to select training data from the rest of the dataset that would improve the training?
PS: I already tested random selection from the first 80% of the timestamps and it did not improve the performance.

First of all, it is not clear how exactly you're testing. Second, either way, you are doing the testing wrong.
RESULT: I am getting totally different results for these two datasets. For the first one the AUC (area under the curve) is 85%-90% (I repeated the experiment several times) and for the second one it is 45%-50%.
Is this for the training set or the test set? If the test set, that means you have poor generalization.
You are doing it wrong because you are not allowed to tweak your model until it performs well on the same test set: that might lead you to a model that does just that, but generalizes badly.
You should do one of two things:
1. A training-validation-test split
Keep 60% of the data for training, 20% for validation and 20% for testing, split randomly. Using your training set, train your model so that it performs well on the validation set. Make sure you don't overfit: the performance on the training set should be close to that on the validation set; if they are very far apart, you have overfit your training set. Do not use the test set at all at this stage.
Once you're happy, train your selected model on the training set + validation set and test it on the test set you've held out. You should get acceptable performance. You are not allowed to tweak your model further based on the results you get on this test set; if you're not happy, you have to start from scratch.
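A minimal sketch of such a 60/20/20 random split, using scikit-learn with synthetic data as a stand-in for your own 11-attribute dataset (the model and metric below are illustrative, not prescribed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ~4000-row, 11-feature dataset described above.
X, y = make_classification(n_samples=4000, n_features=11, random_state=0)

# Random 60/20/20 split: first carve off 40%, then cut that 40% in half.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Tune against the validation set only; the held-out test set stays untouched.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```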
2. Use cross validation
A popular form is 10-fold cross validation: shuffle your data and split it into 10 groups of equal or almost equal size. For each of the 10 groups, train on the other 9 and test on the remaining one. Average your results on the test groups.
You are allowed to make changes to your model to improve that average score; just run cross-validation again after each change (and make sure to reshuffle).
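As a rough sketch, 10-fold cross-validation with shuffling could look like this in scikit-learn (again with synthetic stand-in data; change the random_state to reshuffle after each model change):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; swap in your own 11 attributes and output.
X, y = make_classification(n_samples=4000, n_features=11, random_state=0)

# 10-fold CV with shuffling; change random_state to reshuffle after each model change.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="roc_auc")
print(scores.mean())  # the average score you compare across model changes
```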
Personally I prefer cross validation.
I am guessing that what happens is that by sorting based on timestamp, you make your algorithm generalize poorly. Maybe the 20% you keep for testing differs significantly in some way, and your algorithm is not given a chance to capture this difference? In general, your data should be shuffled randomly in order to avoid such issues.
Of course, you might also have a buggy implementation.
I would suggest you try cross validation and see what results you get then.

Related

Best way to create validation / test set when time is a factor?

When it comes to data that has a temporal component to it, the "gold" standard in terms of validation and hyper-parameter tuning would be to use a sliding window approach. In other words, use a sliding window of size N points to evaluate the next K points after that.
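For illustration, a sliding-window evaluation of this kind can be sketched with scikit-learn's TimeSeriesSplit; the window sizes N=300 and K=100 and the toy data below are made up, not taken from the question:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy time-ordered data; in practice X and y would be sorted by timestamp.
X = np.arange(1000).reshape(-1, 1)
y = np.random.RandomState(0).randint(0, 2, size=1000)

# Made-up window sizes: train on the most recent N=300 points,
# evaluate on the next K=100 points, sliding forward at each split.
tscv = TimeSeriesSplit(n_splits=5, max_train_size=300, test_size=100)

for train_idx, test_idx in tscv.split(X):
    print(f"train [{train_idx[0]}..{train_idx[-1]}] -> test [{test_idx[0]}..{test_idx[-1]}]")
    # fit on X[train_idx], y[train_idx]; evaluate on X[test_idx], y[test_idx]
```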
However, this approach is not practical if model training is prohibitively expensive. For example, training an XGBoost model on a Spark cluster on terabytes of data over and over again.
So then what is the best approach for creating a validation set in this scenario? Let's assume we have 1 year of training data and two months of test data after that.
1. Split the test set randomly to create a validation set and test set.
2. Split the test set by time (1 month each) to create a validation set and test set.
Option 1 intuitively feels "closer" to what we're supposed to do (test and validation set being from the same distribution), but feels more prone to over-fitting. Although is it really overfitting if the datasets are technically distinct?
Option 2 is definitely less prone to over-fitting, but will definitely produce an under-estimate of real model performance and can potentially miss important information from the time window in the test set. Also, seems like it'd be difficult to assess whether we've actually overfit or if the performance drop-off from validation to test set is simply due to the data being further out in time.
Here is my take:
#1: Split the test set randomly to create a validation set and test set.
No, you cannot do that. It will lead to leakage: if some of the randomly chosen validation data comes after some of the randomly chosen test data, you end up tuning on information from the future relative to your test set. It happens all the time.
#2: Split the test set by time (1 month each) to create a validation set and test set.
Yes, you can do that, provided you make sure that the month of test data comes after the month of validation data.
In short,
no matter how you select the data, the following constraints have to be enforced:
Training data has to come before validation and test data, with zero temporal overlap.
Validation data has to come before test data, with zero temporal overlap.
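A minimal sketch of a purely chronological split that satisfies both constraints (the date range and cut-off dates below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for ~14 months of data; in practice you would load your own.
rng = np.random.RandomState(0)
df = pd.DataFrame({"timestamp": pd.date_range("2022-01-01", "2023-02-28", freq="D")})
df["feature"] = rng.randn(len(df))
df["target"] = rng.randint(0, 2, len(df))
df = df.sort_values("timestamp")

# Train strictly before validation, validation strictly before test,
# with zero temporal overlap between the three.
train = df[df["timestamp"] < "2023-01-01"]
valid = df[(df["timestamp"] >= "2023-01-01") & (df["timestamp"] < "2023-02-01")]
test = df[df["timestamp"] >= "2023-02-01"]

print(len(train), len(valid), len(test))
```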

Cross validation and Improvement

I was wondering how the cross validation process can improve a model. I am totally new to this field and keen to learn.
I understand the principle of cross-validation but don't understand how it improves a model. Let's say the data is divided into 4 folds; if I train my model on the first three folds and test on the last one, the model will train well. But when I repeat this step by training the model on the last three folds and testing on the first one, most of the training data has already been "seen" by the model, right? The model won't improve from data it has already seen, will it? Or is the result a "mean" of the models built with the different training sets?
Thank you in advance for your time!
Cross-validation doesn't actually improve the model, but it helps you score its performance accurately.
Let's say at the beginning of your training you divide your data into 80% train and 20% test sets. Then you train on the said 80% and test on 20% and get the performance metric.
The problem is that when you separated the data in the beginning, you did so randomly (hopefully) or otherwise arbitrarily, and as a result the model performance you obtained depends in part on the pseudo-random number generator you used, or on your judgement.
So instead you divide your data into, for example, 5 random equal sets. Then you take set 1, put it aside, train on sets 2-5, test on set 1 and record the performance metric. Then you put aside set 2, and train a fresh (not trained) model on sets 1, 3-5, test on set 2, record the metric and so on.
After 5 sets you will have 5 performance metrics. If you take their average (of the most appropriate kind) it would be a better representation of your model performance, because you are 'averaging out' the random effects of data splitting.
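A minimal sketch of that 5-fold procedure, with scikit-learn and toy data standing in for your own model and dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Toy data; the point is the loop over the 5 folds.
X, y = make_classification(n_samples=1000, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model = RandomForestClassifier(random_state=0)  # a fresh, untrained model each time
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(scores)           # 5 performance metrics, one per held-out set
print(np.mean(scores))  # their average 'averages out' the splitting randomness
```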
I think it is explained well in this blog with some code in Python.
With 4-fold cross-validation you are effectively training 4 different models. There's no dependency between the models and one does not train on top of the other.
What will happen later depends on the implementation. Typically you can access all models that were trained and it's left to you what to do with that.
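For example, if you happen to be using scikit-learn, cross_validate with return_estimator=True keeps every fitted model around (the toy data below is just for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = load_breast_cancer(return_X_y=True)  # toy data for illustration

cv_results = cross_validate(
    RandomForestClassifier(random_state=0),
    X, y,
    cv=4,                    # 4 folds -> 4 independently trained models
    return_estimator=True,   # keep each fitted model around
)

print(cv_results["test_score"])        # one score per fold
fold_models = cv_results["estimator"]  # the 4 fitted models; use them as you like
```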

(cross) validation of CNN models - when to bring in test data?

For a research project, I am working on using a CNN for defect detection on images of weld beads, using a dataset containing about 500 images. To do so, I am currently testing different models (e.g. ResNet18-50), as well as data augmentation and transfer learning techniques.
After having experimented for a while, I am wondering which way of training/validation/testing is best suited to provide an accurate measurement of performance while at the same time keeping the computational cost at a reasonable level, considering I want to test as many models etc. as possible.
I guess it would be reasonable to perform some sort of cross-validation (CV), especially since the dataset is so small. However, after conducting some research, I could not find a clear best way to apply it; it seems that different people follow different strategies, and I am a little confused.
Could somebody maybe clarify:
Do you use the validation set only to find the best-performing epoch/weights during training and then directly test every model on the test set (and repeat this k times as a k-fold CV)?
Or do you find the best-performing model by comparing the average validation-set accuracies of all models (with k runs per model) and then check its accuracy on the test set? If so, which exact weights do you load for testing, or would one perform another CV on the best model to determine the final test accuracy?
Would it be an option to perform multiple consecutive training-validation-test runs for each model, shuffling the dataset before each run and splitting it into "new" training, validation and test sets to determine an average test accuracy (like Monte Carlo CV, but maybe with fewer runs)?
Thanks a lot!
In your case I would remove 100 images for your test set. You should make sure that they resemble what you expect your final model to be able to handle. E.g. they should be objects that are not in the training or validation set, because you probably also want the model to generalize to new weld beads. Then you can do something like 4-fold cross-validation on the remaining 400 images, where you randomly or intelligently sample 100 images as your validation set, train your model on the other 300, and use the held-out 100 for validation. Then you can put a confidence interval on the performance of the model. This confidence interval is what you use to tune your hyperparameters.
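For the confidence-interval step, a minimal sketch assuming you already have per-fold validation scores (the numbers below are made up):

```python
import numpy as np

# Made-up per-fold validation accuracies from the four folds described above.
fold_scores = np.array([0.81, 0.78, 0.84, 0.80])

mean = fold_scores.mean()
# Standard error of the mean across folds; fold scores are not fully
# independent, so treat the resulting interval as a rough heuristic.
sem = fold_scores.std(ddof=1) / np.sqrt(len(fold_scores))
print(f"accuracy ~ {mean:.3f} +/- {1.96 * sem:.3f}")
```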
The test set is then useful to predict the performance on novel data, but you should !!!NEVER!!! use it to tune hyperparameters.

Overfitting my model over my training data of a single sample

I am trying to over-fit my model on my training data, which consists of only a single sample. The training accuracy comes out to be 1.00, but when I predict the output for my test data, which consists of the same single training input sample, the results are not accurate. The model has been trained for 100 epochs and the loss is ~1e-4.
What could be the possible sources of error?
As mentioned in the comments of your post, it isn't possible to give specific advice without you first providing more details.
Generally speaking, your approach of overfitting a tiny batch (in your case one image) essentially provides three sanity checks, i.e. that:
backprop is functioning
the weight updates are doing their job
the learning rate is in the correct order of magnitude
As Andrej Karpathy points out in Lecture 5 of the CS231n course at Stanford: "if you can't overfit on a tiny batch size, things are definitely broken".
This suggests, given your description, that your implementation is incorrect. I would start by checking each of the three points listed above. For example, alter your test by picking several different images, or a batch size of 5 images instead of one. You could also revise your predict function, as that is where there is clearly some discrepancy, given that you are getting zero error during training (and validation?).
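As a rough illustration of that sanity check, here is a minimal PyTorch sketch that overfits a made-up tiny model on a single random sample and then verifies the prediction on that same sample; it is not your actual architecture or data:

```python
import torch
import torch.nn as nn

# Made-up tiny model and a single random sample, just to exercise the check.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
x = torch.randn(1, 10)   # one training sample
y = torch.tensor([1])    # its label

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

model.train()
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()          # check 1: backprop runs
    optimizer.step()         # check 2: the weights actually move

model.eval()                 # disables dropout / batch-norm updates at test time
with torch.no_grad():
    pred = model(x).argmax(dim=1)

# With a sane learning rate (check 3) the loss ends up near zero and the
# prediction on this very same sample must equal its label.
print(loss.item(), pred.item(), y.item())
```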

Does it make sense using validation set together with a crossvalidation approach?

I want to train a MultiLayerPerceptron using Weka with ~200 samples and 6 attributes.
I was thinking of splitting the data into train and test sets, and on the train set specifying a certain % as a validation set.
But then I considered using k-fold cross-validation in order to make better use of my set of samples.
My question is: does it make sense to specify a validation set when doing a cross-validation approach?
And, considering the size of the sample, can you suggest some numbers for the two approaches? (e.g. 2/3 for training, 1/3 for testing, and 20% validation... and for CV: 10-fold, 2-fold, or LOOCV instead...)
Thank you in advance!
Your question sounds like you're not exactly familiar with cross-validation. As you noticed, there is a parameter for the number of folds to run. For a simple cross-validation this parameter defines the number of subsets created out of your original set. Let that parameter be k. Your original set is split into k equally sized subsets. Then, for each run, the training is done on k-1 of the subsets and the validation is done on the remaining, k-th subset. Then another combination of k-1 of the k subsets is used for training, and so on. So you run k iterations of this process.
For your data set size, k=10 sounds alright, but basically everything is worth testing, as long as you take all results into account and don't just take the best one.
For the very simple evaluation you just use 2/3 as the training set, and the 1/3 "test set" is actually your validation set. There are more sophisticated approaches, though, which use the test set as a termination criterion and a separate validation set for the final evaluation (since your results might be overfitted to the test set as well, because it defines the termination). For this approach you obviously need to split up the set differently (e.g. 2/3 training, 3/12 test and 1/12 validation).
You should be careful because you don't have many samples. On the other hand, if you want to check your model's accuracy, you should partition off a test set for your model. Cross-validation splits your data into training and validation data. Considering that you don't have many samples and your validation set will be very small, you can have a look at this approach:
5×2 cross-validation, which uses training and validation sets of equal size (Dietterich, 1998)
You can find more info about it in Ethem Alpaydin's Machine Learning book.
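A minimal sketch of 5×2 cross-validation, using scikit-learn's MLPClassifier and synthetic data as stand-ins for the Weka MultiLayerPerceptron and your ~200-sample set:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the ~200-sample, 6-attribute dataset.
rng = np.random.RandomState(0)
X = rng.randn(200, 6)
y = rng.randint(0, 2, 200)

scores = []
for rep in range(5):                                 # 5 repetitions...
    halves = StratifiedKFold(n_splits=2, shuffle=True, random_state=rep)
    for train_idx, test_idx in halves.split(X, y):   # ...of a 50/50 split
        clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                            random_state=rep)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(len(scores), np.mean(scores), np.std(scores))  # 10 scores in total
```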
Don't memorize the data, and don't test on a small number of samples. It looks like a dilemma, but the right decision depends on your data set.
