Splitting training and test data

Splitting training and test data - machine-learning

Can someone recommend what is the best percent of divided the training data and testing data in Machine learning. What are the disadvantages if i split training and test data in 30-70 percentage ?

There is no one "right way" to split your data unfortunately, people use different values which are chosen based on different heuristics, gut feeling and personal experience/preference. A good starting point is the Pareto principle (80-20).
Sometimes using a simple split isn't an option as you might simply have too much data - in that case you might need to sample your data or use smaller testing sets if your algorithm is computationally complex. An important part is to randomly choose your data. The trade-off is pretty simple: less testing data = the performance of your algorithm will have bigger variance. Less training data = parameter estimates will have bigger variance.
For me personally more important than the size of the split is that you obviously shouldn't always perform your tests only once on the same test split as it might be biased (you might be lucky or unlucky with your split). That's why you should do tests for several configurations (for instance you run your tests X times each time choosing a different 20% for testing). Here you can have problems with the so called model variance - different splits will result in different values. That's why you should run tests several times and average out the results.
With the above method you might find it troublesome to test all the possible splits. A well established method of splitting data is the so called cross validation and as you can read in the wiki article there are several types of it (both exhaustive and non-exhaustive). Pay extra attention to the k-fold cross validation.

Read up on the various strategies for cross-validation.
A 10%-90% split is popular, as it arises from 10x cross-validation.
But you could do 3x or 4x cross validation, too. (33-67 or 25-75)
Much larger errors arise from:
having duplicates in both test and train
unbalanced data
Make sure to first merge all duplicates, and do stratified splits if you have unbalanced data.

Related

(cross) validation of CNN models - when to bring in test data?

For a research project, I am working on using a CNN for defect detection on images of weld beads, using a dataset containing about 500 images. To do so, I am currently testing different models (e.g. ResNet18-50), as well as data augmentation and transfer learning techniques.
After having experimented for a while, I am wondering which way of training/validation/testing is best suited to provide an accurate measurement of performance while at the same time keeping the computational expensiveness at a reasonable level, considering I want to test as many models etc.as possible.
I guess it would be reasonable performing some sort of cross validation (cv), especially since the dataset is so small. However, after conducting some research, I could not find a clear best way to apply this and it seems to me that different people follow different strategies an I am a little confused.
Could somebody maybe clarify:
Do you use the validation set only to find the best performing epoch/weights while training and then directly test every model on the test set (and repeat this k times as a k-fold-cv)?
Or do you find the best performing model by comparing the average validation set accuracies of all models (with k-runs per model) and then check its accuracy on the testset? If so, which exact weigths do you load for testing or would one perform another cv for the best model to determine the final test accuracy?
Would it be an option to perform multiple consecutive training-validation-test runs for each model, while before each run shuffling the dataset and splitting it in “new” training-, validation- & testsets to determine an average test accuracy (like monte-carlo-cv, but maybe with less amount of runs)?
Thanks a lot!

In your case I would remove 100 images for your test set. You should make sure that they resemble what you expect your final model to be able to handle. E.g. it should be objects which are not in the training or validation set, because you probably also want it to generalize to new weld beads. Then you can do something like 5 fold cross validation, where you randomly or intelligently sample 100 images as your validation set. Then you train your model on the remaining 300 and use the remaining 100 for validation purposes. Then you can use a confidence interval on the performance of the model. This confidence interval is what you use to tune your hyperparameters.
The test set is then useful to predict the performance on novel data, but you should !!!NEVER!!! use it to tune hyperparmeter.

Model selection for classification with random train/test sets

I'm working with an extremelly unbalanced and heterogeneous multiclass {K = 16} database for research, with a small N ~= 250. For some labels the database has a sufficient amount of examples for supervised machine learning, but for others I have almost none. I'm also not in a position to expand my database for a number of reasons.
As a first approach I divided my database into training (80%) and test (20%) sets in a stratified way. On top of that, I applied several classification algorithms that provide some results. I applied this procedure over 500 stratified train/test sets (as each stratified sampling takes individuals randomly within each stratum), hoping to select an algorithm (model) that performed acceptably.
Because of my database, depending on the specific examples that are part of the train set, the performance on the test set varies greatly. I'm dealing with runs that have as high (for my application) as 82% accuracy and runs that have as low as 40%. The median over all runs is around 67% accuracy.
When facing this situation, I'm unsure on what is the standard procedure (if there is any) when selecting the best performing model. My rationale is that the 90% model may generalize better because the specific examples selected in the training set are be richer so that the test set is better classified. However, I'm fully aware of the possibility of the test set being composed of "simpler" cases that are easier to classify or the train set comprising all hard-to-classify cases.
Is there any standard procedure to select the best performing model considering that the distribution of examples in my train/test sets cause the results to vary greatly? Am I making a conceptual mistake somewhere? Do practitioners usually select the best performing model without any further exploration?
I don't like the idea of using the mean/median accuracy, as obviously some models generalize better than others, but I'm by no means an expert in the field.
Confusion matrix of the predicted label on the test set of one of the best cases:
Confusion matrix of the predicted label on the test set of one of the worst cases:
They both use the same algorithm and parameters.

Good Accuracy =/= Good Model
I want to firstly point out that a good accuracy on your test set need not equal a good model in general! This has (in your case) mainly to do with your extremely skewed distribution of samples.
Especially when doing a stratified split, and having one class dominatingly represented, you will likely get good results by simply predicting this one class over and over again.
A good way to see if this is happening is to look at a confusion matrix (better picture here) of your predictions.
If there is one class that seems to confuse other classes as well, that is an indicator for a bad model. I would argue that in your case it would be generally very hard to find a good model unless you do actively try to balance your classes more during training.
Use the power of Ensembles
Another idea is indeed to use ensembling over multiple models (in your case resulting from different splits), since it is assumed to generalize better.
Even if you might sacrifice a lot of accuracy on paper, I would bet that a confusion matrix of an ensemble is likely to look much better than the one of a single "high accuracy" model. Especially if you disregard the models that perform extremely poor (make sure that, again, the "poor" performance comes from an actual bad performance, and not just an unlucky split), I can see a very good generalization.
Try k-fold Cross-Validation
Another common technique is k-fold cross-validation. Instead of performing your evaluation on a single 80/20 split, you essentially divide your data in k equally large sets, and then always train on k-1 sets, while evaluating on the other set. You then not only get a feeling whether your split was reasonable (you usually get all the results for different splits in k-fold CV implementations, like the one from sklearn), but you also get an overall score that tells you the average of all folds.
Note that 5-fold CV would equal a split into 5 20% sets, so essentially what you are doing now, plus the "shuffling part".
CV is also a good way to deal with little training data, in settings where you have imbalanced classes, or where you generally want to make sure your model actually performs well.

How specific should a Support Vector Machine Model be?

The whole point of using an SVM is that the algorithm will be able to decide whether an input is true or false etc etc.
I am trying to use an SVM for predictive maintenance to predict how likely a system is to overheat.
For my example, the range is 0-102°C and if the temperature reaches 80°C or above it's classed as a failure.
My inputs are arrays of 30 doubles(the last 30 readings).
I am making some sample inputs to train the SVM and I was wondering if it is good practice to pass in very specific data to train it - eg passing in arrays 80°C, 81°C ... 102°C so that the model will automatically associate these values with failure. You could do an array of 30 x 79°C as well and set that to pass.
This seems like a complete way of doing it, although if you input arrays like that - would it not be the same as hardcoding a switch statement to trigger when the temperature reads 80->102°C.
Would it be a good idea to pass in these "hardcoded" style arrays or should I stick to more random inputs?

If there is a finite set of possibilities I would really recommend using Naïve Bayes, as that method would fit this problem perfectly. However if you are forced to use an SVM, I would say that would be rather difficult. For starters the main idea with an SVM is to use it for classification, and the amount of scenarios does not really matter. The input is however seldom discrete, so I guess there usually are infinite scenarios. However, an SVM implemented normally would only give you a classification, unless you have 100 classes one for 1% another one for 2%, this wouldn't really solve problem.
The conclusion is that this could work, but it would not be considered "best practice". You can imagine your 30 dimensional vector space divided into 100 small sub spaces, and each datapoint, a 30x1 vector is a point in that vectorspace so that the probability is decided by which of the 100 subsets its in. However, having a 100 classes and not very clean or insufficient data, will lead to very bad, hard performing models.
Cheers :)

Is it a good practice to use your full data set for predictions?

I know you're supposed to separate your training data from your testing data, but when you make predictions with your model is it OK to use the entire data set?
I assume separating your training and testing data is valuable for assessing the accuracy and prediction strength of different models, but once you've chosen a model I can't think of any downsides to using the full data set for predictions.

You can use full data for prediction but better retain indexes of train and test data. Here are pros and cons of it:
Pro:
If you retain index of rows belonging to train and test data then you just need to predict once (and so time saving) to get all results. You can calculate performance indicators (R2/MAE/AUC/F1/precision/recall etc.) for train and test data separately after subsetting actual and predicted value using train and test set indexes.
Cons:
If you calculate performance indicator for entire data set (not clearly differentiating train and test using indexes) then you will have overly optimistic estimates. This happens because (having trained on train data) model gives good results of train data. Which depending of % split of train and test, will gives illusionary good performance indicator values.
Processing large test data at once may create memory bulge which is can result in crash in all-objects-in-memory languages like R.

In general, you're right - when you've finished selecting your model and tuning the parameters, you should use all of your data to actually build the model (exception below).
The reason for dividing data into train and test is that, without out-of-bag samples, high-variance algorithms will do better than low-variance ones, almost by definition. Consequently, it's necessary to split data into train and test parts for questions such as:
deciding whether kernel-SVR is better or worse than linear regression, for your data
tuning the parameters of kernel-SVR
However, once these questions are determined, then, in general, as long as your data is generated by the same process, the better predictions will be, and you should use all of it.
An exception is the case where the data is, say, non-stationary. Suppose you're training for the stock market, and you have data from 10 years ago. It is unclear that the process hasn't changed in the meantime. You might be harming your prediction, by including more data, in this case.

Yes, there are techniques for doing this, e.g. k-fold cross-validation:
One of the main reasons for using cross-validation instead of using the conventional validation (e.g. partitioning the data set into two sets of 70% for training and 30% for test) is that there is not enough data available to partition it into separate training and test sets without losing significant modelling or testing capability. In these cases, a fair way to properly estimate model prediction performance is to use cross-validation as a powerful general technique.
That said, there may not be a good reason for doing so if you have plenty of data, because it means that the model you're using hasn't actually been tested on real data. You're inferring that it probably will perform well, since models trained using the same methods on less data also performed well. That's not always a safe assumption. Machine learning algorithms can be sensitive in ways you wouldn't expect a priori. Unless you're very starved for data, there's really no reason for it.

Is it considered overfit a decision tree with a perfect attribute?

I have a 6-dimensional training dataset where there is a perfect numeric attribute which separates all the training examples this way: if TIME<200 then the example belongs to class1, if TIME>=200 then example belongs to class2. J48 creates a tree with only 1 level and this attribute as the only node.
However, the test dataset does not follow this hypothesis and all the examples are missclassified. I'm having trouble figuring out whether this case is considered overfitting or not. I would say it is not as the dataset is that simple, but as far as I understood the definition of overfit, it implies a high fitting to the training data, and this I what I have. Any help?

However, the test dataset does not follow this hypothesis and all the examples are missclassified. I'm having trouble figuring out whether this case is considered overfitting or not. I would say it is not as the dataset is that simple, but as far as I understood the definition of overfit, it implies a high fitting to the training data, and this I what I have. Any help?
Usually great training score and bad testing means overfitting. But this assumes IID of the data, and you are clearly violating this assumption - your training data is completely different from the testing one (there is a clear rule for the training data which has no meaning for testing one). In other words - your train/test split is incorrect, or your whole problem does not follow basic assumptions of where to use statistical ml. Of course we often fit models without valid assumptions about the data, in your case - the most natural approach is to drop a feature which violates the assumption the most - the one used to construct the node. This kind of "expert decisions" should be done prior to building any classifier, you have to think about "what is different in test scenario as compared to training one" and remove things that show this difference - otherwise you have heavy skew in your data collection, thus statistical methods will fail.

Yes, it is an overfit. The first rule in creating a training set is to make it look as much like any other set as possible. Your training set is clearly different than any other. It has the answer embedded within it while your test set doesn't. Any learning algorithm will likely find the correlation to the answer and use it and, just like the J48 algorithm, will regard the other variables as noise. The software equivalent of Clever Hans.
You can overcome this by either removing the variable or by training on a set drawn randomly from the entire available set. However, since you know that there is a subset with an embedded major hint, you should remove the hint.
You're lucky. At times these hints can be quite subtle which you won't discover until you start applying the model to future data.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart