Suggest using Kfold splits or validation_split kwarg in Keras Training? - machine-learning

In many examples, I see train/cross-validation dataset splits being performed with KFold, StratifiedKFold, or another pre-built dataset splitter. Keras models have a built-in validation_split kwarg that can be used for training.
model.fit(self, x, y, batch_size=32, nb_epoch=10, verbose=1, callbacks=[], validation_split=0.0, validation_data=None, shuffle=True, class_weight=None, sample_weight=None)
(https://keras.io/models/model/)
validation_split: float between 0 and 1: fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch.
I am new to the field and tools, so my intuition for what the different splitters offer is limited. Mainly, though, I can't find any information on how Keras' validation_split works. Can someone explain it to me, and explain when a separate splitting method is preferable? The built-in kwarg seems to me like the cleanest and easiest way to split off validation data, without having to architect your training loops much differently.

The difference between the two is quite subtle and they can be used in conjunction.
KFold and similar functions in scikit-learn will randomly split your data into k folds. You can then train models holding out a single fold each time and testing on the held-out fold.
validation_split takes a fraction of your data non-randomly. According to the Keras documentation it will take the fraction from the end of your data, e.g. 0.1 will hold out the final 10% of rows in the input matrix. The purpose of the validation split is to allow you to assess how the model is performing on the training set and a held out set at every epoch in the training period. If the model continues to improve on the training set but not the validation set then it is a clear sign of potential overfitting.
You could theoretically use KFold cross-validation to construct a model while also using validation_split to monitor the performance of each model. At each fold you will be generating a new validation_split from the training data.
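To make that combination concrete, here is a minimal sketch, assuming a small Sequential binary classifier and synthetic data (build_model, X, and y are placeholders I made up, not anything from the question):

import numpy as np
from sklearn.model_selection import KFold
from tensorflow import keras

def build_model(n_features):
    # Hypothetical small binary classifier; the question does not specify an architecture
    model = keras.Sequential([
        keras.layers.Dense(16, activation="relu", input_shape=(n_features,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

X = np.random.rand(1000, 20)                    # placeholder features
y = np.random.randint(0, 2, size=1000)          # placeholder binary labels

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, test_idx in kfold.split(X):
    model = build_model(X.shape[1])             # fresh model for every fold
    # validation_split holds out the last 10% of X[train_idx] (taken before shuffling)
    # and evaluates it at the end of every epoch
    model.fit(X[train_idx], y[train_idx], epochs=10, batch_size=32,
              validation_split=0.1, verbose=0)
    _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    fold_scores.append(acc)
print("mean test accuracy across folds:", np.mean(fold_scores))

Each fold's validation_split is used only to monitor training per epoch, while the held-out fold gives the test score that is averaged at the end.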

Related

How to initialize the parameter in the cross validation method and get the final model after training and evaluating using this method?

As I learned about cross-validation from most of the articles on the web, there is a variety of cross-validation methods. Here I want to be clear about the k-fold cross-validation technique.
In the k-fold cross-validation algorithm, we split the training set into k non-overlapping folds.
As we split the training data into k folds, we have to train the model in k iterations.
So, in each iteration, we train the model with (k-1) folds and validate it with the remaining fold.
In each split we can calculate the desired metric(s) of our model.
At the end we can report the cross-validation error by taking the average of the scores across all iterations.
But what is the final trained model?
Some points in those articles are not clear to me:
Should I re-initialize the model's parameters in each iteration?
I ask this because, if I don't re-initialize the parameters, the model could retain patterns from data that is supposed to be unseen in the next iteration, and so on.
Should I save the initial parameters of the split in which I got the best score as the best initial values of the parameters?
Should I retrain the model, initializing it with the initial values of the parameters obtained in my second question, then feed it the whole training dataset to obtain the final trained model?
Alright, before answering your question I will step back a bit to explain the purpose of cross-validation and model evaluation. You can read these slides or research more about statistical learning theory if you want to go deeper.
Train/test split
Suppose you have a model with defined hyperparameters (or none) and you train it on the training split. If you calculate the metrics over the test split, this gives you the risk of the model on new data. Then you know that this particular model will perform like that on unseen data.
So we have a learning process B that takes a dataset S (here the training dataset) as well as hyperparameters h, and gives a fitted model m: B(S, h) -> m (training with B on S with hyperparameters h gives a model m, with its parameters). We then test this model to estimate its risk R on the test dataset.
k-fold Cross validation
When doing k-fold cross-validation, you fit k models using the learning process B. Each model is fitted on a different training set, and the risk is computed on non-overlapping samples.
Then you calculate the mean risk across the folds. A common mistake is to think that this gives you the performance of the model; that's not true. It gives you the mean (or expected) performance of the learning process B (with hyperparameters h). That means that if you train a new model using B (and hyperparameters h), its expected performance will be around the calculated metrics (of course, this is not always true).
For your questions
Yes, you should train the model from scratch in each fold, if possible with the same initial parameters (if initialization is not random), to avoid any difference between folds. Using a warm start with the previous parameters would modify the learning process and the fitting.
No: if initialization is random, let it be random; if it is fixed, use the same initial parameters for all folds.
For the two previous questions, if by initial parameters you meant hyperparameters, then you should keep the same ones for all folds, otherwise the calculated risk will be useless. If you want to try multiple sets of hyperparameters, you have to repeat the cross-validation for each of them, and then you can select the best ones based on the calculated risk.
Once you have tuned your hyperparameters, you can train the model on your whole training set. This will give you a model m. Before your cross-validation you can keep a small test split to evaluate this final model on unseen data.
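A rough sketch of that workflow with scikit-learn (the estimator, the hyperparameter grid, and the synthetic data are placeholders chosen for illustration, not taken from the answer):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)    # placeholder data

# Keep a small test split aside BEFORE any cross-validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# k-fold CV estimates the risk of the learning process B for each hyperparameter setting h
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)            # refit=True (the default) retrains the best setting on all of X_train

final_model = search.best_estimator_    # the model m trained on the whole training set
print("estimated error of B with best h (CV):", 1 - search.best_score_)
print("error of the final model on the held-out test set:", 1 - final_model.score(X_test, y_test))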

Disadvantages of train-test split

"Train/test split does have its dangers — what if the split we make isn’t random? What if one subset of our data has only people from a certain state, employees with a certain income level but not other income levels, only women or only people at a certain age? (imagine a file ordered by one of these). This will result in overfitting, even though we’re trying to avoid it! This is where cross validation comes in." The above is most of the blogs mentioned about which I don't understand that. I think the disadvantages is not overfitting but underfitting. When we split the data , assume State A and B become the training dataset and try to predict the State C which is completely different than the training data that will lead to underfitting. Can someone fill me in why most of the blogs state 'test-split' lead to overfitting.
It would be more correct to talk about selection bias, which is what your question describes.
Selection bias is not really tied to overfitting, but to fitting a biased set, so the model will be unable to generalize/predict correctly.
In other words, whether you "fit" or "overfit" a biased train set, the result is still wrong.
The semantic pull of the "over" prefix is just that: it implies bias, which is why the two ideas get conflated.
Imagine you have no selection bias. In that case, when you overfit even a healthy set, by definition of overfitting you will still make the model biased towards your train set.
Here, your starting training set is already biased, so any fitting, even "correct fitting", will be biased, just as happens with overfitting.
In fact, a train/test split does have some randomness. See below with scikit-learn's train_test_split:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
Here, to build some initial intuition, you can change the random_state value to different integers and train the model multiple times to see whether you get comparable test accuracies in each run, as in the sketch below. If the dataset is small (on the order of hundreds of rows), the test accuracies may differ significantly. But when you have a larger dataset (on the order of tens of thousands of rows), the test accuracies become more or less similar, as the train set will include at least some examples from all parts of the data.
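For example, a quick sketch of that experiment (the dataset and classifier are placeholders; any small dataset will show the effect):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)      # placeholder dataset, ~570 rows

scores = []
for seed in range(10):
    # A different random_state gives a different train/test partition each run
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))
print("test accuracy ranges from", np.min(scores), "to", np.max(scores))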
Of course, cross-validation is performed to minimize the effect of overfitting and to make the results more generalizable. But with very large datasets, it can be really expensive to do cross-validation.
The "train_test_split" function will not necessarily be biased if you do it only once on a data set. What I mean is that by selecting a value for "random_state" feature of the function, you can make different groups of train and test data sets.
Imagine you have a data set, and after applying the train_test_split and training your model, you get low accuracy score on your test data.
If you alter the random_state value and retrain your model, you will get a different accuracy score on your data set.
Consequently, you can essentially be tempted to find the best value for random_state feature to train your model in a way that will have best accuracy. Well, guess what?, you have just introduced bias to your model. So you have found a train set which could train your model in such way that would work the best on the test set.
However, when we use something such as KFold cross Validation, we break down the data set into five or ten (depending on size) groups of train and test data set. Every time we train the model, we can see a different score. The average of all the scores will probably be something more realistic for the model, when trained on the whole data set. It would look like something like this:
import numpy as np
from sklearn import metrics
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# X and y are assumed to be a pandas DataFrame/Series, as in the original snippet
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
R_2 = []
for train_index, test_index in kfold.split(X):
    # KFold yields positional indices, so use .iloc rather than .loc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model = LinearRegression().fit(X_train, y_train)
    r2 = metrics.r2_score(y_test, model.predict(X_test))
    R_2.append(r2)
R_2mean = np.mean(R_2)

validation loss and validation data of multi-output model in Keras

I want to train a model with one input and two outputs in Keras, but I have a couple of issues with the setup of validations.
1) The Keras functional API documentation says that model.fit can take a list of numpy arrays as the output when there are multiple outputs. However, for the validation_data argument to model.fit, it says that the model can take a tuple of the form (x_val, y_val) or (x_val, y_val, val_sample_weights). Then how can I pass in the y_val of my second output? Would I be able to do that using validation_split, or would validation_split also only be applied to one of my outputs?
2) Also, what is the validation loss passed to the EarlyStopping callback going to be? Functions like model.evaluate return two loss values, one per output. For training, the sum of the losses times their weights is minimized. How does this work with EarlyStopping? I want the early stopping to also be based on the minimization of the weighted sum of the losses, but I don't know if this is what will actually happen.
It's specified that both y_train and y_val may be lists of numpy arrays. From my experience, validation_split should work fine as well.
The final loss is the (weighted) sum of all the model's losses, and it is this value that is used to check the EarlyStopping criterion.
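A minimal sketch of both points, assuming a functional model with two Dense output heads (the layer names, shapes, and losses are illustrative, not taken from the question):

import numpy as np
from tensorflow import keras

inputs = keras.Input(shape=(20,))
h = keras.layers.Dense(32, activation="relu")(inputs)
out_a = keras.layers.Dense(1, name="out_a")(h)
out_b = keras.layers.Dense(1, name="out_b")(h)
model = keras.Model(inputs, [out_a, out_b])
model.compile(optimizer="adam", loss=["mse", "mse"], loss_weights=[1.0, 0.5])

x = np.random.rand(200, 20)
y1, y2 = np.random.rand(200, 1), np.random.rand(200, 1)
x_val = np.random.rand(50, 20)
y1_val, y2_val = np.random.rand(50, 1), np.random.rand(50, 1)

# y_val is passed as a list, mirroring the list of outputs; "val_loss" is the
# weighted sum of the per-output validation losses, so EarlyStopping monitors that sum
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
model.fit(x, [y1, y2],
          validation_data=(x_val, [y1_val, y2_val]),
          epochs=50, callbacks=[early_stop], verbose=0)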

How to correctly combine my classifiers?

I have to solve a two-class classification problem.
I have 2 classifiers that output probabilities. Both of them are neural networks with different architectures.
Those 2 classifiers are trained and saved into 2 files.
Now I want to build a meta-classifier that will take those probabilities as input and learn the weights of those 2 classifiers.
So it will automatically decide how much I should "trust" each of my classifiers.
This model is described here:
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier
I plan to use the mlxtend library, but it seems that StackingClassifier refits the models.
I do not want to refit because it takes a huge amount of time.
On the other hand, I understand that refitting is necessary to "coordinate" the work of each classifier and "tune" the whole system.
What should I do in such situation?
I won't talk about mlxtend because I haven't worked with it, but I'll tell you the general idea.
You don't have to refit these models on the whole training set, but you do have to refit them on parts of it so you can create out-of-fold predictions.
Specifically, split your training data in a few pieces (usually 3 to 10). Keep one piece (i.e. fold) as validation data and train both models on the other folds. Then, predict the probabilities for the validation data using both models. Repeat the procedure treating each fold as a validation set. In the end, you should have the probabilities for all data points in the training set.
Then, you can train a meta-classifier using these probabilities and the ground truth labels. You can use the trained meta-classifier on your new data.
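Here is a sketch of that procedure with scikit-learn; since the two trained networks are not shown, I stand in two simple classifiers for them, and the data is synthetic:

import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, random_state=0)   # placeholder data

# Stand-ins for the two neural networks; replace with your own models
base_models = [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)]

oof = np.zeros((len(X), len(base_models)))    # out-of-fold probability of class 1
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    for j, base in enumerate(base_models):
        m = clone(base).fit(X[train_idx], y[train_idx])       # refit on k-1 folds only
        oof[val_idx, j] = m.predict_proba(X[val_idx])[:, 1]   # predict the held-out fold

# Meta-classifier learns how much to trust each base model
meta = LogisticRegression().fit(oof, y)

# At prediction time, stack the base models' probabilities on new data and apply meta, e.g.:
# new_features = np.column_stack([m.predict_proba(X_new)[:, 1] for m in fitted_base_models])
# meta.predict(new_features)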

Pre-randomization before random forest training in Scikit-learn

I am getting a surprisingly significant performance boost (+10% cross-validation accuracy) with sklearn.ensemble.RandomForestClassifier just by pre-randomizing the training set.
This is very puzzling to me, since
(a) RandomForestClassifier supposedly randomizes the training data anyway; and
(b) why would the order of the examples matter so much anyway?
Any words of wisdom?
I had the same issue and posted a question, which luckily got resolved.
In my case it was because the data were stored in order and I was using k-fold cross-validation without shuffling when doing the train-test split. This means each model is trained only on a chunk of adjacent samples with a certain pattern.
An extreme example: if you have 50 rows of samples all of class A, followed by 50 rows of samples all of class B, and you do a train-test split right in the middle, the model is trained only on samples of class A but tested only on samples of class B, so the test accuracy will be 0.
In scikit-learn, train_test_split shuffles by default, while the KFold class doesn't. So you should do one of the following, depending on your context (a short sketch follows the list):
Shuffle the data first
Use train_test_split with shuffle=True (again, this is the default)
Use KFold and remember to set shuffle=True
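For instance, a short sketch of the last two options on deliberately ordered placeholder data:

import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Placeholder ordered data: all class 0 first, then all class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

# train_test_split shuffles by default (shuffle=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# KFold does NOT shuffle by default; set shuffle=True for ordered data
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kfold.split(X):
    print("classes in this test fold:", np.unique(y[test_idx]))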
Ordering of the examples should not affect RF performance at all. Note that RF performance can vary by 1-2% across runs anyway. Are you keeping the cross-validation set separate before training? (Just to make sure this is not because the cross-validation set is different every time.) Also, by randomizing I assume you mean changing the order of the examples.
You can also check the out-of-bag (OOB) accuracy of the classifier in both cases on the training set itself; you don't need a separate cross-validation set for RF.
During the training of a random forest, the data for training each individual tree is obtained by sampling with replacement from the training data, so each training sample is left out of roughly 1/3 of the trees. We can use the votes of those trees to compute an out-of-bag prediction for that sample. Thus, with OOB accuracy you only need a training set, not validation or test data, to estimate performance on unseen data. See the out-of-bag error description at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm for further study.
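For example, a minimal sketch of reading the OOB accuracy with scikit-learn (the dataset is a placeholder):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset

# oob_score=True evaluates each sample using only the trees whose bootstrap sample excluded it
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy:", rf.oob_score_)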
