Difference between doing cross-validation and validation_data/validation_split in Keras - machine-learning

First, I split the dataset into train and test, for example:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=999)
I then use GridSearchCV with cross-validation to find the best performing model:
validator = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="accuracy", cv=cv)
And by doing this, I have:
A model is trained using k-1 of the folds as training data; the resulting
model is validated on the remaining part of the data (scikit-learn.org)
But then, when reading about the Keras fit function, the documentation introduces 2 more terms:
validation_split: Float between 0 and 1. Fraction of the training data
to be used as validation data. The model will set apart this fraction
of the training data, will not train on it, and will evaluate the loss
and any model metrics on this data at the end of each epoch. The
validation data is selected from the last samples in the x and y data
provided, before shuffling.
validation_data: tuple (x_val, y_val) or tuple (x_val, y_val,
val_sample_weights) on which to evaluate the loss and any model
metrics at the end of each epoch. The model will not be trained on
this data. validation_data will override validation_split.
From what I understand, validation_split (to be overridden by validation_data) will be used as an unchanged validation dataset, meanwhile hold-out set in cross-validation changes during each cross-validation step.
First question: is it necessary to use validation_split or validation_data since I already do cross validation?
Second question: if it is not necessary, then should I set validation_split and validation_data to 0 and None, respectively?
grid_result = validator.fit(train_images, train_labels, validation_data=None, validation_split=0)
Question 3: If I do so, what will happen during the training, would Keras just simply ignore the validation step?
Question 4: Does the validation_split belong to k-1 folds or the hold-out fold, or will it be considered as "test set" (like in the case of cross validation) which will never be used to train the model.

Validation is performed to ensure that the model is not overfitting on the dataset and that it will generalize to new data. Since the parameter grid search already performs validation, there is no need for the Keras model itself to perform a validation step during training. Therefore, to answer your questions:
is it necessary to use validation_split or validation_data since I already do cross validation?
No, as I mentioned above.
if it is not necessary, then should I set validation_split and validation_data to 0 and None, respectively?
No, since by default no validation is done in Keras (i.e. by default we have validation_split=0.0, validation_data=None in the fit() method).
If I do so, what will happen during the training, would Keras just simply ignore the validation step?
Yes, Keras won't perform the validation when training the model. However note that, as I mentioned above, the grid search procedure would perform validation to better estimate the performance of the model with a specific set of parameters.
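To make the difference between the two kinds of hold-out data concrete, here is a minimal pure-Python sketch. It uses neither Keras nor scikit-learn; kfold_indices and tail_split are hypothetical helpers written only for this illustration, and the folds are assumed to be contiguous and unshuffled. It shows that the hold-out fold rotates from one cross-validation step to the next, while a validation_split-style tail is always the last fraction of whatever data is handed to fit:

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, holdout_idx) for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        holdout = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, holdout
        start += size


def tail_split(indices, validation_split):
    """Hold out the last fraction of the given data, before any shuffling."""
    split_at = int(len(indices) * (1 - validation_split))
    return indices[:split_at], indices[split_at:]


folds = list(kfold_indices(n_samples=10, n_folds=5))
for step, (train, holdout) in enumerate(folds, start=1):
    # If validation_split were also set, it would be taken from `train`,
    # i.e. from the k-1 folds passed to fit(), never from the hold-out fold.
    fit_part, val_part = tail_split(train, validation_split=0.25)
    print(step, "hold-out:", holdout, "tail validation:", val_part)
```

Running this prints a different hold-out fold at every step, which is the behaviour the grid search relies on for its estimate.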

Related

Train Test Valid data sets... General question about fitting the models

So I was given Xtrain, ytrain, Xtest, ytest, Xvalid, yvalid data for a HW assignment. This assignment is for a Random Forest but I think my question can apply to any/most models.
So my understanding is that you use Xtrain and ytrain to fit the model such as (clf.fit(Xtrain, ytrain)) and this creates the model which can provide you a score and predictions for your training data
So when I move on to Test and Valid data sets, I only use ytest and yvalid to see how they predict and score. My professor provided us with three X dataset (Xtrain, Xtest, Xvalid), but to me I only need the Xtrain to train the model initially and then test the model on the different y data sets.
If I did .fit() for each pair of X, y, I would create/fit three different models from completely different data, so the models are not comparable from my perspective.
Am I wrong?
Training step :
Assuming you are using sklearn, the clf.fit(Xtrain, ytrain) method trains your model (clf) to best fit the training data Xtrain and labels ytrain. At this stage, you can compute a score to evaluate your model on the training data, as you said.
#train step
clf = your_classifier
clf.fit(Xtrain, ytrain)
Test step :
Then, you have to feed the test data Xtest to the previously trained model in order to generate new labels ypred.
#test step
ypred = clf.predict(Xtest)
Finally, you have to compare these generated labels ypred with the true labels ytest to provide a robust evaluation of the model performance on unknown data (data not used during training) with tools like confusion matrix, metrics...
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
test_cm = confusion_matrix(ytest,ypred)
test_report = classification_report(ytest,ypred)
test_accuracy = accuracy_score(ytest, ypred)
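To tie this back to the question, here is a small self-contained sketch of the same workflow applied to both held-out sets. NearestCentroid1D is a toy classifier invented for this example as a stand-in for clf (it is not a scikit-learn class), and the data is made up. The point is only that the model is fitted once on the training pair, and that both Xtest and Xvalid are needed as inputs to predict, with ytest and yvalid serving as the reference labels:

```python
class NearestCentroid1D:
    """Toy stand-in for `clf`: predicts the class whose training mean is closest."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for x, label in zip(X, y):
            sums[label] = sums.get(label, 0.0) + x
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {label: sums[label] / counts[label] for label in sums}
        return self

    def predict(self, X):
        return [min(self.centroids_, key=lambda c: abs(x - self.centroids_[c]))
                for x in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up data, for illustration only.
Xtrain, ytrain = [0.1, 0.3, 0.2, 2.1, 1.9, 2.3], [0, 0, 0, 1, 1, 1]
Xtest, ytest = [0.0, 0.4, 2.0, 1.8], [0, 0, 1, 1]
Xvalid, yvalid = [0.25, 2.2, 1.2], [0, 1, 1]

clf = NearestCentroid1D().fit(Xtrain, ytrain)  # fit once, on training data only

# The same fitted model is scored on each held-out set:
# the X data goes into predict(), the y data is what the output is compared to.
test_accuracy = accuracy(ytest, clf.predict(Xtest))
valid_accuracy = accuracy(yvalid, clf.predict(Xvalid))
print(test_accuracy, valid_accuracy)
```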

How to check the predicted output during fitting of the model in Keras?

I am new to Keras and I have learned how to fit and evaluate a model.
After evaluating the model, one can see the actual predictions made by the model.
I am wondering: is it also possible to see the predictions during fitting in Keras? So far I can't find any code doing this.
Since this question doesn't specify "epochs", and since using callbacks may represent extra computation, I don't think it's exactly a duplication.
With tensorflow, you can use a custom training loop with eager execution turned on. A simple tutorial for creating a custom training loop: https://www.tensorflow.org/tutorials/eager/custom_training_walkthrough
Basically you will:
import tensorflow as tf

# transform your data into a Dataset:
dataset = tf.data.Dataset.from_tensor_slices(
    (x_train, y_train)).shuffle(some_buffer).batch(batchSize)
# the above is buggy in some versions regarding shuffling, you may need to shuffle
# again between each epoch
# create an optimizer
optimizer = tf.keras.optimizers.Adam()
# create an epoch loop:
for e in range(epochs):
    # create a batch loop
    for i, (x, y_true) in enumerate(dataset):
        # create a tape to record actions
        with tf.GradientTape() as tape:
            # take the model's predictions
            y_pred = model(x)
            # calculate loss
            loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
        # calculate gradients
        gradients = tape.gradient(loss, model.trainable_weights)
        # apply gradients
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))
You can use the y_pred variable for anything you like, including getting its NumPy value with numpy_pred = y_pred.numpy().
The tutorial gives some more details about metrics and validation loop.

What does initial_epoch in Keras mean?

I'm a little bit confused about initial_epoch value in fit and fit_generator methods. Here is the doc:
initial_epoch: Integer. Epoch at which to start training (useful for resuming a previous training run).
I understand, it is not useful if you start training from scratch. It is useful if you trained your dataset and want to improve accuracy or other values (correct me if I'm wrong). But I'm not sure what it really does.
So after all this, I have 2 questions:
What does initial_epoch do and what is it for?
When can I use initial_epoch?
When I change my dataset?
When I change the learning rate, optimizer or loss function?
Both of them?
Since some optimizers set some of their internal values (e.g. the learning rate) from the current epoch value, and you may also have (custom) callbacks that depend on the current epoch, the initial_epoch argument lets you specify the epoch value to start from when training.
As stated in the documentation, this is mostly useful when you have trained your model for some epochs, say 10, saved it, and now want to load it and resume training for another 10 epochs without disrupting the state of epoch-dependent objects (e.g. the optimizer). You would then set initial_epoch=10 (i.e. the model has already been trained for 10 epochs) and epochs=20 (not 10, since the total number of epochs to reach is 20), and everything resumes as if you had trained the model for 20 epochs in one single training session.
However, note that when using the built-in optimizers of Keras you don't need to use initial_epoch, since they store and update their state internally (without considering the value of the current epoch), and the state of the optimizer is also stored when a model is saved.
The answer above is correct; however, it is important to note that if you have trained for 10 epochs and set initial_epoch=10 and epochs=20, you train for 10 more epochs until you reach a total of 20 epochs. For example, I trained for 2 epochs, then set initial_epoch=2 and epochs=4. The result is that it trains for 4-2=2 more epochs. The new data in the history object starts at epoch 3, so the returned history object does not start from epoch 1 as you might expect. In other words, the state of the history object is not preserved from the initial training epochs. If you do not set initial_epoch, train for 2 epochs, and then rerun fit_generator with epochs=4, it will train for 4 more epochs starting from the state preserved at the end of the second epoch (provided you use the built-in optimizers). Again, the history object state is NOT preserved from the initial training and only contains the data for the last 4 epochs. I noticed this because I plot the validation loss versus epochs.
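The epoch arithmetic described above can be checked with a few lines of plain Python. This is only a sketch, not Keras code: epochs_to_run is a hypothetical helper, and it assumes the training loop iterates over range(initial_epoch, epochs), so that epochs is the index of the final epoch rather than a count of additional epochs:

```python
def epochs_to_run(epochs, initial_epoch=0):
    """Return the 1-based epoch numbers that a single fit call would run.

    Assumes the loop runs over range(initial_epoch, epochs).
    """
    return [epoch + 1 for epoch in range(initial_epoch, epochs)]


first_run = epochs_to_run(epochs=2)                  # fresh training, 2 epochs
resumed = epochs_to_run(epochs=4, initial_epoch=2)   # resume after those 2 epochs
rerun_without = epochs_to_run(epochs=4)              # initial_epoch left at 0

print(first_run, resumed, rerun_without)
```

With initial_epoch=2 and epochs=4 only epochs 3 and 4 are run, whereas leaving initial_epoch at its default runs four full epochs again.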
Here is an example of how to integrate the initial_epoch in your code
#Training first 4 epochs and saving
model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=32, epochs=4)
model.save("partial.h5")
#loading the model, training another 4 Epochs and then saving the updated model.
from keras.models import load_model
new_model = load_model('partial.h5')
new_model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=32, initial_epoch=4,epochs=8)
new_model.save("updated.h5")
Also, don't forget to specify a particular random_state value when splitting the data into train and test sets, so that the same training data is used each time you restart the training process and no test data leaks into the training data.

How to extract train and validation sets in Keras?

I implement a neural net in keras, with the following structure:
model = Sequential([... layers ...])
model.compile(optimizer=..., loss=...)
hist=model.fit(x=X,y=Y, validation_split=0.1, epochs=100)
Is there a way to extract from either model or hist the train and validation sets? That is, I want to know which indices in X and Y were used for training and which were used for validation.
Keras splits the dataset at
split_at = int(x[0].shape[0] * (1-validation_split))
into the train and validation part. So if you have n samples, the first int(n*(1-validation_split)) samples are the training set and the remainder is the validation set.
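Based on that formula, the indices can be recovered without touching model or hist. The helper below is a hypothetical sketch, not part of Keras, and it assumes the split is taken from the end of the arrays exactly as passed to fit, before any shuffling:

```python
def validation_split_indices(n_samples, validation_split):
    """Return (train_idx, val_idx) for a split taken from the tail of the data."""
    split_at = int(n_samples * (1 - validation_split))
    train_idx = list(range(0, split_at))
    val_idx = list(range(split_at, n_samples))
    return train_idx, val_idx


train_idx, val_idx = validation_split_indices(n_samples=10, validation_split=0.2)
print("train:", train_idx)
print("validation:", val_idx)
```

For the call in the question you would pass len(X) and 0.1 instead of the made-up values used here.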
If you want to have more control, you can split the dataset yourself and pass the validation dataset with the parameter validation_data:
model.fit(train_x, train_y, …, validation_data=(validation_x, validation_y))
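One way to do such a manual split while keeping track of the indices is sketched below in plain Python. shuffled_split is a hypothetical helper, the data is made up, and the seed is fixed only so that the split is reproducible:

```python
import random


def shuffled_split(n_samples, validation_fraction, seed=0):
    """Shuffle the sample indices and hold out a fraction of them for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * validation_fraction)
    return sorted(indices[n_val:]), sorted(indices[:n_val])


# Made-up data, for illustration only.
X = [[float(i)] for i in range(10)]
Y = [i % 2 for i in range(10)]

train_idx, val_idx = shuffled_split(n_samples=len(X), validation_fraction=0.2)

train_x = [X[i] for i in train_idx]
train_y = [Y[i] for i in train_idx]
validation_x = [X[i] for i in val_idx]
validation_y = [Y[i] for i in val_idx]

# train_idx and val_idx now record exactly which rows went where; the four
# lists above are what would be passed to fit() and validation_data.
print(train_idx, val_idx)
```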

Cross Validation in Keras

I'm implementing a Multilayer Perceptron in Keras and using scikit-learn to perform cross-validation. For this, I was inspired by the code found in the issue Cross Validation in Keras
from sklearn.model_selection import StratifiedKFold

def load_data():
    # load your data using this function
    pass

def create_model():
    # create your model using this function
    pass

def train_evaluate(model, x_train, y_train, x_test, y_test):
    # fit and evaluate here.
    pass

if __name__ == "__main__":
    X, Y = load_data()
    kFold = StratifiedKFold(n_splits=10)
    for train, test in kFold.split(X, Y):
        model = None
        model = create_model()
        train_evaluate(model, X[train], Y[train], X[test], Y[test])
In my studies on neural networks, I learned that the knowledge representation of a neural network is in its synaptic weights, and that during the training process the weights are updated to reduce the network's error rate and improve its performance. (In my case, I'm using supervised learning.)
For better training and assessment of neural network performance, a commonly used method is cross-validation, which returns partitions of the data set for training and evaluation of the model.
My doubt is...
In this code snippet:
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
We define, train and evaluate a new neural net for each of the generated partitions?
If my goal is to fine-tune the network for the entire dataset, why is it not correct to define a single neural network and train it with the generated partitions?
That is, why is this piece of code like this?
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
and not so?
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
Is my understanding of how the code works wrong? Or my theory?
If my goal is to fine-tune the network for the entire dataset
It is not clear what you mean by "fine-tune", or even what exactly is your purpose for performing cross-validation (CV); in general, CV serves one of the following purposes:
Model selection (choose the values of hyperparameters)
Model assessment
Since you don't define any search grid for hyperparameter selection in your code, it would seem that you are using CV in order to get the expected performance of your model (error, accuracy etc).
Anyway, for whatever reason you are using CV, the first snippet is the correct one; your second snippet
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
will train your model sequentially over the different partitions (i.e. train on partition #1, then continue training on partition #2 etc), which essentially is just training on your whole data set, and it is certainly not cross-validation...
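A quick way to see why the second snippet is not cross-validation is to count, for each fold, how many of its test samples the model has already been trained on. The sketch below is plain Python with hypothetical helpers (kfold_indices, leaked_per_fold); it tracks only index sets, not an actual network, and it assumes contiguous folds with n_samples divisible by n_folds:

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train, test) index sets for contiguous folds of equal size."""
    fold_size = n_samples // n_folds
    everything = set(range(n_folds * fold_size))
    for k in range(n_folds):
        test = set(range(k * fold_size, (k + 1) * fold_size))
        yield everything - test, test


def leaked_per_fold(n_samples, n_folds, fresh_model_each_fold):
    """Count the test samples in each fold that were used for training earlier."""
    seen = set()  # samples the current model has been trained on so far
    leaked = []
    for train, test in kfold_indices(n_samples, n_folds):
        if fresh_model_each_fold:
            seen = set()  # a new model has not seen anything yet
        leaked.append(len(test & seen))
        seen |= train
    return leaked


first_snippet = leaked_per_fold(10, 5, fresh_model_each_fold=True)
second_snippet = leaked_per_fold(10, 5, fresh_model_each_fold=False)
print(first_snippet, second_snippet)
```

With a fresh model per fold the count stays at zero, while with a single shared model every fold after the first is evaluated entirely on samples the model was trained on in an earlier fold.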
That said, a final step after the CV which is often only implied (and frequently missed by beginners) is that, once you are satisfied with your chosen hyperparameters and/or the model performance given by your CV procedure, you go back and retrain your model, this time with the entire available data.
You can use wrappers of the Scikit-Learn API with Keras models.
Given inputs x and y, here's an example of repeated 5-fold cross-validation:
from sklearn.model_selection import RepeatedKFold, cross_val_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# note: this wrapper module is deprecated and missing from recent TensorFlow releases
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor

def buildmodel():
    model = Sequential([
        Dense(10, activation="relu"),
        Dense(5, activation="relu"),
        Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mse'])
    return model

estimator = KerasRegressor(build_fn=buildmodel, epochs=100, batch_size=10, verbose=0)
kfold = RepeatedKFold(n_splits=5, n_repeats=100)
results = cross_val_score(estimator, x, y, cv=kfold, n_jobs=2)  # 2 cpus
results.mean()  # mean score over all folds; this wrapper scores with the negated loss, i.e. -MSE
I think many of your questions will be answered if you read about nested cross-validation. This is a good way to "fine tune" the hyper parameters of your model. There's a thread here:
https://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection
The biggest issue to be aware of is "peeking" or circular logic. Essentially, you want to make sure that none of the data used to assess model accuracy is seen during training.
One example where this might be problematic is if you are running something like PCA or ICA for feature extraction. If doing something like this, you must be sure to run PCA on your training set, and then apply the transformation matrix from the training set to the test set.
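The same fit-on-train, apply-to-test pattern can be shown with something much simpler than PCA. The sketch below uses plain mean-centering as a stand-in for PCA (it is not PCA), with hypothetical helper names and made-up numbers. The point is that the statistics are computed from the training rows only and then reused unchanged on the test rows:

```python
def fit_centering(rows):
    """Learn the per-column means from the training rows only."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def apply_centering(rows, means):
    """Subtract previously learned means; nothing is re-estimated here."""
    return [[value - m for value, m in zip(row, means)] for row in rows]


# Made-up data, for illustration only.
X_train = [[1.0, 10.0], [3.0, 14.0]]
X_test = [[2.0, 11.0]]

means = fit_centering(X_train)                # uses the training set only
X_train_c = apply_centering(X_train, means)
X_test_c = apply_centering(X_test, means)     # test set reuses the training means
print(means, X_test_c)
```

Inside cross-validation this fitting step would have to be repeated within every fold, on that fold's training part.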
The main idea of testing your model performance is to perform the following steps:
Train a model on a training set.
Evaluate your model on data not used during the training process, in order to simulate the arrival of new data.
So basically, the data you finally test your model on should mimic the first portion of data you'll get from your client/application to apply your model to.
That's why cross-validation is so powerful: it lets every data point in your whole dataset be used as a simulation of new data.
And now - to answer your question - every cross-validation should follow the following pattern:
for train, test in kFold.split(X, Y):
    model = training_procedure(train, ...)
    score = evaluation_procedure(model, test, ...)
because, after all, you'll first train your model and then use it on new data. In your second approach you cannot treat it as mimicking a training process because, for example, in the second fold your model would retain information from the first fold, which is not equivalent to your training procedure.
Of course, you could apply a training procedure which uses 10 folds of consecutive training in order to fine-tune the network. But this is not cross-validation then; you would still need to evaluate this procedure using a scheme like the one above.
The commented out functions make this a little less obvious, but the idea is to keep track of your model performance as you iterate through your folds and at the end provide either those lower level performance metrics or an averaged global performance. For example:
The train_evaluate function ideally would output some accuracy score for each split, which could be combined at the end.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def train_evaluate(model, x_train, y_train, x_test, y_test):
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)

X, Y = load_data()
kFold = StratifiedKFold(n_splits=10)
scores = np.zeros(10)
idx = 0
for train, test in kFold.split(X, Y):
    model = create_model()  # a fresh model for every fold
    scores[idx] = train_evaluate(model, X[train], Y[train], X[test], Y[test])
    idx += 1
print(scores)
print(scores.mean())
So yes you do want to create a new model for each fold as the purpose of this exercise is to determine how your model as it is designed performs on all segments of the data, not just one particular segment that may or may not allow the model to perform well.
This type of approach becomes particularly powerful when applied along with a grid search over hyperparameters. In this approach you train a model with varying hyperparameters using the cross validation splits and keep track of the performance on splits and overall. In the end you will be able to get a much better idea of which hyperparameters allow the model to perform best. For a much more in depth explanation see sklearn Model Selection and pay particular attention to the sections of Cross Validation and Grid Search.
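As a rough illustration of combining a grid search with cross-validation, here is a self-contained sketch in plain Python. It does not use scikit-learn's GridSearchCV: the model is a toy 1-D k-nearest-neighbours regressor, all helper names are hypothetical, and the data is made up. For each candidate value of k the error is averaged over the folds, and the value with the lowest average is kept:

```python
def knn_predict(x_train, y_train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))[:k]
    return sum(y_train[i] for i in nearest) / k


def kfold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; sample i goes to test fold i % n_folds."""
    for fold in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == fold]
        train = [i for i in range(n_samples) if i % n_folds != fold]
        yield train, test


def cv_mse(x, y, k, n_folds):
    """Mean squared error of k-NN, averaged over the cross-validation folds."""
    fold_errors = []
    for train, test in kfold_indices(len(x), n_folds):
        x_tr = [x[i] for i in train]
        y_tr = [y[i] for i in train]
        errors = [(knn_predict(x_tr, y_tr, x[i], k) - y[i]) ** 2 for i in test]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


# Made-up data: a line with an alternating +1/-1 disturbance.
x = [float(i) for i in range(12)]
y = [2.0 * v + (1.0 if i % 2 else -1.0) for i, v in enumerate(x)]

param_grid = [1, 2, 4]  # candidate values for the hyperparameter k
cv_results = {k: cv_mse(x, y, k, n_folds=3) for k in param_grid}
best_k = min(cv_results, key=cv_results.get)
print(cv_results, best_k)
```

After picking best_k this way, the model would normally be refitted on all available data with that setting.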
