Cross Validation on Train, Validation & Test Set - machine-learning

In the scenario of having three sets
A train set of e.g. 80% (for model training)
A validation set of e.g. 10% (for model training)
A test set of e.g. 10% (for final model testing)
let's say I perform k-fold cross validation (CV) on the example dataset of [1,2,3,4,5,6,7,8,9,10]. Let's also say
10 is the test set in this example
the remaining [1,2,3,4,5,6,7,8,9] will be used for training and validation
leave-one-out CV would than look something like this
# Fold 1
[2, 3, 4, 5, 6, 7, 8, 9] # train
[1] # validation
# Fold 2
[1, 3, 4, 5, 6, 7, 8, 9] # train
[2] # validation
# Fold 3
[1, 2, 4, 5, 6, 7, 8, 9] # train
[3] # validation
# Fold 4
[1, 2, 3, 5, 6, 7, 8, 9] # train
[4] # validation
# Fold 5
[1, 2, 3, 4, 6, 7, 8, 9] # train
[5] # validation
# Fold 6
[1, 2, 3, 4, 5, 7, 8, 9] # train
[6] # validation
# Fold 7
[1, 2, 3, 4, 5, 6, 8, 9] # train
[7] # validation
# Fold 8
[1, 2, 3, 4, 5, 6, 7, 9] # train
[8] # validation
# Fold 9
[1, 2, 3, 4, 5, 6, 7, 8] # train
[9] # validation
Great, now the model has been built and validation using each data point of the combined train and validation set once.
Next, I would test my model on the test set (10) and get some performance.
What I was wondering now is why we not also perform CV using the test set and average the result to see the impact of different test sets? Meaning why we don't do the above process 10 times such that we have each data point also in the test set?
It would be obviously computationally extremely expensive but I was thinking about that cause it seemed difficult to choose an appropriate test set. For example, it could be that my model from above would have performed much differently when I would have chosen 1 as the test set and trained and validated on the remaining points.
I wondered about this in scenarios where I have groups in my data. For example
[1,2,3,4] comes from group A,
[5,6,7,8] comes from group B and
[9,10] comes from group C.
In this case when choosing 10 as the test set, it could perform much differently than choosing 1 right, or am I missing something here?

All your train-validation-test splits should be randomly sampled and sufficiently big. Hence if your data comes from different groups you should have roughly the same distribution of groups across train, validation and test pools. If your test performance varies based on the sampling seed you're definitely doing something wrong.
As to why not use test set for cross-validation, this would result in overfitting. Usually you would run your cross-validation many times with different hyperparameters and use cv score to select best models. If you don't have a separate test set to evaluate your model at the end of model selection you would never know if you overfitted to the training pool during model selection iterations.

Related

Dask Tutorial: Explain unexpected results from tutorial

The tutorial page requested that we ask questions here.
On tutorial 01_dask.delayed, there is the following code:
Parallelizing Increment
Prep
from time import sleep
def inc(x):
sleep(1)
return x + 1
def add(x, y):
sleep(1)
return x + y
data = [1, 2, 3, 4, 5, 6, 7, 8]
Calc
results = []
for x in data:
y = delayed(inc)(x)
results.append(y)
total = delayed(sum)(results)
print("Before computing:", total) # Let's see what type of thing total is
result = total.compute()
print("After computing :", result) # After it's computed
This code takes 1 second. This makes sense; each of the 8 inc calculations takes 1 second, the rest are ~ instantaneous, and it can all be run fully in parallel.
Parallelizing Increment and Double
Prep
def double(x):
sleep(1)
return 2 * x
def is_even(x):
return not x % 2
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Calc
results = []
for x in data:
if is_even(x): # even
y = delayed(double)(x)
else: # odd
y = delayed(inc)(x)
results.append(y)
#total = delayed(sum)(results)
total = sum(results)
This takes 2 seconds, which seems strange to me. The situation is the same as above; there are 10 actions that each take 1 second each, and can again be run fully in parallel.
The only thing I can imagine is that my machine is only able to allow for 8 tasks in parallel, but this is tough to know for sure because I have an Intel Core i7 and it seems that some have 8 threads and some have 16. (I have a MacBook Pro, and Apple notoriously likes to hide this detailed information from us pleebs.)
Can anyone confirm if this is what is going on? I am nearly certain, because bumping the data object for the first portion from data = [1, 2, 3, 4, 5, 6, 7, 8] to data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] also bumps the time up to 2 seconds.
I believe your analysis is correct, and you have 8 threads running in parallel for 1s each, before moving on to the remaining data, which do not fill all the threads, but still take 1s to complete.
You may want to try with the distributed scheduler, which provides dashboards for more feedback on what is going on (see later in the tutorial).

Converting keras code to pytorch code with Conv1D layer

I have some keras code that I need to convert to Pytorch. I've done some research but so far I am not able to reproduce the results I got from keras. I have spent many hours on this any tips or help is very appreciated.
Here is the keras code I am dealing with. The input shape is (None, 105, 768) where None is the batch size and I want to apply Conv1D to the input. The desire output in keras is (None, 105)
x = tf.keras.layers.Dropout(0.2)(input)
x = tf.keras.layers.Conv1D(1,1)(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Activation('softmax')(x)
What I've tried, but worse in term of results:
self.conv1d = nn.Conv1d(768, 1, 1)
self.dropout = nn.Dropout(0.2)
self.softmax = nn.Softmax()
def forward(self, input):
x = self.dropout(input)
x = x.view(x.shape[0],x.shape[2],x.shape[1])
x = self.conv1d(x)
x = torch.squeeze(x, 1)
x = self.softmax(x)
The culprit is your attempt to swap the dimensions of the input around, since Keras and PyTorch have different conventions for the dimension order.
x = x.view(x.shape[0],x.shape[2],x.shape[1])
.view() does not swap the dimensions, but changes which part of the data is part of a given dimension. You can consider it as a 1D array, then you decide how many steps you take to cover the dimension. An example makes it much simpler to understand.
# Let's start with a 1D tensor
# That's how the underlying data looks in memory.
x = torch.arange(6)
# => tensor([0, 1, 2, 3, 4, 5])
# How the tensor looks when using Keras' convention (expected input)
keras_version = x.view(2, 3)
# => tensor([[0, 1, 2],
# [3, 4, 5]])
# Vertical isn't swapped with horizontal, but the data is arranged differently
# The numbers are still incrementing from left to right
incorrect_pytorch_version = keras_version.view(3, 2)
# => tensor([[0, 1],
# [2, 3],
# [4, 5]])
To swap the dimensions you need to use torch.transpose.
correct_pytorch_version = keras_version.transpose(0, 1)
# => tensor([[0, 3],
# [1, 4],
# [2, 5]])

Why does sklearn Imputer need to fit?

I'm really new in this whole machine learning thing and I'm taking an online course on this subject. In this course, the instructors showed the following piece of code:
imputer = Inputer(missing_values = 'Nan', strategy = 'mean', axis=0)
imputer = Imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
I don't really get why this imputer object needs to fit. I mean, I´m just trying to get rid of missing values in my columns by replacing them with the column mean. From the little I know about programming, this is a pretty simple, iterative procedure, and wouldn´t require a model that has to train on data to be accomplished.
Can someone please explain how this imputer thing works and why it requires training to replace some missing values by the column mean?
I have read sci-kit's documentation, but it just shows how to use the methods, and not why they´re required.
Thank you.
The Imputer fills missing values with some statistics (e.g. mean, median, ...) of the data.
To avoid data leakage during cross-validation, it computes the statistic on the train data during the fit, stores it and uses it on the test data, during the transform.
from sklearn.preprocessing import Imputer
obj = Imputer(strategy='mean')
obj.fit([[1, 2, 3], [2, 3, 4]])
print(obj.statistics_)
# array([ 1.5, 2.5, 3.5])
X = obj.transform([[4, np.nan, 6], [5, 6, np.nan]])
print(X)
# array([[ 4. , 2.5, 6. ],
# [ 5. , 6. , 3.5]])
You can do both steps in one if your train and test data are identical, using fit_transform.
X = obj.fit_transform([[1, 2, np.nan], [2, 3, 4]])
print(X)
# array([[ 1. , 2. , 4. ],
# [ 2. , 3. , 4. ]])
This data leakage issue is important, since the data distribution may change from the training data to the testing data, and you don't want the information of the testing data to be already present during the fit.
See the doc for more information about cross-validation.

Correct way to split data to batches for Keras stateful RNNs

As the documentation states
the last state for each sample at index i in a batch will be used as
initial state for the sample of index i in the following batch
does it mean that to split data to batches I need to do it the following way
e.g. let's assume that I am training a stateful RNN to predict the next integer in range(0, 5) given the previous one
# batch_size = 3
# 0, 1, 2 etc in x are samples (timesteps and features omitted for brevity of the example)
x = [0, 1, 2, 3, 4]
y = [1, 2, 3, 4, 5]
batches_x = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
batches_y = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
then the state after learning on x[0, 0] will be initial state for x[1, 0]
and x[0, 1] for x[1, 1] (0 for 1 and 1 for 2 etc)?
Is it the right way to do it?
Based on this answer, for which I performed some tests.
Stateful=False:
Normally (stateful=False), you have one batch with many sequences:
batch_x = [
[[0],[1],[2],[3],[4],[5]],
[[1],[2],[3],[4],[5],[6]],
[[2],[3],[4],[5],[6],[7]],
[[3],[4],[5],[6],[7],[8]]
]
The shape is (4,6,1). This means that you have:
1 batch
4 individual sequences = this is batch size and it can vary
6 steps per sequence
1 feature per step
Every time you train, either if you repeat this batch or if you pass a new one, it will see individual sequences. Every sequence is a unique entry.
Stateful=True:
When you go to a stateful layer, You are not going to pass individual sequences anymore. You are going to pass very long sequences divided in small batches. You will need more batches:
batch_x1 = [
[[0],[1],[2]],
[[1],[2],[3]],
[[2],[3],[4]],
[[3],[4],[5]]
]
batch_x2 = [
[[3],[4],[5]], #continuation of batch_x1[0]
[[4],[5],[6]], #continuation of batch_x1[1]
[[5],[6],[7]], #continuation of batch_x1[2]
[[6],[7],[8]] #continuation of batch_x1[3]
]
Both shapes are (4,3,1). And this means that you have:
2 batches
4 individual sequences = this is batch size and it must be constant
6 steps per sequence (3 steps in each batch)
1 feature per step
The stateful layers are meant to huge sequences, long enough to exceed your memory or your available time for some task. Then you slice your sequences and process them in parts. There is no difference in the results, the layer is not smarter or has additional capabilities. It just doesn't consider that the sequences have ended after it processes one batch. It expects the continuation of those sequences.
In this case, you decide yourself when the sequences have ended and call model.reset_states() manually.

How to use a Gaussian Process for Binary Classification?

I know that a Gaussian Process model is best suited for regression rather than classification. However, I would still like to apply a Gaussian Process to a classification task but I am not sure what is the best way to bin the predictions generated by the model. I have reviewed the Gaussian Process classification example that is available on the scikit-learn website at:
http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html
But I found this example confusing (I have listed the things I found confusing about this example at the end of the question). To try and get a better understanding I have created a very basic python code example using scikit-learn that generates classifications by applying a decision boundary to the predictions made by a gaussian process:
#A minimum example illustrating how to use a
#Gaussian Processes for binary classification
import numpy as np
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.gaussian_process import GaussianProcess
if __name__ == "__main__":
#defines some basic training and test data
#If the descriptive features have large values
#(i.e., 8s and 9s) the target is 1
#If the descriptive features have small values
#(i.e., 2s and 3s) the target is 0
TRAININPUTS = np.array([[8, 9, 9, 9, 9],
[9, 8, 9, 9, 9],
[9, 9, 8, 9, 9],
[9, 9, 9, 8, 9],
[9, 9, 9, 9, 8],
[2, 3, 3, 3, 3],
[3, 2, 3, 3, 3],
[3, 3, 2, 3, 3],
[3, 3, 3, 2, 3],
[3, 3, 3, 3, 2]])
TRAINTARGETS = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
TESTINPUTS = np.array([[8, 8, 9, 9, 9],
[9, 9, 8, 8, 9],
[3, 3, 3, 3, 3],
[3, 2, 3, 2, 3],
[3, 2, 2, 3, 2],
[2, 2, 2, 2, 2]])
TESTTARGETS = np.array([1, 1, 0, 0, 0, 0])
DECISIONBOUNDARY = 0.5
#Fit a gaussian process model to the data
gp = GaussianProcess(theta0=10e-1, random_start=100)
gp.fit(TRAININPUTS, TRAINTARGETS)
#Generate a set of predictions for the test data
y_pred = gp.predict(TESTINPUTS)
print "Predicted Values:"
print y_pred
print "----------------"
#Convert the continuous predictions into the classes
#by splitting on a decision boundary of 0.5
predictions = []
for y in y_pred:
if y > DECISIONBOUNDARY:
predictions.append(1)
else:
predictions.append(0)
print "Binned Predictions (decision boundary = 0.5):"
print predictions
print "----------------"
#print out the confusion matrix specifiy 1 as the positive class
cm = confusion_matrix(TESTTARGETS, predictions, [1, 0])
print "Confusion Matrix (1 as positive class):"
print cm
print "----------------"
print "Classification Report:"
print metrics.classification_report(TESTTARGETS, predictions)
When I run this code I get the following output:
Predicted Values:
[ 0.96914832 0.96914832 -0.03172673 0.03085167 0.06066993 0.11677634]
----------------
Binned Predictions (decision boundary = 0.5):
[1, 1, 0, 0, 0, 0]
----------------
Confusion Matrix (1 as positive class):
[[2 0]
[0 4]]
----------------
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 1.00 1.00 2
avg / total 1.00 1.00 1.00 6
The approach used in this basic example seems to work fine with this simple dataset. But this approach is very different from the classification example given on the scikit-lean website that I mentioned above (url repeated here):
http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html
So I'm wondering if I am missing something here. So, I would appreciate if anyone could:
With respect to the classification example given on the scikit-learn website:
1.1 explain what the probabilities being generated in this example are probabilities of? Are they the probability of the query instance belonging to the class >0?
1.2 why the example uses a cumulative density function instead of a probability density function?
1.3 why the example divides the predictions made by the model by the square root of the mean square error before they are input into the cumulative density function?
With respect to the basic code example I have listed here, clarify whether or not applying a simple decision boundary to the predictions generated by a gaussian process model is an appropriate way to do binary classification?
Sorry for such a long question and thanks for any help.
In the GP classifier, a standard GP distribution over functions is "squashed," usually using the standard normal CDF (also called the probit function), to map it to a distribution over binary categories.
Another interpretation of this process is through a hierarchical model (this paper has the derivation), with a hidden variable drawn from a Gaussian Process.
In sklearn's gp library, it looks like the output from y_pred, MSE=gp.predict(xx, eval_MSE=True) are the (approximate) posterior means (y_pred) and posterior variances (MSE) evaluated at points in xx before any squashing occurs.
To obtain the probability that a point from the test set belongs to the positive class, you can convert the normal distribution over y_pred to a binary distribution by applying the Normal CDF (see [this paper again] for details).
The hierarchical model of the probit squashing function is defined by a 0 decision boundary (the standard normal distribution is symmetric around 0, meaning PHI(0)=.5). So you should set DECISIONBOUNDARY=0.

Resources