Using Select-K-best on unknown test-set - machine-learning

I trained a logistic regression classifier in sklearn. My base feature-file has 65 features, now I extrapolated them to a 1000 by considering quadratic combinations also (using PolynomialFeatures()). And then I reduced them back to 100 by Select-K-Best() method.
However, once I have my model trained and I get a new test_file, it would only have the 65 base features but my model expects 100 of them.
So, how can I apply the Select-K-Best() method on my test-set when I do not know the labels which is required in Select-K-Best.fit() function

You shouldn't fit SelectKBest again on test data - use the same (already fit) SelectKBest instance as in training instead. I.e. you should only use .transform method on test data, not .fit method.
scikit-learn provides an utility which makes managing multiple steps like that easier; it is called Pipeline. It should be something like that in your case (via make_pipeline helper):
pipe = make_pipeline(
PolynomialFeatures(2),
SelectKBest(100),
LogisticRegression()
)
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

Related

Specifying class or sample weights in Keras for one-hot encoded labels in a TF Dataset

I am trying to train an image classifier on an unbalanced training set. In order to cope with the class imbalance, I want either to weight the classes or the individual samples. Weighting the classes does not seem to work. And somehow for my setup I was not able to find a way to specify the samples weights. Below you can read how I load and encode the training data and the two approaches that I tried.
Training data loading and encoding
My training data is stored in a directory structure where each image is place in the subfolder corresponding to its class (I have 32 classes in total). Since the training data is too big too all load at once into memory I make use of image_dataset_from_directory and by that describe the data in a TF Dataset:
train_ds = keras.preprocessing.image_dataset_from_directory (training_data_dir,
batch_size=batch_size,
image_size=img_size,
label_mode='categorical')
I use label_mode 'categorical', so that the labels are described as a one-hot encoded vector.
I then prefetch the data:
train_ds = train_ds.prefetch(buffer_size=buffer_size)
Approach 1: specifying class weights
In this approach I try to specify the class weights of the classes via the class_weight argument of fit:
model.fit(
train_ds, epochs=epochs, callbacks=callbacks, validation_data=val_ds,
class_weight=class_weights
)
For each class we compute weight which are inversely proportional to the number of training samples for that class. This is done as follows (this is done before the train_ds.prefetch() call described above):
class_num_training_samples = {}
for f in train_ds.file_paths:
class_name = f.split('/')[-2]
if class_name in class_num_training_samples:
class_num_training_samples[class_name] += 1
else:
class_num_training_samples[class_name] = 1
max_class_samples = max(class_num_training_samples.values())
class_weights = {}
for i in range(0, len(train_ds.class_names)):
class_weights[i] = max_class_samples/class_num_training_samples[train_ds.class_names[i]]
What I am not sure about is whether this solution works, because the keras documentation does not specify the keys for the class_weights dictionary in case the labels are one-hot encoded.
I tried training the network this way but found out that the weights did not have a real influence on the resulting network: when I looked at the distribution of predicted classes for each individual class then I could recognize the distribution of the overall training set, where for each class the prediction of the dominant classes is most likely.
Running the same training without any class weight specified led to similar results.
So I suspect that the weights don't seem to have an influence in my case.
Is this because specifying class weights does not work for one-hot encoded labels, or is this because I am probably doing something else wrong (in the code I did not show here)?
Approach 2: specifying sample weight
As an attempt to come up with a different (in my opinion less elegant) solution I wanted to specify the individual sample weights via the sample_weight argument of the fit method. However from the documentation I find:
[...] This argument is not supported when x is a dataset, generator, or keras.utils.Sequence instance, instead provide the sample_weights as the third element of x.
Which is indeed the case in my setup where train_ds is a dataset. Now I really having trouble finding documentation from which I can derive how I can modify train_ds, such that it has a third element with the weight. I thought using the map method of a dataset can be useful, but the solution I came up with is apparently not valid:
train_ds = train_ds.map(lambda img, label: (img, label, class_weights[np.argmax(label)]))
Does anyone have a solution that may work in combination with a dataset loaded by image_dataset_from_directory?

Using a composite estimator with sklearn's cross_validate method, what does the "fit_time" argument include?

Using sklearn make_pipeline utility you can create a composite estimator as I have done below (clf). Every time the cross_validate method is called, it first fits the minmax scaler on the kfolds that are not being used for validation, and the transforms the final fold, only then is the model fit.
The cross_validate method returns a value called the "fit_time". Does this fit time take into account the minmax scaling, or does it only time training the "model" (second argument to make_pipeline).
Thanks.
#The following clf uses minmax scaling
clf = make_pipeline(preprocessing.MinMaxScaler(), model)
results = cross_validate(clf, Data, labels, cv=kfold,return_train_score=True)
Based on the docs:
fit_time
The time for fitting the estimator on the train set for each cv split.
I would say it only counts the estimator, which should be the model and not the scalling.
But you can test it, if you exclude your scalling in one test run and see if the time is different.

Explanation Needed for Autokeras's AutoModel and GraphAutoModel

I understand what AutoKeras ImageClassifier does (https://autokeras.com/image_classifier/)
clf = ImageClassifier(verbose=True, augment=False)
clf.fit(x_train, y_train, time_limit=12 * 60 * 60)
clf.final_fit(x_train, y_train, x_test, y_test, retrain=True)
y = clf.evaluate(x_test, y_test)
But i am unable to Understand what does AutoModel class (https://autokeras.com/auto_model/) does, or how is it different from ImageClassifier
autokeras.auto_model.AutoModel(
inputs,
outputs,
name="auto_model",
max_trials=100,
directory=None,
objective="val_loss",
tuner="greedy",
seed=None)
Documentation for Arguments Inputs and Outputs Says
inputs: A list of or a HyperNode instance. The input node(s) of the AutoModel.
outputs: A list of or a HyperHead instance. The output head(s) of the AutoModel.
What is HyperNode Instance ?
Similarly, what is GraphAutoModel class ? (https://autokeras.com/graph_auto_model/)
autokeras.auto_model.GraphAutoModel(
inputs,
outputs,
name="graph_auto_model",
max_trials=100,
directory=None,
objective="val_loss",
tuner="greedy",
seed=None)
Documentation Reads
A HyperModel defined by a graph of HyperBlocks. GraphAutoModel is a subclass of HyperModel. Besides the HyperModel properties, it also has a tuner to tune the HyperModel. The user can use it in a similar way to a Keras model since it also has fit() and predict() methods.
What is HyperBlocks ?
If Image Classifier automatically does HyperParameter Tuning, what is the use of GraphAutoModel ?
Links to Any Documents / Resources for better understanding of AutoModel and GraphAutoModel appreciated .
Having worked with autokeras recently, I can share my little knowledge.
Task API
When doing a classical task such as image classification/regression, text classification/regression, ..., you can use the simplest APIs provided by autokeras called Task API: ImageClassifier, ImageRegressor, TextClassifier, TextRegressor, ... In this case you have one input (image or text or tabular data, ...) and one output (classification, regression).
Automodel
However when you are in a situation where you have for example a task that requires multi inputs/outputs architecture, then you cannot use directly Task API, and this is where Automodel comes into play with the I/O API. you can check the example provided in the documentation where you have two inputs (image and structured data) and two outputs (classification and regression)
GraphAutoModel
GraphAutomodel works like keras functional API. It assembles different blocks (Convolutions, LSTM, GRU, ...) and create a model using this block, then it will look for the best hyperparameters given this architecture you provided. Suppose for instance I want to do a binary classification task using time series as input data.
First let's generate a toy dataset :
import numpy as np
import autokeras as ak
x = np.random.randn(100, 7, 3)
y = np.random.choice([0, 1], size=100, p=[0.5, 0.5])
Here x is a time series of 100 samples, each sample is a sequence of length 7 and a features dimension of 3. The corresponding target variable y is binary (0, 1).
Using GraphAutomodel, I can specify the architecture I want, using what is called HyperBlocks. There are many blocks: Conv, RNN, Dense, ... check the full list here.
In my case I want to use RNN blocks to create a model because I have time series data :
input_layer = ak.Input()
rnn_layer = ak.RNNBlock(layer_type="lstm")(input_layer)
dense_layer = ak.DenseBlock()(rnn_layer)
output_layer = ak.ClassificationHead(num_classes=2)(dense_layer)
automodel = ak.GraphAutoModel(input_layer, output_layer, max_trials=2, seed=123)
automodel.fit(x, y, validation_split=0.2, epochs=2, batch_size=32)
(If you are not familiar with the above style of defining model, then you should check the keras functional API documentation).
So in this example I have more flexibility for creating the skeleton of architecture I would like to use : LSTM block followed by a Dense layer, followed by a Classification layer, However I didn't specify any hyperparameter, (number of lstm layers, number of dense layers, size of lstm layers, size of dense layers, activation functions, dropout, batchnorm, ....), Autokeras will do the hyperparameters tuning automatically based on the architecture (skeleton) I provided.

Are the k-fold cross-validation scores from scikit-learn's `cross_val_score` and `GridsearchCV` biased if we include transformers in the pipeline?

Data pre-processers such as StandardScaler should be used to fit_transform the train set and only transform (not fit) the test set. I expect the same fit/transform process applies to cross-validation for tuning the model. However, I found cross_val_score and GridSearchCV fit_transform the entire train set with the preprocessor (rather than fit_transform the inner_train set, and transform the inner_validation set). I believe this artificially removes the variance from the inner_validation set which makes the cv score (the metric used to select the best model by GridSearch) biased. Is this a concern or did I actually miss anything?
To demonstrate the above issue, I tried the following three simple test cases with the Breast Cancer Wisconsin (Diagnostic) Data Set from Kaggle.
I intentionally fit and transform the entire X with StandardScaler()
X_sc = StandardScaler().fit_transform(X)
lr = LogisticRegression(penalty='l2', random_state=42)
cross_val_score(lr, X_sc, y, cv=5)
I include SC and LR in the Pipeline and run cross_val_score
pipe = Pipeline([
('sc', StandardScaler()),
('lr', LogisticRegression(penalty='l2', random_state=42))
])
cross_val_score(pipe, X, y, cv=5)
Same as 2 but with GridSearchCV
pipe = Pipeline([
('sc', StandardScaler()),
('lr', LogisticRegression(random_state=42))
])
params = {
'lr__penalty': ['l2']
}
gs=GridSearchCV(pipe,
param_grid=params, cv=5).fit(X, y)
gs.cv_results_
They all produce the same validation scores.
[0.9826087 , 0.97391304, 0.97345133, 0.97345133, 0.99115044]
No, sklearn doesn't do fit_transform with entire dataset.
To check this, I subclassed StandardScaler to print the size of the dataset sent to it.
class StScaler(StandardScaler):
def fit_transform(self,X,y=None):
print(len(X))
return super().fit_transform(X,y)
If you now replace StandardScaler in your code, you'll see dataset size passed in first case is actually bigger.
But why does the accuracy remain exactly same? I think this is because LogisticRegression is not very sensitive to feature scale. If we instead use a classifier that is very sensitive to scale, like KNeighborsClassifier for example, you'll find accuracy between two cases start to vary.
X,y = load_breast_cancer(return_X_y=True)
X_sc = StScaler().fit_transform(X)
lr = KNeighborsClassifier(n_neighbors=1)
cross_val_score(lr, X_sc,y, cv=5)
Outputs:
569
[0.94782609 0.96521739 0.97345133 0.92920354 0.9380531 ]
And the 2nd case,
pipe = Pipeline([
('sc', StScaler()),
('lr', KNeighborsClassifier(n_neighbors=1))
])
print(cross_val_score(pipe, X, y, cv=5))
Outputs:
454
454
456
456
456
[0.95652174 0.97391304 0.97345133 0.92920354 0.9380531 ]
Not big change accuracy-wise, but change nonetheless.
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:
A model is trained using of the folds as training data;
the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
More over if your model is already biased from starting we have to make it balance by SMOTE /Oversampling of Less Target Variable/Under-sampling of High target variable.

Transformer of a pipeline acts on the training data instead of the whole data in GridSearchCV

I am using pipeline and GridSearchCV to select features automatically. Since the data set is small, I set the parameter 'cv' in GridSearchCV to StratifiedShuffleSplit. The code looks like as follows:
selection = SelectKBest()
clf = LinearSVC()
pipeline = Pipeline([("select", selection), ("classify", clf)])
cv = StratifiedShuffleSplit(n_splits=50)
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv = cv)
grid_search.fit(X, y)
It seems that SelectKBest acts on the training data of each split instead of the whole data set (the latter is what I want) since the result becomes different if I separate the 'select' and 'classify', where the StratifiedShuffleSplit will for sure only act on the classifier.
What is the correct way of using pipeline and GridSearchCV in this case? Thanks a lot!
Cross-validating the whole pipeline (i.e. running SelectKBest only on training part of each split) is the way to go. Otherwise the model is allowed to look at the test part - it means quality estimates become wrong. Best hyperparameters found with these unfair quality estimates may also work worse on a real unseen data.
At prediction time you're not going to re-run SelectKBest on (training dataset + example to be predicted) and then re-train the classifier, why would you do that in evaluation?
You can also find the answer of this question from page 245 to page 246 in the book-"The elements of statistical learning (2rd edition)" by Hastie,etc.

Resources