I am using a Pipeline with GridSearchCV, and I am trying to retrieve the selected features:
pipeline = Pipeline(
[
('transform', SimpleImputer(strategy='mean')),
('selector',SelectKBest(f_regression)),
('classifier',KNeighborsClassifier())
]
)
scoring = ['precision', 'recall','accuracy']
CV = StratifiedKFold(n_splits = 4, random_state = None, shuffle=True)
search = GridSearchCV(
estimator = pipeline,
param_grid = {'selector__k':[3,4,5,6,7,8,9,10],
'classifier__n_neighbors':[3,4,5,6,7], 'classifier__weights' :['uniform', 'distance'],
'classifier__algorithm' :['auto', 'ball_tree', 'kd_tree', 'brute'],
'classifier__p':[1,2] },
n_jobs=-1,
refit='accuracy',
scoring=scoring,
cv=CV,
verbose=0
)
search.fit(data,target)
SelectKBest acts on the training data of each split instead of the whole dataset, which is perfect.
The confusing part for me is this line, which returns a set of features:
search.best_estimator_.named_steps['selector'].get_support()
Which features am I getting here? I assume that each iteration selects a different set of features, depending on the split.
Since the search parameter refit is not False, the best-performing set of hyperparameters has been used to refit a model onto the entire training set (no splitting into folds for this part); that single model is what is exposed in the attribute best_estimator_.
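To illustrate, here is a minimal sketch on synthetic data. The dataset, the feature names f0..f9 and the small grid are made up for the example, and I swapped in f_classif because the target is categorical (the question uses f_regression). It suggests that the mask from best_estimator_ is simply the selection computed on all the data passed to fit, using the winning k:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for (data, target); the names f0..f9 are hypothetical.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)
feature_names = np.array([f"f{i}" for i in range(X.shape[1])])

pipeline = Pipeline([
    ("selector", SelectKBest(f_classif)),
    ("classifier", KNeighborsClassifier()),
])
search = GridSearchCV(pipeline, {"selector__k": [3, 5, 7]}, cv=4, refit=True)
search.fit(X, y)

# Boolean mask over the input columns, taken from the refitted pipeline.
mask = search.best_estimator_.named_steps["selector"].get_support()
selected = feature_names[mask]

# Refitting the selector by hand on the full data with the best k gives the same mask.
best_k = search.best_params_["selector__k"]
manual_mask = SelectKBest(f_classif, k=best_k).fit(X, y).get_support()
print(selected, np.array_equal(mask, manual_mask))
```

So the features you see do not belong to any particular fold; they come from one final fit on everything.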
You could define an additional scoring callable that would return the feature list; then in cv_results_ you would have the features selected for each hyperparameter combination and each fold.
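As far as I know, scorers are expected to return numbers, so a scorer that returns a list may not work out of the box. A different way to inspect the per-fold selections (a sketch only, for one fixed hyperparameter setting, on made-up data and with f_classif as an assumption) is to ask cross_validate to keep the fitted pipelines:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for (data, target).
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

pipeline = Pipeline([
    ("selector", SelectKBest(f_classif, k=3)),
    ("classifier", KNeighborsClassifier()),
])
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# return_estimator=True keeps the pipeline fitted on each training split.
results = cross_validate(pipeline, X, y, cv=cv, return_estimator=True)

# Column indices chosen by the selector in each fold.
per_fold = [
    np.flatnonzero(est.named_steps["selector"].get_support()).tolist()
    for est in results["estimator"]
]
for fold, cols in enumerate(per_fold):
    print(fold, cols)
```

The selections can differ from fold to fold, which is the behaviour the question guessed at.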
I have a multiple instance dataset for which I want to predict the instance category as well as a (derived) bag label using Keras' Functional API. Simple instance prediction works and getting a bag label from that also works. But since the bag label is outside of the model the results seem to be suboptimal.
My thinking is as follows:
For each instance in a bag, start up a separate branch of the model.
After running each instance through its branch, concatenate the results.
After concatenation, predict the bag label based on probabilities
What I have written so far - here, n_instances is the number of instances per bag, n_feat the number of features per instance, and n_classes the number of possible categories an instance/bag can belong to.
import tensorflow as tf
from keras.layers import *
inputs = []
instance_layer = [None] * n_instances
for i in range(n_instances):
    inp = Input(shape=(n_feat,))  # shape is given as a tuple
    inputs.append(inp)
    instance_layer[i] = Dense(units=256, activation='ReLU')(inp)
    instance_layer[i] = Dense(units=128, activation='ReLU')(instance_layer[i])
    instance_layer[i] = Dense(units=64, activation='ReLU')(instance_layer[i])
    instance_layer[i] = Dense(units=n_classes + 1, activation='sigmoid')(instance_layer[i]) # output to be converted to one-hot vector
output_tensor = Concatenate()(instance_layer)
"""
Code to go from concatenated tensor to a single bag prediction
"""
model = tf.keras.models.Model(inputs, output_tensor)
Issues:
It seems to me like each instance sees a separate model while I want to keep the models identical
Concatenate() produces a tensor of length n_instances*n_classes, whereas I'm interested in a tensor of shape (n_instances, n_classes). I would prefer to use CategoricalCrossEntropy as a loss function.
Any pointers on how to go from this tensor of instance predictions to a bag prediction?
For posterity:
instance_model = tf.keras.models.Sequential([
Dense(units=256, name='fc_256', activation='ReLU', input_dim=n_feat),
Dense(units=128, name='fc_128', activation='ReLU'),
Dense(units=64, name='fc_64', activation='ReLU'),
Dense(units=n_classes+1, name='label_predictions', activation='sigmoid')
])
This is then wrapped in a TimeDistributed layer which returns a tensor with n_instances rows and n_classes+1 columns, for an input tensor of n_instances rows and n_feat columns. n_instances is variable here, hence the None in the input shape:
inputs = Input(shape=(None, n_feat), name="input")
instance_output = TimeDistributed(instance_model)(inputs)
# Condense into bag prediction
bag_output = GlobalAveragePooling1D(name="pooling")(instance_output)
model = tf.keras.models.Model(inputs, bag_output)
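To make the shapes concrete, here is a NumPy-only sketch of the same idea. It is not Keras code: the single-layer instance_model, the weights W and b, and all sizes are made up for illustration. The point is that one set of weights is applied to every instance, and the instance axis is then averaged away:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_instances, n_feat, n_out = 2, 5, 4, 3  # n_out plays the role of n_classes + 1

# One shared weight matrix and bias: every instance sees the same "model".
W = rng.normal(size=(n_feat, n_out))
b = np.zeros(n_out)

def instance_model(x):
    """Toy stand-in for the Sequential model: one dense layer with a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))

bags = rng.normal(size=(batch, n_instances, n_feat))

# TimeDistributed-like step: the same function is applied along the instance axis.
instance_output = instance_model(bags)      # shape (batch, n_instances, n_out)

# GlobalAveragePooling1D-like step: average over the instance axis.
bag_output = instance_output.mean(axis=1)   # shape (batch, n_out)

print(instance_output.shape, bag_output.shape)
```

Because the weights are shared, this addresses the first issue in the question (identical models per instance), and the intermediate tensor keeps the (n_instances, n_classes + 1) layout asked for in the second.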
I want to apply cross-validation to my machine learning models. In these models, I also want feature selection and a grid search to be applied. Imagine that I want to estimate the performance of a K-Nearest-Neighbors classifier by applying a feature selection technique based on an F-score (ANOVA) that chooses the 10 most relevant features. The code would be as follows:
# 10-times 10-fold cross validation
n_repeats = 10
rkf = RepeatedKFold(n_splits=10, n_repeats = n_repeats, random_state=0)
# Data standardization
scaler = StandardScaler()
# Variable to contain error measures and counter for the splits
error_knn = []
split = 0
for train_index, test_index in rkf.split(X, y):
    # Print a dot for each train / test partition
    sys.stdout.write('.')
    sys.stdout.flush()
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Standardize the data
    scaler.fit(X_train, y_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    ###- In order to select the best number of neighbors -###
    # Pipeline for training the classifier from previous notebooks
    pipeline = Pipeline([ ('knn', KNeighborsClassifier()) ])
    N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
    param_grid = { 'knn__n_neighbors': N_neighbors }
    # Evaluate the performance in a 5-fold cross-validation
    skfold = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=split)
    # n_jobs = -1 to use all processors
    gridcv = GridSearchCV(pipeline, cv=skfold, n_jobs=-1, param_grid=param_grid, \
        scoring=make_scorer(accuracy_score))
    result = gridcv.fit(X_train, y_train)
    ###- Results -###
    # Mean accuracy and standard deviation
    accuracies = gridcv.cv_results_['mean_test_score']
    std_accuracies = gridcv.cv_results_['std_test_score']
    # Best value for the number of neighbors
    # Define KNeighbors Classifier with that best value
    # Method fit(X,y) to fit each model according to training data
    best_Nneighbors = N_neighbors[np.argmax(accuracies)]
    knn = KNeighborsClassifier(n_neighbors = best_Nneighbors)
    knn.fit(X_train, y_train)
    # Error for the prediction
    error_knn.append(1.0 - np.mean(knn.predict(X_test) == y_test))
    split += 1
However, my columns are categorical (except for the binary label) and I need to encode them. I cannot remove these columns because they are essential.
Where would you perform this encoding, and how would you handle categories that are unseen in a given fold?
Categorical encoding should be performed as the first step, precisely to avoid the problem you mentioned regarding unseen labels in each fold.
Additionally, your current implementation suffers from data leakage.
You're performing feature scaling on the full X_train dataset before performing your inner cross-validation.
This can be solved by including StandardScaler in the pipeline used for your GridSearchCV:
...
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    ###- In order to select the best number of neighbors -###
    # Pipeline for training the classifier from previous notebooks
    pipeline = Pipeline(
        [ ('scaler', scaler), ('knn', KNeighborsClassifier()) ]
    )
    N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
    param_grid = { 'knn__n_neighbors': N_neighbors }
...
Another couple of tips:
GridSearchCV has a best_estimator_ attribute that can be used to extract the estimator with the best set of hyperparameters found.
When using GridSearchCV with refit=True (the default), you can use the object directly to perform predictions, e.g. gridcv.predict(X_test).
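A small sketch of both tips on synthetic data (the dataset and the shortened grid are made up for the example). The scaler sits inside the pipeline, and the fitted search object is used directly for prediction:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one outer train/test partition.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is a pipeline step, so it is refitted on each inner training split.
pipeline = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
gridcv = GridSearchCV(pipeline, {"knn__n_neighbors": [1, 3, 5, 7]}, cv=5)
gridcv.fit(X_train, y_train)

best_k = gridcv.best_params_["knn__n_neighbors"]

# With refit=True (the default), predict delegates to the refitted best pipeline.
y_pred = gridcv.predict(X_test)
same = bool((y_pred == gridcv.best_estimator_.predict(X_test)).all())
print(best_k, same)
```

This removes the need to rebuild a KNeighborsClassifier by hand with the best number of neighbors.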
EDIT: Perhaps I was too general about when to perform categorical encoding. Your approach should depend on your problem/dataset.
If you know beforehand how many categorical features exist and you want to train your inner CV classifiers with this knowledge, you should perform categorical encoding as the first step.
If at training time you do not know how many categorical features you are going to see, or you want to train your CV classifiers without knowledge of the full range of categorical features, you should perform categorical encoding at each fold.
When using the former your classifiers will all be trained on the same feature space while that's not guaranteed for the latter.
If using the latter, the above pipeline can be extended to incorporate categorical encoding:
pipeline = Pipeline(
[
('enc', OneHotEncoder()),
('scaler', StandardScaler(with_mean=False)),
('knn', KNeighborsClassifier()),
],
)
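One caveat with per-fold encoding: by default OneHotEncoder raises an error at transform time when it meets a category it did not see during fit. A minimal sketch of the handle_unknown='ignore' option, which encodes such a category as all zeros (the colour values are made up for the example):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([["red"], ["green"], ["red"]])
X_test = np.array([["blue"]])  # category not present in the training split

# Default behaviour: unknown categories raise a ValueError at transform time.
strict = OneHotEncoder().fit(X_train)
try:
    strict.transform(X_test)
    raised = False
except ValueError:
    raised = True

# With handle_unknown="ignore", the unseen category becomes a row of zeros.
lenient = OneHotEncoder(handle_unknown="ignore").fit(X_train)
encoded = lenient.transform(X_test).toarray()

print(raised, encoded)
```

Passing this option to the 'enc' step above is one way to keep the inner folds from failing on unseen labels.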
I suggest you read the Encoding categorical features section of scikit-learn's User Guide carefully.
For the given imbalanced data, I have created separate pipelines for standardization and one-hot encoding:
numeric_transformer = Pipeline(steps = [('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('ohe', OneHotCategoricalEncoder())])
After that, a column transformer combines the above pipelines into one:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
The final pipeline is as below
smt = SMOTE(random_state=42)
rf = pl1([('preprocessor', preprocessor),('smote',smt),
('classifier', RandomForestClassifier())])
I am fitting the pipeline on imbalanced data, so I have included SMOTE along with the preprocessing and the classifier. As the data is imbalanced, I want to check the recall score.
Is the approach shown in the code below correct? I am getting a recall of around 0.98, which makes me worry that the model is overfitting. Any suggestions if I am making a mistake?
scores = cross_val_score(rf, X, y, cv=5,scoring="recall")
The important concern in imbalanced settings is to ensure that enough members of the minority class will be present in each CV fold; thus, it would seem advisable to enforce that using StratifiedKFold, i.e.:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
scores = cross_val_score(rf, X, y, cv=skf, scoring="recall")
Nevertheless, it turns out that even when using the cross_val_score as you do (i.e. simply with cv=5), scikit-learn takes care of it and engages a stratified CV indeed; from the docs:
cv : int, cross-validation generator or an iterable, default=None
None, to use the default 5-fold cross validation,
int, to specify the number of folds in a (Stratified)KFold.
For int/None inputs, if the estimator is a classifier and y is either
binary or multiclass, StratifiedKFold is used. In all other cases,
KFold is used.
So, using your code as is:
scores = cross_val_score(rf, X, y, cv=5, scoring="recall")
is absolutely fine indeed.
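You can check this behaviour yourself with check_cv, the scikit-learn helper that turns a cv argument into a splitter (I believe cross_val_score relies on it internally, but treat that as an assumption). A sketch with a made-up 90/10 imbalanced label vector:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Toy imbalanced labels: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X_dummy = np.zeros((len(y), 1))  # only the number of rows matters for splitting

# An integer cv resolves to StratifiedKFold for classifiers, KFold otherwise.
cv_clf = check_cv(5, y, classifier=True)
cv_reg = check_cv(5, y, classifier=False)

# Number of minority samples landing in each test fold under stratification.
minority_per_fold = [int(y[test].sum()) for _, test in cv_clf.split(X_dummy, y)]

print(type(cv_clf).__name__, type(cv_reg).__name__, minority_per_fold)
```

Each test fold receives its share of the minority class, which is what matters for a recall estimate.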
I'm currently using sklearn for a school project and I have some questions about how GridSearchCV applies preprocessing algorithms such as PCA or Factor Analysis. Let's suppose I perform hold out:
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = 0.1, stratify = y)
Then, I declare some hyperparameters and perform a GridSearchCV (it would be the same with RandomizedSearchCV but whatever):
# make_pipeline names each step after its lowercased class, so the SVC step is 'svc'
params = {
    'svc__C' : [...],
    'svc__tol' : [...],
    'svc__degree' : [...]
}
clf = make_pipeline(PCA(), SVC(kernel='linear'))
model = GridSearchCV(clf, params, cv = 5, verbose = 2, n_jobs = -1)
model.fit(X_tr, y_tr)
My issue is: my teacher told me that, with k-fold CV, you should never fit the preprocessing algorithm (here PCA) on the validation split, only on the training split (here both the training split and the validation split are subsets of X_tr, and of course they change at every fold). So if I have PCA() here, it should be fit on the part of the fold used for training the model, and when I test the resulting model against the validation split, that split should be transformed with the PCA fitted on the training split. This ensures no leaks whatsoever.
Does sklearn account for this?
And if it does: suppose that now I want to use imblearn to perform oversampling on an unbalanced set:
clf = make_pipeline(SMOTE(), SVC(kernel='linear'))
Still according to my teacher, you shouldn't oversample the validation split either, as this could lead to misleading accuracy estimates. So the statement above that held for PCA, about transforming the validation split afterwards, does not apply here.
Does sklearn/imblearn account for this as well?
Many thanks in advance
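For the PCA part, one way to check this empirically is to keep the pipelines fitted during cross-validation and look at how many samples each PCA saw. This is a sketch on made-up data; it relies on cross_validate's return_estimator option and on PCA's n_samples_ attribute, and it says nothing about SMOTE:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for (X_tr, y_tr).
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

clf = make_pipeline(PCA(n_components=3), SVC(kernel="linear"))

# Keep the pipeline that was fitted on each training split.
results = cross_validate(clf, X, y, cv=5, return_estimator=True)

# n_samples_ is the number of samples each PCA was fitted on:
# roughly 4/5 of the data per fold, never all of it.
pca_fit_sizes = [est.named_steps["pca"].n_samples_ for est in results["estimator"]]
print(pca_fit_sizes)
```

If PCA had been fitted on the validation split as well, every entry would equal the full sample count.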
I'm trying to make a network that outputs a depth map, and semantic segmentation data separately.
In order to train the network, I'd like to use categorical cross entropy for the segmentation branch, and mean squared error for the branch that outputs the depth map.
I couldn't find any info in the Keras documentation for the Functional API on using a different loss function for each branch.
Is it possible for me to use these loss functions simultaneously during training, or would it be better for me to train the different branches separately?
From the documentation of Model.compile:
loss: String (name of objective function) or objective function. See
losses. If the model has multiple outputs, you can use a different
loss on each output by passing a dictionary or a list of losses. The
loss value that will be minimized by the model will then be the sum of
all individual losses.
If your output is named, you can use a dictionary mapping the names to the corresponding losses:
from keras.layers import Input, Dense
from keras.models import Model

x = Input((10,))
out1 = Dense(10, activation='softmax', name='segmentation')(x)
out2 = Dense(10, name='depth')(x)
model = Model(x, [out1, out2])
model.compile(loss={'segmentation': 'categorical_crossentropy', 'depth': 'mse'},
              optimizer='adam')
Otherwise, use a list of losses (in the same order as the corresponding model outputs).
x = Input((10,))
out1 = Dense(10, activation='softmax')(x)
out2 = Dense(10)(x)
model = Model(x, [out1, out2])
model.compile(loss=['categorical_crossentropy', 'mse'], optimizer='adam')
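To make "the sum of all individual losses" concrete, here is a NumPy sketch with hand-written versions of the two losses. These are simplified stand-ins rather than Keras' own implementations, and the toy targets and predictions are made up:

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred)) across classes."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=-1)))

def mse(y_true, y_pred):
    """Mean squared error over all elements."""
    return float(np.mean((y_true - y_pred) ** 2))

# Toy "segmentation" branch: two samples, two classes, uniform predictions.
seg_true = np.array([[1.0, 0.0], [0.0, 1.0]])
seg_pred = np.array([[0.5, 0.5], [0.5, 0.5]])

# Toy "depth" branch: two samples, one value each.
depth_true = np.array([[1.0], [3.0]])
depth_pred = np.array([[2.0], [3.0]])

seg_loss = categorical_crossentropy(seg_true, seg_pred)  # ln(2) for uniform two-class predictions
depth_loss = mse(depth_true, depth_pred)                  # (1 + 0) / 2 = 0.5

# The quantity being minimised is the sum of the per-output losses.
total_loss = seg_loss + depth_loss
print(seg_loss, depth_loss, total_loss)
```

Since the two losses can live on very different scales, it may be worth looking at the loss_weights argument of Model.compile, which weights each output's loss before they are summed.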