Nested cross validation with pipeline sklearn

I am trying to apply nested cross-validation with pipeline from the Sklearn library as seen below:
pipeline = imbpipeline(steps=[['smote', SMOTE(random_state=11)],
                              ['scaler', MinMaxScaler()],
                              ['classifier', LogisticRegression(random_state=11,
                                                                max_iter=1000)]])

cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)

param_grid = {'classifier__C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=cv_inner,
                           n_jobs=-1,
                           refit=True)

scores = cross_val_score(grid_search,
                         train_set,
                         train_labels,
                         scoring='accuracy',
                         cv=cv_outer,
                         n_jobs=-1)

print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
The code works just fine, but I can't figure out how to extract the best parameters found with the above procedure.
As per the documentation I tried:
grid_search.best_params_
but I get:
AttributeError: 'GridSearchCV' object has no attribute 'best_params_'
which I can't really understand.
Any thoughts?

Before you can get the best parameters, you need to fit the data. You should add one line:
grid_search.fit(X_train, y_train)
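For example, a minimal sketch of that pattern, using the train_set / train_labels names from your question:

# Fit the inner grid search once on the training data;
# the fitted attributes only exist after fit() has run.
grid_search.fit(train_set, train_labels)

# Best hyperparameters and best inner-CV score found by the search
print(grid_search.best_params_)
print(grid_search.best_score_)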

Related

Using Ray-Tune with sklearn's RandomForestClassifier

Putting together various basic and documentation examples, I have managed to come up with this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

def objective(config, reporter):
    for i in range(config['iterations']):
        model = RandomForestClassifier(random_state=0, n_jobs=-1, max_depth=None,
                                       n_estimators=int(config['n_estimators']),
                                       min_samples_split=int(config['min_samples_split']),
                                       min_samples_leaf=int(config['min_samples_leaf']))
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # Feed the score back to tune?
        reporter(precision=precision_score(y_test, y_pred, average='macro'))

space = {'n_estimators': (100, 200),
         'min_samples_split': (2, 10),
         'min_samples_leaf': (1, 5)}

algo = BayesOptSearch(
    space,
    metric="precision",
    mode="max",
    utility_kwargs={
        "kind": "ucb",
        "kappa": 2.5,
        "xi": 0.0
    },
    verbose=3
)

scheduler = AsyncHyperBandScheduler(metric="precision", mode="max")

config = {
    "num_samples": 1000,
    "config": {
        "iterations": 10,
    }
}

results = run(objective,
              name="my_exp",
              search_alg=algo,
              scheduler=scheduler,
              stop={"training_iteration": 400, "precision": 0.80},
              resources_per_trial={"cpu": 2, "gpu": 0.5},
              **config)

print(results.dataframe())
print("Best config: ", results.get_best_config(metric="precision"))
It runs and I am able to get a best configuration at the end. However, my doubt mainly lies in the objective function. Have I written it properly? There are no examples that I could find.
Follow up question:
What is num_samples in the config object? Is it the number of samples it will extract from the overall training data for each trial?
Tune now has native sklearn bindings: https://github.com/ray-project/tune-sklearn
Can you give that a shot instead?
To answer your original question, the objective function looks good; and num_samples is the total number of hyperparameter configurations you want to try.
Also, you'll want to remove the for loop from your training function:
def objective(config, reporter):
    model = RandomForestClassifier(random_state=0, n_jobs=-1, max_depth=None,
                                   n_estimators=int(config['n_estimators']),
                                   min_samples_split=int(config['min_samples_split']),
                                   min_samples_leaf=int(config['min_samples_leaf']))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Feed the score back to tune
    reporter(precision=precision_score(y_test, y_pred, average='macro'))
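For reference, the tune-sklearn route mentioned above might look roughly like this. This is only a sketch: TuneSearchCV, param_distributions, n_trials and search_optimization are taken from the tune-sklearn README, and the exact form of the search space it accepts can vary between versions.

from tune_sklearn import TuneSearchCV
from sklearn.ensemble import RandomForestClassifier

# Same search space as in the question (integer ranges)
param_distributions = {
    'n_estimators': (100, 200),
    'min_samples_split': (2, 10),
    'min_samples_leaf': (1, 5),
}

tune_search = TuneSearchCV(
    RandomForestClassifier(random_state=0, n_jobs=-1),
    param_distributions,
    n_trials=20,                      # number of hyperparameter configurations to try
    search_optimization="bayesian",   # Bayesian optimization backend
    scoring="precision_macro",
)
tune_search.fit(X_train, y_train)
print(tune_search.best_params_)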

Tensorflow 2 XOR implementation

I am a TensorFlow newbie and, to start with, I want to train an XOR model: 4 input samples with 2 values each, learning 4 outputs with 1 value each.
Here is what I am doing in TF 2:
model = keras.Sequential([
    keras.layers.Input(shape=(2,)),
    keras.layers.Dense(units=2, activation='relu'),
    keras.layers.Dense(units=1, activation='softmax')
])

model.compile(optimizer='adam',
              loss=tf.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(
    (tf.cast([[0,0],[0,1],[1,0],[1,1]], tf.float32), tf.cast([0,1,1,0], tf.float32)),
    epochs=4,
    steps_per_epoch=1,
    validation_data=(tf.cast([[0.7, 0.7]], tf.float32), tf.cast([0], tf.float32)),
    validation_steps=1
)
The above code gives the error IndexError: list index out of range.
Please help me with this; I also want to understand how to come up with the shapes to give to the model.
The problem is in how you are passing the arguments to the fit function here:
history = model.fit(
    (tf.cast([[0,0],[0,1],[1,0],[1,1]], tf.float32), tf.cast([0,1,1,0], tf.float32)),
    epochs=4,
    steps_per_epoch=1,
    validation_data=(tf.cast([[0.7, 0.7]], tf.float32), tf.cast([0], tf.float32)),
    validation_steps=1)
Try replacing that call with this:
x_train = tf.cast([[0,0],[0,1],[1,0],[1,1]], tf.float32)
y_train = tf.cast([0,1,1,0], tf.float32)
x_test = tf.cast([[0.7, 0.7]], tf.float32)
y_test = tf.cast([0], tf.float32)

history = model.fit(
    x=x_train, y=y_train,
    epochs=4,
    steps_per_epoch=1,
    validation_data=(x_test, y_test),
    validation_steps=1
)
And your issue should be solved.
PS: Just a suggestion: when you are doing binary classification, use a sigmoid activation instead of a softmax, and correspondingly a BinaryCrossentropy loss instead of CategoricalCrossentropy. Good luck!
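A rough sketch of that suggestion (the number of hidden units and epochs here are arbitrary choices, not something the original answer specified):

model = keras.Sequential([
    keras.layers.Input(shape=(2,)),
    keras.layers.Dense(units=8, activation='relu'),    # a few more hidden units help XOR converge
    keras.layers.Dense(units=1, activation='sigmoid')  # sigmoid for a single binary output
])
model.compile(optimizer='adam',
              loss=tf.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
history = model.fit(x=x_train, y=y_train, epochs=500, verbose=0)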

Using SMOTEENN in GridSearchCV Pipeline with Preprocesing

I am working on a classification problem with a highly imbalanced dataset. I am trying to use SMOTEENN in the grid search pipeline; however, I keep getting this ValueError:
ValueError: Invalid parameter randomforestclassifier for estimator Pipeline(memory=None,
steps=[('preprocessor_X',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('scaler',
StandardScaler(copy=True,
with_mean=True,
with_std=True))],
verbose=False),
['number_of_participants',
'count_timely_submission',
'count_by_self',
'count_at_ra...
class_weight='balanced',
criterion='gini',
max_depth=None,
max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100,
n_jobs=None,
oob_score=False,
random_state=0,
verbose=0,
warm_start=False))],
verbose=False))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
I found online that SMOTEENN can be used with GridSearchCV if the Pipeline from imblearn is imported. I am using the Pipeline from imblearn but it still gives me this error.
The issue first started when I tried to use SMOTEENN to get balanced X and y variables. I have a prepare_data() function that splits the data into X and y, and I wanted to apply SMOTEENN in that function and return the balanced data. However, one of my features is a string and needs to go through OneHotEncoder, and SMOTEENN doesn't seem to handle strings. Thus, I needed to use it in the pipeline so that SMOTEENN runs after the preprocessing.
I am pasting my pipeline code below. Any help or explanation would be much appreciated! Thank you!
def ML_RandomF(X, y, random_state, n_folds, oneHot_ftrs, num_ftrs,
               ordinal_ftrs, ordinal_cats, beta, test_size, score_type):

    scoring = {'roc_auc_score': make_scorer(roc_auc_score),
               'f_beta': make_scorer(fbeta_score, beta=beta, average='weighted'),
               'accuracy': make_scorer(accuracy_score)}

    X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=test_size,
                                                        random_state=random_state)
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=random_state)

    reg = RandomForestClassifier(random_state=random_state, n_estimators=100,
                                 class_weight="balanced")
    sme = SMOTEENN(random_state=random_state)
    model = Pipeline([
        ('sampling', sme),
        ('classification', reg)])

    # ordinal encoder
    ordinal_transformer = Pipeline(steps=[
        ('ordinal', OrdinalEncoder(categories=ordinal_cats))])

    # oneHot encoder
    onehot_transformer = Pipeline(steps=[
        ('ordinal', OneHotEncoder(sparse=False, handle_unknown='ignore'))])

    # standard scaler
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())])

    preprocessor_X = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, num_ftrs),
            ('oneH', onehot_transformer, oneHot_ftrs),
            ('ordinal', ordinal_transformer, ordinal_ftrs)])

    pipe = Pipeline(steps=[('preprocessor_X', preprocessor_X), ('model', model)])

    param_grid = {'randomforestclassifier__max_depth': [3, 5, 7, 10],
                  'randomforestclassifier__min_samples_split': [10, 25, 40]}

    grid = GridSearchCV(pipe, param_grid=param_grid,
                        scoring=scoring, cv=kf, refit=score_type,
                        return_train_score=True, iid=True, verbose=2, n_jobs=-1)
    grid.fit(X_other, y_other)

    return grid, grid.score(X_test, y_test)
You named the RandomForestClassifier step classification, and the pipeline containing it is named model in your outer pipeline. Hence you have to change your param_grid as follows:
param_grid = {'model__classification__max_depth': [3, 5, 7, 10],
              'model__classification__min_samples_split': [10, 25, 40]}
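As the error message itself hints, you can list the parameter names a nested pipeline actually exposes with get_params(), for example:

# Inspect the parameter names the grid search will accept
print(sorted(pipe.get_params().keys()))
# this includes e.g. 'model__classification__max_depth'
# rather than 'randomforestclassifier__max_depth'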

StandardScaler with make_pipeline

If I use make_pipeline, do I still need to call the fit and transform functions to fit my model and transform the data, or will the pipeline perform these itself?
Also, does StandardScaler perform normalization as well, or only scaling?
Explaining the code: I want to apply PCA and later apply normalization with an SVM.
pca = PCA(n_components=4).fit(X)
X = pca.transform(X)
# training a linear SVM classifier 5-fold
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
clf = make_pipeline(preprocessing.StandardScaler(), SVC(kernel = 'linear'))
scores = cross_val_score(clf, X, y, cv=5)
I am also a bit confused about what happens if I don't use the fit function in the code below:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
clf = SVC(kernel = 'linear', C = 1)
scores = cross_val_score(clf, X, y, cv=5)
StandardScaler standardizes each feature (it removes the mean and scales to unit variance), so it covers both the scaling and the normalization you are asking about.
cross_val_score() fits (and transforms) your data for you inside each fold, so you don't need to call fit explicitly.
A more common approach would be to put all steps (StandardScaler, PCA, SVC) in one pipeline and use GridSearchCV for tuning the hyperparameters and choosing the best parameters (estimators).
Demo:
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel='linear', C=1))
])

param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))
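Once the grid is fitted, the winning hyperparameters and their cross-validated score can be read back in the usual way:

print(grid.best_params_)   # e.g. the chosen n_components, C and kernel
print(grid.best_score_)    # mean cross-validated score of that combination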

Cross validation in classifying text documents using scikit-learn

Do you do cross validation first, followed by feature extraction, or the other way around when classifying text documents using scikit-learn?
Here is my pipeline:
union = FeatureUnion(
    transformer_list=[
        ('tfidf', TfidfVectorizer()),
        ('featureEx', FeatureExtractor()),
        ('spell_chker', Spellingchecker()),
    ], n_jobs=-1)
I am doing it in the following way, but I wonder if I should extract the features first and then do the cross validation. In this example X is a list of documents and y are the labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train = union.fit_transform(X_train)
X_test = union.transform(X_test)

ch2 = SelectKBest(f_classif, k=7000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)

clf = SVC(C=1, gamma=0.001, kernel='linear', probability=True).fit(X_train, y_train)

print("classification report:")
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
print()
print()
Doing the feature selection first and then cross validating on those features is fairly common with text data, but it is less desirable: it can lead to over-fitting, and the cross-validation procedure may over-estimate your true accuracy.
When you do the feature selection first, that feature selection process gets to look at all the data. The point of cross validation is to hold out one fold from the others; by doing the feature selection first, you leak knowledge of that held-out data into the other folds. A sketch of the alternative follows.
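One way to avoid that leak is to keep the feature extraction and selection inside a pipeline, so they are re-fitted on each training fold only. A rough sketch along the lines of the code above (FeatureExtractor and Spellingchecker are the custom transformers from the question, and k=7000 is kept as in the question):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('union', union),                               # feature extraction
    ('chi2', SelectKBest(f_classif, k=7000)),       # feature selection, fitted per fold
    ('clf', SVC(C=1, gamma=0.001, kernel='linear')),
])

# Each fold now fits the vectorizers and the selector on its own training split only
scores = cross_val_score(pipe, X, y, cv=5)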