Grid search on random forests: "RandomForestClassifier instance is not fitted yet"

I'm trying to do a grid search on a random forest classifier, testing different numbers of PCA components and different values of n_estimators:
model_rf = RandomForestClassifier()
pca_rf = Pipeline([('pca', PCA()), ('rf', RandomForestClassifier())])
param_grid_rf = [{
    'pca__n_components': [20],
    'rf__n_estimators': [5]
}]
grid_cv_rf = GridSearchCV(estimator=pca_rf, cv=5, param_grid=param_grid_rf)
grid_cv_rf.fit(x_train, y_train1)
test_pca_evaluate = pca.transform(x_test)
y_pred = model_rf.predict(test_pca_evaluate)  # error here
In the last line I get an error "This RandomForestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this method."

It's a pretty straightforward error: the RandomForestClassifier whose predict method you're calling has never been fitted, because you never call model_rf.fit. Fitting grid_cv_rf does not fit that separate model_rf object; the grid search clones and fits its own copy of the pipeline you passed it.
I think what you want is grid_cv_rf.predict(x_test): the fitted grid_cv_rf object holds the whole pipeline, so it applies the PCA transform and the random forest prediction for you.
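For illustration, a minimal sketch of the whole flow, assuming x_train, y_train1 and x_test are defined as in the question:

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# the pipeline bundles PCA and the forest, so both get fitted by the grid search
pca_rf = Pipeline([('pca', PCA()), ('rf', RandomForestClassifier())])
param_grid_rf = [{'pca__n_components': [20], 'rf__n_estimators': [5]}]

grid_cv_rf = GridSearchCV(estimator=pca_rf, cv=5, param_grid=param_grid_rf)
grid_cv_rf.fit(x_train, y_train1)

# predict with the refit best pipeline: PCA transform + forest prediction in one call
y_pred = grid_cv_rf.predict(x_test)

# if you need the fitted sub-estimators themselves, take them from best_estimator_
fitted_pca = grid_cv_rf.best_estimator_.named_steps['pca']
fitted_rf = grid_cv_rf.best_estimator_.named_steps['rf']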

Related

get_support of the features selected in GridSearch CV

I am using a Pipeline with GridSearchCV, and I am trying to retrieve the features that were selected:
pipeline = Pipeline(
    [
        ('transform', SimpleImputer(strategy='mean')),
        ('selector', SelectKBest(f_regression)),
        ('classifier', KNeighborsClassifier())
    ]
)
scoring = ['precision', 'recall', 'accuracy']
CV = StratifiedKFold(n_splits=4, random_state=None, shuffle=True)
search = GridSearchCV(
    estimator=pipeline,
    param_grid={
        'selector__k': [3, 4, 5, 6, 7, 8, 9, 10],
        'classifier__n_neighbors': [3, 4, 5, 6, 7],
        'classifier__weights': ['uniform', 'distance'],
        'classifier__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
        'classifier__p': [1, 2]
    },
    n_jobs=-1,
    refit='accuracy',
    scoring=scoring,
    cv=CV,
    verbose=0
)
search.fit(data, target)
SelectKBest acts on the training data of each split instead of the whole dataset, which is exactly what I want.
The confusing part for me is this line, which returns a single set of features:
search.best_estimator_.named_steps['selector'].get_support()
What features am I getting here? I assume that for each fold there is a different set of selected features, depending on the split.
Since the search parameter refit is not False, the best-performing set of hyperparameters has been used to refit a model onto the entire training set (no splitting into folds for this part); that single model is what is exposed in the attribute best_estimator_.
You could define an additional scoring callable that would return the feature list; then in cv_results_ you would have the features selected for each hyperparameter combination and each fold.
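For example, a minimal sketch of reading that mask back into column names, under the assumption that data is a pandas DataFrame (with a plain array, use the column positions instead):

import numpy as np

# boolean mask from the selector of the single, refit best pipeline
mask = search.best_estimator_.named_steps['selector'].get_support()

# columns kept by the model refit on the whole training set
selected_columns = np.asarray(data.columns)[mask]
print(selected_columns)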

scikit-learn and imblearn: does GridSearchCV/RandomSearchCV apply preprocessing to the validation set as well?

I'm currently using sklearn for a school project and I have some questions about how GridSearchCV applies preprocessing algorithms such as PCA or Factor Analysis. Let's suppose I perform a hold-out split:
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = 0.1, stratify = y)
Then, I declare some hyperparameters and perform a GridSearchCV (it would be the same with RandomSearchCV but whatever):
params = {
    'linearsvc__C': [...],
    'linearsvc__tol': [...],
    'linearsvc__degree': [...]
}
clf = make_pipeline(PCA(), SVC(kernel='linear'))
model = GridSearchCV(clf, params, cv = 5, verbose = 2, n_jobs = -1)
model.fit(X_tr, y_tr)
My issue is this: my teacher told me that in a k-fold CV you should never fit the preprocessing algorithm (here PCA) on the validation split, only on the training split (here both splits are subsets of X_tr, and of course they change at every fold). So PCA should be fitted on the part of the fold used for training the model, and when the resulting model is evaluated on the validation split, that split should be transformed with the PCA model obtained from the training split. This ensures there are no leaks whatsoever.
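To make the rule concrete, this is what one fold of that leak-free procedure looks like written out by hand (train_idx and val_idx are assumed to come from a KFold split of X_tr):

from sklearn.decomposition import PCA
from sklearn.svm import SVC

# fit the preprocessing on the training part of the fold only
pca = PCA().fit(X_tr[train_idx])
X_fold_train = pca.transform(X_tr[train_idx])
X_fold_val = pca.transform(X_tr[val_idx])  # the validation split is only transformed, never fitted on

svc = SVC(kernel='linear').fit(X_fold_train, y_tr[train_idx])
fold_score = svc.score(X_fold_val, y_tr[val_idx])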
Does sklearn account for this?
And if it does: suppose that now I want to use imblearn to perform oversampling on an unbalanced set:
clf = make_pipeline(SMOTE(), SVC(kernel='linear'))
Still according to my teacher, you should not perform oversampling on the validation split either, as this would give misleading accuracy figures. So the rule above for PCA, where the validation split is later transformed with the fitted preprocessor, does not apply here: the sampler should touch the training split only, and the validation split should be left untouched.
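For reference, this is how that setup is usually written with imblearn's own pipeline helper, which accepts samplers such as SMOTE as steps (the parameter grid below is purely illustrative):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline  # imblearn's make_pipeline, not sklearn's
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

clf = make_pipeline(SMOTE(), SVC(kernel='linear'))
params = {'svc__C': [0.1, 1, 10]}  # illustrative values; the step name 'svc' comes from make_pipeline
model = GridSearchCV(clf, params, cv=5, n_jobs=-1)
model.fit(X_tr, y_tr)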
Does sklearn/imblearn account for this as well?
Many thanks in advance

GridSearchCV and its feature importance

With GridSearchCV, when I fit something like the following:
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
and after that, when I execute
grid_search.best_estimator_.feature_importances_
it gives an array of values.
So my question is: what values does this line return?
In your case, grid_search.best_estimator_ is the refit RandomForestRegressor, so grid_search.best_estimator_.feature_importances_ returns that forest's feature_importances_ array.
According to the RandomForestRegressor documentation:
feature_importances_ : array of shape = [n_features]
Return the feature importances (the higher, the more important the feature).
In other words, it returns one importance score per feature of your training set X_train. Each element of feature_importances_ corresponds to one feature of X_train (e.g. the first element of feature_importances_ refers to the first feature/column of X_train).
The higher the value of an element in feature_importances_, the more important that feature was to the fitted model.
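A small sketch of how these scores are often inspected, assuming X_train is a pandas DataFrame (with a plain array you would pair the scores with column positions instead):

import pandas as pd

# one importance score per column of X_train, from the refit best estimator
importances = pd.Series(
    grid_search.best_estimator_.feature_importances_,
    index=X_train.columns,
)
print(importances.sort_values(ascending=False))  # most important features first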

how the generator is trained with the output of discriminator in Generative adversarial Networks

Recently I have learned about Generative Adversarial Networks.
For training the Generator, I am somewhat confused about how it learns. Here is an implementation of a GAN:
# train generator
z = Variable(xp.random.uniform(-1, 1, (batchsize, nz), dtype=np.float32))
x = gen(z)
yl = dis(x)
L_gen = F.softmax_cross_entropy(yl, Variable(xp.zeros(batchsize, dtype=np.int32)))
L_dis = F.softmax_cross_entropy(yl, Variable(xp.ones(batchsize, dtype=np.int32)))
# train discriminator
x2 = Variable(cuda.to_gpu(x2))
yl2 = dis(x2)
L_dis += F.softmax_cross_entropy(yl2, Variable(xp.zeros(batchsize, dtype=np.int32)))
#print "forward done"
o_gen.zero_grads()
L_gen.backward()
o_gen.update()
o_dis.zero_grads()
L_dis.backward()
o_dis.update()
So it computes a loss for the Generator, as mentioned in the paper.
However, it calls the Generator's backward function based on the Discriminator's output, and the Discriminator's output is just a number (not an array).
But we know that in general, to train a network, we compute a loss at the last layer (a loss between the last layer's output and the target output) and then compute the gradients. So, for example, if the output is 64*64, we compare it with a 64*64 image, compute the loss and do the backpropagation.
However, in the GAN code I have seen, the loss for the Generator is computed from the Discriminator's output (which is just a number) and then backpropagation is called for the Generator. The Generator's last layer is, for example, 64*64 pixels, but the discriminator loss is 1*1 (which is different from the usual networks), so I do not understand how this causes the Generator to be trained.
I thought that if we attached the two networks (Generator followed by Discriminator), called backpropagation, but only updated the Generator's parameters, that would make sense and should work. But the code I see looks totally different.
So how is this possible?
Thanks
You say 'it calls the Generator backward function based on the Discriminator output. The discriminator output is just a number (not an array)', but a loss is always a scalar value; the mean squared error of two images is also a single scalar.
L_adversarial = E[log(D(x))] + E[log(1 − D(G(z)))]
where x is drawn from the real data distribution
and z is drawn from the latent distribution and transformed by the Generator.
Coming back to your actual question: the Discriminator network has a sigmoid activation in its last layer, so it outputs values in the range [0, 1]. The Discriminator tries to maximize this objective by maximizing both of its terms. The maximum of the first term is 0 and occurs when D(x) is 1; the maximum of the second term is also 0 and occurs when 1 − D(G(z)) is 1, i.e. when D(G(z)) is 0. So the Discriminator effectively performs binary classification by maximizing this objective: it tries to output 1 when fed x (real data) and 0 when fed G(z) (generated fake data).
The Generator, on the other hand, tries to minimize this objective; in other words, it tries to fool the Discriminator by generating fake samples that look like real samples. Over time both the Generator and the Discriminator get better and better. This is the intuition behind GANs.
Here is the same idea in PyTorch:
bce_loss = nn.BCELoss()  # bce_loss = -y*log(y_hat) - (1-y)*log(1-y_hat), similar to L_adversarial
Discriminator = .....  # some network
Generator = .....  # some network
optimizer_generator = .......  # some optimizer for the generator network
optimizer_discriminator = .......  # some optimizer for the discriminator network
z = ......  # latent samples that are transformed by the generator
real = .....  # a batch of real data
#####################
# Update Discriminator
#####################
optimizer_discriminator.zero_grad()
fake = Generator(z)
fake_prediction = Discriminator(fake.detach())  # detach so this step does not touch the generator
real_prediction = Discriminator(real)
# the discriminator wants 0 for fakes and 1 for real samples
discriminator_loss = bce_loss(fake_prediction, torch.zeros(batch_size)) \
                     + bce_loss(real_prediction, torch.ones(batch_size))
discriminator_loss.backward()
optimizer_discriminator.step()
#################
# Update Generator
#################
optimizer_generator.zero_grad()
fake = Generator(z)
fake_prediction = Discriminator(fake)
# the generator wants the discriminator to output 1 for its fakes; the scalar loss
# is backpropagated through the discriminator into the generator's weights
generator_loss = bce_loss(fake_prediction, torch.ones(batch_size))
generator_loss.backward()
optimizer_generator.step()

scikit-learn: get selected features when using SelectKBest within pipeline

I am trying to do feature selection as part of a scikit-learn pipeline, in a multi-label scenario. My goal is to select the best k features, for some given k.
It might be simple, but I don't understand how to get the indices of the selected features in such a scenario.
In a single-label scenario I could do something like this:
anova_filter = SelectKBest(f_classif, k=10)
anova_filter.fit_transform(data.X, data.Y)
anova_filter.get_support()
But in a multi-label scenario my labels have dimensions #samples x #unique_labels, so fit and fit_transform raise the following exception:
ValueError: bad input shape
which makes sense, because it expects labels of dimension [#samples]
In the multi-label scenario, it makes sense to do something like this:
clf = Pipeline([('f_classif', SelectKBest(f_classif, k=10)),('svm', LinearSVC())])
multiclf = OneVsRestClassifier(clf, n_jobs=-1)
multiclf.fit(data.X, data.Y)
But then the object I get is of type sklearn.multiclass.OneVsRestClassifier, which doesn't have a get_support function. How do I get at the trained SelectKBest model when it's used inside a pipeline?
The way you set it up, there will be one SelectKBest per class. Is that what you intended?
You can get them via
multiclf.estimators_[i].named_steps['f_classif'].get_support()
If you want one feature selection for all the OvR models,
you can do
clf = Pipeline([('f_classif', SelectKBest(f_classif, k=10)),
('svm', OneVsRestClassifier(LinearSVC()))])
and get the single feature selection with
clf.named_steps['f_classif'].get_support()
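Either way, the boolean mask returned by get_support() can be turned into the indices of the selected features; a small sketch for the second variant:

import numpy as np

mask = clf.named_steps['f_classif'].get_support()  # boolean mask over the input features
selected_idx = np.flatnonzero(mask)                # indices of the k selected features
print(selected_idx)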
