I am trying to use GridSearchCV to optimize a pipeline that does feature selection at the beginning and KNN classification at the end. I have fitted the model on my data set, but when I look at the best parameters found by GridSearchCV, it only gives the best parameter for SelectKBest. I have no idea why it doesn't show the best parameters for KNN.
Here is my code.
# Addition of KNN and SelectKBest
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

classifier = KNeighborsClassifier()
parameters = {"classify__n_neighbors": list(range(5, 15)),
              "classify__p": [1, 2]}
sel = SelectKBest(f_classif)
param = {'kbest__k': [10, 20, 30, 40, 50]}

# GridSearchCV with pipeline and parameter grid
model = GridSearchCV(Pipeline([('kbest', sel), ('classify', classifier)]),
                     param_grid=[param, parameters], cv=10)

# fitting the model
model.fit(X_new, y)

# the result
print(model.best_params_)
{'kbest__k': 40}
That's an incorrect way of merging the dicts, I believe. Try
param_grid={**param,**parameters}
or (Python 3.9+)
param_grid=param|parameters
When param_grid is a list, GridSearchCV explores the union of the grids generated by each dictionary in the list, not their cross product. So your search covers (1) the default k=10 selected features with every combination of classifier parameters, and separately (2) the default classifier parameters with each value of k. That the best parameters show only k=40 means that keeping more features, even with the default classifier settings, performed best. You can check cv_results_ to verify.
As dx2-66's answer says, merging the dictionaries will generate the full grid you are probably after. You could also just define a single dictionary from the start.
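For illustration, a minimal sketch of the single-dictionary version (reusing the X_new and y from the question):

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([('kbest', SelectKBest(f_classif)), ('classify', KNeighborsClassifier())])

# one dictionary -> the full cross product of k, n_neighbors and p is searched
param_grid = {'kbest__k': [10, 20, 30, 40, 50],
              'classify__n_neighbors': list(range(5, 15)),
              'classify__p': [1, 2]}

model = GridSearchCV(pipe, param_grid=param_grid, cv=10)
model.fit(X_new, y)
print(model.best_params_)   # now reports k, n_neighbors and p together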
I used sklearn GridSearchCV to search for the number of topics with an LDA model. After fitting, the fitted model is stored in CV_model.best_estimator_. According to the sklearn documentation, GridSearchCV has the default option refit=True, which will 'Refit an estimator using the best found parameters on the whole dataset.' (Sklearn GridSearchCV)
Since the documentation says it has already been fit on the full data, I believed CV_model.best_estimator_.fit_transform(full_train_data) should give the same result as CV_model.best_estimator_.transform(full_train_data). However, the outputs of fit_transform and transform differ. What did I miss? Should I use fit_transform or transform after GridSearchCV?
I realized it was due to the unfixed random state: fit_transform re-fits the already-refit best estimator from scratch, so the stochastic LDA optimization can land on a different solution each time, while transform just reuses the fit produced by refit=True. After I assigned a fixed random state, .transform() and .fit_transform() returned the same results.
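For reference, a minimal sketch of the pattern (the estimator and parameter values below are only illustrative; full_train_data stands for the document-term matrix from the question):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# a fixed random_state makes repeated fits reproducible
lda = LatentDirichletAllocation(random_state=0)
CV_model = GridSearchCV(lda, param_grid={'n_components': [5, 10, 15]}, cv=5)
CV_model.fit(full_train_data)   # refit=True: best_estimator_ is already fit on the whole data

# reuse the refit model instead of fitting it again
topic_distributions = CV_model.best_estimator_.transform(full_train_data)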
I am using a pipeline and GridSearchCV to select features automatically. Since the data set is small, I set the 'cv' parameter of GridSearchCV to StratifiedShuffleSplit. The code looks as follows:
from sklearn.feature_selection import SelectKBest
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV

selection = SelectKBest()
clf = LinearSVC()
pipeline = Pipeline([("select", selection), ("classify", clf)])
cv = StratifiedShuffleSplit(n_splits=50)
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=cv)
grid_search.fit(X, y)
It seems that SelectKBest acts on the training data of each split instead of on the whole data set (the latter is what I want), since the result changes if I separate 'select' from 'classify' so that the StratifiedShuffleSplit only applies to the classifier.
What is the correct way of using a pipeline with GridSearchCV in this case? Thanks a lot!
Cross-validating the whole pipeline (i.e. running SelectKBest only on the training part of each split) is the way to go. Otherwise the model is allowed to look at the test part, which means the quality estimates become wrong. The best hyperparameters found with these unfair quality estimates may also work worse on truly unseen data.
At prediction time you're not going to re-run SelectKBest on (training dataset + example to be predicted) and then re-train the classifier, so why would you do that in evaluation?
You can also find the answer to this question on pages 245-246 of the book "The Elements of Statistical Learning" (2nd edition) by Hastie et al.
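To see the difference concretely, here is a small sketch (using synthetic data, so the numbers are only illustrative) that contrasts the leaky approach of selecting features on the full data set before cross-validation with the correct approach of putting SelectKBest inside the pipeline:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=1000, n_informative=10, random_state=0)
cv = StratifiedShuffleSplit(n_splits=50, random_state=0)

# leaky: feature selection has already seen the test parts of every split
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
print(cross_val_score(LinearSVC(), X_sel, y, cv=cv).mean())   # optimistically biased

# correct: selection is re-fit on the training part of every split
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)), ("classify", LinearSVC())])
print(cross_val_score(pipe, X, y, cv=cv).mean())              # honest estimate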
I have one dataset and need to do cross-validation, for example a 10-fold cross-validation, on the entire dataset. I would like to use a radial basis function (RBF) kernel with parameter selection (there are two parameters for an RBF kernel: C and gamma). Usually, people select the hyperparameters of an SVM on a dev set and then apply the best hyperparameters to the test set for evaluation. However, in my case the original dataset is partitioned into 10 subsets, and each subset is tested in turn using the classifier trained on the remaining 9 subsets. Obviously, we do not have fixed training and test data. How should I do hyperparameter selection in this case?
Is your data partitioned into exactly those 10 partitions for a specific reason? If not, you could concatenate/shuffle them together again and then do regular (repeated) cross-validation to perform a parameter grid search. For example, using 10 partitions and 10 repeats gives a total of 100 training and evaluation sets. These are used to train and evaluate every parameter set, so you get 100 results per parameter set you tried, and the average performance per parameter set can then be computed from those 100 results.
This process is built into most ML tools already, as in this short example in R using the caret library:
library(caret)
library(lattice)
library(doMC)
registerDoMC(3)

model <- train(x = iris[, 1:4],
               y = iris[, 5],
               method = 'svmRadial',
               preProcess = c('center', 'scale'),
               # all permutations of these parameters get evaluated
               tuneGrid = expand.grid(C = 3**(-3:3), sigma = 3**(-3:3)),
               trControl = trainControl(method = 'repeatedcv',
                                        number = 10,
                                        repeats = 10,
                                        # store results of all parameter sets on all partitions and repeats
                                        returnResamp = 'all',
                                        allowParallel = T))

# performance of the different parameter sets (e.g. average and standard deviation of performance)
print(model$results)

# visualization of the above
levelplot(x = Accuracy ~ C * sigma, data = model$results,
          col.regions = gray(100:0/100), scales = list(log = 3))

# results of all parameter sets over all partitions and repeats; the metrics above are calculated from these
str(model$resample)
Once you have evaluated a grid of hyperparameters you can choose a reasonable parameter set ("model selection"), e.g. by choosing a well-performing yet still reasonably simple model.
BTW: I would recommend repeated cross-validation over plain cross-validation if possible (possibly using more than 10 repeats, but the details depend on your problem); and as @christian-cerri already recommended, having an additional, unseen test set that is used to estimate the performance of your final model on new data is a good idea.
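The same repeated-CV grid search can be sketched in scikit-learn as well (the data set and parameter ranges below are just placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# scaling lives inside the pipeline, so it is re-fit on each training split
pipe = Pipeline([('scale', StandardScaler()), ('svm', SVC(kernel='rbf'))])

param_grid = {'svm__C': [3.0**i for i in range(-3, 4)],
              'svm__gamma': [3.0**i for i in range(-3, 4)]}

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
search = GridSearchCV(pipe, param_grid=param_grid, cv=cv)
search.fit(X, y)

print(search.best_params_)
print(search.cv_results_['mean_test_score'])   # average over the 100 resamples per parameter set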
I am trying to do feature selection as part of a scikit-learn pipeline, in a multi-label scenario. My goal is to select the best k features, for some given k.
It might be simple, but I don't understand how to get the selected feature indices in such a scenario.
In a regular scenario I could do something like this:
anova_filter = SelectKBest(f_classif, k=10)
anova_filter.fit_transform(data.X, data.Y)
anova_filter.get_support()
but in a multilabel scenario my label matrix has dimensions #samples x #unique_labels, so fit and fit_transform raise the following exception:
ValueError: bad input shape
which makes sense, because SelectKBest expects labels of shape [#samples].
In the multilabel scenario, it makes sense to do something like this:
clf = Pipeline([('f_classif', SelectKBest(f_classif, k=10)),('svm', LinearSVC())])
multiclf = OneVsRestClassifier(clf, n_jobs=-1)
multiclf.fit(data.X, data.Y)
but then the object I get back is of type sklearn.multiclass.OneVsRestClassifier, which doesn't have a get_support method. How do I get the trained SelectKBest model when it's used inside a pipeline?
The way you set it up, there will be one SelectKBest per class. Is that what you intended?
You can get them via
multiclf.estimators_[i].named_steps['f_classif'].get_support()
If you want one feature selection for all the OvR models,
you can do
clf = Pipeline([('f_classif', SelectKBest(f_classif, k=10)),
                ('svm', OneVsRestClassifier(LinearSVC()))])
and get the single feature selection with
clf.named_steps['f_classif'].get_support()
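In either case, the boolean mask from get_support() maps back to concrete feature indices; a small sketch, assuming the multiclf and clf objects from above:

import numpy as np

# per-class selection (first variant): one mask per OvR estimator
for i, est in enumerate(multiclf.estimators_):
    mask = est.named_steps['f_classif'].get_support()
    print(i, np.where(mask)[0])    # indices of the k features kept for class i

# shared selection (second variant): a single index array
# shared_idx = clf.named_steps['f_classif'].get_support(indices=True)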
I am using the CRFSuite package here
http://www.chokkan.org/software/crfsuite/tutorial.html
and I have successfully used it to build a classifier and tag text. However, I'm wondering if I can get a confidence value for each prediction it makes?
It doesn't seem so. What I would really like is to get the probability of a word being each type of tag ('PER', 'LOC', 'MISC', etc), rather than just the prediction itself.
The API supports extracting conditional (marginal) probabilities. I guess you mean that the crfsuite binary does not offer that as an option. You could edit the source and add the option yourself.
I hope this serves as an answer: sklearn-crfsuite provides a probability for each label.
predict_marginals(X)
    Make a prediction.
    Parameters: X (list of lists of dicts) – feature dicts in python-crfsuite format
    Returns: y – predicted probabilities for each label at each position
    Return type: list of lists of dicts
Source: https://sklearn-crfsuite.readthedocs.io/en/latest/_modules/sklearn_crfsuite/estimator.html#CRF.predict_marginals
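A rough usage sketch (the feature dicts and labels here are made up; real features would come from your own feature extraction):

import sklearn_crfsuite

# toy training data in python-crfsuite format: one feature dict per token
X_train = [[{'word.lower()': 'john', 'word.istitle()': True},
            {'word.lower()': 'lives', 'word.istitle()': False}]]
y_train = [['PER', 'O']]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100)
crf.fit(X_train, y_train)

# one dict per token, mapping every label to its marginal probability
marginals = crf.predict_marginals(X_train)
print(marginals[0][0])   # e.g. {'PER': ..., 'O': ...}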