scikit-learn: get selected features when using SelectKBest within pipeline - machine-learning

I am trying to do features selection as a part of the a scikit-learn pipeline, on a multi-label scenario. My purpose is to select best K features, for some given k.
It might be simple, but I don't understand how to get the selected features indices in such a scenario.
on a regular scenario I could do something like that:
anova_filter = SelectKBest(f_classif, k=10)
anove_filter.fit_transform(data.X, data.Y)
anova_filter.get_support()
but on a multilabel scenario my labels dimensions are #samples X #unique_labels so fit and fit_transform yield the following exception:
ValueError: bad input shape
which makes sense, because it expects labels of dimension [#samples]
on the multilabel scenario, it makes sense to do something like that:
clf = Pipeline([('f_classif', SelectKBest(f_classif, k=10)),('svm', LinearSVC())])
multiclf = OneVsRestClassifier(clf, n_jobs=-1)
multiclf.fit(data.X, data.Y)
but then the object I'm getting is of type sklearn.multiclass.OneVsRestClassifier which doesn't have a get_support function. How do I get the trained SelectKBest model when it's used during a pipeline?

The way you set it up, there will be one SelectKBest per class. Is that what you intended?
You can get them via
multiclf.estimators_[i].named_steps['f_classif'].get_support()
If you want one feature selection for all the OvR models,
you can do
clf = Pipeline([('f_classif', SelectKBest(f_classif, k=10)),
('svm', OneVsRestClassifier(LinearSVC()))])
and get the single feature selection with
clf.named_steps['f_classif'].get_support()

Related

Feature selection with GridsearchCV

I am trying to use GridSearchCV to optimize a pipeline that does feature selection in the beginning and classification using KNN at the end. I have fitted the model using my data set but when I see the best parameters found by GridSearchCV, it only gives the best parameters for SelectKBest. I have no idea why it doesn't show the best parameters for KNN.
Here is my code.
Addition of KNN and SelectKbest
classifier = KNeighborsClassifier()
parameters = {"classify__n_neighbors": list(range(5,15)),
"classify__p":[1,2]}
sel = SelectKBest(f_classif)
param={'kbest__k': [10, 20 ,30 ,40 ,50]}
GridsearchCV with pipeline and parameter grid
model = GridSearchCV(Pipeline([('kbest',sel),('classify', classifier)]),
param_grid=[param,parameters], cv=10)
fitting the model
model.fit(X_new, y)
the result
print(model.best_params_)
{'kbest__k': 40}
That's an incorrect way of merging dicts I believe. Try
param_grid={**param,**parameters}
or (Python 3.9+)
param_grid=param|parameters
When param_grid is a list, the disjoint union of the grids generated by each dictionary in the list is explored. So your search is over (1) the default k=10 selected features and every combination of classifier parameters, and separately (2) the default classifier parameters and each value of k. That the best parameters just show k=40 means that having more features, even with default classifier, performed best. You can check your cv_results_ to verify.
As dx2-66 answers, merging the dictionaries will generate the full grid you probably are after. You could also just define a single dictionary from the start.

keras model output vector of one hot vectors, is it possible? are there any alternatives if not?

So I have an output vector of dim=7 and 4 possible classes for each position, so my question is, is it possible to feed the keras model a vector of one hot vectors, where each position of the vector is a one hot vector? something like this [[1000],[1000],[0100],[0010],[0001],[0001],[0010]].
If this is not possible are there any alternatives?
If you want your output of your model to be like that when your model = keras.models.Model(...), the answer is not possible because the output that you provide (which is like a step respond "[1000] => [0000]") will have a gradient of infinity at step and 0 at other point.
What people do is to create a model that give distribution over different action and select the highest probability as predicted value and using cross entropy loss to optimize the model. For example, from your output [1,0,0,0] you will have something like [0.9,0.01,0.01,0.08] instead. Then you can pick first instance as predicted value. This will make sure that your model have gradient at all point.
So if you really want your model to have dim = 7 and 4 different value, you can create output size of 28 = 7*4 with sigmoid activation, then pick first 4 as your dimension 1 (maybe using something like np.argmax(output[0:4])) and so on.

How to get jlibsvm prediction probability in multi-class classification

I am new to SVM. I am using jlibsvm for a multi-class classification problem. Basically, I am doing a sentence classification problem. There are 3 Classes. What I understood is I am doing One-against-all classification. I have a comparatively small train set. A total of 75 sentences, In which 25 sentences belongs to each class.
I am making 3 SVMs (so 3 different models), where, while training, in SVM_A, sentences belong to CLASS A will have a true label, i.e., 1 and other sentences will have a -1 label. Correspondingly done for SVM_B, and SVM_C.
While testing, to get the true label of a sentence, I am giving the sentence to 3 models and I am taking the prediction probability returned by these 3 models. Which one returns the highest will be the class the sentence belong to.
This is how I am doing. But I am getting the same prediction probability for every sentence in the test set for all models.
A predicted:0.012820514
B predicted:0.012820514
C predicted:0.012820514
These values repeat for all sentences in the training set.
The following is how I set parameters for training:
C_SVC svm = new C_SVC();
MutableBinaryClassificationProblemImpl problem;
ImmutableSvmParameterGrid.Builder builder = ImmutableSvmParameterGrid.builder();
// create training parameters ------------
HashSet<Float> cSet;
HashSet<LinearKernel> kernelSet;
cSet = new HashSet<Float>();
cSet.add(1.0f);
kernelSet = new HashSet<LinearKernel>();
kernelSet.add(new LinearKernel());
// configure finetuning parameters
builder.eps = 0.001f; // epsilon
builder.Cset = cSet; // C values used
builder.kernelSet = kernelSet; //Kernel used
builder.probability=true; // To get the prediction probability
ImmutableSvmParameter params = builder.build();
What am I doing wrong?
Is there any other better way to do multi-class classification other than this?
You are getting the same output, because you generate the same model three times.
The reason for this is, that jlibsvm is able to perform multiclass classification out of the box based on the provided data (LIBSVM itself supports this too). If it detects, that more than two class labes are provided in the given data, it automatically performs multiclass classification. So there is no need for a manually 1vsN approach. Just supply the data with class-labels for each category.
However, jlibsvm is still in beta and relies on a rather old version of LIBSVM (2.88). A lot has changed. For a more intiuitive Java binding (in comparison to the default LIBSVM version), you can take a look at zlibsvm, which is available via Maven Central and based on the latest LIBSVM version.

Transformer of a pipeline acts on the training data instead of the whole data in GridSearchCV

I am using pipeline and GridSearchCV to select features automatically. Since the data set is small, I set the parameter 'cv' in GridSearchCV to StratifiedShuffleSplit. The code looks like as follows:
selection = SelectKBest()
clf = LinearSVC()
pipeline = Pipeline([("select", selection), ("classify", clf)])
cv = StratifiedShuffleSplit(n_splits=50)
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv = cv)
grid_search.fit(X, y)
It seems that SelectKBest acts on the training data of each split instead of the whole data set (the latter is what I want) since the result becomes different if I separate the 'select' and 'classify', where the StratifiedShuffleSplit will for sure only act on the classifier.
What is the correct way of using pipeline and GridSearchCV in this case? Thanks a lot!
Cross-validating the whole pipeline (i.e. running SelectKBest only on training part of each split) is the way to go. Otherwise the model is allowed to look at the test part - it means quality estimates become wrong. Best hyperparameters found with these unfair quality estimates may also work worse on a real unseen data.
At prediction time you're not going to re-run SelectKBest on (training dataset + example to be predicted) and then re-train the classifier, why would you do that in evaluation?
You can also find the answer of this question from page 245 to page 246 in the book-"The elements of statistical learning (2rd edition)" by Hastie,etc.

How to use SelectFromModel in sklearn to find the positively informative features for a class

I think I understand that until recently people used the attribute coef_ to extract the most informative features from linear models in python's machine learning library sklearn. Now users get pointed to SelectFromModel instead. SelectFromModel allows to reduce the features based on a threshold. So something like the following code reduces the features down to those features which have an importance > 0.5. My question now: Is there any way to determine whether a feature is positivly or negatively discriminating for a class?
I have my data in a pandas dataframe called data, first column a list of filenames of text files, second column the label.
count_vect = CountVectorizer(input="filename", analyzer="word")
X_train_counts = count_vect.fit_transform(data["filenames"])
print(X_train_counts.shape)
tf_transformer = TfidfTransformer(use_idf=True)
traindata = tf_transformer.fit_transform(X_train_counts)
print(traindata.shape) #report size of the training data
clf = LogisticRegression()
model = SelectFromModel(clf, threshold=0.5)
X_transform = model.fit_transform(traindata, data["labels"])
print("reduced features: ", X_transform.shape)
#get the names of all features
words = np.array(count_vect.get_feature_names())
#get the names of the important features using the boolean index from model
print(words[model.get_support()])
To my knowledge you need to stick back to the .coef_ method and see which coefficients are negative or positive. a negative coefficient obviously decreases the odds of that class to happen (so negative relationship), while a positive coefficient increases the odds the class to happen (so positive relationship).
However this method will not give you the significance, only the direction. You will need the SelectFromModel method to extract that.

Resources