Meta classifier based on "or" logic in scikit-learn - machine-learning

How can I build a meta-classifier in scikit-learn out of N binary classifiers which will return 1 if any of the classifiers returns 1?
Currently I've tried VotingClassifier, but it lacks the logic that I need, both with voting equal to hard and soft. Pipeline seems to be oriented towards sequential computation
I can write the logic by myself, but I am wondering if there is anything built-in?

The built-in options are only soft and hard voting. As you mentioned, we can create a custom function to this meta-classifier, which uses OR logic based on the source code here. This custom meta classifier can fit into the pipeline as well.
from sklearn.utils.validation import check_is_fitted
class CustomMetaClassifier(VotingClassifier):
def predict(self, X):
""" Predict class labels for X.
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
The input samples.
Returns
----------
maj : array-like, shape = [n_samples]
Predicted class labels.
"""
check_is_fitted(self, 'estimators_')
maj = np.max(eclf1._predict(X), 1)
maj = self.le_.inverse_transform(maj)
return maj
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.ensemble import RandomForestClassifier, VotingClassifier
>>> clf1 = LogisticRegression(solver='lbfgs', multi_class='multinomial',
... random_state=1)
>>> clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
>>> clf3 = GaussianNB()
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> eclf1 = CustomMetaClassifier(estimators=[
... ('lr', clf1), ('rf', clf2), ('gnb', clf3)])
>>> eclf1 = eclf1.fit(X, y)
>>> eclf1.predict(X)
array([1, 1, 1, 2, 2, 2])

Related

Is it necessary to use cross-validation after data is split using StratifiedShuffleSplit?

I used StratifiedShuffleSplit to split the data and now I am wondering if I need to use cross-validation again as I go for building the classification model(Logistic Regression,KNN,Random Forest etc.) I am confused about it because reading the documentation in Sklearn I get the impression that StratifiedShuffleSplit is a mix of splitting the data and cross-validating it at the same time.
StratifiedShuffleSplit provides you just a list with train/test indices. How it will be used depends on you.
You can fit the model with train set and predict on the test and calculate the score manually - so implementing cross validation by yourself
Or you can use cross_val_score and pass StratifiedShuffleSplit() to it and cross_val_score will do the same thing.
Example:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=1, random_state=1)
# Calculate scores automatically
accuracy_per_split = cross_val_score(model, X, y, scoring="accuracy", cv=sss, n_jobs=1)
print(f"Accuracies per splits: {accuracy_per_split}")
# Calculate scores manually
accuracy_per_split = []
for train_index, test_index in sss.split(X, y):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
accuracy_per_split.append(acc)
print(f"Accuracies per splits: {accuracy_per_split}")

Reverse column transform

In a machine learning model y_train is a string so column transform was applied in y_train before training the model ,How to remove the transform while printing the end result
All scikit-learn transformers have .inverse_transform method to do so. e.g. LabelEncoder https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

Naive Bayes model from scikit learn

What is the difference between the two gnb.fit()?
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train.ravel())
gnb.fit(X_train, y_train).predict(X_test)
I see two differences:
np.ravel flattens arrays, e.g.
x = np.array([[1, 2, 3], [4, 5, 6]])
np.ravel(x)
array([1, 2, 3, 4, 5, 6])
gnb.fit returns the model itself, and it can be used to immediately predict. Predicting does not modify the model.
If your y_train is one-dimensional, you'll get the same model regardless of whether you call gbn.fit(x, y) or gbn.fit(x, y.ravel()) because y == y.ravel()

Is F1 micro the same as Accuracy?

I have tried many examples with F1 micro and Accuracy in scikit-learn and in all of them, I see that F1 micro is the same as Accuracy. Is this always true?
Script
from sklearn import svm
from sklearn import metrics
from sklearn.cross_validation import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score, accuracy_score
# prepare dataset
iris = load_iris()
X = iris.data[:, :2]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# svm classification
clf = svm.SVC(kernel='rbf', gamma=0.7, C = 1.0).fit(X_train, y_train)
y_predicted = clf.predict(X_test)
# performance
print "Classification report for %s" % clf
print metrics.classification_report(y_test, y_predicted)
print("F1 micro: %1.4f\n" % f1_score(y_test, y_predicted, average='micro'))
print("F1 macro: %1.4f\n" % f1_score(y_test, y_predicted, average='macro'))
print("F1 weighted: %1.4f\n" % f1_score(y_test, y_predicted, average='weighted'))
print("Accuracy: %1.4f" % (accuracy_score(y_test, y_predicted)))
Output
Classification report for SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.7, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
precision recall f1-score support
0 1.00 0.90 0.95 10
1 0.50 0.88 0.64 8
2 0.86 0.50 0.63 12
avg / total 0.81 0.73 0.74 30
F1 micro: 0.7333
F1 macro: 0.7384
F1 weighted: 0.7381
Accuracy: 0.7333
F1 micro = Accuracy
In classification tasks for which every test case is guaranteed to be assigned to exactly one class, micro-F is equivalent to accuracy. It won't be the case in multi-label classification.
This is because we are dealing with a multi class classification , where every test data should belong to only 1 class and not multi label , in such case where there is no TN , we can call True Negatives as True Positives.
Formula wise ,
correction : F1 score is 2* precision* recall / (precision + recall)
Micoaverage precision, recall, f1 and accuracy are all equal for cases in which every instance must be classified into one (and only one) class. A simple way to see this is by looking at the formulas precision=TP/(TP+FP) and recall=TP/(TP+FN). The numerators are the same, and every FN for one class is another classes's FP, which makes the denominators the same as well. If precision = recall, then f1 will also be equal.
For any inputs should should be able to show that:
from sklearn.metrics import accuracy_score as acc
from sklearn.metrics import f1_score as f1
f1(y_true,y_pred,average='micro')=acc(y_true,y_pred)
I had the same issue so I investigated and came up with this:
Just thinking about the theory, it is impossible that accuracy and the f1-score are the very same for every single dataset. The reason for this is that the f1-score is independent from the true-negatives while accuracy is not.
By taking a dataset where f1 = acc and adding true negatives to it, you get f1 != acc.
>>> from sklearn.metrics import accuracy_score as acc
>>> from sklearn.metrics import f1_score as f1
>>> y_pred = [0, 1, 1, 0, 1, 0]
>>> y_true = [0, 1, 1, 0, 0, 1]
>>> acc(y_true, y_pred)
0.6666666666666666
>>> f1(y_true,y_pred)
0.6666666666666666
>>> y_true = [0, 1, 1, 0, 1, 0, 0, 0, 0]
>>> y_pred = [0, 1, 1, 0, 0, 1, 0, 0, 0]
>>> acc(y_true, y_pred)
0.7777777777777778
>>> f1(y_true,y_pred)
0.6666666666666666

How do you use fit_params for RandomizedSearch with VotingClassifier in Sklearn?

Hi I'm trying to use fit_params (for sample_weight on GradientBoostingClassifier) for RandomizedSearch with VotingClassifier in Sklearn since the dataset is unbalanced. Could someone give me advice and possibly code sample?
My current-not-working code is below:
random_search = RandomizedSearchCV(my_votingClassifier, param_distributions=param_dist,
n_iter=n_iter_search, n_jobs=-1, fit_params={'sample_weight':y_np_array})
Error:
TypeError: fit() got an unexpected keyword argument 'sample_weight'
Taking into account that there doesn't seem to be a direct way to pass sample_weight parameter through the VotingClassifier I came across this little "hack":
Override the fit method of the classifiers at the bottom. For example, if you are using a DecisionTreeClassifier you could override its fit method by passing through the desired sample_weight parameter.
class MyDecisionTreeClassifier(DecisionTreeClassifier):
def fit(self, X , y = None):
return super(DecisionTreeClassifier, self).fit(X,y,sample_weight=y)
Now in your ensemble of classifiers in your VotingClassifier you can use your own MyDecisionTreeClassifier.
Full working example:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import RandomizedSearchCV
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
class MyDecisionTreeClassifier(DecisionTreeClassifier):
def fit(self, X , y = None):
return super(DecisionTreeClassifier, self).fit(X,y,sample_weight=y)
clf1 = MyDecisionTreeClassifier()
clf2 = RandomForestClassifier()
params = {'dt__max_depth': [5, 10],'dt__max_features':[1,2]}
eclf = VotingClassifier(estimators=[('dt', clf1), ('rf', clf2)], voting='hard')
random_search = RandomizedSearchCV(eclf, param_distributions=params,n_iter=4)
random_search.fit(X, y)
print(random_search.grid_scores_)
print(random_search.best_score_)

Resources