scikit-learn calculate F1 in multilabel classification - machine-learning

I am trying to calculate macro-F1 with scikit-learn in a multi-label classification setting:
from sklearn.metrics import f1_score
y_true = [[1,2,3]]
y_pred = [[1,2,3]]
print(f1_score(y_true, y_pred, average='macro'))
However, it fails with the error message:
ValueError: multiclass-multioutput is not supported
How can I calculate macro-F1 for multi-label classification?

In the current scikit-learn release, your code results in the following warning:
DeprecationWarning: Direct support for sequence of sequences multilabel
representation will be unavailable from version 0.17. Use
sklearn.preprocessing.MultiLabelBinarizer to convert to a label
indicator representation.
Following this advice, you can use sklearn.preprocessing.MultiLabelBinarizer to convert the multilabel representation into the label-indicator form accepted by f1_score. For example:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score
y_true = [[1,2,3]]
y_pred = [[1,2,3]]
m = MultiLabelBinarizer().fit(y_true)
f1_score(m.transform(y_true),
         m.transform(y_pred),
         average='macro')
# 1.0
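The same approach works with more than one sample. A slightly larger sketch (label sets chosen arbitrarily) showing the per-label scores alongside the macro and micro averages:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

y_true = [[1, 2], [2, 3], [1, 3]]
y_pred = [[1, 2], [2], [1, 2, 3]]

m = MultiLabelBinarizer().fit(y_true)
Y_true = m.transform(y_true)
Y_pred = m.transform(y_pred)

print(f1_score(Y_true, Y_pred, average=None))     # one F1 per label
print(f1_score(Y_true, Y_pred, average='macro'))  # unweighted mean over labels
print(f1_score(Y_true, Y_pred, average='micro'))  # computed from global TP/FP/FN counts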

Related

Get 100% accuracy score on Decision tree model

I got 100% accuracy on my decision tree model but only 75% accuracy with a random forest.
Is there something wrong with my model, or is a decision tree simply best suited for the dataset provided?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state= 30)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)
At first it may look like your model is overfitted, but that is not the case, since you have put the test set aside.
The likely reason is a data leak. A random forest randomly excludes some features for every tree. Now suppose you have the label (or a feature perfectly correlated with it) among your features: in some trees that feature gets excluded and the accuracy drops, while the single decision tree always has it available and predicts the result perfectly.
How can you find out whether this is the case?
Visualize the decision tree; if this guess is right you will find that there are only a few decision nodes. You can also look at the correlation between the label and every feature and check whether any of them is perfectly correlated, as in the sketch below.
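A rough sketch of both checks (df and its 'label' column are hypothetical placeholders for the question's data, which is not shown; classifier is the fitted tree from the question):
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# a suspiciously shallow tree (one or two decision nodes) hints at leakage
plot_tree(classifier, filled=True)
plt.show()

# correlation of every numeric feature with the label; values near 1.0 are suspect
print(df.corr()['label'].sort_values(ascending=False))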

How to draw ROC curve for a multi-class dataset?

I have a multi-class confusion matrix and would like to draw the associated ROC curve for one of its classes (e.g. class 1). I know the "one-vs-all-others" approach should be used in this case, but I want to know how exactly we need to change the threshold to obtain different pairs of TP and corresponding FP rates.
scikit-learn has a handy implementation (roc_curve) which calculates the tpr and fpr, and another function (auc) which computes the area under the curve for you. You can apply this to your data by treating each class on its own (all other data being negative) and looping through the classes. The code below was inspired by the scikit-learn example on this topic.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# generating synthetic data
N_classes = 3
N_per_class = 100
labels = np.concatenate([[i] * N_per_class for i in range(N_classes)])
preds = np.stack([np.random.uniform(0, 1, N_per_class * N_classes) for _ in range(N_classes)]).T
preds /= preds.sum(1, keepdims=True)  # rows sum to 1, approximate softmax

fpr, tpr, roc_auc = {}, {}, {}
f, ax = plt.subplots()

# generate one-vs-rest ROC data for each class
for i in range(N_classes):
    fpr[i], tpr[i], _ = roc_curve(labels == i, preds[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
    ax.plot(fpr[i], tpr[i])

plt.legend(['Class {:d}'.format(d) for d in range(N_classes)])
plt.xlabel('FPR')
plt.ylabel('TPR')
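If you also want a single summary number rather than one curve per class, roc_auc_score (already imported above) can average the one-vs-rest AUCs directly. A small sketch reusing the synthetic labels and preds from above:
macro_auc = roc_auc_score(labels, preds, multi_class='ovr', average='macro')
print(macro_auc)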

SVM duality: set of hyperparameters not supported

I am trying to train an SVM model on the Iris dataset. The aim is to distinguish Iris virginica flowers from the other types of flowers. Here is the code:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
X = iris["data"][:, (2,3)] # petal length, petal width
y = (iris["target"]==2).astype(np.float64) # Iris virginica
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", dual=False))
])
svm_clf.fit(X,y)
My book, Aurelien Geron's "Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow" (2nd edition), says on page 156:
For better performance, you should set the dual hyperparameter to
False, unless there are more features than training instances
But if I set the dual hyperparameter to False, I get the following error:
ValueError: Unsupported set of arguments: The combination of penalty='l2' and loss='hinge' are not supported when dual=False, Parameters: penalty='l2', loss='hinge', dual=False
It instead works if I set the dual hyperparameter to True.
Why is this set of hyperparameters not supported?
An L2-regularized SVM with L1 loss (hinge) cannot be solved in its primal form; only its dual form can be solved efficiently. This is a limitation of the LIBLINEAR library used by sklearn. If you want to solve the primal form of the L2-regularized SVM, you will have to use the L2 loss (squared hinge) instead.
LinearSVC(C=1, loss='squared_hinge', dual=False).fit(X,y)
For more details: Link 1
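To make the constraint concrete, here is a minimal sketch (reusing X and y from the question) of the two combinations LIBLINEAR does support:
# dual form: required for the plain hinge loss
svm_dual = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", dual=True))
])
svm_dual.fit(X, y)

# primal form: works, but only with the squared hinge loss
svm_primal = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="squared_hinge", dual=False))
])
svm_primal.fit(X, y)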

Logistic Regression sklearn with categorical Output

I have to train a model with logistic regression in sklearn. I read everywhere that the outcome has to be binary, but my labels are good, bad, or normal. I have 12 features and I don't know how to deal with three labels. I am very thankful for every answer.
You can use Multinomial Logistic Regression.
In python, you can modify your Logistic Regression code as:
LogisticRegression(multi_class='multinomial').fit(X_train,y_train)
You can see Logistic Regression documentation in Scikit-Learn for more details.
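A minimal sketch of what that looks like end to end (synthetic data standing in for your 12 features; note that scikit-learn accepts the string labels directly):
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_train = rng.rand(90, 12)                                # 12 features
y_train = rng.choice(['good', 'bad', 'normal'], size=90)  # three string labels

clf = LogisticRegression(multi_class='multinomial', max_iter=1000).fit(X_train, y_train)
print(clf.classes_)              # label order used by predict_proba
print(clf.predict(X_train[:5]))  # predictions come back as the original strings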
This is called one-vs-all classification, or multi-class classification.
From sklearn.linear_model.LogisticRegression:
In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)
Code example:
# Authors: Tom Dupre la Tour <tom.dupre-la-tour@m4x.org>
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
# make 3-class dataset for classification
centers = [[-5, 0], [0, 1.5], [5, -1]]
X, y = make_blobs(n_samples=1000, centers=centers, random_state=40)
transformation = [[0.4, 0.2], [-0.4, 1.2]]
X = np.dot(X, transformation)
for multi_class in ('multinomial', 'ovr'):
    clf = LogisticRegression(solver='sag', max_iter=100, random_state=42,
                             multi_class=multi_class).fit(X, y)
    # print the training scores
    print("training score : %.3f (%s)" % (clf.score(X, y), multi_class))
For the full code example, see: Plot multinomial and One-vs-Rest Logistic Regression

Use RBF Kernel with Chi-squared distance metric in SVM

How can I achieve the task mentioned in the title? Is there a parameter in the RBF kernel to set the distance metric to the chi-squared distance? I can see a chi2_kernel in the scikit-learn library.
Below is the code that I have written.
import numpy as np
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.preprocessing import Imputer
from numpy import genfromtxt
from sklearn.metrics.pairwise import chi2_kernel
file_csv = 'dermatology.data.csv'
dataset = genfromtxt(file_csv, delimiter=',')
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=1)
dataset = imp.fit_transform(dataset)
target = dataset[:, [34]].flatten()
data = dataset[:, range(0,34)]
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3)
# TODO : willing to set chi-squared distance metric instead. How to do that ?
clf = svm.SVC(kernel='rbf', C=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))
Are you sure you want to compose RBF and chi2? Chi2 on its own defines a valid kernel, and all you have to do is
clf = svm.SVC(kernel=chi2_kernel, C=1)
since sklearn accepts functions as kernels (note, however, that this requires O(N^2) memory and time). If you would like to compose these two, it is a bit more complex and you will have to implement your own kernel to do that. For a bit more control (and other kernels) you might also try pykernels; however, there is no support for composing yet.
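One thing worth noting: scikit-learn's chi2_kernel is itself already an exponential (RBF-style) kernel over the chi-squared distance, k(x, y) = exp(-gamma * sum((x - y)^2 / (x + y))), and additive_chi2_kernel gives the negated distance if you prefer to build it yourself. A minimal sketch along those lines (gamma chosen arbitrarily; X_train, y_train from the question; features must be non-negative for the chi-squared distance):
import numpy as np
from sklearn import svm
from sklearn.metrics.pairwise import chi2_kernel, additive_chi2_kernel

gamma = 1.0  # tune like any other hyperparameter

# option 1: chi2_kernel(A, B, gamma) == exp(-gamma * chi2 distance)
clf = svm.SVC(kernel=lambda A, B: chi2_kernel(A, B, gamma=gamma), C=1)
clf.fit(X_train, y_train)

# option 2: build the same thing from the negated additive chi2 distance
def rbf_chi2(A, B):
    return np.exp(gamma * additive_chi2_kernel(A, B))  # additive_chi2_kernel returns -distance

clf = svm.SVC(kernel=rbf_chi2, C=1)
clf.fit(X_train, y_train)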
