I am using the support vector classifier from sklearn on the Iris dataset. When I call decision_function it returns negative values, yet every sample in the test dataset is assigned the right class after classification. I thought decision_function should return a positive value when the sample is an inlier and a negative value when it is an outlier. Where am I wrong?
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X = iris.data[:,:]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3,
random_state=0)
clf = SVC(probability=True)
print(clf.fit(X_train,y_train).decision_function(X_test))
print(clf.predict(X_test))
print(y_test)
Here is the output:
[[-0.76231668 -1.03439531 -1.40331645]
[-1.18273287 -0.64851109 1.50296097]
[ 1.10803774 1.05572833 0.12956269]
[-0.47070432 -1.08920859 -1.4647051 ]
[ 1.18767563 1.12670665 0.21993744]
[-0.48277866 -0.98796232 -1.83186272]
[ 1.25020033 1.13721691 0.15514536]
[-1.07351583 -0.84997114 0.82303659]
[-1.04709616 -0.85739411 0.64601611]
[-1.23148923 -0.69072989 1.67459938]
[-0.77524787 -1.00939817 -1.08441968]
[-1.12212245 -0.82394879 1.11615504]
[-1.14646662 -0.91238712 0.80454974]
[-1.13632316 -0.8812114 0.80171542]
[-1.14881866 -0.95169643 0.61906248]
[ 1.15821271 1.10902205 0.22195304]
[-1.19311709 -0.93149873 0.78649126]
[-1.21653084 -0.90953622 0.78904491]
[ 1.16829526 1.12102515 0.20604678]
[ 1.18446364 1.1080255 0.15199149]
[-0.93911991 -1.08150089 -0.8026332 ]
[-1.15462733 -0.95603159 0.5713605 ]
[ 0.93278883 0.99763184 0.34033663]
[ 1.10999556 1.04596018 0.14791409]
[-1.07285663 -1.01864255 -0.10701465]
[ 1.21200422 1.01284263 0.0416991 ]
[ 0.9462457 1.01076579 0.36620915]
[-1.2108146 -0.79124775 1.43264808]
[-1.02747495 -0.25741977 1.13056021]
...
[ 1.16066886 1.11212424 0.22506538]]
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
2 1 1 2 0 2 0 0]
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
1 1 1 2 0 2 0 0]
You need to consider the decision_function and the prediction separately. The decision value is the signed distance from the hyperplane to your sample, so by looking at the sign you can tell on which side of the hyperplane the sample lies. Negative values are therefore perfectly fine and simply indicate the negative class ("the other side of the hyperplane").
With the iris dataset you have a multi-class problem. As the SVM is a binary classifier, there is no inherent multi-class classification. Two common approaches are "one-vs-rest" (OvR) and "one-vs-one" (OvO), which construct a multi-class classifier from the binary "units".
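A quick way to see which scheme a fitted SVC is reporting is the shape of decision_function's output. A minimal sketch (my addition, using the same split as in the question; note that for 3 classes both shapes happen to be identical, which is part of the confusion):
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=.3, random_state=0)

# 'ovr' returns one value per class: shape (n_samples, n_classes)
clf_ovr = SVC(decision_function_shape='ovr').fit(X_train, y_train)
print(clf_ovr.decision_function(X_test).shape)   # (45, 3)

# 'ovo' returns one value per class pair: shape (n_samples, n_classes * (n_classes - 1) / 2)
clf_ovo = SVC(decision_function_shape='ovo').fit(X_train, y_train)
print(clf_ovo.decision_function(X_test).shape)   # also (45, 3), since 3 classes give 3 pairs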
One-vs-one
Now that you already know OvR, OvO is not much harder to grasp. You basically construct a classifier for every pair of classes (A, B). In your case: 0 vs 1, 0 vs 2, 1 vs 2.
Note: The values of (A, B) and (B, A) can be obtained from a single binary classifier. You only change what is considered the positive class and thus you have to invert the sign.
Doing this gives you a matrix:
+-------+------+-------+-------+
| A / B |  #0  |  #1   |  #2   |
+-------+------+-------+-------+
|  #0   |  --  | -1.18 | -0.64 |
|  #1   | 1.18 |  --   |  1.50 |
|  #2   | 0.64 | -1.50 |  --   |
+-------+------+-------+-------+
Read this as follows:
Decision function value when class A (row) competes against class B (column).
In order to extract a result a vote is performed. In the basic form you can imagine this as a single vote that each classifier can give: Yes or No. This could lead to draws, so we use the whole decision function values instead.
+-------+------+-------+-------+-------+
| A / B |  #0  |  #1   |  #2   |  SUM  |
+-------+------+-------+-------+-------+
|  #0   |  -   | -1.18 | -0.64 | -1.82 |
|  #1   | 1.18 |  -    |  1.50 |  2.68 |
|  #2   | 0.64 | -1.50 |  -    | -0.86 |
+-------+------+-------+-------+-------+
The SUM column gives you the vector [-1.82, 2.68, -0.86]. Now apply arg max and it matches your prediction.
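As a tiny numpy sketch of that vote (my addition, using the decision values of the second test sample shown above):
import numpy as np

dec = np.array([-1.18273287, -0.64851109, 1.50296097])   # 0 vs 1, 0 vs 2, 1 vs 2

votes = [ dec[0] + dec[1],    # class 0
         -dec[0] + dec[2],    # class 1
         -dec[1] - dec[2]]    # class 2

print(votes)             # roughly [-1.83, 2.69, -0.85]
print(np.argmax(votes))  # 1, i.e. the predicted class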
One-vs-rest
I keep this section to avoid further confusion. The scikit-learn SVC classifier (libsvm) has a decision_function_shape parameter, which deceived me into thinking it was OvR (I am using liblinear most of the time).
For a real OvR response you get one value from the decision function per classifier, e.g.
[-1.18273287 -0.64851109 1.50296097]
Now to obtain a prediction from this you could just apply arg max, which would return the last index with a value of 1.50296097. From here on the decision function's value is not needed anymore (for this single prediction). That's why you noticed that your predictions are fine.
However you also specified probability=True, which takes the decision function's value and passes it through a sigmoid. Same principle as above, but now you also have confidence values (I prefer this term over probabilities, since it only describes the distance to the hyperplane) between 0 and 1.
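If those calibrated values are what you actually care about, predict_proba returns them directly. A minimal sketch (my addition, assuming the same iris split as in the question):
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=.3, random_state=0)

# probability=True fits an additional sigmoid on top of the decision values
clf = SVC(probability=True).fit(X_train, y_train)

proba = clf.predict_proba(X_test)   # shape (n_samples, n_classes), rows sum to 1
print(proba[:3])
print(proba[:3].argmax(axis=1))     # usually (not always) agrees with clf.predict(X_test)[:3]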
Edit:
Oops, sascha is right. LibSVM uses one-vs-one (despite the shape of the decision function).
Christopher is correct, but assuming OvR here.
Now you are doing the OvO scheme without noticing it!
Here is some example code, which:
explains how to predict using OvO + decision_function
But first, OvO's theory on prediction, from:
ECS289: Scalable Machine Learning
(Cho-Jui Hsieh; UC Davis; Oct 27, 2015)
Code:
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import numpy as np
iris = datasets.load_iris()
X = iris.data[:,:]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3,
random_state=0)
clf = SVC(decision_function_shape='ovo') # EXPLICIT OVO-usage!
clf.fit(X, y)
def predict(dec):
    # OVO prediction-scheme
    # hardcoded for 3 classes!
    # OVO order assumption: 0 vs 1; 0 vs 2; 1 vs 2 (lexicographic!)
    # theory: http://www.stat.ucdavis.edu/~chohsieh/teaching/ECS289G_Fall2015/lecture9.pdf page 18
    # and: http://www.mit.edu/~9.520/spring09/Classes/multiclass.pdf page 8
    class0 = dec[0] + dec[1]
    class1 = -dec[0] + dec[2]
    class2 = -dec[1] - dec[2]
    return np.argmax([class0, class1, class2])

dec_vals = clf.decision_function(X_test)
pred_vals = clf.predict(X_test)
pred_vals_own = np.array([predict(x) for x in dec_vals])

for i in range(len(X_test)):
    print('decision_function vals : ', dec_vals[i])
    print('sklearns prediction : ', pred_vals[i])
    print('own prediction using dec: ', pred_vals_own[i])
Output:
decision_function vals : [-0.76867027 -1.04536032 -1.60216452]
sklearns prediction : 2
own prediction using dec: 2
decision_function vals : [-1.19939987 -0.64932285 1.6951256 ]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [ 1.11946664 1.05573131 0.06261988]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [-0.46107656 -1.09842529 -1.50671611]
sklearns prediction : 2
own prediction using dec: 2
decision_function vals : [ 1.2094164 1.12827802 0.1415261 ]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [-0.47736819 -0.99988924 -2.15027278]
sklearns prediction : 2
own prediction using dec: 2
decision_function vals : [ 1.25467104 1.13814461 0.07643985]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [-1.07557745 -0.87436887 0.93179222]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [-1.05047139 -0.88027404 0.80181305]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [-1.24310627 -0.70058067 1.906847 ]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [-0.78440125 -1.00630434 -0.99963088]
sklearns prediction : 2
own prediction using dec: 2
decision_function vals : [-1.12586024 -0.84193093 1.25542752]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [-1.15639222 -0.91555677 1.07438865]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [-1.14345638 -0.90050709 0.95795276]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [-1.15790163 -0.95844647 0.83046875]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [ 1.17805731 1.11063472 0.1333462 ]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [-1.20283096 -0.93961585 0.98410451]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [-1.22782802 -0.90725712 1.05316513]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [ 1.16903803 1.12221984 0.11367107]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [ 1.17145967 1.10832227 0.08212776]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [-0.9506135 -1.08467062 -0.79851794]
sklearns prediction : 2
own prediction using dec: 2
decision_function vals : [-1.16266048 -0.9573001 0.79179457]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [ 0.99991983 0.99976567 0.27258784]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [ 1.14009372 1.04646327 0.05173163]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [-1.08080806 -1.03404209 -0.06411027]
sklearns prediction : 2
own prediction using dec: 2
decision_function vals : [ 1.23515997 1.01235174 -0.03884014]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [ 0.99958361 1.0123953 0.31647776]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [-1.21958703 -0.8018796 1.67844367]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [-1.03327108 -0.25946619 1.1567434 ]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [ 1.12368215 1.11169071 0.20956223]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [-0.82416303 -1.07792277 -1.1580516 ]
sklearns prediction : 2
own prediction using dec: 2
decision_function vals : [-1.13071754 -0.96096255 0.65828256]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [ 1.194643 1.12966124 0.15746621]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [-1.04070512 -1.04532308 -0.20319486]
sklearns prediction : 2
own prediction using dec: 2
decision_function vals : [-0.70170723 -1.09340841 -1.9323473 ]
sklearns prediction : 2
own prediction using dec: 2
decision_function vals : [-1.24655214 -0.74489305 1.15450078]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [ 0.99984598 1.03781258 0.2790073 ]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [-0.99993896 -1.06846079 -0.44496083]
sklearns prediction : 2
own prediction using dec: 2
decision_function vals : [-1.22495071 -0.83041964 1.41965874]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [-1.286798 -0.72689128 1.72244026]
sklearns prediction : 1
own prediction using dec: 1
decision_function vals : [-0.75503345 -1.09561165 -1.44344022]
sklearns prediction : 2
own prediction using dec: 2
decision_function vals : [ 1.24778268 1.11179415 0.05277115]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [-0.79577073 -1.00004599 -0.99974376]
sklearns prediction : 2
own prediction using dec: 2
decision_function vals : [ 1.07018075 1.0831253 0.22181655]
sklearns prediction : 0
own prediction using dec: 0
decision_function vals : [ 1.16705531 1.11326796 0.15604895]
sklearns prediction : 0
own prediction using dec: 0
Related
I am using the code below for anomaly detection. It is a binary classification, so the confusion matrix should be 2x2, but instead it is 3x3, with extra zeros appended in a T-shape. A similar thing happened with OneClassSVM a few weeks back as well, but I thought I was doing something wrong. Could you please help me fix this?
import numpy as np
import pandas as pd
import os
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn import metrics
from sklearn.metrics import roc_auc_score
data = pd.read_csv('opensky_train.csv')
#to make sure that normal data contains no anomaly
sortedData = data.sort_values(by=['class'])
target = pd.DataFrame(sortedData['class'])
Y = target.replace(['surveill', 'other'], [1,0])
X = sortedData.drop(['class'], axis = 1)
x_normal = X.iloc[:200,:]
y_normal = Y.iloc[:200,:]
x_anomaly = X.iloc[200:,:]
y_anomaly = Y.iloc[200:,:]
Edited:
column_values = y_anomaly.values.ravel()
unique_values = pd.unique(column_values)
print(unique_values)
Output : [0 1]
clf = IsolationForest(random_state=0).fit(x_normal)
pred = clf.predict(x_anomaly)
print(pred)
Output : [ 1 1 1 1 1 1 -1 1 -1 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 -1 1 1 1 1 1 1 -1 1 1 -1 1 1 -1 1 1 -1 1 -1 1
-1 1 1 -1 -1 1 -1 -1 1 1 1 1 -1 1 1 -1 -1 1 1 1 1 1 1 1
-1 1 1 1 1 1 1 1 1 1 -1]
#printing the results
print(confusion_matrix(y_anomaly, pred))
print (classification_report(y_anomaly, pred))
Result:
Confusion Matrix :
[[ 0 0 0]
[ 7 0 60]
[12 0 28]]
precision recall f1-score support
-1 0.00 0.00 0.00 0
0 0.00 0.00 0.00 67
1 0.32 0.70 0.44 40
accuracy 0.26 107
macro avg 0.11 0.23 0.15 107
weighted avg 0.12 0.26 0.16 107
Inliers are labeled 1, while outliers are labeled -1
Source: scikit-learn Anomaly and Outlier detection.
Your example has transformed the classes to 0 and 1, so the three possible labels in the confusion matrix are -1, 0 and 1.
You need to change from
Y = target.replace(['surveill', 'other'], [1,0])
to
Y = target.replace(['surveill', 'other'], [1,-1])
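With that change, y_anomaly shares the same label set as IsolationForest's output ({-1, 1}) and the confusion matrix collapses to 2x2. A self-contained sketch of the same idea (made-up data instead of opensky_train.csv, since the file isn't available here):
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix

rng = np.random.RandomState(0)
x_normal = rng.normal(0, 1, size=(200, 3))                # "normal" training data
x_anomaly = np.vstack([rng.normal(0, 1, size=(60, 3)),
                       rng.normal(5, 1, size=(40, 3))])   # mix of inliers and outliers
y_anomaly = np.array([1] * 60 + [-1] * 40)                # labels encoded as 1 / -1, matching predict()

clf = IsolationForest(random_state=0).fit(x_normal)
pred = clf.predict(x_anomaly)                             # returns 1 (inlier) or -1 (outlier)

print(confusion_matrix(y_anomaly, pred))                  # now a 2x2 matrix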
I have a dataset of 1127 patients. My goal was to classify each patient as 0 or 1.
I have two different classifiers, both with the same purpose: to classify the patient as 0 or 1.
I've run one classifier on 364 patients and the second classifier on 763 patients.
For each classifier/group, I generated the ROC curve.
Now, I would like to combine the curves.
Could someone guide me on how to do it?
I'm thinking of calculating the weighted FPR and TPR, but I'm not sure how to do it.
The number of FPR/TPR pairs differs between the curves (the first ROC curve is based on 312 pairs and the second on 666 pairs).
Thanks!!!
Imports
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
Data generation
# simulate first dataset with 364 obs
df1 = \
pd.DataFrame(i for i in range(364))
df1['predict_proba_1'] = np.random.normal(0,1,len(df1))
df1['epsilon'] = np.random.normal(0,1,len(df1))
df1['true'] = (0.7*df1['epsilon'] < df1['predict_proba_1']) * 1
df1 = df1.drop(columns=[0, 'epsilon'])
# simulate second dataset with 763 obs
df2 = \
pd.DataFrame(i for i in range(763))
df2['predict_proba_2'] = np.random.normal(0,1,len(df2))
df2['epsilon'] = np.random.normal(0,1,len(df2))
df2['true'] = (0.7*df2['epsilon'] < df2['predict_proba_2']) * 1
df2 = df2.drop(columns=[0, 'epsilon'])
Quick look at generated data
df1
predict_proba_1 true
0 1.234549 1
1 -0.586544 0
2 -0.229539 1
3 0.132185 1
4 -0.411284 0
.. ... ...
359 -0.218775 0
360 -0.985565 0
361 0.542790 1
362 -0.463667 0
363 1.119244 1
[364 rows x 2 columns]
df2
predict_proba_2 true
0 0.278755 1
1 0.653663 0
2 -0.304216 1
3 0.955658 1
4 -1.341669 0
.. ... ...
758 1.359606 1
759 -0.605894 0
760 0.379738 0
761 1.571615 1
762 -1.102565 0
[763 rows x 2 columns]
Necessary functions
def show_ROCs(scores_list: list, ys_list: list, labels_list: list = None):
    """
    This function plots a couple of ROCs. Corresponding labels are optional.

    Parameters
    ----------
    scores_list : list of array-likes with scorings or predicted probabilities.
    ys_list : list of array-likes with ground truth labels.
    labels_list : list of labels to be displayed in the plotted graph.

    Returns
    ----------
    None
    """
    if len(scores_list) != len(ys_list):
        raise Exception('len(scores_list) != len(ys_list)')

    fpr_dict = dict()
    tpr_dict = dict()

    for x in range(len(scores_list)):
        fpr_dict[x], tpr_dict[x], _ = roc_curve(ys_list[x], scores_list[x])

    for x in range(len(scores_list)):
        try:
            plot_ROC(fpr_dict[x], tpr_dict[x],
                     str(labels_list[x]) + ' AUC:' + str(round(auc(fpr_dict[x], tpr_dict[x]), 3)))
        except:
            plot_ROC(fpr_dict[x], tpr_dict[x],
                     str(x) + ' ' + str(round(auc(fpr_dict[x], tpr_dict[x]), 3)))
    plt.show()
def plot_ROC(fpr, tpr, label):
    """
    This function plots a single ROC. The corresponding label is optional.

    Parameters
    ----------
    fpr : array-like with fpr.
    tpr : array-like with tpr.
    label : label to be displayed in the plotted graph.

    Returns
    ----------
    None
    """
    plt.figure(1)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr, tpr, label=label)
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC curve')
    plt.legend(loc='best')
Plotting
show_ROCs(
[df1['predict_proba_1'], df2['predict_proba_2']],
[df1['true'], df2['true']],
['df1 with {} obs'.format(len(df1)), 'df2 with {} obs'.format(len(df2))]
)
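The code above plots the two curves next to each other. If you want one single combined curve, one simple option (my suggestion, not something the question prescribes, and only sensible if the two classifiers' scores are on a comparable scale) is to pool the scores and labels of both groups and compute a single ROC over the union; the larger group then contributes proportionally more points:
# pool both groups and compute one ROC over all 1127 patients
scores_all = np.concatenate([df1['predict_proba_1'], df2['predict_proba_2']])
y_all = np.concatenate([df1['true'], df2['true']])

fpr_all, tpr_all, _ = roc_curve(y_all, scores_all)
print('combined AUC:', round(auc(fpr_all, tpr_all), 3))

plot_ROC(fpr_all, tpr_all, 'combined ({} obs)'.format(len(y_all)))
plt.show()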
Can anyone explain how to calculate the accuracy, sensitivity and specificity of a multi-class dataset?
Sensitivity of each class can be calculated from its
TP/(TP+FN)
and specificity of each class can be calculated from its
TN/(TN+FP)
For more information about the concepts and equations, see:
http://en.wikipedia.org/wiki/Sensitivity_and_specificity
For multi-class classification, you may use the one-against-all approach.
Suppose there are three classes: C1, C2, and C3
"TP of C1" is all C1 instances that are classified as C1.
"TN of C1" is all non-C1 instances that are not classified as C1.
"FP of C1" is all non-C1 instances that are classified as C1.
"FN of C1" is all C1 instances that are not classified as C1.
To find these four terms of C2 or C3 you can replace C1 with C2 or C3.
In simple terms:
In a 2x2, once you have picked one category as positive, the other is automatically negative. With 9 categories, you basically have 9 different sensitivities, depending on which of the nine categories you pick as "positive". You could calculate these by collapsing to a 2x2, i.e. Class1 versus not-Class1, then Class2 versus not-Class2, and so on.
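As a small sketch of that collapsing (made-up labels, unrelated to the glass data shown next), per-class sensitivity and specificity can be read straight off the multi-class confusion matrix:
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical true and predicted labels for a 3-class problem
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0, 0]

cm = confusion_matrix(y_true, y_pred)

for c in range(cm.shape[0]):
    tp = cm[c, c]                      # class c predicted as c
    fn = cm[c, :].sum() - tp           # class c predicted as something else
    fp = cm[:, c].sum() - tp           # non-c predicted as c
    tn = cm.sum() - tp - fn - fp       # non-c predicted as non-c
    print('class', c,
          'sensitivity =', tp / (tp + fn),
          'specificity =', tn / (tn + fp))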
Example :
We get a confusion matrix for the 7 types of glass:
=== Confusion Matrix ===
a b c d e f g <-- classified as
50 15 3 0 0 1 1 | a = build wind float
16 47 6 0 2 3 2 | b = build wind non-float
5 5 6 0 0 1 0 | c = vehic wind float
0 0 0 0 0 0 0 | d = vehic wind non-float
0 2 0 0 10 0 1 | e = containers
1 1 0 0 0 7 0 | f = tableware
3 2 0 0 0 1 23 | g = headlamps
A true positive rate (sensitivity) is calculated for each type of glass, plus an overall weighted average:
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.714 0.174 0.667 0.714 0.690 0.532 0.806 0.667 build wind float
0.618 0.181 0.653 0.618 0.635 0.443 0.768 0.606 build wind non-float
0.353 0.046 0.400 0.353 0.375 0.325 0.766 0.251 vehic wind float
0.000 0.000 0.000 0.000 0.000 0.000 ? ? vehic wind non-float
0.769 0.010 0.833 0.769 0.800 0.788 0.872 0.575 containers
0.778 0.029 0.538 0.778 0.636 0.629 0.930 0.527 tableware
0.793 0.022 0.852 0.793 0.821 0.795 0.869 0.738 headlamps
0.668 0.130 0.670 0.668 0.668 0.539 0.807 0.611 Weighted Avg.
You may print a classification report (see the link below); it will give you the overall accuracy of your model.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report
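For instance, a small self-contained sketch with made-up labels (your own y_test / y_pred go here):
from sklearn.metrics import classification_report

y_test = [0, 0, 1, 1, 2, 2, 2]   # hypothetical ground truth
y_pred = [0, 1, 1, 1, 2, 2, 0]   # hypothetical predictions

# per-class precision/recall/F1 plus overall accuracy and macro/weighted averages
print(classification_report(y_test, y_pred))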
Compute sensitivity and specificity for multi-class classification:
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# y_true / y_prediction are your multi-class ground-truth and predicted labels
res = []
for l in [0, 1, 2, 3]:
    prec, recall, _, _ = precision_recall_fscore_support(np.array(y_true) == l,
                                                         np.array(y_prediction) == l,
                                                         pos_label=True, average=None)
    # recall[1] is the recall of the positive class (sensitivity),
    # recall[0] is the recall of the negative class (specificity)
    res.append([l, recall[1], recall[0]])

pd.DataFrame(res, columns=['class', 'sensitivity', 'specificity'])
I wish to run all preprocessing and model optimisation tasks in a single pipeline with the following steps :
onehot encoding
SVD dimension reduction
aggregation (pandas groupby)
Random Forest modelling
My input variables are:
X_train with 349 rows, which will become 338 rows after step3 (aggregation)
y_train with 338 rows
I get the error "Found input variables with inconsistent numbers of samples."
This is because sklearn does not allow X_train and y_train to have different numbers of rows.
Do you know another method, if possible, to have an aggregation in a pipeline?
Here is my code:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import TruncatedSVD
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
# does nothing, but is here to collect numerical columns
class nothing(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class Aggregator(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        X = X.rename(columns={0: 'InvoiceNo', 1: 'amount', 2: 'Quantity',
                              3: 'UnitPrice', 4: 'CustomerID'})
        X['InvoiceNo'] = X['InvoiceNo'].astype('int')
        X['Quantity'] = X['Quantity'].astype('float64')
        X['UnitPrice'] = X['UnitPrice'].astype('float64')

        aggregations = dict()
        for col in range(5, X.shape[1]-1):
            aggregations[col] = 'max'
        aggregations.update({'CustomerID': 'first',
                             'amount': "sum", 'Quantity': 'mean', 'UnitPrice': 'mean'})

        # aggregating all basket lines
        result = X.groupby('InvoiceNo').agg(aggregations)

        # add number of lines in the basket
        result['lines_nb'] = X.groupby('InvoiceNo').size()

        return result
numeric_features = ['InvoiceNo','amount', 'Quantity', 'UnitPrice',
'CustomerID']
numeric_transformer = Pipeline(steps=[('nothing', nothing())])
categorical_features = ['StockCode', 'Country']
preprocessor = ColumnTransformer(
[
# 'num' transformer does nothing, but is here to
# collect numerical columns
('num', numeric_transformer ,numeric_features ),
('cat', Pipeline([
('onehot', OneHotEncoder(handle_unknown='ignore')),
('best', TruncatedSVD(n_components=100)),
]), categorical_features)
]
)
pipe = Pipeline(steps=[
('preprocessor', preprocessor),
('aggregator', Aggregator()),
('rf', RandomForestClassifier(n_estimators=400,
max_features='auto',
class_weight=class_weights)),
])
X_train_transformed = pipe.fit_transform(X_train)
ValueError: Found input variables with inconsistent numbers of samples: [349, 338]
More detail to answer #desertnaut's comment:
Example:
X_train contains 4 rows :
customer_num : 1 article_ref : 1 money : 10$
customer_num : 1 article_ref : 2 money : 15$
customer_num : 2 article_ref : 5 money : 5$
customer_num : 3 article_ref : 4 money : 11$
I aggregate the 4 rows with a pandas groupby on customer_num; the resulting dataframe, X_train_transformed, has 3 rows, one per customer.
y_train has 3 rows, containing the group (label to predict) for customer_num 1, customer_num 2 and customer_num 3.
The standard method is :
pipeline 1 : transform X_train (4 rows) to X_train_transformed (3 rows)
pipeline 2 : fit a model to (X_train_transformed (3 rows), y_train(3 rows))
I wish to have a single pipeline that handles pipeline 1 and pipeline 2.
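For reference, a minimal sketch of that standard two-step method on the toy example above (column names and the labels in y_train are made up; a plain groupby stands in for your Aggregator):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# toy data from the example above: 4 basket lines, 3 customers
X_train = pd.DataFrame({'customer_num': [1, 1, 2, 3],
                        'article_ref':  [1, 2, 5, 4],
                        'money':        [10, 15, 5, 11]})
y_train = pd.Series([0, 1, 1])   # hypothetical label per customer (3 rows)

# "pipeline 1": aggregation outside the sklearn pipeline (4 rows in, 3 rows out)
X_train_transformed = X_train.groupby('customer_num').agg({'money': 'sum',
                                                           'article_ref': 'max'})

# "pipeline 2": fit the model on the aggregated rows
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train_transformed, y_train)
Inside a single sklearn Pipeline this split is hard to avoid, because transformers are not allowed to change the number of samples (y is never passed through the transform steps), which is exactly the error you are seeing.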
I have 10+ features and a dozen thousand cases to train a logistic regression for classifying people's race. The first example is French vs non-French, and the second example is English vs non-English. The results are as follows:
//////////////////////////////////////////////////////
1= fr
0= non-fr
Class count:
0 69109
1 30891
dtype: int64
Accuracy: 0.95126
Classification report:
precision recall f1-score support
0 0.97 0.96 0.96 34547
1 0.92 0.93 0.92 15453
avg / total 0.95 0.95 0.95 50000
Confusion matrix:
[[33229 1318]
[ 1119 14334]]
AUC= 0.944717975754
//////////////////////////////////////////////////////
1= en
0= non-en
Class count:
0 76125
1 23875
dtype: int64
Accuracy: 0.7675
Classification report:
precision recall f1-score support
0 0.91 0.78 0.84 38245
1 0.50 0.74 0.60 11755
avg / total 0.81 0.77 0.78 50000
Confusion matrix:
[[29677 8568]
[ 3057 8698]]
AUC= 0.757955582999
//////////////////////////////////////////////////////
However, I am getting some very strange-looking ROC curves with triangular shapes instead of jagged, rounded curves. Any explanation as to why I am getting such a shape? Is there any possible mistake I have made?
Code:
all_dict = []
for i in range(0, len(my_dict)):
    temp_dict = dict(my_dict[i].items() + my_dict2[i].items() + my_dict3[i].items() + my_dict4[i].items()
                     + my_dict5[i].items() + my_dict6[i].items() + my_dict7[i].items() + my_dict8[i].items()
                     + my_dict9[i].items() + my_dict10[i].items() + my_dict11[i].items() + my_dict12[i].items()
                     + my_dict13[i].items() + my_dict14[i].items() + my_dict15[i].items() + my_dict16[i].items())
    all_dict.append(temp_dict)
newX = dv.fit_transform(all_dict)
# Separate the training and testing data sets
half_cut = int(len(df)/2.0)*-1
X_train = newX[:half_cut]
X_test = newX[half_cut:]
y_train = y[:half_cut]
y_test = y[half_cut:]
# Fitting X and y into model, using training data
#$$
lr.fit(X_train, y_train)
# Making predictions using trained data
#$$
y_train_predictions = lr.predict(X_train)
#$$
y_test_predictions = lr.predict(X_test)
#print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0])
print 'Accuracy:',(y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0])
print 'Classification report:'
print classification_report(y_test, y_test_predictions)
#print sk_confusion_matrix(y_train, y_train_predictions)
print 'Confusion matrix:'
print sk_confusion_matrix(y_test, y_test_predictions)
#print y_test[1:20]
#print y_test_predictions[1:20]
#print y_test[1:10]
#print np.bincount(y_test)
#print np.bincount(y_test_predictions)
# Find and plot AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print 'AUC=',roc_auc
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
You're doing it wrong. According to the documentation:
y_score : array, shape = [n_samples]
Target scores, can either be probability estimates of the positive class or confidence values.
Thus at this line:
roc_curve(y_test, y_test_predictions)
You should pass the result of decision_function (or one of the two columns from the predict_proba result) into roc_curve instead of the actual predictions. With hard 0/1 predictions the curve has only a single operating point between (0, 0) and (1, 1), which is exactly why you see a triangular shape.
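A minimal sketch of that change (assuming lr is your fitted LogisticRegression and roc_curve / auc are already imported, as in your code):
# use the score of the positive class, not the hard 0/1 predictions
y_test_scores = lr.predict_proba(X_test)[:, 1]   # or: lr.decision_function(X_test)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_scores)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('AUC = %0.3f' % roc_auc)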
Look at these examples http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#example-model-selection-plot-roc-py
http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#example-model-selection-plot-roc-crossval-py