multi performance metrics using cross validation - machine-learning

I want to use accuracy, precision, recall, and F-measure as performance metrics. With accuracy alone the code works fine, but with several metrics I get errors. How can I do that?
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
scoring = {'accuracy' : make_scorer(accuracy_score),
'precision' : make_scorer(precision_score),
'recall' : make_scorer(recall_score),
'f1_score' : make_scorer(f1_score)}
# load dataset
# prepare configuration for cross validation test harness
seed = 7
# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
#models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
#scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=5, random_state=seed)
    cv_results = model_selection.cross_validate(model, X_, y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    '''
    msg = "%s: %f (%f)" % (name, cv_results['accuracy'].mean(), cv_results['accuracy'].std())
    msg2 = "%s: %f (%f)" % (name, cv_results['precision'].mean(), cv_results['precision'].std())
    msg3 = "%s: %f (%f)" % (name, cv_results['recall'].mean(), cv_results['recall'].std())
    msg4 = "%s: %f (%f)" % (name, cv_results['f1_score'].mean(), cv_results['f1_score'].std())
    print(msg)
    print(msg2)
    print(msg3)
    print(msg4)
    '''
The code below plots the accuracy results of the models when accuracy is the only scoring function. I want to adapt it to the case above, where I have several scoring functions. How can I do that?
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
results holds the following values. How do I get the metric scores out of it?
[{'fit_time': array([0.05684781, 0.03089881, 0.04285073, 0.03789902, 0.04088998]),
'score_time': array([0.00798011, 0.00497937, 0.00498676, 0.00598478, 0.00398898]),
'test_accuracy': array([0.95977011, 0.94827586, 0.96551724, 0.95677233, 0.94524496]),
'test_precision': array([0.95209581, 0.94886364, 0.97633136, 0.97701149, 0.93785311]),
'test_recall': array([0.96363636, 0.94886364, 0.95375723, 0.93922652, 0.95402299]),
'test_f1': array([0.95783133, 0.94886364, 0.96491228, 0.95774648, 0.94586895])},
{'fit_time': array([0.01396322, 0.00897574, 0.01296639, 0.0089767 , 0.01097035]),
'score_time': array([0.0069809 , 0.0079782 , 0.00698042, 0.0069809 , 0.00598478]),
'test_accuracy': array([0.97701149, 0.95402299, 0.96264368, 0.95389049, 0.97982709]),
'test_precision': array([0.99371069, 0.97058824, 0.99382716, 1. , 0.99408284]),
'test_recall': array([0.95757576, 0.9375 , 0.93063584, 0.91160221, 0.96551724]),
'test_f1': array([0.97530864, 0.95375723, 0.96119403, 0.95375723, 0.97959184])},
{'fit_time': array([0.00698161, 0.00698113, 0.00698113, 0.0039897 , 0.00498629]),
'score_time': array([0.00598383, 0.00598478, 0.00398827, 0.0039897 , 0.00498652]),
'test_accuracy': array([1. , 1. , 1. , 0.99711816, 1. ]),
'test_precision': array([1., 1., 1., 1., 1.]),
'test_recall': array([1. , 1. , 1. , 0.99447514, 1. ]),
'test_f1': array([1. , 1. , 1. , 0.99722992, 1. ])},
{'fit_time': array([0.00398946, 0.00399137, 0.00498724, 0.00299191, 0.00299263]),
'score_time': array([0.00398922, 0.00498629, 0.00697994, 0.00498533, 0.00698185]),
'test_accuracy': array([0.87068966, 0.89655172, 0.90229885, 0.88760807, 0.88184438]),
'test_precision': array([0.78571429, 0.83018868, 0.83574879, 0.82272727, 0.80930233]),
'test_recall': array([1., 1., 1., 1., 1.]),
'test_f1': array([0.88 , 0.90721649, 0.91052632, 0.90274314, 0.89460154])},
{'fit_time': array([0.03992987, 0.04884362, 0.04388309, 0.03992462, 0.03992629]),
'score_time': array([0.01694345, 0.01100636, 0.01097107, 0.0119369 , 0.01093674]),
'test_accuracy': array([0.9683908 , 0.95689655, 0.97413793, 0.95389049, 0.97982709]),
'test_precision': array([0.99358974, 1. , 0.9939759 , 1. , 1. ]),
'test_recall': array([0.93939394, 0.91477273, 0.95375723, 0.91160221, 0.95977011]),
'test_f1': array([0.96573209, 0.95548961, 0.97345133, 0.95375723, 0.97947214])}]

As mentioned in the documentation:
For single metric evaluation, where the scoring parameter is a string, callable or None, the keys will be - ['test_score', 'fit_time', 'score_time']
And for multiple metric evaluation, the return value is a dict with the following keys - ['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time']
In your case, you can get the accuracy/recall/etc with
cv_results["test_accuracy"]
cv_results["test_recall"]
...
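As a minimal sketch of how the per-metric arrays could be collected for plotting: the synthetic dataset from make_classification, the two models and the metric names passed as strings are assumptions for illustration, not the asker's data or setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the asker's X_ and y (assumption).
X, y = make_classification(n_samples=200, random_state=7)

# Built-in scorer names; each one yields a 'test_<name>' key in the result.
scoring = ['accuracy', 'precision', 'recall', 'f1']

models = [
    ('LR', LogisticRegression(max_iter=1000)),
    ('CART', DecisionTreeClassifier(random_state=7)),
]

# shuffle=True is needed in recent scikit-learn when random_state is set.
kfold = KFold(n_splits=5, shuffle=True, random_state=7)

# One list of per-fold score arrays per metric, in the same order as names.
results = {metric: [] for metric in scoring}
names = []
for name, model in models:
    cv_results = cross_validate(model, X, y, cv=kfold, scoring=scoring)
    names.append(name)
    for metric in scoring:
        results[metric].append(cv_results['test_' + metric])

for metric in scoring:
    for name, scores in zip(names, results[metric]):
        print("%s %s: %f (%f)" % (name, metric, scores.mean(), scores.std()))
```

Each results[metric] is a list with one array of fold scores per model, which is the shape the boxplot code from the question expects, so plt.boxplot(results['f1']) followed by ax.set_xticklabels(names) should give one comparison plot per metric.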

Related

How to implement LIME in a Bert model?

I am new to machine learning. I noticed that such questions have been asked before as well but did not receive a proper solution. Below is the code for semantic similarity and I want to implement LIME as a base. Please, help me out.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-distilroberta-base-v1')
# Two lists of sentences
sentences1 = ['The cat sits outside',
'A man is playing guitar',
'The new movie is awesome']
sentences2 = ['The cat sits outside',
'A woman watches TV',
'The new movie is so great']
#Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)
#Compute cosine similarities
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)
#Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))
I don't know what Bert is, but try this sample code and see if it helps you.
import pandas as pd
import numpy as np
import sklearn
import sklearn.ensemble
import sklearn.metrics
from sklearn.utils import shuffle
from io import StringIO
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import lime
from lime import lime_text
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline
df = pd.read_csv('C:\\Users\\ryans\\OneDrive\\Desktop\\Briefcase\\PDFs\\1-ALL PYTHON & R CODE SAMPLES\\A - GITHUB\\Natural Language Processing - Amazon Reviews\\Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv')
# let's experiment with some sentiment analysis concepts
# first we need to clean up the text fields of the DataFrame we are working with
df.replace('\'','', regex=True, inplace=True)
df['review_title'] = df['reviews.title'].astype(str)
df['review_text'] = df['reviews.text'].astype(str)
# get rid of digits (each step works on the already cleaned column)
df['review_title'] = df['review_title'].str.replace(r'\d+', '', regex=True)
df['review_text'] = df['review_text'].str.replace(r'\d+', '', regex=True)
# get rid of special characters
df['review_title'] = df['review_title'].str.replace(r'[^\w\s]+', '', regex=True)
df['review_text'] = df['review_text'].str.replace(r'[^\w\s]+', '', regex=True)
# collapse runs of whitespace into a single space
df['review_title'] = df['review_title'].str.replace(r'\s+', ' ', regex=True)
df['review_text'] = df['review_text'].str.replace(r'\s+', ' ', regex=True)
# convert all case to lower
df['review_title'] = df['review_title'].str.lower()
df['review_text'] = df['review_text'].str.lower()
list_corpus = df["review_text"].tolist()
list_labels = df["reviews.rating"].tolist()
X_train, X_test, y_train, y_test = train_test_split(list_corpus, list_labels, test_size=0.2, random_state=40)
vectorizer = CountVectorizer(analyzer='word',token_pattern=r'\w{1,}', ngram_range=(1, 3), stop_words = 'english', binary=True)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(train_vectors, y_train)
pred = logreg.predict(test_vectors)
accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred, average='weighted')
recall = recall_score(y_test, pred, average='weighted')
f1 = f1_score(y_test, pred, average='weighted')
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))
list_corpus[3]
c = make_pipeline(vectorizer, logreg)
# LIME expects one class name per column of predict_proba, in the order of logreg.classes_
class_names = [str(label) for label in logreg.classes_]
explainer = LimeTextExplainer(class_names=class_names)
idx = 3
# labels holds column indices of predict_proba; 1 is the second class
exp = explainer.explain_instance(X_test[idx], c.predict_proba, num_features=6, labels=[1])
print('Document id: %d' % idx)
print('Predicted class =', logreg.predict(test_vectors[idx])[0])
print('True class: %s' % y_test[idx])
print('Explanation for class %s' % class_names[1])
print('\n'.join(map(str, exp.as_list(label=1))))
exp = explainer.explain_instance(X_test[idx], c.predict_proba, num_features=6, top_labels=2)
print(exp.available_labels())
exp.show_in_notebook(text=False)
https://towardsdatascience.com/explain-nlp-models-with-lime-shap-5c5a9f84d59b
https://marcotcr.github.io/lime/tutorials/Lime%20-%20multiclass.html
https://towardsdatascience.com/understanding-model-predictions-with-lime-a582fdff3a3b

How to extract coefficients from fitted pipeline for penalized logistic regression?

I have a set of training data that consists of X, which is a set of n columns of data (features), and Y, which is one column of target variable.
I am trying to train my model with logistic regression using the following pipeline:
pipeline = sklearn.pipeline.Pipeline([
    ('logistic_regression', LogisticRegression(penalty = 'none', C = 10))
])
My goal is to obtain the values of each of the n coefficients corresponding to the features, under the assumption of a linear model (y = coeff_0 + coeff_1*x1 + ... + coeff_n*xn).
What I tried was to train this pipeline on my data with model = pipeline.fit(X, Y). So I think that I now have the model that contains the coefficients that I want. However, I don't know how to access them. I'm looking for something like model.best_params_('logistic_regression').
Does anyone know how to extract the fitted coefficients from a model like this?
Have a look at the scikit-learn documentation for Pipeline; this example is inspired by it:
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.pipeline import Pipeline
# generate some data to play with
X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)
# ANOVA SVM-C
anova_filter = SelectKBest(f_regression, k=5)
clf = svm.SVC(kernel='linear')
anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)
# access coefficients
print(anova_svm['svc'].coef_)
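The coef_ attribute of the fitted estimator does the job (and intercept_ holds the constant term); .best_params_ is usually associated with GridSearchCV, i.e. hyperparameter optimization.
In your specific case, the estimator is a step inside the pipeline, so try: model['logistic_regression'].coef_.
Example to get the coefficients from a pipeline: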
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
X, y = load_iris(return_X_y=True)
pipeline = Pipeline([('lr', LogisticRegression(penalty='l2', C=10))])
pipeline.fit(X, y)
pipeline['lr'].coef_
array([[-0.42923513, 2.08235619, -4.28084811, -1.97174699],
[ 1.06321671, -0.08077595, -0.46911772, -2.3221883 ],
[-0.63398158, -2.00158024, 4.74996583, 4.29393529]])
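To relate the coefficients to the linear form in the question (coeff_0 + coeff_1*x1 + ...), the sketch below pairs each coefficient with a feature name and prints the intercept per class. It reuses the iris data from the example above; the step name 'lr' and max_iter=1000 are illustrative choices, not something the original answers prescribe.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

data = load_iris()
pipeline = Pipeline([('lr', LogisticRegression(C=10, max_iter=1000))])
pipeline.fit(data.data, data.target)

# named_steps gives access to the fitted estimator by its step name.
lr = pipeline.named_steps['lr']

# For a multiclass problem there is one row of coefficients and one
# intercept per class, ordered like lr.classes_.
for cls, row, intercept in zip(lr.classes_, lr.coef_, lr.intercept_):
    terms = dict(zip(data.feature_names, row.round(3)))
    print("class", cls, "intercept", round(float(intercept), 3), terms)
```

For a binary target, coef_ has a single row and intercept_ a single value, which matches the y = coeff_0 + coeff_1*x1 + ... + coeff_n*xn picture applied to the log-odds.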
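Here is how to visualize the coefficients and measure model accuracy. I used the baby's weight, height and gestation period to predict preterm birth (df is my own DataFrame with those columns):
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([('lr', LogisticRegression(penalty='l2',C=10))])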
scaler=StandardScaler()
#X=np.array(df['gestation_wks']).reshape(-1,1)
X=scaler.fit_transform(df[['bwt_lbs','height_ft','gestation_wks']])
y=np.array(df['PreTerm'])
X_train,X_test, y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=123)
pipeline.fit(X_train,y_train)
y_pred_prob=pipeline.predict_proba(X_test)
predictions=pipeline.predict(X_test)
print(predictions)
sns.countplot(x=predictions, orient='h')
plt.show()
#print(predictions[:,0])
print(pipeline['lr'].coef_)
print(pipeline['lr'].intercept_)
print('Coefficients close to zero will contribute little to the end result')
num_err = np.sum(y != pipeline.predict(X))
print("Number of errors:", num_err)
def my_loss(y,w):
    # Sum of squared differences between true and predicted targets
    s = 0
    for i in range(y.size):
        # Get the true and predicted target values for example 'i'
        y_i_true = y[i]
        y_i_pred = w[i]
        s = s + (y_i_true - y_i_pred)**2
    return s
print("Loss:",my_loss(y_test,predictions))
fpr, tpr, thresholds = roc_curve(y_test,y_pred_prob[:,1])
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
accuracy=round(pipeline['lr'].score(X_train, y_train) * 100, 2)
print("Model Accuracy={accuracy}".format(accuracy=accuracy))
cm=confusion_matrix(y_test,predictions)
print(cm)

raise ValueError("Unknown label type: %s" % repr(ys)) ValueError: Unknown label type: (array

I'm trying out a machine learning approach, but I'm having some problems. This is my code:
import sys
import scipy
import numpy
import matplotlib
import pandas
import sklearn
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
dataset = pandas.read_csv('Libro111.csv')
array = numpy.asarray(dataset,dtype=numpy.float64) #all values are float64
X = array[:,1:49]
Y = array[:,0]
validation_size = 0.2
seed = 7.0
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
scoring = 'accuracy'
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
And then I get two different errors.
For Logistic Regression:
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 172, in check_classification_targets
raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'
I found someone who had the same problem, but I couldn't sort it out yet.
And (most important):
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 97, in unique_labels
raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array([ 0.5, 0. , 1. , 1. , 0.5, 0.5, 1. , 0.5, 0. , 0.5, 1. ,
0. , 0. , 0. , 1. , 1......
In both cases the error occurs when I execute the "cv_results" line. I hope you can help me.
"ValueError: Unknown label type: 'continuous'" means that scikit-learn does not recognize your "Y" values as class labels. Class labels are discrete values, typically integers or strings, that many rows share. Your "Y" is a float array containing non-integer values such as 0.5 (see the second traceback), so it is treated as a continuous target. Classifiers such as "DecisionTreeClassifier", "KNeighborsClassifier" and "LogisticRegression" (do not be fooled by its name: LogisticRegression is a classification method) reject such a target. A continuous target calls for a regression method (e.g. "RandomForestRegressor").
Here are two solutions:
a) Convert the Y values into classes, for example by grouping them into bins or by encoding each distinct value as an integer label, and then apply classification modeling to your data.
b) If you prefer your predictions to be continuous values (float numbers), use a regression method to predict the Y values.
By the way, "scoring = 'accuracy'" is an evaluation metric for classification modeling.
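A small sketch of option a), assuming the target really takes only a few distinct float values like the 0, 0.5 and 1 visible in the traceback. The toy array y_float is made up for illustration; type_of_target is the scikit-learn helper that decides how a target is interpreted.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.multiclass import type_of_target

# Toy target with the same kind of values as in the traceback (assumption).
y_float = np.array([0.5, 0.0, 1.0, 1.0, 0.5, 0.0])

# Non-integer floats are interpreted as a continuous (regression) target.
print(type_of_target(y_float))

# Encode each distinct value as an integer class label.
encoder = LabelEncoder()
y_class = encoder.fit_transform(y_float)

# The encoded target is now recognized as a classification target.
print(type_of_target(y_class))

# encoder.classes_ maps label i back to the original float value.
print(encoder.classes_)
```

After this step, y_class can be passed to the classifiers in place of the float Y, and encoder.inverse_transform turns predicted labels back into the original values.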

How to encode categorical data for use with Semi-supervised algorithm LabelPropagation

I am attempting to use the anneal.arff dataset with Python scikit-learn's semi-supervised algorithm LabelPropagation. The anneal dataset is categorical data, so I preprocessed it so that the output class for each instance looks like [0. 0. 1. 0. 0.]. This is a numeric list that encodes the output class as 5 possible values, with 0's everywhere and 1. in the position of the corresponding class. This is what I would expect.
For semi-supervised learning, most of the training data must be unlabeled, so I modified the training set so that the unlabeled data has output [-1, -1, -1, -1, -1]. I previously tried just using -1, but the code emits the same error as shown below.
I train the classifier as follows, Y_train includes labeled and "unlabeled" data:
lp_model = LabelSpreading(gamma=0.25, max_iter=5)
lp_model.fit(X, Y_train)
I receive the error shown below after calling the fit method:
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\semi_supervised\label_propagation.py", line 221, in fit
X, y = check_X_y(X, y)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 526, in check_X_y
y = column_or_1d(y, warn=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (538, 5)
This suggests that something is wrong with the shape of my Y_train list, but this is the correct shape. What am I doing wrong?
Can LabelPropagation take training data in this form, or does it only accept unlabeled data as a scalar -1?
--- edit ---
Here is the code that generates the error. I'm sorry about the confusion over algorithms--I want to use both LabelSpreading and LabelPropagation, and choosing one or the other doesn't fix this error.
from scipy.io import arff
import pandas as pd
import numpy as np
import math
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from copy import deepcopy
from sklearn.semi_supervised import LabelPropagation
from sklearn.semi_supervised import LabelSpreading
f = "../../Documents/UCI/anneal.arff"
dataAsRecArray, meta = arff.loadarff(f)
dataset_raw = pd.DataFrame.from_records(dataAsRecArray)
dataset = pd.get_dummies(dataset_raw)
class_names = [col for col in dataset.columns if 'class_' in col]
print (dataset.shape)
number_of_output_columns = len(class_names)
print (number_of_output_columns)
def run(name, model, dataset, percent):
    # Split-out validation dataset
    array = dataset.values
    X = array[:, 0:-number_of_output_columns]
    Y = array[:, -number_of_output_columns:]
    validation_size = 0.40
    seed = 7
    X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
    num_samples = len(Y_train)
    num_labeled_points = math.floor(percent*num_samples)
    indices = np.arange(num_samples)
    unlabeled_set = indices[num_labeled_points:]
    Y_train[unlabeled_set] = [-1, -1, -1, -1, -1]
    lp_model = LabelSpreading(gamma=0.25, max_iter=5)
    lp_model.fit(X_train, Y_train)
    """
    predicted_labels = lp_model.transduction_[unlabeled_set]
    print(predicted_labels[:10])
    """
if __name__ == "__main__":
    #percentages = [0.1, 0.2, 0.3, 0.4]
    percentages = [0.1]
    models = []
    models.append(('LS', LabelSpreading()))
    #models.append(('CART', DecisionTreeClassifier()))
    #models.append(('NB', GaussianNB()))
    #models.append(('SVM', SVC()))
    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
        for percent in percentages:
            run(name, model, dataset, percent)
    print ("bye")
Your Y_train has shape (538, 5) but should be 1d. LabelPropagation doesn't support multi-label or multi-output multi-class right now.
The error message could be more informative, though :-/
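One way to get a 1d target, sketched under assumptions: the one-hot rows are collapsed to a class index with argmax, and unlabeled rows are then marked with the scalar -1. The toy data (three clusters, 60 points) is invented for illustration and stands in for the anneal features; gamma and max_iter are copied from the question.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# Toy data: three clusters, labels cycling 0, 1, 2 (assumption, not anneal).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
true_labels = np.tile([0, 1, 2], 20)
X = centers[true_labels] + rng.normal(scale=0.5, size=(60, 2))

# One-hot targets, shaped like the (n_samples, n_classes) array in the question.
Y_onehot = np.eye(3)[true_labels]

# Collapse each one-hot row to a single class index: shape (60,).
y = Y_onehot.argmax(axis=1)

# Mark everything after the first 20 samples as unlabeled with scalar -1.
y_train = y.copy()
unlabeled_set = np.arange(20, 60)
y_train[unlabeled_set] = -1

lp_model = LabelSpreading(gamma=0.25, max_iter=5)
lp_model.fit(X, y_train)

# transduction_ holds the label assigned to every training sample.
predicted_labels = lp_model.transduction_[unlabeled_set]
print(predicted_labels[:10])
```

Note that argmax has to be applied before the unlabeled rows are overwritten, because a row of five -1 values would otherwise collapse to class 0.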

Probability prediction method of KNeighborsClassifier returns only 0 and 1

Can anyone tell me what the problem is with my code?
Why can I predict probabilities on the iris dataset with LogisticRegression, while KNeighborsClassifier gives me only 0 or 1 when it should give me a result like the one LogisticRegression yields?
from sklearn.datasets import load_iris
from sklearn import metrics
iris = load_iris()
X = iris.data
y = iris.target
# skf is a cross-validation splitter defined earlier (not shown here)
for train_index, test_index in skf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
from sklearn.linear_model import LogisticRegression
ln = LogisticRegression()
ln.fit(X_train,y_train)
ln.predict_proba(X_test)[:,1]
array([ 0.18075722, 0.08906078, 0.14693156, 0.10467766, 0.14823032,
0.70361962, 0.65733216, 0.77864636, 0.67203114, 0.68655163,
0.25219798, 0.3863194 , 0.30735105, 0.13963637, 0.28017798])
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree', metric='euclidean')
knn.fit(X_train, y_train)
knn.predict_proba(X_test)[0:10,1]
array([ 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.])
Because KNN has a very limited concept of probability: its estimate is simply the fraction of votes among the nearest neighbours. Increase the number of neighbours to 15 or 100, or query a point near the decision boundary, and you will see more diverse results. Currently, all 5 neighbours of each of your points share the same label, hence a probability of 0 or 1.
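The effect can be checked with a short sketch. The train/test split and the choice of k = 5 versus k = 50 are illustrative assumptions; with the default uniform weights, each predicted probability is a vote count divided by k.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

probas = {}
for k in (5, 50):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Each entry is (number of neighbours voting for the class) / k.
    probas[k] = knn.predict_proba(X_test)
    # Distinct probability values seen for class 1 at this k.
    print(k, np.unique(probas[k][:, 1]))

proba_k5 = probas[5]
```

With k = 5 the only possible values are 0, 0.2, 0.4, 0.6, 0.8 and 1, so well-separated points end up at exactly 0 or 1; a larger k allows a finer grid of values.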
Here I have a fitted KNN model, model_knn, built with sklearn. This returns the predicted class together with its probability:
result = {}
model_classes = model_knn.classes_
predicted = model_knn.predict(word_average)
score = model_knn.predict_proba(word_average)
index = np.where(model_classes == predicted[0])[0][0]
result["predicted"] = predicted[0]
result["score"] = score[0][index]
