How to interpret scikit's learn confusion matrix and classification report? - machine-learning

I have a sentiment analysis task, for this Im using this corpus the opinions have 5 classes (very neg, neg, neu, pos, very pos), from 1 to 5. So I do the classification as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
tfidf_vect= TfidfVectorizer(use_idf=True, smooth_idf=True,
sublinear_tf=False, ngram_range=(2,2))
from sklearn.cross_validation import train_test_split, cross_val_score
import pandas as pd
df = pd.read_csv('/corpus.csv',
header=0, sep=',', names=['id', 'content', 'label'])
X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,
y, test_size=0.33)
from sklearn.svm import SVC
svm_1 = SVC(kernel='linear')
svm_1.fit(X, y)
svm_1_prediction = svm_1.predict(X_test)
Then with the metrics I obtained the following confusion matrix and classification report, as follows:
print '\nClasification report:\n', classification_report(y_test, svm_1_prediction)
print '\nConfussion matrix:\n',confusion_matrix(y_test, svm_1_prediction)
Then, this is the result:
Clasification report:
precision recall f1-score support
1 1.00 0.76 0.86 71
2 1.00 0.84 0.91 43
3 1.00 0.74 0.85 89
4 0.98 0.95 0.96 288
5 0.87 1.00 0.93 367
avg / total 0.94 0.93 0.93 858
Confussion matrix:
[[ 54 0 0 0 17]
[ 0 36 0 1 6]
[ 0 0 66 5 18]
[ 0 0 0 273 15]
[ 0 0 0 0 367]]
How can I interpret the above confusion matrix and classification report. I tried reading the documentation and this question. But still can interpretate what happened here particularly with this data?. Wny this matrix is somehow "diagonal"?. By the other hand what means the recall, precision, f1score and support for this data?. What can I say about this data?. Thanks in advance guys

Classification report must be straightforward - a report of P/R/F-Measure for each element in your test data. In Multiclass problems, it is not a good idea to read Precision/Recall and F-Measure over the whole data any imbalance would make you feel you've reached better results. That's where such reports help.
Coming to confusion matrix, it is much detailed representation of what's going on with your labels. So there were 71 points in the first class (label 0). Out of these, your model was successful in identifying 54 of those correctly in label 0, but 17 were marked as label 4. Similarly look at second row. There were 43 points in class 1, but 36 of them were marked correctly. Your classifier predicted 1 in class 3 and 6 in class 4.
Now you can see the pattern this follows. An ideal classifiers with 100% accuracy would produce a pure diagonal matrix which would have all the points predicted in their correct class.
Coming to Recall/Precision. They are some of the mostly used measures in evaluating how good your system works. Now you had 71 points in first class (call it 0 class). Out of them your classifier was able to get 54 elements correctly. That's your recall. 54/71 = 0.76. Now look only at first column in the table. There is one cell with entry 54, rest all are zeros. This means your classifier marked 54 points in class 0, and all 54 of them were actually in class 0. This is precision. 54/54 = 1. Look at column marked 4. In this column, there are elements scattered in all the five rows. 367 of them were marked correctly. Rest all are incorrect. So that reduces your precision.
F Measure is harmonic mean of Precision and Recall.
Be sure you read details about these. https://en.wikipedia.org/wiki/Precision_and_recall

Here's the documentation for scikit-learn's sklearn.metrics.precision_recall_fscore_support method: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support
It seems to indicate that the support is the number of occurrences of each particular class in the true responses (responses in your test set). You can calculate it by summing the rows of the confusion matrix.

Confusion Matrix tells us about the distribution of our predicted values across all the actual outcomes.Accuracy_scores, Recall(sensitivity), Precision, Specificity and other similar metrics are subsets of Confusion Matrix.
F1 scores are the harmonic means of precision and recall.
Support columns in Classification_report tell us about the actual counts of each class in test data.
Well, rest is explained above beautifully.
Thank you.

Related

Fit clustering outputs into Machine Learning model

Just a machine learning/data science problem.
a) Let's say I have a dataset of 20 features, and i decide to use 3 features to perform unsupervised learning of clustering - and ideally this produces 3 clusters (A,B and C).
b) Then i fit that output result (cluster A, B or C) back into my dataset as a new feature (i.e. now total of 21 features).
c) I run a regression model to predict a label value with the 21 features.
Wonder if step b) is redundant (since the features already exist in the earlier dataset), if I use a more powerful model (Random forest, XGBoost), or not, and how to explain this mathematically.
Any opinions and suggestions will be great!
Great idea: just give it a try and see how that goes. This is highly dependent on your dataset and model choice as you guessed. Hard to predict how adding this type of feature will behave, just like any other feature engineering. But caution, in some cases it's not even improving your performance. See a test below where performance actually decreases, with Iris dataset:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn import metrics
# load data
iris = load_iris()
X = iris.data[:, :3] # only keep three out of the four available features to make it more challenging
y = iris.target
# split train / test
indices = np.random.permutation(len(X))
N_test = 30
X_train, y_train = X[indices[:-N_test]], y[indices[:-N_test]]
X_test, y_test = X[indices[N_test:]], y[indices[N_test:]]
# compute a clustering method (here KMeans) based on available features in X_train
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)
new_clustering_feature_train = kmeans.predict(X_train)
new_clustering_feature_test = kmeans.predict(X_test)
# create a new input train/test X with this feature added
X_train_with_clustering_feature = np.column_stack([X_train, new_clustering_feature_train])
X_test_with_clustering_feature = np.column_stack([X_test, new_clustering_feature_test])
Now let's compare the two models that learnt either only on X_train or on X_train_with_clustering_feature:
model1 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train, y_train)
print(metrics.classification_report(model1.predict(X_test), y_test))
precision recall f1-score support
0 1.00 1.00 1.00 45
1 0.95 0.97 0.96 38
2 0.97 0.95 0.96 37
accuracy 0.97 120
macro avg 0.97 0.97 0.97 120
weighted avg 0.98 0.97 0.97 120
And the other model:
model2 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train_with_clustering_feature, y_train)
print(metrics.classification_report(model2.predict(X_test_with_clustering_feature), y_test))
0 1.00 1.00 1.00 45
1 0.87 0.97 0.92 35
2 0.97 0.88 0.92 40
accuracy 0.95 120
macro avg 0.95 0.95 0.95 120
weighted avg 0.95 0.95 0.95 120
Aha nice one! You might think you are using two models, but actually you are combining two models into one, with skip connections. As it is one model, there is no way knowing for sure what is the best architecture beforehand, per the No Free Lunch Theorem. So, practically, you have have to try it out, and mathematically, there's no knowing it beforehand, because of the No Free Lunch Theorem.

What is Random seed in Azure Machine Learning?

I am learning Azure Machine Learning. I am frequently encountering the Random Seed in some of the steps like,
Split Data
Untrained algorithm models as Two Class Regression, Multi-class regression, Tree, Forest,..
In the tutorial, they choose Random Seed as '123'; trained model has high accuracy but when I try to choose other random integers like 245, 256, 12, 321,.. it did not do well.
Questions
What is a Random Seed Integer?
How to carefully choose a Random Seed from range of integer values? What is the key or strategy to choose it?
Why does Random Seed significantly affect the ML Scoring, Prediction and Quality of the trained model?
Pretext
I have Iris-Sepal-Petal-Dataset with Sepal (Length & Width) and Petal (Length & Width)
Last column in data-set is 'Binomial ClassName'
I am training the data-set with Multiclass Decision Forest Algorithm and splitting the data with different random seeds 321, 123 and 12345 in order
It affects the final quality of trained model. Random seed#123 being best of Prediction probability score: 1.
Observations
1. Random seed: 321
2. Random seed: 123
3. Random seed: 12345
What is a Random Seed Integer?
Will not go into any details regarding what a random seed is in general; there is plenty of material available by a simple web search (see for example this SO thread).
Random seed serves just to initialize the (pseudo)random number generator, mainly in order to make ML examples reproducible.
How to carefully choose a Random Seed from range of integer values? What is the key or strategy to choose it?
Arguably this is already answered implicitly above: you are simply not supposed to choose any particular random seed, and your results should be roughly the same across different random seeds.
Why does Random Seed significantly affect the ML Scoring, Prediction and Quality of the trained model?
Now, to the heart of your question. The answer here (i.e. with the iris dataset) is the small-sample effects...
To start with, your reported results across different random seeds are not that different. Nevertheless, I agree that, at first sight, a difference in macro-average precision of 0.9 and 0.94 might seem large; but looking more closely it is revealed that the difference is really not an issue. Why?
Using the 20% of your (only) 150-samples dataset leaves you with only 30 samples in your test set (where the evaluation is performed); this is stratified, i.e. about 10 samples from each class. Now, for datasets of that small size, it is not difficult to imagine that a difference in the correct classification of only 1-2 samples can have this apparent difference in the performance metrics reported.
Let's try to verify this in scikit-learn using a decision tree classifier (the essence of the issue does not depend on the specific framework or the ML algorithm used):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=321, stratify=y)
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Result:
[[10 0 0]
[ 0 9 1]
[ 0 0 10]]
precision recall f1-score support
0 1.00 1.00 1.00 10
1 1.00 0.90 0.95 10
2 0.91 1.00 0.95 10
micro avg 0.97 0.97 0.97 30
macro avg 0.97 0.97 0.97 30
weighted avg 0.97 0.97 0.97 30
Let's repeat the code above, changing only the random_state argument in train_test_split; for random_state=123 we get:
[[10 0 0]
[ 0 7 3]
[ 0 2 8]]
precision recall f1-score support
0 1.00 1.00 1.00 10
1 0.78 0.70 0.74 10
2 0.73 0.80 0.76 10
micro avg 0.83 0.83 0.83 30
macro avg 0.84 0.83 0.83 30
weighted avg 0.84 0.83 0.83 30
while for random_state=12345 we get:
[[10 0 0]
[ 0 8 2]
[ 0 0 10]]
precision recall f1-score support
0 1.00 1.00 1.00 10
1 1.00 0.80 0.89 10
2 0.83 1.00 0.91 10
micro avg 0.93 0.93 0.93 30
macro avg 0.94 0.93 0.93 30
weighted avg 0.94 0.93 0.93 30
Looking at the absolute numbers of the 3 confusion matrices (in small samples, percentages can be misleading), you should be able to convince yourself that the differences are not that big, and they can be arguably justified by the random element inherent in the whole procedure (here the exact split of the dataset into training and test).
Should your test set be significantly bigger, these discrepancies would be practically negligible...
A last notice; I have used the exact same seed numbers as you, but this does not actually mean anything, as in general the random number generators across platforms & languages are not the same, hence the corresponding seeds are not actually compatible. See own answer in Are random seeds compatible between systems? for a demonstration.
The seed is used to initialize the pseudorandom number generator in Python.
The random module uses the seed value as a base to generate a random number. if seed value is not present it takes system current time. if you provide same seed value before generating random data it will produce the same data. refer https://pynative.com/python-random-seed/ for more details.
Example:
import random
random.seed( 30 )
print ("first number - ", random.randint(25,50))
random.seed( 30 )
print ("Second number- ", random.randint(25,50))
Output:
first number - 42
Second number - 42

classification_report vs f1_score in scikit-learn's classification metrics

What is the right way to evaluate a binary classifier using scikit-learn's evaluation metrics?
Given y_test and y_pred as the gold and predicted labels, shouldn't the F1 score in the classification_report output be the same as what f1_score produces?
Here is how I do it:
print(classification_reprot(y_test, y_pred)
gives the following table:
precision recall f1-score support
0 0.49 0.18 0.26 204
1 0.83 0.96 0.89 877
avg / total 0.77 0.81 0.77 1081
However,
print(f1_score(y_test, y_pred)
gives F1 score = 0.89
Now, given the above outputs, is the performance of this model F1 score = 0.89 or is it 0.77?
In short, for your case, the f1-score is 0.89, and the weighted average f1-score is 0.77.
Take a look at the docstring of sklearn.metrics.f1_score:
The F1 score can be interpreted as a weighted average of the precision and
recall, where an F1 score reaches its best value at 1 and worst score at 0.
The relative contribution of precision and recall to the F1 score are
equal. The formula for the F1 score is::
F1 = 2 * (precision * recall) / (precision + recall)
In the multi-class and multi-label case, this is the weighted average of
the F1 score of each class.
The key is the final sentence here. If you're looking for the weighted average f1 score for each class, then you shouldn't feed the function a 0/1 binary classification. So, for example, you could do
f1_score(y_test + 1, y_pred + 1)
# 0.77
If the class labels are not 0/1, then it is treated as a multiclass metric (where you care about all precision/recall scores) rather than a binary metric (where you care about precision/recall only for positive samples). I agree that this might be a bit surprising, but in general 0/1 classes are treated as a marker of binary classification.
Edit: some of the behavior listed here is deprecated since Scikit-learn 0.16 – in particular the confusing implicit assumptions about binary vs non-binary classifications. See this github thread for details.

GridSearchCV scoring parameter: using scoring='f1' or scoring=None (by default uses accuracy) gives the same result

I'm using an example extracted from the book "Mastering Machine Learning with scikit learn".
It uses a decision tree to predict whether each of the images on a web page is an
advertisement or article content. Images that are classified as being advertisements could then be hidden using Cascading Style Sheets. The data is publicly available from the Internet Advertisements Data Set: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements, which contains data for 3,279 images.
The following is the complete code for completing the classification task:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
import sys,random
def main(argv):
df = pd.read_csv('ad-dataset/ad.data', header=None)
explanatory_variable_columns = set(df.columns.values)
response_variable_column = df[len(df.columns.values)-1]
explanatory_variable_columns.remove(len(df.columns.values)-1)
y = [1 if e == 'ad.' else 0 for e in response_variable_column]
X = df[list(explanatory_variable_columns)]
X.replace(to_replace=' *\?', value=-1, regex=True, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=100000)
pipeline = Pipeline([('clf',DecisionTreeClassifier(criterion='entropy',random_state=20000))])
parameters = {
'clf__max_depth': (150, 155, 160),
'clf__min_samples_split': (1, 2, 3),
'clf__min_samples_leaf': (1, 2, 3)
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)
print 'Best score: %0.3f' % grid_search.best_score_
print 'Best parameters set:'
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
print '\t%s: %r' % (param_name, best_parameters[param_name])
predictions = grid_search.predict(X_test)
print classification_report(y_test, predictions)
if __name__ == '__main__':
main(sys.argv[1:])
The RESULTS of using scoring='f1' in GridSearchCV as in the example is:
The RESULTS of using scoring=None (by default Accuracy measure) is the same as using F1 score:
If I'm not wrong optimizing the parameter search by different scoring functions should yield different results. The following case shows that different results are obtained when scoring='precision' is used.
The RESULTS of using scoring='precision' is DIFFERENT than the other two cases. The same would be true for 'recall', etc:
WHY 'F1' AND None, BY DEFAULT ACCURACY, GIVE THE SAME RESULT??
EDITED
I agree with both answers by Fabian & Sebastian. The problem should be the small param_grid. But I just wanted to clarify that the problem surged when I was working with a totally different (not the one in the example here) highly imbalance dataset 100:1 (which should affect the accuracy) and using Logistic Regression. In this case also 'F1' and accuracy gave the same result.
The param_grid that I used, in this case, was the following:
parameters = {"penalty": ("l1", "l2"),
"C": (0.001, 0.01, 0.1, 1, 10, 100),
"solver": ("newton-cg", "lbfgs", "liblinear"),
"class_weight":[{0:4}],
}
I guess that the parameter selection is also too small.
I think that the author didn't choose this example very well. I may be missing something here, but min_samples_split=1 doesn't make sense to me: Isn't it the same as setting min_samples_split=2 since you can't split 1 sample -- essentially, it's a waste of computational time.
From the documentation: min_samples_split: "The minimum number of samples required to split an internal node."
Btw. this is a very small grid and there is not much choice anyways, which may explain why accuracy and f1 give you the same parameter combinations and hence the same scoring tables.
Like mentioned above, the dataset may be well balanced which is why F1 and accuracy scores may prefer the same parameter combinations. So, looking further at your GridSearch results using (a) F1 score and (b) Accuracy, I conclude that in both cases a depth of 150 works best. Since this is the lower boundary, it gives you a slight hind that lower "depth" values may work even better. However, I suspect that the tree doesn't even go that deep on this dataset (you can end up with "pure" leaves even well before reaching the max depth).
So, let's repeat the experiment with a little bit more sensible values using the following parameter grid
parameters = {
'clf__max_depth': list(range(2, 30)),
'clf__min_samples_split': (2,),
'clf__min_samples_leaf': (1,)
}
The optimal "depth" for the best F1 score seems to be around 15.
Best score: 0.878
Best parameters set:
clf__max_depth: 15
clf__min_samples_leaf: 1
clf__min_samples_split: 2
precision recall f1-score support
0 0.98 0.99 0.99 716
1 0.92 0.89 0.91 104
avg / total 0.98 0.98 0.98 820
Next, let's try it using "accuracy" (or None) as our scoring metric:
> Best score: 0.967
Best parameters set:
clf__max_depth: 6
clf__min_samples_leaf: 1
clf__min_samples_split: 2
precision recall f1-score support
0 0.98 0.99 0.98 716
1 0.93 0.85 0.88 104
avg / total 0.97 0.97 0.97 820
As you can see, you get different results now, and the "optimal" depth is different if you use "accuracy."
I don't agree that optimizing the parameter search by different scoring functions should yield necessarily different results necessarily. If your dataset is balanced (roughly same number of samples in each class), I would expect that model selection by accuracy and F1 would yield very similar results.
Also, have in mind that GridSearchCV optimizes over a discrete grid. Maybe using a thinner grid of parameters would yield the results that you are looking for.
On an unbalanced dataset use the "labels" parameter of the f1_score scorer to use only the f1 score of the class you are interested in. Or consider using "sample_weight".

SVM for gender classification: 100% correct results with linear kernel, but much poorer results with RBF

I have crafted a little program for gender classification based on image of a face. I used Yale face databse (175 images for males and the same number for females), converted them to grayscale and equalized histograms, so after preprocessing images look like this:
I ran following code to test results (it uses SVM and linear kernel):
def run_gender_classifier():
Xm, Ym = mkdataset('gender/male', 1) # mkdataset just preprocesses images,
Xf, Yf = mkdataset('gender/female', 0) # flattens them and stacks into a matrix
X = np.vstack([Xm, Xf])
Y = np.hstack([Ym, Yf])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
test_size=0.1,
random_state=100)
model = svm.SVC(kernel='linear')
model.fit(X_train, Y_train)
print("Results:\n%s\n" % (
metrics.classification_report(
Y_test, model.predict(X_test))))
And got 100% precision!
In [22]: run_gender_classifier()
Results:
precision recall f1-score support
0 1.00 1.00 1.00 16
1 1.00 1.00 1.00 19
avg / total 1.00 1.00 1.00 35
I could expect different results, but 100% correct for image classification look really suspicious to me.
Furthermore, when I changed kernel to RBF, results became totally bad:
In [24]: run_gender_classifier()
Results:
precision recall f1-score support
0 0.46 1.00 0.63 16
1 0.00 0.00 0.00 19
avg / total 0.21 0.46 0.29 35
Which seems even more strange for me.
So my questions are:
Is there any mistake in my approach or code?
If not, how can results for linear kernel be so good, and for RBF so bad?
Note, that I also got 100% correct results with logistic regression, and very poor results with deep belief networks, so it's not specific to SVM, but rather for linear and non-linear models.
Just for completeness, here's my code for preprocessing and making dataset:
import cv2
from sklearn import linear_model, svm, metrics
from sklearn.cross_validation import train_test_split
def preprocess(im):
im = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
im = cv2.resize(im, (100, 100))
return cv2.equalizeHist(im)
def mkdataset(path, label):
images = (cv2.resize(cv2.imread(fname), (100, 100))
for fname in list_images(path))
images = (preprocess(im) for im in images)
X = np.vstack([im.flatten() for im in images])
Y = np.repeat(label, X.shape[0])
return X, Y
All of described models require tuning parameters:
Linear SVM : C
RBF SVM : C, gamma
DBN : Layers count, Neurons count, Output classifier, Training rate ...
And you simply omitted this element. So it is quite natural, that models with smallest number of tunable parameters behaved better - as simply there is bigger probability that default parameters actually worked.
100% score always looks suspicious and you should double check it "by hand" - phisically split data into train and test (put into different directories), train on one part, save your model to a file. Then in separate code - load a model, and test it on test files with displaying image + label from the model. This way you will make sure, that there is no implmenentation error (as you really don't care whether there is any processing error, if you have a physical proof that your model recognizes those faces, right?). This is purely "psychological method", which makes it obvious that there is no error in data splitting/sharing and further evaluation.
UPDATE
As suggested in the comment, I also checked your dataset, and as as it is stated on the official website:
The extended Yale Face Database B contains 16128 images of 28 human subjects under 9 poses and 64 illumination conditions.
So this is for sure a problem - this is not the dataset for the gender recognition. Your classifier simply memorizes these 28 subjects, which are easily splitted to male/female. It simply won't work on any image from other subjects. The only "valuable" part of this dataset is the set of 28 faces of distinctive individuals, which you can extract by hand, but 28 images seems at least row of magnitude too small to be useful.
Friend from what I understand your description of your questions and I think the explanation is simple, due to the problem with linear kernel is better than the RBF believe in your logic is correct however you should is using RBF somewhat wrong I think that it will serve to your problem, continue trying to develop a way to just use the linear kernel

Resources