How to calculate the precision and F1? - machine-learning

I am using TF-IDF for feature selection and a Naive Bayes classifier. I want to calculate the total accuracy and precision.
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count)
print("NB, Count Vectors: ", accuracy)
# Naive Bayes on Word Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)
print("NB, WordLevel TF-IDF: ", accuracy)
# Naive Bayes on Ngram Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print("NB, N-Gram Vectors: ", accuracy)
# Naive Bayes on Character Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
print("NB, CharLevel Vectors: ", accuracy)

Use classification_report:
from sklearn.metrics import classification_report
print(classification_report(true_value, predicted_value))
This prints the precision, recall, and F1 score for each class, along with the overall accuracy.
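If you want the individual numbers rather than the printed report, a minimal sketch along these lines may help. The toy `y_true` and `y_pred` lists are made up for illustration; they stand in for your validation labels and the labels predicted by your model.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Toy labels standing in for the validation labels and the model's
# predictions (illustrative values only, not from the question).
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)
# average="macro" is the unweighted mean of the per-class scores;
# the default average="binary" would score only the positive class.
precision = precision_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")

print("accuracy:", accuracy)
print("macro precision:", precision)
print("macro F1:", f1)
```

For a multiclass problem, the `average` argument has to be set explicitly (for example "macro", "micro", or "weighted"), since the binary default does not apply.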

Related

Inspection of trees in a Quantile Random Forest Regression model

I am interested in training a random forest to learn some conditional quantile on some data {X, y} sampled independently from some distribution.
That is, for some $$\alpha \in (0, 1)$$, a mapping $$\hat{q}_{\alpha}(x) \in [0, 1]$$ such that for each $X$, $$\operatorname{argmin}_{\hat{q}_{\alpha}} P(y < \hat{q}_{\alpha}(x)) > \alpha$$.
Is there any clear way to build a random forest effectively in python that could yield such a model?
Additionally, I have one added requirement that may be possible with the current libraries, though I am unsure. Requirement: I would like to select a subset of points, A, from my training set and select and exclude those trees that were trained with points in A from my random forest as I make predictions.
There is a Python-based, scikit-learn compatible/compliant Quantile Regression Forest implementation that can be used to estimate conditional quantiles here: https://github.com/zillow/quantile-forest
Your additional requirement of making predictions on training samples by excluding trees that included those samples during training is called out-of-bag (OOB) estimation, and can also be done with the above package.
Setup should be as easy as:
pip install quantile-forest
Then, here's an example of how to fit a quantile random forest model and use it to predict quantiles with OOB estimation for a subset (here the first 100 rows) of the training data:
import numpy as np
from quantile_forest import RandomForestQuantileRegressor
from sklearn import datasets
from sklearn.model_selection import train_test_split
X, y = datasets.fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
qrf = RandomForestQuantileRegressor()
qrf.fit(X_train, y_train)
# Predict OOB quantiles for first 100 training samples.
y_pred_oob = qrf.predict(
    X_train[:100, :],
    quantiles=[0.025, 0.5, 0.975],
    oob_score=True,
    indices=np.arange(100),
)

Logistic Regression sklearn with categorical Output

I have to train a model with logistic regression in sklearn. Everywhere I look says the outcome has to be binary, but my label is good, bad, or normal. I have 12 features and I don't know how to deal with three labels. I am very thankful for every answer.
You can use Multinomial Logistic Regression.
In Python, you can modify your logistic regression code as follows:
LogisticRegression(multi_class='multinomial').fit(X_train,y_train)
You can see Logistic Regression documentation in Scikit-Learn for more details.
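As a rough end-to-end sketch of the three-label case: the data below is randomly generated (12 features, labels "good", "bad", "normal") purely as a stand-in for the asker's dataset, so the fitted model itself is meaningless. It relies on the default lbfgs solver, which in recent scikit-learn versions fits a multinomial model when the target has more than two classes, so no `multi_class` argument is passed here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 300 rows, 12 features,
# and three string labels (illustrative only).
X = rng.normal(size=(300, 12))
y = rng.choice(["good", "bad", "normal"], size=300)

# String labels can be passed directly; no manual encoding is needed.
clf = LogisticRegression(max_iter=1000).fit(X, y)

pred = clf.predict(X[:5])          # predicted label per row
proba = clf.predict_proba(X[:5])   # one probability column per class

print(clf.classes_)  # column order of predict_proba
print(pred)
print(proba.round(3))
```

The columns of `predict_proba` follow the order of `clf.classes_`, which is the sorted order of the labels.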
This is called multiclass classification; one-vs-rest (also called one-vs-all) is one strategy for handling it.
From sklearn.linear_model.LogisticRegression:
In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)
Code example:
# Authors: Tom Dupre la Tour <tom.dupre-la-tour@m4x.org>
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
# make 3-class dataset for classification
centers = [[-5, 0], [0, 1.5], [5, -1]]
X, y = make_blobs(n_samples=1000, centers=centers, random_state=40)
transformation = [[0.4, 0.2], [-0.4, 1.2]]
X = np.dot(X, transformation)
for multi_class in ('multinomial', 'ovr'):
    clf = LogisticRegression(solver='sag', max_iter=100, random_state=42,
                             multi_class=multi_class).fit(X, y)
    # print the training scores
    print("training score : %.3f (%s)" % (clf.score(X, y), multi_class))
Check for full code example: Plot multinomial and One-vs-Rest Logistic Regression

I have done a classification problem where I am getting 99.9% accuracy but precision, recall, and F1 are coming out as 0

After average-ensemble classification, I am getting a weird confusion matrix and even weirder metric scores.
Code:
x = data_train[categorical_columns + numerical_columns]
y = data_train['target']
from imblearn.over_sampling import SMOTE
x_sample, y_sample = SMOTE().fit_sample(x, y.values.ravel())
x_sample = pd.DataFrame(x_sample)
y_sample = pd.DataFrame(y_sample)
# checking the sizes of the sample data
print("Size of x-sample :", x_sample.shape)
print("Size of y-sample :", y_sample.shape)
# Train-Test split.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_sample, y_sample,
                                                    test_size=0.40,
                                                    shuffle=False)
Accuracy is 99.9%, but recall, F1 score, and precision are 0. I have never faced this problem before. I have used the AdaBoost classifier.
Confusion Matrix for ADB:
[[46399 25]
[ 0 0]]
Accuracy for ADB:
0.9994614854385663
Precision for ADB:
0.0
Recall for ADB:
0.0
f1_score for ADB:
0.0
Since it is an imbalanced dataset, I have used SMOTE. Now I am getting the following results:
Confusion Matrix for ETC:
[[ 0 0]
[ 336 92002]]
Accuracy for ETC:
0.99636119474106
Precision for ETC:
1.0
Recall for ETC:
0.99636119474106
f1_score for ETC:
0.9981772811109906
This is happening because you have an imbalanced dataset (99.9% 0's and only 0.1% 1's). In such scenarios, using accuracy as the metric can be misleading.
You can read more about which metrics to use in such scenarios here
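To see how the reported numbers arise, the metrics can be recomputed by hand from the AdaBoost confusion matrix posted in the question. This sketch assumes the scikit-learn convention (rows are true classes, columns are predicted classes) and that class 1 is the positive class.

```python
import numpy as np

# AdaBoost confusion matrix as posted in the question.
# Assumed layout: rows = true class, columns = predicted class.
cm = np.array([[46399, 25],
               [0, 0]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
# Guard against division by zero; scikit-learn also reports 0.0
# (with a warning) when a denominator is empty.
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
positives_in_test = tp + fn

print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
print("true positives available in this set:", positives_in_test)
```

Under that layout, the second row being all zeros means the evaluated set contains no samples of class 1 at all, so precision and recall for class 1 cannot be anything but 0 while accuracy stays near 1. It may be worth checking the class counts in y_test.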
As the answers above mention, this is caused by skewed (imbalanced) data. However, I would like to suggest a simpler solution: use an SVM with balanced class weights.
from sklearn.svm import SVC
model = SVC(class_weight='balanced')
model.fit(X_train, y_train)
Setting class_weight='balanced' weights each class inversely to its frequency, so all classes get equal importance regardless of how many data points each has. The default 'rbf' kernel often works well here, although that depends on the data.

Using cross-validation to select optimal threshold: binary classification in Keras

I have a Keras model that takes a transformed vector x as input and outputs probabilities that each input value is 1.
I would like to take the predictions from this model and find an optimal threshold. That is, maybe the cutoff value for "this value is 1" should be 0.23, or maybe it should be 0.78, or something else. I know cross-validation is a good tool for this.
My question is how to work this in to training. For example, say I have the following model (taken from here):
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(60, input_dim=60, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
I train the model and get some output probabilities:
model.fit(train_x, train_y)
predictions = model.predict(train_x)
Now I want to learn the threshold for the value of each entry in predictions that would give the best accuracy, for example. How can I learn this parameter, instead of just choosing one after training is complete?
EDIT: For example, say I have this:
def fake_model(self):
    # Model that returns probability that each of 10 values is 1
    a_input = Input(shape=(2, 10), name='a_input')
    dense_1 = Dense(5)(a_input)
    outputs = Dense(10, activation='sigmoid')(dense_1)
    def hamming_loss(y_true, y_pred):
        return tf.to_float(tf.reduce_sum(abs(y_true - y_pred))) / tf.to_float(tf.size(y_pred))
    fakemodel = Model(a_input, outputs)
    # Use the outputs of the model; find the threshold value that minimizes the Hamming loss
    # Record the final confusion matrix.
How can I train a model like this end-to-end?
If an ROC curve isn't what you are looking for, you could create a custom Keras Layer that takes in the outputs of your original model and tries to learn an optimal threshold given the true outputs and the predicted probabilities.
This layer subtracts the threshold from the predicted probability, multiplies by a relatively large constant (in this case 100) and then applies the sigmoid function. Here is a plot that shows the function at three different thresholds (.3, .5, .7).
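The shape of that function can also be checked numerically. The following NumPy sketch evaluates the same expression, sigmoid(100 * (p - threshold)), at the three thresholds mentioned; the helper name `soft_threshold` and the sample probabilities are made up for illustration.

```python
import numpy as np


def soft_threshold(p, threshold, steepness=100.0):
    """Smooth step function: sigmoid(steepness * (p - threshold))."""
    return 1.0 / (1.0 + np.exp(-steepness * (p - threshold)))


# Illustrative predicted probabilities (not from the question).
probs = np.array([0.2, 0.4, 0.5, 0.6, 0.8])

for t in (0.3, 0.5, 0.7):
    # Values well below t map close to 0, values well above t close to 1.
    print(t, np.round(soft_threshold(probs, t), 3))
```

Because the function is smooth, its gradient with respect to the threshold is defined everywhere, which is what lets the threshold be trained; the large constant only makes the step steep enough to approximate a hard cutoff.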
Below is the code that defines this layer and creates a model composed solely of it. After fitting your original model, feed its output probabilities to this model and train it to find an optimal threshold.
class ThresholdLayer(keras.layers.Layer):
    def __init__(self, **kwargs):
        super(ThresholdLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.kernel = self.add_weight(name="threshold", shape=(1,), initializer="uniform",
                                      trainable=True)
        super(ThresholdLayer, self).build(input_shape)

    def call(self, x):
        return keras.backend.sigmoid(100*(x-self.kernel))

    def compute_output_shape(self, input_shape):
        return input_shape

out = ThresholdLayer()(input_layer)
threshold_model = keras.Model(inputs=input_layer, outputs=out)
threshold_model.compile(optimizer="sgd", loss="mse")
First, here's a direct answer to your question. You're thinking of an ROC curve. For example, assuming some data X_test and y_test:
from matplotlib import pyplot as plt
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
y_pred = model.predict(X_test).ravel()
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
my_auc = auc(fpr, tpr)
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Model_name (area = {:.3f})'.format(my_auc))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.8, 1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Model_name (area = {:.3f})'.format(my_auc))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve close-up')
plt.legend(loc='best')
plt.show()
Second, regarding my comment, here's an example of one attempt. It can be done in Keras, or TF, or anywhere, although he does it with XGBoost.
Hope that helps!
The first idea I have is a kind of brute force.
On a test set, you compute a metric separately for each input and its corresponding predicted output.
Then, for each of them, iterate over values of the threshold between 0 and 1 until the metric is optimized for the given input/prediction pair.
For many of the popular metrics of classification quality (accuracy, precision, recall, etc) you just cannot learn the optimal threshold while training your neural network.
This is because these metrics are not differentiable, so gradient updates will fail to set the threshold (or any other parameter) correctly. You are therefore forced to optimize a nice smooth loss (like negative log likelihood) when training most of the parameters, and then tune the threshold by grid search.
Of course, you can come up with a smoothed version of your metric and optimize it (and sometimes people do this). But in most cases it is OK to optimize log-likelihood, get a nice probabilistic classifier, and tune the thresholds on top of it. E.g. if you want to optimize accuracy, then you should first estimate class probabilities as accurately as possible (to get close to the perfect Bayes classifier), and then just choose their argmax.
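The grid search over thresholds could look roughly like the sketch below. The helper `best_threshold` and the small validation arrays are hypothetical; in practice `y_val` and `p_val` would be held-out labels and the probabilities your trained model predicts for them, and accuracy could be swapped for any other metric.

```python
import numpy as np


def best_threshold(y_true, y_prob, grid=None):
    """Return the threshold on the grid with the highest accuracy, and that accuracy."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Accuracy of the hard predictions (prob >= t) for every candidate t.
    accuracies = [np.mean((y_prob >= t).astype(int) == y_true) for t in grid]
    best = int(np.argmax(accuracies))  # first maximum if there are ties
    return grid[best], accuracies[best]


# Made-up validation labels and predicted probabilities (illustrative only).
y_val = np.array([0, 0, 0, 1, 0, 1, 1, 1])
p_val = np.array([0.05, 0.10, 0.20, 0.25, 0.30, 0.60, 0.70, 0.90])

threshold, acc = best_threshold(y_val, p_val)
print("best threshold:", threshold, "accuracy:", acc)
```

The threshold should be tuned on data that was not used to fit the network, otherwise the chosen cutoff tends to be optimistic.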

How to generate ROC Curve with cross validation using SGD classifier loss=hinge

We want to generate an ROC curve with cross-validation using an SGD classifier with loss=hinge, but this is not supported because the ROC curve requires probabilities. We want to stick strictly with hinge because it fits our requirements, and we want to verify the trained model's accuracy using an ROC curve. Please suggest how to generate an ROC curve with cross-validation using loss=hinge.
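With loss=hinge you can use the decision function, which gives the signed distance of each sample from the separating hyperplane. The ROC curve only needs a score that ranks the samples, not a probability.
Here is how you can apply it:
import matplotlib.pyplot as plt
from sklearn import metrics
svm_clf.fit(Xtrain, Xtarget)
# decision_function returns the signed distance to the hyperplane
score_roc = svm_clf.decision_function(Ytest)
fpr, tpr, thresholds = metrics.roc_curve(Ytarget, score_roc)
roc_auc = metrics.auc(fpr, tpr)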
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
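For the cross-validation part of the question, one possible approach is to collect out-of-fold decision scores with `cross_val_predict` and build a single ROC curve from them. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the real data; the 5-fold setting and random seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_predict

# Synthetic binary dataset standing in for the real one (illustrative only).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = SGDClassifier(loss="hinge", random_state=0)

# Each sample is scored by a model that did not see it during fitting.
scores = cross_val_predict(clf, X, y, cv=5, method="decision_function")

fpr, tpr, thresholds = roc_curve(y, scores)
roc_auc = auc(fpr, tpr)
print("cross-validated AUC: %.3f" % roc_auc)
```

The resulting `fpr` and `tpr` arrays can be plotted exactly as in the snippet above. Pooling scores across folds is a simplification; computing one curve per fold and averaging is another common option.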
