log loss output is greater than 1 - machine-learning

I prepared several models for binary classification of documents in the fraud field and calculated the log loss for all of them. I thought log loss essentially measures the confidence of the predictions and should therefore lie in the range [0, 1]. I believe it is an important measure in classification when the outcome alone (the predicted class) is not sufficient for evaluation purposes. So if two models have accuracy, recall and precision that are quite close but one has a lower log loss, it should be selected, given there are no other parameters/metrics (such as time or cost) in the decision process.
The log loss for the decision tree is 1.57, for all other models it is in the 0-1 range. How do I interpret this score?

It's important to remember that log loss does not have an upper bound: log loss exists on the range [0, ∞).
From Kaggle we can find a formula for log loss:

logloss = -(1/N) * Σ_i Σ_j y_ij * log(p_ij)

in which y_ij is 1 for the correct class and 0 for the other classes, and p_ij is the probability assigned to that class.
If we look at the case where the average log loss exceeds 1, it is when log(p_ij) < -1 for the true class j. This means that the predicted probability for that class is less than exp(-1), or around 0.368. So, seeing a log loss greater than one can be expected whenever your model assigns less than about a 37% probability to the actual class.
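As a quick illustration (a toy example, not from the original question), sklearn's log_loss shows exactly this behaviour: if the model puts only ~0.3 probability on the true class, the average loss already exceeds 1.

import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 1, 0, 0]
# predicted probability of class 1 for each sample;
# every true class receives only ~0.3 probability (< exp(-1) ≈ 0.368)
y_prob = [0.3, 0.3, 0.7, 0.7]

print(log_loss(y_true, y_prob))   # ≈ 1.20, i.e. greater than 1
print(-np.log(0.3))               # per-sample loss ≈ 1.20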
We can also see this by plotting the log loss given various probability estimates.
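Since the original plot is not reproduced here, a minimal matplotlib sketch of the same idea (per-sample log loss, -log(p), as a function of the probability assigned to the true class):

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 1, 500)           # probability assigned to the true class
plt.plot(p, -np.log(p))                  # per-sample log loss
plt.axhline(1, linestyle="--")           # loss = 1
plt.axvline(np.exp(-1), linestyle="--")  # p = exp(-1) ≈ 0.368
plt.xlabel("predicted probability of the true class")
plt.ylabel("log loss")
plt.show()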

Related

SVM model predicts instances with probability scores greater than 0.1 (default threshold 0.5) as positives

I'm working on a binary classification problem in which I used a logistic regression model and a support vector machine model imported from sklearn. The two models were fit on the same imbalanced training data with class weights adjusted, and they achieved comparable performance. When I used these two pre-trained models to predict on a new dataset, the LR model and the SVM model predicted a similar number of instances as positives, and the predicted instances had a large overlap.
However, when I looked at the probability scores of being classified as positive, the distribution for LR runs from 0.5 to 1, while for the SVM it starts from around 0.1. I called model.predict(prediction_data) to find the instances predicted as each class and
model.predict_proba(prediction_data) to get the probability scores of being classified as 0 (neg) and 1 (pos), and assumed they both use a default threshold of 0.5.
There is no error in my code, and I have no idea why the SVM also predicted instances with probability scores < 0.5 as positives. Any thoughts on how to interpret this situation?
That's a known fact in sklearn when it comes to binary classification problems with SVC(), reported, for instance, in these GitHub issues (here and here). Moreover, it is also reported in the User Guide, where it is said that:
In addition, the probability estimates may be inconsistent with the scores:
the “argmax” of the scores may not be the argmax of the probabilities; in binary classification, a sample may be labeled by predict as belonging to the positive class even if the output of predict_proba is less than 0.5; and similarly, it could be labeled as negative even if the output of predict_proba is more than 0.5.
or directly within the libsvm FAQ, where it is said that:
Let's just consider two-class classification here. After probability information is obtained in training, we do not have prob >= 0.5 if and only if decision value >= 0.
All in all, the point is that:
on one side, predictions are based on decision_function values: if the decision value computed on a new instance is positive, the predicted class is the positive class, and vice versa.
on the other side, as stated within one of the GitHub issues, np.argmax(self.predict_proba(X), axis=1) != self.predict(X), which is where the inconsistency comes from. In other words, in order to always have consistency on binary classification problems, you would need a classifier whose predictions are based on the output of predict_proba() (which is, by the way, what you get when considering calibrators), like so:
def predict(self, X):
    y_proba = self.predict_proba(X)
    return np.argmax(y_proba, axis=1)
I'd also suggest this post on the topic.
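For what it's worth, here is a rough, self-contained sketch (toy data via make_classification, not from the original answer) of both the mismatch and the calibrated workaround:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

svc = SVC(probability=True, class_weight="balanced", random_state=0).fit(X, y)

# predict() uses decision_function(), while predict_proba() uses a separately
# fitted Platt-scaling model, so the two can disagree on some samples
pred_labels = svc.predict(X)
pred_from_proba = np.argmax(svc.predict_proba(X), axis=1)
print("mismatching samples:", (pred_labels != pred_from_proba).sum())  # may be 0 or a handful

# a calibrated wrapper predicts from the calibrated probabilities,
# so predict() and argmax(predict_proba()) stay consistent
cal = CalibratedClassifierCV(SVC(class_weight="balanced"), cv=5).fit(X, y)
assert (cal.predict(X) == np.argmax(cal.predict_proba(X), axis=1)).all()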

Advantage of MAPE loss function over MAE and RMSE

I'm reading this article: Rolling Window Regression: a Simple Approach for Time Series Next value Predictions, where the author explains the difference between five loss functions:
The first question is asking how do we measure success? We do this via a loss function, where we try to minimize the loss function. There are several loss functions, and they have different pros and cons.
I managed to understand the first two loss functions:
MAE (Mean Absolute Error) — here all errors, big and small, are treated equally.
Root Mean Square Error (RMSE) — this penalizes large errors due to the squared term. For example, with errors [0.5, 0.5] and [0.1, 0.9], the MAE for both is 0.5, while the RMSE is 0.5 and ≈0.64 respectively.
But I don't understand the third one:
MAPE (Mean Absolute Percentage Error) — Since #1 and #2 depend on the value range of the target variable, they cannot be compared across datasets. In contrast, MAPE is a percentage, hence relative. It is like accuracy in a classification problem, where everyone knows 99% accuracy is pretty good.
Why can't #1 and #2 be compared across datasets just because they depend on the value range of the target variable?
Why is MAPE better than them?
I don't understand his explanation.
The thing is: MAPE uses percentages.
With both MAE and RMSE I get the mean error (or the root of the mean squared error) in the units of the dataset. Therefore in one dataset, let's say beer prices, the numbers will be small, whereas in another dataset, let's say house prices, the numbers will be large. Therefore I cannot compare the success of MAE/RMSE on one dataset to their success on another.
In contrast, MAPE expresses the error as a percentage, so it does not depend on the size of the numbers in the data itself, and therefore I can compare its success on the beer-prices dataset and the house-prices dataset.
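A small sketch of that point, using made-up "beer" and "house" price errors (illustrative numbers only): MAE and RMSE scale with the data, while MAPE stays comparable.

import numpy as np

def mae(y, yhat):  return np.mean(np.abs(y - yhat))
def rmse(y, yhat): return np.sqrt(np.mean((y - yhat) ** 2))
def mape(y, yhat): return np.mean(np.abs((y - yhat) / y)) * 100

# the same 10% relative error, but very different value ranges
beer_true,  beer_pred  = np.array([5.0, 4.0]),     np.array([5.5, 4.4])
house_true, house_pred = np.array([500e3, 400e3]), np.array([550e3, 440e3])

for name, (y, yhat) in {"beer": (beer_true, beer_pred),
                        "house": (house_true, house_pred)}.items():
    print(name, mae(y, yhat), rmse(y, yhat), round(mape(y, yhat), 1))

# MAE/RMSE differ by a factor of ~100,000 between the two datasets,
# while MAPE is 10% in both cases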

Confidence Probability for Binary Machine Learning Classification

When using sklearn and getting probabilities with the predict_proba(x) function for a binary classification [1, 0], the function returns the probability that the classification falls into each class, for example [.8, .2].
Is there a community adopted standard way to reduce this down to a single classification confidence which takes all factors into consideration?
Option 1)
Just take the probability for the classification that was predicted (.8 in this example)
Option 2)
Some mathematical formula or function call which takes into consideration all of the different probabilities and returns a single number. Such a confidence approach could take into consideration how close the probabilities of the different classes are, and return a lower confidence if there is not much separation between the classes.
There's no standard way of doing it. But what you can do is vary the threshold. What I mean is: if you use predict instead, it throws out a binary output classifying your dataset, and what it's doing is taking 0.5 as the threshold for predicting. That is, if the probability of class 1 is > 0.5, classify the sample as 1, and as 0 if it is <= 0.5. But this can lead to a bad F1-score in some cases.
So the approach should be to vary the threshold and choose the one which yields the maximum F1-score, or any other metric you want to use as a score function. ROC (Receiver Operating Characteristic) curves are meant for exactly this purpose. And in fact, the motive behind sklearn giving out the class probabilities is precisely to let you choose the best threshold.
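A hedged sketch of what "vary the threshold" could look like with sklearn (it assumes you already have a fitted model and a validation set; model, X_val and y_val are placeholder names):

import numpy as np
from sklearn.metrics import precision_recall_curve

proba = model.predict_proba(X_val)[:, 1]           # probability of class 1
precision, recall, thresholds = precision_recall_curve(y_val, proba)

f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])                          # last point has no threshold
print("best threshold:", thresholds[best], "F1:", f1[best])

y_pred = (proba >= thresholds[best]).astype(int)   # use this instead of 0.5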
A very nice example is predicting whether a patient has cancer or not. You have to choose your threshold wisely: if you choose it too high you might get a lot of false negatives, and if you choose it too low you might get a lot of false positives. So you just choose the threshold according to your needs (here it is better to get more false positives).
Hope it helps!

Machine Learning Training & Test data split method

I was running a random forest classification model and initially divided the data into train (80%) and test (20%). However, the predictions had too many false positives, which I think was because there was too much noise in the training data, so I decided to split the data differently, and here's how I did it.
Since I thought the high number of false positives was due to the noise in the training data, I made the training data have an equal number of each target class. For example, if I have data of 10,000 rows and the target variable is 8,000 (0) and 2,000 (1), I made the training data a total of 4,000 rows, including 2,000 (0) and 2,000 (1), so that the training data now has more signal.
When I tried this new splitting method, it predicted way better, increasing the recall on the positive class from 14% to 70%.
I would love to hear your feedback if I am doing anything wrong here. I am concerned if I am making my training data biased.
When you have an unequal number of data points in each class in the training set, the baseline (random prediction) changes.
By noisy data, I think you mean that the number of training points for one class is larger than for the other. This is not really called noise; it is actually bias.
For example: you have 10,000 data points in the training set, 8,000 of class 0 and 2,000 of class 1. I can predict class 0 all the time and already get 80% accuracy. This induces a bias, and the baseline for 0-1 classification will not be 50%.
To remove this bias you can either intentionally balance the training set, as you did, or change the error function by giving each class a weight inversely proportional to its number of points in the training set (sketched below).
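As a sketch of the second option (weighting instead of throwing data away), most sklearn classifiers accept a class_weight argument; "balanced" weights each class inversely proportionally to its frequency:

from sklearn.ensemble import RandomForestClassifier

# keeps all 10,000 rows; each class is weighted inversely to its frequency,
# so the 2,000 positives count as much as the 8,000 negatives during training
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
# clf.fit(X_train, y_train)   # X_train / y_train are placeholders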
Actually, what you did is right, and this process is somewhat similar to "stratified sampling".
In your first model, where the recall was very low, the model did not learn enough of the correlation between the features and the target for the positive class (1). Also, the model might have somewhat over-fitted to the negative class. This is called a "high bias, high variance" situation.
"Stratified sampling" is nothing but making sure, when you extract a sample from a big population, that all classes have an approximately equal proportion, so that the model's training assumptions are more accurate and reliable.
In the second case, the model was able to learn the relationships between the features and the target, and the characteristics of the positive and negative classes were well distinguishable.
Eliminating noise is a part of data preparation that should obviously be done before feeding data into a model.

Reason of having high AUC and low accuracy in a balanced dataset

Given a balanced dataset (the sizes of both classes are the same), fitting an SVM model to it yields a high AUC value (~0.9) but a low accuracy (~0.5).
I have no idea why this would happen; can anyone explain this case for me?
The ROC curve is biased towards the positive class. The described situation with high AUC and low accuracy can occur when your classifier achieves good performance on the positive class (high AUC) at the cost of a high false positive rate (i.e., a low number of true negatives).
The question of why the training process resulted in a classifier with poor predictive performance is very specific to your problem/data and the classification methods used.
The ROC analysis tells you how well the samples of the positive class can be separated from the other class, while the prediction accuracy hints on the actual performance of your classifier.
About ROC analysis
The general context for ROC analysis is binary classification, where a classifier assigns elements of a set into two groups. The two classes are usually referred to as "positive" and "negative". Here, we assume that the classifier can be reduced to the following functional behavior:
def classifier(observation, t):
    # score_function() is the classifier's scoring rule, described below
    if score_function(observation) <= t:
        return "negative"   # observation belongs to the "negative" class
    else:
        return "positive"   # observation belongs to the "positive" class
The core of a classifier is the scoring function that converts observations into a numeric value measuring the affinity of the observation to the positive class. Here, the scoring function incorporates the set of rules, the mathematical functions, the weights and parameters, and all the ingenuity that makes a good classifier. For example, in logistic regression classification, one possible choice for the scoring function is the logistic function that estimates the probability p(x) of an observation x belonging to the positive class.
In a final step, the classifier converts the computed score into a binary class assignment by comparing the score against a decision threshold (or prediction cutoff) t.
Given the classifier and a fixed decision threshold t, we can compute actual class predictions y_p for given observations x. To assess the capability of a classifier, the class predictions y_p are compared with the true class labels y_t of a validation dataset. If y_p and y_t match, we refer to them as true positives (TP) or true negatives (TN), depending on the value of y_p and y_t; if y_p and y_t do not match, they are false positives (FP) or false negatives (FN).
We can apply this to the entire validation dataset and count the total number of TPs, TNs, FPs and FNs, as well as the true positive rate (TPR) and false positive rate (FPR), which are defined as follows:
TPR = TP / P = TP / (TP+FN) = number of true positives / number of positives
FPR = FP / N = FP / (FP+TN) = number of false positives / number of negatives
Note that the TPR is often referred to as the sensitivity, and the FPR is equivalent to 1 - specificity.
In comparison, the accuracy is defined as the ratio of all correctly labeled cases and the total number of cases:
accuracy = (TP+TN)/(Total number of cases) = (TP+TN)/(TP+FP+TN+FN)
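In code, these quantities can be read straight off a confusion matrix (a generic sketch using sklearn, with y_t / y_p as placeholder arrays following the notation above):

from sklearn.metrics import confusion_matrix

# y_t: true labels, y_p: predicted labels (placeholders)
tn, fp, fn, tp = confusion_matrix(y_t, y_p).ravel()

tpr = tp / (tp + fn)                        # sensitivity
fpr = fp / (fp + tn)                        # 1 - specificity
accuracy = (tp + tn) / (tp + tn + fp + fn)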
Given a classifier and a validation dataset, we can evaluate the true positive rate TPR(t) and false positive rate FPR(t) for varying decision thresholds t. And here we are: plotting FPR(t) against TPR(t) yields the receiver operating characteristic (ROC) curve. Below are some sample ROC curves, plotted in Python using roc-utils*.
Think of the decision threshold t as a final free parameter that can be tuned at the end of the training process. The ROC analysis offers means to find an optimal cutoff t* (e.g., Youden index, concordance, distance from optimal point).
Furthermore, we can examine with the ROC curve how well the classifier can discriminate between samples from the "positive" and the "negative" class:
Try to understand how the FPR and TPR change for increasing values of t. In the first extreme case (with some very small value for t), all samples are classified as "positive". Hence, there are no true negatives (TN=0), and thus FPR=TPR=1. By increasing t, both FPR and TPR gradually decrease, until we reach the second extreme case, where all samples are classified as negative, and none as positive: TP=FP=0, and thus FPR=TPR=0. In this process, we start in the top right corner of the ROC curve and gradually move to the bottom left.
In the case where the scoring function is able to separate the samples perfectly, leading to a perfect classifier, the ROC curve passes through the optimal point FPR(t)=0 and TPR(t)=1 (see the left figure below). In the other extreme case where the distributions of scores coincide for both classes, resulting in a random coin-flipping classifier, the ROC curve travels along the diagonal (see the right figure below).
Unfortunately, it is very unlikely that we can find a perfect classifier that reaches the optimal point (0,1) in the ROC curve. But we can try to get as close to it as possible.
The AUC, or area under the ROC curve, tries to capture this characteristic. It is a measure of how well a classifier can discriminate between the two classes. It varies between 0 and 1: a perfect classifier has an AUC of 1, while a classifier that assigns a random class label to input data would yield an AUC of 0.5.
* Disclaimer: I'm the author of roc-utils
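To make the two extreme cases concrete, here is a tiny sketch with sklearn (synthetic scores, not the asker's data): a score function that separates the classes perfectly gives an AUC of 1.0, while uninformative random scores hover around 0.5.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = np.array([0] * 500 + [1] * 500)              # balanced classes

perfect_scores = y + rng.normal(0, 0.01, 1000)   # scores fully separate the classes
random_scores = rng.random(1000)                 # scores carry no information

print(roc_auc_score(y, perfect_scores))   # 1.0
print(roc_auc_score(y, random_scores))    # ≈ 0.5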
I guess you are misreading the correct class when calculating the ROC curve...
That would explain the low accuracy and the high (wrongly calculated) AUC.
It is easy to see that AUC can be misleading when used to compare two classifiers if their ROC curves cross. Classifier A may produce a higher AUC than B, while B performs better for a majority of the thresholds with which you may actually use the classifier. And in fact, empirical studies have shown that it is indeed very common for ROC curves of common classifiers to cross. There are also deeper reasons why AUC is incoherent and therefore an inappropriate measure (see references below).
http://sandeeptata.blogspot.com/2015/04/on-dangers-of-auc.html
Another simple explanation for this behaviour is that your model is actually very good - just its final threshold to make predictions binary is bad.
I came across this problem with a convolutional neural network on a binary image classification task. Consider, e.g., that you have 4 samples with labels 0, 0, 1, 1. Let's say your model produces continuous predictions for these four samples like so: 0.7, 0.75, 0.9 and 0.95.
We would consider this to be a good model, since high values (> 0.8) predict class 1 and low values (< 0.8) predict class 0. Hence, the ROC-AUC would be 1. Note how I used a threshold of 0.8. However, if you use a fixed and badly-chosen threshold for these predictions, say 0.5, which is what we sometimes force upon our model output, then all 4 sample predictions would be class 1, which leads to an accuracy of 50%.
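That example can be checked directly (a small sketch using sklearn, with the four hypothetical predictions above):

import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.7, 0.75, 0.9, 0.95])

print(roc_auc_score(y_true, y_score))                       # 1.0: the ranking is perfect
print(accuracy_score(y_true, (y_score > 0.5).astype(int)))  # 0.5 with threshold 0.5
print(accuracy_score(y_true, (y_score > 0.8).astype(int)))  # 1.0 with threshold 0.8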
Note that most models optimize not for accuracy, but for some sort of loss function. In my CNN, training for just a few epochs longer solved the problem.
Make sure that you know what you are doing when you transform a continuous model output into a binary prediction. If you do not know what threshold to use for a given ROC curve, have a look at Youden's index or find the threshold value that represents the "most top-left" point in your ROC curve.
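A hedged sketch of picking such a threshold from the ROC curve via Youden's index (J = TPR - FPR; y_true and y_score are placeholder arrays):

import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # y_true / y_score assumed given
youden_j = tpr - fpr                                # height above the diagonal
best_threshold = thresholds[np.argmax(youden_j)]    # "most top-left" point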
If this is happening every single time, maybe your model is not right.
Start by changing the kernel and try the model on new data splits.
Look at the confusion matrix every time and check the TN and TP areas. The model is probably inadequate at detecting one of them.

Resources