How is the AUC of a decision tree calculated? - machine-learning

Suppose I have a dataset which has only one continuous variable, and I try to use a decision tree algorithm to build a model that classifies the +ve and -ve labels in the dataset. I run 10-fold cross-validation.
How is the AUC calculated for the decision tree classifier? Will the algorithm check different threshold values of the classifier and determine the AUC from those?
What if I have more than 2 continuous variables?
Thanks!

Off topic, but hey:
AUC only makes sense for binary classification. The number of predictors does not matter.
Decision trees do not inherently have a 'threshold', but typically in a classification problem the leaves contain a probability distribution over the 2 classes, and so does the tree's prediction. So you could conceive of picking the positive class only if its probability is >= p, not just >= 0.5. Then you could draw a ROC curve and compute the AUC.
So it's a little unnatural to apply this to a decision tree, but it can be done.
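For illustration, here is a minimal scikit-learn sketch of that idea (it assumes a DataFrame df with a single feature column "x" and a binary label column "y"; roc_auc_score sweeps the probability thresholds for you):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
# 10-fold CV: each sample is scored by a tree that never saw it during training
proba = cross_val_predict(clf, df[["x"]], df["y"], cv=10, method="predict_proba")
print(roc_auc_score(df["y"], proba[:, 1]))  # AUC computed from the leaf probabilities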

Related

Number of trees per boosting round in XGBoost

I want to figure out why in binary classification we only need 1 tree per boosting round, while in n-class multi-class classification we need n trees per boosting round. I think XGBoost needs to compute a sum of scores for each class in every round and pass it to the softmax function, so binary classification should use two trees per round. Is there anything wrong with this reasoning?

In Classification, what is the difference between the test accuracy and the AUC score?

I am working on a classification-based project, and I am evaluating different ML models based on their training accuracy, testing accuracy, confusion matrix, and the AUC score. I am now stuck in understanding the difference between the scores I get by calculating accuracy of a ML model on the test set (X_test), and the AUC score.
If I am correct, both metrics calculate how well a ML model is able to predict the correct class of previously unseen data. I also understand that for both, the higher the number the better, as long as the model is not over-fit or under-fit.
Assuming a ML model is neither over-fit nor under-fit, what is the difference between test accuracy score and the AUC score?
I don't have a background in math and stats, and I pivoted towards data science from a business background. Therefore, I would appreciate an explanation that a business person can understand.
Both terms quantify the quality of a classification model; however, the accuracy quantifies a single manifestation of the variables, which means it describes a single confusion matrix. The AUC (area under the curve) represents the trade-off between the true-positive rate (tpr) and the false-positive rate (fpr) across multiple confusion matrices, generated for different fpr values of the same classifier.
A confusion matrix is of the form:

|                    | actual positive | actual negative |
|--------------------|-----------------|-----------------|
| predicted positive | tp              | fp              |
| predicted negative | fn              | tn              |

1) The accuracy is a measure for a single confusion matrix and is defined as:

accuracy = (tp + tn) / (tp + tn + fp + fn)

where tp=true-positives, tn=true-negatives, fp=false-positives and fn=false-negatives (the count of each).
2) The AUC measures the area under the ROC (receiver operating characteristic), that is, the trade-off curve between the true-positive rate and the false-positive rate. For each choice of the false-positive-rate (fpr) threshold, the true-positive rate (tpr) is determined. I.e., for a given classifier an fpr of 0, 0.1, 0.2 and so forth is accepted, and for each fpr its corresponding tpr is evaluated. Therefore, you get a function tpr(fpr) that maps the interval [0,1] onto the same interval, because both rates are defined on that interval. The area under this curve is called the AUC; it lies between 0 and 1, whereby a random classifier is expected to yield an AUC of 0.5.
The AUC, as it is the area under the curve, is defined as:

AUC = ∫_0^1 tpr(fpr) d(fpr)

However, in real (and finite) applications, the ROC is a step function and the AUC is determined by a weighted sum of these levels (the trapezoidal rule).
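As a concrete illustration of that weighted sum, here is a small numpy sketch (my own, not from the lecture; it assumes both classes are present and ignores the subtleties of tied scores):

import numpy as np

def roc_auc(y_true, scores):
    # Sort samples by descending score, trace the (fpr, tpr) step curve,
    # then integrate it with the trapezoidal rule.
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(y_true)[order]
    P, N = y.sum(), len(y) - y.sum()
    tpr = np.concatenate(([0.0], np.cumsum(y) / P))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / N))
    return np.trapz(tpr, fpr)

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75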
Graphics are from Borgelt's Intelligent Data Mining Lecture.

Reason of having high AUC and low accuracy in a balanced dataset

Given a balanced dataset (the sizes of both classes are the same), fitting an SVM model to it yields a high AUC value (~0.9) but a low accuracy (~0.5).
I have no idea why this would happen; can anyone explain this case for me?
The ROC curve is biased towards the positive class. The described situation with high AUC and low accuracy can occur when your classifier achieves good performance on the positive class (high AUC), at the cost of a high false negative rate (or a low number of true negatives).
The question of why the training process resulted in a classifier with poor predictive performance is very specific to your problem/data and the classification methods used.
The ROC analysis tells you how well the samples of the positive class can be separated from the other class, while the prediction accuracy hints on the actual performance of your classifier.
About ROC analysis
The general context for ROC analysis is binary classification, where a classifier assigns elements of a set into two groups. The two classes are usually referred to as "positive" and "negative". Here, we assume that the classifier can be reduced to the following functional behavior:
def classifier(observation, t):
    if score_function(observation) <= t:
        return "negative"   # observation belongs to the "negative" class
    else:
        return "positive"   # observation belongs to the "positive" class
The core of a classifier is the scoring function that converts observations into a numeric value measuring the affinity of the observation to the positive class. Here, the scoring function incorporates the set of rules, the mathematical functions, the weights and parameters, and all the ingenuity that makes a good classifier. For example, in logistic regression classification, one possible choice for the scoring function is the logistic function that estimates the probability p(x) of an observation x belonging to the positive class.
In a final step, the classifier converts the computed score into a binary class assignment by comparing the score against a decision threshold (or prediction cutoff) t.
Given the classifier and a fixed decision threshold t, we can compute actual class predictions y_p for given observations x. To assess the capability of a classifier, the class predictions y_p are compared with the true class labels y_t of a validation dataset. If y_p and y_t match, we speak of a true positive TP or a true negative TN, depending on the value of y_p and y_t; if y_p and y_t do not match, we speak of a false positive FP or a false negative FN.
We can apply this to the entire validation dataset and count the total number of TPs, TNs, FPs and FNs, as well as the true positive rate (TPR) and false positive rate (FPR), which are defined as follows:
TPR = TP / P = TP / (TP+FN) = number of true positives / number of positives
FPR = FP / N = FP / (FP+TN) = number of false positives / number of negatives
Note that the TPR is often referred to as the sensitivity, and the FPR is equivalent to 1-specificity.
In comparison, the accuracy is defined as the ratio of all correctly labeled cases and the total number of cases:
accuracy = (TP+TN)/(Total number of cases) = (TP+TN)/(TP+FP+TN+FN)
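A tiny helper (purely illustrative, not part of any library) that translates these three definitions directly into code:

def rates(tp, tn, fp, fn):
    tpr = tp / (tp + fn)                        # sensitivity
    fpr = fp / (fp + tn)                        # 1 - specificity
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # fraction of correctly labeled cases
    return tpr, fpr, accuracy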
Given a classifier and a validation dataset, we can evaluate the true positive rate TPR(t) and false positive rate FPR(t) for varying decision thresholds t. And here we are: plotting FPR(t) against TPR(t) yields the receiver operating characteristic (ROC) curve. Below are some sample ROC curves, plotted in Python using roc-utils*.
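If you do not want to depend on roc-utils, an equivalent sketch with scikit-learn and matplotlib looks like this (y_true and scores are assumed to come from your validation data):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, scores)
plt.plot(fpr, tpr)                          # the ROC curve itself
plt.plot([0, 1], [0, 1], linestyle="--")    # diagonal of a random classifier
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.show()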
Think of the decision threshold t as a final free parameter that can be tuned at the end of the training process. The ROC analysis offers means to find an optimal cutoff t* (e.g., Youden index, concordance, distance from optimal point).
Furthermore, we can examine with the ROC curve how well the classifier can discriminate between samples from the "positive" and the "negative" class:
Try to understand how the FPR and TPR change for increasing values of t. In the first extreme case (with some very small value for t), all samples are classified as "positive". Hence, there are no true negatives (TN=0), and thus FPR=TPR=1. By increasing t, both FPR and TPR gradually decrease, until we reach the second extreme case, where all samples are classified as negative, and none as positive: TP=FP=0, and thus FPR=TPR=0. In this process, we start in the top right corner of the ROC curve and gradually move to the bottom left.
In the case where the scoring function is able to separate the samples perfectly, leading to a perfect classifier, the ROC curve passes through the optimal point FPR(t)=0 and TPR(t)=1 (see the left figure below). In the other extreme case where the distributions of scores coincide for both classes, resulting in a random coin-flipping classifier, the ROC curve travels along the diagonal (see the right figure below).
Unfortunately, it is very unlikely that we can find a perfect classifier that reaches the optimal point (0,1) in the ROC curve. But we can try to get as close to it as possible.
The AUC, or the area under the ROC curve, tries to capture this characteristic. It is a measure of how well a classifier can discriminate between the two classes. It varies between 0 and 1. In the case of a perfect classifier, the AUC is 1. A classifier that assigns a random class label to input data would yield an AUC of 0.5.
* Disclaimer: I'm the author of roc-utils
I guess you are misreading the correct class when calculating the ROC curve...
That would explain the low accuracy and the high (wrongly calculated) AUC.
It is easy to see that AUC can be misleading when used to compare two
classifiers if their ROC curves cross. Classifier A may produce a
higher AUC than B, while B performs better for a majority of the
thresholds with which you may actually use the classifier. And in fact
empirical studies have shown that it is indeed very common for ROC
curves of common classifiers to cross. There are also deeper reasons
why AUC is incoherent and therefore an inappropriate measure (see
references below).
http://sandeeptata.blogspot.com/2015/04/on-dangers-of-auc.html
Another simple explanation for this behaviour is that your model is actually very good - just its final threshold for turning predictions into binary labels is bad.
I came across this problem with a convolutional neural network on a binary image classification task. Consider, e.g., that you have 4 samples with labels 0, 0, 1, 1. Let's say your model produces continuous predictions for these four samples like so: 0.7, 0.75, 0.9 and 0.95.
We would consider this to be a good model, since high values (> 0.8) predict class 1 and low values (< 0.8) predict class 0. Hence, the ROC-AUC would be 1. Note how I used a threshold of 0.8. However, if you use a fixed and badly-chosen threshold for these predictions, say 0.5, which is what we sometimes force upon our model output, then all 4 sample predictions would be class 1, which leads to an accuracy of 50%.
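A minimal sketch of exactly this situation, using the labels and predictions from the example above:

from sklearn.metrics import roc_auc_score, accuracy_score

y_true = [0, 0, 1, 1]
scores = [0.7, 0.75, 0.9, 0.95]

print(roc_auc_score(y_true, scores))             # 1.0 - the ranking is perfect
bad = [1 if s >= 0.5 else 0 for s in scores]     # fixed, badly chosen cutoff
print(accuracy_score(y_true, bad))               # 0.5
good = [1 if s >= 0.8 else 0 for s in scores]    # cutoff read off the ROC curve
print(accuracy_score(y_true, good))              # 1.0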
Note that most models optimize not for accuracy, but for some sort of loss function. In my CNN, training for just a few epochs longer solved the problem.
Make sure that you know what you are doing when you transform a continuous model output into a binary prediction. If you do not know what threshold to use for a given ROC curve, have a look at Youden's index or find the threshold value that represents the "most top-left" point in your ROC curve.
If this is happening every single time, maybe your model is not correct.
Start with the kernel: change it and retrain the model with the new settings.
Look at the confusion matrix every time and check the TN and TP areas; the model is probably inadequate at detecting one of them.

How to get scores along with Machine Learning results?

I am new to machine learning and I am currently working on a classification problem. I am able to train the model and predict on test data sets. I want to know whether there is some way I can get scores along with the prediction. By scores, I mean proximity scores that come along with the prediction. For example, in the standard age-salary-buy problem (based on age and salary, will the customer buy the product or not), in addition to the prediction of whether he will buy the product, I want to know a score out of 100 that he will buy it.
Currently, I am using the LibSVM algorithm. Is there some algorithm which provides me the above data?
Thanks.
What you are looking for is a measure of support for your decision. In other words, many classifiers base their decision about the class of x over labels Y on:
cl(x) = arg max_{y \in Y} p(y|x)
where p(y|x) is their internal estimation of "x having label y". And such classifiers include:
neural networks (with sigmoid output)
logistic regression
naive bayes
voting ensembles (such as RF)
...
These methods can be easily converted to your 0-100 scale, as probability is in 0-1 scale.
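For example, with one of the listed classifiers (a random forest; X_train, y_train and X_test are assumed to exist), the conversion is a one-liner:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = rf.predict_proba(X_test)          # internal estimates p(y|x) for every class
pred = proba.argmax(axis=1)               # cl(x) = arg max_y p(y|x)
score_0_100 = 100 * proba.max(axis=1)     # support for the chosen class on a 0-100 scale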
Some, on the other hand, use a measure proportional to probability (such as SVM), but unbounded; here you can get this value (often called the decision function), but you cannot convert it to a 0-100 score (as you do not have a "maximum" value). This is a big drawback, so some modifications were proposed. In particular, for SVM you have Platt's scaling, which actually fits a logistic regression on top of the SVM so that you get a probability estimate. In libSVM you can set -b to get probability estimates.
from libsvm website
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
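If you are working from Python rather than the libSVM command line, scikit-learn's SVC wraps libSVM, and probability=True switches on the same Platt scaling (X_train, y_train, X_test are assumed to exist):

from sklearn.svm import SVC

model = SVC(probability=True).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]   # estimated probability of the positive class
score_0_100 = 100 * proba                   # rescaled to the 0-100 score asked for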

What is the difference between linear regression and logistic regression? [closed]

When we have to predict the value of a categorical (or discrete) outcome we use logistic regression. I believe we use linear regression to also predict the value of an outcome given the input values.
Then, what is the difference between the two methodologies?
Linear regression output as probabilities
It's tempting to use the linear regression output as probabilities, but it's a mistake because the output can be negative or greater than 1, whereas a probability cannot be. As regression might actually produce probabilities that are less than 0 or even bigger than 1, logistic regression was introduced.
Source: http://gerardnico.com/wiki/data_mining/simple_logistic_regression
Outcome
In linear regression, the outcome (dependent variable) is continuous.
It can have any one of an infinite number of possible values.
In logistic regression, the outcome (dependent variable) has only a limited number of possible values.
The dependent variable
Logistic regression is used when the response variable is categorical in nature. For instance, yes/no, true/false, red/green/blue,
1st/2nd/3rd/4th, etc.
Linear regression is used when your response variable is continuous. For instance, weight, height, number of hours, etc.
Equation
Linear regression gives an equation of the form Y = mX + C, i.e. an equation with degree 1.
However, logistic regression gives an equation of the form
Y = 1 / (1 + e^(-X))
Coefficient interpretation
In linear regression, the coefficient interpretation of independent variables are quite straightforward (i.e. holding all other variables constant, with a unit increase in this variable, the dependent variable is expected to increase/decrease by xxx).
However, in logistic regression, the interpretation depends on the family (binomial, Poisson, etc.) and the link (log, logit, inverse-log, etc.) you use.
Error minimization technique
Linear regression uses ordinary least squares method to minimise the
errors and arrive at a best possible fit, while logistic regression
uses maximum likelihood method to arrive at the solution.
Linear regression is usually solved by minimizing the least squares error of the model to the data, therefore large errors are penalized quadratically.
Logistic regression is just the opposite. Using the logistic loss function causes large errors to be penalized by an asymptotically constant amount.
Consider linear regression on categorical {0, 1} outcomes to see why this is a problem. If your model predicts the outcome is 38 when the truth is 1, you've lost almost nothing under the logistic loss. Linear regression would try to reduce that 38; logistic regression wouldn't (as much).
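A tiny numerical sketch of that asymmetry, using the margin-based form of the logistic loss log(1 + exp(-y*f)) with the true label encoded as y = +1 (my own illustration, not from the answer):

import numpy as np

f, y = 38.0, 1.0
print((f - y) ** 2)              # 1369.0 - squared error keeps pulling the 38 down
print(np.log1p(np.exp(-y * f)))  # ~3e-17 - the logistic loss has already saturated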
In linear regression, the outcome (dependent variable) is continuous. It can have any one of an infinite number of possible values. In logistic regression, the outcome (dependent variable) has only a limited number of possible values.
For instance, if X contains the area in square feet of houses, and Y contains the corresponding sale price of those houses, you could use linear regression to predict the selling price as a function of house size. While the possible selling price may not literally be any value, there are so many possible values that a linear regression model would be chosen.
If, instead, you wanted to predict, based on size, whether a house would sell for more than $200K, you would use logistic regression. The possible outputs are either Yes, the house will sell for more than $200K, or No, the house will not.
Just to add on the previous answers.
Linear regression
Is meant to resolve the problem of predicting/estimating the output value for a given element X (say f(x)). The result of the prediction is a continuous function whose values may be positive or negative. In this case you normally have an input dataset with lots of examples and the output value for each one of them. The goal is to be able to fit a model to this dataset so that you are able to predict that output for new, never-seen elements. Following is the classical example of fitting a line to a set of points, but in general linear regression could be used to fit more complex models (using higher polynomial degrees):
Resolving the problem
Linear regression can be solved in two different ways:
Normal equation (direct way to solve the problem)
Gradient descent (Iterative approach)
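For the normal-equation option, here is a minimal numpy sketch (X is assumed to already contain a column of ones for the intercept):

import numpy as np

def normal_equation(X, y):
    # theta = (X^T X)^{-1} X^T y, solved without forming the explicit inverse
    return np.linalg.solve(X.T @ X, X.T @ y)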
Logistic regression
Is meant to resolve classification problems where, given an element, you have to classify it into N categories. Typical examples are, for example, given an email, classifying it as spam or not, or given a vehicle, finding which category it belongs to (car, truck, van, etc.). Basically, the output is a finite set of discrete values.
Resolving the problem
Logistic regression has no closed-form solution like the normal equation, so it is typically resolved with an iterative method such as gradient descent. The formulation in general is very similar to linear regression; the only difference is the use of a different hypothesis function. In linear regression the hypothesis has the form:
h(x) = theta_0 + theta_1*x_1 + theta_2*x_2 ..
where theta is the model we are trying to fit and [1, x_1, x_2, ..] is the input vector. In logistic regression the hypothesis function is different:
g(x) = 1 / (1 + e^-x)
This function has a nice property: it maps any value into the range (0,1), which is appropriate for handling probabilities during classification. For example, in the case of a binary classification, g(x) could be interpreted as the probability of belonging to the positive class. In this case you normally have different classes that are separated by a decision boundary, which is basically a curve that decides the separation between the different classes. Following is an example of a dataset separated into two classes.
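A minimal sketch of the two hypotheses side by side (theta and x are made-up values for illustration; the first entry of x is the bias term 1):

import numpy as np

def h_linear(theta, x):
    return theta @ x                           # theta_0 + theta_1*x_1 + theta_2*x_2 + ...

def h_logistic(theta, x):
    return 1.0 / (1.0 + np.exp(-(theta @ x)))  # the same linear score squashed into (0, 1)

theta = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 0.3, 0.8])
print(h_linear(theta, x), h_logistic(theta, x))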
You can also use the below code to generate the linear regression curve:

import pandas as pd
import matplotlib.pyplot as plt
import sklearn.model_selection
import statsmodels.api as sm

q_df = details_df  # details_df and q3_rechange_delay_df are assumed to exist already
# q_df = pd.get_dummies(q_df)
q_df = pd.get_dummies(q_df, columns=["1", "2", "3", "4", "5", "6", "7", "8", "9"])
q_1_df = q_df["1"]
q_df = q_df.drop(["2", "3", "4", "5"], axis=1)

x = sm.add_constant(q_df)
train_x, test_x, train_y, test_y = sklearn.model_selection.train_test_split(
    x, q3_rechange_delay_df, test_size=0.2, random_state=123)

lmod = sm.OLS(train_y, train_x).fit()
lmod.summary()

lmod.predict()[:10]
lmod.get_prediction().summary_frame()[:10]

sm.qqplot(lmod.resid, line="q")
plt.title("Q-Q plot of Standardized Residuals")
plt.show()
Simply put, linear regression is a regression algorithm, which outputs a possibly continuous and unbounded value; logistic regression is considered a binary classifier algorithm, which outputs the 'probability' of the input belonging to a label (0 or 1).
The basic difference:
Linear regression is basically a regression model, which means it will give a non-discrete/continuous output of a function. So this approach gives the value. For example: given x, what is f(x)?
For example, given a training set of different factors and the price of a property, after training we can provide the required factors to determine what the property price will be.
Logistic regression is basically a binary classification algorithm, which means that here the output of the function is discrete-valued. For example: for a given x, if f(x) > threshold classify it as 1, else classify it as 0.
For example, given a set of brain tumour sizes as training data, we can use the size as input to determine whether it is a benign or malignant tumour. Therefore here the output is discrete, either 0 or 1.
*here the function is basically the hypothesis function
They are both quite similar in solving for the solution, but as others have said, one (Logistic Regression) is for predicting a category "fit" (Y/N or 1/0), and the other (Linear Regression) is for predicting a value.
So if you want to predict whether you have cancer Y/N (or a probability) - use logistic regression. If you want to know how many years you will live - use linear regression!
Regression means a continuous variable; linear means there is a linear relation between y and x.
Ex: You are trying to predict salary from the number of years of experience. So here salary is the dependent variable (y) and years of experience is the independent variable (x).
y = b0 + b1*x1
We are trying to find the optimum values of the constants b0 and b1 which will give us the best-fitting line for the observed data.
It is the equation of a line which gives a continuous value from x = 0 up to very large values.
This line is called the linear regression model.
Logistic regression is a type of classification technique. Don't be misled by the term regression. Here we predict whether y = 0 or y = 1.
Here we first need to find p(y=1), the probability of y = 1, for a given x from the formula below:
p(y=1) = 1 / (1 + e^-(b0 + b1*x1))
The probability p is related to the predicted class y by thresholding, e.g.:
y = 1 if p >= 0.5, otherwise y = 0
Ex: we can classify a tumour having more than a 50% chance of being cancerous as 1 and a tumour having less than a 50% chance as 0.
Here the red point will be predicted as 0 whereas the green point will be predicted as 1.
Cannot agree more with the above comments.
On top of that, there are some more differences, such as:
In Linear Regression, residuals are assumed to be normally distributed.
In Logistic Regression, residuals need to be independent but not normally distributed.
Linear Regression assumes that a constant change in the value of the explanatory variable results in constant change in the response variable.
This assumption does not hold if the value of the response variable represents a probability (in Logistic Regression)
GLMs (generalized linear models) do not assume a linear relationship between the dependent and independent variables. However, the logit model assumes a linear relationship between the link function and the independent variables.
| Basis | Linear | Logistic |
|-------|--------|----------|
| Basic | The data is modelled using a straight line. | The probability of some obtained event is represented as a linear function of a combination of predictor variables. |
| Linear relationship between dependent and independent variables | Is required | Not required |
| The independent variables | Could be correlated with each other (especially in multiple linear regression) | Should not be correlated with each other (no multicollinearity) |
In short:
Linear Regression gives continuous output, i.e. any value within a range of values.
Logistic Regression gives discrete output, i.e. Yes/No, 0/1 kinds of outputs.
To put it simply, in a linear regression model, if more test cases arrive that are far away from the threshold (say 0.5) for predicting y=1 vs y=0, then the hypothesis will change and become worse. Therefore a linear regression model is not used for classification problems.
Another problem is that if the classes are y=0 and y=1, h(x) can be > 1 or < 0. So we use logistic regression, where 0 <= h(x) <= 1.
Logistic Regression is used for predicting categorical outputs like Yes/No, Low/Medium/High, etc. You basically have 2 types of logistic regression: binary logistic regression (Yes/No, Approved/Disapproved) and multi-class logistic regression (Low/Medium/High, digits from 0-9, etc.).
On the other hand, linear regression is if your dependent variable (y) is continuous.
y = mx + c is a simple linear regression equation (m = slope and c is the y-intercept). Multilinear regression has more than 1 independent variable (x1,x2,x3 ... etc)
In linear regression the outcome is continuous whereas in logistic regression, the outcome has only a limited number of possible values(discrete).
example:
In a scenario where the given value x is the size of a plot in square feet, predicting y, i.e. the rate (price) of the plot, comes under linear regression.
If, instead, you wanted to predict, based on size, whether the plot would sell for more than 300000 Rs, you would use logistic regression. The possible outputs are either Yes, the plot will sell for more than 300000 Rs, or No.
In case of Linear Regression the outcome is continuous while in case of Logistic Regression outcome is discrete (not continuous)
To perform Linear regression we require a linear relationship between the dependent and independent variables. But to perform Logistic regression we do not require a linear relationship between the dependent and independent variables.
Linear Regression is all about fitting a straight line in the data while Logistic Regression is about fitting a curve to the data.
Linear Regression is a regression algorithm for Machine Learning while Logistic Regression is a classification Algorithm for machine learning.
Linear regression assumes gaussian (or normal) distribution of dependent variable. Logistic regression assumes binomial distribution of dependent variable.
The basic difference between Linear Regression and Logistic Regression is :
Linear Regression is used to predict a continuous or numerical value, but when we are looking to predict a value that is categorical, Logistic Regression comes into the picture.
Logistic Regression is used for binary classification.
