Sklearn models: decision function vs predict_proba for roc curve

In sklearn, roc_curve requires (y_true, y_score). Generally, for y_score, I feed in the probabilities output by a classifier's predict_proba function. But in the sklearn example, I see that both predict_proba and decision_function are used.
What is the difference between the two in terms of real-life model evaluation?

The functional form of logistic regression is
$$f(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$$
This is what is returned by predict_proba.
The term inside the exponential, i.e.
$$d(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
is what is returned by decision_function. The "hyperplane" referred to in the documentation is
$$\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k = 0$$
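Because f(x) is a strictly increasing function of d(x), both outputs rank the samples in the same order, so a ROC curve (which depends only on the ranking) comes out the same either way. The following is a minimal sketch in plain Python, not scikit-learn code: the coefficients, the toy data, and the helper names are made up for illustration.

```python
import math

def decision_function(x, intercept, coefs):
    """Linear score d(x) = b0 + b1*x1 + ... + bk*xk (unbounded)."""
    return intercept + sum(b * xi for b, xi in zip(coefs, x))

def predict_proba(x, intercept, coefs):
    """Logistic transform f(x) = 1 / (1 + exp(-d(x))), always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-decision_function(x, intercept, coefs)))

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical fitted coefficients and toy data, for illustration only.
intercept, coefs = -0.5, [1.2, -0.7]
X = [[0.1, 1.0], [2.0, 0.3], [1.5, 2.5], [-1.0, 0.2], [0.8, 0.1], [3.0, 4.0]]
y = [0, 1, 1, 0, 1, 0]

scores = [decision_function(x, intercept, coefs) for x in X]
probs = [predict_proba(x, intercept, coefs) for x in X]

auc_scores = auc(y, scores)  # AUC from raw decision values
auc_probs = auc(y, probs)    # AUC from probabilities
print(auc_scores, auc_probs)
```

The two AUC values agree because the sigmoid never changes which of two samples scores higher; only the scale of the scores differs.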

My understanding after reading a few resources:
Decision function: gives a score proportional to the signed distance from the hyperplane. These scores are unbounded and therefore cannot be equated to probabilities. To obtain probabilities, there are two solutions: Platt scaling, and Multi-Attribute Spaces, which calibrates outputs using Extreme Value Theory.
Predict proba: gives probabilities (0 to 1). For SVC, however, the parameter 'probability' has to be set to True when the model is constructed, before fitting. It then uses Platt scaling, which is known to have theoretical issues.
Refer to this in documentation.
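To make the idea of Platt scaling concrete, here is a simplified sketch: a one-dimensional logistic model p = sigmoid(a*d + b) is fitted on the decision values d. This is not the exact procedure used inside any library (Platt's original method uses smoothed targets, a different optimizer, and held-out data); the toy decision values, the learning rate, and the function names are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_log_loss(a, b, decision_values, labels):
    """Mean negative log-likelihood of the labels under p = sigmoid(a*d + b)."""
    total = 0.0
    for d, y in zip(decision_values, labels):
        p = sigmoid(a * d + b)
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(labels)

def platt_fit(decision_values, labels, lr=0.1, n_iter=2000):
    """Fit a and b by plain gradient descent on the mean log loss."""
    a, b = 1.0, 0.0
    n = len(labels)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for d, y in zip(decision_values, labels):
            p = sigmoid(a * d + b)
            grad_a += (p - y) * d / n
            grad_b += (p - y) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Hypothetical decision values (e.g. from an SVM) and true labels.
decision_values = [-2.5, -1.0, -0.3, 0.2, 0.4, 1.1, 1.8, 3.0]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = platt_fit(decision_values, labels)
calibrated = [sigmoid(a * d + b) for d in decision_values]
print(a, b, calibrated)
```

In scikit-learn itself, CalibratedClassifierCV with method="sigmoid" is the usual way to apply this kind of calibration to a classifier that only exposes decision_function.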

Related

Is the loss function='Multiclass' in catboost same as log loss if I am doing a multiclassification problem?

I am building a multiclass prediction model using catboost. The final solution should have the minimum Logloss error, but I do not see Logloss available in catboost; they have something called 'MultiClass' as the loss function. Are they both the same? If not, how can I measure the accuracy of the catboost model in terms of Logloss?
Are they both the same? Effectively, yes.
The catboost documentation describes the calculation of the 'MultiClass' loss as what is generally known as multinomial (multiclass) cross-entropy loss. That is, effectively, a log softmax applied to the classifier output 'a' to produce values that can be interpreted as log-probabilities, followed by the negative log-likelihood loss (NLLLoss); see wiki1 & wiki2.
Their documentation also describes the calculation of 'Logloss', which again is NLLLoss, but applied to 'p', which they describe as the result of applying the sigmoid function to the classifier output. Since the NLLLoss is reworked for the binary problem, only a single class probability is calculated, with 'p' and '1-p' used for the two classes. In this special (binary) case, the use of sigmoid and softmax is equivalent.
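The equivalence in the binary case can be checked numerically. The sketch below is plain Python, not catboost code; it assumes the loss definitions as described above, and the raw outputs and labels are made up. A single raw output a passed through the sigmoid gives the same probability as the two-class raw output [0, a] passed through the softmax.

```python
import math

def softmax(raw):
    """Numerically stable softmax over a list of raw class outputs."""
    m = max(raw)
    exps = [math.exp(a - m) for a in raw]
    total = sum(exps)
    return [e / total for e in exps]

def multiclass_log_loss(raw_outputs, labels):
    """Mean negative log of the softmax probability of the true class."""
    losses = [-math.log(softmax(raw)[t]) for raw, t in zip(raw_outputs, labels)]
    return sum(losses) / len(losses)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def binary_log_loss(raw_outputs, labels):
    """Mean negative log-likelihood using p = sigmoid(a) for class 1."""
    total = 0.0
    for a, t in zip(raw_outputs, labels):
        p = sigmoid(a)
        total += -math.log(p) if t == 1 else -math.log(1.0 - p)
    return total / len(labels)

# Hypothetical raw classifier outputs for a binary problem.
raw_binary = [2.0, -1.0, 0.5]
labels = [1, 0, 0]

# The same problem written as a two-class softmax: class 0 gets raw score 0.
as_two_class = [[0.0, a] for a in raw_binary]

loss_sigmoid = binary_log_loss(raw_binary, labels)
loss_softmax = multiclass_log_loss(as_two_class, labels)
print(loss_sigmoid, loss_softmax)
```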
How can I measure the catboost model in terms of Logloss?
They describe a method to produce desired metrics on given data.
Be careful not to confuse the loss/objective function 'loss_function' with the evaluation metric 'eval_metric'. In this instance, however, the same function can be used for both, as listed in their supported metrics.
Hope this helps!
Log loss can also be used as a metric (not only as a training loss) to measure the performance of a classification model whose prediction is a probability value between 0 and 1.
Learn more here.

Loss function for OneVsRestClassifier

I have a OneVsRestClassifier (scikit-learn) which has been trained.
clf = OneVsRestClassifier(LogisticRegression(C=1.2, penalty='l1')).fit(X_train, y_train)
I want to find out the loss for my test data. I used log_loss function but it does not seem to work because I have multiple classes as outputs for each test case. What do I do?
The classification problem that you are referring to is known as a multi-label classification problem. Using the OneVsRestClassifier is a good choice for this purpose. By default, the score method uses subset accuracy, which is a very harsh metric because it requires the entire subset of labels to be predicted correctly.
Some other loss functions and metrics, provided by scikit-learn, that you can use are as follows:
Hamming Loss - This measures the Hamming distance between your predicted labels and the true labels, i.e. the fraction of labels that are predicted incorrectly. This is an intuitive formula to understand the Hamming distance.
Jaccard Similarity Coefficient Score - This measures the Jaccard similarity between your predicted labels and the true labels.
Precision, Recall and F-Measures - In the case of multi-label classification, the notion of Precision, Recall and F-Measures can be applied to each class independently. The following guide explains how to combine them across all labels in multi-label classification.
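To see how these metrics differ, the sketch below computes subset accuracy, Hamming loss, and the sample-averaged Jaccard score by hand on a tiny, made-up label indicator matrix. It is plain Python rather than scikit-learn code, and the function names are illustrative; scoring an empty prediction against an empty truth as 1.0 is an assumption of this sketch.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label set is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming_loss(y_true, y_pred):
    """Fraction of individual labels that are predicted incorrectly."""
    n_labels = len(y_true[0])
    wrong = sum(ti != pi
                for t, p in zip(y_true, y_pred)
                for ti, pi in zip(t, p))
    return wrong / (len(y_true) * n_labels)

def jaccard_samples(y_true, y_pred):
    """Mean over samples of |intersection| / |union| of the label sets."""
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = sum(1 for ti, pi in zip(t, p) if ti and pi)
        union = sum(1 for ti, pi in zip(t, p) if ti or pi)
        # Assumption: two empty label sets count as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)

# Toy data: 3 samples, 3 possible labels each (1 = label present).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]

acc = subset_accuracy(y_true, y_pred)   # only the second sample matches fully
ham = hamming_loss(y_true, y_pred)      # 2 wrong labels out of 9
jac = jaccard_samples(y_true, y_pred)   # mean of 1/2, 1 and 2/3
print(acc, ham, jac)
```

Subset accuracy penalizes the first and third samples fully even though each has only one wrong label, which is why the Hamming loss and Jaccard score give a more forgiving picture.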
If you also need to rank the labels, as is done in multi-label ranking problems, there are other, more advanced techniques available in scikit-learn, which are well documented with examples here. If you are dealing with this kind of problem, let me know in the comments and I will explain each of these metrics in more detail.
Hope this helps!

Are these different definitions of Likelihood functions In Machine Learning equivalent?

I have a lot of confusion regarding the way likelihood functions are defined in the context of different machine learning algorithms. For this discussion, I will reference Andrew Ng's CS229 lecture notes.
Here is my understanding thus far.
In the context of classification, we have two different types of algorithms: discriminative and generative. The goal in both of these cases is to determine the posterior probability, that is p(C_k|x;w), where w is parameter vector and x is feature vector and C_k is kth class. The approaches are different as in discriminative we are trying to solve for the posterior probability directly given x. And in the generative case, we are determining the conditional distributions p(x|C_k), and prior classes p(C_k), and using Bayes theorem to determine P(C_k|x;w).
From my understanding Bayes theorem takes the form: p(parameters|data) = p(data|parameters)p(parameters)/p(data) where the likelihood function is p(data|parameters), posterior is p(parameters|data) and prior is p(parameters).
Now in the context of linear regression, we have the likelihood function:
p(y|X;w) where y is the vector of target values, X is design matrix.
This makes sense according to how we defined the likelihood function above.
Now, moving over to classification, the likelihood is still defined as p(y|X;w). Will the likelihood always be defined as such?
The posterior probability we want is p(y_i|x;w) for each class, which is very strange, since this is apparently the likelihood function as well.
When reading through a text, the likelihood always seems to be defined in different ways, which confuses me. Is there a difference in how the likelihood function should be interpreted for regression vs. classification, or, say, generative vs. discriminative models? For example, the way the likelihood is defined in Gaussian discriminant analysis looks very different.
If anyone can recommend resources that go over this in detail I would appreciate this.
A quick answer is that the likelihood function is a function proportional to the probability of seeing the data conditional on all the parameters in your model. As you said in linear regression it is p(y|X,w) where w is your vector of regression coefficients and X is your design matrix.
In a classification context, your likelihood would be proportional to P(y|X,w) where y is your vector of observed class labels. You do not have a y_i for each class, because your training data was observed to be in one particular class. Given your model specification and your model parameters, for each observed data point you should be able to calculate the probability of seeing the observed class. This is your likelihood.
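As a concrete instance of the previous paragraph, the sketch below computes the likelihood p(y|X;w) for a logistic regression model as the product, over the training points, of the probability the model assigns to the observed class. The data, the two parameter vectors, and the function names are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob_of_observed(w, x, label):
    """Probability the model assigns to the label that was actually observed."""
    p1 = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return p1 if label == 1 else 1.0 - p1

def likelihood(w, X, y):
    """p(y | X; w): product of per-point probabilities of the observed labels."""
    value = 1.0
    for x, label in zip(X, y):
        value *= prob_of_observed(w, x, label)
    return value

def log_likelihood(w, X, y):
    """Sum of log-probabilities; numerically safer than the raw product."""
    return sum(math.log(prob_of_observed(w, x, label)) for x, label in zip(X, y))

# Toy training set: the first column is a constant 1 for the intercept.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]]
y = [1, 0, 1, 0]

w_good = [0.0, 2.0]   # hypothetical parameters that agree with the labels
w_bad = [0.0, -2.0]   # hypothetical parameters that contradict them

L_good = likelihood(w_good, X, y)
L_bad = likelihood(w_bad, X, y)
print(L_good, L_bad)
```

Viewed as a function of w with the data held fixed, this is the likelihood; maximum likelihood training searches for the w that makes it as large as possible.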
The posterior predictive distribution, p(y_new_i|X,y), is the probability you want in paragraph 4. This is distinct from the likelihood because it is the probability for some unobserved case, rather than the likelihood, which relates to your training data. Note that I removed w because typically you would want to marginalize over it rather than condition on it because there is still uncertainty in the estimate after training your model and you would want your predictions to marginalize over that rather than condition on one particular value.
As an aside, the goal of all classification methods is not to find a posterior distribution, only Bayesian methods are really concerned with a posterior and those methods are necessarily generative. There are plenty of non-Bayesian methods and plenty of non-probabilistic discriminative models out there.
Any function proportional to p(a|b) where a is fixed is a likelihood function for b. Note that p(a|b) might be called something else, depending on what's interesting at the moment. For example, p(a|b) can also be called the posterior for a given b. The names don't really matter.

Random Forests - Probability Estimates (+scikit-learn specific)

I am interested in understanding how probability estimates are calculated by random forests, both in general and specifically in Python's scikit-learn library (where probability estimates are returned by the predict_proba function).
Thanks,
Guy
The probabilities returned by a forest are the mean probabilities returned by the trees in the ensemble (docs).
The probabilities returned by a single tree are the normalized class histograms of the leaf a sample lands in.
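The two statements above can be sketched in a few lines of plain Python. This illustrates the averaging scheme only, not scikit-learn's internal code; the leaf class counts below are made up, and each inner list stands for the training-sample counts per class in the leaf that one tree routes the query sample to.

```python
def leaf_proba(class_counts):
    """Normalized class histogram of a single leaf."""
    total = sum(class_counts)
    return [c / total for c in class_counts]

def forest_proba(leaf_counts_per_tree):
    """Mean of the per-tree leaf probabilities, class by class."""
    per_tree = [leaf_proba(counts) for counts in leaf_counts_per_tree]
    n_trees = len(per_tree)
    n_classes = len(per_tree[0])
    return [sum(p[k] for p in per_tree) / n_trees for k in range(n_classes)]

# Hypothetical forest of 3 trees, 2 classes: counts in the leaf reached by
# one query sample in each tree.
leaves = [[8, 2], [3, 3], [0, 5]]

proba = forest_proba(leaves)  # mean of [0.8, 0.2], [0.5, 0.5], [0.0, 1.0]
print(proba)
```

Note that this averages probabilities rather than counting hard votes, so a tree whose leaf is split 50/50 contributes half a vote to each class.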
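In addition to what Andreas/Dougal said,
when you train the RF, turn on compute_importances=True (this was only needed in old scikit-learn versions; the parameter was later removed, and the importances are available after fitting without it).
Then inspect classifier.feature_importances_ to see which features are occurring high-up in the RF's trees.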

What's the meaning of logistic regression dataset labels?

I have been learning logistic regression for some days, and I think the labels of a logistic regression dataset need to be 1 or 0. Is that right?
But when I look at the regression datasets of the libSVM library, I see that the label values are continuous numbers (e.g. 1.0086, 1.0089 ...). Did I miss something?
Note that the libSVM library could be used for regression problem.
Thanks so much !
Contrary to its name, logistic regression is a classification algorithm and it outputs class probability conditioned on the data point. Therefore the training set labels need to be either 0 or 1. For the dataset you mentioned, logistic regression is not a suitable algorithm.
SVM is a classification algorithm and it uses the input labels -1 or 1. It is not a probabilistic algorithm and it doesn't output class probabilities. It also can be adapted to regression.
Are you using a 3rd party library or programming this yourself? Generally the labels are used as ground truth so you can see how effective your approach was.
For example, if your algorithm outputs -1 for a particular instance while the ground-truth label is +1, it did not successfully classify that instance.
Note that "regression" is a general term. To say someone will perform regression analysis doesn't necessarily tell you what algorithm they will be using, nor all of the nature of the data sets. All it really tells you is that you have a set of samples with features which you want to use to predict a single outcome value (a model for conditional probability).
One major difference between logistic regression and linear regression is that the former is usually trained on categorical, binary-labeled sample sets; while the latter is trained on real-labeled (ℝ) sample sets.
Any time your labels are real valued, it means you're probably going to use linear regression or similar, or else convert those real valued labels to categorical labels (e.g. via thresholds or bins) if you want to in fact use logistic regression. There is potentially a big difference in the quality and interpretation of your results though, if you try to convert from one such problem setup to another.
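A minimal sketch of the threshold conversion mentioned above, in plain Python. The first two label values come from the question; the remaining values and the threshold of 1.0 are made up for illustration, and in practice the choice of threshold changes what the resulting classifier means.

```python
def binarize(labels, threshold):
    """Map real-valued labels to 1 if above the threshold, else 0."""
    return [1 if value > threshold else 0 for value in labels]

# Real-valued regression targets (the last three are hypothetical).
real_labels = [1.0086, 1.0089, 0.9971, 1.0123, 0.9950]

# Hypothetical threshold; logistic regression could then be trained on these.
binary_labels = binarize(real_labels, threshold=1.0)
print(binary_labels)
```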
See also Regression Analysis.
