Random Forests - Probability Estimates (+scikit-learn specific) - machine-learning

I am interested in understanding how probability estimates are calculated by random forests, both in general and specifically in Python's scikit-learn library (where probability estimated are returned by the predict_proba function).
Thanks,
Guy

The probabilities returned by a forest are the mean probabilities returned by the trees in the ensemble (docs).
The probabilities returned by a single tree are the normalized class histograms of the leaf a sample lands in.

In addition to what Andreas/Dougal said,
when you train the RF, turn on compute_importances=True.
Then inspect classifier.feature_importances_ to see which features are occurring high-up in the RF's trees.

Related

Is the loss function='Multiclass' in catboost same as log loss if I am doing a multiclassification problem?

I am making a multiclass prediction model using catboost, The final solution should have minimum Logloss error but Logloss is not present in catboost, they have something called 'Multiclass' as the loss function. Are they both same? if not then how can I measure the accuracy of the catboost model in terms of Logloss?
Are they both same? Effectively, Yes...
The catboost documentation describe the calculation of 'MultiClass' loss as what is generally considered as Multinomial/Multiclass Cross Entropy Loss. That is effectively, a Log Softmax applied to the classifier output 'a' to produce values that can be interpreted as probabilities, and subsequently then apply Negative Log Likelihood Loss (NLLLoss), wiki1 & wiki2.
Their documentation describe the calculation of 'LogLoss' also, which again is NLLLoss, however applied to 'p'. Which they describe here to be result of applying the sigmoid fn to the classifier output. Since the NLLLoss is reworked for the binary problem, only a single class probability is calculated, using 'p' and '1-p' for each class. And in this special (binary) case, use of sigmoid and softmax are equivalent.
How can I measure the the catboost model in terms of Logloss?
They describe a method to produce desired metrics on given data.
Be careful not to confuse loss/objective function 'loss_function' with evaluation metric 'eval_metric', however in this instance, the same function can be used for both, as listed in their supported metrics.
Hope this helps!
Log loss is not a loss function but a metric to measure the performance of a classification model where the prediction is a probability value between 0 and 1.
Learn more here.

Sklearn models: decision function vs predict_proba for roc curve

In Sklearn, roc curve requires (y_true, y_scores). Generally, for y_scores, I feed in probabilities as outputted by a classifier's predict_proba function. But in the sklearn example, I see both predict_prob and decision_fucnction are used.
I wonder what is the difference in terms of real life model evaluation?
The functional form of logistic regression is -
f(x)=11+e−(β0+β1x1+⋯+βkxk)
This is what is returned by predict_proba.
The term inside the exponential i.e.
d(x)=β0+β1x1+⋯+βkxk
is what is returned by decision_function. The "hyperplane" referred to in the documentation is
β0+β1x1+⋯+βkxk=0
My Understanding after reading few resources:
Decision Function: Gives the distances from the hyperplane. These are therefore unbounded. This can not be equated to probabilities. For getting probabilities, there are 2 solutions - Platt Scaling & Multi-Attribute Spaces to calibrate outputs using Extreme Value Theory.
Predict Proba: Gives the actual probabilities (0 to 1) however attribute 'probability' has to be set to True while fitting the model itself. It uses Platt scaling which is known to have theoretical issues.
Refer to this in documentation.

Can a probability score from a machine learning model XGBClassifier or Neural Network be treated as a confidence score?

If there are 4 classes and output probability from the model is A=0.30,B=0.40,C=0.20 D=0.10 then can I say that output from the model is class B with 40% confidence? If not then why?
Although a softmax activation will ensure that the outputs satisfy the surface Kolmogorov axioms (probabilities always sum to one, no probability below zero and above one) and the individual values can be seen as a measure of the network's confidence, you would need to calibrate the model (train it not as a classifier but rather as a probability predictor) or use a bayesian network before you could formally claim that the output values are your per-class prediction confidences. (https://arxiv.org/pdf/1706.04599.pdf)

What's the meaning of logistic regression dataset labels?

I've learned the Logistic Regression for some days, and i think the logistic regression's dataset's labels needs to be 1 or 0, is it right ?
But when i lookup the libSVM library's regression dataset, i see the label values are continues number(e.g. 1.0086,1.0089 ...), did i miss something ?
Note that the libSVM library could be used for regression problem.
Thanks so much !
Contrary to its name, logistic regression is a classification algorithm and it outputs class probability conditioned on the data point. Therefore the training set labels need to be either 0 or 1. For the dataset you mentioned, logistic regression is not a suitable algorithm.
SVM is a classification algorithm and it uses the input labels -1 or 1. It is not a probabilistic algorithm and it doesn't output class probabilities. It also can be adapted to regression.
Are you using a 3rd party library or programming this yourself? Generally the labels are used as ground truth so you can see how effective your approach was.
For example if your algo is trying to predict what a particular instance is it might output -1, the ground truth label will be +1 which means you did not successfully classify that particular instance.
Note that "regression" is a general term. To say someone will perform regression analysis doesn't necessarily tell you what algorithm they will be using, nor all of the nature of the data sets. All it really tells you is that you have a set of samples with features which you want to use to predict a single outcome value (a model for conditional probability).
One major difference between logistic regression and linear regression is that the former is usually trained on categorical, binary-labeled sample sets; while the latter is trained on real-labeled (ℝ) sample sets.
Any time your labels are real valued, it means you're probably going to use linear regression or similar, or else convert those real valued labels to categorical labels (e.g. via thresholds or bins) if you want to in fact use logistic regression. There is potentially a big difference in the quality and interpretation of your results though, if you try to convert from one such problem setup to another.
See also Regression Analysis.

What is the OpenCV svm type parameter

The opencv SVM implementation takes a parameter labeled as "SVM type" which must be used in the CVSVMParams structure used in training the SVM. All the explanation I can find is:
// SVM type
enum { C_SVC=100, NU_SVC=101, ONE_CLASS=102, EPS_SVR=103, NU_SVR=104 };
Anyone know what these different values represent?
They are different formulations of SVM. At the heart of SVM is an mathematical optimization problem. This problem can be stated in different ways.
C-SVM uses C as the tradeoff parameter between the size of margin and the number of training points which are misclassified. C is just a number, the useful range depends on the dataset and it can range from very small (like 10-5) to very large (like 10^5), depending on your data.
nu-SVM uses nu instead of C. nu is roughly a percentage of training points which will end up as support vectors. The more support vectors, the wider your margin is, the more training points which will be misclassified. nu ranges from 0.1 to 0.8 - at 0.1 roughly 10% of training points will be support vectors, at 0.8, more like 80%. I say roughly because its just correlated that way - its not exact.
epsilon-SVR and nu-SVR use SVM for regression. Instead of doing binary classification by finding a maximum margin hyperplane, instead the concept is used to find a hypertube which best fits the data in order to use it to predict future models. They differ in the way they are parameterized (like nu-SVM and C-SVM differ).
One-Class SVM is novelty detection. Rather than binary classification, or predicting a value, instead you give the SVM a training set and it attempts to train a model to wrap around that set so that a future instance can be classified as part of the class or outside the class (novel or outlier).
In general:
Classification SVM Type 1 (also known as C-SVM classification)
Classification SVM Type 2 (also known as nu-SVM classification)
Regression SVM Type 1 (also known as epsilon-SVM regression)
Regression SVM Type 2 (also known as nu-SVM regression)
Details can be found on page SVM

Resources