I know some machine learning algorithms can output the probability of predicted labels of an input sample.
For example, given a sample with three possible labels, a probabilistic learning algorithm such as logistic regression or a probability estimation tree can output a probability tuple like (0.2, 0.3, 0.5). The label with the maximum probability (here 0.5) is then output as the final prediction.
My question is: given a new sample with the predicted probability tuple (0.3, 0.4, 0.3), how can I quantitatively determine the likelihood that the predicted label (here the second label) is correct?
Many thanks
(IMHO this question doesn't belong here. It belongs on the stats Stack Exchange.)
The answer is freakingly simple: the probability/likelihood -- which is not exactly the same -- is 0.4, which is pretty low.
If you want, you can run a small experiment: build/learn a model, classify a few instances, and compare the predictions to the ground truth. In addition, sum up the probability of the most likely label for each instance. You will see that this sum, divided by the number of instances, matches the fraction of correctly classified instances
or:
your model is wrong
your sample set is too small
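The experiment described above can be sketched without any real model. The snippet below is only an illustration under a strong assumption: the "model" is perfectly calibrated by construction, because the true label is drawn from the very probabilities the model reports. The function name and the random probability tuples are made up for this sketch.

```python
import random

def simulate_calibrated_model(n_samples, n_labels=3, seed=0):
    """Simulate a perfectly calibrated classifier (an assumption of this sketch)."""
    rng = random.Random(seed)
    sum_top_prob = 0.0
    n_correct = 0
    for _ in range(n_samples):
        # Random probability tuple, e.g. something like (0.2, 0.3, 0.5)
        raw = [rng.random() for _ in range(n_labels)]
        total = sum(raw)
        probs = [r / total for r in raw]
        # Ground truth is drawn from the same probabilities the model reports
        true_label = rng.choices(range(n_labels), weights=probs)[0]
        predicted = max(range(n_labels), key=lambda k: probs[k])
        sum_top_prob += probs[predicted]
        n_correct += int(predicted == true_label)
    # Mean probability of the predicted label, and fraction classified correctly
    return sum_top_prob / n_samples, n_correct / n_samples

mean_confidence, accuracy = simulate_calibrated_model(20000)
print(f"mean top probability: {mean_confidence:.3f}, accuracy: {accuracy:.3f}")
```

With a real model, a large gap between the two numbers would point to one of the two problems listed above: a miscalibrated model or too few samples.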
I'm working on a binary classification problem. I used the logistic regression and support vector machine models from sklearn. The two models were fit on the same imbalanced training data, with class weights adjusted, and they achieved comparable performance. When I used these two trained models to predict on a new dataset, the LR and SVM models predicted a similar number of instances as positive, and the predicted positives overlap heavily.
However, when I looked at the probability scores of the instances classified as positive, the distribution for LR ranges from 0.5 to 1, while for the SVM it starts from around 0.1. I called model.predict(prediction_data) to find the instances predicted as each class and model.predict_proba(prediction_data) to get the probability scores of being classified as 0 (neg) and 1 (pos), and I assumed both use a default threshold of 0.5.
There is no error in my code and I have no idea why the SVM predicted instances with probability scores < 0.5 as positives as well. Any thoughts on how to interpret this situation?
That's a known fact in sklearn when it comes to binary classification problems with SVC(), which is reported, for instance, in these github issues (here and here). Moreover, it is also reported in the User guide, where it is said that:
In addition, the probability estimates may be inconsistent with the scores:
the “argmax” of the scores may not be the argmax of the probabilities; in binary classification, a sample may be labeled by predict as belonging to the positive class even if the output of predict_proba is less than 0.5; and similarly, it could be labeled as negative even if the output of predict_proba is more than 0.5.
or directly within libsvm faq, where it is said that
Let's just consider two-class classification here. After probability information is obtained in training, we do not have prob > = 0.5 if and only if decision value >= 0.
All in all, the point is that:
on one side, predictions are based on decision_function values: if the decision value computed on a new instance is positive, the predicted class is the positive class, and vice versa.
on the other side, as stated within one of the github issues, np.argmax(self.predict_proba(X), axis=1) != self.predict(X) which is where the inconsistency comes from. In other terms, in order to always have consistency on binary classification problems you would need a classifier whose predictions are based on the output of predict_proba() (which is btw what you'll get when considering calibrators), like so:
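def predict(self, X):
    # Column index of the most probable class for each row of X;
    # map through self.classes_ if the original labels are needed
    y_proba = self.predict_proba(X)
    return np.argmax(y_proba, axis=1)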
I'd also suggest this post on the topic.
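To see why the two outputs can disagree, here is a minimal sketch of the sigmoid used by Platt scaling, P(y=1 | f) = 1 / (1 + exp(A*f + B)), where f is the decision value. The parameters A and B below are hypothetical, chosen only for illustration; in practice they are fitted during training. Whenever B is not zero, the probability crosses 0.5 at f = -B/A rather than at f = 0.

```python
import math

def platt_probability(decision_value, a, b):
    """Sigmoid used by Platt scaling: P(y=1 | f) = 1 / (1 + exp(a*f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

# Hypothetical fitted parameters (assumption); real values come from training.
A, B = -2.0, 1.0

decision_value = 0.2                       # positive decision value
predicted_positive = decision_value > 0    # what a sign-based predict() would say
probability = platt_probability(decision_value, A, B)

print(predicted_positive, round(probability, 3))
```

With these made-up parameters the instance is labeled positive by the sign of the decision value while its estimated probability stays below 0.5, which is the situation described in the question.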
In the tutorial https://www.tensorflow.org/tutorials/keras/classification, at https://www.tensorflow.org/tutorials/keras/classification#make_predictions, it says:
A prediction is an array of 10 numbers. They represent the model's
"confidence" that the image corresponds to each of the 10 different
articles of clothing. You can see which label has the highest
confidence value:
If I want to estimate the probability of each class (the different articles of clothing) instead of the confidence, how do I do that?
As @desertnaut mentioned in a comment above, the confidences in the code
probability_model = tf.keras.Sequential([model,
                                         tf.keras.layers.Softmax()])
predictions = probability_model.predict(test_images)
given by the variable predictions are indeed probabilities.
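For intuition, here is a plain-Python sketch of what a softmax computes. It is not TensorFlow's implementation, and the logits are made-up numbers standing in for the raw outputs of the model's last layer.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    shift = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                       # hypothetical raw model outputs
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=lambda k: probs[k])

print(probs, predicted_class)
```

Each entry is between 0 and 1 and the entries sum to 1, which is why the tutorial's "confidence" values can be read as class probabilities.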
Imagine we have a classification problem on a dataset where the examples are only positive (or, equivalently, only negative). For instance, a problem where the winning class is specified by position (e.g. think of a tennis dataset where the first player is always the winner). How can we create negative examples in order to train a supervised learning algorithm on this dataset? One idea would be to generate negative examples by exchanging the positions of the features that are tied to each of the classes. Do you think this will give an unbiased dataset? Could we create negative duplicates of our original dataset and train a supervised learning algorithm on this doubled dataset?
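The swapping idea from the question could look roughly like the sketch below. The feature values and the function name are invented for illustration, and the sketch does not settle whether the resulting dataset is unbiased.

```python
def add_swapped_negatives(matches):
    """matches: list of (first_player_features, second_player_features),
    where the first player is always the winner."""
    dataset = []
    for first, second in matches:
        dataset.append((first + second, 1))    # original order: first player wins
        dataset.append((second + first, 0))    # swapped order: first player loses
    return dataset

# Hypothetical per-player features, e.g. [win rate, ranking]
matches = [([0.8, 12], [0.6, 30]), ([0.7, 5], [0.9, 2])]
dataset = add_swapped_negatives(matches)

for features, label in dataset:
    print(features, label)
```

The doubled dataset is balanced by construction. One thing to watch for: each original row and its swapped copy carry the same information, so both should land in the same train or test split to avoid leakage.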
Can someone give me a clear and simple definition of Maximum entropy classification? It would be very helpful if someone can provide a clear analogy, as I am struggling to understand.
"Maximum Entropy" is synonymous with "Least Informative". You wouldn't want a classifier that was least informative. It is in reference to how the priors are established. Frankly, "Maximum Entropy Classification" is an example of using buzz words.
For an example of an uninformative prior, consider a six-sided object. The probability that any given face will appear if the object is tossed is 1/6. This would be your starting prior. It's the least informative. You really wouldn't want to start with anything else or you will bias later calculations. Of course, if you have knowledge that one side will appear more often, you should incorporate that into your priors.
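As a small check of the six-sided example, the sketch below computes the Shannon entropy of the uniform prior and of a biased one. The biased distribution is an arbitrary example chosen for this illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 6] * 6                         # the least informative prior
loaded = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]       # hypothetical biased object

print(f"uniform: {entropy(uniform):.3f} bits, loaded: {entropy(loaded):.3f} bits")
```

The uniform prior attains the maximum possible entropy for six outcomes, log2(6) bits, and any prior that favors one face has lower entropy.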
The Bayes formula is P(H|E) = P(E|H)P(H)/P(E)
where P(H) is the prior for the hypothesis and P(E) is the sum of the numerators P(E|H)P(H) over all possible hypotheses.
For text classification where a missing word is to be inserted, E is some given document and H is the given word. IOW, the hypothesis is that H is the word which should be selected and P(H) is the weight given to the word.
Maximum Entropy Text classification means: start with least informative weights (priors) and optimize to find weights that maximize the likelihood of the data, P(E). Essentially, it's the EM algorithm.
A simple Naive Bayes classifier would assume the prior weights to be proportional to the number of times the word appears in the document. However, this ignores correlations between words.
The so-called MaxEnt classifier takes the correlations into account.
I can't think of a simple example to illustrate this but I can think of some correlations. For example, "the missing" in English should give higher weights to nouns but a Naive Bayes classifier might give equal weight to a verb if its relative frequency were the same as a given noun. A MaxEnt classifier considering missing would give more weight to nouns because they would be more likely in context.
I may also advise "Hidden Markov and Maximum Entropy Models" from the Department of Computer Science, Johns Hopkins. Specifically, take a look at chapter 6.6. This book explains Maximum Entropy using the example of PoS tagging and compares the MaxEnt application in MEMMs with the Hidden Markov Model. There is also an explanation of what exactly MaxEnt is, with the math behind it.
Taken from "Understanding Deep Learning Generalization by Maximum Entropy" (Zheng et al., 2017):
(Original Maximum Entropy Model) Supposing the dataset has input $X$ and label $Y$, the task is to find a good prediction of $Y$ using $X$. The prediction $\hat{Y}$ needs to maximize the conditional entropy $H(\hat{Y} \mid X)$ while preserving the same distribution with data $(X, Y)$. This is formulated as:
$$\min \; -H(\hat{Y} \mid X) \qquad (1)$$
$$\text{s.t.} \quad P(X, Y) = P(X, \hat{Y}), \qquad \sum_{\hat{Y}} P(\hat{Y} \mid X) = 1$$
Berger et al., 1996 solves this with Lagrange multipliers $\omega_i$ as an exponential form:
$$P_\omega(\hat{Y} = y \mid X = x) = \frac{1}{Z_\omega(x)} \exp\Big(\sum_i \omega_i f_i(x, y)\Big)$$
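The exponential form above can be evaluated directly once feature functions and weights are given. The sketch below uses two made-up feature functions and weights for a toy PoS-style example. Nothing here is taken from the cited papers beyond the formula itself, and no training is performed.

```python
import math

def maxent_probs(x, labels, features, weights):
    """P(y|x) = exp(sum_i w_i * f_i(x, y)) / Z(x), with Z(x) summing over all labels."""
    scores = {y: sum(w * f(x, y) for w, f in zip(weights, features)) for y in labels}
    z = sum(math.exp(s) for s in scores.values())       # normalizer Z(x)
    return {y: math.exp(s) / z for y, s in scores.items()}

# Hypothetical binary feature functions f_i(x, y)
features = [
    lambda x, y: 1.0 if x["prev"] == "the" and y == "noun" else 0.0,
    lambda x, y: 1.0 if x["word"].endswith("ing") and y == "verb" else 0.0,
]
weights = [1.5, 0.8]                                    # hypothetical multipliers w_i

probs = maxent_probs({"prev": "the", "word": "missing"}, ["noun", "verb"], features, weights)
print(probs)
```

With these invented weights, the context feature ("the" before the word) outweighs the suffix feature, so the noun reading gets the higher probability.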
I am new to machine learning and I am currently working on a classification problem. I am able to train the model and predict on test data sets. I want to know whether there is some way to get scores along with the predictions. By scores, I mean proximity scores that accompany each prediction. For example, in the standard age-salary-buy classification problem (whether the customer will buy the product or not, based on age and salary), I want to know a score out of 100 that he will buy the product, in addition to the prediction of whether he will buy it or not.
Currently, I am using LibSVM. Is there an algorithm that provides the above data?
Thanks.
What you are looking for is the support of your decision. In other words, many classifiers base their decision about the class of x over the labels Y on:
cl(x) = arg max_{y \in Y} p(y|x)
where p(y|x) is their internal estimation of "x having label y". And such classifiers include:
neural networks (with sigmoid output)
logistic regression
naive bayes
voting ensembles (such as RF)
...
These methods can be easily converted to your 0-100 scale, as probability is in 0-1 scale.
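As a minimal sketch of that conversion, assuming the classifier already returns a probability per label (the numbers and label names below are invented), the arg max rule and the 0-100 score look like this:

```python
def predict_with_score(class_probs):
    """Return the most probable label and its probability on a 0-100 scale."""
    label = max(class_probs, key=class_probs.get)       # arg max over labels
    return label, round(100 * class_probs[label], 1)

# Hypothetical output of a probabilistic classifier for one customer
class_probs = {"buy": 0.73, "not_buy": 0.27}
label, score = predict_with_score(class_probs)

print(label, score)
```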
Others, on the other hand, use a measure that reflects confidence but is unbounded (such as SVM). Here you can get this value (often called the decision function), but you cannot convert it directly to a 0-100 score, as there is no "maximum" value. This is a big drawback, so some modifications were proposed. In particular, for SVM you have Platt scaling, which fits a logistic regression on top of the SVM outputs so you get a probability estimate. In libSVM you can set -b to get probability estimates.
From the libsvm website:
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
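To illustrate the idea behind Platt scaling, the sketch below fits the two sigmoid parameters on a handful of made-up decision values using plain gradient descent. This is an assumption-laden toy: libsvm's own fitting procedure is more elaborate, and the data, learning rate, and function names here are invented.

```python
import math

def fit_platt(decision_values, labels, lr=0.1, epochs=2000):
    """Fit P(y=1|f) = 1 / (1 + exp(a*f + b)) by gradient descent on the log loss."""
    a, b = 0.0, 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for f, y in zip(decision_values, labels):
            p = 1.0 / (1.0 + math.exp(a * f + b))
            # Gradients of the log loss with respect to a and b
            grad_a += (y - p) * f
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical unbounded SVM decision values with their true labels
decision_values = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(decision_values, labels)

def score_out_of_100(f):
    """Map a decision value to a 0-100 score with the fitted sigmoid."""
    return 100.0 / (1.0 + math.exp(a * f + b))

scores = [round(score_out_of_100(f), 1) for f in decision_values]
print(scores)
```

The fitted sigmoid squeezes the unbounded decision values into the 0-100 range while preserving their order, which is the kind of score asked for in the question.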