Are NN Classification Outputs Probabilities? - machine-learning

I have just begun to work with neural networks using TensorFlow and I am really new to this. I trained my first model to make two-category classifications and I'm a little curious about the output. Let's say we are predicting whether or not a house price will go up, and we get outputs like
House A: .99
House B: .75
House C: .55
House D: .40
Can I assume that these outputs are probabilities? So it's more likely that House B will go up than House C? Or is it just a classification, i.e. Houses B and C will go up and House D will not? Thanks!

Not exactly. A neural network will output a prediction of whatever you trained it for. So if you trained it to predict probabilities, it will indeed output (predictions of) probabilities. However, if you trained it on observations of whether the price actually went up, say a single output which is 1.0 if the price went up and 0.0 if it didn't, then the output is a regression of that observation given the input. This is not necessarily a probability, but can rather be viewed as the confidence of the model.
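To make that concrete, here is a minimal sketch (Keras-style TensorFlow with made-up data; the shapes and layer sizes are illustrative, not from the question) of a binary classifier trained on 0/1 "price went up" labels with a sigmoid output. The outputs land in [0, 1], but they are only trustworthy probabilities if the model happens to be well calibrated:

import numpy as np
import tensorflow as tf

# Made-up training data: 4 features per house, label 1.0 if the price went up.
X = np.random.rand(200, 4).astype("float32")
y = (X[:, 0] + X[:, 1] > 1.0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # single output in [0, 1]
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=5, verbose=0)

# Scores in [0, 1]: read them as model confidence, not guaranteed calibrated probabilities.
print(model.predict(X[:4], verbose=0).ravel())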

Yes, each number can be thought of as a probability representing how likely it is that a house will go up in price. Just to clarify further: the probability estimate of one house does not affect the probability estimates of the others, since they are treated as separate samples. So B being more likely doesn't make C less likely; B just happens to be more likely to go up.
And the classification depends on your threshold. By default, I believe most classifiers use 0.5 as their threshold, so in this case A, B, and C are classified as going up and D is classified as going down.
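As a quick illustration (plain Python, using the numbers from the question and the default 0.5 threshold mentioned above):

scores = {"A": 0.99, "B": 0.75, "C": 0.55, "D": 0.40}
threshold = 0.5

# Score above the threshold -> classified as "will go up".
for house, score in scores.items():
    label = "up" if score > threshold else "down"
    print(f"House {house}: score={score:.2f} -> {label}")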

Related

How to understand output from a Multiclass Neural Network

I built a flow in Azure ML using the Neural Network Multiclass module (for settings see picture).
Some more info about the Multiclass:
The data flow is simple, with an 80/20 split.
Preparation of the data is done before it goes into Azure. The data looks like this:
My problem comes when I want to make sense of the output and, if possible, transform/calculate the output into probabilities. The output looks like this:
My question: If scored probabilities output for my model is 0.6 and scored labels = 1, how sure is the model of the scored labels 1? And how sure can I be that actual outcome will be a 1?
Can I safely assume that a scored probability of 0.80 = 80% chance of the outcome? Or what types of outcomes should I watch out for?
To start with, you are in a binary classification setting, not a multi-class one (we normally use that term when the number of classes is > 2).
If scored probabilities output for my model is 0.6 and scored labels = 1, how sure is the model of the scored labels 1?
In practice, the scored probabilities are routinely interpreted as the confidence of the model; so, in this example, we would say that your model has 60% confidence that the particular sample belongs to class 1 (and, complementarily, 40% confidence that it belongs to class 0).
And how sure can I be that actual outcome will be a 1?
If you don't have any alternate means of computing such outcomes yourself (e.g. a different model), I cannot see how this question is different from your previous one.
Can I safely assume that a scored probability of 0.80 = 80% chance of the outcome?
This is the kind of statement that would drive a professional statistician mad; nevertheless, the clarifications above regarding confidence should be enough for your purposes (they are indeed enough for ML practitioners).
My answer in Predict classes or class probabilities? should also be helpful.

Multilabel classification neural network, any one label

I am trying to figure out how to build a neural network in which, let's say, I have 3 output labels (A, B, C).
Now my data consists of rows in which 2 of the labels can be 1, e.g. A and B will be 1 and C will be 0. I want to train my neural network such that it predicts A or B. I don't want it to be trained to give a high probability for both A and B (like in multilabel problems); I want only one of them.
The reason is that rows having 1 in both A and B are more like "don't care" rows, where predicting either A or B is correct. So I don't want the neural network to find a minimum where it tries to predict both A and B.
Is it possible to train a neural network like this?
Using a sample weight is the best way I can think of for your application.
Define a weight w for each sample such that w = 0 if A = 1 and B = 1, else w = 1. Now define your loss function as:
w * (CE(A) + CE(B)) + w' * min(CE(A), CE(B)) + CE(C)
where CE(A) is the cross-entropy loss over label A and w' is the complement of w (i.e. 1 - w). The loss function is quite simple to understand: it tries to predict both A and B correctly when they are not both 1; otherwise it only needs to get either A or B right. Remember, which of A and B will be predicted correctly cannot be known in advance, and it may not be consistent across batches. The model will always try to predict class C correctly.
If you are already using your own weights to indicate sample importance, then you should multiply the entire expression above by that weight.
However, I wouldn't be surprised if you got similar (or even better) performance with the classic multi-label loss function. Assuming an equal proportion of each label combination, only in 1/8th of the cases are you allowing your network to predict either A or B; otherwise the network has to predict all three labels correctly. Usually, the simpler loss functions work better.
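For concreteness, here is a minimal sketch of that weighted loss as a custom Keras loss function (this assumes three independent sigmoid outputs ordered A, B, C and 0/1 targets; the function name and shapes are illustrative, not from the original answer):

import tensorflow as tf

def dont_care_loss(y_true, y_pred):
    # y_true and y_pred have shape (batch, 3); columns are labels A, B, C.
    eps = 1e-7
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    # Element-wise binary cross-entropy per label.
    ce = -(y_true * tf.math.log(y_pred) + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    ce_a, ce_b, ce_c = ce[:, 0], ce[:, 1], ce[:, 2]
    # w = 0 when both A and B are 1 (the "don't care" rows), else w = 1.
    w = 1.0 - y_true[:, 0] * y_true[:, 1]
    loss = w * (ce_a + ce_b) + (1.0 - w) * tf.minimum(ce_a, ce_b) + ce_c
    return tf.reduce_mean(loss)

# Usage: model.compile(optimizer="adam", loss=dont_care_loss)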
TL;DR:
a typical network will give you a probability for each class.
how you interpret it is up to you.
if you get equal output values in a single-label scenario, it means both labels are equally likely.
The typical implementation of a multi-class classifier with neural networks uses a softmax layer, with one output per class.
If you want a single-label classifier, you treat the output with the maximum value as the selected label.
The actual value of this output compared to the others is a measure of the confidence in this value.
In case of equality, it means that both outputs are equally likely.
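As a small illustration (plain NumPy with made-up logits): softmax turns the raw outputs into values that sum to 1, argmax picks the single label, and the size of the winning value relative to the others acts as the confidence:

import numpy as np

logits = np.array([2.0, 1.0, 0.1])               # made-up raw network outputs, one per class
probs = np.exp(logits) / np.exp(logits).sum()    # softmax: non-negative, sums to 1

label = int(np.argmax(probs))                    # selected class for a single-label decision
confidence = float(probs[label])                 # how far ahead it is of the other classes

print(probs, label, confidence)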

Likelihood of a sample prediction in machine learning

I know some machine learning algorithms can output the probability of the predicted label for an input sample.
For example, given a sample with three possible labels, a probability tuple (0.2, 0.3, 0.5) can be output by some probabilistic learning algorithms, such as logistic regression or a probability estimation tree. The label with the maximum probability (here 0.5) is then output as the final prediction.
My question is: given a new sample with the predicted probability tuple (0.3, 0.4, 0.3), how can I quantitatively determine the likelihood that the predicted label (here the second label) is correct?
Many thanks
(IMHO this question doesn't belong here; it belongs on the statistics Stack Exchange.)
The answer is freakishly simple: the probability/likelihood -- which are not exactly the same thing -- is 0.4, which is pretty low.
If you want to run a small experiment: build/learn a model, classify a few instances and compare them to the ground truth. In addition, average the probability of the most likely label over those instances. You will see that this average matches the fraction of correctly classified instances,
or:
your model is wrong (e.g. badly calibrated), or
your sample set is too small.
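A minimal sketch of that experiment with scikit-learn (a made-up synthetic three-class dataset and logistic regression, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 3-class data, just for the experiment.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)            # one probability per class
pred = proba.argmax(axis=1)                  # most likely label

accuracy = (pred == y_test).mean()
mean_confidence = proba.max(axis=1).mean()   # average probability of the predicted label

# For a reasonably well-calibrated model these two numbers should be close.
print(f"accuracy={accuracy:.3f}  mean confidence={mean_confidence:.3f}")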

How to get scores along with Machine Learning results?

I am new to machine learning and I am currently working on a classification problem. I am able to train the model and predict on test data sets. I want to know whether there is some way to get scores along with the prediction. By scores, I mean proximity/confidence scores accompanying the prediction. For example, in the standard age-salary-buy problem (based on age and salary, will the customer buy the product or not), I want to know, in addition to the prediction of whether he will buy the product, a score out of 100 for how likely he is to buy it.
Currently, I am using the LibSVM algorithm. Is there some algorithm which provides the above?
Thanks.
What you are looking for is the support of your decision. In other words, many classifiers base their decision for x over the class labels Y on:
cl(x) = arg max_{y \in Y} p(y|x)
where p(y|x) is their internal estimate of "x having label y". Such classifiers include:
neural networks (with sigmoid output)
logistic regression
naive bayes
voting ensembles (such as RF)
...
These methods can easily be converted to your 0-100 scale, as a probability lies on a 0-1 scale.
Others (such as SVM), on the other hand, use a measure proportional to probability but unbounded; here you can get this value (often called the decision function), but you cannot convert it to a 0-100 score (as you do not have a "maximum" value). This is a big drawback, so some modifications have been proposed. In particular, for SVM you have Platt's scaling, which actually fits a logistic regression on top of the SVM so you get a probability estimate. In libSVM you can set -b to get probability estimates,
from libsvm website
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
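In scikit-learn, whose SVC is built on top of libSVM, the same option is exposed as probability=True, which fits the Platt-style scaling internally. A minimal sketch with made-up age/salary data (purely illustrative):

import numpy as np
from sklearn.svm import SVC

# Made-up age/salary data: label 1 = bought, 0 = did not buy.
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=100)
salary = rng.uniform(20000, 100000, size=100)
X = np.column_stack([age, salary])
y = (salary > 55000).astype(int)             # made-up buying rule

clf = SVC(probability=True).fit(X, y)        # Platt scaling fitted internally

proba = clf.predict_proba([[35, 60000]])[0]  # [P(no buy), P(buy)]
score_out_of_100 = 100 * proba[1]            # probability of "buy" on a 0-100 scale

print(proba, round(score_out_of_100, 1))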

classification algorithms that return confidences?

Given a machine learning model built on top of scikit-learn, how can I classify new instances and then keep only those with the highest confidence? How do we define confidence in machine learning, and how do we generate it (if it is not generated automatically by scikit-learn)? What should I change in this approach if I had more than 2 potential classes?
This is what I have done so far:
# load libraries
from sklearn import neighbors
# initialize k-nearest-neighbours classifier
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
# train model on a toy one-feature dataset
knn.fit([[1], [2], [3], [4], [5], [6]], [0, 0, 0, 1, 1, 1])
# predict ::: get class probabilities (predict_proba expects a 2D array of samples)
print(knn.predict_proba([[1.5]]))
print(knn.predict_proba([[37]]))
print(knn.predict_proba([[3.5]]))
Example:
Let's assume that we have created a model using the XYZ machine learning algorithm. Let's also assume that we are trying to classify users based on their gender using information such as location, hobbies, and income. Then, say we have 10 new instances that we want to classify. As normal, upon applying the model, we get 10 outputs, either M (for male) or F (for female). So far so good. However, I would like to somehow measure the precision of these results and then, by using a hard-coded threshold, leave out those with low precision. My question is how to measure this precision. Is probability (as given by the predict_proba() function) a good measure? For example, can I say that if the probability is between 0.9 and 1 then "keep" (otherwise "omit")? Or should I use a more sophisticated method for doing that? As you can see, I lack theoretical background, so any help would be highly appreciated.
While this is more of a stats question, I can give answers relative to scikit-learn.
Confidence in machine learning depends on the method used for the model. For example, with 3-NN (what you used), predict_proba(x) will give you n/3, where n is the number of "class 1" samples among the 3 nearest neighbours of x. You can easily say that if n/3 is smaller than 0.5, there are fewer than 2 "class 1" samples among the nearest neighbours and at least 2 "class 0" samples, so your x is more likely to be from "class 0". (I assume you knew that already.)
For another method like SVM, the confidence can be the distance from the point considered to the hyperplane, while for ensemble models it could be the number of aggregated votes towards a certain class. Scikit-learn's predict_proba() uses whatever is available from the model.
For multiclass problems (imagine Y can be equal to A, B or C), you have two main approaches, which are sometimes handled directly by scikit-learn.
The first approach is OneVsOne. It basically evaluates every new sample with an A-vs-B, an A-vs-C and a B-vs-C model and takes the most probable class (if A wins against both B and C, it is very likely that the right class is A; the annoying cases, e.g. A wins against B, B wins against C and C wins against A, are resolved by taking the class with the highest confidence in its match-ups).
The second approach is OneVsAll (also called OneVsRest), in which you compute A vs (B and C), B vs (A and C), C vs (A and B) and take the class that is the most likely by looking at the confidence scores.
Using scikit-learn's predict() will always give the most likely class based on the confidence scores that predict_proba would give.
I suggest you read this http://scikit-learn.org/stable/modules/multiclass.html very carefully.
EDIT:
Ah, I see what you are trying to do. predict_proba() has a big flaw: suppose you have a big outlier among your new instances (e.g. a female with video games and guns as hobbies, working as a software developer, etc.). If you use, for instance, k-NN and your outlier sits inside the other class's cloud of points, predict_proba() could give a confidence score of 1 for Male while the instance is Female. However, it works well for indecisive cases (e.g. male or female, with video games and guns as hobbies, who works in a nursery), as predict_proba() will give something around ~0.5.
I don't know if something better can be used, though. If you have enough training samples to do cross-validation, I suggest you look towards ROC and PR curves for optimizing your threshold.
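As a rough sketch of the hard-coded-threshold idea from the question (toy data as in the snippet above; the 0.9 cut-off is just the value the question suggests), you can keep only the predictions whose top predict_proba() value clears the threshold:

import numpy as np
from sklearn import neighbors

# Toy training data, as in the snippet above.
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit([[1], [2], [3], [4], [5], [6]], [0, 0, 0, 1, 1, 1])

new_instances = np.array([[1.5], [3.5], [37]])
proba = knn.predict_proba(new_instances)   # shape (n_samples, n_classes)

predicted = proba.argmax(axis=1)           # most likely class per instance
confidence = proba.max(axis=1)             # probability of that class

threshold = 0.9
for x, label, conf in zip(new_instances, predicted, confidence):
    decision = "keep" if conf >= threshold else "omit"
    print(f"x={x[0]}: predicted={label}, confidence={conf:.2f} -> {decision}")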
