Built a flow in Azure ML using a Neural network Multiclass module (for settings see picture).
Some more info about the Multiclass:
The data flow is simple: an 80/20 split.
The data is prepared before it goes into Azure. The data looks like this:
My problem comes when I want to make sense of the output and, if possible, transform/calculate the output into probabilities. The output looks like this:
My question: If scored probabilities output for my model is 0.6 and scored labels = 1, how sure is the model of the scored labels 1? And how sure can I be that actual outcome will be a 1?
Can I safely assume that a scored probabilities of 0.80 = 80% chance of outcome? Or what type of outcomes should I watch out for?
To start with, you are in a binary classification setting, not a multi-class one (we normally use that term when the number of classes is > 2).
If scored probabilities output for my model is 0.6 and scored labels = 1, how sure is the model of the scored labels 1?
In practice, the scored probabilities are routinely interpreted as the confidence of the model; so, in this example, we would say that your model has 60% confidence that the particular sample belongs to class 1 (and, complementarily, 40% confidence that it belongs to class 0).
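For illustration, here is a minimal sketch (plain Python, outside Azure ML) of how a single scored probability for class 1 maps to the two class probabilities and a scored label; the 0.5 cut-off is the usual default and is an assumption here:

```python
# Minimal sketch: relate one scored probability for class 1 to both class
# probabilities and a hard label (0.5 threshold assumed).
scored_probability = 0.6                  # value taken from the scored output
p_class_1 = scored_probability            # model's confidence in class 1
p_class_0 = 1.0 - scored_probability      # complementary confidence in class 0
scored_label = 1 if p_class_1 >= 0.5 else 0

print(f"P(1)={p_class_1:.2f}  P(0)={p_class_0:.2f}  label={scored_label}")
```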
And how sure can I be that actual outcome will be a 1?
If you don't have any alternate means of computing such outcomes yourself (e.g. a different model), I cannot see how this question is different from your previous one.
Can I safely assume that a scored probabilities of 0.80 = 80% chance of outcome?
This is the kind of statement that would drive a professional statistician mad; nevertheless, the clarifications above regarding the confidence should be enough for your purposes (they are enough indeed for ML practitioners).
My answer in Predict classes or class probabilities? should also be helpful.
Related
I’m using the Cleveland Heart Disease dataset from UCI for classification, but I don’t understand the target attribute.
The dataset description says that the values go from 0 to 4 but the attribute description says:
0: < 50% coronary disease
1: > 50% coronary disease
I’d like to know how to interpret this: is this dataset meant to be a multiclass or a binary classification problem? And must I group values 1-4 into a single class (presence of disease)?
If you are working on an imbalanced dataset, you should use a re-sampling technique to get better results. With imbalanced datasets the classifier tends to "predict" the most common class without performing any real analysis of the features.
You should try SMOTE: it synthesizes new elements for the minority class based on those that already exist. It works by randomly picking a point from the minority class, computing its k nearest neighbors, and generating synthetic points along the lines joining them.
I also used k-fold cross-validation along with SMOTE; cross-validation helps ensure that the model picks up genuine patterns from the data.
When measuring the performance of the model, the accuracy metric can be misleading: it shows high accuracy even when there are many false positives. Use metrics such as the F1-score and MCC instead.
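To make the workflow above concrete, here is a minimal sketch assuming scikit-learn and imbalanced-learn are installed; LogisticRegression is just a placeholder classifier, and X, y are assumed to be NumPy arrays:

```python
# Sketch: SMOTE applied only to the training fold inside stratified k-fold CV,
# scored with F1 and MCC rather than accuracy.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, matthews_corrcoef
from imblearn.over_sampling import SMOTE

def cv_with_smote(X, y, n_splits=5, seed=42):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    f1s, mccs = [], []
    for train_idx, test_idx in skf.split(X, y):
        # Oversample the minority class on the training fold only, so the
        # validation fold keeps its original class distribution.
        X_res, y_res = SMOTE(random_state=seed).fit_resample(X[train_idx], y[train_idx])
        clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
        y_pred = clf.predict(X[test_idx])
        f1s.append(f1_score(y[test_idx], y_pred))
        mccs.append(matthews_corrcoef(y[test_idx], y_pred))
    return float(np.mean(f1s)), float(np.mean(mccs))
```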
References:
https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
It basically means that the presence of different heart diseases is denoted by 1, 2, 3, 4, while absence is simply denoted by 0. Now, most of the experiments that have been conducted on this dataset have been based on binary classification, i.e. presence (1, 2, 3, 4) vs. absence (0). One reason for this might be the class imbalance problem (0 accounts for about 160 samples, and 1, 2, 3 and 4 together make up the other half) and the small number of samples (only around 300 in total). So it makes sense to treat this data as a binary classification problem instead of a multi-class one, given the constraints we have.
is this dataset meant to be a multiclass or a binary classification problem?
Without changes, the dataset is ready to be used for a multi-class classification problem.
And must i group values 1-4 to a single class (presence of disease)?
Yes, you must, as long as you are interested in using the dataset for a binary classification problem.
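A minimal sketch of that grouping, assuming pandas and a local copy of processed.cleveland.data (the file name, lack of header, and target column position follow the UCI distribution; adjust if your copy differs):

```python
# Collapse the Cleveland target: 0 stays "absence", 1-4 become "presence".
import pandas as pd

df = pd.read_csv("processed.cleveland.data", header=None)  # hypothetical local path
df = df.rename(columns={13: "num"})          # 14th column holds the 0-4 target
df["target"] = (df["num"] > 0).astype(int)   # 0 -> 0 (absence), 1-4 -> 1 (presence)
```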
I am using two neural networks for two-class text classification. I'm getting 90% accuracy on test data. I'm also using different performance metrics like precision, recall, F-score and the confusion matrix to make sure the model is performing as expected.
In the predictive experiment using the trained model, I'm fetching probabilities for each prediction. The output looks as follows (I can't provide code, as it's implemented in Azure ML Studio):
ex:
class 1 (probability) , class 2 (probability) -> predicted class
class 1 (0.99) , class 2 (0.01) -> class 1
class 1 (0.53) , class 2 (0.47) -> class 1
class 1 (0.2) , class 2(0.8) -> class 2
As per my understanding so far, by looking at the probability we can tell how confident the model is about its prediction. And 90% accuracy means that out of 100 records, 10 predictions could go wrong.
Now my question is: by looking at the probability (confidence), can we tell which bucket the current record falls into, the 90% (correct predictions) or the 10% (wrong predictions)?
What I'm trying to achieve is to give the end user some metric that tells them a given prediction is probably wrong, so they might want to change it to some other class before using the results.
90% accuracy means that out of 100 records, 10 predictions could go wrong.
It is not exactly like that; accuracy is always (although implicitly) linked to the specific test set we used to measure it: so, 90% means that out of 100 records our classifier indeed misclassified 10 (i.e. there is no "could" about it).
What we hope for in machine learning is that the performance of our models on new, unseen data will be comparable to that on our test set (which, as far as the training of our model is concerned, is also unseen). Roughly speaking, provided that the new data come from the same statistical distribution as our training and test sets, this is not an unreasonable expectation.
What I'm trying to achieve is to give the end user some metric that tells them a given prediction is probably wrong, so they might want to change it to some other class before using the results.
Intuitively, you should already know the answer to this: interpreting the returned probabilities as confidence (which, at least in principle, is not an invalid interpretation), their values tell you something about how "certain" your model is about its answers. So what you could do is provide the end users with these probability values; in your example, the case of "Question" with probability 0.97 is indeed qualitatively not the same as the case of "Question" with probability ~0.50...
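A minimal sketch of that idea, in plain Python/NumPy rather than Azure ML Studio: flag every prediction whose top class probability falls below some cut-off (the 0.7 used here is an arbitrary assumption, to be tuned on validation data):

```python
import numpy as np

def flag_uncertain(probabilities, threshold=0.7):
    """probabilities: array-like of shape (n_samples, n_classes)."""
    probs = np.asarray(probabilities)
    confidence = probs.max(axis=1)         # top-class probability per sample
    prediction = probs.argmax(axis=1)      # predicted class index per sample
    needs_review = confidence < threshold  # flag for manual review by the end user
    return prediction, confidence, needs_review

# With the probabilities from the example above, only the 0.53/0.47 row is flagged.
print(flag_uncertain([[0.99, 0.01], [0.53, 0.47], [0.2, 0.8]]))
```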
I'm writing a naive bayes classifier for a class project and I just got it working... sort of. While I do get an error-free output, the winning output label had an output probability of 3.89*10^-85.
Wow.
I have a couple of ideas of what I might be doing wrong. Firstly, I am not normalizing the output percentages for the classes, so all of the percentages are effectively zero. While that would give me numbers that look nice, I don't know if that's the correct thing to do.
My second idea was to reduce the number of features. Our input data is a list of pseudo-images in the form of a very long text file. Currently, our features are just the binary value of every pixel of the image, and with a 28x28 image that's a lot of features. If I instead chopped the image into blocks of size, say, 7x7, how much would that actually improve the output percentages?
tl;dr Here's the general things I'm trying to understand about naive bayes:
1) Do you need to normalize the output percentages from testing each class?
2) How much of an effect does having too many features have on the results?
Thanks in advance for any help you can give me.
It could be normal. The output of a naive Bayes classifier is not meant to be a real probability; what it is meant to do is provide a score that orders the competing classes.
The reason why the probability is so low is that many Naive Bayes implementations are the product of the probabilities of all the observed features of the instance that is being classified. If you are classifying text, each feature may have a low conditional probability for each class (example: lower than 0.01). If you multiply 1000s of feature probabilities, you quickly end up with numbers such as you have reported.
Also, the probabilities returned are not the probabilities of each class given the instance, but an estimate of the probability of observing this set of features given the class. Thus, the more features you have, the less likely it is to observe that exact combination of features. Bayes' theorem is used to change argmax_c P(class_c|features) to argmax_c P(class_c)*P(features|class_c), and P(features|class_c) is then further simplified by making the independence assumption, which turns it into a product of the probabilities of observing each individual feature given the class. These assumptions don't change the argmax (the winning class).
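If you do want to report normalized class probabilities despite that, the usual trick is to accumulate log-probabilities and normalize with log-sum-exp, which avoids the underflow behind numbers like 3.89*10^-85. A minimal sketch (the scores in the example are made up):

```python
import math

def normalize_log_scores(log_scores):
    """log_scores: dict of class -> log P(class) + sum_i log P(feature_i | class)."""
    m = max(log_scores.values())
    # Subtracting the max keeps exp() in a safe range without changing the result.
    exps = {c: math.exp(s - m) for c, s in log_scores.items()}
    total = sum(exps.values())
    return {c: v / total for c, v in exps.items()}

# Both raw scores would underflow if exponentiated directly,
# yet their normalized ratio is perfectly well defined.
print(normalize_log_scores({"A": -195.0, "B": -198.5}))  # A ~ 0.97, B ~ 0.03
```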
If I were you, I would not worry too much about the probability output; focus instead on the accuracy of your classifier and take action to improve the accuracy, not the calculated probabilities.
When using sklearn and getting probabilities with the predict_proba(x) function for a binary classification [1, 0], the function returns the probability that the sample falls into each class, for example [0.8, 0.2].
Is there a community adopted standard way to reduce this down to a single classification confidence which takes all factors into consideration?
Option 1)
Just take the probability for the classification that was predicted (.8 in this example)
Option 2)
Some mathematical formula or function call which takes into consideration all of the different probabilities and returns a single number. Such a confidence approach could take into consideration how close the probabilities of the different classes are and return a lower confidence if there is not much separation between them.
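For illustration, a minimal sketch of what the two options would compute for a single predict_proba row (the numbers are hypothetical, and option 2 is just one possible margin-based choice, not a standard):

```python
import numpy as np

proba = np.array([0.8, 0.2])      # one row of predict_proba output (binary case)

# Option 1: probability of the predicted class.
conf_option_1 = proba.max()

# Option 2: rescale the gap between the two classes into [0, 1];
# 0 when the classes are tied, 1 when the model is certain.
conf_option_2 = abs(proba[0] - proba[1])

print(conf_option_1, conf_option_2)   # 0.8 and 0.6
```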
There's no standard way of doing it. But what you can do is vary the threshold. What I mean is that if you use predict instead, it throws out a binary output classifying your dataset; what it is doing is taking 0.5 as the threshold for prediction: if the probability of class 1 is > 0.5, classify the sample as 1, and as 0 if it is <= 0.5. But this can lead to a bad F1-score in some cases.
So the approach should be to vary the threshold and choose the one which yields the maximum F1-score, or whichever other metric you want to use as a score function. ROC (receiver operating characteristic) curves are meant for exactly this purpose. In fact, the motive behind sklearn giving out the class probabilities is precisely this: to let you choose the best threshold.
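A minimal sketch of that threshold sweep, assuming a fitted classifier clf and a held-out validation set X_val, y_val (the names are placeholders):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(clf, X_val, y_val):
    proba_pos = clf.predict_proba(X_val)[:, 1]   # P(class 1) per sample
    thresholds = np.linspace(0.05, 0.95, 19)     # candidate cut-offs
    scores = [f1_score(y_val, (proba_pos >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]        # chosen threshold and its F1
```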
A very nice example is predicting whether a patient has cancer or not. You have to choose your threshold wisely: if you choose it high, you might get a lot of false negatives; if you choose it low, you might get a lot of false positives. So you just choose the threshold according to your needs (in this case it is better to get more false positives).
Hope it helps!
I am trying to use linear SVMs for multi-class object category recognition. So far what I have understood is that there are mainly two approaches used: one-vs-all (OVA) and one-vs-one (OVO).
But I am having difficulty understanding their implementation. I mean, these are the steps that I think are used:
First, the feature descriptors are prepared from, let's say, SIFT. So I have a 128xN feature vector.
Next, to prepare an SVM classifier model for a particular object category (say, car), I take 50 images of cars as the positive training set and 50 images in total from the remaining categories, taken randomly from each category (is this part correct?). I prepare such models for all the categories (say 5 of them).
Next, when I have an input image, do I need to feed the image into all 5 models and then check their values (+1/-1) for each of these models? I am having difficulty understanding this part.
In the one-vs-all approach, you have to check all 5 models and then take the decision with the highest confidence value. LIBSVM gives probability estimates.
In the one-vs-one approach, you can take the majority vote. For example, you test 1 vs. 2, 1 vs. 3, 1 vs. 4 and 1 vs. 5, and suppose class 1 wins in 3 of those cases. You do the same for the other 4 classes; suppose their vote counts are [0, 1, 1, 2]. Class 1 was therefore chosen the most times, making it the final class. In this case, you could also sum the probability estimates and take the maximum. That works unless the classification in one pair goes extremely wrong. For example, if in 1 vs. 4 the classifier picks 4 (while the true class is 1) with a confidence of 0.7, then, just because of this one decision, the summed probability estimates may shoot up and give the wrong result. This issue can be examined experimentally.
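A minimal sketch of the two aggregation rules just described; `pairwise` is a hypothetical dict mapping a class pair (i, j) to the probability estimates returned by the classifier trained on that pair:

```python
from collections import defaultdict

def ovo_decide(pairwise, n_classes):
    votes = defaultdict(int)        # majority-vote tallies
    prob_sum = defaultdict(float)   # summed probability estimates
    for (i, j), (p_i, p_j) in pairwise.items():
        votes[i if p_i >= p_j else j] += 1
        prob_sum[i] += p_i
        prob_sum[j] += p_j
    by_votes = max(range(n_classes), key=lambda c: votes[c])
    by_probs = max(range(n_classes), key=lambda c: prob_sum[c])
    return by_votes, by_probs       # the two rules can disagree, as noted above
```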
LIBSVM uses one vs. one. You can check the reasoning here. You can also read this paper, where the authors defend the one-vs-all approach and conclude that it is not necessarily worse than one vs. one.
In short, your positive training samples are always the same. In one vs. one you train n-1 classifiers, with the negative samples taken from each of the negative classes separately. In one vs. all you lump all negative samples together and train a single classifier. The problem with the former approach is that you have to consider all of those outcomes to decide on the class. The problem with the latter approach is that lumping all negative object classes together may create a non-homogeneous class that is hard to process and analyse.
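If you end up implementing this with scikit-learn rather than LIBSVM directly, both strategies are available as wrappers; a minimal sketch, where X_train, y_train, X_test stand in for your SIFT-based descriptors and the 5 category labels:

```python
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

ova = OneVsRestClassifier(LinearSVC()).fit(X_train, y_train)  # one vs. all
ovo = OneVsOneClassifier(LinearSVC()).fit(X_train, y_train)   # one vs. one

print(ova.predict(X_test[:5]))
print(ovo.predict(X_test[:5]))
```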