Weka NaiveBayes Classifier giving different (wrong?) means/standard-deviation on numerical values - machine-learning

I have tried to classify using both a NaiveBayes classifier and a NaiveBayesSimple classifier, using the following data:
#attribute a real
#attribute b {yes, no}
#data
1,yes
3,yes
5,yes
2,yes
1,yes
4,no
7,no
5,no
8,no
9,no
When using the NaiveBayesSimple classifier, I get the mean and variance values I expect:
=== Classifier model (full training set) ===
Naive Bayes (simple)
Class yes: P(C) = 0.5
Attribute a
Mean: 2.4 Standard Deviation: 1.67332005
Class no: P(C) = 0.5
Attribute a
Mean: 6.6 Standard Deviation: 2.07364414
However, when using the NaiveBayes classifier, I get different values:
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute yes no
(0.5) (0.5)
=============================
a
mean 2.5143 6.6286
std. dev. 1.3328 1.8286
weight sum 5 5
precision 1.1429 1.1429
I was wondering what the cause of the shifting mean/SD was? I've read through the paper that the NaiveBayes classifier is based on: http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.8.3257 and can't see any reason for it there.
Thanks

The two algorithms differ from each other.
Naive bayes in Weka is defined as follows:
NAME weka.classifiers.bayes.NaiveBayes
SYNOPSIS Class for a Naive Bayes classifier using estimator classes.
Numeric estimator precision values are chosen based on analysis of the
training data. For this reason, the classifier is not an
UpdateableClassifier (which in typical usage are initialized with zero
training instances) -- if you need the UpdateableClassifier
functionality, use the NaiveBayesUpdateable classifier. The
NaiveBayesUpdateable classifier will use a default precision of 0.1
for numeric attributes when buildClassifier is called with zero
training instances.
For more information on Naive Bayes classifiers, see
George H. John, Pat Langley: Estimating Continuous Distributions in
Bayesian Classifiers. In: Eleventh Conference on Uncertainty in
Artificial Intelligence, San Mateo, 338-345, 1995.
OPTIONS debug -- If set to true, classifier may output additional info
to the console.
displayModelInOldFormat -- Use old format for model output. The old
format is better when there are many class values. The new format is
better when there are fewer classes and many attributes.
useKernelEstimator -- Use a kernel estimator for numeric attributes
rather than a normal distribution.
useSupervisedDiscretization -- Use supervised discretization to
convert numeric attributes to nominal ones.
and NaiveBayesSimple is defined as follows:
NAME weka.classifiers.bayes.NaiveBayesSimple
SYNOPSIS Class for building and using a simple Naive Bayes
classifier.Numeric attributes are modelled by a normal distribution.
For more information, see
Richard Duda, Peter Hart (1973). Pattern Classification and Scene
Analysis. Wiley, New York.
OPTIONS debug -- If set to true, classifier may output additional info
to the console.

Related

Calculating Probability of a Classification Model Prediction

I have a classification task. The training data has 50 different labels. The customer wants to differentiate the low probability predictions, meaning that, I have to classify some test data as Unclassified / Other depending on the probability (certainty?) of the model.
When I test my code, the prediction result is a numpy array (I'm using different models, this is one is pre-trained BertTransformer). The prediction array doesn't contain probabilities such as in Keras predict_proba() method. These are numbers generated by prediction method of pretrained BertTransformer model.
[[-1.7862008 -0.7037363 0.09885322 1.5318055 2.1137428 -0.2216074
0.18905772 -0.32575375 1.0748093 -0.06001111 0.01083148 0.47495762
0.27160102 0.13852511 -0.68440574 0.6773654 -2.2712054 -0.2864312
-0.8428862 -2.1132915 -1.0157436 -1.0340284 -0.35126117 -1.0333195
9.149789 -0.21288703 0.11455813 -0.32903734 0.10503325 -0.3004114
-1.3854568 -0.01692022 -0.4388664 -0.42163098 -0.09182278 -0.28269592
-0.33082992 -1.147654 -0.6703184 0.33038092 -0.50087476 1.1643585
0.96983343 1.3400391 1.0692116 -0.7623776 -0.6083422 -0.91371405
0.10002492]]
I'm using numpy.argmax() to identify the correct label. The prediction works just fine. However, since these are not probabilities, I cannot compare the best result with a threshold value.
My question is, how can I define a threshold (say, 0.6), and then compare the probability of the argmax() element of the BertTransformer prediction array so that I can classify the prediction as "Other" if the probability is less than the threshold value?
Edit 1:
We are using 2 different models. One is Keras, and the other is BertTransformer. We have no problem in Keras since it gives the probabilities so I'm skipping Keras model.
The Bert model is pretrained. Here is how it is generated:
def model(self, data):
number_of_categories = len(data['encoded_categories'].unique())
model = BertForSequenceClassification.from_pretrained(
"dbmdz/bert-base-turkish-128k-uncased",
num_labels=number_of_categories,
output_attentions=False,
output_hidden_states=False,
)
# model.cuda()
return model
The output given above is the result of model.predict() method. We compare both models, Bert is slightly ahead, therefore we know that the prediction works just fine. However, we are not sure what those numbers signify or represent.
Here is the Bert documentation.
BertForSequenceClassification returns logits, i.e., the classification scores before normalization. You can normalize the scores by calling F.softmax(output, dim=-1) where torch.nn.functional was imported as F.
With thousands of labels, the normalization can be costly and you do not need it when you are only interested in argmax. This is probably why the models return the raw scores only.

Problem with XGboost Classification & eli5 package

When training an XGBoost classification model, I am using the eli5 function "explain_prediction()" to look at the feature contributions to invidividual predictions.
However, the eli5 package seems to be treating my model as a regressor rather than a classifier.
Below is a snippet of code, showing my model, my prediction, and then the output from the "explain_prediction" method.
As you can see, the output gives a score that is 3.016 rather than a probability between 0 and 1. In this case I would have expected 0.953.
Any help appreciated.
the eli5 package seems to be treating my model as a regressor rather than a classifier.
The boosting score is converted to the probability score by applying the inverse logit function to it.
The probability scale is non-linear, which would make the numeric interpretation of feature contributions more difficult.
.. the output gives a score is 3.016 .. I would have expected 0.953
1 / (1 + exp(-3.016)) = 0.9532917416863492

sklearn - Predict each class's probability

So far I have resourced another post and sklearn documentation
So in general I want to produce the following example:
X = np.matrix([[1,2],[2,3],[3,4],[4,5]])
y = np.array(['A', 'B', 'B', 'C', 'D'])
Xt = np.matrix([[11,22],[22,33],[33,44],[44,55]])
model = model.fit(X, y)
pred = model.predict(Xt)
However for output, I would like to see 3 columns per observation as output from pred:
A | B | C
.5 | .2 | .3
.25 | .25 | .5
...
and a different probability for each class showing up in my prediction.
I believe that the best approach would be Multilabel classification from the second link I provided above. Additionally, I think it might be a good idea to hop into one of the multi-label or multi-output models listed below:
Support multilabel:
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.neighbors.KNeighborsClassifier
sklearn.neural_network.MLPClassifier
sklearn.neighbors.RadiusNeighborsClassifier
sklearn.ensemble.RandomForestClassifier
sklearn.linear_model.RidgeClassifierCV
Support multiclass-multioutput:
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.neighbors.KNeighborsClassifier
sklearn.neighbors.RadiusNeighborsClassifier
sklearn.ensemble.RandomForestClassifier
However, I am looking for someone who is has more confidence and experience at doing this the right way. All feedback is appreciated.
-bmc
From what I understand you want to obtain probabilities for each of the potential classes for multi-class classifier.
In Scikit-Learn it can be done by generic function predict_proba. It is implemented for most of the classifiers in scikit-learn. You basically call:
clf.predict_proba(X)
Where clf is the trained classifier.
As output you will get a decimal array of probabilities for each class for each input value.
One word of caution - not all classifiers naturally evaluate class probabilities. For instance, SVM doesn't do that. You still can obtain the class probabilities though, but to do that upon constructing such classifiers you need to instruct it to perform probability estimation. For SVM it would look like:
SVC(Probability=True)
After you fit it you will be able to use predict_proba as before.
I need to warn you that if classifier doesn't naturally evaluate probabilities that means that the probabilities will be evaluated using rather expansive computational methods which may significantly increase training time. So I advice you to use classifiers which naturally evaluate class probabilities (neural networks with softmax output, logistic regression, gradient boosting etc)
Try to use calibrated model:
# define model
model = SVC()
# define and fit calibration model
calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
calibrated.fit(trainX, trainy)
# predict probabilities
print(calibrated.predict_proba(testX)[:, 1])

How to get jlibsvm prediction probability in multi-class classification

I am new to SVM. I am using jlibsvm for a multi-class classification problem. Basically, I am doing a sentence classification problem. There are 3 Classes. What I understood is I am doing One-against-all classification. I have a comparatively small train set. A total of 75 sentences, In which 25 sentences belongs to each class.
I am making 3 SVMs (so 3 different models), where, while training, in SVM_A, sentences belong to CLASS A will have a true label, i.e., 1 and other sentences will have a -1 label. Correspondingly done for SVM_B, and SVM_C.
While testing, to get the true label of a sentence, I am giving the sentence to 3 models and I am taking the prediction probability returned by these 3 models. Which one returns the highest will be the class the sentence belong to.
This is how I am doing. But I am getting the same prediction probability for every sentence in the test set for all models.
A predicted:0.012820514
B predicted:0.012820514
C predicted:0.012820514
These values repeat for all sentences in the training set.
The following is how I set parameters for training:
C_SVC svm = new C_SVC();
MutableBinaryClassificationProblemImpl problem;
ImmutableSvmParameterGrid.Builder builder = ImmutableSvmParameterGrid.builder();
// create training parameters ------------
HashSet<Float> cSet;
HashSet<LinearKernel> kernelSet;
cSet = new HashSet<Float>();
cSet.add(1.0f);
kernelSet = new HashSet<LinearKernel>();
kernelSet.add(new LinearKernel());
// configure finetuning parameters
builder.eps = 0.001f; // epsilon
builder.Cset = cSet; // C values used
builder.kernelSet = kernelSet; //Kernel used
builder.probability=true; // To get the prediction probability
ImmutableSvmParameter params = builder.build();
What am I doing wrong?
Is there any other better way to do multi-class classification other than this?
You are getting the same output, because you generate the same model three times.
The reason for this is, that jlibsvm is able to perform multiclass classification out of the box based on the provided data (LIBSVM itself supports this too). If it detects, that more than two class labes are provided in the given data, it automatically performs multiclass classification. So there is no need for a manually 1vsN approach. Just supply the data with class-labels for each category.
However, jlibsvm is still in beta and relies on a rather old version of LIBSVM (2.88). A lot has changed. For a more intiuitive Java binding (in comparison to the default LIBSVM version), you can take a look at zlibsvm, which is available via Maven Central and based on the latest LIBSVM version.

Predicting probabilities

I have time series data consisting of a vector
v=(x_1,…, x_n)
of binary categorical variables and the probabilities for four outcomes
p_1, p_2, p_3, p_4.
Given a new vector of categorical variables I want to predict the probabilities
p_1,…,p_4
The probabilities are very unbalanced with
p_1>.99 and p_2, p_3, p_4 < .01.
For example
v_1= (1,0,0,0,1,0,0,0) , p_1=.99, p_2=.005, p_3=.0035, p_4= .0015
v_2=(0,0,1,0,0,0,0,1), p_1=.99, p_2=.006, p_3=.0035, p_4= .0005
v_3=(0,1,0,0,1,1,1,0), p_1=.99, p_2=.005, p_3=.003, p_4= .002
v_4=(0,0,1,0,1,0,0,1), p_1=.99, p_2=.0075, p_3=.002, p_4= .0005
Given a new vector
v_5= (0,0,1,0,1,1,0,0)
I want to predict
p_1, p_2, p_3, p_4.
I should also note that the new vector could be identical to one of the input vectors, i.e.,
v_5=(0,0,1,0,1,0,0,1)= v_4.
My initial approach is to turn this into 4 regression problems.
The first would predict p_1, the second would predict p_2, the third would predict p_3, and the fourth would predict p_4. The problem with this is that I need
p_1+p_2+p_3+p_4=1
I’m not classifying, but should I also be worried about the unbalanced probabilities. Any ideas would be welcome.
Your suggestion of considering this as a multiple problem + final normalization, has some sense, but it's known to be problematic in many cases (see, e.g., the problem of masking).
What you're describing here is multiclass (soft) classification, and there are many many known techniques for doing so. You didn't specify which language/tool/library you're using, or if you're planning on rolling your own (which only makes sense for didactic purposes). I'd suggest starting with Linear Discriminant Analysis which is very simple to understand and implement, and - despite its strong assumptions - is known to often work well in practice (see the classical book by Hastie & Tibshirani).
Irrespective of the underlying algorithm you use for soft binary classification (e.g., LDA or not), It is not very difficult to transform aggregate input into labeled input.
Consider for example the instance
v_1= (1,0,0,0,1,0,0,0) , p_1=.99, p_2=.005, p_3=.0035, p_4= .0015
If your classifier supports instance weights, feed it 4 instances, labeled 1, 2, ..., with weights given by p_1, p_2, ..., respectively.
If it does not support instance weights, simply simulate what the law of large numbers says would happen: generate some large n instance from this input; for each such new input, choose a label randomly proportionally to its probability.

Resources