WEKA classification results similar but different performances - machine-learning

First I read this: How to interpret weka classification?
but it didn't help me.
Then, to set up the background: I am learning through Kaggle competitions, where models are evaluated by ROC area.
I built two models, and the data about them is reported like this:
Correctly Classified Instances 10309 98.1249 %
Incorrectly Classified Instances 197 1.8751 %
Kappa statistic 0.7807
K&B Relative Info Score 278520.5065 %
K&B Information Score 827.3574 bits 0.0788 bits/instance
Class complexity | order 0 3117.1189 bits 0.2967 bits/instance
Class complexity | scheme 948.6802 bits 0.0903 bits/instance
Complexity improvement (Sf) 2168.4387 bits 0.2064 bits/instance
Mean absolute error 0.0465
Root mean squared error 0.1283
Relative absolute error 46.7589 %
Root relative squared error 57.5625 %
Total Number of Instances 10506
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.998 0.327 0.982 0.998 0.99 0.992 0
0.673 0.002 0.956 0.673 0.79 0.992 1
Weighted Avg. 0.981 0.31 0.981 0.981 0.98 0.992
Apart from the K&B Relative Info Score, Relative absolute error, and Root relative squared error, which are respectively lower, higher, and higher in the model that is best as assessed by ROC curves,
all the figures are the same.
I built a third model with similar behaviour (TP rate and so on), but again the K&B Relative Info Score, Relative absolute error, and Root relative squared error varied. That still did not let me predict whether this third model was better than the first two (the variations went in the same direction as in the best model, so in theory it should have been superior, but it wasn't).
What should I do to predict whether a model will perform well, given such details about it?
Thanks in advance.

Related

How to interpret the results of logistic regression in Weka

Hello everyone, I'm new to this area, and I wondered if anyone could help me understand the results of a logistic regression.
I would need to understand if the independent variables can be used to make a good classification.
=== Run information ===
Scheme: weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -num-decimal-places 4
Relation: Train
Instances: 14185
Attributes: 5
ATTR_1
ATTR_2
ATTR_3
ATTR_4
DEPENDENT_VAR
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
Class
Variable 0
====================
ATTR_1 0.0022
ATTR_2 0.0022
ATTR_3 0.0034
ATTR_4 -0.0021
Intercept 0.9156
Odds Ratios...
Class
Variable 0
====================
ATTR_1 1.0022
ATTR_2 1.0022
ATTR_3 1.0034
ATTR_4 0.9979
Time taken to build model: 0.13 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0.07 seconds
=== Summary ===
Correctly Classified Instances 51240 72.2453 %
Incorrectly Classified Instances 19685 27.7547 %
Kappa statistic -0.0001
Mean absolute error 0.3992
Root mean squared error 0.4467
Relative absolute error 99.5581 %
Root relative squared error 99.7727 %
Total Number of Instances 70925
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1,000 1,000 0,723 1,000 0,839 -0,005 0,545 0,759 0
0,000 0,000 0,000 0,000 0,000 -0,005 0,545 0,305 1
Weighted Avg. 0,722 0,723 0,522 0,722 0,606 -0,005 0,545 0,633
=== Confusion Matrix ===
a b <-- classified as
51240 5 | a = 0
19680 0 | b = 1
In particular, I am interested in understanding the values of the coefficients and the odds-ratios.
Thanks.
Off the top of my head:
Odds ratios and coefficient values are directly related and can be calculated from each other: the odds ratio is exp(coefficient).
For ATTR_1, exp(0.0022) ≈ 1.0022.
For doing further calculations and fitting/predicting, coefficients are "better". However, coefficients are values that must be plugged into exp(x), so they are somewhat difficult to visualize in your head.
For human understanding, odds ratios are sometimes more convenient: they are easier to interpret and visualize, but you can't do certain calculations directly with them.
Weka does not know what you are more interested in, so it gives you both for convenience.
By the way, Weka does regularized logistic regression
(Logistic Regression with ridge parameter of 1.0E-8), so the coefficients might differ slightly from those that a different software package would give you.
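As an illustration, here is a minimal sketch (not Weka API code) of how coefficients, odds ratios and predicted probabilities relate. The coefficient values are copied from the output above, the input values are made up, and it assumes, as the output header suggests, that the coefficients model the log-odds of class 0:
public class LogisticInterpretation {
    public static void main(String[] args) {
        // coefficients for class 0, as reported by Weka above
        double[] coef = {0.0022, 0.0022, 0.0034, -0.0021};
        double intercept = 0.9156;

        // odds ratio = exp(coefficient); this reproduces Weka's "Odds Ratios" table
        for (double c : coef) {
            System.out.printf("coef = %.4f  ->  odds ratio = %.4f%n", c, Math.exp(c));
        }

        // predicted probability of class 0 for a hypothetical input x,
        // using the usual logistic link: p = 1 / (1 + exp(-(intercept + sum(coef * x))))
        double[] x = {1.0, 2.0, 3.0, 4.0};   // made-up attribute values
        double z = intercept;
        for (int i = 0; i < coef.length; i++) {
            z += coef[i] * x[i];
        }
        double p = 1.0 / (1.0 + Math.exp(-z));
        System.out.printf("P(class = 0 | x) = %.4f%n", p);
    }
}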

Accuracy for prediction

I'm new to machine learning, I'm trying to learn the process, and I have started by playing around with Weka. When I load the data in Weka and run a classification, the software shows values such as these:
Correctly Classified Instances 416 39.6568 %
Incorrectly Classified Instances 633 60.3432 %
Kappa statistic 0.091
Mean absolute error 0.4371
Root mean squared error 0.4663
Relative absolute error 98.4524 %
Root relative squared error 98.9763 %
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 100 %
Total Number of Instances 1049
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.310 0.231 0.377 0.310 0.340 0.084 0.554 0.448 16-18
0.271 0.167 0.460 0.271 0.341 0.123 0.501 0.359 19+
0.599 0.511 0.382 0.599 0.467 0.084 0.570 0.395 All Age
Weighted Avg. 0.397 0.306 0.407 0.397 0.384 0.098 0.541 0.399
Looking at these values, I assume that I have bad data, since only 39.66 % of instances are correctly classified and the error rates are high. But the TP rate and precision are at an acceptable level.
This confuses me. I want to know how I can judge the model based on these numbers. Does it mean my data is badly preprocessed?
You have to build a confusion matrix to get the accuracy and precision. Below is a link that explains it. Hope it helps.
http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html
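The linked page walks through this in detail; as a minimal sketch of the same idea (the counts below are made up, not taken from the question):
public class ConfusionMatrixStats {
    public static void main(String[] args) {
        // confusion matrix for a binary problem (rows = actual, columns = predicted), made-up counts
        long tn = 50, fp = 10;   // actual class 0
        long fn = 5,  tp = 35;   // actual class 1

        double accuracy  = (double) (tp + tn) / (tp + tn + fp + fn);
        double precision = (double) tp / (tp + fp);   // of everything predicted 1, how much really is 1
        double recall    = (double) tp / (tp + fn);   // of everything that really is 1, how much was found

        System.out.printf("accuracy = %.3f, precision = %.3f, recall = %.3f%n",
                accuracy, precision, recall);
    }
}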

Weka Classification

I was trying to build a classification machine-learning model on a data set which has 32 attributes, the last column being the target class. I reduced the number of attributes from 32 to 6, which I felt would be more useful for my classification model.
I tried to run J48 and some incremental classification algorithms.
I expected output consisting of a confusion matrix, correctly and incorrectly classified instances, and a kappa value.
But my result did not give any information on correctly and incorrectly classified instances. It also did not include a confusion matrix or a kappa value. All I received is this:
=== Summary ===
Correlation coefficient 0.9482
Mean absolute error 0.2106
Root mean squared error 0.5673
Relative absolute error 13.4077 %
Root relative squared error 31.9157 %
Total Number of Instances 1461
Can anyone tell me why I did not get the confusion matrix, kappa, and correct/incorrect instance information?
Unfortunately you didn't post your code or say which version of Weka you are using. That said, the summary you show (a correlation coefficient instead of a confusion matrix) is what Weka reports when the class attribute is numeric, so your target column was most likely treated as a regression target rather than a nominal class.
BTW, to calculate the confusion matrix, kappa, etc. you can use the methods of the Evaluation class: http://weka.sourceforge.net/doc.dev/weka/classifiers/Evaluation.html
For example, after you train your model:
import java.util.Random;
import weka.classifiers.Evaluation;
// 'classifier' is any weka.classifiers.Classifier and 'train' is your training Instances
classifier.buildClassifier(train);
Evaluation eval = new Evaluation(train);
// evaluate your model in a 10-fold cross-validation manner
eval.crossValidateModel(classifier, train, 10, new Random(1));
System.out.println(classifier);
// print different stats with
System.out.println(eval.toSummaryString());
System.out.println(eval.toMatrixString());
System.out.println(eval.toClassDetailsString());

Understanding Recall and Precision

I am currently learning information retrieval and I am rather stuck on an example about recall and precision:
A searcher uses a search engine to look for information. There are 10 documents on the first screen of results and 10 on the second.
Assume there are known to be 10 relevant documents in the search engine's index.
So... there are 20 returned results altogether, of which 10 are relevant.
Can anyone help me make sense of this?
Thanks
Recall and precision measure the quality of your result. To understand them let's first define the types of results. A document in your returned list can either be
classified correctly
a true positive (TP): a document which is relevant (positive) that was indeed returned (true)
a true negative (TN): a document which is not relevant (negative) that was indeed NOT returned (true)
misclassified
a false positive (FP): a document which is not relevant (negative) but was returned anyway (false)
a false negative (FN): a document which is relevant (positive) but was NOT returned (false)
the precision is then:
|TP| / (|TP| + |FP|)
i.e. the fraction of retrieved documents which are indeed relevant
the recall is then:
|TP| / (|TP| + |FN|)
i.e. the fraction of relevant documents which are in your result set
So, in your example 10 out of 20 results are relevant. This gives you a precision of 0.5. If there are no more than these 10 relevant documents, you have got a recall of 1.
(When measuring the performance of an information retrieval system, it only makes sense to consider precision and recall together. You can easily get a precision of 100% by returning only a single result you are certain is relevant (no spurious returned instance => no FP), or a recall of 100% by returning every instance (no relevant document is missed => no FN).)
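To make the arithmetic from the example explicit, here is a tiny sketch (the counts are the ones given above):
public class RetrievalExample {
    public static void main(String[] args) {
        int returned = 20;          // 10 documents per screen, 2 screens
        int relevantReturned = 10;  // relevant documents among the results (TP)
        int relevantTotal = 10;     // relevant documents known to be in the index (TP + FN)

        double precision = (double) relevantReturned / returned;       // TP / (TP + FP) = 0.5
        double recall    = (double) relevantReturned / relevantTotal;  // TP / (TP + FN) = 1.0

        System.out.printf("precision = %.2f, recall = %.2f%n", precision, recall);
    }
}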
Well, this is an extension of my answer on recall at https://stackoverflow.com/a/63120204/6907424. First read about precision here and then go read about recall there. Here I am only explaining precision, using the same example:
ExampleNo Ground-truth Model's Prediction
0 Cat Cat
1 Cat Dog
2 Cat Cat
3 Dog Cat
4 Dog Dog
For now I am calculating precision for Cat, so Cat is our positive class and the rest of the classes (here only Dog) are the negative classes. Precision means: what percentage of the positive detections were actually positive? Here the model makes 3 detections of Cat. But are all of them correct? No! Only 2 of them are correct (examples 0 and 2) and one is wrong (example 3). So the percentage of correct detections is 2 out of 3, which is (2 / 3) * 100 % = 66.67%.
Now coming to the formulation, here:
TP (True positive): Predicting something positive when it is actually positive. If cat is our positive example then predicting something a cat when it is actually a cat.
FP (False positive): Predicting something as positive but which is not actually positive, i.e, saying something positive "falsely".
Now the number of correct detection of a certain class is the number of TP of that class. But apart from them the model also predicted some other examples as positives but which were not actually positives and so these are the false positives (FP). So irrespective of correct or wrong the total number of positive class detected by the model is TP + FP. So the percentage of correct detection of positive class among all detection of that class will be: TP / (TP + FP) which is the precision of the detection of that class.
Like recall we can also generalize this formula for any number of classes. Just take one class at a time and consider it as the positive class and the rest of the classes as negative classes and continue the same process for all of the classes to calculate precision for each of them.
You can calculate precision and recall in another way (basically the other way of thinking the same formulae). Say for Cat, first count how many examples at the same time have Cat in both Ground-truth and Model's prediction (i.e, count the number of TP). Therefore if you are calculating precision then divide this count by the number of "Cat"s in the Model's Prediction. Otherwise for recall divide by the number of "Cat"s in the Ground-truth. This works as the same as the formulae of precision and recall. If you can't understand why then you should think for a while and review what actually TP, FP, TN and FN means.
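Here is a minimal sketch of that counting procedure for the Cat class, using the five examples from the table above (the class names and array layout are just for illustration):
public class PerClassPrecisionRecall {
    public static void main(String[] args) {
        // ground truth and predictions from the table above
        String[] truth = {"Cat", "Cat", "Cat", "Dog", "Dog"};
        String[] pred  = {"Cat", "Dog", "Cat", "Cat", "Dog"};

        String positive = "Cat";
        int tp = 0, predictedPositive = 0, actualPositive = 0;
        for (int i = 0; i < truth.length; i++) {
            if (pred[i].equals(positive))  predictedPositive++;   // TP + FP
            if (truth[i].equals(positive)) actualPositive++;      // TP + FN
            if (pred[i].equals(positive) && truth[i].equals(positive)) tp++;
        }

        System.out.printf("precision(%s) = %.4f%n", positive, (double) tp / predictedPositive); // 2/3
        System.out.printf("recall(%s)    = %.4f%n", positive, (double) tp / actualPositive);    // 2/3
    }
}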
If you have difficulty understanding precision and recall, consider reading this
https://medium.com/seek-product-management/8-out-of-10-brown-cats-6e39a22b65dc

The best way to calculate the best threshold with P. Viola, M. Jones Framework

I'm trying to implement the P. Viola and M. Jones detection framework in C++ (at the beginning, simply the sequential classifier, not the cascaded version). I think I have designed all the required classes and modules (e.g. integral images, Haar features) except one, the most important: the core AdaBoost algorithm.
I have read the P. Viola and M. Jones original paper and many other publications. Unfortunately I still don't understand how I should find the best threshold for one weak classifier. I have found only small references to "weighted median" and "Gaussian distribution" algorithms and many pieces of mathematical formulas...
I have tried to use the OpenCV Train Cascade module sources as a template, but they are so comprehensive that reverse engineering the code is very time-consuming. I also wrote my own simple code to understand the idea of adaptive boosting.
The question is: could you explain to me the best way to calculate the best threshold for one weak classifier?
Below I present the AdaBoost pseudocode, rewritten from a sample found via Google, but I'm not convinced it is the correct approach. Calculating one weak classifier is very slow (a few hours), and I especially have doubts about the method of calculating the best threshold.
(1) AdaBoost::FindNewWeakClassifier
(2) AdaBoost::CalculateFeatures
(3) AdaBoost::FindBestThreshold
(4) AdaBoost::FindFeatureError
(5) AdaBoost::NormalizeWeights
(6) AdaBoost::FindLowestError
(7) AdaBoost::ClassifyExamples
(8) AdaBoost::UpdateWeights
DESCRIPTION (1)
-Generates all possible arrangements of features in the detection window and puts them into a vector
DO IN LOOP
-Runs main calculating function (2)
END
DESCRIPTION(2)
-Normalizes weights (5)
DO FOR EACH HAAR FEATURE
-Applies the next feature from the list to all integral images in turn
-Finds the best threshold for each feature (3)
-Finds the error of the best threshold for each feature in the current iteration (4)
-Saves the error of each feature's best threshold in the current iteration in an array
-Saves each feature's best threshold in the current iteration in an array
-Saves each feature's best threshold sign in the current iteration in an array
END LOOP
-Finds the classifier index with the lowest error selected by the above loop (6)
-Gets the error value of the best feature
-Calculates the value of the best feature on all integral images (7)
-Updates weights (8)
-Adds new, weak classifier to vector
DESCRIPTION (3)
-Calculates an error for each feature threshold on the positive integral images, separately for the "+" and "-" signs (4)
-Returns threshold and sign of the feature with the lowest error
DESCRIPTION(4)
- Returns the feature error over all samples, by evaluating the inequality f(x) * sign < sign * threshold
DESCRIPTION (5)
-Ensures that samples weights are probability distribution
DESCRIPTION (6)
-Finds the classifier with the lowest error
DESCRIPTION (7)
-Calculates the value of the best feature on all integral images
-Counts the number of false positives and false negatives
DESCRIPTION (8)
-Corrects weights, depending on classification results
Thank you for any help
In the original Viola-Jones paper, section 3.1 Learning Discussion (paragraph 4, to be precise), you will find the procedure for finding the optimal threshold.
I'll sum up the method quickly below.
The optimal threshold for each feature is sample-weight dependent and is therefore recalculated in every iteration of AdaBoost. The best weak classifier's threshold is saved, as mentioned in the pseudocode.
In every round, for each weak classifier, you have to sort the N training samples according to their feature value. Placing a threshold separates this sequence into 2 parts. Each part will have a majority of either positive or negative samples, along with a few samples of the other type.
T+ : total sum of positive sample weights
T- : total sum of negative sample weights
S+ : sum of positive sample weights below the threshold
S- : sum of negative sample weights below the threshold
Error for this particular threshold is -
e = MIN((S+) + (T-) - (S-), (S-) + (T+) - (S+))
Why the minimum? Here's an example.
If the samples and the threshold are like this:
+ + + + + - - | + + - - - - -
In the first round, if all weights are equal (= w), taking the minimum gives you an error of 4*w instead of 10*w.
You calculate this error for all N possible ways of separating the samples.
The minimum error will give you the range of threshold values. The actual threshold is probably the average of the adjacent feature values (I'm not sure though, do some research on this).
This was the second step in your DO FOR EACH HAAR FEATURE loop.
The cascades given along with OpenCV were created by Rainer Lienhart and I don't know what method he used.
You could closely follow the OpenCV source code for any further improvements to this procedure.
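For concreteness, here is a small sketch of that threshold sweep for a single feature. It is written in Java rather than C++ for brevity, the names are made up, and it assumes the feature values, labels (1 = positive, anything else = negative) and current AdaBoost weights are already available:
import java.util.Arrays;
import java.util.Comparator;

public class StumpThreshold {

    /** Result of the sweep: threshold, polarity (sign) and weighted error. */
    static class Stump {
        double threshold;
        int polarity;      // +1: positives lie below the threshold, -1: positives lie above
        double error;
    }

    /**
     * Finds the best threshold for one feature by sorting the samples by feature value
     * and sweeping a split point, using e = min(S+ + (T- - S-), S- + (T+ - S+)).
     */
    static Stump findBestThreshold(double[] featureValues, int[] labels, double[] weights) {
        int n = featureValues.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> featureValues[i]));

        double tPos = 0, tNeg = 0;               // T+ and T-
        for (int i = 0; i < n; i++) {
            if (labels[i] == 1) tPos += weights[i]; else tNeg += weights[i];
        }

        Stump best = new Stump();
        best.error = Double.MAX_VALUE;
        double sPos = 0, sNeg = 0;               // S+ and S- (weights below the split)
        for (int k = 0; k < n; k++) {
            int idx = order[k];
            if (labels[idx] == 1) sPos += weights[idx]; else sNeg += weights[idx];

            // error if everything below the split is called positive vs. called negative
            double errPosBelow = sNeg + (tPos - sPos);
            double errNegBelow = sPos + (tNeg - sNeg);
            double err = Math.min(errPosBelow, errNegBelow);
            if (err < best.error) {
                best.error = err;
                // place the threshold between this feature value and the next one
                double next = (k + 1 < n) ? featureValues[order[k + 1]] : featureValues[idx];
                best.threshold = (featureValues[idx] + next) / 2.0;
                best.polarity = (errPosBelow < errNegBelow) ? +1 : -1;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // toy data: two positives with high feature values, two negatives with low ones
        double[] f = {0.2, 0.4, 0.6, 0.8};
        int[] y    = {0,   0,   1,   1};
        double[] w = {0.25, 0.25, 0.25, 0.25};
        Stump s = findBestThreshold(f, y, w);
        System.out.printf("threshold = %.2f, polarity = %d, error = %.2f%n",
                s.threshold, s.polarity, s.error);
    }
}
In a full AdaBoost round you would run this sweep for every Haar feature and keep the feature whose error is lowest, which corresponds to steps (3) and (6) in the pseudocode above.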
