Accuracy for prediction - machine-learning

I'm new to machine learning and am trying to learn the process, so I have started by playing around with Weka. When I load the data in Weka and run a classification, the software shows values such as the following:
Correctly Classified Instances 416 39.6568 %
Incorrectly Classified Instances 633 60.3432 %
Kappa statistic 0.091
Mean absolute error 0.4371
Root mean squared error 0.4663
Relative absolute error 98.4524 %
Root relative squared error 98.9763 %
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 100 %
Total Number of Instances 1049
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.310 0.231 0.377 0.310 0.340 0.084 0.554 0.448 16-18
0.271 0.167 0.460 0.271 0.341 0.123 0.501 0.359 19+
0.599 0.511 0.382 0.599 0.467 0.084 0.570 0.395 All Age
Weighted Avg. 0.397 0.306 0.407 0.397 0.384 0.098 0.541 0.399
Looking at these values, I assume that I have bad data, since the proportion of Correctly Classified Instances is only 39.66% and the error rates are high. But the TP Rate and Precision seem to be at a roughly acceptable level.
This confuses me. How can I judge the model based on these numbers? Does it mean my data is badly preprocessed?

You need to look at the confusion matrix to get the accuracy and precision. Below is a link that explains it. Hope it helps.
http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html
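For example, here is a minimal scikit-learn sketch (not tied to your Weka run; the labels are invented purely for illustration) showing how per-class precision and overall accuracy fall out of a confusion matrix:

import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for a 3-class problem, just to show the mechanics
y_true = ["16-18", "19+", "All Age", "16-18", "19+", "All Age", "16-18"]
y_pred = ["16-18", "All Age", "All Age", "19+", "19+", "All Age", "16-18"]

cm = confusion_matrix(y_true, y_pred, labels=["16-18", "19+", "All Age"])
print(cm)  # rows = actual class, columns = predicted class

accuracy = np.trace(cm) / cm.sum()                    # correctly classified / total
precision_per_class = np.diag(cm) / cm.sum(axis=0)    # TP / (TP + FP), per predicted column
recall_per_class = np.diag(cm) / cm.sum(axis=1)       # TP / (TP + FN), per actual row
print(accuracy, precision_per_class, recall_per_class)

From the confusion matrix you can see exactly which classes are being mixed up, which the summary statistics alone do not show.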

Related

High AUC and 100% recall, but precision and F1 are low

I have an imbalanced dataset which has 43323 rows, 9 of which belong to the 'failure' class; the other rows belong to the 'normal' class. I trained a classifier with 100% recall and 94.89% AUC on the test data (0.75/0.25 split with stratify = y). However, the classifier has 0.18% precision and 0.37% F1 score. I assumed I could find a better F1 score by changing the threshold, but I failed (I checked thresholds between 0 and 1 with step = 0.01). Also, it seems weird to me, because usually when dealing with an imbalanced dataset it is hard to get a high recall. The goal is to get a better F1 score. What can I do as a next step? Thanks!
(To be clear, I used SMOTE to upsample the failure samples in the training dataset.)
Getting 100% recall is in fact trivial: just classify everything as 1.
Is the precision/recall curve any good? Perhaps a more thorough scan could yield a better result:
import numpy as np
from sklearn.metrics import precision_recall_curve

# Probability of the positive class for each test instance
probabilities = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probabilities)
# precision and recall have one more entry than thresholds; drop the last point
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_f1 = np.max(f1_scores)
best_thresh = thresholds[np.argmax(f1_scores)]

How to interpret the results of logistic regression in Weka

Hello everyone, I'm new to this area and wondered whether anyone could help me understand the results of a logistic regression.
I need to understand whether the independent variables can be used to make a good classification.
=== Run information ===
Scheme: weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -num-decimal-places 4
Relation: Train
Instances: 14185
Attributes: 5
ATTR_1
ATTR_2
ATTR_3
ATTR_4
DEPENDENT_VAR
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
Class
Variable 0
====================
ATTR_1 0.0022
ATTR_2 0.0022
ATTR_3 0.0034
ATTR_4 -0.0021
Intercept 0.9156
Odds Ratios...
Class
Variable 0
====================
ATTR_1 1.0022
ATTR_2 1.0022
ATTR_3 1.0034
ATTR_4 0.9979
Time taken to build model: 0.13 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0.07 seconds
=== Summary ===
Correctly Classified Instances 51240 72.2453 %
Incorrectly Classified Instances 19685 27.7547 %
Kappa statistic -0.0001
Mean absolute error 0.3992
Root mean squared error 0.4467
Relative absolute error 99.5581 %
Root relative squared error 99.7727 %
Total Number of Instances 70925
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1,000 1,000 0,723 1,000 0,839 -0,005 0,545 0,759 0
0,000 0,000 0,000 0,000 0,000 -0,005 0,545 0,305 1
Weighted Avg. 0,722 0,723 0,522 0,722 0,606 -0,005 0,545 0,633
=== Confusion Matrix ===
a b <-- classified as
51240 5 | a = 0
19680 0 | b = 1
In particular, I am interested in understanding the values of the coefficients and the odds-ratios.
Thanks.
Off the top of my head:
Odds ratios and coefficient values are directly related: the odds ratio is the exponential of the coefficient, so each can be calculated from the other.
For ATTR_1, exp(0.0022) ≈ 1.0022.
For doing further calculations and fitting/predicting, the coefficients are "better". However, the coefficients are values that must be plugged into exp(x) functions and are somewhat difficult to "visualize in your head".
For human understanding, odds ratios are sometimes more convenient: they are easier to interpret and visualize, but you can't do certain calculations directly with them.
Weka does not know which you are more interested in, so it gives you both for convenience.
By the way, Weka does regularized logistic regression (Logistic Regression with ridge parameter of 1.0E-8), so the coefficients might differ slightly from those that a different software package would give you.
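To make the relationship concrete, here is a small Python sketch (my own illustration, not Weka output) that reproduces the odds ratios from the coefficients above and shows how a predicted probability would be formed, assuming the usual logistic form in which the listed coefficients give the log-odds of class 0 versus class 1; the feature values in x are invented:

import numpy as np

coefs = {"ATTR_1": 0.0022, "ATTR_2": 0.0022, "ATTR_3": 0.0034, "ATTR_4": -0.0021}
intercept = 0.9156

# Odds ratio = exp(coefficient); this reproduces the "Odds Ratios" table above
odds_ratios = {name: np.exp(b) for name, b in coefs.items()}
print(odds_ratios)  # ATTR_1 ~1.0022, ATTR_2 ~1.0022, ATTR_3 ~1.0034, ATTR_4 ~0.9979

# Predicted probability for one hypothetical instance (feature values invented)
x = {"ATTR_1": 10.0, "ATTR_2": 5.0, "ATTR_3": 2.0, "ATTR_4": 7.0}
log_odds = intercept + sum(coefs[k] * x[k] for k in coefs)
p_class0 = 1.0 / (1.0 + np.exp(-log_odds))  # probability of class 0 under this assumption
print(log_odds, p_class0)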

WEKA classification results similar but different performances

First I read this: How to interpret weka classification?
but it didn't help me.
For background, I am learning through Kaggle competitions, where models are evaluated by ROC area.
I built two models, and their results are reported in this way:
Correctly Classified Instances 10309 98.1249 %
Incorrectly Classified Instances 197 1.8751 %
Kappa statistic 0.7807
K&B Relative Info Score 278520.5065 %
K&B Information Score 827.3574 bits 0.0788 bits/instance
Class complexity | order 0 3117.1189 bits 0.2967 bits/instance
Class complexity | scheme 948.6802 bits 0.0903 bits/instance
Complexity improvement (Sf) 2168.4387 bits 0.2064 bits/instance
Mean absolute error 0.0465
Root mean squared error 0.1283
Relative absolute error 46.7589 %
Root relative squared error 57.5625 %
Total Number of Instances 10506
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.998 0.327 0.982 0.998 0.99 0.992 0
0.673 0.002 0.956 0.673 0.79 0.992 1
Weighted Avg. 0.981 0.31 0.981 0.981 0.98 0.992
Apart from the K&B Relative Info Score, the Relative absolute error and the Root relative squared error, which are respectively lower, higher and higher in the model judged best by the ROC curves, all the figures are the same.
I built a third model with similar behavior (TP rate and so on), but again the K&B Relative Info Score, the Relative absolute error and the Root relative squared error varied. That still did not let me predict whether this third model was superior to the first two (the variations went in the same direction as for the best model, so theoretically it should have been superior, but it wasn't).
What should I do to predict whether a model will perform well, given such details about it?
Thanks in advance.

Experiment with OHSUMED dataset using SVM Rank

I am trying to learn RankSVM using the OHSUMED dataset and the SVM Rank library, as explained in the following link:
http://research.microsoft.com/en-s/um/beijing/projects/letor/Baselines/RankSVM-Struct.txt
I used the same parameters as the link suggests for the OHSUMED dataset, i.e.
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold1_l1_c0.0002_e0.001.log
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold2_l1_c0.002_e0.001.log
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold3_l1_c0.01_e0.001.log
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold4_l1_c0.02_e0.001.log
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold5_l1_c0.01_e0.001.log
But when I train my model and run the "svm_rank_classify" command, I get the following result:
Reading model...done.
Reading test examples...done.
Classifying test examples...done
Runtime (without IO) in cpu-seconds: 0.00
Average loss on test set: 0.3864
Zero/one-error on test set: 100.00% (0 correct, 22 incorrect, 22 total)
NOTE: The loss reported above is the fraction of swapped pairs averaged over
all rankings. The zero/one-error is fraction of perfectly correct
rankings!
Total Num Swappedpairs : 31337
Avg Swappedpairs Percent: 38.64
Please suggest if there are any steps I am missing here.
Thanks.
The zero/one-error is the percentage of rankings (i.e. qid sets) where the model ranked at least one pair incorrectly. Your accuracy over all pairs is actually:
(100 - Avg Swappedpairs Percent) = 61.36%
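As a quick, illustrative check (my own sketch, not part of the svm_rank output), the quoted figures are consistent with that reading:

# Illustration only: relating the svm_rank_classify figures quoted above
avg_swapped_pct = 38.64                       # "Avg Swappedpairs Percent"
pairwise_accuracy = 100.0 - avg_swapped_pct   # ~61.36% of pairs ordered correctly
print(pairwise_accuracy)

# The zero/one-error counts a whole ranking (one qid) as wrong if even a single
# pair is swapped, which is why it can read 100% while most pairs are correct.

So the model is not useless; it simply never produces a perfectly ordered ranking for any of the 22 queries.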

How to distinguish photo from picture?

I have the following problem:
I am given a set of images and I need to divide them into photos and pictures (graphics) by means of the OpenCV library.
I've already tried
analyzing the RGB histogram (on average, a picture has empty histogram bins),
analyzing the HSV histogram (on average, a picture does not have many colors),
searching for contours (on average, a picture has fewer contours than a photo).
So far I have a 7% error rate (tested on 2000 images). I'm a little confused, because I don't have much experience with the many computer-vision techniques available.
For example, take the photo below. Its histograms (RGB and HSV) are very sparse and the number of contours is rather small. In addition there is a lot of background color, so I need to find the object and calculate the histogram only over it (I use findContours() for this). But in any case my algorithm detects this image as a picture.
And one more example:
The problem with pictures is noise. My images are small (200×150), and in some cases the noise is so perceptible that my algorithm detects the image as a photo. I've tried blurring the images, but then the number of colors increases because of pixel mixing, and it also decreases the number of contours (some faint boundaries become indistinguishable).
Example of pictures:
I've also tried color segmentation and MSER, but my best result is still 7%.
Could you advise me on what other methods I could try?
I've used your dataset to create some really simple models. To do this, I've used the Rattle library in R.
Input data
rgbh1 - number of bins in the RGB histogram whose value is > #param#; in my case #param# = 30 (340 is the maximum value)
rgbh2 - number of bins in the RGB histogram whose value is > 0 (i.e. not empty)
hsvh1 - number of bins in the HSV histogram whose value is > #param#; in my case #param# = 30 (340 is the maximum value)
hsvh2 - number of bins in the HSV histogram whose value is > 0 (i.e. not empty)
countours - number of contours in the image
PicFlag - flag indicating picture/photo (picture = 1, photo = 0)
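For concreteness, here is a rough Python/OpenCV sketch (my own assumption, not the original poster's code) of how features of this kind could be computed; the threshold of 30 and the bin counts are illustrative only:

import cv2
import numpy as np

def histogram_features(path, thresh=30):
    # Rough sketch of rgbh1/rgbh2/hsvh1/hsvh2/countours-style features
    img = cv2.imread(path)

    # RGB histogram: bins above a threshold and non-empty bins
    rgb_hist = np.concatenate(
        [cv2.calcHist([img], [c], None, [64], [0, 256]).ravel() for c in range(3)])
    rgbh1 = int(np.sum(rgb_hist > thresh))
    rgbh2 = int(np.sum(rgb_hist > 0))

    # Same idea on the HSV representation
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    hsv_hist = np.concatenate(
        [cv2.calcHist([hsv], [c], None, [64], [0, 256]).ravel() for c in range(3)])
    hsvh1 = int(np.sum(hsv_hist > thresh))
    hsvh2 = int(np.sum(hsv_hist > 0))

    # Contour count on an edge map (OpenCV 4 returns (contours, hierarchy))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    return rgbh1, rgbh2, hsvh1, hsvh2, len(contours)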
Data exploration
To better understand your data, here is a plot of the distribution of each variable by picture/photo group (the y axis shows percentages):
It clearly shows that there are variables with predictive power. Most of them can be used in our model. Next I created a simple scatter plot matrix to see whether some combination of variables could be useful:
You can see that, for example, the combination of the number of contours and rgbh1 looks promising.
On the following chart you can notice that there is also quite strong correlation among the variables. (Generally, we like to have a lot of variables with low correlation, while you have only a limited number of correlated variables.) The pie chart shows how large the correlation is: a full circle means 1, an empty circle means 0. In my opinion, if the correlation exceeds 0.4 it might not be a good idea to have both variables in the model.
Model
Then I created simple models (keeping Rattle's defaults) using a decision tree, a random forest, logistic regression and a neural network. As input I've used your data with a 60/20/20 split (training, validation and testing datasets). These are my results (please refer to Google if you don't understand an error matrix):
Error matrix for the Decision Tree model on pics.csv [validate] (counts):
Predicted
Actual 0 1
0 167 22
1 6 204
Error matrix for the Decision Tree model on pics.csv [validate] (%):
Predicted
Actual 0 1
0 42 6
1 2 51
Overall error: 0.07017544
Rattle timestamp: 2013-01-02 11:35:40
======================================================================
Error matrix for the Random Forest model on pics.csv [validate] (counts):
Predicted
Actual 0 1
0 170 19
1 8 202
Error matrix for the Random Forest model on pics.csv [validate] (%):
Predicted
Actual 0 1
0 43 5
1 2 51
Overall error: 0.06766917
Rattle timestamp: 2013-01-02 11:35:40
======================================================================
Error matrix for the Linear model on pics.csv [validate] (counts):
Predicted
Actual 0 1
0 171 18
1 13 197
Error matrix for the Linear model on pics.csv [validate] (%):
Predicted
Actual 0 1
0 43 5
1 3 49
Overall error: 0.07769424
Rattle timestamp: 2013-01-02 11:35:40
======================================================================
Error matrix for the Neural Net model on pics.csv [validate] (counts):
Predicted
Actual 0 1
0 169 20
1 15 195
Error matrix for the Neural Net model on pics.csv [validate] (%):
Predicted
Actual 0 1
0 42 5
1 4 49
Overall error: 0.0877193
Rattle timestamp: 2013-01-02 11:35:40
======================================================================
Results
As you can see, the overall error rate oscillates between 6.5% and 8%. I do not think this result can be significantly improved by tuning the parameters of the methods used. There are two ways to decrease the overall error rate:
add more uncorrelated variables (we usually have 100+ input variables in the modeling dataset, and roughly 5-10 end up in the final model)
add more data (we can then tune the model without being scared of overfitting)
Software used:
R http://www.r-project.org/
Rattle http://rattle.togaware.com/
Code used to create the corrgram and the scatterplot matrix (other outputs were generated using the Rattle GUI):
# install.packages("lattice", dependencies=TRUE)
# install.packages("car")
# install.packages("corrgram")
library(lattice)
library(car)
library(corrgram)  # needed for corrgram()

setwd("C:/")
indata <- read.csv2("pics.csv")
str(indata)

# Corrgram (correlation matrix with shaded and pie panels)
corrgram(indata, order=TRUE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="Picture/Photo correlation matrix")

# Scatterplot matrix, grouped by the picture/photo flag
attach(indata)
scatterplotMatrix(~rgbh1+rgbh2+hsvh1+hsvh2+countours|PicFlag,
                  main="Picture/Photo scatterplot matrix",
                  diagonal=c("histogram"), legend.plot=TRUE, pch=c(1,1))
Well, a generic suggestion would be to increase the number of features (or get better features) and to build a classifier using these features, trained with an appropriate machine learning algorithm. OpenCV already has a couple of good machine learning algorithms which you can make use of.
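As a rough illustration of that suggestion (my own sketch, with an invented feature matrix and labels standing in for whatever features you extract), OpenCV's ml module can train a random forest directly on such features:

import cv2
import numpy as np

# Invented feature matrix (one row per image) and 0/1 labels; cv2.ml expects
# float32 samples and int32 responses for classification
samples = np.random.rand(200, 5).astype(np.float32)      # e.g. rgbh1, rgbh2, hsvh1, hsvh2, contours
labels = np.random.randint(0, 2, 200).astype(np.int32)   # 1 = picture, 0 = photo

rtrees = cv2.ml.RTrees_create()
rtrees.train(samples, cv2.ml.ROW_SAMPLE, labels)

_, predictions = rtrees.predict(samples)                 # predict() returns (retval, results)
accuracy = np.mean(predictions.ravel() == labels)
print(accuracy)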
I have never worked on this problem, but a quick Google search led me to this paper by Cutzu et al., Distinguishing paintings from photographs.
One feature that should be useful is the gradient histogram. Natural images have a particular distribution of gradient strengths.
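If you want to try that, here is a rough Python/OpenCV sketch (my own assumption, not the paper's exact method) of a normalised gradient-magnitude histogram that could be added as extra features:

import cv2
import numpy as np

def gradient_histogram(path, bins=16):
    # Normalised histogram of gradient magnitudes: photos tend to have a heavier
    # tail of strong gradients than flat, synthetic graphics
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)
    mag = cv2.magnitude(gx, gy)
    hist, _ = np.histogram(mag, bins=bins, range=(0, float(mag.max()) + 1e-6))
    return hist / hist.sum()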
