Percent difference between accuracy of two machine learning models - machine-learning

I have trained two machine learning models. Both have slightly different accuracies.
Model-A Accuracy = 0.78 or 78%
Model-B Accuracy = 0.80 or 80%
Can I infer from the above results that Model-B is 2% better than Model-A?

The answer depends on how you evaluate the models and on the target distribution.
Metric
If the class distribution is imbalanced, accuracy may not describe the generalization error well. Consider ROC AUC or the F1-score instead.
Evaluation process
Cross-validation will give you a more robust estimate of the evaluation metric than hold-out validation. Stratified cross-validation is even better for an imbalanced dataset.
If you're confident in your validation method, then yes, you can interpret the results the way you described: Model-B's accuracy is 2 percentage points higher than Model-A's.
It's still only an estimate, after all. You can use bootstrapping to estimate a confidence interval for the difference, pick a significance threshold, and check whether the difference is statistically significant.
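For instance, a minimal bootstrap sketch (the arrays y_true, pred_a and pred_b are placeholders for your test labels and the two models' predictions; they are not from the question):
import numpy as np

def bootstrap_accuracy_diff(y_true, pred_a, pred_b, n_boot=10000, seed=0):
    # Resample the test set with replacement and recompute the accuracy gap
    # (Model-B minus Model-A) on each resample.
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        acc_a = np.mean(pred_a[idx] == y_true[idx])
        acc_b = np.mean(pred_b[idx] == y_true[idx])
        diffs.append(acc_b - acc_a)
    return np.percentile(diffs, [2.5, 97.5])  # 95% confidence interval for the gap
If the resulting interval excludes 0, the 2-point gap is unlikely to be explained by sampling noise alone.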

Related

Cross-entropy loss influence over F-score

I'm training an FCN (Fully Convolutional Network) and using "Sigmoid Cross Entropy" as a loss function.
My evaluation metrics are F-measure and MAE.
The train/dev loss vs. iteration curves show that, although the dev loss increases slightly after iteration 2200, my metrics on the dev set keep improving until around iteration 10000. Is this possible in machine learning at all? If the F-measure improves, shouldn't the loss also decrease? How do you explain it?
Any answer would be appreciated.
Short answer: yes, it's possible.
I would explain it by reasoning about the cross-entropy loss and how it differs from the metrics. Classification losses are, generally speaking, used to optimize models via predicted probabilities (e.g. 0.1/0.9), while metrics usually use the predicted hard labels (0/1).
If the model is strongly confident (probability close to 0 or 1) but wrong, the loss increases greatly while the F-measure decreases only slightly.
Likewise, in the opposite scenario, a low-confidence model (e.g. 0.49/0.51) has a small numerical impact on the loss but a larger impact on the metrics.
Plotting the distribution of your predictions would help to confirm this hypothesis.
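As a toy illustration (the numbers are invented, not from the post), the snippet below shows a "later" checkpoint whose hard predictions are better (higher F1) even though one very confident mistake makes its cross-entropy worse:
import numpy as np
from sklearn.metrics import f1_score, log_loss

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])

# Earlier checkpoint: cautious probabilities, two borderline mistakes.
p_early = np.array([0.60, 0.70, 0.55, 0.40, 0.30, 0.45, 0.48, 0.55])

# Later checkpoint: fixes one mistake, but the remaining error is made
# with very high confidence, which inflates the loss.
p_late = np.array([0.90, 0.95, 0.85, 0.10, 0.05, 0.15, 0.02, 0.10])

for name, p in [("early", p_early), ("late", p_late)]:
    labels = (p >= 0.5).astype(int)  # hard labels at a 0.5 threshold
    print(name, "log loss:", round(log_loss(y_true, p), 3),
          "F1:", round(f1_score(y_true, labels), 3))
Here the later checkpoint has both the higher F-measure and the higher loss, which is exactly the pattern described in the question.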

Random Forest Train / Test meaning

I have the following:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, criterion='entropy', random_state=42)
rf.fit(X_train, y_train)
From this, I get:
1.0 accuracy on training set
0.6990116801437556 accuracy on test set
Since we're not setting the max_depth, it seems the trees are overfitting to the training data.
My question is: what does this tell us about the training data? Does the fact that it has reasonable test accuracy imply that the test data is very similar to the training data, and that this is the only reason we're getting such an accuracy?
Since you don't specify max_depth, each tree grows until all of its leaves are pure. So it is natural to overfit, and it is expected to get 100% (or at least very high, if the minimum number of samples per node is not set too large) accuracy on the training set.
This fact is not very insightful about the training set.
The fact that you get such a "good" accuracy on the test set could indeed point to a similarity between the training and test distributions (which, to some extent, is expected if they are drawn from the same phenomenon) and to the tree having some degree of generalizability.
As a general rule, I would say it is wrong to draw conclusions from a single result, especially when the model is overfitting the training set. Additionally, whether 0.69 counts as "good" accuracy is relative to the problem at hand; a 30 percentage point gap between training and test could be huge in many applications.
To get a better understanding of your problem and more robust results, it would be better to evaluate the random forest with a cross-validation approach rather than a single split, as in the sketch below.
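A minimal sketch along those lines (X and y stand for your full feature matrix and labels; the max_depth value is only illustrative):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Limit tree depth so the forest cannot simply memorize the training data,
# and estimate accuracy with 5-fold cross-validation instead of a single split.
rf = RandomForestClassifier(n_estimators=500, criterion='entropy',
                            max_depth=10, random_state=42)
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())  # mean accuracy and its variability across folds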

Test accuracy is greater than train accuracy what to do?

I am using a random forest. My test accuracy is 70%, while my train accuracy is 34%. What should I do? How can I solve this problem?
Test accuracy should not be higher than train since the model is optimized for the latter. Ways in which this behavior might happen:
you did not use the same source dataset for testing. You should do a proper train/test split in which both parts have the same underlying distribution. Most likely you provided a completely different (and easier) dataset for testing
an unreasonably high degree of regularization was applied. Even then, there would need to be some element of "the test data distribution is not the same as the train distribution" for the observed behavior to occur
The other answers are correct in most cases. But I'd like to offer another perspective. There are specific training regimes that can make the training data harder for the model to learn, for instance adversarial training or adding Gaussian noise to the training examples. In these cases the benign test accuracy can be higher than the train accuracy, because benign examples are easier to evaluate. This isn't always a problem, however!
If this applies to you, and the gap between train and test accuracy is larger than you'd like (the roughly 30% in your question is a pretty big gap), then your model is underfitting the harder patterns, so you'll need to increase its expressiveness. In the case of random forests, this might mean growing the trees to a greater depth, for example by tuning max_depth as in the sketch below.
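A sketch of what that tuning could look like with a grid search (X_train and y_train are placeholders for your data; the depth grid is just an example):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Larger max_depth means more expressive trees; None lets them grow fully.
param_grid = {'max_depth': [5, 10, 20, None]}
search = GridSearchCV(RandomForestClassifier(n_estimators=300, random_state=0),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)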
First you should check the data used for training; there may be some problem with it, such as improper pre-processing.
Also, in this case, you could train for more epochs (if your model is trained iteratively) and plot the learning curve to analyze when the model converges; see the sketch after the checklist below.
You should check the following:
Both training and validation accuracy should increase and the loss should decrease.
If step 1 goes wrong after a particular epoch, train your model only up to that epoch, because it is over-fitting beyond that point.
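One common way to plot such a curve with scikit-learn is accuracy versus training-set size; a sketch, assuming X and y hold your full dataset (the estimator and sizes are illustrative):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Compare train vs. validation accuracy as the training set grows.
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=200, random_state=0),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5), scoring='accuracy')

plt.plot(sizes, train_scores.mean(axis=1), label='train')
plt.plot(sizes, val_scores.mean(axis=1), label='validation')
plt.xlabel('training set size')
plt.ylabel('accuracy')
plt.legend()
plt.show()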

Why do we want to maximize AUC in classification problems?

I wonder why our objective is to maximize AUC when maximizing accuracy seemingly achieves the same thing.
I think that, alongside the primary goal of maximizing accuracy, AUC will automatically be large.
I guess we use AUC because it describes how well our method separates the data independently of a threshold.
For some applications, we cannot afford false positives or false negatives. And when we use accuracy, we already make an a priori choice of the best threshold to separate the data, regardless of specificity and sensitivity.
In binary classification, accuracy is a performance metric of a single model at a single threshold, while the AUC (area under the ROC curve) summarizes performance across a whole series of thresholds, i.e. the family of classifiers obtained by varying the threshold.
Thanks to this question, I have learnt quite a bit about comparing AUC and accuracy. I don't think there is a simple correspondence between the two, and I believe this is still an open problem. At the end of this answer I've added some links that I think are useful.
One scenario where accuracy fails:
Example Problem
Let's consider a binary classification problem where you evaluate the performance of your model on a data set of 100 samples (98 of class 0 and 2 of class 1).
Take out your sophisticated machine learning model and replace the whole thing with a dumb system that always outputs 0 for whatever the input it receives.
What is the accuracy now?
Accuracy = Correct predictions/Total predictions = 98/100 = 0.98
We got a stunning 98% accuracy on the "Always 0" system.
Now you convert your system into a cancer-diagnosis system and start predicting (0 - no cancer, 1 - cancer) on a set of patients. As long as only a few cases correspond to class 1, you will still achieve a high accuracy.
Despite the high accuracy, what is the point of the system if it fails on class 1 (identifying the patients who actually have cancer)?
This shows that accuracy is not a good evaluation metric for every type of machine learning problem. The above is known as an imbalanced-class problem, and there are plenty of practical problems of this nature.
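You can reproduce this with a few lines of scikit-learn (the 98/2 split mirrors the example above; the constant scores are just what an "Always 0" system would output):
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0] * 98 + [1] * 2)   # 98 samples of class 0, 2 of class 1
y_pred = np.zeros(100, dtype=int)       # the "Always 0" system
scores = np.zeros(100)                  # it assigns every sample the same score

print(accuracy_score(y_true, y_pred))   # 0.98
print(roc_auc_score(y_true, scores))    # 0.5 -- no ability to separate the classes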
As for the comparison of accuracy and AUC, here are some links I think would be useful,
An introduction to ROC analysis
Area under curve of ROC vs. overall accuracy
Why is AUC higher for a classifier that is less accurate than for one that is more accurate?
What does AUC stand for and what is it?
Understanding ROC curve
ROC vs. Accuracy vs. AROC

How to deal with this unbalanced-class skewed data-set?

I have to deal with a class-imbalance problem and do binary classification of an input test data-set, where the majority class-label in the training data-set is 1 (the other class-label is 0).
For example, following is some part of the training data :
93.65034,94.50283,94.6677,94.20174,94.93986,95.21071,1
94.13783,94.61797,94.50526,95.66091,95.99478,95.12608,1
94.0238,93.95445,94.77115,94.65469,95.08566,94.97906,1
94.36343,94.32839,95.33167,95.24738,94.57213,95.05634,1
94.5774,93.92291,94.96261,95.40926,95.97659,95.17691,0
93.76617,94.27253,94.38002,94.28448,94.19957,94.98924,0
where the last column is the class-label (0 or 1). The actual data-set is very skewed, with roughly a 10:1 class ratio: around 700 samples have 0 as their class label, while the remaining 6800 have 1.
The rows above are only a few of the samples in the data-set, but the actual data-set contains about 90% of samples with class-label 1 and the rest with class-label 0, even though more or less all the samples look very similar.
Which classifier should be best for handling this kind of data-set ?
I have already tried logistic regression as well as SVM with the class-weight parameter set to "balanced", but got no significant improvement in accuracy.
"...but got no significant improvement in accuracy."
Accuracy isn't the way to go here (see, e.g., the accuracy paradox). With a 10:1 class ratio you can easily get about 90% accuracy just by always predicting the majority class (label 1 in your case).
Some good starting points are:
try a different performance metric, e.g. the F1-score or the Matthews correlation coefficient
"resample" the dataset: add examples from the under-represented class (over-sampling) or delete instances from the over-represented class (under-sampling; you should have a lot of data); a minimal over-sampling sketch follows this list
a different point of view: anomaly detection is worth a try for an imbalanced dataset
a different algorithm is another possibility, but not a silver bullet. You could probably start with decision trees, which often perform well on imbalanced datasets
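As promised in the resampling point above, here is a minimal over-sampling sketch with plain scikit-learn utilities (df and the column name 'label' are placeholders for your data frame):
import pandas as pd
from sklearn.utils import resample

minority = df[df['label'] == 0]   # ~700 rows in your case
majority = df[df['label'] == 1]   # ~6800 rows

# Over-sample the minority class with replacement until the classes are
# balanced, then shuffle the combined frame.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)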
EDIT (now knowing you're using scikit-learn)
The weights from scikit-learn's class_weight parameter are used when training the classifier (so "balanced" is fine), but accuracy is a poor choice for judging how well it performs.
The sklearn.metrics module implements several loss, score and utility functions to measure classification performance. Also take a look at How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?.
Have you tried plotting a ROC curve and computing the AUC to compare your parameters across different thresholds? If not, that should give you a good starting point; see the sketch below.
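A sketch of that kind of evaluation, assuming you already have a fitted classifier clf and a held-out X_test/y_test (the names are placeholders):
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score, roc_curve

probs = clf.predict_proba(X_test)[:, 1]   # predicted probability of class 1
preds = clf.predict(X_test)

print("F1:", f1_score(y_test, preds))
print("MCC:", matthews_corrcoef(y_test, preds))
print("ROC AUC:", roc_auc_score(y_test, probs))

fpr, tpr, thresholds = roc_curve(y_test, probs)   # one point per threshold
plt.plot(fpr, tpr)
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.show()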
