Third point on ROC curve for Zero-R classifier in WEKA - machine-learning

I was comparing several classifiers in Weka and included ZeroR as a baseline result. Because some of the classifiers had interesting ROC curves, I also looked at those. However, I'm confused by the result I get with ZeroR (for which an ROC curve isn't very useful, I know).
My dataset has 122 true and 68 false instances, leading to a ZeroR prediction of 64.2% true for every instance. Using a threshold of >=0% you then get TPR=122/122=1 and FPR=68/68=1. Using a threshold of >=100% you get TPR=0/122=0 and FPR=0/68=0. This should lead to an AUC of 0.5.
So far, everything makes sense.
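For comparison, here is a minimal sketch of what a plain ROC computation gives for a constant-score classifier, done outside Weka with scikit-learn and with labels that merely mimic the counts above (so the data itself is an assumption):

import numpy as np
from sklearn.metrics import roc_curve, auc

# hypothetical labels mimicking the counts in the question: 122 true, 68 false
y_true = np.array([1] * 122 + [0] * 68)
# a ZeroR-style model assigns every instance the same score (the base rate)
y_score = np.full(len(y_true), 122 / 190)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)       # only the corner points (0, 0) and (1, 1) appear
print(auc(fpr, tpr))  # 0.5, i.e. the diagonal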
However, Weka draws a third point at threshold 0.64162 (still smaller than 122/(122+68)=0.64211), for which it decides there are 96 true positives and 56 false positives. It then calculates an AUC of 0.4817.
Does anyone know where this third point comes from?

Related

How to know if the data can be classified in machine learning

In a classification problem, we can adjust the soft margin to make things easier. However, in some cases, especially with real-world data collections, the classification problem can be extremely hard because of the noise introduced. In 2D space, the following graph can be obtained:
We can easily conclude that these two classes of data cannot be classified. However, the data usually has a huge number of features, which cannot be plotted in 2D or 3D space. I also tried t-SNE to visualize the classified data, but since t-SNE is trained with a KL-divergence objective while the SVM uses a different loss function, I cannot get anything from the following picture,
where the green ones are true positives, the blue ones are false negatives, the yellow ones are false positives, and the black ones are true negatives.
So my question is: is there a scientific method that can be used to analyse whether a problem can be classified or not?

Reduce false positives in HOG

I have implemented a car detector using HOG and it is working quite okay at the moment. Unfortunately, I get a lot of false positives from the classifier.
What I have done so far
I changed the ratio of samples (positive:negative) from 1:1 to 1:3, and it lowered the false positives to some extent. Can someone help me reduce the false positives for the classifier?
My approach to implementing HOG (a rough sketch of these steps is given right after them):
Get the HOG features (blocks only) for the complete image.
Extract the positive features based on the label information and window size.
Extract the negative samples by randomly drawing rectangles and checking for collision with the object I am interested in.
Train a linear SVM.
Test the classifier.
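A rough sketch of those steps, assuming the windows have already been cropped to one fixed size (the arrays below are random placeholders rather than real image data, and the helper name is made up):

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
positive_windows = rng.random((20, 64, 128))   # placeholder patches standing in for car crops
negative_windows = rng.random((60, 64, 128))   # 1:3 positive:negative ratio, as in the question

def extract_hog(windows):
    # one HOG descriptor per fixed-size grayscale window
    return np.array([hog(w, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for w in windows])

X = np.vstack([extract_hog(positive_windows), extract_hog(negative_windows)])
y = np.concatenate([np.ones(len(positive_windows)), np.zeros(len(negative_windows))])

clf = LinearSVC(C=1.0).fit(X, y)
# clf.decision_function(...) gives a score per window; raising the decision
# threshold on that score trades false positives for missed detections.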
Maybe it is not the perfect solution, but I hope it helps you.
I was using a HOG descriptor + SVM classifier for specific object detection. To reduce the false positive results, although it is important to adjust the numbers of false and true training samples, in the end I had to empirically adjust the gamma and cost (C) parameters of the radial basis function (RBF) kernel SVM. If you increase your gamma value you will probably have fewer false positive results, but there may also be some missed detections.
The effect of the inverse-width parameter of the Gaussian kernel (γ) for a fixed value of the soft-margin constant: for small values of γ (upper left) the decision boundary is nearly linear. As γ increases, the flexibility of the decision boundary increases. Large values of γ lead to overfitting (bottom).
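As a small sketch of that kind of empirical tuning, a grid search over gamma and C in scikit-learn could look like this (the parameter grid, scoring choice, and synthetic data are all just illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# synthetic stand-in for HOG feature vectors with an imbalanced class ratio
X, y = make_classification(n_samples=400, n_features=36, weights=[0.75, 0.25],
                           random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}
# scoring="precision" penalises false positives; use "recall" to penalise misses instead
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="precision", cv=5)
search.fit(X, y)
print(search.best_params_)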
I leave you some links as reference:
A User’s Guide to Support Vector Machines
Using SVMs for Scientists and Engineers
Regards.

receiver operating characteristic (ROC) on a test set

The following image definitely makes sense to me.
Say you have a few trained binary classifiers A, B (B not much better than random guessing etc. ...) and a test set composed of n test samples to go with all those classifiers. Since Precision and Recall are computed for all n samples, those dots corresponding to classifiers make sense.
Now sometimes people talk about ROC curves and I understand that precision is expressed as a function of recall or simply plotted Precision(Recall).
I don't understand where this variability comes from, since you have a fixed number of test samples. Do you just pick some subsets of the test set and compute precision and recall on them, in order to plot many discrete values (or an interpolated line)?
The ROC curve is well-defined for a binary classifier that expresses its output as a "score." The score can be, for example, the probability of being in the positive class, or it could also be the probability difference (or even the log-odds ratio) between probability distributions over each of the two possible outcomes.
The curve is obtained by setting the decision threshold for this score at different levels and measuring the true-positive and false-positive rates, given that threshold.
There's a good example of this process in Wikipedia's "Receiver Operating Characteristic" page:
For example, imagine that the blood protein levels in diseased people and healthy people are normally distributed with means of 2 g/dL and 1 g/dL respectively. A medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease. The experimenter can adjust the threshold (black vertical line in the figure), which will in turn change the false positive rate. Increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.
If code speaks more clearly to you, here's the code in scikit-learn that computes an ROC curve given a set of predictions for each item in a dataset. The fundamental operation seems to be (direct link):
import numpy as np

def binary_clf_curve(y_true, y_score):
    # simplified version of the scikit-learn helper: sort by decreasing score
    desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]
    y_score = y_score[desc_score_indices]
    y_true = y_true[desc_score_indices]
    # accumulate the true positives with decreasing threshold
    tps = y_true.cumsum()
    fps = 1 + np.arange(len(y_true)) - tps
    return fps, tps, y_score
(I've omitted a bunch of code in there that deals with the (common) cases of having weighted samples and of the classifier giving near-identical scores to multiple samples.) Basically, the true labels are sorted in descending order by the score the classifier assigned to them, and their cumulative sum is then computed, giving the true-positive count (and, after normalisation, the true-positive rate) as a function of the score threshold.
And here's an example showing how this gets used: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
The ROC curve just shows how much sensitivity you will obtain if you increase the FPR by some amount: it is the trade-off between TPR and FPR. The variability comes from varying some parameter of the classifier (for the logistic regression case below, it is the threshold value).
For example, logistic regression gives you the probability that an object belongs to the positive class (a value in [0, 1]), but it is just a probability, not a class label. So in the general case you have to specify a threshold on the probability above which you will classify an object as positive. You can train a logistic regression model, obtain from it the probability of the positive class for each object in your set, and then vary this threshold parameter in small steps from 0 to 1; thresholding the probabilities at each value gives you class labels for every object, from which you compute the TPR and FPR. Thus you get a TPR and FPR for every threshold. You can mark them on a plot and, after you have computed the (TPR, FPR) pairs for all thresholds, draw a line through them.
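A minimal sketch of that sweep, with a logistic regression trained on synthetic data (scikit-learn names; the step size of 0.1 is arbitrary):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

for threshold in np.linspace(0, 1, 11):
    pred = (proba >= threshold).astype(int)
    tpr = np.sum((pred == 1) & (y == 1)) / np.sum(y == 1)
    fpr = np.sum((pred == 1) & (y == 0)) / np.sum(y == 0)
    print(f"threshold={threshold:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")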
Also, for linear binary classifiers you can think of this varying process as choosing the distance between the decision line and the positive (or negative, if you want) class cluster. If you move the decision line away from the positive class, you will classify more objects as positive (because you increased the positive-class region), and at the same time you increase the FPR by some amount (because the negative-class region decreased).

How do you set an optimal threshold for detection with an SVM?

I have a face detection system with an SVM as the classifier. The classifier outputs a confidence level between 0 and 1 along with the decision. As in any detection system, there are several false positives too. To eliminate some of them, we can use non-maxima suppression (please see http://www.di.ens.fr/willow/teaching/recvis10/assignment4/). The confidence threshold for detection is set manually; for example, any detection with confidence below 0.6 is treated as a false positive. Is there a way to set this threshold automatically, for example using something from detection/estimation theory?
If you search for probability calibration you will find some research on a related matter (recalibrating the outputs to return better scores).
If your problem is a binary classification problem, you can calculate the slope of the cost line by assigning values to the true/false positive/negative outcomes, multiplied by the class ratio. You can then find where a line with that slope touches the ROC curve at exactly one point; that point gives a threshold that is, in some sense, optimal for your problem.
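A rough sketch of that idea (the per-error costs below are made-up assumptions and the helper name is hypothetical): evaluate the expected cost at every candidate threshold on the ROC curve and keep the cheapest one.

import numpy as np
from sklearn.metrics import roc_curve

def optimal_threshold(y_true, scores, cost_fp=1.0, cost_fn=5.0):
    y_true = np.asarray(y_true)
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    n_pos, n_neg = np.sum(y_true == 1), np.sum(y_true == 0)
    # expected cost at each candidate threshold
    cost = cost_fp * fpr * n_neg + cost_fn * (1 - tpr) * n_pos
    return thresholds[np.argmin(cost)]

# e.g. optimal_threshold(labels, svm_confidences, cost_fp=1.0, cost_fn=3.0)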

ROC curves cross validation

How do you generate an ROC curve for cross-validation?
For a single test set, I think I should threshold the SVM's classification scores to generate the ROC curve.
But I am unclear about how to generate it for cross-validation.
After a complete round of cross-validation, all observations have been classified once (although by different models) and have been given an estimated probability of belonging to the class of interest, or a similar statistic. These probabilities can be used to generate an ROC curve in exactly the same way as probabilities obtained on an external test set. Just calculate the classwise error rates as you vary the classification threshold from 0 to 1 and you are all set.
However, you would typically like to perform more than one round of cross-validation, as the performance varies depending on how the folds are divided. It is not obvious to me how to calculate the mean ROC curve of all rounds, so I suggest plotting them all and calculating the mean AUC.
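A minimal sketch of the pooled variant, using scikit-learn's cross_val_predict so that every observation receives exactly one out-of-fold probability (the data here is synthetic and purely illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

X, y = make_classification(n_samples=300, random_state=0)
proba = cross_val_predict(SVC(probability=True), X, y, cv=5,
                          method="predict_proba")[:, 1]
fpr, tpr, _ = roc_curve(y, proba)
print(auc(fpr, tpr))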
As a follow-up to Backlin:
The variation in the results for different runs of k-fold or leave-n-out cross-validation shows the instability of the models. This is valuable information.
Of course you can pool the results and just generate one ROC curve. But you can also plot the set of curves (see e.g. the R package ROCR), or calculate e.g. the median and IQR at different thresholds and construct a band depicting these variations.
Here's an example: the shaded areas are the interquartile ranges observed over 125 iterations of 8-fold cross-validation. The thin black areas contain half of the observed specificity-sensitivity pairs for one particular threshold, with the median marked by an x (ignore the + marks).
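For the band variant, one rough sketch (scikit-learn on synthetic data, with the FPR grid and number of repetitions chosen arbitrarily) is to interpolate each repetition's ROC onto a common grid and take the median and quartiles pointwise:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=300, random_state=0)
grid = np.linspace(0, 1, 101)          # common FPR grid
tprs = []
for seed in range(25):                 # 25 repetitions of 8-fold cross-validation
    cv = StratifiedKFold(n_splits=8, shuffle=True, random_state=seed)
    proba = cross_val_predict(SVC(probability=True), X, y, cv=cv,
                              method="predict_proba")[:, 1]
    fpr, tpr, _ = roc_curve(y, proba)
    tprs.append(np.interp(grid, fpr, tpr))

tprs = np.array(tprs)
median = np.median(tprs, axis=0)
q1, q3 = np.percentile(tprs, [25, 75], axis=0)   # band limits (the IQR at each grid point)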
