Precision and recall for outlier detection - machine-learning

I am trying to calculate precision, recall and F1-score for outlier detection (in my case, attacks in a network) using a one-class SVM, and I am having trouble doing it in a rigorous manner. Let me explain. Since precision is calculated as:
precision = true_positive /(true_positive + false_positive)
if I run my tests on a dataset that I already know contains only a few attacks, then the number of false positives will be very large compared to the number of true positives, so precision will be very low.
However, if I use a dataset that I already know contains many attacks, then without changing my detection algorithm the number of true positives will increase and precision will be higher.
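To make the effect concrete, here is a minimal numeric sketch of what I mean (the 90% detection rate and 5% false-alarm rate are made-up numbers; only the attack prevalence changes between the two calls):

    # A detector with fixed per-class behaviour (90% detection rate,
    # 5% false-alarm rate) evaluated on two datasets that differ only
    # in how many attacks they contain.
    def precision_recall(n_attacks, n_normal, detection_rate=0.90, false_alarm_rate=0.05):
        tp = detection_rate * n_attacks          # attacks correctly flagged
        fp = false_alarm_rate * n_normal         # normal traffic flagged as attack
        fn = (1 - detection_rate) * n_attacks    # attacks missed
        return tp / (tp + fp), tp / (tp + fn)    # (precision, recall)

    print(precision_recall(n_attacks=100, n_normal=100000))    # few attacks  -> low precision
    print(precision_recall(n_attacks=50000, n_normal=100000))  # many attacks -> high precision

Recall stays at 0.90 in both cases, while precision swings from about 0.02 to about 0.90.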
I know that something must be wrong in the way that I calculate precision. What am I missing?
Thanks in advance!

if I run my tests on a dataset that I already know contains only a few attacks, then the number of false positives will be very large compared to the number of true positives, so precision will be very low.
That is (probably) expected behavior, because your data set is skewed. However, you should get a recall value that is acceptable.
However, if I use a dataset that I already know contains many attacks, then without changing my detection algorithm the number of true positives will increase and precision will be higher.
And in this case, I bet recall will be low.
Based on what you describe, there are a few issues and things you can do. I can address more specific issues if you add more information to your question:
Why are you using multiple test sets, all of which are unbalanced? You should use something that is balanced, or even better, use k-fold cross validation with your entire data set. Use it to find the best parameters for your model (a sketch of this appears after this list).
To decide if you have a good enough balance between precision and recall, consider using the F1 score.
Use a confusion matrix to decide if your measures are acceptable.
Plot learning curves to help avoid overfitting.
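A minimal sketch of that evaluation loop, assuming scikit-learn and a labelled, imbalanced dataset (the data below is random placeholder data, and the supervised SVC stands in for whatever detector you end up using):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_predict
    from sklearn.svm import SVC
    from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

    # Placeholder data: 1000 samples, ~5% labelled as attacks.
    X = np.random.RandomState(0).randn(1000, 10)
    y = (np.random.RandomState(1).rand(1000) < 0.05).astype(int)

    model = SVC(kernel="rbf", class_weight="balanced")
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    y_pred = cross_val_predict(model, X, y, cv=cv)   # out-of-fold predictions

    print("precision:", precision_score(y, y_pred, zero_division=0))
    print("recall:   ", recall_score(y, y_pred, zero_division=0))
    print("F1:       ", f1_score(y, y_pred, zero_division=0))
    print("confusion matrix:\n", confusion_matrix(y, y_pred))

Stratified folds keep the attack ratio roughly constant across folds, so the scores are comparable; the confusion matrix then shows where the errors actually fall.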

Related

What's the right way to assess an ML algo on Precision Recall values?

So I know what Precision and Recall stand for.
Precision penalizes false positives, while recall penalizes false negatives. In the end, what should be taken into account is the cost objective of the business. For a hospital, for example, you might want an algorithm with high recall (few false negatives), because the cost of missing a malignant tumor is higher than the cost of doing further investigation on the false alarms.
But what are still considered decent precision/recall metrics? For example, I have a binary classification algorithm with a precision of 0.34 but a recall of 0.98. Even if the business objective favors minimizing false negatives (i.e. high recall), is it acceptable to use an algorithm that achieves high recall but such poor precision?
Note: I had a severe class imbalance problem, with around 99% of observations in class 0 and just under 1% in class 1.
This is highly dependent on the context, but let's assume this classifier detects malignant tumors at very early stages, where they are quite hard to detect.
Now for the purpose of this analysis, let's have two scenarios with two different assumptions.
Scenario 1: The system will be used as a quick filtering phase on a huge number of people, to quickly rule out those who are not suspected of having a tumor
Well, in that case, this model with 0.98 recall will rarely let a person with a tumor slip by undetected, and that is the main purpose of the system: it is merely a quick filtering phase meant to dismiss a considerable portion of the population before the follow-up examinations, which are quite costly and time-consuming.
I would say this system would do pretty well in that scenario.
Scenario 2: The system will be used to diagnose people with tumors, who will then be directly enrolled in costly treatment programs
In this fictional scenario, the system is meant to be very confident and precise about those it classifies as having tumors, since there are no further filtering phases after this one, and the treatments are costly and may cause quite harmful side effects in those who don't actually have cancer.
In that case, this model would perform terribly for the purpose it was meant for in this scenario.
So it is totally dependent on the case. In scenario 1, it is totally fine to go with low precision as long as the recall is very high; of course, higher precision is better, but the real constraint is that recall must not fall below a certain threshold.
In scenario 2, by contrast, a very high precision is expected even if the recall is very low; 0.99 precision with 0.05 recall is totally fine in that scenario.
UPDATE 1
Regarding the class imbalance your dataset suffers from: it could be a direct cause of the poor precision on the under-sampled class. Have you tried using a weighted loss, where the under-sampled class gets a higher weight? That should help balance the classes' effect during training.
There are many techniques that can be used to handle imbalanced datasets; you can read more about them here.
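As a rough illustration of the weighted-loss idea (scikit-learn is an assumption here; the question doesn't say which library is in use):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.utils.class_weight import compute_class_weight

    # Placeholder data with roughly the 99% / 1% imbalance from the note above.
    y_train = np.array([0] * 990 + [1] * 10)
    X_train = np.random.RandomState(0).randn(len(y_train), 5)

    # Inspect the weights that "balanced" would assign: the minority class
    # gets a much larger weight, so its errors cost more during training.
    weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_train)
    print(dict(zip([0, 1], weights)))   # e.g. {0: ~0.5, 1: ~50}

    clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

Most scikit-learn classifiers (including SVC) accept the same class_weight argument.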

When are precision and recall inversely related?

I am reading about precision and recall in machine learning.
Question 1: When are precision and recall inversely related? That is, when does the situation occur where you can improve your precision but at the cost of lower recall, and vice versa? The Wikipedia article states:
Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other. Brain surgery provides an obvious example of the tradeoff.
However, I have seen research experiment results where both precision and recall increase simultaneously (for example, as you use different or more features).
In what scenarios does the inverse relationship hold?
Question 2: I'm familiar with the precision and recall concept in two fields: information retrieval (e.g. "return 100 most relevant pages out of a 1MM page corpus") and binary classification (e.g. "classify each of these 100 patients as having the disease or not"). Are precision and recall inversely related in both or one of these fields?
The inverse relation only holds when there is some parameter in the system that you can vary in order to get more or fewer results. Then there is a straightforward relationship: you lower the threshold to get more results, and among them some are TPs and some are FPs. This doesn't always mean that precision and recall move in strictly opposite directions, though - the real relationship can be mapped using the ROC curve. As for Q2: likewise, in both of these tasks precision and recall are not necessarily inversely related.
So, how do you increase recall or precision without hurting the other? Usually, by improving the algorithm or model. That is, when you only change the parameters of a given model, the inverse relationship will usually hold (keep in mind that it will also usually be non-linear). But if you, for example, add more descriptive features to the model, you can increase both metrics at once.
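A small sketch of that threshold sweep, using scikit-learn and synthetic data (both are assumptions, for illustration only):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve

    rng = np.random.RandomState(0)
    X = rng.randn(500, 4)
    y = (X[:, 0] + 0.5 * rng.randn(500) > 0).astype(int)   # synthetic labels

    # Scores from a fixed model; only the decision threshold varies below.
    scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y, scores)

    for p, r, t in list(zip(precision, recall, thresholds))[::100]:
        print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")

Walking along this curve is the "vary one parameter" case: lower thresholds give higher recall and (usually) lower precision, while changing the model or the features moves the whole curve.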
Regarding the first question, I interpret these concepts in terms of how restrictive your results must be.
If you're more restrictive, I mean, if you're more "demanding on the correctness" of the results, you want it to be more precise. For that, you might be willing to reject some correct results as long as everything you get is correct. Thus, you're raising your precision and lowering your recall. Conversely, if you do not mind getting some incorrect results as long as you get all the correct ones, you're raising your recall and lowering your precision.
Regarding the second question, if I look at it from the point of view of the paragraphs above, I can say that yes, they are inversely related.
To the best of my knowledge, in order to increase both precision and recall, you will need either a better model (one more suitable for your problem) or better data (or both, actually).

Are high values for c or gamma problematic when using an RBF kernel SVM?

I'm using WEKA/LibSVM to train a classifier for a term extraction system. My data is not linearly separable, so I used an RBF kernel instead of a linear one.
I followed the guide from Hsu et al. and iterated over several values for both c and gamma. The parameters which worked best for classifying known terms (test and training material differ of course) are rather high, c=2^10 and gamma=2^3.
So far the high parameters seem to work ok, yet I wonder if they may cause any problems further on, especially regarding overfitting. I plan to do another evaluation by extracting new terms, yet those are costly as I need human judges.
Could anything still be wrong with my parameters, even if both evaluations turn out positive? Do I perhaps need another kernel type?
Thank you very much!
In general you have to perform cross-validation to answer whether the parameters are all right or whether they lead to overfitting.
From the "intuition" perspective - it seems like highly overfitted model. High value of gamma means that your Gaussians are very narrow (condensed around each poinT) which combined with high C value will result in memorizing most of the training set. If you check out the number of support vectors I would not be surprised if it would be the 50% of your whole data. Other possible explanation is that you did not scale your data. Most ML methods, especially SVM, requires data to be properly preprocessed. This means in particular, that you should normalize (standarize) the input data so it is more or less contained in the unit sphere.
RBF seems like a reasonable choice so I would keep using it. A high value of gamma is not necessary a bad thing, it would depends on the scale where your data lives. While a high C value can lead to overfitting, it would also be affected by the scale so in some cases it might be just fine.
If you think that your dataset is a good representation of the whole data, then you could use cross-validation to test your parameters and have some peace of mind.
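For reference, here is a rough sketch of that cross-validated search, with the scaling step included; it uses scikit-learn (which wraps LIBSVM) rather than the WEKA setup from the question, and the data is a random placeholder:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    X = np.random.RandomState(0).randn(300, 20)       # placeholder features
    y = np.random.RandomState(1).randint(0, 2, 300)   # placeholder labels

    # Scale inside the pipeline so each CV fold is standardized on its own
    # training split, then search the usual log2 grid for C and gamma.
    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    param_grid = {
        "svc__C":     [2.0**k for k in range(-5, 11, 2)],
        "svc__gamma": [2.0**k for k in range(-15, 4, 2)],
    }
    search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
    search.fit(X, y)

    print(search.best_params_)
    best_svc = search.best_estimator_.named_steps["svc"]
    print("support vectors:", best_svc.n_support_.sum(), "of", len(X))

If the winning C and gamma are still extreme after scaling, or the support-vector count stays close to the training-set size, that is a reasonably strong hint of overfitting.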

precision or recall speaks loud?

Say I'm evaluating a text classification research project using two approaches, 'A' and 'B'. With approach 'A' I get an x% increase in precision, while with 'B' I get an x% increase in recall. How can I tell whether approach A or B is better?
It depends on your goal. If you need the first couple of returned classes to be correct then you should go for precision, if you want to focus on returning all relevant classes then try to increase recall.
If precision and recall both matter to you then an often used measure is the F1 score which combines precision and recall into a single measure.
I fully agree with what @Sicco wrote.
Also, I would recommend watching this video; it's from the Machine Learning course on Coursera. From the video: in some cases you can trade precision against recall by changing the threshold. If you're not sure what's more important for you, just stick to F1.
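For completeness, this is all the F1 score does with the two numbers; the more general F-beta lets you tilt the balance toward recall or precision (the values below are made up):

    # Made-up example values for a single classifier.
    precision, recall = 0.34, 0.98

    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)

    # F-beta with beta > 1 weights recall more heavily than precision.
    beta = 2
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

    print(f"F1 = {f1:.3f}, F2 = {f_beta:.3f}")   # F1 ~ 0.505, F2 ~ 0.712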

One versus rest classifier

I'm implementing a one-versus-rest classifier to discriminate between neural data corresponding (1) to moving a computer cursor up and (2) to moving it in any of the other seven cardinal directions or no movement. I'm using an SVM classifier with an RBF kernel (created by LIBSVM), and I did a grid search to find the best possible gamma and cost parameters for my classifier. I have tried using training data with 338 elements from each of the two classes (undersampling my large "rest" class) and have used 338 elements from my first class and 7218 from my second one with a weighted SVM.
I have also used feature selection to bring the number of features I'm using down from 130 to 10. I tried using the ten "best" features and the ten "worst" features when training my classifier. I have also used the entire feature set.
Unfortunately, my results are not very good, and moreover, I cannot find an explanation why. I tested with 37759 data points, where 1687 of them came from the "one" (i.e. "up") class and the remaining 36072 came from the "rest" class. In all cases, my classifier is 95% accurate BUT the values that are predicted correctly all fall into the "rest" class (i.e. all my data points are predicted as "rest" and all the values that are incorrectly predicted fall in the "one"/"up" class). When I tried testing with 338 data points from each class (the same ones I used for training), I found that the number of support vectors was 666, which is ten less than the number of data points. In this case, the percent accuracy is only 71%, which is unusual since my training and testing data are the exact same.
Do you have any idea what could be going wrong? If you have any suggestions, please let me know.
Thanks!
The test dataset being the same as the training data implies your training accuracy was 71%. There is nothing wrong with that; the data was possibly just not well separable by the kernel you used.
However, one point of concern is that the high number of support vectors suggests probable overfitting.
Not sure if this amounts to an answer - it would probably be hard to give one without actually seeing the data - but here are some ideas regarding the issue you describe:
In general, an SVM tries to find a hyperplane that best separates your classes. However, since you have opted for one-versus-rest classification, you have no choice but to mix all negative cases together (your 'rest' class). This might make the 'best' separation much less fit to solve your problem. I'm guessing that this might be a major issue here.
To verify if that's the case, I suggest trying to use only one other cardinal direction as the negative set, and see if that improves results. In case it does, you can train 7 classifiers, one for each direction. Another option might be to use the multiclass option of libSVM, or a tool like SVMLight, which is able to classify one against many.
One caveat of most SVM implementations is that they cope poorly with large size differences between the positive and negative sets, even with weighting. From my experience, weighting factors of over 4-5 are problematic in many cases. On the other hand, since the variety on your negative side is large, taking equal sizes might also be less than optimal. Thus, I'd suggest using something like your 338 positive examples and around 1000-1200 random negative examples, with weighting.
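A sketch of that suggestion with scikit-learn's SVC (which wraps LIBSVM, so the class weights map onto LIBSVM's per-class weight options); the data here is a random stand-in for the neural features:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X_pos = rng.randn(338, 10) + 1.0     # stand-in for the "up" class
    X_neg = rng.randn(7218, 10)          # stand-in for the "rest" class

    # Subsample the negatives to ~1000 and apply a modest class weight,
    # keeping the effective imbalance below the 4-5x range mentioned above.
    neg_sample = X_neg[rng.choice(len(X_neg), 1000, replace=False)]
    X = np.vstack([X_pos, neg_sample])
    y = np.array([1] * len(X_pos) + [0] * len(neg_sample))

    clf = SVC(kernel="rbf", class_weight={1: 3.0, 0: 1.0}).fit(X, y)
    print("support vectors per class:", clf.n_support_)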
A little off-topic from your question, but I would also consider other types of classifiers. To start with, I'd suggest looking at kNN.
Hope it helps :)
