How can I combine AUC and averaged 11-point precision/recall? - machine-learning

I am evaluating a recommender and I have ROC curves and precision-recall curves. When I change some parameters, the ROC and PR curves change in slightly different ways. Sometimes the ROC curve looks better than the PR curve, or the other way around, so I want to keep both curves. I can boil the ROC curve down to the AUC, and since I have an 11-point PR curve I can take the mean over the 11 points to get a single number.
Can I combine these measures somehow to one number? And is this something that people do or is that unnecessary?
Is the fact that the ROC looks better than the PR just a subjective thing because I am not good at interpreting the curves, or is it valid that one can look better than the other? (They are not completely different, but the difference is still noticeable, I think.)
EDIT:
Basically I don't want to show tons of plots, I want a table of numbers. Would you combine these numbers in one table? Or make a table for each measure?

What people most commonly do is report the AUC (area under the ROC curve) or the F-measure as a summary metric. But since you are dealing with recommender systems, as far as I know people prefer to see the precision and recall curves (like these), because how precision decays and recall grows as the top-K grows is an important result for these systems.
But if you still want a better answer about precision-recall versus ROC curves, read this paper.
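If you do want a single table, here is a rough sketch (my own, not from the answers above) of reporting both summary numbers side by side per parameter setting: the ROC AUC and the mean of the 11-point interpolated precision. The labels and scores below are toy stand-ins for your own per-setting results.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

def eleven_point_ap(y_true, y_score):
    """Mean of the interpolated precision at recall = 0.0, 0.1, ..., 1.0."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # interpolated precision at recall level r: best precision achievable at recall >= r
    return float(np.mean([precision[recall >= r].max() for r in np.linspace(0, 1, 11)]))

# Toy stand-ins for the relevance labels and scores of two parameter settings.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
runs = {
    "setting A": y_true * rng.normal(1.0, 1.0, 500) + rng.normal(0, 1.0, 500),
    "setting B": y_true * rng.normal(0.5, 1.0, 500) + rng.normal(0, 1.0, 500),
}

print(f"{'setting':<12}{'ROC AUC':>10}{'11-pt AP':>10}")
for name, y_score in runs.items():
    print(f"{name:<12}{roc_auc_score(y_true, y_score):>10.3f}{eleven_point_ap(y_true, y_score):>10.3f}")
```

Whether you put both columns in one table or make a table per measure is mostly a presentation choice; one table with two columns per setting keeps the comparison easy to scan.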

Related

Shape of ROC curve

I did a prediction analysis on a dataset and drew the ROC curve.
The ROC curve looks like the one below.
I'm not very sure about the shape of the curve. Doesn't it need to be a wavy curve? Looking at this curve, can we decide whether there is an issue? I got around 71% accuracy, which is OK for me, but I'm worried about the shape of the curve, which is not wavy. For example, it doesn't look like the one below (taken from the internet).
It looks like you only plotted three points. The idea of a ROC curve is to show how the true positive and false positive rates vary as you tweak the decision threshold, in order to establish the performance at every operating point. Without information about how you plotted this or what parameters you have, it's hard to say anything more.
A typical example would be to tweak the aggressiveness of the classifier: if you have a spam scanner which classifies a message as spam above a particular score, how does changing the score threshold change the TP/FP rates? Effectively the x axis will also reveal the threshold setting (though possibly stretched in a non-linear way), and the curve at every point will show how many of the samples in your clean collection would be FPs at that threshold, and how many in your spam collection would be correctly blocked.
("Stretching" means that the threshold setting might not map linearly onto the FP rate. If nothing happens between thresholds 0.950 and 0.975, you don't plot that interval on the x axis at all. The points on the x axis are the threshold values where the TP/FP rate changes; some could be very close to each other in terms of threshold value, and other adjacent points could correspond to a large jump in the threshold value.)
A good ROC curve has a large area underneath it. An ideal ROC jumps straight to a TP rate of 1.00 and stays there, but then you don't need the plot to help you decide how to deploy your solution anyway. In reality, they come in all kinds of shapes, from hugging the upper left corner (very good) to a straight diagonal (pretty lousy) and even bowing towards the lower right (extremely poor; random verdicts would be better). The interesting points are the "knee" where the TP rate's growth slows down and the FP rate starts growing faster (that's where you should stop relaxing the threshold) and any irregularities, especially any that break monotonicity.
(In your example from the net, there is a spot around a TP rate of 0.6 where relaxing the threshold further only increases FPs. Why is that? Is there a skew in the samples, or a problem in the implementation? Could it be fixed?)
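As a small illustration of that threshold sweep (my own sketch with made-up scores, not from the answer): each threshold value yields one (FPR, TPR) point, and the collection of those points is the ROC curve.

```python
import numpy as np

# Hypothetical spam-filter scores: higher means "more spammy".
scores_clean = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.70])  # ham, should stay below threshold
scores_spam  = np.array([0.30, 0.55, 0.60, 0.80, 0.90, 0.95])  # spam, should exceed threshold

thresholds = np.unique(np.concatenate([scores_clean, scores_spam]))
for t in thresholds:
    fpr = np.mean(scores_clean >= t)   # clean mail wrongly blocked at this threshold
    tpr = np.mean(scores_spam  >= t)   # spam correctly blocked at this threshold
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```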
It looks like you have plotted points using the predicted class of a classifier (.predict function in python's sklearn package) rather than the predicted class probability (.predict_proba function in python's sklearn package). This means there is only one threshold change, when the class switches from 0 to 1, rather than a range of values that would give you the smooth curve.
Replace the predicted class with the predicted probability and this should fix your problem.
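A hedged sketch of that fix (the dataset and model below are toy stand-ins, not taken from the question): pass the positive-class probability rather than the hard 0/1 prediction to roc_curve.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# y_score = clf.predict(X_test)             # hard labels: only one real threshold, a 3-point "curve"
y_score = clf.predict_proba(X_test)[:, 1]   # probability of the positive class: many thresholds

fpr, tpr, thresholds = roc_curve(y_test, y_score)
print(f"{len(thresholds)} thresholds, AUC = {auc(fpr, tpr):.3f}")
```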

Why do we need to normalize the input to zero mean and unit variance before feeding it to the network?

In deep learning, I have seen many papers apply normalization as a pre-processing step: the input is normalized to zero mean and unit variance before being fed to the convolutional network (which has BatchNorm). Why not use the original intensities? What is the benefit of the normalization step? If I use histogram matching among images, should I still use the normalization step? Thanks
Normalization is important because it brings the features onto the same scale, which makes the network behave much better. Suppose there are two features, one measured on a scale of 1 to 10 and the other on a scale of 1 to 10,000. With a squared-error loss, the network will be busy optimizing the weights to reduce the larger errors coming from the second feature.
Therefore it is better to normalize.
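A minimal sketch of that step (the numbers are made up): standardizing each column so that no single feature dominates the squared-error gradient.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[3.0, 4000.0],
              [7.0, 9500.0],
              [1.0,  200.0]])          # feature 1 on a 1-10 scale, feature 2 on a 1-10,000 scale

scaler = StandardScaler()
X_std = scaler.fit_transform(X)        # per column: (x - mean) / std

print(X_std.mean(axis=0))              # ~0 for both columns
print(X_std.std(axis=0))               # ~1 for both columns
```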
The answer to this can be found in Andrew Ng's tutorial: https://youtu.be/UIp2CMI0748?t=133.
TLDR: If you do not normalize input features, some features can have a very different scale and will slow down Gradient Descent.
Long explanation: Let us consider a model that uses two features Feature1 and Feature2 with the following ranges:
Feature1: [10,10000]
Feature2: [0.00001, 0.001]
The contour plot of these will look something like this (scaled for easier visibility):
[Figure: contour plot of Feature1 vs Feature2]
When you perform gradient descent, you compute d(Feature1) and d(Feature2), where "d" denotes the partial derivative of the loss with respect to each feature's weight, in order to move the model weights closer to the minimum of the loss. As the contour plot above suggests, d(Feature1) is going to be significantly larger than d(Feature2) because Feature1's values are so much bigger, so even if you choose a reasonably moderate learning rate you will zig-zag along the Feature1 direction because of its relatively large gradient values, and you may even overshoot and miss the global minimum.
[Figure: gradient descent with a medium learning rate]
To avoid this, you could choose a very small learning rate, but then gradient descent will take a very long time to converge and you may stop training before ever reaching the global minimum.
[Figure: gradient descent with a very small learning rate]
So as you can see from the above examples, not scaling your features leads to inefficient gradient descent, which can keep you from finding the optimal model.
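An illustrative sketch of this effect (the data and coefficients are entirely made up): plain gradient descent on a least-squares problem with one large-range and one tiny-range feature. With a learning rate that works fine after standardization, the unscaled version diverges.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
f1 = rng.uniform(10, 10_000, n)        # Feature1: large range
f2 = rng.uniform(1e-5, 1e-3, n)        # Feature2: tiny range
X = np.column_stack([f1, f2])
y = 0.002 * f1 + 3_000.0 * f2 + rng.normal(0, 0.5, n)

def final_loss(X, y, lr, steps=1_000):
    """Plain batch gradient descent on the mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
        if np.abs(w).max() > 1e100:            # diverged: stop early
            return float("inf")
    return float(np.mean((X @ w - y) ** 2))

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # zero mean, unit variance per feature
y_ctr = y - y.mean()                           # centre the target once the features are centred

print("unscaled, lr=1e-2:", final_loss(X, y, 1e-2))         # diverges (inf)
print("scaled,   lr=1e-2:", final_loss(X_std, y_ctr, 1e-2)) # converges to roughly the noise level
```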

Right order of doing feature selection, PCA and normalization?

I know that feature selection helps me remove features that may have a low contribution. I know that PCA helps combine possibly correlated features into fewer components, reducing the dimensionality. I know that normalization transforms features onto the same scale.
But is there a recommended order to do these three steps? Logically I would think that I should weed out bad features by feature selection first, followed by normalizing them, and finally use PCA to reduce dimensions and make the features as independent from each other as possible.
Is this logic correct?
Bonus question: are there any other preprocessing or transformation steps that should be applied to the features before feeding them into the estimator?
If I were building a classifier of some sort, I would personally use this order:
Normalization
PCA
Feature Selection
Normalization: You would do normalization first to get the data into reasonable bounds. If you have data (x, y) where x ranges from -1000 to +1000 and y ranges from -1 to +1, any distance metric would automatically treat a change in y as less significant than a change in x, and we don't know yet whether that is actually the case. So we want to normalize our data.
PCA: Uses the eigenvalue decomposition of the data's covariance matrix to find an orthogonal basis that describes the variance in the data points. If you have 4 features, PCA can show you that only 2 directions really differentiate the data points, which brings us to the last step.
Feature Selection: Once you have a coordinate space that better describes your data, you can select which features are salient. Typically you'd keep the largest eigenvalues (EVs) and their corresponding eigenvectors from PCA for your representation. Since larger EVs mean there is more variance in that direction of the data, you get more granularity in isolating features. This is a good method for reducing the number of dimensions of your problem.
Of course this could change from problem to problem, but it is a reasonable generic guide.
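A hedged sketch of that order using sklearn (the dataset and classifier here are placeholders, not from the question): standardize, then run PCA, then keep only the components that explain most of the variance.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(
    StandardScaler(),             # 1. normalization: zero mean, unit variance
    PCA(n_components=0.95),       # 2.+3. PCA, keeping the components that explain 95% of the variance
    LogisticRegression(max_iter=1000),
)

print(cross_val_score(pipe, X, y, cv=5).mean())
```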
Generally speaking, normalization is needed before PCA.
The key question is the order of feature selection, and it depends on the feature selection method.
A simple form of feature selection is to look at whether the variance or standard deviation of a feature is small. If these values are relatively small, the feature may not help the classifier. But if you normalize before doing this, the standard deviations and variances become smaller (generally less than 1), which leaves only very small differences in std or variance between the different features. If you use zero-mean, unit-variance normalization, every feature ends up with mean 0 and std 1, so variance-based selection can no longer distinguish them. In that case it would be bad to normalize before feature selection.
Feature selection is flexible, and there are many ways to select features; the order should be chosen according to the situation at hand.
Good answers here. One point needs to be highlighted: PCA is a form of dimensionality reduction. It will find a lower-dimensional linear subspace that approximates the data well. When the axes of this subspace align with the features you started with, it also leads to interpretable feature selection. Otherwise, feature selection after PCA selects features that are linear combinations of the original set, and these are difficult to interpret in terms of the original features.
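A small illustration of that interpretability point (my own example on a public dataset, not from the answer): each principal component is a weighted combination of all the original features, which is why selecting components is harder to interpret than selecting original columns.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_iris()
X_std = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X_std)

# Each component mixes every original feature with some weight.
for i, comp in enumerate(pca.components_):
    weights = ", ".join(f"{w:+.2f}*{name}" for w, name in zip(comp, data.feature_names))
    print(f"PC{i + 1} = {weights}")
```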

What are the variables involved in constructing an ROC curve?

Say I have a classifier and I achieve FAR of 10% and FRR of 15%. What would I need to do with these to construct an ROC curve?
I'm having trouble seeing what they actually represent and the situations in which they are used. I don't seem to have an important variable that shifts the FAR and FRR in one direction or the other. Can I still use ROC?
Short answer is: no, you cannot.
A ROC curve is a parametric curve: you need a scalar value you can shift in order to change your classifier's decisions. It's useful for:
Checking robustness to this parameter
Fine-tuning final probability estimates for a particular application
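A minimal sketch of that point (the scores below are made up): a single operating point such as FAR = 10%, FRR = 15% cannot produce a curve, but with the underlying scores you can sweep the threshold and get one (FAR, FRR) pair per value.

```python
import numpy as np

# Hypothetical decision scores: higher means "accept".
scores_impostor = np.array([0.10, 0.20, 0.30, 0.40, 0.55, 0.60])   # should be rejected
scores_genuine  = np.array([0.35, 0.50, 0.70, 0.80, 0.90, 0.95])   # should be accepted

for t in np.linspace(0.0, 1.0, 11):
    far = np.mean(scores_impostor >= t)   # false accept rate at this threshold
    frr = np.mean(scores_genuine  <  t)   # false reject rate at this threshold
    print(f"threshold={t:.1f}  FAR={far:.2f}  FRR={frr:.2f}")
```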

ROC curve shows strange pattern

I have a dataset to which I add 10-30% artificial data and run an algorithm to classify which data is original and which is artificial. I got the attached ROC curves. I've never seen ROC curves ending like that. Am I doing something wrong? Or is such a pattern possible? If so, what would explain it?
Thanks
You could see a ROC curve similar to what you have shown if your target data have an unbalanced bimodal distribution with a noise/background distribution located between the two modes. Initially (like in your plot), you would have a steep increase in the ROC curve as it covers the main peak of the true positive (TP) distribution. Next, you would have a relatively flat region where you accumulate false positives (FP's) without much increase in TP's. Then, you would hit the second cluster of TP's.
I'm guessing that your artificial data is closer to the centroid of the main cluster of TP's, which is why adding more artificial data tends to deemphasize the smaller TP cluster and make it look more like a typical ROC curve.
As I mentioned in my comment, it would be informative to plot the ROC curve without any artificial data. Also, it could be informative to show a version zoomed in on the tail end of the plot where the TP rate approaches 1 (i.e., to see if it flattens as it approaches 1).
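As an illustrative simulation of that explanation (the distributions are entirely made up): positives drawn from a bimodal distribution, with the negatives sitting between the two modes, produce exactly this kind of ROC curve, with a steep start, a long flat stretch, and a late second rise.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
pos = np.concatenate([rng.normal(5.0, 0.5, 800),    # main cluster of positives
                      rng.normal(1.0, 0.3, 200)])   # small second cluster of positives
neg = rng.normal(3.0, 0.5, 1000)                    # negatives sit between the two modes

y_true = np.concatenate([np.ones_like(pos), np.zeros_like(neg)])
y_score = np.concatenate([pos, neg])

fpr, tpr, _ = roc_curve(y_true, y_score)
# The curve plateaus near TPR ~= 0.8 for a wide range of FPR before the second cluster kicks in.
print(np.interp([0.1, 0.5, 0.9], fpr, tpr).round(2))
```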
