I have a dataset to which I add 10-30% of artificial data and run an algorithm to classify which data is original and which is artificial. I got the attached ROC curves. I've never seen ROC curves ending like that. Am I doing something wrong? Or is such a pattern possible? If so, what would explain it?
Thanks
You could see a ROC curve similar to what you have shown if your target data have an unbalanced bimodal distribution with a noise/background distribution located between the two modes. Initially (like in your plot), you would have a steep increase in the ROC curve as it covers the main peak of the true positive (TP) distribution. Next, you would have a relatively flat region where you accumulate false positives (FP's) without much increase in TP's. Then, you would hit the second cluster of TP's.
I'm guessing that your artificial data is closer to the centroid of the main cluster of TP's, which is why adding more artificial data tends to deemphasize the smaller TP cluster and make it look more like a typical ROC curve.
As I mentioned in my comment, it would be informative to plot the ROC curve without any artificial data. Also, it could be informative to show a version zoomed in on the tail end of the plot where the TP rate approaches 1 (i.e., to see if it flattens as it approaches 1).
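To make the mechanism above concrete, here is a minimal sketch with made-up 1-D scores. The cluster locations, widths, and sizes are assumptions chosen purely for illustration, not taken from the asker's data. It shows the steep rise, the flat stretch where only false positives accumulate, and the late jump when the second TP cluster is reached.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D classifier scores (higher = "more positive").
# Positives: a large high-scoring cluster plus a small low-scoring cluster.
pos = np.concatenate([rng.normal(3.0, 0.3, 900), rng.normal(-3.0, 0.3, 100)])
# Negatives (noise/background) sit between the two positive clusters.
neg = rng.normal(0.0, 0.5, 1000)

def roc_points(pos_scores, neg_scores, thresholds):
    """TPR and FPR when predicting positive for score >= threshold."""
    tpr = np.array([(pos_scores >= t).mean() for t in thresholds])
    fpr = np.array([(neg_scores >= t).mean() for t in thresholds])
    return fpr, tpr

thresholds = np.linspace(5.0, -5.0, 201)  # sweep from strict to lenient
fpr, tpr = roc_points(pos, neg, thresholds)

# Print a few points along the sweep.
for t in (2.0, 1.5, -1.5, -4.5):
    i = int(np.argmin(np.abs(thresholds - t)))
    print(f"threshold={thresholds[i]:+.2f}  FPR={fpr[i]:.3f}  TPR={tpr[i]:.3f}")
```

Between thresholds 1.5 and -1.5 the FPR sweeps almost its whole range while the TPR stays near 0.9; only at the very end does the TPR climb to 1.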
I did a prediction analysis on a dataset and drew the ROC curve.
The ROC curve looks like the one below.
I'm not very sure about the shape of the curve. Doesn't it need to be a wavy curve? Looking at the curve, can we decide whether there is an issue with it? I got around 71% accuracy, which is OK for me. But I'm worried about the shape of the curve, which is not wavy. For example, it doesn't look like the one below (taken from the internet).
It looks like you only plotted three points. The idea of a ROC curve is to show how the TP and FP rates vary as you tweak the decision threshold, so that you can see the performance at every setting. Without information about how you plotted this or what parameters you have, it's hard to say anything more.
A typical example would be to tweak the aggressiveness level -- if you have a spam scanner which will classify as spam at a particular score, how does changing the score threshold change the TP/FP rate? So effectively the X axis will also reveal the threshold setting (though possibly stretched nonlinearly), and the curve at every point will show how many of the samples in your clean collection will be FPs at that threshold, and how many in your spam collection will be correctly blocked.
("Stretching" means that the threshold setting might not map linearly onto the FP rate. If nothing happens between thresholds 0.950 and 0.975, you don't plot that interval on the x axis at all. The points on the x axis are the threshold values where the TP/FP rate changes; some could be very close to each other in terms of threshold value, and other adjacent points could correspond to a large jump in the threshold value.)
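As a rough illustration of the threshold sweep described above, here is a small sketch in plain Python. The scores and labels are invented for the example. It produces one ROC point per distinct score, which is why a threshold interval containing no scores (0.95 to 0.975 here) contributes nothing to the curve.

```python
def roc_points(scores, labels):
    """Return (threshold, FPR, TPR) triples, one per distinct score.

    A sample is predicted positive when its score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, fp / n_neg, tp / n_pos))
    return points

# Hypothetical spam scores; label 1 = spam, 0 = clean mail.
scores = [0.99, 0.975, 0.95, 0.60, 0.58, 0.20]
labels = [1, 1, 0, 1, 0, 0]

for thr, fpr, tpr in roc_points(scores, labels):
    print(f"threshold={thr:<6} FPR={fpr:.2f} TPR={tpr:.2f}")
```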
A good ROC curve has a large area underneath it. An ideal ROC goes from 0 to 1.00 and stays there, but then you don't need the plot to help you decide how to deploy your solution anyway. In reality, they come in all kinds of shapes, from vaguely asymptotic towards the upper left (very good) to straight diagonal (pretty lousy) and even asymptotic towards the lower right (extremely poor; random verdicts would be better). The interesting points are the "knee", where the TP rate's growth slows down and the FP rate starts growing quicker (that's where you should stop increasing the threshold), and any irregularities, especially any which break monotonicity.
(In your example from the net, there is a spot around TP 0.6 where increasing the threshold will only increase FPs. Why is that? Is there a skew in the samples, or a problem in the implementation? Could it be fixed?)
It looks like you have plotted points using the predicted class of a classifier (the .predict function in Python's sklearn package) rather than the predicted class probability (the .predict_proba function in Python's sklearn package). This means there is only one threshold change, when the class switches from 0 to 1, rather than a range of values that would give you the smooth curve.
Replace the predicted class with the predicted probability and this should fix your problem.
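A minimal sketch of the difference, assuming a scikit-learn classifier. The synthetic dataset and the choice of LogisticRegression are stand-ins, since the original post does not say which model or data were used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hard 0/1 labels: only one useful threshold, so the "curve" has three points.
fpr_hard, tpr_hard, _ = roc_curve(y_test, clf.predict(X_test))

# Probability of the positive class: many thresholds, hence a fuller curve.
fpr_soft, tpr_soft, _ = roc_curve(y_test, clf.predict_proba(X_test)[:, 1])

print(len(fpr_hard), len(fpr_soft))
```

Note that predict_proba returns one column per class, so the positive-class column ([:, 1] for a binary problem) is what gets passed to roc_curve.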
In deep learning, I have seen many papers apply normalization as a pre-processing step. It normalizes the input to zero mean and unit variance before feeding it to the convolutional network (which has BatchNorm). Why not use the original intensity? What is the benefit of the normalization step? If I used histogram matching among images, should I still use the normalization step? Thanks
Normalization is important to bring features onto the same scale for the network to behave much better. Let's assume there are two features where one is measured on a scale of 1 to 10 and the second on a scale from 1 to 10,000. In terms of squared error function the network will be busy optimizing the weights according to the larger error on the second feature.
Therefore it is better to normalize.
The answer to this can be found in Andrew Ng's tutorial: https://youtu.be/UIp2CMI0748?t=133.
TLDR: If you do not normalize input features, some features can have a very different scale and will slow down Gradient Descent.
Long explanation: Let us consider a model that uses two features Feature1 and Feature2 with the following ranges:
Feature1: [10,10000]
Feature2: [0.00001, 0.001]
The Contour plot of these will look something like this (scaled for easier visibility).
[Figure: Contour plot of Feature1 and Feature2]
When you perform Gradient Descent, you will calculate d(Feature1) and d(Feature2), where "d" denotes the differential, in order to move the model weights closer to minimizing the loss. As is evident from the contour plot above, d(Feature1) is going to be significantly smaller than d(Feature2), so even if you choose a moderate learning rate, you will be zig-zagging around because of the relatively large values of d(Feature2) and may even miss the global minimum.
[Figure: Medium value of learning rate]
If, in order to avoid this, you choose a very small learning rate, Gradient Descent will take a very long time to converge and you may stop training before reaching the global minimum.
[Figure: Gradient Descent with a very small learning rate]
So, as you can see from the above examples, not scaling your features leads to an inefficient Gradient Descent, which can result in not finding the optimal model.
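The effect can be reproduced in a few lines. The sketch below is an illustration under stated assumptions: it uses the feature ranges from the text, a made-up linear target, plain batch gradient descent on mean squared error, and hand-picked learning rates. It is not the exact setup from the video.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical features using the ranges from the text.
x1 = rng.uniform(10, 10_000, n)
x2 = rng.uniform(1e-5, 1e-3, n)
X_raw = np.column_stack([x1, x2])
# Hypothetical target chosen so both features contribute comparably.
y = 0.001 * x1 + 10_000 * x2

def gradient_descent(X, y, lr, steps=1000):
    """Plain batch gradient descent on mean squared error, with intercept."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = 2.0 / len(y) * Xb.T @ (Xb @ w - y)
        w -= lr * grad
    return np.mean((Xb @ w - y) ** 2)

# Raw features: the step size must be tiny to keep the large-scale feature stable.
mse_raw = gradient_descent(X_raw, y, lr=1e-8)

# Standardized features: zero mean, unit variance, so one step size suits both.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
mse_std = gradient_descent(X_std, y, lr=0.1)

print(f"MSE after 1000 steps, raw features:          {mse_raw:.4g}")
print(f"MSE after 1000 steps, standardized features: {mse_std:.4g}")
```

With raw features, a learning rate small enough to stay stable for Feature1 leaves the weight for Feature2 almost untouched after 1000 steps; after standardization the same number of steps is enough to converge.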
I am currently using sklearn's Logistic Regression function to work on a synthetic 2d problem. The dataset is shown as below:
I'm basically plugging the data into sklearn's model, and this is what I'm getting (the light green; disregard the dark green):
The code for this is only two lines; model = LogisticRegression(); model.fit(tr_data,tr_labels). I've checked the plotting function; that's fine as well. I'm using no regularizer (should that affect it?)
It seems really strange to me that the boundaries behave in this way. Intuitively I feel they should be more diagonal, as the data is (mostly) located top-right and bottom-left, and from testing some things out it seems a few stray datapoints are what's causing the boundaries to behave in this manner.
For example here's another dataset and its boundaries
Would anyone know what might be causing this? From my understanding Logistic Regression shouldn't be this sensitive to outliers.
Your model is overfitting the data. (The decision regions it found do indeed perform better on the training set than the diagonal line you would expect.)
The loss is optimal when all the data is classified correctly with probability 1. The distances to the decision boundary enter in the probability computation. The unregularized algorithm can use large weights to make the decision region very sharp, so in your example it finds an optimal solution, where (some of) the outliers are classified correctly.
With stronger regularization you prevent that, and the distances play a bigger role. Try different values for the inverse regularization strength C, e.g.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.1)  # smaller C means stronger regularization
model.fit(tr_data, tr_labels)
Note: the default value C=1.0 corresponds already to a regularized version of logistic regression.
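To see what C does to the weights, here is a small sketch on invented two-blob data (the asker's dataset is not available, so the blobs and the list of C values are assumptions). It prints the norm of the learned weight vector for a few values of C.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical 2-D data: two blobs, bottom-left (class 0) and top-right (class 1).
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} ||w||={norms[C]:.3f}")
```

Larger C (weaker regularization) allows larger weights, which is what makes the decision region sharper and lets individual points near the boundary pull it around.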
Let us further qualify why logistic regression overfits here: after all, there are just a few outliers, but hundreds of other data points. To see why, it helps to note that logistic loss is kind of a smoothed version of hinge loss (used in SVM).
SVM does not 'care' about samples on the correct side of the margin at all - as long as they do not cross the margin they inflict zero cost. Since logistic regression is a smoothed version of SVM, the far-away samples do inflict a cost but it is negligible compared to the cost inflicted by samples near the decision boundary.
So, unlike e.g. Linear Discriminant Analysis, samples close to the decision boundary have disproportionately more impact on the solution than far-away samples.
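The comparison between the two losses can be checked numerically. This is a small sketch using the standard textbook definitions, with the signed margin m = y * f(x) and labels in {-1, +1}; the margin values are arbitrary.

```python
import math

def hinge_loss(margin):
    """SVM hinge loss for a signed margin m = y * f(x), with y in {-1, +1}."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic loss log(1 + exp(-m)) for the same signed margin."""
    return math.log1p(math.exp(-margin))

for m in (-2.0, 0.0, 1.0, 3.0, 10.0):
    print(f"margin={m:+5.1f}  hinge={hinge_loss(m):.5f}  logistic={logistic_loss(m):.5f}")
```

For samples far on the correct side (large positive margin) the hinge loss is exactly zero and the logistic loss is tiny but nonzero, while samples near or across the boundary dominate both.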
Recently I was told that the labels of regression data should also be normalized for better results, but I am pretty doubtful of that. I have never tried normalizing labels in either regression or classification, which is why I don't know if that statement is true or not. Can you please give me a clear explanation (mathematical or from experience) of this problem?
Thank you so much.
Any help would be appreciated.
When you say "normalize" labels, it is not clear what you mean (i.e. whether you mean this in a statistical sense or something else). Can you please provide an example?
On Making labels uniform in data analysis
If you are trying to neaten labels for use with the text() function, you could try the abbreviate() function to shorten them, or the format() function to align them better.
The pretty() function works well for rounding labels on plot axes. For instance, the base function hist() for drawing histograms calls on Sturges or other algorithms and then uses pretty() to choose nice bin sizes.
The scale() function will standardize values by subtracting their mean and dividing by the standard deviation, which in some circles is referred to as normalization.
On the reasons for scaling in regression (in response to comment by questor). Suppose you regress Y on covariates X1, X2, ... The reasons for scaling covariates Xk depend on the context. It can enable comparison of the coefficients (effect sizes) of each covariate. It can help ensure numerical accuracy (these days not usually an issue unless covariates on hugely different scales and/or data is big). For a readable intro see Psychosomatic medicine editors' guide. For a mathematically intense discussion see Sylvain Sardy's guide.
In particular, in Bayesian regression, rescaling is advisable to ensure convergence of MCMC estimation; e.g. see this discussion.
You mean features not labels.
It is not necessary to normalize your features for regression or classification, even though in some cases it is a trick that can help the model converge faster. You might want to check this post.
In my experience, when using a simple model like a linear regression with only a few variables, keeping the features as they are (without normalization) is preferable since the model is more interpretable.
It may be that what you mean is that you should scale your labels. The reason is that convergence is faster, and you don't get numerical instability.
For example, if your labels are in the range (1000, 1000000) and the weights are initialized close to zero, an MSE loss would be so large that you'd likely get NaN errors.
See https://datascience.stackexchange.com/q/22776/38707 for a similar discussion.
For a regression problem, with algorithms including decision tree, logistic regression, and linear regression, I tested two modes: (1) with label scaling using MinMaxScaler, and (2) without label scaling. The result I got was that the R2 score is the same in both modes, while MSE and MAE scale with the labels.
For the diabetes dataset using linear regression, the results before and after are:
without scaling:
Mean Squared Error: 3424.3166
Mean Absolute Error: 46.1742
R2_score : 0.33
after scaling labels:
Mean Squared Error: 0.0332
Mean Absolute Error: 0.1438
R2_score : 0.33
The link below may also be useful; it says that scaling can help with fast convergence: scale or not scale labels in deep learning?
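The pattern in the numbers above is what you would expect from a min-max rescaling of the labels with ordinary least squares: R2 is unchanged, MAE shrinks by the label range, and MSE by the range squared. (The reported ratios, 46.1742 / 0.1438 and the square root of 3424.3166 / 0.0332, both come out at roughly 321.) Here is a sketch on made-up data rather than the diabetes dataset, evaluated in-sample for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical regression data with labels on a large scale.
X = rng.normal(size=(300, 3))
y = 150 + 40 * X[:, 0] - 25 * X[:, 1] + rng.normal(scale=30, size=300)

def fit_and_score(X, y):
    """Ordinary least squares with intercept; returns (MSE, MAE, R^2)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    mse = np.mean(resid ** 2)
    mae = np.mean(np.abs(resid))
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return mse, mae, r2

label_range = y.max() - y.min()
y_scaled = (y - y.min()) / label_range  # same mapping MinMaxScaler applies by default

mse, mae, r2 = fit_and_score(X, y)
mse_s, mae_s, r2_s = fit_and_score(X, y_scaled)

print(f"R^2:  {r2:.4f} vs {r2_s:.4f}")
print(f"MSE ratio: {mse / mse_s:.2f}, range^2: {label_range ** 2:.2f}")
print(f"MAE ratio: {mae / mae_s:.2f}, range:   {label_range:.2f}")
```

So for models like this, label scaling changes the units of the error metrics but not the quality of the fit; any benefit would come from the optimizer, as discussed in the other answers.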
I am evaluating a recommender and I have ROC curves and Precision-Recall curves. When I change some parameters, the ROC and PR curves change a little bit differently. Sometimes the ROC curve looks better than the PR curve, or the other way around. Therefore I want both curves. I can boil down the ROC curve to AUC, and since I have an 11-point PR curve I can take the mean over the 11 points to get a single number.
Can I combine these measures somehow to one number? And is this something that people do or is that unnecessary?
Is the fact that the ROC looks better than the PR just a subjective thing because I am not good at interpreting the curves, or is it valid that one can be better than the other? (They are not completely different, but it's still noticeable, I think.)
EDIT:
Basically I don't want to show tons of plots, I want a table of numbers. Would you combine these numbers in one table? Or make a table for each measure?
In most systems, people commonly use the AUC (area under the ROC curve) or the F-measure as summary metrics. But since you are dealing with recommender systems: as far as I know, people there like to see the precision and recall curves (like these), because how precision decays and recall grows as the top-K grows are important results for these systems.
But if you still want a better answer about precision-recall versus ROC curves, read this paper.
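For the table of numbers asked about in the question, the two summaries mentioned there can be computed as in the sketch below. The function names and curve points are hypothetical (the points correspond to a toy ranking of positive, negative, positive, negative), and the 11-point figure uses the usual interpolated-precision convention, which is an assumption about how the asker's curve was built.

```python
def roc_auc(fpr, tpr):
    """Area under a ROC curve by the trapezoidal rule (points sorted by FPR)."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area

def eleven_point_average_precision(recall, precision):
    """Mean of interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    Interpolated precision at level r is the maximum precision observed at any
    recall >= r (0 if that recall level is never reached).
    """
    total = 0.0
    for k in range(11):
        level = k / 10.0
        candidates = [p for r, p in zip(recall, precision) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11.0

# Hypothetical curve points, for illustration only.
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]
recall = [0.5, 1.0]
precision = [1.0, 2.0 / 3.0]

auc = roc_auc(fpr, tpr)
ap11 = eleven_point_average_precision(recall, precision)
print(f"AUC={auc:.3f}  11-point AP={ap11:.3f}")
```

Reporting the two numbers side by side in one table keeps them comparable across parameter settings without merging them into a single score.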