Logistic Regression is sensitive to outliers? Using on synthetic 2D dataset - machine-learning

I am currently using sklearn's Logistic Regression function to work on a synthetic 2d problem. The dataset is shown as below:
I'm basically plugging the data into sklearn's model, and this is what I'm getting (the light green; disregard the dark green):
The code for this is only two lines; model = LogisticRegression(); model.fit(tr_data,tr_labels). I've checked the plotting function; that's fine as well. I'm using no regularizer (should that affect it?)
It seems really strange to me that the boundaries behave in this way. Intuitively I feel they should be more diagonal, as the data is (mostly) located top-right and bottom-left, and from testing some things out it seems a few stray datapoints are what's causing the boundaries to behave in this manner.
For example here's another dataset and its boundaries
Would anyone know what might be causing this? From my understanding Logistic Regression shouldn't be this sensitive to outliers.

Your model is overfitting the data. The decision regions it found do indeed perform better on the training set than the diagonal line you would expect.
The loss is optimal when all the data is classified correctly with probability 1. The distances to the decision boundary enter in the probability computation. The unregularized algorithm can use large weights to make the decision region very sharp, so in your example it finds an optimal solution, where (some of) the outliers are classified correctly.
Stronger regularization prevents this, so that the distances to the boundary play a bigger role. Try different values for the inverse regularization strength C (smaller C means stronger regularization), e.g.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.1)  # smaller C = stronger regularization
model.fit(tr_data, tr_labels)
Note: the default value C=1.0 corresponds already to a regularized version of logistic regression.
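A minimal sketch of that experiment, assuming synthetic data rather than the asker's dataset: two Gaussian clusters plus three hand-placed stray points (all coordinates and the list of C values are hypothetical). It fits the model at several values of C and prints the weight norm and the direction of the boundary normal. A very large C is used here as a stand-in for "almost no regularization".

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Class 0 sits bottom-left, class 1 sits top-right.
X0 = rng.normal(loc=[-2.0, -2.0], scale=0.7, size=(200, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=0.7, size=(200, 2))
# A few stray class-1 points far from their own cluster (hypothetical outliers).
X_out = np.array([[-3.0, 3.0], [-2.5, 2.5], [-3.5, 2.0]])

X = np.vstack([X0, X1, X_out])
y = np.concatenate([np.zeros(200), np.ones(200), np.ones(3)])

coef_norms = {}
for C in (100.0, 1.0, 0.01):
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    w = model.coef_[0]
    coef_norms[C] = float(np.linalg.norm(w))
    # Direction of the normal vector of the decision boundary, in degrees.
    angle = np.degrees(np.arctan2(w[1], w[0]))
    print(f"C={C:>6}: |w|={coef_norms[C]:.3f}, boundary normal at {angle:.1f} deg")
```

Comparing the printed norms and angles across C gives a rough feel for how much the stray points pull on the boundary when the weights are allowed to grow.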

Let us further qualify why logistic regression overfits here: after all, there are just a few outliers, but hundreds of other data points. To see why, it helps to note that logistic loss is a kind of smoothed version of hinge loss (used in SVM).
SVM does not 'care' about samples on the correct side of the margin at all - as long as they do not cross the margin they inflict zero cost. Since logistic regression is a smoothed version of SVM, the far-away samples do inflict a cost but it is negligible compared to the cost inflicted by samples near the decision boundary.
So, unlike e.g. Linear Discriminant Analysis, samples close to the decision boundary have disproportionately more impact on the solution than far-away samples.
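To make the comparison concrete, here is a small sketch that evaluates both losses at a few margins (the margin is the label in {-1, +1} times the raw score). The function names and the chosen margin values are just for illustration.

```python
import numpy as np

def hinge_loss(margin):
    # Zero as soon as the sample is on the correct side with margin >= 1.
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    # log(1 + exp(-margin)), computed in a numerically stable way.
    return np.logaddexp(0.0, -margin)

margins = np.array([-2.0, 0.0, 1.0, 3.0, 10.0])
for m, h, l in zip(margins, hinge_loss(margins), logistic_loss(margins)):
    print(f"margin={m:5.1f}  hinge={h:.4f}  logistic={l:.6f}")
```

For large positive margins the hinge loss is exactly zero, while the logistic loss is small but never zero, which is the "smoothed" behaviour described above.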

Related

Decision boundary is not a property of training data in classification

In Andrew Ng's ML videos on Coursera on Classification (in the third video), he said that the "decision boundary is not a property of the training set". What does this statement mean? And does it also imply that the straight line or any curves that we use in linear regression to fit data are not a property of the training set? He claims that those curves (achieved through linear regression) aren't properties of the corresponding training data. I am a bit confused about this and would appreciate it if someone could clear up my doubts. Thanks in advance.
The decision boundary is a property of your classifier. Different classifiers lead to different decision boundaries.
The decision boundary has nothing to do with linear regression, as it only makes sense for classification problems. The decision boundary is the curve (or surface, in more than two dimensions) that splits the elements of the two different classes in your classification problem. In logistic regression on the original features, the decision boundary is a straight line (a hyperplane in higher dimensions), while in nonlinear classification methods, like neural networks, the decision boundary can be a curve.
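As a rough illustration that the boundary belongs to the classifier and not to the data, the sketch below fits two different classifiers to the same training set and measures how often they disagree on a grid of points. The dataset (make_moons), the choice of a decision tree as the second classifier, and the grid limits are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# One training set, two classifiers.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
lr = LogisticRegression().fit(X, y)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Evaluate both classifiers on a regular grid covering the data.
xx, yy = np.meshgrid(np.linspace(-2.0, 3.0, 100), np.linspace(-1.5, 2.0, 100))
grid = np.column_stack([xx.ravel(), yy.ravel()])
disagree = float(np.mean(lr.predict(grid) != tree.predict(grid)))

print(f"training accuracy: LR={lr.score(X, y):.3f}, tree={tree.score(X, y):.3f}")
print(f"fraction of grid points where the two classifiers disagree: {disagree:.3f}")
```

Any nonzero disagreement means the two models drew different boundaries from identical training data.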

Pre-processing data: Normalizing data labels in regression?

Recently I was told that the labels of regression data should also be normalized for better results, but I am pretty doubtful of that. I have never tried normalizing labels in either regression or classification, which is why I don't know if that statement is true or not. Can you please give me a clear explanation (mathematical or from experience) of this problem?
Thank you so much.
Any help would be appreciated.
When you say "normalize" labels, it is not clear what you mean (i.e. whether you mean this in a statistical sense or something else). Can you please provide an example?
On Making labels uniform in data analysis
If you are trying to neaten labels for use with the text() function, you could try the abbreviate() function to shorten them, or the format() function to align them better.
The pretty() function works well for rounding labels on plot axes. For instance, the base function hist() for drawing histograms calls on Sturges or other algorithms and then uses pretty() to choose nice bin sizes.
The scale() function will standardize values by subtracting their mean and dividing by the standard deviation, which in some circles is referred to as normalization.
On the reasons for scaling in regression (in response to comment by questor). Suppose you regress Y on covariates X1, X2, ... The reasons for scaling covariates Xk depend on the context. It can enable comparison of the coefficients (effect sizes) of each covariate. It can help ensure numerical accuracy (these days not usually an issue unless the covariates are on hugely different scales and/or the data is big). For a readable intro see Psychosomatic medicine editors' guide. For a mathematically intense discussion see Sylvain Sardy's guide.
In particular, in Bayesian regression, rescaling is advisable to ensure convergence of MCMC estimation; e.g. see this discussion.
You mean features not labels.
It is not necessary to normalize your features for regression or classification, even though in some cases it is a trick that can help the optimization converge faster. You might want to check this post.
In my experience, when using a simple model like a linear regression with only a few variables, keeping the features as they are (without normalization) is preferable since the model is more interpretable.
It may be that what you mean is that you should scale your labels. The reason is so convergence is faster, and you don't get numeric instability.
For example, if your labels are in the range (1000, 1000000) and the weights are initialized close to zero, an MSE loss would be so large that you'd likely get NaN errors.
See https://datascience.stackexchange.com/q/22776/38707 for a similar discussion.
For a regression problem I tested several algorithms (decision tree, logistic regression and linear regression) in two modes: (1) with label scaling using MinMaxScaler, and (2) without label scaling. The result I got was that the R2 score is the same in both modes, while MSE and MAE change with the scale of the labels.
For the diabetes dataset using linear regression, the results before and after scaling are:
without scaling:
Mean Squared Error: 3424.3166
Mean Absolute Error: 46.1742
R2_score : 0.33
after scaling labels:
Mean Squared Error: 0.0332
Mean Absolute Error: 0.1438
R2_score : 0.33
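A sketch of how such a comparison could be reproduced, assuming scikit-learn's bundled diabetes dataset and an 80/20 split with a fixed seed (the split and seed are assumptions, so the exact numbers will not necessarily match those quoted above). It checks that R2 is unchanged, while MSE shrinks by the square of the label range and MAE by the label range itself.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Mode 1: raw labels.
pred_raw = LinearRegression().fit(X_tr, y_tr).predict(X_te)

# Mode 2: min-max scaled labels (scaler fitted on the training labels only).
scaler = MinMaxScaler().fit(y_tr.reshape(-1, 1))
y_tr_s = scaler.transform(y_tr.reshape(-1, 1)).ravel()
y_te_s = scaler.transform(y_te.reshape(-1, 1)).ravel()
pred_s = LinearRegression().fit(X_tr, y_tr_s).predict(X_te)

r2_raw, r2_scaled = r2_score(y_te, pred_raw), r2_score(y_te_s, pred_s)
mse_raw, mse_scaled = mean_squared_error(y_te, pred_raw), mean_squared_error(y_te_s, pred_s)
mae_raw, mae_scaled = mean_absolute_error(y_te, pred_raw), mean_absolute_error(y_te_s, pred_s)
span = scaler.data_max_[0] - scaler.data_min_[0]  # label range seen in training

print(f"R2 : raw={r2_raw:.4f}  scaled={r2_scaled:.4f}")
print(f"MSE: raw={mse_raw:.4f}  scaled={mse_scaled:.6f}  ratio={mse_raw / mse_scaled:.1f}")
print(f"MAE: raw={mae_raw:.4f}  scaled={mae_scaled:.6f}  ratio={mae_raw / mae_scaled:.1f}")
print(f"label range={span:.1f}, range squared={span ** 2:.1f}")
```

For ordinary least squares with an intercept, an affine rescaling of the labels rescales the predictions in the same way, which is why R2 does not move and only the scale-dependent metrics change.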
The link below may also be useful; it says that scaling can help with fast convergence: scale or not scale labels in deep learning?

How many learning curves should I plot for a multi-class logistic regression classifier?

If we have K classes, do I have to plot K learning curves?
Because it seems impossible to me to calculate the train/validation error against all K theta vectors at once.
To clarify, the learning curve is a plot of the training & cross validation/test set error/cost vs training set size. This plot should allow you to see if increasing the training set size improves performance. More generally, the learning curve allows you to identify whether your algorithm suffers from a bias (under fitting) or variance (over fitting) problem.
It depends. Learning curves do not concern themselves with the number of classes. Like you said, it is a plot of training set and test set error, where that error is a numerical value. This is all learning curves are.
That error can be anything you want: accuracy, precision, recall, F1 score etc. (even MAE, MSE and others for regression).
However, the error you choose to use is the one that does or does not apply to your specific problem, which in turn indirectly affects how you should use learning curves.
Accuracy is well defined for any number of classes, so if you use this, a single plot should suffice.
Precision and recall, however, are defined only for binary problems. You can somewhat generalize them (see here for example) by considering the binary problem with classes x and not x for each class x. In that case, you will probably want to plot learning curves for each class. This will also help you identify problems relating to certain classes better.
If you want to read more about performance metrics, I like this paper a lot.
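For the single-plot case, a minimal sketch using scikit-learn's learning_curve helper with accuracy as the score. The iris dataset (three classes), the training-size grid and the 5-fold split are assumptions chosen only to keep the example small. The error plotted would be 1 - accuracy.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)  # 3 classes

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.3, 1.0, 5),
    cv=5,
    scoring="accuracy",
    shuffle=True,
    random_state=0,
)

# One error value per training-set size, averaged over the CV folds.
train_err = 1.0 - train_scores.mean(axis=1)
val_err = 1.0 - val_scores.mean(axis=1)

for n, tr, va in zip(sizes, train_err, val_err):
    print(f"n_train={n:3d}  train error={tr:.3f}  validation error={va:.3f}")
```

Plotting train_err and val_err against sizes gives one learning curve for the whole multi-class model. Per-class curves would need a per-class metric instead of accuracy.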

One-class Support Vector Machine Sensitivity Drops when the number of training sample increase

I am using One-Class SVM for outlier detections. It appears that as the number of training samples increases, the sensitivity TP/(TP+FN) of One-Class SVM detection result drops, and classification rate and specificity both increase.
What's the best way of explaining this relationship in terms of hyperplane and support vectors?
Thanks
In your case, the more training examples you have, the less your classifier is able to detect true positives correctly.
This means that the new data does not fit well with the model you are training.
Here is a simple example.
Below you have two classes, and we can easily separate them using a linear kernel.
The sensitivity of the blue class is 1.
As I add more yellow training data near the decision boundary, the generated hyperplane can't fit the data as well as before.
As a consequence, we now see that there are two misclassified blue data points.
The sensitivity of the blue class is now 0.92.
As the amount of training data increases, the support vectors generate a somewhat less optimal hyperplane. Perhaps the extra data turns a linearly separable data set into one that is not linearly separable. In such a case, trying a different kernel, such as the RBF kernel, can help.
EDIT: Adding more information about the RBF kernel:
In this video you can see what happens with an RBF kernel.
The same logic applies: if the training data is not easily separable in n dimensions, you will get worse results.
You should try to select a better C using cross-validation.
In this paper, Figure 3 illustrates that the results can be worse if C is not properly selected:
"More training data could hurt if we did not pick a proper C. We need to cross-validate on the correct C to produce good results."
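A hedged sketch of what cross-validating C could look like for a two-class SVM with an RBF kernel, as in the toy example of this answer. Note the assumptions: the data comes from make_moons, the candidate grid of C values is arbitrary, and this uses SVC, not OneClassSVM (scikit-learn's OneClassSVM is controlled by nu and has no C parameter).

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy two-class data that is not linearly separable.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# Candidate values for C (an arbitrary illustrative grid).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_C = search.best_params_["C"]
print(f"best C: {best_C}, mean CV accuracy: {search.best_score_:.3f}")
for C, score in zip(param_grid["C"], search.cv_results_["mean_test_score"]):
    print(f"C={C:>6}: mean CV accuracy={score:.3f}")
```

Re-running the search whenever the training set grows is one way to avoid keeping a C that was tuned for a smaller sample.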

Logistic Regression and Naive Bayes for this dataset

Can both Naive Bayes and Logistic Regression classify both of these datasets perfectly? My understanding is that Naive Bayes can, and that Logistic Regression with complex terms can classify these datasets. Please correct me if I am wrong.
Image of datasets is here:
Let's run both algorithms on two datasets similar to the ones you posted and see what happens...
EDIT The previous answer I posted was incorrect. I forgot to account for the variance in Gaussian Naive Bayes. (The previous solution was for naive bayes using Gaussians with fixed, identity covariance, which gives a linear decision boundary).
It turns out that LR fails at the circular dataset while NB could succeed.
Both methods succeed at the rectangular dataset.
The LR decision boundary is linear while the NB boundary is quadratic (the boundary between two axis-aligned Gaussians with different covariances).
Applying NB to the circular dataset gives two means in roughly the same position, but with different variances, leading to a roughly circular decision boundary: as the radius increases, the probability of the higher-variance Gaussian increases compared to that of the lower-variance Gaussian. In this case, many of the inner points on the inner circle are incorrectly classified.
The two plots below show a Gaussian NB solution with fixed variance.
In the plots below, the contours represent probability contours of the NB solution.
This Gaussian NB solution also learns the variances of the individual features, leading to an axis-aligned covariance in the solution.
Naive Bayes/Logistic Regression can get the second (right) of these two pictures, in principle, because there's a linear decision boundary that perfectly separates.
If you used a continuous version of Naive Bayes with class-conditional Normal distributions on the features, you could separate because the variance of the red class is greater than that of the blue, so your decision boundary would be circular. You'd end up with distributions for the two classes which had the same mean (the centre point of the two rings) but where the variance of the features conditioned on the red class would be greater than that of the features conditioned on the blue class, leading to a circular decision boundary somewhere in the margin. This is a non-linear classifier, though.
You could get the same effect with histogram binning of the feature spaces, so long as the histograms' widths were narrow enough. In this case both logistic regression and Naive Bayes will work, based on histogram-like feature vectors.
How would you use Naive Bayes on these data sets?
In its usual form, Naive Bayes needs binary / categorical data.

Resources