Logistic Regression and Naive Bayes for this dataset - machine-learning

Can both Naive Bayes and logistic regression classify both of these datasets perfectly? My understanding is that Naive Bayes can, and that logistic regression with complex terms can classify these datasets. Please correct me if I am wrong.
Image of datasets is here:

Let's run both algorithms on two datasets similar to the ones you posted and see what happens...
EDIT The previous answer I posted was incorrect. I forgot to account for the variance in Gaussian Naive Bayes. (The previous solution was for naive bayes using Gaussians with fixed, identity covariance, which gives a linear decision boundary).
It turns out that LR fails at the circular dataset while NB could succeed.
Both methods succeed at the rectangular dataset.
The LR decision boundary is linear while the NB boundary is quadratic (the boundary between two axis-aligned Gaussians with different covariances).
Applying NB to the circular dataset gives two means in roughly the same position, but with different variances, leading to a roughly circular decision boundary: as the radius increases, the probability of the higher-variance Gaussian increases compared to that of the lower-variance Gaussian. In this case, many of the inner points on the inner circle are incorrectly classified.
The two plots below show a Gaussian NB solution with fixed variance.
In the plots below, the contours represent probability contours of the NB solution.
This Gaussian NB solution also learns the variances of the individual features, leading to an axis-aligned covariance in the solution.
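The effect of learning per-class variances can be reproduced with a small hand-rolled Gaussian NB. This is only a sketch under assumptions of mine: the ring radii, the noise level and the helper names (make_rings, fit_gaussian_nb, predict) are made up for illustration and are not the code behind the plots in this answer.

```python
import math
import random

def make_rings(n_per_class, r_inner=0.3, r_outer=1.0, noise=0.03, seed=0):
    """Sample two concentric noisy rings; label 0 = inner, 1 = outer."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, radius in ((0, r_inner), (1, r_outer)):
        for _ in range(n_per_class):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            r = radius + rng.gauss(0.0, noise)
            points.append((r * math.cos(angle), r * math.sin(angle)))
            labels.append(label)
    return points, labels

def fit_gaussian_nb(points, labels):
    """Per class: prior, per-feature mean and per-feature variance."""
    model = {}
    for c in sorted(set(labels)):
        rows = [p for p, y in zip(points, labels) if y == c]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / len(rows)
                     for col, m in zip(zip(*rows), means)]
        model[c] = (len(rows) / len(points), means, variances)
    return model

def predict(model, point):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_post(c):
        prior, means, variances = model[c]
        return math.log(prior) + sum(
            -0.5 * math.log(2.0 * math.pi * var) - (x - m) ** 2 / (2.0 * var)
            for x, m, var in zip(point, means, variances))
    return max(model, key=log_post)

points, labels = make_rings(200)
model = fit_gaussian_nb(points, labels)
accuracy = sum(predict(model, p) == y for p, y in zip(points, labels)) / len(points)
print("training accuracy:", accuracy)
print("per-class variances:", {c: model[c][2] for c in model})
```

Both class means end up close to the origin, so the only thing separating the classes is the variance term. That is what bends the boundary into a closed curve around the inner ring.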

Naive Bayes and logistic regression can both, in principle, classify the second (right) of these two pictures, because there is a linear decision boundary that separates the classes perfectly.
If you used a continuous version of Naive Bayes with class-conditional Normal distributions on the features, you could separate because the variance of the red class is greater than that of the blue, so your decision boundary would be circular. You'd end up with distributions for the two classes which had the same mean (the centre point of the two rings) but where the variance of the features conditioned on the red class would be greater than that of the features conditioned on the blue class, leading to a circular decision boundary somewhere in the margin. This is a non-linear classifier, though.
You could get the same effect with histogram binning of the feature spaces, so long as the histograms' widths were narrow enough. In this case both logistic regression and Naive Bayes will work, based on histogram-like feature vectors.
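As a rough sketch of the histogram idea for Naive Bayes only, each coordinate can be binned and treated as a categorical feature. The bin count, the value range, the add-one smoothing and the ring parameters are all assumptions chosen for illustration.

```python
import math
import random

N_BINS = 10            # assumed number of equal-width bins per feature
LOW, HIGH = -1.2, 1.2  # assumed range that covers both rings

def make_rings(n_per_class, r_inner=0.3, r_outer=1.0, noise=0.03, seed=0):
    """Sample two concentric noisy rings; label 0 = inner, 1 = outer."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, radius in ((0, r_inner), (1, r_outer)):
        for _ in range(n_per_class):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            r = radius + rng.gauss(0.0, noise)
            points.append((r * math.cos(angle), r * math.sin(angle)))
            labels.append(label)
    return points, labels

def bin_index(value):
    """Map a coordinate to a bin index, clamping values outside the range."""
    idx = int((value - LOW) / (HIGH - LOW) * N_BINS)
    return min(max(idx, 0), N_BINS - 1)

def fit_histogram_nb(points, labels):
    """Per class and feature: smoothed probability of each bin."""
    model = {}
    for c in sorted(set(labels)):
        rows = [p for p, y in zip(points, labels) if y == c]
        tables = []
        for col in zip(*rows):
            counts = [1.0] * N_BINS  # add-one smoothing avoids log(0)
            for v in col:
                counts[bin_index(v)] += 1.0
            total = sum(counts)
            tables.append([k / total for k in counts])
        model[c] = (len(rows) / len(points), tables)
    return model

def predict(model, point):
    """Return the class with the highest log posterior."""
    def log_post(c):
        prior, tables = model[c]
        return math.log(prior) + sum(
            math.log(table[bin_index(x)]) for x, table in zip(point, tables))
    return max(model, key=log_post)

points, labels = make_rings(200)
model = fit_histogram_nb(points, labels)
accuracy = sum(predict(model, p) == y for p, y in zip(points, labels)) / len(points)
print("training accuracy:", accuracy)
```

This works here because every point on the outer ring has at least one coordinate that falls in a bin the inner ring never reaches, so the inner class gets only the smoothing probability for that feature.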

How would you use Naive Bayes on these data sets?
In the usual form, Naive Bayes needs binary / categorical data.

Related

Does it make sense to scale features by only one label before using logistic regression?

I have a simple binary classification problem. My current classifier is logistic regression, and I'm using RobustScaler from sklearn to scale my features before fitting the model.
Assume my features look like 2 Gaussians:
The orange histogram is for the positive label and the blue histogram is for the negative.
My question is, does it make sense to pass only the negative-label features into the scaler?
My intuition is that in our case the blue ones are the "normal" cases and the orange ones are "abnormal". So shouldn't it be better to scale by the "normals" and push the "abnormals" further away from the mean (which is 0 after scaling)?
Consider how you would use your model for inference. On new data, you will not know the class, so you can only apply the scaler to all of the cases. That will reduce the model's performance.

Which classifier can classify according to non axis parallel decision boundaries?

I have some 2d data which looks like it could be well classified by an area described by two intersecting straight lines. These lines won't necessarily be at right angles to each other. Here is a simple example where the two lines would be more or less at right angles:
Is there a suitable classifier for this? Logistic regression will give me one straight line but I am not sure what will give me two as a decision boundary. A decision tree will give me two that are axis parallel, which isn't really what I want.
You can give a Support Vector Machine (SVM) a try. There are multiple kernels that can be used with an SVM, such as:
Linear
Polynomial
RBF (Radial Basis Function)
Sigmoid
You can even specify custom kernels as mentioned in this list.
Here is an image of decision boundaries over Iris dataset, taken from this example
References
Difference between various SVM kernels
Selecting Kernels for SVM
Custom kernel SVM

Decision boundary is not a property of training data in classification

In Andrew Ng's ML videos on Coursera (the third video on classification), he says that the "decision boundary is not a property of the training set". What does this statement mean? Does it also imply that the straight line or any curve that we use in linear regression to fit data is not a property of the training set? He claims that those curves (obtained through linear regression) aren't properties of the corresponding training data. I am a bit confused about this and would appreciate a clarification. Thanks in advance.
The decision boundary is a property of your classifier. Different classifiers lead to different decision boundaries.
The decision boundary has nothing to do with linear regression, as it only makes sense for classification problems. The decision boundary is the curve (or surface, in more than two dimensions) that splits the elements of the two different classes in your classification problem. In logistic regression on the raw features, the decision boundary is a straight line (a hyperplane in more than two dimensions), while in nonlinear classification methods, like neural networks, the decision boundary is a curve.
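One way to see that the boundary belongs to the classifier rather than to the data is to train two different classifiers on the same training set and find a point where they disagree. The toy data and the two classifiers below (nearest centroid and 1-nearest-neighbor) are my own choice for illustration and are not taken from the course.

```python
import math

# Toy training set, assumed for illustration: (point, label) pairs.
train = [((0.0, 0.0), 0), ((0.0, 2.0), 0), ((2.0, 0.0), 0), ((2.0, 2.0), 0),
         ((2.6, 2.6), 1), ((5.0, 5.0), 1), ((5.0, 7.0), 1), ((7.0, 5.0), 1),
         ((7.0, 7.0), 1)]

def nearest_centroid(train, query):
    """Predict the label whose class mean is closest to the query."""
    groups = {}
    for point, label in train:
        groups.setdefault(label, []).append(point)
    centroids = {label: tuple(sum(col) / len(pts) for col in zip(*pts))
                 for label, pts in groups.items()}
    return min(centroids, key=lambda label: math.dist(centroids[label], query))

def nearest_neighbor(train, query):
    """Predict the label of the single closest training point."""
    _, label = min(train, key=lambda item: math.dist(item[0], query))
    return label

query = (3.0, 3.0)
print("nearest centroid predicts:", nearest_centroid(train, query))
print("1-nearest-neighbor predicts:", nearest_neighbor(train, query))
```

The training set is identical in both calls, yet the two classifiers put the query point on different sides of their boundaries, so the two boundaries cannot be the same curve.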

Logistic Regression is sensitive to outliers? Using on synthetic 2D dataset

I am currently using sklearn's Logistic Regression function to work on a synthetic 2d problem. The dataset is shown as below:
I'm basically plugging the data into sklearn's model, and this is what I'm getting (the light green; disregard the dark green):
The code for this is only two lines; model = LogisticRegression(); model.fit(tr_data,tr_labels). I've checked the plotting function; that's fine as well. I'm using no regularizer (should that affect it?)
It seems really strange to me that the boundaries behave in this way. Intuitively I feel they should be more diagonal, as the data is (mostly) located top-right and bottom-left, and from testing some things out it seems a few stray datapoints are what's causing the boundaries to behave in this manner.
For example here's another dataset and its boundaries
Would anyone know what might be causing this? From my understanding Logistic Regression shouldn't be this sensitive to outliers.
Your model is overfitting the data (the decision regions it found do indeed perform better on the training set than the diagonal line you would expect).
The loss is optimal when all the data is classified correctly with probability 1. The distances to the decision boundary enter in the probability computation. The unregularized algorithm can use large weights to make the decision region very sharp, so in your example it finds an optimal solution, where (some of) the outliers are classified correctly.
Stronger regularization prevents that, so the distances play a bigger role. Try different values for the inverse regularization strength C, e.g.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.1)  # smaller C means stronger regularization
model.fit(tr_data, tr_labels)
Note: the default value C=1.0 corresponds already to a regularized version of logistic regression.
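To make the role of C concrete without depending on any library, here is a minimal gradient-descent sketch of L2-regularized logistic regression. The data, the step size and the function name fit_logreg are assumptions for illustration; this is not sklearn's solver, it only uses a penalty of the same form, with C as the inverse regularization strength.

```python
import math

def fit_logreg(points, labels, C, steps=2000, lr=0.1):
    """Gradient descent on (sum of log-losses + ||w||^2 / (2 C)) / n."""
    n, d = len(points), len(points[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w = [wi / (C * n) for wi in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(points, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(y = 1 | x)
            for j in range(d):
                grad_w[j] += (p - y) * x[j] / n
            grad_b += (p - y) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))

# Small linearly separable toy set (assumed for illustration).
points = [(-2.0, -1.0), (-1.0, -2.0), (-1.0, -1.0), (-0.5, -0.3),
          (2.0, 1.0), (1.0, 2.0), (1.0, 1.0), (0.3, 0.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

w_strong, _ = fit_logreg(points, labels, C=0.1)    # strong regularization
w_weak, _ = fit_logreg(points, labels, C=100.0)    # weak regularization
print("weight norm with C=0.1:  ", norm(w_strong))
print("weight norm with C=100.0:", norm(w_weak))
```

With weak regularization the weights keep growing on separable data, which is what makes the predicted probabilities switch sharply at the boundary. A small C keeps the weights small, so the transition is softer.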
Let us further qualify why logistic regression overfits here: after all, there are just a few outliers, but hundreds of other data points. To see why, it helps to note that logistic loss is a kind of smoothed version of hinge loss (used in SVM).
SVM does not 'care' about samples on the correct side of the margin at all: as long as they do not cross the margin they inflict zero cost. Since logistic loss is a smoothed version of hinge loss, the far-away samples do inflict a cost, but it is negligible compared to the cost inflicted by samples near the decision boundary.
So, unlike e.g. Linear Discriminant Analysis, samples close to the decision boundary have disproportionately more impact on the solution than far-away samples.
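The comparison of the two losses can be checked numerically. In this sketch the margin is y * f(x) with y in {-1, +1}, and the function names are mine.

```python
import math

def logistic_loss(margin):
    """log(1 + exp(-margin)), where margin = y * f(x) and y is -1 or +1."""
    return math.log1p(math.exp(-margin))

def hinge_loss(margin):
    """max(0, 1 - margin): exactly zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)

for margin in (-2.0, 0.0, 1.0, 3.0, 6.0):
    print(f"margin={margin:+.1f}  "
          f"logistic={logistic_loss(margin):.4f}  "
          f"hinge={hinge_loss(margin):.4f}")
```

For large positive margins the hinge loss is exactly zero while the logistic loss is small but still positive, so far-away points contribute very little compared to points near or across the boundary.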

Formulae of Cosine Distance?

1) I am using the following for measuring the cosine distance between two vectors (let's say A and B).
Let's assume we have two vectors, e.g. vector A and vector B:
cosine distance between A & B = (dot(A, B) / (Magnitude (A) * Magnitude (B)))
Is this formula right? If not, kindly suggest the right formula.
2) Is K-NN always better in accuracy than Rocchio, or are there situations where Rocchio performs better than K-NN? K-NN looks like an enhancement of Rocchio, and theory suggests that K-NN will perform much better than Rocchio, but in my practical implementation I am finding the opposite: Rocchio is performing much better than K-NN.
(1) The cosine measure is one of several similarity measures; others include the Euclidean distance and the weighted Euclidean distance. Your formula is correct for the cosine similarity. The cosine distance is usually defined as 1 minus that value.
(2) The main difference between KNN and Rocchio is that there is no training in the former, whereas prototype vectors are generated during the training process in the latter. At test time, KNN uses all the training instances, but Rocchio uses only the prototype vectors (usually one vector per class). So Rocchio is more efficient in both training and test. However, there is little theoretical support for Rocchio's stability and robustness, and it has been shown that Rocchio does not work well if the categories are not linearly separable.
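For point (1), a minimal sketch of the formula from the question, dot(A, B) / (|A| * |B|), in plain Python. The function names are mine, and taking the distance as 1 minus the similarity is the common convention assumed here.

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|); undefined if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Common convention: 1 - cosine similarity, ranging from 0 to 2."""
    return 1.0 - cosine_similarity(a, b)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))       # orthogonal vectors
print(cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
```

Orthogonal vectors have similarity 0, and vectors pointing in the same direction have distance close to 0 regardless of their lengths.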
