Linear Discriminant Analysis Algorithm - image-processing

I have started to learn image processing and I am trying to learn the LDA (Linear Discriminant Analysis) algorithm. I want to ask a question to understand the philosophy of LDA. If it is useful to use LDA on the distribution in the first example, is it also useful to use LDA on the distribution in the second example? I mean, if I rotate the image by 90 degrees, is LDA still useful?
[Image: first distribution]
[Image: second distribution]

LDA is a generative algorithm that classifies observations into two or more classes separated by linear boundaries. So if you rotate the image, LDA does not lose its value, as long as you fit a separate LDA model to each distribution.
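As a quick check, here is a minimal sketch (assuming scikit-learn and made-up Gaussian data, since the original images aren't available) showing that rotating the data by 90 degrees doesn't change how well a freshly fitted LDA model separates the classes:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Two made-up Gaussian classes that are linearly separable.
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([4, 4], 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Rotate the whole dataset by 90 degrees.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_rot = X @ R.T

# Fit a separate LDA model to each version of the data;
# the rotation doesn't change how well the classes separate.
for data in (X, X_rot):
    print(LinearDiscriminantAnalysis().fit(data, y).score(data, y))
```

Rotation just moves the linear boundary along with the data, so a model fitted to the rotated distribution performs the same as one fitted to the original.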

Related

Difference between linear regression and gradient descent

I'm working on a school assignment and I chose machine learning as my topic. I'm still in high school, so I don't know much about calculus.
My end goal is to use a machine learning algorithm to predict stock values, but I want to understand what I'm doing rather than just copying and analyzing existing code that performs the function I need.
This isn't strictly programming-related; it mostly concerns the theory. I read through articles on linear regression and watched Stanford's lecture on YouTube, but I don't get it. These are my main confusions:
Are linear regression and gradient descent different algorithms or a set of algorithms used together to predict or classify stuff?
Are y = mx + c and f(x) = θ₀ + θ₁x the same? What can I calculate with this?
This equation is shown in the linear regression part, so what exactly does it do?
I will try to answer all three questions you asked.
First, let me classify ML into some categories.
Regression - predicting a continuous-valued output (example: stock prediction)
Classification - predicting a discrete-valued output (example: spam classification)
Regression can be further classified as linear regression or polynomial regression.
Linear Regression is the simplest one. This is how it works.
Suppose I have this data.
These are house prices plotted against the size of the house. Now I want a straight line that best fits this data. Maybe I will try this line.
And I will try more and more lines to see which one actually fits the data best. To obtain different lines, I vary the parameters a and b in y = a + bx. This answers your second question: this equation represents a straight line that you are trying to fit to the data.
But how will I decide whether one line is a better fit than another? I calculate a value that represents the error my line makes in predicting the y values of all the x values in my data. This is called a cost function. I can choose a cost function like this (ignore it if it doesn't make sense):
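The original formula image is missing; the squared-error cost from Ng's course, which is almost certainly what was shown here, reads, for the line y = a + bx over m data points:

```latex
J(a, b) = \frac{1}{2m} \sum_{i=1}^{m} \bigl( (a + b x_i) - y_i \bigr)^2
```

Here m is the number of data points and (x_i, y_i) are the observed sizes and prices.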
Basically, I want my cost function (the value representing the error) to be minimal, and Gradient Descent is one algorithm that can minimize it. Gradient Descent can minimize any differentiable function (it finds a local minimum), so it is not exclusive to linear regression, but it is popular for linear regression. This answers your first question.
The next step is to understand how Gradient Descent works. This is the algorithm:
This is what you asked in your third question. It is the line that actually adjusts your fitted line (called the hypothesis) while minimizing the cost function.
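The image of the update rule is missing; the standard rule from the course updates every parameter simultaneously as θ_j := θ_j − α · ∂J/∂θ_j, where α is the learning rate. Here is a minimal sketch of that loop for y = a + bx, with made-up house-price numbers:

```python
import numpy as np

# Made-up house data: size (sq. ft.) vs. price (in $1000s).
x = np.array([1000., 1500., 2000., 2500., 3000.])
y = np.array([200., 280., 370., 445., 540.])

# Scale the feature so a simple fixed learning rate converges.
x = (x - x.mean()) / x.std()

a, b = 0.0, 0.0   # parameters of the hypothesis y = a + b*x
alpha = 0.1       # learning rate
for _ in range(1000):
    err = a + b * x - y          # prediction error on every point
    # Simultaneous update: theta_j := theta_j - alpha * dJ/dtheta_j
    a, b = a - alpha * err.mean(), b - alpha * (err * x).mean()

print(f"fitted line: y = {a:.1f} + {b:.1f}x  (on the scaled feature)")
```

Each pass nudges a and b a little further downhill on the cost J until the line settles on the best fit.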

Which classifier can classify according to non axis parallel decision boundaries?

I have some 2D data which looks like it could be well classified by a region described by two intersecting straight lines. These lines won't necessarily be at right angles to each other. Here is a simple example where the two lines would be more or less at right angles:
Is there a suitable classifier for this? Logistic regression will give me one straight line, but I am not sure what will give me two as a decision boundary. A decision tree will give me two that are axis-parallel, which isn't really what I want.
You can give a Support Vector Machine (SVM) a try. There are multiple kernels that can be used with an SVM, such as:
Linear
Polynomial
RBF (Radial Basis Function)
Sigmoid
You can even specify custom kernels as mentioned in this list.
Here is an image of decision boundaries over the Iris dataset, taken from this example.
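Here is a minimal sketch of the idea on data like yours (the two generating lines and all parameters are made up for illustration): a degree-2 polynomial kernel can represent a product of two linear functions, which is exactly the kind of boundary you describe.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Made-up 2-D data whose class is decided by which side of TWO
# intersecting, non-axis-parallel lines a point falls on: the sign
# of the product of two linear functions.
X = rng.uniform(-1, 1, size=(400, 2))
f1 = X[:, 0] + 0.5 * X[:, 1]            # first line
f2 = 0.3 * X[:, 0] - X[:, 1] + 0.2      # second line
y = (f1 * f2 > 0).astype(int)

# A degree-2 polynomial kernel can represent a product of two linear
# functions, so it can recover exactly this kind of boundary.
clf = SVC(kernel="poly", degree=2, coef0=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```

An RBF kernel would also work here; the degree-2 polynomial kernel just matches the structure of the problem most directly.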
References
Difference between various SVM kernels
Selecting Kernels for SVM
Custom kernel SVM

Decision boundary is not a property of training data in classification

In Andrew Ng's ML videos on Coursera, in the third video on Classification, he says that the "decision boundary is not a property of the training set". What does this statement mean? Does it also imply that the straight line or curves we use in linear regression to fit data are not a property of the training set? He claims that those curves (obtained through linear regression) aren't properties of the corresponding training data. I am a bit confused about this and would appreciate clarification. Thanks in advance.
The decision boundary is a property of your classifier. Different classifiers lead to different decision boundaries.
The decision boundary has nothing to do with linear regression; it only makes sense for classification problems. The decision boundary is the curve (or surface, in more than two dimensions) that separates the elements of the two classes in your classification problem. In logistic regression the decision boundary is a straight line, while in nonlinear classification methods, like neural networks, it can be a curve.
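A quick way to see this is to fit two different classifiers to the exact same training set and check where they disagree. A sketch, assuming scikit-learn and its toy make_moons data:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# One and the same training set for both models.
X, y = make_moons(noise=0.2, random_state=0)

linear = LogisticRegression().fit(X, y)   # straight-line boundary
curved = MLPClassifier(hidden_layer_sizes=(20,),
                       max_iter=2000, random_state=0).fit(X, y)  # curved boundary

# Evaluate both models on a dense grid covering the plane: identical
# data, yet the classifiers disagree wherever their boundaries differ.
xs = np.linspace(-1.5, 2.5, 200)
ys = np.linspace(-1.0, 1.5, 200)
grid = np.c_[xs.repeat(200), np.tile(ys, 200)]
disagree = (linear.predict(grid) != curved.predict(grid)).mean()
print(f"the two models disagree on {disagree:.0%} of the grid")
```

Same data, different boundaries: the boundary comes from the model, not from the training set.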

How is a linear autoencoder equal to PCA?

I would like a mathematical proof of this. Does anyone know of a paper on it, or can someone work out the math?
https://pvirie.wordpress.com/2016/03/29/linear-autoencoders-do-pca/
PCA is restricted to a linear map, while autoencoders can have nonlinear encoders/decoders.
A single-layer autoencoder with a linear transfer function is nearly equivalent to PCA, where "nearly" means that the weight matrix W found by the AE and by PCA won't be the same, but the subspaces spanned by the respective W's will be.
Here is a detailed mathematical explanation:
https://arxiv.org/abs/1804.10253
This paper also shows that, using a linear autoencoder, it is possible not only to compute the subspace spanned by the PCA vectors, but to compute the principal components themselves.
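For intuition, here is a small NumPy sketch (synthetic data, plain full-batch gradient descent, all parameters made up) comparing the subspace found by a single-layer linear autoencoder with the top-k PCA subspace via their orthogonal projectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic centred data: 200 samples in 5 dimensions.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X -= X.mean(axis=0)
k = 2  # bottleneck width / number of principal components

# PCA subspace: span of the top-k right singular vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T

# Linear autoencoder x -> W2 (W1 x), trained by gradient descent
# on the mean squared reconstruction error.
W1 = rng.normal(scale=0.1, size=(k, 5))
W2 = rng.normal(scale=0.1, size=(5, k))
lr = 1e-3
for _ in range(5000):
    Z = X @ W1.T                      # encode
    E = Z @ W2.T - X                  # reconstruction error
    gW2 = E.T @ Z / len(X)
    gW1 = W2.T @ E.T @ X / len(X)
    W1 -= lr * gW1
    W2 -= lr * gW2

# Compare the two subspaces through their orthogonal projectors: the
# learned weights differ from the principal directions, but the
# projectors should (nearly) coincide after training.
Q, _ = np.linalg.qr(W2)
print(np.abs(Q @ Q.T - V @ V.T).max())  # should shrink toward 0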

Logistic Regression and Naive Bayes for this dataset

Can both Naive Bayes and logistic regression classify both of these datasets perfectly? My understanding is that Naive Bayes can, and that logistic regression with complex terms can. Please correct me if I am wrong.
[Image of the two datasets]
Let's run both algorithms on two datasets similar to the ones you posted and see what happens...
EDIT: The previous answer I posted was incorrect. I forgot to account for the variance in Gaussian Naive Bayes. (The previous solution was for Naive Bayes using Gaussians with fixed, identity covariance, which gives a linear decision boundary.)
It turns out that LR fails on the circular dataset while NB succeeds.
Both methods succeed on the rectangular dataset.
The LR decision boundary is linear while the NB boundary is quadratic (the boundary between two axis-aligned Gaussians with different covariances).
Applying NB to the circular dataset gives two means in roughly the same position but with different variances, leading to a roughly circular decision boundary: as the radius increases, the probability under the higher-variance Gaussian increases relative to that of the lower-variance Gaussian. In this case, many of the points on the inner circle are incorrectly classified.
The two plots below show a Gaussian NB solution with fixed variance.
In the plots below, the contours represent probability contours of the NB solution.
This Gaussian NB solution also learns the variances of the individual parameters, leading to an axis-aligned covariance in the solution.
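To reproduce the qualitative result, here is a minimal sketch using scikit-learn's make_circles as a stand-in for the circular dataset (the dataset parameters are made up):

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Concentric classes: roughly equal means, different variances.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

# A linear boundary can't separate the rings; NB's quadratic one can.
print("LogisticRegression:", LogisticRegression().fit(X, y).score(X, y))
print("GaussianNB:        ", GaussianNB().fit(X, y).score(X, y))
```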
Naive Bayes/logistic regression can get the second (right) of these two pictures, in principle, because there is a linear decision boundary that perfectly separates the classes.
If you used a continuous version of Naive Bayes with class-conditional Normal distributions on the features, you could separate the circular dataset, because the variance of the red class is greater than that of the blue. You'd end up with distributions for the two classes that have the same mean (the centre point of the two rings), but where the variance of the features conditioned on the red class is greater than that conditioned on the blue class, leading to a circular decision boundary somewhere in the margin. This is a non-linear classifier, though.
You could get the same effect with histogram binning of the feature space, so long as the bins were narrow enough. In that case both logistic regression and Naive Bayes would work, based on histogram-like feature vectors, as sketched below.
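A sketch of the histogram-binning idea, assuming scikit-learn's KBinsDiscretizer (the bin count is made up): one-hot bin indicators let a linear model add up per-bin scores and carve out a roughly circular region.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

# Bin each feature into narrow one-hot "histogram" indicators; a
# linear model on the bins can then approximate an additive score
# like -(x1^2 + x2^2), whose threshold is a circle.
model = make_pipeline(
    KBinsDiscretizer(n_bins=20, encode="onehot", strategy="uniform"),
    LogisticRegression(max_iter=1000),
)
print(model.fit(X, y).score(X, y))
```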
How would you use Naive Bayes on these data sets?
In its usual form, Naive Bayes needs binary/categorical data.
