I have only 2 classes.
Fisher Discriminant Analysis projects the data into a low-dimensional discriminative subspace. According to the papers, I can find at most C-1 non-zero eigenvalues. That means if my initial data has dimension d, the projected data will have dimension at most C-1.
If there are only 2 classes, I will get a feature vector with only one dimension.
My problem is that I would like to project my data into a discriminative subspace and then get a feature vector of size m (m < d, m ≠ 1).
Is there any way to do this kind of discriminative projection?
I would like to compare the accuracies of running logistic regression on a dataset following PCA and LDA. The dataset I am using is the Wisconsin cancer dataset, which contains two classes (malignant or benign tumors) and 30 features. I have already run PCA on this data and have been able to get good accuracy scores with 10 principal components. I know that LDA is similar to PCA. My understanding is that you calculate the mean vectors of each feature for each class, compute the scatter matrices and then get the eigenvalues for the dataset. Is LDA similar to PCA in the sense that I can choose 10 LDA eigenvalues to better separate my data? I have tried LDA with scikit-learn; however, it has only given me one LDA component back. Is this because I only have 2 classes, or do I need to do an additional step? I would like to have 10 LDA components in order to compare them with my 10 principal components. Is this even possible?
Actually, both LDA and PCA are linear transformation techniques: LDA is supervised, whereas PCA is unsupervised (it ignores class labels). You can picture PCA as a technique that finds the directions of maximal variance, and LDA as a technique that also cares about class separability. Remember that LDA makes assumptions about normally distributed classes and equal class covariances (at least the multiclass version; the generalized version is due to Rao).
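To make the component-count limit concrete, here is a minimal sketch (assuming scikit-learn and its bundled copy of the Wisconsin dataset via load_breast_cancer; the 10-component choice mirrors the question) that fits logistic regression on 10 principal components and on the single LDA component, which is all LDA can give you with 2 classes:

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # PCA can keep any number of components up to the feature count (30 here).
    pca_lr = make_pipeline(StandardScaler(), PCA(n_components=10),
                           LogisticRegression(max_iter=1000))
    print("PCA(10) + LR:", pca_lr.fit(X_train, y_train).score(X_test, y_test))

    # LDA is capped at n_classes - 1 components, i.e. a single one for 2 classes.
    lda_lr = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=1),
                           LogisticRegression(max_iter=1000))
    print("LDA(1) + LR:", lda_lr.fit(X_train, y_train).score(X_test, y_test))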
Consider a parametric binary classifier (such as logistic regression, SVM, etc.) trained on a dataset (say containing two features, e.g. blood pressure and cholesterol level). The dataset is then thrown away, and the trained model can only be used as a black box (no tweaks, and no inside information can be gathered from the trained model). Only a set of data points can be provided and their labels predicted.
Is it possible to get information about the mean and/or standard deviation and/or range of the features of the dataset on which this model was trained? If yes, how? If no, why not?
SVM does not provide any information about the data statistics. It is a maximum-margin classifier: it finds the best separating hyperplane between the two classes in the feature space, expressed as a linear combination of "support vectors". If you use kernel functions, this combination lives in the kernel-induced feature space, which is not even the original feature space. SVM does not have a straightforward probabilistic interpretation at all.
Logistic regression is a discriminative classifier that models the conditional probability p(y|x,w), where y is the label, x is the data and w are the weights. After maximum-likelihood training you are left with w, which again describes a discriminating hyperplane in the feature space, so you still do not recover any statistics of the training features.
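For illustration, a minimal sketch (assuming scikit-learn; the two features and the labeling rule below are made up) showing that the fitted object only retains the hyperplane parameters, not the data:

    # After fitting, the model only carries the weights w (coef_) and the bias
    # (intercept_); nothing about the feature means or spreads is stored.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.randn(200, 2) * [10, 40] + [120, 200]    # e.g. blood pressure, cholesterol
    y = (X[:, 0] + 0.5 * X[:, 1] > 220).astype(int)  # arbitrary labeling rule

    clf = LogisticRegression().fit(X, y)
    print(clf.coef_, clf.intercept_)                 # all that survives training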
The following can be considered. Use a Gaussian classifier. Assume that your class is produced by the prior class probability p(y), and then a class-conditional density p(x|y,w) produces your data. Then by Bayes' rule you have: p(y|x,w) = p(y) p(x|y,w) / p(x). If you define the class-conditional density p(x|y,w) as Gaussian, its parameter set w consists of the mean vector m and covariance matrix C of x, assuming x is produced by the class y. But remember that this works only under the assumption that the current data vector belongs to a specific class. Conditioned on w, a better option for the mean vector would be E[x|w], the expectation of x with respect to p(x|w). It comes down to a weighted average of the mean vectors for the classes y=0 and y=1, weighted by their prior class probabilities. The same should work for the covariance as well, but it needs to be derived properly; I am not 100% sure right now.
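Spelling out that weighted-average step for the mean (the covariance would pick up an extra between-class term, which is why it needs a proper derivation):

    E[x \mid w] = \sum_{y \in \{0,1\}} p(y)\, E[x \mid y, w] = p(y{=}0)\, m_0 + p(y{=}1)\, m_1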
I need a machine learning algorithm that will satisfy the following requirements:
The training data are a set of feature vectors, all belonging to the same "positive" class (as I cannot produce negative data samples).
The test data are some feature vectors which might or might not belong to the positive class.
The prediction should be a continuous value, which should indicate the "distance" from the positive samples (i.e. 0 means the test sample clearly belongs to the positive class and 1 means it is clearly negative, but 0.3 means it is somewhat positive)
An example:
Let's say that the feature vectors are 2D feature vectors.
Positive training data:
(0, 1), (0, 2), (0, 3)
Test data:
(0, 10) should be an anomaly, but not a distinct one
(1, 0) should be an anomaly, but with higher "rank" than (0, 10)
(1, 10) should be an anomaly, with an even higher anomaly "rank"
The problem you described is usually referred to as outlier, anomaly or novelty detection. There are many techniques that can be applied to this problem. A nice survey of novelty detection techniques can be found here. The article gives a thorough classification of the techniques and a brief description of each, but as a start, I will list some of the standard ones:
K-nearest neighbors - a simple distance-based method which assumes that normal data samples are close to other normal data samples, while novel samples lie far from the normal points. A Python implementation of KNN can be found in scikit-learn.
Mixture models (e.g. Gaussian Mixture Model) - probabilistic models of the generative probability density function of the data, for instance a mixture of Gaussian distributions. Given a set of normal data samples, the goal is to find the parameters of a probability distribution that describes the samples best. Then, use the probability of a new sample under that distribution to decide whether it belongs to the distribution or is an outlier. scikit-learn implements Gaussian Mixture Models and uses the Expectation-Maximization algorithm to learn them.
One-class Support Vector Machine (SVM) - an extension of the standard SVM classifier which tries to find a boundary that separates the normal samples from the unknown novel samples (in the classic approach, the boundary is found by maximizing the margin between the normal samples and the origin of the space, projected into the so-called "feature space"). scikit-learn has an implementation of one-class SVM which is easy to use, along with a nice example whose plot illustrates the boundary one-class SVM finds "around" the normal data samples.
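As a starting point for your toy data, here is a minimal sketch (assuming scikit-learn's OneClassSVM; the kernel, gamma and nu values are arbitrary choices, not tuned):

    import numpy as np
    from sklearn.svm import OneClassSVM

    X_train = np.array([[0, 1], [0, 2], [0, 3]])   # the "positive" training samples
    X_test = np.array([[0, 10], [1, 0], [1, 10]])  # samples to rank by abnormality

    model = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(X_train)

    # decision_function is positive inside the learned region and negative
    # outside; negating it gives a continuous anomaly score (higher = more novel).
    for point, score in zip(X_test, -model.decision_function(X_test)):
        print(point, score)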
I am reading the mathematical formulation of SVM, and in many sources I found the idea that the "max-margin hyperplane is completely determined by those \vec{x}_i which lie nearest to it. These \vec{x}_i are called support vectors."
Could an expert explain this consequence mathematically, please?
This follows from the representer theorem.
Your separating hyperplane W can be represented as

    W = \sum_{i=1}^{m} \alpha_i \, \phi(x_i)

where m is the number of examples in your training sample, x_i is example i, and \phi is the function that maps x_i into some feature space.
So the SVM algorithm finds a vector \alpha = (\alpha_1, ..., \alpha_m), one \alpha_i per x_i. Every x_i (example in the training sample) whose \alpha_i is NOT zero is a support vector of W.
Hence the name - SVM.
If your data is separable, what happens is that you only need the support vectors that lie close to the separating margin of W, and all the rest of the training set can be discarded (their alphas are 0). The number of support vectors the algorithm ends up using depends on the complexity of the data and the kernel you are using.
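A quick way to see this sparsity in practice, as a minimal sketch (assuming scikit-learn; the two Gaussian blobs are made-up data):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(50, 2) + [2, 2],     # class +1 cluster
                   rng.randn(50, 2) - [2, 2]])    # class -1 cluster
    y = np.array([1] * 50 + [-1] * 50)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    print("training points:", len(X))
    print("support vectors:", clf.support_vectors_.shape[0])   # typically far fewer
    print("non-zero alphas:", clf.dual_coef_)                  # alpha_i * y_i, support vectors only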
I have a cyclic method running which collects a data set of 15,000 feature vectors with 30 dimensions (every 200 ms). My current setup simply feeds all raw feature vectors to an SVM with an RBF (radial basis function) kernel. The classification result is rather unconvincing, and it is costly in terms of time. I know that the dataset isn't that big, so real-time classification should be possible with the right subsampling of the feature vectors or similar. The goal is to speed up the entire classification process (training/prediction) to a few milliseconds. To obtain an unsupervised classification approach, I currently run k-means to label the feature vectors: I pick a few of the resulting clusters, assign them class 1, and assign all others class 0.
The idea is now the following:
collect all 15,000 (N) feature vectors with 30 (D) dimensions
PCA on all N feature vectors
use the eigenvalues to determine a feature vector with (d) dimensions (d < D)
feed the new set of (n < N) feature vectors (or: the eigenvectors?) to train the SVM (a sketch of such a pipeline follows after this list)
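Here is a minimal sketch of that pipeline (assuming scikit-learn; the random stand-in data, the 95% variance threshold for picking d, and the switch to a linear SVM are my assumptions, not part of the question):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    rng = np.random.RandomState(0)
    X = rng.randn(15000, 30)              # stand-in for the N x D feature matrix
    y = rng.randint(0, 2, size=15000)     # stand-in for the k-means-derived labels

    # Inspect the spectrum to pick d: keep enough components for ~95% of the variance.
    pca = PCA().fit(X)
    d = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95)) + 1
    print("chosen d:", d)

    # Feed the projected vectors (not the eigenvectors themselves) into the
    # classifier; a linear SVM is much faster than an RBF SVM on 15,000 samples.
    clf = make_pipeline(StandardScaler(), PCA(n_components=d), LinearSVC(dual=False))
    clf.fit(X, y)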
Maybe instead of an SVM, a KNN approach would yield a similar result?
Does this approach make sense?
Any ideas to improve the process or change it in order to speed it up?
How do I determine the best value for d?
The classification accuracy shouldn't suffer too much from the time reduction.
EDIT: Data stream mining
I was just reading about Data Stream Mining. I think this topic fits my setup quite well since I have to extract knowledge structures from continuous, rapid data records. Maybe I should replace the SVM with a Gradient Boosted Tree?
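For reference, a minimal sketch of what an incremental, stream-friendly learner could look like (assuming scikit-learn's SGDClassifier with partial_fit as a stand-in; the random chunks just mimic the 200 ms batches and are not real data):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier(loss="hinge")          # linear SVM-like objective
    classes = np.array([0, 1])

    rng = np.random.RandomState(0)
    for _ in range(10):                        # pretend each iteration is one 200 ms batch
        X_chunk = rng.randn(15000, 30)         # stand-in for one batch of feature vectors
        y_chunk = rng.randint(0, 2, 15000)     # stand-in for the k-means-derived labels
        clf.partial_fit(X_chunk, y_chunk, classes=classes)  # update instead of retraining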
Thanks!