Machine Learning - one class classification/novelty detection/anomaly assessment? - machine-learning

I need a machine learning algorithm that will satisfy the following requirements:
The training data are a set of feature vectors, all belonging to the same, "positive" class (as I cannot produce negative data samples).
The test data are some feature vectors which might or might not belong to the positive class.
The prediction should be a continuous value, which should indicate the "distance" from the positive samples (i.e. 0 means the test sample clearly belongs to the positive class and 1 means it is clearly negative, but 0.3 means it is somewhat positive)
An example:
Let's say that the feature vectors are 2D feature vectors.
Positive training data:
(0, 1), (0, 2), (0, 3)
Test data:
(0, 10) should be an anomaly, but not a distinct one
(1, 0) should be an anomaly, but with higher "rank" than (0, 10)
(1, 10) should be an anomaly, with an even higher anomaly "rank"

The problem you described is usually referred to as outlier, anomaly or novelty detection. There are many techniques that can be applied to this problem. A nice survey of novelty detection techniques can be found here. The article gives a thorough classification of the techniques and a brief description of each, but as a start, I will list some of the standard ones:
K-nearest neighbors - a simple distance-based method which assumes that normal data samples are close to other normal data samples, while novel samples are located far from the normal points. A Python implementation of KNN can be found in scikit-learn.
Mixture models (e.g. Gaussian Mixture Model) - probabilistic models of the generative probability density function of the data, for instance using a mixture of Gaussian distributions. Given a set of normal data samples, the goal is to find the parameters of a probability distribution that describes the samples best. Then, use the probability of a new sample to decide if it belongs to the distribution or is an outlier. scikit-learn implements Gaussian Mixture Models and uses the Expectation Maximization algorithm to learn them.
One-class Support Vector Machine (SVM) - an extension of the standard SVM classifier which tries to find a boundary that separates the normal samples from the unknown novel samples (in the classic approach, the boundary is found by maximizing the margin between the normal samples and the origin of the space, projected to the so-called "feature space"). scikit-learn has an implementation of one-class SVM which allows you to use it easily, and a nice example. I attach the plot of that example to illustrate the boundary one-class SVM finds "around" the normal data samples:
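As a rough illustration of the distance-based idea from the list above, here is a minimal sketch in plain Python that scores the test points from the question by their mean Euclidean distance to the k nearest training points. The function names `knn_novelty_score` and `squash`, the choice k=2, and the mapping `1 - exp(-d)` into [0, 1) are my own assumptions for illustration, not part of any library.

```python
import math

def knn_novelty_score(train, point, k=2):
    """Mean Euclidean distance from `point` to its k nearest training samples."""
    distances = sorted(math.dist(point, t) for t in train)
    return sum(distances[:k]) / k

def squash(distance):
    """Map a non-negative distance into [0, 1); 0 means 'looks positive'."""
    return 1.0 - math.exp(-distance)

train = [(0, 1), (0, 2), (0, 3)]            # positive samples from the question
tests = [(0, 2), (1, 0), (0, 10), (1, 10)]  # (0, 2) added as a clearly positive point

scores = {p: knn_novelty_score(train, p) for p in tests}
for p in tests:
    print(p, round(scores[p], 3), round(squash(scores[p]), 3))
```

Note that with plain Euclidean distance (0, 10) comes out as more anomalous than (1, 0), which is the opposite of the ranking asked for in the question. Getting the requested order would need a distance that weights the first coordinate much more heavily, and that weighting is a modelling decision you would have to make yourself.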

Related

Can a machine learning model provide information about mean and standard deviation of data on which it was trained?

Consider a parametric binary classifier (such as Logistic Regression, SVM etc.) trained on a dataset (say, containing two features, e.g. Blood Pressure and Cholesterol level). The dataset is thrown away and the trained model can only be used as a black box (no tweaks can be made and no inside information can be gathered from the trained model). Only a set of data points can be provided and their labels predicted.
Is it possible to get information about the mean and/or standard deviation and/or range of the features of the dataset on which this model was trained? If yes, how so? And if no, then why can't we?
Thank you for your response! :)
SVM does not provide any information about the data statistics. It is a maximum-margin classifier that finds the best separating hyperplane between the two classes in the feature space, as a linear combination of "support vectors". If you use kernel functions, this combination lives in the kernel space, not even in the original feature space. SVM also does not have a straightforward probabilistic interpretation.
Logistic regression is a discriminative classifier and models the conditional probability p(y|x,w), where y is your label, x is your data and w are the weights. After maximum likelihood training you are left with w, which again defines a discriminator (a hyperplane) in the feature space, so you cannot recover the feature statistics from it either.
The following can be considered: use a Gaussian classifier. Assume that the class is drawn from the prior class probability p(y), and that a class-conditional density p(x|y,w) then produces your data. By Bayes' rule you have: p(y|x,w) = p(y) p(x|y,w) / p(x). If you define the class-conditional density p(x|y,w) as Gaussian, its parameter set w consists of the mean vector m and covariance matrix C of x, assuming x was produced by the class y. But remember that this only works under the assumption that the current data vector belongs to a specific class. Conditioned on w, a better option for the mean vector is E[x|w], the expectation of x with respect to p(x|w). It comes down to a weighted average of the mean vectors for the classes y=0 and y=1, weighted by their prior class probabilities. The same should work for the covariance as well, but it needs to be derived properly; I am not 100% sure right now.
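On the covariance part: the law of total covariance gives the overall mean as m = sum_y p(y) m_y and the overall covariance as C = sum_y p(y) (C_y + (m_y - m)(m_y - m)^T). Below is a small 1-D check of that identity in plain Python. The toy numbers and the helper name `mixture_moments` are made up for illustration, and the sketch assumes that the class priors, means and variances can actually be read off the Gaussian classifier.

```python
from statistics import fmean, pvariance

def mixture_moments(priors, means, variances):
    """Overall mean and variance of a mixture from per-class parameters (1-D)."""
    mean = sum(p * m for p, m in zip(priors, means))
    # Law of total variance: within-class part plus between-class part.
    var = sum(p * (v + (m - mean) ** 2)
              for p, m, v in zip(priors, means, variances))
    return mean, var

# Hypothetical 1-D feature values for two classes (toy numbers).
class0 = [110.0, 120.0, 130.0]
class1 = [150.0, 160.0, 170.0, 180.0]
n = len(class0) + len(class1)

priors = [len(class0) / n, len(class1) / n]
means = [fmean(class0), fmean(class1)]
variances = [pvariance(class0), pvariance(class1)]  # 1/N (population) variance

mix_mean, mix_var = mixture_moments(priors, means, variances)
pooled = class0 + class1
print(mix_mean, fmean(pooled))      # overall mean, two ways
print(mix_var, pvariance(pooled))   # overall variance, two ways
```

The two printed pairs agree up to floating-point error, i.e. the per-class parameters plus the priors are enough to reconstruct the mean and variance of the pooled data.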

What does it mean by the phrase "a machine learning algorithm learn a probability distribution"? What exactly is happening here

Generative and discriminative models seem to learn conditional P(x|y) and joint P(x,y) probability distributions. But at the fundamental level I fail to convince myself what it means by the probability distribution is learnt.
It means that your model is either functioning as an estimator for the distribution from which your training samples were drawn, or is utilizing that estimator to perform some other prediction.
To give a trivial example, consider a set of observations {x[1], ..., x[N]}. Let's say you want to train a Gaussian estimator on it. From these samples, the maximum-likelihood parameters for this Gaussian estimator would be the mean and variance of the data:
Mean = 1/N * (x[1] + ... + x[N])
Variance = 1/N * ((x[1] - Mean)^2 + ... + (x[N] - Mean)^2)
(Dividing by N-1 instead of N gives the unbiased sample variance; the two barely differ for large N.)
Now you have a model capable of generating new samples from (an estimate of) the distribution your training sample was drawn from.
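A minimal sketch of this in plain Python, using only the standard library: fit the mean and variance, then draw new samples from the fitted Gaussian. The observations are toy numbers chosen for illustration.

```python
import random
from statistics import fmean, pvariance, variance

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # toy observations

mean = fmean(x)
var_ml = pvariance(x)        # divides by N: the maximum-likelihood estimate
var_unbiased = variance(x)   # divides by N - 1: the unbiased sample variance

# Draw new samples from the fitted Gaussian (an estimate of the true distribution).
rng = random.Random(0)
new_samples = [rng.gauss(mean, var_ml ** 0.5) for _ in range(5)]

print(mean, var_ml, var_unbiased)
print(new_samples)
```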
Going a little more sophisticated, you could consider something like a Gaussian mixture model. This similarly infers the best-fitting parameters of a model given your data. Except this time, that model is comprised of multiple Gaussians. As a result, if you are given some test data, you may probabilistically assign classes to each of those samples, based on the relative contribution of each Gaussian component to the probability density at the points of observation. This of course makes the fundamental assumption of machine learning: your training and test data are both drawn from the same distribution (something you ought to check).
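To make the "relative contribution of each Gaussian component" concrete, here is a small sketch that computes those contributions (often called responsibilities) for a 1-D mixture. It assumes the mixture parameters have already been fitted; the weights, means and standard deviations below are invented for illustration, as are the function names.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def responsibilities(x, weights, means, stds):
    """Posterior probability of each mixture component given x."""
    contributions = [w * gaussian_pdf(x, m, s)
                     for w, m, s in zip(weights, means, stds)]
    total = sum(contributions)   # mixture density p(x)
    return [c / total for c in contributions]

# Assumed, already-fitted parameters of a two-component mixture.
weights, means, stds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]

for x in (0.0, 2.0, 3.5):
    print(x, [round(r, 3) for r in responsibilities(x, weights, means, stds)])
```

A point halfway between the two means is assigned to each component with probability 0.5, while a point sitting on one mean is assigned almost entirely to that component.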

Difference between Probabilistic kNN and Naive Bayes

I'm trying to modify a standard kNN algorithm to obtain the probability of belonging to a class instead of just the usual classification. I haven't found much information about Probabilistic kNN, but as far as I understand, it works similarly to kNN, with the difference that it calculates the percentage of examples of every class inside the given radius.
So I wonder, what's the difference then between Naive Bayes and Probabilistic kNN? I can only spot that Naive Bayes takes the prior probability into consideration, while PkNN does not. Am I getting it wrong?
Thanks in advance!
To be honest there is nearly no similarity.
Naive Bayes assumes that each class follows a simple distribution in which the features are independent of each other. In the continuous case it fits an axis-aligned Normal distribution (diagonal covariance) to each class and then makes a decision through argmax_y p(y) N(x; m_y, Sigma_y).
KNN, on the other hand, is not a probabilistic model. The modification you are referring to is simply a "smooth" version of the original idea, where you return the ratio of each class in the nearest-neighbours set (this is not really a "probabilistic kNN", it is just regular kNN with a rough estimate of probability). It assumes nothing about the data distribution (besides being locally smooth). In particular, it is a nonparametric model which, given enough training samples, will fit perfectly to any dataset. Naive Bayes will fit perfectly only to K Gaussians (where K is the number of classes).
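For comparison, this is roughly what the Gaussian Naive Bayes side looks like when written out in plain Python: one prior plus one mean and one variance per feature for each class, and a decision by the largest log-posterior. It is only a sketch under my own assumptions (toy data, made-up function names, no variance smoothing), not the implementation of any particular library.

```python
import math
from statistics import fmean, pvariance

def fit_gaussian_nb(samples, labels):
    """Per class: prior, per-feature means and per-feature variances."""
    model = {}
    for c in set(labels):
        rows = [s for s, y in zip(samples, labels) if y == c]
        columns = list(zip(*rows))
        model[c] = (len(rows) / len(samples),
                    [fmean(col) for col in columns],
                    [pvariance(col) for col in columns])
    return model

def log_posterior(model, x):
    """Unnormalised log p(y) + sum_j log N(x_j; m_yj, var_yj) for each class."""
    scores = {}
    for c, (prior, means, variances) in model.items():
        score = math.log(prior)
        for xj, m, v in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        scores[c] = score
    return scores

def predict(model, x):
    """Class with the largest log-posterior."""
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)

# Toy 2-D data with two well-separated classes.
samples = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.8), (6.0, 7.0), (7.0, 6.5), (6.5, 7.5)]
labels = ["a", "a", "a", "b", "b", "b"]

model = fit_gaussian_nb(samples, labels)
print(predict(model, (1.2, 1.5)), predict(model, (6.8, 7.1)))
```

Everything the model knows about a class is compressed into those few parameters, whereas kNN keeps the whole training set around and looks only at the local neighbourhood of the query point.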
(I don't know how to format math formulas. For more details and clear representations, please see this.)
I would like to propose an opposite view: KNN is a kind of simplified Naive Bayes (NB), if you view KNN as a means of density estimation.
To perform density estimation, we attempt to estimate p(x) = k/(NV), where k is the number of samples lying in a region R, N is the total number of samples, and V is the volume of the region R. Usually, there are two ways to estimate it: (1) fix V and calculate k, which is known as kernel density estimation or the Parzen window; (2) fix k and calculate V, which is KNN-based density estimation. The latter is much less well known than the former due to its many drawbacks.
Yet, we can use KNN-based density estimation to connect KNN and NB. Given N samples in total, of which Ni belong to class ci, we can write NB in the form of KNN-based density estimation by considering a region containing x:
P(ci|x) = P(x|ci)P(ci)/P(x) = (ki/(Ni V)) (Ni/N) / (k/(N V)) = ki/k,
where ki is the number of samples of class ci lying in the region. The final form ki/k is actually the KNN classifier.
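The final form ki/k is easy to compute directly. A minimal sketch in plain Python, with toy data and a made-up function name `knn_class_ratios`:

```python
import math
from collections import Counter

def knn_class_ratios(train, labels, x, k=3):
    """Return {class: k_i / k} over the k nearest neighbours of x."""
    nearest = sorted(zip(train, labels),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / k for label, count in counts.items()}

train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (3.0, 3.0), (3.0, 4.0), (4.0, 3.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_class_ratios(train, labels, (0.5, 0.5), k=3))
print(knn_class_ratios(train, labels, (2.0, 2.5), k=5))
```

Here the region is whatever ball around x happens to contain exactly k training points, so V changes from query to query while k stays fixed.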

Suggested unsupervised feature selection / extraction method for 2 class classification?

I've got a set of F features, e.g. Lab color space, entropy. By concatenating all features together, I obtain a feature vector of dimension d (between 12 and 50, depending on which features are selected).
I usually get between 1000 and 5000 new samples, denoted x. A Gaussian Mixture Model is then trained with the vectors, but I don't know which class the features are from. What I know though, is that there are only 2 classes. Based on the GMM prediction I get a probability of that feature vector belonging to class 1 or 2.
My question now is: How do I obtain the best subset of features, for instance only entropy and normalized rgb, that will give me the best classification accuracy? I guess this is achieved, if the class separability is increased, due to the feature subset selection.
Maybe I can utilize Fisher's linear discriminant analysis? Since I already have the mean and covariance matrices obtained from the GMM. But wouldn't I have to calculate the score for each combination of features then?
It would be nice to get some help if this is an unrewarding approach and I'm on the wrong track, and/or any other suggestions.
One way of finding "informative" features is to use the features that will maximise the log likelihood. You could do this with cross validation.
https://www.cs.cmu.edu/~kdeng/thesis/feature.pdf
Another idea might be to use another unsupervised algorithm that automatically selects features, such as a clustering forest
http://research.microsoft.com/pubs/155552/decisionForests_MSR_TR_2011_114.pdf
In that case the clustering algorithm will automatically split the data based on information gain.
Fisher LDA will not select features but project your original data into a lower dimensional subspace. If you are looking into the subspace method, another interesting approach might be spectral clustering, which also happens in a subspace, or unsupervised neural networks such as autoencoders.
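On the Fisher idea raised in the question: if you only want a cheap ranking of individual features (rather than a projection), one option is the univariate two-class Fisher criterion (m1 - m2)^2 / (v1 + v2), computed per feature from the means and variances of the two GMM components. The sketch below is only an illustration under strong assumptions: the feature names and numbers are invented, the score ignores correlations between features, and it is only meaningful if the two GMM components really correspond to the two classes.

```python
def fisher_score(mean_1, var_1, mean_2, var_2):
    """Two-class Fisher criterion for a single feature."""
    return (mean_1 - mean_2) ** 2 / (var_1 + var_2)

# Hypothetical per-feature (mean, variance) of the two GMM components.
component_1 = {"entropy": (0.2, 0.01), "norm_r": (0.5, 0.04), "lab_a": (0.1, 0.09)}
component_2 = {"entropy": (0.8, 0.02), "norm_r": (0.55, 0.05), "lab_a": (0.4, 0.16)}

scores = {
    name: fisher_score(*component_1[name], *component_2[name])
    for name in component_1
}

# Features sorted from most to least separating.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Scoring features one at a time needs only d evaluations instead of one per feature combination, at the price of missing features that are only useful jointly.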

What is the OpenCV svm type parameter

The OpenCV SVM implementation takes a parameter labeled as "SVM type" which must be set in the CvSVMParams structure used in training the SVM. All the explanation I can find is:
// SVM type
enum { C_SVC=100, NU_SVC=101, ONE_CLASS=102, EPS_SVR=103, NU_SVR=104 };
Anyone know what these different values represent?
They are different formulations of SVM. At the heart of SVM is a mathematical optimization problem, and this problem can be stated in different ways.
C-SVM uses C as the tradeoff parameter between the size of the margin and the number of training points which are misclassified. C is just a number; the useful range depends on the dataset and can go from very small (like 10^-5) to very large (like 10^5).
nu-SVM uses nu instead of C. nu is roughly the fraction of training points which will end up as support vectors. The more support vectors, the wider your margin is, and the more training points will be misclassified. nu must lie between 0 and 1 and is typically chosen from about 0.1 to 0.8: at 0.1 roughly 10% of training points will be support vectors, at 0.8 more like 80%. I say roughly because it's only correlated that way (strictly, nu is a lower bound on the fraction of support vectors and an upper bound on the fraction of margin errors); it's not exact.
epsilon-SVR and nu-SVR use SVM for regression. Instead of doing binary classification by finding a maximum-margin hyperplane, the concept is used to find a hypertube which best fits the data, which is then used to predict future values. They differ in the way they are parameterized (like nu-SVM and C-SVM differ).
One-Class SVM is novelty detection. Rather than doing binary classification or predicting a value, you give the SVM a training set and it attempts to train a model that wraps around that set, so that a future instance can be classified as part of the class or outside the class (novel or outlier).
In general:
Classification SVM Type 1 (also known as C-SVM classification)
Classification SVM Type 2 (also known as nu-SVM classification)
Regression SVM Type 1 (also known as epsilon-SVM regression)
Regression SVM Type 2 (also known as nu-SVM regression)
Details can be found on page SVM
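To make the "hypertube" of epsilon-SVR a bit more concrete: the regression variants penalise a prediction only when it falls outside a tube of half-width epsilon around the target, using the so-called epsilon-insensitive loss. A tiny sketch in plain Python; the function name, the epsilon value and the numbers are my own choices for illustration and have nothing to do with the OpenCV API.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the tube of half-width epsilon, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Hypothetical targets and predictions of some already-fitted regressor.
targets = [1.00, 2.00, 3.00, 4.00]
predictions = [1.05, 2.30, 2.95, 3.50]

losses = [epsilon_insensitive_loss(t, p) for t, p in zip(targets, predictions)]
print([round(loss, 2) for loss in losses])
```

Points whose error is smaller than epsilon contribute nothing to the loss, which is why only the points on or outside the tube end up as support vectors.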

Resources