What is the OpenCV svm type parameter - opencv

The OpenCV SVM implementation takes a parameter labeled "SVM type", which must be set in the CvSVMParams structure used in training the SVM. All the explanation I can find is:
// SVM type
enum { C_SVC=100, NU_SVC=101, ONE_CLASS=102, EPS_SVR=103, NU_SVR=104 };
Anyone know what these different values represent?

They are different formulations of SVM. At the heart of SVM is a mathematical optimization problem. This problem can be stated in different ways.
C-SVM uses C as the tradeoff parameter between the size of the margin and the number of training points which are misclassified. C is just a number; the useful range depends on the dataset and can run from very small (like 10^-5) to very large (like 10^5).
nu-SVM uses nu instead of C. nu is roughly the fraction of training points which will end up as support vectors (more precisely, it is a lower bound on the fraction of support vectors and an upper bound on the fraction of margin errors). The more support vectors, the wider your margin is and the more training points will be misclassified. nu must lie in (0, 1], and useful values typically range from 0.1 to 0.8: at 0.1 roughly 10% of training points will be support vectors, at 0.8 more like 80%. I say roughly because it's only correlated that way; it's not exact.
epsilon-SVR and nu-SVR use SVM for regression. Instead of doing binary classification by finding a maximum margin hyperplane, the concept is used to find a hypertube which best fits the data, so that it can be used to predict future values. They differ in the way they are parameterized (like nu-SVM and C-SVM differ).
One-Class SVM is novelty detection. Rather than doing binary classification or predicting a value, you give the SVM a training set and it attempts to train a model that wraps around that set, so that a future instance can be classified as part of the class or outside the class (novel or outlier).
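To make the difference between the classification and regression formulations more concrete, here is a minimal sketch in plain Python of the two loss terms involved: the primal C-SVC objective (margin term plus C times the hinge loss) and the epsilon-insensitive loss used by epsilon-SVR. The function names and the toy data are made up for illustration; this is not OpenCV code.

```python
def c_svc_objective(w, b, C, X, y):
    """Primal C-SVC objective: 0.5 * ||w||^2 + C * (sum of hinge losses)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        # The hinge loss is zero once a point is on the right side of the margin.
        hinge += max(0.0, 1.0 - yi * score)
    return margin_term + C * hinge


def eps_insensitive_loss(y_true, y_pred, eps):
    """epsilon-SVR loss: errors inside the tube of half-width eps cost nothing."""
    return max(0.0, abs(y_true - y_pred) - eps)


# Hypothetical toy data: the third point lies on the wrong side of the boundary.
X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, -1, -1]
w, b = (1.0, 0.0), 0.0

print(c_svc_objective(w, b, 1.0, X, y))   # small C: the violation is cheap
print(c_svc_objective(w, b, 10.0, X, y))  # large C: the same violation dominates
print(eps_insensitive_loss(3.0, 3.2, 0.5))
```

With a larger C the same misclassified point contributes more to the objective, which is the tradeoff described above.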

In general:
Classification SVM Type 1 (also known as C-SVM classification)
Classification SVM Type 2 (also known as nu-SVM classification)
Regression SVM Type 1 (also known as epsilon-SVM regression)
Regression SVM Type 2 (also known as nu-SVM regression)
Details can be found on page SVM

Related

Comparing MSE loss and cross-entropy loss in terms of convergence

For a very simple classification problem where I have a target vector [0,0,0,....0] and a prediction vector [0,0.1,0.2,....1] would cross-entropy loss converge better/faster or would MSE loss?
When I plot them it seems to me that MSE loss has a lower error margin. Why would that be?
Or for example when I have the target as [1,1,1,1....1] I get the following:
As complement to the accepted answer, I will answer the following questions
What is the interpretation of MSE loss and cross entropy loss from probability perspective?
Why cross entropy is used for classification and MSE is used for linear regression?
TL;DR Use MSE loss if (random) target variable is from Gaussian distribution and categorical cross entropy loss if (random) target variable is from Multinomial distribution.
MSE (Mean squared error)
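One of the assumptions of linear regression is multivariate normality. From this it follows that the target variable is normally distributed (more on the assumptions of linear regression can be found here and here).
The Gaussian distribution (normal distribution) with mean $\mu$ and variance $\sigma^2$ is given by
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Often in machine learning we deal with distributions with mean 0 and variance 1 (or we transform our data to have mean 0 and variance 1). In this case the normal distribution is
$$\mathcal{N}(x \mid 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$
This is called the standard normal distribution.
For a normal distribution model with weight parameter $w$ and precision (inverse variance) parameter $\beta$, the probability of observing a single target $t$ given input $x$ is expressed by the following equation
$$p(t \mid x, w, \beta) = \mathcal{N}\left(t \mid y(x, w), \beta^{-1}\right)$$
where $y(x, w)$ is the mean of the distribution, calculated by the model from $x$ and $w$.
Now the probability of the target vector $\mathbf{t} = (t_1, \dots, t_N)$ given the inputs $X = (x_1, \dots, x_N)$, assuming independent samples, can be expressed by
$$p(\mathbf{t} \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid y(x_n, w), \beta^{-1}\right)$$
Taking the natural logarithm of both sides yields
$$\ln p(\mathbf{t} \mid X, w, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)$$
where $\ln p(\mathbf{t} \mid X, w, \beta)$ is the log likelihood of the normal model. Often training a model involves optimizing the likelihood function with respect to $w$. The part of the log likelihood that depends on $w$ is (terms that are constant with respect to $w$ can be omitted)
$$-\frac{\beta}{2} \sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2$$
For training the model, omitting the constant $\beta$ doesn't affect the convergence, so maximizing the likelihood is equivalent to minimizing
$$\frac{1}{2} \sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2$$
This is called squared error, and taking the mean (which only changes the constant factor) yields the mean squared error:
$$\frac{1}{N} \sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2$$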
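As a quick numerical check of the argument above, the following sketch (plain Python, with made-up targets and predictions) computes the Gaussian negative log-likelihood for a fixed precision and compares it with the MSE. The names `gaussian_nll`, `mse`, `preds_a` and `preds_b` are illustrative only. The point is that the difference in negative log-likelihood between two sets of predictions is just a positive constant times the difference in MSE, so both criteria rank models identically.

```python
import math


def gaussian_nll(targets, preds, beta):
    """Negative log-likelihood of targets under N(t | y, 1/beta), independent samples."""
    n = len(targets)
    sse = sum((y - t) ** 2 for t, y in zip(targets, preds))
    return 0.5 * beta * sse - 0.5 * n * math.log(beta) + 0.5 * n * math.log(2 * math.pi)


def mse(targets, preds):
    """Mean squared error."""
    return sum((y - t) ** 2 for t, y in zip(targets, preds)) / len(targets)


# Hypothetical data: preds_a is close to the targets, preds_b is far away.
targets = [1.0, 2.0, 3.0]
preds_a = [1.1, 1.9, 3.2]
preds_b = [0.0, 0.0, 0.0]
beta = 2.0
n = len(targets)

nll_gap = gaussian_nll(targets, preds_b, beta) - gaussian_nll(targets, preds_a, beta)
mse_gap = mse(targets, preds_b) - mse(targets, preds_a)

# The two gaps differ only by the constant factor beta * n / 2.
print(nll_gap, 0.5 * beta * n * mse_gap)
```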
Cross entropy
Before going into more general cross entropy function, I will explain specific type of cross entropy - binary cross entropy.
Binary Cross entropy
Binary cross entropy assumes that the target variable is drawn from a Bernoulli distribution. According to Wikipedia
Bernoulli distribution is the discrete probability distribution of a random variable which
takes the value 1 with probability p and the value 0
with probability q=1-p
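The probability of a Bernoulli-distributed random variable is given by
$$P(X = x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases}$$
where $x \in \{0, 1\}$ and $p$ is the probability of success.
This can be simply written as
$$P(X = x) = p^{x} (1 - p)^{1 - x}$$
Taking the negative natural logarithm of both sides yields
$$-\ln P(X = x) = -x \ln p - (1 - x) \ln(1 - p)$$
which is called binary cross entropy.
Categorical cross entropy
The generalization of cross entropy covers the case where the random variable takes one of $K$ categories (it follows a Multinomial distribution). With a one-hot encoded outcome $x = (x_1, \dots, x_K)$ and class probabilities $p = (p_1, \dots, p_K)$, the probability distribution is
$$P(x \mid p) = \prod_{k=1}^{K} p_k^{x_k}$$
Taking the negative natural logarithm of both sides yields the categorical cross entropy loss:
$$-\ln P(x \mid p) = -\sum_{k=1}^{K} x_k \ln p_k$$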
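The two cross entropy losses described above can be sketched in a few lines of plain Python. The function names are made up for illustration, and the probabilities are assumed to lie strictly between 0 and 1 so that the logarithms are defined. With two classes, the categorical form reduces to the binary one.

```python
import math


def binary_cross_entropy(x, p):
    """Negative log of the Bernoulli probability p^x * (1 - p)^(1 - x)."""
    return -(x * math.log(p) + (1 - x) * math.log(1 - p))


def categorical_cross_entropy(one_hot, probs):
    """Negative log of the categorical probability prod_k p_k^(x_k)."""
    return -sum(x_k * math.log(p_k) for x_k, p_k in zip(one_hot, probs))


# A confident correct prediction is cheap, a confident wrong one is expensive.
print(binary_cross_entropy(1, 0.9))
print(binary_cross_entropy(1, 0.1))

# With K = 2 the categorical loss matches the binary loss.
print(categorical_cross_entropy([1, 0], [0.8, 0.2]), binary_cross_entropy(1, 0.8))
```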
You sound a little confused...
Comparing the values of MSE & cross-entropy loss and saying that one is lower than the other is like comparing apples to oranges
MSE is for regression problems, while cross-entropy loss is for classification ones; these contexts are mutually exclusive, hence comparing the numerical values of their corresponding loss measures makes no sense
When your prediction vector is like [0,0.1,0.2,....1] (i.e. with non-integer components), as you say, the problem is a regression (and not a classification) one; in classification settings, we usually use one-hot encoded target vectors, where only one component is 1 and the rest are 0
A target vector of [1,1,1,1....1] could be the case either in a regression setting, or in a multi-label multi-class classification, i.e. where the output may belong to more than one class simultaneously
On top of these, your plot choice, with the percentage (?) of predictions in the horizontal axis, is puzzling - I have never seen such plots in ML diagnostics, and I am not quite sure what exactly they represent or why they can be useful...
If you like a detailed discussion of the cross-entropy loss & accuracy in classification settings, you may have a look at this answer of mine.
I tend to disagree with the previously given answers. The point is that the cross-entropy and MSE losses are two instances of the same thing: a negative log-likelihood.
Modern neural networks learn their parameters using maximum likelihood estimation (MLE). The maximum likelihood estimator is the argmax, over the parameter space, of the product of the model probabilities of the training examples. If we apply a log transformation and divide by the number of training examples, we get an expectation with respect to the empirical distribution defined by the training data.
Furthermore, we can assume different distributions for the target, e.g. Gaussian or Bernoulli, which yield either the MSE loss or the negative log-likelihood of the sigmoid output (the binary cross-entropy).
For further reading:
Ian Goodfellow "Deep Learning"
A simple answer to your first question:
For a very simple classification problem ... would cross-entropy loss converge better/faster or would MSE loss?
is that MSE loss, when combined with sigmoid activation, will result in non-convex cost function with multiple local minima. This is explained by Prof Andrew Ng in his lecture:
Lecture 6.4 — Logistic Regression | Cost Function — [ Machine Learning | Andrew Ng]
I imagine the same applies to multiclass classification with softmax activation.
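The non-convexity claim can be checked numerically on the smallest possible case. The sketch below assumes a single training example with input 1 and label 1 and a single weight w (a deliberately tiny, hypothetical setup), and estimates the second derivative of each loss with a central difference. A negative second derivative anywhere means the loss is not convex in w. This only demonstrates non-convexity for this toy case; it does not by itself show multiple local minima.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mse_loss(w):
    """Squared error for one example with input x = 1 and label y = 1."""
    return (sigmoid(w) - 1.0) ** 2


def ce_loss(w):
    """Cross-entropy for the same example."""
    return -math.log(sigmoid(w))


def second_difference(f, w, h=1e-3):
    """Central-difference estimate of the second derivative of f at w."""
    return (f(w + h) - 2.0 * f(w) + f(w - h)) / (h * h)


for w in (-3.0, 0.0, 3.0):
    print(w, second_difference(mse_loss, w), second_difference(ce_loss, w))
```

For the MSE loss the estimate is negative at w = -3 and positive at w = 0, so the curvature changes sign; for the cross-entropy loss it stays positive at all three points.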

Machine Learning - one class classification/novelty detection/anomaly assessment?

I need a machine learning algorithm that will satisfy the following requirements:
The training data are a set of feature vectors, all belonging to the same, "positive" class (as I cannot produce negative data samples).
The test data are some feature vectors which might or might not belong to the positive class.
The prediction should be a continuous value, which should indicate the "distance" from the positive samples (i.e. 0 means the test sample clearly belongs to the positive class and 1 means it is clearly negative, but 0.3 means it is somewhat positive)
An example:
Let's say that the feature vectors are 2D feature vectors.
Positive training data:
(0, 1), (0, 2), (0, 3)
Test data:
(0, 10) should be an anomaly, but not a distinct one
(1, 0) should be an anomaly, but with higher "rank" than (0, 10)
(1, 10) should be an anomaly, with an even higher anomaly "rank"
The problem you described is usually referred to as outlier, anomaly or novelty detection. There are many techniques that can be applied to this problem. A nice survey of novelty detection techniques can be found here. The article gives a thorough classification of the techniques and a brief description of each, but as a start, I will list some of the standard ones:
K-nearest neighbors - a simple distance-based method which assumes that normal data samples are close to other normal data samples, while novel samples are located far from the normal points. Python implementation of KNN can be found in ScikitLearn.
Mixture models (e.g. Gaussian Mixture Model) - probabilistic models modeling the generative probability density function of the data, for instance using a mixture of Gaussian distributions. Given a set of normal data samples, the goal is to find parameters of a probability distribution so that it describes the samples best. Then, use the probability of a new sample to decide if it belongs to the distribution or is an outlier. ScikitLearn implements Gaussian Mixture Models and uses the Expectation Maximization algorithm to learn them.
One-class Support Vector Machine (SVM) - an extension of the standard SVM classifier which tries to find a boundary that separates the normal samples from the unknown novel samples (in the classic approach, the boundary is found by maximizing the margin between the normal samples and the origin of the space, projected to the so-called "feature space"). ScikitLearn has an implementation of one-class SVM which allows you to use it easily, and a nice example that plots the boundary one-class SVM finds "around" the normal data samples.
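As a small illustration of the distance-based idea from the list above, here is a sketch of a nearest-neighbor anomaly score in plain Python, applied to the 2D example from the question. The function name `knn_score` and the choice k = 1 are assumptions for illustration; this is not the ScikitLearn API. The score is the mean Euclidean distance to the k nearest training points, so it is 0 for a training point and grows without bound (it is not scaled to the 0..1 range the question asks for).

```python
import math


def knn_score(point, train, k=1):
    """Mean Euclidean distance from point to its k nearest training samples."""
    dists = sorted(math.dist(point, t) for t in train)
    return sum(dists[:k]) / k


train = [(0, 1), (0, 2), (0, 3)]  # positive samples from the question

for test_point in [(0, 2), (1, 0), (0, 10), (1, 10)]:
    print(test_point, knn_score(test_point, train))
```

Note that with a plain Euclidean distance, (0, 10) scores as more anomalous than (1, 0), which is the opposite of the ranking wished for in the question. Getting that ranking would need a distance that penalizes deviations along the first feature more heavily, for example by scaling each feature by its spread in the training data.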

Suggested unsupervised feature selection / extraction method for 2 class classification?

I've got a set of F features, e.g. Lab color space, entropy. By concatenating all features together, I obtain a feature vector of dimension d (between 12 and 50, depending on which features are selected).
I usually get between 1000 and 5000 new samples, denoted x. A Gaussian Mixture Model is then trained with the vectors, but I don't know which class the features are from. What I know though, is that there are only 2 classes. Based on the GMM prediction I get a probability of that feature vector belonging to class 1 or 2.
My question now is: How do I obtain the best subset of features, for instance only entropy and normalized rgb, that will give me the best classification accuracy? I guess this is achieved, if the class separability is increased, due to the feature subset selection.
Maybe I can utilize Fisher's linear discriminant analysis? Since I already have the mean and covariance matrices obtained from the GMM. But wouldn't I have to calculate the score for each combination of features then?
It would be nice to get some help if this is an unrewarding approach and I'm on the wrong track, and/or any other suggestions.
One way of finding "informative" features is to use the features that will maximise the log likelihood. You could do this with cross validation.
https://www.cs.cmu.edu/~kdeng/thesis/feature.pdf
Another idea might be to use another unsupervised algorithm that automatically selects features, such as a clustering forest
http://research.microsoft.com/pubs/155552/decisionForests_MSR_TR_2011_114.pdf
In that case the clustering algorithm will automatically split the data based on information gain.
Fisher LDA will not select features but project your original data into a lower dimensional subspace. If you are looking into the subspace method, another interesting approach might be spectral clustering, which also happens in a subspace, or unsupervised neural networks such as autoencoders.

how to use weight when training a weak learner for adaboost

The following is adaboost algorithm:
It mentions "using weights wi on the training data" at part 3.1.
I am not very clear about how to use the weights. Should I resample the training data?
It depends on what classifier you are using.
If your classifier can take instance weight (weighted training examples) into account, then you don't need to resample the data. An example classifier could be naive bayes classifier that accumulates weighted counts or a weighted k-nearest-neighbor classifier.
Otherwise, you want to resample the data using the instance weights, i.e., instances with larger weights may be sampled multiple times, while instances with small weights might not even appear in the training data. Most of the other classifiers fall in this category.
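The resampling option can be sketched with the standard library alone. The helper name `resample_by_weight` and the toy samples are made up for illustration; `random.choices` draws with replacement, with probability proportional to the given weights.

```python
import random


def resample_by_weight(samples, weights, rng):
    """Draw len(samples) items with replacement, proportionally to their weights."""
    return rng.choices(samples, weights=weights, k=len(samples))


rng = random.Random(0)  # fixed seed so that runs are repeatable
samples = ["a", "b", "c", "d"]
weights = [0.7, 0.1, 0.1, 0.1]  # "a" was misclassified often, so it is weighted up

drawn = resample_by_weight(samples, weights, rng)
print(drawn)
```

The weak learner is then trained on `drawn` as if it were an ordinary unweighted training set.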
In Practice
Actually, in practice boosting performs better if you only rely on a pool of very naive classifiers, e.g., decision stumps or linear discriminants. In this case, the algorithm you listed has an easy-to-implement form (see here for details):
Where alpha is chosen by (epsilon is defined similarly as yours).
An Example
Define a two-class problem in the plane (for example, a circle of points
inside a square) and build a strong classifier out of a pool of randomly
generated linear discriminants of the type sign(ax1 + bx2 + c).
The two class labels are represented with red crosses and blue dots. We here are using a bunch of linear discriminants (yellow lines) to construct the pool of naive/weak classifiers. We generate 1000 data points for each class in the graph (inside the circle or not) and 20% of data is reserved for testing.
This is the classification result (in the test dataset) I got, in which I used 50 linear discriminants. The training error is 1.45% and the testing error is 2.3%
The weights are the values applied to each example (sample) in step 2. These weights are then updated at step 3.3 (wi).
So initially all weights are equal (step 2), and they are increased for wrongly classified data and decreased for correctly classified data. So in step 3.1 you have to take these values into account to determine a new classifier, giving more importance to higher weight values. If you did not change the weights, you would produce exactly the same classifier each time you execute step 3.1.
These weights are only used for training purpose, they're not part of the final model.
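To show how the weights enter the training of each weak learner, here is a minimal AdaBoost sketch in plain Python with one-dimensional decision stumps. It is an illustration under stated assumptions, not the exact algorithm from the question: the weak learner is a threshold stump chosen by minimum weighted error, and alpha uses the common formulation 0.5 * ln((1 - epsilon) / epsilon). All function names and the toy dataset are made up.

```python
import math


def stump_predict(x, threshold, polarity):
    """Return polarity for x below the threshold and -polarity otherwise."""
    return polarity if x < threshold else -polarity


def best_stump(xs, ys, weights):
    """Pick the stump with the smallest weighted error (this is where weights are used)."""
    values = sorted(set(xs))
    midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
    thresholds = [values[0] - 0.5] + midpoints + [values[-1] + 0.5]
    best = None
    for thr in thresholds:
        for pol in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, thr, pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n  # all weights start equal
    ensemble = []
    for _ in range(rounds):
        err, thr, pol = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # assumed standard choice of alpha
        ensemble.append((alpha, thr, pol))
        # Increase the weights of misclassified points, decrease the others.
        weights = [w * math.exp(-alpha * y * stump_predict(x, thr, pol))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * stump_predict(x, thr, pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(xs, ys, rounds=3)
errors = sum(predict(ensemble, x) != y for x, y in zip(xs, ys))
print(errors)
```

Only the trained stumps and their alpha values are kept in `ensemble`; the sample weights are discarded after training.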

What's the meaning of logistic regression dataset labels?

I've been learning logistic regression for some days, and I think the labels of a logistic regression dataset need to be 1 or 0. Is that right?
But when I look at the regression datasets of the libSVM library, I see the label values are continuous numbers (e.g. 1.0086, 1.0089 ...). Did I miss something?
Note that the libSVM library could be used for regression problem.
Thanks so much !
Contrary to its name, logistic regression is a classification algorithm and it outputs class probability conditioned on the data point. Therefore the training set labels need to be either 0 or 1. For the dataset you mentioned, logistic regression is not a suitable algorithm.
SVM is a classification algorithm and it uses the input labels -1 or 1. It is not a probabilistic algorithm and it doesn't output class probabilities. It also can be adapted to regression.
Are you using a 3rd party library or programming this yourself? Generally the labels are used as ground truth so you can see how effective your approach was.
For example if your algo is trying to predict what a particular instance is it might output -1, the ground truth label will be +1 which means you did not successfully classify that particular instance.
Note that "regression" is a general term. To say someone will perform regression analysis doesn't necessarily tell you what algorithm they will be using, nor all of the nature of the data sets. All it really tells you is that you have a set of samples with features which you want to use to predict a single outcome value (a model for conditional probability).
One major difference between logistic regression and linear regression is that the former is usually trained on categorical, binary-labeled sample sets; while the latter is trained on real-labeled (ℝ) sample sets.
Any time your labels are real valued, it means you're probably going to use linear regression or similar, or else convert those real valued labels to categorical labels (e.g. via thresholds or bins) if you want to in fact use logistic regression. There is potentially a big difference in the quality and interpretation of your results though, if you try to convert from one such problem setup to another.
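The thresholding mentioned above can be sketched as follows. The first two label values come from the question; the other two values and the threshold of 1.0 are hypothetical, chosen only to show the conversion from real-valued labels to the 0/1 labels that logistic regression expects.

```python
def binarize(labels, threshold):
    """Map real-valued labels to 1 (at or above the threshold) or 0 (below it)."""
    return [1 if value >= threshold else 0 for value in labels]


real_labels = [1.0086, 1.0089, 0.9950, 1.0120]  # last two values are made up
binary_labels = binarize(real_labels, threshold=1.0)
print(binary_labels)
```

The choice of threshold changes what the classifier is actually predicting, which is the change in interpretation the answer warns about.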
See also Regression Analysis.

Resources