I have a categorical variable, say Building Type, with possible values (Residential, Commercial, Industry, Special Buildings). I applied SVM to the dataset and tried to predict the value of a population class (High, Med, Low) based on building type. How are the accuracy measures and F-scores calculated in my case? Is accuracy a good metric for a categorical variable with more than two values?
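For concreteness, here is a minimal sketch of how these metrics could be computed, assuming scikit-learn (the library choice and the toy labels are illustrative, not from the question):

from sklearn.metrics import accuracy_score, f1_score, classification_report

# Hypothetical ground-truth and predicted population classes
y_true = ["High", "Med", "Low", "Low", "High", "Med"]
y_pred = ["High", "Low", "Low", "Low", "Med", "Med"]

# Accuracy: fraction of exact matches, however many classes there are
print(accuracy_score(y_true, y_pred))             # 4 of 6 correct -> 0.666...

# F1 computed per class and then averaged; "macro" weights all classes equally
print(f1_score(y_true, y_pred, average="macro"))

# Per-class precision, recall, and F1 in one table
print(classification_report(y_true, y_pred))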
For a very simple classification problem where I have a target vector [0,0,0,...,0] and a prediction vector [0,0.1,0.2,...,1], would cross-entropy loss converge better/faster, or would MSE loss?
When I plot them, it seems to me that MSE loss has a lower error margin. Why would that be?
Or, for example, when I have the target as [1,1,1,...,1] I get the following (plot not shown):
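As a quick numeric check on the first scenario, a minimal sketch (NumPy assumed; the vectors are the ones from the question, with the predictions treated as probabilities):

import numpy as np

y_true = np.zeros(11)                  # target vector [0, 0, ..., 0]
y_pred = np.linspace(0.0, 1.0, 11)     # prediction vector [0, 0.1, ..., 1]

# MSE: mean of squared differences
mse = np.mean((y_true - y_pred) ** 2)

# Binary cross-entropy, clipping to avoid log(0)
eps = 1e-12
p = np.clip(y_pred, eps, 1 - eps)
bce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(mse, bce)  # cross-entropy punishes the confident wrong prediction (1.0) far harder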
As a complement to the accepted answer, I will answer the following questions:
What is the interpretation of MSE loss and cross-entropy loss from a probability perspective?
Why is cross-entropy used for classification and MSE for linear regression?
TL;DR: Use MSE loss if the (random) target variable is drawn from a Gaussian distribution, and use categorical cross-entropy loss if the (random) target variable is drawn from a Multinomial distribution.
MSE (Mean squared error)
One of the assumptions of linear regression is multivariate normality. From this it follows that the target variable is normally distributed (more on the assumptions of linear regression can be found here and here).
A Gaussian distribution (normal distribution) with mean $\mu$ and variance $\sigma^2$ is given by

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Often in machine learning we deal with a distribution with mean 0 and variance 1 (or we transform our data to have mean 0 and variance 1). In this case the normal distribution is

$$N(x \mid 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

This is called the standard normal distribution.
For a normal-distribution model with weight parameter $\mathbf{w}$ and precision (inverse variance) parameter $\beta$, the probability of observing a single target $t$ given input $\mathbf{x}$ is expressed by the following equation

$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = N(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$$

where $y(\mathbf{x}, \mathbf{w})$ is the mean of the distribution and is calculated by the model as

$$y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \mathbf{x}$$

Now the probability of a target vector $\mathbf{t} = (t_1, \dots, t_N)$ given the inputs can be expressed by

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} N(t_n \mid y(\mathbf{x}_n, \mathbf{w}), \beta^{-1})$$

Taking the natural logarithm of both sides yields

$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)$$

where the left-hand side is the log-likelihood of the normal model. Often, training a model involves optimizing the likelihood function with respect to $\mathbf{w}$. The maximum likelihood solution for the parameter $\mathbf{w}$ is then given by minimizing (constant terms with respect to $\mathbf{w}$ can be omitted)

$$\sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2$$

For training the model, omitting the constants doesn't affect the convergence. This is called the squared error, and taking the mean yields the mean squared error:

$$\text{MSE} = \frac{1}{N}\sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2$$
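A minimal numeric sketch of this equivalence (NumPy assumed; the data and one-weight model are made up for illustration). Up to constants, the Gaussian negative log-likelihood and the squared error are minimized at the same point:

import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)
t = 2.0 * x + rng.normal(scale=0.5, size=N)   # targets with Gaussian noise

beta = 1.0 / 0.25                             # precision = 1 / variance

def neg_log_likelihood(w):
    # Gaussian NLL: beta/2 * squared error, plus terms constant in w
    return 0.5 * beta * np.sum((w * x - t) ** 2) - 0.5 * N * np.log(beta) + 0.5 * N * np.log(2 * np.pi)

def squared_error(w):
    return np.sum((w * x - t) ** 2)

# Both objectives are minimized by the same w
ws = np.linspace(1.0, 3.0, 201)
print(ws[np.argmin([neg_log_likelihood(w) for w in ws])])
print(ws[np.argmin([squared_error(w) for w in ws])])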
Cross entropy
Before going into the more general cross-entropy function, I will explain a specific type of cross-entropy: binary cross-entropy.
Binary cross entropy
The assumption behind binary cross-entropy is that the probability distribution of the target variable is drawn from a Bernoulli distribution. According to Wikipedia:
the Bernoulli distribution is the discrete probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q = 1 - p
The probability mass function of a Bernoulli random variable is given by

$$P(X = x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases}$$

where $x \in \{0, 1\}$ and $p$ is the probability of success. This can be simply written as

$$P(X = x) = p^x (1-p)^{1-x}$$

Taking the negative natural logarithm of both sides yields

$$-\ln P(X = x) = -x \ln p - (1-x)\ln(1-p)$$

This is called the binary cross-entropy.
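A minimal sketch, assuming NumPy and made-up predicted probabilities, showing that the averaged negative Bernoulli log-likelihood is exactly the binary cross-entropy:

import numpy as np

t = np.array([1, 0, 1, 1, 0])             # binary targets
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted P(t = 1)

# Bernoulli likelihood of each observation: p^t * (1-p)^(1-t)
likelihood = p**t * (1 - p)**(1 - t)

# The negative mean log-likelihood ...
nll = -np.mean(np.log(likelihood))

# ... equals the usual binary cross-entropy formula
bce = -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))
print(nll, bce)  # identical values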
Categorical cross entropy
The generalization of the cross-entropy follows the general case where the random variable is multivariate (i.e. drawn from a Multinomial distribution) with the following probability distribution

$$P(X = \mathbf{x}) = \prod_{k=1}^{K} p_k^{x_k}$$

where $\mathbf{x} = (x_1, \dots, x_K)$ is a one-hot vector over the $K$ classes and $p_k$ is the probability of class $k$. Taking the negative natural logarithm of both sides yields the categorical cross-entropy loss

$$-\ln P(X = \mathbf{x}) = -\sum_{k=1}^{K} x_k \ln p_k$$
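And a matching sketch for the categorical case (NumPy assumed; the class probabilities are made up), with a one-hot target:

import numpy as np

t = np.array([0, 0, 1])         # one-hot target: the true class is class 2
p = np.array([0.2, 0.3, 0.5])   # predicted class probabilities (sum to 1)

# Multinomial (single-draw) likelihood: prod_k p_k^{x_k} picks out the true class's probability
likelihood = np.prod(p**t)

# Its negative log equals the categorical cross-entropy
print(-np.log(likelihood))      # -ln(0.5)
print(-np.sum(t * np.log(p)))   # same value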
You sound a little confused...
Comparing the values of MSE & cross-entropy loss and saying that one is lower than the other is like comparing apples to oranges
MSE is for regression problems, while cross-entropy loss is for classification ones; these contexts are mutually exclusive, hence comparing the numerical values of their corresponding loss measures makes no sense
When your prediction vector is like [0,0.1,0.2,...,1] (i.e. with non-integer components), as you say, the problem is a regression (and not a classification) one; in classification settings, we usually use one-hot encoded target vectors, where only one component is 1 and the rest are 0 (see the sketch after this answer)
A target vector of [1,1,1,1....1] could be the case either in a regression setting, or in a multi-label multi-class classification, i.e. where the output may belong to more than one class simultaneously
On top of these, your plot choice, with the percentage (?) of predictions on the horizontal axis, is puzzling: I have never seen such plots in ML diagnostics, and I am not quite sure what exactly they represent or why they would be useful...
If you would like a detailed discussion of cross-entropy loss & accuracy in classification settings, you may have a look at this answer of mine.
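To make the one-hot encoding point concrete, a minimal sketch (NumPy assumed; the integer labels are made up):

import numpy as np

y = np.array([1, 2, 0, 2])   # integer class labels for a 3-class problem
one_hot = np.eye(3)[y]       # row n is the one-hot target vector for sample n
print(one_hot)
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [0. 0. 1.]]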
I tend to disagree with the previously given answers. The point is that the cross-entropy and MSE losses come from the same place: both are negative log-likelihoods, just under different distributional assumptions.
Modern NNs learn their parameters using maximum likelihood estimation (MLE) over the parameter space. The maximum likelihood estimator is the argmax, over the parameters, of the product of the probabilities the model assigns to the training examples. If we apply a log transformation and scale by the number of training examples, the sum becomes an expectation over the empirical distribution defined by the training data.
Furthermore, we can assume different conditional distributions for the target, e.g. Gaussian or Bernoulli, which yield either the MSE loss or the negative log-likelihood of the sigmoid output, respectively.
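In symbols (this is the standard presentation from the Deep Learning book; $m$ is the number of training examples and $\hat{p}_{\text{data}}$ the empirical distribution):

$$\theta_{ML} = \arg\max_{\theta} \prod_{i=1}^{m} p_{\text{model}}(y^{(i)} \mid x^{(i)}; \theta) = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}(y^{(i)} \mid x^{(i)}; \theta)$$

Dividing by $m$ turns the sum into an expectation over the empirical distribution:

$$\theta_{ML} = \arg\max_{\theta} \mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x; \theta)$$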
For further reading:
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, "Deep Learning"
A simple answer to your first question:
For a very simple classification problem ... would cross-entropy loss converge better/faster or would MSE loss?
is that MSE loss, when combined with sigmoid activation, will result in a non-convex cost function with multiple local minima. This is explained by Prof. Andrew Ng in his lecture:
Lecture 6.4 — Logistic Regression | Cost Function — [ Machine Learning | Andrew Ng]
I imagine the same applies to multiclass classification with softmax activation.
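A minimal sketch of this effect under assumptions of my own (NumPy, a single weight, and a contrived one-feature dataset chosen so the effect is visible):

import numpy as np

x = np.array([-4.0, -3.0, 2.0, 5.0])    # one feature, four samples (contrived)
t = np.array([1.0, 1.0, 0.0, 1.0])      # labels chosen so the non-convexity shows

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

ws = np.linspace(-6, 6, 601)
mse = [np.mean((sigmoid(w * x) - t) ** 2) for w in ws]
ce = [-np.mean(t * np.log(sigmoid(w * x) + 1e-12)
               + (1 - t) * np.log(1 - sigmoid(w * x) + 1e-12)) for w in ws]
# Plotting ws against mse and ce shows the MSE curve with a local bump and flat
# plateaus where gradients vanish, while the cross-entropy curve is convex in w.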
I am a newbie to using SVM for classification. I want to tune the SVM parameters with the .TrainAuto function in EmguCV, but I don't know what ranges (min-max values) of the parameters below I should give this function to search:
1- C (for poly and RBF kernels)
2- Gamma (for poly and RBF kernels)
3- Coefficient (for poly kernel)
4- Degree (for poly kernel)
What are the ranges of these parameters?
Do these ranges depend on the number of samples?
Since I receive an out-of-memory allocation error, can I first set the step parameter to a large value and then, once I have found approximately optimal values, narrow the ranges and search again with a smaller step?
In general, you should consider the RBF kernel first, unless you have a good reason not to. Here are some guides on how to choose the parameters C and gamma:
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
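The usual approach from those guides is a coarse-to-fine search over exponentially spaced values. A minimal sketch of the idea (in Python with scikit-learn rather than EmguCV, purely for illustration; the digits dataset and grid bounds are arbitrary choices):

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)

# Coarse grid of exponentially spaced values, as recommended in the LIBSVM guide
param_grid = {
    "C": [2**k for k in range(-5, 16, 2)],       # 2^-5 ... 2^15
    "gamma": [2**k for k in range(-15, 4, 2)],   # 2^-15 ... 2^3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)

# A second, finer grid around search.best_params_ with a smaller step refines the
# result: the same coarse-to-fine strategy the question asks about.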
I have a classification task, and I use the svm_perf application.
The question is: having trained the model, I wonder whether it's possible to get the weights of the features.
There is an -a parameter which outputs the alphas, but honestly I don't recall alphas in SVM; I thought the weights are always w.
If you are using a linear SVM, there is a Python script that computes the weights from the model file output by svm_learn and svm_perf_learn. To be more specific, the weight vector is just w = SUM_i (alpha_i * y_i * sv_i), where sv_i is the i-th support vector and y_i is its category (label) from the training sample.
If you are using a non-linear SVM, I don't think the weight coefficients are directly related to the input space. Yet you can still get the decision function:
f(x) = sgn( SUM_i (alpha_i*y_i*K(sv_i,x)) + b );
where K is your kernel function.
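For the linear case, a minimal sketch of the same computation (using scikit-learn instead of svm_perf, purely to illustrate the formula; sklearn's dual_coef_ already stores alpha_i * y_i):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

# w = SUM_i (alpha_i * y_i * sv_i); dual_coef_ holds alpha_i * y_i
w = clf.dual_coef_ @ clf.support_vectors_
print(w)
print(clf.coef_)   # sklearn's own weight vector, identical to w above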
The OpenCV SVM implementation takes a parameter labeled "SVM type", which must be set in the CvSVMParams structure used when training the SVM. All the explanation I can find is:
// SVM type
enum { C_SVC=100, NU_SVC=101, ONE_CLASS=102, EPS_SVR=103, NU_SVR=104 };
Anyone know what these different values represent?
They are different formulations of SVM. At the heart of SVM is a mathematical optimization problem, and this problem can be stated in different ways.
C-SVM uses C as the trade-off parameter between the size of the margin and the number of training points that are misclassified. C is just a number; the useful range depends on the dataset, and it can go from very small (like 10^-5) to very large (like 10^5), depending on your data.
nu-SVM uses nu instead of C. nu is roughly the fraction of training points that will end up as support vectors. The more support vectors, the wider your margin is and the more training points will be misclassified. nu typically ranges from 0.1 to 0.8: at 0.1 roughly 10% of training points will be support vectors, at 0.8 more like 80%. I say roughly because it is only correlated that way; it's not exact.
epsilon-SVR and nu-SVR use SVM for regression. Instead of doing binary classification by finding a maximum-margin hyperplane, the same concept is used to find a hypertube that best fits the data, in order to use it to predict future values. They differ in the way they are parameterized (just as nu-SVM and C-SVM differ).
One-class SVM is novelty detection. Rather than doing binary classification or predicting a value, you give the SVM a training set and it attempts to train a model that wraps around that set, so that a future instance can be classified as inside the class or outside it (novel, or an outlier).
In general:
Classification SVM Type 1 (also known as C-SVM classification)
Classification SVM Type 2 (also known as nu-SVM classification)
Regression SVM Type 1 (also known as epsilon-SVM regression)
Regression SVM Type 2 (also known as nu-SVM regression)
Details can be found on the SVM page.
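For reference, a minimal sketch of the same five formulations in scikit-learn (purely illustrative; the OpenCV enum values map onto these estimators conceptually, not through any shared API):

from sklearn import svm

# The five formulations, keyed by their OpenCV enum names
models = {
    "C_SVC": svm.SVC(C=1.0),               # C-SVM classification
    "NU_SVC": svm.NuSVC(nu=0.5),           # nu-SVM classification
    "ONE_CLASS": svm.OneClassSVM(nu=0.5),  # novelty/outlier detection
    "EPS_SVR": svm.SVR(epsilon=0.1),       # epsilon-SVM regression
    "NU_SVR": svm.NuSVR(nu=0.5),           # nu-SVM regression
}
for name, model in models.items():
    print(name, "->", type(model).__name__)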