Selecting parameters in libsvm - machine-learning

I started working with libsvm one week ago.
I cannot find much information about the parameters to libsvm.
I would like to better understand how I should select these parameters.
Can somebody explain, in simple language, what each parameter means?
-d degree
-g gamma
-r coef0
-c cost
-n nu
-p epsilon
-m cachesize
-e epsilon
-h shrinking
-b probability_estimates
-wi weight
In this case (for my vectors), which parameter values would be best?

The parameters are described on the main libsvm page. There are links to several papers that show the mathematical usage of the Greek-letter variables, including A Practical Guide to Support Vector Classification.
In general, you leave the parameter values at their defaults. Then you tweak them, one at a time, to see how the changes affect your desired characteristics.
To help with a few of the main parameters:
degree ... This is the degree (highest exponent) of a polynomial kernel function. The kernel is a transformation applied to your data points in an effort to get a more accurate linear division between the classes. A high degree will lead to over-fitting; a low degree loses accuracy.
gamma & r ... Leading coefficient and constant (bias) parameters of the kernel.
-e epsilon ... Convergence tolerance; a smaller value will take more iterations to converge.
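If it helps to see where those flags end up, here is a minimal sketch assuming Python with scikit-learn (whose SVC class wraps libsvm); the flag-to-argument mapping in the comments is approximate and the values are placeholders, not recommendations:

```python
# Minimal sketch (assumes Python + scikit-learn, whose SVC wraps libsvm).
from sklearn.svm import SVC

clf = SVC(
    kernel="poly",        # polynomial kernel, so that degree/coef0 matter
    degree=3,             # -d degree
    gamma=0.5,            # -g gamma
    coef0=0.0,            # -r coef0
    C=1.0,                # -c cost
    tol=1e-3,             # -e epsilon (optimizer stopping tolerance)
    cache_size=200,       # -m cachesize (MB)
    shrinking=True,       # -h shrinking
    probability=False,    # -b probability_estimates
    class_weight=None,    # -wi weight (e.g. {1: 10} to up-weight class 1)
)
# -p (SVR tube epsilon) and -n (nu) belong to the regression / nu variants
# (SVR, NuSVC, NuSVR) rather than plain C-SVC.
# clf.fit(X_train, y_train) once you have a scaled feature matrix and labels.
```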
Overall, the paper gives you a good sequence of suggestions for developing an SVM model. I suggest that you work your way through these steps and post again when you have a specific programming problem with what you're doing.
As far as choosing parameters for your data set, we cannot give you a good starting set without analysing the data for "shape" and span, as well as knowing what results you need. In short, we'd have to know more from you, and then step through the work you need to do as model developer.
Do these suggestions get you moving in the right direction?


interpret statistical model metrics

Do you know how to interpret RAE and RSE values? I know a COD closer to 1 is a good sign. Does this indicate that boosted decision tree regression is best?
RAE and RSE closer to 0 are a good sign...you want the error to be as low as possible. See this article for more information on evaluating your model. From that page:
The term "error" here represents the difference between the predicted value and the true value. The absolute value or the square of this difference are usually computed to capture the total magnitude of error across all instances, as the difference between the predicted and true value could be negative in some cases. The error metrics measure the predictive performance of a regression model in terms of the mean deviation of its predictions from the true values. Lower error values mean the model is more accurate in making predictions. An overall error metric of 0 means that the model fits the data perfectly.
Yes, with your current results, the boosted decision tree performs best. I don't know the details of your work well enough to determine whether that is good enough; it honestly may be. But if you determine it's not, you can also tweak the input parameters in your "Boosted Decision Tree Regression" module to try to get even better results. The "ParameterSweep" module can help with that by trying many different input parameters for you; you specify the metric you want to optimize (such as the RAE, RSE, or COD referenced in your question). See this article for a brief description. Hope this helps.
P.S. I'm glad that you're looking into the black carbon levels in Westeros...I'm sure Cersei doesn't even care.

Are high values for c or gamma problematic when using an RBF kernel SVM?

I'm using WEKA/LibSVM to train a classifier for a term extraction system. My data is not linearly separable, so I used an RBF kernel instead of a linear one.
I followed the guide from Hsu et al. and iterated over several values for both c and gamma. The parameters which worked best for classifying known terms (test and training material differ of course) are rather high, c=2^10 and gamma=2^3.
So far the high parameters seem to work ok, yet I wonder if they may cause any problems further on, especially regarding overfitting. I plan to do another evaluation by extracting new terms, yet those are costly as I need human judges.
Could anything still be wrong with my parameters, even if both evaluations turn out positive? Do I perhaps need another kernel type?
Thank you very much!
In general, you have to perform cross-validation to answer whether the parameters are all right or whether they lead to overfitting.
From the "intuition" perspective - it seems like highly overfitted model. High value of gamma means that your Gaussians are very narrow (condensed around each poinT) which combined with high C value will result in memorizing most of the training set. If you check out the number of support vectors I would not be surprised if it would be the 50% of your whole data. Other possible explanation is that you did not scale your data. Most ML methods, especially SVM, requires data to be properly preprocessed. This means in particular, that you should normalize (standarize) the input data so it is more or less contained in the unit sphere.
RBF seems like a reasonable choice, so I would keep using it. A high value of gamma is not necessarily a bad thing; it depends on the scale on which your data lives. While a high C value can lead to overfitting, it is also affected by the scale, so in some cases it might be just fine.
If you think that your dataset is a good representation of the whole data, then you can use cross-validation to test your parameters and have some peace of mind.
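For the cross-validation itself, a sketch along the lines of the Hsu et al. guide might look like this (again assuming Python with scikit-learn; the grid bounds are only illustrative):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Scale inside the pipeline so each CV fold is scaled on its own training part.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [2**k for k in range(-5, 16, 2)],
    "svc__gamma": [2**k for k in range(-15, 4, 2)],
}
search = GridSearchCV(pipe, param_grid, cv=5)
# search.fit(X, y)
# print(search.best_params_, search.best_score_)
```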

Estimating parameters in multivariate classification

Newbie here typesetting my question, so excuse me if this doesn't work.
I am trying to build a Bayesian classifier for a multivariate classification problem where the input is assumed to have a multivariate normal distribution. I chose to use a discriminant function defined as log(likelihood * prior).
However, from the distribution,
$$f(x \mid \mu,\Sigma) = (2\pi)^{-Nd/2}\det(\Sigma)^{-N/2}\exp\!\left[-\tfrac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)\right]$$
I encounter a term $-\log(\det(S_i))$, where $S_i$ is my sample covariance matrix for a specific class $i$. Since my input actually represents square image data, my $S_i$ exhibits quite a lot of correlation, which results in $\det(S_i)$ being zero. Then my discriminant functions all turn to Inf, which is disastrous for me.
I know there must be a lot of things going wrong here; is anyone willing to help me out?
UPDATE: Can anyone help me get the formula working?
I won't analyze the concept, as it is not very clear to me what you are trying to accomplish here, and I do not know the dataset, but regarding the problem with the covariance matrix:
The most obvious solution when you need a covariance matrix and its determinant, but computing them is not numerically feasible, is to use some kind of dimensionality reduction technique to capture the most informative dimensions and simply discard the rest. One such method is Principal Component Analysis (PCA), which, applied to your data and truncated after, for example, 5-20 dimensions, yields a reduced covariance matrix with a non-zero determinant.
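A minimal sketch of that idea, assuming Python with numpy/scikit-learn and a made-up matrix X_i of images for class i (one flattened image per row):

```python
import numpy as np
from sklearn.decomposition import PCA

X_i = np.random.randn(100, 64 * 64)       # placeholder: 100 images of 64x64 pixels

pca = PCA(n_components=20)                # keep e.g. 5-20 informative dimensions
Z_i = pca.fit_transform(X_i)

S_i = np.cov(Z_i, rowvar=False)           # reduced covariance matrix
sign, logdet = np.linalg.slogdet(S_i)     # log-determinant, now finite
print(logdet)
```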
PS. It may be a good idea to post this question on Cross Validated
You probably do not have enough data to infer the parameters in a space of dimension d. Typically, the way you get around this is to take a MAP estimate instead of an ML estimate.
For the multivariate normal, the corresponding prior is a normal-inverse-Wishart distribution. The MAP estimate adds the matrix parameter of the inverse-Wishart distribution to the ML covariance matrix estimate and, if chosen correctly, gets rid of the singularity problem.
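A rough sketch of that idea, assuming Python with numpy and treating the mean as known (the full normal-inverse-Wishart treatment is a bit more involved); the prior parameters Psi and nu0 are illustrative choices, not values derived from the question:

```python
import numpy as np

X = np.random.randn(50, 100)              # placeholder: n=50 samples, d=100 dims
n, d = X.shape
mu = X.mean(axis=0)
scatter = (X - mu).T @ (X - mu)           # n * (ML covariance estimate)

Psi = 0.1 * np.eye(d)                     # inverse-Wishart scale matrix (assumed prior)
nu0 = d + 2                               # inverse-Wishart degrees of freedom (assumed prior)

Sigma_map = (Psi + scatter) / (n + nu0 + d + 1)   # posterior mode of the covariance
sign, logdet = np.linalg.slogdet(Sigma_map)
print(sign, logdet)                        # finite even though n < d
```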
If you are actually trying to create a classifier for normally distributed data, and not just doing an experiment, then a better way to do this would be with a discriminative method. The decision boundary for a multivariate normal is quadratic, so just use a quadratic kernel in conjunction with an SVM.
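For the discriminative route, a quadratic decision boundary is just a degree-2 polynomial kernel; a one-line sketch assuming Python with scikit-learn (X_train, y_train are placeholders):

```python
from sklearn.svm import SVC

# Degree-2 polynomial kernel gives a quadratic decision boundary;
# scikit-learn's QuadraticDiscriminantAnalysis is the generative counterpart for comparison.
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0)
# clf.fit(X_train, y_train)
# clf.predict(X_test)
```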

machine learning: how to generate a regression model that outputs a multivariate response instead of a univariate one?

Given D=(x,y), y=F(x), it seems most machine learning methods only output y as a univariate value, either a label or a real number. But I am facing a situation where the x vector may only have 5~9 dimensions while I need y to be a multinomial distribution vector that can have up to 800 dimensions. This makes the problem really tricky.
I looked into a lot of multitask learning (MTL) methods, where I could train all these y_i at the same time. And of course, another naive way is to train all these dimensions separately without considering the linkage between tasks. But the problem is that, after reviewing many papers, it seems most MTL experiments only deal with 10~30 tasks, which means 800 tasks could be far too many to train. Maybe clustering could be a solution, but I am really curious whether anyone can suggest other ways to deal with this problem, not from an MTL perspective.
When the input is so "small" and the output so big, I would expect there to be a different representation of those output values. You could analyze whether they are a linear or nonlinear combination of some sort, so as to estimate the "function parameters" instead of the values themselves. Example: we once estimated a time series that could be "reduced" to a weighted sum of normal distributions, so we only had to estimate the weights and parameters.
In the end you will only reach a 6-to-12-dimensional subspace in some sense (probably not a linear one) when you have only 6 input parameters. It can of course be a bit complicated, but to avoid the chaos of an 800-dimensional space I would really look into parametrizing the result; a sketch of one way to do this is below.
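One concrete way to parametrize the result, sketched under the assumption of Python with scikit-learn and made-up data shapes (n x 6 inputs, n x 800 distribution targets): compress the targets to a handful of components, regress those, and reconstruct.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

X = np.random.randn(500, 6)               # placeholder inputs
Y = np.random.rand(500, 800)
Y = Y / Y.sum(axis=1, keepdims=True)       # placeholder "distribution" targets

pca = PCA(n_components=10)                 # 800-dim outputs -> 10 latent parameters
Z = pca.fit_transform(Y)

reg = Ridge().fit(X, Z)                    # Ridge handles multi-output targets directly

Y_hat = pca.inverse_transform(reg.predict(X))
Y_hat = np.clip(Y_hat, 0, None)
Y_hat = Y_hat / Y_hat.sum(axis=1, keepdims=True)   # renormalize to a distribution
```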
And, as I commented, the machine learning methods that I know of do produce vectors. http://en.wikipedia.org/wiki/Bayes_estimator

Is there a way to use SVM to build a model that contains a pre-defined percentage of the training data?

Say, I wish to use LIBSVM to build a model that contains 70% of the training data. Is that possible?
There are no techniques that allow you to specify in advance exactly how many support vectors the model will contain. A possible exception is the fixed-size formulation of least-squares SVM, which allows you to specify the size of the kernel in advance (and for LS-SVM every training instance is an SV).
Note that a fraction of 70% support vectors is very high for a typical SVM in most cases, so I can't see an immediate reason why you would want this.
You can, however, specify a minimum fraction of support vectors using formulations like the nu-SVM by Schölkopf et al:
B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1207-1245, 2000.
In this formulation you will have at least a fraction nu of support vectors (with 0 < nu <= 1). nu-SVMs are implemented in LIBSVM, for example (use -s 1 for nu-SVC or -s 4 for nu-SVR). For more information, you may refer to this pdf (page 4) or google nu-SVM to find various papers about it.
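A minimal sketch, assuming Python with scikit-learn (whose NuSVC wraps libsvm's -s 1); the data here is random and only serves to show the support-vector fraction check:

```python
import numpy as np
from sklearn.svm import NuSVC

X = np.random.randn(300, 5)                      # placeholder features
y = np.array([0] * 150 + [1] * 150)              # balanced placeholder labels

clf = NuSVC(nu=0.7, kernel="rbf").fit(X, y)      # roughly equivalent to svm-train -s 1 -n 0.7
print(len(clf.support_) / len(y))                # should come out >= 0.7
```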
If you look in the tools folder of libsvm you will find a python script called subset.py. You can use it to randomly select a subset of your data for training.
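If you would rather not rely on subset.py's exact interface, the same random subsetting is only a few lines of plain Python (the file names here are hypothetical):

```python
import random

with open("train.libsvm") as f:
    lines = f.readlines()

subset = random.sample(lines, k=int(0.7 * len(lines)))   # keep 70% of the rows

with open("train_subset.libsvm", "w") as f:
    f.writelines(subset)
```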
