Are high values for c or gamma problematic when using an RBF kernel SVM? - machine-learning

I'm using WEKA/LibSVM to train a classifier for a term extraction system. My data is not linearly separable, so I used an RBF kernel instead of a linear one.
I followed the guide from Hsu et al. and iterated over several values for both c and gamma. The parameters which worked best for classifying known terms (test and training material differ of course) are rather high, c=2^10 and gamma=2^3.
So far the high parameters seem to work ok, yet I wonder if they may cause any problems further on, especially regarding overfitting. I plan to do another evaluation by extracting new terms, yet those are costly as I need human judges.
Could anything still be wrong with my parameters, even if both evaluation turns out positive? Do I perhaps need another kernel type?
Thank you very much!

In general you have to perform cross validation to answer whether the parameters are all right or do they lead to the overfitting.
From the "intuition" perspective - it seems like highly overfitted model. High value of gamma means that your Gaussians are very narrow (condensed around each poinT) which combined with high C value will result in memorizing most of the training set. If you check out the number of support vectors I would not be surprised if it would be the 50% of your whole data. Other possible explanation is that you did not scale your data. Most ML methods, especially SVM, requires data to be properly preprocessed. This means in particular, that you should normalize (standarize) the input data so it is more or less contained in the unit sphere.

RBF seems like a reasonable choice so I would keep using it. A high value of gamma is not necessary a bad thing, it would depends on the scale where your data lives. While a high C value can lead to overfitting, it would also be affected by the scale so in some cases it might be just fine.
If you think that your dataset is a good representation of the whole data, then you could use crossvalidation to test your parameters and have some peace of mind.

Related

Is it a good idea to exclude noisy data from the dataset to train the model?

Will it be a good idea to exclude the noisy data( which may reduce model accuracy or cause unexpected output for testing dataset) from a dataset to generate the training and validation dataset ?
Assumption: Noisy data is pre-known to us
Any suggestion is deeply appreciated!
It depends on your application. If the noisy data is valid, then definitely include it to find the best model.
However, if the noisy data is invalid, then it should be cleaned out before fitting your model.
Noise is a broad term, you better consider them as inliers or outliers instead.
Most of the outliers detection algorithms specify a threshold and sort the outliers candidates according to some given score. In this case, you can choose to eradicate the most extreme values. Say for example 3xSTD far from the mean (of course that is in case you have a Gaussian-like distributed data set).
So my suggestion is to build your judgement based on two things:
Your business concept and logic about validity vs invalidity. For example: A house size, area or price cannot be a negative number.
Your mathematical / algorithmic logic. For example: Detect extreme values based on some threshold to decide (along with / without point no. 1) whether it is a valid observation or not.
Noisy data doesn't cause a huge problem themselves. The extreme noisy data (i.e. extreme values / outliers) are those you should really concern about!
Such points would adjust the hypothesis of your model while fitting the data. Hence, results might be drastically shifted / incorrect.
Finally, you can look at Pyod open-source Pythonic toolbox which contains a lot of different algorithms implemented off-the-shelf. (You can choose more than one algorithm and create a voting pool to decide the extremeness of the observations).
You can use Multivariate Gaussian Distribution for outlier Detection in python. It is the best method.

How to model multiple inputs to single output in classification?

Purpose:
I am trying to build a model to classify multiple inputs to a single output class, which is something like this:
{x_i1, x_i2, x_i3, ..., x_i16} (features) to y_i (class)
I am using a SVM to make the classification, but the 0/1-loss was bad (half of the data a misclassified), which leads me to the conclusion that the data might be non-linear. This is why I played around with polynomial basis function. I transformed each coefficient such that I get any combinations of polynomials up to degree 4, in the hope that my features are linear in the transformed space. my new transformed input looks like this:
{x_i1, ..., x_i16, x_i1^2, ..., x_i16^2, ... x_i1^4, ..., x_i16^4, x_i1^3, ..., x_i16^3, x_i1*x_i2, ...}
The loss was minimized but still not quite where I want to go. Since with the number of polynomial degree the chance of overfitting rises, i added regularization in order to counter balance that. I also added a forward greedy algorithm in order to pick up the coefficients which leads to minimal cross-validation error, but with no great improvement.
Question:
Is there a systematic way to figure out which transform leads to linear feature behaviour in the transformed space? Seems little odd to me that I have to try out every polynomial until it "fits". Are there perhaps better basis functions except polynomials? I understand that in low dimensional feature space, one can simply plot the data out and estimate the transform visually, but how can I do it in high dimensional space?
Maybe a little off topic but I also informed myself about PCA in order to throw away the components which doesnt provide much informations in the first place. Is this worth a try?
Thank you for your help.
Did you try other kernel functions such as RBF other than linear and polynomial? Since different dataset may have different characteristics, some kernel functions may work better than others do, especially in non-linear cases.
I don't know which tools you are using, but the following one also provides a guide for beginners on how to build SVM models:
https://www.csie.ntu.edu.tw/~cjlin/libsvm/
It is always a good idea to have a feature selection step first, especially for high-dimensional data. Those noisy or irrelevant features should be taken away, leading to a better performance and higher efficiency.

would we ever compute the cost J(θ) on the *test* set?

I'm pretty sure that the answer is no, but wanted to confirm...
When training a neural network or other learning algorithm, we will compute the cost function J(θ) as an expression of how well our algorithm fits the training data (higher values mean it fits the data less well). When training our algorithm, we generally expect to see J(theta) go down with each iteration of gradient descent.
But I'm just curious, would there ever be a value to computing J(θ) against our test data?
I think the answer is no, because since we only evaluate our test data once, we would only get one value of J(θ), and I think that it is meaningless except when compared with other values.
Your question touches on a very common ambiguity regarding the terminology: one between the validation and the test sets (the Wikipedia entry and this Cross Vaidated post may be helpful in resolving this).
So, assuming that you indeed refer to the test set proper and not the validation one, then:
You are right in that this set is only used once, just at the end of the whole modeling process
You are, in general, not right in assuming that we don't compute the cost J(θ) in this set.
Elaborating on (2): in fact, the only usefulness of the test set is exactly for evaluating our final model, in a set that has not been used at all in the various stages of the fitting process (notice that the validation set has been used indirectly, i.e. for model selection); and in order to evaluate it, we obviously have to compute the cost.
I think that a possible source of confusion is that you may have in mind only classification settings (although you don't specify this in your question); true, in this case, we are usually interested in the model performance regarding a business metric (e.g. accuracy), and not regarding the optimization cost J(θ) itself. But in regression settings it may very well be the case that the optimization cost and the business metric are one and the same thing (e.g. RMSE, MSE, MAE etc). And, as I hope is clear, in such settings computing the cost in the test set is by no means meaningless, despite the fact that we don't compare it with other values (it provides an "absolute" performance metric for our final model).
You may find this and this answers of mine useful regarding the distinction between loss & accuracy; quoting from these answers:
Loss and accuracy are different things; roughly speaking, the accuracy is what we are actually interested in from a business perspective, while the loss is the objective function that the learning algorithms (optimizers) are trying to minimize from a mathematical perspective. Even more roughly speaking, you can think of the loss as the "translation" of the business objective (accuracy) to the mathematical domain, a translation which is necessary in classification problems (in regression ones, usually the loss and the business objective are the same, or at least can be the same in principle, e.g. the RMSE)...

How to verify what's noise what's real data?

I am wondering how can I claim that I correctly catch the "noise" in my data ?
To be more specific, take Principle Component Analysis as example, we know that in PCA, after doing SVD, we can zeros out the small singular values and reconstruct the original matrix using low-rank approximation.
Then can I claim what's been ignored is indeed noise in the data ?
Is there any evaluation metric for this ?
The only method I can come up with is simply subtract the original data from the reconstructed data.
Then, try to fit a Gaussian over it, seeing if the fitness is good.
Is that conventional method in field like DSP ??
BTW, I think in typical machine learning tasks, the measurement would be the follow up classification performance, but since I am doing purely generative model, there are no labels attached.
The way I see it, the definition of noise would depend on the domain of the problem. Therefore the strategy for reducing it would be different on each domain.
For instance, having a noisy signal in problems like seismic formation classification or a noisy image on a face classification problem would be drastically different to the noise produced by improperly tagged data in a medical diagnostic problem or the noise because similar words with different meaning in a language classification problem for documents.
When the noise is because of a given (or a set of) data point, then the solution is as simple as ignore those data points (although identify those data points most of the time is the challenging part)
From your example I guess you are more concerning about the case when the noise is embedded into the features (like in the seismic example). Sometimes people tend to pre-process the data with a noise reduction filter like the median filter (http://en.wikipedia.org/wiki/Median_filter). In contrast, some other people tend to reduce the dimension of the data to reduce noise, and PCA is used in this scenario.
Both strategies are valid, and normally people try both and cross-validate them to see which one gave better results.
What you did is a good metric to check gaussian noise. However, for non-gaussian noise your metric can give you false negatives (bad fitness but still good noise reduction)
Personally, if you want to prove the efficacy of the noise reduction, I'd use a task-based evaluation. I assume you're doing this for some purpose, to solve some problem? If so, solve the task with the original noisy matrix and the new clean one. If the latter works better, what was discarded was noise, for the purposes of the task you're interested in. I think some objective measure of noise is pretty hard to define.
I have found this. it is very resoureful, needs good time to understand.
https://sci2s.ugr.es/noisydata#Introduction%20to%20Noise%20in%20Data%20Mining

How to approach machine learning problems with high dimensional input space?

How should I approach a situtation when I try to apply some ML algorithm (classification, to be more specific, SVM in particular) over some high dimensional input, and the results I get are not quite satisfactory?
1, 2 or 3 dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on, and have some idea how to aproach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters I am not really sure how to attack it?
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem by PCA or the similar technique. Beware that PCA has two important points. (1) It assumes that the data it is applied to is normally distributed and (2) the resulting data looses its natural meaning (resulting in a blackbox). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVM's were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM) in which they used linear SVM to pre-select "interesting features" and then used RBF - based SVM on the selected features. If you are familiar with Orange, a python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness" might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve this problem with PCA (see above), you might want to go to heuristic methods, which try to select best possible combinations of predictors. The main pitfall of this kind of approaches is the high potential of overfitting. Make sure you have a bunch "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure that the model is ready. If you fail, don't use this data once more to validate another model, you will have to find a new data set. Otherwise you won't be sure that you didn't overfit once more.
List of selected papers on parameter selection:
Feature selection for high-dimensional genomic microarray data
Oh, and one more thing about SVM. SVM is a black box. You better figure out what is the mechanism that generate the data and model the mechanism and not the data. On the other hand, if this would be possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this Stochastic method to determine, in silico, the drug like character of molecules
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model(SVM in this case) is not well suited for this problem. This can be diagnozed by trying other models (adaboost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex(too many parameters) and it needs to be constrained further,
Or you trained it on a training set which is too small and you need more data
Of course it may be a mixture of the above elements. These are all "blind" methods to attack the problem. In order to gain more insight into the problem you may use visualization methods by projecting the data into lower dimensions or look for models which are suited better to the problem domain as you understand it (for example if you know the data is normally distributed you can use GMMs to model the data ...)
If I'm not wrong, you are trying to see which parameters to the SVM gives you the best result. Your problem is model/curve fitting.
I worked on a similar problem couple of years ago. There are tons of libraries and algos to do the same. I used Newton-Raphson's algorithm and a variation of genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for, through real world experiment (or if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algos I mentioned earlier reiterates this process till the result of your model(SVM in this case) somewhat matches the expected values (note that this process would take some time based your problem/data size.. it took about 2 months for me on a 140 node beowulf cluster).
If you choose to go with Newton-Raphson's, this might be a good place to start.

Resources