Creating synthetic dataset with known SVM parameters - machine-learning

I want to create a synthetic dataset consisting of 2 classes and 3 features for testing a hyperparameter optimization technique for a SVM classifier with a RBF kernel. The hyperparameters are gamma and C (the cost).
I have created my current 3D synthetic dataset as follows:
I have created 10 based points for each class by sampling from a multivariate normal distribution with mean (1,0,0) and (0,1,0), respectively, and unit variance.
I have added more points to each class by picking a base point at random and then sampling a new point from a normal distribution with mean equal to the chosen base point and variance I/5.
It would be a very cool thing if I could determine the best C and gamma from the dataset (before running SVM), so that I can see if my optimization technique provides me the best parameters in the end.
Is there a possibility to calculate the best gamma and C parameter from the synthetic dataset described above?
Or else is there a way to create a synthetic dataset where the best gamma and C parameters are known?

Very interesting question, but the answer is no. It is completely data specific, even knowing exactly the distributions, unless you have an infinite sample, it is mathematicaly impossible to prove best C/gamma as SVM in the end is purely point-based method (as opposed to density estimation based). Typical comparison is done in a different scenario - you take real data, and fit hyperparams using other techniques, like gaussian processes (bayesian optimization) etc, which generate baseline (and probably will get to optimal C and gamma too, or at least realy close to them). In the end looking for best C and gamma is not complex problem, thus simply run good techniqe (like bayesopt) for a longer time, and you will get your optimas to compare against. Furthermore, remember that the task of hyperparams optimization is not to find a particular C and gamma, it is to find hyperparams yielding best results, and in fact, even for SVM, there might be many sets of "optimal" C and gammas, all yielding the same results (in terms of your finite dataset) despite being very far away from each other.

Related

Linear Regression: Is there a difference in the model between using ML instead MSE?

We know we need 4 things for building a machine learning algorithm:
A Dataset
A Model
A cost function
An optimization procedure
Taking the example of linear regression (y = m*x +q) we have two most common way of finding the best parameters: using ML or MSE as cost functions.
We hypotize data are Gaussian-distributed, using ML.
Is this assumption part of the model, also?
It it's not, why? Is it part of the cost function?
I can't see the "edge" of the model, in this case.
Is this assumption part of the model, also?
Yes it is. The ideas of different loss functions derived from the nature of the problem, consequently the nature of the model.
MSE by definition calculates for the mean of the squares of the errors (error means the difference between real y and predicted y) which in its turn will be high if the data is not Gaussian-Like distributed. Just imagine a few extreme values among the data, what will happen to the line slope and consequently the residual error?
It is worth mentioning the assumptions of Linear Regression:
Linear relationship
Multivariate normality
No or little multicollinearity
No auto-correlation
Homoscedasticity
If it's not, why? Is it part of the cost function?
As far I have seen, the assumption is not directly related to the cost function itself, rather related -as above-mentioned- to the model itself.
For example, Support Vector Machine idea is separation of classes. That’s finding out a line/ hyper-plane (in multidimensional space that separate outs classes), thus its cost function is Hinge Loss to "maximum-margin" of classification.
On the other hand, Logistic Regression uses Log-Loss (related to cross-entropy) because the model is binary and works on the probability of the output (0 or 1). And the list goes on...
The assumption that the data is Gaussian-distributed is part of the model in the sense that, for Gaussian distributed data the minimal Mean Squared Error also yields the maximum liklelihood solution for the data, given the model parameters. (Common proof, you can look it up if you are interested).
So you could say that the Gaussian distribution assumption justifies the choice of least squares as the loss function.

Are high values for c or gamma problematic when using an RBF kernel SVM?

I'm using WEKA/LibSVM to train a classifier for a term extraction system. My data is not linearly separable, so I used an RBF kernel instead of a linear one.
I followed the guide from Hsu et al. and iterated over several values for both c and gamma. The parameters which worked best for classifying known terms (test and training material differ of course) are rather high, c=2^10 and gamma=2^3.
So far the high parameters seem to work ok, yet I wonder if they may cause any problems further on, especially regarding overfitting. I plan to do another evaluation by extracting new terms, yet those are costly as I need human judges.
Could anything still be wrong with my parameters, even if both evaluation turns out positive? Do I perhaps need another kernel type?
Thank you very much!
In general you have to perform cross validation to answer whether the parameters are all right or do they lead to the overfitting.
From the "intuition" perspective - it seems like highly overfitted model. High value of gamma means that your Gaussians are very narrow (condensed around each poinT) which combined with high C value will result in memorizing most of the training set. If you check out the number of support vectors I would not be surprised if it would be the 50% of your whole data. Other possible explanation is that you did not scale your data. Most ML methods, especially SVM, requires data to be properly preprocessed. This means in particular, that you should normalize (standarize) the input data so it is more or less contained in the unit sphere.
RBF seems like a reasonable choice so I would keep using it. A high value of gamma is not necessary a bad thing, it would depends on the scale where your data lives. While a high C value can lead to overfitting, it would also be affected by the scale so in some cases it might be just fine.
If you think that your dataset is a good representation of the whole data, then you could use crossvalidation to test your parameters and have some peace of mind.

data imbalance in SVM using libSVM

How should I set my gamma and Cost parameters in libSVM when I am using an imbalanced dataset that consists of 75% 'true' labels and 25% 'false' labels? I'm getting a constant error of having all the predicted labels set on 'True' due to the data imbalance.
If the issue isn't with libSVM, but with my dataset, how should I handle this imbalance from a Theoretical Machine Learning standpoint? *The number of features I'm using is between 4-10 and I have a small set of 250 data points.
Classes imbalance has nothing to do with selection of C and gamma, to deal with this issue you should use the class weighting scheme which is avaliable in for example scikit-learn package (built on libsvm)
Selection of best C and gamma is performed using grid search with cross validation. You should try vast range of values here, for C it is reasonable to choose values between 1 and 10^15 while a simple and good heuristic of gamma range values is to compute pairwise distances between all your data points and select gamma according to the percentiles of this distribution - think about putting in each point a gaussian distribution with variance equal to 1/gamma - if you select such gamma that this distribution overlaps will many points you will get very "smooth" model, while using small variance leads to the overfitting.
Imbalanced data sets can be tackled in various ways. Class balance has no effect on kernel parameters such as gamma for the RBF kernel.
The two most popular approaches are:
Use different misclassification penalties per class, this basically means changing C. Typically the smallest class gets weighed higher, a common approach is npos * wpos = nneg * wneg. LIBSVM allows you to do this using its -wX flags.
Subsample the overrepresented class to obtain an equal amount of positives and negatives and proceed with training as you traditionally would for a balanced set. Take note that you basically ignore a large chunk of data this way, which is intuitively a bad idea.
I know this has been asked some time ago, but I would like to answer it since you might find my answer useful.
As others have mentioned, you might want to consider using different weights for the minority classes or using different misclassification penalties. However, there is a more clever way of dealing with the imbalanced datasets.
You can use the SMOTE (Synthetic Minority Over-sampling Technique) algorithm to generate synthesized data for the minority class. It is a simple algorithm that can deal with some imbalance datasets pretty well.
In each iteration of the algorithm, SMOTE considers two random instances of the minority class and add an artificial example of the same class somewhere in between. The algorithm keeps injecting the dataset with the samples until the two classes become balanced or some other criteria(e.g. add certain number of examples). Below you can find a picture describing what the algorithm does for a simple dataset in 2D feature space.
Associating weight with the minority class is a special case of this algorithm. When you associate weight $w_i$ with instance i, you are basically adding the extra $w_i - 1$ instances on top of the instance i!
What you need to do is to augment your initial dataset with the samples created by this algorithm, and train the SVM with this new dataset. You can also find many implementation online in different languages like Python and Matlab.
There have been other extensions of this algorithm, I can point you to more materials if you want.
To test the classifier you need to split the dataset into test and train, add synthetic instances to the train set (DO NOT ADD ANY TO THE TEST SET), train the model on the train set, and finally test it on the test set. If you consider the generated instances when you are testing you will end up with a biased(and ridiculously higher) accuracy and recall.

Ideal classifiers in python to fit sparse high dimensional features (with hierarchical classification)

This is my task:
I have a set of hierarchical classes (ex. "object/architecture/building/residential building/house/farmhouse")--and I've written two ways of classifying:
treating each class independently (using one model/classifier overall)
using a tree where each node represents a decision (the root represents "object/", and each level decreases generality), and a specific model/classifier for each node (here, I consider the c (usually 3) highest probabilities that come out of each node, and propagate the probabilities down (summing the log probs) to the leaves), and choose the highest.
I also had to introduce a way to incentivize going further down the tree (as it could stop at say object/architecture/building (if there is the corresponding training data)), and used an arbitrary trial-and-error process to decide specifically how (I don't feel comfortable with this).:
if numcategories == 4:
tempscore +=1
elif numcategories ==5:
tempscore +=1.3
elif numcategories ==6:
tempscore +=1.5
elif numcategories >6:
tempscore +=2
It is also important to note that I have around 290k training samples and ~150k (currently/mostly) boolean features (represented with 1.0 or 0.0)--although it's highly sparse, so I use scipy's sparse matrices. Also, there are ~6500 independent classes (though many less for each node in method 2)
With method 1, with scikit's sgdclassifier(loss=hinge), I get around 75-76% accuracy, and with linearsvc, I get around 76-77% (although it's 8-9 times slower).
However, for the second method (which I think can/should ultimately perform better) neither of these classifiers produce true probabilities, and while I've attempted to scale the confidence scores produced by their .decision_functions(), it didn't work well (accuracies of 10-25%). Thus, I switched to logisticregression(), which gets me ~62-63% accuracy. Also, NB based classifiers seem to perform substantially less well.
Ultimately, I have twoish questions:
Is there a better classifier (than scikit's logisticregression()) around implemented in python (could be scikit or mlpy/nltk/orange/etc) that can (i) handle sparse matrices, (ii) produce (something close to) probabilities, and (iii) work with multiclass classification?
Is there some way to handle method two better?
2.a. Specifically, is there some way to better handle incentivizing the classifier to produce results further down the tree?
A few ideas you could try:
Apply some embedding technique on your features, so as to avoid a large sparse matrix. However, it doesn't fit every case and also requires a lot of work
Use XGBoost with a custom loss function. In this loss function you can basically apply the logic you've described regarding the depth of the predicted class and provide an incentive for the model to get deeper classes predicted more often. In addition, a tree-based model will benefit you by taking into consideration the correlations between your features

How to approach machine learning problems with high dimensional input space?

How should I approach a situtation when I try to apply some ML algorithm (classification, to be more specific, SVM in particular) over some high dimensional input, and the results I get are not quite satisfactory?
1, 2 or 3 dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on, and have some idea how to aproach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters I am not really sure how to attack it?
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem by PCA or the similar technique. Beware that PCA has two important points. (1) It assumes that the data it is applied to is normally distributed and (2) the resulting data looses its natural meaning (resulting in a blackbox). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVM's were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM) in which they used linear SVM to pre-select "interesting features" and then used RBF - based SVM on the selected features. If you are familiar with Orange, a python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness" might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve this problem with PCA (see above), you might want to go to heuristic methods, which try to select best possible combinations of predictors. The main pitfall of this kind of approaches is the high potential of overfitting. Make sure you have a bunch "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure that the model is ready. If you fail, don't use this data once more to validate another model, you will have to find a new data set. Otherwise you won't be sure that you didn't overfit once more.
List of selected papers on parameter selection:
Feature selection for high-dimensional genomic microarray data
Oh, and one more thing about SVM. SVM is a black box. You better figure out what is the mechanism that generate the data and model the mechanism and not the data. On the other hand, if this would be possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this Stochastic method to determine, in silico, the drug like character of molecules
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model(SVM in this case) is not well suited for this problem. This can be diagnozed by trying other models (adaboost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex(too many parameters) and it needs to be constrained further,
Or you trained it on a training set which is too small and you need more data
Of course it may be a mixture of the above elements. These are all "blind" methods to attack the problem. In order to gain more insight into the problem you may use visualization methods by projecting the data into lower dimensions or look for models which are suited better to the problem domain as you understand it (for example if you know the data is normally distributed you can use GMMs to model the data ...)
If I'm not wrong, you are trying to see which parameters to the SVM gives you the best result. Your problem is model/curve fitting.
I worked on a similar problem couple of years ago. There are tons of libraries and algos to do the same. I used Newton-Raphson's algorithm and a variation of genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for, through real world experiment (or if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algos I mentioned earlier reiterates this process till the result of your model(SVM in this case) somewhat matches the expected values (note that this process would take some time based your problem/data size.. it took about 2 months for me on a 140 node beowulf cluster).
If you choose to go with Newton-Raphson's, this might be a good place to start.

Resources