For certain data, we may need to manually create features which are combinations of earlier features to get a better algorithm. The below distribution (and any other where the distribution is an ellipse with it's axis NOT aligned to the feature axes) are easy examples of gaussian distribution not working.
The image is from Stanford's Machine Learning course on coursera.
It was told in that course that a multivariate gaussian distribution works the best for it OR we can develop a new feature x_3 = x_2 - x_1 and x_4 = x_2 + x_1, and use these two new features as the axes for the ellipse, since with these axes the ellipse will no longer be tilted and normal gaussian distribution can work.
I want to automate this new feature selection and was thinking that using PCA (with a high variance %) would help in creating a new features that are of 'linear type', like x_2 - x_1. Kind of how the long diagonal black line from this image (from the same course as above) is exactly what we need, and this black line was generated using PCA.
Would the above work? Is there something better than we can do to automate generating new features (of linear type and otherwise)? Neural networks and Kernels were something that I also thought of but I have a hunch that they would be very expensive computational wise.
The family of the problems you're looking for is called dimensionality reduction. Yes, PCA is the default approach to such problems, fully analogous to how linear regression is the default approach to regression.
It won't, however, find a feature of the form x_2 / x_1, because that is a nonlinear transformation. The features it will find will be of the form a_1 x_1 + a_2 x_2 (in the mean-centered coordinates). There are many methods that allow you to look for such features automatically. Kernel PCA that you mentioned is one such method, although, like all kernel methods, it does not explicitly search for features and only selects the most useful ones from a predefined set. Among the methods that can search infinite classes of nonlinear features my personal favorite is genetic programming. It mostly focuses on regression, but it should be straightforward to apply it to dimensionality reduction. Note that it is not something that will work out of the box and will require some trial and error. If you do decide to take that route, here are some useful resources: 3 and 4.
Methods that search large classes of nonlinearities are inherently computationally costly. If performance is critical, study the system that generates the distribution, identify a class of nonlinearities that might be useful and use something like kernel PCA for feature selection. Or have an approach like the one described in 3 extract useful features from the data in one slow, but automatic study, then freeze the feature search and let the linear selector that's sitting on top of it make decisions for each particular measurement.
Related
I want to use a feature selection method where "combinations" of features or "between features" interactions are considered for a simple linear regression.
SelectKBest only looks at one feature to the target, one at a time, and ranks them by Pearson's R values. While this is quick, but I'm afraid it's ignoring some important interactions between features.
Recursive Feature Elimination first uses ALL my features, fits a Linear Regression model, and then kicks out the feature with the smallest absolute value coefficient. I'm not sure whether if this accounts for "between features" interaction...I don't think so since it's simply kicking out the smallest coefficient one at a time until it reaches your designated number of features.
What I'm looking for is, for those seasoned feature selection scientists out there, a method to find the best subset or combination of features. I read through all Feature Selection documentation and can't find a method that describes what I have in mind.
Any tips will be greatly appreciated!!!!!!
I want to use a feature selection method where "combinations" of features or "between features" interactions are considered for a simple linear regression.
For this case, you might consider using Lasso (or, actually, the elastic net refinement). Lasso attempts to minimize linear least squares, but with absolute-value penalties on the coefficients. Some results from convex-optimization theory (mainly on duality), show that this constraint takes into account "between feature" interactions, and removes the more inferior of correlated features. Since Lasso is know to have some shortcomings (it is constrained in the number of features it can pick, for example), a newer variant is elastic net, which penalizes both absolute-value terms and square terms of the coefficients.
In sklearn, sklearn.linear_model.ElasticNet implements this. Note that this algorithm requires you to tune the penalties, which you'd typically do using cross validation. Fortunately, sklearn also contains sklearn.linear_model.ElasticNetCV, which allows very efficient and convenient searching for the values of these penalty terms.
I believe you have to generate your combinations first and only then apply the feature selection step. You may use http://scikit-learn.org/stable/modules/preprocessing.html#generating-polynomial-features for the feature combinations.
I am new to machine learning field and right now trying to get a grasp of how the most common learning algorithms work and understand when to apply each one of them. At the moment I am learning on how Support Vector Machines work and have a question on custom kernel functions.
There is plenty of information on the web on more standard (linear, RBF, polynomial) kernels for SVMs. I, however, would like to understand when it is reasonable to go for a custom kernel function. My questions are:
1) What are other possible kernels for SVMs?
2) In which situation one would apply custom kernels?
3) Can custom kernel substantially improve prediction quality of SVM?
1) What are other possible kernels for SVMs?
There are infinitely many of these, see for example list of ones implemented in pykernels (which is far from being exhaustive)
https://github.com/gmum/pykernels
Linear
Polynomial
RBF
Cosine similarity
Exponential
Laplacian
Rational quadratic
Inverse multiquadratic
Cauchy
T-Student
ANOVA
Additive Chi^2
Chi^2
MinMax
Min/Histogram intersection
Generalized histogram intersection
Spline
Sorensen
Tanimoto
Wavelet
Fourier
Log (CPD)
Power (CPD)
2) In which situation one would apply custom kernels?
Basically in two cases:
"simple" ones give very bad results
data is specific in some sense and so - in order to apply traditional kernels one has to degenerate it. For example if your data is in a graph format, you cannot apply RBF kernel, as graph is not a constant-size vector, thus you need a graph kernel to work with this object without some kind of information-loosing projection. also sometimes you have an insight into data, you know about some underlying structure, which might help classifier. One such example is a periodicity, you know that there is a kind of recuring effect in your data - then it might be worth looking for a specific kernel etc.
3) Can custom kernel substantially improve prediction quality of SVM?
Yes, in particular there always exists a (hypothethical) Bayesian optimal kernel, defined as:
K(x, y) = 1 iff arg max_l P(l|x) == arg max_l P(l|y)
in other words, if one has a true probability P(l|x) of label l being assigned to a point x, then we can create a kernel, which pretty much maps your data points onto one-hot encodings of their most probable labels, thus leading to Bayes optimal classification (as it will obtain Bayes risk).
In practise it is of course impossible to get such kernel, as it means that you already solved your problem. However, it shows that there is a notion of "optimal kernel", and obviously none of the classical ones is not of this type (unless your data comes from veeeery simple distributions). Furthermore, each kernel is a kind of prior over decision functions - closer you get to the actual one with your induced family of functions - the more probable is to get a reasonable classifier with SVM.
I want to create a synthetic dataset consisting of 2 classes and 3 features for testing a hyperparameter optimization technique for a SVM classifier with a RBF kernel. The hyperparameters are gamma and C (the cost).
I have created my current 3D synthetic dataset as follows:
I have created 10 based points for each class by sampling from a multivariate normal distribution with mean (1,0,0) and (0,1,0), respectively, and unit variance.
I have added more points to each class by picking a base point at random and then sampling a new point from a normal distribution with mean equal to the chosen base point and variance I/5.
It would be a very cool thing if I could determine the best C and gamma from the dataset (before running SVM), so that I can see if my optimization technique provides me the best parameters in the end.
Is there a possibility to calculate the best gamma and C parameter from the synthetic dataset described above?
Or else is there a way to create a synthetic dataset where the best gamma and C parameters are known?
Very interesting question, but the answer is no. It is completely data specific, even knowing exactly the distributions, unless you have an infinite sample, it is mathematicaly impossible to prove best C/gamma as SVM in the end is purely point-based method (as opposed to density estimation based). Typical comparison is done in a different scenario - you take real data, and fit hyperparams using other techniques, like gaussian processes (bayesian optimization) etc, which generate baseline (and probably will get to optimal C and gamma too, or at least realy close to them). In the end looking for best C and gamma is not complex problem, thus simply run good techniqe (like bayesopt) for a longer time, and you will get your optimas to compare against. Furthermore, remember that the task of hyperparams optimization is not to find a particular C and gamma, it is to find hyperparams yielding best results, and in fact, even for SVM, there might be many sets of "optimal" C and gammas, all yielding the same results (in terms of your finite dataset) despite being very far away from each other.
If one trains a model using a SVM from kernel data, the resultant trained model contains support vectors. Now consider the case of training a new model using the old data already present plus a small amount of new data as well.
SO:
Should the new data just be combined with the support vectors from the previously formed model to form the new training set. (If yes, then how to combine the support vectors with new graph data? I am working on libsvm)
Or:
Should the new data and the complete old data be combined together and form the new training set and not just the support vectors?
Which approach is better for retraining, more doable and efficient in terms of accuracy and memory?
You must always retrain considering the entire, newly concatenated, training set.
The support vectors from the "old" model might not be support vectors anymore in case some "new points" are closest to the decision boundary. Behind the SVM there is an optimization problem that must be solved, keep that in mind. With a given training set, you find the optimal solution (i.e. support vectors) for that training set. As soon as the dataset changes, such solution might not be optimal anymore.
The SVM training is nothing more than a maximization problem where the geometrical and functional margins are the objective function. Is like maximizing a given function f(x)...but then you change f(x): by adding/removing points from the training set you have a better/worst understanding of the decision boundary since such decision boundary is known via sampling where the samples are indeed the patterns from your training set.
I understand your concerned about time and memory efficiency, but that's a common problem: indeed training the SVMs for the so-called big data is still an open research topic (there are some hints regarding backpropagation training) because such optimization problem (and the heuristic regarding which Lagrange Multipliers should be pairwise optimized) are not easy to parallelize/distribute on several workers.
LibSVM uses the well-known Sequential Minimal Optimization algorithm for training the SVM: here you can find John Pratt's article regarding the SMO algorithm, if you need further information regarding the optimization problem behind the SVM.
Idea 1 has been already examined & assessed by research community
anyone interested in faster and smarter aproach (1) -- re-use support-vectors and add new data -- kindly review research materials published by Dave MUSICANT and Olvi MANGASARIAN on such their method referred as "Active Support Vector Machine"
MATLAB implementation: available from http://research.cs.wisc.edu/dmi/asvm/
PDF:[1] O. L. Mangasarian, David R. Musicant; Active Support Vector Machine Classification; 1999
[2] David R. Musicant, Alexander Feinberg; Active Set Support Vector Regression; IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 15, NO. 2, MARCH 2004
This is a purely theoretical thought on your question. The idea is not bad. However, it needs to be extended a bit. I'm looking here purely at the goal to sparsen the training data from the first batch.
The main problem -- which is why this is purely theoretical -- is that your data is typically not linear separable. Then the misclassified points are very important. And they will spoil what I write below. Furthermore the idea requires a linear kernel. However, it might be possible to generalise to other kernels
To understand the problem with your approach lets look at the following support vectors (x,y,class): (-1,1,+),(-1,-1,+),(1,0,-). The hyperplane is the a vertical line going trough zero. If you would have in your next batch the point (-1,-1.1,-) the max margin hyperplane would tilt. This could now be exploited for sparsening. You calculate the - so to say - minimal margin hyperplane between the two pairs ({(-1,1,+),(1,0,-)}, {(-1,-1,+),(1,0,-)}) of support vectors (in 2d only 2 pairs. higher dimensions or non-linear kernel might be more). This is basically the line going through these points. Afterwards you classify all data points. Then you add all misclassified points in either of the models, plus the support vectors to the second batch. Thats it. The remaining points can't be relevant.
Besides the C/Nu problem mentioned above. The curse of dimensionality will obviously kill you here
An image to illustrate. Red: support vectors, batch one, Blue, non-support vector batch one. Green new point batch two.
Redline first Hyperplane, Green minimal margin hyperplane which misclassifies blue point, blue new hyperplane (it's a hand fit ;) )
How should I approach a situtation when I try to apply some ML algorithm (classification, to be more specific, SVM in particular) over some high dimensional input, and the results I get are not quite satisfactory?
1, 2 or 3 dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on, and have some idea how to aproach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters I am not really sure how to attack it?
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem by PCA or the similar technique. Beware that PCA has two important points. (1) It assumes that the data it is applied to is normally distributed and (2) the resulting data looses its natural meaning (resulting in a blackbox). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVM's were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM) in which they used linear SVM to pre-select "interesting features" and then used RBF - based SVM on the selected features. If you are familiar with Orange, a python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness" might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve this problem with PCA (see above), you might want to go to heuristic methods, which try to select best possible combinations of predictors. The main pitfall of this kind of approaches is the high potential of overfitting. Make sure you have a bunch "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure that the model is ready. If you fail, don't use this data once more to validate another model, you will have to find a new data set. Otherwise you won't be sure that you didn't overfit once more.
List of selected papers on parameter selection:
Feature selection for high-dimensional genomic microarray data
Oh, and one more thing about SVM. SVM is a black box. You better figure out what is the mechanism that generate the data and model the mechanism and not the data. On the other hand, if this would be possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this Stochastic method to determine, in silico, the drug like character of molecules
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model(SVM in this case) is not well suited for this problem. This can be diagnozed by trying other models (adaboost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex(too many parameters) and it needs to be constrained further,
Or you trained it on a training set which is too small and you need more data
Of course it may be a mixture of the above elements. These are all "blind" methods to attack the problem. In order to gain more insight into the problem you may use visualization methods by projecting the data into lower dimensions or look for models which are suited better to the problem domain as you understand it (for example if you know the data is normally distributed you can use GMMs to model the data ...)
If I'm not wrong, you are trying to see which parameters to the SVM gives you the best result. Your problem is model/curve fitting.
I worked on a similar problem couple of years ago. There are tons of libraries and algos to do the same. I used Newton-Raphson's algorithm and a variation of genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for, through real world experiment (or if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algos I mentioned earlier reiterates this process till the result of your model(SVM in this case) somewhat matches the expected values (note that this process would take some time based your problem/data size.. it took about 2 months for me on a 140 node beowulf cluster).
If you choose to go with Newton-Raphson's, this might be a good place to start.