How do I use principal component analysis in supervised machine learning classification problems? - machine-learning

I have been working through the concepts of principal component analysis in R.
I am comfortable with applying PCA to a (say, labeled) dataset and ultimately extracting out the most interesting first few principal components as numeric variables from my matrix.
The ultimate question is, in a sense, now what? Most of the reading I've come across on PCA immediately halts after the computations are done, especially with regards to machine learning. Pardon my hyperbole, but I feel as if everyone agrees that the technique is useful, but nobody wants to actually use it after they do it.
More specifically, here's my real question:
I respect that principle components are linear combinations of the variables you started with. So, how does this transformed data play a role in supervised machine learning? How could someone ever use PCA as a way to reduce dimensionality of a dataset, and THEN, use these components with a supervised learner, say, SVM?
I'm absolutely confused about what happens to our labels. Once we are in eigenspace, great. But I don't see any way to continue to move forward with machine learning if this transformation blows apart our concept of classification (unless there's some linear combination of "Yes" or "No" I haven't come across!)
Please step in and set me straight if you have the time and wherewithal. Thanks in advance.

Old question, but I don't think it's been satisfactorily answered (and I just landed here myself through Google). I found myself in your same shoes and had to hunt down the answer myself.
The goal of PCA is to represent your data X in an orthonormal basis W; the coordinates of your data in this new basis is Z, as expressed below:
Because of orthonormality, we can invert W simply by transposing it and write:
Now to reduce dimensionality, let's pick some number of components k < p. Assuming our basis vectors in W are ordered from largest to smallest (i.e., eigenvector corresponding to the largest eigenvalue is first, etc.), this amounts to simply keeping the first k columns of W.
Now we have a k dimensional representation of our training data X. Now you run some supervised classifier using the new features in Z.
The key is to realize that W is in some sense a canonical transformation from our space of p features down to a space of k features (or at least the best transformation we could find using our training data). Thus, we can hit our test data with the same W transformation, resulting in a k-dimensional set of test features:
We can now use the same classifier trained on the k-dimensional representation of our training data to make predictions on the k-dimensional representation of our test data:
The point of going through this whole procedure is because you may have thousands of features, but (1) not all of them are going to have a meaningful signal and (2) your supervised learning method may be far too complex to train on the full feature set (either it would take too long or your computer wouldn't have a enough memory to process the calculations). PCA allows you to dramatically reduce the number of features it takes to represent your data without eliminating features of your data that truly add value.

After you have used PCA on a portion of your data to compute the transformation matrix, you apply that matrix to each of your data points before submitting them to your classifier.
This is useful when the intrinsic dimensionality of your data is much smaller than the number of components and the gain in performance you get during classification is worth the loss in accuracy and the cost of PCA. Also, keep in mind the limitations of PCA:
In performing a linear transformation, you implicitly assume that all components are expressed in equivalent units.
Beyond variance, PCA is blind to the structure of your data. It may very well happen that the data splits along low-variance dimensions. In that case, the classifier won't learn from transformed data.

Related

How to Intelligently Sample Parameter Space while Training a Statistical Classifier

I'm interested in a statistical classification problem. Given a feature vector X, I would like to classify X as either "yes" or "no". However, the training data will be fed in real-time based on human input. For instance, if the user sees feature vector X, the user will assign "yes" or "no" based on their expertise.
Rather than doing grid search on parameter space, I would like to more intelligently explore the parameter space based on the previously submitted data. For example, if there is a dense cluster of "no's" in part of the parameter space, it probably doesn't make sense to keep sampling there - it's probably just going to be more "no's".
How can I go about doing this? The C4.5 algorithm seems to be up this alley, but I'm unsure if this is the way to go.
An additional subtlety is that some of the features might be specifying random data. Suppose that the first two attributes in the feature vector specify the mean and variance of a gaussian distribution. The data the user classifies could be significantly different, even if all parameters are held equal.
For example, let's say the algorithm displays a sine wave with gaussian noise added, where the gaussian distribution is specified by the mean and variance in the feature vector. The user is asked "does this graph represent a sine wave?" Two very similar values in mean or variance could still have significantly different graphs.
Is there an algorithm designed to handle such cases?
The setting that you're talking about fits in the broad area of Active Learning. This topic addresses the iterative process of model building, and choosing which training examples to query next in order to optimize model performance. Here, the training cost of each data point is roughly the same, and there are no additional variable rewards in the learning phase.
However, in each iteration, if you have a variable reward which is a function of the data point chosen, you would want to look at Multi-Armed Bandits and Reinforcement Learning.
The other issue that you're talking about is one of finding the right features to represent your data points, and should be handled separately.

Machine Learning - SVM

If one trains a model using a SVM from kernel data, the resultant trained model contains support vectors. Now consider the case of training a new model using the old data already present plus a small amount of new data as well.
SO:
Should the new data just be combined with the support vectors from the previously formed model to form the new training set. (If yes, then how to combine the support vectors with new graph data? I am working on libsvm)
Or:
Should the new data and the complete old data be combined together and form the new training set and not just the support vectors?
Which approach is better for retraining, more doable and efficient in terms of accuracy and memory?
You must always retrain considering the entire, newly concatenated, training set.
The support vectors from the "old" model might not be support vectors anymore in case some "new points" are closest to the decision boundary. Behind the SVM there is an optimization problem that must be solved, keep that in mind. With a given training set, you find the optimal solution (i.e. support vectors) for that training set. As soon as the dataset changes, such solution might not be optimal anymore.
The SVM training is nothing more than a maximization problem where the geometrical and functional margins are the objective function. Is like maximizing a given function f(x)...but then you change f(x): by adding/removing points from the training set you have a better/worst understanding of the decision boundary since such decision boundary is known via sampling where the samples are indeed the patterns from your training set.
I understand your concerned about time and memory efficiency, but that's a common problem: indeed training the SVMs for the so-called big data is still an open research topic (there are some hints regarding backpropagation training) because such optimization problem (and the heuristic regarding which Lagrange Multipliers should be pairwise optimized) are not easy to parallelize/distribute on several workers.
LibSVM uses the well-known Sequential Minimal Optimization algorithm for training the SVM: here you can find John Pratt's article regarding the SMO algorithm, if you need further information regarding the optimization problem behind the SVM.
Idea 1 has been already examined & assessed by research community
anyone interested in faster and smarter aproach (1) -- re-use support-vectors and add new data -- kindly review research materials published by Dave MUSICANT and Olvi MANGASARIAN on such their method referred as "Active Support Vector Machine"
MATLAB implementation: available from http://research.cs.wisc.edu/dmi/asvm/
PDF:[1] O. L. Mangasarian, David R. Musicant; Active Support Vector Machine Classification; 1999
[2] David R. Musicant, Alexander Feinberg; Active Set Support Vector Regression; IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 15, NO. 2, MARCH 2004
This is a purely theoretical thought on your question. The idea is not bad. However, it needs to be extended a bit. I'm looking here purely at the goal to sparsen the training data from the first batch.
The main problem -- which is why this is purely theoretical -- is that your data is typically not linear separable. Then the misclassified points are very important. And they will spoil what I write below. Furthermore the idea requires a linear kernel. However, it might be possible to generalise to other kernels
To understand the problem with your approach lets look at the following support vectors (x,y,class): (-1,1,+),(-1,-1,+),(1,0,-). The hyperplane is the a vertical line going trough zero. If you would have in your next batch the point (-1,-1.1,-) the max margin hyperplane would tilt. This could now be exploited for sparsening. You calculate the - so to say - minimal margin hyperplane between the two pairs ({(-1,1,+),(1,0,-)}, {(-1,-1,+),(1,0,-)}) of support vectors (in 2d only 2 pairs. higher dimensions or non-linear kernel might be more). This is basically the line going through these points. Afterwards you classify all data points. Then you add all misclassified points in either of the models, plus the support vectors to the second batch. Thats it. The remaining points can't be relevant.
Besides the C/Nu problem mentioned above. The curse of dimensionality will obviously kill you here
An image to illustrate. Red: support vectors, batch one, Blue, non-support vector batch one. Green new point batch two.
Redline first Hyperplane, Green minimal margin hyperplane which misclassifies blue point, blue new hyperplane (it's a hand fit ;) )

Fuzzy clustering using unsupervised dimensionality reduction

An unsupervised dimensionality reduction algorithm is taking as input a matrix NxC1 where N is the number of input vectors and C1 is the number of components for each vector (the dimensionality of the vector). As a result, it returns a new matrix NxC2 (C2 < C1) where each vector has a lower number of component.
A fuzzy clustering algorithm is taking as input a matrix N*C1 where N, here again, is the number of input vectors and C1 is the number of components for each vector. As a result, it returns a new matrix NxC2 (C2 usually lower than C1) where each component of each vector is indicating the degree to which the vector belongs to the corresponding cluster.
I noticed that input and output of both classes of algorithms are the same in structure, only the interpretation of the results changes. Moreover, there no fuzzy clustering implementation in scikit-learn, hence the following question:
Does it make sense to use a dimensionality reduction algorithm to perform fuzzy clustering?
For instance, is it a non-sense to apply FeatureAgglomeration or TruncatedSVD to a dataset built from TF-IDF vectors extracted from textual data, and interpret the results as a fuzzy clustering?
In some sense, sure. It kind of depends on how you want to use the results downstream.
Consider SVD truncation or excluding principal components. We have projected into a new, variance-preserving space with essentially few other restrictions on the structure of the new manifold. The new coordinate representations of the original data points could have large negative numbers for some elements, which is a little weird. But one could shift and rescale the data without much difficulty.
One could then interpret each dimension as a cluster membership weight. But consider a common use for fuzzy clustering, which is to generate a hard clustering. Notice how easy this is with fuzzy cluster weights (e.g. just take the max). Consider a set of points in the new dimensionally-reduced space, say <0,0,1>,<0,1,0>,<0,100,101>,<5,100,99>. A fuzzy clustering would given something like {p1,p2}, {p3,p4} if thresholded, but if we took the max here (i.e. treat the dimensionally reduced axes as membership, we get {p1,p3},{p2,p4}, for k=2, for instance. Of course, one could use a better algorithm than max to derive hard memberships (say by looking at pairwise distances, which would work for my example); such algorithms are called, well, clustering algorithms.
Of course, different dimensionality reduction algorithms may work better or worse for this (e.g. MDS which focuses on preserving distances between data points rather than variances is more naturally cluster-like). But fundamentally, many dimensionality reduction algorithms implicitly preserve data about the underlying manifold that the data lie on, whereas fuzzy cluster vectors only hold information about the relations between data points (which may or may not implicitly encode that other information).
Overall, the purpose is a little different. Clustering is designed to find groups of similar data. Feature selection and dimensionality reduction are designed to reduce the noise and/or redundancy of the data by changing the embedding space. Often we use the latter to help with the former.

PCA in machine learning

When applying the PCA technique on a training set, we find a coefficient matrix A, which is the principal component. So when we in training stage we find this principals and project it on the data. my question is does we apply the same principals or we find a new principals for the data in testing stage? I think in an answer like this : if we use it for dimensionality reduction, we have to find new principals. but if we use it for feature extraction (like feature extraction for EEG data ) we have to use the old(which is for the data in training stage) how much my thinking is true? BS: I'm not ask and answer the question in the same time, but to tell what I think , to show the points of misunderstanding, and take the opinion from experts
PCA is one of feature vector transformations. The goal is to reduce dimensionality. It sort of merges correlated features. If you have features like weight and size and the most of the objects when something is heavy it's also big. It replaces these features with one weight_and_size. It reduces noise and also is makes e.q. neural network smaller.
It enables the network to solve a problem in shorter time (be reducing network's size). It also should improve generalization.
So if you trained your network with feature vectors compressed with PCA you have to test it with transformed data as well. Simply because it only has as many inputs as compressed feature vector. You also have to use exactly the same transformation. If the network learned that first input is weight_and_size you cannot put the the value of e.q. warm_and_colorful and expect good results.
Both PCA and PCR are built on the training data and the transformation is applied to Test for performance (error) evaluation. With these 2 techniques, you get better results, when not using just a single training dataset, but doing a K-fold Cross Validation where you do a separate PCA for every fold and apply the transformations to the Test sets. Hope it helps!

How to approach machine learning problems with high dimensional input space?

How should I approach a situtation when I try to apply some ML algorithm (classification, to be more specific, SVM in particular) over some high dimensional input, and the results I get are not quite satisfactory?
1, 2 or 3 dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on, and have some idea how to aproach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters I am not really sure how to attack it?
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem by PCA or the similar technique. Beware that PCA has two important points. (1) It assumes that the data it is applied to is normally distributed and (2) the resulting data looses its natural meaning (resulting in a blackbox). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVM's were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM) in which they used linear SVM to pre-select "interesting features" and then used RBF - based SVM on the selected features. If you are familiar with Orange, a python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness" might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve this problem with PCA (see above), you might want to go to heuristic methods, which try to select best possible combinations of predictors. The main pitfall of this kind of approaches is the high potential of overfitting. Make sure you have a bunch "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure that the model is ready. If you fail, don't use this data once more to validate another model, you will have to find a new data set. Otherwise you won't be sure that you didn't overfit once more.
List of selected papers on parameter selection:
Feature selection for high-dimensional genomic microarray data
Oh, and one more thing about SVM. SVM is a black box. You better figure out what is the mechanism that generate the data and model the mechanism and not the data. On the other hand, if this would be possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this Stochastic method to determine, in silico, the drug like character of molecules
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model(SVM in this case) is not well suited for this problem. This can be diagnozed by trying other models (adaboost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex(too many parameters) and it needs to be constrained further,
Or you trained it on a training set which is too small and you need more data
Of course it may be a mixture of the above elements. These are all "blind" methods to attack the problem. In order to gain more insight into the problem you may use visualization methods by projecting the data into lower dimensions or look for models which are suited better to the problem domain as you understand it (for example if you know the data is normally distributed you can use GMMs to model the data ...)
If I'm not wrong, you are trying to see which parameters to the SVM gives you the best result. Your problem is model/curve fitting.
I worked on a similar problem couple of years ago. There are tons of libraries and algos to do the same. I used Newton-Raphson's algorithm and a variation of genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for, through real world experiment (or if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algos I mentioned earlier reiterates this process till the result of your model(SVM in this case) somewhat matches the expected values (note that this process would take some time based your problem/data size.. it took about 2 months for me on a 140 node beowulf cluster).
If you choose to go with Newton-Raphson's, this might be a good place to start.

Resources