PCA in machine learning - machine-learning

When applying the PCA technique on a training set, we find a coefficient matrix A, which is the principal component. So when we in training stage we find this principals and project it on the data. my question is does we apply the same principals or we find a new principals for the data in testing stage? I think in an answer like this : if we use it for dimensionality reduction, we have to find new principals. but if we use it for feature extraction (like feature extraction for EEG data ) we have to use the old(which is for the data in training stage) how much my thinking is true? BS: I'm not ask and answer the question in the same time, but to tell what I think , to show the points of misunderstanding, and take the opinion from experts

PCA is one of feature vector transformations. The goal is to reduce dimensionality. It sort of merges correlated features. If you have features like weight and size and the most of the objects when something is heavy it's also big. It replaces these features with one weight_and_size. It reduces noise and also is makes e.q. neural network smaller.
It enables the network to solve a problem in shorter time (be reducing network's size). It also should improve generalization.
So if you trained your network with feature vectors compressed with PCA you have to test it with transformed data as well. Simply because it only has as many inputs as compressed feature vector. You also have to use exactly the same transformation. If the network learned that first input is weight_and_size you cannot put the the value of e.q. warm_and_colorful and expect good results.

Both PCA and PCR are built on the training data and the transformation is applied to Test for performance (error) evaluation. With these 2 techniques, you get better results, when not using just a single training dataset, but doing a K-fold Cross Validation where you do a separate PCA for every fold and apply the transformations to the Test sets. Hope it helps!

Related

How to deal with dataset of different features?

I am working to create an MLP model on a CEA Classification Dataset (Binary Classification). Each sample contains different 4 features, such as resistance and other values, each in its own range (resistance in hundreds, another in micros, etc.). I am still new to machine learning and this is the first real model to build. How can I deal with such data? I have tried feeding each sample to the neural network with a sigmoid activation function, but I am not getting accurate results. My assumption to deal with this kind of data is to scale it? If so, what are some resources which are useful to look at, since I do not quite understand when is scaling required.
Scaling your data can be an important step in building a machine-learning model, especially when working with neural networks. Scaling can help to ensure that all of the features in your dataset are on a similar scale, which can make it easier for the model to learn.
There are a few different ways to scale your data, such as normalization and standardization. Normalization is the process of scaling the data so that it has a minimum value of 0 and a maximum value of 1. Standardization is the process of scaling the data so that it has a mean of 0 and a standard deviation of 1.
When working with your CEA Classification dataset, it might be helpful to try both normalization and standardization to see which one works better for your specific dataset. You can use scikit-learn library's preprocessing functions like MinMaxScaler() and StandardScaler() for normalization and standardization respectively.
Additionally, it might be helpful to try different activation functions, such as ReLU or LeakyReLU, to see if they lead to more accurate results. Also, you can try adding more layers and neurons in your neural network to see if it improves the performance.
It's also important to remember that feature engineering, which includes the process of selecting the most important features, can be more important than scaling.

Applying PCA before sending data to SVM

Before applying SVM on my data I want to reduce its dimension by PCA. Should I separate the Train data and Test data then apply PCA on each of them separately or apply PCA on both sets combined then separate them?
Actually both provided answers are only partially right. The crucial part here is what is the exact problem you are trying to solve. There are two basic possible settings which can be considered, and both are valid under some assumptions.
Case 1
You have some data (which you splitted to train and test) and in the future you will get more data coming from the same distribution.
If this is the case, you should fit PCA on train data, then SVM on its projection, and for testing you just apply already fitted PCA followed by already fitted SVM, and you do exactly the same for new data that will come. This way your test error (under some "size assumptions" should approximate your expected error).
Case 2
You have some data (which you splitted train and test) and in the future you will obtain a big chunk of unlabeled data and you will be able to fit your model then.
In such a case, you fit PCA on whole data provided, learn SVM on labeled part (train set) and evaluate on test set. This way, once new data arrives you can fit PCA using both your data and new ones, and then - train SVM on your old data (as this is the only one having labels). Under the assumption that again - data comes from the same distributions, everything is correct here. You use more data to fit PCA only to have a better estimator (maybe your data is really high dimensional and PCA fails with small sample?).
You should do them separately. If you run pca on both sets combined then you are going to introduce a bias in your svn. The goal of the test set is to see how your algorithm will perform without prior knowledge of the data.
Learn the Projection Matrix of PCA on the train set and use this to reduce the dimensions of the test data.
One benifit is this way you don't have to rely on collecting sufficient data in the test set if you are applying your classifier for actual run time where test data comes one sample at a time.
Also I think separate train and test PCA will fail.Why?
Think of PCA as giving you features, and then you learn a classifier over these features. If over time your data shifts, then the test features you get using PCA would be different, and you don't have a classifier trained on these features. Even if the set of directions/features of the PCA remain same but their order varies your classifier still fails.

Machine learning: Which algorithm is used to identify relevant features in a training set?

I've got a problem where I've potentially got a huge number of features. Essentially a mountain of data points (for discussion let's say it's in the millions of features). I don't know what data points are useful and what are irrelevant to a given outcome (I guess 1% are relevant and 99% are irrelevant).
I do have the data points and the final outcome (a binary result). I'm interested in reducing the feature set so that I can identify the most useful set of data points to collect to train future classification algorithms.
My current data set is huge, and I can't generate as many training examples with the mountain of data as I could if I were to identify the relevant features, cut down how many data points I collect, and increase the number of training examples. I expect that I would get better classifiers with more training examples given fewer feature data points (while maintaining the relevant ones).
What machine learning algorithms should I focus on to, first,
identify the features that are relevant to the outcome?
From some reading I've done it seems like SVM provides weighting per feature that I can use to identify the most highly scored features. Can anyone confirm this? Expand on the explanation? Or should I be thinking along another line?
Feature weights in a linear model (logistic regression, naive Bayes, etc) can be thought of as measures of importance, provided your features are all on the same scale.
Your model can be combined with a regularizer for learning that penalises certain kinds of feature vectors (essentially folding feature selection into the classification problem). L1 regularized logistic regression sounds like it would be perfect for what you want.
Maybe you can use PCA or Maximum entropy algorithm in order to reduce the data set...
You can go for Chi-Square tests or Entropy depending on your data type. Supervized discretization highly reduces the size of your data in a smart way (take a look into Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani).
If you work in R, the SIS package has a function that will do this for you.
If you want to do things the hard way, what you want to do is feature screening, a massive preliminary dimension reduction before you do feature selection and model selection from a sane-sized set of features. Figuring out what is the sane-size can be tricky, and I don't have a magic answer for that, but you can prioritize what order you'd want to include the features by
1) for each feature, split the data in two groups by the binary response
2) find the Komogorov-Smirnov statistic comparing the two sets
The features with the highest KS statistic are most useful in modeling.
There's a paper "out there" titled "A selctive overview of feature screening for ultrahigh-dimensional data" by Liu, Zhong, and Li, I'm sure a free copy is floating around the web somewhere.
4 years later I'm now halfway through a PhD in this field and I want to add that the definition of a feature is not always simple. In the case that your features are a single column in your dataset, the answers here apply quite well.
However, take the case of an image being processed by a convolutional neural network, for example, a feature is not one pixel of the input, rather it's much more conceptual than that. Here's a nice discussion for the case of images:
https://medium.com/#ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721

How do I use principal component analysis in supervised machine learning classification problems?

I have been working through the concepts of principal component analysis in R.
I am comfortable with applying PCA to a (say, labeled) dataset and ultimately extracting out the most interesting first few principal components as numeric variables from my matrix.
The ultimate question is, in a sense, now what? Most of the reading I've come across on PCA immediately halts after the computations are done, especially with regards to machine learning. Pardon my hyperbole, but I feel as if everyone agrees that the technique is useful, but nobody wants to actually use it after they do it.
More specifically, here's my real question:
I respect that principle components are linear combinations of the variables you started with. So, how does this transformed data play a role in supervised machine learning? How could someone ever use PCA as a way to reduce dimensionality of a dataset, and THEN, use these components with a supervised learner, say, SVM?
I'm absolutely confused about what happens to our labels. Once we are in eigenspace, great. But I don't see any way to continue to move forward with machine learning if this transformation blows apart our concept of classification (unless there's some linear combination of "Yes" or "No" I haven't come across!)
Please step in and set me straight if you have the time and wherewithal. Thanks in advance.
Old question, but I don't think it's been satisfactorily answered (and I just landed here myself through Google). I found myself in your same shoes and had to hunt down the answer myself.
The goal of PCA is to represent your data X in an orthonormal basis W; the coordinates of your data in this new basis is Z, as expressed below:
Because of orthonormality, we can invert W simply by transposing it and write:
Now to reduce dimensionality, let's pick some number of components k < p. Assuming our basis vectors in W are ordered from largest to smallest (i.e., eigenvector corresponding to the largest eigenvalue is first, etc.), this amounts to simply keeping the first k columns of W.
Now we have a k dimensional representation of our training data X. Now you run some supervised classifier using the new features in Z.
The key is to realize that W is in some sense a canonical transformation from our space of p features down to a space of k features (or at least the best transformation we could find using our training data). Thus, we can hit our test data with the same W transformation, resulting in a k-dimensional set of test features:
We can now use the same classifier trained on the k-dimensional representation of our training data to make predictions on the k-dimensional representation of our test data:
The point of going through this whole procedure is because you may have thousands of features, but (1) not all of them are going to have a meaningful signal and (2) your supervised learning method may be far too complex to train on the full feature set (either it would take too long or your computer wouldn't have a enough memory to process the calculations). PCA allows you to dramatically reduce the number of features it takes to represent your data without eliminating features of your data that truly add value.
After you have used PCA on a portion of your data to compute the transformation matrix, you apply that matrix to each of your data points before submitting them to your classifier.
This is useful when the intrinsic dimensionality of your data is much smaller than the number of components and the gain in performance you get during classification is worth the loss in accuracy and the cost of PCA. Also, keep in mind the limitations of PCA:
In performing a linear transformation, you implicitly assume that all components are expressed in equivalent units.
Beyond variance, PCA is blind to the structure of your data. It may very well happen that the data splits along low-variance dimensions. In that case, the classifier won't learn from transformed data.

How to approach machine learning problems with high dimensional input space?

How should I approach a situtation when I try to apply some ML algorithm (classification, to be more specific, SVM in particular) over some high dimensional input, and the results I get are not quite satisfactory?
1, 2 or 3 dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on, and have some idea how to aproach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters I am not really sure how to attack it?
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem by PCA or the similar technique. Beware that PCA has two important points. (1) It assumes that the data it is applied to is normally distributed and (2) the resulting data looses its natural meaning (resulting in a blackbox). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVM's were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM) in which they used linear SVM to pre-select "interesting features" and then used RBF - based SVM on the selected features. If you are familiar with Orange, a python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness" might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve this problem with PCA (see above), you might want to go to heuristic methods, which try to select best possible combinations of predictors. The main pitfall of this kind of approaches is the high potential of overfitting. Make sure you have a bunch "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure that the model is ready. If you fail, don't use this data once more to validate another model, you will have to find a new data set. Otherwise you won't be sure that you didn't overfit once more.
List of selected papers on parameter selection:
Feature selection for high-dimensional genomic microarray data
Oh, and one more thing about SVM. SVM is a black box. You better figure out what is the mechanism that generate the data and model the mechanism and not the data. On the other hand, if this would be possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this Stochastic method to determine, in silico, the drug like character of molecules
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model(SVM in this case) is not well suited for this problem. This can be diagnozed by trying other models (adaboost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex(too many parameters) and it needs to be constrained further,
Or you trained it on a training set which is too small and you need more data
Of course it may be a mixture of the above elements. These are all "blind" methods to attack the problem. In order to gain more insight into the problem you may use visualization methods by projecting the data into lower dimensions or look for models which are suited better to the problem domain as you understand it (for example if you know the data is normally distributed you can use GMMs to model the data ...)
If I'm not wrong, you are trying to see which parameters to the SVM gives you the best result. Your problem is model/curve fitting.
I worked on a similar problem couple of years ago. There are tons of libraries and algos to do the same. I used Newton-Raphson's algorithm and a variation of genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for, through real world experiment (or if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algos I mentioned earlier reiterates this process till the result of your model(SVM in this case) somewhat matches the expected values (note that this process would take some time based your problem/data size.. it took about 2 months for me on a 140 node beowulf cluster).
If you choose to go with Newton-Raphson's, this might be a good place to start.

Resources