How many principal components to take? - machine-learning

I know that principal component analysis does an SVD on a matrix and then generates a matrix of eigenvalues. To select the principal components we take only the first few eigenvalues. Now, how do we decide how many eigenvalues to keep from the eigenvalue matrix?

To decide how many eigenvalues/eigenvectors to keep, you should consider your reason for doing PCA in the first place. Are you doing it to reduce storage requirements, to reduce dimensionality for a classification algorithm, or for some other reason? If you don't have any strict constraints, I recommend plotting the cumulative sum of eigenvalues (assuming they are in descending order). If you divide each value by the total sum of eigenvalues prior to plotting, then your plot will show the fraction of total variance retained vs. number of eigenvalues. The plot then gives a good indication of when you hit the point of diminishing returns (i.e., little variance is gained by retaining additional eigenvalues).
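For example, with scikit-learn that plot takes only a few lines (a sketch; X is assumed to be your data matrix of shape (n_samples, n_features)):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA().fit(X)  # X: assumed (n_samples, n_features) data matrix
cum_var = np.cumsum(pca.explained_variance_ratio_)  # eigenvalues divided by their total sum, cumulated
plt.plot(np.arange(1, len(cum_var) + 1), cum_var, marker='o')
plt.xlabel('Number of components')
plt.ylabel('Fraction of total variance retained')
plt.show()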

There is no single correct answer; it lies somewhere between 1 and n.
Think of a principal component as a street in a town you have never visited before. How many streets should you take to get to know the town?
Well, you should obviously visit the main street (the first component), and maybe some of the other big streets too. Do you need to visit every street to know the town well enough? Probably not.
To know the town perfectly, you should visit all of the streets. But what if you could visit, say 10 out of the 50 streets, and have a 95% understanding of the town? Is that good enough?
Basically, you should select enough components to explain as much of the variance as you are comfortable with.

As others said, it doesn't hurt to plot the explained variance.
If you use PCA as a preprocessing step for a supervised learning task, you should cross-validate the whole data processing pipeline and treat the number of PCA dimensions as a hyperparameter to select via a grid search on the final supervised score (e.g. F1 score for classification or RMSE for regression).
If a cross-validated grid search on the whole dataset is too costly, try it on two subsamples, e.g. one with 1% of the data and the other with 10%, and see whether you arrive at the same optimal value for the PCA dimensions.
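A minimal sketch of that pipeline with scikit-learn (X and y are assumed to be your features and labels, and the candidate dimensions are arbitrary):

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('pca', PCA()), ('clf', LogisticRegression())])
param_grid = {'pca__n_components': [5, 10, 20, 50]}   # candidate PCA dimensions (arbitrary)
search = GridSearchCV(pipe, param_grid, scoring='f1', cv=5)  # F1 as the final supervised score
search.fit(X, y)
print(search.best_params_)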

There are a number of heuristics used for that.
E.g. taking the first k eigenvectors that capture at least 85% of the total variance.
However, for high-dimensional data these heuristics are usually not very reliable.
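For instance, the 85% rule above can be written directly in terms of the eigenvalues (a sketch; eigvals is assumed to be your eigenvalue array, sorted in descending order):

import numpy as np

explained = np.cumsum(eigvals) / np.sum(eigvals)       # cumulative fraction of total variance
k = int(np.searchsorted(explained, 0.85)) + 1          # smallest k capturing at least 85% of the variance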

Depending on your situation, it may be interesting to define the maximum relative error you are willing to accept when projecting your data onto ndim dimensions.
MATLAB example
I will illustrate this with a small MATLAB example. Just skip the code if you are not interested in it.
I will first generate a random matrix of n samples (rows) and p features, containing exactly 100 non-zero principal components.
n = 200;
p = 119;
data = zeros(n, p);
for i = 1:100
    data = data + rand(n, 1)*rand(1, p);
end
For this generated data, one can calculate the relative error made by projecting the input data onto ndim dimensions as follows:
[coeff, score] = pca(data, 'Economy', true);
relativeError = zeros(p, 1);
for ndim = 1:p
    reconstructed = repmat(mean(data, 1), n, 1) + score(:, 1:ndim)*coeff(:, 1:ndim)';
    residuals = data - reconstructed;
    relativeError(ndim) = max(max(residuals./data));
end
Plotting the relative error as a function of the number of dimensions (principal components) results in the following graph:
Based on this graph, you can decide how many principal components you need to take into account. In this synthetic example, taking 100 components results in an exact representation of the data, so taking more than 100 components is useless. If you want, for example, at most 5% error, you should take about 40 principal components.
Disclaimer: the obtained values are only valid for my artificial data. Do not use the proposed values blindly in your situation; instead, perform the same analysis and make a trade-off between the error you accept and the number of components you need.
Code reference
The iterative algorithm is based on the source code of MATLAB's pcares function.
A StackOverflow post about pcares

I highly recommend the following paper by Gavish and Donoho: The Optimal Hard Threshold for Singular Values is 4/sqrt(3).
I posted a longer summary of this on CrossValidated (stats.stackexchange.com). Briefly, they obtain an optimal procedure in the limit of very large matrices. The procedure is very simple, does not require any hand-tuned parameters, and seems to work very well in practice.
They have a nice code supplement here: https://purl.stanford.edu/vg705qn9070
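For a square n x n matrix the recipe is short. Here is a sketch of it as I understand the paper (for rectangular matrices the threshold coefficient depends on the aspect ratio; see their supplement for the exact values):

import numpy as np

def hard_threshold_rank(X, sigma=None):
    # Sketch for a square n x n matrix, following my reading of Gavish & Donoho.
    n = X.shape[0]
    s = np.linalg.svd(X, compute_uv=False)
    if sigma is not None:
        tau = (4.0 / np.sqrt(3.0)) * np.sqrt(n) * sigma   # noise level known
    else:
        tau = 2.858 * np.median(s)                        # noise level unknown
    return int(np.sum(s > tau))                           # number of singular values to keep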

Related

Gaussian Process Regression use case

While reading the paper "Tactile-based active object discrimination and target object search in an unknown workspace", there is something that I just cannot understand:
The paper is about finding an object's position and other properties using only tactile information. In section 4.1.2, the author says that he uses GPR to guide the exploratory process, and in section 4.1.4 he describes how he trained his GPR:
Using the example from section 4.1.2, the input is (x,z) and the output y.
Whenever there is a contact, the corresponding y-value is stored.
This procedure is repeated several times.
This trained GPR is used to estimate the next exploring point, which is the point where the variance is maximum at.
In the following link you can also see a demonstration: https://www.youtube.com/watch?v=ZiLq3i-BJcA&t=177s . In the first part of the video (0:24-0:29), the first initialization takes place, where the robot samples 4 times. Then, over the next 25 seconds, the robot explores from the corresponding direction. I do not understand how this tiny initialization of the GPR can guide the exploratory process. Could someone please explain how the input points (x,z) for the first exploration phase could be estimated?
Any regression algorithm simply maps the input (x,z) to an output y in some way unique to the specific algorithm. For a new input (x0,z0), the algorithm will likely predict something very close to the true output y0 if many similar data points were included in the training. If the only available training data lies in a vastly different region, the predictions will likely be very poor.
GPR includes a measure of confidence of the predictions, namely the variance. The variance will naturally be very high in regions where no training data has been seen before and low very close to already seen data points. If the 'experiment' takes much longer than evaluating the Gaussian Process, you can use the Gaussian Process fit to make sure you sample regions where you are very uncertain of your answer.
If the goal is to fully explore the entire input space, you could draw a lot of random values of (x,z) and evaluate the variance at these values. Then you could perform the costly experiment at the input point where you are most uncertain in y. Then you can retrain the GPR with all the explored data so far and repeat the process.
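Here is a rough sketch of that loop with scikit-learn's GaussianProcessRegressor; experiment, the initial samples, the bounds and the budget are placeholders for the real tactile setup:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

X_train = np.asarray(initial_points)   # the few (x, z) points from the initialisation phase (placeholder)
y_train = np.asarray(initial_values)   # the stored y-values at those contacts (placeholder)
gpr = GaussianProcessRegressor()
for _ in range(budget):                # budget: number of costly experiments you can afford (placeholder)
    gpr.fit(X_train, y_train)
    candidates = np.random.uniform(lower_bounds, upper_bounds, size=(1000, 2))  # random (x, z) candidates
    _, std = gpr.predict(candidates, return_std=True)
    next_point = candidates[np.argmax(std)]   # the most uncertain (x, z)
    y_new = experiment(*next_point)           # run the costly physical measurement there (placeholder)
    X_train = np.vstack([X_train, next_point])
    y_train = np.append(y_train, y_new)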
For optimization problems (Not the OP's question)
If you wish to find the lowest value of y across the input space, you are not interested in running experiments in regions that you already know give high values of y, even if you are uncertain of exactly how high those values will be. So instead of choosing the (x,z) point with the highest variance, you might choose the point where the predicted value of y minus one standard deviation is lowest. Optimizing this way is called Bayesian optimization, and this specific scheme is the confidence-bound acquisition, usually named Upper Confidence Bound (UCB). Expected Improvement (EI) - which weighs how much you can expect to improve on the previously best score - is also commonly used.

Is my method to detect overfitting in matrix factorization correct?

I am using matrix factorization as a recommender system algorithm based on users' click behaviour records. I tried two matrix factorization methods:
The first one is the basic SVD, whose prediction is just the product of the user factor vector u and the item factor vector i: r = u * i
The second one I used is the SVD with a bias component:
r = u * i + b_u + b_i
where b_u and b_i represent the preference biases of the users and items.
One of the models performs very poorly and the other one is reasonable. I really do not understand why the latter one performs worse, and I suspect it is overfitting.
I googled for methods to detect overfitting and found that the learning curve is a good way. However, its x-axis is the size of the training set and its y-axis is the accuracy, which leaves me quite confused. How can I change the size of the training set? By picking subsets of records out of the data set?
Another problem: I tried to plot the iteration-loss curve (the loss is the ...), and the curve seems normal:
But I am not sure whether this method is correct, because the metrics I use are precision and recall. Should I plot an iteration-precision curve instead? Or does this one already tell me that my model is correct?
Can anybody please tell me whether I am going in the right direction? Thank you so much. :)
I will answer in reverse:
So you are trying two different models, one that uses straight matrix factorization, r = u * i, and the other which adds the biases, r = u * i + b_u + b_i.
You mentioned you are doing matrix factorization for a recommender system based on users' clicks. So my question here is: is this an implicit-ratings case or an explicit one? I believe it is an implicit-ratings problem if it is about clicks.
This is the first important thing you need to be aware of, whether your problem is about explicit or implicit ratings, because there are differences in how the two are modelled and implemented.
If you check here:
http://yifanhu.net/PUB/cf.pdf
Implicit ratings are treated in such a way that the number of times someone clicked on or bought a given item is used to infer a confidence level. If you look at the error function, you can see that the confidence levels act essentially as weights. So the whole idea is that in this scenario the biases have no meaning.
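Concretely, the cost function in that paper is (roughly)

$$\min_{x_*,y_*}\ \sum_{u,i} c_{ui}\,\big(p_{ui}-x_u^\top y_i\big)^2 \;+\; \lambda\Big(\sum_u\|x_u\|^2+\sum_i\|y_i\|^2\Big),$$

where the preference is $p_{ui}=1$ if $r_{ui}>0$ and $0$ otherwise, and the confidence $c_{ui}=1+\alpha r_{ui}$ is built from the raw click/purchase counts $r_{ui}$. The counts enter only as weights on the squared errors, not as target values, which is why the explicit-rating biases have no natural place here.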
In the case of explicit ratings, where one has a score, for example from 1 to 5, one can calculate those biases for users and products (averages of these bounded scores) and introduce them into the ratings formula. They make sense in this scenario.
The whole point is, depending whether you are in one scenario or the other you can use the biases or not.
On the other hand, your question is about overfitting. For that you can compare training errors with test errors: depending on the size of your data you can keep a holdout test set, and if the two errors differ a lot then you are overfitting.
Another thing is that matrix factorization models usually include regularization terms, see the article linked above, to avoid overfitting.
So I think in your case you are having a different problem, the one I mentioned before.
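To make that overfitting check concrete, here is a sketch of the kind of curves you could track; model.train_one_epoch, model.predict, n_epochs and the (user, item, value) triples are placeholders for whatever your factorization code exposes:

import numpy as np

def rmse(model, triples):
    errors = [(value - model.predict(user, item)) ** 2 for user, item, value in triples]
    return np.sqrt(np.mean(errors))

train_curve, test_curve = [], []
for epoch in range(n_epochs):
    model.train_one_epoch(train_triples)          # one pass of your SGD/ALS training (placeholder)
    train_curve.append(rmse(model, train_triples))
    test_curve.append(rmse(model, test_triples))  # held-out triples never used for training
# Overfitting shows up as the training error still dropping while the held-out error stalls or rises.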

What “information” in document vectors makes sentiment prediction work?

Sentiment prediction based on document vectors works pretty well, as examples show:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
http://linanqiu.github.io/2015/10/07/word2vec-sentiment/
I wonder what pattern in the vectors makes that possible. I thought it should be the similarity of vectors that somehow makes it possible. Gensim's similarity measures rely on cosine similarity. Therefore, I tried the following:
I randomly initialised a fixed "compare" vector, computed the cosine similarity of this "compare" vector with all other vectors in the training and test sets, used the similarities and the labels of the training set to fit a logistic regression model, and evaluated the model on the test set.
It looks like this, where train/test_arrays contain document vectors and train/test_labels contain labels, either 0 or 1. (Note that the document vectors are obtained from gensim doc2vec and are well trained, predicting the test set 80% correctly if used directly as input to the logistic regression):
import numpy
import scipy.spatial.distance
from sklearn.linear_model import LogisticRegression

fix_vec = numpy.random.rand(100)  # 1-D random reference vector, same dimensionality as the doc-vectors

def cos_distance_to_fix(x):
    return scipy.spatial.distance.cosine(fix_vec, x)

train_arrays_cos = numpy.reshape(numpy.apply_along_axis(cos_distance_to_fix, axis=1, arr=train_arrays), newshape=(-1, 1))
test_arrays_cos = numpy.reshape(numpy.apply_along_axis(cos_distance_to_fix, axis=1, arr=test_arrays), newshape=(-1, 1))

classifier = LogisticRegression()
classifier.fit(train_arrays_cos, train_labels)
classifier.score(test_arrays_cos, test_labels)
It turns out that this approach does not work, predicting the test set at only 50% accuracy...
So, my question is: what "information" is in the vectors that makes prediction based on them work, if it is not the similarity of the vectors? Or is my approach simply unable to capture the similarity of vectors correctly?
This is less a question about Doc2Vec than about machine-learning principles with high-dimensional data.
Your approach is collapsing 100 dimensions to a single dimension – the distance to your random point. Then, you're hoping that single dimension can still be predictive.
And roughly all LogisticRegression can do with that single-valued input is try to pick a threshold-number that, when your distance is on one side of that threshold, predicts a class – and on the other side, predicts not-that-class.
Recasting that single-threshold-distance back to the original 100-dimensional space, it's essentially trying to find a hypersphere, around your random point, that does a good job collecting all of a single class either inside or outside its volume.
What are the odds your randomly-placed center-point, plus one adjustable radius, can do that well, in a complex high-dimensional space? My hunch is: not a lot. And your results, no better than random guessing, seem to suggest the same.
The LogisticRegression with access to the full 100-dimensions finds a discriminating-frontier for assigning the class that's described by 100 coefficients and one intercept-value – and all of those 101 values (free parameters) can be adjusted to improve its classification performance.
In comparison, your alternative LogisticRegression with access to only the one 'distance-from-a-random-point' dimension can pick just one coefficient (for the distance) and an intercept/bias. It's got 1/100th as much information to work with, and only 2 free parameters to adjust.
As an analogy, consider a much simpler space: the surface of the Earth. Pick a 'random' point, like say the South Pole. If I then tell you that you are in an unknown place 8900 miles from the South Pole, can you answer whether you are more likely in the USA or China? Hardly – both of those 'classes' of location have lots of instances 8900 miles from the South Pole.
Only in the extremes will the distance tell you for sure which class (country) you're in – because there are parts of the USA's Alaska and Hawaii further north and south than parts of China. But even there, you can't manage well with just a single threshold: you'd need a rule which says, "less than X or greater than Y, in USA; otherwise unknown".
The 100-dimensional space of Doc2Vec vectors (or other rich data sources) will often only be sensibly divided by far more complicated rules. And, our intuitions about distances and volumes based on 2- or 3-dimensional spaces will often lead us astray, in high dimensions.
Still, the Earth analogy does suggest a way forward: there are some reference points on the globe that will work way better, when you know the distance to them, at deciding if you're in the USA or China. In particular, a point at the center of the US, or at the center of China, would work really well.
Similarly, you may get somewhat better classification accuracy if rather than a random fix_vec, you pick either (a) any point for which a class is already known; or (b) some average of all known points of one class. In either case, your fix_vec is then likely to be "in a neighborhood" of similar examples, rather than some random spot (that has no more essential relationship to your classes than the South Pole has to northern-Hemisphere temperate-zone countries).
(Also: alternatively picking N multiple random points, and then feeding the N distances to your regression, will preserve more of the information/shape of the original Doc2Vec data, and thus give the classifier a better chance of finding a useful separating-threshold. Two would likely do better than your one distance, and 100 might approach or surpass the 100 original dimensions.)
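A sketch of that multi-reference-point variant, reusing the arrays from the question (the value of N and the random reference points are arbitrary choices):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.linear_model import LogisticRegression

N = 20                                                     # number of random reference points (arbitrary)
ref_points = np.random.rand(N, train_arrays.shape[1])       # N random reference vectors
train_dists = cdist(train_arrays, ref_points, metric='cosine')  # (n_docs, N) distance features
test_dists = cdist(test_arrays, ref_points, metric='cosine')
clf = LogisticRegression()
clf.fit(train_dists, train_labels)
print(clf.score(test_dists, test_labels))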
Finally, some comment about the Doc2Vec aspect:
Doc2Vec optimizes vectors that are somewhat-good, within their constrained model, at predicting the words of a text. Positive-sentiment words tend to occur together, as do negative-sentiment words, and so the trained doc-vectors tend to arrange themselves in similar positions when they need to predict similar-meaning-words. So there are likely to be 'neighborhoods' of the doc-vector space that correlate well with predominantly positive-sentiment or negative-sentiment words, and thus positive or negative sentiments.
These won't necessarily be two giant neighborhoods, 'positive' and 'negative', separated by a simple boundary –or even a small number of neighborhoods matching our ideas of 3-D solid volumes. And many subtleties of communication – such as sarcasm, referencing a not-held opinion to critique it, spending more time on negative aspects but ultimately concluding positive, etc – mean incursions of alternate-sentiment words into texts. A fully-language-comprehending human agent could understand these to conclude the 'true' sentiment, while these word-occurrence based methods will still be confused.
But with an adequate model, and the right number of free parameters, a classifier might capture some generalizable insight about the high-dimensional space. In that case, you can achieve reasonably-good predictions, using the Doc2Vec dimensions – as you've seen with the ~80%+ results on the full 100-dimensional vectors.

How to choose C and gamma AFTER grid search using libSVM (RBF kernel) for best possible generalisation?

I am aware of the abundance of questions asking about choosing the 'best' C and gamma values for an SVM (RBF kernel). The standard answer is a grid search; however, my question starts after the results of the grid search. Let me explain:
I have a data set of 10 subjects on which I perform leave-one-subject-out cross-validation, meaning I perform a grid search for each left-out subject. In order not to optimise on this training data, I do not want to choose the best C and gamma parameters by averaging the accuracy over all 10 models and searching for the maximum. Considering one model within the cross-validation, I could perform another cross-validation only on the training data within this model (not involving the left-out validation subject), but you can imagine the computational effort, and I do not have enough time for this at the moment.
Since the grid search for each of the 10 models resulted in a wide range of good C and gamma parameters (differences in accuracy of only 2-4%, see Figure 1), I thought about a different approach.
I defined a region within the grid which only contains the accuracies within 2% of the maximum accuracy of that grid; all other accuracy values, with a difference greater than 2%, are set to zero (see Figure 2). I do this for every model and build the intersection of the regions of all models. This results in a much smaller region of C and gamma values that would produce accuracies within 2% of the maximum accuracy for each model. However, the range is still rather big. So I thought about choosing the C-gamma pair with the lowest C, as this would mean that I am furthest away from overfitting and closest to a good generalisation. Can I argue like that?
How would I generally choose a C and gamma within this region of C-gamma pairs, which all proved to be reliable settings for my classifier in all 10 models?
Should I focus on minimising the C parameter? Or should I focus on minimising both the C and the gamma parameter?
I found a related answer here (Are high values for c or gamma problematic when using an RBF kernel SVM?) that says a combination of high C and high gamma would mean overfitting. I understand that the value of gamma changes the width of the Gaussian curve around data points, but I still can't get my head around what it practically means within a data set.
The post brought me to another idea: could I use the number of SVs relative to the number of data points as a criterion to choose between all the C-gamma pairs? Would a low ratio (number of SVs / number of data points) mean a better generalisation? I am willing to lose accuracy, as it shouldn't affect the outcome I am interested in, if I get a better generalisation in return (at least from a theoretical point of view).
Since the linear kernel is a special case of the RBF kernel, there is a method that uses a linear SVM to tune C first, and a bilinear tuning of the C-gamma pair later, to save time:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.880&rep=rep1&type=pdf
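Here is a rough sketch of that two-stage idea with scikit-learn; the grids and the X_train/y_train names are placeholders, and the linked paper describes the actual procedure:

import numpy as np
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import GridSearchCV

# Stage 1: tune C cheaply with a linear SVM.
lin = GridSearchCV(LinearSVC(), {'C': np.logspace(-3, 3, 7)}, cv=5)
lin.fit(X_train, y_train)
C0 = lin.best_params_['C']

# Stage 2: search C and gamma for the RBF kernel in a band around C0.
grid = {'C': C0 * np.logspace(-1, 1, 5), 'gamma': np.logspace(-4, 1, 6)}
rbf = GridSearchCV(SVC(kernel='rbf'), grid, cv=5)
rbf.fit(X_train, y_train)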

What does dimensionality reduction mean?

What does dimensionality reduction mean exactly?
I searched for its meaning and only found that it means transforming raw data into a more useful form. So what is the benefit of having data in a more useful form; I mean, how can I use it in practice (in an application)?
Dimensionality Reduction is about converting data of very high dimensionality into data of much lower dimensionality such that each of the lower dimensions conveys much more information.
This is typically done while solving machine learning problems to get better features for a classification or regression task.
Here's a contrived example. Suppose you have a list of 100 movies and 1000 people, and for each person you know whether they like or dislike each of the 100 movies. So for each instance (which in this case means each person) you have a binary vector of length 100 [position i is 0 if that person dislikes the i'th movie, 1 otherwise].
You can perform your machine learning task on these vectors directly, but instead you could decide upon 5 genres of movies and, using the data you already have, figure out whether the person likes or dislikes each entire genre. In this way you reduce your data from a vector of length 100 to a vector of length 5 [position i is 1 if the person likes genre i].
The vector of length 5 can be thought of as a good representative of the vector of length 100, because most people tend to like movies only in their preferred genres.
However, it is not going to be an exact representative, because there might be cases where a person hates all movies of a genre except one.
The point is that the reduced vector conveys most of the information in the larger one while consuming a lot less space and being faster to compute with.
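As a sketch of that genre idea (the like/dislike matrix and the genre assignment are made up here):

import numpy as np

likes = np.random.randint(0, 2, size=(1000, 100))   # person x movie, 1 = likes (made-up data)
genre_of = np.random.randint(0, 5, size=100)        # pretend genre label for each of the 100 movies
reduced = np.zeros((likes.shape[0], 5), dtype=int)
for g in range(5):
    # a person "likes genre g" if they like more than half of the movies in it
    reduced[:, g] = (likes[:, genre_of == g].mean(axis=1) > 0.5).astype(int)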
Your question is a little vague, but there's an interesting statistical technique that may be what you're thinking of, called Principal Component Analysis, which does something similar (and incidentally, plotting its results was my first real-world programming task).
It's a neat but clever technique which is remarkably widely applicable. I applied it to similarities between protein amino acid sequences, but I've seen it used to analyse everything from relationships between bacteria to malt whisky.
Consider a graph of some attributes of a collection of things where one has two independent variables. To analyse the relationship between these, one obviously plots in two dimensions and you might see a scatter of points. If you have three variables you can use a 3D graph, but after that one starts to run out of dimensions.
In PCA one might have dozens or even a hundred or more independent factors, all of which need to be plotted on perpendicular axes. Using PCA one does this, then analyses the resultant multidimensional graph to find the set of two or three axes within the graph which contain the largest amount of information. For example, the first principal coordinate will be a composite axis (i.e. at some angle through n-dimensional space) which carries the most information when the points are plotted along it. The second axis is perpendicular to this (remember this is n-dimensional space, so there are a lot of perpendiculars) and contains the second largest amount of information, and so on.
Plotting the resultant graph in 2D or 3D will typically give you a visualization of the data which contains a significant amount of the information in the original dataset. The technique is usually considered valid if the representation retains around 70% of the original variance - enough to visualize relationships with some confidence that would otherwise not be apparent in the raw statistics. Note that the technique requires all factors to have the same weight; still, it's an extremely widely applicable method that deserves to be more widely known, and it is available in most statistical packages (I did my work on an ICL 2700 in 1980 - which is about as powerful as an iPhone).
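In code, the visualization step itself is short (a sketch; X is assumed to hold the raw measurements):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X_2d = PCA(n_components=2).fit_transform(X)   # project onto the first two principal axes
plt.scatter(X_2d[:, 0], X_2d[:, 1])
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()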
http://en.wikipedia.org/wiki/Dimension_reduction
Maybe you have heard of PCA (principal component analysis), which is a dimension reduction algorithm.
Others include LDA, matrix factorization based methods, etc.
Here's a simple example. You have a lot of text files, and each file consists of some words. These files can be classified into two categories. You want to visualize a file as a point in a 2D/3D space so that you can see the distribution clearly. So you need to do dimension reduction to transform a file containing a lot of words into only 2 or 3 dimensions.
The dimensionality of a measurement of something is the number of numbers required to describe it. So, for example, the number of numbers needed to describe the location of a point in space is 3 (x, y and z).
Now let's consider the location of a train along a long but winding track through the mountains. At first glance this may appear to be a 3-dimensional problem, requiring a longitude, latitude and height measurement to specify. But these 3 dimensions can be reduced to one if you just take the distance travelled along the track from the start instead.
If you were given the task of using a neural network or some statistical technique to predict how far a train could get given a certain quantity of fuel, it would be far easier to work with the 1-dimensional data than the 3-dimensional version.
It's a technique of data mining. Its main benefit is that it allows you to produce a visual representation of many-dimensional data. The human brain is peerless at spotting and analyzing patterns in visual data, but it can process at most three dimensions (four if you use time, i.e. animated displays) - so any data with more than 3 dimensions needs to be somehow compressed down to 3 (or 2, since plotting data in 3D can often be technically difficult).
BTW, a very simple form of dimensionality reduction is the use of color to represent an additional dimension, for example in heat maps.
Suppose you're building a database of information about a large collection of adult human beings. It's also going to be quite detailed, so we could say that the database is going to have large dimensions.
As a matter of fact, each database record will include a measure of the person's IQ and shoe size. Now let's pretend that these two characteristics are quite highly correlated. Compared to IQs, shoe sizes may be easy to measure, and we want to populate the database with useful data as quickly as possible. One thing we could do would be to forge ahead and record shoe sizes for new database records, postponing the task of collecting IQ data until later. We would still be able to estimate IQs using shoe sizes, because the two measures are correlated.
We would be using a very simple form of practical dimension reduction by leaving IQ out of records initially. Principal components analysis, various forms of factor analysis and other methods are extensions of this simple idea.
