I recently read about PCA (Principal Component Analysis) and understood how it reduces dimensionality. When we need only one dimension, we select the eigenvector corresponding to the maximum eigenvalue, but if we need more than one dimension, should I take the eigenvectors corresponding to the largest eigenvalues?
Principal component analysis (PCA) is a statistical technique that carries out an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
The number of components after PCA transformation is equal to the number of variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set.
Generally, people keep as many components as are needed to account for 99% of the variance, which is usually far fewer than the total number of variables (see the sketch after the references below).
References:
https://stats.stackexchange.com/a/140579/86202
http://scikit-learn.org/stable/modules/decomposition.html#pca
https://en.wikipedia.org/wiki/Principal_component_analysis
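As an illustration, scikit-learn's PCA accepts a float n_components and keeps just enough components to reach that fraction of explained variance. A minimal sketch, assuming a placeholder data matrix generated on the fly:

import numpy as np
from sklearn.decomposition import PCA

# Placeholder data: 200 samples, 50 correlated features
rng = np.random.RandomState(0)
X = rng.rand(200, 10) @ rng.rand(10, 50)

# A float n_components keeps the smallest number of components
# whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (200, k) with k much smaller than 50
print(pca.explained_variance_ratio_.sum())  # at least 0.99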
Basically yes (from what can be inferred from your description). It would be nice to have more information about your case, your implementation tool, etc., but basically yes, the process would be:
Compute the covariance matrix.
Compute the eigenvectors of the covariance matrix. Depending on your tool, this can be done with a pre-defined function such as "eig" or with a singular value decomposition ("svd" in MATLAB). If you use svd, it commonly returns three values; the first is a matrix whose columns are the eigenvectors. If you want "k" dimensions, you take "k" columns of this matrix and they are your principal components (see the sketch below).
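A minimal NumPy sketch of those two steps, under the assumption that X is an n-by-p data matrix; the helper name pca_top_k is just for illustration:

import numpy as np

def pca_top_k(X, k):
    """Return the top-k principal components (eigenvectors) of X and the projected data."""
    X_centered = X - X.mean(axis=0)          # center each feature
    cov = np.cov(X_centered, rowvar=False)   # p x p covariance matrix

    # eigh is suited to symmetric matrices; eigenvalues come back in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]        # sort descending by eigenvalue

    components = eigvecs[:, order[:k]]       # p x k matrix of eigenvectors
    projected = X_centered @ components      # n x k reduced representation
    return components, projected

# Equivalently, via SVD of the centered data: the rows of Vt are the same
# principal directions.
# U, S, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
# components_svd = Vt[:k].T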
Here is my implementation of PCA in Octave. I use a pca.m file to define the PCA calculation and ex7_pca.m to apply it for dimensionality reduction in that particular case:
https://github.com/llealgt/standord_machine_learning_exercices/blob/master/machine-learning-ex7/ex7/pca.m
https://github.com/llealgt/standord_machine_learning_exercices/blob/master/machine-learning-ex7/ex7/ex7_pca.m
I am not sure what the y-axis of my PDP (partial dependence plot) implies. Is it the probability of my target feature being 1 (binary classification), or something else?
If you do the partial dependence plot of column a and you want to interpret the y value at x = 0.0, the y-axis value represents the average probability of class 1, computed by (see the sketch below):
changing the value of column a in all rows of your dataset to 0.0
predicting all changed rows with your fitted model
averaging the probabilities given by the model
I may not be good at explaining this, but you can read more about PDPs at https://christophm.github.io/interpretable-ml-book/pdp.html. Hope this helps :)
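To make those three steps concrete, here is a minimal sketch; the fitted binary classifier model (with a predict_proba method) and the pandas DataFrame X are placeholders for your own objects:

def partial_dependence_at(model, X, column, value):
    """Average predicted probability of class 1 when `column` is set to `value`."""
    X_changed = X.copy()
    X_changed[column] = value                       # step 1: set the column in every row
    proba = model.predict_proba(X_changed)[:, 1]    # step 2: predict the changed rows
    return proba.mean()                             # step 3: average the probabilities

# y-axis value of the PDP of column "a" at x = 0.0:
# pd_value = partial_dependence_at(model, X, "a", 0.0)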
Generally speaking, we can produce a classifier from a function f producing a real-valued output, plus a threshold. We call the output an 'activation'. If the threshold condition is met, we say the class is detected:
is_class := ( f(x0, x1, ...) > threshold )
and
activation = f(x0, x1, ...)
PDP plots simply show activation values as they change in response to changes in an input value (we ignore the threshold). That is, we might plot:
f(x0, x, x2, x3, ...)
as a single input x varies. Typically, we hold the others constant, although we can also plot in 2d and 3d.
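For instance, a minimal sketch of such a plot; the activation function f below is an arbitrary placeholder, and the other inputs are held at fixed values:

import numpy as np
import matplotlib.pyplot as plt

def f(x0, x1, x2):
    # Placeholder activation; in practice this is your model's raw output.
    return 2.0 * x0 - 0.5 * x1 ** 2 + np.sin(x2)

x = np.linspace(-3, 3, 200)     # the single input we vary
x0_fixed, x2_fixed = 1.0, 0.0   # the other inputs are held constant

plt.plot(x, f(x0_fixed, x, x2_fixed))
plt.xlabel("x (varied input)")
plt.ylabel("activation f(x0, x, x2)")
plt.show()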
Sometimes we're interested in:
how a single input changes the activation
how multiple inputs independently change the activation
how multiple activations change based on different inputs, and so on.
Strictly speaking, we need not even be talking about a classifier when looking at PDP plots. Any function that produces a real-valued output (an activation) in response to one or more real-valued feature inputs that we can vary allows us to produce PDP plots.
Classifier activations need not be, and often should not be, interpreted as probabilities, as others have written; in very many cases that interpretation is simply incorrect. Nevertheless, the analysis of activation levels is of interest to us, independently of whether the activations represent probabilities: in PDP plots we can see, for example, which feature values produce a strong change; a mostly flat, horizontal plot may indicate a worthless feature.
Similarly, in ROC plots, we explicitly examine the true-positive and false-positive detection rates that result from varying the threshold on the activation values.
In both cases, there's no necessity that the classifier produce probabilities as its activation.
Interpretation of PDP plots is fraught with dangers. At a minimum, you need to be clear about what is being held constant as an input feature is varied. Were the other features set to zero (a good choice for linear models)? Did we set them to their most common values in the test set, or to the most common values for a known class in a sample? Without this information, the vertical axis may be less helpful.
Knowing that an activation is a probability also doesn't seem too helpful in PDP plots -- you can't expect the area under the curve to sum to one. Perhaps the most useful thing you might find is error cases, where output probabilities fall outside the range 0..1.
I am working on a multiple regression problem. I have the data set below.
rank--discipline--yrs.since.phd--yrs.service--sex--salary
[ 1 1 19 18 1 139750],......
I am taking salary as the dependent variable and the other variables as independent variables. After preprocessing the data, I ran gradient descent for the regression model and estimated the bias (intercept) and the coefficients for all independent features.
I want to do a scatter plot of the actual values and the regression line for the hypothesis I predicted. Since we have more than one feature here, I have the following questions.
While plotting the actual values (scatter plot), how do I decide the x-axis values? I have a list of values; for example, the first row is [1,1,19,18,1] => 139750. How do I transform or map [1,1,19,18,1] to the x-axis? I need to somehow reduce [1,1,19,18,1] to one value, so I can mark a point (x, y) in the plot.
While plotting the regression line, what would the feature values be, so I can calculate the hypothesis value? I now have the intercept and the weight of every feature, but I don't have the feature values. How do I decide on the feature values?
I want to calculate the points myself and use matplotlib to do the job. I am aware that there are a lot of tools available, including matplotlib, that can do this for me, but I want to get a basic understanding.
Thanks.
I am still not sure I completely understand your question, so if something is not what you expected, comment below and we will work it out.
Now,
Query 1: In any dataset like this you have multiple inputs, and there is no way to view the target variable (salary, in your case) against all of them in a single graph. What is usually done is either to apply dimensionality reduction, using t-SNE (link) or principal component analysis (PCA), so that your output becomes a function of two or three variables you can plot, or, the technique I prefer, to plot the target against each variable separately as subplots (sketched below). The reason is that we simply have no way to visualise data in more than three dimensions.
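For the subplot approach, a minimal sketch; the pandas DataFrame df is assumed to hold the dataset with the column names from the question:

import matplotlib.pyplot as plt

features = ["rank", "discipline", "yrs.since.phd", "yrs.service", "sex"]

fig, axes = plt.subplots(1, len(features), figsize=(15, 3), sharey=True)
for ax, col in zip(axes, features):
    ax.scatter(df[col], df["salary"], s=10)   # target vs one feature at a time
    ax.set_xlabel(col)
axes[0].set_ylabel("salary")
plt.tight_layout()
plt.show()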
Query 2: If you are not set on matplotlib, I would suggest seaborn.regplot(), but let's also do it in matplotlib. Suppose the first pair of variables you want to look at is 'discipline' vs 'salary'.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = df[['discipline']]   # single feature (df holds your dataset)
Y = df['salary']         # target
lm.fit(X, Y)
After running this, lm.coef_ will give you the coefficient and lm.intercept_ the intercept of the linear equation for this variable; then you can easily plot the data for the two variables plus the fitted line with matplotlib.
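Continuing the snippet above, a minimal sketch of that scatter plot plus fitted line (still assuming df and the fitted lm from the previous lines):

import numpy as np
import matplotlib.pyplot as plt

plt.scatter(df["discipline"], df["salary"], s=10, label="data")   # actual values

x_line = np.linspace(df["discipline"].min(), df["discipline"].max(), 100)
y_line = lm.intercept_ + lm.coef_[0] * x_line                     # fitted line for this one feature
plt.plot(x_line, y_line, color="red", label="regression line")

plt.xlabel("discipline")
plt.ylabel("salary")
plt.legend()
plt.show()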
What you can do is:
from pandas import plotting as pdplt
pdplt.scatter_matrix(dataframe)   # plus any optional parameters you need (figure size, diagonal plot type, ...)
This gives you a matrix of plots (in your case 6x6) that shows exactly how each column in your dataframe relates to the other columns, so you can clearly see which features dominate the result and how the features are correlated with each other.
If you ask me, this is the first thing to do with this type of problem: then remove highly correlated features and select the features that best approximate the output.
But since you have to produce a 2D plot, and the above approach might leave you with more than a single feature that dominates the output, the next thing you can do is the small miracle named PCA.
If you ask me, PCA is one of the most beautiful things in machine learning. It merges all your features in certain ratios to generate the principal components of your data. The principal components are the components that make the major contribution to your model. You apply PCA by simply importing it from sklearn and then selecting the first principal component (since you need a 2D plot), or you might select two principal components and plot a 3D graph. But always remember that these principal components are not the real features of your model; they are combinations of them, and how PCA builds them is very interesting (using concepts like eigenvalues and eigenvectors), so you can also implement it yourself.
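A minimal sketch of that step with scikit-learn; here X stands for your preprocessed feature matrix and y for salary, both placeholders:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=1)             # keep only the first principal component
X_1d = pca.fit_transform(X)           # shape (n_samples, 1)

plt.scatter(X_1d[:, 0], y, s=10)
plt.xlabel("first principal component")
plt.ylabel("salary")
plt.show()

print(pca.explained_variance_ratio_)  # how much variance that component captures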
Apart from that, you can apply singular value decomposition (SVD) to your data; it is at the heart of linear algebra and is a matrix decomposition that exists for every matrix. It decomposes your matrix into three matrices, one of which is diagonal and contains the singular values (scaling factors) in descending order. What you then do is select the top singular values (in your case only the first one, with the highest magnitude) and reconstruct a feature matrix reduced from 5 columns to 1 column, and plot that. You can compute the SVD with numpy.linalg.
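And a minimal sketch of the SVD route with numpy.linalg, assuming X is the feature matrix as a NumPy array:

import numpy as np

X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

print(S)   # singular values, already in descending order

# Keep only the direction with the largest singular value (5 columns -> 1 column)
X_1d = X_centered @ Vt[0].reshape(-1, 1)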
Once you have applied one of these methods, you can learn your hypothesis with only the single most important extracted feature and finally plot the graph. But take a tip: do not throw away other important features just to get a 2D plot, because you might have three principal components with almost the same contribution, or the top three singular values might be very close to each other. So take my word for it and take all the important features into account, and if you need a visualisation of those important features, use the scatter matrix.
Summary:
You can follow the same process with any of these techniques, or even invent your own statistical or mathematical model for compressing your feature space.
Personally, I prefer PCA, and for this type of problem I first plot the scatter matrix to get a visual intuition for the data. PCA and SVD also help remove redundancy and hence reduce overfitting.
For the remaining details, refer to the docs.
Happy machine learning...
I would like to conduct feature extraction (or clustering) for a dataset containing sub-features.
For example, the dataset looks like the one below. The goal is to classify the type of robot using the data.
Samples : 100 robot samples [Robot 1, Robot 2, ..., Robot 100]
Classes : 2 types [Type A, Type B]
Variables : 6 parts, and 3 sub-features for each parts (total 18 variables)
[Part1_weight, Part1_size, Part1_strength, ..., Part6_size, Part6_strength, Part6_weight]
I want to conduct feature extraction on [weight, size, strength] and use the extracted feature as a representative value for each part.
In short, my aim is to reduce the features to 6, [Part1_total, Part2_total, ..., Part6_total], and then classify the type of robot with those 6 features. So the problem to solve is how to build a combined feature from 'weight', 'size', and 'strength'.
First I thought of applying PCA (Principal Component Analysis), because it is one of the most popular feature extraction algorithms. But it treats all 18 features separately, so 'Part1_weight' can end up being considered more important than 'Part2_weight'. What I need to capture is the importance of 'weights', 'sizes', and 'strengths' across samples, so PCA does not seem applicable.
Is there a recommended way to solve this problem?
If you want to have exactly one feature per part, I see no other way than performing the feature reduction part-wise. However, there might be better choices than plain PCA. For example, if the parts are mostly solid, their weight is likely to correlate with the third power of their size, so you could take the cube root of the weight or the cube of the size before performing the PCA. Alternatively, you can take the logarithm of both values, which again results in a linear dependency.
Of course, there are many more fancy transformations you could use. In statistics, the Box-Cox Transformation is used to achieve a normal-looking distribution of the data.
You should also consider normalising the transformed data before performing the PCA, i.e. subtracting the mean and dividing by the standard deviation of each variable. This removes the influence of the units of measurement, i.e. it won't matter whether you measure weight in kilograms, atomic units, or Sun masses.
If the part number makes parts different from one another (e.g. Part1 is different from Part2, regardless of whether their size, weight and strength parameters are identical), you can run PCA once for each part, using only that part's size, weight and strength as inputs (see the sketch below).
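A minimal sketch of this part-wise version; it assumes an 18-column DataFrame df with columns named as in the question (e.g. 'Part1_weight'):

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

part_totals = pd.DataFrame(index=df.index)

for part in range(1, 7):
    cols = [f"Part{part}_weight", f"Part{part}_size", f"Part{part}_strength"]
    scaled = StandardScaler().fit_transform(df[cols])   # remove units/scale
    # One PCA per part, keeping only the first component as that part's "total" score
    part_totals[f"Part{part}_total"] = PCA(n_components=1).fit_transform(scaled)[:, 0]

# part_totals now has 6 columns, one representative value per part,
# and can be fed to a classifier to separate Type A from Type B.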
Alternatively, if the order of the parts doesn't matter, you can run a single PCA using all the (size, weight, strength) triples together, without distinguishing them by part number.
An unsupervised dimensionality reduction algorithm takes as input a matrix NxC1, where N is the number of input vectors and C1 is the number of components of each vector (the dimensionality of the vectors). As a result, it returns a new matrix NxC2 (C2 < C1) where each vector has a lower number of components.
A fuzzy clustering algorithm also takes as input a matrix NxC1, where N is again the number of input vectors and C1 is the number of components of each vector. As a result, it returns a new matrix NxC2 (C2 usually lower than C1) where each component of each vector indicates the degree to which the vector belongs to the corresponding cluster.
I noticed that the input and output of both classes of algorithms have the same structure; only the interpretation of the results changes. Moreover, there is no fuzzy clustering implementation in scikit-learn, hence the following question:
Does it make sense to use a dimensionality reduction algorithm to perform fuzzy clustering?
For instance, is it nonsense to apply FeatureAgglomeration or TruncatedSVD to a dataset built from TF-IDF vectors extracted from textual data, and interpret the results as a fuzzy clustering?
In some sense, sure. It kind of depends on how you want to use the results downstream.
Consider SVD truncation or excluding principal components. We have projected into a new, variance-preserving space with few other restrictions on the structure of the new manifold. The new coordinate representations of the original data points could have large negative numbers for some elements, which is a little weird, but one could shift and rescale the data without much difficulty.
One could then interpret each dimension as a cluster membership weight. But consider a common use of fuzzy clustering, which is to generate a hard clustering; notice how easy this is with fuzzy cluster weights (e.g. just take the max). Consider a set of points in the new dimensionally-reduced space, say <0,0,1>, <0,1,0>, <0,100,101>, <5,100,99>. A fuzzy clustering would give something like {p1,p2}, {p3,p4} if thresholded, but if we take the max here (i.e. treat the dimensionally-reduced axes as memberships), we get {p1,p3}, {p2,p4} for k=2, for instance. Of course, one could use a better algorithm than max to derive hard memberships (say, by looking at pairwise distances, which would work for my example); such algorithms are called, well, clustering algorithms.
Of course, different dimensionality reduction algorithms may work better or worse for this (e.g. MDS, which focuses on preserving distances between data points rather than variances, is more naturally cluster-like). But fundamentally, many dimensionality reduction algorithms implicitly preserve information about the underlying manifold that the data lie on, whereas fuzzy cluster vectors only hold information about the relations between data points (which may or may not implicitly encode that other information).
Overall, the purpose is a little different. Clustering is designed to find groups of similar data. Feature selection and dimensionality reduction are designed to reduce the noise and/or redundancy of the data by changing the embedding space. Often we use the latter to help with the former.
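A minimal sketch of the TruncatedSVD-on-TF-IDF idea, including the shift/rescale and the naive take-the-max hard step discussed above; the docs list is a toy placeholder:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat", "dogs bark loudly", "cats and dogs", "loud barking dog"]

tfidf = TfidfVectorizer().fit_transform(docs)
components = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Shift and rescale each row so it can be read as non-negative "membership" weights
shifted = components - components.min(axis=1, keepdims=True)
memberships = shifted / (shifted.sum(axis=1, keepdims=True) + 1e-12)

hard_labels = memberships.argmax(axis=1)   # naive hard clustering via the max
print(memberships)
print(hard_labels)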
I know that principal component analysis does an SVD on a matrix and then generates an eigenvalue matrix. To select the principal components we have to take only the first few eigenvalues. Now, how do we decide how many eigenvalues to take from the eigenvalue matrix?
To decide how many eigenvalues/eigenvectors to keep, you should consider your reason for doing PCA in the first place. Are you doing it for reducing storage requirements, to reduce dimensionality for a classification algorithm, or for some other reason? If you don't have any strict constraints, I recommend plotting the cumulative sum of eigenvalues (assuming they are in descending order). If you divide each value by the total sum of eigenvalues prior to plotting, then your plot will show the fraction of total variance retained vs. number of eigenvalues. The plot will then provide a good indication of when you hit the point of diminishing returns (i.e., little variance is gained by retaining additional eigenvalues).
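For example, a minimal sketch of that cumulative plot; here the eigenvalue spectrum comes from scikit-learn's PCA, and X is a placeholder for your data matrix:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA().fit(X)   # X: your data matrix (placeholder)

# Fraction of total variance retained vs. number of eigenvalues kept
cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.plot(np.arange(1, len(cumulative) + 1), cumulative, marker="o")
plt.xlabel("number of components")
plt.ylabel("cumulative explained variance")
plt.axhline(0.95, linestyle="--")   # e.g. a 95% target line
plt.show()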
There is no correct answer, it is somewhere between 1 and n.
Think of a principal component as a street in a town you have never visited before. How many streets should you take to get to know the town?
Well, you should obviously visit the main street (the first component), and maybe some of the other big streets too. Do you need to visit every street to know the town well enough? Probably not.
To know the town perfectly, you should visit all of the streets. But what if you could visit, say 10 out of the 50 streets, and have a 95% understanding of the town? Is that good enough?
Basically, you should select enough components to explain enough of the variance that you are comfortable with.
As others said, it doesn't hurt to plot the explained variance.
If you use PCA as a preprocessing step for a supervised learning task, you should cross-validate the whole data processing pipeline and treat the number of PCA dimensions as a hyperparameter to select with a grid search on the final supervised score (e.g. F1 score for classification or RMSE for regression).
If a cross-validated grid search on the whole dataset is too costly, try it on two subsamples, e.g. one with 1% of the data and the other with 10%, and see if you arrive at the same optimal value for the number of PCA dimensions.
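A minimal sketch of such a pipeline for a binary classification task; X, y, the classifier choice and the grid values are placeholders:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate numbers of PCA dimensions (must not exceed the number of features)
param_grid = {"pca__n_components": [2, 5, 10, 20, 50]}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X, y)   # X, y: your training data (placeholders)

print(search.best_params_["pca__n_components"])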
There are a number of heuristics used for that.
E.g. taking the first k eigenvectors that capture at least 85% of the total variance.
However, for high dimensionality, these heuristics usually are not very good.
Depending on your situation, it may be interesting to define the maximal allowed relative error when projecting your data onto ndim dimensions.
Matlab example
I will illustrate this with a small matlab example. Just skip the code if you are not interested in it.
I will first generate a random matrix of n samples (rows) and p features containing exactly 100 non-zero principal components.
n = 200;              % number of samples (rows)
p = 119;              % number of features (columns)
data = zeros(n, p);
for i = 1:100         % sum of 100 rank-one terms, so the data has rank 100
data = data + rand(n, 1)*rand(1, p);
end
[Figure omitted: the generated data matrix displayed as an image.]
For this sample data, one can calculate the relative error made by projecting the input data onto ndim dimensions as follows:
[coeff,score] = pca(data,'Economy',true);   % principal directions and scores
relativeError = zeros(p, 1);
for ndim=1:p
% Reconstruct the data from the first ndim principal components
reconstructed = repmat(mean(data,1),n,1) + score(:,1:ndim)*coeff(:,1:ndim)';
residuals = data - reconstructed;
% Worst-case relative reconstruction error over all entries
relativeError(ndim) = max(max(residuals./data));
end
Plotting the relative error as a function of the number of dimensions (principal components) results in the following graph: [figure omitted]
Based on this graph, you can decide how many principal components you need to take into account. In this synthetic example, taking 100 components results in an exact representation of the data, so taking more than 100 components is useless. If you want, for example, at most 5% error, you should take about 40 principal components.
Disclaimer: the obtained values are only valid for my artificial data. So do not use the proposed values blindly in your situation; perform the same analysis and make a trade-off between the error you make and the number of components you need.
Code reference
The iterative algorithm is based on the source code of pcares
A StackOverflow post about pcares
I highly recommend the following paper by Gavish and Donoho: The Optimal Hard Threshold for Singular Values is 4/sqrt(3).
I posted a longer summary of this on CrossValidated (stats.stackexchange.com). Briefly, they obtain an optimal procedure in the limit of very large matrices. The procedure is very simple, does not require any hand-tuned parameters, and seems to work very well in practice.
They have a nice code supplement here: https://purl.stanford.edu/vg705qn9070