Why to add mean in the reconstruction in PCA? - machine-learning

Suppose that X is our dataset (still not centered) and X_cent is our centered dataset (X_cent = X - mean(X)).
If we do the PCA projection as Z_cent = F*X_cent, where F is the matrix of principal components, it is pretty obvious that we need to add mean(X) back after reconstructing from Z_cent.
But what if we do the PCA projection as Z = F*X? In this case we don't need to add mean(X) after reconstruction, but it gives us a different result.
I think that something is wrong with this procedure (construction-reconstruction) when it is applied to non-centered data (X in our case). Can anyone explain how it works? Why can't we do the construction/reconstruction phase without subtracting/adding the mean?
Thank you in advance.

If you retain all Principal Components, then reconstruction of the centered and non-centered vectors as described in your question would be identical. The problem (as indicated in your comments) is that you are only retaining K principal components. When you drop PCs, you lose information so the reconstruction will contain errors. Since you don't have to reconstruct the mean in one of the reconstructions, you don't introduce errors w.r.t. the mean there so the reconstruction errors of the two versions will be different.
Reconstruction with fewer than all PCs isn't quite as simple as multiplying by the transpose of the eigenvectors (F') because you need to pad your transformed data with zeros but to keep things simple, I'll ignore that here. Your two reconstructions look like this:
R1 = F'*F*X
R2 = F'*F*X_cent + X_mean
= F'*F*(X - X_mean) + X_mean
= F'*F*X - F'*F*X_mean + X_mean
Since the reconstruction is lossy, in general, F'*F*Y != Y for a matrix Y. If you retained all PCs, you would have R1 - R2 = 0. But since you are only retaining a subset of the PCs, your two reconstructions will differ by
R2 - R1 = X_mean - F'*F*X_mean
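As a quick numerical check of the relation above, here is a minimal NumPy sketch. It assumes the samples are the columns of X and that F holds the retained PCs as rows; the data, the number of features and the helper name `reconstructions` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 5 features (rows) x 200 samples (columns), with a nonzero mean.
X = rng.normal(size=(5, 200)) + np.array([[3.0], [-1.0], [0.5], [2.0], [4.0]])
X_mean = X.mean(axis=1, keepdims=True)
X_cent = X - X_mean

# Principal components: eigenvectors of the covariance of the centered data.
vals, vecs = np.linalg.eigh(X_cent @ X_cent.T / X.shape[1])
order = np.argsort(vals)[::-1]  # largest eigenvalue first


def reconstructions(k):
    """Return F (k x 5, rows are the retained PCs) and both reconstructions."""
    F = vecs[:, order[:k]].T
    R1 = F.T @ F @ X                # reconstruct the non-centered data
    R2 = F.T @ F @ X_cent + X_mean  # reconstruct the centered data, then add the mean
    return F, R1, R2


F, R1, R2 = reconstructions(2)
gap = X_mean - F.T @ F @ X_mean     # predicted difference R2 - R1
print(np.allclose(R2 - R1, gap))

F_all, R1_all, R2_all = reconstructions(5)
print(np.allclose(R1_all, R2_all))  # with all PCs the two reconstructions agree
```

With only two PCs retained, the two reconstructions differ by exactly the part of the mean that the retained PCs cannot represent; with all five PCs, F'*F is the identity and the difference vanishes.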
Your follow-up question in the comments regarding why it's better to reconstruct X_cent instead of X is a bit more nuanced and really depends on why you are doing PCA in the first place. The most fundamental reason is that the PCs are defined with respect to the mean (they are eigenvectors of the covariance of the centered data), so by not centering the data prior to transforming/rotating, you aren't really decorrelating the features. Another reason is that the numeric values of the transformed data are often orders of magnitude smaller when centering the data first.

Related

Composing multiple Essential matrices

I have computed the Essential matrices between Frame [a,b], [b,c] and [c,d]. I have now E_ab, E_bc and E_cd. Is it possible to compute E_ad directly without matching? I am thinking of sth like 3D transformation composing where:
T04 = T01 * T12 * T23 * T34
The solution that I found is to decompose each one into R|T, construct a 4x4 T from R|T, use the previous notation for composing T matrices, and then convert the final R|T to E. However, I was wondering if there is a direct way without decomposing, for two reasons:
Decomposing and re composing (at least in OpenCV) changes the essential matrix a bit
Performance sake
So, How can I combine two (or more) essential matrices directly?
Before answering the original question, I think it might be important to discuss this point from your comment:
Ultimate goal is Bundle Adjustment on the Essential Matrices directly instead of on RT. I am using Single Calibrated Camera
It might be that I'm missing critical details about your project, but there seems to be a few issues with that method:
Minimal representation. Usually, it's better (for both performance and accuracy reasons) to use minimal parametrisations in your estimations. The Essential matrix has only five degrees of freedom (three for the rotation and two for the direction of the translation, since its decomposition as [T]_xR is only defined up to scale). If you estimate your Essential directly, you're optimizing eight parameters instead of five.
Less constrained problem. The Essential matrix imposes coplanarity constraints and doesn't penalize some wrong configurations that would be penalized by the reprojection errors used in usual bundle adjustment. If you consider two cameras C_1 and C_2 such that a point X_2 expressed in C_2 can be expressed in C_1 via RX_2+T, and we denote the corresponding point in C_1 as X_1, then the constraint X_1.transpose()*E*X_2=0 imposes that the three vectors T, X_1 and RX_2+T remain coplanar. So, the point X_2 is free to move as long as it reprojects to the epipolar line, even if its ray diverges from X_1:
As you can see in the figure, the red and green points would respect the Essential matrix constraint, however they would have different reprojection errors.
Time complexity. Computing Essentials is really slow compared to alternatives, and so is the recovery of rotation and translations from Essentials, compared to the inverse problem.
Now to come back to the initial problem from the question, if you consider the three cameras C_1, C_2, C_3, denote the relative pose between C_1 and C_2 as (R_1, T_1) and the one between C_2 and C_3 as (R_2, T_2), and note the Essential between C_1 and C_2 as E_1 = [T_1]_x R_1 and the one between C_2 and C_3 as E_2 = [T_2]_x R_2, then you can easily see that the Essential from C_3 to C_1, that I'll note E_3, will be given by
E_3=[R_1T_2+T_1]_x R_1R_2
which with a little bit of reorganization leads to
E_3=R_1E_2 + E_1R_2
This relation which establishes the link between the three Essential matrices seems to indicate that you still will need to extract R_1 and R_2 from E_1 and E_2.
Edit: I derived the relationship above like this: assume y is a 3x1 vector, then
E_3y = [R_1T_2+T_1]_x R_1R_2y
= (R_1T_2+T_1) x (R_1R_2y)
= R_1T_2 x R_1R_2y + T_1 x R_1R_2y #for the next line, note that the cross product commutes with rotation: (Ra) x (Rb) = R(a x b)
= R_1(T_2 x R_2 y) + T_1 x R_1R_2 y
= R_1[T_2]_x R_2 y + [T_1]_x R_1 R_2 y
= R_1 E_2 y + E_1 R_2 y
= (R_1E_2 + E_1R_2) y
Since this holds for any arbitrary y, this means that E_3=R_1E_2+E_1R_2.
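The identity can also be checked numerically. The sketch below is only a sanity check under the conventions used in this answer (E = [T]_x R, proper rotations with determinant +1); the helper names `skew` and `random_rotation` and the random poses are illustrative assumptions.

```python
import numpy as np


def skew(t):
    """Return the 3x3 matrix [t]_x such that skew(t) @ y == np.cross(t, y)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def random_rotation(rng):
    """Random proper rotation built from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # flip one axis so that det(q) = +1
    return q


rng = np.random.default_rng(1)
R_1, R_2 = random_rotation(rng), random_rotation(rng)
T_1, T_2 = rng.normal(size=3), rng.normal(size=3)

E_1 = skew(T_1) @ R_1
E_2 = skew(T_2) @ R_2
E_3_direct = skew(R_1 @ T_2 + T_1) @ R_1 @ R_2  # [R_1 T_2 + T_1]_x R_1 R_2
E_3_composed = R_1 @ E_2 + E_1 @ R_2            # the relation derived above

print(np.allclose(E_3_direct, E_3_composed))
```

Note that the composed form still uses R_1 and R_2 explicitly, which is the point made above: the Essentials alone are not enough.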

Trying to do PCA analysis on interest rate swaps data (multivariate time series)

I have a data set with 20 non-overlapping different swap rates (spot1y, 1y1y, 2y1y, 3y1y, 4y1y, 5y2y, 7y3y, 10y2y, 12y3y...) over the past year.
I want to use PCA / multiregression and look at residuals in order to determine which sectors on the curve are cheap/rich. Has anyone had experience with this? I've done PCA but not for time series. I'd ideally like to model something similar to the first figure here but in USD.
https://plus.credit-suisse.com/rpc4/ravDocView?docid=kv66a7
Thanks!
Here are some broad strokes that can help answer your question. Also, that's a neat analysis from CS :)
Let's be pythonistas and use NumPy. You can imagine your dataset as a 20x261 array of floats. The first place to start is creating the array. Suppose you have a CSV file storing the raw data persistently. Then a reasonable first step to load the data would be something as simple as:
import numpy
x = numpy.loadtxt("path/to/my/file", delimiter=",")  # loadtxt splits on whitespace by default, so name the CSV delimiter
The object x is our raw time series matrix, and we can verify that x.shape == (20, 261) is true. The next step is to transform this array into its covariance matrix. Whether it has been done on the raw data already, or it still has to be done, the first step is centering each time series on its mean, like this:
x_centered = x - x.mean(axis=1, keepdims=True)
The purpose of this step is to help simplify any necessary rescaling, and is a very good habit that usually shouldn't be skipped. The call to x.mean uses the parameters axis and keepdims to make sure each row (e.g. the time series for spot1y, ...) is centered with its mean value.
The next steps are to square and scale x to produce a swap rate covariance array. With 2-dimensional arrays like x, there are two ways to square it-- one that leads to a 261x261 array and another that leads to a 20x20 array. It's the second array we are interested in, and the squaring procedure that will work for our purposes is:
x_centered_squared = numpy.matmul(x_centered, x_centered.transpose())
Then, to scale, one can choose between 1/261 and 1/(261-1) depending on the statistical context, which looks like this:
x_covariance = x_centered_squared * (1/261)
The array x_covariance has an entry for how each swap rate changes with itself, and changes with any one of the other swap rates. In linear-algebraic terms, it is a symmetric operator that characterizes the spread of each swap rate.
Linear algebra also tells us that this array can be decomposed into its associated eigen-spectrum, with elements in this spectrum being scalar-vector pairs, or eigenvalue-eigenvector pairs. In the analysis you shared, x_covariance's eigenvalues are plotted in exhibit two as percent variance explained. To produce the data for a plot like exhibit two (which you will always want to furnish to the readers of your PCA), you simply divide each eigenvalue by the sum of all of them, then multiply each by 100.0. Due to the convenient properties of x_covariance (it is real and symmetric), a suitable way to compute its spectrum is like this:
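vals, vects = numpy.linalg.eigh(x_covariance)  # eigh is meant for symmetric matrices; eigenvalues come back in ascending order
vals, vects = vals[::-1], vects[:, ::-1]  # largest eigenvalue first; the eigenvectors are the columns of vects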
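The percent-variance-explained numbers described above could be produced along these lines. This is only a sketch: the random array `x_demo` is a hypothetical stand-in for the real 20x261 swap-rate data, and all of the `*_demo` names are made up for the example.

```python
import numpy

rng = numpy.random.default_rng(0)
x_demo = rng.normal(size=(20, 261))  # hypothetical stand-in for the real swap-rate array
x_demo_centered = x_demo - x_demo.mean(axis=1, keepdims=True)
cov_demo = numpy.matmul(x_demo_centered, x_demo_centered.transpose()) / 261

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
vals_demo = numpy.linalg.eigvalsh(cov_demo)[::-1]  # largest first
pct_explained = 100.0 * vals_demo / vals_demo.sum()

print(pct_explained[:3])  # share of variance carried by the first three components
```

Plotting `pct_explained` as a bar chart gives the kind of figure shown in exhibit two.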
We are now in a position to talk about residuals! Here is their definition (with our namespace): residuals_ij = x_ij − reconstructed_ij; i = 1:20; j = 1:261. Thus for every datum in x, there is a corresponding residual, and to find them, we need to recover the reconstructed_ij array. We can do this column-by-column, operating on each x_i with a change of basis operator to produce each reconstructed_i, each of which can be viewed as coordinates in a proper subspace of the original or raw basis. The analysis describes a modified Gram-Schmidt approach to compute the change of basis operator we need, which ensures this proper subspace's basis is an orthogonal set.
What we are going to do in the approach is take the eigenvectors corresponding to the three largest eigenvalues, and transform them into three mutually orthogonal vectors, u, v, w (named so that they don't overwrite the data array x). There are plenty of discussions and questions on the web geared toward developing the Gram-Schmidt process for all sorts of practical applications, but for simplicity let's follow the analysis by hand, keeping in mind that numpy stores the eigenvectors as the columns of vects:
top = numpy.argsort(vals)[::-1][:3]  # indices of the three largest eigenvalues
p1, p2, p3 = vects[:, top].transpose()  # eigenvectors are the columns of vects
u = p1
uu = numpy.dot(u, u)
v = p2 - (numpy.dot(u, p2) / uu) * u
vv = numpy.dot(v, v)
w = p3 - (numpy.dot(u, p3) / uu) * u - (numpy.dot(v, p3) / vv) * v
It's reasonable to implement normalization before or after this step, which should be informed by the data of course. The projection used below assumes unit-length basis vectors, so they are normalized when the basis is assembled.
Now with the raw data, we implicitly made the assumption that the basis is standard, so we need a map between {e1, e2, ..., e20} and {u,v,w}, which is given by
ch_of_basis = numpy.array([b / numpy.linalg.norm(b) for b in (u, v, w)]).transpose()  # shape (20, 3)
This can be used to compute each reconstructed_i, like this:
reconstructed = []
for measurement in x.transpose():
    coords = numpy.dot(ch_of_basis.transpose(), measurement)  # coordinates in the {u, v, w} basis
    reconstructed.append(numpy.dot(ch_of_basis, coords))  # mapped back to the 20 original coordinates
reconstructed = numpy.array(reconstructed).transpose()
And then you get the residuals by subtraction:
residuals = x - reconstructed
This flow obviously might need further tuning, but it's the gist of how to compute all the residuals. To get that periodic bar plot, take the average of each row in residuals.

Are there any machine learning regression algorithms that can train on ordinal data?

I have a function f(x): R^n --> R (sorry, is there a way to do LaTeX here?), and I want to build a machine learning algorithm that estimates f(x) for any input point x, based on a bunch of sample xs in a training data set. If I know the value of f(x) for every x in the training data, this should be simple - just do a regression, or take the weighted average of nearby points, or whatever.
However, this isn't what my training data looks like. Rather, I have a bunch of pairs of points (x, y), and I know the value of f(x) - f(y) for each pair, but I don't know the absolute values of f(x) for any particular x. It seems like there ought to be a way to use this data to find an approximation to f(x), but I haven't found anything after some Googling; there are papers like this but they seem to assume that the training data comes in the form of a set of discrete labels for each entity, rather than having labels over pairs of entities.
This is just making something up, but could I try kernel density estimation over f'(x), and then do integration to get f(x)? Or is that crazy, or is there a known better technique?
You could assume that f is linear, which would simplify things - if f is linear we know that:
f(x-y) = f(x) - f(y)
For example, suppose you assume f(x) = <w, x>, making w the parameter you want to learn. What would the squared loss for a sample pair (x,y) with known difference d look like?
loss((x,y), d) = (f(x)-f(y) - d)^2
= (<w,x> - <w,y> - d)^2
= (<w, x-y> - d)^2
= (<w, z> - d)^2 // where z:=x-y
which is simply the squared loss of a linear regression on input z=x-y with target d.
Practically, you would need to construct z=x-y for each pair and then learn f using linear regression over inputs z and outputs d.
This model might be too weak for your needs, but it's probably the first thing you should try. Otherwise, as soon as you step away from the linearity assumption, you'd likely arrive at a difficult non-convex optimization problem.
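A minimal sketch of that recipe with NumPy least squares follows. The synthetic data, the weight vector `w_true` and the constant offset are assumptions made up for the demonstration. No intercept is fitted, because any constant in f cancels in f(x) - f(y).

```python
import numpy as np

rng = np.random.default_rng(2)
n_features, n_pairs = 4, 500
w_true = np.array([1.5, -2.0, 0.3, 0.7])  # hypothetical ground truth, for the demo only


def f(points):
    """Linear function with a constant offset that the differences cannot reveal."""
    return points @ w_true + 10.0


xs = rng.normal(size=(n_pairs, n_features))
ys = rng.normal(size=(n_pairs, n_features))
d = f(xs) - f(ys) + 0.01 * rng.normal(size=n_pairs)  # noisy known differences

z = xs - ys                                  # one regression input per pair
w_hat, *_ = np.linalg.lstsq(z, d, rcond=None)
print(w_hat)
```

The recovered `w_hat` should be close to `w_true`, while the offset of 10.0 leaves no trace in the fit, so f is only determined up to an additive constant.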
I don't see a way to get absolute results. Any constant in your function (f(x) = g(x) + c) will disappear, in the same way constants disappear in an integral.

Intuition about the kernel trick in machine learning

I have successfully implemented a kernel perceptron classifier that uses an RBF kernel. I understand that the kernel trick maps features to a higher dimension so that a linear hyperplane can be constructed to separate the points. For example, if you have features (x1,x2) and map them to a 3-dimensional feature space you might get: K(x1,x2) = (x1^2, sqrt(2)*x1*x2, x2^2).
If you plug that into the perceptron decision function w'x+b = 0, you end up with: w1'x1^2 + w2'sqrt(2)*x1*x2 + w3'x2^2 + b = 0, which gives you a circular decision boundary.
While the kernel trick itself is very intuitive, I am not able to understand the linear algebra aspect of this. Can someone help me understand how we are able to map all of these additional features without explicitly specifying them, using just the inner product?
Thanks!
Simple.
Give me the numeric result of (x+y)^10 for some values of x and y.
What would you rather do, "cheat" and sum x+y and then take that value to the 10'th power, or expand out the exact results writing out
x^10+10 x^9 y+45 x^8 y^2+120 x^7 y^3+210 x^6 y^4+252 x^5 y^5+210 x^4 y^6+120 x^3 y^7+45 x^2 y^8+10 x y^9+y^10
And then compute each term and then add them together? Clearly we can evaluate the dot product between degree 10 polynomials without explicitly forming them.
Valid kernels are dot products where we can "cheat" and compute the numeric result between two points without having to form their explicit feature values. There are many such possible kernels, though only a few are widely used in papers and in practice.
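The same "cheat" can be seen on a small case. The sketch below assumes the standard degree-2 feature map phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2), whose inner product equals the square of the ordinary dot product; the names `phi` and `poly_kernel` and the two sample points are chosen for illustration.

```python
import numpy as np


def phi(p):
    """Explicit degree-2 feature map for a 2-D point (x1, x2)."""
    x1, x2 = p
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])


def poly_kernel(p, q):
    """The same inner product, computed directly in the original 2-D space."""
    return np.dot(p, q) ** 2


a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Both routes give the same number (up to rounding); only the first forms the features.
print(np.dot(phi(a), phi(b)), poly_kernel(a, b))
```

For a degree-10 kernel or an RBF kernel the explicit map becomes very large or infinite-dimensional, while the kernel evaluation stays a single dot product followed by a scalar function.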
I'm not sure if I'm answering your question, but as I remember the "trick" is that you don't explicitly calculate the transformed coordinates. The perceptron calculates a straight line that separates the clusters. To get curved lines or even circles, instead of changing the perceptron you can change the space that contains the clusters. This is done by using a transformation, usually called phi, that transforms coordinates from one space to another. The perceptron algorithm is then applied in the new space where it produces a straight line, but when that line is transformed back to the original space it can be curved.
The trick is that the perceptron only needs to know the inner product of the points of the clusters it is trying to separate. This means that we only need to be able to calculate the inner product of the transformed points. This is what the kernel does K(x,y) = <phi(x), phi(y)> where < . , . > is the inner product in the new space. This means that there is no need to do all the transformations to the new space and back, we don't even need to explicitly know what the transformation phi() is. All that is needed is that K defines an inner product in some space and hope that this inner product and space is useful for separating our clusters.
I think that there was some theorem that says that if the space represented by the kernel has higher dimensionality than the original space it is likely that it will separate the clusters better.
There is really not much to it
The weight in the higher space is
w = sum_i{a_i * Phi(x_i)}
and the input vector in the higher space
Phi(x)
so that the linear classification in the higher space is
w^t * input + c > 0
so if you put these together
(sum_i{a_i * Phi(x_i)})^t * Phi(x) + c = sum_i{a_i * Phi(x_i)^t * Phi(x)} + c > 0
the last dot product's computational complexity is linear in the number of dimensions of the higher space (often intractable, or not wanted)
We solve this by going over to the kernel "magic answer to the dot product"
K(x_i, x) = Phi(x_i)^t * Phi(x)
which gives
sum_i{a_i * K(x_i, x)} + c > 0
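To make the final decision rule concrete, here is a small kernel perceptron sketch on four XOR-like points. It is an illustration under assumptions that are not part of the answer: an inhomogeneous degree-2 polynomial kernel is used, its constant term stands in for the bias c, and a_i is taken to be the mistake count of x_i times its label.

```python
import numpy as np


def kernel(p, q):
    """Inhomogeneous degree-2 polynomial kernel; the '+ 1' plays the role of the bias c."""
    return (1.0 + np.dot(p, q)) ** 2


X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])  # XOR-like points
y = np.array([1.0, 1.0, -1.0, -1.0])

K = np.array([[kernel(p, q) for q in X] for p in X])  # Gram matrix of kernel values
alpha = np.zeros(len(X))  # mistake counts; a_i in the text corresponds to alpha[i] * y[i]

for _ in range(10):  # a few passes are enough for this tiny data set
    for i in range(len(X)):
        if y[i] * np.dot(alpha * y, K[:, i]) <= 0:  # point i is misclassified
            alpha[i] += 1.0


def decide(x):
    """Evaluate sign(sum_i a_i * K(x_i, x)) without ever forming Phi(x)."""
    return np.sign(sum(a_i * kernel(x_i, x) for a_i, x_i in zip(alpha * y, X)))


print([decide(p) for p in X])
```

These points are not linearly separable in the original plane, yet the classifier only ever touches kernel values, never the higher-dimensional features.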

Compare Plots in matlab

I have two plots in matlab in which I have plotted x and y coordinates. Given these two plots, is it possible to compare whether the plots match? Can I obtain numbers to tell how well they match?
Note that the graphs could possibly be right/left/up/down shifted in plot (turning axis off is not problem), scaled/rotated (I would also like to know if it is skewed, but for now, it is not a must).
It will not need to test color elements, color inversion and any other complicated graphic properties than basic ones mentioned above.
If matlab is not enough, I would welcome other tools.
Note that I cannot simply take the absolute difference of x- and y- values. I could obtain x-absolute difference average and y-absolute difference and then average but I need a combined error. I need to compare the graph.
Graphs to be compared.
EDIT
Direct Correlation does not work for me.
For a different set of data I got a correlation of .94. This is very high for the given data, considering that one data set fluctuates less and faster than the other.
You can access the plotted data with this code
x = 10:100;
y = log10(x);
plot(x,y);
h = gcf;
axesObjs = get(h, 'Children'); %axes handles
dataObjs = get(axesObjs, 'Children'); %handles to low-level graphics objects in axes
objTypes = get(dataObjs, 'Type'); %type of low-level graphics object
xdata = get(dataObjs, 'XData'); %data from low-level graphics objects
ydata = get(dataObjs, 'YData');
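Then you can compute a correlation between xdata and ydata, for example, or do any other kind of comparison. corrcoef returns a 2x2 matrix; its off-diagonal entry R(1,2) is the correlation coefficient, between -1 and 1, which gives a rough measure of how well the two match.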
[R,P] = corrcoef(xdata, ydata);
You would also be interested in comparing the axes limits in the graphical current axes. For example
R = ( diff(get(h_ax1,'XLim')) / diff(get(h_ax2,'XLim')) ) + ...
( diff(get(h_ax1,'YLim')) / diff(get(h_ax2,'YLim')) )
where h_ax1 is the handle of the first axes and h_ax2 that of the second one. Here, you will have a comparison between values of (XLim + YLim). The possible comparisons with different gca properties are really vast though.
EDIT
To compare two sets of points, you may use metrics other than an analytical relationship. I am thinking of distances or convergences such as the Hausdorff distance. A script is available here in matlab central. I used such a distance to compare letter shapes. In the wikipedia page, the 'Applications' section is of importance (edge detector for thick shapes, but it may not be pertinent to your particular problem).
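For readers who want to try the idea outside MATLAB, here is a minimal NumPy sketch of the symmetric Hausdorff distance between two point sets. It is not the Matlab Central script mentioned above. The helper names `hausdorff` and `centered`, and the shifted test curve, are made up for the example. Centering only removes a pure translation; scaling or rotation would need extra normalization.

```python
import numpy as np


def hausdorff(a, b):
    """Symmetric Hausdorff distance between point sets a (n x 2) and b (m x 2)."""
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # n x m pairwise distances
    return max(dists.min(axis=1).max(), dists.min(axis=0).max())


def centered(points):
    """Remove a pure translation by moving the centroid to the origin."""
    return points - points.mean(axis=0)


x = np.arange(10, 101, dtype=float)
curve_1 = np.column_stack([x, np.log10(x)])
curve_2 = curve_1 + np.array([5.0, 0.3])  # same shape, shifted right and up

print(hausdorff(curve_1, curve_2))                      # penalizes the shift
print(hausdorff(centered(curve_1), centered(curve_2)))  # shift removed before comparing
```

A value near zero after centering indicates that the two curves have the same shape up to translation; larger values give a single combined error in the units of the plot.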
