Why multiplying two matrices need opposite order compared to its visual apply order in transform matrix

Why multiplying two matrices need opposite order compared to its visual apply order in transform matrix - ios

I am reading a book that talks about the iOS UIView transform property and noticed this fact in the picture when using CGAffineTransformConcat function, from the documents CGAffineTransformConcat simply multiplying its two parameters but here if you want to get Figure 1-9, your transform matrix parameter need in opposite order
this cause me confused and after a bit searching found wiki said this
In other words, the matrix of the combined transformation A followed by B is simply the product of the individual matrices. Note that the multiplication is done in the opposite order from the English sentence: the matrix of "A followed by B" is BA, not AB.
I am not really understand this, because from my current knowledge the Matrix is not commutative but can associate so if the original coordinate is multiplied from the right everything is explainable like this
let X = original coordinate
first multiply A --> A*X
then B --> B*(A*X)
equal === (B*A)*X
here the final combined matrix BA's order is opposite!
but from the apple documents the original coordinate is multiplied from left
can someone explain this, thanks

Think of it like this: where will you place your vector x (assuming coordinates are represented using columnar vectors) which has to transformed? You place it in the last i.e. BAx . So first B(Ax) and then (B(Ax)). So basically you are applying A first and then B.
In ios, instead of representing coordinates as columnar vectors they are using row vectors. Therefore the change in the order of multiplications.
Note:-If you take transpose on both sides, you can observe that it gives the other representation.
Suppose x is a column vector.
Text book method = BAx
Lets take its transpose
transpose(B*A*x) = transpose(x) * transpose(B*A)
= transpose(x) * transpose(B*A)
= transpose(x) * (transpose(A) * transpose(B))
= transpose(x) * transpose(A) * transpose(B)
rewriting transpose(A) as A' and transpose(B) as B'
= transpose(x) * A' * B'
= ios notation
If we use M in one representation then we must use transpose(M) in the other representation.
See page 5 in this https://www.cs.utexas.edu/~fussell/courses/cs384g/lectures/lecture07-Affine.pdf
Try to relate tx and ty in both representations.
Columnar representation (From wiki)
Row representation (From the question)

Related

How to align two Point Clouds given the set of points and point-to-point correspondence?

Suppose I have two pointclouds [x1, x2, x3...] and [y1, y2, y3...]. These two pointclouds should be as close as possible. There are a lot of algorithms and deep learning techniques for the pointcloud registration problems. But I have the extra information that: points x1 and y1 should be aligned, x2 and y2 should be aligned, and so on.
So the order of the points in both point clouds is the same. How can I use this to properly get the transformation matrix to align these two-point clouds?
Note: These two points clouds are not exactly the same. Actually, I had ground truth point cloud [x1,x2,x3...] and I tried to reconstruct another point cloud as [y1,y2,y3...]. Now I want to match them and visualize them if the reconstruction is good or not.

The problem you are facing is an overdetermined system of equations, which is solvable with a closed-form expression. No need for iterative methods like ICP, since you have the correspondence between points.
If you're looking for a rigid transform (that allows scaling, rotation and translation but no shearing), you want Umeyama's algorithm [3], which is a closed form as well, there is a Python implementation here: https://gist.github.com/nh2/bc4e2981b0e213fefd4aaa33edfb3893
If you are looking for an affine transform between your point clouds, i.e a linear transform A (that allows shearing, see [2]) as well as a translation t (which is not linear):
Then, each of your points must satisfy the equation:
y = Ax + t.
Here we assume the following shapes: y:(d,n), A:(d,d), x:(d,n), t:(d,1) if each cloud has n points in R^d.
You can also write it in homogeneous notation, by adding an extra coordinate, see [1]. This results in a linear system y=Mx, and a lot (assuming n>d) of pairs (x,y) that satisfy this equation (i.e. an overdetermined system).
You can therefore solve this using a closed-form least square method:
# Inputs:
# - P, a (n,dim) [or (dim,n)] matrix, a point cloud of n points in dim dimension.
# - Q, a (n,dim) [or (dim,n)] matrix, a point cloud of n points in dim dimension.
# P and Q must be of the same shape.
# This function returns :
# - Pt, the P point cloud, transformed to fit to Q
# - (T,t) the affine transform
def affine_registration(P, Q):
transposed = False
if P.shape[0] < P.shape[1]:
transposed = True
P = P.T
Q = Q.T
(n, dim) = P.shape
# Compute least squares
p, res, rnk, s = scipy.linalg.lstsq(np.hstack((P, np.ones([n, 1]))), Q)
# Get translation
t = p[-1].T
# Get transform matrix
T = p[:-1].T
# Compute transformed pointcloud
Pt = P#T.T + t
if transposed: Pt = Pt.T
return Pt, (T, t)
Opencv has a function called getAffineTransform(), however it only takes 3 pairs of points as input. https://theailearner.com/tag/cv2-getaffinetransform/. This won't be robust for your case (if e.g. you give it the first 3 pairs of points).
References:
[1] https://web.cse.ohio-state.edu/~shen.94/681/Site/Slides_files/transformation_review.pdf#page=24
[2] https://docs.opencv.org/3.4/d4/d61/tutorial_warp_affine.html
[3] https://stackoverflow.com/a/32244818/4195725

As another user already mentioned, the ICP algorithm (implementation in PCL can be found here) can be used to register two point clouds to each other. However this only works locally, so the clouds have to be aligned first.
I don't think there is a global registration in PCL at the moment, but I've used OpenGR which has a PCL wrapper.
If you know for sure that x1 is near y1, x2 is near y2 etc. you can do a manual alignment which will be a lot faster than global alignment:
Translate 2nd cloud by vector y1-x1
Rotate vector y2-y1 into vector x2-x1
Then refine it using ICP.
This does not account for measurement errors, so using the matrix estimation above will be useful if your data is not 100% correct.

VTK's vtkLandmarkTransform also does the same thing, with support for RigidBody/Similarity/Affine transformation:
// need at least four pairs of points in sourcePoint and targetPoints,
// can pick more, but probably not too many
vtkLandmarkTransform landmarkTransform = new vtkLandmarkTransform();
landmarkTransform.SetSourceLandmarks(sourcePoints); // source is to be transformed to match the target
landmarkTransform.SetTargetLandmarks(targetPoints); // target stays still
landmarkTransform.SetMode(VTK_Modes.AFFINE);
landmarkTransform.Modified(); // do the calculation
landmarkTransform.GetMatrix(mtx);
// now you can apply the mtx to all points

Trying to do PCA analysis on interest rate swaps data (multivariate time series)

I have a data set with 20 non-overlapping different swap rates (spot1y, 1y1y, 2y1y, 3y1y, 4y1y, 5y2y, 7y3y, 10y2y, 12y3y...) over the past year.
I want to use PCA / multiregression and look at residuals in order to determine which sectors on the curve are cheap/rich. Has anyone had experience with this? I've done PCA but not for time series. I'd ideally like to model something similar to the first figure here but in USD.
https://plus.credit-suisse.com/rpc4/ravDocView?docid=kv66a7
Thanks!

Here are some broad strokes that can help answer your question. Also, that's a neat analysis from CS :)
Let's be pythonistas and use NumPy. You can imagine your dataset as a 20x261 array of floats. The first place to start is creating the array. Suppose you have a CSV file storing the raw data persistently. Then a reasonable first step to load the data would be something as simple as:
import numpy
x = numpy.loadtxt("path/to/my/file")
The object x is our raw time series matrix, and we verify the truthness of x.shape == (20, 261). The next step is to transform this array into it's covariance matrix. Whether it has been done on the raw data already, or it still has to be done, the first step is centering each time series on it's mean, like this:
x_centered = x - x.mean(axis=1, keepdims=True)
The purpose of this step is to help simplify any necessary rescaling, and is a very good habit that usually shouldn't be skipped. The call to x.mean uses the parameters axis and keepdims to make sure each row (e.g. the time series for spot1yr, ...) is centered with it's mean value.
The next steps are to square and scale x to produce a swap rate covariance array. With 2-dimensional arrays like x, there are two ways to square it-- one that leads to a 261x261 array and another that leads to a 20x20 array. It's the second array we are interested in, and the squaring procedure that will work for our purposes is:
x_centered_squared = numpy.matmul(x_centered, x_centered.transpose())
Then, to scale one can chose between 1/261 or 1/(261-1) depending on the statistical context, which looks like this:
x_covariance = x_centered_squared * (1/261)
The array x_covariance has an entry for how each swap rate changes with itself, and changes with any one of the other swap rates. In linear-algebraic terms, it is a symmetric operator that characterizes the spread of each swap rate.
Linear algebra also tells us that this array can be decomposed into it's associated eigen-spectrum, with elements in this spectrum being scalar-vector pairs, or eigenvalue-eigenvector pairs. In the analysis you shared, x_covariance's eigenvalues are plotted in exhibit two as percent variance explained. To produce the data for a plot like exhibit two (which you will always want to furnish to the readers of your PCA), you simply divide each eigenvalue by the sum of all of them, then multiply each by 100.0. Due to the convenient properties of x_covariance, a suitable way to compute it's spectrum is like this:
vals, vects = numpy.linalg.eig(x_covariance)
We are now in a position to talk about residuals! Here is their definition (with our namespace): residuals_ij = x_ij − reconstructed_ij; i = 1:20; j = 1:261. Thus for every datum in x, there is a corresponding residual, and to find them, we need to recover the reconstructed_ij array. We can do this column-by-column, operating on each x_i with a change of basis operator to produce each reconstructed_i, each of which can be viewed as coordinates in a proper subspace of the original or raw basis. The analysis describes a modified Gram-Schmidt approach to compute the change of basis operator we need, which ensures this proper subspace's basis is an orthogonal set.
What we are going to do in the approach is take the eigenvectors corresponding to the three largest eigenvalues, and transform them into three mutually orthogonal vectors, x, y, z. Research the web for active discussions and questions geared toward developing the Gram-Schmidt process for all sorts of practical applications, but for simplicity let's follow the analysis by hand:
x = vects[0] - sum([])
xx = numpy.dot(x, x)
y = vects[1] - sum(
(numpy.dot(x, vects[1]) / xx) * x
)
yy = numpy.dot(y, y)
z = vects[2] - sum(
(numpy.dot(x, vects[2]) / xx) * x,
(numpy.dot(y, vects[2]) / yy) * y
)
It's reasonable to implement normalization before or after this step, which should be informed by the data of course.
Now with the raw data, we implicitly made the assumption that the basis is standard, we need a map between {e1, e2, ..., e20} and {x,y,z}, which is given by
ch_of_basis = numpy.array([x,y,z]).transpose()
This can be used to compute each reconstructed_i, like this:
reconstructed = []
for measurement in x.transpose().tolist():
reconstructed.append(numpy.dot(ch_of_basis, measurement))
reconstructed = numpy.array(reconstructed).transpose()
And then you get the residuals by subtraction:
residuals = x - reconstructed
This flow obviously might need further tuning, but it's the gist of how to do compute all the residuals. To get that periodic bar plot, take the average of each row in residuals.

Fourier Transform Scaling the magnitude

I have a set of signals S1, S2, ....,SN, for which I am numerically computing the Fourier transforms F1, F2, ,,,,FN. Where Si's and Fi's are C++ vectors (my computations are in C++).
My Computation objective is as follows:
compute product: F = F1*F2*...*FN
inverse Fourier transform F to get a S.
What I numerically observe is when I am trying to compute the product either I am running into situation where numbers become too small. Or numbers become too big.
I thought of scaling F1 by say a1 so that the overflow or underflow is avoided.
With the scaling my step 1 above will become
F' = (F1/a1)*(F2/a2)*...*(FN/aN) = F'1*F'2*...*F'N
And when I inverse transform my final S' will differ from S by a scale factor. The structural form of the S will not change. By this I mean only the normalization of S is different.
My question is :
Is my rationale correct.
If my rationale is correct then given a C++ vector "Fi" how can I choose a good scale "ai" to scale up or down the Fi's.
Many thanks in advance.

You can change the range of the Fi vector in a new domain, let's say [a, b]. So from your vector Fi, the minimum number will become a and the maximum number will become b. You can scale the vector using this equation:
Fnew[i] = (b-a)/(max-min) * (F[i] - min) + a,
where: min = the minimum value from F
max= the maximum value from F
The only problem now is choosing a and b. I would choose [1,2], because the values are small (so you won't get overflow easily), but not smaller than 0.

Machine Learning: Why xW+b instead of Wx+b?

I started to learn Machine Learning. Now i tried to play around with tensorflow.
Often i see examples like this:
pred = tf.add(tf.mul(X, W), b)
I also saw such a line in a plain numpy implementation. Why is always x*W+b used instead of W*x+b? Is there an advantage if matrices are multiplied in this way? I see that it is possible (if X, W and b are transposed), but i do not see an advantage. In school in the math class we always only used Wx+b.
Thank you very much

This is the reason:
By default w is a vector of weights and in maths a vector is considered a column, not a row.
X is a collection of data. And it is a matrix nxd (where n is the number of data and d the number of features) (upper case X is a matrix n x d and lower case only 1 data 1 x d matrix).
To correctly multiply both and use the correct weight in the correct feature you must use X*w+b:
With X*w you mutliply every feature by its corresponding weight and by adding b you add the bias term on every prediction.
If you multiply w * X you multipy a (1 x d)*(n x d) and it has no sense.

I'm also confused with this. I guess this may be a dimension matter. For a n*m-dimension matrix W and a n-dimension vector x, using xW+b can be easily viewed as that maping a n-dimension feature to a m-dimension feature, i.e., you can easily think W as a n-dimension -> m-dimension operation, where as Wx+b (x must be m-dimension vector now) becomes a m-dimension -> n-dimension operation, which looks less comfortable in my opinion. :D

Intuition about the kernel trick in machine learning

I have successfully implemented a kernel perceptron classifier, that uses an RBF kernel. I understand that the kernel trick maps features to a higher dimension so that a linear hyperplane can be constructed to separate the points. For example, if you have features (x1,x2) and map it to a 3-dimensional feature space you might get: K(x1,x2) = (x1^2, sqrt(x1)*x2, x2^2).
If you plug that into the perceptron decision function w'x+b = 0, you end up with: w1'x1^2 + w2'sqrt(x1)*x2 + w3'x2^2which gives you a circular decision boundary.
While the kernel trick itself is very intuitive, I am not able to understand the linear algebra aspect of this. Can someone help me understand how we are able to map all of these additional features without explicitly specifying them, using just the inner product?
Thanks!

Simple.
Give me the numeric result of (x+y)^10 for some values of x and y.
What would you rather do, "cheat" and sum x+y and then take that value to the 10'th power, or expand out the exact results writing out
x^10+10 x^9 y+45 x^8 y^2+120 x^7 y^3+210 x^6 y^4+252 x^5 y^5+210 x^4 y^6+120 x^3 y^7+45 x^2 y^8+10 x y^9+y^10
And then compute each term and then add them together? Clearly we can evaluate the dot product between degree 10 polynomials without explicitly forming them.
Valid kernels are dot products where we can "cheat" and compute the numeric result between two points without having to form their explicit feature values. There are many such possible kernels, though only a few have been getting used a lot on papers / practice.

I'm not sure if I'm answering your question, but as I remember the "trick" is that you don't explicitly calculate inner products. The perceptron calculates a straight line that separates the clusters. To get curved lines or even circles, instead of changing the perceptron you can change the space that contains the clusters. This is done by using a transformation usually called phi that transform coordinates to from one space to another. The perceptron algorithm is then applied in the new space where it produces a straight line, but when that line then is transformed back to the original space it can be curved.
The trick is that the perceptron only needs to know the inner product of the points of the clusters it is trying to separate. This means that we only need to be able to calculate the inner product of the transformed points. This is what the kernel does K(x,y) = <phi(x), phi(y)> where < . , . > is the inner product in the new space. This means that there is no need to do all the transformations to the new space and back, we don't even need to explicitly know what the transformation phi() is. All that is needed is that K defines an inner product in some space and hope that this inner product and space is useful for separating our clusters.
I think that there was some theorem that says that if the space represented by the kernel has higher dimensionality than the original space it is likely that it will separate the clusters better.

There is really not much to it
The weight in the higher space is
w = sum_i{a_i^t * Phi(x_i)}
and the input vector in the higher space
Phi(x)
so that the linear classification in the higher space is
w^t * input + c > 0
so if you put these together
sum_i{a_i * Phi(x_i)} * Phi(x) + c = sum_i{a_i * Phi(x_i)^t * Phi(x)} + c > 0
the last dot product's computational complexity is linear to the number of dimensions (often intractable, or not wanted)
We solve this by going over to the kernel "magic answer to the dot product"
K(x_i, x) = Phi(x_i)^t * Phi(x)
which gives
sum_i{a_i * K(x_i, x)} + c > 0

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Why multiplying two matrices need opposite order compared to its visual apply order in transform matrix - ios

Related

How to align two Point Clouds given the set of points and point-to-point correspondence?

Trying to do PCA analysis on interest rate swaps data (multivariate time series)

Fourier Transform Scaling the magnitude

Machine Learning: Why xW+b instead of Wx+b?

Intuition about the kernel trick in machine learning

Categories

Resources