How to align two Point Clouds given the set of points and point-to-point correspondence?

How to align two Point Clouds given the set of points and point-to-point correspondence? - alignment

Suppose I have two pointclouds [x1, x2, x3...] and [y1, y2, y3...]. These two pointclouds should be as close as possible. There are a lot of algorithms and deep learning techniques for the pointcloud registration problems. But I have the extra information that: points x1 and y1 should be aligned, x2 and y2 should be aligned, and so on.
So the order of the points in both point clouds is the same. How can I use this to properly get the transformation matrix to align these two-point clouds?
Note: These two points clouds are not exactly the same. Actually, I had ground truth point cloud [x1,x2,x3...] and I tried to reconstruct another point cloud as [y1,y2,y3...]. Now I want to match them and visualize them if the reconstruction is good or not.

The problem you are facing is an overdetermined system of equations, which is solvable with a closed-form expression. No need for iterative methods like ICP, since you have the correspondence between points.
If you're looking for a rigid transform (that allows scaling, rotation and translation but no shearing), you want Umeyama's algorithm [3], which is a closed form as well, there is a Python implementation here: https://gist.github.com/nh2/bc4e2981b0e213fefd4aaa33edfb3893
If you are looking for an affine transform between your point clouds, i.e a linear transform A (that allows shearing, see [2]) as well as a translation t (which is not linear):
Then, each of your points must satisfy the equation:
y = Ax + t.
Here we assume the following shapes: y:(d,n), A:(d,d), x:(d,n), t:(d,1) if each cloud has n points in R^d.
You can also write it in homogeneous notation, by adding an extra coordinate, see [1]. This results in a linear system y=Mx, and a lot (assuming n>d) of pairs (x,y) that satisfy this equation (i.e. an overdetermined system).
You can therefore solve this using a closed-form least square method:
# Inputs:
# - P, a (n,dim) [or (dim,n)] matrix, a point cloud of n points in dim dimension.
# - Q, a (n,dim) [or (dim,n)] matrix, a point cloud of n points in dim dimension.
# P and Q must be of the same shape.
# This function returns :
# - Pt, the P point cloud, transformed to fit to Q
# - (T,t) the affine transform
def affine_registration(P, Q):
transposed = False
if P.shape[0] < P.shape[1]:
transposed = True
P = P.T
Q = Q.T
(n, dim) = P.shape
# Compute least squares
p, res, rnk, s = scipy.linalg.lstsq(np.hstack((P, np.ones([n, 1]))), Q)
# Get translation
t = p[-1].T
# Get transform matrix
T = p[:-1].T
# Compute transformed pointcloud
Pt = P#T.T + t
if transposed: Pt = Pt.T
return Pt, (T, t)
Opencv has a function called getAffineTransform(), however it only takes 3 pairs of points as input. https://theailearner.com/tag/cv2-getaffinetransform/. This won't be robust for your case (if e.g. you give it the first 3 pairs of points).
References:
[1] https://web.cse.ohio-state.edu/~shen.94/681/Site/Slides_files/transformation_review.pdf#page=24
[2] https://docs.opencv.org/3.4/d4/d61/tutorial_warp_affine.html
[3] https://stackoverflow.com/a/32244818/4195725

As another user already mentioned, the ICP algorithm (implementation in PCL can be found here) can be used to register two point clouds to each other. However this only works locally, so the clouds have to be aligned first.
I don't think there is a global registration in PCL at the moment, but I've used OpenGR which has a PCL wrapper.
If you know for sure that x1 is near y1, x2 is near y2 etc. you can do a manual alignment which will be a lot faster than global alignment:
Translate 2nd cloud by vector y1-x1
Rotate vector y2-y1 into vector x2-x1
Then refine it using ICP.
This does not account for measurement errors, so using the matrix estimation above will be useful if your data is not 100% correct.

VTK's vtkLandmarkTransform also does the same thing, with support for RigidBody/Similarity/Affine transformation:
// need at least four pairs of points in sourcePoint and targetPoints,
// can pick more, but probably not too many
vtkLandmarkTransform landmarkTransform = new vtkLandmarkTransform();
landmarkTransform.SetSourceLandmarks(sourcePoints); // source is to be transformed to match the target
landmarkTransform.SetTargetLandmarks(targetPoints); // target stays still
landmarkTransform.SetMode(VTK_Modes.AFFINE);
landmarkTransform.Modified(); // do the calculation
landmarkTransform.GetMatrix(mtx);
// now you can apply the mtx to all points

Related

Composing multiple Essential matrices

I have computed the Essential matrices between Frame [a,b], [b,c] and [c,d]. I have now E_ab, E_bc and E_cd. Is it possible to compute E_ad directly without matching? I am thinking of sth like 3D transformation composing where:
T04 = T01 * T12 * T23 * T34
The solution that I found is to decompose each one it into R|T and construct T_4*4 from R|T and then use the previouse notation of composing T matrices and then convert the final R|T to E. However, I was wondering if there is a direct way without decomposing for two reasons:
Decomposing and re composing (at least in OpenCV) changes the essential matrix a bit
Performance sake
So, How can I combine two (or more) essential matrices directly?

Before answering the original question, I think it might be important to discuss this point from your comment:
Ultimate goal is Bundle Adjustment on the Essential Matrices directly instead of on RT. I am using Single Calibrated Camera
It might be that I'm missing critical details about your project, but there seems to be a few issues with that method:
Minimal representation. Usually, it's better (for both performance and accuracy reasons) to use minimal parametrisations in your estimations. The Essential matrix has six degrees of freedom (which can be seen from its decomposition as [T]_xR). If you estimate your Essential directly, you're optimizing eight parameters instead of six.
Less constrained problem. The Essential matrix imposes coplanarity constraints and doesn't penalize some wrong configurations that would be penalized by reprojection errors that are used in usual bundle adjustment. If you consider two cameras C_1 and C_2 such that a point X_2 expressed in C_2 can be expressed in C_1 via RX_2+T, and that we denote the corresponding point in C_1 as X_1, then the constraint X_1.transpose()*E*X_2=0 imposes that the three vectors T, X_1 and RX_2+T remain coplanar. So, the point X_2 is free to move as long as it reprojects to the epiploar line, even if its ray diverges from X_1:
As you can see in the figure, the red and green points would respect the Essential matrix constraint, however they would have different reprojection errors.
Time complexity. Computing Essentials is really slow compared to alternatives, and so is the recovery of rotation and translations from Essentials, compared to the inverse problem.
Now to come back to the initial problem from the question, if you consider the three cameras C_1, C_2, C_3 and respectively denote their poses as Identity, R_1, R_2, and note the Essentials between C_1 and C_2 as E_1 and the one between C_2 and C_3 as E_2, then you can easily see that the Essential from C_3 to C_1, that I'll note E_3, will be given by
E_3=[R_1T_2+T_1]_x R_1R_2
which with a little bit of reorganization leads to
E_3=R_1E_2 + E_1R_2
This relation which establishes the link between the three Essential matrices seems to indicate that you still will need to extract R_1 and R_2 from E_1 and E_2.
Edit: I derived the relationship above like this: assume y is a 3x1 vector, then
E_3y = [R_1T_2+T_1]_x R_1R_2y
= (R_1T_2+T_1) x (R_1R_2y)
= R_1T_2 x R_1R_2y + T_1 x R_1R_2y #for the next line, note that crossproduct is invariant under rotation
= R_1(T_2 x R_2 y) + T_1 x R_1R_2 y
= R_1[T_2]_x R_2 y + [T_1]x_x R_1 R_2 y
= R_1 E_2 y + E_1 R_2 y
= (R_1E_2 + E_1R_2) y
Since this holds for any arbitrary y, this means that E_3=R_1E_2+E_1R_2.

Trying to do PCA analysis on interest rate swaps data (multivariate time series)

I have a data set with 20 non-overlapping different swap rates (spot1y, 1y1y, 2y1y, 3y1y, 4y1y, 5y2y, 7y3y, 10y2y, 12y3y...) over the past year.
I want to use PCA / multiregression and look at residuals in order to determine which sectors on the curve are cheap/rich. Has anyone had experience with this? I've done PCA but not for time series. I'd ideally like to model something similar to the first figure here but in USD.
https://plus.credit-suisse.com/rpc4/ravDocView?docid=kv66a7
Thanks!

Here are some broad strokes that can help answer your question. Also, that's a neat analysis from CS :)
Let's be pythonistas and use NumPy. You can imagine your dataset as a 20x261 array of floats. The first place to start is creating the array. Suppose you have a CSV file storing the raw data persistently. Then a reasonable first step to load the data would be something as simple as:
import numpy
x = numpy.loadtxt("path/to/my/file")
The object x is our raw time series matrix, and we verify the truthness of x.shape == (20, 261). The next step is to transform this array into it's covariance matrix. Whether it has been done on the raw data already, or it still has to be done, the first step is centering each time series on it's mean, like this:
x_centered = x - x.mean(axis=1, keepdims=True)
The purpose of this step is to help simplify any necessary rescaling, and is a very good habit that usually shouldn't be skipped. The call to x.mean uses the parameters axis and keepdims to make sure each row (e.g. the time series for spot1yr, ...) is centered with it's mean value.
The next steps are to square and scale x to produce a swap rate covariance array. With 2-dimensional arrays like x, there are two ways to square it-- one that leads to a 261x261 array and another that leads to a 20x20 array. It's the second array we are interested in, and the squaring procedure that will work for our purposes is:
x_centered_squared = numpy.matmul(x_centered, x_centered.transpose())
Then, to scale one can chose between 1/261 or 1/(261-1) depending on the statistical context, which looks like this:
x_covariance = x_centered_squared * (1/261)
The array x_covariance has an entry for how each swap rate changes with itself, and changes with any one of the other swap rates. In linear-algebraic terms, it is a symmetric operator that characterizes the spread of each swap rate.
Linear algebra also tells us that this array can be decomposed into it's associated eigen-spectrum, with elements in this spectrum being scalar-vector pairs, or eigenvalue-eigenvector pairs. In the analysis you shared, x_covariance's eigenvalues are plotted in exhibit two as percent variance explained. To produce the data for a plot like exhibit two (which you will always want to furnish to the readers of your PCA), you simply divide each eigenvalue by the sum of all of them, then multiply each by 100.0. Due to the convenient properties of x_covariance, a suitable way to compute it's spectrum is like this:
vals, vects = numpy.linalg.eig(x_covariance)
We are now in a position to talk about residuals! Here is their definition (with our namespace): residuals_ij = x_ij − reconstructed_ij; i = 1:20; j = 1:261. Thus for every datum in x, there is a corresponding residual, and to find them, we need to recover the reconstructed_ij array. We can do this column-by-column, operating on each x_i with a change of basis operator to produce each reconstructed_i, each of which can be viewed as coordinates in a proper subspace of the original or raw basis. The analysis describes a modified Gram-Schmidt approach to compute the change of basis operator we need, which ensures this proper subspace's basis is an orthogonal set.
What we are going to do in the approach is take the eigenvectors corresponding to the three largest eigenvalues, and transform them into three mutually orthogonal vectors, x, y, z. Research the web for active discussions and questions geared toward developing the Gram-Schmidt process for all sorts of practical applications, but for simplicity let's follow the analysis by hand:
x = vects[0] - sum([])
xx = numpy.dot(x, x)
y = vects[1] - sum(
(numpy.dot(x, vects[1]) / xx) * x
)
yy = numpy.dot(y, y)
z = vects[2] - sum(
(numpy.dot(x, vects[2]) / xx) * x,
(numpy.dot(y, vects[2]) / yy) * y
)
It's reasonable to implement normalization before or after this step, which should be informed by the data of course.
Now with the raw data, we implicitly made the assumption that the basis is standard, we need a map between {e1, e2, ..., e20} and {x,y,z}, which is given by
ch_of_basis = numpy.array([x,y,z]).transpose()
This can be used to compute each reconstructed_i, like this:
reconstructed = []
for measurement in x.transpose().tolist():
reconstructed.append(numpy.dot(ch_of_basis, measurement))
reconstructed = numpy.array(reconstructed).transpose()
And then you get the residuals by subtraction:
residuals = x - reconstructed
This flow obviously might need further tuning, but it's the gist of how to do compute all the residuals. To get that periodic bar plot, take the average of each row in residuals.

Batch Normalization in Convolutional Neural Network

I am newbie in convolutional neural networks and just have idea about feature maps and how convolution is done on images to extract features. I would be glad to know some details on applying batch normalisation in CNN.
I read this paper https://arxiv.org/pdf/1502.03167v3.pdf and could understand the BN algorithm applied on a data but in the end they mentioned that a slight modification is required when applied to CNN:
For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini- batch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations – so for a mini-batch of size m and feature maps of size p × q, we use the effec- tive mini-batch of size m′ = |B| = m · pq. We learn a pair of parameters γ(k) and β(k) per feature map, rather than per activation. Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
I am total confused when they say
"so that different elements of the same feature map, at different locations, are normalized in the same way"
I know what feature maps mean and different elements are the weights in every feature map. But I could not understand what location or spatial location means.
I could not understand the below sentence at all
"In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations"
I would be glad if someone cold elaborate and explain me in much simpler terms

Let's start with the terms. Remember that the output of the convolutional layer is a 4-rank tensor [B, H, W, C], where B is the batch size, (H, W) is the feature map size, C is the number of channels. An index (x, y) where 0 <= x < H and 0 <= y < W is a spatial location.
Usual batchnorm
Now, here's how the batchnorm is applied in a usual way (in pseudo-code):
# t is the incoming tensor of shape [B, H, W, C]
# mean and stddev are computed along 0 axis and have shape [H, W, C]
mean = mean(t, axis=0)
stddev = stddev(t, axis=0)
for i in 0..B-1:
out[i,:,:,:] = norm(t[i,:,:,:], mean, stddev)
Basically, it computes H*W*C means and H*W*C standard deviations across B elements. You may notice that different elements at different spatial locations have their own mean and variance and gather only B values.
Batchnorm in conv layer
This way is totally possible. But the convolutional layer has a special property: filter weights are shared across the input image (you can read it in detail in this post). That's why it's reasonable to normalize the output in the same way, so that each output value takes the mean and variance of B*H*W values, at different locations.
Here's how the code looks like in this case (again pseudo-code):
# t is still the incoming tensor of shape [B, H, W, C]
# but mean and stddev are computed along (0, 1, 2) axes and have just [C] shape
mean = mean(t, axis=(0, 1, 2))
stddev = stddev(t, axis=(0, 1, 2))
for i in 0..B-1, x in 0..H-1, y in 0..W-1:
out[i,x,y,:] = norm(t[i,x,y,:], mean, stddev)
In total, there are only C means and standard deviations and each one of them is computed over B*H*W values. That's what they mean when they say "effective mini-batch": the difference between the two is only in axis selection (or equivalently "mini-batch selection").

Some clarification on Maxim's answer.
I was puzzled by seeing in Keras that the axis you specify is the channels axis, as it doesn't make sense to normalize over the channels - as every channel in a conv-net is considered a different "feature". I.e. normalizing over all channels is equivalent to normalizing number of bedrooms with size in square feet (multivariate regression example from Andrew's ML course). This is usually not what you want - what you do is normalize every feature by itself. I.e. you normalize the number of bedrooms across all examples to be with mu=0 and std=1, and you normalize the the square feet across all examples to be with mu=0 and std=1.
This is why you want C means and stds, because you want a mean and std per channel/feature.
After checking and testing it myself I realized the issue: there's a bit of a confusion/misconception here. The axis you specify in Keras is actually the axis which is not in the calculations. i.e. you get average over every axis except the one specified by this argument. This is confusing, as it is exactly the opposite behavior of how NumPy works, where the specified axis is the one you do the operation on (e.g. np.mean, np.std, etc.).
I actually built a toy model with only BN, and then calculated the BN manually - took the mean, std across all the 3 first dimensions [m, n_W, n_H] and got n_C results, calculated (X-mu)/std (using broadcasting) and got identical results to the Keras results.
Hope this helps anyone who was confused as I was.

I'm only 70% sure of what I say, so if it does not make sense, please edit or mention it before downvoting.
About location or spatial location: they mean the position of pixels in an image or feature map. A feature map is comparable to a sparse modified version of image where concepts are represented.
About so that different elements of the same feature map, at different locations, are normalized in the same way:
some normalisation algorithms are local, so they are dependent of their close surrounding (location) and not the things far apart in the image. They probably mean that every pixel, regardless of their location, is treated just like the element of a set, independently of it's direct special surrounding.
About In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations: They get a flat list of every values of every training example in the minibatch, and this list combines things whatever their location is on the feature map.

Firstly we need to make it clear that the depth of a kernel is determined by previous feature map's channel num, and the number of kernel in this layer determins the channel num of next feature map (the next layer).
then we should make it clear that each kernel(three dimentional usually) will generate just one channel of feature map in the next layer.
thirdly we should try to accept the idea of each points in the generated feature map (regardless of their position) are generated by the same kernel, by sliding on previous layer. So they could be seen as a distribution generated by this kernel, and they could be seen as samples of a stochastic variable. Then they should be averaged to obtain the mean and then the variance. (it not rigid, only helps to understand)
This is what they say "so that different elements of the same feature map, at different locations, are normalized in the same way"

Why multiplying two matrices need opposite order compared to its visual apply order in transform matrix

I am reading a book that talks about the iOS UIView transform property and noticed this fact in the picture when using CGAffineTransformConcat function, from the documents CGAffineTransformConcat simply multiplying its two parameters but here if you want to get Figure 1-9, your transform matrix parameter need in opposite order
this cause me confused and after a bit searching found wiki said this
In other words, the matrix of the combined transformation A followed by B is simply the product of the individual matrices. Note that the multiplication is done in the opposite order from the English sentence: the matrix of "A followed by B" is BA, not AB.
I am not really understand this, because from my current knowledge the Matrix is not commutative but can associate so if the original coordinate is multiplied from the right everything is explainable like this
let X = original coordinate
first multiply A --> A*X
then B --> B*(A*X)
equal === (B*A)*X
here the final combined matrix BA's order is opposite!
but from the apple documents the original coordinate is multiplied from left
can someone explain this, thanks

Think of it like this: where will you place your vector x (assuming coordinates are represented using columnar vectors) which has to transformed? You place it in the last i.e. BAx . So first B(Ax) and then (B(Ax)). So basically you are applying A first and then B.
In ios, instead of representing coordinates as columnar vectors they are using row vectors. Therefore the change in the order of multiplications.
Note:-If you take transpose on both sides, you can observe that it gives the other representation.
Suppose x is a column vector.
Text book method = BAx
Lets take its transpose
transpose(B*A*x) = transpose(x) * transpose(B*A)
= transpose(x) * transpose(B*A)
= transpose(x) * (transpose(A) * transpose(B))
= transpose(x) * transpose(A) * transpose(B)
rewriting transpose(A) as A' and transpose(B) as B'
= transpose(x) * A' * B'
= ios notation
If we use M in one representation then we must use transpose(M) in the other representation.
See page 5 in this https://www.cs.utexas.edu/~fussell/courses/cs384g/lectures/lecture07-Affine.pdf
Try to relate tx and ty in both representations.
Columnar representation (From wiki)
Row representation (From the question)

Intuition about the kernel trick in machine learning

I have successfully implemented a kernel perceptron classifier, that uses an RBF kernel. I understand that the kernel trick maps features to a higher dimension so that a linear hyperplane can be constructed to separate the points. For example, if you have features (x1,x2) and map it to a 3-dimensional feature space you might get: K(x1,x2) = (x1^2, sqrt(x1)*x2, x2^2).
If you plug that into the perceptron decision function w'x+b = 0, you end up with: w1'x1^2 + w2'sqrt(x1)*x2 + w3'x2^2which gives you a circular decision boundary.
While the kernel trick itself is very intuitive, I am not able to understand the linear algebra aspect of this. Can someone help me understand how we are able to map all of these additional features without explicitly specifying them, using just the inner product?
Thanks!

Simple.
Give me the numeric result of (x+y)^10 for some values of x and y.
What would you rather do, "cheat" and sum x+y and then take that value to the 10'th power, or expand out the exact results writing out
x^10+10 x^9 y+45 x^8 y^2+120 x^7 y^3+210 x^6 y^4+252 x^5 y^5+210 x^4 y^6+120 x^3 y^7+45 x^2 y^8+10 x y^9+y^10
And then compute each term and then add them together? Clearly we can evaluate the dot product between degree 10 polynomials without explicitly forming them.
Valid kernels are dot products where we can "cheat" and compute the numeric result between two points without having to form their explicit feature values. There are many such possible kernels, though only a few have been getting used a lot on papers / practice.

I'm not sure if I'm answering your question, but as I remember the "trick" is that you don't explicitly calculate inner products. The perceptron calculates a straight line that separates the clusters. To get curved lines or even circles, instead of changing the perceptron you can change the space that contains the clusters. This is done by using a transformation usually called phi that transform coordinates to from one space to another. The perceptron algorithm is then applied in the new space where it produces a straight line, but when that line then is transformed back to the original space it can be curved.
The trick is that the perceptron only needs to know the inner product of the points of the clusters it is trying to separate. This means that we only need to be able to calculate the inner product of the transformed points. This is what the kernel does K(x,y) = <phi(x), phi(y)> where < . , . > is the inner product in the new space. This means that there is no need to do all the transformations to the new space and back, we don't even need to explicitly know what the transformation phi() is. All that is needed is that K defines an inner product in some space and hope that this inner product and space is useful for separating our clusters.
I think that there was some theorem that says that if the space represented by the kernel has higher dimensionality than the original space it is likely that it will separate the clusters better.

There is really not much to it
The weight in the higher space is
w = sum_i{a_i^t * Phi(x_i)}
and the input vector in the higher space
Phi(x)
so that the linear classification in the higher space is
w^t * input + c > 0
so if you put these together
sum_i{a_i * Phi(x_i)} * Phi(x) + c = sum_i{a_i * Phi(x_i)^t * Phi(x)} + c > 0
the last dot product's computational complexity is linear to the number of dimensions (often intractable, or not wanted)
We solve this by going over to the kernel "magic answer to the dot product"
K(x_i, x) = Phi(x_i)^t * Phi(x)
which gives
sum_i{a_i * K(x_i, x)} + c > 0

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart