This question belongs to the topic - 'Structure from Motion'.
Suppose, there are 3 images. There are point correspondences between image 1-2 and image 2-3, but there's no common point between image 1 and 3. I got the RT (rotation and translation matrix) of image 2, RT12, with respect to image 1(considering image 1 RT as [I|0], that means, rotation is identity, translation is zero). Lets split RT12 into R12 and T12.
Similarly, I got RT23 considering image 2 RT as [I|0]. So, now I have R23 and T23, who are related to image 2, but not image 1. Now I want to find R13 and T13.
For a synthetic dataset, the equation R13=R23*R12 is giving correct R(verified, as I actually have the R13 precalculated). Similary T13 should be T2+T1. But the translation computed this way is bad. Since I have actual results with me, I could verify that Rotation was estimated nicely, but not translation. Any ideas?
This is a simple matrix block-multiplication problem, but you have to remember that you are actually considering 4x4 matrices (associated to rigid transformations in the 3D homogeneous world) and not 3x4 matrices.
Your 3x4 RT matrices actually correspond to the first three rows of 4x4 matrices A, whose last rows are [0, 0, 0, 1]:
RT23 -> A23=[R23, T23; 0, 0, 0, 1]
RT12 -> A12=[R12, T12; 0, 0, 0, 1]
Then, if you do the matrix block-multiplication on the 4x4 matrices (A13 = A23 * A12), you'll quickly find out that:
R13 = R23 * R12
T13 = R23 * T12 + T23
Related
When trying to use torch.nn.functional.affine_grid, it requires a theta affine matrix of size (N x 3 x 4) according to the documentation. I thought a general affine matrix is (N x 4 x 4). What is the supposed affine matrix format in pytorch?
An example of 3D rotation affine input would be ideal. Appreciate your help.
The dimensions you mention are applicable for the case of 3D inputs, that is you wish to apply 3D geometric transforms on the input tensor x of shape bxcxdxhxw.
A transformation to points in 3D (represented as 4-vector in homogeneous coordinates as (x, y, z, 1)) should be, in the general case, a 4x4 matrix as you noted.
However, since we restrict ourselves to homogeneous coordinates, i.e., the fourth coordinate must be 1, the 4th row of the matrix must be (0, 0, 0, 1) (see this).
Therefore, there's no need to explicitly code this last row.
To conclude, a 3D transformation composed of a 3x3 rotation R and 3d translation t is simply the 3x4 matrix:
theta = [R t]
I am building a project that is a basic neural network that takes in a 2x2 image with the goal to classify the image as either a forward slash (1-class) or back slash (0-class) shape. The data for the input is a flat numpy array. 1 represents a black pixel 0 represents a white pixel.
0-class: [1, 0, 0, 1]
1-class: [0, 1, 1, 0]
If I start my filter as a random 4x1 matrix, how can I use gradient descent to come to either perfect matrix [1,-1,-1,1] or [-1,1,1,-1] to classify the datapoints.
Side note: Even when multiplied with the "perfect" answer matrix then summed, the label output would be -2 and 2. Would my data labels need to be -2 and 2? What if I want my classes labeled as 0 and 1?
I'm wondering if I can determine in OpenCV the transformation kinds (rotation, translation, shear, ...) just from a given transformation matrix?
Following up on this article: http://docs.opencv.org/2.4/doc/tutorials/imgproc/imgtrans/warp_affine/warp_affine.html
I will get an 2x3 Transformation matrix. E.g.:
[ 0.85, 0.20, 0;
-0.06, 0.37, 253.44]
I know that the third "column" stands for the translation. So in this case x=0, y=253.44
But is there a way of determine if in the first two columns only a rotation or a rotation and a scaling is applied? So what I mean is get the matrix multiplication somehow reversed?
This isn't possible in general. For example, consider the affine transformation:
cos(theta) sin(theta) 0
-sin(theta) cos(theta) 0
If theta = pi, then this evaluates to
-1 0 0
0 -1 0
Is this affine transformation scaling by -1, or rotating by 180 degrees?
Affine transformations can indeed be written as a composition of multiple transformations, but as matrix multiplication is not commutative, the order of these compositions matter. And that even goes for translation. The resulting transformation matrix is different if you rotate and then translate than if you translate and rotate, because rotation happens around the (0, 0) location in your image.
In short, there is no unique decomposition. If you know certain constraints, like scaling is uniform, or shearing is uniform, or the scaling is only positive, etc, then you might be able to get a unique decomposition. See here for the mathematics.
I'm trying to calculate a new camera position based on the motion of corresponding images.
the images conform to the pinhole camera model.
As a matter of fact, I don't get useful results, so I try to describe my procedure and hope that somebody can help me.
I match the features of the corresponding images with SIFT, match them with OpenCV's FlannBasedMatcher and calculate the fundamental matrix with OpenCV's findFundamentalMat (method RANSAC).
Then I calculate the essential matrix by the camera intrinsic matrix (K):
Mat E = K.t() * F * K;
I decompose the essential matrix to rotation and translation with singular value decomposition:
SVD decomp = SVD(E);
Matx33d W(0,-1,0,
1,0,0,
0,0,1);
Matx33d Wt(0,1,0,
-1,0,0,
0,0,1);
R1 = decomp.u * Mat(W) * decomp.vt;
R2 = decomp.u * Mat(Wt) * decomp.vt;
t1 = decomp.u.col(2); //u3
t2 = -decomp.u.col(2); //u3
Then I try to find the correct solution by triangulation. (this part is from http://www.morethantechnical.com/2012/01/04/simple-triangulation-with-opencv-from-harley-zisserman-w-code/ so I think that should work correct).
The new position is then calculated with:
new_pos = old_pos + -R.t()*t;
where new_pos & old_pos are vectors (3x1), R the rotation matrix (3x3) and t the translation vector (3x1).
Unfortunately I got no useful results, so maybe anyone has an idea what could be wrong.
Here are some results (just in case someone can confirm that any of them is definitely wrong):
F = [8.093827077399547e-07, 1.102681999632987e-06, -0.0007939604310854831;
1.29246107737264e-06, 1.492629957878578e-06, -0.001211264339006535;
-0.001052930954975217, -0.001278667878010564, 1]
K = [150, 0, 300;
0, 150, 400;
0, 0, 1]
E = [0.01821111092414898, 0.02481034499174221, -0.01651092283654529;
0.02908037424088439, 0.03358417405226801, -0.03397110489649674;
-0.04396975675562629, -0.05262169424538553, 0.04904210357279387]
t = [0.2970648246214448; 0.7352053067682792; 0.6092828956013705]
R = [0.2048034356172475, 0.4709818957303019, -0.858039396912323;
-0.8690270040802598, -0.3158728880490416, -0.3808101689488421;
-0.4503860776474556, 0.8236506374002566, 0.3446041331317597]
First of all you should check if
x' * F * x = 0
for your point correspondences x' and x. This should be of course only the case for the inliers of the fundamental matrix estimation with RANSAC.
Thereafter, you have to transform your point correspondences to normalized image coordinates (NCC) like this
xn = inv(K) * x
xn' = inv(K') * x'
where K' is the intrinsic camera matrix of the second image and x' are the points of the second image. I think in your case it is K = K'.
With these NCCs you can decompose your essential matrix like you described. You triangulate the normalized camera coordinates and check the depth of your triangulated points. But be careful, in literature they say that one point is sufficient to get the correct rotation and translation. From my experience you should check a few points since one point can be an outlier even after RANSAC.
Before you decompose the essential matrix make sure that E=U*diag(1,1,0)*Vt. This condition is required to get correct results for the four possible choices of the projection matrix.
When you've got the correct rotation and translation you can triangulate all your point correspondences (the inliers of the fundamental matrix estimation with RANSAC). Then, you should compute the reprojection error. Firstly, you compute the reprojected position like this
xp = K * P * X
xp' = K' * P' * X
where X is the computed (homogeneous) 3D position. P and P' are the 3x4 projection matrices. The projection matrix P is normally given by the identity. P' = [R, t] is given by the rotation matrix in the first 3 columns and rows and the translation in the fourth column, so that P is a 3x4 matrix. This only works if you transform your 3D position to homogeneous coordinates, i.e. 4x1 vectors instead of 3x1. Then, xp and xp' are also homogeneous coordinates representing your (reprojected) 2D positions of your corresponding points.
I think the
new_pos = old_pos + -R.t()*t;
is incorrect since firstly, you only translate the old_pos and you do not rotate it and secondly, you translate it with a wrong vector. The correct way is given above.
So, after you computed the reprojected points you can calculate the reprojection error. Since you are working with homogeneous coordinates you have to normalize them (xp = xp / xp(2), divide by last coordinate). This is given by
error = (x(0)-xp(0))^2 + (x(1)-xp(1))^2
If the error is large such as 10^2 your intrinsic camera calibration or your rotation/translation are incorrect (perhaps both). Depending on your coordinate system you can try to inverse your projection matrices. On that account you need to transform them to homogeneous coordinates before since you cannot invert a 3x4 matrix (without the pseudo inverse). Thus, add the fourth row [0 0 0 1], compute the inverse and remove the fourth row.
There is one more thing with reprojection error. In general, the reprojection error is the squared distance between your original point correspondence (in each image) and the reprojected position. You can take the square root to get the Euclidean distance between both points.
To update your camera position, you have to update the translation first, then update the rotation matrix.
t_ref += lambda * (R_ref * t);
R_ref = R * R_ref;
where t_ref and R_ref are your camera state, R and t are new calculated camera rotation and translation, and lambda is the scale factor.
I am trying to use FFTW for image convolution.
At first just to test if the system is working properly, I performed the fft, then the inverse fft, and could get the exact same image returned.
Then a small step forward, I used the identity kernel(i.e., kernel[0][0] = 1 whereas all the other components equal 0). I took the component-wise product between the image and kernel(both in the frequency domain), then did the inverse fft. Theoretically I should be able to get the identical image back. But the result I got is very not even close to the original image. I am suspecting this has something to do with where I center my kernel before I fft it into frequency domain(since I put the "1" at kernel[0][0], it basically means that I centered the positive part at the top left). Could anyone enlighten me about what goes wrong here?
For each dimension, the indexes of samples should be from -n/2 ... 0 ... n/2 -1, so if the dimension is odd, center around the middle. If the dimension is even, center so that before the new 0 you have one sample more than after the new 0.
E.g. -4, -3, -2, -1, 0, 1, 2, 3 for a width/height of 8 or -3, -2, -1, 0, 1, 2, 3 for a width/height of 7.
The FFT is relative to the middle, in its scale there are negative points.
In the memory the points are 0...n-1, but the FFT treats them as -ceil(n/2)...floor(n/2), where 0 is -ceil(n/2) and n-1 is floor(n/2)
The identity matrix is a matrix of zeros with 1 in the 0,0 location (the center - according to above numbering). (In the spatial domain.)
In the frequency domain the identity matrix should be a constant (all real values 1 or 1/(N*M) and all imaginary values 0).
If you do not receive this result, then the identify matrix might need padding differently (to the left and down instead of around all sides) - this may depend on the FFT implementation.
Center each dimension separately (this is an index centering, no change in actual memory).
You will probably need to pad the image (after centering) to a whole power of 2 in each dimension (2^n * 2^m where n doesn't have to equal m).
Pad relative to FFT's 0,0 location (to center, not corner) by copying existing pixels into a new larger image, using center-based-indexes in both source and destination images (e.g. (0,0) to (0,0), (0,1) to (0,1), (1,-2) to (1,-2))
Assuming your FFT uses regular floating point cells and not complex cells, the complex image has to be of size 2*ceil(2/n) * 2*ceil(2/m) even if you don't need a whole power of 2 (since it has half the samples, but the samples are complex).
If your image has more than one color channel, you will first have to reshape it, so that the channel are the most significant in the sub-pixel ordering, instead of the least significant. You can reshape and pad in one go to save time and space.
Don't forget the FFTSHIFT after the IFFT. (To swap the quadrants.)
The result of the IFFT is 0...n-1. You have to take pixels floor(n/2)+1..n-1 and move them before 0...floor(n/2).
This is done by copying pixels to a new image, copying floor(n/2)+1 to memory-location 0, floor(n/2)+2 to memory-location 1, ..., n-1 to memory-location floor(n/2), then 0 to memory-location ceil(n/2), 1 to memory-location ceil(n/2)+1, ..., floor(n/2) to memory-location n-1.
When you multiply in the frequency domain, remember that the samples are complex (one cell real then one cell imaginary) so you have to use a complex multiplication.
The result might need dividing by N^2*M^2 where N is the size of n after padding (and likewise for M and m). - You can tell this by (a. looking at the frequency domain's values of the identity matrix, b. comparing result to input.)
I think that your understanding of the Identity kernel may be off. An Identity kernel should have the 1 at the center of the 2D kernal not at the 0, 0 position.
example for a 3 x 3, you have yours setup as follows:
1, 0, 0
0, 0, 0
0, 0, 0
It should be
0, 0, 0
0, 1, 0
0, 0, 0
Check this out also
What is the "do-nothing" convolution kernel
also look here, at the bottom of page 3.
http://www.fmwconcepts.com/imagemagick/digital_image_filtering.pdf
I took the component-wise product between the image and kernel in frequency domain, then did the inverse fft. Theoretically I should be able to get the identical image back.
I don't think that doing a forward transform with a non-fft kernel, and then an inverse fft transform should lead to any expectation of getting the original image back, but perhaps I'm just misunderstanding what you were trying to say there...