How to get scale, rotation & translation after feature tracking? - image-processing

I have implemented a Kanade–Lucas–Tomasi feature tracker and used it on two images that show the same scene, but the camera has moved a bit between taking the pictures.
As a result I get the coordinates of the features. For example:
1st Picture:

| feature | (x,y)=val        |
|---------|------------------|
| 1       | (436,349)=33971  |
| 2       | (440,365)=29648  |
| 3       | ( 36,290)=29562  |

2nd Picture:

| feature | (x,y)=val        |
|---------|------------------|
| 1       | (443.3,356.0)=0  |
| 2       | (447.6,373.0)=0  |
| 3       | ( -1.0, -1.0)=-4 |
So I know the positions of features 1 and 2 in both images, and that feature 3 couldn't be found in the second image. The coordinates of features 1 and 2 aren't the same, because the camera has zoomed in a bit and also moved.
Which algorithm is suitable for recovering the scale, rotation and translation between the two images? Is there a robust algorithm that also handles outliers?

If you don't know what kind of movement happened between the images, then you need to calculate the homography between them. A homography, however, needs at least 4 point correspondences to be calculated.
If you have 4 points in both images that lie roughly on a plane (the same flat surface, e.g. a window), then you can follow the steps described on math.stackexchange to compute the homography matrix that will transform between the images.
Note that while rotation and translation may happen between two images, they could also have been taken from different angles. If that happens, then the homography is your only option. If, instead, the images are for sure related by just rotation and translation (e.g. two satellite images), then you may find some other method, but the homography will also work.
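If the motion really is just scale + rotation + translation (a similarity transform), here is a minimal sketch using OpenCV's Python bindings (3.x/4.x): cv2.estimateAffinePartial2D estimates exactly those 4 degrees of freedom with RANSAC, and cv2.findHomography covers the general planar case. The point arrays below are placeholders for your tracked features.

```python
import numpy as np
import cv2

# Tracked feature coordinates (placeholders); drop features that were lost
# (e.g. the ones reported at (-1, -1)) before estimating the transform.
# In practice you would have many more than 4 correspondences.
pts1 = np.array([[436, 349], [440, 365], [450, 300], [500, 420]], dtype=np.float32)
pts2 = np.array([[443.3, 356.0], [447.6, 373.0], [458.1, 305.2], [509.4, 428.8]], dtype=np.float32)

# 4-DOF similarity transform (scale, rotation, translation), robust to outliers via RANSAC.
M, inliers = cv2.estimateAffinePartial2D(pts1, pts2, method=cv2.RANSAC, ransacReprojThreshold=3.0)
scale = np.sqrt(M[0, 0] ** 2 + M[1, 0] ** 2)      # recover scale from the 2x3 matrix
angle = np.degrees(np.arctan2(M[1, 0], M[0, 0]))  # rotation in degrees
tx, ty = M[0, 2], M[1, 2]                          # translation

# General planar case: a full homography, also estimated with RANSAC.
H, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
```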

Depending on whether the camera is calibrated or uncalibrated, use the tracked features to compute the essential or the fundamental matrix, respectively.
Then factorize the matrix into R and T. Use the Multiple View Geometry book for help with the formulae: https://www.robots.ox.ac.uk/~vgg/hzbook/hzbook1/HZepipolar.pdf
Caution: these steps only work well if the features come from different depth planes and cover a wide field of view. If all features lie on a single plane, you should estimate a homography and factorize that instead.
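As a rough sketch of those steps, assuming a calibrated camera and OpenCV's Python bindings (the intrinsics K are made up, and synthetic points stand in for the KLT tracks):

```python
import numpy as np
import cv2

# Assumed intrinsics of a calibrated camera (focal length and principal point are placeholders).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Synthetic data standing in for the KLT tracks: random 3D points seen by
# two cameras related by a small rotation and translation.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1, 1, 50), rng.uniform(-1, 1, 50), rng.uniform(4, 8, 50)])
R_true, _ = cv2.Rodrigues(np.array([0.0, 0.02, 0.0]))
t_true = np.array([[0.1], [0.0], [0.02]])

def project(P):
    x = (K @ P.T).T
    return (x[:, :2] / x[:, 2:]).astype(np.float32)

pts1 = project(X)
pts2 = project((R_true @ X.T + t_true).T)

# Essential matrix with RANSAC, then decomposition into R and t.
# Note: the translation is recovered only up to scale.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
n_inliers, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
```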

Related

Mean Filter at first position (0,0)

Actually, I am in the middle of implementing adaptive thresholding using the mean. I use a 3x3 matrix, so I calculate the mean value over that matrix and write it into M(1,1), the middle position of the matrix. I got confused about how to perform the process at the first position f(0,0).
Here is a little illustration. Let's assume that I am using a 3x3 matrix (M) over the image (f), centered at the first position, so f(0,0) = M(1,1) = 4. Then M(0,0), M(0,1), M(0,2), M(1,0) and M(2,0) have no value:
-1 | -1 | -1 |
-1 | 4 | 3 |
-1 | 2 | 1 |
Which one is the correct process:
a) ( 4 + 3 + 2 + 1 ) / 4
b) ( 4 + 3 + 2 + 1) / 9
I'm asking because I followed a tutorial on adaptive mean thresholding and it shows a different result, so I need to make sure that the process is correct. Thanks.
There is no "correct" way to solve this issue. There are many different solutions used in practice, they all have some downsides:
Averaging over only the known values (i.e. your suggested (4+3+2+1)/4). By averaging over fewer pixels, one obtains a result that is more sensitive to noise (i.e. the "amount of noise" left in the image after filtering is larger near the borders). Also, a bias is introduced, since the averaging happens over values to one side only.
Assuming 0 outside the image domain (i.e. your suggested (4+3+2+1)/9). Since we don't know what is outside the image, assuming 0 is as good as anything else, no? Well, no, it is not. This leads to a filter result that has darker values around the edges.
Assuming a periodic image. Here one takes values from the opposite side of the image for the unknown values. This effectively happens when computing the convolution through the Fourier domain. But usually images are not periodic, with strong differences in intensities (or colors) at opposite sides of the image, leading to "bleeding" of the colors onto the opposite side of the image.
Extrapolation. Extending image data by extrapolation is a risky business. This basically comes down to predicting what would have been in those pixels had we imaged them. The safest bet is 0-order extrapolation (i.e. replicating the boundary pixel), though higher-order polynomial fits are possible too. The downside is that the pixels at the image edge become more important than other pixels; they will be weighted more heavily in the averaging.
Mirroring. Here the image is reflected at the boundary (imagine placing a mirror at the edge of the image). The value at index -1 is taken to be the value at index 1; at index -2, that at index 2, etc. This has similar downsides to the extrapolation method.
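As a concrete illustration of how these strategies show up in practice, here is a minimal sketch assuming SciPy's ndimage mean filter, whose mode parameter covers most of the options above; the 5x5 test image is made up:

```python
import numpy as np
from scipy import ndimage

img = np.arange(25, dtype=float).reshape(5, 5)  # dummy 5x5 image

# The same 3x3 mean filter with different boundary strategies:
zero_pad = ndimage.uniform_filter(img, size=3, mode='constant', cval=0.0)  # assume 0 outside
periodic = ndimage.uniform_filter(img, size=3, mode='wrap')                # periodic image
replicate = ndimage.uniform_filter(img, size=3, mode='nearest')            # 0-order extrapolation
mirrored = ndimage.uniform_filter(img, size=3, mode='mirror')              # reflect at the boundary

# "Average over known values only" has no single mode flag; one way is to
# divide the zero-padded mean by the fraction of valid pixels under the window:
ones = np.ones_like(img)
valid_fraction = ndimage.uniform_filter(ones, size=3, mode='constant', cval=0.0)
known_only = zero_pad / valid_fraction
```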

PCL Point Feature Histograms - binning

The binning process, which is part of the point feature histogram estimation, results in b^3 bins if only the three angular features (alpha, phi, theta) are used, where b is the number of bins.
Why is it b^3 and not b * 3?
Let's say we consider alpha.
The feature value range is subdivided into b intervals. You iterate over all neighbors of the query point and count the number of alpha values that lie in each interval. So you have b bins for alpha. When you repeat this for the other two features, you get 3 * b bins.
Where am I wrong?
For simplicity, I'll first explain it in 2D, i.e. with two angular features. In that case, you would have b^2 bins, not b*2.
The feature space is divided into a regular grid. Features are binned according to their position in the 2D (or 3D) space, not independently along each dimension. See the following example with two feature dimensions and b=4, where the feature is binned into the cell marked with #:
^ phi
|
+-+-+-+-+
| | | | |
+-+-+-+-+
| | | | |
+-+-+-+-+
| | | |#|
+-+-+-+-+
| | | | |
+-+-+-+-+-> alpha
The feature is binned into the cell where alpha is in a given interval AND phi in another interval. The key difference to your understanding is that the dimensions are not treated independently. Each cell specifies an interval on all the dimensions, not a single one.
(This would work the same way in 3D, only that you would have another dimension for theta and a 3D grid instead of a 2D one.)
This way of binning results in b^2 bins for the 2D case, since each interval in the alpha dimension is combined with ALL intervals in the phi dimension, resulting in a squaring of the number, not a doubling. Add another dimension, and you get the cubing instead of the tripling, as in your question.
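A small sketch of the joint binning idea in plain NumPy (not PCL's actual implementation; the angle arrays and their value ranges are dummy placeholders):

```python
import numpy as np

b = 5                       # number of subdivisions per feature
n = 200                     # number of point pairs in the neighborhood (dummy)
alpha = np.random.uniform(-np.pi, np.pi, n)   # dummy angular features
phi   = np.random.uniform(-1.0, 1.0, n)
theta = np.random.uniform(-np.pi, np.pi, n)

# Joint 3D histogram: each (alpha, phi, theta) triple falls into exactly one
# of the b*b*b cells, so the descriptor has b**3 bins, not 3*b.
hist, _ = np.histogramdd(np.column_stack([alpha, phi, theta]),
                         bins=(b, b, b),
                         range=[(-np.pi, np.pi), (-1.0, 1.0), (-np.pi, np.pi)])
descriptor = hist.ravel()   # flattened b**3-dimensional feature vector
assert descriptor.size == b ** 3
```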

Non-region of interest with Mat Image in OpenCV

I want to get features from the non-region-of-interest area. I know how to define an ROI in Mat format; however, I also need the rest of the area for negative image features. Thanks in advance.
You could use a mask to define any region from which to get features. However, this requires the called function to support a mask.
For example:
void ORB::operator()(InputArray image, InputArray mask, vector<KeyPoint>& keypoints, OutputArray descriptors, bool useProvidedKeypoints=false ) const
mask – The operation mask.
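A minimal sketch of the mask approach, assuming OpenCV's Python bindings (the file name and the ROI rectangle are placeholders):

```python
import numpy as np
import cv2

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)  # placeholder file name
x, y, w, h = 100, 50, 200, 150                        # placeholder ROI rectangle

# Build a mask that is 255 everywhere except inside the ROI,
# so features are detected only in the non-ROI area.
mask = np.full(img.shape, 255, dtype=np.uint8)
mask[y:y + h, x:x + w] = 0

orb = cv2.ORB_create()
keypoints, descriptors = orb.detectAndCompute(img, mask)
```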
If the function does not support a mask, there are two tricks to get features in the non-ROI area:
Get the features of the whole image, then filter out the results that fall inside the ROI manually.
Split the non-ROI area into several ROIs (as shown below), then pass each ROI into the function.
For example:
|-----------------|
| 1 |
|----|-------|----|
| 2 | | 3 |
|----|-------|----|
| 4 |
|-----------------|

Calculate surface area

For a given terrain, how can you calculate its surface area?
As of now, I plan to build the terrain using Three.js with something like:
var geo = new THREE.PlaneGeometry(300, 300, 10, 10);
for (var i = 0; i < geo.vertices.length; i++)
geo.vertices[i].y = someHeight; // Makes the flat plain into a terrain
Next, if it's possible to iterate through each underlying triangle of the geometry (i.e. the triangles of the TRIANGLE_STRIP given to the WebGL array), the areas of the individual triangles could be summed up to get the total surface area.
Does this approach sound right? If so, how do you determine vertices of individual triangles?
Any other ideas to build the terrain in WebGL/Three.js are welcome.
I think your approach sounds right and shouldn't be hard to implement.
I'm not familiar with three.js, but I think it's quite easy to determine the positions of the vertices. You know that the vertices are evenly distributed between x = 0...300, z = 0...300, and you know the y coordinate. So the [i,j]-th vertex has position [i*300/10, y, j*300/10].
You have 10x10 segments in total and each segment consists of 2 triangles. This is where you have to be careful.
The triangles could form two different shapes:
------      ------
|\   |      |   /|
| \  |  or  |  / |
|  \ |      | /  |
------      ------
which could result in different shapes and (I'm not entirely sure about this) different surface areas.
When you find out how exactly three.js creates the surface, it should be relatively easy to iteratively sum the triangle surfaces.
It would be nice to be able to do the sum without actual iteration over all triangles, but, right now, I don't have any idea how to do it...
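As a language-agnostic sketch of the triangle-area summation (written in Python/NumPy with a made-up height field; in three.js you would read the vertex positions from the geometry instead):

```python
import numpy as np

size, segments = 300.0, 10
heights = np.random.rand(segments + 1, segments + 1) * 20.0  # dummy height field

# Vertex positions on a regular (segments+1) x (segments+1) grid.
xs = np.linspace(0.0, size, segments + 1)
zs = np.linspace(0.0, size, segments + 1)

def vertex(i, j):
    return np.array([xs[i], heights[i, j], zs[j]])

def tri_area(a, b, c):
    # Half the magnitude of the cross product of two edge vectors.
    return 0.5 * np.linalg.norm(np.cross(b - a, c - a))

total = 0.0
for i in range(segments):
    for j in range(segments):
        p00, p10 = vertex(i, j), vertex(i + 1, j)
        p01, p11 = vertex(i, j + 1), vertex(i + 1, j + 1)
        # One possible split of the quad into two triangles; the other
        # diagonal gives a (slightly) different total on non-planar quads.
        total += tri_area(p00, p10, p11) + tri_area(p00, p11, p01)

print("approximate surface area:", total)
```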

Calculating camera motion out of corresponding 3d point sets

I am having a little problem. I wrote a program that extracts a set of three-dimensional points in each frame using a camera and depth information. The points are in the camera coordinate system, which means the origin is at the camera center, x is horizontal distance, y vertical distance and z the distance from the camera (along the optical axis). Everything is in meters. I.e. point (2,-1,5) would be two meters right, one meter below and five meters along the optical axis of the camera.
I calculate these points in each time frame and also know the correspondences, i.e. I know which point at time t-1 belongs to which 3D point at time t.
My goal now is to calculate the motion of the camera in each time frame in my world coordinate system (with z pointing up representing the height). I would like to calculate relative motion but also the absolute one starting from some start position to visualize the trajectory of the camera.
This is an example data set of one frame with the current (left) and the previous 3D location (right) of the points in camera coordinates:
-0.174004 0.242901 3.672510 | -0.089167 0.246231 3.646694
-0.265066 -0.079420 3.668801 | -0.182261 -0.075341 3.634996
0.092708 0.459499 3.673029 | 0.179553 0.459284 3.636645
0.593070 0.056592 3.542869 | 0.675082 0.051625 3.509424
0.676054 0.517077 3.585216 | 0.763378 0.511976 3.555986
0.555625 -0.350790 3.496224 | 0.633524 -0.354710 3.465260
1.189281 0.953641 3.556284 | 1.274754 0.938846 3.504309
0.489797 -0.933973 3.435228 | 0.561585 -0.935864 3.404614
Since I would like to work with OpenCV if possible, I found the estimateAffine3D() function in OpenCV 2.3, which takes two 3D point input vectors and calculates the affine transformation between them using RANSAC.
As output I get a 3x4 transformation matrix.
I already tried to make the calculation more accurate by tuning the RANSAC parameters, but a lot of the time the transformation matrix shows a translatory movement that is quite big. As you can see in the sample data, the movement is usually quite small.
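For reference, a minimal sketch of the kind of call described above, using OpenCV's Python bindings (the point rows are taken from the sample data, previous frame on the right mapped to current frame on the left):

```python
import numpy as np
import cv2

# Previous-frame (src) and current-frame (dst) 3D points from the sample data above.
src = np.array([[-0.089167, 0.246231, 3.646694],
                [-0.182261, -0.075341, 3.634996],
                [0.179553, 0.459284, 3.636645],
                [0.675082, 0.051625, 3.509424]], dtype=np.float32)
dst = np.array([[-0.174004, 0.242901, 3.672510],
                [-0.265066, -0.079420, 3.668801],
                [0.092708, 0.459499, 3.673029],
                [0.593070, 0.056592, 3.542869]], dtype=np.float32)

# RANSAC-based affine estimation; returns a 3x4 matrix [R|t] plus an inlier mask.
retval, T, inliers = cv2.estimateAffine3D(src, dst, ransacThreshold=0.01, confidence=0.999)
```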
So I wanted to ask if anybody has another idea on what I could try? Does OpenCV offer other solutions for this?
Also if I have the relative motion of the camera in each timeframe, how would I convert it to world coordinates? Also how would I then get the absolute position starting from a point (0,0,0) so I have the camera position (and direction) for each time frame?
Would be great if anybody could give me some advice!
Thank you!
UPDATE 1:
After Michael Kupchick's nice answer, I tried to check how well the estimateAffine3D() function in OpenCV works. So I created two little test sets of 6 point pairs that contain just a translation, no rotation, and had a look at the resulting transformation matrices:
Test set 1:
1.5 2.1 6.7 | 0.5 1.1 5.7
6.7 4.5 12.4 | 5.7 3.5 11.4
3.5 3.2 1.2 | 2.5 2.2 0.2
-10.2 5.5 5.5 | -11.2 4.5 4.5
-7.2 -2.2 6.5 | -8.2 -3.2 5.5
-2.2 -7.3 19.2 | -3.2 -8.3 18.2
Transformation Matrix:
1 -1.0573e-16 -6.4096e-17 1
-1.3633e-16 1 2.59504e-16 1
3.20342e-09 1.14395e-09 1 1
Test set 2:
1.5 2.1 0 | 0.5 1.1 0
6.7 4.5 0 | 5.7 3.5 0
3.5 3.2 0 | 2.5 2.2 0
-10.2 5.5 0 | -11.2 4.5 0
-7.2 -2.2 0 | -8.2 -3.2 0
-2.2 -7.3 0 | -3.2 -8.3 0
Transformation Matrix:
1 4.4442e-17 0 1
-2.69695e-17 1 0 1
0 0 0 0
--> This gives me two transformation matrices that look right at first sight...
Assuming this is right, how would I recalculate the trajectory when I have such a transformation matrix for each timestep?
Does anybody have any tips or ideas why it's that bad?
This problem is much more 3D-related than image-processing-related.
What you are trying to do is register the known 3D point sets, and since the same 3D-points-to-camera relation holds for all frames, the transformations calculated from the registration will be the camera motion transformations.
In order to solve this you can use PCL. It is OpenCV's sister project for 3D-related tasks.
http://www.pointclouds.org/documentation/tutorials/template_alignment.php#template-alignment
This is a good tutorial on point cloud alignment.
Basically it goes like this:
For each pair of sequential frames the 3D point correspondences are known, so you can use the SVD method implemented in
http://docs.pointclouds.org/trunk/classpcl_1_1registration_1_1_transformation_estimation_s_v_d.html
You need at least 3 corresponding points.
You can follow the tutorial or implement your own RANSAC algorithm.
This only gives you a rough estimate of the transformation (it can be quite good if the noise is not too big). To get an accurate transformation, you should apply the ICP algorithm, using the guess transformation calculated in the previous step as the initial alignment.
ICP is described here:
http://www.pointclouds.org/documentation/tutorials/iterative_closest_point.php#iterative-closest-point
These two steps should give you an accurate estimation of the transformation between frames.
So you should do pairwise registration incrementally: register the first pair of frames to get the transformation from the first frame to the second (1->2), register the second with the third (2->3), and then append the 1->2 transformation to the 2->3 one, and so on. This way you will get the transformations in the global coordinate system where the first frame is the origin. A small sketch of the SVD step and the chaining follows below.
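The following is a minimal sketch of that idea in plain NumPy (the Kabsch/SVD alignment rather than PCL's TransformationEstimationSVD class, and without the ICP refinement); the per-frame point arrays are dummy data that are assumed to already be in row-wise correspondence:

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rotation R and translation t with dst ~ R @ src + t (Kabsch/SVD)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # avoid reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def to_homogeneous(R, t):
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# Chain the per-frame transforms to get each camera pose relative to frame 0.
# frames[k] is an (N, 3) array of points in camera coordinates at time k,
# with rows in correspondence across frames (dummy data here).
frames = [np.random.rand(8, 3) for _ in range(4)]
pose = np.eye(4)                                 # frame 0 is the origin
trajectory = [pose]
for prev, curr in zip(frames[:-1], frames[1:]):
    # Transform taking current-frame coordinates into previous-frame coordinates.
    R, t = rigid_transform(curr, prev)
    pose = pose @ to_homogeneous(R, t)
    trajectory.append(pose)
```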
