Suppose cameras are calibrated therefore Metric projection matrices M_i(3x4) are there for view i from multiple views. As well, K_i(3x3) the camera matrix of each view is available. Can we calculate the location of plane at infinity in projective space?
Sure, the plane at infinity is always the plane where w = 0. If you are applying affine transformations, it remains fixed. It only shifts if you use a homography.
Yes, it is theoretically possible. The plane at infinity always remains fixed in terms of the actual projective 3D world. However, it is imaged differently by a moving camera upon each view and in these cases we say that the plane at infinity is not at its canonical position. Instead of thinking that the camera moved, it is more convenient to think that the entire 3D projective space has moved! Thus, we invent a 3D homography to "blame" for this change. Mathematically, the 3D homography tags along the left side of the projection matrix:
x = (P*H)*X
So, to answer the question: Although tricky, yes you may recover it, given that you have sufficiently enough reconstructed views of the scene. This is a process called auto calibration and it essentially involves one equation (that comes in many flavors) but unfortunately, gives non-linear equations. I suggest you look at the following:
http://nguyendangbinh.org/Proceedings/CVPR/1999/DATA/03_34.PDF
I believe it contains the most up-to-date method to iteratively compute the plane at infinity.
Related
Given a scene, multiple camera views of that scene and their corresponding projection matrices, we wish to triangulate 2D matching points into 3D.
What i've been doing so far is solve the system PX = alphax where P is the projection matrix, X is the 3D point in camera coordinates, alpha is a scalar and x is the vector corresponding to the point in 2D. X and x are in homogeneous coords.
See https://tspace.library.utoronto.ca/handle/1807/10437 page 102 for more detail.
Solving this with an SVD yields proper results when the 2D points are accurately selected or when i only use two views. Introducing more views adds a lot of error.
Any advice on what techniques are best to improve/refine this solution and make it support more views?
If I understand correctly, we can view this as finding a point in 3D space that minimizes the sum of orthogonal distances between the point and lines (one line per camera view) ? I guess with a gradient descent in 3d space, it's possible to find a local minimum of this.
Did I understand the problem correctly?
let's say I am placing a small object on a flat floor inside a room.
First step: Take a picture of the room floor from a known, static position in the world coordinate system.
Second step: Detect the bottom edge of the object in the image and map the pixel coordinate to the object position in the world coordinate system.
Third step: By using a measuring tape measure the real distance to the object.
I could move the small object, repeat this three steps for every pixel coordinate and create a lookup table (key: pixel coordinate; value: distance). This procedure is accurate enough for my use case. I know that it is problematic if there are multiple objects (an object could cover an other object).
My question: Is there an easier way to create this lookup table? Accidentally changing the camera angle by a few degrees destroys the hard work. ;)
Maybe it is possible to execute the three steps for a few specific pixel coordinates or positions in the world coordinate system and perform some "calibration" to calculate the distances with the computed parameters?
If the floor is flat, its equation is that of a plane, let
a.x + b.y + c.z = 1
in the camera coordinates (the origin is the optical center of the camera, XY forms the focal plane and Z the viewing direction).
Then a ray from the camera center to a point on the image at pixel coordinates (u, v) is given by
(u, v, f).t
where f is the focal length.
The ray hits the plane when
(a.u + b.v + c.f) t = 1,
i.e. at the point
(u, v, f) / (a.u + b.v + c.f)
Finally, the distance from the camera to the point is
p = √(u² + v² + f²) / (a.u + b.v + c.f)
This is the function that you need to tabulate. Assuming that f is known, you can determine the unknown coefficients a, b, c by taking three non-aligned points, measuring the image coordinates (u, v) and the distances, and solving a 3x3 system of linear equations.
From the last equation, you can then estimate the distance for any point of the image.
The focal distance can be measured (in pixels) by looking at a target of known size, at a known distance. By proportionality, the ratio of the distance over the size is f over the length in the image.
Most vision libraries (including opencv) have built in functions that will take a couple points from a camera reference frame and the related points from a Cartesian plane and generate your warp matrix (affine transformation) for you. (some are fancy enough to include non-linearity mappings with enough input points, but that brings you back to your time to calibrate issue)
A final note: most vision libraries use some type of grid to calibrate off of ie a checkerboard patter. If you wrote your calibration to work off of such a sheet, then you would only need to measure distances to 1 target object as the transformations would be calculated by the sheet and the target would just provide the world offsets.
I believe what you are after is called a Projective Transformation. The link below should guide you through exactly what you need.
Demonstration of calculating a projective transformation with proper math typesetting on the Math SE.
Although you can solve this by hand and write that into your code... I strongly recommend using a matrix math library or even writing your own matrix math functions prior to resorting to hand calculating the equations as you will have to solve them symbolically to turn it into code and that will be very expansive and prone to miscalculation.
Here are just a few tips that may help you with clarification (applying it to your problem):
-Your A matrix (source) is built from the 4 xy points in your camera image (pixel locations).
-Your B matrix (destination) is built from your measurements in in the real world.
-For fast recalibration, I suggest marking points on the ground to be able to quickly place the cube at the 4 locations (and subsequently get the altered pixel locations in the camera) without having to remeasure.
-You will only have to do steps 1-5 (once) during calibration, after that whenever you want to know the position of something just get the coordinates in your image and run them through step 6 and step 7.
-You will want your calibration points to be as far away from eachother as possible (within reason, as at extreme distances in a vanishing point situation, you start rapidly losing pixel density and therefore source image accuracy). Make sure that no 3 points are colinear (simply put, make your 4 points approximately square at almost the full span of your camera fov in the real world)
ps I apologize for not writing this out here, but they have fancy math editing and it looks way cleaner!
Final steps to applying this method to this situation:
In order to perform this calibration, you will have to set a global home position (likely easiest to do this arbitrarily on the floor and measure your camera position relative to that point). From this position, you will need to measure your object's distance from this position in both x and y coordinates on the floor. Although a more tightly packed calibration set will give you more error, the easiest solution for this may simply be to have a dimension-ed sheet(I am thinking piece of printer paper or a large board or something). The reason that this will be easier is that it will have built in axes (ie the two sides will be orthogonal and you will just use the four corners of the object and used canned distances in your calibration). EX: for a piece of paper your points would be (0,0), (0,8.5), (11,8.5), (11,0)
So using those points and the pixels you get will create your transform matrix, but that still just gives you a global x,y position on axes that may be hard to measure on (they may be skew depending on how you measured/ calibrated). So you will need to calculate your camera offset:
object in real world coords (from steps above): x1, y1
camera coords (Xc, Yc)
dist = sqrt( pow(x1-Xc,2) + pow(y1-Yc,2) )
If it is too cumbersome to try to measure the position of the camera from global origin by hand, you can instead measure the distance to 2 different points and feed those values into the above equation to calculate your camera offset, which you will then store and use anytime you want to get final distance.
As already mentioned in the previous answers you'll need a projective transformation or simply a homography. However, I'll consider it from a more practical view and will try to summarize it short and simple.
So, given the proper homography you can warp your picture of a plane such that it looks like you took it from above (like here). Even simpler you can transform a pixel coordinate of your image to world coordinates of the plane (the same is done during the warping for each pixel).
A homography is basically a 3x3 matrix and you transform a coordinate by multiplying it with the matrix. You may now think, wait 3x3 matrix and 2D coordinates: You'll need to use homogeneous coordinates.
However, most frameworks and libraries will do this handling for you. What you need to do is finding (at least) four points (x/y-coordinates) on your world plane/floor (preferably the corners of a rectangle, aligned with your desired world coordinate system), take a picture of them, measure the pixel coordinates and pass both to the "find-homography-function" of your desired computer vision or math library.
In OpenCV that would be findHomography, here an example (the method perspectiveTransform then performs the actual transformation).
In Matlab you can use something from here. Make sure you are using a projective transformation as transform type. The result is a projective tform, which can be used in combination with this method, in order to transform your points from one coordinate system to another.
In order to transform into the other direction you just have to invert your homography and use the result instead.
I'm working on an object tracking application using openCV. I want to convert my pixel coordinates to world coordinates to get more meaningful information. I have read a lot about computing the perspective transform matrix, and I know about cv2.solvePnP. But I feel like my case should be special, because I'm tracking a runner on a track and field runway with the runway orthogonal to the camera's z-axis. I will set up the camera to ensure this.
If I just pick two points on the runway edge, I can calculate a linear conversion from pixels to world coords at that specific height (ground level) and distance from the camera (i.e. along that line). Then I reason that the runner will run on a line parallel to the runway at a different height and slightly different distance from the camera, but the lines should still be parallel in the image, because they will both be orthogonal to the camera z-axis. With all those constraints, I feel like I shouldn't need the normal number of points to track the runner on that particular axis. My gut says that 2-3 should be enough. Can anyone help me nail down the method here? Am I completely off track? With both height and distance from camera essentially fixed, shouldn't I be able to work with a much smaller set of correspondences?
Thanks, Bill
So, I think I've answered this one myself. It's true that only two correspondence points are needed given the following assumptions.
Assume:
World coordinates are set up with X-axis and Y-axis parallel to the ground plane. X-axis is parallel to the runway.
Camera is translated and possibly rotated about X-axis (angled downward), but no rotation around Y-axis(camera plane parallel to runway and x-axis) or Z-axis (camera is level with respect to ground).
Camera intrinsic parameters are known from camera calibration.
Method:
Pick two points in the ground plane with known coordinates in world and image. For example, two points on the runway edge as mentioned in original post. The line connecting the poitns in world coordinates should not be parallel with either X or Z axis.
Since Y=0 for these points, ignore the second column of the rotation/translation matrix, reducing the projection to a planar homography transform (3x3 matrix). Now we have 9 degrees of freedom.
The rotation assumptions will enforce a certain form on the rotation/translation matrix. Namely, the first column and first row will be the identity (1,0,0). This further reduces the number of degrees of freedom in the matrix to 5.
Constrain the values of the second column of the matrix such that cos^2(theta)+sin^2(theta) = 1. This reduces the number of unknowns to only 4. Two correspondence points will give us the 4 equations we need to calculate the homography matrix for the ground plane.
Factor out the camera intrinsic parameter matrix from the homography matrix, leaving the rotation/translation matrix for the ground plane.
Due to the rotation assumptions made earlier, the ignored column of the rotation/translation matrix can be easily constructed from the third column of the same matrix, which is the second column in the ground plane homography matrix.
Multiply back out with the camera intrinsic parameters to arrive at the final universal projection matrix (from only 2 correspondence points!)
My test implentation has worked quite well. Of course, it's sensitive to the accuracy of the two correspondence points provided, but that's kind of a given.
I have two images that are taken from different positions. The 2nd camera is located to the right, up and backward with respect to 1st camera.
So I think there is a perspective transformation between the two views and not just an affine transform since cameras are at relatively different depths. Am I right?
I have a few corresponding points between the two images. I think of using these corresponding points to determine the transformation of each pixel from the 1st to the 2nd image.
I am confused by the functions findFundamentalMat and findHomography. Both return a 3x3 matrix. What is the difference between the two?
Is there any condition required/prerequisite to use them (when to use them)?
Which one to use to transform points from 1st image to 2nd image? In the 3x3 matrices, which the functions return, do they include the rotation and translation between the two image frames?
From Wikipedia, I read that the fundamental matrix is a relation between corresponding image points. In an SO answer here, it is said the essential matrix E is required to get corresponding points. But I do not have the internal camera matrix to calculate E. I just have the two images.
How should I proceed to determine the corresponding point?
Without any extra assumption on the world scene geometry, you cannot affirm that there is a projective transformation between the two views. This is only true if the scene is planar. A good reference on that topic is the book Multiple View Geometry in Computer Vision by Hartley and Zisserman.
If the world scene is not planar, you should definitely not use the findHomography function. You can use the findFundamentalMat function, which will provide you an estimation of the fundamental matrix F. This matrix describes the epipolar geometry between the two views. You may use F to rectify your images in order to apply stereo algorithms to determine a dense correspondence map.
I assume you are using the expression "perspective transformation" to mean "projective transformation". To the best of my knowledge, a perspective transformation is a world to image mapping, not an image to image mapping.
The Fundamental matrix has the relation
x'Fu = 0
with x in one image and u in the other iff x and u are projections of the same 3d point.
Also
l = Fu
defines a line (lx' = 0) where the correponding point of u must be on, so it can be used to confine the searchspace for the correspondences.
A Homography maps a point on one projection of a plane to another projection of the plane.
x = Hu
There are only two cases where the transformation between two views is a projective transformation (ie a homography): either the scene is planar or the two views were generated by a camera rotating around its center.
I am using the glMatrix to code Webgl and want to get the eye position, focal point and up direction from the existing projection and view matrix (kinda like the reverse of lookat function). Is there any way to do this?
I didn't implement one, no. I'm not even sure that you could decompose it into the original vectors, for that matter. The lookAt point could be anywhere along a ray from the origin, and how would you determine what the appropriate up vector was? I'm thinking this is a one-way algorithm (just too lazy to prove it!)
Beyond that, however, I question wether you would want to do this even if there was a method for it. I'll be willing to bet that it's almost always more beneficial to track the values you're using and manipulate them rather than to try and pull them back and forth from matrix to vectors and back.
Yes and No: Yes you can invert the model view transformation and no you will not get exactly all three vectors the same.
The model view transformation of lookAt is very similar to the connectTo operation as used in CSG models. It is mounting your scene in front of your camera. This is done by translation and three axis rotations. The eye point is translated to (0,0,0) and all further rotation is done around it. You can easily derive the eye point by transforming (0,0,0) with the inverse matrix.
But the center point is just used for adjusting the axis of view along the -Z axis. In openGL the eye is facing to -Z. The distance between center and eye is lost. So you can easy get a center point along your axis of view if you define the distance yourself. Let's say we want a distance of d. Then we just need to transform (0,0,-d) with the inverse matrix and we get a valid center point, but not exactly the same. The center point is defining only two rotation angles, the camera pan and tilt.
Even more worse is the reconstruction of the up vector. It is only used for the roll angle of the camera and thus only for one scalar value. Thus for the inverse transformation you can not only choose any positive value along the Y axis, you could choose any point in the YZ plane with a positive Y value. To get a up vector perfectly normal to the viewing axis and of size 1 we just transform (0,1,0) with the inverse matrix. Remember to transform as vector this time (not as point).
Now we have eye, center and up reconstructed in a way to get exactly the same result of lookAt next time. But since this matrix contains only 6 values of information (translation,pan,tilt,roll) we had to choose 3 values that were lost (distance center to eye, size and angle of up vector in YZ plane of camera).
The model view matrix can of course do other transformation (any affine) but the lookAt function is using this matrix only for translation and rotation. It is adjusting the scene in front of the camera without distorting it.