I am trying to use the solvePnP in the OpenCV calibration functionality to calibrate my camera.
So, the object in my case is a car and the origin is at the ground under the left tail light. The x-axes runs along from left to right and the y-axes is along the length of the car and the z-axes is along the height of the car. Now, the image coordinate system in OpenCV is at the upper left corner of the image and the pixel location is reported as (y, x) where y is along the rows and x is along the column. This is traditionally flipped from most image processing tools, I think.
My question is for the solvePnP function to work correctly, do I need to do something to image coordinates in terms of changing its origin or flipping the (row, cols) indexing or will this be somehow taken into account when estimating the calibration matrix. For me, it seems to be the latter but I could not convince myself that it is indeed the case.
Related
I'd like to get the pose (translation: x, y, z and rotation: Rx, Ry, Rz in World coordinate system) of the overhead camera. I got many object points and image points by moving the ChArUco calibration board with a robotic arm (like this https://www.youtube.com/watch?v=8q99dUPYCPs). Because of that, I already have exact positions of all the object points.
In order to feed many points to solvePnP, I set the first detected pattern (ChArUco board) as the first object and used it as the object coordinate system's origin. Then, I added the detected object points (from the second pattern to the last) to the first detected object points' coordinate system (the origin of the object frame is the origin of the first object).
After I got the transformation between the camera and the object's coordinate frame, I calculated the camera's pose based on that transformation.
The result looked pretty good at first, but when I measured the camera's absolute pose by using a ruler or a tape measure, I noticed that the extrinsic calibration result was around 15-20 millimeter off for z direction (the height of the camera), though almost correct for the others (x, y, Rx, Ry, Rz). The result was same even I changed the range of the object points by moving a robotic arm differently, it always ended up to have a few centimeters off for the height.
Has anyone experienced the similar problem before? I'd like to know anything I can try. What is the common mistake when the depth direction (z) is inaccurate?
I don't know how you measure the z but I believe that what you're measuring with the ruler is not z but the euclidean distance which is computed like so:
d=std::sqrt(x*x+y*y+z*z);
Let's take an example, if x=2; y=2; z=2;
then d will be d~3,5 so 3.5-2=1.5 is the difference you get between z and the ruler when you said around 15-20 millimeter off for z direction.
let's say I am placing a small object on a flat floor inside a room.
First step: Take a picture of the room floor from a known, static position in the world coordinate system.
Second step: Detect the bottom edge of the object in the image and map the pixel coordinate to the object position in the world coordinate system.
Third step: By using a measuring tape measure the real distance to the object.
I could move the small object, repeat this three steps for every pixel coordinate and create a lookup table (key: pixel coordinate; value: distance). This procedure is accurate enough for my use case. I know that it is problematic if there are multiple objects (an object could cover an other object).
My question: Is there an easier way to create this lookup table? Accidentally changing the camera angle by a few degrees destroys the hard work. ;)
Maybe it is possible to execute the three steps for a few specific pixel coordinates or positions in the world coordinate system and perform some "calibration" to calculate the distances with the computed parameters?
If the floor is flat, its equation is that of a plane, let
a.x + b.y + c.z = 1
in the camera coordinates (the origin is the optical center of the camera, XY forms the focal plane and Z the viewing direction).
Then a ray from the camera center to a point on the image at pixel coordinates (u, v) is given by
(u, v, f).t
where f is the focal length.
The ray hits the plane when
(a.u + b.v + c.f) t = 1,
i.e. at the point
(u, v, f) / (a.u + b.v + c.f)
Finally, the distance from the camera to the point is
p = √(u² + v² + f²) / (a.u + b.v + c.f)
This is the function that you need to tabulate. Assuming that f is known, you can determine the unknown coefficients a, b, c by taking three non-aligned points, measuring the image coordinates (u, v) and the distances, and solving a 3x3 system of linear equations.
From the last equation, you can then estimate the distance for any point of the image.
The focal distance can be measured (in pixels) by looking at a target of known size, at a known distance. By proportionality, the ratio of the distance over the size is f over the length in the image.
Most vision libraries (including opencv) have built in functions that will take a couple points from a camera reference frame and the related points from a Cartesian plane and generate your warp matrix (affine transformation) for you. (some are fancy enough to include non-linearity mappings with enough input points, but that brings you back to your time to calibrate issue)
A final note: most vision libraries use some type of grid to calibrate off of ie a checkerboard patter. If you wrote your calibration to work off of such a sheet, then you would only need to measure distances to 1 target object as the transformations would be calculated by the sheet and the target would just provide the world offsets.
I believe what you are after is called a Projective Transformation. The link below should guide you through exactly what you need.
Demonstration of calculating a projective transformation with proper math typesetting on the Math SE.
Although you can solve this by hand and write that into your code... I strongly recommend using a matrix math library or even writing your own matrix math functions prior to resorting to hand calculating the equations as you will have to solve them symbolically to turn it into code and that will be very expansive and prone to miscalculation.
Here are just a few tips that may help you with clarification (applying it to your problem):
-Your A matrix (source) is built from the 4 xy points in your camera image (pixel locations).
-Your B matrix (destination) is built from your measurements in in the real world.
-For fast recalibration, I suggest marking points on the ground to be able to quickly place the cube at the 4 locations (and subsequently get the altered pixel locations in the camera) without having to remeasure.
-You will only have to do steps 1-5 (once) during calibration, after that whenever you want to know the position of something just get the coordinates in your image and run them through step 6 and step 7.
-You will want your calibration points to be as far away from eachother as possible (within reason, as at extreme distances in a vanishing point situation, you start rapidly losing pixel density and therefore source image accuracy). Make sure that no 3 points are colinear (simply put, make your 4 points approximately square at almost the full span of your camera fov in the real world)
ps I apologize for not writing this out here, but they have fancy math editing and it looks way cleaner!
Final steps to applying this method to this situation:
In order to perform this calibration, you will have to set a global home position (likely easiest to do this arbitrarily on the floor and measure your camera position relative to that point). From this position, you will need to measure your object's distance from this position in both x and y coordinates on the floor. Although a more tightly packed calibration set will give you more error, the easiest solution for this may simply be to have a dimension-ed sheet(I am thinking piece of printer paper or a large board or something). The reason that this will be easier is that it will have built in axes (ie the two sides will be orthogonal and you will just use the four corners of the object and used canned distances in your calibration). EX: for a piece of paper your points would be (0,0), (0,8.5), (11,8.5), (11,0)
So using those points and the pixels you get will create your transform matrix, but that still just gives you a global x,y position on axes that may be hard to measure on (they may be skew depending on how you measured/ calibrated). So you will need to calculate your camera offset:
object in real world coords (from steps above): x1, y1
camera coords (Xc, Yc)
dist = sqrt( pow(x1-Xc,2) + pow(y1-Yc,2) )
If it is too cumbersome to try to measure the position of the camera from global origin by hand, you can instead measure the distance to 2 different points and feed those values into the above equation to calculate your camera offset, which you will then store and use anytime you want to get final distance.
As already mentioned in the previous answers you'll need a projective transformation or simply a homography. However, I'll consider it from a more practical view and will try to summarize it short and simple.
So, given the proper homography you can warp your picture of a plane such that it looks like you took it from above (like here). Even simpler you can transform a pixel coordinate of your image to world coordinates of the plane (the same is done during the warping for each pixel).
A homography is basically a 3x3 matrix and you transform a coordinate by multiplying it with the matrix. You may now think, wait 3x3 matrix and 2D coordinates: You'll need to use homogeneous coordinates.
However, most frameworks and libraries will do this handling for you. What you need to do is finding (at least) four points (x/y-coordinates) on your world plane/floor (preferably the corners of a rectangle, aligned with your desired world coordinate system), take a picture of them, measure the pixel coordinates and pass both to the "find-homography-function" of your desired computer vision or math library.
In OpenCV that would be findHomography, here an example (the method perspectiveTransform then performs the actual transformation).
In Matlab you can use something from here. Make sure you are using a projective transformation as transform type. The result is a projective tform, which can be used in combination with this method, in order to transform your points from one coordinate system to another.
In order to transform into the other direction you just have to invert your homography and use the result instead.
I'm working on an object tracking application using openCV. I want to convert my pixel coordinates to world coordinates to get more meaningful information. I have read a lot about computing the perspective transform matrix, and I know about cv2.solvePnP. But I feel like my case should be special, because I'm tracking a runner on a track and field runway with the runway orthogonal to the camera's z-axis. I will set up the camera to ensure this.
If I just pick two points on the runway edge, I can calculate a linear conversion from pixels to world coords at that specific height (ground level) and distance from the camera (i.e. along that line). Then I reason that the runner will run on a line parallel to the runway at a different height and slightly different distance from the camera, but the lines should still be parallel in the image, because they will both be orthogonal to the camera z-axis. With all those constraints, I feel like I shouldn't need the normal number of points to track the runner on that particular axis. My gut says that 2-3 should be enough. Can anyone help me nail down the method here? Am I completely off track? With both height and distance from camera essentially fixed, shouldn't I be able to work with a much smaller set of correspondences?
Thanks, Bill
So, I think I've answered this one myself. It's true that only two correspondence points are needed given the following assumptions.
Assume:
World coordinates are set up with X-axis and Y-axis parallel to the ground plane. X-axis is parallel to the runway.
Camera is translated and possibly rotated about X-axis (angled downward), but no rotation around Y-axis(camera plane parallel to runway and x-axis) or Z-axis (camera is level with respect to ground).
Camera intrinsic parameters are known from camera calibration.
Method:
Pick two points in the ground plane with known coordinates in world and image. For example, two points on the runway edge as mentioned in original post. The line connecting the poitns in world coordinates should not be parallel with either X or Z axis.
Since Y=0 for these points, ignore the second column of the rotation/translation matrix, reducing the projection to a planar homography transform (3x3 matrix). Now we have 9 degrees of freedom.
The rotation assumptions will enforce a certain form on the rotation/translation matrix. Namely, the first column and first row will be the identity (1,0,0). This further reduces the number of degrees of freedom in the matrix to 5.
Constrain the values of the second column of the matrix such that cos^2(theta)+sin^2(theta) = 1. This reduces the number of unknowns to only 4. Two correspondence points will give us the 4 equations we need to calculate the homography matrix for the ground plane.
Factor out the camera intrinsic parameter matrix from the homography matrix, leaving the rotation/translation matrix for the ground plane.
Due to the rotation assumptions made earlier, the ignored column of the rotation/translation matrix can be easily constructed from the third column of the same matrix, which is the second column in the ground plane homography matrix.
Multiply back out with the camera intrinsic parameters to arrive at the final universal projection matrix (from only 2 correspondence points!)
My test implentation has worked quite well. Of course, it's sensitive to the accuracy of the two correspondence points provided, but that's kind of a given.
I am searching lots of resources on internet for many days but i couldnt solve the problem.
I have a project in which i am supposed to detect the position of a circular object on a plane. Since on a plane, all i need is x and y position (not z) For this purpose i have chosen to go with image processing. The camera(single view, not stereo) position and orientation is fixed with respect to a reference coordinate system on the plane and are known
I have detected the image pixel coordinates of the centers of circles by using opencv. All i need is now to convert the coord. to real world.
http://www.packtpub.com/article/opencv-estimating-projective-relations-images
in this site and other sites as well, an homographic transformation is named as:
p = C[R|T]P; where P is real world coordinates and p is the pixel coord(in homographic coord). C is the camera matrix representing the intrinsic parameters, R is rotation matrix and T is the translational matrix. I have followed a tutorial on calibrating the camera on opencv(applied the cameraCalibration source file), i have 9 fine chessbordimages, and as an output i have the intrinsic camera matrix, and translational and rotational params of each of the image.
I have the 3x3 intrinsic camera matrix(focal lengths , and center pixels), and an 3x4 extrinsic matrix [R|T], in which R is the left 3x3 and T is the rigth 3x1. According to p = C[R|T]P formula, i assume that by multiplying these parameter matrices to the P(world) we get p(pixel). But what i need is to project the p(pixel) coord to P(world coordinates) on the ground plane.
I am studying electrical and electronics engineering. I did not take image processing or advanced linear algebra classes. As I remember from linear algebra course we can manipulate a transformation as P=[R|T]-1*C-1*p. However this is in euclidian coord system. I dont know such a thing is possible in hompographic. moreover 3x4 [R|T] Vector is not invertible. Moreover i dont know it is the correct way to go.
Intrinsic and extrinsic parameters are know, All i need is the real world project coordinate on the ground plane. Since point is on a plane, coordinates will be 2 dimensions(depth is not important, as an argument opposed single view geometry).Camera is fixed(position,orientation).How should i find real world coordinate of the point on an image captured by a camera(single view)?
EDIT
I have been reading "learning opencv" from Gary Bradski & Adrian Kaehler. On page 386 under Calibration->Homography section it is written: q = sMWQ where M is camera intrinsic matrix, W is 3x4 [R|T], S is an "up to" scale factor i assume related with homography concept, i dont know clearly.q is pixel cooord and Q is real coord. It is said in order to get real world coordinate(on the chessboard plane) of the coord of an object detected on image plane; Z=0 then also third column in W=0(axis rotation i assume), trimming these unnecessary parts; W is an 3x3 matrix. H=MW is an 3x3 homography matrix.Now we can invert homography matrix and left multiply with q to get Q=[X Y 1], where Z coord was trimmed.
I applied the mentioned algorithm. and I got some results that can not be in between the image corners(the image plane was parallel to the camera plane just in front of ~30 cm the camera, and i got results like 3000)(chessboard square sizes were entered in milimeters, so i assume outputted real world coordinates are again in milimeters). Anyway i am still trying stuff. By the way the results are previosuly very very large, but i divide all values in Q by third component of the Q to get (X,Y,1)
FINAL EDIT
I could not accomplish camera calibration methods. Anyway, I should have started with perspective projection and transform. This way i made very well estimations with a perspective transform between image plane and physical plane(having generated the transform by 4 pairs of corresponding coplanar points on the both planes). Then simply applied the transform on the image pixel points.
You said "i have the intrinsic camera matrix, and translational and rotational params of each of the image.” but these are translation and rotation from your camera to your chessboard. These have nothing to do with your circle. However if you really have translation and rotation matrices then getting 3D point is really easy.
Apply the inverse intrinsic matrix to your screen points in homogeneous notation: C-1*[u, v, 1], where u=col-w/2 and v=h/2-row, where col, row are image column and row and w, h are image width and height. As a result you will obtain 3d point with so-called camera normalized coordinates p = [x, y, z]T. All you need to do now is to subtract the translation and apply a transposed rotation: P=RT(p-T). The order of operations is inverse to the original that was rotate and then translate; note that transposed rotation does the inverse operation to original rotation but is much faster to calculate than R-1.
I have used openCV to calculate the homography relating to views of the same plane by using features and matching them. Is there any way to recover the plane itsself or the plane normal from this homography? (I am looking for an equation where H is the input and the normal n is the output.)
If you have the calibration of the cameras, you can extract the normal of the plane, but not the distance to the plane (i.e. the transformation that you obtain is up to scale), as Wikipedia explains. I don't know any implementation to do it, but here you are a couple of papers that deal with that problem (I warn you it is not straightforward): Faugeras & Lustman 1988, Vargas & Malis 2005.
You can recover the real translation of the transformation (i.e. the distance to the plane) if you have at least a real distance between two points on the plane. If that is the case, the easiest way to go with OpenCV is to first calculate the homography, then obtain four points on the plane with their 2D coordinates and the real 3D ones (you should be able to obtain them if you have a real measurement on the plane), and using PnP finally. PnP will give you a real transformation.
Rectifying an image is defined as making epipolar lines horizontal and lying in the same row in both images. From your description I get that you simply want to warp the plane such that it is parallel to the camera sensor or the image plane. This has nothing to do with rectification - I’d rather call it an obtaining a bird’s-eye view or a top view.
I see the source of confusion though. Rectification of images usually involves multiplication of each image with a homography matrix. In your case though each point in sensor plane b:
Xb = Hab * Xa = (Hb * Ha^-1) * Xa, where Ha is homography from the plane in the world to the sensor a; Ha and intrinsic camera matrix will give you a plane orientation but I don’t see an easy way to decompose Hab into Ha and Hb.
A classic (and hard) way is to find a Fundamental matrix, restore the Essential matrix from it, decompose the Essential matrix into camera rotation and translation (up to scale), rectify both images, perform a dense stereo, then fit a plane equation into 3d points you reconstruct.
If you interested in the ground plane and you operate an embedded device though, you don’t even need two frames - a top view can be easily recovered from a single photo, camera elevation from the ground (H) and a gyroscope (or orientation vector) readings. A simple diagram below explains the process in 2D case: first picture shows how to restore Z (depth) coordinate to every point on the ground plane; the second picture shows a plot of the top view with vertical axis being z and horizontal axis x = (img.col-w/2)*Z/focal; Here is img.col is image column, w - image width, and focal is camera focal length. Note that a camera frustum looks like a trapezoid in a birds eye view.