Understanding the output of solvePnP?

I have been using solvePnP() to calculate the rotation and translation matrices, but the Euler angles computed from the resulting rotation matrix gave very erratic values. To track down the problem, I took a set of 2D projection points for my marker and kept the other parameters of solvePnP() constant.
Example values:
2D points
[219.67473, 242.78395; 363.4151, 238.61298; 503.04855, 234.56117; 501.70917, 628.16742; 500.58069, 959.78564; 383.1756, 972.02679; 262.8746, 984.56982; 243.17044, 646.22925]
The Euler angle theta(x) calculated from the rotation matrix output by solvePnP() was -26.4877.
Next, I incremented only the x value of the first point (i.e. 219.67473) in steps of 0.1 to check the variation of the theta(x) Euler angle (keeping the remaining points and the other parameters constant) and ran solvePnP() again. For that very small change, the values decreased from -19 degrees to -18 degrees (for x coord = 223.074), then suddenly jumped to 27 degrees for a while (for x coord = 223.174 to 226.974), then came down to 1.3 degrees (for x coord = 227.074).
I cannot understand this behaviour at all. Could somebody please explain?
My Euler angle calculation from the rotation matrix uses this procedure.
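(The linked procedure is not reproduced here; for reference, one common ZYX-convention decomposition looks like the sketch below. This is just a representative example, not necessarily the exact convention the question used.)

import numpy as np

def euler_from_rmat(R):
    # standard ZYX (roll-pitch-yaw) extraction from a 3x3 rotation matrix
    theta_x = np.arctan2(R[2, 1], R[2, 2])
    theta_y = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))
    theta_z = np.arctan2(R[1, 0], R[0, 0])
    return np.degrees([theta_x, theta_y, theta_z])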

Try Rodrigues() for the conversion between rotation matrix and rotation vector to make sure everything is clean and right. The non-RANSAC version can be very sensitive to outliers that create a huge error in the parameters and thus bias the solution. Using the RANSAC version of solvePnP (solvePnPRansac) may make it more stable against outliers. For example, adding too much to one of the point coordinates will eventually make it an outlier, and it won't influence the solution after that.
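For reference, a minimal sketch of the RANSAC variant using the question's 2D points; the 3D model points, camera matrix, and distortion coefficients are made-up placeholders for your own values:

import cv2
import numpy as np

# hypothetical planar marker model (z = 0); replace with your real model points
object_points = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [2, 1.5, 0],
                          [2, 3, 0], [1, 3, 0], [0, 3, 0], [0, 1.5, 0]],
                         dtype=np.float64)
image_points = np.array([[219.67, 242.78], [363.42, 238.61], [503.05, 234.56],
                         [501.71, 628.17], [500.58, 959.79], [383.18, 972.03],
                         [262.87, 984.57], [243.17, 646.23]], dtype=np.float64)
camera_matrix = np.array([[800.0, 0, 360], [0, 800.0, 640], [0, 0, 1]])  # placeholder
dist_coeffs = np.zeros(5)                                                # placeholder

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_points, image_points, camera_matrix, dist_coeffs,
    reprojectionError=3.0)      # points reprojecting worse than 3 px become outliers
rmat, _ = cv2.Rodrigues(rvec)   # rotation vector -> rotation matrix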
If everything fails, write a series of unit tests: create an artificial set of points in 3D (possibly non-planar), apply a simple translation first, in a second variant apply rotation only, and in a third test apply both. Project using your camera matrix and then plug your 2D points, 3D points, and projection matrix into your code to find the pose. If the result deviates from the inverse of the translations and rotations you applied to the points, look for the bug in how you feed parameters to PnP.
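A minimal sketch of the combined rotation-plus-translation test; the camera matrix and pose values are arbitrary examples:

import cv2
import numpy as np

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])   # example intrinsics
pts3d = np.random.uniform(-1.0, 1.0, (10, 3))   # artificial non-planar 3D points
rvec_true = np.array([[0.1], [0.2], [0.3]])     # known rotation
tvec_true = np.array([[0.5], [-0.2], [5.0]])    # known translation (points in front of camera)

pts2d, _ = cv2.projectPoints(pts3d, rvec_true, tvec_true, K, None)
ok, rvec_est, tvec_est = cv2.solvePnP(pts3d, pts2d, K, None)

# if parameters are fed correctly, the recovered pose matches the known pose
assert np.allclose(rvec_est, rvec_true, atol=1e-4)
assert np.allclose(tvec_est, tvec_true, atol=1e-4)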

It seems the coordinate systems are different. OpenCV uses a right-handed coordinate system with Y pointing downwards. At nghiaho.com it says the calculations are based on this, and if you look at the axes they don't seem to match. I guess you are using Rodrigues for the matrix computation? Try comparing the rotation vectors as well.

Related

Why do we need to convert Rotational vector to Rotational matrix to calculate angle

I am not finding any proper source that explains why we need to convert a rotation vector to a rotation matrix [in the context of calculating the angle between two ArUco markers].
We are using
rmat = cv2.Rodrigues(rvec)[0]    # Rodrigues returns (matrix, jacobian)
rmat1 = cv2.Rodrigues(rvec1)[0]
relative_rmat = rmat1 @ rmat.T
My questions are:
Why are we converting the rotation vector to a rotation matrix?
And can I please get the source of relative_rmat's formula? I am trying to understand the geometrical concept.
I have tried to understand it from Wikipedia, but I am getting more confused. It would be helpful if anyone could provide a source for the concepts behind both questions.
Translation of a rigid body (or tvec as used in OpenCV conventions) lives in 3D Euclidean space. We call this the 'configuration space'.
Assume there is a rigid body at a point we call pos1. A 3x1 vector pos1 = [x1, y1, z1] completely defines its position uniquely. The term 'unique' means that there is no other way to define pos1 without [x1, y1, z1], and also if you go to [x1, y1, z1], it will always be pos1.
Any attitude (a rotation of a rigid body in three-dimensional space) can be represented in many different ways. However, attitudes live in a space called the special orthogonal group, SO(3). This is the configuration space for rotations, and elements of this space are what you were referring to as 'rotation matrices'.
All other ways of defining a rotation, like Euler angles, rotation vectors (rvec in OpenCV), or quaternions, are 'local parameterizations' of a given rotation matrix. So they have several issues, including not being unique at some rotations. The Gimbal lock Wikipedia page has some nice visualizations.
To avoid such issues, the simplest way is to use rotation matrices. Even though it seems complex, rotation matrices can be so much easier to work with once you get used to it. Given the properties of SO(3), the inverse of a rotation matrix is the transpose of that matrix (which is why you get the rmat.T when you are trying to get the relative rotation).
Let's assume that you have two markers named 'marker' and 'marker1' that correspond to rvec and rvec1 respectively. Your rmat is the rotation of 'marker' with respect to the camera frame, or how a vector in 'marker' can be represented in the camera frame (I know this can be confusing, but this is how it is defined, so stay with me).
Similarly, rmat1 is how a vector in 'marker1' is represented in the camera frame. Also, keep in mind that these matrices are directional, meaning we need inverse(rmat1) if we want to find how a vector in the camera frame is represented in the 'marker1' frame.
Your relative_rmat is how you represent a vector in 'marker' in the 'marker1' frame. You cannot randomly hop between different markers; you always need to go through a common frame. First you transform a vector in 'marker' to the camera frame, and then transform that to the 'marker1' frame. We can write that as
relative_rmat = rmat1 @ inverse(rmat)
But as I said above, a special property of a rotation matrix dictates that its inverse is the same as its transpose. So we write it as:
relative_rmat = rmat1 @ rmat.T
The order matters here: you should always start with the first rotation and pre-multiply the subsequent rotations. If you want to go the other way, you just need to take the inverse of relative_rmat. With simple matrix math, we can see that this is the same as following the rotations described above manually:
relative_rmat_inverse
= relative_rmat.T
= (rmat1 @ rmat.T).T
= (rmat.T).T @ rmat1.T
= rmat @ rmat1.T
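A quick numpy check of these identities, with arbitrary example rotation vectors:

import cv2
import numpy as np

rvec = np.array([0.1, 0.4, -0.2])    # arbitrary example rotations
rvec1 = np.array([-0.3, 0.2, 0.5])
rmat = cv2.Rodrigues(rvec)[0]
rmat1 = cv2.Rodrigues(rvec1)[0]

relative_rmat = rmat1 @ rmat.T
# applying relative_rmat after rmat lands on rmat1 ('marker' -> camera -> 'marker1')
assert np.allclose(relative_rmat @ rmat, rmat1)
# ...and the inverse is the transpose, as derived above
assert np.allclose(np.linalg.inv(relative_rmat), rmat @ rmat1.T)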
It is hard to detail everything here, and it can take a while to understand the math behind rotation matrices. I would recommend this as a good reference, but it can get very technical depending on your background. If you are new to this, start with the basics of robotics and coordinate transformations, and then move on to SO(3).

How can we find theta from rotation matrix?

According to OpenCV's documentation, solvePnP returns the rotation vector of the object pose from 3D-2D point correspondences. To obtain the rotation matrix, we can use the Rodrigues method to convert the rotation vector to a rotation matrix. According to the OpenCV documentation, we can find theta using the following:
theta = norm(r)
But I thought norm(r) finds the magnitude of the vector r? If that's the case, how can we find the angle from the magnitude of the vector r? Correct me if I am wrong. Thank you.
Given a rotation vector r, its length (in Python, numpy.linalg.norm(r)) is the angle of rotation around the axis whose direction is the vector's. The sense of the rotation obeys the right-hand rule: if your right hand makes a thumbs-up sign with the thumb pointing along the vector, the other fingers curl in the sense of the rotation (equivalently, it's the sense of rotation that makes an ordinary screw advance when its tip points along the vector).
The same rotation can be expressed as a 3x3 matrix, or as a triple of (Euler) angles of rotation about up to 3 orthogonal axes. There are ordinarily many different triples of Euler angles that represent the same rotation. Consult a textbook, or Wikipedia, for details.
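As a small example (values chosen for illustration), here is a 30-degree rotation about the Z axis encoded as a rotation vector, with the Rodrigues round trip:

import cv2
import numpy as np

theta_true = np.deg2rad(30)
rvec = np.array([0.0, 0.0, theta_true])   # unit axis (0,0,1) scaled by the angle

theta = np.linalg.norm(rvec)              # the vector's length is the angle
print(np.rad2deg(theta))                  # -> 30.0

rmat, _ = cv2.Rodrigues(rvec)             # same rotation as a 3x3 matrix...
rvec_back, _ = cv2.Rodrigues(rmat)        # ...and back to a rotation vector
assert np.allclose(rvec_back.ravel(), rvec)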

Measure distance to object with a single camera in a static scene

Let's say I am placing a small object on a flat floor inside a room.
First step: Take a picture of the room floor from a known, static position in the world coordinate system.
Second step: Detect the bottom edge of the object in the image and map the pixel coordinate to the object position in the world coordinate system.
Third step: Measure the real distance to the object with a measuring tape.
I could move the small object and repeat these three steps for every pixel coordinate to create a lookup table (key: pixel coordinate; value: distance). This procedure is accurate enough for my use case. I know that it is problematic if there are multiple objects (an object could cover another object).
My question: Is there an easier way to create this lookup table? Accidentally changing the camera angle by a few degrees destroys the hard work. ;)
Maybe it is possible to execute the three steps for a few specific pixel coordinates or positions in the world coordinate system and perform some "calibration" to calculate the distances with the computed parameters?
If the floor is flat, its equation is that of a plane, let
a.x + b.y + c.z = 1
in the camera coordinates (the origin is the optical center of the camera, XY forms the focal plane and Z the viewing direction).
Then a ray from the camera center to a point on the image at pixel coordinates (u, v) is given by
(u, v, f).t
where f is the focal length and t a positive scalar parameter.
The ray hits the plane when
(a.u + b.v + c.f) t = 1,
i.e. at the point
(u, v, f) / (a.u + b.v + c.f)
Finally, the distance from the camera to the point is
p = √(u² + v² + f²) / (a.u + b.v + c.f)
This is the function that you need to tabulate. Assuming that f is known, you can determine the unknown coefficients a, b, c by taking three non-aligned points, measuring the image coordinates (u, v) and the distances, and solving a 3x3 system of linear equations.
From the last equation, you can then estimate the distance for any point of the image.
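A short numpy sketch of that fit; the focal length, pixel coordinates, and measured distances are made-up example values, and (u, v) are measured from the principal point:

import numpy as np

f = 800.0                                                        # focal length in pixels
uv = np.array([[100.0, 400.0], [500.0, 420.0], [300.0, 200.0]])  # 3 non-aligned points
dist = np.array([2.5, 2.7, 4.1])                                 # tape-measured distances

# each point gives one linear equation: a*u + b*v + c*f = sqrt(u^2 + v^2 + f^2) / p
A = np.column_stack([uv[:, 0], uv[:, 1], np.full(3, f)])
rhs = np.sqrt(uv[:, 0]**2 + uv[:, 1]**2 + f**2) / dist
a, b, c = np.linalg.solve(A, rhs)

def distance(u, v):
    # distance from the camera to the floor point seen at pixel (u, v)
    return np.sqrt(u**2 + v**2 + f**2) / (a*u + b*v + c*f)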
The focal distance can be measured (in pixels) by looking at a target of known size, at a known distance. By proportionality, the ratio of the distance over the size is f over the length in the image.
Most vision libraries (including OpenCV) have built-in functions that will take a couple of points from the camera reference frame and the related points from a Cartesian plane and generate the warp matrix (affine transformation) for you. (Some are fancy enough to include non-linear mappings with enough input points, but that brings you back to your calibration-time issue.)
A final note: most vision libraries calibrate off some type of grid, e.g. a checkerboard pattern. If you wrote your calibration to work off such a sheet, then you would only need to measure the distance to one target object, as the transformations would be calculated from the sheet and the target would just provide the world offsets.
I believe what you are after is called a Projective Transformation. The link below should guide you through exactly what you need.
Demonstration of calculating a projective transformation with proper math typesetting on the Math SE.
Although you can solve this by hand and write that into your code, I strongly recommend using a matrix math library (or writing your own matrix math functions) rather than hand-calculating the equations, as you would have to solve them symbolically to turn them into code, and that will be very tedious and prone to miscalculation.
Here are just a few tips that may help you with clarification (applying it to your problem):
-Your A matrix (source) is built from the 4 xy points in your camera image (pixel locations).
-Your B matrix (destination) is built from your measurements in the real world.
-For fast recalibration, I suggest marking points on the ground to be able to quickly place the cube at the 4 locations (and subsequently get the altered pixel locations in the camera) without having to remeasure.
-You will only have to do steps 1-5 (once) during calibration, after that whenever you want to know the position of something just get the coordinates in your image and run them through step 6 and step 7.
-You will want your calibration points to be as far away from each other as possible (within reason; at extreme distances in a vanishing-point situation, you start rapidly losing pixel density and therefore source image accuracy). Make sure that no 3 points are collinear (simply put, make your 4 points approximately square, spanning almost the full camera FOV in the real world).
PS: I apologize for not writing this out here, but they have fancy math editing and it looks way cleaner!
Final steps to applying this method to this situation:
In order to perform this calibration, you will have to set a global home position (likely easiest to do this arbitrarily on the floor and measure your camera position relative to that point). From this position, you will need to measure your object's distance from this position in both x and y coordinates on the floor. Although a more tightly packed calibration set will give you more error, the easiest solution may simply be to use a dimensioned sheet (I am thinking a piece of printer paper or a large board or something). The reason this is easier is that it has built-in axes (i.e. the two sides are orthogonal, and you just use the four corners of the sheet with canned distances in your calibration). EX: for a piece of paper your points would be (0,0), (0,8.5), (11,8.5), (11,0).
So using those points and the pixels you get will create your transform matrix, but that still just gives you a global x,y position on axes that may be hard to measure on (they may be skewed depending on how you measured/calibrated). So you will need to calculate your camera offset:
object in real world coords (from steps above): x1, y1
camera coords (Xc, Yc)
dist = sqrt( pow(x1-Xc,2) + pow(y1-Yc,2) )
If it is too cumbersome to measure the position of the camera from the global origin by hand, you can instead measure the distance to 2 different points and feed those values into the above equation to calculate your camera offset, which you then store and use any time you want to get a final distance.
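Putting those steps into OpenCV terms, a sketch with invented pixel values and an invented camera offset (getPerspectiveTransform takes exactly the four corner correspondences described above):

import cv2
import numpy as np

# pixel locations of the four paper corners in the image (invented values)
src = np.float32([[412, 630], [418, 512], [560, 515], [566, 634]])
# the same corners in world units (inches, letter-size paper)
dst = np.float32([[0, 0], [0, 8.5], [11, 8.5], [11, 0]])

M = cv2.getPerspectiveTransform(src, dst)   # steps 1-5: done once at calibration

def pixel_to_world(u, v):
    # steps 6-7: map an image pixel to (x, y) on the floor
    return cv2.perspectiveTransform(np.float32([[[u, v]]]), M)[0, 0]

# camera offset measured by hand, then the final distance to a detected object
Xc, Yc = -20.0, 4.25
x1, y1 = pixel_to_world(490, 575)
dist = np.hypot(x1 - Xc, y1 - Yc)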
As already mentioned in the previous answers, you'll need a projective transformation, or simply a homography. However, I'll consider it from a more practical view and try to summarize it short and simple.
So, given the proper homography you can warp your picture of a plane such that it looks like you took it from above (like here). Even simpler you can transform a pixel coordinate of your image to world coordinates of the plane (the same is done during the warping for each pixel).
A homography is basically a 3x3 matrix, and you transform a coordinate by multiplying it with the matrix. You may now think: wait, a 3x3 matrix and 2D coordinates? You'll need to use homogeneous coordinates.
However, most frameworks and libraries will do this handling for you. What you need to do is find (at least) four points (x/y coordinates) on your world plane/floor (preferably the corners of a rectangle, aligned with your desired world coordinate system), take a picture of them, measure the pixel coordinates, and pass both to the "find-homography" function of your desired computer vision or math library.
In OpenCV that would be findHomography; here is an example (the method perspectiveTransform then performs the actual transformation).
In Matlab you can use something from here. Make sure you are using a projective transformation as the transform type. The result is a projective tform, which can be used in combination with this method to transform your points from one coordinate system to another.
In order to transform in the other direction, you just have to invert your homography and use the result instead.
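A minimal OpenCV sketch of that workflow; the four correspondences are invented for illustration:

import cv2
import numpy as np

world = np.float32([[0, 0], [2, 0], [2, 3], [0, 3]])            # floor rectangle (meters)
pixels = np.float32([[102, 566], [598, 560], [684, 230], [48, 241]])

H, _ = cv2.findHomography(pixels, world)     # maps pixel -> world plane

pt = np.float32([[[350, 400]]])              # shape (N, 1, 2), as OpenCV expects
print(cv2.perspectiveTransform(pt, H))       # world coordinates of that pixel

H_inv = np.linalg.inv(H)                     # world -> pixel, the other direction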

Camera projection for lines orthogonal to camera z-axis

I'm working on an object tracking application using OpenCV. I want to convert my pixel coordinates to world coordinates to get more meaningful information. I have read a lot about computing the perspective transform matrix, and I know about cv2.solvePnP. But I feel like my case should be special, because I'm tracking a runner on a track-and-field runway with the runway orthogonal to the camera's z-axis. I will set up the camera to ensure this.
If I just pick two points on the runway edge, I can calculate a linear conversion from pixels to world coords at that specific height (ground level) and distance from the camera (i.e. along that line). Then I reason that the runner will run on a line parallel to the runway at a different height and slightly different distance from the camera, but the lines should still be parallel in the image, because they will both be orthogonal to the camera z-axis. With all those constraints, I feel like I shouldn't need the normal number of points to track the runner on that particular axis. My gut says that 2-3 should be enough. Can anyone help me nail down the method here? Am I completely off track? With both height and distance from camera essentially fixed, shouldn't I be able to work with a much smaller set of correspondences?
Thanks, Bill
So, I think I've answered this one myself. It's true that only two correspondence points are needed given the following assumptions.
Assume:
World coordinates are set up with the X-axis and Z-axis parallel to the ground plane (Y-axis vertical). The X-axis is parallel to the runway.
The camera is translated and possibly rotated about the X-axis (angled downward), but there is no rotation around the Y-axis (the camera plane stays parallel to the runway and the X-axis) or the Z-axis (the camera is level with respect to the ground).
Camera intrinsic parameters are known from camera calibration.
Method:
Pick two points in the ground plane with known coordinates in both world and image, for example two points on the runway edge as mentioned in the original post. The line connecting the points in world coordinates should not be parallel to either the X or the Z axis.
Since Y=0 for these points, ignore the second column of the rotation/translation matrix, reducing the projection to a planar homography transform (3x3 matrix). Now we have 9 degrees of freedom.
The rotation assumptions will enforce a certain form on the rotation/translation matrix. Namely, the first column and the first row will both be (1,0,0). This further reduces the number of degrees of freedom in the matrix to 5.
Constrain the values of the second column of the matrix such that cos^2(theta)+sin^2(theta) = 1. This reduces the number of unknowns to only 4. Two correspondence points will give us the 4 equations we need to calculate the homography matrix for the ground plane.
Factor out the camera intrinsic parameter matrix from the homography matrix, leaving the rotation/translation matrix for the ground plane.
Due to the rotation assumptions made earlier, the ignored column of the rotation/translation matrix can be easily constructed from the third column of the same matrix, which is the second column in the ground plane homography matrix.
Multiply back out with the camera intrinsic parameters to arrive at the final universal projection matrix (from only 2 correspondence points!)
My test implementation has worked quite well. Of course, it's sensitive to the accuracy of the two correspondence points provided, but that's kind of a given.
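For illustration, a small numpy check (with assumed example values) that the ground-plane homography really takes the form used in the steps above when the camera is rotated about X only and has no X translation:

import numpy as np

theta = np.deg2rad(20)                               # example downward tilt
Rx = np.array([[1, 0, 0],
               [0, np.cos(theta), -np.sin(theta)],
               [0, np.sin(theta),  np.cos(theta)]])
t = np.array([[0.0], [1.5], [8.0]])                  # no X component
K = np.array([[900.0, 0, 640], [0, 900.0, 360], [0, 0, 1]])   # example intrinsics

# ground-plane points are (X, 0, Z); dropping the Y column of [R|t] gives the
# 3x3 homography that maps (X, Z, 1) to the image
H = K @ np.hstack([Rx[:, [0]], Rx[:, [2]], t])
print(np.linalg.inv(K) @ H)
# the first row and first column come out as (1, 0, 0); the remaining 2x2 block
# holds (-sin, cos) of theta and the translation, as argued above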

Calculate a Homography with only Translation, Rotation and Scale in OpenCV

I have two sets of points and I want to find the best transformation between them.
In OpenCV, you have the following function:
Mat H = Calib3d.findHomography(src_points, dest_points);
that returns a 3x3 homography matrix, using RANSAC. My problem is that I only need translation and rotation (and maybe scale); I don't need the affine and perspective components.
The thing is, my points are only in 2D.
(1) Is there a function to compute something like a homography but with less degrees of freedom?
(2) If there is none, is it possible to extract a 3x3 matrix that does only translation and rotation from the 3x3 homography matrix?
Thanks in advance for any help!
Isa
OpenCV's estimateRigidTransform function is exactly what you need: it returns translation, rotation and scale (pass false for the fullAffine flag). And it DOES use RANSAC (see the source code to be sure of it).
A homography is for 2D points; the third dimension is just for casting points into 3-dimensional homogeneous coordinates and performing perspective effects. You can always cast points back:
homogeneous [x, y, w]
Cartesian [x/w, y/w]
However, since you calculate 8 DOF (homography) instead of 4 DOF (similarity), your result can be pretty different from what you expect with 4 DOF. A more flexible transformation will fit more points in RANSAC at the expense of distortions in the transformation you care about. Bottom line: don't try to decompose H; instead fit a similarity or an isometry (also called rigid or Euclidean) directly. The reason they are absent in the library is that they can be expressed in closed form, even with the correct least-squares metric in point coordinates, and thus don't require non-linear optimization. In other words, they are very simple.
If you only have rotation and translation, I wrote a quick function to find them (no RANSAC though). It is probably similar to estimateRigidTransform but more understandable (hopefully):
https://stackoverflow.com/a/18091472/457687
With scale there is still a closed-form solution, but with slightly different formulas for translation and scaling. See "Learning similarity parameters", p. 25.
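As a side note, in newer OpenCV versions estimateRigidTransform is deprecated; estimateAffinePartial2D fits the same 4-DOF similarity (rotation, uniform scale, translation) with RANSAC. A sketch with made-up points:

import cv2
import numpy as np

src = np.float32([[0, 0], [1, 0], [1, 1], [0, 1], [0.5, 0.5]])
ang, s, tx, ty = 0.3, 1.2, 2.0, -1.0                 # ground-truth similarity
R = s * np.array([[np.cos(ang), -np.sin(ang)],
                  [np.sin(ang),  np.cos(ang)]])
dst = (src @ R.T + [tx, ty]).astype(np.float32)

M, inliers = cv2.estimateAffinePartial2D(src, dst)   # 2x3: [[a, -b, tx], [b, a, ty]]
scale = np.hypot(M[0, 0], M[1, 0])                   # sqrt(a^2 + b^2)
angle = np.arctan2(M[1, 0], M[0, 0])                 # recovered rotation (radians)
print(scale, angle, M[:, 2])                         # -> 1.2, 0.3, [2.0, -1.0]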
