I am using standard OpenCV functions to calibrate a camera for its intrinsic parameters. In order to obtain good results, I know we have to use images of the chessboard taken from different angles (i.e., spanning different planes in 3D). This is stated in all the documentation and papers, but I really don't understand why it is so important to consider different planes, and whether there is an optimal number of planes to use for the best calibration results.
I would be glad if you could point me to a paper or some documentation that explains this. (I think Zhang's paper talks about it, but it is mathematically intensive and was hard to digest.)
Thanks
Mathematically, a unique solution for the intrinsic parameters (up to scale) is defined only if you have 3 or more distinct images of the planar target. See page 6 of Zhang's paper: "If n images of the model plane are observed, by stacking n such equations as (8) we have Vb = 0 (9), where V is a 2n×6 matrix. If n ≥ 3, we will have in general a unique solution b defined up to a scale factor..."
There isn't an "optimal" number of planes: where data are concerned, the more the merrier. But as the solution starts to converge, the marginal gain in calibration accuracy from adding an extra image becomes negligible. Of course, this assumes that the images show planes well separated in both pose and location.
See also this other answer of mine for practical tips.
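For concreteness, here is a minimal sketch of this kind of multi-view calibration with OpenCV in Python; the chessboard size, square units and image paths are assumptions, not something from the question:

```python
import glob
import cv2 as cv
import numpy as np

# Hypothetical setup: a 9x6 inner-corner chessboard, views stored as calib/*.png.
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)  # board coords, in squares

obj_pts, img_pts = [], []
for fname in sorted(glob.glob("calib/*.png")):
    gray = cv.imread(fname, cv.IMREAD_GRAYSCALE)
    found, corners = cv.findChessboardCorners(gray, pattern)
    if found:
        corners = cv.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv.TERM_CRITERIA_EPS + cv.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_pts.append(objp)
        img_pts.append(corners)

# n >= 3 well-separated poses are needed for a unique solution; additional views
# mainly average out corner-detection noise.
rms, K, dist, rvecs, tvecs = cv.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
```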
If you're looking for a little intuition, here's an example of why one plane isn't enough. Imagine your calibration chessboard is tilting away from you at a 45° angle:
You can see that when you move up the chessboard by 1 meter in the +y direction, you also move away from the camera by 1 meter in the +z direction. This means there's no way to separate the effect of moving in the y direction vs the z direction. The y and z movement directions are effectively tied to each other, for all our training points. So, if we just look at points on this one plane, there's no way to tease apart the effects of y movement vs z movement.
For example, from this 1 plane, we can't tell the difference between these scenarios:
The camera has perspective distortion such that things appear smaller in the image as they move in the world's +y direction.
The camera focal length is such that things appear smaller in the image as they move in the world's +z direction.
Any mixture of the effects in #1 and #2.
Mathematically, this ambiguity means that there are many equally possible solutions when OpenCV tries to fit a camera matrix to match the data. (Note that the 45° angle was not important. Any plane you choose will have the same problem: training examples' (x,y,z) dimensions are coupled together, so you can't separate their effects.)
One last note: if you make enough assumptions about the camera matrix (e.g. no perspective distortion, x and y scale identically, etc) then you can end up with a situation with fewer unknowns (in an extreme case, maybe you're just calculating the focal length) and in that case you could calibrate with just 1 plane.
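As a hedged illustration of that last point (not from the original answer): in OpenCV you can bake such assumptions in via calibration flags plus an initial guess, which in the extreme leaves essentially only the focal length to estimate, so a single view can suffice. The file name and board size below are assumptions.

```python
import cv2 as cv
import numpy as np

# Hypothetical: one image "single_view.png" of a 9x6 inner-corner chessboard.
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

gray = cv.imread("single_view.png", cv.IMREAD_GRAYSCALE)
found, corners = cv.findChessboardCorners(gray, pattern)
h, w = gray.shape

# Assumptions encoded as flags: principal point at the image centre, fx == fy,
# and no lens distortion at all -- essentially only the focal length stays free.
flags = (cv.CALIB_USE_INTRINSIC_GUESS | cv.CALIB_FIX_PRINCIPAL_POINT |
         cv.CALIB_FIX_ASPECT_RATIO | cv.CALIB_ZERO_TANGENT_DIST |
         cv.CALIB_FIX_K1 | cv.CALIB_FIX_K2 | cv.CALIB_FIX_K3)
K0 = np.array([[1000., 0., w / 2.],
               [0., 1000., h / 2.],
               [0., 0., 1.]])

rms, K, dist, _, _ = cv.calibrateCamera([objp], [corners], (w, h), K0, None, flags=flags)
print("estimated focal length (px):", K[0, 0])
```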
Related
I have a problem at hand where I need to detect/predict the coordinates of the hinge point or axis of rotation point using image processing. The image is as shown below:
I've used a method where I start by tracking the circular movement (in an arc) of a few feature points in an RoI around the default hinge coordinates (entered manually in a configuration file). This circular motion of the tracked points happens around the vertical axis which passes through the hinge point. I track these points from their initial position until the connecting bar makes a particular angle (15°/20°) with the y-axis, draw secants between the start and end positions of each point, and draw their perpendicular bisectors, which should ideally pass through the centre of the (concentric) circles, i.e. the ideal hinge point (a small sketch of this computation follows the example below).
E.g., y-intercepts calculated for each point:
H0: (322, 42)
H1: (322, 64) (within tolerance, closest to GT)
H2: (322, 48)
H_avg: (322, 52)
H_groundtruth (x, y): (322, 61)
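For reference, a minimal sketch of that y-intercept computation (the function name and the example coordinates are hypothetical, not taken from the data above):

```python
import numpy as np

def hinge_y_intercept(p_start, p_end, x_axis):
    """Perpendicular bisector of the secant p_start -> p_end, intersected with
    the vertical axis x = x_axis, giving one candidate hinge y-coordinate."""
    p0, p1 = np.asarray(p_start, float), np.asarray(p_end, float)
    mid = (p0 + p1) / 2.0
    d = p1 - p0                      # secant direction
    if abs(d[1]) < 1e-9:             # horizontal secant: bisector is vertical, no single y
        return None
    # bisector: all points q with (q - mid) . d = 0  ->  solve for y at x = x_axis
    y = mid[1] - d[0] * (x_axis - mid[0]) / d[1]
    return (x_axis, y)

# e.g. one tracked point moving from (350, 80) to (365, 95), vertical axis at x = 322
print(hinge_y_intercept((350, 80), (365, 95), 322))
```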
We need an accuracy or tolerance of +/- 3 pixels.
Now, the issues we faced in going from this ideal scenario to a practical working version are:
Different tracked points give different potential hinge points (different dots on the vertical yellow line), a few of which are very close to the ground truth (yellow circle), but their weighted average (big green circle) goes off the mark. Frankly, this is a problem of plenty: we do get candidates that are potentially closest to the ground truth, but we're not sure which of these points is the closest, since we're not supposed to use the default hinge coordinates (entered manually) from the config file.
One solution could be to use frameworks already implemented for image registration such as elastix. If you configure it for a rigid registration, you can get the transformation matrix and therefore the center of the rotation.
The problem here is that only one part of your image is moving. Before doing the registration, I would simply mask the region of interest by calculating a mask from the subtraction of the two images, to keep only the part where something actually moved.
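A hedged sketch of that masking step with OpenCV (the file names and the threshold value are assumptions):

```python
import cv2 as cv
import numpy as np

# Keep only the region that actually moved between the two frames, then feed the
# mask to the registration.
img_a = cv.imread("frame_start.png", cv.IMREAD_GRAYSCALE)
img_b = cv.imread("frame_end.png", cv.IMREAD_GRAYSCALE)

diff = cv.absdiff(img_a, img_b)
_, mask = cv.threshold(diff, 25, 255, cv.THRESH_BINARY)                   # scene-dependent threshold
mask = cv.morphologyEx(mask, cv.MORPH_CLOSE, np.ones((9, 9), np.uint8))   # fill small holes
mask = cv.dilate(mask, np.ones((9, 9), np.uint8))                         # add a safety margin
```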
Such an approach could achieve subpixel accuracy. You could also repeat it for multiple angles and average the results. As an alternative to averaging, you could use the RANSAC algorithm to identify which hinge points are off (outliers) and exclude them.
Here is an example of how to do a simple rigid transformation with elastix.
I hope this helps!
I intended this as only a comment, but it ended up significantly over the character limit:
The problem from an accuracy perspective (sorry, couldn't resist) seems to be that you're trying to use a planar euclidean geometry technique to solve a projective geometry problem.
Those feature tracks are only circular arcs in 3D world space. They're actually (noisy) elliptical arcs in 2D image pixel space due to the projection.
Your hinge rotation axis isn't a single pixel either, unless your camera's optical axis is directly aligned with the hinge axis. If that's not the case (as the perspective in the photo you added suggests), then your hinge axis is actually a line in pixel space, not a point, and different heights for the different tracks in model space will be 'centered' around different pixels on that line. So asking for +/- 3 pixel hinge 'point' accuracy is unclear, and so is measuring angles in pixel space in general in a way that doesn't account for perspective.
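If you want a quick check of how far your arcs deviate from circles, a hedged sketch with cv2.fitEllipse (the track coordinates below are made up):

```python
import cv2 as cv
import numpy as np

# One feature track in pixel coordinates (>= 5 points are required by fitEllipse).
track = np.array([[310.2, 120.5], [315.8, 118.1], [321.1, 116.4],
                  [326.9, 115.2], [332.4, 114.7], [337.8, 114.9]], dtype=np.float32)

(cx, cy), (major, minor), angle = cv.fitEllipse(track)
print(f"centre=({cx:.1f}, {cy:.1f}), axes=({major:.1f}, {minor:.1f}), angle={angle:.1f} deg")
```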
I only mention these details because you seem focused on measuring accurately. Often, those kinds of 2D approximations are fine for many applications, but high accuracy and precision from a single camera (if that's really what you need) requires better 3D scene understanding. (Or you could train a deep network with a bunch of labeled ground truth images and let it figure out the mappings.)
Now maybe you don't need such high accuracy for your application after all. In that case, simple affine geometry techniques like that mentioned in the other answer might work well enough.
The traditional solution for high-resolution images is, for example:
extract features (dense) for all images
match features to find tracks through images
triangulate features to 3D points.
I can point out two problems here for my case (many 640x480 images with small movements between each other). First: matching is very slow, especially if the number of images is big, so a better solution could be optical flow tracking, but that becomes sparse with big moves (a mix could solve the problem!).
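For what it's worth, a minimal KLT tracking sketch (frame file names and parameter values are assumptions); losing too many tracks between frames is the cue to fall back to matching and re-seed the corners, i.e. the "mix" mentioned above:

```python
import cv2 as cv

prev = cv.imread("frame_000.png", cv.IMREAD_GRAYSCALE)
curr = cv.imread("frame_001.png", cv.IMREAD_GRAYSCALE)

# Seed corners in the first frame, then track them into the next frame.
pts = cv.goodFeaturesToTrack(prev, maxCorners=1000, qualityLevel=0.01, minDistance=7)
nxt, status, err = cv.calcOpticalFlowPyrLK(prev, curr, pts, None,
                                           winSize=(21, 21), maxLevel=3)

good_prev = pts[status.ravel() == 1]
good_next = nxt[status.ravel() == 1]
print(f"tracked {len(good_next)} of {len(pts)} corners")
```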
Second: triangulating the tracks. Although it is an over-determined problem, I find it hard to code a solution (here I am asking for a simplification of what I read in the references).
I searched quite a bit for libraries in that direction, with no useful result.
Again, I have ground-truth camera matrices and only need the 3D positions as a first estimate (without bundle adjustment).
A coded software solution would be of great help so that I don't have to reinvent the wheel, though detailed instructions may also be helpful.
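Since the linear solution is the part being asked about, here is a hedged sketch of multi-view DLT triangulation for a single track, assuming known 3x4 projection matrices; it solves the over-determined system in a least-squares sense via SVD (no bundle adjustment):

```python
import numpy as np

def triangulate_track(proj_mats, points_2d):
    """Linear (DLT) triangulation of one track observed in n >= 2 views.
    proj_mats: list of 3x4 projection matrices P_i = K_i [R_i | t_i]
    points_2d: list of (u, v) observations of the same feature."""
    A = []
    for P, (u, v) in zip(proj_mats, points_2d):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    A = np.asarray(A)                  # 2n x 4, over-determined for n > 2
    _, _, Vt = np.linalg.svd(A)        # least-squares solution: last right singular vector
    X = Vt[-1]
    return X[:3] / X[3]                # dehomogenize to a 3D point
```

Applied per track, this gives the kind of first estimate described; cv2.triangulatePoints does the same job for the two-view case.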
This basically shows the underlying geometry for estimating the depth.
As you said, we have the camera poses Q, and we pick a point X in the world; X_L is its projection on the left image. Now, with Q_L, Q_R and X_L, we can construct the green epipolar plane. The rest of the job is easy: we search through points on the line (Q_L, X); this line exactly parametrises the depth of X_L. With different depth hypotheses X1, X2, ..., we get different projections on the right image.
Now we compare the pixel intensity difference between X_L and each reprojected point on the right image, pick the smallest one, and the corresponding depth is exactly what we want.
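A minimal sketch of that depth sweep, assuming grayscale images, shared intrinsics K, and (R, t) the pose of the right camera relative to the left (all names here are hypothetical):

```python
import numpy as np

def depth_sweep(img_l, img_r, K, R, t, uv, depths):
    """Sample candidate depths along the ray through pixel `uv` of the left image,
    reproject each candidate into the right image, and keep the depth whose
    reprojected pixel best matches the left intensity."""
    Kinv = np.linalg.inv(K)
    ray = Kinv @ np.array([uv[0], uv[1], 1.0])        # direction of the line (Q_L, X)
    i_l = float(img_l[uv[1], uv[0]])
    best_d, best_cost = None, np.inf
    for d in depths:                                  # candidate points X1, X2, ...
        X = d * ray
        x_r = K @ (R @ X + t)                         # projection into the right image
        u_r, v_r = int(round(x_r[0] / x_r[2])), int(round(x_r[1] / x_r[2]))
        if 0 <= u_r < img_r.shape[1] and 0 <= v_r < img_r.shape[0]:
            cost = abs(i_l - float(img_r[v_r, u_r]))  # single-pixel intensity difference
            if cost < best_cost:
                best_cost, best_d = cost, d
    return best_d
```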
Pretty easy, hey? The truth is it's way harder: the image intensity along the epipolar line is never strictly convex:
This makes our matching extremely hard, since the non-convex profile causes any distance function to have multiple critical points (candidate matches). How do you decide which one is correct?
To handle this problem, people proposed patch-based matching. Metrics like SAD, SSD and NCC were introduced to make the distance function as convex as possible; still, they cannot cope with large-scale repeated textures or low-texture regions.
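For completeness, hedged reference implementations of two of those patch metrics, over equally sized grayscale patches:

```python
import numpy as np

def ssd(p, q):
    """Sum of squared differences: lower is a better match."""
    d = p.astype(np.float64) - q.astype(np.float64)
    return float(np.sum(d * d))

def ncc(p, q):
    """Normalized cross-correlation: closer to 1 is a better match, and it is
    insensitive to affine brightness changes, unlike SAD/SSD."""
    a = p.astype(np.float64).ravel() - p.mean()
    b = q.astype(np.float64).ravel() - q.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```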
To address this, people started to search over a longer range along the epipolar line, and realised that the whole set of matching scores can be described as a distribution along the depth.
The horizontal axis is depth and the vertical axis is the matching score. This illustration leads us to the depth filter: we usually model this distribution as a Gaussian (hence, Gaussian depth filter) and use it to describe the uncertainty of the depth. Combined with the patch-matching method, we can roughly get a proposal.
Now let's use some optimization tools, like Gauss-Newton or gradient descent, to finally refine the depth estimation.
To sum up, the whole depth-estimation process goes through the following steps:
assume the depth at every pixel follows an initial Gaussian distribution
search along the epipolar line and reproject points into the target frame
triangulate the depth and calculate its uncertainty with the depth filter
run 2 and 3 again to get a new depth distribution and merge it with the previous one; if it has converged then stop, otherwise start again from 2 (see the fusion sketch below).
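A tiny sketch of that merge step, under the Gaussian depth-filter assumption (the numbers in the example are made up):

```python
def fuse_gaussian(mu1, var1, mu2, var2):
    """Fuse the previous depth estimate N(mu1, var1) with a new triangulated
    observation N(mu2, var2); the product of two Gaussians is again Gaussian."""
    var = var1 * var2 / (var1 + var2)
    mu = (mu1 * var2 + mu2 * var1) / (var1 + var2)
    return mu, var

# Prior depth 2.0 m (variance 0.25) fused with a new measurement 1.8 m (variance 0.04):
mu, var = fuse_gaussian(2.0, 0.25, 1.8, 0.04)   # pulled strongly toward 1.8, variance shrinks
```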
A computed-tomography device has a roentgen matrix of 20x500 dots with a resolution of 2 mm in each direction. This matrix rotates around a belt which transports the items to be analysed. A special reconstruction algorithm produces a 3D model of the items from the many matrices captured from all 360 perspectives (one image per 1° of rotation).
The problem is that the reconstruction algorithm is very sensitive to the belt speed/position. Measuring the belt position requires quite complicated and expensive positioning sensors and very fine mechanics.
I wonder if it is possible to calculate the belt velocity from the roentgen image itself. It has a width of 40 mm, which should be sufficient for capturing the movement. The problem is that the movement is always in two directions: rotation and X (the belt). For those working in the CT area, are you aware of any applications or publications about such direct measurement of the belt/table velocity?
P.S.: It is not for a medical application.
Hmm, interesting idea.
Are you doing a full 180 degrees for the reconstruction? I'd go with the 0° and 180° cone-beam images. They should be approximately the same, minus some non-linear effects, artifacts, Poisson noise, and differences in 'shadows' and scattering due to perspective.
You could translate the 180° image along the negative x-axis, i.e. in the direction opposite to the movement, and then subtract the images at suitable intervals along this axis. When the absolute value of the summed difference hits a minimum, the translation should be approximately the distance the object has moved between 0° and 180°, as the mirror images partially cancel each other out.
This could obviously be ruined by artifacts and wonkily shaped heavy objects. Still, worth a try. I'm assuming your voltage is pretty high if you are doing industrial stuff.
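A rough sketch of that shift-and-subtract search (whether the 180° image must be mirrored first depends on your geometry; array names and the search range are assumptions):

```python
import numpy as np

def estimate_belt_shift(img0, img180, max_shift):
    """Slide the (mirrored) 180-degree projection along x against the 0-degree one
    and return the shift, in pixels, that minimizes the mean absolute difference
    over the overlapping columns."""
    ref = img0.astype(np.float64)
    mov = img180[:, ::-1].astype(np.float64)      # assumption: opposite views are mirrored in x
    best_shift, best_score = 0, np.inf
    for s in range(max_shift + 1):
        a = ref[:, s:]                            # overlapping part of the reference
        b = mov[:, :mov.shape[1] - s]             # overlapping part of the shifted image
        score = np.mean(np.abs(a - b))
        if score < best_score:
            best_score, best_shift = score, s
    return best_shift                             # convert to mm via the 2 mm pixel pitch
```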
EDIT: "A special reconstruction algorithm produces a 3D model of the items from the many matrices captured from all 360 perspectives (one image per 1° of rotation)."
So yes, you are using 180+ degrees. You could then perhaps use multiple opposite images for a more robust routine. How do you get the full circle? Are you shooting through the belt?
This is my first question on this forum.
I'm working on a project for my thesis. I have to calibrate my camera so that I can import the intrinsic parameters into photoscan for 3D reconstruction of an object which measures at most 0.7 x 0.7 mm.
I calibrate the camera with OpenCV, photographing a symmetric glass pattern (0.5 x 0.5 mm) with a circle grid. I take 24 photos, 8 for each kind of inclination (horizontal, vertical and oblique).
1) I would like to know how I can evaluate the calibration. I read that the reprojection error isn't an absolute evaluation; can I compare cx and cy with the real centre of the image? Can I evaluate the values of the distortion parameters? (How?)
2) How can I improve my method? Do you think I need this little (and perfect) pattern, or can I calibrate with a chessboard?
Any other suggestions are welcome.
The evaluation of results is one of the hardest tasks in photogrammetry. Therefore the questions are: How accurate do you need to be? Are we talking about accuracies of 1 ppm or 1:1,000? How reliable is your hardware for your goal?
1) The reprojection errors do not really tell you anything reliable on their own. They just tell you how well the chosen function fits the measurements (this is also often referred to as internal accuracy). So if your measurements are garbage, the result protocol will happily tell you how well it could fit your garbage. A reliable evaluation is only possible if you have enough external references to get a good approximation of the external accuracy. This can be achieved with precisely known distances between targets which have not been included in the calibration step, used to scale the system (a tiny sketch of such a check follows this answer). For a solid calibration with a planar calibration body you'll need six of them: two as a cross on the main diagonal and four on each side.
2) How big are the circles in the image? You might need to correct your image measurements for circle eccentricity before starting your calibration. Is your measurement volume two-dimensional? Only in that case is a two-dimensional calibration field a good choice. Circle targets are (at the moment) by a wide margin the most reliable, robust and precise targets. Chessboard targets are mostly used in robotics or computer vision, but not really when you expect some level of precision. Also, the cx, cy approach is a bad choice if you want to achieve some level of precision, since it is arbitrary and has no physical basis. Look for a physical model such as the Brown approach to describe your lens. The parameters are usually referred to as: c (focal length), x0, y0 (principal point), r0, A1, A2, A3 (radially symmetric distortion), B1, B2 (radially asymmetric distortion), C1, C2 (affine distortion).
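As a toy illustration of the external check from point 1 (all values below are hypothetical placeholders, not real measurements):

```python
# Distances between pairs of reference targets that were NOT used in the calibration:
# reconstructed values from the 3D result vs. certified values of the scale bars.
reconstructed = {("T1", "T2"): 250.012, ("T3", "T4"): 249.981}   # mm (hypothetical)
certified     = {("T1", "T2"): 250.000, ("T3", "T4"): 250.000}   # mm (hypothetical)

for pair, d_rec in reconstructed.items():
    d_ref = certified[pair]
    rel = (d_rec - d_ref) / d_ref
    print(f"{pair}: {1000 * (d_rec - d_ref):+.1f} um  ({rel * 1e6:+.1f} ppm)")
```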
Is there any particular reason why we need multiple poses (e.g. varying z or rotation) to obtain the focal length and principal point for the camera matrix? In other words, is it sufficient to calibrate a pinhole camera with a single pose? i.e. by keeping the location of the calibration object (let's say a standard checkerboard) constant?
I assume you are asking in the context of OpenCV-like camera calibration using images of a planar target. The reference for the algorithm used by OpenCV is Z. Zhang's now-classic paper. The discussion in the top half of page 6 shows that n >= 3 images are necessary for calibrating all 5 parameters of a pinhole camera matrix. Imposing constraints on the parameters reduces the number of needed images to a theoretical minimum of one.
In practice you need more for various reasons, among them:
The need to have enough measurements to overcome "noise" and "random" corner detection errors, while using a practical target with well-separated corners.
The difference between measuring data and observing (constraining) model parameters.
Practical limitations of physical lenses, e.g. depth of field.
As an example for the second point, the ideal target pose for calibrating the nonlinear lens distortion (barrel, pincushion, tangential, etc.) is frontal-facing, covering the whole field of view, because it produces a large number of well-separated and aligned corners over the image, all with approximately the same degree of blur. However, this is exactly the worst pose you can use in order to estimate the field of view / focal length, as for that purpose you need to observe significant perspective foreshortening.
Likewise, it is possible to show that the location of the principal point is well constrained by a set of images showing the vanishing points of multiple pencils of parallel lines. This is important because that location is inherently confused by the component parallel to the image plane of the relative motion between camera and target. Thus the vanishing points help "guide" the optimizer's solution toward the correct one, in the common case where the target does translate w.r.t the camera.