I am trying to understand some computer vision topics. One main difference I observed between these two is that In optical flow, the 2nd image is often at time (t+1) whereas in disparity estimation, its often the same time-step unless one is having a static view and using single non-stereo camera.
Is there any other difference and their respective implications ?
As you pointed out, Optical flow represents the displacement of pixels between an image at time t and at time t+1 whereas disparity estimation is the displacement of a pixel between one camera and another.
Strictly speaking these two tasks could be considered identical.
However, in practice, disparity is computed using a "right" and a "left" camera which are horizontally aligned. Therefore, the disparity is only horizontal (and in a single direction due to the laws of optics) and can be represented by a heatmap.
On the contrary, Optical flow is a 2D vector field in which vectors can take any value.
In machine learning, this distinction mainly changes the dimension of the output (1D for disparity and 2D for Optical flow) to predict as well as its scale (positive for disparity vs all real numbers for Optical flow).
I hope my answer was clear :)
Related
Currently I am learning dense optical flow by myself. To understand it, I conduct one experiment. I produce one image using Matlab. One box with a given grays value is placed under one uniform background and the box is translated two pixels in x and y directions in another image. The two images are input into the implementation of the algorithm called TV-L1. The generated motion vector outer of the box is not zero. Is the reason that the gradient outer of the box is zero? Is the values filled in from the values with large gradient value?
In Horn and Schunck's paper, it reads
In parts of the image where the brightness gradient is zero, the velocity
estimates will simply be averages of the neighboring velocity estimates. There
is no local information to constrain the apparent velocity of motion of the
brightness pattern in these areas.
The progress of this filling-in phenomena is similar to the propagation effects
in the solution of the heat equation for a uniform flat plate, where the time rate of change of temperature is proportional to the Laplacian.
Is it not possible to obtain correct motion vectors for pixels with small gradients? Or the experiment is not practical. In practical applications, this doesn't happen.
Yes, in so called homogenous image regions with very small gradients no information where a motion can dervided from exists. That's why the motion from your rectangle is propagated outer the border. If you give your background a texture this effect will be less dominant. I know such problem when it comes to estimate the ego-motion of a car. Then the streat makes a lot of problems cause of here homogenoutiy.
Two pioneers in this field Lukas&Kanade (LK) and Horn&Schunch (HS) are developed methods for computing Optical Flow (OF). Both rely on brightness constancy assumption which feature location pixel values between two sequence frames not change. This constraint may be expressed as two equations: I(x+dx,y+dy,t+dt)=I(x,y,t) and ∂I/∂x dx+∂I/∂y dy+∂I/∂t dt=0 by using a Taylor series expansion I(x+dx,y+dy,t+dt) , we get (x+dx,y+dy,t+dt)=I(x,y,t)+∂I/∂x dx+∂I/∂y dy+∂I/∂t dt… letting ∂x/∂t=u and ∂y/∂t=v and combining these equations we get the OF constraint equation: ∂I/∂t=∂I/∂t u+∂I/∂t v . The OF equation has more than one solution, so the different techniques diverge here. LK equations are derived assuming that pixels in a neighborhood of each tracked feature move with the same velocity as the feature. In OpenCV, to catch large motions with a small window size (to keep the “same local velocity” assumption).
this is my first question in this forum.
I'm working about a project for my thesis. I have to calibrate my camera to import intrinsic parameters in photoscan fo reconstructon 3D of the object which measures maximum 0,7 x 0,7 mm.
I calibrate the camera with openCv, photographing a symmetric pattern glass (0,5x0,5 mm) with circle grid. I do 24 photos, 8 for each kind of inclination ( horizontal vertical and oblique)
1)I would know how can I evaluate the calibration? I read that Reprojection Errors isn't an absolute evaluation, can I compare cx and cy with the real center of the image? Can I evaluate the values of distorsion parameters?(How?)
2) How can improve my method? Do you think that i need of this little ( and perfect) pattern or can I calibrate with chessboard?
Any other suggestion is welcome
The evaluation of results is one of the hardest task in photogrammetry. Therefore questions are: How accurate do you need to be? Are we talking about about accuracies of 1ppm or 1:1,000? How reliable is your hardware for your goal?
1) The reprojection errors do not really yield anything reliable. It just tells you how the chosen function fits into the measurements (is also often referred as internal accuracy). So if your measurements are garbage the result protocol will happily tell you how well it could fit into your garbage. A reliable evaluation is only possible if you have enough external references to get a good approximation for the external accuracy. This can be achieved with precise known distances between targets which have been not included in the calibration step to scale the systems. For a solid calibration with a plane calibration body you'll need six of them. Two as a cross on the main diagonal and four on each side.
2) How big are the circles in the image? You might need to correct your image measurements for circle eccentricity before starting your calibration. Is your measurement volume two dimensional? Only in that case a two dimensional calibration field is a good choice. Circle targets are (at the moment) with a huge distance the most reliable,robust and precise targets. Chessboard targets are mostly used in robotics or computer vision but not really when you expect some level of precision. Also the cx, cy approach is a bad choice if you want to achieve some level of precision since it's arbitrary and has no physical basis. Look for a physical equation like the Brown approach to describe your lens. The parameters are mostly referred as: c (focal length), x0,y0 (principal point) ,r0,A1,A2,A3 (radial symmetric distortion),B1,B2 (radial asymmetric distortion) ,C1,C2 (affine distortion).
Is there any particular reason why we need multiple poses (e.g. varying z or rotation) to obtain the focal length and principal point for the camera matrix? In other words, is it sufficient to calibrate a pinhole camera with a single pose? i.e. by keeping the location of the calibration object (let's say a standard checkerboard) constant?
I assume you are asking in the context of OpenCV-like camera calibration using images of a planar target. The reference for the algorithm used by OpenCV is Z. Zhang's now classic paper . The discussion in the top half of page 6 shows that n >= 3 images are necessary for calibrating all 5 parameters of a pinhole camera matrix. Imposing constraints on the parameters reduces the number of needed images to a theoretical minimum of one.
In practice you need more for various reasons, among them:
The need to have enough measurements to overcome "noise" and "random" corner detection errors, while using a practical target with well-separated corners.
The difference between measuring data and observing (constraining) model parameters.
Practical limitations of physical lenses, e.g. depth of field.
As an example for the second point, the ideal target pose for calibrating the nonlinear lens distortion (barrel, pincushion, tangential, etc.) is frontal-facing, covering the whole field of view, because it produces a large number of well-separated and aligned corners over the image, all with approximately the same degree of blur. However, this is exactly the worst pose you can use in order to estimate the field of view / focal length, as for that purpose you need to observe significant perspective foreshortening.
Likewise, it is possible to show that the location of the principal point is well constrained by a set of images showing the vanishing points of multiple pencils of parallel lines. This is important because that location is inherently confused by the component parallel to the image plane of the relative motion between camera and target. Thus the vanishing points help "guide" the optimizer's solution toward the correct one, in the common case where the target does translate w.r.t the camera.
I am doing a research in stereo vision and I am interested in accuracy of depth estimation in this question. It depends of several factors like:
Proper stereo calibration (rotation, translation and distortion extraction),
image resolution,
camera and lens quality (the less distortion, proper color capturing),
matching features between two images.
Let's say we have a no low-cost cameras and lenses (no cheap webcams etc).
My question is, what is the accuracy of depth estimation we can achieve in this field?
Anyone knows a real stereo vision system that works with some accuracy?
Can we achieve 1 mm depth estimation accuracy?
My question also aims in systems implemented in opencv. What accuracy did you manage to achieve?
Q. Anyone knows a real stereo vision system that works with some accuracy? Can we achieve 1 mm depth estimation accuracy?
Yes, you definitely can achieve 1mm (and much better) depth estimation accuracy with a stereo rig (heck, you can do stereo recon with a pair of microscopes). Stereo-based industrial inspection systems with accuracies in the 0.1 mm range are in routine use, and have been since the early 1990's at least. To be clear, by "stereo-based" I mean a 3D reconstruction system using 2 or more geometrically separated sensors, where the 3D location of a point is inferred by triangulating matched images of the 3D point in the sensors. Such a system may use structured light projectors to help with the image matching, however, unlike a proper "structured light-based 3D reconstruction system", it does not rely on a calibrated geometry for the light projector itself.
However, most (likely, all) such stereo systems designed for high accuracy use either some form of structured lighting, or some prior information about the geometry of the reconstructed shapes (or a combination of both), in order to tightly constrain the matching of points to be triangulated. The reason is that, generally speaking, one can triangulate more accurately than they can match, so matching accuracy is the limiting factor for reconstruction accuracy.
One intuitive way to see why this is the case is to look at the simple form of the stereo reconstruction equation: z = f b / d. Here "f" (focal length) and "b" (baseline) summarize the properties of the rig, and they are estimated by calibration, whereas "d" (disparity) expresses the match of the two images of the same 3D point.
Now, crucially, the calibration parameters are "global" ones, and they are estimated based on many measurements taken over the field of view and depth range of interest. Therefore, assuming the calibration procedure is unbiased and that the system is approximately time-invariant, the errors in each of the measurements are averaged out in the parameter estimates. So it is possible, by taking lots of measurements, and by tightly controlling the rig optics, geometry and environment (including vibrations, temperature and humidity changes, etc), to estimate the calibration parameters very accurately, that is, with unbiased estimated values affected by uncertainty of the order of the sensor's resolution, or better, so that the effect of their residual inaccuracies can be neglected within a known volume of space where the rig operates.
However, disparities are point-wise estimates: one states that point p in left image matches (maybe) point q in right image, and any error in the disparity d = (q - p) appears in z scaled by f b. It's a one-shot thing. Worse, the estimation of disparity is, in all nontrivial cases, affected by the (a-priori unknown) geometry and surface properties of the object being analyzed, and by their interaction with the lighting. These conspire - through whatever matching algorithm one uses - to reduce the practical accuracy of reconstruction one can achieve. Structured lighting helps here because it reduces such matching uncertainty: the basic idea is to project sharp, well-focused edges on the object that can be found and matched (often, with subpixel accuracy) in the images. There is a plethora of structured light methods, so I won't go into any details here. But I note that this is an area where using color and carefully choosing the optics of the projector can help a lot.
So, what you can achieve in practice depends, as usual, on how much money you are willing to spend (better optics, lower-noise sensor, rigid materials and design for the rig's mechanics, controlled lighting), and on how well you understand and can constrain your particular reconstruction problem.
I would add that using color is a bad idea even with expensive cameras - just use the gradient of gray intensity. Some producers of high-end stereo cameras (for example Point Grey) used to rely on color and then switched to grey. Also consider a bias and a variance as two components of a stereo matching error. This is important since using a correlation stereo, for example, with a large correlation window would average depth (i.e. model the world as a bunch of fronto-parallel patches) and reduce the bias while increasing the variance and vice versa. So there is always a trade-off.
More than the factors you mentioned above, the accuracy of your stereo will depend on the specifics of the algorithm. It is up to an algorithm to validate depth (important step after stereo estimation) and gracefully patch the holes in textureless areas. For example, consider back-and-forth validation (matching R to L should produce the same candidates as matching L to R), blob noise removal (non Gaussian noise typical for stereo matching removed with connected component algorithm), texture validation (invalidate depth in areas with weak texture), uniqueness validation (having a uni-modal matching score without second and third strong candidates. This is typically a short cut to back-and-forth validation), etc. The accuracy will also depend on sensor noise and sensor's dynamic range.
Finally you have to ask your question about accuracy as a function of depth since d=f*B/z, where B is a baseline between cameras, f is focal length in pixels and z is the distance along optical axis. Thus there is a strong dependence of accuracy on the baseline and distance.
Kinect will provide 1mm accuracy (bias) with quite large variance up to 1m or so. Then it sharply goes down. Kinect would have a dead zone up to 50cm since there is no sufficient overlap of two cameras at a close distance. And yes - Kinect is a stereo camera where one of the cameras is simulated by an IR projector.
I am sure with probabilistic stereo such as Belief Propagation on Markov Random Fields one can achieve a higher accuracy. But those methods assume some strong priors about smoothness of object surfaces or particular surface orientation. See this for example, page 14.
If you wan't to know a bit more about accuracy of the approaches take a look at this site, although is no longer very active the results are pretty much state of the art. Take into account that a couple of the papers presented there went to create companies. What do you mean with real stereo vision system? If you mean commercial there aren't many, most of the commercial reconstruction systems work with structured light or directly scanners. This is because (you missed one important factor in your list), the texture is a key factor for accuracy (or even before that correctness); a white wall cannot be reconstructed by a stereo system unless texture or structured light is added. Nevertheless, in my own experience, systems that involve variational matching can be very accurate (subpixel accuracy in image space) which is generally not achieved by probabilistic approaches. One last remark, the distance between cameras is also important for accuracy: very close cameras will find a lot of correct matches and quickly but the accuracy will be low, more distant cameras will find less matches, will probably take longer but the results could be more accurate; there is an optimal conic region defined in many books.
After all this blabla, I can tell you that using opencv one of the best things you can do is do an initial cameras calibration, use Brox's optical flow to find find matches and reconstruct.
I am working with a set of calibrated images that form a ring around a foreground object (1). I used Fusiello's method (1) to rectify adjacent pairs of images, and then I performed disparity estimation.
When I take the matched points from a stereo pair and triangulate them, it forms an accurate point cloud. Unfortunately, when I triangulate the points from another stereo image pair, this point cloud never aligns correctly with the original cloud.
Should calibrated, rectified images' point clouds merge together automatically?
Thanks in advance for any help you can offer.
This might be due to the accuracy of calibration - both intrinsic (i.e. the same camera model - and how it handles distortion) and extrinsic (i.e. the camera pose in real space). Together, of course, these dictate the ultimate accuracy of your re-projection.
Do you have a measure of error for camera calibration - in terms of MSE re-projection?
Cumulative error is often noticeable in my experience if simply iterating over subsequent images. Some form of global optimisation often needs to be performed to first correct positions for all the camera poses.
The accuracy of your disparity estimation is also a factor. Not only in terms of the algorithm you using, but also in relation to the stereo baseline and how it relates to the size/nature of the object in question (how concave/convex), and how many sampling of the images you are taking (and the quality of those images - exposure/depth-of-field/etc).
Fundamentally, just how "off" are your point clouds? Are they close to being aligned (you could do a bit of ICP before triangulation...). Are they closer in the "centre" of the re-projection? Are they worse for projections taken from opposing images on opposite sides of the object?
Remember as well that (due to the discrete sampling) you shouldn't expect points to ever be re-projected exactly "on-top" on one another. Some form of binning operation during the triangulation pipeline usually occurs for handling this (hence most of the research work in visual hull -> voxels -> marching cubes -> triangulated surface around this...)
Have you checked out MeshLab BTW?