I am doing a project on stereo vision; the system should estimate distance in real time to avoid collisions. The problem is that I am not able to decide on a proper baseline value. In the formula, what is the value of the disparity?
Look at the formula that relates the baseline to disparity:
D = focal * Baseline / z
where the focal length is in pixels, the Baseline is in mm, and z, the distance along the optical axis, is also in mm. Pick your Baseline so that you have a few pixels of disparity at the longest working distance. Keep in mind, though, that while a long Baseline accomplishes this, at closer distances you get a larger dead zone where the cameras' fields of view do not overlap enough for a meaningful disparity calculation.
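To get a feel for the numbers, here is a minimal sketch (the focal length and the working distances are made-up example values) that prints the expected disparity at the near and far ends of the range for a few candidate Baselines:

```python
# Rough sketch: check the expected disparity for candidate baselines
# using D = focal * Baseline / z. All numbers are example values.

focal_px = 800.0          # focal length in pixels (assumed)
z_min_mm = 2000.0         # closest working distance (mm)
z_max_mm = 30000.0        # longest working distance (mm)

for baseline_mm in (400.0, 500.0, 600.0):
    d_far = focal_px * baseline_mm / z_max_mm   # disparity at the far end
    d_near = focal_px * baseline_mm / z_min_mm  # disparity at the near end
    print(f"B={baseline_mm:.0f} mm: disparity {d_far:.1f} px at {z_max_mm/1000:.0f} m, "
          f"{d_near:.1f} px at {z_min_mm/1000:.0f} m")
```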
Also, when selecting the resolution for your images, don't go too high: stereo processing is computationally intensive, and a higher resolution can also mean more noise. Typically people don't use color in stereo matching for the same reason. For your task, an algorithm that uses grayscale VGA images and runs at 20 fps or more with a Baseline of 40-60 cm may be a reasonable choice given a vehicle speed below 40 mph.
I'm working with a stereo camera setup that has auto-focus (which I cannot turn off) and a really low baseline of less than 1 cm.
The auto-focus process can change any intrinsic parameter of both cameras (focal length and principal point, for example), and without a fixed relation (the left camera may increase focus while the right one decreases it). Luckily, the cameras always report the current state of their intrinsics with great precision.
On every frame an object of interest is detected and disparities between the camera images are calculated. As the baseline is quite low and the resolution is not the greatest, stereo triangulation gives quite poor results, so several subsequent computer vision algorithms rely only on image keypoints and disparities.
Now, disparities calculated between stereo frames cannot be directly related: if the principal point changes, the disparities will have very different magnitudes after the auto-focus process.
Is there any way to relate keypoint corners and/or disparities between frames after an auto-focus step? For example, can I calculate where the object would lie in the image under the previous intrinsics?
Maybe by using a bearing vector towards the object and then looking for its intersection with the image plane defined by the previous intrinsics?
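Something along these lines is what I have in mind (a rough sketch with made-up intrinsics, ignoring distortion); since the camera pose does not change, the mapping should reduce to the homography K_old · K_new⁻¹:

```python
import numpy as np

def remap_pixel(p_new, K_new, K_old):
    """Map a pixel observed under intrinsics K_new to where it would lie
    under the previous intrinsics K_old (same pose, distortion ignored):
    back-project to a bearing vector, then re-project with K_old."""
    p_h = np.array([p_new[0], p_new[1], 1.0])
    ray = np.linalg.inv(K_new) @ p_h          # bearing vector (up to scale)
    p_old = K_old @ ray
    return p_old[:2] / p_old[2]

# Example with made-up intrinsics (fx, fy, cx, cy):
K_new = np.array([[900., 0., 321.], [0., 900., 243.], [0., 0., 1.]])
K_old = np.array([[880., 0., 318.], [0., 880., 240.], [0., 0., 1.]])
print(remap_pixel((400.0, 300.0), K_new, K_old))
```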
Your project is quite challenging; perhaps these patents could help you in some way:
Stereo yaw correction using autofocus feedback
Autofocus for stereo images
Depth information for auto focus using two pictures and two-dimensional Gaussian scale space theory
I have software working with an industrial-grade camera and a manual lens (focus and aperture have to be set by hand).
I can control the camera's exposure time and gain.
I did some histogram analysis to check the exposure of the image.
Now I am looking for a method to convert the mean grayscale intensity into an exposure value.
The goal is to calculate an exposure time for a fixed aperture setting and the current lighting condition. Since the exposure value is Ev = Av + Tv (Av: aperture value or f-stops, Tv: time value, i.e. exposure time), I hope there is some conversion from grayscale intensity to exposure value.
I would like to give you my solution.
What I have found out is that normal cameras are not capable of measuring absolute brightness. In general, an image sensor cannot distinguish between bright colors such as white and bright lighting conditions.
Anyway, I implemented a histogram that measures the grayscale intensity. The mean value is then extracted and scaled to a range of 256. The goal is to reach a mean value of 128.
So I use the measured histogram mean value as input for a PI controller which controls the exposure time.
This way I create a link between the histogram mean value and the exposure time.
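As a rough illustration, the loop looks something like this (gains and limits are placeholder values to tune per camera):

```python
import numpy as np

class ExposureController:
    """PI controller that drives the mean grayscale intensity toward 128,
    as described above. Gains and limits are placeholder values to tune."""

    def __init__(self, kp=0.05, ki=0.01, target=128.0,
                 t_min_us=50.0, t_max_us=50000.0, t_init_us=5000.0):
        self.kp, self.ki, self.target = kp, ki, target
        self.t_min, self.t_max = t_min_us, t_max_us
        self.exposure_us = t_init_us
        self.integral = 0.0

    def update(self, gray_frame):
        mean = float(np.mean(gray_frame))   # histogram mean on a 0..255 scale
        error = self.target - mean
        self.integral += error
        # Multiplicative step: exposure scales roughly linearly with brightness.
        self.exposure_us *= 1.0 + (self.kp * error + self.ki * self.integral) / self.target
        self.exposure_us = min(max(self.exposure_us, self.t_min), self.t_max)
        return self.exposure_us
```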
I think you may want to consider the histogram as providing the dynamic range required. Ansel Adams, and others, sometimes referred to this as Zones, so your 18% is Zone V (5). If your image's dynamic range is clipping at the high value (255) or at the minimum (0), then you may need 2 or more images: one with less Ev (a higher f-number or shorter exposure) so the highlights are not clipped ("blown out" in photo speak), and another shot with more exposure to ensure that shadow detail is not lost ("blocked" in photo speak).
If you have an image that stays within the 1-255 range, then you could rescale the image to have a mean value of 18%, or have some calibration from light reading to exposure or Ev.
Generally you want the histogram both for computing the mean (EV or crest factor) and as a way to find the min/max in order to determine whether you need more exposure/gain or less.
Of course, if your picture is flat with respect to brightness, then 128 is perfect. If there are a few bright sources and otherwise a "normal scene", then a mean value closer to 18% is statistically better (~46).
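To make that concrete, here is a sketch of the kind of histogram check I mean (the clipping threshold and the ~18% target, about 46 on a 0-255 scale, are values to adjust):

```python
import numpy as np

def analyse_exposure(gray, clip_frac=0.01, target=46.0):
    """Report the histogram mean and whether a noticeable fraction of pixels
    is clipped at 0 ("blocked") or 255 ("blown out"). `gray` is a uint8 image;
    clip_frac and the ~18% grey target are assumptions to adjust."""
    hist = np.bincount(gray.ravel(), minlength=256)
    n = hist.sum()
    blocked = hist[0] / n > clip_frac      # shadows clipped
    blown = hist[255] / n > clip_frac      # highlights clipped
    mean = float((np.arange(256) * hist).sum() / n)
    return {
        "mean": mean,
        "needs_more_exposure": mean < target and not blown,
        "needs_less_exposure": mean > target and not blocked,
        "consider_bracketing": blocked and blown,   # both ends clipped: take 2+ shots
    }
```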
I am calibrating an industrial AVT camera. Is it OK if I focus on the plane where I will do my measurements at f/4, then close the aperture to f/16, calibrate the internal parameters of the camera, and then open the aperture back to f/4? Will the calibration change with the changing aperture? I know that none of the parameters (focal length, principal point, lens distortion) should physically change, but is there really no effect?
I am not changing the focus (focal length). I need the smaller aperture for the larger depth of field during calibration and the larger aperture for shorter exposure times during my measurements.
I think the short answer is: No it doesn't.
The calibration should be the same (within experimental limits) at different apertures. The aperture only affects the depth of field and the amount of light entering the camera. The focal length, principal point, lens distortions, etc. don't change, although your ability to measure them accurately may be affected by the quality of the image you get.
Maybe a larger aperture could in theory capture a better approximation of the lens distortion (although reading this article makes me doubt my own words), but if you calibrate at a wide aperture and then capture at a smaller aperture, this should not be a problem. Only if your lens were seriously distorted would it be an issue (IMHO). The linked article says this:
The size of the stop has no effect on the distortion, as the chief ray does not alter its route when the aperture is made smaller or larger.
It would presumably be a simple procedure to perform camera calibration at different aperture settings and see if the results are similar. Certainly I know of no way to infer the aperture setting from a camera calibration matrix, which implies that this information itself is not captured.
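If you want to test this yourself, a rough sketch of the comparison with OpenCV (folder names, chessboard size and square size are placeholders):

```python
import glob
import cv2
import numpy as np

def calibrate(image_glob, pattern=(9, 6), square_mm=25.0):
    """Calibrate from chessboard images; returns (K, dist, rms).
    Pattern and square size are assumptions for illustration."""
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square_mm
    obj_pts, img_pts, size = [], [], None
    for path in glob.glob(image_glob):
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        found, corners = cv2.findChessboardCorners(gray, pattern)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)
            size = gray.shape[::-1]
    rms, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, size, None, None)
    return K, dist, rms

# Calibrate from two image sets shot at different apertures and compare.
K16, d16, rms16 = calibrate("calib_f16/*.png")
K4, d4, rms4 = calibrate("calib_f4/*.png")
print("focal length / principal point difference:\n", K16 - K4)
print("distortion difference:", d16 - d4, " RMS errors:", rms16, rms4)
```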
It depends on your lens and exactly what values you are calibrating.
If you are just setting the back focus distance, or if you are only measuring in the center of the field of view, then there is no problem.
Without knowing the focal length, widest aperture, and the general quality of the lens, it's not possible to give a specific answer.
For example, if the widest aperture is f/1.4, then the lens should perform well at f/4. But if f/4 is the widest aperture, chances are you will see a lot of aberrations.
Generally speaking, if you start with the widest aperture of a camera lens and stop down, the resolving power in the center of the field of view increases for about 2 stops, and the resolving power in the periphery of the field improves for about 4 stops.
Beyond that, if you continue to stop down, there is no improvement in image quality (only increased depth-of-field and a dimmer image, as a previous answer correctly states). Eventually, as the physical diameter of the f/stop becomes small, resolving power throughout the field will decrease due to diffraction.
For example, on a "full frame" (35 x 24 mm format) digital camera with
a good lens, f/22 is noticeably less sharp than say, f/8.
Unfortunately, geometrical (Gaussian) analysis cannot predict the behavior of real lenses, for two reasons: aberrations and distortion.
Aberrations are imperfections in design, materials and/or manufacturing that can be corrected by additional elements, better glass, closer manufacturing tolerances, etc. -- but only to a point and only at a price. Ideal lenses exist only in theory; real lenses always perform best (i.e. greatest resolving power) for paraxial rays (traveling near the optical axis).
Not all aberrations are equally affected by stopping down.
Most improvement: higher-order spherical
Much improvement: spherical, oblique spherical and coma
Some improvement: astigmatism, field curvature, axial chromatic
Not affected: lateral chromatic
Geometrical (Petzval) distortion (technically not an aberration) also is not affected by stopping down.
Diffraction, on the other hand, is a fundamental law of optics -- you just have to live with it. Diffraction varies inversely with the physical diameter of the aperture: the smaller the diameter, the bigger the angular size of the Airy disk. As we all know, the f-number is the focal length divided by the diameter -- so f/16 is a much smaller hole on an f=50 mm lens than on an f=150 mm lens.
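As a rough back-of-the-envelope check (assuming green light at 550 nm): the diameter of the Airy disk on the sensor is about 2.44 · λ · N, so it grows with the f-number regardless of focal length:

```python
# Back-of-the-envelope diffraction check: Airy disk diameter (first minimum)
# on the sensor is roughly 2.44 * wavelength * f-number.

wavelength_um = 0.55                      # green light (assumed)
for f_number in (2.8, 8, 16, 22):
    airy_um = 2.44 * wavelength_um * f_number
    print(f"f/{f_number:>4}: Airy disk ≈ {airy_um:5.1f} µm "
          f"(vs. a typical 4-6 µm pixel pitch)")
```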
Traditional methods of measuring diffraction and confusion by diameters at the projected image (film or sensor) -- rather than by resolving power at the object -- tend to understate the performance of longer focal length lenses and the depth-of-field of larger formats. But MTF charts tell the real story about the former: the best performing lens in any manufacturer's catalog is a long lens or telephoto.
Diffraction is why pinholes -- which have no aberrations (and no distortion if properly designed) -- are not sharp.
Smaller aperture diameters always have more diffraction (i.e., a larger Airy disk), but diffraction is only significant when the Airy disk is larger than the lens's circle of confusion. The better corrected the lens, the closer it is to being "diffraction limited" -- the technical term for an ideal optical system.
More information:
https://www.diyphotography.net/what-actually-happens-when-you-stop-down-a-lens/
I am doing research in stereo vision, and in this question I am interested in the accuracy of depth estimation. It depends on several factors such as:
Proper stereo calibration (rotation, translation and distortion extraction),
image resolution,
camera and lens quality (less distortion, proper color capture),
matching features between two images.
Let's say we are not using low-cost cameras and lenses (no cheap webcams, etc.).
My question is, what is the accuracy of depth estimation we can achieve in this field?
Does anyone know of a real stereo vision system that achieves good accuracy?
Can we achieve 1 mm depth estimation accuracy?
My question also concerns systems implemented in OpenCV. What accuracy did you manage to achieve?
Q. Does anyone know of a real stereo vision system that achieves good accuracy? Can we achieve 1 mm depth estimation accuracy?
Yes, you definitely can achieve 1mm (and much better) depth estimation accuracy with a stereo rig (heck, you can do stereo recon with a pair of microscopes). Stereo-based industrial inspection systems with accuracies in the 0.1 mm range are in routine use, and have been since the early 1990's at least. To be clear, by "stereo-based" I mean a 3D reconstruction system using 2 or more geometrically separated sensors, where the 3D location of a point is inferred by triangulating matched images of the 3D point in the sensors. Such a system may use structured light projectors to help with the image matching, however, unlike a proper "structured light-based 3D reconstruction system", it does not rely on a calibrated geometry for the light projector itself.
However, most (likely, all) such stereo systems designed for high accuracy use either some form of structured lighting, or some prior information about the geometry of the reconstructed shapes (or a combination of both), in order to tightly constrain the matching of points to be triangulated. The reason is that, generally speaking, one can triangulate more accurately than they can match, so matching accuracy is the limiting factor for reconstruction accuracy.
One intuitive way to see why this is the case is to look at the simple form of the stereo reconstruction equation: z = f b / d. Here "f" (focal length) and "b" (baseline) summarize the properties of the rig, and they are estimated by calibration, whereas "d" (disparity) expresses the match of the two images of the same 3D point.
Now, crucially, the calibration parameters are "global" ones, and they are estimated based on many measurements taken over the field of view and depth range of interest. Therefore, assuming the calibration procedure is unbiased and that the system is approximately time-invariant, the errors in each of the measurements are averaged out in the parameter estimates. So it is possible, by taking lots of measurements, and by tightly controlling the rig optics, geometry and environment (including vibrations, temperature and humidity changes, etc), to estimate the calibration parameters very accurately, that is, with unbiased estimated values affected by uncertainty of the order of the sensor's resolution, or better, so that the effect of their residual inaccuracies can be neglected within a known volume of space where the rig operates.
However, disparities are point-wise estimates: one states that point p in left image matches (maybe) point q in right image, and any error in the disparity d = (q - p) appears in z scaled by f b. It's a one-shot thing. Worse, the estimation of disparity is, in all nontrivial cases, affected by the (a-priori unknown) geometry and surface properties of the object being analyzed, and by their interaction with the lighting. These conspire - through whatever matching algorithm one uses - to reduce the practical accuracy of reconstruction one can achieve. Structured lighting helps here because it reduces such matching uncertainty: the basic idea is to project sharp, well-focused edges on the object that can be found and matched (often, with subpixel accuracy) in the images. There is a plethora of structured light methods, so I won't go into any details here. But I note that this is an area where using color and carefully choosing the optics of the projector can help a lot.
So, what you can achieve in practice depends, as usual, on how much money you are willing to spend (better optics, lower-noise sensor, rigid materials and design for the rig's mechanics, controlled lighting), and on how well you understand and can constrain your particular reconstruction problem.
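To make the scaling concrete: differentiating z = f b / d gives Δz ≈ z² Δd / (f b), so a fixed matching error grows quadratically with distance. A small numerical sketch with assumed rig parameters:

```python
# Depth error for a given disparity (matching) error, from z = f*b/d.
# Focal length, baseline and matching error are assumed example values.

focal_px = 1400.0      # focal length in pixels
baseline_mm = 120.0    # baseline in mm
d_err_px = 0.25        # assumed matching uncertainty (quarter of a pixel)

for z_mm in (250.0, 500.0, 1000.0, 2000.0):
    z_err_mm = (z_mm ** 2) * d_err_px / (focal_px * baseline_mm)
    print(f"z = {z_mm/1000:.2f} m -> depth error ≈ {z_err_mm:.2f} mm")
```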
I would add that using color is a bad idea even with expensive cameras - just use the gradient of the gray intensity. Some producers of high-end stereo cameras (for example Point Grey) used to rely on color and then switched to gray. Also consider bias and variance as the two components of the stereo matching error. This matters because correlation stereo, for example, with a large correlation window averages depth (i.e. models the world as a bunch of fronto-parallel patches), which reduces the variance while increasing the bias, and vice versa. So there is always a trade-off.
More than the factors you mentioned above, the accuracy of your stereo will depend on the specifics of the algorithm. It is up to the algorithm to validate depth (an important step after stereo estimation) and gracefully patch the holes in textureless areas. For example, consider back-and-forth validation (matching R to L should produce the same candidates as matching L to R), blob noise removal (non-Gaussian noise typical for stereo matching, removed with a connected-component algorithm), texture validation (invalidate depth in areas with weak texture), uniqueness validation (requiring a unimodal matching score without strong second and third candidates; this is typically a shortcut to back-and-forth validation), etc. The accuracy will also depend on sensor noise and the sensor's dynamic range.
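As an illustration, here is a sketch of the back-and-forth (left-right) validation, assuming you already have dense disparity maps for both directions (the tolerance is a value to tune):

```python
import numpy as np

def lr_consistency_mask(disp_left, disp_right, tol=1.0):
    """Keep a left disparity only if following it into the right image and
    looking up the right disparity lands back near the starting pixel.
    `tol` is in pixels; disparities <= 0 are treated as invalid."""
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    x_right = np.clip((xs - disp_left).astype(int), 0, w - 1)  # matched column in the right image
    d_back = disp_right[np.arange(h)[:, None], x_right]        # disparity seen from the right image
    valid = np.abs(disp_left - d_back) <= tol
    valid &= disp_left > 0
    return valid
```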
Finally, you have to ask your question about accuracy as a function of depth, since d = f*B/z, where B is the baseline between the cameras, f is the focal length in pixels, and z is the distance along the optical axis. Thus there is a strong dependence of accuracy on baseline and distance.
Kinect will provide 1 mm accuracy (bias) with quite a large variance up to 1 m or so; beyond that the accuracy drops sharply. Kinect also has a dead zone up to about 50 cm, since there is not sufficient overlap of the two cameras at close distances. And yes - Kinect is a stereo camera where one of the cameras is simulated by an IR projector.
I am sure with probabilistic stereo such as Belief Propagation on Markov Random Fields one can achieve a higher accuracy. But those methods assume some strong priors about smoothness of object surfaces or particular surface orientation. See this for example, page 14.
If you want to know a bit more about the accuracy of the different approaches, take a look at this site; although it is no longer very active, the results are pretty much state of the art. Take into account that a couple of the papers presented there went on to create companies. What do you mean by a real stereo vision system? If you mean commercial, there aren't many; most commercial reconstruction systems work with structured light or directly with scanners. This is because texture (an important factor you missed in your list) is key for accuracy, or even before that for correctness: a white wall cannot be reconstructed by a stereo system unless texture or structured light is added. Nevertheless, in my own experience, systems that involve variational matching can be very accurate (subpixel accuracy in image space), which is generally not achieved by probabilistic approaches. One last remark: the distance between the cameras is also important for accuracy. Very close cameras will find a lot of correct matches quickly, but the accuracy will be low; more distant cameras will find fewer matches and will probably take longer, but the results could be more accurate. There is an optimal conic region defined in many books.
After all this blabla, I can tell you that with OpenCV one of the best things you can do is an initial camera calibration, then use Brox's optical flow to find matches and reconstruct.
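Something like the following sketch of that pipeline, with Farneback flow standing in for Brox (OpenCV's Brox implementation lives in the CUDA optical flow module); the file names, intrinsics and baseline are placeholders:

```python
import cv2
import numpy as np

# Rectified stereo pair (placeholder file names).
imgL = cv2.imread("left_rectified.png", cv2.IMREAD_GRAYSCALE)
imgR = cv2.imread("right_rectified.png", cv2.IMREAD_GRAYSCALE)
h, w = imgL.shape

# Projection matrices would normally come from your calibration/rectification;
# here an assumed K and a 120 mm baseline along x are used for illustration.
K = np.array([[900., 0., w / 2], [0., 900., h / 2], [0., 0., 1.]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-120.0], [0.0], [0.0]])])

# Dense matches from optical flow (Farneback as a stand-in for Brox).
flow = cv2.calcOpticalFlowFarneback(imgL, imgR, None, 0.5, 5, 21, 3, 7, 1.5, 0)

ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
ptsL = np.stack([xs.ravel(), ys.ravel()])                     # 2xN points in the left image
ptsR = np.stack([(xs + flow[..., 0]).ravel(),
                 (ys + flow[..., 1]).ravel()])                # matched points in the right image

pts4d = cv2.triangulatePoints(P1, P2, ptsL, ptsR)
pts3d = (pts4d[:3] / pts4d[3]).T.reshape(h, w, 3)             # per-pixel 3D points (mm)
```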
I am generating a disparity map from 2 stereoscopic images, and then I use the normal triangulation formula depth = focallength * baseline / disparity to get the depth. How can I check that the recovered depth is indeed correct? Is there some test for this? I guess there are some tweakable parameters, like multiplying this depth by some factor, but again that is more trial and error; I am looking for something more concrete. How do people in the vision community generally verify their results?
I suggest you verify the depth measured in your images by measuring it in the real world. If there were a way to verify the measurement using only the images, then you probably would have used that way to measure depth in the first place.
Measure the distance from your camera to some object in the real world, and measure the size of the object perpendicular to the axis of one of the cameras. Then also measure the distance and size in your images. Use the size measured in the real world, combined with the size of the object in pixels in the image, to scale the distance you calculate. The result should match the distance you measured.
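For example, a tiny sketch of that check (all numbers are placeholder measurements; the size-based depth uses the same pinhole model, z = f · W / w_px):

```python
# Compare triangulated depth against a tape-measure distance and against the
# depth implied by the object's known physical size. All values are placeholders.

focal_px = 1200.0        # focal length in pixels (from calibration)
baseline_m = 0.10        # baseline in metres (from calibration)

disparity_px = 24.0      # measured disparity of the object in the images
depth_m = focal_px * baseline_m / disparity_px

measured_distance_m = 5.0    # distance measured with a tape measure
object_width_m = 0.50        # known physical width of the object
object_width_px = 118.0      # width of the object in the image, in pixels

# Depth implied by the object's apparent size (pinhole model).
depth_from_size_m = focal_px * object_width_m / object_width_px

print(f"depth from disparity:   {depth_m:.2f} m")
print(f"depth from object size: {depth_from_size_m:.2f} m")
print(f"tape measure:           {measured_distance_m:.2f} m")
```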