Can you, please, suggest me ways of determining the distance between camera and a pixel in an image (in real world units, that is cm/m/..).
The information I have is: camera horizontal (120 degrees) and vertical (90 degrees) field of view, camera angle (-5 degrees) and the height at which the camera is placed (30 cm).
I'm not sure if this is everything I need. Please tell me what information should I have about the camera and how can I calculate the distance between camera and one pixel?
May be it isn't right to tell 'distance between camera and pixel ', but I guess it is clear what I mean. Please write in the comments if something isn't clear.
Thank you in advance!
What I think you mean is, "how can I calculate the depth at every pixel with a single camera?" Without adding some special hardware this is not feasible, as Rotem mentioned in the comments. There are exceptions, and though I expect you may be limited in time or budget, I'll list a few.
If you want to find depths so that your toy car can avoid collisions, then you needn't assume that depth measurement is required. Google "optical flow collision avoidance" and see if that meets your needs.
If instead you want to measure depth as part of some Simultaneous Mapping and Localization (SLAM) scheme, then that's a different problem to solve. Though difficult to implement, and perhaps not remotely feasible for a toy car project, there are a few ways to measure distance using a single camera:
Project patterns of light, preferably with one or more laser lines or laser spots, and determine depth based on how the dots diverge or converge. The Kinect version 1 operates on this principle of "structured light," though the implementation is much too complicated to reproduce completely. For a collision warning simple you can apply the same principles, only more simply. For example, if the projected light pattern on the right side of the image changes quickly, turn left! Learning how to estimate distance using structured light is a significant project to undertake, but there are plenty of references.
Split the optical path so that one camera sensor can see two different views of the world. I'm not aware of optical splitters for tiny cameras, but they may exist. But even if you find a splitter, the difficult problem of implementing stereovision remains. Stereovision has inherent problems (see below).
Use a different sensor, such as the somewhat iffy but small Intel R200, which will generate depth data. (http://click.intel.com/intel-realsense-developer-kit-r200.html)
Use a time-of-flight camera. These are the types of sensors built into the Kinect version 2 and several gesture-recognition sensors. Several companies have produced or are actively developing tiny time-of-flight sensors. They will generate depth data AND provide full-color images.
Run the car only in controlled environments.
The environment in which your toy car operates is important. If you can limit your toy car's environment to a tightly controlled one, you can limit the need to write complicated algorithms. As is true with many imaging problems, a narrowly defined problem may be straightforward to solve, whereas the general problem may be nearly impossible to solve. If you want your car to run "anywhere" (which likely isn't true), assume the problem is NOT solvable.
Even if you have an off-the-shelf depth sensor that represents the best technology available, you would still run into limitations:
Each type of depth sensing has weaknesses. No depth sensors on the market do well with dark, shiny surfaces. (Some spot sensors do okay with dark, shiny surfaces, but area sensors don't.) Stereo sensors have problems with large, featureless regions, and also require a lot of processing power. And so on.
Once you have a depth image, you still need to run calculations, and short of having a lot of onboard processing power this will be difficult to pull off on a toy car.
If you have to make many compromises to use depth sensing, then you might consider just using a simpler ultrasound sensor to avoid collisions.
Good luck!
Related
I am totally new to camera calibration techniques... I am using OpenCV chessboard technique... I am using a webcam from Quantum...
Here are my observations and steps..
I have kept each chess square side = 3.5 cm. It is a 7 x 5 chessboard with 6 x 4 internal corners. I am taking total of 10 images in different views/poses at a distance of 1 to 1.5 m from the webcam.
I am following the C code in Learning OpenCV by Bradski for the calibration.
my code for calibration is
cvCalibrateCamera2(object_points,image_points,point_counts,cvSize(640,480),intrinsic_matrix,distortion_coeffs,NULL,NULL,CV_CALIB_FIX_ASPECT_RATIO);
Before calling this function I am making the first and 2nd element along the diagonal of the intrinsic matrix as one to keep the ratio of focal lengths constant and using CV_CALIB_FIX_ASPECT_RATIO
With the change in distance of the chess board the fx and fy are changing with fx:fy almost equal to 1. there are cx and cy values in order of 200 to 400. the fx and fy are in the order of 300 - 700 when I change the distance.
Presently I have put all the distortion coefficients to zero because I did not get good result including distortion coefficients. My original image looked handsome than the undistorted one!!
Am I doing the calibration correctly?. Should I use any other option than CV_CALIB_FIX_ASPECT_RATIO?. If yes, which one?
Hmm, are you looking for "handsome" or "accurate"?
Camera calibration is one of the very few subjects in computer vision where accuracy can be directly quantified in physical terms, and verified by a physical experiment. And the usual lesson is that (a) your numbers are just as good as the effort (and money) you put into them, and (b) real accuracy (as opposed to imagined) is expensive, so you should figure out in advance what your application really requires in the way of precision.
If you look up the geometrical specs of even very cheap lens/sensor combinations (in the megapixel range and above), it becomes readily apparent that sub-sub-mm calibration accuracy is theoretically achievable within a table-top volume of space. Just work out (from the spec sheet of your camera's sensor) the solid angle spanned by one pixel - you'll be dazzled by the spatial resolution you have within reach of your wallet. However, actually achieving REPEATABLY something near that theoretical accuracy takes work.
Here are some recommendations (from personal experience) for getting a good calibration experience with home-grown equipment.
If your method uses a flat target ("checkerboard" or similar), manufacture a good one. Choose a very flat backing (for the size you mention window glass 5 mm thick or more is excellent, though obviously fragile). Verify its flatness against another edge (or, better, a laser beam). Print the pattern on thick-stock paper that won't stretch too easily. Lay it after printing on the backing before gluing and verify that the square sides are indeed very nearly orthogonal. Cheap ink-jet or laser printers are not designed for rigorous geometrical accuracy, do not trust them blindly. Best practice is to use a professional print shop (even a Kinko's will do a much better job than most home printers). Then attach the pattern very carefully to the backing, using spray-on glue and slowly wiping with soft cloth to avoid bubbles and stretching. Wait for a day or longer for the glue to cure and the glue-paper stress to reach its long-term steady state. Finally measure the corner positions with a good caliper and a magnifier. You may get away with one single number for the "average" square size, but it must be an average of actual measurements, not of hopes-n-prayers. Best practice is to actually use a table of measured positions.
Watch your temperature and humidity changes: paper adsorbs water from the air, the backing dilates and contracts. It is amazing how many articles you can find that report sub-millimeter calibration accuracies without quoting the environment conditions (or the target response to them). Needless to say, they are mostly crap. The lower temperature dilation coefficient of glass compared to common sheet metal is another reason for preferring the former as a backing.
Needless to say, you must disable the auto-focus feature of your camera, if it has one: focusing physically moves one or more pieces of glass inside your lens, thus changing (slightly) the field of view and (usually by a lot) the lens distortion and the principal point.
Place the camera on a stable mount that won't vibrate easily. Focus (and f-stop the lens, if it has an iris) as is needed for the application (not the calibration - the calibration procedure and target must be designed for the app's needs, not the other way around). Do not even think of touching camera or lens afterwards. If at all possible, avoid "complex" lenses - e.g. zoom lenses or very wide angle ones. For example, anamorphic lenses require models much more complex than stock OpenCV makes available.
Take lots of measurements and pictures. You want hundreds of measurements (corners) per image, and tens of images. Where data is concerned, the more the merrier. A 10x10 checkerboard is the absolute minimum I would consider. I normally worked at 20x20.
Span the calibration volume when taking pictures. Ideally you want your measurements to be uniformly distributed in the volume of space you will be working with. Most importantly, make sure to angle the target significantly with respect to the focal axis in some of the pictures - to calibrate the focal length you need to "see" some real perspective foreshortening. For best results use a repeatable mechanical jig to move the target. A good one is a one-axis turntable, which will give you an excellent prior model for the motion of the target.
Minimize vibrations and associated motion blur when taking photos.
Use good lighting. Really. It's amazing how often I see people realize late in the game that you need a generous supply of photons to calibrate a camera :-) Use diffuse ambient lighting, and bounce it off white cards on both sides of the field of view.
Watch what your corner extraction code is doing. Draw the detected corner positions on top of the images (in Matlab or Octave, for example), and judge their quality. Removing outliers early using tight thresholds is better than trusting the robustifier in your bundle adjustment code.
Constrain your model if you can. For example, don't try to estimate the principal point if you don't have a good reason to believe that your lens is significantly off-center w.r.t the image, just fix it at the image center on your first attempt. The principal point location is usually poorly observed, because it is inherently confused with the center of the nonlinear distortion and by the component parallel to the image plane of the target-to-camera's translation. Getting it right requires a carefully designed procedure that yields three or more independent vanishing points of the scene and a very good bracketing of the nonlinear distortion. Similarly, unless you have reason to suspect that the lens focal axis is really tilted w.r.t. the sensor plane, fix at zero the (1,2) component of the camera matrix. Generally speaking, use the simplest model that satisfies your measurements and your application needs (that's Ockam's razor for you).
When you have a calibration solution from your optimizer with low enough RMS error (a few tenths of a pixel, typically, see also Josh's answer below), plot the XY pattern of the residual errors (predicted_xy - measured_xy for each corner in all images) and see if it's a round-ish cloud centered at (0, 0). "Clumps" of outliers or non-roundness of the cloud of residuals are screaming alarm bells that something is very wrong - likely outliers due to bad corner detection or matching, or an inappropriate lens distortion model.
Take extra images to verify the accuracy of the solution - use them to verify that the lens distortion is actually removed, and that the planar homography predicted by the calibrated model actually matches the one recovered from the measured corners.
This is a rather late answer, but for people coming to this from Google:
The correct way to check calibration accuracy is to use the reprojection error provided by OpenCV. I'm not sure why this wasn't mentioned anywhere in the answer or comments, you don't need to calculate this by hand - it's the return value of calibrateCamera. In Python it's the first return value (followed by the camera matrix, etc).
The reprojection error is the RMS error between where the points would be projected using the intrinsic coefficients and where they are in the real image. Typically you should expect an RMS error of less than 0.5px - I can routinely get around 0.1px with machine vision cameras. The reprojection error is used in many computer vision papers, there isn't a significantly easier or more accurate way to determine how good your calibration is.
Unless you have a stereo system, you can only work out where something is in 3D space up to a ray, rather than a point. However, as one can work out the pose of each planar calibration image, it's possible to work out where each chessboard corner should fall on the image sensor. The calibration process (more or less) attempts to work out where these rays fall and minimises the error over all the different calibration images. In Zhang's original paper, and subsequent evaluations, around 10-15 images seems to be sufficient; at this point the error doesn't decrease significantly with the addition of more images.
Other software packages like Matlab will give you error estimates for each individual intrinsic, e.g. focal length, centre of projection. I've been unable to make OpenCV spit out that information, but maybe it's in there somewhere. Camera calibration is now native in Matlab 2014a, but you can still get hold of the camera calibration toolbox which is extremely popular with computer vision users.
http://www.vision.caltech.edu/bouguetj/calib_doc/
Visual inspection is necessary, but not sufficient when dealing with your results. The simplest thing to look for is that straight lines in the world become straight in your undistorted images. Beyond that, it's impossible to really be sure if your cameras are calibrated well just by looking at the output images.
The routine provided by Francesco is good, follow that. I use a shelf board as my plane, with the pattern printed on poster paper. Make sure the images are well exposed - avoid specular reflection! I use a standard 8x6 pattern, I've tried denser patterns but I haven't seen such an improvement in accuracy that it makes a difference.
I think this answer should be sufficient for most people wanting to calibrate a camera - realistically unless you're trying to calibrate something exotic like a Fisheye or you're doing it for educational reasons, OpenCV/Matlab is all you need. Zhang's method is considered good enough that virtually everyone in computer vision research uses it, and most of them either use Bouguet's toolbox or OpenCV.
I've built an imaging system with a webcam and feature matching such that as I move the camera around; I can track the camera's motion. I am doing something similar to here, except with the webcam frames as the input.
It works really well for "good" images, but when taking images in really low light lots of noise appears (camera high gain), and that messes with the feature detection and matching. Basically, it doesn't detect any good features, and when it does, it cannot match them correctly between frames.
Does anyone know a good solution for this? What other methods are used for finding and matching features?
Here are two example images with very low features:
I think phase correlation is going to be your best bet here. It is designed to tell you the phase shift (i.e., translation) between two images. It is much more resilient (but not immune) to noise than feature detection because it operates in frequency space; whereas, feature detectors operate spatially. Another benefit is, it is very fast when compared with feature detection methods. I have an implementation available in the OpenCV trunk that is sub-pixel accurate located here.
However, your images are pretty much "featureless" with the exception of the crease in the middle, so even phase correlation may have some trouble with it. Think of it like trying to detect translation in a snow storm. If all you can see is white, you can't tell that you have translated at all, thus the term whiteout. In your case, the algorithm might suffer from "greenout" :)
Can you adjust the camera settings to work better in low-light conditions. Have you fully opened the iris? Can you live with lower framerates? Setting a longer exposure time will allow the camera to gather more light, thus giving you more features at the cost of adding motion blur. Or, if low-light is your default environment you probably want something designed for this like an IR camera, but those can be expensive. Other than that, a big lens and long exposures are your friend :)
Histogram equalization may be of interest in improving the image contrast. But, sometimes it can just enhance the noise. OpenCV has a global histogram equalization function called equalizeHist. For a more localized implementation, you'll want to look at Contrast Limited Adaptive Histogram Equalization or CLAHE for short. Here is a good article on it. This page has some nice examples, and some code.
If I take a picture with a camera, so I know the distance from the camera to the object, such as a scale model of a house, I would like to turn this into a 3D model that I can maneuver around so I can comment on different parts of the house.
If I sit down and think about taking more than one picture, labeling direction, and distance, I should be able to figure out how to do this, but, I thought I would ask if someone has some paper that may help explain more.
What language you explain in doesn't matter, as I am looking for the best approach.
Right now I am considering showing the house, then the user can put in some assistance for height, such as distance from the camera to the top of that part of the model, and given enough of this it would be possible to start calculating heights for the rest, especially if there is a top-down image, then pictures from angles on the four sides, to calculate relative heights.
Then I expect that parts will also need to differ in color to help separate out the various parts of the model.
As mentioned, the problem is very hard and is often also referred to as multi-view object reconstruction. It is usually approached by solving the stereo-view reconstruction problem for each pair of consecutive images.
Performing stereo reconstruction requires that pairs of images are taken that have a good amount of visible overlap of physical points. You need to find corresponding points such that you can then use triangulation to find the 3D co-ordinates of the points.
Epipolar geometry
Stereo reconstruction is usually done by first calibrating your camera setup so you can rectify your images using the theory of epipolar geometry. This simplifies finding corresponding points as well as the final triangulation calculations.
If you have:
the intrinsic camera parameters (requiring camera calibration),
the camera's position and rotation (it's extrinsic parameters), and
8 or more physical points with matching known positions in two photos (when using the eight-point algorithm)
you can calculate the fundamental and essential matrices using only matrix theory and use these to rectify your images. This requires some theory about co-ordinate projections with homogeneous co-ordinates and also knowledge of the pinhole camera model and camera matrix.
If you want a method that doesn't need the camera parameters and works for unknown camera set-ups you should probably look into methods for uncalibrated stereo reconstruction.
Correspondence problem
Finding corresponding points is the tricky part that requires you to look for points of the same brightness or colour, or to use texture patterns or some other features to identify the same points in pairs of images. Techniques for this either work locally by looking for a best match in a small region around each point, or globally by considering the image as a whole.
If you already have the fundamental matrix, it will allow you to rectify the images such that corresponding points in two images will be constrained to a line (in theory). This helps you to use faster local techniques.
There is currently still no ideal technique to solve the correspondence problem, but possible approaches could fall in these categories:
Manual selection: have a person hand-select matching points.
Custom markers: place markers or use specific patterns/colours that you can easily identify.
Sum of squared differences: take a region around a point and find the closest whole matching region in the other image.
Graph cuts: a global optimisation technique based on optimisation using graph theory.
For specific implementations you can use Google Scholar to search through the current literature. Here is one highly cited paper comparing various techniques:
A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms.
Multi-view reconstruction
Once you have the corresponding points, you can then use epipolar geometry theory for the triangulation calculations to find the 3D co-ordinates of the points.
This whole stereo reconstruction would then be repeated for each pair of consecutive images (implying that you need an order to the images or at least knowledge of which images have many overlapping points). For each pair you would calculate a different fundamental matrix.
Of course, due to noise or inaccuracies at each of these steps you might want to consider how to solve the problem in a more global manner. For instance, if you have a series of images that are taken around an object and form a loop, this provides extra constraints that can be used to improve the accuracy of earlier steps using something like bundle adjustment.
As you can see, both stereo and multi-view reconstruction are far from solved problems and are still actively researched. The less you want to do in an automated manner the more well-defined the problem becomes, but even in these cases quite a bit of theory is required to get started.
Alternatives
If it's within the constraints of what you want to do, I would recommend considering dedicated hardware sensors (such as the XBox's Kinect) instead of only using normal cameras. These sensors use structured light, time-of-flight or some other range imaging technique to generate a depth image which they can also combine with colour data from their own cameras. They practically solve the single-view reconstruction problem for you and often include libraries and tools for stitching/combining multiple views.
Epipolar geometry references
My knowledge is actually quite thin on most of the theory, so the best I can do is to further provide you with some references that are hopefully useful (in order of relevance):
I found a PDF chapter on Multiple View Geometry that contains most of the critical theory. In fact the textbook Multiple View Geometry in Computer Vision should also be quite useful (sample chapters available here).
Here's a page describing a project on uncalibrated stereo reconstruction that seems to include some source code that could be useful. They find matching points in an automated manner using one of many feature detection techniques. If you want this part of the process to be automated as well, then SIFT feature detection is commonly considered to be an excellent non-real-time technique (since it's quite slow).
A paper about Scene Reconstruction from Multiple Uncalibrated Views.
A slideshow on Methods for 3D Reconstruction from Multiple Images (it has some more references below it's slides towards the end).
A paper comparing different multi-view stereo reconstruction algorithms can be found here. It limits itself to algorithms that "reconstruct dense object models from calibrated views".
Here's a paper that goes into lots of detail for the case that you have stereo cameras that take multiple images: Towards robust metric reconstruction
via a dynamic uncalibrated stereo head. They then find methods to self-calibrate the cameras.
I'm not sure how helpful all of this is, but hopefully it includes enough useful terminology and references to find further resources.
Research has made significant progress and these days it is possible to obtain pretty good-looking 3D shapes from 2D images. For instance, in our recent research work titled "Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes With Deep Generative Networks" took a big step in solving the problem of obtaining 3D shapes from 2D images. In our work, we show that you can not only go from 2D to 3D directly and get a good, approximate 3D reconstruction but you can also learn a distribution of 3D shapes in an efficient manner and generate/synthesize 3D shapes. Below is an image of our work showing that we are able to do 3D reconstruction even from a single silhouette or depth map (on the left). The ground-truth 3D shapes are shown on the right.
The approach we took has some contributions related to cognitive science or the way the brain works: the model we built shares parameters for all shape categories instead of being specific to only one category. Also, it obtains consistent representations and takes the uncertainty of the input view into account when producing a 3D shape as output. Therefore, it is able to naturally give meaningful results even for very ambiguous inputs. If you look at the citation to our paper you can see even more progress just in terms of going from 2D images to 3D shapes.
This problem is known as Photogrammetry.
Google will supply you with endless references, just be aware that if you want to roll your own, it's a very hard problem.
Check out The Deadalus Project, althought that website does not contain a gallery with illustrative information about the solution, it post several papers and info about the working method.
I watched a lecture from one of the main researchers of the project (Roger Hubbold), and the image results are quite amazing! Althought is a complex and long problem. It has a lot of tricky details to take into account to get an approximation of the 3d data, take for example the 3d information from wall surfaces, for which the heuristic to work is as follows: Take a photo with normal illumination of the scene, and then retake the picture in same position with full flash active, then substract both images and divide the result by a pre-taken flash calibration image, apply a box filter to this new result and then post-process to estimate depth values, the whole process is explained in detail in this paper (which is also posted/referenced in the project website)
Google Sketchup (free) has a photo matching tool that allows you to take a photograph and match its perspective for easy modeling.
EDIT: It appears that you're interested in developing your own solution. I thought you were trying to obtain a 3D model of an image in a single instance. If this answer isn't helpful, I apologize.
Hope this helps if you are trying to construct 3d volume from 2d stack of images !! You can use open source tool such as ImageJ Fiji which comes with 3d viewer plugin..
https://quppler.com/creating-a-classifier-using-image-j-fiji-for-3d-volume-data-preparation-from-stack-of-images/
what approach would you recommend for finding obstacles in a 2D image?
Here are some key points I came up with till now:
I doubt I can use object recognition based on "database of obstacles" search, since I don't know what might the obstruction look like.
I assume color recognition might be problematic if the path does not differ a lot from the object itself.
Possibly, adding one more camera and computing a 3D image (like a Kinect does) would work, but that would not run as smooth as I require.
To illustrate the problem; robot can ride either left or right side of the pavement. In the following picture, left side is the correct choice:
If you know what the path looks like, this is largely a classification problem. Acquire a bunch of images of path at different distances, illumination, etc. and manually label the ground in each image. Use this labeled data to train a classifier that classifies each pixel as either "road" or "not road." Depending upon the texture of the road, this could be as simple as classifying each pixels' RGB (or HSV) values or using OpenCv's built-in histogram back-projection (i.e. cv::CalcBackProjectPatch()).
I suggest beginning with manual thresholds, moving to histogram-based matching, and only using a full-fledged machine learning classifier (such as a Naive Bayes Classifier or a SVM) if the simpler techniques fail. Once the entire image is classified, all pixels that are identified as "not road" are obstacles. By classifying the road instead of the obstacles, we completely avoided building a "database of objects".
Somewhat out of the scope of the question, the easiest solution is to add additional sensors ("throw more hardware at the problem!") and directly measure the three-dimensional position of obstacles. In order of preference:
Microsoft Kinect: Cheap, easy, and effective. Due to ambient IR light, it only works indoors.
Scanning Laser Rangefinder: Extremely accurate, easy to setup, and works outside. Also very expensive (~$1200-10,000 depending upon maximum range and sample rate).
Stereo Camera: Not as good as a Kinect, but it works outside. If you cannot afford a pre-made stereo camera (~$1800), you can make a decent custom stereo camera using USB webcams.
Note that professional stereo vision cameras can be very fast by using custom hardware (Stereo On-Chip, STOC). Software-based stereo is also reasonably fast (10-20 Hz) on a modern computer.
I have a number of images where I know the focal length, pixel count, dimensions and position (from GPS). They are all in a high oblique manner, taken on the ground with commercially available cameras.
alt text http://desmond.yfrog.com/Himg411/scaled.php?tn=0&server=411&filename=mjbm.jpg&xsize=640&ysize=640
What would be the best method for calculating the euclidean distances between certain pixels within an image? If it is indeed possible.
Assuming you're not looking for full landscape modelling but a simple approximation then this shouldn't be too hard. Basically a first approximation of your image reduces to a camera with know focal length looking along a plane. So we can create a model of the system in 3D very easily - it's not too far from the classic observer looking over a checkerboard demo.
Normally our graphics problem would be to project the 3D model into 2D so we could render the image. Although most programs nowadays use an API (such as OpenGL) to do this the equations are not particularly complex or difficult to understand. I wrote my first code using the examples from 3D Graphics In Pascal which is a nice clear treatise, but there will be lots of other similar source (although probably less nowadays as a hardware API is invariably used).
What's useful about this is that the projection equations are commutative, in that if you have a point on the image and the model you can run the data back though the projection to retrieve the original 3D coordinates - which is what you wish to do.
So a couple of approaches suggest: either write the code to do the above yourself directly, or probably more simply use OpenGL (I'd recommend the GLUT toolkit for this). If your math is good and manipulating matrices causes you no issue then I'd recommend the former as the solution will be tighter and it's interesting stuff - otherwise take the OpenGL approach. You'd probably want to turn the camera/plane approximation into camera/sphere fairly early too.
If this isn't sufficient for your needs then in theory going to actual landscape modelling would be feasible. The SRTM data is freely available (albeit not in the friendliest of forms) so combined with your GPS position it should be possible to create a mesh model in with which you apply the same algorithms as above.