I am looking for a way to approach a computer vision problem I'm having.
I have a working tracking system:
4-8 cameras
Gives the (x, y, z) position of an infrared LED
Each LED transmits a unique 8-bit signal
The tracking system is expensive and its interface is too hard for our users to work with. I want to replace it with my own implementation, possibly built on OpenCV.
My current approach seems to require a lot of development for what look like common problems:
Calibrate the cameras to define a 3D space. The cameras need to know where they are in space and in relation to each other.
When two or more cameras see a given LED, use the LED's pixel position in each gray-scale image to calculate its 3D position (x, y, z).
Right now I am attempting to write my own custom algorithm for both tasks, and it is proving to be a lot of work. Is it possible to approach this with OpenCV to help with the heavy lifting?
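For the second task, the usual building block is linear (DLT) triangulation. The sketch below is only an illustration in plain NumPy and assumes the first task is already solved, that is, each camera has a known 3x4 projection matrix. The camera matrices, the 0.5 m baseline and the LED position are made-up values. OpenCV offers calibrateCamera, stereoCalibrate and triangulatePoints for the same steps.

```python
import numpy as np

def triangulate(projections, pixels):
    """Linear (DLT) triangulation of one 3D point seen by two or more cameras.

    projections: list of 3x4 projection matrices P = K [R | t]
    pixels:      list of (u, v) image coordinates of the same LED, one per camera
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        # Each view adds two linear constraints on the homogeneous point X.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.asarray(rows)
    # Least-squares solution: right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with projection matrix P (ideal pinhole, no distortion)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical setup: two identical cameras 0.5 m apart, both looking down +Z.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

led = np.array([0.2, -0.1, 3.0])  # made-up ground-truth LED position in metres
estimate = triangulate([P1, P2], [project(P1, led), project(P2, led)])
print(estimate)
```

With 4-8 cameras you would pass every view that currently sees the LED; the extra rows make the estimate more robust to pixel noise.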
Take a look at FreeTrack: http://www.free-track.net/english/ (you can download the sources there).
Can you please suggest ways of determining the distance between the camera and a pixel in an image (in real-world units, that is cm, m, ...)?
The information I have is: the camera's horizontal (120 degrees) and vertical (90 degrees) field of view, the camera angle (-5 degrees), and the height at which the camera is placed (30 cm).
I'm not sure if this is everything I need. Please tell me what information I should have about the camera, and how I can calculate the distance between the camera and one pixel.
Maybe it isn't right to say "distance between camera and pixel", but I guess it is clear what I mean. Please write in the comments if something isn't clear.
Thank you in advance!
What I think you mean is, "how can I calculate the depth at every pixel with a single camera?" Without adding some special hardware this is not feasible, as Rotem mentioned in the comments. There are exceptions, and though I expect you may be limited in time or budget, I'll list a few.
If you want to find depths so that your toy car can avoid collisions, then you needn't assume that depth measurement is required. Google "optical flow collision avoidance" and see if that meets your needs.
If instead you want to measure depth as part of some Simultaneous Localization and Mapping (SLAM) scheme, then that's a different problem to solve. Though difficult to implement, and perhaps not remotely feasible for a toy car project, there are a few ways to measure distance using a single camera:
Project patterns of light, preferably with one or more laser lines or laser spots, and determine depth based on how the dots diverge or converge. The Kinect version 1 operates on this principle of "structured light," though the implementation is much too complicated to reproduce completely. For a simple collision warning you can apply the same principles in a much simpler form. For example, if the projected light pattern on the right side of the image changes quickly, turn left! Learning how to estimate distance using structured light is a significant project to undertake, but there are plenty of references.
Split the optical path so that one camera sensor can see two different views of the world. I'm not aware of optical splitters for tiny cameras, but they may exist. But even if you find a splitter, the difficult problem of implementing stereovision remains. Stereovision has inherent problems (see below).
Use a different sensor, such as the somewhat iffy but small Intel R200, which will generate depth data. (http://click.intel.com/intel-realsense-developer-kit-r200.html)
Use a time-of-flight camera. These are the types of sensors built into the Kinect version 2 and several gesture-recognition sensors. Several companies have produced or are actively developing tiny time-of-flight sensors. They will generate depth data AND provide full-color images.
Run the car only in controlled environments.
The environment in which your toy car operates is important. If you can limit your toy car's environment to a tightly controlled one, you can limit the need to write complicated algorithms. As is true with many imaging problems, a narrowly defined problem may be straightforward to solve, whereas the general problem may be nearly impossible to solve. If you want your car to run "anywhere" (which likely isn't true), assume the problem is NOT solvable.
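As an illustration of how much a controlled environment buys you: if the floor is flat and the pixel in question shows the floor, the distance follows from the camera height and tilt alone. The sketch below is a rough example under those assumptions, plus an ideal pinhole model with no lens distortion (a real 120-degree lens will distort noticeably). The 640x480 resolution is made up, and the "-5 degrees" camera angle is interpreted as a 5-degree downward tilt.

```python
import math

def ground_distance(u, v, width, height, hfov_deg, vfov_deg, pitch_down_deg, cam_height):
    """Distance from the camera to the floor point seen at pixel (u, v).

    Assumes an ideal pinhole camera, a flat floor, and that the pixel really
    shows the floor. Returns None if the ray never reaches the floor.
    """
    # Focal lengths in pixels derived from the fields of view.
    fx = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    fy = (height / 2) / math.tan(math.radians(vfov_deg) / 2)
    # Ray through the pixel in camera coordinates (x right, y down, z forward).
    a = (u - width / 2) / fx
    b = (v - height / 2) / fy
    p = math.radians(pitch_down_deg)
    # Downward component of the ray once the camera tilt is taken into account.
    down = math.sin(p) + b * math.cos(p)
    if down <= 0:
        return None  # the ray points at or above the horizon
    scale = cam_height / down
    return scale * math.sqrt(a * a + b * b + 1.0)

# Values from the question, plus a hypothetical 640x480 image.
d = ground_distance(320, 240, 640, 480, 120, 90, 5, 0.30)
print(f"centre pixel hits the floor {d:.2f} m from the camera")
```

Anything that is not on the floor (an obstacle, a wall) breaks the assumption, so this only gives a distance for pixels you already know belong to the ground.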
Even if you have an off-the-shelf depth sensor that represents the best technology available, you would still run into limitations:
Each type of depth sensing has weaknesses. No depth sensors on the market do well with dark, shiny surfaces. (Some spot sensors do okay with dark, shiny surfaces, but area sensors don't.) Stereo sensors have problems with large, featureless regions, and also require a lot of processing power. And so on.
Once you have a depth image, you still need to run calculations, and short of having a lot of onboard processing power this will be difficult to pull off on a toy car.
If you have to make many compromises to use depth sensing, then you might consider just using a simpler ultrasound sensor to avoid collisions.
Good luck!
Let's say I have a video from a drive recorder. I want to construct a point cloud of the recorded scene using the structure from motion technique. First I need to track some points.
Which algorithm yields a better result: sparse optical flow (the Kanade-Lucas-Tomasi tracker) or dense optical flow (Farneback)? I have experimented a bit but cannot really decide. Each has its own strengths and weaknesses.
The ultimate goal is to get a point cloud of the recorded cars in the scene. With sparse optical flow I can track the interesting points of the cars, but where those points fall is quite unpredictable. One solution is to lay a grid over the image and force the tracker to track one interesting point in each grid cell. But I think this would be quite hard.
With dense flow I can get the movement of every pixel, but the problem is that it cannot really detect cars that move only a little. I also doubt that the per-pixel flow yielded by the algorithm is that accurate. Plus, I believe I can only get the pixel movement between two frames (unlike sparse optical flow, where I can get multiple coordinates of the same interesting point over time t).
Your title indicates SfM, which includes pose estimation.
Tracking is only the first step (matching). If you want a point cloud from video (a very hard task), the first thing I would think of is bundle adjustment, which also works for MVE.
Nevertheless, for video we can do more: since consecutive frames are very close to each other, we can use a faster algorithm such as optical flow rather than SIFT matching, extract the F matrix from it, and then:
E = K^T * F * K (where K^T is the transpose of the intrinsic matrix K)
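For reference, the standard relation for two frames sharing the same intrinsic matrix K is E = K^T F K. The sketch below checks it numerically in NumPy; the intrinsics and the relative motion are made-up values, and in practice F would come from your tracked point pairs rather than from a known E.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Hypothetical intrinsics and relative motion between two frames.
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = np.radians(10.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.0, 0.2])

E_true = skew(t) @ R
K_inv = np.linalg.inv(K)
F = K_inv.T @ E_true @ K_inv   # fundamental matrix implied by E and K

E = K.T @ F @ K                # recover the essential matrix from F
singular_values = np.linalg.svd(E, compute_uv=False)
print(singular_values)
```

A valid essential matrix has two equal singular values and one zero singular value, which is a handy sanity check on an E estimated from noisy data.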
Back to your original question, which is better:
1) dense optical flow, or
2) sparse optical flow?
Apparently you are working offline, so speed is not important, but I would still recommend the sparse one.
Update
For 3D reconstruction the dense approach may seem more attractive, but as you said it is rarely robust, so you can use the sparse approach and add as many points as you want to make it semi-dense.
I can name only a few methods that do this, such as MonoSLAM or ORB-SLAM.
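The grid idea from the question is one simple way to spread sparse points over the image, and it is less work than it sounds. A minimal sketch in plain Python, assuming you already have candidate corners with a strength score from whatever detector you use (the function name, grid size and sample detections are made up for illustration):

```python
def bucket_features(points, scores, image_size, grid=(8, 6)):
    """Keep only the strongest feature in each grid cell.

    points: list of (x, y) pixel coordinates; scores: matching corner strengths.
    """
    width, height = image_size
    cols, rows = grid
    best = {}
    for (x, y), score in zip(points, scores):
        # Index of the grid cell that contains this point.
        cell = (min(int(x * cols / width), cols - 1),
                min(int(y * rows / height), rows - 1))
        if cell not in best or score > best[cell][1]:
            best[cell] = ((x, y), score)
    return [pt for pt, _ in best.values()]

# Hypothetical detections: three corners crowd one cell, one sits elsewhere.
points = [(10, 10), (12, 14), (15, 11), (600, 400)]
scores = [0.3, 0.9, 0.5, 0.2]
kept = bucket_features(points, scores, image_size=(640, 480))
print(kept)
```

A finer grid, or keeping the top few points per cell instead of one, moves the result toward semi-dense. OpenCV's goodFeaturesToTrack also takes a minDistance and a mask argument, which can be used to a similar effect.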
Final Update
Use semi-dense as I wrote earlier, but note that SfM always assumes static objects (no movement); otherwise it will never work.
In practice, using all the pixels in the image was something rarely done for 3D reconstruction (outside of direct methods), and SIFT was always the praised way of detecting and matching features. More recently, all the pixels have been used in a different kind of calibration, for example in methods such as Direct Sparse Odometry and LSD-SLAM, known as direct methods.
I want to get a 3D model of a real-world object.
I have two webcams. Using OpenCV and SBM for stereo correspondence I get a point cloud of the scene, and by filtering on z I can keep only the object's point cloud.
I know that ICP is good for this purpose, but it needs the point clouds to be well aligned initially, so it is combined with SAC to achieve better results.
But my SAC fitness score is too big, something like 70 or 40, and ICP doesn't give good results either.
My questions are:
Is it OK for ICP if I just rotate the object in front of the cameras to obtain the point clouds? What rotation angle between captures is needed to achieve good results? Or is there a better way of taking pictures of the object to get a 3D model? Is it OK if my point clouds have some holes? What is the maximum acceptable SAC fitness score for a good ICP, and what is the maximum fitness score of a good ICP?
Example of my point cloud files:
https://drive.google.com/file/d/0B1VdSoFbwNShcmo4ZUhPWjZHWG8/view?usp=sharing
My advice, from experience: you already have RGB or gray images, so use them. ICP is a good tool for refining a point cloud alignment, but it has trouble finding the alignment on its own.
First start with RGB odometry (use feature points to align the point clouds, which are rotated relative to each other), then learn how ICP works with the already mentioned Point Cloud Library and use it. Let the RGB features give you a prediction, and then use ICP to refine it when possible.
When this works, think about a good fitness score calculation. If that all works, use the trunk version of ICP and tune the parameters. After all this has been done, you have code that is not only fast but also has a low chance of going wrong.
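The "prediction" step can be as simple as a least-squares rigid fit between matched feature points, once each matched feature has a 3D position from the stereo data. The sketch below uses the SVD-based (Kabsch) solution in NumPy. It is only an illustration: the points are synthetic, the matches are assumed to be correct, and a real pipeline would wrap this in RANSAC before handing the result to ICP as the initial guess.

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rotation R and translation t such that dst ~ R @ src + t.

    src, dst: (N, 3) arrays of matched 3D points, N >= 3 and not collinear.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection being returned as the best fit.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Synthetic check: rotate the object by 30 degrees about the vertical axis.
rng = np.random.default_rng(0)
src = rng.normal(size=(20, 3))
a = np.radians(30.0)
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a), np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.05])
dst = src @ R_true.T + t_true

R_est, t_est = rigid_transform(src, dst)
print(np.round(R_est, 3), np.round(t_est, 3))
```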
The following post explains what went wrong:
Using ICP, we refine this transformation using only geometric information. However, here ICP decreases the precision. What happens is that ICP tries to match as many corresponding points as it can. Here the background behind the screen has more points than the screen itself on the two scans. ICP will then align the clouds to maximize the correspondences on the background. The screen is then misaligned.
https://github.com/introlab/rtabmap/wiki/ICP
First off, I'd like to state that I'm very new to this field and apologize if the question is a little too repetitive. I've looked around but in vain. I'm working on reading Hartley and Zisserman's book but it's taking me a while.
My problem is that I've got 3 video sources of an area and I need to find the camera position at each frame of the video. I do not have any information about the cameras that took the videos (i.e. no intrinsics).
Looking for a solution I came across SfM and tried existing software, namely Bundler and VisualSFM, and both seem to have worked quite well. However, I've got a couple of questions about it.
1) Is SfM really required in my case? SfM does a sparse reconstruction, and the common points between images are also an output. Is that fully necessary, or are there more suitable (or less complex) methods, since positions are all I really need?
2) From what I've read, I need to calibrate the camera and find its intrinsics and extrinsics. How can I do this without knowing either? I've looked at the 5-point problem and others, but most of them require you to know the intrinsic properties of the camera, which I don't have, and I cannot use a pattern such as a chessboard to calibrate the cameras since the videos come from a source outside my control.
Thanks for your time!
Based on my experience, the short answer is:
1) You cannot reliably estimate the 3D pose of the cameras independently from the 3D of the scene. Moreover, since your cameras are moving independently, I think SfM is the right way to approach your problem.
2) You need to estimate the cameras' intrinsics in order to estimate useful (i.e. Euclidean) poses and scene reconstruction. If you cannot use the standard calibration procedure, with chessboard and co, you can have a look at autocalibration techniques (see also chapter 19 in Hartley and Zisserman's book). This calibration procedure is done independently for each camera and only requires several image samples at different positions, which seems appropriate in your case.
You can actually accomplish your task in one massive bundle adjustment procedure, up to a scale factor. But it is a very complicated thing even if you aren't a novice. You don't need a 3D reconstruction, just an essential matrix, which can be obtained from 2D projections and decomposed into rotation and translation, but this does require the intrinsic parameters. To get them you need at least three frames.
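The decomposition of an essential matrix into rotation and translation can be sketched with an SVD. This is a minimal NumPy illustration with made-up motion values, not a full pose pipeline: it returns the two candidate rotations and the translation direction (scale is not recoverable), and choosing among the four resulting poses still requires a check that triangulated points lie in front of both cameras. OpenCV exposes the same step as decomposeEssentialMat and recoverPose.

```python
import numpy as np

def decompose_essential(E):
    """Return two candidate rotations and the unit translation direction.

    The true pose is one of (R1, t), (R1, -t), (R2, t), (R2, -t).
    """
    U, _, Vt = np.linalg.svd(E)
    # Force determinant +1; E is only defined up to sign, so this is allowed.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]
    return R1, R2, t

def skew(v):
    """Cross-product matrix [v]x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# Hypothetical relative motion used to build a test essential matrix.
a = np.radians(15.0)
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a), np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.3, 0.1, 1.0])
E = skew(t_true) @ R_true

R1, R2, t = decompose_essential(E)
rotation_error = min(np.linalg.norm(R1 - R_true), np.linalg.norm(R2 - R_true))
direction = abs(t @ t_true) / np.linalg.norm(t_true)
print(rotation_error, direction)
```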
Finally, drop the Zisserman book; it will drive you crazy. Read Simon Prince's "Computer Vision" instead.
I am totally new to camera calibration techniques. I am using the OpenCV chessboard technique with a webcam from Quantum.
Here are my observations and steps.
I have kept each chess square side = 3.5 cm. It is a 7 x 5 chessboard with 6 x 4 internal corners. I am taking total of 10 images in different views/poses at a distance of 1 to 1.5 m from the webcam.
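For readers using the Python bindings rather than the C API, the board described above translates into the following object-point array. This is just a sketch of the common NumPy idiom; the variable names are arbitrary, and the corner ordering is assumed to run along each row of 6 corners first, matching the order in which the detected image corners are listed.

```python
import numpy as np

cols, rows = 6, 4    # inner corners along each row and each column
square = 3.5         # measured square side in cm

# One (x, y, 0) coordinate per inner corner, row by row, on the board plane.
object_points = np.zeros((rows * cols, 3), np.float32)
object_points[:, :2] = np.mgrid[0:cols, 0:rows].T.reshape(-1, 2) * square

print(object_points[:3])
print(object_points[-1])
```

The same array is reused for every view of the board; only the detected image corners change from image to image.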
I am following the C code in Learning OpenCV by Bradski for the calibration.
My calibration call is:
cvCalibrateCamera2(object_points,image_points,point_counts,cvSize(640,480),intrinsic_matrix,distortion_coeffs,NULL,NULL,CV_CALIB_FIX_ASPECT_RATIO);
Before calling this function I set the first and second elements along the diagonal of the intrinsic matrix to one, to keep the ratio of the focal lengths constant, and I use CV_CALIB_FIX_ASPECT_RATIO.
When I change the distance of the chessboard, fx and fy change, with fx:fy staying almost equal to 1. The cx and cy values are on the order of 200 to 400, and fx and fy are on the order of 300 to 700.
At present I have set all the distortion coefficients to zero because I did not get good results when including them. My original image looked more handsome than the undistorted one!
Am I doing the calibration correctly? Should I use any option other than CV_CALIB_FIX_ASPECT_RATIO? If yes, which one?
Hmm, are you looking for "handsome" or "accurate"?
Camera calibration is one of the very few subjects in computer vision where accuracy can be directly quantified in physical terms, and verified by a physical experiment. And the usual lesson is that (a) your numbers are just as good as the effort (and money) you put into them, and (b) real accuracy (as opposed to imagined) is expensive, so you should figure out in advance what your application really requires in the way of precision.
If you look up the geometrical specs of even very cheap lens/sensor combinations (in the megapixel range and above), it becomes readily apparent that sub-sub-mm calibration accuracy is theoretically achievable within a table-top volume of space. Just work out (from the spec sheet of your camera's sensor) the solid angle spanned by one pixel - you'll be dazzled by the spatial resolution you have within reach of your wallet. However, actually achieving REPEATABLY something near that theoretical accuracy takes work.
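To make the "solid angle spanned by one pixel" exercise concrete, here is a small back-of-the-envelope calculation. The numbers (3 micron pixels, a 4 mm lens, a target 1 m away) are made-up but typical of inexpensive hardware; substitute the values from your own spec sheet.

```python
import math

def pixel_footprint(pixel_pitch_um, focal_length_mm, distance_m):
    """Angle subtended by one pixel and the width it covers at a given distance."""
    pitch_m = pixel_pitch_um * 1e-6
    focal_m = focal_length_mm * 1e-3
    angle_rad = 2 * math.atan(pitch_m / (2 * focal_m))     # per-pixel angle on axis
    footprint_mm = distance_m * pitch_m / focal_m * 1e3    # width covered at distance
    solid_angle_sr = (pitch_m / focal_m) ** 2              # small-angle approximation
    return angle_rad, footprint_mm, solid_angle_sr

angle, footprint, solid_angle = pixel_footprint(3.0, 4.0, 1.0)
print(f"{math.degrees(angle):.4f} deg per pixel, {footprint:.2f} mm per pixel at 1 m")
```

With these example values one pixel covers less than a millimetre at 1 m, and corner detectors localise to a fraction of a pixel, which is where the sub-millimetre theoretical figure comes from.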
Here are some recommendations (from personal experience) for getting a good calibration experience with home-grown equipment.
If your method uses a flat target ("checkerboard" or similar), manufacture a good one. Choose a very flat backing (for the size you mention window glass 5 mm thick or more is excellent, though obviously fragile). Verify its flatness against another edge (or, better, a laser beam). Print the pattern on thick-stock paper that won't stretch too easily. Lay it after printing on the backing before gluing and verify that the square sides are indeed very nearly orthogonal. Cheap ink-jet or laser printers are not designed for rigorous geometrical accuracy, do not trust them blindly. Best practice is to use a professional print shop (even a Kinko's will do a much better job than most home printers). Then attach the pattern very carefully to the backing, using spray-on glue and slowly wiping with soft cloth to avoid bubbles and stretching. Wait for a day or longer for the glue to cure and the glue-paper stress to reach its long-term steady state. Finally measure the corner positions with a good caliper and a magnifier. You may get away with one single number for the "average" square size, but it must be an average of actual measurements, not of hopes-n-prayers. Best practice is to actually use a table of measured positions.
Watch your temperature and humidity changes: paper adsorbs water from the air, the backing dilates and contracts. It is amazing how many articles you can find that report sub-millimeter calibration accuracies without quoting the environment conditions (or the target response to them). Needless to say, they are mostly crap. The lower temperature dilation coefficient of glass compared to common sheet metal is another reason for preferring the former as a backing.
Needless to say, you must disable the auto-focus feature of your camera, if it has one: focusing physically moves one or more pieces of glass inside your lens, thus changing (slightly) the field of view and (usually by a lot) the lens distortion and the principal point.
Place the camera on a stable mount that won't vibrate easily. Focus (and f-stop the lens, if it has an iris) as is needed for the application (not the calibration - the calibration procedure and target must be designed for the app's needs, not the other way around). Do not even think of touching camera or lens afterwards. If at all possible, avoid "complex" lenses - e.g. zoom lenses or very wide angle ones. For example, anamorphic lenses require models much more complex than stock OpenCV makes available.
Take lots of measurements and pictures. You want hundreds of measurements (corners) per image, and tens of images. Where data is concerned, the more the merrier. A 10x10 checkerboard is the absolute minimum I would consider. I normally worked at 20x20.
Span the calibration volume when taking pictures. Ideally you want your measurements to be uniformly distributed in the volume of space you will be working with. Most importantly, make sure to angle the target significantly with respect to the focal axis in some of the pictures - to calibrate the focal length you need to "see" some real perspective foreshortening. For best results use a repeatable mechanical jig to move the target. A good one is a one-axis turntable, which will give you an excellent prior model for the motion of the target.
Minimize vibrations and associated motion blur when taking photos.
Use good lighting. Really. It's amazing how often I see people realize late in the game that you need a generous supply of photons to calibrate a camera :-) Use diffuse ambient lighting, and bounce it off white cards on both sides of the field of view.
Watch what your corner extraction code is doing. Draw the detected corner positions on top of the images (in Matlab or Octave, for example), and judge their quality. Removing outliers early using tight thresholds is better than trusting the robustifier in your bundle adjustment code.
Constrain your model if you can. For example, don't try to estimate the principal point if you don't have a good reason to believe that your lens is significantly off-center w.r.t. the image; just fix it at the image center on your first attempt. The principal point location is usually poorly observed, because it is inherently confused with the center of the nonlinear distortion and by the component parallel to the image plane of the target-to-camera's translation. Getting it right requires a carefully designed procedure that yields three or more independent vanishing points of the scene and a very good bracketing of the nonlinear distortion. Similarly, unless you have reason to suspect that the lens focal axis is really tilted w.r.t. the sensor plane, fix at zero the (1,2) component of the camera matrix. Generally speaking, use the simplest model that satisfies your measurements and your application needs (that's Occam's razor for you).
When you have a calibration solution from your optimizer with low enough RMS error (a few tenths of a pixel, typically, see also Josh's answer below), plot the XY pattern of the residual errors (predicted_xy - measured_xy for each corner in all images) and see if it's a round-ish cloud centered at (0, 0). "Clumps" of outliers or non-roundness of the cloud of residuals are screaming alarm bells that something is very wrong - likely outliers due to bad corner detection or matching, or an inappropriate lens distortion model.
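Besides looking at the plot, a couple of numbers can flag the same problems. The sketch below is one possible summary, not a standard routine: the function name is made up, and the data are synthetic residuals with 0.1 px noise, once isotropic and once stretched along x to imitate a bad distortion model.

```python
import numpy as np

def residual_summary(predicted_xy, measured_xy):
    """Summarise reprojection residuals: offset, RMS and elongation of the cloud."""
    r = np.asarray(predicted_xy) - np.asarray(measured_xy)   # (N, 2) residuals
    mean = r.mean(axis=0)                                    # should be near (0, 0)
    rms = np.sqrt((r ** 2).sum(axis=1).mean())               # in pixels
    # Ratio of the principal standard deviations; close to 1 for a round cloud.
    eigvals = np.linalg.eigvalsh(np.cov(r.T))
    elongation = np.sqrt(eigvals[1] / eigvals[0])
    return mean, rms, elongation

# Synthetic corners and residuals, for illustration only.
rng = np.random.default_rng(0)
measured = rng.uniform(0, 640, size=(2000, 2))
round_cloud = measured + rng.normal(scale=0.1, size=(2000, 2))
stretched = measured + rng.normal(scale=0.1, size=(2000, 2)) * [3.0, 1.0]

mean_r, rms_r, elong_r = residual_summary(round_cloud, measured)
mean_s, rms_s, elong_s = residual_summary(stretched, measured)
print(mean_r, rms_r, elong_r)
print(mean_s, rms_s, elong_s)
```

A mean far from (0, 0) or an elongation well above 1 is worth investigating, but the scatter plot remains the better tool for spotting separate clumps of outliers.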
Take extra images to verify the accuracy of the solution - use them to verify that the lens distortion is actually removed, and that the planar homography predicted by the calibrated model actually matches the one recovered from the measured corners.
This is a rather late answer, but for people coming to this from Google:
The correct way to check calibration accuracy is to use the reprojection error provided by OpenCV. I'm not sure why this wasn't mentioned anywhere in the answer or comments, you don't need to calculate this by hand - it's the return value of calibrateCamera. In Python it's the first return value (followed by the camera matrix, etc).
The reprojection error is the RMS error between where the points would be projected using the intrinsic coefficients and where they are in the real image. Typically you should expect an RMS error of less than 0.5px - I can routinely get around 0.1px with machine vision cameras. The reprojection error is used in many computer vision papers, there isn't a significantly easier or more accurate way to determine how good your calibration is.
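For anyone who wants to see what that number means, here is the computation written out in NumPy for a single view. It assumes an ideal pinhole camera without distortion and uses made-up intrinsics, pose and board; the observed corners are simply the projected ones shifted by (0.3, 0.4) px so the result is easy to verify.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project (N, 3) world points with a distortion-free pinhole model."""
    cam = points_3d @ R.T + t          # world -> camera coordinates
    uv = cam @ K.T                     # apply the intrinsics
    return uv[:, :2] / uv[:, 2:3]      # perspective divide

def rms_reprojection_error(projected, observed):
    """Root mean square of the per-point distances, in pixels."""
    d2 = ((projected - observed) ** 2).sum(axis=1)
    return np.sqrt(d2.mean())

# Hypothetical camera and a 6x4 corner board with 35 mm squares, 1 m away.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([-0.1, -0.05, 1.0])
gx, gy = np.meshgrid(np.arange(6), np.arange(4))
board = np.column_stack([gx.ravel(), gy.ravel(), np.zeros(24)]) * 0.035

projected = project_points(board, K, R, t)
observed = projected + np.array([0.3, 0.4])   # every corner is off by 0.5 px
error = rms_reprojection_error(projected, observed)
print(error)
```

In a real calibration the sum runs over all corners in all images, using the estimated pose of each view and the full distortion model.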
Unless you have a stereo system, you can only work out where something is in 3D space up to a ray, rather than a point. However, as one can work out the pose of each planar calibration image, it's possible to work out where each chessboard corner should fall on the image sensor. The calibration process (more or less) attempts to work out where these rays fall and minimises the error over all the different calibration images. In Zhang's original paper, and subsequent evaluations, around 10-15 images seems to be sufficient; at this point the error doesn't decrease significantly with the addition of more images.
Other software packages like Matlab will give you error estimates for each individual intrinsic, e.g. focal length, centre of projection. I've been unable to make OpenCV spit out that information, but maybe it's in there somewhere. Camera calibration is now native in Matlab 2014a, but you can still get hold of the camera calibration toolbox which is extremely popular with computer vision users.
http://www.vision.caltech.edu/bouguetj/calib_doc/
Visual inspection is necessary, but not sufficient when dealing with your results. The simplest thing to look for is that straight lines in the world become straight in your undistorted images. Beyond that, it's impossible to really be sure if your cameras are calibrated well just by looking at the output images.
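The straight-line check can be made slightly less subjective by fitting a line to points sampled along an edge in the undistorted image and measuring how far they stray from it. A minimal sketch in NumPy; the function name is made up, and the sample points are synthetic rather than taken from a real image.

```python
import numpy as np

def max_line_deviation(points):
    """Largest perpendicular distance (pixels) of points from their best-fit line."""
    pts = np.asarray(points, dtype=float)
    centred = pts - pts.mean(axis=0)
    # The dominant right singular vector is the line direction, the other its normal.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    normal = vt[1]
    return np.abs(centred @ normal).max()

# Synthetic edge samples: one truly straight, one bowed as residual distortion would leave it.
x = np.arange(0.0, 101.0, 10.0)
straight = np.column_stack([x, 2.0 * x + 5.0])
bowed = np.column_stack([x, 2.0 * x + 5.0 + 0.002 * (x - 50.0) ** 2])

print(max_line_deviation(straight), max_line_deviation(bowed))
```

Edges near the image border are the most informative, since that is where leftover distortion shows up first.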
The routine provided by Francesco is good, follow that. I use a shelf board as my plane, with the pattern printed on poster paper. Make sure the images are well exposed - avoid specular reflection! I use a standard 8x6 pattern, I've tried denser patterns but I haven't seen such an improvement in accuracy that it makes a difference.
I think this answer should be sufficient for most people wanting to calibrate a camera. Realistically, unless you're trying to calibrate something exotic like a fisheye lens or you're doing it for educational reasons, OpenCV/Matlab is all you need. Zhang's method is considered good enough that virtually everyone in computer vision research uses it, and most of them use either Bouguet's toolbox or OpenCV.