Tracking a bowling ball down the lane - OpenCV

So I'm working on this app and have many things done; I can track the ball perfectly in my current videos (which only cover a small part of the lane). The idea is to track a bowling ball down the whole lane to calculate all sorts of things (like speed and position). My problem is that lanes are 60 feet (18 m) long and about 4 feet (1.1 m) wide. I figured the best way to do it was putting the cameras on the ceiling, because background extraction worked flawlessly, so my first test was with a USB webcam. My first surprise was that ceilings are almost 10 feet (3 m) from the floor, so I would need a lot of webcams. When I got one more camera I realized webcams are not the way, because:
a) USB doesn't have the bandwidth to stream video from many cameras at once at high resolution.
b) I would need a lot of webcams because of their fixed lens sizes.
c) Unless you use powered cables, USB doesn't reach very far before losing signal.
So I found what seemed to be the solution: something like 1 or 2 Point Grey cameras (Blackfly or another model) with maybe a 4 mm or 8 mm lens. At this point my funding is very low and I'm trying to make as few mistakes as possible, since this is my own startup, and being in Argentina makes things harder than they would be in the US.
My question is: what kind of approach would you take to capture the whole lane? Maybe change the camera positions and use something other than background extraction? Do you think I'm on the right track? With the USB cam I was at least able to capture and follow the ball frame by frame in a very limited stretch of lane, so I realized I can do everything I want and the project is possible, but USB is not the way.
I'm hoping to hear some advice, as I'm no expert in computer vision or cameras, and I want to keep the project cost efficient. I'm currently working in C# using EmguCV.
Thanks to anyone who took the time to read this :)
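To make the "background extraction + frame-by-frame tracking" idea concrete, here is a rough sketch in Python/OpenCV (EmguCV exposes the same API from C#). The video path, MOG2 parameters, and pixels-per-metre scale are placeholders, not the asker's actual code:

```python
import cv2
import numpy as np

PIXELS_PER_METRE = 250.0                      # placeholder: measure from a known lane dimension
cap = cv2.VideoCapture("lane_overhead.mp4")   # placeholder video path
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

prev_centroid = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        ball = max(contours, key=cv2.contourArea)   # assume the biggest moving blob is the ball
        m = cv2.moments(ball)
        if m["m00"] > 0:
            cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
            if prev_centroid is not None:
                dist_px = np.hypot(cx - prev_centroid[0], cy - prev_centroid[1])
                speed_mps = dist_px / PIXELS_PER_METRE * fps   # metres per second
                print(f"ball at ({cx:.0f}, {cy:.0f}), ~{speed_mps:.2f} m/s")
            prev_centroid = (cx, cy)
cap.release()
```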

These approaches came to my mind:
1- It seems that with one camera on the ceiling your system works well, but you do not cover the whole lane. You can then calculate the speed and position in the portion you cover and extrapolate for when the ball leaves the field of view.
2- Another approach is to use a camera with a wider angle on the ceiling.
3- Instead of the ceiling, you can mount the camera somewhere else (for example on the side) where it covers the whole lane. Since you know the exact location of your camera, you can calculate speed and position based on the location of the ball on the screen (taking perspective etc. into account). The problem with this approach is that you would have to mount the camera at the same point for every customer, and the system won't work if the camera is moved later.
4- The most robust approach to me is stereo vision. You use two cameras a certain distance apart and calibrate them. Then you can mount them anywhere that covers the whole lane. Distance, speed, position, etc. are all feasible and easy to extract once you have the matrices of the two cameras.
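To give an idea of what approach 4 involves in OpenCV terms, here is a minimal triangulation sketch. The intrinsics, extrinsics, and pixel coordinates below are made-up placeholders; in practice they come from cv2.stereoCalibrate and from your ball detector:

```python
import cv2
import numpy as np

# Made-up calibration values so the snippet runs; replace with cv2.stereoCalibrate output.
K1 = K2 = np.array([[800.0, 0, 640], [0, 800.0, 360], [0, 0, 1]])
R = np.eye(3)                          # rotation of camera 2 w.r.t. camera 1
T = np.array([[0.5], [0.0], [0.0]])    # translation of camera 2 w.r.t. camera 1 (0.5 m baseline)

# Projection matrices: camera 1 at the origin, camera 2 at [R|T]
P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K2 @ np.hstack([R, T])

# Ball centroid detected in each image for the same frame (placeholder pixel values)
pt1 = np.array([[700.0], [400.0]])
pt2 = np.array([[620.0], [400.0]])

X = cv2.triangulatePoints(P1, P2, pt1, pt2)
X = (X[:3] / X[3]).ravel()
print("ball position in the camera-1 frame (same units as T):", X)
# Track X over consecutive frames to estimate speed = |delta X| * fps.
```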

Put a camera at an angle where it can see the whole lane. For example, you can mount it on the ceiling looking down and forward. Background subtraction should work just fine, and distances can be calibrated from the initial and final position of the ball. The mapping between the image and the physical world (the lane surface) is given by a homography, which requires 4 point correspondences to compute.
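As a rough illustration of that homography mapping (not part of the original answer), here is a sketch with placeholder correspondences; the lane width and the pixel positions of the landmarks are assumptions:

```python
import cv2
import numpy as np

# Pixel positions of four known lane landmarks (e.g. foul-line corners and two
# points 15 m down the lane) and their lane coordinates in metres. All placeholders.
img_pts = np.array([[210, 700], [430, 700], [300, 80], [345, 80]], dtype=np.float32)
lane_pts = np.array([[0.0, 0.0], [1.05, 0.0], [0.0, 15.0], [1.05, 15.0]], dtype=np.float32)

H, _ = cv2.findHomography(img_pts, lane_pts)

# Project a detected ball centroid (pixels) into lane coordinates (metres)
ball_px = np.array([[[320.0, 400.0]]], dtype=np.float32)
ball_lane = cv2.perspectiveTransform(ball_px, H)[0][0]
print("ball at %.2f m across, %.2f m down the lane" % (ball_lane[0], ball_lane[1]))
```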

Related

Am I missing something with stereo calibration?

I am trying the stereo_calib example and it fails with garbage output, even though it is finding corners in my images...
My xml file and images are all here:
https://drive.google.com/open?id=12-5jBN7FK-LO6SLb4r3YYkrOnP7f_xmG
What am I doing wrong? I first tried printing a pattern on a sheet of paper, then thought OK, that must be too wavy or something, so I had it printed on foam board. But no dice.
(we chatted on a side channel, so this is to the benefit of the rest of the world)
tl;dr: hold the board very still or get a camera with global shutter.
Rolling shutter (see here and there), an attribute of most webcam sensors, many camcorder sensors, and some industrial image sensors, will distort objects that are moving. If you've moved the board even just a little during a frame capture (visible in files right19/right20), it will be captured with distortion. That will affect everything you do with the picture, starting with intrinsic calibration.
To give a sense of scale for the distortions: assuming a 30 FPS video stream, the worst case rolling shutter lag is 33 ms. A pedestrian travels 40-50 mm in that time. If your hands are moving slightly, you can maybe expect a tenth of that, which is still a lot in proportion to the square sizes most people use.
Another source of trouble is printers. If you've printed your checkerboard pattern, make sure to measure the width and height of your squares; they might be slightly rectangular. It's also a good idea to make sure the pattern is quite flat, not bent.
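A useful sanity check before blaming stereoCalibrate, sketched here under assumptions (board size, square size, and file pattern are placeholders): calibrate each camera on its own and look at the RMS reprojection error first.

```python
import glob
import cv2
import numpy as np

BOARD = (9, 6)            # inner corners per row and column (placeholder)
SQUARE = 0.025            # measured square size in metres -- measure both sides!

# 3D object points of the board corners in the board's own coordinate frame
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

objpoints, imgpoints = [], []
for path in sorted(glob.glob("left*.jpg")):          # placeholder file pattern
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, BOARD)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        objpoints.append(objp)
        imgpoints.append(corners)

rms, K, dist, _, _ = cv2.calibrateCamera(objpoints, imgpoints, gray.shape[::-1], None, None)
print("per-camera RMS reprojection error:", rms)   # much above ~1 px suggests bad input images
```

If either camera's mono calibration already has a large RMS, fix the capture (stillness, flatness, lighting) before attempting the stereo step.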

Setting up a scene for image capturing and image processing

I'm building a box where the user will put their foot and then have measurements of their feet taken.
My 1st tier goal is to take basic measurements and my reach goal is to build a 3d model of the person's foot.
Here are some images from my first attempts and prototyping:
[images: back of the foot | inside of the foot | outside of the foot | top of the foot]
So, my big advantage is that I have a lot of control over the scene.
I want to use this fact to set things up so I can get reliable measurements using pictures.
So my questions are as follows:
1) What is the best way to set the scene up? Right now I'm going to have a blue background, lights, and a contrasting sock to create a consistent internal image. Is there a more 'optimal' contrast to use? As you can see, it's working decently.
2) What's an easy way for me to get reliable pixel to mm measurements? I can use a patterned sock (to increase feature density) and then two cameras from each viewpoint, but it would be great to minimize the number of cameras I need.
I'm going to leave the questions there so as not to overload this post - but if people have any other tips, they would be very helpful. Thank you!
My approach to 1) would essentially be a "green screen" or "blue screen".
The idea is to carefully illuminate the background so that there are no shadows. Then you can apply a color threshold, and everything that is not that specific color is the foreground. So far in your images there is quite a bit of shadow, which may be possible to eliminate with careful lighting. You'll have to experiment with how much of an issue that is for you.
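As a sketch of that colour-threshold step (the hue range and file name are placeholders to tune for your actual background and lighting):

```python
import cv2
import numpy as np

# Minimal "blue screen" segmentation: threshold the background colour in HSV
# and take everything else as foreground (the foot/sock).
img = cv2.imread("foot_side.jpg")                      # placeholder image
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

lower_blue = np.array([100, 80, 80])                   # placeholder hue/sat/val bounds
upper_blue = np.array([130, 255, 255])
background = cv2.inRange(hsv, lower_blue, upper_blue)
foreground = cv2.bitwise_not(background)

# Clean up shadow speckle; with even lighting this should leave one clean blob
foreground = cv2.morphologyEx(foreground, cv2.MORPH_OPEN, np.ones((7, 7), np.uint8))
foreground = cv2.morphologyEx(foreground, cv2.MORPH_CLOSE, np.ones((7, 7), np.uint8))
cv2.imwrite("foot_mask.png", foreground)
```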
2) This is a little tougher, but possible. You will need to know the position and direction of your cameras, the lens parameters (such as the f-number), and the sensor parameters (pixel pitch/spacing). With this information you should be able to locate the extrema of the foot and get some measurements. The general idea is this: you could use the top view to locate the mid-line of the foot so you know how far it is from the side cameras. Then you have all the information you need to solve for pixel to real-space measurements. The top camera is easy: since everything is in a plane (assuming the camera is properly aligned and rectified), all you have to do is put a ruler on the floor and take some pictures of it. Then you can measure the pixel to real-space conversion directly from the image.
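To make the top-camera case concrete, a minimal sketch; every number below is a placeholder you would replace with your own measurements:

```python
import numpy as np

# For the top camera the "ruler on the floor" calibration is just a scale factor:
# measure the ruler's length in pixels once, then reuse mm_per_px for the foot.
ruler_length_mm = 300.0
ruler_length_px = 1240.0            # measured once in the calibration image
mm_per_px = ruler_length_mm / ruler_length_px

# Foot length from the extrema of the segmented foot mask (pixel coordinates)
heel_px = np.array([410.0, 980.0])
toe_px = np.array([415.0, 120.0])
foot_length_mm = np.linalg.norm(toe_px - heel_px) * mm_per_px
print("foot length ~ %.1f mm" % foot_length_mm)
```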
For your 3-d modeling issue, I'd like to point out that you don't actually have to get a full point cloud. You could just get a model of a foot and scale it for display based on the measurements you make. In any case, good luck on your project!

Calculate distance between camera and pixel in image

Can you please suggest ways of determining the distance between the camera and a pixel in an image (in real-world units, i.e. cm/m/...)?
The information I have is: camera horizontal (120 degrees) and vertical (90 degrees) field of view, camera angle (-5 degrees) and the height at which the camera is placed (30 cm).
I'm not sure if this is everything I need. Please tell me what information I should have about the camera and how I can calculate the distance between the camera and a pixel.
Maybe it isn't right to say 'distance between camera and pixel', but I guess it is clear what I mean. Please write in the comments if something isn't clear.
Thank you in advance!
What I think you mean is, "how can I calculate the depth at every pixel with a single camera?" Without adding some special hardware this is not feasible, as Rotem mentioned in the comments. There are exceptions, and though I expect you may be limited in time or budget, I'll list a few.
If you want to find depths so that your toy car can avoid collisions, then you needn't assume that depth measurement is required. Google "optical flow collision avoidance" and see if that meets your needs.
If instead you want to measure depth as part of some Simultaneous Localization and Mapping (SLAM) scheme, then that's a different problem to solve. Though difficult to implement, and perhaps not remotely feasible for a toy car project, there are a few ways to measure distance using a single camera:
Project patterns of light, preferably with one or more laser lines or laser spots, and determine depth based on how the dots diverge or converge. The Kinect version 1 operates on this principle of "structured light", though the implementation is much too complicated to reproduce completely. For a simple collision warning you can apply the same principles, only more simply: for example, if the projected light pattern on the right side of the image changes quickly, turn left! Learning how to estimate distance using structured light is a significant project to undertake, but there are plenty of references (a rough triangulation sketch appears near the end of this answer).
Split the optical path so that one camera sensor can see two different views of the world. I'm not aware of optical splitters for tiny cameras, but they may exist. But even if you find a splitter, the difficult problem of implementing stereovision remains. Stereovision has inherent problems (see below).
Use a different sensor, such as the somewhat iffy but small Intel R200, which will generate depth data. (http://click.intel.com/intel-realsense-developer-kit-r200.html)
Use a time-of-flight camera. These are the types of sensors built into the Kinect version 2 and several gesture-recognition sensors. Several companies have produced or are actively developing tiny time-of-flight sensors. They will generate depth data AND provide full-color images.
Run the car only in controlled environments.
The environment in which your toy car operates is important. If you can limit your toy car's environment to a tightly controlled one, you can limit the need to write complicated algorithms. As is true with many imaging problems, a narrowly defined problem may be straightforward to solve, whereas the general problem may be nearly impossible to solve. If you want your car to run "anywhere" (which likely isn't true), assume the problem is NOT solvable.
Even if you have an off-the-shelf depth sensor that represents the best technology available, you would still run into limitations:
Each type of depth sensing has weaknesses. No depth sensors on the market do well with dark, shiny surfaces. (Some spot sensors do okay with dark, shiny surfaces, but area sensors don't.) Stereo sensors have problems with large, featureless regions, and also require a lot of processing power. And so on.
Once you have a depth image, you still need to run calculations, and short of having a lot of onboard processing power this will be difficult to pull off on a toy car.
If you have to make many compromises to use depth sensing, then you might consider just using a simpler ultrasound sensor to avoid collisions.
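For what it's worth, here is a rough sketch of the geometry behind the structured-light option above (a single laser spot at a known baseline, aimed parallel to the optical axis); all numbers are made-up placeholders:

```python
import numpy as np

# Depth from a laser spot follows from simple triangulation, much like stereo disparity.
focal_px = 600.0        # focal length in pixels (from camera calibration) -- placeholder
baseline_m = 0.05       # camera-to-laser offset in metres -- placeholder
cx = 320.0              # principal point x (image centre for a 640-wide sensor)

spot_x_px = 380.0       # detected laser spot column in the image -- placeholder
offset_px = abs(spot_x_px - cx)

depth_m = focal_px * baseline_m / offset_px   # Z = f * B / d
print("estimated distance to the lit surface: %.2f m" % depth_m)
```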
Good luck!

How to align (register) and merge point clouds to get a full 3D model?

I want to get a 3D model of some real-world object.
I have two webcams, and using OpenCV and StereoBM for stereo correspondence I get a point cloud of the scene; by filtering on z I can get a point cloud of only the object.
I know that ICP is good for this purpose, but it needs the point clouds to be initially well aligned, so it is combined with SAC to achieve better results.
But my SAC fitness score is too big, something like 70 or 40, and ICP doesn't give good results either.
My questions are:
Is it OK for ICP if I just rotate the object in front of the cameras to obtain the point clouds? What angle of rotation between captures is needed to achieve good results? Or is there a better way of taking pictures of the object for getting a 3D model? Is it OK if my point clouds have some holes? What is the maximum acceptable SAC fitness score for good ICP, and what is the maximum fitness score of a good ICP?
Example of my point cloud files:
https://drive.google.com/file/d/0B1VdSoFbwNShcmo4ZUhPWjZHWG8/view?usp=sharing
My advice, from experience: you already have RGB (or greyscale) images, so use them. ICP is good for optimising the alignment of point clouds, but it has trouble aligning them from scratch.
First start with RGB odometry (aligning the point clouds, which are rotated relative to each other, through feature points), then learn how ICP works with the already mentioned Point Cloud Library. Let the RGB features give you an initial prediction, then use ICP to refine it where possible.
Once this works, think about a good fitness score calculation. If that all works, use the trunk version of ICP and tune the parameters. After all of this is done you have code that is not only fast, but also has a low chance of going wrong.
The following post explains what can go wrong:
Using ICP, we refine this transformation using only geometric information. However, here ICP decreases the precision. What happens is that ICP tries to match as many corresponding points as it can. Here the background behind the screen has more points than the screen itself on the two scans. ICP will then align the clouds to maximize the correspondences on the background. The screen is then misaligned.
https://github.com/introlab/rtabmap/wiki/ICP
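As a rough sketch of the "coarse alignment first, then ICP, then check the fitness" workflow, here is the equivalent in Open3D (a Python library; the answer above refers to PCL, but the concepts map directly). File names and the initial transform are placeholders:

```python
import numpy as np
import open3d as o3d

source = o3d.io.read_point_cloud("cloud_000.pcd")   # placeholder paths
target = o3d.io.read_point_cloud("cloud_001.pcd")

# Replace with the coarse transform from your RGB feature matching / SAC step
init = np.eye(4)

result = o3d.pipelines.registration.registration_icp(
    source, target, 0.02, init,                      # 0.02 = max correspondence distance (tune)
    o3d.pipelines.registration.TransformationEstimationPointToPoint())

# fitness = overlap ratio (higher is better), inlier_rmse = residual error (lower is better)
print("fitness:", result.fitness, "inlier RMSE:", result.inlier_rmse)
source.transform(result.transformation)              # apply the transform, then merge the clouds
```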

Structure from Motion (SfM) in a tunnel-like structure?

I have a very specific application in which I would like to try structure from motion to get a 3D representation. So far, all the software/code samples I have found for structure from motion work like this: a fixed object is photographed from all angles to create the 3D model. This is not my case.
In my case, the camera is moving in the middle of a corridor and looking forward. Sometimes the camera can look in another direction (left, right, up, down). The camera never goes back or looks back; it always moves forward. Since the corridor is small, almost everything is visible (no hidden spots). The corridor can sometimes be very long.
I have tried this software and it doesn't work in my particular case (but it's fantastic for normal use). Can anybody suggest a library/software/tool/paper that targets my specific needs? Or have you ever needed to implement something like that? Any help is welcome!
Thanks!
What kind of corridors are you talking about and what kind of precision are you aiming for?
A priori, I don't see why your corridor would not be a fixed object photographed from different angles. The quality of your reconstruction might suffer if you only look forward and you can't get many different views of the scene, but standard methods should still work. Are you sure that the programs you used aren't failing because of your picture quality, arrangement or other reasons?
If you have to do the reconstruction yourself, I would start by
1) Calibrating your camera
2) Undistorting your images
3) Matching feature points in subsequent image pairs
4) Extracting a 3D point cloud for each image pair
You can then orient the point clouds with respect to one another, for example via ICP between two subsequent clouds. More sophisticated methods might not yield much difference if you don't have any closed loops in your dataset (as your camera is only moving forward).
OpenCV and the Point Cloud Library should be everything you need for these steps. Visualization might be more of a hassle, but the pretty pictures are what you pay for in commercial software after all.
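To make steps 3) and 4) concrete, here is a minimal two-view sketch with OpenCV; the intrinsics and image file names are placeholders (in practice K comes from step 1, and you would undistort the images first, step 2):

```python
import cv2
import numpy as np

# Load a consecutive image pair (placeholder paths)
img1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Intrinsic matrix from your calibration (step 1); values here are made up
K = np.array([[700.0, 0, 640], [0, 700.0, 360], [0, 0, 1]])

# Step 3: match ORB features between the two frames
orb = cv2.ORB_create(4000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Step 4: recover the relative pose and triangulate a sparse point cloud
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
pts3d = (pts4d[:3] / pts4d[3]).T    # N x 3 point cloud, up to scale
print("triangulated", len(pts3d), "points")
```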
Edit (2017/8): I haven't worked on this in the meantime, but I feel like this answer is missing some pieces. If I had to answer it today, I would definitely suggest looking into the keyword monocular SLAM, which has recently seen a lot of activity, not least because of drones with cameras. Notably, LSD-SLAM is open source and may not be as vulnerable to feature-deprived views, as it operates directly on the intensity. There even seem to be approaches combining inertial/odometry sensors with the image matching algorithms.
Good luck!
FvD is right in the sense that your corridor is a static object. Your scenario is the same as moving around an object and taking images from multiple views; your views are just not arranged to provide a 360-degree view of the object.
I see you mentioned in a previous comment that the data is coming from a video? In that case, the problem could very well be the camera calibration. A camera calibration tells the SfM algorithm about the internal parameters of the camera (focal length, principal point, lens distortion, etc.). In the absence of this knowledge, the bundler in VSfM uses information from the EXIF data of the images. However, I don't think video stores any EXIF information (not 100% sure). As a result, I think the entire algorithm is running with a bad focal length and cannot solve for the orientation.
Can you extract a few frames from the video and see if there is any EXIF information?

Resources