How to track people across multiple cameras? - opencv

This is the setup: A fairly large room with 4 fish-eye cameras mounted on the ceiling. There are no blind spots. Each camera coverage overlaps a little with the other.
The idea is to track people across these cameras. As of now a blob extracting algorithm is in place, which detects people as blobs. It's a fairly decently working algorithm which detects individual people pretty good. Am using the OpenCV API for all of this.
What I mean by track people is that - Say, camera 1 identifies two people, say Person A and Person B. Now, as these two people move from the coverage of camera 1 into the overlapping area of coverage of cam1 and cam2 and into the area where only cam2 covers, cam2 should be able to identify them as the same people A and B cam1 identified them as.
This is what I thought I'd do -
1) The camera renders the image at 15fps and I think the dimensions of the frames are of 1920x1920.
2) Identify blobs individually in each camera and give each blob an unique label.
3) Now as for the overlaps - Compute an affine transformation matrix which maps pixels on one camera's frame onto another camera's frame - this needn't be done for every frame - this can be done before the whole process starts, as a pre-processing step. So in real time, whenever I detect a blob which is in the overlapping area, all I have to do is apply the transformation matrix to the pixels in cam1 and see if there is a corresponding blob in cam2 and give them the same label.
So, Questions :
1) Would this system give me a badly-working system which tracks people decently ?
2) So, for the affine transform, do I have to convert the fish-eye to rectilinear image ? (My answer is yes, but am not too sure)
Please feel free to point out possible errors and why certain things might not work in the process I've described. Also alternate suggestions are welcome! TIA

1- blob extraction is not enough to track a specific object, for people case I suggest HoG - or at least background subtraction before blob extraction, since all of the cameras have still scenes.
2- opencv <=2.4.9 uses pinhole model for stereo vision. so, before any calibration with opencv methods your fisheye images must be converted to rectilinear images first. You might try calibrating yourself using other approaches too
release 3.0.0 will have support for fisheye model. It is on alpha stage, you can still download and give it a try.

Related

How can I detect whether the object is 3D?

I am trying to build a solution where I could differentiate between a 3D textured surface with the height of around 200 micron and a regular text print.
The following image is a textured surface. The black color here is the base surface.
Regular text print will be the 2D print of the same 3D textured surface.
[EDIT]
Initial thought about solving this problem, could look like this:
General idea here would be, images shot at different angles of a 3D object would be less related to each other than the images shot for a 2D object in the similar condition.
One of the possible way to verify could be: 1. Take 2 images, with enough light around (flash of the camera). These images should be shot at as far angle from the object plane as possible. Say, one taken at camera making 45 degree at left side and other with the same angle on the right side.
Extract the ROI, perspective correct them.
Find GLCM of the composite of these 2 images. If the contrast of the GLCM is low, then it would be a 3D image, else a 2D.
Please pardon the language, open for edit suggestion.
General idea here would be, images shot at different angles of a 3D object would be less related to each other than the images shot for a 2D object in the similar condition.
One of the possible way to verify could be:
1. Take 2 images, with enough light around (flash of the camera). These images should be shot at as far angle from the object plane as possible. Say, one taken at camera making 45 degree at left side and other with the same angle on the right side.
Extract the ROI, perspective correct them.
Find GLCM of composite of these 2 images. If contrast of the GLCM is low, then it would be a 3D image, else a 2D.
Please pardon the language, open for edit suggestion.
If you can get another image which
different angle or
sharper angle or
different lighting condition
you may get result. However, using two image with different angle with calibrate camera can get stereo vision image which solve your problem easily.
This is a pretty complex problem and there is no plug-in-and-go solution for this. Using light (structured or laser) or shadow to detect a height of 0.2 mm will almost surely not work with an acceptable degree of confidence, no matter of how much "photos" you take. (This is just my personal intuition, in computer vision we verify if something works by actually testing).
GLCM is a nice feature to describe texture, but it is, as far as I know, used to verify if there is a pattern in the texture, so, I believe it would output a positive value for 2D print text if there is some kind of repeating pattern.
I would let the computer learn what is text, what is texture. Just extract a large amount of 3D and 2D data, and use a machine learning engine to learn which is what. If the feature space is rich enough, it may be able to find a way to differentiate one from another, in a way our human mind wouldn't be able to. The feature space should consist of edge and colour features.
If the system environment is stable and controlled, this approach will work specially well, since the training data will be so similar to the testing data.
For this problem, I'd start by computing colour and edge features (local image pixel sums over different edge and colour channels) and try a boosted classifier. Boosted classifiers aren't the state of the art when it comes to machine learning, but they are good at not overfitting (meaning you can just insert as much data as you want), and will most likely work in a stable environment.
Hope this helps,
Good luck.

Stiching Aerial images with OpenCV with a warper that projects images to the ground

Have anyone done something like that?
My problems with the OpenCV sticher is that it warps the images for panoramas, meaning the images get stretched a lot as one moves away from the first image.
From what I can tell OpenCV also builds ontop of the assumption of the camera is in the same position. I am seeking a little guidence on this, if its just the warper I need to change or I also need to relax this asusmption about the camera position being fixed before that.
I noticed that opencv uses a bundle adjuster also, is it using the same assumption that the camera is fixed?
Aerial image mosaicing
The image warping routines that are used in remote sensing and digital geography (for example to produce geotiff files or more generally orthoimages) rely on both:
estimating the relative image motion (often improved with some aircraft motion sensors such as inertial measurement units),
the availability of a Digital Elevation Model of the observed scene.
This allows to estimate the exact projection on the ground of each measured pixel.
Furthermore, this is well beyond what OpenCV will provide with its built-in stitcher.
OpenCV's Stitcher
OpenCV's Stitcher class is indeed dedicated to the assembly of images taken from the same point.
This would not be so bad, except that the functions try to estimate just a rotation (to be more robust) instead of plain homographies (this is where the fixed camera assumption will bite you).
It adds however more functionality that are useful in the context of panoramao creation, especially the image seam cut detection part and the image blending in overlapping areas.
What you can do
With aerial sensors, it is usually sound to assume (except when creating orthoimages) that the camera - scene distance is big enough so that you can approach the inter-frame transform by homographies (expecially if your application does not require very accurate panoramas).
You can try to customize OpenCV's stitcher to replace the transform estimate and the warper to work with homographies instead of rotations.
I can't guess if it will be difficult or not, because for the most part it will consist in using the intermediate transform results and bypassing the final rotation estimation part. You may have to modify the bundle adjuster too however.

Image Rectification for Shake Correction on OpenCV

I've 2 pictures of the same scene from an uncalibrated camera. The pics are from a slightly different angle and scale(zoom) and I'd like to superpose them, rejecting any kind of shake. In other words, I should transform them so the shake becomes imperceptible, do a Motion Compensation.
I've already tried using a simple SURF (feature) detector along with Homography but sometimes the result isn't satisfactory. So I am thinking about trying Image Rectification to compensate the motion.
- Would it work with slight changes, such as user shake?
- Would it really work to reject shake for these 2 frames? And for a bigger buffer of pictures (10 maybe)?
- Anyone knows if it would fix scale disparity (different zoom in the images)?
- What the algorithm really do? Will it transform both pictures into a third orientation?
If there is a better solution, I would be glad to know =)
EDIT
I don't aim to compensate blur motion but the displacement itself. For example, in this file the author compensates the angle difference between two cameras by Image Rectification. How does it actually work? Does it always create an intermediate picture orientation or can I specify that one of the pictures shall remains still??
Also, would I be able to apply this to many frames or it would always find an intermediate orientation for each two frames I put in?
Cheers,
I'm not sure how well superimposing the images would work. Another way to remove blur (including motion blur which should dominate in handheld camera devices) from an image is by blind deconvolution. It is basically a method of finding the inverse of the blur filter that was physically applied (camera shaken) to the real image. There's plenty of techniques out on the web. I've specifically had good results using a modified version of the algorithm in this paper: http://www.cse.cuhk.edu.hk/~leojia/all_final_papers/motion_deblur_cvpr07.pdf
It also comes with an executable file somewhere around the web so you can see if it's fit for your purpose.
Good luck out there!

OpenCV intrusion detection

For a project of mine, I'm required to process images differences with OpenCV. The goal is to detect an intrusion in a zone.
To be a little more clear, here are the inputs and outputs:
Inputs:
An image of reference
A second image from approximately the same point of view (can be an error margin)
Outputs:
Detection of new objects in the scene.
Bonus:
Recognition of those objects.
For me, the most difficult part of it is to take off small differences (luminosity, camera position margin error, movement of trees...)
I already read a lot about OpenCV image processing (subtraction, erosion, threshold, SIFT, SURF...) and have some good results.
What I would like is a list of steps you think is the best to have a good detection (humans, cars...), and the algorithms to do each step.
Many thanks for your help.
Track-by-Detect, human tracker:
You apply the Hog detector to detect humans.
You draw a respective rectangle as foreground area on the foreground mask.
You pass this mask to "The OpenCV Video Surveillance / Blob Tracker Facility"
You can, now, group the passing humans based on their blob.{x,y} values into public/restricted areas.
I had to deal with this problem the last year.
I suggest an adaptive background-foreground estimation algorithm which produced a foreground mask.
On top of that, you add a blob detector and tracker, and then calculate if an intersection takes place between the blobs and your intrusion area.
Opencv comes has samples of all of these within the legacy code. Ofcourse, if you want you can also use your own or other versions of these.
Links:
http://opencv.willowgarage.com/wiki/VideoSurveillance
http://experienceopencv.blogspot.gr/2011/03/blob-tracking-video-surveillance-demo.html
I would definitely start with a running average background subtraction if the camera is static. Then you can use findContours() to find the intruding object's location and size. If you want to detect humans that are walking around in a scene, I would recommend looking at using the built-in haar classifier:
http://docs.opencv.org/doc/tutorials/objdetect/cascade_classifier/cascade_classifier.html#cascade-classifier
where you would just replace the xml with the upperbody classifier.

3D reconstruction -- How to create 3D model from 2D image?

If I take a picture with a camera, so I know the distance from the camera to the object, such as a scale model of a house, I would like to turn this into a 3D model that I can maneuver around so I can comment on different parts of the house.
If I sit down and think about taking more than one picture, labeling direction, and distance, I should be able to figure out how to do this, but, I thought I would ask if someone has some paper that may help explain more.
What language you explain in doesn't matter, as I am looking for the best approach.
Right now I am considering showing the house, then the user can put in some assistance for height, such as distance from the camera to the top of that part of the model, and given enough of this it would be possible to start calculating heights for the rest, especially if there is a top-down image, then pictures from angles on the four sides, to calculate relative heights.
Then I expect that parts will also need to differ in color to help separate out the various parts of the model.
As mentioned, the problem is very hard and is often also referred to as multi-view object reconstruction. It is usually approached by solving the stereo-view reconstruction problem for each pair of consecutive images.
Performing stereo reconstruction requires that pairs of images are taken that have a good amount of visible overlap of physical points. You need to find corresponding points such that you can then use triangulation to find the 3D co-ordinates of the points.
Epipolar geometry
Stereo reconstruction is usually done by first calibrating your camera setup so you can rectify your images using the theory of epipolar geometry. This simplifies finding corresponding points as well as the final triangulation calculations.
If you have:
the intrinsic camera parameters (requiring camera calibration),
the camera's position and rotation (it's extrinsic parameters), and
8 or more physical points with matching known positions in two photos (when using the eight-point algorithm)
you can calculate the fundamental and essential matrices using only matrix theory and use these to rectify your images. This requires some theory about co-ordinate projections with homogeneous co-ordinates and also knowledge of the pinhole camera model and camera matrix.
If you want a method that doesn't need the camera parameters and works for unknown camera set-ups you should probably look into methods for uncalibrated stereo reconstruction.
Correspondence problem
Finding corresponding points is the tricky part that requires you to look for points of the same brightness or colour, or to use texture patterns or some other features to identify the same points in pairs of images. Techniques for this either work locally by looking for a best match in a small region around each point, or globally by considering the image as a whole.
If you already have the fundamental matrix, it will allow you to rectify the images such that corresponding points in two images will be constrained to a line (in theory). This helps you to use faster local techniques.
There is currently still no ideal technique to solve the correspondence problem, but possible approaches could fall in these categories:
Manual selection: have a person hand-select matching points.
Custom markers: place markers or use specific patterns/colours that you can easily identify.
Sum of squared differences: take a region around a point and find the closest whole matching region in the other image.
Graph cuts: a global optimisation technique based on optimisation using graph theory.
For specific implementations you can use Google Scholar to search through the current literature. Here is one highly cited paper comparing various techniques:
A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms.
Multi-view reconstruction
Once you have the corresponding points, you can then use epipolar geometry theory for the triangulation calculations to find the 3D co-ordinates of the points.
This whole stereo reconstruction would then be repeated for each pair of consecutive images (implying that you need an order to the images or at least knowledge of which images have many overlapping points). For each pair you would calculate a different fundamental matrix.
Of course, due to noise or inaccuracies at each of these steps you might want to consider how to solve the problem in a more global manner. For instance, if you have a series of images that are taken around an object and form a loop, this provides extra constraints that can be used to improve the accuracy of earlier steps using something like bundle adjustment.
As you can see, both stereo and multi-view reconstruction are far from solved problems and are still actively researched. The less you want to do in an automated manner the more well-defined the problem becomes, but even in these cases quite a bit of theory is required to get started.
Alternatives
If it's within the constraints of what you want to do, I would recommend considering dedicated hardware sensors (such as the XBox's Kinect) instead of only using normal cameras. These sensors use structured light, time-of-flight or some other range imaging technique to generate a depth image which they can also combine with colour data from their own cameras. They practically solve the single-view reconstruction problem for you and often include libraries and tools for stitching/combining multiple views.
Epipolar geometry references
My knowledge is actually quite thin on most of the theory, so the best I can do is to further provide you with some references that are hopefully useful (in order of relevance):
I found a PDF chapter on Multiple View Geometry that contains most of the critical theory. In fact the textbook Multiple View Geometry in Computer Vision should also be quite useful (sample chapters available here).
Here's a page describing a project on uncalibrated stereo reconstruction that seems to include some source code that could be useful. They find matching points in an automated manner using one of many feature detection techniques. If you want this part of the process to be automated as well, then SIFT feature detection is commonly considered to be an excellent non-real-time technique (since it's quite slow).
A paper about Scene Reconstruction from Multiple Uncalibrated Views.
A slideshow on Methods for 3D Reconstruction from Multiple Images (it has some more references below it's slides towards the end).
A paper comparing different multi-view stereo reconstruction algorithms can be found here. It limits itself to algorithms that "reconstruct dense object models from calibrated views".
Here's a paper that goes into lots of detail for the case that you have stereo cameras that take multiple images: Towards robust metric reconstruction
via a dynamic uncalibrated stereo head. They then find methods to self-calibrate the cameras.
I'm not sure how helpful all of this is, but hopefully it includes enough useful terminology and references to find further resources.
Research has made significant progress and these days it is possible to obtain pretty good-looking 3D shapes from 2D images. For instance, in our recent research work titled "Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes With Deep Generative Networks" took a big step in solving the problem of obtaining 3D shapes from 2D images. In our work, we show that you can not only go from 2D to 3D directly and get a good, approximate 3D reconstruction but you can also learn a distribution of 3D shapes in an efficient manner and generate/synthesize 3D shapes. Below is an image of our work showing that we are able to do 3D reconstruction even from a single silhouette or depth map (on the left). The ground-truth 3D shapes are shown on the right.
The approach we took has some contributions related to cognitive science or the way the brain works: the model we built shares parameters for all shape categories instead of being specific to only one category. Also, it obtains consistent representations and takes the uncertainty of the input view into account when producing a 3D shape as output. Therefore, it is able to naturally give meaningful results even for very ambiguous inputs. If you look at the citation to our paper you can see even more progress just in terms of going from 2D images to 3D shapes.
This problem is known as Photogrammetry.
Google will supply you with endless references, just be aware that if you want to roll your own, it's a very hard problem.
Check out The Deadalus Project, althought that website does not contain a gallery with illustrative information about the solution, it post several papers and info about the working method.
I watched a lecture from one of the main researchers of the project (Roger Hubbold), and the image results are quite amazing! Althought is a complex and long problem. It has a lot of tricky details to take into account to get an approximation of the 3d data, take for example the 3d information from wall surfaces, for which the heuristic to work is as follows: Take a photo with normal illumination of the scene, and then retake the picture in same position with full flash active, then substract both images and divide the result by a pre-taken flash calibration image, apply a box filter to this new result and then post-process to estimate depth values, the whole process is explained in detail in this paper (which is also posted/referenced in the project website)
Google Sketchup (free) has a photo matching tool that allows you to take a photograph and match its perspective for easy modeling.
EDIT: It appears that you're interested in developing your own solution. I thought you were trying to obtain a 3D model of an image in a single instance. If this answer isn't helpful, I apologize.
Hope this helps if you are trying to construct 3d volume from 2d stack of images !! You can use open source tool such as ImageJ Fiji which comes with 3d viewer plugin..
https://quppler.com/creating-a-classifier-using-image-j-fiji-for-3d-volume-data-preparation-from-stack-of-images/

Resources