Contour assignment in OpenCV - Tracking across frames of a movie

I am interested in tracking objects across frames of a movie to calculate the velocity of each object. These are Drosophila in well plates, so there are always 12 objects in every frame and the objects should not be allowed to merge. I have written a script that identifies each object and finds its centroid. It then minimizes the distances between those centroids and the ones detected in the previous frame.
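In OpenCV C++ terms, the matching step amounts to something like the following sketch (names are illustrative, not my actual script, and the greedy nearest-neighbour assignment relies on the wells keeping the flies well separated):

#include <opencv2/imgproc.hpp>
#include <vector>
#include <limits>

// One centroid per contour, via image moments.
std::vector<cv::Point2f> contourCentroids(const std::vector<std::vector<cv::Point>>& contours)
{
    std::vector<cv::Point2f> centroids;
    for (const auto& c : contours) {
        cv::Moments m = cv::moments(c);
        if (m.m00 > 0)   // skip degenerate contours
            centroids.emplace_back(static_cast<float>(m.m10 / m.m00),
                                   static_cast<float>(m.m01 / m.m00));
    }
    return centroids;
}

// For each current centroid, return the index of the closest centroid
// from the previous frame (greedy nearest-neighbour assignment).
std::vector<int> matchToPrevious(const std::vector<cv::Point2f>& prev,
                                 const std::vector<cv::Point2f>& curr)
{
    std::vector<int> ids(curr.size(), -1);
    for (size_t i = 0; i < curr.size(); ++i) {
        double best = std::numeric_limits<double>::max();
        for (size_t j = 0; j < prev.size(); ++j) {
            const double dx = curr[i].x - prev[j].x;
            const double dy = curr[i].y - prev[j].y;
            const double d2 = dx * dx + dy * dy;   // squared distance is enough for the argmin
            if (d2 < best) { best = d2; ids[i] = static_cast<int>(j); }
        }
    }
    return ids;
}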
What I am really surprised to see is that, without even taking the previous centroids into consideration, OpenCV seems to do a really good job of automatically assigning each contour the same relative identity across frames. So if I plot my video with the contour number over each blob, that number hardly changes across frames. How does this work? How does OpenCV decide which contour will be returned first and which will be returned last?
Thank you,
FB

Related

Object detection+tracking

I am new to the field of CV and am trying to build object detection with YOLO and object tracking with DeepSort.
I have some problems with the identification of objects in the video. Here is an example:
The sports ball is identified in the video, but when it is too close to the person the detector cannot identify it.
In this picture the ball is identified:
Here the ball is not identified:
How can I improve the detection? I am using pre-trained YOLOv3 (which is trained on the COCO dataset) and DeepSort.
To get better accuracy from YOLOv3 you can retrain the model with new images.
If you don't want to do that, you can try upgrading your model to YOLOv4 or YOLOv5. These have better performance and accuracy.
https://github.com/AlexeyAB/darknet
https://github.com/ultralytics/yolov5
Based on the resolution of your sample, I can say that it's a crop from a much larger image, so the ball is, in its given context, a fairly small object.
YOLO architectures are notorious for poor performance on small objects, due to the reduced dimensionality of the feature maps.
For detecting and then tracking small objects I recommend an SSD architecture, which uses feature maps extracted from multiple depth levels to produce its results; or maybe try out a newer architecture with comparable inference-time performance, such as EfficientDet. Implementations of both can be found across a variety of frameworks.
You can use a single-object tracker when the ball is not detected. In OpenCV: KCF, CSRT, etc. The idea, frame by frame (a sketch follows these steps):
Frame N. Detected 2 objects: person and ball. Save a pointer to Frame N.
Frame N+1. Detected only the person. Create a single-object tracker (CSRT is better) from the previous ball position on Frame N. Run the CSRT tracker on the current Frame N+1. Get the current position and add it to the detections. Save a pointer to Frame N+1.
Frame N+2. If the ball was detected, then delete the CSRT tracker. Save a pointer to Frame N+2.
...
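A rough sketch of that fallback logic in OpenCV C++ (assuming OpenCV 4.5+ with the contrib tracking module, where TrackerCSRT::update takes a cv::Rect; names and the surrounding detector are placeholders):

#include <opencv2/core.hpp>
#include <opencv2/tracking.hpp>

cv::Ptr<cv::TrackerCSRT> ballTracker;  // alive only while the detector has lost the ball
cv::Rect lastBallBox;                  // last known ball position (from detector or tracker)

// Call once per frame, after the detector has run.
// ballDetected / detectedBox come from your detector (e.g. YOLOv3 + DeepSort).
void updateBall(const cv::Mat& frame, bool ballDetected, const cv::Rect& detectedBox)
{
    if (ballDetected) {
        lastBallBox = detectedBox;
        ballTracker.release();         // detection is back: drop the single-object tracker
        return;
    }
    if (!ballTracker) {                // first missed frame: start CSRT from the last detection
        ballTracker = cv::TrackerCSRT::create();
        ballTracker->init(frame, lastBallBox);
    }
    cv::Rect box = lastBallBox;
    if (ballTracker->update(frame, box))
        lastBallBox = box;             // use the tracked box as a substitute detection
}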

Extrinsic Camera Calibration Using OpenCV's solvePnP Function

I'm currently working on an augmented reality application using a medical imaging program called 3DSlicer. My application runs as a module within the Slicer environment and is meant to provide the tools necessary to use an external tracking system to augment a camera feed displayed within Slicer.
Currently, everything is configured properly, so all that I have left to do is automate the calculation of the camera's extrinsic matrix, which I decided to do using OpenCV's solvePnP() function. Unfortunately, this has been giving me some difficulty, as I am not obtaining the correct results.
My tracking system is configured as follows:
The optical tracker is mounted in such a way that the entire scene can be viewed.
Tracked markers are rigidly attached to a pointer tool, the camera, and a model that we have acquired a virtual representation for.
The pointer tool's tip was registered using a pivot calibration. This means that any values recorded using the pointer indicate the position of the pointer's tip.
Both the model and the pointer have 3D virtual representations that augment a live video feed as seen below.
The pointer and camera (referred to as C from here on) markers each return a homogeneous transform that describes their position relative to the marker attached to the model (referred to as M from here on). The model's marker, being the origin, does not return any transformation.
I obtained two sets of points, one 2D and one 3D. The 2D points are the coordinates of a chessboard's corners in pixel coordinates, while the 3D points are the corresponding world coordinates of those same corners relative to M. These were recorded using OpenCV's findChessboardCorners() function for the 2D points and the pointer for the 3D points. I then transformed the 3D points from M space to C space by multiplying them by C inverse. This was done because the solvePnP() function requires that the 3D points be described relative to the world coordinate system of the camera, which in this case is C, not M.
Once all of this was done, I passed the point sets into solvePnP(). The transformation I got was completely incorrect, though. I am honestly at a loss as to what I did wrong. Adding to my confusion is the fact that OpenCV uses a different coordinate format from OpenGL, which is what 3DSlicer is based on. If anyone can provide some assistance in this matter I would be exceptionally grateful.
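For reference, the shape of the solvePnP() call in OpenCV C++ is roughly as follows (an illustrative sketch with placeholder names, not my actual module code; solvePnP returns the pose that maps points from whatever frame the 3D points are expressed in into the camera frame):

#include <opencv2/calib3d.hpp>
#include <vector>

// Returns the 4x4 transform that maps points from the world frame (the frame the
// 3D corners are expressed in) into the camera frame.
cv::Mat estimateExtrinsics(const std::vector<cv::Point3f>& objectPoints, // 3D corners (e.g. in M space)
                           const std::vector<cv::Point2f>& imagePoints,  // same corners in pixels, same order
                           const cv::Mat& cameraMatrix,                  // intrinsics from a prior calibration
                           const cv::Mat& distCoeffs)
{
    cv::Mat rvec, tvec;
    cv::solvePnP(objectPoints, imagePoints, cameraMatrix, distCoeffs, rvec, tvec);

    cv::Mat R;
    cv::Rodrigues(rvec, R);                        // rotation vector -> 3x3 matrix

    cv::Mat T = cv::Mat::eye(4, 4, CV_64F);
    R.copyTo(T(cv::Rect(0, 0, 3, 3)));             // [R | t] in homogeneous form
    tvec.reshape(1, 3).copyTo(T(cv::Rect(3, 0, 1, 3)));
    return T;
}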
Also if anything is unclear, please don't hesitate to ask. This is a pretty big project so it was hard for me to distill everything to just the issue at hand. I'm wholly expecting that things might get a little confusing for anyone reading this.
Thank you!
UPDATE #1: It turns out I'm a giant idiot. I recorded collinear points only, because I was too impatient to record the entire checkerboard. Of course this meant that there were nearly infinitely many solutions to the least-squares regression, as I had only constrained the solution in 2 dimensions! My values are much closer to my ground truth now, and in fact the rotational columns seem correct, except that they're all completely out of order. I'm not sure what could cause that, but it seems that my rotation matrix was mirrored across the center column. In addition to that, my translation components are negative when they should be positive, although their magnitudes seem to be correct. So now I've basically got all the right values in all the wrong order.
Mirror/rotational ambiguity.
You basically need to reorient your coordinate frames by imposing the constraints that (1) the scene is in front of the camera and (2) the checkerboard axes are oriented as you expect them to be. This boils down to multiplying your calibrated transform by an appropriate ("hand-built") rotation and/or mirroring (a small sketch follows below).
The basic problem is that the calibration target you are using - even when all the corners are seen - has at least a 180-degree rotational ambiguity unless color information is used. If some corners are missed, things can get even weirder.
You can often use prior info about the camera orientation w.r.t. the scene to resolve this kind of ambiguity, as I was suggesting above. However, in more dynamic situations, or if a further degree of automation is needed in situations where the target may be only partially visible, you'd be much better off using a target in which each small chunk of corners can be individually identified. My favorite is Matsunaga and Kanatani's "2D barcode" one, which uses sequences of square lengths with unique cross-ratios. See the paper here.
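To make the "hand-built" rotation mentioned above concrete, here is a minimal sketch for the 180-degree case (the 4x4 homogeneous form and the choice of the board's z axis are illustrative):

#include <opencv2/core.hpp>

// The checkerboard pose has (at least) a 180-degree ambiguity about the board normal,
// so composing the estimated world-to-camera transform with that rotation picks the
// orientation you actually expect.
cv::Mat flipBoard180(const cv::Mat& T_world_to_cam)   // 4x4 homogeneous transform
{
    cv::Mat Rz180 = (cv::Mat_<double>(4, 4) <<
        -1,  0, 0, 0,
         0, -1, 0, 0,
         0,  0, 1, 0,
         0,  0, 0, 1);
    // Applied on the world side, this re-labels the board axes before mapping into the camera.
    return T_world_to_cam * Rz180;
}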

Unstable homography estimation using ORB

I am developing a feature tracking application and so far, after trying almost all the feature detectors/descriptors, I've got the most satisfactory overall results with ORB.
Both my feature detector and descriptor are ORB.
I am selecting a specific area for detecting features on my source image (by masking), and then matching them with features detected on subsequent frames.
Then I filter my matches by performing a ratio test on the matches obtained from the following code:
std::vector<std::vector<DMatch>> matches1;
m_matcher.knnMatch( m_descriptorsSrcScene, m_descriptorsCurScene, matches1,2 );
I also tried the two-way ratio test (filtering matches from the source to the current scene and vice versa, then keeping the matches common to both), but it didn't do much, so I went ahead with the one-way ratio test.
I also add a minimum-distance check to my ratio test, which, it appears, gives better results:
if (distanceRatio < m_fThreshRatio && bestMatch.distance < 5*min_dist)
{
    refinedMatches.push_back(bestMatch);
}
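For context, a fuller sketch of that combined ratio/minimum-distance filter, continuing from the knnMatch snippet above (member names are taken from my snippets; treating min_dist as the smallest best-match distance over all keypoints is just one common convention):

std::vector<cv::DMatch> refinedMatches;

// First pass: the smallest best-match distance over all keypoints.
float min_dist = std::numeric_limits<float>::max();
for (const auto& m : matches1)
    if (!m.empty() && m[0].distance < min_dist)
        min_dist = m[0].distance;

// Second pass: Lowe-style ratio test plus the absolute distance check.
for (const auto& m : matches1) {
    if (m.size() < 2 || m[1].distance == 0.f)
        continue;
    const cv::DMatch& bestMatch = m[0];
    const float distanceRatio = bestMatch.distance / m[1].distance;
    if (distanceRatio < m_fThreshRatio && bestMatch.distance < 5 * min_dist)
        refinedMatches.push_back(bestMatch);
}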
And in the end, I estimate the homography:
Mat H = findHomography(points1,points2);
I've tried using the RANSAC method for estimating inliers and then using those to recalculate my homography, but that gives more instability and consumes more time.
Then in the end I draw a rectangle around the specific region which is to be tracked. I get the plane coordinates by:
perspectiveTransform( obj_corners, scene_corners, H);
where 'obj_corners' are the coordinates of my masked (or unmasked) region.
The rectangle I draw using 'scene_corners' seems to be vibrating. Increasing the number of features has reduced it quite a bit, but I can't increase them too much because of the time constraint.
How can I improve the stability?
Any suggestions would be appreciated.
Thanks.
If it is the vibrations that are really bothersome to you then you could try taking the moving average of the homography matrices over time:
cv::Mat homoG = cv::findHomography(obj, scene, CV_RANSAC);
if (homography.empty()) {
    homoG.copyTo(homography);
}
cv::accumulateWeighted(homoG, homography, 0.1);
Make the 'homography' variable global, and keep calling this every time you get a new frame.
The alpha parameter of accumulateWeighted is roughly the reciprocal of the period of the (exponentially weighted) moving average.
So 0.1 is roughly averaging over the last 10 frames, 0.2 over the last 5, and so on...
A suggestion that comes to mind from experience with feature detection/matching is that sometimes you just have to accept that the matched feature points will not work perfectly. Even subtle changes in the scene you are looking at can cause somewhat annoying problems, for example changes in light or unwanted objects coming into view.
It appears to me, from what you say, that you have decently working feature matching in place; you may want to work on a way of keeping the region of interest constant. If you know the typical speed or any other movement patterns unique to the object you are trying to track between frames, or any constraints relating to the position of your camera, that may be useful in avoiding recalculating the region of interest unnecessarily, which causes the vibrations. In fact, it may help in creating a more efficient search algorithm, allowing you to increase the number of feature points you can detect and use.
Another (small) hack you can use is to avoid redrawing the region window if the previous window was of similar size and position.
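That last hack could look roughly like this (a small sketch; names and the pixel threshold are illustrative):

#include <opencv2/core.hpp>
#include <vector>

// Redraw the tracked rectangle only if the new corners moved noticeably; otherwise
// keep the previously drawn window, which suppresses the small frame-to-frame jitter.
bool windowMovedEnough(const std::vector<cv::Point2f>& prevCorners,
                       const std::vector<cv::Point2f>& newCorners,
                       float minShiftPx = 3.0f)
{
    if (prevCorners.size() != newCorners.size())
        return true;
    for (size_t i = 0; i < newCorners.size(); ++i) {
        const float dx = newCorners[i].x - prevCorners[i].x;
        const float dy = newCorners[i].y - prevCorners[i].y;
        if (dx * dx + dy * dy > minShiftPx * minShiftPx)
            return true;
    }
    return false;   // every corner moved less than minShiftPx: reuse the old window
}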

Vehicle segmentation and tracking

I've been working on a project for some time, to detect and track (moving) vehicles in video captured from UAVs. Currently I am using an SVM trained on bag-of-features representations of local features extracted from vehicle and background images. I am then using a sliding-window detection approach to try to localise vehicles in the images, which I would then like to track. The problem is that this approach is far too slow and my detector isn't as reliable as I would like, so I'm getting quite a few false positives.
So I have been considering attempting to segment the cars from the background to find their approximate positions and reduce the search space before applying my classifier, but I am not sure how to go about this, and was hoping someone could help.
Additionally, I have been reading about motion segmentation with layers, using optical flow to segment the frame by flow model. Does anyone have any experience with this method? If so, could you offer some input as to whether you think it would be applicable to my problem?
Below are two frames from a sample video
frame 0:
frame 5:
Assuming your cars are moving, you could try to estimate the ground plane (road).
You may get a decent ground-plane estimate by extracting features (SURF rather than SIFT, for speed), matching them over frame pairs, and solving for a homography using RANSAC, since a plane in 3D moves according to a homography between two camera frames.
Once you have your ground plane you can identify the cars by looking at clusters of pixels that don't move according to the estimated homography.
A more sophisticated approach would be to do Structure from Motion on the terrain. This only presupposes that it is rigid, and not that it is planar.
Update
I was wondering if you could expand on how you would go about looking for clusters of pixels that don't move according to the estimated homography?
Sure. Say I and K are two video frames and H is the homography mapping features in I to features in K. First you warp I onto K according to H, i.e. you compute the warped image Iw as Iw( [x y]' ) = I( inv(H) [x y]' ) (roughly Matlab notation). Then you look at the squared or absolute difference image Diff = (Iw-K).*(Iw-K). Image content that moves according to the homography H should give small differences (assuming constant illumination and exposure between the images). Image content that violates H, such as moving cars, should stand out.
For clustering high-error pixel groups in Diff I would start with simple thresholding ("every pixel difference in Diff larger than X is relevant", maybe using an adaptive threshold). The thresholded image can be cleaned up with morphological operations (dilation, erosion) and clustered with connected components. This may be too simplistic, but it's easy to implement for a first try, and it should be fast. For something more fancy, look at Clustering on Wikipedia. A 2D Gaussian Mixture Model may be interesting; when you initialize it with the detection result from the previous frame, it should be pretty fast.
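A minimal OpenCV sketch of this warp/difference/threshold pipeline (ORB is used instead of SURF only because it is patent-free; all parameters are illustrative):

#include <opencv2/features2d.hpp>
#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Returns a binary mask of pixels in frame K that do not move with the ground plane.
cv::Mat movingObjectMask(const cv::Mat& I, const cv::Mat& K)   // consecutive grayscale frames
{
    // 1. Detect and match features between the two frames.
    cv::Ptr<cv::ORB> orb = cv::ORB::create(2000);
    std::vector<cv::KeyPoint> kpI, kpK;
    cv::Mat dI, dK;
    orb->detectAndCompute(I, cv::noArray(), kpI, dI);
    orb->detectAndCompute(K, cv::noArray(), kpK, dK);

    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    std::vector<cv::DMatch> matches;
    matcher.match(dI, dK, matches);

    std::vector<cv::Point2f> ptsI, ptsK;
    for (const auto& m : matches) {
        ptsI.push_back(kpI[m.queryIdx].pt);
        ptsK.push_back(kpK[m.trainIdx].pt);
    }

    // 2. Ground-plane homography from I to K (RANSAC rejects the moving cars as outliers).
    cv::Mat H = cv::findHomography(ptsI, ptsK, cv::RANSAC, 3.0);

    // 3. Warp I onto K and take the absolute difference.
    cv::Mat Iw, diff;
    cv::warpPerspective(I, Iw, H, K.size());
    cv::absdiff(Iw, K, diff);

    // 4. Threshold and clean up; the remaining blobs are the candidate vehicles.
    cv::Mat mask;
    cv::threshold(diff, mask, 40, 255, cv::THRESH_BINARY);
    cv::morphologyEx(mask, mask, cv::MORPH_OPEN,
                     cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3)));
    cv::morphologyEx(mask, mask, cv::MORPH_CLOSE,
                     cv::getStructuringElement(cv::MORPH_RECT, cv::Size(7, 7)));
    return mask;
}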
I did a little experiment with the two frames you provided, and I have to say I am somewhat surprised myself how well it works. :-) Left image: Difference (color coded) between the two frames you posted. Right image: Difference between the frames after matching them with a homography. The remaining differences clearly are the moving cars, and they are sufficiently strong for simple thresholding.
Thinking of the approach you currently use, it may be interesting to combine it with my proposal:
You could try to learn and classify the cars in the difference image Diff instead of the original image. This would amount to learning what a car motion pattern looks like rather than what a car looks like, which could be more reliable.
You could get rid of the expensive window search and run the classifier only on regions of Diff with sufficiently high values.
Some additional remarks:
In theory, the cars should even stand out if they are not moving since they are not flat, but given your distance to the scene and camera resolution this effect may be too subtle.
You can replace the feature extraction / matching part of my proposal with Optical Flow, if you like. This amounts to identifying flow vectors that "stick out" from a consistent frame-to-frame motion of the ground. It may be prone to outliers in the optical flow, however. You can also try to get the homography from the flow vectors.
This is important: regardless of which method you use, once you have found cars in one frame you should use this information to robustify your search for these cars in consecutive frames, giving a higher likelihood to detections close to the old ones (Kalman filter, etc.). That's what tracking is all about!
If the number of cars in your field of view always remains the same but they move around, then you can use optical flow. It will give you good results against a still background. If the number of cars is changing, then you need to call the goodFeaturesToTrack function in OpenCV after a certain number of frames and again track the cars using optical flow.
You can use background modelling to model the background, and hence the cars are always your foreground. The simplest example is frame differencing: subtract the previous frame from the current frame, diff(x,y,k) = I(x,y,k) - I(x,y,k-1). As your cars are moving, in each frame you will get their positions.
Both processes will work fine since you have a still background, I presume. Check this link to see what optical flow can do.
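A minimal sketch of that frame-differencing idea (using the absolute difference; the threshold value is arbitrary):

#include <opencv2/imgproc.hpp>

// diff(x,y,k) = |I(x,y,k) - I(x,y,k-1)|, thresholded to a binary foreground mask.
cv::Mat frameDifference(const cv::Mat& prevGray, const cv::Mat& currGray, double thresh = 25.0)
{
    cv::Mat diff, mask;
    cv::absdiff(currGray, prevGray, diff);
    cv::threshold(diff, mask, thresh, 255, cv::THRESH_BINARY);
    return mask;   // moving cars show up as white blobs against the still background
}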

Image Processing: What are occlusions?

I'm developing an image processing project and I come across the word occlusion in many scientific papers. What do occlusions mean in the context of image processing? The dictionary only gives a general definition. Can anyone describe them using an image as context?
Occlusion means that there is something you want to see, but can't due to some property of your sensor setup, or some event.
Exactly how it manifests itself or how you deal with the problem will vary due to the problem at hand.
Some examples:
If you are developing a system which tracks objects (people, cars, ...) then occlusion occurs if an object you are tracking is hidden (occluded) by another object. Like two persons walking past each other, or a car that drives under a bridge.
The problem in this case is what you do when an object disappears and reappears again.
If you are using a range camera, then occlusion is areas where you do not have any information. Some laser range cameras work by transmitting a laser beam onto the surface you are examining and then having a camera setup which identifies the point of impact of that laser in the resulting image. That gives the 3D coordinates of that point. However, since the camera and laser are not necessarily aligned, there can be points on the examined surface which the camera can see but the laser cannot hit (occlusion).
The problem here is more a matter of sensor setup.
The same can occur in stereo imaging if there are parts of the scene which are only seen by one of the two cameras. No range data can obviously be collected from these points.
There are probably more examples.
If you specify your problem, then maybe we can define what occlusion is in that case, and what problems it entails.
The problem of occlusion is one of the main reasons why computer vision is hard in general. Specifically, this is much more problematic in Object Tracking. See the below figures:
Notice how the lady's face is not completely visible in frames 0519 & 0835, as opposed to the face in frame 0005.
And here's one more picture where the face of the man is partially hidden in all three frames.
Notice in the image below how the tracking of the couple in the red & green bounding boxes is lost in the middle frame due to occlusion (i.e. they are partially hidden by another person in front of them), but they are correctly tracked in the last frame when they become (almost) completely visible.
Picture courtesy: Stanford, USC
Occlusion is whatever blocks our view. In the image shown here, we can easily see the people in the front row. But the second row is only partly visible, and the third row is much less visible. Here, we say that the second row is partly occluded by the first row, and the third row is occluded by the first and second rows.
We can see such occlusions in classrooms (students sitting in rows), traffic junctions (vehicles waiting for a signal), forests (trees and plants), etc., whenever there are a lot of objects.
In addition to what has been said, I want to add the following:
For object tracking, an essential part of dealing with occlusions is writing an efficient cost function which is able to discriminate between the occluded object and the object that is occluding it. If the cost function is not good enough, the object instances (IDs) may swap and the object will be incorrectly tracked. There are numerous ways in which cost functions can be written: some methods use CNNs [1], while some prefer to have more control and aggregate features [2]. The disadvantage of CNN models is that if you are tracking objects that are in the training set in the presence of objects which are not, and the first ones get occluded, the tracker can latch onto the wrong object and may never recover. Here is a video showing this. The disadvantage of aggregate features is that you have to manually engineer the cost function, and this can take time and sometimes knowledge of advanced mathematics.
In the case of dense stereo vision reconstruction, occlusion happens when a region is seen with the left camera and not seen with the right (or vice versa). In the disparity map this occluded region appears black (because the corresponding pixels in that region have no equivalent in the other image). Some techniques use so-called background-filling algorithms, which fill the occluded black region with pixels coming from the background. Other reconstruction methods simply leave those pixels without values in the disparity map, because the pixels coming from the background-filling method may be incorrect in those regions. Below you have the 3D projected points obtained using a dense stereo method. The points were rotated a bit to the right (in 3D space). In the presented scenario, the occluded values in the disparity map are left unreconstructed (black), and for this reason in the 3D image we see that black "shadow" behind the person.
As the other answers have explained occlusion well, I will only add to that. Basically, there is a semantic gap between us and computers.
A computer actually sees every image as a sequence of values, typically in the range 0-255, for every color channel in an RGB image. These values are indexed in the form (row, col) for every point in the image. So if an object changes its position w.r.t. the camera such that some aspect of the object is hidden (say a person's hands are no longer shown), the computer will see different numbers (or edges, or any other features), and this makes it harder for the computer algorithm to detect, recognize, or track the object.
