I am working on a convolutional neural network to identify objects like animals, vehicles, trees, etc. One of the classes for detection is auto. When I gave an image to the network, it was predicted as auto, but I also need to draw a bounding box around the object. When I tried the sliding-window approach, I got a lot of bounding boxes, but I need only one. How do I find the most appropriate bounding box for an object after the network's prediction? Don't we need some method to localise the objects within a large image? That is what I want.
My final layer is a logistic regression function that predicts only 1 or 0, and I don't know how to turn that prediction into a probability score. If I had a probability score for each box, it would be easy to find the most appropriate one. Please suggest some methods for doing this. Thanks in advance; all answers are welcome.
(Images showing the input, the output, and the expected output were attached to the original post.)
It's not clear if you have a single object in your input image or several. Your example shows one.
If you have ONE object, here are some options to consider for the bounding boxes (a small code sketch of the first three follows the list):
Keep the most distant ones: Keep the top, bottom, right, and left boundaries that are most distant from the center of all the bounding boxes.
Keep the average ones: E.g. Take all the top boundaries and keep their average location. Repeat the same with all the bottom, right, and left boundaries.
Keep the median ones: Same as the average, but keep the median of each direction boundary instead.
Keep the bounding box with the largest activation: Since you're using logistic regression as the final step, look at the input that goes into that logistic layer, and keep the bounding box with the largest input to it.
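If it helps, here is a small sketch of the first three options, assuming the candidate boxes are available as (top, left, bottom, right) tuples (the function name is just illustrative):

```python
import numpy as np

# boxes: array of shape (N, 4) with rows (top, left, bottom, right),
# one row per candidate box from the sliding window.
def combine_boxes(boxes, method="median"):
    """Collapse many candidate boxes into a single box.

    'extreme' keeps the outermost boundaries, 'mean' averages each
    boundary, 'median' takes the per-boundary median.
    """
    boxes = np.asarray(boxes, dtype=float)
    if method == "extreme":
        top, left = boxes[:, 0].min(), boxes[:, 1].min()
        bottom, right = boxes[:, 2].max(), boxes[:, 3].max()
    elif method == "mean":
        top, left, bottom, right = boxes.mean(axis=0)
    else:  # median
        top, left, bottom, right = np.median(boxes, axis=0)
    return top, left, bottom, right

# Example: three overlapping detections of the same object
candidates = [(10, 12, 50, 60), (12, 14, 52, 58), (9, 10, 48, 62)]
print(combine_boxes(candidates, "median"))
```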
A lot of popular and state-of-the-art object detection algorithms like YOLO and SSD use the concept of anchor boxes. As far as I understand, for networks like YOLOv3 each output grid cell has multiple anchor boxes with different aspect ratios. For detection, the network predicts offsets for the anchor box with the highest overlap with the given object. Why is this used instead of having multiple bounding box predictors (each predicting x, y, w, h and c)?
No, anchor boxes cannot be simply replaced by multiple bounding box predictors.
In your description, there was a minor misunderstanding.
For detection, the network predicts offsets for the anchor box with the highest overlap with the given object
Selecting the anchor box with the highest overlap with a ground truth box only happens during the training phase, as explained in the SSD paper, section 2.2 (Matching Strategy). Not only the highest-overlap anchor boxes are selected, but also the ones that have an IoU greater than 0.5.
At prediction time, the box predictor predicts the four offsets of each anchor box together with confidences for all categories.
Now we come to the question of why we predict offsets instead of the box attributes (x, y, w, h) directly.
In short, this is related to scale. I agree with #viceriel's answer on this, but here is a vivid example.
Suppose the following two images of the same size (the left one has a blue background) are fed to the predictor and we want to get the bbox for the dog. The red bbox in each image represents the anchor box, and both are nearly perfect bboxes for the dog. If we predict offsets, the box predictor only needs to predict 0 for all four offsets in both cases. If you instead use multiple direct predictors, the model has to produce two different sets of values for w and h even though x and y are the same. This is essentially what #viceriel explains: predicting offsets presents a less difficult mapping for the predictor to learn.
This example also explains why anchor boxes can help improve the detector's performance.
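To make the offset idea concrete, here is a minimal sketch of a typical center-size encoding (the exact parametrization differs a bit between SSD and YOLO; this is just one common form for illustration):

```python
import numpy as np

# Boxes given as (cx, cy, w, h): center coordinates, width and height.
def encode(gt, anchor):
    """Offsets the network is trained to predict for this anchor."""
    gcx, gcy, gw, gh = gt
    acx, acy, aw, ah = anchor
    return np.array([(gcx - acx) / aw,      # shift of the center, in anchor units
                     (gcy - acy) / ah,
                     np.log(gw / aw),       # log-scale ratio of the sizes
                     np.log(gh / ah)])

def decode(offsets, anchor):
    """Turn predicted offsets back into an absolute box."""
    tx, ty, tw, th = offsets
    acx, acy, aw, ah = anchor
    return np.array([acx + tx * aw, acy + ty * ah,
                     aw * np.exp(tw), ah * np.exp(th)])

# The dog example: in both images the anchor already matches the ground truth,
# so the target offsets are all zero regardless of the dog's absolute size.
print(encode((50, 50, 40, 30), (50, 50, 40, 30)))   # [0. 0. 0. 0.]
print(encode((50, 50, 80, 60), (50, 50, 80, 60)))   # [0. 0. 0. 0.]
```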
The key is to understand how anchor boxes are created. For example, YOLOv3 takes the sizes of the bounding boxes in the training set, applies K-means to them, and finds box sizes that describe well all the boxes present in the training set.
If you predict w and h directly instead of offsets from an anchor box, your possible outputs will be more variable, in the sense that there are many, many possible heights and widths for a bounding box. But if you instead predict offsets from boxes that already have appropriate sizes for your detection task, there is less variability, because the anchor boxes already describe the kinds of bounding boxes you want. This leads to better performance, because you reframe the task and the network now learns a less difficult input-output mapping.
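Roughly, that anchor-generation step looks like this (a sketch using scikit-learn's K-means on (width, height) pairs; YOLOv3 actually clusters with an IoU-based distance, which is omitted here for simplicity):

```python
import numpy as np
from sklearn.cluster import KMeans

# wh: (N, 2) array of (width, height) for every ground-truth box in the training set
def anchor_sizes(wh, k=9, seed=0):
    """Cluster box sizes; the cluster centers become the anchor box sizes."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(wh)
    centers = km.cluster_centers_
    # sort by area so the anchors go from small to large
    return centers[np.argsort(centers.prod(axis=1))]

wh = np.abs(np.random.randn(1000, 2)) * 50 + 20   # stand-in for real training boxes
print(anchor_sizes(wh, k=9))
```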
Most approaches to detection problems are based, more or less, on some form of bounding box proposal, which turns the problem into a two-class (positive/negative) classification. I found lots of material on these topics.
But I was wondering: aren't there approaches that take the whole image as input, send it through several convolutional and pooling layers, and output just two numbers (the x, y position of the object)? Of course, this would mean there is only one object in the image. So far I haven't found anything about this; should I consider it unusable?
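To be concrete, this is roughly the kind of network I have in mind (a Keras sketch with made-up layer sizes; the two outputs would be the normalised x, y position of the object):

```python
import tensorflow as tf

# Input image -> a few conv/pool blocks -> two regression outputs (x, y),
# here assumed to be normalised to [0, 1] relative to the image size.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, 3, activation="relu", padding="same"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="sigmoid"),   # (x, y) of the single object
])
model.compile(optimizer="adam", loss="mse")
```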
I'm trying to use a pretrained VGG16 as an object localizer in TensorFlow on ImageNet data. In their paper, the group mentions that they basically just strip off the softmax layer and toss on either a 4-D or a 4000-D fc layer for bounding box regression. I'm not trying to do anything fancy here (sliding windows, RCNN), just to get some mediocre results.
I'm sort of new to this and I'm just confused about the preprocessing done here for localization. In the paper, they say that they scale the image to 256 as its shortest side, then take the central 224x224 crop and train on this. I've looked all over and can't find a simple explanation on how to handle localization data.
Questions: How do people usually handle the bounding boxes here?...
Do you use something like the tf.sample_distorted_bounding_box command, and then rescale the image based on that?
Do you just rescale/crop the image itself, and then interpolate the bounding box with the transformed scales (see the rough sketch after these questions for what I mean)? Wouldn't this result in negative box coordinates in some cases?
How are multiple objects per image handled?
Do you just choose a single bounding box from the beginning, crop to that, and then train on this crop?
Or, do you feed it the whole (centrally cropped) image, and then try to predict 1 or more boxes somehow?
Does any of this generalize to the detection or segmentation (like MS-CoCo) challenges, or is it completely different?
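To show what I mean by the rescale/crop option above: resize so the shortest side is 256, take the central 224x224 crop, and apply the same transform to the box, clipping it so the coordinates don't go negative (this is only a sketch of the idea, all names are made up, it's not code from the paper):

```python
import numpy as np

def transform_box(box, img_w, img_h, short_side=256, crop=224):
    """box = (xmin, ymin, xmax, ymax) in pixels of the original image.

    Returns the box in the coordinates of the central crop, clipped so the
    coordinates stay inside [0, crop] instead of going negative.
    """
    scale = short_side / min(img_w, img_h)
    new_w, new_h = img_w * scale, img_h * scale
    # offsets of the central 224x224 crop inside the resized image
    off_x, off_y = (new_w - crop) / 2.0, (new_h - crop) / 2.0
    xmin, ymin, xmax, ymax = [c * scale for c in box]
    xmin, xmax = xmin - off_x, xmax - off_x
    ymin, ymax = ymin - off_y, ymax - off_y
    # clip to the crop; a box that falls completely outside ends up with zero area
    return tuple(float(np.clip(v, 0, crop)) for v in (xmin, ymin, xmax, ymax))

print(transform_box((30, 40, 300, 350), img_w=640, img_h=480))
```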
Anything helps...
Thanks
Localization is usually performed as an intersection of sliding windows where the network identifies the presence of the object you want.
Generalizing that to multiple objects works the same.
Segmentation is more complex. You can train your model on a pixel mask with your object filled in, and try to output a pixel mask of the same size.
The question is conceptual. I basically understand how the MNIST example works: the feedforward net takes an image as input and outputs a predicted label from 0 to 9.
I'm working on a project that ideally will take an image as the input, and for every pixel on that image, I will output a probability of that pixel being a certain label or not.
So my input, for example, is of size 600 * 800 * 3 pixels, and my output would be 600 * 800, where every single entry of my output is a probability.
How can I design the pipeline for that using a convolutional neural network? I'm working with TensorFlow. Thanks
Elaboration:
Basically, I want to label every pixel as either foreground or background (the probability of the pixel being foreground). My intuition is that in the convolutional layers, the neurons will be able to pick up information in a patch around that pixel, and finally be able to tell how likely it is that this pixel is foreground.
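To make the question concrete, this is the kind of fully convolutional pipeline I imagine (a rough Keras sketch with arbitrary layer sizes; the output is one sigmoid probability per pixel):

```python
import tensorflow as tf

# Fully convolutional: 600x800x3 image in, 600x800 map of foreground probabilities out.
inputs = tf.keras.Input(shape=(600, 800, 3))
x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.MaxPooling2D()(x)                      # 300x400
x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.MaxPooling2D()(x)                      # 150x200
x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.UpSampling2D()(x)                      # back to 300x400
x = tf.keras.layers.UpSampling2D()(x)                      # back to 600x800
outputs = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(x)  # per-pixel probability

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
```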
Although it wouldn't be very efficient, a naive method could be to color a window (say, 5px x 5px) of pixels black, record the probabilities for each output class, then slide the window over a bit, then record again. This would be repeated until the window passed over the whole image.
Now we have some interesting information. For each window position, we know the delta of the probability distribution over the labels compared to the probabilities when the classifier received the whole image. That delta corresponds to the amount that that region contributed to the classifier making that decision.
If you want this mapped down to a per-pixel level for visualization purposes, you could use a stride length of 1 pixel when sliding the window and map the probability delta to the centermost pixel of the window.
Note that you don't want to make the window too small, otherwise the deltas will be too small to make a difference. Also, you'll probably want to be a bit smart about how you choose the color of the window so the window itself doesn't appear to be a feature to the classifier.
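A minimal sketch of that procedure, assuming some Keras-like model whose predict method returns class probabilities (the function and parameter names are made up):

```python
import numpy as np

def occlusion_map(model, image, window=5, stride=1):
    """Slide a black window over the image and record how much the class
    probabilities change compared to the unoccluded image."""
    h, w, _ = image.shape
    base = model.predict(image[np.newaxis])[0]        # probabilities for the full image
    heat = np.zeros((h, w))
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            occluded = image.copy()
            occluded[y:y + window, x:x + window, :] = 0   # black out the window
            probs = model.predict(occluded[np.newaxis])[0]
            delta = np.abs(base - probs).sum()            # how much the prediction moved
            # assign the delta to the centermost pixel of the window
            heat[y + window // 2, x + window // 2] = delta
    return heat
```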
Edit in response to your elaboration:
This would still work for what you're trying to do. In fact, it becomes a bit nicer. Instead of keeping all the label probability deltas separate, you would sum them. This would give you a measurement that tells you "how much does this region make the image more like a number" (or, in other words, more like foreground). Also, you wouldn't measure the deltas against the uncovered image, but rather against the vector of probabilities where P(x)=0 for each label.
I have a few doubts about how to approach my goal. I have an outdoor camera that is recording people, and I want to draw an ellipse on every person.
Right now, what I do is get the feature points of the people from the frame (I get them using a mask so that I only have feature points on the people), set up an EM algorithm, and train it with my samples (the extracted feature points). The number of clusters is twice the number of people in the image (I obtain that number before starting the EM algorithm using other methods, such as pixel counting with a codebook).
My question is:
(a) Do I train it only on the first frame and then use predict on the following frames, or
(b) train it with the feature points in every frame?
Right now I am doing option (b) (I don't use predict) because I don't really know how to use predict.
If I do (a), can you help me with it, and after that, with how to draw the ellipses? If I do (b), can you help me draw one ellipse per person? Right now I get different ellipses for the same person from the covariance, mean, etc. (one for the arm, for example).
What I want to achieve is what this paper does using the Gaussian model: Link
If you were drawing bounding boxes rather than ellipses, you could use the function groupRectangles to merge the different bounding boxes.
But, more importantly, for people detection you can simply use OpenCV's person detector (based on HOG) or the latent SVM detector with the person model.
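A minimal sketch of that route: OpenCV's built-in HOG people detector followed by groupRectangles to merge overlapping hits (the parameters are just reasonable defaults, not tuned for your footage):

```python
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

frame = cv2.imread("frame.jpg")                       # one frame from the camera
rects, weights = hog.detectMultiScale(frame, winStride=(8, 8), padding=(8, 8), scale=1.05)

# groupRectangles expects a plain list; duplicating each rect lets single
# detections survive a grouping threshold of 1.
rects = [list(r) for r in rects] * 2
merged, merged_weights = cv2.groupRectangles(rects, 1, 0.2)

for (x, y, w, h) in merged:
    cv2.rectangle(frame, (int(x), int(y)), (int(x + w), int(y + h)), (0, 255, 0), 2)
cv2.imwrite("detections.jpg", frame)
```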
You should do (b) anyway, because otherwise you'll be trying to match the keypoints to the clusters (persons) from the first frame, which after a few seconds would no longer be relevant.
It seems reasonable to assume that the change from frame to frame is not going to be overwhelming, so the results of training on frame N-1 are a good seed for training on frame N, and this is likely to converge faster than running EM from scratch on each frame.
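Roughly, that re-seeding could look like this with OpenCV's EM (a sketch: trainEM on the first frame, then trainE with the previous frame's means as the seed; your feature extraction is assumed to happen elsewhere):

```python
import cv2
import numpy as np

def fit_frame(points, n_clusters, prev_means=None):
    """points: (N, 2) float32 array of feature point locations in this frame."""
    em = cv2.ml.EM_create()
    em.setClustersNumber(n_clusters)
    samples = np.asarray(points, dtype=np.float32)
    if prev_means is None:
        em.trainEM(samples)                                   # first frame: start from scratch
    else:
        # seed the E-step with the previous frame's means, then iterate
        em.trainE(samples, np.asarray(prev_means, dtype=np.float32))
    return em, em.getMeans()

# usage across frames (the point sets would come from your feature extractor):
# em, means = fit_frame(points_frame0, n_clusters=2 * n_people)
# em, means = fit_frame(points_frame1, n_clusters=2 * n_people, prev_means=means)
```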
In order to draw the ellipses, you can borrow from the mixture-of-Gaussians example in the Python bindings:
https://github.com/opencv/opencv/blob/master/samples/python/gaussian_mix.py
Note that if you use a diagonal covariance matrix, your ellipses are going to be "straight", with their own axes aligned with the X and Y axes of the frame, so you can skip the calculation of the ellipse's angle.
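For completeness, here is a small sketch of drawing one ellipse per Gaussian from its mean and covariance: an eigendecomposition gives the axis lengths and the angle, and with a diagonal covariance the angle calculation can indeed be skipped:

```python
import cv2
import numpy as np

def draw_gaussian_ellipse(img, mean, cov, color=(0, 255, 0), n_std=2.0):
    """Draw the n_std-sigma ellipse of a 2D Gaussian onto img."""
    eigvals, eigvecs = np.linalg.eigh(np.asarray(cov, dtype=np.float64))
    # eigenvalues are the variances along the principal axes (ascending order)
    axes = (int(n_std * np.sqrt(eigvals[1])), int(n_std * np.sqrt(eigvals[0])))
    angle = np.degrees(np.arctan2(eigvecs[1, 1], eigvecs[0, 1]))  # angle of the major axis
    center = (int(mean[0]), int(mean[1]))
    cv2.ellipse(img, center, axes, angle, 0, 360, color, 2)

img = np.zeros((480, 640, 3), dtype=np.uint8)
draw_gaussian_ellipse(img, mean=(320, 240), cov=[[900, 300], [300, 400]])
cv2.imwrite("ellipse.jpg", img)
```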