I am reading YOLO original paper https://arxiv.org/pdf/1506.02640.pdf.
At the beginning of the paper, it says:
If the center of an object falls into a grid cell, that grid cell
is responsible for detecting that object.
And about the loss function:
Note that the loss function only penalizes classification
error if an object is present in that grid cell (hence the conditional class probability discussed earlier).
So my understanding is that an object is present in a cell only if the center of that object falls into that cell. Even if part of an object (but not its center) is contained in a cell, we still consider that cell to have no object (1_i^obj = 0), and the target confidence score should be 0.
Am I correct?
I will try to answer the question (please correct me if I make any mistakes).
First of all, why is the centre grid cell responsible for the bounding box information?
1) YOLO's bounding box annotation differs from other approaches: it uses (x_centre, y_centre, width, height) instead of (x_min, y_min, x_max, y_max). (Why? See below.)
2) One loss term penalizes the difference between the predicted and ground-truth centre, width and height, and this term makes the centre grid cell's prediction more likely to have the highest IOU compared with the others.
Given that, during training only the predicted bounding box with the highest IOU is used, so the centre grid cell almost always ends up with the highest confidence.
Finally, back to your question: no. The confidence scores of the non-centre grid cells are very likely to be lower than that of the centre cell, but they are not 0 if part of an object is covered by the cell. Another way to see this: in the testing phase, a large number of bounding boxes are generated and NMS is used to pick the best one.
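To make the centre-cell assignment concrete, here is a minimal sketch (not from the paper's code; the grid size, image size and function name are just illustrative) of how a ground-truth box is mapped to the grid cell that becomes responsible for it:

```python
import numpy as np

def assign_responsible_cell(box, S=7, img_w=448, img_h=448):
    """Return the (row, col) of the grid cell whose region contains the
    box centre; in YOLO this cell is the one 'responsible' for the object.

    `box` is (x_centre, y_centre, width, height) in pixels -- the names
    are illustrative, not taken from the paper's code.
    """
    x_c, y_c, w, h = box
    col = int(x_c / img_w * S)   # which of the S columns the centre falls in
    row = int(y_c / img_h * S)   # which of the S rows the centre falls in
    col = min(col, S - 1)        # guard against centres exactly on the right/bottom edge
    row = min(row, S - 1)
    return row, col

# Example: a 100x80 object centred at (300, 200) in a 448x448 image
print(assign_responsible_cell((300, 200, 100, 80)))   # -> (3, 4)
```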
Related
A lot of popular and state of the art object detection algorithms like YOLO and SSD use the concept of anchor boxes. As far as I understand, for networks like YOLOv3, each output grid cell has multiple anchor boxes with different aspect ratios. For detection, the network predicts offsets for the anchor box with the highest overlap with the given object. Why is this used instead of having multiple bounding box predictors (each predicting x, y, w, h and c)?
No, anchor boxes cannot be simply replaced by multiple bounding box predictors.
In your description, there was a minor misunderstanding.
For detection, the network predicts offsets for the anchor box with the highest overlap with the given object
Selecting the anchor box with the highest overlap with a ground truth only happens during the training phase, as explained in the SSD paper, section 2.2 (Matching Strategy). Not only the highest-overlap anchor boxes are selected, but also the ones that have an IoU greater than 0.5.
During prediction time, the box predictor will predict the four offsets of each anchor box together with confidences for all categories.
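As an illustration of that prediction-time decoding step, here is a rough sketch using an SSD-style offset parameterisation (the exact formulas and any variance scaling differ between detectors; the function and variable names are mine, not from either paper):

```python
import numpy as np

def decode_boxes(anchors, offsets):
    """Convert predicted offsets back into absolute boxes (SSD-style
    parameterisation; exact details vary between detectors).

    anchors: (N, 4) array of (cx, cy, w, h)
    offsets: (N, 4) array of (tx, ty, tw, th) predicted by the network
    """
    cx = anchors[:, 0] + offsets[:, 0] * anchors[:, 2]   # shift centre by a fraction of anchor width
    cy = anchors[:, 1] + offsets[:, 1] * anchors[:, 3]   # shift centre by a fraction of anchor height
    w = anchors[:, 2] * np.exp(offsets[:, 2])            # scale width
    h = anchors[:, 3] * np.exp(offsets[:, 3])            # scale height
    return np.stack([cx, cy, w, h], axis=1)

# A zero offset returns the anchor unchanged -- the "predict 0" case in the example below
anchors = np.array([[0.5, 0.5, 0.3, 0.4]])
print(decode_boxes(anchors, np.zeros((1, 4))))           # -> [[0.5 0.5 0.3 0.4]]
```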
Now we come to the question of why we predict offsets instead of box attributes (x, y, w, h).
In short, this is related to scale. Here I agree with #viceriel's answer, but here is a vivid example.
Suppose the following two images of the same size (the left one has a blue background) are fed to the predictor and we want to get the bbox for the dog. The red bbox in each image represents the anchor box, and both are roughly perfect bboxes for the dog. If we predict offsets, the box predictor only needs to predict 0 for all four offsets in both cases, while if you use multiple predictors, the model has to give two different sets of values for w and h even though x and y are the same. This is essentially what #viceriel explains: predicting offsets presents a less difficult mapping for the predictor to learn.
This example also explains why anchor boxes can help improve the detector's performance.
The key is to understand how anchor boxes are created. For example, YOLOv3 takes the sizes of the bounding boxes in the training set, applies k-means to them, and finds box sizes that describe all the boxes present in the training set well.
If you predict w, h directly instead of an offset from an anchor box, your possible outputs will be more variable, in the sense that there will be many, many possible heights and widths for a bounding box. But if you instead predict offsets from boxes that already have appropriate sizes for your object detection task, there is less variability, because the anchor boxes describe the wanted bounding boxes. This leads to better performance, because you reframe the task and the network now learns a less difficult input-output mapping.
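As a rough illustration of that clustering step, here is a sketch using plain Euclidean k-means on the (width, height) pairs; note that YOLOv2/v3 actually use a 1 - IoU distance rather than Euclidean distance, and all names here are illustrative:

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs from the training set to pick anchor
    sizes, roughly as YOLOv2/v3 do (they use an IoU-based distance; plain
    Euclidean k-means is used here to keep the sketch short)."""
    rng = np.random.default_rng(seed)
    centres = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each box to the nearest centre
        d = np.linalg.norm(wh[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centre to the mean of its assigned boxes
        for j in range(k):
            if np.any(labels == j):
                centres[j] = wh[labels == j].mean(axis=0)
    return centres

# wh would be an (N, 2) array of ground-truth box widths/heights
wh = np.abs(np.random.randn(500, 2)) * 50 + 20
print(kmeans_anchors(wh, k=3))
```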
The question is conceptual. I basically understand how the MNIST example works: the feedforward net takes an image as input and outputs a predicted label from 0 to 9.
I'm working on a project that ideally will take an image as the input, and for every pixel on that image, I will output a probability of that pixel being a certain label or not.
So my input, for example, is of size 600 * 800 * 3 pixels, and my output would be 600 * 800, where every single entry in my output is a probability.
How can I design the pipeline for that using a convolutional neural network? I'm working with TensorFlow. Thanks.
Elaboration:
Basically I want to label every pixel as either foreground or background (the probability of the pixel being foreground). My intuition is that in convolutional layers, the neurons will be able to pick up information in a patch around that pixel, and finally be able to tell how likely this pixel is to be foreground.
Although it wouldn't be very efficient, a naive method could be to color a window (say, 5px x 5px) of pixels black, record the probabilities for each output class, then slide the window over a bit, then record again. This would be repeated until the window passed over the whole image.
Now we have some interesting information. For each window position, we know the delta of the probability distribution over the labels compared to the probabilities when the classifier received the whole image. That delta corresponds to the amount that that region contributed to the classifier making that decision.
If you want this mapped down to a per-pixel level for visualization purposes, you could use a stride length of 1 pixel when sliding the window and map the probability delta to the centermost pixel of the window.
Note that you don't want to make the window too small, otherwise the deltas will be too small to make a difference. Also, you'll probably want to be a bit smart about how you choose the color of the window so the window itself doesn't appear to be a feature to the classifier.
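Here is a minimal sketch of that sliding-occlusion procedure; `model.predict_proba` is a stand-in for whatever call returns your classifier's class probabilities, not a specific library API:

```python
import numpy as np

def occlusion_map(model, image, window=5, stride=1, fill=0.0):
    """Slide a blacked-out square over the image and record how much the
    classifier's output drops, giving a per-pixel 'importance' map.
    `model.predict_proba(image)` is an assumed interface returning class
    probabilities for the whole image."""
    h, w = image.shape[:2]
    base = model.predict_proba(image)                # probabilities on the intact image
    heat = np.zeros((h, w))
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            patched = image.copy()
            patched[y:y + window, x:x + window] = fill    # blank out this window
            delta = base - model.predict_proba(patched)   # drop in confidence
            cy, cx = y + window // 2, x + window // 2
            heat[cy, cx] = delta.max()               # attribute the drop to the centre pixel
    return heat
```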
Edit in response to your elaboration:
This would still work for what you're trying to do. In fact, it becomes a bit nicer even. Instead of keeping all the label probability deltas separate, you would sum them. This would give you a measurement that tells you "how much does this region make the image more like a number" (or, in your case, more like foreground). Also, you wouldn't measure the deltas against the uncovered image, but rather against the vector of probabilities where P(x)=0 for each label.
I am working on a convolutional neural network to identify objects like animals, vehicles, trees etc. One of the classes for detection is auto. When I gave the image to the network, it was predicted as auto. But I need to draw a bounding box around the object. When I tried the sliding-window approach, I got a lot of bounding boxes, but I need only one. How do I find the most appropriate bounding box for an object after the network's prediction? Don't we need some method to localise objects in a large image? That is what I want.
My final layer is a logistic regression function, so it predicts only 1 or 0. I don't know how to turn that prediction into a probability score. If I had a probability score for each box, it would be easy to find the most appropriate box. Please suggest some methods for doing this. Thanks in advance. All answers are welcome.
[Image: input, output and expected output]
It's not clear if you have a single object in your input image or several. Your example shows one.
If you have ONE object, here are some options to consider for the bounding boxes:
Keep the most distant ones: Keep the top, bottom, right, left boundaries that are most distant from the center across all the bounding boxes.
Keep the average ones: E.g. Take all the top boundaries and keep their average location. Repeat the same with all the bottom, right, and left boundaries.
Keep the median ones: Same as the average, but keep the median of each direction boundary instead.
Keep the bounding box with the largest activation: Since you're using logistic regression as the final step, find the input that goes into that logistic layer and keep the bounding box with the largest input to it.
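For the last few options, here is a small sketch of what this could look like: the input to the logistic layer can be pushed through a sigmoid to recover a probability, and the boxes can then be combined in any of the ways above. Function and variable names are illustrative:

```python
import numpy as np

def sigmoid(z):
    """The logistic layer's hard 0/1 decision is just a threshold on this
    value, so the raw input z can be turned back into a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def pick_box(boxes, logits, mode="best"):
    """boxes: (N, 4) array of (x_min, y_min, x_max, y_max) from the sliding
    window; logits: (N,) inputs to the final logistic layer.
    Returns one box: the highest-scoring one, or the average/median of all
    candidate boxes (the options described above)."""
    scores = sigmoid(np.asarray(logits, dtype=float))
    boxes = np.asarray(boxes, dtype=float)
    if mode == "best":
        return boxes[scores.argmax()]
    if mode == "average":
        return boxes.mean(axis=0)
    if mode == "median":
        return np.median(boxes, axis=0)
    raise ValueError(mode)
```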
I'm in the process of creating a classifier for an electrical outlet (specifically the three open holes that occur twice on standard outlet panels, not the entire panel itself).
My question is, what are the ideal traits of positive images and what width and height should I pass to train_cascade to enable my object detector to detect the smallest possible outlets? I.e. to detect them from the farthest possible distance? I also care about accuracy, and am fine with a classifier that takes weeks to train (assuming it is actually making progress).
And a question to increase my understanding of this: are the width and height I pass to train_cascade the dimensions of the search box that will be passed over each image? If so, and I want my detector to detect very small objects, then I should pass a small width and height, correct?
I would like to be able to detect very large and very small instances of outlets. From very close up (the camera is literally 3 inches away from the outlet) to at least a few feet away.
OK, so after a few weeks of getting to know OpenCV and its object detection capabilities, and since no one else has answered, I'll answer my own question.
Anyway, my understanding is that the smallest detectable object can be as small as the positive samples that opencv_createsamples is fed.
I used OpenCV's object detection to detect the outlet, as pictured in the question. I specified 20x20 pixels to createsamples and got great results. The object can be detected from 3-4 feet away, which I believe is the point where its resolution falls under 20x20 pixels.
One thing to remember is that when you are running your detector, it is sliding squares with the dimensions you specify over the input frame. If your object appears smaller than that square in the image it will simply go undetected.
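For reference, here is a minimal detection-side sketch with OpenCV's Python bindings; the file names are hypothetical and the cascade file is whatever opencv_traincascade produced:

```python
import cv2

# Hypothetical file names; the cascade is whatever opencv_traincascade produced.
cascade = cv2.CascadeClassifier("outlet_cascade.xml")
frame = cv2.imread("wall.jpg")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# minSize is the smallest window the detector will try; it cannot usefully be
# smaller than the width/height the cascade was trained with (20x20 here).
outlets = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                   minSize=(20, 20))
for (x, y, w, h) in outlets:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```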
I'm looking for an efficient way of selecting a relatively large portion of points (2D Euclidian graph) that are the furthest away from the center. This resembles the convex hull, but would include (many) more points. Further criteria:
The number of points in the selection / set ("K") must be within a specified range. Most likely it won't be very narrow, but it must work for different ranges (e.g. 0.01*N < K < 0.05*N as well as 0.1*N < K < 0.2*N).
The algorithm must be able to balance distance from the center and "local density". If there are dense areas near the upper part of the graph range, but sparse areas near the lower part, then the algorithm must make sure to select some points from the lower part even if they are closer to the center than the points in the upper region. (See example below)
Bonus: rather than simple distance from center, taking into account distance to a specific point (or both a point and the center) would be perfect.
My attempts so far have focused on "pigeon holing" (divide the graph into CxR boxes, assign points to boxes based on coordinates) and selecting "outer" boxes until we have sufficient points in the set. However, I haven't been successful at balancing the selection (dense regions get over-selected because of the fixed box size), nor at using a selected point as the reference instead of (only) the center.
I've (poorly) drawn an example: the red dots are the points, the green shape is an example of what I want (outside the green = selected). For sparse regions, the bounding shape comes closer to the center to find suitable points (but doesn't necessarily find any, if they're too close to the center). The yellow box is an example of what my pigeon-holing-based algorithm does. Even when trying to adjust for sparser regions, it doesn't manage well.
Any and all ideas are welcome!
I don't think there are any standard algorithms that will give you what you want. You're going to have to get creative. Assuming your points are embedded in 2D Euclidean space, here are some ideas:
Iteratively compute several convex hulls. For example, compute the convex hull, keep the points that are part of the convex hull, then compute another convex hull ignoring the points from the original convex hull. Continue to do this until you have a sufficient number of points, essentially plucking off points on the perimeter for each iteration. The only problem with this approach is that it will not work well for concavities in your data set (e.g., the one on the bottom of your sample you posted).
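A quick sketch of that "onion peeling" idea with SciPy (the function name and stopping rule are illustrative):

```python
import numpy as np
from scipy.spatial import ConvexHull

def peel_hulls(points, k_min):
    """Repeatedly take the convex hull and remove its vertices ('onion
    peeling') until at least k_min points have been collected."""
    remaining = np.asarray(points, dtype=float)
    selected = []
    while len(selected) < k_min and len(remaining) >= 3:
        hull = ConvexHull(remaining)
        selected.extend(remaining[hull.vertices])      # keep this layer's perimeter points
        remaining = np.delete(remaining, hull.vertices, axis=0)
    return np.array(selected)

pts = np.random.rand(1000, 2)
print(len(peel_hulls(pts, k_min=50)))
```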
Fit a Gaussian to your data and keep everything more than N standard deviations away from the mean (where N is a value that you'd have to choose). This should work pretty well if your data is Gaussian. If it isn't, you could always model it with several Gaussians (instead of one) and keep points with a joint probability less than some threshold. Using multiple Gaussians will probably handle concavities decently. References:
http://en.wikipedia.org/wiki/Gaussian_function
How to fit a gaussian to data in matlab/octave?
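A minimal sketch of the single-Gaussian variant, using the Mahalanobis distance as the "number of standard deviations" from the mean (names and the example data are illustrative):

```python
import numpy as np

def outer_points_gaussian(points, n_std=2.0):
    """Fit a single Gaussian (mean + covariance) to the points and keep
    those more than n_std 'standard deviations' (Mahalanobis distance)
    away from the mean."""
    pts = np.asarray(points, dtype=float)
    mean = pts.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(pts, rowvar=False))
    diff = pts - mean
    # squared Mahalanobis distance of every point to the mean
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return pts[np.sqrt(d2) > n_std]

pts = np.random.randn(1000, 2)
print(len(outer_points_gaussian(pts, n_std=2.0)))
```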
Use Kernel Density Estimation. If you create a kernel density surface, you could slice the surface at some height (e.g., turning it into a plateau), giving you a perimeter shape (the shape of the plateau) around the points. The trick would be to slice it at the right location, though, because you could end up getting no points outside of the shape; with the right selection you could easily get the green shape you drew. This approach will work well and give you the green shape in your example if you choose the slice point wisely (which may be difficult to do). The big drawback of this approach is that it is very computationally expensive. More information:
http://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation
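A rough sketch of that idea with SciPy's gaussian_kde, where the slice height is chosen as a density quantile so you can tune it toward a target K (the function name and quantile knob are illustrative):

```python
import numpy as np
from scipy.stats import gaussian_kde

def outer_points_kde(points, quantile=0.15):
    """Keep the points whose estimated density is below a chosen quantile;
    slicing the KDE surface at that height leaves the sparse 'outer' points
    outside the plateau."""
    pts = np.asarray(points, dtype=float)
    kde = gaussian_kde(pts.T)             # gaussian_kde expects shape (dims, n_points)
    density = kde(pts.T)                  # density evaluated at every point
    threshold = np.quantile(density, quantile)
    return pts[density < threshold]

pts = np.random.randn(2000, 2)
print(len(outer_points_kde(pts, quantile=0.1)))   # roughly 10% of the points
```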
Use alpha shapes to get a general shape that wraps tightly around the outside perimeter of the point set, then erode the shape a little to force some points outside of it. I don't have a lot of experience with alpha shapes, but this approach will also be quite computationally expensive. More info:
http://doc.cgal.org/latest/Alpha_shapes_2/index.html