I prepared a dataset for training an object detection model (10 classes). I asked several people to help me with the annotation task. But when I assembled their results, I found that I had sent many images to more than one person, so those images contain several bounding boxes for the same object. For some objects, the bounding boxes drawn by different annotators are not the same (they may overlap almost completely, or only partially). So I have some questions:
Will these bounding boxes affect detection performance of my model?
If so, how can I deal with this problem? Can I use an algorithm such as Weighted Boxes Fusion to combine all the bounding boxes of an object into a single correct one?
Related
I am working on creating an object detection model that should be able to look at an image (and later watch a video) and label particular objects inside the image. However, in the "gun" dataset, only "officer" and "gun" are labelled, and if things like batons or riot shields happen to be in the image they aren't labelled. There are, however, separate datasets for "riot shield" and "baton", because these are objects I want to detect. Equally, these two datasets sometimes contain guns that aren't labelled, because they were collected only to recognize those individual objects.
Here is my question:
If I train the model on these datasets, and it is training on the "gun" dataset for example and sees unlabelled riot shields, will those unlabelled objects conflict with the labelled images when it trains on the "riot shield" dataset and ruin the detection? If so, is there a way to isolate its training so it doesn't make assumptions about other objects that are unlabelled in images?
It's a problem if an object shows up unlabelled in a picture. The network is being given conflicting messages about what is and isn't a certain object. AFAIK there's no real way to isolate training. However, you can train on the gun dataset first and then have the model run through the other datasets and label all the guns in them. The same idea should work for each object you want to detect in the other datasets.
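A minimal sketch of that pseudo-labelling step, assuming you already have detections from the model trained on the gun dataset. The names `add_pseudo_labels`, `min_score` and `max_iou`, and the threshold values, are made up for illustration. Only confident detections are added, and a detection is skipped if it overlaps an existing label of the same class:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels
Label = Tuple[str, Box]                   # (class name, box)
Detection = Tuple[str, Box, float]        # (class name, box, confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def add_pseudo_labels(existing: List[Label],
                      detections: List[Detection],
                      min_score: float = 0.8,   # hypothetical threshold
                      max_iou: float = 0.5      # hypothetical threshold
                      ) -> List[Label]:
    """Return the existing labels plus confident, non-duplicate detections."""
    merged = list(existing)
    for cls, box, score in detections:
        if score < min_score:
            continue  # too uncertain to trust as a label
        duplicate = any(c == cls and iou(box, b) > max_iou for c, b in merged)
        if not duplicate:
            merged.append((cls, box))
    return merged


# An image from the "riot shield" dataset, plus the gun model's output on it.
existing = [("riot shield", (10, 10, 60, 120))]
detections = [
    ("gun", (100, 40, 140, 70), 0.93),          # confident: kept
    ("gun", (300, 200, 330, 220), 0.35),        # low confidence: dropped
    ("riot shield", (12, 11, 61, 118), 0.95),   # overlaps existing label: dropped
]
merged = add_pseudo_labels(existing, detections)
```

The pseudo-labels are only as good as the first model, so it is probably worth spot-checking a sample of them by hand before retraining.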
Most approaches to detection problems are based more or less on some form of bounding box proposal, which turns the problem into two-class (positive/negative) classification. I found lots of material on these topics.
But I was wondering, aren't there approaches that take the whole image as input and send it through several convolutional and pooling layers, the output of which would be two numbers (the x, y position of the object)? Of course this would mean that there is just one object in the image. So far I haven't found anything about this; should I consider it unusable?
As we know, Faster R-CNN has two main parts: one is the region proposal network (RPN), and the other is Fast R-CNN.
My question is: now that the region proposal network (RPN) can output class scores and bounding boxes and is trainable, why do we need Fast R-CNN?
Am I right in thinking that the RPN is enough for detection (red circle), and that Fast R-CNN is now redundant (blue circle)?
Short answer: no, they are not redundant.
The R-CNN article and its variants popularized the use of what we used to call a cascade.
Back then, it was fairly common in detection to chain several detectors, often very similar in structure, because of their complementary power.
If the detectors' errors are partly orthogonal, this removes false positives along the way.
Furthermore, by definition the two parts of Faster R-CNN have different roles: the first discriminates objects from the background, and the second discriminates fine-grained categories of objects from each other (and also from the background).
But you are right: if there is only one class versus the background, one could use only the RPN part to do detection. Even in that case, chaining two different classifiers would probably improve the result (or not; see e.g. this article).
PS: I answered because I wanted to but this question is definitely unsuited for stackoverflow
If you just add a class head to the RPN, you would indeed get detections, with scores and class estimates.
However, the second stage is used mainly to obtain more accurate detection boxes.
Faster R-CNN is a two-stage detector, like Fast R-CNN.
In Fast R-CNN, Selective Search was used to generate rough estimates of the location of objects, and the second stage then refines or rejects them.
Now why is this necessary for the RPN? Why are its proposals only rough estimates?
One reason is the limited receptive field:
The input image is transformed via a CNN into a feature map with limited spatial resolution. For each position on the feature map, the RPN heads estimate if the features at that position correspond to an object and the heads regress the detection box.
The box regression is done based on the final feature map of the CNN. In particular, it may happen that the correct bounding box on the image is larger than the corresponding receptive field due to the CNN.
Example: Let's say we have an image depicting a person, and the features at one position of the feature map indicate a high probability of a person. Now, if the corresponding receptive field contains only part of the body, the regressor has to estimate a box enclosing the entire person, although it "sees" only that body part.
Therefore, RPN creates a rough estimate of the bounding box. The second stage of Faster RCNN uses all features contained in the predicted bounding box and can correct the estimate.
In the example, the RPN creates a bounding box that is too large but encloses the person (since it cannot see the pose of the person), and the second stage uses all the information in this box to reshape it so that it is tight. This can be done much more accurately, since more of the object's content is accessible to the network.
Faster R-CNN is a two-stage method, in contrast to one-stage methods like YOLO and SSD. The reason Faster R-CNN is accurate is its two-stage architecture: the RPN is the first stage, for proposal generation, and the second stage, for classification and localisation, learns more precise results based on the coarse-grained result from the RPN.
So yes, you can, but the performance will not be good enough.
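To put a number on the receptive field argument, here is a small sketch that computes the theoretical receptive field of a stack of convolution and pooling layers. It assumes a VGG16 backbone (3x3 convolutions with stride 1, 2x2 max-pooling with stride 2), which is not stated in the answer above, and the helper name `receptive_field` is made up:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride), first layer first.

    Returns (receptive field, total stride), both in input pixels.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) * current step
        jump *= stride              # distance in input pixels between neighbouring outputs
    return rf, jump


conv, pool = (3, 1), (2, 2)

# VGG16 up to conv5_3: 2, 2, 3, 3, 3 convolutions, with pooling after the first four blocks.
vgg16_conv5_3 = ([conv] * 2 + [pool] + [conv] * 2 + [pool] +
                 [conv] * 3 + [pool] + [conv] * 3 + [pool] + [conv] * 3)

backbone_rf, backbone_stride = receptive_field(vgg16_conv5_3)

# The RPN head applies one more 3x3 convolution on top of that feature map.
rpn_rf, rpn_stride = receptive_field(vgg16_conv5_3 + [conv])
```

Under these assumptions `backbone_rf` comes out at 196 pixels and `rpn_rf` at 228 pixels, with a stride of 16. Any object wider or taller than that cannot be fully "seen" from a single feature map position, which is the situation described in the example with the person.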
I think the blue circle is completely redundant and just adding a class classification layer (gives class for each bounding box containing object) should work just fine and that's what the single shot detectors do with compromised accuracy.
According to my understanding, the RPN just performs a binary check of whether there is an object in the bbox or not, and the final detector part classifies the object into classes, e.g. car, human, phone, etc.
I'm trying to use a pretrained VGG16 as an object localizer in TensorFlow on ImageNet data. In their paper, the group mentions that they basically just strip off the softmax layer and toss on either a 4-D or a 4000-D fc layer for bounding box regression. I'm not trying to do anything fancy here (sliding windows, R-CNN), just get some mediocre results.
I'm sort of new to this and I'm just confused about the preprocessing done here for localization. In the paper, they say that they scale the image to 256 as its shortest side, then take the central 224x224 crop and train on this. I've looked all over and can't find a simple explanation on how to handle localization data.
Questions: How do people usually handle the bounding boxes here?...
Do you use something like the tf.sample_distorted_bounding_box command, and then rescale the image based on that?
Do you just rescale/crop the image itself, and then interpolate the bounding box with the transformed scales? Wouldn't this result in negative box coordinates in some cases?
How are multiple objects per image handled?
Do you just choose a single bounding box from the beginning, crop to that, then train on this crop?
Or, do you feed it the whole (centrally cropped) image, and then try to predict 1 or more boxes somehow?
Does any of this generalize to the Detection or segmentation (like MS-CoCo) challenges, or is it completely different?
Anything helps...
Thanks
Localization is usually performed as an intersection of sliding windows where the network identifies the presence of the object you want.
Generalizing that to multiple objects works the same.
Segmentation is more complex. You can train your model on a pixel mask in which your object is filled in, so that it learns to output a pixel mask of the same size.
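On the preprocessing part of the question (scale the shortest side to 256, then take the central 224x224 crop), one common option is to push the box through the same transform as the image and clip it to the crop. A minimal sketch, not the paper's actual pipeline: the function name `transform_box` is made up, boxes are assumed to be (x1, y1, x2, y2) in pixels, and pixel rounding during the resize is ignored:

```python
def transform_box(box, width, height, short_side=256, crop=224):
    """Map a box from original image coordinates to central-crop coordinates.

    Returns the clipped box, or None if the box lies entirely outside the crop.
    """
    scale = short_side / min(width, height)          # resize factor for the image
    off_x = (width * scale - crop) / 2               # left edge of the central crop
    off_y = (height * scale - crop) / 2              # top edge of the central crop

    x1, y1, x2, y2 = box
    x1, x2 = x1 * scale - off_x, x2 * scale - off_x  # can become negative here
    y1, y2 = y1 * scale - off_y, y2 * scale - off_y

    # Clip to the crop so that no coordinate is negative or larger than the crop size.
    x1, y1 = max(0.0, x1), max(0.0, y1)
    x2, y2 = min(float(crop), x2), min(float(crop), y2)
    if x2 <= x1 or y2 <= y1:
        return None                                  # object was cropped away
    return (x1, y1, x2, y2)


# A 512x384 image: resized to about 341x256, then cropped to 224x224.
partly_inside = transform_box((0, 96, 300, 288), width=512, height=384)
cropped_away = transform_box((0, 0, 60, 60), width=512, height=384)
```

So yes, the coordinates do go negative before clipping, and a box near the border can disappear completely, in which case that image has no valid target for that object.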
I am trying to implement the Canny edge detection found here (Canny edge) to differentiate objects based on their shapes. I would like to know what the features are. I need to find a score/metric so that I can define a probability from information like the mean of the shape. The purpose is to differentiate between objects of different shapes. So, let's assume that the mean shapes (x) of Object1 and Object2 are x1 and x2, and the standard deviations (s) are s1 and s2 respectively. What do I calculate this information from, and how do I find it?
The Canny algorithm is an edge detector. It searches for high frequencies in the image by computing the magnitude of the derivatives in the x and y directions. In the end you have the contours of objects. What you are trying to do is classify objects, and using Canny does not sound like the right way to do it. I am not saying you cannot build features out of edges, but it might perform poorly.
In order to achieve what you want, you first need to identify which features are important for you. You mentioned the shape, but is color a good feature for the class of objects you are trying to find? Your pictures show very colorful objects. Are you only trying to distinguish one object from another (assuming the images display only the object of interest), or do you want to locate them in the image? Does the image contain only one object or multiple ones?
I will give you some direction regarding feature modeling.
If color is a strong cue for your objects, you could model your features using histogram information: compute n bins for all objects and store the distribution over the bins as a feature vector. You can also use HOG (histogram of oriented gradients).
Another possible (naive) solution is to compute all colors of patches (e.g. 7x7) belonging to each object and to compute later the histogram over patches instead of single pixels.
If you are not satisfied with color information and you would like to differentiate objects by comparing information in their neighborhood, you can use local binary patterns, which might be enough for the type of information you have.
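The color-histogram feature from the first suggestion can be sketched in a few lines. This is an illustration only: the function name `color_histogram` is made up, pixels are assumed to be (r, g, b) tuples with values 0 to 255, and the number of bins is assumed to divide 256. In practice you would more likely use the histogram routines in NumPy or OpenCV:

```python
def color_histogram(pixels, bins=8):
    """Build a normalized per-channel histogram feature vector.

    pixels: iterable of (r, g, b) tuples with values in 0..255.
    Returns a list of length 3 * bins; each channel's block sums to 1.
    """
    if 256 % bins != 0:
        raise ValueError("bins must divide 256")
    bin_width = 256 // bins
    counts = [0] * (3 * bins)
    n = 0
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            counts[channel * bins + value // bin_width] += 1
        n += 1
    if n == 0:
        raise ValueError("no pixels given")
    return [c / n for c in counts]


# Three pure red pixels and one pure blue pixel belonging to one object.
feature = color_histogram([(255, 0, 0)] * 3 + [(0, 0, 255)])
```

The resulting vector (24 numbers for 8 bins) is what you would hand to the classifier for each object.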
Once you have decided which features are important and modeled them, you can go for the classification (which is going to determine which object you are seeing given a certain feature).
A probabilistic framework tries to estimate the posterior probability P(X|C), i.e. the probability of being object X given that we observed C (C could be your feature), and this is very powerful. You might consider reading about Maximum Likelihood Estimation and Maximum A Posteriori estimation. Also, a Naive Bayes classifier is a simple off-the-shelf algorithm available in OpenCV that you could use.
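To connect this with the means and standard deviations from the question: if you assume that the feature of each object follows a Gaussian with that object's mean and standard deviation, Bayes' rule gives the posterior directly. A minimal sketch with a single scalar feature; the numbers and the equal priors are invented for illustration:

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def posteriors(x, classes):
    """classes: dict mapping name -> (mean, std, prior).

    Returns a dict mapping name -> P(class | x) via Bayes' rule.
    Assumes x is not so far from every mean that all densities underflow to 0.
    """
    joint = {name: gaussian_pdf(x, mean, std) * prior
             for name, (mean, std, prior) in classes.items()}
    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}


# Hypothetical values for x1, s1 and x2, s2, estimated from labelled examples.
classes = {
    "Object1": (2.0, 0.5, 0.5),
    "Object2": (4.0, 0.5, 0.5),
}
halfway = posteriors(3.0, classes)   # exactly between the two means
near_one = posteriors(2.0, classes)  # right at Object1's mean
```

With equal priors and equal standard deviations, a measurement halfway between the means gives 0.5 for each object, while a measurement at Object1's mean gives it nearly all of the probability. The means and standard deviations themselves are estimated from labelled training examples of each object, which is the Maximum Likelihood Estimation part.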
You could use many other algorithms, such as SVM, Boost, Decision Trees, Neural Networks and so on. Bag of visual words is also a nice alternative.
If you are interested in how to separate the object of interest from the background, you are talking about image segmentation. You can look at K-Means or, more powerfully, Graph Cuts techniques. Of course, you can always segment first and then classify the segmented blobs.
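As a rough idea of what K-Means segmentation does, here is a toy version that clusters grayscale intensities into k groups. It is only a sketch: the function name `kmeans_1d` is made up, the initial centers are simply spread evenly between the darkest and brightest value, and a real implementation (for example the one in OpenCV) would work on color vectors and use a better initialization:

```python
def kmeans_1d(values, k=2, iterations=20):
    """Cluster scalar values into k groups (k >= 2).

    Returns (labels, centers), where labels[i] is the cluster index of values[i].
    """
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers


# Dark background pixels and bright object pixels.
intensities = [10, 12, 11, 200, 205, 198]
labels, centers = kmeans_1d(intensities, k=2)
```

Pixels that share a label form one segment, and each segment (blob) can then be passed to the classifier.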
Samuel