Tensorflow object detection API extract box feature - object-detection-api

I am trying to extract bounding box features from the Tensorflow object detection API with the pretrained model 'faster_rcnn_resnet101_coco_2018_01_28', specifically the top-K bounding box features.
The tensor named 'SecondStageBoxPredictor/AvgPool:0' gives me box features of shape (100, 1, 1, 2048), 'detection_boxes:0' gives the bounding box coordinates, and 'detection_scores:0' gives the scores.
The last two appear to be sorted (descending by 'detection_scores:0'), so are the bounding box features sorted the same way?
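For reference, a minimal sketch (TF 1.x graph/session style, assuming the standard frozen_inference_graph.pb layout and the usual 'image_tensor:0' input name of exported detection graphs) of fetching the three tensors in one session run so their orderings can be compared directly:

```python
import numpy as np
import tensorflow as tf  # assumes TF 1.x APIs (tf.Session, tf.GraphDef, tf.gfile)

# assumed path inside the downloaded pretrained model directory
PATH_TO_FROZEN_GRAPH = 'faster_rcnn_resnet101_coco_2018_01_28/frozen_inference_graph.pb'

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

with tf.Session(graph=graph) as sess:
    image = np.zeros((1, 600, 600, 3), dtype=np.uint8)  # stand-in input image
    boxes, scores, box_features = sess.run(
        [graph.get_tensor_by_name('detection_boxes:0'),
         graph.get_tensor_by_name('detection_scores:0'),
         graph.get_tensor_by_name('SecondStageBoxPredictor/AvgPool:0')],
        feed_dict={graph.get_tensor_by_name('image_tensor:0'): image})

    # box_features has shape (100, 1, 1, 2048); IF it is ordered the same way as
    # detection_scores, the top-K features are simply the first K rows.
    k = 10
    topk_features = np.squeeze(box_features)[:k]
```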

Related

How to detect an object (x) only when it is within another detected object (y) in computer vision?

[Image: license plates within cars]
I have used YOLO for car detection and have also trained another YOLO model for license plate detection, which detects the license plates of all vehicles. I want to combine these two so that license plates are detected only for cars; currently the pipeline detects license plates for buses and trucks too. Is there any way I can detect a license plate only if the detected vehicle is a car?
You could ... submit the original image to the car detection model. The NMS output would contain the class, confidence, and bounding box coordinates. For the object classes that equal car, use OpenCV to crop the original image based upon the scaled bounding box output and submit the new image(s) to the license plate model. The output from the second model should only contain license plates within the car regions.
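A rough sketch of that two-stage pipeline, where detect_cars and detect_plates are placeholder wrappers around the two YOLO models (both assumed to return (class_name, confidence, (x1, y1, x2, y2)) tuples in pixel coordinates of the image they were given):

```python
import cv2

def plates_on_cars(image, detect_cars, detect_plates, car_class='car'):
    """Run the plate detector only inside boxes the car detector labelled 'car'."""
    results = []
    for cls, conf, box in detect_cars(image):
        if cls != car_class:
            continue  # skip buses, trucks, etc.
        x1, y1, x2, y2 = map(int, box)
        car_crop = image[y1:y2, x1:x2]
        for p_cls, p_conf, (px1, py1, px2, py2) in detect_plates(car_crop):
            # translate plate coordinates back into the original image frame
            results.append((p_conf, (x1 + px1, y1 + py1, x1 + px2, y1 + py2)))
    return results

# image = cv2.imread('traffic.jpg')          # hypothetical input
# plates = plates_on_cars(image, detect_cars, detect_plates)
```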

Bounding Box Regression

I have approximately 100K X-ray images of dimension 1024x1024. Only ~970 of them have pre-existing bounding box coordinates. I am training the model with a 70:30 training/testing split. My question is: how do I train the model if the rest of the images have no bounding boxes? Since I'm no medical expert, I can't manually draw bounding boxes on the images. There are 14 classes, and it gets really difficult to draw bounding boxes manually.
If you have some knowledge about the remaining unlabelled images, for example if you know whether an image contains a particular class, you can use weakly supervised learning to train object detection on all of them.
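As a loose illustration of that idea (names and shapes are placeholders, not tied to any particular X-ray dataset): the ~970 box-labelled images train the detector as usual, while all 100K images can still contribute through a 14-way multi-label classifier trained only on image-level labels, for example:

```python
import tensorflow as tf

NUM_CLASSES = 14  # one output per finding class

# ImageNet-pretrained backbone with a multi-label (sigmoid) head; trained only
# with image-level labels, i.e. no bounding boxes required.
base = tf.keras.applications.ResNet50(include_top=False, pooling='avg',
                                      input_shape=(1024, 1024, 3))
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation='sigmoid')(base.output)
model = tf.keras.Model(base.input, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy')

# images: (N, 1024, 1024, 3) float tensor, labels: (N, 14) multi-hot image-level labels
# model.fit(images, labels, batch_size=8, epochs=5)
```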

Labeling runways for localization and detection using deep learning

Shown above is a sample image of a runway that needs to be localized (a bounding box around the runway).
I know how image classification is done in Tensorflow; my question is how do I label this image for training?
I want the model to output 4 numbers to draw the bounding box.
In CS231n they say that we use a classifier and a localization head,
but how does my model know where the runway is in a 400x400 image?
In short: how do I LABEL this image for training, so that after training my model detects and localizes (draws a bounding box around) runways in input images?
Please feel free to give me links to lectures, videos, and GitHub tutorials from which I can learn about this.
**Not CS231n**: I have already taken that lecture and couldn't understand how to solve this using their approach.
Thanks
If you want to predict bounding boxes, then the labels are also bounding boxes. This is what most object detection systems use for training. You can have just bounding box labels, or, if you want to detect multiple object classes, a class label for each bounding box as well.
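For example, a label for one training image could be as simple as the class name plus the pixel coordinates of the box corners (the field names below are illustrative, not a required schema):

```python
runway_label = {
    'filename': 'runway_0001.jpg',
    'width': 400, 'height': 400,
    'objects': [
        # one entry per runway visible in the image
        {'class': 'runway', 'xmin': 52, 'ymin': 118, 'xmax': 367, 'ymax': 201},
    ],
}
```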
Collect data from Google or any other resource that contains only runway photos (from a closer view). I would suggest using a pre-trained image classification network (like VGG, AlexNet, etc.) and fine-tuning it on the downloaded runway data.
After building a good image classifier on the runway data set, you can use any popular algorithm to generate region proposals from the image.
Now take all region proposals and pass them to the classification network one by one, checking whether the network classifies each proposal as positive or negative. If it classifies a proposal as positive, then most probably your object (runway) is present in that region; otherwise it is not.
If there are a lot of region proposals in which the object is present according to the classifier, you can use a non-maximum suppression algorithm to reduce the number of positive proposals, as in the sketch below.
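A sketch of that proposal -> classify -> NMS loop, using OpenCV's selective search (opencv-contrib) as the proposal generator; classify_patch is a placeholder for the fine-tuned runway/not-runway classifier and is assumed to return the probability that a patch contains a runway:

```python
import cv2
import numpy as np

def detect_runways(image, classify_patch, score_thresh=0.8, nms_thresh=0.3):
    # region proposals via selective search (requires opencv-contrib-python)
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image)
    ss.switchToSelectiveSearchFast()
    proposals = ss.process()                  # array of (x, y, w, h) boxes

    boxes, scores = [], []
    for (x, y, w, h) in proposals[:500]:      # cap the number of proposals for speed
        score = classify_patch(image[y:y + h, x:x + w])
        if score >= score_thresh:
            boxes.append([int(x), int(y), int(w), int(h)])
            scores.append(float(score))

    # non-maximum suppression merges overlapping positive proposals
    keep = cv2.dnn.NMSBoxes(boxes, scores, score_thresh, nms_thresh)
    return [boxes[i] for i in np.array(keep).flatten()]
```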

Faster RCNN: how to translate coordinates

I'm trying to understand and use the Faster R-CNN algorithm on my own data.
My question is about ROI coordinates: what we have as labels, and what we want in the end, are ROI coordinates in the input image. However, if I understand it correctly, anchor boxes are given in the convolutional feature map, the ROI regression gives ROI coordinates relative to an anchor box (so easily translatable to coordinates in the conv feature map), and then the Fast R-CNN part does the ROI pooling using the coordinates in the convolutional feature map and itself (classifies and) regresses the bounding box coordinates.
Considering that between the raw image and the convolutional features some convolutions and poolings occur, possibly with strides > 1 (subsampling), how do we associate coordinates in the raw image with coordinates in feature space (in both directions)?
How are we supposed to give anchor box sizes: relative to the input image size, or to the convolutional feature map?
How is the bounding box regressed by Fast R-CNN expressed? (I would guess relative to the ROI proposal, similarly to the encoding of the proposal relative to the anchor box, but I'm not sure.)
It looks like it's actually an implementation question; the method itself does not dictate the answer.
A good way to do it, though, used by the Tensorflow Object Detection API, is to always give coordinates and ROI sizes relative to the layer's input size. That is, all coordinates and sizes are real numbers between 0 and 1, and likewise for the anchor boxes.
This handles the downsampling problem nicely and allows easy computation of ROI coordinates.
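A small illustration of why normalized coordinates make the downsampling issue disappear: the same [ymin, xmin, ymax, xmax] box in [0, 1] can be mapped to the raw image or to any feature map simply by multiplying by that layer's height and width (the example sizes below are arbitrary):

```python
import numpy as np

def to_absolute(norm_boxes, height, width):
    """norm_boxes: (N, 4) array of [ymin, xmin, ymax, xmax] in [0, 1]."""
    scale = np.array([height, width, height, width], dtype=np.float32)
    return norm_boxes * scale

box = np.array([[0.25, 0.10, 0.75, 0.60]])   # one normalized ROI
print(to_absolute(box, 600, 800))            # coordinates in a 600x800 input image
print(to_absolute(box, 38, 50))              # same ROI on a ~16x-downsampled feature map
```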
When you don't use an activation function on the output layer, the result is raw numbers, and these raw values can be associated directly with the coordinate labels (regression).
Using an activation function such as softmax instead produces probability-like outputs, which leads to a classification-style solution rather than regression.

CNN Object Localization Preprocessing?

I'm trying to use a pretrained VGG16 as an object localizer in Tensorflow on ImageNet data. In their paper, the group mentions that they basically just strip off the softmax layer and attach either a 4-D or a 4000-D fully connected layer for bounding box regression. I'm not trying to do anything fancy here (sliding windows, R-CNN), just get some mediocre results.
I'm fairly new to this and I'm confused about the preprocessing done for localization. In the paper, they say that they scale the image so its shortest side is 256, then take the central 224x224 crop and train on that. I've looked all over and can't find a simple explanation of how to handle the localization data.
Questions: how do people usually handle the bounding boxes here?
Do you use something like the tf.sample_distorted_bounding_box command, and then rescale the image based on that?
Do you just rescale/crop the image itself, and then interpolate the bounding box with the transformed scales? Wouldn't this result in negative box coordinates in some cases?
How are multiple objects per image handled?
Do you just choose a single bounding box from the beginning, crop to that, and then train on this crop?
Or do you feed it the whole (centrally cropped) image and then try to predict one or more boxes somehow?
Does any of this generalize to the detection or segmentation challenges (like MS COCO), or is it completely different?
Anything helps...
Thanks
Localization is usually performed as an intersection of sliding windows where the network identifies the presence of the object you want.
Generalizing that to multiple objects works the same way.
Segmentation is more complex: you train your model on a pixel mask with your object filled in, and it tries to output a pixel mask of the same size.
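On the earlier worry about negative coordinates when interpolating the box through the resize and central crop, a minimal sketch of that transform (shortest side scaled to 256, central 224x224 crop, box clipped to the crop so coordinates never go negative; a box that ends up empty would typically be discarded):

```python
import numpy as np

def transform_box(box, orig_h, orig_w, short_side=256, crop=224):
    """box: [ymin, xmin, ymax, xmax] in pixels of the original image."""
    scale = short_side / min(orig_h, orig_w)
    new_h, new_w = orig_h * scale, orig_w * scale
    off_y, off_x = (new_h - crop) / 2.0, (new_w - crop) / 2.0

    ymin, xmin, ymax, xmax = [c * scale for c in box]
    # shift into the crop's coordinate frame, then clip to [0, crop]
    shifted = [ymin - off_y, xmin - off_x, ymax - off_y, xmax - off_x]
    return np.clip(shifted, 0, crop)

print(transform_box([100, 50, 700, 400], orig_h=768, orig_w=1024))
```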
