Yolo/SSD: How are features localized?

I want to fully understand how YOLO or SSD works for object detection.
Here is what I understand:
In a simple CNN for image recognition, features are extracted, flattened, and used for classification.
YOLO divides the image into a grid. For each grid cell, values such as class probabilities and bounding box parameters are calculated.
SSD uses not just one grid but a combination of grids of different sizes, to better detect objects of any size.
What I don't understand:
In YOLO and SSD, is there a classification per grid cell? How do we know which features are in that specific cell?
Are features extracted per grid cell, per bounding box, or neither?
How does SSD combine results from different grids?
Thank you!


Labeling runways for localization and detection using deep learning

Shown above is a sample image of a runway that needs to be localized (a bounding box around the runway).
I know how image classification is done in TensorFlow. My question is: how do I label this image for training?
I want the model to output 4 numbers to draw the bounding box.
In CS231n they say that we use a classifier and a localization head.
But how does my model know where the runway is in 400x400 images?
In short: how do I LABEL this image for training, so that after training my model detects and localizes runways (draws a bounding box around them) in input images?
Please feel free to give me links to lectures, videos, or GitHub tutorials where I can learn about this.
**********Not CS231n********** I already took that lecture and couldn't understand how to solve this using their approach.
If you want to predict bounding boxes, then the labels are also bounding boxes. This is what most object detection systems use for training. You can just have bounding box labels, or if you want to detect multiple object classes, then also class labels for each bounding box would be required.
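As a rough illustration of what such a label could look like, here is a minimal sketch. The field names, the file name, and the pixel coordinates are made up for the example and are not a required format; many pipelines also normalize the four numbers by the image size so the network regresses values in [0, 1].

```python
# Hypothetical annotation for one 400x400 training image.
# Field names, file name, and coordinates are illustrative assumptions.
annotation = {
    "image": "runway_001.png",  # hypothetical file name
    "width": 400,
    "height": 400,
    "boxes": [
        # (x_min, y_min, x_max, y_max) in pixels, plus a class label
        {"bbox": (120, 40, 260, 380), "label": "runway"},
    ],
}


def to_regression_target(bbox, width, height):
    """Scale pixel corners to [0, 1] so the model regresses 4 numbers."""
    x_min, y_min, x_max, y_max = bbox
    return (x_min / width, y_min / height, x_max / width, y_max / height)


target = to_regression_target(
    annotation["boxes"][0]["bbox"], annotation["width"], annotation["height"]
)
print(target)
```

With a single class, the class label can be dropped and the four numbers alone serve as the training target.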
Collect data from Google or any other source that contains only runway photos (from a fairly close view). I would suggest using a pre-trained image classification network (such as VGG or AlexNet) and fine-tuning it on the downloaded runway data.
After building a good image classifier on the runway dataset, you can use any popular algorithm to generate region proposals from the image.
Now take all region proposals and pass them to the classification network one by one, checking whether the network classifies each proposal as positive or negative. If a proposal is classified as positive, your object (the runway) is most probably present in that region; otherwise it is not.
If the classifier finds the object in many region proposals, you can use non-maximum suppression to reduce the number of positive proposals.
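The last step can be sketched as follows. This is a plain-Python version of greedy non-maximum suppression, assuming boxes are given as (x1, y1, x2, y2) and that the classifier has already produced a score per positive proposal; the boxes, scores, and the 0.5 threshold are made-up values for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Proposals the classifier marked as positive (made-up numbers).
boxes = [(100, 100, 200, 300), (105, 110, 205, 310), (300, 50, 360, 120)]
scores = [0.90, 0.75, 0.60]
kept = non_max_suppression(boxes, scores)
print([boxes[i] for i in kept])
```

Here the second box overlaps the first almost entirely and is dropped, while the third box does not overlap anything and is kept.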

CNN Object Localization Preprocessing?

I'm trying to use a pretrained VGG16 as an object localizer in TensorFlow on ImageNet data. In their paper, the group mentions that they basically just strip off the softmax layer and toss on either a 4-D or a 4000-D fc layer for bounding box regression. I'm not trying to do anything fancy here (sliding windows, RCNN), just get some mediocre results.
I'm sort of new to this and I'm just confused about the preprocessing done here for localization. In the paper, they say that they scale the image to 256 as its shortest side, then take the central 224x224 crop and train on this. I've looked all over and can't find a simple explanation on how to handle localization data.
Questions: How do people usually handle the bounding boxes here?...
Do you use something like tf.image.sample_distorted_bounding_box, and then rescale the image based on that?
Do you just rescale/crop the image itself, and then interpolate the bounding box with the transformed scales? Wouldn't this result in negative box coordinates in some cases?
How are multiple objects per image handled?
Do you just choose a single bounding box from the beginning, crop to that, then train on this crop?
Or, do you feed it the whole (centrally cropped) image, and then try to predict 1 or more boxes somehow?
Does any of this generalize to the Detection or segmentation (like MS-CoCo) challenges, or is it completely different?
Anything helps...
Localization is usually performed as an intersection of sliding windows where the network identifies the presence of the object you want.
Generalizing that to multiple objects works the same.
Segmentation is more complex. You can train your model on a pixel mask in which your object is filled in, and have it output a pixel mask of the same size.
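On the rescale-and-crop question raised above, one possible way to keep a box consistent with the image is to push it through the same transform and clip it to the crop. The sketch below assumes the preprocessing described in the question (shortest side scaled to 256, central 224x224 crop); the function name and the choice to return None for boxes that fall outside the crop are assumptions for illustration, not what the paper's authors did.

```python
def resize_and_center_crop_box(box, width, height, short_side=256, crop=224):
    """Map a pixel box (x1, y1, x2, y2) through resize + central crop.

    Returns the clipped box in crop coordinates, or None if the box
    falls entirely outside the crop.
    """
    scale = short_side / min(width, height)
    new_w, new_h = width * scale, height * scale
    # Offset of the central crop inside the resized image.
    off_x, off_y = (new_w - crop) / 2, (new_h - crop) / 2

    x1, y1, x2, y2 = (v * scale for v in box)
    x1, x2 = x1 - off_x, x2 - off_x
    y1, y2 = y1 - off_y, y2 - off_y

    # Shifted coordinates can be negative or exceed the crop: clip them.
    x1, x2 = max(0.0, min(crop, x1)), max(0.0, min(crop, x2))
    y1, y2 = max(0.0, min(crop, y1)), max(0.0, min(crop, y2))
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)


# A box touching the left edge of a 512x512 image gets clipped at x = 0.
clipped = resize_and_center_crop_box((0, 100, 200, 300), width=512, height=512)
print(clipped)

# A box in the far corner of a wide image is lost to the central crop.
lost = resize_and_center_crop_box((0, 0, 10, 10), width=1024, height=512)
print(lost)
```

So negative coordinates do appear before clipping, and an object can disappear from the crop entirely, in which case that image has no usable box.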

Poor performance on digit recognition with CNN trained on MNIST dataset

I trained a CNN (in TensorFlow) for digit recognition using the MNIST dataset.
Accuracy on the test set was close to 98%.
I wanted to predict digits using data that I created myself, and the results were bad.
What did I do to the images I wrote?
I segmented out each digit, converted it to grayscale, resized it to 28x28, and fed it to the model.
How come I get such low accuracy on my own data set but such high accuracy on the test set?
Are there other modifications that I'm supposed to make to the images?
Here is the link to the images and some examples:
Excluding bugs and obvious errors, my guess is that you are capturing your handwritten digits in a way that is too different from your training set.
When capturing your data, you should try to mimic as closely as possible the process used to create the MNIST dataset.
From the official MNIST dataset website:
"The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field."
If your data is processed differently in the training and test phases, your model cannot generalize from the training data to the test data.
So I have two pieces of advice for you:
Try to capture and process your digit images so that they look as similar as possible to the MNIST dataset;
Add some of your examples to your training data to allow your model to train on images similar to the ones you are classifying;
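The centering step from the MNIST description can be sketched in plain Python. This is a minimal sketch under a few assumptions: the digit has already been resized to fit a 20x20 box with an image library of your choice, it is stored as a list of rows with bright ink on a dark background (as in MNIST), and the shift is rounded to whole pixels. The function name and the sample stroke are made up for illustration.

```python
def center_by_mass(digit, canvas=28):
    """Paste a small grayscale digit into a canvas x canvas image so that
    its intensity center of mass lands near the middle of the canvas."""
    total = sum(sum(row) for row in digit)
    if total == 0:
        raise ValueError("empty digit image")
    cy = sum(y * v for y, row in enumerate(digit) for v in row) / total
    cx = sum(x * v for row in digit for x, v in enumerate(row)) / total
    # Whole-pixel shift that moves the center of mass to the canvas center.
    center = (canvas - 1) / 2
    dy, dx = round(center - cy), round(center - cx)
    out = [[0] * canvas for _ in range(canvas)]
    for y, row in enumerate(digit):
        for x, v in enumerate(row):
            ty, tx = y + dy, x + dx
            if 0 <= ty < canvas and 0 <= tx < canvas:
                out[ty][tx] = v
    return out


def center_of_mass(img):
    """Return (cy, cx) of the pixel intensities, used here as a check."""
    total = sum(sum(row) for row in img)
    cy = sum(y * v for y, row in enumerate(img) for v in row) / total
    cx = sum(x * v for row in img for x, v in enumerate(row)) / total
    return cy, cx


# Made-up 20x20 "digit": a vertical stroke far to the left of the box.
digit = [[255 if (x == 3 and 2 <= y <= 17) else 0 for x in range(20)]
         for y in range(20)]
centered = center_by_mass(digit)
print(center_of_mass(centered))
```

After centering, the stroke's center of mass sits within half a pixel of the middle of the 28x28 field, even though it started near the left edge of the 20x20 box.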
For those who still have a hard time with poor results from CNN-based models trained on MNIST:
Normalization was the key.

OpenCV SVM training dataset

Let's say I have a dataset of about 350 positive images and more than 400 negative images. They aren't all the same size, and all of them are larger than 640x320.
What should I do to create a better dataset? Do I need the images to be smaller? If yes, why?
Should I apply some normalization to the dataset? What should it be (contrast, noise reduction)?
Can I create a bigger dataset using the existing one? If yes, how?
Thanks in advance!
Optimal size of images is that you can easily classify object by
Yes, classifiers work better after normalization, and there are several options. The most popular approach is to center the dataset (subtract the mean) and normalize the range of values to, say, [-1, 1]. Another popular approach is similar, but normalizes the standard deviation instead (preferable in most cases).
Yes, you can create a bigger dataset from the existing one by adding distortions and noise to your images.
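The two normalization options and the noise-based augmentation can be sketched like this. It is a simplified sketch that works on a flat list of pixel values using only the standard library; the function names, the sample pixels, and the noise level are assumptions for illustration. In practice the mean and standard deviation are usually computed over the whole training set rather than per image.

```python
import random
import statistics


def center_and_scale(pixels):
    """Subtract the mean, then scale so all values lie in [-1, 1]."""
    mean = statistics.fmean(pixels)
    centered = [p - mean for p in pixels]
    peak = max(abs(c) for c in centered) or 1.0  # avoid division by zero
    return [c / peak for c in centered]


def standardize(pixels):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels) or 1.0  # avoid division by zero
    return [(p - mean) / std for p in pixels]


def add_noise(pixels, sigma=5.0, rng=None):
    """Augmentation: add Gaussian noise and clamp to the 8-bit range."""
    rng = rng or random.Random(0)
    return [min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


pixels = [0, 64, 128, 192, 255]  # made-up grayscale values
scaled = center_and_scale(pixels)
standardized = standardize(pixels)
noisy = add_noise(pixels)
print(scaled)
print(standardized)
print(noisy)
```

Each noisy copy can be added to the training set alongside the original, which is one simple way to enlarge a small dataset.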
Have a look at the INRIA dataset and their comments on how they "normalized" their input images for HOG person detection training.
One thing that hasn't been mentioned yet: for most detection techniques it isn't enough to collect a set of n images with the desired object "somewhere" within each image. Instead, you should crop each image around the object (with some border).
e.g. for person detection they used this input image:
but they cropped and rescaled (and transformed) those regions (objects):
probably there are some good hints about training in the thesis too:
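Cropping around the object with some border could look like the sketch below. It assumes the image is a row-major list of rows and that the object's box is known as (x1, y1, x2, y2) in pixels; the 25% border and the synthetic image are made-up values, and the rescaling to a common size that would normally follow is left out.

```python
def crop_with_border(image, box, border=0.1):
    """Crop a row-major image (list of rows) around box = (x1, y1, x2, y2),
    padded by `border` times the box size and clipped to the image."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    pad_x = int(round((x2 - x1) * border))
    pad_y = int(round((y2 - y1) * border))
    left, top = max(0, x1 - pad_x), max(0, y1 - pad_y)
    right, bottom = min(width, x2 + pad_x), min(height, y2 + pad_y)
    return [row[left:right] for row in image[top:bottom]]


# Synthetic 64x48 image whose pixel value encodes its position.
image = [[x + 100 * y for x in range(64)] for y in range(48)]
patch = crop_with_border(image, (20, 10, 40, 30), border=0.25)
print(len(patch[0]), len(patch))
```

The 20x20 object box grows by 5 pixels on each side, so the training sample keeps a bit of background context around the object.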

OpenCV cascade-training using images bounded/outlined by irregular polygons?

I need to prepare training data which I will then use with OpenCV's cascaded classifier. I understand that for training data I'll need to provide rectangular images as samples with aspect ratios that correspond to the -w and -h parameters in OpenCV's training commands.
I was fine with this idea, but then I saw the web-based annotation tool LabelMe.
People have labelled in LabelMe using complex polygons!
Can these polygons be somehow used in cascaded training?
Wouldn't using irregular polygons improve the classification results?
If not, then what is the use of the complex polygons that outline objects in LabelMe'd images?
Data sets annotated with LabelMe are used for many different purposes. Some of them, like image segmentation, require tight boundaries, rather than bounding boxes.
On the other hand, the cascade classifier in OpenCV is designed to classify rectangular image regions. It is then used as part of a sliding-window object detector, which also works with bounding boxes.
Whether tight boundaries help improve object detection is an interesting question. There is evidence that the background pixels caught by the bounding box actually help the classification.
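To make the sliding-window idea concrete, here is a minimal sketch of how a classifier for rectangular regions is turned into a detector. It does not use OpenCV; the window size, stride, and the stand-in `toy_classifier` are hypothetical, and a real detector would also repeat the scan over several image scales.

```python
def sliding_windows(width, height, win_w, win_h, stride):
    """Yield (x1, y1, x2, y2) for every window that fits inside the image."""
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)


def detect(width, height, classify, win_w=24, win_h=24, stride=8):
    """Collect the windows that the rectangular-region classifier accepts."""
    windows = sliding_windows(width, height, win_w, win_h, stride)
    return [w for w in windows if classify(w)]


def toy_classifier(window):
    """Hypothetical stand-in: accept windows that contain the point (30, 30)."""
    x1, y1, x2, y2 = window
    return x1 <= 30 < x2 and y1 <= 30 < y2


hits = detect(64, 48, toy_classifier)
print(len(hits), hits[0])
```

Every candidate passed to the classifier is a rectangle of fixed aspect ratio, which is why the training samples need to be rectangles with that same aspect ratio.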
