Why is image segmentation needed for object detection? - image-processing

What is the point of image segmentation algorithms like SLIC? Most object detection algorithms work over the entire set of (square) sub-images anyway.
The only conceivable benefit to segmenting the image is that now, the classifier has shape information available to it. Is that right?
Most classifiers I know of take rectangular input images. What classifiers allow you to pass variable sized image segments to them?

First, SLIC and the kind of algorithms I'm guessing you refer to are not segmentation algorithms; they are oversegmentation algorithms. There is a difference between the two terms: segmentation methods split the image into objects, while oversegmentation methods split the image into small clusters (spatially adjacent groups of pixels with similar characteristics). These clusters are usually called superpixels. See the image** of superpixels below:
Now, answering parts of your original question:
Why use superpixels?
They reduce the dimensionality of your data and the complexity of the problem: from N = w x h pixels to M superpixels, with M << N. For example, for an image of N = 320x480 = 153600 pixels, M = 400 or M = 800 superpixels seem to be enough to oversegment it. Leaving aside for now how to classify them, just consider how much easier your problem becomes when you go from N = 150K pixels to M = 800 superpixels as examples to train on or classify. The superpixels still represent your data properly, as they adhere to image boundaries.
They allow you to compute more complex features. With pixels, you can only compute some simple statistics on them, or use a filter bank/feature extractor to extract some features in their vicinity. This, however, represents your pixel's neighbourhood very locally, without taking the context into consideration. With superpixels, you can compute a superpixel descriptor from all the pixels that belong to it. That is, features are usually computed at pixel level as before, but then they are merged into a superpixel descriptor by different methods. Some of the methods to do that are: simple mean of all pixels inside a superpixel, histograms, bag-of-words, correlation. As a simple example, imagine you only consider grayscale intensity as a feature for your image/classifier. If you use simple pixels, all you have is the pixel's intensity, which is very local and noisy. If you use superpixels, you can compute a histogram of the intensities of all the pixels inside, which describes the region much better than a single local intensity value.
They allow you to compute new features. Over superpixels you can compute regional statistics (first-order, such as mean or variance, or second-order, such as covariance). You can also extract other information not available before (e.g. shape, length, diameter, area...).
How to use them?
1. Compute pixel features
2. Merge pixel features into superpixel descriptors
3. Classify/Optimize superpixel descriptors
In step 2, whether by averaging, using a histogram, or using a bag-of-words model, the superpixel descriptor is computed with a fixed size (e.g. 100 bins for a histogram). Therefore, in the end you have reduced the training data from X = N x K (N = 100K pixels times K features) to X = M x D (with M = 1K superpixels and D the length of the superpixel descriptor). Usually D > K but M << N, so you end up with regional, more robust features that represent your data better with far fewer examples, which reduces the complexity of your problem (classification, optimization) by 2-3 orders of magnitude on average!
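As a rough illustration of step 2, here is a minimal sketch that merges per-pixel grayscale intensities into one normalised histogram per superpixel. The function name superpixel_histograms, the 4-bin setting and the tiny hand-written label map are assumptions made for the example; in practice the labels would come from an oversegmentation such as SLIC, which is not computed here.

```python
from collections import defaultdict

def superpixel_histograms(intensities, labels, n_bins=4, max_value=256):
    """Merge per-pixel intensities into one fixed-size histogram per superpixel.

    intensities, labels: 2D lists of equal shape; labels[y][x] is the
    superpixel id of pixel (x, y), assumed to be given.
    """
    counts = defaultdict(lambda: [0] * n_bins)
    for row_values, row_labels in zip(intensities, labels):
        for value, label in zip(row_values, row_labels):
            bin_index = min(value * n_bins // max_value, n_bins - 1)
            counts[label][bin_index] += 1
    # Normalise so that superpixels with different areas are comparable.
    descriptors = {}
    for label, hist in counts.items():
        total = sum(hist)
        descriptors[label] = [c / total for c in hist]
    return descriptors

# Toy 2x4 image split into two superpixels (ids 0 and 1).
intensities = [[10, 20, 200, 210],
               [15, 70, 220, 250]]
labels = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
descriptors = superpixel_histograms(intensities, labels)
```

Every superpixel gets a descriptor of length n_bins regardless of how many pixels it contains, which is what makes the M x D training matrix possible.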
Conclusions
You can compute more complex (robust) features, but you have to be careful about how, when and what for you use superpixels as your data representation. You might lose some information (e.g. you lose your 2D grid lattice), or, if you don't have enough training examples, you might make the problem more difficult, as the features are more complex and you could transform a linearly separable problem into a non-linear one.
** Image from SLIC superpixels: http://ivrl.epfl.ch/research/superpixels

Related

How to enable a Convolutional NN to take variable size input?

So, I've seen that many of the first CNN examples in Machine Learning use the MNIST dataset. Each image there is 28x28, and so we know the shape of the input beforehand. How would this be done for variable-size input, let's say you have some images that are 56x56 and some that are 28x28?
I'm looking for a language- and framework-agnostic answer if possible, or preferably one in TensorFlow terms.
In some cases, resizing the images appropriately (for example, keeping the aspect ratio) will be sufficient. But this can introduce distortion, and in case this is harmful, another solution is to use Spatial Pyramid Pooling (SPP). The problem with different image sizes is that they produce layers of different sizes. For example, taking the features of the n-th layer of some network, you can end up with a feature map of size 128*fw*fh, where fw and fh vary depending on the size of the input example. What SPP does to alleviate this problem is to turn this variable-size feature map into a fixed-length vector of features. It operates at different scales, by dividing the feature map into equal patches and performing maxpooling on them. I think this paper does a great job at explaining it. An example application can be seen here.
As a quick explanation, imagine you have a feature map of size k*fw*fh. You can consider it as k maps of the form
X Y
Z T
where each of the blocks is of size fw/2*fh/2. Now, performing maxpooling on each of those blocks separately gives you a vector of size 4 per map, and therefore you can roughly describe the k*fw*fh map as a fixed-size vector of k*4 features.
Now, call this fixed-size vector w and set it aside. This time, consider the k*fw*fh feature map as k feature planes written as
A B C D
E F G H
I J K L
M N O P
and again, perform maxpooling separately on each block. Using this, you obtain a more fine-grained representation, as a vector v of length k*16.
Now, concatenating the two vectors as u=[v;w] gives you a fixed-size representation of length k*20. This is exactly what a 2-scale SPP does (of course, you can change the number and sizes of the divisions).
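To make the 2-scale pooling concrete, here is a minimal pure-Python sketch (no framework). The function names max_pool_grid and spp and the helper make_map are made up for this example, and it assumes fw and fh are at least as large as the finest grid (4 here). It is only an illustration of the idea, not the implementation from the paper.

```python
def max_pool_grid(plane, n):
    """Max-pool one 2D feature plane over an n x n grid of blocks (n*n values)."""
    height, width = len(plane), len(plane[0])
    pooled = []
    for i in range(n):
        # Integer block boundaries that together cover the whole plane.
        r0, r1 = i * height // n, (i + 1) * height // n
        for j in range(n):
            c0, c1 = j * width // n, (j + 1) * width // n
            pooled.append(max(plane[r][c]
                              for r in range(r0, r1)
                              for c in range(c0, c1)))
    return pooled

def spp(feature_map, levels=(4, 2)):
    """Concatenate the pooled vectors of every level: u = [v; w]."""
    u = []
    for n in levels:
        for plane in feature_map:
            u.extend(max_pool_grid(plane, n))
    return u

def make_map(k, fh, fw):
    """Dummy feature map with k planes of size fh x fw (values are arbitrary)."""
    return [[[(p + 1) * (r * fw + c) for c in range(fw)] for r in range(fh)]
            for p in range(k)]

u_small = spp(make_map(3, 4, 4))   # k = 3 planes of 4 x 4
u_large = spp(make_map(3, 8, 6))   # k = 3 planes of 8 x 6
```

Both u_small and u_large have length k*16 + k*4, even though the feature maps have different spatial sizes.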
Hope this helps.
When you use a CNN for a classification task, your network has two parts:
Feature generator. This part generates a feature map of size WF x HF with CF channels from an image of size WI x HI with CI channels. The relation between image size and feature map size depends on the structure of your NN (for example, on the number of pooling layers and their strides).
Classifier. This part classifies vectors with WF*HF*CF components into classes.
You can feed images of different sizes into the feature generator and get feature maps of different sizes. But the classifier can only be trained on vectors of some fixed length. Therefore you normally train your network for some fixed image size. If you have images of a different size, you resize them to the input size of the network, or crop part of the image.
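A small arithmetic sketch of why the classifier input changes with the image size. The network here is hypothetical: it is assumed to reduce each spatial dimension by a total stride of 4 (for example two stride-2 poolings with padding that keeps sizes evenly divisible) and to output 64 channels. Real architectures will differ.

```python
def feature_vector_length(image_w, image_h, total_stride=4, channels=64):
    """Length WF*HF*CF of the flattened feature map for a hypothetical network."""
    wf = image_w // total_stride
    hf = image_h // total_stride
    return wf * hf * channels

small = feature_vector_length(28, 28)   # 7 * 7 * 64
large = feature_vector_length(56, 56)   # 14 * 14 * 64
```

The two lengths differ by a factor of 4, so a fully connected classifier trained on one cannot be applied directly to the other.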
Another way is described in the article
K. He, X. Zhang, S. Ren, J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," arXiv:1406.4729, 2014
The authors propose spatial pyramid pooling, which solves the problem of different image sizes at the input of a CNN. However, I am not sure whether a spatial pyramid pooling layer exists in TensorFlow.

Bag of Words with HOG descriptors

I'm not quite sure how to implement the "Bag of Words" approach with HOG descriptors.
I've checked several sources which usually provide several steps to follow:
1. Compute the HOGs for the set of valid training images.
2. Apply a clustering algorithm to retrieve n centroids from the descriptors.
3. Perform some magic to create histograms with the frequency of the nearest centroids of the computed HOGs, or use OpenCV's implementation to do this.
4. Train a linear SVM with the histograms.
The step which involves magic (3) is not really clear. If I don't use OpenCV, how would I implement it?
The HOGs are vectors which are calculated cell-wise, so I have a vector for each cell. I could iterate over the vector, calculate the closest centroid for each element of the vector, and create the histogram accordingly. Would this be a proper way to do it? But if so, I still have vectors of different sizes and no benefit from it.
The main steps are:
1- Extract features from your entire training set (HOG features, in your case).
2- Cluster those features into a vocabulary V; you get K distinct cluster centers (using K-Means or K-Medoids; K is your hyperparameter).
3- Encode each training image as a histogram of the number of times each vocabulary element shows up in the image. Each image is then represented by a length-K vector.
For example, the first vocabulary element may occur 5 times in your image and the second 10 times. The exact counts don't matter; in the end you will have a vector h with K elements:
h[0] = 5
h[1] = 10
....
h[K-1] = 3
4- Train the classifier on these vectors (linear SVM).
When given a test image, extract the features. Now represent the test image as a histogram of the number of times each cluster center from V was closest to a feature in the test image. This is a length-K vector again.
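Here is a minimal sketch of the encoding in step 3 without OpenCV. The names encode_bow and squared_distance and the toy 2-dimensional "descriptors" are assumptions for illustration; real HOG descriptors would be much longer vectors, and the vocabulary would come from the clustering in step 2. Note that each whole local descriptor (for example one cell or block vector) counts as one "word", not each element of a descriptor.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode_bow(descriptors, vocabulary):
    """Count, for each cluster center, how many descriptors are nearest to it.

    The result always has len(vocabulary) entries, however many
    descriptors the image produced.
    """
    histogram = [0] * len(vocabulary)
    for d in descriptors:
        nearest = min(range(len(vocabulary)),
                      key=lambda i: squared_distance(d, vocabulary[i]))
        histogram[nearest] += 1
    return histogram

# Toy vocabulary with K = 3 centers and two images with 3 and 5 descriptors.
vocabulary = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
image_a = [[0.1, 0.0], [0.9, 1.1], [0.8, 0.9]]
image_b = [[0.0, 0.9], [0.1, 1.0], [0.2, 0.1], [1.0, 0.9], [0.0, 0.2]]
hist_a = encode_bow(image_a, vocabulary)
hist_b = encode_bow(image_b, vocabulary)
```

Both histograms have length K = 3 even though the images yielded different numbers of descriptors, which is what lets a linear SVM consume them. Dividing each histogram by its sum is a common extra step when images produce very different descriptor counts.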

Kmeans clustering on different distance function in Lab space

Problem:
To cluster pixels of similar colour in CIE LAB using k-means.
I want to use CIE94 as the distance between 2 pixels.
Formula of CIE94
What I read was that k-means works in "Euclidean space", where the positional coordinates are optimised by minimising a cost function, which is the sum of squared differences.
The reason for not using k-means in spaces other than Euclidean is:
"algorithm is often presented as assigning objects to the nearest cluster by distance. The standard algorithm aims at minimizing the within-cluster sum of squares (WCSS) objective, and thus assigns by "least sum of squares", which is exactly equivalent to assigning by the smallest Euclidean distance. Using a different distance function other than (squared) Euclidean distance may stop the algorithm from converging" (source: wiki)
So how can I use the CIE94 distance in LAB space for clustering similar colours?
How should I approach the problem? What should the minimisation function be here? How do I map LAB space to Euclidean space so that the Euclidean k-means formula works? Is there any other approach?
One reason CIE LAB is often used for clustering is that it separates lightness (L) from color, so color can be described with 2 dimensions, the a and b channels (as opposed to RGB, where all 3 channels mix color and brightness). You can think of the color of each pixel as a point in a Cartesian coordinate system: instead of points (x,y) you have points (a,b). From here you simply perform a 2D k-means.
Exactly how you implement k-means is up to you. The nice thing about reducing colors to a 2D space is that we can imagine the data on a grid, and now we can use any regular distance measure we want: Mahalanobis, Euclidean, 1-norm, city block, etc. The possibilities are really endless here.
You don't have to use CIELAB; you can just as easily use YCbCr, YUV, or any other color space that represents color with 2 chroma channels. If you wanted to try a 3D k-means you could use RGB, HSV, etc. One problem with higher dimensionality is sparsity of clusters (large variance) and, most importantly, increased computation time.
Just for fun I've included two images clustered using k-means, one in LAB and one in YCbCr. You can see the clustering is nearly identical (except that the labels are different), which suggests that the exact color space matters little here; the main point is to match the dimensionality of your k-means with that of your data.
EDIT
You made some good points in your comments. I was merely demonstrating that by abstracting the problem you can imagine many variations of the same basic clustering algorithm. But you are right, there are advantages to using CIELAB.
Back to the distance measure. K-means has two steps, assignment and update (it is very similar to the Expectation Maximization algorithm). The distance is used in the assignment step of k-means. Here is some pseudocode:
for each pixel p in 1 to rows*cols
    for each cluster j in 1 to k
        dist[j] = calculate_distance(pixel[p], mu[j])
    pixel_id[p] = index j of minimum dist[j]
You would create a function calculate_distance that uses the delta E calculation from CIE94. This formula uses all 3 channels to calculate the distance. Hopefully this answers your questions.
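As a hedged sketch of what such a distance function could look like, the following implements the CIE94 delta E with the graphic-arts constants (kL = kC = kH = 1, K1 = 0.045, K2 = 0.015) and uses it in the assignment step. The names delta_e_cie94 and assign are made up for this example. CIE94 is not symmetric in its two arguments; treating the cluster center as the reference colour is an assumption made here.

```python
import math

def delta_e_cie94(lab_ref, lab, k_L=1.0, K1=0.045, K2=0.015):
    """CIE94 colour difference, with lab_ref as the reference colour."""
    L1, a1, b1 = lab_ref
    L2, a2, b2 = lab
    C1 = math.hypot(a1, b1)          # chroma of the reference
    C2 = math.hypot(a2, b2)
    dL = L1 - L2
    dC = C1 - C2
    # dH^2 = da^2 + db^2 - dC^2; clamp tiny negatives caused by rounding.
    dH_sq = max((a1 - a2) ** 2 + (b1 - b2) ** 2 - dC ** 2, 0.0)
    S_L, S_C, S_H = 1.0, 1.0 + K1 * C1, 1.0 + K2 * C1
    return math.sqrt((dL / (k_L * S_L)) ** 2
                     + (dC / S_C) ** 2
                     + dH_sq / S_H ** 2)

def assign(pixels, centers, distance=delta_e_cie94):
    """Assignment step: index of the nearest center for every pixel."""
    return [min(range(len(centers)), key=lambda j: distance(centers[j], p))
            for p in pixels]

centers = [(50.0, 60.0, 0.0), (50.0, -60.0, 0.0)]
pixels = [(52.0, 55.0, 5.0), (48.0, -50.0, -3.0)]
pixel_ids = assign(pixels, centers)
```

Only the assignment step changes here. If the update step still takes the mean of each cluster, the algorithm is no longer guaranteed to decrease a single objective, which is the convergence caveat quoted in the question.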
NOTE
My examples only use the 2 color channels, ignoring the luminance channel. I used this technique since often the goal is to group colors despite lighting disparities (such as shadows). The delta E measure is not lighting-invariant. This may or may not be a concern for your application, but it is something to keep in mind.
results using square euclidean distance
results using cityblock distance
There are k-means variations for other distance functions.
In particular k-medoids (PAM) works with arbitrary distance functions.
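A minimal sketch of the idea, assuming a simple alternating scheme (assign points to the nearest medoid, then pick the best member of each cluster as the new medoid) rather than the full PAM swap search. The function names and the toy 2D points are illustrative only; distance can be any function of two points, for instance a CIE94 delta E on Lab triples.

```python
def k_medoids(points, initial_medoids, distance, max_iter=100):
    """Alternating k-medoids; works with an arbitrary distance function."""
    medoids = list(initial_medoids)
    clusters = []
    for _ in range(max_iter):
        # Assignment: each point joins its nearest medoid.
        clusters = [[] for _ in medoids]
        for p in points:
            j = min(range(len(medoids)), key=lambda i: distance(p, medoids[i]))
            clusters[j].append(p)
        # Update: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = []
        for old, members in zip(medoids, clusters):
            if not members:
                new_medoids.append(old)   # keep the medoid of an empty cluster
                continue
            new_medoids.append(
                min(members, key=lambda c: sum(distance(c, q) for q in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

def city_block(a, b):
    """City block (L1) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 12)]
medoids, clusters = k_medoids(points, [(1, 0), (10, 12)], city_block)
```

Because the medoids are always actual data points, no mean has to be computed, so the distance does not need to be Euclidean.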

HOG features on different scales

Suppose we calculate HOG features of image patches of different sizes, ranging from 64 * 64 to 128 * 128. Now, if we want to do k-means on these, should we normalize the patches which belong to different scale? I know HOG features are normalized, but does scale matter?
Normally, the HOG representations are normalized. However, you must be careful with the block size. In fact, you must have the same number of blocks whatever the size of the image; otherwise you obtain descriptors of different lengths and k-means cannot be performed. This means that larger images will have larger blocks. The resulting histograms will contain information from more gradients, so they aren’t scale-invariant at this stage. However, applying the histogram normalization makes the final descriptor (approximately) scale-invariant.
Yet, if you are not sure whether the histogram normalization is performed correctly, you can extract the descriptor for an image and for its resized version and compare them.
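A small arithmetic sketch of the "same number of blocks" rule. It assumes a simplified HOG layout with a fixed 8 x 8 grid of cells and 9 orientation bins per cell, and it ignores the grouping of cells into overlapping blocks; the function name hog_layout and these numbers are assumptions for illustration, not the settings of any particular library.

```python
def hog_layout(patch_size, cells_per_side=8, bins=9):
    """Cell size and descriptor length when the cell grid is kept fixed."""
    # The cell size scales with the patch so the grid stays 8 x 8.
    cell_size = patch_size // cells_per_side
    descriptor_length = cells_per_side * cells_per_side * bins
    return cell_size, descriptor_length

layout_64 = hog_layout(64)     # cells of 8 x 8 pixels
layout_96 = hog_layout(96)     # cells of 12 x 12 pixels
layout_128 = hog_layout(128)   # cells of 16 x 16 pixels
```

Only the cell size changes with the patch size; the descriptor length stays the same, so the descriptors can all be fed to the same k-means.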
Good luck!

Ways to improve Image Pixel Classification

Here is the problem we are trying to solve:
Goal is to classify pixels of a colored image into 3 different classes.
We have a set of manually classified data for training purposes
Pixels are almost uncorrelated with each other (each has individual behaviour), so most likely classification is done on each individual pixel, based on its individual features.
The 3 classes can be mapped approximately to the RED, YELLOW and BLACK color families.
We need the system to be semi-automatic, i.e. to have 3 parameters that control the probability of the 3 outcomes (for final fine-tuning).
Having this in mind:
Which classification technique will you choose?
What pixel features will you use for classification (RGB, Ycc, HSV, etc) ?
What modification functions would you choose for fine-tuning between the three outcomes?
My first try was based on
Naive Bayes classifier
HSV (also tried RGB and Ycc)
(failed to find proper functions for fine-tuning)
Any suggestion?
Thanks
For each pixel in the image, try using the histogram of colors in the n x n window around that pixel as its features. For general-purpose color matching under varied lighting conditions, I have had good luck with two-dimensional histograms of hue and saturation with a relatively small number of bins along each dimension. Depending upon your lighting consistency, it might make sense for you to use the RGB values directly.
As for the classifier, the manual-tuning requirement is most easily expressed using class weights: parameters that specify the relative costs of false negatives versus false positives. I have only used this functionality with SVMs, but I'm sure you can find implementations of other classifiers that support a similar concept.
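A minimal sketch of the window feature described above, using only the standard library. The function name hue_sat_histogram, the 8 x 2 bin layout and the 3 x 3 window are assumptions chosen for illustration. The value (brightness) channel is dropped, as in the hue/saturation histograms mentioned above.

```python
import colorsys

def hue_sat_histogram(image, cx, cy, n=3, h_bins=8, s_bins=2):
    """Flattened 2D hue/saturation histogram of the n x n window at (cx, cy).

    image: 2D list of (r, g, b) tuples with components in 0..255.
    Window pixels that fall outside the image are skipped.
    """
    half = n // 2
    hist = [0.0] * (h_bins * s_bins)
    count = 0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            if 0 <= y < len(image) and 0 <= x < len(image[0]):
                r, g, b = image[y][x]
                h, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
                hi = min(int(h * h_bins), h_bins - 1)
                si = min(int(s * s_bins), s_bins - 1)
                hist[hi * s_bins + si] += 1
                count += 1
    # Normalise so that border windows with fewer pixels stay comparable.
    return [v / count for v in hist]

red_image = [[(255, 0, 0)] * 3 for _ in range(3)]
yellow_image = [[(255, 255, 0)] * 3 for _ in range(3)]
red_features = hue_sat_histogram(red_image, 1, 1)
yellow_features = hue_sat_histogram(yellow_image, 1, 1)
corner_features = hue_sat_histogram(red_image, 0, 0)
```

Each pixel then gets a feature vector of length h_bins * s_bins, which can be passed to whichever classifier you choose.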

Resources