Suppose we calculate HOG features of image patches of different sizes, ranging from 64 x 64 to 128 x 128. Now, if we want to do k-means on these, should we normalize the patches that belong to different scales? I know HOG features are normalized, but does scale matter?

Normally, HOG representations are normalized. However, you must be careful with the block size. You need the same number of blocks whatever the size of the image; otherwise you obtain descriptors of different lengths and k-means cannot be performed. This means that larger images will have larger blocks. The resulting histograms accumulate more gradients, so they aren’t scale invariant at this stage. Normalizing the histograms removes that dependence on the number of gradients, which makes the final descriptor approximately scale invariant.
If you are not sure whether the histogram normalization is performed correctly, you can extract the descriptor for an image and for a resized version of it and compare the two.
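As a rough illustration of that check, here is a minimal NumPy sketch. It is not a full HOG implementation: `hog_like_descriptor` and `synthetic_patch` are hypothetical helpers, the grid is fixed at 4 x 4 cells with 9 orientation bins, and a single L2 normalization over the whole descriptor stands in for HOG's per-block normalization. The same synthetic pattern is sampled at 64 x 64 and 128 x 128 and the two descriptors are compared.

```python
import numpy as np

def hog_like_descriptor(image, grid=4, n_bins=9):
    """Orientation histograms on a fixed grid x grid layout (simplified, HOG-like)."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % np.pi  # unsigned orientation in [0, pi)
    bins = np.minimum((angle / np.pi * n_bins).astype(int), n_bins - 1)

    # Same number of cells for every image size, so the cells grow with the image.
    rows = np.linspace(0, image.shape[0], grid + 1).round().astype(int)
    cols = np.linspace(0, image.shape[1], grid + 1).round().astype(int)
    cells = []
    for r0, r1 in zip(rows[:-1], rows[1:]):
        for c0, c1 in zip(cols[:-1], cols[1:]):
            # Magnitude-weighted orientation histogram of one cell.
            hist = np.bincount(bins[r0:r1, c0:c1].ravel(),
                               weights=magnitude[r0:r1, c0:c1].ravel(),
                               minlength=n_bins)
            cells.append(hist)
    descriptor = np.concatenate(cells)
    # Assumption: one global L2 normalization instead of per-block normalization.
    return descriptor / (np.linalg.norm(descriptor) + 1e-12)

def synthetic_patch(size):
    """Sample the same smooth pattern at a given resolution (stand-in for resizing)."""
    t = (np.arange(size) + 0.5) / size
    x, y = np.meshgrid(t, t)
    return np.sin(2 * np.pi * (2 * x + y)) + 0.5 * np.cos(2 * np.pi * (x - 3 * y))

d_small = hog_like_descriptor(synthetic_patch(64))
d_large = hog_like_descriptor(synthetic_patch(128))
similarity = float(d_small @ d_large)  # cosine similarity, both are unit length
print(d_small.shape, d_large.shape, round(similarity, 3))
```

Both descriptors have the same length, so they can go into the same k-means run, and a cosine similarity close to 1 suggests the normalization is doing its job.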
How can I normalize two histograms in OpenCV?

I'm working on a CBIR image processing project and I need to normalize two histograms such that their values are on the same scale. I'm not quite sure if normalizing is the right term for it though.
Here's what I'm trying to accomplish.
I am slicing each image (single channel grayscale) into an 8x8 grid. Therefore, for some images the blocks of the grid may be larger and for others smaller. But for each block, I'm extracting a 256 bin histogram as a feature for storage and comparison against other features later, ergo all images have equal sized feature descriptors, regardless of size differences.
As it stands, this doesn't scale properly: an image with larger blocks has higher counts in each bin, even though it could simply be a 2x upscaled copy of the same image.
Because of this, I want to store the histograms such that each bin value is the percentage of the block's pixels that fall into that bin. That way, I would be able to compare histograms from both images on the same scale.
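One way to sketch this, assuming an 8-bit single-channel image held in a NumPy array: divide each block's histogram by the number of pixels in that block, so every histogram sums to 1. The helper name `grid_histograms` is made up for this example. In OpenCV, `cv2.normalize` with `norm_type=cv2.NORM_L1` should give the same effect on a single histogram, but the sketch below sticks to NumPy.

```python
import numpy as np

def grid_histograms(image, grid=8, n_bins=256):
    """Per-block histograms, each divided by the block's pixel count."""
    rows = np.linspace(0, image.shape[0], grid + 1).round().astype(int)
    cols = np.linspace(0, image.shape[1], grid + 1).round().astype(int)
    features = np.empty((grid, grid, n_bins))
    for i in range(grid):
        for j in range(grid):
            block = image[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
            counts = np.bincount(block.ravel(), minlength=n_bins)
            features[i, j] = counts / block.size  # fraction of pixels per bin
    return features

rng = np.random.default_rng(0)
small = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
# 2x nearest-neighbour upscale: every pixel becomes a 2x2 patch.
large = np.kron(small, np.ones((2, 2), dtype=np.uint8))

f_small = grid_histograms(small)
f_large = grid_histograms(large)
print(f_small.shape, np.allclose(f_small, f_large))
```

The raw counts of the upscaled image are four times larger, but after dividing by the block size the two feature sets match, which is the behaviour asked for.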
Why is image segmentation needed for object detection?

What is the point of image segmentation algorithms like SLIC? Most object detection algorithms work over the entire set of (square) sub-images anyway.
The only conceivable benefit to segmenting the image is that now, the classifier has shape information available to it. Is that right?
Most classifiers I know of take rectangular input images. What classifiers allow you to pass variable sized image segments to them?
First, SLIC, and the kind of algorithms I'm guessing you refer to, are not segmentation algorithms; they are oversegmentation algorithms. There is a difference between those two terms: segmentation methods split the image into objects, while oversegmentation methods split the image into small clusters (spatially adjacent groups of pixels with similar characteristics). These clusters are usually called superpixels. See the image of superpixels below:
Now, answering parts of your original question:
Why use superpixels?
They reduce the dimensionality of your data and the complexity of the problem: from N = w x h pixels to M superpixels, with M << N. That is, for an image composed of N = 320 x 480 = 153600 pixels, M = 400 or M = 800 superpixels seem to be enough to oversegment it. Leaving for later how to classify them, just consider how much easier your problem becomes when you go from roughly 150K pixels to M = 800 superpixels as the examples to train on and classify. The superpixels still represent your data properly, as they adhere to image boundaries.
They allow you to compute more complex features. With pixels, you can only compute some simple statistics on them, or use a filter bank or feature extractor to extract some features in their vicinity. This, however, represents a pixel's neighbourhood very locally, without taking the context into consideration. With superpixels, you can compute a superpixel descriptor from all the pixels that belong to it. That is, features are usually computed at pixel level as before, but are then merged into a superpixel descriptor by different methods. Some of the methods to do that are: simple mean of all pixels inside a superpixel, histograms, bag-of-words, correlation. As a simple example, imagine you only consider grayscale as a feature for your image/classifier. If you use single pixels, all you have is the pixel's intensity, which is very local and noisy. If you use superpixels, you can compute a histogram of the intensities of all the pixels inside, which describes the region much better than a single local intensity value.
They allow you to compute new features. Over superpixels you can compute regional statistics (first-order, such as mean or variance, or second-order, such as covariance). You can now also extract information that was not available before (e.g. shape, length, diameter, area...).
How to use them?
1. Compute pixel features
2. Merge pixel features into superpixel descriptors
3. Classify/Optimize superpixel descriptors
In step 2, whether by averaging, using a histogram or using a bag-of-words model, the superpixel descriptor is computed with a fixed size (e.g. 100 bins for a histogram). Therefore, at the end you have reduced the X = N x K training data (N = 100K pixels times K features) to X = M x D (with M = 1K superpixels and D the length of the superpixel descriptor). Usually D > K but M << N, so you end up with regional, more robust features that represent your data better with lower data dimensionality. That is great, and it reduces the complexity of your problem (classify, optimize) by 2-3 orders of magnitude on average!
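Step 2 can be sketched in NumPy as follows. This assumes you already have a label map that assigns every pixel to a superpixel (for instance from a SLIC implementation). Here a regular 10 x 10 grid stands in for it, and the helper names `mean_descriptors` and `histogram_descriptors` are made up for the example.

```python
import numpy as np

def mean_descriptors(features, labels, n_segments):
    """Average the K pixel features inside each superpixel -> (M, K)."""
    flat_labels = labels.ravel()
    flat_features = features.reshape(-1, features.shape[-1])
    sums = np.zeros((n_segments, flat_features.shape[1]))
    np.add.at(sums, flat_labels, flat_features)  # accumulate per superpixel
    counts = np.bincount(flat_labels, minlength=n_segments)
    return sums / counts[:, None]  # assumes no superpixel is empty

def histogram_descriptors(intensity, labels, n_segments, n_bins=16):
    """Normalized intensity histogram per superpixel -> (M, n_bins)."""
    # Intensities are assumed to lie in [0, 1].
    bin_idx = np.minimum((intensity * n_bins).astype(int), n_bins - 1).ravel()
    hist = np.zeros((n_segments, n_bins))
    np.add.at(hist, (labels.ravel(), bin_idx), 1)
    return hist / hist.sum(axis=1, keepdims=True)

h = w = 40
# Stand-in label map: 16 square "superpixels" of 10 x 10 pixels each.
labels = (np.arange(h)[:, None] // 10) * 4 + (np.arange(w)[None, :] // 10)
rng = np.random.default_rng(0)
features = rng.random((h, w, 3))  # K = 3 features per pixel

mean_desc = mean_descriptors(features, labels, 16)
hist_desc = histogram_descriptors(features[..., 0], labels, 16)
print(features.reshape(-1, 3).shape, "->", mean_desc.shape, hist_desc.shape)
```

The 1600 x 3 pixel matrix shrinks to 16 rows, one per superpixel, and every row has the same length regardless of the superpixel's shape or size, which is what lets you feed it to an ordinary classifier.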
You can compute more complex (robust) features, but you have to be careful about how, when and what for you use superpixels as your data representation. You might lose some information (e.g. you lose the regular 2D grid lattice), or, if you don't have enough training examples, you might make the problem more difficult, because the features are more complex and you could turn a linearly separable problem into a non-linear one.
What is the difference between Binning and sub-sampling in Image Signal Processing?

As far as I know, there are several functions in a CMOS image sensor's ISP (Image Signal Processor).
Specifically, I'd like to know the difference between binning and sub-sampling. I think both serve the same purpose: reducing the image size.
However, I'm not sure why both functions exist.
What is their purpose?
Binning and sub-sampling both reduce the image size, as you suspected, but they focus on different things. Let's tackle each one separately.
Binning in image processing deals primarily with quantization. The closest thing I can think of is what is known as data binning. Basically, consider breaking up your image into distinct (non-overlapping) M x N tiles, where M and N are the rows and columns of a tile and should be much smaller than the rows and columns of the image.
For any such tile of M x N pixels, all of the pixels get replaced with a representative colour. This representative colour can be calculated in many ways; the average is a popular choice. Binning is performed primarily as a data pre-processing technique to reduce the effects of minor observation errors. This effectively reduces the amount of information that represents the image, and so it reduces the image size by reducing the number of unique colours in the image.
In addition, binning may also reduce the effect of CMOS sensor noise on the final processed image, but at the cost of a lower dynamic range of colours.
Sub-sampling in the case of image processing mostly deals with image resizing. It's also called image scaling. The goal is to take an image and reduce its dimensions so that you get a smaller image as a result. Binning, as described above, keeps the image the same size (i.e. the same dimensions as the original) while reducing the number of colours, which ultimately reduces the amount of space the image takes up. Sub-sampling reduces the image size by removing information altogether. Usually when you sub-sample, you also interpolate or smooth the image so that you reduce aliasing.
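To make the contrast concrete, here is a small NumPy sketch of both operations as they are described above. The function names `bin_tiles` and `subsample` are made up for this example, the tile mean is used as the representative value, and the image dimensions are assumed to be multiples of the tile size.

```python
import numpy as np

def bin_tiles(image, m, n):
    """Replace every pixel of each m x n tile with the tile mean (shape unchanged)."""
    h, w = image.shape  # assumes h % m == 0 and w % n == 0
    tiles = image.reshape(h // m, m, w // n, n)
    means = tiles.mean(axis=(1, 3), keepdims=True)
    return np.broadcast_to(means, tiles.shape).reshape(h, w)

def subsample(image, step):
    """Keep every step-th pixel in both directions (no anti-alias filtering)."""
    return image[::step, ::step]

image = np.arange(16, dtype=float).reshape(4, 4)
binned = bin_tiles(image, 2, 2)
smaller = subsample(image, 2)
print(binned.shape, np.unique(binned).size)  # same dimensions, fewer distinct values
print(smaller.shape)                         # fewer pixels
```

Note that this follows the description given above. If only one value is kept per tile (which, as far as I know, is what binning on a sensor usually does), the output is the array of tile means without the broadcast step, and the dimensions shrink as well.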
Sub-sampling has another application in video processing, especially in MPEG, where video is encoded in YCbCr. Y is the luminance, while Cb and Cr are the chrominance pair. We tend to notice changes in luminance more than changes in chrominance, so the chrominance is sub-sampled to reduce the amount of space taken up by the video. Specifically, the human visual system has poorer acuity for colour information than for luminance / intensity. Usually, the chrominance values are filtered and then sub-sampled to 1/2 or even 1/4 of the resolution of the intensity. Even with a rather high sub-sampling rate, we don't notice any difference in perceived image quality.
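A quick sketch of the arithmetic, assuming the three planes are already separated into NumPy arrays: a 2 x 2 box average stands in for the "filter, then sub-sample" step, which leaves each chroma plane with 1/4 of its samples. The helper name `subsample_chroma` is made up, and real codecs may use different filters and sample positions.

```python
import numpy as np

def subsample_chroma(plane, factor=2):
    """Box-filter and sub-sample one chroma plane by `factor` in each direction."""
    h, w = plane.shape  # assumes both are multiples of factor
    blocks = plane.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

rng = np.random.default_rng(0)
y = rng.random((8, 8))   # luminance stays at full resolution
cb = rng.random((8, 8))
cr = rng.random((8, 8))

cb_small = subsample_chroma(cb)
cr_small = subsample_chroma(cr)

full_samples = y.size + cb.size + cr.size
reduced_samples = y.size + cb_small.size + cr_small.size
print(full_samples, reduced_samples)
```

With both chroma planes reduced to a quarter of their samples, the total sample count is halved while the luminance plane is untouched.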
This is obviously a rather rough introduction on the differences between them both, but I hope this gives you enough of what you're after for your purposes.
Minimum texture image dimension for effective classification

I am a beginner in image mining. I would like to know the minimum dimension required for effective classification of textured images. My feeling is that if an image is too small, the feature extraction step will not extract enough features, and if the image size goes beyond a certain dimension, the processing time will increase exponentially with image size.
This is a complex question that requires a bit of thinking.
Short answer: It depends.
Long answer: It depends on the type of texture you want to classify and the type of feature your classification is based on. If the extracted feature is, say, colour only, you can use a "texture" as small as 1x1 pixel (in that case, using the word "texture" is a bit of an abuse). If you want to classify, for example, characters, you can usually extract a lot of local information from edges (Hough transform, Gabor filters, etc.). The image plane just has to be big enough to hold the characters (say 16x16 pixels for the Latin alphabet).
If you want to be able to classify arbitrary images in arbitrary numbers, you can also base your classification on global information, like entropy, correlogram, energy, inertia, cluster shade, cluster prominence, colour and correlation. Those features are used for content-based image retrieval.
Gaussian blur and FFT

I'm trying to implement Gaussian blur for a school project.
I need to make both a CPU and a GPU implementation to compare performance.
I am not quite sure that I understand how Gaussian blur works, so one of my questions is
whether I have understood it correctly.
Here's what I do now:
I use the equation from Wikipedia http://en.wikipedia.org/wiki/Gaussian_blur to calculate
the filter.
For 2D, I take the RGB values of each pixel in the image and apply the filter by
multiplying the RGB values of the pixel and its surrounding pixels with the filter value at the associated position.
These products are then summed to give the new RGB values of the pixel.
For 1D, I apply the filter first horizontally and then vertically, which should give
the same result if I understand things correctly.
Is this result exactly the same as when the 2D filter is applied?
Another question I have is about how the algorithm can be optimized.
I have read that the Fast Fourier Transform is applicable to Gaussian blur.
But I can't figure out how to relate it.
Can someone give me a hint in the right direction?
Yes, the 2D Gaussian kernel is separable so you can just apply it as two 1D kernels. Note that you can't apply these operations "in place" however - you need at least one temporary buffer to store the result of the first 1D pass.
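If you want to convince yourself, here is a small NumPy check on a single channel. It assumes zero padding at the borders, sigma = 1 and a radius of 3, and the helper names are made up for the example. The `horizontal` array plays the role of the temporary buffer.

```python
import numpy as np

def gaussian_kernel_1d(sigma=1.0, radius=3):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()  # normalize so the blur preserves overall brightness

def blur_2d_direct(image, kernel2d):
    """Direct 2D filtering with zero padding (the kernel is symmetric, so no flip is needed)."""
    kh, kw = kernel2d.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(image.shape)
    for i in range(kh):
        for j in range(kw):
            out += kernel2d[i, j] * padded[i:i + image.shape[0], j:j + image.shape[1]]
    return out

def blur_separable(image, kernel1d):
    """Two 1D passes: rows first (into a temporary buffer), then columns."""
    horizontal = np.apply_along_axis(np.convolve, 1, image, kernel1d, mode="same")
    return np.apply_along_axis(np.convolve, 0, horizontal, kernel1d, mode="same")

rng = np.random.default_rng(0)
channel = rng.random((32, 32))  # one colour channel; repeat per channel for RGB
k1 = gaussian_kernel_1d()
k2 = np.outer(k1, k1)           # the 2D Gaussian is the outer product of two 1D Gaussians

direct = blur_2d_direct(channel, k2)
separable = blur_separable(channel, k1)
print(np.allclose(direct, separable))
```

For a kernel of width k, the direct version does k x k multiplications per pixel while the two passes do 2k, which is why the separable form is usually the first optimization to try, on the CPU and on the GPU.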
FFT-based convolution is a useful optimisation when you have large kernels - this applies to any kind of filter, not just Gaussian. Just how big "large" is depends on your architecture, but you probably don't want to worry about using an FFT-based approach for anything smaller than, say, a 49x49 kernel. The general approach is:
FFT the image
FFT the kernel, padded to the size of the image
multiply the two in the frequency domain (equivalent to convolution in the spatial domain)
IFFT (inverse FFT) the result
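These steps can be sketched with `numpy.fft` as below. Two things are worth noting, both of which are choices made for this example: the padded kernel is rolled so that its centre sits at index (0, 0), otherwise the output comes out shifted, and the product of the two FFTs gives a circular convolution, so the image wraps around at the borders unless you pad it first. The helper names are made up, and the direct version is only there as a reference to compare against.

```python
import numpy as np

def fft_convolve_circular(image, kernel):
    """Circular convolution via the FFT; kernel is padded to the image size."""
    kh, kw = kernel.shape
    padded = np.zeros(image.shape)
    padded[:kh, :kw] = kernel
    # Move the kernel centre to the origin so the result is not shifted.
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    spectrum = np.fft.fft2(image) * np.fft.fft2(padded)
    return np.real(np.fft.ifft2(spectrum))

def direct_convolve_circular(image, kernel):
    """Reference: spatial-domain convolution with wrap-around borders."""
    kh, kw = kernel.shape
    wrapped = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="wrap")
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    out = np.zeros(image.shape)
    for i in range(kh):
        for j in range(kw):
            out += flipped[i, j] * wrapped[i:i + image.shape[0], j:j + image.shape[1]]
    return out

x = np.arange(-3, 4)
k1 = np.exp(-(x ** 2) / 2.0)
k1 /= k1.sum()
kernel = np.outer(k1, k1)  # 7x7 Gaussian, sigma = 1

rng = np.random.default_rng(0)
image = rng.random((32, 32))

via_fft = fft_convolve_circular(image, kernel)
via_direct = direct_convolve_circular(image, kernel)
print(np.allclose(via_fft, via_direct))
```

The cost of the FFT route does not depend on the kernel size, which is why it only pays off for large kernels. For a small Gaussian like this one, the separable 1D passes will normally be faster.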
