What is the general consensus on rescaling images that have different sizes? I have read that one approach is to rescale the largest size of an image to a fixed size. It's not clear to me how only rescaling one of the dimensions would lead to uniform image shapes across the dataset.
Are there other approaches, e.g. would it work to take the average size of the two dimensions and then rescale the dimensions of each image to the mean of each dimension across the dataset?
Is it important which interpolation method is used in the rescaling?
Would it make sense to simply take an nxm part of each image and cut off the rest of each image?
Is there a list of approaches people have used, and how they perform in different scenarios?
It depends on the target application of the CNN. For object detection/classification, a sliding window approach or cropping is usually used. In the sliding window approach, a window is moved across the image and a prediction is made for every patch (with different overlap criteria). These predictions are then filtered with pooling or other filtering strategies.
For image segmentation (aka semantic segmentation), similar approaches are used: 1) image scaling + segmenting + scaling back to the original size, 2) different image patches + segmentation of each, or 3) sliding window segmentation + max pooling. With option (3), each pixel gets N = H x W votes (where H x W is the size of the sliding window). These N predictions are then aggregated by a maximum-voting classifier (similar to ensembles such as Random Forests and other classifiers).
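To make option (3) concrete, here is a minimal sketch of sliding-window voting, assuming a hypothetical per-patch predictor predict_patch that returns a vector of class scores for a fixed-size window (the window size and stride below are arbitrary choices):

import numpy as np

def sliding_window_segment(image, predict_patch, win=64, stride=32):
    """Accumulate per-patch class scores over overlapping windows, then label each
    pixel by the class with the most (averaged) votes."""
    h, w = image.shape[:2]
    votes, counts = None, np.zeros((h, w), dtype=np.float64)
    for r in range(0, h - win + 1, stride):
        for c in range(0, w - win + 1, stride):
            scores = np.asarray(predict_patch(image[r:r + win, c:c + win]))
            if votes is None:
                votes = np.zeros((h, w, scores.size), dtype=np.float64)
            votes[r:r + win, c:c + win] += scores   # every pixel in the patch gets a vote
            counts[r:r + win, c:c + win] += 1
    averaged = votes / np.maximum(counts, 1)[..., None]
    return averaged.argmax(axis=-1)                 # per-pixel label from the pooled votes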
So, in short, I believe there is no short nor unique answer to this question. The decision you take will depend on the goal you try to achieve with the CNN, and of course, the quality of your approach will have an impact on the performance of the CNN. I don't know of any study of this kind, though.
I have read in many papers that background removal as a preprocessing step helps reduce the amount of computation. But why is this the case? My understanding is that the CNN works on a rectangular window no matter how it is filled, with zeros or positive values.
See this for an example.
In the paper you provide, it seems that they do not pass the entire image to the network. Instead, they seem to be selecting smaller patches from the non-white background. This makes sense because it reduces the noise in their data, but it also reduces computational complexity, because of the effect it has on fully connected layers.
Suppose the input image is of size h*w. In your CNN, the image passes through a series of convolutions and max-poolings, and as a result, right before the first fully connected layer, you end up with a feature map of size
sz=m*(h/k)*(w/d)
where m is the number of feature planes, and where k and d depend on the number of layers and the parameters of each convolution and max pooling module (e.g. the size of the convolution kernel, etc.). Usually, we'll have d == k. Now, assume that you feed this to a fully connected layer to produce a vector of q parameters. What this layer does is basically a matrix multiplication
A*x
where A is a matrix of size q*sz, and x is just your feature map written as a vector.
Now, assume you pass a patch of size (h/t)*(w/t) to the network. You end up with a feature map of size
sz/(t^2)
Given the size of the images in their dataset, this is a considerable reduction in the number of parameters. Smaller patches also mean larger batches, and that too can accelerate training (better gradient approximation).
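To make the reduction concrete, here is a tiny sketch with made-up numbers; the image size, number of feature planes, downsampling factor and FC width below are assumptions for illustration, not values taken from the paper:

# Hypothetical numbers, only to illustrate the scaling argument above.
h, w = 2048, 2048      # full image size (assumption)
m = 256                # feature planes before the fully connected layer
k = d = 16             # total downsampling factor of the conv/pool stack
q = 1024               # width of the fully connected layer
t = 8                  # patches are (h/t) x (w/t)

sz_full = m * (h // k) * (w // d)                # feature map size, full image
sz_patch = m * (h // (k * t)) * (w // (d * t))   # roughly sz_full / t**2

print("FC weights, full image:", q * sz_full)
print("FC weights, patch:     ", q * sz_patch)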
I hope this helps.
Edit, following #wlnirvana's comment: Yes, the patch size is a hyperparameter. In the example I gave, it is set by choosing t. Given the size of the images in the dataset, I'd say something like t >= 6 would be realistic. As for how this relates to background removal, to quote the paper (section 3.1):
"To reduce computation time and to focus our analysis on regions of the slide most likely to contain cancer metastasis..."
This means that they select patches only around areas that are not background. This makes sense, since passing a completely white patch to the network would just be a waste of time (in figure 1, you can see how many white/gray/useless patches you would get if you selected them randomly, without removing the background). I didn't find any explanation of how patch selection is done in their paper, but something like selecting a number of pixels p_1, ..., p_n in the non-background regions and taking n patches of size (h/t)*(w/t) around each of them would make sense.
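Purely as an illustration (this is my guess at that kind of selection, not the paper's actual procedure), such a step could look like the sketch below; white_threshold, the patch size and the number of patches are all made-up parameters:

import numpy as np

def sample_nonbackground_patches(gray, patch_size=128, n=16, white_threshold=230, rng=None):
    """Pick n patch centers among pixels darker than the (near-white) background
    and cut a patch around each of them."""
    rng = rng or np.random.default_rng()
    half = patch_size // 2
    rows, cols = np.where(gray[half:-half, half:-half] < white_threshold)
    if len(rows) == 0:
        return []
    idx = rng.choice(len(rows), size=min(n, len(rows)), replace=False)
    return [gray[r:r + patch_size, c:c + patch_size]
            for r, c in zip(rows[idx], cols[idx])]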
Let's say I have a dataset of about 350 positive images and more than 400 negative images. They aren't the same size, and they are all bigger than 640x320.
What should I do to create a better dataset? Do I need the images to be smaller? If yes, why?
Should I apply some normalization to the dataset? What should it be (contrast, noise reduction)?
Can I create a bigger dataset using the existing one? If yes, how?
Thanks in advance!
The optimal image size is one at which you can still easily classify the object yourself.
Yes, classifiers work better after normalization, and there are several options. The most popular is to center the dataset (subtract the mean) and normalize the range of values, say to the [-1, 1] range. Another popular way of normalization is similar, but also divides by the standard deviation (preferable in most cases).
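As a minimal sketch of those two options with NumPy (the toy data and the epsilon are mine, not a prescribed recipe):

import numpy as np

# Toy stand-in for a real dataset: 10 RGB images, 64x64 pixels, values in [0, 255].
X = np.random.randint(0, 256, size=(10, 64, 64, 3)).astype(np.float32)

# Option 1: rescale every pixel to the [-1, 1] range.
X_scaled = X / 127.5 - 1.0

# Option 2: center the dataset and divide by its standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0) + 1e-8      # epsilon avoids division by zero
X_standardized = (X - mean) / std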
Yes, you can create a bigger dataset from the existing one by adding distortions and noise to your existing images (data augmentation).
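For example, a minimal augmentation sketch (the flip probability and noise level are arbitrary choices):

import numpy as np

def augment(image, rng=None):
    """Return a randomly distorted copy of `image` (H x W x C, floats in [0, 1])."""
    rng = rng or np.random.default_rng()
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                          # horizontal flip
    out = out + rng.normal(0.0, 0.02, out.shape)       # additive Gaussian noise
    return np.clip(out, 0.0, 1.0)

# Example: several augmented variants of one image.
img = np.random.rand(64, 64, 3)
augmented = [augment(img) for _ in range(5)]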
Have a look at the INRIA dataset and their comments on how they "normalized" their input images for HoG person detection training.
http://pascal.inrialpes.fr/data/human/
One thing that wasn't mentioned yet is that for most detection techniques it isn't enough to collect a set of n images with the desired object "somewhere" within each image. Instead, you should crop each image around the object (with some border).
E.g. for person detection they started from full input images, but then cropped, rescaled (and transformed) the object regions before training.
probably there are some good hints about training in the thesis too:
http://lear.inrialpes.fr/people/dalal/NavneetDalalThesis.pdf
The OpenCV Haar cascade classifier seems to use 24x24 images of faces as its positive training data. I have two questions regarding this:
What are the considerations that go into selecting the training image size, besides the fact that larger training images require more processing?
For non-square images, some people have chosen to keep one dimension at 24px, and expand the other dimension as necessary (to, say 100-200px). Is this the correct strategy?
How does one go about deciding the size of the training images (this is a variant of question 1)?
I honestly believe that there are far better parameters to tweak than the image size. Even so, it's a question of fine-to-coarse detection: at finer levels you gain detail, and at coarser levels you gain structure. There is also a trade-off: with 24x24 detection regions, there are about ~160,000 possible rectangular (Haar-like) features, so increasing or decreasing the region size also affects this number for both training and testing (this is why boosting is used to select a small subset of discriminative features).
As you said, this is because his target was different (i.e. a pen). I think it is sensible to introduce a priori aspect ratio information to the cascade training, otherwise you would be getting detections that have square bounding boxes for a pen detector and probably suffer in performance because the training stage is picking up a larger background region around the pen.
See my first answer. I think this is largely empirical. There are techniques for either feature scaling or building image pyramids (e.g. see this work) that also reduce the need to tightly control the choice of training target image sizes.
Here is the problem we are trying to solve:
Goal is to classify pixels of a colored image into 3 different classes.
We have a set of manually classified data for training purposes
The pixels are almost uncorrelated with each other (each has its own behaviour), so classification is most likely done per pixel, based on its individual features.
The 3 classes can be roughly mapped to the RED, YELLOW and BLACK color families.
We need the system to be semi-automatic, i.e. 3 parameters to control the probability of the presence of the 3 outcomes (for final fine-tuning).
Having this in mind:
Which classification technique would you choose?
What pixel features will you use for classification (RGB, Ycc, HSV, etc) ?
What modification functions would you choose for fine-tuning between the three outcomes?
My first try was based on
Naive Bayes classifier
HSV (also tried RGB and Ycc)
(failed to find proper functions for fine-tuning)
Any suggestion?
Thanks
For each pixel in the image, try using the histogram of colors of the n x n window around that pixel as its features. For general-purpose color matching under varied lighting conditions, I have had good luck using two-dimensional histograms of hue and saturation with a relatively small number of bins along each dimension. Depending on how consistent your lighting is, it might also make sense to use the RGB values directly.
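A minimal sketch of such a feature with OpenCV (the bin counts, window size and input file name are assumptions):

import cv2
import numpy as np

def hs_histogram_feature(bgr_patch, h_bins=8, s_bins=8):
    """2-D hue/saturation histogram of a BGR patch, flattened and normalized."""
    hsv = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [h_bins, s_bins], [0, 180, 0, 256])
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-8)    # normalize so the patch size doesn't matter

# Example: feature for the 15x15 window around pixel (r, c).
img = cv2.imread("image.png")            # hypothetical input file
n, r, c = 15, 100, 100
half = n // 2
patch = img[r - half:r + half + 1, c - half:c + half + 1]
feature = hs_histogram_feature(patch)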
As for the classifier, the manual-tuning requirement is most easily expressed using class weights: parameters that specify the relative costs of false negatives versus false positives. I have only used this functionality with SVMs, but I'm sure you can find implementations of other classifiers that support a similar concept.
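With scikit-learn's SVC, for instance, that could look like the sketch below; the class weights are placeholders meant to be tuned by hand, not recommended values:

from sklearn.svm import SVC

# Raising a class's weight makes the classifier more reluctant to miss that class,
# which gives you the three hand-tunable knobs asked for above.
clf = SVC(kernel="rbf", class_weight={0: 1.0, 1: 2.0, 2: 0.5})
# clf.fit(features, labels)   # features: per-pixel histograms, labels: 0/1/2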
OpenCV has a handy cvEqualizeHist() function that works great on faded/low-contrast images.
However, when an already high-contrast image is given, the result is a low-contrast one. I understand why: the histogram gets redistributed evenly across the range.
Question is - how do I get to know the difference between a low-contrast and a high-contrast image?
I'm operating on grayscale images and setting their contrast properly so that thresholding them won't delete the text I'm supposed to extract (that's a different story).
Suggestions welcome, especially on how to find out if the majority of the pixels in the image are light gray (which means that histogram equalization should be performed).
Please help!
EDIT: Thanks everyone for the many informative answers. The standard deviation calculation was sufficient for my requirements, so I'm taking that as the answer to my query.
You can probably just use a simple statistical measure of the image to determine whether an image has sufficient contrast. The variance of the image would probably be a good starting point. If the variance is below a certain threshold (to be empirically determined) then you can consider it to be "low contrast".
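For instance, a rough sketch (the threshold value and the file name are placeholders; the threshold is to be determined empirically on your own data):

import cv2

gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input file
LOW_CONTRAST_STD = 35.0                                # empirical threshold, placeholder
if gray.std() < LOW_CONTRAST_STD:                      # gray.var() works just as well
    gray = cv2.equalizeHist(gray)                      # equalize only when contrast is low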
If you're adjusting contrast just so you can threshold later on, you may be able to avoid the contrast adjustment step if you set your threshold adaptively using Otsu's method.
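In OpenCV that is essentially a one-liner; a small sketch (the file name is a placeholder):

import cv2

gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)    # hypothetical input file
# Otsu's method derives the threshold from the histogram itself, so a separate
# contrast-adjustment step is often unnecessary.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)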
If you're still interested in finding out the image contrast, then read on.
There are a number of different ways to calculate "contrast". Often, those metrics are applied locally, rather than to the entire image, to make the result more sensitive to image content:
Divide the image into adjacent, non-overlapping neighborhoods.
Pick neighborhood sizes that are approximately the size of the features of your image (e.g. if your main feature is horizontal text, make neighborhoods tall enough to capture 2 lines of text, and just as wide).
Apply the metric to each neighborhood individually
Threshold the metric result to separate low and high variance blocks. This will prevent such things as large, blank areas of page skewing your contrast estimates.
From there, you can use a number of features to determine contrast:
The proportion of high metric blocks to low metric blocks
High metric block mean
Intensity distance between the high and low metric blocks (using means, modes, etc)
This may serve as a better indication of image contrast than global image variance alone. Here's why:
(stddev: 50.6)
(stddev: 7.9)
The two images are equally high in contrast (the grey background is just there to make it obvious they're images), but their standard deviations (and thus variances) are completely different.
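A minimal sketch of the blockwise procedure described above (the block size and std threshold are arbitrary placeholders):

import numpy as np

def blockwise_contrast(gray, block=32, std_threshold=10.0):
    """Split a grayscale image into non-overlapping blocks and summarize their contrast."""
    h, w = gray.shape
    stds, means = [], []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            patch = gray[r:r + block, c:c + block].astype(np.float64)
            stds.append(patch.std())
            means.append(patch.mean())
    stds, means = np.array(stds), np.array(means)
    high = stds > std_threshold                      # high- vs. low-variance blocks
    return {
        "high_block_ratio": float(high.mean()),
        "high_block_std_mean": float(stds[high].mean()) if high.any() else 0.0,
        "mean_gap": float(abs(means[high].mean() - means[~high].mean()))
                    if high.any() and (~high).any() else 0.0,
    }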
Calculate the cumulative histogram of the image.
Fit a linear regression to the cumulative histogram, of the form y(x) = A*x + B.
Calculate the RMSE of real_cumulative_frequency(x) - y(x).
If that RMSE is close to zero, the image is already equalized. (For equalized images the cumulative histogram must be linear.)
Idea is taken from here.
EDIT:
I've illustrated this approach in my blog (C example code included).
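For reference, a minimal NumPy version of the same idea (not the blog's C code); normalizing the cumulative histogram is my addition, so that the "close to zero" check doesn't depend on image size:

import numpy as np

def equalization_rmse(gray):
    """RMSE between the (normalized) cumulative histogram and its best linear fit.
    A value near zero suggests the image is already roughly equalized."""
    hist, _ = np.histogram(gray.ravel(), bins=256, range=(0, 256))
    cum = np.cumsum(hist).astype(np.float64)
    cum /= cum[-1]                         # normalize to [0, 1]
    x = np.arange(256, dtype=np.float64)
    A, B = np.polyfit(x, cum, 1)           # linear regression y(x) = A*x + B
    return float(np.sqrt(np.mean((cum - (A * x + B)) ** 2)))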
There is support for this in skimage: skimage.exposure.is_low_contrast. reference
Example:
>>> import numpy as np
>>> from skimage.exposure import is_low_contrast
>>> image = np.linspace(0, 0.04, 100)
>>> is_low_contrast(image)
True
>>> image[-1] = 1
>>> is_low_contrast(image)
True
>>> is_low_contrast(image, upper_percentile=100)
False