Image size consideration for Haar cascades - opencv

The OpenCV Haar cascade classifier seems to use 24x24 images of faces as its positive training data. I have two questions regarding this:
What are the consideration that go into selecting the training image size, besides the fact that larger training images require more processing?
For non-square images, some people have chosen to keep one dimension at 24px, and expand the other dimension as necessary (to, say 100-200px). Is this the correct strategy?
How does one go about deciding the size of the training images (this is a variant of question 1)

I honestly believe that there are far better parameters to be tweaked than the image size. Even so, it's a question of fine-to-coarse detection - at finer levels, you gain detail and at coarser levels, you gain structure. Also, there is a trade off: with 24x24 detection regions, there are about ~160,000 possible rectangular (haar-like) features, so increasing or decreasing also affects this number for both training/testing (this is why boosting is used to select a small subset of discriminative features).
As you said, this is because his target was different (i.e. a pen). I think it is sensible to introduce a priori aspect ratio information to the cascade training, otherwise you would be getting detections that have square bounding boxes for a pen detector and probably suffer in performance because the training stage is picking up a larger background region around the pen.
See my first answer. I think this is largely empirical. There are techniques for either feature scaling or building image pyramids (e.g. see this work) that also mitigate the usefulness of highly controlling the choice of training target image sizes too.

Related

TensorFlow for image recognition, size of images

How can size of an image effect training the model for this task?
My current training set holds images that are 2880 X 1800, but I am worried this may be too large to train. In total my sample size will be about 200-500 images.
Would this just mean that I need more resources (GPU,RAM, Distribution) when training my model?
If this is too large, how should I go about resizing? -- I want to mimic real-world photo resolutions as best as possible for better accuracy.
Edit:
I would also be using TFRecord format for the image files
Your memory and processing requirements will be proportional to the pixel size of your image. Whether this is too large for you to process efficiently will depend on your hardware constraints and the time you have available.
With regards to resizing the images there is no one answer, you have to consider how to best preserve information that'll be required for your algorithm to learn from your data while removing information that won't be useful. Reducing the size of your input images won't necessarily be a negative for accuracy. Consider two cases:
Handwritten digits
Here the images could be reduced considerably in size and maintain all the structural information necessary to be correctly identified. Have a look at the MNIST data set, these images are distributed at 28 x 28 resolution and identifiable to 99.7%+ accuracy.
Identifying Tree Species
Imagine a set of images of trees where individual leaves could help identify species. Here you might find that reducing the image size reduces small scale detail on leaf shape in a way that's detrimental to the model, but you might find that you get a similar result with a tight crop (which preserves individual leaves) rather than an image resize. If this is the case you may find that creating multiple crops from the same image gives you an augmented data set for training that considerably improves results (which is something to consider, if possible, given your training set is very small)
Deep learning models are achieving results around human level in many image classification tasks: if you struggle to identify your own images then it's less likely you'll train an algorithm to. This is often a useful starting point when considering the level of scaling that might be appropriate.
If you are using GPUs to train, this will def affect your training time. Tensorflow does most of the GPU allocation so you don't have to worry about that. But with big photos you will be experiencing long training time although your dataset is small. You should consider data-augmentation.
You could complement your resizing with the data-augmentation. Resize in equal dimensions and then perform reflection and translation (as in geometric movement)
If your images are too big, your GPU might run out of memory before it can start training because it has to store the convolution outputs on its memory. If that happens, you can do some of the following things to reduce memory consumption:
resize the image
reduce batch size
reduce model complexity
To resize your image, there are many scripts just one Google search away, but I will add that in your case 1440 by 900 is probably a sweet spot.
Higher resolution images will result in a higher training time and an increased memory consumption (mainly GPU memory).
Depending on your concrete task, you might want to reduce the image size in order to therefore fit a reasonable batch size of let's say 32 or 64 on the GPU - for stable learning.
Your accuracy is probably affected more by the size of your training set. So instead of going for image size, you might want to go for 500-1000 sample images. Recent publications like SSD - Single Shot MultiBox Detector achieve high accuracy values like an mAP of 72% on the PascalVOC dataset - with "only" using 300x300 image resolution.
Resizing and augmentation: SSD for instance just scales every input image down to 300x300, independent of the aspect ratio - does not seem to hurt. You could also augment your data by mirroring, translating, ... etc (but I assume there are built-in methods in Tensorflow for that).

semantic segmentation for large images

I am working on a limited number of large size images, each of which can have 3072*3072 pixels. To train a semantic segmentation model using FCN or U-net, I construct a large sample of training sets, each training image is 128*128.
In the prediction stage, what I do is to cut a large image into small pieces, the same as trainning set of 128*128, and feed these small pieces into the trained model, get the predicted mask. Afterwards, I just stitch these small patches together to get the mask for the whole image. Is this the right mechanism to perform the semantic segmentation against the large images?
Your solution is often used for this kind of problem. However, I would argue that it depends on the data if it truly makes sense. Let me give you two examples you can still find on kaggle.
If you wanted to mask certain parts of satellite images, you would probably get away with this approach without a drop in accuracy. These images are highly repetitive and there's likely no correlation between the segmented area and where in the original image it was taken from.
If you wanted to segment a car from its background, it wouldn't be desirable to break it into patches. Over several layers the network will learn the global distribution of a car in the frame. It's very likely that the mask is positive in the middle and negative in the corners of the image.
Since you didn't give any specifics what you're trying to solve, I can only give a general recommendation: Try to keep the input images as large as your hardware allows. In many situation I would rather downsample the original images than breaking it down into patches.
Concerning the recommendation of curio1729, I can only advise against training on small patches and testing on the original images. While it's technically possible thanks to fully convolutional networks, you're changing the data to an extend, that might very likely hurt performance. CNNs are known for their extraction of local features, but there's a large amount of global information that is learned over the abstraction of multiple layers.
Input image data:
I would not advice feeding the big image (3072x3072) directly into the caffe.
Batch of small images will fit better into the memory and parallel programming will too come into play.
Data Augmentation will also be feasible.
Output for big Image:
As for the output of big Image, you better recast the input size of FCN to 3072x3072 during test phase. Because, layers of FCN can accept inputs of any size.
Then you will get 3072x3072 segmented image as output.

Rescaling input to CNNs

What is the general consensus on rescaling images that have different sizes? I have read that one approach is to rescale the largest size of an image to a fixed size. It's not clear to me how only rescaling one of the dimensions would lead to uniform image shapes across the dataset.
Are there other approaches, e.g. would it work to take the average size of the two dimensions and then rescale the dimensions of each image to the mean of each dimension across the dataset?
Is it important which interpolation method is used in the rescaling?
Would it make sense to simply take an nxm part of each image and cut off the rest of each image?
Is there a list of approaches people have used and how they perform in different scenarios.
Depends on the target application of the CNNs. For object detection/classification usually a sliding window approach or cropping is used. For the first option, sliding window is moved around the image and for every patch (with different overlapping criterion) a prediction is made. This predictions are then filtered with other pooling or filter strategies.
For image segmentation (aka semantic segmentation), similar approaches are used. 1) image scaling + segmenting + scaling back to its original size. 2) different image patches + segmentation of each, or 3) sliding window segmentation + maxpooling. With the option (3) each pixel has a N = HxW votes (where N is the size of the sliding window). This N predictions are then aggregated into a maxixmum-voting classifier (similar to ensemble models on Random Forest and other classifiers).
So, in short, I believe there is no short nor unique answer to this question. The decision you take will depend in the goal you try to achieve with the CNN, and of course, the quality of your approach will have an impact in the performance of the CNN. I don't know about any study of this kind though.

Haar classifiers for object with varied poses

I'm trying to perform image recognition of large birds, and since the camera is moving, my usual tactic of background removal and shape recognition won't be effective. Since all adult birds of the species are remarkably similar in coloration, I was thinking that Haar classifiers may be effective, and found some resources to try and train my own classifier.
Negative images should be significantly larger than the positive images, and present similar environments but not contain the positive image. However, I haven't been able to find too many details on what makes for a good positive training image. I found some references to trying to keep a similar aspect ratio in all positive training images, but for something like a bird that can change dramatically in different poses (wings open, wings closed), is it crucial? Is it better to train multiple classifiers for each pose and orientation of the target to be recognized? Is it better to instead try to identify a subset of the object that is relatively consistent (like the bird's very distinctive black and white head)?
If I use a sub-feature like the head, does it matter if it is "mirrored" in some views? Should I artificially make my positive images face the same direction, and when running the classifier on an image, run it twice, once mirrored? What are the considerations made when designing the positive image set for a classifier?

Using flipped images for machine learning dataset

I'v got a binary classification problem. I'm trying to train a neural network to recognize objects from images. Currently I've about 1500 50x50 images.
The question is whether extending my current training set by the same images flipped horizontally is a good idea or not? (images are not symetric)
Thanks
I think you can do this to a much larger extent, not just flipping the images horizontally, but changing the angle of the image by 1 degree. This will result in 360 samples for every instance that you have in your training set. Depending on how fast your algorithm is, this may be a pretty good way to ensure that the algorithm isn't only trained to recognize images and their mirrors.
It's possible that it's a good idea, but then again, I don't know what's the goal or the domain of the image recognition. Let's say the images contain characters and you're asking the image recognition software to determine if an image contains a forward slash / or a back slash \ then flipping the image will make your training data useless. If your domain doesn't suffer from such issues, then I'd think it's a good idea to flip them and even rotate with varying degrees.
I have used flipped images in AdaBoost with great success in the course: http://www.csc.kth.se/utbildning/kth/kurser/DD2427/bik12/Schedule.php
from the zip "TrainingImages.tar.gz".
I know there are some information on pros/cons with using flipped images somewhere in the slides (at the homepage) but I can't find it. Also a great resource is http://www.csc.kth.se/utbildning/kth/kurser/DD2427/bik12/DownloadMaterial/FaceLab/Manual.pdf (together with the slides) going thru things like finding things in different scales and orientation.
If the images patches are not symmetric I don't think its a good idea to flip. Better idea is to do some similarity transforms to the training set with some limits. Another way to increase the dataset is to add gaussian smoothed templates to it. Make sure that the number of positive and negative samples are proportional. Too many positive and too less negative might skew the classifier and give bad performance on testing set.
It depends on what your NN is based on. If you are extracting rotation invariant features or features that do not depend on the spatial position within the the image (like histograms or whatever) and train your NN with these features, then rotating will not be a good idea.
If you are training directly on pixel values, then it might be a good idea.
Some more details might be useful.

Resources