OpenCV positive samples dimensions? - opencv

So I've come across lots of tutorials about OpenCV's haartraining and cascaded training tools. In particular I'm interested in training a car classifier using the createsamples tool but there seem to be conflicting statements all over the place regarding the -w and -h parameters, so I'm confused.
I'm referring to the command:
$ createsamples -info samples.dat -vec samples.vec -w 20 -h 20
I have the following three questions:
1) I understand that the aspect ratio of the positive samples should be the same as the aspect ratio you get from the -w and -h parameters above. But do ALL of the positive samples have to be the same size as well? E.g. I have close to 1000 images. Do all of them have to be the same size after cropping?
2) If it is not the size but the aspect ratio that matters, then how precisely must the aspect ratio of the positive samples match the -w and -h parameters mentioned in the OpenCV tools? I mean, is the classifier very sensitive, so that even a few pixels off here and there would affect its performance? Or would you say that it's safe to work with images as long as they're all approximately the same ratio by eye?
3) I have already cropped several images to the same size. But in trying to make them all the same size, some of them have a bit more background included in the bounding boxes than others, and some have slightly different margins. (For example, see the two images below. The bigger car takes up more of the image, but there's a wider margin around the smaller car.) I'm just wondering if having a collection of images like this is fine, or if it will lower the accuracy of the classifier, so that I should ensure tighter bounding boxes around all objects of interest (in this case, cars)?

First Question: Yes, all the images used for training have to be the same size (at least that was the case the last time I did face detection sample training, and it should be the same here). If I am not wrong, there will be an error if the images are not the same size, but you can try it out and see if time permits.
Second Question: Not really sure what you are asking here, but the classifier is not as sensitive as you think. Say the object of interest is a hand: if the little finger is missing a few pixels due to cropping in some images and the thumb is missing a few pixels in others, the classifier will still be able to detect the hand. So a few pixels missing here and there, or a few background pixels added in, will not affect the classifier much at the end of the day.
Third Question: You should crop the image so that it consists of the car only for the best results. Try to eliminate as much background as possible. I did some research comparing samples with a noisy background, samples with a black background, and cropped samples with minimal background. Cropped samples with minimal background showed the best results in terms of false positives and false negatives, from what I remember.
You can use object marker to do it: http://achuwilson.wordpress.com/2011/02/13/object-detection-using-opencv-using-haartraining/
The tedious way would be to use Paint to resize all the images to the same pixel dimensions after cropping.
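As a rough illustration of keeping every crop at the -w/-h aspect ratio without stretching the object, the sketch below grows a tight bounding box around its centre until the ratio matches. The function name and the example box are hypothetical, and the result is not clamped to the image borders, so a real script would still have to handle boxes that run off the edge.

```python
def fit_box_to_aspect(x, y, w, h, target_w=20, target_h=20):
    """Grow a tight box (x, y, w, h) around its centre until w/h equals target_w/target_h.

    Hypothetical helper: target_w/target_h stand for the -w/-h values passed to
    createsamples. The returned box may extend past the image borders.
    """
    target_ratio = target_w / target_h
    if w / h < target_ratio:
        # Box is too narrow for the target ratio: widen it.
        new_w = round(h * target_ratio)
        x -= (new_w - w) // 2
        w = new_w
    else:
        # Box is too flat (or already correct): make it taller.
        new_h = round(w / target_ratio)
        y -= (new_h - h) // 2
        h = new_h
    return x, y, w, h


# A 40x20 box around a car becomes a 40x40 box for a square 20x20 sample.
print(fit_box_to_aspect(10, 10, 40, 20))
```

Growing the box adds some background instead of distorting the car, which trades a slightly looser crop for an undistorted object.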
This link should also answer your question: http://coding-robin.de/2013/07/22/train-your-own-opencv-haar-classifier.html
I also agree with GilLevi that there are much better detection methods than Haar, HoG and LBP cascades. Training can take days (depending on the number of images). If you really have to use the cascade methods and you are looking to minimise training time, note that training with Haar-like features takes much longer than with HoG or LBP. But results-wise, I am not really sure which will give better performance and robustness.
Hope my answer helped you. Should there be more questions, do comment.

Related

Haar Classifier positive image set clarification

Could you please help understand several points related to Haar Classifier training:
1) Should positive image contain only the training object or they can contain some other objects in it? Like I want to recognize some traffic sign, should the positive image contain only traffic sign or it can contain highway also?
2) There are 2 ways of creating samples vector file, one is using info file, which contains the detected object coordinates in positive image, another just giving the list of positives and negatives. Which one is better?
3) How do you usually create the info file, which contains the object coordinates in each positive image? Can image clipper generate object coordinates?
And does dlib's histogram of oriented gradients (HOG) detector provide better results than a Haar classifier?
My target is traffic sign detection in raspberry pi.
Thanks
The positive sample (not necessarily the image) should contain only the object. Sometimes it is not possible to get the right aspect ratio for each positive sample; then you would either add some background or crop some of the object boundary. The final detector will detect regions with the aspect ratio of your positive samples, so if you use a lot of background around all of your positive samples, your final detector will probably not detect the region of your traffic sign, but a region with a lot of background around your traffic sign.
AFAIK, the positive samples must be provided in a .vec file which is created with opencv_createsamples.exe, and you'll need a file with the description (where in the images are your positive samples?). I typically preprocess my labeled training samples and crop away all the background, so that there are only intermediate images where the positive sample fills the whole image and the image already has the right aspect ratio. I fill a text file with basically "folder/filename.png 1 0 0 width height" for each of those intermediate images (the 1 is the number of objects in the image) and then create a .vec file from those intermediate images. But the other way, using real ROI information from full-size images, should give the same quality.
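A minimal sketch of writing such a description file, assuming the usual layout of path, object count, then x y width height for each object. The file names and crop sizes here are made up for illustration.

```python
def info_line(path, boxes):
    """Format one description-file line: path, object count, then x y w h per object."""
    parts = [path, str(len(boxes))]
    for x, y, w, h in boxes:
        parts.extend(str(v) for v in (x, y, w, h))
    return " ".join(parts)


# Hypothetical pre-cropped images, keyed by path, with their (width, height).
crops = {"pos/sign_001.png": (48, 48), "pos/sign_002.png": (64, 64)}

# Each crop fills its whole image, so the single box starts at (0, 0).
lines = [info_line(path, [(0, 0, w, h)]) for path, (w, h) in crops.items()]
print("\n".join(lines))
```

The same helper also covers the full-size-image route: pass several boxes for one path and the count is filled in automatically.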
Be aware that if you don't fix the same aspect ratio for each positive sample, you'll stretch your objects, which might or might not be a problem in your task.
And keep in mind, that you can create additional positive samples from warping/transforming your images. opencv_createsamples can do that for you, but I never really used it, so I'm not sure whether training will benefit from using such samples.

TensorFlow for image recognition, size of images

How can the size of an image affect training the model for this task?
My current training set holds images that are 2880 X 1800, but I am worried this may be too large to train. In total my sample size will be about 200-500 images.
Would this just mean that I need more resources (GPU,RAM, Distribution) when training my model?
If this is too large, how should I go about resizing? -- I want to mimic real-world photo resolutions as best as possible for better accuracy.
Edit:
I would also be using TFRecord format for the image files
Your memory and processing requirements will be proportional to the pixel size of your image. Whether this is too large for you to process efficiently will depend on your hardware constraints and the time you have available.
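To put a rough number on that proportionality, here is a back-of-the-envelope sketch of the memory taken by the input batch alone. The batch size of 32, the float32 storage and the 720 x 450 comparison size are assumptions for illustration, and the intermediate activations of a real network will usually need far more than this.

```python
def input_batch_bytes(width, height, channels=3, batch_size=32, bytes_per_value=4):
    """Bytes needed to hold one batch of images as a dense tensor (float32 by default)."""
    return width * height * channels * batch_size * bytes_per_value


full = input_batch_bytes(2880, 1800)   # resolution from the question
small = input_batch_bytes(720, 450)    # hypothetical 4x downscale per side

print(f"full size: {full / 2**30:.2f} GiB per batch")
print(f"downscaled: {small / 2**20:.1f} MiB per batch")
print(f"ratio: {full // small}x")
```

Halving each side divides the pixel count by four, so a 4x downscale per side cuts the input memory by a factor of 16.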
With regards to resizing the images there is no one answer, you have to consider how to best preserve information that'll be required for your algorithm to learn from your data while removing information that won't be useful. Reducing the size of your input images won't necessarily be a negative for accuracy. Consider two cases:
Handwritten digits
Here the images could be reduced considerably in size and maintain all the structural information necessary to be correctly identified. Have a look at the MNIST data set, these images are distributed at 28 x 28 resolution and identifiable to 99.7%+ accuracy.
Identifying Tree Species
Imagine a set of images of trees where individual leaves could help identify species. Here you might find that reducing the image size reduces small scale detail on leaf shape in a way that's detrimental to the model, but you might find that you get a similar result with a tight crop (which preserves individual leaves) rather than an image resize. If this is the case you may find that creating multiple crops from the same image gives you an augmented data set for training that considerably improves results (which is something to consider, if possible, given your training set is very small)
Deep learning models are achieving results around human level in many image classification tasks: if you struggle to identify your own images then it's less likely you'll train an algorithm to. This is often a useful starting point when considering the level of scaling that might be appropriate.
If you are using GPUs to train, this will definitely affect your training time. TensorFlow handles most of the GPU allocation, so you don't have to worry about that. But with big photos you will experience long training times even though your dataset is small. You should consider data augmentation.
You could complement your resizing with the data-augmentation. Resize in equal dimensions and then perform reflection and translation (as in geometric movement)
If your images are too big, your GPU might run out of memory before it can start training because it has to store the convolution outputs on its memory. If that happens, you can do some of the following things to reduce memory consumption:
resize the image
reduce batch size
reduce model complexity
To resize your image, there are many scripts just one Google search away, but I will add that in your case 1440 by 900 is probably a sweet spot.
Higher resolution images will result in a higher training time and an increased memory consumption (mainly GPU memory).
Depending on your concrete task, you might want to reduce the image size so that a reasonable batch size of, say, 32 or 64 fits on the GPU, which helps stable learning.
Your accuracy is probably affected more by the size of your training set. So instead of going for image size, you might want to go for 500-1000 sample images. Recent publications like SSD - Single Shot MultiBox Detector achieve high accuracy values like an mAP of 72% on the PascalVOC dataset - with "only" using 300x300 image resolution.
Resizing and augmentation: SSD for instance just scales every input image down to 300x300, independent of the aspect ratio - does not seem to hurt. You could also augment your data by mirroring, translating, ... etc (but I assume there are built-in methods in Tensorflow for that).
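For illustration, mirroring and translating can be sketched on a plain list-of-rows image without any library. The helper names are hypothetical; in practice TensorFlow's own image ops (tf.image.flip_left_right, for example) would be the natural choice.

```python
def mirror(img):
    """Flip an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in img]


def translate(img, dx, fill=0):
    """Shift an image dx pixels to the right (left if dx is negative), padding with fill."""
    width = len(img[0])
    dx = max(-width, min(width, dx))  # a shift larger than the image just empties it
    out = []
    for row in img:
        if dx >= 0:
            out.append([fill] * dx + row[:width - dx])
        else:
            out.append(row[-dx:] + [fill] * (-dx))
    return out


img = [[1, 2, 3],
       [4, 5, 6]]

# One original image yields several training variants.
augmented = [img, mirror(img), translate(img, 1), translate(img, -1)]
for variant in augmented:
    print(variant)
```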

Image size consideration for Haar cascades

The OpenCV Haar cascade classifier seems to use 24x24 images of faces as its positive training data. I have two questions regarding this:
What are the consideration that go into selecting the training image size, besides the fact that larger training images require more processing?
For non-square images, some people have chosen to keep one dimension at 24px, and expand the other dimension as necessary (to, say 100-200px). Is this the correct strategy?
How does one go about deciding the size of the training images (this is a variant of question 1)
I honestly believe that there are far better parameters to be tweaked than the image size. Even so, it's a question of fine-to-coarse detection: at finer levels you gain detail, and at coarser levels you gain structure. Also, there is a trade-off: with 24x24 detection regions, there are roughly 160,000 possible rectangular (Haar-like) features, so increasing or decreasing the size also affects this number for both training and testing (this is why boosting is used to select a small subset of discriminative features).
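The figure of roughly 160,000 can be reproduced by counting every position and scale of the five basic upright feature shapes inside the window. The sketch below assumes that basic set (two-rectangle horizontal and vertical, three-rectangle horizontal and vertical, and the four-rectangle shape); extended or tilted feature sets would give a different total.

```python
def count_haar_features(window=24, shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count all placements of the given base shapes, at every integer scale, in a square window."""
    total = 0
    for base_w, base_h in shapes:
        # Widths and heights must be multiples of the base shape so the rectangles stay equal.
        for w in range(base_w, window + 1, base_w):
            for h in range(base_h, window + 1, base_h):
                total += (window - w + 1) * (window - h + 1)
    return total


for size in (20, 24, 32):
    print(size, count_haar_features(size))
```

Running it for a few window sizes shows how quickly the feature pool grows with the sample size, which is part of why larger samples train more slowly.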
As you said, this is because his target was different (i.e. a pen). I think it is sensible to introduce a priori aspect ratio information to the cascade training, otherwise you would be getting detections that have square bounding boxes for a pen detector and probably suffer in performance because the training stage is picking up a larger background region around the pen.
See my first answer. I think this is largely empirical. There are also techniques for either feature scaling or building image pyramids (e.g. see this work) that reduce the need to tightly control the choice of training image size.

find mosquitos' head in the image

I have images of mosquitos similar to these ones and I would like to automatically circle the head of each mosquito in the images. They are obviously in different orientations and there is a random number of them in each image. Some error is fine. Any ideas for algorithms to do this?
This problem resembles a face detection problem, so you could try a naïve approach first and refine it if necessary.
First you would need to recreate your training set. For this you would like to extract small images with examples of what is a mosquito head or what is not.
Then you can use those images to train a classification algorithm, be careful to have a balanced training set, since if your data is skewed to one class it would hit the performance of the algorithm. Since images are 2D and algorithms usually just take 1D arrays as input, you will need to arrange your images to that format as well (for instance: http://en.wikipedia.org/wiki/Row-major_order).
I normally use support vector machines, but other algorithms such as logistic regression could do the trick too. If you decide to use support vector machines I strongly recommend you check libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), since it's a very mature library with bindings to several programming languages. They also have a very easy to follow guide targeted at beginners (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf).
If you have enough data, the classifier should become tolerant to orientation on its own. If you don't have enough data, you could create more training rows by rotating some of the samples, so you have a more representative training set.
As for the prediction what you could do is given an image, cut it using a grid where each cell has the same dimension that the ones you used on your training set. Then you pass each of this image to the classifier and mark those squares where the classifier gave you a positive output. If you really need circles then take the center of the given square and the radius would be the half of the square side size (sorry for stating the obvious).
After you do this you might have problems with sizes (some mosquitos might appear closer to the camera than others), since we have not trained the algorithm to be tolerant to scale. Moreover, even with all mosquitos at the same scale, we still might miss some of them just because they didn't fit in our grid perfectly. To address this, we need to repeat this procedure (grid cut and predict) after rescaling the given image to different sizes. How many sizes? Here you would have to determine that through experimentation.
This approach is sensitive to the size of the "window" that you are using, that is also something I would recommend you to experiment with.
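The grid-and-rescale procedure described above can be sketched as a generator of candidate windows. The window size, step and scale list are placeholder values to be tuned by experiment, and using a step smaller than the window (overlapping cells) is an assumption added here to reduce misses at grid boundaries. The classifier call itself is left out.

```python
def sliding_windows(img_w, img_h, win=24, step=8, scales=(1.0, 0.75, 0.5)):
    """Yield (scale, x, y, size) square boxes expressed in original-image coordinates."""
    for scale in scales:
        scaled_w, scaled_h = int(img_w * scale), int(img_h * scale)
        for y in range(0, scaled_h - win + 1, step):
            for x in range(0, scaled_w - win + 1, step):
                # Map the window found in the rescaled image back to the original image.
                yield scale, round(x / scale), round(y / scale), round(win / scale)


def to_circle(x, y, size):
    """Circle for a positive square: centre of the square, radius of half its side."""
    return x + size / 2, y + size / 2, size / 2


# Every window would be cropped, flattened and passed to the trained classifier;
# windows classified as "head" are then drawn with to_circle().
windows = list(sliding_windows(48, 48, win=24, step=24, scales=(1.0, 0.5)))
print(len(windows), to_circle(*windows[-1][1:]))
```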
Some research that may be useful:
A Multistep Approach for Shape Similarity Search in Image Databases
Representation and Detection of Shapes in Images
From the pictures you provided this seems to be an extremely hard image recognition problem, and I doubt you will get anywhere near acceptable recognition rates.
I would recommend a simpler approach:
First, if you have any control over the images, separate the mosquitoes before taking the picture, and use a white unmarked background, perhaps even something illuminated from below. This will make separating the mosquitoes much easier.
Then threshold the image. For example, here I did a quick try taking the red channel, then subtracting the blue channel*5, then applying a threshold of 80:
Use morphological dilation and erosion to get rid of the small leg structures.
Identify blobs of the right size to be mosquitoes by Connected Component Labeling. If a blob is large enough to be two mosquitoes, cut it out, and apply some more dilation/erosion to it.
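A library-free sketch of the thresholding and labelling steps, under a few assumptions: pixels are (r, g, b) tuples, values above the threshold are treated as foreground (the answer does not say which side of 80 is kept), and blobs are 4-connected. The dilation and erosion step is omitted, and the tiny test image is made up.

```python
from collections import deque


def threshold_mask(pixels, factor=5, cutoff=80):
    """True where red - factor * blue exceeds cutoff (assumed foreground direction)."""
    return [[(r - factor * b) > cutoff for r, g, b in row] for row in pixels]


def label_blobs(mask):
    """4-connected component labelling; returns one list of (x, y) pixels per blob."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue, blob = deque([(x0, y0)]), []
            seen[y0][x0] = True
            while queue:
                x, y = queue.popleft()
                blob.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if 0 <= nx < w and 0 <= ny < h and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            blobs.append(blob)
    return blobs


on, off = (200, 0, 10), (0, 0, 0)
image = [[on, on, off],
         [off, off, off],
         [off, off, on]]
blobs = label_blobs(threshold_mask(image))
print(sorted(len(b) for b in blobs))
```

Filtering the returned blobs by pixel count then separates single mosquitoes from specks and from merged pairs.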
Once you have a single blob like this
you can find the direction of the body using Principal Component Analysis. The head should be the part of the body where the cross-section is the thickest.
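For a 2D blob the first principal component can be computed directly from the second-order moments of the pixel coordinates, without a linear algebra library. The sketch below returns only the axis angle; deciding which end of the axis is the head (for example by comparing cross-section widths along it) is not shown.

```python
import math


def principal_axis_angle(points):
    """Angle in radians, within (-pi/2, pi/2], of the first principal component of 2D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix of the coordinates.
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed-form orientation of the dominant eigenvector.
    return 0.5 * math.atan2(2 * sxy, sxx - syy)


diagonal_blob = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(math.degrees(principal_axis_angle(diagonal_blob)))
```

In image coordinates the y axis points down, so a positive angle here corresponds to a clockwise tilt on screen.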

Using flipped images for machine learning dataset

I've got a binary classification problem. I'm trying to train a neural network to recognize objects from images. Currently I have about 1500 50x50 images.
The question is whether extending my current training set with the same images flipped horizontally is a good idea or not? (The images are not symmetric.)
Thanks
I think you can do this to a much larger extent, not just flipping the images horizontally, but changing the angle of the image by 1 degree. This will result in 360 samples for every instance that you have in your training set. Depending on how fast your algorithm is, this may be a pretty good way to ensure that the algorithm isn't only trained to recognize images and their mirrors.
It's possible that it's a good idea, but then again, I don't know what's the goal or the domain of the image recognition. Let's say the images contain characters and you're asking the image recognition software to determine if an image contains a forward slash / or a back slash \ then flipping the image will make your training data useless. If your domain doesn't suffer from such issues, then I'd think it's a good idea to flip them and even rotate with varying degrees.
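The slash example is easy to check on a tiny ASCII image: a horizontal flip turns one class into the other, so a flipped copy that keeps its original label becomes a wrongly labelled sample. The 3x3 patterns below are made up for illustration.

```python
def flip_horizontal(img):
    """Mirror an image given as a list of rows (strings or lists) left to right."""
    return [row[::-1] for row in img]


backslash = ["#..",
             ".#.",
             "..#"]
forward_slash = ["..#",
                 ".#.",
                 "#.."]

flipped = flip_horizontal(backslash)
print("\n".join(flipped))
print("flip changed the class:", flipped == forward_slash)
```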
I have used flipped images in AdaBoost with great success in the course: http://www.csc.kth.se/utbildning/kth/kurser/DD2427/bik12/Schedule.php
from the zip "TrainingImages.tar.gz".
I know there is some information on the pros and cons of using flipped images somewhere in the slides (on the homepage) but I can't find it. Another great resource is http://www.csc.kth.se/utbildning/kth/kurser/DD2427/bik12/DownloadMaterial/FaceLab/Manual.pdf (together with the slides), which goes through things like finding objects at different scales and orientations.
If the image patches are not symmetric I don't think it's a good idea to flip. A better idea is to apply some similarity transforms to the training set within some limits. Another way to increase the dataset is to add Gaussian-smoothed templates to it. Make sure that the numbers of positive and negative samples are proportional. Too many positives and too few negatives might skew the classifier and give bad performance on the test set.
It depends on what your NN is based on. If you are extracting rotation-invariant features, or features that do not depend on the spatial position within the image (like histograms or whatever), and train your NN with these features, then rotating will not be a good idea.
If you are training directly on pixel values, then it might be a good idea.
Some more details might be useful.
