I am planning on making a cascade detector for a white cup, a red ball, and a blue puck. Since these objects have such simple shapes, I was wondering whether there are any training parameter differences compared with detecting complex objects such as cars / faces. Also, within the training pos images I have the objects in different lighting conditions, including instances where the objects are under shadow.
For training negative images I noticed the image sizes may vary. However, for positive images they MUST be a fixed size.
I plan on using 100x100 pos images to help detect the objects from 20-30 feet, and 200x200 pos images to detect the objects when I am within 5 ft or directly overhead of the object (approx. 3 ft off the ground). Does this mean that I will have to train 6 different XMLs, i.e. 2 for each object, one trained at 100x100 and one at 200x200?
Short answer: Yes
Long Answer: Probably:
You have to think about it like this: the classifier is going to build up a set of features from the positive images and then use these to determine whether your detection window contains the object or not. If you drastically change the viewing angle at detection time, then you are going to need a different classifier.
Let me explain with pictures:
If at 20ft away your cup looks like this:
with associated background/lighting etc., then it is going to be a very different classifier than if your cup looks like this (maybe 5 ft away but at a different angle):
Now, with all that being said, if you only have larger and smaller versions of your cup, then you may only need one. However, you will need a different classifier for each object (cup/ball/puck).
Images not mine - Taken from Google
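On the scale point specifically: at detection time OpenCV rescans the image at multiple window sizes, so a single cascade can usually cover the far-away and close-up cases by itself. A minimal sketch of that, assuming a hypothetical trained cascade file cup_cascade.xml and a test image scene.jpg:

#include <opencv2/opencv.hpp>
#include <vector>
using namespace cv;

int main() {
    // "cup_cascade.xml" is a placeholder for whatever cascade you trained.
    CascadeClassifier cup;
    if (!cup.load("cup_cascade.xml")) return -1;

    Mat frame = imread("scene.jpg");
    Mat gray;
    cvtColor(frame, gray, COLOR_BGR2GRAY);
    equalizeHist(gray, gray);

    // detectMultiScale rescans the image at several scales (scale step 1.1 here),
    // so one cascade can find the cup both far away and close up, as long as
    // the viewpoint stays roughly the same.
    std::vector<Rect> hits;
    cup.detectMultiScale(gray, hits, 1.1, 3, 0, Size(30, 30));

    for (const Rect& r : hits)
        rectangle(frame, r, Scalar(0, 255, 0), 2);
    imwrite("detections.jpg", frame);
    return 0;
}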
I have image patches from DDSM Breast Mammography that are 150x150 in size. I would like to augment my dataset by randomly cropping these images 2x times to 120x120 size. So, if my dataset contains 6500 images, augmenting it with random crops should get me to 13000 images. Thing is, I do NOT want to lose potential information in the image and possibly change the ground truth label.
What would be best way to do this? Should I crop them randomly from 150x150 to 120x120 and hope for the best or maybe pad them first and then perform the cropping? What is the standard way to approach this problem?
If your ground truth contains the exact location of what you are trying to classify, use the ground truth to crop your images in an informed way, i.e. adjust the ground truth label if a crop removes what you are trying to classify.
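If the ground truth is available as a bounding box, a minimal sketch of such an informed crop could look like this (OpenCV C++; the Rect lesion and the 120x120 target size are assumptions based on the question):

#include <opencv2/opencv.hpp>
#include <algorithm>
using namespace cv;

// Keep the annotated lesion inside the crop window by centring the crop on it
// and clamping the window to the image borders.
Mat informedCrop(const Mat& img, const Rect& lesion, int cropSize = 120) {
    int x = lesion.x + lesion.width / 2 - cropSize / 2;
    int y = lesion.y + lesion.height / 2 - cropSize / 2;
    x = std::max(0, std::min(x, img.cols - cropSize));
    y = std::max(0, std::min(y, img.rows - cropSize));
    return img(Rect(x, y, cropSize, cropSize)).clone();
}

Instead of always centring, you can also randomize the crop origin within the range that still contains the lesion, which gives you the desired 2x augmentation without ever cutting the lesion away.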
If you don't know the location of what you are classifying, you could
attempt to train a classifier on your un-augmented dataset,
find out what regions of the images your classifier reacts to,
make note of these locations,
crop your images in an informed way
train a new classifier
But how do you "find out what regions your classifier reacts to"?
Multiple ways are described in Visualizing and Understanding Convolutional Networks by Zeiler and Fergus:
Imagine your classifier classifies breast cancer or no breast cancer. Now take an image that contains positive evidence of breast cancer, occlude part of it with some blank color (see the gray square in the image above, image by Zeiler et al.), and predict cancer or not. Then move the occluding square around. In the end you'll get rough prediction scores for all parts of your original image (see (d) in the image above), because when you cover up the important part that is responsible for a positive prediction, you (should) get a negative cancer prediction.
If you have someone who can actually recognize cancer in an image, this is also a good way to check for and guard against confounding factors.
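A rough sketch of that occlusion experiment, assuming your trained classifier is available as an ONNX model (model.onnx, the 150x150 input size, and a single output probability are all assumptions here):

#include <opencv2/opencv.hpp>
using namespace cv;
using namespace cv::dnn;

int main() {
    // "model.onnx" and the 150x150 input size are placeholders for your own classifier.
    Net net = readNetFromONNX("model.onnx");
    Mat img = imread("positive_patch.png");

    const int patch = 20, stride = 10;  // size and step of the gray occluder
    Mat heat(img.rows / stride, img.cols / stride, CV_32F, Scalar(0));

    for (int y = 0; y + patch <= img.rows; y += stride) {
        for (int x = 0; x + patch <= img.cols; x += stride) {
            Mat occluded = img.clone();
            occluded(Rect(x, y, patch, patch)).setTo(Scalar(127, 127, 127));

            Mat blob = blobFromImage(occluded, 1.0 / 255.0, Size(150, 150));
            net.setInput(blob);
            // Assumes the network outputs a single "cancer" probability.
            float score = net.forward().at<float>(0, 0);
            heat.at<float>(y / stride, x / stride) = score;
        }
    }

    // Low scores mark the regions whose occlusion destroys the positive
    // prediction, i.e. the regions the classifier reacts to.
    Mat vis;
    heat.convertTo(vis, CV_8U, 255.0);
    imwrite("occlusion_map.png", vis);
    return 0;
}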
BTW: You might want to crop on-the-fly and randomize how you crop even more to generate way more samples.
If the 150x150 patch is already the region of interest (ROI), you could try the following data augmentations (a rough OpenCV sketch follows the list):
use a larger patch, e.g. 170x170 that always contains your 150x150 patch
use a larger patch, e.g. 200x200, and scale it down to 150x150
add some gaussian noise to the image
rotate the image slightly (by random amounts)
change image contrast slightly
artificially emulate whatever other (image-)effects you see in the original dataset
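A rough sketch of the crop / noise / rotation / contrast augmentations in OpenCV C++ (the parameter values are arbitrary starting points, not recommendations):

#include <opencv2/opencv.hpp>
using namespace cv;

RNG rng(12345);

// Random 120x120 crop out of a larger patch.
Mat randomCrop(const Mat& img, int size = 120) {
    int x = rng.uniform(0, img.cols - size + 1);
    int y = rng.uniform(0, img.rows - size + 1);
    return img(Rect(x, y, size, size)).clone();
}

// Additive gaussian noise.
Mat addGaussianNoise(const Mat& img, double sigma = 5.0) {
    Mat noise(img.size(), CV_32FC(img.channels()));
    randn(noise, Scalar::all(0), Scalar::all(sigma));
    Mat out;
    img.convertTo(out, CV_32F);
    out += noise;
    out.convertTo(out, img.type());
    return out;
}

// Small random rotation around the centre (borders replicated).
Mat randomRotate(const Mat& img, double maxDeg = 10.0) {
    double angle = rng.uniform(-maxDeg, maxDeg);
    Mat rot = getRotationMatrix2D(Point2f(img.cols / 2.f, img.rows / 2.f), angle, 1.0);
    Mat out;
    warpAffine(img, out, rot, img.size(), INTER_LINEAR, BORDER_REPLICATE);
    return out;
}

// Slight random contrast change: out = alpha * img.
Mat randomContrast(const Mat& img, double maxDelta = 0.1) {
    double alpha = 1.0 + rng.uniform(-maxDelta, maxDelta);
    Mat out;
    img.convertTo(out, -1, alpha, 0);
    return out;
}

int main() {
    Mat patch = imread("patch_150x150.png");  // placeholder file name
    Mat aug = randomContrast(randomRotate(addGaussianNoise(randomCrop(patch))));
    imwrite("augmented.png", aug);
    return 0;
}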
My problem is very similar to detecting birds flying in a flock. The objects have few features and can appear at different angles in images. Objects can be positioned quite irregularly within a group (not on a regular grid), but they never intersect. I tried YOLOv3: at the start, when I had <30 training images, it worked quite well (it overfitted, but at least it worked for the training images). As I increased the number of training images, it stopped working; the network does not learn the data (underfitting). I think the main problem is that the objects do not have enough features for a CNN, i.e. the separate objects are too simple. I wanted to somehow use the fact that they always come in groups, i.e. somehow consider neighbors. There may be a different number of them in a group, at least 3, but mostly > 10. They may look different (like birds with different wing positions), but the size of all objects in a group is about the same. I am a newbie in neural networks, so maybe someone with more experience could point me in the right direction.
I tried to use template matching from OpenCV: I must use many templates (>20) because the objects may look quite different (different wing positions), and multiscale matching is also needed, all of which takes a lot of execution time. More importantly, under different settings, template matching finds either too few objects or too many false positives. So I think neural networks fit this task better; please correct me if I am wrong. I thought maybe it could make sense to mask "useful" regions with a pass through a Mask R-CNN, and then somehow separate the objects in these regions (because I have to mark them separately for the user). Could this work, or are there some other ways I could try? Any hints would be greatly appreciated!
EDIT: I also have many other objects in the images (not just sky and birds), for example trees. The leaves or groups of leaves give false positives. They may be of different colors (green, orange, dark green, black), so filtering them by color is hardly possible.
Quote from YOLO introduction article:
2.4. Limitations of YOLO
YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.
Since YOLO version 1, the recognition of small and multiscale objects has improved a lot, but I didn't find any evidence that it got better at recognizing small grouped objects (please correct me if I am wrong).
It will be problematic to recognise very small objects on large high-res images, as YOLOv3 will downscale them to 416x416 (or 320x320 if you use YOLOv3 320) resolution. You can feed YOLOv3 regions of an image if it is too big. Or you can find some existing solutions for such cases.
In this article the authors combined a CNN-based detector with a fully convolutional network and superpixel-based semantic segmentation, using support vector machines, to detect small objects in large images. They claim to achieve high detection precision.
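Going back to the idea of feeding the detector regions of a big frame: a minimal tiling sketch (the tile size and overlap are arbitrary values to tune, and the detector call itself is left as a placeholder):

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>
using namespace cv;

// Split a large frame into overlapping tiles that are closer to the detector's
// native input size, so small birds are not destroyed by the global downscale.
std::vector<Rect> makeTiles(const Size& frame, int tile = 832, int overlap = 96) {
    std::vector<Rect> tiles;
    int step = tile - overlap;
    for (int y = 0; y < frame.height; y += step)
        for (int x = 0; x < frame.width; x += step)
            tiles.push_back(Rect(x, y,
                                 std::min(tile, frame.width - x),
                                 std::min(tile, frame.height - y)));
    return tiles;
}

int main() {
    Mat frame = imread("flock.jpg");
    for (const Rect& roi : makeTiles(frame.size())) {
        Mat tile = frame(roi);
        // Run your detector on `tile` here and offset the returned boxes by
        // roi.x / roi.y to map them back into frame coordinates.
    }
    return 0;
}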
But often you can use much simpler approaches that involve only a little bit of algorithmic image processing, if the difference between the object you want to detect and its environment is obvious and simple to compute.
You can try to detect a flock by the high color contrast between the birds and the sky. Usually birds appear much darker than the sky background. You may find the OpenCV docs about image thresholding helpful for that.
#include <opencv2/opencv.hpp>
using namespace cv;

int main() {
    // Load the frame and convert it to grayscale.
    Mat src = imread("1.jpg");
    Mat gray;
    cvtColor(src, gray, COLOR_BGR2GRAY);

    // Inverse binary threshold: dark birds become white blobs on the bright sky.
    Mat thresholded;
    threshold(gray, thresholded, 100, 255, THRESH_BINARY_INV);

    imwrite("2.jpg", thresholded);
    return 0;
}
I got this:
Now you can extract white bird blobs with findContours() or SimpleBlobDetector (and match them against templates or do additional recognition/classification if that is required).
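For example, a small follow-up sketch with findContours() (the 10-500 px area limits are guesses to be tuned to your bird size):

#include <opencv2/opencv.hpp>
#include <vector>
using namespace cv;

int main() {
    // Continue from the thresholded image produced by the code above.
    Mat thresholded = imread("2.jpg", IMREAD_GRAYSCALE);

    std::vector<std::vector<Point>> contours;
    findContours(thresholded, contours, RETR_EXTERNAL, CHAIN_APPROX_SIMPLE);

    Mat vis = imread("1.jpg");
    for (const auto& c : contours) {
        double area = contourArea(c);
        if (area < 10 || area > 500) continue;  // reject blobs that are too small/large
        rectangle(vis, boundingRect(c), Scalar(0, 0, 255), 2);
    }
    imwrite("blobs.jpg", vis);
    return 0;
}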
I'm trying to perform image recognition of large birds, and since the camera is moving, my usual tactic of background removal and shape recognition won't be effective. Since all adult birds of the species are remarkably similar in coloration, I was thinking that Haar classifiers may be effective, and found some resources to try and train my own classifier.
Negative images should be significantly larger than the positive images, and present similar environments but not contain the positive image. However, I haven't been able to find too many details on what makes for a good positive training image. I found some references to trying to keep a similar aspect ratio in all positive training images, but for something like a bird that can change dramatically in different poses (wings open, wings closed), is it crucial? Is it better to train multiple classifiers for each pose and orientation of the target to be recognized? Is it better to instead try to identify a subset of the object that is relatively consistent (like the bird's very distinctive black and white head)?
If I use a sub-feature like the head, does it matter if it is "mirrored" in some views? Should I artificially make my positive images face the same direction, and when running the classifier on an image, run it twice, once mirrored? What are the considerations made when designing the positive image set for a classifier?
I'm trying to understand Viola Jones method, and I've mostly got it.
It uses simple Haar-like features boosted into strong classifiers and organized into layers / a cascade in order to achieve better performance (not bothering with obvious 'non-object' regions).
I think I understand the integral image, and I understand how the values for the features are computed.
The only thing I can't figure out is how the algorithm deals with face size variations.
As far as I know, they use a 24x24 subwindow that slides over the image, and within it the algorithm goes through the classifiers and tries to figure out whether there is a face/object in it or not.
And my question is - what if one face is 10x10 in size, and another 100x100? What happens then?
And I'm dying to know what these first two features (in the first layer of the cascade) are and what they look like (keeping in mind that these two features, according to Viola & Jones, will almost never miss a face and will eliminate 60% of the incorrect ones). How?
And how is it possible to construct these features so that they work with these statistics for different face sizes in the image?
Am I missing something, or maybe I've figured it all wrong?
If I'm not clear enough, I'll try to explain better my confusion.
Training
The Viola-Jones classifier is trained on 24x24 images. Each of the face images contains a similarly scaled face. This produces a set of feature detectors built out of two, three, or four rectangles optimised for a particular sized face.
Face size
Different face sizes are detected by repeating the classification at different scales. The original paper notes that good results are obtained by trying different scales a factor of 1.25 apart.
Note that the integral image means that it is easy to compute the rectangular features at any scale by simply scaling the coordinates of the corners of the rectangles.
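As a small illustration, the rectangle sum behind each Haar feature costs only four lookups in the integral image, whatever the rectangle size (a sketch; the example rectangles are arbitrary):

#include <opencv2/opencv.hpp>
#include <cstdio>
using namespace cv;

// Sum of the pixels inside r, computed in constant time from the integral image.
// ii is the (rows+1)x(cols+1) CV_32S matrix returned by cv::integral().
int rectSum(const Mat& ii, const Rect& r) {
    return ii.at<int>(r.y, r.x)
         + ii.at<int>(r.y + r.height, r.x + r.width)
         - ii.at<int>(r.y, r.x + r.width)
         - ii.at<int>(r.y + r.height, r.x);
}

int main() {
    Mat gray = imread("face.jpg", IMREAD_GRAYSCALE);
    Mat ii;
    integral(gray, ii, CV_32S);

    // A two-rectangle "dark eyes over bright cheeks" feature at some scale;
    // scaling the rectangles changes nothing about the cost of evaluating it.
    Rect eyes(40, 40, 60, 15), cheeks(40, 55, 60, 15);
    int feature = rectSum(ii, cheeks) - rectSum(ii, eyes);
    printf("feature value: %d\n", feature);
    return 0;
}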
Best features
The original paper contains pictures of the first two features selected in a typical cascade (see page 4).
The first feature detects the wide dark rectangle of the eyes above a wide brighter rectangle of the cheeks.
----------
----------
++++++++++
++++++++++
The second feature detects the bright thin rectangle of the bridge of the nose between the darker rectangles on either side containing the eyes.
---+++---
---+++---
---+++---
I want to develop an application in which the user inputs an image of a person, and the system should be able to identify the face in that image. The system should also work if there is more than one person in the image.
I need the logic; I don't have any idea how to work on image pixel data in such a manner that it identifies people's faces.
Eigenface might be a good algorithm to start with if you're looking to build a system for educational purposes, since it's relatively simple and serves as the starting point for a lot of other algorithms in the field. Basically what you do is take a bunch of face images (training data), switch them to grayscale if they're RGB, resize them so that every image has the same dimensions, and turn the images into vectors by stacking the columns of the images (which are now 2D matrices) on top of each other. Then you compute the mean face (the per-pixel mean over all the images) and subtract it from every image vector so that the data is centered. Once that's done, you compute the covariance matrix of the result, solve for its eigenvalues and eigenvectors, and find the principal components. These components will serve as the basis for a vector space, and together describe the most significant ways in which face images differ from one another.
Once you've done that, you can compute a similarity score for a new face image by converting it into a face vector, projecting into the new vector space, and computing the linear distance between it and other projected face vectors.
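A compact sketch of that pipeline using cv::PCA (the file names, the 64x64 size, and the 16 components are placeholders, not recommendations):

#include <opencv2/opencv.hpp>
#include <cstdio>
#include <string>
#include <vector>
using namespace cv;

// Flatten a grayscale face image into a single row vector of floats.
static Mat toRow(const Mat& face, Size size = Size(64, 64)) {
    Mat g, row;
    resize(face, g, size);
    g.convertTo(row, CV_32F);
    return row.reshape(1, 1);
}

int main() {
    // Training gallery: file names are placeholders.
    std::vector<std::string> files = {"face1.png", "face2.png", "face3.png"};
    Mat data;
    for (const auto& f : files)
        data.push_back(toRow(imread(f, IMREAD_GRAYSCALE)));

    // PCA subtracts the mean face and finds the principal components
    // (the "eigenfaces"); keeping 16 of them is an arbitrary choice.
    PCA pca(data, Mat(), PCA::DATA_AS_ROW, 16);

    // Project a new face into the eigenface space and compare it to a
    // gallery face by plain Euclidean distance (lower = more similar).
    Mat probe = pca.project(toRow(imread("probe.png", IMREAD_GRAYSCALE)));
    Mat gallery0 = pca.project(data.row(0));
    printf("distance to gallery face 0: %f\n", norm(probe, gallery0, NORM_L2));
    return 0;
}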
If you decide to go this route, be careful to choose face images that were taken under an appropriate range of lighting conditions and pose angles. Those two factors play a huge role in how well your system will perform when presented with new faces. If the training gallery doesn't account for the properties of a probe image, you're going to get nonsense results. (I once trained an eigenface system on random pictures pulled down from the internet, and it gave me Bill Clinton as the strongest match for a picture of Elizabeth II, even though there was another picture of the Queen in the gallery. They both had white hair, were facing in the same direction, and were photographed under similar lighting conditions, and that was good enough for the computer.)
If you want to pull faces from multiple people in the same image, you're going to need a full system to detect faces, pull them into separate files, and preprocess them so that they're comparable with other faces drawn from other pictures. Those are all huge subjects in their own right. I've seen some good work done by people using skin color and texture-based methods to cut out image components that aren't faces, but these are also highly subject to variations in training data. Color casting is particularly hard to control, which is why grayscale conversion and/or wavelet representations of images are popular.
Machine learning is the keystone of many important processes in an FR system, so I can't stress the importance of good training data enough. There are a bunch of learning algorithms out there, but the most important one in my view is the naive Bayes classifier; the other methods converge on Bayes as the size of the training dataset increases, so you only need to get fancy if you plan to work with smaller datasets. Just remember that the quality of your training data will make or break the system as a whole, and as long as it's solid, you can pick whatever trees you like from the forest of algorithms that have been written to support the enterprise.
EDIT: A good sanity check for your training data is to compute average faces for your probe and gallery images. (This is exactly what it sounds like; after controlling for image size, take the sum of the RGB channels for every image and divide each pixel by the number of images.) The better your preprocessing, the more human the average faces will look. If the two average faces look like different people -- different gender, ethnicity, hair color, whatever -- that's a warning sign that your training data may not be appropriate for what you have in mind.
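A minimal sketch of that average-face check (the file names and the 128x128 size are placeholders):

#include <opencv2/opencv.hpp>
#include <string>
#include <vector>
using namespace cv;

int main() {
    // Gallery images: file names are placeholders; repeat the same for the probe set.
    std::vector<std::string> files = {"g1.png", "g2.png", "g3.png"};
    Mat acc = Mat::zeros(128, 128, CV_32FC3);

    for (const auto& f : files) {
        Mat img = imread(f), resized, f32;
        resize(img, resized, acc.size());   // control for image size first
        resized.convertTo(f32, CV_32FC3);
        acc += f32;                         // running per-pixel sum of the RGB channels
    }

    Mat avg;
    acc.convertTo(avg, CV_8UC3, 1.0 / files.size());  // divide each pixel by the image count
    imwrite("average_face.png", avg);       // compare this with the probe-set average visually
    return 0;
}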
Have a look at the Face Recognition Homepage - there are algorithms, papers, and even some source code.
There are many, many different algorithms out there. Basically what you are looking for is "computer vision". We did a project at university based around facial recognition and detection. What you need to do is google extensively and try to understand all this stuff. There is a bit of mathematics involved, so be prepared. First go to Wikipedia. Then you will want to search for PDF publications of specific algorithms.
You can go the hard way - write an implementation of all the algorithms yourself. Or the easy way - use a computer vision library like OpenCV or OpenVIDIA.
And actually it is not that hard to make something that will work, so be brave. It is a lot harder to make software that will work under different and constantly varying conditions. And that is where Google won't help you. But I suppose you don't want to go that deep.