I have 5000 images for fish detection and tracking. The images contain 5-10 different fish species. Should I label them with a small number of classes, i.e. the 5-10 species? Or should I label them with a finer-grained set of fish types, varying between 50-100? Or should I just label all of them as "fish"? And would splitting the labels into 5-10 or 50-100 classes leave too few labeled examples per class for training?
If you are interested in detecting any kind of fish, then you can give all kinds of fish the same class. If you want the model to detect and classify different types of fish separately, your labelling should also be specific for each class.
And for the last question: yes, 5-10 or 50-100 examples would be too few for training. You should target at least 300-400 labeled objects as the bare minimum.
Context: I have partial side-view images of different types of vehicles in my data set (partial because of the limited field of view of my camera lens). These partial images cover more than half of the vehicle and can be considered good representative images of it. The vehicle categories are car, bus, and truck. I always get a wheel of the vehicle in these images, and because I capture them during different parts of the day, the colour intensity of the wheels varies throughout the day. However, a wheel is definitely present in all the images.
Question: I wanted to know whether the presence of an object that appears in every image of a data set, but is not logically useful for classification, will affect the CNN in any way. Basically, before training the CNN, should I mask the object (i.e. black it out) in all the images, or just leave it there?
A CNN creates a hierarchical decomposition of the image into combinations of various discriminatory patterns. These patterns are learnt during training to find those that separate the classes well.
If an object is present in every image, it is likely that it is not needed to separate the classes and won't be learnt. If there is some variation in the object that is class dependent, then maybe it will be used. It is really difficult to know beforehand which features are important. Maybe buses have shinier wheels than cars, and this is something you have not noticed, so having the wheel in the image is beneficial.
If you have inadvertently introduced some class-specific variation, this can cause a problem for later classification. For example, if you only took photos of buses at night, the network might learn night = bus, and when you show it a photo of a bus during the day it won't classify it correctly.
However, using dropout in the network forces it to learn multiple features for classification rather than relying on just one. So if there is such variation, it might not have as big an impact.
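As a minimal sketch of what that looks like in a network definition (a Keras-style model; the layer sizes, input shape, and 0.5 rate are illustrative choices, not a tuned architecture):

    # Minimal sketch: a small CNN with dropout before the classifier head.
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=(128, 128, 3)),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        # Randomly zeroes half the activations during training, so the
        # network cannot rely on any single feature (such as the wheel).
        layers.Dropout(0.5),
        layers.Dense(3, activation="softmax"),  # car / bus / truck
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])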
I would use the images without blanking anything out. Unless it is something simple, such as background removal of particles, finding and blacking out the object adds another layer of complexity. You can test whether the wheels make a big difference by training the network on the normal images, then classifying a few training examples with the object blacked out and seeing whether the class probabilities change.
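A sketch of that blackout test (assuming a trained Keras-style model and that you know roughly where the wheel sits; the box coordinates are hypothetical):

    import numpy as np

    def occlusion_test(model, image, box):
        """image: HxWx3 float array; box: (x, y, w, h) region to black out."""
        x, y, w, h = box
        occluded = image.copy()
        occluded[y:y + h, x:x + w, :] = 0.0   # black out the wheel region

        p_original = model.predict(image[np.newaxis])[0]
        p_occluded = model.predict(occluded[np.newaxis])[0]
        return p_original, p_occluded

    # If the two probability vectors barely differ, the network is not
    # relying on the occluded object for its decision.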
Focus your energy on doing good data augmentation; that is where you will get the most gains.
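For example, with Keras' ImageDataGenerator (the ranges here are starting points to tune, and the brightness jitter is aimed at your time-of-day variation):

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(
        rotation_range=10,             # small rotations
        width_shift_range=0.1,
        height_shift_range=0.1,
        zoom_range=0.1,
        horizontal_flip=True,
        brightness_range=(0.5, 1.5),   # simulate different times of day
    )
    # with train_images / train_labels loaded as arrays:
    # model.fit(datagen.flow(train_images, train_labels, batch_size=32), epochs=20)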
You can see an example of which features are learnt on MNIST in this paper.
In my project I need to recognize euro coins, and someone advised me to use OpenCV classifiers and training algorithms. So I downloaded version 3.1 of OpenCV and am trying to get started with it. I would like to ask about some things that I don't understand from the tutorials (the ones I am following are this, this from the official OpenCV documentation, and this).
First of all, is it mandatory to generate and consider negative samples? If yes, which kind of "objects" should I consider as negatives? In my app I need to detect and recognize euro coins, so... should I create negatives from any other random kinds of objects?
Secondly, my app is supposed to recognize 2€, 1€ and 0.50€ coins. So, how many positive samples should I generate with opencv_createsamples? One set for each coin (front and back), or a single one for all three kinds of coins? If I have understood correctly, I will then have some .xml files which I am supposed to include in my iOS app project, right?
Finally, will detectMultiScale() not only detect the coin but also its kind? That's why I was thinking that I would need more than one classifier file: one to distinguish front from back, and another to distinguish the value of the coin.
I hope I have not written too broad a question; thank you for your attention.
Regarding negative (or background) examples: consider on which background the euro coins will appear in the images your program has to process in the end. E.g., if they lie on a table together with non-euro coins, use table tops (without coins) as negatives. If you want to spot them in images where people hold them, use hands as negatives, etc.
Regarding the second question, I am also searching for advice on this. I guess it depends on the complexity of the features you need to recognize and also on the complexity of the background. I succeeded in training a classifier with 40 circle positives, but completely failed to train a classifier using 60 (aerial) cow images.
Regarding your third question, I think (not tested!) it depends on your recognition needs: if you just want to spot any of the euro coins, you'll be fine training one classifier with images of all coins from all sides. If you want to distinguish the coins, create different classifiers with images of each coin from all sides. If you even need to distinguish front from back, you'll have to create one classifier for every side of every coin.
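To illustrate the one-classifier-per-class idea at detection time, a minimal sketch (the cascade file names are hypothetical; detectMultiScale() itself only reports where its one trained class was found):

    import cv2

    cascades = {
        "2eur": cv2.CascadeClassifier("cascade_2eur.xml"),
        "1eur": cv2.CascadeClassifier("cascade_1eur.xml"),
        "50ct": cv2.CascadeClassifier("cascade_50ct.xml"),
    }

    img = cv2.imread("coins.jpg")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # run every classifier over the same image and tag its detections
    for label, cascade in cascades.items():
        for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1,
                                                     minNeighbors=5):
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(img, label, (x, y - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)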
Yes, negative samples are mandatory, and again yes, you can use images of random objects, but you need to have lots of them, so I suggest extracting frames from videos or using a website such as this.
Next, you need positive samples for your coins. The greater the number of positive samples, the more accurate the results will be. One thing you can do is take 2-3 images of your coin and superimpose them on your negative ones, changing the angle every time you superimpose. That way, if you have, say, 500 negative images and 2 positive images of a coin, you will have 1000 positive images after superimposing. The opencv_createsamples script takes care of all of this.
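A rough Python sketch of that superimposing idea, just to make the mechanics concrete (opencv_createsamples automates this properly, with distortions and blending; the file paths, paste position, and 15-degree step are illustrative):

    import cv2

    coin = cv2.imread("coin_2eur.png")           # tightly cropped positive
    bg = cv2.imread("negatives/table_01.jpg")    # one negative background

    h, w = coin.shape[:2]
    for angle in range(0, 360, 15):
        # rotate the coin around its centre
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        rotated = cv2.warpAffine(coin, M, (w, h))

        # paste it onto a copy of the background (randomize position in practice)
        sample = bg.copy()
        y, x = 50, 50
        sample[y:y + h, x:x + w] = rotated
        cv2.imwrite("positives/coin_%03d.jpg" % angle, sample)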
I am planning on making a cascade detector for a white cup, a red ball, and a blue puck. Given how simple these objects are in shape, I was wondering whether there are any parameter differences in training compared to finding complex objects such as cars or faces. Also, the positive training images show the objects in different lighting conditions and instances where the objects are under shadow.
For negative training images I noticed the image sizes may vary. However, positive images MUST be a fixed size.
I plan on using the 100x100 positive images to help detect the objects from 20-30 feet, and the 200x200 positive images to detect the objects when I am within 5 ft or directly overhead of the object (approx. 3 ft off the ground). Does this mean that I will have to train 6 different XMLs, i.e. 2 for each object, one trained at 100x100 and one at 200x200?
Short answer: Yes
Long answer: probably.
You have to think about it like this: the classifier is going to build up a set of features from the positive images and then use these to determine whether your detection image matches or not. If you are drastically changing the angle of your detection, then you are going to need a different classifier.
Let me explain with pictures:
If at 20 ft away your cup looks like this: [image: a cup photographed from far away], with the associated background/lighting etc., then it is going to be a very different classifier than if your cup looks like this: [image: a close-up of a cup, maybe 5 ft away, at a different angle].
Now, with all that being said, if you only have larger and smaller versions of your cup, then you may only need one classifier. However, you will need a different classifier for each object (cup/ball/puck).
Images not mine - Taken from Google
Viola-Jones' AdaBoost method is very popular for face detection. We need lots of positive and negative samples to train a face detector.
The rule for collecting positive samples is simple: images which contain faces. But the rule for collecting negative samples is not very clear: images which do not contain faces.
But there are so many scenes that do not contain faces (sky, rivers, houses, animals, etc.). Which should I collect? How can I know when I have collected enough negative samples?
One suggested idea for negative samples: take the positive samples, crop out the face region, and use the remaining parts as negatives. Does this work?
You have asked many questions in your thread.
1. Amount of samples. As a rule of thumb: when you train a detector you need roughly a few thousand positive and negative examples per stage. A typical detector has 10-20 stages, and each stage rejects about half of the remaining negatives, so the later stages need exponentially more raw negatives to stay supplied. So you will need roughly 3,000-10,000 positive examples and ~5,000,000 to 100,000,000 negative examples.
2. Which negatives to take. A rule of thumb: you need to find a face in a given environment, so you need to take that environment as negative examples. For instance, if you try to detect faces of students sitting in a classroom, then take as negative examples images from the classroom (walls, windows, human bodies, clothes, etc.). Taking images of the moon or of the sky will probably not help you. If you don't know your environment, then just take as many different natural images as possible (under different lighting conditions).
3. Should you take facial parts (like an eye, or a nose) as negatives? You can, but it is definitely not enough to take only those negatives. The real strength of the detector will come from the negative images which represent the typical background of the faces.
4. How to collect/generate negative samples - you don't actually need many negative images. You can take 1000 images and generate 10,000,000 negative samples from them. Here is how you do it. Suppose you take a photo of a car at 1 megapixel resolution, 1000x1000 pixels. Suppose you want to train the face detector to work at a resolution of 20x20 pixels (like OpenCV did). Take your 1000x1000 image and cut it into pieces of 20x20: you get 2,500 pieces (50x50). So from a single big image you have generated 2,500 negative examples. Now take the same big image and cut it into pieces of 10x10 pixels: you now have an additional 10,000 negative examples, each 10x10 pixels, which you can enlarge by a factor of 2 to force all the samples to the same size. You can repeat this process as much as you want, cutting the input image into pieces of different sizes. Mathematically speaking, if your image is of size NxN, you can generate O(N^4) negative examples from it by taking every possible rectangle inside it. (A rough sketch of this cutting procedure appears after this list.)
5. In step 4, I described how to take a single big image and cut it into a large number of negative examples. I must warn you that negative examples should not have high covariance, so I don't recommend taking only one image and generating 1 million negative examples from it. As a rule of thumb, create a library of 1000 images (or download random images from Google). Verify that none of the images contains faces. Crop about 10,000 negative examples from each image, and now you have a decent 10,000,000 negative examples. Train your detector. In the next step you can cut each image into ~50,000 (partially overlapping) pieces and thus enlarge your number of negatives to 50 million. You will start getting very good results with it.
6. Final enhancement step of the detector. When you already have a rather good detector, run it on many images. It will produce false detections (detecting a face where there is none). Gather all those false detections and add them to your negative set, then retrain the detector. The more such iterations you do, the better your detector becomes. (A sketch of this bootstrapping step also appears after this list.)
7. Real numbers - the best face detectors today (like Facebook's) use hundreds of millions of positive examples and billions of negatives. As positive examples they take not only frontal faces but faces in many orientations, with different facial expressions (smiling, shouting, angry, ...), different age groups, different genders, different races (Caucasian, black, Thai, Chinese, ...), with or without glasses/hats/sunglasses/make-up, etc. You will not be able to compete with the best, so don't get upset if your detector misses some faces.
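A rough sketch of the cutting procedure from step 4 (the file paths, window sizes, and non-overlapping stride are illustrative; overlapping strides would give far more patches, up to the O(N^4) bound mentioned above):

    import cv2

    img = cv2.imread("backgrounds/street.jpg", cv2.IMREAD_GRAYSCALE)
    H, W = img.shape
    target = 20                                # detector window size
    count = 0

    for win in (20, 10, 40):                   # cut at several scales
        for y in range(0, H - win + 1, win):
            for x in range(0, W - win + 1, win):
                patch = img[y:y + win, x:x + win]
                # force every sample to the same 20x20 size
                patch = cv2.resize(patch, (target, target))
                cv2.imwrite("negatives/neg_%07d.png" % count, patch)
                count += 1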
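And a sketch of the bootstrapping step from step 6: run the current detector on images verified to contain no faces, so every detection is a false positive that can be added to the negative set (the cascade file and directory names are hypothetical):

    import glob
    import cv2

    detector = cv2.CascadeClassifier("cascade_iter1.xml")
    count = 0

    for path in glob.glob("face_free_images/*.jpg"):
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        # anything detected here is a false positive by construction
        for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 3):
            crop = cv2.resize(gray[y:y + h, x:x + w], (20, 20))
            cv2.imwrite("hard_negatives/fp_%06d.png" % count, crop)
            count += 1
    # add hard_negatives/ to the negative set and retrain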
Good luck
I want to count the number of cars in aerial images of parking lots. After some research I believe that Haar cascade classifiers might be an option for this. An example of an image I will be using would be something similar to a zoomed-in image of a parking lot from Google Maps.
My current plan to accomplish this is to train a custom Haar Classifier using cars that I crop out of images in only one orientation (up and down), and then attempt recognition multiple times while rotating the image in 15 degree increments. My specific questions are:
Is using a Haar Classifier a good approach here or is there something better?
Assuming this is a good approach, when cropping cars from larger images for training data, would it be better to crop a larger area that could contain small portions of cars in adjacent parking spaces (although some training images would obviously include solo cars, cars with only one car next to them, etc.), or would it be best to crop the cars as close to their outline as possible?
Again assuming I take this approach, how can I avoid double counting cars? If a car was recognized in one orientation, I don't want it to be counted again. Is there some way that I could mark a car as counted and have it ignored?
I think in your case I would not go for Haar features; you should search for something that is rotation invariant.
I would recommend approaching this task in the following order:
Create a solid training/testing data set and have a good look at papers about getting good negative samples. In my experience, good negative samples have a great deal of influence on the resulting quality of your classifier. It makes your life a lot easier if all your samples are of the same image size. Add different types of negative samples: half cars, just pavement, grass, trees, people, etc.
Before starting your search for a classifier, make sure that you have your evaluation pipeline in order: do a 10-fold cross evaluation with the simplest Haar classifier possible. Now you have a baseline. Try to keep the software for all the features you test working, in case you find out that your data set needs adjustment. Ideally you can just execute a script and rerun your whole evaluation on the new data set automatically.
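The harness itself can be very small. A sketch with scikit-learn (cascade training itself lives outside this, so a stand-in linear classifier is used; features.npy and labels.npy are hypothetical files holding one feature vector and one car/background label per sample):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    X = np.load("features.npy")   # one feature vector per sample
    y = np.load("labels.npy")     # 1 = car, 0 = background

    # the point is the repeatable 10-fold protocol, not the model
    scores = cross_val_score(LinearSVC(), X, y, cv=10)
    print("10-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))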
The problem of counting cars multiple times will not be as important once you find a feature that is rotation invariant. Still, non-maximum suppression will be needed, because you might not get a good recognition with simple thresholding.
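A minimal non-maximum suppression sketch (greedy, IoU-based; the 0.3 threshold is a common starting point, not a tuned value):

    def iou(a, b):
        """Intersection over union of two (x1, y1, x2, y2) boxes."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / float(area_a + area_b - inter)

    def nms(boxes, scores, iou_threshold=0.3):
        """Keep the highest-scoring boxes, dropping ones that overlap them."""
        order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
        keep = []
        while order:
            i = order.pop(0)
            keep.append(i)
            order = [j for j in order if iou(boxes[i], boxes[j]) < iou_threshold]
        return keep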
As a tip, you might consider HOG features; I did get some good results on cars with them.
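For instance, extracting HOG features with scikit-image looks roughly like this (the patch size and HOG parameters are common defaults, not values tuned for aerial imagery; the resulting vectors could feed the evaluation harness sketched above):

    from skimage.feature import hog
    from skimage.io import imread
    from skimage.transform import resize

    patch = imread("car_patch.png", as_gray=True)   # hypothetical cropped sample
    patch = resize(patch, (64, 64))

    features = hog(
        patch,
        orientations=9,
        pixels_per_cell=(8, 8),
        cells_per_block=(2, 2),
    )
    # `features` is a 1-D vector ready for a linear SVM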