I want to build a face detector/classifier, i.e. train a network that detects whether a face is present in an image/video.
I understand the basic concept, but I am struggling with the choice of the number of classes.
Initially, I thought that two classes (with face / without face) would be sufficient. However, I was unsure which data I should use for the class 'without face'. So I threw together datasets of equipment, plants and animals, which left the classes very unbalanced, and that is apparently not good.
Then I thought it would be better to use as many classes as possible.
But again, I am unsure. What would be the best/common approach to this problem?
You can experiment with any number of samples and different images for the negative class. If the equipment/plants/places datasets you have are imbalanced, you can try to subsample, e.g. pick 100 images from each.
Just don't make the negative class much larger than the number of face images you have. The rest is up to experimentation.
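The subsampling idea above can be sketched in a few lines of Python. This is only an illustration: the file names, the source names and the helper subsample_negatives are made up, and the cap of 100 per source is simply the number mentioned above.

```python
import random

def subsample_negatives(negatives_by_source, per_source, seed=0):
    """Pick at most `per_source` items from each negative source."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    sampled = []
    for source in sorted(negatives_by_source):
        items = negatives_by_source[source]
        k = min(per_source, len(items))  # small sources contribute everything
        sampled.extend(rng.sample(items, k))
    return sampled

# Hypothetical file lists of very different sizes.
negatives_by_source = {
    "equipment": [f"equipment_{i}.jpg" for i in range(5000)],
    "plants": [f"plants_{i}.jpg" for i in range(300)],
    "animals": [f"animals_{i}.jpg" for i in range(80)],
}

negatives = subsample_negatives(negatives_by_source, per_source=100)
print(len(negatives))  # 100 + 100 + 80 = 280
```

You would then compare that count with the number of face images you have and adjust per_source until the two classes are roughly comparable.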
I'm trying to figure out a question. I'm working with a big dataset of pictures, and almost all of the pictures have just one person in them. Every class should represent a different person, but for some reason roughly 1 in 1000 pictures in every class shows a face that does not belong to that class (it is not the same person as in the other pictures of that class). In fact, the mislabeled person does not belong to any class.
Here is my question: what happens during the learning process? Does the convnet learn that the face is not useful for the task, or does it produce some kind of error? I ask because I need to know whether I have to remove these "noisy" pictures for better performance, or whether the error would be negligible. Thank you all in advance.
Mislabeled targets will definitely add noise to your data. With a significant amount of incorrectly labeled data, training becomes much more unstable. In your case, though, with roughly 1 in 1000 labels wrong, it won't affect training much unless you are using weighted classes.
By the way, if you are trying to create a model that classifies a person by face image, you might want to add other features, such as eye position, skin color, etc.
I was wondering whether it is possible to combine images and some "bios" data to find patterns. For example, if I want to know whether an image shows a cat or a dog and I have:
Enough image data to train my model
Enough "bios" data like:
size of the animal
size of the tail
weight
height
Thanks!
Are you looking for a simple yes or no answer? In that case, yes. You are in complete control over building your models, which includes what data you make them process and what predictions you get.
If you actually wanted to ask how to do it, it will depend on the specific datasets and application, but one way would be to have two models. One would be specialized in determining the output label (cat or dog) from the image, so perhaps some kind of simple CNN. The other would process the biological measurements (size, weight, and so on) and find patterns in those. Then, at the end, you could either have a non-AI evaluator that naively combines these two predictions into one, or feed the outputs of both models into a simple neural network that learns patterns from them.
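As a rough sketch of the "naive, non-AI evaluator" option, the two models' class probabilities could simply be averaged. Everything here is an assumption for illustration: the function combine_predictions, the equal weighting and the example probabilities are not taken from any particular library or dataset.

```python
def combine_predictions(p_image, p_other, image_weight=0.5):
    """Weighted average of two class-probability dicts with the same labels."""
    return {
        label: image_weight * p_image[label]
        + (1.0 - image_weight) * p_other[label]
        for label in p_image
    }

# Hypothetical outputs of the two models for one animal.
p_image = {"cat": 0.70, "dog": 0.30}  # from the image model
p_other = {"cat": 0.40, "dog": 0.60}  # from the size/weight model

combined = combine_predictions(p_image, p_other, image_weight=0.5)
prediction = max(combined, key=combined.get)
print(combined, prediction)
```

A learned combiner would replace the fixed image_weight with a small network trained on the two models' outputs. Note that this requires each image to be paired with the measurements of the same animal.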
That is just one possible way to do it though and, as I said, the exact implementation will depend on a lot of other factors. How are the two datasets labeled? Are the data connected to each other? That is, for each picture, do you have measurements that belong to that specific image? Or do you just have one dataset of pictures and a separate dataset of biological information?
You will probably also want to consider whether this approach is necessary at all. Current models can classify images very accurately, on some benchmarks better than humans. Unless this is an exercise in creating a more complex model, this seems like overkill.
PS: I wouldn't use the term "bios" in this context. I believe it is not a very common usage, and here on SO it will mostly confuse people into thinking you mean the actual BIOS.
I am using TensorFlow (object detection) on my own dataset for drone recognition, with only one class named 'drone'. After about 30000 training steps, the resulting model detects drones with very high accuracy. However, I have a problem: I used the ssd_inception_v2_coco model and its fine_tune_checkpoint from the model zoo, and in real-time detection it sometimes detects a human face as a drone, even though the two objects look very different. I think this is because of the old checkpoint.
How can I prevent the detection of objects that look very different from my drone object, such as humans, dogs, cats...? Or can someone explain what the problem is here?
Sorry for my bad English
Even if you train an SSD for one class, it automatically creates another class called background. The background is trained using the regions of the training images that are not labeled as the desired classes (in your case, drone).
An easy way out is to add training images that contain, in the same scene, both drones and the things that you don't want to recognize as drones. Doing this and then increasing the number of epochs should improve the precision.
If your application has certain objects that frequently occur together with drones, another possibility is to actually train the network on those objects too. This will increase your training workload, but improve the accuracy.
Some implementations of SSD have an option for hard negative mining, so that the background regions the model gets most wrong (for example, false detections with a high loss) are specifically emphasized during training. If you are familiar with the code, you might want to check if this is available.
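To give an idea of what hard negative mining does, here is a minimal plain-Python sketch. It is not the TensorFlow object-detection implementation. The function name, the per-box losses and the 3:1 negative-to-positive ratio are assumptions chosen for illustration.

```python
def mine_hard_negatives(losses, is_positive, neg_pos_ratio=3):
    """Keep all positives plus only the highest-loss negatives."""
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Cap the number of negatives relative to the number of positives.
    num_keep = min(len(negatives), neg_pos_ratio * max(len(positives), 1))
    hardest = sorted(negatives, key=lambda i: losses[i], reverse=True)[:num_keep]
    return positives, sorted(hardest)

# Hypothetical per-box classification losses; box 0 is a labeled drone,
# the rest are background boxes.
losses = [0.9, 0.05, 2.1, 0.01, 0.7, 0.02, 1.5, 0.03]
is_positive = [True, False, False, False, False, False, False, False]

positives, hard_negatives = mine_hard_negatives(losses, is_positive)
print(positives, hard_negatives)  # [0] [2, 4, 6]
```

Only the selected boxes would contribute to the classification loss, so confidently wrong background boxes (such as a face scored as a drone) get more attention than easy background.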
I am just a beginner in ML. I have gone through several websites for the basics, and a lot of things are still unclear to me. The question below is one of them.
With a CNN (Convolutional Neural Network), is it required to tell the system in advance how many classes are available as a result?
I was going through the URL below and this question came up.
https://www.youtube.com/watch?v=2-Ol7ZB0MmU
Yes. The final layer of the CNN depends on the quantity of output classes you have: one element for each class. A CNN is built to handle a particular problem; this includes knowing the full shapes of the input and output.
For instance, the ILSVRC image data set comes with classified images, 1000 classes in all. The topologies that learn on this data set have 1000 elements in the final layer.
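To make the "one element for each class" point concrete, here is a tiny framework-free sketch. The class names and the raw scores (logits) are invented. In a real CNN the logits would come from the final layer, whose size is fixed to the number of classes when the network is built.

```python
import math

def softmax(logits):
    """Turn raw final-layer scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class_names = ["cat", "dog", "bird"]  # hypothetical, fixed before training
logits = [2.0, 0.5, -1.0]             # one score per class

probs = softmax(logits)
predicted = class_names[probs.index(max(probs))]
print(len(probs), predicted)  # 3 cat
```

Adding a fourth class later would mean changing the size of that final layer and retraining at least that part of the network.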
Does that solve the problem?
I want to classify the input as one of 3 possibilities. Is it better to use 3 networks with one output each, or 1 network with 3 outputs?
(i.e. 3 networks that each output 0 or 1, or 1 network that outputs a one-hot vector of length 3 such as [1,0,0])
Does the answer change depending on how complex the incoming data is to classify?
At what number of outputs does it make sense to partition the networks (if ever)? For example, if I want to classify into 20 groups, does it make a difference?
I would say it would make more sense to use a single network with multiple outputs.
The main reason is that hidden layers (I'm assuming you'll have at least one hidden layer) can be interpreted as transforming the data from the original space (feature space) into a different space that is more suitable for the task (classification in your case). For example, when training a network to recognize faces from raw pixels, it might use a hidden layer to first detect simple shapes such as small lines based on pixels, then use another hidden layer to detect more complex shapes such as eyes/noses based on the lines from the first layer, etc. (it may not be entirely as ''clean'' as this, but this is an easy-to-understand example).
Such a transformation that a network can learn is typically useful for the classification task, regardless of what class the specific example has. For example, it is useful to be able to detect eyes in images regardless of whether or not the actual image contains a face; if you do indeed detect two eyes, you can classify it as a face, and otherwise you classify it as not being a face. In both cases, you were looking for eyes.
So, by splitting up into multiple networks, you may end up learning quite similar patterns in all networks anyway. Then you might as well have saved yourself the computational effort and just learned it once.
Another disadvantage of splitting up into multiple networks would be that you would probably cause your dataset to become imbalanced (or more imbalanced if it already is imbalanced). Suppose you have three classes, with exactly 1/3 of the dataset belonging to each class. If you use three networks for three binary classification tasks, each of them suddenly sees 1/3 ''1'' labels and 2/3 ''0'' labels. A network may then become biased towards predicting 0 everywhere, since that is the majority class in each of the three separate problems.
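The 1/3 versus 2/3 arithmetic is easy to check with a few lines of Python. The labels here are a made-up, perfectly balanced dataset of 300 examples, and one_vs_rest_targets is just a helper name chosen for this sketch.

```python
from collections import Counter

labels = ["a", "b", "c"] * 100  # hypothetical balanced dataset, 100 per class

def one_vs_rest_targets(labels, positive_class):
    """Binary targets that one of the three separate networks would see."""
    return [1 if label == positive_class else 0 for label in labels]

fractions = {}
for cls in ("a", "b", "c"):
    counts = Counter(one_vs_rest_targets(labels, cls))
    fractions[cls] = (counts[1] / len(labels), counts[0] / len(labels))

print(fractions)  # each class: about 0.333 ones and 0.667 zeros
```

With more classes the effect grows: with 20 balanced classes, each binary problem would have only 1/20 positive labels.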
Note that this is all based on my intuition; the best solution if you have time would be to simply try both approaches and test! I don't think I have ever seen someone using multiple networks for a single classification task in practice though, so if you only have time for one approach I'd recommend going for a single network.
I think the only case where it would really make sense to use multiple networks would be if you actually want to predict multiple unrelated values (or at least values that are not strongly related). For example, suppose that, given images, you want to 1) predict whether or not there is a dog in the image, and 2) predict whether it is a photograph or a painting. Then it may be better to use two networks with two outputs each, instead of a single network with four outputs.