So I have been working on object detection since a lot of time, I have seen models like YOLO and Mask-RCNN use a single deep model to classify objects. Is it possible to make multiple small networks comparatively to identify each and every object separately to increase accuracy and what will be the effect on speed. I'm confused little bit.
If you look inside the "black box" of models such as Yolo and Mask-RCNN, you will realize that they already contain "multiple small networks", to a certain extent, regarding object detection.
Actually, Mask-RCNN is roughly a Faster-RCNN with an additional branch for segmentation. However, regarding detection, there is "somewhere" a classification layer that give a score for each class object (and a regression layer to estimate the box). All the object classes are estimated from a common representation (all the rest of the network) and only the last layer is specialized to each class. The point is nevertheless that there are advantages to compute the common representation jointly to all object classes, in particular because a positive sample for class i is usually also a negative sample for class j.
The idea is quite different for YOLO (v1) but "somewhere" at the end of the network, there is a stack of neurons layers. There is a layer for each object class and it computes the probability of presence of the corresponding object in a region of the image. Once again, the layers are computed from a "common representation" thus in that sense they are quite independent "classifiers". But once again, these "classifiers" benefit from the representation that is computed in jointly for all class objects.
To be honest, these explanation are quite approximate, in order to try to be clear. If you really want to understand, the best is to read the publication(s) of Yolo or that of mask R-CNN. However, it is quite technical and requires to understand quite well deep learning basics. There are also some good tutorials on the web.
This being said, you can modify the architecture of Yolo and Mask R-CNN to put more complex "small neural networks" in replacement of the existing layers. It may improve performance since you will have more neurons, but will be also more complex to train. As said in comments by #jakub, you can also train multiple specific network and add a layer to choose between all, but it would be a "new" architecture and I doubt that you will obtain a better compromise between performances and computational efficiency than Yolo or Mask R-CNN
you can ensemble multiple models. I found a this article which has a better explanation about ensembling multiple models
Related
I understand that Batch Normalisation helps in faster training by turning the activation towards unit Gaussian distribution and thus tackling vanishing gradients problem. Batch norm acts is applied differently at training(use mean/var from each batch) and test time (use finalized running mean/var from training phase).
Instance normalisation, on the other hand, acts as contrast normalisation as mentioned in this paper https://arxiv.org/abs/1607.08022 . The authors mention that the output stylised images should be not depend on the contrast of the input content image and hence Instance normalisation helps.
But then should we not also use instance normalisation for image classification where class label should not depend on the contrast of input image. I have not seen any paper using instance normalisation in-place of batch normalisation for classification. What is the reason for that? Also, can and should batch and instance normalisation be used together. I am eager to get an intuitive as well as theoretical understanding of when to use which normalisation.
Definition
Let's begin with the strict definition of both:
Batch normalization
Instance normalization
As you can notice, they are doing the same thing, except for the number of input tensors that are normalized jointly. Batch version normalizes all images across the batch and spatial locations (in the CNN case, in the ordinary case it's different); instance version normalizes each element of the batch independently, i.e., across spatial locations only.
In other words, where batch norm computes one mean and std dev (thus making the distribution of the whole layer Gaussian), instance norm computes T of them, making each individual image distribution look Gaussian, but not jointly.
A simple analogy: during data pre-processing step, it's possible to normalize the data on per-image basis or normalize the whole data set.
Credit: the formulas are from here.
Which normalization is better?
The answer depends on the network architecture, in particular on what is done after the normalization layer. Image classification networks usually stack the feature maps together and wire them to the FC layer, which share weights across the batch (the modern way is to use the CONV layer instead of FC, but the argument still applies).
This is where the distribution nuances start to matter: the same neuron is going to receive the input from all images. If the variance across the batch is high, the gradient from the small activations will be completely suppressed by the high activations, which is exactly the problem that batch norm tries to solve. That's why it's fairly possible that per-instance normalization won't improve network convergence at all.
On the other hand, batch normalization adds extra noise to the training, because the result for a particular instance depends on the neighbor instances. As it turns out, this kind of noise may be either good and bad for the network. This is well explained in the "Weight Normalization" paper by Tim Salimans at al, which name recurrent neural networks and reinforcement learning DQNs as noise-sensitive applications. I'm not entirely sure, but I think that the same noise-sensitivity was the main issue in stylization task, which instance norm tried to fight. It would be interesting to check if weight norm performs better for this particular task.
Can you combine batch and instance normalization?
Though it makes a valid neural network, there's no practical use for it. Batch normalization noise is either helping the learning process (in this case it's preferable) or hurting it (in this case it's better to omit it). In both cases, leaving the network with one type of normalization is likely to improve the performance.
Great question and already answered nicely. Just to add: I found this visualisation From Kaiming He's Group Norm paper helpful.
Source: link to article on Medium contrasting the Norms
I wanted to add more information to this question since there are some more recent works in this area. Your intuition
use instance normalisation for image classification where class label
should not depend on the contrast of input image
is partly correct. I would say that a pig in broad daylight is still a pig when the image is taken at night or at dawn. However, this does not mean using instance normalization across the network will give you better result. Here are some reasons:
Color distribution still play a role. It is more likely to be a apple than an orange if it has a lot of red.
At later layers, you can no longer imagine instance normalization acts as contrast normalization. Class specific details will emerge in deeper layers and normalizing them by instance will hurt the model's performance greatly.
IBN-Net uses both batch normalization and instance normalization in their model. They only put instance normalization in early layers and have achieved improvement in both accuracy and ability to generalize. They have open sourced code here.
IN provide visual and appearance in-variance and BN accelerate training and preserve discriminative feature.
IN is preferred in Shallow layer(starting layer of CNN) so remove appearance variation and BN is preferred in deep layers(last CNN layer) should be reduce in order to maintain discrimination.
I searched for the reason a lot but I didn't get it clear, May someone explain it in some more detail please?
In theory you do not have to attach a fully connected layer, you could have a full stack of convolutions till the very end, as long as (due to custom sizes/paddings) you end up with the correct number of output neurons (usually number of classes).
So why people usually do not do that? If one goes through the math, it will become visible that each output neuron (thus - prediction wrt. to some class) depends only on the subset of the input dimensions (pixels). This would be something among the lines of a model, which only decides whether an image is an element of class 1 depending on first few "columns" (or, depending on the architecture, rows, or some patch of the image), then whether this is class 2 on a few next columns (maybe overlapping), ..., and finally some class K depending on a few last columns. Usually data does not have this characteristic, you cannot classify image of the cat based on a few first columns and ignoring the rest.
However, if you introduce fully connected layer, you provide your model with ability to mix signals, since every single neuron has a connection to every single one in the next layer, now there is a flow of information between each input dimension (pixel location) and each output class, thus the decision is based truly on the whole image.
So intuitively you can think about these operations in terms of information flow. Convolutions are local operations, pooling are local operations. Fully connected layers are global (they can introduce any kind of dependence). This is also why convolutions work so well in domains like image analysis - due to their local nature they are much easier to train, even though mathematically they are just a subset of what fully connected layers can represent.
note
I am considering here typical use of CNNs, where kernels are small. In general one can even think of MLP as a CNN, where the kernel is of the size of the whole input with specific spacing/padding. However these are just corner cases, which are not really encountered in practise, and not really affecting the reasoning, since then they end up being MLPs. The whole point here is simple - to introduce global relations, if one can do it by using CNNs in a specific manner - then MLPs are not needed. MLPs are just one way of introducing this dependence.
Every fully connected (FC) layer has an equivalent convolutional layer (but not vice versa). Hence it is not necessary to add FC layers. They can always be replaced by convolutional layers (+ reshaping). See details.
Why do we use FC layers then?
Because (1) we are used to it (2) it is simpler. (1) is probably the reason for (2). For example, you would need to adjust the loss fuctions / the shape of the labels / add a reshape add the end if you used a convolutional layer instead of a FC layer.
I found this answer by Anil-Sharma on Quora helpful.
We can divide the whole network (for classification) into two parts:
Feature extraction:
In the conventional classification algorithms, like SVMs, we used to extract features from the data to make the classification work. The convolutional layers are serving the same purpose of feature extraction. CNNs capture better representation of data and hence we don’t need to do feature engineering.
Classification:
After feature extraction we need to classify the data into various classes, this can be done using a fully connected (FC) neural network. In place of fully connected layers, we can also use a conventional classifier like SVM. But we generally end up adding FC layers to make the model end-to-end trainable.
The CNN gives you a representation of the input image. To learn the sample classes, you should use a classifier (such as logistic regression, SVM, etc.) that learns the relationship between the learned features and the sample classes. Fully-connected layer is also a linear classifier such as logistic regression which is used for this reason.
Convolution and pooling layers extract features from image. So this layer doing some "preprocessing" of data. Fully connected layrs perform classification based on this extracted features.
I'm using a multiclass classifier (a Support Vector Machine, via One-Vs-All) to classify data samples. Let's say I currently have n distinct classes.
However, in the scenario I'm facing, it is possible that a new data sample may belong to a new class n+1 that hasn't been seen before.
So I guess you can say that I need a form of Online Learning, as there is no distinct training set in the beginning that suits all data appearing later. Instead I need the SVM to adapt dynamically to new classes that may appear in the future.
So I'm wondering about if and how I can...
identify that a new data sample does not quite fit into the existing classes but instead should result in creating a new class.
integrate that new class into the existing classifier.
I can vaguely think of a few ideas that might be approaches to solve this problem:
If none of the binary SVM classifiers (as I have one for each class in the OVA case) predicts a fairly high probability (e.g. > 0.5) for the new data sample, I could assume that this new data sample may represent a new class.
I could train a new binary classifier for that new class and add it to the multiclass SVM.
However, these are just my naive thoughts. I'm wondering if there is some "proper" approach for this instead, e.g. using a Clustering algorithms to find all classes.
Or maybe my approach of trying to use an SVM for this is not even appropriate for this kind of problem?
Help on this is greatly appreciated.
As in any other machine learning problem, if you do not have a quality criterion, you suck.
When people say "classification", they have supervised learning in mind: there is some ground truth against which you can train and check your algorithms. If new classes can appear, this ground truth is ambiguous. Imagine one class is "horse", and you see many horses: black horses, brown horses, even white ones. And suddenly you see a zebra. Whoa! Is it a new class or just an unusual horse? The answer will depend on how you are going to use your class labels. The SVM itself cannot decide, because SVM does not use these labels, it only produces them. The decision is up to a human (or to some decision-making algorithm which knows what is "good" and "bad", that is, has its own "loss function" or "utility function").
So you need a supervisor. But how can you assist this supervisor? Two options come to mind:
Anomaly detection. This can help you with early occurences of new classes. After the very first zebra your algorithm sees it can raise an alarm: "There is something unusual!". For example, in sklearn various algorithms from random forest to one-class SVM can be used to detect unusial observations. Then your supervisor can look at them and decide whether they deserve to form an entirely new class.
Clustering. It can help you to make decision about splitting your classes. For example, after the first zebra, you decided it is not worth making a new class. But over time, your algorithm has accumulated dozens of their images. So if you run a clustering algorithm on all the observations labeled as "horses", you might end up with two well-separated clusters. And it will be again up to the supervisor to decide, whether the striped horses should be detached from the plain ones into a new class.
If you want this decision to be purely authomatic, you can split classes if the ratio of within-cluster mean distance to between-cluster distance is low enough. But it will work well only if you have a good distance metric in the first place. And what is "good" is again defined by how you use your algorithms and what your ultimate goal is.
I am a deep-learning newbie and working on creating a vehicle classifier for images using Caffe and have a 3-part question:
Are there any best practices in organizing classes for training a
CNN? i.e. number of classes and number of samples for each class?
For example, would I be better off this way:
(a) Vehicles - Car-Sedans/Car-Hatchback/Car-SUV/Truck-18-wheeler/.... (note this could mean several thousand classes), or
(b) have a higher level
model that classifies between car/truck/2-wheeler and so on...
and if car type then query the Car Model to get the car type
(sedan/hatchback etc)
How many training images per class is a typical best practice? I know there are several other variables that affect the accuracy of
the CNN, but what rough number is good to shoot for in each class?
Should it be a function of the number of classes in the model? For
example, if I have many classes in my model, should I provide more
samples per class?
How do we ensure we are not overfitting to class? Is there way to measure heterogeneity in training samples for a class?
Thanks in advance.
Well, the first choice that you mentioned corresponds to a very challenging task in computer vision community: fine-grained image classification, where you want to classify the subordinates of a base class, say Car! To get more info on this, you may see this paper.
According to the literature on image classification, classifying the high-level classes such as car/trucks would be much simpler for CNNs to learn since there may exist more discriminative features. I suggest to follow the second approach, that is classifying all types of cars vs. truck and so on.
Number of training samples is mainly proportional to the number of parameters, that is if you want to train a shallow model, much less samples are required. That also depends on your decision to fine-tune a pre-trained model or train a network from scratch. When sufficient samples are not available, you have to fine-tune a model on your task.
Wrestling with over-fitting has been always a problematic issue in machine learning and even CNNs are not free of them. Within the literature, some practical suggestions have been introduced to reduce the occurrence of over-fitting such as dropout layers and data-augmentation procedures.
May not included in your questions, but it seems that you should follow the fine-tuning procedure, that is initializing the network with pre-computed weights of a model on another task (say ILSVRC 201X) and adapt the weights according to your new task. This procedure is known as transfer learning (and sometimes domain adaptation) in community.
I'm building a recognizer of antibodies in blod-cells images. It is based on libsvm. The prototype works well when it comes to recognize an instance which belongs to one of trained classes.
But when I give any image even not containing blod-cells (e.g. Microscope had bad offset/focus), it still suggests one of the classes known by model.
I first considered to implement class "Unknown" but I'm affraid training it with all the noise images would make the model performance worse.
So my idea is to check, if one/several feature(s) of an instance to be recognized is out of value-range and discard it.
Is it a good method?
If yes, how should the cut-off be selected (e.g. in terms of standard deviations)?
Thank you very much!
In problems with "possible non class samples" the most obvious solution seems to be create a one-class SVM (outlier detection algorithm) in one of two ways:
Train two one-class SVMs (oner per class) and discard samples marked by both models as "outliers"
Train one one-class SVM on the whole dataset (instances of both classes) and discard data marked as outlier
Suggested approach with "out of range check" is good as long as there is an obvios threshold value - as you are asking here what would be the best choice - it means that it is not a good way. If you cannot (as an expert) figure out it by yourself, it seems much better and safer option to train outlier detection method as suggested before, which will actualy do the same thing, but in the automatic fashion (as it will find rules for discarding "bad data" without training on any "bad images").