Instance Normalisation vs Batch Normalisation

I understand that Batch Normalisation helps in faster training by turning the activations towards a unit Gaussian distribution and thus tackling the vanishing gradients problem. Batch norm is applied differently at training time (use the mean/var of each batch) and at test time (use the finalized running mean/var from the training phase).
Instance normalisation, on the other hand, acts as contrast normalisation, as mentioned in this paper: https://arxiv.org/abs/1607.08022. The authors mention that the output stylised images should not depend on the contrast of the input content image, and hence instance normalisation helps.
But then should we not also use instance normalisation for image classification, where the class label should not depend on the contrast of the input image? I have not seen any paper using instance normalisation in place of batch normalisation for classification. What is the reason for that? Also, can and should batch and instance normalisation be used together? I am eager to get an intuitive as well as theoretical understanding of when to use which normalisation.

Definition
Let's begin with the strict definition of both:
Batch normalization
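Using the notation of the instance normalization paper ($x_{tijk}$ is the value at channel $i$ and spatial position $(j,k)$ of the $t$-th image in a batch of $T$, with feature maps of size $H \times W$), batch norm computes one mean and variance per channel $i$, over the whole batch and all spatial locations:

$$y_{tijk} = \frac{x_{tijk} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}, \qquad \mu_i = \frac{1}{HWT}\sum_{t=1}^{T}\sum_{l=1}^{W}\sum_{m=1}^{H} x_{tilm}, \qquad \sigma_i^2 = \frac{1}{HWT}\sum_{t=1}^{T}\sum_{l=1}^{W}\sum_{m=1}^{H} \left(x_{tilm} - \mu_i\right)^2$$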
Instance normalization
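Instance norm computes a separate mean and variance for every (image, channel) pair $(t, i)$, over the spatial locations only:

$$y_{tijk} = \frac{x_{tijk} - \mu_{ti}}{\sqrt{\sigma_{ti}^2 + \epsilon}}, \qquad \mu_{ti} = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H} x_{tilm}, \qquad \sigma_{ti}^2 = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H} \left(x_{tilm} - \mu_{ti}\right)^2$$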
As you can notice, they are doing the same thing, except for the number of input tensors that are normalized jointly. The batch version normalizes all images across the batch and spatial locations (in the CNN case; in the ordinary fully-connected case it is different); the instance version normalizes each element of the batch independently, i.e., across spatial locations only.
In other words, where batch norm computes one mean and std dev (thus making the distribution of the whole layer Gaussian), instance norm computes T of them, making each individual image distribution look Gaussian, but not jointly.
A simple analogy: during data pre-processing step, it's possible to normalize the data on per-image basis or normalize the whole data set.
Credit: the formulas are from here.
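If it helps to see the same thing in code, here is a minimal numpy sketch of the difference (illustration only; it ignores the learnable scale/shift parameters and the running statistics used at test time):

```python
import numpy as np

x = np.random.randn(8, 3, 32, 32)  # a batch: (T images, C channels, H, W)
eps = 1e-5

# Batch norm: one mean/var per channel, computed over the batch AND spatial dims
mu_bn  = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
var_bn = x.var(axis=(0, 2, 3), keepdims=True)
y_bn   = (x - mu_bn) / np.sqrt(var_bn + eps)

# Instance norm: one mean/var per (image, channel) pair, spatial dims only
mu_in  = x.mean(axis=(2, 3), keepdims=True)      # shape (T, C, 1, 1)
var_in = x.var(axis=(2, 3), keepdims=True)
y_in   = (x - mu_in) / np.sqrt(var_in + eps)
```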
Which normalization is better?
The answer depends on the network architecture, in particular on what is done after the normalization layer. Image classification networks usually stack the feature maps together and wire them to the FC layer, which shares weights across the batch (the modern way is to use a CONV layer instead of FC, but the argument still applies).
This is where the distribution nuances start to matter: the same neuron is going to receive the input from all images. If the variance across the batch is high, the gradient from the small activations will be completely suppressed by the high activations, which is exactly the problem that batch norm tries to solve. That's why it's fairly possible that per-instance normalization won't improve network convergence at all.
On the other hand, batch normalization adds extra noise to the training, because the result for a particular instance depends on the neighboring instances. As it turns out, this kind of noise may be either good or bad for the network. This is well explained in the "Weight Normalization" paper by Tim Salimans et al., which names recurrent neural networks and reinforcement-learning DQNs as noise-sensitive applications. I'm not entirely sure, but I think that the same noise-sensitivity was the main issue in the stylization task, which instance norm tried to fight. It would be interesting to check whether weight norm performs better for this particular task.
Can you combine batch and instance normalization?
Though it makes a valid neural network, there's no practical use for it. Batch normalization noise is either helping the learning process (in this case it's preferable) or hurting it (in this case it's better to omit it). In both cases, leaving the network with one type of normalization is likely to improve the performance.

Great question and already answered nicely. Just to add: I found the visualisation from Kaiming He's Group Norm paper helpful.
Source: an article on Medium contrasting the norms.

I wanted to add more information to this question since there are some more recent works in this area. Your intuition
use instance normalisation for image classification where class label
should not depend on the contrast of input image
is partly correct. I would say that a pig in broad daylight is still a pig when the image is taken at night or at dawn. However, this does not mean that using instance normalization across the whole network will give you better results. Here are some reasons:
Color distribution still plays a role. It is more likely to be an apple than an orange if it has a lot of red.
In later layers, you can no longer think of instance normalization as acting as contrast normalization. Class-specific details emerge in deeper layers, and normalizing them per instance will hurt the model's performance greatly.
IBN-Net uses both batch normalization and instance normalization in their model. They only put instance normalization in early layers and have achieved improvement in both accuracy and ability to generalize. They have open sourced code here.
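As a rough illustration of the idea (a toy sketch, not IBN-Net's exact block - see their repository for the real implementation), one way to mix the two is to instance-normalize half of the channels and batch-normalize the rest in the early layers:

```python
import torch
import torch.nn as nn

class ToyIBN(nn.Module):
    """Toy layer in the spirit of IBN-Net: IN on half the channels, BN on the rest."""
    def __init__(self, channels):
        super().__init__()
        self.half = channels // 2
        self.inorm = nn.InstanceNorm2d(self.half, affine=True)
        self.bnorm = nn.BatchNorm2d(channels - self.half)

    def forward(self, x):
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.inorm(a), self.bnorm(b)], dim=1)

layer = ToyIBN(64)
out = layer(torch.randn(8, 64, 32, 32))   # same shape as the input
```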

IN provides visual and appearance invariance, while BN accelerates training and preserves discriminative features.
IN is therefore preferred in shallow layers (the early layers of a CNN) to remove appearance variation, while BN is preferred in deep layers (the last CNN layers) to maintain discrimination.

Related

How to deal with dataset of different features?

I am working to create an MLP model on a CEA classification dataset (binary classification). Each sample contains 4 different features, such as resistance and other values, each in its own range (resistance in the hundreds, another in micros, etc.). I am still new to machine learning and this is my first real model to build. How can I deal with such data? I have tried feeding each sample to the neural network with a sigmoid activation function, but I am not getting accurate results. My assumption is that I should scale the data? If so, what are some useful resources to look at, since I do not quite understand when scaling is required.
Scaling your data can be an important step in building a machine-learning model, especially when working with neural networks. Scaling can help to ensure that all of the features in your dataset are on a similar scale, which can make it easier for the model to learn.
There are a few different ways to scale your data, such as normalization and standardization. Normalization is the process of scaling the data so that it has a minimum value of 0 and a maximum value of 1. Standardization is the process of scaling the data so that it has a mean of 0 and a standard deviation of 1.
When working with your CEA classification dataset, it might be helpful to try both normalization and standardization to see which one works better for your specific dataset. You can use the scikit-learn library's preprocessing classes MinMaxScaler() and StandardScaler() for normalization and standardization, respectively.
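For example (a minimal sketch; the feature values are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# 3 samples, 4 features on very different scales (values are made up)
X = np.array([[150.0, 3e-6, 0.7, 1200.0],
              [400.0, 9e-6, 0.2,  800.0],
              [220.0, 5e-6, 0.9, 1500.0]])

X_minmax   = MinMaxScaler().fit_transform(X)    # each feature scaled to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # each feature to mean 0, std 1
```

In practice, fit the scaler on the training set only (fit_transform on train, transform on test) so no information from the test set leaks into the preprocessing.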
Additionally, it might be helpful to try different activation functions, such as ReLU or LeakyReLU, to see if they lead to more accurate results. Also, you can try adding more layers and neurons in your neural network to see if it improves the performance.
It's also important to remember that feature engineering, which includes the process of selecting the most important features, can be more important than scaling.

Why do we use fully-connected layer at the end of CNN?

I have searched a lot for the reason but didn't find a clear answer. Could someone explain it in some more detail, please?
In theory you do not have to attach a fully connected layer, you could have a full stack of convolutions till the very end, as long as (due to custom sizes/paddings) you end up with the correct number of output neurons (usually number of classes).
So why do people usually not do that? If one goes through the math, it becomes visible that each output neuron (thus, the prediction w.r.t. some class) depends only on a subset of the input dimensions (pixels). This would be something along the lines of a model which decides whether an image belongs to class 1 depending only on the first few "columns" (or, depending on the architecture, rows, or some patch of the image), then whether it is class 2 based on the next few columns (maybe overlapping), ..., and finally some class K based on the last few columns. Data usually does not have this characteristic: you cannot classify an image of a cat based on the first few columns while ignoring the rest.
However, if you introduce a fully connected layer, you give your model the ability to mix signals: since every single neuron has a connection to every single one in the next layer, there is now a flow of information between each input dimension (pixel location) and each output class, so the decision is truly based on the whole image.
So intuitively you can think about these operations in terms of information flow. Convolutions are local operations, and pooling is a local operation. Fully connected layers are global (they can introduce any kind of dependence). This is also why convolutions work so well in domains like image analysis - due to their local nature they are much easier to train, even though mathematically they are just a subset of what fully connected layers can represent.
Note
I am considering here the typical use of CNNs, where kernels are small. In general, one can even think of an MLP as a CNN where the kernel is the size of the whole input, with specific spacing/padding. However, these are just corner cases which are not really encountered in practice, and they do not really affect the reasoning, since then they end up being MLPs. The whole point here is simple: to introduce global relations. If one can do that by using CNNs in a specific manner, then MLPs are not needed; MLPs are just one way of introducing this dependence.
Every fully connected (FC) layer has an equivalent convolutional layer (but not vice versa). Hence it is not necessary to add FC layers. They can always be replaced by convolutional layers (+ reshaping). See details.
Why do we use FC layers then?
Because (1) we are used to it and (2) it is simpler. (1) is probably the reason for (2). For example, you would need to adjust the loss function / the shape of the labels / add a reshape at the end if you used a convolutional layer instead of an FC layer.
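A small PyTorch sketch of that equivalence (the shapes are hypothetical: a 512×7×7 feature map and 10 classes):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 7, 7)             # feature maps from the last conv block

fc   = nn.Linear(512 * 7 * 7, 10)          # the usual flatten + FC head
conv = nn.Conv2d(512, 10, kernel_size=7)   # a conv whose kernel covers the whole map

# Copy the FC weights into the conv kernel so both compute the same function
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(10, 512, 7, 7))
    conv.bias.copy_(fc.bias)

out_fc   = fc(x.flatten(1))                # shape (1, 10)
out_conv = conv(x).flatten(1)              # shape (1, 10)
print(torch.allclose(out_fc, out_conv, atol=1e-5))  # True (up to float error)
```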
I found this answer by Anil-Sharma on Quora helpful.
We can divide the whole network (for classification) into two parts:
Feature extraction:
In the conventional classification algorithms, like SVMs, we used to extract features from the data to make the classification work. The convolutional layers are serving the same purpose of feature extraction. CNNs capture better representation of data and hence we don’t need to do feature engineering.
Classification:
After feature extraction, we need to classify the data into various classes; this can be done using a fully connected (FC) neural network. In place of fully connected layers, we can also use a conventional classifier like an SVM. But we generally end up adding FC layers to make the model end-to-end trainable.
The CNN gives you a representation of the input image. To learn the sample classes, you should use a classifier (such as logistic regression, SVM, etc.) that learns the relationship between the learned features and the sample classes. A fully-connected layer is also a linear classifier, such as logistic regression, and is used for this reason.
Convolution and pooling layers extract features from the image, so these layers do some "preprocessing" of the data. Fully connected layers then perform classification based on these extracted features.

Can a neural network be trained while it changes in size?

Are there known methods of continuous training and graceful degradation of a neural net while it shrinks or grows in size (by number of nodes, connections, whatever)?
To the best of my memory, everything I've read about neural networks is from a static perspective. You define the net and then train it.
If there is some neural network X with N nodes (neurons, whatever), is it possible to train the network (X) so that while N increases or decreases, the network is still useful and capable of performing?
In general, changing the network architecture (adding new layers, adding more neurons into existing layers) once the network has already been trained makes sense and is a rather common operation in the Deep Learning domain. One example is dropout - during training, half of the neurons are randomly switched off completely and only the remaining half participates in training during a specific iteration (each iteration, or 'epoch' as it is often named, has a different random list of switched-off neurons). Another example is transfer learning - where you train the network on one set of input data, cut off part of the outgoing layers, replace them with new layers and re-train the model on another dataset.
To better explain why this makes sense, let's step back for a moment. In deep networks, where you have lots of hidden layers, each layer learns some abstraction from the incoming data. Each additional layer uses the abstract representations learned by the previous layer and builds upon them, combining such abstractions to form a higher level of data representation. For instance, you could be trying to classify images with a DNN. The first layer will learn rather simple concepts from images - like edges or points in the data. The next layer could combine these simple concepts to learn primitives - like triangles, circles or squares. The next layer could take it further and combine these primitives to represent objects which you could find in images, like 'a car' or 'a house', and using softmax it calculates the probabilities of the answer you are looking for (what to actually output). I need to mention that these facts and learned representations can actually be checked. You could visualize the activations of your hidden layers and see what they learned. For example, this was done with Google's project 'inceptionism'. With that in mind, let's get back to what I mentioned earlier.
Dropout is used to improve the generalization of the network. It forces each neuron to 'not be so sure' that some pieces of information from the previous layer will be available, and makes it try to learn representations that rely on less favorable and informative pieces of abstraction from the previous layer. It forces it to consider all of the representations from the previous layer to make decisions, instead of putting all of its weight on a couple of neurons it 'likes most of all'. By doing this, the network is usually better prepared for new data where the input differs from the training set.
Q: "As far as you're aware is the quality of the stored knowledge (whatever training has done to the net) still usable following the dropout? Maybe random halves could be substituted by random 10ths with a single 10th dropping, that might result in less knowledge loss during the transition period."
A: Unfortunately, I can't properly answer why precisely half of the neurons are switched off and not 10% (or any other number). Maybe there is an explanation, but I haven't seen it. In general, it just works and that's it.
Also, I need to mention that the task of dropout is to ensure that each neuron doesn't rely on just a few of the neurons from the previous layer and is ready to make some decision even if the neurons which usually helped it to make the correct decision are not available. This is used for generalization only and helps the network to better cope with data it hasn't seen previously; nothing else is achieved with dropout.
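A minimal PyTorch illustration of that behaviour (dropout is only active in training mode and becomes the identity at evaluation time):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each activation is zeroed with probability 0.5
x = torch.ones(1, 10)

drop.train()
print(drop(x))   # roughly half the entries are 0, survivors are scaled by 1/(1-p) = 2
drop.eval()
print(drop(x))   # identity at test time: all ones
```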
Now let's consider transfer learning again. Consider that you have a network with 4 layers. You train it to recognize specific objects in pictures (cat, dog, table, car, etc.). Then you cut off the last layer, replace it with three additional layers, and now you train the resulting 6-layered network on a dataset which, for instance, writes short sentences about what is shown in the image ('a cat is on the car', 'house with windows and a tree nearby', etc.). What did we do with such an operation? Our original 4-layer network was capable of understanding whether some specific object is in the image we feed it. Its first 3 layers learned good representations of the images - the first layer learned about possible edges or points or some extremely primitive geometric shapes in images. The second layer learned some more elaborate geometric figures like 'circle' or 'square'. The last layer knows how to combine them to form some higher-level objects - 'car', 'cat', 'house'. Now, we can just re-use this good representation which we learned in a different domain and just add several more layers. Each of them will use abstractions from the last (3rd) layer of the original network and learn how to combine them to create meaningful descriptions of images. While you perform learning on the new dataset with images as input and sentences as output, it will adjust the first 3 layers which we got from the original network, but these adjustments will be mostly minor, while the 3 new layers will be adjusted by learning significantly. What we achieve with transfer learning is:
1) We can learn much better data representations. We can create a network which is very good at a specific task and then build upon that network to perform something different.
2) We can save training time - the first layers of the network will already be trained well enough so that the layers closer to the output already receive rather good data representations. So the training should finish much faster when using pre-trained first layers.
So the bottom line is that pre-training a network and then re-using part of it or the whole network in another network makes perfect sense and is not something uncommon.
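A hedged PyTorch sketch of that kind of re-use (torchvision's ResNet-18 is used only as an example of a pre-trained backbone, and the new head is a made-up stand-in; a real captioning head as in the answer above would be more involved):

```python
import torch.nn as nn
from torchvision import models

# Pre-trained backbone (recent torchvision versions)
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Optionally freeze the early layers so mostly the new head gets adjusted
for param in backbone.parameters():
    param.requires_grad = False

# Cut off the original classification layer and attach a new head for the new task
num_features = backbone.fc.in_features
backbone.fc = nn.Sequential(
    nn.Linear(num_features, 256),
    nn.ReLU(),
    nn.Linear(256, 100),   # hypothetical number of outputs for the new task
)
# Now train `backbone` on the new dataset as usual; only the new head has
# requires_grad=True, so the pre-trained representations are largely preserved.
```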
This is something I have seen in the likes of this video...
https://youtu.be/qv6UVOQ0F44
There are links to further resources in the video description.
It is based on a process called NEAT (NeuroEvolution of Augmenting Topologies).
It uses a genetic algorithm and evolutionary process to design and evolve a neural net from scratch with no prior assumptions of structure or complexity of the neural net.
I believe this is what you are looking for.

How does pre-training improve classification in neural networks?

Many of the papers I have read so far mention that "pre-training the network could improve computational efficiency in terms of back-propagating errors", and that this can be achieved using RBMs or autoencoders.
If I have understood correctly, autoencoders work by learning the identity function, and if they have fewer hidden units than the size of the input data, then they also do compression. BUT what does this have to do with improving computational efficiency in propagating the error signal backwards? Is it because the weights of the pre-trained hidden units do not diverge much from their initial values?
Assuming data scientists reading this already know that autoencoders take the inputs as target values, since they are learning the identity function, which is regarded as unsupervised learning: can such a method be applied to convolutional neural networks, for which the first hidden layer is a feature map? Each feature map is created by convolving a learned kernel with a receptive field in the image. This learned kernel - how could it be obtained by pre-training (in an unsupervised fashion)?
One thing to note is that autoencoders try to learn a non-trivial approximation of the identity function, not the identity function itself; otherwise they wouldn't be useful at all. The pre-training helps move the weight vectors towards a good starting point on the error surface. Then the backpropagation algorithm, which is basically doing gradient descent, is used to improve upon those weights. Note that gradient descent gets stuck in the closest local minimum.
[Ignore the term "Global Minimum" in the image posted and think of it as another, better, local minimum.]
Intuitively speaking, suppose you are looking for an optimal path to get from origin A to destination B. Having a map with no routes shown on it (the errors you obtain at the last layer of the neural network model) kind of tells you where to go, but you may put yourself on a route which has a lot of obstacles, uphills and downhills. Then suppose someone tells you about a route and a direction they have taken before (the pre-training) and hands you a new map (the pre-training phase's starting point).
This could be an intuitive reason why starting with random weights and immediately optimizing the model with backpropagation may not necessarily achieve the performance you obtain with a pre-trained model. However, note that many models achieving state-of-the-art results do not necessarily use pre-training, and they may use backpropagation in combination with other optimization methods (e.g. Adagrad, RMSProp, Momentum, etc.) to hopefully avoid getting stuck in a bad local minimum.
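A rough sketch of the pre-training-as-initialization idea with a one-hidden-layer autoencoder (PyTorch used only for illustration; the layer sizes are made up):

```python
import torch.nn as nn

# 1) Unsupervised pre-training: learn to reconstruct the input through a bottleneck
encoder = nn.Linear(784, 128)
decoder = nn.Linear(128, 784)
autoencoder = nn.Sequential(encoder, nn.ReLU(), decoder)
# ... train `autoencoder` with an MSE reconstruction loss on (unlabeled) inputs ...

# 2) Supervised fine-tuning: reuse the pre-trained encoder as the starting point
classifier = nn.Sequential(encoder, nn.ReLU(), nn.Linear(128, 10))
# ... train `classifier` with cross-entropy on the labeled data; backpropagation now
# starts from weights that already sit in a reasonable region of the error surface ...
```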
I don't know a lot about autoencoder theory, but I've done a bit of work with RBMs. What RBMs do is model the probability of seeing a specific type of data in order to get the weights initialized into the right ballpark - it is considered an (unsupervised) probabilistic model, so you don't correct using the known labels. Basically, the idea is that a learning rate that is too big will never lead to convergence, while one that is too small will take forever to train. Thus, by "pretraining" in this way you find the ballpark of the weights and can then set the learning rate to be small in order to get them down to the optimal values.
As for the second question, no, you don't generally pre-learn kernels, at least not in an unsupervised fashion. I suspect that what is meant by pretraining here is a bit different from your first question - that is, they are taking a pretrained model (say from a model zoo) and fine-tuning it with a new set of data.
Which model you use generally depends on the type of data you have and the task at hand. I've found convnets to train faster and more efficiently, but not all data has meaning when convolved, in which case DBNs may be the way to go. If, say, you have only a small amount of data, then I'd use something other than neural networks entirely.
Anyways, I hope this helps clear some of your questions.

How to approach machine learning problems with high dimensional input space?

How should I approach a situation where I try to apply some ML algorithm (classification, to be more specific, SVM in particular) to some high-dimensional input, and the results I get are not quite satisfactory?
1-, 2- or 3-dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on and have some idea of how to approach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters, I am not really sure how to attack it.
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
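For anyone who wants to repeat that kind of comparison on their own data, here is a minimal scikit-learn sketch (the built-in dataset and the parameter values are just placeholders for your own data and your own parameter search):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # stand-in for your own high-dimensional data

plain    = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
with_pca = make_pipeline(StandardScaler(), PCA(n_components=10),
                         SVC(kernel="rbf", C=10, gamma="scale"))

print("scaled features only:", cross_val_score(plain, X, y, cv=5).mean())
print("scaled + PCA(10):    ", cross_val_score(with_pca, X, y, cv=5).mean())
```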
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem by PCA or a similar technique. Beware that PCA has two important caveats: (1) it assumes that the data it is applied to is normally distributed, and (2) the resulting data loses its natural meaning (resulting in a black box). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVMs were already mentioned here, you might try the approach of Chang and Lin (Feature Ranking Using Linear SVM), in which they used a linear SVM to pre-select "interesting features" and then used an RBF-based SVM on the selected features. If you are familiar with Orange, a Python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness", might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve this problem with PCA (see above), you might want to use heuristic methods, which try to select the best possible combinations of predictors. The main pitfall of this kind of approach is the high potential of overfitting. Make sure you have a bunch of "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure that the model is ready. If you fail, don't use this data once more to validate another model; you will have to find a new data set. Otherwise you won't be sure that you didn't overfit once more.
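A hedged scikit-learn sketch of that two-stage idea (rather than Orange): rank features by the magnitude of the linear-SVM weights, keep the top ones, then fit an RBF SVM on them. The cutoff of 20 features and the parameter values are arbitrary placeholders:

```python
import numpy as np
from sklearn.svm import LinearSVC, SVC

# X_train, y_train: your (already scaled) training data - random placeholders here
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 500))
y_train = rng.integers(0, 2, 200)

# Stage 1: rank features by the magnitude of the linear SVM weights
ranker = LinearSVC(C=1.0, dual=False).fit(X_train, y_train)
top = np.argsort(np.abs(ranker.coef_).ravel())[::-1][:20]

# Stage 2: train an RBF SVM on the selected features only
clf = SVC(kernel="rbf", C=10, gamma="scale").fit(X_train[:, top], y_train)
```

As the answer stresses, keep a truly untouched test set aside and evaluate the final model on it only once.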
Oh, and one more thing about SVMs: an SVM is a black box. It is better to figure out the mechanism that generates the data and model the mechanism, not the data. On the other hand, if this were possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this stochastic method to determine, in silico, the drug-like character of molecules.
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model (SVM in this case) is not well suited for this problem. This can be diagnosed by trying other models (AdaBoost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex (too many parameters) and it needs to be constrained further,
Or you trained it on a training set which is too small and you need more data
Of course it may be a mixture of the above elements. These are all "blind" methods to attack the problem. In order to gain more insight into the problem you may use visualization methods by projecting the data into lower dimensions or look for models which are suited better to the problem domain as you understand it (for example if you know the data is normally distributed you can use GMMs to model the data ...)
If I'm not wrong, you are trying to see which parameters of the SVM give you the best result. Your problem is model/curve fitting.
I worked on a similar problem a couple of years ago. There are tons of libraries and algorithms to do the same. I used the Newton-Raphson algorithm and a variation of a genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for through a real-world experiment (or, if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algorithms I mentioned earlier iterate this process until the result of your model (SVM in this case) somewhat matches the expected values (note that this process will take some time based on your problem/data size; it took about 2 months for me on a 140-node Beowulf cluster).
If you choose to go with Newton-Raphson, this might be a good place to start.
