I am running the training example for Quantization aware training as is in Google Colab.
You can find the tutorial here.
My exact code can be found here. I expected that all the layers in the output frozen model (.tflite) would be quantized (integer, int8 precisely).
However, I noticed a dequantize node present in the layers.
I am not sure what leads to this dequantize node. It is not desirable and I would like to remove it, preferably with network surgery.
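For reference, the conversion step looks roughly like this (a sketch rather than my exact code, which is linked above; the tiny stand-in model just mirrors the tutorial's QAT wrapping):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Stand-in for the tutorial's QAT-wrapped Keras model.
base = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])
quant_model = tfmot.quantization.keras.quantize_model(base)

converter = tf.lite.TFLiteConverter.from_keras_model(quant_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Without the two lines below the converter keeps float32 inputs/outputs,
# which is what usually leaves a Quantize/Dequantize node at the edges of
# an otherwise int8 graph.
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
```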
Related
I don't mean that a neural network can replace a traditional image processing algorithm. What I want to say is: does there exist a kind of neural network that can take the parameters of the traditional method as input and output more universal parameters that don't require manual adjustment? Intuitively, my idea is less efficient than using neural networks directly, but I don't know much about the mathematics of neural networks.
If I understood correctly, what you mean is that for a traditional method (let's say thresholding), you want to find the best parameters using an ANN. It is possible, but you have to supply so much training data, which needs to be created, processed, and evaluated, that it will take a lot of time. AFAIK many mobile phones that have an AI-assisted camera use this method to find the best aperture, exposure, etc.
First of all, thank you very much. I still have two things to figure out. First, if I wanted to get a (or a set of) relatively optimal parameters, what data set would I need to build (such as some kind of error between input and output, and a threshold)? Second, taking the example you gave, is selecting the optimal threshold through a neural network more efficient or better in practice than traversal or Otsu? To be honest, I wonder if this is really more efficient than training directly on the input and output with a neural network.
For your second question, Otsu only works on cases where the histogram has two distinct peaks. Thresholding is a simple function, but the cut-off value is based on your objective; there is no single "best" value valid for every case. So if you want to train a model for thresholding, I think you have to come up with separate models for each case (like a model for thresholding bright objects, another for darker ones, etc.). Maybe an additional output parameter for determining the aim would work, but I am not sure. Will it be more efficient and better? It depends on the case (and your definition of better). Otsu, traversal, or adaptive thresholding do not work all the time (actually, Otsu has very specific use cases). If they work for your case, excellent. If not, then things get messy. So to answer your question, it depends on the problem at hand.
For the first question, TBF, it is quite difficult to work with images in traditional ANNs. Images have a lot of pixels, so standard ANNs struggle with that many inputs. Moreover, when the location/scale of an object in the image changes, the whole pixel data changes even though the content is the same (these are the reasons why CNNs are superior to plain ANNs for images). For these reasons it is better to use processed metrics which contain condensed and location-invariant information. E.g. for thresholding, you can give it the histogram and it returns a threshold value. Therefore you need an ANN with 256 input neurons (for the intensity histogram of an 8-bit grayscale image), 1 output neuron, and 1-2 hidden layers of densely connected neurons (128 maybe?). Your training data will be a bunch of histograms as input and the corresponding best threshold value for each histogram. Once training is finished, you can give the ANN a histogram it has never seen before and it will tell you the optimal threshold value based on its training.
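A minimal Keras sketch of that histogram-to-threshold ANN (the layer sizes are just the ones suggested above; how you build the training histograms and target thresholds is up to you):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(256,)),          # 256-bin intensity histogram of an 8-bit grayscale image
    layers.Dense(128, activation="relu"),  # 1-2 hidden layers of densely connected neurons
    layers.Dense(128, activation="relu"),
    layers.Dense(1),                       # predicted threshold value
])
model.compile(optimizer="adam", loss="mse")

# hists: (N, 256) array of histograms, thresholds: (N,) best threshold per histogram
# model.fit(hists, thresholds, epochs=50, validation_split=0.2)
```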
What I want to do is a model that can output different parameters (parameter sets) based on different input images, so I think if you choose a good enough data set it should be somewhat universal.
Most likely, but your data set should be quite inclusive of expected images (in terms of metrics and features), which means it has to be large.
Also, I don't know much about modeling -- can I use a function of the output/parameters (which might be a function of the result of the traditional method) as the error in back-propagation, by creating a custom loss function?
I think so, but training the model will be more involved than using predefined loss functions because, well, you have to write them. You also have to test that they work as expected.
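As a rough sketch of what such a custom loss might look like in Keras (the penalty term here is purely hypothetical, just to show that the error can be any differentiable function of the predicted parameters):

```python
import tensorflow as tf

def parameter_loss(y_true, y_pred):
    # plain squared error on the predicted parameter(s)...
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    # ...plus a hypothetical penalty for parameters outside the valid 0-255 range
    penalty = tf.reduce_mean(tf.nn.relu(y_pred - 255.0) + tf.nn.relu(-y_pred))
    return mse + 0.1 * penalty

# model.compile(optimizer="adam", loss=parameter_loss)
```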
I am currently building a 2-channel (also called double-channel) convolutional neural network in order to classify 2 binary images (containing binary objects) as 'similar' or 'different'.
The problem I am having is that it seems as though the network doesn't always converge to the same solution. For example, I can use exactly the same ordering of training pairs and all the same parameters and so forth, and when I run the network multiple times, each time produces a different solution; sometimes converging to below 2% error rates, and other times I get 50% error rates.
I have a feeling that it has something to do with the random initialization of the weights of the network, which results in different optimization paths each time the network is executed. This issue even occurs when I use SGD with momentum, so I don't really know how to 'force' the network to converge to the same solution (global optima) every time?
Can this have something to do with the fact that I am using binary images instead of grey-scale or color images, or is there something intrinsic to neural networks that is causing this issue?
There are several sources of randomness in training.
Initialization is one. SGD itself is of course stochastic since the content of the minibatches is often random. Sometimes, layers like dropout are inherently random too. The only way to ensure getting identical results is to fix the random seed for all of them.
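For example, a rough sketch of pinning those seeds in a PyTorch setup (other frameworks have equivalent calls; the cuDNN flags trade some speed for repeatability):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    random.seed(seed)                          # Python's own RNG
    np.random.seed(seed)                       # numpy (shuffling, augmentation, ...)
    torch.manual_seed(seed)                    # weight initialization, dropout masks
    torch.cuda.manual_seed_all(seed)           # GPU RNGs
    torch.backends.cudnn.deterministic = True  # avoid non-deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False

set_seed(42)
```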
Given all these sources of randomness and a model with many millions of parameters, your quote
"I don't really know how to 'force' the network to converge to the same solution (global optima) every time?"
is pretty much something anyone would say - no one knows how to find the same solution every time, or even the same local optimum, let alone the global optimum.
Nevertheless, it is desirable to have the network perform similarly across training attempts (with fixed hyper-parameters and dataset). Anything else causes problems with reproducibility, of course.
Unfortunately, I suspect the problem is inherent to CNNs.
You may be aware of the bias-variance tradeoff. For a powerful model like a CNN, the bias is likely to be low, but the variance very high. In other words, CNNs are sensitive to data noise, initialization, and hyper-parameters. Hence, it's not so surprising that training the same model multiple times yields very different results. (I have seen this phenomenon too, with performance changing between training runs by as much as 30% in one project.) My main suggestion for reducing this is stronger regularization.
Can this have something to do with the fact that I am using binary images instead of grey-scale or color images, or is there something intrinsic to neural networks that is causing this issue?
As I mentioned, this problem is present to some extent in all deep models. However, your use of binary images may also be a factor, since the space of the data itself is rather discontinuous. Perhaps consider "softening" the input (e.g. filtering the inputs) and using data augmentation. A similar softening of the targets is known to help in the form of label smoothing, for example.
What is a multi-headed model in deep learning?
The only explanation I found so far is this: every model might be thought of as a backbone plus a head, and if you pre-train the backbone and put on a random head, you can fine-tune it and it is a good idea.
Can someone please provide a more detailed explanation.
The explanation you found is accurate. Depending on what you want to predict from your data, you require an adequate backbone network and a certain number of prediction heads.
For a basic classification network, for example, you can view ResNet, AlexNet, VGGNet, Inception, ... as the backbone and the fully connected layer as the sole prediction head.
A good example for a problem where you need multiple-heads is localization, where you not only want to classify what is in the image but also want to localize the object (find the coordinates of the bounding box around it).
The image below shows the general architecture
The backbone network ("convolution and pooling") is responsible for extracting a feature map from the image that contains higher level summarized information. Each head uses this feature map as input to predict its desired outcome.
The loss that you optimize for during training is usually a weighted sum of the individual losses for each prediction head.
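As a rough sketch of such an architecture in Keras (the layer sizes, the 10 classes, and the 4 bounding-box coordinates are made up for illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)   # backbone: convolution...
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)                # ...and pooling -> shared feature map

class_head = layers.Dense(10, activation="softmax", name="class")(x)  # what is in the image
bbox_head = layers.Dense(4, name="bbox")(x)                           # where it is (x, y, w, h)

model = Model(inputs, [class_head, bbox_head])
model.compile(
    optimizer="adam",
    loss={"class": "sparse_categorical_crossentropy", "bbox": "mse"},
    loss_weights={"class": 1.0, "bbox": 0.5},  # weighted sum of the per-head losses
)
```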
The head is the top of a network. For instance, at the bottom (where the data comes in) you take the convolution layers of some model, say ResNet. If you call ConvLearner.pretrained, ConvnetBuilder will build a network with a head appropriate to your data in Fast.ai (if you are working on a classification problem, it will create a head with a cross-entropy loss; if you are working on a regression problem, it will create a head suited to that).
But you could build a model that has multiple heads. The model could take inputs from the base network (the ResNet conv layers) and feed the activations to some model, say head1, and then the same data to head2. Or you could have some number of shared layers built on top of ResNet and only those layers feeding head1 and head2.
You could even have different layers feed to different heads! There are some nuances to this (for instance, with regards to the fastai lib, ConvnetBuilder will add an AdaptivePooling layer on top of the base network if you don’t specify the custom_head argument and if you do it won’t) but this is the general picture.
https://forums.fast.ai/t/terminology-question-head-of-neural-network/14819/2
https://youtu.be/h5Tz7gZT9Fo?t=3613 (1:13:00)
I understand that Batch Normalisation helps with faster training by turning the activations towards a unit Gaussian distribution, thus tackling the vanishing gradient problem. Batch norm is applied differently at training time (use the mean/variance of each batch) and at test time (use the finalised running mean/variance from the training phase).
Instance normalisation, on the other hand, acts as contrast normalisation, as mentioned in this paper https://arxiv.org/abs/1607.08022. The authors mention that the output stylised images should not depend on the contrast of the input content image and hence instance normalisation helps.
But then should we not also use instance normalisation for image classification, where the class label should not depend on the contrast of the input image? I have not seen any paper using instance normalisation in place of batch normalisation for classification. What is the reason for that? Also, can and should batch and instance normalisation be used together? I am eager to get an intuitive as well as theoretical understanding of when to use which normalisation.
Definition
Let's begin with the strict definition of both:
Batch normalization
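In the notation of the instance-normalisation paper linked in the question (the input is a batch $x \in \mathbb{R}^{T \times C \times W \times H}$, with $t$ indexing the image in the batch, $i$ the channel, and $j, k$ the spatial location):

$$y_{tijk} = \frac{x_{tijk} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}, \qquad \mu_i = \frac{1}{HWT}\sum_{t=1}^{T}\sum_{l=1}^{W}\sum_{m=1}^{H} x_{tilm}, \qquad \sigma_i^2 = \frac{1}{HWT}\sum_{t=1}^{T}\sum_{l=1}^{W}\sum_{m=1}^{H} \left(x_{tilm} - \mu_i\right)^2$$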
Instance normalization
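$$y_{tijk} = \frac{x_{tijk} - \mu_{ti}}{\sqrt{\sigma_{ti}^2 + \epsilon}}, \qquad \mu_{ti} = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H} x_{tilm}, \qquad \sigma_{ti}^2 = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H} \left(x_{tilm} - \mu_{ti}\right)^2$$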
As you can see, they are doing the same thing, except for the number of input tensors that are normalized jointly. The batch version normalizes all images across the batch and spatial locations (this is the CNN case; in the ordinary fully connected case it's different); the instance version normalizes each element of the batch independently, i.e., across spatial locations only.
In other words, where batch norm computes one mean and std dev (thus making the distribution of the whole layer Gaussian), instance norm computes T of them, making each individual image distribution look Gaussian, but not jointly.
A simple analogy: during data pre-processing step, it's possible to normalize the data on per-image basis or normalize the whole data set.
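To make the per-image vs. whole-batch difference concrete, a small numpy sketch (hypothetical tensor shape) of which axes each version averages over:

```python
import numpy as np

x = np.random.rand(8, 3, 32, 32)  # (T, C, H, W): a batch of 8 three-channel feature maps
eps = 1e-5

# Batch norm: one mean/std per channel, shared across the whole batch
mu_bn = x.mean(axis=(0, 2, 3), keepdims=True)
y_bn = (x - mu_bn) / np.sqrt(x.var(axis=(0, 2, 3), keepdims=True) + eps)

# Instance norm: a separate mean/std per channel *and* per image
mu_in = x.mean(axis=(2, 3), keepdims=True)
y_in = (x - mu_in) / np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
```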
Credit: the formulas are from here.
Which normalization is better?
The answer depends on the network architecture, in particular on what is done after the normalization layer. Image classification networks usually stack the feature maps together and wire them to the FC layer, which shares weights across the batch (the modern way is to use a CONV layer instead of FC, but the argument still applies).
This is where the distribution nuances start to matter: the same neuron is going to receive the input from all images. If the variance across the batch is high, the gradient from the small activations will be completely suppressed by the high activations, which is exactly the problem that batch norm tries to solve. That's why it's fairly possible that per-instance normalization won't improve network convergence at all.
On the other hand, batch normalization adds extra noise to the training, because the result for a particular instance depends on the neighbouring instances. As it turns out, this kind of noise may be either good or bad for the network. This is well explained in the "Weight Normalization" paper by Tim Salimans et al., which names recurrent neural networks and reinforcement learning DQNs as noise-sensitive applications. I'm not entirely sure, but I think that the same noise-sensitivity was the main issue in the stylization task, which instance norm tried to fight. It would be interesting to check if weight norm performs better for this particular task.
Can you combine batch and instance normalization?
Though it makes a valid neural network, there's no practical use for it. Batch normalization noise is either helping the learning process (in this case it's preferable) or hurting it (in this case it's better to omit it). In both cases, leaving the network with one type of normalization is likely to improve the performance.
Great question, and already answered nicely. Just to add: I found this visualisation from Kaiming He's Group Norm paper helpful.
Source: link to article on Medium contrasting the Norms
I wanted to add more information to this question since there are some more recent works in this area. Your intuition
use instance normalisation for image classification where class label
should not depend on the contrast of input image
is partly correct. I would say that a pig in broad daylight is still a pig when the image is taken at night or at dawn. However, this does not mean using instance normalization across the network will give you better result. Here are some reasons:
Color distribution still plays a role. It is more likely to be an apple than an orange if it has a lot of red.
At later layers, you can no longer imagine instance normalization acts as contrast normalization. Class specific details will emerge in deeper layers and normalizing them by instance will hurt the model's performance greatly.
IBN-Net uses both batch normalization and instance normalization in their model. They only put instance normalization in early layers and have achieved improvement in both accuracy and ability to generalize. They have open sourced code here.
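As a rough sketch of that idea in PyTorch (not their exact implementation; the half-and-half channel split is just illustrative):

```python
import torch
import torch.nn as nn

class IBNBlock(nn.Module):
    """Normalize half of the channels with InstanceNorm and the rest with BatchNorm."""
    def __init__(self, channels: int):
        super().__init__()
        self.half = channels // 2
        self.inorm = nn.InstanceNorm2d(self.half, affine=True)
        self.bnorm = nn.BatchNorm2d(channels - self.half)

    def forward(self, x):
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.inorm(a), self.bnorm(b)], dim=1)

# Typically used in place of plain BatchNorm in the early layers only.
```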
IN provides visual and appearance invariance, while BN accelerates training and preserves discriminative features.
IN is preferred in shallow layers (the early layers of a CNN) to remove appearance variation, while BN is preferred in deep layers (the last CNN layers) to maintain discrimination.
I've been doing a lot of reading on Conv Nets and even some playing using Julia's Mocha.jl package (which looks a lot like Caffe, but you can play with it in the Julia REPL).
In a conv net, convolution layers produce "feature maps". What I'm wondering is: how does one determine how many feature maps a network needs to solve a particular problem? Is there any science to this, or is it more art? I can see that if you're trying to make a classification, at least the last layer should have number of feature maps == number of classes (unless you've got a fully connected MLP at the top of the network, I suppose).
In my case, I'm not doing a classification so much as trying to come up with a value for every pixel in an image (I suppose this could be seen as a classification where the classes are from 0 to 255).
Edit: as pointed out in the comments, I'm trying to solve a regression problem where the outputs are in a range from 0 to 255 (grayscale in this case). Still, the question remains: How does one determine how many feature maps to use at any given convolution layer? Does this differ for a regression problem vs. a classification problem?
Basically, like any other hyperparameter: by evaluating results on a separate development set and finding what number works best. It is also worth checking publications that deal with a similar problem and seeing what number of feature maps they were using.
More art. The only difference between ImageNet winners that use conv nets has been the structure of the layers and maybe some novel ways of training.
VGG is a neat example. The number of feature maps starts at 2^6 (64) in the first convolution block and doubles in the later blocks (2^7, 2^8, then 2^9), followed by fully connected layers, then an output layer which gives you your classes. Your feature-map counts and layer depths can be completely unrelated to the number of output classes.
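For the per-pixel regression case, here is a minimal Keras sketch (hypothetical feature-map counts following the doubling pattern above) that avoids a fully connected top entirely:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, None, 1)),                    # grayscale input of any size
    layers.Conv2D(64, 3, padding="same", activation="relu"),  # feature-map count doubles with depth...
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.Conv2D(256, 3, padding="same", activation="relu"),
    layers.Conv2D(1, 1, activation="sigmoid"),                # one value per pixel, scale by 255
])
model.compile(optimizer="adam", loss="mse")
```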
You would not want a fully connected layer at the top, though. That reintroduces the problems convolutional nets were designed to avoid (overfitting, and having to optimize hundreds of thousands of weights per neuron).
Training on big sets will require some heavy computational resources. If you're working with ImageNet, there's a set of pre-trained Caffe models that you could build on top of: http://caffe.berkeleyvision.org/model_zoo.html
I'm not sure if you can port these to Mocha. There's a port to TensorFlow though, if you're interested in that: https://github.com/ethereon/caffe-tensorflow