Fine tuning of pre-trained convolutional neural network [closed] - machine-learning

From what I have read about fine-tuning a pre-trained network, it is done in the following two steps (in short):
Freeze the hidden layer, unfreeze the fully connected layer, and train.
Unfreeze both layers and train again.
My questions are:
Is it enough to perform only the first step?
If I perform only the first step, is that not the same as using the network as a feature extractor?
(Using the network as a feature extractor means extracting features with the pre-trained network and classifying them with a traditional machine learning classification algorithm.)
If you want more information to clarify the question, please let me know.

There are some issues with your question...
First, you clearly imply a network with only 2 layers, which is rather (very) far from the way fine-tuning is actually used in practice nowadays.
Second, what exactly do you mean by "enough" in your first question (enough for what)?
In fact, there is considerable overlap between the notions of pre-trained models, feature extractors, and fine-tuning, and different people may use these terms in slightly different ways. One approach, adopted by the Stanford CNNs for Visual Recognition course, is to consider all of these as special cases of something more general called transfer learning; here is a useful excerpt from the respective section of that course, which arguably addresses the spirit (if not the letter) of your questions:
The three major Transfer Learning scenarios look as follows:
ConvNet as fixed feature extractor. Take a ConvNet pretrained on ImageNet, remove the last fully-connected layer (this layer’s outputs are the 1000 class scores for a different task like ImageNet), then treat the rest of the ConvNet as a fixed feature extractor for the new dataset. In an AlexNet, this would compute a 4096-D vector for every image that contains the activations of the hidden layer immediately before the classifier. We call these features CNN codes. It is important for performance that these codes are ReLUd (i.e. thresholded at zero) if they were also thresholded during the training of the ConvNet on ImageNet (as is usually the case). Once you extract the 4096-D codes for all images, train a linear classifier (e.g. Linear SVM or Softmax classifier) for the new dataset.
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or it’s possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but later layers of the ConvNet become progressively more specific to the details of the classes contained in the original dataset. In the case of ImageNet for example, which contains many dog breeds, a significant portion of the representational power of the ConvNet may be devoted to features that are specific to differentiating between dog breeds.
Pretrained models. Since modern ConvNets take 2-3 weeks to train across multiple GPUs on ImageNet, it is common to see people release their final ConvNet checkpoints for the benefit of others who can use the networks for fine-tuning. For example, the Caffe library has a Model Zoo where people share their network weights.
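To make the two-step recipe from the question concrete, here is a minimal sketch in plain Python. Everything in it is a toy assumption for illustration: a "pretrained" base of one 2x2 linear layer with ReLU, a logistic head, four hand-made data points, and hand-picked learning rates. It is not how a real ConvNet is fine-tuned in a framework, only the freeze/unfreeze logic in miniature.

```python
import copy
import math


def forward(params, x):
    """Base: h = relu(W x). Head: p = sigmoid(v . h + b)."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in params["W"]]
    h = [z if z > 0.0 else 0.0 for z in pre]
    logit = sum(vi * hi for vi, hi in zip(params["v"], h)) + params["b"]
    p = 1.0 / (1.0 + math.exp(-logit))
    return pre, h, p


def mean_loss(params, data):
    """Mean binary cross-entropy over the dataset."""
    total = 0.0
    for x, y in data:
        _, _, p = forward(params, x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def train(params, data, epochs, lr, train_base):
    """Plain SGD. The head is always updated; the base only if train_base."""
    for _ in range(epochs):
        for x, y in data:
            pre, h, p = forward(params, x)
            g = p - y  # dLoss/dlogit for sigmoid + cross-entropy
            # Gradient w.r.t. the hidden activations, taken before the head changes.
            dh = [g * vi for vi in params["v"]]
            params["v"] = [vi - lr * g * hi for vi, hi in zip(params["v"], h)]
            params["b"] -= lr * g
            if train_base:
                for i, row in enumerate(params["W"]):
                    if pre[i] > 0.0:  # ReLU passes gradient only when active
                        params["W"][i] = [w - lr * dh[i] * xj
                                          for w, xj in zip(row, x)]
    return params


# Hypothetical "pretrained" base weights and a toy dataset (label 1 if x0 > x1).
pretrained_W = [[1.0, -1.0], [-1.0, 1.0]]
data = [([2.0, 0.0], 1), ([0.0, 2.0], 0), ([3.0, 1.0], 1), ([1.0, 3.0], 0)]
params = {"W": copy.deepcopy(pretrained_W), "v": [0.0, 0.0], "b": 0.0}

loss_start = mean_loss(params, data)

# Step 1: base frozen, only the new head is trained.
train(params, data, epochs=50, lr=0.1, train_base=False)
loss_step1 = mean_loss(params, data)
base_after_step1 = copy.deepcopy(params["W"])

# Step 2: everything unfrozen, trained again with a smaller learning rate.
train(params, data, epochs=20, lr=0.01, train_base=True)
```

Note that during step 1 the base never changes, so its outputs could just as well be computed once and cached. That is why doing only step 1 amounts to the "fixed feature extractor" scenario with a softmax/logistic classifier on top.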
When and how to fine-tune? How do you decide what type of transfer learning you should perform on a new dataset? This is a function of several factors, but the two most important ones are the size of the new dataset (small or big), and its similarity to the original dataset (e.g. ImageNet-like in terms of the content of images and the classes, or very different, such as microscope images). Keeping in mind that ConvNet features are more generic in early layers and more original-dataset-specific in later layers, here are some common rules of thumb for navigating the 4 major scenarios:
New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the ConvNet to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN codes.
New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won’t overfit if we were to try to fine-tune through the full network.
New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier from the top of the network, which contains more dataset-specific features. Instead, it might work better to train the SVM classifier from activations somewhere earlier in the network.
New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can afford to train a ConvNet from scratch. However, in practice it is very often still beneficial to initialize with weights from a pretrained model. In this case, we would have enough data and confidence to fine-tune through the entire network.
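The four rules of thumb above fit in a small lookup table. This is only a restatement of the excerpt as code; the function name and the two boolean flags are my own, and what counts as "large" or "similar" is left to you.

```python
# Rules of thumb from the excerpt, keyed by (dataset_is_large, similar_to_original).
STRATEGIES = {
    (False, True): "train a linear classifier on the CNN codes",
    (True, True): "fine-tune through the full network",
    (False, False): "train a linear classifier on activations from an earlier layer",
    (True, False): "initialize from pretrained weights and fine-tune the entire network",
}


def choose_strategy(dataset_is_large, similar_to_original):
    """Return the suggested transfer learning strategy for one scenario."""
    return STRATEGIES[(bool(dataset_is_large), bool(similar_to_original))]


# Example: a small set of microscope images (very different from ImageNet).
suggestion = choose_strategy(dataset_is_large=False, similar_to_original=False)
```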

Related

Why I need pre-trained weight in transfer learning

I'm learning transfer learning with some pre-trained models (vgg16, vgg19,…), and I wonder why I need to load pre-trained weights to train on my own dataset.
I can understand it if the classes in my dataset are included in the dataset that the pre-trained model was trained on. For example, the VGG model was trained on the 1000 classes of the Imagenet dataset, and my model is to classify cats and dogs, which are also in the Imagenet dataset. But here the classes in my dataset are not in that dataset. So how can the pre-trained weights help?
You don't have to use a pretrained network in order to train a model for your task. However, in practice using a pretrained network and retraining it to your task/dataset is usually faster and often you end up with better models yielding higher accuracy. This is especially the case if you do not have a lot of training data.
Why faster?
It turns out that (relatively) independent of the dataset and target classes, the first couple of layers converge to similar results. This is due to the fact that low level layers usually act as edge, corner and other simple structure detectors. Check out this example that visualizes the structures that filters of different layers "react" to. Having already trained the lower layers, adapting the higher level layers to your use case is much faster.
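To give an idea of what such a low-level "edge detector" filter does, here is a small sketch in plain Python. It uses a hand-written Sobel kernel rather than a learned filter (an assumption for illustration; learned first-layer filters merely tend to resemble kernels like this), and a made-up 4x5 image with one vertical edge.

```python
def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Sobel kernel that responds to left-to-right intensity increases.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# Toy image: dark on the left, bright on the right.
edge_image = [[0, 0, 0, 1, 1] for _ in range(4)]
flat_image = [[1, 1, 1, 1, 1] for _ in range(4)]

edge_response = correlate2d_valid(edge_image, SOBEL_X)  # [[0, 4, 4], [0, 4, 4]]
flat_response = correlate2d_valid(flat_image, SOBEL_X)  # all zeros: no edge
```

A filter like this is useful whatever the target classes are, which is the intuition behind reusing the lower layers.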
Why more accurate?
This question is harder to answer. IMHO it is due to the fact that pretrained models that you use as basis for transfer learning were trained on massive datasets. This means that the knowledge acquired flows into your retrained network and will help you to find a better local minimum of your loss function.
If you are in the comfortable situation of having a lot of training data, you should probably train a model from scratch, as the pretrained model might "point you in the wrong direction".
In this master's thesis you can find a number of tasks (small datasets, medium datasets, small semantic gap, large semantic gap) on which 3 methods are compared: fine-tuning, feature extraction + SVM, and training from scratch. Fine-tuning a model pretrained on Imagenet is almost always the better choice.

MobileNet vs SqueezeNet vs ResNet50 vs Inception v3 vs VGG16

I have recently been looking into incorporating the machine learning release for iOS developers into my app. Since this is my first time using anything ML-related, I was lost when I started reading the descriptions of the different models that Apple has made available. They have the same purpose/description, the only difference being the actual file size. What is the difference between these models, and how would you know which one is the best fit?
The models Apple makes available are just for simple demo purposes. Most of the time, these models are not sufficient for use in your own app.
The models on Apple's download page are trained for a very specific purpose: image classification on the ImageNet dataset. This means they can take an image and tell you what the "main" object is in the image, but only if it's one of the 1,000 categories from the ImageNet dataset.
Usually, this is not what you want to do in your own apps. If your app wants to do image classification, typically you want to train a model on your own categories (like food or cars or whatever). In that case you can take something like Inception-v3 (the original, not the Core ML version) and re-train it on your own data. That gives you a new model, which you then need to convert to Core ML again.
If your app wants to do something other than image classification, you can use these pretrained models as "feature extractors" in a larger neural network structure. But again this involves training your own model (usually from scratch) and then converting the result to Core ML.
So only in a very specific use case -- image classification using the 1,000 ImageNet categories -- are these Apple-provided models useful to your app.
If you do want to use any of these models, the difference between them is speed vs. accuracy. The smaller models are fastest but also least accurate. (In my opinion, VGG16 shouldn't be used on mobile. It's just too big and it's no more accurate than Inception or even MobileNet.)
SqueezeNets are fully convolutional and use Fire modules. A Fire module has a squeeze layer of 1x1 convolutions, which greatly reduces the number of parameters because it restricts the number of input channels that the following layer sees. Together with the fact that they have no dense layers, this makes SqueezeNets extremely low latency.
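As a rough back-of-the-envelope sketch of why the squeeze layer helps, the snippet below counts weights for one Fire module and for a plain 3x3 convolution with the same input and output channels. The channel sizes are illustrative assumptions, biases are ignored, and the function names are my own.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Number of weights in a standard convolution (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def fire_params(in_channels, squeeze, expand1x1, expand3x3):
    """Weights in a Fire module: 1x1 squeeze, then parallel 1x1 and 3x3 expand."""
    squeeze_part = conv_params(in_channels, squeeze, 1)
    expand_part = (conv_params(squeeze, expand1x1, 1)
                   + conv_params(squeeze, expand3x3, 3))
    return squeeze_part + expand_part


# Illustrative sizes: 96 input channels squeezed to 16, expanded to 64 + 64.
fire = fire_params(96, squeeze=16, expand1x1=64, expand3x3=64)  # 1536 + 1024 + 9216 = 11776
plain = conv_params(96, 128, 3)                                 # 110592
```

With these numbers the Fire module needs roughly nine times fewer weights, because the expensive 3x3 filters only see the 16 squeezed channels.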
MobileNets utilise depth-wise separable convolutions, very similar to the inception towers in Inception. These also reduce the number of parameters and hence the latency. MobileNets also have useful model-shrinking hyperparameters that you can set before training to get exactly the size you want. The Keras implementation can use ImageNet pre-trained weights too.
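The saving from a depth-wise separable convolution can be sketched the same way. The snippet counts weights only (no biases), the layer sizes are made up, and `alpha` stands for the width multiplier, one of the shrinking hyperparameters mentioned above; the helper names are my own.

```python
def standard_conv_weights(c_in, c_out, k):
    """Standard k x k convolution: every output channel sees every input channel."""
    return k * k * c_in * c_out


def separable_conv_weights(c_in, c_out, k, alpha=1.0):
    """Depth-wise k x k convolution followed by a 1x1 point-wise convolution.

    alpha scales the number of channels, like a width multiplier.
    """
    c_in, c_out = int(c_in * alpha), int(c_out * alpha)
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1x1 convolution that mixes the channels
    return depthwise + pointwise


standard = standard_conv_weights(64, 128, 3)               # 73728
separable = separable_conv_weights(64, 128, 3)             # 576 + 8192 = 8768
separable_half = separable_conv_weights(64, 128, 3, 0.5)   # 288 + 2048 = 2336
```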
The other models are very deep, large models. Where they reduce the number of parameters or use a particular style of convolution, the aim is not low latency but, essentially, the ability to train very deep models. ResNet introduced residual connections between layers, which were originally believed to be key to training very deep models. These aren't seen in the previously mentioned low-latency models.

How do multiple hidden layers in a neural network improve its ability to learn?

Most neural networks bring high accuracy with only one hidden layer, so what is the purpose of multiple hidden layers?
To answer your question you first need to understand why the term 'deep learning' was coined almost a decade ago. Deep learning is nothing but a neural network with several hidden layers. The term deep roughly refers to the way our brain passes sensory inputs (especially from the eyes to the visual cortex) through different layers of neurons to do inference. However, until about a decade ago researchers were not able to train neural networks with more than one or two hidden layers because of issues such as vanishing and exploding gradients, getting stuck in local minima, and less effective optimization techniques (compared to what is used nowadays). In 2006 and 2007 several researchers 1 and 2 showed new techniques that enabled better training of neural networks with more hidden layers, and since then the era of deep learning has started.
In deep neural networks the goal is to mimic what the brain does (hopefully). Before describing more, I should point out that from an abstract point of view the problem in any learning algorithm is to approximate a function given some inputs X and outputs Y. This is also the case for neural networks, and it has been proven theoretically that a neural network with only one hidden layer, using a bounded, continuous activation function in its units, can approximate any continuous function on a compact domain. This result is known as the universal approximation theorem. However, this raises the question of why current neural networks with one hidden layer cannot approximate any function with very high accuracy (say >99%). This could potentially be due to many reasons:
The current learning algorithms are not as effective as they should be
For a specific problem, how should one choose the exact number of hidden units so that the desired function is learned and the underlying manifold is approximated well?
The number of training examples needed could be exponential in the number of hidden units. So how many training examples should one train a model with? This could turn into a chicken-and-egg problem!
What is the right bounded, continuous activation function, and does the universal approximation theorem generalize to activation functions other than the sigmoid?
There are also other questions that need to be answered as well but I think the most important ones are the ones I mentioned.
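For reference, one common way to state the universal approximation theorem mentioned above is the following (my paraphrase of the standard statement, not a quote from a specific paper). Let \sigma be a nonconstant, bounded, continuous activation function, K a compact subset of R^n, and f a continuous function on K. Then for every \varepsilon > 0 there exist a number of hidden units N and parameters v_i, b_i \in R, w_i \in R^n such that:

```latex
\[
\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} v_i \, \sigma\!\left( w_i^{\top} x + b_i \right) \right| < \varepsilon .
\]
```

Note that the theorem only says such an N and such weights exist. It says nothing about how large N has to be or whether a learning algorithm will find the weights, which is exactly where the questions in the list above come from.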
Before anyone could come up with provable answers to the above questions (either theoretically or empirically), researchers started using more than one hidden layer, each with a limited number of hidden units. Empirically this has shown a great advantage. Although adding more hidden layers increases the computational cost, it has been shown empirically that more hidden layers learn hierarchical representations of the input data and generalize better to unseen data. The figure below shows how a deep neural network can learn hierarchies of features and combine them successively as we go from the first hidden layer to the last one:
[Figure not included: features learned by successive hidden layers of a deep network.]
As the figure shows, the first hidden layer (at the bottom) learns some edges; combining those seemingly useless representations yields parts of objects, and combining those parts in turn yields things like faces, cars, elephants, chairs and so on. Note that these results would not have been achievable without new optimization techniques and new activation functions.

How to find dynamically the depth of a network in Convolutional Neural Network

I am looking for an automatic way to decide how many layers my network should have, depending on the data and the computer configuration. I searched the web but could not find anything. Maybe my keywords or my way of searching are wrong.
Do you have any idea?
The number of layers, or depth, of a neural network is one of its hyperparameters.
This means that it is a quantity that cannot be learned from the data; you have to choose it before trying to fit your dataset. According to Bengio,
We define a hyper-parameter for a learning algorithm A as a variable to be set prior to the actual application of A to the data, one that is not directly selected by the learning algorithm itself.
There are three main approaches to finding the optimal value of a hyperparameter. The first two are well explained in the paper I linked.
Manual search. Using well-known black magic, the researcher chooses the optimal value through trial and error.
Automatic search. The researcher relies on an automated routine in order to speed up the search.
Bayesian optimization.
More specifically, adding more layers to a deep neural network is likely to improve the performance (reduce the generalization error) up to a certain depth, beyond which the network overfits the training data.
So, in practice, you should train your ConvNet with, say, 4 layers, then add one hidden layer and train again, and repeat until you see some overfitting. Of course, some strong regularization techniques (such as dropout) are required.
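The "add one layer until it overfits" procedure can be sketched as a simple loop. The `evaluate` callback is hypothetical: in practice it would build a network of the given depth, train it, and return the error on a held-out validation set. Here it is replaced by made-up numbers so the sketch runs on its own.

```python
def search_depth(evaluate, start_depth=4, max_depth=12):
    """Increase the depth one layer at a time until validation error stops improving.

    evaluate(depth) is assumed to train a network with that many layers
    and return its validation error (lower is better).
    """
    best_depth = start_depth
    best_error = evaluate(start_depth)
    for depth in range(start_depth + 1, max_depth + 1):
        error = evaluate(depth)
        if error >= best_error:  # no improvement: treat it as the onset of overfitting
            break
        best_depth, best_error = depth, error
    return best_depth, best_error


# Made-up validation errors standing in for real training runs.
fake_validation_error = {4: 0.30, 5: 0.25, 6: 0.22, 7: 0.24, 8: 0.27}
best_depth, best_error = search_depth(lambda depth: fake_validation_error[depth])
```

Stopping at the first non-improvement is a design choice of this sketch; since training is noisy, you may prefer to tolerate one or two worse results before giving up.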

What is suitable neural network architecture for the prediction of popularity of articles?

I am a newbie in machine learning and also in neural networks. Currently I'm taking a course at coursera.org about neural networks, but I don't understand everything. I have a little problem with my thesis. I should use a neural network, but I don't know how to choose the right neural network architecture for my problem.
I have a lot of data from web portals (typically online editions of newspapers and magazines). There is information about each article, for example its name, its text and its release date. There is also a large amount of sequence data that captures the behavior of users.
My goal is to predict the popularity of an article (the number of readers, or clicks on the article by unique users). I want to make vectors from this data and feed them to my neural network.
I have two questions:
1. How do I create the right vector?
2. Which neural network architecture is best suited for this problem?
Those are very broad questions. You'll need to identify smaller issues if you want more exact answers.
How do I create the right vector?
For text data, you usually use the vector space model. Best results are often obtained using tf-idf weighting.
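As a minimal sketch of what tf-idf weighting in the vector space model looks like, here is a plain Python version. It assumes whitespace tokenization, term frequency normalized by document length, and idf = log(N / df); real libraries use slightly different variants (smoothing, normalization), and the three example "articles" are made up.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn non-empty text documents into tf-idf vectors over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df[term]) for term in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] / len(tokens) * idf[term]
                        for term in vocabulary])
    return vocabulary, vectors


articles = [
    "election results announced today",
    "football results today",
    "new phone announced",
]
vocabulary, vectors = tfidf_vectors(articles)
```

Each article becomes a vector with one entry per vocabulary term. Terms that are rare across articles (like "election") get a higher weight than terms shared by several articles (like "results").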
Which neural network architecture is suitable for this problem?
This is very hard to say. I would start with a network with k input neurons, where k is the size of your vectors after applying tf-idf. You might also want to do some sort of feature selection to reduce the number of features; the chi-squared test is a good feature selection method.
Then, a standard network layout is given by using a single hidden layer with number of neurons equal to the average between the number of input neurons and output neurons. Then it looks like you only need a single output neuron that will output how popular the article is going to be (this can be a linear neuron or a sigmoid neuron).
For the neurons in your hidden layer, you can also experiment with linear and sigmoid neurons.
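A minimal sketch of that layout, in plain Python: k inputs, one hidden layer whose size is the average of the input and output counts, sigmoid hidden neurons and a single linear output. The weights are random and untrained, biases are left out for brevity, and all function names are my own, so this only illustrates the shape of the network, not a usable predictor.

```python
import math
import random


def layer_sizes(n_inputs, n_outputs=1):
    """Hidden layer size = average of the input and output sizes (rounded down)."""
    return n_inputs, (n_inputs + n_outputs) // 2, n_outputs


def init_network(n_inputs, n_outputs=1, seed=0):
    """Small random weights for the hidden and output layers."""
    rng = random.Random(seed)
    n_in, n_hidden, n_out = layer_sizes(n_inputs, n_outputs)
    w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
             for _ in range(n_out)]
    return w_hidden, w_out


def predict(network, x):
    """Sigmoid hidden layer followed by a linear output neuron."""
    w_hidden, w_out = network
    hidden = [1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(row, x))))
              for row in w_hidden]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


# Example: 7-dimensional tf-idf vectors and one popularity output.
network = init_network(n_inputs=7)
popularity = predict(network, [0.0, 0.27, 0.0, 0.0, 0.0, 0.1, 0.1])
```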
There are many other things you can try as well: weight decay, the momentum technique, networks with multiple layers, recurrent networks and so on. It's impossible to say what would work best for your given problem without a lot of experimentation.
