What's the difference between these 2 Keras approaches in Transfer Learning? - machine-learning

I've seen two different approaches for Transfer Learning/Fine Tuning and I'm not sure about their differences and benefits:
One simply loads the model, e.g. Inception, initialized with the weights from training on e.g. ImageNet, freezes the conv layers, and appends some dense layers to adapt to the specific classification task at hand. Some references are: [1], [2], [3], [4]
On this Keras blog tutorial the process seems more convoluted: it runs the train/test data through the VGG16 model once and records, in two numpy arrays, the output of the last activation maps before the fully-connected layers. It then trains a small fully-connected model on top of the stored features (the weights are stored as e.g. mini-fc.h5). At this point it follows a procedure similar to approach #1: it freezes the first convolutional layers of VGG16 (initialized with ImageNet weights) and trains only the last conv layers and the fully-connected classifier (which is instead initialized with the weights from the previous training step, mini-fc.h5). This final model is then trained. A possibly more recent version of this approach is explained in the section "Fine-tune InceptionV3 on a new set of classes" of this Keras page: https://keras.io/applications/
What are the differences/benefits of the two approaches? Are they distinct examples of Transfer Learning vs. Fine Tuning? Is the last link really just a revised version of method #2?
Thanks for your support
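For readers comparing the two approaches, the second one (stored features first, fine-tuning afterwards) can be sketched roughly as below. This is only an illustration under assumptions, not the blog's actual code: the function names, the 64-unit head, the softmax output, the optimizer settings, and the "block5" prefix used to pick the last VGG16 conv block are all choices made for this sketch.

```python
from keras import layers, models, optimizers
from keras.applications.vgg16 import VGG16


def build_base(input_shape, weights="imagenet"):
    # Convolutional part of VGG16 only; weights=None skips the ImageNet download.
    return VGG16(include_top=False, weights=weights, input_shape=input_shape)


def build_head(feature_shape, num_classes):
    # Small fully-connected classifier that sits on top of the conv features.
    return models.Sequential([
        layers.Input(shape=feature_shape),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])


def train_head_on_bottleneck_features(base, images, labels, num_classes, epochs=1):
    # Step 1: run the data through the conv base once and keep the outputs.
    features = base.predict(images, verbose=0)
    # Step 2: train only the small classifier on the stored features.
    head = build_head(features.shape[1:], num_classes)
    head.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    head.fit(features, labels, epochs=epochs, verbose=0)
    return head


def build_fine_tuning_model(base, head, trainable_prefix="block5"):
    # Step 3: freeze every base layer except the last conv block, then stack
    # the already-trained head on top and train with a small learning rate.
    for layer in base.layers:
        layer.trainable = layer.name.startswith(trainable_prefix)
    model = models.Sequential([base, head])
    model.compile(
        optimizer=optimizers.SGD(learning_rate=1e-4, momentum=0.9),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Approach #1 corresponds to stopping after a step like step 2 (or training the head directly on top of a fully frozen base). Step 3 is what turns it into fine-tuning, and starting it from an already-trained head is presumably meant to keep large gradients from a randomly initialized classifier from disturbing the pre-trained conv weights.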

Related

how to do fine-tuning with resnet50 model?

I have seen many examples on the Internet of how to fine-tune VGG16 and InceptionV3. For example, some people freeze the first 25 layers when fine-tuning VGG16. For InceptionV3, the first 172 layers are frozen. But what about ResNet? When fine-tuning, we freeze some layers of the base model, as follows:
from keras.applications.resnet50 import ResNet50
base_model = ResNet50(include_top=False, weights="imagenet", input_shape=(input_dim, input_dim, channels))
..............
for layer in base_model.layers[:frozen_layers]:
    layer.trainable = False
So how should I set frozen_layers? I do not actually know how many layers I should freeze when fine-tuning VGG16, VGG19, ResNet50, InceptionV3, etc. Can anyone give me suggestions on how to fine-tune these models, and especially on how many layers people freeze when fine-tuning them?
That's curious.... the VGG16 model has a total of 23 layers... (https://github.com/fchollet/keras/blob/master/keras/applications/vgg16.py)
All these models have a similar structure:
A series of convolutional layers
Followed by a few dense layers
These few dense layers are what keras calls top. (As in the include_top parameter).
Usually, this fine-tuning happens only in the last dense layers. You let the convolutional layers (which understand images and locate features) do their job unchanged, and create your own top part adapted to your own classes.
People often create their own top part because they don't have exactly the same classes the original model was trained on. So they adapt the final part, and train only the final part.
So, you create a model with include_top=False, then you freeze it entirely.
Now you add your own dense layers and leave these trainable.
This is the most usual adaptation of these models.
For other kinds of fine tuning, there probably aren't clear rules.

How to correctly combine my classifiers?

I have to solve a 2-class classification problem.
I have 2 classifiers that output probabilities. Both of them are neural networks of different architecture.
Those 2 classifiers are trained and saved into 2 files.
Now I want to build meta classifier that will take probabilities as input and learn weights of those 2 classifiers.
So it will automatically decide how much should I "trust" each of my classifiers.
This model is described here:
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier
I plan to use mlxtend library, but it seems that StackingClassifier refits models.
I do not want to refit because it takes a huge amount of time.
On the other hand, I understand that refitting is necessary to "coordinate" the work of each classifier and "tune" the whole system.
What should I do in such a situation?
I won't talk about mlxtend because I haven't worked with it but I'll tell you the general idea.
You don't have to refit these models to the training set but you have to refit them to parts of it so you can create out-of-fold predictions.
Specifically, split your training data in a few pieces (usually 3 to 10). Keep one piece (i.e. fold) as validation data and train both models on the other folds. Then, predict the probabilities for the validation data using both models. Repeat the procedure treating each fold as a validation set. In the end, you should have the probabilities for all data points in the training set.
Then, you can train a meta-classifier using these probabilities and the ground truth labels. You can use the trained meta-classifier on your new data.
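The out-of-fold procedure above can be sketched with scikit-learn as follows. The two base models here (a logistic regression and a small decision tree) and the synthetic dataset are stand-ins chosen for this sketch, not the asker's neural networks; any estimator with fit and predict_proba would fit the same pattern.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def out_of_fold_probabilities(model, X, y, n_splits=5, seed=0):
    """Predict P(class 1) for every training point using a model that never saw it."""
    oof = np.zeros(len(y))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in folds.split(X, y):
        fold_model = clone(model).fit(X[train_idx], y[train_idx])
        oof[valid_idx] = fold_model.predict_proba(X[valid_idx])[:, 1]
    return oof


# Synthetic 2-class data and two stand-in base classifiers (assumptions).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
base_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

# One column of out-of-fold probabilities per base classifier.
meta_features = np.column_stack(
    [out_of_fold_probabilities(m, X, y) for m in base_models]
)

# The meta-classifier learns how much to trust each base classifier.
meta_classifier = LogisticRegression().fit(meta_features, y)

# For new data, use base models fitted on the full training set.
fitted_bases = [clone(m).fit(X, y) for m in base_models]


def predict_stacked(X_new):
    probs = np.column_stack([m.predict_proba(X_new)[:, 1] for m in fitted_bases])
    return meta_classifier.predict(probs)
```

In the asker's situation the already-trained networks could play the role of fitted_bases, so the only extra training cost is the per-fold refits needed to produce meta_features.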

Fine Tuning of GoogLeNet Model

I trained a GoogLeNet model from scratch, but it didn't give me promising results.
As an alternative, I would like to fine-tune the GoogLeNet model on my dataset. Does anyone know what steps I should follow?
Assuming you are trying to do image classification, these should be the steps for fine-tuning a model:
1. Classification layer
The original classification layer "loss3/classifier" outputs predictions for 1000 classes (its num_output is set to 1000). You'll need to replace it with a new layer with the appropriate num_output. To replace the classification layer:
Change layer's name (so that when you read the original weights from caffemodel file there will be no conflict with the weights of this layer).
Change num_output to the right number of output classes you are trying to predict.
Note that you need to change ALL classification layers. Usually there is only one, but GoogLeNet happens to have three: "loss1/classifier", "loss2/classifier" and "loss3/classifier".
2. Data
You need to make a new training dataset with the new labels you want to fine tune to. See, for example, this post on how to make an lmdb dataset.
3. How extensive a fine-tuning do you want?
When fine-tuning a model, you can train ALL of the model's weights, or choose to fix some weights (usually the filters of the lower layers, i.e. those closest to the input) and train only the weights of the top-most layers. This choice is up to you and it usually depends on the amount of training data available (the more examples you have, the more weights you can afford to fine-tune).
Each layer that holds trainable parameters has param { lr_mult: XX }. This coefficient determines how susceptible these weights are to SGD updates. Setting param { lr_mult: 0 } means you FIX the weights of this layer, and they will not be changed during the training process.
Edit your train_val.prototxt accordingly.
4. Run caffe
Run caffe train, but supply it with the caffemodel weights as initial weights:
~$ $CAFFE_ROOT/build/tools/caffe train -solver /path/to/solver.prototxt -weights /path/to/orig_googlenet_weights.caffemodel
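To make the effect of lr_mult from step 3 concrete: the learning rate a layer actually sees is the solver's base_lr multiplied by that layer's lr_mult. The snippet below is a plain NumPy illustration of that scaling, not Caffe code; it ignores momentum and weight decay, and the layer names and numbers are made up for the example.

```python
import numpy as np


def sgd_step(weights, grads, lr_mult, base_lr):
    """One plain SGD update where each layer's step is scaled by its lr_mult."""
    return {
        name: weights[name] - base_lr * lr_mult[name] * grads[name]
        for name in weights
    }


# Hypothetical layers: a pre-trained conv layer and a renamed, new classifier.
weights = {"conv1": np.array([1.0, 2.0]), "loss3/classifier_new": np.array([1.0])}
grads = {"conv1": np.array([0.5, 0.5]), "loss3/classifier_new": np.array([2.0])}

# lr_mult 0 fixes conv1; lr_mult 10 lets the new classifier learn faster.
lr_mult = {"conv1": 0.0, "loss3/classifier_new": 10.0}

updated = sgd_step(weights, grads, lr_mult, base_lr=0.01)
print(updated["conv1"])                 # unchanged: [1. 2.]
print(updated["loss3/classifier_new"])  # 1.0 - 0.01 * 10 * 2.0, i.e. about 0.8
```

Giving the newly added classification layers a larger lr_mult than the pre-trained layers is a common choice, since they start from random weights while the rest of the network only needs small adjustments.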
Fine-tuning is a very useful trick for achieving promising accuracy compared to hand-crafted features. @Shai already posted a good tutorial for fine-tuning GoogLeNet using Caffe, so I just want to give some recommendations and tricks for fine-tuning in general.
Most of the time we face a classification task in which the new dataset (e.g. the Oxford 102 flower dataset or Cat&Dog) falls into one of the following four common situations (from CS231n):
New dataset is small and similar to original dataset.
New dataset is small but is different to original dataset (Most common cases)
New dataset is large and similar to original dataset.
New dataset is large but is different to original dataset.
In practice, most of the time we do not have enough data to train the network from scratch, but we may have enough to fine-tune a pre-trained model. Whichever of the cases above applies, the only thing we must care about is whether we have enough data to train the CNN.
If yes, we can train the CNN from scratch. However, in practice it is still beneficial to initialize the weights from a pre-trained model.
If no, we need to check whether the data is very different from the original dataset. If it is very similar, we can just fine-tune the fully connected layers or train an SVM on the extracted features. However, if it is very different from the original dataset, we may need to fine-tune the convolutional layers as well to improve generalization.
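The decision rule described above can be written down as a small helper. This is only a restatement of the text in code; the function name, argument names, and returned strings are invented for this sketch, and deciding whether the data is "enough" or "similar" remains a judgment call.

```python
def choose_strategy(enough_data_for_scratch, similar_to_original):
    """Map the two questions from the text to a fine-tuning strategy."""
    if enough_data_for_scratch:
        # Training everything is affordable, but a pre-trained start still helps.
        return "train all layers, ideally initialized from pre-trained weights"
    if similar_to_original:
        # Little data, similar domain: keep the conv features as they are.
        return "freeze the conv layers and train only the fully connected layers or an SVM"
    # Little data, different domain: the conv features need adapting too.
    return "fine-tune the convolutional layers as well"


for enough in (True, False):
    for similar in (True, False):
        print(enough, similar, "->", choose_strategy(enough, similar))
```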

Caffe fine-tuning vs. starting from scratch

Context: let's say I have trained a CNN on datasetA and I've obtained caffeModelA.
Current situation: new pictures arrive, so I can build a new dataset, datasetB.
Question: would these two situations lead to same caffemodel?
merge datasetA and datasetB and train the net from scratch.
perform some fine-tuning on existing caffeModelA by training it only on datasetB (as explained here: http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html)
It might seem a dumb question, but I'm not really sure about the answer. And it's really important, because if the two approaches lead to the same result I can save time by going with number 2.
Note: bear in mind that it's the same problem, so no need to change architecture here, I just plan to add new images to the training.
In the Flickr Style example the situation is a bit more generic. They use the weights of the first layers from a model trained for a different classification task and employ them for a new task, training only a new last layer and fine-tuning the first layers a bit (by setting a low learning rate for those pretrained layers). Your case is similar but more specific: you want to use the pretrained model to train the exact same architecture for the exact same task, but with an extension of your data.
If your question is whether Option 1 will produce exactly the same model (all resulting weights equal) as Option 2, then no, most probably not.
In Option 1 the network is trained on iterations of dataset A, then dataset B, then dataset A again, and so on (assuming both were just concatenated together).
In Option 2, by contrast, the network is trained for some iterations/epochs on dataset A and then continues learning for some iterations/epochs on dataset B only, and that's it. So the solver sees a different sequence of gradients in the two options, resulting in two different models. That's from a strict theoretical perspective.
If you ask from a practical perspective, the two options will probably end up with very similar models. How many epochs (not iterations) did you train on dataset A? Say N epochs; then you can safely go with Option 2 and train your existing model further on dataset B for the same number of epochs, with the same learning rate and batch size.
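The "different sequence of gradients" argument can be illustrated on a toy problem. The sketch below uses a small linear regression trained with mini-batch SGD as a stand-in for the CNN, with two synthetic datasets drawn from the same distribution; the model, data, and hyperparameters are all assumptions made for this illustration, and a real network may behave less benignly.

```python
import numpy as np


def sgd_train(X, y, w, epochs, lr=0.05, batch_size=10, seed=0):
    """Mini-batch SGD on mean squared error, starting from the weights w."""
    rng = np.random.default_rng(seed)
    w = w.copy()
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            w -= lr * 2.0 * X[idx].T @ error / len(idx)
    return w


# Two datasets generated by the same underlying rule (same "problem").
rng = np.random.default_rng(42)
true_w = np.array([2.0, -1.0])
X_a, X_b = rng.normal(size=(100, 2)), rng.normal(size=(100, 2))
y_a = X_a @ true_w + 0.1 * rng.normal(size=100)
y_b = X_b @ true_w + 0.1 * rng.normal(size=100)

# Option 1: merge both datasets and train from scratch.
w_merged = sgd_train(np.vstack([X_a, X_b]), np.concatenate([y_a, y_b]),
                     np.zeros(2), epochs=20)

# Option 2: train on A, then continue training on B only.
w_model_a = sgd_train(X_a, y_a, np.zeros(2), epochs=20)
w_fine_tuned = sgd_train(X_b, y_b, w_model_a, epochs=20)

print("merged:    ", w_merged)
print("fine-tuned:", w_fine_tuned)
print("identical: ", np.array_equal(w_merged, w_fine_tuned))
```

In this toy setting the two weight vectors are not identical but both land close to true_w, which matches the "different in theory, similar in practice" conclusion. That outcome depends on datasetB resembling datasetA; if it does not, training only on B can pull the model away from what it learned on A.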

How to train and fine-tune fully unsupervised deep neural networks?

In scenario 1, I had a multi-layer sparse autoencoder that tries to reproduce my input, so all my layers are trained together starting from randomly initialized weights. Without a supervised layer, this didn't learn any relevant information on my data (the code works fine; I've verified it by using it in many other deep neural network problems).
In scenario 2, I simply train multiple auto-encoders in a greedy layer-wise training similar to that of deep learning (but without a supervised step in the end), each layer on the output of the hidden layer of the previous autoencoder. They'll now learn some patterns (as I see from the visualized weights) separately, but not awesome, as I'd expect it from single layer AEs.
So I decided to check whether the pretrained layers, connected into one multi-layer AE, would perform better than the randomly initialized version. As you can see, this is the same idea as the fine-tuning step in deep neural networks.
But during my fine-tuning, instead of improvement, the neurons of all the layers seem to quickly converge towards an all-the-same pattern and end up learning nothing.
Question: What's the best configuration to train a fully unsupervised multi-layer reconstructive neural network? Layer-wise first and then some sort of fine tuning? Why is my configuration not working?
After some tests I've come up with a method that seems to give very good results, and, as you'd expect from 'fine-tuning', it improves the performance of all the layers:
As usual, during the greedy layer-wise learning phase each new autoencoder tries to reconstruct the activations of the previous autoencoder's hidden layer. However, the last autoencoder (which will be the last layer of our multi-layer autoencoder during fine-tuning) is different: it uses the activations of the previous layer but tries to reconstruct the 'global' input (i.e. the original input that was fed to the first layer).
This way when I connect all the layers and train them together, the multi-layer autoencoder will really reconstruct the original image in the final output. I found a huge improvement in the features learned, even without a supervised step.
I don't know if this is supposed to somehow correspond with standard implementations but I haven't found this trick anywhere before.
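A rough NumPy sketch of the pretraining scheme described in this answer is given below. It is not the author's code: the tiny tanh autoencoders without biases or sparsity terms, the layer sizes, the synthetic data, and the training settings are assumptions chosen to keep the example short. The point is only to show which target each layer is trained against.

```python
import numpy as np


def train_layer(inputs, targets, hidden_dim, steps=2000, lr=0.005, seed=0):
    """Fit targets ~ tanh(inputs @ W_enc) @ W_dec with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n = len(inputs)
    W_enc = 0.1 * rng.normal(size=(inputs.shape[1], hidden_dim))
    W_dec = 0.1 * rng.normal(size=(hidden_dim, targets.shape[1]))
    for _ in range(steps):
        hidden = np.tanh(inputs @ W_enc)
        error = hidden @ W_dec - targets
        grad_dec = 2.0 * hidden.T @ error / n
        grad_pre = 2.0 * (error @ W_dec.T) * (1.0 - hidden ** 2)
        grad_enc = inputs.T @ grad_pre / n
        W_enc -= lr * grad_enc
        W_dec -= lr * grad_dec
    return W_enc, W_dec


def encode(inputs, encoders):
    """Pass the inputs through all encoders trained so far."""
    for W_enc in encoders:
        inputs = np.tanh(inputs @ W_enc)
    return inputs


# Synthetic, standardized data with some low-dimensional structure.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)

hidden_sizes = [6, 5, 4]
encoders = []
for depth, hidden_dim in enumerate(hidden_sizes):
    layer_input = encode(X, encoders)
    is_last = depth == len(hidden_sizes) - 1
    # Ordinary layers reconstruct their own input; the last one targets
    # the original ("global") input X instead.
    target = X if is_last else layer_input
    W_enc, W_dec = train_layer(layer_input, target, hidden_dim, seed=depth)
    encoders.append(W_enc)

# Connected network: all encoders, then the last layer's decoder.
reconstruction = encode(X, encoders) @ W_dec
stacked_error = np.mean(np.sum((reconstruction - X) ** 2, axis=1))
baseline_error = np.mean(np.sum(X ** 2, axis=1))  # error of predicting all zeros
print(stacked_error, baseline_error)
```

Because the last decoder already maps back to the original input space, the connected network starts joint training as a meaningful end-to-end autoencoder. The joint fine-tuning step itself (training all the encoders and the final decoder together on the reconstruction error) is omitted here.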
