I have seen many examples on the Internet of how to fine-tune VGG16 and InceptionV3. For example, some people freeze the first 25 layers when fine-tuning VGG16, and the first 172 layers for InceptionV3. But what about ResNet? When fine-tuning, we freeze some layers of the base model, as follows:
from keras.applications.resnet50 import ResNet50
base_model = ResNet50(include_top=False, weights="imagenet", input_shape=(input_dim, input_dim, channels))
..............
for layer in base_model.layers[:frozen_layers]:
    layer.trainable = False
So how should I set frozen_layers? I actually do not know how many layers to freeze when fine-tuning VGG16, VGG19, ResNet50, InceptionV3, etc. Can anyone give me suggestions on how to fine-tune these models, and especially on how many layers people usually freeze?
That's curious.... the VGG16 model has a total of 23 layers... (https://github.com/fchollet/keras/blob/master/keras/applications/vgg16.py)
All these models have a similar structure:
A series of convolutional layers
Followed by a few dense layers
These few dense layers are what Keras calls the top (as in the include_top parameter).
Usually, this fine tuning happens only in the last dense layers. You let the convolutional layers (which understand images and locate features) do their job unchanged, and create your own top part adapted to your own classes.
People often create their own top part because they don't have exactly the same classes the original model was trained on. So they adapt the final part, and train only the final part.
So, you create a model with include_top=False, then you freeze it entirely.
Now you add your own dense layers and leave these trainable.
This is the most usual adaptation of these models.
For other kinds of fine tuning, there probably aren't clear rules.
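The usual adaptation described above (a frozen convolutional base with a new, trainable top) could look roughly like the sketch below. It assumes the Keras bundled with TensorFlow; the function name build_transfer_model, the pooling layer, and the 256-unit hidden layer are illustrative choices, not something prescribed by the models themselves.

```python
from tensorflow import keras


def build_transfer_model(num_classes, input_shape=(224, 224, 3), weights="imagenet"):
    """Frozen ResNet50 base plus a small trainable top (illustrative sketch)."""
    base = keras.applications.ResNet50(
        include_top=False, weights=weights, input_shape=input_shape
    )
    base.trainable = False  # freeze the whole convolutional base

    inputs = keras.Input(shape=input_shape)
    # training=False keeps the BatchNormalization layers in inference mode
    x = base(inputs, training=False)
    x = keras.layers.GlobalAveragePooling2D()(x)
    x = keras.layers.Dense(256, activation="relu")(x)  # hypothetical size
    outputs = keras.layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```

Calling build_transfer_model(num_classes=5) downloads the ImageNet weights on first use. Only the two Dense layers are updated during training, so the number of frozen layers never has to be chosen by hand.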
I am trying to use the pretrained BERT model in TensorFlow, which has approximately 110 million parameters, and it is nearly impossible to train all of them on my GPU. But freezing the entire layer makes all of these parameters untrainable.
Is it possible to make the layer partially trainable? Like have a couple million params trainable and the rest untrainable?
import tensorflow as tf
from transformers import TFAutoModel

input_ids_layer = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name='input_ids')
input_attention_layer = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name='attention_mask')
model = TFAutoModel.from_pretrained("bert-base-uncased")
for layer in model.layers:
    for i in range(len(layer.weights)):
        # assuming there are 199 weights
        if i > 150:
            layer.weights[i]._trainable = True
        else:
            layer.weights[i]._trainable = False
I don't know about training only some of the weights inside a layer, but I still suggest you do it the "standard way": freezing layers is what is usually done in these cases to avoid retraining everything. However, you should not freeze all the layers, since then nothing would be trained. What you want to do is freeze everything except the last few layers, and then train the network.
This works because the first layers usually learn general, low-level features, which are therefore transferable across many problems. The last layers, on the other hand, usually learn the features that actually solve the task at hand, based on the current dataset.
Therefore, if you want to retrain a pretrained model on another dataset, you usually only need to retrain the last few layers. You can also edit the end of the network by adding some Dense layers and changing the output of the last layer, which is useful if, for example, the number of classes to predict differs from the original dataset. There are a lot of short and easy tutorials online that show how to do that.
To summarize:
Freeze all the layers except the last one (or the last few)
(optional) Create new layers and link them with the output of the second-last layer
Train the network
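A minimal sketch of this summary for a generic Keras model is shown below. The helper freeze_all_but_last and the toy network are hypothetical stand-ins; in the BERT model from the question most weights sit inside a single top-level layer, so the same idea would have to be applied to that layer's sub-layers instead.

```python
from tensorflow import keras


def freeze_all_but_last(model, n_trainable):
    """Freeze every layer of `model` except the last `n_trainable` ones."""
    cutoff = max(len(model.layers) - n_trainable, 0)
    for layer in model.layers[:cutoff]:
        layer.trainable = False
    for layer in model.layers[cutoff:]:
        layer.trainable = True
    return model


# Toy stand-in for a pretrained network (randomly initialized here).
toy = keras.Sequential([
    keras.Input(shape=(16,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(4, activation="softmax"),
])
freeze_all_but_last(toy, n_trainable=1)
# Compile after changing `trainable` so the change takes effect in training.
toy.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

After this call only the final Dense layer's kernel and bias receive gradient updates; the weights of the first two layers stay fixed.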
I've seen two different approaches for Transfer Learning/Fine Tuning and I'm not sure about their differences and benefits:
One simply loads the model, eg. Inception, initialized with the weights generated from training on eg. Imagenet, freezes the conv layers and appends some dense layers to adapt to the specific classification task one's working on. Some references are: [1], [2], [3], [4]
In this Keras blog tutorial the process seems more convoluted: it runs the train/test data through the VGG16 model once and records, in two NumPy arrays, the output of the last activation maps before the fully connected layers. It then trains a small fully connected model on top of the stored features (the weights are stored as e.g. mini-fc.h5). At this point it follows a procedure similar to approach #1: it freezes the first convolutional layers of VGG16 (initialized with weights from ImageNet) and trains only the last conv layers and the fully connected classifier (which is instead initialized with the weights from the previous training step of this approach, mini-fc.h5). This final model is then trained. A more recent version of this approach may be the one explained in the section "Fine-tune InceptionV3 on a new set of classes" of this Keras page: https://keras.io/applications/
What are the differences and benefits of the two approaches? Are they distinct examples of transfer learning vs. fine-tuning? Is the last link really just a revised version of method #2?
Thanks for your support
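For reference, the first stage of the second approach (caching the convolutional features once, then training a small classifier on them) might look roughly like the sketch below. It assumes the Keras bundled with TensorFlow; the function names, the 256-unit layer, and the optimizer are illustrative assumptions, and real images would first need the VGG16 preprocessing step, which is omitted here.

```python
import numpy as np
from tensorflow import keras


def extract_bottleneck_features(images, input_shape=(150, 150, 3), weights="imagenet"):
    """Run images once through the VGG16 conv base and return its activations."""
    base = keras.applications.VGG16(
        include_top=False, weights=weights, input_shape=input_shape
    )
    return base.predict(images, verbose=0)


def build_small_classifier(feature_shape, num_classes):
    """Small fully connected model trained on the cached features."""
    model = keras.Sequential([
        keras.Input(shape=feature_shape),
        keras.layers.Flatten(),
        keras.layers.Dense(256, activation="relu"),  # hypothetical size
        keras.layers.Dropout(0.5),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
    return model
```

Because the convolutional base is evaluated only once per image, this stage is cheap. The trade-off is that the cached features cannot benefit from on-the-fly data augmentation, and the conv layers are not adapted until the later fine-tuning stage.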
I have a general question regarding fine-tuning and transfer learning, which came up when I tried to figure out how to best get yolo to detect my custom object (being hands).
I apologize for the long text, which possibly contains lots of false information. I would be glad if someone had the patience to read it and help me clear up my confusion.
After lots of googling, I learned that many people regard fine-tuning as a sub-class of transfer learning, while others believe that they are two different approaches to training a model. At the same time, people differentiate between re-training only the last classifier layer of a model on a custom dataset vs. also re-training other layers of the model (and possibly adding an entirely new classifier instead of retraining?). Both approaches use pre-trained models.
My final confusion lies here: I followed these instructions: https://github.com/thtrieu/darkflow to train tiny yolo via darkflow, using the command:
# Initialize yolo-new from yolo-tiny, then train the net on 100% GPU:
flow --model cfg/yolo-new.cfg --load bin/tiny-yolo.weights --train --gpu 1.0
But what happens here? I suppose I only retrain the classifier because the instructions say to change the number of classes in the last layer in the configuration file. But then again, it is also required to change the number of filters in the second last layer, a convolutional layer.
Lastly, the instructions provide an example of an alternative training:
# Completely initialize yolo-new and train it with ADAM optimizer
flow --model cfg/yolo-new.cfg --train --trainer adam
I don't understand at all how this relates to the different ways of doing transfer learning.
If you are using AlexeyAB's darknet repo (not darkflow), he suggests doing fine-tuning instead of transfer learning by setting this parameter in the cfg file: stopbackward=1.
Then run ./darknet partial yourConfigFile.cfg yourWeightsFile.weights outPutName.LastLayer# LastLayer#, for example:
./darknet partial cfg/yolov3.cfg yolov3.weights yolov3.conv.81 81
It will create yolov3.conv.81 and will freeze the lower layers; you can then train using the weights file yolov3.conv.81 instead of the original darknet53.conv.74.
References : https://github.com/AlexeyAB/darknet#how-to-improve-object-detection , https://groups.google.com/forum/#!topic/darknet/mKkQrjuLPDU
I have not worked on YOLO but looking at your problems I think I can help. Fine tuning, re-training, post-tuning are all somewhat ambiguous terms often used interchangeably. It's all about how much you want to change the pre-trained weights.
Since you are loading the weights in the first case with --load, the pre-trained weights are used as the starting point; this could mean you are adjusting them a bit with a low learning rate, or maybe not changing them at all. In the second case, however, you are not loading any weights, so you are probably training from scratch. So when you make small (fine) changes, call it fine-tuning; post-tuning would be tuning again after initial training, maybe not as finely as fine-tuning; and retraining would then be training the whole network, or part of it, again.
There are separate ways to optionally freeze some layers.
I would like to know how to define or represent a negative training set if I want to train a binary classifier from a pre-trained model, say AlexNet trained on the ILSVRC12 (ImageNet) dataset. What I am currently thinking of is to take one of the unrelated classes as the negative training set and the related one as the positive set. Is there a better, more elegant way?
The CNNs trained on the ILSVRC data set are already discriminating among 1000 classes of images. Yes, you can use one of those topologies to train a binary classifier, but I suggest that you start with an untrained model and run it through your two chosen classes. If you start with a trained model, you have to unlearn a lot, and your result is still trying to discriminate among 1000 classes: that last FC layer is going to give you trouble.
There are ways to work around the 1000-class problem. If your application already overlaps one or more of the trained classes, then simply add a layer that maps those classes to label "1" and all the others to label "0".
If you're insistent on retaining the trained kernels, then try replacing the final FC layer (1000) with a 2-class FC layer. Then choose your two classes (applicable images vs everything else) and run your training.
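The mapping workaround mentioned above (trained classes that overlap the application become label 1, everything else label 0) can be sketched with NumPy as follows. The function name and the class indices are made up for illustration; the probabilities are assumed to come from the unchanged pre-trained classifier.

```python
import numpy as np


def to_binary_labels(probs, positive_classes):
    """Map multi-class predictions to 1 (positive) or 0 (everything else).

    probs: array of shape (n_samples, n_classes) with class scores.
    positive_classes: indices of the original classes that count as positive.
    """
    predicted = np.argmax(probs, axis=1)  # winning original class per sample
    return np.isin(predicted, list(positive_classes)).astype(int)


# Tiny example with 3 classes instead of 1000; classes 1 and 2 are "positive".
scores = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1]])
labels = to_binary_labels(scores, positive_classes={1, 2})
```

This needs no retraining at all, but it only works when the positive concept is already covered by one or more of the original classes.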
I trained a GoogLeNet model from scratch, but it didn't give me promising results.
As an alternative, I would like to fine-tune the GoogLeNet model on my dataset. Does anyone know what steps I should follow?
Assuming you are trying to do image classification, these should be the steps for finetuning a model:
1. Classification layer
The original classification layer "loss3/classifier" outputs predictions for 1000 classes (its num_output is set to 1000). You'll need to replace it with a new layer with the appropriate num_output. Replacing the classification layer:
Change layer's name (so that when you read the original weights from caffemodel file there will be no conflict with the weights of this layer).
Change num_output to the right number of output classes you are trying to predict.
Note that you need to change ALL classification layers. Usually there is only one, but GoogLeNet happens to have three: "loss1/classifier", "loss2/classifier" and "loss3/classifier".
2. Data
You need to make a new training dataset with the new labels you want to fine tune to. See, for example, this post on how to make an lmdb dataset.
3. How extensive a finetuning do you want?
When finetuning a model, you can train ALL of the model's weights, or choose to fix some weights (usually the filters of the lower layers, closest to the input) and train only the weights of the top-most layers. This choice is up to you, and it usually depends on the amount of training data available (the more examples you have, the more weights you can afford to finetune).
Each layer (that holds trainable parameters) has param { lr_mult: XX }. This coefficient determines how susceptible these weights are to SGD updates. Setting param { lr_mult: 0 } means you FIX the weights of this layer, and they will not be changed during the training process.
Edit your train_val.prototxt accordingly.
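To illustrate the effect of lr_mult, here is a small NumPy sketch (not Caffe code) of a plain SGD step in which each layer's effective learning rate is the solver's base rate multiplied by that layer's lr_mult. Momentum and weight decay are left out for simplicity, and the layer names and numbers are made up.

```python
import numpy as np


def sgd_step(weights, grads, base_lr, lr_mult):
    """One plain SGD update with a per-layer learning-rate multiplier."""
    updated = {}
    for name, w in weights.items():
        local_lr = base_lr * lr_mult.get(name, 1.0)  # lr_mult defaults to 1
        updated[name] = w - local_lr * grads[name]
    return updated


weights = {"conv1": np.array([1.0, 2.0]), "new_classifier": np.array([1.0, 2.0])}
grads = {"conv1": np.array([1.0, 1.0]), "new_classifier": np.array([1.0, 1.0])}

# lr_mult = 0 fixes conv1; the new classifier is updated normally.
new_weights = sgd_step(weights, grads, base_lr=0.1,
                       lr_mult={"conv1": 0.0, "new_classifier": 1.0})
```

With these numbers conv1 stays at its initial values while new_classifier moves by base_lr times its gradient, which is the behaviour the lr_mult: 0 setting is meant to give.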
4. Run caffe
Run caffe train, but supply it with the caffemodel weights as initial weights:
~$ $CAFFE_ROOT/build/tools/caffe train -solver /path/to/solver.prototxt -weights /path/to/orig_googlenet_weights.caffemodel
Fine-tuning is a very useful trick for achieving promising accuracy compared to earlier hand-crafted features. @Shai already posted a good tutorial on fine-tuning GoogLeNet using Caffe, so I just want to give some recommendations and tricks for fine-tuning in general cases.
Most of the time, we face a classification task in which the new dataset (e.g. the Oxford 102 flower dataset or Cat&Dog) falls into one of the following four common situations (from CS231n):
The new dataset is small and similar to the original dataset.
The new dataset is small but different from the original dataset (the most common case).
The new dataset is large and similar to the original dataset.
The new dataset is large but different from the original dataset.
In practice, most of the time we do not have enough data to train the network from scratch, but we may have enough to fine-tune a pre-trained model. Whichever of the cases above applies, the one thing we must ask is: do we have enough data to train the CNN?
If yes, we can train the CNN from scratch. However, in practice it is still beneficial to initialize the weights from a pre-trained model.
If no, we need to check whether the data is very different from the original dataset. If it is very similar, we can just fine-tune the fully connected layers, or train an SVM on the extracted features. However, if it is very different from the original dataset, we may need to fine-tune the convolutional layers as well to improve generalization.
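The decision rule above can be written down as a small helper. This is only a sketch of the rule of thumb: the function name and the strategy labels are made up, and what counts as "enough data" or "similar" remains a judgment call.

```python
TRAIN_ALL = "train all layers, initialized from the pre-trained weights"
CLASSIFIER_ONLY = "freeze conv layers, train only the classifier (FC or SVM)"
FINE_TUNE_CONV = "fine-tune conv layers as well as the classifier"


def choose_strategy(enough_data, similar_to_original):
    """Return a rough fine-tuning strategy for the given situation."""
    if enough_data:
        # Training everything is affordable; pre-trained init still helps.
        return TRAIN_ALL
    if similar_to_original:
        # Little data, similar domain: reuse the learned features as they are.
        return CLASSIFIER_ONLY
    # Little data, different domain: the features themselves need adapting.
    return FINE_TUNE_CONV


strategy = choose_strategy(enough_data=False, similar_to_original=True)
```

For a small dataset that resembles the original one, the helper returns CLASSIFIER_ONLY, matching the "fine-tune the fully connected layers or train an SVM" advice.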