Fine Tuning of GoogLeNet Model - machine-learning

I trained GoogLeNet model from scratch. But it didn't give me the promising results.
As an alternative, I would like to do fine tuning of GoogLeNet model on my dataset. Does anyone know what are the steps should I follow?

Assuming you are trying to do image classification. These should be the steps for finetuning a model:
1. Classification layer
The original classification layer "loss3/classifier" outputs predictions for 1000 classes (it's mum_output is set to 1000). You'll need to replace it with a new layer with appropriate num_output. Replacing the classification layer:
Change layer's name (so that when you read the original weights from caffemodel file there will be no conflict with the weights of this layer).
Change num_output to the right number of output classes you are trying to predict.
Note that you need to change ALL classification layers. Usually there is only one, but GoogLeNet happens to have three: "loss1/classifier", "loss2/classifier" and "loss3/classifier".
2. Data
You need to make a new training dataset with the new labels you want to fine tune to. See, for example, this post on how to make an lmdb dataset.
3. How extensive a finetuning you want?
When finetuning a model, you can train ALL model's weights or choose to fix some weights (usually filters of the lower/deeper layers) and train only the weights of the top-most layers. This choice is up to you and it ususally depends on the amount of training data available (the more examples you have the more weights you can afford to finetune).
Each layer (that holds trainable parameters) has param { lr_mult: XX }. This coefficient determines how susceptible these weights to SGD updates. Setting param { lr_mult: 0 } means you FIX the weights of this layer and they will not be changed during the training process.
Edit your train_val.prototxt accordingly.
4. Run caffe
Run caffe train but supply it with caffemodel weights as an initial weights:
~$ $CAFFE_ROOT/build/tools/caffe train -solver /path/to/solver.ptototxt -weights /path/to/orig_googlenet_weights.caffemodel

Fine-tuning is a very useful trick to achieve a promising accuracy compared to past manual feature. #Shai already posted a good tutorial for fine-tuning the Googlenet using Caffe, so I just want to give some recommends and tricks for fine-tuning for general cases.
In most of time, we face a task classification problem that new dataset (e.g. Oxford 102 flower dataset or Cat&Dog) has following four common situations CS231n:
New dataset is small and similar to original dataset.
New dataset is small but is different to original dataset (Most common cases)
New dataset is large and similar to original dataset.
New dataset is large but is different to original dataset.
In practice, most of time we do not have enough data to train the network from scratch, but may be enough for pre-trained model. Whatever which cases I mentions above only thing we must care about is that do we have enough data to train the CNN?
If yes, we can train the CNN from scratch. However, in practice it is still beneficial to initialize the weight from pre-trained model.
If no, we need to check whether data is very different from original datasets? If it is very similar, we can just fine-tune the fully connected neural network or fine-tune with SVM. However, If it is very different from original dataset, we may need to fine-tune the convolutional neural network to improve the generalization.

Related

How to correctly combine my classifiers?

I have to solve 2 class classification problem.
I have 2 classifiers that output probabilities. Both of them are neural networks of different architecture.
Those 2 classifiers are trained and saved into 2 files.
Now I want to build meta classifier that will take probabilities as input and learn weights of those 2 classifiers.
So it will automatically decide how much should I "trust" each of my classifiers.
This model is described here:
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier
I plan to use mlxtend library, but it seems that StackingClassifier refits models.
I do not want to refit because it takes very huge amount of time.
From the other side I understand that refitting is necessary to "coordinate" work of each classifier and "tune" the whole system.
What should I do in such situation?
I won't talk about mlxtend because I haven't worked with it but I'll tell you the general idea.
You don't have to refit these models to the training set but you have to refit them to parts of it so you can create out-of-fold predictions.
Specifically, split your training data in a few pieces (usually 3 to 10). Keep one piece (i.e. fold) as validation data and train both models on the other folds. Then, predict the probabilities for the validation data using both models. Repeat the procedure treating each fold as a validation set. In the end, you should have the probabilities for all data points in the training set.
Then, you can train a meta-classifier using these probabilities and the ground truth labels. You can use the trained meta-classifier on your new data.

When should you use pretrained weights when training deep learning models?

I am interested in training a range of image and object detection models and I am wondering what the general rule of when to use pretrained weights of a network like VGG16 is.
For example, it seems obvious that fine-tuning pre-trained VGG16 imagenet model weights is helpful you are looking for a subset ie. Cats and Dogs.
However it seems less clear to me whether using these pretrained weights is a good idea if you are training an image classifier with 300 classes with only some of them being subsets of the classes in the pretrained model.
What is the intuition around this?
Lower layers learn features that are not necessarily specific to your application/dataset: corners, edges , simple shapes, etc. So it does not matter if your data is strictly a subset of the categories that the original network can predict.
Depending on how much data you have available for training, and how similar the data is to the one used in the pretrained network, you can decide to freeze the lower layers and learn only the higher ones, or simply train a classifier on top of your pretrained network.
Check here for a more detailed answer

how to load reference .caffemodel for training

I'm using alexnet to train my own dataset.
The example code in caffe comes with
bvlc_reference_caffenet.caffemodel
solver.prototxt
train_val.prototxt
deploy.prototxt
When I train with the following command:
./build/tools/caffe train --solver=models/bvlc_reference_caffenet/solver.prototxt
I'd like to start with weights given in bvlc_reference.caffenet.caffemodel.
My questions are
How do I do that?
Is it a good idea to start from the those weights? Would this converge faster? Would this be bad if my data are vastly different from the Imagenet dataset?
1.
In order to use existing .caffemodel weights for fine-tuning, you need to use --weights command line argument:
./build/tools/caffe train --solver=models/bvlc_reference_caffenet/solver.prototxt --weights=models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
2.
In most cases fine-tuning a net is quite a recommended practice, even when the input images are quite different than "imagenet" photos.
However, you should note that when training for the original weights you are about to use, some (very reasonable) assumptions were made. You should decide whether these assumptions are still true for your task.
For instance, most nets were trained with simple data augmentation using an image and its horizontal flip. However, if your task is to distinguish between images that are flipped you will find it very difficult to fine tune.

how to fine-tune word2vec when training our CNN for text classification?

I have 3 Questions about fine-tuning word vectors. Please, help me out. I will really appreciate it! Many thanks in advance!
When I train my own CNN for text classification, I use Word2vec to initialize the words, then I just employ these pre-trained vectors as my input features to train CNN, so if I never had a embedding layer, it surely can not do any fine-tunes through back-propagation. my question is if I want to do fine-tuning, does it means to create a Embedding layer?and how to create it?
When we train Word2vec, we use unsupervised training right? as in my case, I use the skip-gram model to get my pre-trained word2vec; But when I had the vec.bin and use it in the text classification model (CNN) as my words initialiser, if I could fine-tune the word-to-vector map in vec.bin, does it means that I have to have a CNN net structure exactly same as the one when training my Word2vec? and does the fine-tunes stuff would change the vec.bin or just fine-tune in computer memory?
Are the skip-gram model and CBOW model are only used for unsupervised Word2vec training? Or they could also apply for other general text classification tasks? and what's the different of the network between Word2vec unsupervised training supervised fine-tuning?
#Franck Dernoncourt thank you for reminding me. I'm green here, and hope to learn something from the powerful community. Please have a look at my questions when you have time, thank you again!
1) What you need is just a good example of using pretrained word embedding with trainable/fixed embedding layer with following change in code. In Keras you can update this layer by default, to exclude it from training you need set trainable to False.
embedding_layer = Embedding(nb_words + 1,
EMBEDDING_DIM,
weights=[embedding_matrix],
input_length=MAX_SEQUENCE_LENGTH,
trainable=True)
2) Your w2v is just for embedding layer initialization , no more relation to what CNN structure you are going to use. Will only update the weights in memory.
Answer to your 1st question-
When you set trainable=True in your Embedding constructor. Your pretrained-embeddings are set as weights of that embedding layer. Now any fine-tuning that happens on those weights has nothing to do with w2v(CBOW or SG). If you want to finetune you will have to finetune your w2v model using any of these techniques. Refer to these answers.
Answer 2-
Any finetuning on weights of embedding layer does not affect your vec.bin. These updated weights are saved along with the model though so theoretically you can get them out.
Answer 3 -
gensim implements only these two methods (SG and CBOW). However there are multiple new methods being used for training word vectors like MLM(masked language modeling).
glove tries to model probabilities of co-occurences of words.
If you want to fine-tune using your own custom method. You will just have to specify a task(like text classification) and then save your updated embedding layer weights. You will have to take proper care of indexing to assign each word its corresponding vector.

Caffe fine-tuning vs. starting from scratch

Context: let's say I have trained a CNN on datasetA and I've obtained caffeModelA.
Current situation: new pictures arrive so I can build a new dataset, datasetB
Question: would these two situations lead to same caffemodel?
merge datasetA and datasetB and train the net from scratch.
perform some fine-tuning on existing caffeModelA by training it only on datasetB (as explained here: http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html)
It might seem a dumb question, but I'm not really sure about its answer. And it's really important because if the two approximations lead to same result I can save time by performing number 2.
Note: bear in mind that it's the same problem, so no need to change architecture here, I just plan to add new images to the training.
In the Flicker-style example the situation is a bit more generic. They use the weights of first layers from a model trained for a different classification task and employ it for a new task, training only a new last layer and fine-tuning the first layers a bit (by setting a low learning rate for those pretrained layers). Your case is similar but more specific, you want to use the pretrained model to train the exact architecture for the exact same task but with an extension of your data.
If your question if whether Option 1. will produce exactly the same model (all resulting weights are equal) as Option 2. Then no, most probably not.
In Option 2. the network is trained for iterations of dataset A then for dataset B then dataset A again..and so on (assuming both were just concatenated together).
While in Option 1. will have the network trained for some iterations/epochs on dataset A, then later continue learning for iterations/epochs on only dataset B and that's it. So the solver will see a different sequence of gradients in both options resulting in two different models. That's from a strict theoretical perspective.
If you ask from a practical perspective, the two options will probably end up with very similar models. How many epochs (not iterations) did you train on dataset A ? say N epochs, then you can safely go with Option 2. and train your existing model further on dataset B for the same number of epochs and same learning rate and batch size.

Resources