I'm using AlexNet to train on my own dataset.
The example code in caffe comes with
bvlc_reference_caffenet.caffemodel
solver.prototxt
train_val.prototxt
deploy.prototxt
When I train with the following command:
./build/tools/caffe train --solver=models/bvlc_reference_caffenet/solver.prototxt
I'd like to start with the weights given in bvlc_reference_caffenet.caffemodel.
My questions are
How do I do that?
Is it a good idea to start from those weights? Would this converge faster? Would this be bad if my data are vastly different from the ImageNet dataset?
1.
In order to use existing .caffemodel weights for fine-tuning, you need to use the --weights command-line argument:
./build/tools/caffe train --solver=models/bvlc_reference_caffenet/solver.prototxt --weights=models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
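If you prefer to drive training from Python instead of the command line, the same weight initialization can be done through pycaffe. A minimal sketch (paths taken from the question; it assumes the Caffe Python bindings are built):

```python
import caffe

caffe.set_mode_gpu()

# Build the solver from the existing solver definition.
solver = caffe.SGDSolver('models/bvlc_reference_caffenet/solver.prototxt')

# Equivalent of the --weights flag: copy the weights of matching layers
# from the pretrained model before training starts.
solver.net.copy_from('models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel')

# Run SGD iterations (the solver handles snapshots, lr policy, etc.).
solver.step(1000)
```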
2.
In most cases fine-tuning a net is a recommended practice, even when the input images are quite different from ImageNet photos.
However, you should note that some (very reasonable) assumptions were made when the weights you are about to reuse were originally trained. You should decide whether these assumptions still hold for your task.
For instance, most nets were trained with simple data augmentation using each image together with its horizontal flip. If your task requires distinguishing an image from its flipped version, you will find it very difficult to fine-tune from such weights.
Related
I have a general question regarding fine-tuning and transfer learning, which came up when I tried to figure out how best to get YOLO to detect my custom object (hands).
I apologize for the long text, possibly containing lots of false information. I would be glad if someone had the patience to read it and help me clear up my confusion.
After lots of googling, I learned that many people regard fine-tuning as a sub-class of transfer learning, while others believe that they are two different approaches to training a model. At the same time, people differentiate between re-training only the last classifier layer of a model on a custom dataset vs. also re-training other layers of the model (and possibly adding an entirely new classifier instead of retraining?). Both approaches use pre-trained models.
My final confusion lies here: I followed these instructions: https://github.com/thtrieu/darkflow to train tiny yolo via darkflow, using the command:
# Initialize yolo-new from yolo-tiny, then train the net on 100% GPU:
flow --model cfg/yolo-new.cfg --load bin/tiny-yolo.weights --train --gpu 1.0
But what happens here? I suppose I only retrain the classifier, because the instructions say to change the number of classes in the last layer in the configuration file. But then again, it is also required to change the number of filters in the second-to-last layer, a convolutional layer.
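For what it's worth, in YOLOv2-style models such as tiny-yolo the filter count of that second-to-last convolutional layer is tied to the number of classes and anchors, which is why both values have to be edited together. A rough sanity check of the arithmetic (this follows the region-layer convention used by darkflow; other YOLO versions use different formulas):

```python
# filters = num_anchors * (num_classes + 5)
# where 5 = 4 box coordinates + 1 objectness score
num_anchors = 5   # "num" in the region layer of tiny-yolo's cfg
num_classes = 1   # e.g. a single "hand" class

filters = num_anchors * (num_classes + 5)
print(filters)    # 30 -> value for the conv layer right before the region layer
```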
Lastly, the instructions provide an example of an alternative training:
# Completely initialize yolo-new and train it with ADAM optimizer
flow --model cfg/yolo-new.cfg --train --trainer adam
and I don't understand at all how this relates to the different ways of transfer learning.
If you are using AlexeyAB's darknet repo (not darkflow), he suggests doing fine-tuning instead of transfer learning by setting this param in the cfg file: stopbackward=1.
Then run ./darknet partial yourConfigFile.cfg yourWeightsFile.weights outPutName.LastLayer# LastLayer#, such as:
./darknet partial cfg/yolov3.cfg yolov3.weights yolov3.conv.81 81
It will create yolov3.conv.81 and will freeze the lower layers; you can then train using the weights file yolov3.conv.81 instead of the original darknet53.conv.74.
References: https://github.com/AlexeyAB/darknet#how-to-improve-object-detection, https://groups.google.com/forum/#!topic/darknet/mKkQrjuLPDU
I have not worked on YOLO, but looking at your problems I think I can help. Fine-tuning, re-training, and post-tuning are all somewhat ambiguous terms that are often used interchangeably. It's all about how much you want to change the pre-trained weights.
Since you are loading the weights with --load in the first case, the pre-trained weights are being loaded there; it could mean you are adjusting them a bit with a low learning rate, or perhaps not changing them at all. In the second case, however, you are not loading any weights, so you are probably training from scratch. When you make small (fine) changes, call it fine-tuning; post-tuning would be tuning again after the initial training, perhaps not as fine as fine-tuning; and re-training would be training the whole network, or a part of it, again.
There are also separate ways in which you can optionally freeze some layers.
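As a generic illustration of what freezing means (a PyTorch sketch, not darkflow-specific; the backbone choice is arbitrary), you simply stop computing gradients for the layers you want to keep fixed and train only the new head:

```python
import torch
import torchvision

# Load a pretrained backbone (arbitrary choice, just for illustration).
model = torchvision.models.resnet18(pretrained=True)

# Freeze everything ...
for param in model.parameters():
    param.requires_grad = False

# ... then replace and train only the final classifier layer.
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # e.g. hand / no-hand

# The optimizer only receives the parameters that are still trainable.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```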
I am new to TensorFlow and machine learning, and I came across the concept of a batch.
What is the purpose of splitting the dataset into batches, and how does TensorFlow perform the optimization of variables using these different subsets?
You are confusing a few things, as far as I understand.
First, you need to split the dataset into two (or more) distinct sets. One is the set you train your system on, and the second one is used to test your model.
These are the basics of ML and you can easily find more on the internet. Look for "cross-validation" or "train, validation, test sets".
A batch is something that is usually important in neural networks (NNs). You do not use one example at each training step (then the algorithm would be called Stochastic Gradient Descent), nor every example at each training step (that would be Batch Gradient Descent). Usually, it is best to train an NN using mini-batches (Mini-batch Gradient Descent).
It is a trade-off in optimization between accuracy and training speed.
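To make the three variants concrete, here is a minimal NumPy sketch of one training epoch of linear regression; batch_size = 1 gives Stochastic Gradient Descent, batch_size = len(X) gives Batch Gradient Descent, and anything in between is Mini-batch Gradient Descent:

```python
import numpy as np

# Toy data: y = 3*x + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=1000)

w, b = 0.0, 0.0
lr = 0.1
batch_size = 32  # 1 -> SGD, len(X) -> batch GD, otherwise mini-batch GD

indices = rng.permutation(len(X))
for start in range(0, len(X), batch_size):
    batch = indices[start:start + batch_size]
    xb, yb = X[batch, 0], y[batch]

    pred = w * xb + b
    err = pred - yb

    # Gradients of the mean squared error over this batch only.
    grad_w = 2 * np.mean(err * xb)
    grad_b = 2 * np.mean(err)

    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should approach 3.0 and 0.0
```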
TensorFlow is just a library for NNs. You can easily find how sets and batches are split in many tutorials. Remember to learn the basic concepts first, for example in this great class:
https://www.coursera.org/specializations/deep-learning
The purpose of splitting the dataset into batches is typically to speed up learning. Instead of processing the entire set of training examples at once, the model processes only a batch, a small subset of the training set, at a time. There are various techniques for forming such batches and processing them to obtain the final trained model.
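In TensorFlow itself, batching is usually handled by the input pipeline. A minimal sketch with the tf.data API (the data and batch size are placeholders):

```python
import numpy as np
import tensorflow as tf

# Placeholder training data: 1000 examples with 20 features each.
features = np.random.rand(1000, 20).astype("float32")
labels = np.random.randint(0, 2, size=1000)

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=1000)  # reshuffle the examples each epoch
    .batch(32)                  # yield mini-batches of 32 examples
)

for x_batch, y_batch in dataset:
    # Each iteration sees only one mini-batch; the optimizer updates
    # the variables using the gradient computed on this subset.
    pass
```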
I trained a GoogLeNet model from scratch, but it didn't give me promising results.
As an alternative, I would like to fine-tune the GoogLeNet model on my dataset. Does anyone know what steps I should follow?
Assuming you are trying to do image classification, these are the steps for fine-tuning a model:
1. Classification layer
The original classification layer "loss3/classifier" outputs predictions for 1000 classes (its num_output is set to 1000). You'll need to replace it with a new layer with an appropriate num_output. When replacing the classification layer:
Change the layer's name (so that when the original weights are read from the caffemodel file there will be no conflict with the weights of this layer).
Change num_output to the right number of output classes you are trying to predict.
Note that you need to change ALL classification layers. Usually there is only one, but GoogLeNet happens to have three: "loss1/classifier", "loss2/classifier" and "loss3/classifier".
2. Data
You need to make a new training dataset with the new labels you want to fine-tune to. See, for example, this post on how to make an lmdb dataset.
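For reference, a minimal sketch of writing such a dataset into an lmdb with pycaffe (the load_my_dataset helper is hypothetical, and a real pipeline should also handle resizing and mean subtraction):

```python
import lmdb
import caffe

# images: list of HxWxC uint8 arrays, labels: list of ints (hypothetical helper).
images, labels = load_my_dataset()

env = lmdb.open('my_train_lmdb', map_size=int(1e12))
with env.begin(write=True) as txn:
    for i, (img, label) in enumerate(zip(images, labels)):
        # Caffe expects CxHxW, so transpose before serializing.
        datum = caffe.io.array_to_datum(img.transpose(2, 0, 1), label)
        txn.put('{:08d}'.format(i).encode('ascii'), datum.SerializeToString())
env.close()
```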
3. How extensive a fine-tuning do you want?
When fine-tuning a model, you can train ALL of the model's weights or choose to fix some weights (usually filters of the lower/deeper layers) and train only the weights of the top-most layers. This choice is up to you and it usually depends on the amount of training data available (the more examples you have, the more weights you can afford to fine-tune).
Each layer (that holds trainable parameters) has param { lr_mult: XX }. This coefficient determines how susceptible these weights are to SGD updates. Setting param { lr_mult: 0 } means you FIX the weights of this layer and they will not be changed during the training process.
Edit your train_val.prototxt accordingly.
4. Run caffe
Run caffe train, but supply it with the caffemodel weights as the initial weights:
~$ $CAFFE_ROOT/build/tools/caffe train -solver /path/to/solver.prototxt -weights /path/to/orig_googlenet_weights.caffemodel
Fine-tuning is a very useful trick to achieve promising accuracy compared to past hand-crafted features. @Shai already posted a good tutorial for fine-tuning GoogLeNet using Caffe, so I just want to give some recommendations and tricks for fine-tuning in general cases.
Most of the time, we face a classification task where the new dataset (e.g. the Oxford 102 flower dataset or Cats&Dogs) falls into one of the following four common situations, as described in CS231n:
New dataset is small and similar to original dataset.
New dataset is small but different from the original dataset (most common case)
New dataset is large and similar to original dataset.
New dataset is large but different from the original dataset.
In practice, most of the time we do not have enough data to train the network from scratch, but it may be enough to fine-tune a pre-trained model. Whichever of the cases above applies, the only thing we must really care about is: do we have enough data to train the CNN?
If yes, we can train the CNN from scratch. However, in practice it is still beneficial to initialize the weights from a pre-trained model.
If no, we need to check whether the data is very different from the original dataset. If it is very similar, we can just fine-tune the fully connected layers, or train an SVM on top of the extracted features. However, if it is very different from the original dataset, we may need to fine-tune the convolutional layers as well to improve generalization.
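As a concrete example of the "fine-tune with SVM" option, you can use a pre-trained network purely as a frozen feature extractor and fit a linear SVM on top of the features. A sketch with Keras and scikit-learn (the backbone choice and the load_my_images helper are placeholders):

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from sklearn.svm import LinearSVC

# Pretrained ImageNet weights, classifier head removed.
base = VGG16(weights="imagenet", include_top=False, pooling="avg")

# X: float array of shape (n_samples, 224, 224, 3), y: class labels (hypothetical helper).
X, y = load_my_images()

features = base.predict(preprocess_input(X))  # (n_samples, 512) feature vectors
clf = LinearSVC().fit(features, y)            # linear SVM on frozen CNN features
```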
Given any image, I want my classifier to tell whether it is a sunflower or not. How can I go about creating the second class? Keeping the set of all possible images minus {sunflower} in the second class is overkill. Is there any research in this direction? Currently my classifier uses a neural network in the final layer. I have based it upon the following tutorial:
https://github.com/torch/tutorials/tree/master/2_supervised
I am using 254x254 images as the input.
Would an SVM help in the final layer? I am also open to using any other classifier/features that might help.
The standard approach in ML is:
1) Build model
2) Try to train on some data with positive/negative examples (start with 50/50 pos/neg in the training set)
3) Validate it on a test set (again, try 50/50 pos/neg examples in the test set)
If results not fine:
a) Try different model?
b) Get more data
For case (b), when deciding which additional data you need, a rule of thumb that works nicely for me is:
1) If the classifier gives lots of false positives (it says sunflower when the image is actually not a sunflower at all), get more negative examples.
2) If the classifier gives lots of false negatives (it says not a sunflower when the image actually is a sunflower), get more positive examples.
Generally, start with a reasonable amount of data, check the results, and if the results on the train or test set are bad, get more data. Stop adding data once the results are good enough.
Another thing you need to consider: if your results with the current data and current classifier are not good, you need to understand whether the problem is high bias (bad results on both the train and test sets) or high variance (nice results on the train set but bad results on the test set). If you have a high-bias problem, more data or a more powerful classifier will definitely help. If you have a high-variance problem, a more powerful classifier is not needed; you need to think about generalization instead: introduce regularization, or maybe remove a couple of layers from your ANN. Another possible way of fighting high variance is getting much, MUCH more data.
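A quick way to tell the two failure modes apart is to compare accuracy on the training set and on a held-out set. A scikit-learn sketch (the data and classifier are placeholders; only the comparison matters):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder data and classifier, just to illustrate the check.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(train_acc, test_acc)

# Low train_acc and low test_acc  -> high bias: stronger model or better features.
# High train_acc, low test_acc    -> high variance: regularize, simplify, or get much more data.
```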
So to sum up, you need to use an iterative approach and increase the amount of data step by step until you get good results. There is no magic-stick classifier, and there is no simple answer to how much data you should use.
It is a good idea to use the CNN as a feature extractor: peel off the original fully connected layer that was used for classification and add a new classifier. This is also known as transfer learning, a technique that has been widely used in the deep learning research community. For your problem, using a one-class SVM as the added classifier is a good choice.
Specifically,
a good CNN feature extractor can be trained on a large dataset, e.g. ImageNet,
the one-class SVM can then be trained using your 'sunflower' dataset.
The essential part of solving your problem is the implementation of the one-class SVM, which is also known as anomaly detection or novelty detection. You may refer to http://scikit-learn.org/stable/modules/outlier_detection.html for some insight into the method.
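A minimal sketch of the one-class part with scikit-learn (it assumes the CNN feature extraction has already happened; shapes and values are placeholders):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# CNN feature vectors of known sunflower images only (placeholder data).
sunflower_features = np.random.rand(500, 512)

# nu roughly bounds the fraction of training points treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
ocsvm.fit(sunflower_features)

# At test time: +1 -> looks like a sunflower, -1 -> novelty / not a sunflower.
new_features = np.random.rand(10, 512)
print(ocsvm.predict(new_features))
```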
We all know that the objective function of an SVM is trained iteratively. In order to continue training, we can at least store all the variables used in the iterations if we want to continue on the same training dataset.
But if we want to train on a slightly different dataset, what should we do to make full use of the previously trained model? Or does this kind of idea even make sense? I think it is quite reasonable if we train a k-means model, but I am not sure whether it still makes sense for the SVM problem.
There is some literature on this topic:
alpha-seeding, in which the training data is divided into chunks. After you train an SVM on the i-th chunk, you take the resulting alpha coefficients and use them to warm-start training with the (i+1)-th chunk.
Incremental SVM serves as an online learning method in which you update the classifier with new examples rather than retraining on the entire data set.
SVM heavy, a package that supports online SVM training as well.
What you are describing is what an online learning algorithm does, and unfortunately the classic formulation of SVM training is done in a batch fashion.
However, there are several SVM solvers that produce a quasi-optimal hypothesis for the underlying optimization problem in an online learning fashion. In particular, my favourite is Pegasos-SVM, which can find a good near-optimal solution in linear time:
http://ttic.uchicago.edu/~nati/Publications/PegasosMPB.pdf
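Pegasos is essentially stochastic gradient descent on the SVM objective, and scikit-learn exposes the same idea through SGDClassifier with hinge loss, which can be updated chunk by chunk via partial_fit. A sketch (the data chunks are placeholders):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Linear SVM (hinge loss) trained by stochastic gradient descent.
clf = SGDClassifier(loss="hinge", alpha=1e-4)

first_call = True
for _ in range(10):
    # Placeholder chunks standing in for old data plus the slightly different new data.
    X_chunk = np.random.rand(200, 30)
    y_chunk = np.random.randint(0, 2, size=200)
    if first_call:
        # The full set of class labels must be given on the first call.
        clf.partial_fit(X_chunk, y_chunk, classes=np.array([0, 1]))
        first_call = False
    else:
        # Each later call refines the existing weights instead of retraining from scratch.
        clf.partial_fit(X_chunk, y_chunk)
```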
In general this doesn't make sense. SVM training is an optimization process with regard to every training set vector. Each training vector has an associated coefficient, which as a result is either 0 (irrelevant) or > 0 (a support vector). Adding another training vector imposes another, different, optimization problem.
The only way I can think of to reuse information from previous training is to choose the support vectors from the previous training and add them to the new training set. I'm not sure, but this will probably affect generalization negatively: the VC dimension of an SVM is related to the number of support vectors, so adding previous support vectors to the new dataset is likely to increase the support vector count.
Apparently, there are more possibilities, as noted in lennon310's answer.