I am relatively new to this domain. Currently I have three models:
Model #1: trained from scratch, but using the GoogLeNet architecture
Model #2: transfer learning (fine-tuning): retrain only the last layer of GoogLeNet, and use the pretrained GoogLeNet model as the initial weights
Model #3: transfer learning (fine-tuning): train all the layers, and use the pretrained GoogLeNet model as the initial weights
Models #2 and #3 showed promising validation accuracy (about 79%-80%); however, when I evaluated them on the test set (30,000 photos), both models performed poorly (error rate: 95%), while Model #1 achieved 70% accuracy on the same test set.
I am afraid, due to my lack of knowledge in this domain, that I have done something wrong when fine-tuning the existing model (GoogLeNet). Following the guidance I got from Quora and Stack Overflow, I tried to modify the train.prototxt.
To get Model #2, I set lr_mult and decay_mult to 0 (zero) for every layer except the last one (in my case, the fc layer).
To get Model #3, I kept all the layers trainable and only modified the last layer, changing its num_output.
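For concreteness, here is roughly the kind of edit I mean, sketched with Caffe's Python protobuf bindings (the layer name and file names below are placeholders, not my actual setup):

# Sketch: zero out lr_mult/decay_mult for every layer except the final
# classifier, which is what "freeze all but the last layer" amounts to
# in train.prototxt. Layers without explicit param { } blocks keep their
# defaults and would need param entries added first.
from caffe.proto import caffe_pb2
from google.protobuf import text_format

net = caffe_pb2.NetParameter()
with open("train.prototxt") as f:
    text_format.Merge(f.read(), net)

LAST_LAYER = "loss3/classifier"  # placeholder name of the final fc layer

for layer in net.layer:
    if layer.name == LAST_LAYER:
        continue  # leave the new classifier trainable
    for p in layer.param:  # typically one entry for weights, one for biases
        p.lr_mult = 0
        p.decay_mult = 0

with open("train_frozen.prototxt", "w") as f:
    f.write(text_format.MessageToString(net))

For Model #3 the only change would be the num_output of that last layer, with all lr_mult values left as they are.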
Then I trained the model using the following command:
/root/caffe/build/tools/caffe train --solver my_solver.prototxt -weights googlenet_places365.caffemodel --gpu=0,1,2,3
My questions are:
Are the steps I described above (for getting Models #2 and #3) correct? Did I miss any important steps?
Why does the validation accuracy during training look high, while the models perform so poorly on the test data (overfitting)? What could be the root causes of this overfitting?
FYI, I have 4 million photos divided into three sets: a training set (70% of the photos), a validation set (20% of the photos), and a test set (10% of the photos).
Thank you very much; I would be really glad to hear any guidance, comments, or information from you.
Related
When pretraining a deep learning model (let's say a deep convolutional neural network) in order to achieve a good weight initialization, do I use the entire training set without validation (so that I avoid information leakage) or just a subset of the training set?
If you want to fine-tune your network after training it on your dataset, then you can use the same dataset (making sure that the data in the training, test, and validation sets do not switch around). What you can also do as 'pre-training' is to download a model that is already trained on a dataset/problem similar to yours and then train it on your dataset. This is known as transfer learning and works well for similar problems, but of course the bigger the gap between the two problems, the more you need to train.
In conclusion: you can use any dataset as long as the validation set remains hidden from the network.
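As an illustration of that second option, here is a minimal transfer-learning sketch in Keras (the base model, class count, and data pipeline are placeholder assumptions, not a prescription):

# Reuse an ImageNet-pretrained base, freeze it, and train only a new
# classifier head on the target dataset.
import tensorflow as tf

NUM_CLASSES = 10  # placeholder for the target problem

base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)
# Optionally unfreeze part of `base` afterwards and fine-tune with a lower LR.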
I think it is more useful to divide the dataset into training, validation, and test data. Keeping a completely unseen test set aside and validating the model only on the validation data is a good choice. The entire training set should be used for training.
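For instance, a common way to carve out the three sets (a scikit-learn sketch; the 70/20/10 ratios and the toy data are just an example):

# Two-step split into train/validation/test (here 70/20/10); the test set
# stays untouched until the final evaluation.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 8)          # placeholder features
y = np.random.randint(0, 2, 1000)    # placeholder labels

X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42)            # 70% train
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=1 / 3, random_state=42)   # 20% val, 10% test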
I have built two ML models with the following roc_auc_score values:
Model 1
Training score - 95%
Test score - 74%
Model 2
Training score - 78%
Test score - 74%
It is highly likely that Model 1 is overfitting, but the test score is the same in both cases. So, which of the two is the better-performing model?
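For context, the gap between the two scores is measured along these lines (a scikit-learn sketch with a placeholder model and data, not the actual pipeline):

# Compare roc_auc_score on the training set vs. a held-out test set; a large
# gap (e.g. 0.95 vs. 0.74) is the overfitting signal discussed below.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"train AUC = {train_auc:.2f}, test AUC = {test_auc:.2f}")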
I assume this is a hypothetical question where all other conditions are equal. In that case, I would argue with Occam's razor and declare the simpler model (probably Model 2) the winner.
In practice, other factors might be important too. For example, have you extensively tuned hyperparameters to get to Model 2 and thus overfitted to the test data?
Without any further information, I would agree that your first model does appear to be overfit. Other than that, both models conceptually have "learned" about the behavior of the underlying real world training data with a similar level of accuracy, as given by the identical test scores.
But because the first model is overfit, it has possibly also incorporated noise from the training data. This additional information won't help the model and might actually hurt when it makes new predictions.
So, I would lean towards using the second model, if I had to choose one of the two.
In general it is hard to give a concrete answer without insight into the use case, the problem to be solved, and the model and training strategy you have chosen.
However, perhaps a differentiation between errors might help:
Bayes Error: This is the theoretically lowest possible error a classifier can reach.
Human Error: The classification error exhibited by a human solving the task.
Avoidable Bias: The difference between the human/Bayes error and the error exhibited by your model on the training set.
Avoidable Variance: The difference between the test error and the training error (see the small numeric sketch after this list).
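Here is that sketch: a few lines that decompose the two models' reported scores into these quantities (treating 1 - score loosely as an error rate, and assuming a placeholder human/Bayes error of 2%):

# Decompose the reported scores into avoidable bias and avoidable variance.
# The 2% human/Bayes error is an assumed placeholder, not a given value.
human_error = 0.02

for name, train_score, test_score in [("Model 1", 0.95, 0.74),
                                      ("Model 2", 0.78, 0.74)]:
    train_error = 1 - train_score
    test_error = 1 - test_score
    avoidable_bias = train_error - human_error
    avoidable_variance = test_error - train_error
    print(f"{name}: bias = {avoidable_bias:.2f}, variance = {avoidable_variance:.2f}")
# Model 1: bias = 0.03, variance = 0.21 -> low bias, high variance (overfitting)
# Model 2: bias = 0.20, variance = 0.04 -> higher bias, low variance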
So in your case, it seems at first sight that Model 1 is overfitting compared to Model 2, since it shows a much larger variance (gap between training and test error). That alone does not tell you which model is better; it depends. I would advise you to:
Take a closer look at your available data: what is its distribution? How does it differ from the data the model will see once deployed?
Apply further training techniques to Model 1 to see if you can reduce the test error: data augmentation (appropriate to the task), weight regularization, dropout, etc. (see the sketch after this list).
If you have already done this extensively, then I would analyze the performance/computational cost of both models (which one is faster/lighter) and, as #saibot suggested, go with the simpler one (the one that consumes fewer resources), following Occam's razor.
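The second point could look roughly like this in Keras (an illustrative sketch; the architecture, augmentation layers, and hyperparameters are assumptions, not a recipe):

# Combine the regularization ideas above: data augmentation, L2 weight
# regularization, and dropout (assumes a recent TensorFlow/Keras version).
import tensorflow as tf

model = tf.keras.Sequential([
    # Lightweight augmentation layers, active only during training.
    tf.keras.layers.RandomFlip("horizontal", input_shape=(64, 64, 3)),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.Conv2D(32, 3, activation="relu",
                           kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),  # randomly drop units to reduce co-adaptation
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])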
Remember, the goal is not necessarily to get your test error equal to the training error; it is to get your test error as close as possible to the Bayes error.
I have a general question regarding fine-tuning and transfer learning, which came up when I tried to figure out how best to get YOLO to detect my custom object (hands).
I apologize for the long text, which possibly contains lots of false information. I would be glad if someone had the patience to read it and help me clear up my confusion.
After lots of googling, I learned that many people regard fine-tuning as a sub-class of transfer learning, while others believe that they are two different approaches to training a model. At the same time, people differentiate between re-training only the last classifier layer of a model on a custom dataset vs. also re-training other layers of the model (and possibly adding an entirely new classifier instead of retraining?). Both approaches use pre-trained models.
My final confusion lies here: I followed these instructions: https://github.com/thtrieu/darkflow to train tiny YOLO via darkflow, using the command:
# Initialize yolo-new from yolo-tiny, then train the net on 100% GPU:
flow --model cfg/yolo-new.cfg --load bin/tiny-yolo.weights --train --gpu 1.0
But what happens here? I suppose I only retrain the classifier, because the instructions say to change the number of classes in the last layer of the configuration file. But then again, it is also required to change the number of filters in the second-to-last layer, a convolutional layer.
Lastly, the instructions provide an example of an alternative training:
# Completely initialize yolo-new and train it with ADAM optimizer
flow --model cfg/yolo-new.cfg --train --trainer adam
I don't understand at all how this relates to the different ways of transfer learning.
If you are using AlexeyAB's darknet repo (not darkflow), he suggests doing fine-tuning instead of transfer learning by setting this parameter in the cfg file: stopbackward=1.
Then run ./darknet partial yourConfigFile.cfg yourWeightsFile.weights outPutName.LastLayer# LastLayer#, for example:
./darknet partial cfg/yolov3.cfg yolov3.weights yolov3.conv.81 81
This will create yolov3.conv.81 and will freeze the lower layers; you can then train using the weights file yolov3.conv.81 instead of the original darknet53.conv.74.
References : https://github.com/AlexeyAB/darknet#how-to-improve-object-detection , https://groups.google.com/forum/#!topic/darknet/mKkQrjuLPDU
I have not worked with YOLO, but looking at your problems I think I can help. Fine-tuning, re-training, and post-tuning are all somewhat ambiguous terms that are often used interchangeably. It's all about how much you want to change the pre-trained weights.
In the first case, since you pass --load, the pre-trained weights are being loaded; this could mean you are adjusting those weights slightly with a low learning rate, or perhaps not changing some of them at all. In the second case, however, you are not loading any weights, so you are most likely training from scratch. When you make small (fine) changes, call it fine-tuning; post-tuning would be tuning again after the initial training, perhaps not as fine as fine-tuning; and re-training would then be training the whole network, or a part of it, again.
There are also separate mechanisms for optionally freezing some layers.
I am trying to overfit my model on training data that consists of only a single sample. The training accuracy comes out to 1.00, but when I predict the output for my test data, which consists of the same single training sample, the results are not accurate. The model has been trained for 100 epochs and the loss is ~1e-4.
What could be the possible sources of error?
As mentioned in the comments of your post, it isn't possible to give specific advice without you first providing more details.
Generally speaking, your approach to overfitting a tiny batch (in your case one image) is in essence providing three sanity checks, i.e. that:
backprop is functioning
the weight updates are doing their job
the learning rate is in the correct order of magnitude
As Andrej Karpathy points out in Lecture 5 of the CS231n course at Stanford: "if you can't overfit on a tiny batch size, things are definitely broken".
This means, given your description, that your implementation is incorrect. I would start by checking each of the three points listed above. For example, alter your test somehow by picking several different images, or a batch size of 5 images instead of one. You could also revise your predict function, as that is where there is definitely some discrepancy, given that you are getting near-zero error during training (and so during validation?).
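As a concrete version of this sanity check, here is a minimal sketch (Keras, with a made-up toy model and sample; your framework and preprocessing may differ):

# Overfit a single fake sample, then predict the *identical*, identically
# preprocessed sample back; with ~zero training loss the label should match.
import numpy as np
import tensorflow as tf

x = np.random.rand(1, 32, 32, 3).astype("float32")  # one placeholder image
y = np.array([3])                                    # its made-up class id

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=100, verbose=0)

pred = model.predict(x).argmax(axis=1)
print("true:", y[0], "predicted:", pred[0])  # these should agree

If they do not agree, the discrepancy most likely sits in the prediction path (different preprocessing, wrong input scaling, or a mismatched label mapping) rather than in training itself.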
I am using the Weka GUI for classification. I am new to Weka and getting confused by the options
Use training set
Supplied test set
Cross-validation
To train my classification algorithm (for example, J48), I used 10-fold cross-validation and the accuracy is pretty good (97%). When I test the classifier, the accuracy drops to about 72%. I am so confused. Any tips, please? This is how I did it:
I train my model on the training data (for example: train.arff)
I right-click in the Results list on the item whose model I want to save
I select Save model and save it, for example, as j48tree.model
and then
I load the test data (for example: test.arff) via the Supplied test set button
I right-click in the Results list, select Load model, and choose j48tree.model
I select Re-evaluate model on current test set
Is the way I do it wrong? Why does the accuracy drop so miserably from 97% to 72%? Or is doing only 10-fold cross-validation enough to train and test the classifier?
Note: my training and test datasets have the same attributes and labels. The only difference is that I have more data in the test set, which I don't think should be a problem.
I don't think there is any issue with how you use WEKA.
You mentioned that your test set is larger than your training set? What is the split? The usual rule of thumb is that the test set should be about 1/4 of the whole dataset, i.e. three times smaller than the training set and definitely not larger. This alone could explain the drop from 97% to 72%, which is, by the way, not so bad for a real-life case.
It will also be helpful to build a learning curve (https://weka.wikispaces.com/Learning+curves), as it will show whether you have a bias or a variance issue. Judging by your values, it sounds like you have high variance (i.e. too many parameters for your dataset), so adding more examples or changing your split between training and test sets will likely help.
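Weka has its own learning-curve tooling (linked above); as an illustration of the same idea in scikit-learn (with a toy dataset and classifier standing in for yours):

# Train on growing subsets and compare training vs. cross-validated scores;
# a large, persistent gap between the two points to high variance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5), scoring="accuracy")

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:5d}  train={tr:.3f}  cv={va:.3f}")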
Update
I ran a quick analysis of the dataset in question with a random forest, and my performance was similar to the one posted by the author. Details and code are available on my GitHub page: http://omdv.github.io/2016/03/10/WEKA-stackoverflow