Transformer model fine-tuning on single GPU

Transformer model fine-tuning on single GPU - machine-learning

Is fine-tuning a pre-trained transformer model a easier model an ‘easier’ task than training a transformer from scratch (BERT, GPT-2) in terms of GPU needs and GPU memory usage?
To clarify further, I’ve read how to train most transformer models, one would require multi-GPU training. However, is it possible to fine-tune some of these models on a single-GPU?
Why is this the case?
Is it because we can use with smaller batches, the fine-tuning time is not as much as training from scratch?

Yes, fine-tuning a pre-trained Transformer model is a typical way to go. The training hours required are prohibitively large (even hundred of thousands of hours of decent GPU card per model) while fine-tuning can be done on a single GPU. The reason is that fine-tuning entails training only a few layers on top of the output of the pre-trained model to tailor for a given task. As such, fine-tuning requires less data and significantly less training time to achieve good results.

Related

What is meant by stability in relation to neural networks

I hear the terms stability/instability thrown around a lot when reading up on Deep Q Networks. I understand that stability is improved with the addition of a target network and replay buffer but I fail to understand exactly what it's refering to.
What would the loss graph look like for an instable vs stable neural network?
What does it mean when a neural network converges/diverges?

Stability, also known as algorithmic stability, is a notion in
computational learning theory of how a machine learning algorithm is
perturbed by small changes to its inputs. A stable learning algorithm
is one for which the prediction does not change much when the training
data is modified slightly.
Here Stability means suppose you have 1000 training data that you use to train the model and it performs well. So in terms of model stability if you train the same model with 900 training data the model should still perform well , thats why it is also called as algorithmic stability.
As For the loss Graph if the model is stable the loss graph probably should be same for both size of training data (1000 & 900). And different in case of unstable model.
As in Machine learning we want to minimize loss so when we say a model converges we mean to say that the model's loss value is within acceptable margin and the model is at that stage where no additional training would improve the model.
Divergence is a non-symmetric metrics which is used to measure the difference between continuous value. For example you want to calculate difference between 2 graphs you would use Divergence instead of traditional symmetric metrics like Distance.

Why I need pre-trained weight in transfer learning

I'm learning transfer learning with some pre-trained models (vgg16, vgg19,…), and I wonder why I need to load pre-trained weight to train my own dataset.
I can understand if the classes in my dataset are included in the dataset that the pre-trained model is trained with. For example, VGG model was trained with 1000 classes in Imagenet dataset, and my model is to classify cat-dog, which are also in the Imagenet dataset. But here the classes in my dataset are not in this dataset. So how the pre-trained weight can help?

You don't have to use a pretrained network in order to train a model for your task. However, in practice using a pretrained network and retraining it to your task/dataset is usually faster and often you end up with better models yielding higher accuracy. This is especially the case if you do not have a lot of training data.
Why faster?
It turns out that (relatively) independent of the dataset and target classes, the first couple of layers converge to similar results. This is due to the fact that low level layers usually act as edge, corner and other simple structure detectors. Check out this example that visualizes the structures that filters of different layers "react" to. Having already trained the lower layers, adapting the higher level layers to your use case is much faster.
Why more accurate?
This question is harder to answer. IMHO it is due to the fact that pretrained models that you use as basis for transfer learning were trained on massive datasets. This means that the knowledge acquired flows into your retrained network and will help you to find a better local minimum of your loss function.
If you are in the compfortable situation that you have a lot of training data you should probably train a model from scratch as the pretained model might "point you in the wrong direction".
In this master thesis you can find a bunch of tasks (small datasets, medium datasets, small semantical gap, large semantical gap) where 3 methods are compared : fine tuning, features extraction + SVM, from scratch. Fine tuning a model pretrained on Imagenet is almost always a better choice.

Does it overfit if the nested models are trained on the same data

Does it overfit if I build a machine learning model where it use the output from another machine learning model while both models are trained on the same data?
Basically I was wondering if I can use the KNN prediction result as an input for a deep neural network model while both of the models are trained on the very same data.

Nesting machine learning models is possible. For example, neuronal networks can be seen as multiple nested perceptrons (see https://en.wikipedia.org/wiki/Perceptron).
However you are right - nesting machine learning models increase the VC-dimension (https://en.wikipedia.org/wiki/VC_dimension) of your complete machine learning system and thus the risk of overfitting.
In practice cross-validation is often used in order to reduce the risk of overfitting.
Edit:
#MatiasValdenegro +1 for pointing towards a point I do not specify very clearly in my answer. Pure cross-validation can indeed only be used in order to detect overfitting.
However when we training certain machine learning systems like neuronal networks, it is possible to use some sort of cross-validation in order to reduce the risk of overfitting. In order to do so, we simply discard e.g. 10% of the training data for training. Then after each training round, the trained machine learning system is evaluated on the discarded training data. Once the trained neuronal network is getting worse on the discarded part, the training algorithm stops. This is for example done by the python pybrain (http://pybrain.org/) library.

Fine tuning of pre-trained convolutional neural network [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
As I read and searched about the fine tuning of pre-trained network, it is done in following two steps (in short):
freeze the hidden layer and unfreeze the fully connected layer and trained.
unfreeze both the layers and again train.
My questions are:
Whether it is enough to perform only first step?
If I preform only first step, is it not same as network as a feature extractor method?
(The network as a feature extractor method is, to extract the feature using pre-trained network and classify it using tradition machine learning classification algorithm).
If you want more information to clarify the question, please let me know.

There are some issues with your question...
First, you clearly imply a network with only 2 layers, which is rather (very) far from the way fine-tuning is actually used in practice nowadays.
Second, what exactly do you mean by "enough" in your first question (enough for what)?
In fact, there is enough overlapping between the notions of pre-trained models, feature extractors, and fine-tuning, and different people may even use the involved terms in not exactly the same ways. One approach, adopted by the Stanford CNNs for Visual Recognition course, is to consider all these as special cases of something more general called transfer learning; here is a useful excerpt from the respective section of the aforementioned course, which arguably addresses the spirit (if not the letter) of your questions:
The three major Transfer Learning scenarios look as follows:
ConvNet as fixed feature extractor. Take a ConvNet pretrained on ImageNet, remove the last fully-connected layer (this layer’s outputs are the 1000 class scores for a different task like ImageNet), then treat the rest of the ConvNet as a fixed feature extractor for the new dataset. In an AlexNet, this would compute a 4096-D vector for every image that contains the activations of the hidden layer immediately before the classifier. We call these features CNN codes. It is important for performance that these codes are ReLUd (i.e. thresholded at zero) if they were also thresholded during the training of the ConvNet on ImageNet (as is usually the case). Once you extract the 4096-D codes for all images, train a linear classifier (e.g. Linear SVM or Softmax classifier) for the new dataset.
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or it’s possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but later layers of the ConvNet becomes progressively more specific to the details of the classes contained in the original dataset. In case of ImageNet for example, which contains many dog breeds, a significant portion of the representational power of the ConvNet may be devoted to features that are specific to differentiating between dog breeds.
Pretrained models. Since modern ConvNets take 2-3 weeks to train across multiple GPUs on ImageNet, it is common to see people release their final ConvNet checkpoints for the benefit of others who can use the networks for fine-tuning. For example, the Caffe library has a Model Zoo where people share their network weights.
When and how to fine-tune? How do you decide what type of transfer learning you should perform on a new dataset? This is a function of several factors, but the two most important ones are the size of the new dataset (small or big), and its similarity to the original dataset (e.g. ImageNet-like in terms of the content of images and the classes, or very different, such as microscope images). Keeping in mind that ConvNet features are more generic in early layers and more original-dataset-specific in later layers, here are some common rules of thumb for navigating the 4 major scenarios:
New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the ConvNet to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN codes.
New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won’t overfit if we were to try to fine-tune through the full network.
New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier form the top of the network, which contains more dataset-specific features. Instead, it might work better to train the SVM classifier from activations somewhere earlier in the network.
New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can afford to train a ConvNet from scratch. However, in practice it is very often still beneficial to initialize with weights from a pretrained model. In this case, we would have enough data and confidence to fine-tune through the entire network.

Why do we use metric learning when we can classify

So far, I have read some highly cited metric learning papers. The general idea of such papers is to learn a mapping such that mapped data points with same label lie close to each other and far from samples of other classes. To evaluate such techniques they report the accuracy of the KNN classifier on the generated embedding. So my question is if we have a labelled dataset and we are interested in increasing the accuracy of classification task, why do not we learn a classifier on the original datapoints. I mean instead of finding a new embedding which suites KNN classifier, we can learn a classifier that fits the (not embedded) datapoints. Based on what I have read so far the classification accuracy of such classifiers is much better than metric learning approaches. Is there a study that shows metric learning+KNN performs better than fitting a (good) classifier at least on some datasets?

Metric learning models CAN BE classifiers. So I will answer the question that why do we need metric learning for classification.
Let me give you an example. When you have a dataset of millions of classes and some classes have only limited examples, let's say less than 5. If you use classifiers such as SVMs or normal CNNs, you will find it impossible to train because those classifiers (discriminative models) will totally ignore the classes of few examples.
But for the metric learning models, it is not a problem since they are based on generative models.
By the way, the large number of classes is a challenge for discriminative models itself.
The real-life challenge inspires us to explore more better models.

As #Tengerye mentioned, you can use models trained using metric learning for classification. KNN is the simplest approach but you can take the embeddings of your data and train another classifier, be it KNN, SVM, Neural Network, etc. The use of metric learning, in this case, would be to change the original input space to another one which would be easier for a classifier to handle.
Apart from discriminative models being hard to train when data is unbalanced, or even worse, have very few examples per class, they cannot be easily extended for new classes.
Take for example facial recognition, if facial recognition models are trained as classification models, these models would only work for the faces it has seen and wouldn't work for any new face. Of course, you could add images for the faces you wish to add and retrain the model or fine-tune the model if possible, but this is highly impractical. On the other hand, facial recognition models trained using metric learning can generate embeddings for new faces, which can be easily added to the KNN and your system then can identify the new person given his/her image.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart