Multi GPU training for Transformers with different GPUs - machine-learning

I want to fine-tune a GPT-2 model using Huggingface’s Transformers, preferably the medium model, but large if possible. Currently, I have an RTX 2080 Ti with 11GB of memory and I can train the small model just fine.
My question is: will I run into any issues if I add an old Tesla K80 (24GB) to my machine and distribute the training? I cannot find information about using GPUs of different capacities during training and the issues I could run into.
Will my model size limit essentially be sum of all available GPU memory? (35GB?)
I’m not interested in doing this in AWS.

You already solved your problem. That's great. I would like to point out a different approach and address a few questions.
Will my model size limit essentially be sum of all available GPU memory? (35GB?)
This depends on the training technique you use. Standard data parallelism replicates the model, gradients and optimizer states on each of the GPUs, so each GPU must have enough memory to hold all of these. Only the data is split across the GPUs. However, the bottleneck is usually the optimizer states and the model, not the data.
The state-of-the-art approach in training is ZeRO. Not only the dataset, but also the model parameters, the gradients and the optimizer states are split across the GPUs. This allows you to train huge models without hitting OOM. The paper has a nice illustration of this: starting from the baseline (the standard case that I mentioned), the authors gradually split the optimizer states, gradients and model parameters across the GPUs and compare the memory usage per GPU.
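To make the memory comparison concrete, here is a rough sketch of the per-GPU accounting, assuming mixed-precision Adam as in the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer states). The function name is mine, and activations, buffers and fragmentation are ignored, so treat the numbers as lower bounds.

```python
# Per-GPU memory for model states under mixed-precision Adam, following the
# accounting used in the ZeRO paper: 2 bytes/param for fp16 weights,
# 2 bytes/param for fp16 gradients, and k = 12 bytes/param for optimizer
# states (fp32 copy of the weights, momentum and variance).
# Activations and temporary buffers are NOT included.

def model_state_bytes_per_gpu(num_params, num_gpus, stage, k=12):
    """Return the bytes of model states each GPU has to hold.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer states partitioned
    stage 2: optimizer states + gradients partitioned
    stage 3: optimizer states + gradients + parameters partitioned
    """
    params = 2 * num_params
    grads = 2 * num_params
    opt = k * num_params
    if stage == 0:
        return params + grads + opt
    if stage == 1:
        return params + grads + opt / num_gpus
    if stage == 2:
        return params + (grads + opt) / num_gpus
    if stage == 3:
        return (params + grads + opt) / num_gpus
    raise ValueError("stage must be 0, 1, 2 or 3")


if __name__ == "__main__":
    # The paper's running example: 7.5B parameters on 64 GPUs.
    for stage in range(4):
        gb = model_state_bytes_per_gpu(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

With plain data parallelism the per-GPU cost does not shrink as you add GPUs, which is why the smaller card sets the limit. For a model of roughly 350 million parameters, the same formula gives about 5.6 GB of model states per GPU before activations.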
The authors of the paper created a library called DeepSpeed, and it is very easy to integrate with Huggingface. With it I was able to increase my model size from 260 million to 11 billion parameters :)
If you want to understand in detail how it works, here is the paper:
https://arxiv.org/pdf/1910.02054.pdf
More information on integrating DeepSpeed with Huggingface can be found here:
https://huggingface.co/docs/transformers/main_classes/deepspeed
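As a rough idea of what the integration involves, the sketch below writes a minimal DeepSpeed config that enables ZeRO stage 2 with the optimizer states offloaded to CPU memory. The file name and helper names are my own choices, and the settings are a starting point under those assumptions, not a tuned setup; check the documentation linked above for what fits your hardware.

```python
import json


def build_zero2_config():
    """Return a minimal DeepSpeed config dict enabling ZeRO stage 2.

    The "auto" values are placeholders that the Huggingface integration
    fills in from the values given in TrainingArguments.
    """
    return {
        "zero_optimization": {
            "stage": 2,
            # Keep optimizer states in CPU RAM to free up GPU memory.
            "offload_optimizer": {"device": "cpu"},
        },
        "fp16": {"enabled": "auto"},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }


def write_config(path):
    """Write the config as JSON and return the path."""
    with open(path, "w") as f:
        json.dump(build_zero2_config(), f, indent=2)
    return path


if __name__ == "__main__":
    # "ds_config.json" is a hypothetical file name.
    print(write_config("ds_config.json"))
```

The resulting file would then be passed to the Trainer via TrainingArguments(deepspeed="ds_config.json"), with the training script started through the deepspeed launcher.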
PS: There is also the model parallelism technique, in which each GPU trains different layers of the model, but it has lost its popularity and is not being actively used.

Related

Do GPUs have a noticeable improvement in ML predictions? (not training)

I am interested in improving performance with regard to ML predictions. (I don't care about training.)
-Will GPUs provide more throughput or lower latency?
-Are they good for batch or online serving?
-What types of models would be most impacted by using GPUs?
Disclaimer: The real answer is "it depends; if such a decision is important to you, you should benchmark CPU performance against GPU performance on your target systems to make an informed decision." The rest of this answer is just advice to loosely guide your decision when you don't want to (or don't have time to) do any benchmarking.
In a research environment, predictions are often (though not always) done in batches. As such, even if the model is entirely serial (i.e. there is an execution dependency between every pair of operations), it will likely still benefit from parallelization in that those serial operations may have to be replicated for multiple query points simultaneously, and so you can parallelize predictions across query points within a batch. So if your prediction setting involves batches, you should pretty much always use a GPU. From my own research experiences, a GPU is always faster than a CPU in batched prediction settings, regardless of the model used.
If you are only making a single prediction at a time (e.g. an "online" prediction setting), most modern ML methods are still highly parallelizable in general. In a neural network, for instance, there are only execution dependencies between layers; there are no execution dependencies between nodes within a layer. If you have many nodes per layer (which most modern deep learning architectures do), then your model is likely very parallelizable and can benefit from using a GPU instead of a CPU.
Naive Bayes classifiers make predictions by computing a bunch of (supposedly) conditionally independent probabilities, which can be parallelized, and then multiplying them together, which can be parallelized via reduction. As such, they may also benefit from using a GPU instead of a CPU.
For a support vector machine with the dual problem approach, making a prediction requires computing an inner product (kernel trick) between each training data point and the query point, multiplying each inner product by the corresponding parameter and target binary label, and summing the results. This can very easily be parallelized in a similar way to naive Bayes classifiers.
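The structure of that computation can be sketched in a few lines. This is a plain-Python illustration with a linear kernel and made-up function names, not how any particular SVM library implements it. The point is that every term depends on a single training point, so the terms can be computed independently and then combined with a reduction (a sum).

```python
def linear_kernel(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def svm_decision(query, support_vectors, alphas, labels, bias=0.0,
                 kernel=linear_kernel):
    """Dual-form decision value: sum_i alpha_i * y_i * K(x_i, query) + bias."""
    # Each term uses only one training point, so on a GPU all of them
    # could be computed at the same time.
    terms = [
        alpha * label * kernel(sv, query)
        for sv, alpha, label in zip(support_vectors, alphas, labels)
    ]
    # Summing the terms is a reduction, which also parallelizes well.
    return sum(terms) + bias


def svm_predict(query, support_vectors, alphas, labels, bias=0.0,
                kernel=linear_kernel):
    """Return +1 or -1 depending on the sign of the decision value."""
    value = svm_decision(query, support_vectors, alphas, labels, bias, kernel)
    return 1 if value >= 0 else -1
```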
The list goes on. The point is, most ML methods are at least relatively conducive to parallelization even if you're processing a single query point at a time, and extremely conducive to parallelization if you're processing query points in batch. This makes them generally run faster on the "average" GPU than the "average" CPU.
But ultimately, it depends on your model and target system, so if it matters that much to you, you should benchmark to make an informed decision.
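If you do decide to benchmark, a very small timing harness is enough to get started. The sketch below is only an assumption about how you might set it up: dummy_predict is a hypothetical stand-in for your model's prediction function, and you would run the same measurement once with the CPU version and once with the GPU version of your model.

```python
import statistics
import time


def median_seconds(fn, arg, repeats=20, warmup=3):
    """Median wall-clock time of fn(arg) over several runs."""
    # Warm-up runs so one-time costs (caches, lazy initialisation)
    # do not distort the measurement.
    for _ in range(warmup):
        fn(arg)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def dummy_predict(batch):
    """Hypothetical stand-in for a real model: one score per row."""
    return [sum(x * x for x in row) for row in batch]


if __name__ == "__main__":
    single = [[0.1] * 256]   # "online" setting: one query point
    batch = single * 64      # batched setting: 64 query points
    t_single = median_seconds(dummy_predict, single)
    t_batch = median_seconds(dummy_predict, batch)
    print(f"latency, single query: {t_single * 1e3:.3f} ms")
    print(f"throughput, batch of 64: {64 / t_batch:.0f} predictions/s")
```

When timing a real GPU model, make sure the measured function waits for the GPU to finish (for example by copying the result back to the host), otherwise you only measure how long it takes to queue the work.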

Do you need to train your ml model equal no. of times before and after fine tuning while using transfer learning?

I'm trying to make a model that can classify 7 different denominations of bank notes. I'm using VGG19 as a convolutional base. I have a dataset of more than 10000 images, with each category containing more than 1k. How many layers should I add after the convolutional base, and what should the size of each layer be?
This question is too vague. Choosing the right architecture is not a simple task; it depends on your domain. With ready-made architectures you're prone to overshooting the problem: the capacity of such networks might be overkill. You'll get a nice low entropy in your net, but it will overfit like crazy. Rule of thumb: start with smaller nets, slowly build up and compare your metrics.
There's a similar thread here.
There's ongoing research regarding Neural Architecture Search (NAS). As far as I know, the only available solution is Google's AutoML. Metaheuristics-based NAS is still in its infancy and won't be widely used for some time.
The most popular open-source NAS tool is AutoKeras.

what machine learning algorithm could be better for this scenario

I have a dataset of roughly 15M observations, with approximately 3% of them belonging to the class of interest. I can train the model on a PC, but I need to deploy the classifier on a Raspberry Pi 3. Since the Raspberry Pi has such limited memory, which algorithms put the least load on it?
Additional info: the dataset is hard to differentiate. For example, ANNs can't get past the 80% detection rate for the interest class, no matter the architecture or activation function. Random forest has demonstrated great performance but the number of trees and nodes required aren't feasible for the implementation on a microcontroller.
Thank you, in advance.
You could potentially trim the trees in the Random Forest approach to balance classifier performance against memory and processing power requirements.
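To get a feeling for how much trimming buys you, here is a back-of-the-envelope sketch. It assumes binary trees limited to a maximum depth and a hypothetical 16 bytes per stored node; the real per-node cost depends entirely on how you serialize the trees on the device.

```python
def max_nodes_per_tree(max_depth):
    """Upper bound on nodes in a binary tree of the given depth.

    A root-only tree has depth 0 and a full tree of depth d has
    2^(d+1) - 1 nodes.
    """
    return 2 ** (max_depth + 1) - 1


def forest_size_bytes(n_trees, max_depth, bytes_per_node=16):
    """Worst-case forest size; bytes_per_node is an assumed value."""
    return n_trees * max_nodes_per_tree(max_depth) * bytes_per_node


if __name__ == "__main__":
    for depth in (6, 10, 14, 18):
        mb = forest_size_bytes(100, depth) / 1e6
        print(f"100 trees, max depth {depth}: at most {mb:.1f} MB")
```

Since the bound doubles with every extra level, capping the depth (or the number of leaves) is usually a more effective lever than removing a few trees. In scikit-learn this corresponds to the max_depth, max_leaf_nodes and n_estimators parameters of RandomForestClassifier.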
Also, I suspect you have strongly imbalanced train/test sets, so I wonder if you used any of the approaches suggested for this case (e.g. SMOTE, ADASYN, etc.). If you are using Python, I strongly suggest looking at the imbalanced-learn library. Using such an approach could lead to a smaller classifier with acceptably good performance that you would be able to fit and run on the target device.
Last but not least, this question could easily go to Cross Validated or Data Science sites.

MobileNet vs SqueezeNet vs ResNet50 vs Inception v3 vs VGG16

I have recently been looking into incorporating the machine learning release for iOS developers into my app. Since this is my first time ever using anything ML related, I was very lost when I started reading the different model descriptions that Apple has made available. They have the same purpose/description, the only difference being the actual file size. What is the difference between these models, and how would you know which one is the best fit?
The models Apple makes available are just for simple demo purposes. Most of the time, these models are not sufficient for use in your own app.
The models on Apple's download page are trained for a very specific purpose: image classification on the ImageNet dataset. This means they can take an image and tell you what the "main" object is in the image, but only if it's one of the 1,000 categories from the ImageNet dataset.
Usually, this is not what you want to do in your own apps. If your app wants to do image classification, typically you want to train a model on your own categories (like food or cars or whatever). In that case you can take something like Inception-v3 (the original, not the Core ML version) and re-train it on your own data. That gives you a new model, which you then need to convert to Core ML again.
If your app wants to do something other than image classification, you can use these pretrained models as "feature extractors" in a larger neural network structure. But again this involves training your own model (usually from scratch) and then converting the result to Core ML.
So only in a very specific use case -- image classification using the 1,000 ImageNet categories -- are these Apple-provided models useful to your app.
If you do want to use any of these models, the difference between them is speed vs. accuracy. The smaller models are fastest but also least accurate. (In my opinion, VGG16 shouldn't be used on mobile. It's just too big and it's no more accurate than Inception or even MobileNet.)
SqueezeNets are fully convolutional and use Fire modules, which have a squeeze layer of 1x1 convolutions. This vastly decreases the parameter count because it restricts the number of input channels to each layer. Together with the fact that they have no dense layers, this makes SqueezeNets extremely low latency.
MobileNets utilise depth-wise separable convolutions, very similar to the inception towers in Inception. These also reduce the number of parameters and hence latency. MobileNets also have useful model-shrinking parameters that you can set before training to make the model the exact size you want. The Keras implementation can use ImageNet pre-trained weights too.
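The parameter saving from depth-wise separable convolutions is easy to check with a bit of arithmetic. The sketch below counts weights only (biases are ignored), and the layer sizes in the example are arbitrary values picked for illustration.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depth-wise conv followed by a 1x1 point-wise conv."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution that mixes the channels
    return depthwise + pointwise


if __name__ == "__main__":
    k, c_in, c_out = 3, 128, 256
    std = standard_conv_params(k, c_in, c_out)
    sep = depthwise_separable_params(k, c_in, c_out)
    print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.1f}x")
```

For a 3x3 layer going from 128 to 256 channels this is 294912 weights against 33920, a bit under a 9x reduction.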
The other models are very deep, large models. Where they reduce the number of parameters or use a particular style of convolution, the goal is not low latency but essentially the ability to train very deep models. ResNet introduced residual connections between layers, which were originally believed to be key in training very deep models. These aren't seen in the previously mentioned low latency models.

Choice of infrastructure for faster deep learning model training with tensorflow?

I am a newbie in deep learning with tensorflow. I am trying out a seq2seq model sample code.
I wanted to understand:
What are the minimum values of number of layers, layer size and batch size that I could start off with to be able to test the seq2seq model with satisfactory accuracy?
Also, what is the minimum infrastructure setup required in terms of memory and CPU capability to train this deep learning model within a max time of a few hours?
My experience has been that training a seq2seq model with 2 layers of size 900 and batch size 4
took around 3 days on a 4GB RAM, 3GHz Intel i5 single core processor.
took around 1 day on an 8GB RAM, 3GHz Intel i5 single core processor.
Which helps the most for faster training: more RAM capacity, multiple CPU cores or a CPU + GPU combination?
Disclaimer: I'm also new, and could be wrong on a lot of this.
I am a newbie in deep learning with tensorflow. I am trying out a seq2seq model sample code.
I wanted to understand:
What is the minimum values of number of layers, layer size and batch size that I could start off with to be able to test the seq2seq model with satisfactory accuracy?
I think that this will just have to be up to your experimentation. Find out what works for your data set. I have heard a few pieces of advice: don't pick your own architecture if you can avoid it; find someone else's that is tried and tested. It seems deeper networks are better than wider ones if you're going to choose between the two. I also think bigger batch sizes are better if you have the memory. I've also heard that you should maximize network size and then regularize so you don't overfit.
I have the impression these are big questions that no one really knows the answer to (could be very wrong about this!). We'd all love a smart way of choosing layer size / number of layers, but no one knows exactly how changing these things affects training.
Also, the minimum infrastructure setup required in terms of memory and cpu capability to train this deep learning model within a max time of a few hours.
Depending on your model, that could be an unreasonable request. Seems like some models train for hundreds if not thousands of hours (on GPUs).
My experience has been training a seq2seq model to build a neural network with 2 layers of size 900 and batch size 4 took around 3 days to train on a 4GB RAM, 3GHz Intel i5 single core processor. took around 1 day to train on a 8GB RAM, 3GHz Intel i5 single core processor. Which helps the most for faster training - more RAM capacity, multiple CPU cores or a CPU + GPU combination core?
I believe a GPU will help you the most. I have seen some work that uses the CPU (asynchronous actor-critic or something? They didn't use locking) where it seemed like the CPU was better, but I think a GPU will give you huge speedups.

Resources