For many machine learning models it is desirable to compare models that have a similar number of multiplies, so as to limit the total inference time of the finished product (e.g. when the algorithm is to be deployed on a mobile device). Many neural network libraries have functionality for easily calculating the total number of multiplies in a model; however, I could not find a similar feature for TensorFlow.
Hence my question is: how would one go about calculating the total number of multiply operations? Is there possibly a tool that someone has developed for that purpose?
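For reference, TensorFlow's graph profiler can report floating-point operation counts, which gets most of the way there. A minimal sketch, assuming TF 2.x with the tf.compat.v1.profiler API (in TF 1.x the same calls live under tf.profiler); the toy matmul graph is just a placeholder for your own model:

    import numpy as np
    import tensorflow as tf

    # Build a small TF1-style graph so the profiler can walk it.
    g = tf.Graph()
    with g.as_default():
        x = tf.constant(np.random.randn(1, 100).astype(np.float32))
        w = tf.constant(np.random.randn(100, 10).astype(np.float32))
        y = tf.matmul(x, w)  # 100 * 10 multiplies (plus roughly as many adds)

        opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
        prof = tf.compat.v1.profiler.profile(graph=g, options=opts)

    # total_float_ops counts multiplies and adds together; for matmul/conv layers
    # the multiply count is roughly half of the reported figure.
    print("Total float ops:", prof.total_float_ops)

Note that the profiler needs fully defined shapes, so feed fixed-size inputs (e.g. batch size 1) when counting.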
I'm currently working with a huge dataset consisting of speech recordings of conversations between 120 persons. For each person, I have around 5 conversation recordings lasting 40-60 min each (the conversations are dyadic). For each recording, I have a label (i.e., around 600 labels in total). The labels provide information about the mental state of one of the persons in the conversation (three classes). To classify this mental state based on the speech recordings, I see the following three possibilities:
1. Extracting Mel-spectrograms or MFCCs (better for speech) and training an RNN/CNN (e.g., ConvLSTM) for the classification task. The problem I see here is that, with the small number of labels, it might overfit. In addition, the recordings are long, so training an RNN might be difficult. A network pretrained on a different task (e.g., speech recognition?) could also be used (but is probably not available for RNNs).
2. Training a CNN autoencoder on Mel-spectrograms or MFCCs over small shifted windows (e.g., 1 minute). The encoder could then be used to extract features. The problem here is that the whole recording probably has to be used for the prediction, so features would need to be extracted over the whole recording with the same shifted windows used to train the autoencoder.
3. Extracting features manually (e.g., frequency-based features, etc.) and using an SVM or Random Forest for the prediction, which might be better suited to the small number of labels (a fully-connected network could also be used for comparison). The advantage here is that features can be chosen that are independent of the length of the recording (a rough sketch of this option is given below).
Which option do you think is best? Do you have any recommendations?
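To make option 3 concrete, here is a minimal sketch (assuming librosa and scikit-learn): duration-independent summary statistics of MFCCs per recording, a Random Forest, and speaker-independent cross-validation so no person appears in both train and test folds. The random waveforms, labels and speaker ids are placeholders for the real recordings:

    import numpy as np
    import librosa
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GroupKFold, cross_val_score

    SR = 16000

    def recording_features(y, sr=SR, n_mfcc=20):
        """Summarise a whole recording with a fixed-length, duration-independent vector."""
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # (n_mfcc, frames)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # length 2 * n_mfcc

    # Stand-in data: replace with waveforms loaded via librosa.load(path, sr=SR).
    rng = np.random.default_rng(0)
    waveforms = [rng.standard_normal(SR * 5).astype(np.float32) for _ in range(60)]
    labels = rng.integers(0, 3, size=60)      # three mental-state classes
    speakers = rng.integers(0, 12, size=60)   # speaker id per recording

    X = np.stack([recording_features(w) for w in waveforms])
    clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)

    # Speaker-independent folds: evaluate on people the model has never heard.
    print(cross_val_score(clf, X, labels, groups=speakers, cv=GroupKFold(n_splits=4)))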
I want to fine-tune a GPT-2 model using Huggingface's Transformers, preferably the medium model, but large if possible. Currently I have an RTX 2080 Ti with 11GB of memory and I can train the small model just fine.
My question is: will I run into any issues if I add an old Tesla K80 (24GB) to my machine and distribute the training? I cannot find information about using GPUs of different capacities during training, or about the issues I could run into.
Will my model size limit essentially be the sum of all available GPU memory? (35GB?)
I’m not interested in doing this in AWS.
You already solved your problem. That's great. I would like to point out a different approach and address a few questions.
Will my model size limit essentially be the sum of all available GPU memory? (35GB?)
This depends on the training technique you use. Standard data parallelism replicates the model, gradients and optimizer states on each GPU, so each GPU must have enough memory to hold all of them. Only the data is split across the GPUs. However, the bottleneck is usually the optimizer states and the model, not the data.
The state-of-the-art approach to training is ZeRO. Not only the data, but also the model parameters, the gradients and the optimizer states are split across the GPUs. This lets you train huge models without hitting OOM. See the illustration in the paper: the baseline is the standard case I described above, and the authors progressively partition the optimizer states, gradients and model parameters across the GPUs and compare the per-GPU memory usage.
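As a rough back-of-envelope using the mixed-precision Adam accounting from the ZeRO paper: each parameter costs about 2 bytes for the fp16 weights, 2 bytes for the fp16 gradients and 12 bytes for the fp32 optimizer states (master copy, momentum, variance), i.e. roughly 16 bytes per parameter replicated on every GPU in the baseline. GPT-2 medium (~355M parameters) therefore needs around 5.7 GB before activations, and GPT-2 large (~774M) around 12 GB, which is why the larger model overflows an 11 GB card even though inference would fit. ZeRO shards those per-parameter costs across the GPUs instead of replicating them.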
The authors of the paper created a library called DeepSpeed, and it is very easy to integrate with Huggingface. With it I was able to increase my model size from 260 million to 11 billion parameters :)
If you want to understand in detail how it works, here is the paper:
https://arxiv.org/pdf/1910.02054.pdf
More information on integrating DeepSpeed with Huggingface can be found here:
https://huggingface.co/docs/transformers/main_classes/deepspeed
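For what it's worth, the integration really is only a few lines: you hand a DeepSpeed config (a dict or a path to a JSON file) to TrainingArguments and start the script with the deepspeed launcher. A rough sketch for GPT-2 medium; the toy corpus and hyperparameters are placeholders rather than a tested recipe, and keep in mind that data parallelism over mismatched cards runs at the pace of the slower one (the K80):

    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    # Minimal ZeRO stage-2 config; the HF integration fills in the "auto" values.
    ds_config = {
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu"},  # spill optimizer states to CPU RAM
        },
        "fp16": {"enabled": "auto"},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }

    tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
    tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token by default
    model = AutoModelForCausalLM.from_pretrained("gpt2-medium")

    # Toy corpus standing in for your real fine-tuning data.
    texts = ["Replace this with your own training text.", "One example per row."]
    dataset = Dataset.from_dict({"text": texts}).map(
        lambda b: tokenizer(b["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])

    args = TrainingArguments(
        output_dir="gpt2-medium-finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        fp16=True,                                   # drop this if the K80 misbehaves with fp16
        deepspeed=ds_config,                         # hand sharding to DeepSpeed
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()   # launch with: deepspeed --num_gpus=2 finetune.py

With stage 2 the optimizer states and gradients are sharded across the two cards; if that is still not enough, stage 3 also shards the parameters.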
PS: There is also a model parallelism technique, in which each GPU trains different layers of the model, but it has lost popularity and is not being actively used.
I have a dataset comprising roughly 15M observations, with approximately 3% of them belonging to the class of interest. I can train the model on a PC, but I need to run the classifier on a Raspberry Pi 3. Since the Raspberry Pi has such limited memory, which algorithms put the least load on it?
Additional info: the dataset is hard to separate. For example, ANNs can't get past an 80% detection rate for the class of interest, no matter the architecture or activation function. Random Forest has shown great performance, but the number of trees and nodes required isn't feasible for implementation on a microcontroller.
Thank you in advance.
You could potentially trim the trees in the Random Forest approach so as to balance classifier performance against memory and processing-power requirements.
Also, I suspect you have strongly imbalanced train/test sets, so I wonder if you have used any of the approaches suggested for this case (e.g. SMOTE, ADASYN, etc.). In the case of Python, I strongly suggest reviewing the imbalanced-learn library. Such an approach could lead to a smaller classifier, with acceptably good performance, that you would be able to fit and run on the target device.
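As a hedged sketch of both ideas together (capped trees plus oversampling applied only to the training split), with synthetic data standing in for the real 15M rows:

    import pickle
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the imbalanced dataset (~3% positives), much smaller here.
    X, y = make_classification(n_samples=50_000, n_features=20,
                               weights=[0.97, 0.03], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Oversample only the training split so the test set keeps the true class ratio.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

    # Capping depth/leaves trades a little accuracy for a model small enough for a Pi.
    clf = RandomForestClassifier(n_estimators=50, max_depth=10, max_leaf_nodes=256,
                                 n_jobs=-1, random_state=0)
    clf.fit(X_res, y_res)

    print(classification_report(y_te, clf.predict(X_te)))
    print("serialized size (MB):", len(pickle.dumps(clf)) / 1e6)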
Last but not least, this question could easily go to Cross Validated or Data Science sites.
If I provided you with data sufficient to classify a bunch of objects as either apples, oranges or bananas, how long might it take you to build an SVM that could make that classification? I appreciate that it probably depends on the nature of the data, but are we more likely talking hours, days or weeks?
Ok. Now that you have that SVM, and you have an understanding of how the data behaves, how long would it likely take you to upgrade that SVM (or build a new one) to classify an extra class (tomatoes) as well? Seconds? Minutes? Hours?
The motivation for the question is trying to assess the practical suitability of SVMs to a situation in which not all data is available to be sampled at any time. Fruit are an obvious case - they change colour and availability with the season.
If you would expect SVMs to be too fiddly to be able to create inside 5 minutes on demand, despite experience with the problem domain, then suggestions of a more user-friendly form of classifier for such a situation would be appreciated.
Generally, adding a class to a one-vs-rest SVM classifier requires retraining all classes. In the case of large data sets, this can turn out to be quite expensive. In the real world, when facing very large data sets, if performance and flexibility are more important than state-of-the-art accuracy, Naive Bayes is quite widely used (adding a class to an NB classifier requires training only the new class).
However, according to your comment, which states that the data has tens of dimensions and up to thousands of samples, the problem is relatively small, so in practice an SVM retrain can be performed very fast (probably on the order of seconds to tens of seconds).
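To put rough numbers on that, here is a quick timing sketch at about the scale you describe (synthetic data, default RBF kernel; treat the exact figures as indicative only):

    import time
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Roughly the stated scale: a few thousand samples, tens of dimensions.
    X3, y3 = make_classification(n_samples=3000, n_features=30, n_classes=3,
                                 n_informative=10, random_state=0)
    start = time.perf_counter()
    SVC(kernel="rbf").fit(X3, y3)          # libsvm trains one-vs-one under the hood
    print(f"3-class fit:   {time.perf_counter() - start:.2f}s")

    # "Adding tomatoes" means retraining from scratch with the extra class included.
    X4, y4 = make_classification(n_samples=4000, n_features=30, n_classes=4,
                                 n_informative=10, random_state=1)
    start = time.perf_counter()
    SVC(kernel="rbf").fit(X4, y4)
    print(f"4-class refit: {time.perf_counter() - start:.2f}s")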
You need to give us more details about your problem, since there are too many different scenarios: an SVM can be trained fairly quickly (I could train one in real time in a third-person shooting game without any noticeable latency), or it could take minutes to hours (I have a face-detector case where training took an hour).
As a rule of thumb, training time grows with the number of samples and the dimension of each vector.
I'm using the random forest algorithm as the classifier for my thesis project. The training set consists of thousands of images, and for each image about 2000 pixels are sampled. For each pixel, I have hundreds of thousands of features. With my current hardware limitations (8 GB of RAM, possibly extendable to 16 GB) I can only fit in memory the samples (i.e. the features per pixel) for one image. My question is: is it possible to call the train method multiple times, each time with a different image's samples, and get the statistical model updated automatically at each call? I'm particularly interested in the variable importance, since after I train on the full training set with the whole feature set, my idea is to reduce the number of features from hundreds of thousands to about 2000, keeping only the most important ones.
Thank you for any advice,
Daniele
I don't think the algorithm supports incremental training. You could consider reducing the size of your descriptors prior to training using another feature-reduction method, or estimate the variable importance on a random subset of pixels taken from across all your training images, as many as you can fit into memory...
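A sketch of that second idea: rank features with a forest trained on a small random subset of pixels drawn from every image, then keep only the ~2000 most important ones for the full training pass. The loader and feature count below are placeholders for the actual pipeline (10,000 features here stands in for the real hundreds of thousands):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def load_image_samples(image_id, n_pixels=50, n_features=10_000, seed=0):
        """Hypothetical loader: returns (features, labels) for a few sampled pixels of one image."""
        rng = np.random.default_rng(seed + image_id)
        feats = rng.standard_normal((n_pixels, n_features)).astype(np.float32)
        labels = rng.integers(0, 2, size=n_pixels)
        return feats, labels

    # Pass 1: a small random subset of pixels from every image, just to rank features.
    chunks = [load_image_samples(i) for i in range(20)]   # stand-in for thousands of images
    X = np.vstack([c[0] for c in chunks])
    y = np.concatenate([c[1] for c in chunks])

    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    rf.fit(X, y)

    # Keep the ~2000 most important features for the second, full training pass.
    top = np.argsort(rf.feature_importances_)[::-1][:2000]
    np.save("selected_features.npy", top)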
See my answer to this post. There are incremental versions of random forests, and they will let you train on much larger data.