I'm interested in improving performance with regard to ML predictions. (I don't care about training.)
-Will GPUs provide more throughput or lower latency?
-Are they good for batch or online serving?
-What types of models would be most impacted by using GPUs?
Disclaimer: The real answer is "it depends; if such a decision is important to you, you should benchmark CPU performance against GPU performance on your target systems to make an informed decision." The rest of this answer is just advice to loosely guide your decision when you don't want to (or don't have time to) do any benchmarking.
In a research environment, predictions are often (though not always) done in batches. As such, even if the model is entirely serial (i.e. there is an execution dependency between every pair of operations), it will likely still benefit from parallelization, in that those serial operations may have to be replicated for multiple query points simultaneously, so you can parallelize predictions across query points within a batch. So if your prediction setting involves batches, you should pretty much always use a GPU. In my own research experience, a GPU has always been faster than a CPU in batched prediction settings, regardless of the model used.
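To make this concrete, here is a minimal benchmark sketch in PyTorch (my own assumptions, not something from your setup: a small MLP, a batch of 4096 query points, and a CUDA-capable machine); it just times batched forward passes on CPU versus GPU:

```python
# Hypothetical benchmark sketch: batched inference, CPU vs. GPU.
# Assumes PyTorch and a CUDA-capable GPU; numbers will vary by hardware.
import time
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.ReLU(),
    torch.nn.Linear(2048, 2048), torch.nn.ReLU(),
    torch.nn.Linear(2048, 10),
).eval()

batch = torch.randn(4096, 512)  # 4096 query points predicted at once

def time_inference(device):
    m, x = model.to(device), batch.to(device)
    with torch.no_grad():
        m(x)                                   # warm-up run
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(20):
            m(x)
        if device == "cuda":
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / 20

print("CPU:", time_inference("cpu"))
if torch.cuda.is_available():
    print("GPU:", time_inference("cuda"))
```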
If you are only making a single prediction at a time (e.g. an "online" prediction setting), most modern ML methods are still highly parallelizable in general. In a neural network, for instance, there are only execution dependencies between layers; there are no execution dependencies between nodes within a layer. If you have many nodes per layer (which most modern deep learning architectures do), then your model is likely very parallelizable and can benefit from using a GPU instead of a CPU.
Naive Bayes classifiers make predictions by computing a bunch of (supposedly) conditionally independent probabilities, which can be parallelized, and then multiplying them together, which can be parallelized via reduction. As such, they may also benefit from using a GPU instead of a CPU.
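To show the structure, here is a tiny NumPy sketch of a Gaussian naive Bayes prediction (the class means, variances and priors are made-up placeholders): the per-feature log-likelihoods are computed independently, and the product of probabilities becomes a sum of logs, i.e. a reduction:

```python
import numpy as np

# Hypothetical fitted parameters for 3 classes and 4 features.
means = np.random.randn(3, 4)
variances = np.ones((3, 4))
log_priors = np.log(np.array([0.5, 0.3, 0.2]))

def predict(x):
    # Per-class, per-feature Gaussian log-likelihoods: embarrassingly parallel.
    log_lik = -0.5 * (np.log(2 * np.pi * variances) + (x - means) ** 2 / variances)
    # The product of probabilities becomes a sum of logs: a parallel reduction.
    return np.argmax(log_priors + log_lik.sum(axis=1))

print(predict(np.random.randn(4)))
```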
For a support vector machine with the dual problem approach, making a prediction requires computing an inner product (kernel trick) for each training data point with the query point, and multiplying each inner product by the corresponding parameters and target binary labels. This can very easily be parallelized in a similar way to naive Bayes classifiers.
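Concretely, the dual decision function is f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, and every kernel evaluation is independent of the others. A small NumPy sketch with an RBF kernel (the support vectors, coefficients and bias are placeholder values):

```python
import numpy as np

# Hypothetical support vectors, dual coefficients, labels and bias.
support_vectors = np.random.randn(1000, 20)
alphas = np.random.rand(1000)
labels = np.random.choice([-1.0, 1.0], size=1000)
bias = 0.1
gamma = 0.5

def decision(x):
    # One RBF kernel value per support vector, all independent, so they
    # vectorize (or run on a GPU) trivially, followed by a sum reduction.
    sq_dists = np.sum((support_vectors - x) ** 2, axis=1)
    kernel_values = np.exp(-gamma * sq_dists)
    return np.dot(alphas * labels, kernel_values) + bias

print(np.sign(decision(np.random.randn(20))))
```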
The list goes on. The point is, most ML methods are at least relatively conducive to parallelization even if you're processing a single query point at a time, and extremely conducive to parallelization if you're processing query points in batch. This makes them generally run faster on the "average" GPU than the "average" CPU.
But ultimately, it depends on your model and target system, so if it matters that much to you, you should benchmark to make an informed decision.
I want to fine-tune a GPT-2 model using Huggingface’s Transformers. Preferably the medium model, but large if possible. Currently, I have an RTX 2080 Ti with 11GB of memory and I can train the small model just fine.
My question is: will I run into any issues if I add an old Tesla K80 (24GB) to my machine and distribute the training? I cannot find information about using GPUs of different capacities during training and the issues I could run into.
Will my model size limit essentially be the sum of all available GPU memory? (35GB?)
I’m not interested in doing this in AWS.
You already solved your problem. That's great. I would like to point out a different approach and address a few questions.
Will my model size limit essentially be the sum of all available GPU memory? (35GB?)
This depends on the training technique you use. Standard data parallelism replicates the model, gradients and optimizer states on each of the GPUs, so each GPU must have enough memory to hold all of them. The data is split across the GPUs. However, the bottleneck is usually the optimizer states and the model, not the data.
The state-of-the-art approach to training is ZeRO. Not only the dataset, but also the model parameters, the gradients and the optimizer states are split across the GPUs. This allows you to train huge models without hitting OOM. See the illustration in the paper: the baseline is the standard case I mentioned above; they gradually partition the optimizer states, gradients and model parameters across the GPUs and compare the memory usage per GPU.
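As a rough back-of-the-envelope sketch (using the mixed-precision Adam accounting from the paper: roughly 2 + 2 + 12 bytes per parameter for fp16 weights, fp16 gradients and fp32 optimizer states, and ignoring activations and temporary buffers), you can estimate the per-GPU memory for the model states under each ZeRO stage:

```python
# Rough per-GPU memory for the model states (parameters, gradients, optimizer
# states), following the ZeRO paper's mixed-precision Adam accounting.
# Activations and temporary buffers are ignored here.
def per_gpu_gb(n_params, n_gpus, stage):
    p, g, o = 2 * n_params, 2 * n_params, 12 * n_params  # bytes
    if stage == 0:    # plain data parallelism: everything replicated
        total = p + g + o
    elif stage == 1:  # ZeRO-1: partition optimizer states
        total = p + g + o / n_gpus
    elif stage == 2:  # ZeRO-2: also partition gradients
        total = p + (g + o) / n_gpus
    else:             # ZeRO-3: also partition parameters
        total = (p + g + o) / n_gpus
    return total / 1e9

for stage in range(4):
    print(f"GPT-2 large (~774M params), 2 GPUs, stage {stage}: "
          f"{per_gpu_gb(774e6, 2, stage):.1f} GB per GPU")
```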
The authors of the paper created a library called DeepSpeed, and it is very easy to integrate it with huggingface. With that I was able to increase my model size from 260 million to 11 billion parameters :)
If you want to understand in detail how it works, here is the paper:
https://arxiv.org/pdf/1910.02054.pdf
More information on integrating DeepSpeed with Huggingface can be found here:
https://huggingface.co/docs/transformers/main_classes/deepspeed
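Here is a minimal integration sketch, assuming you use the Hugging Face Trainer (the config values and the dataset are placeholders, not recommendations):

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Illustrative ZeRO stage 2 config; TrainingArguments(deepspeed=...) accepts
# either a dict like this or a path to a JSON file.
ds_config = {
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

model = AutoModelForCausalLM.from_pretrained("gpt2-medium")

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed=ds_config,   # this is what wires DeepSpeed/ZeRO into the Trainer
)

# my_tokenized_dataset is a placeholder for your own tokenized training data.
trainer = Trainer(model=model, args=args, train_dataset=my_tokenized_dataset)
trainer.train()
```

You would normally launch this through the deepspeed launcher (e.g. deepspeed train.py) rather than plain python, and it requires deepspeed installed alongside transformers.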
PS: There is also the model-parallelism technique, in which each GPU trains different layers of the model, but it has lost its popularity and is not being actively used.
I looked at the classic word2vec sources and, if I understood correctly, there is no synchronization of data access when the neural network is trained by several threads (no synchronization for the matrices syn0, syn1, syn1neg). Is this normal practice for training, or is it a bug?
Perhaps counterintuitively, it's normal. A pioneering work on this was the 'Hogwild' paper in 2011:
https://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent
Its abstract:
Stochastic Gradient Descent (SGD) is a popular algorithm that can
achieve state-of-the-art performance on a variety of machine learning
tasks. Several researchers have recently proposed schemes to
parallelize SGD, but all require performance-destroying memory locking
and synchronization. This work aims to show using novel theoretical
analysis, algorithms, and implementation that SGD can be implemented
without any locking. We present an update scheme called Hogwild which allows processors access to shared memory with the possibility
of overwriting each other's work. We show that when the associated
optimization problem is sparse, meaning most gradient updates only
modify small parts of the decision variable, then Hogwild achieves a
nearly optimal rate of convergence. We demonstrate experimentally that
Hogwild outperforms alternative schemes that use locking by an order
of magnitude.
It turns out SGD is slowed down more by synchronized access than by threads overwriting each other's work... and some results even seem to hint that, in practice, the extra "interference" may be a net benefit for optimization progress.
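If you want to see what this looks like in code, here is a minimal Hogwild-style sketch using PyTorch shared memory (the tiny linear model and the random data are placeholders, not anything from word2vec): several processes update the same parameters with no locks at all.

```python
import torch
import torch.multiprocessing as mp

def train(model):
    # Each worker updates the *shared* parameters with no locking at all.
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()      # may race with other workers; Hogwild tolerates this

if __name__ == "__main__":
    model = torch.nn.Linear(10, 1)
    model.share_memory()              # parameters live in shared memory
    workers = [mp.Process(target=train, args=(model,)) for _ in range(4)]
    for w in workers: w.start()
    for w in workers: w.join()
```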
I have a dataset comprising roughly 15M observations, with approximately 3% of it being from the class of interest. I can train the model on a PC, but I need to implement the classifier on a Raspberry Pi 3. Since the Raspberry Pi has such limited memory, which algorithms represent the least load for it?
Additional info: the dataset is hard to differentiate. For example, ANNs can't get past an 80% detection rate for the class of interest, no matter the architecture or activation function. Random forest has demonstrated great performance, but the number of trees and nodes required isn't feasible for implementation on such a constrained device.
Thank you in advance.
You could potentially trim the trees in the Random Forest approach so as to balance classifier performance against memory / processing-power requirements.
Also, I suspect you have strongly imbalanced train/test sets, so I wonder if you used any of the approaches suggested for this case (e.g. SMOTE, ADASYN, etc.). In the case of Python I strongly suggest reviewing the imbalanced-learn library. Using such an approach could lead to a classifier of reduced size with acceptably good performance that you would be able to run on the target device.
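As a rough sketch of that combination (the generated data just stands in for your real 15M-row dataset, and the depth and tree count are arbitrary starting points, not tuned values): oversample the minority class with SMOTE, then fit a deliberately small forest whose size you can trade off against accuracy.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder imbalanced data standing in for the real dataset (~3% positives).
X, y = make_classification(n_samples=50_000, weights=[0.97, 0.03], random_state=0)

# Oversample the minority class before training.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

# Constrain the forest so the serialized model stays small enough for a Pi.
clf = RandomForestClassifier(n_estimators=20, max_depth=8, random_state=0)
clf.fit(X_res, y_res)
```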
Last but not least, this question could easily go to Cross Validated or Data Science sites.
I recently took a course by Andrew Ng on Coursera. After that I shifted to Python and used Pandas, NumPy and scikit-learn to implement ML algorithms. Now, while surfing, I came across TensorFlow and found it pretty amazing, and implemented this example which takes MNIST data as input.
But I am unsure why to use such a library (TensorFlow)?
We are not doing any parallel calculations, since the weights updated in the previous epoch are used in the next one?
I am finding it difficult to find a reason to use such a library.
There are several forms of parallelism that TensorFlow provides when training a convolutional neural network (and many other machine learning models), including:
Parallelism within individual operations (such as tf.nn.conv2d() and tf.matmul()). These operations have efficient parallel implementations for multi-core CPUs and GPUs, and TensorFlow uses these implementations wherever available.
Parallelism between operations. TensorFlow uses a dataflow graph representation for your model, and where there are two nodes that aren't connected by a directed path in the dataflow graph, these may execute in parallel. For example, the Inception image recognition model has many parallel branches in its dataflow graph (see figure 3 in this paper), and TensorFlow can exploit this to run many operations at the same time. The AlexNet paper also describes how to use "model parallelism" to run operations in parallel on different parts of the model, and TensorFlow supports that using the same mechanism.
Parallelism between model replicas. TensorFlow is also designed for distributed execution. One common scheme for parallel training ("data parallelism") involves sharding your dataset across a set of identical workers, performing the same training computation on each of those workers for different data, and sharing the model parameters between the workers.
In addition, libraries like TensorFlow and Theano can perform various optimizations when they can work with the whole dataflow graph of your model. For example, they can eliminate common subexpressions, avoid recomputing constant values, and generate more efficient fused code.
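If you want to observe or control the first two kinds of parallelism directly, TensorFlow exposes intra-op and inter-op thread-pool settings, and it places ops on a GPU automatically when one is visible. A small sketch (the thread counts are arbitrary examples, not recommendations):

```python
import tensorflow as tf

# Thread-pool settings must be applied before TensorFlow executes any op.
# Intra-op: threads used inside a single op such as tf.matmul / tf.nn.conv2d.
tf.config.threading.set_intra_op_parallelism_threads(8)
# Inter-op: independent ops in the graph that may run concurrently.
tf.config.threading.set_inter_op_parallelism_threads(2)

@tf.function
def two_branches(x):
    # a and b have no data dependency, so TensorFlow may run them in parallel.
    a = tf.matmul(x, x)
    b = tf.nn.relu(x)
    return a + b

print(two_branches(tf.random.normal((256, 256))).shape)
```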
You might be able to find pre-baked models in sklearn or other libraries, but TensorFlow allows for really fast iteration of custom machine learning models. It also comes with a ton of useful functions that you would have to (and probably shouldn't) write yourself.
To me, it's less about performance (though they certainly care about performance), and more about whipping out neural networks really quickly.
I am designing a neural network and am trying to determine if I should write it in such a way that each neuron is its own 'process' in Erlang, or if I should just go with C++ and run a network in one thread (I would still use all my cores by running an instance of each network in its own thread).
Is there a good reason to give up the speed of C++ for the asynchronous neurons that Erlang offers?
I'm not sure I understand what you're trying to do. An artificial neural network is essentially represented by the weights of the connections between nodes. The nodes themselves don't exist in isolation; their values are only calculated (at least in feed-forward networks) through the forward-propagation algorithm, when the network is given input.
The backpropagation algorithm for updating weights is definitely parallelizable, but that doesn't seem to be what you're describing.
The usefulness of having neurons in a Neural Network (NN) is to have a multi-dimensional matrix whose coefficients you want to handle (to train them, to change them, to adapt them little by little, so that they fit the problem you want to solve). On this matrix you can apply numerical methods (proven and efficient) so as to find an acceptable solution in an acceptable time.
IMHO, with NNs (namely with the back-propagation training method), the goal is to have a matrix which is efficient both at run/predict time and at training time.
I don't grasp the point of having asynchronous neurons. What would it offer? What issue would it solve?
Maybe you could explain clearly what problem you would solve by making them asynchronous?
I am indeed inverting your question: what do you want to gain from asynchronicity compared with traditional NN techniques?
It would depend upon your use case: the neural network computational model and your execution environment. Here is a recent paper (2014) by Plotnikova et al. that uses "Erlang and platform Erlang/OTP with predefined base implementation of actor model functions" and a new model developed by the authors that they describe as “one neuron—one process”, using the "Gravitation Search Algorithm" for training:
http://link.springer.com/chapter/10.1007%2F978-3-319-06764-3_52
To briefly cite their abstract, "The paper develops asynchronous distributed modification of this algorithm and presents the results of experiments. The proposed architecture shows the performance increase for distributed systems with different environment parameters (high-performance cluster and local network with a slow interconnection bus)."
Also, most other answers here reference a computational model that uses matrix operations as the basis of training and simulation, which the authors of this paper contrast by saying, "this case neural network model [i.e. matrix-operations based] becomes fully mathematical and its original nature (from neural networks biological prototypes) gets lost".
The tests were run on three types of systems:
The IBM cluster is represented as 15 virtual machines.
The distributed system deployed on the local network is represented as 15 physical machines.
The hybrid system is based on system 2, but each physical machine has four processor cores.
They provide the following concrete results, "The presented results evidence a good distribution ability of gravitation search, especially for large networks (801 and more neurons). Acceleration depends on the node count almost linearly. If we use 15 nodes we can get about eight times acceleration of the training process."
Finally, they conclude regarding their model, "The model includes three abstraction levels: NNET, MLP and NEURON. Such architecture allows encapsulating some general features on general levels and some specific for the considered neural networks features on special levels. Asynchronous message passing between levels allow to differentiate synchronous and asynchronous parts of training and simulation algorithms and, as a result, to improve the use of resources."
It depends what you are after.
2nd Generation of Neural Networks are synchronous. They perform computations on an input-output basis without a delay, and can be trained either through reinforcement or back-propagation. This is the prevailing type of ANN at the moment and the easiest to get started with if you are trying to solve a problem via machine learning, lots of literature and examples available.
3rd Generation of Neural Networks (so-called "Spiking Neural Networks") are asynchronous. Signals propagate internally through the network as a chain-reaction of spiking events, and can create interesting patterns and oscillations depending on the shape of the network. While they model biological brains more closely they are also harder to make use of in a practical setting.
I think that async computation for NNs might prove beneficial for (recognition) performance. In fact, the result might be similar to using dropout (though maybe less pronounced).
But a straightforward implementation of async NNs would be much slower, because for synchronous NNs you can use linear algebra libraries, which make good use of vectorization or GPUs.
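To make that gap concrete, here is a tiny sketch (the layer sizes are arbitrary) comparing a per-neuron Python loop with a single vectorized matrix-vector product for one layer's forward pass; the latter is what BLAS libraries and GPUs accelerate:

```python
import time
import numpy as np

inputs = np.random.randn(512)
weights = np.random.randn(1024, 512)   # one layer: 1024 neurons, 512 inputs each

start = time.perf_counter()
# "One neuron at a time": each output computed separately in Python.
loop_out = np.array([weights[i] @ inputs for i in range(weights.shape[0])])
loop_time = time.perf_counter() - start

start = time.perf_counter()
# Whole layer as a single matrix-vector product.
vec_out = weights @ inputs
vec_time = time.perf_counter() - start

print(np.allclose(loop_out, vec_out), loop_time, vec_time)
```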