I'm looking at creative ways to speed up training time for my neural nets and maybe also reduce the vanishing gradient problem. I was considering breaking the net up onto different nodes, using classifiers on each node as backprop "boosters", and then stacking the nodes on top of each other with sparse connections between them (as many as I can get away with before Ethernet network saturation makes it pointless). If I do this, I am uncertain whether I have to maintain some kind of state between nodes and train synchronously on the same example (which probably defeats the purpose of speeding up the process), or whether I can simply train on the same data but asynchronously. I think I can, and that the weight space can still be updated and propagated along my sparse connections between nodes even if they are training on different examples, but I'm not certain. Can someone confirm this is possible, or explain why not?
It is possible to do what you suggest; however, it is a formidable amount of work for one person to undertake. The most recent example that I'm aware of is the "DistBelief" framework, developed by a large research/engineering team at Google -- see the 2012 NIPS paper at http://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf.
Briefly, the DistBelief approach partitions the units in a neural network so that each worker machine in a cluster is responsible for some disjoint subset of the overall architecture. Ideally the partitions are chosen to minimize the amount of cross-machine communication required (i.e., a min-cut through the network graph).
Workers perform computations locally for their part of the network, and then send updates to the other workers as needed for links that cross machine boundaries.
Parameter updates are handled by a separate "parameter server." The workers send gradient computations to the parameter server, and periodically receive updated parameter values from the server.
The entire setup runs asynchronously and works pretty well. Due to the async nature of the computations, the parameter values for a given computation might be "stale," but they're usually not too far off. And the speedup is pretty good.
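To make the asynchrony concrete, the per-worker control flow looks roughly like the sketch below. This is just an illustration of the pattern, not DistBelief's actual code; the parameter-server client `ps` and the `compute_gradients` callback are hypothetical stand-ins for whatever RPC layer and model code you would actually use.

```python
def worker_loop(ps, data_iter, compute_gradients, steps=10000, refresh_every=10):
    """Asynchronous worker: push gradients, periodically pull fresh parameters.

    `ps` is assumed to expose pull() -> params and push(grads); it and
    `compute_gradients` are placeholders, not part of any real library.
    """
    params = ps.pull()                        # may already be slightly stale
    for step in range(steps):
        grads = compute_gradients(params, next(data_iter))
        ps.push(grads)                        # fire-and-forget: no barrier across workers
        if step % refresh_every == 0:
            params = ps.pull()                # refresh; other workers kept updating meanwhile
```

The key point is that nothing blocks on other workers: gradients computed against slightly stale parameters are still applied, which is exactly the "staleness" trade-off described above.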
I'm interested in improving performance with regard to ML predictions (I don't care about training).
-Will GPUs provide more throughput or lower latency?
-Are they good for batch or online serving?
-What types of models would be most impacted by using GPUs?
Disclaimer: The real answer is "it depends; if such a decision is important to you, you should benchmark CPU performance against GPU performance on your target systems to make an informed decision." The rest of this answer is just advice to loosely guide your decision when you don't want to (or don't have time to) do any benchmarking.
In a research environment, predictions are often (though not always) done in batches. As such, even if the model is entirely serial (i.e. there is an execution dependency between every pair of operations), it will likely still benefit from parallelization in that those serial operations may have to be replicated for multiple query points simultaneously, and so you can parallelize predictions across query points within a batch. So if your prediction setting involves batches, you should pretty much always use a GPU. From my own research experiences, a GPU is always faster than a CPU in batched prediction settings, regardless of the model used.
If you are only making a single prediction at a time (e.g. an "online" prediction setting), most modern ML methods are still highly parallelizable in general. In a neural network, for instance, there are only execution dependencies between layers; there are no execution dependencies between nodes within a layer. If you have many nodes per layer (which most modern deep learning architectures do), then your model is likely very parallelizable and can benefit from using a GPU instead of a CPU.
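To make both kinds of parallelism concrete, here is a minimal NumPy sketch of a two-layer network: each matrix multiply computes all the units of a layer at once (within-layer parallelism), and the leading batch dimension handles many query points at once (across-query parallelism). On a GPU both axes map onto many cores; the sizes and random weights below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((512, 1024)), np.zeros(1024)   # layer 1: 512 -> 1024 units
W2, b2 = rng.standard_normal((1024, 10)), np.zeros(10)      # layer 2: 1024 -> 10 outputs

def predict(X):
    # Each matmul computes every unit of a layer at once (within-layer parallelism)
    # and does so for every row of X at once (across-query parallelism).
    h = np.maximum(X @ W1 + b1, 0.0)     # ReLU hidden layer
    return h @ W2 + b2

single = predict(rng.standard_normal((1, 512)))     # "online": one query point
batch = predict(rng.standard_normal((256, 512)))    # "batch": 256 query points in one call
```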
Naive Bayes classifiers make predictions by computing a bunch of (supposedly) conditionally independent probabilities, which can be parallelized, and then multiplying them together, which can be parallelized via reduction. As such, they may also benefit from using a GPU instead of a CPU.
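For instance, a toy NumPy sketch with made-up class priors and per-feature log-likelihood tables (real implementations work in log space, so the product of probabilities becomes a sum):

```python
import numpy as np

rng = np.random.default_rng(1)
log_prior = np.log(np.array([0.4, 0.6]))                     # 2 classes
log_like = np.log(rng.dirichlet(np.ones(3), size=(2, 5)))    # 2 classes x 5 features x 3 values

def nb_predict(x):
    # Each (class, feature) lookup is independent -> parallelizable;
    # the sum over features is a reduction; argmax picks the class.
    scores = log_prior + log_like[:, np.arange(len(x)), x].sum(axis=1)
    return np.argmax(scores)

print(nb_predict(np.array([0, 2, 1, 1, 0])))
```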
For a support vector machine trained via the dual problem, making a prediction requires computing an inner product (via the kernel trick) between the query point and each training data point (in practice, each support vector), multiplying each inner product by the corresponding dual coefficient and target label, and summing the results. This can very easily be parallelized in a similar way to naive Bayes classifiers.
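A minimal NumPy sketch of that decision function, assuming an RBF kernel and placeholder support vectors, dual coefficients and labels (the names here are illustrative, not from any particular library):

```python
import numpy as np

def svm_decision(x, support_vectors, alpha, y, b, gamma=0.5):
    # One RBF kernel evaluation per support vector -- all independent, so they
    # vectorize/parallelize cleanly; the weighted sum is a single reduction.
    sq_dists = np.sum((support_vectors - x) ** 2, axis=1)
    k = np.exp(-gamma * sq_dists)             # K(x_i, x) for every support vector
    return np.dot(alpha * y, k) + b           # sum_i alpha_i * y_i * K(x_i, x) + b

# Placeholder data just to show the call; a real model would come from training.
rng = np.random.default_rng(2)
sv = rng.standard_normal((100, 4))
a, labels = rng.random(100), rng.choice([-1.0, 1.0], size=100)
print(np.sign(svm_decision(rng.standard_normal(4), sv, a, labels, b=0.1)))
```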
The list goes on. The point is, most ML methods are at least relatively conducive to parallelization even if you're processing a single query point at a time, and extremely conducive to parallelization if you're processing query points in batch. This makes them generally run faster on the "average" GPU than the "average" CPU.
But ultimately, it depends on your model and target system, so if it matters that much to you, you should benchmark to make an informed decision.
I think convolution layers should be fully connected (see this and this). That is, each feature map should be connected to all feature maps in the previous layer. However, when I looked at this CNN visualization, the second convolution layer is not fully connected to the first. Specifically, each feature map in the second layer is connected to 3-6 of the feature maps in the first layer (some to all of them), and I don't see any pattern in it. My questions are:
Is it canonical/standard to fully connect convolution layers?
What's the rationale for the partial connections in the visualization?
Am I missing something here?
Neural networks have the remarkable property that knowledge is not stored anywhere specifically, but in a distributed sense. If you take a working network, you can often cut out large parts and still get a network that works approximately the same.
A related effect is that the exact layout is not very critical. ReLU and sigmoid (or tanh) activation functions are mathematically very different, but both work quite well. Similarly, the exact number of nodes in a layer doesn't really matter.
Fundamentally, this relates to the fact that in training you optimize all weights to minimize your error function, or at least find a local minimum. As long as there are sufficient weights and those are sufficiently independent, you can optimize the error function.
There is another effect to take into account, though. With too many weights and not enough training data, you cannot optimize the network well. Regularization only helps so much. A key insight in CNNs is that they have fewer weights than a fully connected network, because nodes in a CNN are connected only to a small local neighborhood of nodes in the prior layer.
So, this particular CNN has even fewer connections than a CNN in which all feature maps are connected, and therefore fewer weights. That allows you to have more and/or bigger maps for a given amount of data. Is that the best solution? Perhaps - choosing the best layout is still a bit of a black art. But it's not a priori unreasonable.
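As a back-of-the-envelope illustration, here is the weight count for a partial connection table in the style of LeNet-5's C3 layer (6 input maps, 16 output maps, 5x5 kernels), compared with fully connecting the feature maps. Treat the numbers as illustrative rather than a description of the exact network in the question.

```python
kernel = 5 * 5            # 5x5 convolution kernels
in_maps, out_maps = 6, 16 # e.g. LeNet-5: 6 maps in S2, 16 maps in C3

# Fully connected feature maps: every output map sees every input map.
full = out_maps * in_maps * kernel + out_maps            # + one bias per output map

# LeNet-5's partial table: 6 maps see 3 inputs, 9 maps see 4, 1 map sees all 6.
partial_links = 6 * 3 + 9 * 4 + 1 * 6                    # = 60 map-to-map connections
partial = partial_links * kernel + out_maps

print(full, partial)      # 2416 vs 1516 weights for the same layer shape
```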
In comma.ai's self-driving car software they use a client/server architecture. Two processes are started separately, server.py and train_steering_model.py.
server.py sends data to train_steering_model.py via http and sockets.
Why do they use this technique? Isn't this a complicated way of sending data? Wouldn't it be easier to have train_steering_model.py load the data set by itself?
The document DriveSim.md in the repository links to a paper titled Learning a Driving Simulator. In the paper, they state:
Due to the problem complexity we decided to learn video prediction with separable networks.
They also mention the frame rate they used is 5 Hz.
While that sentence is the only one that directly addresses your question, and it isn't exactly crystal clear, let's break down the task in question (a rough code sketch of the loop follows the list):
Grab an image from a camera
Preprocess/downsample/normalize the image pixels
Pass the image through an autoencoder to extract a representative feature vector
Pass the output of the autoencoder on to an RNN that will predict proper steering angle
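A rough sketch of that loop, just to illustrate the data flow. This is not comma.ai's code; every name here is a hypothetical placeholder.

```python
import time

def steering_loop(camera, preprocess, autoencoder, rnn, actuator, hz=5):
    """Illustrative control loop: camera frame -> features -> steering angle.
    All of the objects passed in are hypothetical placeholders."""
    state = rnn.initial_state()                   # RNN keeps temporal context across frames
    while True:
        frame = camera.grab()                     # 1. grab an image from the camera
        pixels = preprocess(frame)                # 2. downsample / normalize the pixels
        features = autoencoder.encode(pixels)     # 3. compress to a representative feature vector
        angle, state = rnn.step(features, state)  # 4. predict the steering angle from the sequence
        actuator.set_steering(angle)
        time.sleep(1.0 / hz)                      # the paper reports running at 5 Hz
```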
The "problem complexity" refers to the fact that they're dealing with a long sequence of large images that are (as they say in the paper) "highly uncorrelated." There are lots of different tasks going on, so the separable-network approach is more modular: in addition to allowing the pieces to be worked on in parallel, it also allows the components to be scaled up without getting bottlenecked by a single piece of hardware hitting its computational limits. (And just think: this is only the steering aspect. The Logs.md file lists other components of the vehicle to worry about that aren't addressed by this neural network - gas, brakes, blinkers, acceleration, etc.)
Now let's fast forward to the practical implementation in a self-driving vehicle. There will definitely be more than one neural network operating onboard the vehicle, and each will need to be limited in size - microcomputers or embedded hardware, with limited computational power. So, there's a natural ceiling to how much work one component can do.
Tying all of this together is the fact that cars already operate using a network architecture - a CAN bus is literally a computer network inside of a vehicle. So, this work simply plans to farm out pieces of an enormously complex task to a number of distributed components (which will be limited in capability) using a network that's already in place.
After playing with the current distributed training implementation for a while, I think it treats each GPU as a separate worker. However, it is common now to have 2-4 GPUs in one box. Isn't it better to adopt a single-box multi-GPU approach, i.e. compute the averaged gradients within each box first and then sync up across nodes? This would ease the I/O traffic a lot, which is always the bottleneck in data parallelism.
I was told this is possible with the current implementation by having all the GPUs in a single box act as one worker, but I can't figure out how to tie the averaged gradients to SyncReplicasOptimizer, since SyncReplicasOptimizer directly takes an optimizer as input.
Any ideas from anyone?
Distributed TensorFlow supports multiple GPUs in the same worker task. One common way to perform distributed training for image models is to perform synchronous training across multiple GPUs in the same worker, and asynchronous training across workers (though other configurations are possible). This way you only pull the model parameters to the worker once, and they are distributed among the local GPUs, easing the network bandwidth utilization.
To do this kind of training, many users perform "in-graph replication" across the GPUs in a single worker. This can use an explicit loop across the local GPU devices, like in the CIFAR-10 example model; or higher-level library support, like in the model_deploy() utility from TF-Slim.
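For illustration, a stripped-down version of the explicit-loop approach might look like the sketch below (TF 1.x style, loosely following the CIFAR-10 multi-GPU example; `build_loss` and `get_batch_shard` are hypothetical placeholders for your model and input pipeline). Since tf.train.SyncReplicasOptimizer exposes the same compute_gradients()/apply_gradients() interface as the optimizer it wraps, the same pattern should also work when `opt` is a SyncReplicasOptimizer.

```python
import tensorflow as tf

NUM_GPUS = 4
opt = tf.train.GradientDescentOptimizer(0.01)  # or a SyncReplicasOptimizer wrapping it

tower_grads = []
for i in range(NUM_GPUS):
    # One "tower" per local GPU; variables are shared through the common scope.
    with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
        loss = build_loss(get_batch_shard(i))          # hypothetical helpers
        tower_grads.append(opt.compute_gradients(loss))

# Average each variable's gradient across towers (assumes every variable gets a
# gradient on every tower), then apply the averaged gradients once.
averaged = []
for grads_and_vars in zip(*tower_grads):
    grads = tf.stack([g for g, _ in grads_and_vars])
    averaged.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))

train_op = opt.apply_gradients(averaged)
```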
I am designing a neural network and am trying to determine if I should write it in such a way that each neuron is its own 'process' in Erlang, or if I should just go with C++ and run a network in one thread (I would still use all my cores by running an instance of each network in its own thread).
Is there a good reason to give up the speed of C++ for the asynchronous neurons that Erlang offers?
I'm not sure I understand what you're trying to do. An artificial neural network is essentially represented by the weight of the connections between nodes. The nodes themselves don't exist in isolation; their values are only calculated (at least in feed-forward networks) through the forward-propagation algorithm, when it is given input.
The backpropagation algorithm for updating weights is definitely parallelizable, but that doesn't seem to be what you're describing.
The point of the neurons in a neural network (NN) is to give you a multi-dimensional matrix of coefficients that you can handle (train them, change them, adapt them little by little so that they fit the problem you want to solve). On this matrix you can apply numerical methods (proven and efficient) to find an acceptable solution in an acceptable time.
IMHO, with NNs (namely with the back-propagation training method), the goal is to have a matrix that is efficient both at run/prediction time and at training time.
I don't grasp the point of having asynchronous neurons. What would it offer? What issue would it solve?
Maybe you could explain clearly what problem you would solve by making them asynchronous?
I am indeed inverting your question: what do you hope to gain from asynchronicity compared with traditional NN techniques?
It would depend upon your use case: the neural network computational model and your execution environment. Here is a paper from 2014 by Plotnikova et al. that uses "Erlang and platform Erlang/OTP with predefined base implementation of actor model functions" and a new model developed by the authors that they describe as "one neuron—one process", using a "Gravitation Search Algorithm" for training:
http://link.springer.com/chapter/10.1007%2F978-3-319-06764-3_52
To briefly cite their abstract, "The paper develops asynchronous distributed modification of this algorithm and presents the results of experiments. The proposed architecture shows the performance increase for distributed systems with different environment parameters (high-performance cluster and local network with a slow interconnection bus)."
Also, most of the other answers here reference a computational model that uses matrix operations as the basis of training and simulation. The authors of this paper contrast this by saying, "this case neural network model [i.e. matrix-operations-based] becomes fully mathematical and its original nature (from neural networks biological prototypes) gets lost".
The tests were run on three types of systems:
An IBM cluster, represented as 15 virtual machines.
A distributed system deployed on the local network, represented as 15 physical machines.
A hybrid system based on system 2, but where each physical machine has four processor cores.
They provide the following concrete results, "The presented results evidence a good distribution ability of gravitation search, especially for large networks (801 and more neurons). Acceleration depends on the node count almost linearly. If we use 15 nodes we can get about eight times acceleration of the training process."
Finally, they conclude regarding their model, "The model includes three abstraction levels: NNET, MLP and NEURON. Such architecture allows encapsulating some general features on general levels and some specific for the considered neural networks features on special levels. Asynchronous message passing between levels allow to differentiate synchronous and asynchronous parts of training and simulation algorithms and, as a result, to improve the use of resources."
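The paper's implementation is in Erlang, but the "one neuron—one process" idea itself can be sketched in a few lines of Python, with multiprocessing queues standing in for actor mailboxes. This is purely illustrative; none of it comes from the paper's code.

```python
from multiprocessing import Process, Queue

def neuron(weights, bias, inbox, outboxes):
    """One neuron as one process: wait for an input vector, emit an activation."""
    while True:
        x = inbox.get()
        if x is None:                           # poison pill shuts the process down
            break
        act = max(0.0, sum(w * xi for w, xi in zip(weights, x)) + bias)   # ReLU
        for q in outboxes:
            q.put(act)                          # asynchronous message to downstream neurons

if __name__ == '__main__':
    inbox, out = Queue(), Queue()
    p = Process(target=neuron, args=([0.5, -0.2], 0.1, inbox, [out]))
    p.start()
    inbox.put([1.0, 2.0])
    print(out.get())                            # 0.5*1.0 - 0.2*2.0 + 0.1 ~= 0.2
    inbox.put(None)
    p.join()
```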
It depends what you are after.
2nd-generation neural networks are synchronous. They perform computations on an input-output basis without a delay, and can be trained either through reinforcement or back-propagation. This is the prevailing type of ANN at the moment and the easiest to get started with if you are trying to solve a problem via machine learning; there is lots of literature and plenty of examples available.
3rd-generation neural networks (so-called "spiking neural networks") are asynchronous. Signals propagate internally through the network as a chain reaction of spiking events, and can create interesting patterns and oscillations depending on the shape of the network. While they model biological brains more closely, they are also harder to make use of in a practical setting.
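For a flavour of the event-driven behaviour, here is a toy leaky integrate-and-fire neuron, the textbook building block of spiking networks (plain Python, arbitrary constants, not tied to any particular SNN framework):

```python
def lif_neuron(input_current, dt=1.0, tau=20.0, threshold=1.0, v_reset=0.0):
    """Leaky integrate-and-fire: the membrane potential leaks, integrates the
    input current, and emits a spike event whenever it crosses the threshold."""
    v, spikes = v_reset, []
    for t, i_t in enumerate(input_current):
        v += dt * (-v / tau + i_t)       # leak toward rest plus injected current
        if v >= threshold:
            spikes.append(t)             # an asynchronous "event", not a per-step value
            v = v_reset
    return spikes

print(lif_neuron([0.3] * 30))            # spike times for a constant input current
```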
I think that async computation for NNs might prove beneficial for (recognition) performance. In fact, the result might be similar to using dropout, though maybe less pronounced.
But a straightforward implementation of async NNs would be much slower, because for synchronous NNs you can use linear algebra libraries, which make good use of vectorization or GPUs.
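A quick way to see the gap even on a CPU: compare a per-neuron Python loop with the equivalent single matrix-vector product (NumPy here; on a GPU the difference is typically far larger). The sizes are arbitrary and purely illustrative.

```python
import time
import numpy as np

rng = np.random.default_rng(3)
W, x = rng.standard_normal((4096, 4096)), rng.standard_normal(4096)

t0 = time.perf_counter()
loop = np.array([W[i] @ x for i in range(W.shape[0])])   # "one neuron at a time"
t1 = time.perf_counter()
vec = W @ x                                              # whole layer in one BLAS call
t2 = time.perf_counter()

print(np.allclose(loop, vec), t1 - t0, t2 - t1)          # same result, very different cost
```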