I need to train a recurrent neural network as a language model, and I decided to use Keras with the Theano backend for that. Is it better to use an ordinary PC with a graphics card instead of a "cool" server machine that can't do GPU computing? Is there a boundary (given perhaps by the architecture of the NN and the amount of training data) that would separate "CPU-learnable" problems from those that can be done (in reasonable time) only by utilizing a GPU?
(I have access to an older production server at the company I work for. It has 16 cores and about 49GB of available RAM, so I thought I was ready for training. Now I am reading about the GPU optimization Theano does, and I am thinking I am basically screwed without it.)
Edit
I have just come across this article, where Tomáš Mikolov states they managed to train a single-layer recurrent neural network with 1024 states in 10 days while using only 24 CPUs and no GPU.
Is there a boundary
One boundary that would separate CPU from GPU is memory access. If you are accessing the values from your neural network often, a CPU would do better, as it has faster access to RAM. If I'm not wrong, computing the updates (SGD, RMSProp, Adagrad, etc.) requires that the values be accessed.
A GPU would be advisable when the amount of computation outweighs the memory access, e.g. when training a deep neural network.
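As a rough way to reason about this compute-versus-memory boundary, the sketch below estimates arithmetic intensity (floating-point operations per byte moved) for a dense matrix multiply, the kind of operation that dominates a recurrent layer. The function name, the float32 assumption, and the simplification that each matrix is read or written exactly once are illustrative assumptions, not measurements of any real device.

```python
def matmul_arithmetic_intensity(batch, n_in, n_out, bytes_per_value=4):
    """Estimate FLOPs per byte for a (batch x n_in) @ (n_in x n_out) product.

    Assumes each input and output matrix is moved through memory exactly once
    and that values are float32 (4 bytes). Real hardware will differ.
    """
    # One multiply and one add per (batch, n_in, n_out) triple.
    flops = 2 * batch * n_in * n_out
    # Inputs, weights, and outputs, each touched once.
    values_moved = batch * n_in + n_in * n_out + batch * n_out
    return flops / (bytes_per_value * values_moved)


# A single sequence step at batch size 1 is dominated by reading the weights.
small = matmul_arithmetic_intensity(batch=1, n_in=1024, n_out=1024)
# A larger batch reuses the same weights many times per byte read.
large = matmul_arithmetic_intensity(batch=256, n_in=1024, n_out=1024)

print(f"batch=1:   {small:.2f} FLOPs/byte")
print(f"batch=256: {large:.2f} FLOPs/byte")
```

Under these assumptions the batch-1 case does well under one operation per byte, while the batch-256 case does dozens, which is one way to see why larger, denser workloads tend to favor a GPU.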
that can be done (in reasonable time) only by utilizing gpu
Unfortunately, if you are trying to solve such a hard problem, Theano would be a bad choice, as you are constrained to running on a single machine. Try other frameworks that allow running on multiple CPUs and GPUs across machines, such as Microsoft CNTK or Google TensorFlow.
thinking I am basically screwed
The difference (which may be a speedup or a slowdown) won't be that big, depending on the neural network. Plus, running the neural network computation on your own machine can get in the way of your work. So you are probably better off putting that extra server to use.
Related
I'm seeing a significant difference in inference performance between my desktop CPU and the Neural Compute Stick 2 VPU: inference is almost 500ms slower on the VPU. This is the one line that takes the most time and shows the biggest difference:
result = exec_net.infer(inputs={input_layer_ir: blob})
My desktop is my gaming machine and has a nice fast Intel CPU. That said, is this the expected order-of-magnitude difference between the VPU and the CPU?
CPU inference is really fast, around 0.07 seconds, while the VPU takes around 0.5 seconds.
It's the road segmentation model from the Open Model Zoo samples.
Intel® Neural Compute Stick 2 (NCS 2) is a USB stick that offers you access to neural network functionality, without the need for large, expensive hardware. It is a plug-and-play device, so you are ready to start prototyping right away.
In terms of TFLOPS, the performance of the NCS 2 is still about a hundred times lower than that of well-known CPUs or GPUs. This behaviour is expected, so don't rely on it as an external device to replace the CPU plugin.
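When comparing devices like this, it can help to time the inference call in isolation, with a few warm-up runs discarded and the median taken over several repetitions. The helper below is only a sketch: `benchmark` and `fake_infer` are hypothetical names, and `fake_infer` is a stand-in workload rather than a real OpenVINO call.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=20):
    """Return the median wall-clock time of fn() in seconds.

    The first `warmup` calls are discarded so that one-time costs
    (caching, lazy initialization) do not skew the result.
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def fake_infer():
    # Hypothetical stand-in for the real inference call.
    return sum(i * i for i in range(10_000))


median_seconds = benchmark(fake_infer)
print(f"median latency: {median_seconds * 1000:.2f} ms")
```

In the setting from the question, the callable passed in would wrap the line shown there, for example `lambda: exec_net.infer(inputs={input_layer_ir: blob})`, run once per device.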
I have a Keras model which is doing inference on a Raspberry Pi (with a camera). The Raspberry Pi has a really slow CPU (1.2 GHz) and no CUDA GPU, so the model.predict() stage is taking a long time (~20 seconds). I'm looking for ways to reduce that as much as possible. I've tried:
Overclocking the CPU (+200 MHz), which saved a few seconds.
Using float16s instead of float32s.
Reducing the image input size as much as possible.
Is there anything else I can do to increase the speed during inference? Is there a way to simplify a model.h5 and take a drop in accuracy? I've had success with simpler models, but for this project I need to rely on an existing model so I can't train from scratch.
The VGG16 / VGG19 architecture is very slow since it has lots of parameters. Check this answer.
Before any other optimization, try to use a simpler network architecture.
Google's MobileNet seems like a good candidate since it's implemented in Keras and it was designed for more limited devices.
If you can't use a different network, you may compress the network with pruning. This blog post specifically does pruning with Keras.
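To illustrate the general idea behind pruning, here is a minimal sketch of magnitude-based pruning, which zeroes out the weights with the smallest absolute values. It works on a flat Python list rather than a Keras model, `prune_by_magnitude` is a hypothetical helper, and this is not necessarily the method used in the linked blog post.

```python
def prune_by_magnitude(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction zeroed.

    `sparsity` is the fraction of weights to remove, between 0.0 and 1.0.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0.0 and 1.0")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = prune_by_magnitude(weights, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0]
```

Zeroed weights only translate into faster inference if the runtime can exploit the sparsity, and a pruned network usually needs some fine-tuning to recover accuracy.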
Maybe OpenVINO will help. OpenVINO is an open-source toolkit for network inference, and it optimizes the inference performance by, e.g., graph pruning and fusing some operations. The ARM support is provided by the contrib repository.
Here are the instructions on how to build an ARM plugin to run OpenVINO on Raspberry Pi.
Disclaimer: I work on OpenVINO.
Let's say I used CUDA to train an object tracking program. Could I then put that program on another computer that didn't have a powerful GPU and run the object tracking program? Or is GPU support required to run the resulting model as well as to train it?
No, it does not matter how you trained your model. You can execute it in a completely different scenario, using a CPU, a GPU, the cloud, or whatever you want. Since execution is usually much cheaper than training, you will usually need much less powerful hardware.
I am a newbie in deep learning with TensorFlow. I am trying out a seq2seq model sample code.
I wanted to understand:
What are the minimum values for the number of layers, layer size, and batch size that I could start off with to be able to test the seq2seq model with satisfactory accuracy?
Also, what is the minimum infrastructure setup required, in terms of memory and CPU capability, to train this deep learning model within a maximum of a few hours?
In my experience, training a seq2seq model with 2 layers of size 900 and batch size 4
took around 3 days on a machine with 4GB RAM and a 3GHz single-core Intel i5 processor.
took around 1 day on a machine with 8GB RAM and a 3GHz single-core Intel i5 processor.
Which helps the most for faster training: more RAM capacity, multiple CPU cores, or a CPU + GPU combination?
Disclaimer: I'm also new, and could be wrong on a lot of this.
I am a newbie in deep learning with TensorFlow. I am trying out a seq2seq model sample code.
I wanted to understand:
What are the minimum values for the number of layers, layer size, and batch size that I could start off with to be able to test the seq2seq model with satisfactory accuracy?
I think this will just have to be up to your experimentation. Find out what works for your data set. I have heard a few pieces of advice: don't pick your own architecture if you can avoid it; find someone else's that is tried and tested. It seems deeper networks are better than wider ones if you're going to choose between the two. I also think bigger batch sizes are better if you have the memory. I've also heard that you should maximize network size and then regularize so you don't overfit.
I have the impression these are big questions that no one really knows the answer to (could be very wrong about this!). We'd all love a smart way of choosing layer size / number of layers, but no one knows exactly how changing these things affects training.
Also, what is the minimum infrastructure setup required, in terms of memory and CPU capability, to train this deep learning model within a maximum of a few hours?
Depending on your model, that could be an unreasonable request. Seems like some models train for hundreds if not thousands of hours (on GPUs).
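For a sense of scale on the memory side, a back-of-the-envelope parameter count can be sketched as below. This assumes the layers are standard LSTM cells and that the input size equals the layer size, neither of which is stated in the question, and it ignores embeddings, the output projection, optimizer state, and activations. The function names are hypothetical.

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameter count of one standard LSTM layer.

    Four gates, each with input weights, recurrent weights, and a bias.
    """
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def stacked_lstm_params(input_size, hidden_size, num_layers):
    """Parameter count of a stack of LSTM layers of equal hidden size."""
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        total += lstm_layer_params(layer_input, hidden_size)
        # Every layer after the first consumes the previous layer's output.
        layer_input = hidden_size
    return total


# Assumed configuration: 2 layers of size 900, input size also 900.
params = stacked_lstm_params(input_size=900, hidden_size=900, num_layers=2)
weight_mib = params * 4 / 2**20  # float32 weights only

print(f"{params:,} parameters, about {weight_mib:.1f} MiB in float32")
```

A seq2seq model typically has both an encoder and a decoder, so the real figure would be at least double this, but under these assumptions the weights alone are small compared with 4GB of RAM.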
In my experience, training a seq2seq model with 2 layers of size 900 and batch size 4 took around 3 days on a machine with 4GB RAM and a 3GHz single-core Intel i5 processor, and around 1 day on a machine with 8GB RAM and a 3GHz single-core Intel i5 processor. Which helps the most for faster training: more RAM capacity, multiple CPU cores, or a CPU + GPU combination?
I believe a GPU will help you the most. I have seen some stuff that uses the CPU (asynchronous actor critic or something? They didn't use locking) where it seemed like CPU was better, but I think GPU will give you huge speedups.
After playing with the current distributed training implementation for a while, I think it views each GPU as a separate worker. However, it is now common to have 2~4 GPUs in one box. Isn't it better to adopt the single-box multi-GPU methodology, computing the average gradients within a single box first and then syncing up across multiple nodes? This would ease the I/O traffic a lot, which is always the bottleneck in data parallelism.
I was told it's possible with the current implementation by having all GPUs in a single box act as one worker, but I am not able to figure out how to tie the average gradients to SyncReplicasOptimizer, since SyncReplicasOptimizer directly takes the optimizer as input.
Any ideas from anyone?
Distributed TensorFlow supports multiple GPUs in the same worker task. One common way to perform distributed training for image models is to perform synchronous training across multiple GPUs in the same worker, and asynchronous training across workers (though other configurations are possible). This way you only pull the model parameters to the worker once, and they are distributed among the local GPUs, easing the network bandwidth utilization.
To do this kind of training, many users perform "in-graph replication" across the GPUs in a single worker. This can use an explicit loop across the local GPU devices, like in the CIFAR-10 example model; or higher-level library support, like in the model_deploy() utility from TF-Slim.
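The gradient-averaging step at the heart of in-graph replication can be sketched in plain Python, with scalars standing in for gradient tensors. `average_gradients` here is a hypothetical helper written for illustration, not the TensorFlow or TF-Slim API, and the second averaging step models the cross-node sync proposed in the question rather than the asynchronous scheme described above.

```python
def average_gradients(tower_grads):
    """Average per-variable gradients across replicas.

    `tower_grads` holds one list per replica, each with one gradient value
    per model variable, in the same variable order.
    """
    num_towers = len(tower_grads)
    # zip(*...) groups the gradients for the same variable across replicas.
    return [sum(per_variable) / num_towers for per_variable in zip(*tower_grads)]


# Two workers, each with two local GPUs ("towers") and two variables.
worker_a = average_gradients([[1.0, 2.0], [3.0, 4.0]])
worker_b = average_gradients([[5.0, 6.0], [7.0, 8.0]])

# Only one averaged gradient per worker has to cross the network.
synced = average_gradients([worker_a, worker_b])

print(worker_a, worker_b, synced)
```

With the same number of GPUs on every worker, averaging locally and then across workers gives the same result as averaging over all GPUs at once, while sending fewer gradients over the network.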