OpenVino Performance - Neural Stick vs. CPU - opencv

I'm seeing a significant difference in inference performance between my desktop CPU and when I run on the Neural Compute Stick 2 VPU - almost 500ms slower on VPU. This is the one line that takes the most time and has the biggest difference:
result = exec_net.infer( inputs={input_layer_ir: blob} )
My desktop is my gaming machine and has a nice fast Intel CPU. That said is this the expected order of magnitude difference between the VPU and CPU?
CPU speeds are really fast like .07 seconds and VPU is around .5.
It’s the road segmentation model from the open zoo samples.

Intel® Neural Compute Stick 2 (NCS 2) is a USB stick that offers you access to neural network functionality, without the need for large, expensive hardware. It is a plug-and-play device, so you are ready to start prototyping right away.
The performance of NCS 2 compared to the well-known CPUs or GPUs in the meaning of TFLOPS, it is still a hundred times lower. This behaviour is expected, so don’t rely on it as an external device to replace the CPU plugin.

Related

Choice of infrastructure for faster deep learning model training with tensorflow?

I am a newbie in deep learning with tensorflow.I am trying out a seq2seq model sample code.
I wanted to understand:
What is the minimum values of number of layers, layer size and batch
size that I could start off with to be able to test the seq2seq
model with satisfactory accuracy?
Also,the minimum infrastructure setup required in terms of memory
and cpu capability to train this deep learning model within a max
time of a few hours.
My experience has been training a seq2seq model to build a neural network with
2 layers of size 900 and batch size 4
took around 3 days to train on a 4GB RAM,3GHz Intel i5 single core
processor.
took around 1 day to train on a 8GB RAM,3GHz Intel i5 single core
processor.
Which helps the most for faster training - more RAM capacity, multiple CPU cores or a CPU + GPU combination core?
Disclaimer: I'm also new, and could be wrong on a lot of this.
I am a newbie in deep learning with tensorflow.I am trying out a
seq2seq model sample code.
I wanted to understand:
What is the minimum values of number of layers, layer size and batch
size that I could start off with to be able to test the seq2seq model
with satisfactory accuracy?
I think that this will just have to be up to your experimentation. Find out what works for your data set. I have heard a few pieces of advice: don't pick your own architecture if you can - find someone else's that is tried and tested. Seems deeper networks are better than wider if you're going to choose between the too. I also think bigger batch sizes are better if you have the memory. I've heard to maximize network size and then regularize so you don't overfit.
I have the impression these are big questions that no one really knows the answer to (could be very wrong about this!). We'd all love a smart way of choosing layer size / number of layers, but no one knows exactly how changing these things affects training.
Also,the minimum infrastructure setup required in terms of memory and cpu capability to train this deep
learning model within a max time of a few hours.
Depending on your model, that could be an unreasonable request. Seems like some models train for hundreds if not thousands of hours (on GPUs).
My experience has
been training a seq2seq model to build a neural network with 2 layers
of size 900 and batch size 4
took around 3 days to train on a 4GB RAM,3GHz Intel i5 single core
processor. took around 1 day to train on a 8GB RAM,3GHz Intel i5
single core processor. Which helps the most for faster training - more
RAM capacity, multiple CPU cores or a CPU + GPU combination core?
I believe a GPU will help you the most. I have seen some stuff that uses the CPU (asynchronous actor critic or something? They didn't use locking) where it seemed like CPU was better, but I think GPU will give you huge speedups.

nerual networks: gpu vs no-gpu

I need to train a recurrent neural network as a language model and I decided to use keras with theano backend for that. Is it better to use an ordinary PC with some graphics card instead of a "cool" server machine that can't do gpu computing? Is there a boundary (given perhaps by the architecture of the NN and amount of the training data) that would separate "cpu-learnable" problems from those that can be done (in reasonable time) only by utilizing gpu?
(I have access to an older production server in the company I work in. It has 16 cores, about 49GB of available RAM so I thought I was ready for training, now I am reading about gpu optimization theano is doing and I am thinking I am basically screwed without it.)
Edit
I have just come across this article, where Tomáš Mikolov states they managed to train a single-layer recurrent neural network with 1024 states in 10 days while using only 24 CPUs and no GPU.
Is there a boundary
One that would separate CPU vs GPU is memory access. If you are accessing the values from your neural network often, CPU would do better, as it has faster access to RAM. If I'm not wrong, getting the updates (SGD, RMSProp, Adagrad etc) would require that the values be accessed.
GPU would be advisable when amount of computation is larger than memory access, e.g. training a deep neural network.
that can be done (in reasonable time) only by utilizing gpu
Unfortunately, if you are trying to solve such a hard problem, Theano would be a bad choice, as you are constrained to running on a single machine. Try other frameworks that would allow running on multiple CPU and GPU across machines, such as Microsoft CNTK or Google TensorFlow.
thinking I am basically screwed
The difference (may be speed up or slow down) won't be that big, depending on the neural network. Plus, running the neural network computation on your machine can get in the way of your work. So you are probably better off using that extra server and making it useful.

TBB parallel_for with less number of threads

I have written a multi-view face detection code using opencv face detector. I am running five detectors (trained for different pose angles) over an image and taking their weights to detect faces in an image. I have made the code parallel using TBB parallel_for but it improved the performance by just 1.7-times. I would like to ask if there is any better way to run five detectors in parallel?
I am running my code on a cluster with 16-cores. I think number of threads (that in my case are 5) are too less to utilize the complete power.
Any suggestions?
Thanks,
Some possible problems to look into:
One of the detectors takes longer than the other detectors to run. For example, if one detector takes 4 units of time, and the other four detectors each take 1 unit of time, the most possible speedup is 2x. Parallelizing the slow detector itself might help in this kind of situation.
The detectors run so fast that the parallel_for does not have time to spread the work. If the detectors each take at least 0.1 sec, this should not be a problem.
Memory bandwidth can be a limiting resource, particularly if working sets do not fit in outer-level cache.
A profiler such as Intel(R) VTune(TM) Amplifier can sometimes help to track down these problems. Both commercial and non-commercial licenses exist for Amplifier. [Disclaimer: I work for Intel.]

Use Digital Signal Processors to accelerate calculations in the same fashion than GPUs

I read that several DSP cards that process audio, can calculate very fast Fourier Transforms and some other functions involved in Sound processing and others. There are some scientific problems (not many), like Quantum mechanics, that involver Fourier Transform calculations. I wonder if DSP could be used to accelerate calculations in this fashion, like GPUs do in some other cases, and if you know succcessful examples.
Thanks
Any linear operations are easier and faster to do on DSP chips. Their architecture allows you to perform a linear operation (take two numbers, multiply each of them by a constant and add the results) in a single clock cycle. This is one of the reasons FFT can be calculated so quickly on a DSP chip. This is also a reason many other linear operations can be accelerated with their use. I guess I have three main points to make concerning performance and code optimization for such processors.
1) Perhaps less relevant, but I'd like to mention it nonetheless. In order to take full advantage of DSP processor's architecture, you have to code in Assembly. I'm pretty sure that regular C code will not be fully optimized by the compiler to do what you want. You literally have to specify each register, etc. It does pay off, however. The same way, you are able to make use of circular buffers and other DSP-specific things. Circular buffers are also very useful for calculating the FFT and FFT-based (circular) convolution.
2) FFT can be found in solutions to many problems, such as heat flow (Fourier himself actually came up with the solution back in the 1800s), analysis of mechanical oscillations (or any linear oscillators for that matter, including oscillators in quantum physics), analysis of brain waves (EEG), seismic activity, planetary motion and many other things. Any mathematical problem that involves convolution can be easily solved via the Fourier transform, analog or discrete.
3) For some of the applications listed above, including audio processing, other transforms other than FFT are constantly being invented, discovered, and applied to processing, such as Mel-Cepstrum (e.g. MPEG codecs), wavelet transform (e.g. JPEG2000 codecs), discrete cosine transform (e.g. JPEG codecs) and many others. In quantum physics, however, the Fourier Transform is inherent in the equation of angular momentum. It arises naturally, not just for the purposes of analysis or easy of calculations. For this reason, I would not necessarily put the reasons to use Fourier Transform in audio processing and quantum mechanics into the same category. For signal processing, it's a tool; for quantum physics, it's in the nature of the phenomenon.
Before GPUs and SIMD instruction sets in mainstream CPUs this was the only way to get performance for some applications. In the late 20th Century I worked for a company that made PCI cards to place extra processors in a PCI slot. Some of these were DSP cards using a TI C64x DSP, others were PowerPC cards to provide Altivec. The processor on the cards would typically have no operating system to give more predicatable real-time scheduling than the host. Customers would buy an industrial PC with a large PCI backplace, and attach multiple cards. We would also make cards in form factors such as PMC, CompactPCI, and VME for more rugged environments.
People would develop code to run on these cards, and host applications which communicated with the add-in card over the PCI bus. These weren't easy platforms to develop for, and the modern libraries for GPU computing are much easier.
Nowadays this is much less common. The price/performance ratio is so much better for general purpose CPUs and GPUs, and DSPs for scientific computing are vanishing. Current DSP manufacturers tend to target lower power embedded applications or cost sensitive high volume devices like digital cameras. Compare GPUFFTW with these Analog Devices benchmarks. The DSP peaks at 3.2GFlops, and the Nvidia 8800 reachs 29GFlops.

Reasons for NOT scaling-up vs. -out?

As a programmer I make revolutionary findings every few years. I'm either ahead of the curve, or behind it by about π in the phase. One hard lesson I learned was that scaling OUT is not always better, quite often the biggest performance gains are when we regrouped and scaled up.
What reasons to you have for scaling out vs. up? Price, performance, vision, projected usage? If so, how did this work for you?
We once scaled out to several hundred nodes that would serialize and cache necessary data out to each node and run maths processes on the records. Many, many billions of records needed to be (cross-)analyzed. It was the perfect business and technical case to employ scale-out. We kept optimizing until we processed about 24 hours of data in 26 hours wallclock. Really long story short, we leased a gigantic (for the time) IBM pSeries, put Oracle Enterprise on it, indexed our data and ended up processing the same 24 hours of data in about 6 hours. Revolution for me.
So many enterprise systems are OLTP and the data are not shard'd, but the desire by many is to cluster or scale-out. Is this a reaction to new techniques or perceived performance?
Do applications in general today or our programming matras lend themselves better for scale-out? Do we/should we take this trend always into account in the future?
Because scaling up
Is limited ultimately by the size of box you can actually buy
Can become extremely cost-ineffective, e.g. a machine with 128 cores and 128G ram is vastly more expensive than 16 with 8 cores and 8G ram each.
Some things don't scale up well - such as IO read operations.
By scaling out, if your architecture is right, you can also achieve high availability. A 128-core, 128G ram machine is very expensive, but to have a 2nd redundant one is extortionate.
And also to some extent, because that's what Google do.
Scaling out is best for embarrassingly parallel problems. It takes some work, but a number of web services fit that category (thus the current popularity). Otherwise you run into Amdahl's law, which then means to gain speed you have to scale up not out. I suspect you ran into that problem. Also IO bound operations also tend to do well with scaling out largely because waiting for IO increases the % that is parallelizable.
The blog post Scaling Up vs. Scaling Out: Hidden Costs by Jeff Atwood has some interesting points to consider, such as software licensing and power costs.
Not surprisingly, it all depends on your problem. If you can easily partition it with into subproblems that don't communicate much, scaling out gives trivial speedups. For instance, searching for a word in 1B web pages can be done by one machine searching 1B pages, or by 1M machines doing 1000 pages each without a significant loss in efficiency (so with a 1,000,000x speedup). This is called "embarrassingly parallel".
Other algorithms, however, do require much more intensive communication between the subparts. Your example requiring cross-analysis is the perfect example of where communication can often drown out the performance gains of adding more boxes. In these cases, you'll want to keep communication inside a (bigger) box, going over high-speed interconnects, rather than something as 'common' as (10-)Gig-E.
Of course, this is a fairly theoretical point of view. Other factors, such as I/O, reliability, easy of programming (one big shared-memory machine usually gives a lot less headaches than a cluster) can also have a big influence.
Finally, due to the (often extreme) cost benefits of scaling out using cheap commodity hardware, the cluster/grid approach has recently attracted much more (algorithmic) research. This makes that new ways of parallelization have been developed that minimize communication, and thus do much better on a cluster -- whereas common knowledge used to dictate that these types of algorithms could only run effectively on big iron machines...

Resources