Use Digital Signal Processors to accelerate calculations in the same fashion as GPUs - signal-processing

I read that several DSP cards that process audio can calculate very fast Fourier transforms and some other functions involved in sound processing and other fields. There are some scientific problems (not many), like quantum mechanics, that involve Fourier transform calculations. I wonder if DSPs could be used to accelerate calculations in this fashion, as GPUs do in some other cases, and whether you know of successful examples.
Thanks

Any linear operations are easier and faster to do on DSP chips. Their architecture allows you to perform a linear operation (take two numbers, multiply each of them by a constant and add the results) in a single clock cycle. This is one of the reasons FFT can be calculated so quickly on a DSP chip. This is also a reason many other linear operations can be accelerated with their use. I guess I have three main points to make concerning performance and code optimization for such processors.
1) Perhaps less relevant, but I'd like to mention it nonetheless. To take full advantage of a DSP's architecture, you have to code in assembly. I'm pretty sure that regular C code will not be fully optimized by the compiler to do what you want. You literally have to specify each register, etc. It does pay off, however. In the same way, you can make use of circular buffers and other DSP-specific features. Circular buffers are also very useful for calculating the FFT and FFT-based (circular) convolution.
2) FFT can be found in solutions to many problems, such as heat flow (Fourier himself actually came up with the solution back in the 1800s), analysis of mechanical oscillations (or any linear oscillators for that matter, including oscillators in quantum physics), analysis of brain waves (EEG), seismic activity, planetary motion and many other things. Any mathematical problem that involves convolution can be easily solved via the Fourier transform, analog or discrete.
3) For some of the applications listed above, including audio processing, transforms other than the FFT are constantly being invented, discovered, and applied to processing, such as Mel-Cepstrum (e.g. MPEG codecs), wavelet transform (e.g. JPEG2000 codecs), discrete cosine transform (e.g. JPEG codecs) and many others. In quantum physics, however, the Fourier Transform is inherent in the equation of angular momentum. It arises naturally, not just for the purposes of analysis or ease of calculation. For this reason, I would not necessarily put the reasons to use the Fourier Transform in audio processing and quantum mechanics into the same category. For signal processing, it's a tool; for quantum physics, it's in the nature of the phenomenon.
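The multiply-accumulate step and the circular buffer mentioned above can be illustrated with a small FIR filter. This is only a sketch in Python for readability: on a real DSP the inner loop would be a hardware multiply-accumulate with automatic address wrap-around. The class name CircularFIR and its layout are assumptions made for illustration, not code from any DSP vendor.

```python
class CircularFIR:
    """FIR filter y[n] = sum_k coeffs[k] * x[n-k] using a circular buffer."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.buf = [0.0] * len(self.coeffs)  # holds the most recent samples
        self.pos = 0                         # index where the next sample is written

    def step(self, sample):
        # Overwrite the oldest sample instead of shifting the whole delay line.
        self.buf[self.pos] = sample
        acc = 0.0
        idx = self.pos
        for c in self.coeffs:
            acc += c * self.buf[idx]           # one multiply-accumulate per tap
            idx = (idx - 1) % len(self.buf)    # walk backwards with wrap-around
        self.pos = (self.pos + 1) % len(self.buf)
        return acc


if __name__ == "__main__":
    fir = CircularFIR([0.5, 0.5])  # two-tap moving average
    print([fir.step(x) for x in [2.0, 4.0, 6.0]])
```

The modulo on the index is what a DSP's circular addressing mode gives you for free, which is part of why hand-written assembly can keep the loop at one tap per cycle.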

Before GPUs and SIMD instruction sets in mainstream CPUs this was the only way to get performance for some applications. In the late 20th Century I worked for a company that made PCI cards to place extra processors in a PCI slot. Some of these were DSP cards using a TI C64x DSP, others were PowerPC cards to provide Altivec. The processor on the cards would typically have no operating system, to give more predictable real-time scheduling than the host. Customers would buy an industrial PC with a large PCI backplane, and attach multiple cards. We would also make cards in form factors such as PMC, CompactPCI, and VME for more rugged environments.
People would develop code to run on these cards, and host applications which communicated with the add-in card over the PCI bus. These weren't easy platforms to develop for, and the modern libraries for GPU computing are much easier.
Nowadays this is much less common. The price/performance ratio is so much better for general purpose CPUs and GPUs, and DSPs for scientific computing are vanishing. Current DSP manufacturers tend to target lower power embedded applications or cost sensitive high volume devices like digital cameras. Compare GPUFFTW with these Analog Devices benchmarks. The DSP peaks at 3.2 GFLOPS, and the Nvidia 8800 reaches 29 GFLOPS, roughly nine times as much.

Related

Are there any problems for which SIMD outperforms Cray-style vectors?

CPUs intended to provide high-performance number crunching end up with some kind of vector instruction set. There are basically two kinds:
1. SIMD. This is conceptually straightforward, e.g. instead of just having a set of 64-bit registers and operations thereon, you have a second set of 128-bit registers and you can operate on a short vector of two 64-bit values at the same time. It becomes complicated in the implementation because you also want to have the option of operating on four 32-bit values, and then a new CPU generation provides 256-bit vectors which requires a whole new set of instructions etc.
2. The older Cray-style vector instructions, where the vectors start off large e.g. 4096 bits, but the number of elements operated on simultaneously is transparent, and the number of elements you want to use in a given operation is an instruction parameter. The idea is that you bite off a little more complexity upfront, in order to avoid creeping complexity later.
It has been argued that option 2 is better, and the arguments seem to make sense, e.g. https://www.sigarch.org/simd-instructions-considered-harmful/
At least at first glance, it looks like option 2 can do everything option 1 can, more easily and generally better.
Are there any workloads where the reverse is true? Where SIMD instructions can do things Cray-style vectors cannot, or can do something faster or with less code?
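To make the contrast concrete, here is a rough Python emulation of how an element-wise add loop tends to be structured in each style. It is a sketch only: the function names, the width of 4 and the max_vl of 64 are illustrative assumptions, and list slices stand in for vector registers.

```python
def add_fixed_simd(a, b, width=4):
    """Fixed-width SIMD style: full-width chunks, then a scalar tail loop."""
    n = len(a)
    out = [0] * n
    main = n - n % width                      # largest multiple of width
    for i in range(0, main, width):
        out[i:i + width] = [x + y for x, y in zip(a[i:i + width], b[i:i + width])]
    for i in range(main, n):                  # leftover elements, one at a time
        out[i] = a[i] + b[i]
    return out


def add_vector_length_agnostic(a, b, max_vl=64):
    """Cray-style: each iteration asks for up to max_vl elements."""
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        vl = min(max_vl, n - i)               # the "vector length" parameter
        out[i:i + vl] = [x + y for x, y in zip(a[i:i + vl], b[i:i + vl])]
        i += vl
    return out


if __name__ == "__main__":
    a, b = list(range(10)), list(range(10, 20))
    print(add_fixed_simd(a, b) == add_vector_length_agnostic(a, b))
```

Changing width in the first function corresponds to recompiling for a new instruction set; changing max_vl in the second does not change the loop at all, which is the forward-compatibility argument in a nutshell.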
The "traditional" vector approaches (Cray, CDC/ETA, NEC, etc) arose in an era (~1976 to ~1992) with limited transistor budgets and commercially available low-latency SRAM main memories. In this technology regime, processors did not have the transistor budget to implement the full scoreboarding and interlocking for out-of-order operations that is currently available to allow pipelining of multi-cycle floating-point operations. Instead, a vector instruction set was created. Vector arithmetic instructions guaranteed that successive operations within the vector were independent and could be pipelined. It was relatively easy to extend the hardware to allow multiple vector operations in parallel, since the dependency checking only needed to be done "per vector" instead of "per element".
The Cray ISA was RISC-like in that data was loaded from memory into vector registers, arithmetic was performed register-to-register, then results were stored from vector registers back to memory. The maximum vector length was initially 64 elements, later 128 elements.
The CDC/ETA systems used a "memory-to-memory" architecture, with arithmetic instructions specifying memory locations for all inputs and outputs, along with a vector length of 1 to 65535 elements.
None of the "traditional" vector machines used data caches for vector operations, so performance was limited by the rate at which data could be loaded from memory. The SRAM main memories were a major fraction of the cost of the systems. In the early 1990's SRAM cost/bit was only about 2x that of DRAM, but DRAM prices dropped so rapidly that by 2002 SRAM price/MiB was 75x that of DRAM -- no longer even remotely acceptable.
The SRAM memories of the traditional machines were word-addressable (64-bit words) and were very heavily banked to allow nearly full speed for linear, strided (as long as powers of two were avoided), and random accesses. This led to a programming style that made extensive use of non-unit-stride memory access patterns. These access patterns cause performance problems on cached machines, and over time developers using cached systems quit using them -- so codes were less able to exploit this capability of the vector systems.
As codes were being re-written to use cached systems, it slowly became clear that caches work quite well for the majority of the applications that had been running on the vector machines. Re-use of cached data decreased the amount of memory bandwidth required, so applications ran much better on the microprocessor-based systems than expected from the main memory bandwidth ratios.
By the late 1990's, the market for traditional vector machines was nearly gone, with workloads transitioned primarily to shared-memory machines using RISC processors and multi-level cache hierarchies. A few government-subsidized vector systems were developed (especially in Japan), but these had little impact on high performance computing, and none on computing in general.
The story is not over -- after many not-very-successful tries (by several vendors) at getting vectors and caches to work well together, NEC has developed a very interesting system (NEC SX-Aurora Tsubasa) that combines a multicore vector register processor design with DRAM (HBM) main memory, and an effective shared cache. I especially like the ability to generate over 300 GB/s of memory bandwidth using a single thread of execution -- this is 10x-25x the bandwidth available with a single thread with AMD or Intel processors.
So the answer is that the low cost of microprocessors with cached memory drove vector machines out of the marketplace even before SIMD was included. SIMD had clear advantages for certain specialized operations, and has become more general over time -- albeit with diminishing benefits as the SIMD width is increased. The vector approach is not dead in an architectural sense (e.g., the NEC Vector Engine), but its advantages are generally considered to be overwhelmed by the disadvantages of software incompatibility with the dominant architectural model.
Cray-style vectors are great for pure-vertical problems, the kind of problem that some people think SIMD is limited to. They make your code forward compatible with future CPUs with wider vectors.
I've never worked with Cray-style vectors, so I don't know how much scope there might be for getting them to do horizontal shuffles.
If you don't limit things to Cray specifically, modern instruction-sets like ARM SVE and RISC-V extension V also give you forward-compatible code with variable vector width, and are clearly designed to avoid that problem of short-fixed-vector SIMD ISAs like AVX2 and AVX-512, and ARM NEON.
I think they have some shuffling capability. Definitely masking, but I'm not familiar enough with them to know if they can do stuff like left-pack (see "AVX2 what is the most efficient way to pack left based on a mask?") or prefix-sum (see "parallel prefix (cumulative) sum with SSE").
And then there are problems where you're working with a small fixed amount of data at a time, but more than fits in an integer register. For example, "How to convert a binary integer number to a hex string?", although that's still basically doing the same stuff to every element after some initial broadcasting.
But other stuff like "Most insanely fastest way to convert 9 char digits into an int or unsigned int", where a one-off custom shuffle and horizontal pairwise multiply can get just the right work done with a few single-uop instructions, is something that requires tight integration between SIMD and integer parts of the core (as on x86 CPUs) for maximum performance. Using the SIMD part for what it's good at, then getting the low two 32-bit elements of a vector into an integer register for the rest of the work. Part of the Cray model is (I think) a looser coupling to the CPU pipeline; that would defeat use-cases like that. Although some 32-bit ARM CPUs with NEON have the same loose coupling where mov from vector to integer is slow.
Parsing text in general, and atoi, is one use-case where short vectors with shuffle capabilities are effective. E.g. https://www.phoronix.com/scan.php?page=article&item=simdjson-avx-512&num=1 - 25% to 40% speedup from AVX-512 with simdjson 2.0 for parsing JSON, over the already-fast performance of AVX2 SIMD. (See "How to implement atoi using SIMD?" for a Q&A about using SIMD for JSON back in 2016).
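For reference, the in-register prefix sum mentioned above is typically built from log2(n) shift-and-add steps on a short vector. Below is a minimal Python emulation of that pattern, with list elements standing in for vector lanes. The function name is hypothetical, and the sketch says nothing about how SVE or RISC-V V would express the same thing.

```python
def prefix_sum_lanes(lanes):
    """Inclusive prefix sum using log2(n) 'shift lanes up and add' steps."""
    n = len(lanes)
    out = list(lanes)
    shift = 1
    while shift < n:
        # Shift the whole vector up by `shift` lanes, filling with zeros,
        # then add it to itself: the same work is done on every lane.
        shifted = [0] * shift + out[:n - shift]
        out = [x + y for x, y in zip(out, shifted)]
        shift *= 2
    return out


if __name__ == "__main__":
    print(prefix_sum_lanes([1, 2, 3, 4]))
```

The lane shift is a horizontal operation, which is exactly the kind of data movement a pure-vertical vector model does not provide by itself.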
Many of those tricks depend on x86-specific pmovmskb eax, xmm0 for getting an integer bitmap of a vector compare result. You can test if it's all zero or all-1 (cmp eax, 0xffff) to stay in the main loop of a memcmp or memchr loop for example. And if not then bsf eax,eax to find the position of the first difference, possibly after a not.
Having vector width limited to a number of elements that can fit in an integer register is key to this, although you could imagine an instruction-set with compare-into-mask with scalable width mask registers. (Perhaps ARM SVE is already like that? I'm not sure.)
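The compare-then-bitmap trick described above can be emulated in a few lines of Python. This is a sketch of the logic only, assuming 16-byte vectors: movemask_eq plays the role of a byte-wise compare followed by pmovmskb, and the lowest-set-bit step plays the role of bsf. Both function names are made up for illustration.

```python
def movemask_eq(a: bytes, b: bytes) -> int:
    """Bit i of the result is set when a[i] == b[i] (16 lanes -> 16-bit mask)."""
    assert len(a) == len(b) == 16
    mask = 0
    for i in range(16):
        if a[i] == b[i]:
            mask |= 1 << i
    return mask


def first_difference(a: bytes, b: bytes):
    """Index of the first differing byte, or None if all 16 bytes match."""
    mask = movemask_eq(a, b)
    if mask == 0xFFFF:                 # all lanes equal: stay in the main loop
        return None
    diff = ~mask & 0xFFFF              # the "not": 1 bits now mark mismatches
    return (diff & -diff).bit_length() - 1   # lowest set bit, like bsf


if __name__ == "__main__":
    print(first_difference(b"abcdefghijklmnop", b"abcdefgXijklmnop"))
```

The whole mask fits in one integer because the vector has only 16 lanes, which is the point made above about width-limited vectors.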

Comma.ai self-driving car neural network using client/server architecture in TensorFlow, why?

In comma.ai's self-driving car software they use a client/server architecture. Two processes are started separately, server.py and train_steering_model.py.
server.py sends data to train_steering_model.py via HTTP and sockets.
Why do they use this technique? Isn't this a complicated way of sending data? Wouldn't it be easier to have train_steering_model.py load the data set by itself?
The document DriveSim.md in the repository links to a paper titled Learning a Driving Simulator. In the paper, they state:
Due to the problem complexity we decided to learn video prediction with separable networks.
They also mention the frame rate they used is 5 Hz.
While that sentence is the only one that addresses your question, and it isn't exactly crystal clear, let's break down the task in question:
Grab an image from a camera
Preprocess/downsample/normalize the image pixels
Pass the image through an autoencoder to extract a representative feature vector
Pass the output of the autoencoder on to an RNN that will predict the proper steering angle
The "problem complexity" refers to the fact that they're dealing with a long sequence of large images that are (as they say in the paper) "highly uncorrelated." There are lots of different tasks that are going on, so the network approach is more modular - in addition to allowing them to work in parallel, it also allows scaling up the components without getting bottlenecked by a single piece of hardware reaching its threshold computational abilities. (And just think: this is only the steering aspect. The Logs.md file lists other components of the vehicle to worry about that aren't addressed by this neural network - gas, brakes, blinkers, acceleration, etc.).
Now let's fast forward to the practical implementation in a self-driving vehicle. There will definitely be more than one neural network operating onboard the vehicle, and each will need to be limited in size - microcomputers or embedded hardware, with limited computational power. So, there's a natural ceiling to how much work one component can do.
Tying all of this together is the fact that cars already operate using a network architecture - a CAN bus is literally a computer network inside of a vehicle. So, this work simply plans to farm out pieces of an enormously complex task to a number of distributed components (which will be limited in capability) using a network that's already in place.
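The modularity argument can be sketched with a toy producer and consumer that talk over a socket. This is not comma.ai's code and does not reproduce their protocol: the message framing, the batch contents, and every function name below are assumptions chosen only to show a data server and a trainer running as separate, loosely coupled pieces.

```python
import pickle
import socket
import struct
import threading


def send_msg(sock, obj):
    """Send one pickled object with a 4-byte length prefix."""
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed early")
        buf += chunk
    return buf


def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, length))


def data_server(sock, n_batches):
    """Stand-in for the data server: produces batches, then an end marker."""
    for i in range(n_batches):
        send_msg(sock, {"frames": [i] * 4, "steering": [0.1 * i] * 4})
    send_msg(sock, None)


def trainer(sock):
    """Stand-in for the training process: consumes batches as they arrive."""
    seen = 0
    while True:
        batch = recv_msg(sock)
        if batch is None:
            return seen
        seen += len(batch["frames"])   # a real trainer would run a training step here


def run_demo(n_batches=3):
    server_sock, client_sock = socket.socketpair()
    worker = threading.Thread(target=data_server, args=(server_sock, n_batches))
    worker.start()
    seen = trainer(client_sock)
    worker.join()
    server_sock.close()
    client_sock.close()
    return seen


if __name__ == "__main__":
    print(run_demo())
```

Because the two sides only share a message format, either one could be moved to another process or machine, or replaced, without touching the other.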

If a computer can be Turing complete with one instruction what is the purpose of having many instructions?

I understand the concept of a computer being Turing complete (having a MOV command or a SUBNEG command and therefore being able to "synthesize" other instructions). If that is true, what is the purpose of having 100s of instructions like x86 has, for example? Is it to increase efficiency?
Yes.
Equally, any logical circuit can be made using just NANDs. But that doesn't make other components redundant. Crafting a CPU from NAND gates would be monumentally inefficient, even if that CPU performed only one instruction.
An OS or application has a similar level of complexity to a CPU.
You COULD compile it so it just used a single instruction. But you would just end up with the world's most bloated OS.
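To get a feel for the bloat, here is a tiny interpreter for subleq ("subtract and branch if less than or equal to zero"), a close relative of the SUBNEG instruction mentioned in the question. It is a minimal sketch: the memory layout and the convention of halting on a negative jump target are assumptions of this example.

```python
def run_subleq(memory, pc=0, max_steps=1000):
    """Each instruction is three cells a, b, c:
    mem[b] -= mem[a]; jump to c if the result is <= 0, else fall through."""
    mem = list(memory)
    steps = 0
    while pc >= 0 and steps < max_steps:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
        steps += 1
    return mem


# Program: B = B + A, using a scratch cell Z that starts at 0.
# Cells 0..8 hold three instructions, cells 9..11 hold A, B and Z.
A, B, Z = 9, 10, 11
program = [
    A, Z, 3,    # Z = Z - A      (Z is now -A)
    Z, B, 6,    # B = B - Z      (B is now B + A)
    Z, Z, -1,   # Z = 0, then jump to -1 to halt
    7, 5, 0,    # data: A = 7, B = 5, Z = 0
]

if __name__ == "__main__":
    print(run_subleq(program)[B])
```

A single add already costs three instructions and a scratch cell; on x86 it is one instruction.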
So, when designing a CPU's instruction set, the choice is a tradeoff between reducing CPU size/expense, which allows more instructions per second as they are simpler, and smaller size means easier cooling (RISC); and increasing the capabilities of the CPU, including instructions that take multiple clock-cycles to complete, but making it larger and more cumbersome to cool (CISC).
This tradeoff is why math co-processors were a thing back in the 486 days. Floating point math could be emulated without the instructions. But it was much, much faster if it had a co-processor designed to do the heavy lifting on those floating point things.
Remember that a Turing Machine is generally understood to be an abstract concept, not a physical thing. It's the theoretical minimal form a computer can take that can still compute anything. Theoretically. Heavy emphasis on theoretically.
An actual Turing machine that did something so simple as decode an MP3 would be outrageously complicated. Programming it would be an utter nightmare as the machine is so insanely limited that even adding two 64-bit numbers together and recording the result in a third location would require an enormous amount of "tape" and a whole heap of "instructions".
When we say something is "Turing Complete" we mean that it can perform generic computation. It's a pretty low bar in all honesty; crazy things like the Game of Life and even CSS have been shown to be Turing Complete. That doesn't mean it's a good idea to program for them, or take them seriously as a computational platform.
In the early days of computing people would have to type in machine codes by hand. Adding two numbers together and storing the result is often one or two operations at most. Doing it in a Turing machine would require thousands. The complexity makes it utterly impractical on the most basic level.
As a challenge try and write a simple 4-bit adder. Then if you've successfully tackled that, write a 4-bit multiplier. The complexity ramps up exponentially once you move to things like 32 or 64-bit values, and when you try and tackle division or floating point values you're quickly going to drown in the outrageousness of it all.
You don't tell the CPU which transistors to flip when you're typing in machine code; the instructions act as macros to do that for you. But when you're writing Turing Machine code, it's up to you to command it how to flip each and every single bit.
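As a small taste of that bit-by-bit style, here is a sketch of a Turing machine that does nothing more than add 1 to a binary number. The rule table format, the state names and the simulator are assumptions of this example. A full 4-bit adder needs a considerably larger table.

```python
# (state, symbol read) -> (symbol to write, head movement, next state)
INCREMENT_RULES = {
    ("carry", "1"): ("0", -1, "carry"),  # 1 + carry = 0, keep carrying left
    ("carry", "0"): ("1", 0, "halt"),    # 0 + carry = 1, done
    ("carry", "_"): ("1", 0, "halt"),    # ran off the left edge: new digit
}


def run_turing_machine(tape_str, rules=INCREMENT_RULES, state="carry"):
    """Run until the 'halt' state; the head starts on the rightmost symbol."""
    tape = dict(enumerate(tape_str))     # sparse tape, '_' means blank
    head = len(tape_str) - 1
    steps = 0
    while state != "halt":
        symbol = tape.get(head, "_")
        write, move, state = rules[(state, symbol)]
        tape[head] = write
        head += move
        steps += 1
    cells = range(min(tape), max(tape) + 1)
    return "".join(tape.get(i, "_") for i in cells), steps


if __name__ == "__main__":
    print(run_turing_machine("1011"))
```

Even this trivial operation visits the bits one at a time, so the step count grows with the length of the carry chain.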
If you want to learn more about CPU history and design there's a wealth of information out there, and you can even implement your own using transistor logic or an FPGA kit where you can write it out using a higher level design language like Verilog.
The Intel 4004 chip was intended for a calculator so the operation codes were largely geared towards that. The subsequent 8008 built on that, and by the time the 8086 rolled around the instruction set had taken on that familiar x86 flavor, albeit a 16-bit version of same.
There's an abstraction spectrum here between defining the behaviour of individual bits (Turing Machine) and some kind of hypothetical CPU with an instruction for every occasion. RISC and CISC designs from the 1980s and 1990s differed in their philosophy here, where RISC generally had fewer instructions, CISC having more, but those differences have largely been erased as RISC gained more features and CISC became more RISC-like for the sake of simplicity.
The Turing Machine is the "absolute zero" in terms of CPU design. If you can come up with something simpler or more reductive you'd probably win a prize.

How do I know when I need a dedicated DSP chip?

When designing an embedded system, how can I tell in general when the floating point processing required will be too much for a standard microcontroller?
In case anyone is curious, the system I am designing is a Kalman filter and some motor control. However, I am looking for an engineering methodology for the general case.
The general approach to finding out whether a given processor can solve your problem is to estimate the number of floating-point operations that have to run per second, and then compare it to what the processor can do.
This ideal case will be affected by memory-access times, I/O interrupts, etc. In practice, you'll have to run it (although you don't want to hear that).
For the Kalman filter case:
1. Know the sample rate, the size of the state vector and the size of the measurement vector.
2. The complexity of the Kalman filter is dominated by the matrix inversion and multiple matrix multiplications (O(d^3), d: size of the state vector; or, for the Information Filter (inverse problem), O(z^3), z: size of the measurement vector). Online or in books you'll find detailed analyses of the operations required for Kalman filters.
3. Find out what actual operations are run in the algorithms and add the number of operations required for each part of the algorithm.
The analysis is essentially the same for a general microcontroller or a DSP, except that some things come for free on the DSP.
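The counting procedure above can be turned into a rough budget calculator. The sketch below assumes a standard dense Kalman filter with state size n and measurement size m, counts a (p x q)(q x r) matrix product as 2*p*q*r operations, and assumes about 2*m^3 operations for the matrix inversion. These counts, the 50% headroom default and all function names are illustrative assumptions, so treat the result as an order-of-magnitude estimate, not a guarantee.

```python
def matmul_flops(p, q, r):
    """Approximate cost of multiplying a (p x q) matrix by a (q x r) matrix."""
    return 2 * p * q * r


def kalman_flops_per_step(n, m):
    """Rough operation count for one predict + update cycle."""
    predict = (
        matmul_flops(n, n, 1)            # x = F x
        + 2 * matmul_flops(n, n, n)      # F P, then (F P) F^T
        + n * n                          # + Q
    )
    update = (
        matmul_flops(m, n, 1) + m        # y = z - H x
        + matmul_flops(m, n, n)          # H P
        + matmul_flops(m, n, m)          # (H P) H^T
        + m * m                          # + R
        + matmul_flops(n, n, m)          # P H^T
        + 2 * m ** 3                     # invert S (assumed cost)
        + matmul_flops(n, m, m)          # K = (P H^T) S^-1
        + matmul_flops(n, m, 1) + n      # x = x + K y
        + matmul_flops(n, m, n)          # K H
        + n * n                          # I - K H
        + matmul_flops(n, n, n)          # (I - K H) P
    )
    return predict + update


def fits(n, m, sample_rate_hz, sustained_flops, headroom=0.5):
    """Return (required flops/s, whether it fits within the headroom)."""
    required = kalman_flops_per_step(n, m) * sample_rate_hz
    return required, required <= headroom * sustained_flops


if __name__ == "__main__":
    # Hypothetical numbers: 6 states, 3 measurements, 1 kHz loop, 10 MFLOPS part.
    print(fits(6, 3, 1000, 10e6))
```

The headroom factor is there because, as noted above, memory access and interrupts eat into the theoretical peak; on a microcontroller without an FPU the sustained figure itself has to come from a measurement.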

Fastest method to compute convolution

I have to apply a convolution filter on each row of many images. The classic is 360 images of 1024x1024 pixels. In my use case it is 720 images 560x600 pixels.
The problem is that my code is much slower than what is advertised in articles.
I have implemented the naive convolution, and it takes 2m 30s. I then switched to FFT using fftw. I used complex-to-complex transforms, filtering two rows in each transform. I'm now around 20s.
The thing is that articles advertise around 10s and even less for the classic condition.
So I'd like to ask the experts here if there could be a faster way to compute the convolution.
Numerical Recipes suggests avoiding the reordering done in the DFT and adapting the frequency domain filter function accordingly. But there is no code example showing how this could be done.
Maybe I lose time in copying data. With a real-to-real transform I wouldn't have to copy the data into the complex values. But I have to pad with 0 anyway.
EDIT: see my own answer below for progress feedback and further information on solving this issue.
Question (precise reformulation):
I'm looking for an algorithm or piece of code to apply a very fast convolution to a discrete non-periodic function (512 to 2048 values). Apparently the discrete Fourier transform is the way to go. However, I'd like to avoid data copy and conversion to complex, and avoid the butterfly reordering.
FFT is the fastest technique known for convolving signals, and FFTW is the fastest free library available for computing the FFT.
The key for you to get maximum performance (outside of hardware ... the GPU is a good suggestion) will be to pad your signals to a power of two. When using FFTW use the 'patient' setting when creating your plan to get the best performance. It's highly unlikely that you will hand-roll a faster implementation than what FFTW provides (forget about N.R.). Also be sure to be using the Real version of the forward 1D FFT and not the Complex version; and only use single (floating point) precision if you can.
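The padding step can be sketched as follows. This is a plain Python illustration of the idea and not FFTW: it uses a textbook recursive radix-2 FFT on complex values for brevity, where a real-input transform would do less work, and all function names are assumptions of this example. The signal and kernel are zero-padded to the next power of two that is at least len(signal) + len(kernel) - 1, so the circular convolution computed by the FFT equals the linear one.

```python
import cmath


def next_pow2(n):
    """Smallest power of two that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (unscaled)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_convolve(signal, kernel):
    """Linear convolution of two real sequences via zero-padded FFTs."""
    out_len = len(signal) + len(kernel) - 1
    n = next_pow2(out_len)
    a = fft([complex(v) for v in signal] + [0j] * (n - len(signal)))
    b = fft([complex(v) for v in kernel] + [0j] * (n - len(kernel)))
    y = fft([p * q for p, q in zip(a, b)], inverse=True)
    return [v.real / n for v in y[:out_len]]   # 1/n scaling of the inverse


def direct_convolve(signal, kernel):
    """Naive O(N*M) reference implementation."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


if __name__ == "__main__":
    row, taps = [1, 2, 3, 4, 5], [1, 0, -1]
    print(direct_convolve(row, taps))
    print([round(v, 6) for v in fft_convolve(row, taps)])
```

Since the same kernel is applied to every row, its transform only needs to be computed once and can be reused for all rows of all images.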
If FFTW is not cutting it for you, then I would look at Intel's (very affordable) IPP library. They have hand-tuned FFTs for Intel processors that have been optimized for images with various bit depths.
Paul
CenterSpace Software
You may want to add image processing as a tag.
But this article may be of interest, especially with the assumption that the image size is a power of 2. You can also see where they optimize the FFT. I expect that the articles you are looking at made some assumptions and then optimized the equations for those.
http://www.gamasutra.com/view/feature/3993/sponsored_feature_implementation_.php
If you want to go faster you may want to use the GPU to actually do the work.
This book may be helpful for you, if you go with the GPU:
http://www.springerlink.com/content/kd6qm361pq8mmlx2/
This answer is to collect progress report feedback on this issue.
Edit 11 oct.:
The execution time I measured doesn't reflect the effective time of the FFT. I noticed that when my program ends, the CPU is still busy in system time up to 42% for 10s. When I wait until the CPU is back to 0%, before restarting my program I then get the 15.35s execution time which comes from the GPU processing. I get the same time if I comment out the FFT filtering.
So the FFT is in fact currently faster than the GPU and was simply hindered by a competing system task. I don't know yet what this system task is. I suspect it results from the allocation of a huge heap block where I copy the processing result before writing it to disk. For the input data I use a memory map.
I'll now change my code to get an accurate measurement of the FFT processing time. Making it faster is still relevant because there is room to optimize the GPU processing, for instance by pipelining the transfer of the data to process.

Resources