How do I Perform Integer SIMD operations on the iPad A4 Processor? - ipad

I feel the need for speed. Double for loops are killing my iPad app's performance. I need SIMD. How do I perform integer SIMD operations on the iPad A4 processor?
Thanks,
Doug

The instruction set is NEON; see the NEON intrinsics reference.
I've never been able to find good documentation on what they all actually do, but you pick it up pretty quickly if you've had any exposure to SSE.
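A minimal sketch of what integer SIMD with the NEON intrinsics looks like (assuming arm_neon.h as provided by GCC/Clang when targeting ARMv7 with NEON; the function name and the multiple-of-8 length assumption are just for illustration):

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Add two arrays of signed 16-bit integers, 8 lanes per iteration.
       Assumes n is a multiple of 8; a real version needs a scalar tail loop. */
    void add_s16(int16_t *dst, const int16_t *a, const int16_t *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 8) {
            int16x8_t va = vld1q_s16(a + i);          /* load 8 x s16 */
            int16x8_t vb = vld1q_s16(b + i);
            vst1q_s16(dst + i, vaddq_s16(va, vb));    /* lane-wise add, store */
        }
    }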

To get the fastest speed, you will have to write ARM assembly language code that uses NEON SIMD operations, because C compilers generally don't generate very good SIMD code, so hand-written assembly can make a big difference. I have a brief intro here: http://www.shervinemami.co.cc/iphoneAssembly.html
Note that the iPad A4 uses an ARMv7-A CPU, so the reference manual for the NEON SIMD instructions is at: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406b/index.html
(but it's 2,000 pages long and requires an understanding of assembly code, and perhaps of SIMD in general!).

Related

Are there any problems for which SIMD outperforms Cray-style vectors?

CPUs intended for high-performance number crunching end up with some kind of vector instruction set. There are basically two kinds:
1. SIMD. This is conceptually straightforward: e.g. instead of just having a set of 64-bit registers and operations on them, you have a second set of 128-bit registers and can operate on a short vector of two 64-bit values at the same time (see the sketch after this list). It becomes complicated in the implementation because you also want the option of operating on four 32-bit values, and then a new CPU generation provides 256-bit vectors, which requires a whole new set of instructions, etc.
2. The older Cray-style vector instructions, where the vectors start off large (e.g. 4096 bits), but the number of elements operated on simultaneously is transparent, and the number of elements you want to use in a given operation is an instruction parameter. The idea is that you bite off a little more complexity upfront, in order to avoid creeping complexity later.
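To make option 1 concrete, here is what a fixed-width SIMD loop looks like with SSE2 intrinsics (an illustrative sketch, not from the question; two 64-bit doubles per 128-bit register):

    #include <emmintrin.h>   /* SSE2: 128-bit vectors, width fixed by the ISA */
    #include <stddef.h>

    /* The vector width (2 doubles) is baked into the code; a wider CPU
       generation (AVX, AVX-512) needs new intrinsics and a rewrite. */
    void add_pd(double *dst, const double *a, const double *b, size_t n)
    {
        size_t i = 0;
        for (; i + 2 <= n; i += 2) {
            __m128d va = _mm_loadu_pd(a + i);
            __m128d vb = _mm_loadu_pd(b + i);
            _mm_storeu_pd(dst + i, _mm_add_pd(va, vb));
        }
        for (; i < n; i++)       /* scalar tail for odd n */
            dst[i] = a[i] + b[i];
    }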
It has been argued that option 2 is better, and the arguments seem to make sense, e.g. https://www.sigarch.org/simd-instructions-considered-harmful/
At least at first glance, it looks like option 2 can do everything option 1 can, more easily and generally better.
Are there any workloads where the reverse is true? Where SIMD instructions can do things Cray-style vectors cannot, or can do something faster or with less code?
The "traditional" vector approaches (Cray, CDC/ETA, NEC, etc) arose in an era (~1976 to ~1992) with limited transistor budgets and commercially available low-latency SRAM main memories. In this technology regime, processors did not have the transistor budget to implement the full scoreboarding and interlocking for out-of-order operations that is currently available to allow pipelining of multi-cycle floating-point operations. Instead, a vector instruction set was created. Vector arithmetic instructions guaranteed that successive operations within the vector were independent and could be pipelined. It was relatively easy to extend the hardware to allow multiple vector operations in parallel, since the dependency checking only needed to be done "per vector" instead of "per element".
The Cray ISA was RISC-like in that data was loaded from memory into vector registers, arithmetic was performed register-to-register, then results were stored from vector registers back to memory. The maximum vector length was initially 64 elements, later 128 elements.
The CDC/ETA systems used a "memory-to-memory" architecture, with arithmetic instructions specifying memory locations for all inputs and outputs, along with a vector length of 1 to 65535 elements.
None of the "traditional" vector machines used data caches for vector operations, so performance was limited by the rate at which data could be loaded from memory. The SRAM main memories were a major fraction of the cost of the systems. In the early 1990's SRAM cost/bit was only about 2x that of DRAM, but DRAM prices dropped so rapidly that by 2002 SRAM price/MiB was 75x that of DRAM -- no longer even remotely acceptable.
The SRAM memories of the traditional machines were word-addressable (64-bit words) and were very heavily banked to allow nearly full speed for linear, strided (as long as powers of two were avoided), and random accesses. This led to a programming style that made extensive use of non-unit-stride memory access patterns. These access patterns cause performance problems on cached machines, and over time developers using cached systems quit using them -- so codes were less able to exploit this capability of the vector systems.
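As a concrete (hypothetical) illustration of such a non-unit-stride pattern: summing a column of a row-major matrix walks memory with a stride of one row, which the heavily banked SRAM machines ran at nearly full speed but which touches a whole cache line per element on cached machines:

    #include <stddef.h>

    /* Sum column j of an n-by-n row-major matrix: stride-n access.
       Fast on banked SRAM (as long as n avoids power-of-two bank conflicts),
       wasteful on cache-line-based memory systems. */
    double sum_column(const double *a, size_t n, size_t j)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i * n + j];
        return s;
    }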
As codes were being re-written to use cached systems, it slowly became clear that caches work quite well for the majority of the applications that had been running on the vector machines. Re-use of cached data decreased the amount of memory bandwidth required, so applications ran much better on the microprocessor-based systems than expected from the main memory bandwidth ratios.
By the late 1990's, the market for traditional vector machines was nearly gone, with workloads transitioned primarily to shared-memory machines using RISC processors and multi-level cache hierarchies. A few government-subsidized vector systems were developed (especially in Japan), but these had little impact on high performance computing, and none on computing in general.
The story is not over -- after many not-very-successful tries (by several vendors) at getting vectors and caches to work well together, NEC has developed a very interesting system (NEC SX-Aurora Tsubasa) that combines a multicore vector register processor design with DRAM (HBM) main memory, and an effective shared cache. I especially like the ability to generate over 300 GB/s of memory bandwidth using a single thread of execution -- this is 10x-25x the bandwidth available with a single thread with AMD or Intel processors.
So the answer is that the low cost of microprocessors with cached memory drove vector machines out of the marketplace even before SIMD was included. SIMD had clear advantages for certain specialized operations, and has become more general over time -- albeit with diminishing benefits as the SIMD width is increased. The vector approach is not dead in an architectural sense (e.g., the NEC Vector Engine), but its advantages are generally considered to be overwhelmed by the disadvantages of software incompatibility with the dominant architectural model.
Cray-style vectors are great for pure-vertical problems, the kind of problem that some people think SIMD is limited to. They make your code forward compatible with future CPUs with wider vectors.
I've never worked with Cray-style vectors, so I don't know how much scope there might be for getting them to do horizontal shuffles.
If you don't limit things to Cray specifically, modern instruction-sets like ARM SVE and RISC-V extension V also give you forward-compatible code with variable vector width, and are clearly designed to avoid that problem of short-fixed-vector SIMD ISAs like AVX2 and AVX-512, and ARM NEON.
I think they have some shuffling capability. Definitely masking, but I'm not familiar enough with them to know if they can do stuff like left-pack (AVX2 what is the most efficient way to pack left based on a mask?) or prefix-sum (parallel prefix (cumulative) sum with SSE).
And then there are problems where you're working with a small fixed amount of data at a time, but more than fits in an integer register. For example How to convert a binary integer number to a hex string? although that's still basically doing the same stuff to every element after some initial broadcasting.
But other stuff, like Most insanely fastest way to convert 9 char digits into an int or unsigned int, where a one-off custom shuffle and horizontal pairwise multiply can get just the right work done with a few single-uop instructions, requires tight integration between the SIMD and integer parts of the core (as on x86 CPUs) for maximum performance: you use the SIMD part for what it's good at, then get the low two 32-bit elements of a vector into an integer register for the rest of the work. Part of the Cray model is (I think) a looser coupling to the CPU pipeline; that would defeat use-cases like that, although some 32-bit ARM CPUs with NEON have the same loose coupling, where a mov from a vector register to an integer register is slow.
Parsing text in general, and atoi, is one use-case where short vectors with shuffle capabilities are effective. e.g. https://www.phoronix.com/scan.php?page=article&item=simdjson-avx-512&num=1 - 25% to 40% speedup from AVX-512 with simdjson 2.0 for parsing JSON, over the already-fast performance of AVX2 SIMD. (See How to implement atoi using SIMD? for a Q&A about using SIMD for JSON back in 2016).
Many of those tricks depend on the x86-specific pmovmskb eax, xmm0 for getting an integer bitmap of a vector compare result. You can test whether it's all-zero or all-ones (cmp eax, 0xffff) to stay in the main loop of a memcmp or memchr, for example, and if not, bsf eax, eax finds the position of the first difference (possibly after a not).
Having vector width limited to a number of elements that can fit in an integer register is key to this, although you could imagine an instruction-set with compare-into-mask with scalable width mask registers. (Perhaps ARM SVE is already like that? I'm not sure.)
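A rough sketch of that pmovmskb pattern with SSE2 intrinsics, as a memchr-style search (the helper name and the multiple-of-16 length assumption are mine; __builtin_ctz assumes GCC/Clang; this compiles to roughly pcmpeqb / pmovmskb / bsf):

    #include <emmintrin.h>
    #include <stddef.h>

    /* Return the index of the first byte equal to c in s[0..n), or n.
       Assumes n is a multiple of 16 to keep the sketch short. */
    size_t find_byte(const unsigned char *s, size_t n, unsigned char c)
    {
        __m128i needle = _mm_set1_epi8((char)c);
        for (size_t i = 0; i < n; i += 16) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle)); /* pmovmskb */
            if (mask != 0)                                   /* any lane matched? */
                return i + (size_t)__builtin_ctz((unsigned)mask);  /* bsf */
        }
        return n;
    }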

If a computer can be Turing complete with one instruction what is the purpose of having many instructions?

I understand the concept of a computer being Turing complete (having only a MOV or a SUBNEG instruction and therefore being able to "synthesize" other instructions from it). If that is true, what is the purpose of having hundreds of instructions like x86 has, for example? Is it to increase efficiency?
Yes.
Equally, any logical circuit can be made using just NANDs. But that doesn't make other components redundant. Crafting a CPU from NAND gates would be monumentally inefficient, even if that CPU performed only one instruction.
An OS or application has a similar level of complexity to a CPU.
You COULD compile it so it just used a single instruction. But you would just end up with the world's most bloated OS.
So, when designing a CPU's instruction set, the choice is a trade-off between reducing the CPU's size and expense, which allows more instructions per second because they are simpler, and a smaller die is easier to cool (RISC); and increasing the capabilities of the CPU, including instructions that take multiple clock cycles to complete, but making it larger and more cumbersome to cool (CISC).
This tradeoff is why math co-processors were a thing back in the 486 days. Floating point math could be emulated without the instructions. But it was much, much faster if it had a co-processor designed to do the heavy lifting on those floating point things.
Remember that a Turing Machine is generally understood to be an abstract concept, not a physical thing. It's the theoretical minimal form a computer can take that can still compute anything. Theoretically. Heavy emphasis on theoretically.
An actual Turing machine that did something as "simple" as decoding an MP3 would be outrageously complicated. Programming it would be an utter nightmare, as the machine is so insanely limited that even adding two 64-bit numbers together and recording the result in a third location would require an enormous amount of "tape" and a whole heap of "instructions".
When we say something is "Turing Complete" we mean that it can perform generic computation. It's a pretty low bar in all honesty, crazy things like the Game of Life and even CSS have been shown to be Turing Complete. That doesn't mean it's a good idea to program for them, or take them seriously as a computational platform.
In the early days of computing people would have to type in machine codes by hand. Adding two numbers together and storing the result is often one or two operations at most. Doing it in a Turing machine would require thousands. The complexity makes it utterly impractical on the most basic level.
As a challenge, try to write a simple 4-bit adder. Then, if you've successfully tackled that, write a 4-bit multiplier. The complexity ramps up exponentially once you move to things like 32- or 64-bit values, and when you try to tackle division or floating-point values you're quickly going to drown in the outrageousness of it all.
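For a flavour of that challenge, here is a 4-bit ripple-carry adder written out gate-by-gate in C (just an illustration of the logic involved, not part of the original answer):

    /* 4-bit ripple-carry adder built from plain bitwise logic.
       Each full adder: sum = a ^ b ^ cin, carry = (a & b) | (cin & (a ^ b)). */
    unsigned add4(unsigned a, unsigned b)       /* a, b in 0..15 */
    {
        unsigned sum = 0, carry = 0;
        for (int i = 0; i < 4; i++) {
            unsigned ai = (a >> i) & 1u, bi = (b >> i) & 1u;
            sum   |= (ai ^ bi ^ carry) << i;                 /* sum bit i */
            carry  = (ai & bi) | (carry & (ai ^ bi));        /* carry out */
        }
        return sum | (carry << 4);              /* 5-bit result with carry out */
    }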
You don't tell the CPU which transistors to flip when you're typing in machine code, the instructions act as macros to do that for you, but when you're writing Turing Machine code it's up to you to command it how to flip each and every single bit.
If you want to learn more about CPU history and design there's a wealth of information out there, and you can even implement your own using transistor logic or an FPGA kit where you can write it out using a higher level design language like Verilog.
The Intel 4004 chip was intended for a calculator so the operation codes were largely geared towards that. The subsequent 8008 built on that, and by the time the 8086 rolled around the instruction set had taken on that familiar x86 flavor, albeit a 16-bit version of same.
There's an abstraction spectrum here, from defining the behaviour of individual bits (the Turing machine) up to some hypothetical CPU with an instruction for every occasion. RISC and CISC designs from the 1980s and 1990s differed in their philosophy here, with RISC generally having fewer instructions and CISC more, but those differences have largely been erased as RISC gained more features and CISC became more RISC-like for the sake of simplicity.
The Turing Machine is the "absolute zero" in terms of CPU design. If you can come up with something simpler or more reductive you'd probably win a prize.

SIMD math libraries for SSE and AVX

I am looking for SIMD math libraries (preferably open source) for SSE and AVX. For example, if I have an AVX register v with 8 float values, I want sin(v) to return the sine of all eight values at once.
AMD has a proprietary library, LibM http://developer.amd.com/tools/cpu-development/libm/, which has some SIMD math functions, but LibM only uses AVX if it detects FMA4, which Intel CPUs don't have. Also, I'm not sure it fully uses AVX, as all the function names end in s4 (d2) and not s8 (d4). It gives better performance than the standard math libraries on Intel CPUs, but not by much.
Intel has SVML as part of its C++ compiler, but the compiler suite is very expensive on Windows. Additionally, Intel cripples the library on non-Intel CPUs.
I found the following AVX library, http://software-lisc.fbk.eu/avx_mathfun/, which supports a few math functions (exp, log, sin, cos, and sincos). It gives very fast results for me, faster than SVML, but I have not checked the accuracy. It only works on single-precision floats and does not work in Visual Studio (though that would be easy to fix). It's based on an older SSE library.
Does anyone have any other suggestions?
Edit: I found an SO thread that has many answers on this subject:
Vectorized Trig functions in C?
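For concreteness, the interface being asked for looks roughly like the sketch below (the name sin_ps256 and its scalar-fallback body are purely illustrative, not from any of the libraries mentioned; a real SIMD math library replaces the body with vectorized range-reduction and polynomial code):

    #include <immintrin.h>
    #include <math.h>

    /* sin() of 8 floats at once -- scalar fallback shown only to pin down
       the interface; SVML, avx_mathfun, etc. provide vectorized versions. */
    static __m256 sin_ps256(__m256 v)
    {
        float tmp[8];
        _mm256_storeu_ps(tmp, v);
        for (int i = 0; i < 8; i++)
            tmp[i] = sinf(tmp[i]);
        return _mm256_loadu_ps(tmp);
    }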
I have implemented Vecmathlib https://bitbucket.org/eschnett/vecmathlib/ as a generic library for two other projects (the Einstein Toolkit and pocl http://pocl.sourceforge.net/). Vecmathlib is open source and is written in C++.
Gromacs is a highly optimized molecular dynamics software package written in C++ that makes use of SIMD. As far as I know, the SIMD math functionality has not yet been split out into a separate library, but I guess the implementation might be useful for others nonetheless.
https://github.com/gromacs/gromacs/blob/master/src/gromacs/simd/simd_math.h
http://manual.gromacs.org/documentation/2016.4/doxygen/html-lib/simd__math_8h.xhtml

Linear Algebra library using OpenGL ES 2.0 for iOS

Does anyone know of a linear algebra library for iOS that uses OpenGL ES 2.0 under the covers?
Specifically, I am looking for a way to do matrix multiplication on arbitrary-sized matrices (e.g., much larger than 4x4, more like 5,000 x 100,000) using the GPUs on iOS devices.
Is there a specific reason you're asking for something that "uses OpenGL ES 2.0 under the covers"? Or do you just want a fast, hardware-optimized linear algebra library such as BLAS, which is built into iOS?
MetalPerformanceShaders.framework provides some tuned BLAS-like functions. It is not GLES; it is Metal and runs on the GPU. See MetalPerformanceShaders/MPSMatrixMultiplication.h.
OpenGL on iOS is probably the wrong way to go. Metal support on iOS would be the better way to go if you're going GPU.
Metal
You could use Apple's support for Metal compute shaders. I've written high-performance code for my PhD with it. An early experiment I made calculating some fractals using Metal might give you some ideas to start with.
Ultimately, this question is too broad. What do you intend to use the library for, and how do you intend to use it? Is it a one-off multiplication? Have you tested with current libraries and found the performance to be too slow? If so, by how much?
In general, you can run educational or purely informational experiments on performance of algorithm X on CPU vs. GPU vs. specialized hardware, but most often you run up against Amdahl's law and your code vs. a team of experts in the field.
Accelerate
You can also look into the Accelerate framework which offers BLAS.
Apple, according to the WWDC 2014 talk What's New in the Accelerate Framework, has hand-tuned the linear algebra libraries for their current-generation hardware. They aren't just fast, but energy efficient. There are newer talks as well.
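For reference, a minimal sketch of a matrix multiply through Accelerate's BLAS (cblas_sgemm is the standard single-precision GEMM entry point; the wrapper name and row-major layout here are just an example):

    #include <Accelerate/Accelerate.h>

    /* C = A * B in single precision: A is m x k, B is k x n, C is m x n,
       all row-major. */
    void matmul(const float *A, const float *B, float *C, int m, int n, int k)
    {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0f, A, k,     /* alpha, A, lda */
                          B, n,     /*        B, ldb */
                    0.0f, C, n);    /* beta,  C, ldc */
    }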

ipad2 neon floating point speed versus ipad1

When testing NEON instructions on the iPad 1 and iPad 2, I notice almost no speedup in the NEON code on the iPad 2, even though most other functions actually run much faster on the iPad 2 than on the iPad 1.
This is for instructions like VMUL, VLD1, VADD and VSUB, etc., using quad-word registers like q0, for things like FFT.
In addition, I notice that Apple's own FFT function vDSP_fft_zrip does not speed up on the iPad 2 either.
So the question is: does the iPad 2's NEON engine execute quad-word SIMD instructions faster than the iPad 1's NEON engine?
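For reference, the kind of quad-word operation being timed, written with intrinsics rather than hand assembly (an illustrative sketch; assumes arm_neon.h and a length that is a multiple of 4):

    #include <arm_neon.h>
    #include <stddef.h>

    /* dst[i] = a[i] * b[i] + c[i], 4 single-precision lanes per q register. */
    void mul_add_f32(float *dst, const float *a, const float *b,
                     const float *c, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);          /* VLD1 */
            float32x4_t vb = vld1q_f32(b + i);
            float32x4_t vc = vld1q_f32(c + i);
            vst1q_f32(dst + i, vaddq_f32(vmulq_f32(va, vb), vc)); /* VMUL + VADD */
        }
    }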
The NEON unit on the A4 was extraordinarily powerful compared to the rest of the core. The rest of the core on the A5 is much improved from A4, but the NEON unit's performance is more-or-less unchanged. What you are observing is expected.
Of course, there are now two cores, so if you can take advantage of both of them, you can still see significant speedups. Also, double-precision computation on the A5 is vastly improved from the A4, as it is now fully pipelined.
NEON will remain much the same for quite a while, even on the recently introduced 64-bit ARM.
NEON doesn't benefit much from increased clock speed; it is already so fast that it spends the majority of the function's execution time waiting for data from memory.
