Is long defined by system architecture, or is it an IEEE standard? - communication

Working on getting some communication of longs between computers and chips. I was running into some issues and thought it might be because the definition of long differs between system architectures (we're talking between 32-bit and 64-bit machines). Does anyone know if longs follow an IEEE standard (like floats and doubles), or if they vary based on system architecture (like ints)?

The type long is not covered by an IEEE standard; its size may vary between architectures. The C standard only guarantees that long is at least 32 bits; the exact size is up to the implementation and ABI. In C you can use the header stdint.h, which defines types like uint32_t, uint16_t, etc. that have fixed sizes. If your chip has its own C compiler, that should solve your problem.
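For example, here's a minimal sketch of the usual fix, assuming you control both ends of the wire: pin down both the width and the byte order explicitly, so it no longer matters how wide long is on either machine (the function names here are made up for illustration):

    #include <stdint.h>

    /* Hypothetical sketch: transmit a 32-bit value with a pinned width
     * (uint32_t) and a pinned byte order (little-endian here), so it no
     * longer matters how wide `long` is on either machine. */
    void put_u32le(uint8_t *buf, uint32_t v) {
        buf[0] = (uint8_t)(v);
        buf[1] = (uint8_t)(v >> 8);
        buf[2] = (uint8_t)(v >> 16);
        buf[3] = (uint8_t)(v >> 24);
    }

    uint32_t get_u32le(const uint8_t *buf) {
        return (uint32_t)buf[0]
             | ((uint32_t)buf[1] << 8)
             | ((uint32_t)buf[2] << 16)
             | ((uint32_t)buf[3] << 24);
    }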

Related

Are there any problems for which SIMD outperforms Cray-style vectors?

CPUs intended for high-performance number crunching end up with some kind of vector instruction set. There are basically two kinds:
SIMD. This is conceptually straightforward, e.g. instead of just having a set of 64-bit registers and operations thereon, you have a second set of 128-bit registers and can operate on a short vector of two 64-bit values at the same time. It becomes complicated in the implementation because you also want the option of operating on four 32-bit values, and then a new CPU generation provides 256-bit vectors, which requires a whole new set of instructions, etc.
The older Cray-style vector instructions, where the vectors start off large, e.g. 4096 bits, the number of elements operated on simultaneously is transparent, and the number of elements you want to use in a given operation is an instruction parameter. The idea is that you bite off a little more complexity upfront in order to avoid creeping complexity later.
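To make the contrast concrete, here is a sketch of c[i] = a[i] + b[i] in both styles. The option-1 code uses real AVX2 intrinsics; the option-2 part is only a comment sketch, written with ARM SVE intrinsics as a stand-in for the Cray style, since there the element count is a runtime parameter rather than baked into the instruction:

    #include <stddef.h>
    #include <immintrin.h>

    /* Option 1 (fixed-width SIMD, AVX2): exactly 8 floats per instruction,
     * so leftovers need a scalar tail, and 512-bit vectors need new code. */
    void add_avx2(float *c, const float *a, const float *b, size_t n) {
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
        }
        for (; i < n; i++)          /* scalar tail for the last n % 8 */
            c[i] = a[i] + b[i];
    }

    /* Option 2, Cray/SVE style (comment sketch only):
     *   for (size_t i = 0; i < n; i += svcntw()) {
     *       svbool_t pg = svwhilelt_b32(i, n);   // "how many elements" mask
     *       svfloat32_t va = svld1_f32(pg, a + i);
     *       svfloat32_t vb = svld1_f32(pg, b + i);
     *       svst1_f32(pg, c + i, svadd_f32_m(pg, va, vb));
     *   }
     * No tail loop, and the same binary runs at any hardware vector width. */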
It has been argued that option 2 is better, and the arguments seem to make sense, e.g. https://www.sigarch.org/simd-instructions-considered-harmful/
At least at first glance, it looks like option 2 can do everything option 1 can, more easily and generally better.
Are there any workloads where the reverse is true? Where SIMD instructions can do things Cray-style vectors cannot, or can do something faster or with less code?
The "traditional" vector approaches (Cray, CDC/ETA, NEC, etc) arose in an era (~1976 to ~1992) with limited transistor budgets and commercially available low-latency SRAM main memories. In this technology regime, processors did not have the transistor budget to implement the full scoreboarding and interlocking for out-of-order operations that is currently available to allow pipelining of multi-cycle floating-point operations. Instead, a vector instruction set was created. Vector arithmetic instructions guaranteed that successive operations within the vector were independent and could be pipelined. It was relatively easy to extend the hardware to allow multiple vector operations in parallel, since the dependency checking only needed to be done "per vector" instead of "per element".
The Cray ISA was RISC-like in that data was loaded from memory into vector registers, arithmetic was performed register-to-register, then results were stored from vector registers back to memory. The maximum vector length was initially 64 elements, later 128 elements.
The CDC/ETA systems used a "memory-to-memory" architecture, with arithmetic instructions specifying memory locations for all inputs and outputs, along with a vector length of 1 to 65535 elements.
None of the "traditional" vector machines used data caches for vector operations, so performance was limited by the rate at which data could be loaded from memory. The SRAM main memories were a major fraction of the cost of the systems. In the early 1990's SRAM cost/bit was only about 2x that of DRAM, but DRAM prices dropped so rapidly that by 2002 SRAM price/MiB was 75x that of DRAM -- no longer even remotely acceptable.
The SRAM memories of the traditional machines were word-addressable (64-bit words) and were very heavily banked to allow nearly full speed for linear, strided (as long as powers of two were avoided), and random accesses. This led to a programming style that made extensive use of non-unit-stride memory access patterns. These access patterns cause performance problems on cached machines, and over time developers using cached systems quit using them -- so codes were less able to exploit this capability of the vector systems.
As codes were being re-written to use cached systems, it slowly became clear that caches work quite well for the majority of the applications that had been running on the vector machines. Re-use of cached data decreased the amount of memory bandwidth required, so applications ran much better on the microprocessor-based systems than expected from the main memory bandwidth ratios.
By the late 1990's, the market for traditional vector machines was nearly gone, with workloads transitioned primarily to shared-memory machines using RISC processors and multi-level cache hierarchies. A few government-subsidized vector systems were developed (especially in Japan), but these had little impact on high performance computing, and none on computing in general.
The story is not over -- after many not-very-successful tries (by several vendors) at getting vectors and caches to work well together, NEC has developed a very interesting system (NEC SX-Aurora Tsubasa) that combines a multicore vector register processor design with DRAM (HBM) main memory, and an effective shared cache. I especially like the ability to generate over 300 GB/s of memory bandwidth using a single thread of execution -- this is 10x-25x the bandwidth available with a single thread with AMD or Intel processors.
So the answer is that the low cost of microprocessors with cached memory drove vector machines out of the marketplace even before SIMD was included. SIMD had clear advantages for certain specialized operations, and has become more general over time -- albeit with diminishing benefits as the SIMD width is increased. The vector approach is not dead in an architectural sense (e.g., the NEC Vector Engine), but its advantages are generally considered to be overwhelmed by the disadvantages of software incompatibility with the dominant architectural model.
Cray-style vectors are great for pure-vertical problems, the kind of problem that some people think SIMD is limited to. They make your code forward compatible with future CPUs with wider vectors.
I've never worked with Cray-style vectors, so I don't know how much scope there might be for getting them to do horizontal shuffles.
If you don't limit things to Cray specifically, modern instruction sets like ARM SVE and the RISC-V V extension also give you forward-compatible code with variable vector width, and are clearly designed to avoid the problems of short fixed-width SIMD ISAs like AVX2, AVX-512, and ARM NEON.
I think they have some shuffling capability. Definitely masking, but I'm not familiar enough with them to know if they can do stuff like left-pack (AVX2 what is the most efficient way to pack left based on a mask?) or prefix-sum (parallel prefix (cumulative) sum with SSE).
And then there are problems where you're working with a small fixed amount of data at a time, but more than fits in an integer register. For example, How to convert a binary integer number to a hex string?, although that's still basically doing the same stuff to every element after some initial broadcasting.
But other stuff like Most insanely fastest way to convert 9 char digits into an int or unsigned int where a one-off custom shuffle and horizontal pairwise multiply can get just the right work done with a few single-uop instructions is something that requires tight integration between SIMD and integer parts of the core (as on x86 CPUs) for maximum performance. Using the SIMD part for what it's good at, then getting the low two 32-bit elements of a vector into an integer register for the rest of the work. Part of the Cray model is (I think) a looser coupling to the CPU pipeline; that would defeat use-cases like that. Although some 32-bit ARM CPUs with NEON have the same loose coupling where mov from vector to integer is slow.
Parsing text in general, and atoi, is one use-case where short vectors with shuffle capabilities are effective. e.g. https://www.phoronix.com/scan.php?page=article&item=simdjson-avx-512&num=1 - 25% to 40% speedup from AVX-512 with simdjson 2.0 for parsing JSON, over the already-fast performance of AVX2 SIMD. (See How to implement atoi using SIMD? for a 2016 Q&A on that.)
Many of those tricks depend on x86-specific pmovmskb eax, xmm0 for getting an integer bitmap of a vector compare result. You can test whether it's all-zero or all-ones (cmp eax, 0xffff) to stay in the main loop of a memcmp or memchr, for example. If not, bsf eax,eax finds the position of the first difference, possibly after a not.
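As a concrete sketch of that trick, here it is written with the SSE2 intrinsics that compile to pcmpeqb / pmovmskb (the function name and the simplifying assumptions are mine):

    #include <stddef.h>
    #include <emmintrin.h>   /* SSE2 */

    /* Sketch: find the first byte equal to `needle`. Assumes n is a
     * multiple of 16; a real memchr also needs alignment/tail handling. */
    const char *find_byte(const char *p, size_t n, char needle) {
        __m128i vneedle = _mm_set1_epi8(needle);
        for (size_t i = 0; i < n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
            /* pcmpeqb -> 0xFF in matching lanes; pmovmskb -> 16-bit mask */
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, vneedle));
            if (mask != 0)
                return p + i + __builtin_ctz(mask);  /* like bsf */
        }
        return NULL;
    }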
Having vector width limited to a number of elements that can fit in an integer register is key to this, although you could imagine an instruction-set with compare-into-mask with scalable width mask registers. (Perhaps ARM SVE is already like that? I'm not sure.)

How was integer divide pervasively used in Apple programming?

I was reading an interesting interview with Chris Lattner, author of LLVM and Swift, and noticed a very curious claim:
Other things are that Apple does periodically add new instructions [11:00] to its CPUs. One example of this historically was the hilariously named “Swift” chip that it launched which was the first designed in-house 32-bit ARM chip. This was the iPhone 5, if I recall.
In this chip, they added an integer-divide instruction. All of the chips before that didn’t have the ability to integer-divide in hardware: you had to actually open-code it, and there was a library function to do that. [11:30] That, and a few other instructions they added, were a pretty big deal and used pervasively
Now that is surprising. As I understand it, integer divide is almost never needed. The cases I've seen where it could be used, fall into a few categories:
Dividing by a power of two. Shift right instead.
Dividing by an integer constant. Multiply by the reciprocal instead. (It's counterintuitive but true that this works in all cases; see the sketch after this list.)
Fixed point as an approximation of real number arithmetic. Use floating point instead.
Fixed point, for multiplayer games that run peer-to-peer across computers with different CPU architectures and need each computer to agree on the results down to the last bit. Okay, but as I understand it, multiplayer games on iPhone don't use that kind of peer-to-peer design.
Rendering 3D textures. Do that on the GPU instead.
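For the second bullet, here is a minimal sketch of the multiply-by-reciprocal trick, using the well-known magic constant for unsigned division by 10 (the function name is made up for illustration):

    #include <stdint.h>

    /* Unsigned divide by the constant 10 with no divide instruction:
     * 0xCCCCCCCD is ceil(2^35 / 10), so the high bits of the 64-bit
     * product hold the quotient. Compilers emit exactly this for n / 10. */
    static uint32_t div10(uint32_t n) {
        return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
    }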
After floating point became available in hardware, I've never seen a workload that needed to do integer divide with significant frequency.
What am I missing? What was integer divide used for on Apple devices, so frequently that it was considered worth adding as a CPU instruction?

Addressing on 32-bit and 64-bit architectures

If we have a 32bit or 64bit computer architecture, does it mean that we can address 2^32 or 2^64 byte sized memory locations?
Do we always relate addressing with the smallest memory unit byte?
No, and no. (But on "normal" CPUs, sort of and yes.)
x86-64 is a 64-bit architecture with 64-bit integer registers, but it only has 48-bit virtual address space (extensible to 64 in the future if needed). Pointers are 64-bit integers, but the top 16 bits have to be the sign-extension of the low 48 bits. The initial implementations (AMD K8) only supported 40-bit physical addresses. The page-table format allows up to 52-bit physical addresses, but current implementations only support 48-bit (256TiB of RAM).
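As a sketch of what that sign-extension rule means, here is a hypothetical checker for canonical 48-bit addresses (shift the low 48 bits up, arithmetic-shift back down, and compare):

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical checker: an x86-64 address is canonical if its top
     * 16 bits are the sign-extension of bit 47 of the low 48 bits. */
    static bool is_canonical(uint64_t addr) {
        uint64_t sext = (uint64_t)((int64_t)(addr << 16) >> 16);
        return sext == addr;
    }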
32-bit x86 has segmentation as well as paging, so it can address 4GiB from each of cs, ss, ds, es, fs, and gs. (This is super inconvenient, so all the major OSes just use a flat 4GiB memory space).
Most 8-bit CPUs could use a pair of 8-bit registers as a 16-bit pointer. AVR (8-bit RISC microcontroller with 32 8-bit registers) still does that.
The terms 32-bit or 64-bit architecture are pretty fuzzy. Does it mean register size, data bus width, pointer width? People (sometimes including the marketing department) call things whatever they want. Skylake-avx512 has a 512-bit path between execution units and L1D cache (for vector loads/stores). So is it a 512-bit SIMD architecture? Yes, I guess so. It's also a 64-bit architecture. AMD Ryzen splits 256b AVX instructions into two 128b halves. Is it a 256-bit SIMD architecture? Sure, why not.
Architecturally, the registers are 256b wide. Is it also a 128b SIMD architecture? Yes.
Most modern architectures are byte-addressable, but some (like maybe DSPs?) are only word-addressable: each address refers to a whole machine word (e.g. 36 bits on some weird CPUs). In that case, 32-bit addresses would get you 4GiB * bytes-per-word of address space, but you'd have to load plus shift-and-mask to use it as that many separate bytes.
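A sketch of that shift-and-mask step, assuming 32-bit words and written in plain C purely for illustration (a real word-addressed target would be doing the word load itself):

    #include <stdint.h>

    /* Byte k of a 32-bit word, LSB-first: the shift-and-mask step you'd
     * do after each word load on a word-addressed machine. */
    static uint8_t byte_of_word(uint32_t word, unsigned k) {
        return (uint8_t)((word >> (8 * k)) & 0xFF);   /* k = 0..3 */
    }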

SIMD math libraries for SSE and AVX

I am looking for SIMD math libraries (preferably open source) for SSE and AVX. I mean, for example, if I have an AVX register v with 8 float values, I want sin(v) to return the sine of all eight values at once.
AMD has a proprietary library, LibM http://developer.amd.com/tools/cpu-development/libm/, which has some SIMD math functions, but LibM only uses AVX if it detects FMA4, which Intel CPUs don't have. Also, I'm not sure it fully uses AVX, as all the function names end in s4 (d2) and not s8 (d4). It gives better performance than the standard math libraries on Intel CPUs, but not by much.
Intel has the SVML as part of its C++ compiler, but the compiler suite is very expensive on Windows. Additionally, Intel cripples the library on non-Intel CPUs.
I found the following AVX library, http://software-lisc.fbk.eu/avx_mathfun/, which supports a few math functions (exp, log, sin, cos, and sincos). It gives very fast results for me, faster than SVML, but I have not checked the accuracy. It only works on single-precision floats and does not compile in Visual Studio (though that would be easy to fix). It's based on an earlier SSE library.
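For reference, a usage sketch assuming avx_mathfun's own names (v8sf is its typedef for __m256 and sin256_ps its 8-wide sine; check the header for the exact spellings):

    #include <immintrin.h>
    #include "avx_mathfun.h"   /* assumed header name from that project */

    /* Compute the sine of 8 packed floats at once (names assumed from
     * avx_mathfun: v8sf is a typedef for __m256, sin256_ps its sine). */
    void sin8(float *out, const float *in) {
        v8sf x = _mm256_loadu_ps(in);
        v8sf s = sin256_ps(x);
        _mm256_storeu_ps(out, s);
    }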
Does anyone have any other suggestions?
Edit: I found a SO thread that has many answers on this subject
Vectorized Trig functions in C?
I have implemented Vecmathlib https://bitbucket.org/eschnett/vecmathlib/ as a generic library for two other projects (the Einstein Toolkit, and pocl http://pocl.sourceforge.net/). Vecmathlib is open source and written in C++.
Gromacs is a highly optimized molecular dynamics software package written in C++ that makes use of SIMD. As far as I know, the SIMD math functionality has not yet been split out into a separate library, but I guess the implementation might be useful to others nonetheless.
https://github.com/gromacs/gromacs/blob/master/src/gromacs/simd/simd_math.h
http://manual.gromacs.org/documentation/2016.4/doxygen/html-lib/simd__math_8h.xhtml

How do I Perform Integer SIMD operations on the iPad A4 Processor?

I feel the need for speed. Double for loops are killing my iPad app's performance. I need SIMD. How do I perform integer SIMD operations on the iPad A4 processor?
Thanks,
Doug
The instruction set is NEON; see the intrinsics reference.
I've never been able to find good documentation on what they all actually are, but you pick it up pretty quickly if you've had any exposure to SSE.
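A minimal sketch of what NEON integer SIMD looks like from C, using intrinsics from arm_neon.h (the function name and the multiple-of-8 assumption are mine):

    #include <stdint.h>
    #include <arm_neon.h>

    /* Add two arrays of 16-bit ints, 8 lanes per instruction.
     * Assumes n is a multiple of 8 (my simplification). */
    void add_s16(int16_t *c, const int16_t *a, const int16_t *b, int n) {
        for (int i = 0; i < n; i += 8) {
            int16x8_t va = vld1q_s16(a + i);      /* load 8 lanes */
            int16x8_t vb = vld1q_s16(b + i);
            vst1q_s16(c + i, vaddq_s16(va, vb));  /* 8 adds at once */
        }
    }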
To get the fastest speed, you will have to write ARM assembly code that uses NEON SIMD operations, because C compilers generally don't generate very good SIMD code, so hand-written assembly can make a big difference. I have a brief intro here: http://www.shervinemami.co.cc/iphoneAssembly.html
Note that the iPad A4 uses an ARMv7-A CPU, so the reference manual for the NEON SIMD instructions is at: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406b/index.html
(but it's 2000 pages long and requires an understanding of assembly code, and perhaps of SIMD in general!).
