Addressing on 32-bit and 64-bit architectures - memory

If we have a 32-bit or 64-bit computer architecture, does that mean we can address 2^32 or 2^64 byte-sized memory locations?
Do we always relate addressing to the smallest memory unit, the byte?

No, and no. (But on "normal" CPUs, sort of and yes.)
x86-64 is a 64-bit architecture with 64-bit integer registers, but it only has 48-bit virtual address space (extensible to 64 in the future if needed). Pointers are 64-bit integers, but the top 16 bits have to be the sign-extension of the low 48 bits. The initial implementations (AMD K8) only supported 40-bit physical addresses. The page-table format allows up to 52-bit physical addresses, but current implementations only support 48-bit (256TiB of RAM).
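A minimal sketch of what that sign-extension rule means in practice (my illustration in C, not part of the original answer; it assumes the 48-bit case described above): bits 63..47 of a valid, "canonical" address must be all zeros or all ones.

#include <stdbool.h>
#include <stdint.h>

/* Canonical-address check for 48-bit x86-64 virtual addresses: the top 16
 * bits must equal bit 47, i.e. bits 63..47 are either all 0 or all 1. */
static bool is_canonical_48(uint64_t addr) {
    uint64_t top17 = addr >> 47;            /* bits 63..47 */
    return top17 == 0 || top17 == 0x1FFFF;  /* 17 zeros or 17 ones */
}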
32-bit x86 has segmentation as well as paging, so it can address 4GiB from each of cs, ss, ds, es, fs, and gs. (This is super inconvenient, so all the major OSes just use a flat 4GiB memory space).
Most 8-bit CPUs could use a pair of 8-bit registers as a 16-bit pointer. AVR (8-bit RISC microcontroller with 32 8-bit registers) still does that.
The terms 32-bit or 64-bit architecture are pretty fuzzy. Does it mean register size, data bus width, pointer width? People (sometimes including the marketing department) call things whatever they want. Skylake-avx512 has a 512-bit path between execution units and L1D cache (for vector loads/stores). So is it a 512-bit SIMD architecture? Yes, I guess so. It's also a 64-bit architecture. AMD Ryzen splits 256b AVX instructions into two 128b halves. Is it a 256-bit SIMD architecture? Sure, why not.
Architecturally, the registers are 256b wide. Is it also a 128b SIMD architecture? Yes.
Most modern architectures are byte-addressable, but some (like some DSPs) are only word-addressable: each address refers to a whole machine word (e.g. 36 bits on some unusual CPUs). In that case, 32-bit addresses would get you 4GiB * bytes-per-word of address space, but you'd have to load, then shift & mask, to use it as that many separate bytes.
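For instance (a sketch I'm adding, not from the original answer, assuming a hypothetical word-addressable machine with 32-bit words and big-endian byte order), pulling a single byte out of memory takes a word load plus a shift and a mask:

#include <stdint.h>

/* Hypothetical word-addressable machine: memory is an array of 32-bit words,
 * so "byte i" only exists as a field inside word i/4. Software has to
 * extract it with a load, a shift, and a mask. */
static uint8_t load_byte(const uint32_t *mem, uint32_t byte_index) {
    uint32_t word  = mem[byte_index / 4];          /* word-granular load */
    uint32_t shift = (3 - (byte_index % 4)) * 8;   /* byte position (big-endian) */
    return (uint8_t)((word >> shift) & 0xFF);
}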

Related

Are there any problems for which SIMD outperforms Cray-style vectors?

CPUs intended for high-performance number crunching end up with some kind of vector instruction set. There are basically two kinds:
SIMD. This is conceptually straightforward, e.g. instead of just having a set of 64-bit registers and operations thereon, you have a second set of 128-bit registers and you can operate on a short vector of two 64-bit values at the same time. It becomes complicated in the implementation because you also want to have the option of operating on four 32-bit values, and then a new CPU generation provides 256-bit vectors which requires a whole new set of instructions etc.
The older Cray-style vector instructions, where the vectors start off large e.g. 4096 bits, but the number of elements operated on simultaneously is transparent, and the number of elements you want to use in a given operation is an instruction parameter. The idea is that you bite off a little more complexity upfront, in order to avoid creeping complexity later.
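To make the contrast concrete, here is a plain-C sketch of the strip-mining style that option 2 encourages (my own illustration, not real vector code; VLMAX and set_vl stand in for the hardware's maximum vector length and its "set vector length" instruction). The point is that the same loop runs unchanged on machines with different vector widths:

#include <stddef.h>

#define VLMAX 64   /* stand-in for whatever the hardware provides */

/* Models a "set vector length" instruction: request up to VLMAX elements,
 * get back how many will actually be processed this iteration. */
static size_t set_vl(size_t remaining) {
    return remaining < VLMAX ? remaining : VLMAX;
}

void vec_add(double *c, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = set_vl(n - i);          /* elements this trip */
        for (size_t j = 0; j < vl; ++j)     /* conceptually one vector op */
            c[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}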
It has been argued that option 2 is better, and the arguments seem to make sense, e.g. https://www.sigarch.org/simd-instructions-considered-harmful/
At least at first glance, it looks like option 2 can do everything option 1 can, more easily and generally better.
Are there any workloads where the reverse is true? Where SIMD instructions can do things Cray-style vectors cannot, or can do something faster or with less code?
The "traditional" vector approaches (Cray, CDC/ETA, NEC, etc) arose in an era (~1976 to ~1992) with limited transistor budgets and commercially available low-latency SRAM main memories. In this technology regime, processors did not have the transistor budget to implement the full scoreboarding and interlocking for out-of-order operations that is currently available to allow pipelining of multi-cycle floating-point operations. Instead, a vector instruction set was created. Vector arithmetic instructions guaranteed that successive operations within the vector were independent and could be pipelined. It was relatively easy to extend the hardware to allow multiple vector operations in parallel, since the dependency checking only needed to be done "per vector" instead of "per element".
The Cray ISA was RISC-like in that data was loaded from memory into vector registers, arithmetic was performed register-to-register, then results were stored from vector registers back to memory. The maximum vector length was initially 64 elements, later 128 elements.
The CDC/ETA systems used a "memory-to-memory" architecture, with arithmetic instructions specifying memory locations for all inputs and outputs, along with a vector length of 1 to 65535 elements.
None of the "traditional" vector machines used data caches for vector operations, so performance was limited by the rate at which data could be loaded from memory. The SRAM main memories were a major fraction of the cost of the systems. In the early 1990's SRAM cost/bit was only about 2x that of DRAM, but DRAM prices dropped so rapidly that by 2002 SRAM price/MiB was 75x that of DRAM -- no longer even remotely acceptable.
The SRAM memories of the traditional machines were word-addressable (64-bit words) and were very heavily banked to allow nearly full speed for linear, strided (as long as powers of two were avoided), and random accesses. This led to a programming style that made extensive use of non-unit-stride memory access patterns. These access patterns cause performance problems on cached machines, and over time developers using cached systems quit using them -- so codes were less able to exploit this capability of the vector systems.
As codes were being re-written to use cached systems, it slowly became clear that caches work quite well for the majority of the applications that had been running on the vector machines. Re-use of cached data decreased the amount of memory bandwidth required, so applications ran much better on the microprocessor-based systems than expected from the main memory bandwidth ratios.
By the late 1990's, the market for traditional vector machines was nearly gone, with workloads transitioned primarily to shared-memory machines using RISC processors and multi-level cache hierarchies. A few government-subsidized vector systems were developed (especially in Japan), but these had little impact on high performance computing, and none on computing in general.
The story is not over -- after many not-very-successful tries (by several vendors) at getting vectors and caches to work well together, NEC has developed a very interesting system (NEC SX-Aurora Tsubasa) that combines a multicore vector register processor design with DRAM (HBM) main memory, and an effective shared cache. I especially like the ability to generate over 300 GB/s of memory bandwidth using a single thread of execution -- this is 10x-25x the bandwidth available with a single thread with AMD or Intel processors.
So the answer is that the low cost of microprocessors with cached memory drove vector machines out of the marketplace even before SIMD was included. SIMD had clear advantages for certain specialized operations, and has become more general over time -- albeit with diminishing benefits as the SIMD width is increased. The vector approach is not dead in an architectural sense (e.g., the NEC Vector Engine), but its advantages are generally considered to be overwhelmed by the disadvantages of software incompatibility with the dominant architectural model.
Cray-style vectors are great for pure-vertical problems, the kind of problem that some people think SIMD is limited to. They make your code forward compatible with future CPUs with wider vectors.
I've never worked with Cray-style vectors, so I don't know how much scope there might be for getting them to do horizontal shuffles.
If you don't limit things to Cray specifically, modern instruction-sets like ARM SVE and RISC-V extension V also give you forward-compatible code with variable vector width, and are clearly designed to avoid that problem of short-fixed-vector SIMD ISAs like AVX2 and AVX-512, and ARM NEON.
I think they have some shuffling capability. Definitely masking, but I'm not familiar enough with them to know if they can do stuff like left-pack (AVX2 what is the most efficient way to pack left based on a mask?) or prefix-sum (parallel prefix (cumulative) sum with SSE).
And then there are problems where you're working with a small fixed amount of data at a time, but more than fits in an integer register. For example How to convert a binary integer number to a hex string? although that's still basically doing the same stuff to every element after some initial broadcasting.
But other stuff, like Most insanely fastest way to convert 9 char digits into an int or unsigned int, where a one-off custom shuffle and horizontal pairwise multiply can get just the right work done with a few single-uop instructions, requires tight integration between the SIMD and integer parts of the core (as on x86 CPUs) for maximum performance: use the SIMD part for what it's good at, then get the low two 32-bit elements of a vector into an integer register for the rest of the work. Part of the Cray model is (I think) a looser coupling to the CPU pipeline; that would defeat use-cases like that. Although some 32-bit ARM CPUs with NEON have the same loose coupling, where mov from vector to integer is slow.
Parsing text in general, and atoi, is one use-case where short vectors with shuffle capabilities are effective. e.g. https://www.phoronix.com/scan.php?page=article&item=simdjson-avx-512&num=1 - 25% to 40% speedup from AVX-512 with simdjson 2.0 for parsing JSON, over the already-fast performance of AVX2 SIMD. (See How to implement atoi using SIMD? for a Q&A about using SIMD for JSON back in 2016).
Many of those tricks depend on the x86-specific pmovmskb eax, xmm0 for getting an integer bitmap of a vector compare result. You can test whether it's all-zero or all-one (cmp eax, 0xffff) to stay in the main loop of a memcmp or memchr, for example. If not, bsf eax,eax finds the position of the first difference, possibly after a not.
Having vector width limited to a number of elements that can fit in an integer register is key to this, although you could imagine an instruction-set with compare-into-mask with scalable width mask registers. (Perhaps ARM SVE is already like that? I'm not sure.)
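A sketch of that pattern with SSE2 intrinsics (my illustration of the idea, not code from any of the linked answers; __builtin_ctz assumes GCC or Clang): compare 16 bytes at a time, collapse the compare result into a 16-bit integer mask with _mm_movemask_epi8 (pmovmskb), and only leave the fast loop when the mask is non-zero.

#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>

/* memchr-style scan; assumes haystack is 16-byte aligned and n is a
 * multiple of 16 to keep the sketch short. Returns the index of the first
 * matching byte, or n if there is none. */
static size_t find_byte(const unsigned char *haystack, size_t n, unsigned char needle) {
    __m128i v_needle = _mm_set1_epi8((char)needle);
    for (size_t i = 0; i < n; i += 16) {
        __m128i chunk = _mm_load_si128((const __m128i *)(haystack + i));
        __m128i eq    = _mm_cmpeq_epi8(chunk, v_needle);
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);  /* pmovmskb */
        if (mask != 0)                                    /* a match somewhere in this chunk */
            return i + (size_t)__builtin_ctz(mask);       /* bsf/tzcnt: first match position */
    }
    return n;
}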

How many instructions could fit into a 256-byte memory unit, given a 32-bit architecture?

Could someone explain how you get this answer? I know each instruction is 32 bits. I know the answer is 64, but I don't know how to get that answer.
1 instruction = 32 bits = 4 bytes.
Then 256/4 = 64.
What I would like explained, if possible, is what difference it makes that it is a 32-bit architecture. That refers to the instruction, right? In other words, if it said a 64-bit architecture, would that mean each instruction is 64 bits = 8 bytes, so only 32 instructions would fit?
Am I correct?
Thanks!
A 32-bit architecture does not necessarily use 32-bit instructions. The "32-bit" only tells you that memory addresses are 32 bits long, so 2^32 addresses are possible (2^32 bytes = 4GiB), and that registers are correspondingly 32 bits wide.
Instructions can have variable length depending on your architecture.
For example, if you have an x86 architecture (16-, 32-, or 64-bit), instructions have variable length. The encoding scheme itself does not impose a hard upper bound, but in the real world instructions on x86 machines are limited to 15 bytes.
If the 15-byte limit applies, the minimum number of instructions that can be stored in a 256-byte memory unit is ⌊256/15⌋ = 17 instructions.
The shortest possible instructions on an x86 machine are just 1 byte long, so up to 256 instructions would fit. (x86 Wikipedia article)
If you are using a RISC architecture (reduced instruction set), instructions often have a fixed length. For example, the ARM architecture originally used only 32-bit instructions, so 256/4 = 64 instructions would fit into your memory unit. 32-bit ARM later added 16-bit Thumb instructions to increase code density (64-bit AArch64 returned to fixed 32-bit instructions), so up to 256/2 = 128 instructions could theoretically fit in your memory unit. (ARM Wikipedia article)

1MiB = 1024KiB = 2^20 bytes. Nonetheless, why not just use 1000 instead of 1024 to calculate sizes? [closed]

1024 = 2 to the power of 10. Computers use the binary system, where the base is 2 (0 and 1). Humans use the decimal system, where the base is 10. So if I have 1 byte, which contains 8 bits in modern computers, I can represent up to 256 different states, possibilities, values or such. 10 bits can represent 1024 states. Well... so what? What does it have to do with memory? I think memory size is about the number of bits/bytes, not about the number of states that bits and bytes can represent. I'm confused. What is the technical benefit of treating 1K(i)B as 1024 rather than 1000 bytes?
I think I need a more technical explanation, maybe something related to how the CPU works or how data is actually stored on a hard drive. Not just: hey, computers use binary, so we use 2^10 and not 10^3.
What does it have to do with memory?
It has to do with memory addressing, which is done using binary numbers as well.
On a very high level, a typical memory chip works like this: it has pins of three types - address pins, data pins, and control pins. When the CPU needs to read or write memory, it sets up a combination of zeros and ones on the address pins of a memory chip, sends a read or write signal to the control pins, waits a little, and then reads or writes data using the data pins.
The combination of zeros and ones placed on the address pins is called the memory address. It is a binary number in the range from zero to 2^n - 1, where n is the number of address pins.
This is how the powers of two get into measuring the capacity of memory. Conveniently, 1024 was very near 1000, so "K" was borrowed to mean 1024 when talking about memory size.
Note that measuring data sizes using binary multiples is not universal. For example, capacities of hard drives are often quoted using decimal, not binary multiples, because hard drives do not inherently use binary addressing (and the number of decimal gigabytes is higher, which helps marketing the product).
What does it have to do with memory?
Obviously all these sizes are powers of 2. The reason it's 1024 and 1024^2 is that this is an exact description of what you get.
It's easy to address memory when you use the building blocks of a computer. These building blocks are address pins, and each additional pin doubles the number of addressable locations, so capacities naturally come in powers of 2.
That said, we can also just call it '1000'. It's just less exact in most cases, and therefore makes less sense in the general case. There are exceptions though: most hard disks actually use powers of 10 for their capacity. You notice that when you put one in your computer and the '24' suddenly starts to make a difference. :-)
Memory management is historically done in powers of two because multiplication and division by powers of two can be performed using bit shifts. If the system had 1024 pages, the hardware could identify the page simply by extracting (or shifting) bits.
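For example (a sketch I'm adding, assuming 4KiB pages, i.e. a 12-bit page offset), splitting an address into page number and offset is just a shift and a mask, with no division:

#include <stdint.h>

/* With 4KiB (2^12-byte) pages, the low 12 bits of an address are the offset
 * within the page and the remaining bits are the page number. Power-of-two
 * sizes make this a shift and a mask instead of a divide and a modulo. */
#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)   /* 0xFFF */

static uint32_t page_number(uint32_t addr) { return addr >> PAGE_SHIFT; }
static uint32_t page_offset(uint32_t addr) { return addr & PAGE_MASK; }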
In disk drives, 1K is 1000 bytes. This was done for marketing purposes. Disk drive manufacturers could advertise slightly larger capacities. This is true even though drives work with blocks whose sizes are powers of 2.

Is long defined by system architecture, or is it an IEEE standard?

Working on getting some communication of longs between computers and chips. I was running into some issues and thought it might be because the definition of long differs between system architectures (we're talking between 32-bit and 64-bit machines). Does anyone know if longs are covered by an IEEE standard (like floats and doubles), or if they vary based on system architecture (like ints)?
The type long is not covered by an IEEE standard. Its size may vary between different architectures. In C you can use the header stdint.h, which defines types like uint32_t, uint16_t, etc. that have a fixed size. If your chip has its own C compiler, that should solve your problem.
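A minimal sketch of that advice (the variable names and values are mine, just for illustration): convert to a fixed-width type before sending, so a long on a 64-bit PC and a long on a 32-bit chip cannot disagree about the field size.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    long    native = 123456789L;        /* 4 or 8 bytes, depending on platform and ABI */
    int32_t wire   = (int32_t)native;   /* exactly 4 bytes wherever stdint.h is available */

    printf("sizeof(long)    = %zu bytes\n", sizeof native);
    printf("sizeof(int32_t) = %zu bytes\n", sizeof wire);
    return 0;
}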

in what cases memory is byte or word addressable and why

Memory may be byte addressable or word (2 bytes, 4 bytes, etc.) addressable (please correct me if I am wrong here).
Does this (being byte addressable or word addressable) depend on the processor architecture?
If yes, in what cases we go for byte addressable memory and in what cases we go for word addressable memory?
And what are the reasons for that? In other words, why is memory byte-addressable in the cases where it is, and word-addressable in the cases where it is? I saw a few questions on byte-addressable memory on this site, but none provided answers to these questions.
Obviously, different software needs to operate with data/variables of different types and sizes, and oftentimes there's a need to operate with several different ones in the same code.
Being able to access those variables directly and as a whole, irrespective of their size, simplifies programming: you don't need to glue together, say, four 8-bit bytes to form a 32-bit value, or similarly extract individual 8-bit values from a 32-bit memory location.
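For instance (my sketch, not from the original answer), gluing four 8-bit bytes into one 32-bit value by hand looks like this; byte-addressable hardware spares you this kind of code for the common cases:

#include <stdint.h>

/* Assemble a 32-bit value from four bytes "by hand" (little-endian order
 * chosen arbitrarily for the sketch). */
static uint32_t pack4(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3) {
    return  (uint32_t)b0
         | ((uint32_t)b1 << 8)
         | ((uint32_t)b2 << 16)
         | ((uint32_t)b3 << 24);
}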
There exist processors that aren't very flexible in terms of the natively supported data sizes, for example fixed-point digital signal processors. Some can only directly access memory as 16-bit words and 32-bit double words. I think the absence of 8-bit byte addressing in them isn't a big problem because they are expected to be doing lots of signal processing rather than being versatile, general-purpose computers, and signal samples are rarely 8-bit (that's too coarse); most often they're 16-bit.
Supporting fewer data sizes and other features in hardware makes this hardware simpler and cheaper (including in terms of consumed energy), which becomes important if we're talking about thousands and millions of devices.
Different problems need different solutions, hence the variety.

Resources