Does word length == number of bits transferred between memory and CPU per access?

I am really confused about the concept of the "word length".
I know that on a 32-bit machine, a memory address has 32 bits, and each memory access transfers 32 bits (4 bytes) to the CPU.
On a 64-bit machine, an address has 64 bits. But does that mean the memory access unit is also 64 bits?
In this answer, the author says "Word: The natural size with which a processor is handling data (the register size)". But it does not explicitly specify how many bits are transferred between memory and CPU per memory access.

In a CPU with a cache, data usually transfers between CPU and memory only a whole cache line at a time. For example, on a modern x86, a 1-byte load that hits in cache produces no external memory access at all.
If it misses even in the last-level cache, the memory chips see a request for the 64B-aligned block containing that byte.
Modern x86 CPUs have 16B or even 32B (256-bit) data paths between cache and execution units.
See also other links in the x86 tag wiki to learn more.
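On Linux with glibc you can query this transfer granularity directly; a minimal sketch (note that _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, not POSIX, and may report 0 on some systems):

    #include <stdio.h>
    #include <unistd.h>

    /* Query the L1 data cache line size, i.e. the granularity at which
     * data moves between cache and memory as described above. */
    int main(void)
    {
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        printf("L1D cache line: %ld bytes\n", line); /* typically 64 on x86 */
        return 0;
    }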

It does not say that because it does not mean that. Addresses are often 64 bits on a 64-bit machine, but not always. Data paths are often 64 bits on a 64-bit machine, but not always.
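One way to see the "not always" for yourself is to print what a given compiler/ABI actually picks; a minimal C sketch:

    #include <stdio.h>

    /* On a typical LP64 Linux/x86-64 build this prints 8/8/4; on a 32-bit
     * build, or on LLP64 Windows (where long is 4 bytes), the answers
     * differ -- pointer size and data sizes need not match. */
    int main(void)
    {
        printf("sizeof(void *) = %zu\n", sizeof(void *));
        printf("sizeof(long)   = %zu\n", sizeof(long));
        printf("sizeof(int)    = %zu\n", sizeof(int));
        return 0;
    }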

Related

Can we use SSE intrinsics to write to a memory mapped PCI device memory

I have a use case where the x86 CPU has to write 64 bytes of data to a PCIe slave device whose memory has been mmap()ed into user space. For now, I use memcpy to do that, but it turns out to be very slow. Can we use Intel SSE intrinsics like _mm_stream_si128 to speed it up? Or any other mechanism besides DMA?
The objective is to pack all 64 bytes into one TLP and send it over the PCIe bus to reduce the overhead.
As I understand it, memory-mapped I/O doesn't make certain store instructions special. An 8B store from movq mem, xmm is the same as the store from mov mem, r64.
I think if you have 64B to write into MMIO, you should write it with whatever instructions do it most efficiently as it's generated, then flush the cache line. Generating a 64B buffer and then doing a memcpy (or doing it yourself with four movdqa, or two AVX vmovdqa) is a waste of time, unless you expect the code that generates the 64B to be slow and more likely to be interrupted part-way through than memcpy. A timer interrupt can come in at any time, including during your memcpy, if you're in user space where you can't disable interrupts. Since you can't guarantee complete 64B writes either way, a 99.99% chance of a full cache-line write vs. a 99.99999% chance probably won't make a difference.
Streaming stores to the MMIO region might avoid the CPU doing a read-for-ownership after the clflush from the previous write. clwb isn't available yet, so the only option is clflush, which evicts the data from cache.
Non-temporal loads/stores are so-called weakly ordered. IDK if that means you'd need more fencing to guarantee ordering.
One use-case for streaming loads/stores is copying from uncacheable memory, like video RAM. I'm not sure about using them for MMIO. I found this article about it, talking about how to read from MMIO without just getting the same cached value.
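For the 64-byte write itself, a minimal sketch of the streaming-store approach; the helper name, and the assumption that both pointers are 16-byte aligned, are mine, not from the question:

    #include <emmintrin.h> /* SSE2: _mm_stream_si128, _mm_load_si128, _mm_sfence */

    /* Copy 64 bytes into an mmap()ed device region using four 16-byte
     * non-temporal stores. Assumes dst and src are 16-byte aligned. The
     * sfence orders the weakly-ordered NT stores (the caveat above)
     * against anything the program stores afterwards. */
    static void mmio_write64(void *dst, const void *src)
    {
        const __m128i *s = (const __m128i *)src;
        __m128i *d = (__m128i *)dst;
        _mm_stream_si128(d + 0, _mm_load_si128(s + 0));
        _mm_stream_si128(d + 1, _mm_load_si128(s + 1));
        _mm_stream_si128(d + 2, _mm_load_si128(s + 2));
        _mm_stream_si128(d + 3, _mm_load_si128(s + 3));
        _mm_sfence();
    }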

Do 32-bit types save memory on 64-bit systems?

Do 32-bit types save memory on 64-bit systems?
Also, is memory divided into individual bytes or into multi-byte units (32/64-bit)?
I know that the processor processes all data as 64-bit values, filling in the missing bits.
So would a 32-bit int slow down the calculation? Or would the int be stored as 64 bits anyway?
I ask because I was trained on microcontrollers, where memory and storage are limited, and I'm wondering whether that is at all relevant on smartphones and computers.
Thanks.
The size is not the issue; the compiler will generally sort that out. What is important for performance is how the data storage is structured.
Pad data structures so that every element of a structure or array is aligned to a natural operand size of 64 or 128 bits. Also arrange the data in cache-line-sized chunks: the L1 line size is 64 bytes on Intel Pentium 4, Intel Xeon, Pentium M, and Intel Core Duo processors, with 64-byte lines paired into 128-byte sectors on Pentium 4 and Intel Xeon.
You do not want to access variables that straddle two different cache lines. In a routine, prefer locally defined variables to globals; if globals are to be accessed repeatedly over an extended period, copy them to variables defined on the local stack.
Unless you have huge arrays (hundreds of MB or even GB), you won't save much memory by using 32-bit types on modern systems. If you are doing math (e.g. crypto algorithms), operating on 64-bit integers should give you a performance boost.
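To make both points concrete (32-bit elements really do halve array storage, while a struct's size is governed by alignment padding rather than just the sum of its fields), a small sketch; the layout shown is the typical one, not something the C standard guarantees:

    #include <stdint.h>
    #include <stdio.h>

    struct mixed {
        int32_t a;  /* 4 bytes */
                    /* typically 4 bytes of padding so that b is 8-aligned */
        int64_t b;  /* 8 bytes */
    };

    int main(void)
    {
        printf("%zu vs %zu bytes\n",
               sizeof(int32_t[1000]), sizeof(int64_t[1000])); /* 4000 vs 8000 */
        printf("sizeof(struct mixed) = %zu\n", sizeof(struct mixed)); /* usually 16 */
        return 0;
    }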

32-bit PC, size of pointer

For 4 GB of RAM, there are 4 * 1024 * 1024 * 1024 * 8 = 2^(32+3) bits. My question is how a 32-bit PC can access 4 GB of memory. What I can think of is this: "a byte is the storage unit; one cannot store data in a single bit." Is this correct?
Another question: on such a PC, does a pointer always have size 32 bits? That seems reasonable to me, because we have 2^32 storage units in which to store the data. But in this answer and the next, with their comments, this is said to be wrong. If it is wrong, why?
Individual bits are accessed by reading the address of the byte containing them, modifying the byte, and writing it back if necessary.
In some architectures the smallest addressable unit is a double word, in which case no single byte can be accessed "as is". Theoretically one could design an architecture that addresses 16 GB of memory with 32 bits of unique addresses. Similar things happened years ago, when the addressable units of a hard drive were limited to a bare 2^28 units of 512-byte sectors or so.
It's not completely wrong to say that PCs have 32-bit pointers. That's just slightly outdated information, as the newer models are internally 64-bit systems and can access, depending on the OS, up to 2^48 bytes of memory. Currently most existing PCs are 32-bit and nothing can be done about it.
Well, StuartLC reminded me about paging. Even on the current 32-bit systems, one can use 48 bits of addressing via the old segment registers. (I can't remember whether there was a restriction that the low three bits of a segment register be zero...) But anyway, that would allow 2^45 bytes of individual addresses, of which just a small fraction could ever be in main memory simultaneously. If an OS supporting that addressing mode were developed, then probably a full 64 bits would be allocated for the pointer, just as is done today with 64-bit processors.
My question is how a 32-bit PC can access 4 GB of memory
You may be confusing the address bus (addressable memory) with the size of the processor registers. This superuser post details the differences.
Paging is a technique commonly used to allow memory to be addressed beyond the OS's native capability; see e.g. PAE.
does a pointer always have size 32 bits
No, not necessarily: e.g. on 16-bit DOS and Windows; also, pointers could be relative to a segment.
Can one store data in a bit?
Yes, you can: e.g. in C, bit packing in structs can be done with bit-fields, albeit at a cost in performance and portability (see the sketch below).
Today performance matters more, and compilers will typically try to align data to the machine word size, for performance reasons.
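A minimal sketch of C bit-fields, which trade exactly the performance and portability mentioned above (field layout is implementation-defined):

    #include <stdio.h>

    /* Three 1-bit flags packed into one storage unit instead of three
     * separate bytes; layout and padding are implementation-defined. */
    struct flags {
        unsigned a : 1;
        unsigned b : 1;
        unsigned c : 1;
    };

    int main(void)
    {
        struct flags f = { .a = 1, .b = 0, .c = 1 };
        printf("sizeof(struct flags) = %zu\n", sizeof f); /* typically 4 */
        printf("a=%u b=%u c=%u\n", f.a, f.b, f.c);
        return 0;
    }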

Is there merit to having less-than-8-byte pointers on 64-bit systems?

We know that on 64-bit computers pointers will be 8 bytes, which allows us to address a huge memory. But on the other hand, the memories available to ordinary people right now go up to about 16 GB, which means that at the moment we do not need 8 bytes for addressing, but 5 or at most 6 bytes.
I am a Delphi user. The question (probably for the developers of the 64-bit compiler) is: would it be possible to declare somewhere how many bytes you would like to use for pointers, valid for the whole application?
If you have an application with millions of pointers and you could declare that pointers are only 5 bytes, the amount of memory occupied would be much lower.
I can imagine that this could be difficult to implement, but I am curious about it anyway.
Thanks in advance.
A million 64-bit pointers will occupy less than eight megabytes. That's nothing: a typical modern computer has 6 GB of RAM, so 8 MB is only slightly more than one part per thousand of the total.
There are other uses for the excess capacity of 8-byte pointers: you can, for example, encode the class of a reference (as an ordinal index) into the pointer itself, stealing 10 or 20 of the 64 available bits and still leaving more than enough for currently available systems.
This lets the compiler writer do inline caching of virtual methods without paying the cost of an indirection to confirm that the instance is of the expected type.
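A rough sketch of such tagging in C, assuming (as on current x86-64) that user-space pointers fit in the low 48 bits; the helper names and the 48-bit constant are illustrative, not portable:

    #include <stdint.h>

    #define ADDR_BITS 48
    #define ADDR_MASK ((1ULL << ADDR_BITS) - 1)

    /* Pack a class index into the otherwise-unused top 16 bits. */
    static inline uint64_t ref_make(void *p, unsigned cls)
    {
        return ((uint64_t)cls << ADDR_BITS) | ((uintptr_t)p & ADDR_MASK);
    }

    static inline void *ref_ptr(uint64_t r)
    {
        return (void *)(uintptr_t)(r & ADDR_MASK);
    }

    static inline unsigned ref_class(uint64_t r)
    {
        return (unsigned)(r >> ADDR_BITS);
    }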
Actually, it wouldn't save memory. Memory allocations have to be aligned based on the size of what you're allocating; e.g. a 4-byte object has to be placed at a multiple of 4. So, due to the padding needed to align your 5-byte pointers, they'd actually consume the same amount of memory.
Remember that actual OSes don't let you use physical addresses. User processes always use virtual addresses (usually only the kernel can access physical addresses); the processor transparently turns virtual addresses into physical ones. That means your program can hold pointers to virtual addresses that have no physical counterpart on a given system. This always happened on 32-bit Windows, where DLLs are mapped into the upper 2 GB of the (always 4 GB) virtual process address space, even when the machine has far less than 2 GB of memory (it actually started to happen when PCs had only a few megabytes, but that doesn't matter).
Therefore, using "small" pointers is nonsense (even ignoring all the other factors, i.e. memory access, register sizes, standard instruction operand sizes, etc.) and would only reduce the available virtual address space. Techniques like memory-mapped files also need "large" pointers to access files that can be far larger than the available memory.
Another use for some of the excess pointer space would be storing certain value types without boxing. I'm not sure one would want a general-purpose mechanism for small value types, but it would certainly be reasonable to encode all 32-bit signed and unsigned integers, as well as all single-precision floats, and probably many values of type long and unsigned long (e.g. all those that could be precisely represented by an int, unsigned int, or float).
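One classic way to do this is a low-bit tag, as used in many language runtimes: real pointers are at least 2-byte aligned, so bit 0 is free to mark an inline integer. A hedged sketch (the helper names are mine):

    #include <stdbool.h>
    #include <stdint.h>

    /* Bit 0 set => the word holds an inline 32-bit integer, not a pointer. */
    static inline uint64_t box_i32(int32_t v)
    {
        return ((uint64_t)(uint32_t)v << 1) | 1u;
    }

    static inline bool is_boxed_i32(uint64_t w) { return (w & 1u) != 0; }

    static inline int32_t unbox_i32(uint64_t w)
    {
        return (int32_t)(uint32_t)(w >> 1);
    }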

Memory Addressing

I was reading http://duartes.org/gustavo/blog/post/motherboard-chipsets-memory-map and in specific, the following section:
In a motherboard the CPU’s gateway to the world is the front-side bus connecting it to the northbridge. Whenever the CPU needs to read or write memory it does so via this bus. It uses some pins to transmit the physical memory address it wants to write or read, while other pins send the value to be written or receive the value being read. An Intel Core 2 QX6600 has 33 pins to transmit the physical memory address (so there are 2^33 choices of memory locations) and 64 pins to send or receive data (so data is transmitted in a 64-bit data path, or 8-byte chunks). This allows the CPU to physically address 64 gigabytes of memory (2^33 locations * 8 bytes) although most chipsets only handle up to 8 gigs of RAM.
Now the math above states that since there are 33 pins for addressing, 2^33 * 8 bytes = 64 GB. All good, but now I get a bit confused. Say I install a 64-bit OS: will I be able to address 64 GB total, or 2^64 addresses * 8 bytes (which is much more)? Also, with the same CPU but a 32-bit OS, can I still only address 4 GB (2^32 addresses)?
I think the physical vs "OS Allowable" is getting me confused.
Thanks!
You're confusing a bunch of things:
The size of a pointer limits the amount of virtual memory a user process can access. Not all of these will actually be usable by your process (it is traditional to reserve the "high" 1 or 2 GB for use by the kernel).
Not all virtual address bits are valid. The original AMD64 implementation effectively uses 48-bit sign-extended addresses (i.e. addresses in the range [0x0000800000000000, 0xFFFF7FFFFFFFFFFF] are invalid). This exists largely to limit page tables to 4 levels, which decreases the cost of a page fault; you would need 6-level page tables to address the full 2^64 bytes, assuming 4K pages. For comparison, i386 has 2-level page tables. (A sketch of this canonical-address check appears after this list.)
Not all virtual addresses need to correspond to physical addresses at any given time. This is the whole point of virtual memory: you can address memory which doesn't "physically" exist, and the OS pages it in for you.
Not all physical addresses correspond to virtual addresses. They might not be mapped, for one, but it's also possible to have more physical memory than you can address. PAE supports up to 64 GB of physical addresses and was common on servers before AMD64. While an individual process can't address 64 GB, it means you can run a lot of multi-gigabyte processes without swapping all the time.
And finally: There's no point having more physical addresses than your RAM slots can handle. I have a D945GCLF2 board which supports AMD64, but only 2 GB of RAM. There's no point having extra physical address lines which can't be used anyway. (I'm handwaving over memory-mapped devices and the funky two-DIMMs-one-slot thing which I forget the name of.)
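A minimal sketch of the canonical-address rule from the second point above: bits 63..47 must all be copies of bit 47 (the constants assume the original 48-bit AMD64 implementation):

    #include <stdbool.h>
    #include <stdint.h>

    /* Canonical iff the top 17 bits (63..47) are all zeros or all ones. */
    static bool is_canonical48(uint64_t addr)
    {
        uint64_t top = addr >> 47;
        return top == 0 || top == 0x1FFFF;
    }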
Also, note a few other things:
For memory-mapped I/O (in the hardware sense), the CPU needs to address individual bytes. It can't just do a 64-bit access. This seems to have been glossed over.
Modern processors include the memory controller on the CPU instead of using the traditional northbridge and FSB (see HyperTransport and QuickPath).
Yes, the number of bits in physical and virtual addresses can differ. For example, here is what 64-bit Linux says about the cores here (cat /proc/cpuinfo):
...
processor : 3
vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : Dual Core AMD Opteron(tm) Processor 280
stepping : 2
cpu MHz : 2392.623
cache size : 1024 KB
...
bogomips : 4784.41
TLB size : 1088 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
There are a few things to consider about the physical address wires:
Each physical address wire ("pin") references a front-side-bus word, not a byte address. If the CPU fetches 64-bit words, then the physical address wires are aligned to that 8-byte boundary: address lines A0-A2 are simply not wired, because they would always be zero. The byte address range of the physical wires is thus extended by a factor of the front-side bus width.
The virtual memory system can maintain a map of 64-bit virtual addresses to n-bit physical addresses. In practice, the OS maintains a "physical max address" value which the VM mappings do not exceed.
Some memory architectures allow memory bank paging, where off-CPU hardware increases the effective physical memory address range by re-using some physical addresses for different "banks" of memory.
Imagine that in a 64-bit OS some of the wires to address memory don't go anywhere. The OS understands that this is pretty confusing, so it takes the standard 64-bit address and uses virtual memory mapping to make you believe that you're living in a flat 64-bit space.
The chipset limit is a big factor -- the hardware on the motherboard has to be able to pass the addresses from the CPU to the RAM. So the 8GB limit will apply unless you have a motherboard designed to handle more.
For reference, current 64-bit CPUs have the upper bits of the address space (somewhere between 8 and 24 bits) wired together, as 64 bits is simply too much address space for now (you'd need 8 billion 2 GB modules to take up that much address space). AMD CPUs, for example, have a 48-bit limit (IIRC) on the address space in a single segment. That's more than enough, but nowhere near the theoretical max.
The main difference between a 64-bit and a 32-bit OS is that one simply regards the primitive datatype (e.g. a word) as being wider. If the CPU can only physically address 2^33 locations, that won't change just because you're running a 64-bit OS. On the other hand, using a 32-bit OS will generally limit your addressable memory, since 32-bit pointers can't represent all the possible values your CPU could use to address memory (in your example, a 32-bit pointer is one bit short).
Long story short, your addressable memory is limited by both the pointer width (an OS restriction) and the address bus width (a physical restriction). Some architectures get around the OS pointer width by cleverly using two pointers, one to address a "bank" of memory and another to address locally within the bank. These schemes have rather fallen out of vogue lately, though.
Also, modern OSes generally use a virtual memory subsystem that translates logical addresses into their corresponding physical ones. With caching, the actual physical location of the memory could be in one (or several!) components along the memory hierarchy (e.g. processor cache, main memory, hard disk, etc.). Don't know how I completely forgot to mention VM, but investigating it would definitely help your understanding.
I believe that if you have a 64-bit operating system you can (theoretically) address 2^64 bytes = 16 EB (exabytes), but you will be limited by the hardware to 2^33 * 8 bytes = 64 GB. If you have a 32-bit OS you will not be able to utilize the full hardware capacity, since the OS is then the limiting factor, only being able to express 2^32 different addresses. I might be off, but that's my current understanding.
I think you are getting confused by the fact that the memory stores 8 bytes at a time, while an address (at the CPU level) refers to 1 byte (not a group of 8). So with 32 bits you can refer to 2^32 bytes = 4 GB. Put differently, +8 on a pointer corresponds to +1 on the number of the "physical" line.
You can then get access to more memory using paging (I'm not sure whether it is still used in modern computers).
To make an analogy with a library: you (or the CPU) can enumerate 2^32 books, but the librarian (the chipset) deals in shelves of books. So what is book #10 for you is book #2 of shelf #2, but you never see the shelf number; it's the librarian's job to go to the right shelf and bring you the right book.
For me (another program on the same computer), book #10 could be a different one: book #2 of shelf #100002 (because my pages start at shelf #100000).
We can both refer to 2^32 different books, but they are not the same (and the library can hold far more than that).
(Things have changed a lot since I studied computers, so what I'm saying may not be 100% accurate, but I think the idea is there.)
