Why is there an alignment bigger than a word?

OK, I understand that storing data aligned to CPU-word-sized chunks increases the speed of accessing it. But those chunks are normally 16, 32, or 64 bits, so why are there other alignment values like 128 or 256 bits? There aren't any processors using such large registers in PCs anyway. I suppose this has something to do with the CPU cache? I have also seen such alignments in secondary storage, but there they are actually much larger: 10240 bits, for example.

Many processors do have 128-bit SIMD registers (e.g., x86 SSE registers, ARM Neon registers, MIPS SIMD Architecture registers); x86 AVX extends SSE registers to 256 bits and AVX-512 doubles the size again.
However, there are other reasons for desiring larger alignments. As you guessed, cache behavior is one motive for using larger alignments. Aligning a larger data structure to the size of a cache line (commonly 64 bytes for x86, usually not smaller than 32 bytes in modern systems) guarantees that an access to any member will bring the same other members into the cache. This can be used to reduce cache capacity use and miss rate by placing members that are frequently used (a.k.a., hot) or which are commonly used at about the same time together in what will be the same cache block.
E.g., consider the following structure accessed with a cache having 32-byte cache blocks:
struct {
    int64_t hot1; // frequently used member
    int64_t hot2; // frequently used member
    int64_t hot3; // frequently used member
    int64_t hot4; // frequently used member
    // end of 32-byte cache block if 32-byte aligned
    int64_t a; // always used by func1, func2
    int64_t b; // always used by func2
    int64_t c; // always used by func1, func3
    int64_t d; // always used by func2, func3
    // end of 32-byte cache block if 32-byte aligned
    int64_t e; // used by func4
    int64_t f; // used by func5
    int64_t g; // used by func6
    int64_t h; // used by func7
};
If the structure is 32-byte aligned:
an access to any of the hot members will bring all of the hot members into the cache
calling func1, func2, or func3 will bring a, b, c, and d into the cache; if these functions are called close together in time, the data will still be in cache
If the structure is 16-byte aligned but not 32-byte aligned (50% chance with 16-byte alignment):
an access to hot1 or hot2 will bring in 16 bytes of unrelated data located before hot1 and will not automatically load hot3 and hot4 into the cache
an access to hot3 or hot4 will bring a and b into the cache (likely unnecessarily)
a call to func1 or func2 is more likely to hit in the cache for a and b, since these would be in the same cache block as hot3 and hot4, but will miss for c and d and less usefully bring e and f into the cache
a call to func3 will less usefully bring e and f into the cache but not a and b
Even for a small structure, alignment can prevent the structure (or just the hot or accessed-nearby-in-time portions) from crossing cache block boundaries. E.g., aligning a 24-byte structure with 16-bytes of hot data to 16 bytes can guarantee that the hot data will always be in the same cache block.
Cache block alignment can also be used to guarantee that two locks (or other data elements that are accessed by different threads and written by at least one) do not share the same cache block. This avoids false sharing issues. (False sharing is when unrelated data used by different threads share a cache block. A write by one thread will remove that cache block from all other caches. If another thread writes to unrelated data in that block, it removes the block from the first thread's cache. For ISAs using load-linked/store-conditional to set locks, this can cause the store-conditional to fail even though there was not an actual data conflict.)
Similar alignment considerations apply with respect to virtual memory page size (typically 4KiB). By guaranteeing that data accessed nearby in time are in a smaller number of pages, the cache storing translations of virtual memory addresses (the translation lookaside buffer [TLB]) will not have as much capacity pressure.
Alignment can also be used in object caches to reduce cache conflict misses, which occur when items have the same cache index. (Caches are typically indexed simply by a selection of some least significant bits. At each index a limited number of blocks, called a set, are available. If more blocks want to share an index than there are blocks in the set (the associativity, or number of ways), then one of the blocks in the set must be removed from the cache to make room.) A 2048-byte, fully aligned chunk of memory could hold 21 copies of the above structure with a 32-byte chunk of padding (which might be used for other purposes). This guarantees that hot members from different chunks will only have a 33.3% chance of using the same cache index. (Allocating in a chunk, even if not aligned, also guarantees that none of the 21 copies within a chunk will share a cache index.)
Large alignment can also be convenient in buffers since a simple bitwise and can produce the start address of the buffer or the number of bytes in the buffer.
Alignment can also be exploited to provide pointer compression (e.g., 64-byte alignment would allow a 32-bit pointer to address 256 GiB instead of 4 GiB at the cost of a 6-bit left shift when loading the pointer). Similarly, the least significant bits of a pointer to an aligned object can be used to store metadata, requiring an and to zero the bits before using the pointer.

Here are the alignments I have used:
SSE: 16 bytes
AVX: 32 bytes
cache-line: 64 bytes
page: 4096 bytes
SSE and AVX both offer load and store instructions which require alignment to 16 bytes for SSE or 32 bytes for AVX. E.g.
SSE: _mm_load_ps() and _mm_store_ps()
AVX: _mm256_load_ps() and _mm256_store_ps()
However, they also offer instructions which don't require alignment:
SSE: _mm_loadu_ps() and _mm_storeu_ps()
AVX: _mm256_loadu_ps() and _mm256_storeu_ps()
Before Nehalem the unaligned loads/stores had a higher latency and lower throughput, even on aligned memory, than the instructions that required alignment. However, since Nehalem they have the same latency/throughput on aligned memory, which means there is no reason to use the load/store instructions that require alignment anymore. That does NOT mean that aligned memory does not matter anymore.
If 16 or 32 bytes cross a cache line and these bytes are loaded into an SSE/AVX register, this can cause a stall, so it can also help to align to a cache line. In practice I usually align to 64 bytes.
On multi-socket systems with multiple processors, sharing memory between the processors is slower than each processor accessing its own main memory. For this reason it can help to make sure that memory is not split across a virtual page boundary; a virtual page is usually, but not necessarily, 4096 bytes.


Why is primary and cache memory divided into blocks?
Hi, I just got posed this question. I haven't been able to find a detailed explanation covering both primary memory and cache memory; if you have a solution it would be greatly appreciated :)
Thank you
Cache blocks exploit locality of reference, based on two types of locality. Temporal locality: after you reference location x, you are likely to access location x again shortly. Spatial locality: after you reference location x, you are likely to access nearby locations (x+1, ...) shortly.
If you use a value at some distant data center x, you are likely to reuse that value, and so it is copied geographically closer: 150 ms. If you use a value on disk block x, you are likely to reuse disk block x, and so it is kept in memory: 20 ms. If you use a value on memory page x, you are likely to reuse memory page x, and so the translation of its virtual address to its physical address is kept in the TLB cache. If you use a particular memory location x, you are likely to reuse it and its neighbors, and so it is kept in cache.
Cache memory is very small, L1D on an M1 is 192kB, and DRAM is very big, 8GB on an M1 Air. L1D cache is much faster than DRAM, maybe 5 cycles vs maybe 200 cycles. I wish this table was in cycles and included registers but it gives a useful overview of latencies:
https://gist.github.com/jboner/2841832
The moral of this is to pack data into aligned structures which fit. If you randomly access memory instead, you will miss in the cache, the TLB, the virtual page cache, ... and everything will be excruciatingly slow.
Most things in computer systems are divided into chunks of fixed sizes: bytes, words, cache blocks, pages.
One reason for this is that while hardware can do many things at once, it is hardware and thus generally can only do what it was designed for. Making bytes blocks of 8 bits, and words blocks of 4 bytes (32-bit systems) or 8 bytes (64-bit systems), is something that we can design hardware to do, and mostly in parallel.
Not using fixed-sized chunks or blocks, on the other hand, can make things much more difficult for the hardware, so, data structures like strings — an example of data that is highly variable in length — are usually handled with software loops.
Usually these fixed sizes are powers of 2 (32, 64, etc.) because division and modulus, which are very useful operations, are easy to do in binary for powers of 2.
In summary, we must subdivide data into blocks because, we cannot treat all the data as one lump sum (hardware wise at least), and, treating all data as individual bits is also too cumbersome.  So, we chunk or group data into blocks as appropriate for various levels of hardware to deal with in parallel.

Is it easier to fetch a 4 byte word from a word addressable memory compared to byte addressable?

So I did find some answers related to this on Stack Overflow, but none of them clearly answered this.
So if our memory is byte addressable and the word size is, for example, 4 bytes, then why not make the memory word addressable?
If I'm not mistaken the CPU will work with words, right? So when the CPU tries to get a word from the memory, what's the difference between getting a 4-byte word from a byte-addressable memory vs getting a word from a word-addressable memory?
if I'm not mistaken the CPU will work with words, right?
It depends on the Instruction Set Architecture (ISA) implemented by the CPU. For example, x86 supports operands of sizes ranging from a single 8-bit byte to as much as 64 bytes (in the most recent CPUs). Although the word size in modern x86 CPUs is 8 or 4 bytes only. The word size is generally defined as equal to the size of a general-purpose register. However, the granularity of accessing memory or registers is not necessarily restricted to the word size. This is very convenient from a programmer's perspective and from the CPU implementation perspective as I'll discuss next.
So when the CPU tries to get a word from the memory, what's the difference between getting a 4-byte word from a byte-addressable memory vs getting a word from a word-addressable memory?
While an ISA may support byte addressability, a CPU that implements the ISA may not necessarily fetch data from memory one byte at a time. Spatial locality of reference is a memory access pattern very common in most real programs. If the CPU was to issue single-byte requests along the memory hierarchy, it would unnecessarily consume a lot of energy and significantly hurt performance to handle single-byte requests and move one-byte data across the hierarchy. Therefore, typically, when the CPU issues a memory request for data of some size at some address, a whole block of memory (known as a cache line, which is usually 64-byte in size and 64-byte aligned) is brought to the L1 cache. All requests to the same cache line can be effectively combined into a single request. Therefore, the address bus between different levels of the memory hierarchy does not have to include wires for the bits that constitute an offset within the cache line. In that case, the implementation would be really addressing memory at the 64-byte granularity.
It can be useful, however, to support byte addressability in the implementation. For example, if only one byte of a cache line has changed and the cache line has to be written back to main memory, instead of sending all 64 bytes to memory, it would take less energy, bandwidth, and time to send only the byte that changed (or a few bytes). Another situation where byte addressability is useful is when providing support for the idea of critical-word-first. There is much more to it, but to keep the answer simple, I'll stop here.
DDR SDRAM is a prevalent class of main memory interfaces used in most computer systems today. The data bus width is 8 bytes in size and the protocol supports only transferring aligned 8-byte chunks with byte enable signals (called data masks) to select which bytes to write. Therefore, main memory is typically 8-byte addressable. It is the CPU that provides the illusion of byte addressability.
Memory normally is byte-addressable. But whole-word loads are possible, and get 4x as much data in the same time.
There's basically no difference, if the word load is naturally aligned; the low bits of the address are zero instead of being not present.

Is data loaded into the cache aligned to the cache line size?

Let's say that the cache line size is 64 bytes and that I have an object whose size is also 64 bytes. If this object is accessed, will it be:
All loaded into one cache line
Only the part between the start of the object and the next multiple of 64 bytes will be loaded
The object will be loaded into two different cache lines
Something else
I have a feeling that the answer differs from processor to processor, but what is the most likely outcome on modern CPU's?
At the machine-instruction level, the concept of objects available in higher-level languages disappears. Accessing their members through higher-level assignment operations will be translated into regular read and write instructions. So if the runtime system or virtual machine of the language is smart and able to allocate objects for better cache utilization (in your case, at 64-byte-aligned addresses), then when the higher-level language reads any member of the object, which could be anywhere within that 64-byte-aligned region, the entire object will be loaded into one cache line (because it is allocated at a 64-byte-aligned address). If the runtime system is dumb, and just allocates objects as requested in the flow of the program without considering situations like the one in your question (64-byte objects and 64-byte cache lines), then when a read happens to a member of the object, the data at the 64-byte-aligned address containing that member will be loaded into the cache. In the latter case you therefore need to be either lucky, or write code with special padding in place, either in front of or behind the object, to make it cache-line aligned.

Is it possible to use cudaMemcpy with src and dest as different types?

I'm using a Tesla, and for the first time, I'm running low on CPU memory instead of GPU memory! Hence, I thought I could cut the size of my host memory by switching all integers to short (all my values are below 255).
However, I want my device memory to use integers, since the memory access is faster. So is there a way to copy my host memory (in short) to my device global memory (in int)? I guess this won't work:
short *buf_h = new short[100];
int *buf_d = NULL;
cudaMalloc((void **)&buf_d, 100*sizeof(int));
cudaMemcpy( buf_d, buf_h, 100*sizeof(short), cudaMemcpyHostToDevice );
Any ideas? Thanks!
There isn't really a way to do what you are asking directly. The CUDA API doesn't support "smart copying" with padding or alignment, or "deep copying" of nested pointers, or anything like that. Memory transfers require linear host and device memory, and alignment must be the same between source and destination memory.
Having said that, one approach to circumvent this restriction would be to copy the host short data to an allocation of short2 on the device. Your device code can retrieve a short2 containing two packed shorts, extract the value it needs and then cast the value to int. This will give the code 32 bit memory transactions per thread, allowing for memory coalescing, and (if you are using Fermi GPUs) good L1 cache hit rates, because adjacent threads within a block would be reading the same 32 bit word. On non Fermi GPUs, you could probably use a shared memory scheme to efficiently retrieve all the values for a block using coalesced reads.

Purpose of memory alignment

Admittedly I don't get it. Say you have a memory with a word length of 1 byte. Why can't you access a 4-byte-long variable in a single memory access at an unaligned address (i.e., not divisible by 4), as is the case with aligned addresses?
The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.
Speed
Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.
The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.
Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:
struct mystruct {
    char c;  // one byte
    int i;   // four bytes
    short s; // two bytes
};
On a 32-bit processor it would most likely be laid out like this:
0x0000: c (1 byte, followed by 3 bytes of padding)
0x0004: i (4 bytes)
0x0008: s (2 bytes, followed by 2 bytes of padding)
The processor can read each of these members in one transaction.
Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it might look something like this:
0x0000: c (1 byte)
0x0001: i (4 bytes, unaligned)
0x0005: s (2 bytes, unaligned)
Reading the first byte is going to be the same.
When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.
When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.
Range
For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states), or the same amount of memory with 2 bits for something like flags. Taking the 2 LSBs off of an address would give you a 4-byte alignment; also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.
This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.
Atomicity
The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
Conclusion
The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).
There are many more benefits to adhering to memory alignment that you can read at this IBM article.
A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data, in, out, and between more and faster execution units–in a highly reliable way.
Bonus: Caches
Another alignment-for-performance that I alluded to previously is alignment on cache lines which are (for example, on some CPUs) 64B.
For more info on how much performance can be gained by leveraging caches, take a look at Gallery of Processor Cache Effects; from this question on cache-line sizes
Understanding of cache lines can be important for certain types of program optimizations. For example, the alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice slower.
It's a limitation of many underlying processors. It can usually be worked around by doing 4 inefficient single byte fetches rather than one efficient word fetch, but many language specifiers decided it would be easier just to outlaw them and force everything to be aligned.
There is much more information in this link that the OP discovered.
You can with some processors (Nehalem can do this), but previously all memory access was aligned on a 64-bit (or 32-bit) line; because the bus is 64 bits wide, you had to fetch 64 bits at a time, and it was significantly easier to fetch these in aligned 'chunks' of 64 bits.
So, if you wanted to get a single byte, you fetched the 64-bit chunk and then masked off the bits you didn't want. Easy and fast if your byte was at the right end, but if it was in the middle of that 64-bit chunk, you'd have to mask off the unwanted bits and then shift the data over to the right place. Worse, if you wanted a 2-byte variable that was split across 2 chunks, then that required twice the memory accesses.
So, as everyone thinks memory is cheap, they just made the compiler align the data on the processor's chunk sizes so your code runs faster and more efficiently at the cost of wasted memory.
Fundamentally, the reason is because the memory bus has some specific length that is much, much smaller than the memory size.
So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU will have the vastly smaller width of the cache line size. This will be on the order of 128 bits.
So:
262,144 bits - size of memory
128 bits - size of bus
Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to the DRAM.
Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines which each have a piece of the data. On one line, it will be in the very high order bits, in the other, the very low order bits.
There will be dedicated hardware fully integrated into the pipeline that handles moving aligned objects onto the necessary bits of the CPU data bus, but such hardware may be lacking for misaligned objects, because it probably makes more sense to use those transistors for speeding up correctly optimized programs.
In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.
@joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects that were described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effect of different word alignments looks like.
In addition, here's a link to a GitHub gist with the code for the test.
The test code is adapted from the article written by Jonathan Rentzsch which @joshperry referenced. The tests were run on a MacBook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16 GB of RAM.
If you have a 32bit data bus, the address bus address lines connected to the memory will start from A2, so only 32bit aligned addresses can be accessed in a single bus cycle.
So if a word spans an address alignment boundary - i.e. A0 for 16/32 bit data or A1 for 32 bit data are not zero, two bus cycles are required to obtain the data.
Some architectures/instruction sets do not support unaligned access and will generate an exception on such attempts, so compiler generated unaligned access code requires not just additional bus cycles, but additional instructions, making it even less efficient.
If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.
On PowerPC you can load an integer from an odd address with no problems.
SPARC and i86 and (I think) Itanium raise hardware exceptions when you try this.
One 32-bit load vs four 8-bit loads isn't going to make a lot of difference on most modern processors. Whether the data is already in cache or not will have a far greater effect.
