Why is primary and cache memory divided into blocks?
Hi, I just got posed this question and haven't been able to find a detailed explanation covering both primary memory and cache memory. If you have a solution, it would be greatly appreciated :)
Thank you
Cache blocks exploit locality of reference, which comes in two types. Temporal locality: after you reference location x, you are likely to access location x again shortly. Spatial locality: after you reference location x, you are likely to access nearby locations (x+1, ...) shortly.
If you use a value at some distant data center x, you are likely to reuse that value, so it is copied geographically closer (150 ms). If you use a value on disk block x, you are likely to reuse disk block x, so it is kept in memory (20 ms). If you use a value on memory page x, you are likely to reuse memory page x, so the translation of its virtual address to its physical address is kept in the TLB. If you use a particular memory location x, you are likely to reuse it and its neighbors, so it is kept in cache.
Cache memory is very small (the L1D of an M1 performance core is 128 kB; the 192 kB figure often quoted is its L1 instruction cache) and DRAM is very big (8 GB on an M1 Air). L1D cache is much faster than DRAM, maybe 5 cycles vs. maybe 200 cycles. I wish this table was in cycles and included registers, but it gives a useful overview of latencies:
https://gist.github.com/jboner/2841832
The moral of this is to pack data into aligned structures which fit. If you randomly access memory instead, you will miss in the cache, the TLB, the virtual page cache, ... and everything will be excruciatingly slow.
Most things in computer systems are divided into chunks of fixed sizes: bytes, words, cache blocks, pages.
One reason for this is that while hardware can do many things at once, it is hardware and thus can generally only do what it was designed for. Making bytes blocks of 8 bits, and making words blocks of 4 bytes (32-bit systems) or 8 bytes (64-bit systems), is something that we can design hardware to do, and mostly in parallel.
Not using fixed-sized chunks or blocks, on the other hand, can make things much more difficult for the hardware, so, data structures like strings — an example of data that is highly variable in length — are usually handled with software loops.
Usually these fixed sizes are in powers of 2 (32, 64, etc..) — because division and modulus, which are very useful operations are easy to do in binary for powers of 2.
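To make the power-of-2 point concrete, here is a minimal sketch in C. The function names (is_power_of_two, block_index, block_offset) are made up for illustration; the only assumption is that the block size is a power of 2, in which case division becomes a right shift and modulus becomes a bit mask.

```c
#include <assert.h>
#include <stdint.h>

/* True when n has exactly one bit set, i.e. n is a power of two. */
int is_power_of_two(uint32_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

/* addr / block_size, computed as a right shift.
   log2_block is log2(block_size), e.g. 6 for 64-byte blocks. */
uint32_t block_index(uint32_t addr, unsigned log2_block) {
    return addr >> log2_block;
}

/* addr % block_size, computed as a mask of the low bits.
   Only valid when block_size is a power of two. */
uint32_t block_offset(uint32_t addr, uint32_t block_size) {
    assert(is_power_of_two(block_size));
    return addr & (block_size - 1);
}
```

For example, with 64-byte blocks, address 200 falls in block 3 at offset 8. Hardware gets the same result simply by routing the upper and lower address wires to different places, with no arithmetic at all.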
In summary, we must subdivide data into blocks because we cannot treat all the data as one lump sum (hardware-wise at least), and treating all data as individual bits is also too cumbersome. So we chunk or group data into blocks as appropriate for the various levels of hardware to deal with in parallel.
Intel architecture has had 64-byte cache lines for a long time. I am curious: if instead of 64-byte cache lines a processor had 32-byte or 16-byte cache lines, would this improve the RAM-to-register data transfer latency? If so, how much? If not, why?
Thank you.
Transferring a larger amount of data of course increases the communication time. But the increase is very small due to the way memories are organized, and it does not impact memory-to-register latency.
Memory access operations are done in three steps:
bitline precharge: row address is sent and the internal busses of memory are precharged (duration tRP)
row access: an internal row of a memory is read and written to internal latches. During that time, column address is sent (duration tRCD)
column access: the selected columns are read in the row latches and start to be sent to the processor (duration tCL)
Row access is a long operation.
A memory is a matrix of cell elements. To increase the capacity of memory, cells must be made as small as possible. And when reading a row of cells, one has to drive a large and very capacitive bus that runs along a memory column. The voltage swing is very low, and sense amplifiers are needed to detect the small voltage variations.
Once this operation is done, a complete row is held in latches; reading from the latches is fast, and the data is generally sent in burst mode.
Considering a typical DDR4 memory, with a 1GHz IO cycle time, we generally have tRP/tRCD/tCL=12-15cy/12-15cy/10-12cy and the complete time is around 40 memory cycles (if processor frequency is 4GHz, this is ~160 processor cycles). Then data is sent in burst mode twice per cycle, and 2x64 bits are sent every cycle. So, data transfer adds 4 cycles for 64 bytes and it would add only 2 cycles for 32 bytes.
So reducing cache line from 64B to 32B would reduce the transfer time by ~2/40=5%
If the row address does not change, precharging and reading the memory row is not required and the access time is ~15 memory cycles. In that case, the relative increase in time for transferring 64B vs 32B is larger but still limited: ~2/15 ≈ 13%.
Neither evaluation takes into account the extra time required to process a miss in the memory hierarchy, so the actual percentage will be even smaller.
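The arithmetic above can be reproduced with a small sketch in C. The function names are hypothetical, and the figures plugged in are only the ones quoted in this answer: 16 bytes per memory cycle (2 x 64 bits), ~40 cycles for a full access, and ~15 cycles when the row is already open.

```c
/* Memory cycles needed to burst line_bytes when bytes_per_cycle
   bytes are transferred each memory cycle (rounded up). */
unsigned burst_cycles(unsigned line_bytes, unsigned bytes_per_cycle) {
    return (line_bytes + bytes_per_cycle - 1) / bytes_per_cycle;
}

/* Transfer cycles saved by the smaller line, as a percentage of
   the access latency (access_cycles) that precedes the burst. */
double transfer_saving_percent(unsigned big_line, unsigned small_line,
                               unsigned bytes_per_cycle,
                               unsigned access_cycles) {
    unsigned saved = burst_cycles(big_line, bytes_per_cycle)
                   - burst_cycles(small_line, bytes_per_cycle);
    return 100.0 * saved / access_cycles;
}
```

With these assumptions, transfer_saving_percent(64, 32, 16, 40) gives 5% and transfer_saving_percent(64, 32, 16, 15) gives about 13.3%, matching the two cases discussed.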
Data can be sent "critical word first" by the memory. If the processor requires a given word, the address of this word is sent to memory. Once the row is read, memory sends this word first, then the other words in the cache line. So caches can serve the processor request as soon as the first word is received, whatever the cache line size is, and decreasing the line width would have no impact on cache latency. With this feature, memory-to-register time would not change.
In recent processors, exchanges between different caches levels are based on the cache line width and sending the critical word first does not bring any gain.
Besides that, large line sizes reduce compulsory (cold) misses thanks to spatial locality, and reducing the line size would have a negative impact on the cache miss rate.
Last, using larger cache lines increases data transfer rate between cache and memory.
The only negative aspect of large cache lines (besides the small transfer time increase) is that the number of lines in the cache is reduced and conflict misses may increase. But with the large associativity of modern caches, this effect is limited.
Please explain it nicely. Don't just write a definition. Also explain what it does and how it is different from segmentation.
Fragmentation needs to be considered with memory allocation techniques. Paging is basically not a memory allocation technique, but rather a means of providing virtual address spaces.
Considering the comparison with segmentation, what you're probably asking about is the difference between a memory allocation technique using fixed size blocks (like the pages of paging, assuming 4KB page size here) and a technique using variable size blocks (like the segments used for segmentation).
Now, assume that you directly use the page allocation interface to implement memory management, that is you have two functions for dealing with memory:
alloc_page, which allocates a single page and returns a pointer to the beginning of the newly available address space, and
free_page, which frees a single, allocated page.
Now suppose all of your currently available virtual memory is used, but you need to store 1 additional byte. You call alloc_page and get a 4KB block of memory. You only use 1 byte of that huge block, but also the other 4095 bytes are, from the perspective of the allocator, used. If this happens multiple times eventually all pages will be allocated, so further calls to alloc_page will fail. Even if you just need another additional byte (which could be one of the 4095 that got wasted above) the allocator will tell you that you're out of memory. This is internal fragmentation.
If, on the other hand, you would use variable sized blocks (like in segmentation), then you're vulnerable to external fragmentation: Suppose you manage 6 bytes of memory (F means "free"):
FFFFFF
You first allocate 3 bytes for a, then 1 for b and finally 2 bytes for c:
aaabcc
Now you free both a and c, leaving only b allocated:
FFFbFF
You now have 5 bytes of unused memory, but if you try to allocate a block of 4 bytes (which is less than the available memory) the allocation will fail due to the unfavorable placement of the memory for b. This is external fragmentation.
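The 6-byte scenario above can be replayed with a small first-fit allocator. This is a minimal sketch, not how any real allocator is written: the function names are invented for illustration, and the arena is just a string with one character per byte, where 'F' means free.

```c
#include <stddef.h>
#include <string.h>

/* First fit: find `size` contiguous free bytes, mark them with `tag`,
   and return the start index, or -1 if no such run exists. */
int arena_alloc(char *arena, size_t len, size_t size, char tag) {
    size_t run = 0;
    if (size == 0) return -1;
    for (size_t i = 0; i < len; i++) {
        run = (arena[i] == 'F') ? run + 1 : 0;
        if (run == size) {
            size_t start = i + 1 - size;
            memset(arena + start, tag, size);
            return (int)start;
        }
    }
    return -1;
}

/* Release every byte owned by `tag`. */
void arena_free(char *arena, size_t len, char tag) {
    for (size_t i = 0; i < len; i++)
        if (arena[i] == tag) arena[i] = 'F';
}

/* Total number of free bytes, contiguous or not. */
size_t arena_free_bytes(const char *arena, size_t len) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (arena[i] == 'F') n++;
    return n;
}
```

Starting from "FFFFFF", allocating 3, 1 and 2 bytes for a, b and c and then freeing a and c leaves "FFFbFF": arena_free_bytes reports 5, yet arena_alloc for 4 bytes returns -1 because the longest free run is only 3.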
Now, if you extend your page allocator to be able to allocate multiple pages and add alloc_multiple_pages, you have to deal with both internal and external fragmentation.
There is no external fragmentation in paging but internal fragmentation exists.
First, we need to understand what external fragmentation is. External fragmentation occurs when we have enough total free memory to accommodate a process, but it is not contiguous.
How does it not occur in paging?
Paging divides the virtual memory of all processes into equal-sized pages and physical memory into fixed-size frames. So you are fitting equal-sized blocks called pages into equal-sized slots called frames! Try to visualize this and you will conclude that there can never be external fragmentation.
In the case of segmentation, we divide virtual addresses into blocks of different sizes. That is why there may be cases where blocks in main memory must be moved together (compacted) to make space for a new process. I hope it helps!
When a process is divided into fixed-size pages, there is generally some leftover space in the last page (internal fragmentation). When there are many processes, the unused areas of their last pages could add up to the size of one page or more. Now even if you have a total free size of one page or more, you cannot load a new page because a page has to be contiguous. External fragmentation has happened. So, I don't think external fragmentation is completely zero in paging.
EDIT: It is all about how external fragmentation is defined. The sum of internal fragmentation does not contribute to external fragmentation. External fragmentation comes from empty space that is EXTERNAL to a partition (or page). Suppose there are only two frames in main memory, each of size 16B and each occupied by only 1B of data. The internal fragmentation in each frame is 15B. The total unused space is 30B. Now if you want to load one new page of some process, you will see that you do not have any frame available. You are unable to load a new page even though you have 30B of unused space. Would you call this external fragmentation? The answer is no, because each 15B of unused space is INTERNAL to a page. So in paging, internal fragmentation is possible but external fragmentation is not.
Paging allows a process to be allocated physical memory in a non-contiguous fashion. I will explain why external fragmentation can't occur in paging.
External fragmentation occurs when a process that was allocated contiguous memory is unloaded from physical memory, which creates a hole (free space) in memory.
Now if a new process arrives that requires more memory than this hole, we won't be able to allocate contiguous memory to it, even if the total free memory is sufficient, because the free memory is not contiguous. This is called external fragmentation.
The problem above originates from the constraint of allocating contiguous memory to the process. This is what paging solves, by allowing a process to get non-contiguous physical memory.
In paging, the probability of having external fragmentation is very low although internal fragmentation may occur.
In paging scheme, the whole main memory and the virtual memory is divided into some fixed size slots which are called pages (in case of virtual memory) and page frames (in case of main memory or RAM or physical memory). So, whenever a process is executed in main memory, it occupies the entire space of a page frame. Let us say, the main memory has 4096 page frames with each page frame having a size of 4096 bytes. Suppose, there is a process P1 which requires 3000 bytes of space for its execution in main memory. So, in order to execute P1, it is brought from virtual memory to main memory and placed in a page frame (F1) but P1 requires only 3000 bytes of space for its execution and as a result of which (4096 - 3000 = 1096 bytes) of space in the page frame F1 is wasted. In other words, this denotes the case of internal fragmentation in the page frame F1.
Again, external fragmentation may occur if some space of the main memory could not be included in a page frame. But this case is very rare as usually the size of a main memory, the size of a page frame as well as the total no. of page frames in main memory can be expressed in terms of power of 2.
As far as I've understood, I would answer your question like so:
Why is there internal fragmentation with paging?
Because a page has fixed size, but processes may request more or less space. Say a page is 32 units, and a process requests 20 units. Then when a page is given to the requesting process, that page is no longer useable despite having 12 units of free "internal" space.
Why is there no external fragmentation with paging?
Because in paging, a process is allowed to be allocated spaces that are non-contiguous in the physical memory. Meanwhile, the logical representation of those blocks will be contiguous in the virtual memory. This is what I mean:
A process requires 128 units of space. This is 4 pages as in the previous example. Regardless of the actual page numbers (formally frame numbers) in physical memory, you give those pages the numbers 0, 1, 2, and 3. This is the virtual representation that is the defining characteristic of paging itself. Those pages may be 21, 213, 23, 234 in the actual physical memory. But they can really be anything, contiguous or non-contiguous. Therefore, even if paging leaves small free spaces in between used spaces, those small free spaces can still be used together as if they were one contiguous block of space. That's why external fragmentation won't happen.
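A minimal sketch of that mapping in C, using the numbers from this example (32-unit pages, frames 21, 213, 23 and 234). The translate function and the flat array used as a page table are illustrative assumptions; real page tables are hardware-defined multi-level structures.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 32u  /* units per page, as in the example above */

/* Translate a virtual address into a physical address.
   page_table[p] holds the physical frame number of virtual page p. */
unsigned translate(const unsigned *page_table, size_t num_pages,
                   unsigned vaddr) {
    unsigned page = vaddr / PAGE_SIZE;    /* which virtual page */
    unsigned offset = vaddr % PAGE_SIZE;  /* position inside it */
    assert(page < num_pages);
    return page_table[page] * PAGE_SIZE + offset;
}
```

With the table {21, 213, 23, 234}, virtual addresses 31 and 32 are neighbors from the point of view of the process, but they translate to physical addresses 703 and 6816, which sit in unrelated frames.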
Frames are allocated as units. If the memory requirements of a process do not happen to coincide with page boundaries, the last frame allocated may not be completely full.
For example, if the page size is 2,048 bytes, a process of 72,766 bytes will need 35 pages plus 1,086 bytes. It will be allocated 36 frames, resulting in internal fragmentation of 2,048 - 1,086 = 962 bytes. In the worst case, a process would need n pages plus 1 byte. It would be allocated n + 1 frames, resulting in internal fragmentation of almost an entire frame.
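The same calculation as a short C sketch (the function names are made up for illustration):

```c
/* Number of frames needed to hold process_bytes, rounding up. */
unsigned long frames_needed(unsigned long process_bytes,
                            unsigned long page_size) {
    return (process_bytes + page_size - 1) / page_size;
}

/* Bytes allocated to the process but left unused in its last frame. */
unsigned long internal_fragmentation(unsigned long process_bytes,
                                     unsigned long page_size) {
    return frames_needed(process_bytes, page_size) * page_size
           - process_bytes;
}
```

For the 72,766-byte process with 2,048-byte pages this gives 36 frames and 962 wasted bytes. A process of 2,049 bytes (one page plus one byte) wastes 2,047 bytes, which is the worst case for this page size.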
I am implementing a spiking neural network using the CUDA library and am really unsure of how to proceed with regard to the following things:
Allocating memory (cudaMalloc) to many different arrays. Up until now, simply using cudaMalloc 'by hand' has sufficed, as I have not had to make more than 10 or so arrays. However, I now need to make pointers to, and allocate memory for thousands of arrays.
How to decide how much memory to allocate to each of those arrays. The arrays have a height of 3 (1 row for the postsynaptic neuron ids, 1 row for the number of the synapse on the postsynaptic neuron, and 1 row for the efficacy of that synapse), but they have an undetermined length which changes over time with the number of outgoing synapses.
I have heard that dynamic memory allocation in CUDA is very slow and so toyed with the idea of allocating the maximum memory required for each array, however the number of outgoing synapses per neuron varies from 100-10,000 and so I thought this was infeasible, since I have on the order of 1000 neurons.
If anyone could advise me on how to allocate memory to many arrays on the GPU, and/or how to code a fast dynamic memory allocation for the above tasks, I would be more than greatly appreciative.
Thanks in advance!
If you really want to do this, you can call cudaMalloc as many times as you want; however, it's probably not a good idea. Instead, try to figure out how to lay out the memory so that neighboring threads in a block will access neighboring elements of RAM whenever possible.
The reason this is likely to be problematic is that threads execute in groups of 32 at a time (a warp). NVidia's memory controller is quite smart, so if neighboring threads ask for neighboring bytes of RAM, it coalesces those loads into a single request that can be efficiently executed. In contrast, if each thread in a warp is accessing a random memory location, the entire warp must wait till 32 memory requests are completed. Furthermore, reads and writes to the card's memory happen a whole cache line at a time, so if the threads don't use all the RAM that was read before it gets evicted from the cache, memory bandwidth is wasted. If you don't optimize for coherent memory access within thread blocks, expect a 10x to 100x slowdown.
(side note: The above discussion is still applicable with post-G80 cards; the first generation of CUDA hardware (G80) was even pickier. It also required aligned memory requests if the programmer wanted the coalescing behavior.)
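One possible layout, which is my own suggestion and not something the answer above prescribes, is to pack all per-neuron arrays into a few large buffers plus an offsets array (the layout used by CSR sparse matrices). The sketch below is plain host-side C with malloc; the type and function names are hypothetical. On the device, each buffer would presumably be a single cudaMalloc allocation instead of thousands of small ones. It does not address arrays that grow over time; for that you would have to reserve slack per neuron or rebuild the buffers periodically.

```c
#include <stdlib.h>

/* All synapses of all neurons packed end to end.
   Synapses of neuron n occupy indices [offsets[n], offsets[n + 1]). */
typedef struct {
    size_t num_neurons;
    size_t *offsets;   /* num_neurons + 1 entries */
    int *post_id;      /* postsynaptic neuron id */
    int *post_slot;    /* synapse number on the postsynaptic neuron */
    float *efficacy;   /* efficacy of the synapse */
} SynapseTable;

void synapse_table_free(SynapseTable *t) {
    free(t->offsets);
    free(t->post_id);
    free(t->post_slot);
    free(t->efficacy);
    t->offsets = NULL;
    t->post_id = NULL;
    t->post_slot = NULL;
    t->efficacy = NULL;
}

/* counts[n] is the number of outgoing synapses of neuron n.
   Returns 0 on success, -1 if an allocation fails. */
int synapse_table_init(SynapseTable *t, const size_t *counts,
                       size_t num_neurons) {
    size_t total = 0;
    t->num_neurons = num_neurons;
    t->post_id = NULL;
    t->post_slot = NULL;
    t->efficacy = NULL;
    t->offsets = malloc((num_neurons + 1) * sizeof *t->offsets);
    if (t->offsets == NULL) return -1;
    for (size_t n = 0; n < num_neurons; n++) {
        t->offsets[n] = total;   /* running prefix sum of the counts */
        total += counts[n];
    }
    t->offsets[num_neurons] = total;

    size_t slots = total ? total : 1;  /* avoid zero-size allocations */
    t->post_id = malloc(slots * sizeof *t->post_id);
    t->post_slot = malloc(slots * sizeof *t->post_slot);
    t->efficacy = malloc(slots * sizeof *t->efficacy);
    if (!t->post_id || !t->post_slot || !t->efficacy) {
        synapse_table_free(t);
        return -1;
    }
    return 0;
}

/* Number of outgoing synapses of neuron n. */
size_t synapse_count(const SynapseTable *t, size_t n) {
    return t->offsets[n + 1] - t->offsets[n];
}
```

Keeping the three rows as separate arrays (rather than an array of 3-field records) means that neighboring threads reading the same field of neighboring synapses touch neighboring addresses, which is the access pattern the answer recommends.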
Admittedly I don't get it. Say you have a memory with a memory word of length of 1 byte. Why can't you access a 4 byte long variable in a single memory access on an unaligned address(i.e. not divisible by 4), as it's the case with aligned addresses?
The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.
Speed
Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.
The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.
Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:
    struct mystruct {
        char c;   // one byte
        int i;    // four bytes
        short s;  // two bytes
    };
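On a 32-bit processor it would most likely be laid out with padding: c at 0x0000 followed by three padding bytes, i at 0x0004 to 0x0007, and s at 0x0008 to 0x0009 followed by two padding bytes, for 12 bytes in total.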
The processor can read each of these members in one transaction.
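If you want to see what your own compiler does with this struct, offsetof and sizeof will tell you. This is a small sketch; the helper names are made up, and the exact numbers depend on the ABI (on common ABIs with a 4-byte int they come out as 3 bytes of padding after c and 12 bytes in total).

```c
#include <stddef.h>

struct mystruct {
    char c;   /* one byte */
    int i;    /* four bytes on common ABIs */
    short s;  /* two bytes on common ABIs */
};

/* Padding bytes the compiler inserted between c and i. */
size_t padding_after_c(void) {
    return offsetof(struct mystruct, i) - sizeof(char);
}

/* All padding bytes in the struct, including trailing padding. */
size_t total_padding(void) {
    return sizeof(struct mystruct)
           - (sizeof(char) + sizeof(int) + sizeof(short));
}
```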
Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it would have no padding, with c at 0x0000, i at 0x0001 to 0x0004, and s at 0x0005 to 0x0006.
Reading the first byte is going to be the same.
When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.
When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.
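The shift-and-OR sequence just described can be emulated in software. The sketch below is an illustration only: it models memory as a byte array, assumes big-endian byte order (which is what "shift left, then OR in the top bytes of the next word" implies), and counts how many aligned word reads were needed. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read the aligned 32-bit word at addr (a multiple of 4), big-endian.
   This stands in for one memory transaction. */
uint32_t load_aligned_word(const uint8_t *mem, size_t addr) {
    return ((uint32_t)mem[addr] << 24) | ((uint32_t)mem[addr + 1] << 16)
         | ((uint32_t)mem[addr + 2] << 8) | (uint32_t)mem[addr + 3];
}

/* Read 32 bits from any address using only aligned word reads.
   *transactions receives the number of word reads performed. */
uint32_t load_unaligned_u32(const uint8_t *mem, size_t addr,
                            unsigned *transactions) {
    size_t base = addr & ~(size_t)3;           /* word containing addr */
    unsigned shift = (unsigned)(addr & 3) * 8; /* misalignment in bits */
    uint32_t result = load_aligned_word(mem, base);
    *transactions = 1;
    if (shift == 0) return result;             /* aligned: one read */

    /* Misaligned: the value straddles two words, so read the next
       one and splice the two pieces together. */
    uint32_t next = load_aligned_word(mem, base + 4);
    *transactions = 2;
    return (result << shift) | (next >> (32 - shift));
}
```

Reading from address 1 of the bytes 00 11 22 33 44 55 66 77 returns 0x11223344 after two word reads, while reading from address 4 returns 0x44556677 after one.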
Range
For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states), or the same amount of memory with 2 bits for something like flags. Taking the 2 LSBs off of an address would give you a 4-byte alignment; also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.
This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.
Atomicity
The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
Conclusion
The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).
There are many more benefits to adhering to memory alignment that you can read at this IBM article.
A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data in, out, and between more and faster execution units, in a highly reliable way.
Bonus: Caches
Another alignment-for-performance that I alluded to previously is alignment on cache lines which are (for example, on some CPUs) 64B.
For more info on how much performance can be gained by leveraging caches, take a look at Gallery of Processor Cache Effects; from this question on cache-line sizes
Understanding of cache lines can be important for certain types of program optimizations. For example, the alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice as slow.
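Whether an access touches one or two cache lines is simple address arithmetic. A minimal sketch, with a hypothetical function name and the line size passed in as a parameter (64 is only the example figure used here, not a universal constant):

```c
#include <stddef.h>
#include <stdint.h>

/* Number of cache lines covered by an access of `size` bytes
   starting at byte address `addr`. */
size_t cache_lines_touched(uintptr_t addr, size_t size,
                           size_t line_size) {
    if (size == 0) return 0;
    uintptr_t first = addr / line_size;              /* line of first byte */
    uintptr_t last = (addr + size - 1) / line_size;  /* line of last byte */
    return (size_t)(last - first + 1);
}
```

With 64-byte lines, an 8-byte value at address 56 stays within one line, while the same value at address 60 straddles two.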
It's a limitation of many underlying processors. It can usually be worked around by doing 4 inefficient single byte fetches rather than one efficient word fetch, but many language specifiers decided it would be easier just to outlaw them and force everything to be aligned.
There is much more information in this link that the OP discovered.
You can with some processors (Nehalem can do this), but previously all memory access was aligned on a 64-bit (or 32-bit) line: because the bus is 64 bits wide, you had to fetch 64 bits at a time, and it was significantly easier to fetch these in aligned 'chunks' of 64 bits.
So, if you wanted to get a single byte, you fetched the 64-bit chunk and then masked off the bits you didn't want. Easy and fast if your byte was at the right end, but if it was in the middle of that 64-bit chunk, you'd have to mask off the unwanted bits and then shift the data over to the right place. Worse, if you wanted a 2 byte variable, but that was split across 2 chunks, then that required double the required memory accesses.
So, as everyone thinks memory is cheap, they just made the compiler align the data on the processor's chunk sizes so your code runs faster and more efficiently at the cost of wasted memory.
Fundamentally, the reason is because the memory bus has some specific length that is much, much smaller than the memory size.
So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU will have the vastly smaller width of the cache line size. This will be on the order of 128 bits.
So:
262,144 bits - size of memory
128 bits - size of bus
Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to the DRAM.
Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines which each have a piece of the data. On one line, it will be in the very high order bits, in the other, the very low order bits.
There will be dedicated hardware fully integrated into the pipeline that handles moving aligned objects onto the necessary bits of the CPU data bus, but such hardware may be lacking for misaligned objects, because it probably makes more sense to use those transistors for speeding up correctly optimized programs.
In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.
@joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects which were described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effect of different word alignments looks like.
In addition here's a link to a Github gist with the code for the test.
The test code is adapted from the article written by Jonathan Rentzsch which @joshperry referenced. The tests were run on a MacBook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16GB of RAM.
If you have a 32bit data bus, the address bus address lines connected to the memory will start from A2, so only 32bit aligned addresses can be accessed in a single bus cycle.
So if a word spans an address alignment boundary - i.e. A0 for 16/32 bit data or A1 for 32 bit data are not zero, two bus cycles are required to obtain the data.
Some architectures/instruction sets do not support unaligned access and will generate an exception on such attempts, so compiler generated unaligned access code requires not just additional bus cycles, but additional instructions, making it even less efficient.
If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.
On PowerPC you can load an integer from an odd address with no problems.
Sparc and I86 and (I think) Itanium raise hardware exceptions when you try this.
One 32-bit load vs. four 8-bit loads isn't going to make a lot of difference on most modern processors. Whether the data is already in cache or not will have a far greater effect.