Why is stack frame a multiple of 16 bytes long? - alignment

CSAPP explains that SSE instructions operate on 16-byte blocks of data and it needs memory addresses to be multiple of 16.
But what's the relationship with stack frame? Does it means SSE instructions operate on stack frame? If so, what's the commonly used instructions?

Yes, the alignmentment of stack frame is set so any instruction could work on any data type which you could potentially store in the stack frame.
So on x86/x86_64 for example there are SSE instructions which suppose that the memory address is aligned to 16 bytes. Compiler then supposes that the stack frame is aligned 16 bytes so it could arrange the local variables that they are aligned too if needed. SSE instructions (like any other) can operate on any memory, including global, heap or stack.
The same it is actually for heap - when you allocate structure longer than 16 (or equal), the malloc/new must return 16-bytes aligned address so that kind of instruction could work with it.

Related

How do stack machines efficiently store data types of different sizes?

Suppose I have the following primitive stack implementation for a virtual machine:
unsigned long stack[512];
unsigned short top = 0;
void push(unsigned long qword) {
stack[top] = qword;
top++;
}
void pop() {
top--;
}
unsigned long get() {
return top-1;
}
This stack actually works fine (except that it doesn't check for an overflow) but I now have the following problem: It is quite inefficient.
Here is an example:
Let's say I want to push a byte onto the stack. I would now have to cast it to a long and then push it onto the stack. But now a whole 7 bytes are not being used. This feels kind of wrong.
So now I have the following question:
How do stack machines efficiently store data types of different sizes? Do they do the same as in this implementation?
There are different metrics of efficiency. Using an eight bytes long to store a single byte will raise the memory consumption. On the other hand, memory is not the major concern on most of today’s machines. Further, a stack is a pre-allocated amount of memory, typically. So as long as not the entire memory block has been exhausted, it is entirely irrelevant whether the unused seven bytes are within that long or on the other side of the location marked by top.
In terms of CPU time, you don’t gain any advantage of transferring a quantity smaller than the hardware’s bus size. In the best case, it makes no difference. In the worst case, transferring a single byte boils down to reading a long from memory, manipulating one byte of it and writing the long back. In the latter case, it would be more efficient to expand the byte to long, to overwrite all eight bytes explicitly.
This is reflected by the design of the Java bytecode, for example. It does not only drop support for pushing and popping quantities smaller than 32 bit, it doesn’t even support arithmetic instructions for them¹. So for most use cases, you don’t even know that a quantity could be a byte before pushing. Only formal parameter types and array types may refer to byte.
But note that a JVM isn’t even a stack engine in the narrowest sense. There is no support for pushing and popping arbitrary numbers of items. As explained in this answer, expressing the intent using a stack allows very compact instructions. But Java bytecode doesn’t allow branching to code locations with a different number of items on the stack. So it doesn’t support pushing or popping items in a loop. In other words, for each instruction, the actual offset into the stack is predictable and also the operand types are known. So it’s always possible to transform Java bytecode to an IR not using a stack straight-forwardly. Such transformed code could use instructions with arbitrary operand sizes, if that has a benefit on the particular target architecture.
¹ And that was accounting for hardware in use a quarter century ago
There's no "one true" way of doing this, and the Java VM uses a few different strategies. All types less than 32-bits in size are widened to 32-bits. Pushing 1 byte to the stack effectively pushes 4 bytes to the stack. The benefit is simplicity when there are fewer native value sizes to deal with.
Another strategy is used for 64-bit values. They occupy two stack slots instead of one. The JVM has specific opcodes which indicate which type of value they expect on the stack, and the verifier ensures that no opcode is attempting to access a variable off the stack that doesn't match the type that should be there.
A third strategy is used for object references. The actual pointer size can be 32 bits or 64 bits, depending on the CPU capabilities, whether the JVM is running in 64-bit mode, etc. The JVM has specific opcodes for handling object references, and the verifier checks this too.

Is it easier to fetch a 4 byte word from a word addressable memory compared to byte addressable?

So i did find some answers related to this in stackvoerflow but non of them clearly answered this
so if our memory is byte addressable and the word size is for example 4 byte, then why not make the memory byte addressable?
if i'm not mistaking CPU will work with words right? so when the cpu tries to get a word from the memory what's the difference between getting a 4 byte word from a byte addressable memory vs getting a word from word addressable memory?
if i'm not mistaking CPU will work with words right?
It depends on the Instruction Set Architecture (ISA) implemented by the CPU. For example, x86 supports operands of sizes ranging from a single 8-bit byte to as much as 64 bytes (in the most recent CPUs). Although the word size in modern x86 CPUs is 8 or 4 bytes only. The word size is generally defined as equal to the size of a general-purpose register. However, the granularity of accessing memory or registers is not necessarily restricted to the word size. This is very convenient from a programmer's perspective and from the CPU implementation perspective as I'll discuss next.
so when the cpu tries to get a word from the memory what's the
difference between getting a 4 byte word from a byte addressable
memory vs getting a word from word addressable memory?
While an ISA may support byte addressability, a CPU that implements the ISA may not necessarily fetch data from memory one byte at a time. Spatial locality of reference is a memory access pattern very common in most real programs. If the CPU was to issue single-byte requests along the memory hierarchy, it would unnecessarily consume a lot of energy and significantly hurt performance to handle single-byte requests and move one-byte data across the hierarchy. Therefore, typically, when the CPU issues a memory request for data of some size at some address, a whole block of memory (known as a cache line, which is usually 64-byte in size and 64-byte aligned) is brought to the L1 cache. All requests to the same cache line can be effectively combined into a single request. Therefore, the address bus between different levels of the memory hierarchy does not have to include wires for the bits that constitute an offset within the cache line. In that case, the implementation would be really addressing memory at the 64-byte granularity.
It can be useful, however, to support byte addressability in the implementation . For example, if only one byte of a cache line has changed and the cache line has to be written back to main memory, instead of sending all the 64 bytes to memory, it would take less energy, bandwidth, and time to send only the byte that changed (or few bytes). Another situation where byte addressability is useful is when providing support for the idea of critical-word first. This is much more to it, but to keep the answer simple, I'll stop here.
DDR SDRAM is a prevalent class of main memory interfaces used in most computer systems today. The data bus width is 8 bytes in size and the protocol supports only transferring aligned 8-byte chunks with byte enable signals (called data masks) to select which bytes to write. Therefore, main memory is typically 8-byte addressable. It is the CPU that provides the illusion of byte addressability.
memory normally is byte-addressable. But whole-word loads are possible, and get 4x as much data in the same time.
There's basically no difference, if the word load is naturally aligned; the low bits of the address are zero instead of being not present.

Is data loaded into the cache aligned to the cache line size?

Let's say that the cache line size is 64 bytes and that I have an object whose size is also 64 bytes. If this object is accessed, will it be:
All loaded into one cache line
Only the part between the start of the object and the next multiple of 64 bytes will be loaded
The object will be loaded into two different cache lines
Something else
I have a feeling that the answer differs from processor to processor, but what is the most likely outcome on modern CPU's?
When it comes to machine instruction level, concept of objects that are available in higher level languages disappears. Accessing their members through higher level assignment operations will be translated to regular read and write instructions. So if the runtime system, or the virtual machine of the language is smart and able to allocate objects to gain better cache utilization, in your case, into 64 byte aligned addresses, when the higher level language reads any member of the object, which could be anywhere within the 64 byte aligned address, the entire object will be loaded to the cacheline (because its allocated at 64 byte aligned address). If the runtime system is dumb, if it just allocates objects as requested in the flow of the program, without looking into situations (64 byte objects and 64 byte cache lines) as mentioned in your question, then when a read happens to a member of the object, data at the 64 byte aligned address of that member will be loaded into the cache. Therefore in the latter case you need to be either lucky or write code such that special padding is in place, either front or back of the object to make it cache line aligned.

How is there internal fragmentation in paging and no external fragmentation?

Please explain it nicely. Don't just write definition. Also explain what it does and how is it different from segmentation.
Fragmentation needs to be considered with memory allocation techniques. Paging is basically not a memory allocation technique, but rather a means of providing virtual address spaces.
Considering the comparison with segmentation, what you're probably asking about is the difference between a memory allocation technique using fixed size blocks (like the pages of paging, assuming 4KB page size here) and a technique using variable size blocks (like the segments used for segmentation).
Now, assume that you directly use the page allocation interface to implement memory management, that is you have two functions for dealing with memory:
alloc_page, which allocates a single page and returns a pointer to the beginning of the newly available address space, and
free_page, which frees a single, allocated page.
Now suppose all of your currently available virtual memory is used, but you need to store 1 additional byte. You call alloc_page and get a 4KB block of memory. You only use 1 byte of that huge block, but also the other 4095 bytes are, from the perspective of the allocator, used. If this happens multiple times eventually all pages will be allocated, so further calls to alloc_page will fail. Even if you just need another additional byte (which could be one of the 4095 that got wasted above) the allocator will tell you that you're out of memory. This is internal fragmentation.
If, on the other hand, you would use variable sized blocks (like in segmentation), then you're vulnerable to external fragmentation: Suppose you manage 6 bytes of memory (F means "free"):
FFFFFF
You first allocate 3 bytes for a, then 1 for b and finally 2 bytes for c:
aaabcc
Now you free both a and c, leaving only b allocated:
FFFbFF
You now have 5 bytes of unused memory, but if you try to allocate a block of 4 bytes (which is less than the available memory) the allocation will fail due to the unfavorable placement of the memory for b. This is external fragmentation.
Now, if you extend your page allocator to be able to allocate multiple pages and add alloc_multiple_pages, you have to deal with both internal and external fragmentation.
There is no external fragmentation in paging but internal fragmentation exists.
First, we need to understand what is external fragmentation. External fragmentation occurs when we have a memory to accommodate a process but it's not continuous.
How does it not occur in paging?
Paging divides virtual memory or all processes into equal-sized pages and physical memory into fixed size frames. So you are typically fixing equal size blocks called pages into equal block shaped spaces called frames! Try to visualize and conclude that there can never be external fragmentation.
In the case of segmentation, we divide virtual addresses into different sized blocks that is why there may be the case some blocks in main memory must stick together or compact to make space for the new process! I hope it helps!
When a process is divided into fix sized pages, there is generally some leftover space in the last page(internal fragmentation). When there are many processes, each of their last page's unused area could add up to be greater than or equal to size of one page. Now even if you have to total free size of one page or more but you cannot load a new page because a page has to be continuous. External fragmentation has happened. So, I don't think external fragmentation is completely zero in paging.
EDIT: It is all about how External Fragmentation is defined. The collection of internal fragmentation do not contribute to external fragmentation. External fragmentation is contributed by the empty space which is EXTERNAL to partition(or page). So if suppose there are only two frames in main memory ,say of size 16B, each occupied by only 1B data. The internal fragmentation in each frame is 15B. The total unused space is 30B. Now if you want to load one new page of some process, you will see that you do not have any frame available. You are unable to load a new page eventhough you have 30B unused space. Will you call this as external fragmentation? Answer is no. Because these 15B unused space are INTERNAL to the pages. So in paging, internal fragmentation is possible but not external fragmentation.
Paging allows a process to be allocated physical memory in non-contiguous fashion. I will answer that why external fragmentation can't occur in paging.
External frag occurs when a process, which was allocated contiguous memory , is unloaded from physical memory, which creates a hole (free space ) in the memory.
Now if a new process comes, which requires more memory than this hole, then we won't be able to allocate contiguous memory to that process due to non contiguous nature of free memory, this is called external fragmentation.
Now, the problem above originated due to the constraint of allocating contiguous memory to the process. This is what paging solved by allowing process to get non contiguous physical memory.
In paging, the probability of having external fragmentation is very low although internal fragmentation may occur.
In paging scheme, the whole main memory and the virtual memory is divided into some fixed size slots which are called pages (in case of virtual memory) and page frames (in case of main memory or RAM or physical memory). So, whenever a process is executed in main memory, it occupies the entire space of a page frame. Let us say, the main memory has 4096 page frames with each page frame having a size of 4096 bytes. Suppose, there is a process P1 which requires 3000 bytes of space for its execution in main memory. So, in order to execute P1, it is brought from virtual memory to main memory and placed in a page frame (F1) but P1 requires only 3000 bytes of space for its execution and as a result of which (4096 - 3000 = 1096 bytes) of space in the page frame F1 is wasted. In other words, this denotes the case of internal fragmentation in the page frame F1.
Again, external fragmentation may occur if some space of the main memory could not be included in a page frame. But this case is very rare as usually the size of a main memory, the size of a page frame as well as the total no. of page frames in main memory can be expressed in terms of power of 2.
As far as I've understood, I would answer your question like so:
Why is there internal fragmentation with paging?
Because a page has fixed size, but processes may request more or less space. Say a page is 32 units, and a process requests 20 units. Then when a page is given to the requesting process, that page is no longer useable despite having 12 units of free "internal" space.
Why is there no external fragmentation with paging?
Because in paging, a process is allowed to be allocated spaces that are non-contiguous in the physical memory. Meanwhile, the logical representation of those blocks will be contiguous in the virtual memory. This is what I mean:
A process requires 128 units of space. This is 4 pages as in the previous example. Unregardless of the actual page numbers (formally frame numbers) in the physical memory, you give those pages the numbers 0, 1, 2, and 3. This is the virtual representation that is the defining characteristic of paging itself. Those pages may be 21, 213, 23, 234 in the actual physical memory. But they can really be anything, contiguous or non-contiguous. Therefore, even if paging leaves small free spaces in between used spaces, those small free spaces can still be used together as if they were one contiguous block of space. That's why external fragmentation won't happen.
Frames are allocated as units. If the memory requirements of a process do not happen to coincide with page boundaries, the last frame allocated may not be completely full.
For example, if the page size is 2,048 bytes, a process of 72,766 bytes will need 35 pages plus 1,086 bytes. It will be allocated 36 frames, resulting in internal fragmentation of 2,048 - 1,086 = 962 bytes. In the worst case, a process would need 11 pages plus 1 byte. It would be allocated 11 + 1 frames, resulting in internal fragmentation of almost an entire frame.

x86 Can push/pop be less than 4 bytes? [duplicate]

This question already has answers here:
Why is it not possible to push a byte onto a stack on Pentium IA-32?
(4 answers)
Closed 7 years ago.
Hi I am reading a guide on x86 by the University of Virginia and it states that pushing and popping the stack either removes or adds a 4-byte data element to the stack.
Why is this set to 4 bytes? Can this be changed, could you save memory on the stack by pushing on smaller data elements?
The guide can be found here if anyone wishes to view it:
http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
Short answer: Yes, 16 or 32 bits. And, for x86-64, 64 bits.
The primary reasons for a stack are to return from nested function calls and to save/restore register values. It is also typically used to pass parameters and return function results. Except for the smallest parameters, these items usually have the same size by the design of the processor, namely, the size of the instruction pointer register. For 8088/8086, it is 16-bits. For 80386 and successors, it is 32-bits. Therefore, there is little value in having stack instructions that operate on other sizes.
There is also the consideration of the size of the data on the memory bus. It takes the same amount of time to retrieve or store a word as it does a byte. (Except for 8088 which has 16-bit registers but an 8-bit data bus.) Alignment also comes into play. The stack should be aligned on word boundaries so each value can be retrieved as one memory operation. The trade-off is usually taken to save time over saving memory. To pass one byte as a parameter, one word is usually used. (Or, depending on the optimization available to the compiler, one word-sized register would be used, avoiding the stack altogether.)

Resources