How to learn the associativity (number of ways) of the TLB?

I have a task to learn the number of ways in TLB-cache. Which algorithm should I use?

The question is a bit unclear as to what you need help with, so this is a summary of information related to the topics you mention.
There are two "ways" to translate an address. The first is direct mapping, where the page table is kept in memory and is indexed by virtual page number. To translate from virtual page number to real page number, the OS (or the hardware page walker) goes to the base address of the page table and adds the virtual page number, scaled by the size of an entry. The value at this location gives the real address of the page.
The other way is associative mapping. Associative mapping keeps the page table in content-addressed memory, so when a virtual address is looked up, all the process's pages are searched in parallel giving O(1) lookup time complexity. Another advantage is that this stores only the pages that have been actually allocated.
The problem is that associative mapping requires special hardware to accomplish the content-addressed memory.
So the trade-off is that a small amount of content-addressed memory is used (a TLB = translation lookaside buffer which you refer to in your question), with the majority using direct mapping.
Then the big consideration is when to place an address in the TLB and which old one to evict from the TLB. For this there are many choices: most likely it will be Least Recently Used (LRU) to exploit temporal locality. Other choices could be Least Frequently Used, Round Robin (probably not very good here), WS_Clock, etc.
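To make the trade-off above concrete, here is a minimal sketch of a small, fully associative TLB with LRU eviction sitting in front of a flat, directly indexed page table. The class name `TinyTLB`, the 4 KiB page size, the capacity of 2 and the sample mapping are all assumptions made for illustration; a real TLB is hardware and is usually set-associative.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TinyTLB:
    """Fully associative TLB with LRU eviction in front of a flat page table."""

    def __init__(self, page_table, capacity):
        self.page_table = page_table   # index = virtual page number, value = frame number
        self.capacity = capacity
        self.entries = OrderedDict()   # virtual page number -> frame number, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)             # mark as most recently used
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)      # evict the least recently used entry
            self.entries[vpn] = self.page_table[vpn]  # direct-mapped lookup in memory
        return self.entries[vpn] * PAGE_SIZE + offset


page_table = [7, 3, 9, 1]  # hypothetical mapping: virtual page number -> frame number
tlb = TinyTLB(page_table, capacity=2)
addresses = [0x0000, 0x1004, 0x0008, 0x2000, 0x1000]
physical = [tlb.translate(a) for a in addresses]
print(physical, tlb.hits, tlb.misses)
```

With a capacity of 2, the third access hits (page 0 is still cached), while the last access misses because page 1 was the least recently used entry when page 2 was brought in.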

Related

What makes having an array of pointers to page table entries better than having an array of page table entries?

Here in this lecture, some extreme implementations of a page table are discussed as well as some reasonable ones.
One of the extreme cases was to allocate a flat array that maps every possible virtual address to a physical address.
At minute 19:19 (where the given link starts), the lecturer says that he's talking about a flat array of pointers to PTEs, and then mentions that he could've done something even more stupid, which is to use an array of actual page table entries.
Why would having an array of pointers to PTEs be better than having an actual array of PTEs?
He is talking about a 32-bit system where an address is 4 bytes, but a PTE is also 4 bytes.
Isn't having an array of pointers more wasteful because it'll take double the space (4 bytes for the pointer and 4 for the PTE)?
Also, I believe that allocating a lot of PTEs that're spread across the physical memory would cause fragmentation and will be hard to manage, as opposed to creating just an array of PTEs which will be one chunk of memory that does not need a lot of management.
Why would having an array of pointers be a better case?
If you have many virtual address spaces that are all 4 GiB each; you might find that:
a large area (e.g. 1 GiB) is used for kernel and needs to be the same in all virtual address spaces. Because this is the same in all virtual address spaces, any modification in this area needs to modify every virtual address space at the same time.
various areas will be shared by 2 or more virtual address spaces for other reasons - memory mapped files, shared libraries, shared memory, "copy on write" areas caused by "fork()", etc.
some areas will be the same in the same virtual address space (e.g. refer to the same "read only physical page full of zeros" to implement an "allocate on write" strategy)
a lot of space will be entirely unused (maybe an average of 2 GiB per virtual address space)
For anything that's used in multiple places; a "pointer to PTE" would give you a single PTE that can be modified regardless of how many places the page is used.
For one example; let's say you have a "C standard library" shared by 40 different processes (and included in 40 different virtual address spaces), but part of the library's code is still on disk and the PTE/s for those parts say "not present". When any process needs that part of the shared library you get a page fault (because it's not present yet) and the OS has to fetch the page from disk. In this case "pointer to PTE" means the OS can change one PTE (from "not present" to "present") and doesn't need to figure out how many PTEs need to be updated, then update 40 different PTEs for 40 different virtual address spaces/processes.
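The shared-library example can be sketched as follows. This is only an illustration under assumptions: the `PTE` class, the page number `LIBC_VPN` and the frame number 1234 are hypothetical, and Python object references stand in for "pointer to PTE".

```python
class PTE:
    """A page table entry reduced to the two fields the example needs."""

    def __init__(self, frame=None, present=False):
        self.frame = frame
        self.present = present


NUM_PROCESSES = 40
LIBC_VPN = 5  # hypothetical virtual page number of the shared library page

# "Array of pointers to PTEs": every address space refers to the same PTE object.
shared_pte = PTE()
pointer_spaces = [{LIBC_VPN: shared_pte} for _ in range(NUM_PROCESSES)]

# "Array of PTEs": every address space holds its own private copy.
flat_spaces = [{LIBC_VPN: PTE()} for _ in range(NUM_PROCESSES)]


def page_in_shared(pte, frame):
    """Mark the single shared entry present; returns the number of PTE writes."""
    pte.frame = frame
    pte.present = True
    return 1


def page_in_flat(spaces, vpn, frame):
    """Mark every private copy present; returns the number of PTE writes."""
    writes = 0
    for space in spaces:
        space[vpn].frame = frame
        space[vpn].present = True
        writes += 1
    return writes


shared_writes = page_in_shared(shared_pte, frame=1234)
flat_writes = page_in_flat(flat_spaces, LIBC_VPN, frame=1234)
print(shared_writes, flat_writes)
```

In the pointer variant one write makes the page present in all 40 address spaces; in the flat variant the kernel also has to know which address spaces to visit, which the sketch sidesteps by passing the list in.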
Isn't having an array of pointers more wasteful because it'll take double the space (4 bytes for the pointer and 4 for the PTE)?
Array of pointers to PTEs would waste more space, but it's hard to say how much space (it wouldn't be "double" because lots of PTEs would be used multiple times, and might be closer to 50% more space). Array of PTEs would waste more CPU time (in kernel's code trying to manage everything) instead, and (if you take into account kernel using its own additional data/memory to be able to figure out which pages are shared where) it might actually cost more memory.
However...
They are both relatively awful; and I'd expect that the lecturer is preparing to introduce multi-level paging (where "one pointer to PTE per page" gets replaced with "one pointer to group of PTEs"), which is what most real CPUs use.
Walking the page tables to find a translation incurs overhead. Every now and then a new paper is published explaining how its implementation is superior. Some suggest hashing the page tables. I suggest you don't overthink it: understanding the principles of walking the page tables and a simple implementation is enough to get a grasp.

What's the difference between Virtually Indexed Physically tagged and Virtually Indexed and Virtually Tagged Paging?

Also, state whether there are any other methods for indexing and tagging.
The CPU cache is divided in cache lines, which have a fixed size (usually 64 bytes). In general, we can say that each cache line is identified by the address of the memory it refers to (with the last 6 bits discarded because they refer to an offset inside the cache line).
To make the lookup faster the address is split in two parts: index and tag. The index addresses a set of cache lines whose position is known: accessing a set is really fast. Inside an N-way associative set you have N cache lines, in no particular order, which are identified using a tag.
Now we said that tag and index are portions of the memory address, but what type of address? Physical or virtual?
In theory you can have any combination of physically indexed (PI), physically tagged (PT), virtually indexed (VI), and virtually tagged (VT).
Each combination has its pros and cons. In general, we can say that using physical addresses has the drawback of having to wait for the virtual address to be translated (which can be expensive in case of a TLB miss), on the other hand, using virtual addresses, while faster, can cause coherency problems because multiple virtual addresses can map to the same physical address and mappings can change over time requiring cache flushes.
For these reasons, PIPT is used only for slow cache (e.g. L2/L3), VIVT is rarely used, PIVT is almost never used and VIPT is used for fast (and small) cache.
The advantage of using VIPT is that, while the lookup can start in parallel with the address translation (and so it's faster than PIPT), it uses the physical address for the last part of the lookup so, with a properly chosen size for the index, it can prevent coherency problems.
The correct size for the index depends on the page size. Translation between virtual and physical addresses is done at page granularity, so if the index (together with the offset inside the cache line) is taken entirely from the page-offset bits, those bits are identical in the virtual and the physical address and it makes no difference which of the two is used. Unfortunately this limits the size of the cache (each way can be no larger than one page), hence the reason why it is used only for fast and small caches (e.g. L1).
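The size limit can be checked with a little arithmetic. The sketch below assumes power-of-two sizes; the function names and the sample cache geometries are made up for illustration and do not describe any particular CPU.

```python
def index_fits_in_page_offset(cache_size, line_size, ways, page_size):
    """True if the line offset plus set index use only page-offset bits."""
    sets = cache_size // (line_size * ways)
    # Bits consumed by the offset inside the line plus the set index.
    index_and_offset_bits = (sets * line_size).bit_length() - 1
    page_offset_bits = page_size.bit_length() - 1
    return index_and_offset_bits <= page_offset_bits


def max_alias_free_size(page_size, ways):
    """Largest VIPT cache whose index never depends on translated bits."""
    return page_size * ways


small_l1 = index_fits_in_page_offset(32 * 1024, 64, 8, 4096)  # 64 sets: 6 + 6 bits
large_l1 = index_fits_in_page_offset(64 * 1024, 64, 8, 4096)  # 128 sets: 6 + 7 bits
limit = max_alias_free_size(4096, 8)
print(small_l1, large_l1, limit)
```

With 4 KiB pages and 64-byte lines, an 8-way cache can grow to 32 KiB before the index starts using bits that translation may change; beyond that, more ways or larger pages are needed.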

Do modern OS's use paging and segmentation?

I was reading about memory architecture and I got a bit confused about paging and segmentation. I read that modern OSes use only paging to manage memory access, but looking at disassembled code I can see segments like "ds" and "fs". Does it mean that the OS (I saw this on Windows and Linux) is using both segmentation and paging, or is it just mapping all the segments onto the same pages (making segments irrelevant)?
OK, based on the book Modern Operating Systems, 3rd Edition, by Andrew S. Tanenbaum and the materials from Open Security Training (opensecuritytraining.info), I managed to understand segmentation and paging, and the answer to my question is:
Concepts:
1.1.Segmentation:
Segmentation is the division of memory into pieces (sections) called segments. These segments are independent of each other, have variable sizes and may grow as needed.
1.2. Virtual Memory:
Virtual memory is an abstraction of real memory. This means that a virtual address (used by programs) is mapped to a physical address (used by the hardware). If a program wants to access memory address 2000 (mov eax, [2000]), the virtual address (2000) is converted into the real physical address (for example 1422), and the program then accesses memory at 1422 while thinking that it is accessing memory at 2000.
So, if virtual memory is being used by the system, programs no longer access real memory directly; instead, they use the corresponding virtual addresses.
1.3. Paging:
Paging is a virtual memory management scheme. As explained before, it maps a virtual address to a physical address. Paging divides the virtual memory into pieces called “pages” and also divides the physical memory into pieces called “page frames”. A page can be bound to a page frame (keep in mind that a page can be mapped to different page frames over time, but only one at a time).
Advantages and Disadvantages
2.1. Segmentation:
The main advantage of using segmentation is that it divides the memory into different linear address spaces. With this, a program can have one address space for its code, another for the stack, another for dynamic variables (heap), etc. The disadvantage of segmentation is the fragmentation of memory.
Consider this example – One portion of the memory is divided into 4 segments sequentially allocated -> seg1(in use) --- seg2(in use)--- seg3(in use)---seg4(in use). Now if we free the memory from seg2, we will have -> seg1(in use) --- seg2(FREE)--- seg3(in use)---seg4(in use). If we want to allocate some data, we can do it by using seg2, but if the size of data is bigger than the size of the seg2, we won’t be able to do so and the space will be wasted and the memory fragmented. Another problem is that some segments could have a larger size and since this segment can’t be “broken” into smaller pieces, it must be fully allocated in memory.
2.2. Paging:
The main advantage of using paging is that it abstracts the physical memory, so programs (and programmers) don’t need to bother with physical memory addresses. You can have two programs that both access the (virtual) address 2000, because it will be mapped to two different physical addresses. Also, it’s possible to use the hard disk so that only the necessary pages are kept in memory. The main disadvantage is that paging uses only one linear address space. Paging maps virtual addresses from 0 to the maximum addressable value for all programs. Now consider this example: two chunks of memory, “chk1” and “chk2”, are next to each other in virtual memory (chk1 is allocated from address 1000 to 2000 and chk2 uses 2001 to 3000). If chk1 needs to grow, the OS needs to make room in memory for the new size of chk1. If that happens again, the OS will have to do the same thing or, if no space can be found, an exception will be raised. The routines for managing this kind of operation are very bad for performance, because this growing and shrinking of memory happens very often.
Segmentation and Paging
3.1. Why combine both?
An operating system can combine both segmentation and paging. Why? Because by combining them it’s possible to have the best of both without their disadvantages. The memory is divided into many segments and each of those has one or more pages. When a program tries to access a memory address, the system first goes to the corresponding segment, where it finds the corresponding page, and from the page it finds the page frame that the program wanted to access. Using this approach, the OS divides the memory into various segments, including segments for the kernel and user programs. These segments have variable sizes and access protection features. To solve the fragmentation and the “big segment” problems, paging is used. With paging a larger segment is broken into several pages and only the necessary pages remain in memory (and more pages can be brought into memory if needed). So, basically, such OSes have two memory abstractions, where segmentation is used more for “handling the programs” and paging for “managing the physical memory”.
3.2. How it really works?
An OS running segmentation and Paging will have the following structures:
3.2.1. Segment Selector:
This represents an index into the Global/Local Descriptor Table. It contains 3 fields: the index into the descriptor table, a bit that identifies whether the segment is in the Global or the Local Descriptor Table, and a privilege level.
3.2.2. Segment Register:
A CPU register used to store a segment selector. On an x86 machine these are CS (Code Segment), DS (Data Segment), SS, ES, FS and GS.
3.2.3. Segment descriptors:
This structure contains the data describing a segment, like its base address, size (in pages or in bytes), the access privilege, whether the segment is present in memory or not, etc. (for all the fields, search for segment descriptors on Google)
3.2.4. Global/Local Descriptor Table:
The table that contains several segment descriptors. So this structure holds all the segment descriptors for the system. If the table is Global, it can hold other things like Local Descriptor Table descriptors, Task State Segment descriptors and Call Gate descriptors (I won’t explain these here; please google them). If the table is Local, it will (as far as I know) only hold user-related segment descriptors.
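As an illustration of the three selector fields described in 3.2.1, here is a small sketch that decodes a 16-bit x86 segment selector. The function name `decode_selector` is made up; the bit layout (privilege level in bits 0-1, table indicator in bit 2, index in bits 3-15) is the standard x86 one, and 0x001B is just a sample value.

```python
def decode_selector(selector):
    """Split a 16-bit x86 segment selector into (index, table, privilege level)."""
    rpl = selector & 0b11                          # bits 0-1: requested privilege level
    table = "LDT" if selector & 0b100 else "GDT"   # bit 2: table indicator
    index = selector >> 3                          # bits 3-15: index into that table
    return index, table, rpl


fields = decode_selector(0x001B)
print(fields)
```

For 0x001B this gives descriptor index 3 in the GDT with privilege level 3, that is, a user-mode segment.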
3.3. How it works?
So, to access memory, the program first needs to access a segment. To do this, a segment selector is loaded into a segment register and then the Global or Local Descriptor Table is consulted (depending on the table field of the segment selector). This is the reason the full memory address has the form SEGMENT REGISTER:ADDRESS, like CS:ADDRESS -> 001B:0044BF7A. Now the CPU goes to the G/LDT and (using the index field of the segment selector) finds the segment descriptor for the address being accessed. Then it checks whether the segment is present, checks the protection, and if everything is OK it goes to the address stated in the “base” field of the descriptor + the address offset. If paging is not enabled, the system goes directly to real memory, but with paging on, the address is treated as a virtual address and goes through the page directory. The base address + the offset is called the linear address and is interpreted as 3 fields: Directory+Page+offset. In the page directory, the CPU looks up the entry specified by the “directory” field of the linear address; this entry points to a page table, and the “page” field of the linear address is used to find the entry in that table; this entry points to a page frame, and the offset is used to find the exact address that the program wants to access.
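The Directory+Page+offset split can be sketched like this, assuming classic 32-bit x86 paging with 4 KiB pages (10 + 10 + 12 bits, no PAE). The function name is made up, and the sample value reuses the offset 0044BF7A from the example address, assuming a segment base of 0 so that the offset is also the linear address.

```python
def split_linear_address(linear):
    """Split a 32-bit linear address into (directory, page, offset)."""
    directory = (linear >> 22) & 0x3FF   # top 10 bits: index into the page directory
    page = (linear >> 12) & 0x3FF        # next 10 bits: index into the page table
    offset = linear & 0xFFF              # low 12 bits: byte offset inside the page frame
    return directory, page, offset


parts = split_linear_address(0x0044BF7A)
print(parts)
```

Here the lookup would use page directory entry 1, then entry 75 of the page table it points to, and finally byte 0xF7A (3962) inside the resulting page frame.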
3.4. Modern Operating Systems
Modern OSes "do not use" segmentation. It's in quotes because they use 4 segments: Kernel Code Segment, Kernel Data Segment, User Code Segment and User Data Segment. What this means is that all user processes have the same code and data segments (so the same segment selectors). The segments only change when going from user to kernel.
So, the whole path explained in section 3.3 still occurs, but every process uses the same segments and, since the page tables are individual per process, isolation between processes comes from paging rather than from segmentation.
Hope this helps, and if there are any mistakes or you want more details (I was a bit generic in some parts), please feel free to comment and reply.
Thank you guys
Danilo PC
They only use paging for memory protection, while they use segmentation for other purposes (like storing thread-local data).

How is RAM able to access any place in memory at O(1) speed?

We are taught that the abstraction of RAM is a long array of bytes, and that it takes the CPU the same amount of time to access any part of it. What is the device that has the ability to access any byte out of the 4 gigabytes (on my computer) in the same time? This does not seem like a trivial task to me.
I have asked colleagues and my professors, but nobody can pinpoint how this task can be achieved with simple logic gates, and if it isn't just a tricky combination of logic gates, then what is it?
My personal guess is that you could achieve access to any memory location in O(log(n)) time, where n is the size of the memory, because each gate would split the memory in two and send your memory access instruction on to the next gate, which again splits the memory in two. But that requires A LOT of gates. I can't come up with any other educated guess, and I don't even know the name of the device that I should look up on Google.
Please help my anguished curiosity, and thanks in advance.
Edit:
This is what I learned!
To quote one of the answers: "the RAM can send the value from cell addressed X to some output pins". This is where everyone skips (again) the part that is not trivial to me. The way I see it, in order to build a circuit that decides from 64 pins which byte out of 2^64 to get, each pin needs to split the overall possible range of memory in two. If the bit at index 0 is 0, then the address is in the range 0 to 2^64/2, else the address is in the range 2^64/2 to 2^64, and so on. The number of gates (let's call them that) that a memory fetch goes through will then be 64 (a constant). However, the number of gates needed is N, where N is the number of memory bytes.
Just because there are 64 pins doesn't mean that you can decode the address into a single fetch from a range of 2^64. Does a 4-gigabyte memory come with 4 giga-gates in the memory controller???
This can be improved. As I read furiously more and more about how this memory is architected: if you arrange the memory as a matrix with sqrt(N) rows and sqrt(N) columns, the number of gates that a memory fetch needs to go through is O(log(2*sqrt(N))) and the number of gates required is 2*sqrt(N), which is much better, and I think it's probably a trade secret.
What the heck, I might as well make this an answer.
Yes, in the physical world, memory access cannot be constant time.
But it cannot even be logarithmic time. The O(log n) circuit you have in mind ultimately involves some sort of binary (or whatever) tree, and you cannot make a binary tree with constant-length wires in a 3D universe.
Whatever the "bits per unit volume" capacity of your technology is, storing n bits requires a sphere with radius O(n^(1/3)). Since information can only travel at the speed of light, accessing a bit at the other end of the sphere requires time O(n^(1/3)).
But even this is wrong. If you want to talk about actual limitations of our universe, our physics friends say the absolute maximum number of bits you can store in any sphere is proportional to the sphere's surface area, not its volume. So the actual radius of a minimal sphere containing n bits of information is O(sqrt(n)).
As I mentioned in my comment, all of this is pretty much moot. The models of computation we generally use to analyze algorithms assume constant-access-time RAM, which is close enough to the truth in practice and allows a fair comparison of competing algorithms. (Although good engineers working on high-performance code are very concerned about locality and the memory hierarchy...)
Let's say your RAM has 2^64 cells (places where it is possible to store a single value, let's say 8-bit). Then it needs 64 pins to address every cell with a different number. When at the input pins of your RAM there 'appears' a binary number X the RAM can send the value from cell addressed X to some output pins, and your CPU can get the value from there. In hardware the addressing can be done quite easily, for example by using multiple NAND gates (such 'addressing device' from some logic gates is called a decoder).
So it is all happening at the hardware level; this is just direct addressing. If the CPU is able to provide 64 bits to 64 pins of your RAM, it can address every single memory cell (as 64 bits are enough to represent any number up to 2^64 - 1). The only reason why you do not get the value immediately is a kind of 'propagation time', that is, the time it takes for the signal to go through all the logic gates in the circuit.
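The decoder mentioned above can be modelled in software. This is only a behavioural sketch, not a circuit description: the function name `decode` is made up, and each output line stands for one wide AND gate over the address bits or their complements.

```python
def decode(address, address_bits):
    """Model an n-to-2^n decoder: return the list of select lines (one-hot)."""
    lines = []
    for cell in range(2 ** address_bits):
        # A select line is high only if every address bit matches that cell's number.
        selected = all(
            ((address >> bit) & 1) == ((cell >> bit) & 1)
            for bit in range(address_bits)
        )
        lines.append(int(selected))
    return lines


select_lines = decode(5, 3)
print(select_lines)
```

Exactly one of the 2^n lines goes high, and each line depends on the n address bits directly, so the signal does not have to pass through the other lines first; the price is one select line per cell, which is why the number of gates grows with the memory size.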
The component responsible for dealing with memory accesses is the memory controller. It is used by the CPU to read from and write to memory.
The access time is constant because memory words are truly laid out in matrix form (thus, the "byte array" abstraction is very realistic), where you have rows and columns. To fetch a given memory position, the desired memory address is passed on to the controller, which then activates the right row and column.
From http://computer.howstuffworks.com/ram1.htm:
Memory cells are etched onto a silicon wafer in an array of columns
(bitlines) and rows (wordlines). The intersection of a bitline and
wordline constitutes the address of the memory cell.
So, basically, the answer to your question is: the memory controller figures it out. Of course, given a memory address, the mapping to column and row must be easy to calculate in constant time.
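A constant-time mapping of that kind can be as simple as splitting the address bits. The sketch below is an assumption-laden simplification (real DRAM also has banks, ranks and channels, and the controller may interleave bits differently); the function names are made up.

```python
def split_address(address, col_bits):
    """Use the high bits as the row number and the low bits as the column number."""
    row = address >> col_bits
    col = address & ((1 << col_bits) - 1)
    return row, col


def select_line_count(row_bits, col_bits):
    """Select lines needed by a row decoder plus a column decoder."""
    return 2 ** row_bits + 2 ** col_bits


cell = split_address(13, col_bits=2)       # 16 cells arranged as a 4 x 4 matrix
matrix_lines = select_line_count(16, 16)   # 2^32 cells arranged as 65536 x 65536
flat_lines = 2 ** 32                       # one select line per cell
print(cell, matrix_lines, flat_lines)
```

Splitting the address is just wiring, so it costs no time, and the two small decoders need 131072 select lines in total instead of one line for each of the 2^32 cells.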
To fully understand this topic, I recommend you to read this guide on how memory works: http://computer.howstuffworks.com/ram.htm
There are so many concepts to master that it is difficult to explain it all in one answer.
I read your comments and questions before answering. I think you are on the right track, but there is some confusion here. The random access you are implying doesn't exist in the way you think it does.
Reading, writing, and refreshing are done in a continuous cycle. A particular cell in memory is only read or written in a certain interval if a signal is detected to do so in that cycle. There is going to be support circuitry that includes "sense amplifiers to amplify the signal or charge detected on a memory cell."
Unless I am misunderstanding what you are implying, your confusion is about how easy it is to read/write a cell. It differs depending on the chip design, but there IS a minimum number of cycles it takes to read or write data to a cell.
These are my sources:
http://www.doc.ic.ac.uk/~dfg/hardware/HardwareLecture16.pdf
http://www.electronics.dit.ie/staff/tscarff/memory/dram_cycles.htm
http://www.ece.cmu.edu/~ece548/localcpy/dramop.pdf
To avoid a humungous answer, I left most of the detail out but all three of these will describe the process you are looking for.

4 questions about processor architecture. (Computer engineering)

Our teacher has asked us around 50 true or false questions in preparation for our final exam. I could find an answer for most of them online or by asking relatives. However, these 4 questions are driving me crazy. Most of the questions aren't that hard, I just can't get a satisfying answer anywhere. Sorry, the original questions are not written in English; I had to translate them myself. If you don't understand something, please tell me.
Thanks!
True or false
The size of the manipulated address by the processor determines the size of the virtual memory. However, the size of the memory cache is independent.
For a long time, DRAM technology stayed incompatible with the CMOS technology used for the standard logic in processors. This is the reason DRAM memory is (most of the time) used outside of the processor (on a different chip).
Pagination lets multiple virtual addressing spaces correspond to the same space of physical addressing.
An associative cache memory with sets of 1 line is an entierly associative cache memory, because one memory block can go in any set since each sets are of the same size that of the block.
"Manipulated address" is not a term of the art. You have an m-bit virtual address mapping to an n-bit physical address. Yes, a cache may be of any size up to the physical address size, but typically is much smaller. Note that cache lines are tagged with virtual or more typically physical address bits corresponding to the maximum virtual or physical address range of the machine.
Yes, DRAM processes and logic processes are each tuned for different objectives, and involve different process steps (different materials and thicknesses to lay down DRAM capacitor stacks/trenches, for example) and historically you haven't built processors in DRAM processes (except the Mitsubishi M32RD) nor DRAM in logic processes. Exception is so-called eDRAM that IBM likes to use for their SOI processes, and which is used as last level cache in IBM microprocessors such as the Power 7.
"Pagination" is what we call issuing a form feed so that text output begins at the top of the next page. "Paging" on the other hand is sometimes a synonym for virtual memory management, by which a virtual address is mapped (on a page by page basis) to a physical address. If you set up your page tables just so it allows multiple virtual addresses (indeed, virtual addresses from different processes' virtual address spaces) to map to the same physical address and hence the same location in real RAM.
"An associative cache memory with sets of 1 line is an entierly associative cache memory, because one memory block can go in any set since each sets are of the same size that of the block."
Hmm, that's a strange question. Let's break it down. 1) You can have a direct mapped cache, in which an address maps to only one cache line. 2) You can have a fully associative cache, in which an address can map to any cache line; there is something like a CAM (content addressable memory) tag structure to find which, if any, line matches the address. Or 3) you can have an n-way set associative cache, in which you have, essentially, n sets of direct mapped caches, and a given address can map to one of n lines. There are other more esoteric cache organizations, but I doubt you're being taught them.
So let's parse the statement. "An associative cache memory". Well that rules out direct mapped caches. So we're left with "fully associative" and "n-way set associative". It has sets of 1 line. OK, so if it is set associative, then instead of something traditional like 4-ways x 64 lines/way, it is n-ways x 1 lines/way. In other words, it is fully associative. I would say this is a true statement, except the term of the art is "fully associative" not "entirely associative."
Makes sense?
Happy hacking!
True, more or less (it depends on the accuracy of your translation I guess :) ) The number of bits in addresses sets an upper limit on the virtual memory space; you could, of course, choose not to use all the bits. The size of the memory cache depends on how much actual memory is installed, which is independent; but of course if you had more memory than you can address, then it still can't be used.
Almost certainly false. We have RAM on separate chips so that we can install more without building a whole new computer or replacing the CPU.
There is no a-priori upper or lower limit to the cache size, though in a real application certain sizes make more sense than others, of course.
I don't know of any incompatibility. The reason why we use SRAM as on-die cache is because it's faster.
Maybe you can force an MMU to map different virtual addresses to the same physical location, but usually it's used the other way around.
I don't understand the question.

Resources