Can Intel processors delay TLB invalidations?

This is in reference to Intel's Software Developer's Manual (Order Number: 325384-039US, May 2011). Section 4.10.4.4, "Delayed Invalidation", describes a potential delay in the invalidation of TLB entries, which can cause unpredictable results when accessing memory whose paging-structure entry has been changed.
The manual says ...
"Required invalidations may be delayed under some circumstances. Software devel-
opers should understand that, between the modification of a paging-structure entry
and execution of the invalidation instruction recommended in Section 4.10.4.2, the
processor may use translations based on either the old value or the new value of the
paging-structure entry. The following items describe some of the potential conse-
quences of delayed invalidation:
If a paging-structure entry is modified to change the R/W flag from 0 to 1, write
accesses to linear addresses whose translation is controlled by this entry may or
may not cause a page-fault exception."
Let us suppose a simple case, where a paging-structure entry is modified (the R/W flag is flipped from 0 to 1) for a linear address, and immediately afterwards the corresponding TLB invalidation instruction is issued. My question is: as a consequence of delayed TLB invalidation, is it possible that even after the TLB invalidation has been issued, a write access to the linear address in question doesn't fault (page fault)?
Or can "delayed invalidation" only cause unpredictable results when the "invalidate" instruction for the linear address whose paging-structure entry has changed has not yet been issued?

TLB entries are not necessarily flushed by CR3 changes; processors optimistically keep them cached. TLB entries are tagged with a unique identifier for the address space and are left in the TLB until they are either touched by the wrong process (in which case the entry is discarded) or the address space is restored (in which case the entry was preserved across the address-space change).
This all happens transparently to software. Your program (or OS) shouldn't be able to tell the difference between this behaviour and the TLB actually being invalidated, except via:
A) Timing - i.e. optimistically keeping TLB entries is faster (which is why processors do it).
B) There are edge cases where this behaviour is somewhat undefined. If you modify the code page you're currently executing, or touch memory whose mapping you've just changed, the old translation might still be in the TLB (even across a CR3 change).
The solution to this is to either:
1) force a TLB update via an invlpg instruction. This purges the TLB entry, triggering a TLB read-in on the next touch of the page.
2) disable and re-enable paging via the CR0 register.
3) mark all pages as uncacheable via the cache-disable bit in CR0, or on all of the pages in the TLB (TLB entries marked uncacheable are auto-purged after use).
4) change the mode of the code segment.
Note that this is genuinely undefined behaviour. Transitioning to SMM can invalidate the TLB, or might not, leaving this open to a race condition. Don't depend on this behaviour.
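To make option 1 concrete, here is a minimal kernel-mode sketch (GCC inline assembly; make_page_writable and the raw PTE bit layout are illustrative assumptions, not from the answer) of flipping the R/W flag and then purging the stale translation:

#include <stdint.h>

/* Minimal sketch, assuming kernel mode and GCC inline assembly. */
static inline void invlpg(void *linear_addr)
{
    __asm__ volatile ("invlpg (%0)" : : "r" (linear_addr) : "memory");
}

void make_page_writable(uint64_t *pte, void *linear_addr)
{
    *pte |= 0x2;          /* set the R/W flag (bit 1) of the paging-structure entry */
    invlpg(linear_addr);  /* purge the possibly stale TLB entry for this address */
}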

Related

A problem about virtual memory management in OS

Here is the context of this problem. I am confused: why don't I need to know how many entries the TLB has?
For the first question:
When I access data at 0x2330, I find it in main memory since the TLB is empty now, so I need 10 + 100 = 110 (ns).
When I access data at 0x0565, I meet a page fault, so I need 500 (ns); then I load it into the TLB and main memory (now I should replace one page in main memory because the resident set contains 2 frames, but which page should I replace? The problem only says we use an LRU replacement policy in the TLB).
When I access data at 0x2345, what may happen? I'm not sure.
I am confused: why don't I need to know how many entries the TLB has?
For an answer to be 100% correct, you do need to know how many TLB entries there are. There are 2 possibilities:
a) There are 2 or more TLB entries. In this case you have no reason to care what the replacement algorithm is, and the question's "And the TLB is replaced with LRU replacement policy" is an unnecessary distraction. In practice, for real hardware (but not necessarily in theory for academia), this is extremely likely.
b) There is only 1 TLB entry. In this case you have no reason to care what the replacement algorithm is for a different reason - any TLB miss must cause the previous contents to be evicted (and all of those accesses will be "TLB miss" because no page is accessed 2 or more times in a row).
To work around this problem I'd make the most likely assumption (that there are 2 or more TLB entries) and then clearly state that assumption in my answer (e.g. "Assuming there are 2 or more TLB entries (not stated in question), the answers are:").
When I access data at 0x2330, I find it in main memory since the TLB is empty now, so I need 10 + 100 = 110 (ns).
That's not quite right. The CPU would have to access the TLB to determine that it's a "TLB miss", then fetch the translation from memory into the TLB, then either:
(if there are no cache/s) fetch the data from memory (at physical address 0x27330) to do the final access; or
(if there are cache/s) check if the data is already cached and either:
(if "cache hit") fetch the data from cache, or
(if "cache miss") fetch the data from memory
Sadly, the question doesn't mention anything about cache/s. To work around that problem I'd make the most likely assumption (that there are no caches - see note) and then clearly state that assumption in my answer (e.g. "Assuming there are 2 or more TLB entries (not stated in the question) and that there are also no cache/s (also not stated in the question), the answers are:").
Note: "No cache/s" is the most likely assumption for the context (academia focusing on teaching virtual memory in isolation); but is also the least likely assumption in the real world.
When I access data at 0x0565, I meet a page fault, so I need 500 (ns); then I load it into the TLB and main memory (now I should replace one page in main memory because the resident set contains 2 frames, but which page should I replace?
You're right again - the question only says "The OS adopts fixed allocation and local replacement policy" (and doesn't say if it uses LRU or something else).
To work around this problem I'd make a sane assumption (that the OS uses an LRU replacement policy) and then clearly state that assumption in my answer (e.g. "Assuming there are 2 or more TLB entries (not stated in the question), and that there are also no cache/s (also not stated in the question), and that the OS is using an LRU page replacement policy (also not stated in the question); the answers are:").
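For reference, this is what the assumed LRU page replacement looks like as a minimal sketch (the 2-frame resident set is from the question; the page numbers in the trace are illustrative assumptions):

#include <stdio.h>

/* Tiny LRU page-replacement sketch, assuming a resident set of 2 frames.
   The access trace is illustrative, not taken from the question. */
#define NFRAMES 2

int main(void)
{
    int frame[NFRAMES] = {-1, -1};    /* resident pages; -1 = empty frame */
    int last_used[NFRAMES] = {0, 0};  /* timestamp of each frame's last access */
    int trace[] = {2, 0, 2, 1};       /* illustrative page-number trace */
    int faults = 0;

    for (int t = 0; t < 4; t++) {
        int page = trace[t], slot = -1;
        for (int i = 0; i < NFRAMES; i++)
            if (frame[i] == page)
                slot = i;
        if (slot < 0) {               /* page fault: evict the LRU frame */
            faults++;
            slot = 0;
            for (int i = 1; i < NFRAMES; i++)
                if (last_used[i] < last_used[slot])
                    slot = i;
            frame[slot] = page;
            printf("page %d: fault\n", page);
        } else {
            printf("page %d: hit\n", page);
        }
        last_used[slot] = t + 1;      /* mark this frame most recently used */
    }
    printf("%d page faults\n", faults);
    return 0;
}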

How is the LRU eviction policy less efficient than the random policy in this corner case?

I am reading Operating Systems: Three Easy Pieces, and I stumbled upon this quote in chapter 19: Translation Lookaside Buffers. It talks about eviction policies, comparing the hit efficiency of LRU vs. a random eviction policy.
Such a policy (Random policy) is useful due to its simplicity and ability to avoid corner-case behaviors; for example, a “reasonable” policy such as LRU behaves quite unreasonably when a program loops over n + 1 pages with a TLB of size n; in this case, LRU misses upon every access, whereas random does much better
I think I understand why LRU misses upon every access. For example, take an empty TLB of size 3 pages and a program loops through 4. The first page access is a miss, then second is a miss, then the third, then the fourth one is not in the TLB, so we must evict the least recent one (first page). After that eviction, we repeat the TLB lookup to finally get the fourth page in it, giving a total of four misses, but now how does the Random eviction work? Let's have the same empty TLB and program. We miss one, two, three pages. Now the fourth ought to be in there, so we have to replace a random page this time. So we repeat a TLB lookup and now the fourth page could be in the first, second or third TLB entry. Doesn't this just lead to four misses as well?
Where in my assumptions and logic about how TLBs and eviction work am I going wrong, such that random is better than LRU in this corner case? Doesn't random also just miss the same amount?
A step-by-step explanation of each policy would be preferable. Thanks.
For example, take an empty TLB of size 3 pages and a program loops through 4. The first page access is a miss, then second is a miss, then the third, then the fourth one is not in the TLB, so we must evict the least recent one (first page). After that eviction, we repeat the TLB lookup to finally get the fourth page in it, giving a total of four misses, but now how does the Random eviction work?
In this case, after the first 3 misses the TLB will contain translations for pages 1, 2 and 3; then it will evict an entry to make room for the translation for page 4. There are 3 possibilities (chosen randomly):
If it evicts the translation for page 1 then you get a TLB miss for page 1.
If it evicts the translation for page 2 then you avoid a TLB miss for page 1.
If it evicts the translation for page 3 then you avoid a TLB miss for page 1.
In other words, there's a 2/3 (66.7%) chance that it will avoid a TLB miss when you need the translation for page 1 next; and a 1/3 (33.3%) chance that there will be a TLB miss when you need the translation for page 1, in which case something else will be evicted to make room for it.
This probability will degrade. E.g. when you need the translation for page 2 again, there's a 1/3 chance it was evicted to make room for the translation of page 4, but there's also a chance that it was evicted to make room for the translation of page 1; so (guessing) there might be roughly a 50% chance that it wasn't evicted.
Doesn't this just lead to four misses as well?
I didn't do the maths (it's messy and I haven't finished my morning coffee), but the chance of getting 4 misses is probably around 12%.
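If you'd rather not do the maths by hand either, a quick Monte Carlo sketch (my own check, not part of the answer) estimates the per-page miss probabilities after page 4 is inserted with random eviction:

#include <stdio.h>
#include <stdlib.h>

/* Monte Carlo sketch of the scenario above: a 3-entry TLB holding pages
   1, 2, 3 (after the three cold misses), then accesses to 4, 1, 2 with
   random eviction on each miss. */
enum { TLB_SIZE = 3, TRIALS = 1000000 };

static int contains(const int *tlb, int page)
{
    for (int i = 0; i < TLB_SIZE; i++)
        if (tlb[i] == page)
            return 1;
    return 0;
}

static void access_page(int *tlb, int page, long *misses)
{
    if (!contains(tlb, page)) {
        (*misses)++;
        tlb[rand() % TLB_SIZE] = page;  /* random eviction */
    }
}

int main(void)
{
    long miss1 = 0, miss2 = 0, dummy = 0;
    srand(1);
    for (int t = 0; t < TRIALS; t++) {
        int tlb[TLB_SIZE] = {1, 2, 3};  /* state after the three cold misses */
        access_page(tlb, 4, &dummy);    /* guaranteed miss, random eviction */
        access_page(tlb, 1, &miss1);    /* the answer predicts a 1/3 miss chance */
        access_page(tlb, 2, &miss2);    /* the answer guesses roughly 1/2 */
    }
    printf("P(page 1 misses) ~ %.3f\n", (double)miss1 / TRIALS);
    printf("P(page 2 misses) ~ %.3f\n", (double)miss2 / TRIALS);
    return 0;
}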

Purpose of address-space identifiers (ASIDs)

I am currently studying Operating Systems by A Silberschatz, P Galvin, G Gagne.
I am studying memory management strategies, and on section where they introduce Translation Look-aside Buffer (TLB).
Some TLBs store address-space identifiers (ASIDs) in each TLB entry. An ASID uniquely identifies each process and is used to provide address-space protection for that process. When the TLB attempts to resolve virtual page numbers, it ensures that the ASID for the currently running process matches the ASID associated with the virtual page. If the ASIDs do not match, the attempt is treated as a TLB miss.
Above is a quote from the textbook explaining ASID.
I am a bit confused, as a TLB miss means the logical address couldn't be matched in the TLB, so the page table has to be consulted to get to physical memory.
That being said, the ASID is a set of extra bits in each TLB entry, used to check whether the entry belongs to the process that is accessing it.
What I am wondering is: when the ASID is used to refuse the process, shouldn't it trap instead of being a TLB miss? A TLB miss will forward the process to the page table, where the logical address for the process can be mapped to some address in main memory.
Please help me see where I am understanding this incorrectly.
Thanks!
Let's say you have two processes running on a system. Process A has its 2nd page mapped to the 100th page frame and Process B has its 2nd page mapped to the 200th page frame.
So now the MMU needs to translate page #2, but does not want to read the page tables again. Does it go to page frame 100 or page frame 200?
There are at least two ways of handling that problem. One is to flush the TLB whenever there is a process switch.
The other is to assign some unique identifier for each process and include that in the TLB cache entries.
I am a bit confused, as a TLB miss means the logical address couldn't be matched in the TLB, so the page table has to be consulted to get to physical memory.
To translate logical page #X to a physical page frame (a sketch in C follows these steps):
1) Look in the TLB for #X. If it is not there, go to the page table.
2) [#X exists] Is there an entry for #X with an ASID that matches the current process? If not, go to the page table.
3) Use the page mapping in the TLB.
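As a sketch of those steps (all names and sizes here are illustrative assumptions, not any particular hardware's layout):

#include <stdint.h>
#include <stdbool.h>

/* Minimal sketch of a TLB lookup with ASID tags. */
#define TLB_ENTRIES 64

struct tlb_entry {
    uint32_t vpn;    /* virtual page number (#X) */
    uint32_t pfn;    /* physical page frame number */
    uint16_t asid;   /* which address space this entry belongs to */
    bool     valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit; a mismatched ASID is treated as a miss,
   exactly as the textbook quote describes. */
bool tlb_lookup(uint32_t vpn, uint16_t current_asid, uint32_t *pfn_out)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].asid == current_asid) {
            *pfn_out = tlb[i].pfn;
            return true;            /* hit: use the cached mapping */
        }
    }
    return false;                   /* miss: fall back to the page table walk */
}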
What I am wondering is, when ASID is used to refuse the process, shouldn't it trap, instead of TLB miss?
Then you'd get traps the first time the process accessed a page and the program would crash.
Though one year has passed, I happen to have the same problem as you.
And I found a detailed explanation of TLB miss:
For software-managed TLB, when the machine encounters a TLB miss, the hardware would raise an exception (trap) to the OS (switched to kernel mode), and the trap handler for TLB miss would then look up the page table and update the TLB.
After that, the handler would return to the interrupted instruction (try the instruction that caused the exception again), which would yield TLB hit this time.
The explanation is in Operating Systems: Three Easy Pieces, section 19.3.
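That flow, as a compilable sketch (the types and helpers are hypothetical stand-ins for real kernel machinery, not from the book):

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12

typedef struct { uint32_t frame; bool present; } pte_t;

/* Hypothetical helpers a real kernel would provide. */
pte_t page_table_walk(uint32_t vpn);
void  tlb_insert(uint32_t vpn, uint32_t frame, uint16_t asid);
void  raise_page_fault(uintptr_t vaddr);
uint16_t current_asid(void);

/* Trap handler: the hardware raises this on a TLB miss (kernel mode). */
void tlb_miss_handler(uintptr_t faulting_vaddr)
{
    uint32_t vpn = (uint32_t)(faulting_vaddr >> PAGE_SHIFT);
    pte_t pte = page_table_walk(vpn);

    if (!pte.present) {
        raise_page_fault(faulting_vaddr);   /* genuine page fault */
        return;
    }
    tlb_insert(vpn, pte.frame, current_asid());
    /* On return, the interrupted instruction retries and now hits the TLB. */
}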
I would think that a TLB miss should trap to the OS (or virtual memory manager) when that TLB is the final TLB for physical/real memory, but there are also TLBs for the L1 cache, L2 cache, and L3 cache. When a cache TLB has a miss, there may be a hardware page-table walker that can resolve the TLB miss much faster than a context switch to the OS (which would also pollute the caches and TLBs).
Multiple processes share an L3 TLB on a multicore processor, and two processes share an L1 TLB and L2 TLB when hyperthreading is available and enabled. Each process has its own independent virtual address space, which the TLBs must distinguish.

OpenCL memory consistency

I have a question concerning the OpenCL memory consistency model. Consider the following kernel:
__kernel void foo() {
    __local int lmem[1];
    lmem[0] = 1;
    lmem[0] += 2;
}
In this case, is any synchronization or memory fence necessary to ensure that lmem[0] == 3?
According to section 3.3.1 of the OpenCL specification, "within a work-item memory has load / store consistency".
To me, this says that the assignment will always be executed before the increment.
However, section 6.12.9 defines the mem_fence function as follows:
Orders loads and stores of a work-item executing a kernel. This means that loads and stores preceding the mem_fence will be committed to memory before any loads and stores following the mem_fence.
Doesn't this contradict section 3.3.1? Or maybe my understanding of load / store consistency is wrong? I would appreciate your help.
As long as only one work-item performs read/write access to a local memory cell, that work-item has a consistent view of it. Committing to memory using a barrier is only necessary to propagate writes to other work-items in the work-group. For example, an OpenCL implementation would be permitted to keep any changes to local memory in private registers until a barrier is encountered. Within the work-item, everything would appear fine, but other work-items would never see these changes. This is how the phrase "committed to memory" should be interpreted in 6.12.9.
Essentially, the interaction between local memory and barriers boils down to this:
Between barriers:
Only one work-item is allowed read/write access to a local memory cell, OR
any number of work-items in a work-group is allowed read-only access to a local memory cell.
In other words, no work-item may read or write to a local memory cell which is written to by another work-item after the last barrier.
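For the multi-work-item case, a small illustrative kernel (my own sketch, not from the answer) shows the barrier that makes a local-memory write visible to the rest of the work-group:

/* Work-item 0 writes a local cell, and barrier() makes that write visible
   to the other work-items in the work-group before they read it. */
__kernel void share_value(__global int *out)
{
    __local int lmem[1];

    if (get_local_id(0) == 0)
        lmem[0] = 42;                 /* single writer between barriers */

    barrier(CLK_LOCAL_MEM_FENCE);     /* commit the write to local memory */

    out[get_global_id(0)] = lmem[0];  /* every work-item now reads 42 */
}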

ARM single-copy atomicity

I am currently wading through the ARM architecture manual for the ARMv7 core. In section A3.5.3, about atomicity of memory accesses, it states:
If a single-copy atomic load overlaps a single-copy atomic store and for any of the overlapping bytes the load returns the data written by the write inserted into the Coherence order of that byte by the single-copy atomic store then the load must return data from a point in the Coherence order no earlier than the writes inserted into the Coherence order by the single-copy atomic store of all of the overlapping bytes.
As a non-native English speaker, I admit that I am slightly challenged in understanding this sentence.
Is there a scenario where writes to a memory byte are not inserted into the Coherence order, and thus the above does not apply? If not, am I correct to say that shortening and rephrasing the sentence to the following:
If the load happens to return at least one byte of the write, then the load must return all overlapping bytes from a point no earlier than where the write inserted them into the Coherence order of all of the overlapping bytes.
still transports the same meaning?
I see that wording in the ARMv8 ARM, which really tries to remove any possible ambiguity in a lot of places (even if it does make the memory ordering section virtually unreadable).
In terms of general understanding (as opposed to actually implementing the specification), a little bit of ambiguity doesn't always hurt, so whilst it fails to make it absolutely clear what a "memory location" means, I think the old v7 manual (DDI0406C.b) is a nicer read in this case:
A read or write operation is single-copy atomic if the following conditions are both true:
After any number of write operations to a memory location, the value of the memory location is the value written by one of the write operations. It is impossible for part of the value of the memory location to come from one write operation and another part of the value to come from a different write operation.
When a read operation and a write operation are made to the same memory location, the value obtained by the read operation is one of:
the value of the memory location before the write operation
the value of the memory location after the write operation.
It is never the case that the value of the read operation is partly the value of the memory location before the write operation and partly the value of the memory location after the write operation.
So your understanding is right - the defining point of a single-copy atomic operation is that at any given time you can only ever see either all of it, or none of it.
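As a concrete illustration of that guarantee (a hypothetical C sketch; the manual itself doesn't give code):

#include <stdint.h>

/* If aligned 32-bit accesses are single-copy atomic, reader() observes
   either 0 or 0xAABBCCDD in full, never a torn value such as 0xAABB0000.
   (This sketch assumes 'shared' is naturally aligned and that each access
   compiles to a single load/store instruction.) */
static volatile uint32_t shared = 0;

void writer(void)
{
    shared = 0xAABBCCDDu;  /* all four bytes become visible together */
}

uint32_t reader(void)
{
    return shared;         /* returns all bytes from one write, never a mix */
}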
There is a case in v7 whereby (if I'm interpreting it right) two normally single-copy atomic stores that occur to the same location at the same time, but with different sizes, break any guarantee of atomicity, so in theory you could observe some unexpected mix of bytes there - this appears to have been removed in v8.
