Is it possible to issue a prefetch for an address backed by an MMIO region in a PCIe BAR (and mapped via either UC or WC page table entries)? I am currently issuing a load for this address which causes the hyperthread to stall for quite some time. There is a non-temporal access hint via PREFETCHNTA, so it seems like this may be possible.
If it is possible, do you know where the prefetched value is stored and what would possibly cause it to become invalidated before I am able to issue a load for it? For example, if I issue a synchronizing instruction such as sfence for something unrelated, would this cause the prefetched value to become invalidated?
From the Intel Software Development Manual:
"Prefetches from uncacheable or WC memory are ignored. ... It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type that permits speculative reads (that is, the WB, WC, and WT memory types)."
The PCIe BAR that the MMIO region is in is marked as prefetchable, so I am not sure if that means prefetches will work with it given the language from the manual above.
I'd like to thank Peter Cordes, John D McCalpin, Neel Natu, Christian Ludloff, and David Mazières for their help with figuring this out!
In order to prefetch, you need to be able to store MMIO reads in the CPU cache hierarchy. When you use UC or WC page table entries, you cannot do this. However, you can use the cache hierarchy if you use WT page table entries.
The only caveat is that when you use WT page table entries, previous MMIO reads with stale data can linger in the cache. You must implement a coherence protocol in software to flush the stale cache lines from the cache and read the latest data via an MMIO read. This is alright in my case because I control what happens on the PCIe device, so I know when to flush. You may not know when to flush in all scenarios though, which could make this approach unhelpful to you.
Here is how I set up my system:
Mark the page table entries that map to the PCIe BAR as WT. You can use ioremap_wt() for this (or ioremap_change_attr() if the BAR has already been mapped into the kernel).
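For example, in a driver's probe routine the mapping could look roughly like this (BAR number and error handling are illustrative, not from my actual setup):

#include <linux/pci.h>
#include <linux/io.h>

/* Map BAR 0 of the device with the WT memory type. */
static void __iomem *map_bar_wt(struct pci_dev *pdev)
{
	resource_size_t start = pci_resource_start(pdev, 0);
	resource_size_t len = pci_resource_len(pdev, 0);

	return ioremap_wt(start, len);	/* returns NULL on failure */
}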
According to https://sandpile.org/x86/coherent.htm, the PAT type and the MTRR type can conflict with each other. The MTRR type for the PCIe BAR must also be set to WT, otherwise the PAT WT type is ignored. You can do this with the command below. Be sure to update the command with the PCIe BAR address (which you can see with lspci -vv) and the PCIe BAR size. The size is a hexadecimal value in units of bytes.
echo "base=$ADDRESS size=$SIZE type=write-through" >| /proc/mtrr
As a quick check at this point, you may want to issue a large number of MMIO reads in a loop to the same cache line in the BAR. You should see the cost per MMIO read go down substantially after the first MMIO read. The first MMIO read will still be expensive because you need to fetch the value from the PCIe device, but the subsequent reads should be much cheaper because they all read from the cache hierarchy.
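A minimal sketch of such a check from kernel code (names are illustrative; bar is the pointer returned by ioremap_wt() above, and ktime is used instead of rdtsc for simplicity):

#include <linux/io.h>
#include <linux/ktime.h>
#include <linux/printk.h>

/* Read the same cache line in the BAR several times and print the cost of
 * each read. The first read should be a real MMIO read; with the WT mapping
 * in effect, the later reads should be served from the cache hierarchy. */
static void time_bar_reads(void __iomem *bar)
{
	u64 t0, t1;
	int i;

	for (i = 0; i < 8; i++) {
		t0 = ktime_get_ns();
		(void)readq(bar);	/* always the same cache line */
		t1 = ktime_get_ns();
		pr_info("read %d: %llu ns\n", i, t1 - t0);
	}
}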
You can now issue a prefetch to an address in the PCIe BAR and have the prefetched cache line stored in the cache hierarchy. Linux has the prefetch() function to help with issuing a prefetch.
You must implement a simple coherence protocol in software to ensure that stale cache lines backed by the PCIe BAR are flushed from the cache. You can use clflush to flush a stale cache line. Linux has the clflush() function to help with this.
A note about clflush in this scenario: Since the memory type is WT, each store goes both to the cache line in the cache and out to the device over MMIO. Thus, from the CPU's perspective, the cached line is never dirty: everything the CPU has written has already reached the device. Therefore, clflush will just invalidate the cache line in the cache -- it will not also write the stale cache line back out over MMIO.
Note that in my system, I immediately issue a prefetch after the clflush. However, the code below is incorrect:
clflush(address);
prefetch(address);
This code is incorrect because, according to https://c9x.me/x86/html/file_module_x86_id_252.html, the prefetch could be reordered before the clflush. If that happened, the prefetched line would presumably be invalidated when the clflush eventually executes.
To fix this, according to the link, you should issue cpuid in between the clflush and the prefetch:
unsigned int eax, ebx, ecx, edx;
clflush(address);
cpuid(0, &eax, &ebx, &ecx, &edx);
prefetch(address);
Peter Cordes said it is sufficient to issue an lfence instead of cpuid above.
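In other words (a sketch; as far as I know the kernel has no dedicated lfence() helper, so inline asm is used here):

clflush(address);
asm volatile("lfence" ::: "memory");	/* keep the prefetch ordered after the clflush */
prefetch(address);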
Here is the context of this problem. I am confused: why don't I need to know how many entries there are in the TLB?
For the first question:
When I access data at 0x2330, I find it in main memory since the TLB is empty at this point, so I need 10 + 100 = 110 (ns).
When I access data at 0x0565, I get a page fault, so I need 500 (ns); then I load the page into the TLB and main memory (now I should replace one page in main memory because the resident set contains 2 frames, but which page should I replace? The problem only says we use an LRU replacement policy in the TLB).
When I access data at 0x2345, what might happen? I'm not sure.
I am confused: why don't I need to know how many entries there are in the TLB?
For an answer to be 100% correct, you do need to know how many TLB entries there are. There are 2 possibilities:
a) There are 2 or more TLB entries. In this case you have no reason to care what the replacement algorithm is, and the question's "And the TLB is replaced with LRU replacement policy" is an unnecessary distraction. In practice for real hardware (but not necessarily in theory for academia) this is extremely likely.
b) There is only 1 TLB entry. In this case you have no reason to care what the replacement algorithm is for a different reason - any TLB miss must cause the previous contents to be evicted (and all of those accesses will be "TLB miss" because no page is accessed 2 or more times in a row).
To work around this problem I'd make the most likely assumption (that there are 2 or more TLB entries) and then clearly state that assumption in my answer (e.g. "Assuming there are 2 or more TLB entries (not stated in the question), the answers are:").
When I access data at 0x2330, I find it in main memory since the TLB is empty at this point, so I need 10 + 100 = 110 (ns)
That's not quite right. The CPU would have to access the TLB to determine that it's a "TLB miss", then fetch the translation from memory into the TLB, then either:
(if there are no cache/s) fetch the data from memory (at physical address 0x27330) to do the final access; or
(if there are cache/s) check if the data is already cached and either:
(if "cache hit") fetch the data from cache, or
(if "cache miss") fetch the data from memory
Sadly, the question doesn't mention anything about cache/s. To work around that problem I'd make the most likely assumption (that there are no caches - see notes) and then clearly state that assumption in my answer (e.g. "Assuming there are 2 or more TLB entries (not stated in the question) and that there are also no cache/s (also not stated in the question), the answers are:").
Note: "No cache/s" is the most likely assumption for the context (academia focusing on teaching virtual memory in isolation); but is also the least likely assumption in the real world.
When I access data at 0x0565, I get a page fault, so I need 500 (ns); then I load the page into the TLB and main memory (now I should replace one page in main memory because the resident set contains 2 frames, but which page should I replace?
You're right again - the question only says "The OS adopts fixed allocation and local replacement policy" (and doesn't say if it uses LRU or something else).
To work around this problem I'd make a sane assumption (that the OS uses an LRU replacement policy) and then clearly state that assumption in my answer (e.g. "Assuming there are 2 or more TLB entries (not stated in the question), and that there are also no cache/s (also not stated in the question), and that the OS is using an LRU page replacement policy (also not stated in the question); the answers are:").
Consider the big box in the following figure as a cache and the block as a single cache line inside the cache.
The CPU fetched the data (first 4 elements of the array A) from RAM into the cache block.
Now, my question is: does it take exactly the same time to perform read/write operations on all 4 memory locations (A[0], A[1], A[2] and A[3]) in the cache block, or is it only approximately the same?
PS: I am expecting an answer for the ideal case, where the runtime of any read/write operation on any memory location is not affected by operating system jitter on user processes or applications.
With the line already hot in cache, time is constant for access to any aligned word in the cache. The hardware that handles the offset-within-line part of an address doesn't have to iterate through to the right position or anything, it just MUXes those bytes to the output.
If the line was not already hot in cache, then it depends on the design of the cache. If the CPU doesn't transfer around whole lines at once over a wide bus, then one / some words of the line will arrive before others. A cache that supports early-restart can let the load complete as soon as the needed word arrives.
A critical-word-first bus and memory allow that word to be the first one transferred for a demand-miss. Otherwise they arrive in some fixed order, and a cache miss on the last word of the line could take an extra few cycles.
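If you want to convince yourself of the hot-line case, a rough user-space sketch like the following (timing numbers will be dominated by rdtsc overhead, so treat it as illustrative only) should show no systematic difference between the four elements:

#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    /* 64-byte-aligned array so A[0]..A[3] share one cache line. */
    volatile int A[16] __attribute__((aligned(64))) = {0};
    unsigned long long t0, t1;
    int i, sink = 0;

    for (i = 0; i < 4; i++)      /* warm the line */
        sink += A[i];

    for (i = 0; i < 4; i++) {
        t0 = __rdtsc();
        sink += A[i];            /* access while the line is hot in cache */
        t1 = __rdtsc();
        printf("A[%d]: %llu cycles (including rdtsc overhead)\n", i, t1 - t0);
    }
    return sink & 1;
}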
Related:
Does cacheline size affect memory access latency?
if cache miss happens, the data will be moved to register directly or first moved to cache then to register?
which is optimal a bigger block cache size or a smaller one?
I have a question concerning the OpenCL memory consistency model. Consider the following kernel:
__kernel void foo() {
    __local int lmem[1];
    lmem[0] = 1;
    lmem[0] += 2;
}
In this case, is any synchronization or memory fence necessary to ensure that lmem[0] == 3?
According to section 3.3.1 of the OpenCL specification,
within a work-item memory has load / store consistency.
To me, this says that the assignment will always be executed before the increment.
However, section 6.12.9 defines the mem_fence function as follows:
Orders loads and stores of a work-item executing a kernel. This means that loads and stores preceding the mem_fence will be committed to memory before any loads and stores following the mem_fence.
Doesn't this contradict section 3.3.1? Or maybe my understanding of load / store consistency is wrong? I would appreciate your help.
As long as only one work-item performs read/write access to a local memory cell, that work-item has a consistent view of it. Committing to memory using a barrier is only necessary to propagate writes to other work-items in the work-group. For example, an OpenCL implementation would be permitted to keep any changes to local memory in private registers until a barrier is encountered. Within the work-item, everything would appear fine, but other work-items would never see these changes. This is how the phrase "committed to memory" should be interpreted in 6.12.9.
Essentially, the interaction between local memory and barriers boils down to this:
Between barriers:
Only one work-item is allowed read/write access to a local memory cell, OR
Any number of work-items in a work-group is allowed read-only access to a local memory cell.
In other words, no work-item may read or write to a local memory cell which is written to by another work-item after the last barrier.
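To make the contrast concrete, here is a sketch of the case where a barrier is required: each work-item writes its own local-memory cell and then reads a cell written by a different work-item (the kernel name, sizes, and output buffer are made up for illustration):

__kernel void neighbour_sum(__global int *out)
{
    __local int lmem[64];                 /* assumes work-group size <= 64 */
    int lid = get_local_id(0);
    int size = get_local_size(0);

    lmem[lid] = lid;                      /* each work-item writes only its own cell */

    barrier(CLK_LOCAL_MEM_FENCE);         /* make those writes visible to the whole work-group */

    /* Reading a neighbour's cell is only well-defined after the barrier. */
    out[get_global_id(0)] = lmem[lid] + lmem[(lid + 1) % size];
}

In your original kernel, by contrast, a work-item only reads back a value it wrote itself, so the load/store consistency from section 3.3.1 is enough.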
I'm reading the What Every Programmer Should Know About Memory PDF by Ulrich Drepper. At the beginning of part 6 there's a code fragment:
#include <emmintrin.h>

void setbytes(char *p, int c)
{
  __m128i i = _mm_set_epi8(c, c, c, c,
                           c, c, c, c,
                           c, c, c, c,
                           c, c, c, c);
  _mm_stream_si128((__m128i *)&p[0], i);
  _mm_stream_si128((__m128i *)&p[16], i);
  _mm_stream_si128((__m128i *)&p[32], i);
  _mm_stream_si128((__m128i *)&p[48], i);
}
with the following comment right below it:
Assuming the pointer p is appropriately aligned, a call to this
function will set all bytes of the addressed cache line to c. The
write-combining logic will see the four generated movntdq instructions
and only issue the write command for the memory once the last
instruction has been executed. To summarize, this code sequence not
only avoids reading the cache line before it is written, it also
avoids polluting the cache with data which might not be needed soon.
What bugs me is that the comment to the function says it "will set all bytes of the addressed cache line to c", but from what I understand of stream intrinsics they bypass the caches -- there is neither cache reading nor cache writing. How would this code access any cache line? The second bolded fragment says something similar, that the function "avoids reading the cache line before it is written". As stated above, I don't see how and when the caches are written to. Also, does every write to memory need to be preceded by reading the corresponding cache line into the cache? Could someone clarify this issue for me?
When you write to memory, the cache line where you write must first be loaded into the caches in case you only write the cache line partially.
When you write to memory, stores are grouped in store buffers. Typically, once a buffer is full it will be flushed to the caches/memory. Note that the number of store buffers is typically small (~4). Consecutive writes to adjacent addresses will use the same store buffer.
Streaming reads/writes with non-temporal hints are typically used to reduce cache pollution (often with WC memory). The idea is that a small set of cache lines is reserved on the CPU for these instructions to use. Instead of loading a cache line into the main caches, it is loaded into this smaller cache.
The comment supposes the following behavior (but I cannot find any references confirming that the hardware actually does this; one would need to measure it or find a solid source, and it could vary from hardware to hardware):
- Once the CPU sees that the store buffer is full and that it is aligned to a cache line, it will flush it directly to memory since the non-temporal write bypasses the main cache.
The only way this would work is if the merging of the store buffer with the actual cache line written happens once it is flushed. This is a fair assumption.
Note that if the cache line written is already in the main caches, the above method will also update them.
If regular memory writes were used instead of non-temporal writes, the store buffer flushing would also update the main caches. It is entirely possible that this scenario would also avoid reading the original cache line in memory.
If a partial cache line is written with a non-temporal write, presumably the cache line will need to be fetched from main memory (or the main cache if present) and could be terribly slow if we have not read the cache line ahead of time with a regular read or non-temporal read (which would place it into our separate cache).
Typically the non-temporal cache size is on the order of 4-8 cache lines.
To summarize, the last instruction kicks in the write because it also happens to fill up the store buffer. The store buffer flush can avoid reading the cache line written to because the hardware knows the store buffer is contiguous and aligned to a cache line. The non-temporal write hint only serves to avoid populating the main cache with our written cache line IF and only IF it wasn't already in the main caches.
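One related point (not from the quoted text, just standard practice with non-temporal stores): if another core or a device is going to consume the freshly written line, the stores are normally followed by an sfence so the write-combining buffer is drained and globally visible before you publish any "data ready" indication. A sketch of Drepper's function with that fence added (the flag is purely illustrative):

#include <emmintrin.h>

void setbytes_and_publish(char *p, int c, volatile int *ready)
{
    __m128i i = _mm_set_epi8(c, c, c, c, c, c, c, c,
                             c, c, c, c, c, c, c, c);

    _mm_stream_si128((__m128i *)&p[0],  i);
    _mm_stream_si128((__m128i *)&p[16], i);
    _mm_stream_si128((__m128i *)&p[32], i);
    _mm_stream_si128((__m128i *)&p[48], i);

    _mm_sfence();   /* drain the WC buffer: the NT stores become globally visible first */
    *ready = 1;     /* illustrative "data ready" flag */
}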
I think this is partly a terminology question: The passage you quote from Ulrich Drepper's article isn't talking about cached data. It's just using the term "cache line" for an aligned 64B block.
This is normal, and especially useful when talking about a range of hardware with different cache-line sizes. (Earlier x86 CPUs, as recently as PIII, had 32B cache lines, so using this terminology avoids hard-coding that microarch design decision into the discussion.)
A cache-line of data is still a cache-line even if it's not currently hot in any caches.
I don't have references under my fingers to prove what I am saying, but my understanding is this: the only unit of transfer over the memory bus is a cache line, whether it goes into the cache or into some special registers. So indeed, the code you pasted fills a cache line, but it is a special cache line that does not reside in the cache. Once all bytes of this cache line have been modified, the cache line is sent directly to memory, without passing through the cache.
Our teacher has asked us around 50 true or false questions in preparation for our final exam. I could find an answer for most of them online or by asking relatives. However, these 4 questions are driving me crazy. Most of these questions aren't that hard, I just can't get a satisfying answer anywhere. Sorry, the original questions were not written in English; I had to translate them myself. If you don't understand something, please tell me.
Thanks!
True or false
The size of the manipulated address of the processor determines the size of the virtual memory. However, the size of the cache memory is independent.
For a long time, DRAM technology remained incompatible with the CMOS technology used for the standard logic in processors. This is the reason DRAM memory is (most of the time) placed outside of the processor (on a different chip).
Pagination lets multiple virtual address spaces correspond to the same physical address space.
An associative cache memory with sets of 1 line is an entirely associative cache memory, because one memory block can go in any set, since each set is the same size as a block.
"Manipulated address" is not a term of the art. You have an m-bit virtual address mapping to an n-bit physical address. Yes, a cache may be of any size up to the physical address size, but typically is much smaller. Note that cache lines are tagged with virtual or more typically physical address bits corresponding to the maximum virtual or physical address range of the machine.
Yes, DRAM processes and logic processes are each tuned for different objectives, and involve different process steps (different materials and thicknesses to lay down DRAM capacitor stacks/trenches, for example), and historically you haven't built processors in DRAM processes (except the Mitsubishi M32RD) nor DRAM in logic processes. An exception is the so-called eDRAM that IBM likes to use for their SOI processes, and which is used as last level cache in IBM microprocessors such as the Power 7.
"Pagination" is what we call issuing a form feed so that text output begins at the top of the next page. "Paging" on the other hand is sometimes a synonym for virtual memory management, by which a virtual address is mapped (on a page by page basis) to a physical address. If you set up your page tables just so it allows multiple virtual addresses (indeed, virtual addresses from different processes' virtual address spaces) to map to the same physical address and hence the same location in real RAM.
"An associative cache memory with sets of 1 line is an entierly associative cache memory, because one memory block can go in any set since each sets are of the same size that of the block."
Hmm, that's a strange question. Let's break it down. 1) You can have a direct mapped cache, in which an address maps to only one cache line. 2) You can have a fully associative cache, in which an address can map to any cache line; there is something like a CAM (content addressable memory) tag structure to find which line, if any, matches the address. Or 3) you can have an n-way set associative cache, in which you have, essentially, n sets of direct mapped caches, and a given address can map to one of n lines. There are other more esoteric cache organizations, but I doubt you're being taught them.
So let's parse the statement. "An associative cache memory". Well that rules out direct mapped caches. So we're left with "fully associative" and "n-way set associative". It has sets of 1 line. OK, so if it is set associative, then instead of something traditional like 4-ways x 64 lines/way, it is n-ways x 1 lines/way. In other words, it is fully associative. I would say this is a true statement, except the term of the art is "fully associative" not "entirely associative."
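For completeness, a tiny sketch of how an address is split up for case 3 above (n-way set associative); the line size and set count are made up. With a single set, every line is a candidate for any address, which is the fully associative case just discussed:

#include <stdio.h>

#define LINE_SIZE 64u    /* bytes per cache line */
#define NUM_SETS  128u   /* number of sets; 1 would mean fully associative */

int main(void)
{
    unsigned long addr = 0x1234ABCDul;
    unsigned long block = addr / LINE_SIZE;   /* which memory block this address is in */
    unsigned long set = block % NUM_SETS;     /* which set the block must go in */
    unsigned long tag = block / NUM_SETS;     /* stored in the line to identify the block */

    printf("address 0x%lx -> set %lu, tag 0x%lx\n", addr, set, tag);
    return 0;
}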
Makes sense?
Happy hacking!
True, more or less (it depends on the accuracy of your translation I guess :) ) The number of bits in addresses sets an upper limit on the virtual memory space; you could, of course, choose not to use all the bits. The size of the memory cache depends on how much actual memory is installed, which is independent; but of course if you had more memory than you can address, then it still can't be used.
Almost certainly false. We have RAM on separate chips so that we can install more without building a whole new computer or replacing the CPU.
There is no a-priori upper or lower limit to the cache size, though in a real application certain sizes make more sense than others, of course.
I don't know of any incompatibility. The reason why we use SRAM as on-die cache is because it's faster.
Maybe you can force an MMU to map different virtual addresses to the same physical location, but usually it's used the other way around.
I don't understand the question.