Time required to access the memory locations in the same cache line - memory

Consider the big box in the following figure as a cache and the block as a single cache line inside the cache.
The CPU fetched the data (first 4 elements of the array A) from RAM into the cache block.
Now, my question is: does it take exactly the same time to perform read/write operations on all 4 memory locations (A[0], A[1], A[2] and A[3]) in the cache block, or is it only approximately the same?
PS: I am expecting an answer for the ideal case, where the time to perform any read/write operation on any memory location is not affected by operating system jitter on user processes or applications.

With the line already hot in cache, time is constant for access to any aligned word in the cache. The hardware that handles the offset-within-line part of an address doesn't have to iterate through to the right position or anything, it just MUXes those bytes to the output.
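As a rough illustration of the "offset-within-line part of an address", the sketch below splits an address into its line base and byte offset. The 64-byte line size and the helper names are assumptions made for this example, not something stated in the question.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64u   /* assumed cache-line size for this sketch */

static_assert((LINE_BYTES & (LINE_BYTES - 1u)) == 0,
              "LINE_BYTES must be a power of two");

/* Byte offset of an address inside its cache line (the low address bits). */
static inline unsigned line_offset(uintptr_t addr)
{
    return (unsigned)(addr & (LINE_BYTES - 1u));
}

/* Address of the first byte of the line that contains addr. */
static inline uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(LINE_BYTES - 1u);
}

/* Non-zero when both addresses fall in the same cache line. */
static inline int same_line(uintptr_t a, uintptr_t b)
{
    return line_base(a) == line_base(b);
}
```

The offset is just a bit field of the address, which is why selecting a word within a hot line needs no search: the hardware uses those bits directly as the MUX select.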
If the line was not already hot in cache, then it depends on the design of the cache. If the CPU doesn't transfer around whole lines at once over a wide bus, then one / some words of the line will arrive before others. A cache that supports early-restart can let the load complete as soon as the needed word arrives.
A critical-word-first bus and memory allow that word to be the first one transferred for a demand-miss. Otherwise they arrive in some fixed order, and a cache miss on the last word of the line could take an extra few cycles.
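To make the difference concrete, here is a small model of which transfer slot a word arrives in. It assumes a simple wrap-around burst that starts at the critical word; real buses and DRAM burst orderings can differ, so treat the function names and the ordering as illustrative only.

```c
#include <assert.h>

/* Slot (0 = first transfer) in which `word` arrives when the line is always
 * sent in ascending order starting from word 0. */
static unsigned slot_fixed_order(unsigned word)
{
    return word;
}

/* Slot in which `word` arrives when the transfer starts at `critical` and
 * wraps around the line (assumed ordering for this sketch). */
static unsigned slot_critical_first(unsigned word, unsigned critical,
                                    unsigned words_per_line)
{
    assert(words_per_line > 0);
    assert(word < words_per_line && critical < words_per_line);
    return (word + words_per_line - critical) % words_per_line;
}
```

With 8 words per line, a miss on word 7 waits for slot 7 in the fixed order but for slot 0 with critical-word-first, which is the "extra few cycles" difference described above.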
Related:
Does cacheline size affect memory access latency?
if cache miss happens, the data will be moved to register directly or first moved to cache then to register?
which is optimal a bigger block cache size or a smaller one?

Related

Prefetch from MMIO?

Is it possible to issue a prefetch for an address backed by an MMIO region in a PCIe BAR (and mapped via either UC or WC page table entries)? I am currently issuing a load for this address which causes the hyperthread to stall for quite some time. There is a non-temporal access hint via PREFETCHNTA, so it seems like this may be possible.
If it is possible, do you know where the prefetched value is stored and what would possibly cause it to become invalidated before I am able to issue a load for it? For example, if I issue a synchronizing instruction such as sfence for something unrelated, would this cause the prefetched value to become invalidated?
From the Intel Software Development Manual:
"Prefetches from uncacheable or WC memory are ignored. ... It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type that permits speculative reads (that is, the WB, WC, and WT memory types)."
The PCIe BAR that the MMIO region is in is marked as prefetchable, so I am not sure if that means prefetches will work with it given the language from the manual above.
I'd like to thank Peter Cordes, John D McCalpin, Neel Natu, Christian Ludloff, and David Mazières for their help with figuring this out!
In order to prefetch, you need to be able to store MMIO reads in the CPU cache hierarchy. When you use UC or WC page table entries, you cannot do this. However, you can use the cache hierarchy if you use WT page table entries.
The only caveat is that when you use WT page table entries, previous MMIO reads with stale data can linger in the cache. You must implement a coherence protocol in software to flush the stale cache lines from the cache and read the latest data via an MMIO read. This is alright in my case because I control what happens on the PCIe device, so I know when to flush. You may not know when to flush in all scenarios though, which could make this approach unhelpful to you.
Here is how I set up my system:
Mark the page table entries that map to the PCIe BAR as WT. You can use ioremap_wt() for this (or ioremap_change_attr() if the BAR has already been mapped into the kernel).
According to https://sandpile.org/x86/coherent.htm, there are conflicts between the PAT type and the MTRR type. The MTRR type for the PCIe BAR must also be set to WT, otherwise the PAT WT type is ignored. You can do this with the command below. Be sure to update the command with the PCIe BAR address (which you can see with lspci -vv) and the PCIe BAR size. The size is a hexadecimal value in units of bytes.
echo "base=$ADDRESS size=$SIZE type=write-through" >| /proc/mtrr
As a quick check at this point, you may want to issue a large number of MMIO reads in a loop to the same cache line in the BAR. You should see the cost per MMIO read go down substantially after the first MMIO read. The first MMIO read will still be expensive because you need to fetch the value from the PCIe device, but the subsequent reads should be much cheaper because they all read from the cache hierarchy.
You can now issue a prefetch to an address in the PCIe BAR and have the prefetched cache line stored in the cache hierarchy. Linux has the prefetch() function to help with issuing a prefetch.
You must implement a simple coherence protocol in software to ensure that stale cache lines backed by the PCIe BAR are flushed from the cache. You can use clflush to flush a stale cache line. Linux has the clflush() function to help with this.
A note about clflush in this scenario: Since the memory type is WT, each store goes to both the cache line in the cache and the MMIO. Thus, from the CPU's perspective, the contents of the cache line in the cache always match the contents of the MMIO. Therefore, clflush will just invalidate the cache line in the cache -- it will not also write the stale cache line to the MMIO.
Note that in my system, I immediately issue a prefetch after the clflush. However, the code below is incorrect:
clflush(address);
prefetch(address);
This code is incorrect, because according to https://c9x.me/x86/html/file_module_x86_id_252.html, the prefetch could be reordered before the clflush. Thus, the prefetch could be issued before the clflush, and the prefetch would presumably be invalidated when the clflush occurs.
To fix this, according to the link, you should issue cpuid in between the clflush and the prefetch:
int eax, ebx, ecx, edx;
clflush(address);
cpuid(0, &eax, &ebx, &ecx, &edx);
prefetch(address);
Peter Cordes said it is sufficient to issue an lfence instead of cpuid above.
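A minimal sketch of that lfence variant is shown below. It uses the userspace compiler intrinsics rather than the kernel's clflush() and prefetch() helpers, and the function name flush_then_prefetch is made up for this example. Whether lfence gives enough ordering on a particular CPU is the claim reported above, not something this code proves.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_lfence (SSE2) */
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

/* Invalidate the cache line holding `address`, then request it again.
 * The lfence sits between the two so that the prefetch is not issued
 * ahead of the flush (assumption taken from the discussion above). */
static inline void flush_then_prefetch(const void *address)
{
    assert(address != 0);
    _mm_clflush(address);
    _mm_lfence();
    _mm_prefetch((const char *)address, _MM_HINT_T0);
}
```

On ordinary write-back memory this is harmless and leaves the data unchanged, which makes it easy to try out before pointing it at a WT-mapped BAR.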

All blocks read same global memory location section. Fastest method is?

I am writing an algorithm in which all blocks read the same address. For example, we have list=[1, 2, 3, 4], and all blocks read it and store it in their own shared memory... My test shows that the more blocks read it, the slower it gets... I guess no broadcast happens here? Any idea how I can make it faster? Thank you!!!
I learnt from a previous post that this can be broadcast within one warp, but it seems it cannot happen across different warps... (Actually, in my case, the threads in one warp are not reading the same location...)
Once a list element has been accessed by the first warp on an SM, a second warp on the same SM gets it from cache and broadcasts it to all SIMT lanes. A warp on another SM may not have it in its L1 cache, so it first fetches the element from L2 into L1.
__constant__ memory behaves similarly, but it requires all threads to access the same address. Its latency is then close to that of a register access. __constant__ memory is like the instruction cache: you get more performance when all threads do the same thing.
For example, if you have a Gaussian filter in which every thread iterates over the same coefficient list, it is better to use constant memory. Shared memory does not offer much advantage because the filter array is not accessed randomly. Shared memory is better when the filter array contents differ per block or when random access is needed.
You can also combine constant memory and shared memory: get half of the list from constant memory and the other half from shared memory. This should let 1024 threads hide the latency of one memory type behind the other.
If the list is small enough, you can use registers directly (the indices have to be known at compile time). This increases register pressure and may decrease occupancy, so be careful.
Some old CUDA architectures required (for FMA operations) one operand to come from constant memory and the other from a register to achieve better performance in compute-bound algorithms.
In a test with a 12000-float filter applied to every thread's input, the shared-memory version with 128 threads per block completed the work in 330 milliseconds, while the constant-memory version completed in 260 milliseconds. L1 access performance was the real bottleneck in both versions, so the real constant-memory performance is even better, as long as all threads access a similar index.

Does cacheline size affect memory access latency?

Intel architecture has had 64-byte cache lines for a long time. I am curious: if a processor had 32-byte or 16-byte cache lines instead of 64-byte ones, would this improve the RAM-to-register data transfer latency? If so, by how much? If not, why?
Thank you.
Transferring a larger amount of data of course increases the communication time. But the increase is very small because of the way memories are organized, and it does not impact memory-to-register latency.
Memory access operations are done in three steps:
bitline precharge: row address is sent and the internal busses of memory are precharged (duration tRP)
row access: an internal row of a memory is read and written to internal latches. During that time, column address is sent (duration tRCD)
column access: the selected columns are read in the row latches and start to be sent to the processor (duration tCL)
Row access is a long operation.
A memory is a matrix of cells. To increase the capacity of the memory, the cells must be made as small as possible. When reading a row of cells, one has to drive a large, very capacitive bus that runs along a memory column. The voltage swing is very low, and sense amplifiers are used to detect the small voltage variations.
Once this operation is done, a complete row is held in latches; reading the latches is fast, and their contents are generally sent in burst mode.
Considering a typical DDR4 memory with a 1 GHz I/O clock, we generally have tRP/tRCD/tCL = 12-15cy/12-15cy/10-12cy, and the complete access takes around 40 memory cycles (if the processor frequency is 4 GHz, this is ~160 processor cycles). Data is then sent in burst mode twice per cycle, so 2x64 bits are sent every cycle. Data transfer therefore adds 4 cycles for 64 bytes, and it would add only 2 cycles for 32 bytes.
So reducing the cache line from 64B to 32B would reduce the total access time by ~2/40 = 5%.
If the row address does not change, precharging and reading the memory row are not required, and the access time is ~15 memory cycles. In that case, the relative cost of transferring 64B rather than 32B is larger but still limited: ~2/15 ≈ 13%.
Neither estimate takes into account the extra time required to process a miss in the memory hierarchy, so the actual percentage will be even smaller.
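The arithmetic behind these two estimates can be written out as a small sketch. It only encodes the figures quoted in this answer (16 bytes per memory cycle, about 40 cycles for a full access, about 15 cycles when the row is already open); the function names are illustrative.

```c
#include <assert.h>

#define BYTES_PER_MEM_CYCLE 16u   /* 2 transfers x 64 bits per memory cycle */

/* Memory cycles needed to burst one cache line of `line_bytes` bytes. */
static unsigned burst_cycles(unsigned line_bytes)
{
    assert(line_bytes % BYTES_PER_MEM_CYCLE == 0);
    return line_bytes / BYTES_PER_MEM_CYCLE;
}

/* Transfer cycles saved by the smaller line, as a fraction of the access
 * time `access_cycles` (the ~40 or ~15 cycle figures from the text). */
static double transfer_saving(unsigned big_line, unsigned small_line,
                              unsigned access_cycles)
{
    assert(access_cycles > 0 && big_line >= small_line);
    return (double)(burst_cycles(big_line) - burst_cycles(small_line))
           / (double)access_cycles;
}
```

With these inputs, transfer_saving(64, 32, 40) is 2/40 (5%) and transfer_saving(64, 32, 15) is 2/15 (about 13%).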
Data can be sent "critical word first" by the memory. If the processor requires a given word, the address of this word is sent to the memory. Once the row is read, the memory sends this word first, then the other words in the cache line. So caches can serve the processor's request as soon as the first word is received, whatever the cache line size is, and decreasing the line width would have no impact on cache latency. With this feature, memory-to-register time would not change.
In recent processors, exchanges between different cache levels are based on the cache line width, and sending the critical word first does not bring any gain.
Besides that, large line sizes reduce compulsory misses thanks to spatial locality, and reducing the line size would have a negative impact on the cache miss rate.
Last, using larger cache lines increases the data transfer rate between cache and memory.
The only negative aspect of large cache lines (besides the small increase in transfer time) is that the number of lines in the cache is reduced and conflict misses may increase. But with the high associativity of modern caches, this effect is limited.

Flash Memory Management

I'm collecting data on an ARM Cortex M4 based evaluation kit in a remote location and would like to log the data to persistent memory for access later.
I would be logging roughly 300 bytes once every hour, and would want to come collect all the data with a PC after roughly 1 week of running.
I understand that I should attempt to minimize the number of writes to flash, but I don't have a great understanding of the best way to do this. I'm looking for a resource that would explain memory management techniques for this kind of situation.
I'm using the ADUCM350 which looks like it has 3 separate flash sections (128kB, 256kB, and a 16kB eeprom).
For logging applications the simplest and most effective wear leveling tactic is to treat the entire flash array as a giant ring buffer.
Define an entry size to be some integer fraction of the smallest erasable flash unit. Say a sector is 4K (4096 bytes); let the entry size be 256.
This keeps every log entry inside a single sector, so you can erase any sector without cutting a log entry in half.
At boot, walk the memory and find the first empty entry. This is the 'write_pointer'.
When a log entry is written, simply write it at write_pointer and increment write_pointer.
If write_pointer is now on a sector boundary, erase the sector at write_pointer to make room for the next write. Essentially, this guarantees that there is always at least one empty log entry for you to find at boot, which allows you to restore the write_pointer.
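The steps above can be sketched as follows. This is a simulation against a RAM array, not ADUCM350 driver code: the flash array, flash_erase_sector(), the 4-sector size, and the convention that an erased byte reads 0xFF are all assumptions made for illustration. A real entry must also never consist entirely of 0xFF bytes, or the boot scan would mistake it for empty space.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 4096u
#define ENTRY_SIZE  256u
#define NUM_SECTORS 4u                          /* small simulated array */
#define FLASH_SIZE  (SECTOR_SIZE * NUM_SECTORS)
#define NUM_ENTRIES (FLASH_SIZE / ENTRY_SIZE)

static uint8_t flash[FLASH_SIZE];               /* stand-in for real flash */

/* Hypothetical driver call: erase one sector (erased bytes read as 0xFF). */
static void flash_erase_sector(uint32_t offset)
{
    assert(offset % SECTOR_SIZE == 0 && offset < FLASH_SIZE);
    memset(&flash[offset], 0xFF, SECTOR_SIZE);
}

/* An entry counts as empty when every byte is still in the erased state. */
static int entry_is_empty(uint32_t index)
{
    const uint8_t *p = &flash[index * ENTRY_SIZE];
    for (uint32_t i = 0; i < ENTRY_SIZE; i++)
        if (p[i] != 0xFF)
            return 0;
    return 1;
}

/* Boot-time scan: the first empty entry is the write pointer. */
static uint32_t log_find_write_pointer(void)
{
    for (uint32_t i = 0; i < NUM_ENTRIES; i++)
        if (entry_is_empty(i))
            return i;
    return 0;   /* not reached while the erase-ahead rule below is followed */
}

/* Write one entry, advance the pointer, and erase the next sector as soon
 * as the pointer lands on a sector boundary. Returns the new pointer. */
static uint32_t log_append(uint32_t wp, const uint8_t *entry)
{
    assert(wp < NUM_ENTRIES);
    memcpy(&flash[wp * ENTRY_SIZE], entry, ENTRY_SIZE);
    wp = (wp + 1u) % NUM_ENTRIES;               /* ring buffer wrap-around */
    if ((wp * ENTRY_SIZE) % SECTOR_SIZE == 0)
        flash_erase_sector(wp * ENTRY_SIZE);
    return wp;
}
```

Erasing ahead costs one sector of history once the log has wrapped, but it keeps the boot-time scan trivial because there is always exactly one place where written entries are followed by empty ones.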
If you dedicate 128 KB to the log entries and the flash has an endurance of 20000 write/erase cycles, this should give you a total of 10,240,000 entries written before failure, or about 1168 years of continuous logging at one entry per hour.
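The endurance estimate is simple enough to check in a few lines. The sketch below just reproduces the arithmetic with the numbers from this answer and the one-entry-per-hour rate from the question; the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Entries that can be written before the erase budget is used up:
 * each pass over the log area writes log_bytes / entry_bytes entries
 * and costs one erase cycle per sector. */
static uint64_t total_entries(uint64_t log_bytes, uint64_t entry_bytes,
                              uint64_t erase_cycles)
{
    assert(entry_bytes > 0 && log_bytes % entry_bytes == 0);
    return (log_bytes / entry_bytes) * erase_cycles;
}

/* Whole years of logging at a fixed number of entries per day. */
static uint64_t years_of_logging(uint64_t entries, uint64_t entries_per_day)
{
    assert(entries_per_day > 0);
    return entries / (entries_per_day * 365u);
}
```

For 128 KB, 256-byte entries, and 20000 cycles this gives 512 x 20000 = 10,240,000 entries, which at 24 entries per day is 1168 whole years.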

How do non temporal instructions work?

I'm reading the What Every Programmer Should Know About Memory PDF by Ulrich Drepper. At the beginning of part 6 there's a code fragment:
#include <emmintrin.h>
void setbytes(char *p, int c)
{
    __m128i i = _mm_set_epi8(c, c, c, c,
                             c, c, c, c,
                             c, c, c, c,
                             c, c, c, c);
    _mm_stream_si128((__m128i *)&p[0], i);
    _mm_stream_si128((__m128i *)&p[16], i);
    _mm_stream_si128((__m128i *)&p[32], i);
    _mm_stream_si128((__m128i *)&p[48], i);
}
With such a comment right below it:
Assuming the pointer p is appropriately aligned, a call to this
function will set all bytes of the addressed cache line to c. The
write-combining logic will see the four generated movntdq instructions
and only issue the write command for the memory once the last
instruction has been executed. To summarize, this code sequence not
only avoids reading the cache line before it is written, it also
avoids polluting the cache with data which might not be needed soon.
What bugs me is that the comment on the function says it "will set all bytes of the addressed cache line to c", but from what I understand of the stream intrinsics, they bypass the caches: there is neither cache reading nor cache writing. How would this code access any cache line? The comment says something similar further on, that the function "avoids reading the cache line before it is written". As stated above, I don't see how or when the caches are written to. Also, does any write to cache need to be preceded by a cache write? Could someone clarify this issue for me?
When you write to memory, the cache line where you write must first be loaded into the caches in case you only write the cache line partially.
When you write to memory, stores are grouped in store buffers. Typically once the buffer is full, it will be flushed to the caches/memory. Note that the number of store buffers is typically small (~4). Consecutive writes to addresses will use the same store buffer.
The streaming read/write with non-temporal hints are typically used to reduce cache pollution (often with WC memory). The idea is that a small set of cache lines are reserved on the CPU for these instructions to use. Instead of loading a cache line into the main caches, it is loaded into this smaller cache.
The comment supposes the following behavior (but I cannot find any reference confirming that the hardware actually does this; one would need to measure it or find a solid source, and it could vary from hardware to hardware):
- Once the CPU sees that the store buffer is full and that it is aligned to a cache line, it will flush it directly to memory since the non-temporal write bypasses the main cache.
The only way this would work is if the merging of the store buffer with the actual cache line written happens once it is flushed. This is a fair assumption.
Note that if the cache line written is already in the main caches, the above method will also update them.
If regular memory writes were used instead of non-temporal writes, the store buffer flushing would also update the main caches. It is entirely possible that this scenario would also avoid reading the original cache line in memory.
If a partial cache line is written with a non-temporal write, presumably the cache line will need to be fetched from main memory (or the main cache if present) and could be terribly slow if we have not read the cache line ahead of time with a regular read or non-temporal read (which would place it into our separate cache).
Typically the non-temporal cache size is on the order of 4-8 cache lines.
To summarize, the last instruction kicks in the write because it also happens to fill up the store buffer. The store buffer flush can avoid reading the cache line written to because the hardware knows the store buffer is contiguous and aligned to a cache line. The non-temporal write hint only serves to avoid populating the main cache with our written cache line IF and only IF it wasn't already in the main caches.
I think this is partly a terminology question: The passage you quote from Ulrich Drepper's article isn't talking about cached data. It's just using the term "cache line" for an aligned 64B block.
This is normal, and especially useful when talking about a range of hardware with different cache-line sizes. (Earlier x86 CPUs, as recently as PIII, had 32B cache lines, so using this terminology avoids hard-coding that microarch design decision into the discussion.)
A cache-line of data is still a cache-line even if it's not currently hot in any caches.
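To make the "aligned 64B block" reading concrete, here is a sketch of a variant of setbytes() that checks the alignment it relies on. The 64-byte line size, the name set_line, and the trailing sfence are choices made for this example and are not part of Drepper's code.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: _mm_set1_epi8, _mm_stream_si128 */
#include <stdint.h>

#define LINE_BYTES 64u   /* assumed cache-line size */

/* Fill one cache-line-sized, cache-line-aligned block with the byte c
 * using non-temporal stores. */
static void set_line(char *p, int c)
{
    /* The "cache line" here is simply this aligned 64-byte block of memory. */
    assert(((uintptr_t)p % LINE_BYTES) == 0);

    const __m128i v = _mm_set1_epi8((char)c);
    for (unsigned off = 0; off < LINE_BYTES; off += 16)
        _mm_stream_si128((__m128i *)(p + off), v);

    /* Non-temporal stores are weakly ordered; the fence orders them
     * before any later stores from this thread. */
    _mm_sfence();
}
```

The four 16-byte stores cover exactly one 64-byte block, which is what lets the write-combining logic handle the block as a unit regardless of whether it is currently cached.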
I don't have references at hand to prove what I am saying, but my understanding is this: the only unit of transfer over the memory bus is the cache line, whether it goes into the cache or into some special registers. So indeed, the code you pasted fills a cache line, but it is a special cache line that does not reside in the cache. Once all bytes of this cache line have been modified, the cache line is sent directly to memory, without passing through the cache.

Resources