Here is an image taken from the CUDA C Programming Guide:
The guide says that this is an example of a Conflict-free access since threads 3, 4, 6, 7 and 9 access the same word within bank 5.
I don't quite understand why this is conflict-free, since not only do threads 3, 4, 6, 7 and 9 access the same word within the same bank (shouldn't that be an example of a bank conflict?), but thread 5 also has to access bank 4.
Could you please explain to me this case?
Note that a bank is not the same thing as a word or location in shared memory. A bank refers collectively to all words in shared memory that satisfy a certain address pattern condition.
In general, shared memory bank conflicts can be avoided if all accesses from a warp (or half-warp in cc 1.x) go to separate banks. These accesses need not be in warp order, i.e. they can be scrambled, as long as the request from each thread targets a separate bank.
The above description covers every arrow in your diagram except those arrows pointing to bank 5.
If we had no other information, then multiple arrows targeting a single bank would indicate a potential bank conflict.
However, there is an exception: when the accesses target not only the same bank but also the same word in memory. When multiple shared memory requests target the same word, the shared memory system has a broadcast mechanism that reads the data contained in that word and delivers it to all the requesting threads in a single cycle.
From the documentation (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-1-x):
Shared memory features a broadcast mechanism whereby a 32-bit word can be read and broadcast to several threads simultaneously when servicing one memory read request. This reduces the number of bank conflicts when several threads read from an address within the same 32-bit word.
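To make the rule concrete, here is a small host-side C++ sketch (not CUDA device code, and not NVIDIA's actual hardware logic). It assumes 32 banks of 4-byte words, which is only one possible configuration; the function name conflict_degree and the constants are hypothetical, chosen for illustration. It counts how many distinct words a warp requests from the busiest bank, so that requests for the same word count once (the broadcast case):

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Simplified model, assuming 32 banks and 4-byte words (an assumption,
// not a statement about any particular GPU).
constexpr std::size_t kNumBanks = 32;
constexpr std::size_t kWordBytes = 4;

// Takes one shared-memory byte address per thread of a warp and returns
// the largest number of DISTINCT words requested from any single bank.
// 1 means conflict-free; N > 1 models an N-way bank conflict.
std::size_t conflict_degree(const std::vector<std::uint32_t>& byte_addresses) {
    std::array<std::set<std::uint32_t>, kNumBanks> words_per_bank;
    for (std::uint32_t addr : byte_addresses) {
        const std::uint32_t word = addr / kWordBytes;
        // Successive words map to successive banks.
        // Threads reading the same word collapse into one set entry (broadcast).
        words_per_bank[word % kNumBanks].insert(word);
    }
    std::size_t degree = 0;
    for (const auto& words : words_per_bank) {
        degree = std::max(degree, words.size());
    }
    return degree;
}
```

In this model, the pattern from the question (threads 3, 4, 6, 7 and 9 all reading the one word in bank 5, every other thread reading from its own bank) gives a degree of 1, while a stride of two words gives a degree of 2, because words 0 and 16 are different words that both live in bank 0.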
In Intel's manual, section "8.2.2 Memory Ordering in P6 and More Recent Processor Families" says:
Any two stores are seen in a consistent order by processors other than those performing the stores
What is the meaning of this statement?
It means no IRIW reordering (Independent Readers, Independent Writers; at least 4 separate cores, at least 2 each of writers and readers). Two readers will always agree on the order of any two stores performed by other cores.
Weaker memory models don't guarantee this, for example ISO C++11 only guarantees it for seq_cst operations, not for acq_rel or any weaker orders.
A few hardware memory models allow IRIW reordering on paper, including ARM before ARMv8, but it is very rare in practice. POWER hardware can actually exhibit it: see my answer to "Will two atomic writes to different locations in different threads always be seen in the same order by other threads?" for an explanation of a hardware mechanism that can make it happen (store-forwarding between SMT "hyperthreads" on the same physical core making a store visible to some cores before it's globally visible).
x86 forbids this, so communication between hyperthreads has to wait for commit to L1d cache, i.e. for the store to be globally visible (thanks to MESI) before any other core can see it. See also: "What will be used for data exchange between threads are executing on one Core with HT?"
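As an illustration, the IRIW pattern can be written as a litmus test in C++. This is only a sketch: the function name iriw_readers_disagree is made up for this example, and it uses seq_cst operations, for which ISO C++ guarantees that the two readers cannot disagree. With weaker orders the language no longer promises that, even though x86 hardware would still behave this way.

```cpp
#include <atomic>
#include <thread>

// Runs one IRIW round: two independent writers, two independent readers.
// Returns true if the readers observed the two stores in opposite orders.
// With seq_cst everywhere (as here), ISO C++ forbids that outcome.
bool iriw_readers_disagree() {
    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0, r3 = 0, r4 = 0;

    std::thread writer_x([&] { x.store(1, std::memory_order_seq_cst); });
    std::thread writer_y([&] { y.store(1, std::memory_order_seq_cst); });
    std::thread reader_xy([&] {
        r1 = x.load(std::memory_order_seq_cst);
        r2 = y.load(std::memory_order_seq_cst);
    });
    std::thread reader_yx([&] {
        r3 = y.load(std::memory_order_seq_cst);
        r4 = x.load(std::memory_order_seq_cst);
    });

    writer_x.join();
    writer_y.join();
    reader_xy.join();
    reader_yx.join();

    // reader_xy saw x's store but not y's, so for it x came first;
    // reader_yx saw y's store but not x's, so for it y came first.
    return r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0;
}
```

A single run proves little, since the interesting interleaving is rare; litmus tests like this are normally run many times in a loop. Changing the orders to release/acquire would keep the code well-defined but would remove the language-level guarantee, leaving only whatever the hardware happens to provide.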
According to "Windows Internals, Part 1" (7th Edition, Kindle version):
Pages in a process virtual address space are either free, reserved, committed, or shareable.
Focusing only on the reserved and committed pages, the first type is described in the same book:
Reserving memory means setting aside a range of contiguous virtual addresses for possible future use (such as an array) while consuming negligible system resources, and then committing portions of the reserved space as needed as the application runs. Or, if the size requirements are known in advance, a process can reserve and commit in the same function call.
Both reserving and committing will initially get you entries in the VADs (virtual address descriptors), but neither operation will touch the PTE (page table entry) structures. Reserving used to cost PTEs before Windows 8.1, but not anymore.
As described above, reserved means blocking a range of virtual addresses, NOT blocking physical memory or paging file space at the OS level. The OS doesn't include this in the commit limit, therefore when the time comes to allocate this memory, you might get a surprise. It's important to note that reserving happens from the perspective of the process address space. It's not that there's any physical resource reserved - there's no stamping of "no vacancy" against RAM space or page file(s).
The analogy with plots of land might be missing something: take reserved as the area of land surrounded by wooden poles, thus letting others know that the land is taken. But how about committed? It can't be land on which structures (e.g. houses) have already been built, since those would require PTEs and there are none there yet, since we haven't accessed anything. It's only when touching committed data that the PTEs will get built, which will make the pages available to the process.
The main problem is that committed memory, at least in its initial state, is functionally very much like reserved memory. It's just an area blocked within the VADs. Try to touch one of the addresses, and you'll get an access violation exception for a reserved address:
Attempting to access free or reserved memory results in an access violation exception because the page isn’t mapped to any storage that can resolve the reference
...and an initial page fault for a committed one (immediately followed by the required PTE entries being created).
Back to the land analogy: once houses are built, that patch of land is still committed. Yet this is a bit peculiar, since it was already committed when the original grass was there, before the very first shovelful was dug to start construction. At that point it was in the same state as a reserved patch. Maybe it would be better to think of it as terrain eligible for construction, e.g. you have a permit to build (although you might never build so much as a wall on that patch of land).
What would be the reasons for using one type of memory versus the other? There's at least one: the OS guarantees that there will be room to back committed memory, should it ever be touched in the future, but doesn't guarantee anything for reserved memory aside from blocking that range of the process' address space. The only downside of committed memory is that one or more paging files might need to be extended so that the commit limit can account for the newly allocated block; that way, should the requester use part or all of the data in the future, the OS can provide access to it.
I can't really think how the land analogy can capture this detail of "guarantee". After all, the reserved patch also physically existed, covered by the same grass as a committed one in its pristine state.
The stack is another scenario where reserved and committed memory are used together:
When a thread is created, the memory manager automatically reserves a predetermined amount of virtual memory, which by default is 1 MB.[...] Although 1 MB is reserved, only the first page of the stack will be committed [...]
along with a guard page. When a thread’s stack grows large enough to touch the guard page, an exception occurs, causing an attempt to allocate another guard. Through this mechanism, a user stack doesn’t immediately consume all 1 MB of committed memory but instead grows with demand.
There is an answer here that deals with why one would want to use reserved memory as opposed to committed. It involves storing continuously expanding data (which is actually the stack model described above) and having specific absolute address ranges available when needed (although I'm not sure why one would want to do that within a process).
OK, what am I actually asking?
What would be a good analogy for the reserved/committed concept?
Are there any other reasons, aside from those described above, that would mandate the use of reserved memory? Are there any interesting use cases where resorting to reserved memory is a smart move?
Your question hits upon the difference between logical memory translation and virtual memory translation. While CPU documentation likes to conflate these two concepts, they are different in practice.
If you look at logical memory translation, there are only two states for a page. Using your terminology, they are FREE and COMMITTED. A FREE page is one that has no mapping to a physical page frame, and a COMMITTED page has such a mapping.
In a virtual memory system, the operating system has to maintain a copy of the address space in secondary storage. How this is done depends upon the operating system. Typically, a process will have its mapping to several different files for secondary storage. The operating system divides the address space into what is usually called a SECTION.
For example, the code and read-only data could be stored virtually as one or more SECTIONS in the executable file. Code and static data in shared libraries could each be in different sections that are paged to the shared libraries. You might also map a shared file into the process, giving memory that can be accessed by multiple processes; that forms another section. Most of the read/write data is likely to be in a page file in one or more sections. How the operating system tracks where it virtually stores each section of data is system dependent.
For Windows, that gives the definition of one of your terms: SHAREABLE. A shareable section is one where a range of addresses can be mapped to different processes, at different (or possibly the same) logical addresses.
Your last term is then RESERVED. If you look at the documentation for the Windows VirtualAlloc function, you can see that (among your options) you can RESERVE or COMMIT. If you reserve, you are creating a section of VIRTUAL MEMORY that has no mapping to physical memory.
This RESERVE/COMMIT model is Windows-specific (although other operating systems may do the same). The likely reason was to save disk space. When Windows NT was developed, 600MB drives the size of a washing machine were still in use.
In these days of 64-bit address spaces, this system works well for (as you say) expanding data. In theory, an exception handler for a stack overrun can simply expand the stack. Reserving 4GB of memory takes no more resources than reserving a single page (which would not be practicable in a 32-bit system—see above). If you have 20 threads, this makes reserving stack space efficient.
What would be a good analogy for the reserved/committed concept ?
One could say RESERVE is like buying options to buy and COMMIT is exercising the option.
Any other reason aside those depicted above that would mandate the use of reserved memory ? Are there any interesting use cases when resorting to reserved memory is a smart move ?
IMHO, the most likely places to RESERVE without COMMITTING are for creating stacks and heaps with the former being the most important.
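The reserve-then-commit pattern discussed above might be sketched roughly as follows. The Windows branch uses VirtualAlloc with MEM_RESERVE and MEM_COMMIT, as in the documentation mentioned earlier. The non-Windows branch (mmap with PROT_NONE followed by mprotect) is my own addition so the sketch builds elsewhere; it is only a loose analogue and does not carry the Windows commit-charge guarantee. The helper names reserve_range, commit_range and release_range are hypothetical.

```cpp
#include <cstddef>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#endif

// Sets aside a contiguous range of virtual addresses without making it usable.
// Returns nullptr on failure.
void* reserve_range(std::size_t bytes) {
#ifdef _WIN32
    return VirtualAlloc(nullptr, bytes, MEM_RESERVE, PAGE_NOACCESS);
#else
    // Loose analogue only: an inaccessible anonymous mapping.
    void* p = mmap(nullptr, bytes, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
#endif
}

// Makes the first `bytes` of a previously reserved range readable and writable.
// `base` must be page-aligned (the start of a reserved range is).
bool commit_range(void* base, std::size_t bytes) {
#ifdef _WIN32
    return VirtualAlloc(base, bytes, MEM_COMMIT, PAGE_READWRITE) != nullptr;
#else
    return mprotect(base, bytes, PROT_READ | PROT_WRITE) == 0;
#endif
}

// Gives the whole reserved range back to the OS.
void release_range(void* base, std::size_t bytes) {
#ifdef _WIN32
    (void)bytes;  // MEM_RELEASE requires a size of 0 and frees the whole reservation
    VirtualFree(base, 0, MEM_RELEASE);
#else
    munmap(base, bytes);
#endif
}
```

A growable array would call reserve_range once with its maximum size, then call commit_range on further pages as it grows, so that element addresses never change. Touching the range before it is committed faults in both branches, which matches the access-violation behaviour quoted in the question.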
I was wondering about this because it's a potential security hole if process A can malloc 50 megs of data that is not zero'd out and that chunk of memory turns out to include what had been physical pages from process B and still contain process B's data.
Is malloc'd data zeroed in Objective-C?
Mostly Yes. There's a zero-page writer that is part of the memory manager which provides a process with zero'd pages. The memory manager will call memory_object_data_unavailable to tell the kernel to supply zero-filled memory for the region.
If the process calls free and then mallocs again, the page is not re-zero'd. Zeroization only occurs when a new page is demanded. In fact, the page is probably not returned to the system upon free. The process retains the page for its own use due to the runtime. Related, see Will malloc implementations return free-ed memory back to the system?
If a page is returned to the system under a low-memory condition, the page will be re-zero'd even if the process formerly held the page. The memory manager does not account for the last owner of a page. It just assumes a new page needs to be zero'd to avoid an information leak across processes.
Note that Microsoft calls it the zero-page writer. Darwin has the same component, but I don't recall seeing it named. Also see Mac OS X Internals: A Systems Approach by Singh. It's a bit dated, but it provides a lot of system information. Chapter 8, Memory, is the chapter of interest.
Singh's book goes into other details, like cases where a page is demanded but does not need to be zeroized. In this case, there was some shared data among processes, and a new page was allocated to the process under a Copy-on-Write (COW) scheme. Effectively, the new page was populated from existing data rather than zeros. The function of interest is memory_object_data_request.
Linux has an interesting discussion of the zero page at "Some ado about zero". It's interesting reading about a topic that seems mundane on the surface.
I wondered how memory access is handled "in general" if, for example, 2 cores of a CPU try to access memory at the same time (via the memory controller). The same question applies when a core and a DMA-enabled IO device try to access memory in the same way.
I think the memory controller is smart enough to utilise the address bus and handle those requests concurrently. However, I'm not sure what happens when they try to access the same location, or when the IO operation monopolises the address bus and there's no room for the CPU to move on.
Thx
The short answer is "it's complex, but access can certainly potentially occur in parallel in certain situations".
I think your question is a bit too black and white: you may be looking for an answer like "yes, multiple devices can access memory at the same time" or "no they can't", but the reality is that first you'd need to describe some specific hardware configuration, including some of the low-level implementation details and optimization features to get an exact answer. Finally you'd need to define exactly what you mean by "the same time".
In general, a good first-order approximation is that the hardware will make it appear that all devices can access memory approximately simultaneously, possibly with an increase in latency and a decrease in bandwidth due to contention. At the very fine-grained timing level, access by one device may indeed postpone access by another device, or it may not, depending on many factors. It is extremely unlikely you would need this information to implement software correctly, and quite unlikely you need to know the details even to maximize performance.
That said, if you really need to know the details, read on and I can give some general observations on some kind of idealized laptop/desktop/server scale hardware.
As Matthias mentioned, you first have to consider caching. Caching means that any read or write operation subject to caching (which includes nearly all CPU requests and many other types of requests as well) may not touch memory at all, so in that sense many cores can "access" memory (at least the cache image of it) simultaneously.
If you then consider requests that miss in all cache levels, you need to know about the configuration of the memory subsystem. In general, a RAM chip can only do "one thing" at a time (i.e., commands [1] such as read and write apply to the entire module), and that usually extends to DRAM modules composed of several chips and also to a series of DRAMs connected via a bus to a single memory controller.
So you can say that, electrically speaking, the combination of one memory controller and its attached RAM is likely to be doing only one thing at once. Now that thing is usually something like reading bytes out of a physically contiguous span of bytes, but that operation could actually help handle several requests from different devices at once: even though each device sends separate requests to the controller, good implementations will coalesce requests to the same or nearby [2] area of memory.
Furthermore, even the CPU may have such abilities: when a new request occurs it can/must notice that an existing request is in progress for an overlapping region and tie the new request to an old one.
Still, you can say that for a single memory controller you'll usually be serving the request of one device at a time, absent unusual opportunities to combine requests. Now the requests themselves typically take on the order of nanoseconds, so many separate requests can be served in a small unit of time, and this "exclusiveness" is fine-grained and not generally noticeable [3].
Now above I was careful to limit the discussion to a single memory controller. When you have multiple memory controllers [4], you can definitely have multiple devices accessing memory simultaneously even at the RAM level. Here each controller is essentially independent, so if the requests from two devices map to different controllers (different NUMA regions) they can proceed in parallel.
That's the long answer.
[1] In fact, the command stream is lower level and more complex than things like "read" or "write" and involves concepts such as opening a memory page, streaming bytes from it, etc. "What every programmer should know about memory" serves as an excellent intro to the topic.
[2] For example, imagine two requests for adjacent bytes in memory: it is possible the controller can combine them into a single request if they fit within the bus width.
[3] Of course, if you are competing for memory across several devices, the overall impact may be very noticeable: a reduction in per-device bandwidth and an increase in latency. What I mean is that the sharing is fine-grained enough that you can't generally tell the difference between finely-sliced exclusive access and some hypothetical device which makes simultaneous progress on each request in each period.
[4] The most common configuration on modern hardware is one memory controller per socket, so on a 2P system you'd usually have two controllers, although other ratios (both higher and lower) are certainly possible.
There are dozens of things that come into play. E.g. on the lowest level there are bus arbitration mechanisms which allow multiple participants to access a shared address and data bus.
On a higher level there are also things like CPU caches that need to be considered: if a CPU reads from memory, it might only read from its local cache, which might not reflect the state that exists in another CPU core's local cache. To synchronize memory between cache instances in multicore systems, there are cache coherence protocols which are implemented in the CPUs. These have to guarantee that if one CPU writes to shared memory, the caches of all other CPUs (which might also contain a copy of the memory location's content) get updated.
From what I found, for "global" memory access the key to optimal transactions is coalescing (having the threads request neighboring memory addresses), while for "shared" memory the key is that the addresses issued by the threads do not conflict. Did I understand this correctly?
From NVIDIA CUDA Programming guide:
To maximize global memory throughput, it is therefore important to maximize coalescing by:
Following the most optimal access patterns based on Sections G.3.2 and G.4.2,
Using data types that meet the size and alignment requirement detailed in Section 5.3.2.1.1,
Padding data in some cases, for example, when accessing a two-dimensional array as described in Section 5.3.2.1.2.
This is about the memory accesses of the threads in a warp, which are coalesced ("packed") into one or more transactions. The requirements have been relaxed for devices of compute capability 2.x.
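As a rough illustration of what "one or more transactions" means, the following host-side C++ sketch counts how many aligned memory segments the addresses of one warp fall into. It is a simplification under stated assumptions: the 128-byte segment size is only a default chosen for this example (actual transaction sizes depend on the compute capability and caching mode), and segments_touched is a hypothetical helper, not part of CUDA.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Takes one global-memory byte address per thread of a warp and returns
// how many distinct aligned segments those addresses touch.
// Fewer segments means better coalescing in this simplified model.
// segment_bytes = 128 is an assumption, not a universal hardware constant.
std::size_t segments_touched(const std::vector<std::uint64_t>& byte_addresses,
                             std::size_t segment_bytes = 128) {
    std::set<std::uint64_t> segments;
    for (std::uint64_t addr : byte_addresses) {
        segments.insert(addr / segment_bytes);  // index of the aligned segment
    }
    return segments.size();
}
```

Under these assumptions, 32 threads reading consecutive 4-byte values starting at an aligned address touch 1 segment, the same pattern shifted by 4 bytes touches 2, and a stride of 128 bytes touches 32, one per thread.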
On the other hand, for shared memory accesses you need to understand how this memory is implemented.
To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously.
If two or more threads access different addresses in the same bank, the accesses are serialized; this is called a bank conflict.
Appendix G. Compute Capabilities has more info about the architecture.
Regards!